
Multiple Sequence Alignment Solutions and Applications

Permanent Link: http://ufdc.ufl.edu/UFE0021685/00001

Material Information

Title: Multiple Sequence Alignment Solutions and Applications
Physical Description: 1 online resource (122 p.)
Language: English
Creator: Zhang, Xu
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: dp, msa, sp
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Multiple sequence alignment is one of the most fundamental problems in bioinformatics. It is widely used in applications such as protein structure prediction, phylogenetic analysis, identification of conserved motifs and domains, gene prediction, and protein classification. Within multiple sequence alignment research, a challenging problem is how to find the alignment that maximizes the SP (Sum-of-Pairs) score. This problem is NP-complete. Furthermore, finding an alignment that is biologically meaningful is not trivial, since the SP score may not reflect biological significance. My research addresses these problems. More specifically, we consider four problems. First, we develop an efficient algorithm to optimize the SP score of a multiple sequence alignment. Second, we extend this algorithm to handle a large number of sequences. Third, we apply secondary structure information of residues to build a biologically meaningful alignment. Finally, we describe a strategy that employs the alignment of multiple sequences to identify primers for a given target genome.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Xu Zhang.
Thesis: Thesis (Ph.D.)--University of Florida, 2007.
Local: Adviser: Kahveci, Tamer.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0021685:00001


f635010833b35f0c9425e30812727df9
f5311f951c6c6c5fba28f72003564c64ab068430
16020 F20101108_AAAPGQ zhang_x_Page_082.pro
890f6b3fdd35279983f22d5fbb6cf8f4
a18d2400633d8350b7ebc5efd525e5fb7d03954e
2278 F20101108_AAAPHF zhang_x_Page_012.txt
71045845af163e96a176b04bc295bce4
d0f73a15349d41dca76580c970f51c9b0d5fce58
82896 F20101108_AAAPGR zhang_x_Page_010.jpg
fb362eace5880aef7634588a10cfe5ce
d64c8e5e7fdd45914e2c5c08b2613ee8249cc68d
23062 F20101108_AAAPHG zhang_x_Page_093.QC.jpg
7f0d7c79903054fefe8262b1b712b872
d280e5b1b5468caeb350a88e8817aa18f1ed6797
79 F20101108_AAAPGS zhang_x_Page_002.txt
a696df64b5abce0e5f6782beb0649f34
e69937b65b01e78b4494793b39ebd14b584973a8
240 F20101108_AAAPHH zhang_x_Page_003.txt
76a0a0f333641c5fc77548917de8fb87
0baa8445ce2a977ffe29c3900d02a34a04e79068
F20101108_AAAPGT zhang_x_Page_004.tif
412bc6242db4726d529547f2957405a3
1d3f7faeaba60e6705248aaf8d24b868950f2021
60653 F20101108_AAAPHI zhang_x_Page_019.pro
1ef701565bb734c675a55c2c1a93af35
c9758c774840c94c65b10175caaa1a0f961fbba5
93959 F20101108_AAAPGU zhang_x_Page_036.jp2
74c2f1ec8092d90ed9cf79aeb8c1a4d0
acfb77303e323c51c5871a8fb9818e66cc4170b8
23745 F20101108_AAAPHJ zhang_x_Page_100.QC.jpg
bf3ee09dd9c3866b7dc16f8c068e9eb0
6f82f6e88bf8b8a0a2ea3907e3196679d1086512
F20101108_AAAPGV zhang_x_Page_103.tif
bc57a845be0821271856bae8f90e057e
41ad03a2e925f3b44ec759a8265ab937ff49125b
89769 F20101108_AAAPHK zhang_x_Page_114.jpg
6af6fd9e892e3482fdae32ecf506a952
5f66add7731a1b08c70abb4783d4055cf5363d53
14681 F20101108_AAAPGW zhang_x_Page_003.jpg
a4f9747061a98217c2187cebd4e60e2c
f62b8dac8b7d2edcf70b7b7fde522e5b15e86666
25560 F20101108_AAAPIA zhang_x_Page_118.QC.jpg
a11d666c8969ac7956db985b49974263
25583f6766802c07de27debda9e1308c0ce834e1
1867 F20101108_AAAPHL zhang_x_Page_009.txt
8def7df1fe40f9d3b777bc04d1ad0c7a
5e597c30d49e1a92591115f278b47e7d80d546ce
F20101108_AAAPGX zhang_x_Page_042.tif
e94ec21338bcd07bef365a735c5de2c8
92a0fae62511638bf1ff0454cf0f96e58f4c355b
62365 F20101108_AAAPIB zhang_x_Page_117.pro
c35b4efe62f27ec96d96ac67b0d9d163
4f9f7be6351532bd6a5f1869031858026014bce6
1051986 F20101108_AAAPHM zhang_x_Page_072.jp2
c3fc9edd162302e7d89812b099d43246
00f35165e47dd69a7e98485d26ad0ed4e1a7a7d1
12871 F20101108_AAAPGY zhang_x_Page_003.jp2
178471403d804e05e6c5939c7e31f599
8a788d045f28176e801d0eb65bd301149ab31144
6086 F20101108_AAAPIC zhang_x_Page_110thm.jpg
e74117055761b46cfad61e561505647d
460cf46f6fd7513642d434ae9f8fd5f011f6b873
1051948 F20101108_AAAPHN zhang_x_Page_029.jp2
796866bb23713e2c87a1ea9db71f3481
ec81c06c035bf14bb78f072f13e618c498577a41
6566 F20101108_AAAPGZ zhang_x_Page_075thm.jpg
03df93a15e2cc27091787a90494af4df
603bee4cd220614ab5f12ebcb2c60d7d4ba51c2f
F20101108_AAAPHO zhang_x_Page_025.tif
a6a8b80abbe4dd7377320762d20fd965
874a14e0f2ba303db875ff9e4205cd7303a66181
85806 F20101108_AAAPID zhang_x_Page_031.jpg
671dc379cad1232234b656f9cc7bb2c3
f2460c3a2906c0e11da3e5ea324fa80cf87faf4c
16437 F20101108_AAAPHP zhang_x_Page_109.QC.jpg
c01f899970feaacb1c17e6d8f2d7dcb0
3aa3902f9933c2f74ed84a58f436a3fef21204ba
50337 F20101108_AAAPIE zhang_x_Page_037.pro
fac34d2d7ac3eca473741b393537fa66
23be9a0bf961f23e4c22b4b93810b7bc36bd9e1d
5434 F20101108_AAAPHQ zhang_x_Page_097thm.jpg
9b147ae572295d393a0a08288fce81d1
97966086aa657ec1b3429f5cc13305243e82cfe0
2170 F20101108_AAAPHR zhang_x_Page_054.txt
5aad06fa74eac8ab24010d7d88deaa7c
f5bacf35c6901cb92e444e322d5d6573183d1174
1051983 F20101108_AAAPIF zhang_x_Page_057.jp2
f3c82aba31014fa7e302827d2af19258
51093cfd566ba6b21ed741e0707c11ffd0eb41c8
57276 F20101108_AAAPHS zhang_x_Page_049.pro
0390e69725d051ab0a67250e60307029
e50f42f21bcfe4a345c8c517e931975add03b872
23362 F20101108_AAAPIG zhang_x_Page_075.QC.jpg
f581092c6b213131fcb266c29bdbd64a
7b918941cd8a3d56120ae16bee4183da94e2e1e9
24553 F20101108_AAAPHT zhang_x_Page_086.QC.jpg
1a5cc0528ae8c5f3c4ddf2fc242dbb37
b1e5e48cf59b268f781ab86a68211b692966df1a
1051978 F20101108_AAAPIH zhang_x_Page_005.jp2
0cafd54901d3840b8ef6921f6495efe7
d456b195803ef676374b33944aa9f1eb6325b164
2204 F20101108_AAAPHU zhang_x_Page_010.txt
eed28cf326dd7aef7f02d6ee0b4a5a4a
3b509df201c919d7f7be689b0c4c63ba10b2b695
53701 F20101108_AAAPII zhang_x_Page_054.pro
73080b1a4f27550cc9cda8e886345cdc
51b7169ca5086c203edd3e019a746dab548c0995
14923 F20101108_AAAPHV zhang_x_Page_055.QC.jpg
76157b26733c0677ddc4e9a5f18df4a1
47aa93f69b789978935afe554ffa1269ca448dcf
2731 F20101108_AAAPIJ zhang_x_Page_120.txt
1dd02867417c2a3903a24179aecc4a55
f6d6e9584829acb4d1642798e03c8be707ed9497
2338 F20101108_AAAPHW zhang_x_Page_100.txt
d8d074be8a0ce3325181b03e13573e01
0c9e073fc96fe88367357cfd184329e5b8234ad8
25283 F20101108_AAAPIK zhang_x_Page_120.QC.jpg
b4115bc604271fb0b1f4fb753c1c6d3e
2ff419821c43d16a7af3ea906eb522e9bb121189
113235 F20101108_AAAPHX zhang_x_Page_093.jp2
6aca1655bb782ea656e4411f79159a58
21dc6da6064f2906e87e4f381cdacd130605ed68
63874 F20101108_AAAPJA zhang_x_Page_046.jpg
aaa75a481e1a225000f28aad3d188d12
eeef697ccb42484b11bd59c7510a5a0457dd72a3
70569 F20101108_AAAPIL zhang_x_Page_098.jpg
2cd744961f69044ccd4b94ac1062e5c4
b0ef63cae3e14938ac981830932cfa27cf392155
2162 F20101108_AAAPHY zhang_x_Page_101.txt
8bc476c4f0bd706d7c86952fa71641df
985fe1faf45957269fb3278fc2e22b0b3926fe1a
2805 F20101108_AAAPJB zhang_x_Page_118.txt
9a538044e0ea96b6c1ed25f3d8b1a2e2
0f0f076f78f36b76c4fa0604fceeb2e5f1650af8
21548 F20101108_AAAPIM zhang_x_Page_109.pro
429f0fb082b3af9f658e15af65da283a
7b093fc76e6af4c49d91057fe99f75aa0252437b
822397 F20101108_AAAPHZ zhang_x_Page_038.jp2
a0e5178826934bdc9669aaf2476bc43f
a1ea46fe9f27de6ed79840edcfd306944b051cc0
5558 F20101108_AAAPJC zhang_x_Page_005thm.jpg
e05dc54938e2fee6b620583522f6e0c4
c346d7b07436cedd9ed3712850d5a7477cb753a7
55594 F20101108_AAAPIN zhang_x_Page_102.pro
e7db1e78c0defe8fefcfd8595701edae
df3e2247120d766df3037a2df92c5ae3734d8acb
80769 F20101108_AAAPJD zhang_x_Page_117.jpg
f1eb12abb819f2ca7b3a2763af064ab7
b4f98b9020d370b1ee0ffcf0ff95552e3040e88e
1051963 F20101108_AAAPIO zhang_x_Page_096.jp2
546354dfbdf5e86b40c2da44e696d5ea
bfd82189abe8fb5d2ef8e957934db7eb9a094b5b
92734 F20101108_AAAPJE zhang_x_Page_019.jpg
7be995921b72a6d04f79d2c3b749c3c2
08e9160274c21bd63283193aab9d34666cb1a169
27967 F20101108_AAAPIP zhang_x_Page_060.QC.jpg
d5616c2e5a0cadef75722dc4705fb483
9184ed6889b6a879b531a9490f92315990dadaba
26838 F20101108_AAAPJF zhang_x_Page_031.QC.jpg
a4ee55d8fbec203dd52258b706b70d84
128438244b3069b74ce4316f7c57412096de7086
F20101108_AAAPIQ zhang_x_Page_048.tif
b16f70a5b8144222651aa9c156b53ebb
4ee578cb67b6429835c3e15521d08bef530f74d1
65186 F20101108_AAAPIR zhang_x_Page_063.jpg
8f3ca75f7f0f1fb43bd49b9f86d6103d
c53fb459fd4536bb91a998241e9cc5faf264530c
F20101108_AAAPJG zhang_x_Page_052.tif
03b40d95b053aedb63557e25fc1c44d4
70eaf1abf29bb3f0045380734aaee6eaafd1f4a5
22457 F20101108_AAAPIS zhang_x_Page_067.QC.jpg
1160d09d29c70bb0c0618aaa8b3ec3f0
c5e206f90ec4d42d57013b1edfdc0ed0a3239dbe
71457 F20101108_AAAPJH zhang_x_Page_033.jpg
fa8829e26c7054f1f71bb6f91c4a2e04
130aeafcd1d63ebf0a939c18bbf543b9050e447d
1051951 F20101108_AAAPIT zhang_x_Page_068.jp2
b0790a2b5a33dd3f9b092f323c9c20c1
bd599bac2cd22098c50ab54eb99df06422e75d6c
28525 F20101108_AAAPJI zhang_x_Page_018.QC.jpg
1bb002b7f256a7740b480e8c8a518a73
063fcf1d869c8ff32081d7ca94bf5b518736dbb6
28384 F20101108_AAAPIU zhang_x_Page_039.QC.jpg
1fd907aaacfd4d761b16111f005edbd0
3f4c8da4eced4092c52c4a6c55ad934680cfad27
6839 F20101108_AAAPJJ zhang_x_Page_022thm.jpg
557bae3eab8699e564c7570b98de7c33
641c589217a02e3ac1c4b15e78ae8467c36f146e
2346 F20101108_AAAPIV zhang_x_Page_039.txt
eb1c5c5a85f48b01af76a3204dd347d0
fa034b2556fa741442b6c38a54388e0d7ca20f9d
6173 F20101108_AAAPJK zhang_x_Page_025thm.jpg
11116ed4c5d291e6dac9eb491589984a
bf68a1b9306556ac60c4ae6cd60993347da1c201
983023 F20101108_AAAPIW zhang_x_Page_028.jp2
da73e40652c2dc7ca572ca1f62344976
bbe707d6d2a49c6efb8496c6f5d31ccc17cdf891
21984 F20101108_AAAPKA zhang_x_Page_005.QC.jpg
900e0bcf72a39681a0189bb922eb7ae3
d8d9ec81d8ec69a2de918c10c2386eb027d23832
5206 F20101108_AAAPJL zhang_x_Page_038thm.jpg
fd311648a0ca4eda832cb00ead594582
e599550f3224959128ac51a4f937e2453bd173e8
F20101108_AAAPIX zhang_x_Page_023.tif
3584af8b978f0ced94c2e44c231a4adb
139bb07f2d9a10947c9392551c1a30509681e4fc
27562 F20101108_AAAPKB zhang_x_Page_089.QC.jpg
9ba19a4986f53a46d7ef050415f6555e
3b0be8b1fe33f1f08df72299667f30091da20173
F20101108_AAAPJM zhang_x_Page_105.tif
f5fbc11cb55a09cc14ac17ec4c1487a8
72e86148c8631b4eb4ce6cd66537a366ba92ebe3
44120 F20101108_AAAPIY zhang_x_Page_112.jpg
1340521f3b70f121054805821f36b4bf
30f4064bdeeb8521223823361a4c5b73fae2d45d
28080 F20101108_AAAPKC zhang_x_Page_072.QC.jpg
f81e357f17909de4cd1e1aa6c6d029b4
b4110de7f0adac5d026a3ad69a7f62844bc566ef
28136 F20101108_AAAPJN zhang_x_Page_084.QC.jpg
19cc53f40fa80a1e392c65eeb2f29b82
4f1867e807843e8be94da7ffdd35d9d21b938323
F20101108_AAAPIZ zhang_x_Page_098.tif
6d643b0876abfe15b03e9741ef7f1e14
16454dec4fe259331db27ab9abba99e454ba2c2a
2166 F20101108_AAAPKD zhang_x_Page_099.txt
1a539126f02fa6b07c620f9bed20b7db
4a31889bb0bcb4b1ff036a2332112409766b7a0a
25876 F20101108_AAAPJO zhang_x_Page_037.QC.jpg
7d842fd18107be0882be8de35841d23d
20a8349cb7338f58caeeb32b356a955413b2e677
44612 F20101108_AAAPKE zhang_x_Page_026.pro
5bfc7059197b7a2ff63e21d2624968f2
1936000972b5b1de622b470a57019028779fa39f
1051950 F20101108_AAAPJP zhang_x_Page_094.jp2
c4367e378ebbc11287f60a2a22b25b1e
11c0b2056d7e671b378e82eb7e9e1f5b09646270
90396 F20101108_AAAPKF zhang_x_Page_079.jpg
92082fa6e28067d049d77cec28bd6a47
e67c51fec16314c9412402668d5b320ec02333e2
19820 F20101108_AAAPJQ zhang_x_Page_006.QC.jpg
a7b98b751a9950b5fa84ed1ba6ab2e08
b90942987f3e7a94faffdd67c69e6b0ddc2fdf3c
F20101108_AAAPKG zhang_x_Page_018.tif
c4d3f6c5de99794068541fd4c4e1b1eb
c8b7494080d161aacdc7b004d5e54e3691ddbadd
75342 F20101108_AAAPJR zhang_x_Page_006.jpg
ef238bab3f4cb85813891196779a1324
1d36a3c4d30503052da3085c26ab23b3a2beef89
2570 F20101108_AAAPJS zhang_x_Page_115.txt
73da913c78632f0187c0362c851945de
bb087fdd208349c98ac110bb420a3e2c628c2fb4
59400 F20101108_AAAPKH zhang_x_Page_017.pro
8754e909defed6e3eb52073b51618310
1b9cce61ca0d4af8b109af57ebb0b0c16d7fd66e
598 F20101108_AAAPJT zhang_x_Page_103.txt
c9aaf4a41780dd7c98261ea1a8e692e0
9470c98513e36e4f119a265cbce77b918ee1dbc8
F20101108_AAAPKI zhang_x_Page_016.jp2
db06d09ac05778be3711f00cdaa84436
ac2589acb7a6c45db4879ff204c402e343a4820f
22080 F20101108_AAAPJU zhang_x_Page_001.jpg
bd152c34c2adee4a8e585ff0e43a766c
e2a69765bd6a976a461a18bcb8c06ae66ad9958b
2864 F20101108_AAAPKJ zhang_x_Page_082thm.jpg
67a15397ad10bf63854ae015b05e721e
03ce65033f358092409e505d5b05dbe10ed9c5b8
5812 F20101108_AAAPJV zhang_x_Page_064thm.jpg
7ee12c910df26e8d9bbb5b38549d98d7
57856e36522e3624f332bc551e224fa7df93e6a6
84479 F20101108_AAAPKK zhang_x_Page_032.jpg
386a5aac4626630e0131f7822199c5c9
52d2ac816ecfef1dd008e7ab361c3aadc0434415
42065 F20101108_AAAPJW zhang_x_Page_028.pro
5c4368df27f3dd8298ca0a3c9178d7e4
d7f6a4ba5c2ab604b733b853dad64ccd53f0fc5e
1051985 F20101108_AAAPKL zhang_x_Page_084.jp2
9f7816cc5786df60bfb53352624dfa29
2269ab2e72bac97c5c4bffd3d4ab91a5ddebff87
6503 F20101108_AAAPJX zhang_x_Page_058thm.jpg
0785918fc25c783bd8c4b966c7b322db
f65ff795de9325b641a1605d31353ededbfe5ebc
F20101108_AAAPLA zhang_x_Page_118.tif
b91ec2bdb40b302e714d652dd7cbf5f2
6065d67456a51f705efa228b227c7bfb652aac5c
1051959 F20101108_AAAPKM zhang_x_Page_107.jp2
22a827f7af6f873940f1d168ffe953c7
375084b640065d9351dadd07ab7ab0680d442207
1961 F20101108_AAAPJY zhang_x_Page_056.txt
8dbd68f967768e92b19dae412be4089d
6aab03299ad771e87ba1195453b690d5a4de1b3a
2287 F20101108_AAAPLB zhang_x_Page_105.txt
4ea5bc8588599fc583ea22c3dd209565
890a5a1390ec29c8ba7eae6ba741866c6805d548
6247 F20101108_AAAPKN zhang_x_Page_053thm.jpg
2555ec9890a0ebca55190900e18d2678
9e77c870b4d98f4bdfb273467d4984b5d88bdddb
F20101108_AAAPJZ zhang_x_Page_074.jp2
f59d5394278de75fbb0b6e0a64f9f8ab
b5b0f6e4e31d20cabb73e78ffd8e505ce28efb4f
25311 F20101108_AAAPLC zhang_x_Page_083.QC.jpg
43bbc3010fdce3ff896ebb6020c2ac94
b16219c035104960abd96bcdb61c2ead84e5cb3c
59961 F20101108_AAAPKO zhang_x_Page_106.jp2
ed55b5ec40bd1b298993429923e230b9
762b59540cfa10c06a25f402ebfd7855e3d4af58
65316 F20101108_AAAPLD zhang_x_Page_036.jpg
a40e5c539b39b0fe4aab76bfb9f8df0a
09066ee4c5d7e96a12981fe2f0638d64ab13bc53
19346 F20101108_AAAPKP zhang_x_Page_122.jpg
9831102d483792b0dafb7daa4821ede9
b48cdff07819cab4ffd9dbf63640f04a7412385d
23451 F20101108_AAAPLE zhang_x_Page_040.QC.jpg
5651a9602a4db9d2996218d2cd335213
5f4f7ecc7d8b85e091a015e18a5d05ba8a3f3d9d
57773 F20101108_AAAPKQ zhang_x_Page_076.pro
ea58258ebf28eb4b7b3c867736e84365
c3b20f122424e30f4315b66c38858e3f14924b6b
33647 F20101108_AAAPLF zhang_x_Page_066.pro
8845c2621a03b4a23e259bc977bca056
5923138727c1dcabe25f330eba8c835db5f86859
27398 F20101108_AAAPKR zhang_x_Page_022.QC.jpg
eacabc9388399bf779023d6e363afc79
aa8bb2e84eb59b3f9103ad786944c500c5493f05
6706 F20101108_AAAPLG zhang_x_Page_057thm.jpg
4f5a7543c54cea17ea329d393562e83e
1999e12c8f43136602fe68265ba533b4f76326fc
1052 F20101108_AAAPKS zhang_x_Page_078.txt
d64c7bb1e1bb6e57682b6ae2bb9773a6
ab02c76f05ab1fec47a511b834fa6414944b3efd
2177 F20101108_AAAPLH zhang_x_Page_068.txt
fcdcd8681836c2427b2d9f2255c87fcf
669bbfb28d29e9571d2e431dd4d748d3f28eb754
1051846 F20101108_AAAPKT zhang_x_Page_073.jp2
b1c73f201c759e9fee98cdc63725e571
2b2cc4c9b7d5cf51f4c3842213612c1dbb7ac725
2117 F20101108_AAAPKU zhang_x_Page_122thm.jpg
c84007537da429b4a7740708cb14b43e
4ef6aeb9026a07abadfa267c2dd80cc8f1bdc3e4
F20101108_AAAPLI zhang_x_Page_022.jp2
7f0c902e7363508dbdaf628414283959
284e36282f63581e59a06f3be68782581ab073e8
4810 F20101108_AAAPKV zhang_x_Page_109thm.jpg
2ee242519815a057a38534715ed23f82
6a8f1764e30474abb542f62576930f649d6f44e5
6213 F20101108_AAAPLJ zhang_x_Page_013thm.jpg
8afdd28e1a50ab277c92505fb4cbf75a
b1f280872231a164d66f1905af7b12a10696bc78
1969 F20101108_AAAPKW zhang_x_Page_080.txt
c632a666b9e0d0bd53fab713b92e3887
d8daf842f7cf5d20fcb18ebea1d2d77aca25f5ba
57585 F20101108_AAAPLK zhang_x_Page_053.pro
666a3ae0a6302dfa4387158c4ca95207
27c434568c3973fc203afa2b1d4aeb541697c007
1987 F20101108_AAAPKX zhang_x_Page_021.txt
dcb7202a40060cb9e84912e7ba10fe3e
dc51e87801b0cb688d2c89d640c9441ba60fa1e5
68937 F20101108_AAAPMA zhang_x_Page_055.jp2
2b75b3d04ae6734d57c1bff95904a589
34e6f42393e7c17f4860ffdc60f4691cc80b409e
725434 F20101108_AAAPLL zhang_x_Page_062.jp2
c0f60fafe3be255caea0fc4e84c47818
1c3d47ddda0f18a6003088d0889e549db48e0f04
5676 F20101108_AAAPKY zhang_x_Page_066thm.jpg
10e4f905eb1262735c0ea49378575874
01f8bd64ec65fb68224d0c342a2486779afc7ce7
84458 F20101108_AAAPMB zhang_x_Page_050.jpg
08ac3738331d4e32b5d80b82269b9990
f914c483d65058211e03f7d163dc4250d45d01b7
88806 F20101108_AAAPLM zhang_x_Page_039.jpg
696b1f94605cfceff9bb3fe5e78ef2e3
ada4f57227be53fa1cd4858c4a39c72a1eb8c541
27125 F20101108_AAAPKZ zhang_x_Page_012.QC.jpg
a6420384c033b3a2c4de5d19f980c7f5
4a939966daa3a797b2a7ccbd53c73c898e54b8e8
F20101108_AAAPMC zhang_x_Page_020.tif
e2ba4c72da555715b57dcee3671e73d6
aeef43e1aa25dcd90d9c036ecbf0c51222778756
F20101108_AAAPLN zhang_x_Page_084.tif
27d7205df258c3d122dd7ca02ba4c655
280045f41a9577f610716efe092ae460c25940c4
18299 F20101108_AAAPMD zhang_x_Page_080.QC.jpg
b748d3b102ccbb1afc8c084eea4032d0
2275e1c201674da42a9b08f438e06e302314fdf3
54555 F20101108_AAAPLO zhang_x_Page_069.pro
8edd40d1087539ac43eee0d223fdf86c
d8d6dc30fbf6bd55bbc6fc632775396b8dcdd19b
26849 F20101108_AAAPME zhang_x_Page_070.QC.jpg
54896085a222ad81eb3fabcb7c2cef03
9b6db3cd5893dfe6ddd387aba95b6c0c390cd5fb
6958 F20101108_AAAPLP zhang_x_Page_105thm.jpg
07a9b363b23e4e69de1a737fa0425c25
ef2ee458bc557ab15de94a7f461de9449f15aa93
4830 F20101108_AAAPMF zhang_x_Page_087thm.jpg
07051e7202a332b7b4c611ab4e2e7aa2
c37ecb958d9c6146b6424bd78b01367a8c4851b3
F20101108_AAAPLQ zhang_x_Page_031.tif
294a039cfb335e3eca15c4705673ab88
f16c89d6b11280c53bfb39d3e4f896696bca87b4
139642 F20101108_AAAPMG zhang_x_Page_119.jp2
ab90f67d45b4695f04700b7ecb4d3fcf
bcf66b9989eab180ffb37c6d7f185e7f657bc99f
26126 F20101108_AAAPLR zhang_x_Page_058.QC.jpg
eb354b6a7950997b9befc897d0a5d375
fd185403aff8bdbe84c274da37918b5e0abc39c8
6268 F20101108_AAAPMH zhang_x_Page_113thm.jpg
75b1df409c2fe157d8751e7ab7a90e9a
79e5d0c4ab1050b1eb2d61188d6573eba414e61c
54417 F20101108_AAAPLS zhang_x_Page_107.pro
8a1c60746f82fafebda2878ec583192e
a7837fea66e60c21388992fb58befa480210442b
53705 F20101108_AAAPMI zhang_x_Page_025.pro
485349de505d6515947b37c8991a2333
e20e92df25b2e120f5b8b50f0348854fdfabfe07
6113 F20101108_AAAPLT zhang_x_Page_117thm.jpg
0230b4d16ad7346050e923f8505bc8bc
c6592d647cc606affe055e809be3a2387d13cc70
1051872 F20101108_AAAPLU zhang_x_Page_010.jp2
d358dbe9df58cd8da81c13a76e26a79a
1cf13be3c0a3306af0facb513236fb035d5d2a40
53878 F20101108_AAAPMJ zhang_x_Page_043.pro
0169e8ed2d6811776ce11ae0e35f00df
c0025123ff1c9111b35e013d3c24f51e7bee674a
73073 F20101108_AAAPLV zhang_x_Page_016.jpg
145b73090019d2bf81018a2d694de7a7
f8d313b5f4bdf67ac8fa54df888d128ef703d91d
22776 F20101108_AAAPMK zhang_x_Page_090.QC.jpg
74a9bb24a1297bf699f82b6f4c31dc83
2f14e5c5866ebc0e2d19e0f8a4ec54078c5c70ce
F20101108_AAAPLW zhang_x_Page_100.tif
3a4db938be509d4dd63afcee05ea7d1a
234266a88d9b33021b46fedb6d90bcba3743605c
27176 F20101108_AAAPML zhang_x_Page_105.QC.jpg
112ee7881c6280646c63c5fe811c7139
da81f269c63e4166a6a94ca39112426c6052229e
40157 F20101108_AAAPLX zhang_x_Page_033.pro
52611d644725b06e386fc55d5173bef1
4298fbaf046576fb8314170da2cbe5d63fe527d1
14176 F20101108_AAAPNA zhang_x_Page_112.QC.jpg
76894b5e949b5bcbc37e6469cea83c19
8e8217cbe80b0b49f7fbc9ccfcef2e165722d149
82821 F20101108_AAAPMM zhang_x_Page_068.jpg
5d6b3e9b602e21aaba188aa7a371e137
7db7c10a708429264b00db390789fe181ab903df
20438 F20101108_AAAPLY zhang_x_Page_008.QC.jpg
67d2e9ab71e0627729afa0a3b076fcb7
d7e5d4a74cc601ccf914269d3735e5521dc9adb0
69633 F20101108_AAAPNB zhang_x_Page_114.pro
5335aacc8f43424e983de6e491f47291
cd8499b3250f68c98ab311fff9dea793208ef700
F20101108_AAAPMN zhang_x_Page_071.pro
84a0f98de78f78d8a48fa1a6a0eb698c
e9fa80e90706f4ea10b340b9d7b31a447f635472
8532 F20101108_AAAPLZ zhang_x_Page_045.pro
c6d36b048c41b09e35299d72996829ae
2f76323b32a2fe593c01ba24ea355b845c43702e
1732 F20101108_AAAPNC zhang_x_Page_003thm.jpg
b689480deb00561d3b51155aeba41b06
dfb716846bd9a2d5182db2ebf056180c4b3f9fc4
73057 F20101108_AAAPMO zhang_x_Page_067.jpg
81cf4f74882a5e4a589dd0b9c2c7d250
af40fa03f6ff276d41ce3e07c073366a8a1abc2c
809 F20101108_AAAPND zhang_x_Page_046.txt
bdf55c869da1e64b0bb67628cd0b871a
ece35890da9aaa82ba9cf007b36d12f698cea6d4
59691 F20101108_AAAPMP zhang_x_Page_072.pro
188e2b6561b2eb3fb3f25a5e157b82ea
dbebd8576a05f312e9ffc20f927468b91b0f7bf0
F20101108_AAAPNE zhang_x_Page_072.tif
47615d226f7612e3894fd1b6945fff93
240f9e2574570d95955816c63e72d0d9a5256722
F20101108_AAAPMQ zhang_x_Page_031thm.jpg
deb290d180d2dd932a59a275e62ed236
a22a57f00655b0e0b418662de52892cc23cc21d5
6896 F20101108_AAAPNF zhang_x_Page_071thm.jpg
5bfe90ea5263ef09cbd0fd6ecc5c0b92
728aca205471464b2c307385dffde4db66e5b83c
90205 F20101108_AAAPMR zhang_x_Page_069.jpg
32f95d5ae06e1f39430530918f6a0c9f
3aaf4ff41b4745114dbf7962939d8730624a918a
2379 F20101108_AAAPNG zhang_x_Page_041.txt
020ae7228b698bc51ce45cbb51046be2
8c8efca384b228c926435903e8f80feb3341f652
1051962 F20101108_AAAPMS zhang_x_Page_042.jp2
cbad7430bc1c873712b77fdfb97d3122
87c3c1a11d9bbd5bf1b32fb5e91fa96b9b16052e
5798 F20101108_AAAPNH zhang_x_Page_052thm.jpg
71c9989b0e560d79b140091f325f7f02
eb33c5edd67670a82ec65c5b529dc306e0868d34
1195 F20101108_AAAPMT zhang_x_Page_008.txt
5797d939463b711e25e501d6e6a2555d
7bc4267f182fe09ad14117d6167bb448eef55260
6553 F20101108_AAAPNI zhang_x_Page_068thm.jpg
ac9d6b49d1645d5c9ade1993f046983e
2f3fe3040f95f41c946da5a3c3ec5657b98f987f
53003 F20101108_AAAPMU zhang_x_Page_020.pro
c818625cdf3830f11fe715140f395822
3b47ebdca473a19e8113b309b18f057c07102ce9
F20101108_AAAPNJ zhang_x_Page_061.tif
027d982b933d1e6ad8ca2e87b1896d6a
80c7d86265d701287218b8c8bcb6233781568588
F20101108_AAAPMV zhang_x_Page_077.tif
b6d608957cbc3ca3a62ff602a4dda9de
51cfba477a381dd44d01a60b3a7a6b92444c7c13
82686 F20101108_AAAPMW zhang_x_Page_115.jpg
64699645fdd8758049fde7fc60cdc7a0
4e04b290750b9460094621549533fa7c1fcd84ca
20983 F20101108_AAAPNK zhang_x_Page_036.QC.jpg
bff8423b3edb6b1cff7030e94e523433
62ca98b218b872baeee203437583254e4780fe75
132555 F20101108_AAAPMX zhang_x_Page_113.jp2
a6a6e4697fcef567b45079b0c827ac57
7979714572e0ed058868bb04b17b5de918a13268
F20101108_AAAPOA zhang_x_Page_094.QC.jpg
7daa29803d09ddfcd662b963110f8783
5b4861981a9ebd131c8f1ee528c959e1ce23dd47
F20101108_AAAPNL zhang_x_Page_040.tif
912733c3592444b35697c9e1b3fe2926
b9666d42b18225cd87a367e72e35864b41c5003d
6840 F20101108_AAAPMY zhang_x_Page_076thm.jpg
c7457695c90379ddff6c911a1b3fcfdc
661f87189f2eb1d4d81905d3dff150fa4ad01aa7
F20101108_AAAPOB zhang_x_Page_022.tif
2ffa7bd04dc6de25e8dd12c9d35cb078
3ef6a1b40d837eb20c11fccc99fa875d7b66f729
F20101108_AAAPNM zhang_x_Page_030.tif
d60d07bc6c25e51b9d368197a3f8c80a
0d1747dde1987cab705640cddbdf8c7653222ddd
6494 F20101108_AAAPMZ zhang_x_Page_030.QC.jpg
b397163edbe57a27cddb6efd6994a706
147f2379e1f7d629df87934128310bcc5e7a17d0
6635 F20101108_AAAPOC zhang_x_Page_118thm.jpg
6984a9114273fd38ba82c6a152447413
fd8a36d851165e15ae092dd6e1e7a686b3de7c16
4792 F20101108_AAAPNN zhang_x_Page_062thm.jpg
f3679a5fafda4277c82a79b3d039e3ce
32fb1ac01702e587b82cf32363d98fbebf8514f8
F20101108_AAAPOD zhang_x_Page_115.tif
650a77e9356b74168c613eb3385dd448
0867b12727a7f8ff9c9aea637920faa39e9b5a6f
5898 F20101108_AAAPNO zhang_x_Page_027thm.jpg
293fa9a0dd1be19b8b8f76231c9b7a5d
cb82672691d54eb114ce9a3fdb86bee61772a9f2
1051975 F20101108_AAAPOE zhang_x_Page_101.jp2
17d55d186d2a1673689c15565a6a64db
c5778167cb37a54ba18d6a540a1e7db9adc87f11
6319 F20101108_AAAPNP zhang_x_Page_077thm.jpg
52a331e0514093487579dd2973a8806f
c9b9a5aac33f7921464ac43b6d4d15003e1976a1
121714 F20101108_AAAPOF zhang_x_Page_086.jp2
50cc601d8954bddacbcae462c5275cb8
951e8202287c2dc5dae4023ea26011d175cfee63
26836 F20101108_AAAPNQ zhang_x_Page_011.QC.jpg
b3767f1a3c2f2489cda97aca3cddb00d
a302dd9f94b586e3d319a34ebbc5d64627583852
24728 F20101108_AAAPOG zhang_x_Page_116.QC.jpg
56d83401d134e5021192587fe0ada093
7696eca1d641aa68cbcd921b4fd34346badf5ca8
9469 F20101108_AAAPNR zhang_x_Page_122.pro
797df998e563a35ae7602f0cf5c9a15f
8137bfda5949f70c392bef5c07553b69dc9adb0c
116480 F20101108_AAAPOH zhang_x_Page_025.jp2
d595b2b871fc18f8d79fbefac7fd065f
3194a8d6bc1f01c0618f1256861ec0a45d821e19
65661 F20101108_AAAPNS zhang_x_Page_080.jpg
dbb8f1c6e9f1c10af3c524c78b23d22b
15da1fd1fb4a0a7883ab74aa539184b10d9efa75
89012 F20101108_AAAPOI zhang_x_Page_049.jpg
048873d2d72a278e279cac556c30b528
cd63e45b8b53f68b8cb6c9bce6aa9639db3c1e4d
6990 F20101108_AAAPNT zhang_x_Page_049thm.jpg
0249f96e71d8456ad61e3aebd6ca03a7
074714150f0112a4e46d3042abcbfaa005b2ab23
83397 F20101108_AAAPOJ zhang_x_Page_094.jpg
a1b56ffc47d38695900d6d5991855588
856daf5edf20d481cc534b66b5bff4221e42a737
51346 F20101108_AAAPNU zhang_x_Page_065.pro
5b3e4218b37c3fa2904ec65a541e3d60
d917ecaee790121a7221064f2727b9bcf32a079d
74060 F20101108_AAAPOK zhang_x_Page_007.jpg
81e27fde32f16ace84a54b4652888ed1
e97a35400f060b278fe86fe51f7bd57b4f1ea065
24118 F20101108_AAAPNV zhang_x_Page_108.QC.jpg
8e4e346ee09079487c8b9d5d48a0d753
9f61bda9da404cc34b51934d83e13affa52ab05c
18557 F20101108_AAAPNW zhang_x_Page_046.QC.jpg
71acbb3d025e344d04c4731cdc0d1a19
038254881d0dee4c63a5f1ecdbb6c7884e012bf6
90296 F20101108_AAAPPA zhang_x_Page_014.jpg
f102688db992d295f2f116c1990524e6
9e198f3f9a5fcfc1ba635ffea47a1aa732721f98
124599 F20101108_AAAPOL zhang_x_Page_013.jp2
c589f0f74279b2297efea83038513c25
c2a0fd5b1716bf3fb8ae939d7fb5c613c254da56
F20101108_AAAPNX zhang_x_Page_079.jp2
8f3c66981dcfc629da4947ca18c2304f
c137cc45c83c60f4a29d7cc6a86bbd49c05a01d8
2645 F20101108_AAAPPB zhang_x_Page_116.txt
4b0f02b596383aea661a57c462e5a940
d92de6326df91118c066b05acead2405ff0bfd80
F20101108_AAAPOM zhang_x_Page_091.tif
7d1efff88983eb1a5d8d30fb89c5255b
27ec4612829351c9c38c773d801056cce34456bf
F20101108_AAAPNY zhang_x_Page_021.tif
95e9d2bf02a44821a2b9b1b62e58e38f
c3f1d2325fab982a4407bde1853069716ce23de5
F20101108_AAAPPC zhang_x_Page_017.jp2
2c30fd0056dee284abd971c5df3977f2
20cbcac04526c16185cb6b3529f266ffae2f2055
121407 F20101108_AAAPON zhang_x_Page_111.jp2
a81fb9f9f887855967e21e33ddd03dd4
ce15537162b58758a605847700612261850dc1a9
64110 F20101108_AAAPNZ zhang_x_Page_115.pro
721ce44a245f49a08593cc6f14b10afb
d3f8265188266920474a8c0146e939e5b7d0a044
F20101108_AAAPPD zhang_x_Page_018.jp2
ab734239ef3a7cdb745f6073bc412ccf
fa1b9da9449c4fb22940cb960dbe5c535ddca378
1009388 F20101108_AAAPOO zhang_x_Page_033.jp2
38b1fe4bda7341378105fa2a569f74d0
681fdd45597905e1f725dc80035cbd0957a29d51
52022 F20101108_AAAPPE zhang_x_Page_067.pro
16a16369fa9525c4839a2964b123233a
7115f779517fc71781bf62be63c1e5b4fb726171
F20101108_AAAPOP zhang_x_Page_096.tif
b8409788ee966cc7ad19359e49fabeda
ea641a1c7ad0e3ae728742ed436155d260b6a731
84194 F20101108_AAAPPF zhang_x_Page_116.jpg
92825cc18b4dc93e953075d8826fadc9
04e425883012ae617ce5194d11c15659cef84dc0
55870 F20101108_AAAPOQ zhang_x_Page_121.pro
8b3e3c69401026a4bffc6164379108fa
888a70ffaa65d6f4c6219d817b70e97a01167056
2788 F20101108_AAAPPG zhang_x_Page_114.txt
4c9ffaaeee13e698de98ddc036238f28
390864f7c31f41a589a8a418d4bd0a6eeac8f584
95654 F20101108_AAAPOR zhang_x_Page_097.jp2
01d2662755ed5fdd8df9de9d9aae48e5
6eb3e466ba9283e0f85f9b88f2d158634c00d4b4
27337 F20101108_AAAPPH zhang_x_Page_071.QC.jpg
4a7be0318a797a0bc788ff50785a047c
f483142262b649a826db73cbe197d63a0d31472f
F20101108_AAAPOS zhang_x_Page_007.jp2
71b5c702803d47cf7f03283cefbc2d0e
668af4d3a7bdb76c5259cac80f108c37dadf8ed6
56902 F20101108_AAAPPI zhang_x_Page_005.pro
f389c39cdfaa030d6cd293963a646da8
9abb4034ce5903651ce4933fadf6f57484f5c00f
1051964 F20101108_AAAPOT zhang_x_Page_069.jp2
d567afa420b9872ff891f807d6a5617f
aee5d528a2248b7dfa073f803d94ad80ec320d1c
108478 F20101108_AAAPPJ zhang_x_Page_067.jp2
8f4527655abbc77a55b72fec45fc03f7
887501141cfe02558da1684d2cc0d80b3dad1d4c
46300 F20101108_AAAPOU zhang_x_Page_027.pro
3439f8edae4020b834a7b420c3971b8d
be9a231e0cb88f5c3fd06b230ba17a22cd2448d2
40947 F20101108_AAAPPK zhang_x_Page_047.jpg
f141cab1a2ca1c24852015747f86df50
16395f85d4357f0fa5f5db60d7d4b1e5911ee5fd
2228 F20101108_AAAPOV zhang_x_Page_098.txt
73ffabb803f9679e34cc16b81e28e7fe
3ee4f20ed95791c29948adc53d577e8b96746206
27183 F20101108_AAAPPL zhang_x_Page_059.QC.jpg
6cf469d40fed3f54c57605616e898c9f
2deb74247df407266059b22f34ac5c1bbf92cc26
54068 F20101108_AAAPOW zhang_x_Page_093.pro
9cbbb9a134ba2a5063b901ca8967dff6
8567667aa7eb33e5d37bf768ac3f4b71df4accbe
6350 F20101108_AAAPOX zhang_x_Page_020thm.jpg
b4b0de54b123511072a7121237311b6b
92740fb26318e2c4144b99c1ccd9093752613ee4
182218 F20101108_AAAPQA UFE0021685_00001.xml FULL
94fee05fd0e7d6e636662ec7fae37c6d
930276d5061ebb2e7abb8a1b3e1bf0149f23a656
4940 F20101108_AAAPPM zhang_x_Page_046thm.jpg
4a72dabd9671ecef427e328b35c9bc8f
bcb13ec5c8899e15b24be2d4b2c550978ccf10ba
34920 F20101108_AAAPOY zhang_x_Page_103.jpg
dc70becd65f15cc89cee50769596f3c0
78a9e6e232a524eaeb5990121e73cb5afd33ab62
F20101108_AAAPPN zhang_x_Page_037.tif
442fdcb6b91b6f0c4ee3c7ed5aeaf9a9
d4055671666fbf650ae0218f0d0c7ed66cdf3b43
647 F20101108_AAAPOZ zhang_x_Page_002.pro
cd3ba52222be0d8f89b96158dc2a8513
d7088886b04d476f557d9ab6958f7fc435818dcf
6708 F20101108_AAAPPO zhang_x_Page_101thm.jpg
960bf3b11cc0fa63ce16fcff8942f5e2
e3c25c828fc8b7cf79c1e6286034c20a2f5656b0
9662 F20101108_AAAPQD zhang_x_Page_002.jpg
e0834143b361e9ebfbf895b64e9661af
af8d359357fe1048f15d361ae0582c63a7762e6d
6716 F20101108_AAAPPP zhang_x_Page_099thm.jpg
78ac08ca56f05c1b0dfb3562ac54ac3f
63746701ea627f89b2bfa42aeb6acdc0d5984011
28242 F20101108_AAAPQE zhang_x_Page_004.jpg
61d7b0cef6c5982453d8c68c9b4a2dd3
ebcf153c160c8a153a73e7098362e0a2e36f89d7
2102 F20101108_AAAPPQ zhang_x_Page_067.txt
91cf500e066b47ea0294dc25bfde2eb9
23ef154ad241261e0015512de07bd204b0cb1741
79446 F20101108_AAAPQF zhang_x_Page_005.jpg
29092a607ae480e1a6ac6f8b1e009b91
a0a46e10137377122b56d6c36ce0e5f9c35b36ca
F20101108_AAAPPR zhang_x_Page_020.txt
b8912754896f9225b03537a3f615a768
90c8ddc3abae5112a246b0004fba7a6a70adc74b
68915 F20101108_AAAPQG zhang_x_Page_008.jpg
cdaaaa4b514c2a3ca0a6a1dc97516970
50c7f8157c91731512896394525d0e5492939fdd
73211 F20101108_AAAPPS zhang_x_Page_091.jpg
fa4392d186b935e39c7e13551b9622b0
9f705f97bc883025dc4c7558b01f1b6b
8e580a0d72a7b4eefbc16b123e4eaee092b46766
894 F20101108_AAAQDQ zhang_x_Page_109.txt
ce56d87ba529da903f1da048468a057f
e33606257ca09ce8d9e01065caeaf77137a76c35
F20101108_AAAPYK zhang_x_Page_119.tif
2744bf78b80fffe5b1c66c1c11c6d7c7
85e32e2cc013cff67c460fb53fa4fe5ef7445b35
F20101108_AAAPXV zhang_x_Page_097.tif
6b05ad1ffa778e2020f09c36f7b87e73
ef774d3711bfe4e44ec40ae580f51eac1dcc84a2
9071 F20101108_AAAQEE zhang_x_Page_004.QC.jpg
402aa59913c9e23acc2d5d095c9ca529
9ed4cbea54dde3298b3771f0e7931fe4696fec12
2254 F20101108_AAAQDR zhang_x_Page_110.txt
00facc5208d25cc8de27a949b376cd4a
bf32f6a9aefb9a7b982bc231d063d248325ff48b
F20101108_AAAPYL zhang_x_Page_120.tif
3a3d79a1ea895c77378e8715cb530ccc
87d74552d24e055fa2bf90d2437fb16780212cd3
F20101108_AAAPXW zhang_x_Page_099.tif
7529028043deb0ae7bcfcecf3e9c47ae
2d673a9cde670d2d9a627a94e7eaea701642e90a
2561 F20101108_AAAQEF zhang_x_Page_004thm.jpg
5675d9da51d1d017fde933866b5bddc7
6d5547c12dbc76ebfcfc7afc8eb7e7a370fc699b
49724 F20101108_AAAPZA zhang_x_Page_021.pro
9e44f429c6937229e7eafc2ad27bc1d8
adec11cac7e738e2b16a9a8b4bc4adb953453258
1154 F20101108_AAAQDS zhang_x_Page_112.txt
8be813bc8d29667b31f59fd60eea7bd7
1aea25791662eafb874ee3cbe21d31875fcbec3d
F20101108_AAAPYM zhang_x_Page_121.tif
2149530d2726d20da2fb57229c6f6de1
6b13ec4324ef45ba16b865e55cf141972a981d7d
F20101108_AAAPXX zhang_x_Page_101.tif
15e6cfeca1d0e6dd638c1d9e7abf64c4
1a3ac461d3720aed6f8b7cb7284e9a234a137814
4989 F20101108_AAAQEG zhang_x_Page_006thm.jpg
5db0cecf647f6de042f222d3a0a8edd8
a26ce5a4ba078b4b934c4d9b061c5481528cc95d
55367 F20101108_AAAPZB zhang_x_Page_022.pro
0958325dd00657efdc941d56b08ee719
889752cbdb25b45b326d750b18456e3cc90a5547
2612 F20101108_AAAQDT zhang_x_Page_113.txt
c8874d8c6399e8bd5f5c2855e6fdad10
a311afb8efb25ccdea741d6ef8f33c8984e02c7b
F20101108_AAAPYN zhang_x_Page_122.tif
460f665e6bb7d895f811aa43808e96c1
6d386f719b443be118e8f275ca6c68c87b3c5555
F20101108_AAAPXY zhang_x_Page_102.tif
c6b6d82679256a751e22df6bd1be5f47
1d7195e802eba0a1a346d4cda161d03233a2110e
21386 F20101108_AAAQEH zhang_x_Page_007.QC.jpg
e8296c582c50232879c69fba9a37144b
281ebe13fae0d1419968bcde5fcd23d8cba83123
42647 F20101108_AAAPZC zhang_x_Page_023.pro
5e239ecb7e0328ccf04ef6c720f92309
f5569aa00ca3d7f503f6ef3d267b00253e6e908c
2509 F20101108_AAAQDU zhang_x_Page_117.txt
45876fd49f0935322fb8746d2dda4bea
ef7ac33ba24ff6649cbacd6d6fbdc0ede0095abe
7389 F20101108_AAAPYO zhang_x_Page_001.pro
3ec0c9c660749cd6a3c8e05ac52a5a75
f4808c0bbd6d916a994df7c10cf741435580df1d
F20101108_AAAPXZ zhang_x_Page_104.tif
98b37b0ae140f1b9f42342e5095d4a0e
675a042948ab5a212b03cad0192adcf6ee879bc8
5324 F20101108_AAAQEI zhang_x_Page_007thm.jpg
9690b266070355149465bd95eb31b5e4
1ea73e21b1f60c3dddd004da5dcd9127ab5635c5
45392 F20101108_AAAPZD zhang_x_Page_024.pro
18d6a043d847eeddae9ae4b36553c8ff
3e1e87dfdb5274aa6ea60abc21a02ef9e4eba311
2676 F20101108_AAAQDV zhang_x_Page_119.txt
c408b40745c438724846987214df5c15
fe064f0d961b69cbd6b89823d09ea0c2ce929609
4562 F20101108_AAAPYP zhang_x_Page_003.pro
bea5e1d10cfe73c14e038ff4bcf06c32
5defdb5af24e15741c0a1c315bd42225d00b6df3
5103 F20101108_AAAQEJ zhang_x_Page_008thm.jpg
a51c53dda01651a6ebca24e7433884c9
8fd0847f9f8b3c3c21705b03869cbd72a9a76791
9038 F20101108_AAAPZE zhang_x_Page_030.pro
30bbacc22581b93fd9566b8da953ff1b
b7da837155d64278c6945b54c815360f9ded7c8c
F20101108_AAAQDW zhang_x_Page_121.txt
d5f032099d916d60a28ad29383be3f98
af603f024dc930360ef2b2027712d22bab040d4d
15756 F20101108_AAAPYQ zhang_x_Page_004.pro
08c07f263f4aab5560e816e4c7cd8f2c
6180970a38319e88425760612945a3b3edf506be
5080 F20101108_AAAQEK zhang_x_Page_009thm.jpg
faf87aa19fe7f962407121073161224d
44b87823ce0cb755fe6043437c633c51f4bb8aed
53810 F20101108_AAAPZF zhang_x_Page_031.pro
133fedc96bc7d0e9511661022be7933a
ced5a0e1f82824373277b1857f150ffa588da978
415 F20101108_AAAQDX zhang_x_Page_122.txt
f1f1148e4783ac0faf9b15568efce1a5
a2822d175d47d9500370131a39d36ee660789536
33394 F20101108_AAAPYR zhang_x_Page_006.pro
d4b69bdeadda0efcd1f1a9a748626296
10ba1c9b9cdea9d52425993dd1c82ff92f08a404
6018 F20101108_AAAQFA zhang_x_Page_021thm.jpg
0eaffc0c0dc86f6714eaadde01e869c4
3c5907de310d9f20c765a9125ea735e4ffcb9018
24828 F20101108_AAAQEL zhang_x_Page_010.QC.jpg
799aff0a1e038efe12d4e93e64cc9b34
383469aac6c9003d9b11f7ad68366f9be0a3e980
52323 F20101108_AAAPZG zhang_x_Page_032.pro
ce0aa1f5b23430a4333b164600480b78
729072655579691eb7cb2078c18a91602d50195f
1330671 F20101108_AAAQDY zhang_x.pdf
83490a2d651397dc532316cc2f5fc882
fc3a39011a91f15768f9489ed8bf399614af4562
BROKEN_LINK
www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/
www-bio3d-igbmc.u-strasbg.fr/BAliBASE/
www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/
www-bio3d-igbmc.u-strasbg.fr/BAliBASE/
46147 F20101108_AAAPYS zhang_x_Page_007.pro
6336b5e1a5579eec8be2c7f68b501299
c1a44c61d397ace95fbe16c2f3ae88e962524362
6951 F20101108_AAAQEM zhang_x_Page_011thm.jpg
439260e8ca3ce0520191c0f6a4541352
239482d37029b1e3e513f98890db8a6c46200682
60878 F20101108_AAAPZH zhang_x_Page_034.pro
a5a70a82aaf0bd0abcc0ddf03feb4f21
534d4d90b2b2a6c7dffa124012f856e7325bbe37
140657 F20101108_AAAQDZ UFE0021685_00001.mets
f56bf9b5b2974990d217f008048e1e4d
f52c3a4ecede6b894733aa3d3b79c9b5832c7f2b
28517 F20101108_AAAPYT zhang_x_Page_008.pro
ee34d504fef6b9de8f100939e61a62a2
c217a15b4afc6fc532a602f4ab0bb2c12e2cdaa3
21664 F20101108_AAAQFB zhang_x_Page_023.QC.jpg
a6af91b8d421c4477e51cabb8bef09cd
ae8c67269784a2f09d2d1607a50cbd645089605b
6906 F20101108_AAAQEN zhang_x_Page_012thm.jpg
b576fcbe035accf2e725873f4e1e45a9
3bfbdcc6b3280a988ff18c367f74988724439a9d
59529 F20101108_AAAPZI zhang_x_Page_035.pro
da7333265e5264139341b94048e5c67a
c64df2a29891eec3e02f37b522b31efa10999eae
54175 F20101108_AAAPYU zhang_x_Page_010.pro
7facb9a2c3df415fef03fee585598edc
9af471c1353718c7df0cdd7a99c6afde6ad8fca3
5808 F20101108_AAAQFC zhang_x_Page_023thm.jpg
cfb855c3ce321f3b92bb9565090259bc
226e8d4051a934fb3471de8e41199f06ba58c5e0
25220 F20101108_AAAQEO zhang_x_Page_013.QC.jpg
d050bb36ea950a22b39b8b5abd12adec
2a93638f74dca6bf1c1bee78159ae660ba94a66c
27755 F20101108_AAAPZJ zhang_x_Page_038.pro
40fe6a6a2f120976bda4202450ec8858
59960d146cd14a3e0e64076739da429246984e10
22361 F20101108_AAAQFD zhang_x_Page_024.QC.jpg
8879cc26a2542d14fe6432eec40fef26
fcf2bff481ec7379c674bcb7d4c7959736723da1
28062 F20101108_AAAQEP zhang_x_Page_014.QC.jpg
5dc4c3b13380297eb32fdcb7a4c52d21
35818935525112635fcd4f87ff6e7803837061d8
59605 F20101108_AAAPZK zhang_x_Page_039.pro
f0839fda58ac38d7d25ae95b9976bd1c
84d94047fb9c32eafeb8cc32b513674c76ccf3ec
61044 F20101108_AAAPYV zhang_x_Page_011.pro
c4debe88d2251924b05c545b7a3eed9c
789901f702a198ff5ef8137e2e199cd77b28214a
5762 F20101108_AAAQFE zhang_x_Page_024thm.jpg
bf8691dbaaeaf713942976d365334aba
d36ba9322b6cc11b004ad79972232d56f4d9d164
6239 F20101108_AAAQEQ zhang_x_Page_015.QC.jpg
7c84f06447d84a728a2d78771f64a363
73c3a98921e51a2c09401989ef3fbec9982d0ed6
46531 F20101108_AAAPZL zhang_x_Page_040.pro
278d8e89fe3d148d14f17ad10c962adc
b9d39964b9470911cf4d07609ad5db88c99ddd53
57822 F20101108_AAAPYW zhang_x_Page_012.pro
909e1ec0511d0ae668c7bb5aa1e15f42
0872e22c26024fd8578f55b5b1c29a15d0c9333a
24132 F20101108_AAAQFF zhang_x_Page_025.QC.jpg
4dec9ba7907cc3f3c5b4cee1727779f4
2a74f3a18f5e01366e0878bb5e4c6ec6a1a096b3
2051 F20101108_AAAQER zhang_x_Page_015thm.jpg
a9703695af73086114da396bf6874fb2
038663a03567964117458726a1f06781273cc955
54746 F20101108_AAAPZM zhang_x_Page_041.pro
ad38ca76788d39b32a14dee1df7fa930
3626e8d053e910a842c8f7e440b6a6feeaaaab29
58737 F20101108_AAAPYX zhang_x_Page_013.pro
ad564008670383e6ba7059dc79805616
48c633ff70b4e1fc09433669d776d08898544cc3
22829 F20101108_AAAQFG zhang_x_Page_027.QC.jpg
ae5c7ba6f1bf44f214fbb5f1045d9160
3fadbe5302e2403cc141b72edfa445344265074a
23231 F20101108_AAAQES zhang_x_Page_016.QC.jpg
180afc90db571e0570eef02a1f3c7bd5
fc300e1ae520fd7f6bcec488d7bcf07496a17cdb
58874 F20101108_AAAPZN zhang_x_Page_042.pro
f86b0b582acf1919d863a68873dcc647
ea970e5178498f6c1a358b4fb2177ebc45d86f48
46425 F20101108_AAAPYY zhang_x_Page_016.pro
62c386186938932bf58fd8d6113ac3e4
8beea56d6c83f08f7d736a178e528f51c887169f
21249 F20101108_AAAQFH zhang_x_Page_028.QC.jpg
ff94bf5e778cb35e3ce81b977f396672
96f67562fc28a121e22562aa01d47796e23156d8
6152 F20101108_AAAQET zhang_x_Page_016thm.jpg
c107cb72709d11d0fd0bb6b33b763a72
2925359dea1de8851d7a47d59ecf3604afee970c
32393 F20101108_AAAPZO zhang_x_Page_044.pro
d6b21d94e8759e14556102ce3e7ff911
6de616759271bc25cd096783aaef4ddb9f17890d
58175 F20101108_AAAPYZ zhang_x_Page_018.pro
4903eb3610440c88029722fedd7b0602
0f6c3d9343b64d8aef5d71365e6ea8a2c5edb506
5700 F20101108_AAAQFI zhang_x_Page_028thm.jpg
ef5a0291387ae7f8f5ed184186da67ef
c02b0d66dfe514a259507c65ac045bce7fba195a
28615 F20101108_AAAQEU zhang_x_Page_017.QC.jpg
6f1e34fcbd7d0a075f0ef60fd5430895
a85f3a6b34d99517769a10f04208a5b1adfb91c1







MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS


By
XU ZHANG



















A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

































© 2007 Xu Zhang


































To my family, and to all who nurtured my intellectual curiosity, academic interests, and sense of scholarship throughout my lifetime, making this milestone possible










ACKNOWLEDGEMENTS

This dissertation would not have been possible without the support of many people. Many thanks to my adviser, Tamer Kahveci, who worked with me on our research and read my numerous revisions. Thanks also to my committee members, Alin Dobra, Arunava Banerjee, Christopher M. Jermaine, and Kevin M. Folta, who offered guidance and support. Thanks to Amit Dhingra for cooperating with me and giving me a great deal of help on the MAPPIT project. Finally, thanks to my parents and numerous friends who endured this long process with me, always offering support and love.











TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND
  2.1 Measurements of Multiple Sequence Alignment
  2.2 Dynamic Programming Methods
  2.3 Heuristic Methods
  2.4 Optimizing Existing Alignments Methods
  2.5 Approximation Algorithms
    2.5.1 Our Methods vs. Approximation Methods
      2.5.1.1 What do "approximable" and "non-approximable" mean?
      2.5.1.2 Why do approximation algorithms not work for multiple sequence alignment applications?
      2.5.1.3 Why do our algorithms work?
    2.5.2 Overview of Approximation Algorithms for Multiple Sequence Alignment
      2.5.2.1 Hardness Results
      2.5.2.2 NP-completeness and MAX-SNP-hardness of multiple sequence alignment

3 OPTIMIZATION OF SP SCORE FOR MULTIPLE SEQUENCE ALIGNMENT IN GIVEN TIME
  3.1 Motivation and Problem Definition
  3.2 Current Results
    3.2.1 Constructing Initial Alignment
    3.2.2 Improving the SP Score via Local Optimizations
    3.2.3 QOMA and Optimality
    3.2.4 Improved Algorithm: Sparse Graph
    3.2.5 Experimental Evaluation

4 OPTIMIZING THE ALIGNMENT OF MANY SEQUENCES
  4.1 Motivation and Problem Definition
  4.2 Current Results
  4.3 Aligning a Window
    4.3.1 Constructing Initial Graph
    4.3.3 Refining Clusters Iteratively
    4.3.4 Aligning the Subsequences in Clusters
    4.3.5 Complexity of QOMA2
  4.4 Experimental Evaluation

5 IMPROVING BIOLOGICAL RELEVANCE OF MULTIPLE SEQUENCE ALIGNMENT
  5.1 Motivation and Problem Definition
  5.2 Current Results
    5.2.1 Constructing Initial Graph
    5.2.2 Grouping Fragments
    5.2.3 Fragment Position Adjustment
    5.2.4 Alignment
    5.2.5 Gap Adjustment
    5.2.6 Experimental Results

6 MODULE FOR AMPLIFICATION OF PLASTOMES BY PRIMER IDENTIFICATION
  6.1 Motivation and Problem Definition
  6.2 Related Work
  6.3 Current Results
    6.3.1 Finding Primer Candidates
      6.3.1.1 Multiple sequence alignment-based primer identification
      6.3.1.2 Motif-based primer identification
    6.3.2 Finding Minimum Primer Pair Set
    6.3.3 Evaluating Primer Pairs
    6.3.4 Experimental Evaluation
    6.3.5 Quality Evaluation
    6.3.6 Performance Comparison
    6.3.7 Wet-lab Verification

7 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH










LIST OF TABLES


Table

3-1 The average SP scores of QOMA using complete K-partite graph
3-2 The average SP scores of QOMA and five other tools
3-3 The improvement of QOMA
3-4 The average ($\mu$) and standard deviation ($\sigma$) of the error, $S^* - SP$, for a window using the sparse version of QOMA
3-5 The running time of QOMA (in seconds)
4-1 The list of variables used in this chapter
4-2 The average SW and SP scores of individual windows
4-3 The average SP scores of QOMA2 for individual windows
4-4 The average SP scores of the alignments of the entire benchmarks
4-5 The average SP scores of QOMA2 and other tools
5-1 The BAliBASE score of HSA and other tools, less than 25% identity
5-2 The BAliBASE score of HSA and other tools, 20%-40% identity
5-3 The BAliBASE score of HSA and other tools, more than 35% identity
5-4 The SP score of HSA and other tools
5-5 The running time of HSA and other tools (measured in milliseconds)
6-1 Comparison of Primer3 and using multiple sequence alignment in step 1
6-2 Comparison of using different sources of alignment
6-3 Comparison of multiple sequence alignment-based methods and motif-based methods in step 1
6-4 Effects of the number of reference sequences
6-5 Eight randomly selected primer pairs










LIST OF FIGURES


Figure

1-1 An example of multiple sequence alignment
2-1 An example to show the meaninglessness of alignments with approximation ratio less than 2
An example of different alignments with the same SP score
Constructing the initial alignment by strategy 2
QOMA finds optimal alignment inside window
Sparse K-partite graph
An example of using K-partite graph
The SP scores of QOMA alignments
Alignment strategies at a high level
Comparison of the SP score found by different strategies
The distribution of the number of benchmarks with different number of sequences (K)
The initial graph constructed
The fragments with similar features are grouped together
A gap vertex is inserted
Cliques found are the columns
Gaps are moved
Example of primer pairs on target sequence
An example of computing the SP score of multiple sequence alignment
An example of matching primers with translocations
Selection of next forward primer from current reverse primer
Polymerase chain reaction samples









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

MULTIPLE SEQUENCE ALIGNMENT SOLUTIONS AND APPLICATIONS

By

Xu Zhang

December 2007

Chair: Tamer Kahveci
Major: Computer Engineering

Bioinformatics is a field in which computer science is used to assist the biological sciences. In this area, multiple sequence alignment is one of the most fundamental problems. A multiple sequence alignment is an alignment of three or more sequences. It is widely used in many applications such as protein structure prediction, phylogenetic analysis, identification of conserved motifs, protein classification, gene prediction, and genome primer identification. A challenging problem in this area is finding the multiple sequence alignment that maximizes the SP (Sum-of-Pairs) score. This problem is NP-complete. Furthermore, finding an alignment that is biologically meaningful is not trivial, since the SP score may not reflect biological significance. This thesis addresses these problems. More specifically, we consider four problems. First, we develop an efficient algorithm to optimize the SP score of a multiple sequence alignment. Second, we extend this algorithm to handle a large number of sequences. Third, we apply secondary structure information of residues to build a biologically meaningful alignment. Finally, we describe a strategy that employs the alignment of multiple sequences to identify primers for a given target genome.









CHAPTER 1
INTRODUCTION

Bioinformatics is the interaction of molecular biology and computer science. It can be viewed as a branch of biology that uses computers to help answer biological questions. One of the fundamental research areas in bioinformatics is multiple sequence alignment. A multiple sequence alignment is an alignment of more than two

sequences. An example of multiple sequence alignment is shown in Figure 1-1. The

alignment is part of a whole alignment selected from BAliBASE benchmark database [1,



Multiple sequence alignment is widely used in many applications such as protein structure prediction [3], phylogenetic analysis [4], identification of conserved motifs [5], protein classification [6], gene prediction [7-9], and genome primer identification [10]. The following are some examples of these applications.

Application 1. Identification of conserved motifs and domains

One important application of multiple sequence alignments is to identify conserved motifs and domains. Motifs are conserved regions or structures in protein or DNA families. They tend to be preserved during evolution [11]. For related proteins, their motifs present similar structures and functions. Within a multiple alignment, motifs can be identified as columns with more conservation than their surroundings. Analyzed together with experimental data, motifs can be a very important characterization of sequences of unknown function. This principle leads to many important applications in bioinformatics. Some important databases, such as PROSITE [12] and PRINTS [13], are built on this principle. Another type of method uses a profile [14] or a hidden Markov model (HMM) [15] to identify motifs. These methods work well when a motif is too subtle to be defined via a standard pattern, since, when searching a database, profiles and HMMs can identify distant members of a protein family and provide much higher sensitivity and specificity than a single sequence or a single pattern can provide. In practice, users can create their own profiles from multiple sequence alignments by using tools such as PFTOOLS [16], by using pre-established collections like Pfam [17], or by computing the profiles on the fly with PSI-BLAST [18], the position-specific version of BLAST.

1thx        aeqpvlvyfwaswcgpcqlmsplinlaantysdrlkvvkled.
thio_thife  sskpvlvdfwaewcgpckmiapileeiadeyadrlrvakfnide..
thio_strcl  sekpvlvdfwaewcgperqiapsleait.ehggqieivklnd.
thio_rhoru  adgpnxvdfwaewcgperqxapaleelatalgdkvtvakind.
thio_myctu  snkpvlvdfwatwcgpckmvapvleeiateratdltvakldt.
thio_gripa  srqpvlvdfwapwcgpermiastideiahdykdklkvvkvnd.
thio_rhosh  sdvpyvvdfwaewcgpcrqigpaleelskeyagkvkivkvnd.
txla_synp7  ndrpmllefyadwctscqamagriaalkqdysdrldfymlnd.
1kte        iqpgkvvyfikptcpfcrktqellsqlp..fkegl.lefydttst
1grx        ...mqtvifgrsgcpysvrakdlacklsner.ddfqyqyvdre.
2trcp       kvttivvniyedgvrgedalnssleclaaey.pmvkfckiran.

Figure 1-1. An example of multiple sequence alignment. Sequences are subsequences selected from the BAliBASE database.

Application 2. Protein Family Classifications

Given a family of homologous protein sequences, how can we know whether a new sequence S belongs to the family? One answer is to align S to the multiple alignment of the sequences of the family and then find common motifs between them [19, 20]. Here, motifs are aligned ungapped segments of the most highly conserved protein regions in the multiple sequence alignment. By comparing the motifs in the multiple sequence alignment with the unknown sequence S, we can determine how similar the alignment and S are, and then assess how likely S is to belong to the family.

Application 3. Sequence Assembly

Multiple sequence alignment can be used in DNA sequencing and primer identification [21-25]. In shotgun sequencing, multiple sequence alignment plays a very important role [26]. Suppose we are given a set of genomic reads in a shotgun sequencing project. These read fragments are highly similar, and hence easy to align. The multiple sequence alignment of the reads can construct the footprint of the main backbone of the original sequence, and thus eases the work of recognizing the whole sequence from the reads. If high-quality reads are used, the target sequence can be rebuilt directly from the consensus sequence of the multiple sequence alignment of the reads.
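The consensus step described above can be illustrated with a minimal sketch. It assumes the reads are already aligned into equal-length rows, uses "-" for positions a read does not cover, and takes a simple per-column majority vote; the function name `consensus` and the toy reads are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Return the column-wise majority consensus of equal-length aligned rows."""
    if not alignment:
        return ""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    result = []
    for column in zip(*alignment):
        # Only residues vote; gap symbols mark positions a read does not cover.
        counts = Counter(ch for ch in column if ch != gap)
        if counts:
            # Keep the most frequent residue of this column.
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


# Three overlapping toy reads, already placed in a common coordinate system.
reads = [
    "ACGTAC----",
    "--GTACGG--",
    "----ACGGTT",
]
print(consensus(reads))
```

A real assembler would also weigh base qualities and resolve ties explicitly; the sketch only shows how a consensus is read off the columns of an alignment.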









Given two sequences, $P_i$ and $P_j$, we denote the score of their alignment as $\mathrm{Score}(P_i, P_j)$. It can be computed as $\mathrm{Score}(P_i, P_j) = \sum_{1 \le k \le N} c(P_{i,k}, P_{j,k})$, where $N$ is the length of the alignment, $P_{i,k}$ is the $k$th character of $P_i$, and $c(x, y)$ is the score of matching $x$ and $y$. Here $x$ or $y$ can be a gap, which means an insertion or a deletion. Finding the multiple sequence alignment that maximizes the SP (Sum-of-Pairs) score is an NP-complete problem [27]. Here, the SP score of an alignment, $A$, of sequences $P_1, P_2, \ldots, P_K$ is computed by adding the alignment scores of all induced pairwise alignments. It can be expressed as $SP(A) = \sum_{i<j} \mathrm{Score}(P_i, P_j)$, where $K$ is the number of sequences, $P_i$ is the sequence indexed by $i$, and $\mathrm{Score}(P_i, P_j)$ is the score of the alignment of $P_i$ and $P_j$ induced by $A$.
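As a concrete illustration of the two definitions above, the following sketch computes the induced pairwise score and the SP score of a small alignment. The scoring function `pair_score` and its values (match 1, mismatch -1, gap -2, gap against gap 0) are illustrative assumptions; in practice $c(x, y)$ would come from a substitution matrix and a gap model.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Toy c(x, y); the numeric values are assumptions for illustration."""
    if x == "-" and y == "-":
        return 0  # assumed convention: two gaps facing each other score nothing
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def induced_score(row_i, row_j, c=pair_score):
    """Score(P_i, P_j): sum of c over the columns of the induced pairwise alignment."""
    return sum(c(x, y) for x, y in zip(row_i, row_j))


def sp_score(alignment, c=pair_score):
    """SP(A): sum of the induced scores over all pairs i < j."""
    return sum(induced_score(a, b, c) for a, b in combinations(alignment, 2))


alignment = ["AC-T", "ACGT", "A-GT"]
print(sp_score(alignment))
```

The number of pairs grows quadratically in $K$, so evaluating the SP score of a given alignment is cheap; the hard part is searching for the alignment that maximizes it.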

The alignment of two sequences with maximum score can be found in $O(N^2)$ time using dynamic programming [28], where $N$ is the length of the sequences. This algorithm can be extended to align $K$ sequences, but it then requires $O(N^K)$ time [29, 30]. A variety of heuristic algorithms have been developed to overcome this difficulty [1]. Most of them are based on progressive application of pairwise alignment. They build up alignments of larger numbers of sequences by adding sequences one by one to the existing alignment [31]. These methods have the shortcoming that the order in which sequences are added to the existing alignment significantly affects the quality of the resulting alignment. This thesis focuses on the problems of optimizing the SP score and of sequence order dependence. We provide solutions based on a divide-and-conquer strategy, and also an application to the prediction of genome primers.
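The quadratic-time pairwise dynamic program mentioned above can be sketched as follows. This is a standard global alignment recurrence in the style of Needleman and Wunsch with a linear gap penalty; the scoring values are assumptions chosen for illustration, and the function name is hypothetical.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned s, aligned t) for a global alignment."""
    n, m = len(s), len(t)
    # score[i][j] is the best score of aligning s[:i] with t[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                score[i - 1][j] + gap,      # s[i-1] against a gap
                score[i][j - 1] + gap,      # t[j-1] against a gap
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                a.append(s[i - 1])
                b.append(t[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))


best, row_s, row_t = needleman_wunsch("ACGT", "AGT")
print(best)
print(row_s)
print(row_t)
```

The table has $(N+1)^2$ cells for two sequences of length $N$; the same recurrence over $K$ sequences needs a $K$-dimensional table, which is where the $O(N^K)$ cost comes from.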

The contributions of this thesis are as follows:

Contribution 1: Given a fixed time budget, we aim to maximize the SP score for a moderate number (3-10) of sequences within this time. Optimizing the SP score of a multiple sequence alignment requires $O(N^K)$ time, which makes exact optimization impractical. We consider this optimization problem and provide a solution that constructs an alignment which converges to the optimal alignment while keeping a practical running time. We develop an algorithm, called QOMA, to address this problem. QOMA takes an initial alignment and then optimizes it using a window of limited size, which is selected from the alignment. It finds the optimal alignment of the window in the sense of SP score and replaces the window with that optimal alignment.

We develop theory to justify the claim that QOMA finds alignments that converge to the globally SP-optimal alignment as the size of the sliding window increases. The experimental results agree with this claim.
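The window replacement step can be pictured with the following sketch. It only shows the mechanics of cutting a window out of an alignment, realigning the residues inside it, and splicing the result back. The function names are hypothetical, and `left_justify` is a deliberately trivial stand-in for the optimal window aligner; it is not the method QOMA uses.

```python
def optimize_window(alignment, start, width, realign, gap="-"):
    """Realign the columns [start, start + width) and splice the result back."""
    end = start + width
    # Residues of each row that fall inside the window, with gaps removed.
    pieces = [row[start:end].replace(gap, "") for row in alignment]
    new_window = realign(pieces)
    if [w.replace(gap, "") for w in new_window] != pieces:
        raise ValueError("realign must preserve the residues of every row")
    if len({len(w) for w in new_window}) != 1:
        raise ValueError("realign must return rows of equal length")
    return [row[:start] + w + row[end:] for row, w in zip(alignment, new_window)]


def left_justify(pieces, gap="-"):
    """Placeholder aligner (an assumption): pad every piece on the right."""
    longest = max(len(p) for p in pieces)
    return [p + gap * (longest - len(p)) for p in pieces]


initial = ["AC-GT-A", "A-CGT-A", "ACG-T-A"]
improved = optimize_window(initial, start=1, width=5, realign=left_justify)
print(improved)
```

Any function that maps the ungapped pieces to equal-length rows can be passed as `realign`, so an exact SP-optimal aligner for small windows could be substituted without changing the splicing logic.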

Contribution 2: Given a large number of protein sequences, we aim to maximize the SP (Sum-of-Pairs) score. The QOMA (Quasi-Optimal Multiple Alignment) algorithm addresses this problem when the number of sequences is small. However, as the number of sequences increases, QOMA becomes impractical. We develop a new algorithm, QOMA2, which optimizes the SP score of the alignment of an arbitrarily large number of sequences. Given an initial (potentially sub-optimal) alignment, QOMA2 selects short subsequences from this alignment by placing a window on it. It quickly estimates the amount of improvement that can be obtained by optimizing the alignment of the subsequences in short windows on this alignment. This estimate is called the SW (Sum of Weights) score. QOMA2 employs a dynamic programming algorithm that selects the set of window positions with the largest total expected improvement. It partitions the subsequences within each window into clusters such that the number of subsequences in each cluster is small enough to be optimally aligned within a given time. It also aims to select these clusters so that the optimal alignment of the subsequences in these clusters produces the highest expected SP score.
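One way to picture the selection of window positions is the small dynamic program below. It assumes that every window has the same width, that chosen windows must not overlap, and that an estimated gain is already available for each possible start position; these constraints, the function name, and the toy gains are assumptions for illustration rather than details taken from the thesis.

```python
def select_windows(gains, width):
    """Pick non-overlapping windows of a fixed width with maximum total gain.

    gains[p] is the estimated improvement of the window starting at column p.
    Returns (total gain, list of chosen start positions).
    """
    n = len(gains)
    # best[p] is the largest total gain achievable with window starts >= p.
    best = [0] * (n + width + 1)
    take = [False] * n
    for p in range(n - 1, -1, -1):
        skip = best[p + 1]
        use = gains[p] + best[p + width]
        if use > skip:
            best[p] = use
            take[p] = True
        else:
            best[p] = skip
    # Walk forward through the decisions to list the chosen windows.
    chosen, p = [], 0
    while p < n:
        if take[p]:
            chosen.append(p)
            p += width
        else:
            p += 1
    return best[0], chosen


total, starts = select_windows([3, 5, 4, 1, 6, 2], width=2)
print(total, starts)
```

The table is filled in a single backward pass, so the selection takes time linear in the number of candidate positions once the gains are known.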

Contribution 3: We aim to construct a biologically meaningful alignment from multiple sequences. We consider this problem together with the sequence order dependence problem. Our solution is to apply secondary structure information of residues when aligning the protein sequences. In this method, we first group residues in the sequences based on their primary types and secondary structures, and adjust their positions according to the groups. We then slide a window over the adjusted sequences, align the residues in the window, and replace the window with the resulting alignment. We construct the final alignment by concatenating the alignments obtained from the sliding window. This method showed a higher SP score than any of the other tools we selected for comparison.

Contribution 4: We use multiple sequences to assist genome sequencing. This is a new problem motivated by new DNA sequencing techniques (see the ASAP project [32]). In DNA sequencing, plastid sequencing throughput can be increased by amplifying the isolated plastid DNA using rolling circle amplification (RCA) [33]. However, obtaining sequence through RCA requires this intermediate step. Recently, the ASAP method showed that sequence information could be gathered by creating templates from plastid DNA based on conserved regions of plastid genes. To expand this technique to an entire chloroplast genome, an efficient method is required to facilitate primer selection. More importantly, such a method will allow the selected primer set to be updated as new plastid sequences become available. Our method is named MAPPIT. MAPPIT uses genes from related species to assist in predicting unknown genes. It takes as input existing gene sequences that are closely related to the gene to be predicted, extracts information from them, and constructs primer pairs. The goal is to find primer pairs that cover as much of the unknown gene as possible while keeping the number of pairs as small as possible. MAPPIT uses two different strategies for constructing primer candidates: a multiple sequence alignment-based method and a motif-based method. The experimental results showed that the primer pairs found by MAPPIT were very helpful for the prediction of unknown genomes.

The rest of this thesis is organized as follows. Chapter 2 discusses related work on multiple sequence alignment. Chapter 3 presents an algorithm for optimizing the SP score of the resulting multiple sequence alignment in a given time. Chapter 4 introduces an algorithm for aligning many sequences, with the goal of optimizing the SP score. Chapter 5 presents an algorithm for improving the biological relevance of multiple sequence alignment by applying secondary structure information. Chapter 6 introduces a module for amplification of plastomes by primer identification. Chapter 7 presents the conclusion of our work.









CHAPTER 2
BACKGROUND

Multiple sequence alignment [34, 35] of protein sequences is one of the most fundamental problems in computational biology. It is an alignment of three or more protein sequences. Multiple sequence alignment is widely used in many applications such as protein structure prediction [3], phylogenetic analysis [4], identification of conserved motifs and domains [5], gene prediction [7-9], and protein classification [6].

2.1 Measurements of Multiple Sequence Alignment

There are several different ways to assess a multiple sequence alignment [36]. One

common method is to score a multiple alignment according to a mathematical model. We

define the cost of the multiple sequence alignment A of K sequences as

$$\sum_{i=1}^{N} c(P_1(i), P_2(i), \ldots, P_K(i))$$

where $P_j(i)$ is the ith letter in the sequence $P_j$, $i = 1, 2, \ldots, N$, and $c(P_1(i), P_2(i), \ldots, P_K(i))$

is the cost of the ith column [37]. The column cost function

$$c(P_1(i), P_2(i), \ldots, P_K(i)) = \sum_{1 \le p < q \le K} c(P_p(i), P_q(i))$$

is called the Sum-of-Pairs (SP) cost. The SP alignment model

is widely used in applications such as finding conserved regions, and has been extensively

researched [38-44]. In SP alignment, we assume all sequences are equally related to each other,

so all pairs of sequences are assigned the same weight. In our later discussion, we

will focus on the SP model. There are also other optimization models in this group, such as

consensus alignment and tree alignment [29, 40-42, 45-50]. The key difference among these

models is how they formulate their column cost functions [37]. For all models in this type of

measurement, the cost scheme used should reflect the probabilities of evolutionary

events, including substitution, insertion, and deletion. So it is important to choose










appropriate cost schemes for pairs of letters. For protein sequences, the PAM matrix

and BLOSUM matrix are the most widely used [51, 52]. For DNA sequences, the simple

match/mismatch cost scheme is often used. We can also use more sophisticated cost

schemes such as transition/transversion costs [53] and DNA PAM matrices. Throughout

this section, we use c() as the column cost function and c(x, y) as the pairwise cost function,

which measures the dissimilarity between a pair of letters or spaces x and y. We use - to

denote a space and Σ to denote the set of letters that form the input sequences.
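The SP cost defined above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this thesis: the function names are hypothetical, and the unit match/mismatch pairwise cost is an assumption chosen only to keep the example small.

```python
from itertools import combinations


def pair_cost(x, y):
    """Toy pairwise cost c(x, y): 0 for identical symbols, 1 otherwise.

    Assumption for illustration only; a space ("-") is treated like any
    other symbol, so c(-, -) = 0 and c(letter, -) = 1.
    """
    return 0 if x == y else 1


def column_cost(column, c=pair_cost):
    """SP cost of one column: sum of c over all unordered pairs of rows."""
    return sum(c(x, y) for x, y in combinations(column, 2))


def sp_cost(alignment, c=pair_cost):
    """SP cost of an alignment given as equal-length gapped strings."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    # zip(*alignment) iterates over the columns of the alignment.
    return sum(column_cost(col, c) for col in zip(*alignment))


example = ["AC-T", "A-GT", "ACGT"]
print(sp_cost(example))
```

Passing a different `c` (for instance a lookup into a substitution matrix) changes the cost scheme without touching the column or alignment logic.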

Another type of measurement compares an alignment with a reference alignment.

The BAliBASE score [5, 54] is the most widely used measure of this type. Given a gold-standard

alignment A*, this measure evaluates how similar the alignments A and A* are. The

BAliBASE score is commonly used in the literature as an alternative to the SP score.

However, the BAliBASE score can only be computed for sets of sequences for which the gold

standard is known. In contrast, the SP score can be computed for any set of sequences.

Most of the existing methods aim to maximize a linear variation of the SP score by

weighting the sequences (or subsequences) in order to converge to the BAliBASE score

for known benchmarks [1, 2]. This chapter focuses on optimizing the SP score, which is

computationally an equivalent problem to the weighted versions in the literature. The

problem of finding appropriate weights to converge the SP and the BAliBASE score is

orthogonal to this chapter and should be considered separately.

2.2 Dynamic Programming Methods

Dynamic programming methods were first proposed for the multiple string matching

problem. The multiple sequence alignment problem can be viewed as a multiple string matching

problem [55-58], and dynamic programming can also be used to find its optimal solutions. Given

a table of scores for matches and mismatches between all amino acids and penalties for

insertions or deletions, the optimal alignment of two sequences can be determined using

dynamic programming (DP). The time and space complexity of this method is O(N^2) [28,

59, 60], where N is the length of each sequence. This algorithm can be extended to align










K sequences, but requires O(N^K) time [29, 30]. Indeed, finding the multiple sequence

alignment that maximizes the SP (Sum-of-Pairs) score is an NP-complete problem [27].
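The pairwise dynamic programming recurrence mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the match, mismatch and gap values are hypothetical, gaps are penalized linearly, and only the optimal score is returned. Because it keeps just two rows of the table it uses O(N) space; recovering the alignment itself by traceback requires the full O(N^2) table.

```python
def pairwise_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t by dynamic programming.

    The scoring values are illustrative assumptions, not those of any
    particular tool. Time is O(len(s) * len(t)).
    """
    n, m = len(s), len(t)
    # prev[j] is the best score for aligning the current prefix of s with t[:j].
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align s[i-1] with t[j-1]
                prev[j] + gap,      # align s[i-1] with a space
                curr[j - 1] + gap,  # align t[j-1] with a space
            )
        prev = curr
    return prev[m]


print(pairwise_alignment_score("ACGT", "AGT"))
```

Extending the same recurrence to K sequences replaces the two-dimensional table with a K-dimensional one, which is where the O(N^K) cost comes from.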

There are a few methods which aim to optimize the alignment by running dynamic

programming alignment on all sequences simultaneously. MSA is the representative in

this class [61]. DCA extends MSA by utilizing 'divide-and-conquer' strategy [47]. Unlike

progressive methods, DCA divides the sequences recursively until they are shorter than

a given threshold. DCA then uses MSA to find the optimal solutions for the smaller

problems. The performance of DCA depends on how it divides the sequences. DCA uses

a cut strategy that minimizes additional costs [62] and uses the longest sequence in the

input sequences as reference to select the cut positions. DCA is not guaranteed to find

an optimal solution. The selection of the longest sequence makes DCA order dependent, as

there is no justification why this selection (or any other selection) optimizes the SP-score

of the alignment. In contrast, our methods in this thesis are order independent.

However, MSA, DCA and other algorithms that maximize the SP score suffer from

high computational cost [1].

2.3 Heuristic Methods

A variety of heuristic algorithms have been developed to overcome the computational

cost of dynamic programming methods [1]. These heuristic methods also provide

solutions for aligning large sequences, which dynamic programming is unable to process

due to memory limitations [63-69]. These heuristic methods can be classified into

four groups [70]: progressive, iterative, anchor-based and probabilistic. They all have the

drawback that they do not provide a flexible quality/time trade-off.

Progressive methods find a multiple alignment by iteratively picking two sequences

or profiles from the input set and replacing them with their alignment (i.e., consensus sequence)

until all sequences are aligned into a single consensus sequence. Thus, progressive methods

guarantee that never more than two sequences or profiles are aligned simultaneously.

The order of selecting sequence or profile is determined by a pre-created guide tree or










a clustering algorithm [71]. This approach is sufficiently fast to allow alignments of

almost any size. The common shortcoming of these methods is that the resulting

alignment depends on the order of aligning the sequences. ClustalW [1], T-COFFEE [2],

Treealign [72], POA [45, 73, 74], and MAFFT [75] can be grouped into this class [76].

ClustalW [1, 77] is currently the most commonly used multiple sequence alignment

program. ClustalW includes the following features to produce biologically meaningful

multiple sequence alignments. 1) According to a pre-computed guide tree, each input

sequence is assigned a weight during the alignment process, so that sequences with

more similarity get less weight and divergent sequences get more weight. 2) According to

the divergence of the sequences to be aligned, different amino acid substitution matrices

are used at different alignment stages. 3) Gap penalties favor extending existing gaps over

opening new gaps. This encourages gaps to occur in loop regions instead of in

highly structured regions such as alpha helices and beta sheets. The biological

rationale is that divergence is often less likely in highly structured

regions, which are commonly very important to the fold and function of a protein. For

similar reasons, to discourage the opening of new gaps near the existing ones, existing gaps

are assigned locally reduced gap penalties.

T-COFFEE [2] is a progressive approach based on consistency. It is one of the most

accurate programs available for multiple sequence alignment. T-COFFEE avoids the most

serious drawback caused by the greedy nature of progressive algorithms. T-Coffee first

aligns all sequences pairwise, and then uses the alignment information to guide the

progressive alignment. T-Coffee creates intermediate alignments based on the sequences to

be aligned next and how all of the sequences align to each other.

MAFFT [75] provides a set of multiple alignment methods and is used on unix-like

operating systems. MAFFT includes two new techniques: identifying motif regions

quickly and using a simplified scoring system. The first technique is implemented with the fast

Fourier transform (FFT). This technique changes an amino acid sequence to a sequence of










volume and polarity values of each amino acid residue. The second technique reduces

CPU time and increases the accuracy of alignments. It works well even when sequences

have a large number of insertions or extensions, or when sequences of similar length are

distantly related. MAFFT implements the iterative refinement method in addition to the

progressive method.

The POA [45] program does not use generalized profiles during the progressive alignment

process. Instead, it introduces a partial order-multiple sequence alignment format to

represent sequences. POA allows alignable regions to be extended, permitting longer alignments

between closely related sequences and shorter alignments for the entire set of sequences.

Iterative methods start with an initial alignment. They then repeatedly refine

this alignment through a series of iterations until no more improvements can be made.

Iterative methods do not provide a flexible quality/time trade-off. In addition, iterative methods

cannot fix mismatches in the previous alignment during the iteration. MUSCLE [78]

can be grouped into this class as well as the progressive method class since it uses a

progressive alignment at each iteration.

MUSCLE [78] applies many techniques such as fast distance estimation using

k-mer counting, progressive alignment using a new profile function which is called the

log-expectation score, and refinement using tree-dependent restricted partitioning. At the

time it was proposed, it achieved the best accuracy. Since it was relatively slow, MUSCLE

was not widely used.
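To illustrate the idea of fast distance estimation by k-mer counting, here is a minimal sketch. It is a generic k-mer distance (one minus the fraction of shared k-mers), not MUSCLE's exact formula; the function names and the default k = 3 are assumptions made for this example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(s, t, k=3):
    """Distance in [0, 1]: 1 - shared k-mers / k-mers in the shorter sequence.

    Generic illustration only; no alignment is computed, which is what
    makes this kind of estimate fast.
    """
    denom = min(len(s), len(t)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must contain at least k letters")
    a, b = kmer_counts(s, k), kmer_counts(t, k)
    # Counter intersection keeps the minimum count of each common k-mer.
    shared = sum((a & b).values())
    return 1.0 - shared / denom


print(kmer_distance("ACDEFG", "ACDXFG"))
```

Distances of this kind can be computed for all sequence pairs in time linear in the sequence lengths and then fed to a clustering step that builds the guide tree.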

Anchor-based methods first identify local motifs (short common subsequences) as

anchors. Then, the unaligned regions between consecutive anchors are aligned using other

techniques. In general, anchor-based methods follow a divide-and-conquer strategy [79].

This group includes several methods designed for rapidly detecting anchors [80-

82]. DIALIGN [83, 84], Align-m [46], L-align [85], Mavid [86] and PRRP [87] belong to

this class.










The DIALIGN program implements a local alignment approach to construct multiple

alignments. It uses comparisons based on segments instead of the residue-based comparisons used previously.

It then integrates the segments identified as anchors into a multiple alignment using an

iterative procedure. DIALIGN treats a column as either alignable or non-alignable.

The Align-m [46] program uses a non-progressive local approach to guide a global

alignment. It constructs a set of pairwise alignments guided by consistency. It performs

well on divergent sequences. The drawback is that it runs slowly.

The PRRP program uses a randomized iterative strategy. It progressively optimizes a

global alignment by dividing the sequences into two groups iteratively. It realigns groups

globally using a group-based alignment algorithm.

Probabilistic methods first compute the substitution probabilities from known

multiple alignments. They then use the probabilities to maximize the substitution

probabilities for a given set of sequences. Especially for divergent sequences, these

consistency-based methods often have an advantage in terms of accuracy. ProbCons [88]

and HMMT [89] can be grouped into this class.

ProbCons [88] introduces an approach based on consistency. It uses a probabilistic

model and maximum expected accuracy scoring. According to the evaluation of its

performance on several standard alignment benchmark data sets, ProbCons is one of the most

accurate alignment tools today.

HMMT first discovers the patterns which are common in the multiple sequences, and

saves a description of the patterns in an HMM file. It then applies a simulated annealing

method, which tries to maximize the probability represented by the HMM file for the

sequences to be aligned. HMMT works iteratively, alternating between computing a new multiple sequence

alignment using the pattern and deriving a new pattern from that alignment.










2.4 Optimizing Existing Alignments Methods

There is also a set of alignment algorithms that aim to improve the alignment

quality of an initial alignment. Our methods, QOMA and QOMA2, can be classified in this

group.

Improving the alignment quality of an initial alignment has traditionally been done

manually (e.g., through programs like MaM and WebMaM [90]). Recently, RASCAL [91],

REFINER [92] and ReAligner [93] have included more automatic features. Our methods,

QOMA and QOMA2, belong to this group in general. QOMA and QOMA2 differ

from RASCAL and REFINER in that QOMA and QOMA2 focus on optimizing

the SP score of alignments and require only sequence information, while RASCAL is a

knowledge-based approach and REFINER targets optimizing the score of core regions.

ReAligner uses a round-robin algorithm and improves DNA alignment.

Most existing tools have the shortcoming that they are unable to process a large

number of sequences. It is appropriate to apply dynamic programming on subdivisions of

alignments. "Jumping alignment" [94] applies a similar idea. Our method, QOMA2 [95],

provides a solution for aligning a large number of protein sequences.

In this thesis, we address the problems mentioned above: the sequence-order-dependence

problem, the quality/time trade-off problem and the problem of a large number of input sequences.

2.5 Approximation Algorithms

Our algorithms provided in this thesis are heuristic algorithms by nature. Heuristic

algorithms can be defined as algorithms that search among all possible solutions, but abandon the

goal of finding the optimal solution for the sake of improved run time. Heuristic

algorithms usually run fast and get good results. However, they do not guarantee the

optimal solution, and come with no proof that the obtained solution is not arbitrarily bad.

If we want to find the optimal solution, we can use exact algorithms. The most

widely adopted method of exact algorithms in multiple sequence alignment is dynamic

programming. However, dynamic programming requires running time of O(N^K) for










aligning K sequences with length N. The required running time is infeasible for

large N or K.
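A back-of-the-envelope calculation makes this growth concrete. The values N = 100 and K = 10 and the rate of 10^9 table cells per second are illustrative assumptions, not measurements from this thesis.

```latex
\begin{align*}
N^K &= 100^{10} = 10^{20} \ \text{cells}, \\
\frac{10^{20}\ \text{cells}}{10^{9}\ \text{cells/s}} &= 10^{11}\ \text{s} \approx 3.2 \times 10^{3}\ \text{years}.
\end{align*}
```

By comparison, a pairwise alignment (K = 2) of sequences of the same length needs only 100^2 = 10^4 cells.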

Thus, if we want to find solutions which are close to the optimal solution, and want

to guarantee that the result is not too bad, and also want to run in reasonable time, then

one alternative is to make use of approximation algorithms. Approximation algorithms

are algorithms which run in polynomial time and guarantee that, for all possible instances of a

minimization problem, all solutions obtained cost at most $\rho$ times the optimal solution.

We can define approximation algorithms for maximization problem symmetrically.

Approximation algorithms are often associated with NP-hard problems. Unlike heuristic

algorithms, approximation algorithms have provable solution quality and provable running

time bounds.

Multiple sequence alignment problems with SP-score are MAX-SNP-hard. Here a

maximization problem is MAX-SNP-hard when, given a set of relations $R_1, R_2, \ldots, R_k$,

a relation D, and a quantifier-free formula $\phi(R_1, R_2, \ldots, R_k, D, v_1, v_2, \ldots, v_t)$, where each $v_i$ is

a variable, the following are satisfied [96]:

1) Given any instance I of the problem, there exists a polynomial-time algorithm that

can produce a set of relations $R_1^J, R_2^J, \ldots, R_k^J$, where every $R_i^J$ has the same arity as

the relation $R_i$.

2)

$$OPT(I) = \max_{D^J} \left| \{ (v_1, v_2, \ldots, v_t) \in J^t : \phi(R_1^J, R_2^J, \ldots, R_k^J, D^J, v_1, v_2, \ldots, v_t) = \mathrm{TRUE} \} \right|$$

where OPT(I) is the optimal solution for instance I, $D^J$ is a relation on J with the same

arity as D and $J^t$ is the set of t-tuples of J. The original definition and detailed discussion

can be found in [96], Chapter 10.









We define the performance ratio of an approximation algorithm H for a minimization

problem [37, 96] as a number $\rho$ such that for any instance I of the problem,

$$\frac{H(I)}{OPT(I)} \le \rho$$

where H(I) is the cost of the solution produced by algorithm H, and OPT(I) is the

cost of an optimal solution for instance I. We define an approximation scheme for a

minimization problem as an algorithm H that takes both an instance I and an error bound $\epsilon$

as input, and achieves the performance ratio

$$\frac{H(I)}{OPT(I)} \le 1 + \epsilon$$

We can actually view such an algorithm H as a set of algorithms $\{H_\epsilon : \epsilon > 0\}$, one for each

error bound $\epsilon$.

We define a polynomial time approximation scheme (PTAS) as an approximation

scheme $\{H_\epsilon\}$, where the algorithm $H_\epsilon$ runs in time polynomial in the size of the instance I,

for any fixed $\epsilon$. There are two types of problems: problems which have good approximation

algorithms, and problems which are hard to approximate. Problems with a PTAS belong to the first type,

and the best we can hope for a problem is that it has a PTAS. However, a MAX-SNP-hard

problem has little chance of having a PTAS. A more detailed discussion can be found

in [37], Chapter 4.

Since achieving an approximation ratio $1 + \epsilon$ for a MAX-SNP-hard problem

is NP-hard, where $\epsilon > 0$ is a fixed value, the approximability of a problem

actually depends on the value of $\epsilon$. For multiple sequence alignment problems, the best

approximation algorithm has a $2 - l/K$ approximation ratio for any constant l, where K

is the number of the sequences [39, 42, 97]. Later we will show this approximation ratio

is not appropriate for real applications of multiple sequence alignment and show other

reasons that approximation algorithms do not work well for multiple sequence alignment.










In this section we discuss the advantages of our algorithms over approximation

algorithms. We will answer critical questions: How can we claim that our algorithms

are superior to other algorithms that offer approximation guarantees? Why do we claim

our algorithms are more appropriate for bioinformatics applications than approximation

algorithms? In the rest of this section, first we answer the above questions, then we

present an overview of approximation algorithms.

2.5.1 Our Methods vs. Approximation Methods

In this section, we first present the concepts of "approximatable" and "non-approximatable".

We then show the reason that approximation algorithms are not appropriate for the multiple

sequence alignment problem in bioinformatics. We finally discuss the reason that our

algorithms are superior to approximation algorithms for the applications of multiple

sequence alignment.

2.5.1.1 What do "approximatable" and "non-approximatable" mean?

Even when a problem is MAX-SNP-hard, it may still have good approximation

algorithms which produce results with a guaranteed approximation ratio. In other

words, a MAX-SNP-hard problem may still be able to be approximated. We know that

a MAX-SNP-hard problem is a problem for which achieving an approximation ratio $1 + \epsilon$

is NP-hard for some fixed $\epsilon > 0$. The result is guaranteed to be close to the optimal solution

within an error factor. We consider a problem as approximatable if it has approximation

algorithms which produce solutions close to optimal solutions within a constant factor,

while the approximation ratio is acceptable for most applications. Otherwise, we consider

it as non-approximatable.

2.5.1.2 Why do approximation algorithms not work for multiple sequence alignment applications?

We will show later that the multiple sequence alignment problem belongs to the MAX-SNP-hard

problems. Then we raise a question: Is the multiple sequence alignment problem approximatable

or non-approximatable with respect to bioinformatics? There are already several










A      A-
A      -A
(a)    (b)

Figure 2-1. An example that alignments with approximation ratio of less than 2 can be
meaningless: (a) The optimal alignment. (b) An alignment with
approximation ratio of 1.5.


approximation algorithms for multiple sequence alignment [42], which can efficiently

produce alignments. However, we will provide three reasons that approximation algorithms

are not applicable to multiple sequence alignment applications in bioinformatics.

1) The score scheme supported by approximation algorithms is metric, while

currently, the most widely used score matrices are not metric. A metric cost matrix should

satisfy the following conditions [98]:

(C1) $c(x, y) > 0$ for all $x \ne y$

(C2) $c(x, x) = 0$ for all x

(C3) $c(x, y) = c(y, x)$ for all x and y

(C4) $c(x, y) \le c(x, z) + c(y, z)$ for any z

Popular score matrices used today, such as BLOSUM62, are not metric. When a

general score matrix is used in the approximation algorithm, the approximation ratio is no

longer guaranteed. Thus these approximation algorithms are of little use in practice.
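The metric conditions can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, and the three BLOSUM62 entries used (A/A = 4, R/R = 5, A/R = -1) are a small excerpt of the full matrix. Positive self-scores alone already break the condition c(x, x) = 0.

```python
from itertools import product


def metric_violations(c, alphabet):
    """Return (condition, symbols) pairs where c breaks a metric condition."""
    violations = []
    for x in alphabet:
        if c(x, x) != 0:                      # C2: zero cost on the diagonal
            violations.append(("C2", (x,)))
    for x, y in product(alphabet, repeat=2):
        if x != y and c(x, y) <= 0:           # C1: positive off the diagonal
            violations.append(("C1", (x, y)))
        if c(x, y) != c(y, x):                # C3: symmetry
            violations.append(("C3", (x, y)))
    for x, y, z in product(alphabet, repeat=3):
        if c(x, y) > c(x, z) + c(y, z):       # C4: triangle inequality
            violations.append(("C4", (x, y, z)))
    return violations


def unit_cost(x, y):
    """Simple match/mismatch cost, which is a metric."""
    return 0 if x == y else 1


# Excerpt of BLOSUM62 for alanine (A) and arginine (R).
BLOSUM62_EXCERPT = {("A", "A"): 4, ("A", "R"): -1, ("R", "R"): 5}


def blosum_excerpt(x, y):
    """Symmetric lookup into the excerpt above."""
    return BLOSUM62_EXCERPT.get((x, y), BLOSUM62_EXCERPT.get((y, x)))


print(metric_violations(unit_cost, "AR"))
print(sorted({name for name, _ in metric_violations(blosum_excerpt, "AR")}))
```

The unit cost passes every check, while the excerpt fails the positivity, zero-diagonal and triangle conditions because it stores similarity scores rather than distances.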

2) An approximation ratio around 2 is too loose to be meaningful in

bioinformatics, and thus such algorithms are almost useless in real applications of bioinformatics. So

far the best known approximation ratio for SP alignment has been improved from $2 - 2/K$

to $2 - l/K$ for any constant l, where K is the number of the sequences [39, 42, 97]. It

seems impossible to improve on the $2 - o(1)$ approximation ratio. This approximation ratio is

not acceptable and makes the problem non-approximatable in biological

science. Here we present a simple example as follows. The score scheme is translated from

the simple DNA match/mismatch score scheme:












$c(x, x) = 3$

$c(x, -) = 1$

$c(x, y) = 1$ if $x \ne y$

Then, given sequences "A" and "A", two possible alignments are shown in Figure 2-1.

We consider the alignment problem as a maximization problem. Then the first alignment

is the optimal solution, with SP score 3, and the second alignment has an SP score of 2. So

the second alignment has approximation ratio 1.5. We know that the second alignment is

a trivial alignment without any meaning in reality. Actually in this example all alignments

other than the optimal one have approximation ratio less than 2, which means that an

approximation ratio of less than 2 cannot guarantee a good alignment at all.
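The numbers in this example can be reproduced with a short script. It is a sketch under one assumption: a match scores 3 and a letter aligned with a space scores 1, values inferred from the SP scores 3 and 2 quoted above. The function names are hypothetical.

```python
from itertools import combinations


def pair_score(x, y):
    """Assumed scheme: match 3, letter/space 1, mismatch 1, space/space 0."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return 1
    return 3 if x == y else 1


def sp_score(alignment):
    """Sum the pairwise scores over every column of the alignment."""
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


optimal = ["A", "A"]        # Figure 2-1 (a)
shifted = ["A-", "-A"]      # Figure 2-1 (b)
ratio = sp_score(optimal) / sp_score(shifted)
print(sp_score(optimal), sp_score(shifted), ratio)
```

The shifted alignment pairs no letters at all, yet its ratio of 1.5 is well inside the guarantee of any algorithm with ratio close to 2.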

3) These approximation algorithms do not consider the biological meaning of the

resulting alignment, and they do not account for the impact of gaps. Here we provide

a simple example to show that we need to consider the location of the inserted gaps. In

biological applications, it is widely accepted that a mismatch can be as bad as matching with

a gap. We can design a simple score scheme as follows:

$c(x, y) = 1$ if $x \ne y$

$c(x, -) = 1$

$c(x, x) = 2$

$c(-, -) = 0$

Then, given sequences "A", "A" and "A", two possible alignments are shown in

Figure 2-2. From Figure 2-2, we see that both alignments have SP-score 6; however, the first

alignment does not actually make any sense. Thus, an approximation algorithm for

multiple sequence alignment with a guaranteed approximation ratio that introduces a lot of

gaps into the resulting alignment without considering the biological meaning of the resulting

alignment can be useless.
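The tie between the two alignments of Figure 2-2 can be verified directly from the score scheme above. The sketch below is illustrative; the function names are hypothetical, and the gap count is added only to show what the SP score fails to distinguish.

```python
from itertools import combinations

# Score scheme from the text: mismatch 1, letter/space 1, match 2, space/space 0.
def pair_score(x, y):
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return 1
    return 2 if x == y else 1


def sp_score(alignment):
    """Sum the pairwise scores over every column of the alignment."""
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


def gap_count(alignment):
    """Total number of spaces inserted into the alignment."""
    return sum(row.count("-") for row in alignment)


gapped = ["A--", "-A-", "--A"]   # Figure 2-2 (a)
compact = ["A", "A", "A"]        # Figure 2-2 (b)
for name, aln in (("gapped", gapped), ("compact", compact)):
    print(name, sp_score(aln), gap_count(aln))
```

Both alignments reach the same SP score, while one of them uses six spaces and the other none, so a method guided only by this score cannot prefer the sensible alignment.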

2.5.1.3 Why do our algorithms work?

Heuristic algorithms can adjust parameter settings, such as the weights of sequences

and score matrix, during processing, and build more biologically meaningful alignments,










A-- A
-A- A
--A A
(a) (b)

Figure 2-2. An example of different alignments with the same SP-score: (a) An alignment
with many gaps. (b) An alignment without gaps.


which is the main advantage over approximation algorithms. Other researchers have

exploited this fact before. For example, ProbCons [88] can obtain prior knowledge via

training to guide the later alignment process, and ClustalW [1, 77] can adjust the weights

of profiles during the alignment process. Our programs, QOMA [99], QOMA2 [95] and

HSA [100] are heuristic optimization algorithms by nature. They also provide adjustment

during the alignment. Also, our methods are designed not only for fixed models such as

SP-score, but can be extended to incorporate additional biological features.

2.5.2 Overview of Approximation Algorithms for Multiple Sequence Alignment

In this section, we first introduce several proven theorems related to approximation algorithms

for multiple sequence alignment. We then present brief proofs of the NP-completeness and

MAX-SNP-hardness of multiple sequence alignment with SP score.

2.5.2.1 Hardness Results

SP alignment was proved to be NP-hard [27] when a particular pairwise cost scheme

is used. The cost scheme used in the proof is not a metric since it does not satisfy the

triangle inequality. Later SP alignment was proved to be NP-hard even when the alphabet

size is 2 and the pairwise cost scheme is a metric. Thus, SP alignment problem is unlikely

to be solved in polynomial time [101].

Theorem 1 [101] SP Alignment is NP-hard when the alphabet size is 2 and the cost

scheme is metric.










Theorem 2 [102] SP Alignment is NP-hard when spaces are only allowed to be inserted

at both ends of the sequences, using a pairwise cost scheme where a match costs 0 and a

mismatch costs 1.

Theorem 3 [103] Tree alignment is NP-hard even when the given phylogenetic tree is a

binary tree.

Theorem 4 [104] Consensus alignment is NP-hard when the alphabet size is 4 using the

cost scheme where a match costs 0 and a mismatch costs 1.

Theorem 5 [27, 103] Consensus alignment is MAX-SNP-hard when the pairwise cost

scheme is arbitrary.

2.5.2.2 NP-completeness and MAX-SNP-hardness of multiple sequence
alignment

In this section, we first show the NP-completeness of multiple sequence alignment

with SP-score. Then we show the MAX-SNP-hardness of multiple sequence alignment.

Theorem 6 [27] Multiple sequence alignment with SP-score is NP-complete.

Proof: The original proof was given in [27]. The basic idea is to show that the multiple

sequence alignment problem is equivalent to the shortest common supersequence problem,

which is a known NP-complete problem even if $|\Sigma| = 2$ [105].

Theorem 7 [106] There exists a score matrix B, such that multiple sequence alignment

problem for B is MAX-SNP-hard, when spaces are only allowed to be inserted at both ends of

the sequences.

Proof: The original proof was given in [106] and used L-reductions. Here we can simplify

the proof and use gap-preserving reduction [96]. We prove the theorem by showing that

there are gap-preserving reductions from maximization problem of gap-0-1 multiple

sequence alignment with SP-score to maximization problem of MAX-CUT(Z) problem

of size k. It was proved that SIMPLE MAX-CUT(Z) is a MAX-SNP-complete problem

for some positive integer Z. In fact, Z = 3 works [107]. Then we show that an optimal

gap-0-1 multiple sequence alignment with SP-score problem exactly defines the optimal










solution of SIMPLE MAX-CUT(Z) problem of size k, and vice versa. Then we conclude

gap-0-1 multiple sequence alignment with SP-score problems are MAX-SNP-hard. Since

this restricted gap-0-1 version of multiple sequence alignment is MAX-SNP-hard, the

general case of multiple sequence alignment is also MAX-SNP-hard. That ends our proof.









CHAPTER 3
OPTIMIZATION OF SP SCORE FOR MULTIPLE SEQUENCE ALIGNMENT IN
GIVEN TIME

In this chapter, we consider the problem of multiple alignment of protein sequences

with the goal of achieving a large SP (Sum-of-Pairs) score. We introduce a new graph-based

method. We name our method QOMA (Quasi-Optimal Multiple Alignment). QOMA

starts with an initial alignment. It represents this alignment using a K-partite graph. It

then improves the SP score of the initial alignment through local optimizations within

a window that moves greedily on the alignment. QOMA uses two strategies to permit

flexibility in time/accuracy trade off: (1) Adjust the sliding window size. (2) Tune from

complete K-partite graph to sparse K-partite graph for local optimization of window.

Unlike traditional tools, QOMA can be independent of the order of sequences. It also

provides a flexible cost/accuracy trade off by adjusting local alignment size or adjusting

the sparsity of the graph it uses. The experimental results on BAliBASE benchmarks

show that QOMA produces higher SP score than the existing tools including ClustalW,

ProbCons, MUSCLE, T-Coffee and DCA. The difference is more significant for distant

proteins.

3.1 Motivation and Problem Definition

We have introduced some background of multiple sequence alignment in Chapter 2.

Progressive methods are the most popular methods for multiple sequence alignment;

however, they have an important shortcoming. The order in which the profiles are chosen

for alignment significantly affects the quality of the alignment. The optimal alignment

may be different from all possible alignments obtained by considering all possible orderings

of sequences [100]. Chapter 2 has discussed major multiple sequence alignment strategies in

detail. A method that can balance running time and alignment accuracy is seriously in

demand.

Fragment-based methods follow the strategy of assembling pairwise or multiple

local alignments. The divide-and-conquer alignment methods such as DCA [47] can be










considered in this group. However, DCA is still an order dependent method as explained



In this chapter, we consider the problem of maximizing the SP score of the alignment

of multiple protein sequences. We develop a graph-based method named QOMA

(Quasi-Optimal Multiple Alignment). QOMA starts by constructing an initial multiple

alignment. The initial alignment is independent of any sequence order. QOMA then builds

a graph corresponding to the initial alignment. It iteratively places a window on this

graph, and improves the SP score of the initial alignment by optimizing the alignment

inside the window. The location of the window is selected greedily as the one that has

a chance of improving the SP score by the largest amount. QOMA uses two strategies

to permit flexibility in time/accuracy trade off: (1) Adjust the sliding window size. (2)

Tune from complete K-partite graph to sparse K-partite graph for local optimization of

window. The experimental results show that QOMA finds alignments with better SP score

compared to existing tools including ClustalW, ProbCons, MUSCLE, T-Coffee and DCA.

The improvement is more significant for distant proteins.

3.2 Current Results

In this section, we introduce the basic QOMA algorithm for aligning K protein

sequences. QOMA works in two steps: (1) It constructs an initial alignment and the

K-partite graph corresponding to this alignment. (2) It iteratively places a window on the

sequences and replaces the window with its optimal alignment. We call this the complete

K-partite graph algorithm since a letter of a protein can be aligned with any letter of the

other proteins within the same window. Next, we describe these two steps in detail.

3.2.1 Constructing Initial Alignment

The purpose of constructing an initial alignment is to roughly identify the position of

each node in final alignment. It is important to find this initial alignment quickly in order

to minimize initialization overhead.










Figure 3-1. Constructing the initial alignment by strategy 2. Left: Pairs of sequences
are aligned. Edges are inserted between nodes which match in the alignment.
Right: Columns are constructed by aligning the nodes. Gaps are inserted
wherever necessary.


There are many ways to construct the initial alignment. We group them into two

classes: (1) Use an existing tool, such as ClustalW, to create an alignment. This strategy

has the shortcoming that the initial alignment depends on other tools, which may be

order-dependent. This makes QOMA partially order-dependent. (2) Construct alignment

from pairwise optimal alignments of sequences. In this strategy, first, sequence pairs are

optimally aligned using DP [60]. An edge is added between two nodes if the nodes are

matched in this alignment. A weight is assigned to each edge as the substitution score of

the two residues that constitute that edge. The substitution score is obtained from the

underlying scoring matrix, such as BLOSUM62 [108]. The weight of each node is defined

as the sum of the weights of the edges that have that vertex on one end. A node set is

then defined by selecting one node from the head of each sequence. The node which has

the highest weight is selected from this set. This node is aligned with the nodes adjacent

to it. Thus, the letters aligned at the end of this step constitute one column of the initial










multiple alignment. The node set is then updated as the nodes immediately after the

nodes in current set in each sequence. This process is repeated and columns are found

until all the sequences end. The alignment is obtained by concatenating all these columns.

Gaps are inserted between nodes if necessary. Unlike progressive tools, this strategy is

order-independent. An example for initial alignment construction is shown in Figure 3-1.

In this example, three protein sequences pi, p2 and p3 arT f1TSt pairwisely aligned. For

simplicity, we show each pairwise alignment as a separate graph in this figure. In reality,

one node per letter is sufficient. The nodes that match in these optimal alignments then

are linked by edges. For example, a1 and b2 match in the optimal alignment of p1 and p2,
thus they have an edge <a1, b2> in the constructed graph. The weight of this edge is
equal to the BLOSUM62 entry for the letters a1 and b2. We do not show the weights of
the edges in Figure 3-1 in order to keep the figure simple. In this figure, the node for a1 has
an edge to the nodes for b2 and c2. Therefore, the weight of the node for a1 is computed as
the sum of the weights of the edges <a1, b2> and <a1, c2>. Initially, {a1, b1, c1} is
chosen as the candidate node set. In this example, we assume that among the three nodes for
a1, b1 and c1, the node for a1 has the largest weight. Thus, we select the node for a1 as the
central node and construct column (a1, b2, c2). Then we start to construct the next column.
We update the candidate node set to {a2, b3, c3}, which are the nodes that immediately
follow the nodes for a1, b2 and c2 in the sequences. Assuming that the node for a2 has the largest
weight among the nodes for a2, b3 and c3, we select the node for a2 as the central node and
construct column (a2, b4, c4) correspondingly. When we concatenate columns to make the final
alignment, gap nodes are inserted if necessary. In this example, when we concatenate
columns (a1, b2, c2) and (a2, b4, c4), two gap nodes are inserted in sequence p1, one before
the node for a1 and one after the node for a1. Thus we construct columns (-, b1, c1) and
(-, b3, c3).
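The node weighting and the choice of the central node described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the (sequence, position) node encoding, and the edge scores are assumptions made for the example, and the scores are made up rather than taken from BLOSUM62.

```python
from collections import defaultdict

def node_weights(edges):
    """Sum the weights of the edges incident to each node.

    edges: iterable of (node_u, node_v, weight) triples. A node is a
    (sequence_id, position) pair; weight is the substitution score of
    the two residues matched in an optimal pairwise alignment.
    """
    weights = defaultdict(int)
    for u, v, w in edges:
        weights[u] += w
        weights[v] += w
    return dict(weights)

def pick_central(candidates, weights):
    """Return the candidate node with the largest weight (0 if it has no edges)."""
    return max(candidates, key=lambda node: weights.get(node, 0))

# Hypothetical edges from three pairwise alignments (scores are invented).
edges = [
    (("p1", 1), ("p2", 2), 5),  # a1 -- b2
    (("p1", 1), ("p3", 2), 4),  # a1 -- c2
    (("p2", 1), ("p3", 1), 3),  # b1 -- c1
]
weights = node_weights(edges)
# Candidate set: the head node of each sequence, i.e. {a1, b1, c1}.
central = pick_central([("p1", 1), ("p2", 1), ("p3", 1)], weights)
```

With these invented scores the node for a1 has weight 9 and is selected as the central node, which mirrors the assumption made in the example of Figure 3-1.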

The time complexity of both of these strategies is O(K^2 N^2), where K is the number of sequences and N is the sequence length, since pairwise
comparisons dominate the running time. However, the latter approach is faster. This is










because it runs dynamic programming only once for each sequence pair. On the other

hand, the former one performs two sets of pairwise alignments: one to find a guide tree

and another to align sequences progressively according to the guide tree.

3.2.2 Improving the SP Score via Local Optimizations

After constructing the initial alignment, the nodes are placed roughly in their correct

positions (or in a close by position) in the alignment. Next, the alignment is iteratively

improved. At each iteration, a short window is placed on the existing alignment. The

subsequences contained in this window are then replaced by their optimal alignment

(Figure 3-2). A generalized version of the DP algorithm [60] is used to find the optimal

alignment. This is feasible since the cost of aligning a window is much less than that of

the entire sequences.
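Since every step of this procedure is driven by the SP score, a small sketch of how the SP score of an alignment (or of a window of it) can be computed may help. The scoring below is an assumption for illustration only: a match/mismatch score stands in for a substitution matrix such as BLOSUM62, each residue-gap pair costs a fixed penalty, and gap-gap pairs score zero.

```python
from itertools import combinations

GAP = "-"

def pair_score(x, y, match=1, mismatch=-1, gap=-4):
    """Score one pair of symbols from the same column (illustrative scoring)."""
    if x == GAP and y == GAP:
        return 0            # two gaps contribute nothing
    if x == GAP or y == GAP:
        return gap          # residue aligned with a gap
    return match if x == y else mismatch

def sp_score(rows):
    """Sum-of-pairs score of an alignment given as equal-length strings."""
    assert len({len(r) for r in rows}) == 1, "rows must have equal length"
    return sum(
        pair_score(x, y)
        for column in zip(*rows)
        for x, y in combinations(column, 2)
    )

alignment = ["AC-G", "ACTG", "A-TG"]
score = sp_score(alignment)   # columns contribute 3, -7, -7 and 3
```

Because the SP score is a sum over columns, the score of a window can be computed by passing only the columns inside the window.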

This algorithm requires solving two problems. First, where should the windows

be placed? Second, when should the iterations stop? One obvious solution is to slide a

window from left to right (or right to left) shifting by some predefined amount a at each

iteration. In this case, the iterations end once the window reaches the right end (or
the left end) of the alignment (see Figure 3-2). This solution, however, has two problems.
First, it is not clear in which direction the window should be slid. Second, a window is
optimized even if it is already a good alignment. We propose another solution. We
compute an upper bound to the improvement of the SP score for every possible window
position as follows. Let X_i denote the upper bound to the SP score for the window
starting at position i in the alignment. This number can be computed as the sum of the
scores of all the pairwise optimal alignments of the subsequences in this window. Let Y_i
denote the current SP score of that window. The upper bound to the improvement is computed as X_i − Y_i. We
propose to greedily select the window that has the largest upper bound at each iteration.

In order to ensure that this solution does not optimize more windows than the first one

(i.e., sliding windows), we do not select a window position that is within a/2 positions

to a previously optimized window. The iterations stop when all the remaining windows












[Figure 3-2: the alignment is split into A_prefix, A_W and A_suffix; A_W is replaced by A*_W, the optimal alignment in the window.]

Figure 3-2. QOMA finds the optimal alignment inside the window, replaces the window with
the optimal alignment, and then moves the window by a positions.


have an upper bound of zero or they are within a/2 positions of a previously optimized

window. In our experiments, the two solutions produced roughly the same SP score; the
second solution was slightly better. The second solution, however, converged to the final
result much faster than the first one (results not shown).
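The greedy window selection can be sketched as below. This is a simplified illustration under stated assumptions: the upper bounds are given as a precomputed list and are not refreshed after a window is optimized, and `min_dist` plays the role of the a/2 exclusion distance. The function name and the example values are hypothetical.

```python
def greedy_windows(upper_bounds, min_dist):
    """Pick window start positions greedily by decreasing upper bound.

    upper_bounds[i] bounds the SP improvement for the window starting at i.
    A position is skipped when it lies within min_dist positions of an
    already selected window. Selection stops once the best remaining
    window has an upper bound of zero.
    """
    chosen = []
    order = sorted(range(len(upper_bounds)), key=lambda i: -upper_bounds[i])
    for i in order:
        if upper_bounds[i] <= 0:
            break  # nothing left that can improve the SP score
        if all(abs(i - j) > min_dist for j in chosen):
            chosen.append(i)
    return chosen

# Invented upper bounds for six window positions.
selected = greedy_windows([5, 9, 8, 0, 3, 7], min_dist=1)
```

In this example position 1 is taken first, positions 0 and 2 are then excluded because they are too close to it, and position 5 is taken next.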

The time complexity of the algorithm is O(2^K W^K K^2 (N − W + 1)). This is because there
are at most N − W + 1 positions for the window, and a dynamic programming solution is computed for
each such window. The cost of each dynamic programming solution is O(2^K W^K K^2). This
algorithm is much faster than the optimal dynamic programming when W is much smaller
than N. The space complexity is O(W^K + KN). This is because dynamic programming
for a window requires O(W^K) space, and only one window is maintained at a time. Also,
O(KN) space is needed to store the sequences and the alignment. Note that the edges
of the complete K-partite graph are not stored at this step as we already know that the

graph is complete.

3.2.3 QOMA and Optimality

In this section, we analyze the QOMA approach. Let P_1, P_2, ..., P_K be the protein
sequences to be aligned. Let A* be an optimal alignment of P_1, P_2, ..., P_K. Let S* denote
the SP score of A*. Let A be an alignment of P_1, P_2, ..., P_K. Let SP(A) be the SP score
of A. We define the error induced by A as error(A) = S* − SP(A). This expression,









however, cannot be computed in practice since finding S* is NP-complete. Instead, we compute the
error of A as e(A) = S − SP(A), where S is an upper bound to S*. Here, S is computed
as the sum of the scores of all optimal pairwise alignments of P_1, P_2, ..., P_K. We conclude
that e(A) ≥ error(A). Let QOMA(A, W) be the alignment obtained by QOMA starting
from initial alignment A by sliding a window of size W. We define the percentage of
improvement provided by QOMA over A using a window size of W as

improve(A, W) = (1 − e(QOMA(A, W)) / e(A)) × 100        (3-1)
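As a quick arithmetic illustration of Formula 3-1, the sketch below evaluates the improvement from three numbers: the upper bound S, the SP score of the initial alignment, and the SP score after QOMA. The function name is hypothetical, and returning 0 when the initial error is already zero is an assumption made to avoid a division by zero. The sample values are the averaged V1-R1-medium entries for strategy s1 in Table 3-1, used here only to show the arithmetic; an improvement computed from averaged scores need not equal the average of per-benchmark improvements.

```python
def improve(upper_bound, sp_initial, sp_final):
    """Percentage of improvement following Formula 3-1.

    e(A) = S - SP(A) is the error of the initial alignment and
    e(QOMA(A, W)) = S - SP(QOMA(A, W)) is the error after optimization.
    """
    e_initial = upper_bound - sp_initial
    e_final = upper_bound - sp_final
    if e_initial == 0:
        return 0.0  # assumption: nothing left to improve
    return (1 - e_final / e_initial) * 100

# S = 2880, initial SP = 1982, SP at W = 16 is 2442 (Table 3-1, s1, medium).
value = improve(2880, 1982, 2442)   # errors shrink from 898 to 438
```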

Our first lemma shows that QOMA always results in an alignment at least as good as
the initial alignment (The proof is shown in the appendix).

Lemma 1. improve(A, W) ≥ 0, ∀A, W.

Proof: For a given position of the window, let A_prefix, A_W and A_suffix denote the
alignment to the left of the window, inside the window, and to the right of the window,
respectively (see Figure 3-2). Let A*_W be the optimal alignment obtained by QOMA for
the window and A' be the alignment obtained by replacing A_W with A*_W in A. We have
SP(A_W) ≤ SP(A*_W). Thus, SP(A) = SP(A_prefix) + SP(A_W) + SP(A_suffix) ≤ SP(A_prefix) +
SP(A*_W) + SP(A_suffix) = SP(A'). Then, we get e(A) = S − SP(A) ≥ S − SP(A') = e(A').
Finally, we have e(QOMA(A, W)) / e(A) ≤ 1. We conclude improve(A, W) ≥ 0. □
Corollary 1 follows from Lemma 1.

Corollary 1. SP(A*) = SP(QOMA(A*, W)), ∀W.

Corollary 1 implies that QOMA alters an initial alignment A only if A is not optimal.
The next lemma discusses the impact of window size on QOMA.

Lemma 2. SP(QOMA(A, W)) ≤ SP(QOMA(A, 2W)).

Proof: For a given position of a window of length 2W, let A_2W denote the alignment
inside the window. Let A_W1 and A_W2 denote the first and second half of window A_2W,
and let SP*(X) denote the optimal SP score of the subsequences in X. Then
SP*(A_W1) + SP*(A_W2) ≤ SP*(A_2W). This is because SP*(A_2W) is the optimal SP score
for the entire window, and concatenating the optimal alignments of the two halves is one of the candidate alignments of the entire window. Therefore, SP(QOMA(A, W)) ≤ SP(QOMA(A, 2W)). □











Figure 3-3. Sparse K-partite graph for two sequences for d = 0 and d = 1.

Figure 3-4. An example of using K-partite graph: (a) A sparse K-partite graph for three
sequences from a window of size 4. (b) The induced subgraph for cell [3, 4, 4]
for the K-partite graph in (a).


Lemma 2 indicates that as W increases, the SP score of the resulting alignment does not decrease.
When W becomes greater than the length of A, the sliding window contains the entire
sequences. In this case, SP(QOMA(A, W)) = S*. The following corollary states this.

Corollary 2. As W increases, SP(QOMA(A, W)) converges to S*.

3.2.4 Improved Algorithm: Sparse Graph

QOMA converges to the optimal alignment as the window size (W) grows. However, this
happens at the expense of exponential time complexity. In Section 3.2.2 we computed
the time complexity of QOMA using the complete K-partite graph as O(2^K W^K K^2 (N − W + 1))
for proteins P_1, P_2, ..., P_K. In this section, we reduce the time complexity of QOMA by









sacrificing accuracy through the use of a sparse K-partite graph. The goal is to enable QOMA
to run within a given limited time budget when using a larger window size.

The factor 2^K in the complexity is incurred because each cell of the dynamic
programming (DP) matrix is computed by considering 2^K − 1 conditions (i.e., 2^K − 1
neighboring cells). This is because there are 2^K − 1 possible nonempty subsets of K
residues. Each subset here corresponds to a set of residues that align together, and thus
to a neighboring cell. We propose to reduce this complexity by reducing the number of
residues that can be aligned together. We do this by keeping only the edges between node
pairs with a high possibility of matching.

The strategy for choosing the promising edges is crucial for the quality of the

resulting alignment. We use the optimal pairwise alignment method as discussed in

Section 3.2.1. This strategy produces at most K − 1 edges per node since each node is
aligned with at most one node from each of the other K − 1 sequences. We also introduce a
deviation parameter d, where d is a non-negative integer. Let p[i] and q[j] be the nodes
corresponding to protein sequences p and q at positions i and j in the initial graph,
respectively. We draw an edge between p[i] and q[j] only if one of the following two
conditions holds in the optimal pairwise alignment of p and q: (1) ∃δ, |δ| ≤ d, such that
p[i] is aligned with q[j + δ]; (2) ∃δ, |δ| ≤ d, such that q[j] is aligned with p[i + δ]. In other
words, we draw an edge between two nodes if their positions differ by at most d in the
optimal alignment of p and q. For example, in Figure 3-3, p[2] aligns with q[2]. Therefore,
for d = 1, we draw an edge from p[2] to q[1] and q[3] as well as q[2] since q[1] and q[3] are within
the d-neighborhood of q[2].
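The two edge conditions can be sketched as a small function that expands the matched pairs of one optimal pairwise alignment into the sparse edge set. The function name, the 1-based position convention, and the input format (a list of matched position pairs) are assumptions chosen for the illustration.

```python
def sparse_edges(aligned_pairs, len_p, len_q, d):
    """Edges between sequences p and q for deviation parameter d.

    aligned_pairs: (i, j) pairs, 1-based, meaning p[i] is matched with q[j]
    in the optimal pairwise alignment of p and q.
    Returns the set of (i, j) pairs that receive an edge.
    """
    edges = set()
    for i, j in aligned_pairs:
        for delta in range(-d, d + 1):
            # Condition (1): p[i] is aligned with a node within d of q[j + delta].
            if 1 <= j + delta <= len_q:
                edges.add((i, j + delta))
            # Condition (2): q[j] is aligned with a node within d of p[i + delta].
            if 1 <= i + delta <= len_p:
                edges.add((i + delta, j))
    return edges

# Three residues matched position by position, as in the p[2], q[2] example.
matches = [(1, 1), (2, 2), (3, 3)]
edges_d0 = sparse_edges(matches, 3, 3, d=0)
edges_d1 = sparse_edges(matches, 3, 3, d=1)
```

For d = 0 only the matched pairs themselves become edges, while for d = 1 the node p[2] is connected to q[1], q[2] and q[3].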

The dynamic programming is modified for the sparse K-partite graph as follows: Each
cell [x_1, x_2, ..., x_K] in the K-dimensional DP matrix corresponds to nodes P_1[x_1], P_2[x_2], ...,
P_K[x_K]. Here P_i[j] stands for the node at position j in sequence i. The set contains one
node from each sequence, and each node can be either a residue or a gap. Thus, each cell defines a

subgraph induced by its node set. For example, during the alignment of the sequences that









have the K-partite graph as shown in Figure 3-4(a), the cell [3, 4, 4] corresponds to nodes
P_1[3], P_2[4] and P_3[4]. Figure 3-4(b) shows the induced subgraph of cell [3, 4, 4].

The induced subgraph for each cell yields a set of connected components. The sparse
graph strategy exploits the concept of connected components to improve the running time
of DP as follows: During the computation of the value of a DP matrix cell, we allow
two nodes to align only if they belong to the same connected component of the induced
subgraph of that cell. For example, for cell [3, 4, 4], P_2[4] and P_3[4] can be aligned
together, but P_1[3] cannot be aligned with P_2[4] or P_3[4] (see Figure 3-4(b)). A connected
component with n nodes produces 2^n − 1 non-empty subsets. Thus, for a given cell, if there
are t connected components and the ith component has n_i nodes, then the cost of that
cell becomes Σ_{i=1..t} (2^(n_i) − 1). This is a significant improvement as the cost of a single cell is
2^(n_1 + n_2 + ... + n_t) − 1 using the complete K-partite graph. For example, in Figure 3-4, the cost
for cell [3, 4, 4] drops from 2^3 − 1 = 7 to (2^1 − 1) + (2^2 − 1) = 4.
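The per-cell cost can be checked with a short sketch that finds the connected components of the induced subgraph and sums 2^n − 1 over them. The node encoding as (sequence, position) pairs, the function names, and the extra edge in the example are assumptions made for illustration; only the edge between P_2[4] and P_3[4] is taken from the example of cell [3, 4, 4].

```python
def components(nodes, edges):
    """Connected components of the subgraph induced by `nodes` (iterative DFS)."""
    nodes = set(nodes)
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:      # keep only edges inside the cell
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adjacency[n] - component)
        seen |= component
        result.append(component)
    return result

def cell_cost(nodes, edges):
    """Number of residue subsets considered for one DP cell: sum of 2^n - 1."""
    return sum(2 ** len(c) - 1 for c in components(nodes, edges))

cell = [(1, 3), (2, 4), (3, 4)]            # P_1[3], P_2[4], P_3[4]
graph_edges = [((2, 4), (3, 4)),           # P_2[4] -- P_3[4]
               ((1, 2), (2, 2))]           # an edge outside the cell, ignored
sparse_cost = cell_cost(cell, graph_edges)
complete_cost = 2 ** len(cell) - 1
```

The sparse cost is 4 and the complete-graph cost is 7, matching the numbers in the text.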

The connected components of an induced subgraph can be found in O(K^2) time (i.e.,
the size of the induced subgraph) by traversing the induced subgraph once. Thus, the
total time complexity of the sparse K-partite graph approach is

O((Σ_{c=1..W^K} Σ_{i=1..t_c} (2^(n_{c,i}) − 1)) (N − W + 1) K^2),

where the outer sum runs over the W^K cells of the DP matrix of a window, t_c is the number of connected components of cell c, and n_{c,i} is the number of nodes in its ith component. The space complexity of using the sparse K-partite graph is

O(W^K + KN + N(K − 1)K(2d + 1)/2).

The first term denotes the space for the dynamic programming alignment within a
window. The second term denotes the number of letters. The last term denotes the
number of edges. The space complexity for the last two terms can be reduced by storing
only the subgraph inside the window.










Table 3-1. The average SP scores of QOMA using complete K-partite graph with
a = W/2 on BAliBASE benchmarks and upper bound score (S). (Initialization
Strategy 1, indicated by s1: Initial alignments are obtained from ClustalW.
Initialization Strategy 2, indicated by s2: Initial alignments are obtained from
optimal pairwise alignments as discussed in Section 3.2.1.)

Dataset        S     Strategy  Initial  W=2    W=4   W=8   W=16
V1-R1-low      5635  s1        -839     -780   -637  -401  -243
                     s2        -797     -5863  -429  -273  -182
V1-R1-medium   2880  s1        1982     2037   2181  2347  2442
                     s2        2041     2192   2338  2446  2508
V1-R1-high     5324  s1        4883     4933   5008  5071  5092
                     s2        4867     4965   5057  5110  5122


3.2.5 Experimental Evaluation

Experimental setup: We used BAliBASE benchmarks [5] reference 1 from version
1 (www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/) and references 1, 2, 8 from
version 3 (www-bio3d-igbmc.u-strasbg.fr/BAliBASE/) for evaluation of our method.
We use V1 and V3 to denote BAliBASE versions 1 and 3 respectively. We use R1 to
R8 to denote reference 1 to 8. For example, we use V3-R4 to represent the reference
4 dataset from version 3. We split the V1-R1 dataset into three datasets (V1-R1-low,
V1-R1-medium, and V1-R1-high) according to the similarity of the sequences in the
benchmarks as denoted in BAliBASE (low, medium and high similarities). Similarly,
V3-R1 is split into two datasets V3-R1-low and V3-R1-high containing low and high
similarity benchmarks. The number of sequences in the benchmarks in version 3 was
usually too large for QOMA and DCA. Therefore, we created 1,000 benchmarks from each
reference by randomly selecting five sequences from the existing benchmarks. Thus, each
of the benchmarks from version 3 contains five sequences.

We evaluated the SP score and the running time in our experiments. We do not

report the BAliBASE scores since the purpose of QOMA is to maximize the SP score.

We implemented the complete and the sparse K-partite QOMA algorithms as
discussed in the chapter, using standard C. We used BLOSUM62 as a measure of










similarity between amino acids. We used gap open = gap extend = -4 to penalize gaps.

We used a = W/2 in our experiments since we achieved the best quality per time with
this value. We also downloaded ClustalW, ProbCons, MUSCLE, T-coffee and DCA for
comparison. We did not compare QOMA with our work HSA [100] since HSA needs
secondary structure information of proteins for alignment. To ensure a fair comparison, we
ran ClustalW, MUSCLE, T-coffee, DCA and QOMA using the same parameters (gap open
= gap extend = -4, similarity matrix = BLOSUM62). This was not possible for ProbCons.
We also ran all the competing methods using their default parameters. We present the
results using the same parameters in our experiments unless otherwise stated.

We ran all our experiments on an Intel Pentium 4 with 2.6 GHz speed and 512 MB of
memory. The operating system was Windows 2000.

Quality evaluation: We first evaluate the quality of QOMA. Table 3-1 shows the
average SP score of QOMA using two strategies for constructing the initial alignment and
four values of W. Strategy 1 obtains the initial alignments from ClustalW. Strategy 2
obtains the initial alignments from the algorithm provided in Section 3.2.1. The table also
shows the upper bound for the SP score, S, and the SP score of ClustalW for comparison.
QOMA achieves a higher SP score than ClustalW on average for all window sizes
and for all data sets. The SP score of QOMA consistently increases as W increases. These
results are justified by Lemmas 1 and 2. The SP score of Strategy 2 is higher than
that of Strategy 1 for almost all cases of low and medium similarity. Both strategies are

almost identical for highly similar sequences. There is a loose correlation between the

initial SP score and the final SP score of QOMA. Higher initial SP scores usually imply

higher SP scores of the end result. There are however exceptions especially for highly

similar sequences. In the rest of the experiments, we use Strategy 2 to construct the initial

alignments by default.

Table 3-2 shows the SP scores of five existing tools and QOMA on all the datasets

when the competing tools are run using the same parameters as QOMA and using their










default parameters. QOMA has higher SP scores than all the tools compared for all the
datasets. DCA always has the second best scores since it also targets maximizing the
SP score of alignments. The difference between the SP scores of QOMA and the other
tools is more significant for low and medium similarity sequences. This is an important
achievement because the alignment of such sequences is usually harder than that of highly
similar sequences.

Table 3-3 shows the average percentage of improvement of QOMA over alignments of
ClustalW using the improvement formula as given in Section 3.2.3. The data set is V1-R1.
As window size increases, the increase in improvement percentage reduces. This indicates
that QOMA converges to the optimal score at reasonable window sizes. In other words,
using a window size larger than 16 will not improve the SP score significantly.

Table 3-4 shows the average and the standard deviation of the error incurred for each
window due to using the sparse K-partite graph for QOMA. The error decreases as d
increases. For W = 8, when d increases from 0 to 1, the error reduces by 0.334 (i.e., 4.893
− 4.559). When d increases from 1 to 2, the error decreases by 0.198. This implies that
the average improvement in the SP score degrades quickly for d > 1. Similar observations
can be made for W = 16. Thus, we conclude that the SP score improves only slightly for d > 1.

Figure 3-5 shows the average SP scores of resulting alignments using sparse K-partite

graph for different values of d and using complete K-partite graph on the V1-R1 dataset.

The complete K-partite graph algorithm produces the best SP scores. However, the SP

scores of results from the sparse K-partite graph algorithm are very close to that of the

complete K-partite graph algorithm. The quality of the sparse K-partite graph algorithm

improves significantly when d increases from 0 to 1. The improvement is less when d

increases from 1 to 2. This implies when d becomes larger, it has less impact on the

quality of alignment.










Performance evaluation: Our second experiment set evaluates the running time
of QOMA. Table 3-5 lists the running time of QOMA for the complete and the sparse
K-partite graph algorithms for varying values of W. Experimental results show that
QOMA runs faster for small W. The sparse K-partite graph algorithm is faster than the
complete K-partite graph algorithm for all values of d for large W. The running time of
QOMA increases as d increases. The results in this table agree with the time complexity
we computed in Sections 3.2.2 and 3.2.4. Referring to Tables 3-1, 3-2 and 3-3, we conclude
that when the window size is small, QOMA runs fast and has high quality results. As the window size
increases, its performance drops but alignment quality improves further.

Another parameter for the quality/time trade-off is d. Figure 3-5 shows that the SP score
difference between the complete and the sparse K-partite graph algorithms is small. Thus,
it is better to increase the window size and use the sparse K-partite graph strategy to obtain
high scoring results quickly. As we have observed in Tables 3-1 and 3-5 and Figure 3-5, the
best balance between quality and running time appears at d = 1 using the sparse K-partite

graph strategy.
































[Figure 3-5: plot of SP score (vertical axis, 2250 to 2600) versus window size (horizontal axis, 2 to 16).]

Figure 3-5. The SP scores of QOMA alignments using complete K-partite graph and
sparse K-partite graphs for different values of d and W on the V1-R1 dataset.
The initial alignments are obtained from strategy 2.































Table 3-3. The improvement (see Formula 3-1 in Section 3.2.3) of QOMA (using
complete K-partite graph) over ClustalW on the V1-R1 dataset. The dataset is
split into three subsets (short, medium, and long) according to the length of the
sequences.

          Window Size
Length    2      4      8      16
Short     18.0   29.2   40.2   46.7
Medium    23.3   39.6   51.6   58.63
Long      18.6   39.4   51.5   54.1

Table 3-4. The average (μ) and standard deviation (σ) of the error, S* − SP, for a window
using the sparse version of QOMA on the V1-R1 dataset. Results are shown for
window sizes W = 8 and 16, and deviation d = 0, 1, and 2. The ε value denotes
the 95% confidence interval, i.e., 95% of the expected improvement values are
in the [μ − ε, μ + ε] interval.

Error using sparse K-partite graph
d=0 d=1 d=2

0.200
0.4:3:3


1.127
2.159


:3.820
5.909


0.358
0.751


2.089
:3.249


1.591
2.4:34






















CHAPTER 4
OPTIMIZING THE ALIGNMENT OF MANY SEQUENCES

In this chapter, we consider the problem of aligning multiple protein sequences with

the goal of maximizing the SP (Sum-of-Pairs) score, when the number of sequences is

large. The QOMA (Quasi-Optimal Multiple Alignment) algorithm addressed this problem

when the number of sequences is small. However, as the number of sequences increases,

QOMA becomes impractical. This chapter develops a new algorithm, QOMA2, which
optimizes the SP score of the alignment of an arbitrarily large number of sequences. Given
an initial (potentially sub-optimal) alignment, QOMA2 selects short subsequences from

this alignment by placing a window on it. It quickly estimates the amount of improvement

that can be obtained by optimizing the alignment of the subsequences in short windows

on this alignment. This estimate is called the SW (Sum of Weights) score. It employs

a dynamic programming algorithm that selects the set of window positions with the

largest total expected improvement. It partitions the subsequences within each window

into clusters such that the number of subsequences in each cluster is small enough to be

optimally aligned within a given time. Also, it aims to select these clusters so that the

optimal alignment of the subsequences in these clusters produces the highest expected SP

score. The experimental results show that QOMA2 produces high SP scores quickly even

for large number of sequences. They also show that the SW score and the resulting SP

score are highly correlated. This implies that it is promising to aim for optimizing the SW

score since it is much cheaper than aligning multiple sequences optimally.

4.1 Motivation and Problem Definition

Progressive methods progressively align pairs of profiles in a certain order and

produce a new profile until a single profile is left. A profile is either a sequence or the

alignment of a set of sequences. Figure 4-1(a) illustrates this. Here, sequences a and b

are optimally aligned. Then, c and d are optimally aligned. Their resulting alignments

are aligned next. Progressive methods, however, have an important shortcoming. The









Table 4-1. The list of variables used in this chapter
Variable   Meaning
K          Total number of sequences to be aligned.
W          Window size.
T          Maximum number of sequences of length W that can be optimally aligned.
P_i        Sequence or profile.
f_i        Subsequence of P_i that lies in a given window.
v_i        Vertex corresponding to f_i.
e_{i,j}    Weight of the edge between v_i and v_j.
N          Length of a sequence or a profile.
M          Number of windows that are optimized.


order in which the profiles are chosen for alignment affects the quality of the alignment
significantly. The optimal alignment may be different from all possible alignments
obtained by considering all possible orderings of sequences [100].

Table 4-1 defines the variables frequently used in the rest of this chapter.

In Chapter 3, we have introduced QOMA [99], which eliminated the drawbacks of the

progressive methods. QOMA partitioned an initial alignment into short subsequences by

placing a window. It then optimally realigned the subsequences in each window. This is

shown in Figure 4-1(b). Optimally aligning each window costs O(W^K 2^K), significantly
less than O(N^K 2^K) for W ≪ N. However, when the number of sequences K is large, even this becomes
costly. The value of W needs to be reduced significantly to make QOMA practical. For

example, assume that QOMA works for W = 32 when K = 6. When K becomes 18, W

should be reduced to two in order to run at roughly the same time. This, however, reduces

the SP score of the alignments found by QOMA since each window contains extremely

short subsequences.

This chapter addresses the problem of aligning multiple protein sequences with the

goal of achieving a large SP score when the number of sequences is large. We develop

an algorithm, QOMA2, which works well even when the number of sequences is large.

Figure 4-1(c) illustrates the QOMA2 algorithm. It takes K sequences and an initial

(potentially sub-optimal) alignment of them as input. QOMA2 selects short subsequences










from these sequences by placing a window on their initial alignment. Each window

position defines K subsequences, and each subsequence has at most W letters. It quickly
estimates the amount of improvement that can be obtained by optimizing the alignment
of the subsequences in each window. This estimate is called the SW (Sum of Weights)
score. It uses a dynamic programming algorithm to select the set of window positions with
the largest total expected improvement. It then recursively forms clusters of at most T
subsequences and optimally aligns each cluster. The clusters are created by iteratively
partitioning the subsequences into clusters and updating the SW score according to
these clusters. Thus, different windows can result in different partitionings of subsequences
into clusters (see Figure 4-1(c)). This is desirable since the optimal clustering of the
subsequences may differ for different window positions. The value of T is determined by
the allowed time budget for QOMA2, since the alignment of the subsequences in clusters
governs the overall running time. As T increases both the alignment score and the

running time increase. The experimental results show that QOMA2 achieves high SP

scores quickly even for large number of sequences. They also show that the SW score

and the resulting SP score are highly correlated. This implies that it is promising to aim

for optimizing the SW score since it is much cheaper than aligning multiple sequences

optimally.

Graph Partitioning. METIS [109, 110] is a popular tool for partitioning unstructured

graphs, partitioning meshes, and computing fill-reduced ordering of sparse matrices. The

algorithms implemented in METIS are based on the multilevel recursive-bisection,

multilevel k-way, and multi-constraint partitioning schemes. It can provide high quality

partitions fast.

4.2 Current Results

Let A be an alignment of K sequences P_1, P_2, ..., P_K. Let W > 1 be an integer that
denotes the window length. Assume that we are allowed to place a window on A in M
different locations and optimize the alignment of the subsequences in these M locations.

















Figure 4-1. Alignment strategies at a high level: (a) progressive alignment, (b) the
QOMA algorithm, (c) the QOMA2 algorithm. The solid lines denote sequences
a, b, .... Dashed polygons denote the subsequences whose alignments are
optimized. In (c), within a window the subsequences from one group of sequences
(for example a, b and c) are optimally aligned, the subsequences from the remaining
sequences are optimally aligned, and then their results are aligned. Another window
can group the subsequences differently (for example c, d and e together).

The first problem that needs to be addressed is the identification of the M locations that
maximize the overall improvement. Figures 4-1(b) and 4-1(c) show two examples in which
three and two positions are selected, respectively. It is important to mention that the
number of windows M is governed by the total time allowed for improving the alignment.










A simple way to select the positions to place the window is to slide a window from

the left to the right (or from the right to the left), shifting by some predefined amount

a at each iteration. Another simple solution is to select the window positions randomly.

Clearly, neither of these solutions distinguishes promising window positions from
unpromising ones. We proposed a greedy solution in our QOMA paper. This algorithm
greedily selects the most promising window position from the unselected positions until
M positions are selected. We discuss how we quantify how promising a window is later in
this section. This greedy strategy, however, is not guaranteed to find the best set of M
window positions. Here, we develop a dynamic programming algorithm that is guaranteed to
find the M optimal window positions.

For each window position, we compute an upper bound to the improvement of the

SP score that could be achieved by replacing that window with its optimal alignment as

follows. Let X_i denote the upper bound to the optimal SP score for the subsequences in
the window starting at position i of the alignment. This number can be computed as the
sum of the scores of all pairwise optimal alignments of the subsequences in this window.
Let Y_i denote the current SP score of that window. The upper bound to the improvement
of the SP score is computed as U_i = X_i − Y_i. We say that a window position i is promising
if U_i is large.
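The bound U_i = X_i − Y_i can be sketched for a single window as follows. This is an illustration under assumptions rather than the thesis code: a match/mismatch score stands in for BLOSUM62, gaps have a fixed per-symbol cost, gap-gap pairs score zero, and the optimal pairwise scores are obtained with a standard global alignment recurrence. All function names are hypothetical.

```python
from itertools import combinations

MATCH, MISMATCH, GAP_COST = 1, -1, -4   # illustrative scoring, not BLOSUM62

def pairwise_optimal(s, t):
    """Optimal global alignment score of s and t with a per-symbol gap cost."""
    prev = [j * GAP_COST for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * GAP_COST]
        for j in range(1, len(t) + 1):
            sub = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            cur.append(max(prev[j - 1] + sub,      # align s[i] with t[j]
                           prev[j] + GAP_COST,     # gap in t
                           cur[j - 1] + GAP_COST)) # gap in s
        prev = cur
    return prev[-1]

def current_sp(rows):
    """Current SP score (Y_i) of the aligned rows of a window."""
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue
            total += GAP_COST if "-" in (x, y) else (MATCH if x == y else MISMATCH)
    return total

def improvement_bound(rows):
    """U_i = X_i - Y_i for the window whose aligned rows are given."""
    subsequences = [r.replace("-", "") for r in rows]
    x = sum(pairwise_optimal(s, t) for s, t in combinations(subsequences, 2))
    return x - current_sp(rows)

window = ["AC-G", "ACTG", "A-TG"]
bound = improvement_bound(window)   # X = -1, Y = -8
```

The bound is never negative, because the pairwise alignment induced by the current window is one of the candidates considered by each optimal pairwise alignment.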

We propose to select the M window positions π_1, π_2, ..., π_M (π_i < π_{i+1}) whose
sum of upper bounds (i.e., Σ_i U_{π_i}) is the largest. Note that, if two windows overlap
greatly, their combined improvement over the initial alignment can be much less than
their individual improvements. This is because they improve almost the same regions,
and thus, they are highly dependent. The sum of their upper bounds includes the upper
bound for their common region twice. In order to prevent this, we also enforce a minimum
distance between the positions of different windows as ∀i, π_{i+1} − π_i ≥ τ. Thus, if a window
is positioned at π_i, no other window can be placed at a position strictly inside the (π_i − τ, π_i + τ)
interval.









The value of τ determines how independent the windows are. As τ increases, windows
become more independent. For τ ≥ W, the windows are completely non-overlapping. On
the other hand, large values of τ limit the number of possible window positions. We use
τ = W/4 as it provided a good balance in our experiments.

We develop a dynamic programming solution to determine the optimal window

positions. Let SU(a, b) denote the largest possible sum of upper bounds of b window

positions selected from the first a possible window positions. We would like to determine

$SU(N - W + 1, M)$ to solve our problem, where N is the length of the alignment. Clearly,

$SU(a, 1) = \max_{1 \le i \le a} \{U_i\}$. This is because if a single window is selected it should be the one

with the largest upper bound. For b > 1, there are two possibilities: 1) If $a < b\tau$, SU(a, b)

= 0. This is because, from the Dirichlet (pigeonhole) principle, it is impossible to select b window positions

that satisfy the minimum distance $\tau$ in this case. 2) If $a \ge b\tau$, we compute SU(a, b)

recursively as

$$SU(a, b) = \max \begin{cases} SU(a - \tau, b - 1) + U_a, & \text{if } U_a \text{ is selected} \\ SU(a - 1, b), & \text{otherwise} \end{cases}$$

In this computation, the first condition implies that the bth window starts at position a.

Thus, the first b − 1 windows should be selected in the interval $[1, a - \tau]$ to ensure that they

respect the minimum distance $\tau$ from the bth window. The second condition implies that

the window at position a is not a part of the solution. Therefore, the b window positions

should be selected in the interval $[1, a - 1]$. The value of $SU(N - W + 1, M)$ is the optimal

sum of upper bounds. The window positions that lead to this optimal solution can be

found by tracking back the values of SU after the dynamic programming computation

completes.
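
The recurrence and the trace-back step can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name `select_windows` is hypothetical, positions are 1-based, selected positions are assumed to be at least `tau` apart (as the term SU(a − τ, b − 1) implies), and infeasible states are marked with −∞ rather than 0 so that they can never be chosen.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def select_windows(U: List[float], M: int, tau: int) -> Tuple[float, List[int]]:
    """Return (best sum, 1-based positions) for M windows at least tau apart."""
    if M < 1 or tau < 1:
        raise ValueError("M and tau must be positive")
    P = len(U)  # number of candidate positions (N - W + 1 in the text)
    # SU[a][b]: best sum of b upper bounds chosen among positions 1..a.
    SU = [[NEG_INF] * (M + 1) for _ in range(P + 1)]
    for a in range(1, P + 1):
        for b in range(1, M + 1):
            if b == 1:
                prev = 0.0                  # no earlier window to respect
            elif a - tau >= 1:
                prev = SU[a - tau][b - 1]   # first b-1 windows in [1, a - tau]
            else:
                prev = NEG_INF              # not enough room (assumption: -inf)
            take = prev + U[a - 1]          # the b-th window starts at a
            skip = SU[a - 1][b]             # position a is not used
            SU[a][b] = max(take, skip)
    if SU[P][M] == NEG_INF:
        raise ValueError("cannot place M windows with this spacing")
    # Trace back the choices that produced SU[P][M].
    positions = []
    a, b = P, M
    while b > 0:
        if SU[a - 1][b] == SU[a][b]:
            a -= 1                          # position a was skipped
        else:
            positions.append(a)             # position a was selected
            a -= tau
            b -= 1
    return SU[P][M], positions[::-1]
```

For example, `select_windows([5, 9, 6, 1], M=2, tau=2)` returns `(11.0, [1, 3])`, whereas a greedy choice that starts with the largest bound (position 2) can only reach 10.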

Figure 4-2 shows the average SP score of the improved alignment for the first eleven

window positions when the windows are selected using our dynamic programming method,

greedily, and by sliding a window. For the window sliding strategy, we shift the window by

W/2 at each iteration. The results are obtained by averaging the results of 82 BAliBASE

benchmarks. We use W = 8 and K = T = 4 (i.e., each window of length eight is optimally

aligned). The figure shows that the proposed selection strategy improves the SP score

much faster than the sliding and the greedy strategies.

Figure 4-2. Comparison of the SP score found by different strategies of selection of window
positions: using the proposed optimal selection, the greedy selection and the
sliding window. [Plot omitted: x-axis, number of window positions (M), 0 to 12; y-axis ticks from 780 to 830.]

4.3 Aligning a Window

The goal of aligning a window is to maximize the SP score of the subsequences within

each window. We propose a divide-and-conquer strategy, which clusters the set of K

subsequences into smaller sets of T subsequences so that the subsequences in each subset

can be optimally aligned. This method has two major differences from the progressive










methods. First, progressive methods align two sequences (or profiles) at a time. Thus T =

2 for the progressive methods, whereas QOMA2 can use larger T values since it focuses

on a short window. Second, once the clusters are determined, progressive methods align

the entire sequences based on that clustering. However, QOMA2 can find different ways of

clustering the data for different window positions (see Figure 4-1(c) as an example). This

is desirable because different regions in sequences may evolve at different conservation rates.

For example, regions that serve important functions show much less variation than the

remaining regions. Therefore, the best clustering for one region of the sequences may not

be good for another region. QOMA2 addresses this by treating each region independently.

We first construct an initial weighted complete graph by considering each subsequence

in the window as a vertex. We then align the subsequences using two nested loops. The

details of the two steps are discussed next.

4.3.1 Constructing Initial Graph

Given a window on the alignment, we first construct a weighted, undirected, complete

graph G = (V, E). This graph models how much the SP score can be improved by

realigning the subsequences in this window carefully. Let $f_i$ denote the subsequence of the

sequence $P_i$ that remains in the window, $\forall i,\ 1 \le i \le K$.

Each $f_i$ maps to a vertex $v_i \in V$ in this graph. We compute the weight of the edge

$e_{i,j} \in E$ between vertices $v_i$ and $v_j$ as

$$e_{i,j} = Score_{optimal}(f_i, f_j) - Score_{induced}(f_i, f_j) \qquad (4\text{-}1)$$

$Score_{optimal}(f_i, f_j)$ computes the score of the optimal alignment of $f_i$ and $f_j$. $Score_{induced}(f_i, f_j)$

denotes the score of the alignment of $f_i$ and $f_j$ induced from the current alignment. In

other words, $e_{i,j}$ is an upper bound to the improvement of the SP score due to $f_i$ and $f_j$

after realigning the window.
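
As an illustration of equation (4-1), the following Python sketch computes one edge weight from two rows of the current window. It rests on assumptions that are not taken from this work: a toy substitution function `toy_sub` stands in for BLOSUM62, gaps are written as `-` and scored with a linear cost, and the function names are hypothetical.

```python
from typing import Callable


def toy_sub(x: str, y: str) -> int:
    """Toy substitution score (assumption; the experiments use BLOSUM62)."""
    return 1 if x == y else -1


def nw_score(a: str, b: str, sub: Callable[[str, str], int], gap: int) -> int:
    """Score of an optimal global pairwise alignment with linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = max(prev[j - 1] + sub(a[i - 1], b[j - 1]),  # letter-letter
                         prev[j] + gap,                          # gap in b
                         cur[j - 1] + gap)                       # gap in a
        prev = cur
    return prev[len(b)]


def induced_score(row_a: str, row_b: str,
                  sub: Callable[[str, str], int], gap: int) -> int:
    """Score of the pairwise alignment induced by two rows of the window."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue                      # gap-gap columns are dropped
        score += gap if "-" in (x, y) else sub(x, y)
    return score


def edge_weight(row_a: str, row_b: str,
                sub: Callable[[str, str], int], gap: int) -> int:
    """e_ij = Score_optimal(f_i, f_j) - Score_induced(f_i, f_j)."""
    fa, fb = row_a.replace("-", ""), row_b.replace("-", "")
    return nw_score(fa, fb, sub, gap) - induced_score(row_a, row_b, sub, gap)
```

With the rows `AC-G` and `A-CG`, the induced alignment scores 0 and the optimal alignment of `ACG` with `ACG` scores 3 under these toy parameters, so the edge weight is 3.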










Definition 1. Let G = (V, E) be the graph constructed for a set of subsequences in

a window. We define the sum of the weights of all the edges in E as the SW (Sum of

Weights) score of G.

The SW score is an upper bound to how much the SP score of the subsequences in

the underlying window can improve by aligning those subsequences optimally when the

edge weights are computed as given in equation (4-1).

The vertex induced subgraph of any subset $V' \subseteq V$ defines a complete subgraph
G' = (V', E'). The SW score of G' is an upper bound to the amount of improvement that

can be obtained by optimally aligning only the subsequences that map to the vertices in

V'. In the following sections we will exploit the SW score to find a good clustering of the

subsequences in a given window.

4.3.2 Clustering

The clustering algorithm partitions the set of subsequences $\{f_1, f_2, \ldots, f_K\}$ into

non-overlapping subsets of size at most T. The eventual goal is that optimally aligning

each subset followed by aligning the results of these alignments improves the SP score as

much as possible. Recall that each subset cannot have more than T subsequences since

we cannot optimally align more than T subsequences within the allowed time.

We first need to understand how many clusters need to be created. The number

of subsequences in each partition should be as large as possible. This is because more

subsequences are optimally aligned with each other when the clusters are large. This

indicates that there must be $\lceil K/T \rceil$ clusters.

Next, we need to understand the right criteria to partition the set of subsequences. A

number of strategies can be developed to address this question. We discuss two solutions

with the help of the complete weighted graph G constructed for the subsequences. Notice

that partitioning the set of subsequences into clusters of subsequences is equivalent to

partitioning the graph G into vertex induced subgraphs of the vertices corresponding to

the subsequences in each cluster.









Min-Cut clustering. The first strategy aims to optimize the intra-cluster SP score. That

is, it maximizes the improvement in the SP score by optimally aligning the subsequences

within each cluster. At a high level, this is done by partitioning G into $\lceil K/T \rceil$ subgraphs

such that the sum of the SW scores of these subgraphs is as large as possible. This is
equivalent to the Min $\lceil K/T \rceil$-Cut problem with the additional restriction that each subgraph

has at most T vertices. In other words, it translates into the problem of finding the set of

edges in G such that

* their removal partitions G into $\lceil K/T \rceil$ complete subgraphs of size at most T, and

* the sum of their weights is as small as possible.

Finding the Min $\lceil K/T \rceil$-Cut of a graph is an NP-complete problem. A number of heuristic

algorithms have been developed to address this issue. One of the most commonly used

tools for partitioning graphs is METIS [109, 110]. METIS partitions an input graph to a

given number of subgraphs with the aim of minimizing or maximizing the total weight of
the edges between different subgraphs. We use METIS to partition G into $\lceil K/T \rceil$ subgraphs

with minimal $\lceil K/T \rceil$-cut.
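
The equivalence between maximizing the intra-cluster SW scores and minimizing the cut follows because the two quantities always add up to the SW score of the whole graph, which is fixed. The short Python sketch below checks this identity on a toy graph; the symmetric weight matrix and the function names are hypothetical and only serve the illustration.

```python
from itertools import combinations
from typing import Iterable, List, Sequence, Set


def sw_score(weights: Sequence[Sequence[float]], vertices: Iterable[int]) -> float:
    """Sum of the weights of all edges of the subgraph induced by `vertices`."""
    return sum(weights[u][v] for u, v in combinations(sorted(vertices), 2))


def cut_weight(weights: Sequence[Sequence[float]], clusters: List[Set[int]]) -> float:
    """Total weight of the edges whose two ends lie in different clusters."""
    total = sw_score(weights, range(len(weights)))
    intra = sum(sw_score(weights, c) for c in clusters)
    return total - intra


# Toy symmetric weights for K = 3 subsequences (illustrative values only).
toy_weights = [[0, 3, 1],
               [3, 0, 2],
               [1, 2, 0]]
```

Here the SW score of the whole graph is 6. The clustering `[{0, 1}, {2}]` keeps an intra-cluster SW score of 3 and cuts edges of total weight 3, so any clustering with a smaller cut necessarily has a larger intra-cluster SW score.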

Although METIS tries to partition the graph into roughly the same sized subgraphs,

it does not guarantee that they will be perfectly balanced in size. As a result, some of

the clusters determined by METIS can have more than T vertices. This is undesirable

since the subsequences in each cluster are optimally aligned in the following step. Recall

that the cost of optimally aligning a cluster is exponential in the size of that cluster. The

maximum size of a cluster, T, is determined by the total amount of time allowed to spend

to optimize the alignment. Thus, METIS clusters need to be post-processed to guarantee
that the sizes of the clusters do not exceed T.

Next, we describe how we propose to adjust the size of the METIS clusters for the

first strategy (i.e., optimizing the intra-cluster SP score). It is trivial to adapt this

algorithm to the second strategy.










Given a set of subgraphs (i.e., clusters) identified by METIS, we create three sets.

The first one is the set of subgraphs with T vertices, named EK (Equal to T). The second

one is the set of subgraphs with more than T vertices, named GK (Greater than T).

The last one is the set of clusters with less than T vertices, named LK (Less than T).

We adjust the size of the clusters by moving vertices from clusters in GK to clusters in

LK. Out of all such moves, the algorithm greedily picks the one which causes the smallest cut since

the goal is to minimize the total weight of the inter-cluster edges. After each move, the
number of vertices in one of the clusters in GK decreases by one. Similarly, the number

of vertices in one of the clusters in LK increases by one. Thus, the clusters in GK and

LK move to EK. The iterations stop when GK is empty. This algorithm is guaranteed to

converge to a solution in $\sum_{G' \in GK} (|G'| - T)$ iterations of the while loop, where |G'| denotes

the number of vertices in G'. This is because the number of vertices in a $G' \in GK$ reduces

by one at each iteration.

Max-Cut clustering. The second strategy aims to optimize the inter-cluster SP score. It

achieves this by maximizing the total weight of the edges in the $\lceil K/T \rceil$-cut of G. Similar to

the first strategy, we use METIS to identify such a cut.

The proposed algorithm for post-processing the clusters found by METIS can be

adapted to the second strategy as follows. At each iteration of the while loop, the vertex

move that maximizes the cut is chosen instead of the one that minimizes it. This can be

done by modifying Steps 1 and 2.c of the algorithm.

It is worth mentioning that the METIS algorithm for clustering the sequences is a

module in QOMA2. It can be replaced in the future by any clustering algorithm that finds a better Min
$\lceil K/T \rceil$-Cut or Max $\lceil K/T \rceil$-Cut.

4.3.3 Refining Clusters Iteratively

The Min-Cut and the Max-Cut clustering strategies aim to minimize or maximize

the cut (see Section 4.3.2). One drawback of these strategies is that each edge weight is

computed by only considering the two subsequences corresponding to the two ends of that










edge (see Section 4.3.1). This is problematic, because the amount of improvement in the

SP score by optimally aligning a cluster of subsequences depends on all the subsequences

in that cluster. Considering two subsequences at a time greatly overestimates the

improvement. We propose to improve the clusters iteratively. Each iteration updates

the edge weights by considering all the subsequences in each cluster. We discuss how the

edge weights are updated later in this section. Once the edge weights are updated, the algorithm

reclusters the subsequences using the new weights. The iterations stop when the SW score

of the graph G does not increase between two consecutive iterations or a certain number of

iterations has been performed.

We would like to estimate how much the two subsequences, fi and fj, contribute to

the SP score under the restriction that each cluster is optimally aligned. The obvious

solution is to optimally align each cluster and measure the new alignment score. This,

however, is not practical for two reasons. First, optimally aligning a cluster of T

subsequences is a costly operation. Performing this operation will make each iteration

of the cluster refinement as costly as QOMA2. Furthermore, this will only update the

weight of the edges whose two ends belong to the same subgraph (i.e., intra-cluster edges).

The weights of the edges between different subgraphs (i.e., inter-cluster edges) still need

to be computed. Thus, a good estimator should be efficient and work for both inter- and

intra-cluster edges.

We propose to estimate the edge weights by focusing on the gaps. At a high level,

we assume the best scenario (i.e., smallest possible number of gaps) for intra-cluster

edges. This is because of the restriction that the subsequences in each cluster are

optimally aligned. We then estimate the improvement in the SP score between every

pair of subsequences by considering these gaps. We describe our estimator in detail next.

Let $L_i$ be the length of subsequence $f_i$. After the complete weighted graph G is

partitioned into $\lceil K/T \rceil$ complete subgraphs, assume that $v_i$ belongs to the subgraph G'.

Recall that $v_i$ is the vertex that denotes $f_i$. The optimal alignment of all the subsequences









in the same cluster as $f_i$ requires insertion of at least

$$g_i = \max_{v_j \in G'} \{L_j\} - L_i$$

letters into $f_i$. This is because the alignment of all the subsequences in a cluster cannot

be shorter than the longest subsequence in that cluster. Each such insertion corresponds

to a gap in the alignment. Thus, gi denotes the minimum number of gaps imposed on fi

due to clustering of the subsequences.

Next, we compute the expected number of indels (insertions or deletions) in the

alignment of subsequences $f_i$ and $f_j$. An indel is an alignment of a letter with a gap.

The alignment of two letters or of two gaps is not considered an indel. Considering all

possible arrangements of the letters and gaps in $f_i$ and $f_j$, the expected ratio of letter-letter

alignments between $f_i$ and $f_j$ in their alignments is

$$\frac{L_i L_j}{(L_i + g_i)(L_j + g_j)} \qquad (4\text{-}2)$$

Similarly, the expected ratio of gap-gap alignments is

$$\frac{g_i g_j}{(L_i + g_i)(L_j + g_j)} \qquad (4\text{-}3)$$

Thus, the expected ratio of indels can be computed by subtracting equations (4-2)

and (4-3) from one. The total length of the induced alignment of fi and fj is at most

$\max\{L_i + g_i, L_j + g_j\}$. Therefore, the expected number of indels in the induced alignment

of $f_i$ and $f_j$, denoted by $Gap_{expected}(f_i, f_j)$, is at most

$$\left(1 - \frac{L_i L_j + g_i g_j}{(L_i + g_i)(L_j + g_j)}\right) \max\{L_i + g_i, L_j + g_j\} \qquad (4\text{-}4)$$

Let $Gap_{induced}(f_i, f_j)$ denote the number of indels in the induced alignment of $f_i$ and $f_j$.

Let gapCost denote the cost of a single indel. We compute the new weight of the edge










between vertices $v_i$ and $v_j$ as

$$e_{i,j} = Score_{optimal}(f_i, f_j) - Score_{induced}(f_i, f_j) - gapCost \times \left(Gap_{expected}(f_i, f_j) - Gap_{induced}(f_i, f_j)\right)$$

This computation differs from the one in Section 4.3.1 since it considers the change in the

gap cost as imposed by the clusters that fi and fj belong to.
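
The estimator can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in this work: the function names are hypothetical, subsequences are assumed to be non-empty, and the sign convention of the updated weight follows the reconstruction of the equation above.

```python
from typing import Dict, Iterable


def min_gaps(lengths: Dict[int, int], cluster: Iterable[int]) -> Dict[int, int]:
    """g_i = (longest subsequence in the cluster) - L_i for each member i."""
    members = list(cluster)
    longest = max(lengths[v] for v in members)
    return {v: longest - lengths[v] for v in members}


def expected_indels(Li: int, gi: int, Lj: int, gj: int) -> float:
    """Upper bound on the expected number of indels, as in equation (4-4)."""
    denom = (Li + gi) * (Lj + gj)          # assumes non-empty subsequences
    letter_letter = Li * Lj / denom        # equation (4-2)
    gap_gap = gi * gj / denom              # equation (4-3)
    indel_ratio = 1.0 - letter_letter - gap_gap
    return indel_ratio * max(Li + gi, Lj + gj)


def updated_weight(score_optimal: float, score_induced: float, gap_cost: float,
                   gap_expected: float, gap_induced: float) -> float:
    """Edge weight after accounting for the gaps imposed by the clustering."""
    return score_optimal - score_induced - gap_cost * (gap_expected - gap_induced)
```

For two subsequences of lengths 3 and 5 in the same cluster, $g_i = 2$ and $g_j = 0$. The letter-letter ratio is 15/25, the gap-gap ratio is 0, and the bound on the expected number of indels is 0.4 × 5 = 2.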

Once the weights of the edges are updated, the current partitioning may not be

a good one anymore. Therefore, we iteratively run the clustering algorithm again and

update the edge weights similarly until the SW score of the complete graph built for the

current window does not increase any further or a given maximum number of iterations

is reached.

The Pseudo-code of the Adjustment in Section 4.3.2

While GK ≠ ∅

    1. min = ∞;

    2. For all G' ∈ GK and G'' ∈ LK

        For all u ∈ G'

            (a) uG' = Sum of weights of all the edges from u to all the vertices in G';

            (b) uG'' = Sum of weights of all the edges from u to all the vertices in G'';

            (c) If uG' − uG'' < min then

                Record (u, G', G'') as the current best move;

                Update min as min = uG' − uG'';

    3. Move the vertex u from G' to G'' according to the best move;

       If G' contains T vertices then

           Move G' from GK to EK;

       If G'' contains T vertices then

           Move G'' from LK to EK;

End While
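
The pseudo-code can be turned into a short runnable Python sketch for the Min-Cut variant. Several details are assumptions made for the illustration: the name `rebalance` is hypothetical, clusters are sets of vertex indices, the weights are given as a symmetric matrix, and GK and LK are recomputed from the cluster sizes in each iteration instead of being stored as separate sets.

```python
from typing import List, Sequence, Set


def rebalance(clusters: List[Set[int]], w: Sequence[Sequence[float]],
              T: int) -> List[Set[int]]:
    """Greedily move vertices out of oversized clusters until none exceeds T."""
    clusters = [set(c) for c in clusters]          # work on a copy
    while any(len(c) > T for c in clusters):       # While GK is not empty
        best = None                                # (delta, u, source, target)
        for gi, g in enumerate(clusters):
            if len(g) <= T:
                continue                           # g is not in GK
            for li, l in enumerate(clusters):
                if len(l) >= T:
                    continue                       # l is not in LK
                for u in g:
                    u_src = sum(w[u][v] for v in g if v != u)  # newly cut edges
                    u_dst = sum(w[u][v] for v in l)            # newly joined edges
                    delta = u_src - u_dst          # change in the cut weight
                    if best is None or delta < best[0]:
                        best = (delta, u, gi, li)
        if best is None:
            raise ValueError("no cluster has room; at least ceil(K/T) clusters are needed")
        _, u, gi, li = best
        clusters[gi].remove(u)                     # Step 3: apply the best move
        clusters[li].add(u)
    return clusters
```

For the Max-Cut variant, the comparison `delta < best[0]` would be replaced by `delta > best[0]`, which mirrors the change to Steps 1 and 2.c described in Section 4.3.2.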









4.3.4 Aligning the Subsequences in Clusters

The clustering algorithm guarantees that each cluster has at most T subsequences.

However, the total number of clusters may be greater than T. This happens when

$K > T^2$. In that case, finding the optimal alignment of the profiles of clusters becomes

infeasible. Although this brings us back to the same problem we are tackling in this paper,
it is easier since we have $\lceil K/T \rceil$ profiles, which is significantly less than K. We recursively

apply the QOMA2 algorithm (Sections 4.3.1 to 4.3.3) to these profiles until all the

subsequences are aligned.

4.3.5 Complexity of QOMA2

The time complexity of QOMA2 is

$$O\left(M \log_T K \left(\frac{K W^T 2^T}{T} + cK^2\right)\right)$$

where c is the upper bound for the number of inner loop iterations. In practice c < 10.

We derive the time complexity as follows: For each window, we need to apply the

clustering algorithm and align the clusters using two nested loops. The outer loop iterates

$\lceil \log_T K \rceil$ times.

At each iteration the set of subsequences inside the window is partitioned into clusters

and the edge weights are updated. Thus, each iteration of the inner loop costs O(|E|)

time. Since G contains K vertices, $O(|E|) = O(K^2)$. At the end of each iteration of the

inner loop all the clusters are optimally aligned. Optimally aligning T subsequences costs
$O(W^T 2^T)$ time. At the ith iteration of the outer loop, $O(K/T^i)$ such optimal alignments are

done. Adding these steps, we find that the total cost of the ith iteration of the outer loop is

$$O\left(\frac{K W^T 2^T}{T^i} + cK^2\right)$$









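The number of outer loop iterations is $\log_T K$. Thus, the total cost of aligning a
window is

$$\sum_{i=1}^{\log_T K} O\left(\frac{K W^T 2^T}{T^i} + cK^2\right) = O\left(K W^T 2^T \sum_{i=1}^{\log_T K} \frac{1}{T^i} + (\log_T K)\, cK^2\right) = O\left((\log_T K)\left(\frac{K W^T 2^T}{T} + cK^2\right)\right)$$

where the last step uses $1/T^i \le 1/T$ for every $i \ge 1$.

Since we have M window positions to align in total, the total cost of QOMA2 is

$$O\left(M \log_T K \left(\frac{K W^T 2^T}{T} + cK^2\right)\right)$$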


4.4 Experimental Evaluation

Experimental setup: We used BAliBASE benchmarks [5] reference 1 from version

1 (www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/) and references 1, 2, 8 from
version 3 (www-bio3d-igbmc.u-strasbg.fr/BAliBASE/) for evaluation of our method.

We call this dataset D3 since it contains benchmarks with three or more sequences. We

call the subset of D3 that contains all the benchmarks with at least 10 sequences D10.

Similarly, we call the subset of D3 that contains all the benchmarks with at least 20

sequences D20. D3, D10, and D20 contain 440, 209, and 84 benchmarks respectively.

We implemented the QOMA2 algorithm using standard C. We downloaded

ProbCons [88], T-Coffee [2], MUSCLE [78], and ClustalW [1, 77] for comparison. We

also downloaded DCA [47] since it aims to maximize the SP score as well. However, DCA

did not run for the benchmarks in our datasets D10 and D20 since it cannot align a large

number of sequences. We used BLOSUM62 as a measure of similarity between amino

acids, since BLOSUM62 is commonly used. Using other popular score matrices, such as

BLOSUM90 or PAM250 will produce similar results. We used gap cost = -4 to penalize

each indel. In order to be fair, we used the same parameters (i.e., BLOSUM62 and gap










cost) for QOMA2, T-Coffee, MUSCLE, and ClustalW. We used the default parameters for

ProbCons because it was impossible to change those parameters for ProbCons.

Among the competing tools used in our experiments, MUSCLE aims to maximize the

SP score, while ClustalW and T-Coffee aim to maximize a weighted version of the SP score.

Therefore, one can argue that it is not fair to include ClustalW, T-Coffee and ProbCons in

our experiments. We, however, include them since most of the existing tools that aim to

maximize the SP score, such as DCA or MSA, do not work for a large number of sequences.

We improve the fairness of our experiments by using the same parameters for all the tools.

First, we compared different clustering algorithms and showed the relationship

between the SP and the SW scores on each window. We then evaluated the impact of the

window and the cluster size on the SP score of the QOMA2 alignment and the running

time of QOMA2. We also compared the SP scores of QOMA2 with four competing

multiple sequence alignment tools. We ran our experiments on a system with dual 2.59

GHz AMD Opteron Processors, 8 gigabytes of RAM, and a Linux operating system.

Dataset Details

The distribution of the number of benchmarks with different number of sequences (K)

is shown in Figure 4-3.

Correlation between the SP and the SW scores: The main hypothesis that QOMA2

depends on is that optimizing the SW score optimizes the SP score. Thus it aims to

optimize the SW score by finding an appropriate clustering of the sequences. For a given

window, the SW score is computed in $O(K^2)$ time as it requires estimating the gap cost

for each pair of subsequences. The SP score, on the other hand, requires aligning the

subsequences. Therefore, it costs

$$O\left(M \log_T K \left(\frac{K W^T 2^T}{T} + cK^2\right)\right)$$

time. This makes QOMA2 desirable since the SW score can be measured efficiently

without actually finding the alignment of multiple sequences. In this experiment, we

evaluate the relationship between the SW and the SP scores. We also measure how each

of the proposed clustering strategies performs. We place a window (W = 16) on all

possible locations of an initial alignment. We find the clusters using the Min-Cut and the

Max-Cut clustering algorithms (see Section 4.3.2). We also find clusters using the iterative

refinement (see Section 4.3.3) on the results of Min-Cut and Max-Cut. We measure the

average SP and SW scores obtained by these algorithms for T = 2, 3, and 4. We use the D20

dataset in this experiment.

Figure 4-3. The distribution of the number of benchmarks with different number of
sequences (K). [Plot omitted: x-axis, number of sequences, 1 to 30; y-axis ticks from 10 to 100.]

Table 4-2 presents the results. Results show that there is a strong correlation between

the SP and the SW scores. For each value of T, the SP score gets larger when the SW

score gets larger. This implies that optimizing the SW score can potentially optimize

the SP score. This is an important observation since the cost of computing the SW score

is negligible as compared to that of the SP score. Note that the SW scores obtained

with different number of clusters are not comparable to each other since they compute

the gap cost under different cluster size assumptions. The results also demonstrate that

the iterative refinement helps in improving the SW and the SP score of both of the

Max-Cut and the Min-Cut algorithms. The Max-Cut algorithm with iterative refinement

always has the best SP and SW scores. This implies that if the induced alignment of

two subsequences has a high score as compared to that of their optimal alignment, it is

advantageous to keep them in the same cluster (i.e., force them to be almost optimally

aligned).

Table 4-2. The average SW and SP scores of individual windows after applying different
clustering algorithms for different values of T, with W = 16. The average SP
score of the initial alignment in the window is 351. The average upper bound to
the SP score for the subsequences in the windows is 1113. Benchmarks are
selected from the D20 dataset.
     Min-Cut       Min-Cut Iterative   Max-Cut       Max-Cut Iterative
T    SP     SW     SP     SW           SP     SW     SP     SW
2    -19    1285   157    1315         284    1482   481    1544
3    133    974    197    1031         490    1207   494    1268
4    200    823    266    908          485    1005   499    1104

The SP scores of all the methods increase as the value of T increases. This is intuitive

since more subsequences are optimally aligned at once for large values of T.

Another important observation that follows from these results is that optimally

aligning clusters does not always improve the SP score of a window. It can actually

reduce it. This happens especially for the Min-Cut clustering (with or without iterative

refinement) for all values of T as well as the Max-Cut clustering for T = 2. This is because

when the clusters of subsequences are aligned, they impose a certain alignment for the

subsequences in each cluster. These restrictions limit the number of possibilities in which a

set of clusters can be aligned together. This indicates that the clusters should be selected

carefully.










Table 4-3. The average SP scores of QOMA2 for individual windows. "SP before" and
"Upper bound" denote the average initial SP scores and the average upper
bounds to the SP scores for individual windows respectively. Benchmarks are
selected from the D10 dataset.
W    SP before   Upper bound   T=2    T=3    T=4    T=5
4    -186        -67           -171   -158   -152   -147
8    -212        100           -175   -140   -124   -111
12   -264        247           -203   -147   -120   -100
16   -342        358           -257   -183   -148   -117


In the rest of the experiments, we select the Max-Cut clustering algorithm with

iterative refinement as the default clustering algorithm of QOMA2.

Impact of W and T on the SP score. The QOMA2 algorithm hypothesizes that the

SP score can be optimized by increasing the value of W and T. In this experiment, we

evaluate the impact of these parameters on the SP score of QOMA2.

Table 4-3 shows the SP score of individual windows aligned by QOMA2 for different

values of W and T. The results show that the SP scores increase when T increases for all

values of W.

Table 4-4 shows the SP scores of alignments of the entire benchmarks in D10 using

QOMA2 for varying values of W and T. As W and T increase, QOMA2 produces higher

scores. The two extreme parameter choices of using a very large value for one of these

parameters and a very small value for the other, i.e., W = 16, T = 2 or W = 4, T = 5,

produce lower SP scores as compared to the intermediate solutions such as W = 12,

T = 3. This is an important observation since it validates that QOMA2 is superior to the

two existing extreme solutions (see Figure 4-1).

Impact of W and T on the running time. Table 4-4 shows the average running time of

QOMA2 for optimizing a single window for varying values of W and T. The experimental

results show that QOMA2 runs very efficiently even for a large number of sequences. As we

have mentioned in Section 4.3.5, the time complexity of QOMA2 is

$$O\left((\log_T K)\left(\frac{K W^T 2^T}{T} + cK^2\right)\right)$$

for a single window. The experimental results suggest that when W is large, the factor

$O\left((\log_T K)\frac{K W^T 2^T}{T}\right)$ quickly dominates the running time.

From Tables 4-3 and 4-4, we conclude that a good point for balancing time and quality is

at (W = 12, T = 4).

Table 4-4. The average SP scores of the alignments of the entire benchmarks in D10 using
QOMA2. The average SP score of the initial alignments is -12295. The average of
the upper bound to the SP scores of the benchmarks is 17648. The average
running times are also shown in parentheses, in seconds.
W    T=2             T=3             T=4             T=5
4    -7119 (1.173)   -6770 (0.653)   -6676 (0.403)   -6498 (0.465)
8    -6197 (1.213)   -5348 (0.673)   -4762 (1.053)   -4236 (5.050)
12   -5914 (1.116)   -4659 (0.808)   -3966 (3.619)   -3464 (13.485)
16   -5690 (1.097)   -4327 (1.102)   -3555 (8.856)   -2811 (40.132)

Comparison to existing tools. Table 4-5 presents the SP scores of the alignments of

the benchmarks in D10 using four existing tools and QOMA2. Note that the compared

tools do not aim to maximize the SP score. ClustalW, MUSCLE, and T-Coffee optimize

a variation of the SP score by computing weights for sequences or subsequences. We still

included this experiment because the existing tools that optimize the SP score, such as

DCA [47], MSA [61] and COSA [111], do not work for a large number of sequences. For

a small number of sequences, QOMA performs significantly better than DCA (see [99]). We

divided the queries into four subsets according to the number of sequences they contain.

The table shows that QOMA2 has a higher SP score than all the tools compared. ClustalW

is always the second best. The remaining tools are not competitive in terms of the SP

score.


Table 4-5. The average SP scores of QOMA2 (W = 12 and T = 4) and four other tools
on the D10 dataset. The competing tools (except ProbCons) are run with the
same parameters as QOMA2.
K ProbCons T-Coffee MUSCLE ClustalW QOMA2
10-14 -16921 -16713 -24492 -12586 -12318
15-19 -14454 -29751 -31851 -9426 -9088
20-24 -5958 -12006 -28866 -778 -710
25-29 -24033 -29305 -50576 -- NORc -8989









CHAPTER 5
IMPROVING BIOLOGICAL RELEVANCE OF MULTIPLE SEQUENCE ALIGNMENT

In this chapter, we introduce a new graph-based multiple sequence alignment method

for protein sequences. We name our method HSA (Horizontal Sequence Alignment) because

it horizontally slides a window on the protein sequences simultaneously. HSA considers

all the proteins at once. It obtains the final alignment by concatenating cliques of a graph. In

order to find a biologically relevant alignment, HSA takes secondary structure information

as well as amino acid sequences into account. The experimental results show that HSA

achieves higher accuracy compared to existing tools on BAliBASE benchmarks. The

improvement is more significant for proteins with low similarity.

5.1 Motivation and Problem Definition

Most heuristic multiple sequence alignment algorithms are based on progressive

application of pairwise alignment. They build up alignments of larger numbers of

sequences by adding sequences one by one to an existing alignment [31]. We call this a

vertical alignment since it progressively adds a new sequence (i.e., row) to a consensus

alignment. These methods have the shortcoming that the order in which sequences are added

to the existing alignment significantly affects the quality of the resulting alignment. This

problem is more apparent when the percentage of identities among amino acids falls

below 25%, called the twilight zone [88]. The accuracies of most progressive sequence

alignment methods drop considerably for such proteins.

We consider the problem of alignment of multiple proteins. We develop a graph-based

solution to this problem. We name this algorithm HSA (Horizontal Sequence Alignment)

as it horizontally aligns sequences. Here, horizontal alignment means that all proteins

are aligned simultaneously, one column at a time. HSA first constructs a directed graph.

In this graph, each amino acid of the input sequences maps to a vertex. An edge is drawn

between pairs of vertices that may be aligned together. The graph is then adjusted by









inserting gap vertices. Later, this graph is traversed to find high scoring cliques. Final

alignment is obtained by concatenating these cliques.

5.2 Current Results

We provide a heuristic solution for multiple sequence alignment for proteins. We

name this algorithm HSA (Horizontal Sequence Alignment) as it horizontally aligns

sequences. Here, horizontal alignment means that all proteins are aligned simultaneously,

one column at a time. HSA first constructs a directed graph. In this graph, each amino

acid of the input sequences maps to a vertex. An edge is drawn between pairs of vertices

that may be aligned together. The graph is then adjusted by inserting gap vertices.

Later, this graph is traversed to find high scoring cliques. The final alignment is obtained

by concatenating these cliques. The underlying assumption of HSA is that residues

that have the same SSE type are more likely to be aligned than residues that

have different SSE types. This assumption is verified by a number of real experiments and

observations [112-115].

HSA works in five steps: (1) An initial directed graph is constructed by considering

residue information such as amino acid and secondary structure type. (2) The vertices

are grouped based on the types of residues. The residue vertices in each group are more

likely to be aligned together in the following step. (3) Gap vertices are inserted into the

graph in order to bring vertices in the same group close to each other in terms of topological

position in the graph. (4) A window is slid from beginning to end. The clique with the highest

score is found in each window and an initial alignment is constructed by concatenating

these cliques. (5) The final alignment is constructed by adjusting gap vertices of the initial

alignment. Next, we describe these five steps in detail.

5.2.1 Constructing Initial Graph

This step constructs the initial graph which will guide the alignment later. Let $s_1$,

$s_2$, ..., $s_k$ be the protein sequences to be aligned. Let $s_i(j)$ denote the jth amino acid of

protein $s_i$. A vertex is built for each amino acid. The vertices corresponding to different










proteins are marked with different colors. Thus, the vertices of the graph span k different

colors. If available, Secondary Structure Element (SSE) type (a~-helix P-sheet) of each

residue is also stored along with the vertex. For simplicity, SSE types include ac-helix ,

P-sheet, and no SSE information, as shown in Figure 5-1. Two types of edges are defined.

First, a directed edge is included from the vertex corresponding to as(j) to as(j + 1) for

all consecutive amino acids. Second, an undirected edge is drawn between pairs of vertices

of different colors if their substitution score is higher than a threshold. HSA gets the

substitution score from BLOSUM62 matrix. A weight is assigned to each undirected edge

as the sum of the substitution score and 'illp Score for the amino acid pair that make up

that edge. The '' up. Score is computed from the SSE types. If two residues belong to the

same SSE type, then their typeScore is high. Otherwise, it is low. We discuss this in more

detail in Section 5.2.2. This policy of weight assignment lets residues with same SSE type

or similar amino acids have higher change to be aligned in following steps. We will discuss

this in Section 5.2.4. Figure 5-1 demonstrates this step on three proteins. The amino acid

sequences and the SSEs are shown at the top of this figure. The dotted arrows represent

the undirected edges between two vertices of different color, the solid arrows only appear

between the vertices corresponding to consecutive amino acids of the same protein and

they only have one direction, from left to right.
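The graph construction described above can be sketched as follows. This is a minimal illustration, not the HSA implementation: the functions toy_substitution and toy_type_score are hypothetical stand-ins for the BLOSUM62 lookup and the typeScore, and their numeric values, the threshold, and the single-letter SSE codes ("a", "b", "x") are assumptions made only for this example.

```python
from itertools import combinations


def toy_substitution(a, b):
    """Hypothetical stand-in for a BLOSUM62 lookup: +4 if identical, -1 otherwise."""
    return 4 if a == b else -1


def toy_type_score(t1, t2):
    """Hypothetical typeScore: high when both residues share a known SSE type."""
    return 2 if t1 == t2 and t1 != "x" else 0


def build_initial_graph(seqs, sses, threshold=0,
                        sub=toy_substitution, type_score=toy_type_score):
    """Return (vertices, directed_edges, undirected_edges).

    A vertex is keyed by (color, position), where color is the index of the
    sequence. Each vertex stores its residue and its SSE code.
    """
    vertices = {(c, j): (seq[j], sse[j])
                for c, (seq, sse) in enumerate(zip(seqs, sses))
                for j in range(len(seq))}
    # Directed edges link consecutive residues of the same protein.
    directed = [((c, j), (c, j + 1))
                for c, seq in enumerate(seqs)
                for j in range(len(seq) - 1)]
    # Undirected edges link residues of different colors whose substitution
    # score exceeds the threshold; the weight adds the typeScore.
    undirected = {}
    for u, v in combinations(vertices, 2):
        if u[0] == v[0]:
            continue
        (a, ta), (b, tb) = vertices[u], vertices[v]
        s = sub(a, b)
        if s > threshold:
            undirected[(u, v)] = s + type_score(ta, tb)
    return vertices, directed, undirected
```

Passing the scoring functions as parameters keeps the sketch independent of any particular substitution matrix; a real BLOSUM62 lookup can be supplied through the sub argument.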

5.2.2 Grouping Fragments

The graph constructed at the first step shows the similarity of pairs of residues.

However, multiple alignment involves alignment of groups of amino acids rather

than pairs. In this step, we group the fragments that are more likely to be aligned

together. Here, a fragment is defined by the following four properties: 1) It is composed of

consecutive vertices. 2) All the vertices have the same color. 3) All vertices have the same

SSE type. 4) There is no other fragment that contains it. For example, in Figure 5-2, S1

consists of four fragments: f1 = LT, f2 = GKTLV, f3 = E, and f4 = IAK. Thus, S1 can be

written as S1 = f1 f2 f3 f4.
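The fragment definition can be illustrated with a short sketch. It assumes that each sequence comes with a per-residue SSE string of the same length; the codes "a", "b" and "x", and the SSE string used for S1 in the usage note, are hypothetical and chosen only so that the split reproduces the four fragments named above.

```python
from itertools import groupby


def split_into_fragments(seq, sse):
    """Split one sequence into maximal runs of consecutive residues that
    share the same SSE code. Returns a list of (fragment, sse_code)."""
    if len(seq) != len(sse):
        raise ValueError("seq and sse must have the same length")
    fragments, start = [], 0
    for sse_code, run in groupby(sse):
        length = len(list(run))
        fragments.append((seq[start:start + length], sse_code))
        start += length
    return fragments
```

For example, split_into_fragments("LTGKTLVEIAK", "xxaaaaaxbbb") yields the fragments LT, GKTLV, E and IAK. Because groupby only merges adjacent equal codes, every returned fragment is maximal, which corresponds to property 4 of the definition.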










Figure 5-1. The initial graph constructed for sequences S1 (LTGKTLVEIAK), S2 (PNKGRVVRMK),
and S3 (PSGEICIEE). Each residue maps to a vertex.

First, all the sequences are scanned to find fragments with known SSE types. The
fragments are then clustered into groups, where each group consists of one fragment from
each sequence. To group fragments, we first align the fragments using a simplified
dynamic programming algorithm that treats each fragment as a residue in the basic
algorithm [28]. The score of a fragment pair is computed from the following formula:

[formula illegible in the source]

The typeScore is computed from the SSE types. Fragments with the same SSE type
contribute a high score, whereas fragments of different SSE types do not. This is
because of our assumption that residues with the same SSE type have a higher chance to be
aligned. The typeScore is calculated as follows. We check the types of the two fragments
and return a number according to the following five conditions. 1) They are both of
type α-helix: we return 4. 2) They are both of type β-sheet: we return 2. 3) They both
have no SSE type: we return 1. 4) One is an α-helix and the other is a β-sheet: we
return a negative value [value illegible in the source]. 5) Otherwise, we return 0. The
position penalty is computed from the difference between the positions of the two
fragments. Here, the position of a fragment is its topological position in the original
sequence. If two fragments are far apart in their sequences, they receive a larger
penalty. A further penalty depends on the number of residues each fragment contains.
Fragment pairs with similar lengths are given a smaller penalty. This is because, as the
lengths of a fragment pair differ more, the number of gap vertices that need to be
inserted increases.

Figure 5-2 demonstrates how HSA groups fragments using the example of Figure 5-1.
Fragments with the same SSE type and similar positions and lengths are clustered into the
same group. Two such groups, with types α-helix and β-sheet, are circled in Figure 5-2.

Figure 5-2. The fragments with similar features such as SSE type, lengths and positions
in the original sequences are grouped together.










Figure 5-3. A gap vertex is inserted to let the fragments in the same group be close to
each other vertically.

5.2.3 Fragment Position Adjustment

Once the groups of fragments are determined, we update the graph to bring the fragments
in each group closer together, which raises the possibility that the vertices in these
fragments are aligned. We update the graph by inserting gap vertices as shown in
Figure 5-3. First, we compute the number of gap vertices to be inserted based on two
factors: 1) The number of residues in fragments. 2) The relative position of fragments in
the same group. Here, a good relative position of fragments means that the position of
fragments leads to a high scoring alignment of the vertices in these fragments. We align
the vertices in fragments of the same group to compute those positions. Then we randomly
select positions between two consecutive fragment groups. Finally, for each sequence we
insert gap vertices, as shown in Figure 5-3, at these positions to bring the fragments
within the same group together. In Figure 5-3, a gap vertex is inserted before residue I
in S3 to bring fragments in the group with β-sheet type close to each other.

5.2.4 Alignment

So far, we have prepared the graph for actual alignment by two means. (1) We

determined vertex pairs that can be a part of the alignment, (2) We brought sequences to

roughly the same size by inserting gap vertices, while keeping similar vertices vertically

close. In this step, the sequences are actually aligned by scanning the updated graph in

topological order.

As demonstrated in Figure 5-4, we start by placing a window of width W at the

beginning of each sequence. This window defines a subgraph of the graph. Typically, we

use W = 4 or 6. The example in Figure 5-4 uses W = 3. Next, we greedily choose a clique

with the best expectation score from this subgraph. We will define the expectation

score of a clique later. A clique here is defined as a complete subgraph that consists of

one vertex from each color. In other words, if K sequences are to be aligned, a clique

corresponds to the alignment of one letter from each of the K sequences. Thus, each

clique produces one column of the multiple alignment. For each clique, we align the letters

of that clique, and iteratively find the next best clique that 1) does not conflict with

this clique, and 2) has at least one letter next to a letter in this clique. This iteration

is repeated t times to find t columns. Typically, t = 4. These t cliques define a local

alignment of the input sequences. The expectation score of the original clique is defined

as the SP score of this local alignment. After finding the highest expectation score clique,

we add this clique as a column to the existing alignment. We then slide the window to the

location which is immediately after the clique found and repeat the same process until it

reaches the end of sequences. Each clique defines a column in the multiple alignment. The

columns are concatenated and gaps are inserted to align them. Figure 5-4 illustrates this
step. In the window (circled by the dotted rectangle), the highest expectation score clique
(the left column marked with a shaded background) consists of residues T, R, and I in S1,
S2 and S3 respectively. Then, the window slides to the next location toward the right of
the graph (this window is not shown in Figure 5-4), and the highest expectation score
clique in that window (the right column marked with a shaded background) consists of
residues V, V, and C in S1, S2 and S3 respectively. The two cliques found are two columns
in the resulting alignment. The resulting alignment is obtained by inserting a gap vertex
into S3.

Legend: α = alpha helix, β = beta strand, x = no SSE information.

Figure 5-4. Cliques found in the sliding window (window size = 3) are the columns of the
resulting alignment. Gaps are inserted to concatenate these columns.

As mentioned in Section 5.2.1, due to the policy of edge weight assignment, cliques

that contain vertices of the same SSE type or similar amino acids have a higher score than

other possible cliques. Since a clique contains one vertex of each color, finding the best

clique does not assure any order for traversal of vertices of different colors. Thus, unlike

existing tools, our method is order independent.
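The window scan of this section can be sketched as below. This is a simplified illustration under stated assumptions, not the HSA procedure itself: a clique is scored by the sum of its pairwise edge weights rather than by the expectation score with t-step lookahead, the scan stops when a window contains no clique, and weight is a hypothetical callable that returns an edge weight or None when two residues are not connected.

```python
from itertools import combinations, product


def best_clique_in_window(seqs, starts, W, weight):
    """Try every choice of one residue per sequence inside the window.

    Returns (score, positions) for the best complete subgraph, or None
    when no choice forms a clique. Enumerates up to W**K candidates.
    """
    ranges = [range(s, min(s + W, len(seq))) for seq, s in zip(seqs, starts)]
    best = None
    for positions in product(*ranges):
        residues = [seq[p] for seq, p in zip(seqs, positions)]
        weights = [weight(a, b) for a, b in combinations(residues, 2)]
        if any(w is None for w in weights):
            continue  # some pair is not connected, so this is not a clique
        score = sum(weights)
        if best is None or score > best[0]:
            best = (score, positions)
    return best


def greedy_columns(seqs, W, weight):
    """Slide the window left to right; each chosen clique is one column."""
    starts = [0] * len(seqs)
    columns = []
    while all(s < len(seq) for seq, s in zip(seqs, starts)):
        found = best_clique_in_window(seqs, starts, W, weight)
        if found is None:
            break
        _, positions = found
        columns.append(positions)
        # The next window begins immediately after the clique just found.
        starts = [p + 1 for p in positions]
    return columns
```

Each returned column is a tuple of residue positions, one per sequence. Positions skipped between consecutive columns are where gaps would be inserted when the columns are concatenated.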

5.2.5 Gap Adjustment

After concatenating the cliques in the previous step, short gaps may be scattered in the
sequences. In this step, the alignment obtained in the previous step is adjusted by moving
the gaps as follows. The sequences are scanned from left to right to find isolated gaps.
If a gap is inside a fragment of type α-helix or β-sheet, it is moved outside of that
fragment, either before or after it. We choose the direction that produces the higher
alignment score. If a gap is inside a fragment with no SSE type, it is moved next to the
neighboring gap only if the movement produces a higher score than the current alignment.
Figure 5-5 shows the movement of a gap vertex in S3. This gap vertex lies between two
residues inside a fragment of type α-helix. Thus, this gap vertex is moved out and
combined with the next gap vertex. The final alignment is obtained by mapping each vertex
in the final graph back to its original residue.

Figure 5-5. Gaps are moved to produce longer and fewer gaps. We favor gaps outside the
fragments of type α-helix and β-sheet.

5.2.6 Experimental Results

The experiments use the BAliBASE benchmarks [citation number illegible in the source]
(http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/). We consider only the
benchmarks that contain SSE information since our algorithm needs SSE information
of sequences. We downloaded ClustalW [1, 77], ProbCons [88], MUSCLE [78] and
T-Coffee [2] for comparison since they are the most commonly used and the most recent
tools. We ran all experiments on a computer with a 3 GHz Intel Pentium 4 processor
and 1 GB of main memory. The operating system is Windows XP.

Evaluation of alignment quality

Alignment of dissimilar proteins is usually harder than the alignment of highly

similar proteins. Tables 5-1, 5-2 and 5-3 show the BAliBASE scores of HSA, ClustalW,

ProbCons, MUSCLE and T-Coffee on benchmarks with low, medium, and high similarity

respectively. From Table 5-1, we conclude that for low similarity benchmarks, our method

outperforms all other tools. On the average HSA achieves a score of 0.619, which is better

than any other tool. HSA finds the best result for 14 out of 21 reference benchmarks. HSA

is the second best in 5 of the remaining 7 benchmarks. Table 5-2 shows that for sequences

with 20-40% identity, HSA is comparable to other tools on average. The average score is

not the best one. However, it is only slightly worse than the winner of this group (0.909

versus 0.901). HSA performs best for 2 cases out of 7, including a case for which HSA

gets full score. In Table 5-3, HSA is higher than other tools on average. HSA performs

best on 2 cases out of 7, including a case for which HSA gets full score. High scores of

existing methods for sequences with high percentage of identity (Tables 5-2 and 5-3) show

that there is little room for improvement for such sequences. Proteins at the twilight zone

(Table 5-1) pose a greater challenge. These results show that our algorithm performs

best for such sequences. For medium and high similarity benchmarks, our results are

comparable to existing tools.

Table 5-4 shows the SP scores of HSA, ClustalW, ProbCons, MUSCLE, T-Coffee

and the original BAliBASE alignment. On average, ClustalW, MUSCLE, and T-Coffee

find the highest SP score for low, medium, and high similarity sequences respectively.

However, according to Tables 5-1 to 5-3, those methods have relatively low BAliBASE










Table 5-1. The BAliBASE score of HSA and other tools: less than 25% identity.

                  ClustalW  ProbCons  MUSCLE  T-Coffee  HSA
Short    1aboA    0.693     0.624     0.616   0.320     0.833
         1idy     0.546     0.679     0.354   0.183     0.700
         1r69     0.655     0.655     0.345   0.234     0.772
         1tvxA    0.223     0.439     0.239   0.235     0.462
         1ubi     0.607     0.464     0.478   0.445     0.648
         1wit     0.630     0.690     0.660   0.707     0.675
         2trx     0.660     0.705     0.712   0.667     0.756
         Avg      0.573     0.608     0.486   0.398     0.692
Medium   1bbt3    0.512     0.373     0.488   0.440     0.539
         1sbp     0.467     0.585     0.587   0.548     0.590
         1havA    0.222     0.397     0.293   0.256     0.352
         1uky     0.531     0.498     0.535   0.441     0.596
         2hsdA    0.482     0.606     0.748   0.573     0.614
         2pia     0.624     0.700     0.691   0.579     0.608
         3grs     0.377     0.355     0.309   0.383     0.487
         Avg      0.459     0.502     0.521   0.460     0.541
Long     1ajsA    0.388     0.411     0.370   0.379     0.472
         1cpt     0.697     0.719     0.765   0.726     0.810
         1lvl     0.368     0.590     0.451   0.528     0.532
         1pamA    0.405     0.534     0.439   0.461     0.524
         1ped     0.678     0.717     0.746   0.638     0.746
         2myr     0.394     0.568     0.386   0.454     0.630
         4enl     0.664     0.573     0.526   0.582     0.652
         Avg      0.513     0.587     0.526   0.538     0.624
Avg all           0.515     0.565     0.511   0.465     0.619


scores. This means that the alignment with the highest SP score is not necessarily the

most meaningful alignment. The SP score of HSA is comparable to other tools on the


Table 5-2. The BAliBASE score of HSA and other tools: 20-40% identity.

         ClustalW  ProbCons  MUSCLE  T-Coffee  HSA
1fjlA    0.994     0.989     0.971   0.991     1.000
1csy     0.861     0.897     0.799   0.887     0.871
1tgxA    0.833     0.760     0.679   0.817     0.782
1ldg     0.920     0.939     0.954   0.956     0.941
1mrj     0.853     0.925     0.894   0.894     0.925
1pgtA    0.941     0.926     0.912   0.955     0.924
1ton     0.718     0.898     0.865   0.867     0.867
Avg      0.874     0.904     0.867   0.909     0.901










average. For low similarity sequence benchmarks, the average SP score of HSA is higher

than the average SP score of the reference alignment.

Table 5-3. The BAliBASE score of HSA and other tools: more than 35% identity.

         ClustalW  ProbCons  MUSCLE  T-Coffee  HSA
1amk     0.978     0.984     0.986   0.988     0.986
1ar5A    0.953     0.956     0.969   0.947     1.000
1led     0.900     0.931     0.950   0.956     0.929
1ppn     0.987     0.983     0.983   0.984     0.981
1thm     0.898     0.900     0.899   0.893     0.910
1zin     0.955     0.975     0.985   0.958     0.978
5ptp     0.948     0.963     0.950   0.961     0.957
Avg      0.945     0.956     0.960   0.955     0.963


Performance Evaluation

The time complexity of our algorithm is $O(W^K N + K^2 M^2)$, where K is the number of
sequences, W is the sliding window size, N is the sequence length and M is the number of
fragments in a protein sequence. The complexity is computed as follows. The clique with
the highest expectation score in a window is found in $W^K$ time, and there are N
positions for the sliding window. $K^2 M^2$ time is required for aligning fragments.
Usually, M is much smaller than N, so the complexity, in practice, is $O(W^K N)$.
Typically W is a small number such as 4. For reasonably small K, $W^K N = O(N)$.
Therefore, for small K, the complexity is $O(N)$. As K increases, the complexity
increases quickly. However, this complexity is observed only if the subgraphs inside a
window are highly connected. It is possible to get rid of the $W^K$ term in the
complexity by using longest path methods rather than clique finding methods. The
experimental results in Table 5-5 coincide with the above conclusion. In general,

Table 5-4. The SP score of HSA and other tools.

                  REF     ClustalW  ProbCons  MUSCLE  T-Coffee  HSA
Short, <25%       -602    -453      -594      -496    -912      -599
Medium, <25%      -2036   -1466     -2516     -1543   -2461     -1617
Long, <25%        -2989   -1964     -3266     -2291   -2991     -2436
Short, 20-40%     456     499       508       480     491       493
Medium, 20-40%    1238    1119      1138      1231    1191      1138
Medium, >35%      3474    3477      3479      3526    3528      3468
Avg overall       -76     202       -208      151     -192      74









Table 5-5. The running time of HSA and other tools (measured in milliseconds).

                  ClustalW  ProbCons  MUSCLE  T-Coffee  HSA
Short, <25%       69        238       98      915       194
Medium, <25%      133       638       297     1890      535
Long, <25%        308       1564      584     3240      1191
Short, 20-40%     62        265       83      1187      421
Medium, 20-40%    171       695       175     2316      613
Medium, >35%      154       629       136     2502      660
Avg overall       149       672       229     2008      602


ClustalW is the fastest. However, ClustalW achieves this at the expense of low accuracy (see

Tables 5-1 to 5-3). HSA is slower than ClustalW and MUSCLE. It is, however, faster

than ProbCons and T-Coffee.
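To make the growth of the clique-search term in the complexity analysis of this section concrete, the following sketch evaluates the expression W^K N + K^2 M^2 for given parameters. It ignores constant factors, so the numbers are only relative operation counts, and the function names are hypothetical.

```python
def cliques_per_window(W, K):
    """Number of ways to pick one vertex per sequence in a window of width W
    over K sequences."""
    return W ** K


def hsa_operation_estimate(W, K, N, M):
    """Evaluate W^K * N + K^2 * M^2, ignoring constant factors."""
    return cliques_per_window(W, K) * N + K ** 2 * M ** 2
```

With W = 4, three sequences give 64 candidate cliques per window and ten sequences give 1,048,576, which illustrates why the complexity increases quickly as K grows.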









CHAPTER 6
MODULE FOR AMPLIFICATION OF PLASTOMES BY PRIMER IDENTIFICATION

The chloroplast is the site of photosynthesis, and is therefore critical to plant growth,

development and agricultural output. The chloroplast genome is also relatively small, yet

despite its approachable size and importance, only a small number of chloroplast genomes

have been sequenced. The dearth of information is due to the requisite preparation,

frequently requiring isolation of plastids and generation of plasmid-based chloroplast DNA

libraries. The method shown in this chapter tests the hypothesis that rapid, inexpensive,

yet substantial sequence coverage of an unknown target chloroplast genome may be

obtained through a PCR-based means. A computational approach predicts a large

number of overlapping primer pairs corresponding to conserved coding regions of known

chloroplast genomes. These computer-selected primers are used to generate PCR-derived

amplicons that may then be sequenced by conventional methods. This chapter considers

the problem of finding a saturating number of overlapping primer pairs to bracket the maximum

possible coverage of the unknown target DNA sequence. None of the currently available

primer prediction tools consider gene and inter-gene information and most use only one

reference sequence, which limits their power and accuracy.

This chapter provides a heuristic solution, named MAPPIT, to the above mentioned

problem that is divided into the task of first identifying universal primers and then

assessing spatial relationships between the primer pair candidates. Two strategies have

been developed to solve the first problem. The first employs multiple alignment, and the

second identifies motifs. The distance between primers, their alignment within gene coding

regions, and most of all their presence in multiple reference genomes narrows the primer

set. Primers generated by the MAPPIT module provide substantially more coverage

than those generated via Primer3. Motif-based strategies provide more coverage than

multiple-alignment based approaches. As predicted, primer selection improves when based

on a larger reference set. The computational predictions were tested in the laboratory and










demonstrate that substantial coverage may be obtained from a set of eudicots, and at least

partial sequence may be obtained from distant taxa.

6.1 Motivation and Problem Definition

DNA sequence information is the basis of many disciplines of biology including

molecular biology, phylogenetics and molecular evolution. The sequence information of

a plant cell resides in three physically distinct compartments, namely the nucleus, the

mitochondrion, and the plastid. Each encodes proteins required for cell form and function,

and each is subject to different mechanisms of selection and inheritance. The green

plastid, chloroplast, is an important organelle. It is the site of photosynthesis and several

other important metabolic processes, and is therefore critical to plant growth, development

and agricultural output. The plastome or chloroplast genome holds a wealth of functional

and phylogenetic information. By mining sequence information from many species,

important taxonomic relationships may be resolved, complementing associations built

from studies of variability in morphology, as well as biochemical and nuclear-genome-based

molecular markers. Also, genetic engineering of the chloroplast requires a foundation of

sequence information.

The chloroplast genome maintains a great degree of conservation in gene content

and organization. Thus a relatively high level of synteny exists between plastid genomes

derived from distantly-related taxa [10]. The chloroplast genome is much smaller than

the nuclear genome, yet only a small number of these extra-nuclear genomes have been

sequenced. Traditionally, plastid genomes have been sequenced only after generating

extensive plasmid-based libraries of the plastid DNA. Plastid DNA extraction relies on

difficult, sometimes problematic and typically time consuming preparative procedures.

Recently, several reports have increased plastid sequencing throughput by amplifying the

isolated plastid DNA using rolling circle amplification (RCA) [33]. However, obtaining

sequence through RCA requires this intermediate step. Recently, the ASAP method

showed that sequence information could be gathered by creating templates from plastid









DNA based on conserved regions of plastid genes [32]. ASAP uses conserved primers

(short, single-stranded DNA fragments that initiate enzyme-based DNA strand elongation)

to flank unknown regions, and the regions are amplified using the polymerase chain

reaction (PCR). PCR involves the exponential amplification of a finite length of DNA in a

cell free environment [116], and it is frequently used to generate a large quantity of specific

DNA sequences for forensic applications. The procedure relies on a thermostable enzyme

known as Taq DNA polymerase, which elongates specific DNA sequences bracketed by

primer homology. A primer is classified as forward or reverse primer depending on its

orientation relative to the target sequence. For instance, a forward and reverse primer

that flank a given gene allow amplification of the bracketed sequence in the presence of

DNA polymerase, nucleotides and appropriate cofactors. Use of PCR depends on many

successive rounds of primer annealing and subsequent template elongation to amplify a

sequence of interest. The ASAP method is fast and cost effective. However, in the initial

report, the required primers were selected by visual inspection of target sequences. This

restricted the ASAP study to a small region of the chloroplast genome. To expand this

technique to an entire chloroplast genome an efficient method is required to facilitate

primer selection. More importantly, such a method will allow the selected primer set to be

updated based upon the availability of new plastid sequences.

This chapter presents the Module for Amplification of Plastomes by Primer

Identification, or MAPPIT. The MAPPIT tool uses the information of database-resident

reference plastid genomes to predict a set of conserved primers that will generate

overlapping amplicons for sequencing. The power of MAPPIT is that it would theoretically

gain accuracy and precision as the reference sequence set grows. MAPPIT uses two

approaches to identify the primers, namely multiple alignment and motif-based.

The first approach develops a multiple alignment strategy. The proposed multiple

alignment method is a variation of traditional progressive multiple alignment strategy that

weights the coding regions of the genomes, increasing the probability that the primers










identified reside in the coding regions of associated genes. Once a multiple sequence

alignment of the reference genomes is obtained, a window is slid on the consensus sequence

to identify the subsequences that satisfy the constraints that designate primer candidates.

Individual primer candidates are then assessed for their relative association with other

primer candidates to assign feasible primer pairs.

The second approach is based on motif identification. This method recognizes

potential primers from each reference genome separately. It then identifies a subset of

these primers that occur frequently in a subset of reference genomes. The presence in

multiple genomes adds support to any primer being assigned to the final primer set. Two

solutions have been developed to identify the final set of primer pairs from the candidates,

namely order dependent and order independent, depending on whether they consider

primer order or not when computing the support values.

Finally, a computational method has been developed to measure the quality of the

identified primer pairs. Experimental results show that the primer pairs designed cover

up to 81% of an unknown target sequence. Randomly selected primer pairs devised by

MAPPIT were used in laboratory experiments to validate computational predictions.

We first define several terms. A DNA sequence is represented by a string over the four
letters A, C, G, T for the bases, plus two extra symbols: N for unknown bases and - for
gaps. A primer is defined as a sequence which satisfies certain constraints. The length of
a primer p, indicated by length(p), is the number of characters it contains. Let s[i : j]
denote the subsequence of s from position i to position j. A primer p binds to DNA
sequence s at position i if p and s[i : i + length(p) - 1] are similar. Two sequences are
considered similar if they have sufficient percentage identity. In practice, 93% identity
is required for primer similarity. A partial order of primers p and q with respect to
sequence s, $p \prec_s q$, is defined if the position of p is before the position of q in
s. Let f and r denote a forward and a reverse primer respectively. Assume that f and r
bind to s[i : i + length(f) - 1] and s[j : j + length(r) - 1]. The distance between f and
r with respect to s, $d_s(f, r)$, is defined as

$$d_s(f, r) = \begin{cases} j + \mathrm{length}(r) - i & \text{if } i < j \\ \infty & \text{otherwise} \end{cases}$$

A primer pair <f, r> identifies the fragment s[i : i + $d_s(f, r)$ - 1] from s if
$d_s(f, r)$ is less than a given cutoff. This cutoff is usually 1000 and is determined by
the limitations of automated sequencing methods currently available. Two fragments of s,
say s1 and s2, identified by two primer pairs can be combined to form a contig if s1 and
s2 have sufficient overlap. In practice, an overlap of at least 100 letters denotes a
contig with high confidence. Shorter overlaps are not counted as they may indicate random
overlaps. Given a set of primer pairs P = {<f1, r1>, <f2, r2>, ..., <fk, rk>}, we define
the coverage of P on a sequence s as the total number of letters of s that can be
identified using P.

Figure 6-1. Example of primer pairs on a target sequence: f and r stand for forward and
reverse primers respectively. The directions of primers are shown. The pair <f1, r1>
covers a region a1 and constructs a contig Contig1. The pairs <f2, r2> and <f3, r3>
cover regions a2 and a3, which construct a contig Contig2 since a2 and a3 overlap.
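The definitions of distance, identified fragment, contig and coverage can be sketched as follows. This is an illustration under assumptions: positions are 0-based and fragments are half-open intervals [start, end), whereas the text uses inclusive positions; the function names are hypothetical; the defaults 1000 and 100 are the cutoff and the minimum overlap quoted above.

```python
import math


def primer_distance(i, j, len_r):
    """d_s(f, r) for a forward primer bound at i and a reverse primer of
    length len_r bound at j."""
    return j + len_r - i if i < j else math.inf


def identified_fragment(i, j, len_r, cutoff=1000):
    """Interval [i, i + d) amplified by the pair, or None if d >= cutoff."""
    d = primer_distance(i, j, len_r)
    return (i, i + d) if d < cutoff else None


def build_contigs(fragments, min_overlap=100):
    """Merge fragments that overlap by at least min_overlap letters."""
    contigs = []
    for start, end in sorted(fragments):
        if contigs and contigs[-1][1] - start >= min_overlap:
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            contigs.append([start, end])
    return [tuple(c) for c in contigs]


def coverage(fragments):
    """Total number of letters covered by at least one fragment."""
    covered, last_end = 0, -math.inf
    for start, end in sorted(fragments):
        start = max(start, last_end)  # skip letters already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered
```

Note that coverage counts the union of all fragments, so two fragments whose overlap is too short to form a contig still contribute each letter only once.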

We define the primer pair finding problem as follows:

Given a target sequence T and a set of reference sequences S = {S1, S2, ..., SK},
where each Si is homologous to T, the goal is to find a set of primer pairs <fi, ri>,
i ∈ {1, 2, ..., k}, that (1) has a large coverage on T and (2) produces a small number of
contigs from T.

An example is shown in Figure 6-1. In this example, a target DNA sequence and six
primers are shown. Primers f1 and r1 construct a primer pair <f1, r1> since $d_s(f1, r1)$
is within the distance limitation L. This pair constructs a contig (Contig1) on the
target. Primer pairs <f2, r2> and <f3, r3> have an overlap greater than the overlap
threshold V; therefore these two primer pairs produce another contig (Contig2).

6.2 Related Work

Rapid and cost effective DNA sequence acquisition is one of the core problems in

bioinformatics research. Sequencing methods mainly fall into two classes: whole-genome

shotgun (WGS) assembly and PCR-based assembly. The whole-genome shotgun assembly

technique has been remarkably successful in efforts to determine the sequence of bases

that make up a genome [23]. CAP3 belongs to this category [117]. The accuracy of the

assembled sequences using WGS methods suffers because of read errors and repeats [118].

They also incur very high computation cost due to large number of pairwise sequence

comparisons. And they also need an additional finishing phase. On the other hand,

PCR-based sequencing methods are more accurate. However, their processing time is

usually much longer and their cost is higher.

Recently, Dhingra and Folta proposed a new sequencing method, called ASAP [32],

to overcome the shortcomings of PCR-based methods. ASAP exploits the fact that

chloroplast genomes are extremely well conserved in gene organization, at least within

major taxonomic subgroups of the plant kingdom. It is a universal high-throughput,

rapid PCR-based technique to amplify, sequence and assemble plastid genome sequence

from diverse species in a short time and at reasonable cost. The ASAP method finds the

multiple alignment of a set of reference genomes that are homologous to the target genome

using ClustalW [1]. Domain experts then identify conserved primer pairs from the

multiple alignment through visual inspection. ASAP uses these primer pairs to generate










1-1.2 kbp overlapping amplicons from the inverted repeat region in 14 diverse genera,

which can be sequenced directly without cloning [32]. The manual primer identification

step is the bottleneck of ASAP. Efficient computational methods are needed to automate

this process. Also, as we discuss later, ASAP can miss potential primers since it uses

ClustalW for multiple alignment. This is because ClustalW maximizes the overall

alignment score for the entire sequences. Primers are however short sequences scattered in

the entire sequence. Thus, short conserved regions can be missed using ClustalW when the

sequences have many indels.

Similar to ASAP, PriFi [119] uses multiple sequence alignment to identify primers.

It also uses ClustalW to obtain multiple alignment. PriFi has the same shortcomings as

ASAP. PriFi also has the shortcoming that it cannot automatically identify introns.

Multiple sequence alignment has many applications in biological science, such

as gene prediction [7] and improving local alignment quality [20]. Multiple sequence

alignment methods can be classified into two groups: optimal and heuristic methods.

MSA [61] is the representative of optimal solutions. Heuristic methods are much more

popular because of their low time complexity. ClustalW [1, 77], ProbCons [88], T-Coffee [2]

and MUSCLE [78] are some examples of heuristic strategies.

6.3 Current Results

6.3.1 Finding Primer Candidates

In this section, we discuss how we construct the set of candidate primers (forward
and reverse) from reference sequences. Our final goal is to obtain a set of primers that
covers the unknown target sequence. Therefore, the primers found in this step
should be selected according to their likelihood of occurring in the target sequence. Let
T denote the target sequence. Let S = {S1, S2, ..., SK} denote the set of reference
sequences homologous to T. Similar to the ASAP method, we assume that a primer p appears
in T with high probability if it appears in most of the reference sequences. We say that p
"appears in" a given sequence if that sequence has a subsequence whose alignment with
p has a percent identity greater than a given threshold. This threshold is usually chosen
as 93% for practical purposes (see Section 6.1). We define the support of a primer p on a
sequence Si as:

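$$\mathrm{support}(p, S_i) = \begin{cases} 1 & \text{if } p \text{ appears in } S_i \\ 0 & \text{otherwise} \end{cases}$$

We define the support of a primer p on sequence set S as:

$$\mathrm{support}(p, S) = \frac{\sum_{S_i \in S} \mathrm{support}(p, S_i)}{|S|} \times 100$$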
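The two support definitions can be sketched as follows. The sketch rests on assumptions that are not stated in the text: a primer is compared against every ungapped window of the same length, a window counts as a match when its identity is at least the threshold, and the set-level support is normalized by the number of reference sequences so that it is a percentage. The function names are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def appears_in(primer, seq, threshold=93.0):
    """True if some ungapped window of seq matches the primer with an
    identity of at least threshold percent (simplifying assumption)."""
    n = len(primer)
    return any(percent_identity(primer, seq[i:i + n]) >= threshold
               for i in range(len(seq) - n + 1))


def support(primer, references, threshold=93.0):
    """Percentage of reference sequences in which the primer appears."""
    hits = sum(appears_in(primer, s, threshold) for s in references)
    return 100.0 * hits / len(references)
```

A gapped comparison, as the phrase "whose alignment with p" suggests, would require a pairwise alignment step in place of the window comparison used here.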

A primer is considered as a candidate primer only if it satisfies the following two
criteria:

Conservation Criteria: A primer has to have sufficient support on set S. In practice

70-90% support is sufficient.

CG-content Criteria: A forward primer has to satisfy the following two criteria in order

to successfully amplify the target. (1) The last letter should be C or G. (2) At least two of

the last six letters should be C or G. Reverse primers have the symmetric restriction: the

first letter should be C or G and at least two of the first six letters should be C or G.
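The CG-content criteria translate directly into two predicates. This is a minimal sketch with hypothetical function names; it assumes primers are given as uppercase strings over A, C, G, T.

```python
def forward_primer_ok(primer):
    """Last letter is C or G, and at least two of the last six letters
    are C or G."""
    if not primer:
        return False
    tail = primer[-6:]
    return primer[-1] in "CG" and sum(ch in "CG" for ch in tail) >= 2


def reverse_primer_ok(primer):
    """First letter is C or G, and at least two of the first six letters
    are C or G."""
    if not primer:
        return False
    head = primer[:6]
    return primer[0] in "CG" and sum(ch in "CG" for ch in head) >= 2
```

For instance, ATATATGC passes the forward test but fails the reverse test, while GCATATAT does the opposite.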

We develop two strategies to obtain a set of candidate primers. The first one is an extension of the ASAP method and uses multiple alignment. The second one finds primer candidates for each reference genome separately. It then merges the candidates progressively. We describe them in the following sections.

6.3.1.1 Multiple sequence alignment-based primer identification

One way to find candidate primers is to align all the reference sequences using a multiple alignment method. A window is then slid over the resulting alignment. The length of the window is equal to the desired primer length. Each window position that satisfies the conservation and CG rate criteria defines a forward or reverse primer candidate. In this approach the multiple alignment brings similar subsequences of all the reference sequences together.









[Figure 6-2: sequences S1, S2, ..., SK aligned, with a forward primer region A (f), a middle region B, and a reverse primer region C (r).]

Figure 6-2. An example of computing the SP score of a multiple sequence alignment. Regions A and C contain primers, so their SP scores are included when we compute the SP score of the alignment. Region B contains no primer, so its SP score is treated as zero.


Alignment: A trivial approach here is to use an existing alignment strategy, such as ClustalW [1, 77]. The underlying problem, however, differs from traditional multiple alignment. This is because traditional multiple alignment methods aim to maximize the overall alignment score. However, in order to find primers we only need to identify short, highly conserved regions in the reference sequences. The non-conserved regions of less than 1000 bases between two primer candidates should be disregarded, as these regions will be identified during the PCR amplification process. Figure 6-2 illustrates this. In the figure, a forward primer region A and a reverse primer region C are shown; we only maximize the SP score of A and C. Region B, which contains no primer, is not considered when computing the SP score of the whole alignment.
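To make the region-restricted score concrete, the sketch below computes a Sum-of-Pairs score over selected column ranges only. The match, mismatch and gap values are placeholder assumptions chosen for illustration; they are not the scoring scheme used in this chapter, and the function names are hypothetical.

```python
from itertools import combinations


def column_sp(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-Pairs score of one alignment column (toy scoring values)."""
    score = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                  # gap against gap contributes nothing
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


def region_sp(alignment, regions):
    """SP score summed over the half-open column ranges in `regions` only.

    Columns outside the listed regions (region B in Figure 6-2) count as zero.
    """
    columns = list(zip(*alignment))
    return sum(
        column_sp(columns[i])
        for start, end in regions
        for i in range(start, end)
    )
```

Passing a single range that spans every column recovers the ordinary SP score of the whole alignment, which makes the effect of ignoring region B easy to compare.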

We propose a variation of the hierarchical clustering algorithm [71]. It follows from two observations: (1) The gene regions of a set of homologous sequences are usually highly conserved, while their intergenic regions can show high variation in length and letter content. (2) Primers need to have a sufficient CG rate.

For each reference sequence, we read the locations and lengths of genes from data source files, which were previously downloaded from GenBank. We also scan the sequence and find regions which have a lower CG rate than the required cutoff for a primer. We tag these regions as unpromising. We replace the letters in such regions with "N". In other words, we mask these regions.
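One possible reading of the masking step is sketched below. It assumes that a position is kept when at least one window covering it contains enough C or G letters and is replaced by "N" otherwise; the window length and the minimum count are hypothetical parameters, since the exact cutoff is not restated here.

```python
def mask_low_cg(seq: str, window: int = 6, min_cg: int = 2) -> str:
    """Replace letters not covered by any sufficiently CG-rich window with 'N'."""
    keep = [False] * len(seq)
    for i in range(len(seq) - window + 1):
        if sum(c in "CG" for c in seq[i:i + window]) >= min_cg:
            for j in range(i, i + window):
                keep[j] = True        # this window is promising, keep its letters
    return "".join(c if k else "N" for c, k in zip(seq, keep))
```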

During the alignment of the sequences, we compute a weighted score of the alignment: the score for letters which are tagged as genes is scaled up using a predefined weight constant. The score for letters which are tagged as "N" is set to 0. We applied an affine gap penalty strategy to reduce the number of gaps. We used an algorithm extended from the alignment method of Myers and Miller [65] to reduce the memory requirement, since the reference genomes are usually too long. We use the Sum-of-Pairs score to evaluate the alignment.

The alignment algorithm is as follows. We first compute the alignment score between each pair of sequences and construct an initial score table. The initial profiles to be aligned are the original sequences. Second, we select the pair of profiles which has the highest score in the score table and obtain a new profile from the alignment of these two profiles. Third, we remove the two profiles and add the new profile to the profile set. We calculate the SP score when we score two elements from two profiles. Fourth, we construct a new pairwise alignment score table. Fifth, we repeat the second through fourth steps until only one profile is left. The final profile is the resulting alignment.

Primer selection: We first construct a consensus string from the multiple alignment.

To do this, we scan the alignment from the beginning to the end. For each column of the

alignment, we choose the most frequent character as its consensus character. We compute

the conservation rate of the consensus character of each column as the percentage of the

appearance of this character in that column.
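The consensus construction described above can be sketched as follows. The alignment is assumed to be a list of equal-length rows; the function name is hypothetical, and ties between equally frequent characters are broken by first occurrence in the column, which is an arbitrary choice.

```python
from collections import Counter


def consensus_with_conservation(alignment):
    """Return the consensus string and the per-column conservation rate (in %)."""
    consensus = []
    rates = []
    for column in zip(*alignment):
        char, count = Counter(column).most_common(1)[0]   # most frequent character
        consensus.append(char)
        rates.append(100.0 * count / len(column))
    return "".join(consensus), rates
```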

We then slide a window from the beginning to the end of the consensus string. The window has the same size as the primer. For each window, we check whether the fragment in the window satisfies the CG rate and conservation rate criteria. The fragments which pass the test become primers. Depending on the CG positions, a fragment is inserted into the forward primer set, the reverse primer set, or both. For each primer, we keep its sequence and position in the consensus sequence.

6.3.1.2 Motif-based primer identification

Multiple alignment of reference sequences provides primer candidates from conserved regions. However, this approach has two drawbacks. First, variations between intergenic regions can cause shifts in the alignment. As a result, some of the conserved regions may not be observed in the consensus sequence. Weighting the genes partially alleviates this problem. However, it is not sufficient as the intergenic regions can also contain primers. Second, multiple alignment cannot find all conserved regions if there are translocations in the reference genomes. In this section, we propose a new strategy to address these problems.

Our solution first finds possible primers from each sequence separately without

considering any conservation constraints. It then finds common primers with sufficient

support by iteratively merging the primer set. We discuss these steps in more detail next.

We start by constructing a set of possible forward primers Fi and a set of possible reverse primers Ri for each reference sequence Si. To do this, we slide a window of primer length on each reference sequence. Each position of the window produces a fragment. The fragments that satisfy the CG criteria for primers are inserted into the corresponding primer set. Let $F_i = \{f_{i,1}, f_{i,2}, \dots, f_{i,m_i}\}$ and $R_i = \{r_{i,1}, r_{i,2}, \dots, r_{i,n_i}\}$ denote the primers found for Si. For each primer $f_{i,j}$, two values are stored: support and location, denoted by support($f_{i,j}$) and location($f_{i,j}$). The support and location of $f_{i,j}$ are initialized to one and the position of $f_{i,j}$ in Si, respectively. The support and location of all reverse primers are computed in the same way. We propose two strategies to find candidate primers from these primers. We explain our strategies for candidate forward primers. Candidate reverse primers are found in exactly the same way. The only difference is that we use Ri instead of Fi.









Order independent strategy: Let G denote the set of candidate forward primers. G is initialized to the empty set. We then carry out the following steps:

We pick a random Si from the reference sequence set that has not been considered so far. For all primers $f_{i,j} \in F_i$, we check if there exists a primer $g \in G$ that is similar to $f_{i,j}$ (i.e., g and $f_{i,j}$ have at least 93% identity; see Section 6.1). If there is no such $g \in G$, then we insert $f_{i,j}$ into G. If there exists such a g, then we update the support and location of g. The location is updated as

$$\mathrm{location}(g) = \frac{\mathrm{location}(g) \times \mathrm{support}(g) + \mathrm{location}(f_{i,j})}{\mathrm{support}(g) + 1} \qquad (1)$$

The support of g is then incremented by one. We repeat the same process for each of the remaining reference sequences in random order. Once all the references are processed, we remove the primers in G that do not satisfy the support criteria. Note that further optimizations can be made in the implementation by removing primers from G as soon as they are guaranteed to have insufficient support. We do not discuss them as they only affect the performance.
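A compact sketch of the order independent merge is given below. It is illustrative only: similarity is approximated by ungapped identity between equal-length primers, references are processed in the order supplied (the caller may shuffle them), and the `Candidate` class, function names and the 70% default are assumptions. As in the description above, nothing prevents two primers from the same reference from both incrementing the support of one candidate.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    seq: str
    location: float
    support: int = 1


def similar(a: str, b: str, threshold: float = 93.0) -> bool:
    """Ungapped identity check between equal-length primers (a simplification)."""
    if len(a) != len(b):
        return False
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a) >= threshold


def merge_order_independent(primer_sets, min_support_pct=70.0):
    """primer_sets holds, per reference, a list of (sequence, location) primers."""
    G = []
    for primers in primer_sets:
        for seq, loc in primers:
            g = next((c for c in G if similar(c.seq, seq)), None)
            if g is None:
                G.append(Candidate(seq, float(loc)))
            else:
                # Equation (1): running average of the matched locations.
                g.location = (g.location * g.support + loc) / (g.support + 1)
                g.support += 1
    k = len(primer_sets)
    return [g for g in G if 100.0 * g.support / k >= min_support_pct]
```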

Order dependent strategy: The first strategy increases the support of a primer regardless of the positions of the primers in G and Fi. As a result, primers in conflicting positions can be considered similar simultaneously. Such conflicting primers can be desirable in case of translocations. However, if the reference genomes do not have translocations, this strategy can produce false primers as it increments support for all matches regardless of the position. Figure 6-3 illustrates this. In the figure, we only show forward primers and their locations; the matched primers are connected by arrows. Primers f1 and f2 are crossed and would not both be considered matched at the same time when using multiple sequence alignment. The order independent strategy allows this type of match.

In the order dependent strategy, we consider the problem as finding the Longest Common Subsequence of a set of sequences, known as k-LCS. Here, each primer set Fi denotes a sequence of primers, since the primers in Fi are ordered by their locations. The goal is to find a










[Figure 6-3: forward primers f1, f2, f3, f4 on one reference sequence matched to f2, f1, f3, f4 on another; the positions of f1 and f2 are swapped.]

Figure 6-3. An example of matching primers with translocations. Only forward primers are shown in the figure. Primers f1 and f2 have crossed positions due to translocation. In step 1, matching both the f1s and the f2s at the same time is allowed when using the motif-based strategy but not when using the multiple sequence alignment-based strategy.


subsequence of primers that is common to most of the reference sequences (i.e., 70-90% of the reference sequences contain it). k-LCS is an NP-complete problem [65] and has many heuristic solutions. We use a progressive solution which is similar to our first strategy in spirit.

We pick a random Si from the reference sequence set and initialize G to Fi. We then repeatedly pick a reference sequence Si from the remaining references and process it as follows: We find the LCS of Fi and G. Here, two primers are considered common if they are similar to each other (i.e., they have at least 93% identity). We update the support and location of all $g \in G$ which are in the LCS. The location is updated as given in equation (1). The support of g is then incremented by one. We then insert all the $f_{i,j} \in F_i$ that are not in the LCS into G. Once all the references are processed, we remove the primers in G that do not satisfy the support criteria. The time complexity of this motif-based method is $O(M^2)$, where M is the number of primers in a sequence. Usually M is much less than the length of the sequence.
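The progressive LCS step can be sketched as below, using the standard quadratic dynamic program with a similarity test in place of exact equality. Several details are assumptions made for the sketch: similarity is ungapped identity, references are taken in the order given, unmatched primers are appended and G is re-sorted by location after each merge, and all names are hypothetical.

```python
def similar(a, b, threshold=93.0):
    """Ungapped identity check between equal-length primers (a simplification)."""
    if len(a) != len(b):
        return False
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a) >= threshold


def lcs_pairs(G, F, same):
    """Index pairs (i, j) of one longest common subsequence of G and F."""
    n, m = len(G), len(F)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if same(G[i], F[j]):
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if same(G[i], F[j]):
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def merge_order_dependent(primer_sets, min_support_pct=70.0):
    """primer_sets: per reference, a location-sorted list of (sequence, location)."""
    G = [[s, float(l), 1] for s, l in primer_sets[0]]     # [seq, location, support]
    for F in primer_sets[1:]:
        pairs = lcs_pairs(G, F, lambda g, f: similar(g[0], f[0]))
        matched = set()
        for gi, fj in pairs:
            g = G[gi]
            g[1] = (g[1] * g[2] + F[fj][1]) / (g[2] + 1)  # equation (1)
            g[2] += 1
            matched.add(fj)
        G.extend([s, float(l), 1] for j, (s, l) in enumerate(F) if j not in matched)
        G.sort(key=lambda g: g[1])                        # keep G ordered by location
    k = len(primer_sets)
    return [g for g in G if 100.0 * g[2] / k >= min_support_pct]
```

With two references whose first two primers are swapped, only one of the swapped primers can be part of the common subsequence, so the other does not gain support.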

6.3.2 Finding Minimum Primer Pair Set

So far, we have discussed how to find candidate primers from a given set of reference sequences. In this section, we discuss how to select a minimum set of primer pairs to obtain the largest coverage and the minimum number of contigs.

Let $F = \{f_1, f_2, \dots, f_m\}$ and $R = \{r_1, r_2, \dots, r_n\}$ denote the sets of forward and reverse primers with sufficient support identified using any of the strategies discussed in Section 6.3.1. Assume that location($f_i$) < location($f_j$) and location($r_i$) < location($r_j$) for i < j. Note that the locations of primers are computed as discussed in Section 6.3.1. The goal is to find a set of primer pairs $P = \{\langle f_{\pi_1}, r_{\rho_1}\rangle, \langle f_{\pi_2}, r_{\rho_2}\rangle, \dots\}$, where $\forall i$, $f_{\pi_i} \in F$, $r_{\rho_i} \in R$ and $\forall i < j$, $\pi_i < \pi_j$, $\rho_i < \rho_j$, with the objective that the primer pairs in P have maximum coverage on the reference sequences and produce the minimum number of contigs. We propose a greedy algorithm. It works in three steps:

Step 1: Initialize the current forward primer, f = f1. Remove f from F.

Step 2: For the current forward primer, check R. If there are reverse primers $r \in R$ which satisfy the distance criteria with f, select the one with the largest location as the current reverse primer, r. Recall from Section 6.1 that the distance criteria is

$$0 < \mathrm{location}(r) - \mathrm{location}(f) + \mathrm{length}(r) < \text{distance-cutoff}.$$

Distance-cutoff is set to 1,000 (see Section 6.1). Insert the pair $\langle f, r \rangle$ into P. If there is no $r \in R$ which satisfies the distance criteria with f, then update f as the next forward primer, remove f from F, and repeat Step 2. If there is no forward primer left in F, the algorithm stops.

Step 3: For the current reverse primer r, check F. There are three cases. Case 1: If $F = \emptyset$, then the algorithm stops. Case 2: If there are forward primers in F which satisfy the overlap criteria, select the one with the largest location as the current forward primer f. Remove all the primers in F whose locations are less than or equal to the location of f. Case 3: If no forward primer satisfies the overlap criteria, select the first forward primer in F which has a larger location than r and go to Step 2. Recall from Section 6.1 that the overlap criteria is

$$0 < \mathrm{location}(r) - \mathrm{location}(f) < \text{overlap-cutoff}.$$

Overlap-cutoff is set to 100 (see Section 6.1).
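The three steps can be sketched as a short loop. This is an illustrative reading rather than the implementation used here: primers are reduced to their locations, all primers are assumed to share one length (`primer_len`, a hypothetical parameter), and Case 2 is assumed to continue with Step 2 for the newly chosen forward primer.

```python
def select_primer_pairs(forward, reverse, primer_len,
                        distance_cutoff=1000, overlap_cutoff=100):
    """Greedy selection of (forward, reverse) location pairs."""
    F = sorted(forward)
    R = sorted(reverse)
    pairs = []
    if not F:
        return pairs
    f = F.pop(0)                                  # Step 1
    while True:
        # Step 2: reverse primers satisfying the distance criteria with f.
        usable = [r for r in R if 0 < r - f + primer_len < distance_cutoff]
        if not usable:
            if not F:
                return pairs
            f = F.pop(0)                          # try the next forward primer
            continue
        r = max(usable)
        pairs.append((f, r))
        # Step 3: choose the next forward primer relative to r.
        if not F:                                 # Case 1
            return pairs
        region_a = [x for x in F if 0 < r - x < overlap_cutoff]
        if region_a:                              # Case 2
            f = max(region_a)
            F = [x for x in F if x > f]
        else:                                     # Case 3
            region_b = [x for x in F if x > r]
            if not region_b:
                return pairs
            f = region_b[0]
            F = region_b[1:]
```

For example, with forward primers at 0, 300, 850, 1200 and 3000, reverse primers at 500, 900, 1800 and 3500, and a primer length of 20, the loop pairs 0 with 900, continues from the overlapping forward primer at 850, and starts a new contig at 3000.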

Figure 6-4 illustrates our primer pair selection strategy. In this example, f1 is chosen as the first forward primer (Step 1). The reverse primers r2 and r3 satisfy the distance criteria for f1. Therefore, r2 and r3 can be paired with f1. The pair $\langle f_1, r_3 \rangle$ is inserted into the solution













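[Figure 6-4: forward primers f1 to f6 and reverse primers r1 to r3 placed along the target; the overlap cutoff splits the forward primers into Region A and Region B, and the distance cutoff is marked.]

Figure 6-4. Selection of the next forward primer from the current reverse primer. The positions of the primers are shown in the figure. We select f2 if both f1 and f2 are in Region A, and select f3 if f3, f4, f5 and f6 are in Region B and no primer is in Region A.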


set since location($r_2$) < location($r_3$) (Step 2). The search space is split into regions A and B. The cut position shows the boundary for the overlap criteria. All the forward primers in A satisfy this criteria, whereas the ones in B do not. The last forward primer in region A, f3, is chosen as the next forward primer (Step 3). If region A had not contained any forward primers with location greater than that of f1, the primer f4 would be selected as the next forward primer, since f4 is the forward primer with the smallest location in region B (Step 3).

Next, we prove that our primer selection strategy yields an optimal solution among all possible solutions that can be found from the candidate primers. We define optimality according to two criteria: 1) The optimal set of primer pairs covers the largest number of letters of the consensus of the reference sequences. 2) Among all the solutions with the same coverage, the optimal solution contains the minimum number of primers and produces the minimum number of contigs.

Optimality Proof: Let $F = \{f_1, f_2, \dots, f_m\}$ and $R = \{r_1, r_2, \dots, r_n\}$ denote the sets of candidate forward and reverse primers. Let $P = \{\langle f_{\pi_1}, r_{\rho_1}\rangle, \langle f_{\pi_2}, r_{\rho_2}\rangle, \dots\}$ be the set of primer pairs found using our primer selection strategy. Let $C = \{c_1, c_2, \dots, c_s\}$ be the optimal set of contigs that can be determined using F and R, sorted in ascending order of their locations. Let left($c_i$) and right($c_i$) denote the leftmost and rightmost positions of $c_i$ in the consensus sequence. We have right($c_i$) < left($c_{i+1}$), $\forall i$, $1 \le i < s$.

(A) We first show that location($f_{\pi_1}$) = left($c_1$). Let $f_i$ be the leftmost primer (i.e., the one with the smallest location) in F which has at least one matching reverse primer satisfying the distance criteria. $f_i$ is selected by our algorithm (Steps 1 & 2) (i.e., $\pi_1 = i$).

(A.1) Assume that location($f_i$) < left($c_1$). This contradicts the assumption that C is optimal. This is because $f_i$ can be paired with a reverse primer to cover some letters to the left of $c_1$. These letters can be included in C to increase its coverage.

(A.2) Assume that location($f_i$) > left($c_1$). This contradicts the assumption that $f_i$ is the leftmost primer with a matching reverse primer.

From (A.1) and (A.2), we conclude that location($f_{\pi_1}$) = left($c_1$).

(B) Second, we prove that location($r_{\rho_1}$) $\le$ right($c_1$). We prove this by contradiction. location($r_{\rho_1}$) > right($c_1$) contradicts the assumption that $c_1$ is an optimal contig, as $\langle f_{\pi_1}, r_{\rho_1}\rangle$ can be included to extend $c_1$.

(C) Third, we show that $\langle f_{\pi_1}, r_{\rho_1}\rangle$ is a part of the optimal solution (Steps 1 & 2 of the algorithm). (A) and (B) prove that $f_{\pi_1}$ and $r_{\rho_1}$ are contained in $c_1$. Thus, they identify a prefix of $c_1$. Selection of $\langle f_{\pi_1}, r_{\rho_1}\rangle$ minimizes the number of primer pairs needed to cover $c_1$. This is because $\langle f_{\pi_1}, r_{\rho_1}\rangle$ defines the longest prefix of $c_1$ that can be identified using F and R. Thus, the coverage of any other primer pair that covers a prefix of $c_1$ is a subsequence of that of $\langle f_{\pi_1}, r_{\rho_1}\rangle$. Such a pair will require additional primer pairs to cover the same region.

(D) Finally, we prove that the selection strategy for the next forward primer minimizes the number of primer pairs (Step 3 of the algorithm). (B) implies that there are two possibilities for $r_{\rho_1}$.

(D.1) Assume that location($r_{\rho_1}$) = right($c_1$). This implies that $\langle f_{\pi_1}, r_{\rho_1}\rangle$ is the optimal primer pair to identify $c_1$. Since $c_1$ is a part of the optimal solution, there is no primer pair which satisfies the overlap criteria with $\langle f_{\pi_1}, r_{\rho_1}\rangle$ and whose reverse primer has a location greater than right($c_1$). Thus, the next forward primer should be selected as the first forward primer in F in region B (see Figure 6-4) in order to detect the next contig in C (Step 3). The justification follows from (A).

(D.2) Assume that location($r_{\rho_1}$) < right($c_1$). This implies that there exists at least one primer pair that satisfies the overlap constraint with $\langle f_{\pi_1}, r_{\rho_1}\rangle$ and covers a subsequence of $c_1$. Otherwise, $c_1$ would not be identified as a part of the optimal solution. Step 3 chooses the rightmost forward primer in region A (see Figure 6-4) to maximize the coverage of this primer pair, and thus minimize the number of primer pairs.

6.3.3 Evaluating Primer Pairs

So far, we have discussed how to find primer pairs from reference sequences to amplify

the target sequence. Performing wet-lab experimentation to evaluate the quality of the

primers is costly. In this section, we develop a new method to evaluate the quality of a set

of primer pairs computationally. This method can be used to predict the primer quality

quickly without any additional cost.

We evaluate the primer pairs using two key parameters: (1) average coverage, and (2) average number of contigs produced for all the reference sequences. Here the coverage is the total number of characters covered by the primer pairs. The number of contigs is the number of fragments identified such that no two fragments have sufficient overlap.










Let $P = \{\langle f_1, r_1\rangle, \langle f_2, r_2\rangle, \dots, \langle f_k, r_k\rangle\}$ denote the set of primer pairs identified from reference sequences S = {S1, S2, ..., SK}. For each $S_i \in S$, the algorithm keeps an integer vector $V_i$, whose size is equal to the length of Si. All entries of $V_i$ are initially set to zero. The algorithm works as follows.

1. Initialize contigid = 0.

2. For j = 1 to k

   (a) Find the locations of fj and rj in Si using dynamic programming [28-30]. A primer is found in Si if Si contains a subsequence whose alignment with that primer has at least 93% identity (see Section 6.1).

   (b) If both fj and rj can be found and their locations satisfy the distance criteria (i.e., the locations differ by at most 1,000), then check the values in $V_i$ from the starting location of fj to the ending location of rj.

       - If the first or the last 100 values are identical and greater than zero, then the fragment identified by $\langle f_j, r_j\rangle$ is an extension of an existing contig. This is because this fragment satisfies the overlap criteria with the existing contig (see Section 6.1). Set all the values of $V_i$ corresponding to the new fragment to this value.

       - Otherwise, $\langle f_j, r_j\rangle$ defines a part of a new contig. Increment the value of contigid by one and set all the values of $V_i$ corresponding to the new fragment to contigid.

3. Return the number of non-zero values in $V_i$ as the coverage and the number of distinct non-zero values in $V_i$ as the number of contigs.
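The bookkeeping in steps 1 to 3 can be sketched as follows for a single reference sequence. The sketch assumes that step 2(a) has already been carried out, so each primer pair is supplied as a half-open interval (start of the forward primer, end of the reverse primer); the function name and the interval representation are hypothetical.

```python
def coverage_and_contigs(seq_len, fragments, overlap=100):
    """Return (coverage, number of contigs) for fragments on one reference."""
    V = [0] * seq_len                 # 0 means "not covered yet"
    contig_id = 0
    for start, end in fragments:
        head = V[start:min(end, start + overlap)]
        tail = V[max(start, end - overlap):end]
        label = 0
        for window in (head, tail):
            # Extension: a full window already carries one existing contig id.
            if (len(window) == overlap and window[0] > 0
                    and all(v == window[0] for v in window)):
                label = window[0]
                break
        if label == 0:                # otherwise this fragment starts a new contig
            contig_id += 1
            label = contig_id
        for i in range(start, end):
            V[i] = label
    coverage = sum(v > 0 for v in V)
    contigs = len({v for v in V if v > 0})
    return coverage, contigs
```

Averaging the two returned values over all reference sequences gives the two quality parameters used in this section.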


6.3.4 Experimental Evaluation

Experimental setup: We evaluate our proposed methods through both computational and wet-lab experimentation. We evaluate the primer pairs based on several criteria, namely the coverage, the number of contigs, and the hit ratio on the target sequence, as well as the time it takes to find the primers. The former two are described in Section 6.1. Hit ratio denotes the ratio of primers that have a matching subsequence in the target genome.

For comparison, we downloaded Primer3 [120] as a representative of single sequence input primer design tools, since it is one of the best-known tools. For our multiple alignment-based strategy, we downloaded the source code of ClustalW [1, 77]. We also implemented the proposed weighted multiple alignment method described in Section 6.3.1, as well as our motif-based primer method described in Section 6.3.1. As a part of this method we implemented both the order independent and order dependent strategies. We used the C language in all our implementations.

We used five plastid genomes used in ASAP [32] and added two more from Cucumis and Lactuca to our dataset. We obtained the DNA sequences of these genomes from GenBank (http://www.ncbi.nih.gov/) and selected their inverted repeat regions. We use the last four digits of the accession number of each DNA sequence in GenBank as its name. To test divergent sequences, we also created another set of sequences by randomly deleting non-gene characters from them according to a given probability. Unless otherwise stated, we report the results for the original plastid genomes in our experiments. In all our experiments we used a subset of these sequences as reference sequences. We picked another sequence, which is not a reference sequence, as the target sequence. Unless otherwise stated, for a given target sequence all the remaining six genomes are used as reference sequences.

We ran all computational experiments on an Intel Pentium 4 at 3.2 GHz with 2 GB of memory, running Windows XP.

In the following tables, CovT represents the coverage on the target sequence, ConT represents the number of contigs on the target sequence, CovR represents the average coverage on the reference sequences and ConR represents the average number of contigs on the reference sequences.

6.3.5 Quality Evaluation

Comparison to Primer3: Our first experiment set compares the quality of primer pairs of MAPPIT to that of Primer3 [120]. We use Primer3 with its default parameters on a single reference sequence to identify the top 50 primers. We then evaluate these primers on the target genome. We limit the number of Primer3 primers to 50 to make it comparable to MAPPIT. We repeat this for all possible reference-target combinations and present the average results for each target. For MAPPIT, we use all six remaining sequences as the reference sequences for each target sequence. We report results for both multiple alignment strategies.

Table 6-1 shows the results. The results show that the coverage of Primer3 is significantly lower than that of our method in all cases. The results illustrate that existing tools which consider only one sequence for primer design are not suitable for sequencing plastid genomes. The coverage of MAPPIT is greater than 62% on average. Furthermore, both alignment strategies achieve similar coverage, number of contigs, and primer pairs.

Evaluation of impact of reference similarity: In order to observe the impact of the degree of similarity of reference sequences, we run MAPPIT on reference sequences of 4%, 8% and 16% divergence. Here, x% divergence means that letters in non-gene regions are randomly deleted with x% probability.
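The divergent datasets can be generated with a few lines of code. The sketch below assumes that gene regions are given as half-open (start, end) intervals and that the deletion probability is passed as a fraction (0.16 for 16%); the function name and arguments are hypothetical.

```python
import random


def diverge(seq, gene_regions, prob, rng):
    """Delete each non-gene letter of seq independently with probability prob."""
    in_gene = [False] * len(seq)
    for start, end in gene_regions:
        for i in range(start, end):
            in_gene[i] = True
    # Gene letters are always kept; other letters survive with probability 1 - prob.
    return "".join(
        c for i, c in enumerate(seq)
        if in_gene[i] or rng.random() >= prob
    )
```

Passing a seeded generator such as `random.Random(0)` makes a generated dataset reproducible.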

Table 6-2 presents the results for the 16% divergent dataset. Due to space limitations, results for other divergent datasets are not shown. The experiments show that the coverage and the number of primers decrease, whereas the number of contigs increases. The coverage is slightly more than 57%. However, the quality drop is very small given that the sequences are altered by 16%. We observe that the quality gradually drops as the divergence increases (results not shown). Another important observation is that MAPPIT achieves higher quality using our weighted multiple sequence alignment method compared to ClustalW. This shows that ClustalW is more suitable for highly similar sequences, whereas our weighted multiple alignment is more suitable for genomes with variations in non-coding regions.

Comparison of proposed strategies: We compare the two methods for constructing the primer candidate set. We show the evaluations in Table 6-3 for the multiple sequence alignment-based and motif-based primer identification strategies. For the motif-based strategy,



























we show the results using the order independent and order dependent approaches, indicated in the table by non-order-MAPPIT and order-MAPPIT, respectively. Motif-based strategies have better coverage than the multiple alignment-based strategy in all experiments. This is because multiple alignment takes all the letters from the references into consideration, including the non-coding regions. As a result, variations in less conserved regions reduce the support of the primers in conserved regions, as they cause shifts in alignments. The order independent motif-based strategy has the highest coverage in all the experiments. The reason is that it produces more candidate primers as the order criteria is relaxed. The average coverage of this strategy is 81%. This is a significant improvement over our multiple alignment-based strategy.

Table 6-3 also shows the coverage and the number of contigs computed on the reference sequences as discussed in Section 6.3.3. The results show that the estimated quality values from the reference sequences are similar to the actual values computed from the target sequence. Thus, we conclude that the evaluation strategy proposed in Section 6.3.3 is accurate.

Evaluation of impact of number of references: Here, we test the effects of the number of reference sequences. We use the hit ratio to evaluate the methods. This value shows the accuracy of the primers found. We carry out the following steps. First, we select a target sequence from our dataset. We then select k sequences randomly from the remaining sequences as references, so that all of them are different from the target sequence. We then run our program on these k sequences and find the primer pairs. We compute the coverage and the number of contigs these primer pairs produce on the target sequence. We repeat this process 10 times for each possible target sequence, each time selecting a new set of references. Thus we carry out 70 experiments (7 targets, 10 tests per target). We report the average values of all these experiments.

Table 6-4 shows the results. The hit ratio usually increases as k increases. This agrees with our assumption that more reference sequences yield higher quality primers. The































Table 6-4. Effects of the number of reference sequences. The multiple sequence alignment-based method uses the hierarchical clustering algorithm and a gap open/extension scoring scheme. Non-order-MAPPIT and order-MAPPIT stand for the order independent and order dependent strategies, respectively, when applying the motif-based method.

                weighted-MAPPIT        non-order-MAPPIT       order-MAPPIT
# References    Coverage  Hit Ratio    Coverage     Hit Ratio Coverage  Hit Ratio
2               32010     0.749        30282        0.290     32680     0.770
3               26476     0.820        35055        0.668     27128     0.835
4               25528     0.844        [illegible]  0.587     32406     0.771
5               25490     0.852        35245        0.715     28697     0.817
6               24629     0.862        31904        0.910     26401     0.952


coverage of the multiple alignment-based strategy increases as k decreases. This is because this strategy produces more primers for small k. The coverage of the motif-based strategy shows variations. However, it usually increases as k decreases.

6.3.6 Performance Comparison

In this section we evaluate the running time of our methods. Our results show that, on average, our multiple alignment-based method runs for about 270 minutes using our weighted alignment strategy. The same method runs in 195 minutes using ClustalW. Our motif-based method runs in 23 and 13 minutes for the order dependent and order independent strategies, respectively. These running times are significant improvements over the current ASAP strategy, which requires manual inspection of the multiple alignment, given that the considered sequences are 40K to 150K bases long.

6.3.7 Wet-lab Verification

The computational method was assessed in the laboratory for efficacy. Primer pairs identified using the computational method described above were tested using actual polymerase chain reaction in a wet-lab experiment. Eight primer pairs were selected at random; the corresponding DNA oligonucleotides were synthesized and used to attempt to amplify target regions from 12 different plant genera (Figure 6-5). Of these, 9 plants are somewhat related and 3 represent ancient or highly-diverged species. Pea lacks the










Table 6-5. Eight randomly selected primer pairs, their locations on sequence 1879, the length of the segment identified by the primers and the genes that they land on. The negative value indicates that the primers landed in incorrect order.

#   Primer pair   Location in 1879   Size (base pairs)   Forward        Reverse
1   5             5279-6223          944                 rps16          Intergenic
2   17            16637-17945        1308                rps2           rpoC2
3   36            37730-39512        1782                ycf9           psaA
4   99            99061-100222       1161                ndhB           rps12 Intron
5   100           100379-100451      -97                 rps12 Intron   rps12 Intron
6   101           100690-101964      1274                rps12          orf131
7   102           101927-102811      884                 orf131         16S
8   150           151524-151976      452                 ycf2           ycf2


inverted repeat region and thus is very different from the other plastid genomes sampled here. Ginkgo, an ancient gymnosperm, and Equisetum, a pteridophyte, are ancestors of modern flowering plants and exhibit a high degree of sequence dissimilarity. The primers devised by the computational method were mapped onto the tobacco chloroplast genome (1879), and Table 6-5 summarizes the sequence locations, expected sizes, and annealing sites of the forward and reverse primers.

From Table 6-5 the following features are evident:

1. Computationally identified primer pairs anneal mainly to coding regions or to conserved introns between the genes. This was one of the prerequisites for efficient primer identification, and it demonstrates that the new multiple sequence alignment method is promising for this specific purpose.

2. The size of the amplified regions ranges from 452 to 1782 base pairs. The optimal primer set will amplify regions ranging from 800 to 1200 base pairs, which makes the amplified products more amenable to sequencing.

3. Primer pair 100 (row 5 of Table 6-5) represents divergent primers in 1879; thus no product is visible there or in any other species except maize, where there is an annealing site that produces an amplicon of the expected size. This illustrates the potential of the method as applicable to divergent plant species.
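The second observation can be checked mechanically. The following is a minimal sketch, assuming the sizes listed in Table 6-5 and the 800 to 1200 base pair range stated above; the names `AMPLICON_SIZES`, `classify`, and `summarize` are hypothetical and are not part of MAPPIT.

```python
# Sizes (in base pairs) as listed in Table 6-5, keyed by primer pair number.
# A negative size means the primers landed in the incorrect order.
AMPLICON_SIZES = {
    5: 944, 17: 1308, 36: 1782, 99: 1161,
    100: -97, 101: 1274, 102: 884, 150: 452,
}

# Preferred amplicon range for sequencing, as stated in the text.
MIN_BP, MAX_BP = 800, 1200


def classify(size_bp):
    """Label one amplicon size relative to the preferred sequencing range."""
    if size_bp <= 0:
        return "wrong order"
    if size_bp < MIN_BP:
        return "too short"
    if size_bp > MAX_BP:
        return "too long"
    return "preferred"


def summarize(sizes):
    """Map each primer pair number to its label."""
    return {pair: classify(size) for pair, size in sizes.items()}


if __name__ == "__main__":
    for pair, label in summarize(AMPLICON_SIZES).items():
        print(pair, AMPLICON_SIZES[pair], label)
```

Under these assumptions, only a subset of the eight sampled pairs falls inside the preferred range, which is consistent with the text describing 800 to 1200 base pairs as a target for the optimal primer set rather than a property of every candidate.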













































Figure 6-5. Polymerase chain reaction samples were analyzed on an agarose gel by
electrophoresis. Column M represents a standard DNA size ladder. Columns
labeled 5, 17, 36, 99, 100, 101, 102 and 150 represent the primer pairs chosen
at random from the computational dataset. White bands in each column
represent amplified DNA from each primer pair in a given plant sample. Note
that primer pair 100 does not produce an amplified product in most plants
except for maize (see Table 6-5). Ginkgo and Equisetum represent ancestral
samples used to test the limits of this approach. Although they are highly
divergent in sequence content and position, some coverage was obtained,
indicating that the method will be highly useful on contemporary crop species.
(This figure was created by Amit Dhingra.)









CHAPTER 7
CONCLUSION

We considered problems in multiple sequence alignment and developed window-based solutions. We also addressed the problem of using multiple sequences in DNA sequencing. The hypothesis behind our algorithms is that a large sequence alignment problem can be divided into smaller ones, and that a semi-optimal alignment of the original sequences can then be reached by combining the solutions of the smaller problems.

First, we considered the problem of optimizing the SP (Sum-of-Pairs) score for multiple protein sequence alignment. We developed a graph-based algorithm called QOMA (Quasi-Optimal Multiple Alignment). QOMA first constructs an initial alignment of the sequences, using a method we developed that is based on the optimal alignments between all pairs of sequences. QOMA represents this alignment using a K-partite graph. It then improves the SP score of the initial alignment by iteratively placing a window on it and optimizing the alignment within this window. QOMA uses two strategies to permit flexibility in the time/accuracy trade-off: (1) adjusting the sliding window size, and (2) tuning from a complete K-partite graph to a sparse K-partite graph for local optimization of the window. Unlike traditional tools, QOMA can be independent of the order of the sequences. The experimental results on BAliBASE benchmarks show that QOMA produces higher SP scores than existing tools, including ClustalW, ProbCons, MUSCLE, T-Coffee and DCA. QOMA achieves a slightly better SP score with the complete K-partite graph strategy than with the sparse K-partite graph strategy. This QOMA work has been accepted by the journal Bioinformatics.
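As a rough illustration of the quantity that QOMA optimizes inside each window, the sketch below computes the SP score of an alignment, optionally restricted to a range of columns. The scoring constants (`MATCH`, `MISMATCH`, `GAP`) and the function names are assumptions made for this example only; a realistic evaluation of protein alignments would typically use a substitution matrix such as BLOSUM62 instead.

```python
from itertools import combinations

# Hypothetical scoring scheme, chosen only to keep the example small.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a, b):
    """Score one pair of aligned symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sp_score(alignment, start=0, end=None):
    """Sum-of-Pairs score of columns [start, end) of an alignment.

    `alignment` is a list of equal-length strings, one per sequence.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have equal length")
    end = length if end is None else end
    total = 0
    for col in range(start, end):
        column = [row[col] for row in alignment]
        # Sum the pairwise scores over every pair of sequences in the column.
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


if __name__ == "__main__":
    aln = ["AC-T", "ACGT", "A-GT"]
    print(sp_score(aln))        # whole alignment
    print(sp_score(aln, 1, 3))  # only the window covering columns 1 and 2
```

With a column-wise scheme like this one, the score of the whole alignment equals the sum of the scores of disjoint windows, which is what makes improving one window at a time a natural strategy. That additivity would not hold exactly under affine gap penalties.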

Second, we considered the problem of multiple alignment for a large number of protein sequences, with the goal of achieving a large SP (Sum-of-Pairs) score. We introduced the QOMA2 algorithm, which is practical for aligning a large number of protein sequences. QOMA2 selects short subsequences from the sequences to be aligned by placing a window on their (potentially sub-optimal) alignment. The window is positioned over the subsequences that have the highest improvement potential. QOMA2 then partitions the subsequences within each window into clusters such that the number of subsequences in each cluster is small enough to be optimally aligned within a given time. The experimental results on BAliBASE benchmarks show that QOMA2 produces alignments with high SP scores quickly.
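The two steps described above, cutting subsequences out of a window and splitting them into clusters of bounded size, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the fixed-size split in input order is a stand-in for QOMA2's actual clustering and refinement, which this chapter does not detail.

```python
def window_subsequences(alignment, start, width):
    """Cut columns [start, start + width) from each row and strip the gaps.

    The result is the list of short, unaligned subsequences covered by the
    window, one per input sequence.
    """
    return [row[start:start + width].replace("-", "") for row in alignment]


def partition(items, max_cluster_size):
    """Split items into consecutive clusters of at most max_cluster_size.

    Assumption: a naive split by input order, used here only as a
    placeholder for a similarity-based clustering.
    """
    if max_cluster_size < 1:
        raise ValueError("max_cluster_size must be at least 1")
    return [
        items[i:i + max_cluster_size]
        for i in range(0, len(items), max_cluster_size)
    ]


if __name__ == "__main__":
    aln = ["AC-TG", "ACGTG", "A-GTG", "ACG-G", "AC-TG"]
    subs = window_subsequences(aln, start=1, width=3)
    for cluster in partition(subs, max_cluster_size=2):
        print(cluster)
```

The bound on cluster size is the knob that trades accuracy for time: each cluster must be small enough that its subsequences can be aligned optimally within the available time budget.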

Third, we considered the problem of constructing a biologically meaningful multiple sequence alignment. We developed a new algorithm called HSA. HSA uses SSE (secondary structure element) types in addition to amino acid information to group the input protein residues. It then adjusts the residue positions according to the groups and constructs a graph. HSA slides a window from the beginning to the end of the graph and finds cliques in the window. HSA concatenates these cliques to form the final alignment. Unlike existing progressive multiple sequence alignment methods, HSA builds up the final alignment by considering all sequences at once. Experimental results show that HSA achieves high accuracy while maintaining a competitive running time. The quality improvement over existing tools is more significant for low-similarity sequences. Our HSA work was published in PSB 2006.

The last problem is to assist primer prediction in DNA sequencing by using multiple sequences. We developed a method called MAPPIT. MAPPIT uses two novel computational approaches to identify consensus primer pairs from a set of reference sequences, enabling cost-effective and rapid acquisition of DNA sequence from plastid genomes. The first approach uses a multiple alignment of the reference sequences. The second finds motifs in the reference sequences that have sufficient support. We developed two solutions for the second approach: order independent and order dependent. In our experiments, the coverage of the primer pairs found by our methods was significantly higher than that of Primer3, an existing primer identification tool. Our wet-lab experiments verified that the primers found by our methods can actually amplify homologous target genomes. We believe rapid sequence information acquisition using MAPPIT will be vital for ongoing efforts to engineer plastid genomes for the benefit of agricultural crops and for phylogenetic studies.
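The alignment-based approach can be illustrated with a simplified sketch that scans a multiple alignment of reference sequences for runs of conserved columns long enough to host a primer. It assumes full conservation with no gaps, which is stricter than a support threshold, and the function name `conserved_runs` and the parameter `min_len` are hypothetical rather than taken from MAPPIT.

```python
def conserved_runs(alignment, min_len):
    """Find maximal runs of fully conserved, gap-free columns.

    Returns a list of (start, end, consensus) tuples with end exclusive,
    keeping only runs of at least `min_len` columns.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have equal length")
    runs = []
    run_start = None
    # Iterate one position past the end so that a trailing run is flushed.
    for col in range(length + 1):
        conserved = False
        if col < length:
            symbols = {row[col] for row in alignment}
            conserved = len(symbols) == 1 and "-" not in symbols
        if conserved:
            if run_start is None:
                run_start = col
        else:
            if run_start is not None and col - run_start >= min_len:
                runs.append((run_start, col, alignment[0][run_start:col]))
            run_start = None
    return runs


if __name__ == "__main__":
    references = [
        "ACGTACGTTTGCA",
        "ACGTAGGTTTGCA",
        "ACGTA-GTTTGCA",
    ]
    for start, end, consensus in conserved_runs(references, min_len=5):
        print(start, end, consensus)
```

Candidate regions found this way would still need to be paired into forward and reverse primers and filtered by amplicon size and the usual primer design criteria, none of which this sketch attempts.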

In summary, we addressed four problems in multiple sequence alignment and provided solutions based on a divide-and-conquer strategy. We first developed a novel algorithm to optimize an existing alignment and implemented it in the tool QOMA. Building on the QOMA algorithm, we then developed an algorithm to process a large number of sequences, implemented as QOMA2. We also developed an algorithm that creates a biologically meaningful alignment by applying secondary structure information during alignment. Last, we applied multiple sequence alignment to primer identification for DNA sequencing. The hypothesis behind our algorithms is that a large sequence alignment problem can be divided into smaller ones, and that a semi-optimal alignment of the original sequences can be reached by combining the solutions of the smaller problems. The experimental results show that this divide-and-conquer hypothesis is useful in multiple sequence alignment.










REFERENCES


[1] J. Thompson, D. Hi----lin- and T. Gibson, "CLUSTAL W: Improving the
Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting,
Position-specific Gap Penalties and Weight Matrix Ch..s..1 ~" Nucleic Acids Research,
vol. 22, no. 22, pp. 467:34680, 1994.

[2] C. Notredame, D. Hi- lis- and J. Heringa, "T-coffee: a novel method for fast and
accurate multiple sequence alignment," Journal of M~olecular B.:. I J-it;, vol. :302, no. 1,
pp. 205-217, 2000.

[:3] D. T. Jones, "Protein Secondary Structure Prediction based on Position-Specific
Scoring Matrices," Journal of M~olecular B: .I J..;,i vol. 292, no. 2, pp. 195-202, 1999.

[4] A. Phillips, D. Janies, and W. Wheeler, j11nlspl Sequence Alignment in
Phylogenetic Analysis," M~olecular Ph tl. .I~,i 1.:. H. and Evolution, vol. 16, no. :3,
pp. :317-3:30, 2000.

[5] J. Thompson, H. Plewniak, and O. Poch, "A comprehensive comparison of
multiple sequence alignment programs," Nucleic Acids Research, vol. 27, no. 1:3,
pp. 2682-2690, 1999.

[6] W. N. Grundy, 1-` lIn!ly-based Homology Detection via Pairwise Sequence
Comparison," in Annual C'onference on Research in C'omp~utational M~olecular
B..~~I J..;,i (REC'OMB '98), 1997, pp. 94-100.

[7] S. S. Gross and 31. R. Brent, "Using multiple alignments to improve gene
prediction.," in REC'OMB, 2005, pp. :374-388.

[8] C. Burge and S. K~arlin, Prediction of complete gene structures in human genomic
DNA., vol. 268, J. Alol. Biol., 1997.

[9] A. E. Tenney, R. H. Brown, C. Vaske, J. K(. Lodge, T. L. Doering, and 31. R.
Brent, "Gene prediction and verification in a compact genome with numerous small
introns," Genome Research, vol. 14, no. 11, pp. 2:330-2:335, 2004.

[10] J. D. Palmer, "Comparative organization of chloroplast genomes," Annual Review of
Genetics, vol. 19, no. 1, pp. :325-354, 1985.

[11] T. 31. Przytycka, G. Davis, N. Song, and D. Durand, "Graph theoretical insights
into evolution of multidomain proteins.," in REC'OMB, 2005, pp. :311-325.

[12] L. Falquet, 31. Pagni, P. Bucher, N. Hulo, C. J. Sigrist, K(. Hofmann, and
A. Bairoch, "The prosite database, its status in 2002.," Nucleic Acids Research, vol.
:30, no. 1, pp. 2:35-238, January 2002.

[1:3] T. K(. Attwood, 31. D. R. Croning, D. R. Flower, A. P. Lewis, J. E. Alabey,
P. Scordis, J. N. Selley, and W. Wright, "Prints-s: the database formerly known
as prints.," Nucleic Acids Research, vol. 28, no. 1, pp. 225-227, 2000.










[14] M. Gribskov, A. McLachlan, and D. Eisenberg, "Profile analysis: detection of
distantly related proteins.," Proceedings of the National A .<.1. I,,;t of Sciences USA,
vol. 84, no. 13, pp. 4355-4358, 1987.

[15] D. Haussler, A. K~rogh, I. Mian, and K(. Sjolander, "Protein modeling using hidden
markov models: Analysis of globins," in Hawaii International Conference on S;, 1.ii
Science, Los Alamitos, CA, 1993, Hawaii International Conference on Systems
Science, vol. 1, pp. 792 -802, IEEE Computer Society Press.

[16] R. Luthy, I. Xenarios, and P. Bucher, lInsim ingll the sensitivity of the sequence
profile method," Protein Science, vol. 3, no. 1, pp. 139-146, January 1994.

[17] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones,
A. K~hanna, M. Marshall, S. Moon, E. L. Sonnhammer, D. J. Studholme, C. Yeats,
and S. R. Eddy, "The pfam protein families database.," Nucleic Acids Res, vol. 32
Database issue, January 2004.

[18] S. Altschul, T. Madden, A. Schaffer, J. Z1!I. .1 Z. Z1! I.1, W. Miller, and D. Lipman,
"Gapped blast and psi-blast: a new generation of protein database search
programs," Nucleic Acids Res., vol. 25, no. 17, pp. 3389-3402, 1997.

[19] I. K~orf, P. Flicek, D. Duan, and M. R. Brent, lIst.~ gI .1ni genomic homology into
gene structure prediction," Bioinformatics, vol. 17, no. 90001, pp. 140S-148, 2001.

[20] J. Flannick and S. Batzoglou, "Using multiple alignments to improve seeded local
alignment algorithms," Nucleic Acids Research, vol. 33, no. 15, pp. 4563-4577, 2005.

[21] R. G. S. P. Consortium, "Genome sequence of the brown 1!. i-- li- rat yields insights
into mammalian evolution," Nature, vol. 428, pp. 493-521, 2004.

[22] P. Havlak, R. C'I, i., K(. J. Durbin, A. Egan, Y. Ren, X.-Z. Song, G. M. Weinstock,
and R. A. Gibbs, "The Atlas Genome Assembly System," Genome Research, vol. 14,
no. 4, pp. 721-732, 2004.

[23] M. Roberts, B. R. Hunt, J. A. Yorke, R. A. Bolanos, and A. L. Delcher, "A
preprocessor for shotgun assembly of large genomes," Journal of Comp~utational
B/. J..,~it; vol. 11, no. 4, pp. 734-752, 2004.

[24] S. Schwartz, W. J. Kent, A. Smit, Z. Z1! I.1. R. Baertsch, R. C. Hardison,
D. Haussler, and W. Miller, "Human-Mouse Alignments with BLASTZ," Genome
Research, vol. 13, no. 1, pp. 103-107, 2003.

[25] T. K~ahveci, V. Ljosa, and A. K(. Singh, "Speeding up whole-genome alignment by
indexing frequency vectors," Bioinformatics, vol. 20, no. 13, pp. 2122-2134, 2004.

[26] A. Apostolico, M. Comin, and L. Parida, "Conservative extraction of
over-represented extensible motifs," Bioinformatics, vol. 21, no. suppl-1, pp.
i9-18, 2005.










[27] L. Wang and T. Jiang, "On the complexity of multiple sequence alignment," Journal
of Computational B..~~I J..;,i vol. 1, no. 4, pp. 337-348, 1994.

[28] S. B. Needleman and C. D. Wunsch, "A General Method Applicable to the Search
for Similarities in the Amino Acid Sequence of Two Proteins," Journal of M~olecular
B.. J..~l,it vol. 48, pp. 443-53, 1970.

[29] D. Lipman, S. Altschul, and J. K~ececioglu, "A Tool for Multiple Sequence
Alignment," Proceedings of the National A .<.l~1, ren of Sciences of the United States of
America (PNAS), vol. 86, no. 12, pp. 4412-4415, 1989.

[30] G. SK(, K(. JD, and S. AA, linsiuinglb the Practical Space and Time Efficiency of
the Shortest-paths Approach to Sum-of-pairs Multiple Sequence Alignment," Journal
of Computational B..~~I J..;,i vol. 2, no. 3, pp. 459, 1995.

[31] D. Feng and R. Doolittle, "Progressive Sequence Alignment As A Prerequisite To
Correct Phylogenetic Trees," Journal Of M~olecular Evolution, vol. 25, no. 4, pp.
351-360, 1987.

[32] A. Dhingra and K(. M. Folta, "ASAP: Amplification, sequencing & annotation of
plastomes," BM~C Genomics, vol. 6, pp. 176, 2005.

[33] e. a. Jansen R. K(., L. A. Raubeson, i l, I ha~ds for obtaining and analyzing whole
chloroplast genome sequences. methods in enzymology," in M~ethods in F,...;;;;;. J..~I,i
Academic Press, 2005, pp. 348-384.

[34] J. Thompson, F. Plewniak, and O. Poch, "A comprehensive comparison of
multiple sequence alignment programs," Nucleic Acids Research, vol. 27, no. 13,
pp. 2682-2690, 1999.

[35] C. Notredame, "Recent progress in multiple sequence alignment: a survey,"
Phonen .-,i. e..> :~mics, vol. 3, no. 1, pp. 131-44, 2002.

[36] N. Chia and R. Bundschuh, "A practical approach to significance assessment in
alignment with gaps.," in RECOM~B, 2005, pp. 474-488.

[37] T. Jiang, Y. Xu, and M. Q. Zhang, Current Topics in Computational M~olecular
B..~~I J..;,i The MIT Press, University of California, Riverside, 2002.

[38] D. J. Bacon and W. F. Anderson, \l,!ul~l sp sequence alignment," Journal of
Molecular B..~~I J..;,i vol. 191, pp. 153-161, 1986.

[39] V. Bafna, E. L. Lawler, and P. A. Pevzner, "Approximation algorithms for multiple
sequence aligmnent," Theoretical Computer Science, vol. 182, no. 1-2, pp. 233-244,
1997.

[40] H. Carrillo and D. Lipman, "The multiple sequence alignment problem in biology,"
SIAM~ Journal on Applied M~ath, vol. 48, no. 5, pp. 1073-1082, 1988.










[41] "Book review: Algorithms on strings, trees, and sequences: computer science and
computational biology by dan gusfield (: Cambridge university press, cambridge,
england, 1997)," SIGACT News, vol. 29, no. 3, pp. 43-46, 1998, Reviewer-Gary
Benson.

[42] D. Gusfield, "Efficient methods for multiple sequence alignment with guaranteed
error bounds.," Bulletin of M~athematical B..~~I J..;,i vol. 55, no. 1, pp. 141-54, 1993.

[43] D. J. Lipman, S. F. Altschul, and J. D. K~ececioglu, "A tool for multiple sequence
alignment," Proceedings of the National A .<.1. I,,;t of Sciences of the United States of
America, vol. 86, pp. 4412-4415, 1989.

[44] B. Ma, L. Wang, and M. Li, \* I.r optimal multiple alignment within a band in
polynomial time," Journal of Computer and System Sciences, vol. 73, no. 6, pp.
997-1011, 2007.

[45] C. Lee, C. Grasso, and M. Sharlow, jl\lt1 !1ple sequence alignment using partial order
graphs," Bioinformatics, vol. 18, no. 3, pp. 452-464, 2002.

[46] I. Walle, I. Lasters, and L. Wyns, "Align-m-a new algorithm for multiple alignment
of highly divergent sequences," Bioinformatics, vol. 20, no. 9, pp. 1428-1435, 2004.

[47] J. Stci; V. Moulton, and A. W. M. Dress, "Dea: an efficient implementation of
the divide-and-conquer approach to simultaneous multiple sequence alignmentt."
Computer Applications in the Biosciences, vol. 13, no. 6, pp. 625-626, 1997.

[48] S. F. Astschul and D. J. Lipman, "Trees, stars, and multiple biological sequence
alignment," SIAM~ Journal on Applied M~ath, vol. 49, no. 1, pp. 197-209, 1989.

[49] D. Sankoff, 11.1.111. i1 mutation trees of sequences," SIAM~ Journal on Applied
Mathematics, vol. 28, no. 1, pp. 35-42, 1975.

[50] M. S. Waterman, Introduction to Computational B..J.-it~;, Map~s, Sequences and
Genomes, June 1995.

[51] S. Henikoff and J. Henikoff, "Amino Acid Substitution Matrices from Protein
Blocks," Proceedings of the National A .<.l~1, ren of Sciences, vol. 89, no. 22, pp.
10915-10919, 1992.

[52] R. Schwarz and M. Dayhoff, 11 Il .:es for detecting distant relationships," Atlas of
protein sequences, pp. 353 -58.

[53] D. Sankoff, R. Cedergren, and G. Lapalme, "Frequency of insertion-deletion,
transversion, and transition in the evolution of 5s ribosomal rna.," J M~ol Evol, vol.
7, no. 2, pp. 133-49, 1976.

[54] J. Thompson, F. Plewniak, and O. Poch, "BAliBASE: a benchmark alignment
database for the evaluation of multiple alignment programs," Bioinformatics, vol. 15,
no. 1, pp. 87-88, 1999.










[55] R. Baeza-Yates and G. T N.1- Iro, N .1-- and faster filters for multiple approximate
string matching," Random Struct. Algorithms, vol. 20, no. 1, pp. 23-49, 2002.

[56] G. T N.- Ir o, 11\!ultiple~ approximate string matching by counting," in Proc. of
WSP'97. 1997, pp. 125-139, Carleton University Press.

[57] R. A. Baeza-Yates and G. T N.1- Iro, 1-` I-I. r approximate string matching," Algorith-
mica, vol. 23, no. 2, pp. 127-158, 1999.

[58] W. I. C'I I1.; and E. L. Lawler, "Sublinear expected time approximate string
matching and biological applications," Tech. Rep. 4/5, EECS Department,
University of California, Berkeley, 1994.

[59] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences,"
Journal of M~olecular B:*.I J..;,i March 1981.

[60] O. Gotoh, "An improved algorithm for matching biological sequences," Journal of
Molecular B..~~I J..;,i vol. 162, no. 3, pp. 705-708, 1982.

[61] S. K(. Gupta, J. D. K~ececioglu, and A. A. Schaffer, lInspui,-ing the practical space
and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence
alignment," Journal of Comp~utational B..~~I J-it;, vol. 2, no. 3, pp. 459-462, 1995.

[62] J. Stci--, \!ltl Iple: sequence alignment with the divide-and-conquer method ,"
Gene, vol. 211, no. 2, pp. GC45-GC56, 1998.

[63] M. Brudno, C. B. Do, G. M. Cooper, M. F. K~im, E. Davydov, N. C. S. Program,
E. D. Green, A. Sidow, and S. Batzoglou, "LAGAN and Multi-LAGAN: Efficient
Tools for Large-Scale Multiple Alignment of Genomic DNA," Genome Research, vol.
13, no. 4, pp. 721-731, 2003.

[64] A. Delcher, S. K~asif, R. Fleischmann, J. Peterson, O. White, and S. Salzberg,
"Alignment of whole genomes," Nucleic Acids Research, vol. 27, no. 11, pp.
2369-2376, 1999.

[65] E. W. Myers and W. Miller, "Optimal alignments in linear space," Comp~uter
Applications in the Biosciences, vol. 4, no. 1, pp. 11-17, 1988.

[66] M. Brudno, M. C.! 11pt., .1', B. Gottgens, S. Batzoglou, and B. Morgenstern, 1-` I-I
and sensitive multiple alignment of large genomic sequences," BM~C Bioinformatics,
vol. 4, no. 66, 2003.

[67] A. Policriti, N. Vitacolonna, M. Morgante, and A. Zuccolo, "Structured motifs
search.," in RECOM~B, 2004, pp. 133-139.

[68] K(. P. Choi, F. Zeng, and L. Zhang, "Good spaced seeds for homology search,"
Bioinformatics, vol. 20, no. 7, pp. 1053-1059, 2004.










[69] B. Ma, J. Tromp, and 31. Li, "PatternHunter: faster and more sensitive homology
search," Bioinformatic~s, vol. 18, no. :3, pp. 440-445, 2002.

[70] P. A. S. Nuin, Z. War,1 and E. R. 31. Tillier, "The accuracy of several multiple
sequence alignment programs for proteins," BMG' Bioinformatic~s, vol. 7, pp. 471+
October 2006.

[71] F. Corpet, j1\!ult pl sequence alignment with hierarchical <1st-1. 1 1). Nucleic Acids
Research, vol. 16, no. 22, pp. 10881-10890, 1988.

[72] J. Hein, "A new method that simultaneously aligns and reconstructs ancestral
sequences forany number of homologous sequences, when the phylogeny is given,"
Molecular B..~~I J..;,i and Evolution, vol. 6, no. 6, pp. 649-668, 1989.

[7:3] C. Grasso and C. Lee, "Combining partial order alignment and progressive multiple
sequence alignment increases alignment speed and salability to very large alignment
problems," Bioinformatic~s, vol. 20, no. 10, pp. 1546-1556, 2004.

[74] C. Lee, C. Grasso, and 31. F. Sharlow, \!lt11 !1ple sequence alignment using partial
order graphs ," Bioinformatic~s, vol. 18, no. :3, pp. 452-464, 2002.

[75] K(. K~atoh, K(. Alisawa, K(. K~uma, and T. Miyata, j1lAFFT: a novel method for
rapid multiple sequence alignment based on fast Fourier transform," Nucleic Acids
Research, vol. :30, no. 14, pp. :3059-3066, 2002.

[76] S.-H. Sze, Y. Lu, and Q. Yang, "A Polynomial Time Solvable Formulation Of
Multiple Sequence Alignment," in International C'onference on Research in C'ompu-
testional M~olecular B..~~I J..;,i (REC'OMB), 2005, pp. 204-216.

[77] R. Thomsen, G. B. Fogel, and T. K~rink, lInp1I~i.-. in,~ 10 of Clustal-Derived Sequence
Alignments with Evolutionary Algorithms," in C'ongress on Evolut... .. r,;, C'om~utte-
tion, 200:3, vol. 1, pp. :312-319.

[78] R. Edgar, \!USCLE: multiple sequence alignment with high accuracy and high
throughput," Nucleic Acids Research, vol. :32, no. 5, pp. 1792-1797, 2004.

[79] 31. Sammeth, B. Morgenstern, and J. Stc., <- "Divide-and-conquer multiple alignment
with segment-hased constraints," Bioinformatic~s, vol. 19, no. 90002, pp. iil89-195,
200:3.

[80] K. K~atoh, K(. Alisawa, K(.-i. K~uma, and T. Miyata, j1lAFFT: a novel method for
rapid multiple sequence alignment based on fast Fourier transform," Nucleic Acids
Research, vol. :30, no. 14, pp. :3059-3066, 2002.

[81] A. K~rishnan, K(.-B. Li, and P. Issac, "Rapid detection of conserved regions in protein
sequences using wavelets.," In Silico B..~~I J..;,i vol. 4, 2004.

[82] K(. R. Rasmussen, J. Stci-;< and E. W. Myers, "Efficient q-gram filters for finding all
epsilon-matches over a given length.," in REC'OMB, 2005, pp. 189-20:3.










[8:3] B. Morgenstern, K(. Fr-ech, A. Dress, and T. Werner, "DIALIGN: Finding Local
Similarities by Multiple Sequence Alignment," Biainformatic~s, vol. 14, no. :3, pp.
290-294, 1998.

[84] B. Morgenstern, "DIALIGN 2: improvement of the segment-to-segment approach to
multiple sequence alignment," Bioinformatic~s, vol. 15, no. :3, pp. 211-218, 1999.

[85] X. Huang and W. Miller, "A time-efficient, linear-space local similarity algorithm,"
Advances in Applied M~athematics, vol. 12, pp. :337-:357, 1991.

[86] N. Bray and L. Pachter, \! AVID: Constrained Ancestral Alignment of Multiple
Sequences," Genome Research, vol. 14, no. 4, pp. 69:3699, 2004.

[87] O. Gotoh, "Significant Improvement in Accuracy of Multiple Protein Sequence
Alignments by Iterative Refinement as Assessed by Reference to Structural
Alignments," Journal of M~olecular B:C I. rit;, vol. 264, no. 4, pp. 82:38:38, 1996.

[88] C. Do, 31. Brudno, and S. Batzoglou, "PROBCONS: Probabilistic Consistency-based
Multiple Alignment of Amino Acid Sequences ," in Intelligent So/;,;Ii;- for Mlolecular
B..~~I J..;,i (15MB), 2004.

[89] E. SR, \!ull sp!.-: Alignment Using Hidden Markov Models," in Intelligent S;,I iii
for M~olecular B:*.I J..;,i (15M~B), 1995, vol. :3, pp. 114-120.

[90] C. Alkan, E. Tuzun, J. Buard, F. Lethiec, E. E. Eichler, J. A. Bailey, and S. C.
Sahinalp, j1 I...1pulating multiple sequence alignments via MaM and WehMahI,"
Nucleic Acids Research, vol. :33, no. suppl2, pp. W295-298, 2005.

[91] J. D. Thompson, J. C. Thierry, and O. Poch, "R ASCAL: rapid scanning and
correction of multiple sequence alignments," Biainformatic~s, vol. 19, no. 9, pp.
1155-1161, 200:3.

[92] S. C'I I1:1 .I~arti, C. Lanczycki, A. Panchenko, T. Przytycka, P. Thiessen, and
S. Bryant, "Refining multiple sequence alignments with conserved core regions.,"
Nucleic Acids Res, vol. :34, no. 9, pp. 2598-606, 2006.

[9:3] E. L. Anson and E. W. Myers, "ReAligner: A program for refining DNA sequence
multi-alignments," in Proceedings of thelst Annual International C'onference on
C'omp~ubstional Mlolecular B..J..~ ~I,i (REC'OMB), Santa Fe, NM, 1997, pp. 9-16, ACijl
Press.

[94] R. Sp .1, 1. Rehmsmeier, and J. Stcos.; "Sequence database search using jumping
alignments," in Intelligent So/;,;Ii;- for Mlolecular B..J. rit~;, (15M~B), 2000, pp.
:367-375.

[95] X. Zhang and T. K~ahveci, "QOMA2: Optimizing the alignment of noanyr: sequences,"
IEEE International C'onference on Bioinformatic~s and Bioengineering (BIBE), vol.
2, pp. 780-787, 2007.










[96] D. S. Hochbaum, Approxrimation Algorithms for NP-Heard Problems, PWS
Publishing Company, Department of Industrial Engineering, Operations Research,
Etcheverry Hall, University of California, Berkeley, CA 94720-1777, 1996.

[97] P. A. Pevzner, \!ulll p!--~ alignment, communication cost, and graph matching,"
SIAAF Journal on Applied M~athematics, vol. 52, no. 6, pp. 176:31779, 1992.

[98] D. Sankoff and J.B. K~ruskal and J.P. K~ruskal, Time Warps. String Edits. and
Mabcromolecules: The Theory and Practice of Sequence C'omp~arison, Cambridge
University Press, 1999, ISBN: 1575862174.

[99] X. Zhang and T. K~ahveci, "QOMA: quasi-optimal multiple alignment of protein
sequences," Bioinformatic~s vol. 2:3, no. 2, pp. 162-168, 2007.

[100] X. Zhang and T. K~ahveci, "A New Approach for Alignment of Multiple Proteins,"
in P i. I. Symp~osium on B: -~ r e; /.:, t (PSB), 2006, pp. :339-350.

[101] P. Bonizzoni and G. D. Vedova, "The complexity of multiple sequence alignment
with SP-score that is a metric," Theoretical C'omp~uter Science, vol. 259, no. 1-2, pp.
6:379, 2001.

[102] 31. Li, B. Ala, and L. Wang, \* I.r optimal multiple alignment within a hand in
polynomial time," pp. 425-434.

[10:3] T. Jiang, E. L. Lawler, and L. Wang1 "Aligning sequences via an evolutionary tree:
complexity and approximation," in STOG' '94: Proceedings of the is, ,,,/i-.sixrth
annual AC'~f symposiumm on Theory of computing, New York, NY, USA, 1994, pp.
760-769, AC il.

[104] 31. Li, B. Ala, and L. Wang, "Finding similar regions in many strings," in STOG'
'99: Proceedings of the thirty-~first annual AC17./symp~osium on Theory of computing,
New York, NY, USA, 1999, pp. 47:3482, ACijL.

[105] 31. Middendorf, "More on the complexity of common superstring and supersequence
problems," Theoretical C'omp~uter Science, vol. 125, no. 2, pp. 205-228, 1994.

[106] W. Just, "Computational complexity of multiple sequence alignment with sp-score,"
1999.

[107] 31. R. Garey and D. S. Johnson, C'omp~uter~s and Intrr;.-lilil...;;: A Guide to the
Ti,' *U of NP-G'omp~leteness, W. H. Freeman, January 1979.

[108] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein
blocks," National A .<.1. I,,;t of Sciences of the United States of America, vol. 89, no.
22, pp. 10915-10919, November 1992.

[109] G. K~arypis and V. K~umar, \!. I !-~ unstructured graph partitioning and sparse
matrix ordering system. version 2.0," Tech. Rep., University of Minnesota,
Department of Computer Science, Minneapolis, MN 55455, August 1995.










[110] G. K~arypis and V. K~umar, "A fast and high quality multilevel scheme for
partitioning irregular graphs," SIAAF Journal on Scienti~fic C'omp~uting, vol. 20,
no. 1, pp. :359-392, 1998.

[111] K(. Reinert, H.-P. Lenhof, P. Mutzel, K(. Mehlhorn, and J. D. K~ececioglu, "A
branch-and-cut algorithm for multiple sequence alignment," in Proceedings of thelst
Annual International C'onference on C'omp~utational M~olecular B..J.-it ~;, (REC'OMB),
Santa Fe, NM, 1997, pp. 241-250, ACijl Press.

[112] P. Bradley, P. S. K~im, and B. Berger, "Trilogy: discovery of sequence-structure
patterns across diverse proteins," in International C'onference on Research in
C'omp~ubstional Mlolecular B..~~I J..;,i (REC'OMB), 2002, pp. 77-88.

[11:3] L. C'I. in~ \ h!1ple Protein Structure Alignment by Deterministic Annealing," in
IEEE computerr S -~... Iri Bioinfortuatic~s C'onference (G'SB'OS), 200:3, vol. 00, p. 609.

[114] G. JF, 31. T, and B. SH, "Surprising similarities in structure comparison," C'urrent
Opinion in Structural B..~~I J-it;, vol. 6, no. :3, pp. :377-385, 1996.

[115] S. V.A. and H. J, "A new method for iterative multiple sequence alignment using
secondary structure prediction," in Intelligent So/;,; Ii;- for M~olecular B..J.-it ~;
(15M~B), 2002.

[116] K(. B. Mullis and F. A. Faloona, "Specific synthesis of dna invitro via a
polymerase-catalyzed chain-reaction.," M~ethods Fr:..uteral pp. 155::335-350, 1987.

[117] X. Huang and A. Aladan, "CAP:3: A DNA Sequence Assembly Program," Genome
Research, vol. 9, no. 9, pp. 868-877, 1999.

[118] 31. Pop, S. L. Salzherg, and 31. Shumway, "Cover feature: Genome sequence
assembly: Algorithms and issues," IEEE-G'OMPUTER, vol. :35, no. 7, pp. 47-54,
July 2002.

[119] J. Fredslund, L. Schauser, L. H. Madsen, N. Sandal, and J. Stougaard, "PriFi:
using a multiple alignment of related sequences to find primers for amplification of
homologfs," Nucleic Acids Research, vol. :33, no. suppl2, pp. W516-520, 2005.

[120] S. Rozen and H. J. Skaletsky, "Primer:3 on the WWW for general users and for
biologist programmers," M~ethods in Mlolecular B..~~I J-it;, pp. :365-386, 2000.









BIOGRAPHICAL SKETCH

Xu Zhang received his master's degree from the Chinese Academy of Sciences in 2002. He is a graduate research assistant in computer and information science and engineering at the University of Florida. His research interests include bioinformatics and [illegible in source], the first of which is the focus of his forthcoming Ph.D.






PAGE 4

This dissertation would not have been possible without the support of many people. Many thanks to my adviser, Tamer Kahveci, who worked with me on our research and read my numerous revisions. Thanks also to my committee members, Alin Dobra, Arunava Banerjee, Christopher M. Jermaine and Kevin M. Folta, who offered guidance and support. Thanks to Amit Dhingra for cooperating with me and giving me a lot of help in the MAPPIT project. Finally, thanks to my parents and numerous friends who endured this long process with me, always offering support and love.

PAGE 5

                                                                          page
LIST OF TABLES .... 7
LIST OF FIGURES .... 8
ABSTRACT .... 9
CHAPTER
1 INTRODUCTION .... 10
2 BACKGROUND .... 16
  2.1 Measurements of Multiple Sequence Alignment .... 16
  2.2 Dynamic Programming Methods .... 17
  2.3 Heuristic Methods .... 18
  2.4 Optimizing Existing Alignments Methods .... 22
  2.5 Approximation Algorithms .... 22
    2.5.1 Our Methods vs. Approximation Methods .... 25
      2.5.1.1 What do "approximatable" and "non-approximatable" mean? .... 25
      2.5.1.2 Why does approximation algorithms do not work for multiple sequence alignment applications? .... 25
      2.5.1.3 Why do our algorithms work? .... 27
    2.5.2 Overview of Approximation Algorithms for Multiple Sequence Alignment .... 28
      2.5.2.1 Hardness Results .... 28
      2.5.2.2 NP-completeness and MAX-SNP-hardness of multiple sequence alignment .... 29
3 OPTIMIZATION OF SP SCORE FOR MULTIPLE SEQUENCE ALIGNMENT IN GIVEN TIME .... 31
  3.1 Motivation and Problem Definition .... 31
  3.2 Current Results .... 32
    3.2.1 Constructing Initial Alignment .... 32
    3.2.2 Improving the SP Score via Local Optimizations .... 35
    3.2.3 QOMA and Optimality .... 36
    3.2.4 Improved Algorithm: Sparse Graph .... 38
    3.2.5 Experimental Evaluation .... 41
4 OPTIMIZING THE ALIGNMENT OF MANY SEQUENCES .... 49
  4.1 Motivation and Problem Definition .... 49
  4.2 Current Results .... 51
  4.3 Aligning a Window .... 55

PAGE 6

    4.3.1 [title missing in source] .... 56
    4.3.2 Clustering .... 57
    4.3.3 Refining Clusters Iteratively .... 59
    4.3.4 Aligning the Subsequences in Clusters .... 63
    4.3.5 Complexity of QOMA2 .... 63
  4.4 Experimental Evaluation .... 64
5 IMPROVING BIOLOGICAL RELEVANCE OF MULTIPLE SEQUENCE ALIGNMENT .... 70
  5.1 Motivation and Problem Definition .... 70
  5.2 Current Results .... 71
    5.2.1 Constructing Initial Graph .... 71
    5.2.2 Grouping Fragments .... 72
    5.2.3 Fragment Position Adjustment .... 75
    5.2.4 Alignment .... 76
    5.2.5 Gap Adjustment .... 77
    5.2.6 Experimental Results .... 78
6 MODULE FOR AMPLIFICATION OF PLASTOMES BY PRIMER IDENTIFICATION .... 83
  6.1 Motivation and Problem Definition .... 84
  6.2 Related Work .... 88
  6.3 Current Results .... 89
    6.3.1 Finding Primer Candidates .... 89
      6.3.1.1 Multiple sequence alignment-based primer identification .... 90
      6.3.1.2 Motif-based primer identification .... 93
    6.3.2 Finding Minimum Primer Pair Set .... 95
    6.3.3 Evaluating Primer Pairs .... 99
    6.3.4 Experimental Evaluation .... 100
    6.3.5 Quality Evaluation .... 101
    6.3.6 Performance Comparison .... 107
    6.3.7 Wet-lab Verification .... 107
7 CONCLUSION .... 110
REFERENCES .... 113
BIOGRAPHICAL SKETCH .... 122

Table

3-1 The average SP scores of QOMA using complete K-partite graph
3-2 The average SP scores of QOMA and five other tools
3-3 The improvement of QOMA
3-4 The average ($\mu$) and standard deviation ($\sigma$) of the error, $S^* - SP$, for a window using sparse version of QOMA
3-5 The running time of QOMA (in seconds)
4-1 The list of variables used in this chapter
4-2 The average SW and SP scores of individual windows
4-3 The average SP scores of QOMA2 for individual windows
4-4 The average SP scores of the alignments of the entire benchmarks
4-5 The average SP scores of QOMA2 and other tools
5-1 The BAliBASE score of HSA and other tools. Less than 25% identity.
5-2 The BAliBASE score of HSA and other tools. 20%-40% identity.
5-3 The BAliBASE score of HSA and other tools. More than 35% identity.
5-4 The SP score of HSA and other tools.
5-5 The running time of HSA and other tools (measured in milliseconds).
6-1 Comparison of Primer3 and using multiple sequence alignment in step 1
6-2 Comparison of using different sources of alignment
6-3 Comparison of multiple sequence alignment-based methods and motif-based methods in step 1
6-4 Effects of the number of reference sequences
6-5 Eight randomly selected primer pairs

Figure

1-1 An example of multiple sequence alignment
2-1 An example to show the meaninglessness of alignments with approximation ratio less than 2
2-2 An example of different alignments with the same SP-score
3-1 Constructing the initial alignment by strategy 2
3-2 QOMA finds optimal alignment inside window
3-3 Sparse K-partite graph
3-4 An example of using K-partite graph
3-5 The SP scores of QOMA alignments
4-1 Alignment strategies at a high level
4-2 Comparison of the SP score found by different strategies
4-3 The distribution of the number of benchmarks with different number of sequences (K).
5-1 The initial graph constructed
5-2 The fragments with similar features are grouped together
5-3 A gap vertex is inserted
5-4 Cliques found are the columns
5-5 Gaps are moved
6-1 Example of primer pairs on target sequence
6-2 An example of computing the SP score of multiple sequence alignment
6-3 An example of matching primers with translocations
6-4 Selection of next forward primer from current reverse primer
6-5 Polymerase chain reaction samples

Bioinformatics is a field where computer science is used to assist biological science. In this area, multiple sequence alignment is one of the most fundamental problems. A multiple sequence alignment is an alignment of three or more sequences. It is widely used in many applications such as protein structure prediction, phylogenetic analysis, identification of conserved motifs, protein classification, gene prediction and genome primer identification. In the research area of multiple sequence alignment, a challenging problem is how to find the multiple sequence alignment that maximizes the SP (Sum-of-Pairs) score. This problem is NP-complete. Furthermore, finding an alignment that is biologically meaningful is not trivial since the SP score may not reflect biological significance. This thesis addresses these problems. More specifically, we consider four problems. First, we develop an efficient algorithm to optimize the SP score of multiple sequence alignment. Second, we extend this algorithm to handle a large number of sequences. Third, we apply secondary structure information of residues to build a biologically meaningful alignment. Finally, we describe a strategy to employ the alignment of multiple sequences to identify primers for a given target genome.

CHAPTER 1
INTRODUCTION

Bioinformatics is the interaction of molecular biology and computer science; it can be viewed as a branch of biology that uses computers to help answer biological questions. One of the fundamental research areas in bioinformatics is multiple sequence alignment. A multiple sequence alignment is an alignment of more than two sequences. An example of multiple sequence alignment is shown in Figure 1-1. The alignment is part of a whole alignment selected from the BAliBASE benchmark database [1, 2].

Figure 1-1. An example of multiple sequence alignment. Sequences are subsequences selected from the BAliBASE database.

Multiple sequence alignment is widely used in many applications such as protein structure prediction [3], phylogenetic analysis [4], identification of conserved motifs [5], protein classification [6], gene prediction [7-9], and genome primer identification [10]. The following are some examples of these applications.

[...] [11]. For related proteins, their motifs present similar structures and functions. Within a multiple alignment, motifs can be identified as columns with more conservation than their surroundings. Analyzed together with experimental data, the motifs can be a very important characterization of sequences of unknown function. This principle leads to many important applications in bioinformatics. Some important databases, such as PROSITE [12] and PRINTS [13], are built on this principle. Another type of method uses a profile [14] or a hidden Markov model (HMM) [15] to identify motifs. These methods work well when a motif is too subtle to be defined via a standard pattern, since when searching a database, profiles and HMMs can identify distant members of a protein family and provide much higher sensitivity and specificity than a single sequence or a single pattern can provide. In practice, users can create their own profile from multiple sequence alignments by using tools such as PFTOOLS [16], by using pre-established collections like Pfam [17], or by computing the profiles on the fly using PSI-BLAST [18], the position-specific version of BLAST.

[...] [19, 20]. Here, motifs are aligned ungapped segments of the most highly conserved protein regions in the multiple sequence alignment. By comparing the motifs in the multiple sequence alignment with the unknown sequence S, we can find how similar the alignment and S are, and then estimate the likely classification of the target sequence.

[...] [21-25]. In shotgun sequencing, multiple sequence alignment plays a very important role [26]. Assume we are given a set of genomic reads in a shotgun sequencing project; these read fragments are highly similar, and hence easy to align. The multiple sequence alignment of the reads can construct the footprint of the main backbone of the original sequence, thus easing the work of recognizing the whole sequence from the reads. If high quality reads are used, the target sequence can be re-built directly from the consensus sequence of the multiple sequence alignment of the reads.
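The idea of rebuilding a target sequence from the consensus of aligned reads can be illustrated with a minimal sketch. It assumes a simple majority vote per column, with "-" as the gap symbol; the function name `consensus` and the toy reads are hypothetical and are not taken from any tool cited in this thesis.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol for this sketch


def consensus(alignment):
    """Return the majority-vote consensus of equal-length aligned rows."""
    if not alignment:
        return ""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    out = []
    for column in zip(*alignment):
        # Ties are resolved in favor of the symbol seen first in the column.
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != GAP:  # a column dominated by gaps contributes nothing
            out.append(symbol)
    return "".join(out)


# Three toy "reads" that have already been aligned (hypothetical data).
reads = [
    "ACGT-ACG",
    "ACGTTACG",
    "AC-TTACG",
]
print(consensus(reads))
```

Real assemblers weight the vote by base quality; the unweighted vote here is only meant to show why highly similar reads make the consensus easy to read off the alignment.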

[...] [27]. Here, the SP score of an alignment, $A$, of sequences $P_1, P_2, \dots, P_K$ is computed by adding the alignment scores of all induced pairwise alignments. It can be expressed as $SP(A) = \sum_{i<j}$ [...]

We develop theories to justify the claim that QOMA can find alignments which converge to globally SP-optimal alignments as the size of the sliding window increases. The experimental results also agree with the claim.

[...] [32]). In sequencing DNA, plastid sequencing throughput can be increased by amplifying the isolated plastid DNA using rolling circle amplification (RCA) [33]. However, obtaining sequence through RCA requires this intermediate step. Recently, the ASAP method showed that sequence information could be gathered by creating templates from plastid DNA based on conserved regions of plastid genes. To expand this technique to an entire chloroplast genome, an efficient method is required to facilitate primer selection. More importantly, such a method will allow the selected primer set to be updated as new plastid sequences become available. Our method is named MAPPIT. MAPPIT uses genes from related species to assist in predicting unknown genes. MAPPIT takes as input existing gene sequences that are closely related to the gene to predict, extracts information from the given gene sequences, and constructs primer pairs. The goal is to find primer pairs that cover as much of the unknown gene as possible while keeping the number of pairs as small as possible. MAPPIT uses two different strategies for constructing primer candidates: a multiple sequence alignment based method and a motif based method. The experimental results showed that the primer pairs found by MAPPIT helped considerably in the prediction of unknown genomes.

The rest of this thesis is organized as follows: Chapter 2 discusses related work on multiple sequence alignment. Chapter 3 presents an algorithm for optimizing the SP score of the resulting multiple sequence alignment in a given time. Chapter 4 introduces an algorithm for aligning many sequences, with the goal of optimizing the SP score.

Chapter 5 presents an algorithm for improving the biological relevance of multiple sequence alignment by applying secondary structure information. Chapter 6 introduces an application, a module for amplification of plastomes by primer identification. Chapter 7 presents the conclusion of our work.

CHAPTER 2
BACKGROUND

Multiple sequence alignment [34, 35] of protein sequences is one of the most fundamental problems in computational biology. It is an alignment of three or more protein sequences. Multiple sequence alignment is widely used in many applications such as protein structure prediction [3], phylogenetic analysis [4], identification of conserved motifs and domains [5], gene prediction [7-9], and protein classification [6].

[...] [36]. One common method is to score a multiple alignment according to a mathematical model. We define the cost of the multiple sequence alignment $A$ of $K$ sequences as

$\sum_{i=1}^{l} c(P_1(i), P_2(i), \dots, P_K(i))$

where $P_j(i)$ is the $i$th letter in the sequence $P_j$, $j = 1, 2, \dots, K$, and $c(P_1(i), P_2(i), \dots, P_K(i))$ is the cost of the $i$th column [37].

$c(P_1(i), P_2(i), \dots, P_K(i)) = \sum_{1 \le p < q \le K} c(P_p(i), P_q(i))$

where $c(P_p(i), P_q(i))$ is the cost of the two letters $P_p(i)$ and $P_q(i)$ in the column. This column cost function is called the Sum-of-Pairs (or SP) cost. The SP alignment model is widely used in applications such as finding conserved regions, and has received extensive research attention [38-44]. In SP alignment, we assume all sequences are equally related to each other, so all pairs of sequences are assigned the same weight. In our later discussion, we will focus on the SP model. There are also other optimization models in this group, such as consensus alignment and tree alignment [29, 40-42, 45-50]. The key difference between these models is how they formulate their column cost functions [37]. For all models in this type of measurement, the cost scheme used should reflect the probabilities of evolutionary events, including substitution, insertion, and deletion. So it is important to choose

[...] [51, 52]. For DNA sequences, the simple match/mismatch cost scheme is often used. We can also use more sophisticated cost schemes such as transition/transversion costs [53] and DNA PAM matrices. Throughout this section, we use $c(\cdot)$ as the column cost function and $c(x, y)$ as the pairwise cost function, which measures the dissimilarity between a pair of letters or spaces $x$ and $y$. We use $\Box$ to denote a space and $\Sigma$ to denote the set of letters that form the input sequences.

Another type of measurement is to compare an alignment with a reference alignment. The BAliBASE score [5, 54] is the most widely used of this type. Given a gold-standard alignment $A^*$, this measure evaluates how similar the alignments $A$ and $A^*$ are. The BAliBASE score is commonly used in the literature as an alternative to the SP score; however, the BAliBASE score can only be computed for sets of sequences for which the gold standard is known. In contrast, the SP score can be computed for any set of sequences. Most of the existing methods aim to maximize a linear variation of the SP score by weighting the sequences (or subsequences) in order to converge to the BAliBASE score for known benchmarks [1, 2]. This chapter focuses on optimizing the SP score, which is computationally an equivalent problem to the weighted versions in the literature. The problem of finding appropriate weights to make the SP and the BAliBASE scores converge is orthogonal to this chapter and should be considered separately.

[...] [55-58] and also can use dynamic programming to find optimal solutions. Given a table of scores for matches and mismatches between all amino acids and penalties for insertions or deletions, the optimal alignment of two sequences can be determined using dynamic programming (DP). The time and space complexity of this method is $O(N^2)$ [28, 59, 60], where $N$ is the length of each sequence. This algorithm can be extended to align

[...] [29, 30]. Indeed, finding the multiple sequence alignment that maximizes the SP (Sum-of-Pairs) score is an NP-complete problem [27].

There are a few methods which aim to optimize the alignment by running dynamic programming alignment on all sequences simultaneously. MSA is the representative of this class [61]. DCA extends MSA by utilizing a divide-and-conquer strategy [47]. Unlike progressive methods, DCA divides the sequences recursively until they are shorter than a given threshold. DCA then uses MSA to find the optimal solutions for the smaller problems. The performance of DCA depends on how it divides the sequences. DCA uses a cut strategy that minimizes additional costs [62] and uses the longest sequence in the input sequences as reference to select the cut positions. DCA does not guarantee finding an optimal solution. The selection of the longest sequence makes DCA order dependent, as there is no justification why this selection (or any other selection) optimizes the SP-score of the alignment. In contrast, our methods in this thesis are order independent. However, MSA, DCA and other algorithms that maximize the SP score suffer from high computational cost [1].

[...] [1]. These heuristic methods also provide solutions for aligning large sequences, which dynamic programming is unable to process due to memory limitations [63-69]. These heuristic methods can be classified into four groups [70]: progressive, iterative, anchor-based and probabilistic. They all have the drawback that they do not provide a flexible quality/time tradeoff.
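For reference, the pairwise dynamic programming that exact tools such as MSA generalize to K dimensions can be sketched as follows. This is a minimal Needleman-Wunsch style sketch with a linear gap penalty; the function name and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, not parameters taken from this thesis. It fills the full quadratic table but keeps only two rows in memory because it returns the score alone.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Score of an optimal global alignment of s and t (linear gap penalty)."""
    # Row 0: the prefix t[:j] aligned against gaps only.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        # Column 0: the prefix s[:i] aligned against gaps only.
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,  # align s[i-1] with t[j-1]
                           prev[j] + gap,      # s[i-1] against a gap
                           cur[j - 1] + gap))  # t[j-1] against a gap
        prev = cur
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Each cell looks at three neighbors here; with K sequences the same recurrence looks at $2^K - 1$ neighbors in a K-dimensional table, which is why the exact approach becomes impractical as K grows.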

[...] [71]. This approach is sufficiently fast to allow alignments of almost any size. The common shortcoming of these methods is that the resulting alignment depends on the order of aligning the sequences. ClustalW [1], T-COFFEE [2], Treealign [72], POA [45, 73, 74], and MAFFT [75] can be grouped into this class [76].

ClustalW [1, 77] is currently the most commonly used multiple sequence alignment program. ClustalW includes the following features to produce biologically meaningful multiple sequence alignments. 1) According to a pre-computed guide tree, each input sequence is assigned a weight during the alignment process, so that sequences with more similarity get less weight and divergent sequences get more weight. 2) According to the divergence of the sequences to be aligned, different amino acid substitution matrices are used at different alignment stages. 3) Gap penalties favor extending existing gaps over opening new gaps. Therefore, gaps are encouraged to occur in loop regions instead of in highly structured regions such as alpha helices and beta sheets. The biological rationale is that divergence is less likely in highly structured regions, which are commonly very important to the fold and function of a protein. For similar reasons, to discourage the opening of new gaps near the existing ones, existing gaps are assigned locally reduced gap penalties.

T-COFFEE [2] is a progressive approach based on consistency. It is one of the most accurate programs available for multiple sequence alignment. T-COFFEE avoids the most serious drawback caused by the greedy nature of the progressive algorithm. T-Coffee first aligns all sequences pairwise, and then uses the alignment information to guide the progressive alignment. T-Coffee creates intermediate alignments based on the sequences to be aligned next and on how all of the sequences align to each other.

MAFFT [75] provides a set of multiple alignment methods and is used on Unix-like operating systems. MAFFT includes two new techniques: identifying motif regions quickly and using a simplified scoring system. The first technique is based on the fast Fourier transform (FFT). This technique changes an amino acid sequence to a sequence of [...]

The POA [45] program does not use generalized profiles during the progressive alignment process. Instead, it introduces a partial order-multiple sequence alignment format to represent sequences. POA allows alignable regions to be extended, and allows longer alignments between closely related sequences and shorter alignments for the entire set of sequences.

[...] [78] can be grouped into this class as well as the progressive method class since it uses a progressive alignment at each iteration.

MUSCLE [78] applies many techniques such as fast distance estimation using k-mer counting, progressive alignment using a new profile function called the log-expectation score, and refinement using tree-dependent restricted partitioning. At the time it was proposed, it achieved the best accuracy. Since it was relatively slow, MUSCLE was not widely used.

[...] [79]. This group includes several methods which are designed for rapidly detecting anchors [80-82]. DIALIGN [83, 84], Align-m [46], L-align [85], Mavid [86] and PRRP [87] belong to this class.

The Align-m [46] program uses a non-progressive local approach to guide a global alignment. It constructs a set of pairwise alignments guided by consistency. It performs well on divergent sequences. The drawback is that it runs slowly.

The PRRP program uses a randomized iterative strategy. It progressively optimizes a global alignment by dividing the sequences into two groups iteratively. It realigns groups globally using a group-based alignment algorithm.

[...] [88], and HMMT [89] can be grouped into this class.

ProbCons [88] introduces an approach based on consistency. It uses a probabilistic model and maximum expected accuracy scoring. According to the evaluation of its performance on several standard alignment benchmark data sets, ProbCons is one of the most accurate alignment tools today.

HMMT first discovers the pattern which is common in the multiple sequences, and saves a description of the pattern in an HMM file. It then applies a simulated annealing method, which tries to maximize the probability represented by the HMM file for the sequences to be aligned. HMMT works iteratively by improving a new multiple sequence alignment calculated using the pattern, then a new pattern derived from that alignment.

Improving the alignment quality of an initial alignment has traditionally been done manually (e.g., through programs like MaM and WebMaM [90]). Recently, RASCAL [91], REFINER [92] and ReAligner [93] have included more automatic features. Our methods, QOMA and QOMA2, belong to this group in general. QOMA and QOMA2 differ from RASCAL and REFINER because QOMA and QOMA2 focus on optimizing the SP score of alignments and require only sequence information, while RASCAL is a knowledge-based approach and REFINER targets optimizing the score of core regions. ReAligner uses a round-robin algorithm and improves DNA alignment.

Most existing tools have the shortcoming that they are unable to process a large number of sequences. It is appropriate to apply dynamic programming on subdivisions of alignments. "Jumping alignments" [94] applies a similar idea. Our method, QOMA2 [95], provides a solution for aligning a large number of protein sequences.

In this thesis, we address the problems mentioned above: the sequence-order-dependence problem, the quality/time tradeoff problem and the problem of a large number of input sequences.

If we want to find the optimal solution, we can use exact algorithms. The most widely adopted exact method in multiple sequence alignment is dynamic programming. However, dynamic programming requires running time of $O(N^K)$ for [...]

Thus, if we want to find solutions which are close to the optimal solution, want to guarantee that the result is not too bad, and also want to run in reasonable time, then one alternative is to make use of approximation algorithms. Approximation algorithms are algorithms which run in polynomial time and guarantee that, for all possible instances of a minimization problem, all solutions obtained are at most $\alpha$ times the optimal solution. We can define approximation algorithms for maximization problems symmetrically. Approximation algorithms are often associated with NP-hard problems. Unlike heuristic algorithms, approximation algorithms have provable solution quality and provable running time bounds.

Multiple sequence alignment problems with SP-score are MAX-SNP-hard. Here a maximization problem is MAX-SNP-hard when, given a set of relations $R_1, R_2, \dots, R_k$, a relation $D$, and a quantifier-free formula $\phi(R_1, R_2, \dots, R_k, D, v_1, v_2, \dots, v_t)$, where $v_i$ is a variable, the following are satisfied [96]:

1) Given any instance $I$ of the problem, there exists a polynomial-time algorithm that can produce a set $J$ of relations $R^J_1, R^J_2, \dots, R^J_k$, where every $R^J_i$ has the same arity as the relation $R_i$.

2) $OPT(I) = \max_{D^J} \{(v_1, v_2, \dots, v_t) \in J^t : \phi(R^J_1, R^J_2, \dots, R^J_k, D^J, v_1, v_2, \dots, v_t) = \mathrm{TRUE}\}$

[...] [96], Chapter 10.

[...] [37, 96] as a number $\alpha$ such that for any instance $I$ of the problem, $H(I)$ [...]

We define a polynomial time approximation scheme (PTAS) as an approximation scheme $\{H_\epsilon\}$, where the algorithm $H_\epsilon$ runs in time polynomial in the size of the instance $I$, for any fixed $\epsilon$. There are two types of problems: problems which have good approximation algorithms, and problems which are hard to approximate. Problems with a PTAS belong to the first type, and the best we can hope for a problem is that it has a PTAS. However, a MAX-SNP-hard problem has little chance of having a PTAS. A more detailed discussion can be found in [37], Chapter 4.

Since achieving an approximation ratio of $1 + \epsilon$ for a MAX-SNP-hard problem is NP-hard, where $\epsilon > 0$ is a fixed value, the approximability of a problem actually depends on the value of $\epsilon$. For multiple sequence alignment problems, the best approximation algorithm has a $2 - l/K$ approximation ratio for any constant $l$, where $K$ is the number of sequences [39, 42, 97]. Later we will show that this approximation ratio is not appropriate for real applications of multiple sequence alignment, and give other reasons why approximation algorithms do not work well for multiple sequence alignment.
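As a quick worked illustration of how close the $2 - l/K$ guarantee stays to 2, the ratio can be evaluated for a few values. The name $\alpha(l, K)$ and the particular values of $l$ and $K$ below are chosen only for illustration and are not taken from the cited results.

```latex
\[
  \alpha(l, K) = 2 - \frac{l}{K}, \qquad
  \alpha(2, 10) = 1.8, \qquad
  \alpha(2, 100) = 1.98, \qquad
  \lim_{K \to \infty} \alpha(l, K) = 2 \quad \text{for fixed } l .
\]
```

For a fixed constant $l$, the guaranteed ratio therefore approaches 2 as more sequences are aligned.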

Figure 2-1. An example showing that alignments with an approximation ratio of less than 2 can be meaningless: (a) The optimal alignment. (b) An alignment with an approximation ratio of 1.5.

[...] approximation algorithms for multiple sequence alignment [42], which can efficiently produce alignments. However, we will provide three reasons why approximation algorithms are not applicable to multiple sequence alignment applications in bioinformatics.

1) The score scheme supported by approximation algorithms is metric, while currently the most widely used score matrices are not metric. A metric cost matrix should satisfy the following conditions [98]:

(C1) $c(x, y) > 0$ for all $x \ne y$
[...]
(C4) $c(x, y) \le c(x, z) + c(z, y)$ for all $x$, $y$ and $z$ [...]

[...] Figure 2-1. We consider the alignment problem as a maximization problem; then the first alignment is the optimal solution, with an SP score of 3, and the second alignment has an SP score of 2. So the second alignment has an approximation ratio of 1.5. We know that the second alignment is a trivial alignment without any meaning in reality. Actually, in this example all alignments other than the optimal one have an approximation ratio of less than 2, which means that an approximation ratio of less than 2 cannot guarantee a good alignment at all.

3) These approximation algorithms do not consider the biological meaning of the resulting alignment, and they do not account for the impact of gaps. Here we provide a simple example to show that we need to consider the location of the gaps inserted. In biological applications, it is widely accepted that a mismatch can be as bad as matching with a gap. We can design a simple score scheme as follows:

[...]
$c(x, \Box) = 1$

Then, given the sequences "A", "A" and "A", two possible alignments are shown in Figure 2-2. From Figure 2-2, we see that both alignments have an SP-score of 6; however, the first alignment does not actually make any sense. Thus, an approximation algorithm for multiple sequence alignment with a guaranteed approximation ratio that introduces a lot of gaps into the resulting alignment, without considering the biological meaning of the resulting alignment, can be useless.
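The observation that a heavily gapped alignment and a gap-free alignment can share the same SP score can be reproduced with a small sketch. Only the letter-against-space score of 1 survives in the text above; the match score of 2, the mismatch score of 1 and the space-against-space score of 0 are assumptions chosen to be consistent with the quoted SP-score of 6, and the gapped layout is a hypothetical stand-in for Figure 2-2(a). The gap symbol "-" and the function names are likewise illustrative.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol for this sketch


def pair_score(x, y, match=2, mismatch=1, gap=1):
    """Score of one aligned pair of symbols (assumed toy scheme)."""
    if x == GAP and y == GAP:
        return 0  # assumption: two spaces contribute nothing
    if x == GAP or y == GAP:
        return gap  # letter against a space, as in the text
    return match if x == y else mismatch


def sp_score(alignment):
    """Sum-of-Pairs score: add the pair scores of every column."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            total += pair_score(x, y)
    return total


gapped = ["A--", "-A-", "--A"]  # every residue sits in its own column
ungapped = ["A", "A", "A"]      # all residues share one column
print(sp_score(gapped), sp_score(ungapped))
```

Under these assumed scores both layouts evaluate to 6, although only the second aligns anything, which is the point the paragraph above makes about ignoring gap placement.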

Figure 2-2. An example of different alignments with the same SP-score: (a) An alignment with many gaps. (b) An alignment without gaps.

[...] which is the main advantage over approximation algorithms. Other researchers have exploited this fact before. For example, ProbCons [88] can obtain prior knowledge via training to guide the later alignment process, and ClustalW [1, 77] can adjust the weights of profiles during the alignment process. Our programs, QOMA [99], QOMA2 [95] and HSA [100], are heuristic optimization algorithms by nature. They also provide adjustment during the alignment. Also, our methods are designed not only for fixed models such as SP-score, but can be extended to incorporate additional biological features.

[...] [27] when a particular pairwise cost scheme is used. The cost scheme used in the proof is not a metric since it does not satisfy the triangle inequality. Later, SP alignment was proved to be NP-hard even when the alphabet size is 2 and the pairwise cost scheme is a metric. Thus, the SP alignment problem is unlikely to be solved in polynomial time [101].

[...] [101] SP alignment is NP-hard when the alphabet size is 2 and the cost scheme is metric.

[...] [102] SP alignment is NP-hard when spaces are only allowed to be inserted at both ends of the sequences, using a pairwise cost scheme where a match costs 0 and a mismatch costs 1.

[...] [103] Tree alignment is NP-hard even when the given phylogeny is a binary tree.

[...] [104] Consensus alignment is NP-hard when the alphabet size is 4, using the cost scheme where a match costs 0 and a mismatch costs 1.

[...] [27, 103] Consensus alignment is MAX-SNP-hard when the pairwise cost scheme is arbitrary.

[...] [27] Multiple sequence alignment with SP-score is NP-complete.

[...] [27]. The basic idea is to show that the multiple sequence alignment problem is equivalent to the shortest common supersequence problem, which is a known NP-complete problem even if $|\Sigma| = 2$ [105].

[...] [106] There exists a score matrix $B$ such that the multiple sequence alignment problem for $B$ is MAX-SNP-hard when spaces are only allowed to be inserted at both ends of the sequences.

[...] [106] and used L-reductions. Here we can simplify the proof and use a gap-preserving reduction [96]. We prove the theorem by showing that there are gap-preserving reductions from the maximization problem of gap-0-1 multiple sequence alignment with SP-score to the maximization problem of MAX-CUT(Z) of size k. It was proved that SIMPLE MAX-CUT(Z) is a MAX-SNP-complete problem for some positive integer Z. In fact, Z = 3 works [107]. Then we show that an optimal gap-0-1 multiple sequence alignment with SP-score problem exactly defines the optimal [...]

CHAPTER 3
OPTIMIZATION OF SP SCORE FOR MULTIPLE SEQUENCE ALIGNMENT IN GIVEN TIME

In this chapter, we consider the problem of multiple alignment of protein sequences with the goal of achieving a large SP (Sum-of-Pairs) score. We introduce a new graph-based method. We name our method QOMA (Quasi-Optimal Multiple Alignment). QOMA starts with an initial alignment. It represents this alignment using a K-partite graph. It then improves the SP score of the initial alignment through local optimizations within a window that moves greedily on the alignment. QOMA uses two strategies to permit flexibility in the time/accuracy tradeoff: (1) Adjust the sliding window size. (2) Tune from a complete K-partite graph to a sparse K-partite graph for local optimization of the window. Unlike traditional tools, QOMA can be independent of the order of sequences. It also provides a flexible cost/accuracy tradeoff by adjusting the local alignment size or adjusting the sparsity of the graph it uses. The experimental results on BAliBASE benchmarks show that QOMA produces higher SP scores than existing tools including ClustalW, ProbCons, MUSCLE, T-Coffee and DCA. The difference is more significant for distant proteins.

[...] 2. Progressive methods are the most popular methods for multiple sequence alignment; however, they have an important shortcoming. The order in which the profiles are chosen for alignment significantly affects the quality of the alignment. The optimal alignment may be different from all possible alignments obtained by considering all possible orderings of sequences [100]. Section 2 has discussed major multiple sequence alignment strategies in detail. A method that can balance running time and alignment accuracy is in serious demand.

Fragment-based methods follow the strategy of assembling pairwise or multiple local alignments. The divide-and-conquer alignment methods such as DCA [47] can be

[...] 2

In this chapter, we consider the problem of maximizing the SP score of the alignment of multiple protein sequences. We develop a graph-based method named QOMA (Quasi-Optimal Multiple Alignment). QOMA starts by constructing an initial multiple alignment. The initial alignment is independent of any sequence order. QOMA then builds a graph corresponding to the initial alignment. It iteratively places a window on this graph, and improves the SP score of the initial alignment by optimizing the alignment inside the window. The location of the window is selected greedily as the one that has a chance of improving the SP score by the largest amount. QOMA uses two strategies to permit flexibility in the time/accuracy tradeoff: (1) Adjust the sliding window size. (2) Tune from a complete K-partite graph to a sparse K-partite graph for local optimization of the window. The experimental results show that QOMA finds alignments with better SP scores compared to existing tools including ClustalW, ProbCons, MUSCLE, T-Coffee and DCA. The improvement is more significant for distant proteins.

Figure 3-1. Constructing the initial alignment by strategy 2. Left: Pairs of sequences are aligned. Edges are inserted between nodes which match in the alignment. Right: Columns are constructed by aligning the nodes. Gaps are inserted wherever necessary.

There are many ways to construct the initial alignment. We group them into two classes: (1) Use an existing tool, such as ClustalW, to create an alignment. This strategy has the shortcoming that the initial alignment depends on other tools, which may be order-dependent. This makes QOMA partially order-dependent. (2) Construct the alignment from pairwise optimal alignments of sequences. In this strategy, first, sequence pairs are optimally aligned using DP [60]. An edge is added between two nodes if the nodes are matched in this alignment. A weight is assigned to each edge as the substitution score of the two residues that constitute that edge. The substitution score is obtained from the underlying scoring matrix, such as BLOSUM62 [108]. The weight of each node is defined as the sum of the weights of the edges that have that node on one end. A node set is then defined by selecting one node from the head of each sequence. The node which has the highest weight is selected from this set. This node is aligned with the nodes adjacent to it. Thus, the letters aligned at the end of this step constitute one column of the initial

[...] Figure 3-1. In this example, three protein sequences $p_1$, $p_2$ and $p_3$ are first aligned pairwise. For simplicity, we show each pairwise alignment as a separate graph in this figure. In reality, one node per letter is sufficient. The nodes that match in these optimal alignments are then linked by edges. For example, $a_1$ and $b_2$ match in the optimal alignment of $p_1$ and $p_2$, thus they have an edge in the graph constructed. The weight of this edge is equal to the BLOSUM62 entry for the letters $a_1$ and $b_2$. We do not show the weights of the edges in Figure 3-1 in order to keep the figure simple. In this figure, the node for $a_1$ has edges to the nodes for $b_2$ and $c_2$. Therefore, the weight of the node for $a_1$ is computed as the sum of the weights of the edges $(a_1, b_2)$ and $(a_1, c_2)$. Initially $\{a_1, b_1, c_1\}$ is chosen as the candidate node set. In this example, we assume that among the three nodes for $a_1$, $b_1$ and $c_1$, the node for $a_1$ has the largest weight. Thus we select the node for $a_1$ as the central node and construct the column $(a_1, b_2, c_2)$. Then we start to construct the next column. We update the candidate node set to $\{a_2, b_3, c_3\}$, which are the nodes that immediately follow the nodes for $a_1$, $b_2$ and $c_2$ in the sequences. Assuming that the node for $a_2$ has the largest weight among the nodes for $a_2$, $b_3$ and $c_3$, we select the node for $a_2$ as the central node and construct the column $(a_2, b_4, c_4)$ correspondingly. When we concatenate columns to make the final alignment, gap nodes are inserted if necessary. In this example, when we concatenate the columns $(a_1, b_2, c_2)$ and $(a_2, b_4, c_4)$, two gap nodes are inserted in sequence $p_1$, one before the node for $a_1$ and one after the node for $a_1$. Thus we construct the columns $(-, b_1, c_1)$ and $(-, b_3, c_3)$.

The time complexity of both of these strategies is $O(K^2 N^2)$ since pairwise comparisons dominate the running time. However, the latter approach is faster. This is

[...] Figure 3-2). A generalized version of the DP algorithm [60] is used to find the optimal alignment. This is feasible since the cost of aligning a window is much less than that of aligning the entire sequences.

This algorithm requires solving two problems. First, where should the windows be placed? Second, when should the iterations stop? One obvious solution is to slide a window from left to right (or right to left), shifting by some predefined amount $\Delta$ at each iteration. In this case, the iterations will end once the window reaches the right end (or the left end) of the alignment (see Figure 3-2). This solution, however, has two problems. First, it is not clear in which direction the window should be slid. Second, a window is optimized even if it is already a good alignment. We propose another solution. We compute an upper bound to the improvement of the SP score for every possible window position as follows. Let $X_i$ denote the upper bound to the SP score for the window starting at position $i$ in the alignment. This number can be computed as the sum of the scores of all the pairwise optimal alignments of the subsequences in this window. Let $Y_i$ denote the current SP score of that window. The upper bound is computed as $X_i - Y_i$. We propose to greedily select the window that has the largest upper bound at each iteration. In order to ensure that this solution does not optimize more windows than the first one (i.e., sliding windows), we do not select a window position that is within $\Delta/2$ positions of a previously optimized window. The iterations stop when all the remaining windows have an upper bound of zero or are within $\Delta/2$ positions of a previously optimized window. In our experiments, the two solutions produced roughly the same SP-score. The second solution was slightly better. The second solution, however, converged to the final result much faster than the first one (results not shown).

Figure 3-2. QOMA finds the optimal alignment inside the window, replaces the window with the optimal alignment, and then moves the window by $\Delta$ positions.

The time complexity of the algorithm is $O(2^K W^K K^2 (N - W + 1)/\Delta)$. This is because there are $(N - W + 1)/\Delta$ positions for the window. A dynamic programming solution is computed for each such window. The cost of each dynamic programming solution is $O(2^K W^K K^2)$. This algorithm is much faster than the optimal dynamic programming when $W$ is much smaller than $N$. The space complexity is $O(W^K + KN)$. This is because dynamic programming for a window requires $O(W^K)$ space, and only one window is maintained at a time. Also, $O(KN)$ space is needed to store the sequences and the alignment. Note that the edges of the complete K-partite graph are not stored at this step as we already know that the graph is complete.
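The greedy window placement described above can be sketched as follows. This is a simplified illustration rather than the QOMA implementation: it assumes the upper bounds $X_i$ and the current scores $Y_i$ are given as lists and are not refreshed after a window is optimized, and the names `select_windows` and `min_gap` (standing in for the exclusion distance between optimized windows) are hypothetical.

```python
def select_windows(upper, current, min_gap):
    """Greedily pick window start positions by largest possible gain X_i - Y_i.

    upper[i]   -- upper bound X_i on the SP score of the window starting at i
    current[i] -- current SP score Y_i of that window
    min_gap    -- positions closer than this to a chosen window are skipped
    """
    gains = [x - y for x, y in zip(upper, current)]
    # Visit positions from the largest possible improvement to the smallest.
    order = sorted(range(len(gains)), key=lambda i: gains[i], reverse=True)
    chosen = []
    for i in order:
        if gains[i] <= 0:
            break  # no remaining window can be improved
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
    return chosen


# Toy bounds and scores for five window positions (hypothetical numbers).
upper = [10, 12, 9, 15, 7]
current = [10, 8, 8, 9, 7]
print(select_windows(upper, current, min_gap=2))
```

In this toy run position 3 has the largest gap between bound and current score and is chosen first, position 1 second, and position 2 is skipped because it lies too close to position 3. Windows that already meet their upper bound are never optimized, which matches the stopping rule in the text.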

Our first lemma shows that QOMA always results in an alignment at least as good as the initial alignment (the proof is shown in the appendix).

[...] Figure 3-2). Let $A^*_W$ be the optimal alignment obtained by QOMA for the window and $A'$ be the alignment obtained by replacing $A_W$ with $A^*_W$ in $A$. We have $SP(A^*_W) \ge SP(A_W)$. Thus, $SP(A) = SP(A_{prefix}) + SP(A_W) + SP(A_{suffix}) \le SP(A_{prefix}) + SP(A^*_W) + SP(A_{suffix}) = SP(A')$. Then, we get $\epsilon(A) = S^* - SP(A) \ge S^* - SP(A') = \epsilon(A')$. Finally, we have $\epsilon(QOMA(A, W)) \le \epsilon(A)$. [...]

[...] 1 follows from Lemma 1. [...] 1 implies that QOMA alters an initial alignment $A$ only if $A$ is not optimal. The next lemma discusses the impact of window size on QOMA.
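The decomposition used in the argument above, where the SP score splits into prefix, window and suffix parts, can be checked on a toy alignment. The sketch assumes a column-independent score (match 1, everything else 0), which is an illustrative choice rather than the scoring used in the experiments; the helper names and the toy rows are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol for this sketch


def pair_score(x, y):
    """Toy score (assumed): 1 for two identical residues, 0 otherwise."""
    return 1 if x == y and x != GAP else 0


def sp_score(rows):
    """Sum-of-Pairs score, computed column by column."""
    return sum(pair_score(x, y)
               for column in zip(*rows)
               for x, y in combinations(column, 2))


def cut(rows, start, end):
    """Columns start..end-1 of every row."""
    return [row[start:end] for row in rows]


def replace_window(rows, start, end, new_window):
    """Swap the columns start..end-1 for a realigned window."""
    return [row[:start] + w + row[end:] for row, w in zip(rows, new_window)]


A = ["ACG-T", "AC-GT", "ACG-T"]
start, end = 2, 4
prefix = cut(A, 0, start)
window = cut(A, start, end)
suffix = cut(A, end, len(A[0]))

# The same residues (one G per row) realigned without gaps.
better_window = ["G", "G", "G"]
A_prime = replace_window(A, start, end, better_window)

print(sp_score(A), sp_score(prefix) + sp_score(window) + sp_score(suffix))
print(sp_score(A_prime))
```

Because the score is a sum over columns, improving the window alone can only raise the total, which is the inequality $SP(A) \le SP(A')$ in the text. With affine gap penalties the columns would no longer be independent, so this simple additivity is specific to the column-wise cost assumed here.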

PAGE 38

Figure 3-3. Sparse K-partite graph for two sequences for d = 0 and d = 1.

Figure 3-4. An example of using the K-partite graph: (a) A sparse K-partite graph for three sequences from a window of size 4. (b) The induced subgraph for cell [3,4,4] for the K-partite graph in (a).

Lemma 2 indicates that as $W$ increases, the SP score of the resulting alignment increases. When $W$ becomes greater than the length of $A$, the sliding window contains the entire sequences. In this case, $SP(QOMA(A, W)) = S$. The following corollary states this.

In Section 3.2.1 we computed the time complexity of QOMA using the complete K-partite graph as $O(2^K W^K (N-W+1) K^2)$.

PAGE 39

The factor $2^K$ in the complexity is incurred because each cell of the dynamic programming (DP) matrix is computed by considering $2^K - 1$ conditions (i.e., $2^K - 1$ neighboring cells). This is because there are $2^K - 1$ possible nonempty subsets of $K$ residues. Each subset here corresponds to a set of residues that align together, and thus to a neighboring cell. We propose to reduce this complexity by reducing the number of residues that can be aligned together. We do this by keeping only the edges between node pairs with a high possibility of matching.

The strategy for choosing the promising edges is crucial for the quality of the resulting alignment. We use the optimal pairwise alignment method as discussed in Section 3.2.1. This strategy produces at most $K - 1$ edges per node since each node is aligned with at most one node from each of the $K - 1$ other sequences. We also introduce a deviation parameter $d$, where $d$ is a non-negative integer. Let $p[i]$ and $q[j]$ be the nodes corresponding to protein sequences $p$ and $q$ at positions $i$ and $j$ in the initial graph, respectively. We draw an edge between $p[i]$ and $q[j]$ only if one of the following two conditions holds in the optimal pairwise alignment of $p$ and $q$: (1) $\exists \epsilon$, $|\epsilon| \le d$, such that $p[i]$ is aligned with $q[j + \epsilon]$; (2) $\exists \epsilon$, $|\epsilon| \le d$, such that $q[j]$ is aligned with $p[i + \epsilon]$. In other words, we draw an edge between two nodes if their positions differ by at most $d$ in the optimal alignment of $p$ and $q$. For example, in Figure 3-3, $p[2]$ aligns with $q[2]$. Therefore, we draw an edge from $p[2]$ to $q[1]$ and $q[3]$ as well as $q[2]$, since $q[1]$ and $q[3]$ are within the d-neighborhood ($d = 1$) of $q[2]$.

The dynamic programming is modified for the sparse K-partite graph as follows. Each cell $[x_1, x_2, \dots, x_K]$ in the K-dimensional DP matrix corresponds to nodes $P_1[x_1], P_2[x_2], \dots, P_K[x_K]$. Here $P_i[j]$ stands for the node at position $j$ in sequence $i$. The set contains one node from each sequence, and each node can be either a residue or a gap. Thus, each cell defines a subgraph induced by its node set. For example, during the alignment of the sequences that

PAGE 40

3-4(a), the cell [3,4,4] corresponds to nodes $P_1[3]$, $P_2[4]$ and $P_3[4]$. Figure 3-4(b) shows the induced subgraph of cell [3,4,4].

The induced subgraph for each cell yields a set of connected components. The sparse graph strategy exploits the concept of connected components to improve the running time of DP as follows: during the computation of the value of a DP matrix cell, we allow two nodes to align only if they belong to the same connected component of the induced subgraph of that cell. For example, for cell [3,4,4], $P_2[4]$ and $P_3[4]$ can be aligned together, but $P_1[3]$ cannot be aligned with $P_2[4]$ or $P_3[4]$ (see Figure 3-4(b)). A connected component with $n$ nodes produces $2^n - 1$ non-empty subsets. Thus, for a given cell, if there are $t$ connected components and the $i$th component has $n_i$ nodes, then the cost of that cell becomes $\sum_{i=1}^{t} (2^{n_i} - 1)$. This is a significant improvement as the cost of a single cell is $2^{n_1 + n_2 + \dots + n_t} - 1$ using the complete K-partite graph. For example, in Figure 3-4, the cost for cell [3,4,4] drops from $2^3 - 1 = 7$ to $(2^1 - 1) + (2^2 - 1) = 4$.

The connected components of an induced subgraph can be found in $O(K^2)$ time (i.e., the size of the induced subgraph) by traversing the induced subgraph once. Thus, the total time complexity of the sparse K-partite graph approach is $O((\sum_{i=1}^{W^K} (\sum_j (2^{n_j} - 1))) (N-W+1) K^2)$. The space complexity of using the sparse K-partite graph is $O(W^K + KN + N(K-1)K(2d+1)/2)$. The first term denotes the space for the dynamic programming alignment within a window. The second term denotes the number of letters. The last term denotes the number of edges. The space complexity for the last two terms can be reduced by storing only the subgraph inside the window.
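The per-cell cost reduction described above can be checked with a short Python sketch. The function names, the node encoding as `(sequence, position)` pairs, and the edge list are assumptions made for illustration only. The sketch groups nodes with a small union-find structure instead of the single graph traversal mentioned in the text, which yields the same connected components.

```python
def connected_components(nodes, edges):
    """Group the nodes of a cell into connected components (union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        # Keep only edges of the subgraph induced by the cell's nodes.
        if u in parent and v in parent:
            parent[find(u)] = find(v)

    groups = {}
    for v in nodes:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())


def cell_cost_sparse(nodes, edges):
    """Sum of (2^n_i - 1) over the connected components of the cell."""
    return sum(2 ** len(c) - 1 for c in connected_components(nodes, edges))


def cell_cost_complete(nodes):
    """2^n - 1 non-empty subsets when every pair of nodes may align."""
    return 2 ** len(nodes) - 1


# Cell [3,4,4]: only P2[4] and P3[4] are connected. The second edge is a
# hypothetical edge that leaves the cell and is therefore ignored.
cell = [("P1", 3), ("P2", 4), ("P3", 4)]
edges = [(("P2", 4), ("P3", 4)), (("P1", 3), ("P2", 2))]
sparse_cost = cell_cost_sparse(cell, edges)      # (2^1 - 1) + (2^2 - 1)
complete_cost = cell_cost_complete(cell)         # 2^3 - 1
```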

PAGE 41

Table 3-1. The average SP scores of QOMA using the complete K-partite graph with $\Delta = W/2$ on BAliBASE benchmarks, and the upper bound score (S). (Initialization Strategy 1, indicated by s1: initial alignments are obtained from ClustalW. Initialization Strategy 2, indicated by s2: initial alignments are obtained from optimal pairwise alignments as discussed in Section 3.2.1.)

Dataset         S      Strategy   Initial   W=2     W=4     W=8     W=16
V1-R1-low       565    s1         -839      -780    -637    -401    -243
                       s2         -797      -586    -429    -273    -182
V1-R1-medium    2880   s1         1982      2037    2181    2347    2442
                       s2         2041      2192    2338    2446    2508
V1-R1-high      5324   s1         4883      4933    5008    5071    5092
                       s2         4867      4965    5057    5110    5122

Experimental setup: We used BAliBASE benchmarks [5] reference 1 from version 1 (

We evaluated the SP score and the running time in our experiments. We do not report the BAliBASE scores since the purpose of QOMA is to maximize the SP score.

We implemented the complete and the sparse K-partite QOMA algorithms as discussed in the chapter, using standard C. We used BLOSUM62 as a measure of

PAGE 42

100] since HSA needs secondary structure information of proteins for alignment. To ensure a fair comparison, we ran ClustalW, MUSCLE, T-coffee, DCA and QOMA using the same parameters (gap open = gap extend = -4, similarity matrix = BLOSUM62). This was not possible for ProbCons. We also ran all the competing methods using their default parameters. We present the results using the same parameters in our experiments unless otherwise stated.

We ran all our experiments on an Intel Pentium 4 with 2.6 GHz speed and 512 MB memory. The operating system was Windows 2000.

Table 3-1 shows the average SP score of QOMA using two strategies for constructing the initial alignment and four values of W. Strategy 1 obtains the initial alignments from ClustalW. Strategy 2 obtains the initial alignments from the algorithm provided in Section 3.2.1. The table also shows the upper bound for the SP score, S, and the SP score of ClustalW for comparison. QOMA achieves a higher SP score than ClustalW on average for all window sizes and for all datasets. The SP score of QOMA consistently increases as W increases. These results are justified by Lemmas 1 and 2. The SP score of Strategy 2 is higher than that of Strategy 1 in almost all cases of low and medium similarity. Both strategies are almost identical for highly similar sequences. There is a loose correlation between the initial SP score and the final SP score of QOMA. Higher initial SP scores usually imply higher SP scores of the end result. There are, however, exceptions, especially for highly similar sequences. In the rest of the experiments, we use Strategy 2 to construct the initial alignments by default.

Table 3-2 shows the SP scores of five existing tools and QOMA on all the datasets when the competing tools are run using the same parameters as QOMA and using their

PAGE 43

Table 3-3 shows the average percentage of improvement of QOMA over alignments of ClustalW using the improvement formula given in Section 3.2.3; the dataset is V1-R1. As the window size increases, the increase in improvement percentage reduces. This indicates that QOMA converges to the optimal score at reasonable window sizes. In other words, using a window size larger than 16 will not improve the SP score significantly.

Table 3-4 shows the average and the standard deviation of the error incurred for each window due to using the sparse K-partite graph for QOMA. The error decreases as d increases. For W = 8, when d increases from 0 to 1, the error reduces by 0.334 (i.e., 4.893 - 4.559). When d increases from 1 to 2, the error decreases by 0.198. This implies that the average improvement in the SP score degrades quickly for d > 1. Similar observations can be made for W = 16. Thus, we conclude that the SP score improves only slightly for d > 1.

Figure 3-5 shows the average SP scores of the resulting alignments using the sparse K-partite graph for different values of d and using the complete K-partite graph on the V1-R1 dataset. The complete K-partite graph algorithm produces the best SP scores. However, the SP scores of results from the sparse K-partite graph algorithm are very close to those of the complete K-partite graph algorithm. The quality of the sparse K-partite graph algorithm improves significantly when d increases from 0 to 1. The improvement is smaller when d increases from 1 to 2. This implies that as d becomes larger, it has less impact on the quality of the alignment.

PAGE 44

Table 3-5 lists the running time of QOMA for the complete and the sparse K-partite graph algorithms for varying values of W. Experimental results show that QOMA runs faster for small W. The sparse K-partite graph algorithm is faster than the complete K-partite graph algorithm for all values of d for large W. The running time of QOMA increases as d increases. The results in this table agree with the time complexity we computed in Sections 3.2.3 and 3.2.4. Referring to Tables 3-1, 3-2 and 3-3, we conclude that when the window size is small, QOMA runs fast and has high quality results. As the window size increases, its running time grows but the alignment quality improves further.

Another parameter for the quality/time trade-off is d. Figure 3-5 shows that the SP score difference between the complete and the sparse K-partite graph algorithms is small. Thus, it is better to increase the window size and use the sparse K-partite graph strategy to obtain high scoring results quickly. As we have observed in Tables 3-1 and 3-5 and Figure 3-5, the best balance between quality and running time appears at d = 1 using the sparse K-partite graph strategy.
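All of the comparisons in this section are stated in terms of the SP score, so a small sketch of how that score can be computed for a given alignment may be useful. The sketch below is illustrative only: `sp_score` and `toy_score` are hypothetical names, the toy match/mismatch values stand in for BLOSUM62, each letter-gap column is charged the gap cost of -4 used in the experiments, and gap-gap columns are assumed to contribute nothing.

```python
from itertools import combinations

GAP = "-"


def sp_score(alignment, score, gap_cost=-4):
    """Sum-of-Pairs score: sum the column scores over every pair of rows."""
    total = 0
    for row_a, row_b in combinations(alignment, 2):
        for x, y in zip(row_a, row_b):
            if x == GAP and y == GAP:
                continue              # gap-gap column: assumed to score zero
            elif x == GAP or y == GAP:
                total += gap_cost     # letter aligned with a gap
            else:
                total += score(x, y)  # letter aligned with a letter
    return total


def toy_score(x, y):
    """Toy stand-in for a substitution matrix such as BLOSUM62."""
    return 2 if x == y else -1


# Three aligned rows of equal length (hypothetical input).
rows = ["AC-T", "A-GT", "ACGT"]
total = sp_score(rows, toy_score)
```

Because the score is a sum over all pairs of rows, the sum of the optimal pairwise alignment scores is an upper bound on it, which is how the bound S used in the tables is obtained.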

PAGE 45

Figure 3-5. The SP scores of QOMA alignments using the complete K-partite graph and sparse K-partite graphs for different values of d and W on the V1-R1 dataset. The initial alignments are obtained from strategy 2.

PAGE 46

Table 3-2. The average SP scores of QOMA (using the complete K-partite graph with W = 16) and five other tools on BAliBASE benchmarks. The numbers show the SP scores when the tools are run with the same parameters as QOMA (indicated by S) and with their default parameters (indicated by D). Some of the tools, namely T-coffee and ClustalW, did not produce any alignment for some benchmarks under either parameter setting. The results of all the tools are ignored for such benchmarks. "N/A" indicates that the corresponding tool failed to produce an alignment for most of the benchmarks in a dataset for that parameter setting. We ignore such tools (i.e., T-coffee) for those datasets and parameter settings.

Dataset         ClustalW        ProbCons        T-coffee        MUSCLE          DCA             QOMA
                S       D       S       D       S       D       S       D       S       D       S       D
V1-R1-low       -808    -839    -1303   -1303   -1499   -1486   -2029   -778    -440    -440    -182    -182
V1-R1-medium    2156    1982    2128    2068    1955    2007    424     2161    2356    2289    2582    2508
V1-R1-high      4954    4935    4924    4975    4842    4920    3468    5001    5007    5055    5122    5172
V3-R1-low       -1233   -1316   -1763   -1763   -2141   -2052   -2617   -1200   -760    -760    -421    -421
V3-R1-high      2048    1911    2008    2008    1803    1902    263     2101    2288    2288    2507    2507
V3-R2           5984    5953    5862    5892    5793    5847    4564    6029    6126    6153    6287    6313
V3-R3           5838    6005    5741    5976    N/A     5945    4406    6088    5968    6202    6110    6348
V3-R4           251     347     -280    -95     N/A     -148    -935    463     714     880     926     1107
V3-R5           3899    3778    3658    3658    N/A     3553    2117    3935    4310    4310    4446    4446
V3-R6           -1601   -1554   -1781   -1782   N/A     -1859   -1710   -1570   -1335   -1336   -1151   -1152
V3-R7           6977    6832    6555    6555    N/A     6409    4663    6973    7502    7502    8000    8000
V3-R8           8432    8321    8238    8238    N/A     8190    6927    8494    8831    8831    9248    9248

PAGE 47

Table 3-3. The improvement (see Formula 3-1 in Section 3.2.3) of QOMA (using the complete K-partite graph) over ClustalW on the V1-R1 dataset, in percent. The dataset is split into three subsets (short, medium, and long) according to the length of the sequences.

Length    Window size
          2       4       8       16
Short     18.0    29.2    40.2    46.7
Medium    23.3    39.6    51.6    58.6
Long      18.6    39.4    51.5    54.1

Table 3-4. The average and the standard deviation of the error, S - SP, for a window using the sparse version of QOMA on the V1-R1 dataset. Results are shown for window sizes W = 8 and 16, and deviation d = 0, 1, and 2. The table also reports the 95% confidence interval, i.e., the interval around the average that contains 95% of the expected improvement values.

Error using sparse K-partite graph
W    d=0    d=1    d=2

PAGE 48

Table 3-5. The running time of QOMA (in seconds) using the complete K-partite graph and the sparse graph for different values of d and W on the V1-R1 dataset. (A: complete K-partite graph. B: sparse K-partite graph with d = 0. C: sparse K-partite graph with d = 1. D: sparse K-partite graph with d = 2.) The dataset is split into three subsets (short, medium, and long) according to the length of the sequences in the benchmarks.

Window    Short                         Medium                        Long
size      A      B      C      D        A      B      C      D        A      B      C      D
W=2       0.27   0.24   0.35   0.3      0.58   0.61   0.75   0.74     1.11   1.99   2.31   2.27
W=4       1.76   1.01   1.53   1.78     3.62   2.19   3.2    3.74     8      5.48   7.83   8.93
W=8       17     6      9      12       37     13     19     26       86     31     43     61
W=16      232    51     63     81       535    110    132    166      1087   257    306    386

PAGE 49

In this chapter, we consider the problem of aligning multiple protein sequences with the goal of maximizing the SP (Sum-of-Pairs) score when the number of sequences is large. The QOMA (Quasi-Optimal Multiple Alignment) algorithm addressed this problem when the number of sequences is small. However, as the number of sequences increases, QOMA becomes impractical. This chapter develops a new algorithm, QOMA2, which optimizes the SP score of the alignment of an arbitrarily large number of sequences. Given an initial (potentially sub-optimal) alignment, QOMA2 selects short subsequences from this alignment by placing a window on it. It quickly estimates the amount of improvement that can be obtained by optimizing the alignment of the subsequences in short windows on this alignment. This estimate is called the SW (Sum of Weights) score. It employs a dynamic programming algorithm that selects the set of window positions with the largest total expected improvement. It partitions the subsequences within each window into clusters such that the number of subsequences in each cluster is small enough to be optimally aligned within a given time. Also, it aims to select these clusters so that the optimal alignment of the subsequences in these clusters produces the highest expected SP score. The experimental results show that QOMA2 produces high SP scores quickly even for a large number of sequences. They also show that the SW score and the resulting SP score are highly correlated. This implies that it is promising to aim for optimizing the SW score since it is much cheaper than aligning multiple sequences optimally.

Figure 4-1(a) illustrates this. Here, sequences a and b are optimally aligned. Then, c and d are optimally aligned. Their resulting alignments are aligned next. Progressive methods, however, have an important shortcoming. The

PAGE 50

order in which the profiles are chosen for alignment affects the quality of the alignment significantly. The optimal alignment may be different from all possible alignments obtained by considering all possible orderings of sequences [100].

Table 4-1. The list of variables used in this chapter

Variable    Meaning

Table 4-1 defines the variables frequently used in the rest of the chapter.

In Chapter 3, we introduced QOMA [99], which eliminated the drawbacks of the progressive methods. QOMA partitioned an initial alignment into short subsequences by placing a window. It then optimally realigned the subsequences in each window. This is shown in Figure 4-1(b). Optimally aligning each window costs $O(W^K 2^K)$, significantly less than $O(N^K 2^K)$ for $W \ll N$. However, when $K$ is large, even $O(W^K 2^K)$ becomes too costly. The value of $W$ needs to be reduced significantly to make QOMA practical. For example, assume that QOMA works for $W = 32$ when $K = 6$. When $K$ becomes 18, $W$ should be reduced to two in order to run in roughly the same time. This, however, reduces the SP score of the alignments found by QOMA since each window contains extremely short subsequences.

This chapter addresses the problem of aligning multiple protein sequences with the goal of achieving a large SP score when the number of sequences is large. We develop an algorithm, QOMA2, which works well even when the number of sequences is large. Figure 4-1(c) illustrates the QOMA2 algorithm. It takes K sequences and an initial (potentially sub-optimal) alignment of them as input. QOMA2 selects short subsequences

PAGE 51

4-1(c)). This is desirable since the optimal clustering of the subsequences may differ for different window positions. The value of $T$ is determined by the time budget allowed for QOMA2, since the alignment of the subsequences in clusters governs the overall running time. As $T$ increases, both the alignment score and the running time increase. The experimental results show that QOMA2 achieves high SP scores quickly even for a large number of sequences. They also show that the SW score and the resulting SP score are highly correlated. This implies that it is promising to aim for optimizing the SW score since it is much cheaper than aligning multiple sequences optimally.

METIS [109, 110] is a popular tool for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices. The algorithms implemented in METIS are based on the multilevel recursive-bisection, multilevel k-way, and multi-constraint partitioning schemes. It can provide high quality partitions fast.
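The earlier example in this chapter, in which a window size of 32 for 6 sequences has to shrink to 2 for 18 sequences, follows directly from the $O(W^K 2^K)$ cost of optimally aligning a window. The short sketch below checks that arithmetic. The function names are hypothetical and constant factors are ignored, so the numbers are step counts rather than running times.

```python
def window_cost(W, K):
    """Step count suggested by O(W^K * 2^K), with constants ignored."""
    return (W ** K) * (2 ** K)


def largest_window_within_budget(K, budget):
    """Largest W >= 1 whose window cost for K sequences stays within budget.

    Returns 1 when even W = 2 is too costly.
    """
    W = 1
    while window_cost(W + 1, K) <= budget:
        W += 1
    return W


# Budget implied by W = 32 and K = 6: 32^6 * 2^6 = 2^36 steps.
budget = window_cost(32, 6)
w_for_18 = largest_window_within_budget(18, budget)
```

With 18 sequences, $W = 2$ already costs $2^{18} \cdot 2^{18} = 2^{36}$ steps, which uses up the whole budget. This is the motivation for aligning clusters of at most $T$ subsequences instead of all $K$ at once.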

PAGE 52

Figure 4-1. Alignment strategies at a high level: (a) progressive alignment, (b) the QOMA algorithm, (c) the QOMA2 algorithm. The solid lines denote sequences a, b, ..., f. Dashed polygons denote the (sub)sequences whose alignments are optimized. The trees next to the alignments show the guide tree used by the underlying algorithm to align the sequences. In (a), a and b are optimally aligned. Then, c and d are optimally aligned. Their resulting alignments are aligned next. In (b), small subsequences of a, b, c, and d in each window are aligned optimally. In (c), the window on the left indicates that the subsequences from a, b and c are optimally aligned, the subsequences from d, e and f are optimally aligned, and then their results are aligned. Similarly, the window on the right indicates that the subsequences from a, b and f are optimally aligned, the subsequences from c, d and e are optimally aligned, and then their results are aligned.

The first problem that needs to be addressed is the identification of the M locations that maximize the overall improvement. Figures 4-1(b) and 4-1(c) show two examples in which three and two positions are selected, respectively. It is important to mention that the number of windows, M, is governed by the total time allowed for improving the alignment.

PAGE 53

For each window position, we compute an upper bound to the improvement of the SP score that could be achieved by replacing that window with its optimal alignment as follows. Let $X_i$ denote the upper bound to the optimal SP score for the subsequences in the window starting at position $i$ of the alignment. This number can be computed as the sum of the scores of all pairwise optimal alignments of the subsequences in this window. Let $Y_i$ denote the current SP score of that window. The upper bound to the improvement of the SP score is computed as $U_i = X_i - Y_i$. We say that a window position $i$ is promising if $U_i$ is large.

We propose to select the M window positions, 1, 2, ..., M ($\forall i$, i
PAGE 54

We develop a dynamic programming solution to determine the optimal window positions. Let $SU(a, b)$ denote the largest possible sum of upper bounds of $b$ window positions selected from the first $a$ possible window positions. We would like to determine $SU(N-W+1, M)$ to solve our problem, where $N$ is the length of the alignment. Clearly, $SU(a, 1) = \max_{i=1}^{a} \{U_i\}$. This is because if a single window is selected, it should be the one with the largest upper bound. For $b > 1$, there are two possibilities: 1) If a < ...

$$SU(a, b) = \max \begin{cases} SU(a', b-1) + U_a, & \text{if } U_a \text{ is selected} \\ SU(a-1, b), & \text{otherwise} \end{cases}$$

In this computation, the first condition implies that the $b$th window starts at position $a$. Thus, the first $b - 1$ windows should be selected in the interval $[1, a']$, where $a'$ is the last start position that does not overlap with the $b$th window by more than the allowed amount. The second condition implies that the window at position $a$ is not a part of the solution. Therefore, the $b$ window positions should be selected in the interval $[1, a-1]$. The value of $SU(N-W+1, M)$ is the optimal sum of upper bounds. The window positions that lead to this optimal solution can be found by tracking back the values of $SU$ after the dynamic programming computation completes.

Figure 4-2 shows the average SP score of the improved alignment for the first eleven window positions when the windows are selected using our dynamic programming method, greedily, and by sliding a window. For the window sliding strategy, we shift the window by

PAGE 55

Figure 4-2. Comparison of the SP score found by different strategies of selection of window positions: using the proposed optimal selection, the greedy selection, and the sliding window.
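The dynamic programming selection of window positions compared in this figure can be sketched as follows. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the name `select_windows_dp` and the parameter `min_gap` (the smallest allowed distance between two selected start positions, standing in for the overlap restriction) are hypothetical, and positions are 0-based in the returned list.

```python
def select_windows_dp(U, M, min_gap):
    """Choose M start positions, pairwise at least min_gap apart, that
    maximize the sum of the upper bounds U. Returns (best_sum, positions);
    best_sum is None when M such positions do not exist."""
    n = len(U)
    NEG = float("-inf")
    # SU[b][a]: best sum of b windows chosen among the first a positions.
    SU = [[NEG] * (n + 1) for _ in range(M + 1)]
    for a in range(n + 1):
        SU[0][a] = 0
    for b in range(1, M + 1):
        for a in range(1, n + 1):
            skip = SU[b][a - 1]                              # a not selected
            take = SU[b - 1][max(a - min_gap, 0)] + U[a - 1]  # a selected
            SU[b][a] = max(skip, take)
    if SU[M][n] == NEG:
        return None, []
    # Trace back through the table to recover the chosen positions.
    positions, a, b = [], n, M
    while b > 0:
        if SU[b][a] == SU[b][a - 1]:
            a -= 1
        else:
            positions.append(a - 1)
            a = max(a - min_gap, 0)
            b -= 1
    return SU[M][n], positions[::-1]


# Hypothetical bounds: a greedy choice would take the middle window (5)
# and block both neighbors, while the DP finds the better pair.
best_sum, best_positions = select_windows_dp([4, 5, 4], M=2, min_gap=2)
```

The small example shows why an optimal selection can beat the greedy one: picking the single largest bound first may rule out two neighboring windows whose bounds add up to more.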

PAGE 56

4-1(c) as an example). This is desirable because different regions in sequences may evolve at different conservation rates. For example, regions that serve important functions show much less variation than the remaining regions. Therefore, the best clustering for one region of the sequences may not be good for another region. QOMA2 addresses this by treating each region independently.

We first construct an initial weighted complete graph by considering each subsequence in the window as a vertex. We then align the subsequences using two nested loops. The details of the two steps are discussed next.

Each $f_i$ maps to a vertex $v_i \in V$ in this graph. We compute the weight of the edge $e_{i,j} \in E$ between vertices $v_i$ and $v_j$ as

PAGE 57

4-1).

The vertex induced subgraph of any subset $V' \subseteq V$ defines a complete subgraph $G' = (V', E')$. The SW score of $G'$ is an upper bound to the amount of improvement that can be obtained by optimally aligning only the subsequences that map to the vertices in $V'$. In the following sections we will exploit the SW score to find a good clustering of the subsequences in a given window.

We first need to understand how many clusters need to be created. The number of subsequences in each partition should be as large as possible. This is because more subsequences are optimally aligned with each other when the clusters are large. This indicates that there must be $\lceil K/T \rceil$ clusters.

Next, we need to understand the right criteria to partition the set of subsequences. A number of strategies can be developed to address this question. We discuss two solutions with the help of the complete weighted graph $G$ constructed for the subsequences. Notice that partitioning the set of subsequences into clusters of subsequences is equivalent to partitioning the graph $G$ into vertex induced subgraphs of the vertices corresponding to the subsequences in each cluster.

PAGE 58

$\lceil K/T \rceil$ subgraphs such that the sum of the SW scores of these subgraphs is as large as possible. This is equivalent to the Min $\lceil K/T \rceil$-Cut problem with the additional restriction that each subgraph has at most $T$ vertices. In other words, it translates into the problem of finding the set of edges in $G$ such that ... $\lceil K/T \rceil$ complete subgraphs of size at most $T$, and ...

Finding the Min $\lceil K/T \rceil$-Cut of a graph is an NP-complete problem. A number of heuristic algorithms have been developed to address this issue. One of the most commonly used tools for partitioning graphs is METIS [109, 110]. METIS partitions an input graph into a given number of subgraphs with the aim of minimizing or maximizing the total weight of the edges between different subgraphs. We use METIS to partition $G$ into $\lceil K/T \rceil$ subgraphs with minimal $\lceil K/T \rceil$-cut.

Although METIS tries to partition the graph into subgraphs of roughly the same size, it does not guarantee that they will be perfectly balanced in size. As a result, some of the clusters determined by METIS can have more than $T$ vertices. This is undesirable since the subsequences in each cluster are optimally aligned in the following step. Recall that the cost of optimally aligning a cluster is exponential in the size of that cluster. The maximum size of a cluster, $T$, is determined by the total amount of time allowed for optimizing the alignment. Thus, METIS clusters need to be post-processed to guarantee that the sizes of the clusters do not exceed $T$.

Next, we describe how we propose to adjust the size of the METIS clusters for the first strategy (i.e., optimizing the intra-cluster SP score). It is trivial to adapt this algorithm to the second strategy.
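The post-processing idea for the first strategy can be illustrated with the sketch below. It is one plausible reading of the adjustment step, not the exact procedure of this chapter: the function name `rebalance_clusters`, the `weight` callback, and the loss measure (weight a vertex loses by leaving its oversized cluster minus the weight it gains in the receiving cluster) are assumptions made for illustration.

```python
def rebalance_clusters(clusters, weight, T):
    """Move vertices out of clusters larger than T, one at a time, always
    choosing the move that loses the least intra-cluster edge weight."""
    clusters = [set(c) for c in clusters]

    def attachment(u, cluster):
        # Total weight of the edges between u and the other cluster members.
        return sum(weight(u, v) for v in cluster if v != u)

    while any(len(c) > T for c in clusters):
        best = None  # (loss, vertex, source index, target index)
        for s, src in enumerate(clusters):
            if len(src) <= T:
                continue
            for u in src:
                for t, dst in enumerate(clusters):
                    if t == s or len(dst) >= T:
                        continue  # target must have room for one more vertex
                    loss = attachment(u, src) - attachment(u, dst)
                    if best is None or loss < best[0]:
                        best = (loss, u, s, t)
        if best is None:
            raise ValueError("no cluster has room for another vertex")
        _, u, s, t = best
        clusters[s].remove(u)
        clusters[t].add(u)
    return clusters


def toy_weight(u, v):
    """Hypothetical edge weights: vertices 0-2 and 3-5 form two tight groups."""
    return 10 if (u < 3) == (v < 3) else 1


balanced = rebalance_clusters([{0, 1, 2, 3}, {4, 5}], toy_weight, T=3)
```

Each move takes one vertex from a cluster with more than T vertices and places it in a cluster with fewer than T vertices, so the loop terminates as long as the clusters have enough total capacity. For the second strategy the comparison would be reversed so that the move maximizing the cut is chosen.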

PAGE 59

$\lceil K/T \rceil$-cut of $G$. Similar to the first strategy, we use METIS to identify such a cut.

The proposed algorithm for post-processing the clusters found by METIS can be adapted to the second strategy as follows. At each iteration of the while loop, the vertex move that maximizes the cut is chosen instead of the one that minimizes it. This can be done by modifying Steps 1 and 2.c of the algorithm.

It is worth mentioning that the METIS algorithm for clustering the sequences is a module in QOMA2. It can be replaced by any clustering algorithm that finds a better Min $\lceil K/T \rceil$-Cut or Max $\lceil K/T \rceil$-Cut in the future.

4.3.2). One drawback of these strategies is that each edge weight is computed by only considering the two subsequences corresponding to the two ends of that

PAGE 60

4.3.1). This is problematic because the amount of improvement in the SP score by optimally aligning a cluster of subsequences depends on all the subsequences in that cluster. Considering two subsequences at a time greatly overestimates the improvement. We propose to improve the clusters iteratively. Each iteration updates the edge weights by considering all the subsequences in each cluster. We discuss how the edge weights are updated later in this section. Once the edge weights are updated, the subsequences are reclustered using the new weights. The iterations stop when the SW score of the graph $G$ does not increase between two consecutive iterations or a certain number of iterations have been performed.

We would like to estimate how much the two subsequences, $f_i$ and $f_j$, contribute to the SP score under the restriction that each cluster is optimally aligned. The obvious solution is to optimally align each cluster and measure the new alignment score. This, however, is not practical for two reasons. First, optimally aligning a cluster of $T$ subsequences is a costly operation. Performing this operation will make each iteration of the cluster refinement as costly as QOMA2. Furthermore, this will only update the weight of the edges whose two ends belong to the same subgraph (i.e., intra-cluster edges). The weight of the edges between different subgraphs (i.e., inter-cluster edges) still needs to be computed. Thus, a good estimator should be efficient and work for both inter- and intra-cluster edges.

We propose to estimate the edge weights by focusing on the gaps. At a high level, we assume the best scenario (i.e., the smallest possible number of gaps) for intra-cluster edges. This is because of the restriction that the subsequences in each cluster are optimally aligned. We then estimate the improvement in the SP score between every pair of subsequences by considering these gaps. We describe our estimator in detail next.

Let $L_i$ be the length of subsequence $f_i$. After the complete weighted graph $G$ is partitioned into $\lceil K/T \rceil$ complete subgraphs, assume that $v_i$ belongs to the subgraph $G'$. Recall that $v_i$ is the vertex that denotes $f_i$. The optimal alignment of all the subsequences

PAGE 61

Next, we compute the expected number of indels (insertions or deletions) in the alignment of subsequences $f_i$ and $f_j$. An indel is an alignment of a letter with a gap. The alignment of two letters or of two gaps is not considered an indel. Considering all possible arrangements of the letters and gaps in $f_i$ and $f_j$, the expected ratio of letter-letter alignments between $f_i$ and $f_j$ in their alignments is

Similarly, the expected ratio of gap-gap alignments is

Thus, the expected ratio of indels can be computed by subtracting equations (4-2) and (4-3) from one. The total length of the induced alignment of $f_i$ and $f_j$ is at most $\max\{L_i + g_i, L_j + g_j\}$. Therefore, the expected number of indels in the induced alignment of $f_i$ and $f_j$, denoted by $Gap_{expected}(f_i, f_j)$, is at most

Let $Gap_{induced}(f_i, f_j)$ denote the number of indels in the induced alignment of $f_i$ and $f_j$. Let gapcost denote the cost of a single indel. We compute the new weight of the edge

PAGE 62

4.3.1 since it considers the change in the gap cost as imposed by the clusters that $f_i$ and $f_j$ belong to.

Once the weights of the edges are updated, the current partitioning may not be a good one anymore. Therefore, we iteratively run the clustering algorithm again and update the edge weights similarly until the SW score of the complete graph built for the current window does not increase any further or a given maximum number of iterations is reached.

The pseudo-code of the adjustment in Section 4.3.3:
1. ... min = $\infty$;
2. ...
   (b) ...
   (c) - Update min as min = $u_{G'} - u_{G''}$;
3. Move the vertex $u$ from $G'$ to $G''$ according to the best move;

PAGE 63

$\lceil K/T \rceil$ profiles, which is significantly less than $K$. We recursively apply the QOMA2 algorithm (Sections 4.3.1 to 4.3.3) to these profiles until all the subsequences are aligned.

..., where $c$ is the upper bound for the number of inner loop iterations. In practice $c \le 10$.

We deduce the time complexity as follows. For each window, we need to apply the clustering algorithm and align the clusters using two nested loops. The outer loop iterates $\lceil \log_T K \rceil$ times.

At each iteration, the set of subsequences inside the window is partitioned into clusters and the edge weights are updated. Thus, each iteration of the inner loop costs $O(|E|)$ time. Since $G$ contains $K$ vertices, $O(|E|) = O(K^2)$. At the end of each iteration of the inner loop all the clusters are optimally aligned. Optimally aligning $T$ subsequences costs $O(W^T 2^T)$ time. At the $i$th iteration of the outer loop, $O(K/T^i)$ such optimal alignments are done. Adding these steps, we find that the total cost of the $i$th iteration of the outer loop is $O(\frac{K}{T^i} W^T 2^T + cK^2)$:

PAGE 64

$\dots \frac{K}{T^i} W^T 2^T + cK^2) = O((\log_T K)(K W^T 2^T (\sum_{i=1}^{\lceil \log_T K \rceil} \dots) + cK^2)) = O((\log_T K)(K W^T 2^T \dots$

Experimental setup: We used BAliBASE benchmarks [5] reference 1 from version 1 (

We implemented the QOMA2 algorithm using standard C. We downloaded ProbCons [88], T-Coffee [2], MUSCLE [78], and ClustalW [1, 77] for comparison. We also downloaded DCA [47] since it aims to maximize the SP score as well. However, DCA did not run for the benchmarks in our datasets D10 and D20 since it cannot align a large number of sequences. We used BLOSUM62 as a measure of similarity between amino acids, since BLOSUM62 is commonly used. Using other popular score matrices, such as BLOSUM90 or PAM250, will produce similar results. We used gap cost = -4 to penalize each indel. In order to be fair, we used the same parameters (i.e., BLOSUM62 and gap

PAGE 65

Among the competing tools used in our experiments, MUSCLE aims to maximize the SP score, while ClustalW and T-Coffee aim to maximize a weighted version of the SP score. Therefore, one can argue that it is not fair to include ClustalW, T-Coffee and ProbCons in our experiments. We, however, include them since most of the existing tools that aim to maximize the SP score, such as DCA or MSA, do not work for a large number of sequences. We improve the fairness of our experiments by using the same parameters for all the tools.

First, we compared different clustering algorithms and showed the relationship between the SP and the SW scores on each window. We then evaluated the impact of the window and the cluster size on the SP score of the QOMA2 alignment and the running time of QOMA2. We also compared the SP scores of QOMA2 with four competing multiple sequence alignment tools. We ran our experiments on a system with dual 2.59 GHz AMD Opteron processors, 8 gigabytes of RAM, and a Linux operating system.

Dataset Details

4-3 ... time. This makes QOMA2 desirable since the SW score can be measured efficiently without actually finding the alignment of multiple sequences. In this experiment, we

PAGE 66

evaluate the relationship between the SW and the SP scores. We also measure how each of the proposed clustering strategies performs. We place a window (W = 16) on all possible locations of an initial alignment. We find the clusters using the Min-Cut and the Max-Cut clustering algorithms (see Section 4.3.2). We also find clusters using the iterative refinement (see Section 4.3.3) on the results of Min-Cut and Max-Cut. We measure the average SP and SW scores obtained by these algorithms for T = 2, 3, and 4. We use the D20 dataset in this experiment.

The distribution of the number of benchmarks with different number of sequences (K).

Table 4-2 presents the results. Results show that there is a strong correlation between the SP and the SW scores. For each value of T, the SP score gets larger when the SW score gets larger. This implies that optimizing the SW score can potentially optimize the SP score. This is an important observation since the cost of computing the SW score is negligible as compared to that of the SP score. Note that the SW scores obtained

PAGE 67

with different numbers of clusters are not comparable to each other since they compute the gap cost under different cluster size assumptions. The results also demonstrate that the iterative refinement helps in improving the SW and the SP scores of both the Max-Cut and the Min-Cut algorithms. The Max-Cut algorithm with iterative refinement always has the best SP and SW scores. This implies that if the induced alignment of two subsequences has a high score as compared to that of their optimal alignment, it is advantageous to keep them in the same cluster (i.e., force them to be almost optimally aligned).

Table 4-2. The average SW and SP scores of individual windows after applying different clustering algorithms for different values of T, with W = 16. The average SP score of the initial alignment in the window is 351. The average upper bound to the SP score for the subsequences in the windows is 1113. Benchmarks are selected from the D20 dataset.

T    Min-Cut      Min-Cut Iterative    Max-Cut      Max-Cut Iterative
     SP    SW     SP    SW             SP    SW     SP    SW

The SP score of all the methods increases as the value of T increases. This is intuitive since more subsequences are optimally aligned at once for large values of T.

Another important observation that follows from these results is that optimally aligning clusters does not always improve the SP score of a window. It can actually reduce it. This happens especially for the Min-Cut clustering (with or without iterative refinement) for all values of T as well as the Max-Cut clustering for T = 2. This is because when the clusters of subsequences are aligned, they impose a certain alignment for the subsequences in each cluster. These restrictions limit the number of possibilities in which a set of clusters can be aligned together. This indicates that the clusters should be selected carefully.

PAGE 68

Table 4-3. The average SP scores of QOMA2 for individual windows. "SP before" and "Upper bound" denote the average initial SP scores and the average upper bounds to the SP scores for individual windows, respectively. Benchmarks are selected from the D10 dataset.

W     SP before    Upper bound    T=2     T=3     T=4     T=5
4     -186         -67            -171    -158    -152    -147
8     -212         100            -175    -140    -124    -111
12    -264         247            -203    -147    -120    -100
16    -342         358            -257    -183    -148    -117

In the rest of the experiments, we select the Max-Cut clustering algorithm with iterative refinement as the default clustering algorithm of QOMA2.

Table 4-3 shows the SP score of individual windows aligned by QOMA2 for different values of W and T. The results show that the SP scores increase when T increases for all values of W.

Table 4-4 shows the SP scores of alignments of the entire benchmarks in D10 using QOMA2 for varying values of W and T. As W and T increase, QOMA2 produces higher scores. The two extreme parameter choices of using a very large value for one of these parameters and a very small value for the other, i.e., W = 16, T = 2 or W = 4, T = 5, produce lower SP scores than intermediate solutions such as W = 12, T = 3. This is an important observation since it validates that QOMA2 is superior to the two existing extreme solutions (see Figure 4-1).

Table 4-4 shows the average running time of QOMA2 for optimizing a single window for varying values of W and T. The experimental results show that QOMA2 runs very efficiently even for a large number of sequences. As we have mentioned in Section 4.3.5, the time complexity of QOMA2 is $O((\log_T K)(K W^T 2^T \dots$

PAGE 69

Table 4-4. The average SP scores of the alignments of the entire benchmarks in D10 using QOMA2. The average SP score of the initial alignments is -12295. The average of the upper bound to the SP scores of the benchmarks is 17648. The average running times, in seconds, are shown in parentheses.

W     T=2               T=3               T=4               T=5
4     -7119 (1.173)     -6770 (0.653)     -6676 (0.403)     -6498 (0.465)
8     -6197 (1.213)     -5348 (0.673)     -4762 (1.053)     -4236 (5.050)
12    -5914 (1.116)     -4659 (0.808)     -3966 (3.619)     -3464 (13.485)
16    -5690 (1.097)     -4327 (1.102)     -3555 (8.856)     -2811 (40.132)

for a single window. The experimental results suggest that when W is large, the factor $O((\log_T K)(K W^T 2^T \dots$

From Tables 4-3 and 4-4, we conclude that a good point for balancing time and quality is at (W = 12, T = 4).

Table 4-5 presents the SP scores of the alignments of the benchmarks in D10 using four existing tools and QOMA2. Note that the compared tools do not aim to maximize the SP score. ClustalW, MUSCLE, and T-coffee optimize a variation of the SP score by computing weights for sequences or subsequences. We still included this experiment because the existing tools that optimize the SP score, such as DCA [47], MSA [61] and COSA [111], do not work for a large number of sequences. For a small number of sequences, QOMA performs significantly better than DCA (see [99]). We divided the queries into four subsets according to the number of sequences they contain. The table shows that QOMA2 has a higher SP score than all the tools compared. ClustalW is always the second best. The remaining tools are not competitive in terms of the SP score.

Table 4-5. The average SP scores of QOMA2 (W = 12 and T = 4) and four other tools on the D10 dataset. The competing tools (except ProbCons) are run with the same parameters as QOMA2.

10-14    -16921    -16713    -24492    -12586    -12318
15-19    -14454    -29751    -31851    -9426     -9088
20-24    -5958     -12006    -28866    -778      -710
25-29    -24033    -29305    -50576    -9628     -8989

PAGE 70

In this chapter, we introduce a new graph-based multiple sequence alignment method for protein sequences. We name our method HSA (Horizontal Sequence Alignment) because it horizontally slides a window on the protein sequences simultaneously. HSA considers all the proteins at once. It obtains the final alignment by concatenating cliques of a graph. In order to find a biologically relevant alignment, HSA takes secondary structure information as well as amino acid sequences into account. The experimental results show that HSA achieves higher accuracy compared to existing tools on BAliBASE benchmarks. The improvement is more significant for proteins with low similarity.

31 ]. We call this a vertical alignment since it progressively adds a new sequence (i.e., row) to a consensus alignment. These methods have the shortcoming that the order in which sequences are added to the existing alignment significantly affects the quality of the resulting alignment. This problem is more apparent when the percentage of identities among amino acids falls below 25%, called the twilight zone [88]. The accuracies of most progressive sequence alignment methods drop considerably for such proteins.

We consider the problem of alignment of multiple proteins. We develop a graph-based solution to this problem. We name this algorithm HSA (Horizontal Sequence Alignment) as it horizontally aligns sequences. Here, horizontal alignment means that all proteins are aligned simultaneously, one column at a time. HSA first constructs a directed graph. In this graph, each amino acid of the input sequences maps to a vertex. An edge is drawn between pairs of vertices that may be aligned together. The graph is then adjusted by

PAGE 71

112-115 ].

HSA works in five steps:

(1) An initial directed graph is constructed by considering residue information such as amino acid and secondary structure type.
(2) The vertices are grouped based on the types of residues. The residue vertices in each group are more likely to be aligned together in the following step.
(3) Gap vertices are inserted into the graph in order to bring vertices in the same group close to each other in terms of topological position in the graph.
(4) A window is slid from beginning to end. The clique with the highest score is found in each window and an initial alignment is constructed by concatenating these cliques.
(5) The final alignment is constructed by adjusting gap vertices of the initial alignment.

Next, we describe these five steps in detail.

PAGE 72

5-1 . Two types of edges are defined. First, a directed edge is included from the vertex corresponding to si(j) to si(j+1) for all consecutive amino acids. Second, an undirected edge is drawn between pairs of vertices of different colors if their substitution score is higher than a threshold. HSA gets the substitution score from the BLOSUM62 matrix. A weight is assigned to each undirected edge as the sum of the substitution score and the typeScore for the amino acid pair that makes up that edge. The typeScore is computed from the SSE types. If two residues belong to the same SSE type, then their typeScore is high. Otherwise, it is low. We discuss this in more detail in Section 5.2.2. This policy of weight assignment gives residues with the same SSE type or similar amino acids a higher chance to be aligned in the following steps. We will discuss this in Section 5.2.4. Figure 5-1 demonstrates this step on three proteins. The amino acid sequences and the SSEs are shown at the top of this figure. The dotted arrows represent the undirected edges between two vertices of different color. The solid arrows appear only between the vertices corresponding to consecutive amino acids of the same protein, and they have only one direction, from left to right.

5-2 , S1 consists of four fragments: f1 = LT, f2 = GKTIV, f3 = E, and f4 = IAK. Thus, S1 can be written as S1 = f1 f2 f3 f4.
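The graph construction described above can be sketched as follows. This is a minimal illustration, not the HSA implementation: the substitution function is a toy stand-in for BLOSUM62, the typeScore values (2 and 0) and the threshold are hypothetical, and the names `build_initial_graph`, `toy_substitution_score` and `toy_type_score` are introduced here only for the example.

```python
from itertools import combinations


def toy_substitution_score(a, b):
    # Assumption: toy stand-in for a BLOSUM62 lookup (+4 identical, -1 otherwise).
    return 4 if a == b else -1


def toy_type_score(sse_a, sse_b):
    # Assumption: the text only says "high" for equal SSE types and "low"
    # otherwise; the values 2 and 0 are hypothetical.
    return 2 if sse_a == sse_b else 0


def build_initial_graph(sequences, sse_strings, threshold=0):
    """Return (directed_edges, undirected_edges).

    A vertex is a pair (sequence index, residue index); the sequence index
    plays the role of the vertex color.
    """
    # Directed edges link consecutive residues of the same sequence.
    directed = []
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - 1):
            directed.append(((i, j), (i, j + 1)))

    # Undirected edges link residues of different sequences whose
    # substitution score exceeds the threshold.
    undirected = {}
    for a, b in combinations(range(len(sequences)), 2):
        for j, res_a in enumerate(sequences[a]):
            for k, res_b in enumerate(sequences[b]):
                sub = toy_substitution_score(res_a, res_b)
                if sub > threshold:
                    weight = sub + toy_type_score(sse_strings[a][j], sse_strings[b][k])
                    undirected[((a, j), (b, k))] = weight
    return directed, undirected
```

With a real substitution matrix, only `toy_substitution_score` would need to change; the edge weight remains the sum of the substitution score and the typeScore.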

PAGE 73

Figure 5-1. The initial graph constructed for sequences S1, S2 and S3. Each residue maps to a vertex in this graph. The figure shows some edges between the first vertices of the sequences, indicated by dashed arrows. The vertices for different sequences are marked with different colors (colors not shown in figure).

With the knowledge that fragments with the same SSE type are more likely to be aligned, all sequences are scanned to find fragments with known SSE types. The fragments are then clustered into groups, where each group consists of one fragment from each sequence. To group fragments, we align the fragments first. We use a simplified dynamic programming algorithm by considering each fragment as a residue in the basic algorithm [28]. The score of a fragment pair is computed from the following formula:

PAGE 74

Figure 5-2. The fragments with similar features, such as SSE types, lengths and positions in the original sequences, are grouped together.

They are the same type of no SSE type, we return 1; 4) They are α-helix and β-sheet, we return -4; 5) Otherwise, we return 0. The positionPenalty is computed as the difference between the positions of two fragments. Here the position of a fragment is the topological position in the original sequence. If two fragments are far away in their sequences, then the pair gets a higher penalty. This is because the alignment of such fragments introduces many gaps. The lengthPenalty is computed as the difference between the lengths of the two fragments. The length of a fragment is the number of residues it contains. Fragment pairs with similar lengths are given a smaller penalty. This is because, as the lengths of the two fragments differ more, the number of gap vertices that need to be inserted in the later alignment increases.

Figure 5-2 demonstrates how HSA groups fragments. Using the example of Figure 5-1, fragments with the same SSE type, similar positions and lengths are clustered into the same group. Two such groups, with α-helix and β-sheet, are circled in Figure 5-2.
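The scoring of a fragment pair can be illustrated with a short sketch. It assumes that the score is the typeScore minus the two penalties, and that both penalties are absolute differences; the value returned for two fragments of the same α-helix or β-sheet type (`SAME_SSE_SCORE`) is a hypothetical placeholder, as are the names `Fragment`, `type_score` and `fragment_pair_score`.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    sse: str       # "H" (alpha-helix), "E" (beta-sheet) or "-" (no SSE type)
    position: int  # position of the fragment in its original sequence
    length: int    # number of residues in the fragment


SAME_SSE_SCORE = 4  # hypothetical value for two helices or two sheets


def type_score(a, b):
    if a.sse == b.sse:
        # Two fragments without SSE type score 1 (from the text); the score
        # for matching helices or sheets is an assumption.
        return 1 if a.sse == "-" else SAME_SSE_SCORE
    if {a.sse, b.sse} == {"H", "E"}:
        return -4  # alpha-helix against beta-sheet (from the text)
    return 0       # every other combination (from the text)


def fragment_pair_score(a, b):
    # Assumption: penalties are subtracted from the typeScore.
    position_penalty = abs(a.position - b.position)
    length_penalty = abs(a.length - b.length)
    return type_score(a, b) - position_penalty - length_penalty
```

Under these assumptions, two helices at the same position whose lengths differ by one residue score 3, while a helix and a sheet two positions apart score -6.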

PAGE 75

Figure 5-3. A gap vertex is inserted to bring the fragments in the same group close to each other vertically.

5-3 , vertex L in S1, vertex P in S2, and vertex P in S3 are at the same vertical position 1. Similarly, vertex T in S1, vertex N in S2, and vertex S in S3 are at the same vertical position 2, etc. As we will discuss later, this process increases the possibility that the vertices in these fragments are aligned.

We update the graph by inserting gap vertices, as shown in Figure 5-3. First, we compute the number of gap vertices to be inserted based on two factors: 1) the number of residues in fragments; 2) the relative positions of fragments in the same group. Here a good relative position of fragments means that the positions of fragments lead to a high scoring alignment of the vertices in these fragments. We align the vertices in fragments of the same group to compute those positions. Then, we randomly select a position between two consecutive fragment groups. Finally, for each sequence we insert gap vertices at these

PAGE 76

5-3 , a gap vertex is inserted before residue I in S3 to bring fragments in the group with β-sheet type close to each other.

As demonstrated in Figure 5-4, we start by placing a window of width W at the beginning of each sequence. This window defines a subgraph of the graph. Typically, we use W=4 or 6. The example in Figure 5-4 uses W=3. Next, we greedily choose a clique with the best expectation score from this subgraph. We will define the expectation score of a clique later. A clique here is defined as a complete subgraph that consists of one vertex from each color. In other words, if K sequences are to be aligned, a clique corresponds to the alignment of one letter from each of the K sequences. Thus, each clique produces one column of the multiple alignment. For each clique, we align the letters of that clique, and iteratively find the next best clique that 1) does not conflict with this clique, and 2) has at least one letter next to a letter in this clique. This iteration is repeated t times to find t columns. Typically, t=4. These t cliques define a local alignment of the input sequences. The expectation score of the original clique is defined as the SP score of this local alignment. After finding the clique with the highest expectation score, we add this clique as a column to the existing alignment. We then slide the window to the location immediately after the clique found and repeat the same process until the window reaches the end of the sequences. Each clique defines a column in the multiple alignment. The columns are concatenated and gaps are inserted to align them. Figure 5-4 illustrates this step. In the window (circled by the dotted rectangle), the highest expectation score clique

PAGE 77

Figure 5-4. Cliques found in the sliding window (window size = 3) are the columns of the resulting alignment. Gaps are inserted to concatenate these columns.

(the left column, marked with a shaded background) consists of residues T, R, and I in S1, S2 and S3 respectively. Then, the window slides to the next location toward the right of the graph (this window is not shown in Figure 5-4), and the highest expectation score clique in that window (the right column, marked with a shaded background) consists of residues V, V, and C in S1, S2 and S3 respectively. The two cliques found are two columns in the resulting alignment. The resulting alignment is obtained by inserting a gap vertex into S3.

As mentioned in Section 5.2.1, due to the policy of edge weight assignment, cliques that contain vertices of the same SSE type or similar amino acids have higher scores than other possible cliques. Since a clique contains one vertex of each color, finding the best clique does not impose any order on the traversal of vertices of different colors. Thus, unlike existing tools, our method is order independent.
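The search for a clique inside one window can be sketched as below. This is a simplification under stated assumptions: a clique is scored by the sum of its own edge weights rather than by the expectation score (which looks ahead over t further cliques), the search is exhaustive, and the function name and the edge dictionary layout are introduced only for this illustration.

```python
from itertools import combinations, product


def best_clique_in_window(starts, lengths, width, edge_weight):
    """Pick the highest-scoring clique in the current window.

    starts[i]   : first residue index of the window in sequence i
    lengths[i]  : length of sequence i
    edge_weight : dict mapping ((seq_a, idx_a), (seq_b, idx_b)), with
                  seq_a < seq_b, to the weight of that undirected edge
    Returns (score, clique), or None if the window contains no clique.
    """
    # One list of candidate vertices per sequence (i.e., per color).
    candidates = [
        [(i, j) for j in range(start, min(start + width, lengths[i]))]
        for i, start in enumerate(starts)
    ]
    best = None
    for clique in product(*candidates):
        score = 0
        complete = True
        for u, v in combinations(clique, 2):
            weight = edge_weight.get((u, v))
            if weight is None:  # a missing edge means this is not a clique
                complete = False
                break
            score += weight
        if complete and (best is None or score > best[0]):
            best = (score, clique)
    return best
```

Because one vertex is taken from every sequence at once, the result does not depend on the order in which the sequences are listed, which mirrors the order independence argued above.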

PAGE 78

Figure 5-5. Gaps are moved to produce longer and fewer gaps. We favor gaps outside the fragments of type α-helix and β-sheet.

the gaps as follows. The sequences are scanned from left to right to find isolated gaps. If a gap is inside a fragment of type α-helix or β-sheet, it is moved outside of that fragment, either before or after. We choose the direction that produces the higher alignment score. If a gap is inside a fragment with no SSE type, it is moved next to the neighboring gap only if the movement produces a higher score than the current alignment. Figure 5-5 shows the movement of the first gap vertex in S3 (i.e., the gap vertex between residues I and C). This is a gap vertex inside a fragment of type α-helix. Thus this gap vertex is moved out and combined with the next gap vertex.

The final alignment is obtained by mapping each vertex in the final graph back to its original residue.

5 ](

PAGE 79

1, 77 ], ProbCons [88], MUSCLE [78] and T-Coffee [2] for comparison since they are the most commonly used and the most recent tools. We ran all experiments on a computer with a 3 GHz Intel Pentium 4 processor and 1 GB of main memory. The operating system is Windows XP.

Tables 5-1, 5-2 and 5-3 show the BAliBASE scores of HSA, ClustalW, ProbCons, MUSCLE and T-Coffee on benchmarks with low, medium, and high similarity respectively. From Table 5-1, we conclude that for low similarity benchmarks, our method outperforms all other tools. On average HSA achieves a score of 0.619, which is better than any other tool. HSA finds the best result for 14 out of 21 reference benchmarks. HSA is the second best in 5 of the remaining 7 benchmarks. Table 5-2 shows that for sequences with 20-40% identity, HSA is comparable to other tools on average. Its average score is not the best one. However, it is only slightly worse than the winner of this group (0.901 versus 0.909). HSA performs best for 2 cases out of 7, including a case for which HSA gets the full score. In Table 5-3, HSA is higher than other tools on average. HSA performs best on 2 cases out of 7, including a case for which HSA gets the full score. The high scores of existing methods for sequences with a high percentage of identity (Tables 5-2 and 5-3) show that there is little room for improvement for such sequences. Proteins in the twilight zone (Table 5-1) pose a greater challenge. These results show that our algorithm performs best for such sequences. For medium and high similarity benchmarks, our results are comparable to existing tools.

Table 5-4 shows the SP scores of HSA, ClustalW, ProbCons, MUSCLE, T-Coffee and the original BAliBASE alignment. On average, ClustalW, MUSCLE, and T-Coffee find the highest SP score for low, medium, and high similarity sequences respectively. However, according to Tables 5-1 to 5-3, those methods have relatively low BAliBASE

PAGE 80

Table 5-1. The BAliBASE score of HSA and other tools. Less than 25% identity.

            ClustalW   ProbCons   MUSCLE   T-Coffee   HSA
  Short
  1aboA      0.693      0.624     0.616     0.320    0.833
  1idy       0.546      0.679     0.354     0.183    0.700
  1r69       0.655      0.655     0.345     0.234    0.772
  1tvxA      0.223      0.439     0.239     0.235    0.462
  1ubi       0.607      0.464     0.478     0.445    0.648
  1wit       0.630      0.690     0.660     0.707    0.675
  2trx       0.660      0.705     0.712     0.667    0.756
  Avg        0.573      0.608     0.486     0.398    0.692
  Medium
  1bbt3      0.512      0.373     0.488     0.440    0.539
  1sbp       0.467      0.585     0.587     0.548    0.590
  1havA      0.222      0.397     0.293     0.256    0.352
  1uky       0.531      0.498     0.535     0.441    0.596
  2hsdA      0.482      0.606     0.748     0.573    0.614
  2pia       0.624      0.700     0.691     0.579    0.608
  3grs       0.377      0.355     0.309     0.383    0.487
  Avg        0.459      0.502     0.521     0.460    0.541
  Long
  1ajsA      0.388      0.411     0.370     0.379    0.472
  1cpt       0.697      0.719     0.765     0.726    0.810
  1lvl       0.368      0.590     0.451     0.528    0.532
  1pamA      0.405      0.534     0.439     0.461    0.524
  1ped       0.678      0.717     0.746     0.638    0.746
  2myr       0.394      0.568     0.386     0.454    0.630
  4enl       0.664      0.573     0.526     0.582    0.652
  Avg        0.513      0.587     0.526     0.538    0.624
  Avg all    0.515      0.565     0.511     0.465    0.619

scores. This means that the alignment with the highest SP score is not necessarily the most meaningful alignment. The SP score of HSA is comparable to other tools on the

Table 5-2. The BAliBASE score of HSA and other tools. 20%-40% identity.

            ClustalW   ProbCons   MUSCLE   T-Coffee   HSA
  1fjlA      0.994      0.989     0.971     0.991    1.000
  1csy       0.861      0.897     0.799     0.887    0.871
  1tgxA      0.833      0.760     0.679     0.817    0.782
  1ldg       0.920      0.939     0.954     0.956    0.941
  1mrj       0.853      0.925     0.894     0.894    0.925
  1pgtA      0.941      0.926     0.912     0.955    0.924
  1ton       0.718      0.898     0.865     0.867    0.867
  Avg        0.874      0.904     0.867     0.909    0.901

PAGE 81

Table 5-3. The BAliBASE score of HSA and other tools. More than 35% identity.

            ClustalW   ProbCons   MUSCLE   T-Coffee   HSA
  1amk       0.978      0.984     0.986     0.988    0.986
  1ar5A      0.953      0.956     0.969     0.947    1.000
  1led       0.900      0.931     0.950     0.956    0.929
  1ppn       0.987      0.983     0.983     0.984    0.981
  1thm       0.898      0.900     0.899     0.893    0.910
  1zin       0.955      0.975     0.985     0.958    0.978
  5ptp       0.948      0.963     0.950     0.961    0.957
  Avg        0.945      0.956     0.960     0.955    0.963

5-5 coincides with the above conclusion. In general,

Table 5-4. The SP score of HSA and other tools.

                      REF    ClustalW   ProbCons   MUSCLE   T-Coffee    HSA
  Short, <25%         -602     -453       -594      -496      -912      -599
  Medium, <25%       -2036    -1466      -2516     -1543     -2461     -1617
  Long, <25%         -2989    -1964      -3266     -2291     -2991     -2436
  Short, 20%-40%       456      499        508       480       491       493
  Medium, 20%-40%     1238     1119       1138      1231      1191      1138
  Medium, >35%        3474     3477       3479      3526      3528      3468
  Avg overall          -76      202       -208       151      -192        74

PAGE 82

Table 5-5. The running time of HSA and other tools (measured in milliseconds).

                      ClustalW   ProbCons   MUSCLE   T-Coffee   HSA
  Short, <25%             69        238        98       915     194
  Medium, <25%           133        638       297      1890     535
  Long, <25%             308       1564       584      3240    1191
  Short, 20%-40%          62        265        83      1187     421
  Medium, 20%-40%        171        695       175      2316     613
  Medium, >35%           154        629       136      2502     660
  Avg overall            149        672       229      2008     602

ClustalW performs best. However, ClustalW achieves this at the expense of low accuracy (see Tables 5-1 to 5-3). HSA is slower than ClustalW and MUSCLE. It is, however, faster than ProbCons and T-Coffee.

PAGE 83

The chloroplast is the site of photosynthesis, and is therefore critical to plant growth, development and agricultural output. The chloroplast genome is also relatively small, yet despite its approachable size and importance, only a small number of chloroplast genomes have been sequenced. The dearth of information is due to the requisite preparation, frequently requiring isolation of plastids and generation of plasmid-based chloroplast DNA libraries. The method shown in this chapter tests the hypothesis that rapid, inexpensive, yet substantial sequence coverage of an unknown target chloroplast genome may be obtained through PCR-based means. A computational approach predicts a large number of overlapping primer pairs corresponding to conserved coding regions of known chloroplast genomes. These computer-selected primers are used to generate PCR-derived amplicons that may then be sequenced by conventional methods. This chapter considers the problem of finding a saturating number of overlapping primer pairs to bracket the maximum possible coverage of the unknown target DNA sequence. None of the currently available primer prediction tools considers gene and inter-gene information, and most use only one reference sequence, which limits their power and accuracy.

This chapter provides a heuristic solution, named MAPPIT, to the above-mentioned problem. The problem is divided into the tasks of first identifying universal primers and then assessing spatial relationships between the primer pair candidates. Two strategies have been developed to solve the first problem. The first employs multiple alignment, and the second identifies motifs. The distance between primers, their alignment within gene coding regions, and most of all their presence in multiple reference genomes narrow the primer set. Primers generated by the MAPPIT module provide substantially more coverage than those generated via Primer3. Motif-based strategies provide more coverage than multiple-alignment based approaches. As predicted, primer selection improves when based on a larger reference set. The computational predictions were tested in the laboratory and

PAGE 84

The chloroplast genome maintains a great degree of conservation in gene content and organization. Thus a relatively high level of synteny exists between plastid genomes derived from distantly-related taxa [10]. The chloroplast genome is much smaller than the nuclear genome, yet only a small number of these extra-nuclear genomes have been sequenced. Traditionally, plastid genomes have been sequenced only after generating extensive plasmid-based libraries of the plastid DNA. Plastid DNA extraction relies on difficult, sometimes problematic and typically time-consuming preparative procedures. Recently, several reports have increased plastid sequencing throughput by amplifying the isolated plastid DNA using rolling circle amplification (RCA) [33]. However, obtaining sequence through RCA requires this intermediate step. Recently, the ASAP method showed that sequence information could be gathered by creating templates from plastid

PAGE 85

32 ]. ASAP uses conserved primers (short, single-stranded DNA fragments that initiate enzyme-based DNA strand elongation) to flank unknown regions, and the regions are amplified using the polymerase chain reaction (PCR). PCR involves the exponential amplification of a finite length of DNA in a cell-free environment [116], and it is frequently used to generate a large quantity of specific DNA sequences for forensic applications. The procedure relies on a thermostable enzyme known as Taq DNA polymerase, which elongates specific DNA sequences bracketed by primer homology. A primer is classified as a forward or reverse primer depending on its orientation relative to the target sequence. For instance, a forward and reverse primer that flank a given gene allow amplification of the bracketed sequence in the presence of DNA polymerase, nucleotides and appropriate cofactors. Use of PCR depends on many successive rounds of primer annealing and subsequent template elongation to amplify a sequence of interest. The ASAP method is fast and cost effective. However, in the initial report, the required primers were selected by visual inspection of target sequences. This restricted the ASAP study to a small region of the chloroplast genome. To expand this technique to an entire chloroplast genome, an efficient method is required to facilitate primer selection. More importantly, such a method will allow the selected primer set to be updated based upon the availability of new plastid sequences.

This chapter presents the Module for Amplification of Plastomes by Primer Identification, or MAPPIT. The MAPPIT tool uses the information of database-resident reference plastid genomes to predict a set of conserved primers that will generate overlapping amplicons for sequencing. The power of MAPPIT is that it would theoretically gain accuracy and precision as the reference sequence set grows. MAPPIT uses two approaches to identify the primers, namely multiple alignment and motif-based.

The first approach develops a multiple alignment strategy. The proposed multiple alignment method is a variation of the traditional progressive multiple alignment strategy that weights the coding regions of the genomes, increasing the probability that the primers

PAGE 86

The second approach is based on motif identification. This method recognizes potential primers from each reference genome separately. It then identifies a subset of these primers that occur frequently in a subset of reference genomes. The presence in multiple genomes adds support to any primer being assigned to the final primer set. Two solutions have been developed to identify the final set of primer pairs from the candidates, namely order dependent and order independent, depending on whether or not they consider primer order when computing the support values.

Finally, a computational method has been developed to measure the quality of the identified primer pairs. Experimental results show that the primer pairs designed cover up to 81% of an unknown target sequence. Randomly selected primer pairs devised by MAPPIT were used in laboratory experiments to validate computational predictions.

We first define several terms. A DNA sequence is represented by a string over four letters, A, C, G and T, for the bases and two extra letters: N for unknown bases and - for gaps. A primer is defined as a sequence which satisfies certain constraints. The length of a primer p, indicated by length(p), is the number of characters it contains. Let s[i : j] denote the subsequence of s from position i to position j. A primer p binds to DNA sequence s at position i if p and s[i : i+length(p)-1] are similar. Two sequences are considered similar if they have sufficient percentage identity. In practice, 93% identity is required for primer similarity. A partial order of primers p and q with respect to sequence s, p <_s q, is defined if the position of p is before the position of q in s. Let f and r denote a forward and a reverse primer respectively. Assume that f and r bind to s[i : i+length(f)-1]

PAGE 87

Figure 6-1. Example of primer pairs on a target sequence: f and r stand for forward and reverse primers respectively. The directions of the primers are shown. Pair <f1, r1> covers a region a1 and constructs a contig Contig 1; pairs <f2, r2> and <f3, r3> cover regions a2 and a3, which construct a contig Contig 2 since a2 and a3 overlap.

and s[j : j+length(r)-1]. The distance between f and r with respect to s, ds(f, r), is defined as

A primer pair <f, r> identifies the fragment s[i : i+ds(f, r)-1] from s if ds(f, r) is less than a given cutoff. This cutoff is usually 1000 and is determined by the limitations of the automated sequencing methods currently available. Two fragments of s, say s1 and s2, identified by two primer pairs can be combined to form a contig if s1 and s2 have sufficient overlap. In practice, an overlap of at least 100 letters denotes a contig with high confidence. Short overlaps are not used, as they may indicate random overlaps. Given a set of primer pairs P = {<f1, r1>, <f2, r2>, ..., <fk, rk>}, we define the coverage of P on a sequence s as the total number of letters of s that can be identified using P.

We define the primer pair finding problem as follows: Given a target sequence T and a set of reference sequences S = {S1, S2, ..., SK}, where the Si are homologous to T, the goal is to find a set of primer pairs, i ∈

PAGE 88

An example is shown in Figure 6-1. In this example, a target DNA sequence and six primers are shown. Primers f1 and r1 construct a primer pair since ds(f1, r1) is within the distance limitation L. This pair constructs a contig (Contig 1) on the target. Primer pairs <f2, r2> and <f3, r3> have an overlap greater than the overlap threshold V; therefore these two primer pairs produce another contig (Contig 2).

23 ]. CAP3 belongs to this category [117]. The accuracy of the sequences assembled using WGS methods suffers because of read errors and repeats [118]. These methods also incur a very high computation cost due to the large number of pairwise sequence comparisons, and they need an additional finishing phase. On the other hand, PCR-based sequencing methods are more accurate. However, their processing time is usually much longer and their processing cost is higher.

Recently, Dhingra and Folta proposed a new sequencing method, called ASAP [32], to overcome the shortcomings of PCR-based methods. ASAP exploits the fact that chloroplast genomes are extremely well conserved in gene organization, at least within major taxonomic subgroups of the plant kingdom. It is a universal, high-throughput, rapid PCR-based technique to amplify, sequence and assemble plastid genome sequence from diverse species in a short time and at reasonable cost. The ASAP method finds the multiple alignment of a set of reference genomes that are homologous to the target genome using ClustalW [1]. Domain experts then identify conserved primer pairs from the multiple alignment through visual inspection. ASAP uses these primer pairs to generate

PAGE 89

32 ]. The manual primer identification step is the bottleneck of ASAP. Efficient computational methods are needed to automate this process. Also, as we discuss later, ASAP can miss potential primers since it uses ClustalW for multiple alignment. This is because ClustalW maximizes the overall alignment score for the entire sequences. Primers, however, are short sequences scattered over the entire sequence. Thus, short conserved regions can be missed using ClustalW when the sequences have many indels.

Similar to ASAP, PriFi [119] uses multiple sequence alignment to identify primers. It also uses ClustalW to obtain the multiple alignment. PriFi has the same shortcomings as ASAP. PriFi also has the shortcoming that it cannot automatically identify introns.

Multiple sequence alignment has many applications in biological science, such as gene prediction [7] and improving local alignment quality [20]. Multiple sequence alignment methods can be classified into two groups: optimal and heuristic methods. MSA [61] is the representative of optimal solutions. Heuristic methods are much more popular because of their low time complexity. ClustalW [1, 77], ProbCons [88], T-Coffee [2] and MUSCLE [78] are some examples of heuristic strategies.

6.3.1 Finding Primer Candidates

PAGE 90

6.1 ). We define the support of a primer p on a sequence Si as:

$$\mathrm{support}(p, S_i) = \begin{cases} 1 & \text{if } p \text{ appears in } S_i \\ 0 & \text{otherwise} \end{cases}$$

We define the support of a primer p on sequence set S as: support(p, S) = 1

A primer is considered as a candidate primer only if it satisfies the following two criteria:

We develop two strategies to obtain a set of candidate primers. The first one is an extension of the ASAP method and uses multiple alignment. The second one finds primer candidates for each reference genome separately. It then merges the candidates progressively. We describe them in the following sections.
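The support definitions can be illustrated with a small sketch. It makes two assumptions that should be read as such: a primer "appears in" a sequence when some ungapped window of the same length has at least 93% identity with it (the chapter uses alignment-based matching), and the support on a set is taken to be the fraction of reference sequences in which the primer appears. All function names are hypothetical.

```python
def percent_identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def appears_in(primer, sequence, min_identity=0.93):
    # Assumption: ungapped comparison against every window of the sequence.
    n = len(primer)
    return any(
        percent_identity(primer, sequence[i:i + n]) >= min_identity
        for i in range(len(sequence) - n + 1)
    )


def support_on_sequence(primer, sequence):
    # support(p, Si): 1 if p appears in Si, 0 otherwise.
    return 1 if appears_in(primer, sequence) else 0


def support_on_set(primer, references):
    # Assumption: support(p, S) is the fraction of references containing p.
    return sum(support_on_sequence(primer, s) for s in references) / len(references)
```

For a 16-letter primer, the 93% threshold tolerates a single mismatch (15/16 = 0.9375) but not two.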

PAGE 91

Figure 6-2. An example of computing the SP score of a multiple sequence alignment. Regions A and C contain primers, so we include their SP scores when we compute the SP score of the alignment. Region B contains no primer, so we treat its SP score as zero.

1, 77 ]. The underlying problem, however, differs from traditional multiple alignment. This is because traditional multiple alignment methods aim to maximize the overall alignment score. However, in order to find primers we only need to identify short, highly conserved regions in the reference sequences. The non-conserved regions of less than 1000 bases between two primer candidates should be disregarded, as these regions will be identified during the PCR amplification process. Figure 6-2 illustrates this. In the figure, a forward primer region A and a reverse primer region C are shown, and we only maximize the SP scores of A and C. Region B, which contains no primer, is not considered when computing the SP score of the whole alignment.

We propose a variation of the hierarchical clustering algorithm [71]. It follows from two observations: (1) The gene regions of a set of homologous sequences are usually highly conserved, while their intergenic regions can show high variation in length and letter content. (2) Primers need to have a sufficient CG rate.

For each reference sequence, we read the locations and lengths of genes from data source files, which were previously downloaded from GenBank. We also scan the sequence and find regions which have a lower CG rate than the required cutoff for a primer. We tag these

PAGE 92

During the alignment of the sequences we compute a weighted score of the alignment: the scores for letters which are tagged as genes are scaled up using a predefined weight constant. The score for letters which are tagged as "N" is 0. We applied an affine gap penalty strategy to reduce the number of gaps. We used an algorithm extended from the alignment method of Myers and Miller [65] to reduce the memory requirement, since the reference genomes are usually too long. We use the Sum-of-Pairs score to evaluate the score of the alignment.

The alignment algorithm is described as follows. First, we compute the alignment score between each pair of sequences and construct an initial score table. The initial profiles to be aligned are the original sequences. Second, we select the pair of profiles which has the highest score in the score table and obtain a new profile from the alignment of these two profiles. Third, we remove the two profiles and add the new profile to the profile set. We calculate the SP score when we score two elements from two profiles. Fourth, we construct a new pairwise alignment score table. Fifth, we repeat the second through fourth steps until only one profile is left. The final profile left is the resulting alignment.

We then slide a window from the beginning to the end of the consensus string. The window has the same size as the primer. For each window, we check whether the fragment in the window satisfies the CG rate and conservation rate criteria. The fragments which pass the test become primers. Depending on the CG positions, a fragment is inserted in either

PAGE 93

Our solution first finds possible primers from each sequence separately without considering any conservation constraints. It then finds common primers with sufficient support by iteratively merging the primer sets. We discuss these steps in more detail next.

We start by constructing a set of possible forward primers Fi and a set of reverse primers Ri for each reference sequence Si. To do this, we slide a window of primer length on each reference sequence. Each position of the window produces a fragment. The fragments that satisfy the CG criteria for primers are inserted into the corresponding primer set. Let Fi = {fi,1, fi,2, ..., fi,mi} and Ri = {ri,1, ri,2, ..., ri,ni} denote the primers found for Si. For each primer fi,j, two values are stored: support and location, denoted by support(fi,j) and location(fi,j). The support and location of fi,j are initialized to one and to the position of fi,j in Si respectively. The support and location of all reverse primers are computed in the same way. We propose two strategies to find candidate primers from these primers. We explain our strategies for candidate forward primers. Candidate reverse primers are found in exactly the same way. The only difference is that we use Ri instead of Fi.
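The window scan that builds a per-sequence primer set can be sketched as follows. This is an illustration only: the primer length of 20 and the minimum C+G fraction of 0.5 are hypothetical defaults, skipping windows that contain N or gap characters is an assumption, and the sketch does not separate forward from reverse primers.

```python
def cg_rate(fragment):
    """Fraction of C and G letters in a fragment."""
    return sum(base in "CG" for base in fragment) / len(fragment)


def candidate_primers(sequence, primer_length=20, min_cg=0.5):
    """Slide a window over one reference sequence and collect candidates.

    Each candidate stores its letters, an initial support of one and its
    location in the sequence, as described in the text.
    """
    primers = []
    for start in range(len(sequence) - primer_length + 1):
        fragment = sequence[start:start + primer_length]
        # Assumption: windows with unknown bases or gaps are skipped.
        if "N" in fragment or "-" in fragment:
            continue
        if cg_rate(fragment) >= min_cg:
            primers.append({"seq": fragment, "support": 1, "location": start})
    return primers
```

Each reference sequence would be scanned once, so the cost of this step grows linearly with sequence length for a fixed primer length.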

PAGE 94

We pick a random Si from the reference sequence set that has not been considered so far. For every primer fi,j ∈ Fi we check whether there exists a primer g ∈ G that is similar to fi,j (i.e., g and fi,j have at least 93% identity; see Section 6.1). If there is no such g ∈ G, then we insert fi,j into G. If such a g exists, then we update the support and location of g. The location is updated as

$$\mathrm{location}(g) = \frac{\mathrm{location}(g)\cdot\mathrm{support}(g) + \mathrm{location}(f_{i,j})}{\mathrm{support}(g) + 1} \qquad (1)$$

The support of g is then incremented by one. We repeat the same process for each of the remaining reference sequences in random order. Once all the references are processed, we remove the primers in G that do not satisfy the support criteria. Note that further optimizations can be made in the implementation by removing primers from G as soon as they are guaranteed to have insufficient support. We do not discuss them as they only affect the performance.

Figure 6-3 illustrates this. In the figure, we only show forward primers and their locations; the matched primers are connected by arrows. Primers f1 and f2 are crossed and cannot both be matched when using multiple sequence alignment. The order independent strategy allows this type of match.

In the order dependent strategy, we consider the problem as finding the Longest Common Subsequence of a set of sequences, known as k-LCS. Here, each primer set Fi denotes a sequence of primers, since the primers in Fi are ordered by their locations. The goal is to find a

PAGE 95

Figure 6-3. An example of matching primers with translocations. Only forward primers are shown in the figure. Primers f1 and f2 have crossed positions due to translocation. In step 1, matching the f1s and the f2s at the same time is allowed when using the motif-based strategy, but not when using the multiple sequence alignment-based strategy.

subsequence of primers that is common to most of the reference sequences (i.e., 70-90% of the reference sequences contain it). k-LCS is an NP-complete problem [65] and has many heuristic solutions. We use a progressive solution which is similar to our first strategy in spirit.

We pick a random Si from the reference sequence set and initialize G to Fi. We then repeatedly pick a reference sequence from the remaining references and process it as follows: We find the LCS of Fi and G. Here, two primers are considered common if they are similar to each other (i.e., they have at least 93% identity). We update the support and location of all g ∈ G which are in the LCS. The location is updated as given in equation (1). The support of g is then incremented by one. We then insert all the fi,j ∈ Fi that are not in the LCS into G. Once all the references are processed, we remove the primers in G that do not satisfy the support criteria. The time complexity of this motif-based method is O(M^2), where M is the number of primers in a sequence. Usually M is much less than the length of the sequence.

Let F = {f1, f2, ..., fm} and R = {r1, r2, ..., rn} denote the sets of forward and reverse primers with sufficient support identified using any of the strategies discussed in

PAGE 96

6.3.1 . Assume that location(fi) ;;;g, where ∀i, fi ∈ F, ri ∈ R and ∀i pair into P. If there is no r ∈ R which satisfies the distance criteria with f, then update f as the next forward primer, remove f from F, and repeat Step 2. If there is no more forward primer left in F, the algorithm stops.

6.1 that the overlap criteria is 0 pair is inserted into solution

PAGE 97

Figure 6-4. Selection of the next forward primer from the current reverse primer. The positions of the primers are shown in the figure. We select f2 if both f1 and f2 are in Region A, and select f3 if f3, f4, f5 and f6 are in Region B and no primer is in Region A.

Note that one can prove that our greedy primer selection strategy is the optimal solution among all possible solutions that can be found from the candidate primers. We define the optimality according to two criteria: 1) The optimal set of primer pairs covers the largest number of letters of the consensus of the reference sequences. 2) Among all the solutions with the same coverage, the optimal solution contains the minimum number of primers and produces the minimum number of contigs. We, however, do not include the proof due to space limitations.

Next, we prove that our primer selection strategy is the optimal solution among all possible solutions that can be found from the candidate primers. We define the optimality according to two criteria: 1) The optimal set of primer pairs covers the largest number of

PAGE 98

(A) We first show that location(f1) = left(c1). Let fi be the leftmost primer (i.e., smallest location) in F which has at least one matching reverse primer satisfying the distance criteria. fi is selected by our algorithm (Steps 1 & 2) (i.e., 1 = i).

(A.1) Assume that location(fi) left(c1). This contradicts the assumption that fi is the leftmost primer with a matching reverse primer.

From (A.1) and (A.2), we conclude that location(f1) = left(c1).

(B) Second, we prove that location(r1) ≤ right(c1). We prove this by contradiction. location(r1) > right(c1) contradicts the assumption that c1 is an optimal contig, as <f1, r1> can be included to extend c1.

(C) Third, we show that <f1, r1> is a part of the optimal solution (Steps 1 & 2 of the algorithm).

(A) and (B) prove that f1 and r1 are contained in c1. Thus, they identify a prefix of c1. Selection of <f1, r1> minimizes the number of primer pairs needed to cover c1. This is because <f1, r1> defines the longest prefix of c1 that can be identified using F and R.

PAGE 99

(D) Finally, we prove that the selection strategy for the next forward primer minimizes the number of primer pairs (Step 3 of the algorithm). (B) implies that there are two possibilities for r1.

(D.1) Assume that location(r1) = right(c1). This implies that <f1, r1> is the optimal primer pair to identify c1. Since c1 is a part of the optimal solution, there is no primer pair which satisfies the overlap criteria with <f1, r1> and location(r1) > right(c1). Thus, the next forward primer should be selected as the first forward primer in F in region B (see Figure 6-4) in order to detect the next contig in C (Step 3). The justification follows from (A).

(D.2) Assume that location(r1) and covers a subsequence of c1. Otherwise, c1 would not be identified as a part of the optimal solution. Step 3 chooses the rightmost forward primer in region A (see Figure 6-4) to maximize the coverage of this primer pair, and thus minimize the number of primer pairs.

We evaluate the primer pairs using two key parameters: (1) average coverage, and (2) average number of contigs produced for all the reference sequences. Here the coverage is the total number of characters covered by the primer pairs. The total number of contigs is the number of fragments identified such that no two fragments have sufficient overlap.
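The two evaluation parameters can be illustrated with a short interval-based sketch. It assumes that each primer pair has already been mapped to a half-open interval (start, end) on one sequence, and that two fragments belong to the same contig when they overlap by at least 100 letters, the threshold quoted earlier in this chapter. The function name and the interval representation are choices made for this example only.

```python
def coverage_and_contigs(fragments, min_overlap=100):
    """Return (coverage, number of contigs) for fragments on one sequence.

    fragments   : list of (start, end) half-open intervals
    min_overlap : minimum overlap, in letters, needed to join two fragments
    """
    covered = set()
    contigs = 0
    contig_end = None
    for start, end in sorted(fragments):
        covered.update(range(start, end))
        if contig_end is not None and contig_end - start >= min_overlap:
            # Sufficient overlap: extend the current contig.
            contig_end = max(contig_end, end)
        else:
            # Too little (or no) overlap: start a new contig.
            contigs += 1
            contig_end = end
    return len(covered), contigs
```

Averaging these two numbers over all reference sequences gives the quantities used for evaluation. Note that in this sketch, letters shared by two fragments with insufficient overlap are counted once in the coverage while the fragments remain separate contigs.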

1. Initialize contigid = 0.
2. For j = 1 to k
   (a) Find the locations of fj and rj in Si using dynamic programming [28-30]. A primer is found in Si if Si contains a subsequence whose alignment with that primer has at least 93% identity (see Section 6.1).
   (b) If both fj and rj can be found and their locations satisfy the distance criteria (i.e., locations differ by at most 1,000), then check the values in Vi from the starting location of fj to the ending location of rj [...] 6.1). Set all the values of Vi corresponding to the new fragment to this value.
3. Return the number of non-zero values in Vi as the coverage and the number of distinct non-zero values in Vi as the number of contigs.

Experimental setup: We evaluate our proposed methods through both computational and wet-lab experimentation. We evaluate the primer pairs based on several criteria, namely the coverage, the number of contigs, and the hit ratio on the target sequence, as well as the time it takes to find the primers. The former two are described in Section 6.1. Hit ratio denotes the ratio of primers that have a matching subsequence in the target genome.

For comparison, we downloaded Primer3 [120] as a representative of single-sequence-input primer design tools, as it is one of the well-known tools. For our multiple alignment

[...] [1, 77]. We also implemented the proposed weighted multiple alignment method in Section 6.3.1. We also implemented our motif-based primer method as described in Section 6.3.1. As a part of this method, we implemented both order-independent and order-dependent strategies. We used the C language in all our implementations.

We used five plastid genomes used in ASAP [32] and added two more from Cucumis and Lactuca to our dataset. We obtained the DNA sequences of these genomes from GenBank ([...]).

We ran all computational experiments on an Intel Pentium 4 with 3.2 GHz speed and 2 GB of memory; the operating system is Windows XP.

In the following tables, CovT represents the coverage on the target sequence, ConT represents the number of contigs on the target sequence, CovR represents the average coverage on the reference sequences, and ConR represents the average number of contigs on the reference sequences.

Comparison to Primer3: Our first experiment set compares the quality of the primer pairs of MAPPIT to that of Primer3 [120]. We use Primer3 with its default parameters on a single reference sequence to identify the top 50 primers. We then evaluate these primers on the target genome. We limit the number of primers of Primer3 to 50 for MAPPIT to make

Table 6-1 shows the results. The results show that the coverage of Primer3 is significantly lower than that of our method in all cases. The results illustrate that existing tools which consider only one sequence for primer design are not suitable for sequencing plastid genomes. The coverage of MAPPIT is greater than 62% on average. Furthermore, both alignment strategies achieve similar coverage, number of contigs, and primer pairs.

Table 6-2 presents the results for the 16% divergent dataset. Due to space limitations, results for other divergent datasets are not shown. The experiments show that the coverage and the number of primers decrease, whereas the number of contigs increases. The coverage is slightly more than 57%. However, the quality drop is very small given that the sequences are altered by 16%. We observe that the quality gradually drops as the divergence increases (results not shown). Another important observation is that MAPPIT achieves higher quality using our weighted multiple sequence alignment method compared to ClustalW. This shows that ClustalW is more suitable for highly similar sequences, whereas our weighted multiple alignment is more suitable for genomes with variations in non-coding regions.

[...] 6-3 for multiple sequence alignment- and motif-based primer identification strategies. For motif-based strategy,

Table 6-1. Comparison of Primer3 and using multiple sequence alignment in step 1. The table shows the results of using the alignment from ClustalW and from our own multiple sequence alignment algorithm, which uses a hierarchical clustering algorithm and a gap open extension score strategy.

                         Primer3        ClustalW-MAPPIT          weighted-MAPPIT
DataSet  TargetLength    CovT   ConT    Pairs#  CovT    ConT     Pairs#  CovT    ConT
0932     39663           594    1       33      25247   6        33      25722   7
1879     38934           557    1       34      25665   2        34      24696   1
2202     38921           557    1       35      21931   6        35      21018   7
4561     39394           557    1       33      25427   2        33      24774   3
6290     39986           550    2       35      25394   3        35      25169   4
7144     38858           0      0       34      24361   6        35      24506   5
7578     38900           1651   2       33      22762   3        34      23417   2
Avg      39236           638    1       33      24398   4        34      24186   4

Table 6-2. Comparison of using different sources of alignment: ClustalW and our weighted multiple sequence alignment algorithm. The datasets are 16% divergent. The weighted multiple sequence alignment method uses a hierarchical clustering algorithm and a gap open extension score scheme.

                         ClustalW-MAPPIT                         weighted-MAPPIT
DataSet  TargetLength    Pairs#  CovT   ConT  CovR   ConR       Pairs#  CovT   ConT  CovR   ConR
0932     39663           30      23069  6     25247  5          32      24587  6     25100  7
1879     38934           29      23044  5     25502  2          29      22047  4     24541  1
2202     38921           31      19145  7     25068  5          31      19412  7     24284  7
4561     39394           30      23220  5     25223  2          31      23200  5     24414  3
6290     39986           30      22459  7     24640  3          31      23173  6     24673  4
7144     38858           32      22270  7     24679  6          33      22260  7     25213  5
7578     38900           31      21982  4     24175  2          32      21982  4     24175  2
Avg      39236           30      22169  5     24933  3          31      22380  5     24628  4

Table 6-3 also shows the coverage and the number of contigs computed on the reference sequences as discussed in Section 6.3.3. The results show that the estimated quality values from the reference sequences are similar to the actual values computed from the target sequence. Thus, we conclude that the evaluation strategy proposed in Section 6.3.3 is accurate.

Table 6-4 shows the results. The hit ratio usually increases as k increases. This agrees with our assumption that more reference sequences yield higher-quality primers. The

Table 6-3. Comparison of multiple sequence alignment-based methods and motif-based methods in step 1. The non-order-MAPPIT and order-MAPPIT columns stand for the motif-based methods with the order-independent and order-dependent strategies, respectively. The multiple sequence alignment-based methods use a hierarchical clustering algorithm and a gap open extension score scheme.

                         weighted-MAPPIT          non-order-MAPPIT         order-MAPPIT
DataSet  TargetLength    Pairs#  CovT    ConT     Pairs#  CovT    ConT     Pairs#  CovT    ConT
0932     39663           33      25722   7        41      35523   9        34      30119   7
1879     38934           34      24696   1        39      31878   11       33      26420   8
2202     38921           35      21018   7        40      29398   12       33      24046   7
4561     39394           33      24774   3        37      31312   13       31      26071   7
6290     39986           35      25169   4        40      31500   11       31      25090   7
7144     38858           35      24506   5        42      32854   10       34      28382   14
7578     38900           34      23417   2        40      30868   14       32      24681   6
Avg      39236           34      24186   4        39      31904   11       32      26401   8

Table 6-4. Effects of the number of reference sequences. The multiple sequence alignment-based method uses a hierarchical clustering algorithm and a gap open extension score scheme. Non-order-MAPPIT and order-MAPPIT stand for the order-independent and order-dependent strategies, respectively, when applying the motif-based method.

             weighted-MAPPIT       non-order-MAPPIT      order-MAPPIT
Reference#   Coverage  HitRatio    Coverage  HitRatio    Coverage  HitRatio
2            32010     0.749       30282     0.290       32680     0.770
3            26476     0.820       35055     0.668       27128     0.835
4            25528     0.844       34492     0.587       32406     0.771
5            25490     0.852       35245     0.715       28697     0.817
6            24629     0.862       31904     0.910       26401     0.952

coverage of the multiple alignment-based strategy increases as k decreases. This is because this strategy produces more primers for small k. The coverage of the motif-based strategy shows variations. However, it usually increases as k decreases.

[...] 6-5). Of these, 9 plants are somewhat related and 3 represent ancient or highly-diverged species. Pea lacks the

Table 6-5. Eight randomly selected primer pairs, their locations on sequence 1879, the length of the segment identified by the primers, and the genes that they land on. A negative value indicates that the primers landed in incorrect order.

Primer pairs    Location in 1879    Size (base pairs)   Forward         Reverse
1    5          5279-6223           944                 rps16           Intergenic
2    17         16637-17945         1308                rps2            rpoC2
3    36         37730-39512         1782                ycf9            psaA
4    99         99061-100222        1161                ndhB            rps12 Intron
5    100        100379-100451       -97                 rps12 Intron    rps12 Intron
6    101        100690-101964       1274                rps12           orf131
7    102        101927-102811       884                 orf131          16S
8    150        151524-151976       452                 ycf2            ycf2

inverted repeat region and thus is very different from the other plastid genomes sampled here. Ginkgo, an ancient Gymnosperm, and Equisetum, a Pteridophyte, are ancestors of modern-day flowering plants and exhibit a high degree of sequence dissimilarity. The primers devised by the computational method were mapped on the tobacco chloroplast genome (1879), and Table 6-5 summarizes the sequence location, expected sizes and annealing sites of the forward and reverse primers.

From Table 6-5 the following features are evident:

1. Computationally identified primer pairs anneal mainly to the coding regions or to conserved introns between the genes. This parameter was one of the prerequisites for efficient primer identification and demonstrates that the new method of multiple sequence alignment is promising for this specific purpose.
2. The size of the amplified regions ranges from 452 base pairs to 1782 base pairs. The optimal primer set will amplify regions ranging from 800 base pairs to 1200 base pairs, which makes the amplified products more amenable to sequencing.
3. Primer pair 5 represents divergent primers in 1879; thus no product is visible here or in the other species, but in maize there is an annealing site that produces an amplicon of the expected size. This illustrates the potential of the method as applicable to divergent plant species.
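The size observations in item 2 can be checked directly from the locations listed in Table 6-5. The short sketch below recomputes each amplicon size as end minus start and flags the pairs that fall in the 800 to 1200 base pair range named above. The dictionary layout and function names are assumptions made for this illustration, and pair 100 is left out because the table marks it as landing in incorrect order.

```python
# Locations on sequence 1879, keyed by primer pair label (from Table 6-5).
# Pair 100 is omitted: the table marks it as landing in incorrect order.
LOCATIONS = {
    5: (5279, 6223),
    17: (16637, 17945),
    36: (37730, 39512),
    99: (99061, 100222),
    101: (100690, 101964),
    102: (101927, 102811),
    150: (151524, 151976),
}


def amplicon_size(start: int, end: int) -> int:
    """Length of the amplified segment, as used in the Size column."""
    return end - start


def in_optimal_range(size: int, low: int = 800, high: int = 1200) -> bool:
    """True if the size lies in the range the text calls optimal."""
    return low <= size <= high


sizes = {pair: amplicon_size(*loc) for pair, loc in LOCATIONS.items()}
optimal_pairs = sorted(p for p, s in sizes.items() if in_optimal_range(s))

print(min(sizes.values()), max(sizes.values()))   # smallest and largest amplicon
print(optimal_pairs)                              # pairs inside 800-1200 bp
```

The recomputed sizes match the Size column for these seven pairs, and three of them (pairs 5, 99 and 102) fall inside the optimal range.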

Polymerase chain reaction samples were analyzed on an agarose gel by electrophoresis. Column M represents a standard DNA size ladder. Columns labeled as 5, 17, 36, 99, 100, 101, 102 and 150 represent the primer pairs chosen at random from the computational dataset. White bands in each column represent amplified DNA from each primer pair in a given plant sample. Note that primer pair 100 does not produce an amplified product in most plants except for maize (see Table 6-5). Ginkgo and Equisetum represent ancestral samples used to test the limits of this approach. Although they are highly divergent in sequence content and position, some coverage was obtained, indicating that the method will be highly useful on contemporary crop species. (This figure was created by Amit Dhingra.)

We considered problems in multiple sequence alignment and developed window-based solutions; we also addressed the problem of using multiple sequences in DNA sequencing. The hypothesis of our algorithms is that we can divide the large sequence alignment problem into smaller ones, and then reach a semi-optimal alignment of the original large sequences by combining the solutions of the smaller problems.

First, we considered the problem of optimizing the SP (Sum-of-Pairs) score for multiple protein sequence alignment. We developed a graph-based algorithm called QOMA (Quasi-Optimal Multiple Alignment). QOMA first constructs an initial alignment of multiple sequences. In order to create this initial alignment, we developed a method based on the optimal alignment between all pairs of sequences. QOMA represents this alignment using a K-partite graph. It then improves the SP score of the initial alignment by iteratively placing a window on it and optimizing the alignment within this window. QOMA uses two strategies to permit flexibility in the time/accuracy tradeoff: (1) adjust the sliding window size; (2) tune from a complete K-partite graph to a sparse K-partite graph for local optimization of the window. Unlike traditional tools, QOMA can be independent of the order of sequences. The experimental results on BAliBASE benchmarks show that QOMA produces a higher SP score than existing tools including ClustalW, ProbCons, MUSCLE, T-Coffee and DCA. QOMA has a slightly better SP score using the complete K-partite graph strategy compared to the sparse K-partite graph strategy. This QOMA work is accepted by the Bioinformatics journal.

Second, we further considered the problem of multiple alignment for a large number of protein sequences, with the goal of achieving a large SP (Sum-of-Pairs) score. We introduced the QOMA2 algorithm, which is practical for aligning a large number of protein sequences. QOMA2 selects short subsequences from the sequences to be aligned by placing a window on their (potentially sub-optimal) alignment. The window position

Third, we considered the problem of constructing a biologically meaningful multiple sequence alignment. We developed a new algorithm called HSA. HSA applies SSE types in addition to amino acid information to group the input protein residues. It then adjusts the residue positions according to the groups and constructs a graph. HSA slides a window from the beginning to the end of the graph and finds cliques in the window. HSA concatenates these cliques and forms the final alignment. Unlike existing progressive multiple sequence alignment methods, HSA builds up the final alignment by considering all sequences at once. Experimental results show that HSA achieves high accuracy and still maintains competitive running time. The quality improvement over existing tools is more significant for low-similarity sequences. Our HSA work is published in PSB 2006.

The last problem is to assist primer prediction in DNA sequencing by using multiple sequences. We developed a method called MAPPIT. MAPPIT has successfully used two novel computational approaches for identification of consensus primer pairs from a set of reference sequences that will enable cost-effective and rapid acquisition of DNA sequence from plastid genomes. The first one uses multiple alignment of the references. The second one finds motifs from the reference sequences that have sufficient support. We developed two solutions for the second approach: order independent and order dependent. In our experiments, the coverage of primer pairs found by our methods was significantly higher than that of Primer3, an existing primer identification tool. Our wet-lab experiments verified that the primers found by our methods can actually amplify homologous target genomes. We believe rapid sequence information acquisition

We addressed four problems of multiple sequence alignment. We provided solutions based on a divide-and-conquer strategy. We first developed a novel algorithm to optimize an existing alignment and applied the algorithm in the tool QOMA. Based on the QOMA algorithm, we then further developed an algorithm to process a large number of sequences. The application was called QOMA2. We also developed an algorithm to create a biologically meaningful alignment by applying secondary structure information during aligning. Last, we applied multiple sequence alignment to primer identification for DNA sequencing. The hypothesis of our algorithms is that we can divide the large sequence alignment problem into smaller ones, and then reach a semi-optimal alignment of the original large sequences by combining the solutions of the smaller problems. The experimental results show that the divide-and-conquer hypothesis is useful in multiple sequence alignment.
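To make the optimization target of QOMA and QOMA2 concrete, the sketch below computes a Sum-of-Pairs score for a small alignment by summing a pairwise score over every pair of rows in every column. The scoring values (match 1, mismatch -1, gap -2, gap against gap 0) and the function names are assumptions chosen only for illustration. Protein alignments would normally use a substitution matrix instead, and this is not the scoring scheme used in the experiments.

```python
from itertools import combinations
from typing import List


def pair_score(a: str, b: str, match: int = 1,
               mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of aligned symbols (illustrative values only)."""
    if a == '-' and b == '-':
        return 0                      # two gaps contribute nothing
    if a == '-' or b == '-':
        return gap
    return match if a == b else mismatch


def sp_score(alignment: List[str]) -> int:
    """Sum-of-Pairs score: pairwise scores summed over all columns."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


print(sp_score(["ACGT", "ACGT", "ACGT"]))   # three identical rows
print(sp_score(["ACG-", "A-GT", "ACGT"]))   # two columns contain a gap
```

Because the score decomposes over columns, the score of a window of columns can be evaluated and improved independently of the rest of the alignment, which is what makes a window-based divide-and-conquer strategy possible.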

[1] J. Thompson, D. Higgins, and T. Gibson, "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-specific Gap Penalties and Weight Matrix Choice," Nucleic Acids Research, vol. 22, no. 22, pp. 4673-4680, 1994.
[2] C. Notredame, D. Higgins, and J. Heringa, "T-Coffee: a novel method for fast and accurate multiple sequence alignment," Journal of Molecular Biology, vol. 302, no. 1, pp. 205-217, 2000.
[3] D. T. Jones, "Protein Secondary Structure Prediction based on Position-Specific Scoring Matrices," Journal of Molecular Biology, vol. 292, no. 2, pp. 195-202, 1999.
[4] A. Phillips, D. Janies, and W. Wheeler, "Multiple Sequence Alignment in Phylogenetic Analysis," Molecular Phylogenetics and Evolution, vol. 16, no. 3, pp. 317-330, 2000.
[5] J. Thompson, F. Plewniak, and O. Poch, "A comprehensive comparison of multiple sequence alignment programs," Nucleic Acids Research, vol. 27, no. 13, pp. 2682-2690, 1999.
[6] W. N. Grundy, "Family-based Homology Detection via Pairwise Sequence Comparison," in Annual Conference on Research in Computational Molecular Biology (RECOMB '98), 1997, pp. 94-100.
[7] S. S. Gross and M. R. Brent, "Using multiple alignments to improve gene prediction," in RECOMB, 2005, pp. 374-388.
[8] C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA," J. Mol. Biol., vol. 268, 1997.
[9] A. E. Tenney, R. H. Brown, C. Vaske, J. K. Lodge, T. L. Doering, and M. R. Brent, "Gene prediction and verification in a compact genome with numerous small introns," Genome Research, vol. 14, no. 11, pp. 2330-2335, 2004.
[10] J. D. Palmer, "Comparative organization of chloroplast genomes," Annual Review of Genetics, vol. 19, no. 1, pp. 325-354, 1985.
[11] T. M. Przytycka, G. Davis, N. Song, and D. Durand, "Graph theoretical insights into evolution of multidomain proteins," in RECOMB, 2005, pp. 311-325.
[12] L. Falquet, M. Pagni, P. Bucher, N. Hulo, C. J. Sigrist, K. Hofmann, and A. Bairoch, "The PROSITE database, its status in 2002," Nucleic Acids Research, vol. 30, no. 1, pp. 235-238, January 2002.
[13] T. K. Attwood, M. D. R. Croning, D. R. Flower, A. P. Lewis, J. E. Mabey, P. Scordis, J. N. Selley, and W. Wright, "PRINTS-S: the database formerly known as PRINTS," Nucleic Acids Research, vol. 28, no. 1, pp. 225-227, 2000.

[14] M. Gribskov, A. McLachlan, and D. Eisenberg, "Profile analysis: detection of distantly related proteins," Proceedings of the National Academy of Sciences USA, vol. 84, no. 13, pp. 4355-4358, 1987.
[15] D. Haussler, A. Krogh, I. Mian, and K. Sjolander, "Protein modeling using hidden Markov models: Analysis of globins," in Hawaii International Conference on Systems Science, Los Alamitos, CA, 1993, vol. 1, pp. 792-802, IEEE Computer Society Press.
[16] R. Luthy, I. Xenarios, and P. Bucher, "Improving the sensitivity of the sequence profile method," Protein Science, vol. 3, no. 1, pp. 139-146, January 1994.
[17] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. Sonnhammer, D. J. Studholme, C. Yeats, and S. R. Eddy, "The Pfam protein families database," Nucleic Acids Res, vol. 32, Database issue, January 2004.
[18] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Res., vol. 25, no. 17, pp. 3389-3402, 1997.
[19] I. Korf, P. Flicek, D. Duan, and M. R. Brent, "Integrating genomic homology into gene structure prediction," Bioinformatics, vol. 17, no. 90001, pp. 140S-148, 2001.
[20] J. Flannick and S. Batzoglou, "Using multiple alignments to improve seeded local alignment algorithms," Nucleic Acids Research, vol. 33, no. 15, pp. 4563-4577, 2005.
[21] Rat Genome Sequencing Project Consortium, "Genome sequence of the Brown Norway rat yields insights into mammalian evolution," Nature, vol. 428, pp. 493-521, 2004.
[22] P. Havlak, R. Chen, K. J. Durbin, A. Egan, Y. Ren, X.-Z. Song, G. M. Weinstock, and R. A. Gibbs, "The Atlas Genome Assembly System," Genome Research, vol. 14, no. 4, pp. 721-732, 2004.
[23] M. Roberts, B. R. Hunt, J. A. Yorke, R. A. Bolanos, and A. L. Delcher, "A preprocessor for shotgun assembly of large genomes," Journal of Computational Biology, vol. 11, no. 4, pp. 734-752, 2004.
[24] S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, "Human-Mouse Alignments with BLASTZ," Genome Research, vol. 13, no. 1, pp. 103-107, 2003.
[25] T. Kahveci, V. Ljosa, and A. K. Singh, "Speeding up whole-genome alignment by indexing frequency vectors," Bioinformatics, vol. 20, no. 13, pp. 2122-2134, 2004.
[26] A. Apostolico, M. Comin, and L. Parida, "Conservative extraction of over-represented extensible motifs," Bioinformatics, vol. 21, no. suppl-1, pp. i9-18, 2005.

[27] L. Wang and T. Jiang, "On the complexity of multiple sequence alignment," Journal of Computational Biology, vol. 1, no. 4, pp. 337-348, 1994.
[28] S. B. Needleman and C. D. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," Journal of Molecular Biology, vol. 48, pp. 443-53, 1970.
[29] D. Lipman, S. Altschul, and J. Kececioglu, "A Tool for Multiple Sequence Alignment," Proceedings of the National Academy of Sciences of the United States of America (PNAS), vol. 86, no. 12, pp. 4412-4415, 1989.
[30] S. K. Gupta, J. D. Kececioglu, and A. A. Schaffer, "Improving the Practical Space and Time Efficiency of the Shortest-paths Approach to Sum-of-pairs Multiple Sequence Alignment," Journal of Computational Biology, vol. 2, no. 3, pp. 459, 1995.
[31] D. Feng and R. Doolittle, "Progressive Sequence Alignment As A Prerequisite To Correct Phylogenetic Trees," Journal of Molecular Evolution, vol. 25, no. 4, pp. 351-360, 1987.
[32] A. Dhingra and K. M. Folta, "ASAP: Amplification, sequencing & annotation of plastomes," BMC Genomics, vol. 6, pp. 176, 2005.
[33] R. K. Jansen, L. A. Raubeson, et al., "Methods for obtaining and analyzing whole chloroplast genome sequences," in Methods in Enzymology, Academic Press, 2005, pp. 348-384.
[34] J. Thompson, F. Plewniak, and O. Poch, "A comprehensive comparison of multiple sequence alignment programs," Nucleic Acids Research, vol. 27, no. 13, pp. 2682-2690, 1999.
[35] C. Notredame, "Recent progress in multiple sequence alignment: a survey," Pharmacogenomics, vol. 3, no. 1, pp. 131-44, 2002.
[36] N. Chia and R. Bundschuh, "A practical approach to significance assessment in alignment with gaps," in RECOMB, 2005, pp. 474-488.
[37] T. Jiang, Y. Xu, and M. Q. Zhang, Current Topics in Computational Molecular Biology, The MIT Press, University of California, Riverside, 2002.
[38] D. J. Bacon and W. F. Anderson, "Multiple sequence alignment," Journal of Molecular Biology, vol. 191, pp. 153-161, 1986.
[39] V. Bafna, E. L. Lawler, and P. A. Pevzner, "Approximation algorithms for multiple sequence alignment," Theoretical Computer Science, vol. 182, no. 1-2, pp. 233-244, 1997.
[40] H. Carrillo and D. Lipman, "The multiple sequence alignment problem in biology," SIAM Journal on Applied Math, vol. 48, no. 5, pp. 1073-1082, 1988.

[41] "Book review: Algorithms on strings, trees, and sequences: computer science and computational biology by Dan Gusfield (Cambridge University Press, Cambridge, England, 1997)," SIGACT News, vol. 29, no. 3, pp. 43-46, 1998, Reviewer: Gary Benson.
[42] D. Gusfield, "Efficient methods for multiple sequence alignment with guaranteed error bounds," Bulletin of Mathematical Biology, vol. 55, no. 1, pp. 141-54, 1993.
[43] D. J. Lipman, S. F. Altschul, and J. D. Kececioglu, "A tool for multiple sequence alignment," Proceedings of the National Academy of Sciences of the United States of America, vol. 86, pp. 4412-4415, 1989.
[44] B. Ma, L. Wang, and M. Li, "Near optimal multiple alignment within a band in polynomial time," Journal of Computer and System Sciences, vol. 73, no. 6, pp. 997-1011, 2007.
[45] C. Lee, C. Grasso, and M. Sharlow, "Multiple sequence alignment using partial order graphs," Bioinformatics, vol. 18, no. 3, pp. 452-464, 2002.
[46] I. Walle, I. Lasters, and L. Wyns, "Align-m - a new algorithm for multiple alignment of highly divergent sequences," Bioinformatics, vol. 20, no. 9, pp. 1428-1435, 2004.
[47] J. Stoye, V. Moulton, and A. W. M. Dress, "DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment," Computer Applications in the Biosciences, vol. 13, no. 6, pp. 625-626, 1997.
[48] S. F. Altschul and D. J. Lipman, "Trees, stars, and multiple biological sequence alignment," SIAM Journal on Applied Math, vol. 49, no. 1, pp. 197-209, 1989.
[49] D. Sankoff, "Minimal mutation trees of sequences," SIAM Journal on Applied Mathematics, vol. 28, no. 1, pp. 35-42, 1975.
[50] M. S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes, June 1995.
[51] S. Henikoff and J. Henikoff, "Amino Acid Substitution Matrices from Protein Blocks," Proceedings of the National Academy of Sciences, vol. 89, no. 22, pp. 10915-10919, 1992.
[52] R. Schwartz and M. Dayhoff, "Matrices for detecting distant relationships," Atlas of protein sequences, pp. 353-58.
[53] D. Sankoff, R. Cedergren, and G. Lapalme, "Frequency of insertion-deletion, transversion, and transition in the evolution of 5S ribosomal RNA," J Mol Evol, vol. 7, no. 2, pp. 133-49, 1976.
[54] J. Thompson, F. Plewniak, and O. Poch, "BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs," Bioinformatics, vol. 15, no. 1, pp. 87-88, 1999.

[55] R. Baeza-Yates and G. Navarro, "New and faster filters for multiple approximate string matching," Random Struct. Algorithms, vol. 20, no. 1, pp. 23-49, 2002.
[56] G. Navarro, "Multiple approximate string matching by counting," in Proc. of WSP'97, 1997, pp. 125-139, Carleton University Press.
[57] R. A. Baeza-Yates and G. Navarro, "Faster approximate string matching," Algorithmica, vol. 23, no. 2, pp. 127-158, 1999.
[58] W. I. Chang and E. L. Lawler, "Sublinear expected time approximate string matching and biological applications," Tech. Rep. 4/5, EECS Department, University of California, Berkeley, 1994.
[59] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences," Journal of Molecular Biology, March 1981.
[60] O. Gotoh, "An improved algorithm for matching biological sequences," Journal of Molecular Biology, vol. 162, no. 3, pp. 705-708, 1982.
[61] S. K. Gupta, J. D. Kececioglu, and A. A. Schaffer, "Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment," Journal of Computational Biology, vol. 2, no. 3, pp. 459-462, 1995.
[62] J. Stoye, "Multiple sequence alignment with the divide-and-conquer method," Gene, vol. 211, no. 2, pp. GC45-GC56, 1998.
[63] M. Brudno, C. B. Do, G. M. Cooper, M. F. Kim, E. Davydov, NISC Comparative Sequencing Program, E. D. Green, A. Sidow, and S. Batzoglou, "LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA," Genome Research, vol. 13, no. 4, pp. 721-731, 2003.
[64] A. Delcher, S. Kasif, R. Fleischmann, J. Peterson, O. White, and S. Salzberg, "Alignment of whole genomes," Nucleic Acids Research, vol. 27, no. 11, pp. 2369-2376, 1999.
[65] E. W. Myers and W. Miller, "Optimal alignments in linear space," Computer Applications in the Biosciences, vol. 4, no. 1, pp. 11-17, 1988.
[66] M. Brudno, M. Chapman, B. Gottgens, S. Batzoglou, and B. Morgenstern, "Fast and sensitive multiple alignment of large genomic sequences," BMC Bioinformatics, vol. 4, no. 66, 2003.
[67] A. Policriti, N. Vitacolonna, M. Morgante, and A. Zuccolo, "Structured motifs search," in RECOMB, 2004, pp. 133-139.
[68] K. P. Choi, F. Zeng, and L. Zhang, "Good spaced seeds for homology search," Bioinformatics, vol. 20, no. 7, pp. 1053-1059, 2004.

[69] B. Ma, J. Tromp, and M. Li, "PatternHunter: faster and more sensitive homology search," Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.
[70] P. A. S. Nuin, Z. Wang, and E. R. M. Tillier, "The accuracy of several multiple sequence alignment programs for proteins," BMC Bioinformatics, vol. 7, pp. 471+, October 2006.
[71] F. Corpet, "Multiple sequence alignment with hierarchical clustering," Nucleic Acids Research, vol. 16, no. 22, pp. 10881-10890, 1988.
[72] J. Hein, "A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given," Molecular Biology and Evolution, vol. 6, no. 6, pp. 649-668, 1989.
[73] C. Grasso and C. Lee, "Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems," Bioinformatics, vol. 20, no. 10, pp. 1546-1556, 2004.
[74] C. Lee, C. Grasso, and M. F. Sharlow, "Multiple sequence alignment using partial order graphs," Bioinformatics, vol. 18, no. 3, pp. 452-464, 2002.
[75] K. Katoh, K. Misawa, K. Kuma, and T. Miyata, "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform," Nucleic Acids Research, vol. 30, no. 14, pp. 3059-3066, 2002.
[76] S.-H. Sze, Y. Lu, and Q. Yang, "A Polynomial Time Solvable Formulation Of Multiple Sequence Alignment," in International Conference on Research in Computational Molecular Biology (RECOMB), 2005, pp. 204-216.
[77] R. Thomsen, G. B. Fogel, and T. Krink, "Improvement of Clustal-Derived Sequence Alignments with Evolutionary Algorithms," in Congress on Evolutionary Computation, 2003, vol. 1, pp. 312-319.
[78] R. Edgar, "MUSCLE: multiple sequence alignment with high accuracy and high throughput," Nucleic Acids Research, vol. 32, no. 5, pp. 1792-1797, 2004.
[79] M. Sammeth, B. Morgenstern, and J. Stoye, "Divide-and-conquer multiple alignment with segment-based constraints," Bioinformatics, vol. 19, no. 90002, pp. ii189-195, 2003.
[80] K. Katoh, K. Misawa, K.-i. Kuma, and T. Miyata, "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform," Nucleic Acids Research, vol. 30, no. 14, pp. 3059-3066, 2002.
[81] A. Krishnan, K.-B. Li, and P. Issac, "Rapid detection of conserved regions in protein sequences using wavelets," In Silico Biology, vol. 4, 2004.
[82] K. R. Rasmussen, J. Stoye, and E. W. Myers, "Efficient q-gram filters for finding all epsilon-matches over a given length," in RECOMB, 2005, pp. 189-203.

[83] B. Morgenstern, K. Frech, A. Dress, and T. Werner, "DIALIGN: Finding Local Similarities by Multiple Sequence Alignment," Bioinformatics, vol. 14, no. 3, pp. 290-294, 1998.
[84] B. Morgenstern, "DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment," Bioinformatics, vol. 15, no. 3, pp. 211-218, 1999.
[85] X. Huang and W. Miller, "A time-efficient, linear-space local similarity algorithm," Advances in Applied Mathematics, vol. 12, pp. 337-357, 1991.
[86] N. Bray and L. Pachter, "MAVID: Constrained Ancestral Alignment of Multiple Sequences," Genome Research, vol. 14, no. 4, pp. 693-699, 2004.
[87] O. Gotoh, "Significant Improvement in Accuracy of Multiple Protein Sequence Alignments by Iterative Refinement as Assessed by Reference to Structural Alignments," Journal of Molecular Biology, vol. 264, no. 4, pp. 823-838, 1996.
[88] C. Do, M. Brudno, and S. Batzoglou, "PROBCONS: Probabilistic Consistency-based Multiple Alignment of Amino Acid Sequences," in Intelligent Systems for Molecular Biology (ISMB), 2004.
[89] S. R. Eddy, "Multiple Alignment Using Hidden Markov Models," in Intelligent Systems for Molecular Biology (ISMB), 1995, vol. 3, pp. 114-120.
[90] C. Alkan, E. Tuzun, J. Buard, F. Lethiec, E. E. Eichler, J. A. Bailey, and S. C. Sahinalp, "Manipulating multiple sequence alignments via MaM and WebMaM," Nucleic Acids Research, vol. 33, no. suppl 2, pp. W295-298, 2005.
[91] J. D. Thompson, J. C. Thierry, and O. Poch, "RASCAL: rapid scanning and correction of multiple sequence alignments," Bioinformatics, vol. 19, no. 9, pp. 1155-1161, 2003.
[92] S. Chakrabarti, C. Lanczycki, A. Panchenko, T. Przytycka, P. Thiessen, and S. Bryant, "Refining multiple sequence alignments with conserved core regions," Nucleic Acids Res, vol. 34, no. 9, pp. 2598-606, 2006.
[93] E. L. Anson and E. W. Myers, "ReAligner: A program for refining DNA sequence multi-alignments," in Proceedings of the 1st Annual International Conference on Computational Molecular Biology (RECOMB), Santa Fe, NM, 1997, pp. 9-16, ACM Press.
[94] R. Spang, M. Rehmsmeier, and J. Stoye, "Sequence database search using jumping alignments," in Intelligent Systems for Molecular Biology (ISMB), 2000, pp. 367-375.
[95] X. Zhang and T. Kahveci, "QOMA2: Optimizing the alignment of many sequences," IEEE International Conference on Bioinformatics and Bioengineering (BIBE), vol. 2, pp. 780-787, 2007.

[96] D. S. Hochbaum, Approximation Algorithms for NP-Hard Problems, PWS Publishing Company, Department of Industrial Engineering, Operations Research, Etcheverry Hall, University of California, Berkeley, CA 94720-1777, 1996.
[97] P. A. Pevzner, "Multiple alignment, communication cost, and graph matching," SIAM Journal on Applied Mathematics, vol. 52, no. 6, pp. 1763-1779, 1992.
[98] D. Sankoff, J. B. Kruskal, and J. P. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Cambridge University Press, 1999, ISBN: 1575862174.
[99] X. Zhang and T. Kahveci, "QOMA: quasi-optimal multiple alignment of protein sequences," Bioinformatics, vol. 23, no. 2, pp. 162-168, 2007.
[100] X. Zhang and T. Kahveci, "A New Approach for Alignment of Multiple Proteins," in Pacific Symposium on Biocomputing (PSB), 2006, pp. 339-350.
[101] P. Bonizzoni and G. D. Vedova, "The complexity of multiple sequence alignment with SP-score that is a metric," Theoretical Computer Science, vol. 259, no. 1-2, pp. 63-79, 2001.
[102] M. Li, B. Ma, and L. Wang, "Near optimal multiple alignment within a band in polynomial time," pp. 425-434.
[103] T. Jiang, E. L. Lawler, and L. Wang, "Aligning sequences via an evolutionary tree: complexity and approximation," in STOC '94: Proceedings of the twenty-sixth annual ACM symposium on Theory of computing, New York, NY, USA, 1994, pp. 760-769, ACM.
[104] M. Li, B. Ma, and L. Wang, "Finding similar regions in many strings," in STOC '99: Proceedings of the thirty-first annual ACM symposium on Theory of computing, New York, NY, USA, 1999, pp. 473-482, ACM.
[105] M. Middendorf, "More on the complexity of common superstring and supersequence problems," Theoretical Computer Science, vol. 125, no. 2, pp. 205-228, 1994.
[106] W. Just, "Computational complexity of multiple sequence alignment with SP-score," 1999.
[107] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, January 1979.
[108] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915-10919, November 1992.
[109] G. Karypis and V. Kumar, "METIS, unstructured graph partitioning and sparse matrix ordering system, version 2.0," Tech. Rep., University of Minnesota, Department of Computer Science, Minneapolis, MN 55455, August 1995.

[110] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359-392, 1998.
[111] K. Reinert, H.-P. Lenhof, P. Mutzel, K. Mehlhorn, and J. D. Kececioglu, "A branch-and-cut algorithm for multiple sequence alignment," in Proceedings of the 1st Annual International Conference on Computational Molecular Biology (RECOMB), Santa Fe, NM, 1997, pp. 241-250, ACM Press.
[112] P. Bradley, P. S. Kim, and B. Berger, "Trilogy: discovery of sequence-structure patterns across diverse proteins," in International Conference on Research in Computational Molecular Biology (RECOMB), 2002, pp. 77-88.
[113] L. Chen, "Multiple Protein Structure Alignment by Deterministic Annealing," in IEEE Computer Society Bioinformatics Conference (CSB '03), 2003, vol. 00, p. 609.
[114] J. F. Gibrat, T. Madej, and S. H. Bryant, "Surprising similarities in structure comparison," Current Opinion in Structural Biology, vol. 6, no. 3, pp. 377-385, 1996.
[115] S. V. A. and H. J, "A new method for iterative multiple sequence alignment using secondary structure prediction," in Intelligent Systems for Molecular Biology (ISMB), 2002.
[116] K. B. Mullis and F. A. Faloona, "Specific synthesis of DNA in vitro via a polymerase-catalyzed chain-reaction," Methods Enzymol, vol. 155, pp. 335-350, 1987.
[117] X. Huang and A. Madan, "CAP3: A DNA Sequence Assembly Program," Genome Research, vol. 9, no. 9, pp. 868-877, 1999.
[118] M. Pop, S. L. Salzberg, and M. Shumway, "Cover feature: Genome sequence assembly: Algorithms and issues," IEEE Computer, vol. 35, no. 7, pp. 47-54, July 2002.
[119] J. Fredslund, L. Schauser, L. H. Madsen, N. Sandal, and J. Stougaard, "PriFi: using a multiple alignment of related sequences to find primers for amplification of homologs," Nucleic Acids Research, vol. 33, no. suppl 2, pp. W516-520, 2005.
[120] S. Rozen and H. J. Skaletsky, "Primer3 on the WWW for general users and for biologist programmers," Methods in Molecular Biology, pp. 365-386, 2000.

Xu Zhang received his master's degree from the Chinese Academy of Sciences in 2002. He is a graduate research assistant in computer and information science and engineering at the University of Florida. His major research interests include bioinformatics and e-learning, the first of which is the focus of his forthcoming Ph.D.