Mining Comparative Genomic Hybridization Data

Permanent Link: http://ufdc.ufl.edu/UFE0021686/00001

Material Information

Title: Mining Comparative Genomic Hybridization Data
Physical Description: 1 online resource (128 p.)
Language: English
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: classification, clustering, feature, progression
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Numerical and structural chromosomal imbalances are among the most prominent features of neoplastic cells. Thousands of (molecular-)cytogenetic studies of human neoplasias have searched for insights into the genetic mechanisms of tumor development and for targets for pharmacologic intervention. It is assumed that repetitive chromosomal aberration patterns reflect the cooperation of a multitude of tumor-relevant genes in most malignant diseases. One method for measuring genomic aberrations is Comparative Genomic Hybridization (CGH), a molecular-cytogenetic analysis method for detecting regions with genomic imbalances (gains or losses of DNA segments). The CGH data of an individual tumor can be considered an ordered list of discrete values, where each value corresponds to a single chromosomal band and denotes one of three aberration statuses (gain, loss, or no change). Along with high dimensionality (around 1000), a key feature of CGH data is that consecutive values are highly correlated. In this research, we have developed novel data mining methods that exploit these characteristics: algorithms for feature selection, clustering, and classification of CGH data sets consisting of samples from multiple cancer types, as well as methods and models for understanding the progression of cancer. Experimental results on real CGH data sets show the benefits of our methods compared with existing methods in the literature.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Ranka, Sanjay.
Local: Co-adviser: Kahveci, Tamer.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0021686:00001


838035 F20101108_AAATVU liu_j_Page_069.jp2
825203f5bb43e79a0982831f893fbf99
55cc7cd5c6ac053bab2f8b753cb360a0a6744946
2257 F20101108_AAAUCD liu_j_Page_012.txt
bb6ac00593e045b7c4563a79e464d031
a34762921549aaf7f73ccb773c412d399bbaf447
61351 F20101108_AAAUBQ liu_j_Page_111.pro
2e168dfbaee58315a23efe6c7db26617
05bf8b5c8fda16b952fdf74933dddc8654661d08
105636 F20101108_AAATWK liu_j_Page_100.jp2
c4006e7ed2c723dfef81a756e6b68350
83c87e1e26a4f3fcae8df9b44bcf2d71b042d656
F20101108_AAATVV liu_j_Page_071.jp2
7d64bfe4052c478b0c63c94ddbfbe8b4
98d0ab3516b81fd5634daad6332f0c2e8db30420
1774 F20101108_AAAUCE liu_j_Page_014.txt
ecc3a83e89ade430bb284e8d869b175f
6cb462f63ace2960f466c99285df3004fbf1347f
27515 F20101108_AAAUBR liu_j_Page_119.pro
04db55a4c3417838aaf750bcaea7db5f
54fa8bd6204b53433a5a89f1818267580d8e3aaa
F20101108_AAATWL liu_j_Page_101.jp2
f52ec34c17e67c109d42d1a7a2030e6d
bbda53eba7da44326e7a609b56566db1618bcff8
1012368 F20101108_AAATVW liu_j_Page_072.jp2
e668b50d427b14410926329658e3921c
736fbd7e259d9849609cc146470134a2fb1d634e
2362 F20101108_AAAUCF liu_j_Page_016.txt
c397697fc255c50363dcaa572b202623
e2efecde80101cf3aaa3b0a14480d593b32c06dd
F20101108_AAATXA liu_j_Page_007.tif
635494f57da507e38021651dcf77d132
6823a829770d27a69a11ba9da39698c0f38e209c
33502 F20101108_AAAUBS liu_j_Page_121.pro
d93595f1b051176653e613154b98eb39
72308261617ac8ca182dfd88f47755e7e0c98cb5
1051958 F20101108_AAATWM liu_j_Page_102.jp2
69a188128bc7bf9cd83a0f67b3a75b23
735e9463b6e8a776ae9f34712fd81cab5f84abf0
1051966 F20101108_AAATVX liu_j_Page_074.jp2
3a1bebf6b2555a0e0aab230c557ddb68
21f9fdfa1e5e242bddec685165d4d98baed62b28
2316 F20101108_AAAUCG liu_j_Page_020.txt
740459afb3ce9ca62628bff7299b2622
fbac293274077bca78b56269d347c20064da512c
F20101108_AAATXB liu_j_Page_008.tif
1fac16633b18bc4a81b2add17805c8aa
797a21a1e44fc6d214f50f8eb7577ed13d6f668e
79343 F20101108_AAAUBT liu_j_Page_122.pro
744fbd2961b2e7d819eff28d0d29ec34
e3aad39a08b38b15f9002b624dd0371a9c10f71c
109201 F20101108_AAATWN liu_j_Page_103.jp2
2649df76d41c53d74421eed9cb7d7466
a9e9784fb866ad5387731a24edc836bbad1f9e55
1051951 F20101108_AAATVY liu_j_Page_077.jp2
3497835e56f5b32add49b7e31e36ceda
b3e30ef666a09c90266c1ee1554e03840ec6153f
1839 F20101108_AAAUCH liu_j_Page_023.txt
dc83306e71753b9f39732d305f91e41f
f5d1bbe15ab93b2e50b6eb9619f164849add749d
F20101108_AAATXC liu_j_Page_009.tif
e8e3f6d2cbb232bc0257bc9a52cda730
a20631af154523f8ffcb993a27e7a52587b6391b
81368 F20101108_AAAUBU liu_j_Page_123.pro
957f30d38d16ee210c9c18df8649001d
9825544b850bc9d22e284114f5195afd910c7cd2
829158 F20101108_AAATWO liu_j_Page_104.jp2
a23b982ca10cbfebb8c26ad225636e17
f98eb4ff75a5121cd46d0949f514428db16c0ee7
94080 F20101108_AAATVZ liu_j_Page_079.jp2
13d9d5b47767761b60fdff496de07134
6585c761a89a4a83c24ecd8a43dca4b86c961c12
2214 F20101108_AAAUCI liu_j_Page_024.txt
3715e3f6cc817e70a5cf8a497f8dafd7
b5a87b90d2658988f6bf35b97039eea8c78215c9
F20101108_AAATXD liu_j_Page_012.tif
d1551198f3780692ba6ffa9ca75049a4
576b92d57c461bce8533a7a1e74e9a84f8198816
82633 F20101108_AAAUBV liu_j_Page_124.pro
78227a158ac71e19dc7bab01b5058813
04dca85c87354be2d8ebd3bdc7a5d0032f08f0b4
2162 F20101108_AAAUCJ liu_j_Page_025.txt
7cc32bf623378f923f6678b313e8988c
c3816b1e397e52d610be211c269a4ecb3dda010a
F20101108_AAATXE liu_j_Page_013.tif
5978a709021dda61479099504028772d
fd04010eb359f58eabc507dd233709d65b3b6107
9001 F20101108_AAAUBW liu_j_Page_128.pro
e4fe39fd88ac4943457fe7d07b14d3cf
1b24f1cc3561bc738746412b3e5c88e19b3c741b
F20101108_AAATWP liu_j_Page_107.jp2
8d186d1048d423219ca5ef1097067f46
e12cf45e7be1758bf8317d1ba3bb31b57d139c02
F20101108_AAAUCK liu_j_Page_030.txt
fd54002c96a41120887f1064c565fa14
082de8aa5c47c9e5fc85b3845521b94f7e8c87d8
F20101108_AAATXF liu_j_Page_016.tif
af864a07ee3704200c37cdb7b3449712
03ee139edb0afa892d332ee377e0cf6505e86e23
414 F20101108_AAAUBX liu_j_Page_001.txt
62bc49b0d69e9605041f67a67877050f
7521cd155b63859e6b102546bec07f4952f3f544
647871 F20101108_AAATWQ liu_j_Page_109.jp2
fcea6528a9fb58a36d44492a52e20344
e3c7b073be8f3a080c5392ecd8cb78084f0e6e29
2202 F20101108_AAAUCL liu_j_Page_032.txt
895d622d49733857c00ea6fbb74a390a
250e24bf23abe6088e97479238ba35d6fa422ea6
F20101108_AAATXG liu_j_Page_018.tif
bae9a5289c1edcb46a139dfd081b2b7d
7b4837103ace6b0be4890861eec96fa32d9d4418
80 F20101108_AAAUBY liu_j_Page_002.txt
115d6a55db1b22959946825872c7275c
2d9d147aa57d89d2814163833c8ea7fac172dfdb
1051961 F20101108_AAATWR liu_j_Page_111.jp2
e2fab6352136d80769916c5aaf36c6f7
3753e9edbb5d47391b3408db773cc9f4a3d19a30
3031 F20101108_AAAUDA liu_j_Page_062.txt
15e77af7ed7ad1ea3af1a8cfa278e5f2
3093a9c046f02134b98e6cb88f21ca88fbe64635
F20101108_AAAUCM liu_j_Page_036.txt
afa3bf2158dd479c638dd0c432b73e6e
ab50417da1364451014e7daac12eb3f04a5760f9
F20101108_AAATXH liu_j_Page_019.tif
2aa61625ba4df0d133df9308c5e73387
014499af644cba0870a0a7063f4ec953020793ce
1051886 F20101108_AAATWS liu_j_Page_115.jp2
a4537735dab16cbeb7955c05b4757caa
5140a3018f87c88964aa668ce61dfa9eb0b781fe
2433 F20101108_AAAUDB liu_j_Page_063.txt
b2c9b6962667aafe2591128619a081e2
076b1f18049e034fb009fe123ed3aef8e08dbb47
2031 F20101108_AAAUCN liu_j_Page_038.txt
66f5d2438c3ee10ba104d96e2e568873
6069f11dd85867309e535531a11c4fd56fd899b8
F20101108_AAATXI liu_j_Page_022.tif
f0c27844575dcd182f5e7730af964358
c9800d05462af91e73503c486560beb66adbf8bd
815 F20101108_AAAUBZ liu_j_Page_004.txt
e740829e8249b357d87b2f4f23bd5687
186b9fdca65bc514e2e0be794ee016d155f6fb55
1051982 F20101108_AAATWT liu_j_Page_118.jp2
b7048f65cd2355ca401ee7acc5afd332
c7991be3f03c82ec1175d9db26c46ba25a0bd2bb
1785 F20101108_AAAUDC liu_j_Page_064.txt
7b4741a6fdaa1516abeeba735dd37a0e
9e1b1c9eeec277fba27605ed8ce7ac16c52d7f77
2190 F20101108_AAAUCO liu_j_Page_039.txt
c357754f794a2cea40fa543649ca597e
e83a9d0331be3a72bd44e40f3e9cb28eca650ca9
F20101108_AAATXJ liu_j_Page_023.tif
50cf0c0c63f64044f7b2f249478c8ec5
64e39f4c42c3eda4887b50b89ccced9440834394
1051978 F20101108_AAATWU liu_j_Page_119.jp2
148823d820c6e669650c04e2332fb1ba
927c395c21e32fc17d8509c34cde9c26453f9cbb
1375 F20101108_AAAUDD liu_j_Page_065.txt
c5e464e1e3631a3793a21d8ebc607a55
44089453287972205881373a67b07383f4fe3166
1618 F20101108_AAAUCP liu_j_Page_040.txt
6ba06e8694380a7c1b78be65503be66a
0285e3757e83b4992332d87e7e296029933fb5b6
F20101108_AAATXK liu_j_Page_024.tif
08cad11604848f7966124aeaed409022
7ff1a484d3036486eb160cf2ddcf9e27ddc00718
115109 F20101108_AAATWV liu_j_Page_120.jp2
00806ff8cd7ed602beb3d819ef1d70c8
fbd3c70b37024f56ba557187c324e39e5c6b86a7
2432 F20101108_AAAUDE liu_j_Page_066.txt
4c483d67ee43606976f9dc4a14dd3081
3efc118c25a65b1ae2cc515dba86d605ee4ebc2d
2040 F20101108_AAAUCQ liu_j_Page_041.txt
2c1e66c05241bce41c36e174ac4dd5f8
f26671fa50f859674d9f6750cb6c0d44bfb4703b
F20101108_AAATXL liu_j_Page_027.tif
cafc9df89c5e99c07642701ee171ea4f
507497e3500567130a49df51f2eab4983f6180fc
162385 F20101108_AAATWW liu_j_Page_125.jp2
ea7c06a62a82f0a7ca2974e2b7fdfe1f
fe639511c7e5bfcdf1e1f3f5fdfc0432074b0ee5
1327 F20101108_AAAUDF liu_j_Page_070.txt
72cf12d72aa55689577db5ddbfb69048
001b665259a700b8dc01f7b94743863221bcc06d
881 F20101108_AAAUCR liu_j_Page_043.txt
34f3f079639426afd747b3e0af7450fc
f86c7c8e25305e0eea78628ecbb18824e0a9d5ac
F20101108_AAATXM liu_j_Page_028.tif
d8615a55d8f8b693dbc68d5003566a7c
aea585204f92830e8d8af10ab5036927491d92c0
22692 F20101108_AAATWX liu_j_Page_128.jp2
321f07779b6cd5d32ff6850f4c65b186
4ad8b15bb5ba568445ba3eb5f5a44060b68f159e
1699 F20101108_AAAUDG liu_j_Page_071.txt
a0227f4e54bcd629429e807b1f17ce79
525a966515809b651c854436fce3c0d5edbb726a
F20101108_AAATYA liu_j_Page_052.tif
eff7ec0753c2003d461d0486c4dd37ab
1a13843b5f52c45ec40366fe6cff08a016589ce7
1893 F20101108_AAAUCS liu_j_Page_045.txt
0ad032911fd8a19c07e712283c99dcbd
708fc15eaccde8d40d8425fb705f59b2d835c161
F20101108_AAATXN liu_j_Page_031.tif
264e249ea22f8265da55d5579f129ec7
f14571260ee3a7f27d96c3d99ed0808dc3ba6f3b
F20101108_AAATWY liu_j_Page_001.tif
2486ba3d095203a8b87659ea9af90316
d932b5ab5faa6bfa1382295e3666f7da0e14180c
1880 F20101108_AAAUDH liu_j_Page_072.txt
33fdf85a127656ffd7f5c2e1c525fc69
2598fa05dd05fe86d8336868ac34b4d1459e799b
F20101108_AAATYB liu_j_Page_053.tif
a9b930d9883dc01434fc90fa7739306f
50247ac9631248dc919a72a364e7c4335471fefd
2339 F20101108_AAAUCT liu_j_Page_048.txt
44951c653b9c7fc957b018724d19d9e7
e32c9d79c7481eca9a95690cc8d7233cbe263708
F20101108_AAATXO liu_j_Page_032.tif
14e5bf976e7546b8addf25a729532dfb
3d43780b46fdea9dc77df7a6d7d43588e17ac230
F20101108_AAATWZ liu_j_Page_005.tif
a6264e7f2ce6d3f25b99828663c1948a
59fbe582f5380b4e8ab6514779e6d970480e49dc
2017 F20101108_AAAUDI liu_j_Page_073.txt
40893e5e32edeec4d99b63007c46afe1
fd83a2cd374ccadf4bcc2eabf4196412ec222a97
F20101108_AAATYC liu_j_Page_059.tif
edba0f9c8eed489a3e4bd716f8c0c3da
0f50d606b5aa8d1e3003cf5658b42d6ae81f9d73
2343 F20101108_AAAUCU liu_j_Page_051.txt
e16d06b24ed62e3db4ef75ef214b2039
54178ec54a09b3a3f7bdbd463a7553bac80488eb
F20101108_AAATXP liu_j_Page_033.tif
11b42ecb37cc84b3c018b646f7970395
359937bd7450e63f5b44a672407cd1f54c36934c
2260 F20101108_AAAUDJ liu_j_Page_074.txt
dc468a68c726456c20d47402f6a38512
b76a8ea840c19dc1874436df959432bf0e19358a
F20101108_AAATYD liu_j_Page_060.tif
82d893c9ba9a801a8905cc1dcad4c329
373336dafd8cf1c3df2a6b95af1afe4e48e8c3ef
1897 F20101108_AAAUCV liu_j_Page_054.txt
051a07aab73ee9913c60f123007b67c1
3a1794b4840eea15b14d84c02440082db467868a
1972 F20101108_AAAUDK liu_j_Page_079.txt
26031c5ac8074bfd216c5e3b1766ea2f
4c19168073553228f872b542df7d4a7d9a3e10b4
F20101108_AAATYE liu_j_Page_061.tif
aa99144e81e64cf599096395e568a4a7
bed99facaa090e3c250e9c607beba7ee3f95fb01
2242 F20101108_AAAUCW liu_j_Page_055.txt
3e7f73f65feeb98fe14dc48f5bd5f047
25487fa5b96e51a87bcc379f9f15560d40196c7a
F20101108_AAATXQ liu_j_Page_034.tif
aec96eaaa145eb5f45952100dfee9945
8987c460b733468628e70e2010cc447ca43770d7
2211 F20101108_AAAUDL liu_j_Page_085.txt
d56070e71dd1537c8fbbe8b9f181cd06
06d1a4f2ab0ae8868de21706ac6d3e4278ae1ef3
F20101108_AAATYF liu_j_Page_062.tif
0990d7e80a78fb74e65afc1f3e9a0da0
db8286f10bce6e2311513f26d3f8ed0c416c3583
2094 F20101108_AAAUCX liu_j_Page_057.txt
07b86268cb988d76cfff7791b678a26a
03beba104ceab3eaf1f805fa0eaf1c8c59281a82
F20101108_AAATXR liu_j_Page_035.tif
cdff13489c3210d83dd0293bb945e44b
adc2fb8899e9d5533b35f5c12ff113155fcfb9c1
2206 F20101108_AAAUEA liu_j_Page_105.txt
fca8b43066a9faf6b2cbc00cf42649f5
725aba121e1c2053e7a0e2ef113768d46554b39f
2904 F20101108_AAAUDM liu_j_Page_086.txt
e5ac2e7c96b3e98f4f7baa78aeb1fcc6
244270221c558d1527bf589b91c3630ac3e7c4b4
F20101108_AAATYG liu_j_Page_063.tif
be0fef341fb1bda636e36ba9d73f2976
c0a1028455ed0b32456e3c493640ec055acf124e
2344 F20101108_AAAUCY liu_j_Page_059.txt
6bdc7faefdbb801db167f559bb9fc296
1e2e883a7b1ac09ba3e85143cd28954880cfb4f9
2192 F20101108_AAAUEB liu_j_Page_107.txt
bd87fb539d1a6dd73dfc114cf80332eb
beb08e9dd9b9a8f2ee0fd7db3988938cae798333
F20101108_AAAUDN liu_j_Page_087.txt
f25ae31960f9090accdd32c30104e8d0
348db027db9c6036ee6ce79534b6129d7df7a753
F20101108_AAATYH liu_j_Page_064.tif
1da228488ce6f5cf05c45624378180c1
bcaed3f2b25ed740bbe2384a4bccba6e779ab0a8
2241 F20101108_AAAUCZ liu_j_Page_060.txt
8c2ef22a9d714a2779949f40279e52a7
eda5382c83aa4bd4724831cb1aa1a6a07c73fa7f
F20101108_AAATXS liu_j_Page_036.tif
2eb719c87c88f192a162ebfacc3a1e22
921550ec8ad3811ceb2ad57c4ce0d3c56963e7b4
5084 F20101108_AAATBA liu_j_Page_076thm.jpg
43488454f70fbf9c29492ba9313fa5de
bf8c19575aefad51d49e7167f478037226df8af0
2409 F20101108_AAAUEC liu_j_Page_111.txt
b41e890931f6b00ed86b6ff39de99334
e42d272ee69f1a7657f8b7e1882bc23ca475907d
2093 F20101108_AAAUDO liu_j_Page_088.txt
ce03a0fb00af63dc1f6b364e3703cd9f
0a05c1e1e71f544c92c214a68a507be0b754bb5f
F20101108_AAATYI liu_j_Page_065.tif
43dc826908a6d9d503a68c704a5b26f3
33179a7952118ac9371704d783012efa06484d9c
F20101108_AAATXT liu_j_Page_040.tif
dc60f5cc1b21a7f2c215a145ececa218
57a1716cebbe003ecbcfd657cea61185b8cf4fe6
F20101108_AAATBB liu_j_Page_050.tif
a9306833fa6f47296fd74ba8f4e14db3
0f6117aa09301bf542a0503bd232edb5f7e6a581
2049 F20101108_AAAUED liu_j_Page_115.txt
98924b087c107957e20fa922cb8c3e47
62a0b61a32a463ec7819a65d8e693320a5831cf6
2239 F20101108_AAAUDP liu_j_Page_089.txt
5ef2080b164e325910d245811ae43e54
75769709e556b20eedd16408961b99b6936eca3a
F20101108_AAATYJ liu_j_Page_066.tif
e2d7df96a46259cbfbd9d85d7a425fac
f2d4fc2dd1d9b9fda5e1959fab9d24023d11aa3a
F20101108_AAATXU liu_j_Page_041.tif
a50fdfffc7de3d91b5404e8acc662fdd
f769c0d16414f18f38e8190d058716f3a7024ec2
2012 F20101108_AAATBC liu_j_Page_050.txt
01ed11bb159c7b58ad5c665ca9326e0e
294a942287e3e962a76f2b866883e9f91866bb73
2142 F20101108_AAAUEE liu_j_Page_116.txt
8738b610e3b022c45224450b2bb32bab
317b3f9531dfaefdf460adff8bea00b4ea0efb1c
2232 F20101108_AAAUDQ liu_j_Page_091.txt
30587fcf087fec087e20074e5b6a8493
efb04fb40fffaa23d9f479b18c872baa686f1357
F20101108_AAATYK liu_j_Page_068.tif
e292269690ba4f3dd48d689da24fd82f
8de5eeac948db7bb6e62645f6768aa1e9d60619e
F20101108_AAATXV liu_j_Page_042.tif
b026371e9ed144b9f5ad26f6b9704c79
e1bbf4436bcc3d8c96030d83b20b1f13ceafccb7
45341 F20101108_AAATBD liu_j_Page_054.pro
60f882c801aca58fa3552d0649b154f3
628c80d75348252b224c82fd87c4ead987ddd1cc
1344 F20101108_AAAUEF liu_j_Page_119.txt
00f378db84b9132c7bf40674d864fe92
49ae270d2d08afbb703739aa86417bb92dff7b1b
495 F20101108_AAAUDR liu_j_Page_092.txt
d14f532dc751d6a4faa176ef00c1f750
8e691754cb2908d2f6c97e1ccb0568b02da5755a
F20101108_AAATYL liu_j_Page_069.tif
262aa42d503ac0a059eabcf37b967d16
d41f9ce8c87c872d47652c1bc83843b47f9e85bb
8423998 F20101108_AAATXW liu_j_Page_043.tif
5d00779c20abe81acbd0abec916eb9e8
3ba18b256a8dff30df52e3bd6808a1dc9b50093c
21198 F20101108_AAATBE liu_j_Page_067.QC.jpg
8715e9d7f0f18c69eaba6b9692b0acfc
b8f2ab76d90551e9c920bccadae7abf2860acea3
1404 F20101108_AAAUEG liu_j_Page_121.txt
7f5f18f88c30bce6e130a27c1ef96431
9c24742cc82619ec039e27350570b7c8289527c1
48579 F20101108_AAATAQ liu_j_Page_009.pro
b497f071afa02bd145a35f2b5504be4d
bcd61b0565d6ccd3c67fcf423faf1958df5aa3f5
F20101108_AAATZA liu_j_Page_091.tif
ebca68e98b7cb2a6f4ca3d5e93fde830
db233720859df19653fe798b58fd785be2f70789
2037 F20101108_AAAUDS liu_j_Page_093.txt
65dac1a78d9f6d7b1937b14bfd5de908
02bdd2f5dfea0f6fcc4a00a327dca5519f7b008e
F20101108_AAATYM liu_j_Page_072.tif
381525b15e1bab699921e3fee72b8dd9
49f69821bb050d2f435b95777d78e901d4d1dc87
F20101108_AAATXX liu_j_Page_046.tif
ea80200a2ca6cd6f57fd33e59abe86c0
74a7d155586f88430c190b57cca646b495018dee
3143 F20101108_AAAUEH liu_j_Page_122.txt
2173f0dd7970b78796916a5802ef95b3
cc7fa4b3dde3cd9d76e08f547d889861fa6e0394
54532 F20101108_AAATAR liu_j_Page_074.pro
0892824670849a166cb9cfa9f0088d7b
10237fd57ede91883e5bdd711bd65db91fea428e
F20101108_AAATZB liu_j_Page_092.tif
ea2284db50ad6397664d997926b1541d
45a1767cea7dd454dbdc886c3c70047a44238394
2281 F20101108_AAAUDT liu_j_Page_094.txt
1c796e7ed9a4cbd118b49332be3feda0
46b74c9854e8654180d721e69408a5b9cf6fb21d
F20101108_AAATYN liu_j_Page_073.tif
13ea7d5cb8379d54b49f457e5e6ee058
41e1cb83adb1a18d62603c675cf60d467d7d12ae
F20101108_AAATXY liu_j_Page_048.tif
4ed3f7805be039dfaf98f1fee1cc7cab
19ccde6ff93c9e2ac742ec7a246744bab985f509
775817 F20101108_AAATBF liu_j_Page_043.jp2
811c5d8301b594e6166f31bc4b185fdd
e5bdb741d427084abf0dbd541b99e83ea151fdff
3224 F20101108_AAAUEI liu_j_Page_123.txt
bf7191739fb0351aba177b3deb3d5ae8
74bf40987003bc79f9655e91d8a0f21c35b920de
F20101108_AAATAS liu_j_Page_083.tif
4d5311c47e4e1ae7ddcf556eae8a9ecd
700d4cf44525837980985bbe731ff1530324c203
F20101108_AAATZC liu_j_Page_095.tif
be63166a2d6a49384f6ee70bcfc5673d
f3ecbd2d00a1cb54ad715cdae504312a95b63a56
2277 F20101108_AAAUDU liu_j_Page_095.txt
7c56002d7a299c41bc87a3338d2dd65b
8b6f8b720fd2568ab70e628d405355d31fb218a4
F20101108_AAATYO liu_j_Page_074.tif
ba2c3ab678e5f764f4b7dd543bd1cd02
7a192a6d1a374879d5415ea72e57f5198c360131
F20101108_AAATXZ liu_j_Page_051.tif
a17df22b783d89267b5e7e8a0629fa92
df6f4d953417c22cca573f66c0d640057bf1f68f
22701 F20101108_AAATBG liu_j_Page_109.pro
7c21e8ba560ca6232eddd971f6f6bb77
ea81ac564fadc77fa7600c47b076fa5134a27eab
3267 F20101108_AAAUEJ liu_j_Page_124.txt
6b9560d570ce667d464a198fa2c4c32b
5457b916d613f8ebcb5d408d252ddaa9e77bb375
2309 F20101108_AAATAT liu_j_Page_056.txt
6236b99c18d6610faeb9c25fac201af2
732d5ecc61895e9bb84832a1888986947acdd89e
F20101108_AAATZD liu_j_Page_096.tif
79df0b09131338755374d571bbd4db36
e5395b600b807c43884116a8c71e805c2fc078ea
2082 F20101108_AAAUDV liu_j_Page_096.txt
e077c8fc2abee13636a677f3e787f23f
4100a1ee8eb1b495ac8b45aa0029a44c8087897f
F20101108_AAATYP liu_j_Page_075.tif
9fb1a04d2344f50d3f5d79613dcf9259
e0abab7cc3076a02628edebeb8010496eaa73072
582 F20101108_AAATBH liu_j_Page_047.txt
610986bf94d56d05357760488999c685
946bad204e945b1173e3d46255628f49ac958670
3479 F20101108_AAAUEK liu_j_Page_125.txt
07c125bec284686cb3913f7c30fbf0a5
1440345ae0c36c1850007d160a1f284c5b1d8338
95777 F20101108_AAATAU liu_j_Page_084.jpg
06b602626f1e12987c4166d96bd44806
57202c45138de65b08f02f9720c6c049419eca93
F20101108_AAATZE liu_j_Page_097.tif
9502b87788cb6961cafbb525e6146624
44b476001a73a2273643c900beb8474ae27c5f31
2301 F20101108_AAAUDW liu_j_Page_098.txt
e7d78b03be31c45f01aae2eb91cd2735
6be170e6e2a77bbec2550f029ed766f074cf2b91
F20101108_AAATYQ liu_j_Page_076.tif
f283ed17495ea3841f71d3f5f72c3b39
add6b82437634dbd758e3a1693cf0015fc3a55fc
F20101108_AAATBI liu_j_Page_006.tif
1b7d17a8f5ba96b6f1387ccd773580fe
8aab1b2072ef0f3e8e0ac0d80e35ee073dd2fdca
1820 F20101108_AAAUEL liu_j_Page_127.txt
5395fd845d352963068162833f71ef52
67f4b14bc7dc5517e4147f5e9ed5bdb588e4e15c
27069 F20101108_AAATAV liu_j_Page_124.QC.jpg
af4ad4bb8616c9786ae1b013b5042d72
6a6a24aca6aaa356ddb0262c72a0a05a37ffa139
F20101108_AAATZF liu_j_Page_101.tif
0c27edf0d4b70ab0bbd4e9ed5db35618
e30162b17303c6d07844926b51768b998887d0a5
2071 F20101108_AAAUDX liu_j_Page_100.txt
50b7344649397e50d55f1f22d4929c7e
543ec2ef1a12e3bd0315f758d115f2b73ef6c5ce
57103 F20101108_AAATBJ liu_j_Page_006.pro
1c6f799466c140136f6e7654667025ec
498e12f0d96e355d4f49a5f2a0d657d94d2bca5a
19491 F20101108_AAAUFA liu_j_Page_069.QC.jpg
da82c142fe8c330d309f962d468769d6
74b7defb4ffb6cf0384512b69c6b9b6613daae71
28009 F20101108_AAAUEM liu_j_Page_090.QC.jpg
9145d716f758d58b04659e8c4a08bd13
3ca4013d077aba7dda64e579df39103f3cff93ce
F20101108_AAATAW liu_j_Page_087.jp2
b91561d26ba7f7407434d4ab015e6af1
31eec32d570e79404be3cc906eadbe62227e51f7
F20101108_AAATZG liu_j_Page_102.tif
41b4474513df0cc8e85829fabbb52f8d
35cf0e2fb15894fe3d4aca15a6c4a3ce79ea1df4
2353 F20101108_AAAUDY liu_j_Page_102.txt
7ba2e5add75a136e293b5c06edd16eed
8eb138dc5c4052c8e98b416f3efef55fba0527d8
F20101108_AAATYR liu_j_Page_078.tif
db2a0618061c71dde199a268d5ae4fe0
b5e396a07548bdd70af033474ccf7a0ed5dfe5c8
6504 F20101108_AAATBK liu_j_Page_087thm.jpg
af03845942271d5366a04ca6a2dcc4d0
023f2b532f2a8b8641dfbaf020cb8b2c1c4b49a4
27471 F20101108_AAAUFB liu_j_Page_059.QC.jpg
3ecfcdc4bc474cfdaaed5b6bc9c0e5c8
d97b065d8e07a4ad5433c44e7d0ef301248f34d6
22497 F20101108_AAAUEN liu_j_Page_113.QC.jpg
7a67c7a8c1ec27fbccbe9f507198679c
cc2bf69bc8abe032b1993fc0823953d98662aac1
6931 F20101108_AAATAX liu_j_Page_051thm.jpg
26f07d025a8a8717431784ee8561a5f4
c76e795f27e7ceeef1d15e505fa052ca561ce1aa
F20101108_AAATZH liu_j_Page_103.tif
54229702b4f3e44d6aceaeba67f0bfee
7846ac9c1c214923cd2a7846c7f5f00fa7ce34bc
2126 F20101108_AAAUDZ liu_j_Page_103.txt
45cb4dcafb5dd948e702c94f915598a0
c70195891eb251537e0ce177b58fb15d473ec528
F20101108_AAATYS liu_j_Page_080.tif
31f416d8a29e99b3cadfd7295e695cfb
b75d276134cced1ca4657f109b764706d5181ee2
27855 F20101108_AAATBL liu_j_Page_042.QC.jpg
d4f0d6f97bbfde2985b4f43ec52142c4
3db64af8c94232f06c33392684abc9876cd40a6d
20358 F20101108_AAAUFC liu_j_Page_005.QC.jpg
dc10c10c4aa01b90157116012ce67365
7c017403d296822b9ac8af355d19e9ac494852de
6403 F20101108_AAAUEO liu_j_Page_018thm.jpg
fb9dc04a12a9377fb03681b9d3cc1289
0326850e62212ce72da5194442a6b1e95cbbb3ce
85415 F20101108_AAATAY liu_j_Page_037.jpg
08b8fa056aac8ca32e76f26ba590437e
a141f28871a80e58a810055acc6713f6b9330dce
F20101108_AAATZI liu_j_Page_104.tif
5e723da07150f2a89c198707c9d4b682
129b22fd45c73a709cfb70fbfb33584bd05229f3
F20101108_AAATYT liu_j_Page_081.tif
94a2ea4e79c237fc69492dc2ed252c38
66f5744c0f4ecdc2e3a68286959a694d46bece5a
6965 F20101108_AAATCA liu_j_Page_090thm.jpg
1acfe0cf036ee2cf718e44e13cd826cd
bd48415e0461aeae10438ec49b480c4e2f1b99c4
F20101108_AAATBM liu_j_Page_071.tif
32455ca1ac783b6f44f4237be232fb02
3d624f79494282620c6b9787b608823b49964034
6705 F20101108_AAAUFD liu_j_Page_010thm.jpg
c64909efde4973e62ba44a049609aebb
bb7834f0cefd0a7b355ed2a0c73dcece3c06176e
6486 F20101108_AAAUEP liu_j_Page_085thm.jpg
11a0c9062a2fd55205a17289cd764182
c7af2d265ab259908cd4cda4d1acd1b862345751
F20101108_AAATZJ liu_j_Page_107.tif
8555aac132d7f91f2c40096b4c6cafb0
02f35f39cb487f38530550d767606a7ff9d321ba
F20101108_AAATYU liu_j_Page_082.tif
bed7942f91b5282686e121be84ad23bb
a30f4c3548966e844e1d6765648aa27921e32ca9
1970 F20101108_AAATCB liu_j_Page_068.txt
6aabb6bbb5141bc6cebc3bb0ea957ec3
f002d6972138648c95272edaf223e952b2817443
82982 F20101108_AAATBN liu_j_Page_046.jpg
01c591e447b743c5172839268d9936ad
7cb14838ebeeaf09ebbf6aae7d21cb6e2049d7ee
4652 F20101108_AAAUFE liu_j_Page_121thm.jpg
82a168a35989d95a33fb491f41fecd9b
5e03598cd12dde28f09847b218ba743048e90d65
2102 F20101108_AAATAZ liu_j_Page_075.txt
0712e6896a549730c696e4d65ba58805
7fa192e5fab921dd3b690a28b974141e294dac12
F20101108_AAATZK liu_j_Page_108.tif
59b4ede836a4948613dd9e584be0235d
c0aec865709db78daead01a0ba1246226d733296
F20101108_AAATYV liu_j_Page_084.tif
462277b6a931684762764a563b74a085
3b84cf0e13f573f660b8a51b7b494a6357ce435b
75092 F20101108_AAATCC liu_j_Page_049.jpg
8cdff71451b669b88de015f31405ebf2
837826929677ff7b279df3e176bfd0c318a328f9
1363 F20101108_AAATBO liu_j_Page_007.txt
09417a1fdbea33747dc01818b56a0372
9d79218ac58bd230ec1e3b6d17c2266ff8affc09
6561 F20101108_AAAUEQ liu_j_Page_053thm.jpg
d9950555f686fd0189cca67bb8047f9a
0de3ca2da91fb374dbfa584e2358e741c478f8ce
6089 F20101108_AAAUFF liu_j_Page_028thm.jpg
834b5aad83ca79ee9ec8902c47d14235
fb64436909f51a8367f8d3aac141c6e5aaaf7df4
F20101108_AAATZL liu_j_Page_114.tif
b509e44bcb534a0ea1be7ac3c511e59f
25126e40551d9009a994dee0560a9ebc1ed11010
F20101108_AAATYW liu_j_Page_085.tif
9ab9f9be00a3a7c6dc0120dbdc4c7c84
17e988da47cb52ae33ebe1bbe49791547f5c1ab6
F20101108_AAATCD liu_j_Page_017.tif
21b9b9b7ee3f64eeb8179eb3434e7692
ced0425db44c07fb4e062ea95e9bfcb72dd3a074
1582 F20101108_AAATBP liu_j_Page_104.txt
5992b72a01182666db07da2a614f6c89
e57882059a67a8ed28646e8d910180cd8cf4508d
25038 F20101108_AAAUER liu_j_Page_032.QC.jpg
3e722efabbd5178602ae17c9ccd39f53
cbb8ef52d8d29b6f2692d4844ed803aba2a1b8df
24203 F20101108_AAAUFG liu_j_Page_073.QC.jpg
ead4cdca09e2a99b984f7d4f73887865
8660786a3d0c759d3a48d43ddccd7c7d6b1dde50
F20101108_AAATZM liu_j_Page_115.tif
be53ca0b116b34010e99527c251f96f9
1bdca7808bd145bba8c5c8e22e5702d5f96a88d1
F20101108_AAATYX liu_j_Page_088.tif
ec3a61476193cc955ae4723aeed3c3a9
5a7061185d69d86016f2e9f319285d2badd41e6f
1918 F20101108_AAATCE liu_j_Page_017thm.jpg
75117de0dd89be2c6140e67679bac99d
445852a6e858f88034808f55906baa0331235c04
55097 F20101108_AAATBQ liu_j_Page_105.pro
3daa030b9c16cbb966d4a31eb41645a7
c0a7abab5d39135359375e46922a50227055886f
5898 F20101108_AAAUES liu_j_Page_038thm.jpg
66bcf6c8226792ffe1ee6fe2274b3db8
d1c17c9d2289874a7cc51887acf81831aad381d9
6523 F20101108_AAAUFH liu_j_Page_063thm.jpg
1c14b18c1ed84627bd022f027e1a2953
b309e4ae139e21fc8d3996eb13b34283d7afd91f
F20101108_AAATZN liu_j_Page_117.tif
db6bfc81d55bd7960e3d92a6d1431fbb
d97e100ca42c2bca1070d84d1c0ea150c7fca728
F20101108_AAATYY liu_j_Page_089.tif
8cc50952d46fcad655299c35bd6f91ee
8e8190341cf23741f6119ab4693ca837a19fef05
25575 F20101108_AAATCF liu_j_Page_107.QC.jpg
26aeb93c3acff2a2d7ecad8e151990b8
3635913f079fe12f07998b0be929db3032258297
36298 F20101108_AAATBR liu_j_Page_082.pro
43f1bd2689865bcd8736f07df271d375
3474c4eb0645d9809b373d1260310b3d22f5891a
5490 F20101108_AAAUET liu_j_Page_075thm.jpg
d2ece4b64d48c1ecb65a9f5e697ae3e5
d3a503a701684f6013f2279d97b6fbeb6a874136
5982 F20101108_AAAUFI liu_j_Page_014thm.jpg
647d2017409150e8d2bd204aee0d9cc6
65939e1df6daff1be941e78d5c00b00c322c122b
F20101108_AAATZO liu_j_Page_118.tif
995c489d1d9b0f7f60287aa011ea7615
5639d73e22da528be963e9860d137714e1c208a6
F20101108_AAATYZ liu_j_Page_090.tif
334e4a09fb8b6df5e70b2e0b5d3118ae
e1fa46d2c20ad5c04320c4b9d79d82a1a1898a15
103069 F20101108_AAATCG liu_j_Page_068.jp2
2239cae0cae78d33e947fee59201745d
b041d4a07bfa9ecb579abb2a12fe36b10ec5c847
754541 F20101108_AAATBS liu_j_Page_121.jp2
4922b908eb818a2623133bf3c0d9a2e0
dfff756fcfe0ba640189efbd9ec49e357889e307
6456 F20101108_AAAUEU liu_j_Page_049thm.jpg
03a5b3c4a4ec2de31b892bb99117471b
4dc7d514d8d79397d7404c14c92f91315ccc1b38
6614 F20101108_AAAUFJ liu_j_Page_116thm.jpg
b8e175e8b1d0e79fff39a0124f984bae
5ac9b8312860df7f40a50829bb71a28f126063ea
F20101108_AAATZP liu_j_Page_119.tif
cf0c8e945ee2b23cba71ebfbb5b253c3
15a2b456ded8095a1be49ed2a7a8ee3ee0290b96
55239 F20101108_AAATCH liu_j_Page_120.pro
3a06e731011ccce27453c583a5e3b831
f1f7b2f14daed6e9eeb432e1ad91f3e6ec294f7b
56756 F20101108_AAATBT liu_j_Page_016.pro
9d5077630fdc223326ce2cc899ef4df3
a8f438327720f61bec47be830416dd4d88ce3067
19443 F20101108_AAAUEV liu_j_Page_076.QC.jpg
2c82cce361e0aa5bf25ed8d7c879db19
f4ae78b348af8205228393e72371034ceb540afb
15641 F20101108_AAAUFK liu_j_Page_007.QC.jpg
6a1c24e029e4c1b80dfc132a84ff00d9
4ae3c6033ca0e69a469f2bfcfc3b730b92c7b401
6805 F20101108_AAAUEW liu_j_Page_034thm.jpg
2a6fd0b1cd2a5c40df8f338b531f66b2
5633c13d3413d5bfc0f2a407a6f69251125c374b
F20101108_AAATZQ liu_j_Page_121.tif
82b17054dd2a68e8999bc28d447cf59a
986c1defd4435710d75344e7f601d9b047c1cc17
111319 F20101108_AAATCI liu_j_Page_113.jp2
3af6f7d6ff1bcd455af07ed792518359
3bad0aa5f38804b58026b7f20e9c6711a7d2b77e
27284 F20101108_AAATBU liu_j_Page_098.QC.jpg
1be73e63abbe3294717767570f700a73
b392f42592a2a62cfd2a73aeff9ce712cef4c5b5
6644 F20101108_AAAUFL liu_j_Page_057thm.jpg
afde1fda9c74dafd1a4d1e9d6a3dc0f0
fbad70f5ccdbcebefec77df0699a7b64dedfbb73
27559 F20101108_AAAUEX liu_j_Page_026.QC.jpg
82cbdbafd2a2e8f191feac01ac4c15e3
0f8327709c02b23c4e4b12bc23b27636298db53f
F20101108_AAATZR liu_j_Page_125.tif
96ce5983523d683f4e7e48ad16ae0c02
525612cbe3e96e2b79ed5dc1d45576f6624dcfbb
F20101108_AAATCJ liu_j_Page_113.tif
6d66fc3f7f0661bc29a180004968a7ae
4a33e6a14140dd8b9965bc941972dcd4d2a443fb
22930 F20101108_AAATBV liu_j_Page_006.QC.jpg
ab5535d4595b6337c34f1b24dd294afd
3d027a1bd3e32cf1d5ef831b0d1e29474b229134
27545 F20101108_AAAUGA liu_j_Page_060.QC.jpg
0a9ab7b1e2a32d4d4ec64848303c3929
d2076f887abb577362261e1f2ae3137fd75bcc28
26412 F20101108_AAAUFM liu_j_Page_105.QC.jpg
6760c05d7575727e4473ba0df05fd5bd
7a75fb770f0d95aaab4fc41e207e3954092a7c85
7041 F20101108_AAAUEY liu_j_Page_044thm.jpg
cc1265877b0309c4e62f6f7a9390f8c1
be24381fd01f158e3a6c9af13a4b87d8e93a8fa9
1710 F20101108_AAATCK liu_j_Page_110.txt
46c66eaeac4c893902475a3159786ef3
2e1733f0426364999866a7c42fd1da618ae16e20
5823 F20101108_AAATBW liu_j_Page_023thm.jpg
1d78db6c9b6a23adab15546e9b02097e
a09d2504d275d4cde3132dea0ebf19c8a04e0bb4
6856 F20101108_AAAUGB liu_j_Page_124thm.jpg
404dcfa70a523a3a0aa3baef0feb245c
47e9829e8ad1806829a0111e998cf7ca48c86829
6754 F20101108_AAAUFN liu_j_Page_081thm.jpg
a6632156799b3421d38a6e97527df9d5
b1c358a8215e08b2ad7a5e09f11d2ac74821b816
22893 F20101108_AAAUEZ liu_j_Page_103.QC.jpg
ebc1be786ef30bc401be743ca2d53a02
430d773ea330682cbc2917b3d0125e03befe635a
F20101108_AAATZS liu_j_Page_126.tif
d672e9371964a76cd370f7d49422f88d
c0d48b855a1566bd585d8de4f16592c15d0861f1
33231 F20101108_AAATCL liu_j_Page_047.jp2
035517f9d0ce44a38b05023085587529
78cd269ea32efdea87cda5b50de8c9266738a08e
5565 F20101108_AAATBX liu_j_Page_118thm.jpg
1d3f967067dcf6f61efcca1e992c5978
fd9f64f7bd66663d95a62270d2958fb7e4a1a506
26023 F20101108_AAAUGC liu_j_Page_081.QC.jpg
6a4a536a013f7e11f042f7965c6ca744
e34b7948619594014cd0fdd37f31dd4f9b70aed6
28353 F20101108_AAAUFO liu_j_Page_030.QC.jpg
2dcb601713c862dbb5b2704aa278195f
770b63d7ba2b3d358067fc2de7f9d51fbebbc99e
7127 F20101108_AAATZT liu_j_Page_001.pro
b38c3dcd28ad6e5e1f9f6c1870598928
684765be499deb184e04cf2af9d53e77c4251312
6420 F20101108_AAATDA liu_j_Page_032thm.jpg
b97400b99abefcfdab0c0e2adcb54887
db325352685ed5b61cfef2bf071d0cb1a24e94ff
19008 F20101108_AAATCM liu_j_Page_040.QC.jpg
0d17abbdbf991e753e35af1096d7a087
430406a431417a94b8392ab831978f6e2e898b70
20413 F20101108_AAATJZ liu_j_Page_091.QC.jpg
ed517b39de1b7658bb13cd1c40c5cc4f
d558729a839769661976ba3068c6be08d379dd05
782071 F20101108_AAATLC liu_j_Page_070.jp2
d01bfff90949df25aa3bbffc021ac207
48736cf59b8848a0ddc6c443807e6aace41269e0
21105 F20101108_AAATKO liu_j_Page_118.QC.jpg
975cc1582313eea63edd7382e7dbf109
8976245c3dfd33382808fd3a351f80cf9b2a75cf
4298 F20101108_AAATLD liu_j_Page_043thm.jpg
225859af7439753ed6ef7182c587b8cd
f8b78ec58ac33d46ddf3050b6bd8b3742ad1c6c5
F20101108_AAATKP liu_j_Page_026.tif
d69e0f052126cd9a7ce22cd01233c28c
b49bc25fe5cebf87d674b10f82e9eac7cab9deea
1625 F20101108_AAATKQ liu_j_Page_118.txt
0f22d801044954618cc0f7efae541576
f0eb5ccbbd5a4aefdbdc69d4baa01d7f92e1ed1a
F20101108_AAATLE liu_j_Page_037.tif
12dfbd3d6e3c06a775a91693aea33a66
8f339d1c8044b4579453f069a0aedb92ea1e6909
2268 F20101108_AAATKR liu_j_Page_010.txt
958e2f413f3e468d36137aa2a90025b1
c58a9187422abca4672c7ec1b4f1de2f14c9b322
53773 F20101108_AAATLF liu_j_Page_015.pro
ef96cf67e9e996a8cd297207881b1f3b
153e67ccf256e01b4fa24871609117a91eaef276
53018 F20101108_AAATKS liu_j_Page_121.jpg
90be0148162a69cfd195edabe86ebe6f
5c853796bdd4262f05bde976e5786ef88a18d5bd
5241 F20101108_AAATLG liu_j_Page_117thm.jpg
9cdd7e2a66bb8f37c47fb57b65fdec2c
a2cdea96e1b92b780a014dcde87f81eafa7bb168
F20101108_AAATKT liu_j_Page_105.tif
755f067bc5a96f9f45975416d3427afb
1e0455a8f6ec01fef140d5d836318308d23a5e36
92041 F20101108_AAATLH liu_j_Page_076.jp2
1a2e8f42fd35f38cd0cb71ea69ea0644
5d791dd3bc279a0ebc057c5dc7035ee25a67f098
1518 F20101108_AAATKU liu_j_Page_082.txt
890b4f19d06604fb04db1d5c3b0fd613
df475da2d07ad429b10e114d1564562ccd4fd0a4
52361 F20101108_AAATLI liu_j_Page_113.pro
b905f30dbf769e610d27dc765ebcec5c
093c4c0c6b3892e7154fc9df41380e84141aa0c2
F20101108_AAATKV liu_j_Page_010.tif
9bce08c54a65fc5031be23a12fda8dc6
c0b32c6290f22b23863c645d0efe1e2c4f7f6b6c
83343 F20101108_AAATLJ liu_j_Page_089.jpg
d8520ef8ec2a58282c7a0ce075e9dca2
07b2b7e9c1ffbdcaa5b10c204f2acbb0891295a9
25424 F20101108_AAATKW liu_j_Page_016.QC.jpg
0478d7650c75e17b86d210c01b68bfc8
959466cd2e2e26076b2c174e248bbecb3d8ca40c
26989 F20101108_AAATLK liu_j_Page_084.QC.jpg
5386c0037cbc2ba11315d16ddecf0d33
62e15023a82560d82e2ed81afad95d8fff621df4
91774 F20101108_AAATKX liu_j_Page_078.jpg
08b7d82f70ef577e632c3155369357ac
683d93b0c9045c950733ead5d4de0753fe3ec77d
6654 F20101108_AAATMA liu_j_Page_017.pro
86e3a8ff4ca23d64eef923c956729380
d92b71fdb7bbba5ff9789ea4531a701450c8a719
F20101108_AAATLL liu_j_Page_044.jp2
856271305b0d4215815b834c1fc58c6d
0c7796dee9aa5e36c0faae442cf48df9f2c553bb
5778 F20101108_AAATKY liu_j_Page_008thm.jpg
8a142967bf5973d17cc44632ac246eaf
36b35d1231e241d300057ba5f775be5ea5fe3681
7282 F20101108_AAATMB liu_j_Page_086thm.jpg
ae1b2c6a2e3d490d9fe9aa270cc19962
73156f7e6773f35fbd40dd3436fd0c1e7e204936
844443 F20101108_AAATLM liu_j_Page_064.jp2
5290605906edc288871b1297a9bdd28d
2a2851f826800be183514324a1d645622711e0df
1101 F20101108_AAATKZ liu_j_Page_109.txt
4feb063a93dc4f6f37f8427e7aa04913
58333b4a991d649aae0470ba5ed4afe41fdc925f
3381 F20101108_AAATMC liu_j_Page_084.txt
63ad8d3fc01173bb9acf88a451e2b545
62f87b2ce9b2715f69988b9c8859c4dc7ef0fec7
6329 F20101108_AAATLN liu_j_Page_128.QC.jpg
8d7e52364f94dd73469a86d230098444
59197847873df90999fca0ca49ed4940f8a7160e
38639 F20101108_AAATMD liu_j_Page_052.pro
fd3c041d3afc23f3333b5384e650e3a7
1d6fe6e057461250a0792f741f3745c1193bf2c2
2245 F20101108_AAATLO liu_j_Page_015.txt
e394e01e4c37ff2b4b575098933a8608
08f910413c55ff139081c83a86de85f805421973
2039 F20101108_AAATME liu_j_Page_049.txt
cc1a08532260f597444853f277e7c3ec
bab1fd634cac0b8accce64faa4442bc344118ba5
F20101108_AAATLP liu_j_Page_117.jp2
4286b7e0c59f89790dffa51e15a29289
e8ccbb74a0749f46fb8112980784384244ed899e
25314 F20101108_AAATLQ liu_j_Page_057.QC.jpg
aa10939b517079733cef8a92605c559b
651483b64b419323542957b8dbaefb8899409444
77482 F20101108_AAATMF liu_j_Page_096.jpg
7a8bca90d8758140be76750e1bd1ecbc
69c2a7ac15f88dbfc8e6dcd9a3f713b1c35eeb59
6245 F20101108_AAATLR liu_j_Page_096thm.jpg
73526dded3a054f5146421ff65faf647
c583d47860d54782d7f68151732029133cdfa224
1340 F20101108_AAATMG liu_j_Page_027.txt
ce7729d939db29631174904b93139038
70b54c01a1df68ee116aeed48471a1a7cb52244f
5000 F20101108_AAATLS liu_j_Page_005thm.jpg
ed0b06059731b6d6e671bf60a1166c29
5cbb477fbd8b5a102adae909dfed6d6f72f99eea
F20101108_AAATMH liu_j_Page_095.jp2
36817739330471cec5cd1de4f0aae405
830b3af2cfd0dc8c20ed438734e02fce4a529e7c
5894 F20101108_AAATLT liu_j_Page_103thm.jpg
b1f719e2b38a9346cbaf96adb6ebe585
4b918381d1e7ddde5533df33d28f7fb8f4b9bfd3
1005755 F20101108_AAATMI liu_j_Page_054.jp2
b7d7ec14e1483858aceb05f77728621e
d28999866f7dad51cef74da6c0d28ec86fa59820
25454 F20101108_AAATLU liu_j_Page_047.jpg
f86ad126a9763618988aa2aa417e40e0
5e6218c3765eae1b955369f962e7d59361cd5022
90984 F20101108_AAATMJ liu_j_Page_044.jpg
056549a6fe0519602b79ed115a60f883
657fcc0be1568b6eb1675ab579c384f0c484dff2
49930 F20101108_AAATLV liu_j_Page_073.pro
e7b41dd2494fa902919a2ea9a87060ee
f633fb2ce0dbeb7c058a691430ef4f706ab4aa7e
3157 F20101108_AAATMK liu_j_Page_126.txt
db8289231611e819ca3b52c1b574c566
9c2d6f8eeb13e18ceb67c9eba08cbfb9701e80ef
54762 F20101108_AAATLW liu_j_Page_101.pro
5f87701098442075ab24c3a277c0b8d1
b25ce4b8444d3f093888a5e28ccc9de15583c3a2
116 F20101108_AAATNA liu_j_Page_003.txt
d123d87e43bc8982352a52529c20bf0e
474812243b5d9b5f4add24f0d2fa47cab4edeb2b
F20101108_AAATML liu_j_Page_120.tif
1fe266dc674da66203abdf70f46d9910
4437260cdbb64454c1dce2f1f05e948f2db188a3
F20101108_AAATLX liu_j_Page_036.jp2
c9434df0c2afba5c3268dfc7f89a7280
0a106e13173f776461cad34faf62e6cdbbe8fd01
10436 F20101108_AAATNB liu_j_Page_004.QC.jpg
e60e197c47c3513b8897334ffd2d4ae4
7cd4ef21573d8f2465369d2f9c4434eb3bdf5e97
49486 F20101108_AAATMM liu_j_Page_100.pro
32d17916edc6cdfab50757b07fd932cb
de4e53437140307dc7c119bfc6cc84124dc744e2
1051984 F20101108_AAATLY liu_j_Page_105.jp2
5285eeceb1eaab55296199abcbc2a2af
8b1104efebaecaaa72378bd1f36fad1ff742352f
F20101108_AAATNC liu_j_Page_110.tif
54e7fbfbed7c140634decb25247f387b
0e1fb81ee415b35f82c847859a93a5b77420ad65
23527 F20101108_AAATMN liu_j_Page_099.QC.jpg
bd5e7944c5cba130012daf9fbda7eed9
63a5c3179c1e032688437edf05f5bc15d68763cc
26562 F20101108_AAATLZ liu_j_Page_116.QC.jpg
17ca22d16528e86f20f0d43fd84c54fc
2ee41642c6d4afa5010508291d3fb48de33c26e4
83046 F20101108_AAATND liu_j_Page_094.jpg
b29ae31dcc2504ffe775a1cb5241c894
154fd0ee774933834aff9cede8299d1045166d66
1051962 F20101108_AAATMO liu_j_Page_049.jp2
2baea375839363ee07fe80e095274d92
85eb7dbd3979947e3738f0a8c0ec78fc11734f51
F20101108_AAATNE liu_j_Page_005.jp2
0ace6cba18674850f89b02eb3591cf65
e8af9d5ed88a5035d84839b9086f261f303d2b64
5965 F20101108_AAATMP liu_j_Page_093thm.jpg
c1393d8cf48f59585161f32f380fc3f8
42fc83b03a25daac5e5c783ab4fd60c3c9f2d59b
63809 F20101108_AAATNF liu_j_Page_117.jpg
b22b50b19a7da811b24e54635c51b32b
7e7f4cc5b3a5f4aabe9d8548579622b7800b5f0e
1614671 F20101108_AAATMQ liu_j.pdf
3ce9e31fc90eb0342e881ac8e0a23cd9
79cf6f14b8ef222e4e850cb5083246190cd41e0a
152164 F20101108_AAATMR liu_j_Page_122.jp2
706454d1bfeef30e9a76b24492539ed5
dfba18efe6ca7678757e064615d8b4501d0cfa85
F20101108_AAATNG liu_j_Page_054.tif
eb925d87004253ae8aa00acbc4267b71
5027fb98f1d6642f6e925ce37e8fdc5fd72de177
2336 F20101108_AAATMS liu_j_Page_092thm.jpg
89441f2ae0a7f9559afe355fa9f5ad31
36866039a0d34d60729e4ad8d08060f72aca4a70
25583 F20101108_AAATNH liu_j_Page_022.QC.jpg
5bbcbb62469dc8e6fde815eb8963f31e
305d2af2a9e410320325746a2d079d6be1438405
50745 F20101108_AAATMT liu_j_Page_114.jp2
9e489a761ebe11592bf3c5e3197ff701
5fc80f29f03002e39040f6323c0b4fddedfebdf4
26161 F20101108_AAATNI liu_j_Page_055.QC.jpg
db047f09ffff7feb5c908927f0501e55
5e5e05953a934b33a664195e9bad4cc7245149f4
54420 F20101108_AAATMU liu_j_Page_019.pro
38924dfeba041df7f882eaaece11ec42
7f21a09d0755baae8b7a00041504b890808b31e6
21334 F20101108_AAATNJ liu_j_Page_061.QC.jpg
c7591d063961724a155f2fc1c88f92bf
ccb36dddc813c69b85d9c1dcf94beaa973f1cd82
6473 F20101108_AAATMV liu_j_Page_126thm.jpg
eafd1bd822b2cfccaef6f45429dd757f
badfde73b5c0bb25d8557019b4f9eddc082d3b9e
16255 F20101108_AAATNK liu_j_Page_043.pro
3ff2852faafe4ad401c07da695930461
9d69b4706dec6a4918fac60b88b6a5dbad55e108
80896 F20101108_AAATMW liu_j_Page_074.jpg
751d014f3302b06495f11e21b77f39ab
dcc8f8e84bf0fcef11a23059d5c3e38769666eec
57056 F20101108_AAATNL liu_j_Page_060.pro
229419411dbf1c6db7d228f5e701342d
f9c9e032f665ea4b368d7ec2abc2f0d06323cfb4
53710 F20101108_AAATMX liu_j_Page_053.pro
50bc23690d96d2b5755291b8426e2b90
4137b46dffbef00b35061066bfa59c442a710cef
F20101108_AAATOA liu_j_Page_098.tif
6560dc8679da8e3ba7a0dd5572b723be
25932ea1fad190e5e4c6c9ab05c3b2ab352b509f
84941 F20101108_AAATNM liu_j_Page_058.jpg
449bba830a2c84fe0e0ae8b749bb3141
e4e4ea0735d50c1f9d7e5a80461077937729a2e1
269 F20101108_AAATMY liu_j_Page_017.txt
ebf4da5c130a6e07fe954a1e758f7706
ac08bb45f07c5c2a8249b84e445eab1636678ba2
53515 F20101108_AAATOB liu_j_Page_087.pro
6e1f4ebdd8274d7b8208971d7883071a
9040760ac70bf5e7e0a1360f1e403eb729f776d2
61019 F20101108_AAATNN liu_j_Page_050.jpg
673fba158dcb0ae4130cc6770275cf62
0fc4e9aa80b8a7a6fae847cbc8d2786ede908564
397 F20101108_AAATMZ liu_j_Page_128.txt
009889b7ffc0541f0f6e25a3804fe645
d1e3909c75911e58bffa0187a314ff07bb70421b
2261 F20101108_AAATOC liu_j_Page_033.txt
2bbfa1b38e432cf34d6c2dd51e373312
5aac38a2a1bf09a4478fc4f8c25c4637406b68e1
87101 F20101108_AAATNO liu_j_Page_060.jpg
e86aed7262efa239991bf665d4af3794
f25bda57d167aa20cd0c07c251fc5210ae419f8a
F20101108_AAATOD liu_j_Page_099.txt
ab31910b15e462e353cd95ed2c0b7538
d9ece0140551fca967e5e05ca131641aa40e5604
27910 F20101108_AAATNP liu_j_Page_066.QC.jpg
f2992cf7618fb63257e21a721cb43bf3
6f52512d45733b0ba72d9ed0218cf01ec1dd7b0e
36083 F20101108_AAATOE liu_j_Page_117.pro
281b30118044d7238fe489afd77ec906
169c2f513fccdf00b140875d53bc789d56b03e9c
F20101108_AAATNQ liu_j_Page_044.tif
24e26df7f837cc2274ff9b25f7a0fe87
7cc5b438fe289029c9143840f91edbd36541258d
53125 F20101108_AAATOF liu_j_Page_008.pro
7d32440497741122c3d94725ef4175e4
95f8b111724db00ff48b959cd21f780056958e0a
2194 F20101108_AAATNR liu_j_Page_019.txt
11cabfa4b3b78f90e620e835c5b535e4
b5d1cfe4734a1f3ece3ff87b09677abb0a2b6d50
5355 F20101108_AAATOG liu_j_Page_011thm.jpg
53efd0c0eba2dcb9fa62aaa8ddd7c62c
ff46ddc614a7578984b353b48ef518cf1bcd1650
28903 F20101108_AAATNS liu_j_Page_092.jp2
1456475efce52c3fd8461ce6f3c289ca
9bf54b0d1e21d2317b4a3a968099a9ceb017aeb3
22834 F20101108_AAATNT liu_j_Page_120.QC.jpg
dce3226620b0aed3d6329198d6429f63
69e40bc05a00cba9443c237cabbc427c7ab943f8
6508 F20101108_AAATOH liu_j_Page_036thm.jpg
6ff4d577429e659a3bd7e07dcd0020d4
685fbe9dba69698d2f64ac166be1cd389be05c06
69974 F20101108_AAATNU liu_j_Page_118.jpg
18dd10e9e142aebf33094c037955716d
3bc96cc0e403ca191b2a57151332be4a17d69e4a
F20101108_AAATOI liu_j_Page_018.jp2
5a65843d4be67a3476a76dfb52a41e99
94eff7c3fb6890babd172b7732b713db0bfde61c
2224 F20101108_AAATNV liu_j_Page_037.txt
7a831df3ef78f70bfafadb5666ffb9ce
6759b158ddf4bf0b71fd6bde6f211d8e84369ed1
24215 F20101108_AAATOJ liu_j_Page_108.QC.jpg
7c519dd089b36481686eba76c2afe489
2400dcc3bc84ab62c0195de7f18a543672668a8b
68108 F20101108_AAATNW liu_j_Page_088.jpg
2d8763230fc0651b466c37b69ca0146b
4ffb97f11c677056aa9d2e17dd1f7eb079b0a174
26056 F20101108_AAATOK liu_j_Page_053.QC.jpg
015d1c73848d10378c6a015dcf5bd3f4
4e917c93550d335d8d6a058f1fe5291e48d62c60
57680 F20101108_AAATNX liu_j_Page_042.pro
c15561cf569228eee0dedb606f52a8d4
260a47df740072c2e7285add65f1cba042986678
F20101108_AAATPA liu_j_Page_020.tif
0ea72393c86006826c17803468dcdfe1
74c7761fd9a47f13557b72d1821f6278e22039ca
86516 F20101108_AAATOL liu_j_Page_063.jpg
74a5b2b286026df44a634810f23e963d
79f6c15722a5232d6426c3801520b7f6502b9e54
98846 F20101108_AAATPB liu_j_Page_091.jp2
5b3d8308263f0c2a48be5de4eda1bb79
f91090d0a48d3489ebec7f75fd35dae544f78b8b
F20101108_AAATOM liu_j_Page_116.tif
e3be52941fde85d284bb3f893062db0e
f3b376d35f29d3ec03cede11179c9b310d8a70fb
F20101108_AAATNY liu_j_Page_112.jp2
e23613154c7a91b6a8e26b18e83ff718
c785d9bc4bcd4a030739132e768abc0f7bba44ec
5986 F20101108_AAATPC liu_j_Page_120thm.jpg
b118ea4ce495792c9454964a113299cc
18af40946f4e13dff2a641341b45088015026e38
6829 F20101108_AAATON liu_j_Page_060thm.jpg
1f0dddfa633386b51be7d920f3e6506c
6858859b02b41feb7402671cb1b24234d920f0ab
F20101108_AAATNZ liu_j_Page_081.jp2
426db98f364a251a1d8c101a8923a84b
eb47d64d16dcf50ea9ff630040f51ea559018fb9
1678 F20101108_AAATPD liu_j_Page_069.txt
d3dd0deba2d2860e5284f6faf539737d
2f4daa775a08a04608330c83ecb09000ba113adf
97594 F20101108_AAATOO liu_j_Page_123.jpg
b572757e457b42a66b068fca1f88641b
069a289cec679b2a5638d591e8934a78347976fe







MINING COMPARATIVE GENOMIC HYBRIDIZATION DATA


By
JUN LIU



















A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008


































© 2008 Jun Liu




































To my parents, my sister, and my little niece









ACKNOWLEDGMENTS

I would like to thank my advisers, Dr. Sanjay Ranka and Dr. Tamer Kahveci, for

their long-term instructions on my research and this dissertation. I would also like to

thank Dr. Michael Baudis for providing the CGH datasets and several discussions on

analyzing CGH data. I would like to thank Jaaved Mohammed and James Carter for

their help in implementing and testing the clustering algorithms, and R ii I Appu-i-- all

for helping develop an initial web-based interface. I would like to thank my committee

members (Dr. Christopher M. Jermaine, Dr. Alin Dobra and Dr. Ravindra K. Ahuja) for

their guidance and support. Finally, I would like to thank my parents and my friends for

their continuous support.











TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

1.1 Comparative Genomic Hybridization Data
1.2 Analysis of CGH Data
1.3 Contribution of This Thesis

2 RELATED WORK

2.1 Structural Analysis of Single Comparative Genomic Hybridization (CGH) Array
2.2 Clustering and Marker Detection of CGH Data
2.3 Classification and Feature Selection of CGH Data
2.4 Inferring Progression Models for CGH Data
2.5 Software for Analyzing CGH Data

3 PAIRWISE DISTANCE-BASED CLUSTERING

3.1 Method
3.1.1 Comparison of Two Samples
3.1.1.1 Raw distance
3.1.1.2 Segment-based similarity
3.1.1.3 Segment-based cosine similarity
3.1.2 Clustering of Samples
3.1.2.1 k-means clustering
3.1.2.2 Complete link bottom-up clustering
3.1.2.3 Top-down clustering
3.1.3 Further Optimization on Clustering
3.1.3.1 Combining k-means with bottom-up or top-down methods
3.1.3.2 Centroid shrinking
3.2 Results
3.2.1 Quality Analysis Measures
3.2.2 Experimental Evaluation
3.3 Conclusion

4 IMPROVE CLUSTERING USING MARKERS

4.1 Detection of Markers
4.2 Prototype Based Clustering
4.3 Pairwise Similarity Based Clustering
4.4 Experimental Results
4.4.1 Quality of Clusters
4.4.2 Quality of Markers
4.4.3 Evaluation
4.5 Conclusion

5 CLASSIFICATION AND FEATURE SELECTION

5.1 Classification with SVM
5.2 Maximum Influence Feature Selection for Two Classes
5.3 Maximum Influence Feature Selection for Multiple Classes
5.4 Datasets
5.5 Experimental Results
5.5.1 Comparison of Linear and Raw Kernel
5.5.2 Comparison of MIFS and Other Methods
5.5.3 Consistency of Selected Features
5.6 Conclusions

6 INFERRING PROGRESSION MODELS

6.1 Preliminary
6.1.1 Marker Detection
6.1.2 Tumor Progression Model
6.1.3 Tree Fitting Problem
6.2 Progression Model for markers
6.3 Progression Model for cancers
6.4 Experimental Results
6.4.1 Results for Marker Models
6.4.2 Results for Phylogenetic Models
6.5 Conclusions

7 A WEB SERVER FOR MINING CGH DATA

7.1 Software Environment
7.2 Example: Distance-Based Clustering of Sample Dataset
7.3 Conclusion

8 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH










LIST OF TABLES


Table

3-1 Detailed specification of Progenetix dataset

3-2 Highest value of external measures for different distance/similarity measure

3-3 Comparison of average quality and running time of top-down methods with global
and local refinement

4-1 Coverage measure of three clustering methods applied over three datasets

4-2 The NMI values of three clustering methods applied over three datasets

4-3 Error bar results of three clustering methods over three datasets

5-1 Detailed specifications of benchmark datasets

5-2 Comparison of classification accuracies for three feature selection methods on
multi-class datasets

5-3 Comparison of classification accuracy for three feature selection methods on two-class
datasets

5-4 Comparison of classification accuracy using different number of features

5-5 Comparison of PMM scores of three feature selection methods

6-1 Name and number of cases of each cancer in the dataset










LIST OF FIGURES


Figure

1-1 Overview of CGH technique

1-2 Raw and normalized (smoothed) CGH data

1-3 Plot of a CGH dataset

3-1 Example of Raw distance

3-2 Example of Sim measure

3-3 Example of

3-4 Evaluation of cluster qualities using (A) NMI and (B) F1-measure for different
clustering methods

3-5 Cluster qualities of different clustering methods with Sim measure over the entire
dataset. The cluster qualities are evaluated using NMI.

4-1 Two CGH samples X and Y with the values of genomic intervals listed in the
order of positions. The segments are underlined.

4-2 The CGH data plot of cancer type, Retinoblastoma, NOS (ICD-O 9510/3), with
120 samples and each sample containing 862 genomic intervals

4-3 Comparison of GMS values of markers in clusters from two clustering approaches

5-1 Plot of 120 CGH cases belonging to Retinoblastoma, NOS (ICD-O 9510/3)

5-2 Working example of dataset re-sampler

5-3 Comparison of classification accuracies of SVM with linear and Raw kernels

6-1 Examples of Venn diagram and corresponding graph model

6-2 Phylogenetic trees of 20 cancers based on weighted markers

6-3 Phylogenetic trees of 20 cancers based on unweighted markers

6-4 A fraction of the phylogenetic tree of sub-clusters of 20 cancers

7-1 Snapshot of the input interface of distance-based clustering

7-2 Snapshot of the results of distance-based clustering

7-3 Snapshot of the plot of clusters









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

MINING COMPARATIVE GENOMIC HYBRIDIZATION DATA

By

Jun Liu

May 2008

Chair: Sanjay Ranka, Tamer Kahveci
Major: Computer Engineering

Numerical and structural chromosomal imbalances are one of the most prominent
features of neoplastic cells. Thousands of (molecular-) cytogenetic studies of human
neoplasias have searched for insights into genetic mechanisms of tumor development and
the detection of targets for pharmacologic intervention. It is assumed that repetitive
chromosomal aberration patterns reflect the supposed cooperation of a multitude of tumor
relevant genes in most malignant diseases.

One method for measuring genomic aberrations is Comparative Genomic Hybridization
(CGH). CGH is a molecular-cytogenetic analysis method for detecting regions with
genomic imbalances (gains or losses of DNA segments). CGH data of an individual tumor
can be considered as an ordered list of discrete values, where each value corresponds to
a single chromosomal band and denotes one of three aberration statuses (gain, loss and
no change). Along with the high dimensionality (around 1000), a key feature of the CGH
data is that consecutive values are highly correlated.

In this research, we have developed novel data mining methods to exploit these
characteristics. We have developed novel algorithms for feature selection, clustering and
classification of CGH data sets consisting of samples from multiple cancer types. We have
also developed novel methods and models for understanding the progression of cancer.
Experimental results on real CGH datasets show the benefits of our methods as compared
to existing methods in the literature.









CHAPTER 1
INTRODUCTION

Numerical and spatial chromosomal imbalances are one of the most prominent and
pathogenetically relevant features of neoplastic cells [18]. Over the last decades, thousands
of (molecular-) cytogenetic studies of human neoplasia have led to important insights
into the genetic mechanisms of tumor development, revealing cancer to be a disease
involving dynamic changes in the genome. The foundation was laid by the discovery of
aberrations that produce oncogenes with dominant gain of function and tumor suppressor
genes with recessive loss of function [28]. In a healthy cell, each chromosomal region has
two copies of its DNA. Deviations from this normal level are called Copy Number
Alterations (CNAs). Both classes of cancer genes, tumor suppressor genes and oncogenes,
have been identified through DNA copy number alterations in human and animal cancer
cells [36]. Detecting these aberrations and interpreting them in the context of broader
knowledge facilitates the identification of crucial genes and pathways involved in biological
processes and disease. The repetitive chromosomal aberration patterns reflect the
supposed cooperation of a multitude of tumor relevant genes [86] in most malignant
diseases. A systematic analysis of these patterns for oncogenomic pathway description
requires the large-scale compilation of (molecular-) cytogenetic tumor data as well as the
development of tools for transforming those data into a format suitable for data mining
purposes.

1.1 Comparative Genomic Hybridization Data

Comparative Genomic Hybridization (CGH) is the first efficient approach to scanning
the entire genome for variations in DNA copy number [59]. The main advantage of
CGH is that the DNA copy numbers for the entire genome can be measured in a
single experiment [70]. CGH on DNA microarrays is a molecular-cytogenetic analysis
method for simultaneously detecting thousands of genes with genomic imbalances (gains
or losses of DNA segments) [36]. In this technique, total genomic DNA is isolated from
test and reference cell populations, differentially labeled and hybridized to metaphase
chromosomes or, more recently, DNA microarrays. The relative hybridization intensity
of the test and reference signals at a given location is then (ideally) proportional to the
relative copy number of those sequences in the test and reference genomes. If the reference
genome is normal, then increases and decreases in the intensity ratio directly indicate
DNA copy number variation in the genome of test cells (Figure 1-1).

Figure 1-1. Overview of CGH technique. Genomic DNA from two cell populations is
differentially labeled and hybridized to a microarray. The fluorescent ratios on
each array spot are calculated and normalized so that the median log2 ratio is
0. Plotting of the data for a chromosome from pter to qter shows that most
elements have a ratio near 0. The two elements nearest pter have a ratio near -1,
indicating a reduction by a factor of two in copy number. This figure is
reproduced from the work by Pinkel et al. [59].
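The factor-of-two reading of a log2 ratio near -1 follows directly from the base-2 logarithm. The following is a minimal sketch, assuming that signal intensity is exactly proportional to copy number and that the reference genome carries two copies of every region; the function name `log2_ratio` is hypothetical and is not taken from the thesis.

```python
import math


def log2_ratio(test_copies: float, ref_copies: float = 2.0) -> float:
    """Idealized log2 intensity ratio of test vs. reference at one locus.

    Assumption: intensity is exactly proportional to copy number, which
    real hybridizations only approximate.
    """
    if test_copies <= 0 or ref_copies <= 0:
        raise ValueError("copy numbers must be positive for a finite log2 ratio")
    return math.log2(test_copies / ref_copies)


# One copy in the test genome against a diploid reference gives -1,
# i.e. a reduction by a factor of two; two copies give 0.
for copies in (1, 2, 3, 4):
    print(copies, round(log2_ratio(copies), 3))
```

Under these assumptions a single-copy gain (three copies) yields a ratio of about 0.585, which is smaller in magnitude than the -1 of a single-copy loss.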

Raw data from CGH experiments is viewed as being continuous [57]. Pre-processing
of raw CGH data comprises all preliminary operations on the data necessary to arrive
at the quantity of interest. For CGH data, the log2 ratios undergo three pre-processing
steps before arriving at the actual copy number.

The first pre-processing step is normalization. Normalization corrects for experimental
artifacts in order to make the log2 ratios from different hybridizations comparable [7,
19, 47, 54].

The second pre-processing step, named segmentation, is motivated by the
underlying discrete DNA copy numbers of test and reference samples [21, 30, 87].
Segmentation algorithms divide the genome into non-overlapping segments that
are separated by breakpoints. These breakpoints indicate a change in DNA copy
number. Array elements that belong to the same segment are assumed, as they
are not separated by a breakpoint, to have the same underlying chromosomal copy
number. Segmentation methods also estimate the mean log2 ratio per segment,
referred to as states.

In the final pre-processing step, referred to as calling, the DNA copy
number of each segment is determined [88, 90]. Normalized CGH signal surpassing
predefined thresholds is considered indicative of genomic gains or losses, respectively
(Figure 1-2). At present, calling algorithms cannot determine whether there are,
say, three or four copies present. They can, however, detect deviations from the
normal copy number, and classify each segment as either 'normal', 'loss' or 'gain'.
Normal status indicates there are two copies of the chromosomal segment present.
Loss status indicates at least one copy is lost. Gain status indicates at least one
additional copy is present. These labels are referred to as calls.
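The calling step described above can be sketched as a simple thresholding of per-segment mean log2 ratios. This is an illustrative sketch only: the function name `call_segments` and the thresholds of +0.25 and -0.25 are assumptions made for the example and are not values or methods taken from the cited calling algorithms.

```python
from typing import List, Sequence


def call_segments(segment_means: Sequence[float],
                  gain_threshold: float = 0.25,
                  loss_threshold: float = -0.25) -> List[int]:
    """Map each segment's mean log2 ratio to a call.

    Returns 1 for gain, -1 for loss and 0 for normal, matching the
    encoding used in the text. Both thresholds are hypothetical.
    """
    if loss_threshold >= gain_threshold:
        raise ValueError("loss_threshold must be below gain_threshold")
    calls = []
    for mean in segment_means:
        if mean > gain_threshold:
            calls.append(1)    # at least one additional copy
        elif mean < loss_threshold:
            calls.append(-1)   # at least one copy lost
        else:
            calls.append(0)    # consistent with the normal two copies
    return calls


# Four segments with their estimated states (mean log2 ratios).
print(call_segments([0.02, 0.61, -0.9, 0.1]))
```

Note that such a rule only separates gain, loss and normal; like the calling algorithms discussed above, it does not distinguish, say, three copies from four.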

Chromosomal CGH summarizes signals from many short stretches of tumor DNA
hybridizing to neighboring regions. Chromosomal CGH results are annotated in a
reverse in-situ karyotype format [50] describing imbalanced genomic regions with reference
to their chromosomal location. CGH data of an individual patient can be considered
as an ordered list of status values, where each value corresponds to a genomic interval
(e.g., a single chromosomal band; the terms feature and dimension have also been used
in the literature to represent the genomic interval). Figure 1-3 shows a CGH dataset
for Retinoblastoma, NOS (ICD-O 9510/3) with 120 cases (i.e., patients), each having 862
genomic intervals. Chromosomal and array CGH account for a significant percentage of
the published analyses in cancer cytogenetics [5, 11, 24, 31, 34, 49, 82].

[Panels: Raw CGH Data (ratio plotted against Genomic Intervals); Smoothed CGH Data:
0 0 0 0 0 0 1 1 1 1 -1 -1 -1 -1 -1 -1]

Figure 1-2. Raw and normalized (smoothed) CGH data. This example shows 16
measurement points of tumor vs. control fluorescence. Runs of normalized
ratio values surpassing the thresholds are considered indicative of gains or
losses of genomic material in the corresponding genomic intervals (e.g.,
chromosomal bands). For our purposes, we use values of 1, -1, and 0 to express
gain, loss and no aberration, respectively.

The Progenetix database [3] (http://www.progenetix.net) is one of the major
resources for CGH data. As of December 2006, it contained 15,429 cases from 609
publications. Recently, the Progenetix database and the software tools developed for the
project have proven useful for delineating genomic aberration patterns with clear
prognostic relevance in neuroblastomas [82] and for producing tumor type specific
imbalance maps [45, 46].

This thesis is concerned with developing tools to help analyze CGH data.










[Panel: Retinoblastoma, NOS (ICD-O 9510/3), 4 Markers]

Figure 1-3. Plot of a CGH dataset. The dataset consists of 120 CGH cases belonging to
Retinoblastoma, NOS (ICD-O 9510/3). The axes correspond to the genomic intervals
and the samples. We plot the gain and loss status in green (light gray) and red
(dark gray), respectively.


1.2 Analysis of CGH Data

The downstream analysis of large-scale CGH data supports cancer diagnosis and
treatment and helps reveal the underlying genetic mechanisms of cancer. For example,
unsupervised clustering methods are often employed to discover previously unknown
sub-categories of cancer and to identify genetic biomarkers associated with the
differentiation. Classification methods can be used to separate healthy patients from
cancer patients and to distinguish patients of different cancer subtypes, based on their
cytogenetic profiles. These tasks contribute to successful cancer diagnosis and treatment.
Feature selection methods are employed to reveal the specific chromosomal regions that
are consistently aberrant for particular cancers and thereby help in focusing investigative
efforts on them. However, existing data mining methods cannot be directly applied to
CGH data because this data is structurally different from ordinary data. The following
are important characteristics of CGH data:

1. The datasets are high dimensional. The number of intervals in the Progenetix

database [1], a large publicly available database, is 862. Newer CGH datasets may










consist of 10,000 or more intervals. The size of a dataset, when compared with the

dimensionality, is relatively small. In the Progenetix database, a clinico-pathological

entity usually contains tens to hundreds of samples.

2. The features in CGH data represent ordered genomic intervals on chromosomes and

their values are categorical.

3. Genomic imbalances in cancer cells correspond to runs of 5 to 15 intervals of

gains or losses for an 862-interval representation. (We will use the term segments to

represent these runs. They correspond to a few megabases to an entire chromosome).

This indicates that neighboring genomic intervals are often correlated.

1.3 Contribution of This Thesis

We have developed novel data mining methods for modeling and analysis of spatial

and temporal characteristics of imbalances in CGH datasets to address these challenges.

The contributions of this thesis are briefly described as follows:
1. We have developed novel pairwise distance-based (we will use the term distance-based

for convenience) clustering methods that effectively exploit the spatial correlations

between consecutive genomic intervals [41]. The goal of our clustering is to identify

sets of tumors exhibiting common underlying genetic aberrations and representing

common molecular causes. Our work is built in two steps. In the first step, we

measure the distance/similarity between all pairs of samples. For this purpose,

we have developed three metrics to compute the similarity/distance between two

CGH samples. In the second step, we build clusters of samples based on pairwise

similarities using variations of well known methods.

Experimental results show that segment-based similarity/distance measures are

better indicators of biological proximity between pairs of samples. This measure

when combined with the top-down method produces the best clusters.
2. We have proposed the concept of markers to represent key recurrent point

aberrations that capture the aberration pattern of a set of CGH samples [42].










(See Figure 1-3. The markers are plotted as vertical lines.) We have developed a

dynamic programming technique to detect markers in a group of samples. The

resulting markers can be seen as the prototype of these samples. Based on the

markers, we have developed several clustering strategies.

Our experimental results show that the use of markers in the distance-based

clustering improves the cluster qualities.
3. We have developed a novel kernel function for using SVM-based methods for

classifying CGH data. The classification of CGH data aims to build a model for

defining a number of classes of tumors and accurately predict the classes of unknown

tumors. This measure counts the number of common aberrations between any two

samples. We show that this kernel measure is significantly better for SVM-based

classification of CGH data than the standard linear kernel.

4. We have developed an SVM-based feature selection method called Maximum

Influence Feature Selection (MIFS). It uses an iterative procedure to progressively

select features. In each iteration, an SVM-based model on selected features is

trained. This model is used to select one of the remaining features that provides the

maximum benefit for classification. This process is repeated until the desired number

of features is reached. We compared our methods against two state-of-the-art

methods that have been used for feature selection of large dimensional biological

data. Our results suggest that our method is considerably superior to existing

methods.

5. We have developed a novel method to infer the progression of multiple cancer

histological types or subtypes based on their aberration patterns. Our experimental

results based on a Progenetix dataset demonstrate that cancers with similar

histology coding are automatically grouped together using these methods.
We also describe a web-based tool for large volume data analysis of CGH datasets [43].

The tool provides various clustering algorithms and distance measures, and identifies markers

that can be interactively used by researchers. It presents the results in both










textual and graphical format. The tool does not require downloading or installing

software. It can be used through a web browser. A preliminary version of this tool is

available at http://cghmine.cise.ufl.edu:8007/CGH/Default.html.









CHAPTER 2
RELATED WORK

One of the early techniques used to identify cytogenetic abnormalities in tumor

cells is called Metaphase analysis [10, 55, 81]. The Mitelman Database of Chromosome

Aberrations in Cancer (http://cgap.nci.nih.gov/Chromosomes/Mitelman) has become

an invaluable resource for cytogenetic aberration data in human malignancies. This

database currently contains data from more than 46,000 cancer and leukemia samples

analyzed by Metaphase banding. However, utilization of the data collection for data

mining purposes so far has been limited by intrinsic problems of the Metaphase banding

technique as well as the specific data format.

Over the last decade, the Comparative Genomic Hybridization (CGH) [37] and array-

or matrix-CGH techniques [61, 63, 71] have addressed technical problems associated with

Metaphase analysis of tumor cells and are now used in many published observations. The

molecular cytogenetic techniques of CGH [36] and array- or matrix-CGH [60, 62, 72] have

previously been used to describe genomic aberration hot spots in cancer entities [5, 24], for

the delineation of disease subsets according to their cytogenetic aberration patterns [34,

48] and for the construction of genomic aberration trees from chromosomal imbalance

data [12].

In contrast to Metaphase analysis, CGH techniques are not limited to dividing tumor

cells, which frequently do not represent the predominant clone in the original tumor. Also,

CGH is not hampered by incomplete identification of chromosomal segments, which for

Metaphase analysis only recently has been addressed by SKY (Spectral Karyotyping) [84]

and MFISH (Multiplex Fluorescent In-Situ Hybridization) [73] techniques. According to

our own survey, chromosomal and array CGH now account for a significant percentage of

published analyses in cancer cytogenetics.

In this chapter, we briefly review the data mining and related methods that have been

used for analyzing CGH data.










2.1 Structural Analysis of Single Comparative Genomic Hybridization (CGH)
Array

Different strategies for structural analysis of CGH data have been applied previously.

Most of these analyses were aimed at the description of pseudo-temporal relationships

of cytogenetic events [12, 31] or at the correlation of disease subsets with clinical

parameters [48, 82]. Other CGH-related data analyses have been aimed at the spatial

coherence of genomic segments with different copy number levels.

Picard et al. pointed out that raw CGH signal exhibits a spatial coherence between

neighboring intervals [57]. This spatial coherence has to be handled. They used a

segmentation method based on a random Gaussian model. The parameters of this

model are determined by abrupt changes at unknown intervals. They developed a dynamic

programming algorithm to partition the data into a finite number of segments. The

intervals in each segment approximately share the same copy number on average. Further,

they proposed a segmentation-clustering approach combined with a Gaussian mixture model

to predict the biological status of the detected segments [58].
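
To make the dynamic programming idea concrete, the following is a minimal sketch of splitting a signal into a fixed number of contiguous segments. It is a generic least-squares change-point formulation and not the exact penalized-likelihood procedure of Picard et al.; the function name segment_least_squares and the fixed segment count k are assumptions made for illustration, and the choice of k (model selection) is left out.

```python
def segment_least_squares(signal, k):
    """Split `signal` into k contiguous segments minimizing the total
    within-segment sum of squared errors. Returns half-open (start, end)
    index pairs. Runs in O(k * n^2) time."""
    n = len(signal)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(signal)")

    # Prefix sums of values and squared values give O(1) segment costs.
    s1, s2 = [0.0], [0.0]
    for v in signal:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        """Sum of squared deviations from the mean of signal[i:j], j > i."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[s][j]: minimal cost of splitting signal[:j] into s segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                c = best[s - 1][i] + cost(i, j)
                if c < best[s][j]:
                    best[s][j] = c
                    back[s][j] = i

    # Walk the back-pointers from the end to recover the boundaries.
    bounds, j = [], n
    for s in range(k, 0, -1):
        i = back[s][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return bounds


print(segment_least_squares([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0], 3))
# [(0, 3), (3, 6), (6, 8)]
```

Each returned segment groups intervals that share approximately the same mean value, which corresponds to the notion of a common copy number per segment described above.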

Fridlyand et al. used an unsupervised Hidden Markov Model approach which consists

of two parts [21]. In the first part, they partition the genomic intervals into the states

which represent the underlying copy number of the groups of intervals. In the second part,

they determine the copy number level of each individual chromosome according to whether

any copy number transitions or whole chromosome gains or losses are contained in the

chromosome. They derived the appropriate values of parameters in the algorithm using

unpublished primary tumor data.

Pei et al. segmented each chromosome arm (or chromosome) using a hierarchical

clustering tree. The clusters are identified by suppressing the False Discovery Rate (FDR)

below a certain level. In addition, their algorithm provided a consensus summary across

a set of intervals, as well as an estimate of the corresponding FDR. They illustrated their










method with applications on a lung cancer microarray CGH data set as well as an array

CGH data set of aneuploid cell strains [88].

Willenbrock et al. made a comparison study on three popular and publicly available

methods for the segmentation analysis of array CGH data [90]. They demonstrated that

segmented CGH data yields better results in the downstream analysis such as hypothesis

testing and classification than the raw CGH data. They also proposed a novel procedure

for calling copy number levels by merging segments across the genome.

All the above works focus on the discretization of raw CGH data. However, they do

not address the subsequent analysis, such as clustering or classification, for a large dataset

consisting of discretised (smoothed) CGH data samples.

2.2 Clustering and Marker Detection of CGH Data

Unsupervised clustering methods are often employed to discover previously unknown

sub-categories of cancer and to identify genetic biomarkers associated with the differentiation.

Towards this end, many studies have been conducted on gene microarray data [32].

In the following, we briefly describe some of the earlier work on clustering of CGH data.

Mattfeldt et al. applied an existing tool for the clustering of CGH data [48]. The

tool, named Genecluster, formed clusters on the basis of an unsupervised learning rule

using an artificial neural network. It was originally proposed for the clustering of gene

expression data. Mattfeldt et al. applied Genecluster over a group of tens of cases from

pT2N0 prostate cancer. Based on the fact that clinically similar cases were placed into the

same clusters, they demonstrated that good clusterings were found.

Beattie et al. developed a new data mining technique to discover significant

sub-clusters and marker genes in a completely unsupervised manner [4]. They used a

digital paradigm to discretise the gene microarray and transferred the data into binary

state patterns. A clustering based on Hamming distance was applied to create clusters

and identify bio-markers. Although their work is not directly based on CGH data, they

demonstrated that their method can be adapted to other categorical datasets.










Rouveirol et al. proposed two algorithms for computing minimal and minimal

constrained regions of recurrent gain and loss aberrations from discretised CGH data [67].

Their algorithms can handle additional constraints describing relevant regions of copy

number change. They validated their algorithms on two public array-CGH datasets.

Thus, the existing literature has not addressed clustering algorithms that exploit the

important spatial and temporal characteristics of CGH data. Further, existing clustering

works usually focus on small homogeneous datasets with several tens of cases. The

existing marker discovery methods usually simply identify the markers, but do not explore

the usage of these markers in clustering analysis, as we propose in this thesis.

2.3 Classification and Feature Selection of CGH Data

Classification aims to build an efficient and effective model for predicting class labels

of unknown data. The model is built on the training data, which consists of data points

chosen from input data space and their class labels. Classification techniques have been

widely used in microarray analysis to predict sample phenotypes based on gene expression

patterns.

Support Vector Machine (SVM) is a state-of-the-art technique for classification [83].

Mukherjee et al. used an SVM classifier for cancer classification based upon gene

expression data from DNA microarrays. They argued that DNA microarray problems

are very high dimensional and have very few training data. This type of situation is

particularly well suited for an SVM approach. Their approach achieved better performance

than reported results [53].

Li et al. performed a comparative study of multiclass classification methods for

tissue classification based on gene expression data [40]. They conducted comprehensive

experiments using various classification methods including SVM [83] with different

multiclass decomposition techniques, Naive Bayes, K-nearest neighbor and decision

tree [79]. They found that SVM is the best classifier for classification of gene expression

data.










To our knowledge, most existing SVM-based approaches focus on gene expression

data. They usually use a linear kernel due to the assumption of Golub et al. about the

additive linearity of the genes in classification [23]. For example, based on experimental

results, Mukherjee et al. demonstrated that linear SVMs did as well as nonlinear SVMs

using polynomial kernels. So far, there is very limited study on developing kernel functions

for the classification of CGH data.

Feature selection is a related task that selects a small subset of discriminative

features. The problem of feature selection was first proposed in machine learning. A

good review can be found in [26]. Recently, feature selection methods have been widely

studied in gene selection of microarray data. These methods can be decomposed into two

broad classes:

1. Filter Methods: These methods select features based on discriminating criteria

that are relatively independent of classification process. Several methods use simple

correlation coefficients similar to Fisher's discriminant criterion. For example, given

class 1 and class -1 denoting two classes, Golub et al. used a criterion as follows [23]:

$$P(j) = \frac{\mu_1(j) - \mu_{-1}(j)}{\sigma_1(j) + \sigma_{-1}(j)}$$

where $j$ is the gene index, $\mu_1(j)$ is the mean of class 1 for gene $j$, $\mu_{-1}(j)$ is the mean of

class -1 for gene $j$, $\sigma_1(j)$ is the standard deviation of class 1 for gene $j$, and $\sigma_{-1}(j)$ is the

standard deviation of class -1 for gene $j$. The genes are then ranked in descending

order according to $P(j)$ and the top values correspond to "informative" genes.

Other methods adopt mutual information or statistical tests (t-test, F-test). For

example, Model et al. ranked the features using a two sample t-test [51]. They

assumed that the value of a feature within a class follows a normal distribution. A

two sample t-test was adopted to rank the features according to the significance of

the difference between the class means. In principle, their approach was similar to

Fisher's criterion because, in both methods, a large mean difference and a small









within class variance are proportional to the discriminative power of a feature. Their

experimental results demonstrated that the t-test approach worked better than the

standard Principal Component Analysis (PCA) method.

Ding et al. considered the nature of feature selection for classification of multi-class

data [16]. They used the F-statistic test, which is a generalization of the t-statistic for

two classes. Given the expression of a gene across $n$ tissue samples $g = (g_1, \ldots, g_n)$ from $K$

classes, the F-statistic is defined as

$$F = \left[ \sum_k n_k (\bar{g}_k - \bar{g})^2 / (K - 1) \right] / \sigma^2$$

where $\bar{g}$ is the average expression across all samples, $\bar{g}_k$ is the average within class

$C_k$, and $\sigma^2$ is the pooled variance:

$$\sigma^2 = \left[ \sum_k (n_k - 1) \sigma_k^2 \right] / (n - K)$$

where $n_k$ and $\sigma_k^2$ are the size and variance of gene expression within class $C_k$. They

picked genes with large F-values.

Earlier filter based methods evaluated features in isolation and did not consider

correlation between features. Recently, methods have been proposed to select

features with minimum redundancy [14, 15, 91]. For example, Yu et al. introduced

the importance of removing redundant genes in sample classification and pointed out

the necessity of studying feature redundancy [91]. They proposed a filter method

with feature redundancy taken into account. They combined sequential forward

selection with backward elimination so that, in each step, the number of feature

pairs for redundancy analysis is reduced. Their method is free of any threshold

in determining feature relevance or redundancy. Their experimental results on

microarray data demonstrated the efficiency and effectiveness of their method in

selecting discriminative genes that improve classification accuracy.










The methods proposed by Ding et al. use a minimum redundancy maximum

relevance (MRMR) feature selection framework [14, 15]. They supplement the

maximum relevance criteria with minimum redundancy criteria to choose

additional features that are maximally dissimilar to already identified ones. By doing

this, MRMR expands the representative power of the feature set and improves its

generalization properties.

2. Wrapper Methods: Wrapper methods utilize a classifier as a black box to score the

subsets of features based on their predictive power. Wrapper methods based on SVM

have been widely studied in the machine learning community [26, 64, 89]. SVM-RFE

(Support Vector Machine Recursive Feature Elimination) [27], a state-of-the-art

wrapper method applied to cancer research, uses a backward feature

elimination scheme to recursively remove insignificant features from subsets of

features. In each recursive step, a linear SVM is trained on the feature set. For each

feature, a ranking coefficient is computed based on the reduction in the objective

function if this feature is removed. The bottom ranked feature is then eliminated

from the feature set. The above process is repeated until the feature set is empty.

The features are sorted based on their sequence of elimination.

A number of variants also use the same backward feature elimination scheme

and linear kernel. Zhang et al. proposed a method aimed for classifying two-class

data [92]. It used a recursive support vector machine (R-SVM) algorithm to select

important features for the classification of noisy high-throughput proteomics and

microarray data. The experimental results showed that, compared to SVM-RFE,

their method is more robust to outliers in the data and capable of selecting the most

informative features.

Duan et al. proposed a new feature selection method that used a backward

elimination procedure similar to that implemented in SVM-RFE [17]. Unlike

SVM-RFE, at each step, the proposed approach trained multiple linear SVMs on










subsamples of the original training data. It then computed the feature ranking score

from a statistical analysis of weight vectors of these SVMs. The experimental results

showed that their method selects better gene subsets than SVM-RFE and improves

the classification accuracy.

For feature selection of multiclass data, Ramaswamy et al. used a one-versus-all

strategy to convert the multiclass problem into a series of two-class problems. They

applied SVM-RFE to each two-class problem separately and generated a consensus

sorting of all features [65].

Fu et al. also proposed a method based on the one-versus-all strategy [90]. For each

two-class problem, they wrapped the feature selection into a 10-fold cross validation

(CV) and selected features using SVM-RFE in each fold. They also developed a

probabilistic model to select significant features from the 10-fold results. They took

the union of features selected from each two-class SVM as the final set of features.

Filter methods are generally less computationally intensive than wrapper methods.

However, they tend to miss complementary features that individually do not separate

the data well. A recent comparison of feature selection methods for the classification of

multiclass microarray data shows that wrapper methods such as SVM-RFE have better

classification accuracy for a large number of features, but achieve lower accuracy than filter

methods when the number of selected features is small [9].
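
As an illustration of the filter criteria reviewed in this section, the sketch below computes the two-class criterion of Golub et al. and the multi-class F-statistic for a single feature. The function names are hypothetical, and the use of the sample (n - 1) standard deviation and variance is an assumption, since the text does not state which estimator is used. Both functions divide by a spread estimate and therefore require non-constant values within each class.

```python
from statistics import mean, stdev, variance


def golub_score(values, labels):
    """P(j) = (mu_1 - mu_-1) / (sigma_1 + sigma_-1) for one feature.
    `labels` holds 1 or -1 for each sample."""
    pos = [v for v, c in zip(values, labels) if c == 1]
    neg = [v for v, c in zip(values, labels) if c == -1]
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def f_statistic(values, labels):
    """F = [sum_k n_k (mean_k - mean)^2 / (K - 1)] / pooled variance,
    with pooled variance = [sum_k (n_k - 1) var_k] / (n - K)."""
    classes = sorted(set(labels))
    n, k = len(values), len(classes)
    grand = mean(values)
    groups = [[v for v, c in zip(values, labels) if c == cls]
              for cls in classes]
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups) / (k - 1)
    pooled = sum((len(g) - 1) * variance(g) for g in groups) / (n - k)
    return between / pooled


# Toy feature measured on six samples from two classes.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [1, 1, 1, -1, -1, -1]
print(golub_score(values, labels))   # -1.5
print(f_statistic(values, labels))   # 13.5
```

To rank features, either score would be computed for every feature and the features sorted in descending order of the score, as described for the filter methods above.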

2.4 Inferring Progression Models for CGH Data

Models of tumor progression can be used to explain known clinical and molecular

evidence of cancer. An earlier effort was made by Vogelstein et al. [85]. They inferred

a chain model of four genetic events, three of which are CNAs, for the progression of

colorectal cancer. These events in the model are irreversible. That is, once an event occurs

it is never undone in the future. The presence of all four events appears to be an indicator

of colorectal cancer.










Desper et al. proposed a branching tree model that is more general than a path

model by assuming that the recurrent CNAs are a set of genetic events that take place

in some order [12]. They derived a tree model inference algorithm by utilizing the idea

of maximum-weight branching in a graph. They applied the algorithm over a CGH data

set for renal cancer and showed that the correct tree for renal cancer was inferred. Later,

they extended their work to distance-based trees, in which events are leaves of the tree,

in the style of models common in phylogenetics [13]. They proposed a novel approach to

reconstruct the distance-based trees using tree-fitting algorithms developed by researchers

in phylogenetics. They applied their approach over the CGH data set for renal cancer.

The results showed that the distance-based models well complemented the branching tree

models.

Bilke et al. proposed a graph model based on the shared status of recurrent CNAs

among different stages of cancer [6]. They first identified a set of recurrent alterations

and computed their shared status using statistical tests. They then constructed a Venn

diagram based on these recurrent alterations. They manually converted the Venn diagram

into a graph model. They found that the pattern of recurrent CNAs in neuroblastoma

cancer is strongly stage dependent.

Pennington et al. developed a mutation model for individual tumors and constructed

an evolutionary tree for each tumor [56]. They identified the consensus tree model based

on the copy number alterations shared by a substantial fraction of the population. They

proved that their results are consistent with prior knowledge about the role of the genes

examined in cancer progression.

All of the above works infer tumor progression models based on genetic events such

as recurrent CNAs. Their models describe the evolutionary relationship between these

events and consequently expose the progression and development of tumors. However,

these works treat every individual recurrent alteration as an independent genetic event.

This makes their models very complex when applied to data sets with samples










from multiple cancers, given that each cancer type contains a set of substantially different

recurrent alterations.

2.5 Software for Analyzing CGH Data

Chromosomal and array CGH recently account for a significant number of published

studies in cancer cytogenetics [5, 12, 24, 31, 34, 48, 82]. Acquisition of thousands of copy

number information brings forth challenges to the analysis of CGH data. Researchers

have explored data mining methods for this purpose. Many of their methods focus on the

structure analysis of CGH data, such as the spatial coherence of genomic segments with

different copy number levels [25, 57, 58, 74]. Associated with these works, a lot of tools

(both web applications and stand-alone software packages) are available for the analysis of

CGH data, such as CGH-Miner (http://www-stat.stanford.edu/~wp57/CGH-Miner/),

CGH-Explorer (http://www.ifi.uio.no/forskning/grupper/bioinf/Papers/CGH/),

ArrayCyGHt (http://genomics.catholic.ac.kr/arrayCGH/) and CGH-Plotter

(http://sigwww.cs.tut.fi/TICSP/CGH-Plotter/). However, very limited efforts have

been made towards mining heterogeneous CGH datasets with more than a few hundred

samples. This is the focus of our proposed work and software.










CHAPTER 3
PAIRWISE DISTANCE-BASED CLUSTERING

The goal of clustering is to develop a systematic way of placing patients with similar

CGH imbalance profiles into the same cluster. Our expectation is that patients with the

same cancer types will generally belong to the same cluster as their underlying CGH

profiles will be similar. In this chapter, we focus on distance-based clustering. We develop

three pairwise distance/similarity measures, namely Raw, Cosine and Sim. Raw measure

compares the aberrations in each genomic interval separately. The other two measures

take the correlations between consecutive genomic intervals into account. Cosine maps

pairs of CGH samples into vectors in a high-dimensional space and measures the angle

between them. Sim measures the number of independent common aberrations. We test

our distance/similarity measures on three well known clustering algorithms, bottom-up,

top-down and k-means with and without centroid shrinking. Our results show that Sim,

when combined with top-down algorithm, consistently performs better than the remaining

measures.

3.1 Method

Genomic aberration data from CGH experiments is usually communicated in a reverse

in-situ karyotype annotation format [50]. We use this strategy and represent gain, loss,

and no change with +1, -1, and 0 respectively throughout the proposal.

We propose to use three different distance-based clustering methods for CGH data

and survey their performance. The key problem, however, is to compute the proximity of

two CGH samples. In Section 3.1.1, we discuss the three measures we developed for such

pairwise comparison. We briefly explain the three clustering algorithms we used to cluster

a population of samples in section 3.1.2. Two techniques that further optimize the cluster

qualities are discussed in section 3.1.3.









Genomic Intervals   1   2   3   4   5   6   7   8   9  10  11  12

X                   0   1   1   1   0   0  -1  -1   0   1  -1  -1

Y                   0   0   1   1   1   0   0   0   0   1   1   1

Diff(x,y)           1   1   0   0   1   1   1   1   1   0   1   1

Figure 3-1. Example of Raw distance. X and Y are two CGH samples. The value of each
genomic interval shows the status (i.e. gain, loss or no change) of that interval.
The distance between X and Y is $\sum_{j=1}^{m} \mathrm{diff}(x_j, y_j) = 9$.


3.1.1 Comparison of Two Samples

Let $X = x_1, x_2, \ldots, x_m$ and $Y = y_1, y_2, \ldots, y_m$ be two CGH samples. Here, $x_i$ and

$y_i$ denote the value or status of the $i$th genomic interval of X and Y, respectively. The

proximity between X and Y can be computed in terms of distance or similarity. In this

section we develop three such measures of distance/similarity.

3.1.1.1 Raw distance

Our first measure assumes that the genomic intervals are independent of each other.

This assumption is often made in existing literature to simplify the problem of computing

distances [57]. If both samples have gain (or loss) at the same genomic interval then we

consider them similar at that position. Otherwise, that genomic interval contributes to

the distance between them. Also, we assume that all genomic intervals have the same

importance. Thus, each genomic interval contributes the same amount to the total

distance. Formally, the distance is computed as $\sum_{j=1}^{m} \mathrm{diff}(x_j, y_j)$. Here $\mathrm{diff}(x_j, y_j) = 1$ if

$x_j \neq y_j$ or $x_j = 0$. Otherwise $\mathrm{diff}(x_j, y_j) = 0$. The similarity is obtained by subtracting

the distance from $m$, the number of genomic intervals of the CGH samples. An example is

shown in Figure 3-1.

This distance function is similar to Hamming distance in principle because it

compares the genomic intervals of both samples one by one. We call this distance Raw

since it is computed on raw CGH data. Raw distance between two samples is small only










if the samples have gains or losses in the same positions. Raw distance ranges between

[0, m].
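
The Raw distance defined above can be sketched in a few lines. The function names are illustrative, and the sample values are those of Figure 3-1 as read from the figure, with samples encoded as lists of +1 (gain), -1 (loss) and 0 (no change).

```python
def raw_distance(x, y):
    """Raw distance between two CGH samples of equal length.
    An interval contributes 0 only when both samples carry the same
    non-zero status (both gain or both loss); otherwise it contributes 1."""
    if len(x) != len(y):
        raise ValueError("samples must have the same number of intervals")
    return sum(1 for xj, yj in zip(x, y) if xj != yj or xj == 0)


def raw_similarity(x, y):
    """Similarity obtained by subtracting the distance from m."""
    return len(x) - raw_distance(x, y)


X = [0, 1, 1, 1, 0, 0, -1, -1, 0, 1, -1, -1]
Y = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
print(raw_distance(X, Y))    # 9
print(raw_similarity(X, Y))  # 3
```

Note that two intervals that are both unchanged still contribute to the distance, so two samples without any aberration are at the maximum distance m under this definition.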

3.1.1.2 Segment-based similarity

This method takes into account the fact that consecutive genomic intervals are usually correlated.

A contiguous block of gains (or losses) can be caused by a point-like aberration at a single

genomic interval. We use the term segment to represent a contiguous block of aberrations

of the same type. For example, in Figure 3-2, sample X contains four segments. The first

and third segments are gain type while the second and fourth segments are loss type. We

call two segments from two samples overlapping if they have at least one common genomic

interval of the same type. For example, the first segment of X is overlapping with the first

segment of Y in Figure 3-2. Also the third segment of X is overlapping with the second

segment of Y. Next, we develop a segment-based similarity measure called Sim.

Given two CGH samples X and Y, Sim constructs maximal segments by combining

as many contiguous aberrations of the same type as possible. Formally, the genomic

intervals $x_i, x_{i+1}, \ldots, x_j$, for $1 \le i \le j \le m$, define a segment if genomic intervals $x_i$

through $x_j$ are in the same chromosome, the values from $x_i$ to $x_j$ are all gains or all losses,

and $x_{i-1}$ and $x_{j+1}$ are different from $x_i$. Thus, each sample translates into a sequence of

segments. After this transformation, Sim assumes that the segments are independent of

each other and gives the same importance to all the segments regardless of the number of

genomic intervals in them. Sim computes the similarity between two CGH samples as the

number of overlapping segment pairs. This is justified because each overlap may indicate

a common point-like aberration in both samples which then led to the corresponding

overlapping segments. An example is shown in Figure 3-2. There are two important

observations that follow from the definition of Sim. First, unlike the Raw distance

measure, Sim considers an overlap of arbitrary number of genomic intervals as a single

match. Second, although two samples have different values for the same genomic interval,

Sim does not consider this as a mismatch if it is an extension of an overlap. For example,









Genomic Intervals   1   2   3   4   5   6   7   8   9  10  11  12

X                   0   1   1   1   0   0  -1  -1   0   1  -1  -1

Y                   0   0   1   1   1   0   0   0   0   1   1   1



Figure 3-2. Example of Sim measure. X and Y are two CGH samples with the values of
genomic intervals shown in the order of positions. The segments are
underlined. The overlapping segments are shown with arrows. Since there are
two overlapping segments; one from position 3 to 4 and the other at position
10, the similarity between X and Y is 2.

in Figure 3-2, the fifth genomic intervals of samples X and Y have different values, but we

still consider this position a match because it could be an extension of an overlap.
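
A minimal sketch of the Sim measure follows, assuming samples are lists of +1, -1 and 0. The helper names segments and overlaps, and the optional chrom argument (one chromosome identifier per interval, used to stop segments at chromosome boundaries), are illustrative choices and not part of the original description.

```python
def segments(sample, chrom=None):
    """Return maximal runs of identical non-zero status as
    (start, end, status) tuples with 0-based inclusive indices."""
    segs, i, m = [], 0, len(sample)
    while i < m:
        if sample[i] == 0:
            i += 1
            continue
        j = i
        # Extend the run while status and chromosome stay the same.
        while (j + 1 < m and sample[j + 1] == sample[i]
               and (chrom is None or chrom[j + 1] == chrom[i])):
            j += 1
        segs.append((i, j, sample[i]))
        i = j + 1
    return segs


def overlaps(s, t):
    """Two segments overlap if they have the same type and share
    at least one genomic interval."""
    return s[2] == t[2] and s[0] <= t[1] and t[0] <= s[1]


def sim(x, y, chrom=None):
    """Number of overlapping segment pairs between two samples."""
    sx, sy = segments(x, chrom), segments(y, chrom)
    return sum(1 for s in sx for t in sy if overlaps(s, t))


X = [0, 1, 1, 1, 0, 0, -1, -1, 0, 1, -1, -1]
Y = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
print(segments(X))  # [(1, 3, 1), (6, 7, -1), (9, 9, 1), (10, 11, -1)]
print(sim(X, Y))    # 2
```

On the example samples, the two overlapping pairs are the ones at positions 3 to 4 and at position 10 (1-based), which matches the similarity of 2 stated for the example.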

3.1.1.3 Segment-based cosine similarity

Segment-based similarity grows linearly with the number of common segments.

However, the aberration patterns of some cancer types can be less complex than the

others. The samples that belong to these cancer types share fewer common segments

leading to small values of Sim even though the samples are almost identical. Cosine

similarity of two vectors normalizes the similarity by measuring the cosine of the angle

between them. This measure is the most commonly used method to compute the

similarity between two directional data in vector-space model [68]. In this section, we

extend the cosine similarity to measure the proximity of two CGH samples.

Let X and Y be two CGH samples. We first map X and Y to two vectors X and

Y $\in R^g$, where $g$ is the number of dimensions of the vectors. Usually, $g < m$, where $m$
is the number of genomic intervals of CGH samples. The mapping process is also based

on segments and works as follows. First, we translate each sample into a sequence of

segments. Let us define segment sequence G, H that corresponds to the sample X, Y

respectively. Without loss of generality, we can assume that for all the genomic intervals in

Y, if they belong to any segment in H, the genomic intervals in X at the same positions

are also covered by the segments in G. Here, we say that a segment covers a consecutive

block of genomic intervals only if for each genomic interval, either it belongs to this









Genomic Intervals   1  2  3  4  5  6  7  8  9 10 11 12
X                   0  1  1  1  0  0 -1 -1  0  1 -1 -1
Y                   0  0  1  1  1  0  0  0  0  1  1  1

Vector X            1  1  1  1
Vector Y            1  0  1  0

Figure 3-3. Example of CosineNoGaps measure. This figure shows the CosineNoGaps
similarity between two CGH samples. X and Y are two CGH samples with
the values of genomic intervals shown in the order of positions. The segments
are underlined. First, X and Y are mapped to two vectors $\vec{X}$ and $\vec{Y}$
respectively. Second, the similarity between X and Y is computed as C(X, Y)
= 0.7071

segment or it is of no-change status and the aberration of this segment can be extended

to this genomic interval. Next, we scan the segment sequence G in the ascending order

of the genomic intervals. For each segment $g_i \in G$, if there exists an overlapping segment

$h_j \in H$, we add a new dimension to both vectors $\vec{X}$ and $\vec{Y}$. We then assign value 1 to this

dimension of $\vec{X}$ and $\vec{Y}$, indicating that the values of this dimension are exactly the same

in the two vectors. If no overlapping segment $h_j \in H$ exists, we add a new dimension to

both vectors with value 1 assigned to vector $\vec{X}$ and value 0 assigned to vector $\vec{Y}$, which

indicates that the values of the new dimension in two vectors are orthogonal. An example

of the segmenting and mapping step for this measure is shown in Figure 3-3. After the two

CGH samples X and Y have been mapped to two vectors, the cosine similarity between X

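and Y is computed as

$$C(X, Y) = \frac{\vec{X} \cdot \vec{Y}}{\|\vec{X}\| \, \|\vec{Y}\|}$$

where $\vec{X}$ and $\vec{Y}$ denote the vectors obtained from X and Y.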
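The mapping and the cosine computation can be sketched as follows. This is a minimal sketch, assuming (as in the text) that only the segment sequence of X needs to be scanned, that segments are maximal runs of identical non-zero values, and that an overlap requires the same aberration type. The names `segments` and `cosine_no_gaps` are hypothetical, and sample Y is read as 0 0 1 1 1 0 0 0 0 1 1 1.

```python
import math


def segments(sample):
    """Return (start, end, status) for each maximal run of equal non-zero values."""
    segs, start = [], None
    for i, value in enumerate(sample):
        if start is not None and value != sample[start]:
            segs.append((start, i - 1, sample[start]))
            start = None
        if value != 0 and start is None:
            start = i
    if start is not None:
        segs.append((start, len(sample) - 1, sample[start]))
    return segs


def cosine_no_gaps(x, y):
    """One dimension per segment of x: (1, 1) on overlap, (1, 0) otherwise."""
    y_segs = segments(y)
    vec_x, vec_y = [], []
    for (xs, xe, xt) in segments(x):
        overlaps = any(
            xt == yt and xs <= ye and ys <= xe for (ys, ye, yt) in y_segs
        )
        vec_x.append(1)
        vec_y.append(1 if overlaps else 0)
    dot = sum(a * b for a, b in zip(vec_x, vec_y))
    # Entries are 0 or 1, so the squared norm equals the plain sum.
    norm = math.sqrt(sum(vec_x)) * math.sqrt(sum(vec_y))
    return dot / norm if norm else 0.0


X = [0, 1, 1, 1, 0, 0, -1, -1, 0, 1, -1, -1]
Y = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
print(round(cosine_no_gaps(X, Y), 4))  # 0.7071, the value reported in Figure 3-3
```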





The majority of genomic intervals in CGH data have zero values (i.e. no aberration).

We call a consecutive block of these genomic intervals a gap. We ignore the impact of gaps

in the above cosine similarity measure. However, considering the overlapping gaps between

two samples might contribute greatly to the similarity between them. We develop another

variant of cosine similarity which takes the overlapping gaps into consideration. The new

similarity measure changes the mapping step that translates the CGH data into vectors.










First, it extends the definition of segments to be a consecutive block of genomic intervals

that share the same status, i.e. gain, loss or no change. That means, gaps are also

included in the segments in this way. Then it translates the CGH data into a sequence of

segments with some of the segments representing gaps. Next, a scan is performed on the

segment sequence G. For each gap in G, if there exists an overlapping gap in H, a new

dimension is added to both vectors and the value 1 is assigned to it in both of them.

The mapping steps for gain and loss segments and the computation of cosine similarity remain

unchanged. Compared to the previous cosine similarity measure, this measure offers a

larger similarity between two CGH samples due to the impact of overlapping gaps. Thus,
we use the term CosineGaps to represent this measure, whereas the term CosineNoGaps is used to

represent the previous definition. Both of these measures produce a value within a range

of [0, 1] indicating the similarity between two samples.
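The gap-aware variant can be sketched in the same way. This is a minimal sketch under stated assumptions: every maximal run of equal status values (including runs of zeros) is a segment, a gap of X contributes a dimension only when it overlaps a gap of Y, and gain or loss segments are mapped as before. The names `runs` and `cosine_gaps` are hypothetical, and sample Y is read as 0 0 1 1 1 0 0 0 0 1 1 1.

```python
import math


def runs(sample):
    """Split a sample into maximal runs of equal status; runs of 0 are the gaps."""
    out, start = [], 0
    for i in range(1, len(sample) + 1):
        if i == len(sample) or sample[i] != sample[start]:
            out.append((start, i - 1, sample[start]))
            start = i
    return out


def cosine_gaps(x, y):
    """Cosine similarity where overlapping gaps also add a matching dimension."""
    y_runs = runs(y)
    vec_x, vec_y = [], []
    for (xs, xe, xt) in runs(x):
        overlaps = any(
            xt == yt and xs <= ye and ys <= xe for (ys, ye, yt) in y_runs
        )
        if xt == 0 and not overlaps:
            continue  # a gap adds a dimension only when it overlaps a gap of y
        vec_x.append(1)
        vec_y.append(1 if overlaps else 0)
    dot = sum(a * b for a, b in zip(vec_x, vec_y))
    norm = math.sqrt(sum(vec_x)) * math.sqrt(sum(vec_y))
    return dot / norm if norm else 0.0


X = [0, 1, 1, 1, 0, 0, -1, -1, 0, 1, -1, -1]
Y = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
print(cosine_gaps(X, Y))
```

For these two samples the three gaps of X all overlap gaps of Y, so three matching dimensions are added to the four dimensions produced by the gain and loss segments.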

3.1.2 Clustering of Samples

With one of the aforementioned distance/similarity measures between two CGH

samples, we can easily apply a distance-based clustering algorithm to group similar

CGH samples together. At a high-level, the problem of clustering is defined as follows.

Given a set S of n samples $s_1, s_2, \ldots, s_n$, we would like to partition S into k subsets

$C_1, C_2, \ldots, C_k$, such that the samples assigned to each subset are more similar to each

other than the samples assigned to different subsets. Here, we assume that two samples

are similar if they correspond to the same cancer type.

As we mentioned earlier, our focus in this thesis is to evaluate the suitability of

various distance/similarity measures together with clustering algorithms in the context

of the CGH data clustering problem. In this section, we briefly introduce the three

distance-based clustering algorithms we used in our experiments.

3.1.2.1 k-means clustering

K-means [44] is one of the simplest unsupervised learning algorithms that solve the

well-known clustering problem. Its key step is to compute the distance/similarity between










sample data and the cluster centroid, which is not necessarily a real sample. Since CGH

samples are represented as an array of status values, it is not trivial to compute an

accurate centroid for a set of CGH samples. Here, we develop a variant of the k-means

algorithm which is more suitable for our distance/similarity measures. Compared to the

standard k-means, our algorithm omits the step of computing the cluster centroids, but

reassigns a sample according to its average distance to all the samples in a cluster rather

than the distance to the centroid of that cluster. These changes let our algorithm work for

any distance/similarity measure described in Section 3.1.1.

We first partition the n samples into k clusters by randomly assigning each sample

to one of the k clusters. This random partition forms the initial cluster seeds for our

k-means algorithm. Then we scan the n samples one by one. For the ith sample, we compute

its average distance to all the samples in cluster j, for $1 \le j \le k$, and then move it to

the cluster with the minimum average distance if that cluster is different from the one

it already belongs to. This scanning process is repeated until there is no movement of

samples during a scan or until a maximum number of iterations is reached.
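The centroid-free variant described above can be sketched as follows. This is a minimal sketch, not the exact implementation used in the experiments: the function name, the `labels` parameter for supplying an initial partition, and the choices to count a sample in the average of its own cluster and to skip empty clusters are all assumptions made for illustration. `dist` is any pairwise distance function.

```python
import random


def average_distance_kmeans(samples, k, dist, labels=None, max_iter=100, seed=0):
    """Reassign each sample to the cluster with the smallest average distance."""
    n = len(samples)
    if labels is None:
        rng = random.Random(seed)
        labels = [rng.randrange(k) for _ in range(n)]  # random initial partition
    labels = list(labels)

    def avg_dist(i, cluster):
        members = [j for j in range(n) if labels[j] == cluster]
        if not members:
            return None  # empty clusters are skipped (assumption)
        return sum(dist(samples[i], samples[j]) for j in members) / len(members)

    for _ in range(max_iter):
        moved = False
        for i in range(n):
            best, best_avg = labels[i], avg_dist(i, labels[i])
            for c in range(k):
                avg = avg_dist(i, c)
                if avg is not None and avg < best_avg:
                    best, best_avg = c, avg
            if best != labels[i]:
                labels[i] = best
                moved = True
        if not moved:  # stop when a full scan moves no sample
            break
    return labels


points = [0, 1, 2, 10, 11, 12]
result = average_distance_kmeans(
    points, 2, lambda a, b: abs(a - b), labels=[0, 1, 0, 1, 0, 1]
)
print(result)
```

A similarity measure such as Sim can be plugged in by passing a function that returns a dissimilarity, for example the negated similarity.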

3.1.2.2 Complete link bottom-up clustering

Complete link [38] clustering defines the distance between two clusters as the largest

distance between a sample from the first cluster and a sample from the second cluster.

The bottom-up clustering works by designating each sample as its own cluster initially.

Next, each cluster is compared to each other cluster, and the two closest clusters are merged.

This process continues until k clusters remain.
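A naive version of this procedure can be sketched as follows. This is a minimal illustration, assuming a pairwise distance function `dist` and breaking ties by taking the first closest pair found; the function name `complete_link` is hypothetical, and no attempt is made to cache inter-cluster distances.

```python
def complete_link(samples, k, dist):
    """Merge the two closest clusters until k remain.

    The distance between two clusters is the largest distance between a
    sample of one and a sample of the other. Clusters hold sample indices.
    """
    clusters = [[i] for i in range(len(samples))]  # every sample starts alone
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(
                    dist(samples[i], samples[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


points = [0, 1, 2, 10, 11, 12]
print(complete_link(points, 2, lambda a, b: abs(a - b)))
```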

3.1.2.3 Top-down clustering

This algorithm [75] starts by assigning all samples into one cluster. It then bisects

this cluster recursively until k clusters are produced, where k is a user defined parameter.

The bisection is performed in two phases. In the first phase, two samples are randomly

selected as the seeds of two clusters. Then, for each remaining sample, its similarity to

these two seeds is computed and it is assigned to the cluster whose seed has a higher










similarity to that sample. In the second phase, the clusters are refined. A refinement

consists of a number of iterations. During each iteration, samples are visited one by one.

Each sample $s_i$ is then moved to all of the clusters one by one, and a criterion function

is computed for each positioning of $s_i$. The criterion function evaluates the quality of the

clusters. We use the term internal measure to represent this criterion function. The formal

definition of internal measure is addressed in Section 3.2.1. The sample $s_i$ is kept in the

cluster that maximizes the internal measure. This refinement process ends as soon as there

is no movement of samples during an iteration or after a predefined maximum number

of iterations has been performed. In our experiments, the number of iterations was

typically less than 20. After the refinement is finished, the cluster with the largest number

of samples is bisected similarly. Once k clusters are created, the top-down algorithm ends.

In each iteration of the refinement, $O(n)$ time is needed to compute the change of

the internal measure for each sample. This is because the similarity between that sample

and every other sample in each cluster needs to be accumulated. The time complexity of

each iteration is $O(n^2)$ as there are $n$ samples in total. Since the total number of iterations

is limited by a small constant, the complexity of refinement is $O(n^2)$. The refinement is

performed every time a new cluster is created. In the above described process the number

of clusters increases by one in every stage until k clusters are created. Therefore, the

overall time complexity of top-down clustering is $O(n^2 k)$.

To reduce this time complexity, we modify the top-down clustering algorithm.

Essentially, the refinement process is limited to the cluster being decomposed into smaller

clusters. There are two differences between the modified and the original top-down

clustering. First, only the samples in the decomposed cluster are considered for refinement.

Second, a sample is relocated only to the two newly created clusters rather than all the

clusters. In the best case, the clusters are decomposed in a balanced fashion. The overall

time complexity in this case is $O\left(n^2 + 2\left(\frac{n}{2}\right)^2 + \cdots + 2^{\log_2 k}\left(\frac{n}{2^{\log_2 k}}\right)^2\right) = O(2n^2)$. In the worst

case, a cluster with $n$ samples could be decomposed into two clusters with $n - 1$ samples










in one cluster and 1 sample in the other. If this case happens to all the bisections, the

worst case time complexity could be $O(kn^2)$. Thus, with this enhanced refinement process,

the average time complexity of top-down clustering is between $O(n^2)$ and $O(kn^2)$. We

generally expect the time complexity to be close to $O(n^2)$, which results in a factor of k

improvement in time. We call this faster refinement process in the top-down clustering

Local Refinement, and the previous refinement process Global Refinement. It is worth

noting that local refinement may produce lower quality clusters. Our experimental results

described in Section 5.5 show that this deterioration is small.
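The two bounds for local refinement can be checked with a short calculation. The following is a sketch, assuming that refining a cluster with $x$ samples costs on the order of $x^2$ and that producing $k$ clusters takes $k - 1$ bisections.

```latex
\begin{align*}
\text{balanced bisections:}\quad
  & \sum_{i=0}^{\log_2 k} 2^{i}\left(\frac{n}{2^{i}}\right)^{2}
    = n^{2}\sum_{i=0}^{\log_2 k} 2^{-i} < 2n^{2}, \\
\text{one sample split off each time:}\quad
  & \sum_{i=0}^{k-2} (n-i)^{2} \le (k-1)\,n^{2} = O(kn^{2}).
\end{align*}
```

In the balanced case the cost per level halves, so the total is dominated by the first bisection; in the unbalanced case every bisection works on almost all of the samples.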

3.1.3 Further Optimization on Clustering

In this proposal, we use two approaches to further optimize the clusters obtained by

the bottom-up or top-down algorithms. We also compare the optimized results with the

non-optimized results of these algorithms in Section 5.5.

3.1.3.1 Combining k-means with bottom-up or top-down methods

Similar to the standard k-means, the k-means algorithm used in this thesis does

not necessarily find the optimal clusters because it is highly sensitive to the initial

cluster seeds. This observation motivates our further optimization by choosing the results

of bottom-up or top-down algorithms as the initial seeds for k-means. That is, after the

bottom-up or top-down clustering, a k-means method will be invoked and the clusters

produced by the bottom-up or top-down clustering will serve as the initial cluster seeds of

k-means. The rest of the k-means clustering remains the same. This additional k-means

step further refines the clusters by using the more CGH specific distance measures

proposed in this thesis. We use the term top-down+kmeans to represent the optimization

approach that combines the top-down algorithm with the k-means algorithm. Similarly, we

use the term bottom-up+kmeans to represent the combination of the bottom-up algorithm and

the k-means algorithm.










3.1.3.2 Centroid shrinking

The idea of centroid shrinking was first introduced by Robert et al. in [80] to improve

the nearest-centroid classification. The centroids of a training set are defined as the

average expression of each gene. This idea shrinks the centroids of each class towards the

overall centroid after normalizing by the intra-class standard deviation for each genomic

interval. This normalization has the effect of assigning more weight to the genomic

interval whose status is stable within samples of the same class, and thus reduces the

number of features contributing to the nearest centroid calculation. We apply this idea to

achieve further optimization of clustering. The centroids of initial clusters found by the

different clustering methods, i.e. bottom-up, top-down, k-means, bottom-up+kmeans and

top-down+kmeans, are shrunk towards the overall centroid. Then, a standard k-means

using Euclidean distance is invoked to re-cluster the samples using the shrunken centroids

as its initial centroids.

3.2 Results

Experimental setup: We evaluated the quality and the performance of all the

distance/similarity measures and the clustering methods discussed in this thesis. For

evaluation of quality we used different measures belonging to two categories, external and

internal measures. We discuss these measures in detail in Section 3.2.1.

We implemented all four distance measures (Raw, Sim, CosineGaps, CosineNoGaps)

and five clustering algorithms (k-means, top-down, bottom-up, top-down + k-means,

bottom-up + k-means). Thus, we had 20 different combinations. We have also implemented

the centroid shrinking strategy and applied it to each combination. Note that we use the local

refinement strategy (see Section 3.1.2.3) for top-down in our experiments unless otherwise

stated.

We use a dataset consisting of 5,020 CGH samples (i.e., cytogenetic imbalance profiles

of tumor samples) taken from the Progenetix database [3]. These samples belonged to

19 different histopathological cancer types with more than 100 cases and had been coded










Table 3-1. Detailed specification of Progenetix dataset. Term #cases denotes the number of
cases.

ICD-O-3 code   #cases   Code translation
0000/0            110   non-neoplastic or benign
8890/3            118   Leiomyosarcoma, NOS
9510/3            120   Retinoblastoma, NOS
9391/3            126   Ependymoma, NOS
9835/3            128   Acute lymphoblastic leukemia, NOS
9180/3            133   Osteosarcoma, NOS
9836/3            141   Precursor B-cell lymphoblastic leukemia
8144/3            144   Adenocarcinoma, intestinal type
9673/3            171   Mantle cell lymphoma
8010/3            180   Carcinoma, NOS
9732/3            190   Multiple myeloma
8140/0            209   Adenoma, NOS
9500/3            271   Neuroblastoma, NOS
8170/3            286   Hepatocellular carcinoma, NOS
8523/3            310   Infiltrating duct mixed with other types of carcinoma
9680/3            323   Diffuse large B-cell lymphoma, NOS
9823/3            346   B-cell chronic lymphocytic leukemia/small lymphocytic lymphoma
8070/3            657   Squamous cell carcinoma, NOS
8140/3           1057   Adenocarcinoma, NOS


according to the ICD-O-3 system [22]. The subset with the smallest number of samples

consists of 110 non-neoplastic cases, while the one with the largest number of samples,

Adenocarcinoma, NOS (ICD-O 8140/3), contains 1057 cases. The details of this dataset are

listed in Table 3-1. Each sample in the dataset consists of 862 ordered genomic intervals

extracted from 24 chromosomes. Each interval is associated with one of the three values

-1, 1 or 0, indicating loss, gain or no change status of that interval. In principle, our

CGH dataset can be mapped to an integer matrix of size 5,020 x 862. We also use a small

dataset with 2,510 samples by randomly selecting 50% of the entire dataset. This small

dataset is regenerated each time an experiment is run on it.

Our experimental simulations were run on a system with dual 2.59 GHz AMD

Opteron Processors, 8 gigabytes of RAM, and a Linux operating system.










3.2.1 Quality Analysis Measures

In this thesis, we hope to identify disease-related signatures of CGH data by

clustering a large number of samples. We assume that samples belonging to the same

cancer type are homogeneous and should be clustered together. There is a range of

different cluster validation techniques that can be grouped into two categories, external

measure and internal measure [29]. We use both measures to evaluate the quality of the

clusters. An external measure evaluates how well the clusters separate samples that belong

to different cancer types. Thus, an external measure can compare clusters based on different

distance/similarity measures. On the other hand, an internal measure evaluates how well

the clustering algorithm operates on a given distance/similarity measure. This measure

ignores the cancer types of the input samples. Compared with internal measures, external

measures are more reasonable in reflecting the quality of clusters as they take the cancer

types into consideration. Note, however, that an internal measure is a better indicator of quality for

cancer types that have multiple aberration patterns that differ significantly.

External measure: An external measure takes a value in the [0, 1] interval. Higher values of

this function represent better clustering quality. An important note is that this measure is

independent of the underlying distance/similarity measure. Thus, the results of different

distance measures can be compared using external measure.

We use three external measures to evaluate the cluster quality. Let n, m and k denote

the total number of samples, the number of different cancer types and the number of

clusters respectively. Let $a_1, a_2, \ldots, a_m$ denote the number of samples that belong to each

cancer type. Similarly, let $b_1, b_2, \ldots, b_k$ be the number of samples that belong to each

cluster. Let $c_{i,j}$, $\forall i, j$, $1 \le i \le m$ and $1 \le j \le k$, denote the number of samples in the jth

cluster that belong to the ith cancer type. The first external measure used, known as the

Normalized Mutual Information (NMI) [93] function is computed as:

$$\mathrm{NMI} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{k} c_{i,j} \log\left(\frac{n \, c_{i,j}}{a_i b_j}\right)}{\sqrt{\left(\sum_{i=1}^{m} a_i \log\frac{a_i}{n}\right)\left(\sum_{j=1}^{k} b_j \log\frac{b_j}{n}\right)}}$$















FI = k 1 nmx ci~i


The third external measure is known as the Rand Index [78]. In order to compute the

Rand Index measure for a given clustering, two values are calculated.

$f_{00}$ = the number of pairs of samples that have different cancer types and belong to

different clusters.

$f_{11}$ = the number of pairs of samples that have the same cancer type and belong to the

same cluster.

The Rand Index is then computed as:

$$\text{Rand Index} = \frac{f_{00} + f_{11}}{n(n-1)}$$

Unlike other external measures, NMI is computed based on mutual information

I(X, Y) between a random variable X, governing the cluster labels and a random variable

Y, governing the cancer types. It has been argued that the mutual information is a

measure superior to purity or entropy [77]. Moreover, NMI is quite impartial to the

number of clusters [93].
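NMI can be computed directly from the table of counts of cancer types versus clusters. The following is a minimal sketch, assuming the standard definition of NMI, i.e. the mutual information between cancer types and clusters normalized by the geometric mean of their entropies; the function name `nmi` and the nested-list layout of the counts are hypothetical.

```python
import math


def nmi(c):
    """Normalized mutual information from a contingency table.

    c[i][j] is the number of samples of cancer type i placed in cluster j.
    """
    a = [sum(row) for row in c]        # samples per cancer type
    b = [sum(col) for col in zip(*c)]  # samples per cluster
    n = sum(a)
    num = sum(
        c[i][j] * math.log(n * c[i][j] / (a[i] * b[j]))
        for i in range(len(a))
        for j in range(len(b))
        if c[i][j] > 0
    )
    # Both sums below are non-positive, so their product is non-negative.
    den = math.sqrt(
        sum(x * math.log(x / n) for x in a if x > 0)
        * sum(x * math.log(x / n) for x in b if x > 0)
    )
    return num / den if den > 0 else 0.0


print(nmi([[2, 0], [0, 2]]))  # clusters coincide with the cancer types
print(nmi([[1, 1], [1, 1]]))  # clusters carry no information about the types
```

The value is 0 when the clusters are independent of the cancer types and approaches 1 when each cluster corresponds to exactly one cancer type.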

Internal measure: Unlike the external measure, the value of internal measure depends

on the distance/similarity measure. Thus, the internal measures of different clusterings

obtained by different similarity measures are not comparable. Instead, we use this measure

to compare the clusters obtained by applying different clustering methods with the same

similarity function. In this thesis, we implement two internal measures. One is the internal

measure based on compactness (cohesion) [78], the other is the internal measure based on

separation.










Let k denote the total number of clusters. Let $b_1, b_2, \ldots, b_k$ be the number of samples

that belong to each cluster. We use $s_i$ and $C_r$ to represent the ith sample and the rth cluster

respectively. Let $S(s_i, s_j)$ be the function that evaluates the similarity between the ith and

jth sample. The internal measure based on compactness is computed as:





The internal measure based on separation is computed as:



=1k~ C=1,q r b' bk

Since both internal measures are computed with pairwise similarity, higher values of

IC and lower values of IS represent better clustering quality respectively.

3.2.2 Experimental Evaluation

In this section, we applied the combinations of four distance/similarity measures and

five clustering methods over the entire dataset and the small dataset. We compared each

combination according to the qualities of clusters. The cluster results are evaluated using

different external measures. Due to the space limit, we mainly report the results using

NMI and F1-measure in the thesis unless otherwise stated. For the small datasets, which are

randomly generated each time, we repeat our experiments 100 times and report the interval

between the fifth and ninety-fifth percentiles as the error bar.
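The error bar over repeated runs can be sketched as follows. This is a minimal sketch that assumes the nearest-rank convention for percentiles, which is one common choice; the convention actually used for the reported figures is not stated here, and the function names are hypothetical.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # ceiling of p * n / 100
    return ordered[rank - 1]


def error_bar(scores):
    """Return (5th percentile, median, 95th percentile) of repeated-run scores."""
    return percentile(scores, 5), percentile(scores, 50), percentile(scores, 95)


scores = list(range(1, 101))  # stand-in for 100 quality values
print(error_bar(scores))
```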

Evaluation of distance measures: The purpose of this experiment is to compare the

distance/similarity measures discussed in this thesis, namely Raw, Sim, CosineNoGaps
and CosineGaps. In the experiment, we randomly select 50% of the entire dataset as a

small dataset with 2,510 samples. For each distance/similarity measure, we created 2, 4,

8, 16, 32 and 64 clusters using five clustering methods: top-down, bottom-up, k-means,

top-down + k-means, and bottom-up + k-means. This resulted in 6 x 5 = 30 sets of

clusters per measure. We report the highest value of external measure of all these 30 sets

as the best quality of a measure. We repeat this experiment 100 times.










Table 3-2. Highest value of external measures for different distance/similarity measures. All
numbers here are the medians of 100 results.

              Sim     CosineNoGaps   CosineGaps   Raw
NMI           0.368   0.265          0.228        0.239
F1-measure    0.34    0.258          0.215        0.235
Rand Index    0.903   0.899          0.898        0.896

The medians of the 100 highest values for Sim, CosineNoGaps, CosineGaps and Raw

are shown in Table 3-2. The results of both NMI and F1-measure show that Sim

produces the highest quality compared to other distance measures. Sim obtains this

quality with top-down clustering method. CosineNoGaps gives slightly better quality than

the other two measures, Raw and CosineGaps. We conclude that Sim is the most suitable

distance/similarity measure for clustering CGH data.

Evaluation of clustering methods and optimizations: The purpose of this

experiment is to compare the quality of clustering algorithms with a fixed distance/similarity

measure. We create 8, 16, 32 and 64 clusters using different clustering methods with and

without centroid shrinking strategy. We only report the results for Sim due to the

space limitations and because Sim gives the best external measure values among all

distance/similarity measures.

We randomly select 50% of the entire dataset (i.e., 2,510 samples) and cluster

it. We then compute the external measure for the underlying clusters. We repeat this

process 100 times and compute the error bar for the external measure. The error bar

indicates the interval between the 5th and 95th percentiles of the results. Figure 3-4A and Figure 3-4B show

the NMI and F1-measure respectively. The top-down clustering method without centroid

shrinking gives the best quality consistently in both figures. The additional k-means step

in the top-down+k-means method degrades the quality. Centroid shrinking improves

the results when the quality of the clustering method is low. It hurts the quality when

the quality is high, especially when the top-down method is used. This can be explained as

follows. The clustering quality is low when the patients with different cancer types are

clustered together. This usually indicates that different samples in the same cluster can


































[Figure 3-4 plots. Panel A: NMI; panel B: F1-measure. x-axis of both panels: Clustering Algorithm and Number of Clusters (8, 16, 32 and 64 clusters for each of bottom-up, bottom-up+k-means, k-means, top-down and top-down+k-means).]

Figure 3-4. Evaluation of cluster qualities using (A) NMI and (B) F1-measure for different
clustering methods. The fifth and the ninety-fifth percentile of the results are
reported as the error bar.










contain gain, loss, and no-change status for the same genomic interval. Such genomic

intervals can be considered as noise. Centroid shrinking filters them out. However,

centroid shrinking has the limitation that its results can only be followed by a standard

k-means clustering using Euclidean distance. Therefore, the underlying similarity measure

(i.e., Sim) cannot be used after shrinking the centroids. Thus, we conclude that the top-down

method works best in conjunction with the Sim measure. At the same time, the centroid

shrinking strategy does not help the clustering using this combination. The error bars

confirm that the top-down clustering without centroid shrinking works best for the Sim

measure. The error bars show that the top-down and the bottom-up methods are more

stable than the k-means method. This is because k-means is highly sensitive to

the initial seeds that are randomly generated. The NMI value of the top-down method

increases as the number of clusters increases from 8 to 64 in Figure 3-4A. On the other

hand, the F1-measure drops in Figure 3-4B. This is because F1-measure favors coarser

clustering and is biased towards a small number of clusters while NMI is quite impartial

to the number of clusters [93]. We don't see the same effect for other clustering methods

because the large variance in the results of other methods, except bottom-up, hides this

effect. For the bottom-up method with or without centroid shrinking, we can see that the

increase in the quality flattens when the number of clusters increases.

Next, we ran all the mentioned clustering methods for the entire CGH dataset (i.e.,

5,020 samples). Figure 3-5 shows the NMI for Sim. The results confirm the experiments in

Figure 3-4A: 1) Top-down clustering produces the best clusters. 2) The centroid shrinking

strategy does not have a significant impact. 3) Most of the results on the entire dataset

remain within the error intervals. The best clustering quality was obtained when 64

clusters were created. The average cluster size, i.e. number of samples in the cluster, is

78.44 and the standard deviation is 51.03.

In our experiments on the same dataset using Rand Index, we obtained slightly better

results with top-down method. The two described internal measures (compactness and










[Figure 3-5 plot. y-axis from 0 to 0.4; x-axis: Clustering Algorithm and Number of Clusters (8, 16, 32 and 64 clusters for each of bottom-up, bottom-up+kmeans, kmeans, top-down and top-down+kmeans); series: NMI and NMI After Centroid Shrinking.]

Figure 3-5. Cluster qualities of different clustering methods with Sim measure over the
entire dataset. The cluster qualities are evaluated using NMI.



separation) support this conclusion that top-down clustering is the better choice (results

omitted due to space limitation).

Performance issues of top-down clustering: In Section 3.1.2.3, we discussed two

types of top-down methods, top-down method with global refinement and top-down

method with local refinement. Here, we evaluate the quality and running time of these two

strategies. We restrict the similarity measure to Sim as it gives the highest quality. Using

each strategy, we created 2, 4, 8, 16, 32, and 64 clusters for each of the 19 cancer types.

We compute the average internal measure based on compactness of all the cancer types

as the quality of the clusters. We also compute the average time to create clusters as the

running time.

Table 3-3 shows the average quality and running time of two different top-down

methods. The first part of the table indicates that local refinement gives slightly worse

qualities than the global refinement. However, the quality difference is negligible. The

quality of the clusters increases as the number of clusters increases up to 32. The quality









Table 3-3. Comparison of average quality and running time of top-down methods with
global and local refinement. (Here L and G indicate local and global refinement
respectively.)

                        Number of Clusters
                  2       4       8      16       32       64
Quality      L    703     797     892     927      947      904
             G    730     839     936     983     1017      971
Time (sec)   L    0.1     0.3     1.7     3.1      6.5      9.8
             G    3.4    22.9   129.7   329.4   1151.2   2018.2

starts to plateau or drop after this point. This indicates that, in general, as the number

of clusters increases, the clusters are more compact and the intra-similarity of clusters

increases. However, when the number of clusters becomes too large compared to the

size of dataset, some closely similar samples will be forced into different clusters, which,

instead, reduces the intra-similarity of clusters. The second part of the table indicates

that the average running time for global refinement is much higher than local refinement.

This observation is consistent with our analysis of time complexity in Section 3.1.2.3.

Considering that local refinement gives only slightly worse qualities but runs much faster

than global refinement, we use the former method throughout this chapter.

3.3 Conclusion

We considered the problem of clustering Comparative Genomic Hybridization

(CGH) data of a population of cancer patient samples. We developed a systematic way

of placing patients with same cancer types in the same cluster based on their CGH

patterns. We focused on distance-based clustering strategies. We developed three pairwise

distance/similarity measures, namely Raw, Cosine, and Sim. Raw measure disregards

correlation between contiguous genomic intervals. It compares the aberrations in each

genomic interval separately. The remaining measures assume that consecutive genomic

intervals may be correlated. Cosine maps pairs of CGH samples into vectors in a high

dimensional space and measures the angle between them. Sim measure counts the number

of independent common aberrations. We employed our distance/similarity measures on










three well known clustering algorithms, bottom-up, top-down, and k-means, with and

without centroid shrinking.

In our experiments using classified disease entities from the Progenetix database, the

highest clustering quality was achieved using Sim as the similarity measure and top-down

as the clustering strategy. This observation fits with the theory that contiguous runs of

genomic aberrations arise around a point-like target (e.g., oncogene), and that consecutive

genomic intervals cannot be considered as independent of each other.









CHAPTER 4
IMPROVE CLUSTERING USING MARKERS

We observe that the Sim measure is affected by noisy aberrations in CGH data since

it depends on only a pair of samples. In this chapter, we develop a dynamic programming

algorithm to identify a small set of important genomic intervals called markers. The

advantage of using these markers is that the potentially noisy genomic intervals are

excluded from clustering. We also develop two clustering strategies using these markers.

The first one, prototype-based approach, maximizes the support for the markers. The

second one, distance-based approach, develops a new similarity measure called RSim and

iteratively refines clusters by maximizing the RSim measure between samples in the same

cluster. Our results demonstrate that the markers we found represent the aberration

patterns of cancer types well and they improve the quality of clustering.

4.1 Detection of Markers

Most cancers result from genomic instability and display various genomic aberrations.

A recurrent alteration is a set of aberrations common to sufficiently many CGH samples.

The recurrent alterations can be used to characterize the aberration pattern of samples.

Due to the correlation between adjacent genomic intervals, recurrent regions can be

represented using a small number of genomic intervals within these regions. We call these

genomic intervals markers. Each marker is represented by two numbers $\langle p, q \rangle$, where p

and q denote the position and the type of aberration respectively.

Consider a set S of N CGH samples $\{s_1, s_2, \ldots, s_N\}$. Let $x_d^j$ denote the status value for

sample j at genomic interval d, $\forall d$, $1 \le d \le D$, where D is the number of intervals. Let

$s_j[u, v]$ be the segment of $s_j$ that starts at the uth interval and ends at the vth interval.

We use the term segment to represent a contiguous block of aberrations of the same type.

Formally, a list of status values $x_u^j, x_{u+1}^j, \ldots, x_v^j$, for $1 \le u \le v \le D$, defines a segment if

genomic intervals u through v are in the same chromosome, the values from $x_u^j$ to $x_v^j$ are

all gains or all losses, and $x_{u-1}^j$ and $x_{v+1}^j$ are different from $x_u^j$. For example, in Figure 4-1,









Genomic Intervals   1  2  3  4  5  6  7  8  9 10 11 12

X                   0  1  1  1  0  0 -1 -1  0  1 -1 -1

Y                   0  0  1  1  1  0  0  0  0  1  1  1



Figure 4-1. Two CGH samples X and Y with the values of genomic intervals listed in the
order of positions. The segments are underlined.

sample X contains four segments. The first and third segments are gain type while the

second and fourth segments are loss type.
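
The segment definition can be illustrated with a short Python sketch. It is not taken from the thesis: the function name `find_segments`, the optional `chrom` argument and the encoding of statuses as 1 (gain), -1 (loss) and 0 (no change) are assumptions made for illustration. When `chrom` is omitted, all intervals are assumed to lie in one chromosome.

```python
from typing import List, Optional, Tuple

# (start, end, type): 1-based inclusive interval indices; type is 1 (gain) or -1 (loss).
Segment = Tuple[int, int, int]


def find_segments(status: List[int], chrom: Optional[List[int]] = None) -> List[Segment]:
    """Return the maximal runs of identical non-zero status values within a chromosome."""
    if chrom is None:
        chrom = [0] * len(status)  # assumption: a single chromosome
    segments: List[Segment] = []
    start = 0  # 0-based index where the current run begins
    for d in range(1, len(status) + 1):
        # A run ends at the end of the sample, at a status change,
        # or at a chromosome boundary.
        run_ends = (d == len(status)
                    or status[d] != status[start]
                    or chrom[d] != chrom[start])
        if run_ends:
            if status[start] != 0:
                segments.append((start + 1, d, status[start]))
            start = d
    return segments


# Sample X from Figure 4-1.
X = [0, 1, 1, 1, 0, 0, -1, -1, 0, 1, -1, -1]
segments_X = find_segments(X)
```

For sample X this yields four segments, gain on intervals 2 to 4, loss on 7 to 8, gain on 10 and loss on 11 to 12, which matches the description of Figure 4-1.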

Let $\{m_t = \langle p_t, q_t \rangle \mid 1 \le t \le R\}$ be a set of markers that are ordered along the

genomic intervals, i.e. $p_1 < p_2 < \dots < p_R$. We define the support of $s_j$ to $m_t$ as $a(s_j, m_t)$.

Here, $a(s_j, m_t) = 1$ only if both of the following conditions are satisfied:

1. Support: There exists a segment $s_j[u, v]$ overlapping with $m_t$, i.e. $u \le p_t \le v$, and the

type of $s_j[u, v]$ is the same as that of $m_t$, i.e. $x_u^j = q_t$.

2. Uniqueness: There exists no other marker $m_{t'}$, $t' < t$, in the same chromosome band

such that $u \le p_{t'} \le v$, $x_u^j = q_{t'}$ and $a(s_j, m_{t'}) = 1$.

Otherwise, $a(s_j, m_t) = 0$. We say $s_j$ supports $m_t$ or $m_t$ covers $s_j$ if $a(s_j, m_t) = 1$. We

define the support value of marker $m_t$ as the sum of its support from all the samples.

Formally, $Support(m_t) = \sum_{j=1}^{N} a(s_j, m_t)$.

The intuition behind these two conditions is as follows. 1. Support: The support

value of a marker counts the number of samples that share the same aberration status

as the type of this marker. Large support value for a marker implies that that marker is

important in characterizing the aberration pattern of samples. The first condition ensures

that a sample supports a marker only if it has the same aberration as that marker in the

position specified by that marker. 2. Uniqueness: The aberrations in the same segment

may correspond to a single aberration that spread to neighboring genomic intervals [78].

The second condition forces a segment overlapping with multiple markers of the same

aberration type to support only one of those markers.
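
The two conditions can be sketched in Python as follows. This is an illustrative reading of the definition, not the thesis implementation: segments are assumed to be given as `(start, end, type)` triples with 1-based positions, markers as `(position, type)` pairs sorted by position, and the names `sample_support` and `support_values` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (start, end, type), 1-based inclusive
Marker = Tuple[int, int]        # (position p, type q)


def sample_support(segments: List[Segment], markers: List[Marker]) -> List[int]:
    """Return a(s_j, m_t) for every marker; markers must be sorted by position."""
    used = set()  # segments that already support an earlier marker
    result = []
    for p, q in markers:
        hit = 0
        for seg in segments:
            u, v, kind = seg
            # Support condition: the segment contains p and has the marker's type.
            # Uniqueness condition: the segment does not back an earlier marker.
            if u <= p <= v and kind == q and seg not in used:
                used.add(seg)
                hit = 1
                break
        result.append(hit)
    return result


def support_values(all_segments: List[List[Segment]], markers: List[Marker]) -> List[int]:
    """Support(m_t): the sum of a(s_j, m_t) over all samples."""
    totals = [0] * len(markers)
    for segments in all_segments:
        for t, value in enumerate(sample_support(segments, markers)):
            totals[t] += value
    return totals


# Segments of samples X and Y from Figure 4-1.
seg_X = [(2, 4, 1), (7, 8, -1), (10, 10, 1), (11, 12, -1)]
seg_Y = [(3, 5, 1), (10, 12, 1)]
# Hypothetical markers: two gain markers inside the same region and one loss marker.
markers = [(3, 1), (4, 1), (11, -1)]
supports = support_values([seg_X, seg_Y], markers)
```

With these hypothetical markers, the gain marker at interval 4 receives no support because the segments that contain it already support the marker at interval 3.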










The marker detection problem can be formally defined as follows. Given a set of CGH

samples $S = \{s_1, s_2, \dots, s_N\}$ and a positive integer $R$, the goal is to find the set of $R$

markers $M = \{m_1, m_2, \dots, m_R\}$, $p_1 < p_2 < \dots < p_R$, such that the sum of support

of markers in $M$, i.e. $\sum_{t=1}^{R} Support(m_t)$, is maximized. Next, we develop a dynamic

programming algorithm to solve this problem optimally.

Let $O(d, r) = \sum_{t=1}^{r} Support(m_t)$, for $1 \le p_t \le d \le D$, $\forall t \in [1, r]$, denote the largest

possible support that $r$ markers can get from the genomic intervals in the range [1..d].

Here, [1..d] denotes the integers $1, 2, \dots, d$. $O(1, 1)$ = support of the single marker at the

first genomic interval. The value of $O(d, r)$ in general, where $1 \le r \le R$, $r \le d \le D$, can

be computed as the maximum of two possible cases.

Case 1: There exists a marker $m_r$ at the $d$th genomic interval. In this case, $O(d, r)$

can be computed as the maximum sum of support of $m_r$ and the optimal value of

locating $r - 1$ markers. Formally, it can be written as

$O(d, r) = \max_{b-1 \le d' \le d-1} \{O(d', r-1) + Support(m_r)\}$,

where $b$ denotes the least genomic interval that is in the same

chromosome as the $d$th genomic interval. Note that $O(d', r-1)$ may correspond to

a different set of $r - 1$ markers for different values of $d'$. The type of marker $m_r$ can

be either gain or loss. We choose the marker type as the one that leads to a larger

$Support(m_r)$ value.

Case 2: There is no marker at the $d$th genomic interval. $O(d, r) = O(d-1, r)$ in this

case.

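Thus, $O(d, r)$ can be computed using the following recursive equation:

$$
O(d, r) = \max \left\{
\begin{array}{ll}
O(d-1, r-1) + Support(m_r) & \\
O(d-2, r-1) + Support(m_r) & \text{if } m_r \text{ appears} \\
\quad \vdots & \text{at interval } d \\
O(b-1, r-1) + Support(m_r) & \\
O(d-1, r) & \text{otherwise}
\end{array}
\right.
$$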









The marker set $M$ that leads to $O(D, R)$ corresponds to $R$ markers with the largest

sum of supports. We call the above approach a fixed number of markers approach.

An important feature of the dynamic programming approach is that optimal solutions

to subproblems are retained so as to avoid recomputing their values [67]. We construct a

$D \times R$ matrix with cell $(d, r)$ storing the optimal value of $O(d, r)$. An iterative program

is implemented to fill this matrix. For each cell $(d, r)$, we need to revisit cell $(g, r-1)$,

$b - 1 \le g \le d - 1$, which takes constant time that is proportional to the average length

of a chromosome. Besides, we need $O(N)$ time to compute $Support(m_r)$. So the time

complexity of filling one cell is $O(N)$. The overall time complexity of filling the whole

matrix would be $O(DNR)$.
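
The recurrence can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the thesis implementation: every table cell keeps one marker list, the total support is recomputed from scratch for each candidate (so the sketch does not reach the stated time complexity), and the names `total_support`, `find_markers` and `chrom_start` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (start, end, type), 1-based inclusive
Marker = Tuple[int, int]        # (position, type)


def total_support(all_segments: List[List[Segment]], markers: List[Marker]) -> int:
    """Sum of Support(m_t); each segment backs at most one marker (uniqueness)."""
    total = 0
    for segments in all_segments:
        used = set()
        for p, q in markers:
            for seg in segments:
                if seg[0] <= p <= seg[1] and seg[2] == q and seg not in used:
                    used.add(seg)
                    total += 1
                    break
    return total


def find_markers(all_segments, chrom_start, D, R):
    """Fill the table O(d, r) and return (value, markers) for O(D, R).

    chrom_start[d] is the first interval (1-based) of the chromosome that
    holds interval d; index 0 is unused.
    """
    # best[d][r] = (value, markers) for r markers placed within intervals 1..d
    best = [[None] * (R + 1) for _ in range(D + 1)]
    for d in range(D + 1):
        best[d][0] = (0, [])
    for d in range(1, D + 1):
        b = chrom_start[d]
        for r in range(1, min(d, R) + 1):
            candidates = []
            if best[d - 1][r] is not None:      # case 2: no marker at interval d
                candidates.append(best[d - 1][r])
            for g in range(b - 1, d):           # case 1: marker m_r at interval d
                if best[g][r - 1] is None:
                    continue
                previous = best[g][r - 1][1]
                for q in (1, -1):               # try both marker types
                    chosen = previous + [(d, q)]
                    candidates.append((total_support(all_segments, chosen), chosen))
            best[d][r] = max(candidates, key=lambda c: c[0])
    return best[D][R]


# Samples X and Y from Figure 4-1, treated as one chromosome of 12 intervals.
seg_X = [(2, 4, 1), (7, 8, -1), (10, 10, 1), (11, 12, -1)]
seg_Y = [(3, 5, 1), (10, 12, 1)]
value, chosen_markers = find_markers([seg_X, seg_Y], [0] + [1] * 12, D=12, R=2)
```

On the two samples of Figure 4-1 with R = 2, the sketch places gain markers at intervals 3 and 10, each supported by both samples.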

Adaptive number of markers: The above approach finds the best combination of

R markers when R is given. However, the number of markers, usually, is not known

in advance. We modify the approach to adaptively determine the number of markers.

Generally, markers are genomic intervals that are supported by sufficiently many segments

contained in a set of samples. Here, we define the segment coverage of markers as the ratio

of segments that support the markers to the total number of segments contained in

the samples. Formally, given $R$ markers found in a set $S$ of CGH samples, we define

$SC(R) = \frac{\sum_{t=1}^{R} Support(m_t)}{b}$, where $b$ denotes the total number of segments in $S$. We adaptively

determine the number of markers as below. Given a threshold $\sigma$, where $0 \le \sigma \le 1$, we

find the minimum $R$ markers whose segment coverage is greater than or equal to $\sigma$. That is,

$R = \arg\min_R (SC(R) \ge \sigma)$. Here, $\sigma$ indicates the fraction of segments that are relevant to

the aberration patterns of samples. Therefore, it determines the number of markers that

appropriately capture the aberration patterns. The value of $\sigma$ is given by the users.
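
The adaptive choice of R can be sketched as a simple search over increasing R. The sketch below is an assumption-laden illustration: `total_support_for` is a hypothetical callable that returns the optimal total support for a given number of markers (for example, the value of O(D, R) from the dynamic program), and `max_markers` is an added safety bound that is not part of the original description.

```python
from typing import Callable


def adaptive_marker_count(total_support_for: Callable[[int], int],
                          num_segments: int,
                          sigma: float,
                          max_markers: int) -> int:
    """Return the smallest R whose segment coverage SC(R) reaches sigma."""
    if num_segments <= 0:
        raise ValueError("the samples contain no segments")
    for R in range(1, max_markers + 1):
        coverage = total_support_for(R) / num_segments  # SC(R)
        if coverage >= sigma:
            return R
    raise ValueError("coverage threshold not reached within max_markers")


# Hypothetical optimal supports for R = 1, 2, 3 on samples with 6 segments in total.
optimal_support = {1: 2, 2: 4, 3: 5}
chosen_R = adaptive_marker_count(optimal_support.get, num_segments=6,
                                 sigma=0.5, max_markers=3)
```

In this toy setting one marker covers 2 of 6 segments, which is below the threshold of 0.5, while two markers cover 4 of 6, so the search stops at R = 2.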

The number of markers chosen is no longer fixed. Figure 4-2 presents an example

of applying adaptive number of markers approach over CGH data of cancer type,

Retinoblastoma, NOS (ICD-O 9510/3) [22], which contains 120 samples with each

sample including 862 genomic intervals. In this case, the parameter $\sigma$ is set to 0.5 and four










[Plot: Retinoblastoma, NOS (ICD-O 9510/3), 4 Markers. Horizontal axis: Genomic Intervals.]

Figure 4-2. The CGH data plot of cancer type, Retinoblastoma, NOS (ICD-O 9510/3),
with 120 samples and each sample containing 862 genomic intervals. The
genomic intervals with gain and loss status are plotted with green and red
colors respectively. The genomic intervals with no change status are not
plotted. Four markers are found at genomic intervals 52, 69, 287 and 690 using the
adaptive number of markers approach with $\sigma$ set to 0.5. The markers are shown
using vertical lines. The types of four markers are gain, gain, gain and loss
respectively.


markers are found at genomic intervals 52, 69, 287 and 690 with types gain, gain, gain

and loss respectively. Please note that two gain markers are found at nearby genomic

intervals, 52 and 69, because they are in different chromosomes. A segment cannot span

two chromosomes, so one sample can support both of them at the same time.

4.2 Prototype Based Clustering

Given a set of CGH samples, our marker detection technique finds a set of markers

that characterize the aberration pattern of the samples in that set. These markers can be

considered as the prototype of the samples. In this section, we develop a prototype-based

clustering algorithm to partition the dataset into subsets such that each subset has a

prototype common to the samples in that subset. It is similar to the k-means algorithm










in spirit. It starts with a random partition of samples. It then iteratively maximizes an

objective function, which we call cohesion function, in two steps. We discuss the cohesion

function later. The two steps are described as follows.

refinement step: For each cluster, the dynamic programming technique developed

in Section 4.1 is used to identify the optimal set of markers. These markers serve as

the prototype of this cluster.

reassignment step: Each sample is assigned to the cluster whose prototype is

supported the most by that sample.

Essentially, both steps optimize a cohesion function alternately until the value of this

function does not change, i.e. converges to a stable value. Next, we discuss the cohesion

function in detail and prove that our clustering algorithm converges.

Given a set of CGH samples $S = \{s_1, s_2, \dots, s_N\}$, let $K$ denote the number of

clusters. Let $c_i$ denote the $i$th cluster. A clustering of samples can be represented by

an encoding function $f : s_j \to [1..K]$ that maps sample $s_j$ to cluster $c_{f(s_j)}$. Let $R$ be

the number of markers in each cluster. Let $M_i = \{m_{i,1}, m_{i,2}, \dots, m_{i,R}\}$ denote the set of

markers (i.e., prototype) for cluster $i$, where $1 \le i \le K$, $1 \le t \le R$ and $m_{i,t}$ denotes

the $t$th marker in the $i$th cluster. Each marker $m_{i,t}$ is a tuple $\langle p_{i,t}, q_{i,t} \rangle$, where $p_{i,t}$

and $q_{i,t}$ denote the position and type of this marker respectively. Let $\Phi$ denote the set of

prototypes $M_1, M_2, \dots, M_K$. We define a cohesion function of clustering as below:

$$cohesion(f, \Phi) = \sum_{i=1}^{K} \sum_{s_j \in c_i} \sum_{t=1}^{R} a(s_j, m_{i,t})$$

where $a(s_j, m_{i,t})$ is defined as in Section 4.1. Essentially, the cohesion function

computes the intra-cluster similarity between samples and cluster prototypes.

In the refinement step, we optimize the cohesion function by refining the prototype

(markers) of clusters given that samples are partitioned into K clusters. For cluster i, let

$O_i(d, r) = \sum_{t=1}^{r} Support(m_{i,t}) = \sum_{t=1}^{r} \sum_{s_j \in c_i} a(s_j, m_{i,t})$ denote the largest support that $r$

markers can get from the genomic intervals in the range [1..d]. Thus the optimal cohesion

function can be written as $cohesion(f, \Phi^*) = \sum_{i=1}^{K} O_i(D, R)$, where $\Phi^*$ denotes the set

of refined prototypes and $\Phi^* = \arg\max_{\Phi} cohesion(f, \Phi)$. The dynamic programming

technique described in Section 4.1 is used to compute $O_i(D, R)$ for $1 \le i \le K$ and, in this

way, the cohesion function is optimized.

In the reassignment step, we reassign the sample $s_j$ to the $i$th cluster whose prototype

$M_i$ is supported the most by $s_j$, i.e. $M_i$ covers the largest number of segments in $s_j$. This

is because, otherwise, the cohesion function could always be improved by letting $f(s_j) = i$.

Formally, $f^* = \arg\max_f cohesion(f, \Phi)$, where $f^*$ denotes the new encoding function

after the reassignment step.
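
The alternation between the two steps can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the prototype fitting and the support computation are passed in as hypothetical callables (`fit_prototype`, `support_of`), the initial partition is given explicitly instead of being drawn at random, and the toy functions at the bottom use a single 0-based marker per cluster as a stand-in for the dynamic program of Section 4.1.

```python
def prototype_clustering(samples, initial_labels, K, fit_prototype, support_of,
                         max_iter=100):
    """Alternate refinement and reassignment until the cohesion stops changing."""
    labels = list(initial_labels)
    prototypes, cohesion, previous = [], 0, None
    for _ in range(max_iter):
        # Refinement step: recompute the prototype of every cluster.
        prototypes = [fit_prototype([s for s, c in zip(samples, labels) if c == i])
                      for i in range(K)]
        # Reassignment step: move each sample to the prototype it supports most.
        labels = [max(range(K), key=lambda i: support_of(s, prototypes[i]))
                  for s in samples]
        cohesion = sum(support_of(s, prototypes[c]) for s, c in zip(samples, labels))
        if cohesion == previous:
            break
        previous = cohesion
    return labels, prototypes, cohesion


def best_single_marker(cluster):
    """Toy stand-in for the dynamic program: the one marker most samples share."""
    if not cluster:
        return None
    candidates = [(d, q) for d in range(len(cluster[0])) for q in (1, -1)]
    return max(candidates, key=lambda m: sum(s[m[0]] == m[1] for s in cluster))


def toy_support(sample, marker):
    """1 if the sample has the marker's status at the marker's position, else 0."""
    return int(marker is not None and sample[marker[0]] == marker[1])


toy_samples = [[1, 0], [1, 0], [0, -1], [0, -1]]
labels, prototypes, cohesion = prototype_clustering(
    toy_samples, [0, 0, 0, 1], K=2,
    fit_prototype=best_single_marker, support_of=toy_support)
```

On the toy data the loop separates the two gain samples from the two loss samples after one reassignment and stops once the cohesion value repeats.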

Proof of convergence: It can be seen that both steps, the refinement and reassignment steps,

are connected through the cohesion function they alternately optimize. The dynamic

programming in the refinement step optimizes the cohesion with respect to $\Phi$. At iteration $(h)$
we have

$$cohesion(f^{(h)}, \Phi^{(h+1)}) \ge cohesion(f^{(h)}, \Phi^{(h)})$$

On the other hand, the reassignment step optimizes the cohesion based on $f$, i.e.

$$cohesion(f^{(h+1)}, \Phi^{(h+1)}) \ge cohesion(f^{(h)}, \Phi^{(h+1)})$$

Put together, our algorithm generates a sequence $(f^{(h)}, \Phi^{(h)})$, $h \ge 0$, that increases the

cohesion function as

$$cohesion(f^{(h+1)}, \Phi^{(h+1)}) \ge cohesion(f^{(h)}, \Phi^{(h)})$$

Since the maximum value of the cohesion function is finite given a set of samples, our

algorithm converges to a stable value at the end.

It is worth noting that our algorithm is a k-means-type algorithm. Let $x_1, x_2, \dots, x_N \in R^D$

be a finite number of samples. To partition patterns into $K$ partitions, $2 \le K \le N$, a

k-means-type algorithm tries to solve the following mathematical problem:

$$P: \text{minimize } f(W, Z) = \sum_{i=1}^{K} \sum_{j=1}^{N} w_{ij} D(x_j, z_i)$$

subject to $\sum_{i=1}^{K} w_{ij} = 1$, $j = 1, 2, \dots, N$, and $w_{ij} = 0$ or $1$ for $i = 1, 2, \dots, K$ and

$j = 1, 2, \dots, N$,

where $W = [w_{ij}]$ is a $K \times N$ matrix, $Z = [z_1, z_2, \dots, z_K]$ and $z_i \in R^D$ is the

center of cluster $i$, and $D(x_j, z_i)$ is some dissimilarity measure between $x_j$ and $z_i$. If we define

$D(x_j, z_i) = -\sum_{t=1}^{R} a(x_j, m_{i,t})$, minimizing $f$ is the same as maximizing the cohesion function, so the maximization of the cohesion function is equivalent to

problem P. It has been proved that a k-means-type algorithm converges to a partial

optimal solution of problem P in a finite number of iterations [69].

Under certain conditions, the k-means-type algorithm may fail to converge to a

local minimum. Let $(W^*, Z^*)$ be a partial optimal solution of problem P and $A(W^*) =$

$\{Z : Z \text{ minimizes } f(W^*, Z), Z \in R^{DK}\}$. A sufficient condition for $W^*$ to be a local

minimum of problem P is that $A(W^*)$ is a singleton [69]. Next, we show that practically

the prototype-based clustering algorithm usually converges to a local optimum. In the

prototype-based clustering algorithm, $A(W^*)$ represents the sets of markers identified in

each cluster for a certain clustering result $W^*$. When the number of markers is small as

compared to the number of all intervals, each chromosome arm often contains at most one

marker. Since alterations in different chromosome arms are independent of each other, the

marker identified in a chromosome arm should be the one with the largest support and no

other marker is identified in the same chromosome arm. Therefore, the optimal marker

in each chromosome arm can be a singleton if we assume no two markers have the same

support value. This makes the set of markers for each cluster a singleton. Therefore, the

prototype-based clustering algorithm often converges to a local optimum.

4.3 Pairwise Similarity Based Clustering

Pairwise similarity based algorithms partition the samples into clusters so that the

similarity between samples from the same cluster is larger than the similarity between

samples of different clusters. This generally requires calculating the similarity between

two samples. In Chapter 3, we proposed a segment-based similarity measure, called Sim,










for CGH data. In this section, we develop a new similarity measure, called RSim, which

avoids noisy aberrations with the help of markers.

Let $D$ denote the number of genomic intervals of each sample. Let $s_i = x_1^i, x_2^i, \dots, x_D^i$

and $s_j = x_1^j, x_2^j, \dots, x_D^j$ be two CGH samples. Here, $x_d^i$ and $x_d^j$ denote the value or status

of the $d$th genomic interval of $s_i$ and $s_j$, respectively.

We first summarize the Sim measure we developed for computing the similarity of

two CGH samples in Chapter 3. We call two segments from two samples overlapping if they

have at least one common genomic interval of the same type. Sim constructs maximal

segments by combining as many contiguous aberrations of the same type as possible.

Thus, each sample translates into a sequence of segments. For example, in Figure 4-1,

$s_i$ and $s_j$ (samples X and Y) are two samples that have four and two segments respectively. After this

transformation, Sim computes the similarity between two CGH samples as the number

of overlapping segment pairs. This is because each overlap may indicate a common

point-like aberration in both samples which potentially led to the overlapping segments.

In Figure 4-1, the first segment of si is overlapping with the first segment of sj. Similarly

the third segment of si is overlapping with the second segment of sj. Sim computes the

similarity between two samples based on the genomic aberrations local to these samples.

Thus, Sim cannot distinguish the true aberrations from noisy ones. As a result, Sim is

a local measure that is easily biased by the noise. Next, we develop a new approach that

addresses this limitation.

We propose to employ markers to eliminate the contribution of noise to the pairwise

similarity. We develop a refined Sim measure, called RSim, as follows. Let $M =$

$\{m_1, m_2, \dots, m_R\}$, $p_1 < p_2 < \dots < p_R$, denote the set of markers that are globally

identified in all samples. The markers imply the important genomic intervals that are

associated with the aberration patterns of samples. Given two CGH samples $s_i$ and $s_j$,

RSim computes the similarity between them as the number of overlapping segment pairs,

such that these segments satisfy both of the following two conditions:










1. At least one of the markers in M~ is contained in both segments.

2. Both of the overlapping segments have the same aberration type as the marker

they both contain.

Formally, let $x_u^i, x_{u+1}^i, \dots, x_v^i$ and $x_{u'}^j, x_{u'+1}^j, \dots, x_{v'}^j$ be a pair of segments from samples

$s_i$ and $s_j$, respectively. RSim counts this pair of segments as one only if

1. there exists a marker $m_t = \langle p_t, q_t \rangle$, $m_t \in M$, such that $u \le p_t \le v$ and

$u' \le p_t \le v'$, and

2. the aberration type of both segments is the same as that of $m_t$, i.e. $x_u^i = x_{u'}^j = q_t$.

Unlike Sim, RSim does not consider the overlapping segments that do not intersect with

any marker. This is because RSim considers such segments as noise. For example, assume

that there are two markers in Figure 4-1, one at the 3rd and the other at the 11th genomic

interval. Then RSim measure for si and sj is computed as one, whereas Sim measure is

two.
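
The example can be checked with a short Python sketch of both measures. It is an illustration under stated assumptions rather than the thesis code: statuses are encoded as 1 (gain), -1 (loss) and 0 (no change), all intervals are assumed to lie in one chromosome, the helper names are hypothetical, and the types of the two example markers (gain at interval 3, loss at interval 11) are assumed because the text does not state them.

```python
from itertools import groupby


def find_segments(status):
    """Maximal runs of identical non-zero values as (start, end, type), 1-based."""
    segments, pos = [], 1
    for value, run in groupby(status):
        length = len(list(run))
        if value != 0:
            segments.append((pos, pos + length - 1, value))
        pos += length
    return segments


def overlapping_pairs(si, sj):
    """Yield pairs of same-type segments that share at least one interval."""
    for (u, v, a) in find_segments(si):
        for (u2, v2, b) in find_segments(sj):
            if a == b and max(u, u2) <= min(v, v2):
                yield (u, v, u2, v2, a)


def sim(si, sj):
    """Sim: the number of overlapping segment pairs."""
    return sum(1 for _ in overlapping_pairs(si, sj))


def rsim(si, sj, markers):
    """RSim: overlapping pairs that both contain a marker of their own type."""
    return sum(1 for (u, v, u2, v2, a) in overlapping_pairs(si, sj)
               if any(q == a and u <= p <= v and u2 <= p <= v2 for p, q in markers))


X = [0, 1, 1, 1, 0, 0, -1, -1, 0, 1, -1, -1]
Y = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
example_markers = [(3, 1), (11, -1)]  # assumed types: gain at 3, loss at 11
sim_value = sim(X, Y)
rsim_value = rsim(X, Y, example_markers)
```

Only the pair of gain segments around interval 3 contains a marker in both samples, so RSim counts one pair while Sim counts two.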

An important observation on RSim is as follows. As the number of markers

approaches the number of genomic intervals in the CGH data, RSim becomes equivalent

to the Sim measure. This is because all segments in the CGH data will overlap with a

marker and contribute to the similarity. Thus Sim is a special case of RSim when noisy

aberrations are not eliminated.

Our previous work in Chapter 3 showed that Sim works best when combined with the

top-down algorithm compared to other popular clustering algorithms such as bottom-up

and k-means. In this chapter, we propose to use the top-down clustering method with

RSim as the pairwise similarity measure for pairs of CGH samples.

Note that it is possible to extend the Raw measure in Chapter 3 by taking the

markers into account. The extended Raw measure works as follows. For each pair of

samples, we compute the similarity between them as the number of genomic intervals

that meet the following two conditions. First, both samples have the same aberration

type (gain or loss) at this interval. Second, one marker of the same type as both samples










appears at this interval. Our experiments (results omitted due to space limit) show that

although the markers improve the original Raw measure, RSim is always superior to this

measure.
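
The extended Raw measure described above can be sketched in a few lines of Python. The function name `extended_raw`, the status encoding (1 for gain, -1 for loss, 0 for no change) and the example markers are assumptions made for illustration.

```python
def extended_raw(si, sj, markers):
    """Count intervals where both samples share an aberration type and a marker
    of that same type sits at the interval (positions are 1-based)."""
    marker_type = dict(markers)  # position -> type
    count = 0
    for d, (a, b) in enumerate(zip(si, sj), start=1):
        if a == b and a != 0 and marker_type.get(d) == a:
            count += 1
    return count


X = [0, 1, 1, 1, 0, 0, -1, -1, 0, 1, -1, -1]
Y = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
raw_value = extended_raw(X, Y, [(3, 1), (11, -1)])
```

For the samples of Figure 4-1 only interval 3 qualifies: both samples show a gain there and the assumed marker at interval 3 is a gain marker, whereas at interval 11 the two samples disagree.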

4.4 Experimental Results

Dataset: With more than 12,000 cases [1], the largest resource for published CGH data

can be found in the Progenetix database [3] (http://www.progenetix.net). We use three

different datasets, dissimilarDS, interDS and similarDS, taken from the Progenetix database.

Each dataset contains more than 800 CGH samples (i.e. cytogenetic imbalance profiles

of tumor samples) from four different histopathological cancer types. Each sample has

been coded according to the ICD-O-3 system [22] and consists of 862 ordered genomic

intervals extracted from 24 chromosomes. In principle, each dataset can be mapped to an

integer matrix of size $N \times 862$, where $N$ denotes the number of samples. The difference

between these three datasets is the divergence of aberration patterns in distinct cancer types.

In dissimilarDS, the samples of different cancer types contain diverse aberration patterns

that are easily distinguished from each other. The samples in similarDS contain similar

aberration patterns. The interDS dataset is at an intermediate degree. The choices of

these datasets are based on a visual inspection of the matrices for each of the cancer types.

System specifications: Our experimental simulations were run on a system with dual

2.59 GHz AMD Opteron Processors, 8 gigabytes of RAM, and a Linux operating system.

4.4.1 Quality of Clustering

Measuring the quality of clustering is a challenging task as it is an unsupervised

learning task. There are a number of internal and external cluster validation techniques

that are described in the literature. In the following, we describe two measures that can be

used for evaluating the clustering.

1. Coverage Measure (CM):

An internal measure evaluates the quality of clusters if the class labels of samples

are not known a priori. One possible internal measure is the cohesion function defined in









Section 4.2. This function measures the total support of markers over each cluster. We

use the term Coverage Measure (CM) to denote this measure. Markers with high support

potentially convey some meaningful biological information and potentially can serve as

the first step for further analysis, such as the identification of new oncogenes and tumor

suppressor genes. A group of markers can be considered as biologically relevant if they

cover most of the segments in all the patients.

2. Normalized Mutual Information (NMI): One way to measure the quality of

clustering is to see if each cluster predominantly consists of samples from a single cancer

type. This clearly makes the assumption that this information is available (as was the

case for datasets used in this chapter) and those samples from the same cancer type will

generally be similar to each other. The external measure, known as the Normalized Mutual

Information, described in Section 3.2.1 can be used for this purpose.

4.4.2 Quality of Markers

Measuring the quality (or the biological relevance) of the markers identified for the

clusters is also important. Here, we develop a measure to address this. We combine all the

markers found in each cluster. For each marker in combined marker set, we first compute

the ratio of the samples that support it among the samples in each cancer type. Thus, if

there are T cancer types in the dataset, T values are computed for each marker. We define

the maximum of these ratios as the biological relevance of this marker. This is because

a larger ratio of support from one cancer type indicates that the marker better captures

the aberration pattern of that cancer type. Therefore, this marker is biologically relevant

in that cancer type. We use the term Global Maximum Support (GMS) to represent

this measure. Formally, let $M = \{m_1, m_2, \dots\}$ denote the set of markers. Let $C_t$,

$1 \le t \le T$, denote the set of samples of the $t$th of the $T$

different cancer types in the dataset. The GMS measure for marker $m_i \in M$ is computed as:

$$GMS(m_i) = \max_{1 \le t \le T} \left\{ \frac{1}{|C_t|} \sum_{s_j \in C_t} a(s_j, m_i) \right\}$$










Note that GMS differs greatly from CM, for the CM measure is computed over clusters

identified by the underlying clustering strategy, whereas GMS is computed over the cancer

types.
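
The GMS computation for a single marker can be sketched as follows. The sketch assumes that the 0/1 support values of the marker have already been computed for every sample; the function name `gms` and the toy labels are hypothetical.

```python
def gms(support, cancer_type):
    """Global Maximum Support of one marker.

    support[j] is 1 if sample j supports the marker and 0 otherwise;
    cancer_type[j] is the cancer type label of sample j.
    """
    totals, hits = {}, {}
    for value, label in zip(support, cancer_type):
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + value
    # The largest fraction of supporting samples over all cancer types.
    return max(hits[label] / totals[label] for label in totals)


toy_support = [1, 1, 0, 0, 1]
toy_types = ["A", "A", "A", "B", "B"]
gms_value = gms(toy_support, toy_types)
```

Here two of the three type A samples and one of the two type B samples support the marker, so its GMS is 2/3.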

4.4.3 Evaluation

We tested the prototype-based approach and two similarity-based approaches, RSim

and Sim [78] over three datasets, dissimilarDS, interDS and similarDS. For each clustering

method and dataset, we created 6, 8 and 10 clusters. For each number of clusters, we

tried different number of markers, i.e. 4, 6, 8, 10 and 12 markers, per cluster in the

prototype-based approach. This is because biologists have pointed out that a total of 4-7

genetic alterations can be estimated for the malignant transformation of a cell [41]. Thus,

we estimated that the number of aberrations common to the samples of one cluster could

be around 10. For consistency, we also employed different numbers of markers for each

clustering of 6, 8 and 10 clusters in RSim. Here, the number of markers is determined

as the product of number of clusters and the number of markers per cluster used in the

prototype-based approach. For example, 24, 36, 48, 60 and 72 markers were found to

create 6 clusters using RSim. We compared three methods according to the qualities

of clusters. We evaluated the cluster qualities using both NMI and CM measures. To

evaluate the cluster qualities using CM, for both RSim and Sim, we identified the markers

for each resulting cluster. The number of markers per cluster is the same as that used

in the prototype-based approach. We compute the error bars for part of the results. We

also used GMS to evaluate the biological relevance of markers found in prototype-based

approach and RSim approach.

The CM results are shown in Table 4-1. The CM values monotonically increase as

the number of clusters or the number of markers increases. Thus, we compare the three clustering

methods for the same number of clusters and markers. We observe that the prototype-based

approach has 8% to 34% better coverage than RSim and 15% to 41% better coverage than

Sim. This is because the cohesion function optimized in the prototype-based approach is











Table 4-1. Coverage measure for three clustering methods applied over three datasets.
           Here, K denotes the number of clusters. Proto denotes the prototype-based
           approach.

                                   Number of markers
Dataset         K    Alg       4K     6K     8K    10K    12K
dissimilarDS    6    Proto   1655   2107   2465   2707   2953
                     RSim    1520   1916   2190   2519   2722
                     Sim     1316   1741   2051   2325   2571
                8    Proto   1738   2192   2524   2769   3038
                     RSim    1585   1925   2254   2516   2817
                     Sim     1470   1883   2209   2484   2719
               10    Proto   1839   2214   2613   2928   3149
                     RSim    1645   1938   2371   2575   2806
                     Sim     1480   1895   2219   2480   2716
interDS         6    Proto   1328   1696   2007   2318   2531
                     RSim     993   1463   1764   2006   2257
                     Sim      945   1303   1611   1884   2123
                8    Proto   1345   1770   2104   2415   2709
                     RSim    1011   1473   1821   2114   2373
                     Sim     1004   1382   1706   1993   2246
               10    Proto   1443   1860   2189   2519   2779
                     RSim    1192   1533   1825   2156   2467
                     Sim     1113   1486   1798   2078   2322
similarDS       6    Proto   1793   2197   2519   2834   3161
                     RSim    1430   1868   2265   2541   2808
                     Sim     1414   1830   2206   2530   2775
                8    Proto   1827   2257   2643   2973   3298
                     RSim    1627   2061   2400   2694   3044
                     Sim     1530   1956   2316   2625   2889
               10    Proto   1946   2438   2908   3274   3551
                     RSim    1734   2153   2506   2831   3197
                     Sim     1636   2100   2492   2822   3116


the same as the Coverage Measure. RSim and Sim do not optimize CM directly. The

results also show that RSim is superior to Sim most of the time. This is because the use of

markers in RSim filters out the noise that is irrelevant to the aberration patterns. As a

result, the markers in each cluster are supported by more samples in RSim as compared to

Sim.

The NMI results are shown in Table 4-2. Since Sim produces clusters without

finding markers, we list its results on a separate column. The results show that all three

methods perform better on dissimilarDS than interDS and similarDS. This is because

the aberration patterns of distinct cancer types in dissimilarDS are divergent. Thus, it

is harder to cluster interDS and similarDS datasets. From the results, we observe that











Table 4-2. The NMI values of the three clustering methods applied over three
           datasets. Here, K denotes the number of clusters. Proto denotes the
           prototype-based approach.

                                         Number of markers
Dataset         K    Sim     Alg      4K     6K     8K    10K    12K
dissimilarDS    6    0.41    Proto   0.31   0.31   0.29   0.29   0.24
                             RSim    0.45   0.49   0.50   0.49   0.50
                8    0.4?    Proto   0.32   0.28   0.30   0.25   0.28
                             RSim    0.48   0.49   0.49   0.51   0.49
               10    0.43    Proto   0.35   0.33   0.37   0.34   0.29
                             RSim    0.46   0.44   0.47   0.50   0.49
interDS         6    0.34    Proto   0.06   0.08   0.08   0.08   0.11
                             RSim    0.35   0.34   0.38   0.40   0.39
                8    0.32    Proto   0.07   0.10   0.09   0.10   0.11
                             RSim    0.34   0.34   0.36   0.38   0.36
               10    0.32    Proto   0.10   0.09   0.10   0.15   0.12
                             RSim    0.35   0.36   0.36   0.37   0.36
similarDS       6    0.39    Proto   0.14   0.13   0.06   0.07   0.07
                             RSim    0.29   0.41   0.39   0.39   0.40
                8    0.36    Proto   0.14   0.09   0.11   0.07   0.05
                             RSim    0.36   0.38   0.38   0.38   0.37
               10    0.33    Proto   0.08   0.08   0.10   0.06   0.07
                             RSim    0.35   0.37   0.35   0.37   0.37


RSim and Sim always beat the prototype-based approach in terms of the NMI values. This

observation, together with the conclusion from the results in Table 4-1, indicates that the NMI

measure has no apparent relationship with CM. This is because NMI computes the quality

based on the class labels of samples. On the other hand, CM evaluates the compactness

of samples in each cluster based on chromosomal aberration patterns and completely

ignores the class labels. Therefore, we conclude that the pairwise similarity-based

clustering approaches are more suitable to external measures, such as NMI, while the

prototype-based approach works better for the Coverage Measure. RSim usually has the

best NMI results using ten markers. When compared to Sim, RSim usually has better

NMI values. This indicates that the use of markers in refining the pairwise similarity

also leads to a better clustering in terms of NMI. Given that RSim has better CM (see

Table 4-1) and NMI values than Sim (see Table 4-2), we conclude that RSim is a better

pairwise similarity measure than Sim.










Table 4-3. Error bar results of three clustering methods over three datasets. The three
           clustering methods are prototype-based (denoted as Proto), RSim and Sim.
           The three datasets are dissimilarDS, interDS and similarDS. For each dataset,
           eight clusters are created. For each cluster, ten markers are identified. The
           resulting clusters are evaluated using both NMI and CM. Here, 5% and 95%
           denote the 5th and 95th percentile respectively.

                               NMI                     CM
Dataset        Alg      5%   median   95%      5%   median   95%
dissimilarDS   Proto   0.20   0.28   0.34    1372   1431   1487
               RSim    0.45   0.51   0.55    1230   1291   1337
               Sim     0.36   0.42   0.47    1187   1241   1289
interDS        Proto   0.05   0.09   0.13    1156   1228   1285
               RSim    0.32   0.37   0.41     985   1042   1120
               Sim     0.27   0.30   0.33     963   1032   1098
similarDS      Proto   0.06   0.09   0.13    1480   1535   1581
               RSim    0.33   0.37   0.40    1294   1342   1401
               Sim     0.29   0.33   0.37    1283   1327   1376


We compute the error bars of experimental results as follows. We randomly sample

50% of each dataset and cluster it using the three methods, prototype-based, RSim and

Sim, described in the thesis. To reduce the amount of computation, we choose one

configuration from the different combinations of parameters in the experiments. For each

dataset, we create 8 clusters. We identify 10 markers per cluster in the prototype-based

approach and 80 markers in the RSim approach. We then compute both NMI and CM values

for the resulting clusters using 10 markers per cluster. We repeat this process 100 times

and compute the error bar for the values of NMI and CM. The error bar indicates the

interval between the 5th and 95th percentiles of the results. Table 4-3 shows the results with error bars.
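
The resampling procedure can be sketched in Python as follows. It is a minimal illustration under stated assumptions: `evaluate` is a hypothetical callable that clusters a subsample and returns one quality score (NMI or CM), the nearest-rank percentile is one of several possible percentile definitions, and the fixed seed is only there to make the sketch reproducible.

```python
import random


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, with p an integer in 1..100."""
    n = len(sorted_values)
    rank = max(1, (p * n + 99) // 100)  # integer ceiling of p * n / 100
    return sorted_values[rank - 1]


def error_bar(samples, evaluate, repeats=100, fraction=0.5, seed=0):
    """Score repeated random subsamples and return (5th, median, 95th) values."""
    rng = random.Random(seed)
    size = max(1, int(len(samples) * fraction))
    scores = sorted(evaluate(rng.sample(samples, size)) for _ in range(repeats))
    return tuple(percentile(scores, p) for p in (5, 50, 95))


# Toy usage: the "score" of a subsample is simply its size.
bars = error_bar(list(range(10)), evaluate=len)
```

In the toy usage every subsample has five elements, so all three reported values equal 5; with a real clustering score the three values would spread out as in Table 4-3.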

Note that the results of CM are roughly half the results shown in Table 4-1 because

the calculation of CM depends on the dataset and we sample 50% of each dataset in

the experiments. The results show that RSim is superior to Sim most of the time in

terms of both NMI and CM. Moreover, among the three clustering methods, RSim and the

prototype-based approach work the best for the NMI and CM measures, respectively. These

observations are consistent with those we obtained from Tables 4-1 and 4-2. Therefore, the

error bars confirm our earlier conclusions.













[Plots: panels A, B and C. Horizontal axis: Marker Index. Curves: RSim and model.]




Figure 4-3. Comparison of GMS values of markers in clusters from two clustering
approaches. Plots of global maximum support of markers found (A) in
dissimilarDS, (B) in interDS and (C) in similarDS. The solid line indicates the
results of markers generated by the RSim approach. The dashed line indicates the
results of markers generated by the prototype-based approach (denoted as model).


The next experiment compares the GMS values for the RSim and prototype-based clustering

approaches. For each dataset, we created eight clusters and identified ten markers per

cluster. This is because these results are among the best results of each dataset in

Table 4-2. We then sort the markers in descending GMS value order. We plot the sorted

results of both the RSim and prototype-based approaches (Figure 4-3). The plots show that

the global maximum support of markers found by RSim is always comparable to or better

than that of markers found by the prototype-based approach.










4.5 Conclusion

We considered the problem of clustering Comparative Genomic Hybridization (CGH) data

of a population of cancer patient samples. There are three main contributions of our work:

1. We developed a dynamic programming algorithm to identify the optimal set of

important genomic intervals called markers. The advantage of using these markers is

that the potentially noisy genomic intervals are excluded in the computation of pairwise

similarity between samples.

2. We developed two clustering strategies using these markers. The first one, the

prototype-based approach, maximizes the support for the markers. The second one, the

similarity-based approach, develops a new similarity measure called RSim. It computes the

pairwise similarity between samples by removing the noisy aberrations. We demonstrated

the utility of such a measure in improving the quality of clustering using the classified

disease entities from the Progenetix database. Our results show that the markers we found

represent the aberration patterns of cancer types well.

3. We developed several measures for comparing markers and different clustering

methods. Our experimental results show that optimizing for the coverage measure may

not lead to better values of NMI and vice versa.










CHAPTER 5
CLASSIFICATION AND FEATURE SELECTION ALGORITHMS FOR MULTI-CLASS
CGH DATA

Classification is the task of learning a target function that maps each attribute set

to one of the predefined class labels [79]. Typical classification tasks for cancer research

include separating healthy patients from cancer patients and distinguishing patients of

different cancer subtypes, based on their cytogenetic profiles. These tasks support successful

cancer diagnosis and treatment. An important technique related to classification is feature

selection. The goal of feature selection is to select a small number of discriminative

features, i.e. genomic intervals in CGH data, for accurate classification. In this chapter,

we propose novel SVM-based methods for classification and feature selection of CGH

data. For classification, we developed a novel similarity kernel that is shown to be more

effective than the standard linear kernel used in SVM. For feature selection, we propose a

novel method based on the new kernel that recursively selects features with the maximum

influence on an objective function. We compared our methods against the best wrapper

based and filter based approaches that have been used for feature selection of large

dimensional biological data. Our results on datasets generated from the Progenetix

database suggest that our methods are considerably superior to existing methods.

5.1 Classification with SVM

The Support Vector Machine (SVM) is a state-of-the-art technique for classification [83].

It has been shown to have better accuracy and computational advantages over its

contenders [27] and has been successfully applied to many biological classification

problems. The basic technique works as follows. Consider a set of points

in a high dimensional space such that each point belongs to one of two

classes. An SVM computes a hyperplane that maximizes the margin separating the

two classes of samples. The optimal hyperplane is called the decision boundary. Formally, let

$x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$ denote n training samples and their corresponding class

labels respectively. Let $y_i \in \{-1, 1\}$ denote the labels of the two classes. The decision boundary of










a linear classifier can be written as $w \cdot x + b = 0$, where w and b are parameters of the model.

By rescaling the parameters w and b, the margin d can be written as $d = 2 / \|w\|$ [79]. The

learning task in SVM can be formalized as the following constrained optimization problem:

\[ \min_{w} \frac{\|w\|^2}{2} \]

subject to $y_i (w \cdot x_i + b) \geq 1$, $i = 1, 2, \ldots, n$.

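The dual version of the above problem corresponds to finding a solution to the

following quadratic program:

Maximize J over $\alpha_i$:

\[ J = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \]

subject to $\alpha_i \geq 0$, $\sum_{i=1}^{n} \alpha_i y_i = 0$, where $\alpha_i$ is a real number.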

The decision boundary can then be constructed from the solutions $\alpha_i$ to the quadratic

program. The resulting decision function of a new sample z is

\[ D(z) = w \cdot z + b \]

with $w = \sum_i \alpha_i y_i x_i$ and $b = \langle y_i - w \cdot x_i \rangle$. Usually many of the $\alpha_i$ are zero. The training

samples $x_i$ with non-zero $\alpha_i$ are called support vectors. The weight vector w is a linear

combination of support vectors. The bias value b is an average over support vectors. The

class label of z is obtained by considering the sign of D(z).
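
As a minimal sketch of how the decision function is assembled from a dual solution, the following Python fragment builds w and b for a toy two-point problem. The data and the multipliers are assumptions made for illustration: the values alpha = 0.5 are the dual optimum for this particular toy problem and would normally come from a quadratic programming solver. The helper names (`dot`, `decision_function`) are hypothetical.

```python
# Toy illustration of D(z) = w . z + b built from dual coefficients.
# The alpha values are ASSUMED to come from a QP solver; for this two-point
# problem the dual optimum is alpha_1 = alpha_2 = 0.5.

def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def decision_function(X, y, alpha):
    """Return a function D(z) = w . z + b from the dual solution alpha."""
    m = len(X[0])
    # w is a linear combination of the training samples weighted by alpha_i * y_i.
    w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(m)]
    # b is the average of y_i - w . x_i over the support vectors (alpha_i > 0).
    sv = [i for i, a in enumerate(alpha) if a > 0]
    b = sum(y[i] - dot(w, X[i]) for i in sv) / len(sv)
    return lambda z: dot(w, z) + b

X = [(1.0, 0.0), (-1.0, 0.0)]   # hypothetical training samples
y = [1, -1]
alpha = [0.5, 0.5]              # assumed dual solution
D = decision_function(X, y, alpha)
label = 1 if D((0.3, 2.0)) > 0 else -1   # class label is the sign of D(z)
```

Here w works out to (1, 0) and b to 0, so the margin 2/||w|| equals 2, which is the distance between the two training points.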

Standard SVM methods find a linear decision boundary based on the training

examples. They compute the similarity between samples $x_i$ and $x_j$ using the inner product

$x_i^T x_j$. However, the simple inner product does not always measure the similarity effectively

for all applications. For some applications, a non-linear decision boundary is more

effective for classification. The basic SVM method can then be extended by transforming

samples to a higher dimensional space via a mapping function $\Phi$. By doing this, a linear

decision boundary can be found in the transformed space if a proper function $\Phi$ is used.










However, the mapping function $\Phi$ is often hard to construct. The computation in the

transformed space can be expensive because of its high dimensionality. A kernel function

can be used to overcome this limitation. A kernel function is defined as

$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, where $x_i$ and $x_j$ denote the ith and jth sample respectively. In effect, it

computes the similarity between $x_i$ and $x_j$. With the help of a kernel function, an explicit

form of the mapping function $\Phi$ is not required.

We introduce a new measure called Raw that captures the underlying categorical

information in CGH data and then show how to incorporate it into the basic SVM

method. CGH data consists of sparse categorical values (gain, loss and no change).

Conceptually, the similarity between CGH samples depends on the number of aberrations

(gains or losses) they both share. Raw calculates the number of common aberrations

between a pair of samples. Given a pair of samples $a = a_1, a_2, \ldots, a_m$ and

$b = b_1, b_2, \ldots, b_m$, the similarity between a and b is computed as $Raw(a, b) = \sum_{i=1}^{m} S(a_i, b_i)$.

Here $S(a_i, b_i) = 1$ if $a_i = b_i$ and $a_i \neq 0$. Otherwise $S(a_i, b_i) = 0$.

The main difference between Raw(a, b) and aT b is the way they deal with different

aberrations in the same interval. For example, if two samples a and b have different

aberrations at the ith interval, i.e. $a_i = 1, b_i = -1$ or $a_i = -1, b_i = 1$, the inner product

calculates this pair as $a_i \times b_i = -1$ while Raw calculates $S(a_i, b_i) = 0$. The similarity value

between a and b computed by Raw is always greater than or equal to the inner product of

a and b. We propose to use the Raw function as the kernel function for training as well as

prediction.
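
The difference between Raw and the inner product can be illustrated with a short Python sketch. The encoding 1 = gain, -1 = loss, 0 = no change follows the text; the two sample vectors and the function names are made up for illustration.

```python
def raw(a, b):
    """Number of intervals where a and b share the same aberration (gain or loss)."""
    return sum(1 for ai, bi in zip(a, b) if ai == bi and ai != 0)

def inner(a, b):
    """Plain inner product of two CGH vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical samples over five genomic intervals.
a = [1, -1, 0, 1, -1]
b = [1,  1, 0, 0, -1]

raw_ab = raw(a, b)      # intervals 1 and 5 carry the same aberration
inner_ab = inner(a, b)  # the opposite aberrations in interval 2 contribute -1
```

In this example the two samples share two aberrations, while the inner product is reduced by the conflicting aberration in the second interval, so Raw is the larger of the two values.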

Using SVM with the Raw kernel amounts to solving the following quadratic program:

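Maximize J over $\alpha_i$:

\[ J = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j \, Raw(x_i, x_j) \]

subject to $\alpha_i \geq 0$, $\sum_{i=1}^{n} \alpha_i y_i = 0$.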









Accordingly, the resulting decision function of a new sample z is


\[ D(z) = \sum_{i=1}^{n} \alpha_i y_i \, Raw(x_i, z) + b \]

The main requirement for the kernel function used in nonlinear SVM is that there

exists a mapping function $\Phi$ such that the kernel function computed for a pair of samples

is equivalent to the inner product between the samples in the transformed space [79]. This

requires that the underlying kernel matrix is "semi-positive definite". Formally, a kernel

K is a symmetric function taking two arguments of an arbitrary set X where the data

stems from, i.e., $K : X \times X \to \mathbb{R}$. For given data points $(x_i)_{i=1}^{n} \in X^n$, the kernel matrix

$M := (K(x_i, x_j))_{i,j}$ can be defined. If for all n, all sets of data points and all vectors

$v \in \mathbb{R}^n$ the inequality $v^T M v \geq 0$ holds, then K is called semi-positive definite. We now

prove that our Raw kernel satisfies this requirement.

The mapping function $\Phi$ is defined as follows: $a \in \{1, 0, -1\}^m \to b \in \{1, 0\}^{2m}$, where

\[ b_{2i-1} b_{2i} = \begin{cases} 01 & \text{if } a_i = 1 \\ 10 & \text{if } a_i = -1 \\ 00 & \text{if } a_i = 0 \end{cases} \]

With this transformation, it is easy to see that the Raw kernel can be written as the

inner product of $\Phi(x)$ and $\Phi(y)$, i.e. $Raw(x, y) = \Phi(x)^T \Phi(y)$. This is because Raw

only counts the number of common aberrations in computing the similarity between two

samples (if both the values are 0, they are not counted).
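
The mapping can be checked on a small example. The sketch below encodes each interval with two bits as described above and compares the resulting inner product with Raw; the sample vectors and function names are hypothetical.

```python
def phi(a):
    """Map a in {1, 0, -1}^m to a binary vector in {0, 1}^(2m)."""
    code = {1: (0, 1), -1: (1, 0), 0: (0, 0)}   # gain -> 01, loss -> 10, none -> 00
    out = []
    for ai in a:
        out.extend(code[ai])
    return out

def raw(a, b):
    """Number of common aberrations between a and b."""
    return sum(1 for ai, bi in zip(a, b) if ai == bi and ai != 0)

x = [1, -1, 0, 1]    # hypothetical samples
y = [1,  1, 0, -1]
lhs = raw(x, y)
rhs = sum(p * q for p, q in zip(phi(x), phi(y)))   # inner product in the mapped space
```

Both quantities equal 1 here: only the first interval carries the same aberration in both samples, and only the corresponding bit is set in both mapped vectors.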

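We define a 2m by n matrix u whose jth column vector corresponds to $\Phi(x_j)$, i.e.

$u := [\Phi(x_1) \; \Phi(x_2) \; \ldots \; \Phi(x_n)]$. The Raw kernel matrix can be written as

\[ M = \begin{bmatrix} Raw(x_1, x_1) & Raw(x_1, x_2) & \cdots & Raw(x_1, x_n) \\ Raw(x_2, x_1) & Raw(x_2, x_2) & \cdots & Raw(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ Raw(x_n, x_1) & Raw(x_n, x_2) & \cdots & Raw(x_n, x_n) \end{bmatrix} = u^T u \]

Now we have $v^T M v = v^T (u^T u) v = (uv)^T uv = \|uv\|^2 \geq 0$, $\forall v \in \mathbb{R}^n$. Therefore, the

Raw kernel is semi-positive definite.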
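
The identity used in the proof can also be checked numerically. The following sketch builds the Raw kernel matrix for four made-up samples and compares the quadratic form with the squared norm of uv for random vectors v. It is only a sanity check under these assumptions, not a substitute for the proof.

```python
import random

def phi(a):
    """Two-bit encoding of each interval: gain -> 01, loss -> 10, none -> 00."""
    code = {1: (0, 1), -1: (1, 0), 0: (0, 0)}
    return [bit for ai in a for bit in code[ai]]

samples = [[1, 0, -1], [1, 1, 0], [-1, 0, -1], [0, 0, 1]]   # hypothetical CGH vectors
U = [phi(s) for s in samples]     # row j is phi(x_j), so U is the transpose of u
M = [[sum(p * q for p, q in zip(ui, uj)) for uj in U] for ui in U]   # M = u^T u

def quad_form(M, v):
    """v^T M v."""
    n = len(v)
    return sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))

def norm_sq_uv(U, v):
    """||u v||^2, where u v = sum_j v_j * phi(x_j)."""
    uv = [sum(v[j] * U[j][d] for j in range(len(v))) for d in range(len(U[0]))]
    return sum(c * c for c in uv)

rng = random.Random(0)
checks = []
for _ in range(100):
    v = [rng.uniform(-1.0, 1.0) for _ in samples]
    checks.append((quad_form(M, v), norm_sq_uv(U, v)))
```

Each pair in `checks` holds the two sides of the identity for one random v; they agree up to rounding error and are never negative.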

It is worth noting that we have developed other similarity measures such as Sim

for the clustering of CGH data in Chapter 3. Although Sim works better than Raw in

clustering, it cannot work as a kernel function in SVM because it is not semi-positive

definite.

5.2 Maximum Influence Feature Selection for Two Classes

An important characteristic of CGH data is that neighboring features are strongly

correlated (Figure 5-1). When a compact set of features is selected, these highly

correlated features may cause "redundancy" in the predictive power. For example, assume

we have a training dataset with four cancer types and we want to select two features for

classification. If the ith feature is ranked high for separating samples of the first cancer

from others, the (i + 1)th or (i - 1)th feature may also be ranked high for the same reason.

However, selecting both the ith and the (i + 1)th (or (i - 1)th) feature cannot improve the

classification performance much. On the other hand, if another feature, say the jth feature,












[Figure 5-1 image: Retinoblastoma, NOS (ICD-O 9510/3), 4 Markers. X-axis: Genomic Intervals (100 to 800).]

Figure 5-1. Plot of 120 CGH cases belonging to Retinoblastoma, NOS (ICD-O 9510/3).
The X-axis and Y-axis denote the genomic intervals and the samples
respectively. We plot the gain and loss status in green and red respectively.


separates samples of the third cancer well from others but has a lower ranking than the

(i + 1)th (or (i - 1)th) feature, we should select the ith and jth features instead.

Typical wrapper methods based on backward feature elimination, such as SVM-RFE [27],

are poor at discriminating between highly correlated features, especially when a small set

of features is selected. Filter methods, such as MRMR [15], address this problem by

selecting features with minimal redundancy. However, due to the difficulty in selecting

complementary features, filter methods often produce lower predictive accuracy compared

to wrapper methods. In this chapter, we propose a novel non-linear SVM-based wrapper

method called Maximum Influence Feature Selection (MIFS) for the classification of

multiclass CGH data.

When the number of features is very large, an exhaustive search of all possible feature

subsets is computationally intractable. We use a greedy search strategy to progressively

add features into a feature subset. To find the next feature to add, we use criteria similar










to the one suggested by Guyon et al. [27]. The basic idea is to compute the change in the

objective function caused by removing or adding a given feature. In our case, we select the

feature that maximizes the variation on the objective function. The added feature is the

one which has the most influence on the objective function. This is unlike the backward

elimination scheme that removes the feature that minimizes the variation on the objective

function [27, 64].

The feature that has the most influence on the objective function is determined as

follows. Let S denote the feature set selected at a given algorithm step and J(S) denote

the value of the objective function of the trained SVM using feature set S. Let k denote

a feature that is not contained in S. The change in the objective function after adding

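a candidate feature is written as $DJ(k) = |J(S \cup \{k\}) - J(S)|$. In the case of SVM,

the objective function that needs to be maximized (under the constraints $0 \leq \alpha_i$ and

$\sum_i \alpha_i y_i = 0$) is:

\[ J(S) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j \, Raw(x_i, x_j) \]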
For each feature k not in S, we compute the new objective function J(S(+k)).

To make the computation tractable, we assume no change in the value of the $\alpha$'s after

the feature k is added. Thus we avoid having to retrain a classifier for every candidate

feature [27]. The new objective function with feature k added is:

\[ J(S(+k)) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j \, Raw(x_i(+k), x_j(+k)) \]

where $x_i(+k)$ means training sample i with feature k added.

Therefore, the estimated change of the objective function (estimated because we are not

retraining the classifier with the additional feature) is:

\[ DJ(k) = \frac{1}{2} \left| \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j \, Raw(x_i, x_j) - \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j \, Raw(x_i(+k), x_j(+k)) \right| \]
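
A minimal sketch of this estimate in Python is given below. It assumes that Raw over a feature subset sums S(a_i, b_i) over the selected intervals only, and that the multipliers are held fixed. The data matrix, the labels and the alpha values are hypothetical and do not come from a trained SVM; the function names are illustrative.

```python
def s(ai, bi):
    """1 if both samples carry the same aberration in this interval, else 0."""
    return 1 if ai == bi and ai != 0 else 0

def raw(a, b, feats):
    """Raw similarity restricted to the feature indices in feats (assumption)."""
    return sum(s(a[f], b[f]) for f in feats)

def quad(X, y, alpha, feats):
    """Double sum of alpha_i alpha_j y_i y_j Raw(x_i, x_j) over the given features."""
    n = len(X)
    return sum(alpha[i] * alpha[j] * y[i] * y[j] * raw(X[i], X[j], feats)
               for i in range(n) for j in range(n))

def delta_j(X, y, alpha, S, k):
    """Estimated change DJ(k) with alpha held fixed (no retraining)."""
    return 0.5 * abs(quad(X, y, alpha, S) - quad(X, y, alpha, S + [k]))

X = [[ 1, 0, 1, -1],
     [ 1, 0, 0, -1],
     [-1, 1, 0, -1],
     [-1, 1, 1,  0]]
y = [1, 1, -1, -1]
alpha = [0.5, 0.5, 0.5, 0.5]   # hypothetical multipliers, not from a real solver
S = [0]                         # currently selected feature set
scores = {k: delta_j(X, y, alpha, S, k) for k in range(4) if k not in S}
best = max(scores, key=scores.get)   # candidate with the largest influence
```

With these values, feature 1 (shared only by the two samples of the second class) has the largest estimated influence, feature 2 (shared across classes) has none, and feature 3 lies in between.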









We add the feature that has the largest difference DJ(k) to the feature set. The

iterative procedure for MIFS is formally defined as follows:

Input: Training samples $\{x_1, x_2, \ldots, x_n\}$ and class labels $\{y_1, y_2, \ldots, y_n\}$, $y_i \in \{1, -1\}$,

initial feature set S, predetermined number of features r

1. Initialize: Ranked feature list RL = S, candidate feature set L = D - S (D is the

set of all features)

2. While |RL| < r

(a) Train an SVM using training samples with features in RL,

(b) Compute the change of objective function DJ(k) for each candidate feature
$k \in L$

(c) Find the feature e with the largest DJ(k), i.e. $e = \arg\max_{k \in L} DJ(k)$

(d) Update RL = [RL, e] and L = L - {e}
3. Return: Ranked feature list RL

This algorithm can be generalized to add more than one feature in Step 2.d to speed up

computations when the number of features r is large.

Time Complexity The training time complexity for linear SVM is dominated by the

time for solving the underlying quadratic program. The conventional approach for solving

the quadratic program takes time cubic in the number of samples and linear in the number

of features [48]. Recent work has shown that the empirical time complexity for training

a linear SVM is about $O(n^{1.7} m)$ [33], where n and m denote the number of samples

and number of features respectively. Based on this, the conventional and empirical time

complexity for this algorithm is $O(n^3 r^2)$ and $O(n^{1.7} r^2)$ respectively.

The above method requires a set of features S to be non-empty. To start the method,

we need to derive the first feature to be added to this set. One possibility is to compute

J({k}) for every feature k by training a separate SVM for each feature k. We can, then,

select the feature with the largest value as the starting feature. However, this can be

computationally very expensive. Another approach is to use the most discriminating










feature (such as done by standard filter based methods that rank features according to

their individual predictive power). Specifically, the mutual information I of two variables r

and s is defined as

\[ I(r, s) = \sum_{i, j} p(r_i, s_j) \log \frac{p(r_i, s_j)}{p(r_i) \, p(s_j)} \]

where p(r, s) is their joint probability; p(r) and p(s) are the respective marginal

probabilities. If we look at the kth feature as a random variable, we use the mutual

information I(k, y) between class labels $y = \{y_1, y_2, \ldots, y_n\}$ and the feature variable

k to quantify the relevance of the kth feature for the classification task. We choose the

feature k with the maximum I(k, y) as our starting feature. We have found that using

such methods is satisfactory. Our preliminary experimental results showed that MIFS

is not sensitive to the initial feature chosen.
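
The choice of the starting feature can be sketched as follows. Probabilities are estimated by relative frequencies and the natural logarithm is used; both are assumptions, since the text does not fix them. The toy data and the function names are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(r, s):
    """I(r, s) for two equally long sequences of discrete values (natural log)."""
    n = len(r)
    joint = Counter(zip(r, s))
    pr, ps = Counter(r), Counter(s)
    return sum((c / n) * log((c / n) / ((pr[ri] / n) * (ps[si] / n)))
               for (ri, si), c in joint.items())

def best_starting_feature(X, y):
    """Index of the feature k with the largest I(k, y)."""
    m = len(X[0])
    return max(range(m), key=lambda k: mutual_information([x[k] for x in X], y))

# Hypothetical data: feature 0 determines the class, features 1 and 2 do not.
X = [[1, 0, 1], [1, 0, 0], [-1, 0, 1], [-1, 0, 0]]
y = [1, 1, -1, -1]
start = best_starting_feature(X, y)
```

In this toy dataset the first feature separates the two classes perfectly, so it has the largest mutual information with the labels and is chosen as the starting feature.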

5.3 Maximum Influence Feature Selection for Multiple Classes

The feature selection method proposed in Section 5.2 only works for two-class

problems. We derive the multiclass version using a one-versus-all approach as follows.

First step. Let $C \geq 3$ denote the number of classes. For each i, $1 \leq i \leq C$, a binary

SVM that separates the ith class from the rest is trained based on the selected

feature set S.

Second step. For each binary SVM, we compute DJ(k) for every feature k not in

S. We rank all the candidate features based on the value of DJ. The larger

the value of DJ(k), the smaller the rank of k (smaller is better). As a result, we

obtain C ranked lists of features with each ranked list corresponding to one of the

C SVMs. Equivalently, each candidate feature corresponds to a ranking vector

containing its rankings in these C ranked lists. For example, a feature can be ranked

first in the first list, third in the second list, 20th in the third list, and 15th in the

fourth list. The vector that is used for ranking this feature is [1, 3, 20, 15].

Third step. A feature that ranks low in one list may rank high in another. We

are interested in those features that are most informative in discriminating one










class from the rest even if they are quite uninformative in other classifications. We

achieve this as follows. We first sort the ranking vector of each candidate feature

in an ascending order. If we regard each element of the ranking vector as a digit,

each ranking vector could represent a C digit number. The smallest ranking (the

first element) represents the most significant digit. We use a least significant digit

radix sort algorithm to sort all the ranking vectors and, accordingly, produce a

global ranking of features. For example, assume we have three features, k1, k2 and

k3, whose rankings in four binary SVMs are [1, 3, 20, 15], [8, 4, 7, 6] and [5, 1, 30, 4]

respectively. The vectors show that k1 ranks top in separating class one from others

and ranks third in separating class two from others, etc. We first sort each ranking

vector in ascending order. The resulting vectors are [1, 3, 15, 20], [4, 6, 7, 8] and

[1, 4, 5, 30] respectively. Next, we apply a radix sort algorithm over the three vectors.

The resulting order of vectors changes to [1, 3, 15, 20], [1, 4, 5, 30], [4, 6, 7, 8], which

corresponds to the order of features: k1, k3, k2. Therefore, we have a global ranking

of the three features.
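
The worked example above can be reproduced with a few lines of Python. This sketch uses the built-in lexicographic comparison of lists in place of an explicit radix sort, which yields the same order when the first element is the most significant digit; the function name is hypothetical.

```python
def global_ranking(rank_vectors):
    """Order features by their sorted ranking vectors (smallest rank first)."""
    # Sort each feature's ranks in ascending order.
    keyed = {name: sorted(ranks) for name, ranks in rank_vectors.items()}
    # Lexicographic comparison of the sorted vectors gives the same order as
    # a radix sort that treats the first element as the most significant digit.
    return sorted(keyed, key=keyed.get)

ranks = {"k1": [1, 3, 20, 15], "k2": [8, 4, 7, 6], "k3": [5, 1, 30, 4]}
order = global_ranking(ranks)
```

The resulting order is k1, k3, k2, as in the example.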

The top-ranked feature (the one with the smallest rank) is added to S. The above three step process is used iteratively

to determine the next feature. This process stops when a predetermined number of

features are selected or S contains all the features. Also, with the set S, the features are

ranked based on the order of addition into this set. The iterative procedure for MIFS is

formally defined as follows:

Input: Training samples $\{x_1, x_2, \ldots, x_n\}$ and class labels $\{y_1, y_2, \ldots, y_n\}$, $y_i \in \{1, \ldots, C\}$,

initial feature set S, predetermined number of features r

1. Initialize: Ranked feature list RL = S, candidate feature set L = D - S (D is the

set of all features)

2. While |RL| < r

(a) For i = 1 to C

i. Construct new class labels $\{y'_1, y'_2, \ldots, y'_n\}$, with $y'_j = 1$ if $y_j = i$, otherwise

$y'_j = -1$;

ii. Train an SVM using training samples with features in RL;

iii. Compute the change of objective function DJ(k) for each candidate feature

$k \in L$;

iv. Sort the sequence of DJ(k), $k \in L$, in descending order; create a

corresponding ranked list of candidate features;

(b) Compute the ranking vectors for all the features in L from the C ranked lists;

(c) Sort the elements of each ranking vector in ascending order;

(d) Perform a radix sort over all

ranking vectors to produce a global ranking of features in L;

(e) Find the top ranked feature e and update RL = [RL, e] and L = L - {e}
3. Return: Ranked feature list RL

This algorithm can be generalized to add more than one feature in Step 2.e to speed up

computations when the number of features r is large.

Time Complexity The conventional and empirical time complexity for this algorithm is

$O(n^3 r^2 C)$ and $O(n^{1.7} r^2 C)$ respectively as a one-versus-all strategy is used to train C SVMs

in each iterative step.

In the above algorithm, we generate a global ranking of features based on their

rankings in each binary SVM. Another "ranking scheme" can be derived based on the sum

of the value that the feature brings to each classifier, i.e. $\sum_{i=1}^{C} (n_i \times |J_i(S \cup \{k\}) - J_i(S)|)$, where

$J_i(S)$ is the corresponding objective function for the SVM that discriminates class i from the

rest and $n_i$ is the number of samples in class i. This ranking scheme gives comparable

results to the one described above. For this reason, the results concerning this scheme are

not reported in the experimental section.









5.4 Datasets

The Progenetix database [3] (http://www.progenetix.net) consists of more than

12,000 cases [1]. We use a dataset consisting of 5020 CGH samples (i.e. cytogenetic

imbalance profiles of tumor samples) taken from Progenetix (Table 3-1). These samples

belong to 19 different histopathological cancer types that have been coded according to

the ICD-O-3 system [22]. The subset with the smallest number of samples consists of 110

non-neoplastic cases, while the one with the largest number of samples, Adenocarcinoma, NOS

(ICD-O 8140/3), contains 1057 cases. Each CGH sample consists of 862 ordered genomic
intervals extracted from 24 chromosomes.

Testing the performance (predictive accuracy and run time) of the proposed methods

requires evaluating them over datasets with different properties such as 1) number of

samples contained in the dataset, 2) number of cancer types contained in the dataset, and

3) the similarity level between samples from different cancer types, which indicates the

difficulty of classification. Currently, there are no standard benchmarks for normalized

CGH data that take the three properties into account. We propose a method to select

subsets from the Progenetix database in a principled manner to create datasets with

desired properties. The dataset sampler accepts the following three parameters as input:

1) Approximate number of samples (denoted as N), 2) Number of cancer types (denoted

as C), and 3) Similarity range (denoted as $[\delta_{min}, \delta_{max}]$) between samples belonging to different

cancer types. An outline of the proposed dataset sampler is as follows:

1. For each cancer type, partition all the samples belonging to this cancer type into

several disjoint groups using clustering. Each cluster corresponds to a different

aberration pattern for a given cancer type.

2. Compute the pairwise similarity between pairs of groups obtained in the first step.

3. Construct a complete weighted graph where each vertex denotes a group of samples

and the weight of an edge equals the similarity between the two groups that are

connected by this edge.










One can use this graph to find a set of samples of a given size N (by choosing a subset

of groups whose sizes sum to N), a given number of cancer types, and a given level of similarity

between groups (by only considering groups that have a similarity within the range

$[\delta_{min}, \delta_{max}]$). The advantage of the above dataset sampler is that a large number of

datasets can be created with variable number of samples and cancer types as well as

variable level of similarities between the chosen cancer types. This allows for testing the

accuracy and performance of a new method across a variety of potential scenarios.

Figure 5-2 shows an example of how such a dataset sampler works. Consider a dataset

containing 1,000 CGH samples, with 400 samples belonging to cancer type c1 and the other

600 samples belonging to cancer type c2. Assume that each cancer type is clustered into

2 clusters. This results in 4 groups of CGH samples, which are denoted as $g_i$, $1 \leq i \leq 4$.

Let the sizes of g1, g2, g3 and g4 be 150, 250, 450, and 150 respectively. The pairwise

similarity between any two groups is shown in the figure. Using this, one can construct a

weighted graph where each vertex denotes a group and the weight of each edge equals

the similarity between the two groups that are connected by this edge. Suppose that a dataset

needs to be sampled with N = 400, C = 2, $\delta_{min} = 0.025$ and $\delta_{max} = 0.035$. The graph can

be parsed to find that g2 and g4 satisfy the three conditions, and a new dataset can be

sampled by combining the samples in g2 and g4.
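
The search over the graph in this example can be sketched as a brute-force enumeration of group subsets. The group sizes and similarities are those of the example. Two choices are assumptions made for illustration: only similarities between groups of different cancer types are checked, and both ends of the similarity range are treated as inclusive. The function name and the `tol` parameter are hypothetical.

```python
from itertools import combinations

def sample_groups(sizes, cancer, sim, n_target, n_types, sim_range, tol=0):
    """Return group sets meeting the size, cancer-type and similarity conditions."""
    lo, hi = sim_range
    hits = []
    for r in range(n_types, len(sizes) + 1):
        for combo in combinations(sorted(sizes), r):
            # The chosen groups must cover exactly n_types cancer types.
            if len({cancer[g] for g in combo}) != n_types:
                continue
            # Their sizes must sum to the target (within the tolerance).
            if abs(sum(sizes[g] for g in combo) - n_target) > tol:
                continue
            # Only similarities between groups of different cancer types count.
            pairs = [(a, b) for a, b in combinations(combo, 2)
                     if cancer[a] != cancer[b]]
            if all(lo <= sim[frozenset(p)] <= hi for p in pairs):
                hits.append(combo)
    return hits

sizes = {"g1": 150, "g2": 250, "g3": 450, "g4": 150}
cancer = {"g1": "c1", "g2": "c1", "g3": "c2", "g4": "c2"}
sim = {frozenset(("g1", "g2")): 0.01, frozenset(("g1", "g3")): 0.02,
       frozenset(("g1", "g4")): 0.02, frozenset(("g2", "g3")): 0.01,
       frozenset(("g2", "g4")): 0.03, frozenset(("g3", "g4")): 0.01}
result = sample_groups(sizes, cancer, sim, n_target=400, n_types=2,
                       sim_range=(0.025, 0.035))
```

For the parameters of the example, the only admissible combination is g2 together with g4. A real sampler over many groups would need a more efficient search than this exhaustive enumeration.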

We used our dataset resampling scheme to select datasets at four different similarity

levels from the Progenetix dataset. We denote the similarity levels as Best, Good, Fair,

and Poor. The samples in Best have the highest similarity and those in Poor have the

lowest similarity. For each similarity level, we created four datasets with two, four,

six, and eight cancer types respectively. Thus, in total, we have sixteen datasets. For

convenience, we use the similarity level followed by the number of cancer types to

denote a dataset. For example, Best6 denotes the dataset with similarity level Best

(i.e., homogeneous samples) that contains six cancer types. The number of samples in each

two-class dataset and multi-class dataset is around 500 and 1,000 respectively. Note that












[Figure 5-2 diagram. Step 2, pairwise similarity matrix:

      g1    g2    g3    g4
g1    0     0.01  0.02  0.02
g2    0.01  0     0.01  0.03
g3    0.02  0.01  0     0.01
g4    0.02  0.03  0.01  0

Step 3: complete weighted graph over g1, g2, g3 and g4.]

Figure 5-2. Working example of dataset re-sampler. $c_i$ and $g_j$ denote the ith cancer type
and the jth group of samples, respectively. In the first step, the samples are
partitioned in each cancer type into two disjoint groups. In the second step,
pairwise similarity metrics are computed. In the third step, a complete
weighted graph is generated.
there are no topological relations between different datasets because we generate all datasets

in separate runs. For example, a sample in a Best dataset with fewer cancer types is not necessarily contained in Best6 or

Best8.

The sampling of the sixteen datasets is explained as follows. Let N and C denote

the number of samples and number of cancer types in the resampled datasets respectively.

In our experiments, we choose N = 500 and C = 2 for the two-class datasets. We choose

N = 1000 and C = 4, 6 and 8 for the multi-class datasets respectively. For each value of C, we

sample four datasets with four different levels of similarity.

In the first step, a clustering algorithm is applied to each cancer type to partition

all the samples belonging to this cancer type into several disjoint groups. Each cluster

corresponds to a different aberration pattern for a given cancer type. We use the

RSim clustering method for this purpose. The number of clusters for each cancer type is

determined adaptively as follows. For the ith cancer, let $S_i$ denote the number of samples

in it. The number of clusters is computed as $S_i C / N$. Therefore, the size of each cluster

is around N/C on average, which makes the size of the resampled dataset close to N.

Table 5-1. Detailed specifications of benchmark datasets. #cases and C denote the
number of cases and number of cancer types respectively.

Name    #cases  Similarity level  ICD-O-3 codes of cancer types
Best2   478     [0.030, 1.000]    80703, 81403
Good2   466     [0.018, 0.030)    80103, 81703
Fair2   351     [0.008, 0.018)    80103, 96733
Poor2   373     [0.000, 0.008)    98233, 96803
Best4   1160    [0.035, 1.000]    95003, 85233, 80703, 81403
Good4   790     [0.020, 0.035)    [illegible], 81703, 96803, 81403
Fair4   800     [0.010, 0.020)    95103, 96733, 96803, 81403
Poor4   800     [0.000, 0.010)    85233, 96803, 98233, 81403
Best6   1100    [0.030, 1.000]    81443, 95003, 81703, 85233, 80703, 81403
Good6   850     [0.017, 0.030)    91803, 81443, 96733, 80103, 97323, 81403
Fair6   880     [0.007, 0.017)    95103, 81400, 98233, 96803, 80703, 81403
Poor6   810     [0.000, 0.007)    85233, 98233, 80703, 81400, 81403, 96803
Best8   1000    [0.030, 1.000]    80103, 97323, 81400, 95003, 81703, 85233, 80703, 81403
Good8   830     [0.018, 0.030)    88903, 93913, 91803, 96733, 80103, 81703, 80703, 81403
Fair8   750     [0.006, 0.018)    95103, 80103, 97323, 81703, 96803, 98233, 80703, 81403
Poor8   760     [0.000, 0.006)    00000, 81400, 81703, 85233, 96803, 98233, 80703, 81403

In the second step, the similarities between any pair of clusters are computed and

sorted in ascending order. The 25th, 50th and 75th percentiles of the sorted similarity

sequence are chosen to divide the sequence into four segments with about equal length.

Each segment corresponds to a similarity level. We denote the four similarity levels as

Poor, Fair, Good, and Best.

In the third step, the minimum and maximum similarity in each segment are chosen

as the parameters $\delta_{min}$ and $\delta_{max}$ respectively. The datasets of different similarity levels are

sampled by the dataset resampler accordingly. We list the detailed specifications of our
datasets in Table 5-1.
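
The second and third steps can be sketched as follows. The sketch splits a sorted list of similarities at the quartile positions and reports the minimum and maximum of each segment. The similarity values are invented for illustration, the quartile positions are computed by simple integer division (an assumption), and at least four values are assumed.

```python
def similarity_levels(similarities):
    """Split sorted similarities into four roughly equal segments."""
    s = sorted(similarities)
    n = len(s)
    cuts = [0, n // 4, n // 2, (3 * n) // 4, n]   # 25th, 50th, 75th percentile positions
    names = ["Poor", "Fair", "Good", "Best"]
    levels = {}
    for name, start, end in zip(names, cuts, cuts[1:]):
        segment = s[start:end]
        levels[name] = (segment[0], segment[-1])   # (delta_min, delta_max)
    return levels

# Hypothetical pairwise similarities between clusters.
sims = [0.034, 0.001, 0.017, 0.009, 0.050, 0.004, 0.021, 0.012]
levels = similarity_levels(sims)
```

Each entry of `levels` gives the similarity range that would be passed to the dataset sampler for that level.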

It is worth noting that the actual number of cases may not equal the parameter N

in the resampled datasets. This is because the clustering algorithm may generate clusters










with unbalanced sizes. When we combine these clusters together, the actual size may be

larger or smaller than N.

5.5 Experimental Results

In this section, we describe the experimental comparison of our methods with

SVM-RFE and MRMR. We developed our code using MATLAB and ran our experiments

on a system with dual 2.59 GHz AMD Opteron Processors, 8 gigabytes of RAM, and a

Linux operating system.

5.5.1 Comparison of Linear and Raw Kernel

In this section, we compare the Raw kernel to linear kernel for the classification of

CGH data. We perform the experiments over the sixteen datasets using a 5-fold cross

validation (CV). For each dataset, we randomly divided the data set into five disjoint

subsets of about equal size. For each fold, we keep one subset as the test data set and the

other four sets as the training examples. We train two SVMs over the training examples

using the linear and Raw kernels respectively. We then use each SVM to predict the class

labels of the set aside examples. We compute the predictive accuracy of

each SVM as the ratio of the number of correctly classified samples to the number of test

dataset examples. Next, we choose another subset as set aside examples and the rest as

training examples. We repeat this procedure until each subset has been chosen as set aside

examples. As a result, we have five values of predictive accuracy corresponding to each

kernel respectively. We compute the average of the five values as the average predictive

accuracy for each kernel in 5-fold CV.
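
The cross-validation procedure can be sketched in Python as below. The classifier is passed in as a function; `train_and_predict` is a hypothetical stand-in for training an SVM and predicting the labels of the set aside examples, and the toy data and the threshold rule are invented for illustration.

```python
import random

def five_fold_accuracy(samples, labels, train_and_predict, folds=5, seed=0):
    """Average predictive accuracy over disjoint folds of about equal size."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    subsets = [idx[f::folds] for f in range(folds)]   # disjoint subsets
    accuracies = []
    for test_idx in subsets:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        predicted = train_and_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx])
        # Accuracy = correctly classified test samples / number of test samples.
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / folds

# Hypothetical one-dimensional data and a stand-in classifier (not an SVM).
samples = list(range(-10, 10))
labels = [1 if v >= 0 else -1 for v in samples]
sign_rule = lambda X_train, y_train, X_test: [1 if v >= 0 else -1 for v in X_test]
avg_accuracy = five_fold_accuracy(samples, labels, sign_rule)
```

Replacing `sign_rule` with a function that trains an SVM with the linear or the Raw kernel would give the two averages compared in this section.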

We use the DAGSVM (Directed Acyclic Graph SVM) provided by the MATLAB SVM

Toolbox [8] for the classification of multiclass data. All other parameters of SVM are set

to the standard values that are part of the software package and existing literature.

The results are presented in Figure 5-3. X-axis lists the sixteen different datasets.

Y-axis denotes the value of average predictive accuracy in 5-fold CV. For the sixteen

datasets, the Raw kernel outperforms the linear kernel in fifteen datasets (except best8). On


























Figure 5-3. Comparison of classification accuracies of SVM with linear and Raw kernels.
X-axis denotes different datasets. Y-axis denotes the predictive accuracy based
on 5-fold CV.


average, the Raw kernel improves the predictive accuracy by 6.1% over the sixteen datasets

compared to the linear kernel. For the best8 dataset, the difference between Raw and Linear

is less than 1%. These results demonstrate that SVM based on the Raw kernel works better

for the classification of CGH data as compared to linear SVM.

The remaining set of experimental results are limited to the Raw kernel (unless stated

explicitly).

5.5.2 Comparison of MIFS and Other Methods

In this section, our method, MIFS, is compared against MRMR (a filter based

approach) and SVM-RFE (a wrapper based approach). MRMR is shown to be more

effective than most filter methods, such as methods based on standard mutual information,

F-statistic or t-statistic [15]. The MIQ scheme of MRMR, i.e. the divisive combination

of relevance and redundancy, is used because it outperforms the MID scheme consistently.

SVM-RFE is a popular wrapper method for gene selection and cancer classification. It is

shown to be better than filter methods such as those based on ranking coefficients similar

to Fisher's discriminant criterion. SVM-RFE is also shown to be more effective than










wrapper methods using RFE and other multivariate linear discriminant functions, such as

Linear Discriminant Analysis and Mean Squared Error (Pseudo-inverse) [27].

For each method, a 5-fold cross validation is used. In each fold, the feature selection

method is applied over the training examples. Multiple sets of features with different sizes

(4, 8, 16 features etc.) are selected. For each set of features, an SVM is trained on the

training examples with only the selected features. The predictive accuracy of this SVM is

determined using the test (set aside) examples with the same set of features. These steps

are repeated for each of the five folds to compute the average predictive accuracy.

To test the predictive accuracy of features selected by different methods, DAGSVM

with the Raw kernel is used as it is found to be more effective than other methods. Since

the SVM-RFE presented in the literature only works for two-class data, it is extended

to multiclass data using the same "ranking scheme" that we use to extend MIFS (as

described in Section 5.3). The linear kernel is used in SVM-RFE for feature selection

purposes.

The experimental results for the multi-class datasets and two-class datasets are shown in

Table 5-2 and Table 5-3 respectively. In these tables, the predictive accuracies of features

selected by three methods, MIFS, MRMR and SVM-RFE, over each dataset are compared.

For each feature selection method, the results for 4, 8, 16, 40, 60, 80, 100, 150, 250 and

500 features over each dataset are presented. The results are averaged over the five folds and

reported in columns 3 to 12. In the 13th column, the average predictive accuracies of SVM

built upon 862 features, i.e. no feature selection, are reported. The average predictive

accuracies of the twelve datasets are reported in the last three rows. We mainly describe

the key findings of multi-class datasets in Table 5-2.

Comparison between MIFS and MRMR. The results in Table 5-2 show that, when

the number of features is less than or equal to sixteen, there is no clear winner between

MIFS and MRMR. Although MIFS is slightly better than MRMR based on the average

results of the twelve datasets, neither of the two methods is predominantly better than the

















Table 5-2. Comparison of classification accuracy for three feature selection methods on
multi-class datasets. The three methods are MIFS, MRMR and SVM-RFE
(denoted as RFE). The average results over twelve datasets are reported in the
last three rows. The 862 column (no feature selection) holds one value per dataset.
DS     Method  Number of Features
               4      8      16     40     60     80     100    150    250    500    862
poor4  MIFS    0.696  0.765  0.811  0.819  0.814  0.819  0.821  0.824  0.814  0.815
       MRMR    0.734  0.772  0.778  0.794  0.791  0.799  0.814  0.814  0.819  0.802  0.800
       RFE     0.567  0.644  0.681  0.706  0.746  0.771  0.794  0.814  0.821  0.821
poor6  MIFS    0.527  0.500  0.615  0.622  0.640  0.654  0.650  0.645  0.649  0.633
       MRMR    0.542  0.576  0.588  0.589  0.581  0.596  0.610  0.596  0.610  0.635  0.633
       RFE     0.337  0.370  0.431  0.531  0.551  0.564  0.578  0.593  0.608  0.635
poor8  MIFS    0.338  0.394  0.433  0.469  0.470  0.488  0.496  0.513  0.530  0.486
       MRMR    0.335  0.408  0.454  0.467  0.469  0.482  0.470  0.474  0.489  0.465  0.472
       RFE     0.250  0.274  0.303  0.300  0.423  0.435  0.457  0.456  0.456  0.475
fair4  MIFS    0.621  0.687  0.755  0.784  0.802  0.816  0.816  0.800  0.808  0.806
       MRMR    0.598  0.685  0.728  0.777  0.796  0.789  0.784  0.777  0.783  0.786  0.798
       RFE     0.466  0.527  0.608  0.693  0.753  0.753  0.771  0.786  0.787  0.806
fair6  MIFS    0.587  0.698  0.754  0.814  0.822  0.825  0.827  0.820  0.820  0.807
       MRMR    0.593  0.698  0.767  0.772  0.786  0.807  0.802  0.807  0.801  0.804  0.792
       RFE     0.504  0.640  0.696  0.761  0.775  0.780  0.781  0.780  0.797  0.816
fair8  MIFS    0.536  0.641  0.684  0.700  0.736  0.733  0.727  0.735  0.732  0.713
       MRMR    0.540  0.653  0.681  0.721  0.707  0.712  0.715  0.704  0.698  0.695  0.720
       RFE     0.398  0.528  0.616  0.677  0.687  0.688  0.702  0.700  0.701  0.700
good4  MIFS    0.586  0.673  0.763  0.773  0.782  0.780  0.783  0.774  0.778  0.767
       MRMR    0.600  0.681  0.755  0.761  0.779  0.780  0.780  0.770  0.772  0.761  0.755
       RFE     0.543  0.610  0.656  0.711  0.718  0.740  0.732  0.735  0.767  0.749
good6  MIFS    0.455  0.551  0.593  0.645  0.700  0.716  0.724  0.697  0.700  0.694
       MRMR    0.427  0.532  0.621  0.667  0.680  0.600  0.677  0.687  0.675  0.664  0.696
       RFE     0.339  0.437  0.517  0.597  0.638  0.653  0.660  0.682  0.674  0.698
good8  MIFS    0.373  0.477  0.567  0.650  0.674  0.676  0.665  0.673  0.666  0.655
       MRMR    0.336  0.461  0.527  0.615  0.634  [illegible]  0.644  0.646  0.649  0.661  0.652
       RFE     0.258  0.346  0.424  0.508  0.530  0.581  0.605  0.624  0.632  0.654
best4  MIFS    0.650  0.754  0.763  0.817  0.829  0.832  0.829  0.821  0.838  0.820
       MRMR    0.667  0.757  0.775  0.785  0.789  0.793  0.798  0.791  0.784  0.802  0.803
       RFE     0.596  0.650  0.708  0.753  0.766  0.789  0.776  0.791  0.803  0.817
best6  MIFS    0.497  0.568  0.699  0.731  0.767  0.765  0.763  0.770  0.750  0.755
       MRMR    0.497  0.568  0.688  0.730  0.731  0.725  0.746  0.739  0.748  0.740  0.750
       RFE     0.449  0.499  0.587  0.667  0.710  0.712  0.727  0.729  0.736  0.749
best8  MIFS    0.427  0.543  0.635  0.726  0.737  0.733  0.735  0.732  0.735  0.727
       MRMR    0.434  0.563  0.652  0.704  0.700  0.714  0.712  0.700  0.693  0.704  0.707
       RFE     0.342  0.429  0.532  0.641  0.648  0.687  0.694  0.723  0.719  0.724
Avg    MIFS    0.524  0.612  0.673  0.713  0.732  0.736  0.737  0.734  0.735  0.723
       MRMR    0.518  0.606  0.664  0.696  0.702  0.700  0.710  0.707  0.707  0.706  0.716
       RFE     0.422  0.497  0.563  0.636  0.662  0.679  0.690  0.700  0.708  0.721










other. However, when the number of features is greater than sixteen, MIFS outperforms

MRMR in almost all cases. This is because, as the number of features increases, features

that individually are not discriminating may increase the predictive power when combined

with the selected features. Although MRMR tries to address this deficiency of filter based

methods by incorporating minimum redundancy, it is inferior to the method described in

this paper for CGH datasets. Further, if we compare the best predictive accuracy obtained

for a given dataset (given in bold) by using MIFS to that of MRMR, we observe that

MIFS always gives a better value.

Comparison between MIFS and SVM-RFE. The results in Table 5-2 show that

MIFS outperforms SVM-RFE in almost all cases. Clearly, as the number of features

increases, the gap between MIFS and SVM-RFE narrows. They become comparable in

terms of predictive accuracy when the number of features reaches several hundreds (we do

not report these results due to the space limitations). We believe that a forward scheme

is better because it first adds the highest discriminating features followed by features

that individually are not discriminating, but improve the classification accuracy when

used in combination with the discriminating features. On the other hand, a backward

elimination scheme (RFE) often selects the "strongest" features but excludes complementary

features that individually do not discriminate the data well. This is exemplified by a

simple example of a classification problem with three features k1, k1 and k2, where the

first two are identical copies. Feature k1 works much better than k2 in discriminating

the data. Assume that we want to select

two features. A typical RFE scheme will first remove k2 because it influences the objective

function least. The two selected features would be k1 and k1. On the other hand, the

proposed forward selection scheme (FS) will choose k1 followed by k2 because choosing

another k1 does not change the objective function at all. Therefore, the two features

selected by the FS scheme lead to a better predictive accuracy as compared to those selected

by the RFE scheme.












Table 5-3. Comparison of classification accuracy for three feature selection methods on
two-class datasets. The three methods are MIFS, MRMR and SVM-RFE
(denoted as RFE). The average results over four datasets are reported in the
last three rows. The 862 column (no feature selection) holds one value per dataset.
DS     Method  Number of Features
               4      8      16     40     60     80     100    150    250    500    862
poor2  MIFS    0.807  0.920  0.920  0.914  0.923  0.909  0.904  0.904  0.906  0.909
       MRMR    0.791  0.885  0.925  0.901  0.922  0.914  0.909  0.920  0.925  0.908  0.914
       RFE     0.775  0.775  0.853  0.917  0.914  0.914  0.914  0.906  0.898  0.904
fair2  MIFS    0.744  0.775  0.829  0.858  0.875  0.877  0.869  0.872  0.875  0.858
       MRMR    0.741  0.783  0.835  0.852  0.846  0.843  0.843  0.849  0.846  0.837  0.849
       RFE     0.675  0.749  0.772  0.815  0.823  0.815  0.818  0.818  0.872  0.846
good2  MIFS    0.818  0.798  0.807  0.822  0.822  0.835  0.837  0.837  0.833  0.822
       MRMR    0.798  0.815  0.815  0.813  0.813  0.820  0.807  0.803  0.800  0.809  0.832
       RFE     0.758  0.781  0.818  0.807  0.806  0.824  0.820  0.815  0.811  0.834
best2  MIFS    0.854  0.864  0.875  0.875  0.875  0.870  0.868  0.870  0.862  0.872
       MRMR    0.852  0.864  0.872  0.879  0.881  0.885  0.866  0.866  0.858  0.885  0.875
       RFE     0.793  0.812  0.841  0.852  0.852  0.835  [illegible]  0.841  0.860  0.875
Avg    MIFS    0.806  0.839  0.858  0.867  0.873  0.873  0.869  0.871  0.869  0.865
       MRMR    0.795  0.837  0.862  0.861  0.866  0.866  0.856  0.859  0.857  0.860  0.868
       RFE     0.750  0.779  0.821  0.848  0.849  [illegible]  0.850  0.845  0.860  0.864


Comparison on two-class datasets. The results in Table 5-3 show that, unlike the results in

Table 5-2, although MIFS is slightly better than MRMR based on the average results

of four datasets, there is no clear winner that beats the other for every one of the

four datasets. This may indicate that MIFS and MRMR are comparable in terms

of classification accuracy for two-class datasets. The results also show that MIFS

outperforms SVM-RFE in most cases when the number of features is less than 250. As the

number of features increases, the gap between MIFS and SVM-RFE narrows. They become

comparable when the number of features reaches 250. This is consistent with our conclusion on

multi-class datasets.


Using MIFS for feature selection. The results in Table 5-2 and Table 5-3 show

that using 40 features results in classification accuracy that is comparable to using all

the features. Also, using 80 features derived from the MIFS scheme results in comparable

or better classification accuracy as compared to all the features. This is significant

because, beyond data reduction, the proposed scheme can lead to better classification. To

support this hypothesis, we generated four new datasets using our dataset resampler. The










Table 5-4. Comparison of classification accuracy using different numbers of features.
Dataset  Number of Features
         40     80     862
newds1   0.801  0.792  0.799
newds2   0.803  0.819  0.800
newds3   0.629  0.670  0.637
newds4   0.706  0.748  0.719
Average  0.735  0.757  0.739


resulting four datasets (newds1 to newds4) contain 4, 5, 6 and 8 classes respectively. The

numbers of samples in the four datasets are 508, 1021, 815 and 649. We applied the MIFS

method over these datasets. We compare the classification accuracies obtained by using

all 862 features to those using only 40 and 80 selected features. The results are shown

in Table 5-4. These results substantiate our hypothesis that using around 40 features

(roughly 5% of all features) can generate comparable accuracy to using all the features.

Also, using around 80 features (roughly 10% of all the features) can result in comparable

or better prediction than all the 862 features.

It is worth noting that the other two methods typically have lower or comparable

accuracy when a smaller number of features is used.

5.5.3 Consistency of Selected Features

To test the classification performance of a feature selection method, a multi-fold

cross validation is usually used. In each fold, a set of features is selected based on the

training examples. A classifier trained on training examples with the selected features is

used to test the predictive accuracy of these features on testing examples. However, the

feature sets selected in different folds are often different. A criterion is needed to evaluate

how consistently features are selected across multiple folds. This criterion is important

because low consistency may indicate that the feature selection

algorithm is sensitive to the training examples and easily subject to overfitting. Further,

consistently selected features help identify the most important chromosomal regions that

are particularly relevant to cancers. In this section, we develop a novel measure called









Pairwise Maximum Matching (PMM) to evaluate the consistency of features selected

across multiple folds for CGH data.

An important property of CGH data is that neighboring features are usually highly

correlated as a pointlike genomic aberration can expand to the neighboring intervals.

Due to the difference in the training examples, these highly correlated features can be

alternatively selected in different folds. For example, assume that two sets of features are

selected in two folds. The 53rd and 54th features are selected only in the first and second

set respectively. Although these two features are different, they should be considered

matching because both the 53rd and 54th features are highly correlated and represent the

same aberration pattern of interest.

We first define the correlation between two features. Consider a set of $n$ CGH samples

$\{x_1, x_2, \ldots, x_n\}$. Let $x_i^d$ denote the value (1, -1 or 0 for gain, loss or no aberration

respectively) for the $i$th sample at the $d$th feature, $\forall d$, $1 \le d \le D$, where $D$ is the number

of genomic intervals. The number of samples that have aberrations at the $d$th feature can
be computed as $B(d) = \sum_{i=1}^{n} |x_i^d|$. In principle, the correlation between neighboring

features is caused by contiguous runs of gain or loss status. We use the term segment

to represent a contiguous run of aberrations of the same type in a sample. Intuitively,

two features are highly correlated when a large number of segments intersect with both

features. Let $x_i[u, v]$ denote a segment of $x_i$ that starts at the $u$th interval and ends at the
$v$th interval, i.e. $\{x_i^u, \ldots, x_i^v\}$ such that $x_i^u = x_i^{u+1} = \cdots = x_i^v \ne 0$, $x_i^{u-1} \ne x_i^u$ and $x_i^{v+1} \ne x_i^v$. Let $k$ and

$k'$ ($1 \le k \le k' \le D$) denote two features. We define $C_i(k, k') = 1$ if there exists a segment

$x_i[u, v]$ in the $i$th sample that intersects with $k$ and $k'$, i.e. $u \le k$, $k' \le v$; otherwise $C_i(k, k') = 0$. The correlation

coefficient between $k$ and $k'$ is computed as


$$Cor(k, k') = \frac{\sum_{i=1}^{n} C_i(k, k')}{\max(B(k), B(k'))}$$









The value of this coefficient lies in [0, 1]. It identifies the fraction of segments that

intersect with both features. Given a user specified threshold $\epsilon$, two features $k$ and $k'$

are defined as highly correlated if and only if $Cor(k, k') > \epsilon$.
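
As a rough illustration of the definition above, the following Python sketch computes the correlation coefficient from samples encoded as lists of 1, -1 and 0. The function names `segments`, `aberration_count` and `correlation` are hypothetical. Two simplifying assumptions are made that the text does not state: feature indices are 0-based, and chromosome boundaries are ignored, so a run is never split at the end of a chromosome.

```python
def segments(sample):
    """Yield (u, v), inclusive and 0-based, for each maximal run of equal nonzero values."""
    u, n = 0, len(sample)
    while u < n:
        if sample[u] == 0:
            u += 1
            continue
        v = u
        while v + 1 < n and sample[v + 1] == sample[u]:
            v += 1
        yield (u, v)
        u = v + 1


def aberration_count(samples, d):
    """B(d): number of samples with a gain or loss at feature d."""
    return sum(abs(x[d]) for x in samples)


def correlation(samples, k, k2):
    """Cor(k, k'): samples with one segment covering both features, over max(B(k), B(k'))."""
    lo, hi = min(k, k2), max(k, k2)
    shared = sum(any(u <= lo and hi <= v for u, v in segments(x)) for x in samples)
    denom = max(aberration_count(samples, k), aberration_count(samples, k2))
    return shared / denom if denom else 0.0  # assumption: 0 when neither feature is aberrant


samples = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, -1, -1, 0],
    [1, -1, 0, 0],   # gain then loss: two separate segments
]
print(correlation(samples, 0, 1))  # 2 shared segments / max(3, 4) aberrant samples
```

With a threshold of 0.8 as used later in this section, features 0 and 1 of this toy data would not count as highly correlated.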

Next, we explain how to evaluate the consistency of two sets of features. Let

$K = \{k_1, \ldots, k_r\}$ and $K' = \{k'_1, \ldots, k'_r\}$ denote the two sets of $r$ features. We create a

bipartite graph as follows. For each feature in the two sets, a vertex is added in the graph.

For each feature $k_i$, $1 \le i \le r$, if there exists a feature $k'_j$ such that $k'_j$ is highly correlated

to $k_i$, i.e. $Cor(k_i, k'_j) > \epsilon$, an edge connecting the corresponding vertices of $k_i$ and $k'_j$ is

added. Let $V_1$ and $V_2$ denote the sets of vertices corresponding to $K$ and $K'$ respectively. It

can be seen that every edge in the graph connects a vertex in $V_1$ and one in $V_2$. Therefore,

the resulting graph is a bipartite graph. A maximum matching $M$ found in this graph is a

set of edges that identify pairs of identical (or highly correlated) features selected in both

sets. We score the consistency between $K$ and $K'$ as $T(K, K') = \frac{|M|}{r}$, where $|M|$ denotes

the number of edges in $M$.

For multiple sets of features $\{K_1, \ldots, K_f\}$, the PMM measure is computed as the

average score of each pair of feature sets:

$$PMM = \frac{2 \sum_{i<j} T(K_i, K_j)}{f \times (f - 1)}$$

where $K_i$ and $K_j$ denote the $i$th and $j$th feature sets respectively.
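
The PMM computation can be sketched as follows. This is a minimal illustration, not the implementation used for the experiments: the maximum matching is found with a standard augmenting-path search (Kuhn's algorithm), the function names are hypothetical, and the predicate `adjacent` is a stand-in assumption that treats features at most one interval apart as highly correlated. In practice the predicate would test whether the correlation coefficient exceeds the threshold.

```python
from itertools import combinations


def max_matching_size(left, right, is_match):
    """Size of a maximum bipartite matching, found by augmenting paths."""
    match_of_right = {}  # index in `right` -> index in `left`

    def try_assign(i, visited):
        for j, b in enumerate(right):
            if j in visited or not is_match(left[i], b):
                continue
            visited.add(j)
            # Take j if it is free, or if its current partner can move elsewhere.
            if j not in match_of_right or try_assign(match_of_right[j], visited):
                match_of_right[j] = i
                return True
        return False

    return sum(try_assign(i, set()) for i in range(len(left)))


def consistency(K1, K2, is_match):
    """T(K, K') = |M| / r for two feature sets of equal size r."""
    return max_matching_size(K1, K2, is_match) / len(K1)


def pmm(feature_sets, is_match):
    """Average consistency over all pairs of feature sets (needs at least two sets)."""
    pairs = list(combinations(feature_sets, 2))
    return sum(consistency(a, b, is_match) for a, b in pairs) / len(pairs)


def adjacent(a, b):
    """Stand-in for 'highly correlated': same or neighboring interval (an assumption)."""
    return abs(a - b) <= 1


# Features 53 and 54 differ but are treated as matching, as in the example in the text.
print(consistency([53, 10, 200], [54, 11, 300], adjacent))
```

Averaging over the list of pairs is the same as the formula above, since there are f(f - 1)/2 pairs among f feature sets.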

We use the above approach to evaluate the consistency of features selected by three

methods (MIFS, MRMR and SVM-RFE) for the twelve multi-class datasets. For each

dataset, each method selects five sets of features because a 5-fold cross validation is used.

The PMM scores of the five sets of features selected by different methods on different

datasets are reported in Table 5-5. The number of features is set to 20, 50 and 100.

The parameter $\epsilon$ is set to 0.8.

To show the significance of the PMM scores, a random test is performed as follows.

For each dataset, five sets of features are randomly selected and the PMM score is










computed. The number of features and the value of parameter $\epsilon$ are exactly the same as

above. This process is repeated one million times. The mean value, the first percentile

and ninety-ninth percentile of the one million scores are reported in Table 5-5.

The results show that both MRMR and MIFS outperform SVM-RFE considerably

in terms of PMM scores. The PMM scores of MRMR are often slightly better than those

of MIFS. Also, the PMM scores of both MRMR and MIFS are significantly greater than

the ninety-ninth percentile of random scores. This indicates that the features selected by

MRMR and MIFS in multiple folds are significantly consistent. On the other hand, the

scores of SVM-RFE are often within the range of the first and the ninety-ninth percentile

of the random scores. This indicates that SVM-RFE performs poorly in consistently selecting

features in multiple folds. It is worth noting that the gap between MIFS or MRMR and

the random approach decreases as the number of features increases. This is because the more

features are selected, the larger the chance of finding a pair of matching features in two

random sets. As a result, the PMM scores of the random approach increase too. Since the

results show that using about 10% of all features already provides a good classification

performance, we limit the comparison of PMM scores to small numbers of features (less

than or equal to 100).

5.6 Conclusions

Recurrent chromosomal alterations provide cytological and molecular positions for

the diagnosis and prognosis of cancer. Comparative Genomic Hybridization (CGH) is one

of the important mapping techniques that has been shown to be useful for understanding

these alterations in cancerous cells.

In this chapter, we develop novel SVM based methods for classification and feature

selection of CGH data. For classification, we developed a novel similarity kernel that

is shown to be more effective than the standard linear kernel used in SVM. For feature

selection, we propose a novel method based on the new kernel that iteratively selects

features that provide the maximum benefit for classification. We compared our methods










Table 5-5. Comparison of PMM scores of three feature selection methods. Terms r, 99%
and 1% denote the number of selected features, the ninety-ninth percentile and
the first percentile respectively.
Dataset  r    MIFS  MRMR  SVM-RFE  Random
                                   mean  99%   1%
best4    20   0.68  0.87  0.32     0.29  0.41  0.19
         50   0.82  0.91  0.51     0.49  0.57  0.42
         100  0.86  0.91  0.65     0.64  0.69  0.58
best6    20   0.83  0.94  0.45     0.31  0.43  0.21
         50   0.84  0.91  0.58     0.52  0.60  0.44
         100  0.88  0.91  0.68     0.66  0.72  0.60
best8    20   0.71  0.93  0.41     0.31  0.43  0.20
         50   0.85  0.91  0.49     0.51  0.60  0.41
         100  0.87  0.90  0.67     0.65  0.72  0.60
fair4    20   0.83  0.82  0.28     0.22  0.31  0.13
         50   0.76  0.86  0.44     0.40  0.47  0.33
         100  0.81  0.88  0.54     0.57  0.62  0.52
fair6    20   0.81  0.86  0.46     0.25  0.37  0.16
         50   0.81  0.87  0.54     0.44  0.54  0.35
         100  0.83  0.91  0.63     0.59  0.65  0.54
fair8    20   0.77  0.82  0.47     0.25  0.36  0.14
         50   0.80  0.85  0.53     0.45  0.52  0.36
         100  0.84  0.88  0.68     0.60  0.67  0.53
good4    20   0.83  0.89  0.42     0.29  0.40  0.18
         50   0.86  0.87  0.54     0.49  0.57  0.41
         100  0.87  0.89  0.61     0.64  0.70  0.59
good6    20   0.76  0.84  0.34     0.31  0.41  0.20
         50   0.75  0.89  0.52     0.51  0.59  0.42
         100  0.88  0.90  0.68     0.65  0.71  0.58
good8    20   0.78  0.74  0.35     0.27  0.39  0.18
         50   0.80  0.86  0.51     0.47  0.55  0.38
         100  0.85  0.87  0.66     0.63  0.68  0.57
poor4    20   0.73  0.75  0.36     0.20  0.33  0.11
         50   0.74  0.85  0.46     0.38  0.45  0.31
         100  0.82  0.87  0.56     0.54  0.60  0.49
poor6    20   0.62  0.79  0.25     0.19  0.29  0.11
         50   0.68  0.83  0.44     0.51  0.45  0.29
         100  0.81  0.85  0.58     0.52  0.57  0.47
poor8    20   0.68  0.74  0.33     0.22  0.33  0.13
         50   0.74  0.80  0.46     0.40  0.49  0.34
         100  0.80  0.83  0.60     0.56  0.62  0.51
Average  20   0.75  0.83  0.37     0.26  0.37  0.16
         50   0.79  0.87  0.50     0.45  0.54  0.37
         100  0.84  0.88  0.63     0.61  0.66  0.55










against the best wrapper based and filter based approaches that have been used for feature

selection of large dimensional biological data. Our results on datasets generated from

the Progenetix database suggest that our methods are considerably superior to existing

methods. Further, unlike other methods proposed in the literature, our methods can

reduce the overall classification error by using a small fraction (around 10%) of all the

features.










CHAPTER 6
INFERRING PROGRESSION MODELS

Cancer is classified into multiple histological types, each of which consists of multiple

subtypes. Genomic aberrations may differ between histologically identical tumors (e.g.

gastro-esophageal, depending on location [76]), different histological subtypes have

different changes (e.g. in renal cell carcinomas, [35]), and different patterns may appear

in the same histologic subtype (and may point towards different mechanisms; e.g. see

complex re-arrangements vs. whole-chromosome gains/losses [52]). The tumor evolution

process leaves characteristic signatures of inheritance along the pathways of progression,

and a method has been presented to infer models of tumor progression by identifying these

signatures in genome-wide data of mutations [6]. Evidence has shown that patterns of

recurrent Copy Number Alterations (CNAs) are observed for a broad range of cancers or

subtypes of the same cancer.

To our knowledge, most existing works infer tumor progression models based on

genetic events such as recurrent Copy Number Alterations (CNAs). Their models describe

the evolutionary relationship between events and consequently expose the progression

and development of tumors. However, most existing works focus on the progression

of individual recurrent alterations. This approach leads to very complex models when

multiple cancers are concerned, given that each cancer contains a set of recurrent

alterations. A promising approach is to consider the whole set of alterations of a

cancer and infer a model based on the alteration patterns of different cancers. Such

models effectively utilize the molecular characteristics of cancers and easily extend to large

scale analysis. In this chapter, we have developed novel graph based computational

methods that derive relationships within a histological type or between histological

subtypes.










6.1 Preliminary

In this section, we briefly introduce some preliminary knowledge related to this work.

In Section 6.1.1, we review the concept of markers that define the key recurrent CNAs in

a cancer. In Section 6.1.2, we introduce an approach proposed by Bilke et al. which is

extended later for inferring the progression of markers. In Section 6.1.3, we describe a

known tree fitting problem that infers phylogenetic models for cancers.

6.1.1 Marker Detection

Due to the correlation between neighboring genomic intervals [42], recurrent

alterations usually accumulate together and form a region of recurrent alterations,

which we call recurrent region. Given a set of samples that belong to the same cancer,

a marker is an independent key recurrent alteration representing a recurrent region. We

proposed a dynamic programming algorithm to identify the best R markers for a set of

CGH cases. We demonstrated that our markers capture the aberration patterns well and

improve the clustering of CGH cases [42].

Next, we briefly introduce some notations of markers. Each marker $m$ in a cancer is

represented by two numbers $\langle p, q \rangle$, where $p$ and $q$ denote the position and the aberration

type respectively. The aberration type of a marker is either gain or loss, denoted by 1

or -1 respectively. Consider a set $S$ of $N$ CGH cases $\{s_1, s_2, \ldots, s_N\}$. Let $x_j^d$ denote the

alteration value (1, -1 or 0 for gain, loss or no aberration respectively) for case $s_j$ at the

$d$th feature, $\forall d$, $1 \le d \le D$, where $D$ is the number of genomic intervals. We use the

term segment to represent a contiguous run of aberrations of the same type in a case. Let

$s_j[u, v]$ be the segment of $s_j$ that starts at the $u$th interval and ends at the $v$th interval.

Formally, $s_j[u, v]$ denotes a contiguous run of intervals $\{x_j^u, \ldots, x_j^v\}$, for $c_s \le u \le v \le c_e$,
where $x_j^u = x_j^{u+1} = \cdots = x_j^v \ne 0$, $x_j^{u-1} \ne x_j^u$, $x_j^{v+1} \ne x_j^v$, and $c_s$ and $c_e$ denote the starting and

ending intervals of a chromosome in $s_j$ respectively.

Let $m = \langle p, q \rangle$ be a marker. We denote the independent support of $s_j$ to $m$ as

$\delta(s_j, m)$. Here, $\delta(s_j, m) = 1$ if and only if $x_j^p = q$. Otherwise, $\delta(s_j, m) = 0$. We define the

total independent support value of marker $m$ as the sum of its support from all the cases.

Formally, $Supt(m) = \sum_{j=1}^{N} \delta(s_j, m)$. We will use the term support to denote $Supt(m)$ in this

paper. Please note that this support is not the same as what we proposed in our previous

work for marker identification [42].

6.1.2 Tumor Progression Model

Bilke et al. proposed an approach of inferring a tumor progression model for

Neuroblastoma (NB) with four different subtypes from CGH data [6]. They described

the relationship between different subtypes based on the recurrent alterations shared by

these subtypes. Their idea first identified a set of recurrent alterations. Each recurrent

alteration belonged to one of the following three categories: common (shared by all the

subtypes), shared (shared by two or more subtypes) and unique (distinct to only one

subtype). They proposed a statistical model to identify recurrent alterations and compute

the shared status of these alterations. Each shared status was a set of subtypes that

contain this recurrent alteration.

The shared status of recurrent alterations can be described using a Venn diagram. For

example, Figure 6-1 shows two Venn diagrams of two sets, represented by two overlapping

circles. Let S1 and S2 denote the left and right circle respectively. There are three distinct

areas (denoted as sections) marked by A, B and C in each Venn diagram. Each section

represents a possible logical relationship between the two sets. For example, sections A and

C represent S1 - S2 (the members of S1 that are not in S2) and S1 ∩ S2 respectively. A section is called non-empty if it contains

some members. Each non-empty section is marked by a distinct color in Figure 6-1. The

component of a non-empty section is defined to be the sets whose members are contained

in this section. For example, the components of sections A and C are {S1} and {S1, S2}

respectively. In general, the number of distinct sections $S$ in a Venn diagram of $K$ sets

is given by $S = \sum_{i=1}^{K} \binom{K}{i}$, which is also the number of different shared statuses of
a recurrent alteration between $K$ cancer subtypes. Since each section can be empty or

non-empty, there are in total $2^S$ distinct Venn diagrams for $K$ sets.
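
The count of sections can be checked by enumeration. The short Python sketch below lists every section of a Venn diagram as the group of sets it lies inside and confirms that the sum of binomial coefficients equals 2^K - 1, a standard identity. The function name `venn_sections` and the subtype labels are hypothetical placeholders.

```python
from itertools import combinations
from math import comb


def venn_sections(set_names):
    """List every section of a Venn diagram as the group of sets it lies inside."""
    names = sorted(set_names)
    return [frozenset(group)
            for size in range(1, len(names) + 1)
            for group in combinations(names, size)]


K = 4
sections = venn_sections(["NB1", "NB2", "NB3", "NB4"])  # hypothetical subtype labels
S = len(sections)

# S = sum over i of C(K, i) = 2^K - 1; each section is empty or not, giving 2^S diagrams.
print(S, sum(comb(K, i) for i in range(1, K + 1)), 2 ** K - 1, 2 ** S)
```

For four subtypes this gives 15 possible shared statuses, which illustrates why an approach that examines every section grows exponentially with the number of cancers.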










The authors built a Venn diagram of four sets for the four different subtypes of NB.

Each shared status corresponds to a distinct section in this Venn diagram. By computing

the shared status of each recurrent alteration, one can determine if a section in the Venn

diagram is empty or not. As a result, the structure of the Venn diagram is determined.

The authors proposed a graph model based on the structure of Venn diagram to infer the

progression of four different subtypes of NB. The graph model satisfies the following three

conditions:

1. All alterations found in a parent genotype must be present in the offspring with a

similar frequency. The daughter generation acquires additional alterations.

2. Unobservable intermediate genotypes are possible, but the model with the smallest

number of genotypes is utilized.

3. All genotypes arise from a common ancestor (i.e. the model has a root).

The resulting graph is a directed acyclic graph with each vertex corresponding to a

non-empty section in the Venn diagram. An edge connects from a vertex u to a vertex v

if (1) the set of cancer subtypes that contain the recurrent alterations of u is a superset

of that of v and (2) there is no other vertex w such that the set of cancer subtypes that

contain the recurrent alterations of w is a superset of that of v and a subset of that of u.

The number of vertices in the resulting graph is bounded by min{S, T}, where T is the

number of recurrent alterations. For example, the graph models corresponding to the two

Venn diagrams in Figure 6-1 are shown on the right of the figure.
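
The two edge conditions can be turned into a small routine. The Python sketch below is one possible reading of the description, not an algorithm given by Bilke et al.: each vertex is represented by the set of subtypes of a non-empty section, "superset" is taken to mean proper superset, and the function name `progression_graph` is hypothetical.

```python
def progression_graph(non_empty_sections):
    """Return (vertices, edges) for a collection of non-empty Venn sections.

    Each section is given as the set of subtypes that contain its alterations.
    An edge (u, v) is added when u is a proper superset of v and no other
    vertex w lies strictly between them.
    """
    vertices = sorted({frozenset(s) for s in non_empty_sections},
                      key=lambda s: (-len(s), sorted(s)))
    edges = [(u, v)
             for u in vertices
             for v in vertices
             if u > v and not any(u > w > v for w in vertices)]
    return vertices, edges


# Example (a) of Figure 6-1: sections A = {S1}, B = {S2}, C = {S1, S2} are all non-empty.
_, edges_a = progression_graph([{"S1"}, {"S2"}, {"S1", "S2"}])
# Example (b): section A is empty, leaving only C and B.
_, edges_b = progression_graph([{"S2"}, {"S1", "S2"}])
print(len(edges_a), len(edges_b))
```

In both examples the section shared by both sets (C) becomes the root, which matches the requirement that all genotypes arise from a common ancestor.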

The authors demonstrate that, with the help of such a model, it is possible to

identify traces of tumor progression in CGH data. However, their approach has several

limitations.

* First, their method of calculating the shared status of each recurrent alteration is

very computationally expensive. The time complexity is exponential in the number of

cancers K.
































Figure 6-1. Examples of Venn diagrams and corresponding graph models. Each Venn
diagram (left) consists of two sets. The corresponding graph models are shown
on the right. The three sections in the Venn diagram are denoted as A, B and
C respectively. The main difference between example (a) and (b) is that, in
example (b), section A is empty, i.e. it contains no members. Therefore, the
corresponding graph model of (b) consists of only two vertices, C and B.


* Second, this method can model the progression of markers. It, however, cannot model the

evolutionary relationship among different cancer types.

In addition to these limitations, Bilke et al. do not provide a systematic algorithm for

mapping the Venn diagram to the graph model automatically. These limitations make it

impractical to use their method for large scale datasets composed of many cancers.

6.1.3 Tree Fitting Problem

Phylogenetics is one of the approaches commonly used to infer evolutionary

relationships between genes or species of organisms. Central to most studies of phylogeny

is the concept of a phylogenetic tree, typically a graphical representation of the evolutionary

relationship among three or more genes or organisms [39].









A broad range of phylogenetic tree construction methods have been proposed. Among

them, an important category is called the distance matrix method. The tree construction

problem of the distance matrix method can be described as follows. Let L denote the set of

samples and $\mathbb{R}$ the set of real numbers. A distance matrix, D, on L is a |L| x |L| matrix,

where each entry D(i, j) of this matrix denotes the distance between the ith and the jth

sample based on a predefined distance function. Let T denote a phylogenetic tree built

upon L, T = (V, E).

Each leaf level node of T corresponds to a sample in set L. Also, there is a node at

the leaf level of T for each sample in L. In other words, there is a bijection between the

leaf level nodes and the samples in L. The rest of the nodes in V are the internal nodes

of the tree. Each edge in E is assigned a positive real number that denotes the weight

of that edge. This is also termed the length of the edge in the literature. For any pair of

leaf nodes $i, j \in V$, define $P_{ij}$ as the path in T between i and j. The length of a path is

the sum of the weights of the edges on that path. We create a new distance matrix D',

where each entry D'(i, j) is the length of the path between i and j. The distance matrix

method aims at the following: given a distance matrix D, find a tree T such that D' is a

good approximation to D.
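
To make the tree-induced distance concrete, the Python sketch below computes the path-length matrix D' for the leaves of a small weighted tree and scores how closely it matches a given matrix D. The function names `path_length_matrix` and `fit_error` are hypothetical, and the sum of squared differences is only one common way to measure a "good approximation"; the text does not commit to a particular criterion.

```python
from collections import defaultdict


def path_length_matrix(edges, leaves):
    """D'(i, j): sum of edge weights on the unique tree path between leaves i and j."""
    adj = defaultdict(list)
    for a, b, w in edges:
        adj[a].append((b, w))
        adj[b].append((a, w))
    D_tree = {}
    for src in leaves:
        dist = {src: 0.0}
        stack = [src]
        while stack:  # depth-first walk; each node is reached by exactly one path
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        for dst in leaves:
            D_tree[src, dst] = dist[dst]
    return D_tree


def fit_error(D, D_tree, leaves):
    """Sum of squared differences over leaf pairs (an assumed fitting criterion)."""
    return sum((D[i, j] - D_tree[i, j]) ** 2
               for i in leaves for j in leaves if i < j)


# Toy tree: leaves a, b, c and internal nodes x, y.
edges = [("a", "x", 1.0), ("b", "x", 2.0), ("x", "y", 0.5), ("c", "y", 3.0)]
leaves = ["a", "b", "c"]
D_tree = path_length_matrix(edges, leaves)
print(D_tree["a", "b"], D_tree["a", "c"], D_tree["b", "c"])
```

Methods such as UPGMA and Neighbor Joining build the tree itself; this sketch only evaluates a tree that is already given.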

The tree fitting problem has been widely studied in molecular phylogenetics. Some of

the leading distance matrix methods for tree construction include the unweighted-pair-group

method with arithmetic mean (UPGMA) [39] and Neighbor Joining [39].

6.2 Progression Model for Markers

In this section, we extend Bilke's approach [6] to infer progression models for markers

of multiple cancer types. Markers are the independent key recurrent alterations that

characterize the aberration pattern of a cancer type (Figure 4-2). Studies of the evolution

of markers would be of obvious value to define gene loci relevant for the early diagnosis

or treatment of cancer. It helps to answer questions about which markers tend to occur in

many cancers, which markers are likely to occur together, etc. The main difference between










our approach and the previous work is that we focus on markers instead of every recurrent

alteration.

We compute the shared status of markers as follows. A marker identified in one

cancer represents a recurrent alteration region in this cancer. However, two or

more cancers containing the same recurrent region may not have markers identified

at the same position due to the noise in the aberration patterns. Therefore, markers in

different cancers representing the same recurrent region should be considered shared by

these cancers.

First, we define the correlation between a marker and its neighboring intervals. Let $C$
denote a set of cases belonging to the same cancer. Let $m = \langle p, q \rangle$ and
$d$, $1 \le d \le D$, denote a marker in $C$ and a genomic interval respectively, where
$p$ is the position of the marker and $q$ is its type. For each case $s_j \in C$, we
define $E_j(d, m) = 1$ if there exists a segment $s_j[u, v]$ of type $q$ overlapping both
intervals $d$ and $p$, i.e. $u \le d, p \le v$; otherwise, $E_j(d, m) = 0$. The function
$E_j(d, m)$ indicates that the alterations at $d$ and $p$ belong to the same segment in
$s_j$ and can be caused by the same point-like genomic alteration. We compute the
correlation between $d$ and $m$ as

$$Cor(d, m) = \frac{\sum_{s_j \in C} E_j(d, m)}{|C| \times Supt(m)},$$

where $|C|$ denotes the size of $C$ and $Supt(m)$ denotes the support value of marker $m$
in $C$. A large value of $Cor(d, m)$ implies that intervals $p$ and $d$ belong to the
same recurrent region that is represented by marker $m$.
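The correlation defined above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: a case is a list of statuses (1, -1, 0) over 0-based intervals, a segment is taken to be a contiguous run of intervals with the same non-zero status, and `Supt(m)` is assumed to be the fraction of cases that carry status `q` at interval `p`. The function names are hypothetical and not part of the original MATLAB code.

```python
from typing import List, Tuple

Case = List[int]          # one CGH case: 1 = gain, -1 = loss, 0 = no change, per interval
Marker = Tuple[int, int]  # (p, q): interval index (0-based here) and aberration type


def same_segment(case: Case, d: int, p: int, q: int) -> bool:
    """E_j(d, m): True if d and p lie inside one contiguous run of type q."""
    lo, hi = min(d, p), max(d, p)
    return all(case[x] == q for x in range(lo, hi + 1))


def support(cases: List[Case], m: Marker) -> float:
    """Assumed definition of Supt(m): fraction of cases with status q at p."""
    p, q = m
    return sum(1 for c in cases if c[p] == q) / len(cases)


def cor(cases: List[Case], d: int, m: Marker) -> float:
    """Cor(d, m) = sum_j E_j(d, m) / (|C| * Supt(m))."""
    p, q = m
    supt = support(cases, m)
    if supt == 0:
        return 0.0  # marker never observed; correlation taken as 0 by convention
    hits = sum(1 for c in cases if same_segment(c, d, p, q))
    return hits / (len(cases) * supt)


# Toy data: four cases over five intervals, and a gain marker at interval 2.
cases = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
marker = (2, 1)
print([round(cor(cases, d, marker), 3) for d in range(5)])
```

In the toy data, three cases carry the gain at interval 2 and two of them extend it to interval 3, so the correlation between interval 3 and the marker is 2/3.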

Next, we say that a marker $m = \langle p, q \rangle$ in cancer $C_i$ is shared by cancer
$C_j$ if and only if there is a marker $m' = \langle p', q' \rangle$ in $C_j$ such that
$q' = q$ and $Cor(p, m') > \epsilon$, where $\epsilon$ is a user-defined threshold. The
larger the value of $\epsilon$, the harder it is for a marker to be shared among multiple
cancers. Intuitively, a marker $m_i$ in $C_i$ is shared by another cancer $C_j$ if and
only if $C_j$ contains a marker $m_j$ of the same type whose recurrent region covers the
position of $m_i$. To compute the shared status of a marker in $C_i$, we visit every
cancer other than $C_i$. This makes the time complexity linear in the number of cancers
$K$. We denote the shared status $S(m)$ of a marker $m$ as the set of cancers that share
this marker, i.e. $S(m) \in \mathcal{P}(\{C_1, \ldots, C_K\})$, where $\mathcal{P}$
denotes the power set operation.
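A minimal Python sketch of the shared-status computation is given below. It assumes that the correlation inside each cancer is available as a callable, that the home cancer is counted in its own shared status, and that the threshold defaults to the value 0.8 used later in the experiments. The names `shared_status`, `markers` and `cor` are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Marker = Tuple[int, int]                # (position p, type q)
CorFn = Callable[[int, Marker], float]  # Cor(d, m) evaluated inside one cancer


def shared_status(
    m: Marker,
    home: str,
    markers: Dict[str, List[Marker]],
    cor: Dict[str, CorFn],
    eps: float = 0.8,
) -> FrozenSet[str]:
    """Return S(m): the cancers that share marker m of cancer `home`."""
    p, q = m
    shared = {home}  # assumption: a cancer always shares its own marker
    for cancer, its_markers in markers.items():
        if cancer == home:
            continue
        # m is shared by `cancer` if some marker there has the same type
        # and is highly correlated with position p.
        if any(q2 == q and cor[cancer](p, (p2, q2)) > eps for p2, q2 in its_markers):
            shared.add(cancer)
    return frozenset(shared)


# Toy setup: correlation in B is high within one interval of a marker.
markers = {"A": [(5, 1)], "B": [(6, 1), (20, -1)], "C": [(5, -1)]}
cor = {
    "B": lambda d, m: 0.9 if abs(d - m[0]) <= 1 else 0.1,
    "C": lambda d, m: 1.0,
}
print(sorted(shared_status((5, 1), "A", markers, cor)))
```

The loop visits every other cancer once, which matches the linear dependence on the number of cancers noted in the text. Cancer C is not added in the toy setup because its only marker has a different type.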

We propose an algorithm that generates a progression model for K cancers based on

markers. Our algorithm consists of three steps:

First step: We identify an optimal set of R markers for each cancer using our

marker identification program. These markers represent significant recurrent

alterations specific to each cancer.

Second step: For each marker in each cancer, we compute the shared status of this

marker using the method we described above. Please note that markers in different

cancers may have the same position and type. We treat these markers as a single

marker and compute its shared status once.

Third step: The logical relationship between $K$ cancers corresponds to a Venn
diagram of $K$ sets. There are in total $S = \sum_{i=1}^{K} \binom{K}{i} = 2^K - 1$
distinct sections in this Venn diagram. Given a marker $m$ with shared status $S(m)$, the
section corresponding to $S(m)$ is non-empty. We mark all the non-empty sections in the
Venn diagram based on the shared status of all markers. We then convert the Venn diagram
to a graph model as follows. We create a vertex $V$ for each non-empty section and
associate it with the markers whose shared status corresponds to this section. We define
the height of this vertex, denoted as $H(V)$, as the number of components (cancers) in the
corresponding section. We visit the vertices in descending order of their heights.
For each pair of vertices $V_i$ and $V_j$ with $H(V_i) < H(V_j)$, we create an edge from
$V_j$ to $V_i$ if both of the following conditions hold:

1. The component set of the section corresponding to $V_i$ is a proper subset of that of
$V_j$.

2. There is no other vertex $V_k$ such that the component set of the section
corresponding to $V_k$ is a superset of that of $V_i$ and a subset of that of $V_j$.
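The third step can be sketched in Python as follows. This is an illustrative sketch rather than the original implementation: it assumes the shared status of every marker is already known and is passed in as a dictionary, and the marker and cancer labels in the example are made up.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Section = FrozenSet[str]  # the set of cancers that defines one Venn-diagram section


def build_marker_graph(
    status: Dict[str, Iterable[str]],
) -> Tuple[Dict[Section, List[str]], List[Tuple[Section, Section]]]:
    """Create one vertex per non-empty section and the edges between them."""
    vertices: Dict[Section, List[str]] = defaultdict(list)
    for marker, cancers in status.items():
        vertices[frozenset(cancers)].append(marker)

    # Visit vertices in descending order of height (number of cancers).
    sections = sorted(vertices, key=len, reverse=True)
    edges = []
    for vj in sections:
        for vi in sections:
            # Condition 1: vi is a proper subset of vj.
            # Condition 2: no other vertex lies strictly between them.
            if vi < vj and not any(vi < vk < vj for vk in sections):
                edges.append((vj, vi))
    return dict(vertices), edges


status = {
    "+8q": {"A", "B", "C"},
    "-13q": {"A", "B"},
    "+17q": {"A"},
    "+7p": {"C"},
}
vertices, edges = build_marker_graph(status)
for parent, child in edges:
    print(sorted(parent), "->", sorted(child))
```

The comparison `vi < vj` on frozensets tests for a proper subset, so the two conditions map directly onto set operations. In the example the section {A, B, C} is the only root, since it has no incoming edge.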










We analyze the time complexity of this algorithm as follows. The time complexity
of the first step is $O(DNR)$ as analyzed in our previous work [42], where $D$ and $N$
denote the number of genomic intervals and the number of cases of all $K$ cancers
respectively. The time complexity of the second step is $O(TNR)$, where $T$ is the
cardinality of the union of all markers. In the third step, the number of vertices is
bounded by $\min\{S, T\}$. Since $T \le K \times R$, the time complexity of this step is
$O(K^2 R^2)$ in the worst case. Since $D > T$, the overall time complexity is
$O(DNR) + O(K^2 R^2)$. In general, we have $D > R$ and $N > K^2$, so the overall time
complexity can be written as $O(DNR)$.

The graph created by our algorithm can be used to describe the hierarchical or
evolutionary relationship between markers representing multiple stages within a single
cancer type or among the markers of different cancer types. We term a node a root
node if it does not have any incoming edges. The nodes that are close to a root (there
can be multiple roots) denote the aberrations that started in earlier stages. From this
perspective, markers are not equally important. The markers that are parents of other
markers in the hierarchical representation are common to multiple cancers. Thus, a
difference at parent marker positions should contribute more to the distance between
different cancers than a difference at child marker positions.

6.3 Progression Model for cancers

The aberration pattern defines the molecular characteristics of a cancer. We assume

that cancers with similar aberration patterns are close to each other in the evolutionary

history. The proper identification of the similarities between cancers will expose the

underlying mechanism of cancer development and benefit the diagnosis and treatment of

cancers.

A phylogenetic tree is a simple and efficient model for inferring evolutionary
relationships among multiple cancers. A key challenge in using an existing distance matrix
method for tree construction is to find a biologically meaningful distance function
between cancers. Next, we propose a novel measure for computing the distance between
cancers based on their aberration patterns. Since markers are a set of recurrent
alterations that characterize the aberration patterns of a cancer, our distance measure
computes the distance between cancers based on their markers. Formally, let $C_i$ and
$C_j$ denote two cancers. Let $M_i = \{m_{i,1}, \ldots, m_{i,R}\}$ and
$M_j = \{m_{j,1}, \ldots, m_{j,R}\}$ denote the corresponding $R$ markers identified in
$C_i$ and $C_j$ respectively, where $p_{i,1} < p_{i,2} < \cdots < p_{i,R}$ and
$p_{j,1} < p_{j,2} < \cdots < p_{j,R}$. Please note that $p_{i,k}$ may not equal
$p_{j,k}$ for any $1 \le k \le R$. To compute the distance between $C_i$ and $C_j$, we
first align the markers in $M_i$ to those in $M_j$. The goal of this alignment is to map
$M_i$ and $M_j$ into two high-dimensional vectors $\bar{M}_i, \bar{M}_j \in \mathbb{R}^g$,
where $g \le 2R$ is the number of dimensions of the new vectors, such that the new
vectors contain a unified format of the aligned markers in $C_i$ and $C_j$ respectively.

We say that a pair of markers $m_{i,k}$ and $m_{j,r}$ are overlapping if they satisfy
either one of the following two conditions:

1. Both markers appear at the same interval and have the same type, i.e.
$p_{i,k} = p_{j,r}$ and $q_{i,k} = q_{j,r}$.

2. Both markers represent the same region of recurrent alterations, i.e.
$Cor(p_{i,k}, m_{j,r}) > \epsilon$ and $Cor(p_{j,r}, m_{i,k}) > \epsilon$, where
$\epsilon$ is a user-defined threshold.

In Section 6.2, we argue that markers are not equally important in the progression

of cancers. A marker that is common to many cancers usually represents a fundamental

characteristic of cancers. Therefore, we assume that markers shared by many cancers are

more important than those shared by a few cancers. The intuition behind this reasoning
can be explained as follows. A marker that appears in most of the cancers has, with high
likelihood, survived the evolution of cancer progression. The markers that are cancer
specific have most likely appeared later in the evolutionary history and created the
underlying cancer alteration pattern. As a result, a deviation in the genomic alterations
corresponding to older markers should correspond to a larger distance between two cancer
types. We incorporate this idea into the mapping process. We assign weights to markers in
each cancer. The weight of a marker is the number of cancers that share this marker. Let
$W_i = \{w_{i,1}, \ldots, w_{i,R}\}$ and $W_j = \{w_{j,1}, \ldots, w_{j,R}\}$ be the
vectors of weights for markers in $M_i$ and $M_j$. Here, $w_{i,k}$ and $w_{j,k}$ denote
the weights of the $k$th markers in $M_i$ and $M_j$.

The mapping process works as follows. Each time, we pick a pair of markers from $M_i$
and $M_j$ and add a new dimension to each of the aligned vectors $\bar{M}_i$ and
$\bar{M}_j$. The values of the added dimensions are determined by three attributes of
markers: support, weight and type. Let
$a(m_{i,k}) = Supt(m_{i,k}) \times w_{i,k} \times q_{i,k}$. If the two markers are
overlapping, the values added into $\bar{M}_i$ and $\bar{M}_j$ are $a(m_{i,k})$ and
$a(m_{j,r})$ respectively. If the two markers are not overlapping, we focus on the marker
at the smaller genomic interval. Without loss of generality, we can assume
$p_{i,k} < p_{j,r}$. There is no marker at interval $p_{i,k}$ in $C_j$. However, we need
to compute the information of this interval across both cancers so that the difference at
this interval can be taken into account. So we assume that there is a "hypothetical"
marker at $p_{i,k}$ in $C_j$. This marker is of the same type and weight as $m_{i,k}$.
However, the support of this marker is computed based on the samples in $C_j$. Let
$m' = \langle p', q' \rangle$ in $C_j$ denote this "hypothetical" marker. We have
$p' = p_{i,k}$, $q' = q_{i,k}$ and $w' = w_{i,k}$. Please note that $Supt(m')$ depends on
the alteration pattern in $C_j$ and may not equal $Supt(m_{i,k})$. We add the two values,
$a(m_{i,k})$ and $a(m')$, into $\bar{M}_i$ and $\bar{M}_j$ respectively. Next, we choose
another pair of markers and repeat the above procedure until all the markers have been
processed.

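The mapping of two sets of markers is implemented as follows.

Inputs: $M_i = \{m_{i,1}, \ldots, m_{i,R}\}$ and $M_j = \{m_{j,1}, \ldots, m_{j,R}\}$,
where $p_{i,1} < p_{i,2} < \cdots < p_{i,R}$ and $p_{j,1} < p_{j,2} < \cdots < p_{j,R}$.
$W_i = \{w_{i,1}, \ldots, w_{i,R}\}$ and $W_j = \{w_{j,1}, \ldots, w_{j,R}\}$ are the
vectors of weights for markers in $M_i$ and $M_j$.

1. Initialize: $\bar{M}_i = \bar{M}_j = [\,]$; $k = r = 1$;

2. while $k \le R$ and $r \le R$

(a) if $m_{i,k}$ and $m_{j,r}$ are overlapping
$\bar{M}_i = [\bar{M}_i, a(m_{i,k})]$; $\bar{M}_j = [\bar{M}_j, a(m_{j,r})]$; $k = k + 1$; $r = r + 1$

(b) else if $p_{i,k} < p_{j,r}$
Create a "hypothetical" marker $m'$ same as $m_{i,k}$ in $C_j$; $\bar{M}_i = [\bar{M}_i, a(m_{i,k})]$;
$\bar{M}_j = [\bar{M}_j, a(m')]$; $k = k + 1$

(c) else if $p_{i,k} > p_{j,r}$
Create a "hypothetical" marker $m'$ same as $m_{j,r}$ in $C_i$; $\bar{M}_i = [\bar{M}_i, a(m')]$;
$\bar{M}_j = [\bar{M}_j, a(m_{j,r})]$; $r = r + 1$

(d) else
$\bar{M}_i = [\bar{M}_i, a(m_{i,k})]$; $\bar{M}_j = [\bar{M}_j, a(m_{j,r})]$; $k = k + 1$; $r = r + 1$

3. while $k \le R$
Create a "hypothetical" marker $m'$ same as $m_{i,k}$ in $C_j$; $\bar{M}_i = [\bar{M}_i, a(m_{i,k})]$;
$\bar{M}_j = [\bar{M}_j, a(m')]$; $k = k + 1$

4. while $r \le R$
Create a "hypothetical" marker $m'$ same as $m_{j,r}$ in $C_i$; $\bar{M}_i = [\bar{M}_i, a(m')]$;
$\bar{M}_j = [\bar{M}_j, a(m_{j,r})]$; $r = r + 1$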

Outputs: $\bar{M}_i$, $\bar{M}_j$

Once we have the aligned vectors $\bar{M}_i$ and $\bar{M}_j$, we use the Extended Jaccard
coefficient [79] to compute the similarity between the two vectors. The Extended Jaccard
coefficient is widely used as a similarity measure in vector spaces. It retains the
sparsity property of the cosine similarity while allowing discrimination of collinear
vectors. For example, given two vectors $\bar{M}_i = [0.1, 0.3]$ and
$\bar{M}_j = [0.2, 0.6]$, the cosine similarity does not discriminate between them and
the similarity value is computed as 1. However, in our case, $\bar{M}_i$ and $\bar{M}_j$
are different because they denote recurrent alterations in $C_i$ and $C_j$ with different
frequencies. The Extended Jaccard coefficient is computed as follows.

$$EJ(\bar{M}_i, \bar{M}_j) = \frac{\bar{M}_i \cdot \bar{M}_j}{\|\bar{M}_i\|^2 + \|\bar{M}_j\|^2 - \bar{M}_i \cdot \bar{M}_j}$$

The Extended Jaccard similarity of any two vectors is within the range of [0, 1]. It
is easy to convert the Extended Jaccard similarity to a distance by subtracting it from 1,
i.e. $D(C_i, C_j) = 1 - EJ(\bar{M}_i, \bar{M}_j)$. We compute the distance $D(C_i, C_j)$
for any $1 \le i, j \le K$. As a result, we construct the distance matrix for $K$
cancers. We apply an existing distance matrix method, such as UPGMA, to construct the
phylogenetic tree.
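The alignment and the distance computation described in this section can be sketched together in Python. The sketch makes several assumptions that are not in the original text: markers are small records holding position, type, support and weight; the overlap test and the support of a "hypothetical" marker in the other cancer are supplied by the caller as functions; and two all-zero vectors are treated as identical. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Marker:
    p: int       # genomic interval
    q: int       # type: +1 gain, -1 loss
    supt: float  # support within its own cancer
    w: float     # weight: number of cancers sharing the marker


def a(m: Marker) -> float:
    """Value contributed by a marker: support x weight x type."""
    return m.supt * m.w * m.q


def align(
    mi: List[Marker],
    mj: List[Marker],
    overlapping: Callable[[Marker, Marker], bool],
    supt_in_i: Callable[[int, int], float],
    supt_in_j: Callable[[int, int], float],
) -> Tuple[List[float], List[float]]:
    """Map two position-sorted marker lists to aligned vectors."""
    vi: List[float] = []
    vj: List[float] = []
    k = r = 0
    while k < len(mi) or r < len(mj):
        x = mi[k] if k < len(mi) else None
        y = mj[r] if r < len(mj) else None
        if x is not None and y is not None and (overlapping(x, y) or x.p == y.p):
            # Cases (a) and (d): both markers are consumed together.
            vi.append(a(x))
            vj.append(a(y))
            k += 1
            r += 1
        elif y is None or (x is not None and x.p < y.p):
            # Case (b) and step 3: hypothetical copy of x in cancer j.
            vi.append(a(x))
            vj.append(a(Marker(x.p, x.q, supt_in_j(x.p, x.q), x.w)))
            k += 1
        else:
            # Case (c) and step 4: hypothetical copy of y in cancer i.
            vi.append(a(Marker(y.p, y.q, supt_in_i(y.p, y.q), y.w)))
            vj.append(a(y))
            r += 1
    return vi, vj


def extended_jaccard(u: List[float], v: List[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    denom = sum(x * x for x in u) + sum(y * y for y in v) - dot
    return dot / denom if denom else 1.0  # assumption: two zero vectors are identical


def distance(u: List[float], v: List[float]) -> float:
    return 1.0 - extended_jaccard(u, v)


mi = [Marker(3, 1, 0.5, 2), Marker(10, -1, 0.4, 1)]
mj = [Marker(3, 1, 0.25, 2), Marker(7, 1, 0.5, 3)]
vi, vj = align(mi, mj, lambda x, y: x.p == y.p and x.q == y.q,
               lambda p, q: 0.1, lambda p, q: 0.2)
print(vi, vj, distance(vi, vj))
```

For the collinear example in the text, `extended_jaccard([0.1, 0.3], [0.2, 0.6])` evaluates to 2/3, whereas the cosine similarity of the same two vectors is 1.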

6.4 Experimental Results

Dataset: With 15127 cases from 571 publications as of Dec 2007, Progenetix is
the largest resource for published chromosomal CGH data [3] (http://www.progenetix.net/).
For the purpose of this paper, we use a dataset with 5918 clearly malignant
epithelial neoplasias (ICD-O-3 xxxx/2 and xxxx/3), a descriptive overview of which has
been published previously [2]. From the biomedical perspective, this dataset can be
divided into 22 clinico-pathological disease categories. Additional entities consisting of
fewer than 40 cases each were summarily moved to an 'other' category.

As a result of the Progenetix database format transformation, for each case the
genomic imbalance status for 862 ordered intervals was extracted from the
karyotype annotation. This information represents the whole genome (copy number)
status information, in the maximum resolution feasible for cytogenetic methods. The value
of each interval is 1, -1 or 0, indicating gain, loss and no change respectively. The
target data set can be represented as a 2-dimensional matrix with 5918 rows and 862
columns representing the imbalance status for each genomic interval. Additional columns
contain clinical information categories.

Although cases without any CNAs are important for the evaluation of overall genomic
instability, 875 such cases were deemed non-informative given our focus on aberration
patterns and were removed prior to further analysis. Also, the categories 'cholangio' and
'squamous_skin' were removed due to the limited number of informative cases (11 and 15,
respectively). We also excluded cases sub-summarized in the 'other' category (386 cases).
The remaining 20 entities with 4631 cases are used for analysis in this paper. The details
of the dataset are shown in Table 6-1.

Table 6-1. Name and number of cases of each cancer in the dataset.
Diagnosis                                    No. of cases
head-neck squamous cell carcinoma (HNSCC)    309
non-small cell lung carcinoma (NSCLC)        242
small cell lung carcinoma (SCLC)             63
bladder carcinoma                            140
breast carcinoma                             640
cervical carcinoma                           210
colorectal adenocarcinoma (CRC)              392
esophagus carcinoma (ES)                     206
gastric carcinoma                            477
hepatocellular adenocarcinoma (HCC)          334
melanocytic (MEL)                            81
nasopharynx carcinoma (NPC)                  149
neuroendocrine ca. and carcinoid (NE)        114
ovarian carcinoma                            388
pancreas adenocarcinoma (PAC)                64
prostate carcinoma                           416
renal carcinoma (RCC)                        163
thyroid carcinoma                            154
uterus carcinoma                             42
vulva carcinoma                              47
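The case counts above can be cross-checked with a few lines of Python. The dictionary simply restates Table 6-1, and the variable names are hypothetical.

```python
# Case counts per diagnosis, as listed in Table 6-1.
table_6_1 = {
    "HNSCC": 309, "NSCLC": 242, "SCLC": 63, "bladder": 140, "breast": 640,
    "cervical": 210, "CRC": 392, "ES": 206, "gastric": 477, "HCC": 334,
    "MEL": 81, "NPC": 149, "NE": 114, "ovarian": 388, "PAC": 64,
    "prostate": 416, "RCC": 163, "thyroid": 154, "uterus": 42, "vulva": 47,
}

total_cases = 5918
without_cna = 875          # cases without any CNAs
small_categories = 11 + 15  # informative cases in 'cholangio' and 'squamous_skin'
other_category = 386

remaining = total_cases - without_cna - small_categories - other_category
print(len(table_6_1), sum(table_6_1.values()), remaining)
```

Both the sum over the table and the subtraction from the 5918 starting cases give 4631 cases in 20 entities.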

System specifications: We developed our code using MATLAB and ran our experiments
on a system with dual 2.59 GHz AMD Opteron processors, 8 gigabytes of RAM, and a
Linux operating system.

6.4.1 Results for Marker Models

In this experiment, we infer a progression model for markers using the dataset in

Table 6-1. We perform each step one by one and discuss the results of each step as follows.

In the first step, we identify an optimal set of 20 markers for each cancer. Please
note that we exclude 100 (peri)centromeric intervals because 1) they mostly consist of
repetitive sequences (ALU repeats etc.) without encoding genes; 2) they pose technical or
interpretation difficulties. The markers are identified from the remaining 762 intervals.
Existing work by Baudis has identified the imbalance hot spots in clinico-pathological
entities in the same dataset [2], using an 'average profile' based approach. We compared
our markers to the reported imbalance hot spots for validation. Due to space limitations,
here we only present the comparison results for the HNSCC disease category.

Imbalance hot spots identified by Baudis [2]:

gains: 3q26 (59."'.), 8q24 (411 '.), 11ql3 (31.9'.,~ many specific high-level), 5p

(26.5' .), Xq, 1q, 7q(21), 12p, 17

losses: 3p (30.1 .), 18q(22) (22.!'.), 9p (22.!'.~), 11q24 (19."'.), 4, 5q, 8p, 13

Markers identified by our method:

gains: 3q26.2 (57.2' .), 8q24.3 (41 .), 11ql3.4 (31.9' .), 5pl14.3 (26.5' .), Xq28 (2 .;),

7q21.3 (20.C,' .), 12pl3.1 (17.'7' ), 17q25.3 (17.'7' ), 20ql2 (17.'7' ), 19pl3.11 (16~ .),

1q31.3 (16.2' .), 18p311.23 (15.9' .)

losses: 3p326.3 (30.'7'.), 18q23 (22.'7'.), 9p323 (22.!'.~), 11q25 (19."'.), 4pl4 (15'.),

5q21.3 (15.;:' .), 8p323.3 (16.2' .), 13q21.33 (16.5' .)

In the above results, markers or hot spots are listed with detailed locus and frequency
information. Gains and losses are evaluated separately. The hot spots or markers are
sorted in descending frequency of occurrence. We identify markers as individual intervals
while Baudis identified the regional hot spots from summary data. Our results are highly
compatible with the reported results if we consider a marker as a representative of a
region. We successfully identify all the hot spots identified by Baudis. We also identify
additional hot spots (e.g., 18q23) that have significant support.

In the second step, for each disease entity, we compute the shared status of each
marker identified in this cancer using the method we described in Section 6.2. We set
the threshold $\epsilon$ to 0.8. To compare with the reported most frequent imbalances
over all cancers, we analyze the markers that are in the same regions. The comparisons of
the imbalances with top frequencies are shown as follows.

Most frequent imbalances reported by Baudis [2]:
+8q: ubiquitously high (exception NE and thyroid)
Markers identified by our method and their shared status:
+8q23.1, +8q23.2, +8q23.3: 19 cancers (exception thyroid)
+8q24.13, +8q24.23, +8q24.3: 18 cancers (exception NE and thyroid)

Most frequent imbalances reported by Baudis [2]:
-13q: occurring in most carcinoma types (exception cholangio and SQS)
Markers identified by our method and their shared status:
-13q21.1, -13q21.2, -13q21.33: 18 cancers (exception CRC, gastric, cholangio and SQS)
-13q22.3: 15 cancers (exception SCLC, CRC, prostate, thyroid, gastric, cholangio and SQS)

The results show that our approach discovers the most frequent markers in a way that is
consistent with Baudis' work. Please note that markers are individual intervals
instead of chromosomal regions. In addition to the markers reported by Baudis et al.
as top-scorers in the different entities, our method detected other regions, for example
+17q and +7p, which are both shared by more than 12 cancer types.

In the third step, we build a graph model based on the shared status of markers.
The model contains 119 vertices and 385 edges, which makes it hard to fit in this thesis.
The model conveys useful information about the importance of markers. We use this
information in our next experiments in Section 6.4.2.

6.4.2 Results for Phylogenetic Models

In this experiment, we infer progression models for cancers using the distance-based
approach described in Section 6.3. We compute the distance matrix D of the 20 cancers in
Table 6-1 based on the markers reported in Section 6.4.1. We use the UPGMA algorithm
in the PHYLIP package [20] to generate the phylogenetic tree. To demonstrate the benefit
of computing the distance between cancers based on the importance of markers, we generate
two phylogenetic trees. For the first tree, we compute the distance matrix by assigning





















Figure 6-2. Phylogenetic tree of 20 cancers based on weighted markers. The tree is
generated by taking the importance of markers into account. We mark
different cancers using different colors and capital letters based on their
overall histological compositions. The legend is shown at the top right side.
Coding and legend: A: endocrine and clear; B: small cell neuroendocrine; C: squamous;
E: mixed squamous/adeno; F: transitional; G: melanoma.





















Figure 6-3. Phylogenetic tree of 20 cancers based on unweighted markers. The tree is
generated by giving equal weights to markers. We mark different cancers using
different colors and capital letters based on their overall histological
compositions.










the weight of each marker as the number of cancers that share this marker. The resulting
tree is shown in Figure 6-2. For the second tree, we compute the distance matrix by
assigning a weight of 1 to each marker. The resulting tree is shown in Figure 6-3.

The leaf nodes of the trees correspond to cancers (i.e., clinico-pathological cancer
entities). We mark these cancers using different colors as well as capital letters based
on the main histological composition of the cases in each cancer. Each color corresponds
to a capital letter. Different colors (letters) encode different histological compositions
of cancers. The internal nodes are denoted by numbers and represent hypothetical cancers.
Since these intermediate cancers may contain daughter branches from completely different
histologies, they have to be viewed as common biological feature sets rather than truly
occurring clinico-pathological cancer entities. The lengths of the branches are
proportional to the difference between pairs of neighboring nodes.

In both trees, some cancers of the same histological composition are closely organized
in the same subtree. However, the tree in Figure 6-2 shows a higher correlation between
histological composition and subtree assignment than the tree in Figure 6-3.
This correlation would be in concordance with the view that cancer clones may arise
from tissue-specific cancer stem cells [66], with a similar regulatory program targeted by
genomic aberrations in related tissues.

Each clinico-pathological cancer entity may contain multiple subtypes with heterogeneous
aberration patterns. To infer a progression model for cancer subtypes, we first divide
each cancer into several (two or four) clusters. By doing this, we hope that cases with
similar aberration patterns can be grouped in the same cluster. We use the RSim clustering
method [42] for this purpose. We determine the number of clusters by visually inspecting
the aberration patterns of the cancer. For each cluster of the same cancer, we compute
its quantity of imbalance as the ratio of intervals with imbalances in all CGH cases. We
sort the clusters of each cancer in ascending order of their imbalance quantities. We
name each cluster by concatenating the cancer name and its rank. For example, if we












Figure 6-4. A fraction of the phylogenetic tree of sub-clusters of 20 cancers. The subtree
is shown on the left with each leaf node corresponding to a cluster. Each cluster
is plotted in the middle, in the same order as the leaf nodes from top to bottom.
The X-axis and Y-axis denote the index of genomic intervals and cases
respectively. Intervals with gain and loss imbalances are plotted in green
and red respectively. The markers in each cluster are plotted using vertical
dotted lines. The histological compositions of each cluster are plotted as a color
side bar. The legend of the color side bar is shown on the right.


divide HNSCC into four clusters, 'HNSCC1' and 'HNSCC4' are the clusters with the least and
greatest quantity of imbalance respectively. Please note that we do not perform clustering
on the entities uterus and vulva because they both contain fewer than 50 cases. As a
result, we divide the 20 cancers into 58 clusters of cases. We compute the distance matrix
D for the 58 clusters based on the weighted markers in them. We apply UPGMA to construct a
phylogenetic tree over these cancer clusters. A part of the tree is shown in Figure 6-4.
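The ranking and naming of clusters described above can be sketched in Python. The sketch assumes that a cluster is a list of cases, that each case is a list of statuses (1, -1, 0), and that the quantity of imbalance is the fraction of non-zero entries over all cases of the cluster. The function names are hypothetical.

```python
from typing import Dict, List

Case = List[int]      # status per genomic interval: 1 gain, -1 loss, 0 no change
Cluster = List[Case]


def imbalance_quantity(cluster: Cluster) -> float:
    """Fraction of intervals with an imbalance over all cases of the cluster."""
    total = sum(len(case) for case in cluster)
    altered = sum(1 for case in cluster for value in case if value != 0)
    return altered / total


def name_clusters(cancer: str, clusters: List[Cluster]) -> Dict[str, Cluster]:
    """Sort clusters by ascending imbalance and name them cancer + rank."""
    ranked = sorted(clusters, key=imbalance_quantity)
    return {f"{cancer}{rank}": cluster for rank, cluster in enumerate(ranked, start=1)}


heavy = [[1, 1, -1, 0], [1, -1, -1, 0]]  # 6 of 8 entries altered
light = [[0, 0, 1, 0], [0, 0, 0, 0]]     # 1 of 8 entries altered
named = name_clusters("HNSCC", [heavy, light])
print({name: imbalance_quantity(c) for name, c in named.items()})
```

With two clusters, the cluster with the smaller fraction of altered intervals is named 'HNSCC1' and the other 'HNSCC2', mirroring the naming scheme in the text.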

Figure 6-4 shows a fraction of the phylogenetic tree (left side) for the 58 clusters. This
fraction contains a subtree whose seven leaf nodes correspond to clusters of six different
cancer types. The names of these clusters all end with 3 or 4, which indicates that cases
in these clusters contain a large amount of imbalances. (SCLC2 is also the cluster with
the highest quantity of imbalance in SCLC because SCLC only contains two clusters.)
The plots of the seven clusters are in the middle and in the same order as the leaf nodes.
The X-axis and Y-axis denote the index of genomic intervals and cases respectively.
Intervals with gain and loss imbalances are plotted in green and red respectively.
The markers in each cluster are denoted by vertical dotted lines. By visually inspecting
the plots, we observe that these seven clusters are indeed similar in their aberration
patterns. For example, many cases present loss aberrations around intervals 150, 200, 600
and 750 and gain aberrations around intervals 180, 240 and 400. In contrast to our
overall observation of a high concordance of histological origin and marker profile, the
histological composition of these clusters varies and includes small cell carcinomas,
adenocarcinomas and squamous cell carcinomas.

6.5 Conclusions

While the computational analysis of genomic imbalance profiles has led to evolutionary
models for aberrations in single cancer entities, a large-scale analysis across
heterogeneous cancer types remains a challenging subject. Recently, the descriptive
analysis of oncogenomic summary data was able to point towards a concordance of imbalance
profiles from entities of similar histological categories. However, the analysis of
average imbalance profiles will not be able to capture the diversity of aberration
complexity in the different entities.

We have developed an automatic method to infer a graph model for the markers of

multiple cancers. We demonstrated the use of this model in determining the importance of

markers in cancer evolution. We also developed a new method to measure the evolutionary

distance between different cancers based on their markers. We used this measure to create

an evolutionary tree for multiple cancers.

With the application of our modeling approach to a set of more than 4600 epithelial
neoplasias (carcinomas) with genomic imbalances, we can draw some preliminary
conclusions:

Marker determination and marker-dependent subset generation are powerful tools for
structuring large CGH data sets.

Phylogenetic modeling of 58 cancer subtypes with unique genomic marker sets shows
a high concordance between branch association and histological subtype.

Cancer subtypes with a high level of genomic instability have overall similar
imbalance patterns, which may reflect their origin from earlier, less determined
progenitor cells and/or tissue-independent mechanisms responsible for high-order
genomic instability.

While our approach as described here used a rough histological group classification
as a reference, a refined data set combined with different reference qualities (e.g.
clinical parameters) should provide a significant contribution to the overall perception
of genomic instability in cancer development.










CHAPTER 7
A WEB SERVER FOR MINING CGH DATA

Data mining analysis on a large number of CGH samples helps biologists understand
the intrinsic mechanism of tumor progression. For example, clustering methods are often
employed to discover previously unknown sub-categories of cancer and to identify genetic
biomarkers associated with the differentiation. Accurate classification of patients to
their cancer types based on their genomic imbalances is crucial in successful cancer
diagnosis and treatment. A public tool for data mining analysis of CGH data is of great
use to cancer studies.

An ideal tool for end users and large-volume data analysis is a web-based application:
a web browser is the only client software. End users are not required to download and
install extra software. In addition, the back-end server can distribute the intensive
computing jobs to clusters or multicore CPUs to reduce the execution time. In this
chapter, we discuss a web application developed based on our previous work [42, 78] to
fulfill these requirements.

7.1 Software Environment

We have developed a web-based data mining tool for mining CGH data using our
algorithms for clustering, marker detection and marker selection. It has the following
features:

It allows data import from tab-delimited text files in Progenetix
(http://www.progenetix.de) format.

It can perform data clustering using multiple algorithms and distance measures,
perform detection of important markers and perform selection of discriminative
markers that help build a reliable classifier.

It provides results in textual and graphical formats.

It provides multiple metrics for evaluation of results.

The application is developed using Microsoft Internet Information Server
(http://www.microsoft.com). The web user interface is developed using Microsoft ASP.NET
and C# (http://www.microsoft.com). When a user submits a request, the front-end C#
program calls the executable files to perform the computation. The results are written to
HTML files stored on the server side. The front-end program polls these results and
returns them to users.

The underlying algorithms are developed using MATLAB (http://www.mathworks.com). The
MATLAB compiler (http://www.mathworks.com/products/compiler/) is used to generate
executable files.

A preliminary version of this tool is available at
http://128.227.162.207:8007/CGH/Default.html. New features and algorithms are constantly
being added to this tool.

7.2 Example: Distance-Based Clustering of Sample Dataset

In this section, we briefly describe the clustering of a small dataset using our
web-based tool to demonstrate its functionality.

The program accepts tab-delimited text files with both CGH data and genomic
interval information. We follow the format of Progenetix, the largest source for published
CGH data (with more than 12,000 cases) (http://www.progenetix.de). The Progenetix
format is a chromosomal band specific matrix suitable for mining experiments. It currently
supports a resolution of 862 genomic intervals from 23 chromosomes. Data files consist
of a header row followed by rows of data. Progenetix also provides online tools that can
convert other formats of CGH data to the matrix format.

Our web tool provides five distance-based clustering algorithms: topDown, bottomUp,
k-means, topDown+k-means and bottomUp+k-means. The first three are well-known
clustering algorithms in the literature. The last two algorithms are combinations of
topDown or bottomUp with k-means. They work as follows. They first find clusters using
the top-down (or bottom-up) clustering algorithm. They then feed these clusters into the
k-means algorithm as the initial clusters. They aim to avoid the poor results obtained by
the k-means algorithm due to random initial clusters. The distance-based clustering

















Figure 7-1. Snapshot of the input interface of distance-based clustering.


tool provides four distance measures, Raw, Sim, cosineGaps and cosineNoGaps, as
described in the earlier chapter on clustering.
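The idea behind the combined algorithms (bottomUp+k-means and topDown+k-means) can be sketched in Python. This is only an illustration under explicit assumptions: a plain centroid-linkage agglomerative step stands in for the bottomUp algorithm, squared Euclidean distance stands in for the tool's own distance measures, and the function names are hypothetical. It is not the tool's MATLAB implementation.

```python
from typing import List, Sequence

Row = Sequence[float]


def sq_dist(a: Row, b: Row) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows: List[Row]) -> List[float]:
    return [sum(col) / len(rows) for col in zip(*rows)]


def bottom_up(data: List[Row], k: int) -> List[List[int]]:
    """Merge the two clusters with the closest centroids until k remain."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sq_dist(centroid([data[i] for i in clusters[a]]),
                            centroid([data[i] for i in clusters[b]]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters


def seeded_kmeans(data: List[Row], k: int, max_iter: int = 100) -> List[int]:
    """k-means whose initial clusters come from bottom_up instead of random seeds."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of samples")
    centers = [centroid([data[i] for i in c]) for c in bottom_up(data, k)]
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(row, centers[c])) for row in data]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [data[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels


samples = [[0, 0, 1], [0, 1, 1], [0, 0, 1], [-1, -1, 0], [-1, 0, 0], [-1, -1, 0]]
print(seeded_kmeans(samples, 2))
```

Because the initial centers come from a deterministic hierarchical step, repeated runs on the same data give the same clusters, which is the motivation stated in the text for avoiding random initialization.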

Users can choose any one of the 20 combinations of the five algorithms and four
distance measures, as shown in Figure 7-1. The default algorithm and distance measure
are topDown and Sim respectively, because this combination produces the best clusters
according to our experimental results. The interface allows the user to upload a database
file containing all the samples. It also allows the user to specify the number of
clusters. Clustering is usually a computationally intensive task. Depending on the
database, this process can take several minutes. The web server allows users to provide
their email addresses so that they can be notified when the results are ready. The server
stores the results in a temporary HTML file and emails the link to this file. The user can
browse this file later. If the user chooses to keep the browser open, the server
automatically refreshes it with the page that contains the results.





















Figure 7-2. Snapshot of the results of distance-based clustering.


Figure 7-2 provides a snapshot of the resulting page after applying the clustering

algorithm to the sample dataset. The sample dataset contains [...] samples

from two cancer types, namely Retinoblastoma, NOS (ICD-O code 9510/3) and

Neuroblastoma, NOS (ICD-O code 9500/3). The numbers of samples in the two cancer [...]










1. Quality of clustering. We provide a table that reports the quality of the clusters.

2. Cluster membership. We list the index of the samples that belong to each cluster. We

also provide a link to download the text file that contains this information.

3. Cluster plot. We plot the clusters as JPEG files and embed them in the result page

as thumbnails. Users can click the thumbnails to view the full-size pictures. In the

plot shown in Figure 7-3, the X-axis denotes the index of genomic intervals and the

Y-axis denotes the index of samples that are grouped by cluster. Different clusters are separated







Figure 7-3. Snapshot of the plot of clusters.



by the horizontal lines. The genomic intervals with gain and loss imbalances are

plotted in green and red, respectively. Those with no imbalance are not plotted.
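The plot layout described above can be mimicked in text form. The sketch below is illustrative only: the function name `render_cluster_plot` and the status coding (1 for gain, -1 for loss, 0 for no change) are assumptions, "G" stands in for a green cell and "R" for a red cell, and the server itself produces JPEG images rather than text.

```python
def render_cluster_plot(samples, labels):
    """Return text rows: one row per sample, grouped by cluster.

    samples : list of equal-length lists with 1 (gain), -1 (loss), 0 (no change).
    labels  : cluster index of each sample.
    """
    symbols = {1: "G", -1: "R", 0: " "}  # no imbalance is left blank
    width = len(samples[0])
    rows = []
    for cluster in sorted(set(labels)):
        if rows:
            rows.append("-" * width)  # horizontal line between clusters
        for sample, label in zip(samples, labels):
            if label == cluster:
                rows.append("".join(symbols[v] for v in sample))
    return rows


for row in render_cluster_plot([[1, 1, 0, 0], [0, 0, -1, -1], [1, 0, 0, 0]],
                               [0, 1, 0]):
    print(row)
```

Each column corresponds to one genomic interval and each row to one sample, so recurrent aberrations within a cluster show up as vertical runs of the same symbol.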

In both the query and result pages, all field names are clickable. Clicking on a field

name brings up the help page that contains a description of that field.

7.3 Conclusion


Our web server employs novel data mining methodologies for clustering and

classification of CGH datasets as well as algorithms for identifying important markers

that are potentially cancer signatures. It also provides a visualization of the dataset and

the results. The developed software will help in understanding the relationships between


genomic aberrations and cancer types.





























CHAPTER 8
CONCLUSION

Comparative Genomic Hybridization (CGH) is a molecular-cytogenetic analysis

method for simultaneously detecting a number of chromosomal imbalances, which are one

of the most prominent and pathogenetically relevant features of human cancer. Along with

the high dimensionality (around 1000), a key feature of CGH data is that consecutive

values are highly correlated. The aim of this thesis is to develop novel data mining methods

that exploit these characteristics in mining a population of CGH samples. In particular,

this thesis makes the following contributions:

1. Novel distance measures are investigated for the clustering of CGH data. Three

pairwise distance/similarity measures, namely Raw, cosine, and Sim, are proposed.

The first one ignores the correlation, while the latter two can effectively leverage this

correlation. These distance/similarity measures are tested on CGH data using three

main clustering techniques. The results show that Sim consistently performs better

than the remaining measures since it can effectively utilize the correlations between

consecutive intervals in the underlying data.

2. A dynamic programming algorithm is developed to identify a small set of important

genomic intervals called markers. The recurrent imbalance profiles of samples can be

captured using a set of markers. Two novel clustering strategies are developed. Both

methods utilize markers to exclude noisy intervals from clustering. The experimental

results demonstrate that the markers found represent the aberration patterns of

CGH data very well and that they improve the quality of clustering significantly.

3. Novel SVM-based methods for classification and feature selection of CGH data are

developed. For classification, a novel similarity kernel is proposed. It is shown to be

more effective than the standard linear kernel used in SVM. For feature selection, a

novel method based on the new kernel is proposed. It iteratively selects features that

provide the maximum benefit for classification. Our methods are compared against










the state-of-the-art wrapper and filter methods that have been used for feature

selection of high-dimensional biological data. Our results on datasets generated from

the Progenetix database show that our methods are considerably superior to

existing methods.

4. A graph model is proposed to infer the progression of markers (key recurrent CNAs).

With this model, the importance of markers in cancer evolution can be derived.

A new distance measure is proposed for computing the distance between cancers

based on their aberration patterns. An existing distance matrix method is employed

along with the new measure for inferring the progression model of multiple cancers. The

results show that cancers with similar histological compositions are well grouped

together.

These methods are evaluated using large repositories of datasets that are publicly

available. These methods are also encapsulated into a web service that can be used for

analyzing and visualizing CGH data.

In the present study, our work is based on chromosomal CGH data annotated in a

reverse in-situ karyotype format [50]. In the future, we will extend our work to support

other CGH formats, such as aCGH data, and other datasets such as gene expression array

data, SNP data and proteomics data.










REFERENCES


[1] M. Baudis. Online database and bioinformatics toolbox to support data mining in cancer
cytogenetics. Biotechniques, 40(3), March 2006.
[2] M. Baudis. Genomic imbalances in 5918 malignant epithelial tumors: An explorative
meta-analysis of chromosomal CGH data. accepted at BMC Cancer, 2007.
[3] M. Baudis and M. L. Cleary. Progenetix.net: an online repository for molecular cytogenetic
aberration data. Bioinformatics, 17(12):1228-1229, 2001.
[4] B. J. Beattie and P. N. Robinson. Binary state pattern clustering: A digital paradigm for
class and biomarker discovery in gene microarray studies of cancer. Journal of Computational
Biology, 13(5):1114-1130, 2006.
[5] M. Bentz, C. Werner, H. Dohner, S. Joos, T. Barth, R. Siebert, M. Schroder, S. Stilgenbauer,
K. Fischer, P. Moller, and P. Lichter. High incidence of chromosomal imbalances
and gene amplifications in the classical follicular variant of follicle center lymphoma. Blood,
88(4):1437-1444, 1996.
[6] S. Bilke, Q.-R. Chen, F. Westerman, M. Schwab, D. Catchpoole, and J. Khan. Inferring
a Tumor Progression Model for Neuroblastoma From Genomic Data. J Clin Oncol,
23(29):7322-7331, 2005.
[7] P. Broet and S. Richardson. Detection of gene copy number changes in CGH microarrays
using a spatially correlated mixture model. Bioinformatics, 22(8):911-918, 2006.
[8] G. C. Cawley. MATLAB support vector machine toolbox (v0.55β)
[http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox]. University of East Anglia, School of
Information Systems, Norwich, Norfolk, U.K. NR4 7TJ, 2000.
[9] H. Chai and C. Domeniconi. An evaluation of gene selection methods for multi-class
microarray data classification. In Proceedings of the Second European Workshop on Data
Mining and Text Mining in Bioinformatics, pages 3-10, 2004.
[10] P. Crossen. Giemsa banding patterns of human chromosomes. Clin Genet, 3:169-179, 1972.
[11] R. Desper, F. Jiang, and O.-P. Kallioniemi. Inferring tree models for oncogenesis from
comparative genome hybridization data. 6:37-51, 1999.
[12] R. Desper, F. Jiang, O.-P. Kallioniemi, H. Moch, C. H. Papadimitriou, and A. A. Schaffer.
Inferring tree models for oncogenesis from comparative genome hybridization data. Journal
of Computational Biology, 6(1):37-52, 1999.
[13] R. Desper, F. Jiang, O. P. Kallioniemi, H. Moch, C. H. Papadimitriou, and A. A. Schaffer.
Distance-based reconstruction of tree models for oncogenesis. J Comput Biol, 7(6):789-803,
2000.
[14] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene
expression data. In CSB, page 523, Washington, DC, USA, 2003. IEEE Computer Society.
[15] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene
expression data. J Bioinform Comput Biol, 3(2):185-205, April 2005.
[16] C. H. Q. Ding. Analysis of gene expression profiles: class discovery and leaf ordering. In
RECOMB, pages 127-136, New York, NY, USA, 2002. ACM Press.










[17] K. B. Duan, J. C. Rajapakse, H. Wang, and F. Azuaje. Multiple SVM-RFE for gene
selection in cancer classification with expression data. IEEE Trans Nanobioscience,
4(3):228-234, September 2005.
[18] P. Duesberg. Does Aneuploidy or Mutation Start Cancer? Science, 307(5706):41d, 2005.
[19] P. H. C. Eilers and R. X. de Menezes. Quantile smoothing of array CGH data.
Bioinformatics, 21(7):1146-1153, 2005.
[20] J. Felsenstein. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics, (5):164-166,
1989.
[21] J. Fridlyand, A. M. Snijders, D. Pinkel, D. G. Albertson, and A. N. Jain. Hidden markov
models approach to the analysis of array cgh data. J. Multivar. Anal., 90(1):132-153, 2004.
[22] A. Fritz, C. Percy, A. Jack, L. Sobin, and M. Parkin, editors. International Classification of
Diseases for Oncology (ICD-O), Third Edition. World Health Organization, Geneva, 2000.
[23] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh,
J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer:
Class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537,
October 1999.
[24] J. Gray, C. Collins, I. Henderson, J. Isola, A. Kallioniemi, O. Kallioniemi, H. Nakamura,
D. Pinkel, T. Stokke, M. Tanner, et al. Molecular cytogenetics of human breast cancer.
Cold Spring Harb Symp Quant Biol, 59:645-652, 1994.
[25] J. W. Gray and C. Collins. Genome changes and gene expression in human solid tumors.
Carcinogenesis, 21(3):443-452, 2000.
[26] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach.
Learn. Res., 3:1157-1182, 2003.
[27] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification
using support vector machines. Machine Learning, 46(1-3):389-422, 2002.
[28] D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100(1):57-70, January
2000.
[29] J. Handl, J. Knowles, and D. B. Kell. Computational cluster validation in post-genomic
data analysis. Bioinformatics, 21(15):3201-3212, August 2005.
[30] G. Hodgson and J. H. H. et al. Genome scanning with array CGH delineates regional
alterations in mouse islet carcinomas. Nature Genetics, 29(4):459-464, 2001.
[31] M. Hoglund, A. Frigyesi, T. Sall, D. Gisselsson and F. Mitelman. Statistical Behavior of
Complex Cancer Karyotypes. Genes Chromosomes Cancer, 42(4):327-341, 2005.
[32] D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: a survey.
Knowledge and Data Engineering, IEEE Transactions on, 16(11):1370-1386, 2004.
[33] T. Joachims. Making large-scale support vector machine learning practical. Advances in
kernel methods: support vector learning, pages 169-184, 1999.
[34] S. Joos, C. Menz, G. Wrobel, R. Siebert, S. Gesk, S. Ohl, G. Mechtersheimer, L. Trumper,
P. Moller, P. Lichter, and T. Barth. Classical hodgkin lymphoma is characterized by
recurrent copy number gains of the short arm of chromosome 2. Blood, 99(4):1381-1387,
2002.










[35] K. Junker, G. Weirich, M. B. Amin, P. Moravek, W. Hindermann, and J. Schubert. Genetic
subtyping of renal cell carcinoma by comparative genomic hybridization. Recent Results
Cancer Res, 162:169-175, 2003.
[36] A. Kallioniemi, O. Kallioniemi, D. Sudar, D. Rutovitz, J. Gray, F. Waldman, and D. Pinkel.
Comparative Genomic Hybridization for Molecular Cytogenetic Analysis of Solid Tumors.
Science, 258(5083):818-821, 1992.
[37] A. Kallioniemi, O. P. Kallioniemi, D. Sudar, D. Rutovitz, J. W. Gray, F. Waldman, and
D. Pinkel. Comparative genomic hybridization for molecular cytogenetic analysis of solid
tumors. Science, 258(5083):818-821, 1992.
[38] B. King. Step-wise clustering procedures. Journal of the American Statistical Association,
[39] D. E. Krane and M. L. Raymer. Fundamental Concepts of Bioinformatics. Benjamin-Cummings
Pub Co, San Francisco, CA, USA, September 2002.
[40] T. Li, C. Zhang, and M. Ogihara. A comparative study of feature selection and multiclass
classification methods for tissue classification based on gene expression. Bioinformatics,
20(15):2429-2437, 2004.
[41] J. Liu, J. Mohammed, J. Carter, S. Ranka, T. Kahveci, and M. Baudis. Distance-based
clustering of CGH data. Bioinformatics, 22(16):1971-1978, 2006.
[42] J. Liu, S. Ranka, and T. Kahveci. Markers improve clustering of CGH data. Bioinformatics,
23(4):450-457, 2007.
[43] J. Liu, S. Ranka, and T. Kahveci. A web server for mining comparative genomic hybridization
(cgh) data. volume 953, pages 144-161. AIP, 2007.
[44] J. B. MacQueen. Some Methods for classification and Analysis of Multivariate Observations.
In Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[45] M. Mao, R. Hamoudi, I. Talbot, and M. Baudis. Allele-Specific Loss of Heterozygosity
in Multiple Colorectal Adenomas: Towards the Integrated Molecular Cytogenetic Map II.
accepted at Cancer, Genetics, Cytogenetics, 2005.
[46] X. Mao, R. Hamoudi, P. Zhao, and M. Baudis. Genetic Losses in Breast Cancer: Toward an
Integrated Molecular Cytogenetic Map. Cancer Genet Cytogenet, 160(2):141-151, 2005.
[47] J. C. Marioni, N. P. Thorne, and S. Tavare. BioHMM: a heterogeneous hidden Markov
model for segmenting array CGH data. Bioinformatics, 22(9):1144-1146, 2006.
[48] T. Mattfeldt, H. Welter, R. Kemmerling, H. Gottfried, and H. Kestler. Cluster analysis of
comparative genomic hybridization (cgh) data using self-organizing maps: application to
prostate carcinomas. Anal Cell Pathol, 23(1):29-37, 2001.
[49] T. Mattfeldt, H. Welter, R. Kemmerling, H. W. Gottfried, and H. A. Kestler. Cluster
analysis of comparative genomic hybridization (CGH) data using self-organizing maps:
application to prostate carcinomas. Anal Cell Pathol, 23:29-37, 2001.
[50] F. Mitelman, editor. International System for Cytogenetic Nomenclature. Karger, Basel,
1995.
[51] F. Model, P. Adorjan, A. Olek, and C. Piepenbrock. Feature selection for dna methylation
based cancer classification. Bioinformatics, 17 Suppl 1, 2001.










[52] R. Molist, M. Gerbault-Seureau, X. Sastre-Garau, B. Sigal-Zafrani, B. Dutrillaux, and
M. Muleris. Ductal breast carcinoma develops through different patterns of chromosomal
evolution. Genes, Chromosomes and Cancer, 43(2):147-154, 2005.
[53] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio.
Support vector machine classification of microarray data, 1999.
[54] A. B. Olshen, E. S. Venkatraman, R. Lucito, and M. Wigler. Circular binary segmentation
for the analysis of array-based DNA copy number data. Biostat, 5(4):557-572, 2004.
[55] K. Patau. The identification of individual chromosomes, especially in man. Am J Hum
Genet, 12:250-276, 1960.
[56] G. Pennington, S. Shackney, and R. Schwartz. Cancer phylogenetics from single-cell assays.
Technical report, School of Computer Science, Carnegie Mellon University, 2006.
[57] F. Picard, S. Robin, M. Lavielle, C. Vaisse, and J. J. Daudin. A statistical approach for
array cgh data analysis. BMC Bioinformatics, 6:27, 2005.
[58] F. Picard, S. Robin, E. Lebarbier, and J.-J. Daudin. A Segmentation-Clustering problem for
the analysis of array CGH data. Applied Stochastic Models and Data Analysis, 2005.
[59] D. Pinkel and D. G. Albertson. Array comparative genomic hybridization and its applications
in cancer. Nature Genetics, 37:S11-S17, 2005.
[60] D. Pinkel, R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W. Kuo,
C. Chen, Y. Zhai, S. Dairkee, B. Ljung, J. Gray, and D. Albertson. High Resolution
Analysis of DNA Copy Number Variation Using Comparative Genomic Hybridization to
Microarrays. Nat Genet, 20(2):207-211, 1998.
[61] D. Pinkel, R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W.-L. Kuo,
C. Chen, Y. Zhai, S. H. Dairkee, B.-M. Ljung, J. W. Gray, and D. G. Albertson. High
resolution analysis of dna copy number variation using comparative genomic hybridization
to microarrays. Nature Genetics, 20:207-211, 1998.
[62] J. Pollack, C. Perou, A. Alizadeh, M. Eisen, A. Pergamenschikov, C. Williams, S. Jeffrey,
D. Botstein, and P. Brown. Genome-Wide Analysis of DNA Copy-Number Changes Using
cDNA Microarrays. Nat Genet, 23(1):41-46, 1999.
[63] J. R. Pollack, C. M. Perou, A. A. Alizadeh, M. B. Eisen, A. Pergamenschikov, C. F.
Williams, S. S. Jeffrey, D. Botstein, and P. O. Brown. Genome-wide analysis of DNA
copy-number changes using cDNA microarrays. Nature Genetics, 23:41-46, 1999.
[64] A. Rakotomamonjy. Variable selection using SVM based criteria. J. Mach. Learn. Res.,
3:1357-1370, 2003.
[65] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd,
M. Reich, E. Latulippe, J. P. Mesirov, T. Poggio, W. Gerald, M. Loda, E. S. Lander, and
T. R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl
Acad Sci U S A, 98(26):15149-15154, December 2001.
[66] T. Reya, S. J. Morrison, M. F. Clarke, and I. L. Weissman. Stem cells, cancer, and cancer
stem cells. Nature, 414(6859):105-11, Nov 2001.
[67] C. Rouveirol, N. Stransky, P. Hupe, P. La Rosa, E. Viara, E. Barillot, and F. Radvanyi.
Computation of recurrent minimal genomic alterations from array-cgh data. Bioinformatics,
January 2006.










[68] G. Salton. Automatic text processing: the transformation, analysis, and retrieval of
information by computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA,
USA, 1989.
[69] S. Selim and M. Ismail. K-means-type algorithms: A generalized convergence theorem
and characterization of local optimality. IEEE Trans. Pattern Analysis and Machine
Intelligence, 6(1):81-87, 1984.
[70] A. M. Snijders, D. Pinkel, and D. G. Albertson. Current status and future prospects
of array-based comparative genomic hybridisation. Brief Funct Genomic Proteomic,
2(1):37-45, 2003.
[71] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Dohner,
T. Cremer, and P. Lichter. Matrix-based comparative genomic hybridization: biochips to screen
for genomic imbalances. Genes Chromosomes Cancer, 20(4):399-407, 1997.
[72] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Dohner,
T. Cremer, and P. Lichter. Matrix-Based Comparative Genomic Hybridization: Biochips to Screen
for Genomic Imbalances. Genes Chromosomes Cancer, 20(4):399-407, 1997.
[73] M. Speicher, S. Gwyn Ballard, and D. Ward. Karyotyping human chromosomes by
combinatorial multi-fluor fish. Nat Genet, 12(4):368-375, 1996.
[74] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive
evaluation of multicategory classification methods for microarray gene expression cancer
diagnosis. Bioinformatics, 21(5):631-643, 2005.
[75] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques.
In KDD Workshop on Text Mining, 2000.
[76] C. S. Stocks, N. Pratt, M. Sales, D. A. Johnston, A. M. Thompson, F. A. Carey, and
N. M. Kernohan. Chromosomal imbalances in gastric and esophageal adenocarcinoma:
Specific comparative genomic hybridization-detected abnormalities segregate with junctional
adenocarcinomas. Genes, Chromosomes and Cancer, 32(1):50-58, 2001.
[77] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining
partitionings. In Proceedings of AAAI 2002, Edmonton, Canada, pages 93-98. AAAI, July
2002.
[78] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley
Longman Publishing Co., Inc., 2005.
[79] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining, (First Edition).
Addison Wesley, May 2005.
[80] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by
shrunken centroids of gene expression. PNAS, 99(10):6567-6572, 2002.
[81] J. Tjio and A. Levan. The chromosome number of man. Hereditas, 42:1-16, 1956.
[82] J. Vandesompele, M. Baudis, K. De Preter, N. Van Roy, and Ambros. Unequivocal
Delineation of Clinicogenetic Subgroups and Development of a New Model for Improved
Outcome Prediction in Neuroblastoma. J Clin Oncol, 23(10):2280-2299, 2005.
[83] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998.










[84] T. Veldman, C. Vignon, E. Schrock, J. Rowley, and T. Ried. Hidden chromosome
abnormalities in haematological malignancies detected by multicolour spectral karyotyping. Nat
Genet, 15(4):406-410, 1997.
[85] B. Vogelstein, E. R. Fearon, S. R. Hamilton, S. E. Kern, A. C. Preisinger, M. Leppert,
Y. Nakamura, R. White, A. M. Smits, and J. L. Bos. Genetic alterations during
colorectal-tumor development. N Engl J Med, 319(9):525-532, September 1988.
[86] B. Vogelstein and K. Kinzler. The Multistep Nature of Cancer. Trends Genet, 9(4):138-141,
1993.
[87] P. Wang, Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. A method for calling gains
and losses in array CGH data. Biostat, 6(1):45-58, 2005.
[88] P. Wang, Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. A method for calling gains
and losses in array cgh data. Biostatistics, 6(1):45-58, January 2005.
[89] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature
selection for SVMs. In NIPS, pages 668-674, 2000.
[90] H. Willenbrock and J. Fridlyand. A comparison study: applying segmentation to array cgh
data for downstream analyses. Bioinformatics, September 2005.
[91] L. Yu and H. Liu. Redundancy based feature selection for microarray data. In KDD, pages
737-742, New York, NY, USA, 2004. ACM Press.
[92] X. Zhang, X. Lu, Q. Shi, X.-q. Xu, H.-c. Leung, L. Harris, J. Iglehart, A. Miron, J. Liu, and
W. Wong. Recursive SVM feature selection and sample classification for mass-spectrometry
and microarray data. BMC Bioinformatics, 7(1):197, 2006.
[93] S. Zhong and J. Ghosh. Generative model-based document clustering: a comparative study.
Know. Inf. Syst., 8(3):374-384, 2005.









BIOGRAPHICAL SKETCH

Jun Liu was born in 1976 in Tianjin, China. He grew up mostly in Tianjin, China.

He earned his B.S. and M.E. degrees in computer science from Tianjin University in 1998

and 2000, respectively. He earned his Ph.D. in computer engineering from the University

of Florida (Gainesville, Florida, USA) in 2008.





PAGE 4

I would like to thank my advisers, Dr. Sanjay Ranka and Dr. Tamer Kahveci, for their long-term instructions on my research and this dissertation. I would also like to thank Dr. Michael Baudis for providing the CGH datasets and several discussions on analyzing CGH data. I would like to thank Jaaved Mohammed and James Carter for their help in implementing and testing the clustering algorithms, and Raja Appuswamy for helping develop an initial web-based interface. I would like to thank my committee members (Dr. Christopher M. Jermaine, Dr. Alin Dobra and Dr. Ravindra K. Ahuja) for their guidance and support. Finally, I would like to thank my parents and my friends for their continuous support.

PAGE 5

ACKNOWLEDGMENTS ... 4
LIST OF TABLES ... 7
LIST OF FIGURES ... 8
ABSTRACT ... 9
CHAPTER
1 INTRODUCTION ... 10
  1.1 Comparative Genomic Hybridization Data ... 10
  1.2 Analysis of CGH Data ... 14
  1.3 Contribution of This Thesis ... 15
2 RELATED WORK ... 18
  2.1 Structural Analysis of Single Comparative Genomic Hybridization (CGH) Array ... 19
  2.2 Clustering and Marker Detection of CGH Data ... 20
  2.3 Classification and Feature Selection of CGH Data ... 21
  2.4 Inferring Progression Models for CGH Data ... 25
  2.5 Software for Analyzing CGH Data ... 27
3 PAIRWISE DISTANCE-BASED CLUSTERING ... 28
  3.1 Method ... 28
    3.1.1 Comparison of Two Samples ... 29
      3.1.1.1 Raw distance ... 29
      3.1.1.2 Segment-based similarity ... 30
      3.1.1.3 Segment-based cosine similarity ... 31
    3.1.2 Clustering of Samples ... 33
      3.1.2.1 k-means clustering ... 33
      3.1.2.2 Complete link bottom-up clustering ... 34
      3.1.2.3 Top-down clustering ... 34
    3.1.3 Further Optimization on Clustering ... 36
      3.1.3.1 Combining k-means with bottom-up or top-down methods ... 36
      3.1.3.2 Centroid shrinking ... 37
  3.2 Results ... 37
    3.2.1 Quality Analysis Measures ... 39
    3.2.2 Experimental Evaluation ... 41
  3.3 Conclusion ... 46

PAGE 6

4 [...] ... 48
  4.1 Detection of Markers ... 48
  4.2 Prototype Based Clustering ... 52
  4.3 Pairwise Similarity Based Clustering ... 55
  4.4 Experimental Results ... 58
    4.4.1 Quality of Clustering ... 58
    4.4.2 Quality of Markers ... 59
    4.4.3 Evaluation ... 60
  4.5 Conclusion ... 65
5 CLASSIFICATION AND FEATURE SELECTION ... 66
  5.1 Classification with SVM ... 66
  5.2 Maximum Influence Feature Selection for Two Classes ... 70
  5.3 Maximum Influence Feature Selection for Multiple Classes ... 74
  5.4 Datasets ... 77
  5.5 Experimental Results ... 81
    5.5.1 Comparison of Linear and Raw Kernel ... 81
    5.5.2 Comparison of MIFS and Other Methods ... 82
    5.5.3 Consistency of Selected Features ... 87
  5.6 Conclusions ... 90
6 INFERRING PROGRESSION MODELS ... 93
  6.1 Preliminary ... 94
    6.1.1 Marker Detection ... 94
    6.1.2 Tumor Progression Model ... 95
    6.1.3 Tree Fitting Problem ... 97
  6.2 Progression Model for markers ... 98
  6.3 Progression Model for cancers ... 101
  6.4 Experimental Results ... 105
    6.4.1 Results for Marker Models ... 106
    6.4.2 Results for Phylogenetic Models ... 108
  6.5 Conclusions ... 113
7 A WEB SERVER FOR MINING CGH DATA ... 115
  7.1 Software Environment ... 115
  7.2 Example: Distance-Based Clustering of Sample Dataset ... 116
  7.3 Conclusion ... 119
8 CONCLUSION ... 120
REFERENCES ... 122
BIOGRAPHICAL SKETCH ... 128

PAGE 7

Table ... page
3-1 Detailed specification of Progenetix dataset ... 38
3-2 Highest value of external measures for different distance/similarity measure ... 42
3-3 Comparison of average quality and running time of top-down methods with global and local refinement ... 46
4-1 Coverage measure of three clustering methods applied over three datasets ... 61
4-2 The NMI values of three clustering methods applied over three datasets ... 62
4-3 Error bar results of three clustering methods over three datasets ... 63
5-1 Detailed specifications of benchmark datasets ... 80
5-2 Comparison of classification accuracies for three feature selection methods on multi-class datasets ... 84
5-3 Comparison of classification accuracy for three feature selection methods on two-class datasets ... 86
5-4 Comparison of classification accuracy using different number of features. ... 87
5-5 Comparison of PMM scores of three feature selection methods ... 91
6-1 Name and number of cases of each cancer in the dataset. ... 106

PAGE 8

Figure ... page
1-1 Overview of CGH technique ... 11
1-2 Raw and normalized (smoothed) CGH data ... 13
1-3 Plot of a CGH dataset ... 14
3-1 Example of Raw distance ... 29
3-2 Example of Sim measure ... 31
3-3 Example of cosineNoGaps measure ... 32
3-4 Evaluation of cluster qualities using (A) NMI and (B) F1-measure for different clustering methods ... 43
3-5 Cluster qualities of different clustering methods with Sim measure over the entire dataset. The cluster qualities are evaluated using NMI. ... 45
4-1 Two CGH samples X and Y with the values of genomic intervals listed in the order of positions. The segments are underlined. ... 49
4-2 The CGH data plot of cancer type, Retinoblastoma, NOS (ICD-O 9510/3), with 120 samples and each sample containing 862 genomic intervals ... 52
4-3 Comparison of GMS values of markers in clusters from two clustering approaches ... 64
5-1 Plot of 120 CGH cases belonging to Retinoblastoma, NOS (ICD-O 9510/3) ... 71
5-2 Working example of dataset re-sampler ... 79
5-3 Comparison of classification accuracies of SVM with linear and Raw kernels ... 82
6-1 Examples of Venn diagram and corresponding graph model ... 97
6-2 Phylogenetic trees of 20 cancers based on weighted markers ... 109
6-3 Phylogenetic trees of 20 cancers based on unweighted markers ... 110
6-4 A fraction of the phylogenetic tree of sub-clusters of 20 cancers ... 112
7-1 Snapshot of the input interface of distance-based clustering. ... 117
7-2 Snapshot of the results of distance-based clustering. ... 118
7-3 Snapshot of the plot of clusters. ... 119

PAGE 9

Numerical and structural chromosomal imbalances are one of the most prominent features of neoplastic cells. Thousands of (molecular-) cytogenetic studies of human neoplasias have searched for insights into genetic mechanisms of tumor development and the detection of targets for pharmacologic intervention. It is assumed that repetitive chromosomal aberration patterns reflect the supposed cooperation of a multitude of tumor relevant genes in most malignant diseases.

One method for measuring genomic aberrations is Comparative Genomic Hybridization (CGH). CGH is a molecular-cytogenetic analysis method for detecting regions with genomic imbalances (gains or losses of DNA segments). CGH data of an individual tumor can be considered as an ordered list of discrete values, where each value corresponds to a single chromosomal band and denotes one of three aberration statuses (gain, loss and no change). Along with the high dimensionality (around 1000), a key feature of the CGH data is that consecutive values are highly correlated.

In this research, we have developed novel data mining methods to exploit these characteristics. We have developed novel algorithms for feature selection, clustering and classification of CGH data sets consisting of samples from multiple cancer types. We have also developed novel methods and models for understanding the progression of cancer. Experimental results on real CGH datasets show the benefits of our methods as compared to existing methods in the literature.

PAGE 10

Numerical and spatial chromosomal imbalances are one of the most prominent and pathogenetically relevant features of neoplastic cells [18]. Over the last decades, thousands of (molecular-) cytogenetic studies of human neoplasia have led to important insights into the genetic mechanisms of tumor development, revealing cancer to be a disease involving dynamic changes in the genome. The foundation has been set in the discovery of aberrations that produce oncogenes with dominant gain of function and tumor suppressor genes with recessive loss of function [28]. Each chromosomal region of a healthy cell has two copies of its DNA. Deviations from this normal level are called Copy Number Alterations (CNAs). Both classes of cancer genes, tumor suppressor genes and oncogenes, have been identified through DNA copy number alterations in human and animal cancer cells [36]. Detecting these aberrations and interpreting them in the context of broader knowledge facilitates the identification of crucial genes and pathways involved in biological processes and disease. The repetitive chromosomal aberration patterns reflect the supposed cooperation of a multitude of tumor relevant genes [86] in most malignant diseases. A systematic analysis of these patterns for oncogenomic pathway description requires the large-scale compilation of (molecular-) cytogenetic tumor data as well as the development of tools for transforming those data into a format suitable for data mining purposes.

[...] [59]. The main advantage of the CGH data is that the DNA copy numbers for the entire genome can be measured in a single experiment [70]. CGH on DNA microarray is a molecular-cytogenetic analysis method for simultaneously detecting thousands of genes with genomic imbalances (gains or losses of DNA segments) [36]. In this technique, total genomic DNA is isolated from

PAGE 11

OverviewofCGHtechnique.GenomicDNAfromtwocellpopulationsisdierentiallylabeledandhybridizedtoamicroarray.Theuorescentratiosoneacharrayspotarecalculatedandnormalizedsothatthemedianlog2ratiois0.Plottingofthedataforchromosomefromptertoqtershowsthatmostelementshavearationear0.Thetwoelementsnearestpterhaverationear-1,indicatingareductionbyafactoroftwoincopynumber.ThisgureisreproducedfromtheworkbyPinkeletal[ 59 ]. testandreferencecellpopulations,dierentiallylabeledandhybridizedtometaphasechromosomesor,morerecently,DNAmicroarrays.Therelativehybridizationintensityofthetestandreferencesignalsatagivenlocationisthen(ideally)proportionaltotherelativecopynumberofthosesequencesinthetestandreferencegenomes.Ifthereferencegenomeisnormal,thenincreasesanddecreasesintheintensityratiodirectlyindicateDNAcopynumbervariationinthegenomeoftestcells(Figure 1-1 ). RawdatafromCGHexperimentsisviewedasbeingcontinuous[ 57 ].Pre-processingofrawCGHdatacomprisesofallpreliminaryoperationsonthedatanecessarytoarrive 11


... 7, 19, 47, 54].

... 21, 30, 87]. Segmentation algorithms divide the genome into non-overlapping segments that are separated by breakpoints. These breakpoints indicate a change in DNA copy number. Array elements that belong to the same segment are assumed, as they are not separated by a breakpoint, to have the same underlying chromosomal copy number. Segmentation methods also estimate the mean log2 ratio per segment, referred to as states.

... 88, 90]. Normalized CGH signal surpassing predefined thresholds is considered indicative for genomic gains or losses, respectively (Figure 1-2). At present, calling algorithms cannot determine whether there are, say, three or four copies present. They can, however, detect deviations from the normal copy number, and classify each segment as either 'normal', 'loss' or 'gain'. Normal status indicates there are two copies of the chromosomal segment present. Loss status indicates at least one copy is lost. Gain status indicates at least one additional copy is present. These labels are referred to as calls.

Figure 1-2. Raw and normalized (smoothed) CGH data. This example shows 16 measurement points of tumor vs. control fluorescence. Runs of normalized ratio values surpassing the thresholds are considered indicative for gains or losses of genomic material in the corresponding genomic intervals (e.g. chromosomal bands). For our purposes, we use values of 1, -1, and 0 to express gain, loss and no aberration, respectively.

The chromosomal CGH summarizes signals from many short stretches of tumor DNA hybridizing to neighboring regions. The chromosomal CGH results are annotated in a reverse in-situ karyotype format [50] describing imbalanced genomic regions with reference to their chromosomal location. CGH data of an individual patient can be considered as an ordered list of status values, where each value corresponds to a genomic interval (e.g., a single chromosomal band). The terms feature and dimension have also been used in the literature to represent the genomic interval. Figure 1-3 shows a CGH dataset for Retinoblastoma, NOS (ICD-O 9510/3) with 120 cases (i.e., patients) each having 862 genomic intervals. Chromosomal and array CGH account for a significant percentage of the published analyses in cancer cytogenetics [5, 11, 24, 31, 34, 49, 82].

The Progenetix database [3] ( ... 82] and for producing tumor type specific imbalance maps [45, 46].

This thesis is concerned with developing tools to help analyze CGH data.
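The calling convention described above, in which normalized ratios that surpass predefined thresholds are labeled as gains or losses and encoded as 1, -1 and 0, can be illustrated with a minimal sketch. The function name and the threshold values (0.2 and -0.2) are hypothetical placeholders chosen for illustration; the text does not state which thresholds are used.

```python
def call_aberrations(log2_ratios, gain_threshold=0.2, loss_threshold=-0.2):
    """Map normalized log2 ratios to calls: 1 (gain), -1 (loss), 0 (no change).

    The default thresholds are illustrative assumptions, not values from the thesis.
    """
    calls = []
    for ratio in log2_ratios:
        if ratio > gain_threshold:
            calls.append(1)       # ratio surpasses the gain threshold
        elif ratio < loss_threshold:
            calls.append(-1)      # ratio falls below the loss threshold
        else:
            calls.append(0)       # within the thresholds: no aberration
    return calls


# Hypothetical normalized ratios for seven consecutive genomic intervals.
ratios = [0.05, 0.4, 0.5, -0.1, -0.9, -1.0, 0.0]
print(call_aberrations(ratios))  # [0, 1, 1, 0, -1, -1, 0]
```

The resulting list is the kind of ordered sequence of status values that the rest of the thesis treats as one CGH sample.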


Figure 1-3. Plot of a CGH dataset. The dataset consists of 120 CGH cases belonging to Retinoblastoma, NOS (ICD-O 9510/3). The X-axis and Y-axis denote the genomic intervals and the samples respectively. We plot the gain and loss status in green (light gray) and red (dark gray) respectively.

1. The datasets are high dimensional. The number of intervals in the Progenetix database [1], a large publicly available database, is 862. Newer CGH datasets may ...


2. The features in CGH data represent ordered genomic intervals on chromosomes and their values are categorical.

3. Genomic imbalances in cancer cells correspond to runs of 5 to 15 intervals of gains or losses for an 862 interval representation. (We will use the term segments to represent these runs. They correspond to a few megabases to an entire chromosome.) This indicates that neighboring genomic intervals are often correlated.

1. We have developed novel pairwise distance-based (we will use the term distance-based for convenience) clustering methods that effectively exploit the spatial correlations between consecutive genomic intervals [41]. The goal of our clustering is to identify sets of tumors exhibiting common underlying genetic aberrations and representing common molecular causes. Our work is built in two steps. In the first step, we measure the distance/similarity between all pairs of samples. For this purpose, we have developed three metrics to compute the similarity/distance between two CGH samples. In the second step, we build clusters of samples based on pairwise similarities using variations of well known methods. Experimental results show that segment-based similarity measures are better indicators of biological proximity between pairs of samples. This measure, when combined with the top-down method, produces the best clusters.

2. We have proposed the concept of markers to represent key recurrent point aberrations that capture the aberration pattern of a set of CGH samples [42].


... 1-3. The markers are plotted in vertical lines.) We have developed a dynamic programming technique to detect markers in a group of samples. The resulting markers can be seen as the prototype of these samples. Based on the markers, we have developed several clustering strategies. Our experimental results show that the use of markers in the distance-based clustering improves the cluster qualities.

3. We have developed a novel kernel function for using SVM based methods for classifying CGH data. The classification of CGH data aims to build a model for defining a number of classes of tumors and accurately predict the classes of unknown tumors. This measure counts the number of common aberrations between any two samples. We show that this kernel measure is significantly better for SVM-based classification of CGH data than the standard linear kernel.

4. We have developed an SVM-based feature selection method called Maximum Influence Feature Selection (MIFS). It uses an iterative procedure to progressively select features. In each iteration, an SVM based model on selected features is trained. This model is used to select one of the remaining features that provides the maximum benefit for classification. This process is repeated until the desired number of features is reached. We compared our methods against two state-of-the-art methods that have been used for feature selection of large dimensional biological data. Our results suggest that our method is considerably superior to existing methods.

5. We have developed a novel method to infer the progression of multiple cancer histological types or subtypes based on their aberration patterns. Our experimental results based on a Progenetix dataset demonstrate that cancers with similar histology coding are automatically grouped together using these methods.

We also describe a web based tool for large volume data analysis of CGH datasets [43]. The tool provides various clustering algorithms and distance measures, and identifies markers that can be interactively used by researchers. The results are provided in both ...


One of the early techniques used to identify cytogenetic abnormalities in tumor cells is called Metaphase analysis [10, 55, 81]. The Mitelman Database of Chromosome Aberrations in Cancer ( ...

Over the last decade, the Comparative Genomic Hybridization (CGH) [37] and array- or matrix-CGH techniques [61, 63, 71] have addressed technical problems associated with Metaphase analysis of tumor cells and are now used in many published observations. The molecular cytogenetic techniques of CGH [36] and array- or matrix-CGH [60, 62, 72] have previously been used to describe genomic aberration hot spots in cancer entities [5, 24], for the delineation of disease subsets according to their cytogenetic aberration patterns [34, 48] and for the construction of genomic aberration trees from chromosomal imbalance data [12].

In contrast to Metaphase analysis, CGH techniques are not limited to dividing tumor cells which frequently do not represent the predominant clone in the original tumor. Also, CGH is not hampered by incomplete identification of chromosomal segments, which for Metaphase analysis only recently has been addressed by SKY (Spectral Karyotyping) [84] and MFISH (Multiplex Fluorescent In-Situ Hybridization) [73] techniques. According to our own survey, chromosomal and array CGH now account for a significant percentage of published analyses in cancer cytogenetics.

In this chapter, we briefly review the data mining and related methods that have been used for analyzing CGH data.


... 12, 31] or at the correlation of disease subsets with clinical parameters [48, 82]. Other CGH related data analyses have been aimed at the spatial coherence of genomic segments with different copy number levels.

Picard et al. pointed out that raw CGH signal exhibits a spatial coherence between neighboring intervals [57]. This spatial coherence has to be handled. They used a segmentation method based on a random Gaussian model. The parameters of this model are determined by abrupt changes at unknown intervals. They developed a dynamic programming algorithm to partition the data into a finite number of segments. The intervals in each segment approximately share the same copy number on average. Further, they proposed a segmentation-clustering approach combined with a Gaussian mixture model to predict the biological status of the detected segments [58].

Fridlyand et al. used an unsupervised Hidden Markov model approach which consists of two parts [21]. In the first part, they partition the genomic intervals into the states which represent the underlying copy number of the groups of intervals. In the second part, they determine the copy number level of each individual chromosome according to whether any copy number transitions or whole chromosome gains or losses are contained in the chromosome. They derived the appropriate values of parameters in the algorithm using unpublished primary tumor data.

Pei et al. segmented each chromosome arm (or chromosome) using a hierarchical clustering tree. The clusters are identified by suppressing the False Discovery Rate (FDR) below a certain level. In addition, their algorithm provided a consensus summary across a set of intervals, as well as an estimate of the corresponding FDR. They illustrated their ...


... 88].

Willenbrock et al. made a comparison study on three popular and publicly available methods for the segmentation analysis of array CGH data [90]. They demonstrated that segmented CGH data yields better results in downstream analysis, such as hypothesis testing and classification, than the raw CGH data. They also proposed a novel procedure for calling copy number levels by merging segments across the genome.

All the above works focus on the discretization of raw CGH data. However, they do not address the subsequent analysis, such as clustering or classification, for a large dataset consisting of discretised (smoothed) CGH data samples.

... 32]. In the following, we briefly describe some of the earlier work on clustering of CGH data.

Mattfeldt et al. applied an existing tool for the clustering of CGH data [48]. The tool, named Genecluster, formed clusters on the basis of an unsupervised learning rule using an artificial neural network. It was originally proposed for the clustering of gene expression data. Mattfeldt et al. applied Genecluster over a group of tens of cases from pT2N0 prostate cancer. Based on the fact that clinically similar cases are placed into the same clusters, they demonstrated that good clusterings were found.

Beattie et al. developed a new data mining technique to discover significant sub-clusters and marker genes in a completely unsupervised manner [4]. They used a digital paradigm to discretise the gene microarray and transferred the data into binary state patterns. A clustering based on Hamming distance was applied to create clusters and identify bio-markers. Although their work is not directly based on CGH data, they demonstrated that their method can be adapted to other categorical datasets.


... 67]. Their algorithms can handle additional constraints describing relevant regions of copy number change. They validate their algorithms on two public array-CGH datasets.

Thus, the existing literature has not addressed clustering algorithms that exploit the important spatial and temporal characteristics of CGH data. Further, existing clustering works usually focus on small homogeneous datasets with several tens of cases. The existing marker discovery methods usually simply identify the markers, but do not explore the usage of these markers in clustering analysis, as we propose in this thesis.

Support Vector Machine (SVM) is a state-of-the-art technique for classification [83]. Mukherjee et al. used an SVM classifier for cancer classification based upon gene expression data from DNA microarrays. They argued that DNA microarray problems are very high dimensional and have very few training data. This type of situation is particularly well suited for an SVM approach. Their approach achieved better performance than reported results [53].

Li et al. performed a comparative study of multiclass classification methods for tissue classification based on gene expression data [40]. They conducted comprehensive experiments using various classification methods including SVM [83] with different multiclass decomposition techniques, Naive Bayes, K-nearest neighbor and decision tree [79]. They found that SVM is the best classifier for classification of gene expression data.


... 23]. For example, based on experimental results, Mukherjee et al. demonstrated that linear SVMs did as well as nonlinear SVMs using polynomial kernels. So far, there is very limited study on developing kernel functions for the classification of CGH data.

... 26]. Recently, feature selection methods have been widely studied in gene selection of microarray data. These methods can be decomposed into two broad classes:

1. Filter Methods: These methods select features based on discriminating criteria that are relatively independent of the classification process. Several methods use simple correlation coefficients similar to Fisher's discriminant criterion. For example, given class 1 and class -1 denoting two classes, Golub et al. used a criterion as follows [23]:

$P(j) = \dfrac{\mu_{1}(j) - \mu_{-1}(j)}{\sigma_{1}(j) + \sigma_{-1}(j)}$

where $\mu_{c}(j)$ and $\sigma_{c}(j)$ denote the mean and the standard deviation of feature $j$ within class $c$.

Other methods adopt mutual information or statistical tests (t-test, F-test). For example, Model et al. ranked the features using a two sample t-test [51]. They assumed that the value of a feature within a class follows a normal distribution. A two sample t-test was adopted to rank the features according to the significance of the difference between the class means. In principle, their approach was similar to Fisher's criterion because, in both methods, a large mean difference and a small ...


Ding et al. considered the nature of feature selection for classification of multi-class data [16]. They used the F-statistic test, which is a generalization of the t-statistic for two classes. Given a gene expression across $n$ tissue samples $g = (g_1, \ldots, g_n)$ from $K$ classes, the F-statistic is defined as

$F = \left[ \sum_{k} n_k (\bar{g}_k - \bar{g})^2 / (K - 1) \right] / \sigma^2$

where $\bar{g}$ is the overall mean, $\bar{g}_k$ is the mean within class $C_k$, $\sigma^2$ is the pooled variance, and $n_k$ and $\sigma_k$ are the size and variance of gene expression within class $C_k$. They picked genes with large F-values.

Earlier filter based methods evaluated features in isolation and did not consider correlation between features. Recently, methods have been proposed to select features with minimum redundancy [14, 15, 91]. For example, Yu et al. introduced the importance of removing redundant genes in sample classification and pointed out the necessity of studying feature redundancy [91]. They proposed a filter method with feature redundancy taken into account. They combined sequential forward selection with backward elimination so that, in each step, the number of feature pairs for redundancy analysis is reduced. Their method is free of any threshold in determining feature relevance or redundancy. Their experimental results on microarray data demonstrated the efficiency and effectiveness of their method in selecting discriminative genes that improve classification accuracy.
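As an illustration of the F-statistic filter criterion described above, the following sketch scores a single feature across labeled samples. It assumes the standard one-way ANOVA form, in which the pooled variance is the within-class sum of squares divided by n - K; the function name and the toy data are hypothetical and are not taken from Ding et al. or from this thesis.

```python
def f_statistic(values, labels):
    """F-statistic of one feature: between-class variation over pooled variance.

    Assumes at least two classes and a non-zero pooled within-class variance.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n = len(values)          # total number of samples
    K = len(groups)          # number of classes
    overall_mean = sum(values) / n

    between = 0.0            # sum_k n_k * (mean_k - overall_mean)^2
    within = 0.0             # sum of squared deviations from the class means
    for members in groups.values():
        n_k = len(members)
        mean_k = sum(members) / n_k
        between += n_k * (mean_k - overall_mean) ** 2
        within += sum((v - mean_k) ** 2 for v in members)

    pooled_variance = within / (n - K)
    return (between / (K - 1)) / pooled_variance


# Toy feature measured on nine samples from three classes (hypothetical data).
values = [1, 2, 3, 2, 3, 4, 5, 6, 7]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
print(round(f_statistic(values, labels), 6))  # 13.0
```

A filter method of this kind would compute the score for every feature independently and keep the features with the largest values, which is why it cannot account for redundancy between features.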


... 14, 15]. They supplement the maximum relevance criteria with minimum redundancy criteria to choose additional features that are maximally dissimilar to already identified ones. By doing this, MRMR expands the representative power of the feature set and improves its generalization properties.

2. Wrapper Methods: Wrapper methods utilize a classifier as a black box to score the subsets of features based on their predictive power. Wrapper methods based on SVM have been widely studied in the machine learning community [26, 64, 89]. SVM-RFE (Support Vector Machine Recursive Feature Elimination) [27], a state-of-the-art wrapper method applied to cancer research, uses a backward feature elimination scheme to recursively remove insignificant features from subsets of features. In each recursive step, a linear SVM is trained on the feature set. For each feature, a ranking coefficient is computed based on the reduction in the objective function if this feature is removed. The bottom ranked feature is then eliminated from the feature set. The above process is repeated until the feature set is empty. The features are sorted based on their sequence of elimination.

A number of variants also use the same backward feature elimination scheme and linear kernel. Zhang et al. proposed a method aimed at classifying two-class data [92]. It used a recursive support vector machine (R-SVM) algorithm to select important features for the classification of noisy high-throughput proteomics and microarray data. The experimental results showed that, compared to SVM-RFE, their method is more robust to outliers in the data and capable of selecting the most informative features.

Duan et al. proposed a new feature selection method that used a backward elimination procedure similar to that implemented in SVM-RFE [17]. Unlike SVM-RFE, at each step, the proposed approach trained multiple linear SVMs on ...


For feature selection of multiclass data, Ramaswamy et al. used a one-versus-all strategy to convert the multiclass problem into a series of two-class problems. They applied SVM-RFE to each two-class problem separately and generated a consensus sorting of all features [65].

Fu et al. also proposed a method based on the one-versus-all strategy [90]. For each two-class problem, they wrapped the feature selection into a 10-fold cross validation (CV) and selected features using SVM-RFE in each fold. They also developed a probabilistic model to select significant features from the 10-fold results. They took the union of features selected from each two-class SVM as the final set of features.

Filter methods are generally less computationally intensive than wrapper methods. However, they tend to miss complementary features that individually do not separate the data well. A recent comparison of feature selection methods for the classification of multiclass microarray data shows that wrapper methods such as SVM-RFE have better classification accuracy for a large number of features, but achieve lower accuracy than filter methods when the number of selected features is small [9].

... 85]. They inferred a chain model of four genetic events, three of which are CNAs, for the progression of colorectal cancer. These events in the model are irreversible. That is, once an event occurs it is never undone in the future. The presence of all four events appears to be an indicator of colorectal cancer.


... 12]. They derived a tree model inference algorithm by utilizing the idea of maximum-weight branching in a graph. They applied the algorithm over a CGH dataset for renal cancer and showed that the correct tree for renal cancer was inferred. Later, they extended their work to distance-based trees, in which events are leaves of the tree, in the style of models common in phylogenetics [13]. They proposed a novel approach to reconstruct the distance-based trees using tree-fitting algorithms developed by researchers in phylogenetics. They applied their approach over the CGH dataset for renal cancer. The results showed that the distance-based models complemented the branching tree models well.

Bilke et al. proposed a graph model based on the shared status of recurrent CNAs among different stages of cancer [6]. They first identified a set of recurrent alterations and computed their shared status using statistical tests. They then constructed a Venn diagram based on these recurrent alterations. They manually converted the Venn diagram into a graph model. They found that the pattern of recurrent CNAs in neuroblastoma cancer is strongly stage dependent.

Pennington et al. developed a mutation model for individual tumors and constructed an evolutionary tree for each tumor [56]. They identified the consensus tree model based on the copy number alterations shared by a substantial fraction of the population. They showed that their results are consistent with prior knowledge about the role of the genes examined in cancer progression.

All the above works infer tumor progression models based on genetic events such as recurrent CNAs. Their models describe the evolutionary relationship between these events and consequently expose the progression and development of tumors. However, these works treat every individual recurrent alteration as an independent genetic event. This makes their models very complex when applied to datasets with samples ...


... 5, 12, 24, 31, 34, 48, 82]. Acquisition of thousands of copy number information brings forth challenges to the analysis of CGH data. Researchers have explored data mining methods for this purpose. Many of their methods focus on the structure analysis of CGH data, such as the spatial coherence of genomic segments with different copy number levels [25, 57, 58, 74]. Associated with these works, a lot of tools (both web application and stand-alone software package) are available for the analysis of CGH data, such as CGH-miner ( ...


The goal of clustering is to develop a systematic way of placing patients with similar CGH imbalance profiles into the same cluster. Our expectation is that patients with the same cancer types will generally belong to the same cluster as their underlying CGH profiles will be similar. In this chapter, we focus on distance-based clustering. We develop three pairwise distance/similarity measures, namely Raw, Cosine and Sim. Raw measure compares the aberrations in each genomic interval separately. The other two measures take the correlations between consecutive genomic intervals into account. Cosine maps pairs of CGH samples into vectors in a high-dimensional space and measures the angle between them. Sim measures the number of independent common aberrations. We test our distance/similarity measures on three well known clustering algorithms, bottom-up, top-down and k-means with and without centroid shrinking. Our results show that Sim, when combined with top-down algorithm, consistently performs better than the remaining measures.

... 50]. We use this strategy and represent gain, loss, and no change with +1, -1, and 0 respectively throughout the proposal.

We propose to use three different distance-based clustering methods for CGH data and survey their performance. The key problem, however, is to compute the proximity of two CGH samples. In Section 3.1.1, we discuss the three measures we developed for such pairwise comparison. We briefly explain the three clustering algorithms we used to cluster a population of samples in Section 3.1.2. Two techniques that further optimize the cluster qualities are discussed in Section 3.1.3.


Figure 3-1. Example of Raw distance. X and Y are two CGH samples. The value of each genomic interval shows the status (i.e. gain, loss or no change) of that interval. The distance between X and Y is $\sum_{j=1}^{m} d(x_j, y_j) = 9$.

... 57]. If both samples have gain (or loss) at the same genomic interval then we consider them similar at that position. Otherwise, that genomic interval contributes to the distance between them. Also, we assume that all genomic intervals have the same importance. Thus, each genomic interval contributes the same amount to the total distance. Formally, the distance is computed as $\sum_{j=1}^{m} d(x_j, y_j)$. Here $d(x_j, y_j) = 1$ if $x_j \neq y_j$ or $x_j = 0$. Otherwise $d(x_j, y_j) = 0$. The similarity is obtained by subtracting the distance from $m$, the number of genomic intervals of the CGH samples. An example is shown in Figure 3-1.

This distance function is similar to Hamming distance in principle because it compares the genomic intervals of both samples one by one. We call this distance Raw since it is computed on raw CGH data. Raw distance between two samples is small only ...


... 4-1, sample X contains four segments. The first and third segments are gain type while the second and fourth segments are loss type. We call two segments from two samples overlapping if they have at least one common genomic interval of the same type. For example, the first segment of X is overlapping with the first segment of Y in Figure 4-1. Also the third segment of X is overlapping with the second segment of Y. Next, we develop a segment-based similarity measure called Sim.

Given two CGH samples X and Y, Sim constructs maximal segments by combining as many contiguous aberrations of the same type as possible. Formally, the genomic intervals $x_i, x_{i+1}, \ldots, x_j$, for $1 \le i \le j \le m$, define a segment if genomic intervals $x_i$ through $x_j$ are in the same chromosome, the values from $x_i$ to $x_j$ are all gains or all losses, and $x_{i-1}$ and $x_{j+1}$ are different than $x_i$. Thus, each sample translates into a sequence of segments. After this transformation, Sim assumes that the segments are independent of each other and gives the same importance to all the segments regardless of the number of genomic intervals in them. Sim computes the similarity between two CGH samples as the number of overlapping segment pairs. This is justified because each overlap may indicate a common point-like aberration in both samples which then led to the corresponding overlapping segments. An example is shown in Figure 4-1. There are two important observations that follow from the definition of Sim. First, unlike the Raw distance measure, Sim considers an overlap of an arbitrary number of genomic intervals as a single match. Second, although two samples have different values for the same genomic interval, Sim does not consider this as a mismatch if it is an extension of an overlap. For example, in Figure 4-1, the fifth genomic intervals of samples X and Y have different values, but we still consider this position a match because it could be an extension of an overlap.

Figure 4-1. Example of Sim measure. X and Y are two CGH samples with the values of genomic intervals shown in the order of positions. The segments are underlined. The overlapping segments are shown with arrows. Since there are two overlapping segments, one from position 3 to 4 and the other at position 10, the similarity between X and Y is 2.

... 68]. In this section, we extend the cosine similarity to measure the proximity of two CGH samples.

Let X and Y be two CGH samples. We first map X and Y to two vectors $\hat{X}$ and $\hat{Y} \in R^g$, where $g$ is the number of dimensions of the vectors. Usually, $g$ is smaller than $m$, where $m$ is the number of genomic intervals of CGH samples. The mapping process is also based on segments and works as follows. First, we translate each sample into a sequence of segments. Let us define segment sequences G and H that correspond to the samples X and Y respectively. Without loss of generality, we can assume that for all the genomic intervals in Y, if they belong to any segment in H, the genomic intervals in X at the same positions are also covered by the segments in G. Here, we say that a segment covers a consecutive block of genomic intervals only if for each genomic interval, either it belongs to this segment or it is of no-change status and the aberration of this segment can be extended to this genomic interval. Next, we scan the segment sequence G in the ascending order of the genomic intervals. For each segment $g_i \in G$, if there exists an overlapping segment $h_j \in H$, we add a new dimension to both vectors $\hat{X}$ and $\hat{Y}$. We then assign value 1 to this dimension of $\hat{X}$ and $\hat{Y}$, indicating that the values of this dimension are exactly the same in the two vectors. If no overlapping segment $h_j \in H$ exists, we add a new dimension to both vectors with value 1 assigned to vector $\hat{X}$ and value 0 assigned to vector $\hat{Y}$, which indicates that the values of the new dimension in the two vectors are orthogonal. An example of the segmenting and mapping step for this measure is shown in Figure 3-3. After the two CGH samples X and Y have been mapped to two vectors, the cosine similarity between X and Y is computed as

$C(\hat{X}, \hat{Y}) = \dfrac{\sum_{i=1}^{g} \hat{x}_i \hat{y}_i}{\sqrt{\sum_{i=1}^{g} \hat{x}_i^2} \, \sqrt{\sum_{i=1}^{g} \hat{y}_i^2}}$

Figure 3-3. Example of cosineNoGaps measure. This figure shows the cosineNoGaps similarity between two CGH samples. X and Y are two CGH samples with the values of genomic intervals shown in the order of positions. The segments are underlined. First, X and Y are mapped to two vectors $\hat{X}$ and $\hat{Y}$ respectively. Second, the similarity between X and Y is computed as $C(\hat{X}, \hat{Y}) = 0.7071$.
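The Raw and Sim measures defined in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: samples are plain lists over {1, -1, 0}, the optional `chromosomes` argument (one chromosome label per interval) and all function names are hypothetical, and the two example samples are made up to resemble the situations described in the text, not copied from the figures.

```python
def raw_distance(x, y):
    """Raw distance: an interval contributes 1 unless both samples
    carry the same aberration (gain/gain or loss/loss) there."""
    return sum(1 for xj, yj in zip(x, y) if xj != yj or xj == 0)


def raw_similarity(x, y):
    """Raw similarity: m minus the Raw distance."""
    return len(x) - raw_distance(x, y)


def segment_ids(sample, chromosomes):
    """Give each aberrant interval the index of its maximal segment.

    A new segment starts when the previous interval has a different
    value or lies on a different chromosome. No-change intervals get None.
    """
    ids = []
    current = -1
    for j, value in enumerate(sample):
        if value == 0:
            ids.append(None)
            continue
        starts_new = (
            j == 0
            or sample[j - 1] != value
            or chromosomes[j - 1] != chromosomes[j]
        )
        if starts_new:
            current += 1
        ids.append(current)
    return ids


def sim(x, y, chromosomes=None):
    """Sim: number of overlapping segment pairs, i.e. pairs of segments
    sharing at least one interval with the same aberration type."""
    if chromosomes is None:
        chromosomes = [0] * len(x)  # assume a single chromosome
    seg_x = segment_ids(x, chromosomes)
    seg_y = segment_ids(y, chromosomes)
    overlapping_pairs = {
        (seg_x[j], seg_y[j])
        for j in range(len(x))
        if x[j] != 0 and x[j] == y[j]
    }
    return len(overlapping_pairs)


# Hypothetical samples with 12 genomic intervals on one chromosome.
X = [0, 1, 1, 1, 0, -1, -1, 0, 0, 1, 0, -1]
Y = [0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(raw_distance(X, Y))    # 9
print(raw_similarity(X, Y))  # 3
print(sim(X, Y))             # 2
```

In this example the common gains at positions 3 and 4 (1-based) count twice toward the Raw similarity but only once toward Sim, which is the first observation made about Sim above.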


As we mentioned earlier, our focus in this thesis is to evaluate the suitability of various distance/similarity measures together with clustering algorithms in the context of the CGH data clustering problem. In this section, we briefly introduce the three distance-based clustering algorithms we used in our experiments.

... 44] is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. Its key step is to compute the distance/similarity between ...


... 3.1.1. We first partition the n samples into k clusters by randomly assigning each sample to one of the k clusters. This random partition forms the initial cluster seeds for our k-means algorithm. Then we scan the n samples one by one. For the ith sample, compute its average distance to all the samples in cluster j, for 1 ...

... 3.2.1. The sample $s_i$ is kept in the cluster that maximizes the internal measure. This refinement process ends as soon as there is no movement of samples during an iteration or after a predefined maximum number of iterations has been performed. In our experiments, the number of iterations was typically less than 20. After the refinement is finished, the cluster with the largest number of samples is bisected similarly. Once k clusters are created, the top-down algorithm ends.

In each iteration of the refinement, O(n) time is needed to compute the change of the internal measure for each sample. This is because the similarity between that sample and every other sample in each cluster needs to be accumulated. The time complexity of each iteration is O(n^2) as there are n samples in total. Since the total number of iterations is limited by a small constant, the complexity of refinement is O(n^2). The refinement is performed every time a new cluster is created. In the above described process the number of clusters increases by one in every stage until k clusters are created. Therefore, the overall time complexity of top-down clustering is O(n^2 k).

To reduce this time complexity, we modify the top-down clustering algorithm. Essentially, the refinement process is limited to the cluster being decomposed into smaller clusters. There are two differences between the modified and the original top-down clustering. First, only the samples in the decomposed cluster are considered for refinement. Second, a sample is relocated only to the two newly created clusters rather than all the clusters. In the best case, the clusters are decomposed in a balanced fashion. The overall time complexity in this case is O(n^2 + 2(n ...


... 5.5 show that this deterioration is small.

... 5.5


... 80] to improve the nearest-centroid classification. The centroids of a training set are defined as the average expression of each gene. This idea shrinks the centroids of each class towards the overall centroid after normalizing by the intra-class standard deviation for each genomic interval. This normalization has the effect of assigning more weight to the genomic intervals whose status is stable within samples of the same class, and thus reduces the number of features contributing to the nearest centroid calculation. We apply this idea to achieve further optimization of clustering. The centroids of initial clusters found by the different clustering methods, i.e. bottom-up, top-down, k-means, bottom-up + k-means and top-down + k-means, are shrunk towards the overall centroid. Then, a standard k-means using Euclidean distance is invoked to re-cluster the samples using the shrunken centroids as its initial centroids.

Experimental setup: We evaluated the quality and the performance of all the distance/similarity measures and the clustering methods discussed in this thesis. For evaluation of quality we used different measures belonging to two categories, external and internal measures. We discuss these measures in detail in Section 3.2.1.

We implemented all four distance measures (Raw, Sim, CosineGaps, CosineNoGaps) and five clustering algorithms (k-means, top-down, bottom-up, top-down + k-means, bottom-up + k-means). Thus, we had 20 different combinations. We have also implemented the centroid shrinking strategy and applied it to each combination. Note that we use the local refinement strategy (see Section 3.1.2.3) for top-down in our experiments unless otherwise stated.

We use a dataset consisting of 5,020 CGH samples (i.e., cytogenetic imbalance profiles of tumor samples) taken from the Progenetix database [3]. These samples belonged to 19 different histopathological cancer types with more than 100 cases and had been coded according to the ICD-O-3 system [22]. The subset with the smallest number of samples consists of 110 non-neoplastic cases, while the one with the largest number of samples, Adenocarcinoma, NOS (ICD-O 8140/3), contains 1057 cases. The details of this dataset are listed in Table 3-1. Each sample in the dataset consists of 862 ordered genomic intervals extracted from 24 chromosomes. Each interval is associated with one of the three values -1, 1 or 0, indicating loss, gain or no change status of that interval. In principle, our CGH dataset can be mapped to an integer matrix of size 5,020 x 862. We also use a small dataset with 2,510 samples by randomly selecting 50% of the entire dataset. This small dataset is generated each time an experiment is run over it.

Table 3-1. Detailed specification of Progenetix dataset. Term #cases denotes the number of cases.

ICD-O-3 code   #cases   Code translation
0000/0         110      non-neoplastic or benign
8890/3         118      Leiomyosarcoma, NOS
9510/3         120      Retinoblastoma, NOS
9391/3         126      Ependymoma, NOS
9835/3         128      Acute lymphoblastic leukemia, NOS
9180/3         133      Osteosarcoma, NOS
9836/3         141      Precursor B-cell lymphoblastic leukemia
8144/3         144      Adenocarcinoma, intestinal type
9673/3         171      Mantle cell lymphoma
8010/3         180      Carcinoma, NOS
9732/3         190      Multiple myeloma
8140/0         209      Adenoma, NOS
9500/3         271      Neuroblastoma, NOS
8170/3         286      Hepatocellular carcinoma, NOS
8523/3         310      Infiltrating duct mixed with other types of carcinoma
9680/3         323      Diffuse large B-cell lymphoma, NOS
9823/3         346      B-cell chronic lymphocytic leukemia/small lymphocytic lymphoma
8070/3         657      Squamous cell carcinoma, NOS
8140/3         1057     Adenocarcinoma, NOS

Our experimental simulations were run on a system with dual 2.59 GHz AMD Opteron Processors, 8 gigabytes of RAM, and a Linux operating system.


... 29]. We use both measures to evaluate the quality of the clusters. An external measure evaluates how well the clusters separate samples that belong to different cancer types. Thus an external measure can compare clusters based on different distance/similarity measures. On the other hand, an internal measure evaluates how well the clustering algorithm operates on a given distance/similarity measure. This measure ignores the cancer types of the input samples. Compared with internal measures, external measures are more reasonable in reflecting the quality of clusters as they take the cancer types into consideration. Note that an internal measure is a better indicator of quality for cancer types that have multiple aberration patterns that differ significantly.

We use three external measures to evaluate the cluster quality. Let $n$, $m$ and $k$ denote the total number of samples, the number of different cancer types and the number of clusters respectively. Let $a_1, a_2, \ldots, a_m$ denote the number of samples that belong to each cancer type. Similarly, let $b_1, b_2, \ldots, b_k$ be the number of samples that belong to each cluster. Let $c_{i,j}$, $\forall i, j$, $1 \le i \le m$ and $1 \le j \le k$, denote the number of samples in the $j$th cluster that belong to the $i$th cancer type. The first external measure used, known as the Normalized Mutual Information (NMI) [93] function, is computed as: ...


... 78]. It is defined as: ...

... 78]. In order to compute the Rand Index measure for a given clustering, two values are calculated. ...

The Rand Index is then computed as:

$\text{Rand Index} = \dfrac{f_{00} + f_{11}}{\binom{n}{2}}$

... 77]. Moreover, NMI is quite impartial to the number of clusters [93].

... 78], the other is the internal measure based on separation.
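The external measures can be illustrated with a short sketch that works directly on two label lists, one holding the cancer type of each sample and one holding its cluster. Two assumptions are made here. For NMI, the sketch uses the square-root-normalized form commonly given in the clustering literature, written with the counts $a_i$, $b_j$ and $c_{i,j}$ introduced above. For the Rand Index, $f_{11}$ is taken to be the number of sample pairs with the same cancer type and the same cluster, and $f_{00}$ the number of pairs with different cancer types and different clusters. Function names and example labels are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def nmi(types, clusters):
    """Normalized mutual information between cancer types and clusters.

    Assumes at least two cancer types and at least two clusters, so that
    neither normalization term is zero.
    """
    n = len(types)
    a = Counter(types)                 # a_i: samples per cancer type
    b = Counter(clusters)              # b_j: samples per cluster
    c = Counter(zip(types, clusters))  # c_ij: samples of type i in cluster j
    mutual = sum(
        c_ij * math.log(n * c_ij / (a[i] * b[j]))
        for (i, j), c_ij in c.items()
    )
    h_types = sum(a_i * math.log(a_i / n) for a_i in a.values())
    h_clusters = sum(b_j * math.log(b_j / n) for b_j in b.values())
    return mutual / math.sqrt(h_types * h_clusters)


def rand_index(types, clusters):
    """Fraction of sample pairs on which types and clusters agree,
    i.e. (f00 + f11) divided by the number of pairs, n choose 2."""
    n = len(types)
    agreeing = 0
    for p, q in combinations(range(n), 2):
        same_type = types[p] == types[q]
        same_cluster = clusters[p] == clusters[q]
        if same_type == same_cluster:  # counts both f11 and f00 pairs
            agreeing += 1
    return agreeing / (n * (n - 1) / 2)


# Six hypothetical samples from two cancer types, grouped into two clusters.
types = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 1, 1]
print(round(rand_index(types, clusters), 3))  # 0.667
print(round(nmi(types, [0, 0, 0, 1, 1, 1]), 3))  # 1.0 for a perfect clustering
```

The pairwise loop in `rand_index` is quadratic in the number of samples, which is acceptable for a sketch but would be replaced by a computation on the contingency counts for a dataset with thousands of samples.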

PAGE 41


PAGE 42

Highest value of external measures for different distance/similarity measures. All numbers here are the medians of 100 results.

              Sim     CosineNoGaps   CosineGaps   Raw
NMI           0.368   0.265          0.228        0.239
F1-measure    0.34    0.258          0.215        0.235
Rand Index    0.903   0.899          0.898        0.896

The medians of the 100 highest values for Sim, CosineNoGaps, CosineGaps and Raw are shown in Table 3-2. The results of both NMI and F1-measure show that Sim produces the highest quality compared to the other distance measures. Sim obtains this quality with the top-down clustering method. CosineNoGaps gives slightly better quality than the other two measures, Raw and CosineGaps. We conclude that Sim is the most suitable distance/similarity measure for clustering CGH data.

We randomly select 50% of the entire dataset (i.e., 2,510 samples) and cluster it. We then compute the external measure for the underlying clusters. We repeat this process 100 times and compute the error bar for the external measure. The error bar indicates the interval between the 5th and 95th percentiles of the results. Figure 4-3A and Figure 4-3B show the NMI and F1-measure respectively. The top-down clustering method without centroid shrinking gives the best quality consistently in both figures. The additional k-means step in the top-down + k-means method deteriorates the quality. Centroid shrinking improves the results when the quality of the clustering method is low. It hurts the quality when the quality is high, especially when the top-down method is used. This can be explained as follows. The clustering quality is low when patients with different cancer types are clustered together. This usually indicates that different samples in the same cluster can

PAGE 43

Evaluation of cluster qualities using (a) NMI and (b) F1-measure for different clustering methods. The fifth and the ninety-fifth percentile of the results are reported as the error bar.

PAGE 44

4-3A. On the other hand, the F1-measure drops in Figure 4-3B. This is because F1-measure favors coarser clustering and is biased towards a small number of clusters, while NMI is quite impartial to the number of clusters [93]. We do not see the same effect for the other clustering methods because the large variance in their results, except for bottom-up, hides this effect. For the bottom-up method with or without centroid shrinking, we can see that the increase in quality flattens as the number of clusters increases.

Next, we ran all the mentioned clustering methods on the entire CGH dataset (i.e., 5,020 samples). Figure 3-5 shows the NMI for Sim. The results confirm the experiments in Figure 4-3A: 1) Top-down clustering produces the best clusters. 2) The centroid shrinking strategy does not have a significant impact. 3) Most of the results on the entire dataset remain within the error intervals. The best clustering quality was obtained when 64 clusters were created. The average cluster size, i.e. the number of samples in the cluster, is 78.44 and the standard deviation is 51.03.

In our experiments on the same dataset using Rand Index, we obtained slightly better results with the top-down method. The two described internal measures (compactness and

PAGE 45

Cluster qualities of different clustering methods with the Sim measure over the entire dataset. The cluster qualities are evaluated using NMI.

separation) support this conclusion that top-down clustering is the better choice (results omitted due to space limitation).

3.1.2.3, we discussed two types of top-down methods, the top-down method with global refinement and the top-down method with local refinement. Here, we evaluate the quality and running time of these two strategies. We restrict the similarity measure to Sim as it gives the highest quality. Using each strategy, we created 2, 4, 8, 16, 32, and 64 clusters for each of the 19 cancer types. We compute the average internal measure based on compactness over all the cancer types as the quality of the clusters. We also compute the average time to create clusters as the running time.

Table 3-3 shows the average quality and running time of the two top-down methods. The first part of the table indicates that local refinement gives slightly worse quality than global refinement. However, the quality difference is negligible. The quality of the clusters increases as the number of clusters increases up to 32. The quality

PAGE 46

Comparison of average quality and running time of top-down methods with global and local refinement. (Here L and G indicate local and global refinement respectively.)

Number of clusters     2      4      8       16      32       64
Quality      L         703    797    892     927     947      904
             G         730    839    936     983     1017     971
Time (sec)   L         0.1    0.3    1.7     3.1     6.5      9.8
             G         3.4    22.9   129.7   329.4   1151.2   2018.2

starts to plateau or drop after this point. This indicates that, in general, as the number of clusters increases, the clusters are more compact and the intra-similarity of clusters increases. However, when the number of clusters becomes too large compared to the size of the dataset, some closely similar samples are forced into different clusters, which instead reduces the intra-similarity of clusters. The second part of the table indicates that the average running time for global refinement is much higher than for local refinement. This observation is consistent with our analysis of time complexity in Section 3.1.2.3. Considering that local refinement gives only slightly worse quality but runs much faster than global refinement, we use the former method throughout this chapter.

PAGE 47

In our experiments using classified disease entities from the Progenetix database, the highest clustering quality was achieved using Sim as the similarity measure and top-down as the clustering strategy. This observation fits with the theory that contiguous runs of genomic aberrations arise around a point-like target (e.g., oncogene), and that consecutive genomic intervals cannot be considered as independent of each other.

PAGE 48

We observe that the Sim measure is affected by noisy aberrations in CGH data since it depends on only a pair of samples. In this chapter, we develop a dynamic programming algorithm to identify a small set of important genomic intervals called markers. The advantage of using these markers is that the potentially noisy genomic intervals are excluded from clustering. We also develop two clustering strategies using these markers. The first one, the prototype-based approach, maximizes the support for the markers. The second one, the distance-based approach, develops a new similarity measure called RSim and iteratively refines clusters by maximizing the RSim measure between samples in the same cluster. Our results demonstrate that the markers we found represent the aberration patterns of cancer types well and that they improve the quality of clustering.

Given a set S of N CGH samples $\{s_1, s_2, \dots, s_N\}$. Let $x^j_d$ denote the status value for sample j at genomic interval d, for all d with $1 \le d \le D$, where D is the number of intervals. Let $s_j[u, v]$ be the segment of $s_j$ that starts at the uth interval and ends at the vth interval. We use the term segment to represent a contiguous block of aberrations of the same type. Formally, a list of status values $x^j_u, x^j_{u+1}, \dots, x^j_v$, for $1 \le u \le v \le D$, defines a segment if genomic intervals u through v are in the same chromosome, the values from $x^j_u$ to $x^j_v$ are all gains or all losses, and $x^j_{u-1}$ and $x^j_{v+1}$ are different from $x^j_u$. For example, in Figure 4-1

PAGE 49

Two CGH samples X and Y with the values of genomic intervals listed in the order of positions. The segments are underlined.

sample X contains four segments. The first and third segments are gain type while the second and fourth segments are loss type.

Let $\{m_t = \langle p_t, q_t \rangle \mid 1 \le t \le R\}$ be a set of markers that are ordered along the genomic intervals, i.e. $p_1 < p_2 < \dots < p_R$.
PAGE 50

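Let $O(d, r) = \sum_{t=1}^{r} Support(m_t)$, for $1 \le p_t \le d \le D$, $\forall t \in [1, r]$, denote the largest possible support that r markers can get from the genomic intervals in the range [1..d]. Here, [1..d] denotes the integers 1, 2, ..., d. O(1, 1) = support of the single marker at the first genomic interval. The value of O(d, r) in general, where $1 \le r \le R$, $r \le d \le D$, can be computed as the maximum of two possible cases.

Thus, O(d, r) can be computed using the following recursive equation,

\[
O(d, r) = \max
\begin{cases}
\max \left\{
\begin{array}{l}
O(d-1, r-1) + Support(m_r) \\
O(d-2, r-1) + Support(m_r) \\
\quad \vdots \\
O(b-1, r-1) + Support(m_r)
\end{array}
\right\} & \text{if } m_r \text{ appears at interval } d \\[1ex]
O(d-1, r) & \text{otherwise}
\end{cases}
\]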
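The recurrence can be illustrated with a small sketch. It rests on several assumptions that are not taken from the text: positions are 0-based, all intervals lie on a single chromosome (so the lower bound b is the first interval), and the support of a marker placed at `pos` after the previous markers were chosen from the first `g` intervals counts only same-type segments that cover `pos` and start at or after `g`, so that no segment is credited to two markers. The helper names `segments` and `select_markers` are hypothetical.

```python
def segments(sample):
    """Maximal runs of equal non-zero status as (start, end, type), 0-based.
    Assumption: the whole sample is one chromosome."""
    segs, start = [], None
    for i, v in enumerate(list(sample) + [0]):      # trailing 0 closes a final run
        if start is not None and v != sample[start]:
            segs.append((start, i - 1, sample[start]))
            start = None
        if start is None and v != 0:
            start = i
    return segs

def select_markers(samples, R):
    """Fill O[d][r] = best total support of r markers within the first d intervals."""
    D = len(samples[0])
    assert R <= D
    segs = [segments(s) for s in samples]

    def support(pos, q, first):
        # Assumed support: type-q segments covering pos that start at or after `first`.
        return sum(1 for ss in segs for (u, v, t) in ss
                   if t == q and first <= u <= pos <= v)

    NEG = float("-inf")
    O = [[NEG] * (R + 1) for _ in range(D + 1)]
    back = [[None] * (R + 1) for _ in range(D + 1)]
    for d in range(D + 1):
        O[d][0] = 0
    for d in range(1, D + 1):
        for r in range(1, min(R, d) + 1):
            O[d][r] = O[d - 1][r]                   # case: no marker at interval d-1
            for g in range(r - 1, d):               # case: r-th marker at interval d-1
                for q in (1, -1):                   # gain or loss marker
                    val = O[g][r - 1] + support(d - 1, q, g)
                    if val > O[d][r]:
                        O[d][r], back[d][r] = val, (g, q)
    markers, d, r = [], D, R                        # trace the choices back
    while r > 0:
        if back[d][r] is None:
            d -= 1
        else:
            g, q = back[d][r]
            markers.append((d - 1, q))
            d, r = g, r - 1
    return O[D][R], markers[::-1]

toy = [[1, 1, 0, 0, -1, -1, 0, 0],
       [0, 1, 1, 0, -1, 0, 0, 0],
       [0, 1, 0, 0, 0, -1, -1, 0]]
total, markers = select_markers(toy, 2)
```

On the toy data the three gain segments all cover interval 1, so one gain marker there collects support 3, and a loss marker further right adds support 2. The sketch recomputes support from scratch for every cell, so it does not reproduce the O(DNR) running time discussed for the actual algorithm.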

PAGE 51

An important feature of the dynamic programming approach is that optimal solutions to subproblems are retained so as to avoid recomputing their values [67]. We construct a D × R matrix with cell (d, r) storing the optimal value of O(d, r). An iterative program is implemented to fill this matrix. For each cell (d, r), we need to revisit the cells (g, r-1), $b-1 \le g \le d-1$, which takes constant time proportional to the average length of a chromosome. Besides, we need O(N) time to compute $Support(m_r)$. So the time complexity of filling one cell is O(N). The overall time complexity of filling the whole matrix is O(DNR).

The number of markers chosen is no longer fixed. Figure 4-2 presents an example of applying the adaptive number of markers approach over CGH data of the cancer type Retinoblastoma, NOS (ICD-O 9510/3) [22], which contains 120 samples with each sample including 862 genomic intervals. In this case, the parameter is set to 0.5 and four

PAGE 52

The CGH data plot of cancer type Retinoblastoma, NOS (ICD-O 9510/3), with 120 samples and each sample containing 862 genomic intervals. The genomic intervals with gain and loss status are plotted with green and red colors respectively. The genomic intervals with no change status are not plotted. Four markers are found at genomic intervals 52, 69, 287 and 690 using the adaptive number of markers approach with the parameter set to 0.5. The markers are shown using vertical lines. The types of the four markers are gain, gain, gain and loss respectively.

markers are found at genomic intervals 52, 69, 287 and 690 with types gain, gain, gain and loss respectively. Please note that two gain markers are found at adjacent genomic intervals, 52 and 69, because they are in different chromosomes and one segment can support both of them at the same time.

PAGE 53

4.1 is used to identify the optimal set of markers. These markers serve as the prototype of this cluster.

Essentially, both steps optimize a cohesion function alternately until the value of this function does not change, i.e. converges to a stable value. Next, we discuss the cohesion function in detail and prove that our clustering algorithm converges.

Given a set of CGH samples $S = \{s_1, s_2, \dots, s_N\}$. Let K denote the number of clusters. Let $c_i$ denote the ith cluster. A clustering of samples can be represented by an encoding function $f: s_j \to [1..K]$ that maps sample $s_j$ to cluster $c_{f(s_j)}$. Let R be the number of markers in each cluster. Let $M_i = \{m_{i,1}, m_{i,2}, \dots, m_{i,R}\}$ denote the set of markers (i.e., prototype) for cluster i, where $1 \le i \le K$, $1 \le t \le R$ and $m_{i,t}$ denotes the tth marker in the ith cluster. Each marker $m_{i,t}$ is a tuple $\langle p_{i,t}, q_{i,t} \rangle$, where $p_{i,t}$ and $q_{i,t}$ denote the position and type of this marker respectively. Let $\mathcal{M}$ denote the set of prototypes $M_1, M_2, \dots, M_K$. We define a cohesion function of clustering as below

\[
\text{cohesion}(f, \mathcal{M}) = \sum_{i=1}^{K} \sum_{s_j \in c_i} \sum_{t=1}^{R} \delta(s_j, m_{i,t});
\]

4.1. Essentially, the cohesion function computes the intra-cluster similarity between samples and cluster prototypes.

In the refinement step, we optimize the cohesion function by refining the prototype (markers) of clusters given that samples are partitioned into K clusters. For cluster i, let $O_i(d, r) = \sum_{t=1}^{r} Support(m_t) = \sum_{t=1}^{r} \sum_{s_j \in c_i} \delta(s_j, m_{i,t})$ denote the largest support that r markers can get from the genomic intervals in the range [1..d]. Thus the optimal cohesion

PAGE 54

4.1 is used to compute $O_i(D, R)$ for $1 \le i \le K$ and, in this way, the cohesion function is optimized.

In the reassignment step, we reassign the sample $s_j$ to the ith cluster whose prototype $M_i$ is supported the most by $s_j$, i.e. $M_i$ covers the largest number of segments in $s_j$. This is because, otherwise, the cohesion function could always be improved by letting $f(s_j) = i$. Formally, $f' = \arg\max_f (\text{cohesion}(f, \mathcal{M}))$ where $f'$ denotes the new encoding function after the reassignment step.

On the other hand, the reassignment step optimizes the cohesion based on f, i.e. $\text{cohesion}(f^{(h+1)}, \mathcal{M}^{(h+1)}) \ge \text{cohesion}(f^{(h)}, \mathcal{M}^{(h+1)})$.

Put together, our algorithm generates a sequence $(f^{(h)}, \mathcal{M}^{(h)})$, $h \ge 0$, that increases the cohesion function as $\text{cohesion}(f^{(h+1)}, \mathcal{M}^{(h+1)}) \ge \text{cohesion}(f^{(h)}, \mathcal{M}^{(h)})$.

Since the maximum value of the cohesion function is finite given a set of samples, our algorithm converges to a stable value at the end.

It is worth noting that our algorithm is a k-means-type algorithm. Let $x_1, x_2, \dots, x_N \in R^D$ be a finite number of samples. To partition the patterns into K partitions, $2 \le K \le N$, a k-means-type algorithm tries to solve the following mathematical problem:

PAGE 55

69].

Under certain conditions, the k-means-type algorithm may fail to converge to a local minimum. Let (W, Z) be a partial optimal solution of problem P and $A(W) = \{Z : Z \text{ minimizes } f(W, Z),\ Z \in R^{D \times K}\}$. A sufficient condition for W to be a local minimum of problem P is that A(W) is a singleton [69]. Next, we show that in practice the prototype-based clustering algorithm usually converges to a local optimum. In the prototype-based clustering algorithm, A(W) represents the sets of markers identified in each cluster for a certain clustering result W. When the number of markers is small compared to the number of all intervals, each chromosome arm often contains at most one marker. Since alterations in different chromosome arms are independent of each other, the marker identified in a chromosome arm should be the one with the largest support, and no other marker is identified in the same chromosome arm. Therefore, the optimal marker in each chromosome arm is unique if we assume that no two markers have the same support value. This makes the set of markers for each cluster a singleton. Therefore, the prototype-based clustering algorithm often converges to a local optimum.

3, we proposed a segment-based similarity measure, called Sim,

PAGE 56

Let D denote the number of genomic intervals of each sample. Let $s_i = x^i_1, x^i_2, \dots, x^i_D$ and $s_j = x^j_1, x^j_2, \dots, x^j_D$ be two CGH samples. Here, $x^i_d$ and $x^j_d$ denote the value or status of the dth genomic interval of $s_i$ and $s_j$, respectively.

We first summarize the Sim measure we developed in Chapter 3 for computing the similarity of two CGH samples. We call two segments from two samples overlapping if they have at least one common genomic interval of the same type. Sim constructs maximal segments by combining as many contiguous aberrations of the same type as possible. Thus, each sample translates into a sequence of segments. For example, in Figure 4-1, $s_i$ and $s_j$ are two samples that have four and two segments respectively. After this transformation, Sim computes the similarity between two CGH samples as the number of overlapping segment pairs. This is because each overlap may indicate a common point-like aberration in both samples which potentially led to the overlapping segments. In Figure 4-1, the first segment of $s_i$ overlaps with the first segment of $s_j$. Similarly, the third segment of $s_i$ overlaps with the second segment of $s_j$. Sim computes the similarity between two samples based on the genomic aberrations local to these samples. Thus, Sim cannot distinguish the true aberrations from noisy ones. As a result, Sim is a local measure that is easily biased by noise. Next, we develop a new approach that addresses this limitation.

We propose to employ markers to eliminate the contribution of noise to the pairwise similarity. We develop a refined Sim measure, called RSim, as follows. Let $M = \{m_1, m_2, \dots, m_R\}$, $p_1 < p_2 < \dots < p_R$,
PAGE 57

2. Both of the overlapping segments have the same aberration type as the marker they both contain.

Formally, let $x^i_u, x^i_{u+1}, \dots, x^i_v$ and $x^j_{u'}, x^j_{u'+1}, \dots, x^j_{v'}$ be a pair of segments from samples $s_i$ and $s_j$, respectively. RSim counts this pair of segments as one only if

1. there exists a marker $m_t = \langle p_t, q_t \rangle$, $m_t \in M$, such that $u \le p_t \le v$ and $u' \le p_t \le v'$, and
2. the aberration type of both segments is the same as that of $m_t$, i.e. $x^i_u = x^j_{u'} = q_t$.

Unlike Sim, RSim does not consider overlapping segments that do not intersect with any marker. This is because RSim considers such segments as noise. For example, assume that there are two markers in Figure 4-1, one at the 3rd and the other at the 11th genomic interval. Then the RSim measure for $s_i$ and $s_j$ is one, whereas the Sim measure is two.

An important observation on RSim is as follows. As the number of markers approaches the number of genomic intervals in the CGH data, RSim becomes equivalent to the Sim measure. This is because all segments in the CGH data will overlap with a marker and contribute to the similarity. Thus Sim is a special case of RSim in which noisy aberrations are not eliminated.

Our previous work in Chapter 3 showed that Sim works best when combined with the top-down algorithm, compared to other popular clustering algorithms such as bottom-up and k-means. In this chapter, we propose to use the top-down clustering method with RSim as the pairwise similarity measure for pairs of CGH samples.

Note that it is possible to extend the Raw measure in Chapter 3 by taking the markers into account. The extended Raw measure works as follows. For each pair of samples, we compute the similarity between them as the number of genomic intervals that meet the following two conditions. First, both samples have the same aberration type (gain or loss) at this interval. Second, one marker of the same type as both samples
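The Sim and RSim definitions given in this section can be sketched in a few lines. The sketch is illustrative only: it uses 0-based positions, treats each sample as a single chromosome, and the two toy samples are made up (they are not the samples of Figure 4-1). The function names are hypothetical.

```python
def segments(sample):
    """Maximal runs of equal non-zero status as (start, end, type), 0-based.
    Assumption: chromosome boundaries are ignored."""
    segs, start = [], None
    for i, v in enumerate(list(sample) + [0]):      # trailing 0 closes a final run
        if start is not None and v != sample[start]:
            segs.append((start, i - 1, sample[start]))
            start = None
        if start is None and v != 0:
            start = i
    return segs

def sim(a, b):
    """Sim: number of overlapping segment pairs of the same aberration type."""
    return sum(1 for (u, v, t) in segments(a) for (u2, v2, t2) in segments(b)
               if t == t2 and max(u, u2) <= min(v, v2))

def rsim(a, b, markers):
    """RSim: count a segment pair only if some marker (p, q) lies inside both
    segments and q equals the aberration type of both segments."""
    return sum(1 for (u, v, t) in segments(a) for (u2, v2, t2) in segments(b)
               if t == t2 and any(q == t and u <= p <= v and u2 <= p <= v2
                                  for p, q in markers))

s_i = [1, 1, 1, 0, -1, -1, 0, 0, 1, 1, 0, -1]       # four segments
s_j = [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]          # two segments
two_markers = [(2, 1), (10, 1)]                     # gain markers, 0-based positions
every_marker = [(p, q) for p in range(len(s_i)) for q in (1, -1)]
```

With the two gain markers only the first overlapping pair is counted, so RSim is one while Sim is two. Placing a marker of each type at every interval makes RSim coincide with Sim, which mirrors the special-case observation above.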

PAGE 58

Dataset: With more than 12,000 cases [1], the largest resource for published CGH data can be found in the Progenetix database [3] (

22] and consists of 862 ordered genomic intervals extracted from 24 chromosomes. In principle, each dataset can be mapped to an integer matrix of size N × 862, where N denotes the number of samples. The difference among these three datasets is the divergence of aberration patterns in distinct cancer types. In dissimilarDS, the samples of different cancer types contain diverse aberration patterns that are easily distinguished from each other. The samples in similarDS contain similar aberration patterns. The interDS dataset lies at an intermediate degree. The choices of these datasets are based on a visual inspection of the matrices for each of the cancer types.

PAGE 59

4.2. This function measures the total support of markers over each cluster. We use the term Coverage Measure (CM) to denote this measure. Markers with high support potentially convey meaningful biological information and can serve as the first step for further analysis, such as the identification of new oncogenes and tumor suppressor genes. A group of markers can be considered biologically relevant if they cover most of the segments in all the patients.

3.2.1 can be used for this purpose.

PAGE 60

78] over three datasets, dissimilarDS, interDS and similarDS. For each clustering method and dataset, we created 6, 8 and 10 clusters. For each number of clusters, we tried different numbers of markers, i.e. 4, 6, 8, 10 and 12 markers, per cluster in the prototype-based approach. This is because biologists have pointed out that a total of 4-7 genetic alterations can be estimated for the malignant transformation of a cell [41]. Thus, we estimated that the number of aberrations common to the samples of one cluster could be around 10. For consistency, we also employed different numbers of markers for each clustering of 6, 8 and 10 clusters in RSim. Here, the number of markers is determined as the product of the number of clusters and the number of markers per cluster used in the prototype-based approach. For example, 24, 36, 48, 60 and 72 markers were found to create 6 clusters using RSim. We compared the three methods according to the quality of the clusters. We evaluated the cluster quality using both the NMI and CM measures. To evaluate the cluster quality using CM, for both RSim and Sim, we identified the markers for each resulting cluster. The number of markers per cluster is the same as that used in the prototype-based approach. We compute the error bars for part of the results. We also used GMS to evaluate the biological relevance of markers found in the prototype-based approach and the RSim approach.

The CM results are shown in Table 4-1. The CM values monotonically increase as the number of clusters or the number of markers increases. Thus, we compare the three clustering methods for the same number of clusters and markers. We observe that the prototype-based approach has 8 to 34% better coverage than RSim and 15 to 41% better coverage than Sim. This is because the cohesion function optimized in the prototype-based approach is

PAGE 61

Coverage measure for three clustering methods applied over three datasets. Here, K denotes the number of clusters. Proto denotes the prototype-based approach.

Dataset   K   Alg   4K   6K   8K   10K   12K

The NMI results are shown in Table 4-2. Since Sim produces clusters without finding markers, we list its results in a separate column. The results show that all three methods perform better on dissimilarDS than on interDS and similarDS. This is because the aberration patterns of distinct cancer types in dissimilarDS are divergent. Thus, it is harder to cluster the interDS and similarDS datasets. From the results, we observe that

PAGE 62

The NMI values of the three clustering methods applied over three datasets. Here, K denotes the number of clusters. Proto denotes the prototype-based approach.

Dataset   K   Sim   Alg   4K   6K   8K   10K   12K

4-1, indicates that the NMI measure has no apparent relationship with CM. This is because NMI computes the quality based on the class labels of samples. On the other hand, CM evaluates the compactness of samples in each cluster based on chromosomal aberration patterns and completely ignores the class labels. Therefore, we conclude that the pairwise similarity-based clustering approaches are better suited to external measures, such as NMI, while the prototype-based approach works better for the Coverage Measure. RSim usually has the best NMI results using ten markers. When compared to Sim, RSim usually has better NMI values. This indicates that the use of markers in refining the pairwise similarity also leads to a better clustering in terms of NMI. Given that RSim has better CM (see Table 4-1) and NMI values than Sim (see Table 4-2), we conclude that RSim is a better pairwise similarity measure than Sim.

PAGE 63

Error bar results of three clustering methods over three datasets. The three clustering methods are prototype-based (denoted as Proto), RSim and Sim. The three datasets are dissimilarDS, interDS and similarDS. For each dataset, eight clusters are created. For each cluster, ten markers are identified. The resulting clusters are evaluated using both NMI and CM. Here, 5% and 95% denote the 5th and 95th percentile respectively.

                         NMI                     CM
Dataset        Alg       5%     median   95%     5%     median   95%
dissimilarDS   Proto     0.20   0.28     0.34    1372   1431     1487
               RSim      0.45   0.51     0.55    1230   1291     1337
               Sim       0.36   0.42     0.47    1187   1241     1289
interDS        Proto     0.05   0.09     0.13    1156   1228     1285
               RSim      0.32   0.37     0.41    985    1042     1120
               Sim       0.27   0.30     0.33    963    1032     1098
similarDS      Proto     0.06   0.09     0.13    1480   1535     1581
               RSim      0.33   0.37     0.40    1294   1342     1401
               Sim       0.29   0.33     0.37    1283   1327     1376

4-3 shows the results with error bars. Note that the CM results are roughly half the results shown in Table 4-1 because the calculation of CM depends on the dataset and we sample 50% of each dataset in the experiments. The results show that RSim is superior to Sim most of the time in terms of both NMI and CM. Moreover, among the three clustering methods, RSim and the prototype-based approach work best for the NMI and CM measures, respectively. These observations are compatible with those we obtained from Tables 4-1 and 4-2. Therefore, the error bars confirm our earlier conclusions.

PAGE 64

Comparison of GMS values of markers in clusters from two clustering approaches. Plots of global maximum support of markers found (A) in dissimilarDS, (B) in interDS and (C) in similarDS. The solid line indicates the results of markers generated by the RSim approach. The dashed line indicates the results of markers generated by the prototype-based approach (denoted as model).

The next experiment compares the GMS values for the RSim and prototype-based clustering approaches. For each dataset, we created eight clusters and identified ten markers per cluster. This is because these results are among the best results of each dataset in Table 4-2. We then sort the markers in descending order of GMS value. We plot the sorted results of both RSim and the prototype-based approach (Figure 4-3). The plots show that the maximum global support of markers found by RSim is always comparable to or better than that of the markers found by the prototype-based approach.

PAGE 65

1. We developed a dynamic programming algorithm to identify the optimal set of important genomic intervals called markers. The advantage of using these markers is that the potentially noisy genomic intervals are excluded in the computation of pairwise similarity between samples.

2. We developed two clustering strategies using these markers. The first one, the prototype-based approach, maximizes the support for the markers. The second one, the similarity-based approach, develops a new similarity measure called RSim. It computes the pairwise similarity between samples by removing the noisy aberrations. We demonstrated the utility of such a measure in improving the quality of clustering using the classified disease entities from the Progenetix database. Our results show that the markers we found represent the aberration patterns of cancer types well.

3. We developed several measures for comparing markers and different clustering methods. Our experimental results show that optimizing for the coverage measure may not lead to better values of NMI and vice versa.

PAGE 66

Classification is the task of learning a target function that maps each attribute set to one of the predefined class labels [79]. Typical classification tasks for cancer research include separating healthy patients from cancer patients and distinguishing patients of different cancer subtypes, based on their cytogenetic profiles. These tasks help successful cancer diagnosis and treatment. An important technique related to classification is feature selection. The goal of feature selection is to select a small number of discriminative features, i.e. genomic intervals in CGH data, for accurate classification. In this chapter, we propose novel SVM-based methods for classification and feature selection of CGH data. For classification, we developed a novel similarity kernel that is shown to be more effective than the standard linear kernel used in SVM. For feature selection, we propose a novel method based on the new kernel that recursively selects features with the maximum influence on an objective function. We compared our methods against the best wrapper-based and filter-based approaches that have been used for feature selection of large dimensional biological data. Our results on datasets generated from the Progenetix database suggest that our methods are considerably superior to existing methods.

83]. It has been shown to have better accuracy and computational advantages over its contenders [27] and has been successfully applied to many biological classification problems. The basic technique works as follows. Consider a set of points that are presented in a high dimensional space such that each point belongs to one of two classes. An SVM computes a hyperplane that maximizes the margin separating the two classes of samples. The optimal hyperplane is called the decision boundary. Formally, let $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$ denote n training samples and their corresponding class labels respectively. Let $y_i \in \{-1, 1\}$ denote the labels of the two classes. The decision boundary of

PAGE 67

79]. The learning task in SVM can be formalized as the following constrained optimization problem:

\[
\min_{w} \frac{\|w\|^2}{2} \quad \text{subject to } y_i (w \cdot x_i + b) \ge 1,\ i = 1, \dots, n.
\]

The dual version of the above problem corresponds to finding a solution to the following quadratic program:

Maximize J over $\alpha_i$:
\[
J = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
\]
subject to $\alpha_i \ge 0$, $\sum_{i=1}^{n} \alpha_i y_i = 0$, where $\alpha_i$ is a real number.

The decision boundary can then be constructed from the solutions $\alpha_i$ to the quadratic program. The resulting decision function of a new sample z is $D(z) = w \cdot z + b$.

Standard SVM methods find a linear decision boundary based on the training examples. They compute the similarity between samples $x_i$ and $x_j$ using the inner product $x_i^T x_j$. However, the simple inner product does not always measure the similarity effectively for all applications. For some applications, a non-linear decision boundary is more effective for classification. The basic SVM method can then be extended by transforming samples to a higher dimensional space via a mapping function. By doing this, a linear decision boundary can be found in the transformed space if a proper function is used.

PAGE 68

We introduce a new measure called Raw that captures the underlying categorical information in CGH data and then show how to incorporate it into the basic SVM method. CGH data consists of sparse categorical values (gain, loss and no change). Conceptually, the similarity between CGH samples depends on the number of aberrations (gains or losses) they both share. Raw calculates the number of common aberrations between a pair of samples. Given a pair of samples $a = a_1, a_2, \dots, a_m$ and $b = b_1, b_2, \dots, b_m$, the similarity between a and b is computed as $Raw(a, b) = \sum_{i=1}^{m} S(a_i, b_i)$. Here $S(a_i, b_i) = 1$ if $a_i = b_i$ and $a_i \ne 0$. Otherwise $S(a_i, b_i) = 0$.

The main difference between Raw(a, b) and $a^T b$ is the way they deal with different aberrations in the same interval. For example, if two samples a and b have different aberrations at the ith interval, i.e. $a_i = 1, b_i = -1$ or $a_i = -1, b_i = 1$, the inner product calculates this pair as $a_i b_i = -1$ while Raw calculates $S(a_i, b_i) = 0$. The similarity value between a and b computed by Raw is always greater than or equal to the inner product of a and b. We propose to use the Raw function as the kernel function for training as well as prediction.

Using SVM with the Raw kernel amounts to solving the following quadratic program:

Maximize J over $\alpha_i$:
\[
J = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j Raw(x_i, x_j)
\]
subject to $\alpha_i \ge 0$, $\sum_{i=1}^{n} \alpha_i y_i = 0$.
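The difference between Raw and the inner product is easy to see on a toy pair. The following minimal sketch implements Raw exactly as defined above; the two sample vectors and the helper name `inner` are made up for illustration.

```python
def raw(a, b):
    """Raw(a, b): number of intervals where both samples carry the same aberration."""
    return sum(1 for ai, bi in zip(a, b) if ai == bi and ai != 0)

def inner(a, b):
    """Plain inner product a^T b, as used by the linear kernel."""
    return sum(ai * bi for ai, bi in zip(a, b))

a = [1, 1, 0, -1, 0, 1]
b = [1, -1, 0, -1, 1, 0]

# Intervals 0 and 3 hold a shared aberration; interval 1 holds opposite ones.
raw_ab = raw(a, b)        # counts only the two shared aberrations
inner_ab = inner(a, b)    # the opposite pair contributes -1
```

Here Raw gives 2 while the inner product gives 1, because the gain/loss conflict at interval 1 is penalized by the inner product and ignored by Raw.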

PAGE 69

79]. This requires that the underlying kernel matrix is "semi-positive definite". Formally, a kernel K is a symmetric function taking two arguments of an arbitrary set X where the data stems from, i.e., $K: X \times X \to R$. For given data points $(x_i)_{i=1}^{n} \in X^n$, the kernel matrix $M := (K(x_i, x_j))_{i,j=1}^{n}$ can be defined. If for all n, all sets of data points and all vectors $v \in R^n$ the inequality $v^T M v \ge 0$ holds, then K is called semi-positive definite. We now prove that our Raw kernel satisfies this requirement.

The mapping function $\Phi$ is defined as follows: $a \in \{-1, 0, 1\}^m \to b \in \{1, 0\}^{2m}$, where

$a_i = 1$: $b_{2i-1} b_{2i} = 01$
$a_i = -1$: $b_{2i-1} b_{2i} = 10$
$a_i = 0$: $b_{2i-1} b_{2i} = 00$

With this transformation, it is easy to see that the Raw kernel can be written as the inner product of $\Phi(x)$ and $\Phi(y)$, i.e. $Raw(x, y) = \Phi(x)^T \Phi(y)$. This is because Raw only counts the number of common aberrations in computing the similarity between two samples (if both values are 0, they are not counted).

We define a 2m by n matrix u whose jth column vector corresponds to $\Phi(x_j)$, i.e. $u := [\Phi(x_1)\ \Phi(x_2)\ \dots\ \Phi(x_n)]$. The Raw kernel matrix can be written as

PAGE 70

It is worth noting that we have developed other similarity measures such as Sim for the clustering of CGH data in Chapter 3. Although Sim works better than Raw in clustering, it cannot serve as a kernel function in SVM because it is not semi-positive definite.

5-1). When a compact set of features is selected, these highly correlated features may cause "redundancy" in the predictive power. For example, assume we have a training dataset with four cancer types and we want to select two features for classification. If the ith feature is ranked high for separating samples of the first cancer from the others, the (i+1)th or (i-1)th feature may be ranked high too for the same effect. However, selecting both the ith and the (i+1)th (or (i-1)th) feature cannot improve the classification performance much. On the other hand, if another feature, say the jth feature,

PAGE 71

Plot of 120 CGH cases belonging to Retinoblastoma, NOS (ICD-O 9510/3). The X-axis and Y-axis denote the genomic intervals and the samples respectively. We plot the gain and loss status in green and red respectively.

well separates samples of the third cancer from the others but has a lower ranking than the (i+1)th (or (i-1)th) feature, we should select the ith and jth feature instead.

Typical wrapper methods based on backward feature elimination, such as SVM-RFE [27], are poor at discriminating among highly correlated features, especially when a small set of features is selected. Filter methods, such as MRMR [15], address this problem by selecting features with minimal redundancy. However, due to the difficulty in selecting complementary features, filter methods often produce lower predictive accuracy compared to wrapper methods. In this paper, we propose a novel non-linear SVM-based wrapper method called Maximum Influence Feature Selection (MIFS) for the classification of multiclass CGH data.

When the number of features is very large, an exhaustive search of all possible feature subsets is computationally intractable. We use a greedy search strategy to progressively add features into a feature subset. To find the next feature to add, we use criteria similar

PAGE 72

27]. The basic idea is to compute the change in the objective function caused by removing or adding a given feature. In our case, we select the feature that maximizes the variation of the objective function. The added feature is the one which has the most influence on the objective function. This is unlike the backward elimination scheme that removes the feature that minimizes the variation of the objective function [27, 64].

The feature that has the most influence on the objective function is determined as follows. Let S denote the feature set selected at a given algorithm step and J(S) denote the value of the objective function of the trained SVM using feature set S. Let k denote a feature that is not contained in S. The change in the objective function after adding a candidate feature is written as $DJ(k) = |J(S \cup \{k\}) - J(S)|$. In the case of SVM, the objective function that needs to be maximized (under the constraints $0 \le \alpha_i$ and $\sum_i \alpha_i y_i = 0$) is:

\[
J(S) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j Raw(x_i, x_j)
\]

For each feature k not in S, we compute the new objective function $J(S \cup \{k\})$. To make the computation tractable, we assume no change in the value of the $\alpha$'s after the feature k is added. Thus we avoid having to retrain a classifier for every candidate feature [27]. The new objective function with feature k added is:

\[
J(S \cup \{k\}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j Raw(x_i(+k), x_j(+k))
\]

where $x_i(+k)$ means training sample i with feature k added.

Therefore, the estimated change of the objective function (estimated because we are not retraining the classifier with the additional feature) is:

\[
DJ(k) = \frac{1}{2} \left| \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j Raw(x_i, x_j) - \sum_{i=1, j=1}^{n} \alpha_i \alpha_j y_i y_j Raw(x_i(+k), x_j(+k)) \right|
\]
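The estimated change DJ(k) can be sketched directly from the last formula. In this sketch the multipliers `alpha` are hypothetical placeholder values standing in for the solution of an already trained SVM (no SVM is trained here), and the toy data, labels and function names are made up for illustration.

```python
def raw(a, b):
    """Raw kernel: number of positions with the same non-zero status."""
    return sum(1 for ai, bi in zip(a, b) if ai == bi and ai != 0)

def quadratic_term(X, y, alpha, features):
    """sum_{i,j} alpha_i alpha_j y_i y_j Raw(x_i, x_j), restricted to `features`."""
    proj = [[x[f] for f in features] for x in X]
    n = len(X)
    return sum(alpha[i] * alpha[j] * y[i] * y[j] * raw(proj[i], proj[j])
               for i in range(n) for j in range(n))

def estimated_change(X, y, alpha, S, k):
    """DJ(k): the alphas are held fixed, so nothing is retrained for candidate k."""
    before = quadratic_term(X, y, alpha, sorted(S))
    after = quadratic_term(X, y, alpha, sorted(S | {k}))
    return 0.5 * abs(before - after)

X = [[1, 0, 1],
     [1, -1, 0],
     [-1, 0, 1],
     [0, -1, -1]]
y = [1, 1, -1, -1]
alpha = [0.5, 0.5, 0.5, 0.5]     # hypothetical values; they satisfy sum(alpha_i * y_i) = 0
S = {0}                          # features selected so far
dj = {k: estimated_change(X, y, alpha, S, k) for k in (1, 2)}
best = max(dj, key=dj.get)       # candidate with the largest influence
```

Because Raw is a sum over intervals, the difference between the two quadratic terms reduces to the contribution of feature k alone, which is why each candidate can be scored without retraining.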

PAGE 73

2.
(a) Train an SVM using training samples with features in RL;
(b) Compute the change of objective function DJ(k) for each candidate feature $k \in L$;
(c) Find the feature e with the largest DJ(k), i.e. $e = \arg\max(DJ(k))$;
(d) Update RL = [RL, e] and L = L - {e}.

48]. Recent work has shown that the empirical time complexity for training a linear SVM is about $O(n^{1.7} m)$ [33], where n and m denote the number of samples and number of features respectively. Based on this, the conventional and empirical time complexity for this algorithm is $O(n^3 r^2)$ and $O(n^{1.7} r^2)$ respectively.

The above method requires the set of features S to be non-empty. To start the method, we need to derive the first feature to be added to this set. One possibility is to compute J({k}) for every feature k by training a separate SVM for each feature k. We can then select the feature with the largest value as the starting feature. However, this can be computationally very expensive. Another approach is to use the most discriminating

PAGE 74

where p(r, s) is their joint probability; p(r) and p(s) are the respective marginal probabilities. If we look at the kth feature as a random variable, we use the mutual information I(k, y) between the class labels $y = \{y_1, y_2, \dots, y_n\}$ and the feature variable k to quantify the relevance of the kth feature for the classification task. We choose the feature k with the maximum I(k, y) as our starting feature. We have found that using such methods is satisfactory. Our preliminary experimental results showed that Multiple Selection is not sensitive to the initial feature chosen.

5.2 only works for two-class problems. We derive the multiclass version using a one-versus-all approach as follows.
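The mutual-information criterion for the starting feature can be sketched as follows, estimating the joint and marginal probabilities by simple counting. The logarithm base (base 2 here), the toy data and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """I(k, y) = sum over (r, s) of p(r, s) * log2(p(r, s) / (p(r) * p(s))),
    with probabilities estimated as empirical frequencies."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    p_r = Counter(feature)
    p_s = Counter(labels)
    return sum((c / n) * math.log2((c / n) / ((p_r[r] / n) * (p_s[s] / n)))
               for (r, s), c in joint.items())

def starting_feature(X, y):
    """Index of the feature (column of X) with the largest I(k, y)."""
    columns = list(zip(*X))
    return max(range(len(columns)), key=lambda k: mutual_information(columns[k], y))

X = [[1, 1],
     [1, -1],
     [-1, 1],
     [-1, -1]]
y = [1, 1, -1, -1]
mi_first = mutual_information([row[0] for row in X], y)    # feature mirrors the labels
mi_second = mutual_information([row[1] for row in X], y)   # feature independent of labels
start = starting_feature(X, y)
```

A feature that determines the class label carries one full bit of information in this balanced two-class toy set, while a feature that is independent of the labels carries none, so the first feature is chosen as the starting point.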

PAGE 75

The lowest ranked feature is added into S. The above three-step process is used iteratively to determine the next feature. This process stops when a predetermined number of features are selected or S contains all the features. Also, with the set S, the features are ranked based on the order of addition into this set. The iterative procedure for MIFS is formally defined as follows:

2.

PAGE 76

i. Construct new class labels $\{y_1', y_2', \dots, y_n'\}$, $y_j' = 1$ if $y_j = i$, otherwise $y_j' = -1$;
ii. Train an SVM using training samples with features in RL;
iii. Compute the change of objective function DJ(k) for each candidate feature $k \in L$;
iv. Sort the sequence of DJ(k), $k \in L$, in descending order; create a corresponding ranked list of candidate features;
(b) Compute the ranking vectors for all the features in L from the C ranked lists;
(c) Sort the elements of each ranking vector in ascending order;
(d) Perform a radix sort over all ranking vectors to produce a global ranking of features in L;
(e) Find the top ranked feature e and update RL = [RL, e] and L = L - {e}.

In the above algorithm, we generate a global ranking of features based on their rankings in each binary SVM. Another "ranking scheme" can be derived based on the sum of the value that a feature brings to each classifier, i.e. $\sum_{i=1}^{C} \left( n_i \, |J_i(S \cup \{k\}) - J_i(S)| \right)$, where $J_i(S)$ is the corresponding objective function for the SVM that discriminates class i from the rest and $n_i$ is the number of samples in class i. This ranking scheme gives results comparable to the one described above. For this reason, the results concerning this scheme are not reported in the experimental section.
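Steps (b) to (e) can be sketched as below. The sketch makes two substitutions that should be read as assumptions: Python's built-in lexicographic sort on tuples stands in for the radix sort (both order the ranking vectors position by position), and ties between identical ranking vectors are broken by feature index. The function name and the toy ranked lists are hypothetical.

```python
def global_ranking(ranked_lists):
    """Combine C per-class ranked lists (best feature first) into one ranking.
    Each feature gets a ranking vector holding its rank in every list; the
    vector is sorted in ascending order and the features are then ordered
    lexicographically by their vectors."""
    features = ranked_lists[0]
    vectors = {}
    for f in features:
        ranks = [lst.index(f) + 1 for lst in ranked_lists]   # rank 1 = best
        vectors[f] = tuple(sorted(ranks))
    # Tuple comparison is lexicographic; the feature index breaks ties (assumption).
    return sorted(features, key=lambda f: (vectors[f], f)), vectors

# Three one-versus-all classifiers ranking three candidate features 0, 1, 2.
lists = [[0, 1, 2],
         [1, 2, 0],
         [2, 1, 0]]
order, vectors = global_ranking(lists)
top = order[0]      # feature to append to RL in step (e)
```

In this toy case every feature is ranked first by exactly one classifier, so the first vector entry ties and the decision falls to the second entry, where feature 1 (ranks 1, 2, 2) beats feature 2 (ranks 1, 2, 3) and feature 0 (ranks 1, 3, 3).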

PAGE 77

3 ]( 1 ]. We use a dataset consisting of 5020 CGH samples (i.e. cytogenetic imbalance profiles of tumor samples) taken from Progenetix (Table 3-1). These samples belong to 19 different histopathological cancer types that have been coded according to the ICD-O-3 system [ 22 ]. The subset with the smallest number of samples consists of 110 non-neoplastic cases, while the one with the largest number of samples, Adenocarcinoma, NOS (ICD-O 8140/3), contains 1057 cases. Each CGH sample consists of 862 ordered genomic intervals extracted from 24 chromosomes.

Testing the performance (predictive accuracy and run time) of the proposed methods requires evaluating them over datasets with different properties such as 1) the number of samples contained in the dataset, 2) the number of cancer types contained in the dataset, and 3) the similarity level between samples from different cancer types, which indicates the difficulty of classification. Currently, there are no standard benchmarks for normalized CGH data that take these three properties into account. We propose a method to select subsets from the Progenetix database in a principled manner to create datasets with desired properties. The dataset sampler accepts the following three parameters as input: 1) the approximate number of samples (denoted as $N$), 2) the number of cancer types (denoted as $C$), and 3) the similarity range (denoted as [min, max]) between samples belonging to different cancer types. An outline of the proposed dataset sampler is as follows:

1. For each cancer type, partition all the samples belonging to this cancer type into several disjoint groups using clustering. Each cluster corresponds to a different aberration pattern for the given cancer type.
2. Compute the pairwise similarity between pairs of groups obtained in the first step.
3. Construct a complete weighted graph where each vertex denotes a group of samples and the weight of an edge equals the similarity between the two groups that are connected by this edge.

PAGE 78

Figure 5-2 shows an example of how such a dataset sampler works. Consider a dataset containing 1,000 CGH samples: 400 samples belonging to cancer type $c_1$ and the other 600 samples belonging to cancer type $c_2$. Assume that each cancer type is clustered into 2 clusters. This results in 4 groups of CGH samples, which are denoted as $g_i$, $1 \le i \le 4$. Let the sizes of $g_1$, $g_2$, $g_3$ and $g_4$ be 150, 250, 450, and 150 respectively. The pairwise similarity between any two groups is shown in the figure. Using this, one can construct a weighted graph where each vertex denotes a group and the weight of each edge equals the similarity between the two groups that are connected by this edge. Suppose that a dataset needs to be sampled with $N = 400$, $C = 2$, min = 0.025 and max = 0.035. The graph can be parsed to find out that $g_2$ and $g_4$ satisfy the three conditions, and a new dataset can be sampled by combining the samples in $g_2$ and $g_4$.

We used our dataset resampling scheme to select datasets at four different similarity levels from the Progenetix dataset. We denote the similarity levels as Best, Good, Fair, and Poor. The samples in Best have the highest similarity and those in Poor have the lowest similarity. For each similarity level, we created four datasets with two, four, six, and eight cancer types respectively. Thus, in total, we have sixteen datasets. For convenience, we use the similarity level followed by the number of cancer types to denote a dataset. For example, Best6 denotes the dataset with similarity level Best (i.e., homogeneous samples) that contains six cancer types. The number of samples in each two-class dataset and multi-class dataset is around 500 and 1,000 respectively. Note that

PAGE 79

Figure 5-2. Working example of the dataset re-sampler. $c_i$ and $g_j$ denote the $i$th cancer type and the $j$th group of samples, respectively. In the first step, the samples in each cancer type are partitioned into two disjoint groups. In the second step, pairwise similarity metrics are computed. In the third step, a complete weighted graph is generated.

there are no topological relations between different datasets because we generate all datasets in separate runs. For example, a sample in best4 is not necessarily contained in best6 or best8.

The sampling of the sixteen datasets is explained as follows. Let $N$ and $C$ denote the number of samples and the number of cancer types in the resampled datasets respectively. In our experiments, we choose $N = 500$ and $C = 2$ for the two-class datasets. We choose $N = 1000$ and $C = 4$, 6 and 8 for the multi-class datasets. For each value of $C$, we sample four datasets with four different levels of similarity.

In the first step, a clustering algorithm is applied to each cancer type to partition all the samples belonging to this cancer type into several disjoint groups. Each cluster corresponds to a different aberration pattern for the given cancer type. We use the RSim clustering method for this purpose. The number of clusters for each cancer type is determined adaptively as follows. For the $i$th cancer, let $S_i$ denote the number of samples

PAGE 80

Table 5-1. Detailed specifications of benchmark datasets. The terms #cases and C denote the number of cases and the number of cancer types respectively.

Name   C  #cases  Similarity level  ICD-O-3 codes of cancer types
Best2  2     478  [0.030, 1.000]    80703, 81403
Good2  2     466  [0.018, 0.030)    80103, 81703
Fair2  2     351  [0.008, 0.018)    80103, 96733
Poor2  2     373  [0.000, 0.008)    98233, 96803
Best4  4    1160  [0.035, 1.000]    95003, 85233, 80703, 81403
Good4  4     790  [0.020, 0.035)    98363, 81703, 96803, 81403
Fair4  4     800  [0.010, 0.020)    95103, 96733, 96803, 81403
Poor4  4     800  [0.000, 0.010)    85233, 96803, 98233, 81403
Best6  6    1100  [0.030, 1.000]    81443, 95003, 81703, 85233, 80703, 81403
Good6  6     850  [0.017, 0.030)    91803, 81443, 96733, 80103, 97323, 81403
Fair6  6     880  [0.007, 0.017)    95103, 81400, 98233, 96803, 80703, 81403
Poor6  6     810  [0.000, 0.007)    85233, 98233, 80703, 81400, 81403, 96803
Best8  8    1000  [0.030, 1.000]    80103, 97323, 81400, 95003, 81703, 85233, 80703, 81403
Good8  8     830  [0.018, 0.030)    88903, 93913, 91803, 96733, 80103, 81703, 80703, 81403
Fair8  8     750  [0.006, 0.018)    95103, 80103, 97323, 81703, 96803, 98233, 80703, 81403
Poor8  8     760  [0.000, 0.006)    00000, 81400, 81703, 85233, 96803, 98233, 80703, 81403

in it. The number of clusters is computed as |S_i

In the second step, the similarities between all pairs of clusters are computed and sorted in ascending order. The 25th, 50th and 75th percentiles of the sorted similarity sequence are chosen to divide the sequence into four segments of about equal length. Each segment corresponds to a similarity level. We denote the four similarity levels as Poor, Fair, Good, and Best.

In the third step, the minimum and maximum similarity in each segment are chosen as the parameters min and max respectively. The datasets of different similarity levels are sampled by the dataset resampler accordingly. We list the detailed specifications of our datasets in Table 5-1.

It is worth noting that the actual number of cases may not equal the parameter $N$ in the resampled datasets. This is because the clustering algorithm may generate clusters

PAGE 81

We use the DAGSVM (Directed Acyclic Graph SVM) provided by the MATLAB SVM Toolbox [ 8 ] for the classification of multiclass data. All other parameters of the SVM are set to the standard values that are part of the software package and the existing literature.

The results are presented in Figure 5-3. The X-axis lists the sixteen different datasets. The Y-axis denotes the value of the average predictive accuracy in 5-fold CV. For the sixteen datasets, the Raw kernel outperforms the linear kernel in fifteen datasets (all except best8). On

PAGE 82

Figure 5-3. Comparison of classification accuracies of SVM with linear and Raw kernels. The X-axis denotes the different datasets. The Y-axis denotes the predictive accuracy based on 5-fold CV.

average, the Raw kernel improves the predictive accuracy by 6.4% over the sixteen datasets compared to the linear kernel. For the best8 dataset, the difference between Raw and linear is less than 1%. These results demonstrate that an SVM based on the Raw kernel works better for the classification of CGH data than a linear SVM.

The remaining experimental results are limited to the Raw kernel (unless stated explicitly).

15 ]. The MIQ scheme of MRMR, i.e. the divisive combination of relevance and redundancy, is used because it consistently outperforms the MID scheme. SVM-RFE is a popular wrapper method for gene selection and cancer classification. It is shown to be better than filter methods such as those based on ranking coefficients similar to Fisher's discriminant criterion. SVM-RFE is also shown to be more effective than

PAGE 83

27 ].

For each method, a 5-fold cross validation is used. In each fold, the feature selection method is applied over the training examples. Multiple sets of features with different sizes (4, 8, 16 features etc.) are selected. For each set of features, an SVM is trained on the training examples with only the selected features. The predictive accuracy of this SVM is determined using the test (set aside) examples with the same set of features. These steps are repeated for each of the 5 folds to compute the average predictive accuracy.

To test the predictive accuracy of features selected by different methods, DAGSVM with the Raw kernel is used, as it is found to be more effective than other methods. Since the SVM-RFE presented in the literature only works for two-class data, it is extended to multiclass data using the same "ranking scheme" that we use to extend MIFS (as described in Section 5.3). The linear kernel is used in SVM-RFE for feature selection purposes.

The experimental results for the multi-class datasets and the two-class datasets are shown in Table 5-2 and Table 5-3 respectively. In these tables, the predictive accuracies of features selected by three methods, MIFS, MRMR and SVM-RFE, over each dataset are compared. For each feature selection method, the results for 4, 8, 16, 40, 60, 80, 100, 150, 250 and 500 features over each dataset are presented. The results are averaged over the 5 folds and reported in columns 3 to 12. In the 13th column, the average predictive accuracies of the SVM built upon all 862 features, i.e. with no feature selection, are reported. The average predictive accuracies over the twelve datasets are reported in the last three rows. We mainly describe the key findings for the multi-class datasets in Table 5-2.

The results in Table 5-2 show that, when the number of features is less than or equal to sixteen, there is no clear winner between MIFS and MRMR. Although MIFS is slightly better than MRMR based on the average results of the twelve datasets, neither of the two methods is predominantly better than

PAGE 84

Table 5-2. Comparison of classification accuracy for three feature selection methods on multi-class datasets. The three methods are MIFS, MRMR and SVM-RFE (denoted as RFE). The average results over the twelve datasets are reported in the last three rows. The 862 column (all features, no feature selection) is reported once per dataset.

Dataset  Method  4      8      16     40     60     80     100    150    250    500    862
poor4    MIFS    0.696  0.765  0.811  0.819  0.814  0.819  0.821  0.824  0.814  0.815  0.809
         MRMR    0.734  0.772  0.778  0.794  0.791  0.799  0.814  0.814  0.819  0.802
         RFE     0.567  0.644  0.681  0.706  0.746  0.771  0.794  0.814  0.821  0.821
poor6    MIFS    0.527  0.590  0.615  0.622  0.640  0.654  0.659  0.645  0.649  0.633  0.633
         MRMR    0.542  0.576  0.588  0.589  0.581  0.596  0.61   0.596  0.610  0.635
         RFE     0.337  0.370  0.431  0.531  0.551  0.564  0.578  0.593  0.608  0.635
poor8    MIFS    0.338  0.394  0.433  0.469  0.470  0.488  0.496  0.513  0.530  0.486  0.472
         MRMR    0.335  0.408  0.454  0.467  0.469  0.482  0.470  0.474  0.489  0.465
         RFE     0.259  0.274  0.303  0.390  0.423  0.435  0.457  0.456  0.456  0.475
fair4    MIFS    0.621  0.687  0.755  0.784  0.802  0.816  0.816  0.809  0.808  0.806  0.798
         MRMR    0.598  0.685  0.728  0.777  0.796  0.789  0.784  0.777  0.783  0.786
         RFE     0.466  0.527  0.608  0.693  0.753  0.753  0.771  0.786  0.787  0.806
fair6    MIFS    0.587  0.698  0.754  0.814  0.822  0.825  0.827  0.820  0.820  0.807  0.792
         MRMR    0.593  0.698  0.767  0.772  0.786  0.807  0.802  0.807  0.801  0.804
         RFE     0.504  0.640  0.696  0.761  0.775  0.780  0.781  0.780  0.797  0.816
fair8    MIFS    0.536  0.641  0.684  0.700  0.736  0.733  0.727  0.735  0.732  0.713  0.720
         MRMR    0.540  0.653  0.681  0.721  0.707  0.712  0.715  0.704  0.698  0.695
         RFE     0.398  0.528  0.616  0.677  0.687  0.688  0.702  0.700  0.701  0.709
good4    MIFS    0.586  0.673  0.763  0.773  0.782  0.78   0.783  0.774  0.778  0.767  0.755
         MRMR    0.609  0.681  0.755  0.761  0.779  0.780  0.780  0.770  0.772  0.761
         RFE     0.543  0.610  0.656  0.711  0.718  0.740  0.732  0.735  0.767  0.749
good6    MIFS    0.455  0.551  0.593  0.645  0.709  0.716  0.724  0.697  0.700  0.694  0.696
         MRMR    0.427  0.532  0.621  0.667  0.680  0.690  0.677  0.687  0.675  0.664
         RFE     0.339  0.437  0.517  0.597  0.638  0.653  0.660  0.682  0.674  0.698
good8    MIFS    0.373  0.477  0.567  0.659  0.674  0.676  0.665  0.673  0.666  0.655  0.652
         MRMR    0.336  0.461  0.527  0.615  0.634  0.647  0.644  0.646  0.649  0.661
         RFE     0.258  0.346  0.424  0.508  0.530  0.581  0.605  0.624  0.632  0.654
best4    MIFS    0.650  0.754  0.763  0.817  0.829  0.832  0.829  0.821  0.838  0.820  0.803
         MRMR    0.667  0.757  0.775  0.785  0.789  0.793  0.798  0.791  0.784  0.802
         RFE     0.596  0.659  0.708  0.753  0.766  0.789  0.776  0.791  0.803  0.817
best6    MIFS    0.497  0.568  0.699  0.731  0.767  0.765  0.763  0.770  0.750  0.755  0.750
         MRMR    0.497  0.568  0.688  0.730  0.731  0.725  0.746  0.739  0.748  0.740
         RFE     0.449  0.499  0.587  0.667  0.710  0.712  0.727  0.729  0.736  0.749
best8    MIFS    0.427  0.543  0.635  0.726  0.737  0.733  0.735  0.732  0.735  0.727  0.707
         MRMR    0.434  0.563  0.652  0.704  0.700  0.714  0.712  0.700  0.693  0.704
         RFE     0.342  0.429  0.532  0.641  0.648  0.687  0.694  0.723  0.719  0.724
Avg      MIFS    0.524  0.612  0.673  0.713  0.732  0.736  0.737  0.734  0.735  0.723  0.716
         MRMR    0.518  0.606  0.664  0.696  0.702  0.709  0.710  0.707  0.707  0.706
         RFE     0.422  0.497  0.563  0.636  0.662  0.679  0.69   0.700  0.708  0.721

PAGE 85

The results in Table 5-2 show that MIFS outperforms SVM-RFE in almost all cases. Clearly, as the number of features increases, the gap between MIFS and SVM-RFE narrows. They become comparable in terms of predictive accuracy when the number of features reaches several hundred (we do not report these results due to space limitations). We believe that a forward scheme is better because it first adds the most discriminating features, followed by features that are not discriminating individually but improve the classification accuracy when used in combination with the discriminating features. On the other hand, a backward elimination scheme (RFE) often selects "redundant" features but excludes complementary features that individually do not discriminate the data well. This is exemplified by a simple classification problem with three features $k_1$, $k_1$ and $k_2$, where $k_1$ appears twice, i.e. one feature is duplicated. Feature $k_1$ works much better than $k_2$ in discriminating the data. Assume that we want to select two features. A typical RFE scheme will first remove $k_2$ because it influences the objective function least. The two selected features would be $k_1$ and $k_1$. On the other hand, the proposed forward selection scheme (FS) will choose $k_1$ followed by $k_2$, because choosing another $k_1$ does not change the objective function at all. Therefore, the two features selected by the FS scheme lead to a better predictive accuracy than those selected by the RFE scheme.

PAGE 86

Table 5-3. Comparison of classification accuracy for three feature selection methods on two-class datasets. The three methods are MIFS, MRMR and SVM-RFE (denoted as RFE). The average results over the four datasets are reported in the last three rows. The 862 column (all features, no feature selection) is reported once per dataset.

Dataset  Method  4      8      16     40     60     80     100    150    250    500    862
poor2    MIFS    0.807  0.920  0.920  0.914  0.923  0.909  0.904  0.904  0.906  0.909  0.914
         MRMR    0.791  0.885  0.925  0.901  0.922  0.914  0.909  0.920  0.925  0.908
         RFE     0.775  0.775  0.853  0.917  0.914  0.914  0.914  0.906  0.898  0.904
fair2    MIFS    0.744  0.775  0.829  0.858  0.875  0.877  0.869  0.872  0.875  0.858  0.849
         MRMR    0.741  0.783  0.835  0.852  0.846  0.843  0.843  0.849  0.846  0.837
         RFE     0.675  0.749  0.772  0.815  0.823  0.815  0.818  0.818  0.872  0.846
good2    MIFS    0.818  0.798  0.807  0.822  0.822  0.835  0.837  0.837  0.833  0.822  0.832
         MRMR    0.798  0.815  0.815  0.813  0.813  0.820  0.807  0.803  0.800  0.809
         RFE     0.758  0.781  0.818  0.807  0.806  0.824  0.820  0.815  0.811  0.834
best2    MIFS    0.854  0.864  0.875  0.875  0.875  0.870  0.868  0.870  0.862  0.872  0.875
         MRMR    0.852  0.864  0.872  0.879  0.881  0.885  0.866  0.866  0.858  0.885
         RFE     0.793  0.812  0.841  0.852  0.852  0.835  0.847  0.841  0.860  0.875
Avg      MIFS    0.806  0.839  0.858  0.867  0.873  0.873  0.869  0.871  0.869  0.865  0.868
         MRMR    0.795  0.837  0.862  0.861  0.866  0.866  0.856  0.859  0.857  0.860
         RFE     0.750  0.779  0.821  0.848  0.849  0.847  0.850  0.845  0.860  0.864

The results in Table 5-3 show that, unlike the results in Table 5-2, although MIFS is slightly better than MRMR based on the average results of the four datasets, there is no clear winner that beats the other on every one of the four datasets. This may indicate that MIFS and MRMR are comparable in terms of classification accuracy for two-class datasets. The results also show that MIFS outperforms SVM-RFE in most cases when the number of features is less than 250. As the number of features increases, the gap between MIFS and SVM-RFE narrows. They become comparable when the number of features reaches 250. This is consistent with our conclusion on the multi-class datasets.

The results in Table 5-2 and Table 5-3 show that using 40 features results in classification accuracy that is comparable to using all the features. Also, using 80 features derived from the MIFS scheme results in comparable or better classification accuracy than using all the features. This is significant because, beyond data reduction, the proposed scheme can lead to better classification. To support this hypothesis, we generated four new datasets using our dataset resampler. The

PAGE 87

Table 5-4. Comparison of classification accuracy using different numbers of features.

         Number of features
Dataset  40     80     862
newds1   0.801  0.792  0.799
newds2   0.803  0.819  0.800
newds3   0.629  0.670  0.637
newds4   0.706  0.748  0.719
Average  0.735  0.757  0.739

resulting four datasets (newds1 to newds4) contain 4, 5, 6 and 8 classes respectively. The numbers of samples in the four datasets are 508, 1021, 815 and 649. We applied the MIFS method over these datasets. We compare the classification accuracies obtained by using all 862 features to those obtained using only 40 and 80 selected features. The results are shown in Table 5-4. These results substantiate our hypothesis that using around 40 features (roughly 5% of all features) can generate accuracy comparable to using all the features. Also, using around 80 features (roughly 10% of all the features) can result in comparable or better prediction than using all 862 features.

It is worth noting that the other two methods typically have lower or comparable accuracy when a smaller number of features is used.

PAGE 88

An important property of CGH data is that neighboring features are usually highly correlated, as a point-like genomic aberration can expand to the neighboring intervals. Due to the difference in the training examples, these highly correlated features can be alternatively selected in different folds. For example, assume that two sets of features are selected in two folds. The 53rd and 54th features are only selected in the first and second set respectively. Although these two features are different, they should be considered matching because the 53rd and 54th features are highly correlated and represent the same aberration pattern of interest.

We first define the correlation between two features. Let $\{x_1, x_2, \ldots, x_n\}$ be a set of $n$ CGH samples. Let $x_i^d$ denote the value (1, -1 or 0 for gain, loss or no aberration respectively) of the $i$th sample at the $d$th feature, $\forall d, 1 \le d \le D$, where $D$ is the number of genomic intervals. The number of samples that have aberrations at the $d$th feature can be computed as $B(d) = \sum_{i=1}^{n} |x_i^d|$. In principle, the correlation between neighboring features is caused by contiguous runs of gain or loss status. We use the term segment to represent a contiguous run of aberrations of the same type in a sample. Intuitively, two features are highly correlated when a large number of segments intersect with both features. Let $x_i[u, v]$ denote a segment of $x_i$ that starts at the $u$th interval and ends at the $v$th interval, i.e. $\{x_i^u, \ldots, x_i^v\}$, such that $x_i^u = x_i^{u+1} = \cdots = x_i^v \ne 0$ and $x_i^{u-1} \ne x_i^u$. Let $k$ and $k'$, $1 \le k, k' \le D$, denote two features. We define $C_i(k, k') = 1$ if there exists a segment $x_i[u, v]$ in the $i$th sample that intersects with both $k$ and $k'$, i.e. $u \le k, k' \le v$. The correlation coefficient between $k$ and $k'$ is computed as $Cor(k, k') = \sum_{i=1}^{n} C_i(k, k')$
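The segment-based co-occurrence count $\sum_{i=1}^{n} C_i(k, k')$ can be sketched as below. The sketch makes several assumptions: samples are plain lists over {1, -1, 0}, indices are 0-based (the text counts intervals from 1), and chromosome boundaries are ignored. Any normalization applied to this sum to obtain $Cor(k, k')$ is not reproduced here, so only the raw count is returned. The names `segments` and `co_occurrence` are hypothetical.

```python
def segments(sample):
    """Return (start, end, value) for every maximal run of equal non-zero values."""
    runs, start = [], None
    for d, value in enumerate(sample):
        # Close the current run when the aberration status changes.
        if start is not None and value != sample[start]:
            runs.append((start, d - 1, sample[start]))
            start = None
        # Open a new run at the first interval of a gain or loss.
        if start is None and value != 0:
            start = d
    if start is not None:
        runs.append((start, len(sample) - 1, sample[start]))
    return runs


def co_occurrence(samples, k, k2):
    """Count the samples in which a single segment covers both features k and k2."""
    lo, hi = min(k, k2), max(k, k2)
    return sum(
        any(u <= lo and hi <= v for u, v, _ in segments(x))
        for x in samples
    )
```

Two adjacent features that fall inside the same gain or loss run in most samples obtain a high count, which is the situation described for the 53rd and 54th features.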

PAGE 89

Next, we explain how to evaluate the consistency of two sets of features. Let $K = \{k_1, \ldots, k_r\}$ and $K' = \{k'_1, \ldots, k'_r\}$ denote the two sets of $r$ features. We create a bipartite graph as follows. For each feature in the two sets, a vertex is added to the graph. For each feature $k_i$, $1 \le i \le r$, if there exists a feature $k'_j$ such that $k'_j$ is highly correlated to $k_i$, i.e. $Cor(k_i, k'_j)$ reaches a given threshold, an edge connecting the corresponding vertices of $k_i$ and $k'_j$ is added. Let $V_1$ and $V_2$ denote the sets of vertices corresponding to $K$ and $K'$ respectively. It can be seen that every edge in the graph connects a vertex in $V_1$ and one in $V_2$. Therefore, the resulting graph is a bipartite graph. A maximum matching $M$ found in this graph is a set of edges that identify pairs of features (or highly correlated features) selected in both sets. We score the consistency between $K$ and $K'$ as $T(K, K') = |M| / r$.

For multiple sets of features $\{K_1, \ldots, K_f\}$, the PMM measure is computed as the average score over all pairs of feature sets: $PMM = \frac{2 \sum_{i=1}^{f} \sum_{j=i+1}^{f} T(K_i, K_j)}{f(f-1)}$, where $K_i$ and $K_j$ denote the $i$th and $j$th feature sets respectively.

We use the above approach to evaluate the consistency of features selected by three methods (MIFS, MRMR and SVM-RFE) for the twelve multi-class datasets. For each dataset, each method selects five sets of features because a 5-fold cross validation is used. The PMM scores of the five sets of features selected by different methods on different datasets are reported in Table 5-5. The number of features is specified as 20, 50 and 100. The threshold parameter is set to 0.8.

To show the significance of the PMM scores, a random test is performed as follows. For each dataset, five sets of features are randomly selected and the PMM score is

PAGE 90

5-5. The results show that both MRMR and MIFS outperform SVM-RFE considerably in terms of PMM scores. The PMM scores of MRMR are often slightly better than those of MIFS. Also, the PMM scores of both MRMR and MIFS are significantly greater than the ninety-ninth percentile of the random scores. This indicates that the features selected by MRMR and MIFS in multiple folds are significantly consistent. On the other hand, the scores of SVM-RFE are often within the range between the first and the ninety-ninth percentile of the random scores. This indicates that SVM-RFE performs poorly at consistently selecting features in multiple folds. It is worth noting that the gap between MIFS or MRMR and the random approach decreases as the number of features increases. This is because the more features are selected, the larger the chance of finding a pair of matching features in two random sets. As a result, the PMM scores of the random approach increase too. Since the results show that using about 10% of all features already provides a good classification performance, we limit the comparison of PMM scores to small numbers of features (less than or equal to 100).

In this chapter, we developed novel SVM-based methods for classification and feature selection of CGH data. For classification, we developed a novel similarity kernel that is shown to be more effective than the standard linear kernel used in SVM. For feature selection, we proposed a novel method based on the new kernel that iteratively selects the features that provide the maximum benefit for classification. We compared our methods

PAGE 91

Table 5-5. Comparison of PMM scores of three feature selection methods. The terms r, 99% and 1% denote the number of selected features, the ninety-ninth percentile and the first percentile respectively.

                                        Random
Dataset  r    SVMRFE  MRMR  MIFS  mean  99%   1%
best4    20   0.32    0.87  0.68  0.29  0.41  0.19
         50   0.51    0.91  0.82  0.49  0.57  0.42
         100  0.65    0.91  0.86  0.64  0.69  0.58
best6    20   0.45    0.94  0.83  0.31  0.43  0.21
         50   0.58    0.91  0.84  0.52  0.6   0.44
         100  0.68    0.91  0.88  0.66  0.72  0.60
best8    20   0.41    0.93  0.71  0.31  0.43  0.20
         50   0.49    0.91  0.85  0.51  0.6   0.41
         100  0.67    0.90  0.87  0.65  0.72  0.60
fair4    20   0.28    0.82  0.83  0.22  0.31  0.13
         50   0.44    0.86  0.76  0.40  0.47  0.33
         100  0.54    0.88  0.81  0.57  0.62  0.52
fair6    20   0.46    0.86  0.81  0.25  0.37  0.16
         50   0.54    0.87  0.81  0.44  0.54  0.35
         100  0.63    0.91  0.83  0.59  0.65  0.54
fair8    20   0.47    0.82  0.77  0.25  0.36  0.14
         50   0.53    0.85  0.80  0.45  0.52  0.36
         100  0.68    0.88  0.84  0.60  0.67  0.53
good4    20   0.42    0.89  0.83  0.29  0.40  0.18
         50   0.54    0.87  0.86  0.49  0.57  0.41
         100  0.61    0.89  0.87  0.64  0.70  0.59
good6    20   0.34    0.84  0.76  0.31  0.41  0.20
         50   0.52    0.89  0.75  0.51  0.59  0.42
         100  0.68    0.90  0.88  0.65  0.71  0.58
good8    20   0.35    0.74  0.78  0.27  0.39  0.18
         50   0.51    0.86  0.80  0.47  0.55  0.38
         100  0.66    0.87  0.85  0.63  0.68  0.57
poor4    20   0.36    0.75  0.73  0.20  0.33  0.11
         50   0.46    0.85  0.74  0.38  0.45  0.31
         100  0.56    0.87  0.82  0.54  0.60  0.49
poor6    20   0.25    0.79  0.62  0.19  0.29  0.11
         50   0.44    0.83  0.68  0.51  0.45  0.29
         100  0.58    0.85  0.81  0.52  0.57  0.47
poor8    20   0.33    0.74  0.68  0.22  0.33  0.13
         50   0.46    0.80  0.74  0.40  0.49  0.34
         100  0.60    0.83  0.80  0.56  0.62  0.51
Average  20   0.37    0.83  0.75  0.26  0.37  0.16
         50   0.50    0.87  0.79  0.45  0.54  0.37
         100  0.63    0.88  0.84  0.61  0.66  0.55
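The PMM scores in Table 5-5 rest on a maximum matching in a bipartite graph of highly correlated features. The following sketch shows one way to compute such a score. It assumes that each pairwise score is the matching size divided by $r$ (so that scores lie in [0, 1], as in the table), that all feature sets have the same size, and that the caller supplies a predicate `correlated(a, b)` standing in for the thresholded $Cor$ test. The matching uses the classical augmenting-path method, and all names are hypothetical.

```python
from itertools import combinations


def max_matching(left, right, correlated):
    """Size of a maximum bipartite matching between two feature lists."""
    match_of_right = {}  # right-hand feature -> left-hand feature matched to it

    def try_assign(u, seen):
        # Try to match u, re-routing earlier matches along an augmenting path.
        for v in right:
            if correlated(u, v) and v not in seen:
                seen.add(v)
                if v not in match_of_right or try_assign(match_of_right[v], seen):
                    match_of_right[v] = u
                    return True
        return False

    return sum(try_assign(u, set()) for u in left)


def pmm(feature_sets, correlated):
    """Average pairwise matching score over all pairs of feature sets."""
    r = len(feature_sets[0])
    pairs = list(combinations(feature_sets, 2))
    return sum(max_matching(a, b, correlated) / r for a, b in pairs) / len(pairs)
```

As a toy stand-in for the correlation test, `lambda a, b: abs(a - b) <= 1` treats features at most one interval apart as matching, which reproduces the 53rd/54th feature situation from the text. Real use would replace it with the thresholded correlation.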

PAGE 92


PAGE 93

Cancer is classified into multiple histological types, each of which consists of multiple subtypes. Genomic aberrations may differ between histologically identical tumors (e.g. gastro-esophageal, depending on location [ 76 ]), different histological subtypes have different changes (e.g. in renal cell carcinomas [ 35 ]), and different patterns may appear in the same histologic subtype (and may point towards different mechanisms; e.g. complex re-arrangements vs. whole-chromosome gains/losses [ 52 ]). The tumor evolution process leaves characteristic signatures of inheritance along the pathways of progression, and models of tumor progression can be inferred by identifying these signatures in genome-wide mutation data [ 6 ]. Evidence has shown that patterns of recurrent Copy Number Alterations (CNAs) are observed for a broad range of cancers or subtypes of the same cancer.

To our knowledge, most existing works infer tumor progression models based on genetic events such as recurrent Copy Number Alterations (CNAs). Their models describe the evolutionary relationship between events and consequently expose the progression and development of tumors. However, most existing works focus on the progression of individual recurrent alterations. This approach leads to very complex models when multiple cancers are concerned, given that each cancer contains a set of recurrent alterations. A promising approach is to consider the whole set of alterations of a cancer and infer a model based on the alteration patterns of different cancers. Such models effectively utilize the molecular characteristics of cancers and extend easily to large-scale analysis. In this chapter, we develop novel graph-based computational methods that derive relationships within a histological type or between histological subtypes.

PAGE 94

In Section 6.1.1, we review the concept of markers that define the key recurrent CNAs in a cancer. In Section 6.1.2, we introduce an approach proposed by Bilke et al., which is extended later for inferring the progression of markers. In Section 6.1.3, we describe a known tree fitting problem that infers phylogenetic models for cancers.

42 ], recurrent alterations usually accumulate together and form a region of recurrent alterations, which we call a recurrent region. Given a set of samples that belong to the same cancer, a marker is an independent key recurrent alteration representing a recurrent region. We proposed a dynamic programming algorithm to identify the best $R$ markers for a set of CGH cases. We demonstrated that our markers capture the aberration patterns well and improve the clustering of CGH cases [ 42 ].

Next, we briefly introduce some notation for markers. Each marker $m$ in a cancer is represented by two numbers $\langle p, q \rangle$, where $p$ and $q$ denote the position and the aberration type respectively. The aberration type of a marker is either gain or loss, denoted by 1 or -1 respectively. Let $S$ be a set of $N$ CGH cases $\{s_1, s_2, \ldots, s_N\}$. Let $x_j^d$ denote the alteration value (1, -1 or 0 for gain, loss or no aberration respectively) for case $j$ at the $d$th feature, $\forall d, 1 \le d \le D$, where $D$ is the number of genomic intervals. We use the term segment to represent a contiguous run of aberrations of the same type in a case. Let $s_j[u, v]$ be the segment of $s_j$ that starts at the $u$th interval and ends at the $v$th interval. Formally, $s_j[u, v]$ denotes a contiguous run of intervals $\{x_j^u, \ldots, x_j^v\}$, for $c_j^s \le u \le v \le c_j^e$, where $x_j^u = x_j^{u+1} = \cdots = x_j^v \ne 0$, $x_j^{u-1} \ne x_j^u$, $x_j^{v+1} \ne x_j^v$, and $c_j^s$ and $c_j^e$ denote the starting and ending intervals of a chromosome in $s_j$ respectively.

Let $m = \langle p, q \rangle$ be a marker. We denote the independent support of $s_j$ to $m$ as $(s_j, m)$. Here, $(s_j, m) = 1$ if and only if $x_j^p = q$. Otherwise, $(s_j, m) = 0$. We define the

PAGE 95

42 ].

6 ]. They described the relationship between different subtypes based on the recurrent alterations shared by these subtypes. Their approach first identified a set of recurrent alterations. Each recurrent alteration belonged to one of the following three categories: common (shared by all the subtypes), shared (shared by two or more subtypes) and unique (distinct to only one subtype). They proposed a statistical model to identify recurrent alterations and compute the shared status of these alterations. Each shared status was the set of subtypes that contain this recurrent alteration.

The shared status of recurrent alterations can be described using a Venn diagram. For example, Figure 6-1 shows two Venn diagrams of two sets, represented by two overlapping circles. Let $S_1$ and $S_2$ denote the left and right circle respectively. There are three distinct areas (denoted as sections) marked by A, B and C in each Venn diagram. Each section represents a possible logical relationship between the two sets. For example, sections A and C represent $S_1 - S_2$ and $S_1 \cap S_2$ respectively. A section is called non-empty if it contains some members. Each non-empty section is marked by a distinct color in Figure 6-1. The components of a non-empty section are defined to be the sets whose members are contained in this section. For example, the components of sections A and C are $\{S_1\}$ and $\{S_1, S_2\}$ respectively. In general, the number of distinct sections $S$ in a Venn diagram of $K$ sets is given by $S = \sum_{n=1}^{K} \frac{K!}{(K-n)!\,n!}$, which is also the number of different shared statuses of a recurrent alteration between $K$ cancer subtypes. Since each section can be empty or non-empty, there are in total $2^S$ distinct Venn diagrams for $K$ sets.
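The section count can be checked numerically. Since each term $\frac{K!}{(K-n)!\,n!}$ is the binomial coefficient $\binom{K}{n}$, the sum equals $2^K - 1$, the number of non-empty subsets of $K$ subtypes. The short sketch below (function names are hypothetical) computes $S$ and enumerates the possible shared statuses.

```python
from itertools import combinations
from math import comb


def venn_sections(K):
    """Number of distinct sections in a Venn diagram of K sets."""
    return sum(comb(K, n) for n in range(1, K + 1))


def shared_statuses(subtypes):
    """Enumerate every possible shared status, i.e. every non-empty subset of subtypes."""
    return [
        frozenset(group)
        for n in range(1, len(subtypes) + 1)
        for group in combinations(subtypes, n)
    ]
```

For $K = 2$ this gives the three sections A, B and C of Figure 6-1 and $2^3 = 8$ possible Venn diagrams. The exponential growth of $S$ with $K$ is the reason the vertex count is later bounded by the number of recurrent alterations as well.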

PAGE 96

1. All alterations found in a parent genotype must be present in the offspring with a similar frequency. The daughter generation acquires additional alterations.
2. Unobservable intermediate genotypes are possible, but the model with the smallest number of genotypes is utilized.
3. All genotypes arise from a common ancestor (i.e. the model has a root).

The resulting graph is a directed acyclic graph with each vertex corresponding to a non-empty section in the Venn diagram. An edge connects a vertex $u$ to a vertex $v$ if (1) the set of cancer subtypes that contain the recurrent alterations of $u$ is a superset of that of $v$ and (2) there is no other vertex $w$ such that the set of cancer subtypes that contain the recurrent alterations of $w$ is a superset of that of $v$ and a subset of that of $u$. The number of vertices in the resulting graph is bounded by $\min\{S, T\}$, where $T$ is the number of recurrent alterations. For example, the graph models corresponding to the two Venn diagrams in Figure 6-1 are shown on the right of the figure.

The authors demonstrate that, with the help of such a model, it is possible to identify traces of tumor progression in CGH data. However, their approach has several limitations.
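The edge rule of the graph model described above can be written down directly. The sketch below is an illustration, not the authors' implementation: it assumes that each vertex is represented by the set of subtypes sharing its alterations (a `frozenset`), that "superset" means proper superset, and that the name `progression_edges` is hypothetical.

```python
def progression_edges(sections):
    """Return the edges (u, v) of the graph model for the given non-empty sections.

    sections: iterable of frozensets, each the set of cancer subtypes that
    contain the recurrent alterations of one non-empty Venn section.
    """
    nodes = list(sections)
    edges = []
    for u in nodes:
        for v in nodes:
            # Condition (1): u's subtype set strictly contains v's.
            # Condition (2): no vertex w lies strictly between u and v.
            if u > v and not any(u > w > v for w in nodes):
                edges.append((u, v))
    return edges


# Example (a) of Figure 6-1: sections A = {S1}, B = {S2}, C = {S1, S2}.
A, B, C = frozenset({"S1"}), frozenset({"S2"}), frozenset({"S1", "S2"})
edges_a = progression_edges([A, B, C])
# Example (b): section A is empty, so only C and B remain.
edges_b = progression_edges([B, C])
```

The double loop examines every ordered pair of vertices and, for each, scans all vertices for an intermediate $w$, so this naive version is cubic in the number of vertices.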

PAGE 97

Figure 6-1. Examples of Venn diagrams and the corresponding graph models. Each Venn diagram (left) consists of two sets. The corresponding graph models are shown on the right. The three sections in the Venn diagram are denoted as A, B and C respectively. The main difference between examples (a) and (b) is that, in example (b), section A is empty, i.e. it contains no members. Therefore, the corresponding graph model of (b) consists of only two vertices, C and B.

In addition to these limitations, Bilke et al. do not provide a systematic algorithm for mapping the Venn diagram to the graph model automatically. These limitations make it impractical to use their method for large-scale datasets composed of many cancers.

39 ].

PAGE 98

Each leaf-level node of $T$ corresponds to a sample in the set $L$. Also, there is a node at the leaf level of $T$ for each sample in $L$. In other words, there is a bijection between the leaf-level nodes and the samples in $L$. The rest of the nodes in $V$ are the internal nodes of the tree. Each edge in $E$ is assigned a positive real number that denotes the weight of that edge. This is also termed the length of the edge in the literature. For any pair of leaf nodes $i, j \in V$, define $P_{ij}$ as the path in $T$ between $i$ and $j$. The length of a path is the sum of the weights of the edges on that path. We create a new distance matrix $D'$, where each entry $D'(i, j)$ is the length of the path between $i$ and $j$. The distance matrix method aims at the following: given a distance matrix $D$, find a tree $T$ such that $D'$ is a good approximation to $D$.

The tree fitting problem has been widely studied in molecular phylogenetics. Some of the leading distance matrix methods for tree construction include the unweighted pair group method with arithmetic mean (UPGMA) [ 39 ] and Neighbor Joining [ 39 ].

6 ] to infer progression models for markers of multiple cancer types. Markers are the independent key recurrent alterations that characterize the aberration pattern of a cancer type (Figure 4-2). Studies of the evolution of markers would be of obvious value for defining gene loci relevant to the early diagnosis or treatment of cancer. They help to answer questions such as which markers tend to occur in many cancers, which markers are likely to occur together, etc. The main difference between

PAGE 99

We compute the shared status of markers as follows. A marker identified in one cancer represents a recurrent alteration region in this cancer. However, two or more cancers containing the same recurrent region may not have markers identified at the same position due to the noise in the aberration patterns. Therefore, markers in different cancers representing the same recurrent region should be considered shared by these cancers.

First, we define the correlation between a marker and its neighboring intervals. Let $C$ denote a set of cases belonging to the same cancer. Let $m = \langle p, q \rangle$ and $d$, $1 \le d \le D$, denote a marker in $C$ and a genomic interval respectively. For each case $s_j \in C$, we define $E_j(d, m) = 1$ if there exists a segment $s_j[u, v]$ overlapping with both intervals $d$ and $p$, i.e. $u \le d, p \le v$ and $x_j^u = q$; otherwise, $E_j(d, m) = 0$. The function $E_j(d, m)$ indicates that the alterations at $d$ and $p$ belong to the same segment in $s_j$ and can be caused by the same point-like genomic alteration. We compute the correlation between $d$ and $m$ as $Cor(d, m) = \sum_{j=1}^{|C|} E_j(d, m)$

Next, we define that a marker $m = \langle p, q \rangle$ in cancer $C_i$ is shared by $C_j$ if and only if the following condition holds: there is a marker $m' = \langle p', q' \rangle$ in cancer $C_j$ such that $q' = q$ and $Cor(p, m')$ exceeds a user-defined threshold. The larger the value of the threshold, the harder it is for a marker to be shared among multiple cancers. Intuitively, this definition indicates that a marker $m_i$ in $C_i$ is shared by another cancer $C_j$ if and only if there exists a marker $m_j$ in $C_j$ such that $m_j$ would be highly correlated with $m_i$ if $m_i$ were also a marker in $C_j$. To compute the shared status of a marker in $C_i$, we visit every cancer other than $C_i$. This makes the time complexity linear in the number of cancers $K$. We denote

PAGE 100

We propose an algorithm that generates a progression model for $K$ cancers based on markers. Our algorithm consists of three steps:

$(K-n)!\,n!$ distinct sections in this Venn diagram. Given a marker $m$ with shared status $S(m)$, the section corresponding to $S(m)$ is non-empty. We mark all the non-empty sections in the Venn diagram based on the shared status of all markers. We then convert the Venn diagram to a graph model as follows. We create a vertex $V$ for each non-empty section and associate it with the markers whose shared status corresponds to this section. We define the height of this vertex, denoted as $H(V)$, as the number of components in the corresponding section. We visit the vertices in descending order of their heights. For each pair of vertices $V_i$ and $V_j$, $H(V_i)$
PAGE 101

42 ], where $D$ and $N$ denote the number of genomic intervals and the number of cases of all $K$ cancers respectively. The time complexity of the second step is $O(TNR)$, where $T$ is the cardinality of the set consisting of the union of all markers. In the third step, the number of vertices is bounded by $\min\{S, T\}$. Since $T \le KR$, the time complexity of this step is $O(K^2 R^2)$ in the worst case. Since we have $D \ge T$, the overall time complexity is $O(DNR) + O(K^2 R^2)$. In general, we have $D \ge R$ and $N \ge K^2$, so the overall time complexity can be written as $O(DNR)$.

The graph created by our algorithm can be used to describe the hierarchical or evolutionary relationship between markers representing multiple stages within a single cancer type or among the markers of different cancer types. We term a node a root node if it does not have any incoming edges. The nodes that are close to a root (there can be multiple roots) denote the aberrations that started in earlier stages. From this perspective, markers are not equally important. The markers that are parents of other markers in the hierarchical representation are common to multiple cancers. Thus, differences at parent marker positions should contribute more to the distance between different cancers than differences at child markers.

A phylogenetic tree is a simple and efficient model for inferring the evolutionary relationship among multiple cancers. A key challenge in using existing distance matrix methods for tree construction is to find a biologically meaningful distance function between

PAGE 102

We say that a pair of markers $m_{i,k}$ and $m_{j,r}$ are overlapping if they satisfy either one of the following two conditions:

1. Both markers appear at the same interval and have the same type, i.e. $p_{i,k} = p_{j,r}$ and $q_{i,k} = q_{j,r}$.
2. Both markers represent the same region of recurrent alterations, i.e. both $Cor(p_{i,k}, m_{j,r})$ and $Cor(p_{j,r}, m_{i,k})$ exceed a user-defined threshold.

In Section 6.2, we argue that markers are not equally important in the progression of cancers. A marker that is common to many cancers usually represents a fundamental characteristic of cancers. Therefore, we assume that markers shared by many cancers are more important than those shared by a few cancers. The intuition behind this reasoning can be explained as follows. A marker that triggers most of the cancers has, with high likelihood, survived the evolution of cancer progression. The markers that are cancer specific have most likely appeared later in the evolutionary history and created the underlying cancer alteration pattern. As a result, a deviation in the genomic alterations corresponding to older markers corresponds to a larger distance between two cancer types, and this distance grows as the age of the genomic alteration increases. We incorporate this idea into the mapping process. We

PAGE 103

The mapping process works as follows. Each time, we pick a pair of markers from $M_i$ and $M_j$. We add a pair of new dimensions to $\hat{M}_i$ and $\hat{M}_j$, respectively. The values of the added dimensions are determined by three attributes of markers: support, weight and type. Let $\phi(m_{i,k}) = \mathrm{Supt}(m_{i,k})\, w_{i,k}\, q_{i,k}$. If the two markers are overlapping, the values added into $\hat{M}_i$ and $\hat{M}_j$ are $\phi(m_{i,k})$ and $\phi(m_{j,r})$, respectively. If the two markers are not overlapping, we focus on the marker at the smaller genomic interval. Without loss of generality, we can assume that this marker is $m_{i,k}$, and we let $m'$ in $C_j$ denote the corresponding "hypothetical" marker. We have $p' = p_{i,k}$, $q' = q_{i,k}$ and $w' = w_{i,k}$. Please note that $\mathrm{Supt}(m')$ depends on the alteration pattern in $C_j$ and may not be equal to $\mathrm{Supt}(m_{i,k})$. We add the two values, $\phi(m_{i,k})$ and $\phi(m')$, into $\hat{M}_i$ and $\hat{M}_j$, respectively. Next, we choose another pair of markers and repeat the above procedure until all the markers have been processed.

The algorithm of the mapping process of two sets of markers is implemented as follows.

PAGE 104

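$\hat{M}_i = [\hat{M}_i, \phi(m_{i,k})]$; $\hat{M}_j = [\hat{M}_j, \phi(m_{j,r})]$; $k = k + 1$; $r = r + 1$;

We use the Extended Jaccard coefficient [79] to compute the similarity between the two vectors. The Extended Jaccard coefficient is widely used as a similarity measure in vector spaces. It retains the sparsity property of the cosine similarity while allowing discrimination of collinear vectors. For example, given two vectors $\hat{M}_i = [0.1, 0.3]$ and $\hat{M}_j = [0.2, 0.6]$, the cosine similarity does not discriminate between them, and the similarity value is computed as 1. However, in our case, $\hat{M}_i$ and $\hat{M}_j$ are different because they denote recurrent alterations in $C_i$ and $C_j$ with different frequencies. The Extended Jaccard coefficient is computed as follows.

$$EJ(\hat{M}_i, \hat{M}_j) = \frac{\hat{M}_i \cdot \hat{M}_j}{\|\hat{M}_i\|^2 + \|\hat{M}_j\|^2 - \hat{M}_i \cdot \hat{M}_j}$$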
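The difference between the two similarity measures on the collinear vectors from the example can be checked with a few lines of Python. This is a minimal sketch, assuming the standard definition of the Extended Jaccard (Tanimoto) coefficient for real-valued vectors; the function names are illustrative and are not taken from the thesis.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (undefined for zero vectors)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def extended_jaccard(x, y):
    """Extended Jaccard coefficient: x.y / (|x|^2 + |y|^2 - x.y).

    Undefined when both vectors are all zeros.
    """
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


# The two collinear vectors used in the text.
m_i = [0.1, 0.3]
m_j = [0.2, 0.6]

print(round(cosine_similarity(m_i, m_j), 6))  # collinear vectors: cosine is 1
print(round(extended_jaccard(m_i, m_j), 6))   # 0.2 / (0.1 + 0.4 - 0.2) = 2/3
```

The cosine similarity depends only on the direction of the vectors, so it cannot tell $[0.1, 0.3]$ from $[0.2, 0.6]$, whereas the Extended Jaccard coefficient equals 1 only when the two vectors are identical.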

PAGE 105

Dataset: With 15127 cases from 571 publications as of Dec 2007, Progenetix is the largest resource for published chromosomal CGH data [3] ( 2]. From the biomedical perspective, this dataset could be divided into 22 clinico-pathological disease categories. Additional entities consisting of fewer than 40 cases each were summarily moved to an 'other' category.

As a result of the Progenetix database format transformation, for each case the genomic imbalance status for 862 ordered intervals had been extracted from the karyotype annotation. This information represents the whole-genome copy number status, in the maximum resolution feasible for cytogenetic methods. The value of each interval is 1, -1 or 0, indicating gain, loss and no change, respectively. The target dataset can be represented as a 2-dimensional matrix with 5918 rows and 862 columns representing the imbalance status for each genomic interval. Additional columns contain clinical information categories.

Although these cases are important for the evaluation of overall genomic instability, due to our focus on aberration patterns, 875 cases without any CNAs were deemed non-informative for our purposes and removed prior to further analysis. Also, the categories 'cholangio' and 'squamous skin' were removed due to the limited number of informative cases (11 and 15, respectively). We also excluded cases sub-summarized in

PAGE 106

the 'other' category (386 cases). The remaining 20 entities with 4631 cases are used for analysis in this paper. The details of the dataset are shown in Table 6-1.

Table 6-1. Name and number of cases of each cancer in the dataset.

Diagnosis                                      No. of cases
head-neck squamous cell carcinoma (HNSCC)      309
non-small cell lung carcinoma (NSCLC)          242
small cell lung carcinoma (SCLC)               63
bladder carcinoma                              140
breast carcinoma                               640
cervical carcinoma                             210
colorectal adenocarcinoma (CRC)                392
esophagus carcinoma (ES)                       206
gastric carcinoma                              477
hepatocellular adenocarcinoma (HCC)            334
melanocytic (MEL)                              81
nasopharynx carcinoma (NPC)                    149
neuroendocrine ca. and carcinoid (NE)          114
ovarian carcinoma                              388
pancreas adenocarcinoma (PAC)                  64
prostate carcinoma                             416
renal carcinoma (RCC)                          163
thyroid carcinoma                              154
uterus carcinoma                               42
vulva carcinoma                                47

We perform each step one by one and discuss the results of each step as follows.

In the first step, we identify an optimal set of 20 markers for each cancer. Please note that we exclude 100 (peri)centromeric intervals because 1) they mostly consist of repetitive sequence (ALU repeats etc.) without encoding genes; 2) they pose technical or interpretation difficulties. The markers are identified from the remaining 762 intervals.
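The case and interval counts quoted above can be cross-checked with a short script. The following is a sketch for illustration only, not the preprocessing code used in the thesis: the constants are copied from the text and Table 6-1, while the variable names, the helper functions and the toy matrix are assumptions.

```python
# Counts quoted in the text (the variable names are illustrative only).
TOTAL_CASES = 5918
CASES_WITHOUT_CNA = 875
REMOVED_SMALL_CATEGORIES = {"cholangio": 11, "squamous skin": 15}
OTHER_CATEGORY_CASES = 386
TOTAL_INTERVALS = 862
CENTROMERIC_INTERVALS = 100

# Case counts per diagnosis, copied from Table 6-1.
TABLE_6_1 = {
    "HNSCC": 309, "NSCLC": 242, "SCLC": 63, "bladder": 140, "breast": 640,
    "cervical": 210, "CRC": 392, "ES": 206, "gastric": 477, "HCC": 334,
    "MEL": 81, "NPC": 149, "NE": 114, "ovarian": 388, "PAC": 64,
    "prostate": 416, "RCC": 163, "thyroid": 154, "uterus": 42, "vulva": 47,
}

remaining_cases = (TOTAL_CASES - CASES_WITHOUT_CNA
                   - sum(REMOVED_SMALL_CATEGORIES.values())
                   - OTHER_CATEGORY_CASES)
remaining_intervals = TOTAL_INTERVALS - CENTROMERIC_INTERVALS


def informative_cases(matrix):
    """Keep only cases (rows) with at least one gain (1) or loss (-1)."""
    return [row for row in matrix if any(value != 0 for value in row)]


def drop_intervals(matrix, excluded):
    """Remove the columns whose 0-based indices are listed in `excluded`."""
    excluded = set(excluded)
    return [[value for idx, value in enumerate(row) if idx not in excluded]
            for row in matrix]


# Toy status matrix: 3 cases x 4 intervals, values in {1, -1, 0}.
toy = [[0, 1, 0, -1],
       [0, 0, 0, 0],   # no CNA at all: non-informative, removed
       [1, 1, 0, 0]]
filtered = drop_intervals(informative_cases(toy), excluded=[2])

print(remaining_cases, sum(TABLE_6_1.values()), remaining_intervals)
print(filtered)
```

Both the subtraction from the 5918 original cases and the column sum of Table 6-1 give 4631 cases, and removing the 100 (peri)centromeric intervals from 862 leaves 762, in agreement with the text.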

PAGE 107

2], using an 'average profile' based approach. We compared our markers to the reported imbalance hot spots as a validation test. Due to the limitation of space, here we only present the comparison results for the HNSCC disease category.

Hot spots reported by Baudis [2]:
gains: 3q26 (59.2%), 8q24 (40.8%), 11q13 (31.9%, many specific high-level), 5p (26.5%), Xq, 1q, 7q(21), 12p, 17
losses: 3p (30.1%), 18q(22) (22.4%), 9p (22.4%), 11q24 (19.2%), 4, 5q, 8p, 13

Markers identified by our method:
gains: 3q26.2 (57.2%), 8q24.3 (41%), 11q13.4 (31.9%), 5p14.3 (26.5%), Xq28 (23%), 7q21.3 (20.9%), 12p13.1 (17.7%), 17q25.3 (17.7%), 20q12 (17.7%), 19p13.11 (16.8%), 1q31.3 (16.2%), 18p11.23 (15.9%)
losses: 3p26.3 (30.7%), 18q23 (22.7%), 9p23 (22.4%), 11q25 (19.2%), 4p14 (18%), 5q21.3 (15.3%), 8p23.3 (16.2%), 13q21.33 (16.5%)

In the above results, markers or hot spots are listed with detailed locus and frequency information. Gains and losses are evaluated separately. The hot spots or markers are sorted in descending frequency of occurrence. We identify markers as individual intervals, while Baudis identified the regional hot spots from summary data. Our results are highly compatible with the reported results if we consider a marker as a representative of a region. We successfully identify all the hot spots identified by Baudis. We also identify additional hot spots (e.g., 18q23) that have significant support.

In the second step, for each disease entity, we compute the shared status of each marker identified in this cancer using the method we described in Section 6.2. We set the threshold to 0.8. To compare with the reported most frequent imbalances over all cancers, we analyze the markers that are in the same regions. The comparisons of imbalances with top frequencies are shown as follows.

Reported by Baudis [2]:

PAGE 108

Our markers:
+8q23.1, +8q23.2, +8q23.3: 19 cancers (exception thyroid)
+8q24.13, +8q24.23, +8q24.3: 18 cancers (exception NE and thyroid)

Reported by Baudis [2]:
-13q: occurring in most carcinoma types (exception cholangio and SQS)

Our markers:
-13q21.1, -13q21.2, -13q21.33: 18 cancers (exception CRC, gastric, cholangio and SQS)
-13q22.3: 15 cancers (exception SCLC, CRC, prostate, thyroid, gastric, cholangio and SQS)

The results show that our approach discovers the most frequent markers in a way that is consistent with Baudis' work. Please note that markers are individual intervals instead of chromosomal regions. In addition to the markers reported by Baudis et al. as top-scorers in the different entities, our method detected other regions, for example +17q and +7p, both of which are shared by more than 12 cancer types.

In the third step, we build a graph model based on the shared status of markers. The model contains 119 vertices and 385 edges, which makes it hard to fit in this thesis. The model conveys useful information about the importance of markers. We use this information in our next experiments in Section 6.4.2 6.3. We compute the distance matrix $D$ of the 20 cancers in Table 6-1 based on the markers reported in Section 6.4.1. We use the UPGMA algorithm in the PHYLIP package [20] to generate the phylogenetic tree. To demonstrate the use of computing the distance between cancers based on the importance of markers, we generate two phylogenetic trees. For the first tree, we compute the distance matrix by assigning

PAGE 109

Figure 6-2. Phylogenetic tree of 20 cancers based on weighted markers. The tree is generated by taking the importance of markers into account. We mark different cancers using different colors and capitalized letters based on their overall histological compositions. The legend is shown at the top right side.

PAGE 110

Figure 6-3. Phylogenetic tree of 20 cancers based on unweighted markers. The tree is generated by giving equal weights to markers. We mark different cancers using different colors and capitalized letters based on their overall histological compositions.

PAGE 111

6-2. For the second tree, we compute the distance matrix by assigning a weight of 1 to each marker. The resulting tree is shown in Figure 6-3.

The leaf nodes of the trees correspond to cancers (e.g. clinico-pathological cancer entities). We mark these cancers using different colors as well as capitalized letters based on the major histological composition of cases in each cancer. Each color corresponds to a capitalized letter. Different colors (letters) encode different histological compositions of cancers. The internal nodes are denoted by numbers and represent hypothetical cancers. Since these intermediate cancers may contain daughter branches from completely different histological compositions, they have to be viewed as common biological feature sets rather than truly occurring clinico-pathological cancer entities. The lengths of the branches are proportional to the difference between pairs of neighboring nodes.

In both trees, some cancers of the same histological composition are closely organized in the same subtree. However, the tree in Figure 6-2 shows a higher correlation of histological composition and subtree assignment compared to the tree in Figure 6-3. This correlation would be in concordance with the view that cancer clones may arise from tissue-specific cancer stem cells [66], with a similar regulatory program targeted by genomic aberrations in related tissues.

Each clinico-pathological cancer entity may contain multiple subtypes with heterogeneous aberration patterns. To infer a progression model for cancer subtypes, we first divide each cancer into several (two or four) clusters. By doing this, we hope that cases with similar aberration patterns can be grouped in the same cluster. We use the RSim clustering method [42] for this purpose. We determine the number of clusters by visually inspecting the aberration patterns of the cancer. For each cluster of the same cancer, we compute its quantity of imbalance as the ratio of intervals with imbalances in all CGH cases. We sort the clusters of each cancer in ascending order of their imbalance quantities. We name each cluster by concatenating the cancer name and its rank. For example, if we

PAGE 112

divide HNSCC into four clusters, 'HNSCC1' and 'HNSCC4' are the clusters with the least and the most quantity of imbalance, respectively. Please note that we do not perform clustering on the entities uterus and vulva because they both contain fewer than 50 cases. As a result, we divide the 20 cancers into 58 clusters of cases. We compute the distance matrix $D$ for the 58 clusters based on the weighted markers in them. We apply UPGMA to construct a phylogenetic tree over these cancer clusters. A part of the tree is shown in Figure 6-4.

Figure 6-4. A fraction of the phylogenetic tree of sub-clusters of 20 cancers. The subtree is shown on the left with each leaf node corresponding to a cluster. Each cluster is plotted in the middle, in the same order as the leaf nodes from top to bottom. The X-axis and Y-axis denote the index of genomic intervals and cases, respectively. Those intervals with gain and loss imbalance are plotted in green and red, respectively. The markers in each cluster are plotted using vertical dotted lines. The histological compositions of each cluster are plotted as a color side bar. The legend of the color side bar is shown on the right.

Figure 6-4 shows a fraction of the phylogenetic tree (left side) for the 58 clusters. This fraction contains a subtree whose seven leaf nodes correspond to clusters of six different cancer types. The names of these clusters all end with 3 or 4, which indicates that cases in these clusters contain a large amount of imbalances. (SCLC2 is also the cluster with the highest quantity of imbalance in SCLC because SCLC only contains two clusters.)
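The ranking and naming scheme for sub-clusters described above can be sketched as follows. This is an illustration under an assumption: the quantity of imbalance is read here as the fraction of all (case, interval) entries in a cluster that carry a gain or loss. The function names and the toy clusters are hypothetical.

```python
def imbalance_quantity(cluster):
    """Fraction of (case, interval) entries in a cluster that are gains or losses.

    `cluster` is a list of cases; each case is a list of values in {1, -1, 0}.
    """
    total = sum(len(case) for case in cluster)
    altered = sum(1 for case in cluster for value in case if value != 0)
    return altered / total


def name_clusters(cancer, clusters):
    """Name clusters '<cancer><rank>', rank 1 having the least imbalance."""
    ranked = sorted(clusters, key=imbalance_quantity)
    return {f"{cancer}{rank}": cluster
            for rank, cluster in enumerate(ranked, start=1)}


# Toy example: two clusters of two cases each over four intervals.
heavy = [[1, 1, -1, 0], [1, 0, -1, 0]]   # 5 of 8 entries altered
light = [[0, 0, 0, 1], [0, 0, 0, 0]]     # 1 of 8 entries altered

named = name_clusters("SCLC", [heavy, light])
print(sorted(named))                      # cluster names in rank order
print(imbalance_quantity(named["SCLC1"]), imbalance_quantity(named["SCLC2"]))
```

With only two clusters, the cluster with the highest quantity of imbalance receives rank 2, which matches the remark about SCLC2 in the text.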

PAGE 113

We have developed an automatic method to infer a graph model for the markers of multiple cancers. We demonstrated the use of this model in determining the importance of markers in cancer evolution. We also developed a new method to measure the evolutionary distance between different cancers based on their markers. We used this measure to create an evolutionary tree for multiple cancers.

With the application of our modeling approach to a set of more than 4600 epithelial neoplasias (carcinomas) with genomic imbalances, we can draw some preliminary conclusions:

PAGE 114

While our approach as described here used rough histological group classification as a reference, a refined dataset combined with different reference qualities (e.g. clinical parameters) should provide a significant contribution to the overall perception of genomic instability in cancer development.

PAGE 115

Data mining analysis on a large number of CGH samples helps biologists understand the intrinsic mechanism of tumor progression. For example, clustering methods are often employed to discover previously unknown sub-categories of cancer and to identify genetic biomarkers associated with the differentiation. Accurate classification of patients to their cancer types based on their genomic imbalances is crucial in successful cancer diagnosis and treatment. A public tool for data mining analysis of CGH data is of great use to cancer studies.

An ideal tool for end users and large-volume data analysis is a web-based application: a web browser is the only client software. End users are not required to download and install extra software. In addition, the backend server can distribute the intensive computing jobs to clusters or multicore CPUs to reduce the execution time. In this chapter, we discuss a web application developed based on our previous work [42, 78] to fulfill these requirements.

The application is developed using Microsoft Internet Information Server (

PAGE 116

The underlying algorithms are developed using MATLAB (

A preliminary version of this tool is available at (

The program accepts tab-delimited text files with both CGH data and genomic interval information. We follow the format of Progenetix, the largest source for published CGH data (with more than 12,000 cases) (

Our web tool provides five distance-based clustering algorithms: topDown, bottomUp, k-means, topDown+k-means and bottomUp+k-means. The first three are well-known clustering algorithms in the literature. The last two algorithms are combinations of topDown or bottomUp with k-means. They work as follows. They first find clusters using the top-down (or bottom-up) clustering algorithm. They then feed these clusters into the k-means algorithm as the initial clusters. They aim to avoid the poor results obtained by the k-means algorithm due to random initial clusters. The distance-based clustering

PAGE 117

tool provides four distance measures, Raw, Sim, cosineGaps and cosineNoGaps, as described in Chapter \ref{chap:paper1}.

Figure 7-1. Snapshot of the input interface of distance-based clustering.

Users can choose any one of the 20 combinations of the five algorithms and four distance measures, as shown in Figure 7-1. The default algorithm and distance measure are topDown and Sim, respectively, because this combination produces the best clusters according to our experimental results. The interface allows the user to upload a database file containing all the samples. It also allows the user to specify the number of clusters. Clustering is usually a computationally intensive task. Depending on the database, this process can take several minutes. The web server allows users to provide their email addresses so that they can be notified when the results are ready. The server stores the results in a temporary html file and emails the link for this file. The user can browse this file later. If the user chooses to keep the browser open, the server automatically refreshes it with the page that contains the results.
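The combined strategies (topDown+k-means and bottomUp+k-means) start k-means from the clusters found by a hierarchical method instead of from random clusters. The following is a minimal sketch of that idea, not the tool's MATLAB implementation. It assumes squared Euclidean distance in place of the Raw, Sim and cosine measures offered by the tool, and it takes the hierarchical result as a list of initial labels; all names are illustrative.

```python
def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(x, y):
    """Squared Euclidean distance (stand-in for the tool's distance measures)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def seeded_kmeans(samples, initial_labels, max_iter=100):
    """Refine an initial clustering with k-means until labels stop changing.

    `initial_labels[i]` is the 0-based cluster index of `samples[i]`, for
    example the output of a top-down or bottom-up clustering step.
    """
    labels = list(initial_labels)
    k = max(labels) + 1
    for _ in range(max_iter):
        # Recompute one centroid per cluster; empty clusters get no centroid.
        centroids = []
        for c in range(k):
            members = [s for s, label in zip(samples, labels) if label == c]
            centroids.append(centroid(members) if members else None)
        # Reassign each sample to its closest existing centroid.
        new_labels = []
        for sample in samples:
            candidates = [c for c in range(k) if centroids[c] is not None]
            new_labels.append(min(
                candidates,
                key=lambda c: squared_distance(sample, centroids[c])))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Toy CGH-like samples (1 = gain, -1 = loss, 0 = no change).
samples = [[1, 1, 0, 0],
           [1, 1, 0, -1],
           [0, 0, -1, -1],
           [0, 1, -1, -1]]
# An imperfect initial split, as a hierarchical method might produce.
print(seeded_kmeans(samples, initial_labels=[0, 0, 0, 1]))
```

Because the starting clusters are deterministic, repeated runs on the same input give the same result, which is the property the combined algorithms rely on to avoid poor random initializations.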

PAGE 118

Figure 7-2. Snapshot of the results of distance-based clustering.

Figure 7-2 provides a snapshot of the resulting page after applying the clustering algorithm to the sample dataset. In this example the dataset contains 391 samples from two cancer types, namely Retinoblastoma, NOS (ICD-O code = 9510/3) and Neuroblastoma, NOS (ICD-O code = 9500/3). The numbers of samples in the two cancer types are 120 and 271, respectively. The server identifies the clusters using the parameters provided by the user. It also provides clustering by applying the centroid shrinking technique [80] to refine the clusters. The results page consists of three parts.

1. 79].
2.
3. 7-3, the X-axis denotes the index of genomic intervals, and the Y-axis denotes the samples grouped by clusters. Different clusters are separated

PAGE 119

by the horizontal lines. The genomic intervals with gain and loss imbalances are plotted in green and red, respectively. Those with no imbalance are not plotted.

In both the query and result pages, all field names are clickable. Clicking on a field name brings up the help page that contains a description of that field.

Figure 7-3. Snapshot of the plot of clusters.

PAGE 120

Comparative Genomic Hybridization (CGH) is a molecular-cytogenetic analysis method for the simultaneous detection of a number of chromosomal imbalances, which are one of the most prominent and pathogenetically relevant features of human cancer. Along with the high dimensionality (around 1000), a key feature of CGH data is that consecutive values are highly correlated. The aim of this thesis is to develop novel data mining methods that exploit these characteristics in mining a population of CGH samples. In particular, this thesis makes the following contributions:

1. Novel distance measures are investigated for the clustering of CGH data. Three pairwise distance/similarity measures, namely raw, cosine, and sim, are proposed. The first one ignores the correlation, while the latter two can effectively leverage this correlation. These distance/similarity measures are tested on CGH data using three main clustering techniques. The results show that Sim consistently performs better than the remaining measures since it can effectively utilize the correlations between consecutive intervals in the underlying data.

2. A dynamic programming algorithm is developed to identify a small set of important genomic intervals called markers. The recurrent imbalance profiles of samples can be captured using a set of markers. Two novel clustering strategies are developed. Both methods utilize markers to exclude noisy intervals from clustering. The experimental results demonstrate that the markers found represent the aberration patterns of CGH data very well and that they improve the quality of clustering significantly.

3. Novel SVM based methods for classification and feature selection of CGH data are developed. For classification, a novel similarity kernel is proposed. It is shown to be more effective than the standard linear kernel used in SVM. For feature selection, a novel method based on the new kernel is proposed. It iteratively selects features that provide the maximum benefit for classification. Our methods are compared against

PAGE 121

4. A graph model is proposed to infer the progression of markers (key recurrent CNAs). With this model, the importance of markers in cancer evolution can be derived. A new distance measure is proposed for computing the distance between cancers based on their aberration patterns. An existing distance matrix method is employed along with the new measure for inferring a progression model of multiple cancers. The results show that cancers with similar histological compositions are well grouped together.

These methods are evaluated using large repositories of datasets that are publicly available. These methods are also encapsulated into a web service that can be used for analyzing and visualizing CGH data.

In the present study, our work is based on chromosomal CGH data annotated in a reverse in-situ karyotype format [50]. In the future, we will extend our work to support other CGH formats, such as aCGH data, and other datasets such as gene expression array data, SNP data and proteomics data.

PAGE 122

[1] M. Baudis. Online database and bioinformatics toolbox to support data mining in cancer cytogenetics. Biotechniques, 40(3), March 2006.
[2] M. Baudis. Genomic imbalances in 5918 malignant epithelial tumors: An explorative meta-analysis of chromosomal CGH data. accepted at BMC Cancer, 2007.
[3] M. Baudis and M. L. Cleary. Progenetix.net: an online repository for molecular cytogenetic aberration data. Bioinformatics, 17(12):1228-1229, 2001.
[4] B. J. Beattie and P. N. Robinson. Binary state pattern clustering: A digital paradigm for class and biomarker discovery in gene microarray studies of cancer. Journal of Computational Biology, 13(5):1114-1130, 2006.
[5] M. Bentz, C. Werner, H. Dohner, S. Joos, T. Barth, R. Siebert, M. Schroder, S. Stilgenbauer, K. Fischer, P. Moller, and P. Lichter. High incidence of chromosomal imbalances and gene amplifications in the classical follicular variant of follicle center lymphoma. Blood, 88(4):1437-1444, 1996.
[6] S. Bilke, Q.-R. Chen, F. Westerman, M. Schwab, D. Catchpoole, and J. Khan. Inferring a Tumor Progression Model for Neuroblastoma From Genomic Data. J Clin Oncol, 23(29):7322-7331, 2005.
[7] P. Broet and S. Richardson. Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics, 22(8):911-918, 2006.
[8] G. C. Cawley. MATLAB support vector machine toolbox (v0.55) [http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox]. University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ, 2000.
[9] H. Chai and C. Domeniconi. An evaluation of gene selection methods for multi-class microarray data classification. In Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics, pages 3-10, 2004.
[10] P. Crossen. Giemsa banding patterns of human chromosomes. Clin Genet, 3:169-179, 1972.
[11] R. Desper, F. Jiang, and O.-P. Kallioniemi. Inferring tree models for oncogenesis from comparative genome hybridization data. 6:37-51, 1999.
[12] R. Desper, F. Jiang, O.-P. Kallioniemi, H. Moch, C. H. Papadimitriou, and A. A. Schäffer. Inferring tree models for oncogenesis from comparative genome hybridization data. Journal of Computational Biology, 6(1):37-52, 1999.
[13] R. Desper, F. Jiang, O. P. Kallioniemi, H. Moch, C. H. Papadimitriou, and A. A. Schäffer. Distance-based reconstruction of tree models for oncogenesis. J Comput Biol, 7(6):789-803, 2000.
[14] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. In CSB, page 523, Washington, DC, USA, 2003. IEEE Computer Society.
[15] C. Ding and H. Peng. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol, 3(2):185-205, April 2005.
[16] C. H. Q. Ding. Analysis of gene expression profiles: class discovery and leaf ordering. In RECOMB, pages 127-136, New York, NY, USA, 2002. ACM Press.

PAGE 123

[17] K. B. Duan, J. C. Rajapakse, H. Wang, and F. Azuaje. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans Nanobioscience, 4(3):228-234, September 2005.
[18] P. Duesberg. Does Aneuploidy or Mutation Start Cancer? Science, 307(5706):41d-, 2005.
[19] P. H. C. Eilers and R. X. de Menezes. Quantile smoothing of array CGH data. Bioinformatics, 21(7):1146-1153, 2005.
[20] J. Felsenstein. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics, (5):164-166, 1989.
[21] J. Fridlyand, A. M. Snijders, D. Pinkel, D. G. Albertson, and A. N. Jain. Hidden markov models approach to the analysis of array cgh data. J. Multivar. Anal., 90(1):132-153, 2004.
[22] A. Fritz, C. Percy, A. Jack, L. Sobin, and M. Parkin, editors. International Classification of Diseases for Oncology (ICD-O), Third Edition. World Health Organization, Geneva, 2000.
[23] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, October 1999.
[24] J. Gray, C. Collins, I. Henderson, J. Isola, A. Kallioniemi, O. Kallioniemi, H. Nakamura, D. Pinkel, T. Stokke, M. Tanner, et al. Molecular cytogenetics of human breast cancer. Cold Spring Harb Symp Quant Biol, 59:645-652, 1994.
[25] J. W. Gray and C. Collins. Genome changes and gene expression in human solid tumors. Carcinogenesis, 21(3):443-452, 2000.
[26] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157-1182, 2003.
[27] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422, 2002.
[28] D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100(1):57-70, January 2000.
[29] J. Handl, J. Knowles, and D. B. Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201-3212, August 2005.
[30] G. Hodgson and J. H. H. et al. Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nature Genetics, 29:459-464, 2001.
[31] M. Hoglund, A. Frigyesi, T. Sall, D. Gisselsson, and F. Mitelman. Statistical Behavior of Complex Cancer Karyotypes. Genes Chromosomes Cancer, 42(4):327-341, 2005.
[32] D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: a survey. Knowledge and Data Engineering, IEEE Transactions on, 16(11):1370-1386, 2004.
[33] T. Joachims. Making large-scale support vector machine learning practical. Advances in kernel methods: support vector learning, pages 169-184, 1999.
[34] S. Joos, C. Menz, G. Wrobel, R. Siebert, S. Gesk, S. Ohl, G. Mechtersheimer, L. Trumper, P. Moller, P. Lichter, and T. Barth. Classical hodgkin lymphoma is characterized by recurrent copy number gains of the short arm of chromosome 2. Blood, 99(4):1381-1387, 2002.

PAGE 124

[35] K. Junker, G. Weirich, M. B. Amin, P. Moravek, W. Hindermann, and J. Schubert. Genetic subtyping of renal cell carcinoma by comparative genomic hybridization. Recent Results Cancer Res, 162:169-175, 2003.
[36] A. Kallioniemi, O. Kallioniemi, D. Sudar, D. Rutovitz, J. Gray, F. Waldman, and D. Pinkel. Comparative Genomic Hybridization for Molecular Cytogenetic Analysis of Solid Tumors. Science, 258(5083):818-821, 1992.
[37] A. Kallioniemi, O. P. Kallioniemi, D. Sudar, D. Rutovitz, J. W. Gray, F. Waldman, and D. Pinkel. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science, 258(5083):818-821, 1992.
[38] B. King. Step-wise clustering procedures. Journal of the American Statistical Association, 69:86-101, 1967.
[39] D. E. Krane and M. L. Raymer. Fundamental Concepts of Bioinformatics. Benjamin-Cummings Pub Co, San Francisco, CA, USA, September 2002.
[40] T. Li, C. Zhang, and M. Ogihara. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics, 20(15):2429-2437, 2004.
[41] J. Liu, J. Mohammed, J. Carter, S. Ranka, T. Kahveci, and M. Baudis. Distance-based clustering of CGH data. Bioinformatics, 22(16):1971-1978, 2006.
[42] J. Liu, S. Ranka, and T. Kahveci. Markers improve clustering of CGH data. Bioinformatics, 23(4):450-457, 2007.
[43] J. Liu, S. Ranka, and T. Kahveci. A web server for mining comparative genomic hybridization (cgh) data. volume 953, pages 144-161. AIP, 2007.
[44] J. B. MacQueen. Some Methods for classification and Analysis of Multivariate Observations. In Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[45] M. Mao, R. Hamoudi, I. Talbot, and M. Baudis. Allele-Specific Loss of Heterozygosity in Multiple Colorectal Adenomas: Towards the Integrated Molecular Cytogenetic Map II. accepted at Cancer, Genetics, Cytogenetics, 2005.
[46] X. Mao, R. Hamoudi, P. Zhao, and M. Baudis. Genetic Losses in Breast Cancer: Toward an Integrated Molecular Cytogenetic Map. Cancer Genet Cytogenet, 160(2):141-151, 2005.
[47] J. C. Marioni, N. P. Thorne, and S. Tavare. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22(9):1144-1146, 2006.
[48] T. Mattfeldt, H. Wolter, R. Kemmerling, H. Gottfried, and H. Kestler. Cluster analysis of comparative genomic hybridization (cgh) data using self-organizing maps: application to prostate carcinomas. Anal Cell Pathol, 23(1):29-37, 2001.
[49] T. Mattfeldt, H. Wolter, R. Kemmerling, H. W. Gottfried, and H. A. Kestler. Cluster analysis of comparative genomic hybridization (CGH) data using self-organizing maps: application to prostate carcinomas. Anal Cell Pathol, 23:29-37, 2001.
[50] F. Mitelman, editor. International System for Cytogenetic Nomenclature. Karger, Basel, 1995.
[51] F. Model, P. Adorján, A. Olek, and C. Piepenbrock. Feature selection for DNA methylation based cancer classification. Bioinformatics, 17 Suppl 1, 2001.

PAGE 125

[52] R. Molist, M. Gerbault-Seureau, X. Sastre-Garau, B. Sigal-Zafrani, B. Dutrillaux, and M. Muleris. Ductal breast carcinoma develops through different patterns of chromosomal evolution. Genes, Chromosomes and Cancer, 43(2):147-154, 2005.
[53] S. Mukherjee, P. Tamayo, D. Slonim, A. Verri, T. Golub, J. Mesirov, and T. Poggio. Support vector machine classification of microarray data, 1999.
[54] A. B. Olshen, E. S. Venkatraman, R. Lucito, and M. Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostat, 5(4):557-572, 2004.
[55] K. Patau. The identification of individual chromosomes, especially in man. Am J Hum Genet, 12:250-276, 1960.
[56] G. Pennington, S. Shackney, and R. Schwartz. Cancer phylogenetics from single-cell assays. Technical report, School of Computer Science, Carnegie Mellon University, 2006.
[57] F. Picard, S. Robin, M. Lavielle, C. Vaisse, and J. J. Daudin. A statistical approach for array cgh data analysis. BMC Bioinformatics, 6, 2005.
[58] F. Picard, S. Robin, E. Lebarbier, and J.-J. Daudin. A Segmentation-Clustering problem for the analysis of array CGH data. In Applied Stochastic Models and Data Analysis, 2005.
[59] D. Pinkel and D. G. Albertson. Array comparative genomic hybridization and its applications in cancer. Nature Genetics, 37:S11-S17, 2005.
[60] D. Pinkel, R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W. Kuo, C. Chen, Y. Zhai, S. Dairkee, B. Ljung, J. Gray, and D. Albertson. High Resolution Analysis of DNA Copy Number Variation Using Comparative Genomic Hybridization to Microarrays. Nat Genet, 20(2):207-211, 1998.
[61] D. Pinkel, R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W.-L. Kuo, C. Chen, Y. Zhai, S. H. Dairkee, B.-M. Ljung, J. W. Gray, and D. G. Albertson. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics, 20:207-211, 1998.
[62] J. Pollack, C. Perou, A. Alizadeh, M. Eisen, A. Pergamenschikov, C. Williams, S. Jeffrey, D. Botstein, and P. Brown. Genome-Wide Analysis of DNA Copy-Number Changes Using cDNA Microarrays. Nat Genet, 23(1):41-46, 1999.
[63] J. R. Pollack, C. M. Perou, A. A. Alizadeh, M. B. Eisen, A. Pergamenschikov, C. F. Williams, S. S. Jeffrey, D. Botstein, and P. O. Brown. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature Genetics, 23:41-46, 1999.
[64] A. Rakotomamonjy. Variable selection using SVM based criteria. J. Mach. Learn. Res., 3:1357-1370, 2003.
[65] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. P. Mesirov, T. Poggio, W. Gerald, M. Loda, E. S. Lander, and T. R. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA, 98(26):15149-15154, December 2001.
[66] T. Reya, S. J. Morrison, M. F. Clarke, and I. L. Weissman. Stem cells, cancer, and cancer stem cells. Nature, 414(6859):105-11, Nov 2001.
[67] C. Rouveirol, N. Stransky, P. Hupé, P. La Rosa, E. Viara, E. Barillot, and F. Radvanyi. Computation of recurrent minimal genomic alterations from array-cgh data. Bioinformatics, January 2006.

PAGE 126

[68] G. Salton. Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.
[69] S. Selim and M. Ismail. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Analysis and Machine Intelligence, 6(1):81-87, 1984.
[70] A. M. Snijders, D. Pinkel, and D. G. Albertson. Current status and future prospects of array-based comparative genomic hybridisation. Brief Funct Genomic Proteomic, 2(1):37-45, 2003.
[71] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Dohner, T. Cremer, and P. Lichter. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer, 20:399-407, 1997.
[72] S. Solinas-Toldo, S. Lampel, S. Stilgenbauer, J. Nickolenko, A. Benner, H. Dohner, T. Cremer, and P. Lichter. Matrix-Based Comparative Genomic Hybridization: Biochips to Screen for Genomic Imbalances. Genes Chromosomes Cancer, 20(4):399-407, 1997.
[73] M. Speicher, S. Gwyn Ballard, and D. Ward. Karyotyping human chromosomes by combinatorial multi-fluor FISH. Nat Genet, 12(4):368-375, 1996.
[74] A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631-643, 2005.
[75] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
[76] C. S. Stocks, N. Pratt, M. Sales, D. A. Johnston, A. M. Thompson, F. A. Carey, and N. M. Kernohan. Chromosomal imbalances in gastric and esophageal adenocarcinoma: Specific comparative genomic hybridization-detected abnormalities segregate with junctional adenocarcinomas. Genes, Chromosomes and Cancer, 32(1):50-58, 2001.
[77] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining partitionings. In Proceedings of AAAI 2002, Edmonton, Canada, pages 93-98. AAAI, July 2002.
[78] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley Longman Publishing Co., Inc., 2005.
[79] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining, (First Edition). Addison Wesley, May 2005.
[80] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99(10):6567-6572, 2002.
[81] J. Tijo and A. Levan. The chromosome number of man. Hereditas, 42:1-16, 1956.
[82] J. Vandesompele, M. Baudis, K. De Preter, N. Van Roy, and Ambros. Unequivocal Delineation of Clinicogenetic Subgroups and Development of a New Model for Improved Outcome Prediction in Neuroblastoma. J Clin Oncol, 23(10):2280-2299, 2005.
[83] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998.

PAGE 127

[84] T. Veldman, C. Vignon, E. Schrock, J. Rowley, and T. Ried. Hidden chromosome abnormalities in haematological malignancies detected by multicolour spectral karyotyping. Nat Genet, 15(4):406-410, 1997.
[85] B. Vogelstein, E. R. Fearon, S. R. Hamilton, S. E. Kern, A. C. Preisinger, M. Leppert, Y. Nakamura, R. White, A. M. Smits, and J. L. Bos. Genetic alterations during colorectal-tumor development. N Engl J Med, 319(9):525-532, September 1988.
[86] B. Vogelstein and K. Kinzler. The Multistep Nature of Cancer. Trends Genet, 9(4):138-141, 1993.
[87] P. Wang, Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. A method for calling gains and losses in array CGH data. Biostat, 6(1):45-58, 2005.
[88] P. Wang, Y. Kim, J. Pollack, B. Narasimhan, and R. Tibshirani. A method for calling gains and losses in array cgh data. Biostatistics, 6(1):45-58, January 2005.
[89] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In NIPS, pages 668-674, 2000.
[90] H. Willenbrock and J. Fridlyand. A comparison study: applying segmentation to array cgh data for downstream analyses. Bioinformatics, September 2005.
[91] L. Yu and H. Liu. Redundancy based feature selection for microarray data. In KDD, pages 737-742, New York, NY, USA, 2004. ACM Press.
[92] X. Zhang, X. Lu, Q. Shi, X.-q. Xu, H.-c. Leung, L. Harris, J. Iglehart, A. Miron, J. Liu, and W. Wong. Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics, 7(1):197, 2006.
[93] S. Zhong and J. Ghosh. Generative model-based document clustering: a comparative study. Knowl. Inf. Syst., 8(3):374-384, 2005.

PAGE 128

Jun Liu was born in 1976 in Nanjing, China. He grew up mostly in Nanjing, China. He earned his B.S. and M.E. degrees in computer science from Nanjing University in 1998 and 2000, respectively. He earned his Ph.D. in computer engineering from the University of Florida (Gainesville, Florida, USA) in 2008.