Semantic Integration through Application Analysis

xml version 1.0 encoding UTF-8
REPORT xmlns http://www.fcla.edu/dls/md/daitss/ xmlns:xsi http://www.w3.org/2001/XMLSchema-instance xsi:schemaLocation http://www.fcla.edu/dls/md/daitss/daitssReport.xsd
INGEST IEID E20101112_AAAACE INGEST_TIME 2010-11-12T12:56:36Z PACKAGE UFE0019342_00001
AGREEMENT_INFO ACCOUNT UF PROJECT UFDC
FILES
FILE SIZE 5705 DFID F20101112_AABIPD ORIGIN DEPOSITOR PATH topsakal_o_Page_086thm.jpg GLOBAL false PRESERVATION BIT MESSAGE_DIGEST ALGORITHM MD5
1b25dd563783371213b53c84be35f923
SHA-1
99c59cdc1f62652e25055fdbf4d6d47538ec37a7
6181 F20101112_AABIOO topsakal_o_Page_077thm.jpg
27f9c80214d118550a65d160ddea6b70
671639ab7adffe296a6f8fe6627521f4357bd5e4
22220 F20101112_AABHMB topsakal_o_Page_049.QC.jpg
7460ffc1eab77c5c7f532355648c8db1
8f57e7b42e270273a5314c653ed65396f0e54046
5789 F20101112_AABHLM topsakal_o_Page_017thm.jpg
d0a673b8f0aca356d7cfdc52a6cf67dd
ffc80b60537a6f3326a2952c209b07528299a77f
1053954 F20101112_AABHKY topsakal_o_Page_124.tif
95461141c79d473a5c3df26bb8e8e06f
8c77f529283bd752b649cb5a438bac3ce45f0714
21872 F20101112_AABIPE topsakal_o_Page_087.QC.jpg
bc11e4aa28d33c42be5972a7cfb1263b
2d431b26d26131a66744d8deac75ac7e9f4ce16f
21402 F20101112_AABIOP topsakal_o_Page_078.QC.jpg
2b1b8d7647adab4546525878888ef3e6
b50f384eb0300f6700fd4f09dd8c60a95c08ee82
42683 F20101112_AABHMC topsakal_o_Page_073.pro
89d9e136a997456e0e3ed7f94023421e
e9785717650533ac9230097cb7f3000622959360
3068 F20101112_AABHLN topsakal_o_Page_003.pro
bc742aff382256d6736176cf19d3e8e1
6c28850cbd6664bc5e32ab5e633306ebc1f92dfc
25271604 F20101112_AABHKZ topsakal_o_Page_030.tif
b02ddd4e32505e5584aa9def78c518a1
c88c94b51b7e307b3f35df2b8369516c0f729caf
5532 F20101112_AABIPF topsakal_o_Page_087thm.jpg
bfd2f6c5df5e9a5826ff02c7b555b234
b4eec57a072d3471f7f610d3e487c4f9a8bf897a
5712 F20101112_AABIOQ topsakal_o_Page_078thm.jpg
4034edad01cc7a18d3ee7fa7af5cfc90
cdf21bce4762e36979acbd16b827928a9f54baad
20046 F20101112_AABHMD topsakal_o_Page_010.QC.jpg
43575a2608c6722074d1b7136a2e4e22
bf8ae84c087da5cae8d3ac9bcafce8983b058322
1902 F20101112_AABHLO topsakal_o_Page_091.txt
564b713de15479f7a6061df519003427
521028c0ebbecfae924d2dda5d341d5a1fa67259
16533 F20101112_AABIPG topsakal_o_Page_088.QC.jpg
25a9c8b8e54ca865d2f198a919bf6ff6
d3dc1ef20a12e66e71eb673bf665c12845642f01
23507 F20101112_AABIOR topsakal_o_Page_079.QC.jpg
7ac69a3ee973b5bf756343ded45b3570
e49957063e8edf886f26368789571d06bca7ee36
1051962 F20101112_AABHME topsakal_o_Page_091.jp2
d9abb78b0f6c084a1b7349f9b4f29f89
d776cb2d55914ad77df5e55583f672a2a78778c4
1051969 F20101112_AABHLP topsakal_o_Page_089.jp2
68fc85fc2c4b140e4a22b66d5ec5e0f4
d373c489fb1d73681f3e87975652d8b2475ebcc3
4371 F20101112_AABIPH topsakal_o_Page_088thm.jpg
6e839de9a2c3f304decb6a563707e7cc
ac14e742d9f0c5402b9d802d48b4589d43bfb26f
6096 F20101112_AABIOS topsakal_o_Page_079thm.jpg
e4f80bb6cfb62028d4e471dc08e7aae4
f127a0b1c7f9390a629f8ba46f85c18f2e246ed6
83064 F20101112_AABHMF topsakal_o_Page_047.jpg
030e0fb03eed738bb2b61383a8066f4b
edb44e4ae9efa7897a2b8bf6c018f7b823517a0c
36275 F20101112_AABHLQ topsakal_o_Page_053.pro
15d5e0ba97f7857696a858c487b49ebd
32bc415de91817695ba9e7ee00effae8af39f219
22723 F20101112_AABIPI topsakal_o_Page_089.QC.jpg
2f66702e541d8a2f6255e0d9c6a6fccf
3a5a8ac2499b969cd14428d9c7b5f79b58af6775
16445 F20101112_AABIOT topsakal_o_Page_080.QC.jpg
2b3a98bbe7da7ac9ee225a988bdbc41f
ad906f0cba8d5c441b107e2c3d8f5c602c5c8bf1
42128 F20101112_AABHMG topsakal_o_Page_097.pro
bd7fbc12a0285b7dbd9fa896e330dc38
cd36d5b22d2918d89be2d267e47ba46caed31e5d
76756 F20101112_AABHLR topsakal_o_Page_122.jpg
76bb11511ba34003788f2a2d1d2cbbd3
f73d8c0556846830fb281cb44230892b149819a9
22181 F20101112_AABIPJ topsakal_o_Page_090.QC.jpg
0ddc26474845a6069383120f98419803
614145ca8a8b38041317f5119a8dda7a6111f578
4854 F20101112_AABIOU topsakal_o_Page_080thm.jpg
a3e8544e53c73134d0694ee9e55c06a7
e008cb2421ec0183fa769559bc96166a776bd8f9
51159 F20101112_AABHMH topsakal_o_Page_050.jpg
a4ec992a80ac4911977bd02ece0512a5
a79322f7041c9457984234eedc84d761d78210f1
1051940 F20101112_AABHLS topsakal_o_Page_109.jp2
b92cc2635de6a4ee62b7aade6f32b345
db3882f659e0382a86518d040672b4310660377a
5851 F20101112_AABIPK topsakal_o_Page_090thm.jpg
f75ed6198beba3fe81278846b9b61ede
8b6d05e25e99ee3762d4759d806e8e7769ec39d6
20856 F20101112_AABIOV topsakal_o_Page_081.QC.jpg
6eb38b933085b2944e008c0611188ffa
08e49056f3894b79429f71bf4911e151c14837dc
1896 F20101112_AABHMI topsakal_o_Page_084.txt
1fc5b28502ec9cfd859f199c736d11ad
3a14ae40765f131ab374a65e1f19048fdb9587a1
F20101112_AABHLT topsakal_o_Page_086.tif
dc76ae0ed3cd0c614f0ed6ff7f93a53e
618872768b8b1b835d5a5cb279561ec42cdfa21c
6790 F20101112_AABIPL topsakal_o_Page_091thm.jpg
4cb4f22422cafdc18934663e5541dcf9
0c963f1ca51e0730c05891138bb66b48dac0505c
5378 F20101112_AABIOW topsakal_o_Page_081thm.jpg
889c63024ded6b331647892682a7b77e
a46282891ab6ec2d0cc2216e3586c7b7f881c60d
2345 F20101112_AABHMJ topsakal_o_Page_008.txt
2644cf48c8689ba88eae749712f5e0da
f2842daa532875d5c55d9e98689b37d19ea60e25
5192 F20101112_AABHLU topsakal_o_Page_011thm.jpg
e587b3366d825bc565ccedfc31b8192f
374ac72d6476d7704d7c11cb548177286b841fbf
6088 F20101112_AABIQA topsakal_o_Page_100thm.jpg
7d2711ddae160340b3e5efa35ebfb118
ab8c1f162fcde58698dab2acd95a6f79d9d6f186
21109 F20101112_AABIPM topsakal_o_Page_092.QC.jpg
3f657d9df32619851e480a9621ac4bbd
99d9e6dc424a12716a4d99ecac7f362f71131864
18278 F20101112_AABIOX topsakal_o_Page_082.QC.jpg
319ead7f14f556d508db3a41914fe914
639c28a39898beb313f0274599e63a8c61031fc5
23045 F20101112_AABHMK topsakal_o_Page_116.QC.jpg
f380fbbcddec06dfe4cd7ee30e0d3da9
0a5ad41bfc2087696ccb191814aee4ff2456408d
1051871 F20101112_AABHLV topsakal_o_Page_023.jp2
32be8d40c29b8e36559c522529670716
0755e727a6fd4a317668ddfb6cd731ad53cdc5f2
24216 F20101112_AABIQB topsakal_o_Page_101.QC.jpg
232a3fe937a3ea03f5bf5e428e8179b3
ae03a5eda882a6bcd7d4f99eb95902988b503e7c
2257 F20101112_AABIOY topsakal_o_Page_083thm.jpg
a215b0e8ec7276a250c3986f3a9ce95a
b008bec168a1b2198f6a64119cbd10b5f65f1327
52732 F20101112_AABHLW topsakal_o_Page_101.pro
046b8e0883b697b5817fb40ee85ce5df
3c03e9cfab39b0824e1a102067912131f3db8f99
5916 F20101112_AABIQC topsakal_o_Page_101thm.jpg
691dc2e6d201fd807c8219a3de18adea
0d8ab2af28048f2aa82bc8c99ae3777f42635524
5778 F20101112_AABIPN topsakal_o_Page_092thm.jpg
c8176f87855c344f736fea1f8b3673ad
1f68fd26cab89219259dec4ead0c655cab831ca5
5383 F20101112_AABIOZ topsakal_o_Page_084thm.jpg
1772202bf13fc66df513da0d38430fe6
399432aebc95500dd13dcaef451539fd5ccace9a
6008 F20101112_AABHML topsakal_o_Page_106thm.jpg
4599a8361feb07201edb0a38d5bbfc3b
4f70a4e62517e1d6833426a07c8f03dadc832bd8
1051977 F20101112_AABHLX topsakal_o_Page_085.jp2
cf332d9372c3c4f47a5f0f68dbb32113
dd04e3dd7136fe4628d4aeac6eaecbad1b1e8f3b
78370 F20101112_AABHNA topsakal_o_Page_119.jpg
c3cd11a19062cdf5e76b080e79082227
5fb8135d2034ec1c3d9dd4f48dd154f166940521
19144 F20101112_AABIQD topsakal_o_Page_102.QC.jpg
3bc89ea9abf936499bf134506f3eaa8f
0b3b90230e12fbd6d1fc28b1514e29ec472e80ff
23301 F20101112_AABIPO topsakal_o_Page_093.QC.jpg
ec331881aede078bd2892ddc6c04d406
bd6d3b8845e88fc936470a866d536191168f61cc
23589 F20101112_AABHMM topsakal_o_Page_063.QC.jpg
4c32ec391971e7924e96605aa311da71
52929661d62b955071cca9574ef985f6372e49ae
25510 F20101112_AABHLY topsakal_o_Page_022.QC.jpg
66efebcc50f4dd8e7351ddeff380d2b7
1a31a96eaf27311c295a586dc91c0e224c88af5b
23814 F20101112_AABHNB topsakal_o_Page_037.QC.jpg
20665e1a18e2bc604c3e0598f3b506dd
8c970b8595685393c5772ccfb5eabd80256654ba
4961 F20101112_AABIQE topsakal_o_Page_102thm.jpg
cb9c712bd9cd1b04a3e36dedf2d85aa5
cf8161a8746544f7f85aba8149c4e4ed7576bd3a
21259 F20101112_AABIPP topsakal_o_Page_094.QC.jpg
2558ffded2cf2bee20fa38cb6540416c
46fddb50c0470da79550291ee9c8f405101698de
F20101112_AABHMN topsakal_o_Page_041.tif
58888e33d1fb7f18cf1ed92db3ea0e10
3cfbe86483df2d3eeebd6dcdcf3969a4b8099e25
24066 F20101112_AABHLZ topsakal_o_Page_059.QC.jpg
be540dd8166fe40218b129c7f5cd013f
efd67dc8f6e1bc16249fe2001f4f9c2ab5aa947c
1051944 F20101112_AABHNC topsakal_o_Page_043.jp2
e6a25a59ccf32c56bd9c1c16579029b5
4d1ea9cb1390a371f1a9682f8ce0328cb6414bc5
5668 F20101112_AABIQF topsakal_o_Page_104thm.jpg
a9e2e90b82bde4453cd2e9c4325998a5
b2b0425b3985a0df3dad984a449275b2c9f07e33
14031 F20101112_AABIPQ topsakal_o_Page_095.QC.jpg
49b7745dc7fd7a000ee039f62808d85f
e578ab20817f975fe3af4933c783aac23b3a5299
F20101112_AABHND topsakal_o_Page_045.tif
70a0d03714abfbd64ecdeffb77a8893a
8c59676009ef5815b626598342ba64b9ec7b5fb7
2310 F20101112_AABHMO topsakal_o_Page_118.txt
f1d6c5021233d632e76d2eed5350353f
cc5bf1ca9e1c9e78eb09a7a3884be588c5c64a4e
23955 F20101112_AABIQG topsakal_o_Page_106.QC.jpg
9a99b1a300fe83711e5409cb1ecd0f5a
c631a137d2183c593a848a93b5963c208f1bdb67
3612 F20101112_AABIPR topsakal_o_Page_095thm.jpg
fccfece55285f52940cd241d71284d2e
0e9615a5f5791ea2ea0d13a21ba9a235a8c97ab1
2073 F20101112_AABHNE topsakal_o_Page_026.txt
5b93fef46eaa2fd15461ffa6ba379e17
59c0fb5c762474c2e9a1b6da903973da76604ca3
F20101112_AABHMP topsakal_o_Page_050.tif
5f4e955150573f089e852c6546616714
f934d2567cd6eac501552dac12a1f05a121ad856
23007 F20101112_AABIQH topsakal_o_Page_107.QC.jpg
ae6945edc9267cf06aa5071dcdb8fde6
285e78466c50856e7c449182a7903ac1950df1d8
23201 F20101112_AABIPS topsakal_o_Page_096.QC.jpg
19fa547c5ccf96b9a769da88b6a76c5e
720d2144550412529716bf4e2bc280bc22246582
F20101112_AABHNF topsakal_o_Page_066.tif
4711155e24a337c8173566d7e43a7ab9
0395d7a361d89b8c6405802fec0684753af7ff70
74961 F20101112_AABHMQ topsakal_o_Page_093.pro
ef61b6470282057b564bfcb48f199ee8
56476ea9cb6467a8d5d144dcbe8dddb5427d7fce
6330 F20101112_AABIQI topsakal_o_Page_107thm.jpg
b6c5fbd6bf386b42a888454fd31cb747
ba151e5269ff2aa007d03472b4b573571a45b6ff
6046 F20101112_AABIPT topsakal_o_Page_096thm.jpg
04b8a7970aa3d81f9b36d0fed5f74539
6593434038e6d5be29d4a0136353cf3305a89fb3
143559 F20101112_AABHNG topsakal_o_Page_123.jp2
a80c03cd377afb98de288e0a6cba9c7e
168bc91bb396a6055adacab3242b238b5effd47e
1051980 F20101112_AABHMR topsakal_o_Page_099.jp2
ca10f6bf695fd49dc38713dbcf66fdbd
878361579aa11384d05cad6719fc7cd76a254f0b
22730 F20101112_AABIQJ topsakal_o_Page_108.QC.jpg
67535eb3875a9cad3791233ca297622f
f4806c33098634cdd6f7c1b0b95df002d869509d
24802 F20101112_AABIPU topsakal_o_Page_097.QC.jpg
96235e384955026db9e70eae51f71c31
7720763853bc9244a9ba5cc0badc9b5b68042a2c
F20101112_AABHNH topsakal_o_Page_102.tif
968ed328fd05f086a8dc7ee1f9e27b91
15e21518a9429d3188ee8cc1c15ff95e6d353ef2
69956 F20101112_AABHMS topsakal_o_Page_078.jpg
814cbdb57d0f5ecb2a290603000b7a29
a82bafcb9c7f3df11346a531d388c1b9b9078dda
5935 F20101112_AABIQK topsakal_o_Page_108thm.jpg
cef805c3d29e7305bd0373b94981bead
29166f5462d7a882c0cffa67fcf747e188d7e5a4
6437 F20101112_AABIPV topsakal_o_Page_097thm.jpg
8265fe85f9e3e2ceed12b65b19233bd7
08c805263f3cb6fbd75388066ad09bb69b929f4d
F20101112_AABHNI topsakal_o_Page_091.tif
3720b45a4fd1881d72656d8df4ae43e4
fa23a9288096b97648eefce1ff4d44183687f601
5039 F20101112_AABHMT topsakal_o_Page_082thm.jpg
69ea5b4a10913e1ea8b3ab58a8df099a
bea2fc53c5e1691e1320ab145cfca3a71b56c906
23291 F20101112_AABIQL topsakal_o_Page_109.QC.jpg
24343ba8893a0ea1604cb0f8b8e324bc
8f7e7e51f52a8cccacd96698d7bf7f5cc05a607e
5831 F20101112_AABIPW topsakal_o_Page_098thm.jpg
eb0c7c3f3f1e11eea53bef0281e56742
c86cf87225dc2f6c4dd4d308b7fb4320eee3b330
5655 F20101112_AABHNJ topsakal_o_Page_049thm.jpg
1997597230213a4e66aa4556eb2f040c
e2241f206bc6aff155edf02ea5b4cef0bcc206f4
F20101112_AABHMU topsakal_o_Page_019.tif
32e3ed19b7ec3fe9aa65301ed45300f2
026500378cb17ea0b09272e9d3828b6dec859dcb
23840 F20101112_AABIRA topsakal_o_Page_119.QC.jpg
760356fe32906502e8bd7b3b1675a9b1
434f5293f51d8a3da89a1ba230bd047117b23500
6209 F20101112_AABIQM topsakal_o_Page_109thm.jpg
3dfcce68464e3ccf58b703b033ccde35
ced0c20ab177be91c8f17800e5c3b1d0dfd68329
20959 F20101112_AABIPX topsakal_o_Page_099.QC.jpg
d621b7896ced12ac3e016d2626cbe345
bc8e0b3140f1d05547ae6b85a122e288ce243831
5901 F20101112_AABHNK topsakal_o_Page_125thm.jpg
9b989901ce0d6134747c4555dea54502
18f685ee7d5d38e442468748d6e0bcde90b6e012
F20101112_AABHMV topsakal_o_Page_002.tif
df82f163fd265d41f8dea3823ca92dda
b0cf43cd97d9347f3d0422e90305fc302409acf9
6034 F20101112_AABIRB topsakal_o_Page_119thm.jpg
b613d2bba2d335cb519ca8a74fa4a276
285af6bf8058162df7486eb718b3068166a41f3d
24098 F20101112_AABIQN topsakal_o_Page_110.QC.jpg
bf6a1ae363f2d899a25827283b410b0b
e21f7d278020f0a773f435da3ad6a263fe327036
5853 F20101112_AABIPY topsakal_o_Page_099thm.jpg
7bb0d8cc279d64cee4279f0ca46d7200
63d04fe5e1d1e46d617e96dbe56a00b0250704b4
86890 F20101112_AABHNL topsakal_o_Page_008.jpg
86458060194efed03347ac558ccae9ea
488a03e9df642697d6385c048895ec0cbdcad8a8
137487 F20101112_AABHMW topsakal_o_Page_121.jp2
e33c4b602d45ed14f64d0e714701ef97
693db8b00f6a0ff827e13a392718f9f7c184feca
F20101112_AABIRC topsakal_o_Page_120.QC.jpg
6f05c19e58dc3d9566076f49d7830bc7
528c63e8fa1770ba50ade865f4ffc99787ec56ad
24285 F20101112_AABIPZ topsakal_o_Page_100.QC.jpg
29186f732d62ae98fa3a8c694f615222
b71b21d6506242aa7d58b91d0a799b23854dcc16
23092 F20101112_AABHOA topsakal_o_Page_071.QC.jpg
04b95ce1d1b57a6220d9a81615e89359
40032307890c4c3b2e798434c64866e10ca59ff5
1305 F20101112_AABHMX topsakal_o_Page_107.txt
e55169bbc1c4beeeca1785cf3482eb25
008de4b8e5fd7a9b5a0f8c8fd751caedeac5dff9
5978 F20101112_AABIRD topsakal_o_Page_120thm.jpg
9e84a513a2ba27ebe6fc4508209a3a45
4627c1b74244ba8dd0a5600d6490a4b8447425cf
6164 F20101112_AABIQO topsakal_o_Page_110thm.jpg
5059dc2506bfdb9aaf5c5e6833659afe
319b04c852080df12795466dddc52320c0dca3e8
6174 F20101112_AABHOB topsakal_o_Page_013thm.jpg
5894250b2de5d12a78ed226c21603ee7
adb693ee492c82553cbe4bd0730cbc9d2bacb633
2120 F20101112_AABHNM topsakal_o_Page_052.txt
58ee17ef7b14189ef94e36734b35c00e
6a761bfc590eb8d237a6387f59dcbf092585fbb3
22876 F20101112_AABHMY topsakal_o_Page_016.QC.jpg
21311094f6a96d3880c6fe59f3c3a469
feac7ee747a39f2f7d2bc281cf92c6ac283120fa
22729 F20101112_AABIRE topsakal_o_Page_121.QC.jpg
f7ad53427cec50d46f789e9ccfed5139
2161cff3f54f607767aea1ec59dec5080d5a8af7
23921 F20101112_AABIQP topsakal_o_Page_111.QC.jpg
942c829f86ebbf33d15c987f55b48a41
a9df4342594d9cc3d41cdd6973bb47c6959f5ab9
5678 F20101112_AABHOC topsakal_o_Page_123thm.jpg
426b5fe381f75eab70aa13de3be97fd5
d21dfb0bc1131e79918138a54277cbb93bc5fff1
77906 F20101112_AABHNN topsakal_o_Page_113.jpg
2357e52de6d0ab9436fa1ee586341dc9
e2bc69c5182c9cad6f9a435103a3899a87068b6b
F20101112_AABHMZ topsakal_o_Page_057.tif
3d6178e779e52629a4ebe00b81ca340a
71c8b05afc04194a1cafd216cc9b7cdae6835763
5641 F20101112_AABIRF topsakal_o_Page_121thm.jpg
608aae38879733921477bb493b680156
9dda786b08703ced449eda634122690ae4b6bd5d
6233 F20101112_AABIQQ topsakal_o_Page_112thm.jpg
d75ac055380d56e07a7bf0dc5c3f605c
f4c31911505832499a59b4ae58a8494484aaa27a
F20101112_AABHOD topsakal_o_Page_069.tif
6557f498e694c0fd971246b8134435ce
a1bef7849f9e1db6fd07d84687767b18600b4ded
49296 F20101112_AABHNO topsakal_o_Page_062.pro
f60f2b08702e0198a8401ae019a34b73
1e7b8e4d5f65f04aac461b57436a99b480f2c506
21988 F20101112_AABIRG topsakal_o_Page_122.QC.jpg
4d1b431ef0e12ddf4ee28a906c135b2e
11e2c22e368ce6a6ae37685036e823b96d5c8c2b
23774 F20101112_AABIQR topsakal_o_Page_113.QC.jpg
fbc10cbb215e5561be6972bc3aeb6d8d
2927d2582a970e7247e1f5b19670af311135a47b
1051986 F20101112_AABHOE topsakal_o_Page_013.jp2
adf828f11d8066dfcec34b9880d220e2
ee6a7f7231b85b045d9a6cf42d9124bb6537eac2
58896 F20101112_AABHNP topsakal_o_Page_072.pro
8972934895f48876b737f625be12a58d
110b23a6b85ef6a155141b89b31138e5012512ae
5751 F20101112_AABIRH topsakal_o_Page_122thm.jpg
d6f01116f93938d30d0396ad879059b8
db72f3548b41392b67eb98da69949a35061c738a
6360 F20101112_AABIQS topsakal_o_Page_113thm.jpg
c596d4cbacca4f58cf7856da7eec1ef4
5068489da50c29bbe487ab02b119cfc6a39eeefc
5512 F20101112_AABHOF topsakal_o_Page_039thm.jpg
d541315c41cbbe71b9f77b8545565d37
3a7b60f8bbd8128b236e657b4bf23887708ef293
22361 F20101112_AABHNQ topsakal_o_Page_098.QC.jpg
11565eff35f4f6abe5fa24c53840c5e4
30ec58270c9ea02d5941f70b26cdc2c4eeb33008
22396 F20101112_AABIRI topsakal_o_Page_124.QC.jpg
54248b31c664a2bc8e61636221f4817a
e4676a4036a9e06db1aea032771ffb1038e0b8f3
21359 F20101112_AABIQT topsakal_o_Page_114.QC.jpg
b8d02e953f253ba386a374aaf2be0de9
f2fe0919e655941aec045e0a033e536f3a787111
6295 F20101112_AABHOG topsakal_o_Page_046thm.jpg
3f4fc49d1caa488824639a3eab8f21ae
883011b43339717788344c0391dac5c6ae6d6b69
70677 F20101112_AABHNR topsakal_o_Page_087.jpg
0aa4ac352ff7baba43f837e2701637eb
43e2d478a3b2b46ba26109ea7947eac80dc238db
5821 F20101112_AABIRJ topsakal_o_Page_124thm.jpg
45bb8e2d59ea711d2be298b7b004e9b0
c7bacb8051032191684da54ef56da14c92aec946
5927 F20101112_AABIQU topsakal_o_Page_114thm.jpg
03896a164ef03b7c4210cee224801b0d
1586caadb470bd13c084280461486260773ee0cc
1051974 F20101112_AABHOH topsakal_o_Page_079.jp2
a1925f0907fd2fbe73ca237283b45f4c
c144ac1618f79a376de64d5705a5c18eea958e83
22184 F20101112_AABHNS topsakal_o_Page_128.QC.jpg
8158f4b87cc26ad242bccff8eeeb1e9d
94af1ac9c151035ee1cab2794a00ff1bf08dc67b
23028 F20101112_AABIRK topsakal_o_Page_125.QC.jpg
c3530f8cced5353590b961092e51564a
261431237bb48b9d5f1025b19881deb537b0daad
17845 F20101112_AABIQV topsakal_o_Page_115.QC.jpg
b3204754934d8f4066b541b62b6d73b8
1bc609068cd34998853e1dcf08e51c8a90889448
F20101112_AABHOI topsakal_o_Page_059.tif
6edfd91f99111e0014b7547ed8fca552
2cc4d6455d3fe1338cc72fe18d23925db9abfd0d
2065 F20101112_AABHNT topsakal_o_Page_085.txt
94c21e7056cab427094df37e3f8b457c
85dc11e6b40187aab41cc7e54a31081f85bc5745
22094 F20101112_AABIRL topsakal_o_Page_126.QC.jpg
a49fcc1720dbb40e423df174208340cd
bbf601095dc572bdbad18b4675a6b81938fd48e4
6020 F20101112_AABIQW topsakal_o_Page_116thm.jpg
fe1a735f2df9fe4da9b04c2713ed614e
e50b573976a46b55c072d15db0106aaca31182d2
2029 F20101112_AABHOJ topsakal_o_Page_048.txt
c4272e86bcf81c71ea7cb8965dcaa8bf
61be8a28fb2b8e8ad0cebcdf17aab352d5cbde2d
6055 F20101112_AABHNU topsakal_o_Page_031thm.jpg
e93b748fea1625b37ce5d28a5658fb79
a583ccb17418cf453a59cbb183cc8e0245bdb53b
5773 F20101112_AABIRM topsakal_o_Page_126thm.jpg
5b2e176e2ddea3b26d503bdc9ae485f8
233788269ce522053128032c9bd312bc52ac08d5
25381 F20101112_AABIQX topsakal_o_Page_117.QC.jpg
233a35c3a3a755b8ef8ee99b82fc0e7f
00a010804db5f39f9fb4e54ce248713de8bc0cd6
F20101112_AABHOK topsakal_o_Page_007.tif
e95408c18d0749d67cadbc33580a1887
96e9c07694e441ad277795d2ec951e9e63b12236
F20101112_AABHNV topsakal_o_Page_028.jp2
984cbfac142635846de46946760d4ef5
d48aa66c00395fb2ccb9c9f7c304c6891ec9b56a
21770 F20101112_AABIRN topsakal_o_Page_127.QC.jpg
399bd01a0f197b4a4d5ed0531c0f5b63
a5a4b4dd22fd00da534dc5f0924305e5e17c51ee
6237 F20101112_AABIQY topsakal_o_Page_117thm.jpg
0e3f1e43f0bf54eeb319f7aa75b3aa59
48f7784bc8c0108bc1bb625e8998aba316c6fb29
58278 F20101112_AABHOL topsakal_o_Page_021.pro
4e7c1b03694062b9888062f9a6ffafa6
a61dd5a474eff60a1b08136dbe8c7101544a89f3
23017 F20101112_AABHNW topsakal_o_Page_012.QC.jpg
03eb3f9684787d8c14d0717a13949e90
51f3191c09b4784e0fb9e9ac3af6d007c7ea75a6
5809 F20101112_AABIRO topsakal_o_Page_127thm.jpg
d5fed3bac659baf16365dc396b9eda34
9f0aebb5cd75d5f40962a2f77ba52d20ee8760ba
5999 F20101112_AABIQZ topsakal_o_Page_118thm.jpg
3cf5ae073dbcd48fd3547de58d007962
6137e19e520ac72937ca8b40225a72c82d216b6e
80481 F20101112_AABHOM topsakal_o_Page_125.jpg
127f3ce43eb79d1a31b629206bf0a298
7877b5e968c209cbacf2e18b94d4590f36d91cbe
80783 F20101112_AABHNX topsakal_o_Page_093.jpg
25ea11e99c0e1f14ca18d465fb8e6e94
b39d5098029705f341ccedef7ade0fd2751373c7
2303 F20101112_AABHPA topsakal_o_Page_021.txt
60857b20ea2bdf4004ebd7c2f75098c4
a82820013ae5debbda344d9d9bd21767094d4afc
84666 F20101112_AABHNY topsakal_o_Page_060.jpg
325bfc1ff7bbc364603076992d976bea
e54f6273d7459fa76dd893e06ed51e9910af2ce6
62672 F20101112_AABHPB topsakal_o_Page_035.jpg
915d40a3c2cb0843420e03b5e8fb128a
96221a1f907274332dd6651305bee919ccaa53bf
5813 F20101112_AABIRP topsakal_o_Page_128thm.jpg
6d855368ba4f03603d4b3a0aeeb2a96a
f95d1da84f1c13e94fe6e0c78397e2d513f96fe5
52713 F20101112_AABHON topsakal_o_Page_109.pro
dc4150181d6509c48120331db701868f
d7156b3155b0ce49c5d5c63ee59218e36ae384b7
52126 F20101112_AABHNZ topsakal_o_Page_081.pro
aabb5e6c556a26cbe4440de1012a6412
0f5c2bc3c9a0627ee8e0e46b7fb2a1973b3a5481
F20101112_AABHPC topsakal_o_Page_128.tif
d03f46eef2889a286fca9762c9bdc773
f0e541e767fa3a785e030ac7a3a872ce66342662
3866 F20101112_AABIRQ topsakal_o_Page_129thm.jpg
11567e5360bcabe3f80826ce04909e36
6415dc687839c527916acf886ce50d8746454eb9
2373 F20101112_AABHOO topsakal_o_Page_117.txt
546af0c33eaf9671fad59bcd84c79efd
f6ddd4cd6ecd6d2f7ef8ff7c8bad679b2e366553
2322 F20101112_AABHPD topsakal_o_Page_044.txt
4ae75ce1ea07e2577137aa81bd9a159e
359bcc7df8fe9d2ecb343bebab5d8758d77449bb
14545 F20101112_AABIRR topsakal_o_Page_130.QC.jpg
e3cd597b9779e5d25cfa0fa8d9606920
7ceed260ad23a1fa86a02092a126a7f745389186
50969 F20101112_AABHOP topsakal_o_Page_078.pro
b30d3560c3e943b2f9f8d3e3e5ef24f4
7a0b5190456c76dae24b02b3f882a71a8cfbf7a3
20455 F20101112_AABHPE topsakal_o_Page_055.QC.jpg
6ba4466584b868104da271c3bdb39518
d03044d73bfdb8c39dcb368f1b89e27c6e23e001
151696 F20101112_AABIRS UFE0019342_00001.mets FULL
a088f3ec74ce245a332d706da5f99f30
12a20c7655b7db25f140f6825849f7f0249d489d
70662 F20101112_AABHOQ topsakal_o_Page_107.jpg
049729c9a1da218b6983bfda120adcc3
d5b276147a63eefada8a8aa8a48279f8cb364e1b
F20101112_AABHPF topsakal_o_Page_122.tif
6164f985f7cc569afdf09b0b77108ea1
3f3f1f8f16a61dadb372ce66e2d72dad4bb08508
22742 F20101112_AABHOR topsakal_o_Page_105.QC.jpg
f7b1d194d7b616fefbf74cc575d30ab3
850b8480baf50675ecef8d76bf8a8932dc2292bf
30901 F20101112_AABHPG topsakal_o_Page_054.pro
aa5756077cc859ec493207fb42645dae
12ff64a76042a5096e1e12ec4de9e195dad67d65
5091 F20101112_AABHOS topsakal_o_Page_014thm.jpg
e8b5bf8c4159eb4240af60b191e56437
477be42617a1e121209913983bcb61e6e6687008
997085 F20101112_AABHPH topsakal_o_Page_036.jp2
109de66224dead65231f8bd952d4f841
affec0bc4988e8ea843072bc83bc1ce5a3921118
53065 F20101112_AABHOT topsakal_o_Page_092.pro
90f314c7b190fda32a40318b8bbd9ede
dfb2ed87abb272d1107d90e14c25f8661169bcf0
52343 F20101112_AABHPI topsakal_o_Page_037.pro
77fc8196e721c32a798fc3b7ffccc889
1a3c9dbab419cd684a3fefacae7f552f0e5546e5
56130 F20101112_AABHOU topsakal_o_Page_114.pro
f4c473fa0d4fcf9a228f8d6b15bad0a3
2edfbe9398a810aa60e6023af710b9198c43b7f7
55612 F20101112_AABHPJ topsakal_o_Page_043.pro
ab5c28950661532a477741b4ff91943e
1a969291561462ad95dcad7c75dab437c0cf0155
1051978 F20101112_AABHOV topsakal_o_Page_097.jp2
df0b2ed525b30618acffde42446d1202
380903170eae96fcded456c7906f828283a3e087
1051976 F20101112_AABHPK topsakal_o_Page_005.jp2
ddec2082e1d55076011d8a846ec69c27
fa8c244c0a544c0d3a86c8c151f8310cce73c1dc
5632 F20101112_AABHOW topsakal_o_Page_061thm.jpg
dabaa5779fda8daa0f6bc03f3d58fb29
f65a5987be4bfeb0226601e947b2d9c0465cd71f
5814 F20101112_AABHPL topsakal_o_Page_094thm.jpg
21a072b36555e852c9241c2ef715f824
91c0d93dc77b5cf20b666269ecc1d8a889f3dd24
1760 F20101112_AABHOX topsakal_o_Page_014.txt
fa5f356aeca1b659e1b581543830a70e
c105bf86febba37de143f86bab57695ae8b297a8
6032 F20101112_AABHQA topsakal_o_Page_044thm.jpg
cf8fb8f233b384846979b32911be20d0
f9f6f99c5e9db8bb8becfdd981c78f12958b9352
41821 F20101112_AABHPM topsakal_o_Page_091.pro
3f8991aa1faacf8a1cfbaff7d8763da1
45d7f11bb76cbc8dad73a1f4d4aae62ae67c477a
2682 F20101112_AABHOY topsakal_o_Page_127.txt
1c740f583ae40c53930bcc4e9ac8d1cf
63a110e8f14f75e751af113679cf2929bf5e21f3
20666 F20101112_AABHQB topsakal_o_Page_040.QC.jpg
3fe5793cca019d5b9aa4da72e2a17797
fc1d8c9eedc53f2229a7ff7336a08955ab16ab9f
6299 F20101112_AABHPN topsakal_o_Page_060thm.jpg
205b7cb8646e4be6e03fd80844627355
8de087f3261d14bcb0d0a8fa9accbf5cc14d77ff
1051983 F20101112_AABHOZ topsakal_o_Page_042.jp2
2b31d339e19c0e5b1c7711a8ed03b450
af97d5898188c494207beaf40d3a32723f26e857
2854 F20101112_AABHQC topsakal_o_Page_124.txt
a877dba2c3004489996c4f9ad4895d74
70fbe314b5b3e012d6fe0b8ae1ed590895e97227
2856 F20101112_AABHQD topsakal_o_Page_128.txt
b66a36b0753ad56c73a54ceb0bd13740
b274df7d0b65219400300e8fbadc173ecc66ec0a
56117 F20101112_AABHPO topsakal_o_Page_059.pro
815c593f3329059fc43cdeaa20f6bfb8
bd4b43f8b87f9c6657caafb65edf45b1ec05cb40
2194 F20101112_AABHQE topsakal_o_Page_030.txt
db3989e976d1732de3b64d583a64d318
866d2976bb940a7371df8ca1b9f89d393f622615
4917 F20101112_AABHPP topsakal_o_Page_115thm.jpg
96a327296411d741fbd9f3a9fc780921
f766f3fb4c6c35e320655800ee040a61a94bd21d
24227 F20101112_AABHQF topsakal_o_Page_112.QC.jpg
64f63cb10c4521710cefefc005ed8f86
5e470784b909d5bc8034a62a7ab251eeea8e9e19
5437 F20101112_AABHPQ topsakal_o_Page_006thm.jpg
f043ac1d648c8f041e2cb00feb0b0bca
cfd1d6de900e5eea83323f979b600b242fd6ca45
18738 F20101112_AABHQG topsakal_o_Page_076.QC.jpg
1d32740e1e64e1c9603ec45ea2fe9752
41b42583f0b0820b04eee8b7c850514f31406cd4
5963 F20101112_AABHPR topsakal_o_Page_089thm.jpg
2bd2c45ec3c33e56b2da10d7bf3da7b2
227c549f11aa5e0603a872973e37fb449300e085
89304 F20101112_AABHQH topsakal_o_Page_004.jp2
a7a60f5d97f8a72a5f4e306cf86292b4
c04e82aa9402b1ed5da7e7705bbbb7622e0bbebd
25135 F20101112_AABHPS topsakal_o_Page_091.QC.jpg
66d7c28977188d57b7f66c58bf2dfca7
892cdeb651e0194c936f2b36c663686451cd7f3f
21724 F20101112_AABHQI topsakal_o_Page_123.QC.jpg
731424f90501c806554cc4ff9704a1b7
764afcc8709b67844bedbbbcde1e1c7de43c3d40
5855 F20101112_AABHPT topsakal_o_Page_093thm.jpg
7f99bcaed8dc9da1b60eaf2c01e43d88
580d98f6a5858695c7c70d19215a1bbd8af33eb5
F20101112_AABHQJ topsakal_o_Page_081.tif
b21c83acba8044399fcfb8e456deacb6
dfdf003bad4ec641060ab15c7c4e9659e5637215
23583 F20101112_AABHPU topsakal_o_Page_067.QC.jpg
84979667fcf0fda7adf47ed36e20c6fd
3d3cc242a5ce4679060ede4bd7bf475e3958b2ec
F20101112_AABHQK topsakal_o_Page_044.tif
ced591366516ed5f0826831dca20c82d
86faa5ab69d62c877480cc204dbc1485c35f5fd6
82141 F20101112_AABHPV topsakal_o_Page_029.jpg
818b251cb488686e22aa4f9057b15394
49f6e24fcd17a67b6636bbf04d1a2a28aef76c43
6197 F20101112_AABHQL topsakal_o_Page_111thm.jpg
fa4c6ee9fbda0f8468f1c6e599a27f6f
b7ee09cc1b653099680ebb52839df7e2ee5b1e57
22118 F20101112_AABHPW topsakal_o_Page_062.QC.jpg
0cf58a555a675f87d90ef509f7e2ac46
6a2a84b35f780eee326647f4bf648e77ec6cba74
49051 F20101112_AABHRA topsakal_o_Page_004.jpg
91165070adb497e191ecac50f37ab3ab
86a36155d6f7f8b3aeb746ba6b8fbcd9633964d2
85720 F20101112_AABHQM topsakal_o_Page_023.jpg
01e752b1d86e49141d90f3c56623ae17
a753891b35e595bbe1f4e063e1413eeb5bcff89c
75837 F20101112_AABHPX topsakal_o_Page_127.jpg
f0ca7277bd394e52c2435b04bc19d009
500f4e0e52a053827542889807ec51d12b2ac3bc
88203 F20101112_AABHRB topsakal_o_Page_006.jpg
1e633d098c902e768caa65d4e2815650
5445bc1e5bc0af89c9e08c65a494cf691524701b
21766 F20101112_AABHQN topsakal_o_Page_104.QC.jpg
78374cb24919b2136906ac09486fe48f
6573c4871c59f80d6aa1baccc22b1be3ec1ce38f
24731 F20101112_AABHPY topsakal_o_Page_118.QC.jpg
21e1f17263c95663e14759056d102e73
b198c90c28dae997ced45d5f83c53a1cc9cf51e1
88636 F20101112_AABHRC topsakal_o_Page_009.jpg
1a7712783cc7b1b0b34558bcb990a9b4
bd16553a053b5cdc7b324a661b3eec74e9c3ab84
98126 F20101112_AABHQO topsakal_o_Page_033.jp2
4388a950835d6b4280aa0bf721eef020
5e16101d1262dc9ca83d670e0421769c6ee95063
53963 F20101112_AABHPZ topsakal_o_Page_012.pro
cdfd9814e826f2ab89e62449023973f9
7f7fc168910476229e79f8e147b2bc6387fd0649
63188 F20101112_AABHRD topsakal_o_Page_011.jpg
84b87f1c417f358e232f99e8b0ad8c17
b4d63433fdc0de0e2948d4151f32a5f6999c1877
76855 F20101112_AABHRE topsakal_o_Page_012.jpg
c0e4a741233556bf0b4e2ce9c07a23ed
dad83aa9977e873c32adf7f7105f728e67b82af0
2294 F20101112_AABHQP topsakal_o_Page_032.txt
f29086d43581beb104e379195fe46329
0c772133bf9192fad376a91df887a2ab0a1ebfef
80071 F20101112_AABHRF topsakal_o_Page_013.jpg
e48c260bea32cc1fc987c5ae09281fbf
773f66a310a1b61844ebd7f69e27cfc702f79c5d
25057 F20101112_AABHQQ topsakal_o_Page_021.QC.jpg
8768784b725fbff3b2345aaefb7070cb
7a0826d4fed02776bec684a0d3c8fea16fb2aa56
58861 F20101112_AABHRG topsakal_o_Page_014.jpg
628db719813b0b91e013e934fcba1bca
83590c28ed4e0dca6364a0a1a5507b18941ab0f8
78676 F20101112_AABHQR topsakal_o_Page_032.jpg
c2fc9c5e2af5a24121b1f57862fefcd8
3d4978f6db3e26c856d428f7826da7f64be75fbf
82562 F20101112_AABHRH topsakal_o_Page_015.jpg
520802664e388d6c8dc91958f97149f3
023a13a1b03fb4854519daafc1b5182b7d944b55
6871 F20101112_AABHQS topsakal_o_Page_083.QC.jpg
84126184e330016fc844eb7a12e76ea8
290e93b1a00be5af9da62e3d5867012f34809d9c
74616 F20101112_AABHRI topsakal_o_Page_016.jpg
5507b773b04097fb4b8f466bf87b1371
bcb2b7bb13add0f2a55de55ceaf5a1e47418768a
70362 F20101112_AABHQT topsakal_o_Page_048.jpg
26cab65cfdfb1f9a985139761e55328d
fe43f8f161cd27a348cc5703d1b5aa6d5205136c
80114 F20101112_AABHRJ topsakal_o_Page_017.jpg
64a0a197419654e520bcb96baf473709
15d4e828efdbaad05c95eb79bab363e53d9dda55
3639 F20101112_AABHQU topsakal_o_Page_003.QC.jpg
2bb2abff648108ede800df3ce1ca821e
4e89ec547d48856c72ba94fe59324a16f381af1b
75690 F20101112_AABHRK topsakal_o_Page_018.jpg
3c954d1b19f46431c3ea184512ade5aa
400c4273a28688e9c281baa3e299afa5c32b5209
196723 F20101112_AABHQV UFE0019342_00001.xml
2b561058b2c54982d093445d8ed40662
cf4a86f2bfeeb1f98815aa4aa580e6e7d4a5ba47
75285 F20101112_AABHRL topsakal_o_Page_019.jpg
e60dd3496904a442bc473c685bc81aa3
1cf7b394b3926e75ee479eaa9cf2d51d4862ca35
67769 F20101112_AABHSA topsakal_o_Page_040.jpg
c637bfd3cb7d5cec466180477258f45e
4628c30cfb0598b82c99c6928247b6a07f87982a
70297 F20101112_AABHRM topsakal_o_Page_020.jpg
35ff677940abd18915ab17b0c1cc3ec9
915592783141851a1de8c2becd4944ed19159eb2
21331 F20101112_AABHQY topsakal_o_Page_001.jpg
0fd535851cdb1877a100cd6b457110f9
a2ca6417ac792d4ceee2be341a7c1ad98883ccaa
72926 F20101112_AABHSB topsakal_o_Page_042.jpg
708d0bec32991e65f213ab33a6da50d4
722fd8cf604d94632d428032811c1db8c28e2de6
80755 F20101112_AABHRN topsakal_o_Page_021.jpg
76230070f92ed0c7d3edc5bfb7bf5e00
5d1f3af7f6b62fa6451f891842fe92ac4e1e6ad3
11341 F20101112_AABHQZ topsakal_o_Page_003.jpg
cc61a02ee42187a5a349af24a4156e2d
7df1b9a979a05c3d504653f325282467d33e773d
77858 F20101112_AABHSC topsakal_o_Page_043.jpg
a06c3e68912468bf2191f90354e561ec
c6313e7f71e704c3c62df0205483fd28a4e59cd1
83647 F20101112_AABHRO topsakal_o_Page_022.jpg
c649ffe0a81533e3cc4a53bdac4c0b70
10b3726db4c819c07fec850a4753fe0fbc3ae3a3
82121 F20101112_AABHSD topsakal_o_Page_044.jpg
f2cf1b337136df0def9ec81b7c41109d
b04b942bffc84eaf7bb2e198878934c388449901
77195 F20101112_AABHRP topsakal_o_Page_024.jpg
2a7f9c77d58d050e2416bc34f11013e7
ad9c978f671800f902147958d9ee6252b650b016
80950 F20101112_AABHSE topsakal_o_Page_045.jpg
8acf58628b67c4552ac205e37dbf94b5
aef3b97c2aaed41edb784fe4b7a1288dc36b828c
83140 F20101112_AABHSF topsakal_o_Page_046.jpg
59385436e8a0c2d2b8c68942a0fcfc30
50f624b12432ddeef2e179d7819339906f4f01b2
72469 F20101112_AABHRQ topsakal_o_Page_025.jpg
4310abc0532aa5aa74fcae60b5a2f7f4
bb2e3115503da4da736c64518ec2a88173f5b631
73232 F20101112_AABHSG topsakal_o_Page_049.jpg
a1153e6aa1d664be37d98a7884a47d39
d77be2f5a1cb638a178551e50a8b6325deb4c496
71115 F20101112_AABHRR topsakal_o_Page_026.jpg
b4d3a16fb33c7d5b0775865dd821637d
bcc26bcca7c20b9a0bd93de4294b81a4a22d2c90
76329 F20101112_AABHSH topsakal_o_Page_052.jpg
ccd6e6fd641b27acfd6176fc85d67498
6e815b143e1639f8024171915a633666a900d4a5
76957 F20101112_AABHRS topsakal_o_Page_027.jpg
9c609c6b25a95b60b4062049102f0c86
fcdefba0d3d7c6a6262dda64bb6bb7e65603cfe2
74958 F20101112_AABHSI topsakal_o_Page_053.jpg
44766177adc2cff057cc7067657c14e1
6aec737d97a214b9c20f33a946ecaf5cc38d7703
75402 F20101112_AABHRT topsakal_o_Page_028.jpg
2b9f33ea386c48d2abb024e1cacb6a4e
4968d8265d832abe386fae346a071261594a43b7
62326 F20101112_AABHSJ topsakal_o_Page_054.jpg
f0fbb54690b21286a36509e8ab872aff
5195ac323163c564e6e17127269daa1a6b518d0c
77324 F20101112_AABHRU topsakal_o_Page_030.jpg
d5dcb36874436329601d4e682da1625c
b9daf3ff9c11ce5bf6c3eeeb8b2c6f15ef993f2e
65615 F20101112_AABHSK topsakal_o_Page_055.jpg
809113b92a1247f83cca7d9837f05a06
acfe360284cc64a9dba340b22ff697acee9fc8da
81872 F20101112_AABHRV topsakal_o_Page_031.jpg
1486724dfa61f906c3b3a6d4b82e1695
4bd7b213cd97a993a0857e91aa12c41322da9d23
69371 F20101112_AABHSL topsakal_o_Page_056.jpg
8426660fb878ef4b3a9d874a41bd7019
3a33af861d95216cf8474d233fc54275989e9a73
54893 F20101112_AABHRW topsakal_o_Page_034.jpg
da5d570b519cf169223926416136f8d8
4360e71203869b0f91dea50b91f9551a23d3d441
69874 F20101112_AABHTA topsakal_o_Page_075.jpg
c09e9b8bea319ae8d4e893a45ddfda2a
f31026152a8d9560cf753ffbb09fde63f99e291a
74163 F20101112_AABHSM topsakal_o_Page_057.jpg
dc7d92a93f274ee7eddbf1cbb7a9da4d
6b6e8b10999470ebafd81487ed4a001030aeb15a
64991 F20101112_AABHRX topsakal_o_Page_036.jpg
e6cb9bf8d4baf5a9d9fc70de733937ca
562b9776eb0b3b3acddb7a5669a163e27526ba91
60355 F20101112_AABHTB topsakal_o_Page_076.jpg
62fddace71eee1ef65c35492701f11cd
55f465b1a4a735fbb9c23090f124a9c62c3e3c0f
78905 F20101112_AABHSN topsakal_o_Page_059.jpg
6e4958d5f229dc633a869c0f086b083f
c7d41a41857eeccdd3171de9ca4e48473e236407
69638 F20101112_AABHRY topsakal_o_Page_038.jpg
997b1ed1082aec738014dc3199de7e01
812d7c80c9c6170155f5bf9e48755cf9d255da6e
81944 F20101112_AABHTC topsakal_o_Page_077.jpg
25bf54d9c1ad203938ab14b2fdd56bf5
9623a19b25f765811c23ae0423c5bc2d1e70ca50
63505 F20101112_AABHSO topsakal_o_Page_061.jpg
a702133348b8639561ca156d6822af49
9be44f830e8e5d0e0f0eb66f2f12a30c21d7ec2c
70126 F20101112_AABHRZ topsakal_o_Page_039.jpg
ec25d977b2747524bd8d1f218d1ab1e0
45cdf6e17a956d58e57ef5003afdf6dc32d77bda
78841 F20101112_AABHTD topsakal_o_Page_079.jpg
1fb7de5c36c37be1c35508fd490de5dc
42b93a71e5169e1631ab25becd5958b73ddabd11
75753 F20101112_AABHSP topsakal_o_Page_063.jpg
5823d3031ccb9ab1d0cf931f18a586bb
31f14f3a2ccdf7e566033c7cf5d7da94692704c7
72764 F20101112_AABHTE topsakal_o_Page_081.jpg
064e8a43ad00a302c563cf7a7d051e15
8184f2faec97c4e8d045a409f5e880f92a749557
70746 F20101112_AABHSQ topsakal_o_Page_065.jpg
cebd5c211ba37db8a52cce3594eb8f3a
ab24aef9c83b77cf0bb76b61d02500d0466675f7
58927 F20101112_AABHTF topsakal_o_Page_082.jpg
12dfa839403c08fd037bf98d33aaa34a
5a184508c244cff52a14a3f740209f763da71e50
63084 F20101112_AABHTG topsakal_o_Page_084.jpg
f8c98445a24d062f35b467d21ac60cce
36a6c7eba7718fc2f82ab2e07638c9585d0b0ef2
69069 F20101112_AABHSR topsakal_o_Page_066.jpg
808a34ec259fce37f15a5c378bdefdb4
33a69a59b979e31adae520936209ce387137c1f9
78235 F20101112_AABHTH topsakal_o_Page_085.jpg
28f00a1b7f853141885720aafc5c12c3
2d0bb360285c392f053db2d6a85029e647efa728
77191 F20101112_AABHSS topsakal_o_Page_067.jpg
d779b385ba265bf1f265690c8f3ca529
8bc9f316625874f602d092691311d0be30263edd
73265 F20101112_AABHTI topsakal_o_Page_086.jpg
74b3065ed731fa13d6f623260e443041
122cb8f589410c960b43a7eb7e2c7ac10e1fbcb3
70701 F20101112_AABHST topsakal_o_Page_068.jpg
d298394b2f505b9fc5ffbcb16a97ac12
90cbc0f55771d5ddcbfbba7a9a75d418557d9f0b
51666 F20101112_AABHTJ topsakal_o_Page_088.jpg
d5ff374a6b7bc9980680f3746e970eaf
c8a8c88b99436dcc4803e02ae79da03ed85fabf2
74558 F20101112_AABHSU topsakal_o_Page_069.jpg
562deea9f3a0dbb7646726c5e7a15bee
ffe2e5437f561eec4e72ec82d2236cbcf8fd3f1d
72906 F20101112_AABHTK topsakal_o_Page_089.jpg
48ca5d0db7c6c284c337ed494b2e69be
972b1e81321572f1683f88a206c805cc701ea077
76346 F20101112_AABHSV topsakal_o_Page_070.jpg
a07f536e92797e23e679eb32eee8ecfc
f3add4713aa7ffef5607aa96204dbd40d6164d6b
72569 F20101112_AABHTL topsakal_o_Page_090.jpg
8dfc104b743ec1215845d64cfb86eb5d
bbf3f4c2133d29cabc16a1d1fb0bb85597063dc4
76633 F20101112_AABHSW topsakal_o_Page_071.jpg
76dcf5aa1a544a190475f8dc8ea7eff9
ac38d80e5c581c5885c9a18849d152d32b0f4206
79674 F20101112_AABHTM topsakal_o_Page_091.jpg
b160cea392842a8dc8bdbf01e4bf559d
f8f94782bac17d7c56a0a99a5866d85453a31a52
83051 F20101112_AABHSX topsakal_o_Page_072.jpg
c725aac9fbc0f224b306f4a9164e0b3c
de4a771954bed7206bc15d0c9963869c50ac6ecf
72510 F20101112_AABHUA topsakal_o_Page_108.jpg
0af7bae03d07eb18add0f53a506ebf59
2562050a8dcf8abc6e5cc9e970f38dad5cdfb12f
75211 F20101112_AABHTN topsakal_o_Page_092.jpg
1572c5e004271aa45ba72bca6d43316c
19ae0129c573312b66d35ded0ab352ed65797375
68443 F20101112_AABHSY topsakal_o_Page_073.jpg
19f248a17e9737b662af966e9a68129d
30bf89eb63ec6277910088553be798b7b6e849ed
76911 F20101112_AABHUB topsakal_o_Page_109.jpg
338e9e87477abb308a49b70fb4ded19e
643735a42f07176abc46443b490774c3f3cd579d
73719 F20101112_AABHTO topsakal_o_Page_094.jpg
6cc823562ac3ffc7f76094128ac4ca2a
ac2bd871692d3b80f8b28f4975755f3212ef1612
54221 F20101112_AABHSZ topsakal_o_Page_074.jpg
b8fe5633a863543d16b97a76fd345c4a
84e0041b74b6a38e4beff1f7f5f89de022fb8a51
79548 F20101112_AABHUC topsakal_o_Page_110.jpg
d9e1b12f75d68869953f6ddf32012306
71020bb04326984207e9c1287cebe15b8ef1835c
44071 F20101112_AABHTP topsakal_o_Page_095.jpg
f411bff5a88b16e6ccb7044976e6a5e6
8c4c14d64d0c909c83be79973e30c6d6535d71a6
78728 F20101112_AABHUD topsakal_o_Page_111.jpg
c0102f22cd8d94aa3d2d1b0b4b2076f3
7f56b78663ef754c198767b92392d56b24f553c1
76342 F20101112_AABHTQ topsakal_o_Page_096.jpg
fd291599df15296b9b988f5eb3b3648b
4b05a17fcfca17f34e4d30605c25e26fd15e5563
80405 F20101112_AABHUE topsakal_o_Page_112.jpg
ff6b3b99cffa45ebb2314cf64c8992f7
b31dc5986e4c8780fbaf84be9a865af033f92846
85799 F20101112_AABHTR topsakal_o_Page_097.jpg
79ac4467dfbeb3fcf0be2d580b44192e
0308d306a3c256c0527e1d3feaa7d7254c95c55f
70836 F20101112_AABHUF topsakal_o_Page_114.jpg
f8265b9e4a341ac70bcdabb54bcaf151
474c3f73c6f2e9ac0ad97200e8ae6baf7d5101c0
F20101112_AABIAA topsakal_o_Page_040.tif
6490e044959bf1d12706aebd355fc1cb
78a5657e616e573e1b007581707bf86fc9e0a9a0
58338 F20101112_AABHUG topsakal_o_Page_115.jpg
f44eff449a7e834cd8a5823df3b7ddd4
5522c13a86397b69875f222ce0e03d97a05be71e
76064 F20101112_AABHTS topsakal_o_Page_098.jpg
56d95ec863af7931dbdd704c15cdf9cb
f08f96f9a2783c595b972ee71627f2846cfbe5f0
F20101112_AABIAB topsakal_o_Page_042.tif
fe53478f673ade43ac1f576d9880650a
3fe23cc3e186434fe5df965bfcb145f20205369b
77293 F20101112_AABHUH topsakal_o_Page_116.jpg
eff652b74fe951d31037948174ce8429
89f35cd815443568c5ebcc75a8e3fe87f4e04742
81560 F20101112_AABHTT topsakal_o_Page_100.jpg
54c062013ed9b9a73ec60dddf6efb5a8
318222af05a05d6ac8ca7233e441c4a340f25e51
F20101112_AABIAC topsakal_o_Page_046.tif
222e20d7f2aeb0c831af1bf333cf6d82
3756da084c733a45e45babb3b589411a384a08a6
83396 F20101112_AABHUI topsakal_o_Page_117.jpg
408e1a202a45d6bde8517058b29ada19
4907752dad324e78555fb86a4916076f7d41a001
79931 F20101112_AABHTU topsakal_o_Page_101.jpg
a38760298374b38b2b5fbd081d8318b6
fc85049d08c18c44162ea3436367ade0a4b331fa
F20101112_AABIAD topsakal_o_Page_047.tif
a52b844dae14aa213523ddb793f92ab6
a9e7667536984b7c4e17a7d525df7bfcf25dbeda
80858 F20101112_AABHUJ topsakal_o_Page_118.jpg
6117a40cb19f373cf4bd08636807c5e9
7c0b8c5f1f3482d08ed79191564b124365561cc8
63684 F20101112_AABHTV topsakal_o_Page_102.jpg
59224f2fdc8111a701d3b58a77bb69c4
690f23df428923f63658393db50d6f57ed8a37ab
F20101112_AABIAE topsakal_o_Page_049.tif
a8101904e2bc1d19628156e55781800a
4d032b0d2a848b34532e3e1e8bb459b626d3e35a
78086 F20101112_AABHUK topsakal_o_Page_120.jpg
9883d50eb2950b79d02ca6f857501fd3
0c7c8902f681fb2e40cafd882fdd6e38b5155012
45213 F20101112_AABHTW topsakal_o_Page_103.jpg
bebba531afba293019ac64b5641dad63
0969831ed4ce6343b084a335dd63c865bde31b3c
F20101112_AABIAF topsakal_o_Page_051.tif
6e6c12813dc1efffc520a662f8abb0d0
01e309a9a2caa986fb824e3032c61d22769bf44c
72222 F20101112_AABHUL topsakal_o_Page_121.jpg
661a8ca47f180a129a83673b49324510
e7ac635165f9ce211f499d8a8d542f65fa152872
71987 F20101112_AABHTX topsakal_o_Page_104.jpg
2e9bed5b60ac51d1f489ab2e9f55c8ea
07e38b859471d7fe9a1712b99dcffeef13a2ad3f
F20101112_AABIAG topsakal_o_Page_052.tif
0ccae3aaf5081fe15378377a852147a4
bded27734a6f942be487ed66ef69af005d147be1
1051961 F20101112_AABHVA topsakal_o_Page_012.jp2
7fc2bdb798336a5492f2948e792449fe
6aa958943fc4bedf27935673af190932ad7afd17
76673 F20101112_AABHUM topsakal_o_Page_123.jpg
88fc4f8535591400881e335559eb0b0e
e3d70061da32d22fbeb37669f78ae3f4ca2d3c60
73405 F20101112_AABHTY topsakal_o_Page_105.jpg
75a93bb08f202433617cca78be42a216
ade5dc0fdadf1ba045097d138f7041c0c708b7c8
F20101112_AABIAH topsakal_o_Page_053.tif
cbacd4ae59dab1529dbe50cc80d8e0e1
2811876ab7f5345e59a8d6a809aacd97c6cb7d8e
863598 F20101112_AABHVB topsakal_o_Page_014.jp2
c9db7e6be6a5fc1ff8d294dfab1fd112
f95f396e2b7770039534397583a80e0f028175dd
79323 F20101112_AABHUN topsakal_o_Page_124.jpg
8d83a7024ab99afd17aa300093e4297f
3a2c7ece1a0049692f398588d7ca6f5387f020e1
79476 F20101112_AABHTZ topsakal_o_Page_106.jpg
7aff2e480b56a009d01488ec1e96002d
b66f33d9101c6689c62cede32b0ca3b9cb2c5c79
F20101112_AABIAI topsakal_o_Page_054.tif
a23293469afc2ba2fbf75e71b546786c
dc750b8c618a683a5e6034ec415c94e378bf82aa
F20101112_AABHVC topsakal_o_Page_015.jp2
6d6d958875754be749ff673f95dd066a
d12e0af941ffdc1e29f3c89877c2aae588f39410
77788 F20101112_AABHUO topsakal_o_Page_126.jpg
0ab13034f43c417b8a9cc4deb102e2ba
bd8a7576d8df4435885bd76597280073d0805ede
F20101112_AABIAJ topsakal_o_Page_055.tif
43551ddc83bbd3a4488283830aea857a
312e71ad4778bb2acec623423ef5404297a9977b
1051957 F20101112_AABHVD topsakal_o_Page_016.jp2
3a5efe3ab2a4673d2889405c977cfb60
3cdbe35edda781ef9334c209b131a92ccaf93c30
77961 F20101112_AABHUP topsakal_o_Page_128.jpg
207ab5d24f5e4c031cc97464a7c0742b
d1f243f2a3fc78424f693de2136caafa5916079f
F20101112_AABIAK topsakal_o_Page_056.tif
23792de1944ba6da196499f80723f3b5
0b1b3f5b1451c89e0b614f26f7fae353249169ea
F20101112_AABHVE topsakal_o_Page_017.jp2
c84443ea0ab2997b3365312a2676a877
848597a4a7cdc5a250228608bdf3b12828e496a6
49520 F20101112_AABHUQ topsakal_o_Page_129.jpg
e4d77afdaf3388e35d007b98aefc5039
c8b908b6b994c7d140478cfbb7038f7203cd487a
F20101112_AABIAL topsakal_o_Page_058.tif
f9c1f51b8f27a0af994882c899a2a2ad
7e6f5abdb076095173eb7ac1381adedda2f3419f
1051949 F20101112_AABHVF topsakal_o_Page_018.jp2
f282fe75feeaab677dd7460616b8746b
053b687c240293a6e59cccbbe463291bebd2f0bc
45238 F20101112_AABHUR topsakal_o_Page_130.jpg
3348c4323ebf0995f6cdd1fb7127d60c
24a7fa0388f0f75c4de942c465aea21e32ee857f
F20101112_AABIAM topsakal_o_Page_060.tif
811fa88e8497079354bd8c2d14496ecb
b21f2fc552c0a060cd89abac12cb06389fafc537
F20101112_AABHVG topsakal_o_Page_019.jp2
0ce5d1ef0ebd52c5c779d88baa69c677
a25d87013a96bc878f8e7982b0875035dc00f144
25970 F20101112_AABHUS topsakal_o_Page_001.jp2
ba2506d64ac60d550e4c719a97ca1356
37d96b0ef444729efa5f66b8a57ce19e704c6ce0
F20101112_AABIBA topsakal_o_Page_077.tif
c3be2320d5365bd5728d303b5c42974b
678276574934c16d8a7617105df6e55facb5840a
F20101112_AABIAN topsakal_o_Page_061.tif
904fbb1b3a2730404fa730ac3dfb35f9
198a624627dbb9fb6bddf53e21857ef65a354eea
F20101112_AABHVH topsakal_o_Page_020.jp2
43626eae1b2e0a6f841521421280e2c3
a53c2c9b6fc86842078e22336543f186da931b58
F20101112_AABIBB topsakal_o_Page_078.tif
30de5ae3c9181a87b4dbf1aed87dbb12
577a3e39b35f5543771d933e183f381a078b0a17
F20101112_AABIAO topsakal_o_Page_062.tif
525a9690cb92b99fa01a9503e9ea3e70
2bbfefd2f0d1229899c48fb9a1362df9d4178045
1051971 F20101112_AABHVI topsakal_o_Page_021.jp2
8a0685af16733a798ac3da497cd0b33f
2637c29c2656626f41692dd07dce692ffdb3de0e
10474 F20101112_AABHUT topsakal_o_Page_003.jp2
52963173d2c337c54b1b65aab50309f9
609d5fedfaf2f296a8923e2007a937c9294b610c
F20101112_AABIBC topsakal_o_Page_080.tif
a604a27f9d842d44a924e2fe5c65d069
79229c0c2914fbe8747b5a909a1cb6f4f3df77f0
F20101112_AABIAP topsakal_o_Page_063.tif
ed376a4491d1a5693d96eb97de28b9cb
bbbb79a275d89176bca408c23c6b871b802c97df
1051937 F20101112_AABHVJ topsakal_o_Page_022.jp2
bfe32d867fcf49fd2227a0b8897773a7
cbfb45defff8d246d572df5527fd1e1ae6c7fad6
1051981 F20101112_AABHUU topsakal_o_Page_006.jp2
166b401c17e8d4b575f5891f7a55fe05
563ad1ffd9715f2cf6ea8a61cdd06a2ec6c78f5b
F20101112_AABIBD topsakal_o_Page_082.tif
2648fbeebc54d2e9ea9eba4f3852343d
3a8922c681a54b14a871e9218125ce7aeb382378
F20101112_AABIAQ topsakal_o_Page_064.tif
ec78b2756c89b8af1f79619b99e47ee3
e97945d2c231863bf875b51b9e7af608e03638d0
F20101112_AABHVK topsakal_o_Page_024.jp2
f0cfa88c08b23ade76ba60583df377b2
8d7b5658ba33353f66424d002522913eaa6d8ad8
F20101112_AABHUV topsakal_o_Page_007.jp2
7682bf3a13df99236ae52bff0e05e033
45c6356a227dc31ea9a470bacf064ee396e72e34
F20101112_AABIBE topsakal_o_Page_083.tif
3e1abb93da453bc062846fc3f75d98e0
0eac322a9e7db207c257c9fe78972fff042d94e4
F20101112_AABIAR topsakal_o_Page_065.tif
6a918304c01cfb50d5514e078fbd3704
ede000a0676901d18639e119a99a75e1bc84f513
1051972 F20101112_AABHVL topsakal_o_Page_025.jp2
1b1a95256c65237d840fdb7d4be234b7
bcae1334e86d2e81ee71a78fab04dcbcdce0964a
F20101112_AABHUW topsakal_o_Page_008.jp2
2e5259639447673fe21dbde99d2a601d
aaf5e19d8b417c86bdf1e6d9abcf8d1d7cbeb5e4
F20101112_AABIBF topsakal_o_Page_084.tif
8ecadfd3e50950e243b8cfb35331044f
5e126077935161250a9dc79ca6d04d136701c883
1051968 F20101112_AABHWA topsakal_o_Page_046.jp2
627e5be197a6e589bf1461d853a3ffbb
870692005b1e936d30e03b38abf7766e9d629c67
F20101112_AABIAS topsakal_o_Page_067.tif
43d66507e6cfb2352acddf6af7796f07
d2127cdcbbc1b93c68984fdc0e6571f462432ca2
F20101112_AABHVM topsakal_o_Page_026.jp2
36b6e9cd981f66779986f1c404f7e99b
36c875a23eb1cd2c44b4498617195b70e5992b86
1051985 F20101112_AABHUX topsakal_o_Page_009.jp2
0f68833bf97a2ec88656cffe9458539d
605bc8e35b9b5c5667e1a6f7fc7634d7678323e4
F20101112_AABIBG topsakal_o_Page_085.tif
3013c2dee3c778a110cbe30422b78317
1504da04ce6f66d7985089cd422fc088bdfa6afb
F20101112_AABHWB topsakal_o_Page_047.jp2
9ed21b784853611c879a9d615d4a1e47
5274b3e22f5fff4c5fe3cc37f651c87aef97b3f6
F20101112_AABIAT topsakal_o_Page_068.tif
96d09f2a4c944665985076a9749b6ca8
abb592f40bb456574acaa71d3f5290e4eabe4310
F20101112_AABHVN topsakal_o_Page_027.jp2
a5058c0f3673ee7000176b6fac36f17c
4ab66506d5d1da74ed413d7032f79ffc6f456e1a
F20101112_AABHUY topsakal_o_Page_010.jp2
6cd62f0db773149f7309e6d83c8443d5
51f58cde4400917138f9dfa5e9c3fea4abc242a4
F20101112_AABIBH topsakal_o_Page_088.tif
3b55bf80c408653fb4019f4e56148425
1ecfc726739551b78d231d9b327f5357f7223016
1051979 F20101112_AABHWC topsakal_o_Page_048.jp2
469e4604eb2850cc89f2641a6b74c271
bfd1833acfc232b7addacffd65d50991f144b9b3
F20101112_AABIAU topsakal_o_Page_070.tif
876c911e707478ea8a95cf2746a8dfce
6bd300c0d36661589346e0986edd5ed92caf3cce
1051950 F20101112_AABHVO topsakal_o_Page_029.jp2
d28d266cf1b4e76474e9f5ee38f73381
dfa218cb55f4f4a66ee208c2d5a9508e11714155
122839 F20101112_AABHUZ topsakal_o_Page_011.jp2
77c15128122dc50be37fd05b43e66c5a
db2bffc1f43bfdc6e9394679af1506deebae5ecb
F20101112_AABIBI topsakal_o_Page_089.tif
157b7336a0c6b6731442134bce81f680
c30dce835875148ca088e7f0dea1dbd86013fe0a
F20101112_AABHWD topsakal_o_Page_049.jp2
34b118b5e749df5a62ec0900beccd5ad
8db629d2e7883062046eb63992cdfb504df3dbe4
F20101112_AABIAV topsakal_o_Page_071.tif
1cb79dbef98d3af58da689472d6789dd
f8cae711e096801e20cda286c78c2cd093125e89
F20101112_AABHVP topsakal_o_Page_030.jp2
ff37f4d7a305c38baae969e3a4ce4bb6
e533d50efaa968754a0de1b30782500adab2ca9f
F20101112_AABIBJ topsakal_o_Page_090.tif
f1448623afc7235dec235c654e9ad365
453c6c61cd938c2dd40008a60f3bff9358d2f84b
773190 F20101112_AABHWE topsakal_o_Page_050.jp2
ec6aa005d41bdda1505b141cc817f93c
b1d04726be4adf3ea7dc6a0635f84838348e99ee
F20101112_AABIAW topsakal_o_Page_072.tif
a639f958d1a37373fc817b21676043b2
3c315e2d7e614a75a10441c9f43a37f5870c0723
F20101112_AABHVQ topsakal_o_Page_031.jp2
67f74721b074a44a8dce9cfb8db51428
3328751a83d4cd178a839c683e02acd9281b5a09
F20101112_AABIBK topsakal_o_Page_092.tif
ce40476b9f554b004031e5b072b89e60
a82d24ddb613b9c7b22cb4fb4abab78caf8ee2ae
F20101112_AABHWF topsakal_o_Page_051.jp2
a897c30fd64bdbc3bca4135dd1e687e4
2d24e376c214a3d5262056941ce606e27ce0628a
F20101112_AABIAX topsakal_o_Page_073.tif
d65ff0480621f5ced76b58020b0c28d3
fa35d1d29fc0cdd437ba81b51e5c8986a5c35971
1051966 F20101112_AABHVR topsakal_o_Page_032.jp2
eed3a22ea11d0e31bed126ef5a508217
bc6ec8464ef1c122adf436591d7edf59f78fc69a
F20101112_AABIBL topsakal_o_Page_093.tif
04f7fc65348c2bc93ecf1d45d36e60be
ca141f5615bee6b70a277b105bd97936808ee84c
1051965 F20101112_AABHWG topsakal_o_Page_052.jp2
cd50a0756a1ae75557325c2ea18d42db
f473a2435603ab3c4e863b784edf565f51ca9b7b
8423998 F20101112_AABIAY topsakal_o_Page_074.tif
b3dccb6317f69111674b3080b8591c83
c8e6e715b33233d69cfe2cb2b58cb57726541d69
814133 F20101112_AABHVS topsakal_o_Page_034.jp2
823db52322c4ce41ddd87666c790fd02
8feac9a363c185fd8c25425f452acc288ea511ca
F20101112_AABICA topsakal_o_Page_112.tif
9db054f99c11fba77a67889796989dd2
681d244efd301fb1911414504e42b4f2291273f0
F20101112_AABIBM topsakal_o_Page_094.tif
b1c39dbc5a2906955cdabe03290adf4f
4d31d50b8edee433b7a559436c9d85dae0ccce1f
1051915 F20101112_AABHWH topsakal_o_Page_053.jp2
9d62031fe0cab51905fd6cc5449b9484
1cc3a239e35ab8282e5fda02f7e2780867f8adc9
F20101112_AABIAZ topsakal_o_Page_075.tif
51ae5156eab6dc1ba72f0d6a0a660b97
d768b2bda42ac6385c7b7b08f913d252d3081aac
972196 F20101112_AABHVT topsakal_o_Page_035.jp2
39c22c3ad3f3f5a14731ef8014eccdd9
08fd8ba2aa162673fa204b4a1d23faa10e0a3866
F20101112_AABICB topsakal_o_Page_113.tif
d30f3d232a0fc69d71b4489fbb26a3de
9b8e7d0d5de04e7c0dfedff200707ed8818629eb
F20101112_AABIBN topsakal_o_Page_096.tif
8a6e52f3ebaf9bbdb49dbe55ad053aa3
3b7ecfb1ad1e98d9b062ba850262ad84d9e6024f
953292 F20101112_AABHWI topsakal_o_Page_054.jp2
7cc20d8a3779349ebfb387bb2ee17d54
673624155dc31c3fbb509d3920c0e4e64acd356a
F20101112_AABICC topsakal_o_Page_114.tif
2b22108d7938b6307a35f60642d397ed
5673ab4916625090c71fec831868f739fd97d8f6
F20101112_AABIBO topsakal_o_Page_097.tif
5afc4f961e86a000416c7ecdfb998086
4f1e5656c5af582ad4fe86f878ea6f2b15eb4689
1051982 F20101112_AABHVU topsakal_o_Page_037.jp2
3959929aea371367f768b80f2dcd4d8a
1519e7fbc9ece240c5e34573e7102cb23c7b0dab
F20101112_AABICD topsakal_o_Page_115.tif
eadb2218bcc2ea6fd68bfd48ad2fb8ae
eb2e03c2309c99e61125ef6773e89f54e313dcc9
F20101112_AABIBP topsakal_o_Page_098.tif
ff205580b617532219b692f1f3f93e65
890030b0085d0dadbd26a913dd16dfc4bef7b263
998941 F20101112_AABHWJ topsakal_o_Page_055.jp2
3a57c95eb1771487d31549f0aacb8204
c7f75849d96ef9f12cab004e91f5cbc6ef2f31e5
F20101112_AABHVV topsakal_o_Page_038.jp2
61c2e7b31b9d01e7b4853ebe8c9d80c5
9d535969d697d58118369cb93b5ea0ceb61bbddb
F20101112_AABICE topsakal_o_Page_116.tif
f2b1bf7cf5cc8782a16d446ef14b872c
adcbae2da187cc83f68d742d2edea817a6fe2564
F20101112_AABIBQ topsakal_o_Page_099.tif
6bdca766105a886c5dbd2f622a5c6d33
91a59df9abb5efb42c032d512ac5cd086c78d3bc
1031176 F20101112_AABHWK topsakal_o_Page_056.jp2
c3aeb54270758bff25eb95fbdf03093f
8ba4d96066fc29fb856fbbae76c98c1dbc6540f3
1051946 F20101112_AABHVW topsakal_o_Page_039.jp2
832cb6ec66c8428d7b4e7d57bcd27693
001c4b0da9665eeaf3c67f677c664891bccf045f
F20101112_AABICF topsakal_o_Page_117.tif
265f0ada72a8eab2959e443eb5a293b9
3e7451f0a28deb327e7cd24eac765f49c63d75b8
F20101112_AABIBR topsakal_o_Page_101.tif
69bf79162b77e74aa6c38dea2873850b
507cb2a5a4a8881b44de1eeaa327c218517e9223
F20101112_AABHWL topsakal_o_Page_057.jp2
bda2dc93b3d8b95d9c66fcef6d1f3547
8921c9cfb2f3e866947ab7f569e7ef38da9aa60c
F20101112_AABHVX topsakal_o_Page_040.jp2
c0d27ca629ff12a2e4d73d3d1a141b0e
fbafd9b280c80d753c628f5d03ef552ae696fd9c
F20101112_AABICG topsakal_o_Page_118.tif
dc0e60b8180c5bf02cdc34de2799f149
ba51038027e33a28c7d9637d8cd39fd03557d502
1007352 F20101112_AABHXA topsakal_o_Page_073.jp2
fd9c30d4eec137eac95160a3fa500170
dd3fa6346b1885d0b31dd59ff074b0aab671b748
F20101112_AABIBS topsakal_o_Page_103.tif
9a9158bbd4a9f2deb1ea7101dc1ce3e7
1425112beb7094344adefcfb51bed1a719c83dda
1051964 F20101112_AABHWM topsakal_o_Page_058.jp2
f5e0524b580b04187f444372ce7e593a
d8273b43210f6a50154e60beb242f58ec9a814dd
F20101112_AABHVY topsakal_o_Page_044.jp2
3e43820c7ead38fff3fb8112fcb47747
c1416b7e6c4caf73b37e594319537b681c794044
F20101112_AABICH topsakal_o_Page_119.tif
d2142e633c573bddb5d658aa5d21f36b
07d868528c359eab94dad5bafe69381a7faf3491
653056 F20101112_AABHXB topsakal_o_Page_074.jp2
2ad0d6b9eab4cce555603a396684579b
2698b21e484e793d08057520ce844d16598d089c
F20101112_AABIBT topsakal_o_Page_104.tif
e43be7708a297d01f5502fe81aa0f0d5
0a13b457190f087151012cefae92fd487dfd65bc
F20101112_AABHWN topsakal_o_Page_059.jp2
2a31785d9651397f2eb917c56df3eb0d
b11cd6efba6c711d8aa342b6ccc6dc555faeba18
1051947 F20101112_AABHVZ topsakal_o_Page_045.jp2
318187eba4a86cd3c6a365f5af0f8593
29b8134b3bce2bc3ed2bcd7ed675ad891b849a02
F20101112_AABICI topsakal_o_Page_121.tif
60e058575c8f4843e09507ced1f1320d
d1fe130285fecf0c092890c1c4cf89f3225df92c
F20101112_AABHXC topsakal_o_Page_075.jp2
71bb4de62c989eba585be4204200a3b2
d58cab929313972ee187280b45fcede5c5fd24ca
F20101112_AABIBU topsakal_o_Page_105.tif
abdc730f154d7d4741bf07942e584a59
394339d71e99654ee23c3299d193351d41c3f8c1
1051938 F20101112_AABHWO topsakal_o_Page_060.jp2
c34700107ebe32d00197612f25d33184
56c11daf883e080fa255a7b9320523f5aa144f68
F20101112_AABICJ topsakal_o_Page_123.tif
b888584423f667494b1aed97f39ab7dc
ed4aa3aed91719b126e77499a266ff707d601c67
890415 F20101112_AABHXD topsakal_o_Page_076.jp2
de289aea38035675999d582cbb6bd2ae
9ec465be94256a8637797c793939b7d25201426c
F20101112_AABIBV topsakal_o_Page_106.tif
1ea0bdaeaecc71d53c274b697acc90b1
2b279f0b5a576575ffb31bb7d42376d425d7f80e
1008465 F20101112_AABHWP topsakal_o_Page_061.jp2
029949ac4416d3d55f89e1cef3a1f953
fef5e50140080a1ff0f49cdc9df779679b7abdd1
F20101112_AABICK topsakal_o_Page_125.tif
4d6e3c5ae74fe97d140b4ffd239cea94
a1a1b54e29f287e0663ccc79f9e0904475aa659d
F20101112_AABHXE topsakal_o_Page_077.jp2
4a6609f9fb3dc8144f6026ccbcd25bd2
bc43712ef4b82684e24d0cb53972107c0baeb3fb
F20101112_AABIBW topsakal_o_Page_107.tif
f79cfc73a3a9b8cdb657baa32aa0d1ea
4dc41803f8e6ba3d102cf0088468c05944a9c18b
F20101112_AABHWQ topsakal_o_Page_062.jp2
ed96fbe9e195e3d301db5a333cd86989
a680fcd4ad00c3091b507251266aea04731d1ab7
52666 F20101112_AABIDA topsakal_o_Page_016.pro
ae0179090c6c0fd08c75662f12395c02
bf208d27d75f85b62eb9c79e5cfbf0b6861fc0d3
F20101112_AABICL topsakal_o_Page_126.tif
7f9e8199b37fc614c97875c9d45a14f8
2674335690910f2acb52f95cb53d800bb9d2bb2c
F20101112_AABHXF topsakal_o_Page_078.jp2
c1b8e8b5c37bc8b2304c7734ef6ebcfb
89f4c838f787c62270572dc1af508b47ac65ea7d
F20101112_AABIBX topsakal_o_Page_109.tif
a3c8ca40f52172eb65da51bc3f1978a9
a58caea705f5391563c18118690ea2edcb6eb907
1051906 F20101112_AABHWR topsakal_o_Page_063.jp2
90cdec7e867fae6e6116174cb0ea9d3f
b91c1e8a0ba1c1087d46ca5a0b69ca5e0d6a8604
F20101112_AABICM topsakal_o_Page_127.tif
5d96d6740c0461ae9c14ee7bd4e68dd7
facb4cda315500e69c416319d5dafab73e4f1f7b
871234 F20101112_AABHXG topsakal_o_Page_080.jp2
344714ec1ecb1caffa819cf789418b95
e1cbe791a8956b36e151eab52d4f4f580b29edb0
F20101112_AABIBY topsakal_o_Page_110.tif
ef1820af5d54c609da6d88da1246241a
9b51fa01a5037583f7d80a3245ddad79fa1797d6
776097 F20101112_AABHWS topsakal_o_Page_064.jp2
5e3a099bf2633d5307d49480748f4dc8
1a047db90ac9c3479ff6890023dfa71de6cd1898
55796 F20101112_AABIDB topsakal_o_Page_017.pro
5c6f798d8eb294921cc672def954eaf5
e505c2b7c31d16ddee7a1456af589b5db80a4ff5
F20101112_AABICN topsakal_o_Page_129.tif
f01877a7fb93647a5f9de3e89b59d01b
d1eb62e5c7c54d04e0308a2cff7b5d2a43302a6b
F20101112_AABHXH topsakal_o_Page_081.jp2
c2b851e3cd977e129538f66d59a1f7af
d0fede855e284fbefdd047e71cc3dd94ad8cbdd2
F20101112_AABIBZ topsakal_o_Page_111.tif
babbffe1b0abba194a9103cb78386fa0
f8c0e4095465d5d790ab4e6b2a2978bd3a363299
1011431 F20101112_AABHWT topsakal_o_Page_065.jp2
0a3b47c0b92583ad108ec949beb48080
7458382b01871afa7a8bb56c2ac05df6dcc896aa
54884 F20101112_AABIDC topsakal_o_Page_018.pro
2c13d881a1cac6a7d959cb567ac175a2
74da7da7b0e48fdc77691a77a3775ceea829026a
1052 F20101112_AABICO topsakal_o_Page_002.pro
991f25fb065e9709b5cf1ce1c9e941c6
aa1cc256e926465703e44b2c15925761a4be2b64
973324 F20101112_AABHXI topsakal_o_Page_082.jp2
ab85704837a1a016ea1a468b607e34de
7316244ed589d9dd0c29a9c807678bfb27c59cc1
F20101112_AABHWU topsakal_o_Page_067.jp2
fb4af22883dff96c7a847f2be5ea5d91
43683af10e863c12393817daa9c58e041fc1f7df
52568 F20101112_AABIDD topsakal_o_Page_019.pro
a607bc17eaf6b058775d395c028ee313
82af231f3a45f263c90ac4145697cb6d46416a72
40672 F20101112_AABICP topsakal_o_Page_004.pro
8711667d127581c2f5397a4181c0edb1
a821c03536e882b358c85ea7880eb2a92198918f
32171 F20101112_AABHXJ topsakal_o_Page_083.jp2
5b22e4eee14bee7c7027eb6613a04f81
7a60a63afb9cd77a0ba6fccac7a3d51f0cec26d1
51842 F20101112_AABIDE topsakal_o_Page_020.pro
7732b12ecf29c4115394071177d7d09e
58a25db6a07376f05856d1067878f8b5544b0688
55056 F20101112_AABICQ topsakal_o_Page_005.pro
782d9f9de62d43b9ab694bc1c2223cfd
e1c48c44e441d7fac3d8ff0201ed3d3d80679f06
965800 F20101112_AABHXK topsakal_o_Page_084.jp2
122912015ba8577dac97fa0c48dc1bbb
8e3d8ec90818fb91bc64490d7aa4f7dd3ee4a13c
136653 F20101112_AABHWV topsakal_o_Page_068.jp2
87ca6ab70de7aa68a11bc2a8f244ae37
adcb5ea351d91d45b0f9de3a9e7f50d33e05522c
60016 F20101112_AABIDF topsakal_o_Page_022.pro
4d24a1050b44c7b1c602251d2a965d21
9f6fe4dc4339c71c1340588265747de56b3875b4
68454 F20101112_AABICR topsakal_o_Page_006.pro
64ef7e2d8de206569eeff555118d785a
529c1c29009420cbe822c73a95c5f446848c5adf
1051912 F20101112_AABHXL topsakal_o_Page_086.jp2
8d699e2221c94890e44a5c9b1b250375
248937ad853ec22a978a6e78906891fc4a94b956
F20101112_AABHWW topsakal_o_Page_069.jp2
17ae80e5f457b7030d8587384accc900
6b79bcb80201c093715a27943cee28030df61b4e
1051960 F20101112_AABHYA topsakal_o_Page_108.jp2
d8043774186e3943f2d74e460abfdb73
4e291e421855f886eb822c77b96b9f96d105d09c
31283 F20101112_AABICS topsakal_o_Page_007.pro
9378ff9c49915046bd3d9eabac93fd66
546b8ac2306b626041dfc3bd11626965c2807b67
F20101112_AABHXM topsakal_o_Page_087.jp2
e3e126ac0ff3fb079565dd6b41b11503
a310c900d40952c4a86281b8554a8e7710304026
F20101112_AABHWX topsakal_o_Page_070.jp2
34db8d560dce4aa73ace789c0c808e93
ad276d29a5441660bd90295774bdbcc5e3542919
55805 F20101112_AABIDG topsakal_o_Page_024.pro
a1ba4d4c4bed7f1c77bf17d43d622da7
54c9d9b8fdbc9d9dbf85b0e113f8a0fb5b86631e
F20101112_AABHYB topsakal_o_Page_110.jp2
0daed5b3845dc96d2811ae5612d01597
b1b4c04014ec190fe906185f0b208438c6dbdda4
56546 F20101112_AABICT topsakal_o_Page_008.pro
6bb3d383f18b9ac7562b07d3ffdfbf13
7b1f768eb5862be1e5b174cdb905d7dc7e7144bc
F20101112_AABHXN topsakal_o_Page_090.jp2
fb5250fd68aa0fd47d56635255779211
a5a4ccfdab08e265853ecee2d37357f2b42c3b4c
1047377 F20101112_AABHWY topsakal_o_Page_071.jp2
d98acdb19863738f40ed890afed9462e
7c2be248d7a833e58b1aaf0abaed38d05bd77ee0
54371 F20101112_AABIDH topsakal_o_Page_025.pro
329d2434fbe14937a935040ecd7fe3c3
0ce47ad3457bb1a58bc83ef3f058d72b7833734e
1051963 F20101112_AABHYC topsakal_o_Page_111.jp2
dc68967f06d506b06b6ee10208b23e7a
266c8bba8d7a3352254430e6f892de702421374a
58378 F20101112_AABICU topsakal_o_Page_009.pro
bcb7b5bd6d63814067fb5a060a7363cb
df526e2b36a6543863c19317dfa0d8874ba5654a
1051970 F20101112_AABHXO topsakal_o_Page_092.jp2
2c9037f02e84f54dbdf61e0d5183d9d2
61b45dc2b65c6a17359e0fcb9b0c8f9f72d1eb8c
F20101112_AABHWZ topsakal_o_Page_072.jp2
bb7ee9c4473d89e3c4666bdd943c2180
c4a9f98f7d06995bcb2213f7aa4766220ead765e
52281 F20101112_AABIDI topsakal_o_Page_026.pro
7e2f3ab786dc7d23cd2bf9e383994d45
fe0c055b0480e458285c4d504e70fd91b32975ca
F20101112_AABHYD topsakal_o_Page_112.jp2
b48d5cd9d847d7f10d640a5026e3d7a2
1867a83f7e792bf09102d74dbfe438206a6b542e
46082 F20101112_AABICV topsakal_o_Page_010.pro
7a2501c5227c9021db870d6fdbf023c7
7e7af470146e6cd2a2a77883714b96ab584c8124
F20101112_AABHXP topsakal_o_Page_094.jp2
775c09a9f2c3a12fa177216673256501
c0dd08cbfe58deebf8496c06e0b29e2bf6cde0ac
56115 F20101112_AABIDJ topsakal_o_Page_027.pro
461cc434f6e612426bc3ef4abc300629
1fb98afbcee3c4316c3d4e0d5f918aaaca02bd32
F20101112_AABHYE topsakal_o_Page_113.jp2
b80d2f09a1f3108a7eca482cacbd0b6c
24bc9988469717f0fbc3b5b9176b93caa2e99550
56607 F20101112_AABICW topsakal_o_Page_011.pro
ccdb9a042b5a2d7d973ca16c6673507d
c176ed14a6b4d3284217dc9bc6014e97a6cea045
694745 F20101112_AABHXQ topsakal_o_Page_095.jp2
016799074bae001d8581129cb97c8148
c0db46404dd85f9bc412b0950792d3c7e0df8505
52885 F20101112_AABIDK topsakal_o_Page_028.pro
d699d94637cf21d6a78a306b0fbfc6df
c6212a3e4317c843172df6904da3e1abda9ac575
F20101112_AABHYF topsakal_o_Page_114.jp2
1c1b407174cab390af19a806de0f8297
1e30a12bc8f46252b75bdf4f3c305bcfa5b21dbf
57493 F20101112_AABICX topsakal_o_Page_013.pro
0f46911db64e278d448d3413b8d8d778
b7c1830e21cca040ae9190e80eaf301b9fc518ba
F20101112_AABHXR topsakal_o_Page_096.jp2
13068ef1a4decd3e8ebebe83c0d06120
f44b2fcf965da1ee2aea2d0d29e3d8ce910726be
49937 F20101112_AABIEA topsakal_o_Page_048.pro
7f9ef6d414611deaced48dff46a95888
429430edf5b2312bfe243758f318ae7442500e02
57441 F20101112_AABIDL topsakal_o_Page_029.pro
f2e335ce1b38ae33d3132210e93dcba9
996fdc16df11e4ac0121d9a5c4e01da76018db3a
974689 F20101112_AABHYG topsakal_o_Page_115.jp2
11691431aa9e0ca605c3ee9f695f26b5
b5c393b397249953c2f09d440ae41eee0cee1bf2
32879 F20101112_AABICY topsakal_o_Page_014.pro
6b98de9bcbcda14a41eb8ea3d79e2a7c
d6f77bd006ac8d59ecf065a6ef5fdef1959b13f7
F20101112_AABHXS topsakal_o_Page_098.jp2
993218ee00147e14d5b311e36be87e01
78f080d6ecffca3bfba191ed1b4dcedc83f06ead
29479 F20101112_AABIEB topsakal_o_Page_050.pro
2ee3861f5e1a966a9ea4cb7128f5330e
e953ba2c9ae55f0c7e4ec7b6f88898a38f8716bb
54483 F20101112_AABIDM topsakal_o_Page_030.pro
dfb6419aa67035dc3c1af23a71092438
af211cc55c29e7016680ec0631a7f77549ec788a
1051931 F20101112_AABHYH topsakal_o_Page_116.jp2
de30b696b124aaca111758360bf51476
2142b62cbcf07db16efee0d2f88ca7dca4a2781f
57179 F20101112_AABICZ topsakal_o_Page_015.pro
708408fec47b9d3eaf6593d74507b5de
f16e339bc20f257f6f9c10cd0cbc0973cd75abb7
F20101112_AABHXT topsakal_o_Page_101.jp2
9b0cef3df90e493fb6758fd0906c7e6f
143f45f313dc345b385bef98d1baf727aaf24982
59165 F20101112_AABIDN topsakal_o_Page_031.pro
cc4e0de7bf56f120d6cc26d7f2be0172
3e11ecf45179b81cae920eaad2fdbd58d3025cb1
F20101112_AABHYI topsakal_o_Page_117.jp2
0b6fd80d3ace62005c6a62a14a6ef559
6ca4bcce5f921ba9e336926df3b83d3e7162e033
1000723 F20101112_AABHXU topsakal_o_Page_102.jp2
a8619cb7b7231a5bbfcc7fcd116f71e6
bccc0cae5eb32b174f0778c538c43af1a3807ff9
55119 F20101112_AABIEC topsakal_o_Page_051.pro
9eefa79d2d0dda38d067ac2d1aa30136
2bdb3c0d39b0cc949579fa4d8444401a11e535f1
57610 F20101112_AABIDO topsakal_o_Page_032.pro
47ca00fd6d5682cc419b89cbb2362add
46989ef30ba408f59c63ff107d0980f7786f47cb
1051984 F20101112_AABHYJ topsakal_o_Page_118.jp2
c3ee789ead1bbe0a127bf10fb201ab87
2aa212743a15a0913c3f8375dbbdf8596648c8b9
83948 F20101112_AABHXV topsakal_o_Page_103.jp2
82f66934a156d52e6ec1545066848993
790f8bb442a75fbc97d544d7fed90f3530614a48
39018 F20101112_AABIED topsakal_o_Page_055.pro
dda12bdfe49549d59e6565d1d8584e5a
2229750cfeaa5aacee6859476d68acfa41f2ed2a
35778 F20101112_AABIDP topsakal_o_Page_034.pro
f1c88e58733b6ed058513b8c53941a8a
5b57eaa030d79eaf3a85b69db8b77cc576179e7c
F20101112_AABHYK topsakal_o_Page_119.jp2
effef15374c5b25b3090856a7d6a5878
18a63f7ec60765fd05fe9f90678e0d0bee73c06e
29663 F20101112_AABIEE topsakal_o_Page_056.pro
a9cd87379f98bc6f8d9947543fb15dd1
cecbae18c8dde7392f2b012021c7f1855402364f
43383 F20101112_AABIDQ topsakal_o_Page_035.pro
e9fa738369ad4a148974a019f7759bad
6bfba6bd76176573bf2ca9571ca3d0228abcaa8d
F20101112_AABHYL topsakal_o_Page_120.jp2
80142f36d8fe55e6c2c14e1e00fd0e9c
671838f520660af30e792247efdd28cc04cd57a1
1051939 F20101112_AABHXW topsakal_o_Page_104.jp2
5d6a6c6e8535f8abb5702e11f8c18598
7d3cf8c30185832ebeec6772a1ffa4d59331497f
60136 F20101112_AABIEF topsakal_o_Page_060.pro
eac46386e5baea56f80d9f9719891c22
a126f46ef2c346b39643781ff483f62790de4980
43930 F20101112_AABIDR topsakal_o_Page_036.pro
44f02338ef99261db51721c21fe7804e
433e8a45dfeab67e00e6dae2573f16be2fa2c125
144881 F20101112_AABHYM topsakal_o_Page_122.jp2
c7fda8f14f55345078d43bf1629ee10f
76f7770508f1f76b38fbed71f0c8545b5854f464
1051914 F20101112_AABHXX topsakal_o_Page_105.jp2
d63a2a5582f3f74873c6f527fcc1f521
0559d8e8df7fefb2cef433e1fbfb26c61e51beed
52070 F20101112_AABIEG topsakal_o_Page_063.pro
76e76ae9250fbdb5a8b54b801646340f
a37d332b0ecfab1f978c2d54e6a8510501ddf13e
F20101112_AABHZA topsakal_o_Page_010.tif
db34223c06e789c58825db48ddc6e365
60783292d04448bc9b50c38a76a0fcc3565d8311
51685 F20101112_AABIDS topsakal_o_Page_038.pro
c7dd29bfe8008044b8709c70b13d4950
70fcacaaefc6ccf5c51bbc9c274baef607c10d9d
149044 F20101112_AABHYN topsakal_o_Page_124.jp2
4b566e2628631267cb8999bca717d934
ff54a3a316b96594c053e72251d16981bbad8d66
1051911 F20101112_AABHXY topsakal_o_Page_106.jp2
2696ebdcb3a0efaf0ca8f2f169efa9a4
9107a487bfa8678915c2ddef93e312af3e60d6c7
34469 F20101112_AABIEH topsakal_o_Page_064.pro
02690479c117164455c12bdfe59f2c10
44cb590522bdf7fcfdc08da0b3b4ebe4be93227d
F20101112_AABHZB topsakal_o_Page_011.tif
0dbbeab29b8d88dfa9f971b8098ab8fe
040dddd4d29bb00847b54fbe4b10270d0d7963dd
49739 F20101112_AABIDT topsakal_o_Page_039.pro
f1e35ad6b83cb7acec6253b3438fdcfa
c92d9aac6730ca01b0c26573242f1f616f3bab05
150568 F20101112_AABHYO topsakal_o_Page_125.jp2
2e7b5256fdd12115877d745931bac80a
a8163be54c2959b9d64c63178dca83c7e71654a5
F20101112_AABHXZ topsakal_o_Page_107.jp2
8ae954f1164520a3244576e98e60c430
463387c00483de10ab739fa9b48a305c2b4bba1a
44975 F20101112_AABIEI topsakal_o_Page_065.pro
ee535beebc2fd9f1b33868f55d615691
72661c281dcf64550e6d004e9aaf880b578691f6
F20101112_AABHZC topsakal_o_Page_012.tif
fd7032bf780e543f5dc010a1d4582bda
2a9e88085e3ac033b37d7696b19215c4182b12d9
50033 F20101112_AABIDU topsakal_o_Page_040.pro
3606ef9b1b6e180d6a5a4818ec109bed
2077cfa1ede9d2571eec467cb0f0029b4e15701c
145486 F20101112_AABHYP topsakal_o_Page_126.jp2
f52ded893a592f27247e19859d24292f
82fdfd3d8680fe2f28785486fd3d1153d5f7e0f6
42449 F20101112_AABIEJ topsakal_o_Page_066.pro
cabeb9b044c3ec87cadbd2818af49845
96276254303fc25526773f759e763d1f6caf4133
F20101112_AABHZD topsakal_o_Page_013.tif
68d86672cbb9ca15155c9ff91cc9e58e
f46cfb2a857babfb07a6a4bd135d2ac0a164865b
57727 F20101112_AABIDV topsakal_o_Page_041.pro
264b6866c44fb83976ccb4124abbda1f
a1117d4ba8928b269e5f222b5f8d91714fd8600b
142650 F20101112_AABHYQ topsakal_o_Page_127.jp2
7aa529d9468a730d4890257239f457c3
2f1ad71a4c27d0642ef30697e958dcc91752e25f
55234 F20101112_AABIEK topsakal_o_Page_067.pro
a7f5ea97acb1e1e0141a76eda6499ec9
e9d383e6ca2cd8b9f968be733fc278fa17690b70
F20101112_AABHZE topsakal_o_Page_014.tif
d913d9a6ba46ca72d88dcac7bd9e1993
d50916676ace12abd1e8d24e0996ca38008adf34
58916 F20101112_AABIDW topsakal_o_Page_044.pro
17a04569152fbb816474bbeadc7a090b
870fc24a7bc87570f24ed60eca0ee466ea14978d
148486 F20101112_AABHYR topsakal_o_Page_128.jp2
23aa3beb185b8b0a5e8849fe66b21988
2fcd9e98b516238eec2ce4b0b816876abd408ddf
51586 F20101112_AABIFA topsakal_o_Page_089.pro
6b54ea63e9b7264bb84e1ee4f298f6d7
c43aec1e7149e47f6bef8acf7c69d6b3edfa46d4
64798 F20101112_AABIEL topsakal_o_Page_068.pro
a2222967c75941c92a81257ae28036d0
187d7f835a55d47353ae0fb7065f8ee10d9b3afd
F20101112_AABHZF topsakal_o_Page_015.tif
8c0caf74d66fd619c24b2bc9e84ed291
9cc6e987e177285512b8b24a773c933e7794cd66
57690 F20101112_AABIDX topsakal_o_Page_045.pro
b8ec8b708a9d6fb1f3f16acadc3e039a
d5c8f775b6214f04319f509e847a70a65a4c7a7b
89401 F20101112_AABHYS topsakal_o_Page_129.jp2
06f24a423111be9f05c5f56a88291602
dee77f566aa5633710902de40a140300fb91233f
49793 F20101112_AABIFB topsakal_o_Page_090.pro
af90664a9f5b06b06acf2c275d491a4f
4fe6b84de5fedd0fbaf2cf30568ef1823ad1bbc9
46804 F20101112_AABIEM topsakal_o_Page_069.pro
608c5d7db55de42064198189ad575711
1885d543b2abd9f25de968c1f48305f1268d9c7f
F20101112_AABHZG topsakal_o_Page_016.tif
e3a5cb83b78926a8e63066527be0e8d9
c6be1d9cab68a800c5499276a03786e009a8df68
57670 F20101112_AABIDY topsakal_o_Page_046.pro
b99c01514bddabeb593c9a0b186a2b93
444bcd3b6ed2bdb7db5aea6c0b6921cfc7336eae
83097 F20101112_AABHYT topsakal_o_Page_130.jp2
504e10cd2950c9b92ba14c086df3fca8
67fb8fd46bf7f64fc67605fb05e34557538f7c69
41175 F20101112_AABIFC topsakal_o_Page_094.pro
bffdf079d15b11e85c044ed38e799b00
930a186cc923532203a7f28d11ca9f4194e4dbb4
34878 F20101112_AABIEN topsakal_o_Page_070.pro
4a22f2dcdbc1b425c15655f17d09242e
ef34fa5799c4cf7cbc4820b09bf7a230d9be0a4b
F20101112_AABHZH topsakal_o_Page_017.tif
f7c198058c82dfe90b953543652f0397
09ade5450fdf315a6a57d66597c58fc0c6764770
59736 F20101112_AABIDZ topsakal_o_Page_047.pro
9ee8df971672c284c433eab3cb5ebb18
f8f59855ae81fde7d557f831289e9c17267da11e
F20101112_AABHYU topsakal_o_Page_001.tif
9f45205730183cbaad4561d5dddbbd4e
cc61a52ebd96c771a31783218c51116a81cf7c2e
14087 F20101112_AABIEO topsakal_o_Page_074.pro
567b93f847a384bac1db098ef59aa238
e411cfa2f4dcf82ac00c39b32b70fe1f25e13680
F20101112_AABHZI topsakal_o_Page_018.tif
e2bb6bd05e788296b7e62b0d48864fd5
148341826295a114103d1b375d35dfb27d5cb33e
F20101112_AABHYV topsakal_o_Page_003.tif
c61c1e09b074d07c6b054b5ed753a834
b84fe15931d4f302973c26b434c3d1c36a805c75
26668 F20101112_AABIFD topsakal_o_Page_095.pro
b39fc11c1652ddb643e24c878b55d3ec
e6114f0d1ac997fafa9ee47fedd31d2a785d92cc
28753 F20101112_AABIEP topsakal_o_Page_076.pro
ccbc1d341b8f0948211f5b04e66493d0
61248d48789c873d80a8872d382fa0ad69220189
F20101112_AABHZJ topsakal_o_Page_020.tif
bb93cee573289ae8b901f40cf89301a6
06e7836e9cf027acb52ad774061a5ebcf94d67db
F20101112_AABHYW topsakal_o_Page_004.tif
728190747b786fff2c5eecc786f16b6c
eba46dfc3535df0761c778d0f40f78a8cbd8091e
52435 F20101112_AABIFE topsakal_o_Page_096.pro
d34ce54305507f4cf042330a567c865e
3b2a235d2dbef3db20ca776a9526fceecebec977
60011 F20101112_AABIEQ topsakal_o_Page_077.pro
9cbe9655b4b7b1c2ad69d2038397fb30
eb349cb6a1ff23e40a928ea871dd6c605801040a
F20101112_AABHZK topsakal_o_Page_022.tif
114b254fc95f722223787017455aee78
ee8d9e3ae73143a92033fd548b6fe61a852b0351
49413 F20101112_AABIFF topsakal_o_Page_098.pro
b140435deedf2e0ca8017f99fbfe0218
f0e34146d216af72ed7e221b7d34e7e218982e82
55794 F20101112_AABIER topsakal_o_Page_079.pro
fc2f4983830387b9021990fb811d50d0
d82ea9d9755e11fd89d2d9dfdc3afd013c4c9256
F20101112_AABHZL topsakal_o_Page_023.tif
e53d4fc08367c065f50b684bf2eb7e9f
88f7c47ba41d55f67aa6acb36fd51633e5183119
F20101112_AABHYX topsakal_o_Page_006.tif
fcbeb348184faee1634d8b9275b15c61
73bc2f0dd6e84c7689efc52fc4ba431da364d272
32089 F20101112_AABIFG topsakal_o_Page_099.pro
c3da82d48ca4b69a1beff6967d565f30
280116a8ac97774ed05349a76de1a692c6e0dfa5
37520 F20101112_AABIES topsakal_o_Page_080.pro
82e59357a99b6cee07703bda64073cf6
88307cb39f71abaa5307c4a0ae5b580d08016d24
F20101112_AABHZM topsakal_o_Page_024.tif
2621b14d363745c3a9372dd8a1ca62f5
78aca749ef8196354dc75b7e3bd3448124d4e234
F20101112_AABHYY topsakal_o_Page_008.tif
b19ca0c8875575cffa6289bef5f1c0aa
8ad9a209186f04efb42106c4f803be587f316ebb
53221 F20101112_AABIFH topsakal_o_Page_100.pro
ba57baa8a16998ef2249fd8488186985
bb5f938a6d3b41ffe20d5a1141db768c379fb08c
39103 F20101112_AABIET topsakal_o_Page_082.pro
3b049f39989bbaad577669e9f6b5cd6e
2f7a30a55a8d7b01df0fc194493c9535fa0aed74
F20101112_AABHZN topsakal_o_Page_025.tif
fae5fe79abcce33c1ed70487cfc5cb8a
3ddc88ead2451934f9ea80430dcae0d95d76387b
F20101112_AABHYZ topsakal_o_Page_009.tif
f678b20504fc7bce213cee1d00ba9384
4e11c50501989cc772fa727f8f024850270cb86d
44674 F20101112_AABIFI topsakal_o_Page_102.pro
de19c4d9a36e8cc08289e668aeaa1c27
ba33ebe00c1f07ec8345ebf13562935bab295509
13771 F20101112_AABIEU topsakal_o_Page_083.pro
bdbd23e94024f250aba2d6ebef48a247
383de29661f5ad61cd92cd1680dbadf66211aa16
F20101112_AABHZO topsakal_o_Page_026.tif
765df2465ec51458e13772d4c1420b3d
0626e888ed3091e93b709313d7b82c7aedc20732
37768 F20101112_AABIFJ topsakal_o_Page_103.pro
0ebd09b6d9e43e132d4744023327ff1c
f65b296bfc824253bdb4ffec6aa93290e8a5dbf5
40096 F20101112_AABIEV topsakal_o_Page_084.pro
affa9c8185f7d0ac6adc0d676f3dfbe1
7262f91740469e9e482bc589f6bb5dddf4da879b
F20101112_AABHZP topsakal_o_Page_027.tif
04cffae7ad1831f022cfd893e4db9788
e23658d8b5e8dd90a4eb9da172b88a1bceaffaa0
49917 F20101112_AABIFK topsakal_o_Page_105.pro
58b8238a78e40f46a129c097b2bdb73e
51fce684a9cc4d0a28c45de9b3985f91755f4212
51126 F20101112_AABIEW topsakal_o_Page_085.pro
761c6a9bdabb5f6706fea7b9d94a0990
83f22698017465e7157a9958cc5fc45cdcab463d
F20101112_AABHZQ topsakal_o_Page_028.tif
8ee6a9ff61d09e0f9d5d13c7f33a0c5e
97d2bd09a4dbd46c135d8b3729edeb93e59e840f
57603 F20101112_AABIFL topsakal_o_Page_106.pro
b5bc1cab496f405ab3cb6f776a3aa193
d7dc73c0701c8fa4434e51bdd383d98167a096ef
48405 F20101112_AABIEX topsakal_o_Page_086.pro
da2999c8d8a50324cb7ee0c37c3efb5f
d9557db1eb21c2fa147d32ae56f5e935ccf920ae
F20101112_AABHZR topsakal_o_Page_031.tif
76a5d02f0ca369a45d4fa5f73832d7d1
e1e5a1ff675ce48bc59134b68ee3640b6a5c3663
69259 F20101112_AABIGA topsakal_o_Page_124.pro
00e7e9bf3b2f4f35423fe12837b4bbc3
dc7f062929005394e35fc91ed27591e850f9272c
28625 F20101112_AABIFM topsakal_o_Page_107.pro
0dad42eac1ccd9176bca822de8a1099c
0776f97895b306d7c604daca6746b51f027b0008
48562 F20101112_AABIEY topsakal_o_Page_087.pro
7b05f54a4f540337f0db362ef20368c2
0c1c51898dda974c81499e4c3de19827266c6717
F20101112_AABHZS topsakal_o_Page_032.tif
753b74e62a5778f800cbf192db7aad68
b5714e2ed3143928d5ff18bd9d52bdb927883aee
71360 F20101112_AABIGB topsakal_o_Page_125.pro
5bb6d0c325b24d592cb7853bc6aca61c
14de786f7c30d5f607fc238e9d59db82b00d3fca
53136 F20101112_AABIFN topsakal_o_Page_108.pro
7429eef2edadf8c58e975adc81490428
0b047c76cf4f085c928aca6c09271bcd3cf37657
32316 F20101112_AABIEZ topsakal_o_Page_088.pro
3ff59cc5d61e7dffd2f5fdd643b5de98
0f4f3167f8be2d15b50da787ab5b34381226fc2c
F20101112_AABHZT topsakal_o_Page_033.tif
1247408e5759d6f280a9144f775f62ee
035c3cb582e708e19c3446bb57eff6c0e264a794
66923 F20101112_AABIGC topsakal_o_Page_126.pro
8b9732657a0987f2377c5ee6725c4ed9
c618225da95579d237e51268eaeb22e0ff0899df
53833 F20101112_AABIFO topsakal_o_Page_110.pro
796c20c259cec069b18728d77834141f
0ed73d836d0d1914d5f71ae1cacd37003ee3f8a9
F20101112_AABHZU topsakal_o_Page_034.tif
be6d8126747918aa419d17536c3130bc
e5e0c376e5585c48dca85085b29c8b09e86c1efc
64668 F20101112_AABIGD topsakal_o_Page_127.pro
09be5f7dbdb8b164dfcac9fb70b45fec
c5ba27bc7507c30953ddd1b812561567bd4b030b
46876 F20101112_AABIFP topsakal_o_Page_111.pro
1fb54420ef901310d71980cc1adbe3d5
224ed66706240f65f8a6f417a7c29175db203f2a
F20101112_AABHZV topsakal_o_Page_035.tif
29f5cffabf5cc9f3ba543fe9d274200b
baa9b7832de11ab0566a52545c7c9646fff7c02f
58882 F20101112_AABIFQ topsakal_o_Page_112.pro
91b96c24a595b6b0c8a3637745e095b4
9a76d27e636a2d9acb3700f1fc5ad66f06ef5214
F20101112_AABHZW topsakal_o_Page_036.tif
e1807a1582ecab4a626619829b7c4ac2
da9756c37688bc8e4e990b379c4b0e99e25e1128
68774 F20101112_AABIGE topsakal_o_Page_128.pro
fe63cef6eb90a92bce72fbbee4b00aad
17a6768ebf84d850a9cc025cf80eec0e641806d3
32527 F20101112_AABIFR topsakal_o_Page_113.pro
b378b78329e6d5aad817b3fc85262cac
6130294a5a4ff4518d394846a6e7c8a4f491eb1a
F20101112_AABHZX topsakal_o_Page_037.tif
ac9a6b24ca15953723275c2feef09782
0da83a2aa88450ab34a9b6352753cae1932e3add
40899 F20101112_AABIGF topsakal_o_Page_129.pro
83b54d33214f7c8802c2a842b032670d
2f7b13159219b0b9874aec9e8bdce724dd17d90a
45649 F20101112_AABIFS topsakal_o_Page_115.pro
23d92a4372f14ca9cb86b175a2cb9ceb
0174f07b324d3908966194d60d9f2fb8f17ec08d
36087 F20101112_AABIGG topsakal_o_Page_130.pro
7d0759563fa90a1224253a1daeabe8dc
5eede71edc51d8c48549352d6271bced17e7443f
55637 F20101112_AABIFT topsakal_o_Page_116.pro
a501d164df081e42f2290827ac8de7c6
7f9d10529c46f8c9cbf634a594a40e8d015957f7
F20101112_AABHZY topsakal_o_Page_038.tif
04bec97644ca7c2710dcd4eba8102482
d663e78eaf33abe15387986ec22814cb1425167e
445 F20101112_AABIGH topsakal_o_Page_001.txt
e7e5f5ed7596ad707abfed11756c9d3e
1867b4909a346aad87a349859814e53e1b427451
60276 F20101112_AABIFU topsakal_o_Page_117.pro
99ada6376147dfcbf376910238093655
5afc0dd641e8ef36e9121fc776652f87018c766b
F20101112_AABHZZ topsakal_o_Page_039.tif
567a2ba0f4b1fc99b24c2b78c2811497
16e638d9d3e58114b528db348b9258554b5a4602
107 F20101112_AABIGI topsakal_o_Page_002.txt
b717617e15207b8b001f996ab1ccf784
4bdd4e06c0253c008be1bcc30726bb89924bce93
57843 F20101112_AABIFV topsakal_o_Page_118.pro
2b711719e603e892a9328a75eabee118
683d76ec3c863fe0c16e36b761f7bb8e36a32b66
184 F20101112_AABIGJ topsakal_o_Page_003.txt
1d1dee24df0a45e41c0f08248717dc8b
73d1019faaec15c92315b5085f4a7806d7c5bafd
56161 F20101112_AABIFW topsakal_o_Page_119.pro
57cfafef008952696021a491c7b9e9e6
1443d01b56dd04bc51a4c2df6192746c498e40aa
1711 F20101112_AABIGK topsakal_o_Page_004.txt
6e62d431cc6d5e78c4c37bf8d57d9f96
d39275708ef63007c698f0943010d646e17d0a58
53936 F20101112_AABIFX topsakal_o_Page_120.pro
39cfbb28897bae9b4c45a614f00be4ba
f5beb8123164a3308313e15bdcb432b578a7d26c
2140 F20101112_AABIHA topsakal_o_Page_025.txt
738d2c7634fbfe11b89b8a78452f18bf
01f64caaceadb82d7cc5e4f747c70581f4c34610
2845 F20101112_AABIGL topsakal_o_Page_005.txt
2a121fb06969969134cf546f56645a6e
d474a75a5ec0e8b6924344e0dda4d09383b33126
69266 F20101112_AABIFY topsakal_o_Page_122.pro
09c11985c6b94ea3d91bd9c65ec1a2a4
86d40f7071341dacaca51ac789d9c7d4c00fa890
2239 F20101112_AABIHB topsakal_o_Page_027.txt
a6e56e49a09cf000dfce09caa91f94d7
6752d304fbaff9085cc4134a1eb87437493bd0aa
3036 F20101112_AABIGM topsakal_o_Page_006.txt
703144e7357cd99c3a1b607431f86162
9bb0057f562135dafb121df48be4a6c77eec399b
67846 F20101112_AABIFZ topsakal_o_Page_123.pro
d52f69fa8521bbaf3daec4ade8954a69
b8b7062515339fa2082c9827072bf711b3b35b0f
2261 F20101112_AABIHC topsakal_o_Page_029.txt
5366349bd93ab11b69e8b7a6d1c94852
3bdc1c0199bc14f2d6475b9a3fa0613361bf99c7
1396 F20101112_AABIGN topsakal_o_Page_007.txt
a495a5ad6fc5e4703146576c0a2c14bf
993abd808772ad6fd3eef7e4f2e907905c2e325b
2346 F20101112_AABIHD topsakal_o_Page_031.txt
6c76d35da162246330fc889d9028126c
db562027b1db682b2107135487c28a18387e7672
2391 F20101112_AABIGO topsakal_o_Page_009.txt
0eef0774eb7af73b42d64f43ce3bb5e2
64f906893e1facda29a36cc0c9f23b46920b1648
2036 F20101112_AABIHE topsakal_o_Page_033.txt
01504d92db1051ea9c984de2fd89c5c7
418fb33f5b0decf72b1292903ae2021f8af7c264
1960 F20101112_AABIGP topsakal_o_Page_010.txt
c70045fa8106bbdb782c9daa641bb4f7
49427575a405fd2fae221428ed761609ae325a69
2493 F20101112_AABIGQ topsakal_o_Page_011.txt
30f0e45a22501aef3092f6195930d69b
d1d72da6e9f9af822018e854b6abc5161187488d
1568 F20101112_AABIHF topsakal_o_Page_034.txt
6fd575f4ba18c473433d2787e24cf787
dfe7676ed97de25feb8c6609d1eedc76e7c350be
2228 F20101112_AABIGR topsakal_o_Page_012.txt
d42cf49dfea0c19a4c7b10ff37657295
35676efc840304b48f1e0de602d45d5f1c425917
1933 F20101112_AABIHG topsakal_o_Page_035.txt
2058bc77c5aa0b1f84fff2824b9be4e2
67ed4b3965ac65b2772c2ba5036669d3c84e9cda
2274 F20101112_AABIGS topsakal_o_Page_013.txt
9859501b266ed3445cc3e8e4ae170ef2
2b4189fcebb22a7f61146a5729be3e63780e6625
1866 F20101112_AABIHH topsakal_o_Page_036.txt
309d23f28badd7daa0cee97f18870e70
8d5fc52f4ad563501b379026a8d093a7e488bff6
2250 F20101112_AABIGT topsakal_o_Page_015.txt
d80469b76102cf8db030dc0434042979
9fe487e835055b854a40e0f984d17547fdb9377d
2111 F20101112_AABIHI topsakal_o_Page_037.txt
cc2cda1d03d279ca01c5e141b4540f73
598b09a251b29f58e38d59fc4b9f6165130cdc70
2124 F20101112_AABIGU topsakal_o_Page_016.txt
33dd824524ebb75de355c2c12db5faf7
e95ca100024178439d74d032fbf50eb22a557c1b
2175 F20101112_AABIHJ topsakal_o_Page_038.txt
b5cd654854b7dd815982eb426c19316c
b9a42688ad93cc15b20bfc7d2c6c84f67ea4c1fc
2237 F20101112_AABIGV topsakal_o_Page_017.txt
6145bb0263187267f4ae4c84aebb29b9
a0337b16e352be6b09ba31751cc5b9bbf512fe26
2007 F20101112_AABIHK topsakal_o_Page_039.txt
36c62c1cea8c2c7c833b0adde9f40ea0
25d83e0954902e5db21f7dc1f355fdccad2a18d3
2226 F20101112_AABIGW topsakal_o_Page_018.txt
e2f9d419ee61b43bc01acf4dac6cbadb
4844af8b097ada016ee6b1566946b01b22bc30b4
2223 F20101112_AABIIA topsakal_o_Page_059.txt
b0aa19df477f8f4596790edcaf6b2d9d
a907774fcc5d395bf12627a271e1147d6bece77a
2100 F20101112_AABIHL topsakal_o_Page_040.txt
64bd81f141c1ebdef4660f72d7c7e736
0ff1a323b27583be17bb036a0106111d8d81d27b
2116 F20101112_AABIGX topsakal_o_Page_019.txt
59ebb81b8dcaaa8c41b1f7e92c852dc5
21522a74e15156c0956f533533e7b01f38454a97
2364 F20101112_AABIIB topsakal_o_Page_060.txt
b4b02c14c0105d3bf12436a2a1c362c3
453fe678ad7bbc0a7ebf6391604548141c09ea64
2290 F20101112_AABIHM topsakal_o_Page_041.txt
c40b9ba6b42073377baff632736e9524
02a7684330a26d63e2a4366aadae1da2ccbcf57c
2386 F20101112_AABIGY topsakal_o_Page_022.txt
e73a94c87ee368cdc7f1581e8e47b703
b9ce6c92909d283068fa202b206c96c2d47fe08f
1112 F20101112_AABIIC topsakal_o_Page_061.txt
56f269350fadfd3941ef3cf0a403c0f9
a570d88e822e1d14003fed988e295c698fe7d23c
2012 F20101112_AABIHN topsakal_o_Page_042.txt
54e8f0e3f8392c416df242b1482e6bba
2a0398a74701a7833391a225ad9624509e267f24
F20101112_AABIGZ topsakal_o_Page_024.txt
9cc6cc91fd20cc4f1243fb94ae6780f2
12ddd2a97cf5c73b5f23786688136cebeae7411a
2072 F20101112_AABIID topsakal_o_Page_063.txt
f7067b3eafe8e25a366e0b0b3fdb1339
e9d1d951c0ee01cbe2e14227e5e0b87ceefe2f32
F20101112_AABIHO topsakal_o_Page_043.txt
ff3a57701810e343d9ea717bbc4cbd6e
000bc5fbf332a4e02292da217d75fb98471460c4
2273 F20101112_AABIHP topsakal_o_Page_045.txt
0d1d7ebb75865edd92f9741cd25a45a6
8351ab93a4ab98b842860dcfbebec1ece68e2f06
1395 F20101112_AABIIE topsakal_o_Page_064.txt
7111bdaba76f96f4ffee23224628a692
d2870748ad0f0e69e929e2e46eefdf6ff7b5dc28
2267 F20101112_AABIHQ topsakal_o_Page_046.txt
d1aee28578ef080e2df42666baa2998d
9385649247871763b56b6cf2416e5a8c2c9bd237
1806 F20101112_AABIIF topsakal_o_Page_065.txt
2c76f769f94a6db5ac660f1be558de6d
40c7883a41f8c894637ccc75337b8e5ff26698f3
2344 F20101112_AABIHR topsakal_o_Page_047.txt
c11a6c34ddc06a31cba0b49b75bac6a4
397f843281c9f490df709d0d077f290407ac0a28
2155 F20101112_AABIHS topsakal_o_Page_049.txt
3342f40849992ddd7a16535e02bb65be
b30dfe0270cb18966c405a1aeb5db9d166156aac
1769 F20101112_AABIIG topsakal_o_Page_066.txt
77fdb8655c536ae796466e6e3695b48e
2d026270c29ca28c767507ac9cb27a848f091aa8
1280 F20101112_AABIHT topsakal_o_Page_050.txt
20a7d5458823f914db654da322c30927
977fd8e8de865baef2565ee1b693ce8f898fe982
2214 F20101112_AABIIH topsakal_o_Page_067.txt
330512a9712a44f15aeaf1bdd0cc5105
597fca330b9aa8701dbc2c52526c131080239b4a
2168 F20101112_AABIHU topsakal_o_Page_051.txt
1b2c5401cb332961d6878900597a4324
9dbcbcfb5be2490b1424c456bc627110ebb575ad
2648 F20101112_AABIII topsakal_o_Page_068.txt
88187af44d9ecd736890d26fbc74c077
ffe439fac1971e0ba1b30fb0232041b9dfc4af1e
1581 F20101112_AABIHV topsakal_o_Page_053.txt
533d5606e429dbc420bc6e29538fed41
a0daa6ff523d19e6a4ca2a986be8c5f1567e7150
2022 F20101112_AABIIJ topsakal_o_Page_069.txt
687e0edddc6cd48b585369b09d4e536c
7513deadfc56eba9daa587fe89f8578b47f9c66b
1747 F20101112_AABIHW topsakal_o_Page_055.txt
77affb0b134d344e3ba3314050f64772
f190aa0da61cc815a4828ae73efe915c852c1bce
1998 F20101112_AABIIK topsakal_o_Page_070.txt
e4d585869b2a2a085507b70f50440365
9edf04b4ca8cb4555a9f88834e0e5877f35f66d4
1413 F20101112_AABIHX topsakal_o_Page_056.txt
5bac32430f1684d0356cce109763710e
8cf12b811daeddcdfd4df947f03d6899188fbd82
2102 F20101112_AABIJA topsakal_o_Page_090.txt
62f72e12924ffe9aea284e8b85e3922f
b26162c033b63565089b73b7cf3782fb9a8219a1
1943 F20101112_AABIIL topsakal_o_Page_071.txt
b037010d55acabce0a319e7d1b8f1210
ce1235092a28de0300a29f9856cb23bdf9b0e38a
1775 F20101112_AABIHY topsakal_o_Page_057.txt
879bccbcf40f2a77290c133812db4dbb
8a699324034e520fb321accc4e0dd11cc973be2a
2760 F20101112_AABIJB topsakal_o_Page_092.txt
cea13700e81567f37cbb5ed2d8f0ee3a
a345144b121c8485bc73f90396e87a8525709da6
2325 F20101112_AABIIM topsakal_o_Page_072.txt
aea9d037751c621dd9593844f87915dd
a8e8f90ac79ec18594306fc7e0202a2314368f42
2149 F20101112_AABIHZ topsakal_o_Page_058.txt
9c73b69e2ff04b1f883ca674f587d2fe
940af2448a0aebabf97e4ac41f0653c518843f5b
3096 F20101112_AABIJC topsakal_o_Page_093.txt
29956466a6e81fd1b2c4a2fba99f2706
906dda64e8084227c0d462a222ca33a10e93e44d
2024 F20101112_AABIIN topsakal_o_Page_073.txt
cad25bf9b3d6009e1a76392d8229d16e
24dcf7bad29cc223bb55e0a5cf38b8a7ec017539
1822 F20101112_AABIJD topsakal_o_Page_094.txt
7b7a6d99c815502aa3b1128d6640acc0
bc31b691897e8a78e492c59b62d8e5851447626a
718 F20101112_AABIIO topsakal_o_Page_074.txt
19f0edd39f50973c956469708433aa43
2725b9b092e3e7fc79a2f5b22a4eb96d4dcfb2e6
2181 F20101112_AABIJE topsakal_o_Page_096.txt
ad459fef3925a9d850f508ed0b2eff95
344877befa668913b8e32e853d7be32003263b27
1940 F20101112_AABIIP topsakal_o_Page_075.txt
2b0d2d137f0981d720b0a4df8f00ce56
fa80ac01d575c4a475ba91f44118780286aa7300
1880 F20101112_AABIJF topsakal_o_Page_097.txt
6725107b0551a62ad6d14086d329d618
a6b9e6efd5dc2a9785100408c87002d80030ee7f
1181 F20101112_AABIIQ topsakal_o_Page_076.txt
cc3dae1302ed5dd9a99b9af0a2989a23
e1aaf227a33e6cd3353024b229a961dd05dd852b
1974 F20101112_AABIJG topsakal_o_Page_098.txt
0a2cb4445eeec3e059f62769bdd1f459
b8306694bf9ee8cf8814801063edbb1f3d579657
2361 F20101112_AABIIR topsakal_o_Page_077.txt
dfb7290637e9848122596b72e84f7690
6979fa701be82513fffd3125e9ecefad22c8af70
2061 F20101112_AABIIS topsakal_o_Page_078.txt
154714f3ce2ed4026d9665f128ca3e8b
94a0db44d537193a0cd526fdb94d75b14b610d22
1637 F20101112_AABIJH topsakal_o_Page_099.txt
789b98e94f292d51e07c48d4f33c7cc9
53a0ed3662a00fa30b4068474e3f9cfd39b09e49
F20101112_AABIIT topsakal_o_Page_080.txt
232fc544a4b32a843a34a4471ac67146
6863a678085cbb9cc10609088423a1e0966c2782
2127 F20101112_AABIJI topsakal_o_Page_100.txt
0667403eee5849f096d51a36a6e2de89
b777cb551842af59e99b1d7e12064d7c4f726b4e
2449 F20101112_AABIIU topsakal_o_Page_081.txt
d9833e34cce79c379885c4cc1b689b4a
43f312d2800c0fa007e8a303274bd004f1b6a54f
2122 F20101112_AABIJJ topsakal_o_Page_101.txt
d0c2f1e5b731d7488b3b4dea1235f1db
1067f5e02658906a0ccd1cec91c0bcb4098dba01
1656 F20101112_AABIIV topsakal_o_Page_082.txt
b0945bf0b6e2e55f91529345df787568
bf066c7f03870904d19d4c2881683f4d58f7684d
F20101112_AABIJK topsakal_o_Page_102.txt
6aa04e34fb713d68ff7fd6e4747c4e08
fd870136525208106b34aa00800d59fe9f4ff927
552 F20101112_AABIIW topsakal_o_Page_083.txt
963ef2f38b47bdf8645137a88967dc33
f5476c1e5866b4625768dab7e6552c82c3f27aa1
2852 F20101112_AABIKA topsakal_o_Page_122.txt
845d366d68e67511cff917b58d7cb2f7
6391e030665fd4e170c4a4ac9687a65cea746085
1737 F20101112_AABIJL topsakal_o_Page_103.txt
8093c86ad8d2d69507940b7246160d59
bed9b73f3c33fa194363470772e8bf3cf8939cc8
1945 F20101112_AABIIX topsakal_o_Page_087.txt
e292cea3d24ff68532eba7e38c79b103
9ed19fa1278ffb42960e302eb10567062a7dfa3e
2827 F20101112_AABIKB topsakal_o_Page_123.txt
d61d3916eaee39d766020a1325f26f2d
c49b3d4f1b19dc5f8d30617f57c65f8ff4fda048
2047 F20101112_AABIJM topsakal_o_Page_104.txt
b7cd6b10dfc614c7c6dfefbee459fe39
e68fc391cdea4fb1961524e70b0075cef0f95745
1341 F20101112_AABIIY topsakal_o_Page_088.txt
50cca005f5a700e1eb7eb6fd540c27de
818f0db43e32362fad5b941589855a3ce466780b
2973 F20101112_AABIKC topsakal_o_Page_125.txt
0c49417b58fd898f145f97297ad4bee5
74291db1a498c1a3fbe56d2d58c8be136e4539f8
2135 F20101112_AABIJN topsakal_o_Page_105.txt
df0d4bc96a08c8c6e44161e4e59dd6d1
1fb86f660940292b363c6b51ec1812779a27f0ae
F20101112_AABIIZ topsakal_o_Page_089.txt
89141152417e757c0508d3e10362e75a
0290beaac0db0f516a794d944e43f3bfe2a3f028
2782 F20101112_AABIKD topsakal_o_Page_126.txt
c92680b0bd6e4ff45f3ddaee5b14db1b
83bb8388928e3b77c0fdc1064e12b23286bda735
2471 F20101112_AABIJO topsakal_o_Page_108.txt
feb82e61743db1e30b8cb0842fe9e5f7
e30effa150860718b2352f013abbe7689bdcb1cf
1728 F20101112_AABIKE topsakal_o_Page_129.txt
12756271cf6d311e1755bacdf90dd851
484cf119a8e06801b71c2b0601587ef8b5f13e7c
2227 F20101112_AABIJP topsakal_o_Page_109.txt
532d7bfb28aac37de9b8ca56e4bf77c7
52d5cd8b99b6036e0a34a329cb8638e982687d2b
1512 F20101112_AABIKF topsakal_o_Page_130.txt
b21df77219e0b9929163aec6dd10303b
fa0242decbf73553f7a113a6a24cd8ea349807d8
2133 F20101112_AABIJQ topsakal_o_Page_110.txt
03e12784c813ba6cd80fa9c4e6dcd1fb
3a797d630486518eec31d610042eeb478208eb1c
1811467 F20101112_AABIKG topsakal_o.pdf
71bcbcf835680567d7e6759acc290a2e
21a777c12a7bd61b73d6a942fe09e55d93118397
2039 F20101112_AABIJR topsakal_o_Page_111.txt
c7a796d96fb904a2be9a2a67f12aa699
94b1012465aa3af73117d1db8344e01a05f28025
2056 F20101112_AABIKH topsakal_o_Page_001thm.jpg
2b161c945998dd8a71d3a0061fdab186
60f3886ac03515f40cc8fe110661acdca241e550
F20101112_AABIJS topsakal_o_Page_112.txt
213a26c92b2e9d42eee1ecb0fe15aba3
13527eefa02f2221baac152289f284d35a54ce5d
1473 F20101112_AABIJT topsakal_o_Page_113.txt
f5ea7a1e20057e3a814cda2e0d357435
973d222f8c6abfd36f5900a12d820b3e84b46dc7
3189 F20101112_AABIKI topsakal_o_Page_002.QC.jpg
dee747db7b2c3588e0bb290cb3e79c06
b843fb9b62e3d9d0d69c67d5f578ab4d876ec64d
2452 F20101112_AABIJU topsakal_o_Page_114.txt
cd270a6b2e9a59b090602b05b6bf8258
b45015b5a8f37b8951fdb6e7f5f9e6a263abd47b
1346 F20101112_AABIKJ topsakal_o_Page_002thm.jpg
4747f2f781d7cb0646f17a4b376c47bb
735ec922904653b6d5196a9ed183429c8ee93c34
2173 F20101112_AABIJV topsakal_o_Page_115.txt
7604bf0af36d899c5bedbd04cc5f8e1c
1173312c225e6b0f0afd54ca71035ba84da044ca
1487 F20101112_AABIKK topsakal_o_Page_003thm.jpg
7fb4e4d1820bbf289323580e5b4a5a8c
0b7c84032b9aaa407c6a19ef4360227ea8c50589
F20101112_AABIJW topsakal_o_Page_116.txt
fe843347616a74eb52566edb31fdd856
a7c018f79be0fe7bb9996118734e139681f577cd
4308 F20101112_AABIKL topsakal_o_Page_004thm.jpg
76d01030de454c3bba2933397d6c4445
4f33b3c41cecf93599fc412feb9fb29b7f104961
F20101112_AABIJX topsakal_o_Page_119.txt
66d2a89b4713b2f8ff38bed9265875e6
25144500c3b11ce337f5ec02872d30e8221d8460
6162 F20101112_AABILA topsakal_o_Page_015thm.jpg
46c1f68eadda49887f046eea84b4a478
fa5beb980eb4bae089468023297016908309a2cf
19921 F20101112_AABIKM topsakal_o_Page_005.QC.jpg
122372f1c59224c51b3c0762a4bbe8da
8852cbac7a70650e5e6666416673a6f5883458f3
2172 F20101112_AABIJY topsakal_o_Page_120.txt
c84ee7639c88b778079a911332781784
e0e4f288142f40a2f146183aa6c91e0f769155d0
5703 F20101112_AABILB topsakal_o_Page_016thm.jpg
344317310484bad868e8beee2380d913
3e92d0f567659980471c3294c506cb1733c1b948
4810 F20101112_AABIKN topsakal_o_Page_005thm.jpg
b0e931cdf8fa290cfaaf48098fec16b8
2ae2f77f20ed1a69ee6eee5f8c28dc0d3aa5cfc9
2562 F20101112_AABIJZ topsakal_o_Page_121.txt
d79c3dd2071a6514dd36862f9eec57d1
5c3ce3ffdd3374d1cd368f2de300613186494430
24560 F20101112_AABILC topsakal_o_Page_017.QC.jpg
04c368e3c2ef69b2102292bcaf7a089e
ea14bcd5245ce192209083ef2fa3d9b319e619d3
22349 F20101112_AABIKO topsakal_o_Page_006.QC.jpg
c9a20e68b5c9ab535e2542486c75155b
0664febac7399296435e48bba8adb69c1f624beb
22962 F20101112_AABILD topsakal_o_Page_018.QC.jpg
f9fa4a332dc0fc7c9057e2c9865da8f3
b6e417b33d27092ab028bb27a2166d8b1bf289fe
14455 F20101112_AABIKP topsakal_o_Page_007.QC.jpg
fdaa532223744b4b5db1bc93bc73f4ae
629d7fa9b3abd89ee30b26f0def7b9a2eeec856d
5892 F20101112_AABILE topsakal_o_Page_018thm.jpg
568c74cd76f815d2cda79485c4fc349f
3e5e0dd9fdaa0102a1e83c097c267e2bb6af5601
3774 F20101112_AABIKQ topsakal_o_Page_007thm.jpg
5e436a8310ad52da62806e1666a833af
959b5b56ae67d2183cbb4e1a81663c0c7b3fe218
22861 F20101112_AABILF topsakal_o_Page_019.QC.jpg
6047b6ab4f01a29f09acd698c5ed1e02
8c018d82fe47a903755c6a647907224856dcdfa3
24779 F20101112_AABIKR topsakal_o_Page_008.QC.jpg
965515799f209fb86f851eb12914866b
a8135ccc20951de8f12bbc90c9a3053988f12658
5823 F20101112_AABILG topsakal_o_Page_019thm.jpg
4b4f19b62a6e4ea59cdf8162dc8af760
95ef85d3072ad1eb1bec5dbcbb38b965db1186ba
6058 F20101112_AABIKS topsakal_o_Page_008thm.jpg
e7796b06b6adeb132ae521ee307af2b4
f534d56e3fda7c7a3d6a0bc2e7297f1a22dc852b
22274 F20101112_AABILH topsakal_o_Page_020.QC.jpg
41d3092ebc8ea7a59d4e1422e0cf19d8
cdd4857c1f0676a3851dac4a3c11ec2454b9efec
24420 F20101112_AABIKT topsakal_o_Page_009.QC.jpg
e699303704e449b0af2bf9e292b7ea07
52e2ded97ff7fd631be2cc255927968f1f02d006
5679 F20101112_AABILI topsakal_o_Page_020thm.jpg
78bea5da8fe88b119f2f6e45ab4d964c
f4443104da00c9c885de916bfce4cd0c6ce37da7
5313 F20101112_AABIKU topsakal_o_Page_010thm.jpg
449a42033c7d32ffa74c6baf8dc26326
4d107e46a5c6ea2b5658bfcbb08a93d20f12d7d0
20015 F20101112_AABIKV topsakal_o_Page_011.QC.jpg
ae484dc5d82a195b907712b15380f17b
581463810d168cc5303b9c22f9254766354e5ab3
6200 F20101112_AABILJ topsakal_o_Page_021thm.jpg
68f3f438e0efc57b82a74260d7d27193
423aa28ff98ed40e5dd3bf830299aab9374674e4
5580 F20101112_AABIKW topsakal_o_Page_012thm.jpg
21bfa871f071682e21f024c460b89bfe
1e93445b371a904cbc99b5d99066ddae21216ac9
6241 F20101112_AABILK topsakal_o_Page_022thm.jpg
01569683533087d414838e0bb430a515
9883e77f10fb0ebdcfcdd3755b47c128f27dd6f0
24418 F20101112_AABIKX topsakal_o_Page_013.QC.jpg
082440c119ceb0588db6d4decfbf7fe7
a5276737cb1a68279b3a94e41a379611845b606e
6111 F20101112_AABIMA topsakal_o_Page_032thm.jpg
f0d4c9ccc4676a45127abad0cb120c48
ea50614f58c8a07e26788c7d96d10c6a9ee96923
25850 F20101112_AABILL topsakal_o_Page_023.QC.jpg
b54c28da6bfe6512eb530bec6e5516f5
e8d8531af7065bf3941f3b95dbae720d40a76cd8
17497 F20101112_AABIKY topsakal_o_Page_014.QC.jpg
6263308bdca502f494e0644dca35ccd4
40eae4702423bd213ab2fb233681248251db824a
17242 F20101112_AABIMB topsakal_o_Page_033.QC.jpg
d53f4f8d595842cea0f0872f6f9c9ec7
651e259c324f71b97d5f608979c39bb90bdad929
6382 F20101112_AABILM topsakal_o_Page_023thm.jpg
234183ef22455f14a8fcb021039423bc
e4bde58f8f262662d9f5509fed6f1d9858d94ea0
24755 F20101112_AABIKZ topsakal_o_Page_015.QC.jpg
9b7e4e73a3c529d6b1119a33741b52e3
06cefc22d34015ca7dfff7f9dd51d434074fb9cf
81291 F20101112_AABHJA topsakal_o_Page_041.jpg
8872d7f3c8dab5976221bba1c3f35e78
58ee3a762ebab650314cabc8368a0d1dacadbd70
4578 F20101112_AABIMC topsakal_o_Page_033thm.jpg
306fa06d0067b58d444addae136b0b41
ce38678941208e1c42a98e9206f390260aa810aa
24249 F20101112_AABILN topsakal_o_Page_024.QC.jpg
731384ca3791b9f6a8fb72d330e1627b
4578af9887d9156929b0f1ecb36ddbe7a719f7aa
76065 F20101112_AABHJB topsakal_o_Page_058.jpg
45494eb1ecae9be8bcbddb27b8d36774
a6836b381939843f91e500bb09f0e6ad048651d3
16852 F20101112_AABIMD topsakal_o_Page_034.QC.jpg
854fd26f5f217f39197492eadfc2b534
2176e4412b9709593fe6fdc087b44f3c81102198
5822 F20101112_AABILO topsakal_o_Page_024thm.jpg
a4e27e6eaa81605771b821cb2f8bce0a
a978380a0095c7b2c6e51393961a675259144d1b
47212 F20101112_AABHJC topsakal_o_Page_104.pro
c7bc4edcdf5cf385de4f4b59638fce3f
610f7701fd5e2f1741dc0d31c9fc320bb18a51ae
4587 F20101112_AABIME topsakal_o_Page_034thm.jpg
23a8852fbea5fb53b194b8b54fc024d9
b70af0de64d3827244f21d221969cad2b6df3367
22733 F20101112_AABILP topsakal_o_Page_025.QC.jpg
5735425ad5356f230460f9dda7fb9c98
75f839b57a96823083ed557a60b93255f9186a75
42823 F20101112_AABHJD topsakal_o_Page_057.pro
58fe0754f4b7d2bac52e35ce64d352a6
6b0bfeff27d4566e33e9d358fcfc1df16649c861
19902 F20101112_AABIMF topsakal_o_Page_035.QC.jpg
285840e02fdca0cb699a6d2049f6c94f
6358a2a3fb1f294b884a3920b8277a28e70993c4
1962 F20101112_AABHIO topsakal_o_Page_062.txt
e19e916c47560c70835d8848d9730b5d
97fe644f3e49d8621f0b418c997ca8f7f0f751a5
5792 F20101112_AABILQ topsakal_o_Page_025thm.jpg
cc7504b989b2fab747517ee761aa6eb5
2fe0603ad3ead0da86ecb81b67d66467ee5654ee
F20101112_AABHJE topsakal_o_Page_079.tif
1a87d6507d6589a838273dc42257228c
4da3217adcd7433742320783c6d54b6e34edbada
5310 F20101112_AABIMG topsakal_o_Page_035thm.jpg
833c3b4a05c49495f01bc9df71f3a079
9246a0231d75ae94c73e19e3cf091f88f4d0ab8d
2147 F20101112_AABHIP topsakal_o_Page_028.txt
3e5101e01ce6e2783e7140ba1e48bd30
a489eaef8a326a23852a4a6b0212632649110431
22112 F20101112_AABILR topsakal_o_Page_026.QC.jpg
b2894b92b2ad9afa5138f727408a2f88
d508430e3fc077c037db8978c5095678c141a0f9
50528 F20101112_AABHJF topsakal_o_Page_052.pro
875a40f84bd61fad8a1e766c145f927e
8e7ff09d6d7a889e5409a6772583585af4d3c4c7
19545 F20101112_AABIMH topsakal_o_Page_036.QC.jpg
93d459561797855b0eecce07e386c932
2d3e0376d213947377b7e0d2a81f68dcde4fb031
19874 F20101112_AABHIQ topsakal_o_Page_084.QC.jpg
2f21f862bd905b231051a61b147ff7e7
de3a3036e86f770ca3df23adcf84393764b3cbf3
5786 F20101112_AABILS topsakal_o_Page_026thm.jpg



PAGE 1

SEMANTIC INTEGRATION THROUGH APPLICATION ANALYSIS

By

OGUZHAN TOPSAKAL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

PAGE 2

© 2007 Oguzhan Topsakal

PAGE 3

To all who are patient, supportive, just and loving to others regardless of time, location and status

PAGE 4

ACKNOWLEDGMENTS

I thought quite a lot about the time when I would finish my dissertation and write this acknowledgment section. Finally, the time has come. Here is the place where I can remember all the good memories and thank everyone who helped along the way. However, I feel words are not enough to show my gratitude to those who were there with me all along the road to my Ph.D.

First of all, I give thanks to God for giving me the patience, strength and commitment to come all this way.

I would like to give my sincere thanks to my dissertation advisor, Dr. Joachim Hammer, who has been so kind and supportive to me. He was the perfect person for me to work with. I also would like to thank my other committee members: Dr. Tuba Yavuz-Kahveci, Dr. Christopher M. Jermaine, Dr. Herman Lam, and Dr. Raymond Issa for serving on my committee.

Thanks to Umut Sargut, Zeynep Sargut, Can Ozturk, Fatih Buyukserin and Fatih Gordu for making Gainesville a better place to live.

I am grateful to my parents, H. Nedret Topsakal and Sabahatdin Topsakal; to my brother, Metehan Topsakal; and to my sister-in-law, Sibel Topsakal. They were always there for me when I needed them, and they have always supported me in whatever I do.

My wife, Elif, and I joined our lives during the most hectic times of my Ph.D. studies, and she supported me in every aspect. She is my treasure.

PAGE 5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ........ 4
LIST OF TABLES ........ 8
LIST OF FIGURES ........ 9
ABSTRACT ........ 11

CHAPTER

1 INTRODUCTION ........ 12
  1.1 Problem Definition ........ 12
  1.2 Overview of the Approach ........ 14
  1.3 Contributions ........ 16
  1.4 Organization of the Dissertation ........ 17
2 RELATED CONCEPTS AND RESEARCH ........ 18
  2.1 Legacy Systems ........ 19
  2.2 Data, Information, Semantics ........ 20
  2.3 Semantic Extraction ........ 20
  2.4 Reverse Engineering ........ 22
  2.5 Program Understanding Techniques ........ 24
    2.5.1 Textual Analysis ........ 25
    2.5.2 Syntactic Analysis ........ 25
    2.5.3 Program Slicing ........ 25
    2.5.4 Program Representation Techniques ........ 26
    2.5.5 Call Graph Analysis ........ 26
    2.5.6 Data Flow Analysis ........ 26
    2.5.7 Variable Dependency Graph ........ 26
    2.5.8 System Dependence Graph ........ 27
    2.5.9 Dynamic Analysis ........ 27
  2.6 Visitor Design Patterns ........ 27
  2.7 Ontology ........ 28
  2.8 Web Ontology Language (OWL) ........ 28
  2.9 WordNet ........ 30
  2.10 Similarity ........ 31
  2.11 Semantic Similarity Measures of Words ........ 32
    2.11.1 Resnik Similarity Measure ........ 32
    2.11.2 Jiang-Conrath Similarity Measure ........ 34
    2.11.3 Lin Similarity Measure ........ 34
    2.11.4 Intrinsic IC Measure in WordNet ........ 34
    2.11.5 Leacock-Chodorow Similarity Measure ........ 35

PAGE 6

    2.11.6 Hirst-St.Onge Similarity Measure ........ 36
    2.11.7 Wu and Palmer Similarity Measure ........ 36
    2.11.8 Lesk Similarity Measure ........ 36
    2.11.9 Extended Gloss Overlaps Similarity Measure ........ 37
  2.12 Evaluation of WordNet-Based Similarity Measures ........ 37
  2.13 Similarity Measures for Text Data ........ 37
  2.14 Similarity Measures for Ontologies ........ 39
  2.15 Evaluation Methods for Similarity Measures ........ 41
  2.16 Schema Matching ........ 43
    2.16.1 Schema Matching Surveys ........ 43
    2.16.2 Evaluations of Schema Matching Approaches ........ 45
    2.16.3 Examples of Schema Matching Approaches ........ 46
  2.17 Ontology Mapping ........ 48
  2.18 Schema Matching vs. Ontology Mapping ........ 48
3 APPROACH ........ 49
  3.1 Semantic Analysis ........ 50
    3.1.1 Illustrative Examples ........ 51
    3.1.2 Conceptual Architecture of Semantic Analyzer ........ 53
      3.1.2.1 Abstract syntax tree generator (ASTG) ........ 53
      3.1.2.2 Report template parser (RTP) ........ 55
      3.1.2.3 Information extractor (IEx) ........ 55
      3.1.2.4 Report ontology writer (ROW) ........ 58
    3.1.3 Extensibility and Flexibility of Semantic Analyzer ........ 58
    3.1.4 Application of Program Understanding Techniques in SA ........ 60
    3.1.5 Heuristics Used for Information Extraction ........ 62
  3.2 Schema Matching ........ 67
    3.2.1 Motivating Example ........ 68
    3.2.2 Schema Matching Approach ........ 73
    3.2.3 Creating an Instance of a Report Ontology ........ 75
    3.2.4 Computing Similarity Scores ........ 76
    3.2.5 Forming a Similarity Matrix ........ 81
    3.2.6 From Matching Ontologies to Schemas ........ 81
    3.2.7 Merging Results ........ 82
4 PROTOTYPE IMPLEMENTATION ........ 84
  4.1 Semantic Analyzer (SA) Prototype ........ 84
    4.1.1 Using JavaCC to generate parsers ........ 84
    4.1.2 Execution steps of the information extractor ........ 86
  4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype ........ 88

PAGE 7

5 TEST HARNESS FOR THE ASSESSMENT OF LEGACY INFORMATION INTEGRATION APPROACHES (THALIA) ........ 89
  5.1 THALIA Website and Downloadable Test Package ........ 89
  5.2 DataExtractor (HTMLtoXML) Opensource Package ........ 91
  5.3 Classification of Heterogeneities ........ 92
  5.4 Web Interface to Upload and Compare Scores ........ 94
  5.5 Usage of THALIA ........ 95
6 EVALUATION ........ 96
  6.1 Test Data ........ 96
    6.1.1 Test Data Set from THALIA testbed ........ 96
    6.1.2 Test Data Set from University of Florida ........ 98
  6.2 Determining Weights ........ 102
  6.3 Experimental Evaluation ........ 105
    6.3.1 Running Experiments with THALIA Data ........ 107
    6.3.2 Running Experiments with UF Data ........ 110
7 CONCLUSION ........ 117
  7.1 Contributions ........ 119
  7.2 Future Directions ........ 121

REFERENCES ........ 123
BIOGRAPHICAL SKETCH ........ 131

PAGE 8

LIST OF TABLES

Table ........ page

2-1 List of relations used to connect senses in WordNet. ........ 31
2-2 Absolute values of the coefficients of correlation between human ratings of similarity and the five computational measures. ........ 37
3-1 Semantic Analyzer can transfer information from one method to another through variables and can use this information to discover semantics of a schema element. ........ 62
3-2 Output string gives clues about the semantics of the variable following it. ........ 63
3-3 Output string and the variable may not be in the same statement. ........ 64
3-4 Output strings before the slicing variable should be concatenated. ........ 64
3-5 Tracing back the output text and associating it with the corresponding column of a table. ........ 64
3-6 Associating the output text with the corresponding column in the where-clause. ........ 65
3-7 Column header describes the data in that column. ........ 65
3-8 Column on the left describes the data items listed to its immediate right. ........ 65
3-9 Column on the left and the header immediately above describe the same set of data items. ........ 66
3-10 Set of data items can be described by two different headers. ........ 66
3-11 Header can be processed before being associated with the data on a column. ........ 66
4-1 Subpackage in the sa package and their functionality. ........ 85
6-1 The 10 university catalogs selected for evaluation and size of their schemas. ........ 98
6-2 Portion of a table description from the College of Engineering, the Bridges Project and the Business School schemas. ........ 100
6-3 Names of tables in the College of Engineering, the Bridges Office, and the Business School schemas and number of schema elements that each table has. ........ 101
6-4 Weights found by analytical method for different similarity functions with THALIA test data. ........ 104
6-5 Confusion matrix. ........ 106

PAGE 9

LIST OF FIGURES

Figure ........ page

1-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture. ........ 14
3-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture. ........ 50
3-2 Schema used by an application. ........ 52
3-3 Schema used by a report. ........ 53
3-4 Conceptual view of the Data Reverse Engineering (DRE) module of the Scalable Extraction of Enterprise Knowledge (SEEK) prototype. ........ 54
3-5 Conceptual view of Semantic Analyzer (SA) component. ........ 54
3-6 Report design template example. ........ 55
3-7 Report generated when the above template was run. ........ 56
3-8 Java Servlet generated HTML report showing course listings of CALTECH. ........ 56
3-9 Annotated HTML page generated by analyzing a Java Servlet. ........ 57
3-10 Inter-procedural call graph of a program source code. ........ 61
3-11 Schemas of two data sources that collaborates for a new online degree program. ........ 69
3-12 Reports from two sample universities listing courses. ........ 70
3-13 Reports from two sample universities listing instructor offices. ........ 71
3-14 Similarity scores of schema elements of two data sources. ........ 73
3-15 Five steps of Schema Matching by Analyzing ReporTs (SMART) algorithm. ........ 74
3-16 Unified Modeling Language (UML) diagram of the Schema Matching by Analyzing ReporTs (SMART) report ontology. ........ 76
3-17 Example for a similarity matrix. ........ 81
3-18 Similarity scores after matching report pairs about course listings. ........ 82
3-19 Similarity scores after matching report pairs about instructor offices. ........ 82
4-1 Java Code size distribution of (Semantic Analyzer) SA and (Schema Matching by Analyzing ReporTs) SMART packages. ........ 84
4-2 Using JavaCC to generate parsers. ........ 86
5-1 Snapshot of Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) website. ........ 90

PAGE 10

5-2 Snapshot of the computer science course catalog of Boston University. ........ 91
5-3 Extensible Markup Language (XML) representation of Boston University's course catalog and corresponding schema file. ........ 92
5-4 Scores uploaded to Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) benchmark for Integration Wizard (IWiz) Project at the University of Florida. ........ 94
6-1 Report design practice where all the descriptive texts are headers of the data. ........ 97
6-2 Report design practice where all the descriptive texts are on the left hand side of the data. ........ 97
6-3 Architecture of the databases in the College of Engineering. ........ 99
6-4 Results of the SMART with Jiang-Conrath (JCN), Lin and Levenstein metrics. ........ 107
6-5 Results of COmbination of MAtching algorithms (COMA++) with AllContext and FilteredContext combined matchers and comparison of SMART and COMA++ results. ........ 108
6-6 Receiver Operating Characteristics (ROC) curves of SMART and COMA++ for THALIA test data. ........ 110
6-7 Results of the SMART with different report pair similarity thresholds for UF test data. ........ 112
6-8 F-Measure results of SMART and COMA++ for UF test data when report pair similarity is set to 0.7. ........ 113
6-9 Receiver Operating Characteristics (ROC) curves of the SMART for UF test data. ........ 114
6-10 Comparison of the ROC curves of the SMART and COMA++ for UF test data. ........ 115

PAGE 11

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor Of Philosophy

SEMANTIC INTEGRATION THROUGH APPLICATION ANALYSIS

By

Oguzhan Topsakal

May 2007

Chair: Joachim Hammer
Major: Computer Engineering

Organizations increasingly need to participate in rapid collaborations with other organizations to be successful, and they need to integrate their data sources to share and exchange data in such collaborations. One of the problems that needs to be solved when integrating different data sources is finding semantic correspondences between elements of schemas of disparate data sources (a.k.a. schema matching). Schemas, even those from the same domain, show many semantic heterogeneities. Resolving these heterogeneities is mostly done manually, which is tedious, time-consuming, and expensive. Current approaches to automating the process mainly use the schemas and the data as input to discover semantic heterogeneities. However, the schemas and the data are not sufficient sources of semantics. In contrast, we analyze a valuable source of semantics, namely application source code and report design templates, to improve schema matching for information integration. Specifically, we analyze application source code that generates reports to present the data of the organization in a user-friendly way. We trace the descriptive information on a report back to the corresponding schema element(s) through reverse engineering of the application source code or report design templates and store the descriptive text, data, and the corresponding schema elements in a report ontology instance. We utilize the information we have discovered for schema matching. Our experiments using a fully functional prototype system show that our approach produces more accurate results than current techniques.

PAGE 12

CHAPTER 1
INTRODUCTION

1.1 Problem Definition

The success of many organizations largely depends on their ability to participate in rapid, flexible, limited-time collaborations. The need to collaborate is not limited to business but also applies to government and non-profit organizations such as military, emergency management, health-care, rescue, etc. The success of a business organization depends on its ability to rapidly customize its products, adapt to continuously changing demands, and reduce costs as much as possible. Government organizations, such as the Department of Homeland Security, need to collaborate and exchange intelligence to maintain the security of national borders or to protect critical infrastructure, such as energy supply and telecommunications. Non-profit organizations, such as the American Red Cross, need to collaborate on matters related to public health in catastrophic events, such as hurricanes. The collaboration of organizations produces a synergy to achieve a common goal that would not be possible otherwise.

Organizations participating in a rapid, flexible collaboration environment need to share and exchange data. In order to share and exchange data, organizations need to integrate their information systems and resolve heterogeneities among their data sources. The heterogeneities exist at different levels. There exist physical heterogeneities at the system level because of differences between various internal data storage, retrieval, and representation methods. For example, some organizations might use professional database management systems while others might use simple flat files to store and represent their data. In addition, there exist structural (syntax)-level heterogeneities because of the differences at the schema level. Finally, there exist semantic-level heterogeneities because of the differences in the use of the data which correspond to the same real-world objects [47]. We face a broad range of semantic heterogeneities in information systems because of different viewpoints of designers of these information systems. Semantic heterogeneity is simply a consequence of the independent creation of the information systems [44].

PAGE 13

To resolve semantic heterogeneities, organizations must first identify the semantics of their data elements in their data sources. Discovering the semantics of data automatically has been an important area of research in the database community [22, 36]. However, the process of resolving semantic heterogeneity of data sources is still mostly done manually. Resolving heterogeneities manually is a tedious, error-prone, time-consuming, non-scalable and expensive task. The time and investment needed to integrate data sources become a significant barrier to information integration of collaborating organizations.

In this research, we are developing an integrated novel approach that automates the process of semantic discovery in data sources to overcome this barrier and to help rapid, flexible collaboration among organizations. As mentioned above, we are aware that there exist physical heterogeneities among information sources, but to keep the dissertation focused, we assume data storage, retrieval and representation methods are the same among the information systems to be integrated. According to our experience gained as a software developer in the information technology departments of several banks and software companies, application source code that generates reports encapsulates valuable information about the semantics of the data to be integrated. Reports present data from the data source in a way that is easily comprehensible by the user and can be a rich source of semantics. We analyze application source code to discover semantics to facilitate integration of information systems. We outline the approach in Section 1.2 below and provide a more detailed explanation in Sections 3.1 and 3.2. The research described in this dissertation is a part of the NSF-funded¹ SEEK (Scalable Extraction of Enterprise Knowledge) project, which also serves as a testbed.

¹ The SEEK project is supported by the National Science Foundation under grant numbers CMS-0075407 and CMS-0122193.

PAGE 14

1.2 Overview of the Approach

The results described in this dissertation are based on the work we have done on the SEEK project. The SEEK project is directed at overcoming the problems of integrating legacy data and knowledge across the participants of a collaboration network [45]. The goal of the SEEK project is to develop methods and theory to enable rapid integration of legacy sources for the purpose of data sharing. We apply these methodologies in the SEEK toolkit which allows users to develop SEEK wrappers. A wrapper translates queries from an application to the data source schema at run-time. SEEK wrappers act as an intermediary between the legacy source and decision support tools which require access to the organization's knowledge.

Figure 1-1: Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.
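The run-time role of a wrapper described above can be pictured with a minimal sketch. It assumes a hypothetical mapping table from application-domain names to legacy schema names and handles only a simple projection query; the names `MAPPING`, `translate_query`, `CRS_TBL`, `CRS_TTL` and `INSTR_NM` are invented for illustration and are not part of the SEEK toolkit.

```python
# Hypothetical mapping from application-domain names to legacy schema names.
# In a real setting such rules would be produced at build time; these
# entries are invented for illustration only.
MAPPING = {
    "Course": "CRS_TBL",
    "title": "CRS_TTL",
    "instructor": "INSTR_NM",
}


def translate_query(table, columns, mapping=MAPPING):
    """Rewrite a simple projection query into legacy schema terms."""
    # Refuse to translate names for which no mapping rule is known.
    missing = [name for name in [table, *columns] if name not in mapping]
    if missing:
        raise KeyError(f"no mapping rule for: {', '.join(missing)}")
    legacy_columns = ", ".join(mapping[c] for c in columns)
    return f"SELECT {legacy_columns} FROM {mapping[table]}"


print(translate_query("Course", ["title", "instructor"]))
```

The sketch fails loudly on unmapped names rather than passing them through, since a silently untranslated name would produce a query the legacy source cannot answer.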


In general, SEEK [45, 46] works in three steps: Data Reverse Engineering (DRE), Schema Matching (SM), and Wrapper Generation (WG). In the first step, the Data Reverse Engineering (DRE) component of SEEK generates a detailed description of the legacy source. DRE has two sub-components, the Schema Extractor (SE) and the Semantic Analyzer (SA). SE extracts the conceptual schema of the data source. SA analyzes electronically available information sources such as application code and discovers the semantics of the schema elements of the data source. In other words, SA discovers mappings between data items stored in an information system and the real-world objects they represent by using the pieces of evidence that it extracts from the application code. SA enhances the schema of the data source with the discovered semantics, and we refer to the semantically enhanced schema as the knowledge base of the organization. In the second step, the Schema Matching (SM) component maps the knowledge base of an organization to the knowledge base of another organization. In the third step, the extracted legacy schema and the mapping rules provide the input to the Wrapper Generator (WG), which produces the source wrapper. These three steps of SEEK are build-time processes. At run-time, the source wrapper translates queries from the application domain model to the legacy source schema. A high-level schematic view outlining the SEEK components and their interactions is shown in Figure 1-1.

In this research, our focus is on the Semantic Analysis (SA) and Schema Matching (SM) methodology. We first describe how SA extracts semantically rich outputs from the application source code and then relates them to the schema knowledge extracted by the Schema Extractor (SE). We show that we can gather significant semantic information from the application source code with the methodology we have developed. We then focus on our Schema Matching (SM) methodology. We describe how we utilize the semantic information discovered by SA to find mappings between two data sources. The extracted semantic information and the mappings can then be used by the subsequent


wrapper generation step to facilitate the development of legacy source translators and other tools during information integration, which is not the focus of this dissertation.

1.3 Contributions

In this research, we introduce novel approaches for semantic analysis of application source code and for matching of related but disparate schemas. In this section, we list the contributions of this work. We describe these contributions in detail in Chapter 7 while concluding the dissertation.

External information sources such as corpora of schemas and past matches have been used for schema matching, but application source code has not yet been used as an external information source [25, 28, 78]. In this research, we focus on this well-known but not yet addressed challenge of analyzing application source code for the purpose of semantic extraction for schema matching. The accuracy of the current schema matching approaches is not sufficient for fully automating the process of schema matching [26]. The approach we present in this dissertation provides better accuracy for the purpose of automatic schema matching.

Schema matching approaches so far have mostly used lexical similarity functions or look-up tables to determine the similarity of two schema element properties (for example, the names and types of schema elements). There have been suggestions to utilize semantic similarity measures between words [7], but these have not been realized. We utilize state-of-the-art semantic similarity measures between words to determine similarities and show their effect on the results.

Another important contribution is the introduction of a generic similarity function for matching classes of ontologies. We also describe how we determine the weights of our similarity function. Our similarity function, along with the methodology to determine its weights, can be applied in many domains to determine similarities between different entities.


Integration based on user reports eases the communication between business and information technology (IT) specialists. Business and IT specialists often have difficulty understanding each other. They can discuss the data presented on reports rather than incomprehensible database schemas. Analyzing reports for data integration and sharing therefore helps business and IT specialists communicate better.

One other contribution is the functional extensibility of our semantic analysis methodology. Our information extraction framework lets researchers add new functionality as they develop new heuristics and algorithms on the source code being analyzed. Our current information extraction techniques provide improved performance because they require fewer passes over the source code, and improved accuracy because they eliminate unused code fragments (i.e., methods, procedures).

While conducting the research, we saw that there is a need for available test data of sufficient richness and volume to allow meaningful and fair evaluations between different information integration approaches. To address this need, we developed the THALIA² (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark, which provides researchers with a collection of over 40 downloadable data sources representing University course catalogs, a set of twelve benchmark queries, as well as a scoring function for ranking the performance of an integration system [47, 48].

1.4 Organization of the Dissertation

The rest of the dissertation is organized as follows. We introduce important concepts of the work and summarize research in Chapter 2. Chapter 3 describes our semantic analysis approach and schema matching approach. Chapter 4 describes the implementation details of our prototype. Before we describe the experimental evaluation of our approach in Chapter 6, we describe the THALIA test bed in Chapter 5. Chapter 7 concludes the dissertation and summarizes the contributions of this work.

² THALIA website: http://www.cise.ufl.edu/project/thalia.html


CHAPTER 2
RELATED CONCEPTS AND RESEARCH

In the context of this work, we have explored a broad range of research areas. These research areas include but are not limited to data semantics, semantic discovery, semantic extraction, legacy system understanding, reverse engineering of application code, information extraction from application code, semantic similarity measures, schema matching, ontology extraction and ontology mapping. While developing our approach, we leverage these research areas.

In this chapter, we introduce important concepts and related research that are essential for understanding the contributions of this work. Whenever necessary, we provide our interpretations of definitions and commonly accepted standards and conventions in this field of study. We also present the state of the art in the related research areas.

We first introduce what a legacy system is. Then we state the difference between the frequently used terms data, information and semantics in Section 2.2. We point out some of the research in semantic extraction in Section 2.3. Since we extract semantics through reverse engineering of application source code, we provide the definitions of reverse engineering of source code and database reverse engineering in Section 2.4 and also describe the techniques for program understanding in Section 2.5. We represent the information extracted from the application source code of different legacy systems in ontologies and utilize these ontologies to find semantic similarities between them. For this reason, semantic similarity measures are also important for us. We have explored the research on semantic similarity measures and present these works in Section 2.11 after giving the definition of similarity in Section 2.10. We aim to leverage the research on assessing similarity scores between texts and ontologies. We present these techniques in Sections 2.13 and 2.14. We then present the ontology concept and the Web Ontology Language (OWL). Finally, we present ontology mapping and schema mapping


and conclude the chapter by presenting some outstanding techniques of schema matching in the literature.

2.1 Legacy Systems

Our approaches for semantic analysis of application source code and schema matching have been developed as a part of the SEEK project. The SEEK project aims to help the understanding of legacy systems. We analyze the application source code of a legacy system to understand its semantics and apply the gained knowledge to solve the schema matching problem of data integration. In this section, we first give a broad definition of a legacy system and highlight its importance and then provide its definition in the context of this work.

Legacy systems are generally known as inflexible, nonextensible, undocumented, old and large software systems which are essential for the organization's business [12, 14, 75]. They significantly resist modifications and changes. Legacy systems are very valuable because they are the repository of corporate knowledge collected over a long time and they also encapsulate the logic of the organization's business processes [49].

A legacy system is generally developed and maintained by many different people with many different programming styles. Mostly, the original programmers have left, and the existing team is not expert in all aspects of the system [49]. Even if documentation about the design and specification of the legacy system once existed, the original software specification and design have changed while the documentation was not updated throughout the years of development and maintenance. Thus, understanding is lost, and the only reliable documentation of the system is the application source code running on the legacy system [75].

In the context of this work, we define a legacy system as any information system with poor or nonexistent documentation about the underlying data or the application code that is using the data. Despite the fact that legacy systems are often interpreted as old


systems, for us, an information system is not required to be old in order to be considered as legacy.

2.2 Data, Information, Semantics

In this section, we give definitions of data, information and semantics before we explore some research on semantic extraction in the following section.

According to a simplistic definition, data is the raw, unprocessed input to an information system that produces information as an output. A commonly accepted definition states that data is a representation of facts, concepts or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means [2, 18]. Data mostly consists of disconnected numbers, words, symbols, etc. and results from measurable events or objects.

Data has value when it is processed, changed into a usable form and placed in a context [2]. When data has a context and has been interpreted, it becomes information. Then it can be used purposefully as information [1].

Semantics is the meaning and the use of data. Semantics can be viewed as a mapping between an object stored in an information system and the real-world object it represents [87].

2.3 Semantic Extraction

In this section, we first state the importance of semantic extraction and of application source code as a rich source for semantic extraction, and then point out several related research efforts in this research area.

Sheth et al. [87] stated that data semantics does not seem to have a purely mathematical or formal model and cannot be discovered completely and fully automatically. Therefore, the process of semantic discovery requires human involvement. Besides being human-dependent, semantic extraction is a time-consuming and hence expensive task [36]. Although it cannot be fully automated, discovering even a limited amount of useful semantics can tremendously reduce the cost of understanding a system.


Semantics can be found from knowledge representation schemas, communication protocols, and applications that use the data [87].

Throughout the discussions and research on semantic extraction, application source code has been proposed as a rich source of information [30, 36, 87]. Besides, researchers have agreed that the extraction of semantics from application source code is essential for identification and resolution of semantic heterogeneity.

We use the semantics discovered from application source code to find correspondences between schemas of disparate data sources automatically. In this context, discovering semantics means gathering information about the data, so that a computer can identify mappings (paths) between corresponding schema elements in different data sources.

Jim Ning et al. worked on extracting semantics from application source code but with a slightly different aim. They developed an approach to identify and recover reusable code components [67]. They investigated conditional statements, as we do, to find business rules. They stated that conditional statements are potential business rules. They also gave importance to input and output statements for highlighting semantics inside the code, and stated that meaningful business functions normally process input values and produce results. Jim Ning et al. called investigating input variables forward slicing and investigating output statements backward slicing. The drawback of their approach was that it was very language-specific (Cobol) [67].

N. Ashish et al. worked on extracting semantics from internet information sources to enable semi-automatic wrapper generation [5]. They used several heuristics to identify important tokens and structures of HTML pages in order to create the specification for a parser. Similar to our approach, they benefited from parser generation tools, namely YACC [53] and LEX [59], for semantic extraction.

There are several related works in information extraction from text that deal with tables and ontology extraction from tables. The most relevant work about information extraction from HTML pages with the help of heuristics was done by Wang and Lochovsky


[94]. They aimed to form the schema of the data extracted from an HTML page by using the labels of a table on the page. The heuristics that they use to relate labels to the data and to separate data found in a table cell into several attributes are very similar to ours. For example, they assume that if several attributes are encoded into one text string, then there should be some special symbol(s) in the string acting as a separator that visually helps users distinguish the attributes. Buttler et al. [17] and Embley et al. [32] also developed heuristic-based approaches for information extraction from HTML pages. However, their aim was to identify boundaries of data on an HTML page. Embley et al. [33] also worked on table recognition from documents and suggested a table ontology which is very similar to our report ontology. In a related work, Tijerino et al. [90] introduced an information extraction system called TANGO which recognizes tables based on a set of heuristics, forms mini-ontologies and then merges these ontologies to form a larger application ontology.

2.4 Reverse Engineering

Without an understanding of the system, in other words without accurate documentation of the system, it is not possible to maintain, extend, and integrate the system with other systems [76, 89, 95]. The methodology to reconstruct this missing documentation is reverse engineering. In this section, we first give the definition of reverse engineering in general and then give definitions of program reverse engineering and database reverse engineering. We also state the importance of these tasks.

Reverse engineering is the process of analyzing a technology to learn how it was designed or how it works. Chikofsky and Cross [19] defined reverse engineering as the process of analyzing a subject system to identify the system's components and their interrelationships and as the process of creating representations of the system in another form or at a higher level of abstraction. Reverse engineering is an action to understand the subject system and does not include the modification of it. The reverse of the reverse


engineering is forward engineering. Forward engineering is the traditional process of moving from high-level abstractions and logical, implementation-independent designs to the physical implementation of a system [19]. While reverse engineering starts from the subject system and aims to identify the high-level abstraction of the system, forward engineering starts from the specification and aims to implement the subject system.

Program (software) reverse engineering is recovering the specifications of the software from source code [49]. The recovered specifications can be represented in forms such as data flow diagrams, flowcharts, specifications, hierarchy charts, call graphs, etc. [75]. The purpose of program reverse engineering is to enhance our understanding of the software of the system in order to reengineer, restructure, maintain, extend or integrate the system [49, 75].

Database Reverse Engineering (DBRE) is defined as identifying the possible specification of a database implementation [22]. It mainly deals with schema extraction, analysis and transformation [49]. Chikofsky and Cross [19] defined DBRE as a process that aims to determine the structure, function and meaning of the data of an organization. Hainaut [41] defined DBRE as the process of recovering the schema(s) of the database of an application from the data dictionary and the program source code that uses the data. The objective of DBRE is to recover the technical and conceptual descriptions of the database. It is a prerequisite for several activities such as maintenance, reengineering, extension, migration and integration. DBRE can produce an almost complete abstract specification of an operational database, while program reverse engineering can only produce partial abstractions that can help better understand a program [22, 42].

Many data structures and constraints are embedded inside the source code of data-oriented applications. If a construct or a constraint has not been declared explicitly in the database schema, it is implemented in the source code of the application that updates or queries the database. The data in the database is a result of the execution of the applications of the organization [49]. Even though the data satisfies the constraints of the database, it is verified with the validation mechanisms inside the source code before it


is being updated into the database to ensure that it does not violate the constraints. We can discover some constraints, such as referential constraints, by analyzing the application source code, even if the application program only queries the data but does not modify it. For instance, if there exists a referential constraint (foreign key relation) between the entity named E1 and the entity named E2, this constraint is used to join the data of these two entities with a query. We can discover this referential constraint by analyzing the query [50]. Since program source code is a very useful source of information in which we can discover many implicit constructs and constraints, we use it as an information source for DBRE.

It is well known that the analysis of program source code is a complex and tedious task. However, we do not need to recover the complete specification of the program for DBRE. We are looking for information to enhance the schema and to find the undeclared constraints of the database. In this process, we benefit from several program understanding techniques to extract information effectively. We provide the definitions of program understanding and its techniques in the following section.

2.5 Program Understanding Techniques

In this section, we introduce the concept of program understanding and its techniques. We have implemented these techniques to analyze application source code and extract semantic information effectively.

Program understanding (a.k.a. program comprehension) is the process of acquiring knowledge about an existing, generally undocumented, computer program. The knowledge acquired about the business processes through the analysis of the source code is accurate and up-to-date because the source code is used to generate the application that the organization uses.

Basic actions that can be taken to understand a program are to read the documentation about it, to ask its users for assistance, to read its source code, or to run the program to see what it outputs for specific inputs [50]. Besides these actions, there


are several techniques that we can apply to understand a program. These techniques help the analyst to extract high-level information from low-level code to come to a better understanding of the program. These techniques are mostly performed manually. However, we apply these techniques in our semantic analyzer module to automatically extract information from data-oriented applications. We show how we apply these techniques in our semantic analyzer in Section 3.1.5. We describe the main program understanding techniques in the following subsections.

2.5.1 Textual Analysis

One simple way to analyze a program is to search for a specific string in the program source code. This searched string can be a pattern or a cliché. The program understanding technique that searches for a pattern or a cliché is named pattern matching or cliché recognition. A pattern can include wildcards and character ranges and can be based on other defined patterns. A cliché is a commonly used programming pattern. Examples of clichés are algorithmic computations, such as list enumeration and binary search, and common data structures, such as priority queue and hash table [49, 97].

2.5.2 Syntactic Analysis

Syntactic analysis is performed by a parser that decomposes a program into expressions and statements. The result of the parser is stored in a structure called an abstract syntax tree (AST). An AST is a representation of source code that facilitates the usage of tree traversal algorithms, and it is the basis of most sophisticated program analysis tools [49].

2.5.3 Program Slicing

Program slicing is a technique to extract the statements from a program relevant to a particular computation, specific behavior or interest such as a business rule [75]. The slice of a program with respect to program point p and variable V consists of all statements and predicates of the program that might affect the value of V at point p [96]. Program slicing is used to reduce the scope of program analysis [49, 83]. The slice that affects the


value of V at point p is computed by gathering statements and control predicates by way of a backward traversal of the program, starting at the point p. This kind of slice is also known as backward slicing. When we retrieve statements that can potentially be affected by the variable V starting from a point p, we call it forward slicing. Forward and backward slicing are both types of static slicing because they use only statically available information (source code) for computing.

2.5.4 Program Representation Techniques

Program source code, even when reduced through program slicing, is often too difficult to understand because the program can be huge, poorly structured, and based on poor naming conventions. It is useful to represent the program in different abstract views such as the call graph, data flow graph, etc. [49]. Most program reverse engineering tools provide these kinds of visualization facilities. In the following sections, we present several program representation techniques.

2.5.5 Call Graph Analysis

Call graph analysis is the analysis of the execution order of the program units or statements. If it determines the order of the statements within a program, then it is called intra-procedural analysis. If it determines the calling relationship among the program units, it is called inter-procedural analysis [49, 83].

2.5.6 Data Flow Analysis

Data flow analysis is the analysis of the flow of values from variables to variables between the instructions of a program. The variables defined and the variables referenced by each instruction, such as declaration, assignment and conditional, are analyzed to compute the data flow [49, 83].

2.5.7 Variable Dependency Graph

A variable dependency graph is a type of data flow graph where a node represents a variable and an arc represents a relation (assignment, comparison, etc.) between two variables. If there is a path from variable v1 to variable v2 in the graph, then there is a


sequence of statements such that the value of v1 is in relation with the value of v2. If the relation is an assignment statement, then the arc in the diagram is directed. If the relation is a comparison statement, then the arc is not directed [49, 83].

2.5.8 System Dependence Graph

A system dependence graph is a type of data flow graph that also handles procedures and procedure calls. A system dependence graph represents the passing of values between procedures. When procedure P calls procedure Q, values of parameters are transferred from P to Q, and when Q returns, the return value is transferred back to P [49].

2.5.9 Dynamic Analysis

The program understanding techniques described so far are performed on the source code of the program and are static analyses. Dynamic analysis is the process of gaining increased understanding of a program by systematically executing it [83].

2.6 Visitor Design Patterns

We applied the above program understanding techniques in our semantic analyzer program. We implemented our semantic analyzer by using visitor patterns. In this section, we explain what a visitor pattern is and the rationale for using it.

A Visitor Design Pattern is a behavioral design pattern [38], which is used to encapsulate the functionality that we desire to perform on the elements of a data structure. It gives the flexibility to change the operation being performed on a structure without the need to change the classes of the elements on which the operation is performed. Our goal is to build semantic information extraction techniques that can be applied to any source code and can be extended with new algorithms. The visitor design pattern is the key object-oriented technique to reach this goal. New operations over the object structure can be defined simply by adding a new visitor. Visitor classes localize related behavior in the same visitor, and unrelated sets of behavior are partitioned in their own visitor subclasses. If the classes defining the object structure, in our case the grammar production rules of the programming language, rarely change, but


new operations over the structure are often defined, a visitor design pattern is the perfect choice [13, 71].

2.7 Ontology

An ontology represents a common vocabulary describing the concepts and relationships for researchers who need to share information in a domain [40, 69]. It includes machine-interpretable definitions of basic concepts in the domain and relations among them. Ontologies enable the definition and sharing of domain-specific vocabularies. They are developed to share common understanding of the structure of information among people or software agents, to enable reuse of domain knowledge, and to analyze domain knowledge [69].

According to a commonly quoted definition, an ontology is a formal, explicit specification of a shared conceptualization [40]. For a better understanding, Michael Uschold et al. define the terms in this definition as follows [92]: A conceptualization is an abstract model of how people think about things in the world. An explicit specification means the concepts and relations in the abstract model are given explicit names and definitions. Formal means that the meaning specification is encoded in a language whose formal properties are well understood. Shared means that the main purpose of an ontology is generally to be used and reused across different applications.

2.8 Web Ontology Language (OWL)

The Web Ontology Language (OWL) is a semantic markup language for publishing and sharing ontologies on the World Wide Web [64]. OWL is derived from the DAML+OIL Web Ontology Language. DAML+OIL was developed as a joint effort of researchers who initially developed DAML (DARPA Agent Markup Language) and OIL (Ontology Inference Layer or Ontology Interchange Language) separately.

OWL is designed for processing and reasoning about information by computers instead of just presenting it on the Web. OWL supports more machine interpretability than XML (Extensible Markup Language), RDF (the Resource Description Framework),


and RDF-S (RDF Schema) by providing additional vocabulary along with a formal semantics.

Formal semantics allows us to reason about the knowledge. We may reason about class membership, equivalence of classes, and consistency of the ontology with respect to unintended relationships between classes, and we may classify the instances into classes. RDF and RDF-S can be used to represent ontological knowledge. However, it is not possible to use all reasoning mechanisms with RDF and RDF-S because of some missing features such as disjointness of classes, boolean combinations of classes, cardinality restrictions, etc. [4]. When all these features are added to RDF and RDF-S to form an ontology language, the language becomes very expressive. However, reasoning over it becomes inefficient. For this reason, OWL comes in three different flavors: OWL Lite, OWL DL, and OWL Full.

The entire language is called OWL Full, and it uses all the OWL language primitives. It also allows these primitives to be combined in arbitrary ways with RDF and RDF-S. The price of this expressiveness is that reasoning in OWL Full can be undecidable. OWL DL (OWL Description Logic) is a sublanguage of OWL Full. It includes all OWL language constructs but restricts the ways in which the constructors from OWL and RDF can be used. This makes the computations in OWL DL complete (all conclusions are guaranteed to be computable) and decidable (all computations will finish in finite time). Therefore, OWL DL supports efficient reasoning. OWL Lite limits OWL DL to a subset of constructors (for example, OWL Lite excludes enumerated classes, disjointness statements and arbitrary cardinality), making it less expressive. However, it may be a good choice for hierarchies needing simple constraints [4, 64].

OWL provides an infrastructure that allows a machine to make the same sorts of simple inferences that human beings do. A set of OWL statements by itself (and the OWL spec) can allow you to conclude another OWL statement, whereas a set of XML statements by itself (and the XML spec) does not allow you to conclude any other XML statements. Given the statements (motherOf subProperty parentOf) and (Nedret


motherOf Oguzhan) when stated in OWL, allows you to conclude (Nedret parentOf Oguzhan) based on the logical definition of subProperty as given in the OWL spec. Another advantage of using OWL ontologies is the availability of tools such as Racer, FaCT and Pellet that can reason about them. A reasoner can also help us check whether we have accurately extracted data and description elements from the report. For instance, we can define a rule such as 'No data or description elements can overlap' and check the OWL ontology with a reasoner to see whether this rule is satisfied.

2.9 WordNet

WordNet is an online database which aims to model the lexical knowledge of a native speaker of English.¹ It is designed to be used by computer programs. WordNet links nouns, verbs, adjectives, and adverbs to sets of synonyms [66]. A set of synonyms represents the same concept and is known as a synset in WordNet terminology. For example, the concept of a 'child' may be represented by the set of words: 'kid', 'youngster', 'tiddler', 'tike'. A synset also has a short definition or description of the real-world concept, known as a 'gloss', and has semantic pointers that describe relationships between the current synset and other synsets. The semantic pointers can be of a number of different types including hyponym/hypernym (is-a / has-a), meronym/holonym (part-of / has-part), etc. A list of semantic pointers is given in Table 2-1.² WordNet can also be seen as a large graph or semantic network. Each node of the graph represents a synset and each edge of the graph represents a relation between synsets. Many of the approaches for measuring similarity of words use the graph structure of WordNet [15, 72, 79, 80].

Since the development of WordNet for English by the researchers of Princeton University, many WordNets for other languages have been developed, such as Danish (DanNet), Persian (PersiaNet), Italian (ItalWordNet), etc. There has also been research to

¹ WordNet 2.1 defines 155,327 words of English.
² Table is adapted from [72].


align WordNets of different languages. For instance, EuroWordNet [93] is a multilingual lexical knowledge base that links WordNets of different languages (e.g., Dutch, Italian, Spanish, German, French, Czech and Estonian). In EuroWordNet, the WordNets are linked to an Inter-Lingual-Index which interconnects the languages so that we can go from the synsets in one language to corresponding synsets in other languages.

While WordNet is a database which aims to model a person's knowledge about a language, another research effort, Cyc [57] (derived from En-cyc-lopedia), aims to model a person's everyday common sense. Cyc formalizes common sense knowledge (e.g., 'You cannot remember events that have not happened yet', 'You have to be awake to eat', etc.) in the form of a massive database of axioms.

Table 2-1. List of relations used to connect senses in WordNet.

Relation       Description                  Example
Hypernym       is a generalization of       furniture is a hypernym of chair
Hyponym        is a kind of                 chair is a hyponym of furniture
Troponym       is a way to                  amble is a troponym of walk
Meronym        is part/substance/member of  wheel is a (part) meronym of a bicycle
Holonym        contains part                bicycle is a holonym of a wheel
Antonym        opposite of                  ascend is an antonym of descend
Attribute      attribute of                 heavy is an attribute of weight
Entailment     entails                      ploughing entails digging
Cause          cause to                     to offend causes to resent
Also see       related verb                 to lodge is related to reside
Similar to     similar to                   dead is similar to assassinated
Participle of  is participle of             stored (adj) is the participle of to store
Pertainym of   pertains to                  radial pertains to radius

2.10 Similarity

Similarity is an important subject in many fields such as philosophy, psychology, and artificial intelligence. Measures of similarity or relatedness are used in various applications such as word sense disambiguation, text summarization and annotation, information extraction and retrieval, automatic correction of word errors in text, and text classification [15, 21]. Understanding how humans assess similarity is important to solve many of the problems of cognitive science such as problem solving, categorization, memory retrieval, inductive reasoning, etc. [39].


Similarity of two concepts refers to how many features they have in common and how many features distinguish them. Lin [60] provides an information-theoretic definition of similarity by clarifying the intuitions and assumptions about it. According to Lin, the similarity between A and B is related to their commonality and their difference. Lin assumes that the commonality between A and B can be measured according to the information they contain in common, $I(\mathrm{common}(A, B))$. In information theory, the information contained in a statement is measured by the negative logarithm of the probability of the statement, $I(\mathrm{common}(A, B)) = -\log P(A \cap B)$. Lin also assumes that if we know the description of A and B, we can measure the difference by subtracting the commonality of A and B from the description of A and B. Hence, Lin states that the similarity between A and B, $\mathrm{sim}(A, B)$, is a function of their commonalities and descriptions. That is, $\mathrm{sim}(A, B) = f(I(\mathrm{common}(A, B)), I(\mathrm{description}(A, B)))$.

We also come across the term 'semantic relatedness' while dealing with similarity. Semantic relatedness is a more general concept than similarity and refers to the degree to which two concepts are related [72]. Similarity is one aspect of semantic relatedness. Two concepts are similar if they are related in terms of their likeness (e.g., child-kid). However, two concepts can be related in terms of functionality or frequent association even though they are not similar (e.g., instructor-student, Christmas-gift).

2.11 Semantic Similarity Measures of Words

In this section, we provide a review of semantic similarity measures of words in the literature. This review is not meant to be a complete list of the similarity measures but covers most of the outstanding ones in the literature. Most of the measures below use the hierarchical structure of WordNet.

2.11.1 Resnik Similarity Measure

Resnik [79] provided a similarity measure based on the is-a hierarchy of WordNet and the statistical information gathered from a large corpus of text. Resnik used the statistical information from the large corpus of text to measure the information content.

PAGE 33

According to information theory, the information content of a concept c can be quantified as $-\log P(c)$, where $P(c)$ is the probability of encountering concept c. This formula tells us that as probability increases, informativeness decreases; so the more abstract a concept, the lower its information content. In order to calculate the probability of a concept, Resnik first computed the frequency of occurrence of every concept in a large corpus of text. Every occurrence of a concept in the corpus adds to the frequency of the concept and to the frequency of every concept subsuming the concept encountered. Based on this computation, the formula for the information content is:

$$P(c) = \mathit{freq}(c) / \mathit{freq}(r)$$
$$ic(c) = -\log P(c)$$
$$ic(c) = -\log(\mathit{freq}(c) / \mathit{freq}(r))$$

where r is the root node of the taxonomy and c is the concept.

According to Resnik, the more information two concepts have in common, the more similar they are. The information shared by two concepts is indicated by the information content of the concepts that subsume them in the taxonomy. The formula of the Resnik similarity measure is:

$$\mathit{sim}_{RES}(c_1, c_2) = \max [-\log P(c)]$$

where c is a concept that subsumes both $c_1$ and $c_2$.

One of the drawbacks of the Resnik measure is that it depends completely on the information content of the concept that subsumes the two concepts whose similarity we measure. It does not take the two concepts themselves into account. For this reason, different pairs of concepts that have the same subsumer receive the same similarity value.
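The computation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the taxonomy, the concept names, and the corpus counts below are hypothetical and are not taken from WordNet or from any real corpus. It shows how occurrence counts propagate to subsuming concepts, how ic(c) follows from them, and how the drawback noted above arises.

```python
import math

# Hypothetical toy is-a taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "chair": "artifact",
}
ROOT = "entity"

# Hypothetical counts of direct occurrences of each concept in a corpus.
DIRECT_COUNTS = {"entity": 0, "animal": 2, "artifact": 1,
                 "dog": 5, "cat": 4, "chair": 4}


def subsumers(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def cumulative_frequencies():
    """Each occurrence counts for the concept and for every concept subsuming it."""
    freq = {c: 0 for c in PARENT}
    for concept, count in DIRECT_COUNTS.items():
        for ancestor in subsumers(concept):
            freq[ancestor] += count
    return freq


FREQ = cumulative_frequencies()


def ic(concept):
    """Information content: ic(c) = -log(freq(c) / freq(r))."""
    return -math.log(FREQ[concept] / FREQ[ROOT])


def sim_resnik(c1, c2):
    """Maximum information content over the concepts that subsume both c1 and c2."""
    common = set(subsumers(c1)) & set(subsumers(c2))
    return max(ic(c) for c in common)
```

In this toy setting, `sim_resnik("dog", "cat")` and `sim_resnik("dog", "animal")` return the same value, because both pairs share `animal` as their most informative subsumer; this is the drawback discussed above.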

PAGE 34

2.11.2 Jiang-Conrath Similarity Measure

Jiang and Conrath [52] address the limitations of the Resnik measure. Their measure uses the information content of the two concepts as well as the information content of their lowest common subsumer to compute the similarity of two concepts. The measure is a distance measure that specifies the extent of unrelatedness of two concepts. The formula of the Jiang and Conrath measure is:

$$\mathit{distance}_{JCN}(c_1, c_2) = ic(c_1) + ic(c_2) - 2 \cdot ic(LCS(c_1, c_2))$$

where ic determines the information content of a concept, and LCS determines the lowest common subsuming concept of two given concepts. However, this measure works only with WordNet nouns.

2.11.3 Lin Similarity Measure

Lin [60] introduced a similarity measure between concepts based on his theory of similarity between arbitrary objects. To measure the similarity, Lin uses the information content of the two concepts being compared and the information content of their lowest common subsumer. The formula of the Lin measure is:

$$\mathit{sim}_{LIN}(c_1, c_2) = \frac{2 \log P(c_0)}{\log P(c_1) + \log P(c_2)}$$

where $c_0$ is the lowest common concept that subsumes both $c_1$ and $c_2$.

2.11.4 Intrinsic IC Measure in WordNet

Seco et al. [85] advocate that WordNet can also be used as a statistical resource, with no need for external corpora to compute the information content of a concept.

PAGE 35

They assume that the taxonomic structure of WordNet is organized in a meaningful and principled way, where concepts with many hyponyms [3] convey less information than concepts that are leaves. They provide the formula for information content as follows:

$$ic_{WN}(c) = \frac{\log\frac{\mathit{hypo}(c)+1}{\mathit{max}_{wn}}}{\log\frac{1}{\mathit{max}_{wn}}} = 1 - \frac{\log(\mathit{hypo}(c)+1)}{\log(\mathit{max}_{wn})}$$

In this formula, the function hypo returns the number of hyponyms of a given concept and $\mathit{max}_{wn}$ is the maximum number of concepts that exist in the taxonomy.

2.11.5 Leacock-Chodorow Similarity Measure

Rada et al. [77] were the first to measure semantic relatedness based on the length of the path between two concepts in a taxonomy. Rada et al. measured the semantic relatedness of medical terms, using a medical taxonomy called MeSH. According to this measurement, given a tree-like structure of a taxonomy, the number of links between two concepts is counted, and the concepts are considered more related if the path between them is shorter.

Leacock and Chodorow [56] applied this approach to measure the semantic relatedness of two concepts using WordNet. The measure counts the shortest path between two concepts in the taxonomy and scales it by the depth of the taxonomy:

$$\mathit{related}_{LCH}(c_1, c_2) = -\log\frac{\mathit{shortestpath}(c_1, c_2)}{2D}$$

In the formula, $c_1$ and $c_2$ represent the two concepts, and D is the maximum depth of the taxonomy. [4]

One weakness of the measure is that it assumes the size or weight of every link to be equal. However, concept pairs that are a single link apart lower down in the hierarchy are more related

[3] hyponym: a word that is more specific than a given word.
[4] For WordNet 1.7.1, the value of D is 19.

PAGE 36

than such pairs higher up in the hierarchy. Another limitation of the measure is that it considers only is-a links and only noun hierarchies.

2.11.6 Hirst-St.Onge Similarity Measure

Hirst and St.Onge's [51] measure of semantic relatedness is based on the idea that two concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that does not change direction too often [15, 72].

The Hirst-St.Onge measure considers all the relations defined in WordNet. All links in WordNet are classified as Upward (e.g., part-of), Downward (e.g., subclass) or Horizontal (e.g., opposite-meaning). They also describe three types of relations between words: extra-strong, strong and medium-strong.

The strength of the relationship is given by:

$$\mathit{rel}_{HS}(c_1, c_2) = C - \mathit{pathlength} - k \cdot d$$

where d is the number of changes of direction in the path, and C and k are constants; if no such path exists, the strength of the relationship is zero and the concepts are considered unrelated.

2.11.7 Wu and Palmer Similarity Measure

The Wu and Palmer [98] measure expresses similarity in terms of the depth of the two concepts in the WordNet taxonomy and the depth of their lowest common subsumer (LCS):

$$\mathit{sim}_{WUP}(c_1, c_2) = \frac{2 \cdot \mathit{depth}(LCS)}{\mathit{depth}(c_1) + \mathit{depth}(c_2)}$$

2.11.8 Lesk Similarity Measure

Lesk [58] defines relatedness as a function of dictionary definition overlaps of concepts. He describes an algorithm that disambiguates words based on the extent of overlap of their dictionary definitions with those of words in the context. The sense of the target word with the maximum overlap is selected as the assigned sense of the word.
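A rough sketch of the definition-overlap idea behind the Lesk measure is given below. The glosses, the sense identifiers, and the stop-word list are hypothetical and only serve as an illustration; overlap is counted here as the number of shared non-stop words, which is one simple way to score it and not necessarily the scoring used in [58].

```python
# Hypothetical mini-dictionary: word -> {sense identifier: gloss}.
GLOSSES = {
    "bank": {
        "bank#1": "a financial institution that accepts deposits and lends money",
        "bank#2": "sloping land beside a body of water such as a river",
    },
    "deposit": {
        "deposit#1": "money placed in a financial institution for safekeeping",
    },
    "river": {
        "river#1": "a large natural stream of water flowing to the sea",
    },
}

# Hypothetical stop-word list; function words carry no sense information.
STOP_WORDS = {"a", "an", "the", "of", "or", "and", "in", "that", "to", "for"}


def content_words(gloss):
    """Lowercase a gloss and return its set of non-stop words."""
    return {word.lower() for word in gloss.split()} - STOP_WORDS


def overlap(gloss_a, gloss_b):
    """Number of content words shared by two glosses."""
    return len(content_words(gloss_a) & content_words(gloss_b))


def disambiguate(target, context_words):
    """Pick the sense of `target` whose gloss overlaps most with the glosses
    of the words in the context."""
    def score(sense):
        return sum(
            overlap(GLOSSES[target][sense], gloss)
            for word in context_words
            for gloss in GLOSSES.get(word, {}).values()
        )
    return max(GLOSSES[target], key=score)
```

With these toy glosses, the word `bank` next to `deposit` resolves to the financial sense, while `bank` next to `river` resolves to the sloping-land sense.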

PAGE 37

Table 2-2. Absolute values of the coefficients of correlation between human ratings of similarity and the five computational measures.

Measure                 Miller & Charles    Rubenstein & Goodenough
Hirst and St-Onge       .744                .786
Jiang and Conrath       .850                .781
Leacock and Chodorow    .816                .838
Lin                     .829                .819
Resnik                  .774                .779

2.11.9 Extended Gloss Overlaps Similarity Measure

Banerjee and Pedersen [9, 72] provided a measure by adapting Lesk's measure to WordNet. Their measure is called `the extended gloss overlaps measure' and takes into account not only the two concepts that are being measured but also the concepts related to them through WordNet relations. An extended gloss of a concept c1 is prepared by adding the glosses of the concepts that are related to c1 through a WordNet relation r. The measurement for two concepts c1 and c2 is based on the overlaps of their extended glosses.

2.12 Evaluation of WordNet-Based Similarity Measures

Budanitsky and Hirst [16] evaluated six different metrics using WordNet and listed the coefficients of correlation between the metrics and human ratings according to the experiments conducted by Miller & Charles [65] and Rubenstein & Goodenough [82]. We present the results of Budanitsky & Hirst's experiments in Table 2-2. According to this evaluation, the Jiang and Conrath metric [52] and the Lin metric [60] are among the best measures. As a result, we use the Jiang and Conrath measure as well as the Lin semantic similarity measure to assign similarity scores between text strings.

2.13 Similarity Measures for Text Data

Several approaches have been used to assess a similarity score between texts. One of the simplest methods is to assess a similarity score based on the number of lexical units that occur in both text segments. Several processes such as stemming, stop-word removal, longest subsequence matching, and weighting factors can be applied to this method

PAGE 38

for improvement. However, these lexical matching methods are not enough to identify the semantic similarity of texts. One of the attempts to identify semantic similarity between texts is the latent semantic analysis (LSA) method [5] [55], which aims to measure similarity between texts by including additional related words. LSA is successful to some extent but has not been used on a large scale, due to the complexity and computational cost of its algorithm.

Corley and Mihalcea [21] introduced a metric for text-to-text semantic similarity by combining word-to-word similarity metrics. To assess a similarity score for a text pair, they first create separate sets for nouns, verbs, adjectives, adverbs, and cardinals for each text. Then they determine pairs of similar words across the sets in the two text segments. For nouns and verbs, they use a semantic similarity metric based on WordNet, and for other word classes they use lexical matching techniques. Finally, they sum up the similarity scores of similar word pairs. This bag-of-words approach improves significantly over the traditional lexical matching metrics. However, as they acknowledge, a metric of text semantic similarity should take into account the relations between words in a text.

In another approach to measuring semantic similarity between documents, Aslam and Frost [6] assume that a text is composed of a set of independent term features and employ Lin's [60] metric for measuring similarity of objects that can be described by a set of independent features. The similarity of two documents in a pile of documents can be calculated by the following formula:

$$\mathit{Sim}_{IT}(a, b) = \frac{2 \sum_t \min(P_{a:t}, P_{b:t}) \log P(t)}{\sum_t P_{a:t} \log P(t) + \sum_t P_{b:t} \log P(t)}$$

where the probability $P(t)$ is the fraction of corpus documents containing term t, $P_{b:t}$ is the fractional occurrence of term t in document b ($\sum_t P_{b:t} = 1$) and two documents a

[5] URL of LSA: http://lsa.colorado.edu/

PAGE 39

and b share $\min(P_{a:t}, P_{b:t})$ amount of term t in common, while they contain $P_{a:t}$ and $P_{b:t}$ amount of term t individually.

Another approach, by Oleshchuk and Pedersen [70], uses ontologies as a filter before assessing similarity scores of texts. They interpret a text based on an ontology and find out how many of the terms (concepts) of the ontology exist in the text. They assign a similarity score for text t1 and text t2 after comparing the ontology o1 extracted from t1 based on the ontology O and the ontology o2 extracted from t2 based on the same ontology O. The base ontology acts as a context filter on the texts and, depending on the base ontology used, texts may or may not be similar.

2.14 Similarity Measures for Ontologies

Rodriguez and Egenhofer [81] suggested assessing semantic similarity among entity classes from different ontologies based on a matching process that uses information about common and different characteristic features of ontologies based on their specifications. The similarity score of two entities from different ontologies is the weighted sum of the similarity scores of the components of the compared entities. Similarity scores are measured independently for three components of an entity. These components are the `set of synonyms', the `set of semantic relations', and the `set of distinguishing features' of the entity. They further suggest classifying the distinguishing features into `functions', `parts', and `attributes', where `functions' represent what is done to or with an instance of that entity, `parts' are structural elements of an entity such as the leg or head of a human body, and `attributes' are additional characteristics of an entity such as the age or hair color of a person.

Rodriguez and Egenhofer point out that if compared entities are related to the same entities, they may be semantically similar. Thus, they interpret comparing semantic

PAGE 40

relations as comparing semantic neighborhoods of entities. [6] The formula of overall similarity between entity a of ontology p and entity b of ontology q is as follows:

$$S(a^p, b^q) = w_w \cdot S_w(a^p, b^q) + w_u \cdot S_u(a^p, b^q) + w_n \cdot S_n(a^p, b^q)$$

where $S_w$, $S_u$, and $S_n$ are the similarities between synonym sets, features, and semantic neighborhoods, and $w_w$, $w_u$, and $w_n$ are the respective weights, which add up to 1.0.

While calculating a similarity score for each component of an entity, they also take noncommon characteristics into account. The similarity of a component is measured by the following formula:

$$S(a, b) = \frac{|A \cap B|}{|A \cap B| + \alpha(a, b)\,|A / B| + (1 - \alpha(a, b))\,|B / A|}$$

where $\alpha$ is a function that defines the relative importance of the noncommon characteristics. They calculate $\alpha$ in terms of the depth of the entities in their ontologies.

Maedche and Staab [63] suggest measuring the similarity of ontologies at two levels: lexical and conceptual. At the lexical level, they use an edit distance measure to find the similarity between the two sets of terms (concepts or relations) that form the ontologies. While measuring similarity at the conceptual level, they take into account all super- and sub-concepts of the two concepts from the two different ontologies.

According to Ehrig et al. [31], comparing ontologies should go far beyond comparing the representation of the entities of the ontologies and should take their relation to real-world entities into account. For this, Ehrig et al. suggested a general framework for measuring similarities of ontologies which consists of four layers: the data, ontology, context, and domain knowledge layers. In the data layer, they compare data values by

[6] The semantic neighborhood of an entity class is the set of entity classes whose distance to the entity class is less than or equal to a nonnegative integer.

PAGE 41

using generic similarity functions such as edit distance for strings. In the ontology layer, they consider semantic relations by using the graph structure of the ontology. In the context layer, they compare the usage patterns of entities in ontology-based applications. According to Ehrig et al., if two entities are used in the same (related) context, then they are similar. They also propose integrating the domain knowledge layer into any of the three layers as needed. Finally, they arrive at an overall similarity function which incorporates all layers of similarity.

Euzenat and Valtchev [34, 35] proposed a similarity measure for OWL-Lite ontologies. Before measuring similarity, they first transform the OWL-Lite ontology into an OL-graph structure. Then, they define similarity between nodes of the OL-graphs depending on the category and the features (e.g., relations) of the nodes. They combine the similarities of features by a weighted sum approach.

A similar work by Bach and Dieng-Kuntz [8] proposes a measure for comparing OWL-DL ontologies. Unlike Euzenat and Valtchev's work, Bach and Dieng-Kuntz adjust the manually assigned feature weights of an OWL-DL entity dynamically in case the features do not exist in the definition of the entity.

2.15 Evaluation Methods for Similarity Measures

There are three kinds of approaches for evaluating similarity measures [15]. These are evaluation by theoretical examination (e.g., Lin [60]), evaluation by comparison with human judgments, and evaluation by calculating the performance within a particular application.

Evaluation by comparison with human judgments has been used by many researchers, such as Resnik [79], and Jiang and Conrath [52]. Most researchers refer to the same experiment on human judgment to evaluate their performance, due to the expense and difficulty of arranging such an experiment. This experiment was conducted by Rubenstein and Goodenough [82], and a later replication of it was done by Miller and Charles [65]. Rubenstein and Goodenough had human subjects assign degrees of synonymy, on a scale from 0 to 4, to 65 pairs of carefully chosen words. Miller and Charles

PAGE 42

repeated the experiment on a subset of 30 word pairs of the 65 pairs used by Rubenstein and Goodenough. Rubenstein and Goodenough used 15 subjects for scoring the word pairs, and the average of these scores was reported. Miller and Charles used 38 subjects in their experiments.

Rodriguez and Egenhofer also used human judgments to evaluate the quality of their similarity measure for comparing different ontologies [81]. They used the Spatial Data Transfer Standard (SDTS) ontology, the WordNet ontology, the WS ontology (created from the combination of WordNet and SDTS) and subsets of these ontologies. They conducted two experiments. In the first experiment, they compared different combinations of ontologies to obtain diverse grades of similarity between ontologies. These combinations include identical ontologies (WordNet to WordNet), ontology and sub-ontology (WordNet to WordNet's subset), overlapping ontologies (WordNet to WS), and different ontologies (WordNet to SDTS). In the second experiment, they asked human subjects to rank the similarity of an entity to other selected entities based on the definitions in the WS ontology. Then, they compared the average of the human rankings with the rankings based on their similarity measure using different combinations of ontologies.

Evaluation by calculating the performance within a particular application is another approach for the evaluation of similarity metrics. Budanitsky and Hirst [15] used this approach to evaluate the performance of their metric within an NLP application, the detection of malapropisms. [7] Patwardhan [72] also used this approach to evaluate his metric within the word sense disambiguation [8] application.

[7] Malapropism: the unintentional misuse of a word by confusion with one that sounds similar.
[8] Word sense disambiguation: the problem of selecting the most appropriate meaning or sense of a word, based on the context in which it occurs.
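The evaluation against human judgments described in this section can be sketched as follows. The sketch assumes that agreement is measured with the Pearson correlation coefficient and that each word pair is rated by several subjects whose ratings are averaged; the word pairs, ratings, and measure scores are hypothetical and are not taken from the experiments cited above.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical ratings on a 0-4 scale, one list of subject ratings per word pair.
HUMAN_RATINGS = {
    ("car", "automobile"): [4, 4, 3],
    ("food", "fruit"): [3, 2, 2],
    ("glass", "magician"): [0, 1, 0],
}

# Hypothetical scores produced by some computational similarity measure.
MEASURE_SCORES = {
    ("car", "automobile"): 0.95,
    ("food", "fruit"): 0.60,
    ("glass", "magician"): 0.10,
}


def evaluate(human_ratings, measure_scores):
    """Absolute correlation between averaged human ratings and measure scores."""
    pairs = sorted(human_ratings)
    human = [mean(human_ratings[pair]) for pair in pairs]
    computed = [measure_scores[pair] for pair in pairs]
    return abs(pearson(human, computed))
```

A value of `evaluate(HUMAN_RATINGS, MEASURE_SCORES)` close to 1 would indicate that the measure ranks and spaces the pairs much as the human subjects do.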

PAGE 43

2.16 Schema Matching

Schema matching is the task of producing a mapping between elements of two schemas that correspond to each other [78]. When we match two schemas S and T, we decide if any element or elements of S refer to the same real-world concept as any element or elements of T [28]. The match operation over two schemas produces a mapping. A mapping is a set of mapping elements. Each mapping element indicates that certain element(s) in S are mapped to certain element(s) in T. A mapping element can have a mapping expression which specifies how the schema elements are related. A mapping element can be defined as a 5-tuple (id, e, e', n, R), where id is the unique identifier, e and e' are schema elements of the matching schemas, n is the confidence measure (usually in the [0,1] range) between the schema elements e and e', and R is a relation (e.g., equivalence, mismatch, overlapping) [88].

Schema matching has many application areas, such as data integration, data warehousing, semantic query processing, agent communication, web services integration, catalog matching, and P2P databases [78, 88]. The match operation is mostly done manually. Manually generating the mapping is a tedious, time-consuming, error-prone, and expensive process. There is a need to automate the match operation. This would be possible if we could discover the semantics of schemas, make the implicit semantics explicit and represent them in a machine-processable way.

2.16.1 Schema Matching Surveys

Schema matching is a very well-researched topic in the database community. Erhard Rahm and Philip Bernstein provide an excellent survey of schema matching approaches by reviewing previous works in the context of schema translation and integration, knowledge representation, machine learning and information retrieval [78]. In their survey, they clarify terms such as match operation, mapping, mapping element, and mapping expression in the context of schema matching. They also introduce application areas of schema matching such as schema integration, data warehouses, message translation, and query processing.
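The mapping element defined in Section 2.16 as a 5-tuple (id, e, e', n, R) can be represented, for illustration, by a small data structure. The class and field names, the example schema element names, and the strict check that n lies in [0, 1] are assumptions made for this sketch; the definition above only says that n is usually in that range.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Relation(Enum):
    """Relation R between matched elements (examples named in the text)."""
    EQUIVALENCE = "equivalence"
    MISMATCH = "mismatch"
    OVERLAPPING = "overlapping"


@dataclass(frozen=True)
class MappingElement:
    id: str                   # unique identifier
    source: Tuple[str, ...]   # element(s) e of schema S
    target: Tuple[str, ...]   # element(s) e' of schema T
    confidence: float         # confidence measure n
    relation: Relation        # relation R

    def __post_init__(self):
        # Assumption of this sketch: reject confidences outside [0, 1].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def confident_equivalences(mapping, threshold):
    """Mapping elements that assert equivalence with confidence >= threshold."""
    return {m for m in mapping
            if m.relation is Relation.EQUIVALENCE and m.confidence >= threshold}


# A mapping is a set of mapping elements; the element names are hypothetical.
MAPPING = frozenset({
    MappingElement("m1", ("S.CourseInst.Name",), ("T.Instructor.FullName",),
                   0.9, Relation.EQUIVALENCE),
    MappingElement("m2", ("S.CourseInst.Code",), ("T.Course.Title",),
                   0.2, Relation.MISMATCH),
})
```

Using tuples for `source` and `target` lets one mapping element relate several elements of S to several elements of T, as the definition allows.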

PAGE 44

The most significant contribution of their survey is the classification of schema matching approaches, which helps in understanding the schema matching problem. They consider a wide range of classification criteria such as instance-level vs. schema-level, element vs. structure, linguistic-based vs. constraint-based, matching cardinality, using auxiliary data (e.g., dictionaries, previous mappings, etc.), and combining different matchers (e.g., hybrid, composite). However, it is very rare that one approach falls under only one leaf of the classification tree presented in that survey. A schema matching approach needs to exploit all the possible inputs to achieve the best possible result, and needs to combine matchers either in a hybrid way or in a composite way. For this reason, most of the approaches use more than one technique and fall under more than one leaf of the classification tree. For example, our approach uses auxiliary data (i.e., application source code) and uses linguistic similarity techniques (e.g., name and description) as well as constraint-based techniques (e.g., type of the related schema element) on the data.

A recent survey by Anhai Doan and Alon Halevy [28] classifies matching techniques under two main groups: rule-based and learning-based solutions. Our approach falls under the rule-based group, which is relatively inexpensive and does not require training. Anhai Doan and Alon Halevy also describe challenges of schema matching. They point out that, since data sources become legacy (poorly documented), schema elements are typically matched based on schema and data. However, the clues gathered by processing the schema and data are often unreliable, incomplete and not sufficient to determine the relationships among schema elements. Our approach aims to overcome this fundamental challenge by analyzing reports for more reliable, complete and sufficient clues.

Anhai Doan and Alon Halevy also state that schema matching becomes more challenging because matching approaches must consider all the possible matching combinations between schemas to make sure there is no better mapping. Considering all the possible combinations increases the cost of the matching process. Our approach

PAGE 45

helps us overcome this challenge by focusing on the subset of schema elements that are used on a report pair.

Another challenge that Anhai Doan and Alon Halevy state is the subjectivity of the matching. This means that the mapping depends on the application and may change in different applications even though the underlying schemas are the same. By analyzing report-generating application source code, we believe we produce more objective results.

Anhai Doan and Alon Halevy's survey also adds two more application areas of schema matching to the application areas mentioned in Rahm and Bernstein's survey. These application areas are peer data management and model management.

A more recent survey by Pavel Shvaiko and Jérôme Euzenat [88] points out new application areas of schema matching such as agent communication, web service integration and catalog matching. In their survey, Pavel Shvaiko and Jérôme Euzenat consider only schema-based approaches, not instance-based approaches, and provide a new classification tree by building on the previous work of Erhard Rahm and Philip Bernstein. They interpret the classification of Erhard Rahm and Philip Bernstein and provide two new classification trees based on granularity and kinds of input, with nodes added to the original classification tree of Erhard Rahm and Philip Bernstein. Finally, Hong-Hai Do summarizes recent advances in the field in his dissertation [25].

2.16.2 Evaluations of Schema Matching Approaches

The approaches to solving the problem of schema matching evaluate their systems by using a variety of methodologies, metrics and data, which are not usually publicly available. This makes it hard to compare these approaches. However, there have been works to benchmark the effectiveness of a set of schema matching approaches [26, 99].

Hong Hai Do et al. [26] specify four comparison criteria. These criteria are kind of input (e.g., schema information, data instances, dictionaries, and mapping rules), match results (e.g., matching between schema elements, nodes or paths), quality measures (metrics such as recall, precision and f-measure) and effort (e.g., pre- and post-match

PAGE 46

efforts for training of learners, dictionary preparation and correction). Mikalai Yatskevich in his work [99] compares the approaches based on the criteria stated in [26] and adds time measures as the fifth criterion.

Hong Hai Do et al. only use the information available in the publications describing the approaches and their evaluation. In contrast, Mikalai Yatskevich provides real-time evaluations of matching prototypes, rather than reviewing the results presented in the papers. Mikalai Yatskevich compares only three approaches (COMA [24], Cupid [62] and Similarity Flooding (SF) [86]) and concludes that COMA performs best on large schemas and Cupid is best for small schemas. Hong Hai Do et al. provide a broader comparison by reviewing six approaches (Automatch [10], COMA [24], Cupid [62], LSD [27], Similarity Flooding (SF) [86], SemInt).

2.16.3 Examples of Schema Matching Approaches

In the rest of this section, we review some of the significant approaches to schema matching and describe their similarities to and differences from our approach. We review the LSD, Corpus-based, COMA and Cupid approaches below.

The LSD (Learning Source Descriptions) approach [27] uses machine-learning techniques to match data sources to a global schema. The idea of LSD is that after a training period of determining mappings between data sources and the global schema manually, the system should learn from previous mappings and successfully propose mappings for new data sources. The LSD system is a composite matcher, meaning that it combines the results of several independently executed matchers. LSD consists of several learners (matchers). Each learner can exploit different types of characteristics of the input data such as name similarities, format, and frequencies. Then the predictions of the different learners are combined. The LSD system is extensible since it has independently working learners (matchers). When new learners are developed, they can be added to the system to enhance the accuracy. The extensibility of the LSD system is similar to the extensibility of our system because we can also add new visitor patterns to our system to

PAGE 47

extract more information to enhance the accuracy. The LSD approach is also similar to our approach in that it comes to a final decision by combining several results coming from different learners. We also combine several results, which come from matching the ontologies of report pairs, to reach a final decision. The LSD approach is a learner-based solution and requires training, which makes it relatively expensive because of the initial manual effort. Our approach, however, needs no initial effort other than collecting the relevant report-generating source code.

One of the distinguished approaches that uses external evidence is the Corpus-based Schema Matching approach [43, 61]. Our approach is similar to Corpus-based Schema Matching in the sense that we also utilize external data rather than solely depending on the matching schemas and their data. The Corpus-based schema matching approach constructs a knowledge base by gathering relevant knowledge from a large corpus of database schemas and previously validated mappings. This approach identifies interesting concepts and patterns in a corpus of schemas and uses this information to match two unseen schemas. However, learning from the corpus and extracting patterns is a challenging task. This approach also requires initial effort to create a corpus of interest and then requires tuning effort to eliminate useless schemas and to add useful schemas.

The COMA (COmbination of MAtching algorithms) approach [24] is a composite schema matching approach. It develops a platform to combine multiple matchers in a flexible way. It provides an extensible library of matching algorithms and a framework to combine the obtained results. The COMA approach has been superior to other systems in the evaluations [26, 99]. The COMA++ [7] approach improves the COMA approach by supporting schemas and ontologies written in different languages (i.e., SQL, W3C XSD and OWL) and by bringing new match strategies such as fragment-based matching and reuse-oriented matching. The fragment-based approach follows the divide-and-conquer idea: it decomposes a large schema into smaller subsets, aiming to achieve better match quality and execution time with the reduced problem size, and then merges the results of

PAGE 48

matching fragments into a global match result. Our approach also considers matching small subsets of a schema that are covered by reports and then merging these match results into a global match result, as described in Chapter 3.

The Cupid approach [62] combines linguistic and structural matchers in a hybrid way. It is both element and structure based. It also uses dictionaries as auxiliary data. It aims to provide a generic solution across data models and uses XML and relational examples. The structural matcher of Cupid transforms the input into a tree structure and assesses a similarity value for a node based on the node's linguistic similarity value and the similarity values of its leaves.

2.17 Ontology Mapping

Ontology mapping is determining which concepts and properties of two ontologies represent similar notions [68]. There are several other terms relevant to ontology mapping that are sometimes used interchangeably with the term mapping. These are alignment, merging, articulation, fusion, and integration [54]. The result of ontology mapping is used in application domains similar to those of schema matching, such as data transformation, query answering, and web services integration [68].

2.18 Schema Matching vs. Ontology Mapping

Schema matching and ontology mapping are similar problems [29]. However, ontology mapping generally aims to match richer structures. Generally, ontologies have more constraints on their concepts and have more relations among these concepts. Another difference is that a schema often does not provide explicit semantics for its data, while an ontology is a system that itself contains semantics either intuitively or formally [88]. The database community deals with the schema matching problem and the AI community deals with the ontology mapping problem. We can perhaps fill the gap between these similar yet separately studied subjects.

PAGE 49

CHAPTER 3
APPROACH

In Chapter 1, we stated the need for rapid, flexible, limited-time collaborations among organizations. We also underlined that organizations need to integrate their information sources to exchange data in order to collaborate effectively. However, integrating information sources is currently a labor-intensive activity because machine-processable documentation of the data sources is nonexistent or outdated. We defined legacy systems as information systems with poor or nonexistent documentation in Section 2.1. Integrating legacy systems is tedious, time-consuming and expensive because the process is mostly manual. To automate the process, we need to develop methodologies to automatically discover semantics from the electronically available information sources of the underlying legacy systems.

In this chapter, we state our approach for extracting semantics from legacy systems and for using these semantics in the schema matching process of information source integration. We developed our approach in the context of the SEEK (Scalable Extraction of Enterprise Knowledge) project. As we show in Figure 3-1, the Semantic Analyzer (SA) takes as input the output of the Schema Extractor (SE), i.e., the schema of the data source, together with the application source code or report templates. After the semantic analysis process, SA stores its output, the extracted semantic information, in a repository which we call the knowledge base of the organization. Then, the Schema Matcher (SM) uses this knowledge base as an input and produces mapping rules as an output. Finally, these mapping rules are an input to the Wrapper Generator (WG), which produces source wrappers. In Section 3.1, we first state our approach for semantic extraction using SA. Then, in Section 3.2, we show how we utilize the semantics discovered by SA in the subsequent schema matching phase. The schema matching phase is followed by the wrapper generation phase, which is not described in this dissertation.

PAGE 50

Figure 3-1. Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.

3.1 Semantic Analysis

Our approach to semantic analysis is based on the observation that application source code can be a rich source of semantic information about the data source it is accessing. Specifically, semantic knowledge extracted from application source code frequently contains information about the domain-specific meanings of the data or the underlying schema elements. According to these observations, for example, application code usually has embedded queries, and the data retrieved or manipulated by queries is stored in variables and displayed to the end user in output statements. Many of these output

PAGE 51

statements contain additional semantic information, usually in the form of descriptive text or markup [36, 84, 87]. These output statements become semantically valuable when they are used to communicate with the end user in a formatted way. One way of communicating with the end user is producing reports. Reports and other user-oriented output, which are typically generated by report generators or application source code, do not use the names of schema elements directly but rather provide more descriptive names for the data to make the output more comprehensible to the users. We claim that these descriptive names, together with their formatting instructions, can be extracted from the application code generating the report and can be related to the underlying schema elements in the data source. We can trace the variables used in output statements throughout the application code and relate the output to the query that retrieves data from the data source and, indirectly, to the schema elements. The descriptive text and formatting instructions are valuable information that helps discover the semantics of the schema elements. In the next subsection, we explain this idea using illustrative examples.

3.1.1 Illustrative Examples

In this section, we illustrate our idea of semantic extraction on two simple examples. On the left hand side of Figure 3-2, we see a relation and its attributes from a relational database schema. By looking at the names of the relation and its attributes, it is hard to understand what kind of information this relation and its attributes store. For example, this relation can be used for storing information about `courses' or `instructors'. The attribute Name can hold information about `course names' or `instructor names'. Without any further knowledge of the schema, we would probably not be able to understand the full meaning of these schema items in the relation `CourseInst'. However, we can gather information about the semantics of these schema items by analyzing the application source code that uses these schema items.
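As a rough illustration of the tracing idea described above, the following sketch relates descriptive labels found in output statements to schema elements through the variables they share with an embedded query. The record formats, the variable names, the labels, and the attribute `CourseInst.Code` are hypothetical simplifications introduced for this sketch; they do not reproduce the Semantic Analyzer's actual data structures or heuristics.

```python
# Hypothetical result of analyzing an embedded query: which schema element
# each program variable was filled from.
QUERY_BINDINGS = {
    "v1": "CourseInst.Name",
    "v2": "CourseInst.Code",
}

# Hypothetical output statements found in the code: the descriptive label
# shown to the user and the variable displayed next to it.
OUTPUT_STATEMENTS = [
    ("Instructor Name", "v1"),
    ("Course Code", "v2"),
    ("Page", "page_no"),  # not bound to any query result
]


def relate_labels_to_schema(output_statements, query_bindings):
    """Map each schema element to the descriptive label of the output
    statement that displays a variable filled from that element."""
    semantics = {}
    for label, variable in output_statements:
        schema_element = query_bindings.get(variable)
        if schema_element is not None:
            semantics[schema_element] = label
    return semantics


SEMANTICS = relate_labels_to_schema(OUTPUT_STATEMENTS, QUERY_BINDINGS)
```

Under these assumptions, `SEMANTICS["CourseInst.Name"]` is the label `Instructor Name`, which suggests that the otherwise ambiguous attribute Name stores instructor names; labels whose variables do not come from a query, such as `Page`, are ignored.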

PAGE 52

Figure 3-2. Schema used by an application.

Let us assume we have access to the application source code that outputs the search screen shown on the right hand side of Figure 3-2. Upon investigation of the code, the semantic analyzer (SA) encounters output statements of the form `Instructor Name' and `Course Code'. SA also encounters input statements that expect input from the user next to the output texts. Using program understanding techniques, SA finds out that the inputs are used with certain schema elements in a `where clause' to form a query that returns the desired tuples from the database. SA first relates the output statements containing descriptive text (e.g., `Instructor Name') to the input statements located next to the output statements on the search screen shown in Figure 3-2. SA then traces the input statements back to the `where clause' and finds their corresponding schema elements in the database. Hence, SA relates the descriptive text to the schema elements. For example, if SA relates the output statement `Instructor Name' to the `Name' schema element of relation `CourseInst', then we can conclude that the `Name' schema element of the relation `CourseInst' stores information about the `Instructor Names'.

Let us look at another example. Figure 3-3 shows a report R1 using the schema elements from the schema S1. Let us assume that we have access to the application source code that generates the report shown in Figure 3-3. The schema element names in S1 are non-descriptive. However, our semantic analyzer can gather valuable semantic information by analyzing the source code. SA first traces the descriptive column header texts back to the schema elements that fill in the data of that column. Then, SA relates descriptive

PAGE 53

Figure3-3.Schemausedbyareport. columnheadertextswiththeschemaelements(redarrows).A fterthat,wecanconclude aboutthesemanticsoftheschemaelement.Forexample,weca nconcludethattheName schemaelementoftherelationCourseInststoresinformati onabout'Instructors`. 3.1.2ConceptualArchitectureofSemanticAnalyzer SAisembeddedintheDataReverseEngineering(DRE)moduleo ftheSEEK prototypetogetherwiththeSchemaExtractor(SE)componen t.AsFigure 3-4 illustrates, theSEcomponentintheDREconnectstothedatasourcewithac all-levelinterface(e.g., JDBC)andextractstheschemaofthedatasource.TheSAcompo nentenhancesthis schemawiththepiecesofevidencefoundaboutthesemantics oftheschemaelementsfrom theapplicationsourcecodeorfromthereportdesigntempla tes. 3.1.2.1Abstractsyntaxtreegenerator(ASTG) WeshowthecomponentsofSemanticAnalyzer(SA)inFigure 3-5 .TheAbstract SyntaxTreeGenerator(ASTG)acceptsapplicationsourceco detobeanalyzed,parses 53

PAGE 54

Figure3-4.ConceptualviewoftheDataReverseEngineering (DRE)moduleofthe ScalableExtractionofEnterpriseKnowledge(SEEK)protot ype. itandproducestheabstractsyntaxtreeofthesourcecode.A nAbstractSyntaxTree (AST)isanalternativerepresentationofthesourcecodefo rmoreecientprocessing. Currently,theASTGisconguredtoparseapplicationsourc ecodewritteninJava.The ASTGcanalsoparseSQLstatementsembeddedintheJavasourc ecodeandHTML codeextractedfromtheJavaServletsourcecode.However,w eaimtoparseandextract semanticinformationfromsourcecodewritteninanyprogra mminglanguage.Toreach thisaim,weusestate-of-the-artparsergenerationtools, JavaCC,tobuildtheASTG. WeexplainhowwebuildtheASTGsothatitbecomesextensible tootherprogramming languagesinSection 3.1.3 Figure3-5.ConceptualviewofSemanticAnalyzer(SA)compo nent. 54


3.1.2.2 Report template parser (RTP)

We also extract semantic information from another electronically available information source, namely from report design templates. A report design template includes information about the design of a report and is typically represented in XML. When a report generation tool, such as Eclipse BIRT or JasperReport, runs a report design template, it retrieves data from the data source and presents it to the end user according to the specification in the report design template. When parsed, valuable semantic information about the schema elements can be gathered from report design templates. The Report Template Parser (RTP) component of SA is used to parse report design templates. Our current semantic analyzer is configured to parse report templates designed with Eclipse BIRT.¹ We show an example of a report design template in Figure 3-6 and a resulting report when this template was run in Figure 3-7.

¹ http://www.eclipse.org/birt/

Figure 3-6. Report design template example.

Figure 3-7. Report generated when the above template was run.

3.1.2.3 Information extractor (IEx)

The outputs of ASTG and RTP are the inputs for the Information Extractor (IEx) component of SA. The IEx, shown in Figure 3-5, is the component where we apply several heuristics to relate descriptive text in application source code with the schema elements in the database by using program understanding techniques. Specifically, the IEx first identifies the output statements. Then, it identifies texts in the output statements and variables related with these output texts. The IEx relates the output text with the variables with the help of several heuristics described in Section 3.1.5. The IEx traces the variables related with the output text to the schema elements from which it retrieves data.

The IEx can extract information from Java application source code that communicates with the user through the console. The IEx can also extract information from Java Servlet application source code. A Servlet is a Java application that runs on the Web Server and responds to client requests by generating HTML pages dynamically. A Servlet generates an HTML page by the output statements embedded inside the Java code. After the IEx analyzes the Java Servlet, it identifies the output statements that output HTML code. It also identifies the schema elements from which the data on the HTML page is retrieved. As an intermediate step, the IEx produces the HTML page that the Servlet would produce with the schema element names instead of the data. An example of the output HTML page generated by the IEx after analyzing a Java Servlet is shown in Figure 3-9. The Java Servlet output that was analyzed by the IEx is shown in Figure 3-8. This example is taken from the THALIA integration benchmark and shows course offerings in the Computer Science department of the California Institute of Technology (CALTECH). The reader can notice that the data on the report in Figure 3-8 is replaced in Figure 3-9 with the schema element names from which the data is retrieved. Next, the IEx analyzes this annotated HTML page shown in Figure 3-9 and extracts semantic information from this page.

Figure 3-8. Java Servlet generated HTML report showing course listings of CALTECH.

Figure 3-9. Annotated HTML page generated by analyzing a Java Servlet.

The IEx has been implemented using visitor design pattern classes. We explain the benefits of using visitor design patterns in Section 3.1.3. The IEx applies several program understanding techniques such as program slicing, data flow analysis and call graph analysis [49] in visitor design pattern classes. We describe these techniques in Section 3.1.4.

The IEx also extracts semantic information from report design templates. The IEx uses Heuristics 7 through 11 described in Section 3.1.5 while analyzing the report design templates. Extracting information from report design templates is relatively easier than extracting information from application source code because the report design templates are represented in XML and are more structured.

3.1.2.4 Report ontology writer (ROW)

The Report Ontology Writer (ROW) component of SA writes the semantic information gathered in report ontology instances represented in the OWL language. We explain the design details of the report ontology in Section 3.2.3. These report ontology instances form the knowledge base of the data source being analyzed.

3.1.3 Extensibility and Flexibility of Semantic Analyzer

Our current semantic analyzer is configured to extract information from application source code written in Java. We choose the Java programming language because it is one of the dominating programming languages in enterprise information systems. However, we aim for our semantic analyzer to be able to process source code written in any programming language to extract semantic information about the data of the legacy system. For this reason, we need to develop our semantic analyzer in a way that is easily extensible to other programming languages. To reach this aim, we leverage state-of-the-art techniques and recent research on code reverse engineering, abstract syntax tree generation and object oriented programming to develop a novel approach for semantic extraction from source code. We describe our extensible semantic analysis approach in detail in this section.

To analyze application source code, we need a parser for the grammar of the programming language of the source code. This parser is used to generate the Abstract Syntax Tree (AST) of the source code. An AST is a type of representation of source code that facilitates the usage of tree traversal algorithms. For programmers, writing a parser for the grammar of a programming language has always been a complex, time-consuming, and error-prone task. Writing a parser becomes more complex when the number of production rules of the grammar increases. It is not easy to write a robust parser for Java, which has many production rules [91].² We focus on extracting semantic information from a legacy system's source code, not on writing a parser. For this reason, we choose a state-of-the-art parser generation tool to produce our Java parser. We use JavaCC³ to automatically generate a parser by using the specification files from the JavaCC repository.⁴ JavaCC can be used to generate parsers for any grammar. We also utilize JavaCC to generate a parser for SQL statements that are embedded inside the Java source code and for HTML code that is embedded inside the Java Servlet code. By using JavaCC, we can extend SA to make it capable of parsing other programming languages with little effort.

² There are over 80 production rules in the Java language according to the Java Grammar that we obtained from the JavaCC Repository
³ JavaCC: https://javacc.dev.java.net/
⁴ JavaCC repository: http://www.cobase.cs.ucla.edu/pub/javacc/

The Information Extractor (IEx) component of SA is composed of several visitor design patterns. Visitor Design Patterns give the flexibility to change the operation being performed on a structure without the need to change the classes of the elements on which the operation is performed [38]. Our goal is to build semantic information extraction techniques that can be applied to any source code and can be extended with new algorithms. By using visitor design patterns [71], we do not embed the functionality of the information extraction inside the classes of the Abstract Syntax Tree Generator (ASTG). This separation lets us focus on the information extraction algorithms. We can maintain the operations being performed whenever necessary. Moreover, new operations over the data structure can be defined simply by adding a new visitor [13].
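The separation between tree classes and extraction logic described above can be illustrated with a minimal sketch. The node and visitor names below (AstNode, OutputStatement, Assignment, Block, OutputCollector) are hypothetical and chosen only for illustration; they are not the classes that JavaCC generates for the SA prototype, and the real AST is far richer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AST node types, assumed for illustration only.
interface AstNode {
    void accept(AstVisitor visitor);
}

// One visit method per node type; the node classes never change
// when a new analysis is added.
interface AstVisitor {
    void visit(OutputStatement node);
    void visit(Assignment node);
    void visit(Block node);
}

final class OutputStatement implements AstNode {
    final String text;      // literal text printed by the statement
    final String variable;  // variable printed after the text, or null
    OutputStatement(String text, String variable) {
        this.text = text;
        this.variable = variable;
    }
    public void accept(AstVisitor visitor) { visitor.visit(this); }
}

final class Assignment implements AstNode {
    final String target;
    final String source;
    Assignment(String target, String source) {
        this.target = target;
        this.source = source;
    }
    public void accept(AstVisitor visitor) { visitor.visit(this); }
}

final class Block implements AstNode {
    final List<AstNode> children;
    Block(List<AstNode> children) { this.children = children; }
    public void accept(AstVisitor visitor) { visitor.visit(this); }
}

// A new operation is a new visitor: this one collects output statements
// together with the variable each one prints.
final class OutputCollector implements AstVisitor {
    final List<String> found = new ArrayList<>();
    public void visit(OutputStatement node) {
        found.add(node.text + " -> " + node.variable);
    }
    public void visit(Assignment node) {
        // Assignments are irrelevant to this particular operation.
    }
    public void visit(Block node) {
        for (AstNode child : node.children) {
            child.accept(this);
        }
    }
}
```

Adding a further analysis, for example one that records assignments for slicing, would mean writing another AstVisitor implementation while the node classes stay untouched, which is the property the text relies on.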


3.1.4 Application of Program Understanding Techniques in SA

We have introduced program understanding techniques in Section 2.5. In this section, we present how we apply these techniques in SA. SA has two components as shown in Figure 3-5. The input of the Information Extractor (IEx) component is an abstract syntax tree (AST). The AST is the output of our Abstract Syntax Tree Generator (ASTG), which is actually a parser. As mentioned in Section 2.5, processing the source code by a parser to produce an AST is one of the program understanding techniques, known as Syntactic Analysis [49]. We perform the rest of the program understanding techniques on the AST by using the visitor design pattern classes of the IEx.

One of the program understanding techniques we apply is Pattern Matching [49]. We wrote a visitor class that looks for certain patterns inside the source code. These patterns, such as input/output statements, are stored in a class structure and new patterns can be simply added into this class structure as needed. The visitor class that searches these patterns identifies the variables in the input/output statements as slicing variables. For instance, the variable V in Table 3-5 is identified as a slicing variable since it is used in an output statement. Program Slicing [75] is another program understanding technique mentioned in Section 2.5. We analyze all the statements affecting a variable that is used in an output statement. This technique is also known as backward slicing.

SA also applies the Call Graph Analysis technique [83]. SA produces the inter-procedural call graph of the source code and analyzes only methods that exist in this graph. SA, starting from a specific method (e.g., the main method of a Java stand-alone class or the doGet method of a Java Servlet), traverses all possible methods that can be executed at run-time. By this, SA eliminates analyzing unused methods. These methods can reflect old functionality of the system and analyzing them can lead to incorrect, misleading information. An example of an inter-procedural call graph of a program source code is shown in Figure 3-10. SA does not analyze method1 of Class1, method1 of Class2, and method3 of Class3 since they are never called from inside other methods.
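The pruning of unused methods described above amounts to a reachability computation over the call graph. The following is a minimal sketch under the assumption that the call graph is already available as a map from a method name to the names of the methods it calls; the class name CallGraphReachability and this representation are illustrative assumptions, not the data structures of the SA prototype.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CallGraphReachability {

    // Returns every method that can be reached from the entry method
    // (e.g., main or doGet) by following call edges. Methods outside
    // the returned set would be skipped by the analysis.
    static Set<String> reachable(Map<String, List<String>> calls, String entry) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(entry);
        while (!work.isEmpty()) {
            String method = work.pop();
            if (!seen.add(method)) {
                continue; // already visited; also guards against recursion cycles
            }
            for (String callee : calls.getOrDefault(method, Collections.<String>emptyList())) {
                work.push(callee);
            }
        }
        return seen;
    }
}
```

With an entry point such as a Servlet's doGet method, a legacy method that nothing reachable calls never enters the result set, so it is not analyzed.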


Figure 3-10. Inter-procedural call graph of a program source code.

The Data Flow Analysis technique [83] is another program understanding technique that we implemented in the IEx by visitor design patterns. As mentioned in Section 2.5, Data Flow Analysis is the analysis of the flow of the values of variables to variables. SA analyzes the data flow in the variable dependency graphs (i.e., flow of data between variables). SA analyzes assignment statements and makes necessary changes in the values stored in the symbol table of the class being analyzed.

SA also analyzes the data flow in the system dependency graphs (i.e., flow of data between methods). SA analyzes method calls and initializes the values of method variables by actual parameters in the method call and transfers back the value of the return variable at the end of the method. SA can transfer information from one method to another through variables and can use this information to discover semantics of a schema element. The code fragment in Table 3-1 is given as an example of this capability of SA. Inside the method, the value of variable query is transferred to variable rs. At the end of the method, the value of variable rs is transferred to variable rsList. The value of the fourth field of the query from the result set is then stored into a variable and then printed out. When we relate the text in the output statement with the fourth field of the query, we can conclude that the Pl field of table Course corresponds to 'Class is held in room number'.

Table 3-1. Semantic Analyzer can transfer information from one method to another through variables and can use this information to discover semantics of a schema element.

    public ResultSet returnList() {
        ResultSet rs = null;
        try {
            String query = "SELECT Code, Time, Day, Pl, Inst FROM Course";
            rs = sqlStatement.executeQuery(query);
        } catch (Exception ex) {
            researchErr = ex.getMessage();
        }
        return rs;
    }

    ResultSet rsList = returnList();
    String dataOut = "";
    while (rsList.next()) {
        dataOut = rsList.getString(4);
        ...
        System.out.println("Class is held in room number:" + dataOut);

3.1.5 Heuristics Used for Information Extraction

A heuristic is any method found through observation which produces correct or sufficiently exact results when applied in commonly occurring conditions. We have developed several guidelines (heuristics) through observations to extract semantics from the application source code and report design templates. These heuristics relate semantically rich descriptive texts to schema elements. They are based mainly on the layout and format (e.g., font size, face, color, and type) of data and description texts that are used to communicate with users through the console with input/output statements or through a report.

We introduce these heuristics below. The first six heuristics shown in this section are developed to extract information from source code of applications that communicate with users through the console with input/output statements. Please note that the code fragments in the first six heuristics contain Java-specific input, output, and database-related statements that use syntax based on the Java API. We parameterized these statements in our SA prototype. Therefore it is theoretically straightforward to add new input, output, and database-related statement names or to switch to another language if necessary.

We developed the rest of the heuristics to extract semantics from reports. We use these heuristics to extract semantic information either from reports generated by Java Servlets or from report design templates.

Heuristic 1. Application code generally has input-output statements that display the results of queries executed on the underlying database. Typically, output statements display one or more variables and/or contain one or more format strings. Table 3-2 represents a format string "\n\nCourse code:\n\t" followed by a variable V.

Table 3-2. Output string gives clues about the semantics of the variable following it.

    System.out.println("\n\nCourse code:\n\t" + V);

Heuristic 2. The format string in an input-output statement describes the displayed slicing variable that comes after this format string. The format string "\n\nCourse code:\n\t" describes the variable V in Table 3-2.

Heuristic 3. The format string that contains semantic information and the variable may not be in the same statement and may be separated by an arbitrary number of statements, as shown in Table 3-3.

Table 3-3. Output string and the variable may not be in the same statement.

    System.out.println("\n\nCourse code:");
    ...
    ...
    System.out.print(V);

Heuristic 4. There may be an arbitrary number of format strings in different statements that inherit semantics and they may be separated by an arbitrary number of statements, before we encounter an output of a slicing variable. Concatenation of the format strings before the slicing variable gives more clues about the variable semantics. An example is shown in Table 3-4.

Table 3-4. Output strings before the slicing variable should be concatenated.

    System.out.print("\n\nCourse");
    System.out.println("\n\tcode:");
    System.out.print(V);

Heuristic 5. An output text in an output statement and a following variable in the same or following output statements are semantically related. The output text can be considered as the variable's possible semantics. We can trace back the variable through backward slicing and identify the schema element in the data source that assigns a value to it. We can conclude that this schema element and variable are related. We can then relate the output text with the schema element. The Java code sample with an embedded SQL query in Table 3-5 illustrates our point.

Table 3-5. Tracing back the output text and associating it with the corresponding column of a table.

    Q = "SELECT C FROM T";
    R = S.executeQuery(Q);
    V = R.getString(1);
    System.out.println("Course code:" + V);
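The last step of the trace in Tables 3-1 and 3-5, going from the index passed to getString back to a column of the query, can be sketched as follows. This is only an illustration under stated assumptions: the class name SelectListResolver is hypothetical, and the sketch handles only a plain "SELECT a, b, c FROM t" statement using a regular expression, whereas the SA prototype parses embedded SQL with a JavaCC-generated parser. The one fact relied on from the JDBC API is that ResultSet column indexes start at 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SelectListResolver {

    // Assumption: a single-table query with an explicit, comma-separated
    // select list and no expressions, aliases, or subqueries.
    private static final Pattern SIMPLE_SELECT = Pattern.compile(
            "^\\s*SELECT\\s+(.+?)\\s+FROM\\s+(\\w+)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Maps the 1-based index used in ResultSet.getString(index)
    // to a "Table.Column" schema element name.
    static String resolve(String query, int index) {
        Matcher m = SIMPLE_SELECT.matcher(query);
        if (!m.find()) {
            throw new IllegalArgumentException("unsupported query: " + query);
        }
        String[] columns = m.group(1).split(",");
        if (index < 1 || index > columns.length) {
            throw new IllegalArgumentException("no column at index " + index);
        }
        String column = columns[index - 1].trim();
        if (column.equals("*")) {
            // The select list alone cannot name the column; the schema is needed.
            throw new IllegalArgumentException("SELECT * requires schema information");
        }
        return m.group(2) + "." + column;
    }
}
```

Applied to the query of Table 3-1 with index 4, the sketch yields the Pl field of table Course, which is the element the text associates with 'Class is held in room number'.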


In Table 3-5, the variable V is associated with the text "Course code". It is also associated with the first column of the query result in R, which is called C. Hence the column C can be associated with the text "Course code".

Heuristic 6. If the variable V is used with column C of table T in a compare statement in the where-clause of the query Q, and if one can associate a text string from an input/output statement denoting the meaning of variable V, then we can associate this meaning of V with column C of table T. The Java code sample with an embedded SQL query in Table 3-6 illustrates our point.

Table 3-6. Associating the output text with the corresponding column in the where-clause.

    Q = "SELECT * FROM T WHERE C='" + V + "'";
    R = S.executeQuery(Q);
    System.out.println("Course code:" + V);

In Table 3-6, the variable V is associated with the text "Course code:". It is also associated with the column C of table T. Hence the schema element C can be associated with the text "Course code".

Heuristic 7. A header H of a column (i.e., description text) on a table on a report describes the value of a data D (i.e., data element) in that column. We can associate the header H with the data D presented on the same column. For example, the header "Instructor" in the fourth column describes the value "Dr. Long" in Table 3-7.

Table 3-7. Column header describes the data in that column.

    College  Course  Title            Instructor
    CAS      CS101   Intro Comp.      Dr. Long
    GRS      CS640   Artificial Int.  Dr. Betke

Table 3-8. Column on the left describes the data items listed to its immediate right.

    Course       CSE103 Introduction to Databases
    Credits      3
    Description  Core concepts in databases
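One step of Heuristic 6, finding which column a concatenated variable is compared with, can be sketched as below. The sketch assumes that the analysis has already isolated the string literal that precedes the variable in the query expression (for Table 3-6, the text up to and including the opening quote). The class name WhereClauseLinker and the regular expression are illustrative assumptions; the SA prototype works on a parsed SQL tree rather than on raw text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhereClauseLinker {

    // Matches "... FROM <table> WHERE ... <column> = '" at the end of the
    // literal that precedes the concatenated variable.
    private static final Pattern TRAILING_COMPARISON = Pattern.compile(
            "FROM\\s+(\\w+)\\s+WHERE\\s+(?:.*\\s)?(\\w+)\\s*=\\s*'?\\s*$",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns "Table.Column" for the column compared with the variable,
    // or null when the literal does not end in an equality comparison.
    static String comparedColumn(String literalBeforeVariable) {
        Matcher m = TRAILING_COMPARISON.matcher(literalBeforeVariable);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }
}
```

Once the column is known, the descriptive text already attached to the variable (here "Course code") can be attached to that column, as the heuristic states. Only equality comparisons on a single table are covered by this sketch.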


Heuristic 8. A descriptive text T on a row of a table on a report describes the value of a data D on the right hand side on the same row of the table. We can associate the text T with the data D presented on the same row. For example, the text "Description" on the third row describes the value "Core concepts in databases" in Table 3-8.

Heuristic 9. Heuristics 7 and 8 can be combined. Both the header of a data on the same column and the text on the left hand side on the same row describe the data. For example, both the text "Course" on the left hand side and the header "Elective Courses" of data "CSE131 Problem Solving" describe the data in Table 3-9.

Table 3-9. Column on the left and the header immediately above describe the same set of data items.

    Core Courses
    Course       CSE103 Introduction to Databases
    Credits      3
    Description  Core concepts in databases

    Elective Courses
    Course       CSE131 Problem Solving
    Credits      3
    Description  Use of Comp. for problem solving

Heuristic 10. If more than one header describes a data on a report, all the headers corresponding to the data describe the data. For example, both the header "Instructor" and the header "Room" describe the value "E452" in Table 3-10.

Table 3-10. Set of data items can be described by two different headers.

    Course            Instructor
    Code     Room     Name          Room
    CIS4301  E221     Dr. Hammer    E452
    COP6726  E112     Dr. Jermaine  E456

Table 3-11. Header can be processed before being associated with the data on a column.

    Course  Title(Credits)          Instructor
    CS105   Comp. Concepts (3.0)    Dr. Krentel
    CS110   Java Intro Prog. (4.0)  Dr. Bolker
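Heuristics 7 through 10 can be read as one rule: every header above a cell and the label to its left describe that cell. A minimal sketch of that rule follows. The grid representation is an assumption made for illustration: each header row is an array with one entry per column, and a header that spans several columns (such as "Instructor" in Table 3-10, as we read its layout) is repeated for each column it covers. The class name CellDescriptions is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CellDescriptions {

    // Collects the descriptive texts that apply to the cell in the given
    // column: every non-empty header above it (top row first), followed by
    // the row label to its left, if any.
    static List<String> describe(String[][] headerRows, String rowLabel, int column) {
        List<String> descriptions = new ArrayList<>();
        for (String[] headerRow : headerRows) {
            String header = headerRow[column];
            if (header != null && !header.isEmpty() && !descriptions.contains(header)) {
                descriptions.add(header);
            }
        }
        if (rowLabel != null && !rowLabel.isEmpty()) {
            descriptions.add(rowLabel);
        }
        return descriptions;
    }
}
```

For the value "E452" in Table 3-10 this collects "Instructor" and "Room"; for "CSE131 Problem Solving" in Table 3-9 it collects "Elective Courses" and "Course".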


Heuristic 11. The data value presented on a column can be retrieved from more than one data item in the schema. In that case, the format of the header of the column gives clues about how we need to parse the header and associate it with the data items. For example, the data of the second column in Table 3-11 is retrieved from two data items in the data source. The format of the header "Title(Credits)" tells us that we need to consider the parentheses while parsing the header and associating the data items in the column with the header.

In this section, we have introduced Semantic Analyzer (SA). SA extracts information about the semantics of schema elements from the application source code. This information is an essential input for the Schema Matching (SM) component. In the following section, we introduce our schema matching approach and how we use SA to discover semantics for SM.

3.2 Schema Matching

Schema matching aims at discovering semantic correspondences between schema elements of disparate but related data sources. To match schemas, we need to identify the semantics of schema elements. When done manually, this is a tedious, time-consuming, and error-prone task. Much research has been carried out to automate this task to aid schema matching; see, for example, [25, 28, 78]. However, despite the ongoing efforts, current schema matching approaches, which use the schemas themselves as the main input for their algorithms, still rely heavily on manual input [26]. This dependence on human involvement is due to the well-known fact that schemas represent semantics poorly. Hence, we believe that improving current schema matching approaches requires improving the way we discover semantics.

Discovering semantics means gathering information about the data, so that after processing the data, a computer can decide on how to use the data in a way a person would do. In the context of schema matching, we are interested in finding information that leads us to find a path from schema elements in one data source to the corresponding schema elements in the other. Therefore, we define discovering semantics for schema matching as discovering paths between corresponding schema elements in different data sources.

We reduce the level of difficulty of the schema matching problem by abstracting it to matching of automatically generated documents, such as reports, that are semantically richer than the schemas to which they correspond. Reports and other user-oriented output, which are typically generated by report generators, do not use the names of schema elements directly but rather provide more descriptive names to make the output more comprehensible to the users. These descriptions, together with their formatting instructions and their relationships to the underlying schema elements in the data source, can be extracted from the application code generating the report. These semantically rich descriptions, which can be linked to the schema elements in the source, can be used to discover relationships between data sources and hence between the underlying schemas. Moreover, reports use more domain terminology than schemas. Therefore, using domain dictionaries is particularly helpful as opposed to their use in schema matching algorithms.

One can argue that reports of an information system may not cover the entire schema and hence by this approach we may not find matches for all schema elements. It is important to note that we do not have to match all the schema elements of two data sources in order to have two organizations collaborate. We believe the reports typically present the most important data of the information system, which is also likely to be the set of elements that are important for the ensuing data integration scenario. Thus, starting the schema matching process from reports can help focus on the important data, eliminating any effort on matching unnecessary schema elements.

3.2.1 Motivating Example

We present a motivating example to show how analyzing report generating application source code and report design templates can help us understand the semantics of schema elements better. We choose our motivating example reports from the university domain because the university domain is well known and easy to understand. To create our motivating example, we use the THALIA⁵ testbed and benchmark, which provides a collection of over 40 downloadable data sources representing university course catalogs from computer science departments worldwide [47].

⁵ THALIA Website: http://www.cise.ufl.edu/project/thalia.html

Figure 3-11. Schemas of two data sources that collaborate for a new online degree program.

We motivate the need for schema matching across the two data sources of computer science departments with a scenario. Let us assume that two computer science departments of universities A and B start to collaborate for a new online degree program. Unless one is content to query each report separately, one can imagine the existence of a course schedule mediator capable of providing integrated access to the different course sites. The mediator enables us to query data of both universities and presents the results in a uniform way. Such a mediator requires finding relationships across the source schemas S1 and S2 of universities A and B shown in Figure 3-11. This is a challenging task when limited to information provided by the data source alone. By just using the schema names in the figure, one can match schema elements of two different schemas in various ways. For instance, one can match the schema element Name in relation Offerings of schema S2 with schema element Name in relation Schedule of schema S1 or with schema element Name in relation CourseInst of schema S1. Both mappings seem reasonable when we only consider the available schema information.

However, when we consider the reports generated by application source code using these schemas of data sources, we can decide on the mappings of schemas more accurately. Information system applications that generate reports retrieve data from the data source, format the data and present it to users. To make the data easier for the user to understand, these applications generally do not use the names of schema elements but invent more descriptive names (i.e., titles) for the data by using domain specific terms when applicable.

Figure 3-12. Reports from two sample universities listing courses.


For our motivating example, university A has reports R1 and R3 and university B has R2 and R4 presenting data from their data sources. Reports R1 and R2 present course listings and reports R3 and R4 present instructor office information from corresponding universities. We show these simplified sample reports (R1, R2, R3, and R4) and the schemas (S1 and S2) in Figures 3-12 and 3-13. The reader can easily match the column headers (blue dotted arrows in Figures 3-12 and 3-13). On the other hand, it is hard to match the schema elements of data sources correctly by only considering their names. However, it becomes again straightforward to identify semantically related schema elements if we know the links between column headers and schema elements (red arrows in Figures 3-12 and 3-13).

Figure 3-13. Reports from two sample universities listing instructor offices.


Our idea is to find mappings between descriptive texts on reports (blue dotted arrows) by using semantic similarity functions and to find the links between these texts and schema elements (red arrows) by analyzing the application source code and report design templates. For this purpose, we first analyze the application source code or the report design template generating each report. For each report, we store our findings such as descriptive texts (e.g., column headers), schema elements and relations between the descriptive texts and the schema elements into an instance of report ontology. We give the details of the report ontology in Section 3.2.3. We pair report ontology instances, one from the first data source and one from the second data source. We then compute the similarities between all possible report ontology instance pairs. For our example, the four possible report pairs when we select one report from DS1 and the other from DS2 are [R1-R2], [R1-R4], [R2-R3] and [R3-R4]. We calculate the similarity scores between descriptive texts on reports for each report pair by using semantic similarity functions using WordNet, which we describe in Section 3.2.4. We then transfer similarity scores between descriptive texts of reports to scores between schema elements of schemas by using the previously discovered relations between descriptive texts and schema elements. Last, we merge the similarity scores of schema elements computed for each report pair and form a final matrix holding similarity scores between elements of schemas that are higher than a threshold. We address details of each step of our approach in Section 3.2.2.

When we apply our schema matching approach on the example schemas and reports described above, we obtain a precision value of 0.86 and a recall value of 1.00. We show the similarity scores between schema elements of data sources DS1 and DS2 which are greater than the threshold (0.5) in Figure 3-14. These results are better than the results found matching the above schemas with the COMA++ (COmbination of MAtching algorithms) framework.⁶ COMA++ [7] is a well known and well respected schema matching framework providing a downloadable prototype. This example motivates us that our approach promises better accuracy for schema matching than existing approaches. We provide a detailed evaluation of the approach in Chapter 6. In the next section, we describe the steps of our schema matching approach.

⁶ We use the default COMA++ AllContext combined matcher

Figure 3-14. Similarity scores of schema elements of two data sources.

3.2.2 Schema Matching Approach

The main idea behind our approach is that user-oriented outputs such as reports encapsulate valuable information about semantics of data which can be used to facilitate schema matching. Applying well-known program understanding techniques as described in Section 3.1.4, we can extract semantically rich textual descriptions and relate these with data presented on reports using heuristics described in Section 3.1.5. We can trace the data back to corresponding schema elements in the data source and match the corresponding schema elements in the two data sources. Below, we outline the steps of our Schema Matching approach, which we call Schema Matching by Analyzing ReporTs (SMART). In the next sections, we provide a detailed description of these steps, which are shown in Figure 3-15.

1. Creating an Instance of a Report Ontology
2. Computing Similarity Scores
3. Forming a Similarity Matrix
4. From Matching Ontologies to Schemas
5. Merging Results
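The transfer of similarity scores from descriptive texts to schema elements, followed by thresholding, can be sketched as follows. Everything concrete here is an assumption made for illustration: the class name ScoreTransfer, the nested-map representation, and the choice to keep the maximum score when several text pairs lead to the same pair of schema elements (the text above does not say how such scores are merged).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ScoreTransfer {

    // textScores:     descriptive text of report 1 -> (descriptive text of report 2 -> similarity)
    // textToElement1: descriptive text of report 1 -> schema element of data source 1
    // textToElement2: descriptive text of report 2 -> schema element of data source 2
    // Returns schema element 1 -> (schema element 2 -> score), keeping only
    // scores strictly higher than the threshold.
    static Map<String, Map<String, Double>> transfer(
            Map<String, Map<String, Double>> textScores,
            Map<String, String> textToElement1,
            Map<String, String> textToElement2,
            double threshold) {
        Map<String, Map<String, Double>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Double>> row : textScores.entrySet()) {
            String element1 = textToElement1.get(row.getKey());
            if (element1 == null) {
                continue; // no link from this text to a schema element was discovered
            }
            for (Map.Entry<String, Double> cell : row.getValue().entrySet()) {
                String element2 = textToElement2.get(cell.getKey());
                double score = cell.getValue();
                if (element2 == null || score <= threshold) {
                    continue;
                }
                // Assumption: keep the highest score seen for an element pair.
                result.computeIfAbsent(element1, key -> new LinkedHashMap<>())
                      .merge(element2, score, Math::max);
            }
        }
        return result;
    }
}
```

With the threshold of 0.5 used in the motivating example, a pair of texts scoring below that value contributes nothing to the final matrix.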


Figure 3-15. Five steps of Schema Matching by Analyzing ReporTs (SMART) algorithm.


3.2.3 Creating an Instance of a Report Ontology

In the first step, we analyze application source code that generates a report. We described the details of the semantic analysis process in Section 3.1. The extracted semantic information from source code or from a report design template is stored in an instance of the report ontology.

We have developed an ontology for reports after analyzing some of the most widely used open source report generation tools such as Eclipse BIRT,⁷ JasperReport⁸ and DataVision.⁹ We designed the report ontology using the Protégé Ontology Editor¹⁰ and represented this report ontology in OWL (Web Ontology Language). The UML diagram of the report ontology depicted in Figure 3-16 shows the concepts, their properties and their relations with other concepts.

⁷ Eclipse BIRT: http://www.eclipse.org/birt/
⁸ JasperReport: http://jasperreports.sourceforge.net/
⁹ Datavision: http://datavision.sourceforge.net/
¹⁰ Protégé tool: http://protege.stanford.edu/

We store information about the descriptive texts on a report (e.g., column headers) and information about the source of data (i.e., schema elements) presented on a report in an instance of the report ontology. The descriptive text and schema element properties are stored in the description element and data element concepts of the report ontology, respectively. The data element concept has properties such as attribute, table (table of the attribute in the relational database) and type (type of the data stored in the attribute). We identify the relation between a description element concept and a data element concept with the help of a set of heuristics, which are based on the location and format information described in Section 3.1.5, and store this information in the hasDescription relation property of the description element concept.
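To make the shape of a report ontology instance more concrete, the sketch below mirrors the two central concepts and the hasDescription relation as plain Java classes. This is an in-memory stand-in written for illustration only: the actual instances are represented in OWL, the class ReportInstance and the method descriptionsOf are hypothetical, and other concepts of the ontology (header, footer, and so on) are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Data element concept: where a value on the report comes from.
final class DataElement {
    final String attribute; // attribute (column) name
    final String table;     // table of the attribute in the relational database
    final String type;      // type of the data stored in the attribute
    DataElement(String attribute, String table, String type) {
        this.attribute = attribute;
        this.table = table;
        this.type = type;
    }
}

// Description element concept: a descriptive text on the report.
final class DescriptionElement {
    final String description;
    // Data elements this text was related to by the heuristics.
    final List<DataElement> hasDescription = new ArrayList<>();
    DescriptionElement(String description) {
        this.description = description;
    }
}

// One instance per analyzed report.
final class ReportInstance {
    final String title;
    final List<DescriptionElement> descriptions = new ArrayList<>();
    ReportInstance(String title) {
        this.title = title;
    }

    // Returns the descriptive texts linked to the given schema element.
    List<String> descriptionsOf(String table, String attribute) {
        List<String> texts = new ArrayList<>();
        for (DescriptionElement d : descriptions) {
            for (DataElement e : d.hasDescription) {
                if (e.table.equals(table) && e.attribute.equals(attribute)) {
                    texts.add(d.description);
                }
            }
        }
        return texts;
    }
}
```

In this stand-in, linking the column header "Instructor" to the Name attribute of CourseInst is one DescriptionElement holding one DataElement, and looking up that attribute returns the header text.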


Figure 3-16. Unified Modeling Language (UML) diagram of the Schema Matching by Analyzing ReporTs (SMART) report ontology.

The design of the report ontology does not change from one report to another, but the information stored in an instance of the report ontology changes based on the report being analyzed. We placed the data element concept in the center of the report ontology as shown in Figure 3-16. This design is appropriate for the calculation of similarity scores between data element concepts according to the formula described in Section 3.2.4.

3.2.4 Computing Similarity Scores

We compute similarity scores between all possible data element concept pairs consisting of a data element concept from an instance of the report ontology of the first data source and another data element concept from an instance of the report ontology of the second data source. This means that if there are $m$ reports having $n$ data element concepts on average for data source DS1 and $k$ reports having $l$ data element concepts on average for data source DS2, we compute similarity scores for $(m \cdot n \cdot k \cdot l)$ pairs of data element concepts.

However, computing similarity scores for all possible report ontology instance pairs may be unnecessary. For example, unrelated report pairs, such as a report describing payments of employees paired with another describing the grades of students at a university, may not have semantically related schema elements and therefore we may not find any semantic correspondence by computing similarity scores of concepts of unrelated report ontology instance pairs. To save computation time, we filter out report pairs that have semantically unrelated reports. To determine which report pairs are semantically related, we first extract texts (i.e., titles, footers and data headers) from the two reports of a pair and calculate similarity scores of these texts. If the similarity score between these texts of a report pair is below a predetermined threshold, we assume that the report pair presents semantically unrelated data and we do not compute similarity scores of data element pairs for that report pair.

The similarity of two objects depends on the similarities of the components that form the objects. An ontology concept is formed by the properties and the relations it has. Each relation of an ontology concept connects the concept to its neighbor concept. Therefore, the similarity of two concepts depends on the similarities of the properties of the concepts and the similarities of the neighbor concepts. For example, the similarity of two data element concepts from different instances of the report ontology depends on the similarity of their properties attribute, table, and type and the similarities of their neighbor concepts Description Element, Header, Footer, etc.

Our similarity function between concepts of instances of an ontology is similar to the function proposed by Rodriguez and Egenhofer [81]. Rodriguez and Egenhofer also consider sets of features (properties) and semantic relations (neighbors) among concepts while assessing similarity scores among entity classes from different ontologies. While their similarity function aims to find similarity scores between concepts from different ontologies, our similarity function is for finding similarity scores between the instances of an ontology.

We formulate the similarity of two concepts in different instances of an ontology as follows:

$$sim_c(c_1, c_2) = w_p \, sim_p(c_1, c_2) + w_n \, sim_n(c_1, c_2) \tag{3-1}$$

where $c_1$ is a concept in an instance of the ontology, $c_2$ is the same type of concept in another instance of the ontology, $w_p$ is the weight of total similarity of properties of that concept and $w_n$ is the weight of total similarity of the neighbor concepts that can be reached from that concept by a relation. $sim_p(c_1, c_2)$ and $sim_n(c_1, c_2)$ are the formulas to calculate similarities of the properties and the neighbors. We can formulate $sim_p(c_1, c_2)$ as follows:

$$sim_p(c_1, c_2) = \sum_{i=1}^{k} w_{pi} \, SimFunc(c_1 p_i, c_2 p_i) \tag{3-2}$$

where $k$ is the number of properties of that concept, $w_{pi}$ is the weight of the $i$-th property, $c_1 p_i$ is the $i$-th property of the concept in the first report ontology instance, and $c_2 p_i$ is the same type of property of the other concept in the second report ontology instance.

SimFunc is the function that we use to assess a similarity score between the values of the properties of two concepts. For description elements, SimFunc is a semantic similarity function between texts which is similar to the text-to-text similarity function of Corley and Mihalcea [21]. To calculate the similarity score between two text strings T1 and T2, we first eliminate stop words (e.g., a, and, but, to, by). We then find the word having the maximum similarity score in text T2 for each word in text T1. The similarity score between two words, one from text T1 and the other from T2, is obtained from a WordNet-based semantic similarity function such as the Jiang and Conrath metric [52]. We sum up the maximum scores and divide the sum by the word count of the text T1. The result is the measure of similarity between text T1 and text T2 for the direction from T1 to T2. We repeat the process for the reverse direction (i.e., from T2 to T1) and then compute the average of the two scores for a bidirectional similarity score.

We use different similarity functions for different properties. If the property that we are comparing has text data, such as the description property, we use one of the word semantic similarity functions that we have introduced in Section 2.11. By using a semantic similarity measure instead of a lexical similarity measure such as edit distance, we can detect the similarities of words that are lexically far but semantically close, such as 'lecturer' and 'instructor', and we can also eliminate the words that are lexically close but semantically far, such as 'tower' and 'power'. Besides the description property of the description element concept, we also use semantic similarity measures to compute similarity scores between the footer note property of the footer concept, the header note property of the header concept and the title property of the report concept. If the property that we are comparing is the attribute or table property of the data element concept, we assess a similarity score based on the Levenshtein edit similarity measure. Besides the attribute and table properties of the data element concept, we also use edit similarity measures to compute similarity scores between the query property of the report concept.

In the following formula, which calculates the similarity between the neighbors of two concepts, $l$ is the number of relations of the concepts we are comparing, $w_{ni}$ is the weight of the $i$-th relation, and $c_1 n_i$ ($c_2 n_i$) is the neighbor concept of the first (second) concept that we reach by following the $i$-th relation.

$$sim_n(c_1, c_2) = \sum_{i=1}^{l} w_{ni} \, sim_c(c_1 n_i, c_2 n_i) \tag{3-3}$$

Note that our similarity function is generic and can be used to calculate similarity scores between concepts of instances of any ontologies. Even though the formulas in Equations 3-1, 3-2 and 3-3 are recursive in nature, when we apply the formulas to compute similarity scores between data elements of report ontologies, we do not encounter recursive behavior. That is because there is no path back to the data element concept through

PAGE 80

relationsfromneighborsofthedataelementconcept.Inoth erwords,theneighbor conceptsofdataelementconceptdoesnothavethedataeleme ntconceptasaneighbor. Weapplytheaboveformulastocalculatesimilarityscoresb etweendataelement conceptsoftwodierentreportontologies.Thedataelemen tconcepthasproperties attribute,table,andtypeandneighborconceptsdescripti onelement,report,header, andfooterconcepts.Thesimilarityscorebetweentwodatae lementconceptscanbe formulatedasfollows: sim DataElement ( DataElement 1 ;DataElement 2 )= w 1 SimFunc ( Attribute 1 ;Attribute 2 ) + w 2 SimFunc ( Table 1 ;Table 2 ) + w 3 SimFunc ( Type 1 ;Type 2 ) + w 4 sim DescriptionElement ( DescriptionElement 1 ;DescriptionElement 2 ) (3{4) + w 5 sim Report ( Report 1 ;Report 2 ) + w 6 sim Header ( Header 1 ;Header 2 ) + w 7 sim Footer ( Footer 1 ;Footer 2 ) Weexplainhowwedeterminetheweights w 1 to w 7 inSection 6.2 .Thesimilarity scorebetweentwodescriptionelement,report,headerandf ooterconceptscanbe computedbythefollowingformulas: sim DescriptionElement ( DescriptionElement 1 ;DescriptionElement 2 )= (3{5) SimFunc ( Description 1 ;Description 2 ) sim Report ( Report 1 ;Report 2 )= SimFunc ( Query 1 ;Query 2 )+ SimFunc ( Title 1 ;Title 2 ) (3{6) sim Header ( Header 1 ;Header 2 )= SimFunc ( HeaderNote 1 ;HeaderNote 2 )(3{7) sim Footer ( Footer 1 ;Footer 2 )= SimFunc ( FooterNote 1 ;FooterNote 2 )(3{8) 80
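As an illustration only, the data element similarity of Equations 3-4 through 3-8 can be sketched in Python as follows. The dictionary layout, the function names, the example values, and the equal weights are hypothetical and are not taken from the SMART prototype (which is written in Java). For brevity the sketch uses a normalized Levenshtein edit similarity as SimFunc for every property, whereas the approach described above uses a WordNet-based semantic measure for the text properties; the normalization by the longer string length is likewise an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1]; 1.0 means identical (assumed normalization)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def sim_report(r1, r2, sim_func):
    # Eq. 3-6: sum of the query similarity and the title similarity.
    return sim_func(r1["query"], r2["query"]) + sim_func(r1["title"], r2["title"])


def sim_data_element(d1, d2, w, sim_func=edit_similarity):
    """Eq. 3-4, with Eqs. 3-5, 3-7 and 3-8 inlined as single SimFunc calls."""
    return (
        w[0] * sim_func(d1["attribute"], d2["attribute"])
        + w[1] * sim_func(d1["table"], d2["table"])
        + w[2] * sim_func(d1["type"], d2["type"])
        + w[3] * sim_func(d1["description"], d2["description"])   # Eq. 3-5
        + w[4] * sim_report(d1["report"], d2["report"], sim_func)  # Eq. 3-6
        + w[5] * sim_func(d1["header_note"], d2["header_note"])    # Eq. 3-7
        + w[6] * sim_func(d1["footer_note"], d2["footer_note"])    # Eq. 3-8
    )


# Hypothetical data elements taken from two report ontology instances.
element_a = {
    "attribute": "instructor", "table": "course", "type": "varchar",
    "description": "Instructor", "header_note": "", "footer_note": "",
    "report": {"query": "SELECT instructor FROM course", "title": "Course Schedule"},
}
element_b = {
    "attribute": "lecturer", "table": "class", "type": "varchar",
    "description": "Lecturer", "header_note": "", "footer_note": "",
    "report": {"query": "SELECT lecturer FROM class", "title": "Class Listing"},
}

weights = [1 / 7] * 7  # placeholder weights; Section 6.2 explains how real ones are found
score = sim_data_element(element_a, element_b, weights)
```

Because the report similarity of Equation 3-6 is an unweighted sum of two scores, it can reach 2, so the overall score is not guaranteed to stay within [0, 1] unless the weights compensate for this.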

3.2.5 Forming a Similarity Matrix

To form a similarity matrix, we connect to the underlying data sources using a call-level interface (e.g., JDBC) and extract the schemas of the two data sources to be integrated. A similarity matrix is a table storing similarity scores for two schemas such that the elements of the first schema form the column headers and the elements of the second schema form the row headers. The similarity scores are in the range [0, 1]. The similarity matrix given as an example in Figure 3-17 has schema elements from the motivating example in Section 3.2.1, and the similarity scores between schema elements are fictitious.

Figure 3-17. Example of a similarity matrix.

3.2.6 From Matching Ontologies to Schemas

In the first step, we traced each data element to its corresponding schema element(s). We use this information to convert inter-ontology matching scores into scores between schema elements. Using the converted scores, we then fill in a similarity matrix for each report pair.

Note that we find similarity scores only for the subset of the schemas used in the reports. We believe that reports typically present the most important data of an information system, which is likely to be the set of elements that is important for the ensuing data integration scenario. Even though the reports of an information system may not cover the entire schema, our approach can help focus on the important data, thus eliminating efforts to match unnecessary schema elements. Note that each similarity matrix can be sparse, having only a small subset of its cells filled in, as shown in Figures 3-18 and 3-19.

Figure 3-18. Similarity scores after matching report pairs about course listings.

Figure 3-19. Similarity scores after matching report pairs about instructor offices.

3.2.7 Merging Results

After generating a similarity matrix for each report pair, we need to merge them into a final similarity matrix. If we have more than one score for a schema element pair, we need to merge the scores. In Section 3.2.4, we described how we compute similarity scores for report pairs to avoid unnecessary computations between unrelated report pairs. We use these overall similarity scores between report pairs while merging similarity scores. We multiply the similarity score of a schema element pair by the overall similarity score of the report pair and sum the resulting scores. Then we divide the sum by the number of report pairs. For instance, if the similarity score between schema elements A and B is 0.9 in the first report pair, which has an overall similarity score of 0.7, and 0.5 in the second report pair, which has an overall similarity score of 0.6, then we conclude that the similarity score between schema elements A and B is (0.9 × 0.7 + 0.5 × 0.6) / 2 = 0.465. Finally, we eliminate the combined scores that fall below a (user-defined) threshold.
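The merging step can be sketched in Python as shown below. This is a minimal sketch and not the prototype's implementation: the function name and the data layout are hypothetical, and it assumes that the divisor is the number of report pairs that actually contributed a score for the schema element pair, which the text above does not state explicitly.

```python
from collections import defaultdict


def merge_similarity_matrices(matrices, threshold=0.0):
    """Merge sparse per-report-pair matrices into one final matrix.

    matrices: list of (overall_report_pair_score, cells) tuples, where cells
    maps a (schema_element_1, schema_element_2) pair to its similarity score.
    """
    weighted_sums = defaultdict(float)
    counts = defaultdict(int)
    for report_pair_score, cells in matrices:
        for element_pair, score in cells.items():
            # Weight each cell by how related the two reports are overall.
            weighted_sums[element_pair] += score * report_pair_score
            counts[element_pair] += 1
    merged = {pair: weighted_sums[pair] / counts[pair] for pair in weighted_sums}
    # Drop combined scores that fall below the user-defined threshold.
    return {pair: s for pair, s in merged.items() if s >= threshold}


# The worked example from the text: A and B score 0.9 and 0.5 in two report pairs
# whose overall similarity scores are 0.7 and 0.6.
example = [
    (0.7, {("A", "B"): 0.9}),
    (0.6, {("A", "B"): 0.5}),
]
final_matrix = merge_similarity_matrices(example, threshold=0.4)
```

With these inputs the merged score for (A, B) is (0.63 + 0.30) / 2, that is, 0.465 up to floating-point rounding, and it survives the illustrative threshold of 0.4.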

CHAPTER 4
PROTOTYPE IMPLEMENTATION

We implemented both the semantic analyzer (SA) component of SEEK and the SMART schema matcher using the Java programming language. As shown in Figure 4-1, we have written 1,350 KB of Java code (approximately 27,000 lines of code) for our prototype implementation. In addition, we have utilized 1,150 KB of Java code (approximately 23,000 lines of code) that was automatically generated by JavaCC. In the following sections, we first explain the SA prototype and then the SMART prototype.

Figure 4-1. Java code size distribution of the Semantic Analyzer (SA) and Schema Matching by Analyzing ReporTs (SMART) packages.

4.1 Semantic Analyzer (SA) Prototype

We have implemented the SA prototype in Java. The SEEK prototype source code is placed in the seek package. The functionality of the SEEK prototype is divided into several packages. The source code of the semantic analyzer (SA) component resides in the sa package. Java classes in the sa package are further divided into subpackages according to their functionality. The subpackages of the sa package are listed in Table 4-1.

Table 4-1. Subpackages in the sa package and their functionality.

Package name      Classes in the package
visitor           Default visitor classes.
parsers           Classes to parse application source code written in the grammars Java, HTML and SQL.
seekstructures    Supplementary classes to analyze application source code.
seekvisitors      Visitor classes to analyze source code written in the grammars Java, HTML

4.1.1 Using JavaCC to generate parsers

The classes inside the packages syntaxtree, visitor, and parsers are automatically created by the JavaCC tool. JavaCC is a tool that reads a grammar specification and converts it to a Java program that can recognize matches to the grammar according to that specification. As shown in Figure 4-2, JavaCC processes a grammar specification file and outputs the Java files that contain the code of the parser. The parser can process languages that conform to the grammar in the specification file. The parsers generated in this way form the ASTG component of the SA. Grammar specification files for some grammars, such as Java, C++, C, SQL, XML, HTML, Visual Basic, and XQuery, can be found at the JavaCC grammar repository Web site (footnote 1). These specification files have been tested and corrected by many JavaCC implementers. This implies that parsers produced by using these specifications must be reasonably effective in the correct production of ASTs. The classes generated by JavaCC form the abstract syntax tree generator (ASTG) of the SA, which was described in Section 3.1.2.1.

For the SA component of the SEEK prototype, we created parsers for three different grammars: Java, SQL, and HTML. We placed these parsers, the related syntax tree classes, and the generic visitor classes into the parsers, syntaxtree, and visitor packages, respectively. Each Java class inside the syntaxtree package has an accept method to be used by visitors. Visitor classes have visit methods, each of which corresponds to a Java class inside the syntaxtree package. The syntaxtree, visitor, and parsers packages have 142, 15, and 14 classes, respectively. The classes inside these packages remain the same as long as the Java, SQL, and HTML grammars do not change.

Footnote 1: JavaCC repository: http://www.cobase.cs.ucla.edu/pub/javacc/

Figure 4-2. Using JavaCC to generate parsers.

The classes inside the packages seekstructures and seekvisitors are written to fulfill the goals of the SA. The seekstructures and seekvisitors packages have 25 and ten classes, respectively, and are subject to change as we add new functionality to the SA module. The classes inside these packages form the Information Extractor (IEx) of the SA, which was described in Section 3.1.2.3. The IEx consists of several visitor design patterns. The execution steps of the IEx and the functionality of some selected visitor design patterns are described in the next section.

4.1.2 Execution steps of the information extractor

The semantic analysis process has two main steps. In the first step, SA makes the preparations necessary for analyzing the source code and forms the control flow graph. The SA driver accepts the name of the stand-alone Java class file (with the main method) as an argument. Starting from this Java class file, SA finds all the user-defined Java classes to be analyzed in the class path and forms the control flow graph. Next, SA parses all the Java classes in the control flow graph and produces the ASTs of these Java classes. Then, the visitor class ObjectSymbolTable gathers variable declaration information for each class to be analyzed and stores this information in the SymbolTable classes. The SymbolTable classes are passed to each visitor class and are filled with new information as the SA process continues.

In the second step, SA identifies the variables used in input, output, and SQL statements. SA uses the ObjectSlicingVars visitor class to identify slicing variables. The list of all input, output, and database-related statements, which are language (Java, JDBC) specific, is stored in the InputOutputStatements and SqlStatements classes. To analyze additional statements, or to switch to another language, all we need to do is add or update statement names in these classes. When a visitor class encounters a method statement while traversing the AST, it checks this list to find out whether the method is an input, output, or database-related statement.

SA finds and parses SQL strings embedded inside the source code. SA uses the ObjectSQLStatement visitor class to find and parse SQL statements. While the visitor traverses the AST, it constructs the values of variables that are of String type. When a variable of type String or a string literal is passed as a parameter to an SQL execute method (e.g., executeQuery(queryStr)), this visitor class parses the string and constructs the AST of the SQL string. The visitor class ObjectSQLStatement then uses the visitor class ObjectSQLParse to extract information from the SQL statement and stores this information in a class named ResultsetSQL. The information gathered from SQL statements and input/output methods is used to construct relations between a database schema element and the text denoting the possible meaning of the schema element.

Besides analyzing application source code written in Java, SA can also analyze report design templates represented in XML. The Report Template Parser (RTP) component of the SA uses the Simple API for XML (SAX) (footnote 2) to parse report templates.

The outcome of the IEx is written into report ontology instances represented in OWL. The Report Ontology Writer (ROW) uses the OWL API (footnote 3) to write the semantic information into OWL ontologies.

Footnote 2: Simple XML API: http://www.saxproject.org/
Footnote 3: OWL API: http://owl.man.ac.uk/api.shtml

4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype

We have implemented the SMART schema matcher prototype in Java. There are 46 classes in five different packages. The total size of the Java classes is 500 KB (approximately 10,000 lines).

We also wrote a Perl program to find similarity scores between word pairs by using the WordNet similarity library (footnote 4) [73]. To assess similarity scores between texts, we first eliminate stop words (e.g., a, and, but, to, by) and convert plural words to singular words (footnote 5), because the WordNet similarity functions return similarity scores between singular words.

The SMART prototype also uses the Simple API for XML (SAX) library to parse XML files and the OWL API to read OWL report ontology instances into internal Java structures.

The COMA++ framework enables external matchers to be included through an interface. We have integrated our SMART matcher into the COMA++ framework as an external matcher.

Footnote 4: WordNet Semantic Similarity Library: http://search.cpan.org/dist/WordNet-Similarity/
Footnote 5: We are using the PlingStemmer library written by Fabian M. Suchanek: http://www.mpi-inf.mpg.de/~suchanek/
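The text preprocessing and the bidirectional text-to-text score described in Section 3.2.4 can be sketched in Python as follows. This is a sketch under stated assumptions, not the prototype's code: the stop word list is abbreviated, the singularization is a naive stand-in for the PlingStemmer library, and toy_word_sim with its lookup table is a hypothetical placeholder for a WordNet-based measure such as jcn.

```python
STOP_WORDS = {"a", "an", "and", "but", "to", "by", "of", "the"}  # abbreviated list


def singularize(word):
    """Naive stand-in for a real stemmer: drop one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text):
    """Lowercase, strip punctuation, drop stop words, and singularize."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    return [singularize(w) for w in words if w and w not in STOP_WORDS]


def directed_similarity(words1, words2, word_sim):
    """For each word in words1 take its best match in words2, then average."""
    if not words1 or not words2:
        return 0.0
    best = [max(word_sim(w1, w2) for w2 in words2) for w1 in words1]
    return sum(best) / len(words1)


def text_similarity(text1, text2, word_sim):
    """Average of the T1-to-T2 and T2-to-T1 directed scores."""
    t1, t2 = tokenize(text1), tokenize(text2)
    forward = directed_similarity(t1, t2, word_sim)
    backward = directed_similarity(t2, t1, word_sim)
    return (forward + backward) / 2


# Hypothetical word-level scores standing in for a WordNet-based measure.
TOY_SCORES = {frozenset(("instructor", "lecturer")): 0.9}


def toy_word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return TOY_SCORES.get(frozenset((w1, w2)), 0.0)


example_score = text_similarity("Name of the instructors", "lecturer name", toy_word_sim)
```

In this example both directions average one exact match (1.0) and one near-synonym match (0.9), so the bidirectional score is 0.95 up to floating-point rounding.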

CHAPTER 5
TEST HARNESS FOR THE ASSESSMENT OF LEGACY INFORMATION INTEGRATION APPROACHES (THALIA)

Information integration refers to the unification of related, heterogeneous data from disparate sources, for example, to enable collaboration across domains and enterprises. Information integration has been an active area of research since the early 80s and has produced a plethora of techniques and approaches to integrate heterogeneous information. Determining the quality and applicability of an information integration technique has been a challenging task because of the lack of available test data of sufficient richness and volume to allow meaningful and fair evaluations. Researchers generally use their own test data and evaluation techniques, which are tailored to the strengths of the approach and often hide any existing weaknesses.

5.1 THALIA Website and Downloadable Test Package

While working on this research, we saw the need for a testbed and benchmark providing test data of sufficient richness and volume to allow meaningful and fair evaluations of information integration approaches. To answer this need, we developed the THALIA (footnote 1) (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark. We show a snapshot of the THALIA website in Figure 5-1. THALIA provides researchers with a collection of over 40 downloadable data sources representing University course catalogs, a set of twelve benchmark queries, as well as a scoring function for ranking the performance of an integration system [47, 48].

Figure 5-1. Snapshot of the Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) website.

The THALIA website also hosts cached web pages of University course catalogs. The downloadable packages have data extracted from these websites. Figure 5-2 shows an example cached course catalog of Boston University hosted on the THALIA website. On the THALIA website, we also provide the ability to navigate between the extracted data and the corresponding schema files that are in the downloadable packages. Figure 5-3 shows the XML representation of Boston University's course catalog and the corresponding schema file. Downloadable University course catalogs are represented using well-formed and valid XML according to the extracted schema for each course catalog. Extraction and translation from the original representation was done using a source-specific wrapper, which preserves the structural and semantic heterogeneities that exist among the different course catalogs.

Footnote 1: URL of the THALIA website: http://www.cise.ufl.edu/project/thalia.html

Figure 5-2. Snapshot of the computer science course catalog of Boston University.

5.2 DataExtractor (HTMLtoXML) Opensource Package

To extract the source data provided in the THALIA benchmark, we enhanced and used the Telegraph Screen Scraper (TESS) (footnote 2) source wrapper developed at UC Berkeley. The enhanced version of TESS, DataExtractor (HTMLtoXML), can be obtained from the SourceForge website (footnote 3) along with the 46 examples used to extract the data provided in THALIA. The DataExtractor (HTMLtoXML) tool provides added functionality over the TESS wrapper, including the capability of extracting data from nested structures. It extracts data from an HTML page according to a configuration file and puts the data into an XML file according to a specified structure.

Footnote 2: TESS: http://telegraph.cs.berkeley.edu/tess/
Footnote 3: URL of DataExtractor (HTMLtoXML): http://sourceforge.net/projects/dataextractor

Figure 5-3. Extensible Markup Language (XML) representation of Boston University's course catalog and corresponding schema file.

5.3 Classification of Heterogeneities

Our benchmark focuses on syntactic and semantic heterogeneities since we believe they pose the greatest technical challenges to the research community. We have chosen course information as our domain of discourse because it is well known and easy to understand. Furthermore, there is an abundance of publicly available data sources that allowed us to develop a testbed exhibiting all of the syntactic and semantic heterogeneities that we have identified in our classification. We list our classification of heterogeneities below. We have also listed these classifications in [48] and in the downloadable package on the THALIA website, along with examples from the THALIA benchmark and corresponding queries.

1. Synonyms: Attributes with different names that convey the same meaning. For example, 'instructor' vs. 'lecturer'.

2. Simple Mapping: Related attributes in different schemas differ by a mathematical transformation of their values. For example, time values using a 24-hour vs. a 12-hour clock.

3. Union Types: Attributes in different schemas use different data types to represent the same information. For example, a course description as a single string vs. a complex data type composed of strings and links (URLs) to external data.

4. Complex Mappings: Related attributes differ by a complex transformation of their values. The transformation may not always be computable from first principles. For example, the attribute 'Units' represents the number of lectures per week vs. a textual description of the expected workload in the field 'credits'.

5. Language Expression: Names or values of identical attributes are expressed in different languages. For example, the English term 'database' is called 'Datenbank' in the German language.

6. Nulls: The attribute (value) does not exist. For example, some courses do not have a textbook field, or the value for the textbook field is empty.

7. Virtual Columns: Information that is explicitly provided in one schema is only implicitly available in the other and must be inferred from one or more values. For example, course prerequisites are provided as an attribute in one schema but exist only in comment form as part of a different attribute in another schema.

8. Semantic incompatibility: A real-world concept that is modeled by an attribute does not exist in the other schema. For example, the concept of student classification ('freshman', 'sophomore', etc.) at American universities does not exist at German universities.

9. Same attribute in different structure: The same or a related attribute may be located in different positions in different schemas. For example, the attribute Room is an attribute of Course in one schema, while it is an attribute of Section, which in turn is an attribute of Course, in another schema.

10. Handling sets: A set of values is represented using a single, set-valued attribute in one schema vs. a collection of single-valued attributes organized in a hierarchy in another schema. For example, a course with multiple instructors can have a single attribute instructors or multiple section-instructor attribute pairs.

11. Attribute name does not define semantics: The name of the attribute does not adequately describe the meaning of the value that is stored there.

12. Attribute composition: The same information can be represented either by a single attribute (e.g., as a composite value) or by a set of attributes, possibly organized in a hierarchical manner.

5.4 Web Interface to Upload and Compare Scores

The THALIA website offers a web interface for researchers to upload their results for each heterogeneity listed above. The web interface accepts data on many aspects, such as the size of the specification, the number of mouse clicks, and the size of the program code, to evaluate the effort spent by the approach to resolve the heterogeneity. The uploaded scores can be viewed by anybody visiting the website of the THALIA benchmark. This helps other researchers compare their approach with others. Figure 5-4 shows the scores uploaded to the THALIA benchmark for the Integration Wizard (IWiz) Project at the University of Florida.

Figure 5-4. Scores uploaded to the Test Harness for the Assessment of Legacy information Integration Approaches (THALIA) benchmark for the Integration Wizard (IWiz) Project at the University of Florida.

While THALIA is not the only data integration benchmark (footnote 4), what distinguishes THALIA is the fact that it combines rich test data with a set of benchmark queries and an associated scoring function to enable the objective evaluation and comparison of integration systems.

5.5 Usage of THALIA

We believe that THALIA not only simplifies the evaluation of existing integration technologies but also helps researchers improve the accuracy and quality of future approaches by enabling more thorough and more focused testing. We have used THALIA test data for the evaluation of our SM approach, as described in Section 6.1.1. We are also happy to see that it is being used as a source of test data and as a benchmark by researchers [11, 74, 100] and in graduate courses (footnote 5).

Footnote 4: A list of data integration benchmarks and test suites can be found at http://mars.csci.unt.edu/dbgroup/benchmarks.html
Footnote 5: URL of the graduate course at the University of Toronto using THALIA: http://www.cs.toronto.edu/~miller/cs2525/

CHAPTER 6
EVALUATION

We evaluate our approach using the prototype described in Chapter 4. In the following sections, we first describe our test data sets and our experiments. We then compare our results with other techniques and present a discussion of the results.

6.1 Test Data

The test data sets have two main components: the schema of the data source and the reports presenting the data from the data source. We used two test data sets. The first test data set is from the THALIA data integration testbed. This data set has 10 schemas. Each schema of the THALIA test data set has one report, and the report covers all schema elements of the corresponding schema. The second test data set is from the University of Florida registrar office. This data set has three schemas. Each schema of the UF registrar test data set has 10 reports, and the reports do not cover all schema elements of the corresponding schema. The first test data set, from THALIA, is used to see how the SMART approach performs when the entire schema is covered by reports, and the second test data set, from UF, is used to see how the SMART approach performs when the entire schema is not covered by reports. The test data set from UF also enables us to see the effect of having multiple reports for one schema. In the following subsections, we give detailed descriptions of the schemas and reports of these test data sets.

6.1.1 Test Data Set from THALIA testbed

The first test data set is from the THALIA testbed [48]. THALIA offers 44+ different University course catalogs from computer science departments worldwide. Each catalog page is represented in HTML. THALIA also offers the data and schema of each catalog page. We explained the details of the THALIA testbed in Chapter 5.

For the scope of this evaluation, we treat each catalog page (in HTML) as a sample report from the corresponding University. We selected 10 university catalogs (reports) from THALIA that represent different report design practices. We give two examples of these report design practices in Figures 6-1 and 6-2. Figure 6-1 shows the course scheduling report of Boston University, and Figure 6-2 shows the course scheduling report of Michigan State University.

Figure 6-1. Report design practice where all the descriptive texts are headers of the data.

Figure 6-2. Report design practice where all the descriptive texts are on the left-hand side of the data.

Sizes of schemas in the THALIA test data set vary between 5 and 13, as listed in Table 6-1. We stored the data and schemas for each selected university in a MySQL 4.1 database. When we pair 10 schemas, we have 45 different pairs of schemas to match. These 45 schema pairs have 2576 possible combinations of schema elements. We manually determined that 215 of these possible combinations are real matches. We use these manual mappings to evaluate our results.
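The pair counts quoted above can be reproduced from the schema sizes listed in Table 6-1. The short Python check below is only an illustration; the variable names are ours, and the sizes are copied from that table.

```python
from itertools import combinations

# Number of schema elements per selected university (values from Table 6-1).
SCHEMA_SIZES = {
    "University of Arizona": 5,
    "Brown University": 7,
    "Boston University": 7,
    "California Institute of Technology": 5,
    "Carnegie Mellon University": 9,
    "Florida State University": 13,
    "Michigan State University": 8,
    "New York University": 7,
    "University of Massachusetts Boston": 8,
    "University of New South Wales, Sydney": 7,
}

# Every unordered pair of distinct schemas is one matching task.
size_pairs = list(combinations(SCHEMA_SIZES.values(), 2))
num_schema_pairs = len(size_pairs)

# Each schema pair contributes (size of schema 1) x (size of schema 2) candidates.
num_element_combinations = sum(a * b for a, b in size_pairs)
```

With these sizes the check yields 45 schema pairs and 2576 candidate element combinations, so the 215 manually determined matches are about 8% of all candidates.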

Table6-1.The10universitycatalogsselectedforevaluati onandsizeoftheirschemas. UniversityName#ofSchemaElements UniversityofArizona5BrownUniversity7BostonUniversity7CaliforniaInstituteofTechnology5CarnegieMellonUniversity9FloridaStateUniversity13MichiganStateUniversity8NewYorkUniversity7UniversityofMassachusettsBoston8UniversityofNewSouthWales,Sydney7 Werecreatedeachreport(catalogpage)fromTHALIAbyusing twomethods.One methodisusingJavaServletsandtheotherisusingEclipseB usinessIntelligenceand ReportingTool(BIRT). 1 JavaServletapplicationscorrespondingtoacoursecatalo gfetch therelevantdatafromtherepositoryandproducetheHTMLre port.Reporttemplates designedbyBIRTtoolalsofetchtherelevantdatafromthere positoryandproducethe HTMLreportaswell.WhenSMARTprototypeisrun,itanalyzes JavaServletcodeand reporttemplatestoextractsemanticinformation.6.1.2TestDataSetfromUniversityofFlorida Thesecondtestdatasetisaboutstudentsregistryinformat ionandfromUniversityof Florida.WecontactedseveralocesattheUniversityofFlo ridatoobtaintestdatasets. 2 WerstcontactedtheCollegeofEngineering.Afterseveral meetingsanddiscussions, theCollegeofEngineeringagreedtogiveustheschemasandt hereportdesigntemplates withoutanydata.Infact,wewerenotafterthedatabecauseo urapproachworkswithout theneedofthedata.TheCollegeofEngineeringformsanduse sthedatasetthatwe obtainedafterseveralmonthsasfollows.TheCollegeofEng ineeringrunsabatchprogram 1 http://www.eclipse.org/birt/ 2 IwouldliketothanktoDr.JoachimHammerforhisextensivee ortsforreachingout severaldepartmentsandorganizingmeetingswithstatoga thertestdatasets. 98

PAGE 99

everyrstdayoftheweekanddownloadsdatafromlegacyDB2d atabaseoftheRegistrar oce.DB2databaseoftheRegistraroceisahierarchicalda tabase.TheCollegeof EngineeringstoresthedatainrelationalMSAccessdatabas es.TheCollegeofEngineering extractsasubsetofthedatabaseoftheregistraroceandus esthesameattributeand tablenamesintheMSAccessdatabaseastheyareinthedataba seoftheregistraroce. TheCollegeofEngineeringcreatessubsetsofthisMSAccess databaseandrunstheir reportsontheseMSAccessdatabases.Figure 6-3 showstheconceptualviewofthe architectureofthedatabasesintheCollegeofEngineering 3 Figure6-3.ArchitectureofthedatabasesintheCollegeofE ngineering. WealsocontactedtheUFBridgesoce.TheBridgesisaprojec ttoreplacethe universitysbusinesscomputersystemscalledlegacysyste mswithnewwebbased,integrated 3 IwouldliketothankJamesOglesfromtheCollegeofEngineer ingforhistimeto preparethetestdataandforansweringourquestionsregard ingthedataset. 99

PAGE 100

systemsthatproviderealtimeinformationandimproveuniv ersitybusinessprocesses. 4 TheUFBridgesprojectalsoredesignedthelegacyDB2databa seoftheregistraroce forMSSQLServer.Weobtainedschemasandagaincouldnotrea chtheassociateddata becauseofprivacyissues. 5 Finally,wereachedtheBusinessSchool. 6 TheBusinessSchoolstorestheirdatain MSSQLServerdatabases.TheirschemaisbasedontheBridges oceschemahowever theyusedierentnamingconventions.Theyaddnewstructur esintotheschemaswhen needed. Table6-2.PortionofatabledescriptionfromtheCollegeof Engineering,theBridges ProjectandtheBusinessSchoolschemas. TheCollegeofEng.TheBridgesOceTheBusinessSchool trans2PS UF CREC COURSEt CREC UUIDVARCHAR(9)UF UUIDVARCHAR(9)UFIDvarchar(9) CNumVARCHAR(4)UF AUTO INDEXINTEGERTermvarchar(6) SectVARCHAR(4)UF TERM CDVARCHAR(5)CourseTypevarchar(1) CTCHARUF TYPE DESCVARCHAR(40)Sectionvarchar(4) TheschemasfromtheCollegeofEngineering,theBridgesOc eandtheBusiness Schoolaresemanticallyrelatedhowevertheyexhibitdier entsyntacticalfeatures.The namingconventionsandsizesofschemasaredierent.TheCo llegeofEngineeringuses thesamenamesforschemaelementsastheyareintheRegistra r'sdatabase.Theschema elementsnamesoftencontainsabbreviationswhicharemost lynotpossibletoguess. TheBridgesoceusesmoredescriptivenamingconventionfo rschemaelements.The schemaelements(i.e,columnnames)intheschemaoftheBusi nessSchoolhavethemost descriptivenames.However,thetablenamesintheschemaof theBusinessSchooluses 4 http://www.bridges.ufl.edu/about/overview.html 5 IalsowouldliketoacknowledgethehelpofMr.WarrenCurryf romtheBridgesoce forhishelpobtainingtheschemas. 6 IalsowouldliketoacknowledgethehelpofMr.JohnC.Holmes fromtheBusiness Schoolforhishelpobtainingtheschemas. 100

PAGE 101

non-descriptivenamessimilartothenamesintheregistrar database.Togiveanexample fordierentnamingconventionintheschemas,wepresentap ortionofatabledescription fromtheCollegeofEngineering,theBridgesOceandtheBus inessSchoolschemasin Table 6-2 TheCollegeofEngineering,theBridgesOce,andtheBusine ssSchoolschemas havetotally135,175,114attributesrespectivelyinsixta bles.Wepresenttablenames inthesethreeschemasandthenumberofschemaelementsthat eachtablehasinTable 6-3 .Inadataintegrationscenario,wepairtheschemasthatist obeintegrated.When wepairthreeschemasfromtheCollegeofEngineering(COE), theBridgesOce(BO) andtheBusinessSchool(BS),showninTable 6-3 ,wehavethreedierentschemapairs, (COE-BO),(COE-BS),(BO-BS),tomatch.Wemanuallydetermi nedthat(COE-BO) pairhas88,(COE-BS)pairhas91and(BO-BS)pairhas110mapp ingsandweusethese manualmappingstoevaluatetheresultsoftheSMART(Schema MatchingbyAnalyzing ReporTs)andCOMA++(COmbinationofMAtchingalgorithms)a pproachesasdescribed inSection 6.3.2 .WerecreatedcorrespondingreportsbyusingEclipseBIRTt ool.Wehave 10reportsforeachschema. Table6-3.NamesoftablesintheCollegeofEngineering,the BridgesOce,andthe BusinessSchoolschemasandnumberofschemaelementsthate achtablehas. TheCollegeofEngineeringBridgesTheBusinessSchool colleges,8ps uf colleges,12t coll,5 deptx1,14ps uf departments,23t dept,8 ce,32ps uf ce,33t ce,30 honors,56ps uf honors,56t honors,40 majors,10ps uf majors,17t majo,9 trans2,15ps uf crec course,34t crec,22 total:135total:175total:114 6.2DeterminingWeights Weexplainedtheformulastocomputesimilarityscoresbetw eenconceptsoftwo ontologiesandshowedhowweapplytheseformulastocompute similarityscoresbetween dataelementconceptsoftworeportontologyinstancesinSe ction 3.2.4 .Inthissection,we 101

PAGE 102

show how we determine the weights in the formulas. The correct selection of the weights of the similarity function is very important. The weights directly affect the similarity score and hence directly affect the results and accuracy of the SMART approach. We show the formula for computing the similarity scores between data elements below. Our goal is to determine the weights $w_1$ to $w_8$.

\[
\begin{aligned}
\mathit{sim}_{\mathit{DataElement}}(\mathit{DataElement}_1, \mathit{DataElement}_2) ={}& w_1\,\mathit{SimFunc}(\mathit{Attribute}_1, \mathit{Attribute}_2) \\
&+ w_2\,\mathit{SimFunc}(\mathit{Table}_1, \mathit{Table}_2) \\
&+ w_3\,\mathit{SimFunc}(\mathit{Type}_1, \mathit{Type}_2) \\
&+ w_4\,\mathit{SimFunc}(\mathit{Description}_1, \mathit{Description}_2) \\
&+ w_5\,\mathit{SimFunc}(\mathit{Query}_1, \mathit{Query}_2) \\
&+ w_6\,\mathit{SimFunc}(\mathit{Title}_1, \mathit{Title}_2) \\
&+ w_7\,\mathit{SimFunc}(\mathit{HeaderNote}_1, \mathit{HeaderNote}_2) \\
&+ w_8\,\mathit{SimFunc}(\mathit{FooterNote}_1, \mathit{FooterNote}_2)
\end{aligned}
\tag{6-1}
\]

We use the multiple linear regression method to determine the weights of the similarity function. In multiple linear regression [3], part of the variables are considered to be explanatory variables, and the remaining are considered to be dependent variables. In our problem, the explanatory variables are the similarities of the properties of the concepts. For example, our explanatory variables in the formula to compute similarity scores between data element concepts are the similarity scores of the Attribute, Table, Type, Description, Query, Title, HeaderNote and FooterNote properties. The dependent variable is the overall similarity of two data element concepts. Linear regression attempts to model the relationship between the explanatory variables and the dependent variable by fitting a linear equation to observed data. Observed data refers to a set of sample vectors for the explanatory variables and the desired value of the dependent variable corresponding to
the sample vectors. Our prototype computes the sample vectors for the explanatory variables according to the SimFunc functions that compute similarity scores between properties of concepts. We manually enter the desired value of the dependent variable (i.e., the similarity score for two data element concepts) corresponding to the sample vectors. Let us denote the explanatory variables as a column vector (called the feature vector) by:

\[
\mathbf{x}(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T \tag{6-2}
\]

where $T$ denotes the transpose operator and $n$ is the index of the sample data. The observed data contains a dependent variable (desired data) called

\[
\mathbf{d}(n) = [d_1(n), d_2(n), \ldots, d_L(n)]^T \tag{6-3}
\]

corresponding to the feature vector $\mathbf{x}(n)$. Note that $L$ is 1 for our problem. We can combine the feature vectors as an $N \times P$ matrix

\[
\mathbf{x} = [\mathbf{x}(1), \mathbf{x}(2), \ldots, \mathbf{x}(P)] \tag{6-4}
\]

where $P$ is the number of data points in our observed data. Similarly, the desired data can be combined in an $L \times P$ matrix

\[
\mathbf{d} = [\mathbf{d}(1), \mathbf{d}(2), \ldots, \mathbf{d}(P)] \tag{6-5}
\]

In linear regression, the goal is to model $\mathbf{d}(n)$ as a linear function of $\mathbf{x}(n)$, i.e.,

\[
\mathbf{d}(n) = \mathbf{w}^T \mathbf{x}(n) \tag{6-6}
\]

where

\[
\mathbf{w} = [w_1, w_2, \ldots, w_N]^T \tag{6-7}
\]

is called the weight matrix. The most common approach for finding $\mathbf{w}$ is the method of least-squares. This method calculates the optimal $\mathbf{w}$ for the observed data by minimizing
a cost function based on the mean of the squared errors (MSE), i.e.,

\[
\mathit{MSE} = \sum_{n=1}^{P} \left( d(n) - \mathbf{w}^T \mathbf{x}(n) \right)^2 \tag{6-8}
\]

(recall that $L = 1$, so $d(n)$ is a scalar; the constant factor $1/P$ of the mean does not change the minimizer). Using the MSE, the weight matrix $\mathbf{w}$ can be found analytically or in an iterative fashion. To find the analytical solution, we calculate the minimum value of the MSE with respect to $\mathbf{w}$: we take the derivative of the MSE with respect to $\mathbf{w}$ and equate it to 0. The resulting equation for the optimal value of $\mathbf{w}$, denoted by $\mathbf{w}^{*}$, is given by

\[
\mathbf{w}^{*} = \left( \frac{1}{P} \sum_{n=1}^{P} \mathbf{x}(n)\,\mathbf{x}(n)^T \right)^{-1} \left( \frac{1}{P} \sum_{n=1}^{P} \mathbf{x}(n)\,d(n) \right) \tag{6-9}
\]

We have determined the weights by using our test data from the THALIA testbed. The THALIA test data has 10 schemas and one report for each schema. For each report, we extracted its information and created a corresponding instance of the report ontology. Each report ontology instance has 5 to 9 data element concepts. In total, we have 2576 data concept pairs in 45 report instance combinations. We use 1500 of these 2576 data concept pairs as the training data set for determining the weights of the similarity function. The eight weights found for the similarity function are shown in Table 6-4. We ran our experiments to determine the weights with three different similarity measures for computing similarity scores between texts: JCN [52], Lin [60] and Levenshtein [20].

Table 6-4. Weights found by the analytical method for different similarity functions with THALIA test data.

SimFunc       Attribute  Table  Type  Description  Query  Title  Header  Footer
JCN           0.30       -0.05  0.01  0.79         -0.06  0.005  0.05    0.09
LIN           0.32       -0.08  0.03  0.60         0.003  -0.03  -0.15   0.08
Levenshtein   0.32       -0.13  0.02  0.72         0.01   -0.03  -0.04   0.14

Alternatively, $\mathbf{w}$ can be found in an iterative fashion using the update equation

\[
\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\, e(n)\, \mathbf{x}(n) \tag{6-10}
\]

where $\eta$ is called the step size and $e(n)$ is the error value given by $d(n) - y(n)$, with $y(n) = \mathbf{w}(n)^T \mathbf{x}(n)$ being the output of the current model.
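
As a rough illustration of Equations 6-9 and 6-10, the following Python sketch computes the weights for a toy two-feature problem, once analytically and once iteratively. It is not the SMART prototype (which is written in Java): the function names, the toy data, the step size of 0.1 and the number of passes over the data are assumptions chosen only for illustration.

```python
# Sketch of Equations 6-9 (analytic) and 6-10 (iterative) in pure Python.
# Names, toy data, step size and epoch count are illustrative assumptions.


def solve(a, b):
    """Solve the linear system a * w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def analytic_weights(xs, ds):
    """Equation 6-9: w* = (1/P sum x x^T)^-1 (1/P sum x d)."""
    p, n = len(xs), len(xs[0])
    r = [[sum(x[i] * x[j] for x in xs) / p for j in range(n)] for i in range(n)]
    c = [sum(x[i] * d for x, d in zip(xs, ds)) / p for i in range(n)]
    return solve(r, c)


def lms_weights(xs, ds, step=0.1, epochs=2000):
    """Equation 6-10: w(n+1) = w(n) + step * e(n) * x(n), repeated over the data."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, d in zip(xs, ds):
            y = sum(wi * xi for wi, xi in zip(w, x))  # model output y(n)
            e = d - y                                  # error e(n) = d(n) - y(n)
            w = [wi + step * e * xi for wi, xi in zip(w, x)]
    return w


# Toy observed data: each x is a feature vector of two property similarities,
# each d is the manually assigned overall similarity (here 0.3*x1 + 0.7*x2).
XS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
DS = [0.3, 0.7, 1.0, 0.29]

W_ANALYTIC = analytic_weights(XS, DS)
W_ITERATIVE = lms_weights(XS, DS)
```

Because the toy data is noise-free, both routes recover approximately the same weights. On real training pairs the iterative result would only approximate the analytic one, and the step size would need tuning.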

6.3 Experimental Evaluation

In the following subsections, we explain the results gathered by running the SMART prototype on the test data sets explained in Section 6.1. We use the f-measure metric and Receiver Operating Characteristic (ROC) curves to evaluate the accuracy of our results. We present descriptions of the f-measure metric and ROC curves below.

F-measure has been the most widely used metric for evaluating schema matching approaches [26]. F-measure is the harmonic mean of Precision (P) and Recall (R). Precision specifies the percentage of correct results among all found results, and Recall specifies the percentage of correct results among all real results. Table 6-5 shows the confusion matrix. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. According to Table 6-5, we can formulate Precision (P) as $TP/(TP+FP)$ and Recall (R) as $TP/(TP+FN)$. We do not use the P or R measures alone because neither P nor R alone can accurately assess the match quality [25]. R can easily be maximized at the expense of a poor P by returning as many correspondences as possible, for example, the cross product of two input schemas. On the other hand, a high P can be achieved at the expense of a poor R by returning only a few but correct correspondences. We calculated the F-measure with the following formula, giving the Recall (R) and Precision (P) metrics equal weights.

\[
\mathit{F\text{-}Measure} = \frac{2 \cdot \mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} \tag{6-11}
\]

Table 6-5. Confusion matrix.

                     Predicted Positive      Predicted Negative
Positive Examples    True Positives (TP)     False Negatives (FN)
Negative Examples    False Positives (FP)    True Negatives (TN)

Receiver Operating Characteristic (ROC) analysis originated from signal detection theory. ROC analysis has also widely been used in medical data analysis to study the
effect of varying the threshold on the numerical outcome of a diagnostic test. It has been introduced to machine learning and data mining relatively recently.

ROC analysis shows the performance of a classifier as a tradeoff between detection rate and false alarm rate. To analyze the tradeoff between the two rates, a ROC curve is plotted. A ROC curve is a graphical plot of the true positive rate (a.k.a. hit or detection rate) versus the false positive rate (a.k.a. false alarm rate) as a binary classifier system's threshold parameter is varied. According to Table 6-5, we formulate the true positive rate (TPR) as $TP/(TP+FN)$ and the false positive rate (FPR) as $1 - TN/(TN+FP)$.

The ROC curve always goes through two points, (0,0) and (1,1). The point (0,0) is where the classifier detects no alarms. In this case it always gets the negative cases right but gets all positive cases wrong. The second point is (1,1), where everything is classified as positive, so the classifier gets all positive cases right but gets all negative cases wrong. The best possible prediction method would yield a point in the upper left corner (0,1), meaning that all true positives are found and no false positives are found. The closer the curve follows a line from (0,0) to (0,1) and then from (0,1) to (1,1), the more accurate the classifier.

The area under the ROC curve is a convenient way of comparing classifiers. A random classifier has an area of 0.5, while an ideal one has an area of 1. The larger the area under the ROC curve, the better the performance of the classifier. However, in some cases, the area under the ROC curve may be misleading. At a chosen threshold, the classifier with the larger area may not be the one with the better performance. The best place to operate the classifier (the best threshold) is the point on its ROC curve which lies on a 45 degree line closest to the upper left corner (0,1) of the ROC plot (footnote 7: we assume that the costs of detection and false alarm are equal).

We run our experiments with different similarity measures (e.g., Lin, JCN and lexical) and then compare them with the results of the COMA++ (COmbination of
MAtching algorithms) [7] schema matcher framework. We chose to compare our results with the results of COMA++ because COMA++ has performed the best in experiments evaluating the existing schema matching approaches [26, 99]. Besides, it provides a downloadable prototype, which enables us to create reproducible results. COMA++ also enables combining different schema matching algorithms. We used the AllContext and FilteredContext combined matchers in the COMA++ framework. AllContext and FilteredContext are combinations of five different matchers: name, path, leaves, parents and siblings.

Figure 6-4. Results of the SMART with Jiang-Conrath (JCN), Lin and Levenshtein metrics.

6.3.1 Running Experiments with THALIA Data

To evaluate the SMART approach, we use data sources and cached HTML pages from the THALIA data integration benchmark [48]. THALIA offers 44+ different University course catalogs from computer science departments worldwide. University course catalogs and their schemas can be downloaded from the THALIA website (footnote 8: http://www.cise.ufl.edu/project/thalia.html). For the scope of this evaluation, we consider each catalog page (in HTML) to be a sample report from the corresponding University. We selected 10 university catalogs (reports) from THALIA that represent different report design practices and paired their reports, resulting in 45
different pairs of reports to match. We tacitly assume that course information is stored in a database and that each report is produced by the Eclipse BIRT tool, which fetches the relevant data from the repository and produces the HTML report. The data presented on the reports are stored in a MySQL 4.1 database.

Figure 6-5. Results of COmbination of MAtching algorithms (COMA++) with the AllContext and FilteredContext combined matchers, and comparison of SMART and COMA++ results.

The SMART approach's prototype, written in Java, extracts information from report design templates and stores the extracted information in instances of the report ontology. Then it computes similarity scores using the weights described in Section 6.2. The SMART prototype uses three different similarity measures to find similarity scores between texts: the JCN and LIN semantic similarity measures and the Levenshtein edit similarity measure.

Figure 6-4 shows the f-measure results of SMART when the JCN and LIN semantic similarity measures and the Levenshtein lexical similarity measure are used. We use semantic similarity measures to compute similarity scores between descriptive texts such as column headers, report headers and footers. Figure 6-4 shows the change in the precision, recall and f-measure metrics as the threshold changes. The reader can notice that the JCN and LIN semantic similarity measures perform better than the Levenshtein lexical similarity measure.
This was quite expected because using semantic similarity measures helps us to identify similarities between words that are semantically close but lexically far.

Figure 6-6. Receiver Operating Characteristics (ROC) curves of SMART and COMA++ for THALIA test data.

In Figure 6-5, we show COMA++ results for the THALIA test data with the AllContext and FilteredContext combined matchers. As in the other figures, the results start with low precision but high recall values for lower thresholds. The precision value increases and the recall value decreases as the threshold increases. On the right-hand side of Figure 6-5, we present the comparison between SMART and COMA++ results. The reader can notice that the SMART performs better at all thresholds with the JCN semantic similarity measure. The second best result on the figure is also achieved by the SMART, when the Levenshtein (EDIT) lexical similarity measure is used. Even the results with the lexical similarity measure are better than the COMA++ results. That is because the SMART finds similarity scores between schema elements by using the descriptive texts extracted from reports. These texts tend to be lexically closer than the schema element names that the COMA++ approach uses to find similarity scores between schema elements.

In Figure 6-6, we show ROC curves of the SMART and COMA++ approaches for the THALIA test data. As stated before, the closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the approach. When the reader analyzes the ROC curves, the reader can notice that results of the SMART
are much more accurate than the results of COMA++. The best threshold at which to run the matchers can be found with the help of the ROC curves. The threshold that generates the closest point to the upper left corner (0,1) of the ROC plot and lies on a 45 degree line is the best threshold (footnote 9: the assumption here is that the costs of a false alarm and a detection are equal). For example, the coordinates of the closest point to the upper left corner (0,1) on the ROC curve for the SMART with the LIN similarity measure are (0.05, 0.8). The SMART produces a 0.05 false alarm rate and a 0.8 detection rate when operated with a threshold of 0.3 (footnote 10: the thresholds for different detection/false alarm rate combinations are not shown on the figure). The coordinates of the closest point to the upper left corner (0,1) on the ROC curve for COMA++ with the AllContext matcher are (0.25, 0.55). COMA++ produces a 0.25 false alarm rate and a 0.55 detection rate when operated with a threshold of 0.4. This shows that SMART and COMA++ achieve their best performance at different thresholds. Since the ROC curve of the SMART is always closer to the upper left corner (0,1), the SMART performs better than COMA++ at any threshold. This fact can also be seen in Figure 6-5, where the f-measure results of the SMART and COMA++ are presented for different thresholds.

6.3.2 Running Experiments with UF Data

In this section, we present our experimental results with our second test data set. The second data set has three schemas and each schema has 10 reports. We obtained the three schemas from the College of Engineering, the Business School and the Bridges Office. We described the details of the second data set in Section 6.1.2.

The difference between the first and the second test data sets is that the second data set has more reports per schema and that the reports of the second test data set do not cover the entire schema. The schemas from the College of Engineering (COE), the Bridges Office (BO), and the Business School (BS) have 135, 175 and 114 schema
elements (i.e., attributes) respectively. We manually determined that the (COE-BO) pair has 88, the (COE-BS) pair has 91 and the (BO-BS) pair has 110 mappings. However, the reports cover only 90% of these mappings. This means that the SMART can reach a recall value of at most 0.9, even if it determines all the mappings covered by the reports. Our experiments show that even with this disadvantage, the SMART performs better than COMA++.

Figure 6-7. Results of the SMART with different report pair similarity thresholds for UF test data.

Since we have more than one report per schema, we need to merge the results from the report pair combinations into a final similarity matrix. In Section 3.2.7, we described how we merge the scores into a final similarity matrix when we have more than one score for a schema element pair. In short, we compute the weighted average of the similarity scores between schema element pairs. We use the similarity scores between report pairs as the weights for this computation. We described how we compute similarity scores between report pairs in Section 3.2.4. The similarity scores between report pairs are in the range [0,1]. To eliminate unrelated report pairs, we set a threshold for report pair similarity scores and consider only schema element similarity scores that come from report pairs having a similarity score higher than this threshold.

We show the accuracy results of the SMART with the UF data set according to the f-measure metric in Figure 6-7. On the left-hand side of Figure 6-7, we show the results of the SMART for the Business-Bridges schema pair when the report pair similarity
threshold is set to 0.6, 0.7 and 0.8. The report pairs cover 90% of the actual mappings, therefore the recall value is always less than 0.9. There are 19, 13 and 10 report pairs that have a similarity score higher than 0.6, 0.7 and 0.8 respectively. The reader can notice that the accuracy of the results is better when the report similarity threshold is set to 0.7 or 0.8. That is because the report pairs having a similarity score higher than 0.7 are more similar to each other, which leads to more accurate results. In the middle of Figure 6-7, we show the results of the SMART for the Business-College of Engineering schema pair when the report pair similarity threshold is set to 0.6, 0.7 and 0.8. There are 16, 12 and 8 report pairs that have a similarity score higher than 0.6, 0.7 and 0.8 respectively. The reader can notice that the accuracy of the results is slightly better when the report similarity threshold is set to 0.6 or 0.7. That is because when the report pair similarity threshold is set to 0.8, we eliminate some very similar report pairs (footnote 11: the schemas have 10 very similar report pairs). On the right-hand side of Figure 6-7, we show the results of the SMART for the College of Engineering-Bridges schema pair when the report pair similarity threshold is set to 0.6, 0.7 and 0.8. There are 21, 13 and 8 report pairs that have a similarity score higher than 0.6, 0.7 and 0.8 respectively. The reader can notice that the accuracy of the results is slightly better when the report similarity threshold is set to 0.7. That is because when the report pair similarity threshold is set to 0.6, we include some unrelated report pairs in the computation, which affects the accuracy of the results negatively. Also, when the report pair similarity threshold is set to 0.8, we eliminate some very similar report pairs. Therefore, the results are better when the threshold is set to 0.7. The changes in the accuracy of the results with different report pair similarity score threshold settings show that correctly determining the report similarity score threshold is important for the SMART approach. Figure 6-7 suggests that we need to select the report pair similarity score threshold carefully. The chosen report
similarity threshold should not eliminate the similar report pairs but should eliminate the unrelated ones.

Figure 6-8. F-Measure results of SMART and COMA++ for UF test data when the report pair similarity threshold is set to 0.7.

In Figure 6-8, we compare the performance of the SMART with that of COMA++ based on the f-measure metric. The SMART results were prepared with the JCN semantic similarity measure when the report threshold was set to 0.7. The COMA++ results were prepared with the AllContext and the FilteredContext combined matchers. The reader can notice that SMART produces higher f-measure results than COMA++. However, SMART and COMA++ achieve their best results at different thresholds. The results of COMA++ are very close to the results of the SMART for the Business School-Bridges Project schema pair. That is because this schema pair has very similar naming conventions, as described in Section 6.1.2. When similar naming conventions are used, lexical similarity measures perform better. COMA++ matchers are based on lexical similarity measures, hence COMA++ performs better for the Business School-Bridges Project schema pair compared to its performance for the other schema pairs. On the other hand, SMART does not use only lexical similarity measures. It combines lexical and semantic similarity measures and does not depend on the lexical closeness of schema element names. It utilizes more descriptive texts extracted from reports. Therefore, the results of the SMART are not affected by changes in the descriptiveness or lexical closeness
of the schema element names. The reader can also notice that SMART and COMA++ achieve their best results at different schema element similarity score thresholds. The SMART achieves its best results around a threshold of 0.25, and COMA++ around a threshold of 0.5, for the UF test data sets.

Figure 6-9. Receiver Operating Characteristics (ROC) curves of the SMART for UF test data.

On the left-hand side of Figure 6-9, we present ROC curves of the SMART for the UF Business-Bridges schema pair. Each schema has 10 reports, which makes 100 report pairs. Ten of these report pairs are very similar. Each report pair has a similarity score in the range [0,1]. There are 31, 19, 13, 10 and 7 report pairs that have a report similarity score higher than the 0.5, 0.6, 0.7, 0.8 and 0.9 thresholds respectively. When the report pair similarity score threshold is set to 0.9, some of the very similar report pairs are eliminated. Therefore the performance of the SMART is low when the report similarity score threshold is set to 0.9. On the right-hand side of Figure 6-9, we present ROC curves of the SMART for the UF Business-Engineering schema pair. Again, ten of the possible 100 report pairs are very similar. There are 28, 16, 12, 8 and 4 report pairs that have a report similarity score higher than the 0.5, 0.6, 0.7, 0.8 and 0.9 thresholds respectively. When the report pair similarity threshold is set to 0.8 or 0.9, some of the very similar report pairs are eliminated. Therefore the performance of the SMART decreases for
these thresholds. As we lower the report pair similarity threshold, the performance of the SMART slightly increases. However, as we lower the report similarity score threshold, more report pairs pass the threshold, which requires extra processing time. After some point, the increase in performance gained by decreasing the report pair similarity threshold, and hence increasing the number of reports and the amount of computation, is negligible. Therefore, we do not consider the report pairs below the 0.5 similarity score in Figure 6-9.

Figure 6-10. Comparison of the ROC curves of the SMART and COMA++ for UF test data.

In Figure 6-10, we compare the performance of the SMART with that of COMA++ based on the ROC curves. For the Business-Bridges schema pair, COMA++ performs better than the SMART. As stated before, the naming conventions of the schemas are very close, which helps COMA++ to perform better. Moreover, the SMART has the disadvantage that not all the mappings are covered by the available reports. For the Business-Engineering schema pair, the SMART performs very close to COMA++ even though not all the mappings are covered by reports.
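
The evaluation procedure used throughout Section 6.3 can be sketched in a few lines of Python: count the confusion matrix entries at a given threshold, derive the F-measure (Equation 6-11) and the ROC point (FPR, TPR), and pick the threshold whose ROC point is closest to the upper left corner (0,1). This is a minimal sketch under stated assumptions, not the code used for the experiments: the function names, the dictionary layout of the scores and the toy data are all hypothetical, and equal costs for false alarms and missed detections are assumed, as in the footnotes.

```python
import math


def confusion(scores, gold, threshold):
    """Count (TP, FP, FN, TN) over scored candidate pairs at one threshold."""
    tp = fp = fn = tn = 0
    for pair, score in scores.items():
        predicted = score >= threshold   # matcher reports a mapping
        actual = pair in gold            # manual mapping exists
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (Equation 6-11)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR); FP/(FP+TN) equals 1 - TN/(TN+FP)."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr


def best_threshold(scores, gold, thresholds):
    """Pick the threshold whose ROC point is closest to the corner (0, 1)."""
    def distance(t):
        fpr, tpr = roc_point(*confusion(scores, gold, t))
        return math.hypot(fpr, 1.0 - tpr)
    return min(thresholds, key=distance)


# Toy similarity scores for schema element pairs and the manual mappings.
SCORES = {("a", "x"): 0.9, ("a", "y"): 0.2, ("b", "x"): 0.4, ("b", "y"): 0.7}
GOLD = {("a", "x"), ("b", "y")}
BEST = best_threshold(SCORES, GOLD, [0.3, 0.5, 0.8])
```

Sweeping the threshold over a grid and recording `f_measure` and `roc_point` at each step would yield the kind of data plotted in the f-measure and ROC figures. Unequal error costs would call for a weighted distance instead of the plain Euclidean one.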

CHAPTER 7
CONCLUSION

Schema matching is a fundamental problem that occurs when information systems share, exchange or integrate data for the purpose of data warehousing, query processing, message translation, etc. Despite extensive efforts, solutions for schema matching are still mostly manual and depend on significant human input, which makes schema matching a time-consuming and error-prone task. Schema elements are typically matched based on schema and data. However, the clues gathered by processing the schema and data are often unreliable, incomplete and not sufficient to determine the relationships among schema elements [28]. Moreover, the mapping depends on the application and may change from one application to another even though the underlying schemas remain the same. Several automatic approaches exist, but their accuracy depends heavily on the descriptiveness of the schemas alone.

We have developed a new approach called Schema Matching by Analyzing ReporTs (SMART), which extracts important semantic information about the schemas and their relationships from report generating application source code and report design templates. Specifically, in SMART we reverse engineer the application source code and report templates associated with the schemas that are to be matched. From the source code and report templates which use the schema and produce reports or other user-friendly output, we extract semantically rich descriptive texts. We identify relationships of these descriptive texts with the data presented on the report with the help of a set of heuristics. We trace the data on the report back to the corresponding schema elements in the data source. We store all the information gathered from a report, including the descriptive texts (e.g., column headers and report title) and properties of the data presented (e.g., schema element name and type of data), in an instance of the report ontology. We compute similarity scores between instances of the report ontology. We then convert inter-ontology matching scores into scores between schema elements.

Our experimental results show that the SMART provides more reliable and accurate results than current approaches that rely on the information contained in the schemas and data instances alone. For example, the highest accuracy (based on the f-measure metric) of the SMART for our first test data set, in which reports cover all schema elements, is 0.73, while the highest accuracy of COMA++ (the best schema matching approach according to the evaluations [26]) is 0.5. The results of the SMART are also better than or very close to the COMA++ results for our second test data set, in which reports cover 90% of the mappings. The highest accuracies (based on the f-measure metric) of the SMART for our second test data set are 0.55, 0.68 and 0.57, while the highest accuracies of COMA++ are 0.5, 0.5 and 0.4 for the respective schema pairs. We also analyzed our results with receiver operating characteristics (ROC) curves. We saw that the SMART's performance is better at every threshold for our first test data set and that the SMART's performance is very close to COMA++'s performance for the second data set.

Our approach shows that valuable semantic information can be extracted from report generating application source code. Reverse engineering source code to extract semantic information is a very challenging task. To ease the process of semantic extraction, we introduced a methodology and framework which utilize state-of-the-art tools and design patterns. Besides, our approach shows that report templates (represented in XML) are also a valuable source of semantics and that the semantic information can be easily extracted from report templates. Moreover, we show how the information extracted from database schemas and report application source can be stored in ontologies. We also explained in detail how we apply the multiple linear regression method to determine the weights of the different pieces of information in order to achieve the most accurate results.

We believe our approach represents an important step towards more accurate and reliable tools for schema matching. More and better solutions for automatic schema matching help us save effort, time and investment. The decreased cost of schema matching, and hence of data integration, enables more and more organizations to
collaborate. The synergy gained from effective, rapid, and flexible collaborations among organizations boosts the economy and thus enhances the quality of our daily life.

7.1 Contributions

Researchers advocate that the gain from capturing even a limited amount of useful semantics can be tremendous [87], and they motivate utilizing any kind of information source to improve our understanding of data. Researchers also point out that application source code encapsulates important semantic information about its application domain and can be used for the purpose of schema matching for data integration [78]. External information sources such as corpora of schemas and past matches have been used for schema matching, but application source code has not yet been used as an external information source [25, 28, 78]. In this research, we focus on this well-known but not yet addressed challenge of analyzing application source code for the purpose of semantic extraction for schema matching. We present a novel approach for schema matching that utilizes semantically rich texts extracted from application source code. We show that the approach we present in this dissertation provides better accuracy for the purpose of automatic schema matching.

During semantic analysis of application source code, we create an instance of the report ontology from each report generated by application source code and use this ontology instance for the purpose of schema matching. While (semi)automatic extraction of ontologies (a.k.a. ontology learning) from text, relational schemata and knowledge bases is well studied in the literature [23, 37], to the best of our knowledge there has been no study aimed at extracting an ontology from application source code.

Another important contribution is the introduction of the generic function for computing similarity scores between concepts of ontologies. We also described how we determine the weights of the similarity function. The similarity function, along with the methodology to determine the weights of the function, can be applied to any domain to determine similarities between different concepts of ontologies.
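
To make the generic similarity function more concrete, the Python sketch below evaluates the weighted sum of Equation 6-1 using the Levenshtein weights reported in Table 6-4. It is an illustration under assumptions rather than the prototype's code: the property key names and the sample data element are hypothetical, and turning the edit distance into a [0,1] similarity by dividing by the longer string's length is an assumed normalization that the text does not specify.

```python
# Weights of the Levenshtein row of Table 6-4; the key names are hypothetical.
WEIGHTS = {
    "attribute": 0.32, "table": -0.13, "type": 0.02, "description": 0.72,
    "query": 0.01, "title": -0.03, "header_note": -0.04, "footer_note": 0.14,
}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def sim_func(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sim_data_element(e1, e2, weights=WEIGHTS):
    """Weighted sum of per-property similarities, as in Equation 6-1."""
    return sum(w * sim_func(e1.get(k, ""), e2.get(k, ""))
               for k, w in weights.items())


# Hypothetical data element concept extracted from a report.
ELEMENT = {"attribute": "instr", "table": "course", "type": "varchar",
           "description": "Instructor", "query": "select instr from course",
           "title": "Course Catalog", "header_note": "", "footer_note": ""}
SELF_SIMILARITY = sim_data_element(ELEMENT, ELEMENT)
```

A lexical measure of this kind rates 'tower' and 'power' as close and 'lecturer' and 'instructor' as distant. Swapping `sim_func` for a semantic measure would leave `sim_data_element` unchanged, which is the sense in which the function is generic.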

The schema matching approaches so far have been using lexical similarity functions or look-up tables to determine the similarity scores of two schema elements. There have been suggestions to utilize semantic similarity measures between words [7], but they have not been realized. Names of schema elements are mostly abbreviations and concatenations of words. These names cannot be found in the dictionaries that semantic similarity measures use to compute similarity scores between two words. Therefore, utilizing semantic similarity measures between words was not possible. We extract descriptive texts from reports and relate them to the schema elements. Therefore, we can utilize state-of-the-art semantic similarity measures to determine similarities. By using a semantic similarity measure instead of a lexical similarity measure such as edit distance, we can detect the similarities of words that are lexically far but semantically close, such as 'lecturer' and 'instructor', and we can also eliminate the words that are lexically close but semantically far, such as 'tower' and 'power'.

One important contribution is that integration based on user reports eases the communication between business and Information Technology (IT) specialists. Business and IT specialists often have difficulty understanding each other. They can discuss the data presented on reports rather than database schemas. Business specialists can request that the data seen on specific reports be integrated or shared. Analyzing reports for data integration and sharing helps business and IT specialists communicate better.

While conducting the research, we saw that there is a need for available test data of sufficient richness and volume to allow meaningful and fair evaluations between different information integration approaches. To address this need, we developed the THALIA (Test Harness for the Assessment of Legacy information Integration Approaches) benchmark (footnote 1: THALIA website: http://www.cise.ufl.edu/project/thalia.html), which provides researchers with a collection of over 40 downloadable data
sources representing University course catalogs, a set of twelve benchmark queries, as well as a scoring function for ranking the performance of an integration system [47, 48]. We are happy to see that it is being used as a source of test data and as a benchmark by researchers [11, 74, 100] and in graduate courses (footnote 2: the graduate course at the University of Toronto using THALIA is 'Research Topics in Data Management': http://www.cs.toronto.edu/~miller/cs2525/).

In the semantic analysis part of our work, we introduce a new extensible and flexible methodology for semantic extraction from application source code. We integrate and utilize state-of-the-art techniques in object oriented programming and parser generation, and leverage the research in code reverse engineering and program understanding. One of the main contributions of our semantic analysis methodology is its functional extensibility. Our information extraction framework lets researchers add new functionality as they develop new heuristics and algorithms on the source code being analyzed. Our current information extraction technique provides improved accuracy as it eliminates unused code fragments (i.e., methods, procedures).

7.2 Future Directions

This research can be continued in the following directions:

Extending the semantic analyzer (SA). SA can be extended to extract information from web query interfaces. Web query interfaces have potentially valuable semantic information for matching schema elements. The information gathered from query interfaces can facilitate better results for schema matching. SA can also be extended to extract other possibly important information (e.g., format and location) about data on a report. New heuristics can also be added to relate data and descriptive texts on a report.

Enhancing the report ontology. Our report ontology was represented in OWL (Web Ontology Language). We can benefit from the capabilities of OWL to relate data and description elements in the ontology. In OWL, a set of OWL statements can allow us
to conclude another OWL statement. For example, the statements (motherOf subProperty parentOf) and (Nedret motherOf Oguzhan), when stated in OWL, allow us to conclude (Nedret parentOf Oguzhan) based on the logical definition of subProperty as given in the OWL spec. Similarly, we can define an isDescriptionOf relation between the data element concept and the description element concept, so that OWL can conclude the isDescriptionOf relation by looking at the location information of both data element and description element concepts on a report. Another advantage of using OWL ontologies is the availability of tools such as Racer, FaCT and Pellet that can reason about them. A reasoner can also help us understand whether we can accurately extract data and description elements from the report. For instance, we can define a rule such as "No data or description elements can overlap" and check the OWL ontology with a reasoner to make sure that this rule is satisfied.

Extending the schema matcher (SMART). We evaluate the similarity scores produced by SMART to determine 1-to-1 mappings. We can work on the results of SMART to figure out how to interpret them to determine 1-n and m-n mappings as well.

Continuing research on similarity. Assessing similarity scores between objects is an important research topic. We introduced a generic similarity function to determine similarities between concepts of ontologies. We also explained how we determine the weights of this generic similarity function. We applied this similarity function to report ontology instances. Real world objects can be modeled using ontologies, and our similarity function can be used to find similarities between them. For example, our similarity function is appropriate for finding semantic similarity scores between two web pages and between two sentences. To determine the similarity scores between two sentences, current approaches do not consider the place of a word in a sentence or the relations between words in a sentence. We can model an ontology specifying the relations of words in a sentence and use our similarity function to assess similarity scores between sentences.

REFERENCES

[1] A. Aamodt, M. Nygard, Different roles and mutual dependencies of data, information, and knowledge: An AI perspective on their integration, Data Knowl. Eng. 16 (3) (1995) 191-222.
[2] P. M. Alexander, Towards reconstructing meaning when text is communicated electronically, Ph.D. thesis, University of Pretoria, South Africa (2002).
[3] M. P. Allen, Understanding Regression Analysis, New York, Plenum Press, 1997.
[4] G. Antoniou, F. van Harmelen, Web ontology language: OWL, in: S. Staab, R. Studer (eds.), Handbook on Ontologies, International Handbooks on Information Systems, Springer, 2004, pp. 67-92.
[5] N. Ashish, C. A. Knoblock, Semi-automatic wrapper generation for internet information sources, in: COOPIS '97: Proceedings of the Second IFCIS International Conference on Cooperative Information Systems, IEEE Computer Society, Washington, DC, USA, 1997.
[6] J. A. Aslam, M. Frost, An information-theoretic measure for document similarity, in: SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, ACM Press, New York, NY, USA, 2003.
[7] D. Aumueller, H.-H. Do, S. Massmann, E. Rahm, Schema and ontology matching with COMA++, in: Proceedings of SIGMOD 2005 (Software Demonstration), Baltimore, 2005.
[8] T.-L. Bach, R. Dieng-Kuntz, Measuring similarity of elements in OWL DL ontologies, in: Context and Ontologies: Theory, Practice and Applications, Pittsburgh, Pennsylvania, USA, 2005.
[9] S. Banerjee, T. Pedersen, An adapted Lesk algorithm for word sense disambiguation using WordNet, in: Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, 2002.
[10] J. Berlin, A. Motro, Database schema matching using machine learning with feature selection, in: CAiSE '02: Proceedings of the 14th International Conference on Advanced Information Systems Engineering, Springer-Verlag, London, UK, 2002.
[11] A. Bilke, J. Bleiholder, F. Naumann, C. Bohm, K. Draba, M. Weis, Automatic data fusion with HumMer, in: VLDB '05: Proceedings of the 31st international conference on Very large data bases, VLDB Endowment, 2005.
[12] J. Bisbal, D. Lawless, B. Wu, J. Grimson, Legacy information systems: Issues and directions, IEEE Softw. 16 (5) (1999) 103-111.
[13] M. Bravenboer, E. Visser, Guiding visitors: Separating navigation from computation, Tech. Rep. UU-CS-2001-42, Institute of Information and Computing Sciences, Utrecht University, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands (November 2001).
[14] M. L. Brodie, The promise of distributed computing and the challenges of legacy information systems, in: Proceedings of the IFIP WG 2.6 Database Semantics Conference on Interoperable Database Systems (DS-5), North-Holland, 1993.
[15] A. Budanitsky, G. Hirst, Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures, in: NAACL 2001 WordNet and Other Lexical Resources Workshop, Pittsburgh, 2001.
[16] A. Budanitsky, G. Hirst, Evaluating WordNet-based measures of semantic distance, Computational Linguistics 32 (1) (2006) 13-47.
[17] D. Buttler, L. Liu, C. Pu, A fully automated object extraction system for the world wide web, in: ICDCS, 2001.
[18] P. Checkland, S. Holwell, Information, Systems and Information Systems - making sense of the field, John Wiley and Sons, Inc., Hoboken, NJ, USA, 1998.
[19] E. J. Chikofsky, J. H. Cross II, Reverse engineering and design recovery: A taxonomy, IEEE Softw. 7 (1) (1990) 13-17.
[20] W. W. Cohen, P. Ravikumar, S. E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: S. Kambhampati, C. A. Knoblock (eds.), IIWeb, 2003.
[21] C. Corley, R. Mihalcea, Measuring the semantic similarity of texts, in: Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, Michigan, 2005.
[22] K. H. Davis, P. H. Aiken, Data reverse engineering: A historical survey, in: Working Conference on Reverse Engineering (WCRE), 2000.
[23] Y. Ding, S. Foo, Ontology research and development. Part I - a review of ontology generation, Journal of Information Science 28 (2) (2002) 123-136.
[24] H.-H. Do, E. Rahm, COMA - a system for flexible combination of schema matching approaches, in: Proc. 28th Intl. Conference on Very Large Databases (VLDB), Hong Kong, Aug. 2002.
[25] H.-H. Do, Schema matching and mapping-based data integration, Dissertation, Department of Computer Science, Universität Leipzig, Germany (January 2006).

[26] H. H. Do, S. Melnik, E. Rahm, Comparison of schema matching evaluations, in: Revised Papers from the NODe 2002 Web and Database-Related Workshops on Web, Web-Services, and Database Systems, Springer-Verlag, London, UK, 2003.
[27] A. Doan, P. Domingos, A. Y. Levy, Learning source description for data integration, in: WebDB (Informal Proceedings), 2000.
[28] A. Doan, A. Halevy, Semantic integration research in the database community: A brief survey, AI Magazine, Special Issue on Semantic Integration 26 (1) (2005) 83-94.
[29] A. Doan, N. F. Noy, A. Y. Halevy, Introduction to the special issue on semantic integration, SIGMOD Record 33 (4) (2004) 11-13.
[30] P. Drew, R. King, D. McLeod, M. Rusinkiewicz, A. Silberschatz, Report of the workshop on semantic heterogeneity and interpolation in multidatabase systems, SIGMOD Rec. 22 (3) (1993) 47-56.
[31] M. Ehrig, P. Haase, N. Stojanovic, M. Hefke, Similarity for ontologies - a comprehensive framework, in: 13th European Conference on Information Systems, Regensburg, 2005, ISBN: 3937195092.
[32] D. W. Embley, Y. S. Jiang, Y.-K. Ng, Record-boundary discovery in web documents, in: A. Delis, C. Faloutsos, S. Ghandeharizadeh (eds.), SIGMOD Conference, ACM Press, 1999.
[33] D. W. Embley, D. P. Lopresti, G. Nagy, Notes on contemporary table recognition, in: H. Bunke, A. L. Spitz (eds.), Document Analysis Systems, vol. 3872 of Lecture Notes in Computer Science, Springer, 2006.
[34] J. Euzenat, P. Valtchev, An integrative proximity measure for ontology alignment, in: ISWC-2003 workshop on semantic information integration, Sanibel Island (FL US), 2003.
[35] J. Euzenat, P. Valtchev, Similarity-based ontology alignment in OWL-Lite, in: 15th European Conference on Artificial Intelligence (ECAI), Valencia, 2004.
[36] W. J. Frawley, G. Piatetsky-Shapiro, C. J. Matheus, Knowledge discovery in databases: An overview, AI Magazine 13 (3) (1992) 57-70.
[37] A. Gal, G. A. Modica, H. M. Jamil, OntoBuilder: Fully automatic extraction and consolidation of ontologies from web sources, in: International Conference on Data Engineering (ICDE), IEEE Computer Society, 2004.
[38] E. Gamma, R. Helm, R. E. Johnson, J. M. Vlissides, Design patterns: Abstraction and reuse of object-oriented design, in: ECOOP '93: Proceedings of the 7th European Conference on Object-Oriented Programming, Springer-Verlag, London, UK, 1993.

[39] R. L. Goldstone, Similarity, in: R. Wilson, F. C. Keil (eds.), MIT encyclopedia of the cognitive sciences, MIT Press, Cambridge, MA, 1999, pp. 763-765.
[40] T. R. Gruber, A translation approach to portable ontology specifications, Knowl. Acquis. 5 (2) (1993) 199-220.
[41] J.-L. Hainaut, M. Chandelon, C. Tonneau, M. Joris, Contribution to a theory of database reverse engineering, in: WCRE, 1993.
[42] J.-L. Hainaut, J. Henrard, A general meta-model for data-centered application reengineering, in: Dagstuhl workshop on Interoperability of Reengineering Tools, 2001.
[43] A. Y. Halevy, J. Madhavan, P. A. Bernstein, Discovering structure in a corpus of schemas, IEEE Data Eng. Bull. 26 (3) (2003) 26-33.
[44] J. Hammer, Resolving semantic heterogeneity in a federation of autonomous, heterogeneous database systems, Ph.D. thesis, University of Southern California (August 1994).
[45] J. Hammer, W. O'Brien, M. S. Schmalz, Scalable knowledge extraction from legacy sources with SEEK, in: H. Chen, R. Miranda, D. D. Zeng, C. C. Demchak, J. Schroeder, T. Madhusudan (eds.), Intelligence and Security Informatics (ISI), vol. 2665 of Lecture Notes in Computer Science, Springer, 2003.
[46] J. Hammer, M. Schmalz, W. O'Brien, S. Shekar, N. Haldavnekar, SEEKing knowledge in legacy information systems to support interoperability, in: ECAI-02 Workshop on Ontologies and Semantic Interoperability, 2002.
[47] J. Hammer, M. Stonebraker, O. Topsakal, THALIA: Test harness for the assessment of legacy information integration approaches, in: International Conference on Data Engineering (ICDE), IEEE Computer Society, 2005.
[48] J. Hammer, M. Stonebraker, O. Topsakal, THALIA: Test harness for the assessment of legacy information integration approaches, Tech. Rep. tr05-001, University of Florida, Computer and Information Science and Engineering (2005).
[49] J. Henrard, Program understanding in database reverse engineering, Ph.D. thesis, University of Notre-Dame (2003).
[50] J. Henrard, V. Englebert, J.-M. Hick, D. Roland, J.-L. Hainaut, Program understanding in databases reverse engineering, in: G. Quirchmayr, E. Schweighofer, T. J. M. Bench-Capon (eds.), DEXA, vol. 1460 of Lecture Notes in Computer Science, Springer, 1998.
[51] G. Hirst, D. St-Onge, Lexical chains as representations of context for the detection and correction of malapropisms, in: C. Fellbaum (ed.), WordNet: An electronic lexical database, MIT Press, 1998.

[52] J. J. Jiang, D. W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, in: Proceedings of ROCLING X, Taiwan, 1997.
[53] S. C. Johnson, YACC: Yet another compiler compiler, Tech. Rep. CSTR 32, AT&T Bell Laboratories (1978).
[54] Y. Kalfoglou, M. Schorlemmer, Ontology mapping: The state of the art, The Knowledge Engineering Review Journal 18 (1) (2003) 1-31.
[55] T. K. Landauer, P. W. Foltz, D. Laham, Introduction to latent semantic analysis, Discourse Processes 25 (1998) 259-284.
[56] C. Leacock, M. Chodorow, Combining local context and WordNet similarity for word sense identification, in: C. Fellbaum (ed.), WordNet: An electronic lexical database, MIT Press, 1998.
[57] D. B. Lenat, Cyc: a large-scale investment in knowledge infrastructure, Communications of the ACM 38 (11) (1995) 33-38.
[58] M. Lesk, Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone, in: SIGDOC 86, 1986.
[59] M. E. Lesk, Lex - a lexical analyzer generator, Tech. Rep. CSTR 39, AT&T Bell Laboratories, New Jersey (1975).
[60] D. Lin, An information-theoretic definition of similarity, in: ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998.
[61] J. Madhavan, P. A. Bernstein, A. Doan, A. Halevy, Corpus-based schema matching, in: ICDE '05: Proceedings of the 21st International Conference on Data Engineering (ICDE '05), IEEE Computer Society, Washington, DC, USA, 2005.
[62] J. Madhavan, P. A. Bernstein, E. Rahm, Generic schema matching with Cupid, in: VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001.
[63] A. Maedche, S. Staab, Measuring similarity between ontologies, in: EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, Springer-Verlag, London, UK, 2002.
[64] D. L. McGuinness, F. van Harmelen, OWL web ontology language overview, http://www.w3.org/TR/owl-features/, W3C Recommendation (February 2004).
[65] G. Miller, W. Charles, Contextual correlates of semantic similarity, Language and Cognitive Processes 6 (1) (1991) 1-28.

[66] G. A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39-41.
[67] J. Q. Ning, A. Engberts, W. V. Kozaczynski, Automated support for legacy code understanding, Commun. ACM 37 (5) (1994) 50-57.
[68] N. F. Noy, Semantic integration: A survey of ontology-based approaches, SIGMOD Record 33 (4) (2004) 65-70.
[69] N. F. Noy, D. L. McGuinness, Ontology development 101: A guide to creating your first ontology, Technical Report KSL-01-05, Stanford Knowledge Systems Laboratory (2003).
[70] V. Oleshchuk, A. Pedersen, Ontology based semantic similarity comparison of documents, in: 14th International Workshop on Database and Expert Systems Applications (DEXA03), IEEE, 2003.
[71] J. Palsberg, C. B. Jay, The essence of the visitor pattern, in: Computer Software and Applications Conference (COMPSAC), IEEE Computer Society, 1998.
[72] S. Patwardhan, Incorporating dictionary and corpus information into a context vector measure of semantic relatedness, Master's thesis, University of Minnesota (2003).
[73] T. Pedersen, S. Patwardhan, J. Michelizzi, WordNet::Similarity - measuring the relatedness of concepts, in: Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), San Jose, CA, 2004.
[74] P. Bailey, D. Hawking, A. Krumpholz, Toward meaningful test collections for information integration benchmarking, in: IIWeb 2006, Workshop on Information Integration on the Web in conjunction with WWW 2006, Edinburgh, Scotland, 2006.
[75] M. Postema, H. W. Schmidt, Reverse engineering and abstraction of legacy systems, Informatica: International Journal of Computing and Informatics 22 (3).
[76] A. Quilici, Reverse engineering of legacy systems: A path toward success, in: ICSE, 1995.
[77] R. Rada, H. Mili, E. Bicknell, M. Blettner, Development and application of a metric on semantic nets, IEEE Transactions on Systems, Man, and Cybernetics 19 (1) (1989) 17-30.
[78] E. Rahm, P. A. Bernstein, A survey of approaches to automatic schema matching, Very Large Data Bases (VLDB) J. 10 (4) (2001) 334-350.
[79] P. Resnik, Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language, J. Artif. Intell. Res. (JAIR) 11 (1999) 95-130.

[80] R. Richardson, A. F. Smeaton, Using WordNet as a knowledge base for measuring semantic similarity between words, Tech. Rep. CA-1294, School of Computer Applications, Dublin City University, Dublin, Ireland (1994).
[81] M. A. Rodriguez, M. J. Egenhofer, Determining semantic similarity among entity classes from different ontologies, IEEE Transactions on Knowledge and Data Engineering 15 (2) (2003) 442-456.
[82] H. Rubenstein, J. Goodenough, Contextual correlates of synonymy, Computational Linguistics 8 (1965) 627-633.
[83] S. Rugaber, Program comprehension, Encyclopedia of Computer Science and Technology 35 (20) (1995) 341-368, Marcel Dekker, Inc., New York.
[84] S. Sangeetha, J. Hammer, M. Schmalz, O. Topsakal, Extracting meaning from legacy code through pattern matching, Technical Report TR-03-003, University of Florida, Gainesville (2003).
[85] N. Seco, T. Veale, J. Hayes, An intrinsic information content metric for semantic similarity in WordNet, in: Proceedings of ECAI 2004, the 16th European Conference on Artificial Intelligence, Valencia, Spain, 2004.
[86] S. Melnik, H. Garcia-Molina, E. Rahm, Similarity flooding: A versatile graph matching algorithm and its application to schema matching, in: ICDE '02: Proceedings of the 18th International Conference on Data Engineering (ICDE '02), IEEE Computer Society, Washington, DC, USA, 2002.
[87] A. P. Sheth, Data semantics: What, where and how, in: Proceedings of the 6th IFIP Working Conference on Data Semantics, 1995.
[88] P. Shvaiko, J. Euzenat, A survey of schema-based matching approaches, Journal on Data Semantics (JoDS) IV (2005) 146-171.
[89] E. Stroulia, M. El-Ramly, L. Kong, P. G. Sorenson, B. Matichuk, Reverse engineering legacy interfaces: An interaction-driven approach, in: 6th Working Conference on Reverse Engineering (WCRE '99), 1999.
[90] Y. A. Tijerino, D. W. Embley, D. W. Lonsdale, Y. Ding, G. Nagy, Towards ontology generation from tables, World Wide Web 8 (3) (2005) 261-285.
[91] O. Topsakal, Extracting semantics from legacy sources using reverse engineering of Java code with the help of visitor patterns, Master's thesis, Department of Computer and Information Science and Engineering, University of Florida (2003).
[92] M. Uschold, M. Gruninger, Ontologies and semantics for seamless connectivity, SIGMOD Rec. 33 (4) (2004) 58-64.

[93] P. Vossen, EuroWordNet: a multilingual database for information retrieval, in: Proceedings of the DELOS workshop on Cross-language Information Retrieval, Zurich, 1997.
[94] J. Wang, F. H. Lochovsky, Data extraction and label assignment for web databases, in: WWW '03: Proceedings of the 12th international conference on World Wide Web, ACM Press, New York, NY, USA, 2003.
[95] B. W. Weide, W. D. Heym, J. E. Hollingsworth, Reverse engineering of legacy code exposed, in: ICSE, 1995.
[96] M. Weiser, Program slicing, in: ICSE '81: Proceedings of the 5th international conference on Software engineering, IEEE Press, Piscataway, NJ, USA, 1981.
[97] L. M. Wills, Using attributed flow graph parsing to recognize clichés in programs, in: Selected papers from the 5th International Workshop on Graph Grammars and Their Application to Computer Science, Springer-Verlag, London, UK, 1996.
[98] Z. Wu, M. Palmer, Verbs semantics and lexical selection, in: Proceedings of the 32nd annual meeting on Association for Computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA, 1994.
[99] M. Yatskevich, Preliminary evaluation of schema matching systems, Tech. Rep. DIT-03-028, University of Trento (2003).
[100] B. Yu, L. Liu, B. C. Ooi, K. L. Tan, Keyword join: Realizing keyword search for information integration, in: Computer Science (CS), DSpace at MIT, 2006.

BIOGRAPHICAL SKETCH

Oguzhan Topsakal, a native of Turkey, received his Bachelor of Science degree from the Computer and Control Engineering Department of Istanbul Technical University in June 1996. He worked in the information technology departments of several companies before pursuing a graduate degree in the U.S.A. After he received his Master of Science degree in computer engineering at the University of Florida in August 2003, he continued for his Ph.D. degree. In the 2004-2005 academic year, he was a study-abroad student at the University of Bremen, Germany. During his Ph.D. studies, he worked as a teaching assistant for programming language and database related courses at the University of Florida and for a data warehousing and data mining course at the University of Hong Kong. He also worked as a research assistant in the Scalable Extraction of Enterprise Knowledge (SEEK) and Test Harness for the Assessment of Legacy Information Integration Approaches (THALIA) projects. His research interests include semantic analysis, machine learning, natural language processing, knowledge management, information retrieval and data integration. He believes in continued learning and education to better understand and to contribute to society.


Permanent Link: http://ufdc.ufl.edu/UFE0019342/00001

Material Information

Title: Semantic Integration through Application Analysis
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0019342:00001





Full Text





SEMANTIC INTEGRATION THROUGH APPLICATION ANALYSIS


By

OGUZHAN TOPSAKAL


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007






































© 2007 Oguzhan Topsakal






































To all who are [illegible], just and loving to others regardless of time, location and status










ACKNOWLEDGMENTS

I thought quite a lot about the time when I would finish my dissertation and write this acknowledgment section. Finally, the time has come. Here is the place where I can remember all the good memories and thank everyone who helped along the way. However, words are not enough to show my gratitude to those who were there with me all along the road to my Ph.D.

First of all, I give thanks to God for giving me the patience, strength and commitment to come all this way.

I would like to give my sincere thanks to my dissertation advisor, Dr. Joachim Hammer, who has been so kind and supportive to me. He was the perfect advisor for me to work with. I also would like to thank my other committee members: Dr. Tuba Yavuz-Kahveci, Dr. Christopher M. Jermaine, Dr. Herman Lam, and Dr. Raymond Issa for serving on my committee.

Thanks to Umut Sargut, Z. i Sairgurt, Can Ozturk, Fatih Buyukserin and Fatih Gordu for making Gainesville a better place to live.

I am grateful to my parents, H. ":-ct Topsakal and Sababaldlin Topsakal; to my brother, Metehan Topsakal; and to my sister-in-law, Sibel Topsakal. They were always there for me when I needed them, and they have always supported me in whatever I do.

My wife, Ei i, and I joined our lives during the most hectic times of my Ph.D. studies, and she supported me in every aspect. She is my treasure.











TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

1.1 Problem Definition
1.2 Overview of the Approach
1.3 Contributions
1.4 Organization of the Dissertation

2 RELATED CONCEPTS AND RESEARCH

2.1 Legacy Systems
2.2 Data, Information, Semantics
2.3 Semantic Extraction
2.4 Reverse Engineering
2.5 Program Understanding Techniques
2.5.1 Textual Analysis
2.5.2 Syntactic Analysis
2.5.3 Program Slicing
2.5.4 Program Representation Techniques
2.5.5 Call Graph Analysis
2.5.6 Data Flow Analysis
2.5.7 Variable Dependency Graph
2.5.8 System Dependence Graph
2.5.9 Dynamic Analysis
2.6 Visitor Design Patterns
2.7 Ontology
2.8 Web Ontology Language (OWL)
2.9 WordNet
2.10 Similarity
2.11 Semantic Similarity Measures of Words
2.11.1 Resnik Similarity Measure
2.11.2 Jiang-Conrath Similarity Measure
2.11.3 Lin Similarity Measure
2.11.4 Intrinsic IC Measure in WordNet
2.11.5 Leacock-Chodorow Similarity Measure











2.11.6 Hirst-St.Onge Similarity Measure
2.11.7 Wu and Palmer Similarity Measure
2.11.8 Lesk Similarity Measure
2.11.9 Extended Gloss Overlaps Similarity Measure
2.12 Evaluation of WordNet-Based Similarity Measures
2.13 Similarity Measures for Text Data
2.14 Similarity Measures for Ontologies
2.15 Evaluation Methods for Similarity Measures
2.16 Schema Matching
2.16.1 Schema Matching Surveys
2.16.2 Evaluations of Schema Matching Approaches
2.16.3 Examples of Schema Matching Approaches
2.17 Ontology Mapping
2.18 Schema Matching vs. Ontology Mapping

3 APPROACH

3.1 Semantic Analysis
3.1.1 Illustrative Examples
3.1.2 Conceptual Architecture of Semantic Analyzer
3.1.2.1 Abstract syntax tree generator (ASTG)
3.1.2.2 Report template parser (RTP)
3.1.2.3 Information extractor (IEx)
3.1.2.4 Report ontology writer (ROW)
3.1.3 Extensibility and Flexibility of Semantic Analyzer
3.1.4 Application of Program Understanding Techniques in SA
3.1.5 Heuristics Used for Information Extraction
3.2 Schema Matching
3.2.1 Motivating Example
3.2.2 Schema Matching Approach
3.2.3 Creating an Instance of a Report Ontology
3.2.4 Computing Similarity Scores
3.2.5 Forming a Similarity Matrix
3.2.6 From Matching Ontologies to Schemas
3.2.7 Merging Results

4 PROTOTYPE IMPLEMENTATION

4.1 Semantic Analyzer (SA) Prototype
4.1.1 Using JavaCC to generate parsers
4.1.2 Execution steps of the information extractor
4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype











5 TEST HARNESS FOR THE ASSESSMENT OF LEGACY INFORMATION INTEGRATION APPROACHES (THALIA)

5.1 THALIA Website and Downloadable Test Package
5.2 DataExtractor (HTMLtoXML) Opensource Package
5.3 Classification of Heterogeneities
5.4 Web Interface to Upload and Compare Scores
5.5 Usage of THALIA

6 EVALUATION

6.1 Test Data
6.1.1 Test Data Set from THALIA testbed
6.1.2 Test Data Set from University of Florida
6.2 Determining Weights
6.3 Experimental Evaluation
6.3.1 Running Experiments with THALIA Data
6.3.2 Running Experiments with ITF Data

7 CONCLUSION

7.1 Contributions
7.2 Future Directions

REFERENCES

BIOGRAPHICAL SKETCH










LIST OF TABLES


Table

2-1 List of relations used to connect senses in WordNet.

2-2 Absolute values of the coefficients of correlation between human ratings of similarity and the five computational measures.

3-1 Semantic Analyzer can transfer information from one method to another through variables and can use this information to discover semantics of a schema element.

3-2 Output string gives clues about the semantics of the variable following it.

3-3 Output string and the variable may not be in the same statement.

3-4 Output strings before the slicing variable should be concatenated.

3-5 Tracing back the output text and associating it with the corresponding column of a table.

3-6 Associating the output text with the corresponding column in the where-clause.

3-7 Column header describes the data in that column.

3-8 Column on the left describes the data items listed to its immediate right.

3-9 Column on the left and the header immediately above describe the same set of data items.

3-10 Set of data items can be described by two different headers.

3-11 Header can be processed before being associated with the data in a column.

4-1 Subpackages in the sa package and their functionality.

6-1 The 10 university catalogs selected for evaluation and size of their schemas.

6-2 Portion of a table description from the College of Engineering, the Bridges Project and the Business School schemas.

6-3 Names of tables in the College of Engineering, the Bridges Office, and the Business School schemas and number of schema elements that each table has.

6-4 Weights found by analytical method for different similarity functions with THALIA test data.

6-5 Confusion matrix.










LIST OF FIGURES


Figure

1-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.

3-1 Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.

3-2 Schema used by an application.

3-3 Schema used by a report.

3-4 Conceptual view of the Data Reverse Engineering (DRE) module of the Scalable Extraction of Enterprise Knowledge (SEEK) prototype.

3-5 Conceptual view of Semantic Analyzer (SA) component.

3-6 Report design template example.

3-7 Report generated when the above template was run.

3-8 Java Servlet generated HTML report showing course listings of CALTECH.

3-9 Annotated HTML page generated by analyzing a Java Servlet.

3-10 Inter-procedural call graph of a program source code.

3-11 Schemas of two data sources that collaborate for a new online degree program.

3-12 Reports from two sample universities listing courses.

3-13 Reports from two sample universities listing instructor offices.

3-14 Similarity scores of schema elements of two data sources.

3-15 Five steps of Schema Matching by Analyzing ReporTs (SMART) algorithm.

3-16 Unified Modeling Language (UML) diagram of the Schema Matching by Analyzing ReporTs (SMART) report ontology.

3-17 Example for a similarity matrix.

3-18 Similarity scores after matching report pairs about course listings.

3-19 Similarity scores after matching report pairs about instructor offices.

4-1 Java code size distribution of Semantic Analyzer (SA) and Schema Matching by Analyzing ReporTs (SMART) packages.

4-2 Using JavaCC to generate parsers.

5-1 Snapshot of Test Harness for the Assessment of Legacy Information Integration Approaches (THALIA) website.










5-2 Snapshot of the computer science course catalog of Boston University.

5-3 Extensible Markup Language (XML) representation of Boston University's course catalog and corresponding schema file.

5-4 Scores uploaded to Test Harness for the Assessment of Legacy Information Integration Approaches (THALIA) benchmark for Integration Wizard (IWiz) Project at the University of Florida.

6-1 Report design practice where all the descriptive texts are headers of the data.

6-2 Report design practice where all the descriptive texts are on the left hand side of the data.

6-3 Architecture of the databases in the College of Engineering.

6-4 Results of the SMART with Jiang-Conrath (JCN), Lin and Levenshtein metrics.

6-5 Results of COmbination of MAtching algorithms (COMA++) with All Context and Filtered Context combined matchers and comparison of SMART and COMA++ results.

6-6 Receiver Operating Characteristics (ROC) curves of SMART and COMA++ for THALIA test data.

6-7 Results of the SMART with different report pair similarity thresholds for ITF test data.

6-8 F-Measure results of SMART and COMA++ for ITF test data when report pair similarity is set to 0.7.

6-9 Receiver Operating Characteristics (ROC) curves of the SMART for ITF test data.

6-10 Comparison of the ROC curves of the SMART and COMA++ for ITF test data.










Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

SEMANTIC INTEGRATION THROUGH APPLICATION ANALYSIS

By

Oguzhan Topsakal

I. -2007

Chair: Joachim Hammer
Major: Computer Engineering

Organizations increasingly need to participate in rapid collaborations with other organizations to be successful and need to share data in such collaborations. One of the problems that needs to be solved when integrating data sources is finding semantic correspondences between elements of schemas of the data sources (a.k.a. schema matching). Schemas, even those from the same domain, show many semantic heterogeneities. Resolving these heterogeneities is mostly done manually, which is tedious, time consuming, and expensive. Current approaches to automating the process mainly use the schemas and the data as input to resolve semantic heterogeneities. However, the schemas and the data are not sufficient sources of semantics. In contrast, we analyze a valuable source of semantics, namely application source code and report design templates, to improve schema matching for information integration. Specifically, we analyze application source code that generates reports to present the data of the organization in a user friendly way. We trace the descriptive information on a report back to the corresponding schema element(s) through reverse engineering of the application source code or report design templates and store the descriptive text, data, and the corresponding schema elements in a report ontology instance. We utilize the information we have discovered for schema matching. Our experiments using a fully functional prototype system show that our approach produces more accurate results than current techniques.










CHAPTER 1
INTRODUCTION

1.1 Problem Definition

The success of many organizations largely depends on their ability to participate in rapid, flexible, limited-time collaborations. The need to collaborate is not limited to business; it also applies to government and non-profit organizations in areas such as the military, emergency management, health care, and rescue. The success of a business organization depends on its ability to rapidly customize its products, adapt to continuously changing demands, and reduce costs as much as possible. Government organizations, such as the Department of Homeland Security, need to collaborate and exchange intelligence to maintain the security of the nation's borders or to protect critical infrastructure, such as energy supply and telecommunications. Non-profit organizations, such as the American Red Cross, need to collaborate on matters related to public health in catastrophic events, such as hurricanes. The collaboration of organizations produces a synergy to achieve a common goal that would not be possible otherwise.

Organizations participating in a rapid, flexible collaboration environment need to share and exchange data. To do so, they need to integrate their information systems and resolve heterogeneities among their data sources. These heterogeneities exist at different levels. There are physical heterogeneities at the system level because of differences between various internal data storage, retrieval, and representation methods. For example, some organizations might use professional database management systems while others might use simple flat files to store and represent their data. In addition, there are structural (syntax-level) heterogeneities because of differences at the schema level. Finally, there are semantic-level heterogeneities because of differences in the use of data that correspond to the same real-world objects [47]. We face a broad range of semantic heterogeneities in information systems because of the different viewpoints of the designers of these systems. Semantic heterogeneity is simply a consequence of the independent creation of the information systems [44].
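As a hedged illustration of semantic heterogeneity (not taken from the systems studied in this dissertation), the following Python sketch compares two invented course-catalog schemas. The element names and the hand-written meanings attached to them are assumptions made purely for the example. It only shows that matching on element names alone can miss correspondences that become obvious once the meaning of each element is known.

```python
# Two hypothetical course-catalog schemas, invented for illustration.
# Each maps a schema element name to a hand-written description of its meaning.
schema_a = {"crs_no": "course number", "instr": "instructor name", "hrs": "credit hours"}
schema_b = {"code": "course number", "lecturer": "instructor name", "units": "credit hours"}


def name_matches(a, b):
    """Return element names that are literally identical in both schemas."""
    return sorted(set(a) & set(b))


def meaning_matches(a, b):
    """Return (element_a, element_b) pairs whose described meanings coincide."""
    return sorted(
        (name_a, name_b)
        for name_a, meaning_a in a.items()
        for name_b, meaning_b in b.items()
        if meaning_a == meaning_b
    )


if __name__ == "__main__":
    print("name-based matches:   ", name_matches(schema_a, schema_b))
    print("meaning-based matches:", meaning_matches(schema_a, schema_b))
```

In this toy case the name-based comparison finds nothing, while the meaning-based comparison pairs every element. In practice the meanings are not given up front, which is why they must be discovered.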

To resolve semantic heterogeneities, organizations must first identify the semantics of

their data elements in their data sources. Discovering the semantics of data automatically

has been an important area of research in the database community [22, 36]. However, the

process of resolving semantic heterogeneity of data sources is still mostly done manually.

Resolving heterogeneities manually is a tedious, error-prone, time-consuming, non-scalable

and expensive task. The time and investment needed to integrate data sources become a

significant barrier to information integration of collaborating organizations.

In this research, we are developing an integrated novel approach that automates the

process of semantic discovery in data sources to overcome this barrier and to help rapid,

flexible collaboration among organizations. As mentioned above, we are aware that there

exist physical heterogeneities among information sources but to keep the dissertation

focused, we assume data storage, retrieval and representation methods are the same

among the information systems to be integrated. In our experience as a software

developer in the information technology departments of several banks and

software companies, application source code that generates reports encapsulates valuable

information about the semantics of the data to be integrated. Reports present data

from the data source in a way that is easily comprehensible to the user and can be a rich

source of semantics. We analyze application source code to discover semantics and thereby facilitate

the integration of information systems. We outline the approach in Section 1.2 below and

provide a more detailed explanation in Sections 3.1 and 3.2. The research described in

this dissertation is part of the NSF-funded1 SEEK (Scalable Extraction of Enterprise

Knowledge) project, which also serves as a testbed.




1 The SEEK project is supported by the National Science Foundation under grant
numbers CMS-0075407 and CMS-0122193.











1.2 Overview of the Approach

The results described in this dissertation are based on the work we have done on the

SEEK project. The SEEK project is directed at overcoming the problems of integrating

legacy data and knowledge across the participants of a collaboration network [45]. The

goal of the SEEK project is to develop methods and theory to enable rapid integration

of legacy sources for the purpose of data sharing. We apply these methodologies in the

SEEK toolkit, which allows users to develop SEEK wrappers. A wrapper translates queries

from an application to the data source schema at run-time. SEEK wrappers act as an

intermediary between the legacy source and decision support tools which require access to

the organization's knowledge.


[Figure 1-1 (diagram not reproduced): for each of Organizations A and B, Data Reverse Engineering (DRE) applies a Schema Extractor (SE) to the data source and a Semantic Analyzer (SA) to the source code, yielding schemas and a knowledgebase; a Schema Matcher (SM) produces mappings and a Wrapper Generator (WG) produces wrappers.]

Figure 1-1: Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.










In general, SEEK [45, 46] works in three steps: Data Reverse Engineering (DRE),

Schema Matching (SM), and Wrapper Generation (WG). In the first step, the Data Reverse

Engineering (DRE) component of SEEK generates a detailed description of the legacy

source. DRE has two sub-components, Schema Extractor (SE) and Semantic Analyzer

(SA). SE extracts the conceptual schema of the data source. SA analyzes electronically

available information sources such as application code and discovers the semantics of

schema elements of the data source. In other words, SA discovers mappings between data

items stored in an information system and the real-world objects they represent by using

the pieces of evidence that it extracts from the application code. SA enhances the schema

of the data source with the discovered semantics, and we refer to the semantically enhanced

schema as the knowledgebase of the organization. In the second step, the Schema Matching (SM)

component maps the knowledgebase of one organization to the knowledgebase of another

organization. In the third step, the extracted legacy schema and the mapping rules

provide the input to the Wrapper Generator (WG), which produces the source wrapper.

These three steps of SEEK are build-time processes. At run-time, the source wrapper

translates queries from the application domain model to the legacy source schema. A

high-level schematic view outlining the SEEK components and their interactions is shown

in Figure 1-1.

In this research, our focus is on the Semantic Analysis (SA) and Schema Matching

(SM) methodology. We first describe how SA extracts semantically rich outputs from the

application source code and then relates them with the schema knowledge extracted by

the Schema Extractor (SE). We show that we can gather significant semantic information

from the application source code by the methodology we have developed. We then focus

on our Schema Matching (SM) methodology. We describe how we utilize the semantic

information that we have discovered by SA to find mappings between two data sources.

The extracted semantic information and the mappings can then be used by the subsequent










wrapper generation step to facilitate the development of legacy source translators and

other tools during information integration which is not the focus of this dissertation.

1.3 Contributions

In this research, we introduce novel approaches for semantic analysis of application

source code and for matching of related but disparate schemas. In this section, we list the

contributions of this work. We describe these contributions in detail in Chapter 7 while

concluding the dissertation.

External information sources such as corpora of schemas and past matches have been

used for schema matching, but application source code has not been used as an external

information source yet [25, 28, 78]. In this research, we focus on this well-known but not

yet addressed challenge of analyzing application source code for the purpose of semantic

extraction for schema matching. The accuracy of the current schema matching approaches

is not sufficient for fully automating the process of schema matching [26]. The approach

we present in this dissertation provides better accuracy for the purpose of automatic

schema matching.

The schema matching approaches so far have been mostly using lexical similarity

functions or look-up tables to determine the similarities of two schema element properties

(for example, the names and types of schema elements). There have been suggestions

to utilize semantic similarity measures between words [7], but they have not been realized.

We utilize state-of-the-art semantic similarity measures between words to determine

similarities and show their effect on the results.

Another important contribution is the introduction of a generic similarity function for

matching classes of ontologies. We have also described how we determine the weights of

our similarity function. Our similarity function along with the methodology to determine

the weights of the function can be applied on many domains to determine similarities

between different entities.










Integration based on user reports eases the communication between business and

information technology (IT) specialists. Business and IT specialists often have difficulty

understanding each other. They can discuss the data presented on

reports rather than incomprehensible database schemas. Analyzing reports

for data integration and sharing helps business and IT specialists communicate better.

Another contribution is the functional extensibility of our semantic analysis

methodology. Our information extraction framework lets researchers add new functionality

as they develop new heuristics and algorithms on the source code being analyzed. Our

current information extraction techniques provide improved performance because they require fewer

passes over the source code, and improved accuracy because they eliminate unused code

fragments (i.e., methods, procedures).

While conducting the research, we saw a need for test data of

sufficient richness and volume to allow meaningful and fair comparisons between different

information integration approaches. To address this need, we developed THALIA2 (Test

Harness for the Assessment of Legacy information Integration Approaches), a benchmark

which provides researchers with a collection of over 40 downloadable data sources

representing university course catalogs, a set of twelve benchmark queries, as well as

a scoring function for ranking the performance of an integration system [47, 48].

1.4 Organization of the Dissertation

The rest of the dissertation is organized as follows. We introduce important

concepts of the work and summarize research in Chapter 2. Chapter 3 describes our

semantic analysis approach and schema matching approach. Chapter 4 describes the

implementation details of our prototype. Before we describe the experimental evaluation

of our approach in Chapter 6, we describe the THALIA test bed in Chapter 5. Chapter 7

concludes the dissertation and summarizes the contributions of this work.




2 THALIA website: http://www.cise.ufl.edu/project/thalia.html










CHAPTER 2
RELATED CONCEPTS AND RESEARCH

In the context of this work, we have explored a broad range of research areas.

These research areas include but are not limited to data semantics, semantic discovery,

semantic extraction, legacy system understanding, reverse engineering of application

code, information extraction from application code, semantic similarity measures, schema

matching, ontology extraction and ontology mapping. While developing our approach,

we leverage these research areas.

In this chapter, we introduce important concepts and related research that are

essential for understanding the contributions of this work. Whenever necessary, we provide

our interpretations of definitions and commonly accepted standards and conventions in

this field of study. We also present the state-of-the-art in the related research areas.

We first introduce what a legacy system is. Then we state the difference between

frequently used terms data, information and semantics in Section 2.2. We point out some

of the research in semantic extraction in Section 2.3. Since we extract semantics through

reverse engineering of application source code, we provide the definitions of reverse

engineering of source code and database reverse engineering in Section 2.4 and present

the techniques for program understanding in Section 2.5. We represent the extracted

information from application source code of different legacy systems in ontologies and

utilize these ontologies to find out semantic similarities between them. For this reason,

semantic similarity measures are also important for us. We have explored the research

on semantic similarity measures and present this work in Section 2.11 after giving

the definition of similarity in Section 2.10. We aim to leverage the research on assessing

similarity scores between texts and ontologies. We present these techniques in Sections

2.13 and 2.14. We then present the ontology concept and the Web

Ontology Language (OWL). Finally, we present ontology mapping and schema mapping










and conclude the chapter by presenting some outstanding techniques of schema matching

in the literature.

2.1 Legacy Systems

Our approaches for semantic analysis of application source code and schema

matching have been developed as part of the SEEK project. The SEEK project aims to

help in understanding legacy systems. We analyze the application source code of a legacy

system to understand its semantics and apply the gained knowledge to solve the schema

matching problem of data integration. In this section, we first give a broad definition of a

legacy system and highlight its importance and then provide its definition in the context

of this work.

Legacy systems are generally known as inflexible, nonextensible, undocumented, old

and large software systems which are essential for the organization's business [12, 14, 75].

They significantly resist modifications and changes. Legacy systems are very valuable

because they are the repository of corporate knowledge collected over a long time and they

also encapsulate the logic of the organization's business processes [49].

A legacy system is generally developed and maintained by many different people with

many different programming styles. In most cases, the original programmers have left, and the

existing team is not expert in all aspects of the system [49]. Even if documentation

of the design and specification of the legacy system once existed, the

original software specification and design have changed, and the documentation was

not updated throughout the years of development and maintenance. Thus, understanding

is lost, and the only reliable documentation of the system is the application source code

running on the legacy system [75].

In the context of this work, we define legacy systems as any information system with

poor or nonexistent documentation about the underlying data or the application code

that is using the data. Despite the fact that legacy systems are often interpreted as old










systems, for us, an information system is not required to be old in order to be considered

as legacy.

2.2 Data, Information, Semantics

In this section, we give definitions of data, information and semantics before we

explore some research on semantic extraction in the following section.

According to a simplistic definition, data is the raw, unprocessed input to an

information system that produces the information as an output. A commonly accepted

definition states that data is a representation of facts, concepts or instructions in a

formalized manner suitable for communication, interpretation, or processing by humans

or by automatic means [2, 18]. Data mostly consists of disconnected numbers, words,

symbols, etc. and results from measurable events or objects.

Data has a value when it is processed, changed into a usable form and placed in a

context [2]. When data has a context and has been interpreted, it becomes information.

Then it can be used purposefully as information [1].

Semantics is the meaning and the use of data. Semantics can be viewed as a mapping

between an object stored in an information system and the real-world object it represents

[87].

2.3 Semantic Extraction

In this section, we first state the importance of semantic extraction and application

source code as a rich source for semantic extraction and then point out several related

research efforts in this research area.

Sheth et al. [87] stated that data semantics does not seem to have a purely

mathematical or formal model and cannot be discovered completely and fully automatically.

Therefore, the process of semantic discovery requires human involvement. Besides being

human-dependent, semantic extraction is a time-consuming and hence expensive task

[36]. Although it cannot be fully automated, discovering even a limited

amount of useful semantics can tremendously reduce the cost of understanding a system.










Semantics can be found from knowledge representation schemas, communication protocols,

and applications that use the data [87].

Throughout the discussions and research on semantic extraction, application source

code has been proposed as a rich source of information [30, 36, 87]. Besides, researchers

have agreed that the extraction of semantics from application source code is essential for

identification and resolution of semantic heterogeneity.

We use the discovered semantics from application source code to find correspondence

between schemas of disparate data sources automatically. In this context, discovering

semantics means gathering information about the data, so that a computer can identify

mappings (paths) between corresponding schema elements in different data sources.

Jim Ning et al. worked on extracting semantics from application source code but with

a slightly different aim. They developed an approach to identify and recover reusable code

components [67]. They investigated conditional statements, as we do, to find business

rules. They stated that conditional statements are potential business rules. They also gave

importance to input and output statements for highlighting semantics inside the code,

and stated that meaningful business functions normally process input values and produce

results. Jim Ning et al. referred to the investigation of input variables as forward slicing and to

the investigation of output statements as backward slicing. The drawback of their approach was

that it was very language-specific (Cobol) [67].

N. Ashish et al. worked on extracting semantics from internet information sources to

enable semi-automatic wrapper generation [5]. They used several heuristics to identify

important tokens and structures of HTML pages in order to create the specification for

a parser. Similar to our approach, they benefited from parser generation tools, namely

YACC [53] and LEX [59], for semantic extraction.

There are several related works on information extraction from text that deal with

tables and ontology extraction from tables. The most relevant work about information

extraction from HTML pages by the help of heuristics was done by Wang and Lochovsky










[94]. They aimed to form the schema of the data extracted from an HTML page by using

labels of a table on an HTML page. The heuristic that they use to relate labels to the

data and to separate data found in a table cell into several attributes is very similar to

our heuristics. For example, they assume that if several attributes are encoded into one

text string, then there should be some special symbol(s) in the string as the separator

to visually support users to distinguish the attributes. They also use heuristics to relate

labels to the data from an HTML page that are similar to our heuristics. Buttler et al.

[17] and Embley et al. [32] also developed heuristic-based approaches for information

extraction from HTML pages. However, their aim was to identify boundaries of data on

an HTML page. Embley et al. [33] also worked on table recognition from documents and

suggested a table ontology which is very similar to our report ontology. In a related work,

Tijerino et al. [90] introduced an information extracting system called TANGO which

recognizes tables based on a set of heuristics, forms mini-ontologies and then merges these

ontologies to form a larger application ontology.

2.4 Reverse Engineering

Without the understanding of the system, in other words without the accurate

documentation of the system, it is not possible to maintain, extend, and integrate the

system with other systems [76, 89, 95]. The methodology to reconstruct this missing

documentation is reverse engineering. In this section, we first give the definition of reverse

engineering in general and then give definitions of program reverse engineering and

database reverse engineering. We also state the importance of these tasks.

Reverse engineering is the process of analyzing a technology to learn how it was

designed or how it works. Chikofsky and Cross [19] defined reverse engineering as the

process of analyzing a subject system to identify the system's components and their

interrelationships and as the process of creating representations of the system in another

form or at a higher level of abstraction. Reverse engineering is an action to understand

the subject system and does not include the modification of it. The reverse of the reverse










engineering is forward engineering. Forward engineering is the traditional process of

moving from high-level abstractions and logical, implementation-independent designs to

the physical implementation of a system [19]. While reverse engineering starts from the

subject system and aims to identify the high-level abstraction of the system, forward

engineering starts from the specification and aims to implement the subject system.

Program (software) reverse engineering is recovering the specifications of the software

from source code [49]. The recovered specifications can be represented in forms such as

data flow diagrams, flow charts, specifications, hierarchy charts, call graphs, etc. [75]. The

purpose of program reverse engineering is to enhance our understanding of the software of

the system to reengineer, restructure, maintain, extend or integrate the system [49, 75].

Database Reverse Engineering (DBRE) is defined as identifying the possible

specification of a database implementation [22]. It mainly deals with schema extraction,

analysis and transformation [49]. Chikofsky and Cross [19] defined DBRE as a process

that aims to determine the structure, function and meaning of the data of an organization.

Hainaut [41] defined DBRE as the process of recovering the schema(s) of the database

of an application from data dictionary and program source code that uses the data. The

objective of DBRE is to recover the technical and conceptual descriptions of the database.

It is a prerequisite for several activities such as maintenance, reengineering, extension,

migration, integration. DBRE can produce an almost complete abstract specification

of an operational database while program reverse engineering can only produce partial

abstractions that can help better understand a program [22, 42].

Many data structures and constraints are embedded inside the source code of

data-oriented applications. If a construct or a constraint has not been declared explicitly

in the database schema, it is implemented in the source code of the application that

updates or queries the database. The data in the database is a result of the execution of

the applications of the organization [49]. Even though the data satisfies the constraints of

the database, it is verified with the validation mechanisms inside the source code before it










is written to the database to ensure that it does not violate the constraints. We

can discover some constraints, such as referential constraints, by analyzing the application

source code, even if the application program only queries the data but does not modify

it. For instance, if there exists a referential constraint (foreign key relation) between the

entity named E1 and entity named E2, this constraint is used to join the data of these two

entities with a query. We can discover this referential constraint by analyzing the query

[50]. Since program source code is a very useful source of information in which we can

discover a lot of implicit constructs and constraints, we use it as an information source for

DBRE.
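The idea of recovering a referential constraint from a join query can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the query text, the table names E1 and E2, the column names, and the function find_join_candidates are all hypothetical, and the regular expressions only handle a simple comma-separated FROM clause. An equality between columns of two different tables is reported as a candidate foreign key relation that an analyst or a later step would still need to confirm.

```python
import re

# Hypothetical query text, as it might appear embedded in application code.
QUERY = """
SELECT e1.name, e2.title
FROM E1 e1, E2 e2
WHERE e1.e2_id = e2.id AND e2.active = 1
"""

def find_join_candidates(sql):
    """Return (table, column, table, column) tuples for cross-table equalities."""
    aliases = {}
    from_clause = re.search(r"FROM\s+(.*?)\s+WHERE", sql, re.IGNORECASE | re.DOTALL)
    if from_clause:
        for item in from_clause.group(1).split(","):
            parts = item.split()
            if parts:
                # "E1 e1" maps alias e1 to table E1; a bare name maps to itself.
                aliases[parts[-1]] = parts[0]
    candidates = []
    # Look for predicates of the form alias.column = alias.column.
    for a1, c1, a2, c2 in re.findall(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", sql):
        if a1 != a2:
            candidates.append((aliases.get(a1, a1), c1, aliases.get(a2, a2), c2))
    return candidates

print(find_join_candidates(QUERY))
```

For the sample query, the only cross-table equality is between E1.e2_id and E2.id; the predicate on e2.active compares a column with a constant and is ignored.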

It is well known that the analysis of program source code is a complex and tedious

task. However, we do not need to recover the complete specification of the program

for DBRE. We are looking for information to enhance the schema and to find the

undeclared constraints of the database. In this process, we benefit from several program

understanding techniques to extract information effectively. We provide the definitions of

the program understanding and its techniques in the following section.

2.5 Program Understanding Techniques

In this section, we introduce the concept of program understanding and its techniques.

We have implemented these techniques to analyze application source code to extract

semantic information effectively.

Program understanding (a.k.a program comprehension) is the process of acquiring

knowledge about an existing, generally undocumented, computer program. The knowledge

acquired about the business processes through the analysis of the source code is accurate

and up-to-date because the source code is used to generate the application that the

organization uses.

The basic actions that can be taken to understand a program are to read its documentation,

to ask its users for assistance, to read its source code, or to run

the program to see what it outputs for specific inputs [50]. Besides these actions, there










are several techniques that we can apply to understand a program. These techniques

help the analyst to extract high-level information from low-level code to come to a better

understanding of the program. These techniques are mostly performed manually. However,

we apply these techniques in our semantic analyzer module to automatically extract

information from data-oriented applications. We show how we apply these techniques

in our semantic analyzer in Section 3.1.5. We describe the main program understanding

techniques in the following subsections.

2.5.1 Textual Analysis

One simple way to analyze a program is to search for a specific string in the program

source code. This searched string can be a pattern or a cliché. The program understanding

technique that searches for a pattern or a cliché is called pattern matching or cliché

recognition. A pattern can include wildcards and character ranges and can be based on other

defined patterns. A cliché is a commonly used programming pattern. Examples of clichés

are algorithmic computations, such as list enumeration and binary search, and common

data structures, such as priority queue and hash table [49, 97].
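As a small illustration of textual analysis, the sketch below searches source text line by line for a pattern. The Java-like source fragment, the pattern for output statements, and the function name find_output_labels are assumptions made for this example only; they are not taken from the prototype.

```python
import re

# Hypothetical fragment of application source code to be searched.
SOURCE = '''\
rs = stmt.executeQuery("SELECT name FROM course");
System.out.println("Course Name: " + rs.getString(1));
int total = a + b;
System.out.println("Total: " + total);
'''

# Pattern: an output call whose first argument starts with a string literal.
OUTPUT_PATTERN = re.compile(r'System\.out\.println\(\s*"([^"]*)"')

def find_output_labels(source):
    """Return (line number, label text) for every line matching the pattern."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = OUTPUT_PATTERN.search(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits

print(find_output_labels(SOURCE))
```

Such a search is cheap but knows nothing about program structure, which is why it is usually combined with the syntactic techniques described next.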

2.5.2 Syntactic Analysis

Syntactic analysis is performed by a parser that decomposes a program into

expressions and statements. The result of the parser is stored in a structure called an abstract

syntax tree (AST). An AST is a representation of source code that facilitates

the use of tree traversal algorithms, and it is the basis of most sophisticated program

analysis tools [49].
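To make the notion of an AST concrete, the following sketch parses a short program and traverses the resulting tree. It uses Python's standard ast module purely as a convenient, readily available parser; the sample program and the helper names statement_kinds and assigned_names are illustrative assumptions.

```python
import ast

# A tiny sample program to parse (illustrative only).
CODE = "total = price * quantity\nif total > 100:\n    discount = 10\n"

tree = ast.parse(CODE)

def statement_kinds(tree):
    """Names of the top-level statement node types, in source order."""
    return [type(node).__name__ for node in tree.body]

def assigned_names(tree):
    """Collect every variable that is the target of an assignment."""
    names = []
    for node in ast.walk(tree):  # generic traversal over all tree nodes
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names.append(target.id)
    return names

print(statement_kinds(tree))
print(sorted(assigned_names(tree)))
```

Once the program is available as a tree, questions such as "which variables are assigned" become simple traversals rather than string searches.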

2.5.3 Program Slicing

Program slicing is a technique to extract the statements from a program relevant to a

particular computation, specific behavior or interest such as a business rule [75]. The slice

of a program with respect to program point p and variable V consists of all statements

and predicates of the program that might affect the value of V at point p [96]. Program

slicing is used to reduce the scope of program analysis [49, 83]. The slice that affects the










value of V at point p is computed by gathering statements and control predicates by

way of a backward traversal of the program, starting at the point p. This kind of slice

is also known as backward slicing. When we retrieve statements that can potentially be

affected by the variable V starting from a point p, we call it forward slicing. Forward and

backward slicing are both types of static slicing because they use only statically available

information (source code) for computing.
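The two slicing directions can be sketched on a deliberately simplified program model. The example assumes straight-line code with no branches or loops, so control predicates are not handled, and each statement is reduced to the variable it defines and the variables it uses. The program, the tuple encoding, and the function names are hypothetical.

```python
# Each statement: (line number, variable defined, variables used).
PROGRAM = [
    (1, "a", set()),        # a = read()
    (2, "b", set()),        # b = read()
    (3, "c", {"a"}),        # c = a * 2
    (4, "d", {"b"}),        # d = b + 1
    (5, "e", {"c", "a"}),   # e = c + a
    (6, "f", {"d"}),        # f = d - 3
]

def backward_slice(program, point, variable):
    """Lines that may affect the value of `variable` at line `point`."""
    relevant = {variable}
    lines = []
    for line, defined, used in reversed(program):
        if line > point:
            continue
        if defined in relevant:
            lines.append(line)
            relevant.discard(defined)  # this definition supplies the value
            relevant |= used           # so its inputs become relevant
    return sorted(lines)

def forward_slice(program, point, variable):
    """Lines after `point` that may be affected by `variable`."""
    affected = {variable}
    lines = []
    for line, defined, used in program:
        if line <= point:
            continue
        if used & affected:
            lines.append(line)
            affected.add(defined)
        else:
            affected.discard(defined)  # overwritten with an unaffected value
    return lines

print(backward_slice(PROGRAM, 6, "e"))
print(forward_slice(PROGRAM, 1, "a"))
```

In this sample, statements 2, 4 and 6 never touch the computation of e, so the backward slice for e leaves them out, which is exactly the reduction in analysis scope that slicing is meant to provide.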

2.5.4 Program Representation Techniques

Program source code, even when reduced through program slicing, is often too difficult

to understand because the program can be huge, poorly structured, and based on poor

naming conventions. It is useful to represent the program in different abstract views such

as the call graph, data flow graph, etc. [49]. Most program reverse engineering tools

provide these kinds of visualization facilities. In the following sections, we present several

program representation techniques.

2.5.5 Call Graph Analysis

Call graph analysis is the analysis of the execution order of the program units or

statements. If it determines the order of the statements within a program then it is called

intra-procedural analysis. If it determines the calling relationship among the program

units, it is called inter-procedural analysis [49, 83].
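An inter-procedural call graph can be approximated as in the sketch below, which records which locally defined functions each function calls. The sample program and the function build_call_graph are assumptions for illustration, and Python's ast module stands in for a parser of the analyzed language; calls through attributes or to library functions are ignored.

```python
import ast

# Illustrative program with three program units.
CODE = """
def load():
    return [1, 2, 3]

def total(values):
    return sum(values)

def report():
    data = load()
    print(total(data))
"""

def build_call_graph(source):
    """Map each top-level function to the set of local functions it calls."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph = {name: set() for name in defined}
    for func in tree.body:
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            # Only direct calls by name to functions defined in this source.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                if node.func.id in defined:
                    graph[func.name].add(node.func.id)
    return graph

print(build_call_graph(CODE))
```

A function that is never reached from any entry point in such a graph is a candidate for the unused code that an analyzer may choose to skip.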

2.5.6 Data Flow Analysis

Data flow analysis is the analysis of the flow of the values from variables to variables

between the instructions of a program. The variables defined and the variables referenced

by each instruction, such as declaration, assignment and conditional, are analyzed to

compute the data flow [49, 83].
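The first step of data flow analysis, collecting the variables defined and referenced by each instruction, can be sketched as follows. The sample code and the function defs_and_uses are hypothetical, and only assignments and the conditions of if statements are inspected.

```python
import ast

# Illustrative code: two assignments and a conditional.
CODE = "x = 5\ny = x + 1\nif y > 3:\n    z = y\n"

def defs_and_uses(source):
    """Return (line, variables defined, variables referenced) per instruction."""
    rows = []
    for stmt in ast.walk(ast.parse(source)):
        if not isinstance(stmt, (ast.Assign, ast.If)):
            continue
        # For an if statement, only the condition is inspected here;
        # statements in its body are visited as separate instructions.
        scope = stmt.test if isinstance(stmt, ast.If) else stmt
        names = [n for n in ast.walk(scope) if isinstance(n, ast.Name)]
        defined = sorted({n.id for n in names if isinstance(n.ctx, ast.Store)})
        used = sorted({n.id for n in names if isinstance(n.ctx, ast.Load)})
        rows.append((stmt.lineno, defined, used))
    return sorted(rows)

for row in defs_and_uses(CODE):
    print(row)
```

Linking each referenced variable to the instruction that most recently defined it would turn these sets into the actual flow of values.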

2.5.7 Variable Dependency Graph

A variable dependency graph is a type of data flow graph where a node represents a

variable and an arc represents a relation (assignment, comparison, etc.) between two

variables. If there is a path from variable v1 to variable v2 in the graph, then there is a










sequence of statements such that the value of v1 is in relation with the value of v2. If the

relation is an assignment statement then the arc in the diagram is directed. If the relation

is a comparison statement then the arc is not directed [49, 83].
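A variable dependency graph of this kind can be sketched with directed arcs for assignments and undirected arcs for comparisons. The sample code, the function names, and the choice to follow only directed arcs when searching for a path are assumptions of this example.

```python
import ast

# Illustrative code: v1 flows to v2, v2 flows to v3, and v3 is compared to limit.
CODE = "v2 = v1 + 1\nv3 = v2\nif v3 > limit:\n    flag = 1\n"

def build_dependency_graph(source):
    """Return (directed arcs, undirected arcs) between variables."""
    directed = set()    # (source variable, target variable) from assignments
    undirected = set()  # frozenset({a, b}) from comparisons
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            used = {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            for target in node.targets:
                if isinstance(target, ast.Name):
                    directed |= {(u, target.id) for u in used}
        elif isinstance(node, ast.Compare):
            names = sorted({n.id for n in ast.walk(node) if isinstance(n, ast.Name)})
            for i, a in enumerate(names):
                for b in names[i + 1:]:
                    undirected.add(frozenset((a, b)))
    return directed, undirected

def has_path(directed, start, goal):
    """Depth-first search along directed (assignment) arcs only."""
    frontier, seen = [start], {start}
    while frontier:
        current = frontier.pop()
        if current == goal:
            return True
        for src, dst in directed:
            if src == current and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return False

directed, undirected = build_dependency_graph(CODE)
print(directed, undirected, has_path(directed, "v1", "v3"))
```

Here the path from v1 to v3 reflects the two assignments through which the value of v1 reaches v3, while the comparison only relates v3 and limit without giving a direction.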

2.5.8 System Dependence Graph

System dependence graph is a type of data flow graph that also handles procedures

and procedure calls. A system dependence graph represents the passing of values between

procedures. When procedure P calls procedure Q, the values of the parameters are transferred

from P to Q and when Q returns, the return value is transferred back to P [49].

2.5.9 Dynamic Analysis

The program understanding techniques described so far are performed on the source

code of the program and are forms of static analysis. Dynamic analysis is the process of gaining

increased understanding of a program by systematically executing it [83].

2.6 Visitor Design Patterns

We applied the above program understanding techniques in our semantic analyzer

program. We implemented our semantic analyzer by using visitor patterns. In this section,

we explain what a visitor pattern is and the rationale for using it.

A Visitor Design Pattern is a behavioral design pattern [38], which is used to

encapsulate the functionality that we desire to perform on the elements of a data

structure. It gives the flexibility to change the operation being performed on a structure

without the need to change the classes of the elements on which the operation is

performed. Our goal is to build semantic information extraction techniques that can

be applied to any source code and can be extended with new algorithms. The visitor

design pattern is the key object-oriented technique to reach this goal. New

operations over the object structure can be defined simply by adding a new visitor. Visitor

classes localize related behavior in the same visitor and unrelated sets of behavior are

partitioned in their own visitor subclasses. If the classes defining the object structure, in

our case the grammar production rules of the programming language, rarely change, but









new operations over the structure are often defined, a visitor design pattern is the perfect

choice [13, 71].
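The structure of the pattern can be sketched as follows. The node classes stand in for grammar productions and the two visitors for independent analyses; all class names, the sample program, and the label-collecting behavior are hypothetical and are not taken from the semantic analyzer itself.

```python
class Node:
    def accept(self, visitor):
        # Double dispatch: call the visitor method named after the node class.
        return getattr(visitor, "visit_" + type(self).__name__)(self)

class Assignment(Node):
    def __init__(self, target, value):
        self.target, self.value = target, value

class OutputStatement(Node):
    def __init__(self, label, variable):
        self.label, self.variable = label, variable

class Program(Node):
    def __init__(self, statements):
        self.statements = statements

class Visitor:
    """Default traversal; subclasses override only the methods they need."""
    def visit_Program(self, node):
        for statement in node.statements:
            statement.accept(self)
    def visit_Assignment(self, node):
        pass
    def visit_OutputStatement(self, node):
        pass

class LabelCollector(Visitor):
    """One operation: relate output labels to the variables they describe."""
    def __init__(self):
        self.labels = {}
    def visit_OutputStatement(self, node):
        self.labels[node.variable] = node.label

class AssignmentCounter(Visitor):
    """A second operation, added without touching the node classes."""
    def __init__(self):
        self.count = 0
    def visit_Assignment(self, node):
        self.count += 1

program = Program([
    Assignment("total", "a + b"),
    OutputStatement("Total:", "total"),
    Assignment("x", "1"),
])
collector, counter = LabelCollector(), AssignmentCounter()
program.accept(collector)
program.accept(counter)
print(collector.labels, counter.count)
```

Adding a new analysis means writing one more Visitor subclass, while Program, Assignment and OutputStatement stay unchanged, which mirrors the case where grammar rules rarely change but new operations are defined often.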

2.7 Ontology

An ontology represents a common vocabulary describing the concepts and relationships

for researchers who need to share information in a domain [40, 69]. It includes machine

interpretable definitions of basic concepts in the domain and relations among them.

Ontologies enable the definition and sharing of domain-specific vocabularies. They

are developed to share common understanding of the structure of information among

people or software agents, to enable reuse of domain knowledge, and to analyze domain

knowledge [69].

According to a commonly quoted definition, an ontology is a formal, explicit

specification of a shared conceptualization [40]. For a better understanding, Michael

Uschold et al. define the terms in this definition as follows [92]: A conceptualization is an

abstract model of how people think about things in the world. An explicit specification

means the concepts and relations in the abstract model are given explicit names and

definitions. Formal means that the meaning specification is encoded in a language whose

formal properties are well understood. Shared means that the main purpose of an ontology

is generally to be used and reused across different applications.

2.8 Web Ontology Language (OWL)

The Web Ontology Language (OWL) is a semantic markup language for publishing

and sharing ontologies on the World Wide Web [64]. OWL is derived from the DAML+OIL

Web Ontology Language. DAML+OIL was developed as a joint effort of researchers who

initially developed DAML (DARPA Agent Markup Language) and OIL (Ontology

Inference Layer or Ontology Interchange Language) separately.

OWL is designed for processing and reasoning about information by computers

instead of just presenting it on the Web. OWL supports more machine interpretability

than XML (Extensible Markup Language), RDF (the Resource Description Framework),

and RDF-S (RDF Schema) by providing additional vocabulary along with a formal

semantics.

Formal semantics allows us to reason about the knowledge. We may reason about

class membership, equivalence of classes, and consistency of the ontology for unintended

relationships between classes and classify the instances in classes. RDF and RDF-S

can be used to represent ontological knowledge. However, it is not possible to use all

reasoning mechanisms with RDF and RDF-S because they lack features such

as disjointness of classes, boolean combinations of classes, cardinality restrictions, etc.

[4]. When all these features are added to RDF and RDF-S to form an ontology language,

the language becomes very expressive. However, reasoning over it becomes inefficient. For this

reason, OWL comes in three different flavors: OWL-Lite, OWL-DL, and OWL Full.

The entire language is called OWL Full, and it uses all the OWL language primitives.

It also allows these primitives to be combined in arbitrary ways with RDF and RDF-S. The price of

this expressiveness is that reasoning in OWL Full can be undecidable. OWL DL (OWL

Description Logic) is a sublanguage of OWL Full. It includes all OWL language constructs

but restricts the ways in which constructors from OWL and RDF can be used. This makes

the computations in OWL-DL complete (all conclusions are guaranteed to be computable)

and decidable (all computations will finish in finite time). Therefore, OWL-DL supports

efficient reasoning. OWL Lite limits OWL-DL to a subset of constructors (for example,

OWL Lite excludes enumerated classes, disjointness statements, and arbitrary cardinality),

making it less expressive. However, it may be a good choice for hierarchies needing simple

constraints [4, 64].

OWL provides an infrastructure that allows a machine to make the same sorts of

simple inferences that human beings do. A set of OWL statements by itself (and the

OWL spec) can allow you to conclude another OWL statement, whereas a set of XML

statements by itself (and the XML spec) does not allow you to conclude any other

XML statements. For example, the statements (motherOf subProperty parentOf) and (Nedret

motherOf Oguzhan), when stated in OWL, allow you to conclude (Nedret parentOf

Oguzhan) based on the logical definition of subProperty as given in the OWL spec.

Another advantage of using OWL ontologies is the availability of tools such as Racer, Fact

and Pellet that can reason about them. A reasoner can also help us to understand if we

could accurately extract data and description elements from the report. For instance, we

can define a rule such as 'No data or description elements can overlap' and use a

reasoner to check whether the OWL ontology satisfies this rule.
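
The subProperty inference described above can be imitated with a few lines of Python. This is only a toy forward-chaining sketch over plain (subject, property, object) tuples, written under the assumption that subProperty is the only rule of interest; it is not an OWL reasoner and does not use the API of Racer, Fact, or Pellet.

```python
def infer_subproperty(triples):
    """Return the triples plus everything implied by subProperty statements."""
    # (child, parent) pairs such as ("motherOf", "parentOf")
    sub = {(s, o) for s, p, o in triples if p == "subProperty"}
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new statement can be derived
        changed = False
        for s, p, o in list(inferred):
            for child, parent in sub:
                if p == child and (s, parent, o) not in inferred:
                    inferred.add((s, parent, o))
                    changed = True
    return inferred


facts = {
    ("motherOf", "subProperty", "parentOf"),
    ("Nedret", "motherOf", "Oguzhan"),
}
closure = infer_subproperty(facts)
```

With the two statements from the text as input, the closure also contains (Nedret parentOf Oguzhan), which is exactly the conclusion that a set of XML statements alone would not license.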

2.9 WordNet

WordNet is an online database which aims to model the lexical knowledge of a

native speaker of English.1 It is designed to be used by computer programs. WordNet

links nouns, verbs, adjectives, and adverbs to sets of synonyms [66]. A set of synonyms

represents the same concept and is known as a synset in WordNet terminology. For

example, the concept of a 'child' may be represented by the set of words: 'kid', 'youngster',

'tiddler', 'tike'. A synset also has a short definition or description of the real world concept

known as a 'gloss' and has semantic pointers that describe relationships between the

current synset and other synsets. The semantic pointers can be of a number of different

types including hyponym / hypernym (is-a / has a), meronym / holonym (part-of /

has-part), etc. A list of semantic pointers is given in Table 2-1.2 WordNet can also be

seen as a large graph or semantic network. Each node of the graph represents a synset and

each edge of the graph represents a relation between synsets. Many of the approaches for

measuring similarity of words use the graphical structure of WordNet [15, 72, 79, 80].

Since the development of WordNet for English by the researchers of Princeton

University, many WordNets for other languages have been developed, such as Danish

(DanNet), Persian (PersiaNet), Italian (ItalWordnet), etc. There has also been research to



1 WordNet 2.1 defines 155,327 words of English

2 Table is adapted from [72]









Table 2-1. List of relations used to connect senses in WordNet.
Relation        Description                        Example
Hypernym        is a generalization of             furniture is a hypernym of chair
Hyponym         is a kind of                       chair is a hyponym of furniture
Troponym        is a way to                        amble is a troponym of walk
Meronym         is part / substance / member of    wheel is a (part) meronym of a bicycle
Holonym         contains part                      bicycle is a holonym of a wheel
Antonym         opposite of                        ascend is an antonym of descend
Attribute       attribute of                       heavy is an attribute of weight
Entailment      entails                            ploughing entails digging
Cause           cause to                           to offend causes to resent
Also see        related verb                       to lodge is related to reside
Similar to      similar to                         dead is similar to assassinated
Participle of   is participle of                   stored (adj) is the participle of to store
Pertainym of    pertains to                        radial pertains to radius


align WordNets of different languages. For instance, EuroWordNet [93] is a multilingual

lexical knowledge base that links WordNets of different languages (e.g., Dutch, Italian,

Spanish, German, French, Czech and Estonian). In EuroWordNet, the WordNets are

linked to an Inter-Lingual-Index which interconnects the languages so that we can go from

the synsets in one language to corresponding synsets in other languages.

While WordNet is a database which aims to model a person's knowledge about a

language, another research effort, Cyc [57] (derived from En-cyc-lopedia), aims to model

a person's everyday common sense. Cyc formalizes common sense knowledge (e.g., 'You

cannot remember events that have not happened yet', 'You have to be awake to eat', etc.)

in the form of a massive database of axioms.

2.10 Similarity

Similarity is an important subject in many fields such as philosophy, psychology, and

artificial intelligence. Measures of similarity or relatedness are used in various applications

such as word sense disambiguation, text summarization and annotation, information

extraction and retrieval, automatic correction of word errors in text, and text classification

[15, 21]. Understanding how humans assess similarity is important to solve many of the

problems of cognitive science such as problem solving, categorization, memory retrieval,

inductive reasoning, etc. [39].

Similarity of two concepts refers to how many features they have in common and

how many features differ between them. Lin [60] provides an information theoretic definition

of similarity by clarifying the intuitions and assumptions about it. According to Lin,

the similarity between A and B is related to their commonality and their difference.

Lin assumes that the commonality between A and B can be measured according to

the information they contain in common (I(common(A, B))). In information theory,

the information contained in a statement is measured by the negative logarithm of the

probability of the statement (I(common(A, B)) = -log P(A ∩ B)). Lin also assumes

that if we know the description of A and B, we can measure the difference by subtracting

the commonality of A and B from the description of A and B. Hence, Lin states that

the similarity between A and B, sim(A, B) is a function of their commonalities and

descriptions. That is, sim(A, B) = f(I(common(A, B)), I(description(A, B))).

We also come across the term 'semantic relatedness' while dealing with similarity.

Semantic relatedness is a more general concept than similarity and refers to the degree

to which two concepts are related [72]. Similarity is one aspect of semantic relatedness.

Two concepts are similar if they are related in terms of their likeness (e.g., child and kid).

However, two concepts can be related in terms of functionality or frequent association even

though they are not similar (e.g., instructor and student, Christmas and gift).

2.11 Semantic Similarity Measures of Words

In this section, we provide a review of semantic similarity measures of words in the

literature. This review is not meant to be a complete list of the similarity measures but

covers most of the prominent ones. Most of the measures below use

the hierarchical structure of WordNet.

2.11.1 Resnik Similarity Measure

Resnik [79] provided a similarity measure based on the is-a hierarchy of WordNet

and the statistical information gathered from a large corpus of text. Resnik used the

statistical information from the corpus to measure the information content.

According to information theory, the information content of a concept c can be

quantified as -log P(c), where P(c) is the probability of encountering concept c. This

formula tells us that as probability increases, informativeness decreases; so the more

abstract a concept, the lower its information content. In order to calculate the probability

of a concept, Resnik first computed the frequency of occurrence of every concept in a large

corpus of text. Every occurrence of a concept in the corpus adds to the frequency of the

concept and to the frequency of every concept subsuming the concept encountered. Based

on this computation, the formula for the information content is:

$$P(c) = \frac{\mathrm{freq}(c)}{\mathrm{freq}(r)}$$

$$ic(c) = -\log P(c)$$

$$ic(c) = -\log\left(\frac{\mathrm{freq}(c)}{\mathrm{freq}(r)}\right)$$

where r is the root node of the taxonomy and c is the concept.

According to Resnik, the more information two concepts have in common, the more

similar they are. The information shared by two concepts is indicated by the information

content of the concepts that subsume them in the taxonomy. The formula of the Resnik

similarity measure is:

$$\mathrm{sim}_{RES}(c_1, c_2) = \max_{c}\,[-\log P(c)]$$

where c is a concept that subsumes both c1 and c2.

One of the drawbacks of the Resnik measure is that it depends completely on

the information content of the concept that subsumes the two concepts whose similarity

we measure. It does not take the two concepts themselves into account. For this reason,

different pairs of concepts that have the same subsumer receive the same

similarity values.
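
As a worked illustration of the frequency propagation and of the drawback just mentioned, the following Python sketch computes the Resnik measure on a six-concept taxonomy. The taxonomy, the concept names, and the corpus counts are hypothetical and chosen only to keep the arithmetic small; a real computation would use WordNet and a large corpus.

```python
import math

# Hypothetical is-a taxonomy: concept -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "chair": "artifact",
}
ROOT = "entity"

# Hypothetical number of direct occurrences of each concept in a corpus.
COUNTS = {"entity": 0, "animal": 2, "artifact": 1, "dog": 5, "cat": 4, "chair": 8}


def subsumers(c):
    """Return c followed by all of its ancestors up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def propagate_frequencies():
    """Each occurrence counts for the concept and for every concept subsuming it."""
    freq = {c: 0 for c in PARENT}
    for c, n in COUNTS.items():
        for s in subsumers(c):
            freq[s] += n
    return freq


FREQ = propagate_frequencies()


def ic(c):
    """Information content: -log P(c), with P(c) = freq(c) / freq(root)."""
    return -math.log(FREQ[c] / FREQ[ROOT])


def sim_resnik(c1, c2):
    """Maximum information content over all concepts subsuming both c1 and c2."""
    common = set(subsumers(c1)) & set(subsumers(c2))
    return max(ic(c) for c in common)
```

In this toy data, sim_resnik("dog", "cat") and sim_resnik("dog", "animal") are equal because both pairs share the subsumer animal, which shows why the measure cannot distinguish pairs with the same subsumer.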









2.11.2 Jiang-Conrath Similarity Measure

Jiang and Conrath [52] address the limitations of the Resnik measure. Their measure uses

both the information content of the two concepts and the information content of their

lowest common subsumer to compute the similarity of two concepts. The measure is a

distance measure that specifies the extent of unrelatedness of two concepts. The formula

of the Jiang and Conrath measure is:

$$\mathrm{distance}_{JCN}(c_1, c_2) = ic(c_1) + ic(c_2) - 2 \cdot ic(LCS(c_1, c_2))$$

where ic determines the information content of a concept, and LCS determines the lowest

common subsuming concept of two given concepts. However, this measure works only with

WordNet nouns.

2.11.3 Lin Similarity Measure

Lin [60] introduced a similarity measure between concepts based on his theory of

similarity between arbitrary objects. To measure the similarity, Lin uses the information

content of the two concepts being compared and the information content of their

lowest common subsumer. The formula of the Lin measure is:

$$\mathrm{sim}_{LIN}(c_1, c_2) = \frac{2 \cdot \log P(c_0)}{\log P(c_1) + \log P(c_2)}$$

where c0 is the lowest common concept that subsumes both c1 and c2.
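
The Jiang and Conrath distance of Section 2.11.2 and the Lin similarity above combine the same three quantities, so they can be sketched together. The probabilities and the lowest-common-subsumer table below are hypothetical values assumed for illustration; in practice they would be estimated from a corpus and read from the WordNet hierarchy.

```python
import math

# Hypothetical concept probabilities P(c).
P = {"entity": 1.0, "animal": 0.55, "dog": 0.25, "cat": 0.20}

# Hypothetical lowest common subsumers for unordered concept pairs.
LCS = {frozenset(["dog", "cat"]): "animal"}


def ic(c):
    """Information content of a concept."""
    return -math.log(P[c])


def lcs(c1, c2):
    """A concept is its own lowest common subsumer when compared with itself."""
    return c1 if c1 == c2 else LCS[frozenset([c1, c2])]


def distance_jcn(c1, c2):
    """Jiang-Conrath distance: larger values mean less related concepts."""
    return ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))


def sim_lin(c1, c2):
    """Lin similarity: ratio of shared information to total information."""
    return 2 * math.log(P[lcs(c1, c2)]) / (math.log(P[c1]) + math.log(P[c2]))
```

Note the opposite orientation of the two measures: identical concepts have a Jiang and Conrath distance of 0 and a Lin similarity of 1.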

2.11.4 Intrinsic IC Measure in WordNet

Seco et al. [85] advocate that WordNet can also be used as a statistical resource with

no need for external corpora to compute the information content of a concept.

They assume that the taxonomic structure of WordNet is organized in a meaningful

and principled way, where concepts with many hyponyms3 convey less information than

concepts that are leaves. They provide the formula for information content as follows:

$$ic_{WN}(c) = \frac{\log\left(\frac{hypo(c) + 1}{max_{wn}}\right)}{\log\left(\frac{1}{max_{wn}}\right)} = 1 - \frac{\log(hypo(c) + 1)}{\log(max_{wn})}$$

In this formula, the function hypo returns the number of hyponyms of a given concept

and max_wn is the maximum number of concepts that exist in the taxonomy.
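
A small Python sketch shows how the intrinsic measure behaves at the two extremes of a taxonomy. The hyponym counts below are hypothetical and describe a six-concept toy taxonomy, not WordNet itself.

```python
import math

# Hypothetical number of hyponyms (all descendants) for each concept.
HYPONYMS = {"entity": 5, "animal": 2, "artifact": 1, "dog": 0, "cat": 0, "chair": 0}

# Total number of concepts in the toy taxonomy (max_wn in the formula).
MAX_WN = len(HYPONYMS)


def ic_wn(c):
    """Intrinsic information content: 1 - log(hypo(c) + 1) / log(max_wn)."""
    return 1 - math.log(HYPONYMS[c] + 1) / math.log(MAX_WN)
```

Leaves such as dog obtain the maximum value of 1, while the root, which subsumes every other concept, obtains 0; no corpus statistics are needed.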

2.11.5 Leacock-Chodorow Similarity Measure

Rada et al. [77] were the first to measure semantic relatedness based on the length

of the path between two concepts in a taxonomy. Rada et al. measured semantic relatedness of

medical terms, using a medical taxonomy called MeSH. According to this measurement,

given a tree-like structure of a taxonomy, the number of links between two concepts is

counted and they are considered more related if the length of the path between them is

shorter.

Leacock and Chodorow [56] applied this approach to measure semantic relatedness of two

concepts using WordNet. The measure counts the shortest path between two concepts in

the taxonomy, and scales it by the depth of the taxonomy:

$$\mathrm{related}_{LCH}(c_1, c_2) = -\log\left(\frac{\mathrm{shortestpath}(c_1, c_2)}{2 \cdot D}\right)$$

In the formula, c1 and c2 represent the two concepts, and D is the maximum depth of the

taxonomy.4

One weakness of the measure is that it assumes every link has the same size or weight.

However, concept pairs a single link apart lower down in the hierarchy are more related




3 hyponym: a word that is more specific than a given word.

4 For WordNet 1.7.1, the value of D is 19.









than such pairs higher up in the hierarchy. Another limitation of the measure is that it

considers only is-a links and only noun hierarchies.
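
The path-based computation can be sketched in Python on a toy is-a hierarchy. The taxonomy is hypothetical, and the sketch assumes that path length is counted in nodes (so a concept is at distance 1 from itself), which avoids taking the logarithm of zero; other conventions are possible.

```python
import math

# Hypothetical is-a taxonomy: concept -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "chair": "artifact",
}


def ancestors(c):
    """Return c followed by all of its ancestors up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def shortest_path(c1, c2):
    """Number of nodes on the is-a path from c1 to c2 (assumed convention)."""
    up1, up2 = ancestors(c1), ancestors(c2)
    for steps_up, a in enumerate(up1):
        if a in up2:
            return steps_up + up2.index(a) + 1
    raise ValueError("concepts share no subsumer")


# Maximum depth of the taxonomy, also counted in nodes.
D = max(len(ancestors(c)) for c in PARENT)


def related_lch(c1, c2):
    """Leacock-Chodorow relatedness: -log(path / (2 * D))."""
    return -math.log(shortest_path(c1, c2) / (2 * D))
```

Because every link weighs the same, the pair dog and cat is scored only by its three-node path, regardless of how deep in the hierarchy the pair sits; this is the weakness noted above.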

2.11.6 Hirst-St.Onge Similarity Measure

Hirst and St.Onge's [51] measure of semantic relatedness is based on the idea that two

concepts are semantically close if their WordNet synsets are connected by a path that is

not too long and that does not change direction too often [15, 72].

The Hirst-St.Onge measure considers all the relations defined in WordNet. All links in

WordNet are classified as Upward (e.g., part-of), Downward (e.g., subclass) or Horizontal

(e.g., opposite-meaning). They also describe three types of relations between words:

extra-strong, strong, and medium-strong.

The strength of the relationship is given by:

$$\mathrm{rel}_{HS}(c_1, c_2) = C - \text{path length} - k \times d$$


where d is the number of changes of direction in the path, and C and k are constants;

if no such path exists, the strength of the relationship is zero and the concepts are

considered unrelated.

2.11.7 Wu and Palmer Similarity Measure

The Wu and Palmer [98] measure computes similarity in terms of the depth of the two

concepts in the WordNet taxonomy and the depth of the lowest common subsumer (LCS):

$$\mathrm{sim}_{WUP}(c_1, c_2) = \frac{2 \cdot depth(LCS)}{depth(c_1) + depth(c_2)}$$
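
A depth-based computation of this kind can be sketched as follows. The taxonomy is hypothetical, and the sketch assumes that the root has depth 1; with a root depth of 0 any pair whose only common subsumer is the root would score 0.

```python
# Hypothetical is-a taxonomy: concept -> parent (the root has parent None).
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "chair": "artifact",
}


def ancestors(c):
    """Return c followed by all of its ancestors up to the root."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain


def depth(c):
    """Depth counted in nodes, so the root has depth 1 (assumed convention)."""
    return len(ancestors(c))


def lcs(c1, c2):
    """Lowest common subsumer: the first ancestor of c1 that also subsumes c2."""
    up2 = set(ancestors(c2))
    for a in ancestors(c1):
        if a in up2:
            return a
    raise ValueError("concepts share no subsumer")


def sim_wup(c1, c2):
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(c1) + depth(c2))."""
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))
```

On this toy data, dog and cat share the subsumer animal at depth 2 and score 2/3, whereas dog and chair only share the root and score 1/3.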

2.11.8 Lesk Similarity Measure

Lesk [58] defines relatedness as a function of dictionary definition overlaps of concepts.

He describes an algorithm that disambiguates words based on the extent of overlaps of

their dictionary definitions with those of words in the context. The sense of the target

word with the maximum overlaps is selected as the assigned sense of the word.
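
The idea of selecting the sense with the largest definition overlap can be sketched in Python. This is a simplified variant that compares each candidate gloss directly with the words of the context, rather than with the dictionary definitions of the context words as in Lesk's original algorithm; the sense labels and glosses are hypothetical.

```python
# Hypothetical sense inventory for the target word "bank".
SENSES = {
    "bank#finance": "an institution that accepts deposits and lends money",
    "bank#river": "sloping land beside a body of water such as a river",
}


def overlap(gloss, context_words):
    """Number of distinct words shared by a gloss and the context."""
    return len(set(gloss.lower().split()) & context_words)


def simplified_lesk(senses, context):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = set(context.lower().split())
    return max(senses, key=lambda sense: overlap(senses[sense], context_words))


river_sense = simplified_lesk(SENSES, "she sat on the land beside the river")
money_sense = simplified_lesk(SENSES, "he went to deposit money")
```

A practical implementation would also remove stop words and stem the tokens, since exact token matching misses pairs such as deposit and deposits.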










Table 2-2. Absolute values of the coefficients of correlation between human ratings of
similarity and the five computational measures.
Measure                 Miller & Charles    Rubenstein & Goodenough
Hirst and St-Onge       .744                .786
Jiang and Conrath       .850                .781
Leacock and Chodorow    .816                .838
Lin                     .829                .819
Resnik                  .774                .779


2.11.9 Extended Gloss Overlaps Similarity Measure

Banerjee and Pedersen [9, 72] provided a measure by adapting Lesk's measure

to WordNet. Their measure is called 'the extended gloss overlaps measure' and takes not

only the two concepts that are being measured into account but also the concepts related

to the two concepts through WordNet relations. An extended gloss of a concept c1 is

prepared by adding the glosses of concepts that are related to c1 through a WordNet

relation r. The relatedness of two concepts c1 and c2 is computed from the

overlaps of their extended glosses.

2.12 Evaluation of WordNet-Based Similarity Measures

Budanitsky and Hirst [16] evaluated six different metrics using WordNet and listed

the coefficients of correlation between the metrics and human ratings according to the

experiments conducted by Miller & Charles [65] and Rubenstein & Goodenough [82]. We

present the results of Budanitsky & Hirst's experiments in Table 2-2. According to this

evaluation, the Jiang and Conrath metric [52] and the Lin metric [60] are

among the best measures. As a result, we use the Jiang and Conrath and the Lin

semantic similarity measures to assign similarity scores between text strings.

2.13 Similarity Measures for Text Data

Several approaches have been used to assess a similarity score between texts. One

of the simplest methods is to assess a similarity score based on the number of lexical

units that occur in both text segments. Several processes such as stemming, stop-word

removal, longest subsequence matching, and weighting factors can be applied to this method

for improvement. However, these lexical matching methods are not enough to identify the

semantic similarity of texts. One of the attempts to identify semantic similarity between

texts is the latent semantic analysis method (LSA)5 [55], which aims to measure similarity

between texts by including additional related words. LSA is successful to some extent but

has not been used on a large scale, due to the complexity and computational cost of its

algorithm.

Corley and Mihalcea [21] introduced a metric for text-to-text semantic similarity by

combining word-to-word similarity metrics. To assess a similarity score for a text pair,

they first create separate sets for nouns, verbs, adjectives, adverbs, and cardinals for each

text. Then they determine pairs of similar words across the sets in the two text segments.

For nouns and verbs, they use a semantic similarity metric based on WordNet, and for other

word classes they use lexical matching techniques. Finally, they sum up the similarity

scores of similar word pairs. This bag-of-words approach improves significantly over the

traditional lexical matching metrics. However, as they acknowledge, a metric of text

semantic similarity should take into account the relations between words in a text.

In another approach to measuring semantic similarity between documents, Aslam

and Frost [6] assume that a text is composed of a set of independent term features and

employ Lin's [60] metric for measuring similarity of objects that can be described by a

set of independent features. The similarity of two documents in a pile of documents can be

calculated by the following formula:

$$\mathrm{sim}_{IT}(a, b) = \frac{2 \sum_t \min(P_{a:t}, P_{b:t}) \log P(t)}{\sum_t P_{a:t} \log P(t) + \sum_t P_{b:t} \log P(t)}$$

where probability P(t) is the fraction of corpus documents containing term t, P_{b:t} is

the fractional occurrence of term t in document b (\sum_t P_{b:t} = 1), and two documents a




5 URL of LSA: http://lsa.colorado.edu/










and b share min(P_{a:t}, P_{b:t}) amount of term t in common, while they contain P_{a:t} and

P_{b:t} amount of term t individually.
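
The formula can be exercised on a small corpus with the following Python sketch. Whitespace tokenization, the four example documents, and the convention of returning 0 when the denominator vanishes are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_fractions(doc):
    """Fractional occurrence of each term in a document (values sum to 1)."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def sim_it(a, b, corpus):
    """Information-theoretic document similarity in the style of the formula above."""

    def p(term):
        # Fraction of corpus documents that contain the term.
        return sum(1 for d in corpus if term in d.split()) / len(corpus)

    pa, pb = term_fractions(a), term_fractions(b)
    shared = pa.keys() & pb.keys()
    numerator = 2 * sum(min(pa[t], pb[t]) * math.log(p(t)) for t in shared)
    denominator = (sum(f * math.log(p(t)) for t, f in pa.items())
                   + sum(f * math.log(p(t)) for t, f in pb.items()))
    # Assumed convention: documents made only of terms found everywhere score 0.
    return numerator / denominator if denominator else 0.0


CORPUS = [
    "schema matching survey",
    "ontology similarity measure",
    "schema integration survey",
    "course grade report",
]
score = sim_it(CORPUS[0], CORPUS[2], CORPUS)
```

Both a and b are assumed to belong to the corpus, so every term has a non-zero document frequency. Rare shared terms contribute more than common ones because log P(t) is larger in magnitude for them.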

Another approach by Oleshchuk and Pedersen [70] uses ontologies as a filter before

assigning similarity scores to texts. They interpret a text based on an ontology and find

out how many of the terms (concepts) of an ontology exist in a text. They assign a

similarity score for text t1 and text t2 after comparing the ontology O1 extracted from

t1 based on the ontology O and the ontology O2 extracted from t2 based on the same

ontology O. The base ontology acts as a context filter to texts and depending on the base

ontology used, texts may or may not be similar.

2.14 Similarity Measures for Ontologies

Rodriguez and Egenhofer [81] suggest assessing semantic similarity among entity

classes from different ontologies based on a matching process that uses information about

common and different characteristic features of ontologies based on their specifications.

The similarity score of two entities from different ontologies is the weighted sum of

similarity scores of components of compared entities. Similarity scores are independently

measured for three components of an entity. These components are 'set of synonyms',

'set of semantic relations', and 'set of distinguishing features' of the entity. They further

suggest classifying the distinguishing features into 'functions', 'parts', and 'attributes'

where 'functions' represents what is done to or with an instance of that entity, 'parts' are

structural elements of an entity such as leg or head of a human body, and 'attributes' are

additional characteristics of an entity such as age or hair color of a person.

Rodriguez and Egenhofer point out that if compared entities are related to the same

entities, they may be semantically similar. Thus, they interpret comparing semantic










relations as comparing semantic neighborhoods of entities.6 The formula of overall

similarity between entity a of ontology p and entity b of ontology q is as follows:

$$S(a^p, b^q) = w_w \cdot S_w(a^p, b^q) + w_u \cdot S_u(a^p, b^q) + w_n \cdot S_n(a^p, b^q)$$

where S_w, S_u, and S_n are the similarity between synonym sets, features, and semantic

neighborhoods, and w_w, w_u, and w_n are the respective weights, which add up to 1.0.

While calculating a similarity score for each component of an entity, they also take

non-common characteristics into account. The similarity of a component is measured by

the following formula:

$$S(a, b) = \frac{|A \cap B|}{|A \cap B| + \alpha(a, b)\,|A \setminus B| + (1 - \alpha(a, b))\,|B \setminus A|}$$

where A and B are the description sets of a and b for that component, and \alpha is a

function that defines the relative importance of the non-common

characteristics. They calculate \alpha in terms of the depth of the entities in their ontologies.
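
The two formulas can be illustrated with a short Python sketch. Here alpha is passed in as a plain number, whereas Rodriguez and Egenhofer derive it from the depth of the entities; the feature sets, the weights, and the function names are hypothetical.

```python
def component_similarity(A, B, alpha):
    """Set-based similarity of one component (synonyms, features, or neighborhood)."""
    common = len(A & B)
    denominator = common + alpha * len(A - B) + (1 - alpha) * len(B - A)
    # Assumed convention: two empty description sets have similarity 0.
    return common / denominator if denominator else 0.0


def overall_similarity(s_w, s_u, s_n, w_w, w_u, w_n):
    """Weighted sum of the three component similarities; weights must sum to 1."""
    if abs(w_w + w_u + w_n - 1.0) > 1e-9:
        raise ValueError("weights must add up to 1.0")
    return w_w * s_w + w_u * s_u + w_n * s_n


parts_a = {"leg", "head", "arm", "hand"}
parts_b = {"leg", "head", "tail"}
s_parts = component_similarity(parts_a, parts_b, alpha=1.0)
total = overall_similarity(1.0, s_parts, 0.0, w_w=0.5, w_u=0.3, w_n=0.2)
```

With alpha = 1 only the characteristics that a has and b lacks are penalized; with alpha = 0 only those that b has and a lacks are, which makes the measure asymmetric in general.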

Maedche and Staab [63] suggest measuring similarity of ontologies at two levels:

lexical and conceptual. At the lexical level, they use an edit-distance measure to find

similarity between two sets of terms (concepts or relations) that form the ontologies.

While measuring similarity at the conceptual level, they take all super- and sub-concepts

of two concepts from two different ontologies into account.

According to Ehrig et al. [31], comparing ontologies should go far beyond comparing

the representation of the entities of the ontologies and should take their relation to the

real world entities into account. For this, Ehrig et al. suggest a general framework

for measuring similarities of ontologies which consists of four layers: data-, ontology-,

context-, and domain knowledge layer. In the data layer, they compare data values by




6 The semantic neighborhood of an entity class is the set of entity classes whose
distance to the entity class is less than or equal to a non-negative integer.










using generic similarity functions such as edit distance for strings. In the ontology layer,

they consider semantic relations by using the graph structure of the ontology. In the

context layer, they compare the usage patterns of entities in ontology-based applications.

According to Ehrig et al., if two entities are used in the same (related) context then they

are similar. They also propose to integrate the domain knowledge layer into any of the three layers

as needed. Finally, they arrive at an overall similarity function which incorporates all layers

of similarity.

Euzenat and Valtchev [34, 35] proposed a similarity measure for OWL-Lite ontologies.

Before measuring similarity, they first transform an OWL-Lite ontology to an OL-graph

structure. Then, they define similarity between nodes of the OL-graphs depending on the

category and the features (e.g., relations) of the nodes. They combine the similarities of

features by a weighted sum approach.

A similar work by Bach and Dieng-Kuntz [8] proposes a measure for comparing

OWL-DL ontologies. Different from Euzenat and Valtchev's work, Bach and Dieng-Kuntz

adjust the manually assigned feature weights of an OWL-DL entity dynamically in case

they do not exist in the definition of the entity.

2.15 Evaluation Methods for Similarity Measures

There are three kinds of approaches for evaluating similarity measures [15]. These

are evaluation by theoretical examination (e.g., Lin [60]), evaluation by comparison with human

judgments, and evaluation by calculating the performance within a particular application.

Evaluation by comparison with human judgments has been used by many

researchers such as Resnik [79], and Jiang and Conrath [52]. Most of the researchers refer

to the same experiment on human judgment to evaluate their performance due to the

expense and difficulty of arranging such an experiment. This experiment was conducted

by Rubenstein and Goodenough [82] and a later replication of it was done by Miller

and Charles [65]. Rubenstein and Goodenough had human subjects assign degrees of

synonymy, on a scale from 0 to 4, to 65 pairs of carefully chosen words. Miller and Charles

repeated the experiment on a subset of 30 word pairs of the 65 pairs used by Rubenstein

and Goodenough. Rubenstein and Goodenough used 15 subjects for scoring the word pairs

and the average of these scores was reported. Miller and Charles used 38 subjects in their

experiments.

Rodriguez and Egenhofer also used human judgments to evaluate the quality of

their similarity measure for comparing different ontologies [81]. They used Spatial Data

Transfer Standard (SDTS) ontology, WordNet ontology, WS ontology (created from the

combination of WordNet and SDTS) and subsets of these ontologies. They conducted two

experiments. In the first experiment, they compare different combinations of ontologies to

have a diverse grade of similarity between ontologies. These combinations include identical

ontologies (WordNet to WordNet), ontology and sub-ontology (WordNet to WordNet's

subset), overlapping ontologies (WordNet to WS), and different ontologies (WordNet

to SDTS). In the second experiment, they asked human subjects to rank similarity of

an entity to other selected entities based on the definitions in WS ontology. Then, they

compared average of human rankings with the rankings based on their similarity measure

using different combinations of ontologies.

Evaluation by calculating the performance within a particular application is another

approach for the evaluation of similarity measurement metrics. Budanitsky and Hirst [15]

used this approach to evaluate the performance of their metric within an NLP application,

malapropisms.7 Patwardhan [72] also used this approach to evaluate his metric within the

word sense disambiguation8 application.




7 Malapropisms: The unintentional misuse of a word by confusion with one that sounds
similar.

8 Word Sense Disambiguation: It is the problem of selecting the most appropriate
meaning or sense of a word, based on the context in which it occurs.










2.16 Schema Matching

Schema matching produces a mapping between elements of two schemas that

correspond to each other [78]. When we match two schemas S and T, we decide if any

element or elements of S refer to the same real-world concept of any element or elements

of T [28]. The match operation over two schemas produces a mapping. A mapping is a

set of mapping elements. Each mapping element indicates that certain element(s) in S are

mapped to certain element(s) in T. A mapping element can have a mapping expression

which specifies how schema elements are related. A mapping element can be defined as

a 5-tuple: (id, e, e', n, R), where id is the unique identifier, e and e' are schema elements

of matching schemas, n is the confidence measure (usually in the [0,1] range) between the

schema elements e and e', and R is a relation (e.g., equivalence, mismatch, overlapping) [88].
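
The 5-tuple translates naturally into a small record type. The following Python sketch is one possible encoding; the field name e_prime (for e'), the example element names, and the helper confident_equivalences are hypothetical and serve only to show how a mapping, as a set of mapping elements, might be filtered by confidence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MappingElement:
    id: str        # unique identifier of the mapping element
    e: str         # element of schema S
    e_prime: str   # element of schema T (e' in the text)
    n: float       # confidence measure, assumed here to lie in [0, 1]
    R: str         # relation, e.g. "equivalence", "mismatch", "overlapping"

    def __post_init__(self):
        if not 0.0 <= self.n <= 1.0:
            raise ValueError("confidence n must lie in the [0, 1] range")


def confident_equivalences(mapping: List[MappingElement], threshold: float):
    """Keep the equivalence elements whose confidence reaches the threshold."""
    return [m for m in mapping if m.R == "equivalence" and m.n >= threshold]


mapping = [
    MappingElement("m1", "S.Course.title", "T.Class.name", 0.9, "equivalence"),
    MappingElement("m2", "S.Course.credits", "T.Class.units", 0.6, "equivalence"),
    MappingElement("m3", "S.Course.id", "T.Class.room", 0.8, "mismatch"),
]
accepted = confident_equivalences(mapping, threshold=0.7)
```

Keeping the relation R separate from the confidence n lets a matcher report a confident mismatch as well as a tentative equivalence.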

Schema matching has many application areas, such as data integration, data

warehousing, semantic query processing, agent communication, web services integration,

catalog matching, and P2P databases [78, 88]. The match operation is mostly done

manually. Manually generating the mapping is a tedious, time-consuming, error-prone,

and expensive process. There is a need to automate the match operation. This would be

possible if we can discover the semantics of schemas, make the implicit semantics explicit

and represent them in a machine processable way.

2.16.1 Schema Matching Surveys

Schema matching is a very well-researched topic in the database community. Erhard

Rahm and Philip Bernstein provide an excellent survey on schema matching approaches

by reviewing previous works in the context of schema translation and integration,

knowledge representation, machine learning and information retrieval [78]. In their survey,

they clarify terms such as match operation, mapping, mapping element, and mapping

expression in the context of schema matching. They also introduce application areas of

schema matching such as schema integration, data warehouses, message translation, and

query processing.










The most significant contribution of their survey is the classification of schema

matching approaches, which helps in understanding the schema matching problem. They

consider a wide range of classification criteria such as instance-level vs schema-level,

element vs structure, linguistic-based vs constraint-based, matching cardinality, using

auxiliary data (e.g., dictionaries, previous mappings, etc.), and combining different

matchers (e.g., hybrid, composite). However, it is very rare that one approach falls under

only one leaf of the classification tree presented in that survey. A schema matching

approach needs to exploit all the possible inputs to achieve the best possible result, and

needs to combine matchers either in a hybrid way or in a composite way. For this reason,

most of the approaches use more than one technique and fall under more than one leaf

of the classification tree. For example, our approach uses auxiliary data (i.e., application

source code) and uses linguistic similarity techniques (e.g., name and description),

constraint based techniques (e.g., type of the related schema element) on the data as well.

A recent survey by Anhai Doan and Alon Halevy [28] classifies matching techniques

under two main groups: rule-based and learning-based solutions. Our approach falls under

the rule-based group which is relatively inexpensive and does not require training. Anhai

Doan and Alon Halevy also describe challenges of schema matching. They point out that

since data sources become legacy (poorly documented), schema elements are typically

matched based on schema and data. However, the clues gathered by processing the schema

and data are often unreliable, incomplete and not sufficient to determine the relationships

among schema elements. Our approach aims to overcome this fundamental challenge by

analyzing reports for more reliable, complete and sufficient clues.

Anhai Doan and Alon Halevy also state that schema matching becomes more

challenging because matching approaches must consider all the possible matching

combinations between schemas to make sure there is no better mapping. Considering

all the possible combinations increases the cost of the matching process. Our approach










helps us overcome this challenge by focusing on a subset of schema elements that are

used on a report pair.

Another challenge that Anhai Doan and Alon Halevy state is the subjectivity of

the matching. This means the mapping depends on the application and may change in

different applications even though the underlying schemas are the same. By analyzing

report generating application source code, we believe we produce more objective results.

Anhai Doan and Alon Halevy's survey also adds two more application areas of schema

matching to the application areas mentioned in Erhard Rahm and Philip Bernstein's survey. These

application areas are peer data management and model management.

A more recent survey by Pavel Shvaiko and Jérôme Euzenat [88] points out new

application areas of schema matching such as agent communication, web service

integration and catalog matching. In their survey, Pavel Shvaiko and Jérôme Euzenat

consider only schema-based approaches, not instance-based approaches, and provide

a new classification tree by building on the previous work of Erhard Rahm and Philip

Bernstein. They interpret the classification of Erhard Rahm and Philip Bernstein and

provide two new classification trees based on granularity and kinds of input with added

nodes to the original classification tree of Erhard Rahm and Philip Bernstein. Finally,

Hong-Hai Do summarizes recent advances in the field in his dissertation [25].

2.16.2 Evaluations of Schema Matching Approaches

Approaches to the schema matching problem evaluate their systems

using a variety of methodologies, metrics, and data, which are not usually publicly available.

This makes it hard to compare these approaches. However, there have been works to

benchmark the effectiveness of a set of schema matching approaches [26, 99].

Hong Hai Do et al. [26] specify four comparison criteria. These criteria are kind

of input (e.g., schema information, data instances, dictionaries, and mapping rules),

match results (e.g., matching between schema elements, nodes or paths), quality measures

(metrics such as recall, precision and f-measure) and effort (e.g., pre- and post-match










efforts for training of learners, dictionary preparation and correction). Mikalai Yatskevich

in his work [99] compares the approaches based on the criteria stated in [26] and adds time

measures as a fifth criterion.
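The quality measures named above have standard definitions in terms of the set of proposed correspondences and a manually determined reference set. The following Java sketch shows one way to compute them. The class name MatchQuality and the encoding of a correspondence as a string are assumptions made for illustration; they are not taken from [26] or [99].

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: match quality measures over sets of correspondences. */
class MatchQuality {

    /** Number of proposed correspondences that also appear in the reference set. */
    static int truePositives(Set<String> proposed, Set<String> reference) {
        Set<String> common = new HashSet<>(proposed);
        common.retainAll(reference);
        return common.size();
    }

    /** Fraction of proposed correspondences that are correct. */
    static double precision(Set<String> proposed, Set<String> reference) {
        return proposed.isEmpty() ? 0.0
                : (double) truePositives(proposed, reference) / proposed.size();
    }

    /** Fraction of reference correspondences that were found. */
    static double recall(Set<String> proposed, Set<String> reference) {
        return reference.isEmpty() ? 0.0
                : (double) truePositives(proposed, reference) / reference.size();
    }

    /** Harmonic mean of precision and recall. */
    static double fMeasure(Set<String> proposed, Set<String> reference) {
        double p = precision(proposed, reference);
        double r = recall(proposed, reference);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }
}
```

A correspondence could, for example, be encoded as a string such as "S1.Name=S2.Instructor"; any encoding works as long as equal correspondences compare equal.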

Hong Hai Do et al. only use the information available in the publications describing

the approaches and their evaluation. In contrast, Mikalai Yatskevich provides real-time

evaluations of matching prototypes, rather than reviewing the results presented in the

papers. Mikalai Yatskevich compares only three approaches (COMA [24], Cupid [62] and

Similarity Flooding (SF) [86]) and concludes that COMA performs the best on the large

schemas and Cupid is the best for small schemas. Hong Hai Do et al. provide a broader

comparison by reviewing six approaches (Automatch [10], COMA [24], Cupid [62], LSD

[27], Similarity Flooding (SF) [86], SemInt).

2.16.3 Examples of Schema Matching Approaches

In the rest of this section, we review some of the significant approaches for schema

matching and describe their similarities to and differences from our approach. We review the LSD,

Corpus-based, COMA and Cupid approaches below.

The LSD (Learning Source Descriptions) approach [27] uses machine-learning

techniques to match data sources to a global schema. The idea of LSD is that after

a training period of determining mappings between data sources and global schema

manually, the system should learn from previous mappings and successfully propose

mappings for new data sources. The LSD system is a composite matcher; that is, it

combines the results of several independently executed matchers. LSD consists of

several learners (matchers). Each learner can exploit different types of characteristics

of the input data such as name similarities, format, and frequencies. Then the predictions

of different learners are combined. The LSD system is extensible since it has independently

working learners (matchers). When new learners are developed, they can be added to the

system to enhance the accuracy. The extensibility of the LSD system is similar to the

extensibility of our system because we can also add new visitor patterns to our system to










extract more information to enhance the accuracy. The LSD approach is also similar to our

approach in that it reaches a final decision by combining several results

coming from different learners. We also combine several results, which come from matching

the ontologies of report pairs, to reach a final decision. The LSD approach is a learning-based

solution and requires training, which makes it relatively expensive because of the initial

manual effort. However, our approach needs no initial effort other than collecting the relevant

report-generating source code.

One of the distinguished approaches that uses external evidence is the Corpus-based

Schema Matching approach [43, 61]. Our approach is similar to Corpus-based Schema

Matching in the sense that we also utilize external data rather than solely depending

on matching schemas and their data. The Corpus-based schema matching approach

constructs a knowledge base by gathering relevant knowledge from a large corpus of

database schemas and previous validated mappings. This approach identifies interesting

concepts and patterns in a corpus of schemas and uses this information to match

two unseen schemas. However, learning from the corpus and extracting patterns is a

challenging task. This approach also requires initial effort to create a corpus of interest

and then requires tuning effort to eliminate useless schemas and to add useful schemas.

The COMA (COmbination of MAtching algorithms) approach [24] is a composite

schema matching approach. It develops a platform to combine multiple matchers in a

flexible way. It provides an extensible library of matching algorithms and a framework

to combine obtained results. The COMA approach has been superior to other systems

in the evaluations [26, 99]. The COMA++ [7] approach improves the COMA approach

by supporting schemas and ontologies written in different languages (i.e., SQL, W3C

XSD and OWL) and by bringing new match strategies such as fragment-based matching

and reuse-oriented matching. The fragment-based approach follows the divide-and-conquer

idea and decomposes a large schema into smaller subsets aiming to achieve better match

quality and execution time with the reduced problem size and then merges the results of










matching fragments into a global match result. Our approach also considers matching

small subsets of a schema that are covered by reports and then merging these match

results into a global match result as described in Chapter 3.

The Cupid approach [62] combines linguistic and structural matchers in a hybrid way.

It is both element-based and structure-based. It also uses dictionaries as auxiliary data. It aims

to provide a generic solution across data models and uses XML and relational examples.

The structural matcher of Cupid transforms the input into a tree structure and assesses

a similarity value for a node based on the node's linguistic similarity value and the

similarity values of its leaves.

2.17 Ontology Mapping

Ontology mapping is determining which concepts and properties of two ontologies

represent similar notions [68]. Several other terms are relevant to ontology mapping

and are sometimes used interchangeably with the term mapping. These are alignment,

merging, articulation, fusion, and integration [54]. The result of ontology mapping is used

in application domains similar to those of schema matching, such as data transformation, query

answering, and web services integration [68].

2.18 Schema Matching vs. Ontology Mapping

Schema matching and ontology mapping are similar problems [29]. However, ontology

mapping generally aims to match richer structures. Generally, ontologies have more

constraints on their concepts and have more relations among these concepts. Another

difference is that a schema often does not provide explicit semantics for its data while

an ontology is a system that itself contains semantics either intuitively or formally [88].

The database community deals with the schema matching problem and the AI community

deals with the ontology mapping problem. We can perhaps fill the gap between these

similar yet separately studied subjects.










CHAPTER 3
APPROACH

In Chapter 1, we stated the need for rapid, flexible, limited-time collaborations among

organizations. We also underlined that organizations need to integrate their information

sources to exchange data in order to collaborate effectively. However, integrating

information sources is currently a labor-intensive activity because of nonexistent or

outdated machine-processable documentation of the data source. We defined legacy

systems as information systems with poor or nonexistent documentation in Section

2.1. Integrating legacy systems is tedious, time-consuming and expensive because the

process is mostly manual. To automate the process, we need to develop methodologies to

automatically discover semantics from electronically available information sources of the

underlying legacy systems.

In this chapter, we state our approach for extracting semantics from legacy systems

and for using these semantics for the schema matching process of information source

integration. We developed our approach in the context of the SEEK (Scalable Extraction

of Enterprise Knowledge) project. As we show in Figure 3-1, the Semantic Analyzer

(SA) takes the output of Schema Extractor (SE), schema of the data source, and the

application source code or report templates as input. After the semantic analysis process,

SA stores its output, extracted semantic information, in a repository which we call the

knowledgebase of the organization. Then, the Schema Matcher (SM) uses this knowledgebase

as an input and produces mapping rules as an output. Finally, these mapping rules will be

an input to the Wrapper Generator (WG), which produces source wrappers. In Section 3.1, we

first state our approach for semantic extraction using SA. Then, in Section 3.2, we show

how we utilize the semantics discovered by SA in the subsequent schema matching phase.

The schema matching phase is followed by the wrapper generation phase which is not

described in this dissertation.









[Figure: the Data Reverse Engineering (DRE) module, comprising Schema Extraction (SE) and Semantic Analysis (SA), takes the data source and source code of Organization A and fills the knowledgebase of Organization A; Schema Matching (SM) uses the knowledgebases of Organizations A, B, C, and D to produce mappings, and the Wrapper Generator (WG) produces source wrappers.]


Figure 3-1. Scalable Extraction of Enterprise Knowledge (SEEK) Architecture.

3.1 Semantic Analysis

Our approach to semantic analysis is based on the observation that application source

code can be a rich source for semantic information about the data source it is accessing.

Specifically, semantic knowledge extracted from application source code frequently
contains information about the domain-specific meanings of the data or the underlying

schema elements. According to these observations, for example, application code usually

has embedded queries, and the data retrieved or manipulated by queries is stored in
variables and displayed to the end user in output statements. Many of these output










statements contain additional semantic information usually in the form of descriptive

text or markup [36, 84, 87]. These output statements become semantically valuable

when they are used to communicate with the end-user in a formatted way. One way of

communicating with the end-user is producing reports. Reports and other user-oriented

output, which are typically generated by report generators or application source code,

do not use the names of schema elements directly but rather provide more descriptive

names for the data to make the output more comprehensible to the users. We claim that

these descriptive names together with their formatting instructions can be extracted

from the application code generating the report and can be related to the underlying

schema elements in the data source. We can trace the variables used in output statements

throughout the application code and relate the output with the query that retrieves data

from the data source and indirectly with the schema elements. These descriptive texts

and formatting instructions are valuable information that helps discover the semantics of

the schema elements. In the next subsection, we explain this idea using illustrative

examples.

3.1.1 Illustrative Examples

In this section, we illustrate our idea of semantic extraction with two simple examples.

On the left hand side of Figure 3-2, we see a relation and its attributes from a relational

database schema. By looking at the names of the relation and its attributes, it is hard to

understand what kind of information this relation and its attributes store. For example,
this relation can be used for storing information about 'courses' or 'instructors'. The

attribute Name can hold information about 'course names' or 'instructor names'. Without

any further knowledge of the schema, we would probably not be able to understand the

full meaning of these schema items in the relation 'CourseInst'. However, we can gather

information about the semantics of these schema items by analyzing the application source

code that uses these schema items.












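[Figure: the relation CourseInst with the attributes Num1, Name, Num2, and Loc (left); a search screen with labeled input fields, including 'Instructor name', and a Search button (right).]

Figure 3-2. Schema used by an application.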


Let us assume we have access to the application source code that outputs the search

screen shown on the right hand side of Figure 3-2. Upon investigation of the code,

the semantic analyzer (SA) encounters output statements of the form 'Instructor Name'

and 'Course Code'. SA also encounters input statements that expect input from the

user next to the output texts. Using program understanding techniques, SA finds out

that inputs are used with certain schema elements in a 'where clause' to form a query

to return the desired tuples from the database. SA first relates the output statements

containing descriptive text (e.g., 'Instructor Name') with the input statements located

next to the output statements on the search screen shown in Figure 3-2. SA then traces

input statements back to the 'where clause' and finds their corresponding schema elements

in the database. Hence, SA relates the descriptive text with the schema elements. For

example, if SA relates the output statement 'Instructor Name' to the 'Name' schema element

of relation 'CourseInst', then we can conclude that the 'Name' schema element of the relation

'CourseInst' stores information about the 'Instructor Name'.

Let us look at another example. Figure 3-3 shows a report R1 using the schema

elements from the schema S1. Let us assume that we have access to the application source

code that generates the report shown in Figure 3-3. The schema element names in S1 are

non-descriptive. However, our semantic analyzer can gather valuable semantic information

by analyzing the source code. SA first traces the descriptive column header texts back

to the schema elements that fill in the data of that column. Then, SA relates descriptive










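[Figure: schema S1 with the relations Schedule and CourseInst (attributes Num1, Name, Num2, Loc), and report R1, 'Course Listings', with the columns Course, Title, Instructor, Time, and Prerequisite; arrows relate the column headers to the schema elements that supply their data. Sample rows list CIS 105 (Introduction to Comp. Sci., Berger) and CIS 201 (Discrete Math., Taylor).]

Figure 3-3. Schema used by a report.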

column header texts with the schema elements (red arrows). After that, we can draw conclusions

about the semantics of the schema elements. For example, we can conclude that the Name

schema element of the relation CourseInst stores information about 'Instructors'.

3.1.2 Conceptual Architecture of Semantic Analyzer

SA is embedded in the Data Reverse Engineering (DRE) module of the SEEK

prototype together with the Schema Extractor (SE) component. As Figure 3-4 illustrates,

the SE component in the DRE connects to the data source with a call-level interface (e.g.,

JDBC) and extracts the schema of the data source. The SA component enhances this

schema with the pieces of evidence found about the semantics of the schema elements from

the application source code or from the report design templates.

3.1.2.1 Abstract syntax tree generator (ASTG)

We show the components of the Semantic Analyzer (SA) in Figure 3-5. The Abstract

Syntax Tree Generator (ASTG) accepts application source code to be analyzed, parses











[Figure: within the Data Reverse Engineering (DRE) module, Schema Extraction (SE) reads the data source of A and passes schemas to Semantic Analysis (SA), which also reads the report design templates and source code of A and writes to the knowledgebase of Organization A.]


Figure 3-4. Conceptual view of the Data Reverse Engineering (DRE) module of the
Scalable Extraction of Enterprise Knowledge (SEEK) prototype.


it and produces the abstract syntax tree of the source code. An Abstract Syntax Tree

(AST) is an alternative representation of the source code for more efficient processing.

Currently, the ASTG is configured to parse application source code written in Java. The

ASTG can also parse SQL statements embedded in the Java source code and HTML

code extracted from the Java Servlet source code. However, we aim to parse and extract

semantic information from source code written in any programming language. To reach

this aim, we use a state-of-the-art parser generation tool, JavaCC, to build the ASTG.

We explain how we build the ASTG so that it becomes extensible to other programming

languages in Section 3.1.3.



Figure 3-5. Conceptual view of Semantic Analyzer (SA) component.










3.1.2.2 Report template parser (RTP)

We also extract semantic information from another electronically available information

source, namely report design templates. A report design template includes

information about the design of a report and is typically represented in XML. When

a report generation tool, such as Eclipse BIRT or JasperReports, runs a report design

template, it retrieves data from the data source and presents it to the end user according

to the specification in the report design template. Parsing report design templates

yields valuable semantic information about the schema elements.

The Report Template Parser (RTP) component of SA is used to parse report design

templates. Our current semantic analyzer is configured to parse report templates designed

with Eclipse BIRT.¹ We show an example of a report design template in Figure 3-6 and the

resulting report when this template was run in Figure 3-7.



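[Figure: a report design template titled 'Computer Science Department, Spring 2004 Schedule', with a table whose column headers (including Time, Day, Place, and Instructor) sit above bracketed data bindings such as [Hour], [Time], and [Loc].]


Figure 3-6. Report design template example.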


3.1.2.3 Information extractor (IEx)

The outputs of ASTG and RTP are the inputs for the Information Extractor (IEx)

component of SA. The IEx, shown in Figure 3-5, is the component where we apply several

heuristics to relate descriptive text in application source code with the schema elements in




1 http://www.eclipse.org/birt/





[Figure: the generated report, titled 'Computer Science Department, Spring 2004 Schedule', with the columns Course, Time, Day, Place, and Instructor.]

Figure 3-7. Report generated when the above template was run.

database by using program understanding techniques. Specifically, the IEx first identifies

the output statements. Then, it identifies texts in the output statements and variables

related with these output texts. The IEx relates the output text with the variables with the

help of several heuristics described in Section 3.1.5. The IEx traces the variables related

with the output text to the schema elements from which it retrieves data.



[Figure: an HTML page of Computer Science course listings.]


Figure 3-8. Java Servlet generated HTML report showing course listings of CALTECH.


The IEx can extract information from Java application source code that communicates

with the user through the console. The IEx can also extract information from Java Servlet

application source code. A Servlet is a Java application that runs on the Web server and

responds to client requests by generating HTML pages dynamically. A Servlet generates

an HTML page through the output statements embedded inside the Java code. After the IEx

analyzes the Java Servlet, it identifies the output statements that output HTML code. It

also identifies the schema elements from which the data on the HTML page is retrieved.

As an intermediate step, the IEx produces the HTML page that the Servlet would produce

with the schema element names instead of the data. An example of the output HTML

page generated by the IEx after analyzing a Java Servlet is shown in Figure 3-9. The Java

Servlet output that was analyzed by the IEx is shown in Figure 3-8. This example is taken

from the THALIA integration benchmark and shows course offerings in the Computer Science

department of the California Institute of Technology (CALTECH). The reader can notice

that the data on the report in Figure 3-8 is replaced in Figure 3-9 with the schema element

names from which the data is retrieved. Next, the IEx analyzes the annotated HTML

page shown in Figure 3-9 and extracts semantic information from this page.


[Figure: the course listing page with its data replaced by the schema element names Schedule.Code, Schedule.Name, CourseInst.Name, Schedule.Time, and Schedule.Loc.]

Figure 3-9. Annotated HTML page generated by analyzing a Java Servlet.


The IEx has been implemented using visitor design pattern classes. We explain the

benefits of using visitor design patterns in Section 3.1.3. The IEx applies several program

understanding techniques such as program slicing, data flow analysis and call graph

analysis [49] in visitor design pattern classes. We describe these techniques in Section

3.1.4.

The IEx also extracts semantic information from report design templates. The IEx

uses heuristics seven through eleven, described in Section 3.1.5, while analyzing the

report design templates. Extracting information from report design templates is

easier than extracting information from application source code because the report design

templates are represented in XML and are more structured.

3.1.2.4 Report ontology writer (ROW)

The Report Ontology Writer (ROW) component of SA writes the semantic information

gathered into report ontology instances represented in the OWL language. We explain the

design details of the report ontology in Section 3.2.3. These report ontology instances

form the knowledgebase of the data source being analyzed.

3.1.3 Extensibility and Flexibility of Semantic Analyzer

Our current semantic analyzer is configured to extract information from application

source code written in Java. We chose the Java programming language because it is

one of the dominant programming languages in enterprise information systems.

However, we want our semantic analyzer to be able to process source code written

in any programming language to extract semantic information about the data of the

legacy system. For this reason, we need to develop our semantic analyzer in a way that

is easily extensible to other programming languages. To reach this aim, we leverage

state-of-the-art techniques and recent research on code reverse engineering, abstract syntax

tree generation and object-oriented programming to develop a novel approach for semantic

extraction from source code. We describe our extensible semantic analysis approach in

detail in this section.

To analyze application source code, we need a parser for the grammar of the

programming language of the source code. This parser is used to generate Abstract Syntax

Tree (AST) of the source code. An AST is a type of representation of source code that










facilitates the usage of tree traversal algorithms. For programmers, writing a parser for

the grammar of a programming language has always been a complex, time-consuming, and

error-prone task. Writing a parser becomes more complex when the number of production

rules of the grammar increases. It is not easy to write a robust parser for Java, which has

many production rules [91].² We focus on extracting semantic information from a legacy

system's source code, not on writing a parser. For this reason, we chose a state-of-the-art

parser generation tool to produce our Java parser. We use JavaCC³ to automatically

generate a parser by using the specification files from the JavaCC repository.⁴ JavaCC

can be used to generate parsers for any grammar. We also utilize JavaCC to generate a

parser for SQL statements that are embedded inside the Java source code and for HTML

code that is embedded inside the Java Servlet code. By using JavaCC, we can extend SA

to make it capable of parsing other programming languages with little effort.

The Information Extractor (IEx) component of SA is composed of several visitor

design patterns. Visitor Design Patterns give the flexibility to change the operation

being performed on a structure without the need to change the classes of the elements

on which the operation is performed [38]. Our goal is to build semantic information

extraction techniques that can be applied to any source code and can be extended with

new algorithms. By using visitor design patterns [71], we do not embed the functionality

of the information extraction inside the classes of the Abstract Syntax Tree Generator (ASTG).

This separation lets us focus on the information extraction algorithms. We can maintain

the operations being performed whenever necessary. Moreover, new operations over the

data structure can be defined simply by adding a new visitor [13].
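The separation described above can be sketched in Java as follows. This is a minimal illustration and not the SA prototype: the node types OutputStatement and Assignment and the two visitors are hypothetical, and the node classes that JavaCC generates for a real grammar would look different.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical AST node; real generated node classes would differ. */
interface AstNode {
    void accept(AstVisitor visitor);
}

/** One visit method per node type; each new operation is a new implementation. */
interface AstVisitor {
    void visit(OutputStatement node);
    void visit(Assignment node);
}

final class OutputStatement implements AstNode {
    final String text;      // descriptive text in the output statement
    final String variable;  // variable printed together with the text
    OutputStatement(String text, String variable) {
        this.text = text;
        this.variable = variable;
    }
    public void accept(AstVisitor visitor) { visitor.visit(this); }
}

final class Assignment implements AstNode {
    final String target;
    final String source;
    Assignment(String target, String source) {
        this.target = target;
        this.source = source;
    }
    public void accept(AstVisitor visitor) { visitor.visit(this); }
}

/** Operation 1: collect the variables used in output statements. */
final class SlicingVariableCollector implements AstVisitor {
    final List<String> slicingVariables = new ArrayList<>();
    public void visit(OutputStatement node) { slicingVariables.add(node.variable); }
    public void visit(Assignment node) { /* not relevant to this operation */ }
}

/** Operation 2: added later without changing any node class. */
final class AssignmentCounter implements AstVisitor {
    int count;
    public void visit(OutputStatement node) { /* not relevant to this operation */ }
    public void visit(Assignment node) { count++; }
}
```

Adding a third analysis means writing a third AstVisitor implementation; OutputStatement and Assignment stay untouched.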




2 There are over 80 production rules in the Java language according to the Java
Grammar that we obtained from the JavaCC Repository

3 JavaCC: https://javacc.dev.java.net/

4 JavaCC repository: http://www.cobase.cs.ucla.edu/pub/javacc/










3.1.4 Application of Program Understanding Techniques in SA

We have introduced program understanding techniques in Section 2.5. In this section,

we present how we apply these techniques in SA. SA has two components as shown in

Figure 3-5. The input of the Information Extractor (IEx) component is an abstract syntax

tree (AST). The AST is the output of our Abstract Syntax Tree Generator (ASTG) which

is actually a parser. As mentioned in Section 2.5, processing the source code by a parser

to produce an AST is one of the program understanding techniques known as Syntactic

Analysis [49]. We perform the rest of the program understanding techniques on the AST

by using the visitor design pattern classes of the IEx.

One of the program understanding techniques we apply is Pattern Matching [49]. We

wrote a visitor class that looks for certain patterns inside the source code. These patterns,

such as input/output statements, are stored in a class structure, and new patterns can be

simply added into this class structure as needed. The visitor class that searches for these

patterns identifies the variables in the input/output statements as slicing variables. For

instance, the variable V in Table 3-5 is identified as a slicing variable since it is used in

an output statement. Program Slicing [75] is another program understanding technique

mentioned in Section 2.5. We analyze all the statements affecting a variable that is used in

an output statement. This technique is also known as backward slicing.

SA also applies the Call Graph Analysis technique [83]. SA produces an inter-procedural

call graph of the source code and analyzes only methods that exist in this graph. Starting

from a specific method (e.g., the main method of a Java stand-alone class or the

doGet method of a Java Servlet), SA traverses all possible methods that can be executed

at run-time. In this way, SA avoids analyzing unused methods. These methods can reflect

old functionality of the system, and analyzing them can lead to incorrect, misleading

information. An example of an inter-procedural call graph of a program source code is

shown in Figure 3-10. SA does not analyze method of Class1, method of Class2, and

method3 of Class3 since they are never called from inside other methods.
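The restriction to reachable methods can be illustrated with a simple graph traversal. The sketch below assumes that the call graph is already available as a map from each method name to the methods it calls; this representation, the class name CallGraphReachability, and the method names in the usage note are hypothetical and do not come from the SA prototype.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: find the methods worth analyzing from an entry point. */
class CallGraphReachability {

    /** Returns every method reachable from entryPoint by following call edges. */
    static Set<String> reachable(Map<String, List<String>> calls, String entryPoint) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(entryPoint);
        while (!work.isEmpty()) {
            String method = work.pop();
            if (!seen.add(method)) {
                continue; // already visited; also guards against recursion cycles
            }
            List<String> callees = calls.getOrDefault(method, Collections.<String>emptyList());
            for (String callee : callees) {
                work.push(callee);
            }
        }
        return seen;
    }
}
```

With edges such as Servlet.doGet -> Dao.returnList and Legacy.oldReport -> Dao.connect, a traversal that starts at Servlet.doGet never visits Legacy.oldReport, so that method would be skipped by later analysis steps.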





[Figure: call graph over the methods of Class1, Class2, and Class3.]


Figure 3-10. Inter-procedural call graph of a program source code.


The Data Flow Analysis technique [83] is another program understanding technique

that we implemented in the IEx by visitor design patterns. As mentioned in Section

2.5, Data Flow Analysis is the analysis of the flow of the values of variables to variables.

SA analyzes the data flow in the variable dependency graphs (i.e., flow of data between

variables). SA analyzes assignment statements and makes necessary changes in the values

stored in the symbol table of the class being analyzed.

SA also analyzes the data flow in the system dependency graphs (i.e., flow of data

between methods). SA analyzes method calls and initializes the values of method variables

by actual parameters in the method call and transfers back the value of return variable at









Table 3-1. Semantic Analyzer can transfer information from one method to another
through variables and can use this information to discover semantics of a
schema element.


public ResultSet returnList() {
    ResultSet rs = null;
    try {
        String query = "SELECT Code, Time, Day, Pl, Inst FROM Course";
        rs = sqlStatement.executeQuery(query);
    } catch (Exception ex) {
        researchErr = ex.getMessage();
    }
    return rs;
}


ResultSet rsList = returnList();
String dataOut = "";
while (rsList.next()) {
    // Column 4 of the query is Pl
    dataOut = rsList.getString(4);
    System.out.println("Class is held in room number:" + dataOut);
}



the end of the method. SA can transfer information from one method to another through

variables and can use this information to discover semantics of a schema element. The

code fragment in Table 3-1 is given as an example for this capability of SA. Inside the

method, the value of variable query is transferred to variable rs. At the end of the method,

value of variable rs is transferred to variable rsList. The value of the fourth field of the

query from the resultset is then stored into a variable and then printed out. When we

relate the text in the output statement with the fourth field of the query, we can conclude

that Pl field of table Course corresponds to 'Class is held in room number'.
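The last step of this trace, mapping a result set column index back to a schema element, can be sketched as follows. This is a simplified illustration that handles only a single-table SELECT with an explicit column list; the class ResultSetFieldResolver and its regular expression are assumptions, whereas the SA prototype parses embedded SQL with a generated parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: resolve rs.getString(n) to "Table.Column". */
class ResultSetFieldResolver {

    // Matches "SELECT <column list> FROM <table>", case-insensitively.
    private static final Pattern SIMPLE_SELECT =
            Pattern.compile("(?i)^\\s*SELECT\\s+(.+?)\\s+FROM\\s+(\\w+)");

    /** Maps a 1-based result set column index to the schema element it reads. */
    static String resolve(String query, int columnIndex) {
        Matcher m = SIMPLE_SELECT.matcher(query);
        if (!m.find()) {
            throw new IllegalArgumentException("Unsupported query: " + query);
        }
        String[] columns = m.group(1).split("\\s*,\\s*");
        if (columnIndex < 1 || columnIndex > columns.length) {
            throw new IllegalArgumentException("No column at index " + columnIndex);
        }
        return m.group(2) + "." + columns[columnIndex - 1].trim();
    }
}
```

For the query in Table 3-1, resolving index 4 gives the schema element Course.Pl, which is then paired with the text 'Class is held in room number'.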

3.1.5 Heuristics Used for Information Extraction

A heuristic is any method found through observation which produces correct or

sufficiently exact results when applied in commonly occurring conditions. We have

developed several guidelines (heuristics) through observations to extract semantics

from the application source code and report design templates. These heuristics relate

semantically rich descriptive texts to schema elements. They are based mainly on the layout

and format (e.g., font size, face, color, and type) of data and description texts that are










used to communicate with users through console with input/output statements or through

a report.

We introduce these heuristics below. The first six heuristics shown in this section are

developed to extract information from source code of applications that communicate with

users through console with input/output statements. Please note that the code fragments

in the first six heuristics contain Java-specific input, output, and database-related

statements that use syntax based on the Java API. We parameterized these statements in

our SA prototype. Therefore it is theoretically straightforward to add new input, output,

and database-related statement names or to switch to another language if necessary.

We developed the rest of the heuristics to extract semantics from reports. We use

these heuristics to extract semantic information either from reports generated by Java

Servlets or from report design templates.

Heuristic 1. Application code generally has input-output statements that display

the results of queries executed on the underlying database. Typically, output statements

display one or more variables and/or contain one or more format strings. Table 3-2

represents a format string '\n Course code:\t' followed by a variable V.

Table 3-2. Output string gives clues about the semantics of the variable following it.

System.out.println("\n Course code:\t" + V);



Heuristic 2. The format string in an input-output statement describes the displayed

slicing variable that comes after this format string. The format string '\n Course code:\t'

describes the variable V in Table 3-2.

Heuristic 3. The format string that contains semantic information and the variable

may not be in the same statement and may be separated by an arbitrary number of

statements as shown in Table 3-3.

Heuristic 4. There may be an arbitrary number of format strings in different

statements that inherit semantics and they may be separated by an arbitrary number









Table 3-3. Output string and the variable may not be in the same statement.

System.out.println("\n Course code:");


System.out.print(V);



of statements, before we encounter an output of the slicing variable. Concatenating the

format strings before the slicing variable gives more clues about the variable's semantics. An

example is shown in Table 3-4.

Table 3-4. Output strings before the slicing variable should be concatenated.

System.out.print("\n Course ");
System.out.println("\t code:");
System.out.print(V);
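Heuristics 1 through 4 can be illustrated with a minimal sketch that scans output statements, collects the format strings, and stops when the slicing variable is printed. The class FormatStringCollector is a hypothetical name, the statements are given as plain source strings, and the regular expressions are a simplification standing in for a real parser; none of this is taken from the SA prototype.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of Heuristics 1-4: gather output text that precedes a variable. */
final class FormatStringCollector {
    // Matches System.out.print(...) and System.out.println(...) and captures the argument.
    private static final Pattern OUTPUT =
        Pattern.compile("System\\.out\\.print(?:ln)?\\((.*)\\)\\s*;");
    // Matches either a string literal (group 1) or an identifier (group 2).
    private static final Pattern TOKEN =
        Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"|([A-Za-z_]\\w*)");

    /** Returns the concatenated output text seen before the variable, or null if it is never printed. */
    static String describe(List<String> statements, String variable) {
        StringBuilder text = new StringBuilder();
        for (String statement : statements) {
            Matcher out = OUTPUT.matcher(statement);
            if (!out.find()) {
                continue; // Heuristic 3: other statements may sit in between
            }
            Matcher tok = TOKEN.matcher(out.group(1));
            while (tok.find()) {
                if (tok.group(1) != null) {
                    text.append(tok.group(1)).append(' '); // Heuristic 4: concatenate
                } else if (tok.group(2).equals(variable)) {
                    return clean(text.toString());
                }
            }
        }
        return null;
    }

    /** Drops the \n and \t escapes and collapses whitespace. */
    private static String clean(String raw) {
        return raw.replaceAll("\\\\[nt]", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        List<String> code = Arrays.asList(
            "System.out.print(\"\\n Course \");",
            "System.out.println(\"\\t code:\");",
            "System.out.print(V);");
        System.out.println(describe(code, "V"));
    }
}
```

The sketch does not reset the collected text when a different variable is printed in between; a fuller version would have to decide where the description of one variable ends and the next begins.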



Heuristic 5. An output text in an output statement and a following variable in the

same or following output statements are semantically related. The output text can be

considered as the variable's possible semantics. We can trace back the variable through

backward slicing and identify the schema element in the data source that assigns a value

to it. We can conclude that this schema element and variable are related. We can then

relate the output text with the schema element. The Java code sample with an embedded

SQL query in Table 3-5 illustrates our point.

Table 3-5. Tracing back the output text and associating it with the corresponding column
of a table.


Q = "SELECT C FROM T";
R = S.executeQuery(Q);
V = R.getString(1);
System.out.println("Course code: " + V);










In Table 3-5, the variable V is associated with the text 'Course code'. It is also

associated with the first column of the query result in R, which is called C. Hence the

column C can be associated with the text 'Course code'.

Heuristic 6. If the variable V is used with column C of table T in a compare

statement in the where-clause of the query Q, and if one can associate a text string from

an input/output statement denoting the meaning of variable V, then we can associate this

meaning of V with column C of table T. The Java code sample with an embedded SQL

query in Table 3-6 illustrates our point.

Table 3-6. Associating the output text with the corresponding column in the where-clause.


Q = "SELECT * FROM T WHERE C = '" + V + "'";
R = S.executeQuery(Q);
System.out.println("Course code: " + V);



In Table 3-6, the variable V is associated with the text 'Course code:'. It is also

associated with the column C of table T. Hence the schema element C can be associated

with the text 'Course code'.

Table 3-7. Column header describes the data in that column.
College  Course  Title            Instructor
CAS      CS101   Intro Comp.      Dr. Long
GRS      CS640   Artificial Int.  Dr. Betke



Heuristic 7. A column header H (i.e., description text) of a table on a report

describes the value of a data element D in that column. We can associate

the header H with the data D presented in the same column. For example, the header

"Instructor" in the fourth column describes the value "Dr. Long" in Table 3-7.

Table 3-8. Column on the left describes the data items listed to its immediate right.
Course       CSE103 Introduction to Databases
Credits      3
Description  Core concepts in databases










Heuristic 8. A descriptive text T on a row of a table on a report describes the value

of a data element D on the right-hand side of the same row of the table. We can associate the text

T with the data D presented on the same row. For example, the text "Description" on the

third row describes the value "Core concepts in databases" in Table 3-8.

Table 3-9. Column on the left and the header immediately above describe the same set of
data items.
Core Courses
Course CSE103 Introduction to Databases
Credits 3
Description Core concepts in databases
Elective Courses
Course CSE131 Problem Solving
Credits 3
Description Use of Comp. for problem solving


Heuristic 9. Heuristics 7 and 8 can be combined. Both the header of a

data item in the same column and the text on the left-hand side of the same row describe the

data. For example, both the text "Course" on the left-hand side and the header "Elective

Courses" of the data "CSE131 Problem Solving" describe the data in Table 3-9.

Table 3-10. Set of data items can be described by two different headers.
Course            Instructor
Code     Room     Name          Room
CIS4301  E221     Dr. Hammer    E452
COP6726  E112     Dr. Jermaine  E456


Heuristic 10. If more than one header describes a data item on a report, all the headers

corresponding to the data item describe it. For example, both the header "Instructor"

and the header "Room" describe the value "E452" in Table 3-10.

Table 3-11. Header can be processed before being associated with the data on a column.
Course  Title (Credits)             Instructor
CS105   Comp. Concepts ( 3.0 )      Dr. Krentel
CS110   Java Intro Prog. ( 4.0 )    Dr. Bolker










Heuristic 11. The data value presented in a column can be retrieved from more

than one data item in the schema. In that case, the format of the header of the column

gives clues about how we need to parse the header and associate it with the data items.

For example, the data of the second column in Table 3-11 is retrieved from two data items

in the data source. The format of the header "Title (Credits)" tells us that we need to

consider the parentheses while parsing the header and associating the data items in the

column with the header.
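As an illustration of Heuristic 11, the following minimal sketch splits a header of the form "Title (Credits)" and a cell of the same form into two header/value associations. The class CompositeHeaderParser is a hypothetical name and the single parenthesized pattern is an assumption; real report headers may use other composite formats.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of Heuristic 11 for headers of the form "Outer (Inner)". */
final class CompositeHeaderParser {
    // "<outer> ( <inner> )" with optional spaces around the parentheses.
    private static final Pattern OUTER_INNER =
        Pattern.compile("^\\s*(.*?)\\s*\\(\\s*(.*?)\\s*\\)\\s*$");

    /** Associates each part of the header with the matching part of the cell value. */
    static Map<String, String> associate(String header, String cell) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher h = OUTER_INNER.matcher(header);
        Matcher c = OUTER_INNER.matcher(cell);
        if (h.matches() && c.matches()) {
            result.put(h.group(1), c.group(1)); // e.g. Title -> Comp. Concepts
            result.put(h.group(2), c.group(2)); // e.g. Credits -> 3.0
        } else {
            result.put(header.trim(), cell.trim()); // plain header, as in Heuristic 7
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(associate("Title (Credits)", "Comp. Concepts ( 3.0 )"));
    }
}
```

Each resulting pair can then be traced to its own schema element, so that one report column yields two description-to-data associations.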

In this section, we have introduced Semantic Analyzer (SA). SA extracts information

about the semantics of schema elements from the application source code. This information

is an essential input for the Schema Matching (SM) component. In the following section,

we introduce our schema matching approach and how we use SA to discover semantics for

SM.

3.2 Schema Matching

Schema matching aims at discovering semantic correspondences between schema

elements of disparate but related data sources. To match schemas, we need to identify the

semantics of schema elements. When done manually, this is a tedious, time-consuming,

and error-prone task. Much research has been carried out to automate this task to aid

schema matching, see for example, [25, 28, 78]. However, despite the ongoing efforts,

current schema matching approaches, which use the schemas themselves as the main input

for their algorithms, still rely heavily on manual input [26]. This dependence on human

involvement is due to the well-known fact that schemas represent semantics poorly. Hence,

we believe that improving current schema matching approaches requires improving the

way we discover semantics.

Discovering semantics means gathering information about the data, so that after

processing the data, a computer can decide on how to use the data in a way a person

would do. In the context of schema matching, we are interested in finding information

that leads us to find a path from schema elements in one data source to the corresponding










schema elements in the others. Therefore, we define discovering semantics for schema

matching as discovering paths between corresponding schema elements in different data

sources.

We reduce the level of difficulty of the schema matching problem by abstracting it

to matching of automatically generated documents such as reports that are semantically

richer than the schemas to which they correspond. Reports and other user-oriented

output, which are typically generated by report generators, do not use the names of

schema elements directly but rather provide more descriptive names to make the output

more comprehensible to the users. These descriptions together with their formatting

instructions plus relationships to the underlying schema elements in the data source can

be extracted from the application code generating the report. These semantically rich

descriptions, which can be linked to the schema elements in the source, can be used to

discover relationships between data sources and hence between the underlying schemas.

Moreover, reports use more domain terminology than schemas. Therefore, using domain

dictionaries is particularly helpful as opposed to their use in schema matching algorithms.

One can argue that reports of an information system may not cover the entire

schema and hence by this approach we may not find matches for all schema elements. It

is important to note that we do not have to match all the schema elements of two data

sources in order to have two organizations collaborate. We believe the reports typically

present the most important data of the information system, which is also likely to be

the set of elements that are important for the ensuing data integration scenario. Thus,

starting the schema matching process from reports can help focus on the important data,

eliminating any effort on matching unnecessary schema elements.

3.2.1 Motivating Example

We present a motivating example to show how analyzing report-generating application

source code and report design templates can help us understand the semantics of schema

elements better. We choose our motivating example reports from the university domain










because the university domain is well known and easy to understand. To create our

motivating example, we use the THALIA5 testbed and benchmark which provides a

collection of over 40 downloadable data sources representing university course catalogs

from computer science departments worldwide [47].


[Figure 3-11: Schema S1 with relations Schedule (Code, Name, Time, Prereq) and CourseInst (Num1, Name, Num2, Office); schema S2 with relations Offerings (No, Name, TID, InsNo), Faculty (No, Name, Room, Title), and ClassTimes (Code, Day, Hour).]



Figure 3-11. Schemas of two data sources that collaborate for a new online degree
program.


We motivate the need for schema matching across the two data sources of computer

science departments with a scenario. Let us assume that two computer science departments

of universities A and B start to collaborate for a new online degree program. Unless one

is content to query each report separately, one can imagine the existence of a course

schedule mediator capable of providing integrated access to the different course sites.

The mediator enables us to query data of both universities and presents the results in a

uniform way. Such a mediator requires finding relationships across the source

schemas S1 and S2 of universities A and B shown in Figure 3-11. This is a challenging

task when limited to information provided by the data source alone. By just using the

schema names in the figure, one can match schema elements of two different schemas in

various ways. For instance, one can match the schema element Name in relation Offerings

of schema S2 with schema element Name in relation Schedule of schema S1 or with schema




5 THALIA Website: http://www.cise.ufl.edu/project/thalia.html











element Name in relation CourseInst of schema S1. Both mappings seem reasonable when

we only consider the available schema information.

However, when we consider the reports generated by application source code using

these schemas of data sources, we can decide on the mappings of schemas more accurately.

Information system applications that generate reports retrieve data from the data source,

format the data and present it to users. To make the data more apprehensible by the user,

these applications generally do not use the names of schema elements but invent more

descriptive names (i.e., title) to the data by using domain specific terms when applicable.


[Figure 3-12: Report R1 "Course Listings" (columns Course, Title, Instructor, Time, Prerequisite) shown with schema S1, and report R2 "Course Schedules" (columns Course, Title, Lecturer, Time) shown with schema S2.]


Figure 3-12. Reports from two sample universities listing courses.












































































[Figure 3-13: Report R4 "Lecturer Rooms" (columns Lecturer, Title, Room) shown with schema S2.]



Figure 3-13. Reports from two sample universities listing instructor offices.


For our motivating example, university A has reports R1 and R3 and university B has


R2 and R4 presenting data from their data sources. Reports R1 and R2 present course


listings and reports R3 and R4 present instructor office information from corresponding


universities. We show these simplified sample reports (R1, R2, R3, and R4) and the


schemas (S1 and S2) in Figures 3-12 and 3-13. The reader can easily match the column


headers (blue dotted arrows in Figures 3-12 and 3-13). On the other hand, it is hard to


match the schema elements of data sources correctly by only considering their names.


However, it again becomes straightforward to identify semantically related schema


elements if we know the links between column headers and schema elements (red arrows in


Figures 3-12 and 3-13).


[Figure 3-13, continued: Report R3 "Instructor Offices" shown with schema S1 (Schedule, CourseInst).]










Our idea is to find mappings between descriptive texts on reports (blue dotted

arrows) by using semantic similarity functions and to find the links between these texts

and schema elements (red arrows) by analyzing the application source code and report

design templates. For this purpose, we first analyze the application source code or the

report design template generating each report. For each report, we store our findings

such as descriptive texts (e.g., column headers), schema elements and relations between

the descriptive texts and the schema elements into an instance of report ontology. We

give the details of the report ontology in Section 3.2.3. We pair report ontology instances,

one from the first data source and one from the second data source. We then compute

the similarities between all possible report ontology instance pairs. For our example, the

four possible report pairs when we select one report from DS1 and the other from DS2

are [R1-R2], [R1-R4], [R2-R3] and [R3-R4]. We calculate the similarity scores between

descriptive texts on reports for each report pair by using semantic similarity functions

based on WordNet, which we describe in Section 3.2.4. We then transfer similarity scores

between descriptive texts of reports to scores between schema elements of schemas by

using the previously discovered relations between descriptive texts and schema elements.

Last, we merge the similarity scores of schema elements computed for each report pair and

form a final matrix holding similarity scores between elements of schemas that are higher

than a threshold. We address details of each step of our approach in Section 3.2.2.

When we apply our schema matching approach on the example schemas and reports

described above, we obtain a precision value of 0.86 and a recall value of 1.00. We show

the similarity scores between schema elements of data sources DS1 and DS2 which are

greater than the threshold (0.5) in Figure 3-14. These results are better than the results

found matching the above schemas with the COMA++ (COmbination of MAtching

algorithms) framework.6 COMA++ [7] is a well known and well respected schema



6 We use the default COMA++ All Context combined matcher










matching framework providing a downloadable prototype. This example suggests that

our approach promises better accuracy for schema matching than existing approaches.

We provide a detailed evaluation of the approach in Chapter 6. In the next section, we

describe the steps of our schema matching approach.

[Figure 3-14: Matrix of similarity scores above the threshold (0.5), with the elements of schema S1 (Schedule: Code, Name, PreReq, Time; CourseInst: Num1, Name, Num2, Office) as columns and the elements of schema S2 (Offerings, ClassTimes, Faculty) as rows.]

Figure 3-14. Similarity scores of schema elements of two data sources.


3.2.2 Schema Matching Approach

The main idea behind our approach is that user-oriented outputs such as reports

encapsulate valuable information about semantics of data which can be used to facilitate

schema matching. Applying well-known program understanding techniques as described

in Section 3.1.4, we can extract semantically rich textual descriptions and relate these

with data presented on reports using heuristics described in Section 3.1.5. We can trace

the data back to corresponding schema elements in the data source and match the

corresponding schema elements in the two data sources. Below, we outline the steps of

our Schema Matching approach, which we call Schema Matching by Analyzing ReporTs

(SMART). In the next sections, we provide a detailed description of these steps, which are

shown in Figure 3-15.

Creating an Instance of a Report Ontology
Computing Similarity Scores
Forming a Similarity Matrix
From Matching Ontologies to Schemas
Merging Results

















[Figure 3-15: Flow from the report generating applications and report templates of the two data sources through the five steps: 1) creating instances of the report ontology, 2) computing similarity scores between report ontology instances, 3) forming a similarity matrix, 4) transferring inter-ontology scores, and 5) merging results.]

Figure 3-15. Five steps of Schema Matching by Analyzing ReporTs (SMART) algorithm.










3.2.3 Creating an Instance of a Report Ontology

In the first step, we analyze application source code that generates a report. We

described the details of the semantic analysis process in Section 3.1. The extracted semantic

information from source code or from a report design template is stored in an instance of

the report ontology.

We have developed an ontology for reports after analyzing some of the most widely

used open source report generation tools such as Eclipse BIRT,7 JasperReports,8 and

DataVision.9 We designed the report ontology using the Protege Ontology Editor10 and

represented this report ontology in OWL (Web Ontology Language). The UML diagram of

the report ontology depicted in Figure 3-16 shows the concepts, their properties and their

relations with other concepts.

We store information about the descriptive texts on a report (e.g., column headers)

and information about the source of data (i.e., schema elements) presented on a report in

an instance of the report ontology. The descriptive text and schema element properties

are stored in description element and data element concepts of the report ontology

respectively. The data element concept has properties such as attribute, table (table of the

attribute in relational database) and type (type of the data stored in the attribute). We

identify the relation between a description element concept and a data element concept

with the help of a set of heuristics which are based on the location and format information

described in Section 3.1.5 and store this information in the hasDescription relation property of

the description element concept.




7 Eclipse BIRT: http://www.eclipse.org/birt/

8 JasperReports: http://jasperreports.sourceforge.net/

9 DataVision: http://datavision.sourceforge.net/

10 Protege tool: http://protege.stanford.edu/































Figure 3-16. Unified Modeling Language (UML) diagram of the Schema Matching by
Analyzing ReporTs (SMART) report ontology.


The design of the report ontology does not change from one report to another but

the information stored in an instance of the report ontology changes based on the report

being analyzed. We placed the data element concept in the center of the report ontology

as shown in Figure 3-16. This design is appropriate for the calculation of similarity scores

between data element concepts according to the formula described in Section 3.2.4.

3.2.4 Computing Similarity Scores

We compute similarity scores between all possible data element concept pairs

consisting of a data element concept from an instance of the report ontology of the

first data source and another data element concept from an instance of report ontology of

the second data source. This means if there are m reports having n data element concepts

on average for the DS1 data source and k reports having l data element concepts on average

for the DS2 data source, we compute similarity scores for (m · n · k · l) pairs of data element

concepts.

However, computing similarity scores for all possible report ontology instance pairs

may be unnecessary. For example, unrelated report pairs, such as a report describing

payments of employees paired with another describing the grades of students at a university,

may not have semantically related schema elements, and therefore we may not find any

semantic correspondence by computing similarity scores of concepts of unrelated report

ontology instance pairs. To save computation time, we filter out report pairs that have

semantically unrelated reports. To determine whether a report pair is semantically related,

we first extract the texts (i.e., titles, footers and data headers) on the two reports of the pair and

calculate the similarity score of these texts. If the similarity score between the texts of a

report pair is below a predetermined threshold, we assume that the report pair presents

semantically unrelated data and we do not compute similarity scores for the data element pairs

of that report pair.
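The filtering step can be sketched as follows. This is a minimal illustration, not the SMART implementation: the class ReportPairFilter, the ReportPair holder, and the pluggable textSimilarity function are hypothetical names, and the choice to keep pairs whose score equals the threshold is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch: keep only report pairs whose texts are similar enough. */
final class ReportPairFilter {
    /** Indices of one report from each data source plus the similarity of their texts. */
    static final class ReportPair {
        final int first;
        final int second;
        final double score;

        ReportPair(int first, int second, double score) {
            this.first = first;
            this.second = second;
            this.score = score;
        }
    }

    /** Pairs every report text of DS1 with every report text of DS2 and filters by threshold. */
    static List<ReportPair> relatedPairs(List<String> textsDs1, List<String> textsDs2,
            ToDoubleBiFunction<String, String> textSimilarity, double threshold) {
        List<ReportPair> kept = new ArrayList<>();
        for (int i = 0; i < textsDs1.size(); i++) {
            for (int j = 0; j < textsDs2.size(); j++) {
                double score = textSimilarity.applyAsDouble(textsDs1.get(i), textsDs2.get(j));
                if (score >= threshold) { // below the threshold: treated as unrelated
                    kept.add(new ReportPair(i, j, score));
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy similarity for demonstration only: 1.0 when the texts share their first word.
        ToDoubleBiFunction<String, String> toy =
            (a, b) -> a.split(" ")[0].equalsIgnoreCase(b.split(" ")[0]) ? 1.0 : 0.0;
        List<ReportPair> kept = relatedPairs(
            Arrays.asList("Course Listings", "Instructor Offices"),
            Arrays.asList("Course Schedules", "Lecturer Rooms"), toy, 0.5);
        System.out.println(kept.size() + " related report pair(s)");
    }
}
```

The stored pair score is kept because the overall similarity of a report pair is used again when the per-pair results are merged.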

The similarity of two objects depends on the similarities of the components that

form the objects. An ontology concept is formed by the properties and the relations it

has. Each relation of an ontology concept connects the concept to its neighbor concept.

Therefore, the similarity of two concepts depends on the similarities of the properties of

the concepts and the similarities of the neighbor concepts. For example, the similarity of

two data element concepts from different instances of the report ontology depends on the

similarity of their properties attribute, table, and type and the similarities of their neighbor

concepts DescriptionElement, Header, Footer, etc.

Our similarity function between concepts of instances of an ontology is similar to

the function proposed by Rodriguez and Egenhofer [81]. Rodriguez and Egenhofer also

consider sets of features (properties) and semantic relations (neighbors) among concepts

while assessing similarity scores among entity classes from different ontologies. While

their similarity function aims to find similarity scores between concepts from different










ontologies, our similarity function finds similarity scores between the instances of an

ontology.

We formulate the similarity of two concepts in different instances of an ontology as

follows:

\[ \mathrm{sim}_c(c_1, c_2) = w_p \cdot \mathrm{sim}_p(c_1, c_2) + w_n \cdot \mathrm{sim}_n(c_1, c_2) \tag{3-1} \]

where $c_1$ is a concept in an instance of the ontology, $c_2$ is the same type of concept

in another instance of the ontology, $w_p$ is the weight of the total similarity of the properties of

that concept and $w_n$ is the weight of the total similarity of the neighbor concepts that can be

reached from that concept by a relation. $\mathrm{sim}_p(c_1, c_2)$ and $\mathrm{sim}_n(c_1, c_2)$ are the formulas to

calculate the similarities of the properties and the neighbors. We can formulate $\mathrm{sim}_p(c_1, c_2)$ as

follows:

\[ \mathrm{sim}_p(c_1, c_2) = \sum_{i=1}^{k} w_{p_i} \cdot \mathrm{SimFunc}(c_1 p_i, c_2 p_i) \tag{3-2} \]

where $k$ is the number of properties of that concept, $w_{p_i}$ is the weight of the ith

property, $c_1 p_i$ is the ith property of the concept in the first report ontology instance, and $c_2 p_i$

is the same type of property of the other concept in the second report ontology instance.

SimFunc is the function that we use to assess a similarity score between the values

of the properties of two concepts. For description elements, the SimFunc is a semantic

similarity function between texts which is similar to the text-to-text similarity function of

Corley and Mihalcea [21]. To calculate the similarity score between two text strings T1

and T2, we first eliminate stop words (e.g., a, and, but, to, by). We then find the word

having the maximum similarity score in text T2 for each word in text T1. The similarity

score between two words, one from text T1 and the other from T2, is obtained from a

WordNet-based semantic similarity function such as the Jiang and Conrath metric [52].

We sum up the maximum scores and divide the sum by the word count of the text T1.

The result is the measure of similarity between text T1 and the text T2 for the direction










from T1 to T2. We repeat the process for the reverse direction (i.e., from T2 to T1) and

then compute the average of the two scores for a bidirectional similarity score.
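The bidirectional text-to-text score described above can be sketched as follows. The class TextSimilarity is a hypothetical name, the word-level function is passed in as a parameter (a WordNet-based metric would be plugged in there), and two details are assumptions of this sketch: the stop word list contains only the five examples named above, and the word count used as the divisor is taken after stop word removal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch of the bidirectional text-to-text similarity. */
final class TextSimilarity {
    // Assumption: only the stop words given as examples in the text.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "and", "but", "to", "by"));

    private final ToDoubleBiFunction<String, String> wordSimilarity;

    TextSimilarity(ToDoubleBiFunction<String, String> wordSimilarity) {
        this.wordSimilarity = wordSimilarity;
    }

    /** Lower-cases the text, splits it into words, and removes stop words. */
    static List<String> contentWords(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        return words;
    }

    /** For each word in 'from', takes its best match in 'to' and averages the best scores. */
    double directed(List<String> from, List<String> to) {
        if (from.isEmpty() || to.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String f : from) {
            double best = 0.0;
            for (String t : to) {
                best = Math.max(best, wordSimilarity.applyAsDouble(f, t));
            }
            sum += best;
        }
        return sum / from.size();
    }

    /** Averages the T1-to-T2 and T2-to-T1 scores. */
    double similarity(String t1, String t2) {
        List<String> a = contentWords(t1);
        List<String> b = contentWords(t2);
        return (directed(a, b) + directed(b, a)) / 2.0;
    }

    public static void main(String[] args) {
        // Exact word match stands in for a WordNet-based metric here.
        TextSimilarity sim = new TextSimilarity((x, y) -> x.equals(y) ? 1.0 : 0.0);
        System.out.println(sim.similarity("Course Title", "Title of the Course"));
    }
}
```

Averaging the two directions keeps a short text that is fully contained in a long one from scoring as a perfect match.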

We use different similarity functions for different properties. If the property that

we are comparing has text data such as property description, we use one of the word

semantic similarity functions that we have introduced in Section 2.11. By using a semantic

similarity measure instead of lexical similarity measure such as edit distance, we can

detect the similarities of words that are lexically far but semantically close such as

lecturer and instructor and we can also eliminate the words that are lexically close but

semantically far such as 'tower' and 'power'. Besides description property of description

element concept, we also use semantic similarity measures to compute similarity scores

between footernote property of the footer concept, headernote property of the header

concept and title property of the report concept. If the property that we are comparing is

the attribute or table property of the data element concept, we assess a similarity score based

on the Levenshtein edit similarity measure. Besides the attribute and table properties of the data

element concept, we also use edit similarity measures to compute similarity scores for the query

property of the report concept.
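For the lexically compared properties, an edit similarity in the range [0,1] can be derived from the Levenshtein distance. The following minimal sketch uses the common normalization by the length of the longer string; this normalization and the class name EditSimilarity are assumptions of the sketch, since the exact normalization is not stated here.

```java
/** Hypothetical sketch of a Levenshtein-based edit similarity in [0,1]. */
final class EditSimilarity {
    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1 - distance / length of the longer string. */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(similarity("InsNo", "Num1"));
        System.out.println(similarity("tower", "power"));
    }
}
```

The pair 'tower' and 'power' shows why this measure suits attribute and table names but not descriptive texts: it scores them as close although they are semantically unrelated.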

In the following formula, which calculates the similarity between the neighbors of two

concepts, $l$ is the number of relations of the concepts we are comparing, $w_{n_i}$ is the weight

of the ith relation, and $c_1 n_i$ ($c_2 n_i$) is the neighbor concept of the first (second) concept that we

reach by following the ith relation.

\[ \mathrm{sim}_n(c_1, c_2) = \sum_{i=1}^{l} w_{n_i} \cdot \mathrm{sim}_c(c_1 n_i, c_2 n_i) \tag{3-3} \]





Note that our similarity function is generic and can be used to calculate similarity

scores between concepts of instances of any ontologies. Even though the formulas in

Equations 3-1, 3-2 and 3-3 are recursive in nature, when we apply the formulas to

compute similarity scores between data elements of report ontologies, we do not encounter

recursive behavior. That is because there is no path back to data element concept through










relations from neighbors of the data element concept. In other words, the neighbor

concepts of the data element concept do not have the data element concept as a neighbor.

We apply the above formulas to calculate similarity scores between data element

concepts of two different report ontologies. The data element concept has properties

attribute, table, and type and neighbor concepts description element, report, header,

and footer concepts. The similarity score between two data element concepts can be

formulated as follows:


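\[
\begin{aligned}
\mathrm{sim}_{DataElement}(DataElement_1, DataElement_2) ={}& w_1 \cdot \mathrm{SimFunc}(Attribute_1, Attribute_2) \\
&+ w_2 \cdot \mathrm{SimFunc}(Table_1, Table_2) \\
&+ w_3 \cdot \mathrm{SimFunc}(Type_1, Type_2) \\
&+ w_4 \cdot \mathrm{sim}_{DescriptionElement}(DescriptionElement_1, DescriptionElement_2) \\
&+ w_5 \cdot \mathrm{sim}_{Report}(Report_1, Report_2) \\
&+ w_6 \cdot \mathrm{sim}_{Header}(Header_1, Header_2) \\
&+ w_7 \cdot \mathrm{sim}_{Footer}(Footer_1, Footer_2)
\end{aligned}
\tag{3-4}
\]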


We explain how we determine the weights w1 to w7 in Section 6.2. The similarity

score between two description element, report, header and footer concepts can be

computed by the following formulas:


\[ \mathrm{sim}_{DescriptionElement}(DescriptionElement_1, DescriptionElement_2) = \mathrm{SimFunc}(Description_1, Description_2) \tag{3-5} \]

\[ \mathrm{sim}_{Report}(Report_1, Report_2) = \mathrm{SimFunc}(Query_1, Query_2) + \mathrm{SimFunc}(Title_1, Title_2) \tag{3-6} \]

\[ \mathrm{sim}_{Header}(Header_1, Header_2) = \mathrm{SimFunc}(HeaderNote_1, HeaderNote_2) \tag{3-7} \]

\[ \mathrm{sim}_{Footer}(Footer_1, Footer_2) = \mathrm{SimFunc}(FooterNote_1, FooterNote_2) \tag{3-8} \]
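The weighted sum in Equations 3-4 through 3-8 can be written out as a short sketch. The class DataElementSimilarity and the flat DataElement holder are hypothetical; the two similarity functions are passed in as parameters, with an edit similarity for attribute, table, and query and a semantic similarity for the descriptive texts. Comparing the type property by exact match is an assumption of this sketch.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch of Equations 3-4 to 3-8 for two data element concepts. */
final class DataElementSimilarity {
    /** Flat holder for the properties of a data element and of its neighbor concepts. */
    static final class DataElement {
        final String attribute, table, type, description, query, title, headerNote, footerNote;

        DataElement(String attribute, String table, String type, String description,
                    String query, String title, String headerNote, String footerNote) {
            this.attribute = attribute;
            this.table = table;
            this.type = type;
            this.description = description;
            this.query = query;
            this.title = title;
            this.headerNote = headerNote;
            this.footerNote = footerNote;
        }
    }

    private final double[] w; // w[0]..w[6] hold the weights w1..w7
    private final ToDoubleBiFunction<String, String> edit;
    private final ToDoubleBiFunction<String, String> semantic;

    DataElementSimilarity(double[] weights, ToDoubleBiFunction<String, String> edit,
                          ToDoubleBiFunction<String, String> semantic) {
        if (weights.length != 7) {
            throw new IllegalArgumentException("expected the seven weights w1..w7");
        }
        this.w = weights.clone();
        this.edit = edit;
        this.semantic = semantic;
    }

    double similarity(DataElement a, DataElement b) {
        double simType = a.type.equals(b.type) ? 1.0 : 0.0; // assumption: exact match
        double simReport = edit.applyAsDouble(a.query, b.query)
                         + semantic.applyAsDouble(a.title, b.title); // Eq. 3-6
        return w[0] * edit.applyAsDouble(a.attribute, b.attribute)
             + w[1] * edit.applyAsDouble(a.table, b.table)
             + w[2] * simType
             + w[3] * semantic.applyAsDouble(a.description, b.description) // Eq. 3-5
             + w[4] * simReport
             + w[5] * semantic.applyAsDouble(a.headerNote, b.headerNote)   // Eq. 3-7
             + w[6] * semantic.applyAsDouble(a.footerNote, b.footerNote);  // Eq. 3-8
    }

    public static void main(String[] args) {
        ToDoubleBiFunction<String, String> exact = (x, y) -> x.equals(y) ? 1.0 : 0.0;
        DataElementSimilarity sim = new DataElementSimilarity(
            new double[] {0.2, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1}, exact, exact);
        DataElement a = new DataElement("Name", "CourseInst", "VARCHAR", "Instructor",
            "SELECT Name FROM CourseInst", "Course Listings", "", "");
        DataElement b = new DataElement("Name", "Faculty", "VARCHAR", "Lecturer",
            "SELECT Name FROM Faculty", "Course Schedules", "", "");
        System.out.println(sim.similarity(a, b));
    }
}
```

Because Equation 3-6 adds two SimFunc terms, the report term can reach 2, so the overall score is not confined to [0,1] unless the weights compensate for it.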











3.2.5 Forming a Similarity Matrix

To form a similarity matrix, we connect to the underlying data sources using a

call-level interface (e.g., JDBC) and extract the schemas of two data sources to be

integrated. A similarity matrix is a table storing similarity scores for two schemas such

that elements of the first schema form the column headers and elements of the second

schema form the row headers. The similarity scores are in the range [0,1]. The similarity

matrix given as an example in Figure 3-17 has schema elements from the motivating example

in Section 3.2.1; the similarity scores between schema elements are fictitious.

                        Schema S1
Entity                  Schedule                     CourseInst
            Attribute   Code  Name  PreReq  Time     Num1  Name  Num2  Office
Offerings   No          0.9   0.4   0.2     0.25     0.4   0.3   0.3   0.3
            Name        0.25  0.95  0.15    0.2      0.3   0.4   0.3   0.25
            TID         0.25  0.25  0.15    0.5      0.2   0.2   0.2   0.35
            InsNo       0.3   0.2   0.1     0.2      0.3   0.15  0.2   0.2
ClassTimes  Code        0.5   0.4   0.2     0.5      0.15  0.3   0.4   0.3
            Day         0.2   0.25  0.05    0.7      0.25  0.2   0.15  0.2
            Hour        0.2   0.2   0.1     0.7      0.2   0.2   0.15  0.25
Faculty     No          0.4   0.3   0.1     0.2      0.8   0.25  0.3   0.2
            Name        0.45  0.5   0.1     0.2      0.4   0.95  0.3   0.3
            Room        0.2   0.3   0.05    0.2      0.2   0.2   0.2   0.2
            Title       0.2   0.4   0.1     0.1      0.2   0.3   0.2   0.1


Figure 3-17. Example for a similarity matrix.
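A similarity matrix of this kind can be represented with a small data structure. The sketch below is illustrative only: the class SimilarityMatrix and its methods are hypothetical names, empty cells are marked with NaN so that the matrix can stay sparse, and the qualified element names such as "Offerings.No" are a naming convention assumed for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a similarity matrix between the elements of two schemas. */
final class SimilarityMatrix {
    private final List<String> columns; // elements of the first schema
    private final List<String> rows;    // elements of the second schema
    private final double[][] scores;    // NaN marks an empty cell

    SimilarityMatrix(List<String> columns, List<String> rows) {
        this.columns = new ArrayList<>(columns);
        this.rows = new ArrayList<>(rows);
        this.scores = new double[rows.size()][columns.size()];
        for (double[] row : scores) {
            Arrays.fill(row, Double.NaN);
        }
    }

    void set(String row, String column, double score) {
        if (!(score >= 0.0 && score <= 1.0)) {
            throw new IllegalArgumentException("score must be in [0,1]");
        }
        scores[index(rows, row)][index(columns, column)] = score;
    }

    double get(String row, String column) {
        return scores[index(rows, row)][index(columns, column)];
    }

    /** Lists the element pairs whose score is above the threshold; empty cells never match. */
    List<String> matchesAbove(double threshold) {
        List<String> result = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            for (int c = 0; c < columns.size(); c++) {
                if (scores[r][c] > threshold) {
                    result.add(rows.get(r) + " ~ " + columns.get(c));
                }
            }
        }
        return result;
    }

    private static int index(List<String> names, String name) {
        int i = names.indexOf(name);
        if (i < 0) {
            throw new IllegalArgumentException("unknown schema element: " + name);
        }
        return i;
    }

    public static void main(String[] args) {
        SimilarityMatrix m = new SimilarityMatrix(
            Arrays.asList("Schedule.Code", "Schedule.Name"),
            Arrays.asList("Offerings.No", "Offerings.Name"));
        m.set("Offerings.No", "Schedule.Code", 0.9);   // values taken from Figure 3-17
        m.set("Offerings.No", "Schedule.Name", 0.4);
        m.set("Offerings.Name", "Schedule.Name", 0.95);
        System.out.println(m.matchesAbove(0.5));
    }
}
```

Keeping empty cells distinct from a score of 0 matters later, because a report pair contributes scores only for the schema elements that appear on its reports.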



3.2.6 From Matching Ontologies to Schemas

In the first step, we traced a data element to its corresponding schema element(s). We

use this information to convert inter-ontology matching scores into scores between schema

elements. Using the converted scores, we then fill in a similarity matrix for each report

pair.

Note that, we find similarity scores only for a subset of schemas used in the reports.

We believe the reports typically present the most important data of the information

system, which is likely to be the set of elements that is important for the ensuing data

integration scenario. Even though reports of an information system may not cover the

entire schema, our approach can help focus on the important data thus eliminating efforts



















to match unnecessary schema elements. Note that each similarity matrix can be sparse,
having only a small subset of its cells filled in, as shown in Figures 3-18 and 3-19.

Scores from Reports about Course Listings
Schema S1
Entity      Schedule   CourseInst
Attribute   Code   Name   PreReq   Time   Num1   Name   Num2   Office
No          0.8    0.2    0.1      0.1    0.15
Name        0.2    0.95   0.15     0.1    0.3
InsNo
Day         0.1    0.1    0.1      0.7    0.1
Hour        0.1    0.1    0.1      0.7    0.1
No
Name        0.2    0.35   0.2      0.1    0.9
Room
Title

Figure 3-18. Similarity scores after matching report pairs about course listings.

Scores from Reports about Instructor Offices
Schema S1
Entity      Schedule   CourseInst
Attribute   Code   Name   PreReq   Time   Num1   Name   Num2   Office
No
InsNo
Code
Day
No
Name        0.85   0.2
Room        0.3
Title       0.3    0.2

Figure 3-19. Similarity scores after matching report pairs about instructor offices.



3.2.7 Merging Results


After generating a similarity matrix for each report pair, we need to merge them
into a final similarity matrix. If we have more than one score for a schema element pair
in the similarity matrix, we need to merge the scores. In Section 3.2.4, we described
how we compute similarity scores for report pairs to avoid unnecessary computations
between unrelated report pairs. We use these overall similarity scores between report pairs
while merging similarity scores. We multiply the similarity score of a schema element
pair by the overall similarity score of the report pair and sum the resulting scores.
Then, we divide the final score by the number of reports. For instance, if the similarity
score between schema elements A and B is 0.9 in the first report having an overall
similarity score of 0.7 and is 0.5 in the second report having an overall similarity score
of 0.6, then we conclude that the similarity score between schema elements A and B is
(0.9 * 0.7 + 0.5 * 0.6) / 2 = 0.465. Finally, we eliminate the combined scores that fall
below a (user-defined) threshold.
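The merging rule above can be written out as a short Python sketch. The function name `merge_scores`, the input layout, and the threshold value are illustrative assumptions; the divisor is taken to be the number of reports that contributed a score for the element pair, which matches the worked example.

```python
def merge_scores(observations, threshold=0.0):
    """Merge per-report scores into one score per schema element pair.

    observations maps an element pair to a list of
    (element_pair_score, overall_report_score) tuples, one per report.
    Pairs whose merged score falls below the threshold are dropped.
    """
    merged = {}
    for pair, scored in observations.items():
        # Weight each element score by its report's overall score, then average.
        combined = sum(element * report for element, report in scored) / len(scored)
        if combined >= threshold:
            merged[pair] = combined
    return merged


# The example from the text: A and B score 0.9 in a report weighted 0.7
# and 0.5 in a report weighted 0.6.
observations = {("A", "B"): [(0.9, 0.7), (0.5, 0.6)]}
merged = merge_scores(observations, threshold=0.3)  # threshold is a made-up value
print(round(merged[("A", "B")], 3))
```

Weighting by the report-level score means that a high element score coming from a weakly related report pair contributes less to the final matrix than the same score from a closely related pair.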









CHAPTER 4
PROTOTYPE IMPLEMENTATION

We implemented both the semantic analyzer (SA) component of SEEK and the
SMART schema matcher using the Java programming language. As shown in Figure 4-1, we
have written 1,350 KB of Java code (approximately 27,000 lines of code) for our prototype
implementation. In addition, we have utilized 1,150 KB of Java code (approximately
23,000 lines of code) that was automatically generated by JavaCC. In the following
sections, we first explain the SA prototype and then the SMART prototype.


[Chart: SMART 500 KB; SA (generated by JavaCC) 1,150 KB; SA (hand-coded) 850 KB]


Figure 4-1. Java Code size distribution of (Semantic Analyzer) SA and (Schema Matching
by Analyzing ReporTs) SMART packages.


4.1 Semantic Analyzer (SA) Prototype

We have implemented the SA prototype using the Java language. The
SEEK prototype source code is placed in the seek package. The functionality of the SEEK
prototype is divided into several packages. The source code of the semantic analyzer (SA)
component resides in the sa package. Java classes in the sa package are further divided
into subpackages according to their functionality. The subpackages of the sa package are
listed in Table 4-1.

4.1.1 Using JavaCC to generate parsers

The classes inside the packages syntaxtree, visitor, and parsers are automatically
created by the JavaCC tool. JavaCC is a tool that reads a grammar specification and converts
it to a Java program that can recognize matches to the grammar according to that










Table 4-1. Subpackages in the sa package and their functionality.
Package name     Classes in the package
visitor          Default visitor classes.
parsers          Classes to parse application source code written in grammars Java, HTML and SQL.
seekstructures   Supplementary classes to analyze application source code.
seekvisitors     Visitor classes to analyze source code written in grammars Java, HTML


specification. As shown in Figure 4-2, JavaCC processes a grammar specification file and
outputs the Java files that contain the code of the parser. The parser can process
languages that conform to the grammar in the specification file. The parsers generated in
this way form the ASTG component of the SA. Grammar specification files for some
grammars such as Java, C++, C, SQL, XML, HTML, Visual Basic, and XQuery can be
found at the JavaCC grammar repository Web site.1 These specification files have been
tested and corrected by many JavaCC implementers. This suggests that parsers produced
by using these specifications should be reasonably effective in the correct production of
ASTs. The classes generated by JavaCC form the abstract syntax tree generator
(ASTG) of the SA, which was described in Section 3.1.2.1.

For the SA component of the SEEK prototype, we created parsers for three different
grammars: Java, SQL, and HTML. We placed these parsers, the related
syntax tree classes, and the generic visitor classes into the parsers, syntaxtree, and visitor
packages, respectively. Each Java class inside the syntaxtree package has an accept method to be
used by visitors. Visitor classes have visit methods, each of which corresponds to a Java class
inside the syntaxtree package. The syntaxtree, visitor, and parsers packages have 142, 15, and
14 classes, respectively. The classes inside these packages remain the same as long as the
Java, SQL, and HTML grammars do not change.
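The accept/visit pairing described above is the classic visitor design pattern. The following Python sketch illustrates the double dispatch only; the node classes (`Literal`, `MethodCall`) and the visitor (`MethodNameCollector`) are hypothetical and far simpler than the JavaCC-generated syntaxtree and visitor classes of the prototype, which are written in Java.

```python
class Node:
    """Base syntax-tree node: accept() dispatches to the visitor's
    visit method that corresponds to the node's own class."""

    def accept(self, visitor):
        return getattr(visitor, "visit_" + type(self).__name__)(self)


class Literal(Node):
    def __init__(self, value):
        self.value = value


class MethodCall(Node):
    def __init__(self, name, args):
        self.name = name
        self.args = args


class MethodNameCollector:
    """Hypothetical visitor that records the name of every method call."""

    def __init__(self):
        self.names = []

    def visit_Literal(self, node):
        pass  # literals carry no method names

    def visit_MethodCall(self, node):
        self.names.append(node.name)
        for arg in node.args:
            arg.accept(self)  # continue the traversal into the arguments


tree = MethodCall("println", [MethodCall("getString", [Literal("title")])])
collector = MethodNameCollector()
tree.accept(collector)
print(collector.names)
```

Because the tree classes only know how to call accept, new analyses can be added as new visitor classes without touching the generated tree classes, which is why the generated packages stay unchanged while the analysis packages evolve.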




1 JavaCC repository: http://www.cobase.cs.ucla.edu/pub/javacc/











[Diagram: grammar specification -> Java Compiler Compiler (JavaCC) -> Abstract Syntax Tree Generator (ASTG)]


Figure 4-2. Using JavaCC to generate parsers.


The classes inside the packages seekstructures and seekvisitors are written to fulfill
the goals of the SA. The seekstructures and seekvisitors packages have 25 and 10 classes,
respectively, and are subject to change as we add new functionality to the SA module. The
classes inside these packages form the Information Extractor (IEx) of the SA, which was
described in Section 3.1.2.3. The IEx consists of several classes that follow the visitor
design pattern. The execution steps of the IEx and the functionality of some selected
visitor classes are described in the next section.

4.1.2 Execution steps of the information extractor

The semantic analysis process has two main steps. In the first step, SA makes the preparations
necessary for analyzing the source code and forms the control flow graph. The SA driver
accepts the name of the stand-alone Java class file (with the main method) as an
argument. Starting from this Java class file, SA finds all the user-defined Java classes
to be analyzed in the class path and forms the control flow graph. Next, SA parses all
the Java classes in the control flow graph and produces the ASTs of these classes. Then,
the visitor class ObjectSymbolTable gathers variable declaration information for each class
to be analyzed and stores this information in the SymbolTable classes. The SymbolTable
classes are passed to each visitor class and are filled with new information as the SA
process continues.

In the second step, SA identifies the variables used in input, output, and SQL
statements. SA uses the ObjectSlicingVars visitor class to identify slicing variables.
The lists of all input, output, and database-related statements, which are language (Java,
JDBC) specific, are stored in the InputOutputStatements and SqlStatements classes. To
analyze additional statements, or to switch to another language, all we need to do is
add or update statement names in these classes. When a visitor class encounters
a method statement while traversing the AST, it checks these lists to find out whether the
method is an input, output, or database-related statement.

SA finds and parses SQL strings embedded inside the source code. SA uses the
ObjectSQLStatement visitor class to find and parse SQL statements. While the visitor
traverses the AST, it constructs the values of variables that are of String type. When
a String variable or a string literal is passed as a parameter to an SQL execute
method (e.g., executeQuery(queryStr)), this visitor class parses the string and constructs
the AST of the SQL string. It then uses the visitor class ObjectSQLParse to
extract information from the SQL statement and
stores this information in a class named ResultsetSQL. The information gathered from
SQL statements and input/output methods is used to construct relations between a database
schema element and the text denoting the possible meaning of that schema element.
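The idea of tracking String values until they reach an SQL execute method can be illustrated with a deliberately simplified Python sketch. Everything here is an assumption made for illustration: statements are given as pre-digested tuples rather than a Java AST, the SQL "parser" is a single regular expression that only handles a plain SELECT ... FROM, and the table and column names are invented. The prototype instead builds a full AST of the SQL string with a JavaCC-generated parser.

```python
import re

EXECUTE_METHODS = {"executeQuery"}  # assumed list of SQL-executing method names


def resolve(expr, strings):
    """Evaluate a concatenation of string literals and known String variables.

    expr is a list of ("lit", text) or ("var", name) terms.
    """
    return "".join(strings[value] if kind == "var" else value for kind, value in expr)


def extract_queries(statements):
    """Track String assignments and collect the SQL text passed to execute methods."""
    strings, queries = {}, []
    for stmt in statements:
        if stmt[0] == "assign":  # ("assign", variable, expr)
            strings[stmt[1]] = resolve(stmt[2], strings)
        elif stmt[0] == "call" and stmt[1] in EXECUTE_METHODS:  # ("call", method, expr)
            queries.append(resolve(stmt[2], strings))
    return queries


def parse_select(sql):
    """Very rough stand-in for an SQL parser: column list and first table only."""
    match = re.match(r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)", sql, re.IGNORECASE)
    if match is None:
        return None
    columns = [column.strip() for column in match.group(1).split(",")]
    return {"columns": columns, "table": match.group(2)}


statements = [
    ("assign", "cols", [("lit", "Code, Name")]),
    ("assign", "queryStr", [("lit", "SELECT "), ("var", "cols"), ("lit", " FROM Course")]),
    ("call", "executeQuery", [("var", "queryStr")]),
]
queries = extract_queries(statements)
parsed = parse_select(queries[0])
print(queries[0], parsed)
```

The extracted column and table names are what later get linked to the descriptive text printed next to the corresponding values.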

Besides analyzing application source code written in Java, SA can also analyze report
design templates represented in XML. The Report Template Parser (RTP) component of the
SA uses the Simple API for XML (SAX)2 to parse report templates.

The outcome of the IEx is written into report ontology instances represented in OWL.
The Report Ontology Writer (ROW) uses the OWL API3 to write the semantic information into
OWL ontologies.
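Event-based parsing of a report template with SAX can be sketched as follows, using Python's standard `xml.sax` module. The template layout and the `label` element name are hypothetical; real report design templates (for example those produced by a reporting tool) use their own, richer vocabulary.

```python
import xml.sax


class LabelHandler(xml.sax.ContentHandler):
    """Collects the text of every <label> element (a hypothetical element name)."""

    def __init__(self):
        super().__init__()
        self.labels = []
        self._in_label = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "label":
            self._in_label = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so buffer until the end tag.
        if self._in_label:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "label":
            self.labels.append("".join(self._buffer).strip())
            self._in_label = False


# Made-up template fragment for illustration only.
TEMPLATE = b"""<report>
  <header><label>Course Code</label></header>
  <detail><label>Instructor</label></detail>
</report>"""

handler = LabelHandler()
xml.sax.parseString(TEMPLATE, handler)
print(handler.labels)
```

SAX suits this task because the parser streams through the template once and the handler keeps only the descriptive texts it cares about.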



2 Simple API for XML: http://www.saxproject.org/

3 OWL API: http://owl.man.ac.uk/api.shtml










4.2 Schema Matching by Analyzing ReporTs (SMART) Prototype

We have implemented the SMART schema matcher prototype using the Java language.
There are 46 classes in five different packages. The total size of the Java classes is 500 KB
(approximately 10,000 lines).

We also wrote a Perl program to find similarity scores between word pairs by using
the WordNet similarity library [73].4 To assess similarity scores between texts, we first
eliminate stop words (e.g., a, and, but, to, by) and convert plural words to singular words.
We convert plural words to singular words5 because the WordNet Similarity functions return
similarity scores between singular words.
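The preprocessing step can be sketched in Python as follows. This is a rough illustration, not the prototype's code: the stop-word list extends the examples given in the text with two assumed entries ("of", "the"), and `singularize` is a naive suffix rule standing in for a real stemmer such as PlingStemmer, so it will mishandle many English plurals.

```python
import re

# "a, and, but, to, by" come from the text; "of" and "the" are added assumptions.
STOP_WORDS = {"a", "and", "but", "to", "by", "of", "the"}


def singularize(word):
    """Naive plural-to-singular rule; a stand-in for a real stemmer."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    """Lower-case, tokenize, drop stop words, and singularize the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [singularize(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("Names of the Courses"))
print(preprocess("Categories and Class"))
```

The resulting word lists are what a word-level similarity function would then compare pair by pair.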

The SMART prototype also uses the Simple API for XML (SAX) library to parse
XML files and the OWL API to read OWL report ontology instances into internal Java
structures.

The COMA++ framework enables external matchers to be included
through an interface. We have integrated our SMART matcher into the COMA++
framework as an external matcher.





















4 WordNet Semantic Similarity Library: http://search.cpan.org/dist/WordNet-Similarity/

5 We are using the PlingStemmer library written by Fabian M. Suchanek:
http://www.mpi-inf.mpg.de/~suchanek/









CHAPTER 5
TEST HARNESS FOR THE ASSESSMENT OF LEGACY INFORMATION
INTEGRATION APPROACHES (THALIA)

Information integration refers to the unification of related, heterogeneous data from

disparate sources, for example, to enable collaboration across domains and enterprises.

Information integration has been an active area of research since the early 1980s and
has produced a plethora of techniques and approaches to integrate heterogeneous information.

Determining the quality and applicability of an information integration technique has

been a challenging task because of the lack of available test data of sufficient richness and

volume to allow meaningful and fair evaluations. Researchers generally use their own test

data and evaluation techniques, which are tailored to the strengths of the approach and

often hide any existing weaknesses.

5.1 THALIA Website and Downloadable Test Package

While working on this research, we saw the need for a testbed and benchmark
providing test data of sufficient richness and volume to allow meaningful and fair
evaluations of information integration approaches. To answer this need, we developed
the THALIA1 (Test Harness for the Assessment of Legacy information Integration Approaches)
benchmark. We show a snapshot of the THALIA website in Figure 5-1. THALIA provides
researchers with a collection of over 40 downloadable data sources representing University
course catalogs, a set of twelve benchmark queries, as well as a scoring function for
ranking the performance of an integration system [47, 48].

The THALIA website also hosts cached web pages of University course catalogs. The
downloadable packages contain data extracted from these web pages. Figure 5-2 shows an
example cached course catalog of Boston University hosted on the THALIA website. On the
THALIA website, we also provide the ability to navigate between extracted data and




1 URL of the THALIA website: http://www.cise.ufl.edu/project/thalia.html
















Figure 5-1. Snapshot of Test Harness for the Assessment of Legacy information Integration
Approaches (THALIA) website.



corresponding schema files that are in the downloadable packages. Figure 5-3 shows the
XML representation of Boston University's course catalog and the corresponding schema file.
Downloadable University course catalogs are represented using well-formed and valid XML
according to the extracted schema for each course catalog. Extraction and translation
from the original representation was done using a source-specific wrapper, which preserves
the structural and semantic heterogeneities that exist among the different course catalogs.


Figure 5-2. Snapshot of the computer science course catalog of Boston University.


5.2 DataExtractor (HTMLtoXML) Opensource Package

To extract the source data provided in the THALIA benchmark, we enhanced and
used the Telegraph Screen Scraper (TESS)2 source wrapper developed at UC Berkeley.
The enhanced version of TESS, DataExtractor (HTMLtoXML), can be obtained from the
SourceForge website3 along with the 46 examples used to extract the data provided in
THALIA. The DataExtractor (HTMLtoXML) tool provides added functionality over the TESS
wrapper, including the capability of extracting data from nested structures. It extracts data
from an HTML page according to a configuration file and puts the data into an XML file
according to a specified structure.
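The general idea of configuration-driven extraction from HTML into XML can be illustrated with the Python sketch below. It is not the DataExtractor or TESS implementation, and the configuration format shown (a dictionary of regular expressions), the HTML fragment, and all element names are invented for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical configuration: which pattern delimits a record and which
# patterns pull individual fields out of each record.
CONFIG = {
    "root": "catalog",
    "record": "course",
    "row_pattern": r"<tr>(.*?)</tr>",
    "fields": {
        "title": r'<td class="title">(.*?)</td>',
        "instructor": r'<td class="instructor">(.*?)</td>',
    },
}


def extract(html, config):
    """Apply the configured patterns to the HTML and return an XML string."""
    root = ET.Element(config["root"])
    for row in re.findall(config["row_pattern"], html, re.DOTALL):
        record = ET.SubElement(root, config["record"])
        for field, pattern in config["fields"].items():
            match = re.search(pattern, row, re.DOTALL)
            if match:  # missing fields are simply left out of the record
                ET.SubElement(record, field).text = match.group(1).strip()
    return ET.tostring(root, encoding="unicode")


html = (
    '<table><tr><td class="title">Intro to Databases</td>'
    '<td class="instructor">Smith</td></tr></table>'
)
xml_text = extract(html, CONFIG)
print(xml_text)
```

Keeping the patterns in a configuration rather than in code is what allows one extractor to serve many differently laid-out catalog pages.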




2 TESS: http://telegraph.cs.berkeley.edu/tess/

3 URL of DataExtractor (HTMLtoXML): http://sourceforge.net/projects/dataextractor


Figure 5-3. Extensible Markup Language (XML) representation of Boston University's
course catalog and corresponding schema file.



5.3 Classification of Heterogeneities

Our benchmark focuses on syntactic and semantic heterogeneities since we believe
they pose the greatest technical challenges to the research community. We have chosen
course information as our domain of discourse because it is well known and easy to
understand. Furthermore, there is an abundance of publicly available data sources that
allowed us to develop a testbed exhibiting all of the syntactic and semantic heterogeneities
that we have identified in our classification. We list our classification of heterogeneities
below. We have also listed these classifications in [48] and in the downloadable package on
the THALIA website along with examples from the THALIA benchmark and corresponding
queries.












1. Synonyms: Attributes with different names that convey the same meaning. For
example, 'instructor' vs. 'lecturer'.

2. Simple Mapping: Related attributes in different schemas differ by a mathematical
transformation of their values. For example, time values using a 24 hour vs. 12 hour
clock.

3. Union Types: Attributes in different schemas use different data types to represent
the same information. For example, course description as a single string vs. a complex
data type composed of strings and links (URLs) to external data.

4. Complex Mappings: Related attributes differ by a complex transformation of their
values. The transformation may not always be computable from first principles. For
example, the attribute 'Units' represents the number of lectures per week vs. a textual
description of the expected work load in the field 'credits'.

5. Language Expression: Names or values of identical attributes are expressed in
different languages. For example, the English term 'database' is called 'Datenbank'
in the German language.

6. Nulls: The attribute (value) does not exist. For example, some courses do not have
a textbook field or the value for the textbook field is empty.

7. Virtual Columns: Information that is explicitly provided in one schema is only
implicitly available in the other and must be inferred from one or more values. For
example, course prerequisites is provided as an attribute in one schema but exists
only in comment form as part of a different attribute in another schema.

8. Semantic incompatibility: A real-world concept that is modeled by an attribute
does not exist in the other schema. For example, the concept of student classification
('freshman', 'sophomore', etc.) at American Universities does not exist in German
Universities.

9. Same attribute in different structure: The same or related attribute may be
located in different positions in different schemas. For example, the attribute Room
is an attribute of Course in one schema while it is an attribute of Section, which in
turn is an attribute of Course, in another schema.

10. Handling sets: A set of values is represented using a single, set-valued attribute
in one schema vs. a collection of single-valued attributes organized in a hierarchy in
another schema. For example, a course with multiple instructors can have a single
attribute instructors or multiple section-instructor attribute pairs.

11. Attribute name does not define semantics: The name of the attribute does
not adequately describe the meaning of the value that is stored there.











12. Attribute composition: The same information can be represented either by
a single attribute (e.g., as a composite value) or by a set of attributes, possibly
organized in a hierarchical manner.
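As a concrete instance of the "Simple Mapping" heterogeneity listed above, the following Python sketch converts 12-hour clock values to 24-hour values. The function name and the accepted input format (for example "3:30PM") are assumptions made for this illustration; actual catalog data may encode times differently.

```python
import re


def to_24_hour(value):
    """Convert a 12-hour time such as '3:30PM' to 24-hour 'HH:MM' form."""
    match = re.fullmatch(r"\s*(\d{1,2}):(\d{2})\s*([AaPp][Mm])\s*", value)
    if match is None:
        raise ValueError(f"unrecognized 12-hour time: {value!r}")
    hour = int(match.group(1)) % 12  # 12AM maps to 0, 12PM to 12 after the shift
    minute = int(match.group(2))
    if match.group(3).upper() == "PM":
        hour += 12
    return f"{hour:02d}:{minute:02d}"


print(to_24_hour("3:30PM"))
print(to_24_hour("12:00 am"))
```

The mapping is purely mathematical, which is what separates this category from the "Complex Mappings" category, where no such closed-form transformation is available.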

5.4 Web Interface to Upload and Compare Scores

The THALIA website offers a web interface for researchers to upload their results for each
heterogeneity listed above. The web interface accepts several measures, such as size
of specification, number of mouse clicks, and size of program code, to evaluate the effort
the approach spends to resolve the heterogeneity. The uploaded scores can be viewed
by anybody visiting the website of the THALIA benchmark. This helps researchers
compare their approach with others. Figure 5-4 shows the scores uploaded to the THALIA
benchmark for the Integration Wizard (IWiz) Project at the University of Florida.





Figure 5-4. Scores uploaded to Test Harness for the Assessment of Legacy information
Integration Approaches (THALIA) benchmark for Integration Wizard (IWiz)
Project at the University of Florida.









While THALIA is not the only data integration benchmark,4 what distinguishes
THALIA is the fact that it combines rich test data with a set of benchmark queries
and an associated scoring function to enable the objective evaluation and comparison of
integration systems.

5.5 Usage of THALIA

We believe that THALIA not only simplifies the evaluation of existing integration
technologies but also helps researchers improve the accuracy and quality of future
approaches by enabling more thorough and more focused testing. We have used THALIA
test data for the evaluation of our SM approach as described in Section 6.1.1. We are
also happy to see that it is being used as a source of test data and as a benchmark by
researchers [11, 74, 100] and in graduate courses.5



























4 A list of Data Integration Benchmarks and Test Suites can be found at
http://mars.csci.unt.edu/dbgroup/benchmarshm

5 URL of the graduate course at the University of Toronto using THALIA is
http://www.cs.toronto.edu/~miller/cs2525










CHAPTER 6
EVALUATION

We evaluate our approach using the prototype described in Chapter 4. In the
following sections, we first describe our test data sets and our experiments. We then
compare our results with other techniques and present a discussion of the results.

6.1 Test Data

The test data sets have two main components: the schema of the data source and the reports
presenting the data from the data source. We used two test data sets. The first test data
set is from the THALIA data integration testbed. This data set has 10 schemas. Each schema
of the THALIA test data set has one report, and the report covers all schema elements of
the corresponding schema. The second test data set is from the University of Florida registrar
office. This data set has three schemas. Each schema of the UF registrar test data set has 10
reports, and the reports do not cover all schema elements of the corresponding schema.
The first test data set from THALIA is used to see how the SMART approach performs when
the entire schema is covered by reports, and the second test data set from UF is used to
see how the SMART approach performs when the entire schema is not covered by reports.
The test data set from UF also enables us to see the effect of having multiple reports for
one schema. In the following subsections, we give detailed descriptions of the schemas and
reports of these test data sets.

6.1.1 Test Data Set from THALIA testbed

The first test data set is from the THALIA testbed [48]. THALIA offers 44+ different
University course catalogs from computer science departments worldwide. Each catalog
page is represented in HTML. THALIA also offers the data and schema of each catalog page.
We explained the details of the THALIA testbed in Chapter 5.

For the scope of this evaluation, we treat each catalog page (in HTML) as a
sample report from the corresponding University. We selected 10 university catalogs
(reports) from THALIA that represent different report design practices. For example,
























Figure 6-1. Report design practice where all the descriptive texts are headers of the data.


we give two examples of these report design practices in Figures 6-1 and 6-2. Figure 6-1

shows the course scheduling report of Boston University and Figure 6-2 shows the course

scheduling report of Michigan State University.





Course: CSE101 Computing Concepts and Competencies
Semester: Fall of every year. Spring of every year. Summer of every year.
Credits: Total Credits: 3 Lecture/Recitation/Discussion Hours: 2 Lab Hours: 2 3(2-2)
Description: Core concepts in computing including information storage, retrieval,
management, and representation. Applications from specific disciplines.
Applying core concepts to design and implement solutions to various focal
problems, using hardware, multimedia software, communication and networks.
Semester Alias: CPS 100, CPS 130
Course: CSE103 Introduction to Databases in Information Technology
Semester: Fall of every year. Spring of every year. Summer of every year.


Figure 6-2. Report design practice where all the descriptive texts are on the left hand side
of the data.


The sizes of the schemas in the THALIA test data set vary between 5 and 13 elements, as listed
in Table 6-1. We stored the data and schemas for each selected university in a MySQL 4.1
database. When we pair 10 schemas, we have 45 different pairs of schemas to match.
These 45 schema pairs have 2576 possible combinations of schema elements. We
manually determined that 215 of these possible combinations are real matches. We use these
manual mappings to evaluate our results.
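The pair and combination counts above follow directly from the schema sizes listed in Table 6-1, as the following Python check illustrates. The variable names are ours; the sizes are copied from the table.

```python
from itertools import combinations

# Number of schema elements per university, as listed in Table 6-1.
SCHEMA_SIZES = {
    "University of Arizona": 5,
    "Brown University": 7,
    "Boston University": 7,
    "California Institute of Technology": 5,
    "Carnegie Mellon University": 9,
    "Florida State University": 13,
    "Michigan State University": 8,
    "New York University": 7,
    "University of Massachusetts Boston": 8,
    "University of New South Wales, Sydney": 7,
}

# Every unordered pair of distinct schemas is one matching task.
pairs = list(combinations(SCHEMA_SIZES.values(), 2))
num_pairs = len(pairs)

# Each pair contributes (size of first schema) x (size of second schema)
# candidate element combinations.
num_candidates = sum(a * b for a, b in pairs)

print(num_pairs, num_candidates)
```

With 215 real matches among these candidates, only a small fraction of the element combinations are true correspondences, which is what makes the matching task non-trivial.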









Table 6-1. The 10 university catalogs selected for evaluation and the sizes of their schemas.
University Name                          # of Schema Elements
University of Arizona                    5
Brown University                         7
Boston University                        7
California Institute of Technology       5
Carnegie Mellon University               9
Florida State University                 13
Michigan State University                8
New York University                      7
University of Massachusetts Boston       8
University of New South Wales, Sydney    7


We recreated each report (catalog page) from THALIA by using two methods. One
method uses Java Servlets and the other uses the Eclipse Business Intelligence and
Reporting Tool (BIRT).1 The Java Servlet application corresponding to a course catalog fetches
the relevant data from the repository and produces the HTML report. Report templates
designed with the BIRT tool likewise fetch the relevant data from the repository and produce the
HTML report. When the SMART prototype is run, it analyzes the Java Servlet code and
report templates to extract semantic information.

6.1.2 Test Data Set from University of Florida

The second test data set contains student registry information from the University of
Florida. We contacted several offices at the University of Florida to obtain test data sets.2
We first contacted the College of Engineering. After several meetings and discussions,
the College of Engineering agreed to give us the schemas and the report design templates
without any data. In fact, we were not after the data because our approach works without
the need for the data. The College of Engineering forms and uses the data set that we
obtained after several months as follows. The College of Engineering runs a batch program




1 http://www.eclipse.org/birt/

2 I would like to thank Dr. Joachim Hammer for his extensive efforts in reaching out to
several departments and organizing meetings with staff to gather test data sets.










every first day of the week and downloads data from the legacy DB2 database of the Registrar
office. The DB2 database of the Registrar office is a hierarchical database. The College of
Engineering stores the data in relational MS Access databases. The College of Engineering
extracts a subset of the database of the Registrar office and uses the same attribute and
table names in the MS Access database as in the database of the Registrar office.
The College of Engineering creates subsets of this MS Access database and runs its
reports on these MS Access databases. Figure 6-3 shows the conceptual view of the
architecture of the databases in the College of Engineering.3


[Diagram: University Registrar Office database; College of Engineering copy of a subset; reports]


Figure 6-3. Architecture of the databases in the College of Engineering.


We also contacted the UF Bridges office. Bridges is a project to replace the
university business computer systems, called legacy systems, with new web-based, integrated




3 I would like to thank James Ogles from the College of Engineering for his time to
prepare the test data and for answering our questions regarding the data set.










systems that provide real-time information and improve university business processes.4
The UF Bridges project also redesigned the legacy DB2 database of the Registrar office
for MS SQL Server. We obtained the schemas but again could not access the associated data
because of privacy issues.5

Finally, we reached the Business School.6 The Business School stores its data in
MS SQL Server databases. Its schema is based on the Bridges office schema; however,
it uses different naming conventions. New structures are added to the schemas when
needed.

Table 6-2. Portion of a table description from the College of Engineering, the Bridges
Project and the Business School schemas.
The College of Eng.    The Bridges Office           The Business School
trans2                 PS_UF_CREC_COURSE            t_CREC
UFID VARCHAR(9)        UF_UFID VARCHAR(9)           UFID varchar(9)
CNum VARCHAR(4)        UF_AUTO_INDEX INTEGER        Term varchar(6)
Sect VARCHAR(4)        UF_TERM_CD VARCHAR(5)        CourseType varchar(1)
CT CHAR                UF_TYPE_DESC VARCHAR(40)     Section varchar(4)


The schemas from the College of Engineering, the Bridges Office, and the Business
School are semantically related; however, they exhibit different syntactic features. The
naming conventions and sizes of the schemas are different. The College of Engineering uses
the same names for schema elements as in the Registrar's database. The schema
element names often contain abbreviations that are mostly impossible to guess.
The Bridges office uses a more descriptive naming convention for schema elements. The
schema elements (i.e., column names) in the schema of the Business School have the most
descriptive names. However, the table names in the schema of the Business School use




4 http://www.bridges.ufl.edu/about/overview.html

5 I also would like to acknowledge the help of Mr. Warren Curry from the Bridges office
in obtaining the schemas.

6 I also would like to acknowledge the help of Mr. John C. Holmes from the Business
School in obtaining the schemas.