
Performance Modeling and Analysis of Multicast Infrastructure for High-Speed Cluster and Grid Networks

2fe32e1623ba5a968a9e31abf8b4f77eba90f5f6
69419 F20110112_AABEIF oral_h_Page_126.jpg
7bcbf5265b7985be570c1f585628d5e4
ce7ec3d6338834f855700e77fafa6196d0b33a36
930 F20110112_AABEHQ oral_h_Page_111.txt
cfaea337cfc33d97b43936387d7b0f26
25257b1838e1fe27e9d55ffa94102e4c88340d82
7699 F20110112_AABEIG oral_h_Page_002.QC.jpg
99da93cc182873fa85ebafea6a614970
91198a587d952135278a798bc72294f9c28af9ac
F20110112_AABEHR oral_h_Page_065.txt
02b9457d5751238593d878768dd48314
e30a0a33b7eccbd973637dd37e096bf5c6a430ac
36329 F20110112_AABEIH oral_h_Page_134.pro
83dcb3ec215c3ab0d58c286eadb76744
6073916cf540aacb54ad6400f0d372410e64aad4
6442 F20110112_AABEHS oral_h_Page_061thm.jpg
d9948dd3b7bb6f26ba5419feaa100af6
23349c1319b11ae8f84b57b7dfb0289916a5a549
23537 F20110112_AABEII oral_h_Page_121.QC.jpg
3f05ea8c1e4f9c9fa988fff8e703a686
bb10fab813e4555c0425256309145cc9a715cfbb
110268 F20110112_AABEHT oral_h_Page_135.jp2
105b0f1735e7df8abb8787f6879a651a
3112b7b8631308b44cf8710353d6077e708cfe1f
F20110112_AABEIJ oral_h_Page_111.tif
721a913e8ea5997950f75e15196de994
e33880d1701cd69334b6545503cd2e959d0d7132
F20110112_AABEHU oral_h_Page_028.txt
51ceaf1d5921eac9e91b81db09e595ee
9b97d01ac126948574fd922aa7fcd3e46d03feac
F20110112_AABEIK oral_h_Page_030.tif
29867ca92f4b1d8e8bc5e891ee93bb40
214892a3b1c7f1a7d27c6145e20dafe230f25d50
74576 F20110112_AABEHV oral_h_Page_015.jpg
7b4cea31e4a7df00139477de30051556
a6c5725047596c9397a940cf4413766bd3b414ed
6353 F20110112_AABEIL oral_h_Page_056thm.jpg
6e92c53aca396495712514581dc67331
99a461ceb6d17f3b1ad325173f50736bb55c60e3
942618 F20110112_AABEHW oral_h_Page_020.jp2
af3562bec8f855d37770b24de5b5d303
e08324498ec649573e38052dd4d18c76953516b0
44820 F20110112_AABEJA oral_h_Page_070.pro
a1935cc3aa9754afbbdbcfeb30668123
e67d98ff7d8371c39410b08cebc09360a2ef7836
19495 F20110112_AABEHX oral_h_Page_117.QC.jpg
54b172499b4473a71358afc3939fec0e
15772f311b07ceb6a3d8627841edbfe8470fd8ed
6338 F20110112_AABEJB oral_h_Page_134thm.jpg
877eb7f06397420d095c31c2d27f6d65
5bc1e0fbc681d71efce3ba328c0269637da3e3a0
41575 F20110112_AABEIM oral_h_Page_071.pro
10e922ec148750ed77dbef38cce94cb3
f15187d94662756e0a983ac94e20fcbc35d9f825
F20110112_AABEHY oral_h_Page_096.tif
740fb9ce5ee1be6212ce3a58b155e7f7
5762f6d5014dc82c6d214112df8ce29fb5a438ce
41253 F20110112_AABEJC oral_h_Page_019.jpg
c1a5c0ae844f229679a7c9b6de674ea6
db753b2fa8318b6b21e7d332bc584a262b8d6fb1
F20110112_AABEIN oral_h_Page_144.tif
0abb50080725898bff751981d2510dee
a2555d323baaa879387ab7e22c8cc51b2481dd9d
65308 F20110112_AABEHZ oral_h_Page_039.jpg
504709fb2c5ca5f0714b92bc035a075d
d0b29de7b1f5e3c2144b5664d52c5d13391d847d
1524 F20110112_AABEJD oral_h_Page_119.txt
4d44cc9ba21ecad0d23655b29367b370
f83d433d9e7ebae8823cca5054f23906e3efe52c
21723 F20110112_AABEIO oral_h_Page_073.QC.jpg
94d6aeb04ecae7dec88af6d1455f1e55
20cd5c88d32927f607e7878ba6cfccf31a014cbd
73762 F20110112_AABEJE oral_h_Page_074.jpg
fc2188ae4fb79cc5c6f54bef86f0851f
190cdaa0358b89f42d10b2f0421063eb268d37a1
22051 F20110112_AABEIP oral_h_Page_132.QC.jpg
3c7f96a48d9d454a73f3f623bca761ef
acddda286afeff6af5508709d60839c67cd6299a
49626 F20110112_AABEJF oral_h_Page_065.pro
bf0c0306a00bb8f49ab5d40937c3400a
fb798b9e71d9288f904cfe4f5677b3328c5d1955
F20110112_AABEIQ oral_h_Page_065.tif
9b71bae3283760902dc6a718114f3d7d
f268079ff359faf0033b2f8d99027460fae3f9e4
45164 F20110112_AABEJG oral_h_Page_143.jpg
e82d7ef275d804749fb07a295f850a3a
7c1fea749fe8f89bd2e397b1e3d6ba016f43ab00
7488 F20110112_AABEIR oral_h_Page_138.QC.jpg
caba427d2e9dbe73cdc4a00e6d0356fb
b2a617ae734d9b6e63113869279872683f11276f
72909 F20110112_AABEJH oral_h_Page_037.jp2
003ae15ac446edcc6c37a1fed9010b7b
d897b7093c8215adee258d4039808b26092ba916
6750 F20110112_AABEIS oral_h_Page_146thm.jpg
2085e6574a86f1aacb00154e1077363a
13c7bbf7213a48b9d993606c0faae3af2dbc2773
65246 F20110112_AABEJI oral_h_Page_022.jpg
5a116af692e65610703ee41fc8155d64
7d6f1ae4f110a75226532b65dc835652844d11a3
1848 F20110112_AABEIT oral_h_Page_073.txt
d705b027481436f40e61381853ce977c
1cf92fee422d52fe4c91631afabb302673f5ea00
F20110112_AABEJJ oral_h_Page_132.txt
13cdc3a40e3dba720a4156df93dffaf1
237cada434763e4b8364862b860d0e02db426fb9
107706 F20110112_AABEIU oral_h_Page_096.jp2
435f8dd58fe8b665c1abdac42a4c25de
e31ba28e9218c1dc20fd48ea03a8a9fd68148f6a
61340 F20110112_AABEJK oral_h_Page_032.jpg
f657bf58bbe6d1eb2b4e09c339709106
612655ea2841d54bd3b39c3ddaca71f398fd67e7
49464 F20110112_AABEIV oral_h_Page_090.pro
3e9199047421d94da5b9f557a46d0008
4f9f3898bafe2952942c0e516386f8fd322b4917
108221 F20110112_AABEJL oral_h_Page_048.jp2
9fe7977027579789399f2bf805e0f170
005f61bb0e76c38ed02cac201b01cb7529eadbf2
49487 F20110112_AABEIW oral_h_Page_144.pro
b1bef0347c03faafb121b6415aa18bd5
5e6bc0f98e5a98e63f0afe1453e25becbf10717f
951109 F20110112_AABEJM oral_h_Page_124.jp2
6f1e7874f44ff1cd80c35491a2e237ce
fd0b791fb2e2701d9b5157786035424d0dc9d36f
56563 F20110112_AABEIX oral_h_Page_043.jpg
2bac6c0034eb684f2f1d9776288e0168
29614425e1a7d59bb9e074560ced61bb98ff7b7d
67460 F20110112_AABEKA oral_h_Page_020.jpg
039318b39d1c99665e5abf7a5839b134
d72a3bf742cec67d6bb2178e83264edecb5beb2f
103124 F20110112_AABEIY oral_h_Page_049.jp2
a89ee009de61596bff1b9e85211dfc55
3b4113708ab292e7d2ac96946495fdef1f794591
114231 F20110112_AABEKB oral_h_Page_064.jp2
e5bb58f0c8deb9f81afa5bcab6cf5bef
745bf77a1fd42a67f4f810c89b9b3e3cf190c194
71669 F20110112_AABEJN oral_h_Page_141.jpg
5f789c86213b0ef0b57ef596bb4e995b
37b159225997b26de431a29c9e54b1bab37cd63f
97804 F20110112_AABEIZ oral_h_Page_070.jp2
517776cfdeee17e590379dd64dffff35
e701ae7ada624a9371494f76be40126e32920043
28013 F20110112_AABEKC oral_h_Page_044.pro
eca2e1fc774f3a2400ab41105956911d
92a074f852f1f04f60391f5462423acfbf9b0421
16506 F20110112_AABEJO oral_h_Page_081.QC.jpg
6b341b2306fab6ab9c295d9a41c80dc4
28f7b2ea10318f162c446120b75f069c93567a7b
6193 F20110112_AABEKD oral_h_Page_125thm.jpg
b6d917e5f81b03a0a2828976a991ae3e
0f9e108102c9086f982f17665b7095f649bfae2a
F20110112_AABEJP oral_h_Page_117.tif
e4b8103b6ca62577381f25e616f106e5
fe0f3f37f102f5e3b7eabcc8e7bc301f814c9373
60916 F20110112_AABEKE oral_h_Page_045.jpg
dd63831b7eb3efe38ef5509216977283
db3c9e2cce2dd55e51c095ceb67f4cc338d207a4
35707 F20110112_AABEJQ oral_h_Page_067.pro
34a2f6984573d8f2307a56411b31b108
9fe73be761acbdcea84c33cc98d2ea44d19f617d
64754 F20110112_AABEKF oral_h_Page_133.jpg
6978944ef24c629c5397b48e2435a159
6aa70f653f05bc72205dc9333e72a65df29f18f6
111978 F20110112_AABEJR oral_h_Page_074.jp2
d8b2e2ed9ba295936f768dbff40a676a
45c118b1b012cf0746eab2075ef50475d877c6b5
67264 F20110112_AABEKG oral_h_Page_112.jpg
70e5295d607ff705635ac98f9e7e9ae9
e6f2bd8fd146601552665f0c8fc224da5b4c0810
26074 F20110112_AABEJS oral_h_Page_145.QC.jpg
0fab34e7820ca8d13bbf959b5157dee9
eda9ca29a61f5dbc1cdef3148af0233810f9389a
5770 F20110112_AABEKH oral_h_Page_032thm.jpg
18d6481c3798cde30b9bf4a29b66b690
e56f9b7e9bbf8fe0f803c2c9da5a1a1e2f6831e0
22665 F20110112_AABEJT oral_h_Page_080.QC.jpg
a8100cca083bef9a2436052b5efe9bae
bb8ea3778e57825a24495911eebe78b07478cd3f
F20110112_AABEKI oral_h_Page_126.tif
05137c6307dafd657defa3de348d14cb
9fea8dec444d490221b92a8ea4595ef7fe91b12e
6577 F20110112_AABEJU oral_h_Page_062thm.jpg
31b393271ce73b896291f618fd0b7fef
31c4f3925c33ccadc6c77985ae6c1acae8b03407
2049 F20110112_AABEKJ oral_h_Page_095.txt
4679ec5c75ec4c95d9908f8408037e9e
c1dfdf4196bdaf6aecc64cc4e68a8013034086c5
97527 F20110112_AABEJV oral_h_Page_105.jp2
f304097da7426705027c3138dff80a89
35c9af7dd4d5dfbf39f836c2fd1c5882775b87f9
6471 F20110112_AABEJW oral_h_Page_130thm.jpg
cf53d4d0802d8605f802c34798ab8443
9b1be1f8ce9c0a9880d08ca0755fcc630d486d33
61738 F20110112_AABEJX oral_h_Page_146.pro
c3dbe7b6f3b58ebef3b8cc88bbd4b480
61a1760c0cacedc6a4311bafd3180f8ad713f30a
1698 F20110112_AABEJY oral_h_Page_103.txt
d53bcb902f908134d3c6e8aaa88c315d
9a7f6414c97df5b3f7527dc336ec733396824627
F20110112_AABEJZ oral_h_Page_109.tif
b374c69c79ea0c6a21a974c2c9232ee0
68713c566559b91b15cbad257b18f9a120198a8d
19667 F20110112_AABDKG oral_h_Page_067.QC.jpg
edcb953aca35285529810028a4d272c4
d5b5f37fb898aa1cb6e2db1a59635fdc74a843d9
759 F20110112_AABDKH oral_h_Page_019.txt
3a099dbd63d484518d59b87fbea85dab
780da294d4d448e56bdf05c85572f84c12721038
F20110113_AAAAAA oral_h_Page_041.tif
b3bea443a625c315e53654f10a03c789
88c87340340cf23e61ea0ab7541445555d10ceff
1051970 F20110112_AABDKI oral_h_Page_114.jp2
d4837924886053d07311232624233a9d
827a7bcd2fddd3a38869b49e206f48709057d716
22135 F20110113_AAAAAB oral_h_Page_010.QC.jpg
4006d7254153f8b36f63a6e270f753d5
d5ec96e90ef29d12f91e96e3545ded5aec38a76b
67859 F20110112_AABDKJ oral_h_Page_053.jpg
82c3df7c62aebeb16926795b728d3b2c
75feba6b6c1f08e2d22559c2eb6e280bbe8b5d5f
5514 F20110113_AAAAAC oral_h_Page_067thm.jpg
1efd47732507b52c5b934da4325f843c
fffaf36b4b03546d5309bd4bc09ae6d6f0ff079e
F20110112_AABDKK oral_h_Page_029.tif
ff6182d80e8b339861733435f711ae3d
c8614f42285973d006b12a38e44d5bd61826332e
18354 F20110113_AAAAAD oral_h_Page_109.QC.jpg
66d6c6a3749fe11391eb03d5fc841087
78f08668cfd5d6f7344bb5d8ae8210c416f9371f
26924 F20110112_AABDKL oral_h_Page_129.pro
3ad1976b56bada86df1aee7f2306e939
eac6617cb605b9b9c34e9846261186564d4dda60
1879 F20110112_AABDLA oral_h_Page_123.txt
07c031562c571ef1db9bbf02804d342c
5fcd038c1fa44f630d20ac7e9696d9a7896b3da8
21565 F20110113_AAAAAE oral_h_Page_054.QC.jpg
e989edeaa1a923249186f3e07b28dbfa
115b68e39059d1c8831ac398b06eb0528a32ff78
F20110112_AABDKM oral_h_Page_120.tif
fa9f290f071278dff177eb9524f2f7d4
f4e76b721f67a0c217fa2ff2a404e32dc1bfcd91
2016 F20110112_AABDLB oral_h_Page_033.txt
fbe4a564cd74f679aeebaea43557032c
32abd2ffaec67cfc3a441345a468ff5b53846c9d
6294 F20110113_AAAAAF oral_h_Page_073thm.jpg
1e2f18abd74e3a2555a5b5f0330ed044
47d70dcc59f8e9f188cb4177dfec6fe4465d75e3
95856 F20110112_AABDKN oral_h_Page_017.jp2
823fdbb2109161049423cc103b90e04b
72a493308e6ee57a120ff30965fedaf515614dc2
6483 F20110112_AABDLC oral_h_Page_021thm.jpg
f0d9c012ef0147d58d9e2ff74215924b
848e0e57268a2e9ce21a825a3fab0f606e0bcd49
6652 F20110113_AAAAAG oral_h_Page_135thm.jpg
c24e48c47b4ae1996b6b850dbbe9a7df
b77be6a6abb0e6b7b4859b98960210bf1a8dcf87
13946 F20110112_AABDKO oral_h_Page_051.QC.jpg
ca3cecc31b0de9bc9eb4e210ad765981
734161d02a7cc4dacf2e719c95436f9a84d4a626
2025 F20110112_AABDLD oral_h_Page_015.txt
3bf07a45cd44bb2c469641d02bfcb4fb
099c55f87cb4838c1cafd82f642eb74e13ddf230
F20110113_AAAAAH oral_h_Page_038.tif
cdc290d50f2e63c96662a3ab5f6677df
bab8745dc90bdebd4638f751d6ce2fc5322b1cd0
1974 F20110112_AABDLE oral_h_Page_012.txt
d815e2e4926c242b61dd2313ece1fc92
6c355a9851b2e05190dc6b642f90938f584f67e5
49113 F20110112_AABDKP oral_h_Page_101.pro
e4fb43649f01450d3f280351708331fe
0b4cdcaf81ed4b2243eb0dfeea04206283de1c89
86529 F20110112_AABDLF oral_h_Page_076.jp2
f3b066c7b0b58ecc580ee1bde3b7df6c
91ec427a9b57dfe4e5d5d99963b125983c5e999b
50591 F20110113_AAAAAI oral_h_Page_021.pro
d6fe23f8a31d09cb0436cde63998288c
9d1f45e077c5375f3496d89e4d2b41e0a27a9659
71920 F20110112_AABDKQ oral_h_Page_033.jpg
40af3c6b5d03bc55e674bdce8ec437e1
559c0fbfa9bff8813137f9c896f2711ca16c8bfc
62961 F20110112_AABDLG oral_h_Page_030.jpg
ebfa52dbb5b8fdcb9028412068107f07
e50d586b93e1661079f939c7f0672f2042f74149
74032 F20110113_AAAAAJ oral_h_Page_083.jpg
61ea5d2ac15cda636cafcdecb58d939c
81dd6a67787f91ff2eb1c856476a16b2a16fa3be
11709 F20110112_AABDKR oral_h_Page_138.pro
98dab872d56f43d41e8092a511c70944
5b4b81cb7e6dcd07d7d0d79f5117d0bcc57080e1
758338 F20110112_AABDLH oral_h_Page_034.jp2
20d4e80fe34136d237dcf20f079660de
c2789e3be9ed94ed8c7d542d4e0a0498c09a51a4
72800 F20110113_AAAAAK oral_h_Page_100.jpg
e50bc70fb7c2baa5aed242fd7d81d083
3b50bc48ead60e26816ab6b1fba45d8f149d701a
19415 F20110112_AABDKS oral_h_Page_032.QC.jpg
3c1a056892637ea5f8a85c88edebf05a
63d0c5fcf67007ed6aea9c5b5a2fc09e16e8abee
71097 F20110112_AABDLI oral_h_Page_061.jpg
220911796e4e3a61fba70798871d6faf
6a9be25bbee0c3a9788385f29b19d4396b574f0e
1642 F20110113_AAAAAL oral_h_Page_043.txt
d9f39dcb49233e561bbc6a921ff8275a
218a3d374f0c3545942749ebf92794adc6c5f2e3
1945 F20110112_AABDKT oral_h_Page_113.txt
c63398aacb7026b9cbfb4e6cabaa3c51
4aad1143b8b743e85d5cf612cb34fbbc60e6ac90
F20110113_AAAABA oral_h_Page_025.tif
4a831608bf68214613022a94c3be09ac
afb2e2e807d372056a69f5b5da9702fe34d90979
6190 F20110112_AABDLJ oral_h_Page_092thm.jpg
fdb9fd6fe57384416efcb21e630f155e
b7594e02bb4ed6045edba64744e1ef9eb651949e
1611 F20110113_AAAAAM oral_h_Page_077.txt
a4cd8abbb916e28117db6eae92b48f25
1f4e424cc3b85d3e59bbbb98f206f342a98a7655
58443 F20110112_AABDKU oral_h_Page_036.jpg
3097c144b58c65de66c561d2001b139a
cef13cf4c3bb5d8a678f6fb93164b86e93e8f2bd
23426 F20110113_AAAABB oral_h_Page_100.QC.jpg
476e08debab7194f1bf081c0716d3456
610c4eefaaabc767122648e2d04c946dc589c3a4
46111 F20110112_AABDLK oral_h_Page_066.jpg
d249dda645eb474b5cc5cdbaef185f5f
264db929536bd5a607b477264148bbee36b6cdb1
52418 F20110113_AAAAAN oral_h_Page_074.pro
e09f011cf1deab596e517a625529dd0f
243e33d459ded9ce683cb737865930b1f9df4788
F20110112_AABDKV oral_h_Page_010.tif
d2e4e6c14b83f0a54c1385b153d4a4a2
a6aba6b94d17b18bbe7102ee82507525999abbac
1808 F20110113_AAAABC oral_h_Page_059.txt
2a6868c9eabaea822400ecafa7875a0f
9f42899a5aab4b0bf9d9a27404397d37c0b81073
2010 F20110112_AABDLL oral_h_Page_014.txt
5daa02bcc1a8e1a8a50f7f4ae361a28d
8b1e7dd35b461080b751edbf8f279c8433c754f8
100540 F20110113_AAAAAO oral_h_Page_073.jp2
4fc3c7b9751f905d43dcb206fb2d14c8
95708d95901ed2d1b8867550a69feef7e610a168
16525 F20110112_AABDKW oral_h_Page_094.pro
ac7a0055a3ff5cc1beb503ab66a8cc14
433d5d5b353c4e772e364dac02d2be16208a917b
23262 F20110113_AAAABD oral_h_Page_096.QC.jpg
3cc220d2def82320fb04f17bdd112c8e
696ad51d48681b5ce3abc8ce91d04526a42ef3ce
870449 F20110112_AABDLM oral_h_Page_111.jp2
e3b93b3ebf7f430707de263ccc311e7c
eca3dec7c2f39e41b893462c706eaa5f6f840007
23807 F20110113_AAAAAP oral_h_Page_014.QC.jpg
a51a8151dc8fe2ca55bec2a5e90f0cc2
e0c9167f605ff3571625e8ca669e4b44862da76a
15356 F20110112_AABDKX oral_h_Page_044.QC.jpg
7c2c9c9572ca77594fe2ea89fe1a44d0
5d080f163745f1c95002ba34efebda249590e131
106289 F20110112_AABDMA oral_h_Page_144.jp2
713c651e77e98e7496010186e99c7bec
64d145b9c438a352a129689d7a516dfc433812b6
111959 F20110113_AAAABE oral_h_Page_116.jp2
37f7c64a2a744a52cc9e224f1da3e8df
0821474af4052f3775f3b8972716e08b448e80b5
F20110112_AABDLN oral_h_Page_022.tif
e7717669a6df17a1db2f4b7b4ba2da60
ed0dceee4e35e7754057027eb66b7701df2fe751
1051976 F20110113_AAAAAQ oral_h_Page_007.jp2
f7f76d384da349691f33bc63338a5cb8
cfead9d3988ae3763c849a19e7bfd3202dc78a1a
44223 F20110112_AABDKY oral_h_Page_030.pro
cdc08b0ff37d49bb8cfc3958c2594f4c
01c838fb521bdaf9625a65a6aed24ae2e4280b1a
882169 F20110112_AABDMB oral_h_Page_032.jp2
c76dc8c645caff5e029de45b66f17d85
92531bfe414deacb8c130f3c43c82c3bc8f6803d
6633 F20110113_AAAABF oral_h_Page_083thm.jpg
b4ed35cf9f744579cd0be39bb3a706b9
b4909359b242da146144384ab6996c65416dca5a
4372 F20110112_AABDLO oral_h_Page_063thm.jpg
20c6b40971fb834fd21825015e3b531f
84e5be8b609140cab2a336659fc327950d6f7a29
22620 F20110113_AAAAAR oral_h_Page_058.QC.jpg
70ec0f1ba417e8428a952b4a323ddeb4
5cd3f8d522c038cf11d63d66b06eed57e1d69d8e
44705 F20110112_AABDKZ oral_h_Page_108.jpg
9d02f8600ac61772765765b366c7797b
5c1ca88f19d86f1deda47f0d1f84031ffcb73571
103671 F20110112_AABDMC oral_h_Page_125.jp2
4ad0b034bc1fdcd11482c760282b60d8
4d7610c1794952a01511483ea1058fa53ca1cc17
4174 F20110113_AAAABG oral_h_Page_127thm.jpg
fc5672a32d2023fdba6ac8babf2e50e4
7d2a263487208002725ae25e02c0db765ad6f758
1028272 F20110112_AABDLP oral_h_Page_046.jp2
a679ea22f21955841d3ffd72957af3ed
5958d64904956497a56a51024496cff9147db1ce
4966 F20110113_AAAAAS oral_h_Page_006thm.jpg
2770ca4e73281e4d0cc8189f0add7234
0021a2eb20ac7cc2984775029782eb22a5c007de
64875 F20110112_AABDMD oral_h_Page_143.jp2
44619587fadf1a6449d31ac5e6ab73b6
f8905df95788641fc82b145fff70e8c8ce86c70f
69712 F20110113_AAAABH oral_h_Page_080.jpg
e1228481f7d6b876034db0ae1aec3ede
9d4db9725dbb2eb344e996b20f5581391be2df6b
4507 F20110113_AAAAAT oral_h_Page_108thm.jpg
0af3ee6c79a864ae8f0454816b34b6a1
c14be5b670723ffeb0cb6fb4303d6e235d29021d
65218 F20110112_AABDME oral_h_Page_123.jpg
fdfa4067cf9a50d77bfa64afd1032ce1
8015f283109d67515e5e7e771d47ad66d6d5d7e2
13870 F20110113_AAAABI oral_h_Page_131.QC.jpg
7518853ebfc0a05f2d3d24e7541b70b6
157c997b11e3e8e444f8e827d5f4f758ea3a876c
70819 F20110112_AABDLQ oral_h_Page_065.jpg
9db8ff61208d0bd873dff277050641bc
29a5a900b5a6a65fac8e339b161cd607f5b10da8
1812 F20110113_AAAAAU oral_h_Page_024.txt
6531a1d8bef3c2e4738c00e33253f555
567b9f1c8f167cf840745616eb2cebc95298a447
50535 F20110112_AABDMF oral_h_Page_042.jpg
a02b61b6fae2887a32e0798d5ce7cf38
5ffe84ef0fe88a4fbf7c7e2f7e738f3bb25caea7
5832 F20110112_AABDLR oral_h_Page_099thm.jpg
15aa5a85367ac899fd05db734d055d54
a131c960c0e88294450d3ec88ed8c65ad8e4a636
2423 F20110113_AAAAAV oral_h_Page_001thm.jpg
69b31b166a2aed7b87498f4dff6d2de4
d402537c83d044673fb0d82cf5416d93e77e73a1
6361 F20110112_AABDMG oral_h_Page_010thm.jpg
429fe0cb4da456255e861f53c24b943d
738d678f80aca6c073546e264369aac584dad498
71296 F20110113_AAAABJ oral_h_Page_055.jpg
e4ded2c83356d9e26138bebe7ed99fc7
66ab17cdb29cfc10ab71d06da5e02136f85f488b
70671 F20110112_AABDLS oral_h_Page_113.jpg
f2ac7dbf5d7e69795e14a3a9ec5b3163
e6102fdb7da5c6d5d329bb585f956df28af868d4
24188 F20110113_AAAAAW oral_h_Page_015.QC.jpg
32265fe8d469f1dfd362045e53f76f95
26be7879c0aca70f8694ddf6a2e5ad133846735b
31554 F20110112_AABDMH oral_h_Page_037.pro
a9857b743cee1afb59e1b7cc2edc714b
dfce55f8080e8302fd32fd103e89e62348d7577a
547434 F20110113_AAAABK oral_h_Page_051.jp2
5ee84b4b7bb8c80a942c3e918a47a049
6ee6ce9a183e75bae1be455c7c9304e9ad7fc988
1949 F20110113_AAAAAX oral_h_Page_075.txt
f5910f6bbe70afb4c96dbf051c31c0d5
52871bca5c59eed28e63aafdb53e692d40fecebd
70807 F20110113_AAAACA oral_h_Page_028.jpg
5e7a93dd463fc3e4f30559e6be63b897
fb7b013caba34154e5d16a15a8ceee509540c2aa
24003 F20110112_AABDMI oral_h_Page_062.QC.jpg
efa0ae43a80923cc9d5ba6cf8915b86b
9a3f0a3420d5619083bfef9ede635636ac26c3aa
742 F20110113_AAAABL oral_h_Page_023.txt
fd6304d53c36c17abc761b163ff42b8d
9e66a24d4f7b6c69fc416f4ad00d6f3b63dcc7eb
F20110112_AABDLT oral_h_Page_002.tif
ad7c6dde781e68365cc0219d63707517
64182dd09ce3ed86e15e493d56f462349ea60906
F20110113_AAAAAY oral_h_Page_113.tif
96f06d204b569558946f4868d0e2220d
589bb8305b5924285e3b929802d25a7e5290140a
45859 F20110113_AAAACB oral_h_Page_031.jpg
be1383ef8134dee3fffd4cad88cad764
5fd6df374c00f7529336bdfcb8cb57514bae7c4f
5885 F20110112_AABDMJ oral_h_Page_004thm.jpg
9b7f64472ea276d440a15408b9ffa38a
fcd95a8b356b3fb931b3d6e9c0cc8de7f931d5f2
F20110113_AAAABM oral_h_Page_036.tif
bf633c5936bb01f72280bc1b05689de4
ff0038a586635ca6d523eb304fd637f1045db8ca
F20110112_AABDLU oral_h_Page_128.tif
430d28bbe2993fc4f7a50717cb29958e
3cec61ff2be9f4476d5d3ffb8b1f1e75ac65e9fc
1995 F20110113_AAAAAZ oral_h_Page_096.txt
08da50cd524cdd69a0eb22ee1bedc091
a760d47f1991bc01bf582963f2f7961cf2307996
64560 F20110113_AAAACC oral_h_Page_035.jpg
a9454018f842405d2453f48757e7d180
13e27e279afdfabc7af21046589f1fd1d4e6465e
F20110112_AABDMK oral_h_Page_052.txt
9d803c0bae89ae97981d06f54bb0bb11
028f60011b428801523c1c590c14d9b70f7fbe66
22675 F20110113_AAAABN oral_h_Page_142.QC.jpg
cd92e5b742adf3787e04d9d4b9c61084
4eac99e36cc1ed536cdcdcb4c3fe3a2f27571bbd
5095 F20110112_AABDLV oral_h_Page_077thm.jpg
6f6a527cd11a8b3db4cd63be90b3ab28
93a8678ed3bed8507a5c4027ce3eedc4516b39ec
62196 F20110113_AAAACD oral_h_Page_040.jpg
9b45a69f02aedd5cbb809ce56088fc39
906df824af8317b460019bba90ea59150a908536
28694 F20110112_AABDML oral_h_Page_138.jp2
5464c3f859bc6c2513acd68a9302958b
7c28aaa2ef539b657db9f15bab2ae49970a4d042
220721 F20110113_AAAABO UFE0002422_00001.xml
a287780468ebd4b2107e13b2a10a001c
78ea87c29955d75209f2b76e8a821d0af25cbcbb
135876 F20110112_AABDLW oral_h_Page_147.jp2
fbfb9737ae65f0fdcd55ea5eca187448
ff654e2dfde4fa6d0a397b9e45f307a3f718cbc4
1004 F20110112_AABDNA oral_h_Page_130.txt
22968a99c5feca86132538b04efe4c9d
3789fb103ec0c40ba8720f04597297a18fa73dd1
57110 F20110113_AAAACE oral_h_Page_041.jpg
b83ca40e839c97eb4f5c3ab7efcaf2d7
5da8bfbf18340641aec56aa13088903c5532ef5a
67442 F20110112_AABDMM oral_h_Page_092.jpg
09d06da9980520154549e5716bd1547c
085235a3ba062e8924f495c4e4bd938c237f0abf
46037 F20110112_AABDLX oral_h_Page_016.pro
6788c85593c47e1eef5aa602f0edd407
de8acb86ce14d906c7fb7376fd119dca6340efd5
21238 F20110112_AABDNB oral_h_Page_112.QC.jpg
aed93443e13fbef143eba3d8fc1ec030
564bd911b95f1937e24d1ab58c9796385220abe0
73443 F20110113_AAAACF oral_h_Page_052.jpg
d2f56fb24a4dc955834507615af45559
c4c747c341f7876921283b23abb6d036e2946e41
22604 F20110112_AABDMN oral_h_Page_057.QC.jpg
0cbbd0bec597a7a7040a603bbfd7183e
5ead0824c60b0c58a4496ba7ed4e683e2fdd8236
1030407 F20110112_AABDLY oral_h_Page_081.jp2
4ddb7bdd1d4c19c1f22453939ce53a4f
1e5355ec877ef4d208d5e4a4120e8e61cf9cc242
10419 F20110112_AABDNC oral_h_Page_005.pro
f54954e090677167288682dfdfb682ac
86639e16a3d926b175bb4df7b62b548abcec2524
63855 F20110113_AAAACG oral_h_Page_054.jpg
3b63493ae3171df1ff13eaa69a67bcdd
a6cb7fe0f176fad3703bfc370700d600275db0ab
1931 F20110112_AABDMO oral_h_Page_127.txt
66466fdb1c2d1ab870613d0f6534be70
1526bac45e3065277882e2bb6a0658d806fcab22
24721 F20110113_AAAABR oral_h_Page_001.jpg
eb81c9ad4a82678642d2e04e87bbb4fd
a4a9f053ce6ba31b78e5fdea912f7f00d586c7a5
805923 F20110112_AABDLZ oral_h_Page_094.jp2
c090cb58d7b4c28358bce1a789ab0656
02a7d197edd45b74abe62376e4b91adadb80c269
F20110112_AABDND oral_h_Page_058.tif
e833f9c9c55f35aaf925580f1ac9dc77
2198dfa083a47dcc1d52da7ea06831a134293e30
71771 F20110113_AAAACH oral_h_Page_056.jpg
dfaa055b12a2ba563008d7cd30873c8e
216063e51e5ab8e8ab4c1c4392d2d083cd567d69
F20110112_AABDMP oral_h_Page_103.tif
02d36bb64a9ad1e3f3284e37056e020e
1bcdc12dc63aa9983b236900cbef35a1ba3fbb71
F20110113_AAAABS oral_h_Page_002.jpg
03778b3ca7a716567806e08f9501785f
4dc00b6f16148b921a6926179b46a38253389c93
F20110112_AABDNE oral_h_Page_093.txt
893fa61e9837994af3e58ccbecdf6ffd
aa855968acd1fc106f92438ee0d17e15bf715993
66606 F20110113_AAAACI oral_h_Page_057.jpg
63c502b1d65e43480d9937cb7002473a
f0f577bef7edf8846f2359cf94b9f95f0c988511
6017 F20110112_AABDMQ oral_h_Page_089thm.jpg
fba48a7a74cdef76b97113a437996338
5bc89d3b2475c42979114b45ee72ea92a4a71b7c
66702 F20110113_AAAABT oral_h_Page_003.jpg
8e404ab1e11f5c51c50a0f1e9670f526
06a7337d4f1fcd4a1429b9b0f4b9c43adb8238ee
48976 F20110112_AABDNF oral_h_Page_034.jpg
bb94db625c74d5b7db60f4cdf34e0d8c
289ef86701890be1f76136de6a3f94e0db1e5639
75937 F20110113_AAAACJ oral_h_Page_064.jpg
3a137888ccf9236b8028d0552d19616f
0045d041be956fb396ca637d1952f128ec8f2866
95846 F20110113_AAAABU oral_h_Page_004.jpg
7a3eba5a9a414b2437aeb28ee2bbe592
6d49a8f926b256eb6d4db4556d4a782f791a8260
46876 F20110112_AABDNG oral_h_Page_010.pro
f7f650c1110d460e325dd55e39bf2644
e42837010824c6c2eb13c113c12cfb4720b26d5e
58118 F20110112_AABDMR oral_h_Page_050.jpg
fe9e3cf26222eef704976f8cb2979381
62aa4125c03eb709c3f1479f65eca63ad073ef90
20431 F20110113_AAAABV oral_h_Page_005.jpg
8f28e2fda628aa77af0623be1747896c
143b8a54e89b75ce102c914a14ac22478e9e11ff
14934 F20110112_AABDNH oral_h_Page_063.QC.jpg
9a19378db416341b4615aa18aaef78a1
98c51f5c116ad7491d1301be449aa5614e5019a9
66421 F20110113_AAAACK oral_h_Page_068.jpg
aef7fad100f942237480ae7b9c665c77
d6a716323f9cda7fe6b125aec203a98a8cd318f2
1621 F20110112_AABDMS oral_h_Page_124.txt
8d880647013c85b5278b5b98b0c44df8
11a84ec82a44af141fa2d1b60a73209eb8e5cd33
87040 F20110113_AAAABW oral_h_Page_013.jpg
3ab1e5717db53d08cd0d3a19ad6d3e71
9ff3329b9da9c194ff8befcb16ee85c4d9b1a77b
5922 F20110112_AABDNI oral_h_Page_071thm.jpg
af8e0684902233ee46cf2314881d313b
8b9665b954eda988ea85bdb41f977fd944796692
39084 F20110113_AAAACL oral_h_Page_069.jpg
ac36365588d89bf10a4d0e14e1e9b875
062d146626eada870af658128f19da10adda5529
111438 F20110112_AABDMT oral_h_Page_052.jp2
8b6765eb253fc98136025179727bfb12
e3022fbcbd6247d4c2efb4972ebe607bba3c3d7f
73266 F20110113_AAAABX oral_h_Page_014.jpg
d7ad601e8a277c1f5f28c47b22523830
0ab1c17cd13c52245ad583f1b71a59ea110cbefb
67025 F20110113_AAAADA oral_h_Page_106.jpg
d21b403c693e1e1844c9af2fbc9cae0d
42c538517d85100e1e9a07848c5399dcbe3f73ba
76022 F20110112_AABDNJ oral_h_Page_004.pro
2320878549f4126bfb30d9632e37ed4c
dab87a5e28add6d64477fb1a097e8be20189c77f
55268 F20110113_AAAACM oral_h_Page_072.jpg
406d8cb0f928a0f931492d85d3d1be7e
de6b69d983ee21261b9d4a9e6b73354d86f2941c
63890 F20110112_AABDMU oral_h_Page_098.jpg
58f80a6b70c268d69934d86b3ce0cd7d
7dc909aee598d914db8b6b57a593e69074429a3a
71436 F20110113_AAAABY oral_h_Page_021.jpg
0774fe069b656339cd992b56db4299ab
b8bf578b069df403bd7b12b3542e9b9c1cf430fd
68555 F20110113_AAAADB oral_h_Page_110.jpg
ddec444e4d16eab32ce9a42774a07a70
470c190fc0560469aed4d9d2125630c0d8030dc4
F20110112_AABDNK oral_h_Page_021.tif
531a4b527b6fce75ba3938069e442428
eae11a2245bec135b6c9586858557d7d8958e6d2
56648 F20110113_AAAACN oral_h_Page_076.jpg
7b06d956e2bd71ae3780af34af3e7c1a
1a9ba33a71a1f1f0deb251614d851025cf9c062c
1523 F20110112_AABDMV oral_h_Page_133.txt
005038c2ad5449ad43a61fb3dbabde28
0134f2583f727a16ef3d03e0f614c0e044c41c49
62613 F20110113_AAAABZ oral_h_Page_024.jpg
a12839d88078b8d524cfe0ad93b0cd31
e56757a59098057c58c37d385998ce048e3e65b4
55526 F20110113_AAAADC oral_h_Page_118.jpg
5c599456b293eeca92a8c6be6de76a70
edbc4cdf826e9830b9f2e2affa1f1020190946cf
863958 F20110112_AABDNL oral_h_Page_118.jp2
592e605fb0e2895a0dff805abb82f6ac
bfd8e98e53b8999547e76ecdc1df6cdeeee4053f
56507 F20110113_AAAACO oral_h_Page_079.jpg
78b706ec695145d7cbbccb7203ef3ec4
5aeceb3f74ea9878ab6f511215d6098707308ea6
4969 F20110112_AABDMW oral_h_Page_091thm.jpg
4badf3b48b9655c85f6d47ee10ad865a
1b876d74a7702a64ae29f54de412d8fc02daf5be
71635 F20110113_AAAADD oral_h_Page_121.jpg
faedb74add00a90a9b4791b567cd8838
f5f888af44cfe46920690c565aec19d02944ccb7
13709 F20110112_AABDNM oral_h_Page_127.QC.jpg
6b239ac4bb6606c18e34dfab43a78c0b
47602e5d06bff5d964c365e867805fa09f8b18a2
53138 F20110113_AAAACP oral_h_Page_082.jpg
c052132c81e40fa3109748338abaaead
fc1df8292505420ad3e8d864d048715244e2025b
988873 F20110112_AABDMX oral_h_Page_112.jp2
eb6abcf555467f2e31db0ac7635c89ae
be0d55aaa335caf79a1b010b8c01fd678d3d4ccc
90075 F20110112_AABDOA oral_h_Page_146.jpg
f041b57608c36df9f073584c58855c4c
37dc494464cc71ae9175f084dd7b062b494ababd
47294 F20110113_AAAADE oral_h_Page_127.jpg
787bd7a32c780c7650e856ae2084252d
3a8b360651fe977feeaefff653156c5206c01ff5
21617 F20110112_AABDNN oral_h_Page_016.QC.jpg
baa13f9fedab795c07cecd0abad96e1a
036c4dcad82e1a77178a7034f9f08ed09deb7084
20975 F20110113_AAAACQ oral_h_Page_084.jpg
4cbef37a3c3f9e07e7216cdf3ed52899
b80d654f6107d6fd59788142899a889d82e5ff9c
693 F20110112_AABDMY oral_h_Page_136.txt
55231b8f3f08c37a190f9a65c4b824f0
00b59cdcfb17fa65d798ef1e894de93674c0bb29
15204 F20110112_AABDOB oral_h_Page_069.pro
aff979b52b5b6c7b8c63c1d44508d007
dbe9b7b6f2c697b9e251d247e2c23f93851e48fc
81376 F20110113_AAAADF oral_h_Page_130.jpg
3426172bc62450ec2c5e0751414b23f6
7aaa7d6027414f2907144e42f9b3969c02ebb351
F20110112_AABDNO oral_h_Page_149.tif
4dece38ddab52b310b1a42d2f9a61e5b
eb78cb0d8c1c75bc86ade041d06066248940002e
71496 F20110113_AAAACR oral_h_Page_088.jpg
9fb1d77102ad7654ec841909e6734588
f9e401a32cc8547613db026ceb2bec8b97c64ffb
27303 F20110112_AABDMZ oral_h_Page_087.pro
c84875d2ae6437f35921a4442dee90e5
282f529f3b828f2c9a3022237267a4b7c1d2c4f2
F20110112_AABDOC oral_h_Page_084.tif
295abadde1663a3e94bb1451d080d504
e6ef4ae0db02b0456428b4d307b59a1db846574b
73369 F20110113_AAAADG oral_h_Page_135.jpg
e80d41807e15e3f3684d1fb119db6616
dc30c284186166fb7ca11a19a2587bb8ca64ffb0
84481 F20110112_AABDNP oral_h_Page_050.jp2
2b155ed1f54bf882b427dd2c37b92b98
fc324a9a191790b555ea079a0491530ff1f99e69
68898 F20110113_AAAACS oral_h_Page_093.jpg
79d0d235077e8879406ccfb87716f45a
5f4df8f3d78f5e511c8b6e69f2fb549a158efb6f
F20110112_AABDOD oral_h_Page_131.tif
322d54d66ce3db21184da7a9f402acac
706c508a7fdc67188f31a553b2fc4af10f6e347a
47477 F20110113_AAAADH oral_h_Page_136.jpg
7359d54658041fa7147d8b3d80799863
1363a251ea5e2e7f472627ceefef4a74a1e90292
29871 F20110112_AABDNQ oral_h_Page_041.pro
c0455f47e74546da919330c256a783d3
7931882ba04a62ef02c9542b2582a0ab4de1435a
45738 F20110113_AAAACT oral_h_Page_094.jpg
0df8384b5c20608cbf9077e8fc755db5
34ffa892c3514738bc786248997a22bc816956a3
49856 F20110112_AABDOE oral_h_Page_100.pro
7d294764f54c5ee91e6984bb61b2f882
6c2f5e184df6da08cadbf485de185dff816ca8cf
22648 F20110113_AAAADI oral_h_Page_138.jpg
637a0052a9f5ae2152c153c1376f3648
eda7e5b9adbc06d4d94d45e30b0098e3761ed263
F20110112_AABDNR oral_h_Page_118.tif
dbbc74a62421432003e2a8d940c92736
776a643058f54e1693007c1f9bcf35724a14d838
73377 F20110113_AAAACU oral_h_Page_095.jpg
71adc41cab2aae5da157da21eeedb4d4
550c7a6738a0392e2426ae739a24afe2e4ef2cc5
47344 F20110112_AABDOF oral_h_Page_038.jpg
881444597db6994245c172be95cc80e2
59fda5f1f296246433baf4ebd74a6666c67fd77f
62934 F20110113_AAAADJ oral_h_Page_139.jpg
6588687e7fc92b3bc781c57ecd1287f5
6d1d615a9063a707c46d7992aab99212cb3bf815
70410 F20110113_AAAACV oral_h_Page_096.jpg
33874ec47bb7a5c488aa1ccc4fe60978
328b63e050156669356aec9372c3ff9b0cec1d4e
108153 F20110112_AABDOG oral_h_Page_065.jp2
57839177e52b0872eb2dcfca0e709e9e
ec426a57bf0e8c56549a2ef5003a5c592daacf4b
70129 F20110113_AAAADK oral_h_Page_142.jpg
6728d6af2a5ca89cfd6292e7b6774303
93b80dde4eacdb4bbabca3420afd218f0b4bd65f
112202 F20110112_AABDNS oral_h_Page_015.jp2
4c95a23993b635b4e3e6e424f60a05d6
2f7c6a0ddccc75fb96f70364d5b7951c062b9245
51509 F20110113_AAAACW oral_h_Page_097.jpg
4cbbec564793d822200d9d0beafe8c73
72f7c0ef08362a9938891ff7545dc27fa81d51c5
20796 F20110112_AABDOH oral_h_Page_139.QC.jpg
0eda55c6e071c0aacd36fd0724ebf0d6
3dae35a6cb0e5402c02d4903e41683bd4aa81735
35233 F20110112_AABDNT oral_h_Page_133.pro
9f866cf3f291f0a04cc26951eb30b5ad
81df174157ea492460d9e108fd426891d8ad23cf
68505 F20110113_AAAACX oral_h_Page_102.jpg
158e829584a62e372740047f6ce24a05
15799eec44914e82f80fbd22a93daec486cf5f26
586805 F20110113_AAAAEA oral_h_Page_038.jp2
d912c48e4e0e46e2b4f22cd48cb92488
b82f1c46fb55b38c78505844ce7a865ecb103c71
89042 F20110112_AABDOI oral_h_Page_145.jpg
b9b3cc60fe1afdfa12e08814534f85b6
5ba551400f4574d306dd56154e6ead9432e420ac
67708 F20110113_AAAADL oral_h_Page_144.jpg
ba9f10988cc49776fcc26b87aa78db8e
4c4f3e855e3a264a1a6fff1cc4ef3547c94db2f3
24234 F20110112_AABDNU oral_h_Page_018.QC.jpg
86e7a3649c0d66df9f7a4d11130425da
2d59debd6d2b9a07ec43c0119f42c5eb6cec2ed0
68702 F20110113_AAAACY oral_h_Page_103.jpg
8ebd84b0d1dfd923b2a0897e925b2d82
c324c12c25487ff3073601bcbf46b0ecd3365b0c
705719 F20110113_AAAAEB oral_h_Page_044.jp2
415044d1a7a47b81c9ee76d2c229027b
3063b4bbc3653a1ca233ed2e67cf3d264d747996
4560 F20110112_AABDOJ oral_h_Page_094thm.jpg
9e6d97f447ff4ce536c06d463d10a8c0
14236dfb96996f2697a81674d6af949198ebac84
69868 F20110113_AAAADM oral_h_Page_148.jpg
d1e8039dee4ad776d22a465a9386c383
bad38f7c8d65f42a939cb70214819d3ec728b85c
104679 F20110113_AAAAEC oral_h_Page_053.jp2
dd7738fd6338e9045cb916a1028b4f7f
92cf468fcce93a4a465803d7c12f07c2422b2530
1357 F20110112_AABDNV oral_h_Page_094.txt
f5a4b1778a1f60f0adeee474d86b84c9
8550c4aad5bcec83a4bf1c1962eb4b7db987ac0b
64200 F20110113_AAAACZ oral_h_Page_105.jpg
80523b21b00a8db3294bbf8873ac4602
f5de2f7044c68aeeb320275327ce9070d45378ed
85893 F20110112_AABDOK oral_h_Page_072.jp2
c2689fc60148569f430c16c302193f1e
e4dba112bf06f94876373779872c6c781ae3ff91
27975 F20110113_AAAADN oral_h_Page_002.jp2
e1979400e9e05744b6e8d8035aff20aa
d19225c44e8a4fae6042d4a33ebde72bb7ca891a
108046 F20110113_AAAAED oral_h_Page_055.jp2
477c479642ae8e31a9f4f2cd41405c88
532c2affee18079bb11ad3dda7cc0f25a8e80e04
20918 F20110112_AABDNW oral_h_Page_137.QC.jpg
aaae3f5fd177733387156ea7cb318f41
1c6a05c356dc7f5177f811edee8178f60bad2e4e
21102 F20110112_AABDOL oral_h_Page_059.QC.jpg
299d73e196fa42fa6434a44dd601d0b2
f567f86bdbfda115a9a80468b4aad0cfbe97c6cc
1051933 F20110113_AAAADO oral_h_Page_004.jp2
a5b165f0ea784f3bcbdaeebb6659ab24
77473fab73afed99390ae3f19e92dd8d9c3b3485
101712 F20110113_AAAAEE oral_h_Page_057.jp2
3ae812e618633c4ed4ee2af953617b08
339ccf602d2b2165e32e92e14104db0f08071bd7
6238 F20110112_AABDNX oral_h_Page_028thm.jpg
6a25c3a9eb30c5bc56a390328b2d9cce
23e5dbfbf2acaf6cf21bed636c8b813d75175fca
73382 F20110112_AABDPA oral_h_Page_062.jpg
7f51a5436b6a3b56b9818050624da68a
84d8e715e3852305d6a70e9a1b02943659396c71
F20110112_AABDOM oral_h_Page_093.tif
922027ca44856d0d4bc4fa4e27509ce6
e17b98eacde5130cf17fed44c9cdb488ef5c526e
378975 F20110113_AAAADP oral_h_Page_005.jp2
e3bdc8b4c4263b529c265fb7a91d9fd7
28ebdd5f52720edb94da81c6f262b24ec6dcab98
106603 F20110113_AAAAEF oral_h_Page_061.jp2
7b84d6831acdfaf1cc3582355072c63e
4f6f98d17b3f85f0dde9c8b5d5c5b9e4d5c650dd
F20110112_AABDNY oral_h_Page_035.tif
121feaea2714dfe87b6c984c97224cb4
5519eb4309e20c9894c49af272e3b4dce3cf283c
3033 F20110112_AABDPB oral_h_Page_008thm.jpg
52ff9983e9591755563e36659bf5017a
cf61bc127384695ac2871d2b1020cdb0fd816d84
17788 F20110112_AABDON oral_h_Page_060.QC.jpg
df9bfb6f9fdd511e3c4c2710ff1612dd
31b4441675f80fdd2657915b00a7f3ad3e48e24a
808209 F20110113_AAAADQ oral_h_Page_008.jp2
b7da3a5bd2f3a87b34e6c9cfb1476ed6
57f3bdd6f187393b37d4aca44265dde8351addf5
1046626 F20110113_AAAAEG oral_h_Page_063.jp2
40ba1d588b6ed07d9fc9b1fdb8cf2260
76adbec8cc82e890859f54e1e81f774a9d1784e2
857734 F20110112_AABDNZ oral_h_Page_040.jp2
c82a987e030f008c51e61eb425de10b0
1a809742b74267950e79535154cc28d10590b472
107039 F20110112_AABDPC oral_h_Page_086.jp2
72f35bc6d133405ab5f31123cd5682b6
f2c7ecdf2b16381a36da1f7c0a3463d495dd5dcd
6171 F20110112_AABDOO oral_h_Page_132thm.jpg
422a3d1f6e827a6eb4dd6aac9b4bb04a
c928c3037447b2812d326fbfed3390124cf1eb67
104199 F20110113_AAAADR oral_h_Page_010.jp2
3d6d7265818772874ccad9ef76945b85
f4188926eeb9573f03bd687c42a2f993cf7fab3d
841890 F20110113_AAAAEH oral_h_Page_066.jp2
54020a56522e157ad09a7f69354bc7fa
f3403db0a13650d5b1b9cccb027aaeebbbfe3126
50954 F20110112_AABDPD oral_h_Page_052.pro
4b8ef0edf00e32fcb079675eaf0dcd00
d4898e01203d15b2b04fd15fd997554ddec567e0
1051985 F20110112_AABDOP oral_h_Page_047.jp2
d0c466d3937d165ebf26bea8fd7b1877
d718842dc4af416454e67e21293e7f9991d430da
130577 F20110113_AAAADS oral_h_Page_013.jp2
79d141e43e77319ab4210f254a01039d
caae6e47f357cd3a741bf1185e33fccf31c405e1
92401 F20110113_AAAAEI oral_h_Page_071.jp2
0ee5a06a1629d3e2f0b378c47bf3af3d
af996a035fe6d671157299cbc21bcac0aef328b2
23282 F20110112_AABDPE oral_h_Page_088.QC.jpg
5ca853da97097efd4fd5b97cb7d1f934
d3cd22ee4f08aa9239da1f220b7f4a8e50c34c95
49836 F20110112_AABDOQ oral_h_Page_096.pro
1585e0949ba82a19b58f8f5ec4cb3cf6
d91c63626df6dfa77f534d30eabc21da7f9f68d5
109903 F20110113_AAAADT oral_h_Page_018.jp2
9237688efce656e763d48e5a34a593cf
41abeda5460c5f9c39409b02829cf18ca9f5fa19
86283 F20110113_AAAAEJ oral_h_Page_075.jp2
6e44e470f3ea2f938caa5473f824f465
a5a460b1635a7f264ae40b5df64892b2bb3d8a28
1033 F20110112_AABDPF oral_h_Page_108.txt
d24e89a9326d17328f9e5154020aba01
8928bed8c311b2846692199f604377a01012f206
2057 F20110112_AABDOR oral_h_Page_074.txt
d90f89f743585f94fa6a950c62def26f
13b321760b496e075c3c93e99766ed4822bcbf19
109452 F20110113_AAAADU oral_h_Page_021.jp2
f54b511835b08e9898011ba19030adff
5d94a9c1e2dd5d76b2847e10ca70d59ec3b3032d
927950 F20110113_AAAAEK oral_h_Page_077.jp2
4ce7ec667de724a43f3cdb0982024ef4
49c280a5a2a773efaa22f0ba74d900286af5d79e
15985 F20110112_AABDPG oral_h_Page_078.QC.jpg
8dd253d8e0ae21255c4a24e8dd5cb59d
ca384478573ca353842582803e053bb491b52bf1
21768 F20110112_AABDOS oral_h_Page_056.QC.jpg
6df885824a8b244e93f18c70f926c483
eec00ce8e4dc35e530f6b92d3735f6a5d33f5dd4
93942 F20110113_AAAADV oral_h_Page_024.jp2
226096f30fec8162924ecfb872b58417
a8a25733ebff5be6bc4ae86dbf90d90ecc117047
112878 F20110113_AAAAEL oral_h_Page_083.jp2
f228ea5f261a81a274361408a9684b5a
17ec4efe6b2cd799452622731b46ff67d5e0ed4a
47502 F20110112_AABDPH oral_h_Page_058.pro
2ed32cd323918e250ebef8ce1af2184d
164b9de1736dee91f0d7148900d33c4a5ba40654
94940 F20110113_AAAADW oral_h_Page_025.jp2
909b5c9eb07e7f1787fc2eddfe7f7dcb
454b6588e80c485e750b94df66f6048d81f526ae
100245 F20110113_AAAAFA oral_h_Page_123.jp2
a23dc14d37d43ff54ef451e482a3fcb3
562e6efee503334d57fe3b0b82387da9f96957d4
67249 F20110112_AABDPI oral_h_Page_016.jpg
fca0a39a9755bb5206a230e3d153710f
f3cbf563320fe087758ae26374cd08a374d0159a
F20110112_AABDOT oral_h_Page_145.tif
8cd44da996dd210545f17cf3606f2517
a8a20e3c0d92d24bed0fa92140a13ff7c1b86f34
104054 F20110113_AAAADX oral_h_Page_026.jp2
347e0da400e67e5aaf5e2e6358cd76af
40e305d9528aebec324ae8fec75904efcfcdffe7
673308 F20110113_AAAAFB oral_h_Page_131.jp2
fe54ad80a28605e9c1aff9789e93fafc
993a96bf9e8a248606eff6f7c718bc035e0db5e5
91388 F20110113_AAAAEM oral_h_Page_085.jp2
e8029667c845356d1c09f792c17618ae
d64bbd805365d419e4b3bd3c45459a8ece5f84aa
48053 F20110112_AABDPJ oral_h_Page_142.pro
c4346d64eafe7ae7acec211c22f7d396
f5ef0ca3449d031fd5fee591e6e4fa2853a0f814
1273 F20110112_AABDOU oral_h_Page_060.txt
c755f6988aaf0f6fb735faa57bdc83c6
4318e1117c107ff807254641e62c8caedfa30796
97285 F20110113_AAAADY oral_h_Page_030.jp2
345279be7807ec6758071bd12285e6c3
71027417c02057705639b0c336aaaf8cfe0bad3b
868991 F20110113_AAAAFC oral_h_Page_133.jp2
7dca6d49eac184bdd6f5ad83762b9947
c6c526ee583b0145229947364e85fb7d58555246
1051942 F20110113_AAAAEN oral_h_Page_091.jp2
306efd5714cd53ef37c6a4631497c23a
f67d0138c0edc7572b1ee1f02fd9f0c990b9fb98
393 F20110112_AABDPK oral_h_Page_084.txt
f9cc04e5ce30f9655729f3639f13ec4c
0b80d141be731747e6117f4924138d1d62c69bea
2029 F20110112_AABDOV oral_h_Page_006.txt
7324158bdc9916bfbe8e6da11f416e79
9c1fca2ffb69cfd84ee3c56e995cde8e52f5c1c6
97976 F20110113_AAAADZ oral_h_Page_035.jp2
a6a12eb8cbf1570f47f15e241387fe92
01d4f0a5f8f2fbaf4599102d3bb997a77863b9ec
686101 F20110113_AAAAFD oral_h_Page_136.jp2
5dfc691dd9b0ba6f9c284d23ec8be07e
83604bafc070faf5372a92a3c8f14e80801c7f12
101350 F20110113_AAAAEO oral_h_Page_092.jp2
e16460835ccfb35f681b1d9246670a43
ae9939357ae73b73434499cd3beb3a2bfd1c5f0b
1051986 F20110112_AABDPL oral_h_Page_006.jp2
dee68058b70a589596c7ceeaf89db33b
0ca60f797fe9cd5de2b15b058a3670247cd0310d
20914 F20110112_AABDOW oral_h_Page_148.QC.jpg
c5f975a6a02f528fc36d41cec2c526bc
518a6283d92198489203107d11513ba4c0b3b02e
105349 F20110113_AAAAFE oral_h_Page_142.jp2
1f97c70837c3599fd5694085f39c95b5
aeaa7d09c162034f1127d4149e612a3afe18e99d
103743 F20110113_AAAAEP oral_h_Page_093.jp2
036e7620a5be3f78440a90bcb88b1fd5
85551d114fc82bb4f8f51a31e1d4d5e53fff4964
22791 F20110112_AABDQA oral_h_Page_026.QC.jpg
e4b9d8242dbab4d34f8223d5dff2e784
66532f74ceef75d636784b0c78b12b63e5005c7d
71544 F20110112_AABDPM oral_h_Page_048.jpg
0277710ca98b446aedbce54817d6d553
3a1f8889ca561c99c56275fe9c8106ca17f06161
108038 F20110112_AABDOX oral_h_Page_028.jp2
f783a1f984afb81ad08f27e0673a2c66
c964e324176a209df0522950a5dfdd0bb7901d7b
128318 F20110113_AAAAFF oral_h_Page_145.jp2
32f8aef0a47e0d50a529c7e17d52d9b5
bd2665250963b074ce220ae0192965ee680347be
110617 F20110113_AAAAEQ oral_h_Page_095.jp2
8c35af63f9f5b9c590565f95552c4d2e
d6d65ab0c1ca399593732e8840cfcd69bcb760c7
F20110112_AABDQB oral_h_Page_066.tif
3463a864f1dde5ce780235b561a0b503
f4af7ffa839b1cee73269c46b59c06ad05300ea5
F20110112_AABDPN oral_h_Page_057.tif
e7310705ed89350d3f24e6e41449b9f3
494f37406e10a3a8418465713533822d9f9a1be7
97475 F20110112_AABDOY oral_h_Page_054.jp2
1d0ccd6c574261d85b6ae76b0809bb4c
f942bda1a0e9b6a4af7207845039c3d04800b38e
31288 F20110113_AAAAFG oral_h_Page_149.jp2
0dab67a5450cd1684f86f789d131eb54
ab75f8129c7858cd67ffa1ac6fa03c4bc50f87d8
108536 F20110113_AAAAER oral_h_Page_100.jp2
2901952c064611f7bcb423728e0e7871
9ba54455b11cb47623e06bf3a2ab9a6453074603
84357 F20110112_AABDQC oral_h_Page_009.jp2
dd2d25a044a4646fe06cb7a78f679b12
2f053fbc57c18314c35abd86651135e35107a59b
F20110112_AABDPO oral_h_Page_140.tif
a25669de2d4f8913ac0f19da19cb4293
88bcde472120a4e837254f6d3bbe8ec8dd16ebac
77048 F20110112_AABDOZ oral_h_Page_119.jpg
dcfde205d5cb92f6d24d98776ca0edb5
120749a358a12cf8bec66afec90e06729344f061
F20110113_AAAAFH oral_h_Page_001.tif
b4fadefd620eeddb6a23773e638db96c
ded846219372019ba32a88cddac0cddf3c105dfb
1051983 F20110113_AAAAES oral_h_Page_104.jp2
2d569ca7a718df272f2f871c784cb51d
523487ff6f81c41753ac6ca065c3b14a09889b8b
950074 F20110112_AABDQD oral_h_Page_102.jp2
e6147ce03b8d48028e543866997e980d
77fd45cdf1e6bfde60141fc9d4f4f6003e89743e
688688 F20110112_AABDPP oral_h_Page_042.jp2
53529e426f924014a1b22ee7aeeeef7b
6fbf96cf9fce07ebc4367c413a91ec88e3bc035d
F20110113_AAAAFI oral_h_Page_005.tif
f872c70b15b69d3988be029a8852f198
e5f9f7d407c0359025193545810f87f0abc86db8
99242 F20110113_AAAAET oral_h_Page_106.jp2
0bc3e628f19cde39c44a245cf8de133c
4d2ae4812bb1e722e2819dd8ca1e37d77e6f77ef
F20110112_AABDQE oral_h_Page_141.txt
a827ee2798278d0fac5a7996cba0527f
74802423744c1ab04cb11fe992623e350a69129b
33093 F20110112_AABDPQ oral_h_Page_008.jpg
497586c079859bb2e83fcfaa2b93bdfb
5e3e7375f46e009a3e81d57bdd25ab6337e35df3
F20110113_AAAAFJ oral_h_Page_013.tif
080e8a9c3318b64078d88d86dbcb00a0
17a4ca55d9c822d40c0c3ba1ef8f5bec3620dba0
661932 F20110113_AAAAEU oral_h_Page_108.jp2
70a2e99239b621b72e481bedc30eecc8
117cba114f8fc26d63e06a63f893c9d8b29e081f
1640 F20110112_AABDQF oral_h_Page_104.txt
23c02649ac57d4e2d2949e8cc70c95eb
c3f95be6aad045aa2786e1a803a89b660120cff6
4088 F20110112_AABDPR oral_h_Page_122thm.jpg
6ab642e9ec4cd762be936a2f0cd9dd10
9f6ac00e4bed883fee3eeed4e63eb3e6133bcf8a
F20110113_AAAAFK oral_h_Page_014.tif
ee444d16466c910a339323568cbc42e6
250d7a84dcea1e466a48164ad9d503d9b0b45663
831728 F20110113_AAAAEV oral_h_Page_109.jp2
784bff0ece61ff04211361e25c585f65
8cbdd5dc58ec3e9824060970562ede0986ed718d
5647 F20110112_AABDQG oral_h_Page_148thm.jpg
16f8a142d32a710540132e8529326be8
912bebe7cc41161edec631f76b3649344965feff
110730 F20110112_AABDPS oral_h_Page_014.jp2
5801c6f69f3878a8d65d15edf6771825
0976b9a99409031be8cd5eee66cf9f872dd91184
F20110113_AAAAFL oral_h_Page_017.tif
8289f09adaf0a283aeec501b763a0155
0c445ae6a08907763fd46b8f74905f3917b1dce4
104036 F20110113_AAAAEW oral_h_Page_110.jp2
a95a1077ca7baa894791924141c40c73
6c370e331efdaa2e35fcbd9b41cefe5b6eef8c66
F20110112_AABDQH oral_h_Page_052.tif
64f2e03c4eb3a1c369694c3262f758e1
25727016cba67e64b705e7d9aabdccc705b72e8a
1899 F20110112_AABDPT oral_h_Page_142.txt
8cc1a39031f073111106573608bf7bdf
a79b032330356d0dd2eb5be53dc556835a77cc95
F20110113_AAAAGA oral_h_Page_063.tif
1f063468c0d2b9f927e158b0a57058b5
dc035e9ba80891f8ee1101a0264896c944ab5f5a
F20110113_AAAAFM oral_h_Page_018.tif
9ed74df6a903ccd8141465f351f4a080
870254d6b84beff7660e79b16bf97c5278eeb0cd
106447 F20110113_AAAAEX oral_h_Page_115.jp2
3f0fd602577039fdec3a68d1f5b67a7e
ccd07241d2e7cb8fdfd939652c77c4aef566b67a
F20110112_AABDQI oral_h_Page_143.tif
1d8e740d9660bb9297d071fcf47e7b82
84588ac848da6908fc551b4eb6fa63c918ade6f7
F20110113_AAAAGB oral_h_Page_064.tif
15c393ef83da40e1ba4f81cd4c81a1dd
f820feb88453948278c168a059bd18d3950395c8
914340 F20110113_AAAAEY oral_h_Page_117.jp2
b3d064d3cdbe127e182456cec4cbde59
5598606e8250d2f7c19d1d2c3e5db3ecd164a20c
5638 F20110112_AABDQJ oral_h_Page_117thm.jpg
77663b0c4efe25273390b03b6aa167b8
7bff98c26c7d365e3518020339d01b38a79e632d
25023 F20110112_AABDPU oral_h_Page_045.pro
2bf0866dadd8925b3bc802c70c96b803
0e35c99cd6bd7fff108e59f00f7a030175233f47
F20110113_AAAAGC oral_h_Page_067.tif
dcb11623702be059b19f5b033324e48b
05608ccea107e8f125b602a7bd35e5b4f8fc06ba
F20110113_AAAAFN oral_h_Page_019.tif
3d36798e22be52040a565ceac4c2ff98
9d8ec898dbfc2f7b62e3dc9eb06ed8e67f94d9a7
869685 F20110113_AAAAEZ oral_h_Page_120.jp2
00a13310b6488cf1e51d437208edfac6
fe46687585e517efd162ded9bfa7915f8e3b753d
15116 F20110112_AABDQK oral_h_Page_038.QC.jpg
8689fe88fc9ae89668fef9970fafdfec
6ba7ac19012782fd3b26c63cb6480a123a8cfd89
6381 F20110112_AABDPV oral_h_Page_140thm.jpg
b448abc31c5b6a0deab87105b04f3c25
759ac197a19c1e4391fed05b28921910989ba7cb
F20110113_AAAAGD oral_h_Page_070.tif
5e758d93a44a65688b7637fc711f1217
bd61ea996341aebffe23c168bfb33dead747a354
F20110113_AAAAFO oral_h_Page_020.tif
ed4c1cb0de3097aa60ebc2ce74cc33f8
49c68153216ec8d81ad76326f1aaabd6c6dff865
1051880 F20110112_AABDQL oral_h_Page_119.jp2
04dc91c02a41b4efa2dc0715608c7e8b
b029a7d1a9adaf6026dd8cfed4f225a522c5633f
105833 F20110112_AABDPW oral_h_Page_099.jp2
6c3cf3a38e2c819a0e943b3311328763
b67a1f95fccd7342db16df09a579cfd398c396b6
F20110113_AAAAGE oral_h_Page_071.tif
ad6e68253724fe5a9d6e6b61a2089482
9422e23f13f689263fdc5621f72eb8102c3a74ee
F20110113_AAAAFP oral_h_Page_023.tif
315bbae06f50a9e7958a248e0d7a36bb
ccb1cb811decaa51a6e6370dfece41b5438cf590
8618 F20110112_AABDQM oral_h_Page_107.pro
fff816b0f77abe6fae6ef7859c975969
453aa7ba2b73dba695cb058b6d0ebd203fec6cc0
17567 F20110112_AABDPX oral_h_Page_041.QC.jpg
b9a98a939d2d924bb531283a1e751055
053e371f31e9b9da1591820efcba53e952bfc01b
23600 F20110112_AABDRA oral_h_Page_033.QC.jpg
51718ac972aae48f40fba462342cf038
9d5ef345797d6d6652ccdffd3637bf4dd990f09a
F20110113_AAAAGF oral_h_Page_073.tif
5ea304fed639e0bd0088cddd383f1a87
1e8c42284c6025db0b829c90cc6f4adf5b868bc4
F20110113_AAAAFQ oral_h_Page_027.tif
fd5a13d09e8a56fb417f5364b2024830
974169c4b75cea7817f5e234f31c3f2f604a0008
692931 F20110112_AABDQN oral_h_Page_122.jp2
555abf51f9d0d5a1b3af9f304285b9da
86dfab02d98e528499170735b4c4d266e8077a9e
1170 F20110112_AABDPY oral_h_Page_045.txt
3d5289b84326f075080d0af974bfaf99
55868cb963fa02c5a929b963746582f0b4b4a4a5
3871 F20110112_AABDRB oral_h_Page_019thm.jpg
32cf1c5c96a565260bce9fbb506a584e
008187860efe16e28cbe433c3a7b747ccf8e4d71
F20110113_AAAAGG oral_h_Page_075.tif
5850b0bd0dc09076fde99763b71da7d1
9d87c5e2ba1266119b8a9424449c12721be1c810
F20110113_AAAAFR oral_h_Page_034.tif
914764f508f32b4add8f530f4cadbefe
5cd0f81c71252ec23d3e98bf8186b4e482bcf036
F20110112_AABDQO oral_h_Page_147.tif
a159059fcabdad40e62d8cb4c23ba624
472ab1b28fe8e4315c16edd81225acf9406c53b3
6493 F20110112_AABDPZ oral_h_Page_018thm.jpg
c7549e1925309ac569745210f1808626
c39e812a5b70b6a1ea69cc609a7342f1707c906a
33831 F20110112_AABDRC oral_h_Page_102.pro
af444427da229f0fad89c2f8a51ea13e
3ab858a36d1f1207310b46a6db2b976e74272f76
F20110113_AAAAGH oral_h_Page_087.tif
dcd02970d9c11d054d52e4131c05f15b
19a861a600944b8ffb062a9492702eef261947a0
F20110113_AAAAFS oral_h_Page_037.tif
1dd3849959dc03cbf29d2293b9dd9bf6
f18e37d25d9cd94a0da7d6daa4b9b0d69cce513a
23927 F20110112_AABDQP oral_h_Page_034.pro
396dd5fb4bc9ef17254d8b4b471b979c
3d8b199bc0bec93b27bf3b3c1bfd3e2140532ba6
F20110112_AABDRD oral_h_Page_072.tif
f5d7cb50d0db1e43b7aa69c9c8bbf0e2
53dbb518fe72f19330eecc2dece7dc7219bc2c64
F20110113_AAAAGI oral_h_Page_088.tif
e3fc96eb027667384c8db6106fc2f38c
f0235015a6eb9c9d2ae276b87e6a4b6809176572
F20110113_AAAAFT oral_h_Page_045.tif
3d58dc3bbfa8878cba266c73c50b217e
f2ec3e9b2b1c752172c9233cbae2c64698d203dd
1597 F20110112_AABDQQ oral_h_Page_082.txt
8b6e55d7c5c159531dc64675bd5cc295
2a9a495f5b2ca3235fab136243c58dab0950efe4
1918 F20110112_AABDRE oral_h_Page_140.txt
6429fb5a7f1fff0dd75489448b56af84
c33df7842f375af8bd24497f292148de8581574a
F20110113_AAAAGJ oral_h_Page_092.tif
f9dee2449c7a7866d4353e0d2d31b151
7d2a79d6050e962dc018db18dabbe6fec3365fd2
F20110113_AAAAFU oral_h_Page_049.tif
f378ecee155783e851ff30500a4d3215
c1399bd729b0313c8d9c5d678344983d9f6f02b5
495 F20110112_AABDRF oral_h_Page_002.txt
3aeff52f3b6a86836a3a85c000e7c833
254c468bb1a8e8075eb10b852451d3076958fa85
100129 F20110112_AABDQR oral_h_Page_089.jp2
408a8d26b34561cc22ff142c3d8e8602
4be326d797466d405ffbca2fe16284a1464afdbb
F20110113_AAAAGK oral_h_Page_097.tif
38677496b7c80523dd5c78f00836e493
b947fd3ae2ca62d7d98dbd1e0a1e86fd736c2305
F20110113_AAAAFV oral_h_Page_051.tif
8a9f0fcf17fbde17fcc506a874b92ab7
4b638f42fde2f1130924914ba60e96b3ff5d55ed
6088 F20110112_AABDRG oral_h_Page_035thm.jpg
bf7acc039127ddbf5052bc5512e13055
71ae10bf70d9e97f70b079d4242f16fa6b5f9ae2
23049 F20110112_AABDQS oral_h_Page_028.QC.jpg
428eeda6c99268fb8b17ba2c3520fd22
66bf338e00f396a91af4654ad425de3ccc87d8cd
F20110113_AAAAGL oral_h_Page_098.tif
a7c5098851fa133b785cbcb1a0489c46
8723a84f005cb2502f8430222055a490c5fdcb06
F20110113_AAAAFW oral_h_Page_053.tif
ae1f9808173fd8b2dba36e0dcac84352
36e2d5256407565bd8ae0f0e131b05702bd30c96
F20110112_AABDRH oral_h_Page_061.txt
dbf493b226ddc71c3930aa0439896f08
fc373284908fa06a9f61411189eb45e19d167cbd
F20110112_AABDQT oral_h_Page_078.tif
7db8da0fd17e169df62131e8b2ad486a
ed02a78e38147bee3b31cd2e8afd3e01689baf40
F20110113_AAAAHA oral_h_Page_142.tif
298c84be218cf5ba6909f81d97f9bfcf
cc5977e590794f292f1d0c46839c6e98d9109de0
F20110113_AAAAGM oral_h_Page_100.tif
222bfd0a1fc39960f4b477b5a635c474
2441ee1cb224342a08dbd104aa63a18c6b7ba40e
F20110113_AAAAFX oral_h_Page_055.tif
57232ebc3bc7015148e20030b9c8dd23
bfc7447e239dc9946efe1610dd8a3be9b12fadec
72379 F20110112_AABDRI oral_h_Page_012.jpg
6079a692a3ec057664cf79b58c08c0d0
4ea33fff581e117c082e1cbe0b26814997d01b50
71078 F20110112_AABDQU oral_h_Page_104.jpg
a94000d24a03124cb52516a990856dd9
6ad86b5bd263116564ea002f6c66e661dffcf2a0
F20110113_AAAAHB oral_h_Page_148.tif
31083fca0365a33eb5078a0d75e9e11f
0d9b46003a0eead77c6336cf9e4eec39d383f61e
F20110113_AAAAGN oral_h_Page_101.tif
1f8ebf34fc512b9c0d3b42c1f109288c
6b80f0e453bc3e3fb37bd1d90556d79d358e4476
F20110113_AAAAFY oral_h_Page_060.tif
70e0c69c31ea51168f478e27aaab1c67
a9d142db8f194ed9630acf4806122d634f38a896
59671 F20110112_AABDRJ oral_h_Page_129.jpg
0ec95db9ddb2c35cad441bada25f788c
2db412603a8b73a62308a4d7d8976fbe14679dfb
8912 F20110113_AAAAHC oral_h_Page_001.pro
ebf4908bdd82787589eb998a1747ccf5
724c2765463b7516d86c8e77c63b7a9665913063
F20110113_AAAAFZ oral_h_Page_061.tif
d4b2f310f3529f631d8c2df6191cf2d4
de858880a1a9109529790024de4714a20d762297
1795 F20110112_AABDRK oral_h_Page_070.txt
5d93827dc2ed0865dd77fe282e271171
f7b01ae8776c042ac2afa96435a48972f1e8043e
46263 F20110112_AABDQV oral_h_Page_148.pro
3f88aecf6b9354bee922b8dcc01a828c
35302679439d9eb0d5796dc16c42c9da595ec700
22961 F20110113_AAAAHD oral_h_Page_008.pro
e0f978b01b0bbba42ab357df5a2034e9
f51fb810105b2f98e062e3a9beae4265a8170dd0
F20110113_AAAAGO oral_h_Page_102.tif
dfb09b48ab0ba2c8c858ecbb0072622d
7c956bda4a74f6682570dafae8c1e3234ca3732c
F20110112_AABDRL oral_h_Page_121.tif
e327a2a1f64fb8a74cd89934d6df97dd
a6343d6475acf01e03aa18d8dbba95a6e81225e7
32099 F20110112_AABDQW oral_h_Page_047.pro
8af6af9df15e6ea6003e411eb9ac308c
a8f9089ebb1e2ac14384f83c222d4ffa53551720
44690 F20110113_AAAAHE oral_h_Page_011.pro
7ecc73635bb50ef41b2470783e9e8ac0
e9f52cb0f13245b1f2f3f5e38956af9186c03bf7
F20110113_AAAAGP oral_h_Page_106.tif
4c455c40d36dfd93aeac1a64fc59a9eb
f714082c833385e340b3d792f4b9bbe8aa0cf0b5
PERFORMANCE MODELING AND ANALYSIS OF MULTICAST INFRASTRUCTURE FOR HIGH-SPEED CLUSTER AND GRID NETWORKS

By

HAKKI SARP ORAL

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2003

ACKNOWLEDGMENTS

I wish to thank the members of the High-performance Computation and Simulation Lab for their help and technical support, especially Ian Troxel for spending countless hours with me and Dr. Alan D. George for his patience and guidance. I also wish to thank everyone who supported me with their encouragement; without your support I would not have been able to complete this research and dissertation.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ........ ii
LIST OF TABLES ........ v
LIST OF FIGURES ........ vi
ABSTRACT ........ ix

CHAPTER

1 INTRODUCTION ........ 1

2 MULTICAST PERFORMANCE ANALYSIS AND MODELING FOR HIGH-SPEED UNIDIRECTIONAL TORUS NETWORKS ........ 7
   Scalable Coherent Interface ........ 7
   Related Research ........ 10
   Selected Multicast Algorithms ........ 11
      Separate Addressing ........ 12
      The U-torus Algorithm ........ 12
      The S-torus Algorithm ........ 14
      The M-torus Algorithm ........ 16
   Case Study ........ 17
      Description ........ 17
      Multicast Completion Latency ........ 18
      User-level CPU Utilization ........ 20
      Multicast Tree Creation Latency ........ 22
      Link Concentration and Concurrency ........ 23
   Multicast Latency Modeling ........ 25
      The Separate Addressing Model ........ 31
      The Md-torus Model ........ 32
      The Mu-torus Model ........ 33
      The U-torus Model ........ 34
      The S-torus Model ........ 34
   Analytical Projections ........ 36
   Summary ........ 37

3 MULTICAST PERFORMANCE ANALYSIS AND MODELING FOR HIGH-SPEED INDIRECT NETWORKS WITH NIC-BASED PROCESSORS ........ 40
   Myrinet ........ 40
   Related Research ........ 42
   The Host Processor vs. NIC Processor Multicasting ........ 44
      The Host-based Multicast Communication ........ 44
      The NIC-based Multicast Communication ........ 44
      The NIC-assisted Multicast Communication ........ 45
   Case Study ........ 47
      Description ........ 47
      Multicast Completion Latency ........ 51
      User-level CPU Utilization ........ 54
      Multicast Tree Creation Latency ........ 55
      Link Concentration and Link Concurrency ........ 57
   Multicast Latency Modeling ........ 58
      The Host-based Latency Model ........ 66
      The NIC-based Latency Model ........ 66
      The NIC-assisted Latency Model ........ 68
   Analytical Projections ........ 70
   Summary ........ 73

4 MULTICAST PERFORMANCE COMPARISON OF SCALABLE COHERENT INTERFACE AND MYRINET ........ 75
   Multicast Completion Latency ........ 76
   User-level CPU Utilization ........ 80
   Link Concentration and Link Concurrency ........ 83
   Summary ........ 86

5 LOW-LATENCY MULTICAST FRAMEWORK FOR GRID-CONNECTED CLUSTERS ........ 88
   Related Research ........ 90
   Framework ........ 94
   Simulation Environment ........ 103
   Case Studies ........ 104
      Case Study 1: Latency-sensitive Distributed Parallel Computing ........ 106
      Case Study 2: Large-file Data and Replica Staging ........ 115
   Summary ........ 127

6 CONCLUSIONS ........ 129

LIST OF REFERENCES ........ 134

BIOGRAPHICAL SKETCH ........ 139

LIST OF TABLES

Table ........ Page
2-1 Calculated t_process and L_i values ........ 31
3-1 Pseudocode for NIC-assisted and NIC-based communication schemes ........ 50
3-2 Measured latency model parameters ........ 65
3-3 Calculated t_process and L_i values ........ 65

LIST OF FIGURES

Figure ........ Page
2-1 Architectural block diagram ........ 9
2-2 Unidirectional SCI ringlet ........ 10
2-3 Unidirectional 2D 3-ary SCI torus ........ 10
2-4 Selected multicast algorithms for torus networks ........ 13
2-5 Completion latency vs. group size ........ 19
2-6 User-level CPU utilization vs. group size ........ 21
2-7 Multicast tree-creation latency vs. group size ........ 22
2-8 Communication balance vs. group size ........ 24
2-9 Sample multicast scenario for a given binomial tree ........ 27
2-10 Small-message latency model parameters ........ 28
2-11 Measured and calculated model ........ 30
2-12 Simplified model vs. actual measurements ........ 31
2-13 Simplified model vs. actual measurements ........ 32
2-14 Simplified model vs. actual measurements ........ 33
2-15 Simplified model vs. actual measurements ........ 34
2-16 Simplified model vs. actual measurements ........ 35
2-17 Small-message latency projections ........ 37
3-1 Architectural block diagram ........ 41
3-2 Possible binomial tree ........ 46
3-3 Myricom's three-layered GM ........ 49
3-4 GM MCP state machine overview ........ 50
3-5 Multicast completion latencies ........ 53
3-6 User-level CPU utilizations ........ 56
3-7 Multicast tree creation latencies ........ 57
3-8 Communication balance vs. group size ........ 59
3-9 Simplified model vs. actual measurements ........ 67
3-10 Simplified model vs. actual measurements ........ 68
3-11 Simplified model vs. actual measurements ........ 69
3-12 Projected small-message completion latency ........ 71
3-13 Projected small-message completion latency ........ 72
4-1 Multicast completion latencies ........ 77
4-2 User-level CPU utilizations ........ 81
4-3 Communication balance vs. group size ........ 84
5-1 Layer 3 PIM-SM/MBGP/MSDP multicast architecture ........ 92
5-2 Sample group versus channel membership multicast ........ 93
5-3 Sample multicast scenario for Grid-connected clusters ........ 94
5-4 Step by step illustration of the top-level ........ 99
5-5 Sample illustration of group-membership ........ 101
5-6 Possible gateway placement ........ 101
5-7 Illustration of low-latency retransmission system ........ 102
5-8 Screenshot of the MLD simulation tool ........ 104
5-9 Latency-sensitive distributed parallel job initiation ........ 107
5-10 IPC multicast communication scenario ........ 108
5-11 Snapshot of simulation model for latency-sensitive ........ 109
5-12 SAN-to-SAN multicast completion latencies ........ 110
5-13 SAN-to-SAN multicast completion latencies ........ 112
5-14 Mixed mode (local, remote) IPC ........ 114
5-15 Top-level tree formation ........ 117
5-16 Multicast communication ........ 117
5-17 Evaluated network topology scenarios ........ 119
5-18 Snapshot of simulative model ........ 120
5-19 Multicast completion latencies ........ 122
5-20 Comparative WAN backbone impacts ........ 123
5-21 Head-to-head comparison ........ 124
5-22 Interconnect utilizations for local clusters ........ 127

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PERFORMANCE MODELING AND ANALYSIS OF MULTICAST INFRASTRUCTURE FOR HIGH-SPEED CLUSTER AND GRID NETWORKS

By

Hakki Sarp Oral

December 2003

Chair: Alan D. George
Major Department: Electrical and Computer Engineering

Cluster computing provides performance similar to that of supercomputers in a cost-effective way. High-performance, specialized interconnects increase the performance of clusters. Grid computing merges clusters and other geographically distributed resources in a user-transparent way. Collective communications, such as multicasting, increase the efficiency of cluster and Grid computing. Unfortunately, commercial high-performance cluster interconnects do not inherently support multicasting. Moreover, Grid multicasting schemes only support limited high-level applications and do not target latency-sensitive applications. This dissertation addresses these problems to achieve efficient software-based multicast schemes for clusters. These schemes are also used for defining a universal multicast infrastructure for latency-sensitive Grid applications.

The first phase of research focuses on analyzing the multicast problem on high-performance, direct, torus networks for clusters. Key contributions are experimental evaluation, latency modeling, and analytical projections of various multicast algorithms from the literature for torus networks. Results show that for systems with small messages and group sizes, simple algorithms perform best. For systems with large messages and group sizes, more complex algorithms perform better because they can efficiently partition the network into smaller segments.

High-performance clusters can also be built using indirect networks with NIC-based processors. The second phase introduces multicast performance analysis and modeling for such networks. Main contributions are experimental evaluation, latency modeling, and analytical projections for indirect networks with NIC-based processors with different levels of host-NIC work-sharing for communication events. Results show that for small message and system sizes, a fast host CPU outperforms other configurations. However, with increasing message and system sizes, the host CPU is overloaded with communication events. Under these circumstances, work offloading increases system performance. Experimental and comparative multicast performance analyses for direct torus networks and indirect networks with NIC-based processors for clusters are also introduced.

The third phase extends key results from the previous two phases to conceptualize a latency-sensitive universal multicast framework for Grid-connected clusters. The framework supports both clusters connected with high-performance specialized interconnects and ubiquitous IP-based networks. Moreover, the framework complies with the Globus infrastructure. Results show that lower WAN-backbone impact is achieved and multicast completion latencies are not affected by increased retransmission rates or the number of hops.

CHAPTER 1
INTRODUCTION

Because of advancements in VLSI technology, it has become possible to fit more transistors, gates, and circuits in the same die area, operating at much higher speeds. These enhancements have allowed the performance of microprocessors to increase steadily. Following this trend, commercial off-the-shelf (COTS) PCs or workstations built around mass-produced, inexpensive CPUs have become the basic building blocks for parallel computers instead of expensive, special-built supercomputers. Mass-produced, fast, commodity network cards and switches have allowed tightly integrated clusters of these workstations to fill the gap between desktop systems and supercomputers.

Although in terms of computing power and cost this approach is accepted as the optimal solution for small-scale organizations, it is not cost-effective for large-scale organizations. Large-scale organizations incorporate geographically distributed locations, and the cost of maintaining and operating a separate parallel-computing cluster in each location impedes the benefit of cluster-based parallel computing. Therefore, the question is still open: what is the optimal way to orchestrate geographically distributed resources, devices, clusters, and machines in a manner that is simultaneously computationally effective and cost-effective?

Distributed computing has emerged as a partial solution to this problem. Distributed computing orchestrates geographically distributed resources, machines, and clusters for solving computational problems in a cost-effective way. This approach is cost-oriented and does not meet the computational needs of large-scale problems.

On the other hand, Grid computing focuses on the computational side of the quest and is therefore problem-driven [1]. As computing technology evolves, researchers work on bigger real-life scientific problems, and to solve these very large-scale problems they design more complex experiments and simulations than ever before. Data sets obtained from these simulations and experiments also grow steadily in scale and size. For example, High Energy and Nuclear Physics (HENP) tackles the problem of analyzing collisions of high-energy particles, which provides invaluable insights into these fundamental particles and their interactions. Thus, HENP is expected to provide better insight into the understanding of the unification of forces, the origin and stability of matter, and the structures and symmetries that govern the nature of matter and space-time in our universe. The HENP experiments currently produce data sets in the range of petabytes (10^15 bytes), and it is estimated that by the end of this decade the data sets will reach exabytes (10^18 bytes) in size [2]. Another example is Very-Long-Baseline Interferometry (VLBI) [3]. VLBI is a technique used by astronomers for over three decades for studying objects in the universe at ultra-high resolutions and measuring earth motions with high precision. VLBI provides much better resolutions than optical telescopes. Currently, VLBI readings continuously produce gigantic data sets on the order of gigabytes (10^9 bytes) per second.

To solve these and similar scientific problems, an unprecedented degree of scientific collaboration on a global scale is needed. To cope with these enormous problem and data-set complexities and to provide a global level of scientific collaboration, Grid computing has been proposed.

Grid computing brings together parallel and supercomputers, databases, scientific instruments, and display devices located at geographically distributed sites in a user-transparent way. According to Foster et al. [1; pp. 1-2], Grid computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation. The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization (VO).

The two key concepts in bringing together these resources and VOs are high-speed networks, and light-weight, high-performance middleware and communication services. The need for high-performance collaboration and computing demands high-performance connectivity on a global scale between these resources and VOs. Projects like iVDgL [4], GriPhyN [5], DataTAG [6], and StarLight [7] are among numerous initiatives that involve research to build or exploit global-scale high-performance interconnects, middleware, and communication services for Grids. These high-performance interconnects are designed and implemented in a hierarchical fashion, from high-speed global backbones that connect organizations down to small proximity networks that connect the resources and clustered computational nodes. Conventional and relatively cheap interconnects such as Fast Ethernet or Gigabit Ethernet, or proprietary and relatively expensive high-performance System Area Networks (SANs), are used to connect these clustered nodes. A SAN is a low-latency, high-throughput interconnection network that uses reliable links to connect clustered computing nodes over short physical distances for high-performance connectivity.

The success of Grids depends on the performance of the physical interconnect, and also on the efficiency and performance of middleware and communication services. Interprocess communication is the basis of most, if not all, communication services. Interprocess communication primitives can be classified as point-to-point (unicast), involving a single source and destination node, or collective, involving more than two processes. Interprocess communication is often handled by collective communication operations in parallel or distributed applications. These primitives play a major role in Grid-level applications by making them more portable among different platforms. Using collective communication simplifies parallel tasks and increases their functionality and efficiency. As a result, efficient support of collective communication is important in the design of high-performance parallel, distributed, and Grid computing systems.

Multicast communication is an important primitive among collective communication operations. Multicast communication (the one-to-many delivery of data) is concerned with sending a single message from a source node to a set of destination nodes. Special cases of multicast include unicast, in which the source node must transmit a message to a single destination, and broadcast, in which the destination node set includes all defined network nodes. Multicast communication has been the subject of extensive research at both Layer 3 (IP-based) and Layer 2 (MAC-based) levels. Multicast is widely used for simultaneous and on-demand audio and video data distribution to multiple destinations, and for data and replica delivery over IP-based networks. Multicast over Active Networks is another area of research that targets increased efficiency and improved reliability [8].

There are well-defined and deployed Layer 3 multicast protocols, such as MBone [9] and MBGP/PIM-SM/MSDP [10-12], for high-performance IP-based interconnects, although most of these are not widely standardized and are still evolving. Layer 2 multicast is widely used as a basis for many collective operations, such as barrier synchronization and global reduction, and for cache invalidations in shared-memory multiprocessors [13]. Layer 2 multicast also functions as a useful tool in parallel numerical procedures such as matrix multiplication and transposition, eigenvalue computation, and Gaussian elimination [14]. Moreover, this type of communication is used in parallel search [15] and parallel graph algorithms [16].

Grids and distributed systems are often formed by integrating multiple networks with different performance characteristics and functionalities. These networks can be listed as Wide Area Networks (WANs) as the backbones; Metropolitan Area Networks (MANs) or Campus Area Networks (CANs) providing regional connectivity to the backbones; Local Area Networks (LANs) providing connectivity to the regional networks; and SANs connecting the computational nodes, clusters, or devices to the rest of the integrated network hierarchy. These large-scale integrated systems support and use various communication protocols (e.g., IP/SONET or IP/Ethernet for WANs; IP/Ethernet for MANs, CANs, and LANs; and proprietary protocols for SANs).

Providing an efficient multicast communication service for such large-scale systems imposes multiple challenges. For example, the data distribution patterns must be optimized to minimize the utilization of, and impact on, the Grid and distributed system backbones, as these are the longest links that the data has to traverse. Also, for the cases where multiple interconnects, such as LANs and SANs, both provide connectivity to the destination at the same time, an efficient multicast service must route the data toward the destination over the fastest interconnect possible to obtain low latencies.

Therefore, it is beneficial for such a communication system to support multiple communication protocols. Furthermore, for unsuccessful transmissions, data must be retransmitted from the closest upper-level parent node to the destinations, again to obtain low-latency characteristics and to minimize the impact on the backbones.

In this dissertation, the multicast problem is evaluated for Grid-connected IP-based and SAN-based clusters. A framework for low-level multicast communication is proposed that can be used as a service for high-level Grid or distributed applications. The proposed framework targets high-performance and latency-sensitive applications. Also, the proposed framework supports multiple protocols and different levels of interconnects in a hierarchical way. The problem is solved using a bottom-up approach. First, Layer 2 multicast is investigated for various communication and networking scenarios using experimental and analytical modeling over various SANs. Results obtained from these Layer 2 studies are then combined with the existing Layer 3 research available in the literature to build a hierarchical, multi-protocol, low-level multicast communication framework for Grids and distributed systems.

Chapter 2 analyzes the multicast communication problem for the Scalable Coherent Interface SAN [17]. Chapter 3 evaluates the multicast problem for the Myrinet SAN [18]. A comparative performance evaluation of multicast on these two SANs is presented in Chapter 4. Chapter 5 combines results obtained in Chapters 2 and 3 and then focuses on building a universal, latency-sensitive multicast communication framework for Grid-connected clusters. Conclusions are presented in Chapter 6.

CHAPTER 2
MULTICAST PERFORMANCE ANALYSIS AND MODELING FOR HIGH-SPEED UNIDIRECTIONAL TORUS NETWORKS

Direct torus networks are widely used in high-performance parallel computing systems. They are cost-effective, as the switching tasks are distributed among the hosts instead of being handled by centralized switching elements. Torus networks also provide good scalability in terms of bandwidth: with each added host, the aggregate bandwidth of the system also increases. Moreover, torus networks allow efficient routing algorithms to be designed and implemented.

The Scalable Coherent Interface (SCI) is a high-performance interconnect that supports torus topologies. The SCI is a widely used SAN because of its high unicast performance, but its multicast communication characteristics are still unclear. This chapter focuses on evaluating the performance of various unicast-based and path-based multicast protocols for high-speed torus networks. The tradeoffs in the performance of the selected algorithms are experimentally evaluated using various metrics, including multicast completion latency, tree creation latency, CPU load, link concentration, and concurrency. Analytical models of the selected algorithms for short messages are also presented. Experimental results are used to verify and calibrate the analytical models. Analytical projections of the algorithms for larger unidirectional torus networks are then produced.

Scalable Coherent Interface

The SCI initially aimed to be a very high-performance computer bus that would support a significant degree of multiprocessing.

However, because of the technical limitations of bus-oriented architectures, the resulting ANSI/IEEE specification [19] turned out to be a set of protocols that provide processors with a shared-memory view of buses using direct point-to-point links. Based on the IEEE SCI standard, Dolphin's SCI interconnect addresses both the high-performance computing and networking domains. Emphasizing flexibility, scalability, and multi-gigabit-per-second data transfers, SCI's main application area is as a SAN for high-performance computing clusters. Recent SCI networks are capable of achieving low latencies (under 2 μs) and high throughputs (5.3 Gb/s peak link throughput) over point-to-point links with cut-through switching. Figure 2-1 is an architectural block diagram of Dolphin's PCI-based SCI NIC.

Using the unidirectional ringlet as a basic building block, it is possible to obtain a large variety of topologies, such as counter-rotating rings and unidirectional and bi-directional tori. Figure 2-2 shows a sample unidirectional SCI ringlet. Figure 2-3 shows a 2D 3-ary SCI torus.

Unlike many other competing SANs, SCI also offers support for both the shared-memory and message-passing paradigms. By exporting and importing memory chunks, SCI provides a shared-memory programming architecture. All exported memory chunks have a unique identifier, which is the collection of the exporting node's SCI node ID and the exporting application's Chunk ID and Module ID. Imported memory chunks are mapped into the importing application's virtual memory space. To exchange messages between the nodes, the data must be copied to this imported memory segment. The SCI NIC detects this transaction and automatically converts the request to an SCI network transaction. The PCI-to-SCI memory address mapping is handled by the SCI protocol engine. The 32-bit PCI addresses are converted into 64-bit SCI addresses, in which the most significant 16 bits are used to select between up to 64K distinct SCI devices.

Figure 2-1. Architectural block diagram of Dolphin's PCI-based SCI NIC.

Each SCI transaction typically consists of two sub-transactions: a request and a response. For the request sub-transaction, a read or write request packet is sent by the requesting node to the destination node. The destination node sends an echo packet to the requesting node upon receiving the request packet. Concurrently, the recipient node processes the request and sends its own response packet to the requesting node. The requesting node will acknowledge and commit the transaction by sending an echo packet back to the recipient node upon receiving the response packet.

Figure 2-2. Unidirectional SCI ringlet.

Figure 2-3. Unidirectional 2D 3-ary SCI torus.

Related Research

Research on multicast communication in the literature can be briefly categorized into two groups: unicast-based and multi-destination-based [15]. Among the unicast-based multicasting methods, separate addressing is the simplest one, in which the source node iteratively sends the message to each destination node one after another as separate unicast transmissions [20]. Another approach for unicast-based multicasting is to use a multi-phase communication configuration for delivering the message to the destination nodes. In this method, the destination nodes are organized in some sort of binomial tree, and at each communication step the number of nodes covered increases by a factor of n, where n denotes the fan-out factor of the binomial tree. The U-torus multicast algorithm proposed by Robinson et al. [20] is a slightly modified version of this binomial-tree approach for direct torus networks that use wormhole routing.

Lin and Ni [21] were the first to introduce and investigate the path-based multicasting approach. Subsequently, path-based multicast communication has received attention and has been studied for direct networks [14, 20, 22].

Regarding path-based studies, this dissertation concentrates on the work of Robinson et al. [20, 22], in which they defined the U-torus, S-torus, Md-torus, and Mu-torus algorithms. These algorithms were proposed as a solution to the multicast communication problem for generic, wormhole-routed, direct unidirectional and bi-directional torus networks. More details about path-based multicast algorithms for wormhole-routed networks can be found in the survey of Li and McKinley [23]. Tree-based multicasting has also received attention [23, 24]; these studies focused on solving the deadlock problem for indirect networks.

SCI unicast performance analysis and modeling has been discussed in the literature [24-26, 28], while collective communication on SCI has received little attention and its multicast communication characteristics are still unclear. The limited studies along this avenue have used collective communication primitives for assessing the scalability of various SCI topologies from an analytical point of view [29, 30], while no known study has yet investigated the multicast performance of SCI.

Selected Multicast Algorithms

The algorithms analyzed in this study were defined in the literature [20, 22]. This section simply provides an overview of how they work and briefly points out their differences. Bound by the limits of the available hardware, two unicast-based and three path-based multicast algorithms were selected, thereby keeping an acceptable degree of variety among different classes of multicast routing algorithms. In this dissertation, the aggregate collection of all destination nodes and the source node is called the multicast group. Therefore, for a given group of size d, there are d - 1 destination nodes. Figure 2-4 shows how each algorithm operates for a group size of 10. The root node and the destination nodes are clearly marked and the message transfers are indicated. Alphabetic labels next to each arrow indicate the individual paths, and the numerical labels represent the logical communication steps on each path.

Separate Addressing

Separate addressing is the simplest unicast-based algorithm in terms of algorithmic complexity. For small group sizes and short messages, separate addressing can be an efficient approach. However, for large messages and large group sizes, the iterative unicast transmissions may result in large host-processor overhead. Another drawback of this protocol is its linearly increasing multicast completion latency with increasing group size. Figure 2-4A shows separate addressing for a given multicast problem.

The U-torus Algorithm

The U-torus [20] is another unicast-based multicast algorithm; it uses a binomial-tree approach to reduce the total number of required communication steps. For a given group of size d, the lower bound on the number of steps required by U-torus to complete the multicast is ⌈log_2 d⌉. This reduction is achieved by increasing the number of covered destination nodes by a factor of 2 in each communication step. Figure 2-4B shows a typical U-torus multicast scenario. Applying U-torus to this group starts with dimension ordering of all the nodes, including the root, based on their physical placement in the torus network, given in (column, row) format. The dimension-ordered node set is then rotated around to place the root node at the beginning of the ordered list, as given below:

Dimension-ordered set: {(1,1), (1,2), (1,3), (2,2), (2,4), (3,1), (3,3), (4,1), (4,2), (4,4)}
Rotated set: {(2,2), (2,4), (3,1), (3,3), (4,1), (4,2), (4,4), (1,1), (1,2), (1,3)}

Here, the first set is the dimension-ordered group and the second is its rotated version. The order in the rotated set also defines the final ranking of the nodes, as they are sequentially ranked starting from the leftmost node. As an example, for the rotated set given above, node (2,2) has a ranking of 0, node (2,4) has a ranking of 1, and node (1,3) has a ranking of 9.

Figure 2-4. Selected multicast algorithms for torus networks. Multicast group size is 10. Individual message paths are marked alphabetically, and the numerical labels represent the logical communication steps for each message path. A) The separate addressing algorithm. B) The U-torus algorithm. C) The S-torus algorithm. D) The Md-torus algorithm. E) The Mu-torus algorithm.

After obtaining the rotated set, the root node sends the message to the center node of the set, partitioning the multicast problem of size d into two subsets of size ⌈d/2⌉ and ⌊d/2⌋. The center node is calculated by Eq. 2-1, as described by Robinson et al. [20], where left denotes the ranking of the leftmost node and right denotes the ranking of the rightmost node.

center = left + ⌊(right - left + 1) / 2⌋    (2-1)

For the group given above, left is rank 0 and right is rank 9; therefore the center is 5, which implies node (4,2). The root node transmits the multicast message, and the new partition's subset information D_subset, to the center node. Using the same example, at the end of the first step the root node will have the subset D_subset_root = {(2,2), (2,4), (3,1), (3,3), (4,1)}, with the values of left and right being rank 0 and 4, respectively. The node (4,2) will have the subset D_subset_(4,2) = {(4,2), (4,4), (1,1), (1,2), (1,3)}, with the values of left and right again being 0 and 4. In the second step, the original root and the (4,2) node both act as root nodes, partitioning their respective subsets in two and sending the multicast message to their subset's center node, along with the new partition's D_subset information. This process continues recursively, until all destination nodes have received the message.

The S-torus Algorithm

The S-torus, a path-based multicast routing algorithm, was defined by Robinson et al. [22] for wormhole-routed torus networks. It is a single-phase communication algorithm. The destination nodes are ranked and ordered to form a Hamiltonian cycle. A Hamiltonian cycle is a closed circuit that starts and ends at the source node, where every other node is listed only once. For any given network, more than one Hamiltonian cycle may exist. The Hamiltonian cycle that S-torus uses is based on a ranking order of nodes, which is calculated with the formula given in Eq. 2-2 for a k-ary 2D torus.

l(u) = k · u_0(u) + (u_0(u) + u_1(u)) mod k    (2-2)

Here, l(u) represents the Hamiltonian ranking of a node u, with coordinates given as (u_0(u), u_1(u)). More detailed information about Hamiltonian node rankings can be found in [9]. Following this step, the ordered Hamiltonian cycle is rotated around to place the root at the beginning, forming a new, rotated cycle. The root node then issues a multi-destination worm which visits each destination node one after another, following the ordered set. At each destination node, the header is truncated to remove the visited destination address and the worm is re-routed to the next destination. The algorithm continues until the last destination node receives the message. Robinson et al. also proved that S-torus routing is deadlock-free [9]. Figure 2-4C shows the S-torus algorithm for the same example presented previously, for a torus network without wormhole routing. The Hamiltonian rankings are noted as the leading label of each node, and the ordered cycle and its rotated version are obtained as:

Hamiltonian-ordered cycle: {4(1,3), 6(1,1), 7(1,2), 8(2,2), 10(2,4), 12(3,1), 14(3,3), 16(4,4), 17(4,1), 18(4,2)}
Rotated cycle: {8(2,2), 10(2,4), 12(3,1), 14(3,3), 16(4,4), 17(4,1), 18(4,2), 4(1,3), 6(1,1), 7(1,2)}

The M-torus Algorithm

Belying its simplicity, single-phase communication is known for large latency variations over a large set of destination nodes [31]. Therefore, to further improve the S-torus algorithm, Robinson et al. proposed the multi-phase multicast routing algorithm M-torus [22]. The idea was to shorten the path lengths of the multi-destination worms to stabilize the latency variations and to achieve better performance by partitioning the multicast group. They introduced two variations of the M-torus algorithm, Md-torus and Mu-torus. The Md-torus algorithm uses a dimensional partitioning method, whereas Mu-torus uses a uniform partitioning mechanism. In both of these algorithms, the root node separately transmits the message to each partition, and the message is then further relayed inside the subsets using multi-destination worms. The Md-torus algorithm partitions the nodes based on their respective sub-torus dimensions, therefore eliminating costly dimension-switching overhead. For example, in a 3D torus, the algorithm will first partition the group into subsets of 2D planes of the network, and then into ringlets for each plane. For a k-ary N-dimensional torus network, where k^N is the total number of nodes, the Md-torus algorithm needs N steps to complete the multicast operation. By contrast, the Mu-torus algorithm tries to minimize and equalize the path length of each worm by applying uniform partitioning. Mu-torus is parameterized by the partitioning size, denoted by r. For a group size of d, the Mu-torus algorithm with a partitioning size of r requires ⌈log_r(d)⌉ steps to complete the multicast operation. For the same example presented previously, Figure 2-4D and Figure 2-4E show Md-torus and Mu-torus, respectively, again assuming a network without wormhole routing, where r = 4.

Case Study

To comparatively evaluate the performance of the selected algorithms, an experimental case study is conducted over a high-performance unidirectional SCI torus network. The following subsections explain the experiment details and the results obtained.

Description

There are 16 nodes in the case-study testbed. Each node is configured with dual 1 GHz Intel Pentium-III processors and 256 MB of PC133 SDRAM. Each node also features a Dolphin SCI NIC (PCI-64/66/D330) with 5.3 Gb/s link speed using Scali's SSP (Scali Software Platform) 3.0.1, and runs Red Hat Linux 7.2 with kernel version 2.4.7-10smp, mtrr patched, and write-combining enabled. The nodes are interconnected to form a 4×4 unidirectional torus.

For all of the selected algorithms, the polling notification method is used to lower the latencies. Although this method is known to be effective for achieving low latencies, it results in higher CPU loads, especially if the polling process runs for extended periods. To further decrease the completion latencies, the multicast-tree creation is removed from the critical path and performed at the beginning of each algorithm in every node.

Throughout the case study, modified versions of the three path-based algorithms, S-torus, Md-torus, and Mu-torus, are used. These algorithms were originally designed to use multi-destination worms. However, as with most high-speed interconnects available on the market today, our testbed does not support multi-destination worms. Therefore, store-and-forward versions of these algorithms are developed.

18 On our 4-ary 2D torus testbed, M d -torus partitions the torus network into simple 4-node rings. For a fair comparison between the M d -torus and the M u -torus algorithms, the partition length r of 4 is chosen for M u -torus. Also, the partition information for U-torus is embedded in the relayed multicast message at each step. Although separate addressing exhibits no algorithmic concurrency, it is possible to provide some degree of concurrency by simply allowing multiple message transfers to occur in a pipelined structure. This method is used for our separate address algorithm. Case study experiments with the five algorithms are performed for various group sizes and for small and large message sizes. Each algorithm is evaluated for each message and group size 100 times, where each execution has 50 repetitions. The variance was found to be very small and the averages of all executions are used in this study. Four different sets of experiments are performed to analyze the various aspects of each algorithm, which are explained in detailed in the following subsections. Multicast Completion Latency Two different sets of experiments for multicast completion latency are performed, one for a message size of 2B and the other for a message size of 64KB. Figure 2-5 shows the multicast completion latency versus group size for small and large messages. The S-torus algorithm has the worst performance for both small and large messages. Moreover, S-torus shows a linear increase in multicast completion latency with respect to the increasing group size, as it exhibits no parallelism in message transfers. By contrast, the separate addressing algorithm has a higher level of concurrency because of its design and performs best for small messages. However, it also presents linearly increasing completion latencies for large messages with increasing group size.

Figure 2-5. Completion latency vs. group size. A) Small messages. B) Large messages.

The Md-torus and Mu-torus algorithms exhibit similar levels of performance for both small and large messages. The difference between these two becomes more distinctive at certain data points, such as 10 and 14 nodes for large messages. For group sizes of 10 and 14, the partition length for Mu-torus does not provide perfectly balanced partitions, resulting in higher multicast completion latencies. Finally, U-torus has nearly

flat latency for small messages. For large messages, it exhibits similar behavior to Mu-torus. Overall, separate addressing appears to be the best for small messages and groups, while for large messages and groups Md-torus performs better compared to the other algorithms.

User-level CPU Utilization

User-level host processor load is measured using Linux's built-in sar utility. Figure 2-6 shows the maximum CPU utilization for the root node of each algorithm for small and large messages. It is observed that S-torus exhibits constant CPU load for the small message size, independent of the group size. However, for large messages, as the group size increases the completion latency also increases linearly, as shown in Figure 2-5B, and the extra polling involved results in higher CPU utilization for the root node. This effect is clearly seen in Figure 2-6B. In the separate addressing algorithm, the root node iteratively performs all message transfers to the destination nodes. As expected, this behavior causes a nearly linear increase in CPU load with increasing group size, which can be observed in Figure 2-6B. By contrast, since the number of message transmissions for the root node stays constant, Md-torus provides a nearly constant CPU overhead for small messages for every group size. For large messages and small group sizes, Md-torus performs similarly. However, for group sizes greater than 10, the CPU utilization tends to increase because of the variations in the path lengths causing extended polling durations. Although these variations are the same for both the small and the large messages, the effect is more visible for the large message size.

Figure 2-6. User-level CPU utilization vs. group size. A) Small messages. B) Large messages.

The Mu-torus algorithm exhibits behavior identical to Md-torus for small messages. Moreover, for large messages, Mu-torus also provides higher but constant CPU utilization. For U-torus, the number of communication steps required to cover all destination nodes is given in previous sections. It is observed that at certain group sizes,

such as 4, 8, and 16, the number of these steps increases, and therefore the CPU load also increases. This behavior of U-torus can be clearly seen in Figure 2-6.

Multicast Tree Creation Latency

Multicast tree creation latency of the SCI API is also an important metric since, for small message sizes, this factor might impede the overall communication performance. The multicast tree creation latency is independent of the message size. Figure 2-7 shows multicast tree creation latency versus group size.

Figure 2-7. Multicast tree-creation latency vs. group size.

Figure 2-7 shows the multicast tree creation latencies for the four algorithms that use a tree-like group formation for message delivery. The Mu-torus and Md-torus algorithms only differ in their partitioning methods as described before, and both methods are quite complex compared to the other algorithms. This complexity is seen in Figure 2-7, as they exhibit the highest multicast tree-creation latencies. The U-torus algorithm has a simple and distributed partitioning process and, compared to the two M-torus algorithms, it has lower tree-creation latency. Unlike the

other tree-based algorithms, S-torus does not perform any partitioning and only orders the destination nodes as described previously. Therefore, S-torus exhibits the lowest and a very slowly, linearly increasing latency because of the simplicity of its tree formation.

Link Concentration and Concurrency

Link concentration is defined here as the ratio of two components: the number of link visits and the number of used links. The number of link visits is defined as the cumulative number of links used during the entire communication process, while the number of used links is the number of individual links used. Link concurrency is the maximum number of messages that are in transit in the network at any given time. Link concentration and link concurrency are given in Figure 2-8. Link concentration combined with link concurrency illustrates the degree of communication balance. The concentration and concurrency values presented in Figure 2-8 are obtained by analyzing the theoretical communication structures and the experimental timings of the algorithms. The S-torus algorithm is a simple chained communication and there is only one active message transfer in the network at any given time. Therefore, S-torus has the lowest and a constant link concentration and concurrency compared to the other algorithms. By contrast, because of the high parallelism provided by the recursive doubling approach, the U-torus algorithm has the highest concurrency. Separate addressing exhibits a degree of concurrency identical to U-torus, because multiple message transfers overlap in time as a result of network pipelining. The Md-torus algorithm has a link concentration that is inversely proportional to the group size. In Md-torus, the root node first sends the message to the destination header nodes, and they relay it to their child nodes. As the number of dimensional header nodes is constant (k in a k-ary

torus), with the increasing group size each new child node added to the group will increase the number of available links. Moreover, because of the communication structure of the Md-torus, the number of used links increases much more rapidly compared to the number of link visits with the increasing group size. This trend asymptotically limits the decreasing link concentration to 1.

Figure 2-8. Communication balance vs. group size. A) Link concentration. B) Link concurrency.

The concurrency of

Md-torus is upper bounded by k, as each dimensional header relays the message over separate ringlets with k nodes in each. The Mu-torus algorithm has low link concentration for all group sizes, as it multicasts the message to the partitioned destination nodes over a limited number of individual paths as shown in Figure 2-4D, where only a single link is used per path at a time. By contrast, for a given partition length of constant size, an increase in the group size results in an increase in the number of partitions and an increase in the number of individual paths. This trait results in more messages being transferred concurrently at any given time over the entire network.

Multicast Latency Modeling

The experiments throughout the SCI case study have investigated the performance of multicast algorithms over a 2D torus network having a maximum of 16 nodes. However, ultimately modeling will be a key tool in predicting the relative performance of these algorithms for system sizes that far exceed our current testbed capabilities. Also, by modifying the model, future systems with improved characteristics could be evaluated quickly and accurately. The model presented in the next subsections assumes an equal number of nodes in each dimension for a given N-dimensional torus network. The presented small-message latency model follows the LogP model [32]. LogP is a general-purpose model for distributed-memory machines with asynchronous unicast communication [32]. Under LogP, a new communication operation can be issued at most once every g CPU cycles, and a one-way small-message delivery to a remote location is formulated as

t_communication = L + 2 * ((o_sender + o_receiver) / 2) (2-3)

where L is the upper bound on the latency for the delivery of a message from its source processor to its target processor, and o_sender and o_receiver represent the sender and receiver communication overheads, respectively. Today's high-performance NICs and interconnection systems are fast enough that any packet can be injected into the network as soon as the host processor produces it, without any further delays [33]. Therefore, g is negligible for modeling high-performance interconnects. This observation yields the relaxed LogP model for one-way unicast communication, given as

t_communication = o_sender + t_network + o_receiver (2-4)

where t_network is the total time spent between the injection of the message into the network and the drainage of the message. Following this approach, a model is proposed to capture the multicast communication performance of high-performance torus networks on an otherwise unloaded system. The model is based on the concept that each multicast message can be expressed as sequences of serially forwarded unicast messages from root to destinations. The proposed model is formulated as

t_multicast = Max over all paths [ o_sender + t_network + o_receiver ] (2-5)

where the Max[ ] operation yields the total multicast latency for the deepest path over all paths, and t_multicast is the time interval between the initiation of the multicast and the last destination's reception of the message. Figure 2-9 shows this concept on a sample binomial-tree multicast scenario. For this scenario, t_multicast is determined over the deepest path, which is the route from node 0 to node 7.
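A minimal sketch of Eq. 2-5 is given below for the 8-node binomial tree of Figure 2-9. The tree shape follows the standard binomial construction; the per-hop timing values are placeholders chosen for illustration, not measured parameters.

# Hedged sketch of Eq. 2-5: multicast latency as the worst (deepest) root-to-leaf
# path of per-hop relaxed-LogP unicast times. The tree is the 8-node binomial
# tree of Figure 2-9; the numeric parameters are assumptions, not measurements.
TREE = {0: [1, 2, 4], 1: [3, 5], 2: [6], 3: [7], 4: [], 5: [], 6: [], 7: []}

O_SENDER = O_RECEIVER = 2.0   # usec, assumed overheads
T_NETWORK = 5.0               # usec, assumed per-hop network time

def t_multicast(root=0):
    """Max over all root-to-destination paths of the summed per-hop latencies."""
    def deepest(node):
        if not TREE[node]:
            return 0.0
        per_hop = O_SENDER + T_NETWORK + O_RECEIVER   # Eq. 2-4 per forwarded hop
        return max(per_hop + deepest(child) for child in TREE[node])
    return deepest(root)

print(t_multicast())  # path 0 -> 1 -> 3 -> 7 dominates: 3 hops * 9 usec = 27 usec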

Gonzalez et al. [27] modeled the unicast performance of direct SCI networks and observed that t_network can be divided into smaller components as given by:

t_network = h_p * L_p + h_f * L_f + h_s * L_s + h_i * L_i (2-6)

Here h_p, h_f, h_s, and h_i represent the total number of hops, forwarding nodes, switching nodes, and intermediate nodes, respectively. Similarly, L_p, L_f, and L_s denote the propagation delay per hop, the forwarding delay through a node, and the switching delay through a node, respectively.

Figure 2-9. Sample multicast scenario for a given binomial tree.

The L_i denotes the intermediate delay through a node, which is the sum of the receiving overhead of the message, the processing time, and the sending overhead of the message. Figure 2-10A shows an arbitrary mapping of the problem given above in Figure 2-9 to a 2D 4×4 torus network. Figure 2-10B shows a visual breakdown of t_multicast over the same arbitrary mapping. Following the method outlined by Gonzalez et al. [27] and using the experimental results obtained from the case study presented in this dissertation, the model parameters are measured and calculated for short message sizes. Assuming that the

electrical signals propagate through a copper conductor at approximately half the speed of light, and observing that our SCI torus testbed is connected with 2m long cables, yields 14ns of propagation latency per link. Since L_p represents the latency for the head of a message passing through a conductor, it is therefore independent of message size [27].

Figure 2-10. Small-message latency model parameters. A) Arbitrary mapping of the multicast problem given in Figure 2-9 to a 2D 4-ary torus. B) Breakdown of the t_multicast parameter over the deepest path for the given multicast scenario.
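As a sketch of how Eq. 2-6 is evaluated, the snippet below plugs in the measured per-hop parameters reported in the text and in Figure 2-11 (L_p = 14ns, L_f = 60ns, L_s = 670ns). The hop counts and the assumed intermediate-node delay are illustrative placeholders for one hypothetical path, not values taken from a specific experiment.

# Sketch of Eq. 2-6 for the SCI testbed, using the measured per-hop parameters
# reported in the text and in Figure 2-11 (L_p = 14 ns, L_f = 60 ns, L_s = 670 ns).
# Hop counts below are illustrative placeholders for one particular path.
L_P, L_F, L_S = 14, 60, 670   # ns

def t_network_ns(h_p, h_f, h_s, h_i, L_i_ns):
    """Eq. 2-6: total network time for one unicast path, in nanoseconds."""
    return h_p * L_P + h_f * L_F + h_s * L_S + h_i * L_i_ns

# Example: 4 hops, 2 forwarding nodes, 1 switching node, and 1 intermediate node
# whose delay L_i is assumed to be roughly 210 us (see Table 2-1 with o of ~2 us).
print(t_network_ns(4, 2, 1, 1, 210_000) / 1000.0, "us")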

Model parameters o_sender, o_receiver, L_f, and L_s were obtained through ring-based and torus-based API unicast experiments. Ring-based experiments were performed with two-, four-, and six-node ring configurations. For each setup, h_p and L_p are known. Inserting these values into Eqs. 2-4 and 2-6 and taking algebraic differences between the two-, four-, and six-node experiments yields o_sender, o_receiver, and L_f. Switching latency, L_s, is determined through torus-based experiments by comparing the latency of message transfers within a dimension versus between dimensions. Figure 2-11 shows the L_p, L_f, L_s, o_sender, and o_receiver model parameters for various short message sizes. The L_p, L_f, L_s, o_sender, and o_receiver parameters are communication-bounded; they are dominated by NIC and interconnect performance. The L_i is dependent upon interconnect performance and the node's computational power, and is formulated as:

L_i = o_sender + t_process + o_receiver (2-7)

Calculations based on the experimental data obtained previously show that each multicast algorithm's processing load, t_process, is different. This load is composed mainly of the time needed to check the child node positions on the multicast tree and the calculation time of the child node(s) for the next iteration of the communication. Also, t_process is observed to be the dominant component of L_i for short message sizes. Moreover, it can be easily seen that, compared to the t_process and L_i parameters, the L_p, L_f, and L_s values are drastically smaller, and thus they are relatively negligible. Dropping these negligible parameters, Eq. 2-5 can be simplified and expressed as

t_multicast = (total number of destination nodes) * (o_sender + t_process + o_receiver) (2-8)
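A quick back-of-the-envelope check of this simplification is shown below, using the measured link-level values and the node-level values reported later in Figure 2-11 and Table 2-1. The hop counts are illustrative assumptions, not taken from a specific path.

# Quick numeric check of the simplification above, using the measured values
# (L_p = 14 ns, L_f = 60 ns, L_s = 670 ns from Figure 2-11; t_process and o on
# the order of hundreds and a couple of microseconds, per Table 2-1).
# The hop counts are illustrative, not taken from a specific path.
link_terms_ns = 6 * 14 + 3 * 60 + 2 * 670      # h_p*L_p + h_f*L_f + h_s*L_s
node_terms_ns = 206_000 + 2 * 1994             # t_process + 2o for Md-torus
print(link_terms_ns, "ns vs", node_terms_ns, "ns")
print("link terms are about {:.2%} of the node terms".format(link_terms_ns / node_terms_ns))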

for the separate addressing model. For the Md-torus, Mu-torus, U-torus, and S-torus algorithms the simplified model can be formulated as in Eq. 2-9. As can be seen, the modeling is straightforward for separate addressing, and for the remaining algorithms the modeling problem is now reduced to identifying the number of intermediate nodes over the longest multicast message path. Moreover, without loss of generality, the o_sender and o_receiver values can be treated as equal to one another [27] for simplicity, and we represent them by a single parameter, o.

t_multicast = Max over all paths [ o_sender + h_i * L_i + o_receiver ] = Max over all paths [ 2o + h_i * L_i ] (2-9)

Figure 2-11. Measured and calculated model parameters for short message sizes: L_p=14ns, L_f=60ns, L_s=670ns, and o=1994+3.15(M-128)ns.

Of course, the variable L_i is not involved in the separate addressing algorithm. The reason is simply that separate addressing consists of a series of unicast message transmissions from source to destination nodes, so there are no intermediate nodes needed to relay the multicast message to other nodes. Table 2-1 shows the t_process and L_i values for short

message sizes. The t_process parameter is independent of the multicast message and group size but strictly dependent on the performance of the host machine. Therefore, for different computing platforms, different t_process values will be obtained.

Table 2-1. Calculated t_process and L_i values.
                      t_process (usec)    L_i (usec)
Separate Addressing   7                   N/A
Md-torus              206                 206+2o
Mu-torus              201                 201+2o
U-torus               629                 629+2o
S-torus               265                 265+2o

The following subsections will discuss and provide more detail about the simplified model for each multicast algorithm. For each algorithm, the modeled values, the actual testbed measurements, and the modeling error will also be presented.

The Separate Addressing Model

With the separate addressing algorithm, for a group size of G, there are G-1 destination nodes that the root node alone must serve. Therefore, a simplified model for separate addressing can be expressed as given in Eq. 2-10.

Figure 2-12. Simplified model vs. actual measurements for the separate addressing algorithm.

t_multicast = (G - 1) * (o_sender + t_process + o_receiver) (2-10)

Figure 2-12 shows the simplified model and the actual measurements for various multicast group sizes for a 128-byte multicast message, using the o and t_process values previously defined. The results show that the model is accurate with an average error of ~3%.

The Md-torus Model

For Md-torus, the total number of intermediate nodes for any communication path is observed to be a function of G, the multicast group size, and k, the number of nodes in a given dimension. The simplified Md-torus model is formulated as given in Eq. 2-11.

Figure 2-13. Simplified model vs. actual measurements for the Md-torus algorithm.

t_multicast = 2o + ⌈G/k⌉ * L_i (2-11)

The simplified model and actual measurements for various group sizes with 128-byte messages are plotted in Figure 2-13. As can be seen, the simplified model is accurate with an average error of ~2%.
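The sketch below evaluates the two simplified models above with the measured parameters (o from Figure 2-11 for a 128-byte message, t_process and L_i from Table 2-1). The ceiling term in the Md-torus expression follows the reconstruction given above and should be read as part of that assumption.

# Sketch evaluating the simplified models above with the measured parameters
# (o from Figure 2-11 for a 128-byte message, t_process and L_i from Table 2-1).
# The ceiling term in the Md-torus expression follows the reconstruction above.
import math

O_US = 1.994                 # o = 1994 ns for a 128-byte message
T_PROC_SEP = 7.0             # usec, separate addressing (Table 2-1)
L_I_MD = 206.0 + 2 * O_US    # usec, Md-torus intermediate-node delay
K = 4                        # nodes per dimension on the 4x4 testbed

def sep_addressing(G):
    """Eq. 2-10: the root serves all G-1 destinations itself."""
    return (G - 1) * (O_US + T_PROC_SEP + O_US)

def md_torus(G):
    """Eq. 2-11: 2o plus one intermediate-node delay per ringlet level."""
    return 2 * O_US + math.ceil(G / K) * L_I_MD

for G in (4, 8, 16):
    print(G, round(sep_addressing(G), 1), round(md_torus(G), 1))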

The Mu-torus Model

The number of partitions for the Mu-torus algorithm, denoted by p, is a byproduct of the multicast group size, G, and the partition length, r. For systems with r equal to G, there exists only one partition and the multicast message is propagated in a chain-type communication mechanism among the destination nodes. Under this condition, the number of intermediate nodes is simply two less than the group size; the subtracted two are the root and the last destination nodes. For systems with two or more partitions, the number of intermediate nodes becomes a function of the group size, the partition length, and the number of nodes in a given dimension. The simplified model is given as:

t_multicast = 2o + (G - 2) * L_i, for p = 1
t_multicast = 2o + h_i(G, r, k) * L_i, for p >= 2 (2-12)

where h_i(G, r, k) is the number of intermediate nodes on the deepest path, determined by the group size, the partition length, and the number of nodes in a given dimension.

Figure 2-14. Simplified model vs. actual measurements for the Mu-torus algorithm.

Figure 2-14 shows the small-message model versus actual measurements for 128-byte messages. The results show that the model is accurate with an average error of ~2%.

The U-torus Model

For U-torus, the minimum number of communication steps required to cover all destination nodes can be expressed as ⌈log2(G)⌉. The number of intermediate nodes in the U-torus algorithm is a function of the minimum required communication steps, the group size, and the number of nodes in a given dimension. The simplified U-torus model is given as:

t_multicast = 2o + h_i(⌈log2(G)⌉, G, k) * L_i (2-13)

where h_i is the number of intermediate nodes on the deepest path, determined by the minimum number of communication steps, the group size, and the number of nodes in a given dimension. Figure 2-15 shows the short-message model and actual measurements for various group sizes. The results show the model is accurate with an average error of ~2%.

Figure 2-15. Simplified model vs. actual measurements for the U-torus algorithm.

The S-torus Model

The S-torus is a chain-type communication algorithm and can be modeled identically to the single-partition case of the Mu-torus algorithm. The simplified model for S-torus is formulated as:

t_multicast = 2o + (G - 2) * L_i (2-14)
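A minimal sketch of Eq. 2-14 follows, using the measured values (o of about 1.994 usec for 128-byte messages from Figure 2-11, and L_i = 265 + 2o usec for S-torus from Table 2-1); the printed group sizes are only examples.

# Sketch of Eq. 2-14 with the measured values (o ~ 1.994 usec for 128-byte
# messages from Figure 2-11, L_i = 265 + 2o usec for S-torus from Table 2-1).
O_US = 1.994
L_I_STORUS = 265.0 + 2 * O_US

def s_torus(G):
    """Eq. 2-14: a single forwarding chain with G-2 intermediate nodes."""
    return 2 * O_US + (G - 2) * L_I_STORUS

for G in (4, 8, 16):
    print(G, round(s_torus(G), 1), "usec")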

The results from the short-message model versus actual measurements for 128-byte messages are shown in Figure 2-16. As previously stated, S-torus routing is based on a Hamiltonian circuit. This type of routing ensures that each destination node will receive only one copy of the message, but some forwarding nodes (i.e., non-destination nodes that are on the actual message path) may be visited more than once for routing purposes. Moreover, depending on the group size, single-phase routing traverses many unnecessary channels, creating more traffic and possible contention. Therefore, S-torus has unavoidably large latency variations because of the varying and sometimes extremely long message paths [31]. The small-message model presented is incapable of tracking these large variations in the completion latency that are inherent to the S-torus multicast algorithm. The modeling error is relatively high, unlike for the other algorithms evaluated in this study. The instability of the modeling error is not expected to lessen with increasing group size.

Figure 2-16. Simplified model vs. actual measurements for the S-torus algorithm.

Analytical Projections

To evaluate and compare the short-message performance of the multicast algorithms for larger systems, the simplified models are used to investigate 2D torus-based parallel systems with 64 and 256 nodes. The effects of different partition lengths (i.e., r=8, r=16, r=32) for the Mu-torus algorithm over these system sizes are also investigated analytically with these projections. The results of the projections are plotted in Figure 2-17. The Md-torus algorithm has a step-like completion latency caused by the fact that, with every new k destination nodes added to the group, a new ringlet is introduced to the multicast communication, which increases the completion latency. The optimal performance for the 8×8 torus network is obtained when Mu-torus has a partition length of 8, and for the 16×16 torus network when the partition length is 16. Therefore, it is surmised that the optimal partition length for Mu-torus is equal to k for 2D SCI tori.

Figure 2-17. Continued.

Figure 2-17. Small-message latency projections. A) Projection values for an 8×8 torus system. B) Projection values for a 16×16 torus system.

The U-torus, on average, has slowly increasing completion latency with increasing group sizes. The U-torus, Mu-torus, and Md-torus algorithms all tend to have similar asymptotic latencies. The S-torus and separate addressing algorithms, as expected, have linearly increasing latencies with increasing group sizes. Although separate addressing is the best choice for small-scale systems, it loses its advantage with increasing group sizes for short messages. The S-torus, by contrast, again proves to be a poor choice, as it is simply the worst-performing algorithm for group sizes greater than 8.

Summary

This phase of the dissertation has investigated the multicast problem on high-performance torus networks. Software-based multicast algorithms from the literature were applied to the SCI network, a commercial example of a high-performance

torus network. Experimental analysis and small-message latency models of these algorithms are introduced. Analytical projections based on the verified small-message latency models for larger size systems are also presented. Based on the experimental results presented earlier, it is observed that the separate addressing algorithm is the best choice for small messages or small group sizes from the perspective of multicast completion latency and CPU utilization, because of its simple and cost-effective structure. The Md-torus algorithm performs best from the perspective of completion latency for large messages or large group sizes, because of the balance provided by its use of dimensional partitioning. In addition, Md-torus incurs a very low CPU overhead and achieves high concurrency for all the message and group sizes considered. The U-torus and Mu-torus algorithms perform better when the individual multicast path depths are approximately equal. Furthermore, the Mu-torus algorithm exhibits its best performance when the group size is an exact multiple of the partition length. The U-torus and Mu-torus algorithms have nearly constant CPU utilizations for small and large messages alike. Moreover, the U-torus algorithm has the highest concurrency among all algorithms evaluated, because of the high parallelism provided by the recursive-doubling method. The S-torus algorithm is always the worst performer from the perspective of completion latency and CPU utilization because of its lack of concurrency and its extensive communication overhead. As expected, S-torus exhibits a nearly linear increase in completion latency and CPU utilization for large messages with increasing group size. The small-message latency models, using only a few parameters, capture the essential mechanisms of multicast communication over the given platforms. The models

are accurate for all evaluated algorithms except the S-torus algorithm. Small-message multicast latency projections for larger torus systems are provided using these models. Projection results show that with increasing group size the U-torus, Mu-torus, and Md-torus algorithms tend to have similar, asymptotically bounded latencies. Therefore, it is possible to choose an optimal multicast algorithm among these three for larger systems, based on the multicast completion latency and other metrics such as CPU utilization or network link concentration and concurrency. It is also possible and straightforward to project the multicast performance of larger-scale 2D torus networks with our model. Projected results show that S-torus and separate addressing have unbounded and linearly increasing completion latencies with increasing group sizes, which makes them unsuitable for large-scale systems. Applying the simplified models to other torus networks and/or multicast communication schemes is possible with a minimal calibration effort. These results make it clear that no single multicast algorithm is best in all cases for all metrics. For example, as the number of dimensions in the network increases, the Md-torus algorithm becomes dominant. By contrast, for networks with fewer dimensions supporting a large number of nodes, the Mu-torus and the U-torus algorithms are most effective. Separate addressing is an efficient and cost-effective choice for small-scale systems. Finally, S-torus is determined to be inefficient as compared to the alternative algorithms in all the cases evaluated. This inefficiency is caused by the extensive length of the paths used to multicast, which in turn leads to long and widely varying completion latencies and a high degree of root-node CPU utilization.

CHAPTER 3
MULTICAST PERFORMANCE ANALYSIS AND MODELING FOR HIGH-SPEED INDIRECT NETWORKS WITH NIC-BASED PROCESSORS

Chapter 2 focused on investigating the topological characteristics of high-performance torus networks for multicast communication. An experimental case study was presented for SCI torus networks. Chapter 3 determines the optimum level of work sharing between the host processor and the NIC processor for multicast communication. The goal is to achieve an optimum balance between multicast completion latency and host processor load. With its onboard NIC RISC processor, Myrinet [18] is an example of such intelligent high-performance interconnects. The following sections explain the details of a study to achieve an optimal work balance between the host and the NIC processors for multicast communication over a Myrinet interconnect.

Myrinet

Myrinet is an indirect high-performance interconnect constructed of switching elements and host interfaces using point-to-point links. The core of the switching element is a pipelined crossbar that supports non-blocking, wormhole routing of unicast packets over bi-directional links at up to 2.0Gbps. A crossbar switching chip is the building block of a Myrinet network. It can be used to build a non-blocking switch. It can also be interconnected to build arbitrary topologies, such as switch-based stars, n-dimensional meshes, or Multistage

Interconnection Network (MIN) topologies. Myrinet provides reliable, connectionless unicast message delivery between communication end-points called ports. Myrinet NICs are equipped with a programmable RISC processor (LANai), three DMA engines, and SRAM memory. Myrinet also supports user-level host access to the NIC, bypassing the operating system for decreased host-to-NIC access latencies and increased throughput. The latest version of Myrinet also supports 64-bit 133MHz PCI-X interfaces. Figure 3-1 shows the architectural block diagram of a Myrinet NIC.

Figure 3-1. Architectural block diagram of Myricom's PCI-based Myrinet NIC.

A striking feature of a Myrinet interconnect is the on-board NIC processor. The main task of this processor is to offload work from the host processor on communication events. This programmable, 32-bit RISC processor runs at 66 or 133 MHz, which is roughly an order of magnitude slower than today's host processors (1000-3000 MHz).

Related Research

Myrinet is the most successful and most widely deployed commercial high-performance interconnect for clusters. It has received extensive attention from academia and industry. Among the many topics of research, collective communication on Myrinet exploiting the NIC processor is of particular interest to this phase of the dissertation. As Myrinet does not support multicasting in hardware, designing efficient and optimal software-based collective communication is the goal of many researchers. All communication-related operations are performed on the Myrinet NIC RISC processor in NIC-based collective communication. This approach is a well-studied method to avoid expensive host-NIC interaction and to reduce system overhead and network transaction latency [34-38]. It was observed that under such communication schemes, very low host CPU loads can be obtained at the cost of increased overall multicast completion latencies. This trait is caused by the fact that the NIC processor is considerably slower (66 or 133 MHz) compared to the host CPU. On the multicasting side, Verstoep et al. [35] extended the Illinois Fast Messages (FM) protocol to produce a totally-ordered, reliable multicasting scheme that is fully processed by the Myrinet NIC processor. The performance of this scheme was evaluated over various spanning-tree multicast protocols. Kesavan et al. [36] presented a simulative evaluation of NIC-based multicast performance of an optimal binomial tree algorithm with packetization support at the NIC level. Bhoedjang et al. [37] simplified

and improved the scalability of Verstoep's design by developing another multicasting scheme that is completely performed by the NIC co-processor. Bunitas et al. [38] presented a NIC-based barrier operation over the Myrinet/GM messaging layer and reported a performance improvement by a factor of 1.83 compared to host-based operations. An analytical model for performance estimation of the barrier operation was also presented. NIC-assisted multicasting was proposed to improve the multicast completion latencies of NIC-based multicast communication schemes while obtaining similar degrees of host CPU loads. Bunitas et al. [39] presented a NIC-assisted binomial tree multicast scheme over FM to improve the latency characteristics of NIC-based multicast algorithms. Among other multicast-related Myrinet research, Sivaram et al. [40] proposed enhancements to a network switch architecture to support reliable hardware-based multicasting. They also presented a detailed latency model of their hardware-based multicast communication approach. This study complements and extends previous work by providing experimental evaluations and small-message latency models of host-based, NIC-based, and NIC-assisted multicast schemes for obtaining optimal host CPU loads and multicast completion latencies. These multicasting schemes are analyzed for the binomial and binary tree, serial forwarding, and separate addressing multicast algorithms. Accurate small-message latency models for these algorithms are also developed and, using these models, multicast completion latencies for larger systems are projected. Results of these comparisons determine the optimum balance of support between the host processor and

the NIC co-processor for multicast communication under various networking scenarios. The next section provides detailed information about the different multicasting schemes.

The Host Processor vs. NIC Processor Multicasting

Multicast communication primitives can be implemented at two different extremes for interconnects with an onboard NIC processor, namely host-based and NIC-based. Between these two extremes there lies another level of implementation: NIC-assisted multicasting. The following subsections will provide detailed information about these three design strategies.

The Host-based Multicast Communication

Host-based multicast communication is the easiest and most conventional way of implementing a multicast communication primitive. In this scheme, the host processor handles all multicasting tasks, such as multicast tree creation and issuing of the unicast send and receive operations. This type of implementation introduces increased CPU load, resulting in a lower computation/communication overlap available for parallel programs. However, as the fast host processor performs all the tasks, host-based multicasting achieves small multicast-completion latencies.

The NIC-based Multicast Communication

In the NIC-based scheme, the NIC co-processor handles all multicasting tasks instead of the host processor. Therefore this scheme provides a reduced CPU load and high computation/communication overlap for parallel programs. However, the Myrinet NIC processor is roughly an order of magnitude slower than modern host processors. Performing all the tasks on this relatively slow processor increases multicast completion latency.
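The back-of-the-envelope comparison below illustrates the latency side of this trade-off for a simple forwarding chain of G nodes. The per-hop processing times are assumptions only (the NIC processor is taken to be roughly 10x slower, as stated above), and the host CPU load side of the trade-off is deliberately not captured here.

# Illustrative comparison of the two extremes for a simple forwarding chain of
# G nodes. The per-hop processing times are assumptions, not measured values.
HOST_PROCESS_US = 10.0     # assumed per-hop processing on the host CPU
NIC_PROCESS_US = 100.0     # assumed per-hop processing on the NIC processor
WIRE_US = 9.0              # assumed per-hop send + network + receive time

def chain_latency(G, per_hop_process):
    """Completion latency of a G-node serial forwarding chain (G-1 hops)."""
    return (G - 1) * (WIRE_US + per_hop_process)

for G in (4, 8, 16):
    print(G, chain_latency(G, HOST_PROCESS_US), chain_latency(G, NIC_PROCESS_US))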

The NIC-assisted Multicast Communication

Between these two extremes, a compromise called NIC-assisted multicasting has been proposed. In this approach, work is shared between the two processors, with the host processor handling computationally intensive multicast tasks, such as multicast tree creation, and the NIC processor handling communication-only multicast tasks, such as unicast send and receive operations. This solution provides low multicast completion latency with moderate CPU loads, resulting in an acceptable computation/communication overlap available for parallel programs. The main difference between these three multicast schemes lies in how they initiate the multicast send operation and how the intermediate nodes relay the multicast message to their child nodes. Figure 3-2 shows a binomial tree multicast operation graphically. All three schemes are applicable to any multicast algorithm. As can be seen from Figure 3-2, in the host-based scheme the host processor issues each multicast send one after another, as if the NIC had no onboard processor. Also, upon the reception of the multicast message, the intermediate nodes pass the message to local host buffers immediately and the host processor processes it and determines the child nodes for the next communication step. After the child nodes have been determined, the host processor also issues the necessary multicast sends. In the NIC-based multicast scheme the host processor issues only one send command. This command includes the multicast message data and the destination node set. The NIC processor creates the multicast tree based on this destination node set information, and then issues all the necessary multicast sends one after another. As soon as the intermediate node receives the multicast message, the NIC processor analyzes the header,

Figure 3-2. Possible binomial tree multicasting variations for Myrinet interconnects. The difference between these schemes can be observed in the host-NIC interactions at nodes 0 and 1. A) Host-based multicast communication scheme. B) NIC-based multicast communication scheme. C) NIC-assisted multicast communication scheme.
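For reference, the snippet below sketches the 8-node binomial tree that Figure 3-2 is drawn around, using the standard binomial-tree construction; it is a generic illustration, not code taken from the testbed implementation.

# Sketch of an 8-node binomial multicast tree rooted at node 0, as in Figure 3-2.
# In a standard binomial-tree multicast, node i forwards to i + 2^j for each
# round j after the one in which i itself is covered.
def binomial_children(node, group_size):
    """Children of `node` in a binomial multicast tree rooted at node 0."""
    children, step = [], 1
    # Find the first round in which `node` itself receives the message.
    while step <= node:
        step *= 2
    while node + step < group_size:
        children.append(node + step)
        step *= 2
    return children

for n in range(8):
    print(n, "->", binomial_children(n, 8))   # 0 -> [1, 2, 4], 1 -> [3, 5], ...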

transfers the multicast data to local buffers for processing, notifies the host processor, and issues the necessary multicast sends to the child nodes. Depending on the DMA capability of the NIC, the transfer of the multicast data to local buffers and the issuing of the multicast messages to child nodes can occur concurrently. As previously stated, the NIC-assisted scheme regulates the two processors to share the workload between them. In this scheme, for a multicast send operation the host processor issues only one send command to the NIC. However, unlike the NIC-based scheme, the host processor in the NIC-assisted scheme encapsulates the destination node set and the pre-computed multicast tree along with the data. The NIC processor only executes the communication commands, which results in a shorter network interface latency as compared to the NIC-based version. Upon reception, the NIC processor immediately transfers the multicast message to the local buffers and notifies the host processor. The host processor then processes the data and creates the child multicast tree, passing it back to the NIC processor. The NIC processor therefore issues the multicast sends without performing any processing.

Case Study

To comparatively evaluate the performance of the host-based, NIC-based, and NIC-assisted communication schemes, an experimental case study is conducted over a 16-node Myrinet network. The following subsections explain the experiment details and the results obtained.

Description

The case study is performed on a 16-node system, each node composed of dual 1GHz Intel Pentium-III processors, 256MB of PC133 SDRAM, a ServerSet III LE (rev 6) chipset, and a 133MHz system bus. A Myrinet network is used as the high-speed

interconnect, where each node has an M2L-PCI64A-2 Myrinet NIC with 1.28 Gb/s link speed, using Myricom's GM-1.6.4 and Redhat Linux 8.0 with kernel version 2.4.18-14smp. The nodes are interconnected to form a 16-node star network using sixteen 3m Myrinet LAN cables and a 16-port Myrinet switch. Binomial tree, binary tree, serial forwarding, and separate addressing algorithms are developed for the host-based, NIC-based, and NIC-assisted communication schemes individually. All of the multicast algorithms are evaluated for small (2B) and large (64KB) message sizes and multicast group sizes of 4, 6, 8, 10, 12, 14, and 16. Multicast completion latency, multicast tree creation latency, host CPU utilization, link concentration, and concurrency are measured for each combination of multicast algorithms, communication schemes, and message and group sizes previously described. The Myrinet interconnect runs proprietary software, GM, on the hosts. GM is a software implementation of the Virtual Interface Architecture (VIA) [41]. Like VIA, GM is targeted at providing low-latency, OS-bypassing interactions between user-level applications and communication devices. In GM, the OS is only responsible for establishing the communication channels and enforcing the required protection mechanisms through the GM driver. Once the initial setup phase is completed, host-to-NIC communications are performed through the GM library. Following the setup phase, both host and NIC are able to initiate one-sided data transfers to each other using the provided DMA engines in the Myrinet NIC hardware. Figure 3-3 shows the three-level GM software architecture. All of the host-based multicast algorithms used in this research are designed in the user-level application layer on top of GM-1.6.4 provided by Myricom.
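The sketch below shows, schematically, what a host-based tree multicast looks like at the user level on top of a messaging layer. The function unicast_send is a hypothetical stand-in for the messaging layer's point-to-point send call; it is not a GM API function, and the tree used is the generic binomial tree from the previous sketch.

# Schematic sketch of a host-based tree multicast at the user level.
# `unicast_send` is a hypothetical wrapper standing in for the messaging
# layer's blocking point-to-point send; it is not a GM API function.
def unicast_send(dest, payload):
    print(f"send {len(payload)}B -> node {dest}")

def host_based_multicast(my_rank, children_of, payload):
    """The host CPU does everything: look up its children in the precomputed
    tree and issue one unicast send per child, in tree order."""
    for child in children_of(my_rank):
        unicast_send(child, payload)

if __name__ == "__main__":
    tree = {0: [1, 2, 4], 1: [3, 5], 2: [6], 3: [7]}
    host_based_multicast(0, lambda r: tree.get(r, []), b"x" * 2)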

The Myrinet Control Program (MCP) provided by Myricom has been modified as follows to fit the new communication schemes for both the NIC-based and the NIC-assisted multicast algorithms. The MCP is a firmware that runs on the Myrinet NIC RISC processor.

Figure 3-3. Myricom's three-layered GM software architecture.

The original MCP has four state machines: SDMA, SEND, RECV, and RDMA, as shown in Figure 3-4. Each one is responsible for a particular task, and these tasks will be explained in detail in the following subsection. Both the NIC-based and the NIC-assisted communication schemes use NIC-issued multi-unicast messages, which required modification of the state machine code. Table 3-1 shows the pseudocode of the overall process for both the NIC-assisted and the NIC-based communication schemes for the root, intermediate, and destination nodes, including the host and NIC tasks and interactions. Parts in italics are either performed or initiated by the host CPU and the rest is performed by the NIC processor. The SDMA process includes the host writing to NIC memory and signaling the NIC upon completion of the write operation. The RDMA process includes the NIC writing to the host memory and signaling it upon the completion of the write operation. The updated state machines

added an extra 8μs to the unicast sends, whereas the unmodified minimum one-way latency was measured as 17μs from the host level using the unmodified MCP code.

Figure 3-4. The GM MCP state machine overview.

Table 3-1. Pseudocode for NIC-assisted and NIC-based communication schemes.

NIC-Assisted Multicast (Root Node):
  Obtain multicast host names
  Obtain GM MCP base address pointer
  Build multicast tree
  SDMA
  Wait for completion
  Do multicast
  RDMA

NIC-Based Multicast (Root Node):
  Obtain multicast host names
  Obtain GM MCP base address pointer
  SDMA
  Wait for completion
  Build multicast tree
  Do multicast
  RDMA

NIC-Assisted Multicast (Intermediate and Destination Nodes):
  Listen for incoming multicast calls
  Receive message
  RDMA
  Check multicast tree
  SDMA
  Do multicast

NIC-Based Multicast (Intermediate and Destination Nodes):
  Listen for incoming multicast calls
  Receive message
  Check multicast tree
  Relay message to root/child hosts
  RDMA

Case study evaluations of the host-based, NIC-based, and NIC-assisted communication schemes are undertaken for each of the four multicast algorithms. Each

experiment is performed for each message and group size, for 100 executions, where every execution has 50 repetitions. Four different sets of metrics are probed in each experiment, including multicast completion latency, user-level CPU utilization, multicast tree-creation latency, and link concentration and concurrency. The maximum user-level host CPU utilization of the root node is measured using the Linux built-in sar utility. Link concentration and concurrency of each algorithm are calculated as described in Chapter 2 of this dissertation, for each group size, based on the communication pattern observed throughout the experiments.

Multicast Completion Latency

As previously stated, completion latency is an important metric for assessing the quality and efficiency of multicast algorithms and communication schemes. Two different sets of experiments for multicast completion latency are performed in this case study, one for a small message size of 2B and the other for a large message size of 64KB. Figure 3-5A shows the multicast completion latency versus group size for small messages. Figure 3-5B shows the multicast completion latency versus group size for large messages. Figure 3-5A is presented with a logarithmic scale for clarity. The small-message results presented in Figure 3-5A show that among the host-based algorithms, binary tree performs the best for any group size for small messages. The host-based binomial tree algorithm is second best in terms of latency. The difference in the performance level of these two algorithms is because of the higher computational load of binomial tree compared to the binary tree algorithm. The step-like shape of the binomial and binary tree curves is due to the ⌈log2(G)⌉ - 1 intermediate nodes on the deepest path traveled by the multicast message. The host-based serial forwarding

algorithm shows a linear increase in completion latency with increasing group size. The host-based separate addressing algorithm performs worst among all host-based algorithms and also shows a linear increase in latency with increasing group size. The NIC-based approach has the highest multicast completion latency for all algorithms and all group sizes, as expected. The reason for this poor performance level is that the slower NIC processor (as compared to the host processor) handles all of the computation and communication tasks. In particular, this communication scheme impedes the performance of separate addressing more than the other algorithms. The NIC-assisted solution provides slightly smaller latencies compared to the NIC-based solutions for all algorithms. The increases in communication efficiency are caused by the fact that the faster host processor performs all the communication-related computational tasks. This communication scheme improves the latency characteristics of the separate addressing algorithm more than the serial forwarding algorithm.

Figure 3-5. Multicast completion latencies. Host-based, NIC-based, and NIC-assisted communication schemes are denoted by H.B., N.B., and N.A., respectively. A) Small messages vs. group size. B) Large messages vs. group size.

The large-message results presented in Figure 3-5B show that the binary tree performs the best compared to all other host-based communication algorithms, as was evident in the small-message host-based results. Binomial tree has slightly higher latency values for all group sizes because of its higher algorithmic computational load as compared to the binary tree algorithm. Serial forwarding, as described previously, has more latency variations compared to the small-message case. The NIC-based communication schemes still significantly impede the performance of all algorithms compared to the host-based approach, as previously seen in the small-message case. However, contrary to the small-message case, the NIC-assisted approach lowers the large-message latencies closer to the host-based approach. In the NIC-assisted approach the sender and receiver overheads are significantly high and they overlap with the expensive host CPU-NIC LANai interaction events. This overlapping hides the latencies of the host CPU-NIC LANai interactions and results in the dramatic reduction of completion latencies of the NIC-assisted approach for large messages. Binomial and binary tree algorithms benefit most from the NIC-assisted communication approach for large messages.

User-level CPU Utilization

The CPU utilization is measured at the host processor for all experiments. For both small messages and large messages, host-based multicasting produces the highest CPU utilization level for each group size because the host processor handles all the multicast communication and computation tasks. By contrast, NIC-based communication provides

a constant level of CPU utilization that is lower than the host-based and NIC-assisted schemes for all algorithms. In the NIC-based approach, the host CPU is only responsible for setting up the multicast communication environment and the rest of the tasks are carried out by the NIC processor, decreasing the workload of the host processor. NIC-assisted multicast reduces host CPU utilization as compared to the host-based scheme. Figure 3-6A shows the small-message CPU utilizations for the host-based, NIC-based, and NIC-assisted communication schemes. Figure 3-6B shows the large-message CPU utilizations for the host-based, NIC-based, and NIC-assisted communication schemes. The results show that all of the algorithms exhibit low CPU utilizations for the NIC-based and NIC-assisted schemes. In terms of host CPU utilization, this shows that the NIC-assisted scheme is the preferable one among the three for all networking scenarios. Overall, the low CPU utilization and low multicast completion latency characteristics of the NIC-assisted scheme make it a good choice for multicast communication.

Multicast Tree Creation Latency

Multicast tree creation is an all-computational task, and tree creation latencies are independent of the multicast message size but dependent on the group size. Figure 3-7 shows the tree creation latencies for all combinations of communication schemes and multicast algorithms versus all group sizes. Host-based multicast tree creation provides the lowest latencies for all algorithms, as can be seen from Figure 3-7. The NIC-based tree creation is roughly an order of magnitude slower than host-based tree creation, reflecting the performance gap between those two processors. The NIC-assisted tree creation has latencies identical to the host-based scheme because the host processor handles the tree-creation tasks in this communication scheme.

Figure 3-6. User-level CPU utilizations. Host-based, NIC-based, and NIC-assisted communication schemes are denoted by H.B., N.B., and N.A., respectively. A) Small messages vs. group size. B) Large messages vs. group size.

Figure 3-7. Multicast tree creation latencies. Host-based, NIC-based, and NIC-assisted communication schemes are denoted by H.B., N.B., and N.A., respectively.

Link Concentration and Link Concurrency

Link concentration measures the degree to which communication is concentrated on individual links. This metric, combined with link concurrency, can be used to assess the effectiveness of the network link usage for a given communication structure. All communication schemes access the network in the same manner; the only difference between these schemes is their host-NIC access patterns. Thus, the network link usage is independent of the deployed communication scheme. Figure 3-8A shows the link concentration for all algorithms versus multicast group sizes. Figure 3-8B shows the link concurrency for all algorithms versus multicast group sizes. From Figure 3-8A it can be seen that the binomial tree, binary tree, and separate addressing algorithms have the lowest link concentrations and an asymptotically bounded link

concentration of 2. Although the link access patterns of the binary and binomial tree algorithms differ from that of the separate addressing algorithm, the number of link visits and the number of used links are the same for all three. For the testbed with a single switch used in this case study, the link concentration can never exceed the asymptotic bound of 2, as two is the maximum number of links that must be crossed for any point-to-point communication. Serial forwarding, which uses a different pair of hosts and links in every step of the multicast communication, has a constant and bounded link concentration of 2 for every group size. Link concurrency, given in Figure 3-8B, shows that binomial tree has the best link concurrency of all algorithms for most group sizes. Binary tree exhibits similar concurrency to binomial tree. Combined with the link concentration results presented in Figure 3-8A, the binomial tree algorithm appears to be the best. The difference between the binary and binomial tree algorithms is caused by their fan-out numbers. Serial forwarding has the lowest concurrency, while the separate addressing algorithm is slightly better. Both show constant link concurrency.

Multicast Latency Modeling

Myrinet experiments for evaluating the host-based, NIC-based, and NIC-assisted communication schemes have been performed over a 16-node star network. These experiments provide useful information for understanding the basic aspects of these schemes and algorithms, but are insufficient for drawing further detailed analyses for systems of arbitrary sizes beyond the testbed capabilities. Latency modeling complements the experimental study by providing more detailed insight and establishes a better understanding of the problem.

Figure 3-8. Communication balance vs. group size. A) Link concentration. B) Link concurrency.

Latency modeling for the Myrinet network is based on Eq. 2-5 and follows the same approach outlined in Chapter 2. The basic idea in the latency modeling presented in this chapter is to express the communication events with a few parameters without oversimplifying the process.

Although the outlined method is the same, there are some differences between the Myrinet latency model and the SCI latency model. First of all, t_network is defined differently because Myrinet is an indirect network, unlike the direct SCI network. In an indirect network, nodes establish point-to-point communications through switches, as there are no forwarding nodes present on a given communication path. Eq. 3-1 reflects the new t_network parameter for an indirect network.

t_network = h_p * L_p + h_s * L_s + h_i * L_i (3-1)

Here, h_p, h_s, and h_i represent the total number of hops, the total number of switching nodes, and the total number of intermediate nodes, respectively. Similarly, L_p and L_s denote the propagation delay per hop and the switching delay through a switch, respectively. The L_i parameter denotes the intermediate delay through a node, which is the sum of the receiving overhead of the message, the processing time, and the sending overhead of the message at the host and NIC layers for a given intermediate node. The second difference between the SCI and Myrinet latency models is that in Myrinet the host and NIC processors coordinate to perform the communication-related tasks. Therefore, the Myrinet model has to account for this interaction and work sharing between the host and NIC processors. Also, this coordination and interaction is different for each communication scheme for a given Myrinet network. Eqs. 3-2 through 3-10 present each scheme's sender and receiver overheads and intermediate node delays in detail. In a software-based multicast communication it is likely that a node will issue more than one message. To formulate the sender overhead for the host-based communication


To formulate the sender overhead for the host-based communication scheme, it is assumed that the host processor is issuing the M-th individual multicast message. The overhead for issuing the M-th message is expressed as:

o_sender = M · t_Host_send    (3-2)

However, for NIC-based communication the send process is completely performed by the NIC processor. The sender overhead for the M-th consecutive multicast message is expressed as in Eq. 3-3.

o_sender = M · t_NIC_send    (3-3)

In NIC-assisted multicasting, a different level of work sharing exists between the host and NIC processors, compared to the NIC-based scheme. In the NIC-assisted scheme, the host processor is only responsible for the multicast-related computational tasks, while the NIC processor is responsible for the communication-only events, such as the send operation. From the NIC point of view, issuing NIC-assisted multicast messages is the same as in the NIC-based scheme. Therefore, the sender overhead of the NIC-assisted multicast scheme for a given M-th consecutive multicast message is the same as in the NIC-based case and is expressed as:

o_sender = M · t_NIC_send    (3-4)

The last node on the multicast message path incurs receiver overhead. For the host-based scheme, the GM software layer automatically drains the message from the network and places it in the appropriate user-level memory space. The receiver overhead for the host-based communication is expressed as:

o_receiver = t_Host_recv    (3-5)

For the NIC-based communication scheme, the receive operation involves both the host and the NIC processors.


The NIC processor receives and removes the multicast message from the network, writes it to the appropriate user-level memory address, and then notifies the host processor. Upon this notification, the host processor completes the receive operation. As explained before, the RDMA operation consists of the memory transfer of the message to the user-level memory and the NIC processor's notification of the host processor upon transfer completion. The receiver overhead for NIC-based multicasting is:

o_receiver = t_NIC_recv + RDMA + t_Host_recv    (3-6)

From the host CPU point of view, the reception of a message is identical for the NIC-based and NIC-assisted communication schemes. For both of these schemes, the NIC processor drains the message from the network, performs an RDMA operation to the user-level memory space, and notifies the host CPU. Therefore, the receiver overhead of the NIC-assisted communication is identical to the NIC-based receiver overhead, and is expressed as follows:

o_receiver = t_NIC_recv + RDMA + t_Host_recv    (3-7)

The intermediate nodes act as relays in multicast communication. These nodes receive messages from the network, calculate the next set of destination nodes, and relay the messages to them. Therefore, the delay for each intermediate node is the sum of the receiver overhead, the processing delay, and the sender overhead. For the host-based communication scheme, the intermediate node delay L_i for the M-th consecutive message sent is:

L_i = t_Host_recv + t_Host_process + M · t_Host_send    (3-8)
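The per-scheme overhead terms above can be collected into a small helper, as in the following Python sketch. It is only an illustration of Eqs. 3-2 through 3-8; the function names and argument layout are assumptions made for this text, not part of the dissertation's software.

# Illustrative sketch of the sender/receiver overheads and the host-based
# intermediate-node delay (Eqs. 3-2 through 3-8). Times are in microseconds;
# rdma is the NIC-to-host transfer-plus-notification cost described in the text.
def sender_overhead(scheme, M, t_host_send, t_nic_send):
    # Host-based sends are issued by the host CPU; NIC-based and NIC-assisted
    # sends are issued by the NIC processor (Eqs. 3-2, 3-3, 3-4).
    return M * (t_host_send if scheme == "host" else t_nic_send)

def receiver_overhead(scheme, t_host_recv, t_nic_recv, rdma):
    # Eq. 3-5 for host-based; Eqs. 3-6 and 3-7 for NIC-based and NIC-assisted.
    if scheme == "host":
        return t_host_recv
    return t_nic_recv + rdma + t_host_recv

def host_intermediate_delay(M, t_host_recv, t_host_process, t_host_send):
    # Eq. 3-8: receive + process + M sends, all performed by the host CPU.
    return t_host_recv + t_host_process + M * t_host_send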


For the NIC-based communication scheme, all intermediate node tasks are performed by the NIC processor. For the M-th consecutive message sent, the NIC-based intermediate node delay is expressed as follows:

L_i = t_NIC_recv + t_NIC_process + M · t_NIC_send    (3-9)

The intermediate node delay for the NIC-assisted communication scheme starts when the NIC processor receives a message and performs an RDMA operation to the host processor. The host processor then processes the incoming multicast message and performs an SDMA operation back to the NIC processor, which performs the actual multicast send operation. For the M-th consecutive send, the NIC-assisted intermediate node delay is expressed as:

L_i = t_NIC_recv + RDMA + t_Host_process + SDMA + M · t_NIC_send    (3-10)

As explained before, to achieve higher network concurrency, all the algorithms evaluated in this chapter are designed such that the root node always serves the deepest path first. Therefore, all equations given above can be simplified by taking M equal to 1 when modeling the total multicast latency. The following paragraphs explain how the components of t_network are acquired. The latency model parameter L_s is obtained by measuring the difference between two latency measurements over two nodes that are first connected through the Myrinet switch and then directly without the switch. The L_s parameter is measured as 500 ns. For the experiment setup, h_s is set to 1. Assuming 7 ns per meter of propagation delay, L_p is calculated as 21 ns, as all the Myrinet interconnect cables used were 3 m long. In our testbed, 2 links are crossed for each pair of nodes that establish a point-to-point communication.


For the single-switch, 16-node system used in our experiments, crossing 2 links per connection yields an h_p of 2 for the separate addressing algorithm. All other algorithms use a tree-based communication structure that depends on relaying the message through intermediate nodes. Therefore, binary tree, binomial tree, and serial forwarding have an h_p equal to 2h_i. The number of intermediate nodes, h_i, is calculated as (G - 2) for serial forwarding and (log_2 G - 1) for the binary and binomial tree algorithms. Following the same approach defined by Bunitas et al. [39], the sender and receiver overhead estimates of the host and NIC processors are obtained individually. As the clock frequency of the host processor is known, the RDTSC assembly instruction is used to read the internal 64-bit cycle counter to get the exact time spent on the host processor. By averaging the host-based multicast communication for each previously defined experiment, for each message and group size, an accurate estimate of t_Host_send is obtained. For simplicity, t_Host_recv is approximated to equal t_Host_send. The real-time clock register (RTC) on the LANai 7 chip is incremented at a regular interval automatically by default. The increment interval of this register is set at NIC initialization time based on the actual PCI bus clock. All accesses to the NIC registers from the host side are performed as PCI bus transactions. Therefore, to remove this bus delay, the 64-bit host processor cycle counter is read immediately after reading the RTC. The host processor is then put to sleep for 1 second, and the same process is repeated. Comparing the elapsed times in these repeated readings allowed the PCI bus transaction delays to be accurately assessed and eliminated from the RTC reading operations. The t_NIC_send and t_NIC_recv values are obtained for the NIC-based and NIC-assisted schemes for all group sizes and for small and large messages.
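The hop-count bookkeeping above is easy to express in code. The following Python sketch is an illustration only; the function name is made up here, and it assumes the single-switch star testbed described in the text (2 links per connection, h_s = 1).

# Illustrative hop counts for the single-switch Myrinet testbed, following the
# relations given in the text: h_i = G - 2 for serial forwarding,
# h_i = log_2 G - 1 for the binary/binomial trees, h_i = 0 for separate
# addressing, and h_p = 2 for separate addressing or 2 * h_i for the relaying
# algorithms.
import math

def hop_counts(G, algorithm):
    if algorithm == "separate":
        h_i = 0
        h_p = 2
    elif algorithm == "serial":
        h_i = G - 2
        h_p = 2 * h_i
    elif algorithm in ("binary", "binomial"):
        h_i = int(math.ceil(math.log2(G))) - 1   # ceiling assumed for non-power-of-2 groups
        h_p = 2 * h_i
    else:
        raise ValueError("unknown algorithm")
    return h_p, h_i

print(hop_counts(16, "binomial"))   # (6, 3) on the 16-node testbed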


The NIC-based and NIC-assisted SDMA and RDMA tasks are also measured using the outlined method, from the host and the NIC, respectively.

Table 3-2. Measured latency model parameters (in μs; m denotes the message size).
              t_Host_send      t_Host_recv      t_NIC_send       t_NIC_recv       SDMA  RDMA
Host-Based    9.864 + 0.0315m  9.864 + 0.0315m  N/A              N/A              N/A   N/A
NIC-Based     N/A              N/A              0.697 + 8.3325m  0.697 + 8.3325m  26    28
NIC-Assisted  N/A              N/A              0.697 + 8.3325m  0.697 + 8.3325m  26    28

Table 3-3. Calculated t_process and L_i values (in μs).
                            t_process   L_i
Host-Based     Binomial      19.2        39.05
               Binary        12.2        32.05
               Serial         7.2        27.05
               Separate      22.2        N/A
NIC-Based      Binomial     145.2       250.25
               Binary        92.2       178.25
               Serial        79.2       121.08
               Separate     233.1        N/A
NIC-Assisted   Binomial      19.2       113.9
               Binary        12.2       107.9
               Serial         7.2       101.7
               Separate      22.2        N/A

Table 3-2 summarizes the acquired values of t_Host_send, t_Host_recv, t_NIC_send, t_NIC_recv, SDMA, and RDMA for the host-based, NIC-based, and NIC-assisted communication schemes. Table 3-3 shows the t_process (i.e., t_Host_process or t_NIC_process, depending on the communication scheme) and L_i values for each possible communication scheme and algorithm combination. These values are obtained by substituting the values given in Table 3-2 into Eqs. 3-2 through 3-10. The t_process and L_i values are specific to the system used; however, they are independent of the multicast message size.


The following subsections provide the details of the small-message latency model for each scheme and algorithm.

The Host-based Latency Model

The host-based latency models are obtained using Eqs. 3-2, 3-5, and 3-8. The binary and binomial trees have the same number of intermediate nodes in each case; therefore, these two algorithms are formulated together in Eq. 3-11.

t_multicast = t_Host_send + (log_2 G - 1)(t_Host_recv + t_Host_process + t_Host_send) + t_Host_recv    (3-11)

The model for serial forwarding is given as:

t_multicast = t_Host_send + (G - 2)(t_Host_recv + t_Host_process + t_Host_send) + t_Host_recv    (3-12)

The separate addressing model is formulated as:

t_multicast = (G - 1)(t_Host_process + t_Host_send) + t_Host_recv    (3-13)

Figure 3-9 shows results from the host-based, small-message model for all algorithms versus multicast group size, along with the actual measurements. The model is accurate for the binomial tree, binary tree, and separate addressing algorithms, with average errors of ~1%, ~2%, and ~1%, respectively. The host-based serial forwarding algorithm has a relatively higher modeling error, ~3%, because of its inherent latency variations, as explained in the previous chapter.

The NIC-based Latency Model

NIC-based latency models are obtained by substituting the values presented in the previous section into Eqs. 3-3, 3-6, and 3-9. The small-message model for the binary and binomial trees is formulated as given in Eq. 3-14.

t_multicast = SDMA + t_NIC_process + t_NIC_send + (log_2 G - 1)(t_NIC_recv + t_NIC_process + t_NIC_send) + t_NIC_recv + RDMA    (3-14)
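For readers who want to experiment with these expressions, the following Python sketch evaluates the host-based small-message models (Eqs. 3-11 through 3-13). The parameter values are the small-message values from Tables 3-2 and 3-3; the function itself is an illustration written for this text, not the dissertation's modeling code, and it assumes the equation forms reconstructed above.

# Illustrative evaluation of the host-based small-message latency model
# (Eqs. 3-11 to 3-13). Times in microseconds; parameter values follow
# Tables 3-2 and 3-3 for a 2-byte message (t_send = t_recv ~ 9.9 us).
import math

T_SEND = T_RECV = 9.9
T_PROCESS = {"binomial": 19.2, "binary": 12.2, "serial": 7.2, "separate": 22.2}

def host_based_latency(G, algorithm):
    t_p = T_PROCESS[algorithm]
    if algorithm in ("binomial", "binary"):           # Eq. 3-11
        h_i = math.ceil(math.log2(G)) - 1
        return T_SEND + h_i * (T_RECV + t_p + T_SEND) + T_RECV
    if algorithm == "serial":                         # Eq. 3-12
        return T_SEND + (G - 2) * (T_RECV + t_p + T_SEND) + T_RECV
    return (G - 1) * (t_p + T_SEND) + T_RECV          # Eq. 3-13, separate addressing

for alg in ("binomial", "binary", "serial", "separate"):
    print(alg, round(host_based_latency(16, alg), 1))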


Figure 3-9. Simplified model vs. actual measurements for the host-based communication scheme.

The model for serial forwarding is as follows:

t_multicast = SDMA + t_NIC_process + t_NIC_send + (G - 2)(t_NIC_recv + t_NIC_process + t_NIC_send) + t_NIC_recv + RDMA    (3-15)

The separate addressing model is formulated as:

t_multicast = SDMA + (G - 1)(t_NIC_process + t_NIC_send) + t_NIC_recv + RDMA    (3-16)

Figure 3-10 shows the NIC-based small-message model and the actual measurements versus multicast group size for all algorithms. Similar to the host-based case, the model is accurate for the binomial tree, binary tree, and separate addressing algorithms, with average errors of ~2%, ~2%, and ~1%, respectively. Serial forwarding has a relatively higher modeling error, ~4%, because of its unavoidable latency variations.


Figure 3-10. Simplified model vs. actual measurements for the NIC-based communication scheme.

The NIC-assisted Latency Model

As explained previously, the difference between the NIC-based and the NIC-assisted models is where the communication-related computational processing is handled. Using Eqs. 3-4, 3-7, and 3-10, the small-message model for the binary and binomial trees is formulated as:

t_multicast = SDMA + t_NIC_send + (log_2 G - 1)(t_NIC_recv + RDMA + t_Host_process + SDMA + t_NIC_send) + t_NIC_recv + RDMA    (3-17)


The model for serial forwarding is:

t_multicast = SDMA + t_NIC_send + (G - 2)(t_NIC_recv + RDMA + t_Host_process + SDMA + t_NIC_send) + t_NIC_recv + RDMA    (3-18)

The separate addressing model is:

t_multicast = (G - 1)(t_Host_process + SDMA + t_NIC_send) + t_NIC_recv + RDMA    (3-19)

Figure 3-11. Simplified model vs. actual measurements for the NIC-assisted communication scheme.

Figure 3-11 shows the results for the NIC-assisted small-message model and the actual measurements versus multicast group size for all algorithms. The figure shows that the model is accurate for the binomial tree, binary tree, and separate addressing algorithms, with an average error of ~1% in all cases. As in the host-based and NIC-based cases, serial forwarding has a relatively higher modeling error, ~3%.


Analytical Projections

To evaluate and compare the small-message performance of the communication schemes and the multicast algorithms for larger systems, the simplified models are used to investigate the multicast completion latency characteristics of an indirect Myrinet star network with 64 and 256 nodes. For the first case, a 64-node network with a single 64-port Clos switch is considered; for the second case, a 256-node network with a single 256-port Clos switch. Currently the largest commercially available Myrinet switch supports 128 nodes, but the technology is progressing towards higher-capacity switches. Higher-capacity switches are preferable because they are easier to maintain and cheaper to build, and they provide full-bisection bandwidth with fewer cable connections than MINs. Vendor-provided results show that the commercially available 128-node Myrinet Clos switch achieves switching latencies similar to the 16-port version. Therefore, projections for a 256-node system with a single Clos switch are considered realistic. Figures 3-12 and 3-13 show the projection results for the two cases under study. Strictly in terms of multicast completion latency, the host-based binary and binomial trees provide the lowest completion latencies. The difference between these two is caused by the processing load and fan-out numbers. A more efficient binomial tree design with a lower processing load and a higher fan-out number, compared to the one used in this dissertation, may outperform the binary tree algorithm. It is also observed that the host-based serial forwarding and separate addressing algorithms have a linear increase in completion time with increasing group sizes.
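As a quick illustration of how such projections can be reproduced, the short Python sketch below evaluates the reconstructed host-based binomial-tree expression (Eq. 3-11) at the projected system sizes. The parameter values are the small-message values from Tables 3-2 and 3-3; the snippet is an illustration written for this text, not the dissertation's projection tool.

# Illustrative projection of host-based binomial-tree completion latency
# (Eq. 3-11) to larger group sizes, using the small-message parameters
# t_send = t_recv ~ 9.9 us and t_process = 19.2 us from Tables 3-2 and 3-3.
import math

def binomial_host_based(G, t_send=9.9, t_recv=9.9, t_process=19.2):
    h_i = math.ceil(math.log2(G)) - 1
    return t_send + h_i * (t_recv + t_process + t_send) + t_recv

for G in (16, 64, 256):
    print(G, "nodes:", round(binomial_host_based(G), 1), "us")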


Figure 3-12. Projected small-message completion latency for 64 nodes.

Unfortunately, the NIC-based approaches are a poor solution for any of the algorithms in any given network scenario compared to the host-based approach, because of their excessive completion latencies. However, as can be seen from the figures, the NIC-assisted algorithms lower the completion latencies compared to the NIC-based solutions. This approach provides significant latency reductions for the separate addressing, binomial tree, and binary tree algorithms, with the largest gain in completion latency occurring for the separate addressing algorithm. Unfortunately, serial forwarding benefits less from the NIC-assisted scheme because of the expensive SDMA and RDMA operations incurred at each intermediate node.


Figure 3-13. Projected small-message completion latency for 256 nodes.

Other than multicast completion latency, host CPU utilization is also an important parameter to consider when choosing a multicast communication scheme and algorithm. For larger systems, a sacrifice in completion latency can be made to achieve a lower CPU utilization. For such systems, NIC-assisted schemes may be well suited, because they provide lower CPU utilization and more communication-computation overlap.


Summary

This chapter of the dissertation investigated the multicast problem for high-speed indirect networks with NIC-based processors and introduced a multicast performance analysis and modeling methodology for this type of interconnect. These interconnects are widely used in the parallel computing community because of their reprogrammable flexibility and their ability to offload work from the host CPU. This phase of the dissertation analyzed various degrees of host and NIC processor work sharing: the host-based, NIC-based, and NIC-assisted multicast communication schemes. Binomial tree, binary tree, serial forwarding, and separate addressing multicast algorithms are used for these analyses. Experimental evaluations are performed using various metrics, such as multicast completion latency, root-node CPU utilization, multicast tree creation latency, and link concentration and concurrency, to identify the key strengths and weaknesses of each approach. To further analyze the performance of the aforementioned multicast communication schemes, small-message latency models of the binomial tree, binary tree, serial forwarding, and separate addressing multicast algorithms are developed and verified against the experimental results. The models are observed to be accurate. Projections for larger systems are also presented and evaluated. The experimental and latency-modeling evaluations showed that for latency-sensitive applications that use small messages and run over networks with NIC coprocessors, the host-based multicast communication scheme performs best. The disadvantage of the host-based multicasting scheme is its high host-CPU utilization. The NIC-based solutions obtain the lowest and constant host-CPU utilizations in all cases, at the cost of increased completion latencies. As a compromise solution, NIC-assisted multicasting provides lower CPU utilizations than the host-based schemes.


Moreover, NIC-assisted multicasting provides CPU utilizations comparable to the NIC-based algorithms, and completion latencies comparable to the host-based schemes at lower host-CPU utilizations. Thus, NIC-assisted multicasting appears to be the better choice for applications that demand a high level of computation-communication overlap.


CHAPTER 4
MULTICAST PERFORMANCE COMPARISON OF SCALABLE COHERENT INTERFACE AND MYRINET

Chapters 2 and 3 focused on individually evaluating the multicast performance of high-speed torus interconnects and of high-speed indirect networks with onboard adapter-based processors. Experimental case studies, small-message latency models, and analytical projections for both types of networks were presented. Chapter 4 summarizes these results and provides head-to-head multicast performance comparisons between SCI and Myrinet. The comparison is based on the case-study data presented in the previous two chapters. All SCI multicast algorithms and all Myrinet multicast schemes and algorithms are included in the comparison except the Myrinet NIC-based communication scheme. The Myrinet NIC-based scheme was excluded because it was the worst performer compared to the Myrinet host-based and NIC-assisted schemes in terms of completion latency, and it does not appear to be a viable solution to the software-based multicast problem for Myrinet interconnects. The unicast performance of SCI and Myrinet has previously been comparatively evaluated by Kurmann and Stricker [42]. They showed that both networks suffer performance degradation with non-contiguous data block transfers. Fischer et al. [43] also compared SCI and Myrinet. In their study, they concluded that, in terms of their performance analyses, Myrinet is a better choice than SCI since, unlike Myrinet, SCI exhibits an order-of-magnitude difference between its remote read and write bandwidths.


Chapter 4 complements and extends the previous work in the literature summarized above. It provides comparative experimental evaluations of the torus-optimized multicast algorithms for SCI versus the various degrees of multicast work sharing between the host and NIC processors optimized for Myrinet. For both the SCI and Myrinet networks, multicast performance is evaluated using different metrics, such as multicast completion latency, root-node CPU utilization, and link concentration and concurrency.

Multicast Completion Latency

Completion latency is an important metric for evaluating and comparing different multicast algorithms, as it reveals how suitable an algorithm is for a given network. Two different sets of experiments for multicast completion latency are used in this case study: one for a small message size of 2 B, and another for a large message size of 64 KB. Figure 4-1A shows the multicast completion latency versus group size for small messages, while Figure 4-1B shows the same for large messages, for both networks combined with the various multicast algorithms. Figure 4-1A is presented with a logarithmic scale for clarity. As explained previously, separate addressing is based on a simple iterative use of the unicast send operation. Therefore, for small messages the inherent unicast performance of the underlying network largely dictates the overall performance of the multicast algorithm. This trait can be observed by comparing the small-message multicast completion latencies of SCI and Myrinet shown in Figure 4-1A. SCI is inherently able to achieve almost an order of magnitude lower unicast latency than Myrinet. The simplicity and cost-effectiveness of the separate addressing algorithm for small messages, combined with SCI's unicast characteristics, result in SCI separate addressing clearly performing the best compared to all other SCI and Myrinet multicast algorithms.


Figure 4-1. Multicast completion latencies. Host-based, NIC-based, and NIC-assisted communication schemes are denoted by H.B., N.B., and N.A., respectively. A) Small messages vs. group size. B) Large messages vs. group size.


The NIC-assisted Myrinet separate addressing does not provide a performance level comparable to the host-based version because of the costly SDMA and RDMA operations. It is observed that the SDMA and RDMA operations impose a significant overhead on small-message communications. Moreover, all three multicast schemes show a linear increase with increasing group size. The SCI S-torus is one of the worst-performing algorithms for small messages, next to the Myrinet NIC-assisted serial forwarding algorithm. Host-based Myrinet serial forwarding performs better than these two algorithms. The store-and-relay characteristic of serial forwarding algorithms results in no parallelism in message transfers and thus degrades performance. Moreover, the expensive SDMA and RDMA operations cause the NIC-assisted serial forwarding algorithm to perform poorly compared to the host-based version. As can be seen, all three multicast algorithms show a linear increase in multicast completion latency with respect to increasing group size. Unlike the separate addressing and serial forwarding algorithms of SCI and Myrinet, the binomial and binary tree algorithms exhibit nearly constant completion latencies with increasing group sizes. Among these, the Myrinet host-based binary and binomial tree algorithms perform best. Comparable algorithms, such as the SCI U-torus, M_d-torus, and M_u-torus algorithms, show higher completion latencies. The difference is attributed to the low sender and receiver overheads of host-based Myrinet multicasting and to the simplicity of the star network compared to the more complex torus structure of the SCI network. For a given star network, the average message transmission paths are shorter than in a same-sized torus network.


Another reason is the increasing efficiency of the lightweight Myrinet GM message-passing library as the complexity of the communication algorithms grows, compared to the shared-memory Scali API of the SCI interconnect. The effect of the expensive SDMA and RDMA operations can be clearly seen in the NIC-assisted Myrinet binary and binomial tree algorithms compared to their host-based counterparts. For the large-message multicast latencies in Figure 4-1B, the SCI algorithms appear to perform best compared to their Myrinet counterparts. This outcome is judged to be primarily caused by the higher data rate of SCI compared to Myrinet (i.e., 5.3 Gb/s vs. 1.28 Gb/s). It should be noted that the Myrinet testbed available for these experiments is not representative of the latest generation of Myrinet equipment (which features 2.0 Gb/s data rates). However, we believe that our results would follow the same general trend for large messages on the newer hardware. Among the SCI algorithms, M_d-torus is found to be the best performer. The M_d-torus and M_u-torus algorithms exhibit similar levels of performance. The difference between the two becomes more distinctive at certain data points, such as 10 and 14 nodes. For these group sizes, the partition length for M_u-torus does not provide perfectly balanced partitions, resulting in higher multicast completion latencies. For large messages, U-torus exhibits behavior similar to M_u-torus. The S-torus is the worst performer compared to all other SCI multicast algorithms. Moreover, S-torus, similar to other single-phase, path-based algorithms, has unavoidably large latency variations caused by the long multicast message paths [31].


The Myrinet multicast algorithms seem to be no match for the SCI-based ones for large messages. Unlike the small-message case, the Myrinet NIC-assisted binary and binomial tree algorithms provide nearly identical completion latencies to their host-based counterparts for large messages. The SDMA and RDMA overheads are negligible for large messages, and for this reason NIC-assisted multicast communication performance is enhanced significantly. Moreover, NIC-assisted communication improves the performance of the separate addressing algorithm the most, compared to all other Myrinet multicast algorithms. This improvement is the result of the relative reduction of the overall effect of the SDMA and RDMA overheads on the multicast completion latencies. By contrast, NIC-assisted communication degrades the performance of the Myrinet serial forwarding algorithm, because the SDMA and RDMA overheads are incurred at each relaying node.

User-level CPU Utilization

Host processor load is another useful metric for assessing the quality of a multicast protocol. Figures 4-2A and 4-2B present the maximum CPU utilization for the root node of each algorithm. As before, results are obtained for various group sizes and for both small and large message sizes. Root-node CPU utilization for small messages is presented with a logarithmic axis for clarity. For small messages, SCI M_d-torus and M_u-torus exhibit a constant 2% CPU utilization for all group sizes. Both algorithms use a tree-based scheme for multicast, which increases the concurrency of the message transfers and decreases the root-node workload significantly. It is also observed that SCI S-torus exhibits relatively higher utilization than these two, but at the same time provides a constant CPU load independent of group size. As can be seen, SCI U-torus exhibits a step-like increase for small messages, caused by the increase in the number of communication steps required to cover all destination nodes.


Figure 4-2. User-level CPU utilizations. Host-based, NIC-based, and NIC-assisted communication schemes are denoted by H.B., N.B., and N.A., respectively. A) Small messages vs. group size. B) Large messages vs. group size.


The Myrinet host-based binomial and binary tree algorithms provide CPU utilization identical to that of SCI M_d-torus and M_u-torus. The host-based separate addressing and serial forwarding algorithms both show a perfectly linear increase in small-message CPU utilization, with serial forwarding performing better than separate addressing. The Myrinet NIC-assisted binomial and binary tree algorithms lower the root-node CPU utilization, as expected. These two algorithms provide the lowest and a constant CPU utilization for all group sizes. Similarly, NIC-assisted separate addressing lowers the CPU utilization compared to the host-based version and provides a very slowly increasing utilization. A similar reduction is also observed for the NIC-assisted serial forwarding algorithm compared to its host-based counterpart. Unlike NIC-assisted separate addressing, serial forwarding cannot sustain this low utilization, and it increases linearly with increasing group size. The linear increase of serial forwarding is caused by the ever-extending path lengths with increasing group sizes. Meanwhile, for large messages, it is observed that as the group size increases, the CPU load of the SCI S-torus algorithm also increases linearly. The SCI separate addressing algorithm has a nearly linear increase in CPU load for large messages with increasing group size. By contrast, since the number of message transmissions for the root node is constant, M_d-torus provides a nearly constant CPU overhead for large messages and small group sizes. However, for group sizes greater than 10, the CPU utilization tends to increase, caused by variations in path lengths that extend the polling durations.


For large messages, SCI M_u-torus also provides a higher but constant CPU utilization. The host-based Myrinet binomial and binary tree algorithms provide levels of CPU utilization similar to SCI M_u-torus for large messages and large group sizes. Similar to the small-message case, host-based separate addressing and serial forwarding have linearly increasing utilizations with increasing group sizes, with serial forwarding performing better than separate addressing. The NIC-assisted binomial and binary tree algorithms again achieve the smallest, sustained, and constant CPU utilizations for large messages for all group sizes. Similar to the small-message case, separate addressing benefits most from NIC-assisted communication, as can be seen from Figure 4-2B. Serial forwarding also exhibits lower CPU utilizations with NIC-assisted communication.

Link Concentration and Link Concurrency

Link concentration and link concurrency for SCI and Myrinet are given in Figures 4-3A and 4-3B, respectively. The Myrinet host-based and NIC-assisted communication schemes have identical link concentration and concurrency, so they are not plotted separately. Link concentration combined with link concurrency illustrates the degree of communication balance. The concentration and concurrency values presented in Figure 4-3 are obtained by analyzing the theoretical communication structures and the experimental timings of the algorithms. SCI S-torus is a simple chained communication, and there is only one active message transfer in the network at any given time. Therefore, it has the lowest and a constant link concentration and concurrency compared to the other algorithms.


Figure 4-3. Communication balance vs. group size. A) Link concentration. B) Link concurrency.


Because of the high parallelism provided by the recursive doubling approach, the SCI U-torus algorithm has the highest concurrency. SCI separate addressing exhibits an identical degree of concurrency to U-torus, because multiple message transfers overlap at the same time thanks to the network pipelining feature available over the SCI torus network. The SCI M_d-torus algorithm has a link concentration that is inversely proportional to the increasing group size. In M_d-torus, the root node first sends the message to the destination header nodes, and they relay it to their child nodes. As the number of dimensional header nodes is constant (k in a k-ary torus), with increasing group size each new child node added to the group will increase the number of available links. Moreover, because of the communication structure of M_d-torus, the number of used links increases much more rapidly than the number of link visits with increasing group size. This trend asymptotically limits the decreasing link concentration to 1. The concurrency of M_d-torus is upper-bounded by k, as each dimensional header relays the message over separate ringlets with k nodes in each. The SCI M_u-torus algorithm has low link concentration for all group sizes, as it multicasts the message to the partitioned destination nodes over a limited number of individual paths, where only a single link is used per path at a time. By contrast, for a given partition length of constant size, an increase in the group size results in an increase in the number of partitions and therefore in the number of individual paths. This trait results in more messages being transferred concurrently at any given time over the entire network. The Myrinet serial forwarding algorithm is very similar to SCI S-torus in terms of its logical communication structure. Therefore, as expected, it also exhibits a constant concentration.


However, Myrinet serial forwarding has a higher link concentration than S-torus, and the difference is caused by the physical structure of the two interconnects. In Myrinet, the degree of connectivity of each host is fixed at 1, whereas in SCI it is N for an N-dimensional torus system. Similar to S-torus, serial forwarding has the lowest link concurrency. The Myrinet binomial and binary trees and the separate addressing algorithm have an asymptotically bounded link concentration of 2 with increasing group sizes, as the number of link visits and the number of used links are the same for all three of them. The number of links required to establish a connection between any two nodes is 2 for a single-switch Myrinet network, which is the upper bound on the number of used links. Myrinet serial forwarding has the lowest and a constant link concurrency for all group sizes, for the reasons explained before. Myrinet separate addressing also has a constant but higher link concurrency. The Myrinet binomial and binary tree algorithms have higher and variable link concurrencies with respect to the group size. The binary tree has a bounded fan-out number, which decreases its link concurrency compared to the binomial tree.

Summary

In summary, the results reveal that multicast algorithms differ in their algorithmic and communication-pattern complexity. The functionality of the algorithms increases with complexity, but the increased complexity occasionally degrades performance. In some cases, such as small-message multicasting for small groups, using simple algorithms helps to obtain the true performance of the underlying network. For example, because of its simplicity and the inherently lower unicast latency of SCI, the SCI separate addressing algorithm is found to be the best choice for small-message multicasting for small groups.


The lightweight GM message-passing software in Myrinet performs efficiently on complex algorithms. Therefore, while simple algorithms such as separate addressing perform better on SCI, more complex algorithms such as the binomial and binary trees achieve good performance on Myrinet for small-message multicast communication. For large messages, SCI has a clear advantage due to its higher link data rate compared to Myrinet (i.e., 5.3 Gb/s for SCI vs. 1.28 Gb/s for the Myrinet used in this study). Although the newest Myrinet hardware features higher data rates (i.e., 2.0 Gb/s) than our testbed, these rates are still significantly lower than SCI's. Therefore, we expect that our results for large messages would follow the same general trend even for the newest generation of Myrinet equipment. Myrinet NIC-assisted communication provides low host-CPU utilization for small and large messages and group sizes. Complex algorithms such as the binomial and binary trees, as well as simple ones like separate addressing, benefit significantly from this approach. However, the multicast performance of NIC-assisted communication is directly affected by the cost of the SDMA and RDMA operations. The overhead of these operations limits the potential advantage of this approach.


CHAPTER 5
LOW-LATENCY MULTICAST FRAMEWORK FOR GRID-CONNECTED CLUSTERS

Previous chapters analyzed and evaluated the multicast problem on two different SANs. Chapter 5 introduces an analysis of a low-level, topology-aware multicast infrastructure for Grid-connected SAN- or IP-based clusters. The proposed framework integrates Layer 2 and Layer 3 protocols for low-latency multicast communication over geographically dispersed resources. Grid-connected clusters provide a global-scale computing platform for complex, real-world scientific and engineering problems. The Globus Alliance [44] is an important project among the numerous Grid-related research initiatives mentioned in Chapter 1. The Globus Alliance initiative for scientific and engineering computing, which is led by various research groups around the world, is a multi-disciplinary research and development project. Typical research areas that the Globus project tackles include resource and data management and access, security, and application and service development, all on a massive, distributed scale. A collection of services, the Globus Toolkit (GT), has emerged as a result of these collective research efforts. The Globus Toolkit mainly includes services and software for Grid-level resource and data management, security, communication, and fault detection. The open-source GT is a complete set of technologies for letting people share computing power, databases, and other tools securely online across corporate, institutional, and geographic boundaries without sacrificing local autonomy. The main components of GT are as follows:


Globus Resource Allocation Manager (GRAM). Provides Grid-level resource allocation and process creation, monitoring, and management services among distributed domains.

Grid Security Infrastructure (GSI). Provides a global security system for Grid users over distributed resources and domains. GSI establishes a single-sign-on authentication service for distributed systems by mapping global user identities to local ones, while maintaining local control over access rights.

Monitoring and Discovery Service (MDS). A Grid-level information service that provides up-to-date information about computing and network resources and datasets.

GT also has three more software services for establishing homogeneous Grid-level access to distributed resources:

Global Access to Secondary Storage (GASS). Provides automatic and programmer-managed data movement and data access strategies.

Nexus. Provides communication services for heterogeneous environments.

Heartbeat Monitor (HBM). Detects system failures of distributed resources.

GSI isolates the local administrative domains from the Grid structure by using gateways or gatekeepers at each domain. These gateways or gatekeepers are responsible for accepting and validating incoming global user login requests and converting them to local credentials. Currently, GT implements these gateways and gatekeepers as Condor nodes [45]. Other than GSI-specific tasks, Condor gateways are generally also responsible for job submission, scheduling, and resource management of local domains and clusters. Chapter 5 introduces a latency-sensitive multicast infrastructure for Grid-connected clusters that is compliant with the GSI and with other GT services. The following sections explain the proposed infrastructure in detail.


Related Research

Layer 3 multicast has been investigated widely for more than a decade. Multicasting over IP networks, such as a Grid backbone, provides a portable solution but is not suitable for high-performance distributed computing platforms because of the inherently high IP overhead. Well-established and implemented Layer 3 protocols exist for both intra-domain and inter-domain multicast. The Distance Vector Multicast Routing Protocol (DVMRP) [46] is one of the basic intra-domain multicast protocols. DVMRP propagates multicast datagrams so as to minimize excess copies sent to any particular network and is based on the controlled-flood method. Sources are advertised using a broadcast-and-prune approach. However, controlled flooding presents a scalability problem. Also, each router on the multicast network must forward incoming source advertisements to all of its output ports. Based on the responses of receivers, the routers forward back prune messages, if required. This process requires each router on the multicast network to keep very large multicast state and routing tables. Overall, DVMRP is not scalable and is only efficient in densely populated networks. Protocol Independent Multicast (PIM) [11] is a step beyond DVMRP and aims to overcome its scalability problem. It uses the readily available unicast routing tables, thereby eliminating the need to keep an extra multicast routing state table. PIM Sparse Mode (PIM-SM), an extension of the PIM protocol, uses a shortest-path tree method. In this method, Rendezvous Points (RPs) are created for receivers to meet new sources. Members join or depart a PIM-SM tree by sending explicit messages to RPs. Receivers initially need to know only one RP, which eliminates the need to flood the entire multicast network with source-advertising messages. RPs must know each other to transfer data from different senders to various multicast groups.


Senders announce their existence to all RPs, while receivers query RPs to find ongoing multicast sessions. Actual data transmission does not require RPs to be on the distribution path. While DVMRP and PIM-SM connect the sources and receivers in the same domain, they do not answer the needs of inter-domain multicasting. The Multicast Border Gateway Protocol (MBGP) [10], an extension of the unicast Border Gateway Protocol (BGP) [47], connects multicast domains. In MBGP, every router only needs to know the topology of its own domain and how to reach the next domain. Each PIM-SM domain uses its own independent RP(s) and does not depend on RPs in other domains. However, in order to connect a source in one domain with receivers in different domains, the RPs must know each other. The Multicast Source Discovery Protocol (MSDP) [12] provides a solution to this problem. MSDP describes a mechanism to connect multiple PIM-SM domains (i.e., RPs) together over the MBGP-connected domains. It provides a mechanism to inform an RP in one domain that there are sources in other domains. MSDP is not scalable for large numbers of groups or large group sizes. Overall, the current Layer 3 multicast architecture deployed in the commercial Internet, vBNS, and Abilene networks is a combination of the PIM-SM, MBGP, and MSDP protocols. Figure 5-1 shows a sample PIM-SM/MBGP/MSDP multicast architecture. As mentioned earlier, both DVMRP and MSDP are efficient with small groups but are not scalable for large groups. This scalability problem makes current setups, such as PIM-SM/MBGP/MSDP, poor solutions for Grid applications. Both of these methods are based on the group membership approach, which is susceptible to congestion and contention problems, as unauthorized sources can freely multicast to a group.


Furthermore, there is no indication of group size from the source's view, which is not favorable from the service provider's point of view for obvious economic reasons. In addition, these methods need unique worldwide multicast addresses for each group: another scalability and management problem in and of itself.

Figure 5-1. Layer 3 PIM-SM/MBGP/MSDP multicast architecture.

New ideas address these shortcomings. For example, Express Multicast (EM) [48] is based on the idea that most multicast applications are either single-source or have an easily identifiable primary source. This assumption greatly reduces the complexity of multicast algorithms. EM is also based on the channel membership concept, in which each source S defines a channel (S,E), where E is the channel address, and only S can send to (S,E). Channels are source-addressed, which eliminates the requirement for receivers or receiver groups to have a globally unique multicast address, further increasing scalability. Although different sources can use the same channel addresses, this situation does not create a conflict, as each channel is identified by the combination of the source and channel address, and no two sources can have the same address.
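The channel-membership idea can be made concrete with a small sketch. The following Python fragment is an illustration written for this text (the class and method names are assumptions, not part of EM or of the proposed framework); it simply shows how identifying a channel by the (source, channel address) pair avoids collisions between channels that reuse the same channel address.

# Illustrative sketch of source-addressed channels, keyed by (source, channel).
# Only the registered source is allowed to transmit on its channel.
class ChannelRegistry:
    def __init__(self):
        self.receivers = {}                        # (source, channel) -> set of receivers

    def join(self, source, channel, receiver):
        self.receivers.setdefault((source, channel), set()).add(receiver)

    def send(self, source, channel, payload):
        # Two sources may reuse channel address "E1" without conflict because
        # the channel identity includes the source address.
        return [(r, payload) for r in self.receivers.get((source, channel), ())]

registry = ChannelRegistry()
registry.join("S1", "E1", "receiver-a")
registry.join("S2", "E1", "receiver-b")            # same channel address, different source
print(registry.send("S1", "E1", "data"))            # reaches only receiver-a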


In EM, receivers send explicit join messages to S, which eliminates the scalability problem of previous algorithms caused by their broadcast-and-prune approach. EM uses bidirectional trees to reduce the state-table overhead on multicast routers. Figure 5-2 shows examples of group membership and channel membership multicast. Although the channel membership approach solves the scalability problem of the group membership approach, it does not fit the current Globus/Condor architecture. The Condor gateway nodes, which are ordinary nodes rather than routers, isolate local domains and clusters from the rest of the Grid network. This architecture prevents the EM root from building a full-scale channel tree that reaches the receiver nodes in local domains.

Figure 5-2. Sample group versus channel membership multicast. A) Group membership. B) Channel membership.

In summary, the existing multicast approaches using group membership or channel membership do not, by themselves, provide an answer to the Grid-level cluster-to-cluster multicast problem. The solution proposed in this dissertation combines the best qualities of these two approaches to solve the Grid-level problem.


In the proposed approach, channel membership is used between domains and group membership is used within domains. Channel membership between domains provides a tree architecture for the gateways while lowering the impact on the WAN backbone and complying with the GSI. Group membership within domains provides a simple, scalable solution for domains and clusters. Figure 5-3 shows a sample multicasting scenario for Grid-connected clusters. The clouds in the figure represent the domains. The next subsection explains the proposed framework in detail.

Figure 5-3. Sample multicast scenario for Grid-connected clusters.

Framework

The proposed multicast framework is based on the assumption that the participating nodes (i.e., sources and receivers) can be clearly organized in terms of administrative domains and that each domain has a gateway node. These gateway nodes are Condor-like nodes, not dedicated routers, which complies with the current Globus architecture. The main difference between the current Globus gateway nodes and those that we propose is that the proposed framework's nodes are responsible for job distribution, as is the case in Globus, and are also capable of initiating, organizing, orchestrating, and managing multicast communication.


The gateways are regular system nodes that simply serve as the gatekeepers of their domains. A gateway in the proposed framework can be a dedicated node or part of the multicast group, and this decision is left to the system administrator of a given domain. All access to a specific domain has to go through these nodes, as they are the only interface allowed to and from the outside world. This approach also eliminates the need for users to have extra administrative rights over multiple domains. The gateways are organized in a binary or binomial tree structure based on their distance from the root, where the gateway distances are measured in terms of their latencies. At initialization, the source gateway informs each gateway that participates in the multicast about the existence of the other participating gateways. Each gateway, upon receiving this information from the source gateway, measures the latency between itself and all other gateways and sends the results to the source gateway. The source composes the gateway latency graph of the participating gateways to build the channel-based gateway multicast tree. The single source of the multicast operation is located at the root of the tree. This leads to a minimal number of message transfers initiated by the root and to minimal impact on the WAN backbone. Each source that participates in multicast communication builds its own version of the top-level tree. As the top-level tree build, join, and setup operations are performed offline, they are not on the critical communication path. Also, once built, the same top-level tree can be used multiple times as long as the network setup and participant list remain the same.
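The initialization step described above, in which each gateway probes its peers and reports the measured latencies back to the source gateway, can be sketched as follows. This Python fragment is purely illustrative; the measure_latency helper and the data layout are assumptions made here and are not defined by the framework.

# Illustrative sketch of the gateway latency-graph collection step.
# measure_latency(a, b) is a hypothetical probe (e.g., an application-level
# round-trip measurement); it is not part of the proposed framework's API.
def collect_latency_graph(source_gateway, gateways, measure_latency):
    """Return {gateway: {peer: latency_ms}} as composed by the source gateway."""
    latency_graph = {}
    for gw in gateways:
        # Each gateway measures its latency to every other participating gateway
        # and reports the results to the source gateway.
        reports = {peer: measure_latency(gw, peer) for peer in gateways if peer != gw}
        latency_graph[gw] = reports
    return latency_graph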


The top-level tree is built by partitioning the all-participant latency graph into physical domains based on neighborhood distances. This idea was introduced by Lowenkamp et al. [49] and has been widely used in MPICH-G [50], MPICH-G2 [51], and other works for optimizing WAN communications. The partitioning is based on the cost of the link between neighbors. Each new link introduced to the multicast tree is checked for its cost. The source assesses the cost of connecting the new node to every node in the system. As a result, the least expensive connection for the new node is determined and added to the multicast top-level tree. Lowenkamp et al. chose a 20% cutoff value empirically for the cost assessment and showed experimentally that this cutoff value is accurate [49]. The outline of the top-level tree building process is as follows:

Start with the source as the root of the binomial tree
Add the domain with the minimal-cost link at each step to the tree
Stop when all domains are analyzed and embedded in the binomial tree

The pseudocode for this process is given below:

Number nodes in any order starting from 1
Source = 1
N = {root}
U = U - N
For all nodes in U
    For all nodes in N
        Choose a node from U with minimum cost with respect to source
        Compare the cost of the new node with the cost of ALL nodes
        If cost is NOT less than (1.2 * cost of ALL existing routes)
            Add new node as a child to the lowest-cost connection
        Else
            Add new node as a separate branch connected to source
        Add new node to N
        Delete node from U
Repeat process until U = {}

where U is the universal set of all available nodes and N is the set of selected nodes. Figure 5-4 shows the top-level tree building process. The process starts with the source node and evaluates the cost of each new node added at every step.
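A runnable interpretation of this process is sketched below in Python. It follows one reading of the pseudocode and of the cost checks annotated in Figure 5-4: a candidate domain is attached directly to the source unless a route through an already-attached gateway beats 1.2 times its direct link cost, in which case it is attached beneath that gateway. The function and the sample latencies are illustrative assumptions, not the dissertation's implementation.

# Hedged sketch of the top-level gateway tree construction (one interpretation
# of the pseudocode above). Costs are pairwise gateway latencies in milliseconds.
def build_top_level_tree(source, cost, cutoff=1.2):
    """Return {gateway: parent_gateway} for all gateways in the cost matrix."""
    parent = {source: None}
    depth_cost = {source: 0.0}                     # accumulated cost from the source
    unattached = set(cost) - {source}
    while unattached:
        # Greedily pick the unattached gateway with the cheapest direct link to the source.
        node = min(unattached, key=lambda n: cost[source][n])
        direct = cost[source][node]
        # Best route through an already-attached gateway (other than the source).
        via = [(depth_cost[m] + cost[m][node], m) for m in parent if m != source]
        best_route, best_parent = min(via) if via else (float("inf"), source)
        if cutoff * direct < best_route:
            parent[node] = source                  # attach as a separate branch
            depth_cost[node] = direct
        else:
            parent[node] = best_parent             # attach under the cheapest connection
            depth_cost[node] = best_route
        unattached.remove(node)
    return parent

# Illustrative latencies (ms); the pairings are assumed here, loosely based on Figure 5-4.
gw = ["HCS.UFL.EDU", "HCS.FSU.EDU", "CSEE.USF.EDU", "PHYS.UFL.EDU", "NETLAB.CALTECH.EDU"]
latency = {a: {} for a in gw}
def link(a, b, ms): latency[a][b] = latency[b][a] = ms
link(gw[0], gw[1], 30); link(gw[0], gw[2], 26); link(gw[0], gw[3], 0.7); link(gw[0], gw[4], 53)
link(gw[1], gw[2], 18); link(gw[1], gw[3], 30); link(gw[1], gw[4], 58)
link(gw[2], gw[3], 26); link(gw[2], gw[4], 62); link(gw[3], gw[4], 53)
print(build_top_level_tree("HCS.UFL.EDU", latency))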


In Figure 5-4, HCS.UFL.EDU is chosen as the source node in Step 1. Figure 5-4A shows the initial link latencies for this example.

Figure 5-4. Continued.


Figure 5-4. Continued.


Figure 5-4. Step-by-step illustration of the top-level tree building process.

The gateways simultaneously form their own intra-domain multicast groups while participating in the top-level multicast tree. Intra-domain multicast is based on group membership. Although it is not mandatory, the DVMRP protocol is chosen for intra-domain multicast. DVMRP is known to have a scalability problem for large groups and topologies, but limiting this protocol to intra-domain multicast eliminates that problem. If necessary, domain system administrators are free to choose any intra-domain multicast protocol that supports group membership, as long as the intra-domain multicast group and the top-level tree are strictly isolated from one another at each gateway. This isolation, combined with intra-domain group multicast, enforces a closed-group concept for security, where only a domain's gateway can multicast to its particular closed group. Moreover, a direct Layer 3 connection to a local/cluster node from the outside world is not allowed in most current Grid-connected clusters for security reasons.


Information about local domains and clusters can be obtained from the gateways through a resource allocation service such as GRAM, or an information service such as MDS, which is beyond the scope of this work. Joins and departures from the multicast group in a domain are handled with broadcast-and-prune messages. For performance reasons, domains with more than one subnet can be divided into hierarchical groups. Figure 5-5 shows an example of the proposed group-membership-based multicast approach. Different types of clusters with various interconnects can participate in local multicast groups. These might be Ethernet-based or SAN-based clusters. Ethernet-based clusters support IP, while SAN-based clusters allow proprietary SAN protocols or high-performance BSD-compatible protocols for SANs to be used for communication. If a SAN protocol that is not BSD-compatible is used for intra-domain multicast, then a packet conversion from IP to the SAN protocol, or vice versa, is required. Where BSD-compatible protocols for SANs exist, a seamless transformation to or from BSD sockets can be achieved. If the selected interconnect is a SAN, then there are two possible scenarios for gateway placement. The gateway can either be part of the SAN subnet, in which case it can convert and transmit incoming multicast data to the receivers directly over the SAN, or it can be outside of the SAN subnet, in which case it transmits incoming multicast data to a head node in the SAN subnet for further conversion and delivery to the multicast group. Figure 5-6 shows these two possible gateway placement scenarios. To further increase the performance of the proposed multicast framework, an extension to the gateways is introduced. Because of link errors and router buffer/queue problems, transmission errors may occur.


Figure 5-5. Sample illustration of the group-membership multicast approach.

Figure 5-6. Possible gateway placement and incoming packet forwarding scenarios.

To further increase the performance of the proposed multicast framework, an extension to the gateways is introduced. Because of link errors and router buffer/queue problems, transmission errors may occur. When such errors occur, it is the receiver's responsibility to request a retransmission from upper-level data providers (i.e., gateways). In the proposed scheme, gateways keep a local copy of incoming multicast data in a local cache.


When the gateway closest to the retransmission requestor intercepts a request, it automatically suppresses the request and provides the data from its local cache. This approach requires extra storage on the gateways, but considering that 1) these gateways are regular nodes rather than special-purpose, high-cost switching devices and 2) storage device prices are decreasing while their capacities are constantly increasing, this does not impose much of an extra burden on the system. This approach also provides lower retransmission latencies when compared to a single-source, single data-holder approach. Moreover, the proposed method lowers the backbone impact, as the requestor is at most one hop away from the gateway that provides the data. Figure 5-7 shows this low-latency retransmission process. As can be seen, the gateway that requests a retransmission, PHYS.UFL.EDU, is only one hop away from the gateway that provides the data, HCS.UFL.EDU. This retransmission is not visible to the rest of the top-level gateways and therefore adds no further impact to the system.

Figure 5-7. Illustration of the low-latency retransmission system.
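The cache-and-suppress behavior just described can be pictured with a short sketch. The C++ fragment below is a minimal, single-process illustration only; the cache container, the sequence numbering, and all names are assumptions made for the sketch and are not part of the framework's specification.

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    // A gateway keeps a copy of every multicast block it forwards and answers
    // retransmission requests from that copy instead of passing them toward
    // the source (illustrative only).
    class GatewayCache {
    public:
        void store(uint32_t seq, const std::vector<char>& block) {
            cache_[seq] = block;                 // local copy kept on the gateway
        }
        // Returns true if the request was suppressed (served locally).
        bool handle_retransmission_request(uint32_t seq) {
            auto it = cache_.find(seq);
            if (it != cache_.end()) {
                resend_to_requestor(it->second); // one hop away, low latency
                return true;
            }
            forward_request_upstream(seq);       // cache miss: escalate toward the source
            return false;
        }
    private:
        void resend_to_requestor(const std::vector<char>&) { std::puts("served from local cache"); }
        void forward_request_upstream(uint32_t seq) { std::printf("miss, forwarding %u upstream\n", (unsigned)seq); }
        std::unordered_map<uint32_t, std::vector<char>> cache_;
    };

    int main() {
        GatewayCache gw;
        gw.store(42, std::vector<char>(64 * 1024, 0)); // a 64KB block, as in the case studies
        gw.handle_retransmission_request(42);          // suppressed
        gw.handle_retransmission_request(43);          // forwarded upstream
        return 0;
    }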


In summary, the proposed framework defines a low-latency, multi-protocol scheme for connecting distributed clusters over WAN backbone links for multicast communication, integrating Layer-3 and Layer-2 multicast protocols and algorithms. It takes advantage of faster interconnects, such as SANs, whenever possible and appropriate. The design is not complex and does not require any changes to existing deployed routers. It requires only minimal changes to existing gateway nodes and complies with the current Globus architecture. The proposed framework provides lower impact on WAN backbones because of its top-level tree-based architecture. The framework does not require extra access privileges to the current administrative structures and is therefore also compliant with the GSI. The proposed framework assists different classes of users, such as Grid end users and Grid application and tool developers, with enhanced performance, functionality, and scalability features. The following sections describe simulative performance analysis case studies for the proposed framework.

Simulation Environment

Since obtaining experimentation time on grid-scale systems is not feasible, a simulation study based on grid-level infrastructure is used to evaluate and analyze the proposed design against previous schemes. Mission-Level Designer (MLD) from MLDesign Technologies, Inc., was used for this purpose [52]. While MLD is an integrated software package with a diverse set of modeling and simulation domains, the discrete-event domain for event-driven simulation of data transfer systems is used in this study. The tool allows for a hierarchical, dataflow representation of hardware devices and networks, with the ability to import finite-state machine diagrams and user-developed C/C++ code as functional primitives. Figure 5-8 shows a screenshot of the MLD simulation tool.


The computer systems and networks simulation capability of MLD is based upon its precursor, the Block-Oriented Network Simulator (BONeS) Designer tool from Cadence Design Systems [53]. BONeS was developed to model and simulate the flow of information represented by bits, packets, messages, or any combination of these.

Figure 5-8. Screenshot of the MLD simulation tool.

Case Studies

To analyze and evaluate the performance of the proposed multicast framework, two sets of simulative case studies are performed. The simulation model is calibrated and validated using real-world WAN, LAN, and SAN link latencies and experimental results. The following sections provide details about the simulative case study setups.
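MLD's discrete-event domain drives the models used in this study; its block library and primitive interfaces are proprietary and are not reproduced here. The generic C++ event-queue loop below only illustrates the event-driven style of simulation referred to above, using a 30 msec WAN delay of the kind measured later in the chapter as a sample event time; all names and structure are assumptions for illustration.

    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <vector>

    // Generic discrete-event loop (not MLD's API): events are dequeued in
    // timestamp order and executed.
    struct Event {
        double time;                        // simulated time, e.g., msec
        std::function<void()> action;
        bool operator>(const Event& o) const { return time > o.time; }
    };

    int main() {
        std::priority_queue<Event, std::vector<Event>, std::greater<Event>> q;
        double now = 0.0;

        q.push({0.0,  [] { std::puts("t=0: source transmits a 64KB block"); }});
        q.push({30.0, [] { std::puts("t=30ms: remote gateway receives the block"); }});

        while (!q.empty()) {
            Event e = q.top();
            q.pop();
            now = e.time;
            e.action();
        }
        std::printf("simulation finished at t=%.1f msec\n", now);
        return 0;
    }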


The first case study analyzes a multicast scenario for latency-sensitive parallel applications over SANs connected across the WAN backbone. The multicast scheme (i.e., all local multicast, all remote multicast, or a mixture of the two) is evaluated for various multicast group sizes (i.e., 4, 8, 12, and 16) with a 64KB multicast message size for two WAN-connected SAN clusters. Myrinet and SCI are the selected SANs, and the various multicast algorithms analyzed in previous chapters are used for this simulative study. Gateway placement is another independent variable for both cases. Gateways can be set up as part of the SAN clusters or as independent nodes outside the clusters. This case study analyzes both cases to better evaluate the effects of each setup. For the case where the gateway is not part of the SAN cluster, both a slow link, such as Fast Ethernet, connecting the gateway to the SAN head node and a fast one, such as Gigabit Ethernet, are analyzed. Multicast completion latency is the only dependent variable chosen in this case study. The second case study is focused on large-file (i.e., Terabyte and above) data distribution and staging over multiple domains connected through a WAN grid backbone. For this case study, retransmission rate, network topology and number of hops (i.e., 0, 1, 2, and 3), and local domain interconnect are chosen as independent variables. The system is analyzed for retransmission rates of 1%, 5%, 10%, and 20%. Different local domain interconnects, such as Ethernet variants and SCI and Myrinet SANs, are also investigated for flat, tree with no local caching (Tree N.C.), and tree with local caching (Tree L.C.) network topologies. The multicast message size and the file size are set to 64KB and 1TB, respectively. Dependent variables for this case study are end-to-end multicast completion latency, WAN backbone impact, and local cluster interconnect utilization. The next section discusses the results obtained for each set of case studies in detail.
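For reference, the combinations just listed can be enumerated mechanically. The sketch below does nothing more than print the experiment grid; the topology labels follow those used in the result figures later in the chapter, and the containers and names carry no meaning beyond bookkeeping.

    #include <cstdio>

    int main() {
        // Case study 1: multicast group sizes for a 64KB message and the three
        // gateway placements evaluated in the text.
        const int group_sizes[] = {4, 8, 12, 16};
        const char* placements[] = {"gateway in SAN cluster",
                                    "dedicated gateway + Gigabit Ethernet",
                                    "dedicated gateway + Fast Ethernet"};
        for (int g : group_sizes)
            for (const char* p : placements)
                std::printf("CS1: group=%d, %s\n", g, p);

        // Case study 2: retransmission rates and topologies for a 1TB transfer.
        const int retrans_pct[] = {1, 5, 10, 20};
        const char* topologies[] = {"flat", "tree 2-hop (N.C.)", "tree 3-hop (N.C.)",
                                    "tree 2-hop (L.C.)", "tree 3-hop (L.C.)"};
        for (int r : retrans_pct)
            for (const char* t : topologies)
                std::printf("CS2: retransmission=%d%%, %s\n", r, t);
        return 0;
    }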


Case Study 1: Latency-sensitive Distributed Parallel Computing

Many programming models and libraries exist today for parallel systems, but the most well-known ones are MPI, PVM, and Active Messages [54]. Currently, only MPI supports distributed parallel computing over Grid networks (i.e., MPICH-G, MPICH-G2, and variants), but MPI is based on the message-passing paradigm and might not be suitable for all problems and cases, such as shared-memory applications. This case study evaluates how a low-latency distributed parallel computing application might benefit from the proposed framework. The user starts the multicast communication by inputting the node/job list. The gateway is notified about the remote multicast IPC calls at run-time. The gateway generates the receiver domain list and the number of nodes per domain for remote multicast calls. The gateway then notifies each SAN domain gateway of the total participant node list. Each SAN domain gateway maintains its own version of the top-level multicast tree, as multicast communication for such an application might be initiated from any of the participating computing nodes. Each SAN gateway sends probe messages to all domain gateways on the receiver list and waits for replies. The total participant list is also included with the probe messages. Upon receiving explicit join messages, each gateway forms its version of the top-level tree based on the request-reply latencies. Each then sends channel messages (S, E) to the first-layer gateways, with the appropriate child gateway information embedded in the channel messages. Simultaneously, they also form multicast groups in their local SAN clusters. They then await the multicast tree ready signal from the first-layer gateways. The tree is formed off-line and can be re-used multiple times during the lifetime of the application. Figure 5-9 shows the tree-formation actions for the outlined scenario.
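The control-message exchange just outlined can be summarized schematically. The C++ sketch below names the message types exactly as in the text (probe, join, channel (S, E), multicast tree ready), but their payloads, ordering details, and the handler structure are assumptions made only to keep the sketch concrete.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Control messages named in the text; payload fields are assumptions.
    enum class Msg { Probe, Join, Channel, TreeReady };

    struct Gateway {
        std::string name;
        std::vector<std::string> children;  // would be filled from a channel (S, E) message

        void on_message(Msg m) {
            switch (m) {
            case Msg::Probe:     // measure request-reply latency, then answer with a join
                std::printf("%s: probe received, replying with join\n", name.c_str());
                break;
            case Msg::Join:      // only the tree builder consumes joins
                std::printf("%s: join received, recording latency sample\n", name.c_str());
                break;
            case Msg::Channel:   // learn child gateways, start local group formation
                std::printf("%s: channel received, forming local multicast group\n", name.c_str());
                break;
            case Msg::TreeReady: // all children ready; report upward
                std::printf("%s: subtree ready, signalling parent\n", name.c_str());
                break;
            }
        }
    };

    int main() {
        Gateway g{"HCS.FSU.EDU", {}};
        g.on_message(Msg::Probe);
        g.on_message(Msg::Channel);
        g.on_message(Msg::TreeReady);
        return 0;
    }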


During the computation phase, IPC will be required. The frequency and characteristics of the IPC are dictated by the computational code and may be coarse-grained or fine-grained, and may employ small or large messages. For increased efficiency, IPC multicast calls can be organized as intra-SAN multicast, which may use the SAN links without any modification, or inter-SAN multicast, which is directed to the gateway for further routing over the WAN backbones.

Figure 5-9. Latency-sensitive distributed parallel job initiation and top-level multicast tree formation.

In the case of an inter-domain multicast IPC request, receiver node IDs are transmitted along with the message to the gateway. There are two possible setups: the domain gateway may be part of the SAN cluster, or it may be a dedicated node.


Figure 5-10. IPC multicast communication scenario for Grid-connected SAN clusters.

If the gateway is a dedicated node, the SAN head-node converts the SAN packet into an IP packet and transfers the message to the domain gateway over the Layer-3 WAN backbone. Otherwise, the gateway itself converts SAN packets into IP packets. The gateways also extract the domain information from the receiver node IDs, organize inter-SAN multicast IPC requests, and transfer them to the next-level gateway. IPC multicast calls routed by the gateways propagate over the pre-defined multicast tree. Upon intercepting the multicast IPC call at the gateway of the receiver domain, the reverse order of the initiation process takes place.
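The per-domain fan-out performed by the gateways can be sketched in a few lines. Grouping receiver node IDs by home domain follows the description above; the "domain:node" ID format, the container choice, and the names in the C++ fragment below are assumptions made only to keep the sketch self-contained.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // Receiver node IDs arrive with the IPC multicast request; the gateway groups
    // them by home domain and issues one inter-SAN request per domain.
    int main() {
        const std::vector<std::string> receivers = {
            "HCS.FSU.EDU:3", "HCS.FSU.EDU:7", "HCS.UFL.EDU:1", "NETLAB.CALTECH.EDU:12"};

        std::map<std::string, std::vector<std::string>> by_domain;
        for (const std::string& id : receivers) {
            const std::string domain = id.substr(0, id.find(':'));
            by_domain[domain].push_back(id);
        }

        for (const auto& entry : by_domain)
            std::printf("forward one inter-SAN multicast request to the gateway of %s (%zu receivers)\n",
                        entry.first.c_str(), entry.second.size());
        return 0;
    }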


When needed, retransmission requests will be handled by the closest gateway. Figure 5-10 shows multicast communication for a low-latency distributed parallel computing scenario. In this simulation model, two clusters are arbitrarily chosen to be located in the HCS.UFL.EDU and HCS.FSU.EDU domains. The WAN latency between these two clusters is experimentally measured as ~30 msec for a 64KB message size. In our scenario we have chosen a SAN cluster at each site, but other possibilities, such as Ethernet-based clusters or a mixture of SAN-based and Ethernet-based clusters, may also be applied to the model. The SAN-to-IP and IP-to-SAN packet conversion latency is measured experimentally on a Pentium III machine as ~3 μsec for both SCI and Myrinet SANs. Figure 5-11 shows a snapshot of the simulation model.

Figure 5-11. Snapshot of the simulation model for latency-sensitive distributed cluster multicast.

Figure 5-12 shows the multicast completion latencies for various system setups for all-remote multicast IPC calls. SCI serves as the SAN cluster interconnect for these trials. Figure 5-12A shows the case where the gateway is part of the local SCI SAN cluster. Figures 5-12B and 5-12C show cases where the gateway is not part of the local SCI SAN cluster, and Gigabit Ethernet and Fast Ethernet are the gateway-to-SAN links, respectively.


Figure 5-12. SAN-to-SAN multicast completion latencies (msec) versus multicast group size (in nodes) for the local and remote SCI U-torus, S-torus, Mu-torus, Md-torus, and separate addressing schemes. A) Gateway is part of the SCI SAN cluster. B) Gateway is a dedicated node and the gateway-to-SAN interconnect is Gigabit Ethernet. C) Gateway is a dedicated node and the gateway-to-SAN interconnect is Fast Ethernet.


For all cases, the WAN backbone latency is the dominant component and is fixed for a given system. Better performance is obtained when the gateway is part of the SAN cluster, or when a fast enough (e.g., Gigabit Ethernet, ~1.2 msec one-way latency at 64 KB) gateway-to-SAN link is used. If a slower (e.g., Fast Ethernet, ~12 msec one-way latency at 64 KB) interconnect technology is used for the gateway-to-SAN link, it may become the major bottleneck of the system. Figure 5-13 shows multicast completion latencies between two SAN clusters placed in two different domains and connected over the WAN backbone. The SAN clusters use a Myrinet interconnect, and all the multicast IPC calls originate from one SAN cluster and are targeted to nodes in the remote cluster. Figure 5-13A shows the case where the gateway is part of the local Myrinet SAN cluster. Figures 5-13B and 5-13C show cases where the gateway is not part of the local Myrinet SAN cluster, and Gigabit Ethernet or Fast Ethernet is the gateway-to-SAN link, respectively. Figure 5-13 shows results similar to the SCI-SCI all-remote case given in Figure 5-12. The WAN link latency is again the dominant component and is on the critical path. Once again the results reveal that it is important to ensure that the chosen gateway-to-SAN link does not impose additional latency if the gateway is not part of the SAN cluster. Figures 5-12 and 5-13 show that Gigabit Ethernet is a viable gateway-to-SAN link for current WAN and SAN technologies if the gateway is a dedicated node, and that SCI and Myrinet clusters with integrated gateways provide good completion latencies. Also, the results show that a slow interconnect, such as Fast Ethernet, used as the gateway-to-SAN link imposes an additional bottleneck and increases the completion latency.


Figure 5-13. SAN-to-SAN multicast completion latencies (msec) versus multicast group size (in nodes) for the local and remote Myrinet H.B. Binomial, H.B. Binary, N.A. Binomial, and N.A. Binary schemes. A) Gateway is part of the Myrinet SAN cluster. B) Gateway is a dedicated node and the gateway-to-SAN interconnect is Gigabit Ethernet. C) Gateway is a dedicated node and the gateway-to-SAN interconnect is Fast Ethernet.


Overall, it was observed that the total all-remote multicast completion latency for Grid-connected distributed clusters, denoted as t_all_remote_multicast in this dissertation, can be expressed for a sufficiently large remote multicast group size as given in Eq. 5-1:

t_all_remote_multicast = t_gateway-to-SAN + t_conversion + t_WAN + t_conversion + t_gateway-to-SAN + t_remote_completion    (5-1)

where t_gateway-to-SAN denotes the gateway-to-SAN link latency, t_conversion denotes the packet conversion latency, t_WAN denotes the WAN link latency, and t_remote_completion is the multicast completion latency in the remote cluster.
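Plugging in the link latencies quoted earlier in this case study gives a rough feel for Eq. 5-1. In the C++ fragment below, the WAN, Gigabit Ethernet, and conversion values come from the measurements cited in the text, while the remote-completion term is a placeholder chosen only to make the arithmetic concrete; none of the printed values are results of the study.

    #include <cstdio>

    int main() {
        // Components of Eq. 5-1, in msec.
        const double t_gateway_to_san    = 1.2;    // Gigabit Ethernet one-way latency at 64KB (from the text)
        const double t_conversion        = 0.003;  // ~3 usec SAN<->IP conversion (from the text)
        const double t_wan               = 30.0;   // measured WAN latency at 64KB (from the text)
        const double t_remote_completion = 5.0;    // placeholder for completion inside the remote SAN cluster

        const double t_all_remote = t_gateway_to_san + t_conversion + t_wan +
                                    t_conversion + t_gateway_to_san + t_remote_completion;
        std::printf("t_all_remote_multicast ~= %.1f msec (the WAN term dominates)\n", t_all_remote);
        return 0;
    }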


Figure 5-14 shows the mixed-mode (i.e., local and remote multicast calls combined) IPC multicast completion latencies between two SAN clusters connected with a WAN backbone. SCI and Myrinet are used as the SAN interconnects. The system setups for the SCI and Myrinet simulations are identical. The local and remote multicast group sizes are given as (Local, Remote). The total mixed multicast completion latency is the greater of the local multicast completion latency and the total all-remote multicast completion latency given above. For clarity, only the best-performing SAN multicast communication schemes and algorithms are included in these trials. As can be seen, the remote completion latency dominates for small local and remote group sizes. However, the local completion latency is likely to dominate for significantly larger local group sizes. Latency hiding appears to be an important aspect to consider, as it is possible to hide the local multicast latency for small multicast group sizes, because it is relatively small compared to initiating and completing a remote multicast call. Similarly, significantly large local multicast group sizes are likely to hide the remote multicast latency.

Figure 5-14. Mixed-mode (local, remote) IPC multicast completion latencies for SCI and Myrinet SAN clusters connected with a WAN backbone. The schemes shown are SCI Md-torus, Myrinet H.B. Binary, and Myrinet N.A. Binary.

The total multicast completion latency for the mixed case, denoted as t_mixed_multicast in this dissertation, can be expressed as:

t_mixed_multicast = Max(t_multicast, t_all_remote_multicast)    (5-2)

where t_multicast is the multicast completion latency in the local cluster, as in previous chapters. In summary, this case study reveals that the WAN latency is the dominant latency component for small group sizes. Moreover, the gateway-to-SAN link plays an important role in the overall system performance. The capacity of the gateway-to-SAN link determines whether the gateway should be part of the SAN cluster or not. The remote completion latency dominates when local and remote group sizes are small, while the local completion latency is likely to dominate for larger group sizes.


Hiding the local multicast latency for small local multicast group sizes is possible, as it is small compared to a remote multicast call, while significantly large local group sizes can hide the remote multicast latency.

Case Study 2: Large-file Data and Replica Staging

Grid and distributed scientific and engineering applications require transfers of large amounts of data between storage systems, as well as access to large amounts of data by distributed applications and users. GridFTP is a tool to distribute and manage large volumes of geographically dispersed data over Grids [55]. It is an enhanced version of FTP with parallel and partial data transfer, among other features. It uses point-to-point unicast communication, and the proposed multicast framework can be used to enhance the efficiency of GridFTP. The case study begins with a user data or replica staging request. The request, consisting of the identification of the source data, the receiver domain list, and the number of nodes per domain, is then transmitted to the actual data holder (i.e., data storage site or database maintainer). Upon receipt of the request by the data server (source), the top-level tree formation phase starts. The source sends a probe message to each domain gateway on the receiver list and waits for replies. Upon receiving explicit join messages, it forms the top-level tree based on the request-reply latency collected from each gateway. The source sends channel messages (S, E) to the first-layer gateways. Lower-layer gateways are listed for each gateway in the channel messages. The source awaits the multicast tree ready signal from the first-layer gateways. Upon receipt of the multicast tree ready messages, it notifies the user that the tree is formed. The state and routing information of the top-level tree is stored in the gateways so that the tree can be used multiple times.
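Because the top-level tree is retained for reuse, each gateway only needs a small amount of per-tree state. The C++ record below is a guess at the minimum state implied by the description above; the field names, types, and sample values are assumptions, not a specification.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Per-tree state a gateway might retain so the top-level tree can be reused
    // for later transfers without re-running the probe/join phase (illustrative).
    struct TopLevelTreeState {
        std::string source;                        // data server that built the tree
        std::string parent_gateway;                // upstream hop toward the source
        std::vector<std::string> child_gateways;   // taken from the channel (S, E) message
        std::vector<double> child_latencies_msec;  // request-reply samples gathered at build time
        bool local_group_formed;                   // intra-domain group created on demand
    };

    int main() {
        TopLevelTreeState t{"NETLAB.CALTECH.EDU", "HCS.UFL.EDU",
                            {"CSEE.USF.EDU", "HCS.FSU.EDU"}, {0.7, 18.0}, true};
        std::printf("%zu child gateways cached for reuse\n", t.child_gateways.size());
        return 0;
    }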


Gateways listen for incoming requests, and upon intercepting a probe message they first extract the participant gateway list and start sending probe messages to each. When they have received replies back from all the other gateways, they send an explicit join message with the results from the other gateways to the source gateway. The control messages are exchanged in an all-to-all fashion for the probe messages, enabling each gateway to evaluate its neighbors and their relative distances. This information helps the source gateway obtain a graph of the participant list so it can build the most efficient top-level tree. First-level gateways receive channel messages (S, E) after sending out the join messages. Each channel message includes the list of respective child gateways, if any exist, for each top-level gateway and for each branch of the tree. First-level gateways propagate the channel message down to their child gateways, while they also simultaneously form a local multicast group in their own subnets, if necessary. These local groups are built based on the user's preset criteria. They then wait for the multicast tree ready signal from their child gateways. After receiving these messages, and after the local multicast group is formed (if requested), they send a multicast tree ready signal to the source. Mid-level and leaf gateways perform similarly to the first-level gateways. Figure 5-15 shows the top-level tree formation actions for the outlined scenario. After the top-level tree is formed, the source initiates the actual multicast data transmission to the first-level gateways. The source listens for retransmission requests from the first-level gateways and replies if there are any. When the data streaming is finished, it waits for the multicast complete signal from the first-layer gateways, and upon receiving them from every child gateway it sends a multicast completed signal to the user.


Figure 5-15. Top-level tree formation.

Figure 5-16. Multicast communication.

In the actual multicast data transfer phase, all gateways receive incoming multicast data from upper-level gateways. They immediately save a copy of the incoming data to their local caches.


If the incoming data is corrupted, they ask for a retransmission from the upper-level gateways. Upon receipt of uncorrupted data, they simultaneously propagate the data to their child gateways and local groups. They listen for and reply to any retransmission request from their own multicast group or child gateways and wait for the multicast completion signal from them. Upon receiving the completion signals, first-level gateways send multicast completion signals to the source. Figure 5-16 shows the multicast communication for the outlined scenario. Figure 5-17 shows the three network topology scenarios evaluated in this case study. Figure 5-17A shows the flat topology, where all receiver domains are connected to the source with hypothetical direct links that are one hop away for the six WAN-connected domains. Figures 5-17B and 5-17C show the 2-hop and 3-hop tree architectures, respectively. Figure 5-18 shows a snapshot of the simulation model. Figure 5-19 shows the obtained multicast completion latencies for a Terabyte multicast file transfer over the WAN backbone. Tree (L.C) represents a tree architecture with local caching for retransmission, and Tree (N.C) represents a tree architecture with no local caching. Figures 5-19A and 5-19B show the case where the gateway is a dedicated node. Figures 5-19C and 5-19D show the case where the gateway is part of the local SAN clusters. As can be seen from the figure, a flat topology achieves the lowest completion latency at the cost of increased backbone impact. Although the flat topology shows good theoretical results, it is not practical, as all-to-all direct connectivity is not possible in most cases in real-world WAN networks.


Moreover, with the flat topology there is a possible congestion problem, as the root has to issue more transfers, and these transfers will most likely share the same physical links to some degree.

Figure 5-17. Evaluated network topology scenarios. A) Flat topology. B) Tree architecture for the 2-hop topology. C) Tree architecture for the 3-hop topology.

An alternative to the flat topology is the tree architecture. Previous chapters presented the comparative multicast performance of various flat-topology (i.e., separate addressing) and tree-architecture (i.e., U-torus, Md-torus, Mu-torus, binomial, and binary tree) algorithms for SANs. Following this approach, the performance of tree-based multicast architectures is analyzed in comparison to the flat topology for WANs.


It is observed that the regular tree architecture (i.e., Tree (N.C)) is not a viable alternative to the flat topology because of its extensive completion latencies, as increasing the number of hops and the retransmission rate also increases the completion latency. By contrast, the enhanced version of the tree architecture proposed in the framework, Tree (L.C), provides performance levels similar to the flat topology. Furthermore, Tree (L.C) is not affected by increasing the number of hops and the retransmission rate, and it provides a lower WAN backbone impact.

Figure 5-18. Snapshot of the simulation model.

Figure 5-19 also shows useful information about the gateway placement problem. As can be seen from the figure, when the gateway is not part of the SAN cluster and the gateway-to-SAN link is a slow interconnect (e.g., Fast Ethernet), this slow link becomes a bottleneck in addition to the slow WAN links.


Figure 5-19. Continued. Panels A through C plot multicast completion latency (min) versus retransmission rate (1%, 5%, 10%, and 20%) for the Tree 2-hop (L.C.), Tree 3-hop (L.C.), Flat, Tree 2-hop (N.C), and Tree 3-hop (N.C) configurations. A) Dedicated gateway with a Fast Ethernet gateway-to-SAN link. B) Dedicated gateway with a Gigabit Ethernet gateway-to-SAN link. C) Gateway is part of the Myrinet SAN cluster.


However, if the gateway-to-SAN link is a fast one, such as Gigabit Ethernet, this problem can be eliminated to some degree, and for a small number of hops Tree (N.C) provides good performance results. When the gateway is part of the SAN cluster, it is observed that the SAN links provide adequate capacity for the data transfers, and the only system bottleneck is the WAN link latencies.

Figure 5-19. Multicast completion latencies for the large-file transfer scenario. A) Gateway is a dedicated node and the gateway-to-SAN interconnect is Fast Ethernet. B) Gateway is a dedicated node and the gateway-to-SAN interconnect is Gigabit Ethernet. C) Gateway is part of the Myrinet SAN cluster. D) Gateway is part of the SCI SAN cluster.

Figure 5-20 shows a comparative analysis of the backbone impact of the Tree (L.C) and Tree (N.C) architectures. The measurements are obtained from a 3-hop system configuration. The backbone impact is measured as the amount of additional traffic on the first hop of the WAN backbone from the source. The baseline is defined as the minimum amount of traffic (the 0% retransmission case) required to transfer the data. As can be seen, Tree (L.C) is constant and unaffected by increasing retransmission rates. This observation parallels the results of the previous case. Moreover, Tree (L.C) provides the lowest WAN backbone impact for all retransmission rates.


By contrast, the Tree (N.C) architecture is directly affected by increasing retransmission rates and therefore places an increased impact on the source and the 1-hop WAN link. This difference is due to the route lengths over which the retransmission requests and the actual retransmissions have to propagate across the WAN backbones in these two tree architectures.

Figure 5-20. Comparative WAN backbone impacts (relative to the baseline) of the Tree (L.C) and Tree (N.C) architectures for retransmission rates of 1%, 5%, 10%, and 20%.

Figure 5-21 shows a head-to-head multicast completion latency comparison of the Tree (L.C) and Tree (N.C) architectures with the four different interconnect technologies for a 3-hop topology with a 5% retransmission rate. Tree (L.C) performs better than Tree (N.C) for all interconnects. Explicitly, Tree (L.C) performs ~40%, ~50%, and ~60% better than the Tree (N.C) algorithm for Fast Ethernet (FE), Gigabit Ethernet (GigE), and Myrinet and SCI, respectively. Among these four, Fast Ethernet is the worst choice as a gateway-to-SAN link. Also, Figure 5-21 reveals that the WAN backbones are the primary bottleneck, as they cannot supply enough data to saturate the SCI and Myrinet links. For Tree (N.C), when the gateway-to-SAN link is chosen to be a Fast Ethernet interconnect, an additional bottleneck is observed, whereas for Tree (L.C) both the Fast Ethernet and the Gigabit Ethernet links act as bottlenecks in addition to the WAN backbone links.


Moreover, for Tree (L.C), more efficient multicast data staging is obtained when the gateway is part of the SAN cluster.

Figure 5-21. Head-to-head comparison of Fast Ethernet (FE), Gigabit Ethernet (GigE), Myrinet, and SCI interconnects as gateway-to-SAN links for the Tree (L.C) and Tree (N.C) architectures.

Figure 5-22 shows the local domain cluster interconnect utilizations. Figures 5-22A and 5-22B show the case where the gateway is a dedicated node. Figures 5-22C and 5-22D show the case where the gateway is part of the local SAN cluster. As can be seen from Figure 5-22A, where the gateway-to-SAN connection is a slow interconnect such as Fast Ethernet, Tree (N.C) utilizes the link less than Tree (L.C) does. Tree (N.C) pushes the utilization bottleneck from the local domain to the WAN backbone, and increasing the number of WAN backbone hops increases the WAN bottleneck problem. Moreover, increasing retransmission rates amplifies the problem, and the gateway-to-SAN Fast Ethernet link is underutilized. Overall, using a slow interconnect such as Fast Ethernet as the gateway-to-SAN link results in decreased throughput.


Tree (L.C), with Fast Ethernet as the gateway-to-SAN link, is highly utilized, as the bottleneck is shared between the WAN and the local gateway-to-SAN link. Increasing retransmission rates pushes the bottleneck to the WAN links, and the gateway-to-SAN link becomes less utilized. The flat results exhibit performance levels similar to those of the Tree (L.C) scheme because the data and retransmissions travel only one hop over the WAN backbones for both the Tree (L.C) and flat architectures. Overall, Tree (L.C) provides faster data transfers to the local multicast group because of its better local link utilization. For the case where the gateway-to-SAN connection is a Gigabit Ethernet link, as given in Figure 5-22B, lower utilization is obtained compared to the Fast Ethernet case, as the link capacity of Gigabit Ethernet is higher than that of Fast Ethernet. For the Gigabit Ethernet case, a reduced backbone impact from the Tree (N.C) approach is observed, as Tree (N.C) exhibits a level of performance similar to Tree (L.C). Finally, as can be seen, increasing retransmission rates decreases utilization. Figures 5-22C and 5-22D show the results for gateways that are part of the Myrinet or SCI SAN clusters. For both cases, the SAN links are highly underutilized for all trials because the system bottleneck is the WAN backbone. It can be concluded that, with today's backbone technology, the SAN links cannot be saturated by the amount of data the WAN links can provide. Case study 2 shows that, in general, the WAN links are the key system bottlenecks. In some cases the gateway-to-SAN interconnect can also be an additional bottleneck (i.e., Fast Ethernet as a gateway-to-SAN link). Lower data retransmission rates and higher QoS priority on the WAN links can help to reduce the WAN bottleneck. Moreover, using faster interconnects as the gateway-to-SAN link and placing the gateway as part of the SAN cluster also helps to reduce the bottleneck.
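One common way to express the interconnect utilization reported in Figure 5-22 is delivered traffic relative to what the link could carry over the measurement interval. Whether the simulation computes it exactly this way is not stated, so the C++ fragment below is only a generic illustration, and every number in it is a placeholder rather than a result of the study.

    #include <cstdio>

    int main() {
        // utilization = bytes actually delivered / (link capacity x elapsed time)
        const double delivered_bytes      = 1e12;    // placeholder: ~1TB staged into the local group
        const double capacity_bytes_per_s = 125e6;   // placeholder: Fast Ethernet, 100 Mb/s
        const double elapsed_s            = 4 * 3600.0; // placeholder measurement interval
        const double utilization = delivered_bytes / (capacity_bytes_per_s * elapsed_s);
        std::printf("utilization = %.0f%%\n", 100.0 * utilization);
        return 0;
    }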


Figure 5-22. Continued. Panels A through C plot cluster interconnect utilization versus retransmission rate (1%, 5%, 10%, and 20%) for the Tree 2-hop (L.C.), Tree 3-hop (L.C.), Flat, Tree 2-hop (N.C), and Tree 3-hop (N.C) configurations. A) Dedicated gateway with a Fast Ethernet gateway-to-SAN link. B) Dedicated gateway with a Gigabit Ethernet gateway-to-SAN link. C) Gateway is part of the Myrinet SAN cluster.


Figure 5-22. Interconnect utilizations for the local clusters. A) Gateway is a dedicated node and the gateway-to-SAN interconnect is Fast Ethernet. B) Gateway is a dedicated node and the gateway-to-SAN interconnect is Gigabit Ethernet. C) Gateway is part of the Myrinet SAN cluster. D) Gateway is part of the SCI SAN cluster.

Summary

This chapter introduced a Layer-2/Layer-3 framework for cluster-to-cluster multicast over a WAN backbone. The proposed infrastructure supports multi-protocol, latency-sensitive multicast with minimal WAN backbone impact. The framework takes advantage of faster interconnects whenever possible, and as it requires a minimal amount of changes to the existing Grid/Globus architecture, it can be implemented easily. Two separate simulative case studies are also presented for further performance analysis of the proposed approach. The simulation case studies show that the WAN backbone latency is the most dominant component of multicast communication for Grid-connected clusters. The flat topology for Grids is not practical and introduces high backbone utilization levels. The proposed tree architecture with local caching provides lower latencies and reduces backbone impact compared to flat topologies.


Moreover, the tree architecture with local caching is unaffected by increasing the retransmission rate or the number of hops. The gateway-to-SAN link is also observed to be an additional bottleneck in some cases. For example, when a slow interconnect, such as Fast Ethernet, is used as the gateway-to-SAN link, it increases both the multicast completion latencies and the WAN backbone utilization. Therefore, special attention is required when determining the placement of the gateway.


CHAPTER 6 CONCLUSIONS Cluster computing is a cost-effective solution for computationally intensive problems, providing performance similar to that of supercomputers. Grid computing is the answer of the parallel and distributed processing community to many of the computationally demanding scientific and engineering problems of the world, bringing together geographically distributed computing resources, such as supercomputers and clusters, on the global-scale. Collective communication operations, such as multicasting, simplify and increase the functionality and efficiency of parallel and distributed tasks for Grid computing. Unfortunately, specialized high-performance cluster interconnects do not inherently support multicasting. Furthermore, current Grid multicasting schemes are unable to support any other parallel and distributed computing platform other than MPI, and they are not targeted for latency-sensitive applications. This research investigates the multicast problem for Grid-connected SAN-based and IP-based clusters. Following a bottom-to-top approach, the research is divided into three distinct phases. The first phase of this research is focused on the experimental analysis, small-message latency modeling, and analytical projections of software-based Layer-2 multicast algorithms on high-performance torus networks. The second phase of this research investigates the multicast problem for interconnects with onboard NIC co-processors. Experimental analysis, small-message latency models, and analytical projections are performed for networks with onboard NIC coprocessors. Finally, a 129


130 universal, latency-sensitive multicast framework for Grid-connected clusters is introduced and evaluated based on extensions to the results obtained in first two phases. In the first phase of this study, five different multicast algorithms for high-performance torus networks are evaluated. Direct torus networks are widely used in high-performance parallel computing systems as they are cost-effective and also provide good scalability in terms of bandwidth. The selected algorithms are analyzed on direct SCI torus networks. The performance characteristics of these algorithms are experimentally examined under different system and network configurations using various metrics, such as multicast completion latency, root-node CPU utilization, multicast tree creation latency, and link concentration and concurrency, to evaluate their key strengths and weaknesses. Based on the results obtained, small-message latency models for each algorithm are defined. The models are observed to be accurate. Projections for larger systems are also presented and evaluated. It is observed that for SCI torus networks with small messages and multicast group sizes, the best approach, in terms of completion latency and host CPU utilization, is the separate addressing algorithm. However, for large message sizes and large multicast group sizes, more complex algorithms perform better. The M d -torus algorithm performs better for larger system sizes because of its balanced dimensional partitioning method, providing the lowest completion latency and very low user-level CPU overhead. Moreover, for higher dimension torus networks (such as 3D or more), the M d -torus algorithm appears to be the best performing protocol. In short, there is no definite multicast algorithm which best suits every possible networking scenario. Although SCI


131 hardware does not inherently support user-level multicasting, it can be achieved with reasonable performance levels for this high-performance interconnect. The second phase of this research introduces a multicast performance analysis and modeling for high-speed indirect networks with NIC-based processors. This type of interconnect finds a wide implementation area in the parallel computing community because of its reprogramming flexibility and work offloading from the host CPU. Various degrees of host and NIC processor work sharing are evaluated in this phase of study, such as host-based, NIC-based, and NIC-assisted communication schemes. The selected schemes are experimentally analyzed for binomial and binary trees, serial forwarding and separate addressing multicast algorithms, using various metrics, such as multicast completion latency, root-node CPU utilization, multicast tree creation latency, and link concentration and concurrency, to evaluate their key strengths and weaknesses. Small-message latency models of these algorithms are developed and verified based on the experimental results. The models are observed to be accurate. Projections for larger systems are also presented and evaluated. Experimental and latency modeling analysis revealed that for interconnects with onboard NIC coprocessors and for latency-sensitive applications that utilize small-messages, a host-based multicast communication scheme performs best. However, host-based multicasting results in the highest CPU utilization. NIC-based solutions provide the lowest and constant CPU utilizations for both small and large messages at the cost of increased completion latencies. NIC-assisted multicasting provides lower CPU utilizations than host-based ones, and comparable CPU-utilizations to the NIC-based algorithms. Furthermore, the NIC-assisted approach provides comparable multicast


132 completion latencies to host-based schemes for lower host CPU utilizations, and thus appears to be better choice for applications that demand a high level of computation-communication overlapping. The third phase of this research tackles the problem of obtaining a universal, low-level multicast system for latency-sensitive applications for Grid computing. A framework for low-level multicast communication is proposed that can be used as a service for high-level Grid computing applications. The proposed framework introduces a Layer-2/Layer-3 framework for cluster-to-cluster multicast over a WAN backbone and uses and extends the results of previous two phases for improved overall multicast performance and efficiency. The infrastructure supports multi-protocol multicast with minimal WAN backbone impact. The framework takes advantage of faster interconnects, such as SANs, whenever possible. It also requires a minimal amount of changes in the existing Grid/Globus architecture. Two simulative case studies are performed for performance analysis of the proposed approach. For Grid-connected clusters the WAN backbone latency is the most dominant component for a given multicast system. Simulation results show that the tree-based architecture with local caching provides lower latencies and lower backbone impact for all cases compared to a flat architecture and is unaffected by the increasing retransmission rate and number of hops, unlike the flat architecture. It is also observed that the gateway-to-SAN interconnect can also be an additional bottleneck and the capacity of the gateway-to-SAN link determines gateway placement. The remote completion latency dominates for small local and remote group sizes and significantly large local groups will likely hide the remote multicast latency.


133 This study is the first to implement and experimentally analyze and model user-level software-based multicast performance on direct SCI networks. Moreover, it is the first to present a full-scale comparative experimental analysis, latency modeling, and analytical projections of host-based, NIC-based, and NIC-assisted multicast communication schemes over various spanning tree and path-based algorithms for Myrinet networks. This research also is the first in presenting an experimental comparative multicast performance analysis of SCI and Myrinet interconnects. Furthermore, this research is the first to define a latency-sensitive universal multicast framework for Grid-connected SAN-based and IP-based clusters that can be used as a low-level service for higher-level applications. The proposed multicast framework is unique in the sense that it introduces a channel-based and group-based hybrid approach for obtaining a distributed and scalable multicast communication system over the WAN backbones while maintaining the efficiency and scalability in the local multicast groups.


LIST OF REFERENCES 1. I. Foster, C. Kesselman, and S. Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations, International Journal of Supercomputer Applications, Vol. 15, No.3, 2001. 2. M Barnett, 2003, ATLAS experiment home page, CERN, http://atlasexperiment.org/ Aug. 2003. 3. A.R. Whitney, K.A. Dudevoir, H.F. Hinteregger, J.P. Gary, B. Fink, L.N. Foster, C. Kodak, K. Kranacs, P. Lang, W.T. Wildes, S.L. Bernstein, L.A. Prior, P.A. Schulz, T.W. Lehman, and J. Sobieski, The Gbps e-VLBI Demonstration Project. ftp://web.haystack.edu/pub/e-vlbi/demo_report.pdf 4. P. Avery, and I. Foster, 2003, iVDgL International Virtual Data Grid Laboratory home page, iVDgL, http://www.ivdgl.org, Jan. 2003. 5. P. Avery, and I. Foster, 2003, GriPhyN Grid Physics Network home page, GriPhyN, http://www.griphyn.org, Jan. 2003. 6. R. Mondardini, 2003, DataTAG project home page, CERN, http://datatag.web.cern.ch/datatag Jan. 2003. 7. T. DeFanti, 2003, StarLight project home page, University of Illinois at Chicago, http://www.startap.net/starlight Jan. 2003. 8. M. Maimour, and C. Pham, An Active Reliable Multicast Framework for the Grids, Proceedings of the International Conference on Computational Science (ICCS 2002), Amsterdam, The Nederlands, pp588-597, Apr. 2002. 9. H. Eriksson, The multicast backbone, Communications of the ACM, Vol. 8, pp. 54-60, 1994. 10. T. Bates, R. Chandra, D. Katz, and Y. Rekhter, Multiprotocol extensions for BGP-4, Internet Engineering Task Force (IETF) specification, RFC 2283, Feb. 1998. 11. D. Estrin, D. Fariannaci, A. Helmy, D. Thaler, S. Deering, M. Handley, V. Jacobson, C. Liu, P. Sharma, and L. Wei, Protocol independent multicast sparse-mode (PIM-SM): Protocol specification, Internet Engineering Task Force (IETF) specification, RFC 2362, Jun. 1998. 134


135 12. D. Farinacci, Y. Rekhter, P. Lothberg, H. Kilmer, and J. Hall, Multicast source discovery protocol, Internet Engineering Task Force (IETF) specification, RFC 0020, Jun. 1998. 13. P.K. McKinley, H.Xu, A.H. Esfahanian, and L.M. Ni, Unicast-Based Multicast Communication in Wormhole-Routed Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 12, pp. 1252-1265, 1994. 14. P.K. McKinley, Y.Tsai, and D.F. Robinson, Collective Communication in Wormhole-Routed Massively Parallel Computers, IEEE Computer, Vol. 28, No. 2, pp. 39-50, 1995. 15. Y. Tseng, D.K. Panda, and T Lai, A Trip-Based Multicasting Model in Wormhole-Routed Networks with Virtual Channels, IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No.2, pp. 138-150, 1996. 16. R. Kesavan and D.K. Panda, Multicasting on Switch-Based Irregular Networks using Multi-Drop Path-Based Multi-Destination Worms, Proceedings of Parallel Computer Routing and Communication, Second International Workshop (PCRCW'97), Atlanta, Georgia, pp. 179-192, Jun. 1997. 17. D. Gustavson and Q. Li, The Scalable Coherent Interface (SCI), IEEE Communications, Vol. 34, No. 8, pp. 52-63, Aug. 1996. 18. N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, and W.K. Su, Myrinet: A Gigabit-per-second Local Area Network, IEEE Micro, Vol. 15, No. 1, pp. 29-36, Feb. 1995. 19. IEEE, SCI: Scalable Coherent Interface, IEEE Approved Standard 1596-1992, 1992. 20. D.F. Robinson, P.K. McKinley, and B.H.C. Cheng, Optimal Multicast Communication in Wormhole-Routed Torus Networks, IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 10, pp. 1029-1042, 1995. 21. X. Lin and L.M. Ni, Deadlock-Free Multicast Wormhole Routing in Multicomputer Networks, Proc. of 18th Annual International Symposium on Computer Architecture, Toronto, Canada, pp. 116-124, May 1991. 22. D.F. Robinson, P.K. McKinley, and B.H.C. Cheng, Path-Based Multicast Communication in Wormhole-Routed Unidirectional Torus Networks, J. of Parallel and Distributed Computing, Vol. 45, No. 2, pp. 104-121, 1997. 23. L.M. Ni and P.K. McKinley, A Survey of Wormhole Routing Techniques In Direct Networks, IEEE Computer, Vol. 26, No. 2, pp. 62-76, 1993.


136 24. K. Omang and B. Parady, Performance of Low-Cost Ultrasparc Multiprocessors Connected By SCI, Technical Report, Department of Informatics, University of Oslo, Norway, 1996. 25. M. Ibel, K.E. Schauser, C.J. Scheiman and M. Weis, High-Performance Cluster Computing using SCI, Proceedings of Hot Interconnects Symposium V, Palo Alto, CA, Aug. 1997. 26. M. Sarwar and A. George, Simulative Performance Analysis of Distributed Switching Fabrics for SCI-Based Systems, Microprocessors and Microsystems, Vol.24, No.1, pp. 1-11, 2000 27. D. Gonzalez, A. George, and M. Chidester, Performance Modeling and Evaluation of Topologies for Low-Latency SCI Systems, Microprocessor and Microsystems, Vol.25, No.7, pp. 343-356, 2001 28. R. Todd, M. Chidester, and A. George, Comparative Performance Analysis of Directed Flow Control for Real-Time SCI, Computer Networks, Vol.37, No.4, pp. 391-406, 2001 29. H. Bugge, Affordable Scalability using Multicubes, in: H. Hellwagner, A. Reinfeld (Eds.), SCI: Scalable Coherent Interface, LNCS State-of-the-Art Survey, Springer, Berlin, Germany, 1999, pp. 167-174 30. L.P. Huse, Collective Communication on Dedicated Clusters Of Workstations, Proc. of 6th PVM/MPI European Users Meeting (EuroPVM/MPI ), Sabadell, Barcelona, Spain, Sep. 1999, pp. 469-476. 31. H. Wang, and D.M. Blough, Tree-Based Multicast in Wormhole-Routed Torus Networks, Proc. PDPTA, 1998. 32. D.E. Culler, R. Karp, D.A. Patterson, A. Sahay, K.E. Shauser, E. Santos, R. Subramonian, and T. von Eicken, LogP: Towards a Realistic Model of Parallel Computation, Proceedings of ACM 4th SIGPLAN Symposium on Principles and Practices of Parallel Programming, San Diego, California, pp. 1-12, May 1993. 33. E. Deelman, A. Dube, A. Hoisie, Y. Luo, R. Oliver, D. Sundaram-Stukel, H. Wasserman, V.S. Adve, R. Bagrodia, J.C. Browne, E. Houstis, O. Lubeck, J. Rice, P. Teller, M.K. Vernon, POEMS: End-to-end Performance Design of Large Parallel Adaptive Computational Systems, Proceedings of First International Workshop on Software and Performance '98, WOSP '98, Santa Fe, New Mexico, pp. 18-30, Oct. 1998. 34. M. Gerla, P. Palnati, and S. Walton, Multicasting Protocols for High-Speed, Wormhole-Routing Local Area Networks, Proceedings of SIGCOMM Symposium, pp. 184-193, Aug. 1996.


137 35. K. Verstoep, K. Landgendoen, and H. Bal, Efficient Reliable Multicast on Myrinet, Proceedings of 1996 International Conference on Parallel Processing, pp. 156-165, Aug. 1996. 36. P. Kesavan and D.K. Panda, Optimal Multicast with Packetization and Network Interface Support, Proceedings of 1997 International Conference on Parallel Processing, pp. 370-377, Aug. 1997. 37. R.A.F. Bhoedjang, T. Ruhl, and H.E. Bal, Efficient Multicast on Myrinet Using Link-Level Flow Control, Proceedings of 1998 International Conference on Parallel Processing, pp. 381-389, Aug. 1998. 38. D. Buntinas, D. K. Panda, and P. Sadayappan, Fast NIC-Based Barrier over Myrinet/GM, International Parallel and Distributed Processing Symposium IPDPS, pp. 52-60, San Francisco, CA, Apr. 2001. 39. D. Buntinas, D. K. Panda, J. Duato, and P. Sadayappan, Broadcast/Multicast over Myrinet using NIC-Assisted Multidestination Messages, Proceedings of Fourth International Workshop on Communication, Architecture, and Applications for Network-Based Parallel Computing, CANPC '00, Toulouse, France, pp. 115-129, Jan. 2000. 40. R. Sivaram, R. Kesavan, D. K. Panda, and C. B. Stunkel, Where to Provide Support for Efficient Multicasting in Irregular Networks: Network Interface or Switch?, Technical Report OSU-CISRC-02/98-TR05, The Ohio State University, Feb. 1998. 41. D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. Merritt, E. Gronke, and C. Dodd, The Virtual Interface Architecture, IEEE Micro, pp. 66-76, Mar./Apr. 1998. 42. C. Kurmann and T.M. Stricker, A Comparison of two Gigabit SAN/LAN technologies: Scalable Coherent Interface versus Myrinet, Scalable Coherent Interface: Technology and Applications, Edited by Hermann Hellwagner & Alexander Reinefeld, pp. 29-42, Cheshire Henbury Publications, 1998. 43. M. Fischer, U. Brning, J. Kluge, L. Rzymianowicz, P. Schulz, and M. Waack, ATOLL a new switched, high-speed Interconnect in Comparison to Myrinet and SCI, IPDPS 2000 (Int. Parallel and Distributed Processing Symposium), PC NOW Workshop, Cancun, Mexico, 2000. 44. I. Foster, and C. Kesselman, 2003, Globus project home page, Globus, http://www.globus.org May 2003. 45. M. Livny, and M. Solomon, 2003, Condor project home page, University of Wisconsin, http://www.cs.wisc.edu/condor, May 2003.


138 46. D. Waitzman, C. Patridge, and S. Deering, Distance vector multicast routing protocol (DVMRP), Internet Engineering Task Force (IETF), RFC 1075, Nov. 1988. 47. Y. Rekhter and T. Li, a border gateway protocol (BGP-4), Internet Engineering Task Force (IETF), RFC 1771, Mar. 1995. 48. H. Holbrook and D. Cheriton, IP multicast channels: EXPRESS support for large-scale single-source applications, ACM SIGCOMM, Cambridge, MA, Aug. 1999. 49. B.B. Lowekamp, A Beguelin, ECO: Efficient Collective Operations for Communication on Heterogeneous Networks, International Parallel Processing Symposium, pp. 399-405, Honolulu, HI, 1996. 50. I. Foster, J. Geisler, W. Gropp, N. Karonis, E. Lusk, G. Thiruvathukal, and S. Tuecke, Wide-Area Implementation of the Message Passing Interface, Parallel Computing, Vol. 24, No. 12, pp. 1735-1749, 1998. 51. N. Karonis, B. Toonen, and I. Foster, MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface, Journal of Parallel and Distributed Computing (JPDC), Vol. 63, No. 5, pp. 551-563, May 2003. 52. G. Schorcht, I. Troxel, K. Farhangian, P. Unger, D. Zinn, C. Mick, A. George, and H. Salzwedel, System-Level Simulation Modeling with MLDesigner, Proc. 11 th IEEE/ACM International Symposium On Computer and Telecommunication Systems, Orlando, FL, 2003. 53. K. Shanmungen, V. Frost, and W. LaRue, A Block-Oriented Network Simulator (BONeS), Simulation, Vol. 58, No.2, pp. 83-94, 1992. 54. D. Culler, K. Keeton, L.T. Liu, A. Mainwaring, R. Martin, S. Rodrigues, K. Wright, and C. Yoshikawa, The Generic Active Message Interface Specification, white paper, 1994. 55. GridFTP: Universal Data Transfer for the Grid, white paper, http://www.globus.org/datagrid/deliverables/C2WPdraft3.pdf


BIOGRAPHICAL SKETCH

Mr. Hakki Sarp Oral received the B.S. degree in Electrical and Electronics Engineering from Istanbul Technical University, Turkey, and the M.S. degree in Electrical and Electronics Engineering from Cukurova University, Turkey. He is presently a Graduate Assistant in the Department of Electrical and Computer Engineering at the University of Florida and a group leader in the High-performance Computing and Simulation (HCS) Research Laboratory.





This item has the following downloads:


Full Text











PERFORMANCE MODELING AND ANALYSIS OF
MULTICAST INFRASTRUCTURE FOR
HIGH-SPEED CLUSTER AND GRID NETWORKS














By

HAKKI SARP ORAL


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


2003
















ACKNOWLEDGMENTS

I wish to thank the members of the High-performance Computation and Simulation

Lab for their help and technical support, especially Ian Troxel for spending countless

hours with me and Dr. Alan D. George for his patience and guidance. I also wish to

thank everyone who supported me with their encouragement; without your support I

would not have been able to complete this research and dissertation.




















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 MULTICAST PERFORMANCE ANALYSIS AND MODELING FOR HIGH-SPEED UNIDIRECTIONAL TORUS NETWORKS
    Scalable Coherent Interface
    Related Research
    Selected Multicast Algorithms
        Separate Addressing
        The U-torus Algorithm
        The S-torus Algorithm
        The M-torus Algorithm
    Case Study
        Description
        Multicast Completion Latency
        User-level CPU Utilization
        Multicast Tree Creation Latency
        Link Concentration and Concurrency
    Multicast Latency Modeling
        The Separate Addressing Model
        The Md-torus Model
        The Mu-torus Model
        The U-torus Model
        The S-torus Model
    Analytical Projections
    Summary

3 MULTICAST PERFORMANCE ANALYSIS AND MODELING FOR HIGH-SPEED INDIRECT NETWORKS WITH NIC-BASED PROCESSORS
    Myrinet
    Related Research
    The Host Processor vs. NIC Processor Multicasting
        The Host-based Multicast Communication
        The NIC-based Multicast Communication
        The NIC-assisted Multicast Communication
    Case Study
        Description
        Multicast Completion Latency
        User-level CPU Utilization
        Multicast Tree Creation Latency
        Link Concentration and Link Concurrency
    Multicast Latency Modeling
        The Host-based Latency Model
        The NIC-based Latency Model
        The NIC-assisted Latency Model
    Analytical Projections
    Summary

4 MULTICAST PERFORMANCE COMPARISON OF SCALABLE COHERENT INTERFACE AND MYRINET
    Multicast Completion Latency
    User-level CPU Utilization
    Link Concentration and Link Concurrency
    Summary

5 LOW-LATENCY MULTICAST FRAMEWORK FOR GRID-CONNECTED CLUSTERS
    Related Research
    Framework
    Simulation Environment
    Case Studies
        Case Study 1: Latency-sensitive Distributed Parallel Computing
        Case Study 2: Large-file Data and Replica Staging
    Summary

6 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


















LIST OF TABLES

2-1 Calculated t_process and L_i values

3-1 Pseudocode for NIC-assisted and NIC-based communication schemes

3-2 Measured latency model parameters

3-3 Calculated t_process and L_i values



















LIST OF FIGURES


2-1 Architectural block diagram

2-2 Unidirectional SCI ringlet

2-3 Unidirectional 2D 3-ary SCI torus

2-4 Selected multicast algorithms for torus networks

2-5 Completion latency vs. group size

2-6 User-level CPU utilization vs. group size

2-7 Multicast tree-creation latency vs. group size

2-8 Communication balance vs. group size

2-9 Sample multicast scenario for a given binomial tree

2-10 Small-message latency model parameters

2-11 Measured and calculated model parameters

2-12 Simplified model vs. actual measurements

2-13 Simplified model vs. actual measurements

2-14 Simplified model vs. actual measurements

2-15 Simplified model vs. actual measurements

2-16 Simplified model vs. actual measurements

2-17 Small-message latency projections

3-1 Architectural block diagram

3-2 Possible binomial tree

3-3 Myricom's three-layered GM

3-4 GM MCP state machine overview

3-5 Multicast completion latencies

3-6 User-level CPU utilizations

3-7 Multicast tree creation latencies

3-8 Communication balance vs. group size

3-9 Simplified model vs. actual measurements

3-10 Simplified model vs. actual measurements

3-11 Simplified model vs. actual measurements

3-12 Projected small-message completion latency

3-13 Projected small-message completion latency

4-1 Multicast completion latencies

4-2 User-level CPU utilizations

4-3 Communication balance vs. group size

5-1 Layer 3 PIM-SM/MBGP/MSDP multicast architecture

5-2 Sample group versus channel membership multicast

5-3 Sample multicast scenario for Grid-connected clusters

5-4 Step by step illustration of the top-level

5-5 Sample illustration of group-membership

5-6 Possible gateway placement

5-7 Illustration of low-latency retransmission system

5-8 Screenshot of the MLD simulation tool

5-9 Latency-sensitive distributed parallel job initiation

5-10 IPC multicast communication scenario

5-11 Snapshot of simulation model for latency-sensitive

5-12 SAN-to-SAN multicast completion latencies

5-13 SAN-to-SAN multicast completion latencies

5-14 Mixed mode (local, remote) IPC

5-15 Top-level tree formation

5-16 Multicast communication

5-17 Evaluated network topology scenarios

5-18 Snapshot of simulative model

5-19 Multicast completion latencies

5-20 Comparative WAN backbone impacts

5-21 Head-to-head comparison

5-22 Interconnect utilizations for local clusters
















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

PERFORMANCE MODELING AND ANALYSIS OF
MULTICAST INFRASTRUCTURE FOR
HIGH-SPEED CLUSTER AND GRID NETWORKS

By

Hakki Sarp Oral

December 2003

Chair: Alan D. George
Major Department: Electrical and Computer Engineering

Cluster computing provides performance similar to that of supercomputers in a

cost-effective way. High-performance, specialized interconnects increase the

performance of clusters. Grid computing merges clusters and other geographically

distributed resources in a user-transparent way. Collective communications, such as

multicasting, increase the efficiency of cluster and Grid computing. Unfortunately,

commercial high-performance cluster interconnects do not inherently support

multicasting. Moreover, Grid multicasting schemes only support limited high-level

applications, and do not target latency-sensitive applications. This dissertation addresses

these problems to achieve efficient software-based multicast schemes for clusters. These

schemes are also used for defining a universal multicast infrastructure for

latency-sensitive Grid applications.

The first phase of research focuses on analyzing the multicast problem on

high-performance, direct, torus networks for clusters. Key contributions are experimental










evaluation, latency modeling, and analytical projections of various multicast algorithms

from literature for torus networks. Results show that for systems with small messages

and group sizes, simple algorithms perform best. For systems with large messages and

group sizes, more complex algorithms perform better because they can efficiently

partition the network into smaller segments.

High-performance clusters can also be built using indirect networks with

NIC-based processors. The second phase introduces multicast performance analysis and

modeling for such networks. Main contributions are experimental evaluation, latency

modeling, and analytical projections for indirect networks with NIC-based processors

with different levels of host-NIC work-sharing for communication events. Results show

that for small message and system sizes, a fast host CPU outperforms other

configurations. However, with increasing message and system sizes, the host CPU is

overloaded with communication events. Under these circumstances, work offloading

increases system performance. Experimental and comparative multicast performance

analyses for direct torus networks and indirect networks with NIC-based processors for

clusters are also introduced.

The third phase extends key results from the previous two phases to conceptualize a

latency-sensitive universal multicast framework for Grid-connected clusters. The

framework supports both clusters connected with high-performance specialized

interconnects and ubiquitous IP-based networks. Moreover, the framework complies

with the Globus infrastructure. Results show that lower WAN-backbone impact is

achieved and multicast completion latencies are not affected by increased retransmission

rates or the number of hops.















CHAPTER 1
INTRODUCTION

Because of advancements in VLSI technology, it has become possible to fit more

transistors, gates, and circuits in the same die area, operating at much higher speeds.

These enhancements have allowed the performance of microprocessors to increase

steadily. Following this trend, commercial off-the-shelf (COTS) PCs or workstations

built around mass-produced, inexpensive CPUs have become the basic building blocks

for parallel computers instead of expensive and special-built supercomputers.

Mass-produced, fast, commodity network cards and switches have allowed tightly

integrated clusters of these workstations to fill the gap between desktop systems and

supercomputers. Although in terms of computing power and cost, this approach is

accepted as the optimal solution for small-scale organizations, it is not cost-effective for

large-scale organizations. Large-scale organizations incorporate geographically

distributed locations, and the cost of maintaining and operating a separate

parallel-computing cluster in each location impedes the benefit of cluster-based parallel

computing. Therefore, the question is still open: What is the optimal way to

orchestrate geographically distributed resources, devices, clusters, and machines in a

computationally effective and cost-effective manner simultaneously? Distributed

computing has emerged as a partial solution to this problem.

Distributed computing orchestrates geographically distributed resources, machines,

and clusters for solving computational problems in a cost-effective way. This approach is

cost-oriented and does not meet the computational needs of large-scale problems. On the









other hand, Grid computing focuses on the computational side of the quest, and is

therefore problem-driven [1]. As computing technology advances, researchers work on

bigger real-life scientific problems, and to solve these very large-scale problems they design

more complex experiments and simulations than ever before. Data sets obtained

from these simulations and experiments also steadily grow larger in scale and size. For

example, High Energy and Nuclear Physics (HENP) tackles the problem of analyzing

collisions of high-energy particles which provides invaluable insights into these

fundamental particles and their interactions. Thus, HENP is expected to provide better

insight into the understanding of the unification of forces, the origin and stability of

matter, and structures and symmetries that govern the nature of matter and space-time in

our universe. The HENP experiments currently produce data sets in the range of

PetaBytes (10^15), and it is estimated that by the end of this decade the data sets will reach

ExaBytes (10^18) in size [2].

Another example is Very-Long-Baseline Interferometry (VLBI) [3]. The VLBI is a

technique used by astronomers for over three decades for studying objects in the universe

at ultra-high resolutions and measuring earth motions with high precision. The VLBI

provides much better resolutions than the optical telescopes. Currently VLBI readings

produce gigantic data sets on the order of GigaBytes (10^9) per second continuously.

To solve these and similar scientific problems, an unprecedented degree of

scientific collaboration on a global scale is needed. To cope with these enormous

problem and data-set complexities and to provide a global level of scientific

collaboration, Grid computing has been proposed. Grid computing brings together

parallel and supercomputers, databases, scientific instruments, and display devices









located at geographically distributed sites in a user-transparent way. According to Foster

et al. [1; pp. 1-2]

Grid computing has emerged as an important new field, distinguished from
conventional distributed computing by its focus on large-scale resource sharing,
innovative applications, and, in some cases, high-performance orientation. ... The
real and specific problem that underlies the Grid concept is coordinated resource
sharing and problem solving in dynamic, multi-institutional virtual organizations.
The sharing that we are concerned with is not primarily file exchange but rather
direct access to computers, software, data, and other resources, as is required by a
range of collaborative problem-solving and resource brokering strategies emerging
in industry, science, and engineering. This sharing is, necessarily, highly
controlled, with resource providers and consumers defining clearly and carefully
just what is shared, who is allowed to share, and the conditions under which
sharing occurs. A set of individuals and/or institutions defined by such sharing
rules form what we call a virtual organization (VO).

The two key concepts in bringing together these resources and VOs are high-speed

networks, and light-weight high-performance middleware and communication services.

The need for high-performance collaboration and computing demands high-performance

connectivity on the global scale between these resources and VOs. Projects like iVDgL

[4], GriPhyN [5], DataTAG [6], and StarLight [7] are among numerous initiatives that

involve research to build or exploit global-scale high-performance interconnects and

middleware and communication services for Grids. These high-performance

interconnects are designed and implemented in a hierarchical fashion, from high-speed

global backbones that connect organizations to small proximity networks that connect the

resources and clustered computational nodes. Conventional and relatively cheap

interconnects such as Fast Ethernet or Gigabit Ethernet, or proprietary and relatively

expensive high-performance System Area Networks (SANs) are used to connect these

clustered nodes. A SAN is a low-latency, high-throughput interconnection network that

uses reliable links to connect clustered computing nodes over short physical distances for

high-performance connectivity.









The success of Grids depends on the performance of the physical interconnect, and

also on the efficiency and performance of middleware and communication services.

Interprocess communication is the basis of most, if not all, communication services.

Interprocess communication primitives can be classified as point-to-point (unicast),

involving a single source and destination node, or collective, involving more than two

processes. Interprocess communication is often handled by collective communication

operations in parallel or distributed applications. These primitives play a major role in

Grid-level applications by making them more portable among different platforms. Using

collective communication simplifies and increases the functionality and efficiency of

parallel tasks. As a result, efficient support of collective communication is important in

the design of high-performance parallel, distributed, and Grid computing systems.

Multicast communication is an important primitive among collective

communication operations. Multicast communication (the one-to-many delivery of data)

is concerned with sending a single message from a source node to a set of destination

nodes. Special cases of multicast include unicast, in which the source node must transmit

a message to a single destination, and broadcast, in which the destination node set

includes all defined network nodes. Multicast communication has been subject to

extensive research in both Layer 3 (IP-based) and Layer 2 (MAC-based) levels.

Multicast is widely used for simultaneous and on-demand audio and video data

distribution to multiple destinations, and data and replica delivery over IP-based

networks. Multicast over Active Networks is another area of research that targets

increased efficiency and improved reliability [8]. There are well-defined and deployed

Layer 3 multicast protocols such as MBone [9] and MBGP/PIM-SM/MSDP [10-12] for










high-performance IP-based interconnects, although most of these are not widely

standardized and are still evolving. Layer 2 multicast is widely used as a basis for many

collective operations, such as barrier synchronization and global reduction, and for cache

invalidations in shared-memory multiprocessors [13]. Layer 2 multicast also functions as

a useful tool in parallel numerical procedures such as matrix multiplication and

transposition, Eigenvalue computation, and Gaussian elimination [14]. Moreover, this

type of communication is used in parallel search [15] and parallel graph algorithms [16].

Grids and distributed systems often form as the integration of multiple networks

with different performance characteristics and functionalities combined. These networks

can be listed as Wide Area Networks (WANs) as the backbones; Metropolitan Area

Networks (MANs) or Campus Area Networks (CANs) providing regional connectivity to

the backbones; Local Area Networks (LANs) providing connectivity to the regional

networks; and SANs connecting the computational nodes and clusters or devices to the

rest of the integrated networks hierarchy. These large-scale integrated systems support

and use various communication protocols (e.g., IP/SONET or IP/Ethernet for WANs,

IP/Ethernet for MANs, CANs, and LANs, and proprietary protocols for SANs).

Providing an efficient multicast communication service for such large-scale systems

imposes multiple challenges. For example, the data distribution patterns must be

optimized in order to minimize the utilization and impact on the Grid and distributed

system backbones as these are the longest links that the data has to traverse. Also, an

efficient multicast service must route the data over the fastest interconnect possible

towards the destination to obtain low latencies, for the cases where multiple

interconnects, such as LANs and SANs, both provide connectivity to the destination at









the same time. Therefore, it is beneficial for such a communication system to support

multiple communication protocols. Furthermore, for unsuccessful transmissions, data

must be re-transmitted from the closest upper-level parent node to the destinations, again

to obtain low-latency characteristics, and to minimize the impact on the backbones.

In this dissertation, the multicast problem is evaluated for Grid-connected IP-based

and SAN-based clusters. A framework for low-level multicast communication is

proposed that can be used as a service for high-level Grid or distributed applications. The

proposed framework targets high-performance and latency-sensitive applications. Also,

the proposed framework supports multiple protocols and different levels of interconnects

in a hierarchical way.

The problem is solved using a bottom-up approach. First, the Layer 2 multicast is

investigated for various communication and networking scenarios using experimental and

analytical modeling over various SANs. Results obtained from these Layer 2 studies are

then combined with the existing Layer 3 research available in literature, to build a

hierarchical multi-protocol and low-level multicast communication framework for Grids

and distributed systems.

Chapter 2 analyzes the multicast communication problem for the Scalable Coherent

Interface SAN [17]. Chapter 3 evaluates the multicast problem for the Myrinet SAN

[18]. A comparative performance evaluation of multicast on these two SANs is presented

in Chapter 4. Chapter 5 combines results obtained in Chapters 2 and 3 and then focuses

on building a universal, latency-sensitive multicast communication framework for

Grid-connected clusters. Conclusions are presented in Chapter 6.















CHAPTER 2
MULTICAST PERFORMANCE ANALYSIS AND MODELING
FOR HIGH-SPEED UNIDIRECTIONAL TORUS NETWORKS

Direct torus networks are widely used in high-performance parallel computing

systems. They are cost-effective, as the switching tasks are distributed among the hosts

instead of having centralized switching elements. Torus networks also provide good

scalability in terms of bandwidth. With each added host, the aggregate bandwidth of the

system also increases. Moreover, torus networks allow efficient routing algorithms to be

designed and implemented. The Scalable Coherent Interface (SCI) is a high-performance

interconnect that supports tori topologies. The SCI is a widely used SAN because of its

high unicast performance, but its multicast communication characteristics are still

unclear. This chapter focuses on evaluating the performance of various unicast-based

and path-based multicast protocols for high-speed torus networks. The tradeoffs in the

performance of the selected algorithms are experimentally evaluated using various

metrics, including multicast completion latency, tree creation latency, CPU load, link

concentration, and concurrency. Analytical models of the selected algorithms for short

messages are also presented. Experimental results are used to verify and calibrate the

analytical models. Analytical projections of the algorithms for larger unidirectional torus

networks are then produced.

Scalable Coherent Interface

The SCI initially aimed to be a very high-performance computer bus that would

support a significant degree of multiprocessing. However, because of the technical









limitations of "bus-oriented" architectures, the resulting ANSI/IEEE specification [19]

turned out to be a set of protocols that provide processors with a shared-memory view of

buses using direct point-to-point links. Based on the IEEE SCI standard, Dolphin's SCI

interconnect addresses both the high-performance computing and networking domains.

Emphasizing flexibility, scalability, and multi-gigabit-per-second data transfers,

SCI's main application area is as a SAN for high-performance computing clusters.

Recent SCI networks are capable of achieving low latencies (smaller than 2 µs) and

high throughputs (5.3 Gbps peak link throughput) over point-to-point links with

cut-through switching. Figure 2-1 is an architectural block diagram of Dolphin's

PCI-based SCI NIC. Using the unidirectional ringlets as a basic block, it is possible to

obtain a large variety of topologies, such as counter-rotating rings and unidirectional and

bi-directional tori. Figure 2-2 shows a sample unidirectional SCI ringlet. Figure 2-3

shows a 2D 3-ary SCI torus.

Unlike many other competing SANs, SCI also offers support for both the

shared-memory and message-passing paradigms. By exporting and importing memory

chunks, SCI provides a shared-memory programming architecture. All exported memory

chunks have a unique identifier, which is the collection of the exporting node's SCI node

ID, and the exporting application's Chunk ID and the Module ID. Imported memory

chunks are mapped into the importer application's virtual memory space. To exchange

messages between the nodes, the data must be copied to this imported memory segment.

The SCI NIC detects this transaction, and automatically converts the request to an SCI

network transaction. The PCI-to-SCI memory address mapping is handled by the SCI

protocol engine. The 32-bit PCI addresses are converted into 64-bit SCI addresses, in










which the most significant 16 bits are used to select between up to 64K distinct SCI

devices.
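To make this mapping concrete, the short Python sketch below packs a 16-bit SCI node ID into the upper 16 bits of a 64-bit address. Only the 16-bit device-selector field is taken from the description above; treating the remaining 48 bits as a node-local offset is an assumption made purely for illustration, and the code is not part of Dolphin's software.

def to_sci_address(node_id, local_offset):
    # Build a 64-bit SCI address: the most significant 16 bits select one of
    # up to 64K devices (per the text above); using the remaining 48 bits as
    # a node-local offset is an assumption made for illustration only.
    assert 0 <= node_id < (1 << 16)
    assert 0 <= local_offset < (1 << 48)
    return (node_id << 48) | local_offset

def split_sci_address(sci_addr):
    # Inverse mapping: recover (node_id, local_offset).
    return sci_addr >> 48, sci_addr & ((1 << 48) - 1)

# Example: an offset inside node 5's exported memory chunk.
addr = to_sci_address(node_id=5, local_offset=0x12345678)
assert split_sci_address(addr) == (5, 0x12345678)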






Figure 2-1. Architectural block diagram of Dolphin's PCI-based SCI NIC.

Each SCI transaction typically consists of two sub-transactions: a request and a

response. For the request sub-transaction, a read or write request packet is sent by the

requesting node to the destination node. The destination node sends an echo packet to the

requesting node upon receiving the request packet. Concurrently, the recipient node

processes the request, and sends its own response packet to the requesting node. The










requesting node will acknowledge and commit the transaction by sending an echo packet

back to the recipient node upon receiving the response packet.
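The ordering of these sub-transactions can be summarized with a minimal Python sketch. It lists one possible ordering of the four packets of a read or write transaction; in practice the echo for the request and the response are issued concurrently, and the sketch does not model the SCI protocol engine itself.

from dataclasses import dataclass

@dataclass
class Packet:
    kind: str   # "request", "echo", or "response"
    src: str
    dst: str

def sci_transaction(requester, responder):
    # One possible ordering of the packets in a read or write transaction;
    # in reality the first echo and the response are issued concurrently.
    yield Packet("request", requester, responder)    # request sub-transaction
    yield Packet("echo", responder, requester)       # echo acknowledging the request
    yield Packet("response", responder, requester)   # response sub-transaction
    yield Packet("echo", requester, responder)       # echo committing the transaction

for p in sci_transaction("requesting node", "destination node"):
    print(p)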















Figure 2-2. Unidirectional SCI ringlet.
Figure 2-3. Unidirectional 2D 3-ary SCI torus.

Related Research

Research for multicast communication in the literature can be briefly categorized

into two groups: unicast-based and multi-destination-based [15]. Among the

unicast-based multicasting methods, separate addressing is the simplest one, in which the

source node iteratively sends the message to each destination node one after another as

separate unicast transmissions [20]. Another approach for unicast-based multicasting is

to use a multi-phase communication configuration for delivering the message to the

destination nodes. In this method, the destination nodes are organized in some sort of

binomial tree, and at each communication step the number of nodes covered increases by

a factor of n, where n denotes the fan-out factor of the binomial tree. The U-torus

multicast algorithm proposed by Robinson et al. [20] is a slightly modified version of this

binomial-tree approach for direct torus networks that use wormhole routing.

Lin and Ni [21] were the first to introduce and investigate the path-based

multicasting approach. Subsequently, path-based multicast communication has received









attention and has been studied for direct networks [14, 20, 22]. Regarding path-based

studies, this dissertation concentrates on the work of Robinson et al. [20, 22] in which

they have defined the U-torus, S-torus, Md-torus, and Mu-torus algorithms. These algorithms

were proposed as a solution to the multicast communication problem for generic,

wormhole-routed, direct unidirectional and bi-directional torus networks. More details

about path-based multicast algorithms for wormhole-routed networks can be found in the

survey of Li and McKinley [23]. Tree-based multicasting also received attention [23,

24]; and these studies focused on solving the deadlock problem for indirect networks.

The SCI unicast performance analysis and modeling has been discussed in the

literature [24-26, 28], while collective communication on SCI has received little attention

and its multicast communication characteristics are still unclear. Limited studies on this

avenue have used collective communication primitives for assessing the scalability of

various SCI topologies from an analytical point of view [29, 30], while no known study

has yet investigated the multicast performance of SCI.

Selected Multicast Algorithms

The algorithms analyzed in this study were defined in the literature [20, 22]. This

section simply provides an overview of how they work and briefly points out their

differences. Bound by the limits of available hardware, two unicast-based and three

path-based multicast algorithms were selected, thereby keeping an acceptable degree of

variety among different classes of multicast routing algorithms. In this dissertation, the

aggregate collection of all destination nodes and the source node is called the multicast

group. Therefore, for a given group with size d, there are d-1 destination nodes. Figure

2-4 shows how each algorithm operates for a group size of 10. The root node and the









destination nodes are clearly marked and the message transfers are indicated. Alphabetic

labels next to each arrow indicate the individual paths, and the numerical labels represent

the logical communication steps on each path.

Separate Addressing

Separate addressing is the simplest unicast-based algorithm in terms of algorithmic

complexity. For small group sizes and short messages, separate addressing can be an

efficient approach. However, for large messages and large group sizes, the iterative

unicast transmissions may result in large host-processor overhead. Another drawback of

this protocol is linearly increasing multicast completion latencies with increasing group

sizes. Figure 2-4A shows separate addressing for a given multicast problem.

The U-torus Algorithm

The U-torus [20] is another unicast-based multicast algorithm that uses a

binomial-tree approach to reduce the total number of required communication steps. For

a given group of size d, the lower bound on the number of steps required to complete the

multicast by U-torus will be ⌈log2(d)⌉. This reduction is achieved by increasing the

number of covered destination nodes by a factor of 2 in each communication step. Figure

2-4B shows a typical U-torus multicast scenario.

Applying U-torus to this group starts with dimension ordering of all the nodes,

including the root, based on their physical placement in the torus network given in a

(column, row) format. The dimension-ordered node set is then rotated around to place

the root node at the beginning of the ordered list as given below:




O' = {(2,2), (2,4), (3,1), (3,3), (4,1), (4,2), (4,4), (1,1), (1,2), (1,3)}

















Figure 2-4. Selected multicast algorithms for torus networks. Multicast group size is 10.
Individual message paths are marked alphabetically, and the numerical
labels represent the logical communication steps for each message path. A)
The separate addressing algorithm. B) The U-torus algorithm. C) The
S-torus algorithm. D) The Md-torus algorithm. E) The Mu-torus algorithm.











where O is the dimension-ordered group and O' denotes the rotated version of O. The

order in O' also defines the final ranking of the nodes, as they are sequentially ranked

starting from the leftmost node. As an example, for 0' given above, node (2,2) has a

ranking of 0, node (2,4) has a ranking of 1, and the node (1,3) has a ranking of 9.

After obtaining O', the root node sends the message to the center node of O' to

partition the multicast problem of size d into two subsets of size ⌈d/2⌉ and ⌊d/2⌋. The

center node is calculated by Eq. 2-1 as described by Robinson et al. [20], where left

denotes the ranking of the leftmost node, and right denotes the ranking of the rightmost

node.


center = ⌈(right − left + 1) / 2⌉  (2-1)


For the group given above, the left is rank 0 and the right is 9, therefore the center

is 5, which implies the node (4,2). The root node transmits the multicast message, and

the new partition's subset information, D_subset, to the center node. Using the same

example, at the end of the first step the root node will have the subset D_subset_root = {(2,2),

(2,4), (3,1), (3,3), (4,1)} with the values of left and right being rank 0 and 4, respectively.

The node (4,2) will have the subset D_subset_(4,2) = {(4,2), (4,4), (1,1), (1,2), (1,3)} with the

values of left and right being again 0 and 4.

In the second step, the original root and the (4,2) node both act as root nodes,

partitioning their respective subsets in two and sending the multicast message to their

subset's center node, along with the new partition's D_subset information. This process

continues recursively, until all destination nodes have received the message.
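The recursive partitioning described above can be restated compactly in code. The following Python sketch is an illustrative re-statement of the U-torus steps (not the implementation used in this study); it reproduces the dimension ordering, rotation, and center-based splitting of Eq. 2-1 for the example group, with node coordinates in (column, row) form.

from math import ceil

def dimension_order(group, root):
    # Dimension-order the (column, row) coordinates, then rotate the ordered
    # list so that the root comes first (the set O' in the text).
    ordered = sorted(group)
    i = ordered.index(root)
    return ordered[i:] + ordered[:i]

def u_torus_sends(subset):
    # Recursively yield (sender, receiver) pairs: the first node of each
    # subset sends to the subset's center node (Eq. 2-1), and both halves
    # recurse until every destination is covered.
    if len(subset) < 2:
        return
    left, right = 0, len(subset) - 1
    center = ceil((right - left + 1) / 2)        # Eq. 2-1
    yield subset[0], subset[center]
    yield from u_torus_sends(subset[:center])    # half kept by the sender
    yield from u_torus_sends(subset[center:])    # half handed to the center node

group = [(2, 2), (2, 4), (3, 1), (3, 3), (4, 1),
         (4, 2), (4, 4), (1, 1), (1, 2), (1, 3)]
for sender, receiver in u_torus_sends(dimension_order(group, root=(2, 2))):
    print(sender, "->", receiver)

Running the sketch reproduces the first-step transfer from (2,2) to (4,2) and the two subsets listed above.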









The S-torus Algorithm

The S-torus, a path-based multicast routing algorithm, was defined by Robinson et

al. [22] for wormhole-routed torus networks. It is a single-phase communication

algorithm. The destination nodes are ranked and ordered to form a Hamiltonian cycle.

A Hamiltonian cycle is a closed circuit that starts and ends at the source node,

where every other node is listed only once. For any given network, more than one

Hamiltonian cycle may exist. The Hamiltonian cycle that S-torus uses is based on a

ranking order of nodes, which is calculated with the formula given in Eq. 2-2 for a k-ary

2D torus.


l(u) = [(σ0(u) + σ1(u)) mod k] + k × σ0(u)  (2-2)

Here, l(u) represents the Hamiltonian ranking of a node u, with the coordinates

given as (σ0(u), σ1(u)). More detailed information about Hamiltonian node rankings

can be found in [9]. Following this step, the ordered Hamiltonian cycle O is rotated

around to place the root at the beginning. This new set is named as O'.

The root node then issues a multi-destination worm which visits each destination

node one after another following the O' ordered set. At each destination node, the header

is truncated to remove the visited destination address and the worm is re-routed to the

next destination. The algorithm continues until the last destination node receives the

message. Robinson et al. also proved that S-torus routing is deadlock-free [9].

Figure 2-4C shows the S-torus algorithm for the same example presented

previously, for a torus network without wormhole routing. The Hamiltonian rankings are

noted in the superscript labels of each node, where O and O' are obtained as:










O = {4(1,3), 6(1,1), 7(1,2), 8(2,2), 10(2,4), 12(3,1), 14(3,3), 16(4,4), 17(4,1), 18(4,2)}

O' = {8(2,2), 10(2,4), 12(3,1), 14(3,3), 16(4,4), 17(4,1), 18(4,2), 4(1,3), 6(1,1), 7(1,2)}
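As an illustration, the following Python sketch computes the same Hamiltonian ordering. The ranking function follows Eq. 2-2 as reconstructed above and reproduces the superscript rankings listed for the example; it is a sketch for clarity, not the code used in the case study.

def hamiltonian_rank(u, k=4):
    # Hamiltonian ranking of a node u = (c0, c1) in a k-ary 2D torus,
    # following the reconstructed Eq. 2-2 (e.g., (1, 3) -> 4, (2, 2) -> 8).
    c0, c1 = u
    return (c0 + c1) % k + k * c0

def s_torus_path(group, root, k=4):
    # Order the multicast group along the Hamiltonian cycle and rotate it
    # so the multi-destination worm starts at the root (the set O' above).
    cycle = sorted(group, key=lambda u: hamiltonian_rank(u, k))
    i = cycle.index(root)
    return cycle[i:] + cycle[:i]

group = [(2, 2), (2, 4), (3, 1), (3, 3), (4, 1),
         (4, 2), (4, 4), (1, 1), (1, 2), (1, 3)]
print(s_torus_path(group, root=(2, 2)))
# [(2, 2), (2, 4), (3, 1), (3, 3), (4, 4), (4, 1), (4, 2), (1, 3), (1, 1), (1, 2)]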

The M-torus Algorithm

Despite its simplicity, single-phase communication is known for large latency

variations for a large set of destination nodes [31]. Therefore, to further improve the

S-torus algorithm, Robinson et al. proposed the multi-phase multicast routing algorithm:

M-torus [22]. The idea was to shorten the path lengths of the multi-destination worms to

stabilize the latency variations and to achieve better performance by partitioning the

multicast group. They introduced two variations of the M-torus algorithm, Md-torus and

Mu-torus. The Md-torus algorithm uses a dimensional partitioning method, whereas

Mu-torus uses a uniform partitioning mechanism. In both of these algorithms, the root

node separately transmits the message to each partition and the message is then further

relayed inside the subsets using multi-destination worms. The Md-torus algorithm

partitions the nodes based on their respective sub-torus dimensions, therefore eliminating

costly dimension-switching overhead. For example, in a 3D torus, the algorithm will first

partition the group into subsets of 2D planes of the network, and then into ringlets for

each plane.

For a k-ary N-dimensional torus network, where k^N is the total number of nodes, the

Md-torus algorithm needs N steps to complete the multicast operation. By contrast, the

Mu-torus algorithm tries to minimize and equalize the path length of each worm by

applying a uniform partitioning. Mu-torus is parameterized by the partitioning size,

denoted by r. For a group size of d, the Mu-torus algorithm with a partitioning size of r

requires ⌈log_r(d)⌉ steps to complete the multicast operation. For the same example










presented previously, Figure 2-4D and Figure 2-4E show Md-torus and Mu-torus,

respectively, again assuming a network without wormhole routing where r=4.
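The two partitioning strategies can be sketched as follows. The Python fragment below is only illustrative: grouping by the first coordinate is an assumed choice of ringlet dimension for Md-torus, and the ordering, root handling, and header selection are simplified relative to the full algorithms.

from math import ceil, log

def md_torus_partitions(group):
    # Dimensional partitioning (Md-torus): on a 2D torus the group is split
    # into the ringlets it occupies; grouping by the first coordinate is an
    # illustrative choice of dimension.
    rings = {}
    for node in group:
        rings.setdefault(node[0], []).append(node)
    return list(rings.values())

def mu_torus_partitions(group, r=4):
    # Uniform partitioning (Mu-torus): split the ordered group into chains
    # of at most r nodes; the root then serves one header node per chain.
    ordered = sorted(group)
    return [ordered[i:i + r] for i in range(0, len(ordered), r)]

group = [(2, 2), (2, 4), (3, 1), (3, 3), (4, 1),
         (4, 2), (4, 4), (1, 1), (1, 2), (1, 3)]
print(md_torus_partitions(group))       # one chain per column ringlet
print(mu_torus_partitions(group, r=4))  # chains of at most 4 nodes
print(ceil(log(len(group), 4)))         # Mu-torus step count bound for r = 4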

Case Study

To comparatively evaluate the performance of the selected algorithms, an

experimental case study is conducted over a high-performance unidirectional SCI torus

network. The following subsections explain experiment details and the results obtained.

Description

There are 16 nodes in the case study testbed. Each node is configured with dual

1GHz Intel Pentium-III processors and 256MB of PC133 SDRAM. Each node also

features a Dolphin SCI NIC (PCI-64/66/D330) with 5.3 Gb/s link speed using Scali's SSP

(Scali Software Platform) 3.0.1, Redhat Linux 7.2 with kernel version 2.4.7-10smp, mtrr

patched, and write-combining enabled. The nodes are interconnected to form a 4x4

unidirectional torus.

For all of the selected algorithms, the polling notification method is used to lower

the latencies. Although this method is known to be effective for achieving low latencies,

it results in higher CPU loads, especially if the polling process runs for extended periods.

To further decrease the completion latencies, the multicast-tree creation is removed from

the critical path and performed at the beginning of each algorithm in every node.

Throughout the case study, modified versions of the three path-based algorithms,

S-torus, Md-torus, and Mu-torus are used. These algorithms were originally designed to

use multi-destination worms. However, as with most high-speed interconnects available

on the market today, our testbed does not support multi-destination worms. Therefore,

store-and-forward versions of these algorithms are developed.









On our 4-ary 2D torus testbed, Md-torus partitions the torus network into simple

4-node rings. For a fair comparison between the Md-torus and the Mu-torus algorithms,

the partition length r of 4 is chosen for Mu-torus. Also, the partition information for

U-torus is embedded in the relayed multicast message at each step. Although separate

addressing exhibits no algorithmic concurrency, it is possible to provide some degree of

concurrency by simply allowing multiple message transfers to occur in a pipelined

structure. This method is used for our separate addressing algorithm.

Case study experiments with the five algorithms are performed for various group

sizes and for small and large message sizes. Each algorithm is evaluated for each

message and group size 100 times, where each execution has 50 repetitions. The

variance was found to be very small and the averages of all executions are used in this

study. Four different sets of experiments are performed to analyze the various aspects of

each algorithm, which are explained in detail in the following subsections.

Multicast Completion Latency

Two different sets of experiments for multicast completion latency are performed,

one for a message size of 2B and the other for a message size of 64KB. Figure 2-5 shows

the multicast completion latency versus group size for small and large messages.

The S-torus algorithm has the worst performance for both small and large messages.

Moreover, S-torus shows a linear increase in multicast completion latency with respect to

the increasing group size, as it exhibits no parallelism in message transfers. By contrast,

the separate addressing algorithm has a higher level of concurrency because of its design

and performs best for small messages. However, it also presents linearly increasing

completion latencies for large messages with increasing group size.








Figure 2-5. Completion latency vs. group size. A) Small messages. B) Large messages.

The Md-torus and Mu-torus algorithms exhibit similar levels of performance for

both small and large messages. The difference between these two becomes more

distinctive at certain data points, such as 10 and 14 nodes, for large messages. For group

sizes of 10 and 14, the partition length for Mu-torus does not provide perfectly balanced

partitions, resulting in higher multicast completion latencies. Finally, U-torus has nearly










flat latency for small messages. For large messages, it exhibits similar behavior to

Mu-torus. Overall, separate addressing appears to be the best for small messages and

groups, while for large messages and groups Md-torus performs better compared to other

algorithms.

User-level CPU Utilization

User-level host processor load is measured using Linux's built-in sar utility.

Figure 2-6 shows the maximum CPU utilization for the root node of each algorithm for

small and large messages.

It is observed that S-torus exhibits constant CPU load for the small message size

independent of the group size. However, for large messages, as the group size increases

the completion latency also linearly increases as shown in Figure 2-5B, and the extra

polling involved results in higher CPU utilization for the root node. This effect is clearly

seen in Figure 2-6B.

In the separate addressing algorithm, the root node iteratively performs all message

transfers to the destination nodes. As expected, this behavior causes a nearly linear

increase in CPU load with increasing group size, which can be observed in Figure 2-6B.

By contrast, since the number of message transmissions for the root node stays

constant, Md-torus provides a nearly constant CPU overhead for small messages for every

group size. For large messages and small group sizes, Md-torus performs similarly.

However, for group sizes greater than 10, the CPU utilization tends to increase because of

the variations in the path lengths causing extended polling durations. Although these

variations are the same for both small and the large messages, the effect is more visible

for the large message size.








































































Figure 2-6. User-level CPU utilization vs. group size. A) Small messages. B) Large
messages.


The Mu-torus algorithm exhibits behavior identical to Md-torus for small messages.


Moreover, for large messages, Mu-torus also provides higher but constant CPU


utilization. For U-torus, the number of communication steps required to cover all


destination nodes is given in previous sections. It is observed that at certain group sizes,












such as 4, 8, and 16, the number of these steps increases, therefore the CPU load also

increases. This behavior of U-torus can be clearly seen in Figure 2-6.

Multicast Tree Creation Latency

Multicast tree creation latency of the SCI API is also an important metric since, for

small message sizes, this factor might impede the overall communication performance.

The multicast tree creation latency is independent of the message size. Figure 2-7 shows

multicast tree creation latency versus group size.








Figure 2-7. Multicast tree-creation latency vs. group size.

Figure 2-7 shows the multicast tree creation latencies for the four algorithms that

use a tree-like group formation for message delivery. The Mu-torus and Md-torus

algorithms only differ in their partitioning methods as described before and both methods

are quite complex compared to the other algorithms. This complexity is seen in Figure

2-7 as they exhibit the highest multicast tree-creation latencies.

The U-torus algorithm has a simple and distributed partitioning process and,

compared to the two M-torus algorithms, it has lower tree-creation latency. Unlike the










other tree-based algorithms, S-torus does not perform any partitioning and it only orders

the destination nodes as described previously. Therefore, S-torus exhibits the lowest

tree-creation latency, increasing only very slowly and linearly, because of the simplicity of its tree

formation.

Link Concentration and Concurrency

Link concentration is defined here as the ratio of two components: number of link

visits and number of used links. Link visits is defined as the cumulative number of links

used during the entire communication process, while used links is defined as the number

of individual links used. Link concurrency is the maximum number of messages that are

in transit in the network at any given time. Link concentration and link concurrency are

given in Figure 2-8. Link concentration combined with the link concurrency illustrates

the degree of communication balance. The concentration and concurrency values

presented in Figure 2-8 are obtained by analyzing the theoretical communication

structures and the experimental timings of the algorithms.
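For reference, the two metrics can be computed directly from a list of per-message link traversals, as in the short Python sketch below; the transfer records shown are invented for illustration and do not correspond to measured timings.

def link_concentration(transfers):
    # Number of link visits divided by the number of distinct links used.
    visits = len(transfers)
    used = len({link for link, _, _ in transfers})
    return visits / used

def link_concurrency(transfers):
    # Maximum number of message transfers in flight at any instant.
    events = []
    for _, start, end in transfers:
        events += [(start, +1), (end, -1)]
    active = peak = 0
    for _, delta in sorted(events):
        active += delta
        peak = max(peak, active)
    return peak

# (link, start time, end time) records, invented for illustration.
transfers = [("A->B", 0.0, 1.0), ("B->C", 1.0, 2.0),
             ("A->D", 0.5, 1.5), ("A->B", 2.0, 3.0)]
print(link_concentration(transfers), link_concurrency(transfers))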

The S-torus algorithm is a simple chained communication and there is only one

active message transfer in the network at any given time. Therefore, S-torus has the

lowest and a constant link concentration and concurrency compared to other algorithms.

By contrast, because of the high parallelism provided by the recursive doubling approach,

the U-torus algorithm has the highest concurrency. Separate addressing exhibits an

identical degree of concurrency to the U-torus, because of the multiple message transfers

overlapping at the same time because of the network pipelining. The Md-torus algorithm

has inversely proportional link concentration versus increasing group size. In Md-torus,

the root node first sends the message to the destination header nodes, and they relay it to

their child nodes. As the number of dimensional header nodes is constant (k in a k-ary











































































Figure 2-8. Communication balance vs. group size. A) Link concentration. B) Link
concurrency.


torus), with the increasing group size each new child node added to the group will


increase the number of available links. Moreover, because of the communication


structure of the Md-torus, the number of used links increases much more rapidly


compared to the number of link visits with the increasing group size. This trend


asymptotically limits the decreasing link concentration to 1. The concurrency of












Md-torus is upper bounded by k as each dimensional header relays the message over

separate ringlets with k nodes in each.

The Mu-torus algorithm has low link concentration for all group sizes, as it

multicasts the message to the partitioned destination nodes over a limited number of

individual paths as shown in Figure 2-4D, where only a single link is used per path at a

time. By contrast, for a given partition length of constant size, an increase in the group

size results in an increase in the number of partitions and an increase in the number of

individual paths. This trait results in more messages being transferred concurrently at

any given time over the entire network.

Multicast Latency Modeling

The experiments throughout SCI case study have investigated the performance of

multicast algorithms over a 2D torus network having a maximum of 16 nodes. However,

ultimately modeling will be a key tool in predicting the relative performance of these

algorithms for system sizes that far exceed our current testbed capabilities. Also, by

modifying the model, future systems with improved characteristics could be evaluated

quickly and accurately. The model presented in the next subsections assumes an equal

number of nodes in each dimension for a given N-dimensional torus network. The

presented small-message latency model follows the LogP model [32].

The LogP is a general-purpose model for distributed-memory machines with

asynchronous unicast communication [32]. Under LogP, at most every g CPU cycles a

new communication operation can be issued, and a one-way small-message delivery to a

remote location is formulated as

t_communication = 2L + 2 × (o_sender + o_receiver)  (2-3)










where L is the upper bound on the latency for the delivery of a message from its

source processor to its target processor, and o_sender and o_receiver represent the sender and

receiver communication overheads, respectively.

Today's high-performance NICs and interconnection systems are fast enough that

any packet can be injected into the network as soon as the host processor produces it

without any further delays [33]. Therefore, g is negligible for modeling

high-performance interconnects. This observation yields the relaxed LogP model for

one-way unicast communication, given as

t_communication = o_sender + t_network + o_receiver  (2-4)

where t_network is the total time spent between the injection of the message into the network

and drainage of the message.

Following this approach, a model is proposed to capture the multicast

communication performance of high-performance torus networks on an otherwise

unloaded system. The model is based on the concept that each multicast message can be

expressed as sequences of serially forwarded unicast messages from root to destinations.

The proposed model is formulated as

t_multicast = Max[o_sender + t_network + o_receiver] over all paths  (2-5)

where the Max[ ] operation yields the total multicast latency for the deepest path over all

paths and t_multicast is the time interval between the initiation of the multicast and the last

destination's reception of the message. Figure 2-9 shows this concept on a sample

binomial-tree multicast scenario. For this scenario, t_multicast is determined over the deepest

path, which is the route from node 0 to node 7.
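Eq. 2-5 translates almost directly into code. The following Python sketch treats each root-to-destination path as an opaque network time and returns the maximum per Eqs. 2-4 and 2-5; the listed paths and timing values are placeholders rather than testbed measurements.

def t_unicast(o_sender, t_network, o_receiver):
    # Relaxed LogP one-way latency, Eq. 2-4.
    return o_sender + t_network + o_receiver

def t_multicast(path_network_times, o_sender, o_receiver):
    # Eq. 2-5: the multicast completes when the slowest root-to-destination
    # path completes.
    return max(t_unicast(o_sender, t, o_receiver) for t in path_network_times)

# One possible set of root-to-destination paths for an 8-node binomial tree
# like the one in Figure 2-9; the deepest path, ending at node 7, dominates.
# Network times are placeholders in microseconds.
paths = {"0->4": 1.0, "0->2->6": 2.5, "0->1->3->7": 4.2}
print(t_multicast(paths.values(), o_sender=2.0, o_receiver=2.0))   # 8.2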










Gonzalez et al. [27] modeled the unicast performance of direct SCI networks and

observed that t_network can be divided into smaller components as given by:

t_network = h_p × L_p + h_f × L_f + h_s × L_s + h_i × L_i  (2-6)

Here h_p, h_f, h_s, and h_i represent the total number of hops, forwarding nodes, switching

nodes, and intermediate nodes, respectively. Similarly, L_p, L_f, and L_s denote the

propagation delay per hop, forwarding delay through a node, and switching delay through


















Figure 2-9. Sample multicast scenario for a given binomial tree.

a node, respectively. L_i denotes the intermediate delay through a node, which is the

sum of the receiving overhead of the message, processing time, and the sending overhead

of the message. Figure 2-10A shows an arbitrary mapping of the problem given above in

Figure 2-9 to a 2D 4x4 torus network. Figure 2-10B shows a visual breakdown of

t_multicast over the same arbitrary mapping.

Following the method outlined by Gonzalez et al. [27] and using the obtained

experimental results from the case study presented in this dissertation, the model

parameters are measured and calculated for short message sizes. Assuming that the









electrical signals propagate through a copper conductor at approximately half the speed

of light, and observing that our SCI torus testbed is connected with 2m long cables,

a propagation latency of 14 ns per link is obtained. Since L_p represents the latency for the

head of a message passing through a conductor, it is therefore independent of message

size [27].





















Figure 2-10. Small-message latency model parameters. A) Arbitrary mapping of the multicast problem given in Figure 2-9 to a 2D 4-ary torus. B) Breakdown of the t_multicast parameter over the deepest path for the given multicast scenario.










Model parameters o_sender, o_receiver, L_f, and L_s were obtained through ring-based and

torus-based API unicast experiments. Ring-based experiments were performed with two-,

four-, and six-node ring configurations. For each setup, h_p and L_p are known. Inserting

these values into Eqs. 2-4 and 2-6 and taking algebraic differences between the two-, four-,

and six-node experiments yields o_sender, o_receiver, and L_f. The switching latency, L_s, is

determined through torus-based experiments by comparing the latency of message transfers

within a dimension versus between dimensions. Figure 2-11 shows the L_p, L_f, L_s, o_sender,

and o_receiver model parameters for various short message sizes.
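The differencing idea can be illustrated with a small Python sketch. It assumes each measured one-way ring latency obeys t = 2 x o + h_p x L_p + h_f x L_f (with o_sender and o_receiver lumped into a single o, as done later in this chapter); the hop counts and measured latencies below are hypothetical numbers used only to show the algebra.

L_p = 14e-9                       # propagation delay per hop (from the 2 m cable estimate)

# Hypothetical ring measurements: (h_p, h_f, measured one-way latency in seconds)
ring_small = (1, 0, 4.05e-6)      # e.g., a two-node ring
ring_large = (3, 2, 4.20e-6)      # e.g., a four-node ring

d_hp = ring_large[0] - ring_small[0]
d_hf = ring_large[1] - ring_small[1]
d_t = ring_large[2] - ring_small[2]

L_f = (d_t - d_hp * L_p) / d_hf                           # forwarding delay per node
o = (ring_small[2] - ring_small[0] * L_p - ring_small[1] * L_f) / 2.0
print(f"L_f ~= {L_f * 1e9:.0f} ns, o ~= {o * 1e6:.2f} us")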

The L_p, L_f, L_s, o_sender, and o_receiver parameters are communication-bounded. These

parameters are dominated by NIC and interconnect performance. The L_i parameter depends

on both interconnect performance and the node's computational power, and is formulated

as:

L_i = o_sender + t_process + o_receiver                                      (2-7)

Calculations based on the experimental data obtained previously show that each

multicast algorithm's processing load, t_process, is different. This load is composed mainly

of the time needed to check the child node positions on the multicast tree and the

calculation time of the child node(s) for the next iteration of the communication. Also,

t_process is observed to be the dominant component of L_i for short message sizes.

Moreover, compared to the t_process and L_i parameters, the L_p, L_f,

and L_s values are drastically smaller, and thus they are relatively negligible.

Dropping these negligible parameters, Eq. 2-5 can be simplified and expressed as

t_multicast ≈ (total number of destination nodes) x (o_sender + t_process) + o_receiver        (2-8)











for the separate addressing model. For the Md-torus, Mu-torus, U-torus, and S-torus

algorithms the simplified model can be formulated as in Eq. 2-9. As can be seen, the

modeling is straightforward for separate addressing, and for the remaining algorithms the

modeling problem is reduced to identifying the number of intermediate nodes over

the longest multicast message path. Moreover, without loss of generality, the o_sender and

o_receiver values can be treated as equal to one another [27] for simplicity, and we represent

them by a single parameter, o.

t_multicast = Max over all paths [ o_sender + h_i x L_i + o_receiver ]
            = Max over all paths [ 2 x o + h_i x L_i ]                       (2-9)
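In code form, the reduced model of Eq. 2-9 is a one-liner once the number of intermediate nodes on the deepest path is known. A minimal Python sketch (the numeric arguments in the example call are round placeholder values, not measurements):

def simplified_multicast_latency(o, L_i, h_i):
    # Eq. 2-9: sender and receiver overheads lumped into 2 x o, plus one
    # intermediate-node delay L_i per relay on the deepest path.
    return 2 * o + h_i * L_i

# Example: o ~ 2 us, L_i ~ 210 us, three relays on the deepest path.
print(simplified_multicast_latency(o=2e-6, L_i=210e-6, h_i=3))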


Figure 2-11. Measured and calculated model parameters (propagation, forwarding,
switching, and overhead) for short message sizes: L_p = 14ns, L_f = 60ns,
L_s = 670ns, and o = 1994 + 3.15 x (M - 128) ns, where M is the message
size in bytes.

Of course, the variable L_i is not involved in the separate addressing algorithm. The reason is

simply that separate addressing consists of a series of unicast message transmissions from the

source to the destination nodes, so no intermediate nodes are needed to relay the

multicast message to other nodes. Table 2-1 shows the t_process and L_i values for short










message sizes. The t_process parameter is independent of the multicast message and group

size but strictly dependent on the performance of the host machine. Therefore, for

different computing platforms, different t_process values will be obtained.

Table 2-1. Calculated t_process and L_i values.
                        t_process (us)    L_i (us)
Separate Addressing           7           N/A
Md-torus                    206           206 + 2 x o
Mu-torus                    201           201 + 2 x o
U-torus                     629           629 + 2 x o
S-torus                     265           265 + 2 x o

The following subsections will discuss and provide more detail about the simplified

model for each multicast algorithm. For each algorithm, the modeled values, the actual

testbed measurements, and the modeling error will also be presented.

The Separate Addressing Model

With the separate addressing algorithm, for a group size of G, there are G-1

destination nodes that the root node alone must serve. Therefore, a simplified model for

separate addressing can be expressed as given in Eq. 2-10.


Figure 2-12. Simplified model vs. actual measurements for the separate addressing
algorithm.











t_multicast ≈ (G - 1) x (o_sender + t_process) + o_receiver                  (2-10)


Figure 2-12 shows the simplified model and the actual measurements with various

multicast group sizes for a 128-byte multicast message using the o and t_process values


previously defined. The results show that the model is accurate with an average error of

~3%.
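As a usage example, Eq. 2-10 can be evaluated directly with the measured parameters (o from Figure 2-11 at 128 bytes, t_process from Table 2-1). The Python sketch below does this for a few group sizes; the printed numbers are model predictions under those parameter values, not additional testbed measurements.

def t_separate_addressing(G, o, t_process):
    # Eq. 2-10: the root serves each of the G - 1 destinations in turn.
    return (G - 1) * (o + t_process) + o

o = 1.994e-6          # sender/receiver overhead at 128 bytes (Figure 2-11)
t_process = 7e-6      # separate-addressing processing load (Table 2-1)
for G in (4, 8, 16):
    print(G, f"{t_separate_addressing(G, o, t_process) * 1e6:.1f} us")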


The Md-torus Model

For Md-torus, the total number of intermediate nodes for any communication path is


observed to be a function of G, the multicast group size, and k, the number of nodes in a


given dimension. The simplified Md-torus model is formulated as given in Eq. 2-11.


Figure 2-13. Simplified model vs. actual measurements for the Md-torus algorithm.








t_multicast = 2 x o + ⌈G / k⌉ x L_i                                          (2-11)


The simplified model and actual measurements for various group sizes with 128-byte


messages are plotted in Figure 2-13. As can be seen, the simplified model is accurate


with an average error of ~2%.











The Mu-torus Model

The number of partitions for the Mu-torus algorithm, denoted by p, is determined

by the multicast group size, G, and the partition length, r. For systems with r equal to G,

there exists only one partition and the multicast message is propagated in a chain-type

communication mechanism among the destination nodes. Under this condition, the

number of intermediate nodes is simply two less than the group size; the two excluded

nodes are the root and the last destination node. For systems with two or more

partitions, the number of intermediate nodes becomes a function of the group size, the

partition length, and the number of nodes in a given dimension. The simplified model is given as:


t_multicast = 2 x o + h_i(G, r, k) x L_i        for p >= 2
t_multicast = 2 x o + (G - 2) x L_i             for p < 2                    (2-12)

where h_i(G, r, k) denotes the number of intermediate nodes on the deepest path in the
multi-partition case.





Figure 2-14. Simplified model vs. actual measurements for Mu-torus algorithm.

Figure 2-14 shows the small-message model versus actual measurements for 128-byte

messages. The results show that the model is accurate with an average error of ~2%.











The U-torus Model

For U-torus the minimum number of communication steps required to cover all


destination nodes can be expressed as ⌈log2 G⌉. The number of intermediate nodes in


the U-torus algorithm is a function of the minimum required communication steps, the


group size and the number of nodes in a given dimension. The simplified U-torus model

is given as:


t_multicast = 2 x o + ⌈log2 G⌉ x (((G mod k) + k) / k) x L_i                 (2-13)


Figure 2-15 shows the short-message model and actual measurements for various

group sizes. The results show the model is accurate with an average error of ~2%.




Figure 2-15. Simplified model vs. actual measurements for U-torus algorithm.

The S-torus Model

The S-torus is a chain-type communication algorithm and can be modeled

identically to the single partition case of the Mu-torus algorithm. The simplified model

for S-torus is formulated as:


t_multicast = 2 x o + (G - 2) x L_i                                          (2-14)











The results from the short-message model versus actual measurements for 128-byte

messages are shown in Figure 2-16. As previously stated, S-torus routing is based on a

Hamiltonian circuit. This type of routing ensures that each destination node will receive


only one copy of the message, but some forwarding nodes (i.e., non-destination nodes

that are on the actual message path) may be visited more than once for routing purposes.

Moreover, depending on the group size, single-phase routing traverses many unnecessary

channels, creating more traffic and possible contention. Therefore, S-torus has

unavoidably large latency variations because of the varying and sometimes extremely


long message paths [31]. The small-message model presented here is incapable of tracking

these large variations in the completion latency that are inherent to the S-torus multicast

algorithm with increasing group size.

Figure 2-16. Simplified model vs. actual measurements for the S-torus algorithm.











Analytical Projections

To evaluate and compare the short-message performance of the multicast algorithms

for larger systems, the simplified models are used to investigate 2D torus-based parallel

systems with 64 and 256 nodes. The effects of different partition lengths (i.e., r=8, r=16,

r=32) for the Mu-torus algorithm over these system sizes are also investigated analytically

with these projections. The results of the projections are plotted in Figure 2-17.

The Md-torus algorithm has step-like completion latency caused by the fact that,

with every k destination nodes added to the group, a new ringlet is introduced into the

multicast communication, which increases the completion latency. The optimal

performance for the 8x8 torus network is obtained when the Mu-torus has a partition

length of 8, and for the 16x16 torus network when the partition length is 16. Therefore, it

is surmised that the optimal partition length for Mu-torus is equal to k for 2D SCI tori.






Figure 2-17. Small-message latency projections. A) Projection values for an 8x8 torus
system. B) Projection values for a 16x16 torus system.

The U-torus, on average, has slowly increasing completion latency with increasing

group sizes. The U-torus, Mu-torus, and Md-torus algorithms all tend to have similar

asymptotic latencies. The S-torus and separate addressing algorithms, as expected, have

linearly increasing latencies with increasing group sizes. Although separate addressing is

the best choice for small-scale systems, it loses its advantage with increasing group sizes

for short messages. The S-torus, by contrast, again proves to be a poor choice, as it is


simply the worst performing algorithm for group sizes greater than 8.

Summary

This phase of the dissertation has investigated the multicast problem on


high-performance torus networks. Software-based multicast algorithms from the

literature were applied to the SCI network, a commercial example of a high-performance









torus network. Experimental analysis and small-message latency models of these

algorithms are introduced. Analytical projections based on the verified small-message

latency models for larger size systems are also presented.

Based on the experimental results presented earlier, it is observed that the separate

addressing algorithm is the best choice for small messages or small group sizes from the

perspective of multicast completion latency and CPU utilization because of its simple and

cost-effective structure. The Md-torus algorithm performs best from the perspective of

completion latency for large messages or large group sizes, because of the balance

provided by its use of dimensional partitioning. In addition, Md-torus incurs a very low

CPU overhead and achieves high concurrency for all the message and group sizes

considered. The U-torus and Mu-torus algorithms perform better when the individual

multicast path depths are approximately equal. Furthermore, the Mu-torus algorithm

exhibits its best performance when group size is an exact multiple of the partition length.

The U-torus and Mu-torus algorithms have nearly constant CPU utilizations for small and

large messages alike. Moreover, the U-torus algorithm has the highest concurrency

among all algorithms evaluated, because of the high parallelism provided by the

recursive-doubling method. The S-torus algorithm is always the worst performer from

the perspective of completion latency and CPU utilization because of its lack of

concurrency and its extensive communication overhead. As expected, S-torus exhibits a

nearly linear increase in completion latency and CPU utilization for large messages with

increasing group size.

The small-message latency models, using only a few parameters, capture the

essential mechanisms of multicast communication over the given platforms. The models










are accurate for all evaluated algorithms except the S-torus algorithm. Small-message

multicast latency projections for larger torus systems are provided using these models.

Projection results show that with increasing group size the U-torus, Mu-torus, and

Md-torus algorithms tend to have similar, asymptotically bounded latencies. Therefore, it

is possible to choose an optimal multicast algorithm among these three for larger systems,

based on the multicast completion latency and other metrics such as CPU utilization or

network link concentration and concurrency. It is also possible and straightforward to

project the multicast performance of larger-scale 2D torus networks with our model.

Projected results show that S-torus and separate addressing have unbounded and linearly

increasing completion latencies with increasing group sizes, which makes them

unsuitable for large-scale systems. Applying the simplified models to other torus

networks and/or multicast communication schemes is possible with a minimal calibration

effort.

These results make it clear that no single multicast algorithm is best in all cases for

all metrics. For example, as the number of dimensions in the network increases, the

Md-torus algorithm becomes dominant. By contrast, for networks with fewer dimensions

supporting a large number of nodes, the Mu-torus and the U-torus algorithms are most

effective. Separate addressing is an efficient and cost-effective choice for small-scale

systems. Finally, S-torus is determined to be inefficient as compared to the alternative

algorithms in all the cases evaluated. This inefficiency is caused by the extensive length

of the paths used to multicast, which in turn leads to long and widely varying completion

latencies and a high degree of root-node CPU utilization.















CHAPTER 3
MULTICAST PERFORMANCE ANALYSIS AND MODELING FOR
HIGH-SPEED INDIRECT NETWORKS WITH NIC-BASED PROCESSORS

Chapter 2 was focused on investigating the topological characteristics of

high-performance torus networks for multicast communication. An experimental case

study was presented for SCI torus networks. Chapter 3 determines the optimum level of

work sharing between the host processor and the NIC processor for multicast

communication. The goal is to achieve an optimum balance between multicast

completion latency and host processor load. With its onboard NIC RISC processor,

Myrinet [18] is an example of such "intelligent" high-performance interconnects. The

following sections explain the details of a study to achieve an optimal work balance

between the host and the NIC processors for multicast communication over a Myrinet

interconnect.

Myrinet

Myrinet is an indirect high-performance interconnect constructed of switching

elements and host interfaces using point-to-point links. The core of the switching

element is a pipelined crossbar that supports non-blocking, wormhole routing of unicast

packets over bi-directional links up to 2.0Gbps.

A crossbar switching chip is the building block of a Myrinet network. It can be

used to build a non-blocking switch. It can also be interconnected to build arbitrary

topologies, such as switch-based stars, n-dimensional meshes or Multistage










Interconnection Network (MIN) topologies. Myrinet provides reliable, connectionless

unicast message delivery between communication end-points called ports.

Myrinet NICs are equipped with a programmable RISC processor (LANai), three

DMA engines, and SRAM memory. Myrinet also supports user-level host access to the

NIC bypassing the operating system for decreased host-to-NIC access latencies and

increased throughput. The latest version of Myrinet also supports 64-bit 133MHz PCI-X

interfaces. Figure 3-1 shows the architectural block diagram of a Myrinet NIC.


Memory


System Bus


64-bit / 133MHz PCl-X or
64-bit/166MHz PCI Bus









LANai RISC IIP I


DMA


Memory


Myrinet Network Links

Figure 3-1. Architectural block diagram of Myricom's PCI-based Myrinet NIC.









A striking feature of a Myrinet interconnect is the on-board NIC processor. The

main task of this processor is to offload work from the host processor on communication

events. This programmable, 32-bit, RISC processor runs at 66 or 133 MHz, which is

roughly an order of magnitude slower than today's host processors (1000-3000 MHz).

Related Research

Myrinet is the most successful and most widely deployed commercial

high-performance interconnect for clusters. It has received extensive attention from

academia and industry. Among the many topics of research, collective communication

on Myrinet exploiting the NIC processor is of particular interest to this phase of the

dissertation. As Myrinet does not support multicasting in hardware, designing efficient

and optimal software-based collective communication is the goal of many researchers.

All communication-related operations are performed on the Myrinet NIC RISC

processor in NIC-based collective communication. This approach is a well-studied

method to avoid expensive host-NIC interaction and to reduce system overhead and

network transaction latency [34-38]. It was observed that under such communication

schemes, very low host CPU loads can be obtained at a cost of increased overall multicast

completion latencies. This trait is caused by the fact that the NIC processor is

considerably slower (66 or 133 MHz) compared to the host CPU.

On the multicasting side, Verstoep et al. [35] extended the Illinois Fast Messages

(FM) protocol to produce a totally-ordered, reliable multicasting scheme that is fully

processed by the Myrinet NIC processor. The performance of this scheme was evaluated

over various spanning-tree multicast protocols. Kesavan et al. [36] presented a

simulative evaluation of NIC-based multicast performance of an optimal binomial tree

algorithm with packetization support at the NIC level. Bhoedjang et al. [37] simplified









and improved the scalability of Verstoep's design by developing another multicasting

scheme that is completely performed by the NIC co-processor. Buntinas et al. [38]

presented a NIC-based barrier operation over the Myrinet/GM messaging layer and

reported that they achieved performance improvement by a factor of 1.83 compared to

the host-based operations. An analytical model for performance estimation of the barrier

operation was also presented.

NIC-assisted multicasting was proposed as a way to improve the multicast

completion latencies of NIC-based multicast communication schemes while obtaining

similarly low host CPU loads. Buntinas et al. [39] presented a NIC-assisted binomial

tree multicast scheme over FM to improve the latency characteristics of NIC-based

multicast algorithms.

Among other multicast-related Myrinet research, Sivaram et al. [40] proposed

enhancements to a network switch architecture to support reliable hardware-based

multicasting. They have also presented a detailed latency model of their hardware-based

multicast communication approach.

This study complements and extends previous work by providing experimental

evaluations and small-message latency models of host-based, NIC-based and

NIC-assisted multicast schemes for obtaining optimal host CPU loads and multicast

completion latencies. These multicasting schemes are analyzed for the binomial and

binary trees, serial forwarding and separate addressing multicast algorithms. Accurate

small-message latency models for these algorithms are also developed and, using these

models, multicast completion latencies for larger systems are projected. Results of these

comparisons determine the optimum balance of support between the host processor and










the NIC co-processor for multicast communication under various networking scenarios.

The next section provides detailed information about the different multicasting schemes.

The Host Processor vs. NIC Processor Multicasting

Multicast communication primitives can be implemented at two different extremes

for interconnects with an onboard NIC processor, namely, host-based and NIC-based.

Between these two extremes there lies another level of implementation: NIC-assisted

multicasting. The following subsections will provide detailed information about these

three design strategies.

The Host-based Multicast Communication

Host-based multicast communication is the easiest and most conventional way of

implementing a multicast communication primitive. In this scheme, the host processor

handles all multicasting tasks, such as multicast tree creation and issuing of the unicast

send and receive operations. This type of implementation introduces increased CPU

load, resulting in lower computation/communication overlap available for parallel

programs. However, as the fast host processor performs all the tasks, host-based

multicasting achieves small multicast-completion latencies.

The NIC-based Multicast Communication

In the NIC-based scheme, the NIC co-processor handles all multicasting tasks

instead of the host processor. Therefore this scheme provides a reduced CPU load and

high computation/communication overlap for parallel programs. However, the Myrinet

NIC processor is roughly an order of magnitude slower than modern host processors.

Performing all the tasks on this relatively slow processor increases multicast completion

latency .









The NIC-assisted Multicast Communication

Between these two extremes a compromise has been proposed to this problem. It is

called NIC-assisted multicasting. In this approach, work is shared between the two

processors with the host processor handling computationally intensive multicast tasks,

such as multicast tree creation, and the NIC processor handling communication-only

multicast tasks, such as unicast send and receive operations. This solution presents low

multicast completion latency with moderate CPU loads, resulting in an acceptable

computation/communication overlap available for parallel programs.

The main difference between these three multicast schemes is how they initiate

the multicast send operation and how the intermediate nodes relay the

multicast message to their child nodes. Figure 3-2 shows a binomial tree multicast

operation graphically. All three schemes are applicable to any multicast algorithm.

As can be seen from Figure 3-2, in the host-based scheme the host processor issues

each multicast send one after another as if the NIC had no onboard processor. Also, upon

the reception of the multicast message, the intermediate nodes pass the message to local

host buffers immediately and the host processor processes it and determines the child

nodes for the next communication step. After the child nodes have been determined, the

host processor also issues the necessary multicast sends.

In the NIC-based multicast scheme the host processor issues only one send command.

This command includes the multicast message data and the destination node set. The

NIC processor creates the multicast tree based on this destination node set information,

and then issues all the necessary multicast sends one after another. As soon as the

intermediate node receives the multicast message the NIC processor analyzes the header,
















Figure 3-2. Possible binomial tree multicasting variations for Myrinet interconnects.
The difference between these schemes can be observed on the host-NIC
interactions at nodes 0 and 1. A) Host-based multicast communication
scheme. B) NIC-based multicast communication scheme. C) NIC-assisted
multicast communication scheme.











transfers the multicast data to local buffers for processing, notifies the host processor, and

issues the necessary multicast sends to the child nodes. Depending on the DMA

capability of the NIC, transfer of the multicast data to local buffers and issuing the

multicast messages to child nodes can occur concurrently.

As previously stated, the NIC-assisted scheme arranges for the two processors to share

the workload between them. In this scheme, for a multicast send operation the host

processor issues only one send command to the NIC. However, unlike the NIC-based

scheme, the host processor in the NIC-assisted scheme encapsulates the destination node

set, and the pre-computed multicast tree along with the data. The NIC processor only

executes the communication commands, which results in a shorter network interface

latency as compared to the NIC-based version. Upon reception, the NIC processor

immediately transfers the multicast message to the local buffers and notifies the host

processor. The host processor then processes the data and creates the child multicast tree,

passing it back to the NIC processor. The NIC processor, therefore, issues multicast

sends without performing any processing.
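All three schemes forward over the same logical tree; only the host-NIC hand-off differs. For reference, a minimal Python sketch of the recursive-doubling construction that binomial-tree multicasts such as the one in Figure 3-2 typically follow. This is the textbook construction, assumed here for illustration rather than taken from the GM or MCP code used in the case study.

def binomial_tree_children(G):
    # Recursive-doubling binomial tree rooted at node 0: at step s, every node i
    # that already holds the message (i < 2**s) forwards it to node i + 2**s,
    # so G nodes are covered in ceil(log2(G)) communication steps.
    children = {i: [] for i in range(G)}
    step = 1
    while step < G:
        for i in range(step):
            if i + step < G:
                children[i].append(i + step)
        step *= 2
    return children

print(binomial_tree_children(8))   # {0: [1, 2, 4], 1: [3, 5], 2: [6], 3: [7], ...}

With G = 8 this yields three communication steps and two intermediate nodes on the deepest path to node 7, matching the ⌈log2 G⌉ - 1 intermediate-node count used in the latency models later in this chapter.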

Case Study

To comparatively evaluate the performance of the host-based, NIC-based, and

NIC-assisted communication schemes, an experimental case study is conducted over a

16-node Myrinet network. The following subsections explain experiment details and the

results obtained.

Description

The case study is performed on a 16-node system, each node composed of dual

1GHz Intel Pentium III processors, 256MB of PC133 SDRAM, ServerSet III LE (rev 6)

chipset, and a 133MHz system bus. A Myrinet network is used as the high-speed










interconnect, where each node has an M2L-PCI64A-2 Myrinet NIC with 1.28 Gb/s link

speed using Myricom's GM-1.6.4 and Redhat Linux 8.0 with kernel version

2.4.18-14smp. The nodes are interconnected to form a 16-node star network using 16 3m

Myrinet LAN cables and a 16-port Myrinet switch.

Binomial tree, binary tree, serial forwarding and separate addressing algorithms are

developed for the host-based, NIC-based, and NIC-assisted communication schemes

individually. All of the multicast algorithms are evaluated for small (2B) and large

(64KB) message sizes and multicast group sizes of 4, 6, 8, 10, 12, 14, and 16. Multicast

completion latency, multicast tree creation latency, host CPU utilization, link

concentration, and concurrency are measured for each combination of multicast

algorithms, communication schemes, and message and group sizes previously described.

The Myrinet interconnect runs proprietary software, GM, on the hosts. GM is a

software implementation of the Virtual Interface Architecture (VIA) [41]. Like VIA, GM

is targeted for providing low-latency, OS-bypassing interactions between user-level

applications and communication devices. In GM, the OS is only responsible for

establishing the communication channels and enforcing required protection mechanisms

through the GM driver. Once the initial setup phase is completed, host to NIC

communications are performed through the GM library. Following the setup phase, both

host and NIC are able to initiate one-sided data transfers to each other using the provided

DMA engines in the Myrinet NIC hardware. Figure 3-3 shows the three-level

GM-software architecture. All of the host-based multicast algorithms used in this

research are designed in the user-level application layer on top of GM-1.6.4 provided by

Myricom.










The Myrinet Control Program (MCP) provided by Myricom has been modified as

follows to fit the new communication schemes for both the NIC-based and the

NIC-assisted multicast algorithms. The MCP is firmware that runs on the Myrinet NIC

RISC processor.

Figure 3-3. Myricom's three-layered GM software architecture.

The original MCP has four state machines: SDMA, SEND, RECV, and RDMA, as shown

in Figure 3-4. Each one is responsible for a

particular task and these tasks will be explained in detail in the following subsection.

Both the NIC-based and the NIC-assisted communication schemes use NIC-issued

multi-unicast messages which required modification of the state machine code. Table 3-1

shows the pseudocode of the overall process for both the NIC-assisted and the NIC-based

communication schemes for the root, intermediate and destination nodes, including the

host and NIC tasks and interactions. Parts in italics are either performed or initiated by

the host CPU and the rest is performed by the NIC processor. The SDMA process

includes the host writing to NIC memory and signaling the NIC upon completion of the

write operation. The RDMA process includes the NIC writing to the host memory and

signaling it upon the completion of the write operation. The updated state machines










added an extra 8 us to the unicast sends, whereas the minimum one-way

latency was measured as 17 us from the host level using the unmodified MCP code.

Figure 3-4. The GM MCP state machine overview.


Table 3-1. Pseudocode for NIC-assisted and NIC-based communication schemes.

NIC-Assisted Multicast (Root Node):           NIC-Based Multicast (Root Node):
  Obtain multicast host names                   Obtain multicast host names
  Obtain GM MCP base address pointer            Obtain GM MCP base address pointer
  Build multicast tree                          SDMA
  SDMA                                          Wait for completion
  Wait for completion                           Build multicast tree
  Do multicast                                  Do multicast
  RDMA                                          RDMA

NIC-Assisted Multicast                        NIC-Based Multicast
(Intermediate and Destination Nodes):         (Intermediate and Destination Nodes):
  Listen for incoming multicast calls           Listen for incoming multicast calls
  Receive message                               Receive message
  RDMA                                          Check multicast tree
  Check multicast tree                          Relay message to root/child hosts
  SDMA                                          RDMA
  Do multicast


Case study evaluations of the host-based, NIC-based, and NIC-assisted


communication schemes are undertaken for each of the four multicast algorithms. Each










experiment is performed for each message and group size, for 100 executions, where

every execution has 50 repetitions. Four different sets of metrics are probed in each

experiment, including multicast completion latency, user-level CPU utilization, multicast

tree-creation latency, and link concentration and concurrency. The maximum user-level

host CPU utilization of the root node is measured using the Linux built-in sar utility.

Link concentration and concurrency of each algorithm are calculated as described in

Chapter 2 of this dissertation, for each group size based on the communication pattern

observed throughout the experiments.

Multicast Completion Latency

As previously stated, completion latency is an important metric for assessing the

quality and efficiency of multicast algorithms and communication schemes. Two

different sets of experiments for multicast completion latency are performed in this case

study, one for a small message size of 2B, and the other for a large message size of

64KB. Figure 3-5A shows the multicast completion latency versus group size for small

messages. Figure 3-5B shows multicast completion latency versus group size for large

messages. Figure 3-5A is presented with a logarithmic scale for clarity.

The small-message results presented in Figure 3-5A show that among the host-based

algorithms, binary tree performs the best for any group size for small messages. The

host-based binomial tree algorithm is second best in terms of latency. The difference in

the performance level of these two algorithms is because of the higher computational

load of binomial tree compared to the binary tree algorithm. The step-like shape of the

binomial and binary tree algorithms is due to the ⌈log2 G⌉ - 1 intermediate nodes

on the deepest path traveled by the multicast message. The host-based serial forwarding










algorithm shows a linear increase in completion latency by increasing group size. The

host-based separate addressing algorithm performs worst among all host-based

algorithms and also shows a linear increase in latency with increasing group size. The

NIC-based approach has the highest multicast completion latency for all algorithms and

all group sizes, as expected. The reason for this poor performance is that the

slower NIC processor (as compared to the host processor) handles all of the computation

and communication tasks. In particular, this communication scheme impedes the

performance of separate addressing more than the other algorithms. The NIC-assisted

solution provides slightly smaller latencies compared to the NIC-based solutions for all

algorithms. The increases in efficiency of communication are caused by the fact that the

faster host processor performs all the communication-related computational tasks. This

communication scheme improves the latency characteristics of the separate addressing

algorithm more than the serial forwarding algorithm.
































Figure 3-5. Multicast completion latencies. Host-based, NIC-based, and NIC-assisted
communication schemes are denoted by H.B., N.B., and N.A., respectively.
A) Small messages vs. group size. B) Large messages vs. group size.










The large-message results presented in Figure 3-5B show that the binary tree

performs the best compared to all other host-based communication algorithms as evident

in the small-message host-based results. Binomial tree has slightly higher latency values

for all group sizes because of its higher algorithmic computational load as compared to

the binary tree algorithm. Serial forwarding, as described previously, has more latency

variations compared to the small-message case. The NIC-based communication schemes

still significantly impede the performance of all algorithms compared to the host-based

approach as previously seen in the small-message case. However, contrary to the

small-message case, the NIC-assisted approach lowers the large-message latencies closer

to the host-based approach. In the NIC-assisted approach the sender and receiver

overheads are significantly high and they overlap with the expensive host CPU-NIC

LANai interaction events. This overlapping hides the latencies of host CPU-NIC LANai

interactions and results in the dramatic reduction of completion latencies of the

NIC-assisted approach for large messages. Binomial and binary tree algorithms benefit

most from the NIC-assisted communication approach for large messages.

User-level CPU Utilization

The CPU utilization is measured at the host processor for all experiments. For both

small messages and large messages, host-based multicasting produces the highest CPU

utilization level for each group size because the host-processor handles all the multicast

communication and computation tasks. By contrast, NIC-based communication provides









a constant level of CPU utilization that is lower than host-based and NIC-assisted

schemes for all algorithms. In the NIC-based approach, the host CPU is responsible only for

setting up the multicast communication environment and the rest of the tasks are carried

out by the NIC processor, decreasing the workload of the host processor. NIC-assisted

multicast reduces host CPU utilization as compared to the host-based scheme. Figure

3-6A shows the small-message CPU utilizations for host-based, NIC-based, and

NIC-assisted communication schemes. Figure 3-6B shows the large-message CPU

utilizations for host-based, NIC-based, and NIC-assisted communication schemes. The

results show that all of the algorithms exhibit low CPU utilizations for NIC-based and

NIC-assisted schemes. This fact proves, in terms of the host CPU utilization, that the

NIC-assisted scheme is the preferable one among the three for all networking scenarios.

Overall, the low CPU utilization and low multicast completion latency characteristics of the

NIC-assisted scheme make it a good choice for multicast communication.

Multicast Tree Creation Latency

Multicast tree creation is an all-computational task, and tree creation latencies are

independent of the multicast message size but dependent on the group size. Figure 3-7

shows the tree creation latencies for all combinations of communication schemes and

multicast algorithms versus all group sizes. Host-based multicast tree creation provides

the lowest latencies for all algorithms as can be seen from Figure 3-7. The NIC-based

tree creation is roughly an order of magnitude slower than host-based tree creation,

reflecting the performance gap between those two processors. The NIC-assisted tree

creation has latencies identical to the host-based scheme because the host-processor

handles tree-creation tasks in this communication scheme.





Figure 3-6. User-level CPU utilizations. Host-based, NIC-based, and NIC-assisted
communication schemes are denoted by H.B., N.B., and N.A., respectively.
A) Small messages vs. group size. B) Large messages vs. group size.










Figure 3-7. Multicast tree creation latencies. Host-based, NIC-based, and NIC-assisted
communication schemes are denoted by H.B., N.B., and N.A., respectively.


Link Concentration and Link Concurrency


Link concentration measures the degree to which communication is concentrated


on individual links. This metric, combined with link concurrency, can be used for


assessing the effectiveness of the network link usage for a given communication


structure. All communication schemes access the network in the same manner; the

only difference between these schemes is their host-NIC access patterns. Thus, the


network link usage is independent of the deployed communication scheme. Figure 3-8A


shows the link concentration for all algorithms versus multicast group sizes. Figure 3-8B


shows the link concurrency for all algorithms versus multicast group sizes. From Figure


3-8A it can be seen that the binomial and binary tree and the separate addressing


algorithms have the lowest link concentrations and an asymptotically bounded link









concentration of 2. Although the link access patterns of the binary and binomial tree

algorithms are different from that of the separate addressing algorithm, the number of link visits and

the links used are the same for all three of them. For the testbed with a single switch

used in this case study, the link concentration can never exceed the asymptotic bound of

2, since two is the maximum number of links that must be crossed for any point-to-point

communication. Serial forwarding, which uses a different pair of hosts and links in every

step of the multicast communication, has a constant and bounded link concentration of 2

for every group size.

Link concurrency, given in Figure 3-8B, shows that binomial tree has the best link

concurrency of all algorithms for most group sizes. Binary tree exhibits similar

concurrency to binomial tree. Combined with the link concentration results presented in

Figure 3-8A, the binomial tree algorithm appears to be the best. The difference between

the binary and binomial tree algorithms is caused by their fan-out numbers. Serial

forwarding has the lowest concurrency while the separate addressing algorithm is slightly

better. Both show constant link concurrency.

Multicast Latency Modeling

Myrinet experiments for evaluating the host-based, NIC-based, and NIC-assisted

communication schemes have been performed over a 16-node star network. These

experiments provide useful information for understanding the basic aspects of these

schemes and algorithms, but are insufficient for drawing further detailed analyses for

systems of arbitrary sizes beyond the testbed capabilities. Latency modeling

complements the experimental study by providing more detailed insight and establishes a

better understanding of the problem.









Figure 3-8. Communication balance vs. group size. A) Link concentration. B) Link
concurrency.

Latency modeling for the Myrinet network is based on Eq. 2-5 and follows the


same approach outlined in Chapter 2. The basic idea in the latency modeling presented in


this chapter is to express the communication events with a few parameters without


oversimplifying the process.










Although the outlined method is the same, there are some differences between the

Myrinet latency model and the SCI latency model. First of all, t_network is defined

differently because Myrinet is an indirect network, unlike the direct SCI network. In an

indirect network, nodes establish point-to-point communications through switches, and

there are no forwarding nodes present on a given communication path. Eq. 3-1 reflects

the new t_network parameter for an indirect network.

t_network = h_p x L_p + h_s x L_s + h_i x L_i                                (3-1)

Here, h_p, h_s, and h_i represent the total number of hops, total number of switching

nodes, and total number of intermediate nodes, respectively. Similarly, L_p and L_s denote

the propagation delay per hop and the switching delay through a switch, respectively. The L_i

parameter denotes the intermediate delay through a node, which is the sum of the

receiving overhead of the message, processing time, and the sending overhead of the

message at the host and NIC layers for a given intermediate node.
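A minimal Python sketch of Eq. 3-1 for the single-switch star used here. The hop and switch counts (two links and one switch crossing per point-to-point connection) and the L_p and L_s values are the ones reported later in this section; they are plugged in only to show the arithmetic.

def t_network_indirect(h_p, h_s, h_i, L_p, L_s, L_i):
    # Eq. 3-1: link propagation + switch crossings + intermediate-node relays.
    return h_p * L_p + h_s * L_s + h_i * L_i

# One point-to-point connection through the single Myrinet switch:
# 2 links at 21 ns each, 1 switch crossing at 500 ns, no intermediate relay.
print(t_network_indirect(h_p=2, h_s=1, h_i=0, L_p=21e-9, L_s=500e-9, L_i=0.0))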

The second difference between SCI and Myrinet latency models is that in Myrinet

the host and NIC processors coordinate to perform the communication-related tasks.

Therefore, the Myrinet model has to account for this interaction and work sharing

between the host and NIC processors. Also, this coordination and interaction is different

for each communication scheme for a given Myrinet network. Eqs. 3-2 through 3-10

present each scheme's sender and receiver overheads; and their intermediate node delays

in detail.

In a software-based multicast communication it is likely that a node will issue more

than one message. To formulate the sender overhead for the host-based communication










scheme, it is assumed that the host processor is issuing the M-th individual multicast

message. The overhead for issuing the M-th message is expressed as:

o_sender = M x t_Host_send                                                   (3-2)

However, for NIC-based communication the send process is completely performed

by the NIC processor. The sender overhead for the M-th consecutive multicast message is

expressed as in Eq. 3-3.

o_sender = M x t_NIC_send                                                    (3-3)

In NIC-assisted multicasting, a different level of work sharing exists between the

host and NIC processors, compared to the NIC-based scheme. In the NIC-assisted scheme,

the host processor is responsible only for the multicast-related computational tasks. In

this scheme the NIC processor is responsible only for the communication-only events,

such as the send operation. From the NIC point of view, issuing NIC-assisted multicast

messages is the same as in the NIC-based scheme. Therefore, the sender overhead for a

given M-th consecutive multicast message in the NIC-assisted scheme is the same as in

the NIC-based case and is expressed as:

o_sender = M x t_NIC_send                                                    (3-4)

The last node on the multicast message path incurs the receiver overhead. For the

host-based scheme, the GM software layer automatically drains the message from the

network and places it in the appropriate user-level memory space. The receiver overhead

for the host-based communication is expressed as:

o_receiver = t_Host_recv                                                     (3-5)

For the NIC-based communication scheme, the receive operation involves both the

host and the NIC processors. The NIC processor receives and removes the multicast










message from the network, writes it to the appropriate user-level memory address and

then notifies the host processor. Upon this notification, the host processor completes the

receive operation. As explained before, the RDMA operation consists of the memory

transfer of the message to the user-level memory and the NIC processor's notification of

the host processor upon transfer completion. The receiver overhead for NIC-based

multicasting is:

o_receiver = t_NIC_recv + RDMA + t_Host_recv                                 (3-6)

From the host CPU point of view the reception of a message is identical for

NIC-based and NIC-assisted communication schemes. For both of these schemes, the

NIC-processor drains the message from the network and performs an RDMA operation to

the user-level memory space and notifies the host CPU. Therefore, the receiver overhead

of the NIC-assisted communication is identical to the NIC-based receiver overhead, and

is expressed as follows:

o_receiver = t_NIC_recv + RDMA + t_Host_recv                                 (3-7)


The intermediate nodes act as relays in multicast communication. These nodes

receive messages from the network, calculate the next set of destination nodes and relay

the messages to them. Therefore, the delay for each intermediate node includes the sum of

the receiver overhead, the processing delay, and the sender overhead. For the host-based

communication scheme, the intermediate node delay, L_i, for the M-th consecutive message

sent is:

L_i = t_Host_recv + t_Host_process + M x t_Host_send                         (3-8)










For the NIC-based communication scheme, all intermediate node tasks are

performed by the NIC processor. For the M-th consecutive message sent, the NIC-based

intermediate node delay is expressed as follows:

L_i = t_NIC_recv + t_NIC_process + M x t_NIC_send                            (3-9)

The intermediate node delay for the NIC-assisted communication scheme starts

when the NIC processor receives a message and performs an RDMA operation to the host

processor. Consequently, the host processor processes the incoming multicast message

and performs an SDMA operation back to the NIC processor to perform the actual

multicast send operation. For the M-th consecutive send, the NIC-assisted intermediate

node delay is expressed as:

L_i = t_NIC_recv + RDMA + t_Host_process + SDMA + M x t_NIC_send             (3-10)


As explained before, for achieving higher network concurrency, all the algorithms

evaluated in this chapter are designed such that the root node always serves the deepest

path first. Therefore, all equations given above can be simplified by taking M equal to 1.

This simplification ensures that the root serves the deepest path first when modeling the

total multicast latency.
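The intermediate-node delays of Eqs. 3-8 through 3-10 (with M = 1, since the root serves the deepest path first) differ only in which processor performs each step. A minimal Python sketch, with all timing parameters passed in rather than hard-coded; the usage line reproduces the host-based binomial-tree entry of Table 3-3 from the Table 3-2 values at a 2-byte message size.

def L_i_host_based(t_host_recv, t_host_process, t_host_send):
    # Eq. 3-8 with M = 1: everything runs on the host processor.
    return t_host_recv + t_host_process + t_host_send

def L_i_nic_based(t_nic_recv, t_nic_process, t_nic_send):
    # Eq. 3-9 with M = 1: everything runs on the (slower) NIC processor.
    return t_nic_recv + t_nic_process + t_nic_send

def L_i_nic_assisted(t_nic_recv, rdma, t_host_process, sdma, t_nic_send):
    # Eq. 3-10 with M = 1: the NIC receives and sends, the host does the
    # processing, with one RDMA and one SDMA crossing per relayed message.
    return t_nic_recv + rdma + t_host_process + sdma + t_nic_send

t_h = 9.864 + 0.0315 * 2                       # t_Host_send = t_Host_recv at m = 2 B (us)
print(L_i_host_based(t_h, 19.2, t_h))          # ~39.05 us, the binomial-tree row of Table 3-3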

The following paragraph explains how the components of t_network are acquired. The

latency model parameter L_s is obtained by measuring the difference between two

latency measurements over two nodes that are first connected through the Myrinet switch

and then connected directly without the switch. The L_s parameter is measured as 500ns. For the

experimental setup, h_s is set to 1. Assuming 7ns per meter of propagation delay, L_p is

calculated as 21ns, as all the Myrinet interconnect cables used were 3m long. In our

testbed, 2 links are crossed for each pair of nodes that establish a point-to-point









communication. For the single-switch, 16-node system used in our experiments, crossing

2 links per connection yields an h_p of 2 for the separate addressing algorithm. All other

algorithms use a tree-based communication structure which depends on relaying the

message through intermediate nodes. Therefore, binary tree, binomial tree, and serial

forwarding have an h_p equal to 2 x h_s. The number of intermediate nodes, h_i, is calculated

as (G - 2) for serial forwarding and ⌈log2 G⌉ - 1 for the binary and binomial tree algorithms.

Following the same approach defined by Buntinas et al. [39], the sender and receiver

overhead estimates of the host and NIC processors are obtained individually. As the

clock frequency of the host processor is known, the RDTSC assembly instruction is used

to read the internal 64-bit cycle counter to get the exact time spent on the host processor.

By averaging the host-based multicast communication for each previously defined

experiment for each message and group size, an accurate estimate of t_Host_send is

obtained. For simplicity, t_Host_recv is approximated to equal t_Host_send.

The real-time clock register (RTC) on the LANai 7 chip is incremented at a regular

interval automatically by default. The incremental interval of this register is set at

initialization time of the NIC based on the actual PCI bus clock. All accesses to the NIC

registers from the host side are performed as PCI bus transactions. Therefore, to remove

this bus delay, the 64-bit host processor cycle counter is read immediately after reading

the RTC. The host-processor is put to sleep for 1 second. The same process is then

repeated. Comparing the elapsed times in these repeated readings allowed the PCI bus

transaction delays to be accurately assessed and eliminated from the RTC reading

operations. The t_NIC_send and t_NIC_recv values are obtained for the NIC-based and the

NIC-assisted schemes for all group sizes and small and large messages. The NIC-based










and NIC-assisted SDMA and RDMA tasks are also measured using the outlined

method, from the host and NIC respectively.
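The hardware-specific parts of this calibration (RDTSC, the LANai RTC, PCI reads) cannot be reproduced in a portable snippet, but the differencing logic itself is simple. The Python sketch below illustrates it with time.perf_counter_ns standing in for the host cycle counter and a stub standing in for the RTC read; it is an analogy for the procedure described above, not the actual measurement code.

import time

def read_nic_rtc_stub():
    # Stand-in for reading the LANai RTC over the PCI bus; in the real setup this
    # read carries a PCI-transaction delay that the differencing cancels out.
    return time.perf_counter_ns() // 1000      # pretend the RTC ticks in microseconds

def paired_intervals(sleep_s=1.0):
    rtc0 = read_nic_rtc_stub()
    host0 = time.perf_counter_ns()             # host counter read right after the RTC
    time.sleep(sleep_s)
    rtc1 = read_nic_rtc_stub()
    host1 = time.perf_counter_ns()
    # Each RTC sample carries one read delay, so comparing the two elapsed times
    # cancels that delay to first order.
    return rtc1 - rtc0, (host1 - host0) / 1000.0

print(paired_intervals(0.1))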

Table 3-2. Measured latency model parameters (m denotes the message size).
               t_Host_send      t_Host_recv      t_NIC_send       t_NIC_recv       SDMA   RDMA
Host-Based     9.864+0.0315xm   9.864+0.0315xm   N/A              N/A              N/A    N/A
NIC-Based      N/A              N/A              0.697+8.3325xm   0.697+8.3325xm   26     28
NIC-Assisted   N/A              N/A              0.697+8.3325xm   0.697+8.3325xm   26     28

Table 3-3. Calculated t_process and L_i values.
                              t_process (us)    L_i (us)
Host-Based      Binomial           19.2           39.05
                Binary             12.2           32.05
                Serial              7.2           27.05
                Separate           22.2           N/A
NIC-Based       Binomial          145.2          250.25
                Binary             92.2          178.25
                Serial             79.2          121.08
                Separate          233.1           N/A
NIC-Assisted    Binomial           19.2          113.9
                Binary             12.2          107.9
                Serial              7.2          101.7
                Separate           22.2           N/A


Table 3-2 summarizes the acquired values of t_Host_send, t_Host_recv, t_NIC_send, t_NIC_recv,

SDMA, and RDMA for the host-based, NIC-based, and NIC-assisted communication

schemes. Table 3-3 shows the t_process (i.e., t_Host_process or t_NIC_process depending on the

communication scheme) and L_i values for each possible communication scheme and

algorithm combination. These values are obtained by substituting the values given in

Table 3-2 into Eqs. 3-2 through 3-10. The t_process and L_i values are specific to the system

used. However, these values are independent of the multicast message size. The following










subsections provide the details of the small-message latency model for each scheme and

algorithm.

The Host-based Latency Model

The host-based latency models are obtained using Eqs. 3-2, 3-5, and 3-8. The

binary and binomial trees have the same number of intermediate nodes in each case;

therefore, these two algorithms are modeled together, as formulated in Eq. 3-11.

t_multicast = t_Host_send + (⌈log2 G⌉ - 1) x (t_Host_recv + t_Host_process + t_Host_send)
              + t_Host_recv                                                  (3-11)

The model for serial forwarding is given as:

t_multicast = t_Host_send + (G - 2) x (t_Host_recv + t_Host_process + t_Host_send)
              + t_Host_recv                                                  (3-12)

The separate addressing model is formulated as:

t_multicast = (G - 1) x (t_Host_process + t_Host_send) + t_Host_recv         (3-13)

Figure 3-9 shows results from the host-based, small-message model for all

algorithms versus multicast group size along with the actual measurements. The model is

accurate for the binomial, binary tree and separate addressing algorithms, and the average

error is ~1%, ~2%, and ~1%, respectively. The host-based serial forwarding algorithm

has a relatively higher modeling error, ~3%, because of its inherent latency variations as

explained in the previous chapter.
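A minimal Python sketch of Eqs. 3-11 through 3-13, parameterized by the small-message (2-byte) values from Tables 3-2 and 3-3. The printed numbers are model evaluations under those parameters, not measurements.

import math

T_SEND = T_RECV = 9.864 + 0.0315 * 2            # host overheads at m = 2 bytes (us)
T_PROC = {"binomial": 19.2, "binary": 12.2, "serial": 7.2, "separate": 22.2}

def host_based_latency(algorithm, G):
    relay = T_RECV + T_PROC[algorithm] + T_SEND          # per intermediate node
    if algorithm in ("binomial", "binary"):              # Eq. 3-11
        return T_SEND + (math.ceil(math.log2(G)) - 1) * relay + T_RECV
    if algorithm == "serial":                            # Eq. 3-12
        return T_SEND + (G - 2) * relay + T_RECV
    return (G - 1) * (T_PROC[algorithm] + T_SEND) + T_RECV   # Eq. 3-13

for algo in ("binomial", "binary", "serial", "separate"):
    print(algo, f"{host_based_latency(algo, 16):.1f} us")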

The NIC-based Latency Model

NIC-based latency models are obtained by substituting the values presented in the

previous section into Eqs. 3-3, 3-6, and 3-9. The small-message model for the binary and

binomial trees is formulated as given in Eq. 3-14.

t_multicast = SDMA + t_NIC_process + t_NIC_send
              + (⌈log2 G⌉ - 1) x (t_NIC_recv + t_NIC_process + t_NIC_send)
              + t_NIC_recv + RDMA                                            (3-14)









Figure 3-9. Simplified model vs. actual measurements for the host-based
communication scheme.


The model for serial forwarding is as follows:

t_multicast = SDMA + t_NIC_process + t_NIC_send
              + (G - 2) x (t_NIC_recv + t_NIC_process + t_NIC_send)
              + t_NIC_recv + RDMA                                            (3-15)

The separate addressing model is formulated as:

t_multicast = SDMA + (G - 1) x (t_NIC_process + t_NIC_send) + t_NIC_recv + RDMA        (3-16)


Figure 3-10 shows the NIC-based small-message model and the actual measurements


versus multicast group size for all algorithms. Similar to the host-based case, the model


is accurate for the binomial, binary tree and separate addressing algorithms, and the


average error is ~2%, ~2%, and ~1%, respectively. Serial forwarding has a relatively














Figure 3-10. Simplified model vs. actual measurements for the NIC-based
communication scheme.


higher modeling error, ~4%, because of its unavoidable latency variations.

The NIC-assisted Latency Model

As explained previously, the difference between the NIC-based and the


NIC-assisted model is where communication-related computational processing is


handled. Using Eqs. 3-4, 3-7, and 3-10, the small-message model for binary and

binomial trees is formulated as:


t_{multicast} = t_{SDMA} + t_{NIC\_send} + (\lceil \log_2 G \rceil - 1) \times (t_{NIC\_recv} + t_{RDMA} + t_{Host\_process} + t_{SDMA} + t_{NIC\_send}) + t_{NIC\_recv} + t_{RDMA} \qquad (3-17)


The serial forwarding model is:

t_{multicast} = t_{SDMA} + t_{NIC\_send} + (G - 2) \times (t_{NIC\_recv} + t_{RDMA} + t_{Host\_process} + t_{SDMA} + t_{NIC\_send}) + t_{NIC\_recv} + t_{RDMA} \qquad (3-18)


The separate addressing model is:

t_{multicast} = (G - 1) \times (t_{Host\_process} + t_{SDMA} + t_{NIC\_send}) + t_{NIC\_recv} + t_{RDMA} \qquad (3-19)
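
A corresponding sketch for the NIC-assisted models of Eqs. 3-17 through 3-19 follows the same pattern; the per-hop term includes the RDMA/SDMA round trip through host memory that, as noted above, penalizes serial forwarding. All timing constants below are illustrative assumptions, not measured values.

import math

# Placeholder costs in microseconds (assumed, not measured).
T_SDMA, T_RDMA = 12.0, 12.0          # host <-> NIC DMA transfers
T_NIC_SEND, T_NIC_RECV = 15.0, 15.0  # wire send/receive handled by the NIC
T_HOST_PROCESS = 2.0                 # forwarding decision made on the host CPU

PER_HOP = T_NIC_RECV + T_RDMA + T_HOST_PROCESS + T_SDMA + T_NIC_SEND

def na_tree(G):
    """NIC-assisted binomial/binary tree model, Eq. 3-17."""
    return T_SDMA + T_NIC_SEND + (math.ceil(math.log2(G)) - 1) * PER_HOP + T_NIC_RECV + T_RDMA

def na_serial(G):
    """NIC-assisted serial forwarding model, Eq. 3-18."""
    return T_SDMA + T_NIC_SEND + (G - 2) * PER_HOP + T_NIC_RECV + T_RDMA

def na_separate(G):
    """NIC-assisted separate addressing model, Eq. 3-19."""
    return (G - 1) * (T_HOST_PROCESS + T_SDMA + T_NIC_SEND) + T_NIC_RECV + T_RDMA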


[Figure: model vs. actual multicast completion latency and model error (%) versus multicast group size for the binomial, binary, serial, and separate addressing algorithms under the NIC-assisted scheme.]


Figure 3-11. Simplified model vs. actual measurements for the NIC-assisted
communication scheme.


Figure 3-11 shows the results for the NIC-assisted small-message model and the


actual measurements versus multicast group size for all algorithms. The figure shows


that the model is accurate for the binomial tree, binary tree, and separate addressing


algorithms, and the average error is ~1% in all cases. Like the host-based and the


NIC-based cases, serial forwarding has a relatively higher modeling error, ~3%.










Analytical Projections

To evaluate and compare the short-message performance of the communication

schemes and the multicast algorithms for larger systems, the simplified models are used

to investigate the performance characteristics of the multicast completion latency for an

indirect Myrinet star network with 64 and 256 nodes. For the first case, a 64-node

network with a single 64-port Clos switch is considered. A 256-node network with a

single 256-port Clos switch is considered for the second case. Currently the largest

commercially available Myrinet switch is for 128 nodes, but the technology is

progressing towards higher capacity switches. Higher capacity switches are preferable

because they are easier to maintain and cheaper to build. These switches also provide

higher bisection bandwidths with fewer cable connections than MINs.

Moreover, higher capacity switches provide full-bisection bandwidth with fewer

connections. Vendor-provided results show that the commercially available 128-node
connections. Vendor provided results show that the commercially available 128-node

Myrinet Clos switch achieves similar switching latencies to the 16-port version.

Therefore, projections for a 256-node system with a single Clos switch are considered to

be realistic.
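
A minimal sketch of how such projections can be generated from the simplified models is given below. It simply sweeps the group size through the host-based expressions with placeholder costs, so the absolute numbers are illustrative; what it reproduces is the shape of the projected curves, logarithmic for the trees and linear for serial forwarding and separate addressing.

import math

# Placeholder per-operation costs in microseconds (not the Table 3-2 values).
SEND, RECV, PROC = 8.0, 8.0, 2.0

def hb_tree(G):
    """Host-based binomial/binary tree projection, Eq. 3-11."""
    return SEND + (math.ceil(math.log2(G)) - 1) * (RECV + PROC + SEND) + RECV

def hb_serial(G):
    """Host-based serial forwarding projection, Eq. 3-12."""
    return SEND + (G - 2) * (RECV + PROC + SEND) + RECV

def hb_separate(G):
    """Host-based separate addressing projection, Eq. 3-13."""
    return (G - 1) * (PROC + SEND) + RECV

# Tree latency grows with log2(G); serial and separate grow linearly with G,
# which is why the projected curves diverge quickly beyond 16 nodes.
for G in (16, 64, 256):
    print(f"G={G:3d}  tree={hb_tree(G):7.1f}  serial={hb_serial(G):7.1f}  "
          f"separate={hb_separate(G):7.1f}  (us)")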

Figures 3-12 and 3-13 show the projection results for the two cases under study.

Strictly in terms of multicast completion latency, the host-based binary and binomial

trees provide the lowest completion latencies. The difference between these two is

caused by the processing load and fan-out numbers. A more efficient binomial tree

design with a lower processing load and a higher fan-out number, compared to the one

used in this dissertation, may outperform the binary tree algorithm. It is also observed

that the host-based serial forwarding and separate addressing algorithms have a linear











increase in completion time with increasing group size. Notably, the NIC-based

approaches are a poor solution for any of the algorithms in any given network scenario

compared to the host-based approaches because of their extensive completion latencies.

[Figure: projected small-message completion latency versus multicast group size (4 to 64 nodes) for the host-based (HB), NIC-based (NB), and NIC-assisted (NA) binomial, binary, serial, and separate addressing algorithms.]

Figure 3-12. Projected small-message completion latency for 64 nodes.






However, as can be seen from the figures, the NIC-assisted algorithms lower the


completion latencies compared to the NIC-based solutions. This approach provides


significant latency reduction in the separate addressing, binomial and binary tree


algorithms, where the largest gain in terms of completion latency is in the separate


addressing algorithm. Unfortunately, serial forwarding is less affected by the














NIC-assisted algorithm because of the expensive SDMA-RDMA operations incurred by


each intermediate node.



[Figure: projected small-message completion latency versus multicast group size (4 to 256 nodes) for the host-based (HB), NIC-based (NB), and NIC-assisted (NA) binomial, binary, serial, and separate addressing algorithms.]



Figure 3-13. Projected small-message completion latency for 256 nodes.


Other than multicast completion latency, host CPU utilization is also an important



parameter to consider when choosing a multicast communication scheme and algorithm.


For larger systems a sacrifice in completion latency can be made to achieve a lower CPU


utilization. For such systems, NIC-assisted schemes may be well suited, because they



provide lower CPU utilization and more communication-computation overlap.










Summary

This chapter of the dissertation investigated the multicast problem for high-speed

indirect networks with NIC-based processors. Chapter 3 introduced a multicast

performance analysis and modeling for this type of interconnect. These interconnects

are widely used in the parallel computing community because of their reprogrammable

flexibility and their ability to offload work from the host CPU. This phase of the

dissertation analyzed various degrees of host and NIC processor work sharing.

Host-based, NIC-based, and NIC-assisted multicast communication schemes for work

sharing are analyzed. Binomial and binary tree, serial forwarding and separate

addressing multicast algorithms are used for these analyses. Experimental evaluations

are performed using various metrics, such as multicast completion latency, root-node

CPU utilization, multicast tree creation latency, and link concentration and concurrency,

to evaluate their key strengths and weaknesses. To further analyze the performance of

aforementioned multicast communication schemes, small-message latency models of

binomial and binary tree, serial forwarding and separate addressing multicast algorithms

are developed and verified based on the experimental results. The models are observed

to be accurate. Projections for larger systems are also presented and evaluated.

Experimental and latency modeling evaluations showed that for latency-sensitive

applications that utilize small messages and run over networks with NIC coprocessors,

the host-based multicast communication scheme performs best. The disadvantage of

the host-based multicasting scheme is its high host-CPU utilization. The NIC-based

solutions obtain the lowest and constant host-CPU utilizations for all cases at the cost of

increased completion latencies. A compromise solution, NIC-assisted multicasting,

provides lower CPU utilizations than the host-based schemes. Moreover, the NIC-assisted









multicasting also provides comparable CPU utilizations to the NIC-based algorithms.

The NIC-assisted approach also provides comparable multicast completion latencies to

host-based schemes for lower host CPU utilizations. Thus, NIC-assisted multicasting

appears to be a better choice for applications that demand a high level of

computation-communication overlapping.















CHAPTER 4
MULTICAST PERFORMANCE COMPARISON OF
SCALABLE COHERENT INTERFACE AND MYRINET

Chapters 2 and 3 focused on individually evaluating the multicast performance

of high-speed torus interconnects and high-speed indirect networks with onboard

adapter-based processors. Experimental case studies and small-message latency models

and analytical projections for both types of networks are presented. Chapter 4

summarizes and provides head-to-head multicast performance comparisons between SCI

and Myrinet. The comparison is based on the case study data presented in the previous

two chapters. All SCI multicast algorithms, Myrinet multicast schemes and algorithms

are included in the comparison except the Myrinet NIC-based communication scheme.

The Myrinet NIC-based scheme was excluded because it was the worst performer compared to the

Myrinet host-based and NIC-assisted schemes in terms of completion latency, and it does

not appear to be a viable solution for the software-based multicast problem for Myrinet

interconnects.

The unicast performance of SCI and Myrinet has been comparatively evaluated by

Kurmann and Stricker [42] previously. They showed that both networks suffer

performance degradation with non-contiguous data block transfers. Fischer et al. [43]

also compared SCI and Myrinet. In their study, they concluded that, based on their

performance analyses, Myrinet is a better choice than SCI since, unlike Myrinet,

SCI exhibits an order-of-magnitude difference between its remote read and write bandwidths.










Chapter 4 complements and extends previous work available in literature

summarized above. Chapter 4 provides comparative experimental evaluations of

torus-optimized multicast algorithms for SCI versus various degrees of multicast

work-sharing between the host and the NIC processors optimized for Myrinet. Both for

SCI and Myrinet networks, the multicast performance is evaluated using different

metrics, such as multicast completion latency, root-node CPU utilization, link

concentration and concurrency.

Multicast Completion Latency

Completion latency is an important metric for evaluating and comparing different

multicast algorithms, as it reveals how suitable an algorithm is for a given network. Two

different sets of experiments for multicast completion latency are used in this case study,

one for a small message size of 2B, and another for a large message size of 64KB. Figure

4-1A shows the multicast completion latency versus group size for small messages, while

Figure 4-1B shows the same for large messages for both networks combined with the

various multicast algorithms. Figure 4-1A is presented with a logarithmic scale for

clarity.

As explained previously, separate addressing is based on a simple iterative use of

the unicast send operation. Therefore, for small messages the inherent unicast

performance of the underlying network significantly dictates the overall performance of

the multicast algorithm. This trait can be observed by comparing the small-message

multicast completion latencies of SCI and Myrinet shown in Figure 4-1A. SCI is

inherently able to achieve almost an order of magnitude lower unicast latency compared

to Myrinet. Simplicity and cost-effectiveness of the separate addressing algorithm for

small messages, combined with SCI's unicast characteristics, result in the outcome where













[Figure: multicast completion latency versus multicast group size for the SCI algorithms (S-torus, separate addressing, U-torus, Mu-torus, Md-torus) and the Myrinet host-based (H.B.) and NIC-assisted (N.A.) serial, separate, binomial, and binary algorithms; panel A uses a logarithmic latency axis.]


Figure 4-1. Multicast completion latencies. Host-based, NIC-based, and NIC-assisted
communication schemes are denoted by H.B., N.B., and N.A., respectively.

A) Small messages vs. group size. B) Large messages vs. group size.









SCI separate addressing clearly performs the best compared to all other SCI and Myrinet

multicast algorithms.
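
Because separate addressing is nothing more than an iterated unicast, its structure can be captured in a few lines. The sketch below is conceptual; send() stands in for whatever unicast primitive the interconnect exposes (an SCI remote write or a GM send, for example) and is not an actual API call from either vendor's library.

def separate_addressing_multicast(send, message, destinations):
    # The root issues one unicast send per destination, G-1 sends in total,
    # so the completion latency directly inherits the network's unicast latency.
    for dest in destinations:
        send(dest, message)

# Example with a dummy transport:
if __name__ == "__main__":
    separate_addressing_multicast(
        send=lambda dest, msg: print(f"unicast to node {dest}: {msg!r}"),
        message=b"\x00\x01",
        destinations=range(1, 16))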

The NIC-assisted Myrinet separate addressing does not provide a comparable

performance level to the host-based version because of the costly SDMA and RDMA

operations. It is observed that the SDMA and RDMA operations impose a significant

overhead for small-message communications. Moreover, all three separate addressing variants

show a linear increase with increasing group size.

The SCI S-torus is one of the worst performing algorithms next to the Myrinet

NIC-assisted serial forwarding algorithm for small messages. Host-based Myrinet serial

forwarding performs better compared to these two algorithms. The store-and-relay

characteristics of serial forwarding algorithms result in no parallelism in message

transfers and thus degradation of performance. Moreover, the expensive SDMA and

RDMA operations cause the NIC-assisted serial forwarding algorithm to perform poorly

compared to the host-based version. As can be seen, all three multicast algorithms show a

linear increase in multicast completion latency with respect to the increasing group size.

Unlike the separate addressing and serial forwarding algorithms of SCI and

Myrinet, binomial and binary tree algorithms exhibit nearly constant completion latencies

with increasing group sizes. Among these, Myrinet host-based binary and binomial tree

algorithms perform best. Comparable algorithms, such as the SCI U-torus, Md-torus, and

Mu-torus algorithms, show higher completion latencies. The difference is attributed to the

low sender and receiver overheads of host-based Myrinet multicasting and the simplicity

of the star network compared to the more complex torus structure of the SCI network.

For a given star network, the average message transmission paths are shorter compared to









the same-sized torus network. Another reason is the increasing efficiency of the

lightweight Myrinet GM message-passing library with increasing complexity of

communication algorithms, compared to the shared-memory Scali API of the SCI

interconnect. The effect of the expensive SDMA and RDMA operations can be clearly

seen in the NIC-assisted Myrinet binary and binomial tree algorithms compared to their

host-based counterparts.
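
The near-constant completion latency of the tree algorithms follows from their logarithmic schedule: in every round, each node that already holds the message forwards it to one node that does not, so coverage doubles per round. The toy generator below illustrates this for a binomial tree; it reflects the standard binomial construction rather than the exact GM-based implementation used in these experiments.

import math

def binomial_schedule(G):
    """Return the list of (sender, receiver) pairs issued in each round."""
    have = [0]                  # node 0 is the multicast root
    rounds, nxt = [], 1
    while nxt < G:
        sends = []
        for src in list(have):
            if nxt >= G:
                break
            sends.append((src, nxt))   # src forwards the message to node nxt
            have.append(nxt)
            nxt += 1
        rounds.append(sends)
    return rounds

schedule = binomial_schedule(16)
print(len(schedule) == math.ceil(math.log2(16)))   # True: 4 rounds for 16 nodes
for r, sends in enumerate(schedule, start=1):
    print(f"round {r}: {sends}")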

For the large-message multicast latencies in Figure 4-1B, the SCI algorithms

appear to perform best compared to their Myrinet counterparts. This outcome is judged

to be primarily caused by the higher data rate of SCI compared to Myrinet (i.e., 5.3Gb/s

vs. 1.28Gb/s). It should be noted that the Myrinet testbed available for these experiments

is not representative of the latest generation of Myrinet equipment (which feature 2.0Gb/s

data rates). However, we believe that our results would follow the same general trend for

large messages on the newer hardware.

Among the SCI algorithms, Md-torus is found to be the best performer. The

Md-torus and Mu-torus algorithms exhibit similar levels of performance. The difference

between these two becomes more distinctive at certain data points, such as 10 and 14

nodes. For these group sizes, the partition length for Mu-torus does not provide perfectly

balanced partitions, resulting in higher multicast completion latencies. For large

messages, U-torus exhibits similar behavior to Mu-torus. The S-torus is the worst

performer compared to all other SCI multicast algorithms. Moreover, S-torus, similar to

other single-phase, path-based algorithms, has unavoidably large latency variations

caused by the long multicast message paths [31].










Myrinet multicast algorithms seem to be no match for the SCI-based ones for large

messages. Unlike the small-message case, Myrinet NIC-assisted binary and binomial

tree algorithms provide nearly identical completion latencies to their host-based

counterparts for large messages. The SDMA and RDMA overheads are negligible for

large messages and because of this reason NIC-assisted multicast communication

performance is enhanced significantly. Moreover, NIC-assisted communication

improves the performance of the separate addressing algorithm most, compared to all

other Myrinet multicast algorithms. This improvement is the result of the relative

reduction of the overall effect of the SDMA and RDMA overheads on the multicast

completion latencies. By contrast, NIC-assisted communication degrades the

performance of the Myrinet serial forwarding algorithm, because the SDMA and the

RDMA overheads are incurred at each relaying node.

User-level CPU Utilization

Host processor load is another useful metric to assess the quality of a multicast

protocol. Figures 4-2A and 4-2B present the maximum CPU utilization for the root node

of each algorithm. As before, results are obtained for various group sizes and for both

small and large message sizes. Root-node CPU utilization for small messages is

presented with a logarithmic axis for clarity.

For small messages, SCI Md-torus and Mu-torus exhibit constant 2% CPU

utilization for all group sizes. Both algorithms use a tree-based scheme for multicast,

which increases the concurrency of the message transfers and decreases the root-node

workload significantly. Also, it is observed that SCI S-torus exhibits relatively higher

utilization compared to these two but at the same time provides a constant CPU load

independent of group size. As can be seen, SCI U-torus exhibits a step-like increase for














[Figure: root-node user-level CPU utilization (%) versus multicast group size for the SCI algorithms (S-torus, separate addressing, U-torus, Mu-torus, Md-torus) and the Myrinet host-based (H.B.) and NIC-assisted (N.A.) serial, separate, binomial, and binary algorithms; panel A uses a logarithmic axis.]


Figure 4-2. User-level CPU utilizations. Host-based, NIC-based, and NIC-assisted
communication schemes are denoted by H.B., N.B., and N.A., respectively.

A) Small messages vs. group size. B) Large messages vs. group size.









small messages caused by the increase in the number of communication steps required to

cover all destination nodes.

Myrinet host-based binomial and binary tree algorithms provide an identical CPU

utilization to that of SCI Md-torus and Mu-torus. Host-based separate addressing and

serial forwarding algorithms both show a perfect linear increase in terms of

small-message CPU utilization, where serial forwarding performs better compared to

separate addressing.

Myrinet NIC-assisted binomial and binary tree algorithms lower the root-node CPU

utilizations as expected. These two algorithms provide the lowest and a constant CPU

utilization for all group sizes. Similarly, NIC-assisted separate addressing lowers the

CPU utilization compared to the host-based version, and provides a very slowly

increasing utilization. Similar reduction is also observed for the NIC-assisted serial

forwarding algorithm compared to its host-based counterpart. Unlike NIC-assisted

separate addressing, serial forwarding cannot sustain this low utilization, and it increases

linearly with increasing group size. The linear increase of the serial forwarding is caused

by the ever extending path lengths with increasing group sizes.

Meanwhile, for large messages, it is observed that as group size increases the CPU

load with the SCI S-torus algorithm also linearly increases. The SCI separate addressing

algorithm has a nearly linear increase in CPU load for large messages with increasing

group size. By contrast, since the number of message transmissions for the root node is

constant, Md-torus provides a nearly constant CPU overhead for large messages and

small group sizes. However, for group sizes greater than 10, the CPU utilization tends to









increase, caused by variations in the path lengths that extend the polling durations. For

large messages, SCI Mu-torus also provides higher but constant CPU utilization.

Host-based Myrinet binomial and binary tree algorithms provide similar levels of

CPU utilization to SCI Mu-torus for large messages and large group sizes. Similar to the

small-message case, host-based separate addressing and serial forwarding have linearly

increasing utilizations with increasing group sizes, where serial forwarding performs

better compared to separate addressing.

NIC-assisted binomial and binary tree algorithms again achieve the smallest and

sustainable constant CPU utilizations for large messages for all group sizes. Similar to

the small-message case, separate addressing benefits most from NIC-assisted

communication as can be seen from Figure 4-2B. Serial forwarding also exhibits lower

CPU utilizations with the NIC-assisted communication.

Link Concentration and Link Concurrency

Link concentration and link concurrency for SCI and Myrinet are given in Figures

4-3A and 4-3B, respectively. Myrinet host-based and NIC-assisted communication

schemes have identical link concentration and concurrency, so they are not separately

plotted. Link concentration combined with the link concurrency illustrates the degree of

communication balance. The concentration and concurrency values presented in Figure

4-3 are obtained by analyzing the theoretical communication structures and the

experimental timings of the algorithms.
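
As a rough illustration of how these two metrics can be derived, the sketch below computes them from a hypothetical trace of message transfers. It assumes link concentration is the ratio of total link visits to distinct links used and link concurrency is the peak number of transfers active at the same time; both readings are consistent with the bounds discussed below, but the trace format itself is invented for the example rather than taken from the experimental instrumentation.

from collections import Counter

def link_concentration(transfers):
    """transfers: iterable of (links_traversed, t_start, t_end)."""
    visits = Counter()
    for links, _, _ in transfers:
        visits.update(links)
    return sum(visits.values()) / len(visits)   # total visits / distinct links

def link_concurrency(transfers):
    """Peak number of transfers whose time intervals overlap."""
    events = []
    for _, t0, t1 in transfers:
        events += [(t0, 1), (t1, -1)]
    active = peak = 0
    for _, delta in sorted(events, key=lambda e: (e[0], e[1])):
        active += delta
        peak = max(peak, active)
    return peak

# Separate addressing on a single-switch star: the root's uplink is visited
# G-1 times and each destination's downlink once, back to back in time.
G = 16
trace = [({"root->switch", f"switch->node{i}"}, i, i + 1) for i in range(1, G)]
print(link_concentration(trace))   # 2*(G-1)/G, i.e. asymptotically bounded by 2
print(link_concurrency(trace))     # 1: transfers never overlap in this trace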

SCI S-torus is a simple chained communication and there is only one active

message transfer in the network at any given time. Therefore, it has the lowest and a

constant link concentration and concurrency compared to other algorithms. Because of











[Figure: link concentration (A) and link concurrency (B) versus multicast group size for the SCI algorithms (S-torus, separate addressing, U-torus, Mu-torus, Md-torus) and the Myrinet serial, separate, binomial, and binary algorithms.]





Figure 4-3. Communication balance vs. group size. A) Link concentration. B) Link
concurrency.









the high parallelism provided by the recursive doubling approach, the SCI U-torus

algorithm has the highest concurrency. SCI separate addressing exhibits an identical

degree of concurrency to the U-torus, because multiple message transfers overlap in

time owing to the network pipelining feature available over the SCI torus

network. The SCI Md-torus algorithm has a link concentration that is inversely

proportional to the group size. In Md-torus, the root node first sends the message to the

destination header nodes, and they relay it to their child nodes. As the number of

dimensional header nodes is constant (k in a k-ary torus), with increasing group size each

new child node added to the group will increase the number of available links.

Moreover, because of the communication structure of the Md-torus, the number of used

links increases much more rapidly compared to the number of link visits with the

increasing group size. This trend asymptotically limits the decreasing link concentration

to 1. The concurrency of Md-torus is upper bounded by k as each dimensional header

relays the message over separate ringlets with k nodes in each. The SCI Mu-torus

algorithm has low link concentration for all group sizes, as it multicasts the message to

the partitioned destination nodes over a limited number of individual paths, where only a

single link is used per path at a time. By contrast, for a given partition length of constant

size, an increase in the group size results in an increase in the number of partitions and an

increase in the number of individual paths. This trait results in more messages being

transferred concurrently at any given time over the entire network.

The Myrinet serial forwarding algorithm is very similar to SCI S-torus in terms of

its logical communications structure. Therefore, as expected it also exhibits a constant









concentration. However, Myrinet serial forwarding has a higher link concentration

compared to S-torus, and the difference is caused by the physical structure of the two

interconnects. In Myrinet, the degree of connectivity of each host is fixed at 1, whereas

in SCI it is N for an N-dimensional torus system. Similar to S-torus, serial forwarding

has the lowest link concurrency. Myrinet binomial and binary trees and the separate

addressing algorithm have a link concentration asymptotically bounded by 2 with

increasing group size, since the numbers of link visits and used links are the same for all three

of them. The number of required links to establish a connection between any two nodes

is 2 for a single-switch Myrinet network, which is the upper bound on the number of used

links. Myrinet serial forwarding has the lowest and constant link concurrency for all

group sizes, for the reasons explained above. Myrinet separate addressing also has

a constant but higher link concurrency. Myrinet binomial and binary tree algorithms

have higher and variable link concurrencies with respect to the group size. Binary tree

has a bounded fan-out number which decreases the link concurrency compared to the

binomial tree.

Summary

In summary, the results reveal that multicast algorithms differ in their algorithmic

and communication pattern complexity. The functionality of the algorithms increases

with complexity, but the increased complexity occasionally degrades performance in

some circumstances. For some cases, such as small-message multicasting for small

groups, using simple algorithms helps to obtain the true performance of the underlying

network. For example, because of its simplicity and the inherently lower unicast latency

of SCI, the SCI separate addressing algorithm is found to be the best choice for

small-message multicasting for small groups.










The lightweight GM software for message passing in Myrinet performs efficiently

on complex algorithms. Therefore, while simple algorithms such as separate addressing

perform better on SCI, it is observed that more complex algorithms such as binomial and

binary tree achieve good performance on Myrinet for small-message multicast

communication.

For large messages, SCI has a clear advantage due to its higher link data rate

compared to Myrinet (i.e., 5.3Gb/s of SCI vs. 1.28Gb/s of Myrinet used in this study).

Although the newest Myrinet hardware features higher data rates (i.e., 2.0Gb/s) than our

testbed, these rates are still significantly lower than SCI. Therefore, we expect that our

results for large messages would follow the same general trend even for the newest

generation of Myrinet equipment.

Myrinet NIC-assisted communication provides low host-CPU utilizations for small

and large messages and across all group sizes. Complex algorithms such as binomial and binary

tree, and simple ones like separate addressing benefit significantly from this approach.

However, multicast performance of NIC-assisted communication is directly affected by

the cost of SDMA and RDMA operations. The overhead of these operations limits the

potential advantage of this approach.















CHAPTER 5
LOW-LATENCY MULTICAST FRAMEWORK FOR
GRID-CONNECTED CLUSTERS

Previous chapters analyzed and evaluated the multicast problem on two different

SANs. Chapter 5 introduces an analysis of a low-level topology-aware multicast

infrastructure for Grid-connected SAN- or IP-based clusters. The proposed framework

integrates Layer 2 and Layer 3 protocols for low-latency multicast communication over

geographically dispersed resources.

Grid-connected clusters provide a global-scale computing platform for complex,

real-world scientific and engineering problems. The Globus Alliance [44] is an important

project among the numerous Grid-related research initiatives mentioned in Chapter 1.

The Globus Alliance initiative for scientific and engineering computing, which is led by

various research groups around the world, is a multi-disciplinary research and

development project. Typical research areas that the Globus project tackles include

resource and data management and access, security, and application and service

development, all on a massive, distributed scale. A collection of services, the Globus

Toolkit (GT), has emerged as a result of these collective research efforts. The Globus

toolkit mainly includes services and software for Grid-level resource and data

management, security, communication, and fault detection. The open source GT is a

complete set of technologies for letting people share computing power, databases, and

other tools securely online across corporate, institutional, and geographic boundaries

without sacrificing local autonomy. The main components of GT are as follows:










* Globus Resource Allocation Manager (GRAM). Provides Grid-level resource
allocation and process creation, monitoring, and management services among
distributed domains.

* Grid Security Infrastructure (GSI). Provides a global security system for Grid users
over distributed resources and domains. GSI establishes a single-sign-on
authentication service for distributed systems by mapping from global to local user
identities, while maintaining local control over access rights.

* Monitoring and Discovery Service (MDS). Grid-level information service that
provides up-to-date information about the computing and network resources and
datasets.

GT also has three more software services for establishing homogenous Grid-level

access for distributed resources:

* Global Access to Secondary Storage (GASS). Provides automatic and
programmer-managed data movement and data access strategies.

* Nexus. Provides communication services for heterogeneous environments.

* Heartbeat Monitor (HBM). Detects system failures of distributed resources.

GSI isolates the local administrative domains from the Grid structure by using

gateways or gatekeepers at each domain. These gateways or gatekeepers are responsible

for accepting and validating incoming global user login requests and converting them to

local credentials. Currently GT implements these gateways and gatekeepers as Condor

nodes [45]. Other than GSI specific tasks, Condor gateways are generally also

responsible for job submission, scheduling and resource management of local domains

and clusters.

Chapter 5 introduces a latency-sensitive multicast infrastructure for Grid-connected

clusters that is compliant with the GSI and also with other GT services. The following

sections explain the proposed infrastructure in detail.









Related Research

Layer 3 multicast research has been investigated widely for more than a decade.

Multicasting over IP networks, such as a Grid backbone, provides a portable solution but

is not suitable for high-performance distributed computing platforms because of the

inherent high IP overhead. Well-established and implemented Layer-3 protocols exist for

both intra-domain and inter-domain Layer 3 multicast.

The Distance Vector Multicast Routing Protocol (DVMRP) [46] is one of the basic

intra-domain multicast protocols. DVMRP propagates multicast datagrams to minimize

excess copies sent to any particular network and is based on the controlled flood method.

Sources are advertised using a broadcast-and-prune method. However, controlled

flooding presents a scalability problem. Also, each router on the multicast network must

forward incoming source advertisements to all its output ports. Based on the responses of

receivers, the routers will forward back prune messages, if required. This process

requires each router on the multicast network to keep very large multicast state and

routing tables. Overall, DVMRP is not scalable, and is only efficient in densely

populated networks.
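
The flood-and-prune behavior and the per-router state it implies can be pictured with a small toy model: the source advertisement floods along a tree, and branches with no interested receivers are pruned back toward the source, yet every router touched by the flood still has to hold per-(source, group) state while it waits for prunes. The sketch below is purely conceptual and does not implement the actual DVMRP packet exchange.

def flood_and_prune(adjacency, source, receivers):
    """Return the routers left on the multicast tree after pruning."""
    # Flood: build a spanning tree from the source with a breadth-first search.
    parent, order = {source: None}, [source]
    for node in order:
        for nbr in adjacency[node]:
            if nbr not in parent:
                parent[nbr] = node
                order.append(nbr)
    # Prune: walk the tree bottom-up, keeping only branches leading to receivers.
    keep = set(receivers) | {source}
    for node in reversed(order):
        if node in keep and parent[node] is not None:
            keep.add(parent[node])
    return keep

routers = {"s": ["a", "b"], "a": ["s", "c"], "b": ["s", "d"],
           "c": ["a"], "d": ["b"]}
print(flood_and_prune(routers, "s", {"c"}))   # {'s', 'a', 'c'}; b and d are pruned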

Protocol Independent Multicast (PIM) [11] is a step beyond DVMRP and aims to

overcome the scalability problem of DVMRP. It uses readily available unicast routing

tables, therefore eliminating the need to keep an extra multicast routing state table. PIM

Sparse Mode (PIM-SM), an extension of the PIM protocol, uses a shortest-path tree

method. In this method, "Rendezvous Points" (RPs) are created for receivers to meet

new sources. Members join or depart a PIM-SM tree by sending explicit messages to

RPs. Receivers need only know one RP initially, which eliminates the need for flooding

the entire multicast network with source advertising messages. RPs must know each other to