Mitigating CMP Memory Wall by Accurate Data Prefetching and On-Chip Storage Optimization

Permanent Link: http://ufdc.ufl.edu/UFE0021650/00001

Material Information

Title: Mitigating CMP Memory Wall by Accurate Data Prefetching and On-Chip Storage Optimization
Physical Description: 1 online resource (108 p.)
Language: english
Creator: Shi, Xudong
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: cache, cmp, memory
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Chip-Multiprocessors (CMPs) are becoming ubiquitous. As the processor feature size continues to decrease, the number of cores in CMPs increases dramatically. To sustain the increasing chip-level computing power of many-core CMPs, tremendous pressure will be placed on the memory hierarchy to supply instructions and data in a timely fashion. This dissertation develops several techniques to address the critical issues in bridging the CPU-memory performance gap in CMPs. An accurate, low-overhead data prefetching scheme for CMPs is proposed, based on a unique observation of coterminous groups: highly repeated, close-by off-chip memory accesses with equal reuse distances. The reuse distance of a memory reference is defined as the number of distinct memory blocks accessed between that reference and the previous reference to the same block. When one member of a coterminous group is accessed, the other members are likely to be accessed in the near future. Coterminous groups are captured in a small table for accurate data prefetching. Performance evaluation demonstrates a 10% IPC improvement for a wide variety of SPEC2000 workload mixes. The scheme is appealing for future many-core CMPs due to its high accuracy and low overhead. Optimizing the limited on-chip cache space is essential for improving memory hierarchy performance, but accurately simulating cache optimizations for many-core CMPs is challenging because of their complexity and long simulation times. An analytical model is developed for quickly estimating the performance of data replication in CMP caches. We also develop a single-pass global stack simulation for a more detailed study of the tradeoff between capacity and access latency in CMP caches; a wide spectrum of the cache design space can be explored in a single simulation pass with high accuracy. Maintaining cache coherence in future many-core CMPs presents difficult design challenges: the snooping-bus-based method and traditional directory protocols are not suitable for many-core CMPs. We investigate a new set-associative CMP coherence scheme with small associativity, augmented with a Directory Lookaside Table (DLT) that allows blocks to be displaced from their primary sets to alleviate hot-set conflicts that cause unwanted block invalidations. Performance evaluation shows a 6-10% IPC improvement for both multiprogrammed and multithreaded workloads.
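A minimal sketch of the reuse-distance definition quoted in the abstract, assuming a simple block-address trace; the trace and the reuse_distances helper below are hypothetical illustrations, not code from the dissertation. Coterminous-group prefetching builds on this definition by grouping nearby off-chip accesses whose reuse distances are equal.

# Reuse distance: the number of distinct blocks accessed between a reference
# and the previous reference to the same block (None for a first reference).
def reuse_distances(trace):
    last_seen = {}                      # block -> index of its previous reference
    result = []
    for i, block in enumerate(trace):
        if block in last_seen:
            between = set(trace[last_seen[block] + 1 : i])  # distinct blocks touched in between
            between.discard(block)
            result.append((block, len(between)))
        else:
            result.append((block, None))
        last_seen[block] = i
    return result

# Hypothetical trace: block A is re-referenced after two distinct blocks (B, C),
# so its reuse distance is 2.
print(reuse_distances(["A", "B", "C", "A", "B", "D", "B"]))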
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Xudong Shi.
Thesis: Thesis (Ph.D.)--University of Florida, 2007.
Local: Adviser: Peir, Jih-Kwon.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0021650:00001

52644c1994fc798b09d020ae937ffaf1
4a3037615222aa0ccd7319d878a257b8826fba22
6779 F20101118_AABDVS shi_x_Page_095thm.jpg
ebf4c9280cacdc076eb8529359285d7b
978981b2deda9cf76f6bb2c65ea4997cd8d6c304
7007 F20101118_AABDWH shi_x_Page_103thm.jpg
15feda9460427b6989428d15a25b6586
aa3d1411bc8bf7ce1a2fbf816eb17e7adc3889c7
26818 F20101118_AABDVT shi_x_Page_095.QC.jpg
d0931a0cbcb0fd99d25cda5f2e276b79
df28f32c5bc3be158a4a865982ab72093eed84cf
30081 F20101118_AABDWI shi_x_Page_103.QC.jpg
f7a6cab269176703a34a88e47f004846
fe9c8d3757013139d43249eb65a65cf17f42fdf3
6845 F20101118_AABDVU shi_x_Page_096thm.jpg
aa6fc0b3f8ef214f7ea3d73834091cf4
65730b4795ebed39364d24c60b5b9e70894d5d09
6630 F20101118_AABDWJ shi_x_Page_104thm.jpg
94c3273a508f6ac8ebd094abc64f8cd4
0e52539bfe994f3d9ff51392fe418f986ef3870d
26535 F20101118_AABDVV shi_x_Page_096.QC.jpg
473f729722d285a9a161d5747451bf4d
a709af956488630f2e99866e6b766c5e1b38dad2
27913 F20101118_AABDWK shi_x_Page_104.QC.jpg
dc9cae30e4ee7c528cc6a5bd023a6da7
442a819ff6de45f389e40673336a3e9b0d8f9d79
5973 F20101118_AABDVW shi_x_Page_097thm.jpg
d22b1171a0b5da6e9f3d9a0121f4b0eb
d0c1c765acc01bd7554c688b6f8cd8e0f53203aa
7017 F20101118_AABDWL shi_x_Page_105thm.jpg
cf238b0db85063fe3115cfce04815a2e
9301ad4d2d696ce011cae2ce451f948a78d373a1
24649 F20101118_AABDVX shi_x_Page_097.QC.jpg
53b13a78666b4fc9971a50608122b0c9
a4a19c2d261abde5ef4974842b25c4cc64da901c
30450 F20101118_AABDWM shi_x_Page_105.QC.jpg
acf3fe0a78fe397a1ccd36565979a51d
51c2e633047ce9746ea3f21defe21d907cd4df32
4903 F20101118_AABDVY shi_x_Page_098thm.jpg
c955dd58914f3421d139da10c2312d18
4a4c4272b0af1d8c1944b49299f6d2353a71dcaa
6989 F20101118_AABDWN shi_x_Page_106thm.jpg
16ca3b0b33d1b13a8c2331ecdbb43ae9
3a43b6d641b7c0f2c80676674cd55662f4bef4c6
20769 F20101118_AABDVZ shi_x_Page_098.QC.jpg
25df2beab1313ecd4417f62f66285dfd
b1fac6ae1b5f88165cedaee2ba0bc53b5f0f7130
30265 F20101118_AABDWO shi_x_Page_106.QC.jpg
d2c7ec31e3076fc49b798297198cb0e0
26022cc85f8c049a7dd81095267121e32b501c1c
1846 F20101118_AABCUA shi_x_Page_063.txt
9e61c162f2e7b5e662deed756f9f1746
148298954e9fdb7cd45c8b254f63ef4678d54240
1051950 F20101118_AABCTN shi_x_Page_016.jp2
1f21bdf52e7e2e781acc9e32b8239c98
45fc33142febaec1b2e09768f0db3977be8f4692
6097 F20101118_AABDWP shi_x_Page_107thm.jpg
a3e8d8a4df9063f2017e66aacb6b3c02
7a3027112ab6651bf03b20cd03bf98592caf4d7e
75150 F20101118_AABCUB shi_x_Page_061.jpg
1592ed00d47b38c3adc59c888ccf0755
19e85fff8b90af6bac7d4717f376d9622fb4056b
723 F20101118_AABCTO shi_x_Page_002.pro
2b7c44069fb5725f9cb1f54c130a52ec
bd283bbf33fbc6ee9446b8aa07ba55a6fd0894b8
25438 F20101118_AABDWQ shi_x_Page_107.QC.jpg
c858cce61084f462d0e950cc84a4efd6
673be8a49cce7464ae366645a1553a40793f00cf
7085 F20101118_AABCUC shi_x_Page_083thm.jpg
df1f208f557a31f3dc764280911cff7d
dd76d33abb660a5349447028bbd0d657b2a36da1
7034 F20101118_AABCTP shi_x_Page_084thm.jpg
38dce6f21981308eb0f7a34536862f4d
779273dbdfd844e4283848e38bf81018ade90cd2
F20101118_AABDWR shi_x_Page_108thm.jpg
4f98080f860c5dadbd580dc5416b8ebd
37d6b0512bd1900dc7e40b4c1aec74fee8fe205d
F20101118_AABCUD shi_x_Page_083.tif
3eb3479be564f29cc93a65a42c2ab211
d2710dc139b2f6cac32175a7de2f84674db4cfdc
F20101118_AABCTQ shi_x_Page_047.tif
2749b3da9a4136637b302458a06fde5f
541c1418242c503c9789a4e167f704b26c266372
6777 F20101118_AABDWS shi_x_Page_108.QC.jpg
d9bdcd9780aca9ee8e5005b94ebb29d5
2a5cc02ae78af343ddbf88fbe147c2b55d3137f9
125728 F20101118_AABCUE shi_x_Page_102.jp2
7dd65b80d394ff18297df27a681775fa
19f84e1572e17b5e49f05b4301ff6b65c4338649
124049 F20101118_AABDWT UFE0021650_00001.mets FULL
cb66bf947e61f4ec0d16e182c03cb52f
19a08c96832fccdfb6dce108df030bfc4f28561e
68406 F20101118_AABDAA shi_x_Page_011.jp2
ab5cc1868cfc24282b5ecd9845a60d98
05c7b251bacc9f785122d1f73268faf249b9b56f
6902 F20101118_AABCUF shi_x_Page_014thm.jpg
30d3ff06818cc8869fb4479cf4436270
781a2b495d84c829ef35a4746181ab0e4d2fe438
35232 F20101118_AABCTR shi_x_Page_092.pro
e60356d7d276823c0ffa32667d16ff3e
d05d1bad7c7d2bd9502f9d99e9b86ca5f5917a6d
1051965 F20101118_AABDAB shi_x_Page_012.jp2
71aa54a30ed1fac327644f45c073100b
f84eed98bec1ea010381b12c3a7bf12c46665683
4221 F20101118_AABCUG shi_x_Page_009thm.jpg
72ea5318da3b6e23ec959b93cd10f83d
89ab8a405fbfa0377e76405505467c72579aa74d
1869 F20101118_AABCTS shi_x_Page_064.txt
cfdacedabee21766f15245dbb7412489
466f0d2b2d36ef3db78502f268777ad7563d0985
1051961 F20101118_AABDAC shi_x_Page_013.jp2
38a9aa2abafb867be503d72896cbdcaf
0a2c36d71b8fc2cd1c33c8f714a2df5c096c160c
1051943 F20101118_AABCUH shi_x_Page_082.jp2
0048742ce4da58d3e71d1d70965f1e6c
2e63f9be86609386927ca415f530d00ffe5cb89b
43102 F20101118_AABCTT shi_x_Page_004.jpg
c70026a94ee9c9f1aad7850227c6739d
cb77e1ca0d2499059b1f90db2a9d27937171c43b
1051972 F20101118_AABDAD shi_x_Page_014.jp2
ebfbf0e90c05179e26a1a4c2702f6515
2f3c40f0c41555a73c85e564d37906af34d6b862
25183 F20101118_AABCUI shi_x_Page_046.QC.jpg
ea94853324c47981cff8ee4d3f558c39
94b2ff410d5b6071ad1ff0209cc93f23ec0fc736
28079 F20101118_AABCTU shi_x_Page_071.QC.jpg
1c5faf8c767e0bd8e8b22323f5c5a3cd
c33198a75ce349ebe3b8c9c1179c5bfac9a1f182
1051958 F20101118_AABDAE shi_x_Page_015.jp2
8819306826065222ab2cd779d466f01b
d55668692d487ac59588fc6f93dd230e0d7320b9
1051895 F20101118_AABCUJ shi_x_Page_077.jp2
793460a1a582ffc770aacca22693e51b
027de0179f2213239bde3475650afe0d2e88a03e
1051967 F20101118_AABCTV shi_x_Page_070.jp2
e7dca2b18c98adeb72700b71e1f14946
29a8d074fde68d0474ca5e02bbec6d7a585571f3
1051980 F20101118_AABDAF shi_x_Page_019.jp2
14ada67395116970cc7f1f808d6e09ee
22e7ce7caea847cbb2c3bb23507ad7da4897ad47
6292 F20101118_AABCUK shi_x_Page_089thm.jpg
2151e75aef81eb3c51a1ba4104db69c1
1dd10780c65295d153470b1d3dc89bf35d88b95b
73849 F20101118_AABCTW shi_x_Page_010.jpg
993a1d16b54afb18ae5144816cfd5b5c
f360d02dc918429f9bf3d2ef748164ac1091455c
F20101118_AABDAG shi_x_Page_020.jp2
f5c674828dba2c06a1bfd7780d8427ce
a3811eed530e128d526f83cff8770256922ea054
1641 F20101118_AABCUL shi_x_Page_056.txt
cf169bd0bd423bb6952a061b2e421a00
fb43410bd3e17377e33d46990c51c1da9da14742
2569 F20101118_AABCTX shi_x_Page_106.txt
927fc6b577882ec09697bcf5ec8e3929
5b4b8142de365875632b3fda36e296db0c721e00
1051916 F20101118_AABDAH shi_x_Page_021.jp2
05e37e32b3a4de91e6d5bd9846f0e19a
e446d3b06e6807e931ffec36385c16a6949ab5f0
1051974 F20101118_AABCVA shi_x_Page_018.jp2
1744e8bd8fa384798e607962be6b8475
7553a7ed4100697f22f5e8a8f5da2d0ef3ded1b3
F20101118_AABCUM shi_x_Page_093.tif
c5422d3a54120cda72986bcbb98d4a27
5d429ba9274dff9528b56eea43a0b1b5da74ec96
1051954 F20101118_AABCTY shi_x_Page_050.jp2
266436ae740f18a1e6648582f242c30b
fd36f4db0c218ce5f65fdbb12fd4ff2319da73e7
F20101118_AABDAI shi_x_Page_022.jp2
5179b0e654e18d1be729a231ebc75611
34b456b3de9350a2c2349a8987a17064751a5077
821731 F20101118_AABCVB shi_x_Page_060.jp2
23d7b882a635d22a47665a364979a53a
c8a83a3223f27ce5d22bf0c6dc4475032e50c023
2044 F20101118_AABCUN shi_x_Page_023.txt
5eb1d6841b8ff975548bd27e8daf13dc
9c4fcbce0fff0635637ffdee8a84c969ebadfe18
6855 F20101118_AABCTZ shi_x_Page_033thm.jpg
46218fa1057923cd93d7f1c2e0ba511c
eb751c31870c52dce49c94b69e46e7bee2906ed2
2609 F20101118_AABCVC shi_x_Page_101thm.jpg
e12537391d83842c23251fd2a53584cd
5f42f22ea24ab51e5d81cb0230e5e594acd5cfda
2174 F20101118_AABCUO shi_x_Page_033.txt
e67620940572af5c926c8145c53ff5cd
85630ce1b84d53c013b9961d81b834e6f60d3e0f
1051971 F20101118_AABDAJ shi_x_Page_023.jp2
d798b022de1b1e8b12f844038814aed8
fed2bdbd786fbd8822191cfe7d71078976dbddd9
25798 F20101118_AABCVD shi_x_Page_090.QC.jpg
1af5cb4f2b0893fbf8604ae503eef1b4
0e141155a3c5d2b0ad13ec34457bca97e51960e6
1051951 F20101118_AABCUP shi_x_Page_017.jp2
75438afe895c71f26021d83fea99c423
7f778b6eeb71cbb30708d3234f3afddee34aee1b
985823 F20101118_AABDAK shi_x_Page_024.jp2
5057ae20ac1b63cea7639fa253082550
6621bf65bce76ff86d305c4a9d7e401ebf3d1340
43279 F20101118_AABCVE shi_x_Page_101.jp2
ee0e30c53432ab024597f40f2274c7d6
59b732c38ef00ad3bddc1978f55a6a1e333f9b57
6601 F20101118_AABCUQ shi_x_Page_025thm.jpg
2cc83d65fa469372f149eff2d61e30cb
2ef7e212ef3748e54a96be755e8cac7084e80659
125178 F20101118_AABDAL shi_x_Page_025.jp2
1f9039ea40ed40847daa7a2401ad0be9
058ac1a7f7066fd8d20ca7ee78c98126d67d28cd
F20101118_AABCVF shi_x_Page_067.jp2
cd97abfea2d31041d79b8e9b2559f6df
4dd9de66a850bbfc5b2d210c1362ecf6f5cfb5fb
6490 F20101118_AABCUR shi_x_Page_085thm.jpg
f57f8c5870cd92e82d7a6deb182d179b
97db7e1563538044b5da5a58a15d94ba80e7a77c
F20101118_AABDBA shi_x_Page_040.jp2
14d56686a23c14036688e7f37e25b0bc
af1c6c43ad39678a63b3dff97a3f64e6ff3b9bf7
1051966 F20101118_AABDAM shi_x_Page_026.jp2
f0b9fa261e9fa218091b1f40782be182
8211d18671f6c0e93d908b11950e48c331c6f727
2141 F20101118_AABCVG shi_x_Page_078.txt
a970a7954b77ccf06908c8cc9d0ca7ed
6b1ffbb894422b267b84e67292639e0a93d84936
24274 F20101118_AABCUS shi_x_Page_037.QC.jpg
1e8deb2df7c1495f42765336e43ed849
07ca700cee6e807c3cb62a7d98981ff2c1dec5a8
1019664 F20101118_AABDBB shi_x_Page_041.jp2
169139b0c6ab66198a5c4f481ac86545
d2038a08ff3236218f3ca4a66729831a8e71c18b
1051983 F20101118_AABDAN shi_x_Page_027.jp2
4f0d78f26553024d255f97212188fa78
91084d9f33d715d1c44dc97d2f3d611a760cd119
46006 F20101118_AABCVH shi_x_Page_057.pro
55fd55dfd1c123301c1dcf5de1e9499a
3ce3bab72f6a09e04d13effdecdc71955d3cacd2
F20101118_AABCUT shi_x_Page_029.jpg
54d3622e45693dd54845f21e2b8ef16b
d13818223f74e2670c4b7a6dbb87f896e701258d
855423 F20101118_AABDBC shi_x_Page_042.jp2
fa04365fc6a7b4e76d4281154f394df9
d596ddd2699da9a39bc2ecb32221bcc152b3b9cb
1051975 F20101118_AABCVI shi_x_Page_084.jp2
c46885126d623c1297d1f67cd34823ae
8b17742e76fee912cc3b253b66a2b33be41d1cac
6757 F20101118_AABCUU shi_x_Page_026thm.jpg
2a4d5bf6e51073367435865a8a2e3968
4243e8a12c0600e630660503a8237b71702a1d77
F20101118_AABDBD shi_x_Page_043.jp2
e32b183eaecc82c5fe6f79cf32111cb1
c27ce38c3191cf549f9d280c635aca39daf73308
F20101118_AABDAO shi_x_Page_028.jp2
9526b4336069f381dce9c113d221f662
a74466cbef4d1cc1e1643d5a00711a6f513923d5
39501 F20101118_AABCVJ shi_x_Page_052.pro
2478a16ac1eb22ecbd1a73e92d263c1f
925da04557873316a458248056d8d311c2567fa8
80708 F20101118_AABCUV shi_x_Page_093.jpg
52234e58d08e48005410abc14f6bb20d
fe37dd88b8c6139a71c088a1e8ffd64bb44e62b3
F20101118_AABDBE shi_x_Page_044.jp2
95074ac4f85dd6df4322acc731ee2a26
53a740391deb9a509518db588d56e7d018ee6eac
614904 F20101118_AABDAP shi_x_Page_029.jp2
2abb8c0b20ade43c482a1886f1cd2687
d72e5845ecc44a2e3f4dc0ceaf307161746d80de
F20101118_AABCVK shi_x_Page_046.jp2
0df905e474e5e8d8b0920bea0a5ebeb1
10d36a571126de72a336900aac8e2cb51bf2818f
31355 F20101118_AABCUW shi_x_Page_060.pro
c8bfea9e0eb2573ce75500c7160b40e9
9a40cd628f1915df0d70ebd3968ac25442126606
1051947 F20101118_AABDBF shi_x_Page_045.jp2
cf404182829c12e97afc0801df64e040
f4dc44abbd9a14b3fc2b21e5817283d15f05d6ed
120954 F20101118_AABDAQ shi_x_Page_030.jp2
a9caf91156ea788517d8f1675ae6211f
f3d867ed35d3b271651b0223901e7f65c321bbf3
6361 F20101118_AABCVL shi_x_Page_042thm.jpg
2427b958a45a02c7cdaf6374189d911c
6a40464b902c033de2c7c33fc31e81051d66be78
F20101118_AABCUX shi_x_Page_010.tif
7c00d551ae9da113bb67d70132ba580a
94ffce72e2a659b7ff6bddab23ea29803c5d2f75
1039907 F20101118_AABDBG shi_x_Page_047.jp2
b629fcf319fa244d376c1a043a169fa0
0b2816399bd66ac4a717fdb7801cacd56cdfd558
F20101118_AABDAR shi_x_Page_031.jp2
9bb30e608dddd9d45e8d066cd8df4d9c
e6a04606f12cb3aea098098dd904009bd2626473
26976 F20101118_AABCVM shi_x_Page_088.QC.jpg
068c139069a676a8db7fe9cde4261d62
46785da0b3049d270d2a8f92a5bb59660f73e395
32222 F20101118_AABCUY shi_x_Page_037.pro
bf86d35c189d04ff48599e6644a53a09
0063fdd564267c19a93728ef76416e8f227171eb
959100 F20101118_AABDBH shi_x_Page_048.jp2
0d14e5db901b096c051f2d5cec9550ba
ca5e4bfadf8f1339ab8ea1234a2cbb378395e5a6
90658 F20101118_AABCWA shi_x_Page_012.jpg
5a8804bea82258d0d3a656e11c5f4d14
7c8c43df19a084ec2092c8da5496a21cb8a364d7
919788 F20101118_AABDAS shi_x_Page_032.jp2
e9979a226b3ace049a9489b7e798db78
28ebac3580c14bf659d6c94e76bc6f958bc24dab
780 F20101118_AABCVN shi_x_Page_024.txt
7ae7de7c12a594c5c7d75d458df146c5
5026661846b56d32583c742d634c54932084393d
2316 F20101118_AABCUZ shi_x_Page_026.txt
314ce5a553a7d5e81067c1a9c0725040
767a8d455e922adce118b0a6680ab341406851c9
F20101118_AABDBI shi_x_Page_049.jp2
17810310c9cd9f82dfb6799fce920412
f6ff5174ce9ea550a4b3c4399429bacbcecf27fe
86124 F20101118_AABCWB shi_x_Page_013.jpg
0710e50e32cf1d21e544634682aa7c71
e0d6a49206a5b5a015a218861afaf30842e13ea5
118875 F20101118_AABDAT shi_x_Page_033.jp2
fc1c5801ed0d292c3d6b876f87e1a400
aa3b94dc98dc370d393154d6a2272065c8e70f6c
160325 F20101118_AABCVO UFE0021650_00001.xml
5d7e739e1b099b90ead7da7180dcfbc9
d251943a176e141bd5c2cf690623dcb80bc3b5b4
F20101118_AABDBJ shi_x_Page_051.jp2
45d2f1ce7acc500c9897b69d8e93907f
adf39ae9288b6c52e4d5807efe7b5646859a910e
91414 F20101118_AABCWC shi_x_Page_014.jpg
790fe1fee0bc48029a59a4599ee4cd1d
fa6a638826040bd3a14b22247b10c68a1b3b3f47
F20101118_AABDAU shi_x_Page_034.jp2
7648c17cf18a033f7d31f44c7add36a2
2524ba95e708ae2f46639628436da2c22adf1078
91121 F20101118_AABCWD shi_x_Page_015.jpg
d1186c0679b4a86a0def1db6f7393dc2
c4398ca26aa1e184cea19371cd278b943c01bbf8
1051959 F20101118_AABDAV shi_x_Page_035.jp2
6dd93d6d4d6fd2a423c7b1378bb39865
1093afedbc1a15ef2cfd33026bcad2b625f43700
971130 F20101118_AABDBK shi_x_Page_052.jp2
7d5a35d349c0c2b9016157bd7c253ca1
490dfb47ff7b11259281957360cfe1b9e6c21583
94095 F20101118_AABCWE shi_x_Page_016.jpg
860cac319bb86386c131fe27a90f4074
a687f44bccf49ccf3dd2e7ce56b0ae7fefac793b
1051876 F20101118_AABDAW shi_x_Page_036.jp2
db3f51a95e866fcdda124b0a4ecb03ce
64fd9d77b92b5bed9d6bff8a4f99038cb3fa2f14
21619 F20101118_AABCVR shi_x_Page_001.jpg
caabb9f3797ffab8eb347a3244e79f7c
8c4fa5c9c77419b32a49e229f594c79de6703042
1051984 F20101118_AABDCA shi_x_Page_071.jp2
8e97e3279c4f5a0c3909fc6b47379483
9f5b7b8682085cbda8b1ebe33bd1453d4ab09d3a
846810 F20101118_AABDBL shi_x_Page_053.jp2
7e57b610d54f559d50dd1b0f9b13be19
831252f690817e8db80b12166a9fcdeacb762dbe
92463 F20101118_AABCWF shi_x_Page_017.jpg
200bcc9ddd568f6cce8f67d11c0bc164
ab57cb2ec6e80d56adfe45da1e888fe1b9562775
860704 F20101118_AABDAX shi_x_Page_037.jp2
2caddf1aea8ae9e39b1794790a51e26d
2a0b9d6f59da5530b510b20d4c852e4b6045fae6
3401 F20101118_AABCVS shi_x_Page_002.jpg
fa99f1e50b11c20d3b8764f815f2a555
c629fed86ae98180bff484c08e726a1cefa00af8
1003095 F20101118_AABDCB shi_x_Page_072.jp2
4219bf537a3a178e0d73a71234156d19
f3103770b496eee6debeb5409ccf17fc9a2f3353
112500 F20101118_AABDBM shi_x_Page_054.jp2
fac7341300f5702f8bded2d540e58ae4
2933c13dbf3a084b7e6ea58faa616533758d34fb
90411 F20101118_AABCWG shi_x_Page_018.jpg
a67f6b085dc97a00e809804ddbc564cd
7d0deb7172388abe45377152f78581758bf43b51
1051933 F20101118_AABDAY shi_x_Page_038.jp2
474c2a7ebd26b04ab6b39a734569bbd2
eacd96fb86dab65d11350c852a2a44e6dd494abc
3758 F20101118_AABCVT shi_x_Page_003.jpg
a25789f6c353c7a00ebf8eaba1910931
f768d2d9f6c1adf7d2e51e36a3e93c162a0eaf34
1050629 F20101118_AABDCC shi_x_Page_073.jp2
ffd04002a61d7c644814a68e6f277a35
10c01273e54b383db8ad3f3775a93195e1af315e
1049171 F20101118_AABDBN shi_x_Page_055.jp2
6e93fbbd31d00e93a1ba2bbeb13c73b8
06d86b61000b7f74cdcdc0ed9f71597caaedb41b
91521 F20101118_AABCWH shi_x_Page_019.jpg
b3ffa907f102e12d828f5a0000a95677
b72a618a79db6a27b031ff06f557b1611a0fd01e
912170 F20101118_AABDAZ shi_x_Page_039.jp2
92d9f6c61596ff2864922661d58a8da2
67c3dd1323413db1c56da6c5ecf9d6341ad7c63d
102649 F20101118_AABCVU shi_x_Page_005.jpg
243eb2c207b5959a57e3942508a3026d
28e962251a6bb82c9e8fdd713634b8542bd76fd4
904846 F20101118_AABDCD shi_x_Page_074.jp2
91f55f43b64c4074466def95e3839fe7
71d5ad9f7fc19efe2bb3042b5e464b8566774bf3
982934 F20101118_AABDBO shi_x_Page_056.jp2
b5bd2cf4eddc8e0d027ef72bf57673b0
61e5644e1b05fce99d398b923260c34f4fc76ee5
90975 F20101118_AABCWI shi_x_Page_020.jpg
9ca8e971a68918ef143b746e06ab6b50
729a3559b6a74a222ede3c8fcf3038e00d3912eb
55909 F20101118_AABCVV shi_x_Page_006.jpg
90288ce2933ecd7e3214ddf384eb4ae9
dc7d8cdfe0b277ce0b3a2ed6b1b5ac2a4aa36009
1051977 F20101118_AABDCE shi_x_Page_075.jp2
eeff82217967634b01475226f48e39a8
8c06bf57052dc1e128dd44c48693f23068e0d11c
1038491 F20101118_AABDBP shi_x_Page_057.jp2
e185bc2444863bfeb112901bf434c742
013c920bf17f680a03cca3f7815235df98a54075
93173 F20101118_AABCWJ shi_x_Page_021.jpg
43a8cac5c3d2c3a254b350485fe405b2
d8592050dd915582cf41805c38ddd3465e23fdd1
35879 F20101118_AABCVW shi_x_Page_007.jpg
d5630e06816768ebef43d91a005241b9
16674214bb1790e7e68296c21644508f9e82ecb6
10501 F20101118_AABDCF shi_x_Page_076.jp2
d3faa302cb235d9e4872d30344a47915
398555ea823d4fde26964d2d5978190297c80217
960953 F20101118_AABDBQ shi_x_Page_058.jp2
57d7aed5efabb0852939ebe4ffa2ba66
78128acb1e7ebda79ff12d9ef5bf3063eb8f5b6d
93216 F20101118_AABCWK shi_x_Page_022.jpg
e2380fef651986a8bd77a008ebd67f6b
69f2e554f4572577b2d116816c1cc756f24176d3
88152 F20101118_AABCVX shi_x_Page_008.jpg
1a273678cb69b13677d6d38e8ee497ff
61b055681a90500fed31b8e55705c3ce43bf5ada
F20101118_AABDCG shi_x_Page_078.jp2
c0a1a2949b1bb8b4fb53205a48093ef3
34ac3c407ec14126747f3f854d0ca230d85938e8
77207 F20101118_AABCXA shi_x_Page_039.jpg
8bba0e28d7655b6a4f8faaeb29ce31a1
7e0253c8b625f6f93333e7858f122f49876f71fc
1051952 F20101118_AABDBR shi_x_Page_059.jp2
b624f8cb4a5e908ab27cbe8d1f19b9f6
5495af5a06e2a98d44fb7297fd1c090d9250bf1d
86667 F20101118_AABCWL shi_x_Page_023.jpg
fabd58cc3ddc4731429978da69652900
2a73b092822d87824e103cf4739bbf8047debe2e
63645 F20101118_AABCVY shi_x_Page_009.jpg
9c1550add36a90fac07cf6901f423418
d49cbe65c146047cd727a477a403f0b295902e71
1051955 F20101118_AABDCH shi_x_Page_079.jp2
2d890bf5a9cb0b19f29e2f7f5d240bfa
0b9bd4535487f24dfbada848a944264a1d6d4033
862331 F20101118_AABDBS shi_x_Page_061.jp2
caafd4bea305b89f6d0f072c622d3acb
717436959a3e76b0d2e7b46d71ad919874c3fe2f
86411 F20101118_AABCWM shi_x_Page_024.jpg
73664d2ab7dd520af8fb9098d4d2ad70
d639a7c537cda3e2f9afaa0a642cd63b2034fe65
53012 F20101118_AABCVZ shi_x_Page_011.jpg
6083ab698b91d9b698b09203a688f4dd
755b1da0cdb7ba8823f341863604001c9be487d3
934231 F20101118_AABDCI shi_x_Page_080.jp2
a066996c586da5c9406d171a12e56d7c
fc2d1689de40bb8f9f50e3f2e2579c9784898728
91882 F20101118_AABCXB shi_x_Page_040.jpg
ffef5f209a52d3c5eb7eb160926767ac
73cb7e8a34f049397fed5c706c83c7de3dbfcd6f
F20101118_AABDBT shi_x_Page_062.jp2
b23a956cf8ebe33761e7e57bfde84ac0
4b51ec63a1988400796a5a0f635cfe7412283890
92640 F20101118_AABCWN shi_x_Page_025.jpg
694101b16de51e908aa45f8a3096d247
baf0382fffe5417d9b0153004ad86d859463c356
821943 F20101118_AABDCJ shi_x_Page_081.jp2
30acd2eb034be39298c72c104f845597
ad5dd98accfb9330a3650c30a72869ade2d29567
77188 F20101118_AABCXC shi_x_Page_041.jpg
4b25d82d1e6e8474764d1ebca9b819d6
20e1cc8ea6897671b7eecb2afbb61ef3ee857c00
847333 F20101118_AABDBU shi_x_Page_063.jp2
81a23162b52ff7dad15ee9f5807c60f5
ae14611146766411e5b9a8a6d6a0b740153ca0ed
92971 F20101118_AABCWO shi_x_Page_026.jpg
d574f9222adcde211521982ef1df71ba
0a51867abc896132ab257588f0acb64579723dee
119376 F20101118_AABDCK shi_x_Page_083.jp2
d3446a202645b2ba85bff5d003494836
3cbe7dfb2f4d48dfd88958cd03d787935db1a43e
74129 F20101118_AABCXD shi_x_Page_042.jpg
0be1309c07ddf79b42db9918080f6b81
320aed4bff696e1fa6b689cf475b214f31566e64
943912 F20101118_AABDBV shi_x_Page_064.jp2
81b74af3938e3d47cb298f27f958c39b
cd4904464d40929633e50441cd48d6355d0fabb6
90901 F20101118_AABCWP shi_x_Page_027.jpg
1566eb6fb57df54ce76f9156f38d3369
8c5284f8424d6d8d2bcd810afd78a42991094f5a
88407 F20101118_AABCXE shi_x_Page_043.jpg
6ad730ed50bb0f7fe952f3755aad7bdc
58e74378edddecb7e675ae0bff7e8797e49c766b
F20101118_AABDBW shi_x_Page_065.jp2
a5505d8e914822b828df96ad0e60959b
cab2e996dcb869e58880fe7a677131a4607796a7
91296 F20101118_AABCWQ shi_x_Page_028.jpg
5d10c7d74238f8fe7b75a19cb0c479b7
aece393e6df60795624e367cd86820335b092269
107571 F20101118_AABDCL shi_x_Page_085.jp2
45c13dd9d89f42385ad5a3aa80bab8ee
d88b1d8d3bcb2d24ec3c31f11d2b7114977b8a5e
84938 F20101118_AABCXF shi_x_Page_044.jpg
b7090b38feeea993dffa37b0131f3e01
a05703675bd2ece174cd0765b7c61fa0f6659f3f
120529 F20101118_AABDBX shi_x_Page_066.jp2
a0666505687f05cf4236ae0ada294307
b0d733442cf565495d6576085fb10bde19ea090c
92926 F20101118_AABCWR shi_x_Page_030.jpg
b679566071e31d26d2070b42f8d6e5e7
28edde36ce137d4f0586fad2483c8e1619f08be3
121292 F20101118_AABDDA shi_x_Page_100.jp2
9582cc5c7cb939391bdded5c08bad3a4
fc801c895f8bbdf546b1f597b3ed8ecc199d029c
F20101118_AABDCM shi_x_Page_086.jp2
8088529b6a66a94738af1e26499e09e8
0a085e120f1f54e9ab80633830778deebc712197
89387 F20101118_AABCXG shi_x_Page_045.jpg
60ec0b81c714ae227b8ee6ee5b5bf1db
a1a624b5224b9c4e0a35dec1935fc5a7a9ac05d1
931691 F20101118_AABDBY shi_x_Page_068.jp2
f4e2a6955debce553a4307d27664810f
4d9c9f31a528df036d05534a410af905c1c3b0fc
87652 F20101118_AABCWS shi_x_Page_031.jpg
cb898302c1a6e4493fd00388b98df3f2
7ea5ae6dcc1607d6f761bc17ea8fa7c3822ed13a
139937 F20101118_AABDDB shi_x_Page_103.jp2
a834342ab89acaede2f86424c5d1aceb
1ba213313fd811a4647bd850f20ee2865c5d7a72
F20101118_AABDCN shi_x_Page_087.jp2
1cf9d7f8ed5acc97bf23a0a6a84b413b
27a329a537a262482739a4bf2ad3a1a92ad46bde
84134 F20101118_AABCXH shi_x_Page_046.jpg
2574179329b95a1931db8cf1e0265616
f297f48fe30410277cb1f630cb39e152a774c26a
1051976 F20101118_AABDBZ shi_x_Page_069.jp2
5ba3b403164e159e29051b6d155cacbc
33291ec47e61858814c8af0aa85c9be2e329c3a3
80293 F20101118_AABCWT shi_x_Page_032.jpg
4ad10f35ebdf9bd553410a9a9bc268fe
616780e6c0dc5e00e6db2806b7d58d669168a69f
128453 F20101118_AABDDC shi_x_Page_104.jp2
6cfab77846be553cf18ab2fd06306840
8e85e6d0421b811e50a9bbd0dd1239f197e01b4b
1051942 F20101118_AABDCO shi_x_Page_088.jp2
ca0b4445663b6895f9d7fed36b3e9daa
d1246122b923ef424e3d2d3df9a2f28db5cf2d6a
80073 F20101118_AABCXI shi_x_Page_047.jpg
fabb613f4aaf9ad8c4bac9bcb3ff14b3
5b6fe60e2b78dced4599944c3935da26c40a3cfa
91509 F20101118_AABCWU shi_x_Page_033.jpg
9c8d02584e732a943e12b57bcc5eda70
855bd8445223e1d53254421a1c66dd03c4f7bc23
140286 F20101118_AABDDD shi_x_Page_105.jp2
2cd66cd0ae5c1dd00567c6637db20edd
3cb93f16547909ff6020ecc23a521f0c14c4593a
F20101118_AABDCP shi_x_Page_089.jp2
ea5ed7f69c037a8a62654bceb177f952
146e4323006f0da45ed1cc48ed4bc5d38bccb773
76673 F20101118_AABCXJ shi_x_Page_048.jpg
51ed1df677c662ec70f96706f4fae333
652a0abfd4fde940193f34fc0433a467d871b9b6
82986 F20101118_AABCWV shi_x_Page_034.jpg
8bcdddd52878bff0aee5ca4798c2d1d3
9c99bd71702f75ca0055223600ae81f2f1e87e78
139626 F20101118_AABDDE shi_x_Page_106.jp2
c8d9fa85af490c89adc9ffc171d2da63
fca84bb62601c8aca3f6419e6f1d351fb7341953
1051926 F20101118_AABDCQ shi_x_Page_090.jp2
15b9134960ad86c5316131a01ab7a9f8
856d67c795ba2bbba6b5ab5a5709e8d1ab9a83d7
94224 F20101118_AABCXK shi_x_Page_049.jpg
84fe9eb49d71427efcbf1d5e5de91eb1
c719607e7d5691c4e9807f799bfd0154ae23c40a
79867 F20101118_AABCWW shi_x_Page_035.jpg
f898a7eb1763ad2c1d91812c92c2efef
b97758f402b827f47c1e3d6c4915dd438bc8bfa0
120517 F20101118_AABDDF shi_x_Page_107.jp2
4ad0a3a4e5c40129296964c07a2f9cd9
d80ce11f575e96f8ce676f3bd1fa6bcea0302733
93019 F20101118_AABCYA shi_x_Page_066.jpg
3eba918a40428a4f64637afbf0b13ce3
3a4f3a27556238da8d9bd1d41751f65e9cd7a004
F20101118_AABDCR shi_x_Page_091.jp2
0038b83a52cca66957689e777dd5356d
d18bdbcc6247d667154b50335daf95cccac8c8d1
79291 F20101118_AABCXL shi_x_Page_050.jpg
7dd1a47446e45e2a22e98ff672186fcc
0d707e803604010c5f31879af0eadb5b92677a6a
93148 F20101118_AABCWX shi_x_Page_036.jpg
71524c742eaf791c3cd5d6ac8bc47296
6774397aaaf81f8559d3e6a1fd361938dce81b25
27458 F20101118_AABDDG shi_x_Page_108.jp2
0abf060c94981634c8468b7c2b955c15
d21fd8aae8b8322eb4f74a233df6672f73fa5376
83754 F20101118_AABCYB shi_x_Page_067.jpg
4b130973bb1e06a907bbdadf3a20cdec
920b477811803d43165d60d7123629760c8b9a38
983343 F20101118_AABDCS shi_x_Page_092.jp2
3e76cb160d9046bc6ae54ddf6065aaeb
ece7928b08a4740374d5406e0068587513e1d136
91653 F20101118_AABCXM shi_x_Page_051.jpg
b3b64dc9702fe4ba1241635cb96f1526
ea25de8a608d9aa94739e04afcd856422ff3a541
74506 F20101118_AABCWY shi_x_Page_037.jpg
8dcf9c0b6eea6d9b41f3d9e09101c888
c1ad735600d498f16f38abe217534859b5943ad7
F20101118_AABDDH shi_x_Page_001.tif
b1fd171b4da28e3bb537ddfffcc50e0d
5cb5b97e2448209f946c735e9b3d7fd0d13545c8
1051922 F20101118_AABDCT shi_x_Page_093.jp2
466b784099105f8d3d6fbdd438caf3b4
fe54b44d4f7c75155702c466db6adbe46fcc1c6e
77873 F20101118_AABCXN shi_x_Page_052.jpg
fb9a42a1b637a8848a25d0cc1ed253e2
fde6f27eafb15fb424821d966ffbdab1597c3656
92851 F20101118_AABCWZ shi_x_Page_038.jpg
115da31c187489f54a5dbfe6550f26f0
6880a63958e5af1e9c7e283a1ba407dfea9e8dc6
F20101118_AABDDI shi_x_Page_002.tif
497bd286b0731e8e53c8fc0a57b5c8c7
2c9dea247e6664c5e4411225d26c1422c52fc2e2
73093 F20101118_AABCYC shi_x_Page_068.jpg
684b4c317d042b982a0eab3e58f4abad
1fb09608854a2cc6fa3323dcf7c32908227bc1ff
979799 F20101118_AABDCU shi_x_Page_094.jp2
5834cfbc1519bbded7f38806235223f7
6647f8ce673b403f1f4df72881489cacc0e8c9a9
66729 F20101118_AABCXO shi_x_Page_053.jpg
1c28890a949db86d0030dcdaa7410ccb
e9d14aa6806701274c3c92f7cf473f35d3eb2073
F20101118_AABDDJ shi_x_Page_003.tif
afadd1ce6b0db89c2d4ec1c6c480b511
78959ae634de837824972be0ce6099e09dd19913
86059 F20101118_AABCYD shi_x_Page_069.jpg
77f6898664e49e582c31216674cfec2a
9530d8b68e8bc2d8a94463ced820018de1f55340
1051956 F20101118_AABDCV shi_x_Page_095.jp2
086a153f4358f360778cfacd8d42424a
d65e8be1cc5cbafc87bd718c55a301d6faff2bdc
86407 F20101118_AABCXP shi_x_Page_054.jpg
df8ef59101a84425d0740de9f91aa871
7cdd4dd1eb56368264d08f2d01b1e6aae9502fe7
F20101118_AABDDK shi_x_Page_004.tif
19d0f0a3eb1b7bf30a1e9f89d7290d55
3b8e35bf03bd9d4a1df01a0d0c08028821210785
83084 F20101118_AABCYE shi_x_Page_070.jpg
460b50904ce2076b747be8b26177a4f1
0863488761caa1e0dccef37bd5cd243bb2650705
107030 F20101118_AABDCW shi_x_Page_096.jp2
f70d04c7509c3b1875317f4809958ad6
ca9c5f87fe37a8a1b29061b541b86292dfde1494
79721 F20101118_AABCXQ shi_x_Page_055.jpg
06607498241d97d026ff2ced35159bdf
8dd2543a8b2861dbce82e44cb31ac5bb4cffa7ea
F20101118_AABDDL shi_x_Page_005.tif
448e30903c8f92521da3b5b1d98e4365
2a402ee9f9059155f0940847fdcadfafab510a7f
90476 F20101118_AABCYF shi_x_Page_071.jpg
e0430ebcff740c5b5278d61ce3c974b6
7cd1ce22a1a50e4d4d190232357faa85771ded33
1013474 F20101118_AABDCX shi_x_Page_097.jp2
17868876d34a7c8bfd145e11f413cd77
9c4595c4ce7d25c3a8d6db7e94e0955f0cf0dad6
76440 F20101118_AABCXR shi_x_Page_056.jpg
8eafa8c391e1d104c8ec79bae0c17fd1
d195d8625bfa99cb2ae0f56bbbda19637e2fc6a8
F20101118_AABDEA shi_x_Page_021.tif
3a0002f5d72036df4bb2dd25337996c2
4b694205e207a651b4f28d6d229cf14ba798b728
79640 F20101118_AABCYG shi_x_Page_072.jpg
c85e2989113d8166a821739baf0f8526
160d57d66f29cc55729d0a46e7468ac551ff0ec5
907559 F20101118_AABDCY shi_x_Page_098.jp2
5d023c9adfa86d6503cd2679948972b4
fa3c2072ee3bf012051d3f387832b0d2fdbc2f55
78710 F20101118_AABCXS shi_x_Page_057.jpg
c1ec017f82361f635212436338c12604
235f89cdcbd2de795b291c24b1d417f8eddc0e78
F20101118_AABDEB shi_x_Page_022.tif
ac9aec58430409ee69cd47cf57d7f452
35f0b159d326d8ecf1fed96529c66a22f91f94df
F20101118_AABDDM shi_x_Page_006.tif
1e36e820992e8253d34fda8a3953714c
088b7a3bf2189a7143b944d44de52474312aacd2
80363 F20101118_AABCYH shi_x_Page_073.jpg
c20797a2392aceb7b2175731a156527e
dc311fcddd5967a392f7afc0b4e972ca32b2864d
112477 F20101118_AABDCZ shi_x_Page_099.jp2
e801501c0a752b4022dbf31baf2d19f6
4d799bbabb44d5fe767ddbbc9ceba0e090eb1825
70714 F20101118_AABCXT shi_x_Page_058.jpg
7ddecf90a63726d9f4591c461b4455d0
902f3e3f342f57887df16384b798187d54e520f0
F20101118_AABDEC shi_x_Page_023.tif
a2fcd2da54ccb237fb9c715dee66bcb2
5abae29e78ae3c8a9a7cf1388eaf038df334d804
F20101118_AABDDN shi_x_Page_007.tif
ba0be5f2119769ff74539fdd2b0208e1
bdda2cd23ace27f853553be5100a25a044d79440
72482 F20101118_AABCYI shi_x_Page_074.jpg
ee1917f9f1eafa164efcf99b46035a92
b1842e50e6103e370f6d8d1e311ddb88b36cd1e1
89895 F20101118_AABCXU shi_x_Page_059.jpg
ccff7402ca982123ce365dd0dcdb6ee4
27cd99c82da42eef2d258486f39c3c43995a1d96
F20101118_AABDED shi_x_Page_024.tif
a6343aa70e2985bb0e7da0a648a471b1
d9c0c6d37344027e71923d4c702a05b546ee65d8
F20101118_AABDDO shi_x_Page_008.tif
c4f1c29c5d64fd9459d681408a384ada
04f270576012094dc8764c9042178978338f284a
90315 F20101118_AABCYJ shi_x_Page_075.jpg
e7870e2e487700d5e8046721add342a3
0b7f19de9bb6a14f5841127588b7ef559c9497bc
67682 F20101118_AABCXV shi_x_Page_060.jpg
2b2d7003cf8d3aa28a05ff52a8ba7c3f
63c7082d432779864b3fb9d8c3a0e55198184c13
F20101118_AABDEE shi_x_Page_025.tif
4213d466071259ebbb4c4c9a087bf03b
ee2ecb37fe6ffd3507903ca963f989b4f955c83c
F20101118_AABDDP shi_x_Page_009.tif
1e0235b50c03e23769981194feb782bc
fa79db5373e3bab5f573fb44ef1687e3d1dba118
7642 F20101118_AABCYK shi_x_Page_076.jpg
77c38c039607590cda2f3b618fc3fe4b
f0f3b82e4d36e10bfc4e2f986ced5381134a77b6
87856 F20101118_AABCXW shi_x_Page_062.jpg
2ff8bfa24b8529bda18060fbbc3e42fb
896e42aa2867cff806b95c0b7212f2e3178df343
F20101118_AABDEF shi_x_Page_026.tif
4743f31dce0bacc04ca3de457f12a5d0
09f5228426e0b4581b0fd129685626a88d3ddee4
F20101118_AABDDQ shi_x_Page_011.tif
61521d9da4e7bfc8b577ef768f1e565f
2d9f12cc4c637369ac1058db2919c491345f1c56
96978 F20101118_AABCYL shi_x_Page_077.jpg
02d38e464ccb1cdafa538b1f3573c5d6
061e06e3dad6a9a27e52e57cd2508e0c4ef7286c
73643 F20101118_AABCXX shi_x_Page_063.jpg
ed935633f889d26c40a8112547eeb24a
a4821422609d0fe3c8f0a814623e19d455472441
F20101118_AABDEG shi_x_Page_027.tif
df69e0ad1e2123f504c7e8545e8f7fb0
a0df891318e8fd0f7e62eec00ba744f715ca5342
77147 F20101118_AABCZA shi_x_Page_092.jpg
7edbaf75d2f0baa6976ffc505d6afa55
c27cdbd5d8867b38b1d15ec0eff5a8d9ddea1701
F20101118_AABDDR shi_x_Page_012.tif
e66ac913bb87233b836326ddb357c6bc
834ed87d22054b2f667e3f36f1b3f6819aaf54af
92155 F20101118_AABCYM shi_x_Page_078.jpg
0767fd5c6d62cbc03eb5b95858e96b36
bee9911531f7876eb5989ee4d5d8f38ecc2dc68a
84424 F20101118_AABCXY shi_x_Page_064.jpg
727b6546de1c41ff1a51ede2a6357f2b
cfdd849255b95a40e1000074761e10581b5b45ad
F20101118_AABDEH shi_x_Page_028.tif
a42624db92e2fbf8b4d56641557f4292
cd2ad64d421a3306d7f44a6af8077ac5abd17c4e
80003 F20101118_AABCZB shi_x_Page_094.jpg
18873a192b7ed27eba1b9d1fc55b1094
68e86fcea9d1b09e3a1a0481b5746eb1243ac358
F20101118_AABDDS shi_x_Page_013.tif
ae0431eb8bab90d4b0546e472afc0355
7b2ad94a74143b0db49459f7d4065ea027a0a6c5
93359 F20101118_AABCYN shi_x_Page_079.jpg
9c1ed8cdefc265d9d6b6fc6f5799766c
42b7fbdc67bdfcdc777bf52bd0e6155f823b7aa9
92161 F20101118_AABCXZ shi_x_Page_065.jpg
22fcbf70b24fc7c36bece135b3a576ba
03ccc4e6732133c015815c7ad5ac5305cec6559e
F20101118_AABDEI shi_x_Page_029.tif
4f0383d33c2d6ee5a0e79dfa9931fc8d
5a3b4e8f36ee60d14f466148b1328d1220dbc286
84115 F20101118_AABCZC shi_x_Page_095.jpg
6c4bb6c4a3e41f16bca0d2dc72858dcb
e1f95685d4413565d6282b0e0dc0ddce39b1259e
F20101118_AABDDT shi_x_Page_014.tif
72e17a2493198e3601988f6cec1e1c8f
915dfd911bfec0dec7d3bc168c8fee021c74c9b7
F20101118_AABDEJ shi_x_Page_030.tif
83e6271a0b90c3ce20e0ad7e62a2338b
265e51c60bd4c9f1f1f025c3f98f989d6096a631
F20101118_AABDDU shi_x_Page_015.tif
2ab02fc4f51d7b69a321988c1f2df2ba
1de24da4244b8a277e2fc4b868915b8eacd6c46f
76526 F20101118_AABCYO shi_x_Page_080.jpg
2b0c232b545a14e0f5c8dd13be6f2cfe
f0f125629b1d6c791b04cf993db7334c54b5a0b8
F20101118_AABDEK shi_x_Page_031.tif
a2f5bde51686de6222162dacb4125101
f5ef5698853ebc740389ea087c482391146d318a
85617 F20101118_AABCZD shi_x_Page_096.jpg
289a616ab707a2bf5807d46e41f55eac
40dace8749240fe97af7ac84ebddbe5a0ce8cc4a
F20101118_AABDDV shi_x_Page_016.tif
fb1044103e071735cc4ffdff74751799
9f926c3cc822059e19e21889934d14da12cc193e
70910 F20101118_AABCYP shi_x_Page_081.jpg
edc06df5bb5145dcb364b1e94853b926
561aa6ad0f8dcf120645829efa8e030304ff3a8c
F20101118_AABDEL shi_x_Page_032.tif
f1a9e1c3a984078d2f9cdd6aba55e652
0f785bae51211817976539a1b55c0c767bfa487b
77150 F20101118_AABCZE shi_x_Page_097.jpg
a51e3d814ed2cdfcafa4ed1627ee332d
294abc6cdbcf9e6d3128a04aa4d142b2e5762f24
F20101118_AABDDW shi_x_Page_017.tif
dee091c81dc8025bbf5a5552e56f54e3
0f0903735e02263a76765802ec24d558465ade11
92315 F20101118_AABCYQ shi_x_Page_082.jpg
32bb76625dfbd282a23bd3c9063dd914
a450b9353388abdc4a1b7cbd9550a2e85262ed4f
F20101118_AABDFA shi_x_Page_048.tif
88ce07374f36b6638712341b6789e6e0
75c29eba97a64048c9f59bf30fedd997592ffa74
F20101118_AABDEM shi_x_Page_033.tif
bcf87aac4179087030f660768e22382b
f390060344207b6c23a2e1cf8964313887106c48
71202 F20101118_AABCZF shi_x_Page_098.jpg
2eff6722c523a2aeb08eb289b4e24de5
62d10be0078c654fb895782f4a806bdb56106c60
F20101118_AABDDX shi_x_Page_018.tif
a4f062fbe315b56b2bb9ffbc8390ff3f
8173b57d6b36c3f10e665fd3475f3fa5e7121ae1
94305 F20101118_AABCYR shi_x_Page_083.jpg
bc90e1d5564ed105a214a6d04be9a790
838e2bbeaecd8c8701d50ad3aba692d25ab0014d
F20101118_AABDFB shi_x_Page_049.tif
096ee68eeddf49325b8c9a35700d0fe0
9f3b9ba64a137043260408bd4611c99411ae4fe5
86130 F20101118_AABCZG shi_x_Page_099.jpg
11804ab6a18f223ae9f866b79d943026
faee72524085c23c164200eb56ecc74431752f51
F20101118_AABDDY shi_x_Page_019.tif
887e53668c563aeddd1affdf817f350d
ac16da98c98dc3ee3c03c2b7e4e3e65147664ac8
89999 F20101118_AABCYS shi_x_Page_084.jpg
52cf06093ed6b5545849a0f75919319c
485aefabf1ce756a62082eaf36697fdd3f6d6c93
F20101118_AABDFC shi_x_Page_050.tif
b584a2cc4f306ab6f2ca0e77006a39d6
e7be51120047d1a053ea0ebeed226af956347563
F20101118_AABDEN shi_x_Page_034.tif
cef2961071e35ddc611695e91611df87
ae78b144c2fc44b26410e7fb91b83736f8b3718a
92735 F20101118_AABCZH shi_x_Page_100.jpg
0f4c2408a5e3f37604ceaf1fda50e8ec
5fbeeaa673d8617b5ec6e95adbb49677bb46ba79
F20101118_AABDDZ shi_x_Page_020.tif
970a6a9e2f34b6390a7bf398c5eed120
baafc960530ffe0ce4f9ecad6ce41b802cb4c4fe
85614 F20101118_AABCYT shi_x_Page_085.jpg
ef37ca14f1f4845b14f85afbdb67fe90
df93ff1fff3da20d3a72506f0cd69f85858997ee
F20101118_AABDFD shi_x_Page_051.tif
605dd7e80b77a472a73b16aac7643264
016933903f4205478d44e74b9d02cd46427c82b0
F20101118_AABDEO shi_x_Page_035.tif
14786bda46f1a926758d430827c10a0f
3288511b8bbf0843baa8473621501ee2f158e622
33164 F20101118_AABCZI shi_x_Page_101.jpg
1ac58e8a0e6a0bba078bb92ef5c1dbfb
f22c48ee1e8aeb077476af6df89dcb65cc56da23
88944 F20101118_AABCYU shi_x_Page_086.jpg
3fc5541a82b61a94deaba6499bd236a0
68a7c448f14242a2e4f01cdefb502ae5b1c8c713
F20101118_AABDFE shi_x_Page_052.tif
fee95f8c9f6090d1ff36ea1fc0e04250
e9005221ca7d5dd487c34d1a9b8882453817bd68
F20101118_AABDEP shi_x_Page_036.tif
afaeadf0d8d572efc08d813d37fc8d95
7629c0c0c057dc13ffc6179eff287f6316b7e13f
96436 F20101118_AABCZJ shi_x_Page_102.jpg
cf3045bbfdce046c4acb250c62229a4f
9294fe492eef1d5420244d5d3502308be3bbf0f1
95635 F20101118_AABCYV shi_x_Page_087.jpg
1b81b7854517650d48f1afeec0676fb6
b5756c1bc5df89ed790e9d6798903b074a213bcc
F20101118_AABDFF shi_x_Page_053.tif
b194ea98bc119c2446546aae5f68189f
d71643c39cfb4a4b74910d4e219fada00ec53deb
F20101118_AABDEQ shi_x_Page_037.tif
2c516846911e9f10bcb743d0e0e24f20
653354d4314a109671c95fb96c7777f40c214b3c
105659 F20101118_AABCZK shi_x_Page_103.jpg
6655995e76d646eb92e44d07d8bbe142
382aefc77a48ff2be14bff2093cd67ec543a2865
85067 F20101118_AABCYW shi_x_Page_088.jpg
fa38c1926ef202e1b319d54a0305f76e
80df5f27f66525487025652c88b935f119fd1078
F20101118_AABDFG shi_x_Page_054.tif
e23510f96299f1864a018b9b3f02db2d
ef20262846477d6ad472720e17857f2b15b26802
F20101118_AABDER shi_x_Page_038.tif
c4b1d16aa11d59ce08bed32d74ced383
12e87666daff7d863da4d8f630ecfc67687418c6
97859 F20101118_AABCZL shi_x_Page_104.jpg
b195f91a8502897b23d6236c6baee41f
5b5f66abca6083b554cd9f3afcc5bb0e000150d9
87333 F20101118_AABCYX shi_x_Page_089.jpg
409f90637236d4b1083fbd4c9082425b
829fb7b1421a8caec129cdd24b0b771c3f9ac9ce
F20101118_AABDFH shi_x_Page_055.tif
98b0bc9a48a3b869d169cec374bfede6
ef4f49b3ecf40040c497ed653ce6d21ec94675ae
F20101118_AABDES shi_x_Page_039.tif
eb33c046211e593d593e431413cc05bf
b33221c22042093c2ac3ed003b546ee5b88c8d9c
109947 F20101118_AABCZM shi_x_Page_105.jpg
43ebca186f3f4ef87f3adaf9462bf8e7
146f3750e2c8b4e687941c61b4442077dd919523
81669 F20101118_AABCYY shi_x_Page_090.jpg
62173eeefcce2cdcce3f827f90347ce5
39056dc320106cd7a89281cd61abcb68f48b7dc7
F20101118_AABDFI shi_x_Page_056.tif
a390ad39d4dc054b0598b5732c4b9c9b
9f62d4b0b33d1651de31adf502943f35ab8fd3c4
F20101118_AABDET shi_x_Page_040.tif
8c9e5ff07adccd419d17e66f8ba45c1f
6cc0d58dd264a09d4423ac9a8e5518a1b636c3ea
109462 F20101118_AABCZN shi_x_Page_106.jpg
6b41264c303e39d74540d754982f55c9
61a537363717d66e81197301d280171eb6df23c6
93350 F20101118_AABCYZ shi_x_Page_091.jpg
80eaa714afec85defeede686b67831cb
03a1be06cff9ad17c1d68a277506097f1eead951







MITIGATING CMP MEMORY WALL BY ACCURATE DATA PREFETCHING
AND ON-CHIP STORAGE OPTIMIZATION























By

XUDONG SHI


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007































© 2007 Xudong Shi


































To my wife and my parents









ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Jih-Kwon Peir, for his guidance and support throughout

the whole period of my graduate study. His depth of knowledge, insightful advice, tremendous

hard work, great passion, and persistence have been instrumental in the completion of this work.

I thank Dr. Ye Xia for numerous discussions, suggestions, and help on my latest research projects.

I also extend my appreciation to my other committee members, Dr. Timothy Davis, Dr. Chris

Jermaine and Dr. Kenneth O.

I appreciate the valuable help from my colleagues, Dr. Lu Peng, Zhen Yang, Feiqi Su, Li

Chen, Sean Sun, Chung-Ching Peng, Zhuo Huang, Gang Liu, David Lin, Jianming Cheng, and

Duckky Lee.

Finally, and most importantly, I would like to thank my parents and my wife for their endless

love, understanding, and support throughout my life. Without them, none of this would have been

possible.












TABLE OF CONTENTS


ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

    1.1 CMP Memory Wall
    1.2 Directions and Related Works
        1.2.1 Data Prefetching
        1.2.2 Optimization of On-Chip Cache Organization
        1.2.3 Maintaining On-Die Cache Coherence
        1.2.4 General Organization of Many-Core CMP Platform
    1.3 Dissertation Contribution
        1.3.1 Coterminous Group Data Prefetching
        1.3.2 Performance Projection of On-Chip Storage Optimization
        1.3.3 Enabling Scalable and Low-Conflict CMP Coherence Directory
    1.4 Simulation Methodology and Workload
    1.5 Dissertation Structure

2 COTERMINOUS GROUP DATA PREFETCHING

    2.1 Cache Contentions on CMPs
    2.2 Coterminous Group and Locality
    2.3 Memory-side CG-prefetcher on CMPs
        2.3.1 Basic Design of CG-Prefetcher
        2.3.2 Integrating CG-prefetcher on CMP Memory Systems
    2.4 Evaluation Methodology
    2.5 Performance Results
    2.6 Summary

3 PERFORMANCE PROJECTION OF ON-CHIP STORAGE OPTIMIZATION

    3.1 Modeling Data Replication
    3.2 Organization of Global Stack
        3.2.1 Shared Caches
        3.2.2 Private Caches
    3.3 Evaluation and Validation Methodology
    3.4 Evaluation and Validation Results
        3.4.1 Hits/Misses for Shared and Private L2 Caches
        3.4.2 Shared Caches with Replication
        3.4.3 Private Caches without Replication
        3.4.4 Simulation Time Comparison
    3.5 Summary

4 DIRECTORY LOOKASIDE TABLE: ENABLING SCALABLE, LOW-CONFLICT
  CMP CACHE COHERENCE DIRECTORY

    4.1 Impact on Limited CMP Coherence Directory
    4.2 A New CMP Coherence Directory
    4.3 Evaluation Methodology
    4.4 Performance Result
        4.4.1 Valid Block and Hit/Miss Comparison
        4.4.2 DLT Sensitivity Studies
        4.4.3 Execution Time Improvement
    4.5 Summary

5 DISSERTATION SUMMARY

LIST OF REFERENCES

BIOGRAPHICAL SKETCH











LIST OF TABLES


1-1 Common simulation parameters

1-2 Multiprogrammed workload mixes simulated

1-3 Multithreaded workloads simulated

2-1 Example operations of forming a CG

2-2 Space overhead for various memory-side prefetchers

3-1 Simulation time comparison of global stack and execution-driven simulation (in minutes)

4-1 Directory-related simulation parameters

4-2 Space requirement for the seven directory organizations










LIST OF FIGURES


1-1 Performance gap between memory and cores since 1980

1-2 Possible organization of the next-generation CMP

2-1 IPC degradation due to cache contention for SPEC2000 workload mixes on CMPs

2-2 Reuse distances for Mcf, Ammp and Parser

2-3 Strong correlation of adjacent references within CGs

2-4 Diagram of the CG prefetcher

2-5 Integration of the CG-prefetcher into the memory controller

2-6 Normalized combined IPCs of various prefetchers

2-7 Average speedup of 4 workload mixes

2-8 Prefetch accuracy and coverage of simulated prefetchers

2-9 Effect of distance constraints on the CG-prefetcher

2-10 Effect of group size on the CG-prefetcher

2-11 Effect of L2 size on the CG-prefetcher

2-12 Effect of memory channels on the CG-prefetcher

3-1 Cache performance impact when introducing replicas

3-2 Curve fitting of reuse distance histogram for the OLTP workload

3-3 Performance with replicas for different cache sizes derived by the analytical model

3-4 Optimal fraction of replication derived by the analytical model

3-5 Single-pass global stack organization

3-6 Example operations of the global stack for shared caches

3-7 Example operations of the global stack for private caches

3-8 Verification of miss ratios from global stack simulation for shared caches

3-9 Verification of miss ratio, remote hit ratio and average effective size from global stack
    simulation for private caches

3-10 Verification of average L2 access time with different levels of replication derived from
    global stack simulation for shared caches with replication

3-11 Verification of average L2 access time ratio from global stack simulation for private
    caches without replication

4-1 Valid cache blocks in CMP directories with various set-associativity

4-2 A CMP coherence directory with a multiple-hashing DLT

4-3 Valid cache blocks for simulated cache coherence directories

4-4 Cache hit/miss and invalidation for simulated cache coherence directories

4-5 Distribution of directory hits to main directory and DLT

4-6 Sensitivity study on DLT size and number of hashing functions

4-7 Effects of filtering directory searches by extra index bits

4-8 Normalized invalidation with banked DLT and restricted mapping from DLT to directory

4-9 Normalized execution time for simulated cache coherence directories









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

MITIGATING CMP MEMORY WALL BY ACCURATE DATA PREFETCHING
AND ON-CHIP STORAGE OPTIMIZATION

By

Xudong Shi

December 2007

Chair: Jih-Kwon Peir
Major: Computer Engineering

Chip-Multiprocessors (CMPs) are becoming ubiquitous. With the processor feature size

continuing to decrease, the number of cores in CMPs increases dramatically. To sustain the

increasing chip-level power in many-core CMPs, tremendous pressures will be put on the

memory hierarchy to supply instructions and data in a timely fashion. The dissertation develops

several techniques to address the critical issues in bridging the CPU memory performance gap in

CMPs.

An accurate, low-overhead data prefetching scheme for CMPs has been proposed based on a unique

observation of coterminous groups, highly repeated close-by off-chip memory accesses with

equal reuse distances. The reuse distance of a memory reference is defined to be the number of

distinct memory blocks between this memory reference and the previous reference to the same

block. When a member in the coterminous group is accessed, the other members will likely be

accessed in the near future. Coterminous groups are captured in a small table for accurate data

prefetching. Performance evaluation demonstrates 10% IPC improvement for a wide variety of

SPEC2000 workload mixes. It is appealing for future many-core CMPs due to its high accuracy

and low overhead.










Optimizing limited on-chip cache space is essential for improving memory hierarchy

performance. Accurate simulation of cache optimization for many-core CMPs is a challenge due

to its complexity and simulation time. An analytical model is developed for fast estimation of the

performance of data replication in CMP caches. We also develop a single-pass global stack

simulation for more detailed study of the tradeoff between the capacity and access latency in

CMP caches. A wide spectrum of the cache design space can be explored in a single simulation

pass with high accuracy.

Maintaining cache coherence in future many-core CMPs presents difficult design

challenges. The snooping-bus-based method and traditional directory protocols are not suitable

for many-core CMPs. We investigate a new set-associative CMP coherence directory with small

associativity, augmented with a Directory Lookaside Table (DLT) that allows blocks to be

displaced from their primary sets for alleviating hot-set conflicts that cause unwanted block

invalidations. Performance shows 6%-10% IPC improvement for both multiprogrammed and

multithreaded workloads.









CHAPTER 1
INTRODUCTION

1.1 CMP Memory Wall

As the silicon VLSI integration continues to advance with deep submicron technology,

billions of transistors will be available in a single processor die with a clock frequency

approaching 10 GHz. Because of limited Instruction-Level Parallelism (ILP), design

complexities, as well as high energy/power consumption, further expanding wide-issued, out-

of-order single-core processors with huge instruction windows and super-speculative execution

techniques will suffer diminishing performance returns. It has become the norm that a processor

die contains multiple cores, called a Chip Multiprocessor (CMP), and each core can execute

multiple threads simultaneously to achieve a higher chip-level Instruction-Per-Cycle (IPC) [56].

The case for a chip multiprocessor was first presented in [56]. Since then, many companies have

designed and/or announced their multi-core products [7], [40], [45], [39], [2], [35]. Trends,

opportunities, and challenges for future CMPs have appeared in recent keynote speeches, invited

talks, as well as in special columns of conferences and professional journals [15], [66], [69],

[16]. CMPs are now becoming ubiquitous in all computing domains. As the processor feature

size continues to decrease, the number of cores in a CMP increases rapidly. Four- or eight-core CMPs

are now commercially available [3], [36], [29]. Recently, Intel announced a prototype of the

teraflop processor [75], realizing an 80-core chip with a 2D mesh interconnect architecture

that reaches more than 1 Tflops of performance. Furthermore, advances in wafer stacking

technology, CAD design tools, thermal management, and electrothermal design methods make 3-

dimensional (3D) chips feasible [48], [13]. This soon-to-be commercially available 3D technology

further changes the landscape of processor chips. We will see a large number of processor

cores in a single CMP die, called many-core CMPs.












Figure 1-1. Performance gap between memory and cores since 1980. (J. Hennessy and
D. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition).

In such a die with a large number of cores, tremendous pressure will be put on the memory

hierarchy system to supply the needed instructions and data in a timely fashion to sustain ever-

increasing chip-level IPCs. However, the performance gap between memory and cores has been

widening since 1980, as illustrated in Figure 1-1. It becomes a critical question of how to bridge

the performance gap between processors and memory on many-core CMPs.

Hierarchical caches are traditionally designed to bridge the gap between main memory and

cores by utilizing programs' spatial locality and temporal locality. To match the fast speed and

high bandwidth of the CPU's execution pipelines, it is quite standard that every core in a CMP

has small (e.g., 16KB-64KB) first-level private instruction and data caches, tightly

coupled into its pipelines. The most critical part of the many-core CMP cache hierarchy design

boils down to the lower-level caches. However, several challenges exist in designing the lower-level

caches in many-core CMPs. First, since caches occupy a large percentage of die space, the total

cache space in many-core CMPs is usually restricted in order to keep the die footprint reasonably

sized. It is not unusual that the average cache space per core decreases when the number of










cores increases. How to leverage the precious on-chip storage space becomes an essential issue.

Second, many-core CMPs usually suffer from longer on-chip remote cache access latency and

off-chip memory access latency in terms of number of CPU cycles. This is mainly due to two

reasons. On one hand, the gap between the speed of cores and that of caches/memory is

widening. On the other hand, with a large number of cores and a large amount of cache space, it is

extremely time-consuming to locate and transfer data blocks in and between the caches. A critical

question is how to reduce the on-chip and off-chip access latency. Third, many-core CMPs

demand higher on- and off-chip memory bandwidth. To sustain the IPC of many cores, a large

amount of data must be transferred between main memory and caches, among different caches,

and between caches and cores.

1.2 Directions and Related Works

The design of the lower level caches in the next-generation many-core CMPs has yet to be

standardized. There are many factors to leverage and many possible performance metrics to

evaluate. Several very important directions among them are data prefetching, optimizing the on-

chip cache organization and improving cache coherence activities among different on-die caches.

In the following part of this chapter, we will introduce existing solutions in these

directions and raise some interesting questions.

1.2.1 Data Prefetching

Data prefetching has been an important technique to mitigate the CPU and memory

performance gap [38], [22], [37], [74], [77], [33], [54], [46]. It speculatively fetches the data or

instructions that the processor(s) will likely use in the near future into the on-chip caches in

advance. A successful prefetch may change an off-chip memory miss into a cache hit or a partial

cache hit, thus eliminating or shortening the expensive off-chip memory latency. A data prefetcher

may make decisions on what and when to prefetch based on program semantics that have been









hinted by programmers or compilers (software-based) or based on runtime memory access

patterns that the processors have experienced (hardware-based). The software techniques require

a significant knowledge and effort from programmers and/or compilers, which limits their

practicability. The hardware approaches, on the other hand, predict the future access pattern

based on the history, hoping that the history will recur in the future. However, an inaccurate (useless) prefetch may hurt the overall performance, since useless prefetches consume memory bandwidth and the useless data blocks may pollute the on-chip storage.

Three key metrics are used to measure the effectiveness of a data prefetcher [37]:

Accuracy. Accuracy is defined as the ratio of useful prefetches to total prefetches.

Coverage. Coverage is the ratio of useful prefetches to total number of misses. It is the

percentage of misses that have been covered by prefetching.

History size. History size is the size of the extra history table used to store memory access

patterns, usually for hardware prefetchers.
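As a concrete illustration, the two ratio metrics can be computed directly from raw event counters. The following minimal Python sketch is illustrative only; the counter names are hypothetical and not taken from this dissertation.

    # Minimal sketch: accuracy and coverage from hypothetical prefetcher counters.
    def prefetcher_metrics(useful_prefetches, total_prefetches, demand_misses):
        accuracy = useful_prefetches / total_prefetches if total_prefetches else 0.0
        coverage = useful_prefetches / demand_misses if demand_misses else 0.0
        return accuracy, coverage

    # Example: 800 of 1000 prefetches were useful, against 4000 baseline misses.
    print(prefetcher_metrics(800, 1000, 4000))   # (0.8, 0.2)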

Many uni-processor data prefetching schemes have been proposed in the last decade [37],

[74], [77], [54]. Traditional sequential or stride prefetchers identify sequential or stride memory

access patterns, and prefetch the next few blocks in such a pattern. They work well for workloads

with regular spatial access behaviors [38], [22], [46]. Correlation-based predictors (e.g. Markov

predictor [37] and Global History Buffer [54]) record and use past miss correlations to predict

future cache misses. They record miss pairs of (A->B) in a history table, meaning that a miss B

following the miss A has been observed in the past. When A misses again, B will be prefetched.

However, a huge history table or a FIFO buffer is usually necessary to provide decent coverage.
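To make the correlation idea concrete, the following minimal Python sketch (a simplification, not the exact organization used in [37] or [54]) records pairwise miss correlations and returns the recorded successors of each new miss as prefetch candidates.

    # Minimal sketch of a pairwise miss-correlation (Markov-style) predictor.
    from collections import defaultdict, deque

    class MissCorrelationTable:
        def __init__(self, successors_per_entry=2):
            self.table = defaultdict(deque)       # miss address -> recent successors
            self.successors_per_entry = successors_per_entry
            self.last_miss = None

        def on_miss(self, block_addr):
            # Learn the pair (last_miss -> block_addr).
            if self.last_miss is not None:
                succ = self.table[self.last_miss]
                if block_addr in succ:
                    succ.remove(block_addr)
                succ.appendleft(block_addr)
                while len(succ) > self.successors_per_entry:
                    succ.pop()
            self.last_miss = block_addr
            # Predict (prefetch) the recorded successors of the current miss.
            return list(self.table[block_addr])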

Instead of recording individual miss-block correlations, Hu et al. [33] use tag correlation, a correlation at a much coarser (tag) granularity, to reduce the history size. The downside of the coarser correlation is that it reduces the accuracy as well. To avoid cache pollution and provide timely

prefetches, the dead-block prefetcher issues a prefetch once a cache block is predicted to be

dead, based on a huge history of program instructions [47].

Speculative data prefetching becomes even more essential on CMPs to hide the higher memory wall. But prefetching on CMPs is more challenging due to limited on-

die cache space and off-chip memory bandwidth. Traditional Markov data prefetcher [37],

despite its advantage of reasonable coverage and great generality, faces serious obstacles in the

context of CMPs. First, each cache miss often has several potential successive misses and

prefetching multiple successors is inaccurate and expensive. Such incorrect speculations are

more harmful on CMPs, wasting already limited memory bandwidth and polluting critical on-

chip caches. Second, consecutive cache misses can be separated by only a few instructions. It could be

too late to initiate prefetches for successive misses. Third, reasonable miss coverage requires a

large history table which translates to more on-chip power/area.

Recently, several proposals target to improve prefetch accuracy. Saulsbury et al. [63]

proposed a recency-based TLB preloading. It maintains the TLB information in a Mattson stack,

and preloads adjacent entries in the stack upon a TLB miss. The recency-based technique can be applied to data prefetching. However, it prefetches adjacent entries in the stack without prior knowledge of whether the adjacent requests have shown any repeated pattern or how the two requests arrived at their adjacent stack positions. Chilimbi [23] introduced a hot-stream prefetcher.

It profiles and analyzes sampled memory traces on-line to identify frequently repeated sequences

(hot streams) and inserts prefetching instructions to the binary code for these streams. The

profiling, analysis, and binary code insertions / modifications incur execution overheads, and

may become excessive to cover hot streams with long reuse distances. Wenisch et al. [78]










proposed temporal streams by extending hot streams and global history buffer to deal with

coherence misses on SMPs. It requires a huge FIFO and multiple searches/comparisons on every

miss to capture repeated streams.

In spite of so much effort, it remains a big challenge to provide a more accurate data

prefetcher with low overhead on CMPs, where memory bandwidth and chip space are more

limited, and where inaccurate prefetches are less tolerated.

1.2.2 Optimization of On-Chip Cache Organization

With limited on-chip caches, optimization of on-chip lower level cache organization

becomes critical. An important design decision is whether the lower level cache is shared among

many cores or partitioned into private caches for each core. Sharing has two main benefits. First,

sharing increases the effective capacity of the cache, since a block only has one copy in the

shared cache. Second, sharing balances cache occupancy automatically among workloads with

unbalanced working sets. However, sharing often increases the hit latency due to the longer

wiring delay, and possibly also due to larger search time and bandwidth bottleneck. Furthermore,

dynamic sharing may lead to erratic, application-dependent performance when different cores interfere with each other by evicting each other's blocks. It can cause a priority-inversion problem when the task running on one core occupies too much cache space and starves higher-priority

tasks in other cores [42], [61].

A monolithic shared cache with high associativity consumes more power as the size

increases. Non-uniform cache access (NUCA) [41] architecture splits a large monolithic shared

cache into several banks to reduce power dissipation and increase bandwidth. Usually the

number of banks is equal to the number of cores, and each core has a local bank. The bank in which a block is stored is statically determined by the lower bits of the block address. The access latency thus

depends on the distance between the requesting core and the bank containing the data. Generally,










only a small fraction of accesses (the reciprocal of the number of cores) target local banks.

Alternatively, private caches contain the most recently used blocks of their specific cores. They provide fast local accesses for the majority of cache hits, likely reducing the traffic between

different caches and consuming less power. But, data may be replicated in private caches when

two or more cores share the same blocks, leading to less capacity and often more off-chip

memory misses. Private caches also need to maintain data coherence. Upon a local read miss or a

write without data exclusivity, the accessing core needs to check other private caches by either a

broadcast or through a global directory, to fetch data and/or to maintain write consistency by

invalidating remote copies. Another downside is that private caches do not allow storage space to be shared among multiple cores, and thus cannot accommodate unbalanced cache occupancy for

workloads with different working sets.

It has become increasingly clear that it could be better to combine the benefits of both private and shared caches [49], [24], [81], [34], [20], [9]. Generally, these proposals can be

summarized into two general directions. The first direction is to organize the L2 as a shared

cache for maximizing the capacity. To shorten the access time, data blocks may be dynamically

migrated to the requesting cores [8], and/or some degree of replication is allowed [81], [9], to

increase the number of local accesses at a minimum cost of lowering on-chip capacity. To

achieve fair capacity sharing, [60] partitions a shared cache between multiple applications

depending on the reduction in cache misses that each application is likely to obtain for a given

amount of cache resources. The second direction is to organize the L2 as private caches for

minimizing the access time. But data replications among multiple L2s are constrained to achieve

larger effective on-chip capacity [20] without adversely decreasing the number of local accesses









too much. Dybdahl [26] proposes to create a logically shared L3 by giving up a dynamically adjusted portion of each core's private L2 space. To achieve optimal capacity sharing, private L2s can steal each other's capacity through block migration [24], [20], accommodating the different space

requirements of different cores (workloads).

One of the biggest problems that these studies face is extremely long simulation time. They must examine a wide spectrum of the design space to draw complete conclusions, covering different numbers of cores, different L2 sizes, different L2 organizations, and different workloads with different working sets. Furthermore, the increasing number of cores on CMPs makes the problem even worse: simulation time usually grows more than linearly with the number of cores. It is expected that 32, 64, or even hundreds of cores will be the target of future

research. To reduce the simulation time, FPGA simulation might be a good solution, but it is too

difficult to build. A great challenge is then how to provide an efficient methodology to study

design choices of optimizing CMP on-chip storage accurately and completely, when the number

of cores increases.

There have been several techniques for speeding up cache simulations in uniprocessor

systems. Mattson et al. [53] present an LRU stack algorithm to measure cache misses for multiple cache sizes in a single pass. For fast search through the stack, tree-based stack algorithms have been proposed [10], [76]. Kim et al. [43] provide a much faster simulation by

maintaining the reuse distance counts only to a few potential cache sizes. All-associativity

simulations allow a single-pass simulation for variable set-associativities [32], [72]. Meanwhile,

various prediction models have been proposed to provide quick cache performance estimation

[5], [28], [27], [76], [11], [12]. They apply statistical models to analyze the stack reuse distances.

But, it is generally difficult to model systems with complex real-time interactions among









multiple processors. StatCache [11] estimates capacity misses using sparse sampling and static

statistical analysis.
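The single-pass idea behind the Mattson LRU stack algorithm mentioned above can be illustrated with a short Python sketch. It models only a fully associative LRU cache; the trace and names are illustrative, not drawn from [53].

    # One pass over a block-address trace yields miss counts for every LRU cache size.
    def lru_stack_misses(trace, cache_sizes):
        stack = []                      # most-recently-used block at index 0
        dist_hist = {}                  # stack distance -> number of references
        cold_misses = 0
        for block in trace:
            if block in stack:
                depth = stack.index(block) + 1        # stack (reuse) distance
                dist_hist[depth] = dist_hist.get(depth, 0) + 1
                stack.remove(block)
            else:
                cold_misses += 1
            stack.insert(0, block)
        # A size-C cache misses on every reference whose stack distance exceeds C.
        return {c: cold_misses + sum(n for d, n in dist_hist.items() if d > c)
                for c in cache_sizes}

    # Example: misses for 2-, 4-, and 8-block caches from a single pass over a toy trace.
    print(lru_stack_misses(list("abcxdxyzabcyd"), [2, 4, 8]))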

All above techniques target uniprocessor systems where there is no interference between

multiple threads. Several works aim at modeling multiprocessor systems [79], [80], [19], [12].

StatCacheMP [12] extends StatCache to incorporate communication misses. However, it

assumes a random replacement policy in its statistical model. Chandra et al. [19] propose three analytical models, based on the L2 stack distance or circular sequence profile of each thread, to predict inter-thread cache contention on a CMP for multiprogrammed workloads that do not share data with each other. Two other works consider only miss ratios, update

ratios, and invalidate ratios for multiprocessor caches [79], [80].

Despite those efforts, it remains an important problem to efficiently model and predict the performance of many-core CMP cache organizations with respect to the tradeoff between data capacity and accessibility.

1.2.3 Maintaining On-Die Cache Coherence

Cache coherence defines the behavior of reads and writes to the same memory location in the presence of multiple cores and multiple caches. Coherence is obtained if the following conditions are met: 1) Program order must be preserved among reads and writes from the same core. 2) A coherent view of memory must be maintained, i.e., a read from a core must return

the latest value written by other cores. 3) Writes to the same location must be serialized.

On CMPs, since the on-chip cache access latency is much less than the off-chip memory

latency, it is desirable to obtain the data from on-chip caches if possible. Write-invalidation

cache coherence protocol is generally used on today's microprocessors. There are three cache

coherence activities here on CMPs. First, on a data read miss at the local cache, a search through

the other on-chip caches is performed to obtain the latest data, if possible. Second, on a write









miss at the local cache, a search through the other caches is performed to obtain the latest data if

possible and all those copies must be invalidated. Third, on a write upgrade (i.e. write hit at the

local cache without exclusivity), all the copies at other caches must be invalidated.
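These three activities can be made concrete with a toy full-map directory sketched in Python below. It is only meant to illustrate the actions; it is neither a real protocol implementation nor the sparse-directory design studied later in this dissertation, and all names and structures are hypothetical.

    # Toy full-map directory illustrating the three coherence activities above.
    class SimpleDirectory:
        def __init__(self):
            self.sharers = {}        # block -> set of core ids holding a copy
            self.owner = {}          # block -> core id with the latest (dirty) copy

        def read_miss(self, core, block):
            supplier = self.owner.get(block)            # fetch latest on-chip data, if any
            self.sharers.setdefault(block, set()).add(core)
            return supplier

        def write_miss(self, core, block):
            invalidated = self.sharers.get(block, set()) - {core}
            self.sharers[block] = {core}                # invalidate all other copies
            self.owner[block] = core
            return invalidated

        def write_upgrade(self, core, block):
            # Write hit at the local cache without exclusivity: invalidate other copies.
            return self.write_miss(core, block)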

Maintaining cache coherence with increasing number of cores has become a key design

issue for future CMPs. In an architecture with a large number of cores and cache modules, it

becomes inherently difficult to locate a copy (or copies) of a requested data block and to keep

them coherent. When a requested block is not located in the local cache module, one strategy is

to search through all modules. A broadcast to all the modules is possible at the same time or in

the sequence of multiple levels, for instance, local modules, neighbor modules, and entire space.

This approach is only applicable to a system with a small number of cores using a shared

snooping bus. Searching the entire CMP cache becomes prohibitively time consuming and power

hungry when the number of cores increases. Hierarchical clusters with hierarchical buses

alleviate the problem at the expense of introducing lots of complexity. Recent ring based

architecture connects all the cores (and the local L2 slice) with one or more uni- or bi-directional

rings [44], [71]. A remote request travels hop-by-hop on the ring until a hit is encountered. The

data may travel back to the requesting core hop-by-hop or directly depending on whether data

interconnection shortcuts are provided. The total number of hops varies, depending on the

workloads and the data replication strategy. However, ring-based architecture may still not scale

well as the number of cores increases.

To avoid broadcasting and searching all cache modules, directory-based cache-coherence

mechanisms might be the choice. When a request cannot be serviced by local cache module, this

request is sent to the directory to find the state and the locations of the block. Many directory

implementations have been proposed in the field of Symmetric Multiprocessor (SMP). A









memory-based directory records the states and sharers of all memory blocks at each memory

module using a set of presence bits [17]. Although the memory-based directory can be accessed

directly by the memory address, such a full directory is unnecessary in CMPs since the size of

the cache is only a small fraction of the total memory. There have been many research works

trying to overcome the space overhead of the memory-based directory [4], [18], [25]. Recently, a

multi-level directory combines the full memory-based directory with directory caches for fast

accesses [1]. A directory cache is a small full-map, first-level directory that provides information

for the most recently referenced blocks, while the second-level directory provides additional

information for all the blocks. The cache-based directory, on the other hand, records only cached

blocks in the directory to save directory space. The simplest approach is to duplicate all

individual cache directories in a centralized location [73]. For instance, Piranha [7] duplicates L1

tag arrays in the shared L2 to maintain L1 coherence. Searches of all cache directories are necessary to locate copies of a block. This essentially builds a directory of much wider set-associativity (the product of the number of cores and the number of cache ways per set), and

wastes a lot of power. In a recent virtual hierarchy design [51], a 2-level directory is maintained

in a virtual machine (VM) environment. The level-1 coherence directory is embedded in the L2

tag array of the dynamically mapped home tile located within each VM domain. Any unresolved

accesses will be sent to the level-2 directory. If a block is predicted on-chip, the request is

broadcast to all cores.

The sparse directory approach uses a small fraction of the full memory directory organized

in a set-associative fashion to record only those cached memory blocks [30], [55]. Since

the directory must maintain a full map of cache states, hot-set conflicts at the directory lead to unnecessary block invalidations at the cache modules, resulting in an increase in cache misses.









With a typical set-associative directory, such conflicts tend to become worse as the number of cores increases, unless the set-associativity also increases dramatically. For instance, in a CMP with 64 cores, each having a 16-way local cache module, only a 1024-way directory can eliminate all inadvertent invalidations. A naive 1024-way set-associative directory, although it would eliminate all conflicts, is hardly feasible. Thus,

an important technical problem is to avoid the hot-set conflicts at the directory with small set

associativity, small space and high efficiency.
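The arithmetic behind the 1024-way figure is simply the product of the core count and the per-core cache associativity, since that many valid blocks can legitimately map to the same directory set at once. A trivial Python sketch:

    def conflict_free_directory_ways(num_cores, cache_ways_per_set):
        # Worst case: every core holds a block that maps to the same directory set.
        return num_cores * cache_ways_per_set

    print(conflict_free_directory_ways(64, 16))   # 1024, matching the example above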

Some previous works exist to alleviate the hot-set conflict of caches, instead of the cache

coherence directory. The column-associative cache [6] establishes a secondary set for each block

using a different hash function from that for the primary set. The group-associative cache [58]

maintains a separate cache tag array for a more flexible secondary set. The skewed-associative

cache applies different hash functions on different cache partitions to reduce conflicts [64], [14].

The V-way cache [59] doubles the cache tag size with respect to the actual number of blocks to

alleviate set conflicts. Any unused tag entry in the primary set can record a newly missed block

without replacing an existing block. The Bloomier filter [21] approach institutes a conflict-free

mapping from a set of search keys to a linear array, using multiple hash functions. It remains a

problem to reduce the set conflicts of the cache coherence directory.

1.2.4 General Organization of Many-Core CMP Platform

To establish the foundation for the proposed research, we depict a possible organization of future many-core CMPs, as illustrated in Figure 1-2. This is similar to Intel's vision of future CMPs. The CMPs will be built on a partition-based or tile-based substrate, with 16-64 processing cores and 16-64MB of on-chip cache capacity.


















Figure 1-2. Possible organization of the next-generation CMP.

Each tile contains one core and a local cache partition. To match the speed and bandwidth of the core's execution pipeline, each core keeps its own private first-level caches. The coherence directory is also physically distributed among all the cache partitions, to maintain the coherence of blocks allocated to multiple cache modules. The coherence directory maintains the states of all on-chip cache blocks. A directory lookup request will be sent to the directory if a memory request cannot be handled by the local cache, such as a read miss, write miss, or write upgrade. It is likely that a 2D mesh interconnect will be needed to link all the partitions. The access latency to a remote cache partition is decided by the wiring distance and router










processing time. Multiple memory interfaces may be available to support multiple channel main

memories.

1.3 Dissertation Contribution

There are three main contributions in this dissertation addressing the issue of mitigating the

CMP memory wall. First of all, we develop an accurate and low overhead data prefetching based

on a unique observation of the program behavior. Second, we describe an analytical model and a

single-pass global stack simulation to quickly project the performance of the tradeoff between cache

capacity and data accessibility in CMP on-chip caches. Third, we develop a many-core CMP

coherence directory with small set associativity, small space and high efficiency and introduce a

directory lookaside table to reduce the number of inadvertent cache invalidations due to directory

hot-set conflicts.

1.3.1 Coterminous Group Data Prefetching

*We prove with a set of SPEC CPU 2000 workload mixes that cache contention from
different cores running independent workloads creates many more cache misses.

*We observe a unique behavior of Coterminous Groups (CGs) in the SPEC CPU 2000
applications: groups of memory accesses with temporally repeated memory access
patterns. We further define a new coterminous locality: when a member of a CG is
referenced, the other members will be referenced soon.

*We develop a CG-prefetcher based on coterminous groups. It identifies and records
highly repeated CGs in a small buffer for accurate and timely prefetches of the members in a
group. Detailed performance evaluations have shown significant IPC improvement
over other prefetchers.

1.3.2 Performance Projection of On-chip Storage Optimization

*We present an analytical model for fast projection of CMP cache performance with respect to the
tradeoff between data accessibility and the cache capacity loss due to data replication.

*We develop a single-pass global stack simulation to more accurately simulate these
effects for shared and private caches. By using the stack results of the shared and private
caches, we further deduce the performance effects of more complicated cache
organizations, such as shared caches with replication and private caches without
replication.










*We verify the projection accuracy of the analytical model and the stack simulation by
detailed execution-driven simulation. More importantly, the single-pass global stack
simulation consumes only a small percentage of the simulation time that execution-driven
simulation requires.

1.3.3 Enabling Scalable and Low-Conflict CMP Coherence Directory

*We demonstrate that a sparse coherence directory with small associativity causes significant
unwanted cache invalidations due to hot-set conflicts in the coherence directory.

*We augment the set-associative sparse coherence directory with a Directory Lookaside Table (DLT)
that allows the displacement of a block away from its primary set to one of the
empty slots in order to reduce conflicts.

*Performance evaluations using multithreaded and multiprogrammed workloads
demonstrate significant performance improvement of the DLT-enhanced directory over
traditional set-associative or skewed-associative directories by eliminating the majority of
the inadvertent cache invalidations.

1.4 Simulation Methodology and Workload

To implement and verify our ideas, we use the full-system simulator, Virtutech Simics 2.2

[50], to simulate 2-, 4-, or 8-core CMPs running a real operating system (Linux 9.0) on a machine

with x86 Instruction Set Architecture (ISA).

The processor module is based on the Simics Microarchitecture Interface (MAI) and

models timing-directed processors in detail. A g-share branch predictor is added to each core.

Each core has its own instruction and data L1 cache. Since L2 cache is our focus, we have

different L2 organizations in different works. It will be described in each work later.

We implement a cycle-by-cycle event-driven memory simulator to accurately model the

memory system. Multi-channel DDR SDRAM is simulated. The DRAM accesses are pipelined

whenever possible. A cycle-accurate, split-transaction processor-memory bus is also included to

connect the L2 caches and the main memory.

Table 1-1 summarizes common simulation parameters. Parameters that are specific to each

work will be described later in the individual chapters.










Table 1-1. Common simulation parameters
Parameter Description
CMP 2, 4, or 8 cores, 3.2GHz, 128-entry ROB
Pipeline Width 4 Fetch / 7 Exec / 5 Retire / 3 Commit
Branch predictor G-share, 64KB, 4K BTB, 10 cycle misprediction penalty
L1-I/L1-D 64KB, 4-way, 64B line, 16-entry MSHR, MESI, 0/2-cycle latency
L2 8- or 16-way, 64B Line, 16-entry MSHR, MOESI if not pure shared
L2 latency 15 cycles local, 30 cycles remote
Memory latency 432 cycles without contentions
DRAM 2/4/8/16 channels, 180-cycle access latency
Memory bus 8-byte, 800MHz, 6.4GB/s, 220-cycle round trip latency

We use 2 sets of workloads in our study: multiprogrammed and multithreaded workloads.

For multiprogrammed workloads, we use several mixtures of SPEC CPU2000 and SPEC

CPU2006 benchmark applications based on the classification of memory-bound and CPU-bound

applications [82]. The memory-bound applications are Art, Mcf, and Ammp, while the CPU-

bound applications are Twolf, Parser, Vortex, Bzip2 and Crafty. The first category of workload

mixes, MEM, includes memory-bound applications; the second category of workload mixes,

MIX, consists of both memory-bound and CPU-bound applications; and the third category of

workloads, CPU, contains only CPU-bound applications. For studies with 2-, 4-, or 8-core

CMPs, we prepare 2-, 4- and 8-application workloads. We choose the ref input set for all the

SPEC CPU2000 and SPEC CPU2006 applications.

Table 1-2 summarizes the selected multiprogrammed workload mixes. For each workload

mix, the applications are ordered by high-to-low L2 miss penalties from left to right in their

appearance. We skip certain instructions for each individual application in a mix based on

studies done in [62], and run the workload mix for another 100 million instructions for warming

up the caches. A Simics checkpoint for each mix is generated afterwards. We run our simulator

until any application in a mix has executed at least 100 million instructions for collecting

statistics.









Table 1-2. Multiprogrammed workload mixes simulated
        MEM                     MIX                      CPU
Two     Art/Mcf                 Art/Twolf                Twolf/Bzip2
        Mcf/Mcf                 Mcf/Twolf                Parser/Bzip2
        Mcf/Ammp                Mcf/Bzip2                Bzip2/Bzip2
Four    Art/Mcf/Ammp/Twolf      Art/Mcf/Vortex/Bzip2     Twolf/Parser/Vortex/Bzip2
Eight   Art/Mcf/Ammp/Parser/Vortex/Bzip2/Crafty (SPEC2000)
        Mcf/Libquantum/Astar/Gobmk/Sjeng/Xalan/Bzip2/Gcc (SPEC2006)

We also use three multithreaded commercial workloads, OLTP (Online Transaction

Processing), Apache (Static web server), and SPECjbb (java server). We consider the variability

of these multithreaded workloads by running multiple simulations for each configuration of each

workload and inserting small random noises (perturbations) in the memory system timing for

each run. For each workload, we carefully adjust system and workload parameters to keep the

CPU idle time low enough. We then fast-forward the whole system for a long enough period of time to fill the internal buffers and other structures before making a checkpoint. Finally, we collect

simulation results during executing a certain number of transactions after we warm up the caches

or other simulation related structures. Table 1-3 gives the details of the workloads.

1.5 Dissertation Structure

The structure of this dissertation is as follows. Chapter 2 develops the first piece of the

dissertation, an accurate and low-overhead data prefetching technique in CMPs based on a

unique observation of coterminous group, a highly repeated and close-by memory access

sequence. In Chapter 3, we illustrate two methodologies, an abstract data replication model and a

single-pass global stack simulation, to quickly project the performance of CMP on-chip storage

optimization. Chapter 4 builds a set-associative CMP coherence directory with small

associativity and high efficiency, augmented by a directory lookaside table that alleviates the

directory hot-set conflicts. This is followed by a brief summary of the dissertation in Chapter 5.










Table 1-3. Multithreaded workloads simulated
Workload Description
OLTP (Online It is built upon the OSDL-DBT-2 [57] and the MySQL database server
Transaction Processing) 5.0. We build a 1GB, 10-warehouse database. To reduce the database
disk activity, we increase the size of the MySQL buffer pool to 512MB.
We further stress the system by simulating 128 users without any keying
and thinking time. We simulate 1024 transactions after bypassing 2000
transactions and warming up caches (or stack) for another 256
transactions.
Apache (Static web We run Apache 2.2 as the web server, and use Surge to generate
server) web requests from a 10,000-file, roughly 200MB repository. To stress the
CMP system, we simulate 8 clients with 50 threads per client. We collect
statistics for 8192 transactions after bypassing 2500 requests and
warming up for 2048 transactions.
SPECjbb (java server) SPECjbb is a java-based 3-tier online transaction processing system. We
simulate 8 warehouses. We first fast-forward 100,000 transactions. Then
we simulate 20480 transactions after warming up the structures for 4096
transactions.









CHAPTER 2
COTERMINOUS GROUP DATA PREFETCHING

In this chapter, we describe an accurate data prefetching technique on CMPs to overlap

expensive off-chip cache miss latency. Our analysis of SPEC applications shows that adjacent

traversals of various data structures, such as arrays, trees and graphs, often exhibit temporal

repeated memory access patterns. A unique feature of these nearby accesses is that they exhibit a

long but equal reuse distance. The reuse distance of a memory reference is defined as the

number of distinct data blocks that are accessed between this reference and the previous reference to

the same block. It is the most fundamental measure of memory reference locality. We define

such a group of memory references with an equal block reuse distance as a Coterminous Group

(CG) and the highly repeated access patterns among members in a CG as the Coterminous

Locality. A new data prefetcher identifies and records highly repeated CGs in a history buffer.

For accurate and timely prefetches, whenever a member in a CG is referenced, the entire group

members are prefetched. We call such a data prefetching method a CG-prefetcher.

We make three contributions about the CG-prefetcher. First, we demonstrate the severe

cache contention problem with various mixes of SPEC2000 applications, and describe the

necessities and the challenges of accurate data prefetching on CMPs. Second, we discover the

existence of coterminous groups in these applications and quantify the strong coterminous

locality among members in a CG. Third, based on the concept of coterminous groups, we

develop a new CG-prefetcher, and present a realistic implementation by integrating the CG-

prefetcher into the memory controller. Full system evaluations based on mixed SPEC CPU 2000

applications have shown that the proposed CG-prefetcher can accurately prefetch the needed data

in a timely manner on CMPs. It generates about 10-40% extra traffic to achieve 20-50% of miss

coverage in comparison with two and a half times more extra traffic by a typical correlation-









based prefetcher with a comparable miss coverage. The CG-prefetcher also shows better IPC

(Instructions per Cycle) improvement than the existing miss correlation based or the stream

based prefetchers.

To clearly demonstrate the effectiveness of CG-prefetcher, we carry out experiments on a

simple shared L2 cache with multiple cores, each running a different application. However, this scheme is independent of any specific cache organization, and can be adapted to private caches

as well.

2.1 Cache Contentions on CMPs

CMPs put tremendously more pressure on the memory hierarchy to supply the data to

cores than uniprocessors. One of the major reasons is that different cores compete for the limited

on-chip shared storage when multiple independent applications are running simultaneously on

multiple cores. The typical example is the shared L2 cache among multiple cores. This effect is

more evident when independent memory-intensive applications are running together. To

demonstrate the cache contentions on CMPs, we show the IPCs (Instructions per Cycle) of a set

of SPEC2000 applications that are running individually, or in parallel on 4- or 2-core CMPs in

Figure 2-1 (A) and Figure 2-1 (B) respectively.

We have three workload mixes consisting of 4 applications, Art/Mcf/Ammp/Twolf,

Art/Mcf/Vortex/Bzip2, and Twolf/Parser/Vortex/Bzip2 in Figure 2-1 (A). The first workload

mix, Art/Mcf/Ammp/Twolf, contains applications with heavier L2 misses; the second one,

Art/Mcf/Vortex/Bzip2, mixes applications with heavier and lighter L2 penalties; and the third one, Twolf/Parser/Vortex/Bzip2, has applications with generally lighter L2 misses. We also run

nine 2-application mixes in Figure 2-1 (B), also ranging from high-to-low L2 miss penalties,

including Art/Mcf, Mcf/Mcf, Mcf/Ammp, Art/Twolf, Mcf/Twolf, Mcf/Bzip2, Twolf/Bzip2,

Parser/Bzip2, and Bzip2/Bzip2.















Figure 2-1. IPC degradation due to cache contention for SPEC2000 workload mixes on CMPs.
A) IPC for 4-workload on 4-core CMPs. The first 4 bars are individual IPCs when
only one application is running and the last bar is combined IPC when 4 workloads
run in parallel on 4 cores. B) IPC for 2-workload on 2-core CMPs. The first 2 bars are
individual IPCs when only one application is running and the last bar is combined
IPC when 2 workloads run in parallel on 2 cores.


The first four bars in Figure 2-1 (A) and the first two bars in Figure 2-1 (B) are the


individual IPCs (Instructions per Cycle) of the applications in the workload mixes, ordered by


the appearance in the name of the workload mixes. We collect the individual IPC for each


application by running the specific application on one core and keeping all the other cores idle.


As a result, the entire L2 cache is available for the individual application. The last bar of each










workload mix is the combined IPC when we run all the applications at the same time with one

core running one independent application. The combined IPC is broken down into segments to

show the IPC contribution of each application.

Ideally, the combined IPC should be equal to the sum of individual IPCs when only one

application is running. However, significant IPC reductions can be observed on each application

when they run in parallel, mainly due to the shared L2 cache contention among multiple

applications. This is especially evident for the workload mixes with high demands on shared L2

caches. For example, when Art/Mcf/Ammp/Twolf are running on four cores, the individual IPC

drops from 0.029 to 0.022 for Art, from 0.050 to 0.026 for Mcf, from 0.132 to 0.043 for Ammp,

and from 0.481 to 0.181 for Twolf, respectively. Instead of accumulating the individual IPCs on four cores, the combined IPC drops from 0.69 (the sum of individual IPCs) to 0.27, a 60%

degradation. Similar effects of various degrees can also be observed with 2-core CMPs. The

significant IPC degradations come from more L2 misses and more off-chip memory accesses.

Data prefetching is an effective way to reduce the number of L2 misses. However, in the CMP context with limited cache space and limited memory bandwidth, inaccurate prefetches are more harmful: they pollute the limited cache space and waste the limited memory bandwidth. CMPs demand accurate prefetchers with low overhead to alleviate

heavier cache contentions and misses.

2.2 Coterminous Group and Locality

The proposed data prefetcher on CMPs is based on a unique observation of the existence of

Coterminous Groups (CGs). A Coterminous Group (CG) is a group of nearby data references

with same block reuse distances. The reuse distance of a reference is defined as the number of

distinct data blocks that are accessed between two consecutive references to this block. For

instance, consider the following access sequence: a b c x d x y z a b c y d. The reuse distances










of a-a, b-b, c-c and d-d are all 6, whereas x-x is 1 and y-y is 4. In this case, a, b, c and d can form a CG.
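The definition can be made concrete with a small Python sketch (illustrative only) that recomputes the reuse distances for the example sequence above.

    # Reuse distance: number of distinct blocks accessed between two consecutive
    # references to the same block.
    def reuse_distances(trace):
        last_pos = {}
        result = []
        for i, block in enumerate(trace):
            if block in last_pos:
                distinct = len(set(trace[last_pos[block] + 1 : i]))
                result.append((block, distinct))
            last_pos[block] = i
        return result

    # For a b c x d x y z a b c y d this reports distance 6 for a, b, c and d,
    # 1 for x, and 4 for y, matching the CG example above.
    print(reuse_distances(list("abcxdxyzabcyd")))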

References in a CG have three important properties. First, the order of references must be

exactly the same at each repetition (e.g. d must follow c, c follows b and b follows a). Second,

references in a CG can interleave with other references (e.g. x, y). These references, however, are

irregular and difficult to predict accurately, and will be excluded by the criteria of same reuse

distance. Third, since we are interested in capturing references with long reuse distances for

prefetching, the same reference (i.e. to the same block) usually does not appear twice in one CG.

To demonstrate the existence of CGs, we plot reuse distances of 3000 nearby references

from three SPEC2000 applications, Mcf, Ammp and Parser in Figure 2-2. We randomly select

3000 memory references for each application, and compute the reuse distance of each reference.

Note that references with short reuse distances (e.g. < 512), which are frequent due to temporal

and spatial localities of memory references, can be captured by small caches and thus are

filtered.

The existence of CGs is quite obvious from these snapshots. Mcf has a huge CG with a

reuse distance of over 60,000. The reuse distance is so large that a reasonably sized L2 cache

(<4MB, if each memory block is 64B) will not keep those blocks in it. So those accesses are

likely to be L2 misses. Ammp shows four large CGs along with a few small ones. And Parser has

many small CGs. Other applications also show the CG behavior. We only present three examples

due to the space limit.

The next important question is whether there exist strong correlations among references in

a CG, i.e. when a member of a captured CG is referenced, the other members will be referenced

in the near future.











Figure 2-2. Reuse distances for Mcf, Ammp, and Parser.








To answer this question, the members of a captured CG (e.g. a, b, c and d) will be recorded in a history table. For accesses that hit the table, we verify whether the actual next access is the next access that has been recorded. For example, if access a happens again, we need to verify whether b is the next access. If so, we count it as an accurate prediction based on the CG.

We can also relax the equal reuse distance requirement, allowing a small variance in reuse distances. We define CG-N as a CG in which nearby references have reuse distances within plus or minus N of each other. A smaller N means more restricted CGs that are potentially more accurate, while a larger N means more relaxed CGs that possibly include more members. CG-0 is the most










restricted one, representing the original same-distance CG, while CG-∞ is the most relaxed one, which forms a single CG including all nearby references with long block reuse distances.
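As a rough illustration of the CG-N relaxation (not the hardware mechanism described later), the following Python sketch splits an ordered list of long-reuse-distance references into groups whose distances stay within plus or minus N of the group's first member.

    # CG-N sketch: N = 0 degenerates to the original equal-distance CG,
    # while a very large N lumps all nearby long-distance references together.
    def split_into_cg_n(references, n):
        """references: ordered list of (block, reuse_distance) pairs."""
        groups = []
        for block, dist in references:
            if groups and abs(dist - groups[-1][0]) <= n:
                groups[-1][1].append(block)
            else:
                groups.append((dist, [block]))     # (anchor distance, members)
        return [members for _, members in groups]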

Figure 2-3 shows the accuracy that the real next access of a member in a CG is indeed the

next access being captured and saved, for CG-0, CG-2, CG-8 and CG-∞ for all the individual applications. In this figure, there are 4 bars for each application representing the accuracy of CG-0, CG-2, CG-8 and CG-∞ from left to right. In general, CG-0, CG-2 and CG-8 exhibit stronger repeated reference behaviors among members in a CG than CG-∞. Ammp and Art show nearly

perfect correlations with about 98% of accuracy regardless of the reuse distance requirement.

Those two applications are floating point applications with regular array accesses, which are easy

to predict. All other applications also demonstrate strong correlations for CG-0, CG-2 and CG-8.

As expected, CG-0 shows stronger correlations than the other, weaker forms of CGs, while CG-∞, which is essentially the same as the adjacent cache-miss correlation, shows very poor correlations. For instance, the accuracy of CG-0 for Mcf is about 78%, while the accuracy of CG-∞ is only 30%. The gap between CG-0 and CG-2/CG-8 is rather narrow for Mcf, Vortex,

and Bzip2, suggesting a weaker form of CGs may be preferable for covering more references. A

large gap is observed between CG-0 and other CGs in Twolf, Parser, and Gcc indicating CG-0 is

more accurate for prefetching for those applications.

Based on these behaviors, we can safely conclude that members in a CG exhibit a highly

repeated access pattern, i.e. whenever a member in a CG is referenced, the remaining members will likely be referenced in the near future according to the previous access sequence. We call such highly repeated patterns coterminous locality. Based on the existence of highly repeated

coterminous locality within members in CGs, we can design an accurate prefetching scheme to

capture CGs, and prefetch members in CGs.











Figure 2-3. Strong correlation of adjacent references within CGs.

2.3 Memory-side CG-prefetcher on CMPs

Due to shared cache contentions on CMPs, it is more beneficial to prefetch those L2

misses to improve the overall performance. Our CG-prefetcher records L2 misses, captures CGs

from the L2 miss sequence and prefetches members in CGs to reduce the number of expensive

off-chip memory accesses. Since the memory controller sees every L2 miss directly, we integrate

the CG-prefetcher into the memory controller. This is the memory-side CG-prefetcher.

A memory-side CG-prefetcher is attractive for several reasons [67]. First, it minimizes

changes to the complex processor pipeline along with any associated performance and space

overheads. Second, it may use the DRAM array to store necessary state information with

minimum cost. A recent trend is to integrate the memory controller in the processor die to reduce

interconnect latency. Nevertheless, such integration has minimal performance implication on

implementing the CG-prefetcher in the memory controller. Note that although the CG-prefetcher









is suitable on uni-processor systems as well, it is more appealing on emerging CMPs with extra

resource contentions and constraints due to its high accuracy and low overhead.

2.3.1 Basic Design of CG-Prefetcher

The structure of a CG-prefetcher is illustrated in Figure 2-4. There are several main

functions: to capture nearby memory references with equal reuse distance, to form CGs, to

efficiently save CGs in a history table for searching, and to update CGs and keep them fresh.

To capture nearby memory references with the same reuse distance, a Request FIFO records the

block addresses and their reuse distances of recent main memory requests. A CG starts to form

once the number of requests with the same reuse distance in the Request FIFO exceeds a certain

threshold (e.g. 3), which controls the aggressiveness of forming a new CG. The size of the FIFO

determines the adj acency of members, and usually it is small (e.g. 16). A flag is associated with

each request indicating whether the request is matched. The matched requests in the FIFO are

copied into a CG Buffer waiting for the CG to form. The size of the CG buffer determines the

maximum number of members in a CG, which can control the timeliness of prefetches. A small

number of CG Buffers can be implemented to allow multiple CGs to form concurrently. A CG is

completed and will be saved when either the CG Buffer is full or a new CG Buffer is needed

when a new CG is identified from the Request FIFO.

To efficiently save CGs, we introduce Coterminous Group History Table (CGHT), a set-

associative table indexed by block addresses, so that every member in a CG can be found very

fast. A unidirectional pointer in each entry links the members in a CG. This link-based CGHT

permits fast searching of a CG from any member in the group, thus allows to prefetch a CG

starting from any member. When the CGHT becomes full, either the LRU entries are replaced

and removed from the existing CGs, or the conflicting new CG members are dropped to avoid

potential thrashing.










Blode F
I ~~Iclex


I


SCoterlnn~Finous-Grou Histry Thble (CGHT)





E A


Cotenninous Giroull AB CD E F
Figure 2-4. Diagram of the CG prefetcher.

Any existing CGs can change dynamically over a period of time. Updating CGs in the

CGHT dynamically is difficult without precise information on when a member leaves or joins a

group. Another option is to simply flush the CGHT periodically based on the number of executed

instructions, memory references, or cache misses. However, a penalty will be paid to reestablish

the CGs after each flush. Note that a block can appear in more than one CG in the CGHT without

updating. This is because a block reference can leave a CG and become a member of a new CG,

while the CGHT may still keep the old CG. On multiple hits, either the most recent CG or all the

matched CGs may be prefetched.

Table 2-1 gives an example to illustrate the operations of CGs in Figure 2-4 step by step.

We assume three accesses with the same reuse distance are required to start forming a CG. We also assume memory accesses A, B and C have reuse distance d1, and have already been identified as a CG in the Request FIFO and moved to the CG Buffer. So the initial state is that the CG Buffer records A, B and C, as well as the current reuse distance d1.












Table 2-1. Example operations of forming a CG
Access  Reuse distance  Event               Comment
D       d1              Put D in CG Buffer  d1 matches the reuse distance of the current CG
N       d3              None                d3 is not equal to d1
M       d3              None                Only two same-reuse-distance accesses, N and M
J       d2              None                d2 is not equal to d1
E       d1              Put E in CG Buffer  d1 matches the reuse distance of the current CG
I       d2              None                Only two same-reuse-distance accesses, J and I
F       d1              Put F in CG Buffer  d1 matches the reuse distance of the current CG

Table 2-1 simulates the situation when accesses D, N, M, J, E, I, F come one by one. D, E, and F have the same reuse distance d1, and will be recorded in the CG Buffer step by step.

Although N and M have the same reuse distance d3, they cannot start to form a new CG until

another access with reuse distance d3 together with N and M appear in the Request FIFO at the

same time. If this does happen, the current CG will be moved to the CGHT to make space for the

newest CG. Once a L2 miss hits the CGHT, the entire CG can be identified and prefetched by

following the circular links. In Figure 2-4, for instance, a miss to block F will trigger prefetches

of A, B, C, D, and E in order.
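The steps walked through above can be summarized in a small behavioral sketch in Python. It is an abstraction of the structures in Figure 2-4, not a hardware description: it models a single CG Buffer, ignores CGHT associativity and replacement, and uses hypothetical parameter values.

    from collections import deque

    class CGPrefetcher:
        """Illustrative model of the Request FIFO, CG Buffer and CGHT."""

        def __init__(self, fifo_size=16, cg_size=8, start_threshold=3):
            self.fifo = deque(maxlen=fifo_size)   # recent (block, reuse_distance) pairs
            self.cg_buffer = []                   # members of the CG currently forming
            self.cg_distance = None               # reuse distance of the forming CG
            self.cg_size = cg_size
            self.start_threshold = start_threshold
            self.cght = {}                        # block -> list of members of its CG

        def _save_current_cg(self):
            # A completed CG is written into the CGHT so any member can find the group.
            if len(self.cg_buffer) >= self.start_threshold:
                for member in self.cg_buffer:
                    self.cght[member] = list(self.cg_buffer)
            self.cg_buffer, self.cg_distance = [], None

        def on_miss(self, block, reuse_distance):
            # A CGHT hit on any member triggers prefetches for the rest of its group.
            prefetches = [b for b in self.cght.get(block, []) if b != block]
            if reuse_distance == self.cg_distance:
                self.cg_buffer.append(block)      # grow the forming CG (e.g. D, E, F)
                if len(self.cg_buffer) == self.cg_size:
                    self._save_current_cg()
            else:
                # Start a new CG once enough same-distance requests sit in the FIFO.
                same = [b for b, d in self.fifo if d == reuse_distance] + [block]
                if len(same) >= self.start_threshold:
                    self._save_current_cg()
                    self.cg_buffer, self.cg_distance = same, reuse_distance
            self.fifo.append((block, reuse_distance))
            return prefetches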

2.3.2 Integrating CG-prefetcher on CMP Memory Systems

There are several issues to integrate the CG-prefetcher into the memory controller. The

first key issue is to determine the block reuse distance without seeing all processor requests at the

memory controller. A global miss sequence number is used. The memory controller assigns and

saves a new sequence number to each missed memory block in the DRAM array. The reuse

distance can be approximated as the difference of the new and the old sequence numbers.

For a 128-byte block with a 16-bit sequence number, a reuse distance of 64K blocks, or an 8MB working set, can be covered. The memory overhead is merely 1.5% (2 bytes per 128-byte block). When the same-distance requirement is relaxed, one sequence number can cover a small number of adjacent

requests, which will expand the working set coverage and/or reduce the space requirement.
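A minimal Python sketch of this approximation follows; the field widths match the text above, while the structure and names are hypothetical.

    # Per-block 16-bit sequence numbers approximate the reuse distance at the
    # memory controller without seeing every processor request.
    SEQ_BITS = 16
    SEQ_MASK = (1 << SEQ_BITS) - 1

    class ReuseDistanceTracker:
        def __init__(self):
            self.seq = 0
            self.last_seq = {}       # block address -> sequence number at its last miss

        def on_miss(self, block):
            self.seq = (self.seq + 1) & SEQ_MASK
            distance = None
            if block in self.last_seq:
                # Wrap-around difference between the new and old sequence numbers.
                distance = (self.seq - self.last_seq[block]) & SEQ_MASK
            self.last_seq[block] = self.seq
            return distance
            # 16 bits per 128-byte block is roughly 1.5% DRAM overhead, as noted above.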










Figure 2-5 shows the CG-prefetcher in memory system. To avoid regular cache-miss

requests from different cores disrupting one another for establishing the CGs [70], we construct a

private CG-prefetcher for each core.

Each CG-prefetcher has a Prefetch Queue (PQ) to buffer the prefetch requests (addresses)

from the associated prefetcher. A shared Miss Queue (MQ) stores regular miss requests from all cores for accessing the DRAM channels. A shared Miss Return Queue (MRQ) and a shared Prefetch Return Queue (PRQ) buffer the data from the miss requests and the prefetch requests

for accessing the memory bus.

We implement a private PQ to prevent prefetch requests of one core from blocking those

from other cores. The PQs have lower priority than the MQ. Among the PQs, a round-robin

fashion is used. Similarly, the PRQ has lower priority than the MRQ in arbitrating the system

bus. Each CG-prefetcher maintains a separate sequence number for calculating the block reuse

distance.

When a regular miss request arrives, all the PQs are searched. In case of a match, the

request is removed from the PQ and is inserted into the MQ, gaining a higher priority to access

the DRAM. In this case, there is no performance benefit since the prefetch of the requested block

has not been initiated. If a matched prefetch request is in the middle of fetching the block from

the DRAM, or is ready in the PRQ, waiting for the shared data bus, the request will be redirected

to the MRQ for a higher priority to arbitrate the data bus. Variable delay cycles can be saved

depending on the stage of the prefetch request. The miss request is inserted into the MQ

normally when no match is found.
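The priority rules just described can be sketched as follows. This is a simplified software model with hypothetical names; queue depths, the MRQ/PRQ data path and DRAM timing are omitted.

    from collections import deque

    class DramScheduler:
        def __init__(self, num_cores):
            self.mq = deque()                        # shared demand-miss queue
            self.pqs = [deque() for _ in range(num_cores)]   # per-core prefetch queues
            self.rr = 0                              # round-robin pointer over the PQs

        def next_request(self):
            # Demand misses always win over prefetches.
            if self.mq:
                return ("miss", self.mq.popleft())
            # Prefetch queues are served in round-robin fashion.
            for i in range(len(self.pqs)):
                pq = self.pqs[(self.rr + i) % len(self.pqs)]
                if pq:
                    self.rr = (self.rr + i + 1) % len(self.pqs)
                    return ("prefetch", pq.popleft())
            return None

        def on_demand_miss(self, addr):
            # Promote a matching, not-yet-issued prefetch to demand priority.
            for pq in self.pqs:
                if addr in pq:
                    pq.remove(addr)
                    break
            self.mq.append(addr)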












Figure 2-5. Integration of the CG-prefetcher into the memory controller.

A miss request can trigger a sequence of prefetches if it hits the CGHT. The prefetch

requests are inserted into the corresponding PQ. If the PQ or the PRQ is full, or if a prefetch

request has been initiated, the prefetch request is simply dropped. In order to filter out prefetches to blocks already located in the processor's cache, a topologically equivalent directory of the lowest-level cache is maintained in the controller (not shown in Figure 2-5). The directory is updated

based on misses, prefetches, and write-backs to keep it close to the cache directory. A prefetch is

dropped in case of a match. Note that all other simulated prefetchers incorporate the directory

too.

2.4 Evaluation Methodology

We simulate 2-core and 4-core CMPs with a 1MB, 8-way, shared L2 cache. Please note

that the CG prefetcher can also be applied to other L2 organizations. We add an independent

processor-side stride prefetcher to each core. All timing delays of misses and prefetches are










carefully simulated. Due to the slower clock of the memory controller, the memory-side prefetchers initiate one prefetch every 10 processor cycles. The queue sizes of the MQ, PQ, PRQ and MRQ are all 16 entries. We use all the 2-application and 4-application workload mixes described

in Chapter 1 for this work.

The performance results of the proposed CG-prefetcher are compared against a pair-wise

miss-correlation prefetcher (MC-prefetcher), a prefetcher based on the last miss stream (LS-

prefetcher), and a hot-stream prefetcher (HS-prefetcher). A processor-side stride prefetcher is

included in all simulated memory-side prefetchers. Descriptions of these prefetchers are given

next.

Processor-side Stride prefetcher (Stride-prefetcher). The stride-prefetcher identifies and

prefetches sequential or stride memory access patterns for specific PCs (program counters) [46].

It has 4k-entry PCs with each entry maintaining four previous references of that PC. Four

successive prefetches are issued, whenever four stride distances of a specific PC are matched.

Memory-side Miss-Correlation (MC) prefetcher. The MC-prefetcher records pair-wise

miss correlations A->B in a history table, and prefetches B if A happens again [37]. Each core

has a MC-prefetcher with a 128k-entry 8 set-associative history table. Each miss address (each

entry) records 2 successive misses. Upon a miss, the MC-prefetcher prefetches two levels in

depth, resulting in a total of up to 6 prefetches.

Memory-side Hot-Stream (HS) prefetcher. The HS-prefetcher records a linear miss

stream in a history table, and dynamically identifies and prefetcher repeated access sequences

[23]. It is simulated based on a Global History Buffer [54], [78], with 128k-entry FIFO and 64k-

entry 16 set-associative miss index table for each core. On every miss, the index and the FIFO

are searched sequentially to find all recent streams that begin with the current miss. If the first 3










misses of any two streams match, the matched stream is prefetched. The length of each stream is



Memory-side Last-Stream (LS) prefetcher. The LS-prefetcher is a special case of the

HS-prefetcher [23], where the last miss stream is prefetched without any further qualification. It

has the same implementation as the HS-prefetcher.

Memory-side Coterminous Group (CG) prefetcher. We use CG-2 to get both high

accuracy and decent coverage of misses. The CGHT is 16k entries per core, with 30 bits (16-way

set-associative) per entry. We use a 16-entry Request FIFO and four 8-entry CG-Buffers. A CG

can be formed once three memory requests in the Request FIFO satisfy the reuse distance

requirement. Each CG contains up to 8 members. The CGHT is flushed periodically every 2

million misses from the corresponding core.

Table 2-2 summarizes the extra space overhead to implement various memory-side prefetchers. Note that the space overhead of the processor-side stride prefetcher is minimal, and thus has not been included.

2.5 Performance Results

In Figure 2-6 (A) and Figure 2-6 (B), the combined IPC of Stride-, MC-, HS-, LS-, and

CG-prefetchers, normalized to that of the baseline model without any prefetching, are presented

for 4-core CMPs and 2-core CMPs respectively. Please note in all the following figures, we

simplify Stride-prefetcher to Stride, MC-prefetcher to MC, HS-prefetcher to HS, LS-prefetcher

to LS and CG-prefetcher to CG. We include the normalized combined IPC of the baseline model without any prefetching for comparison purposes (with a total height of 1). Note that the

absolute combined IPCs for the baseline model were given in Figure 2-1. Also, a separate

processor-side stride prefetcher is always running with MC-, HS-, LS-, and CG-prefetcher to

prefetch blocks with regular access patterns.









Table 2-2. Space overhead for various memory-side prefetcher
Prefetcher Memory controller (SRAM) per core DRAM
CG-prefetcher 60KB (16K*30bit/8) 3%
MC-prefetcher 2MB (128K*2*64bit/8) 0
HS-prefetcher 920KB(128K*43bit/8+64K*29bit/8) 0
LS-prefetcher 920KB(128K*43bit/8+64K*29bit/8) 0

Each bar is broken into the contributions made by individual workloads in the mix. The

total height represents the overall normalized IPC, or IPC speedup. For example, a total height of 1.2 means a 20% improvement in IPC as compared with the base IPC without any prefetching,

given in Figure 2-1. Please note a normalized IPC of less than one means the prefetcher degrades

the overall performance.

Several observations can be made. First, most workload mixes show performance

improvement for all five prefetching techniques. In general, the CG-prefetcher has the highest

overall improvement, followed by the LS-, the HS-, and the MC-prefetchers. Two workload

mixes Art/Twolf and Mcf/Twolf show a performance loss for most prefetchers. Our studies

indicate that Twolf has irregular patterns, and hardly benefits from any of the prefetching

schemes. Although Art and Mcf are well performed, the higher IPC of Twolf dominates the

overall IPC speedup.

Second, the CG-prefetcher is a big winner for the MEM workloads with speedup of 40% in

average, followed by the LS-prefetcher with 30%, the HS-prefetcher with 24% and the MC-

prefetcher with 18%. The MEM workloads exhibit heavier cache contentions and misses.

Therefore, the accurate CG-prefetcher benefits the most for this category.

Third, the CG-prefetcher generally performs better in the MIX and the CPU categories.

However, the LS-prefetcher slightly outperforms the CG-prefetcher in a few cases. With lighter

memory demands in these workload mixes, the LS-prefetcher can deliver more prefetched blocks

with a smaller impact on cache pollutions and memory traffic.






Figure 2-6. Normalized combined IPCs of various prefetchers. (Normalized to baseline). A) 4-
workload mix running on 4-core CMPs. B) 2-workload mix running on 2-core CMPs.

It is important to note that the normalized IPC speedup creates an unfair view when

comparing mixed workloads on multi-cores. For example, in Art/Mcf/Vortex/Bzip2, the

normalized IPCs of individual workloads are measured at 3.16 for Art, 1.41 for Mcf, 0.82 for



Vortex, and 1.42 for Bzip2 with the CG-prefetcher, and 2.39 for Art, 1.22 for Mcf, 0.86 for Vortex, and 1.49 for Bzip2 with the MC-prefetcher. Therefore, the average individual speedups of the four workloads, according to Equation (2-1), are 1.70 for the CG-prefetcher and 1.49 for the MC-prefetcher. However, their normalized combined IPCs are only 1.20 and 1.22. Given that Vortex and Bzip2 have considerably higher IPCs than those of Art and Mcf, the overall IPC improvement is dominated by these two workloads. This is also true for other workload mixes.

In Figure 2-7, the average speedups of two MEM and two MIX workload mixes are shown, computed according to Equation (2-1). Recall that all the applications in a MEM workload mix are memory-intensive, while a MIX workload mix contains both memory-intensive and CPU-intensive applications.


Average speedup = (1/n) \sum_{i=1}^{n} [ IPC_i(with prefetch) / IPC_i(without prefetch) ]        (2-1)

where n is the number of workloads in the mix.
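As a concrete illustration of Equation (2-1), the short sketch below (Python) recomputes the average speedups quoted above for Art/Mcf/Vortex/Bzip2. It is only a worked example of the formula, not part of the simulation infrastructure.

```python
# Illustration of Equation (2-1): per-workload speedups averaged with equal weight.
# The numbers are the normalized per-workload IPCs quoted in the text.

def average_speedup(per_workload_speedups):
    """Equation (2-1): mean of IPC_i(with prefetch) / IPC_i(without prefetch)."""
    return sum(per_workload_speedups) / len(per_workload_speedups)

cg = [3.16, 1.41, 0.82, 1.42]   # CG-prefetcher: Art, Mcf, Vortex, Bzip2
mc = [2.39, 1.22, 0.86, 1.49]   # MC-prefetcher: Art, Mcf, Vortex, Bzip2

print(f"{average_speedup(cg):.2f}")   # 1.70
print(f"{average_speedup(mc):.2f}")   # 1.49
```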

Compared with the measured IPC speedups in Figure 2-6, significantly higher average speedups are achieved by all prefetchers. For Art/Twolf, the average IPC speedups are 48% for the MC-prefetcher, 44% for the HS-prefetcher, 52% for the LS-prefetcher, and 51% for the CG-prefetcher, instead of the IPC degradations shown in Figure 2-6.

Figure 2-8 shows the accuracy and coverage of different prefetchers. In this figure, each

bar is broken down into 5 categories from bottom to top for each prefetcher: misses, partial hits,

miss reductions, extra prefetches, and wasted prefetches. The descriptions of each category are

listed as follows.

* The misses are those main memory accesses that have not been covered by the prefetchers.












Figure 2-7. Average speedup of 4 workload mixes.

* The partial hits refer to memory accesses for which part of the off-chip access latency is saved by earlier but incomplete prefetches. The earlier prefetches have been issued to the memory, but the blocks have not yet arrived in the L2 cache.

* The miss reductions are those accesses that have been fully covered by prefetches. These accesses are successfully and completely changed from L2 misses to L2 hits.

* The extra prefetches represent the prefetched blocks that are replaced before any use. These useless prefetched blocks pollute the limited cache space and waste the limited bandwidth.

* The wasted prefetches refer to prefetched blocks that are already present in the L2 cache when they arrive, because of mis-predictions of the memory-side shadow directory; they waste memory bandwidth.

The sum of the misses, partial hits, and miss reductions is equal to the number of misses of the baseline without prefetching, which is normalized to 1 in the figure. The sum of extra prefetches and wasted prefetches, also normalized to the baseline misses, represents the extra memory traffic.

According to the above definitions, the accuracy of a prefetcher can be described as Equation (2-2):

Accuracy = (Miss reductions + Partial hits) / (Miss reductions + Partial hits + Extra prefetches + Wasted prefetches)        (2-2)












Figure 2-8. Prefetch accuracy and coverage of simulated prefetchers.

The coverage of a prefetcher can be calculated as Equation (2-3):

Coverage = (Miss reductions + Partial hits) / (Miss reductions + Partial hits + Misses)        (2-3)
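The two metrics can be computed directly from the five per-prefetcher categories described above. The sketch below (Python, with illustrative counter names and made-up values) is a minimal rendering of Equations (2-2) and (2-3), not code from the simulator.

```python
# Minimal sketch of Equations (2-2) and (2-3), assuming the five counters
# described above are available (all normalized to the baseline miss count).
# Counter names and the example values are illustrative only.

def accuracy(miss_reductions, partial_hits, extra_prefetches, wasted_prefetches):
    """Equation (2-2): fraction of issued prefetches that turn out to be useful."""
    useful = miss_reductions + partial_hits
    return useful / (useful + extra_prefetches + wasted_prefetches)

def coverage(miss_reductions, partial_hits, misses):
    """Equation (2-3): fraction of baseline misses fully or partially covered."""
    covered = miss_reductions + partial_hits
    return covered / (covered + misses)

print(f"accuracy = {accuracy(0.30, 0.10, 0.05, 0.02):.2f}")   # 0.85
print(f"coverage = {coverage(0.30, 0.10, 0.60):.2f}")         # 0.40
```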


Overall, all prefetchers show significant coverage, i.e., reduction of cache misses, ranging from a few percent to as high as 50%. The MC-, LS-, and CG-prefetchers have better coverage than the HS-prefetcher, since the HS-prefetcher only identifies exactly repeated accesses. On the other hand, in contrast to the MC- and the LS-prefetchers, the HS- and the CG-prefetchers carefully qualify members of a group that show highly repeated patterns for prefetching. The MC- and the LS-prefetchers generate significantly higher memory traffic than the HS- and the CG-prefetchers. On average, the HS-prefetcher has the least extra traffic of about 4%, followed by 21% for the CG-prefetcher, 35% for the LS-prefetcher, and 52% for the MC-prefetcher. The excessive memory traffic of the LS- and the MC-prefetchers does not turn proportionally into a positive reduction of cache misses. In some cases, the impact is negative, mainly due to the cache pollution problem on CMPs. Between the two, the LS-prefetcher is more effective than the MC-prefetcher, indicating that prefetching multiple successor misses may not be a good idea. The HS-prefetcher has the highest accuracy; however, its low miss coverage limits its overall IPC improvement.

The reuse distance constraint for forming a CG is also evaluated; the performance results of CG-0, CG-2, and CG-8 are plotted in Figure 2-9. With respect to the normalized IPCs, the results are mixed, as shown in Figure 2-9 (A). Figure 2-9 (B) further plots the coverage and accuracy of the different CGs.

It is evident that the reuse distance constraint represents a tradeoff between accuracy and coverage. In general, a larger distance means higher coverage but lower accuracy. CG-2 appears to be better than CG-8 for workload mixes with higher L2 misses, such as Mcf/Ammp and Mcf/Mcf. We selected CG-2 to represent the CG-prefetcher due to its slightly better IPCs than those of CG-0 and its considerably lower traffic than that of CG-8. Note that we omit CG-4, which has a similar IPC speedup to CG-2 but generates more memory traffic.

The impact of group size is evaluated as shown in Figure 2-10. Two workload mixes in the MEM category, Art/Mcf/Ammp/Twolf and Mcf/Ammp, and two in the MIX category, Art/Mcf/Vortex/Bzip2 and Art/Twolf, are chosen due to their high memory demand. The measured IPCs decrease slightly or remain unchanged for the two 4-workload mixes, while they increase slightly for the two 2-workload mixes. Due to cache contention, larger groups generate more useless prefetches. A group size of 8 strikes a balance between high IPCs and low overall memory traffic.

Figure 2-11 plots the average speedup of CG with respect to Stride-only for different L2

cache sizes from 512KB to 4MB. As observed, the four workload mixes behave very differently

with respect to different L2 sizes.















Figure 2-9. Effect of distance constraints on the CG-prefetcher. A) Normalized IPC. B) Accuracy and traffic.


For Art/Mcf/Vortex/Bzip2 and Art/Twolf, the average IPC speedups peak at 1MB and 2MB, respectively, and then drop sharply because of a sharp reduction of cache misses with larger caches. However, for the memory-bound workload mixes, Art/Mcf/Ammp/Twolf and Mcf/Ammp, the average speedups with medium-size L2 caches are slightly less than those with smaller and larger L2 caches.













Figure 2-10. Effect of group size on the CG-prefetcher. A) Measured IPCs. B) Accuracy and
traffic.




Figure 2-11. Effect of L2 size on the CG-prefetcher.


With smaller caches, the cache contention problem is so severe that even a small percentage of successful prefetches can lead to significant IPC speedups. For medium-size caches, the impact of delaying normal misses due to conflicts with prefetches begins to offset the benefit of prefetching. When the L2 size continues to increase, the number of misses decreases, which diminishes the effect of access conflicts. As a result, the average speedup increases again.


Given the higher demand on the DRAM from the prefetching methods, we perform a sensitivity study on the number of DRAM channels, as shown in Figure 2-12. The results indicate that the number of DRAM channels does impact the IPCs, more so for the memory-bound workload mixes. All four workload mixes perform poorly with 2 channels. However, the improvements saturate at about 4 to 8 channels.



Figure 2-12. Effect of memory channels on the CG-prefetcher.

2.6 Summary

We have introduced an accurate CG-based data prefetching scheme for Chip Multiprocessors (CMPs). We have shown the existence of coterminous groups (CGs) and a third kind of locality, coterminous locality. In particular, the order of nearby references in a CG follows exactly the same order in which these references appeared last time, even though the references themselves may be irregular. The proposed prefetcher uses CG history to trigger prefetches when a member of a group is re-referenced. It overcomes challenges of existing correlation-based or stream-based prefetchers, including low prefetch accuracy, lack of timeliness, and large history storage. The accurate CG-prefetcher is especially appealing for CMPs, where cache contention and memory access demands are escalated. Evaluations based on various SPEC CPU 2000 workload mixes have demonstrated significant advantages of the CG-prefetcher over other existing prefetching schemes on CMPs.









CHAPTER 3
PERFORMANCE PROJECTION OF ON-CHIP STORAGE OPTIMIZATION

Organizing on-chip storage space on CMPs has become an important research topic.

Balancing data accessibility, which is limited by wiring delay, against the effective on-chip storage capacity, which is reduced by data replication, has been studied extensively. These studies must examine a wide spectrum of the design space to obtain a comprehensive view. The simulation time of detailed timing simulations is prohibitively long and increases drastically as the number of cores grows. A great challenge, then, is how to provide an efficient methodology to study the design choices for optimizing CMP on-chip storage accurately and completely as the number of cores increases.

In the second work, we first develop an analytical model to assess the general performance behavior with respect to data replication in CMP caches. The model injects replicas (replicated data blocks) into a generic cache. Based on the block reuse-distance histogram obtained from a real application, a precise mathematical formula is derived to evaluate the impact of the replicas. The results demonstrate that whether data replication helps or hurts L2 cache performance is a function of the total L2 size and the working set of the application.

To overcome the limitations of modeling, we further develop a single-pass stack

simulation technique to handle shared and private cache organizations with the invalidation-

based coherence protocol. The stack algorithm can handle complex interactions among multiple

private caches. This single-pass stack technique can provide local/remote hit ratios and the

effective cache size for a range of physical cache capacities. We also demonstrate that we can

use the basic multiprocessor stack simulation results to estimate the performance of other

interesting CMP cache organizations, such as shared caches with replication and private caches

without replication.









We verify the results of the analytical data replication model and the single-pass global

stack simulation with detailed execution-driven simulations. We show that the single-pass stack

simulation produces small error margins of 2-9% for all simulated cache organizations. The total

simulation times for the single-pass stack simulation and the individual execution-driven

simulations are compared. For a limited set of the four studied cache organizations, the stack

simulation takes about 8% of the execution-driven simulation time.

3.1 Modeling Data Replication

We first develop an abstract model, independent of private/shared organizations, to evaluate the tradeoff between the access time and the miss rate of CMP caches with respect to data replication. The purpose is to provide a uniform understanding of this central issue of caching in CMPs, which is present in most major cache organizations. This study also highlights the importance of examining a wide enough range of system parameters in the performance evaluation of any cache organization, which can be costly.

In Figure 3-1, a generic histogram of block reuse distances is plotted, where the reuse distance is measured by the number of distinct blocks between two adjacent accesses to the same block. A distance of zero indicates a request to the same block as the previous request. The histogram is denoted by f(x), which represents the number of block references with reuse distance x.


For a cache size S, the total number of cache hits can be measured by \int_0^S f(x) dx, which is equal to the area under the histogram curve from 0 to S. This well-known stack distance histogram can provide the hits/misses of all cache sizes with a fully-associative organization and the LRU replacement policy.
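As a small illustration of this property, the sketch below (Python; the histogram values are hypothetical and not measured data from this work) derives hits and misses for any fully-associative LRU cache size directly from a reuse-distance histogram, mirroring \int_0^S f(x) dx.

```python
# Minimal sketch: deriving hits/misses for any fully-associative LRU cache size
# from a reuse-distance histogram. The histogram below is a hypothetical example,
# indexed by reuse distance measured in distinct blocks.

def hits_and_misses(histogram, cache_blocks):
    """Hits = references with reuse distance smaller than the cache size (in blocks).
    References without a previous use (cold misses) are assumed to be excluded."""
    hits = sum(count for dist, count in enumerate(histogram) if dist < cache_blocks)
    misses = sum(histogram) - hits
    return hits, misses

# histogram[d] = number of references with reuse distance d (made-up values).
histogram = [500, 300, 200, 120, 80, 40, 20, 10]

for size in (2, 4, 8):                       # cache sizes in blocks
    h, m = hits_and_misses(histogram, size)
    print(f"{size} blocks: {h} hits, {m} misses")
```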

























Replica
0 ... S-R --------------------- S Reuse Distance
Figure 3-1. Cache performance impact when introducing replicas.

To model the performance impact of data replication, we inject replicas into the cache. Note that regardless of the cache organization, replicas help to improve the local hit rate, since replicas are created and moved close to the requesting cores. On the other hand, having replicas reduces the effective capacity of the cache and, hence, increases cache misses. We need to weigh the effect of the increase in local hits against that of the increase in cache misses.

Suppose we take a snapshot of the L2 cache and find a total of R replicas. As a result, only

S-R cache blocks are distinct, effectively reducing the capacity of the cache. Note that the model

does not make reference to any specific cache organization and management. For instance, it

does not say where the replicas are stored, which may depend on factors such as shared or

private organization. We will compare this scenario with the baseline case where all S blocks are

distinct. First, the cache misses are increased by \int_{S-R}^{S} f(x) dx, since the total number of hits is now \int_0^{S-R} f(x) dx. On the other hand, the replicas help to improve the local hits. Among the \int_0^{S-R} f(x) dx hits, a fraction R/S of the hits target the replicas. Depending on the specific cache


organization, not all accesses to the replicas result in local hits. A requesting core may find a









replica in the local cache of another remote core, resulting in a remote hit. We assume that a

fraction L of the accesses to replicas are actually local hits. Therefore, compared with the baseline case,

the total change of memory cycles due to the creation of R replicas can be calculated by:


Pm \int_{S-R}^{S} f(x) dx  -  Gl (R/S) L \int_0^{S-R} f(x) dx        (3-1)

where Pm is the penalty (in cycles) of a cache miss, and Gl is the cycle gain from a local hit. With the total number of memory accesses being \int_0^{\infty} f(x) dx, the average change of memory access cycles is equal to:

[ Pm \int_{S-R}^{S} f(x) dx  -  Gl (R/S) L \int_0^{S-R} f(x) dx ] / \int_0^{\infty} f(x) dx        (3-2)

Now the key is to obtain the reuse distance histogram f(x). We conduct an experiment using an OLTP workload [57] and collect its reuse distance histogram. With the curve-fitting tool of Matlab [52], we obtain the equation f(x) = A exp(-Bx), where A = 6.084x10^6 and B = 2.658x10^-3 (with x measured in KB). This is shown in Figure 3-2, where the cross marks represent the actual reuse frequencies from OLTP and the solid line is the fitted curve. We can now substitute f(x) into Equation (3-2) to obtain the average change in memory cycles as:

Pm (e^{-B(S-R)} - e^{-BS})  -  Gl (R/S) L (1 - e^{-B(S-R)})        (3-3)

Equation (3-3) provides the change in the average memory access time as a function of the cache area occupied by the replicas. In Figure 3-3, we plot the change of the memory access time for three cache sizes, 2, 4, and 8 MB, as we vary the replicas' occupancy from none to the entire cache. In this figure, we assume Gl = 15 and Pm = 400, and we vary L over 0.25, 0.5, and 0.75 for each cache size. Note that negative values mean performance gain. We can observe that the benefit of allocating L2 space for replicas for the OLTP workload varies with different cache sizes.
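The sketch below is a minimal numerical evaluation of Equation (3-3), using the fitted B and the Pm, Gl, and L values assumed above (with sizes in KB, as in the fitted histogram). It is only meant to illustrate how the best replica fraction can be located under these assumptions, not to reproduce the exact figures.

```python
import math

# Minimal sketch: evaluate Equation (3-3) and locate the replica fraction that
# minimizes the change in average memory access time. Constants follow the
# assumptions in the text; treat them as illustrative.
B  = 2.658e-3   # decay constant of the fitted reuse-distance histogram (per KB)
Pm = 400        # miss penalty in cycles
Gl = 15         # cycle gain of a local hit

def delta_t(S_kb, R_kb, L):
    """Equation (3-3): change in average memory access cycles with R KB of replicas."""
    miss_term  = Pm * (math.exp(-B * (S_kb - R_kb)) - math.exp(-B * S_kb))
    local_term = Gl * (R_kb / S_kb) * L * (1.0 - math.exp(-B * (S_kb - R_kb)))
    return miss_term - local_term   # negative values mean a performance gain

for size_mb in (2, 4, 8):
    S = size_mb * 1024
    # Sweep the replica fraction and keep the best (most negative) point.
    best = min(((delta_t(S, f * S, L=0.5), f) for f in [i / 100 for i in range(0, 96)]),
               key=lambda t: t[0])
    print(f"{size_mb}MB, L=0.5: best fraction ~ {best[1]:.2f}, delta ~ {best[0]:.1f} cycles")
```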












Figure 3-2. Curve fitting of reuse distance histogram for the OLTP workload.


Figure 3-3. Change of the average memory access time as the fraction of the cache allocated to replicas varies, for 2, 4, and 8MB caches with L = 0.25, 0.5, and 0.75 (OLTP workload).

For instance, when L = 0.5, the results indicate that no replication provides the shortest average




memory access time for a 2MB L2 cache, while for larger 4MB and 8MB L2 caches, allocating

40% and 68% of the cache for the replicas has the smallest access time. These results are

consistent with the reuse histogram curve shown in Figure 3-2. The reuse count approaches zero

when the reuse distance is equal to or greater than 2MB. It increases significantly when the reuse

distance is shorter than 2MB. Therefore, it is not wise to allocate space for the replicas when the









cache size is 2MB or less. Increasing L favors data replication slightly. For instance, for a 4MB cache, allocating 34%, 40%, and 44% of the cache for the replicas achieves the best performance improvement of about 1, 3, and 5 cycles in the average memory access time for L = 0.25, 0.5, and 0.75, respectively. The performance improvement with data replication would be more significant when Gl increases.

The general behavior due to data replication is consistent with the detailed simulation results that will be given in Section 3.4. Note that the fraction of replicas cannot reach 100% unless the entire cache is occupied by a single block. Therefore, in Figure 3-3, the average memory time increase is not meaningful when the fraction of replicas approaches the entire cache.

We also run the same experiment for two other workloads, Apache and SPECjbb. Figure 3-4 plots the optimal fractions of replication for all three workloads with cache sizes from 2 to 8MB and L from 0.25 to 0.75. The same behavior can be observed for both Apache and SPECjbb. Larger caches favor more replication. For example, with L = 0.5, allocating 13%, 50%, and 72% of the space for replicas gives the best performance for Apache, and 28%, 59%, and 78% for SPECjbb. Also, increasing L favors more replication. With its smaller working set, SPECjbb benefits from replication the most among the three workloads.

It is essential to study a set of representative workloads over a spectrum of cache sizes to understand the tradeoff of accessibility vs. capacity in CMP caches. A fixed replication policy may not work well for a wide variety of workloads on different CMP caches. Although mathematical modeling can provide an understanding of the general performance trend, its inability to model sufficiently detailed interactions among multiple cores makes it less useful for accurate performance prediction. To remedy this problem, we describe a global stack based simulation for studying CMP caches next.














Figure 3-4. Optimal fraction of replication derived by the analytical model.

3.2 Organization of Global Stack

Figure 3-5 sketches the organization of the global stack, which records the memory

reference history from all the cores.

In the CMP context, a block address and its core-id uniquely identify a reference, where

the core-id indicates from which core the request is issued. Several independent linked lists are

established in the global stack for simulating a shared stack and several per-core private stacks. Each stack entry appears in exactly one of the private stacks, determined by its core-id, and may or

may not reside in the shared stack depending on the recency of the reference. In addition, an

address-based hash list is also established in the global stack for fast searches.

Since only a set of discrete cache sizes are of interest for cache studies, both the shared and

the private stacks are organized as groups [43]. Each group consists of multiple entries for fast

search during the stack simulation and for easy calculations of cache hits under various

interesting cache sizes after the simulation.











(Each global stack entry contains the block address, the core ID, a shared group ID, a private group ID, and hash/shared/private prev and next pointers. The shared and private group tables keep, for each group, the reuse counters and a group-bound pointer to the last block of the group; the private group table additionally keeps local, remote, and replica counters.)
Figure 3-5. Single-pass global stack organization.

For example, assume the cache sizes of interest are 16KB, 32KB, and 64KB. The groups can then be organized according to the stack sequence starting from the MRU entry, with 256, 256, and 512 entries for the first three groups, respectively, assuming the block size is 64B. Based on the stack inclusion property, the hits to a particular cache size are equal to the sum of the hits to all the groups accumulated up to that cache size. Each group maintains a reuse counter, denoted by G1, G2, and G3. After the simulation, the cache hits for the three cache sizes can be computed as G1, G1+G2, and G1+G2+G3, respectively.
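To make the group-based bookkeeping concrete, the following sketch (Python; a simplified single-core, fully-associative LRU stack with per-group reuse counters, not the actual multi-core simulator with linked lists and hash lookup) shows how a single pass yields the hit counts for every cache size of interest.

```python
# Simplified sketch of a single-pass grouped LRU stack (one core, no sharing).
# The real simulator uses linked lists, a hash list, and group-bound pointers;
# here a plain Python list stands in for the stack to keep the idea visible.

class GroupedStack:
    def __init__(self, group_sizes):
        self.group_sizes = group_sizes          # e.g. [256, 256, 512] blocks
        self.counters = [0] * len(group_sizes)  # reuse counter per group (G1, G2, ...)
        self.stack = []                         # block addresses, MRU at index 0

    def _group_of(self, depth):
        """Map a stack depth (0-based) to its group index, or None if beyond all groups."""
        bound = 0
        for g, size in enumerate(self.group_sizes):
            bound += size
            if depth < bound:
                return g
        return None

    def access(self, block):
        if block in self.stack:
            depth = self.stack.index(block)
            g = self._group_of(depth)
            if g is not None:
                self.counters[g] += 1           # reuse falls into group g
            self.stack.remove(block)
        self.stack.insert(0, block)             # block becomes MRU

    def hits_for(self, num_groups):
        """Hits of a cache covering exactly the first num_groups groups."""
        return sum(self.counters[:num_groups])

# Usage: 16KB/32KB/64KB caches with 64B blocks -> groups of 256, 256, 512 blocks.
sim = GroupedStack([256, 256, 512])
for addr in [1, 2, 3, 1, 2, 4, 1]:
    sim.access(addr)
print([sim.hits_for(m) for m in (1, 2, 3)])     # cumulative hits for the three sizes
```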

Separate shared and private group tables are maintained to record the reuse frequency

count and other information for each group in the shared and private caches. A shared and a

private group-id are kept in each global stack entry as a pointer to the corresponding group

information in the shared and the private group table. The group bound in each entry of the group










table links to the last block of the respective group in the global stack. These group bounds

provide fast links for adjusting entries between adjacent groups. The associated counters are

accumulated on each memory request, and will be used to deduce cache hit/miss ratios for

various cache sizes after the simulation. The following subsections provide detailed stack

operations.

3.2.1 Shared Caches

Each memory block can be recorded multiple times in the global stack, one from each core

according to the order of the requests. Intuitively, only the first-appearance of a block in the

global stack should be in the shared list since there is no replication in a shared cache. A first-

appearance block is the one that is most recently used in the global stack among all blocks with

the same address.

The shared stack is formed by linking all the first-appearance blocks from MRU to LRU. Figure 3-6 illustrates an example of a memory request sequence and the corresponding operations on the shared stack. Each memory request is denoted as a block address, A, B, C, ..., etc., followed by a core-id. The detailed stack operations when B1 is requested are described as follows.

* Address B is searched through the hash list of the shared stack. B2 is found with the matching address. In this case, the reuse counter for the shared group where B2 resides, group 3, is incremented.

* B2 is removed from the shared list, and B1 is inserted at the top of the shared list.

* The shared group-id for B1 is set to 1. Meanwhile, the block located on the boundary of the first group, E1, is pushed to the second group. The boundary adjustment continues to the group where B2 was previously located.

* If a requested block cannot be located through the hash list (i.e., the very first access of the address among any cores), the stack is updated as above without incrementing any reuse counters.

* After the simulation, the total number of cache hits for a shared cache that includes exactly the first m groups is the sum of all shared reuse counters from group 1 to group m.










(Memory request sequence: A1, B2, C3, D4, E1, F2, B1, ...)

Figure 3-6. Example operations of the global stack for shared caches.

3.2.2 Private Caches

The construction and update of the private lists are essentially the same as those of the

shared list, except that we link accesses from the same core together. We collect crucial

information such as the local hits, remote hits, and number of replicas, with the help of the local,

remote, and replica counters in the private group table. For simplicity, we assume these counters

are shared by all the cores, although per-core counters could provide more information. Figure 3-7 shows the contents of the four private lists and the private group table when we extend the previous memory sequence (Figure 3-6) with three additional requests, A2, C1, and A1.

Local/remote reuse counters. The local counter of a group is incremented when a request

falls into the respective group in the local private stack. In this example, only the last request, A1, encounters a local hit, and in this case, the local counter of the second group is incremented.




















Figure 3-7. Example operations of the global stack for private caches.


After the simulation, the sum of all local counters from group 1 to group m represents the total

number of local hits for private caches with exactly m groups.

Counting the remote hits is a little tricky, since a remote hit may only happen when a

reference is a local miss. For example, assume that a request is in the third group of the local

stack; meanwhile, the minimum group id of all the remote groups where this address appears is

the second. When the private cache size is only large enough to contain the first group, neither a

local nor a remote hit happens. If the cache contains exactly two groups, the request is a remote

hit. Finally, if the cache is extended to the third group or larger, it is a local hit. Formally, if an

address is present in the local group L and the minimum remote group that contains the block is

R, the access can be a remote hit only if the cache size is within the range from group R to L-1.

We increment the remote counters for groups R to L-1 (R <= L-1). Note that after the simulation,










the remote counter of group m is the number of remote hits for a cache with exactly m groups. To differentiate them from the local counters, we call them accumulated remote counters.
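A compact way to view the counter-update rule above is the following sketch (Python; group indices are 1-based, and the counter arrays are a simplified stand-in for the simulator's private group table).

```python
# Sketch of the local/remote counter updates described above (1-based group indices).
# local_counters[m-1]  : plain reuse counter of group m in the requesting core's stack.
# remote_counters[m-1] : accumulated counter; entry m already equals the remote hits
#                        of a private cache with exactly m groups.

def record_access(local_group, min_remote_group, local_counters, remote_counters):
    num_groups = len(local_counters)
    if local_group is not None:
        # Local reuse: counted once in the group where the block currently resides.
        local_counters[local_group - 1] += 1
    if min_remote_group is not None:
        # A remote hit is possible only for cache sizes from min_remote_group up to
        # (local_group - 1); with no local copy at all, up to the largest group.
        upper = (local_group - 1) if local_group is not None else num_groups
        for m in range(min_remote_group, upper + 1):
            remote_counters[m - 1] += 1

# Example mirroring request A1 in Figure 3-7: local hit in group 2, a copy of A
# also present remotely in group 1 -> only group 1's remote counter moves.
local, remote = [0, 0, 0], [0, 0, 0]
record_access(local_group=2, min_remote_group=1,
              local_counters=local, remote_counters=remote)
print(local, remote)   # [0, 1, 0] [1, 0, 0]
```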

In the example, the first highlighted request, B1, encounters a local miss but a remote hit to B2 in the first group. We accumulate the remote counters for all the groups. The second request, A2, is also a local miss but a remote hit to A1 in the second group. The remote counter of the first group remains unchanged, while the counters are incremented for all the remaining groups. Similar to B1, all the remote counters are incremented for C1. Finally, the last request, A1, is a local hit in the second group and is also a remote hit to A2 in the first group. In this case, only the remote counter of the first group is incremented, since A1 is considered a local hit if the cache size extends beyond the first group.

Measuring replicas. The effective cache size is an important factor for shared and private cache comparisons [8], [24], [81], [20]. The single-pass stack simulation counts each block replication as a replica for calculating the effective cache size along the simulation. Similar to the remote hit case, we use accumulated replica counters. As shown in Figure 3-7, the first highlighted request, B1, creates a replica in the first group, as well as in any larger groups, because of the presence of B2. The second highlighted request, A2, does not create a new replica in the first group, but it does create a new replica in the second group because of A1. Meanwhile, A2 pushes B2 out of the first group, thus reducing the replicas in the first group by one. This new replica applies to all the larger groups too. Note that the addition of B2 to the second group does not alter the replica counter for group 2, since the replica was already counted when B2 was first referenced. Similar to B1, the third highlighted request, C1, creates a replica in all the groups. Lastly, the reference A1 extends a replica of A into the first group because of A2. The counters for the remaining groups stay the same.










Handling memory writes. In private caches, memory writes may cause invalidations of all the replicas. During the stack simulation, write invalidations create holes in the private stacks where the replicas are located. These holes will be filled later when an adjacent block is pushed down from a more-recently-used position by a new request. No block is pushed out of a group while a hole exists in the group. To accurately maintain the reuse counters in the private group table, each group records the total number of holes for each core. The number of holes is initialized to the respective group size and is decremented whenever a valid block joins the group. The hole count for each group avoids searching for existing holes.

3.3 Evaluation and Validation Methodology

We simulate an 8-core CMP system. The global stack runs behind the L1 caches and simulates every L1 miss, essentially replacing the role of the L2 caches. During simulations, stack distances and other related statistics are collected as described in the above section. Each group contains 256 blocks (16KB), and we simulate 1024 groups (16MB maximum). The results of the single-pass stack simulation are used to derive the performance of shared or private caches with various cache sizes and sharing mechanisms for understanding the accessibility vs. capacity tradeoff in CMP caches.

The results from the stack simulation are verified against execution-driven simulations,

where detailed cache models with proper access latencies are inserted. In the detailed execution-

driven simulation, we assume the shared L2 has eight banks, with one local and seven remote

determined by the least-significant three bits of the block address. The total shared cache sizes

are 1, 2, 4, 8, and 16MB. For the private L2, we model both local and remote accesses. The

MOESI coherence protocol is implemented to maintain data coherence among the private L2s.

Accordingly, we simulate 128, 256, 512, 1024, 2048KB private caches. For comparison, we use

the hit/miss information and average memory access times to approximate the execution time










behavior because the single-pass stack simulation cannot provide IPCs. We use three

multithreaded commercial workloads, OLTP, Apache, and SPECjbb.

The accuracy of the CMP memory performance projection can be assessed from two different angles: the accuracy of predicting individual performance metrics, and the accuracy of predicting general cache behavior. By verifying the results against the execution-driven simulation, we demonstrate that the stack simulation can accurately predict cache hits and misses for the targeted L2 cache organizations, and more importantly, it can precisely project the sharing and replication behavior of the CMP caches.

One inherent weakness of stack simulation is its inability to insert accurate timing delays for variable L2 cache sizes. The fluctuation in memory delays may alter the sequence of memory accesses among multiple processors. We take a simple approach and insert memory delays based on a single discrete cache size. In the stack simulation, we inserted memory delays based on five cache sizes, 1MB, 2MB, 4MB, 8MB, and 16MB, denoted as stack-1, stack-2, stack-4, stack-8, and stack-16, respectively. An off-chip cache miss latency is charged if the reuse distance is longer than the selected discrete cache size.
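The delay-insertion rule reduces to a single comparison per reference, as in the small sketch below (Python; the latency values are illustrative placeholders, not the simulator's configuration).

```python
# Sketch of the stack-N delay insertion: charge a miss latency whenever the reuse
# distance exceeds the single discrete cache size chosen for the run.
# Latency values below are illustrative placeholders.

HIT_LATENCY_CYCLES  = 20
MISS_LATENCY_CYCLES = 400

def inserted_delay(reuse_distance_bytes, selected_cache_bytes):
    """stack-N rule: one fixed cache size decides hit vs. miss timing."""
    if reuse_distance_bytes > selected_cache_bytes:
        return MISS_LATENCY_CYCLES   # treated as an off-chip miss
    return HIT_LATENCY_CYCLES        # treated as an on-chip hit

# Example: the same reference is charged differently under stack-1 and stack-16.
print(inserted_delay(3 * 2**20, 1 * 2**20))    # 400 under stack-1
print(inserted_delay(3 * 2**20, 16 * 2**20))   # 20 under stack-16
```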

3.4 Evaluation and Validation Results

3.4.1 Hits/Misses for Shared and Private L2 Caches

Figure 3-8 shows the projected and real miss rates for shared caches, where "real" represents the results from individual execution-driven simulations. In general, the stack results follow the execution-driven results closely. For OLTP, stack-2 shows only about a 5-6% average error. For Apache and SPECjbb, the difference among the different delay insertions is less apparent. The stack results predict the miss ratios with about 2-6% error, except for Apache with a small 1MB cache.














Figure 3-8. Verification of miss ratios from global stack simulation for shared caches.


Two major factors affect the accuracy of the stack results.

One is cache associativity. Since we use a fully-associative stack to simulate a 16-way cache, the stack simulation usually underestimates the real miss rates. This effect is more apparent when the cache size is small, due to more conflict misses. The issue can be addressed by more complicated set-associative stack simulations [53], [32]. For simplicity, we keep the stack fully-associative. More sensitivity studies are also needed to evaluate L2 caches with smaller set associativity.


The other factor is inaccurate delay insertion. For example, in the stack-1 simulation of OLTP, the cache miss latency is inserted whenever the reuse distance is longer than 1MB. Such a cache miss delay is inserted wrongly for caches larger than 1MB. These extra delays for larger caches cause more OS interference and context switches, which may lead to more cache misses. At the 4MB cache size, the overestimate of cache misses due to the extra delay insertion exceeds the underestimate due to the full associativity. The gap becomes wider with larger caches. On the other hand, the stack-16 simulation for smaller caches mistakenly inserts the hit latency, instead of the miss latency, for accesses with reuse distances between the corresponding cache size and 16MB, causing less OS interference and thus fewer misses. In this case, both the full associativity and the delay insertion lead to an underestimate of the real misses, which makes the stack-16 simulation the most inaccurate.

For private caches, Figure 3-9 shows the overall misses, the remote hits, and the average effective sizes. Note that the horizontal axis shows the size of a single core's cache, from 128KB to 2MB. With eight cores, the total sizes of the private caches are comparable to the shared cache sizes in Figure 3-8.

We can make two important observations. First, compared with the shared cache, the simulation results show that the overall L2 miss ratios are increased by 14.7%, 9.9%, 4.3%, 1.1%, and 0.5% for OLTP for the private cache sizes from 128KB to 2MB. For Apache and SPECjbb, the L2 miss ratios are increased by 11.8%, 4.4%, 1.1%, 1.0%, 2.2%, and 7.3%, 3.1%, 2.9%, 0.6%, 0.5%, respectively. Second, the estimated miss and remote hit rates from the stack simulation match the results from the execution-driven simulations closely, with less than a 10% margin of error.

We also simulate the effective capacity for the private-cache cases. The effective cache

size is the average over the entire simulation period. In general, the private cache reduces the

cache capacity due to replicated and invalid cache entries. The effective capacity is reduced to

45-75% for the three workloads with various cache sizes. The estimated capacity from the stack

simulation is almost identical to the result from the execution-driven simulation. Due to its

higher accuracy, we use the stack-2 simulation in the following discussion.


















Figure 3-9. Verification of miss ratio, remote hit ratio and average effective size from global

stack simulation for private caches.



3.4.2 Shared Caches with Replication



To balance accessibility and capacity, victim-replication [81] creates a dynamic L1 victim



cache for each core in the local slice of the L2 to trade capacity for fast local access. In this



section, we estimate the performance of a static victim-replication scheme. We allocate 0% to



50% of the L2 capacity as L1 victim caches with variable L2 sizes from 2MB to 8MB. For










performance comparison, we use the average memory access time, which is calculated based on the local hits to the victim caches, the hits to the shared portion of the L2, and the L2 misses.

The average memory access time of the static victim replication can be derived directly from the results of the stack simulation described in the previous sections, assuming the inclusion property is enforced between the shared portion of the L2 and the victim portion plus the L1. Suppose the L1 and L2 sizes are denoted by CL1 and CL2, r is the percentage of the L2 allocated for the victim cache, and n is the number of cores. Then, each victim-cache size is equal to (r*CL2)/n, and the remaining shared portion is equal to (1-r)*CL2. The average memory access time includes the following components. First, since the L1 and the victim cache are exclusive, the total hits to the victim cache can be estimated from the private stacks with a size of the L1 plus the victim cache: CL1 + (r*CL2)/n. Note that this estimation may not be precise due to the lack of the L1 hit information that alters the sequence in the stack. Second, the total number of L2 hits (including the victim portion) and L2 misses can be calculated from the shared stack with the size (1-r)*CL2. Finally, the hits to the shared portion of the L2 can be calculated by subtracting the hits to the victim caches from the total L2 hits.
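Put together, the estimate reduces to a few lines of arithmetic, as in the sketch below (Python; the stack-lookup helpers, counts, and latency numbers are hypothetical placeholders standing in for values read off the shared and private stack results).

```python
# Sketch of the static victim-replication estimate described above.
# private_hits(size) and shared_hits_misses(size) are hypothetical helpers that
# would be read off the single-pass stack results; latencies are placeholders.

LOCAL_LAT, SHARED_LAT, MISS_LAT = 5, 20, 400   # cycles (illustrative)

def avg_access_time(r, CL1, CL2, n, total_refs, private_hits, shared_hits_misses):
    victim_size = (r * CL2) / n                 # per-core victim-cache capacity
    shared_size = (1 - r) * CL2                 # remaining shared portion
    # Hits to the victim portion: private stack at size L1 + victim (L1 is exclusive).
    victim_hits = private_hits(CL1 + victim_size)
    # Total L2 hits/misses: shared stack at the reduced shared size.
    l2_hits, l2_misses = shared_hits_misses(shared_size)
    shared_hits = l2_hits - victim_hits         # hits served by the shared portion
    cycles = victim_hits * LOCAL_LAT + shared_hits * SHARED_LAT + l2_misses * MISS_LAT
    return cycles / total_refs

# Dummy usage with made-up counts, just to show the plumbing:
est = avg_access_time(
    r=0.25, CL1=64 * 2**10, CL2=4 * 2**20, n=8, total_refs=1_000_000,
    private_hits=lambda size: 200_000,
    shared_hits_misses=lambda size: (800_000, 200_000))
print(f"{est:.1f} cycles")
```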

Figure 3-10 demonstrates the average L2 access time with static victim replication. Generally, large caches favor more replication. For a small 2MB L2, except that Apache shows a slight performance gain at low replication levels, the average L2 access times increase with more replication. The optimal replication levels for OLTP are 12.5% and 37.5% for the 4MB and 8MB L2, respectively. This general performance behavior with respect to data replication is consistent with what we have observed from the analytical model in Section 3.1. However, the analytical model, which does not account for cache invalidations, would require a lower L to predict the optimal replication level.












Figure 3-10. Average L2 access time with static victim replication for the three workloads with 2, 4, and 8MB L2 caches, as the percentage of the replication area varies from 0% to 50%.



For SPECjbb, 12.5% replication shows the best result for both the 4MB and 8MB L2, in contrast to the 37.5% level for the 8MB OLTP case. The seeming contradiction comes from the fact that L2 misses start to reduce drastically at around 8MB of cache, as demonstrated in Figure 3-8. We can also observe that the optimal replication levels match perfectly between the stack simulations and the execution-driven simulations. With respect to the average L2 access time, the stack results are within 2-8% error margins.




3.4.3 Private Caches without Replication

Private caches sacrifice capacity for fast access time. It may therefore be desirable to limit replication in the private caches. To understand the impact of the private L2 without replication,










we run a separate stack simulation in which the creation of a replica causes the invalidation of

the original copy.

Figure 3-11 demonstrates the L2 access delays of the private caches without replication,

shown as the ratio to those of the private caches with full replication. As expected, with small

128KB and 256KB private caches per core, the average L2 access times without replication are

about 5-17% lower than those with full replication for all the three workloads. This is because

the benefit of the increased capacity more than compensates for the loss of local accesses.

With large 1MB or 2MB caches per core, the average L2 access time of the private caches without replication is 12-30% worse than that of the full-replication counterpart, suggesting that increasing local accesses is beneficial when enough L2 capacity is available. The stack simulation results follow this trend perfectly. They provide very accurate results with only a 2-5% margin of error.

3.4.4 Simulation Time Comparison

The full-system Virtutech Simics 2.2 simulator [50], used to simulate an 8-core CMP system with Linux 9.0 and the x86 ISA, runs on an Intel Xeon 3.2 GHz 2-way SMP. The simulation time of each stack or execution-driven simulation is measured on a dedicated system without other interference. A timer was inserted at the beginning and the end of each run to calculate the total execution time.

In the single-pass stack simulation, each stack is partitioned into 16KB groups, with a total of 1024 groups for the 16MB cache. These small 16KB groups are necessary in order to study shared caches with a variable percentage of replication area, as shown in Figure 3-10. The stack simulation time can be further reduced for cache organizations that only require a few large groups.












Figure 3-11. Verification of average L2 access time ratio from global stack simulation for private caches without replication.

Table 3-1 summarizes the simulation times for the stack and the execution-driven simulations used to obtain the results above. For each workload, two stack simulations are needed. One run produces the results for shared caches, private caches, and shared caches with replication, and the other run is for the private L2 without replication. Execution-driven simulation requires a separate run for each cache size, resulting in five runs for each cache organization. In studying the shared cache with replication, five separate runs are needed for each cache size in order to simulate five different replication percentages. No separate stack simulation is required for the shared cache with replication. Similarly, no separate execution-driven simulation is needed for shared caches with 0% area for data replication. Therefore, we have 20 runs for the shared cache with replication in the execution-driven simulations. The total number of simulation runs is also summarized in Table 3-1.







Table 3-1. Simulation time comparison of global stack and execution-driven simulation (in minutes)
Measurements                 Workload    Stack          Execution-Driven
Shared / Private             OLTP        1 Run: 835     (5+5) Runs: 6252
(Section 3.4.1)              Apache      1 Run: 901     (5+5) Runs: 6319
                             SPECjbb     1 Run: 582     (5+5) Runs: 4220
Shared with replication      OLTP        0 Runs: 0      20 Runs: 11976
(Section 3.4.2)              Apache      0 Runs: 0      20 Runs: 12211
                             SPECjbb     0 Runs: 0      20 Runs: 8210
Private no replication       OLTP        1 Run: 872     5 Runs: 3257
(Section 3.4.3)              Apache      1 Run: 948     5 Runs: 3372
                             SPECjbb     1 Run: 613     5 Runs: 2199
Total                                    4751           58016

The total stack simulation time is about 4751 minutes, while the execution-driven simulations take 58016 minutes, a factor of over 12. This gap can be much wider if more cache organizations and sizes are studied and simulated.

3.5 Summary

In this chapter, we developed an abstract model for understanding the general performance

behavior of data replication in CMP caches. The model showed that data replication could

degrade cache performance without a sufficiently large capacity. We then used the global stack

simulation for more detailed study on the issue of balancing accessibility and capacity for on-

chip storage space on CMPs. With the stack simulation, we can explore a wide-spectrum of the

cache design space in a single simulation pass. We simulated the schemes of shared caches,

private caches and private caches without replication with various cache sizes directly by global

stack simulation. We also deduced the performance of shared caches with replication from the shared and private cache results. We verified the stack simulation results with execution-driven

simulations using commercial multithreaded workloads. We showed that the single-pass stack

simulation can characterize the CMP cache performance with high accuracy (only about 2-9% error margins) and significantly less simulation time (about 8% of the execution-driven time). Our results show that the effectiveness of various techniques for optimizing the CMP on-chip storage is closely related to the total L2 size.









CHAPTER 4
DIRECTORY LOOKASIDE TABLE: ENABLING SCALABLE, LOW-CONFLICT CMP
CACHE COHERENCE DIRECTORY

The directory-based cache coherence mechanism is one of the most important choices for building scalable CMPs. The design of a sparse coherence directory for future CMPs with many cores presents new challenges. With a typical set-associative sparse directory, the hot-set conflict at the directory tends to worsen when many cores compete in each individual set, unless the set associativity is dramatically increased. Because precise cache information must be maintained, set conflicts cause inadvertent cache invalidations. Thus, an important technical issue is to avoid hot-set conflicts at the coherence directory with small set associativity, small directory space, and high efficiency.

We develop a set-associative directory with an augmented directory lookaside table to

allow displacing directory entries from their primary sets for solving the hot-set conflicts. The

proposed CMP coherence directory offers three unique contributions. First, while none of the

existing cache coherence mechanisms are efficient enough when the number of cores becomes

large, the proposed CMP coherence directory provides a low cost design with a small directory

size and low set associativity. Second, although the size of the coherence directory matches the

total number of CMP cache blocks, the topological difference between the coherence directory

and all cache modules creates conflicts in individual sets of the coherence directory and causes

inadvertent invalidations. The DLT is introduced to reconcile the mismatch between the two

CMP components. In addition, the unique design of the DLT has its own independent utility in

that it can be applied to other set-associative cache organizations for alleviating hot-set conflicts.

In particular, it has advantages over other multiple-hash-function-based schemes, such as the

skewed associative cache [64], [14]. Third, unlike the memory-based coherence directory where

each memory block has a single directory entry along with the presence bits indicating where the









block is located, the proposed CMP directory keeps a separate record for every copy of the same

cached block along with the core ID. Multiple hits to a block can occur in a directory lookup,

which returns multiple core IDs without expensive presence bits.

Performance evaluations have demonstrated the significant performance improvement of

the DLT-enhanced directory over the traditional set-associative or skewed associative

directories. Augmented with a DLT that allows up to one quarter of the cache blocks to be

displaced from their primary sets in the set-associative directory, up to 10% improvement in

execution time is achievable. More importantly, such an improvement is within 98% of what an

ideal coherence directory can accomplish.

In the following sections of this chapter, we first show that a coherence directory with limited set associativity can have a large performance impact due to inadvertent cache invalidations. We then propose our enhancement of the directory, the directory lookaside table. This is followed by detailed performance evaluations.

4.1 Impact on Limited CMP Coherence Directory

In this section, we demonstrate the severity of cache invalidation due to hot-set conflicts at

the coherence directory if the directory has small set-associativity. Each copy of a cached block

occupies a directory entry that records the block address and the ID of the core where the block

is located. A block must be removed from the cache when its corresponding entry is replaced

from the CMP directory. Three multithreaded workloads, OLTP, Apache, SPECjbb, and two

multiprogrammed workloads, SPEC2000 and SPEC2006, were used for this study. These

workloads ran on a Simics-based whole-system simulation environment. In these simulations, we assume a CMP with eight cores, and each core has a private 1MB, 8-way L2 cache. The simulated CMP directory with different set associativities can record a total of 8MB of cache blocks. Detailed descriptions of the simulation will be given in Section 4.3.










Figure 4-1 shows the average of the total valid cache blocks over a long simulation period

using a CMP coherence directory with various set associativities. Given eight cores, each with an 8-way set-associative L2 cache, the 64-way directory (Set-full) can accommodate all cache blocks without causing any extra invalidation. The small percentage of invalid blocks for the 64-way directory comes from cache coherence invalidations due to data sharing, OS interference, thread migrations, etc. on multiple cores.

The severity of cache invalidations because of set conflicts in the directory is very evident

in the cases of smaller set associativities. In general, whenever the set associativity is reduced by

half, the total valid cache blocks are reduced by 4-9% for all five workloads. Using OLTP as an

example, the valid blocks are reduced from 93% to 87%, 82% and 75% as the set associativity is

reduced from 64 ways to 32, 16, and 8 ways, respectively. The gap between the 64-way and 8-

way directories indicates that, on average, about 18% of the total cached blocks are invalidated

due to insufficient associativity in the 8-way coherence directory. This severe decrease in valid

blocks will reduce the local cache hits and increase the overall CMP cache misses.

To further demonstrate the effect of hot-set conflicts in a directory with small set-

associativity, we also simulated an 8-way set-associative directory with twice the number of sets,

capable of recording the states for a cache size of 16 MB (denoted as 2x-8way). We can observe

that a significant gap in the number of valid blocks still exists between the 2x-8way and the 64-

way directories. For OLTP, the 32-way directory can even outperform the 2x-8way directory.

To completely avoid extra cache invalidations, a 64-way directory is needed here. Considering a future CMP with 64 cores and 16-way private L2 caches, an expensive and power-hungry 1024-way directory would be needed to eliminate all extra cache invalidations. Such a fully-associative directory is essentially the same as maintaining and searching all individual cache directories.













Figure 4-1. Valid cache blocks in CMP directories with various set-associativity.

4.2 A New CMP Coherence Directory

To solve or reduce the hot-set conflicts, we propose to displace the replaced directory

entries to the empty slots in the directory.

Figure 4-2 illustrates the basic organization of a CMP coherence directory enhanced with a

Directory Lookaside Table (DLT), referred collectively as the DLT-dir. The directory part, called

the main directory, is set-associative. Each entry in the main directory records a cached block

with its address tag, a core ID, a MOESI coherence state, a valid bit, and a bit indicating whether

the recorded block has been displaced from its primary set.

The DLT is organized as a linear array in which each entry can establish a link to a

displaced block in the main directory. Each DLT entry consists of a pointer, the index bits of the

displaced block in the main directory, and a 'use' bit indicating if the DLT entry has a valid

pointer. In addition, a set-empty bit array is used to indicate whether the corresponding set in the

main directory has a free entry for accommodating a block displaced away from its primary set. Note that, different from the memory-based directory, the DLT-dir serves only as a coherence directory without any associated data array.










Figure 4-2. A CMP coherence directory with a multiple-hashing DLT.

Inspired by a similar idea in the skewed associative cache [64], [14], a set of hash functions

are used to index the DLT; the purpose is to reduce the conflict at the DLT. When a block is to

be displaced in the main directory due to a conflict, the hash functions are applied to the block

address to obtain multiple locations in the DLT. Some of these locations may have already been

used to point to some other displaced blocks in the main directory. If there exist unused DLT

locations among the hashed ones and if there also exists a free slot in the main directory, then the

displaced block is moved to the free directory slot and an unused DLT location is selected to

point to the displaced block. Since this paper is not aiming at inventing new hash functions, we

borrow the skewed function family reported in [14]. Let σ denote the one-position circular shift on n index bits. A cache block at memory address A = A3·2^{3n} + A2·2^{2n} + A1·2^{n} + A0 can be mapped to the following m locations: f0(A) = A1 ⊕ A2, f1(A) = σ(A1) ⊕ A2, f2(A) = σ²(A1) ⊕ A2, ..., and f_{m-1}(A) = σ^{m-1}(A1) ⊕ A2, where m ≤ n.
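A small sketch of this hash family is given below (Python; the field widths, the exact address decomposition, and the example address are assumptions for illustration, following the description above rather than the exact hardware wiring).

```python
# Sketch of the skewed hash family used to index the DLT (illustrative only).
# Above the block offset, the address is split into two n-bit fields A1 and A2;
# candidate location i is sigma^i(A1) XOR A2, with sigma a 1-bit circular shift.

N_INDEX_BITS   = 10          # n: DLT index width (assumed for illustration)
BLOCK_OFF_BITS = 6           # 64B blocks
MASK = (1 << N_INDEX_BITS) - 1

def sigma(x, n=N_INDEX_BITS):
    """One-position circular left shift on n bits."""
    return ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)

def dlt_locations(addr, m):
    """Return the m candidate DLT locations f_0(A) .. f_{m-1}(A) for address addr."""
    a1 = (addr >> BLOCK_OFF_BITS) & MASK                    # low index field A1
    a2 = (addr >> (BLOCK_OFF_BITS + N_INDEX_BITS)) & MASK   # next field A2
    locs, cur = [], a1
    for _ in range(m):
        locs.append(cur ^ a2)
        cur = sigma(cur)
    return locs

# Example: three candidate DLT slots for one block address.
print(dlt_locations(0x12345678, m=3))
```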









In the illustrated example of Figure 4-2, a block address is hashed to three locations, a, b,

and c, in the DLT based on the skewed functions [14]. As indicated by the 'use' bits, locations a and c each contain a valid pointer that points to a displaced block in the main directory, while location b is

unused. The index bits of the main directory are attached in the DLT for the displaced blocks for

two purposes. First, such a scheme saves the main directory space by not including any index

bits in the address tag at each directory entry. Note that these index bits are needed only for the

displaced blocks. Instead of allocating space at every directory entry for storing the index bits, no

such space is allocated at all in the main directory and the index bits of the displaced blocks are

stored in the DLT. Second, the index bits in the DLT can be used to filter out unnecessary access

to the main directory. An access is initiated only when the index bits match those of the requested address. In the example of Figure 4-2, assume that the index bits of both a and c match those of the requested address; accesses to the corresponding main directory entries are then initiated. The address tags of the two main directory entries pointed to by locations a and c are compared against the address tag of the request. When a tag match occurs, a displaced block is found. Note that although the DLT-dir requires additional directory accesses beyond the primary sets, our evaluation shows that a majority (about 97-99%) of these secondary accesses can be filtered out using the index bits stored

in the DLT.

In the attempt to relocate a block evicted from its primary set to another directory entry,

suppose a free DLT slot, say b, is found for the block. A free slot in the main directory must also

be identified. The set-empty bit array is maintained for this purpose. The corresponding set-

empty bit is set whenever a free directory slot appears. A quick scan of the set-empty bit array

returns a set with at least one free slot. The displaced block is then stored in the free slot in the

main directory and that location is recorded in entry b in the DLT.
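The displacement attempt can be sketched as follows (an illustrative simplification assuming an unrestricted DLT-to-directory mapping; the data-structure layout, names, and set-level pointer are our own assumptions, not the hardware tables used in the evaluation):

#include <cstdint>
#include <optional>
#include <vector>

struct DltEntry {
    bool     used  = false;  // 'use' bit: entry holds a valid pointer
    uint32_t ptr   = 0;      // directory set holding the displaced block
    uint32_t index = 0;      // primary-set index bits of the displaced block
};

// Attempt to relocate a block that was just evicted from its primary set.
// 'hashes' holds the m DLT locations produced by the skewed hash functions
// for the block address.  Returns true if the block could be displaced
// (otherwise it must be invalidated in the owning cache module).
bool tryDisplace(uint32_t primaryIndex,
                 const std::vector<uint32_t>& hashes,
                 std::vector<DltEntry>& dlt,
                 std::vector<bool>& setEmpty)   // one bit per directory set
{
    // 1. Look for an unused DLT location among the m hashed ones.
    std::optional<uint32_t> freeDlt;
    for (uint32_t h : hashes)
        if (!dlt[h].used) { freeDlt = h; break; }
    if (!freeDlt) return false;

    // 2. Scan the set-empty bit array for a directory set with a free slot.
    for (uint32_t set = 0; set < setEmpty.size(); ++set) {
        if (!setEmpty[set]) continue;
        // 3. Move the block into the free slot of 'set' (directory update
        //    not shown) and record the link plus the primary index bits.
        dlt[*freeDlt].used  = true;
        dlt[*freeDlt].ptr   = set;
        dlt[*freeDlt].index = primaryIndex;
        setEmpty[set] = false;   // simplification: treat the set as full now
        return true;
    }
    return false;                // no free directory slot is available
}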









When a block is removed from a cache module either due to eviction or invalidation, the

block must also be removed from the DLT-dir. If the block is recorded in its primary set, all that

is needed is to turn the valid bit off. If the block is displaced, it will be found through a DLT

lookup. Both the main directory entry for the block and the corresponding DLT entry are freed.

A directory entry that holds a displaced block can also be replaced by a newly referenced

block. Given that the index is unavailable in the main directory for the displaced block, normally

a backward pointer is needed from the directory entry that holds the displaced block to the

corresponding DLT entry, in order to free the DLT entry. However, it is expensive to add a

backward pointer in the main directory. Alternatively, the DLT can be searched to determine which DLT entry points to the location of the displaced block in the main

directory. If each DLT entry is allowed to point to any directory location, then a fully associative

search of the DLT is required to locate a given location in the main directory. To reduce the cost

of searching the DLT, one can impose a restriction on the DLT-to-directory mapping so that, for a

given directory location, only a small subset of the DLT entries can potentially point to it and

need to be examined. For instance, consider a DLT whose total number of entries is one-quarter

of that of the directory entries, which will be shown to be sufficiently large. In the most

restrictive DLT-to-directory mapping, each DLT entry is allowed to point to one of only four

fixed locations in the main directory. Although the need of DLT search is minimal, such a

restrictive mapping could lead to severe hot-set-like conflicts during the displacement of a block,

because the block can only be displaced to a small number of potential locations in the directory:

The block is first mapped to several DLT entries (by multiple hash functions), each of which in

turn can point to one of a small number of directory entries. The result is a reduced chance of

finding a free directory entry for the displaced block. In a less restrictive design, any set-









associative mapping can be instituted such that each DLT entry is limited to a certain collection of

sets in the main directory. The set-associative mapping allows fast search in the DLT for a

directory location with minimum reduction on the chance of finding free slots in the main

directory. We will evaluate the performance of this design in Section 4.4.
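For concreteness, one possible set-associative mapping is sketched below (a simplified illustration of our own; the modulo-based grouping is an assumption and may differ from the mapping actually simulated). DLT entry e may point only into one fixed group of directory sets, so the inverse search for a given directory set touches only a small, fixed subset of DLT entries:

#include <cstdint>
#include <vector>

// Directory sets are grouped into contiguous groups of K sets; DLT entry e is
// only allowed to point into set group (e % numGroups).  Consequently, for a
// given directory set s, only the DLT entries e with (e % numGroups) == (s / K)
// need to be examined, instead of a fully associative search of the whole DLT.
struct Mapping {
    uint32_t K;          // directory sets reachable per set group
    uint32_t numGroups;  // total directory sets / K
};

// Forward direction: may DLT entry e point into directory set s?
bool mayPointTo(uint32_t e, uint32_t s, const Mapping& m) {
    return (e % m.numGroups) == (s / m.K);
}

// Inverse direction: the candidate DLT entries for directory set s.
std::vector<uint32_t> candidateDltEntries(uint32_t s, uint32_t dltEntries,
                                          const Mapping& m) {
    std::vector<uint32_t> cand;
    for (uint32_t e = s / m.K; e < dltEntries; e += m.numGroups)
        cand.push_back(e);
    return cand;
}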

In comparison with other multiple-hash-function-based directories or caches, e.g., the

skewed associative directory, the multiple-hashing DLT has its unique advantages. Since the

DLT is used only to keep track of the free slots and displaced blocks in the main directory, its

size, counted in number of entries, is considerably smaller than the total number of entries in the

main directory. Suppose the directory has a total of 1000 entries, then the DLT may have 250

entries. Suppose the directory contains 100 displaced blocks and 100 free slots at some point.

Then, the directory has 10% of free entries but the DLT has (250-100)/250 = 60% of free entries.

When the same hash function family is used in both the skewed associative directory and the

DLT, the chance of finding a free entry in the DLT is much higher (0.9993 vs. 0.5695 if eight

uniform random hash functions are used). Once a free DLT entry is found, finding a free entry in

the directory is ensured by searching the set-empty bit array (assuming the DLT-to-directory

mapping is unrestricted). We will demonstrate the performance advantage of this unique

property.
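The two probabilities quoted above follow from a simple back-of-the-envelope calculation (our own restatement, assuming the eight hash probes behave like independent, uniformly random choices). The probability that at least one of the eight probes lands on a free entry is

    P_free = 1 - (1 - f)^8,   where f is the fraction of free entries:

    DLT:              f = (250 - 100) / 250 = 0.6   =>   1 - 0.4^8 ≈ 0.9993
    Skewed directory: f = 100 / 1000 = 0.1          =>   1 - 0.9^8 ≈ 0.5695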

The detailed operations of the DLT-dir are summarized as follows.

When a requested block is absent from the local cache, a search of the DLT-dir is carried

out for locating the requested block in other cache modules. When the block is found in its

primary set of the main directory and/or in other sets through the DLT lookup, proper coherence

actions are performed to fetch and/or invalidate the block from other caches. The block with the

requesting core ID is inserted into the MRU position in the primary set of the main directory.









This newly inserted block may lead to the following sequence of actions. First, the block is

always inserted into a free entry in the primary set if one exists. Otherwise, it replaces a

displaced block residing in the primary set; this design is intended to limit the total number of

displaced blocks. In this case, the previously established pointer in the DLT to the displaced

block must be freed. If no displaced block exists in the primary set, the LRU block is replaced. In

either case, the replaced block will undergo a displacement attempt through the DLT. If no free

space is found in either the main directory or the DLT, the replaced block is evicted from the

directory and invalidated in the cache module.
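A compact sketch of this victim-selection order is shown below (illustrative only; the Way structure and helper function are our own assumptions):

#include <vector>

struct Way { bool valid = false; bool displaced = false; /* tag, core ID, state, ... */ };

// Victim selection when a newly referenced block is inserted into its
// primary set of the main directory.
int chooseVictim(const std::vector<Way>& primarySet, int lruWay) {
    for (int w = 0; w < (int)primarySet.size(); ++w)
        if (!primarySet[w].valid) return w;      // 1. a free entry, if one exists
    for (int w = 0; w < (int)primarySet.size(); ++w)
        if (primarySet[w].displaced) return w;   // 2. else a displaced block
                                                 //    (its DLT pointer is freed)
    return lruWay;                               // 3. else the LRU block
}
// In cases 2 and 3 the evicted block then undergoes a displacement attempt
// through the DLT; if that also fails, it is invalidated in the cache module.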

To displace a block, an unused entry in the DLT must be selected through the multiple

hash functions. In addition, the set-empty bit array is checked for selecting a free slot in the main

directory to which the selected DLT entry can be mapped. Each corresponding bit in the set-

empty array is updated every time the respective set is searched.

A miss in the CMP caches is encountered if the block cannot be found in the DLT-dir. The

corresponding block must be fetched from a lower-level memory in the memory hierarchy. The

update of the DLT-dir for the newly fetched block is the same as when the block is found in

other CMP cache modules.

When a write request hits a block in the shared state in the local cache, an upgrade request

is sent to the DLT-dir. The requested block with the core ID must exist either in the primary set

or in other sets linked through the DLT. The requested block can exist in more than one entry in

the main directory since the shared block may also be in other cores' caches. In response to the

upgrade request, all other copies of the block must be invalidated in the respective caches and

their corresponding directory entries must be freed. Replacement of any block in a cache module









must be accompanied by a notification to the directory for freeing the corresponding directory

and DLT entries.

4.3 Evaluation Methodology

We use Simics to evaluate an 8-core out-of-order x86 chip multiprocessor. We develop

detailed cycle-by-cycle event-driven cache, directory and interconnection models. Each core has

its own L1 instruction and data caches, and an inclusive private L2 cache. Every core has its own

command-address-response bus and data bus connecting itself with all the directory banks. The

MOESI coherence protocol is applied to maintain cache coherence through the directory. Each

core has two outgoing request queues to the directory, a miss-request queue and a replacement-

notification queue, and an outgoing response queue for sending responses to the directory. It also has an incoming request queue for handling requests from the directory and an incoming response queue. Each bank of the directory maintains five corresponding queues per core to buffer requests and responses from/to that core. The simulator keeps track of the states of each bus and

directory bank, as well as all the queues. The timing delays of directory access, bus transmission,

and queue conflicts are carefully modeled. We assume the DLT access can be fully overlapped

with the main directory access. However, the overall access latency may vary based on the

number of displaced blocks that need to be checked. Besides the main directory access latency of

6 cycles, we assume that 3 additional cycles are consumed for each access to a displaced block

after the access is issued from the DLT. Table 4-1 summarizes the directory-related simulation parameters, in addition to the general parameters described in Chapter 1.
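Under these timing assumptions, the latency of a DLT-dir lookup that ends up examining $k$ displaced blocks is simply $t_{\text{lookup}} = 6 + 3k$ cycles; for example, a lookup that checks two displaced blocks completes in $6 + 3 \cdot 2 = 12$ cycles.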

For this study, we use three multithreaded commercial workloads, OLTP (Online

Transaction Processing), Apache (static web server), and SPECjbb (java server), and two

multiprogrammed workloads with applications from SPEC2000 and SPEC2006.










Table 4-1. Directory-related simulation parameters
Parameter Description
CMP 8-core, 1M private L2 cache per core
Main directory 1/2/4/8 banks, 128K entries, 8-way
Queue size 8-entry request/response queues to/from each core
DLT table 1/2/4/8 banks, 8K/16K/32K entries
DLT mapping Each DLT entry maps to 8/16/32/64/128 directory sets
Directory latency 6 cycles for the primary set, 3 additional cycles per displaced block
Remote latency 52 cycles without contention, 4 hops
Cmd/Data bus 8B, bidirectional, 32GB/s, 6-cycle propagation latency

Table 4-2. Space requirement (total bits) for the seven directory organizations
Directory 8 Cores Overhead 64 Cores Overhead
Set-8w 4587520 1 36700160 1
Set-8w-64v 4671552 1.02 37372416 1.02
Skew-8w 6160384 1.34 49283072 1.34
Set-10w-1/4 5734400 1.25 45875200 1.25
Set-8w-p 5242880 1.14 97517568 2.66
DLT-8w-1/4 5439488 1.18 43515904 1.18
Set-full 4980736 1.09 39845888 1.09

We evaluated seven directory organizations: the 8-way set-associative (Set-8w), the 8-way set-associative with a 64-block fully-associative victim buffer (Set-8w-64v), the 8-way skewed associative (Skew-8w), the 10-way set-associative with 25% additional directory size (Set-10w-1/4), the 8-way set-associative with presence bits (Set-8w-p), the 8-way set-associative with a DLT of 25% of the total cache entries using 8 hash functions (DLT-8w-1/4), and the fully associative 64-way (Set-full) directory. The Set-full represents the ideal case where no directory conflict will occur. The Set-10w-1/4 is included because it adds one-quarter extra directory space that matches the DLT entries in the DLT-8w-1/4. The Set-10w-1/4 possesses an extra advantage because the set-associativity is increased from 8-way to 10-way. In our simulations, the DLT-dir

is partitioned into four banks based on the low-order two bits in the block address to allow for

sufficient directory bandwidth. A multiple-banked directory also displays interesting effects on the

DLT-dir conflict. Detailed evaluation will be given in Section 4.4. All results are the average

from all four banks. The total number of bits and the normalized space requirement relative to









that of the Set-8w for the seven directories are shown in Table 4-2 for an 8-core CMP and a 64-

core CMP. The skewed associative directory (Skew-8w) requires the index bits as part of the address tag, and hence needs the largest space. The space requirement for the directory with presence bits (Set-8w-p) goes up much faster than the others when the number of cores increases (e.g., from 1.14 to 2.66).

4.4 Performance Result

In this section, we show performance evaluation results of the seven CMP coherence

directory organizations. The cache hit/miss ratios, the average valid blocks, the IPC

improvement, and sensitivity to the DLT design parameters are presented.

4.4.1 Valid Block and Hit/Miss Comparison

Figure 4-3 shows the percentage of valid blocks for the seven directories averaged over the

entire simulation period. We can make a few important observations.

First, the proposed DLT-8w-1/4 is far superior to any other directory organizations, except

for the ideal Set-full. The DLT-8w-1/4 can retain almost all cached blocks using a DLT whose

total number of entries is equal to one-quarter of the total number of the cache blocks. Among

the five workloads, only OLTP shows noticeable invalidations under DLT-8w-1/4. This is

because, with intensive data and instruction sharing, OLTP experiences more set conflicts due to

replication of shared blocks. Our simulation results reveal that over 40% of the cached blocks are

replicas in OLTP.

Second, set-associative directories other than the one with the full 64-way perform poorly.

For instance, in the Set-8w directory, about 18%, 14%, 13%, 21%, and 20% of the cached blocks

are unnecessarily invalidated for the respective five workloads, due to hot-set conflicts in the

directory. The Set-10w-1/4 directory improves the number of valid blocks at the cost of 25%

additional directory entries and higher associativity. But, its deficit is still substantial.















Figure 4-3. Valid cache blocks for simulated cache coherence directories.

Third, very little advantage is shown when the Set-8w is furnished with a 64-block fully-

associative victim buffer. Apparently, the buffer is too small to hold a sufficient number of

conflicting blocks.

Fourth, the skewed associative directory (Skew-8w) does alleviate set conflicts. But it is still far from being able to retain all the valid blocks. Since multiple hashing is applied to the entire directory, the chance of finding a free slot in the main directory is diluted by the non-displaced blocks located in the primary sets. In contrast, the DLT only contains the displaced blocks; hence the chance of finding a free slot in the DLT is much higher (see the sample calculation near the end of Section 4.2 for details).

Lastly, the Set-8w-p works well with multithreaded workloads, but performs poorly with

multiprogrammed workloads. By combining (duplicated) blocks with the same address into one

entry in the directory, the presence-bit-based implementation saves directory entries, and hence,

alleviates set conflicts for multithreaded workloads. However, keeping the presence bits becomes

increasingly space-inefficient as the number of cores increases. In addition, for









multiprogrammed workloads, there is little data sharing; the advantages of having the expensive

presence bits no longer exist.

The total valid blocks determine the overall cache hits and misses. In Figure 4-4, extra L1

hits, L2 local hits, L2 remote hits, and L2 misses of the seven directory schemes, normalized

with respect to the total L2 references of the Set-8w, are displayed for each workload. Note that

the inadvertent invalidation due to set conflicts at the directory also invalidates the blocks located

in the L1 caches for maintaining the L1/L2 inclusion property. Therefore, more L1 misses are

encountered for the directory schemes that cause more invalidation.

Compared with the Set-8w, there are extra L1 hits and fewer L2 references for all other

directory schemes. Among the workloads, SPECjbb under the DLT-8w-1/4 sees about 8%

increase in the L1 hits. The DLT-8w-1/4 shows clear advantages over the other directory

schemes.

Besides additional L1 hits and fewer L2 misses, the biggest gain for using the DLT-dir

comes from the increase of the local L2 hits. Recall that to avoid the expensive presence bits,

each copy of a block occupies a separate directory entry with a unique core ID. Consequently,

inadvertent invalidation of a cached block caused by insufficient directory space may not turn an L2 local hit to the block into an extra L2 cache miss. Instead, it is likely that a local L2 hit to the

block results in a remote L2 hit since not all copies of the block are invalidated.

This is the key reason for more remote L2 hits in the directory schemes that produce more

invalidation for the three multithreaded workloads. The difference among the directory schemes

is not as significant for SPEC2000 and SPEC2006 since there is little data sharing among the multiprogrammed workloads.





























Figure 4-4. Cache hit/miss and invalidation for simulated cache coherence directories.

Figure 4-5 plots the distribution of four types of requests to the directory: instruction fetch

(IFetch), data read (DRead), data write hit to a shared-state block (Upgrade), and data write miss

(DWrite). Within each request type, the results are further broken down into four categories

based on the directory search results: hit only to the main directory, hit only to the DLT, hit to

both the main directory and the DLT, and miss at both the main directory and the DLT (which

becomes a CMP cache miss).

A few interesting observations can be made. First, there are very few requests that find the

blocks only through the DLT for all request types in all five workloads. Given that a displaced block was at the LRU position of its primary set before being displaced, the chance of its reuse is not high, unless other copies of the same block are also in the primary set, in which case the request is likely to target the copies in the primary set. Therefore, a hit to the DLT is usually

accompanied by one or more hits to the main directory. This is especially true for IFetch in the

three multithreaded workloads with more sharing of read-only instructions among multiple cores.












Figure 4-5. Distribution of directory hits to main directory and DLT.

A good percentage of instruction blocks are displaced from their primary sets due to heavy

conflicts. A small percentage of DRead also finds the requested blocks in both places. Since

neither IFetch nor DRead requires locating all copies of the block, it is not harmful to look up the

main directory and the DLT sequentially to save power.

Second, it is observed that very few displaced blocks are encountered by the Upgrade or

DWrite requests. Detailed analysis of the results indicates that widely shared blocks do exist, but

they are mostly read-only blocks. This is demonstrated by the fact that the average number of

sharers for DRead is about 6 when DRead hits both the main directory and the DLT. But, for

Upgrade and DWrite, the number of sharers is less than 3, a number small enough that these few copies of the block can be kept in the primary set most of the time. This explains why such

blocks are not typically found through the DLT. Nevertheless, parallel searching of the directory

and DLT is still desirable because the required acknowledgement can be sent back to the

requesting core faster.










Third, for the two multiprogrammed workloads, there are almost no IFetch requests to the

DLT-dir, revealing the small footprint of the instructions. Since there is no data sharing, both

DRead and DWrite are always misses. Moreover, there are no Upgrade requests because the

blocks are in the E-state.

4.4.2 DLT Sensitivity Studies

Two key DLT design parameters are its size and number of hash functions. In Figure 4-6,

we vary the DLT sizes among 1/16, 1/8, and 1/4 of the total number of the cache blocks. We also

evaluate the difference between using 8 or 12 hash functions for accessing the DLT. As observed

earlier from Figure 4-3, the average number of valid blocks drops by 13-21% when the Set-8w is

used instead of the Set-full. To reduce the number of invalidated blocks, the DLT must be

capable of capturing at least these percentages of blocks and allow them to be displaced from

their primary sets. Therefore, we can observe significant improvement in the average number of

valid blocks as the DLT size increases from 1/16 to 1/8 and to 1/4 of the total number of cache

blocks for all five workloads.

The impact of the number of hash functions, on the other hand, is not as obvious. Eight

hash functions are generally sufficient for finding an unused entry in the DLT when the DLT has sufficient size. We observe a small improvement when using 12 instead of 8 hash functions. For example,

the total number of valid blocks increases from 90.3% to 91.7% for OLTP with 12 hash

functions.

The index bits, which are needed for displaced blocks, are recorded in the DLT to save

space in the main directory. These index bits can also be used to filter the access to the main

directory for the displaced blocks. An access is necessary only when the index bits match those of the requested block.











Figure 4-6. Sensitivity study on DLT size and number of hashing functions.

Figure 4-7 shows the advantage of filtering for the five workloads with the DLT-8w-1/4 directory. Besides the 12 index bits, we also experiment with using an additional 1, 2, or 3 bits for filtering purposes. Figure 4-7 (A) shows the false-positive rate, and Figure 4-7 (B) illustrates the total traffic that can be filtered out. A false positive is defined as an index-bit match followed by a failure to find the block in the main directory.

We observe that by recording one additional bit beyond the necessary index bits, false positives are almost completely eliminated for all three multithreaded workloads. For SPEC2000 and SPEC2006, however, an additional 3 bits are necessary to reduce the false-positive rate to a negligible level. Furthermore, by recording one additional bit beyond the index bits, the total additional traffic to the main directory is reduced to only about 0.2-2.8% of the total request traffic arriving at the DLT. With such high percentages of filtering, a majority of CMP cache misses or upgrades can be identified earlier without searching the main directory again for possible displaced blocks.











Figure 4-7. Effects of filtering directory searches by extra index bits.

The DLT is partitioned into multiple banks to enable simultaneous DLT accesses from

multiple cores. Meanwhile, for performing fast inverse search of the DLT given a main directory

entry, the DLT is organized in a way that each DLT entry can only point to one set among a

fixed sub-collection of all sets in the main directory.

Figure 4-8 shows the sensitivity studies with respect to these two parameters for OLTP and

SPEC2000. The X-axis represents the number of fixed sets in the main directory that each DLT

entry can point to; the Y-axis is the normalized invalidation with respect to the total invalidations

in a reference configuration where the directory has only one bank and the DLT uses an 8-set mapping.

Several interesting observations can be made. First, banking also helps reduce inadvertent invalidations. The main reason is that with smaller banks, the same DLT-to-directory mapping covers a larger percentage of the main directory. For example, for an 8-way directory corresponding to an 8MB cache with a 64-byte block size, there are 16K sets for a 1-bank directory, but only 4K sets per bank for a 4-bank directory. Therefore, if each DLT entry can be mapped to 32 sets, it covers 1/512 of the directory entries in the 1-bank case. However, with the same mapping, a DLT entry covers 1/128 of the directory entries per bank in the 4-bank directory.
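Restating this coverage arithmetic explicitly (using the configuration above):

    8 MB / 64 B = 128K directory entries;  128K / 8 ways = 16K sets (1 bank)  or  4K sets per bank (4 banks)
    coverage of a 32-set mapping:  32 / 16K = 1/512 (1 bank)   vs.   32 / 4K = 1/128 (per bank, 4 banks)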










The greater the coverage, the higher the chance of finding a free slot in the main directory for displacement. The curves corresponding to the same directory coverage percentage per DLT entry are also drawn with broken lines in the figure.


Figure 4-8. Normalized invalidation with banked DLT and restricted mapping from DLT to directory.

Second, banking creates another level of conflicts because the distribution of the cached

blocks to each bank may not be even. The physical bank prevents the DLT-based block

displacement from crossing the bank boundary. This effect is very evident in OLTP. With the

same covering percentage, the 1-bank directory performs noticeably better than the 2-bank

directory, which in turn is better than the 4-bank and the 8-bank directories. The bank conflict is

not as obvious for SPEC2000 due to its mixed applications. A simple index randomization technique can be applied to alleviate the bank conflicts, but further discussion is omitted due to space limitations.

Third, the constrained DLT-to-directory mapping does affect the number of invalidations

substantially. However, the reduction in invalidations starts to diminish when the mapping freedom is doubled from 1/64 to 1/32. In the earlier simulations, we used a 4-bank directory with a 32-set mapping, where each DLT entry can be mapped to 1/128 of the banked directory space, in order to








achieve a balance between minimizing the amount of invalidations and the cost of searching the

DLT entries.

4.4.3 Execution Time Improvement

The execution times for the seven directory schemes are compared in Figure 4-9. The

figure shows the normalized execution times with respect to the Set-8w directory for each

workload. For the DLT-8w-1/4 directory, the execution time improvement is about 10%, 5%,

9%, 8%, and 8% over the Set-8w directory. More importantly, the improvement of the DLT-8w-

1/4 is about 98% of what the full 64-way directory (Set-full) can achieve for all five workloads.

In the case of the more expensive skewed associative directory, the time saved is only about 20-35% of what the DLT-8w-1/4 can save for the three multithreaded workloads. Finally, the Set-

8w-p reduces the execution time of the multithreaded workloads, but does little for the

multiprogrammed ones.

4.5 Summary

In this chapter, we describe an efficient cache coherence mechanism for future CMPs with

many cores and many cache modules. We argue in favor of a directory-based approach because

the snooping-bus-based approach lacks scalability due to its broadcasting nature. However, the

design of a low-cost coherence directory with small size and small set-associativity for future

CMPs must handle the hot-set conflicts at the directory that lead to unnecessary block

invalidations at the cache modules. In a typical set-associative directory, the hot-set conflict

tends to become worse because many cores compete in each individual set with uneven

distribution. The central issue is to reconcile the topology difference between the set-associative

coherence directory and the CMP caches.
















Figure 4-9. Normalized execution time for simulated cache coherence directories.

The proposed DLT-dir accomplishes just that with small space requirement and high

efficiency. The hot-set conflict is alleviated by allowing blocks to be displaced from their

primary sets. A new DLT is introduced to keep track of the displaced blocks by maintaining a

pointer to the displaced block in the main directory. The DLT is accessed by applying multiple

hash functions to the requested block address to reduce the DLT's own conflicts. Performance

evaluation has confirmed the advantage of the DLT-dir over other conventional set-associative directories and the skewed associative directory for conflict avoidance. In particular, the DLT-dir, with a DLT equal in size to one quarter of the total number of directory entries, achieves up to 10%

faster execution time in comparison with a traditional 8-way set-associative directory.









CHAPTER 5
DISSERTATION SUMMARY

Chip multiprocessors (CMPs) are becoming ubiquitous in all computing domains. As the

number of cores increases, tremendous pressures will be exerted on the memory hierarchy

system to supply the instructions and data in a timely fashion to sustain increasing chip-level

IPCs. In this dissertation, we develop three techniques for bridging the ever-increasing CPU-memory performance gap.

In the first part of this dissertation, an accurate and low-overhead data prefetching on

CMPs based on a unique observation of coterminous group (CG) and coterminous locality has

been developed. A coterminous group is a group of off-chip memory accesses with the same

reuse distance. Coterminous locality means that when a member of a coterminous group is accessed, the other members are likely to be accessed in the near future. In particular, the

order of nearby references in a CG follows exactly the same order that these references appeared

last time, even though they may be irregular. The proposed prefetcher uses CG history to trigger

prefetches when a member in a group is re-referenced. It overcomes challenges of the existing

correlation-based or stream-based prefetchers, including low prefetch accuracy, lack of

timeliness, and large history. The accurate CG-prefetcher is especially appealing for CMPs,

where cache contentions and memory access demands are escalated. Evaluations based on

various workload mixes have demonstrated significant advantages of the CG-prefetcher over

other existing prefetching schemes on CMPs, with about 10% of IPC improvement with much

less extra traffic.

As many techniques have been proposed for optimizing on-chip storage space in CMPs, the second part of the dissertation proposes an analytical model and a global stack simulation to quickly project the performance tradeoff between capacity and access latency in CMP









caches. The proposed analytical model can quickly estimate the general performance behavior of data replication in CMP caches. The model has shown that data replication could degrade cache performance without a sufficiently large capacity. The global stack simulation has been proposed

for more detailed study on the issue of balancing accessibility and capacity for on-chip storage

space on CMPs. With the stack simulation, a wide-spectrum of the cache design space can be

explored in a single simulation pass. We have simulated the schemes of shared/private caches,

and shared caches with replication of various cache sizes. We also have verified the stack

simulation results with execution-driven simulations using commercial multithreaded workloads

and showed that the single-pass stack simulation can characterize the CMP cache performance

with high accuracy; only a 2%-9% error margin is observed. Our results have shown that the effectiveness of various techniques to optimize the CMP on-chip storage is closely related to the

total L2 size. More importantly, our global stack simulation consumes only 8% of the simulation

time of execution-driven simulations.

In the third part of the dissertation, we have described an efficient cache coherence

mechanism for future CMPs with many cores and many cache modules. We favor a directory-

based approach because the snooping-bus-based approach lacks scalability due to its

broadcasting nature. However, the design of a low-cost coherence directory with small size and

small set-associativity for future CMPs must handle the hot-set conflicts at the directory that lead

to unnecessary block invalidations at the cache modules. In a typical set-associative directory,

the hot-set conflict tends to become worse because many cores compete in each individual set

with uneven distribution. The central issue is to reconcile the topology difference between the

set-associative coherence directory and the CMP caches. The proposed DLT-dir accomplishes just that with a small space requirement and high efficiency. The hot-set conflict is alleviated by









allowing blocks to be displaced from their primary sets. A new DLT is introduced to keep track

of the displaced blocks by maintaining a pointer to the displaced block in the main directory. The

DLT is accessed by applying multiple hash functions to the requested block address to reduce the

DLT's own conflicts. Performance evaluation has confirmed the advantage of the DLT-dir over other conventional set-associative directories and the skewed associative directory for conflict avoidance. In particular, the DLT-dir, with a DLT equal in size to one quarter of the total number of directory entries, achieves up to 10% faster execution time in comparison with a traditional 8-way set-associative directory.










LIST OF REFERENCES


[1] M. E. Acacio, "A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors," IEEE Trans. on Parallel and Distributed Systems, Vol. 16, No. 1, pp. 67-79, Jan. 2005.

[2] Advanced Micro Devices, "AMD Demonstrates Dual Core Leadership," http://www.amd.com, 2004.

[3] AMD Quad-Core, http://multicore.amd.com/us-en/quadcore/

[4] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, "An Evaluation of Directory
Schemes for Cache Coherence," Proc. 15th Int'1Symp. Computer Architecture, pp. 280-
289, May 1988.

[5] A. Agarwal, M. Horowitz, and J. Hennessy, "An Analytical Cache Model," ACM Trans. on Computer Systems, Vol. 7, No. 2, pp. 184-215, May 1989.

[6] A. Agarwal and S. D. Pudar, "Column-Associative Caches: a Technique for Reducing the
Miss Rate of Direct-Mapped Caches," Proc. 29th Int'1 Symp. on Computer Architecture,
pp. 179-190, May 1993.

[7] L. Barroso et al, "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,"
Proc. 2 7th Int'1 Symp. on Computer Architecture, pp. 165-175, June 2000.

[8] B. Beckmann and D. Wood, "Managing Wire Delay in Large Chip-multiprocessor
Caches," Proc. 3 7th Int '1Symp. on M\~icroarchitecture, Dec. 2004.

[9] B. M. Beckmann, M. R. Marty, and D. A. Wood, "ASR: Adaptive Selective Replication for
CMP Caches," Proc. 39th Int '1Symp. on M\~icroarchitecture, Dec. 2006.

[10] B. T. Bennett and V. J. Kruskal, "LRU Stack Processing," IBM Journal of R&D, pp. 353-357, July 1975.

[1l] E. Berg and E. Hagersten, "StatCache: A Probabilistic Approach to Efficient and Accurate
Data Locality Analysis," Proc. 2004 Int '1Symp. on Performance Analysis of Systems and'
Software, March 2004.

[12] E. Berg, H. Zeffer, and E. Hagersten, "A Statistical Multiprocessor Cache Model," Proc.
2006 Int '1Symp. on Performance Analysis of Systems and' Software, March 2006.

[13] B. Black et al, "Die Stacking (3D) Microarchitecture," Proc. 39th Int'1 Symp. on
M\~icroarchitecture, pp. 469-479, Dec. 2006.

[14] F. Bodin and A. Seznec, "Skewed Associativity Improves Performance and Enhances
Predictability," IEEE Trans. on Computers, 46(5), pp. 530-544, May 1997.










[15] S. Borkar, "Microarchitecture and Design Challenges for Gigascale Integration," Proc. 37th Int'l Symp. on Microarchitecture, 1st Keynote, pp. 3-3, Dec. 2004.

[16] P. Bose, "Chip-Level Microarchitecture Trends," IEEE Micro, Vol 24(2), pp. 5-5, Mar-Apr.
2004.

[17] L. M. Censier and P. Feautrier, "A New Solution to Coherence Problems in Multicache
Systems," IEEE Trans. on Computers, c-27(12), pp. 1112-1118, Dec. 1978.

[18] D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal, "Directory-Based Cache Coherence in
Large-Scale Multiprocessors," Computer 23, 6, pp. 49-58, Jun. 1990.

[19] D. Chandra, F. Guo, S. Kim, and Y. Solihin, "Predicting Inter-Thread Cache Contention on
a Chip Multi-Processor Architecture", Proc. 11th Int '1 Symp. on High Performance
Computer Architecture, pp. 340-351, Feb. 2005.

[20] J. Chang and G. S. Sohi, "Cooperative caching for chip multiprocessors," Proc. 33rdlnt '1
Symp. on Computer Architecture, June 2006.

[21] B. Chazelle, J. Kilian, et al., "The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables," Proc. 15th Annual ACM-SIAM Symp. on Discrete Algorithms, Jan. 2004.

[22] T. Chen and J. Baer, "Reducing Memory Latency via Non-blocking and Prefetching
Caches," Proc. oflnt'1 Conf: on Architectural Support for Programming Languages and'
Operating Systems, pp. 51-61, Oct. 1992.

[23] T. M. Chilimbi and M. Hirzel, "Dynamic Hot Data Stream Prefetching for General-purpose
Programs," Proc. SIGPLAN '02 Conference on PLDI, pp. 199-209, June 2002.

[24] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Optimizing Replication, Communication,
and Capacity Allocation in CMPs," Proc. 32nd Int '1Symp. on Computer Architecture, June
2005.

[25] D. E. Culler, J. P. Singh, and A. Gupta, "Parallel Computer Architecture: A Hardware/Software Approach," Morgan Kaufmann Publishers Inc., 1999.

[26] H. Dybdahl and P. Stenstrom, "An Adaptive Shared/Private NUCA Cache Partitioning
Scheme for Chip Multiprocessors," Proc. 13th Int '1Symp. on High Performance Computer
Architecture, Feb 2007.

[27] G. Edwards, S. Devadas, and L. Rudolph, "Analytical Cache Models with Applications to
Cache Partitioning," Proc. 15th Int '1 Conf: on Supercomputing, pp. 1-12, June 2001.

[28] B. Fraguela, R. Doallo, and E. Zapata, "Automatic Analytical Modeling for the Estimation of Cache Misses," Proc. 1999 Int'l Conf. on Parallel Architectures and Compilation Techniques, Sep. 1999.










[29] J. D. Gilbert, S. H. Hunt, D. Gunadi, and G. Srinivasa, "Niagara2: A Highly-Threaded Server-on-A-Chip," Proc. 18th HotChips Symp., Aug. 2006.

[30] A. Gupta, W. Weber, and T. Mowry, "Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes," Proc. Int'l Conf. ICPP '90, pp. 312-321, Aug. 1990.

[31] J. Hasan, S. Cadambi, et al., "Chisel: A Storage-efficient, Collision-free Hash-based Network Processing Architecture," Proc. 33rd Int'l Symp. on Computer Architecture, pp. 203-215, June 2006.

[32] M. Hill and J. Smith, "Evaluating Associativity in CPU Caches", IEEE Trans. on
Computers, pp. 1612-1630, Dec. 1989.

[33] Z. Hu, M. Martonosi, and S. Kaxiras, "TCP: Tag Correlating Prefetchers," Proc. 9th Int'l Symp. on HPCA, pp. 317-326, Feb. 2003.

[34] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler, "A NUCA Substrate for Flexible CMP Cache Sharing," Proc. 19th Int'l Conf. on Supercomputing, June 2005.

[35] Intel Core Duo Processor: The Next Leap in Microprocessor Architecture, Technology@Intel Magazine, Feb. 2006.

[36] Intel Core 2 Quad Processors, http://www.intel.com/products/processor/core2quad/index.htm

[37] D. Joseph and D. Grunwald, "Prefetching Using Markov Predictors," Proc. 26th Int'l Symp. on Computer Architecture, pp. 252-263, June 1997.

[38] N. P. Jouppi, "Improving Direct-mapped Cache Performance by the Addition of a Small Fully-associative Cache and Prefetch Buffers," Proc. 17th Int'l Symp. on Computer Architecture, pp. 364-373, May 1990.

[39] R. Kalla, B. Sinharoy, and J. Tendler, "IBM POWERS Chip: A Dual-Core Multithreaded
Processor," IEEE Micro, Vol 24(2), Mar-Apr. 2004.

[40] S. Kapil, "UltraSPARC Gemini: Dual CPU Processor," Proc. 15th HotChips Synap., Aug.
2003.

[41] C. Kim, D. Burger, and S. Keckler, "An Adaptive, Non-uniform Cache Structure for Wire-delay Dominated On-chip Caches," Proc. 10th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 2002.

[42] S. Kim, D. Chandra, and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," Proc. 2004 Int'l Conf. on Parallel Architectures and Compilation Techniques, Sep. 2004.










[43] Y. H. Kim, M. D. Hill, and D. A. Wood, "Implementing Stack Simulation for Highly-associative Memories," Proc. 1991 SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, pp. 212-213, May 1991.

[44] M. Kistler, M. Perrone, and F. Petrini, "Cell Multiprocessor Communication Network:
Built for Speed," IEEE Micro, Vol 26(3), pp. 10-23, May-June 2006.

[45] P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: A 32-way Multithreaded SPARC
Processor," Proc. 16th HotChips Symp., Aug. 2004.

[46] S. Iacobovici, L. Spracklen, S. Kadambi, Y. Chou, and S. G. Abraham, "Effective Stream-Based and Execution-Based Data Prefetching," Proc. 19th Int'l Conf. on Supercomputing, pp. 1-11, June 2004.

[47] A. Lai, C. Fide, and B. Falsafi, "Dead-block Prediction & Dead-block Correlating
Prefetchers," Proc. 28th Int'1 Symp. on Computer Architecture, pp 144-154, July 2001.

[48] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and M. Kandemir, "Design and
Management of 3D Chip Multiprocessors Using Network-in-Memory," Proc. 33rdlnt '1
Symp. on Computer Architecture, June 2006.

[49] C. Liu, A. Sivasubramaniam, and M. Kandemir, "Organizing the Last Line of Defense
before Hitting the Memory Wall for CMPs," Proc. 10th Int '1 Symp. on High Performance
Computer Architecture, pp. 176-185, Feb. 2004.

[50] P. S. Magnusson et al, "Simics: A Full System Simulation Platform," IEEE Computer, Feb.
2002.

[51] M. R. Marty and M. D. Hill, "Virtual Hierarchies to Support Server Consolidation," Proc.
34th Int '1Symp. on Computer Architecture, June 2007.

[52] Matlab, http://www.mathworks.com/products/matlab/.

[53] R. Mattson, J. Gecsei, D. Slutz, and I. Traiger, "Evaluation Techniques and Storage
Hierarchies," IBM Systems Journal, 9, pp. 78-117, 1970.

[54] K. Nesbit and J. Smith, "Data Cache Prefetching Using a Global History Buffer," Proc.
10th Int'1 Symp. on High Performance Computer Architecture, pp 96-105, Feb. 2004.

[55] B. O'Krafka and A. Newton, "An Empirical Evaluation of Two Memory-Efficient
Directory Methods," Proc. 1 7th Int'l Symp. Computer Architecture, pp. 138-147, May
1990.

[56] K. Olukotun et al, "The Case for a Single-Chip Multiprocessor," Proc. 7th Int'1 Conf: on
Architectural Support for Programming Languages and' Operating Systems, Oct. 1996.

[57] Open Source Development Labs, Database Test 2, http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-2/










[58] J. K. Peir, Y. Lee, and W. W. Hsu, "Capturing Dynamic Memory Reference Behavior with
Adaptive Cache Topology," Proc. 8th Int'1 Conf: on Architectural Support for
Programming Language and' Operating Systems, pp. 240-250, Oct. 1998.

[59] M. K. Qureshi, D. Thompson, and Y. N. Patt, "The V-Way Cache: Demand-Based
Associativity via Global Replacement," Proc. 32ndlnt 'lSymp. on Computer Architecture,
June 2005.

[60] M. K. Qureshi and Y. N. Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-
Performance, Runtime Mechanism to Partition Shared Caches," Proc. 33rdlnt '1Symp. on
M\~icroarchitecture, Dec. 2006.

[61] N. Rafique, W. Lim, and M. Thottethodi, "Architectural Support for Operating System Driven CMP Cache Management," Proc. 2006 Int'l Conf. on Parallel Architectures and Compilation Techniques, Sep. 2006.

[62] S. Sair and M. Charney, "Memory Behavior of the SPEC-2000 Benchmark Suit," Technical
Report, IBM~ Corp., Oct. 2000.

[63] A. Saulsbury, F. Dahlgren, and P. Stenstrom, "Recency-based TLB preloading," Proc. 27th
Int '1Symp. on Computer architecture, pp. 117-127, May 2000.

[64] A. Seznec, "A Case for Two-Way Skewed-Associative Cache," Proc. 20th Int'1 Symp. on
Computer Architecture, pp. 169-178, May 1993.

[65] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing
Large Scale Program Behavior," Proc. 10th Int '1Conf: on Architecture Support for
Programming Language and' Operating Systems, pp 45-57, Oct. 2002.

[66] G. Sohi, "Single-Chip Multiprocessors: The Next Wave of Computer Architecture Innovation," Proc. 37th Int'l Symp. on Microarchitecture, 2nd Keynote, pp. 143-143, Dec. 2004.

[67] Y. Solihin, J. Lee, and J. Torrellas, "Using a User-level Memory Thread for Correlation
Prefetching," Proc. 29th Int '1Symp. on Computer Architecture, pp. 171-182, May 2002.

[68] M. Spjuth, M. Karlsson, and E. Hagersten, "Skewed Caches from a Low-power
Perspective," Proc. 2nd' Conf: on Computing frontiers 2005, May 2005.

[69] L. Spracklen and S. Abraham, "Chip Multithreading: Opportunities and Challenges," Proc.
11th Int'1 Symp. on High Performance Computer Architecture, pp. 248-252, Feb. 2005.

[70] L. Spracklen and Y. Chou, "Effective Instruction Pre-fetching in Chip Multiprocessors for
Modern Commercial Applications," Proc. 11th Int'1 Symp. on High Performance Computer
Architecture, pp. 225-236, Feb. 2005.










[71] K. Strauss, X. Shen, and J. Torrellas, "Flexible Snooping: Adaptive Forwarding and
Filtering of Snoops in Embedded-Ring Multiprocessors," Proc. 33rdlnt '1Symp. on
Computer Architecture, June 2006.

[72] R. A. Sugumar and S. G. Abraham, "Set-associative Cache Simulation using Generalized
Binomial Trees," ACM~ Trans. on Computer Systems, Vol. 13, No. 1, pp. 32-56, Feb. 1995

[73] C. K. Tang, "Cache Design in the Tightly Coupled Multiprocessor System," AFIPS
Conference Proceedings, National Computer Conference, pp 749-753, June 1976.

[74] S. P. Vanderwiel and D. J. Lilja, "Data Prefetch Mechanisms," ACM Computing Surveys, pp. 174-199, June 2000.

[75] S. Vangal et al., "An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS," IEEE
International Solid-State Circuits Conference, Feb. 2007.

[76] X. Vera and J. Xue, "Let's Study Whole-Program Cache Behavior Analytically," Proc. 8th
Int '1Symp. on High Performance Computer Architecture, Feb. 2002.

[77] Z. Wang and D. Burger, et al., "Guided Region Prefetching: a Cooperative
Hardware/Software Approach," Proc. 30th Int '1Symp. on Computer Architecture, pp. 388-
398, June 2003.

[78] T. Wenisch and S. Somogyi, et al., "Temporal Streaming of Shared Memory," Proc. 32nd
Int'1 Symp. on Computer Architecture, pp. 222-233, June 2005.

[79] C. E. Wu, Y. Hsu, and Y. Liu, "Efficient Stack Simulation for Shared Memory Set-
Associative Multiprocessor Caches," Proc. 1993 Int '1 Conf: on Parallel Processing, Aug.
1993.

[80] Y. Wu and R. Muntz, "Stack Evaluation of Arbitrary Set-Associative Multiprocessor
Caches," IEEE Trans. on Parallel and Distributed Systems, pp. 930-942, Sep. 1995.

[81] M. Zhang and K. Asanovic, "Victim Replication: Maximizing Capacity while Hiding Wire
Delay in Tiled Chip Multiprocessors," Proc. 32nd Int '1Symp. on Computer Architecture,
pp. 336-345, June 2005.

[82] Z. Zhu and Z. Zhang, "A Performance Comparison of DRAM Memory System
Optimizations for SMT Processors," Proc. 11th Int'1 Symp. on High Performance
Computer Architecture, pp. 213- 224, Feb. 2005.









BIOGRAPHICAL SKETCH

Xudong Shi received his B.E. degree in electrical engineering and M.E. degree in

computer science and engineering from Shanghai Jiaotong University in 2000 and 2003, respectively. Immediately after that, he began pursuing the doctoral degree in computer engineering at the University of Florida. His research interests include microarchitecture design and

distributed systems.





PAGE 1

1 MITIGATING CMP MEMORY WALL BY ACCURATE DATA PREFETCHING AND ON-CHIP STORAGE OPTIMIZATION By XUDONG SHI A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLOR IDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2007

PAGE 2

2 2007 Xudong Shi

PAGE 3

3 To my wife and my parents

PAGE 4

4 ACKNOWLEDGMENTS I would thank my advisor, Dr. Jih-Kwon Peir for his guidance and support throughout the whole period of my graduate study. His depth of knowledge, insightful advice, tremendous hardworking, great passion and persiste nce has been instrumental in the completion of this work. I thank Dr. Ye Xia for numerous discussions, suggesti ons and help on my latest research projects. I also extend my appreciation to my other committee members, Dr. Timothy Davis, Dr. Chris Jermaine and Dr. Kenneth O. I appreciate the valuable help from my co lleagues, Dr. Lu Peng, Zhen Yang, Feiqi Su, Li Chen, Sean Sun, Chung-Ching Peng, Zhuo Hua ng, Gang Liu, David Lin, Jianming Cheng, and Duckky Lee. Finally, but most importantly, I would thank my parents and my wi fe for their endless love, understanding and support during my life. W ithout them, none of these would have been possible.

PAGE 5

5 TABLE OF CONTENTS page ACKNOWLEDGMENTS...............................................................................................................4 LIST OF TABLES................................................................................................................. ..........7 LIST OF FIGURES................................................................................................................ .........8 ABSTRACT....................................................................................................................... ............10 CHAPTER 1 INTRODUCTION..................................................................................................................12 1.1 CMP Memory Wall..........................................................................................................12 1.2 Directions and Related Works..........................................................................................14 1.2.1 Data Prefetching.....................................................................................................14 1.2.2 Optimization of On-Chip Cache Organization.......................................................17 1.2.3 Maintaining On-Die Cache Coherence..................................................................20 1.2.4 General Organization of Many-Core CMP Platform.............................................23 1.3 Dissertation Contribution..................................................................................................25 1.3.1 Coterminous Group Data Prefetching....................................................................25 1.3.2 Performance Projection of On-chip Storage Optimization....................................25 1.3.3 Enabling Scalable and Low-Conf lict CMP Coherence Directory..........................26 1.4 Simulation Methodology and Workload..........................................................................26 1.5 Dissertation Structure..................................................................................................... ..28 2 COTERMINOUS GROUP DATA PREFETCHING.............................................................30 2.1 Cache Contentions on CMPs............................................................................................31 2.2 Coterminous Group and Locality.....................................................................................33 2.3 Memory-side CG-prefetcher on CMPs.............................................................................37 2.3.1 Basic Design of CG-Prefetcher..............................................................................38 2.3.2 Integrating CG-prefetcher on CMP Memory Systems...........................................40 2.4 Evaluation Methodology..................................................................................................42 2.5 Performance Results........................................................................................................ .44 2.6 Summary.................................................................................................................... 
.......53 3 PERFORMANCE PROJECTION OF ONCHIP STORAGE OPTIMIZATION..................54 3.1 Modeling Data Replication...............................................................................................55 3.2 Organization of Global Stack...........................................................................................60 3.2.1 Shared Caches........................................................................................................62 3.2.2 Private Caches........................................................................................................63 3.3 Evaluation and Validation Methodology..........................................................................66 3.4 Evaluation and Validation Results....................................................................................67

PAGE 6

6 3.4.1 Hits/Misses for Shared and Private L2 Caches......................................................67 3.4.2 Shared Caches with Replication.............................................................................70 3.4.3 Private Caches without Replication........................................................................72 3.4.4 Simulation Time Comparison.................................................................................73 3.5 Summary.................................................................................................................... .......75 4 DIRECTORY LOOKASIDE TABLE: EN ABLING SCALABLE, LOW-CONFLICT CMP CACHE COHERENCE DIRECTORY........................................................................77 4.1 Impact on Limited CMP Coherence Directory.................................................................78 4.2 A New CMP Coherence Directory...................................................................................80 4.3 Evaluation Methodology..................................................................................................86 4.4 Performance Result......................................................................................................... ..88 4.4.1 Valid Block and Hit/Miss Comparison..................................................................88 4.4.2 DLT Sensitivity Studies.........................................................................................93 4.4.3 Execution Time Improvement................................................................................97 4.5 Summary.................................................................................................................... .......97 5 DISSERTATION SUMMARY..............................................................................................99 LIST OF REFERENCES.............................................................................................................102 BIOGRAPHICAL SKETCH.......................................................................................................108

PAGE 7

7 LIST OF TABLES Table page 1-1 Common simulation parameters........................................................................................27 1-2 Multiprogrammed workload mixes simulated...................................................................28 1-3 Multithreaded workloads simulated...................................................................................29 2-1 Example operations of forming a CG................................................................................40 2-2 Space overhead for various memory-side prefetcher.........................................................45 3-1 Simulation time comparison of global stack and execution-driven simulation (in Minutes)....................................................................................................................... ......75 4-1 Directory-related simulation parameters............................................................................87 4-2 Space requirement for the seven directory organizations..................................................87

PAGE 8

8 LIST OF FIGURES Figure page 1-1 Performance gap between memory and cores since 1980.................................................13 1-2 Possible organization of the next-generation CMP...........................................................24 2-1 IPC degradation due to cache contention for SPEC2000 workload mixes on CMPs........32 2-2 Reuse distances for Mcf, Ammp and Parser......................................................................35 2-3 Strong correlation of adj acent references within CGs.......................................................37 2-4 Diagram of the CG prefetcher............................................................................................39 2-5 Integration of the CG-prefetc her into the memory controller............................................42 2-6 Normalized combined IPCs of various prefetchers...........................................................46 2-7 Average speedup of 4 workload mixes..............................................................................48 2-8 Prefetch accuracy and coverage of simulated prefetchers.................................................49 2-9 Effect of distance constr ains on the CG-prefetcher...........................................................51 2-10 Effect of group si ze on the CG-prefetcher.........................................................................52 2-11 Effect of L2 size on the CG-prefetcher..............................................................................52 2-12 Effect of memory cha nnels on the CG-prefetcher.............................................................53 3-1 Cache performance impact when introducing replicas......................................................56 3-2 Curve fitting of reuse distance histogram for the OLTP workload...................................58 3-3 Performance with replicas for different cache sizes derived by the analytical model.......58 3-4 Optimal fraction of replication derived by the analytical model.......................................60 3-5 Single-pass global stack organization................................................................................61 3-6 Example operations of the global stack for shared caches................................................63 3-7 Example operations of the global stack for private caches................................................64 3-8 Verification of miss ratios from gl obal stack simulation for shared caches......................68


3-9  Verification of miss ratio, remote hit ratio and average effective size from global stack simulation for private caches ... 70
3-10 Verification of average L2 access time with different level of replication derived from global stack simulation for shared caches with replication ... 72
3-11 Verification of average L2 access time ratio from global stack simulation for private caches without replication ... 74
4-1  Valid cache blocks in CMP directories with various set-associativity ... 80
4-2  A CMP coherence directory with a multiple-hashing DLT .................. 81
4-3  Valid cache blocks for simulated cache coherence directories ........... 89
4-4  Cache hit/miss and invalidation for simulated cache coherence directories ... 91
4-5  Distribution of directory hits to main directory and DLT ............... 92
4-6  Sensitivity study on DLT size and number of hashing functions .......... 94
4-7  Effects of filtering directory searches by extra index bits ............ 95
4-8  Normalized invalidation with banked DLT and restricted mapping from DLT to directory ... 96
4-9  Normalized execution time for simulated cache coherence directories .... 98


10 Abstract of Dissertation Pres ented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy MITIGATING CMP MEMORY WALL BY ACCURATE DATA PREFETCHING AND ON-CHIP STORAGE OPTIMIZATION By Xudong Shi December 2007 Chair: Jih-Kwon Peir Major: Computer Engineering Chip-Multiprocessors (CMPs) are becoming ubi quitous. With the processor feature size continuing to decrease, the number of cores in CMPs increases dramatically. To sustain the increasing chip-level power in many-core CM Ps, tremendous pressures will be put on the memory hierarchy to supply instructions and data in a timely fashion. The dissertation develops several techniques to address the critical issues in bri dging the CPU memory performance gap in CMPs. An accurate, low-overhead data prefetching on CMPs has been proposed based on a unique observation of coterminous groups highly repeated close-by o ff-chip memory accesses with equal reuse distances. The reuse distance of a memo ry reference is defined to be the number of distinct memory blocks between this memory re ference and the previous reference to the same block. When a member in the coterminous group is accessed, the other me mbers will likely be accessed in the near future. Coterminous groups are captured in a small table for accurate data prefetching. Performance evaluation demonstrates 10% IPC improvement for a wide variety of SPEC2000 workload mixes. It is appealing for fu ture many-core CMPs due to its high accuracy and low overhead.


11 Optimizing limited on-chip cache space is essential for improving memory hierarchy performance. Accurate simulati on of cache optimization for many-co re CMPs is a challenge due to its complexity and simulation time. An analy tical model is developed for fast estimating the performance of data replication in CMP caches We also develop a single-pass global stack simulation for more detailed study of the tradeo ff between the capacity and access latency in CMP caches. A wide-spectrum of the cache design space can be explored in a single simulation pass with high accuracy. Maintaining cache coherence in future many-core CMPs presents difficult design challenges. The snooping-bus-based method and tr aditional directory protocols are not suitable for many-core CMPs. We investigate a new set-associative CMP coherence with small associativity, augmented with a Directory Lookaside Table (DLT) that allows blocks to be displaced from their primary sets for alleviati ng hot-set conflicts that cause unwanted block invalidations. Performance shows 6%-10% IPC improvement for both multiprogrammed and multithreaded workloads.


CHAPTER 1
INTRODUCTION

1.1 CMP Memory Wall

As silicon VLSI integration continues to advance with deep submicron technology, billions of transistors will be available in a single processor die, with clock frequencies approaching 10 GHz. Because of limited Instruction-Level Parallelism (ILP), design complexity, and high energy/power consumption, further expanding wide-issue, out-of-order single-core processors with huge instruction windows and super-speculative execution techniques will suffer diminishing performance returns. It has become the norm that a processor die contains multiple cores, called a Chip Multiprocessor (CMP), and each core can execute multiple threads simultaneously to achieve a higher chip-level Instructions-Per-Cycle (IPC) [56]. The case for a chip multiprocessor was first presented in [56]. Since then, many companies have designed and/or announced their multi-core products [7], [40], [45], [39], [2], [35]. Trends, opportunities, and challenges for future CMPs have appeared in recent keynote speeches, invited talks, and special columns of conferences and professional journals [15], [66], [69], [16].

CMPs are now becoming ubiquitous in all computing domains. As the processor feature size continues to decrease, the number of cores in a CMP increases rapidly. 4- or 8-core CMPs are now commercially available [3], [36], [29]. Recently, Intel announced a prototype teraflop processor [75], an 80-core design with a 2D mesh interconnect architecture that reaches more than 1 Tflops of performance. Furthermore, advances in wafer stacking technology, CAD design tools, thermal management, and electrothermal design methods make 3-dimensional (3D) chips feasible [48], [13]. This soon-to-be commercially available 3D technology further changes the landscape of processor chips. We will see a large number of processor cores in a single CMP die, called many-core CMPs.


Figure 1-1. Performance gap between memory and cores since 1980 (J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition).

In such a die with a large number of cores, tremendous pressure will be put on the memory hierarchy to supply the needed instructions and data in a timely fashion to sustain ever-increasing chip-level IPCs. However, the performance gap between memory and cores has been widening since 1980, as illustrated in Figure 1-1. How to bridge the performance gap between processors and memory on many-core CMPs becomes a critical question.

Hierarchical caches are traditionally designed to bridge the gap between main memory and cores by exploiting programs' spatial and temporal locality. To match the fast speed and high bandwidth of the CPU's execution pipelines, it is standard for every core in a CMP to have small (e.g., 16KB-64KB) first-level private instruction and data caches, tightly coupled to its pipeline. The most critical part of the many-core CMP cache hierarchy design boils down to the lower-level caches. However, several challenges exist in designing the lower-level caches of many-core CMPs. First, since caches occupy a large percentage of die space, the total cache space in many-core CMPs is usually restricted in order to keep the die footprint reasonably sized. It is not even unusual for the average cache space per core to decrease as the number of cores increases.


How to leverage the precious on-chip storage space becomes an essential issue. Second, many-core CMPs usually suffer from longer on-chip remote cache access latency and off-chip memory access latency in terms of CPU cycles. This is mainly due to two reasons. On one hand, the gap between the speed of cores and that of caches/memory is widening. On the other hand, with a large number of cores and a large amount of cache space, it is extremely time-consuming to locate and transfer a data block in and between the caches. A critical question is how to reduce the on-chip and off-chip access latencies. Third, many-core CMPs demand higher on- and off-chip memory bandwidth. To sustain the IPC of many cores, large amounts of data must be transferred between main memory and caches, among different caches, and between caches and cores.

1.2 Directions and Related Works

The design of the lower-level caches in next-generation many-core CMPs has yet to be standardized. There are many factors to leverage and many possible performance metrics to evaluate. Several important directions among them are data prefetching, optimizing the on-chip cache organization, and improving cache coherence activities among different on-die caches. In the remainder of this chapter, we introduce existing solutions in these directions and raise some interesting questions.

1.2.1 Data Prefetching

Data prefetching has been an important technique to mitigate the CPU-memory performance gap [38], [22], [37], [74], [77], [33], [54], [46]. It speculatively fetches the data or instructions that the processor(s) will likely use in the near future into the on-chip caches in advance. A successful prefetch may turn an off-chip memory miss into a cache hit or a partial cache hit, thus eliminating or shortening the expensive off-chip memory latency. A data prefetcher may decide what and when to prefetch based on program semantics that have been hinted by programmers or compilers (software-based), or based on runtime memory access patterns that the processors have experienced (hardware-based).


The software techniques require significant knowledge and effort from programmers and/or compilers, which limits their practicality. The hardware approaches, on the other hand, predict future access patterns based on history, hoping that the history will recur. However, an inaccurate (useless) prefetch may hurt overall performance, since it consumes memory bandwidth and the useless data block may pollute the on-chip storage. Three key metrics are used to measure the effectiveness of a data prefetcher [37]:

- Accuracy: the ratio of useful prefetches to total prefetches.
- Coverage: the ratio of useful prefetches to the total number of misses, i.e., the percentage of misses that are covered by prefetching.
- History size: the size of the extra history table used to store memory access patterns, usually for hardware prefetchers.

Many uniprocessor data prefetching schemes have been proposed in the last decade [37], [74], [77], [54]. Traditional sequential or stride prefetchers identify sequential or strided memory access patterns and prefetch the next few blocks in such a pattern. They work well for workloads with regular spatial access behavior [38], [22], [46]. Correlation-based predictors (e.g., the Markov predictor [37] and the Global History Buffer [54]) record and use past miss correlations to predict future cache misses. They record miss pairs (A->B) in a history table, meaning that a miss of B following a miss of A has been observed in the past. When A misses again, B will be prefetched. However, a huge history table or FIFO buffer is usually necessary to provide decent coverage. Instead of recording individual miss-block correlations, Hu et al. [33] use tag correlation, correlation at a much coarser (tag) granularity, to reduce the history size.


The downside of the coarser correlation is that it reduces accuracy as well. To avoid cache pollution and provide timely prefetches, the dead-block prefetcher issues a prefetch once a cache block is predicted to be dead, based on a huge history of program instructions [47].

Speculative data prefetching becomes even more essential on CMPs to hide the higher memory wall. But prefetching on CMPs is also more challenging, due to limited on-die cache space and off-chip memory bandwidth. The traditional Markov data prefetcher [37], despite its reasonable coverage and generality, faces serious obstacles in the context of CMPs. First, each cache miss often has several potential successor misses, and prefetching multiple successors is inaccurate and expensive. Such incorrect speculation is more harmful on CMPs, wasting already limited memory bandwidth and polluting critical on-chip caches. Second, consecutive cache misses can be separated by only a few instructions, so it may be too late to initiate prefetches for successor misses. Third, reasonable miss coverage requires a large history table, which translates to more on-chip power and area.

Recently, several proposals target improved prefetch accuracy. Saulsbury et al. [63] proposed recency-based TLB preloading. It maintains the TLB information in a Mattson stack and preloads adjacent entries in the stack upon a TLB miss. The recency-based technique can be applied to data prefetching. However, it prefetches adjacent entries in the stack without prior knowledge of whether the adjacent requests have shown any repeated pattern or how the two requests arrived at adjacent stack positions. Chilimbi [23] introduced a hot-stream prefetcher. It profiles and analyzes sampled memory traces online to identify frequently repeated sequences (hot streams) and inserts prefetching instructions into the binary code for these streams. The profiling, analysis, and binary code insertions/modifications incur execution overheads, and may become excessive to cover hot streams with long reuse distances. Wenisch et al. [78]


17 proposed temporal streams by extending hot stre ams and global history buffer to deal with coherence misses on SMPs. It requires a huge FI FO and multiple searches/comparisons on every miss to capture repeated streams. In spite of so much effect, it remains a bi g challenge to provide a more accurate data prefetcher with low overhead on CMPs, where memory bandwidth and chip space are more limited, and where inaccurate prefetches are less tolerated. 1.2.2 Optimization of On-Chip Cache Organization With limited on-chip caches, optimization of on-chip lower le vel cache organization becomes critical. An important design metrics is whether the lower level cache is shared among many cores or partitioned into private caches for each core. Sharing has two main benefits. First, sharing increases the e ffective capacity of the cache, sinc e a block only has one copy in the shared cache. Second, sharing balances cache occupancy automatically among workloads with unbalanced working sets. However, sharing ofte n increases the hit latency due to the longer wiring delay, and possibly also due to larger se arch time and bandwidth bottleneck. Furthermore, a dynamic sharing may lead to erratic, applicat ion-dependent performance when different cores interfere with each other by evic ting each others blocks. It caus es priority-inversion problem when the task running in one core occupies too much cache space and starves higher priority tasks in other cores [42 ], [61 ]. A monolithic shared cache with high associ ativity consumes more power as the size increases. Non-uniform cache access (NUCA) [41 ] architecture splits a large monolithic shared cache into several banks to reduce power dissi pation and increase bandwidth. Usually the number of banks is equal to the number of core s, and each core has a local bank. Which bank to store a block is statically deci ded by the lower bits of block a ddresses. The access latency thus depends on the distance between the requesting core and the bank containing the data. Generally,


18 only a small fraction of accesses (the reciprocal of the number of cores) are targeting to local banks. Alternatively, private caches c ontain most recently used blocks for specific cores. They provide fast local accesses for majority of the cac he hits, probably reducing the traffic between different caches and consuming less power. But, da ta may be replicated in private caches when two or more cores share the same blocks, lead ing to less capacity and often more off-chip memory misses. Private caches also need to mainta in data coherence. Upon a local read miss or a write without data exclusivity, th e accessing core needs to check ot her private caches by either a broadcast or through a global directory, to fetch data and/or to maintain write consistency by invalidating remote copies. Another downside is private caches do not allow storage space sharing among multiple cores, thus can not accommodate unbalanced cache occupancy for workloads with different working sets. It has become increasingly clear that it coul d be better to combine the benefits of both private caches and shared caches [49 ], [24 ], [81 ], [34 ], [20 ], [9 ]. Generally, they can be summarized into two general direct ions. The first direction is to organize the L2 as a shared cache for maximizing the capacity. To shorten th e access time, data blocks may be dynamically migrated to the requesting cores [8 ], and/or some degree of replication is allowed[81 ], [9 ], to increase the number of local accesses at a mi nimum cost of lowering on-chip capacity. To achieve fair capacity sharing, [60 ] partitions a shared cache between multiple applications depending on the reduction in cache misses that e ach application is likely to obtain for a given amount of cache resources. The second direction is to organize the L2 as private caches for minimizing the access time. But data replications among multiple L2s are constrained to achieve larger effective on-chip capacity [20 ] without adversely decreasing the number of local accesses


19 too much. Dybdahl [26 ] proposes to create a shared logica l L3 part by giving up a dynamically adjusted portion of private L2 space for each core To achieve optimal capacity sharing, private L2s can steal others capacity by block migration [24 ], [20 ], accommodating different space requirements of different cores (workloads). One of the biggest problems that these studies face is extremely long simulation time. They must examine a wide-spectrum of design spaces to have complete conclusi ons, such as different number of cores, different L2 sizes, different L2 organizations and different workloads with different working sets. Furthermore, increasi ng number of cores on CMPs makes the problem even worse. Simulation time usually increases more than linearly as the number of cores increases. It is expected that 32, 64 or even hundr eds of cores will be the target of the future research. To reduce the simulation time, FPGA si mulation might be a good solution, but it is too difficult to build. A great challenge is then how to provide an effi cient methodology to study design choices of optimizing CMP on-chip storage accurately and completely, when the number of cores increases. There have been several techniques for speeding up cache simulations in uniprocessor systems. Mattson, et al. [53 ] presents a LRU stack algorithm to measure cache misses for multiple cache sizes in a single pass. For fast search through the stack, tree-based stack algorithms are proposed [10 ], [76 ]. Kim et al. [43 ] provide a much fa ster simulation by maintaining the reuse distance counts only to a few potential cache sizes. All-associativity simulations allow a single-pass simulati on for variable set-associativities [32 ], [72 ]. Meanwhile, various prediction models have been proposed to provide quick cache performance estimation [5 ], [28 ], [27 ], [76 ], [11 ], [12 ]. They apply statistical models to analyze the stack reuse distances. But, it is generally difficult to model systems with complex real-time interactions among


20 multiple processors. StatCache [11 ] estimates capacity misses usi ng sparse sampling and static statistical analysis. All above techniques target uniprocessor syst ems where there is no interference between multiple threads. Several works aim at modeling multiprocessor systems [79 ], [80 ], [19 ], [12 ]. StatCacheMP [12 ] extends StatCache to incorporate communication misses. However, it assumes a random replacement policy for the statistical model. Chandra, et al [19 ] propose three analytical models based on the L2 stack distance or circular sequence profile of each thread to predict inter-thread cache contentions on the CMP for multiprogrammed workloads that do not have interference with each other. Two other wo rks pay attention only to miss ratios, update ratios, and invalidate ratios for multiprocessor caches [79 ], [80 ]. Despite those efforts, it is still an important problem how to effici ently model and predict the performance results of optimization of many-core CMP c ache organization about the tradeoff between data capacity and accessibility. 1.2.3 Maintaining On-Die Cache Coherence Cache Coherence defines the behavior of r eads and writes to the same memory location with multiple cores and multiple caches. The cohe rence of caches is obtained if the following conditions are met: 1) Program order must be preserved among reads and writes from the same core. 2) The coherent view of me mory must be maintained, i.e. a read from a core must return the latest value written by other cores. 3) Wr ites to the same location must be serialized. On CMPs, since the on-chip cache access latency is much less than the off-chip memory latency, it is desirable to obtain the data from on-chip caches if possible. Write-invalidation cache coherence protocol is generally used on todays microprocessors. There are three cache coherence activities here on CMPs. First, on a data read miss at the local cache, a search through the other on-chip caches is performed to obtain the latest data, if possible. Second, on a write


miss at the local cache, a search through the other caches is performed to obtain the latest data, if possible, and all other copies must be invalidated. Third, on a write upgrade (i.e., a write hit at the local cache without exclusivity), all the copies in other caches must be invalidated.

Maintaining cache coherence with an increasing number of cores has become a key design issue for future CMPs. In an architecture with a large number of cores and cache modules, it becomes inherently difficult to locate a copy (or copies) of a requested data block and to keep the copies coherent. When a requested block is not located in the local cache module, one strategy is to search through all modules. A broadcast to all the modules can be issued at once or staged over multiple levels, for instance, local modules, neighbor modules, and then the entire space. This approach is only applicable to a system with a small number of cores using a shared snooping bus. Searching the entire CMP cache becomes prohibitively time-consuming and power-hungry as the number of cores increases. Hierarchical clusters with hierarchical buses alleviate the problem at the expense of introducing considerable complexity. Recent ring-based architectures connect all the cores (and their local L2 slices) with one or more uni- or bi-directional rings [44], [71]. A remote request travels hop-by-hop on the ring until a hit is encountered. The data may travel back to the requesting core hop-by-hop or directly, depending on whether data interconnect shortcuts are provided. The total number of hops varies, depending on the workload and the data replication strategy. However, ring-based architectures may still not scale well as the number of cores increases.

To avoid broadcasting to and searching all cache modules, directory-based cache-coherence mechanisms might be the choice. When a request cannot be serviced by the local cache module, the request is sent to the directory to find the state and the locations of the block. Many directory implementations have been proposed in the field of Symmetric Multiprocessors (SMPs).


A memory-based directory records the states and sharers of all memory blocks at each memory module using a set of presence bits [17]. Although the memory-based directory can be accessed directly by the memory address, such a full directory is unnecessary in CMPs, since the size of the cache is only a small fraction of the total memory. Many research works have tried to overcome the space overhead of the memory-based directory [4], [18], [25]. Recently, a multi-level directory combined the full memory-based directory with directory caches for fast accesses [1]. A directory cache is a small, full-map, first-level directory that provides information for the most recently referenced blocks, while the second-level directory provides information for all the blocks. The cache-based directory, on the other hand, records only cached blocks in the directory to save directory space. The simplest approach is to duplicate all individual cache directories in a centralized location [73]. For instance, Piranha [7] duplicates the L1 tag arrays in the shared L2 to maintain L1 coherence. Searches of all cache directories are necessary to locate copies of a block. This essentially builds a directory of very wide set-associativity (the number of cores times the number of cache ways per set) and wastes a great deal of power. In a recent virtual hierarchy design [51], a 2-level directory is maintained in a virtual machine (VM) environment. The level-1 coherence directory is embedded in the L2 tag array of the dynamically mapped home tile located within each VM domain. Any unresolved accesses are sent to the level-2 directory. If a block is predicted to be on-chip, the request is broadcast to all cores.

The sparse directory approach uses a small fraction of the full memory directory, organized in a set-associative fashion, to record only the cached memory blocks [30], [55]. Since the directory must maintain a full map of the cache states, hot-set conflicts at the directory lead to unnecessary block invalidations at the cache modules, resulting in an increase in cache misses.
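To make the hot-set problem concrete, the following is a minimal sketch of a sparse, set-associative directory in which allocating into a full set must evict a resident entry and therefore invalidate that block in the caches. It is only an illustration of the conflict described above, not the directory organization evaluated later in this dissertation; the entry layout, the modulo set index, and the naive way-0 victim choice are our own simplifying assumptions.

#include <cstdint>
#include <vector>

struct DirEntry { bool valid; std::uint64_t block; std::uint64_t sharers; };

class SparseDirectory {
public:
    SparseDirectory(std::size_t sets, std::size_t ways)
        : sets_(sets), ways_(ways), dir_(sets * ways, DirEntry{false, 0, 0}) {}

    // Records that 'core' now caches 'block'. Returns the block that had to be
    // thrown out of the directory (and hence invalidated in the caches), or 0.
    std::uint64_t Allocate(std::uint64_t block, int core) {
        DirEntry* set = &dir_[(block % sets_) * ways_];
        for (std::size_t w = 0; w < ways_; ++w)       // block already tracked
            if (set[w].valid && set[w].block == block) {
                set[w].sharers |= 1ull << core;
                return 0;
            }
        for (std::size_t w = 0; w < ways_; ++w)       // free way in the set
            if (!set[w].valid) {
                set[w] = DirEntry{true, block, 1ull << core};
                return 0;
            }
        // Hot-set conflict: the set is full, so some tracked block must be
        // evicted and invalidated in every cache holding it (victim choice simplified).
        std::uint64_t victim = set[0].block;
        set[0] = DirEntry{true, block, 1ull << core};
        return victim;
    }
private:
    std::size_t sets_, ways_;
    std::vector<DirEntry> dir_;
};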


23 With a typical set-associative directory, such conflict at the directory tends to become worst as the number of cores increases, unless the set-a ssociativity also dramatically increases. For instance, in a CMP with 64 cores with each co re having a 16-way local cache module, only a 1024-way directory can eliminate all inadvertent invalidations. A naive plan of building a 1024way set-associative directory, alt hough it can eliminate all conflicts it is hardly feasible. Thus, an important technical problem is to avoid the ho t-set conflicts at the directory with small set associativity, small space and high efficiency. Some previous works exist to alleviate the hot -set conflict of caches instead of the cache coherence directory. The column-associative cache [6 ] establishes a secondary set for each block using a different hash function from that for the primary set. The gr oup-associative cache [58 ] maintains a separate cache tag array for a more flexible secondary set. The skewed-associative cache applies different hash functions on diffe rent cache partitions to reduce conflicts [64 ], [14 ]. The V-way cache [59 ] doubles the cache tag size with respec t to the actual number of blocks to alleviate set conflicts. Any unused tag entry in the primary set can record a newly missed block without replacing an existing block. The Bloomier filter [21 ] approach institutes a conflict-free mapping from a set of search keys to a linear a rray, using multiple hash functions. It remains a problem to reduce the set-confli ct of cache coherence directory. 1.2.4 General Organization of Many-Core CMP Platform To establish the foundation for the proposed rese arch, we plot the possible organization of future many-core CMPs, as illustrated in Figure 1-2 This is similar to the Intels vision of future CMPs. The CMPs will be built on a partition-based or tiled-based substrate. There will be 16-64 processing cores and 16-64MB on-chip cache capacity.


24 Figure 1-2. Possible organization of the next-generation CMP. Each tile contains one core and a local cache part ition. To match the speed and bandwidth of processor internal pipelines, th e local cache partition is further divided into even closer private L1 instruction and data caches, and a unified loca l L2 module. All the other L2 modules except the local module are remote to this specific co re. For achieving the shortest access time, each memory block may be dynamically allocated, migr ated, and replicated to any individual L2 cache modules. The coherence directory is also physically distributed among all the partitions, though logically it is centralized. Write-invalidate write -allocate, MOESI coherence protocol is applied to maintain data coherence for blocks allo cated to multiple cache modules. The coherence directory maintains the states of all cache blocks. A directory l ookup request will be sent to the directory if a memory request cannot be honored by the local cache, such as read miss, write miss, and write upgrade. It is lik ely that a 2D mesh interconnect and necessary routers link all the partitions. The access latency to a remote partiti on is decided by the wiring distance and router


processing time. Multiple memory interfaces may be available to support multi-channel main memory.

1.3 Dissertation Contribution

There are three main contributions in this dissertation addressing the issue of mitigating the CMP memory wall. First, we develop an accurate and low-overhead data prefetching scheme based on a unique observation of program behavior. Second, we describe an analytical model and a single-pass global stack simulation to quickly project the performance of the tradeoff between cache capacity and data accessibility in CMP on-chip caches. Third, we develop a many-core CMP coherence directory with small set-associativity, small space, and high efficiency, and introduce a directory lookaside table to reduce the number of inadvertent cache invalidations due to directory hot-set conflicts.

1.3.1 Coterminous Group Data Prefetching

- We demonstrate with a set of SPEC CPU2000 workload mixes that cache contention among cores running independent workloads creates many more cache misses.
- We observe a unique behavior, the Coterminous Group (CG), in the SPEC CPU2000 applications: a group of memory accesses with temporally repeated access patterns. We further define a new Coterminous Locality: when a member of a CG is referenced, the other members will be referenced soon.
- We develop a CG-prefetcher based on coterminous groups. It identifies and records highly repeated CGs in a small buffer for accurate and timely prefetches of the members in a group. Detailed performance evaluations show significant IPC improvement over other prefetchers.

1.3.2 Performance Projection of On-chip Storage Optimization

- We present an analytical model for fast projection of CMP cache performance with respect to the tradeoff between data accessibility and the cache capacity loss due to data replication.
- We develop a single-pass global stack simulation to more accurately simulate these effects for shared and private caches. Using the stack results of the shared and private caches, we further deduce the performance effects of more complicated cache organizations, such as shared caches with replication and private caches without replication. A minimal sketch of the underlying single-pass stack idea is given after this list.
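The single-pass global stack simulation extends the classic Mattson stack algorithm mentioned in Section 1.2.2. As background only, the sketch below shows the single-pass idea in its simplest uniprocessor form: one pass over a reference stream produces a stack-distance histogram from which the miss count of any fully-associative LRU cache size can be read off. The class and method names are ours, and none of the CMP-specific machinery (sharing, replication, coherence) developed in Chapter 3 is modeled here.

#include <cstdint>
#include <list>
#include <map>

class StackSimulator {
public:
    void Access(std::uint64_t block) {
        std::size_t depth = 0;
        for (auto it = stack_.begin(); it != stack_.end(); ++it, ++depth) {
            if (*it == block) {              // hit at LRU stack depth 'depth'
                ++histogram_[depth];
                stack_.erase(it);
                stack_.push_front(block);    // move to the MRU position
                return;
            }
        }
        ++cold_misses_;                      // first touch of this block
        stack_.push_front(block);
    }

    // Misses of a fully-associative LRU cache holding 'blocks' lines:
    // cold misses plus every reference whose stack distance exceeds the size.
    std::uint64_t Misses(std::size_t blocks) const {
        std::uint64_t m = cold_misses_;
        for (const auto& entry : histogram_)
            if (entry.first >= blocks) m += entry.second;
        return m;
    }
private:
    std::list<std::uint64_t> stack_;                   // MRU at the front
    std::map<std::size_t, std::uint64_t> histogram_;   // stack depth -> count
    std::uint64_t cold_misses_ = 0;
};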


26 We verify the projection accuracy of the an alytical model and the stack simulation by detailed execution-driven simulation. Mo re importantly, the single-pass global stack simulation only consumes a small percentage of simulation time that execution-driven simulation requires. 1.3.3 Enabling Scalable and Low-Conflict CMP Coherence Directory We demonstrate the sparse coherence director y with small associativity causes significant unwanted cache invalidation due to the hot-set conflicts in the coherence directory. We augment a Directory Lookaside Table (DLT) for the set-associ ative sparse coherence directory to allow the displacement of a bloc k away from its primary set to one of the empty slots in order to reduce conflict. Performance evaluations using multithreaded and multiprogrammed workloads demonstrate significant performance improveme nt of the DLT-enhanced directory over the traditional set-associative or skewed asso ciative directories by eliminating majority of the inadvertent cache invalidations. 1.4 Simulation Methodology and Workload To implement and verify our ideas, we use the full-system simulator, Virtutech Simics 2.2 [50 ], to simulate 2-, 4-, or 8-core CMPs runni ng real operation system Linux 9.0 on a machine with x86 Instruction Set Architecture (ISA). The processor module is based on the Simics Microarchitecture Interface (MAI) and models timing-directed processors in detail. A gshare branch predictor is added to each core. Each core has its own instruction and data L1 cache. Since L2 cache is our focus, we have different L2 organizations in different works. It will be described in each work later. We implement a cycle-by-cycle event-driven memory simulator to accurately model the memory system. Multi-channel DDR SDRAM is simulated. The DRAM accesses are pipelined whenever possible. A cycle-accurate, split-transac tion processormemory bus is also included to connect the L2 caches and the main memory. Table 1-1 summarizes common simulation parameters. Pa rameters that are specific to each work will be described later in the individual chapters.


Table 1-1. Common simulation parameters
Parameter         Description
CMP               2, 4, or 8 cores, 3.2GHz, 128-entry ROB
Pipeline width    4 fetch / 7 exec / 5 retire / 3 commit
Branch predictor  G-share, 64KB, 4K BTB, 10-cycle misprediction penalty
L1-I/L1-D         64KB, 4-way, 64B line, 16-entry MSHR, MESI, 0/2-cycle latency
L2                8- or 16-way, 64B line, 16-entry MSHR, MOESI if not purely shared
L2 latency        15 cycles local, 30 cycles remote
Memory latency    432 cycles without contention
DRAM              2/4/8/16 channels, 180-cycle access latency
Memory bus        8-byte, 800MHz, 6.4GB/s, 220-cycle round-trip latency

We use two sets of workloads in our study: multiprogrammed and multithreaded workloads. For the multiprogrammed workloads, we use several mixtures of SPEC CPU2000 and SPEC CPU2006 benchmark applications based on the classification of memory-bound and CPU-bound applications [82]. The memory-bound applications are Art, Mcf, and Ammp, while the CPU-bound applications are Twolf, Parser, Vortex, Bzip2, and Crafty. The first category of workload mixes, MEM, includes memory-bound applications; the second category, MIX, consists of both memory-bound and CPU-bound applications; and the third category, CPU, contains only CPU-bound applications. For studies with 2-, 4-, or 8-core CMPs, we prepare 2-, 4-, and 8-application workloads. We choose the ref input set for all the SPEC CPU2000 and SPEC CPU2006 applications. Table 1-2 summarizes the selected multiprogrammed workload mixes. For each workload mix, the applications are ordered by high-to-low L2 miss penalties from left to right in their appearance. We skip a certain number of instructions for each individual application in a mix based on the studies done in [62], and run the workload mix for another 100 million instructions to warm up the caches. A Simics checkpoint for each mix is generated afterwards. We run our simulator until any application in a mix has executed at least 100 million instructions for collecting statistics.


Table 1-2. Multiprogrammed workload mixes simulated
        MEM                  MIX                   CPU
Two     Art/Mcf              Art/Twolf             Twolf/Bzip2
        Mcf/Mcf              Mcf/Twolf             Parser/Bzip2
        Mcf/Ammp             Mcf/Bzip2             Bzip2/Bzip2
Four    Art/Mcf/Ammp/Twolf   Art/Mcf/Vortex/Bzip2  Twolf/Parser/Vortex/Bzip2
Eight   Art/Mcf/Ammp/Parser/Vortex/Bzip2/Crafty (SPEC2000)
        Mcf/Libquantum/Astar/Gobmk/Sjeng/Xalan/Bzip2/Gcc (SPEC2006)

We also use three multithreaded commercial workloads: OLTP (online transaction processing), Apache (static web server), and SPECjbb (Java server). We account for the variability of these multithreaded workloads by running multiple simulations for each configuration of each workload and inserting small random perturbations in the memory system timing for each run. For each workload, we carefully adjust system and workload parameters to keep the CPU idle time low enough. We then fast-forward the whole system for a sufficient period of time to fill the internal buffers and other structures before making a checkpoint. Finally, we collect simulation results while executing a certain number of transactions after warming up the caches and other simulation-related structures. Table 1-3 gives the details of the workloads.

1.5 Dissertation Structure

The structure of this dissertation is as follows. Chapter 2 develops the first piece of the dissertation, an accurate and low-overhead data prefetching technique for CMPs based on a unique observation of coterminous groups, highly repeated and close-by memory access sequences. In Chapter 3, we illustrate two methodologies, an abstract data replication model and a single-pass global stack simulation, to quickly project the performance of CMP on-chip storage optimization. Chapter 4 builds a set-associative CMP coherence directory with small associativity and high efficiency, augmented by a directory lookaside table that alleviates directory hot-set conflicts. This is followed by a brief summary of the dissertation in Chapter 5.


Table 1-3. Multithreaded workloads simulated

OLTP (online transaction processing): Built upon OSDL-DBT-2 [57] and the MySQL 5.0 database server. We build a 1GB, 10-warehouse database. To reduce database disk activity, we increase the MySQL buffer pool to 512MB. We further stress the system by simulating 128 users without any keying or thinking time. We simulate 1024 transactions after bypassing 2000 transactions and warming up the caches (or stack) for another 256 transactions.

Apache (static web server): We run Apache 2.2 as the web server and use Surge to generate web requests from a 10,000-file, roughly 200MB repository. To stress the CMP system, we simulate 8 clients with 50 threads per client. We collect statistics for 8192 transactions after bypassing 2500 requests and warming up for 2048 transactions.

SPECjbb (Java server): SPECjbb is a Java-based 3-tier online transaction processing system. We simulate 8 warehouses. We first fast-forward 100,000 transactions, then simulate 20480 transactions after warming up the structures for 4096 transactions.


CHAPTER 2
COTERMINOUS GROUP DATA PREFETCHING

In this chapter, we describe an accurate data prefetching technique for CMPs to overlap the expensive off-chip cache miss latency. Our analysis of SPEC applications shows that adjacent traversals of various data structures, such as arrays, trees, and graphs, often exhibit temporally repeated memory access patterns. A unique feature of these nearby accesses is that they exhibit a long but equal reuse distance. The reuse distance of a memory reference is defined as the number of distinct data blocks that are accessed between this reference and the last reference to the same block. It is the most fundamental measure of memory reference locality. We define such a group of memory references with an equal block reuse distance as a Coterminous Group (CG), and the highly repeated access patterns among members of a CG as Coterminous Locality. A new data prefetcher identifies and records highly repeated CGs in a history buffer. For accurate and timely prefetches, whenever a member of a CG is referenced, the entire group is prefetched. We call such a data prefetching method a CG-prefetcher.

We make three contributions with the CG-prefetcher. First, we demonstrate the severe cache contention problem with various mixes of SPEC2000 applications, and describe the necessity and the challenges of accurate data prefetching on CMPs. Second, we discover the existence of coterminous groups in these applications and quantify the strong coterminous locality among members of a CG. Third, based on the concept of coterminous groups, we develop a new CG-prefetcher and present a realistic implementation by integrating the CG-prefetcher into the memory controller. Full-system evaluations based on mixed SPEC CPU2000 applications show that the proposed CG-prefetcher can accurately prefetch the needed data in a timely manner on CMPs. It generates about 10-40% extra traffic to achieve 20-50% miss coverage, compared with two and a half times more extra traffic for a typical correlation-based prefetcher with comparable miss coverage.


The CG-prefetcher also shows better IPC (Instructions per Cycle) improvement than existing miss-correlation-based or stream-based prefetchers. To clearly demonstrate the effectiveness of the CG-prefetcher, we carry out experiments on a simple shared L2 cache with multiple cores, each running a different application. The scheme, however, is independent of any specific cache organization and can be adapted to private caches as well.

2.1 Cache Contentions on CMPs

CMPs put tremendously more pressure on the memory hierarchy to supply data to the cores than uniprocessors do. One of the major reasons is that different cores compete for the limited on-chip shared storage when multiple independent applications run simultaneously on multiple cores. The typical example is the L2 cache shared among multiple cores. This effect is more evident when independent memory-intensive applications run together. To demonstrate cache contention on CMPs, we show the IPCs of a set of SPEC2000 applications running individually or in parallel on 4- or 2-core CMPs in Figure 2-1 (A) and Figure 2-1 (B), respectively. Figure 2-1 (A) covers three workload mixes of 4 applications: Art/Mcf/Ammp/Twolf, Art/Mcf/Vortex/Bzip2, and Twolf/Parser/Vortex/Bzip2. The first mix, Art/Mcf/Ammp/Twolf, contains applications with heavier L2 misses; the second, Art/Mcf/Vortex/Bzip2, mixes applications with heavier and lighter L2 penalties; and the third, Twolf/Parser/Vortex/Bzip2, has applications with generally lighter L2 misses. We also run nine 2-application mixes in Figure 2-1 (B), again ranging from high to low L2 miss penalties: Art/Mcf, Mcf/Mcf, Mcf/Ammp, Art/Twolf, Mcf/Twolf, Mcf/Bzip2, Twolf/Bzip2, Parser/Bzip2, and Bzip2/Bzip2.


Figure 2-1. IPC degradation due to cache contention for SPEC2000 workload mixes on CMPs. A) IPC for 4-application workloads on 4-core CMPs. The first 4 bars are the individual IPCs when only one application is running; the last bar is the combined IPC when the 4 workloads run in parallel on 4 cores. B) IPC for 2-application workloads on 2-core CMPs. The first 2 bars are the individual IPCs when only one application is running; the last bar is the combined IPC when the 2 workloads run in parallel on 2 cores.

The first four bars in Figure 2-1 (A) and the first two bars in Figure 2-1 (B) are the individual IPCs of the applications in the workload mixes, ordered by their appearance in the name of the workload mix. We collect the individual IPC for each application by running that application on one core and keeping all the other cores idle. As a result, the entire L2 cache is available to the individual application. The last bar of each workload mix is the combined IPC when we run all the applications at the same time, with each core running one independent application.


The combined IPC is broken down into segments to show the IPC contribution of each application. Ideally, the combined IPC should equal the sum of the individual IPCs measured when only one application is running. However, significant IPC reductions are observed for each application when they run in parallel, mainly due to shared L2 cache contention among the applications. This is especially evident for the workload mixes with high demands on the shared L2 cache. For example, when Art/Mcf/Ammp/Twolf run on four cores, the individual IPC drops from 0.029 to 0.022 for Art, from 0.050 to 0.026 for Mcf, from 0.132 to 0.043 for Ammp, and from 0.481 to 0.181 for Twolf. Instead of accumulating the individual IPCs of the four cores, the combined IPC drops from 0.69 (the sum of the individual IPCs) to 0.27, a 60% degradation. Similar effects of various degrees can also be observed with 2-core CMPs.

The significant IPC degradation comes from more L2 misses and more off-chip memory accesses. Data prefetching is an effective way to reduce the number of L2 misses. However, in the CMP context with limited cache space and limited memory bandwidth, inaccurate prefetches are more harmful: they pollute the limited cache space and waste the limited memory bandwidth. CMPs demand accurate prefetchers with low overhead to alleviate the heavier cache contention and misses.

2.2 Coterminous Group and Locality

The proposed data prefetcher for CMPs is based on a unique observation of the existence of Coterminous Groups (CGs). A Coterminous Group is a group of nearby data references with the same block reuse distance. The reuse distance of a reference is defined as the number of distinct data blocks that are accessed between two consecutive references to the block. For instance, consider the following access sequence: a b c x d x y z a b c y d.


The reuse distances of a-a, b-b, c-c, and d-d are all 6, whereas x-x is 1 and y-y is 4. In this case, a, b, c, and d can form a CG. References in a CG have three important properties. First, the order of references must be exactly the same at each repetition (e.g., d must follow c, c follows b, and b follows a). Second, references in a CG can interleave with other references (e.g., x, y). These interleaved references, however, are irregular and difficult to predict accurately, and are excluded by the equal-reuse-distance criterion. Third, since we are interested in capturing references with long reuse distances for prefetching, the same reference (i.e., to the same block) usually does not appear twice in one CG.

To demonstrate the existence of CGs, we plot the reuse distances of 3000 nearby references from three SPEC2000 applications, Mcf, Ammp, and Parser, in Figure 2-2. We randomly select 3000 memory references for each application and compute the reuse distance of each reference. Note that references with short reuse distances (e.g., < 512), which are frequent due to the temporal and spatial locality of memory references, can be captured by small caches and are thus filtered out. The existence of CGs is quite obvious from these snapshots. Mcf has a huge CG with a reuse distance of over 60,000. The reuse distance is so large that a reasonably sized L2 cache (< 4MB, if each memory block is 64B) will not keep those blocks, so those accesses are likely to be L2 misses. Ammp shows four large CGs along with a few small ones, and Parser has many small CGs. Other applications also show the CG behavior; we present only three examples due to space limits. The next important question is whether there exist strong correlations among references in a CG, i.e., whether, when a member of a captured CG is referenced, the other members will be referenced in the near future.
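As a concrete check of these definitions, the short program below recomputes the block reuse distances for the example sequence a b c x d x y z a b c y d. It is an illustrative sketch only (a quadratic-time calculation over a toy trace, not the simulator's mechanism); the references that reappear with the same long distance, namely a, b, c, and d, are exactly the candidates for one CG.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Reuse distance of reference i: number of distinct blocks accessed between
// this reference and the previous reference to the same block (-1 if none).
int ReuseDistance(const std::vector<std::string>& trace, std::size_t i) {
    for (std::size_t j = i; j-- > 0;) {
        if (trace[j] == trace[i]) {
            std::set<std::string> distinct(trace.begin() + j + 1, trace.begin() + i);
            return static_cast<int>(distinct.size());
        }
    }
    return -1;
}

int main() {
    std::vector<std::string> trace = {"a","b","c","x","d","x","y","z","a","b","c","y","d"};
    for (std::size_t i = 0; i < trace.size(); ++i) {
        int d = ReuseDistance(trace, i);
        if (d >= 0)
            std::cout << trace[i] << ": reuse distance " << d << "\n";
    }
    // a, b, c, and d all reappear with distance 6 (a candidate CG);
    // x (distance 1) and y (distance 4) are excluded by the equal-distance rule.
    return 0;
}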


Figure 2-2. Reuse distances for Mcf, Ammp and Parser.

We answer this question by measuring the pair-wise correlation A->B between adjacent references in a CG. (This is similar to the miss correlation in [37].) The accuracy with which B is referenced immediately after the re-reference of A provides a good locality measurement. Suppose a CG {a, b, c, d} has been captured; its reference order a->b, b->c, and c->d is recorded in a history table. For an access that hits the table, we verify whether the actual next access is the next access that was recorded. For example, if access a happens again, we verify whether b is the next access. If so, we count it as an accurate prediction based on the CG.

We can also relax the reuse distance requirement, allowing a small variance in reuse distances. We define CG-N as CGs whose nearby references have reuse distances within plus or minus N of each other. A smaller N means more restricted CGs, potentially more accurate, while a larger N means more relaxed CGs, possibly including more members. CG-0 is the most restricted, representing the original same-distance CG, while CG-∞ is the most relaxed, a single CG that includes all nearby references with long block reuse distances.


Figure 2-3 shows, for CG-0, CG-2, CG-8, and CG-∞, the accuracy with which the real next access of a member in a CG is indeed the next access that was captured and saved, for all the individual applications. In this figure, there are four bars for each application, representing the accuracy of CG-0, CG-2, CG-8, and CG-∞ from left to right. In general, CG-0, CG-2, and CG-8 exhibit stronger repeated reference behavior among members of a CG than CG-∞. Ammp and Art show nearly perfect correlations, with about 98% accuracy regardless of the reuse distance requirement. These two applications are floating-point applications with regular array accesses, which are easy to predict. All other applications also demonstrate strong correlations for CG-0, CG-2, and CG-8. As expected, CG-0 shows stronger correlations than the weaker forms of CGs, while CG-∞, which is essentially the same as adjacent cache-miss correlation, shows very poor correlations. For instance, the accuracy of CG-0 for Mcf is about 78%, while the accuracy of CG-∞ is only 30%. The gap between CG-0 and CG-2/CG-8 is rather narrow for Mcf, Vortex, and Bzip2, suggesting that a weaker form of CGs may be preferable for covering more references. A large gap is observed between CG-0 and the other CGs in Twolf, Parser, and Gcc, indicating that CG-0 is more accurate for prefetching in those applications.

Based on these behaviors, we can safely conclude that members of a CG exhibit a highly repeated access pattern, i.e., whenever a member of a CG is referenced, the remaining members will likely be referenced in the near future, following the previous access sequence. We call such highly repeated patterns coterminous locality. Based on the existence of highly repeated coterminous locality among members of CGs, we can design an accurate prefetching scheme that captures CGs and prefetches their members.
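The accuracy measurement described above can be expressed compactly as follows. This is an illustrative sketch, not the evaluation harness used for Figure 2-3: it records the successor of each member of a captured CG and then, over the stream of long-reuse-distance references, counts how often the recorded successor is indeed the next such reference.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct CorrelationCheck {
    std::unordered_map<std::uint64_t, std::uint64_t> next_in_cg;  // A -> recorded successor B
    std::uint64_t predictions = 0, correct = 0;
    std::uint64_t expected = 0;
    bool have_expected = false;

    // Record the reference order of one captured CG (a->b, b->c, ...).
    void RecordGroup(const std::vector<std::uint64_t>& cg) {
        for (std::size_t i = 0; i + 1 < cg.size(); ++i) next_in_cg[cg[i]] = cg[i + 1];
    }

    // Call once per long-reuse-distance reference, in program order.
    void OnReference(std::uint64_t block) {
        if (have_expected) {                 // score the prediction made last time
            ++predictions;
            if (block == expected) ++correct;
            have_expected = false;
        }
        auto it = next_in_cg.find(block);
        if (it != next_in_cg.end()) { expected = it->second; have_expected = true; }
    }

    double Accuracy() const {
        return predictions ? static_cast<double>(correct) / predictions : 0.0;
    }
};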


Figure 2-3. Strong correlation of adjacent references within CGs.

2.3 Memory-side CG-prefetcher on CMPs

Due to shared cache contention on CMPs, it is more beneficial to prefetch the L2 misses to improve overall performance. Our CG-prefetcher records L2 misses, captures CGs from the L2 miss sequence, and prefetches the members of CGs to reduce the number of expensive off-chip memory accesses. Since the memory controller sees every L2 miss directly, we integrate the CG-prefetcher into the memory controller. This is the memory-side CG-prefetcher.

A memory-side CG-prefetcher is attractive for several reasons [67]. First, it minimizes changes to the complex processor pipeline, along with any associated performance and space overheads. Second, it may use the DRAM array to store the necessary state information at minimum cost. A recent trend is to integrate the memory controller into the processor die to reduce interconnect latency. Nevertheless, such integration has minimal performance implications for implementing the CG-prefetcher in the memory controller. Note that although the CG-prefetcher is suitable for uniprocessor systems as well, it is more appealing on emerging CMPs, with their extra resource contention and constraints, due to its high accuracy and low overhead.


2.3.1 Basic Design of CG-Prefetcher

The structure of a CG-prefetcher is illustrated in Figure 2-4. There are several main functions: to capture nearby memory references with equal reuse distances, to form CGs, to save CGs efficiently in a history table for searching, and to update CGs and keep them fresh.

To capture nearby memory references with the same distance, a Request FIFO records the block addresses and reuse distances of recent main memory requests. A CG starts to form once the number of requests with the same reuse distance in the Request FIFO exceeds a certain threshold (e.g., 3), which controls the aggressiveness of forming a new CG. The size of the FIFO determines the adjacency of members and is usually small (e.g., 16). A flag associated with each request indicates whether the request is matched. The matched requests in the FIFO are copied into a CG Buffer, where the CG waits to be completed. The size of the CG Buffer determines the maximum number of members in a CG, which controls the timeliness of prefetches. A small number of CG Buffers can be implemented to allow multiple CGs to form concurrently. A CG is completed and saved when either the CG Buffer is full or a new CG Buffer is needed because a new CG has been identified in the Request FIFO.

To save CGs efficiently, we introduce the Coterminous Group History Table (CGHT), a set-associative table indexed by block address, so that every member of a CG can be found very quickly. A unidirectional pointer in each entry links the members of a CG. This link-based CGHT permits fast searching of a CG from any member of the group, and thus allows prefetching a CG starting from any member. When the CGHT becomes full, either the LRU entries are replaced and removed from their existing CGs, or the conflicting new CG members are dropped to avoid potential thrashing.
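The following sketch summarizes the CG-forming path just described (Request FIFO, threshold, CG Buffer). It is a simplified illustration under our own assumptions: a single CG Buffer instead of several concurrent ones, no per-request matched flag, and a plain vector of groups standing in for the set-associative, link-based CGHT. The constants are the example values from the text, not fixed design parameters.

#include <cstdint>
#include <deque>
#include <vector>

constexpr std::size_t kFifoSize  = 16;  // adjacency window of the Request FIFO
constexpr std::size_t kThreshold = 3;   // same-distance requests needed to open a CG
constexpr std::size_t kMaxGroup  = 8;   // CG Buffer capacity (bounds prefetch depth)

struct Request { std::uint64_t block; std::uint32_t reuse_dist; };

std::deque<Request>                     request_fifo;   // recent off-chip requests
std::vector<std::uint64_t>              cg_buffer;      // CG currently being formed
std::uint32_t                           cg_distance = 0;
bool                                    cg_open = false;
std::vector<std::vector<std::uint64_t>> cght;           // stand-in for the real CGHT

static void SaveCurrentGroup() {
    if (cg_buffer.size() >= kThreshold) cght.push_back(cg_buffer);
    cg_buffer.clear();
    cg_open = false;
}

void OnOffChipRequest(std::uint64_t block, std::uint32_t reuse_dist) {
    if (request_fifo.size() == kFifoSize) request_fifo.pop_front();
    request_fifo.push_back({block, reuse_dist});

    // Grow the CG being formed when the new request matches its reuse distance.
    if (cg_open && reuse_dist == cg_distance) {
        cg_buffer.push_back(block);
        if (cg_buffer.size() == kMaxGroup) SaveCurrentGroup();   // buffer full
        return;
    }

    // Otherwise count FIFO entries with this distance; reaching the threshold
    // opens a new CG (displacing any partially formed one to the CGHT).
    std::size_t same = 0;
    for (const Request& r : request_fifo)
        if (r.reuse_dist == reuse_dist) ++same;
    if (same >= kThreshold) {
        SaveCurrentGroup();
        for (const Request& r : request_fifo)                    // seed with matches
            if (r.reuse_dist == reuse_dist) cg_buffer.push_back(r.block);
        cg_distance = reuse_dist;
        cg_open = true;
    }
}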


Figure 2-4. Diagram of the CG prefetcher.

Any existing CG can change dynamically over time. Updating CGs in the CGHT dynamically is difficult without precise information on when a member leaves or joins a group. Another option is to simply flush the CGHT periodically based on the number of executed instructions, memory references, or cache misses. However, a penalty is paid to re-establish the CGs after each flush. Note that a block can appear in more than one CG in the CGHT without updating. This is because a block reference can leave a CG and become a member of a new CG, while the CGHT may still keep the old CG. On multiple hits, either the most recent CG or all the matched CGs may be prefetched.

Table 2-1 illustrates the operation of CGs in Figure 2-4 step by step. We assume three accesses with the same reuse distance are required to start forming a CG. We also assume memory accesses A, B, and C have reuse distance d1 and have already been identified as a CG in the Request FIFO and moved to the CG Buffer. So the initial state is that the CG Buffer records A, B, and C, as well as the current reuse distance d1.


Table 2-1. Example operations of forming a CG
Access  Reuse distance  Event               Comment
D       d1              Put D in CG Buffer  d1 matches the reuse distance of the current CG
N       d3              None                d3 is not equal to d1
M       d3              None                Only two accesses with the same reuse distance, N and M
J       d2              None                d2 is not equal to d1
E       d1              Put E in CG Buffer  d1 matches the reuse distance of the current CG
I       d2              None                Only two accesses with the same reuse distance, J and I
F       d1              Put F in CG Buffer  d1 matches the reuse distance of the current CG

Table 2-1 simulates the situation in which accesses D, N, M, J, E, I, and F arrive one by one. D, E, and F have the same reuse distance d1 and are recorded in the CG Buffer step by step. Although N and M have the same reuse distance d3, they cannot start to form a new CG until another access with reuse distance d3 appears in the Request FIFO together with N and M. If that happens, the current CG is moved to the CGHT to make space for the newest CG. Once an L2 miss hits the CGHT, the entire CG can be identified and prefetched by following the circular links. In Figure 2-4, for instance, a miss to block F will trigger prefetches of A, B, C, D, and E in order.

2.3.2 Integrating the CG-prefetcher into CMP Memory Systems

There are several issues in integrating the CG-prefetcher into the memory controller. The first key issue is determining the block reuse distance without seeing all processor requests at the memory controller. A global miss sequence number is used. The memory controller assigns and saves a new sequence number for each missed memory block in the DRAM array. The reuse distance can then be approximated as the difference between the new and the old sequence numbers. For a 128-byte block with a 16-bit sequence number, a reuse distance of 64K blocks, or an 8MB working set, can be covered. The memory overhead is merely 1.5%. When the same-distance requirement is relaxed, one sequence number can cover a small number of adjacent requests, which will expand the working set coverage and/or reduce the space requirement.
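A minimal sketch of the sequence-number approximation is shown below. The hash map stands in for the per-block sequence numbers that the controller would keep in the DRAM array, and the class and method names are ours; only the core idea, approximating reuse distance by the difference of 16-bit miss sequence numbers, is taken from the text.

#include <cstdint>
#include <unordered_map>

class SequenceDistance {
public:
    // Call on every off-chip miss; returns the approximate reuse distance in
    // missed blocks, or ~0u for the first miss to this block.
    std::uint32_t OnMiss(std::uint64_t block_addr) {
        std::uint16_t now = next_seq_++;
        std::uint32_t dist = ~0u;
        auto it = last_seq_.find(block_addr);
        if (it != last_seq_.end())
            dist = static_cast<std::uint16_t>(now - it->second);  // wraps at 64K
        last_seq_[block_addr] = now;
        return dist;
    }
private:
    std::uint16_t next_seq_ = 0;
    std::unordered_map<std::uint64_t, std::uint16_t> last_seq_;
};
// Two bytes per 128-byte block cover a 64K-block (8MB) working set at roughly
// 1.5% storage overhead, matching the figures quoted above.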


Figure 2-5 shows the CG-prefetcher in the memory system. To avoid regular cache-miss requests from different cores disrupting one another when establishing the CGs [70], we construct a private CG-prefetcher for each core. Each CG-prefetcher has a Prefetch Queue (PQ) to buffer the prefetch requests (addresses) from the associated prefetcher. A shared Miss Queue (MQ) stores regular miss requests from all cores for accessing the DRAM channels. A shared Miss Return Queue (MRQ) and a shared Prefetch Return Queue (PRQ) buffer the data from the miss requests and the prefetch requests for accessing the memory bus.

We implement a private PQ per core to prevent prefetch requests of one core from blocking those of other cores. The PQs have lower priority than the MQ; among the PQs, arbitration is round-robin. Similarly, the PRQ has lower priority than the MRQ in arbitrating the system bus. Each CG-prefetcher maintains a separate sequence number for calculating the block reuse distance.

When a regular miss request arrives, all the PQs are searched. In case of a match, the request is removed from the PQ and inserted into the MQ, gaining a higher priority to access the DRAM. In this case, there is no performance benefit since the prefetch of the requested block has not been initiated. If a matched prefetch request is in the middle of fetching the block from the DRAM, or is ready in the PRQ waiting for the shared data bus, the request is redirected to the MRQ for a higher priority in arbitrating the data bus. A variable number of delay cycles can be saved depending on the stage of the prefetch request. The miss request is inserted into the MQ normally when no match is found.
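The queue priorities described above amount to a small arbitration policy; a sketch with names of our own choosing is shown below. It covers only the DRAM-side arbitration and the promotion of a matching, not-yet-issued prefetch to miss status, not the return-path (MRQ/PRQ) arbitration of the memory bus.

def next_dram_request(miss_queue, prefetch_queues, rr_index):
    """Pick the next request for a free DRAM channel.
    miss_queue      : pending regular misses (highest priority, FIFO)
    prefetch_queues : one pending-prefetch list per core (lower priority)
    rr_index        : round-robin pointer among the per-core PQs
    Returns (request_or_None, new_rr_index)."""
    if miss_queue:
        return miss_queue.pop(0), rr_index
    n = len(prefetch_queues)
    for i in range(n):
        core = (rr_index + i) % n
        if prefetch_queues[core]:
            return prefetch_queues[core].pop(0), (core + 1) % n
    return None, rr_index

def regular_miss_arrives(addr, miss_queue, prefetch_queues):
    """A regular miss matching a not-yet-issued prefetch cancels it and
    takes over with MQ priority; no latency is saved in this case."""
    for pq in prefetch_queues:
        if addr in pq:
            pq.remove(addr)
            break
    miss_queue.append(addr)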


Figure 2-5. Integration of the CG-prefetcher into the memory controller.

A miss request can trigger a sequence of prefetches if it hits the CGHT. The prefetch requests are inserted into the corresponding PQ. If the PQ or the PRQ is full, or if a prefetch request has already been initiated, the prefetch request is simply dropped. In order to filter out prefetched blocks already located in the processors' caches, a topologically equivalent directory of the lowest-level cache is maintained in the controller (not shown in Figure 2-5). The directory is updated based on misses, prefetches, and write-backs to keep it close to the cache directory. A prefetch is dropped in case of a match. Note that all other simulated prefetchers incorporate this directory too.

2.4 Evaluation Methodology

We simulate 2-core and 4-core CMPs with a 1MB, 8-way, shared L2 cache. Please note that the CG-prefetcher can also be applied to other L2 organizations. We add an independent processor-side stride prefetcher to each core. All timing delays of misses and prefetches are


carefully simulated. Due to the slower clock of the memory controller, the memory-side prefetchers initiate one prefetch every 10 processor cycles. The queue sizes for the MQ, each PQ, the PRQ and the MRQ are all 16 entries. We use all 2-application and 4-application workload mixes described in Chapter 1 for this work.

The performance results of the proposed CG-prefetcher are compared against a pair-wise miss-correlation prefetcher (MC-prefetcher), a prefetcher based on the last miss stream (LS-prefetcher), and a hot-stream prefetcher (HS-prefetcher). A processor-side stride prefetcher is included in all simulated memory-side prefetchers. Descriptions of these prefetchers are given next.

Processor-side Stride prefetcher (Stride-prefetcher): The stride-prefetcher identifies and prefetches sequential or strided memory access patterns for specific PCs (program counters) [46]. It has 4k PC entries, with each entry maintaining four previous references of that PC. Four successive prefetches are issued whenever four stride distances of a specific PC are matched.

Memory-side Miss-Correlation (MC) prefetcher: The MC-prefetcher records pair-wise miss correlations A->B in a history table, and prefetches B if A happens again [37]. Each core has an MC-prefetcher with a 128k-entry, 8-way set-associative history table. Each miss address (each entry) records 2 successive misses. Upon a miss, the MC-prefetcher prefetches two levels in depth, resulting in a total of up to 6 prefetches.

Memory-side Hot-Stream (HS) prefetcher: The HS-prefetcher records a linear miss stream in a history table, and dynamically identifies and prefetches repeated access sequences [23]. It is simulated based on a Global History Buffer [54], [78], with a 128k-entry FIFO and a 64k-entry, 16-way set-associative miss index table for each core. On every miss, the index and the FIFO are searched sequentially to find all recent streams that begin with the current miss. If the first 3


misses of any two streams match, the matched stream is prefetched. The length of each stream is 8.

Memory-side Last-Stream (LS) prefetcher: The LS-prefetcher is a special case of the HS-prefetcher [23], where the last miss stream is prefetched without any further qualification. It has the same implementation as the HS-prefetcher.

Memory-side Coterminous Group (CG) prefetcher: We use CG-2 to obtain both high accuracy and decent miss coverage. The CGHT has 16k entries per core, with 30 bits per entry, and is 16-way set-associative. We use a 16-entry Request FIFO and four 8-entry CG Buffers. A CG can be formed once three memory requests in the Request FIFO satisfy the reuse-distance requirement. Each CG contains up to 8 members. The CGHT is flushed periodically, every 2 million misses from the corresponding core.

Table 2-2 summarizes the extra space overhead of the various memory-side prefetchers. The space overhead of the processor-side stride prefetcher is minimal and thus is not included.

2.5 Performance Results

In Figure 2-6 (A) and Figure 2-6 (B), the combined IPCs of the Stride-, MC-, HS-, LS-, and CG-prefetchers, normalized to that of the baseline model without any prefetching, are presented for 4-core CMPs and 2-core CMPs respectively. Please note that in all the following figures we shorten Stride-prefetcher to Stride, MC-prefetcher to MC, HS-prefetcher to HS, LS-prefetcher to LS, and CG-prefetcher to CG. We include the normalized combined IPC of the baseline model without any prefetching for comparison purposes (with a total height of 1). Note that the absolute combined IPCs for the baseline model were given in Figure 2-1. Also, a separate processor-side stride prefetcher is always running with the MC-, HS-, LS-, and CG-prefetchers to prefetch blocks with regular access patterns.


Table 2-2. Space overhead of various memory-side prefetchers

Prefetcher      Memory controller (SRAM) per core        DRAM
CG-prefetcher   60KB (16K*30bit/8)                        3%
MC-prefetcher   2MB (128K*2*64bit/8)                      0
HS-prefetcher   920KB (128K*43bit/8 + 64K*29bit/8)        0
LS-prefetcher   920KB (128K*43bit/8 + 64K*29bit/8)        0

Each bar is broken into the contributions made by the individual workloads in the mix. The total height represents the overall normalized IPC, or IPC speedup. For example, a total height of 1.2 means a 20% improvement in IPC compared with the base IPC without any prefetching, given in Figure 2-1. Please note that a normalized IPC of less than one means the prefetcher degrades the overall performance.

Several observations can be made. First, most workload mixes show performance improvement for all five prefetching techniques. In general, the CG-prefetcher has the highest overall improvement, followed by the LS-, the HS-, and the MC-prefetchers. Two workload mixes, Art/Twolf and Mcf/Twolf, show a performance loss for most prefetchers. Our studies indicate that Twolf has irregular patterns and hardly benefits from any of the prefetching schemes. Although Art and Mcf perform well, the higher IPC of Twolf dominates the overall IPC speedup.

Second, the CG-prefetcher is the big winner for the MEM workloads, with a speedup of 40% on average, followed by the LS-prefetcher with 30%, the HS-prefetcher with 24%, and the MC-prefetcher with 18%. The MEM workloads exhibit heavier cache contention and more misses; therefore, the accurate CG-prefetcher benefits the most in this category.

Third, the CG-prefetcher generally performs better in the MIX and the CPU categories. However, the LS-prefetcher slightly outperforms the CG-prefetcher in a few cases. With lighter memory demands in these workload mixes, the LS-prefetcher can deliver more prefetched blocks with a smaller impact on cache pollution and memory traffic.


Figure 2-6. Normalized combined IPCs of various prefetchers (normalized to the baseline). A) 4-workload mixes running on 4-core CMPs. B) 2-workload mixes running on 2-core CMPs.

It is important to note that the normalized IPC speedup creates an unfair view when comparing mixed workloads on multi-cores. For example, in Art/Mcf/Vortex/Bzip2, the normalized IPCs of individual workloads are measured at 3.16 for Art, 1.41 for Mcf, 0.82 for


Vortex, and 1.42 for Bzip2 with the CG-prefetcher, and 2.39 for Art, 1.22 for Mcf, 0.86 for Vortex, and 1.49 for Bzip2 with the MC-prefetcher. Therefore, the average individual speedups of the four workloads, according to equation (2-1), are 1.70 for the CG-prefetcher and 1.49 for the MC-prefetcher, yet their normalized combined IPCs are only 1.20 and 1.22. Given that Vortex and Bzip2 have considerably higher IPCs than Art and Mcf, the overall IPC improvement is dominated by the last two workloads. The same is true for other workload mixes.

In Figure 2-7, the average speedups of two MEM and two MIX workload mixes are shown according to equation (2-1). Please recall that all the applications in a MEM workload mix are memory-intensive, while a MIX workload mix contains both memory-intensive and cpu-intensive applications.

\[ \text{Average Speedup} = \frac{1}{n} \sum_{i=1}^{n} \frac{IPC_i^{\text{with prefetch}}}{IPC_i^{\text{without prefetch}}} \qquad (2\text{-}1) \]

where n is the number of workloads in the mix.

Compared with the measured IPC speedups in Figure 2-6, significantly higher average speedups are achieved by all prefetchers. For Art/Twolf, the average IPC speedups are 48% for the MC-prefetcher, 44% for the HS-prefetcher, 52% for the LS-prefetcher and 51% for the CG-prefetcher, instead of the IPC degradations shown in Figure 2-6.

Figure 2-8 shows the accuracy and coverage of the different prefetchers. In this figure, each bar is broken down into 5 categories from bottom to top for each prefetcher: misses, partial hits, miss reductions, extra prefetches, and wasted prefetches. The categories are defined as follows.

The misses are those main memory accesses that have not been covered by the prefetchers.


Figure 2-7. Average speedups of 4 workload mixes.

The partial hits refer to memory accesses for which part of the off-chip access latency is saved by earlier but incomplete prefetches: the prefetches have been issued to the memory, but the blocks have not yet arrived in the L2 cache.

The miss reductions are accesses that have been fully covered by prefetches; they are successfully and completely converted from L2 misses to L2 hits.

The extra prefetches represent prefetched blocks that are replaced before any use. These useless prefetched blocks pollute the limited cache space and waste the limited bandwidth.

The wasted prefetches refer to prefetched blocks that are already present in the L2 cache when they arrive, due to mis-predictions of the memory-side shadow directory; they waste memory bandwidth.

The sum of the misses, partial hits, and miss reductions is equal to the number of misses of the baseline without prefetching, which is normalized to 1 in the figure. The sum of the extra prefetches and wasted prefetches, normalized to the baseline misses, is the extra memory traffic. According to the above definitions, the coverage of a prefetcher can be described by Equation (2-2):

\[ \text{Coverage} = \frac{\text{Partial hits} + \text{Miss reductions}}{\text{Misses} + \text{Partial hits} + \text{Miss reductions}} \qquad (2\text{-}2) \]


Figure 2-8. Prefetch accuracy and coverage of the simulated prefetchers.

The accuracy of a prefetcher can be calculated as Equation (2-3):

\[ \text{Accuracy} = \frac{\text{Partial hits} + \text{Miss reductions}}{\text{Partial hits} + \text{Miss reductions} + \text{Extra prefetches} + \text{Wasted prefetches}} \qquad (2\text{-}3) \]

Overall, all prefetchers show significant coverage, i.e., reduction of cache misses, ranging from a few percent to as high as 50%. The MC-, LS- and CG-prefetchers have better coverage than the HS-prefetcher, since the HS-prefetcher only identifies exactly repeated accesses. On the other hand, in contrast to the MC- and the LS-prefetchers, the HS- and the CG-prefetchers carefully qualify members of a group that show highly repeated patterns for prefetching. The MC- and the LS-prefetchers generate significantly higher memory traffic than the HS- and the CG-prefetchers. On average, the HS-prefetcher has the least extra traffic of about 4%, followed by 21% for the CG-prefetcher, 35% for the LS-prefetcher, and 52% for the MC-prefetcher. The excessive memory traffic of the LS- and the MC-prefetchers does not turn proportionally into a positive reduction of cache misses. In some cases, the impact is negative, mainly due to the cache pollution problem on CMPs. Between the two, the LS-prefetcher is more effective than the MC-prefetcher, indicating that prefetching multiple successor misses may not be a good idea. The HS-


prefetcher has the highest accuracy, but its low miss coverage limits its overall IPC improvement.

The reuse-distance constraint on forming a CG is also simulated, and the performance results of CG-0, CG-2 and CG-8 are plotted in Figure 2-9. With respect to the normalized IPCs, the results are mixed, as shown in Figure 2-9 (A). Figure 2-9 (B) further plots the coverage and accuracy of the different CGs. It is evident that the reuse-distance constraint trades accuracy against coverage: in general, a larger distance tolerance means higher coverage but lower accuracy. CG-2 appears better than CG-8 for workload mixes with higher L2 misses, such as Mcf/Ammp and Mcf/Mcf. We selected CG-2 to represent the CG-prefetcher due to its slightly better IPCs than CG-0 and considerably less traffic than CG-8. Note that we omit CG-4, which has a similar IPC speedup to CG-2 but generates more memory traffic.

The impact of group size is evaluated in Figure 2-10. Two workload mixes in the MEM category, Art/Mcf/Ammp/Twolf and Mcf/Ammp, and two in the MIX category, Art/Mcf/Vortex/Bzip2 and Art/Twolf, are chosen due to their high memory demand. The measured IPCs decrease slightly or remain unchanged for the two 4-workload mixes, while they increase slightly for the two 2-workload mixes. Due to cache contention, larger groups generate more useless prefetches. A group size of 8 strikes a balance between high IPCs and low overall memory traffic.

Figure 2-11 plots the average speedup of CG with respect to Stride-only for L2 cache sizes from 512KB to 4MB. As observed, the four workload mixes behave very differently with respect to different L2 sizes.


Figure 2-9. Effect of distance constraints on the CG-prefetcher. A) Normalized IPC. B) Accuracy and traffic.

For Art/Mcf/Vortex/Bzip2 and Art/Twolf, the average IPC speedups peak at 1MB and 2MB respectively, and then drop sharply because of a sharp reduction of cache misses with larger caches. However, for the memory-bound workload mixes, Art/Mcf/Ammp/Twolf and Mcf/Ammp, the average speedups for medium-size L2 caches are slightly less than those for smaller and larger L2 caches.


Figure 2-10. Effect of group size on the CG-prefetcher. A) Measured IPCs. B) Accuracy and traffic.

Figure 2-11. Effect of L2 size on the CG-prefetcher.

With smaller caches, the cache contention problem is so severe that a small percentage of successful prefetches can lead to significant IPC speedups. For medium-size caches, the impact of delaying normal misses due to conflicts with prefetches begins to offset the benefit of prefetching. When the L2 size continues to increase, the number of misses decreases, which diminishes the effect of access conflicts; as a result, the average speedup increases again.

Given the higher demand for accessing the DRAM with the prefetching methods, we perform a sensitivity study on the number of DRAM channels, as shown in Figure 2-12. The results indicate that the number of DRAM channels does impact the IPCs, and more so for the memory-bound


workload mixes. All four workload mixes perform poorly with 2 channels, but the improvements saturate at about 4 to 8 channels.

Figure 2-12. Effect of memory channels on the CG-prefetcher.

2.6 Summary

We have introduced an accurate CG-based data prefetching scheme for Chip Multiprocessors (CMPs). We have shown the existence of coterminous groups (CGs) and a third kind of locality, coterminous locality. In particular, the nearby references in a CG follow exactly the same order in which they appeared the last time, even though they may be irregular. The proposed prefetcher uses CG history to trigger prefetches when a member of a group is re-referenced. It overcomes challenges of existing correlation-based and stream-based prefetchers, including low prefetch accuracy, lack of timeliness, and large history requirements. The accurate CG-prefetcher is especially appealing for CMPs, where cache contention and memory access demands are escalated. Evaluations based on various SPEC CPU 2000 workload mixes have demonstrated significant advantages of the CG-prefetcher over other existing prefetching schemes on CMPs.


CHAPTER 3
PERFORMANCE PROJECTION OF ON-CHIP STORAGE OPTIMIZATION

Organizing on-chip storage space on CMPs has become an important research topic. Balancing between data accessibility, limited by wiring delay, and the effective on-chip storage capacity, reduced by data replication, has been studied extensively. These studies must examine a wide spectrum of the design space to obtain a comprehensive view. The simulation time is prohibitively long for these timing simulations and would increase drastically as the number of cores increases. A great challenge is then how to provide an efficient methodology for studying design choices of optimizing CMP on-chip storage accurately and completely as the number of cores increases.

In the second work, we first develop an analytical model to assess the general performance behavior with respect to data replication in CMP caches. The model injects replicas (replicated data blocks) into a generic cache. Based on the block reuse-distance histogram obtained from a real application, a precise mathematical formula is derived to evaluate the impact of the replicas. The results demonstrate that whether data replication helps or hurts L2 cache performance is a function of the total L2 size and the working set of the application.

To overcome the limitations of modeling, we further develop a single-pass stack simulation technique to handle shared and private cache organizations with an invalidation-based coherence protocol. The stack algorithm can handle complex interactions among multiple private caches. This single-pass stack technique can provide local/remote hit ratios and the effective cache size for a range of physical cache capacities. We also demonstrate that we can use the basic multiprocessor stack simulation results to estimate the performance of other interesting CMP cache organizations, such as shared caches with replication and private caches without replication.


We verify the results of the analytical data replication model and the single-pass global stack simulation against detailed execution-driven simulations. We show that the single-pass stack simulation produces small error margins of 2-9% for all simulated cache organizations. The total simulation times of the single-pass stack simulation and the individual execution-driven simulations are compared: for the limited set of four studied cache organizations, the stack simulation takes about 8% of the execution-driven simulation time.

3.1 Modeling Data Replication

We first develop an abstract model, independent of private/shared organizations, to evaluate the tradeoff between the access time and the miss rate of CMP caches with respect to data replication. The purpose is to provide a uniform understanding of this central issue of caching in CMPs, which is present in most major cache organizations. This study also highlights the importance of examining a wide enough range of system parameters in the performance evaluation of any cache organization, which can be costly.

In Figure 3-1, a generic histogram of block reuse distances is plotted, where the reuse distance is measured by the number of distinct blocks between two adjacent accesses to the same block. A distance of zero indicates a request to the same block as the previous request. The histogram is denoted by f(x), which represents the number of block references with reuse distance x. For a cache of size S, the total number of cache hits can be measured by \( \int_{0}^{S} f(x)\,dx \), which is the area under the histogram curve from 0 to S. This well-known stack-distance histogram can provide hits/misses for all cache sizes with a fully-associative organization and the LRU replacement policy.
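For reference, the reuse-distance (stack-distance) histogram and the hit counts it implies can be computed from a block-address trace with a simple LRU stack. The unoptimized sketch below uses our own function names and a linear search, which is fine for illustration but not for long traces.

from collections import Counter

def reuse_distance_histogram(trace):
    """trace: iterable of block addresses. Returns Counter {distance: count}.
    Distance = number of distinct blocks between two adjacent accesses to the
    same block; first-time (cold) accesses are not counted."""
    stack, hist = [], Counter()
    for block in trace:
        if block in stack:
            depth = stack.index(block)     # 0 means the previous request was the same block
            hist[depth] += 1
            stack.pop(depth)
        stack.insert(0, block)             # move (or insert) the block at the MRU position
    return hist

def hits_for_cache_size(hist, size_in_blocks):
    """Hits of a fully-associative LRU cache of the given size, per the text:
    all references with reuse distance smaller than the cache size."""
    return sum(count for dist, count in hist.items() if dist < size_in_blocks)

# Tiny example: with 2 blocks only the second access to A hits; with 3 blocks
# all three re-references hit.
h = reuse_distance_histogram(["A", "B", "A", "C", "B", "A"])
print(hits_for_cache_size(h, 2), hits_for_cache_size(h, 3))   # -> 1 3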


Figure 3-1. Cache performance impact when introducing replicas.

To model the performance impact of data replication, we inject replicas into the cache. Note that, regardless of the cache organization, replicas help to improve the local hit rate since replicas are created and moved close to the requesting cores. On the other hand, having replicas reduces the effective capacity of the cache and hence increases cache misses. We need to compare the effect of the increase in local hits against that of the increase in cache misses.

Suppose we take a snapshot of the L2 cache and find a total of R replicas. As a result, only S-R cache blocks are distinct, effectively reducing the capacity of the cache. Note that the model does not make reference to any specific cache organization and management. For instance, it does not say where the replicas are stored, which may depend on factors such as a shared or private organization. We compare this scenario with the baseline case where all S blocks are distinct.

First, the cache misses are increased by \( \int_{S-R}^{S} f(x)\,dx \), since the total number of hits is now \( \int_{0}^{S-R} f(x)\,dx \). On the other hand, the replicas help to improve the local hits. Among the \( \int_{0}^{S-R} f(x)\,dx \) hits, a fraction R/S target the replicas. Depending on the specific cache organization, not all accesses to the replicas result in local hits. A requesting core may find a


replica in the local cache of another remote core, resulting in a remote hit. We assume that a fraction L of the accesses to replicas are actually local hits. Therefore, compared with the baseline case, the total change in memory cycles due to the creation of R replicas can be calculated as:

\[ P_m \int_{S-R}^{S} f(x)\,dx \;-\; G_l \cdot \frac{R}{S} \cdot L \int_{0}^{S-R} f(x)\,dx \qquad (3\text{-}1) \]

where P_m is the penalty in cycles of a cache miss, and G_l is the cycle gain from a local hit. With the total number of memory accesses \( \int_{0}^{\infty} f(x)\,dx \), the average change in memory access cycles is equal to:

\[ \frac{P_m \int_{S-R}^{S} f(x)\,dx \;-\; G_l \cdot \frac{R}{S} \cdot L \int_{0}^{S-R} f(x)\,dx}{\int_{0}^{\infty} f(x)\,dx} \qquad (3\text{-}2) \]

Now the key is to obtain the reuse-distance histogram f(x). We conduct an experiment using an OLTP workload [57] and collect its reuse-distance histogram. With the curve-fitting tool of Matlab [52], we obtain the equation f(x) = A*exp(-Bx), where A = 6.084*10^6 and B = 2.658*10^-3. This is shown in Figure 3-2, where the cross marks represent the actual reuse frequencies from OLTP and the solid line is the fitted curve. We can now substitute f(x) into equation (3-2) to obtain the average change in memory cycles as:

\[ P_m \left( e^{-B(S-R)} - e^{-BS} \right) \;-\; G_l \cdot \frac{R}{S} \cdot L \left( 1 - e^{-B(S-R)} \right) \qquad (3\text{-}3) \]

Equation (3-3) gives the change in L2 access time as a function of the cache area occupied by the replicas. In Figure 3-3, we plot the change in memory access time for three cache sizes, 2, 4, and 8 MB, as we vary the replica occupancy from none to the entire cache. In this figure, we assume G_l = 15 and P_m = 400, and we vary L over 0.25, 0.5 and 0.75 for each cache size. Note that negative values mean a performance gain. We can observe that the benefit of allocating L2 space for replicas for the OLTP workload varies with the cache size.
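Equation (3-3) is simple enough to evaluate directly. The sketch below reproduces the kind of sweep plotted in Figure 3-3 using the fitted OLTP exponent and the stated P_m and G_l; the function names are ours, and S and R must be expressed in the same unit as x in the fitted histogram.

import math

B = 2.658e-3               # exponent of the fitted OLTP histogram f(x) = A*exp(-B*x)
P_M, G_L = 400, 15         # miss penalty and local-hit gain, in cycles (from the text)

def avg_access_time_change(S, R, L):
    """Equation (3-3): average change in memory access cycles when R of the S
    cache blocks are replicas; note that the constant A cancels out."""
    miss_term  = P_M * (math.exp(-B * (S - R)) - math.exp(-B * S))
    local_term = G_L * (R / S) * L * (1.0 - math.exp(-B * (S - R)))
    return miss_term - local_term

def best_replica_fraction(S, L, steps=64):
    """Sweep the replica fraction r = R/S (r < 1) and return the argmin of (3-3)."""
    return min((avg_access_time_change(S, (i / steps) * S, L), i / steps)
               for i in range(steps))[1]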


Figure 3-2. Curve fitting of the reuse-distance histogram for the OLTP workload.

Figure 3-3. Performance with replicas for different cache sizes, derived by the analytical model.

For instance, when L = 0.5, the results indicate that no replication provides the shortest average memory access time for a 2MB L2 cache, while for larger 4MB and 8MB L2 caches, allocating 40% and 68% of the cache for replicas gives the smallest access time. These results are consistent with the reuse histogram curve shown in Figure 3-2: the reuse count approaches zero when the reuse distance is equal to or greater than 2MB, and increases significantly when the reuse distance is shorter than 2MB. Therefore, it is not wise to allocate space for the replicas when the


cache size is 2MB or less. Increasing L favors data replication slightly. For instance, for a 4MB cache, allocating 34%, 40%, and 44% of the cache for replicas achieves the best improvement, of about 1, 3, and 5 cycles in the average memory access time, for L = 0.25, 0.5 and 0.75 respectively. The performance improvement with data replication would be more significant if G_l increased. The general behavior due to data replication is consistent with the detailed simulation results given in Section 3.4. Note that the fraction of replicas cannot reach 100% unless the entire cache is occupied by a single block; therefore, in Figure 3-3, the average memory-time increase is not meaningful when the fraction of replicas approaches the cache size.

We also run the same experiment for two other workloads, Apache and SPECjbb. Figure 3-4 plots the optimal fractions of replication for all three workloads with cache sizes from 2 to 8MB and L from 0.25 to 0.75. The same behavior can be observed for both Apache and SPECjbb: larger caches favor more replication. For example, with L = 0.5, allocating 13%, 50%, and 72% of the space for replicas performs best for Apache, and 28%, 59%, and 78% for SPECjbb. Also, increasing L favors more replication. With its smaller working set, SPECjbb benefits from replication the most among the three workloads.

It is essential to study a set of representative workloads with a spectrum of cache sizes to understand the tradeoff of accessibility vs. capacity in CMP caches. A fixed replication policy may not work well for a wide variety of workloads on different CMP caches. Although mathematical modeling can provide an understanding of the general performance trend, its inability to model sufficiently detailed interactions among multiple cores makes it less useful for accurate performance prediction. To remedy this problem, we describe a global-stack-based simulation for studying CMP caches next.


Figure 3-4. Optimal fraction of replication derived by the analytical model.

3.2 Organization of the Global Stack

Figure 3-5 sketches the organization of the global stack, which records the memory reference history from all the cores. In the CMP context, a block address and its core-id uniquely identify a reference, where the core-id indicates the core from which the request was issued. Several independent linked lists are established in the global stack for simulating one shared and several per-core private stacks. Each stack entry appears in exactly one of the private stacks, determined by the core-id, and may or may not reside in the shared stack depending on the recency of the reference. In addition, an address-based hash list is also established in the global stack for fast searches.

Since only a set of discrete cache sizes is of interest for cache studies, both the shared and the private stacks are organized in groups [43]. Each group consists of multiple entries, for fast search during the stack simulation and for easy calculation of cache hits for the various cache sizes of interest after the simulation.


Figure 3-5. Single-pass global stack organization.

For example, assume the cache sizes of interest are 16KB, 32KB, and 64KB. The groups can then be organized according to the stack sequence, starting from the MRU entry, with 256, 256, and 512 entries for the first three groups respectively, assuming a 64B block size. Based on the stack inclusion property, the hits for a particular cache size are equal to the sum of the hits to all the groups accumulated up to that cache size. Each group maintains a reuse counter, denoted by G1, G2, and G3. After the simulation, the cache hits for the three cache sizes can be computed as G1, G1+G2, and G1+G2+G3 respectively.

Separate shared and private group tables are maintained to record the reuse frequency count and other information for each group in the shared and private caches. A shared and a private group-id are kept in each global stack entry as pointers to the corresponding group information in the shared and the private group tables. The group bound in each entry of the group


table links to the last block of the respective group in the global stack. These group bounds provide fast links for adjusting entries between adjacent groups. The associated counters are accumulated on each memory request and are used to deduce cache hit/miss ratios for various cache sizes after the simulation. The following subsections describe the detailed stack operations.

3.2.1 Shared Caches

Each memory block can be recorded multiple times in the global stack, once per requesting core, according to the order of the requests. Intuitively, only the first appearance of a block in the global stack should be in the shared list, since there is no replication in a shared cache. A first-appearance block is the one that is most recently used in the global stack among all blocks with the same address. The shared stack is formed by linking all the first-appearance blocks from MRU to LRU.

Figure 3-6 illustrates an example of a memory request sequence and the operations on the shared stack. Each memory request is denoted as a block address, A, B, C, etc., followed by a core-id. The detailed stack operations when B1 is requested are as follows. Address B is searched for via the hash list of the shared stack, and B2 is found with the matching address. In this case, the reuse counter for the shared group where B2 resides, group 3, is incremented. B2 is removed from the shared list, and B1 is inserted at the top of the shared list. The shared group-id for B1 is set to 1. Meanwhile, the block located on the boundary of the first group, E1, is pushed to the second group. The boundary adjustment continues down to the group where B2 was previously located. If a requested block cannot be located through the hash list (i.e., the very first access to the address among all cores), the stack is updated as above without incrementing any reuse counters. After the simulation, the total number of cache hits for a shared cache that includes exactly the first m groups is the sum of all shared reuse counters from group 1 to group m.
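A compact way to summarize the shared-list bookkeeping is sketched below. It keeps only the first-appearance blocks and the per-group reuse counters; the group boundary is recomputed with a linear scan here for clarity, whereas the real design uses the group-bound pointers, the hash list, and the core-ids carried in each entry.

class SharedStack:
    """Illustrative single-pass stack for shared-cache hit counting.
    group_sizes: blocks per group, MRU first (e.g., 256 blocks = 16KB per group).
    After the simulation, hits for a cache of the first m groups = sum(reuse[:m])."""
    def __init__(self, group_sizes):
        self.group_sizes = group_sizes
        self.stack = []                       # first-appearance blocks, MRU at index 0
        self.reuse = [0] * len(group_sizes)   # per-group reuse counters

    def _group_of(self, depth):
        bound = 0
        for g, size in enumerate(self.group_sizes):
            bound += size
            if depth < bound:
                return g
        return None                           # deeper than the largest size of interest

    def access(self, block):
        if block in self.stack:
            g = self._group_of(self.stack.index(block))
            if g is not None:
                self.reuse[g] += 1            # counts as a hit for this and all larger caches
            self.stack.remove(block)          # only the first appearance stays linked
        self.stack.insert(0, block)           # the new request becomes the MRU entry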


Figure 3-6. Example operations of the global stack for shared caches.

3.2.2 Private Caches

The construction and update of the private lists are essentially the same as those of the shared list, except that we link accesses from the same core together. We collect crucial information such as the local hits, remote hits, and number of replicas, with the help of the local, remote, and replica counters in the private group table. For simplicity, we assume these counters are shared by all the cores, although per-core counters could provide more information. Figure 3-7 shows the contents of the four private lists and the private group table when we extend the previous memory sequence (Figure 3-6) with three additional requests, A2, C1, and A1.

Local/remote reuse counters. The local counter of a group is incremented when a request falls into the respective group of the local private stack. In this example, only the last request, A1, encounters a local hit, and in this case the local counter of the second group is incremented.


Figure 3-7. Example operations of the global stack for private caches.

After the simulation, the sum of all local counters from group 1 to group m represents the total number of local hits for private caches with exactly m groups.

Counting the remote hits is a little tricky, since a remote hit can only happen when a reference is a local miss. For example, assume that a request falls in the third group of the local stack, while the minimum group id over all the remote groups where this address appears is the second. When the private cache size is only large enough to contain the first group, neither a local nor a remote hit happens. If the cache contains exactly two groups, the request is a remote hit. Finally, if the cache is extended to the third group or larger, it is a local hit. Formally, if an address is present in local group L and the minimum remote group that contains the block is R, the access can be a remote hit only if the cache size is within the range from group R to L-1. We therefore increment the remote counters for groups R to L-1 (R <= L-1). Note that after the simulation,


the remote counter of group m gives the number of remote hits for a cache with exactly m groups. To differentiate them from the local counters, we call them accumulated remote counters.

In the example, the first highlighted request, B1, encounters a local miss but a remote hit to B2 in the first group, so we increment the remote counters of all the groups. The second request, A2, is also a local miss, but a remote hit to A1 in the second group; the remote counter of the first group remains unchanged, while the counters of all the remaining groups are incremented. As with B1, all the remote counters are incremented for C1. Finally, the last request, A1, is a local hit in the second group and also a remote hit to A2 in the first group. In this case, only the remote counter of the first group is incremented, since A1 is considered a local hit once the cache size extends beyond the first group.

Measuring replicas. The effective cache size is an important factor for shared and private cache comparisons [8], [24], [81], [20]. The single-pass stack simulation counts each block replication as a replica when calculating the effective cache size during the simulation. Similar to the remote hit case, we use accumulated replica counters. As shown in Figure 3-7, the first highlighted request, B1, creates a replica in the first group, as well as in any larger group, because of the presence of B2. The second highlighted request, A2, does not create a new replica in the first group, but it does create one in the second group because of A1. Meanwhile, A2 pushes B2 out of the first group, which removes a replica from the first group. This new replica applies to all the larger groups too. Note that the addition of B2 to the second group does not alter the replica counter for group 2, since that replica was already counted when B2 was first referenced. As with B1, the third highlighted request, C1, creates a replica in all the groups. Lastly, the reference A1 extends a replica of A into the first group because of A2; the counters for the remaining groups stay the same.
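The accumulated remote-counter rule (increment groups R through L-1) can be captured in a few lines; the helper below uses 1-based group numbers as in the text and ignores the replica and hole bookkeeping.

def update_remote_counters(remote_counters, L_local, R_remote):
    """remote_counters[m] = remote hits for a private cache of exactly m groups
    (index 0 unused).  L_local: group of the block in the requesting core's own
    stack (None if absent); R_remote: minimum group over all other cores' stacks
    that contain the block (None if absent everywhere)."""
    if R_remote is None:
        return                                   # no other core holds the block
    top = (L_local - 1) if L_local is not None else (len(remote_counters) - 1)
    for m in range(R_remote, top + 1):           # remote hit only for sizes R..L-1
        remote_counters[m] += 1

For example, with four groups, a request that is a local hit in group 2 and a remote hit in group 1 (like A1 above) increments only remote_counters[1], matching the behavior described in the text.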


Handling memory writes. In private caches, memory writes may cause invalidations of all the replicas. During the stack simulation, write invalidations create holes in the private stacks where the replicas are located. These holes are filled later, when an adjacent block is pushed down from a more-recently-used position by a new request. No block is pushed out of a group while a hole exists in the group. To maintain the reuse counters in the private group table accurately, each group records the total number of holes for each core. The hole count is initialized to the respective group size and is decremented whenever a valid block joins the group. Keeping a hole count per group avoids searching for existing holes.

3.3 Evaluation and Validation Methodology

We simulate an 8-core CMP system. The global stack runs behind the L1 caches and simulates every L1 miss, essentially replacing the role of the L2 caches. During simulations, stack distances and other related statistics are collected as described in the previous section. Each group contains 256 blocks (16KB), and we simulate 1024 groups (16MB maximum). The results of the single-pass stack simulation are used to derive the performance of shared or private caches with various cache sizes and sharing mechanisms, for understanding the accessibility vs. capacity tradeoff in CMP caches.

The results from the stack simulation are verified against execution-driven simulations, in which detailed cache models with proper access latencies are inserted. In the detailed execution-driven simulation, we assume the shared L2 has eight banks, one local and seven remote, determined by the least-significant three bits of the block address. The total shared cache sizes are 1, 2, 4, 8, and 16MB. For the private L2, we model both local and remote accesses. The MOESI coherence protocol is implemented to maintain data coherence among the private L2s. Accordingly, we simulate 128, 256, 512, 1024, and 2048KB private caches. For comparison, we use the hit/miss information and average memory access times to approximate the execution-time


behavior, because the single-pass stack simulation cannot provide IPCs. We use three multithreaded commercial workloads: OLTP, Apache, and SPECjbb.

The accuracy of the CMP memory performance projection can be assessed from two different angles: the accuracy of predicting individual performance metrics, and the accuracy of predicting general cache behavior. By verifying the results against the execution-driven simulation, we demonstrate that the stack simulation can accurately predict cache hits and misses for the targeted L2 cache organizations and, more importantly, that it can precisely project the sharing and replication behavior of the CMP caches.

One inherent weakness of stack simulation is its inability to insert accurate timing delays for variable L2 cache sizes. The fluctuation in memory delays may alter the sequence of memory accesses among multiple processors. We adopt a simple approach that inserts memory delays based on a single discrete cache size. In the stack simulation, we insert memory delays based on five cache sizes, 1MB, 2MB, 4MB, 8MB, and 16MB, denoted as stack-1, stack-2, stack-4, stack-8, and stack-16 respectively. An off-chip cache-miss latency is charged if the reuse distance is longer than the selected discrete cache size.

3.4 Evaluation and Validation Results

3.4.1 Hits/Misses for Shared and Private L2 Caches

Figure 3-8 shows the projected and real miss rates for shared caches, where "real" represents the results from the individual execution-driven simulations. In general, the stack results follow the execution-driven results closely. For OLTP, stack-2 shows only about 5-6% average error. For Apache and SPECjbb, the differences among the delay insertions are less apparent; the stack results predict the miss ratios with about 2-6% error, except for Apache with a small 1MB cache.


Figure 3-8. Verification of miss ratios from the global stack simulation for shared caches.

Two major factors affect the accuracy of the stack results. One is cache associativity. Since we use a fully-associative stack to simulate a 16-way cache, the stack simulation usually underestimates the real miss rates. This effect is more apparent when the cache size is small, due to more conflict misses. The issue can be solved by more complicated set-associative stack simulations [53], [32]; for simplicity, we keep the stack fully-associative. Further sensitivity studies are also needed to evaluate L2 caches with smaller set associativity.

The other factor is inaccurate delay insertion. For example, in the stack-1 simulation of OLTP, the cache-miss latency is inserted whenever the reuse distance is longer than 1MB. Such a cache-miss delay is inserted wrongly for caches larger than 1MB. These extra delays for larger caches cause more OS interference and context switches, which may lead to more cache misses. At a 4MB cache size, the overestimate of cache misses due to the extra delay insertion


exceeds the underestimate due to the full associativity, and the gap widens with larger caches. On the other hand, the stack-16 simulation for smaller caches mistakenly inserts the hit latency, instead of the miss latency, for accesses with reuse distances between the corresponding cache size and 16MB, causing less OS interference and thus fewer misses. In this case, both the full associativity and the delay insertion lead to an underestimate of the real misses, which makes the stack-16 simulation the most inaccurate.

For private caches, Figure 3-9 shows the overall misses, the remote hits, and the average effective sizes. Note that the horizontal axis shows the size of a single core's cache, from 128KB to 2MB. With eight cores, the total sizes of the private caches are comparable to the shared cache sizes in Figure 3-8.

We can make two important observations. First, compared with the shared cache, the simulation results show that the overall L2 miss ratios are increased by 14.7%, 9.9%, 4.3%, 1.1%, and 0.5% for OLTP over the five private cache sizes. For Apache and SPECjbb, the L2 miss ratios are increased by 11.8%, 4.4%, 1.1%, 1.0%, 2.2%, and 7.3%, 3.1%, 2.9%, 0.6%, 0.5%, respectively. Second, the estimated miss and remote hit rates from the stack simulation match the results from the execution-driven simulations closely, with less than a 10% margin of error.

We also simulate the effective capacity for the private-cache cases. The effective cache size is averaged over the entire simulation period. In general, the private caches reduce the cache capacity due to replicated and invalid cache entries; the effective capacity is reduced to 45-75% for the three workloads with the various cache sizes. The estimated capacity from the stack simulation is almost identical to the result from the execution-driven simulation. Due to its higher accuracy, we use the stack-2 simulation in the following discussion.


Figure 3-9. Verification of the miss ratio, remote hit ratio and average effective size from the global stack simulation for private caches.

3.4.2 Shared Caches with Replication

To balance accessibility and capacity, victim replication [81] creates a dynamic L1 victim cache for each core in the local slice of the L2, trading capacity for fast local access. In this section, we estimate the performance of a static victim-replication scheme. We allocate 0% to 50% of the L2 capacity as L1 victim caches, with L2 sizes varying from 2MB to 8MB. For


performance comparison, we use the average memory access time, which is calculated from the local hits to the victim caches, the hits to the shared portion of the L2, and the L2 misses.

The average memory access time of static victim replication can be derived directly from the results of the stack simulation described in the previous sections, assuming the inclusion property is enforced between the shared portion of the L2 and the victim portion plus the L1. Suppose the L1 and L2 sizes are denoted by C_L1 and C_L2, r is the fraction of the L2 allocated for the victim caches, and n is the number of cores. Then each victim cache is of size (r*C_L2)/n, and the remaining shared portion is of size (1-r)*C_L2. The average memory access time includes the following components. First, since the L1 and the victim cache are exclusive, the total hits to the victim cache can be estimated from the private stacks with a size of the L1 plus the victim: C_L1+(r*C_L2)/n. Note that this estimate may not be precise due to the lack of L1 hit information, which alters the sequence in the stack. Second, the total number of L2 hits (including the victim portion) and L2 misses can be calculated from the shared stack with size (1-r)*C_L2. Finally, the hits to the shared portion of the L2 can be calculated by subtracting the hits to the victim caches from the total L2 hits.

Figure 3-10 shows the average L2 access time with static victim replication. Generally, large caches favor more replication. For a small 2MB L2, the average L2 access times increase with more replication, except that Apache shows a slight performance gain at low replication levels. The optimal replication levels for OLTP are 12.5% and 37.5% for the 4MB and 8MB L2, respectively. This general performance behavior with respect to data replication is consistent with what we observed from the analytical model in Section 3.1. However, the analytical model, which does not account for cache invalidations, should apply a lower L for the optimal replication level.
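The average-access-time estimate just derived can be written directly in terms of the two stack results. In the sketch below, private_hits(size) and shared_hits(size) stand for hit counts read off the private and shared group counters, and the latency constants are placeholders rather than the simulated configuration.

def victim_replication_amat(r, c_l1, c_l2, n_cores, total_refs,
                            private_hits, shared_hits,
                            t_victim=15, t_shared=40, t_miss=400):
    """Average L2 access time when a fraction r of the L2 is used as per-core
    L1 victim caches (static victim replication)."""
    victim_hits = private_hits(c_l1 + r * c_l2 / n_cores)   # exclusive L1 + victim slice
    l2_hits_total = shared_hits((1 - r) * c_l2)             # shared portion of the L2
    shared_hits_only = max(l2_hits_total - victim_hits, 0)  # hits served by the shared part
    misses = total_refs - l2_hits_total
    return (victim_hits * t_victim
            + shared_hits_only * t_shared
            + misses * t_miss) / total_refs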


Figure 3-10. Verification of the average L2 access time with different levels of replication, derived from the global stack simulation for shared caches with replication.

For SPECjbb, 12.5% replication is best for both the 4MB and 8MB L2. The figure for Apache shows that performance keeps improving with replication as large as 50% for the 4MB L2, but only up to 37.5% for the 8MB L2. This seeming contradiction comes from the fact that the L2 misses start to drop drastically around an 8MB cache, as demonstrated in Figure 3-8. We can also observe that the optimal replication levels match perfectly between the stack simulations and the execution-driven simulations. With respect to the average L2 access time, the stack results are within 2%-8% error margins.

3.4.3 Private Caches without Replication

Private caches sacrifice capacity for fast access time, so it may be desirable to limit replication in the private caches. To understand the impact of a private L2 without replication,


we run a separate stack simulation in which the creation of a replica causes the invalidation of the original copy.

Figure 3-11 shows the L2 access delays of the private caches without replication, as a ratio to those of the private caches with full replication. As expected, with small 128KB and 256KB private caches per core, the average L2 access times without replication are about 5-17% lower than those with full replication for all three workloads. This is because the benefit of the increased capacity more than compensates for the loss of local accesses.

With large 1MB or 2MB caches per core, the average L2 access time of the private caches without replication is 12-30% worse than the full-replication counterpart, suggesting that increasing local accesses is beneficial when enough L2 capacity is available. The stack simulation results follow this trend perfectly, providing very accurate results with only a 2-5% margin of error.

3.4.4 Simulation Time Comparison

The full-system Virtutech Simics 2.2 simulator [50], simulating an 8-core CMP system with Linux 9.0 and the x86 ISA, runs on an Intel Xeon 3.2 GHz 2-way SMP. The simulation time of each stack or execution-driven simulation is measured on a dedicated system without other interference. A timer is inserted at the beginning and the end of each run to calculate the total execution time.

In the single-pass stack simulation, each stack is partitioned into 16KB groups, with a total of 1024 groups for the 16MB cache. The small 16KB groups are necessary in order to study shared caches with a variable percentage of replication area, as shown in Figure 3-10. The stack simulation time can be further reduced for cache organizations that only require a few large groups.


Figure 3-11. Verification of the average L2 access time ratio from the global stack simulation for private caches without replication.

Table 3-1 summarizes the simulation times for the stack and the execution-driven simulations used to obtain the results above. For each workload, two stack simulations are needed: one run produces the results for shared caches, private caches, and shared caches with replication, and the other run is for the private L2 without replication. The execution-driven simulations require a separate run for each cache size, resulting in five runs for each cache organization. In studying the shared cache with replication, five separate runs are needed for each cache size in order to simulate five different replication percentages. No separate stack simulation is required for the shared cache with replication. Similarly, no separate execution-driven simulation is needed for shared caches with 0% area for data replication; therefore we have 20 execution-driven runs for the shared cache with replication. The total number of simulation runs is also summarized in Table 3-1.


Table 3-1. Simulation time comparison of the global stack and execution-driven simulations (in minutes)

Measurement                               Workload   Stack         Execution-Driven
Shared / Private (Section 3.4.1)          OLTP       1 Run: 835    (5+5) Runs: 6252
                                          Apache     1 Run: 901    (5+5) Runs: 6319
                                          SPECjbb    1 Run: 582    (5+5) Runs: 4220
Shared with replication (Section 3.4.2)   OLTP       0 Runs: 0     20 Runs: 11976
                                          Apache     0 Runs: 0     20 Runs: 12211
                                          SPECjbb    0 Runs: 0     20 Runs: 8210
Private no replication (Section 3.4.3)    OLTP       1 Run: 872    5 Runs: 3257
                                          Apache     1 Run: 948    5 Runs: 3372
                                          SPECjbb    1 Run: 613    5 Runs: 2199
Total                                                4751          58016

The total stack simulation time is measured at about 4751 minutes, while the execution-driven simulations take 58016 minutes, a factor of more than 12. This gap can be much wider if more cache organizations and sizes are studied and simulated.

3.5 Summary

In this chapter, we developed an abstract model for understanding the general performance behavior of data replication in CMP caches. The model showed that data replication can degrade cache performance without a sufficiently large capacity. We then used the global stack simulation for a more detailed study of the issue of balancing accessibility and capacity for on-chip storage space on CMPs. With the stack simulation, we can explore a wide spectrum of the cache design space in a single simulation pass. We simulated shared caches, private caches, and private caches without replication with various cache sizes directly by global stack simulation, and we deduced the performance of shared caches with replication from the shared and private cache results. We verified the stack simulation results against execution-driven simulations using commercial multithreaded workloads. We showed that the single-pass stack simulation can characterize CMP cache performance with high accuracy (only about 2%-9% error margins) and significantly less simulation time (only 8%). Our results demonstrated that the


effectiveness of various techniques to optimize the CMP on-chip storage is closely related to the total L2 size.


CHAPTER 4
DIRECTORY LOOKASIDE TABLE: ENABLING SCALABLE, LOW-CONFLICT CMP CACHE COHERENCE DIRECTORY

A directory-based cache coherence mechanism is one of the most important choices for building scalable CMPs. The design of a sparse coherence directory for future CMPs with many cores presents new challenges. With a typical set-associative sparse directory, hot-set conflicts at the directory tend to worsen when many cores compete within each individual set, unless the set associativity is dramatically increased. To maintain precise cache information, a set conflict causes inadvertent cache invalidations. An important technical issue is therefore to avoid hot-set conflicts at the coherence directory while keeping the set associativity small, the directory space small, and the efficiency high.

We develop a set-associative directory augmented with a directory lookaside table (DLT) that allows directory entries to be displaced from their primary sets, resolving the hot-set conflicts. The proposed CMP coherence directory offers three unique contributions. First, while none of the existing cache coherence mechanisms is efficient enough when the number of cores becomes large, the proposed CMP coherence directory provides a low-cost design with a small directory size and low set associativity. Second, although the size of the coherence directory matches the total number of CMP cache blocks, the topological difference between the coherence directory and the cache modules creates conflicts in individual sets of the coherence directory and causes inadvertent invalidations. The DLT is introduced to reconcile the mismatch between these two CMP components. In addition, the unique design of the DLT has its own independent utility in that it can be applied to other set-associative cache organizations for alleviating hot-set conflicts. In particular, it has advantages over other multiple-hash-function-based schemes, such as the skewed associative cache [64], [14]. Third, unlike the memory-based coherence directory, where each memory block has a single directory entry along with presence bits indicating where the


block is located, the proposed CMP directory keeps a separate record for every copy of the same cached block along with the core ID. Multiple hits to a block can occur in a directory lookup, which returns multiple core IDs without expensive presence bits.

Performance evaluations have demonstrated the significant performance improvement of the DLT-enhanced directory over the traditional set-associative or skewed associative directories. Augmented with a DLT that allows up to one quarter of the cache blocks to be displaced from their primary sets in the set-associative directory, up to 10% improvement in execution time is achievable. More importantly, such an improvement is within 98% of what an ideal coherence directory can accomplish.

In the following sections of this chapter, we first show that a limited set-associative CMP coherence directory can have a large performance impact due to inadvertent cache invalidations. We then propose our enhancement of the directory, the directory lookaside table. This is followed by detailed performance evaluations.

4.1 Impact of a Limited CMP Coherence Directory

In this section, we demonstrate the severity of cache invalidation due to hot-set conflicts at the coherence directory when the directory has small set associativity. Each copy of a cached block occupies a directory entry that records the block address and the ID of the core where the block is located. A block must be removed from the cache when its corresponding entry is replaced from the CMP directory. Three multithreaded workloads, OLTP, Apache, and SPECjbb, and two multiprogrammed workloads, SPEC2000 and SPEC2006, were used for this study. These workloads ran on a Simics-based whole-system simulation environment. In these simulations, we assume a CMP with eight cores, and each core has a private 1MB, 8-way L2 cache. The simulated CMP directory with different set associativities can record a total of 8MB of cache blocks. Detailed descriptions of the simulation will be given in Section 4.3.


Figure 4-1 shows the average of the total valid cache blocks over a long simulation period using a CMP coherence directory with various set associativities. Given eight cores, each with an 8-way set-associative L2 cache, the 64-way directory (Set-full) can accommodate all cache blocks without causing any extra invalidation. The small percentage of invalid blocks for the 64-way directory comes from cache coherence invalidations due to data sharing, OS interference, thread migrations, etc., on multiple cores.

The severity of cache invalidations caused by set conflicts in the directory is very evident in the cases of smaller set associativities. In general, whenever the set associativity is reduced by half, the total valid cache blocks are reduced by 4-9% for all five workloads. Using OLTP as an example, the valid blocks are reduced from 93% to 87%, 82%, and 75% as the set associativity is reduced from 64 ways to 32, 16, and 8 ways, respectively. The gap between the 64-way and 8-way directories indicates that, on average, about 18% of the total cached blocks are invalidated due to insufficient associativity in the 8-way coherence directory. This severe decrease in valid blocks reduces the local cache hits and increases the overall CMP cache misses.

To further demonstrate the effect of hot-set conflicts in a directory with small set associativity, we also simulated an 8-way set-associative directory with twice the number of sets, capable of recording the states for a cache size of 16MB (denoted as 2x-8way). We can observe that a significant gap in the number of valid blocks still exists between the 2x-8way and the 64-way directories. For OLTP, the 32-way directory can even outperform the 2x-8way directory. To completely avoid extra cache invalidations, a 64-way directory is needed here. For a future CMP with 64 cores and 16-way private L2 caches, an expensive and power-hungry 1024-way directory would be needed to eliminate all extra cache invalidations. Such a fully associative directory is essentially the same as maintaining and searching all individual cache directories.


Figure 4-1. Valid cache blocks in CMP directories with various set associativities.

4.2 A New CMP Coherence Directory

To solve or reduce the hot-set conflicts, we propose to displace replaced directory entries to empty slots elsewhere in the directory. Figure 4-2 illustrates the basic organization of a CMP coherence directory enhanced with a Directory Lookaside Table (DLT), referred to collectively as the DLT-dir.

The directory part, called the main directory, is set-associative. Each entry in the main directory records a cached block with its address tag, a core ID, a MOESI coherence state, a valid bit, and a bit indicating whether the recorded block has been displaced from its primary set. The DLT is organized as a linear array in which each entry can establish a link to a displaced block in the main directory. Each DLT entry consists of a pointer, the index bits of the displaced block in the main directory, and a use bit indicating whether the DLT entry holds a valid pointer. In addition, a set-empty bit array is used to indicate whether the corresponding set in the main directory has a free entry for accommodating any block displaced away from its primary set. Note that, different from the memory-based directory, the DLT-dir serves only as a coherence directory without any associated data array.
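To make the organization concrete, the sketch below lays out these structures as C types. The field widths, array sizes, and names are illustrative assumptions for an 8-core CMP with a 128K-entry, 8-way main directory and a DLT sized at one quarter of the cache blocks; the dissertation does not give an exact bit-level layout.

```c
/* A minimal sketch of the DLT-dir structures described above.
 * All widths and sizes are illustrative assumptions, not taken from the text. */
#include <stdbool.h>
#include <stdint.h>

#define DIR_SETS 16384            /* assumed: 128K entries / 8 ways              */
#define DIR_WAYS 8
#define DLT_SIZE 32768            /* assumed: one quarter of the cache blocks    */

/* One main-directory entry: a separate record per cached copy of a block. */
typedef struct {
    uint32_t tag;                 /* address tag (index bits are not kept here)  */
    unsigned core_id   : 3;       /* which core's cache holds this copy          */
    unsigned state     : 3;       /* MOESI coherence state                       */
    unsigned valid     : 1;
    unsigned displaced : 1;       /* set if the block left its primary set       */
} DirEntry;

/* One DLT entry: links a displaced block to the set where it is parked. */
typedef struct {
    uint16_t dir_set;             /* pointer: main-directory set holding the block */
    uint16_t blk_index;           /* the block's own (primary-set) index bits       */
    unsigned used : 1;            /* whether this entry holds a valid pointer       */
} DltEntry;

/* The DLT-dir as a whole: coherence state only, no data array. */
typedef struct {
    DirEntry main_dir[DIR_SETS][DIR_WAYS];
    DltEntry dlt[DLT_SIZE];
    bool     set_empty[DIR_SETS]; /* true if the set has at least one free way      */
} DltDir;
```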


Figure 4-2. A CMP coherence directory with a multiple-hashing DLT.

Inspired by a similar idea in the skewed associative cache [64], [14], a set of hash functions is used to index the DLT; the purpose is to reduce conflicts at the DLT. When a block is to be displaced in the main directory due to a conflict, the hash functions are applied to the block address to obtain multiple locations in the DLT. Some of these locations may already be used to point to other displaced blocks in the main directory. If there exist unused DLT locations among the hashed ones, and if there also exists a free slot in the main directory, then the displaced block is moved to the free directory slot and an unused DLT location is selected to point to the displaced block. Since this work does not aim at inventing new hash functions, we borrow the skewed function family reported in [14]. Let σ be the one-position circular shift on n index bits. A cache block at memory address A = A_3·2^(2n+c) + A_2·2^(n+c) + A_1·2^c + A_0 can be mapped to the following m locations: f_0(A) = A_1 ⊕ A_2, f_1(A) = σ(A_1) ⊕ A_2, f_2(A) = σ^2(A_1) ⊕ A_2, ..., and f_(m-1)(A) = σ^(m-1)(A_1) ⊕ A_2, where m ≤ n.
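The following sketch shows one way to realize this hash family in C. The 6-bit block offset (64-byte blocks) and the 32-bit field width are assumptions; the shift-and-XOR form follows the f_i definition above.

```c
/* A minimal sketch of the borrowed skewed hash family f_i(A) = sigma^i(A_1) XOR A_2,
 * where A_1 and A_2 are consecutive n-bit fields of the block address above the
 * block-offset bits.  Assumes 64-byte blocks (c = 6) and n < 32. */
#include <stdint.h>

#define C_OFFSET_BITS 6                       /* assumed 64-byte cache blocks */

/* one-position circular shift (sigma) on the low n bits */
static uint32_t sigma(uint32_t x, unsigned n) {
    return ((x << 1) | (x >> (n - 1))) & ((1u << n) - 1);
}

/* i-th candidate DLT index for a block address, for i = 0 .. m-1 */
uint32_t dlt_hash(uint64_t block_addr, unsigned i, unsigned n) {
    uint32_t a1 = (uint32_t)(block_addr >> C_OFFSET_BITS) & ((1u << n) - 1);
    uint32_t a2 = (uint32_t)(block_addr >> (C_OFFSET_BITS + n)) & ((1u << n) - 1);
    while (i--)
        a1 = sigma(a1, n);                    /* apply sigma i times */
    return a1 ^ a2;
}
```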


In the illustrated example of Figure 4-2, a block address is hashed to three locations, a, b, and c, in the DLT based on the skewed functions [14]. As indicated by the use bits, locations a and c contain valid pointers to displaced blocks in the main directory, while location b is unused.

The index bits of the main directory are attached in the DLT for the displaced blocks for two purposes. First, such a scheme saves main directory space by not including any index bits in the address tag at each directory entry. Note that these index bits are needed only for the displaced blocks. Instead of allocating space at every directory entry for storing the index bits, no such space is allocated at all in the main directory, and the index bits of the displaced blocks are stored in the DLT. Second, the index bits in the DLT can be used to filter out unnecessary accesses to the main directory. An access is granted only when the index bits match those of the requested address. In the example of Figure 4-2, assume that both a's and c's index bits match those of the requested address; then access to the main directory is initiated. The address tags of the two main directory entries pointed to by locations a and c are compared against the address tag of the request. When a tag match occurs, a displaced block is found. Note that although the DLT-dir requires additional directory accesses beyond the primary sets, our evaluation shows that a majority (about 97-99%) of these secondary accesses can be filtered out using the index bits stored in the DLT.

In the attempt to relocate a block evicted from its primary set to another directory entry, suppose a free DLT slot, say b, is found for the block. A free slot in the main directory must also be identified. The set-empty bit array is maintained for this purpose. The corresponding set-empty bit is set whenever a free directory slot appears. A quick scan of the set-empty bit array returns a set with at least one free slot. The displaced block is then stored in the free slot in the main directory, and that location is recorded in entry b in the DLT.
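The displacement attempt just described can be summarized in code. This sketch builds on the DltDir structures and the dlt_hash() helper from the two sketches above; the eight hash functions and the linear scans of the set-empty array are simplifying assumptions (a hardware implementation would scan the bit array in parallel).

```c
/* A minimal sketch of the displacement attempt: find an unused DLT slot via the
 * multiple hash functions, then a free main-directory slot via the set-empty
 * bit array.  Reuses DltDir, DirEntry, DltEntry, C_OFFSET_BITS, and dlt_hash()
 * from the preceding sketches; M_HASHES = 8 is an assumed configuration. */
#define M_HASHES       8
#define DLT_INDEX_BITS 15                      /* log2(DLT_SIZE), assumed        */

static int find_free_way(DltDir *d, unsigned set) {
    for (int w = 0; w < DIR_WAYS; w++)
        if (!d->main_dir[set][w].valid) return w;
    return -1;
}

static int find_empty_set(DltDir *d) {         /* scan of the set-empty array    */
    for (unsigned s = 0; s < DIR_SETS; s++)
        if (d->set_empty[s]) return (int)s;
    return -1;
}

/* Try to retain an entry evicted from its primary set.  Returns false when the
 * block must instead be invalidated in the owning cache module. */
bool displace(DltDir *d, uint64_t block_addr, DirEntry victim) {
    for (unsigned i = 0; i < M_HASHES; i++) {
        uint32_t h = dlt_hash(block_addr, i, DLT_INDEX_BITS);
        if (d->dlt[h].used) continue;          /* candidate DLT slot taken       */
        int set = find_empty_set(d);
        if (set < 0) break;                    /* no free main-directory slot    */
        int way = find_free_way(d, (unsigned)set);
        victim.displaced = 1;
        d->main_dir[set][way] = victim;        /* park the evicted entry         */
        d->dlt[h].used      = 1;
        d->dlt[h].dir_set   = (uint16_t)set;   /* pointer to the parking set     */
        d->dlt[h].blk_index = (uint16_t)((block_addr >> C_OFFSET_BITS) & (DIR_SETS - 1));
        if (find_free_way(d, (unsigned)set) < 0)
            d->set_empty[set] = false;         /* that set is now full again     */
        return true;
    }
    return false;
}
```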


When a block is removed from a cache module, either due to eviction or invalidation, the block must also be removed from the DLT-dir. If the block is recorded in its primary set, all that is needed is to turn the valid bit off. If the block is displaced, it will be found through a DLT lookup. Both the main directory entry for the block and the corresponding DLT entry are freed.

A directory entry that holds a displaced block can also be replaced by a newly referenced block. Given that the index is unavailable in the main directory for the displaced block, a backward pointer would normally be needed from the directory entry that holds the displaced block to the corresponding DLT entry, in order to free the DLT entry. However, it is expensive to add a backward pointer in the main directory. Alternatively, the DLT can be searched to determine the entry that points to the location of the displaced block in the main directory. If each DLT entry is allowed to point to any directory location, then a fully associative search of the DLT is required to locate a given location in the main directory. To reduce the cost of searching the DLT, one can impose a restriction on the DLT-to-directory mapping so that, for a given directory location, only a small subset of the DLT entries can potentially point to it and need to be examined. For instance, consider a DLT whose total number of entries is one quarter of that of the directory entries, which will be shown to be sufficiently large. In the most restrictive DLT-to-directory mapping, each DLT entry is allowed to point to one of only four fixed locations in the main directory. Although the cost of the DLT search is then minimal, such a restrictive mapping could lead to severe hot-set-like conflicts during the displacement of a block, because the block can only be displaced to a small number of potential locations in the directory: the block is first mapped to several DLT entries (by multiple hash functions), each of which in turn can point to one of a small number of directory entries. The result is a reduced chance of finding a free directory entry for the displaced block. In a less restrictive design, any


set-associative mapping can be instituted such that each DLT entry is limited to a certain collection of sets in the main directory. The set-associative mapping allows a fast search of the DLT for a directory location with minimal reduction in the chance of finding free slots in the main directory. We will evaluate the performance of this design in Section 4.4.

In comparison with other multiple-hash-function-based directories or caches, e.g., the skewed associative directory, the multiple-hashing DLT has unique advantages. Since the DLT is used only to keep track of the free slots and displaced blocks in the main directory, its size, counted in number of entries, is considerably smaller than the total number of entries in the main directory. Suppose the directory has a total of 1000 entries; then the DLT may have 250 entries. Suppose the directory contains 100 displaced blocks and 100 free slots at some point. Then, the directory has 10% free entries, but the DLT has (250-100)/250 = 60% free entries. When the same hash function family is used in both the skewed associative directory and the DLT, the chance of finding a free entry in the DLT is much higher (with eight uniform random hash functions, 1 - 0.4^8 ≈ 0.9993 for the DLT versus 1 - 0.9^8 ≈ 0.5695 for the skewed associative directory). Once a free DLT entry is found, finding a free entry in the directory is ensured by searching the set-empty bit array (assuming the DLT-to-directory mapping is unrestricted). We will demonstrate the performance advantage of this unique property.

The detailed operations of the DLT-dir are summarized as follows.

When a requested block is absent from the local cache, a search of the DLT-dir is carried out to locate the requested block in other cache modules. When the block is found in its primary set of the main directory and/or in other sets through the DLT lookup, proper coherence actions are performed to fetch and/or invalidate the block from other caches. The block with the requesting core ID is inserted into the MRU position in the primary set of the main directory.


This newly inserted block may lead to the following sequence of actions. First, the block is always inserted into a free entry in the primary set if one exists. Otherwise, it replaces a displaced block residing in the primary set; this design is intended to limit the total number of displaced blocks. In this case, the previously established pointer in the DLT to the displaced block must be freed. If no displaced block exists in the primary set, the LRU block is replaced. In either case, the replaced block will undergo a displacement attempt through the DLT. If no free space is found in either the main directory or the DLT, the replaced block is evicted from the directory and invalidated in the cache module.

To displace a block, an unused entry in the DLT must be selected through the multiple hash functions. In addition, the set-empty bit array is checked to select a free slot in the main directory to which the selected DLT entry can be mapped. Each corresponding bit in the set-empty array is updated every time the respective set is searched.

A miss in the CMP caches is encountered if the block cannot be found in the DLT-dir. The corresponding block must be fetched from a lower level of the memory hierarchy. The update of the DLT-dir for the newly fetched block is the same as when the block is found in other CMP cache modules.

When a write request hits a block in the shared state in the local cache, an upgrade request is sent to the DLT-dir. The requested block with the core ID must exist either in the primary set or in other sets linked through the DLT. The requested block can exist in more than one entry in the main directory since the shared block may also be in other cores' caches. In response to the upgrade request, all other copies of the block must be invalidated in the respective caches, and their corresponding directory entries must be freed. Replacement of any block in a cache module


must be accompanied by a notification to the directory so that the corresponding directory and DLT entries can be freed.

4.3 Evaluation Methodology

We use Simics to evaluate an 8-core, out-of-order, x86 chip multiprocessor. We develop detailed cycle-by-cycle, event-driven cache, directory, and interconnection models. Each core has its own L1 instruction and data caches and an inclusive private L2 cache. Every core has its own command-address-response bus and data bus connecting it with all the directory banks. The MOESI coherence protocol is applied to maintain cache coherence through the directory. Each core has two outgoing request queues to the directory, a miss-request queue and a replacement-notification queue, and an outgoing response queue for sending responses to the directory. It also has an incoming request queue for handling requests from the directory and an incoming response queue. Each bank of the directory maintains five corresponding queues for each core to buffer the requests and responses from and to that core. The simulator keeps track of the states of each bus and directory bank, as well as all the queues. The timing delays of directory access, bus transmission, and queue conflicts are carefully modeled. We assume the DLT access can be fully overlapped with the main directory access. However, the overall access latency may vary based on the number of displaced blocks that need to be checked. Besides the main directory access latency of 6 cycles, we assume that 3 additional cycles are consumed for each access to a displaced block after the access is issued from the DLT. Table 4-1 summarizes the directory-related simulation parameters beyond the general parameters described in Chapter 1.

For this study, we use three multithreaded commercial workloads, OLTP (online transaction processing), Apache (static web server), and SPECjbb (Java server), and two multiprogrammed workloads with applications from SPEC2000 and SPEC2006.
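The per-core bookkeeping at each directory bank can be pictured as below. Only the 8-entry queue depth comes from Table 4-1; the type names and payload are illustrative assumptions about the simulator, which is not described at the code level in the text.

```c
/* A minimal sketch of the five per-core queues kept at each directory bank,
 * mirroring the queues described above.  Names and the payload type are
 * illustrative assumptions; only the 8-entry depth is taken from Table 4-1. */
#include <stdint.h>

#define QUEUE_DEPTH 8
#define NUM_CORES   8

typedef struct { uint64_t entry[QUEUE_DEPTH]; int head, tail, count; } Queue;

typedef struct {
    Queue miss_request;              /* core -> directory: L2 miss requests        */
    Queue replacement_notification;  /* core -> directory: evicted/replaced blocks */
    Queue core_response;             /* core -> directory: coherence responses     */
    Queue dir_request;               /* directory -> core: forwarded requests      */
    Queue dir_response;              /* directory -> core: data/acknowledgments    */
} CoreQueues;

typedef struct {
    CoreQueues per_core[NUM_CORES];  /* one set of five queues per core, per bank  */
} DirectoryBankModel;
```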


Table 4-1. Directory-related simulation parameters

Parameter           Description
CMP                 8-core, 1MB private L2 cache per core
Main directory      1/2/4/8 banks, 128K entries, 8-way
Queue size          8-entry request/response queues to/from each core
DLT table           1/2/4/8 banks, 8K/16K/32K entries
DLT mapping         Each DLT entry maps to 8/16/32/64/128 directory sets
Directory latency   6 cycles for the primary set, plus 3 additional cycles per displaced block
Remote latency      52 cycles without contention, 4 hops
Cmd/Data bus        8B, bidirectional, 32GB/s, 6-cycle propagation latency

Table 4-2. Space requirement (in bits) for the seven directory organizations

Directory       8 Cores     Overhead    64 Cores     Overhead
Set-8w          4587520     1.00        36700160     1.00
Set-8w-64v      4671552     1.02        37372416     1.02
Skew-8w         6160384     1.34        49283072     1.34
Set-10w-1/4     5734400     1.25        45875200     1.25
Set-8w-p        5242880     1.14        97517568     2.66
DLT-8w-1/4      5439488     1.18        43515904     1.18
Set-full        4980736     1.09        39845888     1.09

We evaluated seven directory organizations: the 8-way set-associative directory (Set-8w), the 8-way set-associative directory with a 64-block fully associative victim buffer (Set-8w-64v), the 8-way skewed associative directory (Skew-8w), the 10-way set-associative directory with 25% additional directory size (Set-10w-1/4), the 8-way set-associative directory with presence bits (Set-8w-p), the 8-way set-associative directory with a DLT of 25% of the total cache entries using 8 hash functions (DLT-8w-1/4), and the full 64-way set-associative directory (Set-full). The Set-full represents the ideal case where no directory conflict will occur. The Set-10w-1/4 is included because it adds one quarter of extra directory space, which matches the DLT entries in the DLT-8w-1/4; it also possesses an extra advantage because the set associativity is increased from 8 to 10 ways. In our simulations, the DLT-dir is partitioned into four banks based on the two low-order bits of the block address to allow for sufficient directory bandwidth. A multiple-banked directory also displays interesting effects on DLT-dir conflicts; a detailed evaluation will be given in Section 4.4. All results are the average over all four banks. The total number of bits and the normalized space requirement relative to that of the Set-8w for the seven directories are shown in Table 4-2 for an 8-core CMP and a 64-core CMP. The skewed associative directory (Skew-8w) requires the index bits as a part of the address tag and hence needs the largest space. The space requirement for the directory with presence bits (Set-8w-p) grows much faster than the others as the number of cores increases (e.g., from 1.14 to 2.66 relative to Set-8w).
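As a rough consistency check, the Set-8w and Set-8w-p rows of Table 4-2 are compatible with a 35-bit entry whose log2(N)-bit core ID field is replaced by N presence bits in the presence-bit design. The 35-bit per-entry figure is inferred from the table, not a breakdown stated in the text:

```latex
% Inferred per-entry arithmetic (an assumption consistent with Table 4-2):
\begin{align*}
\text{Set-8w, 8 cores:}    &\quad 131072  \times 35 = 4587520 \\
\text{Set-8w-p, 8 cores:}  &\quad 131072  \times (35 - 3 + 8)  = 131072  \times 40 = 5242880 \\
\text{Set-8w, 64 cores:}   &\quad 1048576 \times 35 = 36700160 \\
\text{Set-8w-p, 64 cores:} &\quad 1048576 \times (35 - 6 + 64) = 1048576 \times 93 = 97517568
\end{align*}
```

This makes the scaling problem concrete: the presence-bit field grows linearly with the core count, while the core-ID field grows only logarithmically.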


4.4 Performance Results

In this section, we show the performance evaluation results of the seven CMP coherence directory organizations. The cache hit/miss ratios, the average valid blocks, the execution time improvement, and the sensitivity to the DLT design parameters are presented.

4.4.1 Valid Block and Hit/Miss Comparison

Figure 4-3 shows the percentage of valid blocks for the seven directories, averaged over the entire simulation period. We can make a few important observations.

First, the proposed DLT-8w-1/4 is far superior to all other directory organizations except the ideal Set-full. The DLT-8w-1/4 can retain almost all cached blocks using a DLT whose total number of entries is equal to one quarter of the total number of cache blocks. Among the five workloads, only OLTP shows noticeable invalidations under DLT-8w-1/4. This is because, with intensive data and instruction sharing, OLTP experiences more set conflicts due to replication of shared blocks. Our simulation results reveal that over 40% of the cached blocks are replicas in OLTP.

Second, set-associative directories other than the full 64-way one perform poorly. For instance, in the Set-8w directory, about 18%, 14%, 13%, 21%, and 20% of the cached blocks are unnecessarily invalidated for the respective five workloads due to hot-set conflicts in the directory. The Set-10w-1/4 directory improves the number of valid blocks at the cost of 25% additional directory entries and higher associativity, but its deficit is still substantial.


Figure 4-3. Valid cache blocks for simulated cache coherence directories.

Third, very little advantage is shown when the Set-8w is furnished with a 64-block fully associative victim buffer. Apparently, the buffer is too small to hold a sufficient number of conflicting blocks.

Fourth, the skewed associative directory (Skew-8w) does alleviate set conflicts, but it is still far from being able to retain all the valid blocks. Since multiple hashing is applied to the entire directory, the chance of finding a free slot in the main directory is diluted by the non-displaced blocks located in the primary sets. In contrast, the DLT contains only the displaced blocks; hence, the chance of finding a free slot in the DLT is much higher (see the sample calculation near the end of Section 4.2 for details).

Lastly, the Set-8w-p works well with multithreaded workloads but performs poorly with multiprogrammed workloads. By combining (duplicated) blocks with the same address into one entry in the directory, the presence-bit-based implementation saves directory entries and, hence, alleviates set conflicts for multithreaded workloads. However, keeping the presence bits becomes increasingly space-inefficient as the number of cores increases. In addition, for


multiprogrammed workloads, there is little data sharing, so the advantages of having the expensive presence bits no longer exist.

The total valid blocks determine the overall cache hits and misses. In Figure 4-4, the extra L1 hits, L2 local hits, L2 remote hits, and L2 misses of the seven directory schemes, normalized with respect to the total L2 references of the Set-8w, are displayed for each workload. Note that an inadvertent invalidation due to set conflicts at the directory also invalidates the block's copies in the L1 caches to maintain the L1/L2 inclusion property. Therefore, more L1 misses are encountered for the directory schemes that cause more invalidations. Compared with the Set-8w, there are extra L1 hits and fewer L2 references for all other directory schemes. Among the workloads, SPECjbb under the DLT-8w-1/4 sees about an 8% increase in L1 hits.

The DLT-8w-1/4 shows clear advantages over the other directory schemes. Besides additional L1 hits and fewer L2 misses, the biggest gain from using the DLT-dir comes from the increase in local L2 hits. Recall that, to avoid the expensive presence bits, each copy of a block occupies a separate directory entry with a unique core ID. Consequently, an inadvertent invalidation of a cached block caused by insufficient directory space may not turn a local L2 hit to the block into an extra L2 cache miss. Instead, it is likely that a local L2 hit to the block becomes a remote L2 hit, since not all copies of the block are invalidated. This is the key reason for the larger numbers of remote L2 hits in the directory schemes that produce more invalidations for the three multithreaded workloads. The difference among the directory schemes is not as significant for SPEC2000 and SPEC2006 since there is little data sharing among the multiprogrammed workloads.


Figure 4-4. Cache hit/miss and invalidation for simulated cache coherence directories.

Figure 4-5 plots the distribution of four types of requests to the directory: instruction fetch (IFetch), data read (DRead), data write hit to a shared-state block (Upgrade), and data write miss (DWrite). Within each request type, the results are further broken down into four categories based on the directory search results: hit only in the main directory, hit only through the DLT, hit in both the main directory and the DLT, and miss in both the main directory and the DLT (which becomes a CMP cache miss). A few interesting observations can be made.

First, very few requests find the blocks only through the DLT, for all request types in all five workloads. Given that a displaced block used to be at the LRU position of its primary set, the chance of its reuse is not high unless other copies of the same block are also in the primary set, in which case the request is likely targeting the blocks in the primary set. Therefore, a hit through the DLT is usually accompanied by one or more hits in the main directory. This is especially true for IFetch in the three multithreaded workloads, which have more sharing of read-only instructions among multiple cores.


Figure 4-5. Distribution of directory hits to the main directory and the DLT.

A good percentage of instruction blocks are displaced from their primary sets due to heavy conflicts. A small percentage of DRead requests also find the requested blocks in both places. Since neither IFetch nor DRead requires locating all copies of the block, it is not harmful to look up the main directory and the DLT sequentially to save power.

Second, it is observed that very few displaced blocks are encountered by the Upgrade or DWrite requests. Detailed analysis of the results indicates that widely shared blocks do exist, but they are mostly read-only blocks. This is demonstrated by the fact that the average number of sharers for DRead is about 6 when DRead hits both the main directory and the DLT. But for Upgrade and DWrite, the number of sharers is less than 3, a small enough number that these copies can be kept in the primary set most of the time. This explains why such blocks are not typically found through the DLT. Nevertheless, parallel searching of the directory and the DLT is still desirable because the required acknowledgement can be sent back to the requesting core faster.


Third, for the two multiprogrammed workloads, there are almost no IFetch requests to the DLT-dir, revealing the small footprint of the instructions. Since there is no data sharing, both DRead and DWrite are always misses. Moreover, there are no Upgrade requests because the blocks are in the E state.

4.4.2 DLT Sensitivity Studies

Two key DLT design parameters are its size and its number of hash functions. In Figure 4-6, we vary the DLT size among 1/16, 1/8, and 1/4 of the total number of cache blocks. We also evaluate the difference between using 8 or 12 hash functions for accessing the DLT. As observed earlier from Figure 4-3, the average number of valid blocks drops by 13-21% when the Set-8w is used instead of the Set-full. To reduce the number of invalidated blocks, the DLT must be capable of capturing at least these percentages of blocks and allowing them to be displaced from their primary sets. Therefore, we observe significant improvement in the average number of valid blocks as the DLT size increases from 1/16 to 1/8 and to 1/4 of the total number of cache blocks for all five workloads.

The impact of the number of hash functions, on the other hand, is not as obvious. Eight hash functions are generally sufficient for finding an unused entry in the DLT when the DLT has sufficient size. We observe a small improvement using 12 instead of 8 hash functions. For example, the total number of valid blocks increases from 90.3% to 91.7% for OLTP with 12 hash functions.

The index bits, which are needed for the displaced blocks, are recorded in the DLT to save space in the main directory. These index bits can also be used to filter the accesses to the main directory for the displaced blocks. An access is necessary only when the index bits match those of the requested block.


Figure 4-6. Sensitivity study on DLT size and number of hash functions.

Figure 4-7 shows the advantage of filtering for the five workloads with the DLT-8w-1/4 directory. Besides the 12 index bits, we also experiment with using 1, 2, or 3 additional bits for filtering purposes. Figure 4-7 (A) shows the false-positive rate, and Figure 4-7 (B) illustrates the total traffic that can be filtered out. A false positive is defined as an index-bit match but a failure to find the block in the main directory. We observe that, by recording one additional bit beyond the necessary index bits, the false positives are almost completely eliminated for all three multithreaded workloads. For SPEC2000 and SPEC2006, however, 3 additional bits are necessary to reduce the false-positive rate to a negligible level. Furthermore, by recording one additional bit beyond the index bits, the total additional traffic to the main directory is reduced to only about 0.2-2.8% of the total request traffic arriving at the DLT. With such high percentages of filtering, a majority of CMP cache misses or upgrades can be identified earlier without searching the main directory again for possible displaced blocks.
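A sketch of the filter check itself is shown below. Treating the extra filtering bits as the low-order tag bits adjacent to the index field is an assumption; the text does not specify which bits are used.

```c
/* A minimal sketch of the index-bit filter: a DLT candidate triggers a
 * main-directory access only when its stored filter bits match those of the
 * requested address.  64-byte blocks (6 offset bits) are assumed. */
#include <stdbool.h>
#include <stdint.h>

bool dlt_filter_match(uint32_t stored_bits, uint64_t req_addr, unsigned nbits) {
    uint32_t req_bits = (uint32_t)(req_addr >> 6) & ((1u << nbits) - 1);
    return stored_bits == req_bits;   /* mismatch: skip the secondary directory access */
}
```

With nbits = 13 (the 12 index bits plus one extra bit), this check is what removes nearly all of the false positives for the multithreaded workloads in Figure 4-7.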


Figure 4-7. Effects of filtering directory searches by extra index bits.

The DLT is partitioned into multiple banks to enable simultaneous DLT accesses from multiple cores. Meanwhile, to perform a fast inverse search of the DLT given a main directory entry, the DLT is organized such that each DLT entry can only point to one set among a fixed sub-collection of all the sets in the main directory. Figure 4-8 shows the sensitivity studies with respect to these two parameters for OLTP and SPEC2000. The X-axis represents the number of fixed sets in the main directory that each DLT entry can point to; the Y-axis is the normalized invalidation with respect to the total invalidations in a reference configuration where the directory has only one bank and the DLT uses an 8-set mapping. Several interesting observations can be made.

First, banking also helps reduce inadvertent invalidations. The main reason is that, with smaller banks, the same DLT-to-directory mapping covers a larger percentage of the main directory. For example, for an 8-way directory corresponding to an 8MB cache with a 64-byte block size, there are 16K sets for a 1-bank directory, but only 4K sets per bank for a 4-bank directory. Therefore, if each DLT entry can be mapped to 32 sets, it covers 1/512 of the directory entries in the 1-bank case. However, with the same mapping, a DLT entry covers 1/128 of the directory entries in the 4-bank directory.
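The coverage fractions quoted above follow directly from the stated configuration; the short check below uses only the figures given in the text.

```latex
% Coverage of the main directory by one DLT entry under a 32-set mapping.
\begin{align*}
\text{sets (1 bank)} &= \frac{8\,\text{MB}}{64\,\text{B} \times 8\ \text{ways}} = 16384,
  & \tfrac{32}{16384} &= \tfrac{1}{512}, \\
\text{sets per bank (4 banks)} &= \tfrac{16384}{4} = 4096,
  & \tfrac{32}{4096} &= \tfrac{1}{128}.
\end{align*}
```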


The more the coverage, the higher the chance of finding a free slot in the main directory for a displacement. The curves corresponding to the same directory coverage percentage by each DLT entry are also drawn with broken lines in the figure.

Figure 4-8. Normalized invalidation with a banked DLT and restricted mapping from the DLT to the directory.

Second, banking creates another level of conflicts because the distribution of the cached blocks across the banks may not be even. The physical banking prevents the DLT-based block displacement from crossing a bank boundary. This effect is very evident in OLTP. With the same coverage percentage, the 1-bank directory performs noticeably better than the 2-bank directory, which in turn is better than the 4-bank and the 8-bank directories. The bank conflict is not as obvious for SPEC2000 due to its mix of applications. A simple index randomization technique can be applied to alleviate the bank conflicts, but further discussion is omitted due to space limitations.

Third, the constrained DLT-to-directory mapping does affect the number of invalidations substantially, but the reduction of invalidations starts diminishing when the mapping freedom is doubled from 1/64 to 1/32. In the earlier simulations, we used a 4-bank directory with a 32-set mapping, where each DLT entry can be mapped to 1/128 of the banked directory space, in order to


achieve a balance between minimizing the number of invalidations and the cost of searching the DLT entries.

4.4.3 Execution Time Improvement

The execution times for the seven directory schemes are compared in Figure 4-9. The figure shows the normalized execution times with respect to the Set-8w directory for each workload. For the DLT-8w-1/4 directory, the execution time improvement is about 10%, 5%, 9%, 8%, and 8% over the Set-8w directory for the five workloads, respectively. More importantly, the improvement of the DLT-8w-1/4 is about 98% of what the full 64-way directory (Set-full) can achieve for all five workloads. In the case of the more expensive skewed associative directory, the time saved is only about 20-35% of what the DLT-8w-1/4 can save for the three multithreaded workloads. Finally, the Set-8w-p reduces the execution time of the multithreaded workloads but does little for the multiprogrammed ones.

4.5 Summary

In this chapter, we describe an efficient cache coherence mechanism for future CMPs with many cores and many cache modules. We argue in favor of a directory-based approach because the snooping-bus-based approach lacks scalability due to its broadcasting nature. However, the design of a low-cost coherence directory with a small size and small set associativity for future CMPs must handle the hot-set conflicts at the directory that lead to unnecessary block invalidations at the cache modules. In a typical set-associative directory, the hot-set conflict tends to become worse because many cores compete in each individual set with an uneven distribution. The central issue is to reconcile the topological difference between the set-associative coherence directory and the CMP caches.


Figure 4-9. Normalized execution time for simulated cache coherence directories.

The proposed DLT-dir accomplishes just that, with a small space requirement and high efficiency. The hot-set conflict is alleviated by allowing blocks to be displaced from their primary sets. A new DLT is introduced to keep track of the displaced blocks by maintaining a pointer to each displaced block in the main directory. The DLT is accessed by applying multiple hash functions to the requested block address to reduce the DLT's own conflicts. Performance evaluation has confirmed the advantage of the DLT-dir over conventional set-associative directories and the skewed associative directory for conflict avoidance. In particular, the DLT-dir with a DLT size equal to one quarter of the total number of directory blocks achieves up to 10% faster execution time in comparison with a traditional 8-way set-associative directory.


CHAPTER 5
DISSERTATION SUMMARY

Chip multiprocessors (CMPs) are becoming ubiquitous in all computing domains. As the number of cores increases, tremendous pressure will be exerted on the memory hierarchy to supply instructions and data in a timely fashion to sustain increasing chip-level IPCs. In this dissertation, we have developed three techniques for bridging the ever-increasing CPU-memory performance gap.

In the first part of this dissertation, an accurate and low-overhead data prefetcher for CMPs, based on the unique observations of coterminous groups (CGs) and coterminous locality, has been developed. A coterminous group is a group of off-chip memory accesses with the same reuse distance. Coterminous locality means that when a member of a coterminous group is accessed, the other members are likely to be accessed in the near future. In particular, nearby references in a CG follow exactly the same order in which these references appeared last time, even though they may be irregular. The proposed prefetcher uses CG history to trigger prefetches when a member of a group is re-referenced. It overcomes the challenges of existing correlation-based or stream-based prefetchers, including low prefetch accuracy, lack of timeliness, and large history. The accurate CG-prefetcher is especially appealing for CMPs, where cache contention and memory access demands escalate. Evaluations based on various workload mixes have demonstrated significant advantages of the CG-prefetcher over other existing prefetching schemes on CMPs, with about 10% IPC improvement and much less extra traffic.

As many techniques have been proposed for optimizing on-chip storage space in CMPs, the second part of the dissertation proposes an analytical model and a global stack simulation to quickly project the performance tradeoff between capacity and access latency in CMP


caches. The proposed analytical model can quickly estimate the general performance behavior of data replication in CMP caches. The model showed that data replication can degrade cache performance without a sufficiently large capacity. The global stack simulation has been proposed for a more detailed study of the issue of balancing accessibility and capacity for on-chip storage space on CMPs. With the stack simulation, a wide spectrum of the cache design space can be explored in a single simulation pass. We have simulated the schemes of shared/private caches and shared caches with replication for various cache sizes. We have also verified the stack simulation results with execution-driven simulations using commercial multithreaded workloads and showed that the single-pass stack simulation can characterize the CMP cache performance with high accuracy; only a 2%-9% error margin is observed. Our results have proved that the effectiveness of various techniques to optimize the CMP on-chip storage is closely related to the total L2 size. More importantly, our global stack simulation consumes only 8% of the simulation time of execution-driven simulations.

In the third part of the dissertation, we have described an efficient cache coherence mechanism for future CMPs with many cores and many cache modules. We favor a directory-based approach because the snooping-bus-based approach lacks scalability due to its broadcasting nature. However, the design of a low-cost coherence directory with a small size and small set associativity for future CMPs must handle the hot-set conflicts at the directory that lead to unnecessary block invalidations at the cache modules. In a typical set-associative directory, the hot-set conflict tends to become worse because many cores compete in each individual set with an uneven distribution. The central issue is to reconcile the topological difference between the set-associative coherence directory and the CMP caches. The proposed DLT-dir accomplishes just that, with a small space requirement and high efficiency. The hot-set conflict is alleviated by


allowing blocks to be displaced from their primary sets. A new DLT is introduced to keep track of the displaced blocks by maintaining a pointer to each displaced block in the main directory. The DLT is accessed by applying multiple hash functions to the requested block address to reduce the DLT's own conflicts. Performance evaluation has confirmed the advantage of the DLT-dir over conventional set-associative directories and the skewed associative directory for conflict avoidance. In particular, the DLT-dir with a DLT size equal to one quarter of the total number of directory blocks achieves up to 10% faster execution time in comparison with a traditional 8-way set-associative directory.


LIST OF REFERENCES

[1] M. E. Acacio, A Two-Level Directory Architecture for Highly Scalable cc-NUMA Multiprocessors, IEEE Trans. on Parallel and Distributed Systems, Vol. 16(1), pp. 67-79, Jan. 2005.
[2] Advanced Micro Devices, AMD Demonstrates Dual Core Leadership, http://www.amd.com, 2004.
[3] AMD Quad-Core, http://multicore.amd.com/us-en/quadcore/
[4] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, An Evaluation of Directory Schemes for Cache Coherence, Proc. 15th Int'l Symp. on Computer Architecture, pp. 280-289, May 1988.
[5] A. Agarwal, M. Horowitz, and J. Hennessy, An Analytical Cache Model, ACM Trans. on Computer Systems, Vol. 7, No. 2, pp. 184-215, May 1989.
[6] A. Agarwal and S. D. Pudar, Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches, Proc. 20th Int'l Symp. on Computer Architecture, pp. 179-190, May 1993.
[7] L. Barroso et al., Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing, Proc. 27th Int'l Symp. on Computer Architecture, pp. 165-175, June 2000.
[8] B. Beckmann and D. Wood, Managing Wire Delay in Large Chip-Multiprocessor Caches, Proc. 37th Int'l Symp. on Microarchitecture, Dec. 2004.
[9] B. M. Beckmann, M. R. Marty, and D. A. Wood, ASR: Adaptive Selective Replication for CMP Caches, Proc. 39th Int'l Symp. on Microarchitecture, Dec. 2006.
[10] B. T. Bennett and V. J. Kruskal, LRU Stack Processing, IBM Journal of Research and Development, pp. 353-357, July 1975.
[11] E. Berg and E. Hagersten, StatCache: A Probabilistic Approach to Efficient and Accurate Data Locality Analysis, Proc. 2004 Int'l Symp. on Performance Analysis of Systems and Software, March 2004.
[12] E. Berg, H. Zeffer, and E. Hagersten, A Statistical Multiprocessor Cache Model, Proc. 2006 Int'l Symp. on Performance Analysis of Systems and Software, March 2006.
[13] B. Black et al., Die Stacking (3D) Microarchitecture, Proc. 39th Int'l Symp. on Microarchitecture, pp. 469-479, Dec. 2006.
[14] F. Bodin and A. Seznec, Skewed Associativity Improves Performance and Enhances Predictability, IEEE Trans. on Computers, 46(5), pp. 530-544, May 1997.


[15] S. Borkar, Microarchitecture and Design Challenges for Gigascale Integration, Proc. 37th Int'l Symp. on Microarchitecture, 1st Keynote, pp. 3-3, Dec. 2004.
[16] P. Bose, Chip-Level Microarchitecture Trends, IEEE Micro, Vol. 24(2), pp. 5-5, Mar.-Apr. 2004.
[17] L. M. Censier and P. Feautrier, A New Solution to Coherence Problems in Multicache Systems, IEEE Trans. on Computers, C-27(12), pp. 1112-1118, Dec. 1978.
[18] D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal, Directory-Based Cache Coherence in Large-Scale Multiprocessors, IEEE Computer, 23(6), pp. 49-58, June 1990.
[19] D. Chandra, F. Guo, S. Kim, and Y. Solihin, Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture, Proc. 11th Int'l Symp. on High Performance Computer Architecture, pp. 340-351, Feb. 2005.
[20] J. Chang and G. S. Sohi, Cooperative Caching for Chip Multiprocessors, Proc. 33rd Int'l Symp. on Computer Architecture, June 2006.
[21] B. Chazelle, J. Kilian, et al., The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables, Proc. 15th Annual ACM-SIAM Symp. on Discrete Algorithms, Jan. 2004.
[22] T. Chen and J. Baer, Reducing Memory Latency via Non-blocking and Prefetching Caches, Proc. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 51-61, Oct. 1992.
[23] T. M. Chilimbi and M. Hirzel, Dynamic Hot Data Stream Prefetching for General-Purpose Programs, Proc. SIGPLAN Conf. on Programming Language Design and Implementation, pp. 199-209, June 2002.
[24] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, Optimizing Replication, Communication, and Capacity Allocation in CMPs, Proc. 32nd Int'l Symp. on Computer Architecture, June 2005.
[25] D. E. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers Inc., 1999.
[26] H. Dybdahl and P. Stenstrom, An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors, Proc. 13th Int'l Symp. on High Performance Computer Architecture, Feb. 2007.
[27] G. E. Suh, S. Devadas, and L. Rudolph, Analytical Cache Models with Applications to Cache Partitioning, Proc. 15th Int'l Conf. on Supercomputing, pp. 1-12, June 2001.
[28] B. Fraguela, R. Doallo, and E. Zapata, Automatic Analytical Modeling for the Estimation of Cache Misses, Proc. 1999 Int'l Conf. on Parallel Architectures and Compilation Techniques, Sep. 1999.


[29] J. D. Gilbert, S. H. Hunt, D. Gunadi, and G. Srinivasa, Niagara2: A Highly-Threaded Server-on-a-Chip, Proc. 18th HotChips Symp., Aug. 2006.
[30] A. Gupta, W. Weber, and T. Mowry, Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes, Proc. Int'l Conf. on Parallel Processing, pp. 312-321, Aug. 1990.
[31] J. Hasan, S. Cadambi, et al., Chisel: A Storage-Efficient, Collision-Free Hash-Based Network Processing Architecture, Proc. 33rd Int'l Symp. on Computer Architecture, pp. 203-215, June 2006.
[32] M. Hill and J. Smith, Evaluating Associativity in CPU Caches, IEEE Trans. on Computers, pp. 1612-1630, Dec. 1989.
[33] Z. Hu, M. Martonosi, and S. Kaxiras, TCP: Tag Correlating Prefetchers, Proc. 9th Int'l Symp. on High Performance Computer Architecture, pp. 317-326, Feb. 2003.
[34] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler, A NUCA Substrate for Flexible CMP Cache Sharing, Proc. 19th Int'l Conf. on Supercomputing, June 2005.
[35] Intel Core Duo Processor: The Next Leap in Microprocessor Architecture, Technology@Intel Magazine, Feb. 2006.
[36] Intel Core 2 Quad Processors, http://www.intel.com/products/processor/core2quad/index.htm
[37] D. Joseph and D. Grunwald, Prefetching Using Markov Predictors, Proc. 24th Int'l Symp. on Computer Architecture, pp. 252-263, June 1997.
[38] N. P. Jouppi, Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, Proc. 17th Int'l Symp. on Computer Architecture, pp. 364-373, May 1990.
[39] R. Kalla, B. Sinharoy, and J. Tendler, IBM POWER5 Chip: A Dual-Core Multithreaded Processor, IEEE Micro, Vol. 24(2), Mar.-Apr. 2004.
[40] S. Kapil, UltraSPARC Gemini: Dual CPU Processor, Proc. 15th HotChips Symp., Aug. 2003.
[41] C. Kim, D. Burger, and S. Keckler, An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches, Proc. 10th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 2002.
[42] S. Kim, D. Chandra, and Y. Solihin, Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture, Proc. 2004 Int'l Conf. on Parallel Architectures and Compilation Techniques, Sep. 2004.


[43] Y. H. Kim, M. D. Hill, and D. A. Wood, Implementing Stack Simulation for Highly-Associative Memories, Proc. 1991 SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, pp. 212-213, May 1991.
[44] M. Kistler, M. Perrone, and F. Petrini, Cell Multiprocessor Communication Network: Built for Speed, IEEE Micro, Vol. 26(3), pp. 10-23, May-June 2006.
[45] P. Kongetira, K. Aingaran, and K. Olukotun, Niagara: A 32-way Multithreaded SPARC Processor, Proc. 16th HotChips Symp., Aug. 2004.
[46] S. Iacobovici, L. Spracklen, S. Kadambi, Y. Chou, and S. G. Abraham, Effective Stream-Based and Execution-Based Data Prefetching, Proc. 18th Int'l Conf. on Supercomputing, pp. 1-11, June 2004.
[47] A. Lai, C. Fide, and B. Falsafi, Dead-Block Prediction & Dead-Block Correlating Prefetchers, Proc. 28th Int'l Symp. on Computer Architecture, pp. 144-154, July 2001.
[48] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and M. Kandemir, Design and Management of 3D Chip Multiprocessors Using Network-in-Memory, Proc. 33rd Int'l Symp. on Computer Architecture, June 2006.
[49] C. Liu, A. Sivasubramaniam, and M. Kandemir, Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs, Proc. 10th Int'l Symp. on High Performance Computer Architecture, pp. 176-185, Feb. 2004.
[50] P. S. Magnusson et al., Simics: A Full System Simulation Platform, IEEE Computer, Feb. 2002.
[51] M. R. Marty and M. D. Hill, Virtual Hierarchies to Support Server Consolidation, Proc. 34th Int'l Symp. on Computer Architecture, June 2007.
[52] Matlab, http://www.mathworks.com/products/matlab/.
[53] R. Mattson, J. Gecsei, D. Slutz, and I. Traiger, Evaluation Techniques for Storage Hierarchies, IBM Systems Journal, 9, pp. 78-117, 1970.
[54] K. Nesbit and J. Smith, Data Cache Prefetching Using a Global History Buffer, Proc. 10th Int'l Symp. on High Performance Computer Architecture, pp. 96-105, Feb. 2004.
[55] B. O'Krafka and A. Newton, An Empirical Evaluation of Two Memory-Efficient Directory Methods, Proc. 17th Int'l Symp. on Computer Architecture, pp. 138-147, May 1990.
[56] K. Olukotun et al., The Case for a Single-Chip Multiprocessor, Proc. 7th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1996.
[57] Open Source Development Labs Database Test 2, http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-2/.


[58] J. K. Peir, Y. Lee, and W. W. Hsu, Capturing Dynamic Memory Reference Behavior with Adaptive Cache Topology, Proc. 8th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 240-250, Oct. 1998.
[59] M. K. Qureshi, D. Thompson, and Y. N. Patt, The V-Way Cache: Demand-Based Associativity via Global Replacement, Proc. 32nd Int'l Symp. on Computer Architecture, June 2005.
[60] M. K. Qureshi and Y. N. Patt, Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches, Proc. 39th Int'l Symp. on Microarchitecture, Dec. 2006.
[61] N. Rafique, W. Lim, and M. Thottethodi, Architectural Support for Operating System-Driven CMP Cache Management, Proc. 2006 Int'l Conf. on Parallel Architectures and Compilation Techniques, Sep. 2006.
[62] S. Sair and M. Charney, Memory Behavior of the SPEC2000 Benchmark Suite, Technical Report, IBM Corp., Oct. 2000.
[63] A. Saulsbury, F. Dahlgren, and P. Stenstrom, Recency-Based TLB Preloading, Proc. 27th Int'l Symp. on Computer Architecture, pp. 117-127, May 2000.
[64] A. Seznec, A Case for Two-Way Skewed-Associative Caches, Proc. 20th Int'l Symp. on Computer Architecture, pp. 169-178, May 1993.
[65] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, Automatically Characterizing Large Scale Program Behavior, Proc. 10th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 45, Oct. 2002.
[66] G. Sohi, Single-Chip Multiprocessors: The Next Wave of Computer Architecture Innovation, Proc. 37th Int'l Symp. on Microarchitecture, 2nd Keynote, pp. 143-143, Dec. 2004.
[67] Y. Solihin, J. Lee, and J. Torrellas, Using a User-Level Memory Thread for Correlation Prefetching, Proc. 29th Int'l Symp. on Computer Architecture, pp. 171-182, May 2002.
[68] M. Spjuth, M. Karlsson, and E. Hagersten, Skewed Caches from a Low-Power Perspective, Proc. 2nd Conf. on Computing Frontiers, May 2005.
[69] L. Spracklen and S. Abraham, Chip Multithreading: Opportunities and Challenges, Proc. 11th Int'l Symp. on High Performance Computer Architecture, pp. 248-252, Feb. 2005.
[70] L. Spracklen and Y. Chou, Effective Instruction Pre-fetching in Chip Multiprocessors for Modern Commercial Applications, Proc. 11th Int'l Symp. on High Performance Computer Architecture, pp. 225-236, Feb. 2005.


[71] K. Strauss, X. Shen, and J. Torrellas, Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors, Proc. 33rd Int'l Symp. on Computer Architecture, June 2006.
[72] R. A. Sugumar and S. G. Abraham, Set-Associative Cache Simulation Using Generalized Binomial Trees, ACM Trans. on Computer Systems, Vol. 13, No. 1, pp. 32-56, Feb. 1995.
[73] C. K. Tang, Cache Design in the Tightly Coupled Multiprocessor System, AFIPS Conference Proceedings, National Computer Conference, pp. 749-753, June 1976.
[74] S. P. Vanderwiel and D. J. Lilja, Data Prefetch Mechanisms, ACM Computing Surveys, pp. 174-199, June 2000.
[75] S. Vangal et al., An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS, IEEE Int'l Solid-State Circuits Conference, Feb. 2007.
[76] X. Vera and J. Xue, Let's Study Whole-Program Cache Behavior Analytically, Proc. 8th Int'l Symp. on High Performance Computer Architecture, Feb. 2002.
[77] Z. Wang, D. Burger, et al., Guided Region Prefetching: A Cooperative Hardware/Software Approach, Proc. 30th Int'l Symp. on Computer Architecture, pp. 388-398, June 2003.
[78] T. Wenisch, S. Somogyi, et al., Temporal Streaming of Shared Memory, Proc. 32nd Int'l Symp. on Computer Architecture, pp. 222-233, June 2005.
[79] C. E. Wu, Y. Hsu, and Y. Liu, Efficient Stack Simulation for Shared Memory Set-Associative Multiprocessor Caches, Proc. 1993 Int'l Conf. on Parallel Processing, Aug. 1993.
[80] Y. Wu and R. Muntz, Stack Evaluation of Arbitrary Set-Associative Multiprocessor Caches, IEEE Trans. on Parallel and Distributed Systems, pp. 930-942, Sep. 1995.
[81] M. Zhang and K. Asanovic, Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, Proc. 32nd Int'l Symp. on Computer Architecture, pp. 336-345, June 2005.
[82] Z. Zhu and Z. Zhang, A Performance Comparison of DRAM Memory System Optimizations for SMT Processors, Proc. 11th Int'l Symp. on High Performance Computer Architecture, pp. 213-224, Feb. 2005.


BIOGRAPHICAL SKETCH

Xudong Shi received his B.E. degree in electrical engineering and his M.E. degree in computer science and engineering from Shanghai Jiaotong University in 2000 and 2003, respectively. Immediately afterwards, he began pursuing a doctoral degree in computer engineering at the University of Florida. His research interests include microarchitecture design and distributed systems.