Improving Memory Hierarchy Performance with Addressless Preload, Order-Free LSQ, and Runahead Scheduling

Permanent Link: http://ufdc.ufl.edu/UFE0021625/00001

Material Information

Title: Improving Memory Hierarchy Performance with Addressless Preload, Order-Free LSQ, and Runahead Scheduling
Physical Description: 1 online resource (117 p.)
Language: English
Creator: Yang, Zhen
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2007

Subjects

Subjects / Keywords: cache, forwarding, load, prefetch, scheduling, store
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The average memory access latency is determined by three primary factors: cache hit latency, miss rate, and miss penalty. It is well known that the cache miss penalty, measured in processor cycles, continues to grow. For memory-bound workloads, a promising alternative is to exploit memory-level parallelism by overlapping multiple memory accesses. We study the P-load (Preload) scheme, an efficient solution that reduces the cache miss penalty by overlapping cache misses. To reduce cache misses themselves, we also introduce a cache organization with an efficient replacement policy that specifically targets conflict misses. A recent trend is to fetch and issue instructions from multiple threads at the same time on one processor. This design benefits greatly from resource sharing among threads; however, contention for shared resources, including the caches, the instruction issue window, and the instruction window, may hamper the performance improvement from multi-threading. In the third proposed research, we evaluate a technique that addresses this resource contention problem in a multi-threading environment. Store-load forwarding is a critical aspect of dynamically scheduled execution in modern processors. Conventional processors implement store-load forwarding by buffering the addresses and data values of all in-flight stores in an age-ordered store queue. A load accesses the data cache and, in parallel, associatively searches the store queue for older stores with matching addresses. Associative structures can be made fast, but often at the cost of substantial additional energy, area, and/or design effort. We introduce a new order-free store queue that decouples store/load address matching and the corresponding age-based priority encoding logic from the original store queue, largely decreasing hardware complexity.
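
For context, the three factors named in the abstract combine in the textbook average memory access time (AMAT) relation

    \mathrm{AMAT} = t_{\mathrm{hit}} + r_{\mathrm{miss}} \times p_{\mathrm{miss}}

where t_hit is the hit latency, r_miss the miss rate, and p_miss the miss penalty; overlapping misses attacks the penalty term, while the conflict-reducing cache organization attacks the miss-rate term. The conventional store-load forwarding mechanism the abstract describes can also be sketched in software. The following minimal Python model is illustrative only; the names (StoreQueue, forward, and so on) are assumptions of this sketch, not the dissertation's design:

    # Software model of a conventional age-ordered store queue; names are
    # illustrative assumptions, not taken from the dissertation.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class StoreEntry:
        age: int    # program-order sequence number (smaller = older)
        addr: int   # effective address of the store
        value: int  # data value to be written

    class StoreQueue:
        def __init__(self) -> None:
            self.entries: list[StoreEntry] = []  # held in program order

        def insert(self, age: int, addr: int, value: int) -> None:
            self.entries.append(StoreEntry(age, addr, value))

        def forward(self, load_age: int, load_addr: int) -> Optional[int]:
            # Associative search: among stores older than the load whose
            # address matches, the youngest wins. In hardware, that
            # age-based priority selection is the encoding logic the
            # proposed order-free store queue decouples from the queue.
            match = None
            for entry in self.entries:
                if entry.age < load_age and entry.addr == load_addr:
                    match = entry  # younger matches overwrite older ones
            return match.value if match is not None else None

    # Usage: a load at age 5 to address 0x100 forwards from the store at
    # age 3, the youngest older store to that address.
    sq = StoreQueue()
    sq.insert(1, 0x100, 10)
    sq.insert(3, 0x100, 42)
    sq.insert(4, 0x200, 7)
    assert sq.forward(5, 0x100) == 42
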
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Zhen Yang.
Thesis: Thesis (Ph.D.)--University of Florida, 2007.
Local: Adviser: Peir, Jih-Kwon.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2007
System ID: UFE0021625:00001

2166 F20101108_AAAEEJ yang_z_Page_055.txt
9ac9a8e59df5fe8bb44a7fcd4d6080dd
58e02e915c2b2f499e929dc52e0d79678e165b1c
F20101108_AAADZC yang_z_Page_117.tif
490fd8bfb475a8a904dc00a120a5f4a8
92e75a84c936c81cc76cfa949416515e9d99a87f
1803 F20101108_AAAEDU yang_z_Page_036.txt
c56ec4c1698d3dd3fe876d9260101192
98e1a3a754c23fcba19e2b13813e93d9f382711f
F20101108_AAADYO yang_z_Page_099.tif
d96f9d7368c3f1c1dd371b5cf5645c25
c6b161a0f7b002a2acd7b7305fe57fc5684bcfa2
2200 F20101108_AAAEEK yang_z_Page_056.txt
3dad4c313d329813632eef6caf4fa7b6
0b82f3bd13b4f6c2b5c188b3c4a6ecc6456a044b
698 F20101108_AAADZD yang_z_Page_002.pro
79f5ca831cf5d458d82c4084a9d2076a
3cd716f463e8d6028a8cd9ae02dc590643ae7a9b
1041 F20101108_AAAEDV yang_z_Page_037.txt
2c5387befb89d50edc704234f4dc9d30
2bdb26587eb1b442f592429b2512acc9477db976
F20101108_AAADYP yang_z_Page_100.tif
efefd763a97358219394cea1ada50606
6325987e7a6c3ff95957134b8000eaddd2139f9a
2244 F20101108_AAAEEL yang_z_Page_057.txt
2588525a6b8a8202925eb89e3fef9f2e
91b84780fec611861152df7c8525dc100882b958
594 F20101108_AAADZE yang_z_Page_003.pro
1a69c31e12dd05f6fa94261570d46e7a
f6927fb68f612b5caeef96c73e573ba37ca4f2f3
2158 F20101108_AAAEDW yang_z_Page_038.txt
f4cfc56f2ea77d5e5ed5beb364a4a334
75873cd7ddf5d57185054e24c0b28d50cede786b
F20101108_AAADYQ yang_z_Page_101.tif
692cd501c1ac88c7a984c6d66f299f68
0d79e50717cc91cfb6c718d82779d9227b0740cc
72531 F20101108_AAADZF yang_z_Page_005.pro
12ec3de2be16a313686c1c789a5b2e25
99bd28e5168c0c1629ebe6a17d0e0cfeeca26ba4
1273 F20101108_AAAEDX yang_z_Page_039.txt
fc2174d1857d37ff77bbe3f18ca8b956
397b7e2e1986f80af6f5fd96fde74ab7bdb1578f
F20101108_AAADYR yang_z_Page_102.tif
86aa2d06c512cbf0f636516a15c3eae1
3845ddc7c18bde3c8f23e9ab25d8d9f74aeab9bd
F20101108_AAAEFA yang_z_Page_076.txt
ecd4b7b163d3b4df2fdd2430d74917ba
415c4927cb942ff4857587edb0e4e7a6848a851c
2228 F20101108_AAAEEM yang_z_Page_058.txt
255706318222a1a7b116c5b190882019
aa55b645ac03e6ab68af376398b31f0bce0e737f
72792 F20101108_AAADZG yang_z_Page_006.pro
d9cff9fe6ff696d9e50602c017ceda72
cefe527bb71e83620a941cf61102c09d2026a654
1152 F20101108_AAAEDY yang_z_Page_040.txt
e60f52718f63baf7440d0127b67596c8
26431eee4493f157af328f62d67ebe144a2ae498
F20101108_AAADYS yang_z_Page_103.tif
bc5005f0de65ff92d40e2b97db4c9c65
ab023c37f2f6238b29837a588b82732485743778
1176 F20101108_AAAEFB yang_z_Page_078.txt
e931a8b7d90f97315f6ee8e1381e4480
27f5184d36ad2be39de46f688c16826ef9cf809e
2095 F20101108_AAAEEN yang_z_Page_059.txt
dffe9026d964646fc418d2b6dae7ac6a
0d75b7d08e253e73d1804bba17e39b34ee75c531
10702 F20101108_AAADZH yang_z_Page_007.pro
f5c8a97f1d8b3f35814123a6b8f9896a
56f64e1e124987cd0dbe22509f01b6460d8b8109
1642 F20101108_AAAEDZ yang_z_Page_042.txt
26cb673cfef40cce1014c9a995758bf9
d9e0d0dff49920eb5fa06a29dc9c87430a7892dc
F20101108_AAADYT yang_z_Page_106.tif
cab677ab977b039d94288c0b68977d96
657c2bfe3a5d3b65832bd23cc025e616ac74aead
1748 F20101108_AAAEFC yang_z_Page_079.txt
c1cc400ce0f281999379f33216457240
26e9639c1e4fc70f3a4cbdff888156bf7b39a7c9
1955 F20101108_AAAEEO yang_z_Page_060.txt
76e2cb69911fe69229b233b6c9223c3f
19611e9af83a8d3a0eb1d3a80c0157b0d6081f7f
59827 F20101108_AAADZI yang_z_Page_008.pro
0a3a00e66f3ad66a3a5692654dbbd256
1d2c8b111f35f8e88f76f439d1a4e3a17ee074e1
F20101108_AAADYU yang_z_Page_107.tif
83a4dd5cd9626d9ab0134a518e3b178a
5e7cfddbb7f93e9af2a45e69c430bc11aa7375dd
2016 F20101108_AAAEFD yang_z_Page_080.txt
d8dcd4d4d7e47d44c479403fa45540eb
5afe17ef7ac095c31e4ced1f41ac4833caae0504
894 F20101108_AAAEEP yang_z_Page_061.txt
692ab5c47d5a062c54392284448d524c
cf40071cc5859a8644e8c1f7c023128bc99bd654
53443 F20101108_AAADZJ yang_z_Page_009.pro
2fd513a5b132a9f69fb59482b49f7ac4
d7024f61e54e3a0ded9041a789f94f9ca4097193
F20101108_AAADYV yang_z_Page_108.tif
59cbc65c9b226b48dd8294ee696bb067
e90c78a259ffe95b524cff64a0f3e7ab0c6acfd6
1678 F20101108_AAAEFE yang_z_Page_081.txt
6959d423bc1896f5ebd70659685262f4
9675cf915ef37b4c43c1900eaec878de9eab46de
875 F20101108_AAAEEQ yang_z_Page_062.txt
3a0ddc22e13eab683bf8938cf8588b2f
5cfcf9ce889d3a96f0bc3d5c7f472c20c6da69fe
43979 F20101108_AAADZK yang_z_Page_010.pro
b26f6030a88e284cd436cb09e52bec01
a0454a27f778afa39c987ac416184c5df0a9d5ea
F20101108_AAADYW yang_z_Page_109.tif
4d754353804c9bd6813be0aa9a739143
2a787aad9ea56f02532f3b7341c76cb93825553f
2038 F20101108_AAAEFF yang_z_Page_082.txt
2702cf802005360f694025c22d95619a
2fb6e1178c2dd07d14d96c19d0e74af4659e096e
F20101108_AAAEER yang_z_Page_063.txt
74e4b8dc68000aaca3a5cd84882aec3c
39909b2f0ff554f2d5f5df51500255f5b6f6d41e
14708 F20101108_AAADZL yang_z_Page_011.pro
e76ef0a127f007487ce7ff49a12fcdb0
1e38bb61a8100a4dbf96af997dfbd28e173cdd8e
F20101108_AAADYX yang_z_Page_110.tif
150dddbbf3a85f7c77985b1256424a0f
2eb88bbb5d271513035e2e4d913465a6d1cbfe45
817 F20101108_AAAEFG yang_z_Page_083.txt
d612323960187d4d5e023b214aab5dcd
73b7b5d37118eca6b08e2cdf86515a24b9e4d566
2111 F20101108_AAAEES yang_z_Page_065.txt
989a919165aef1f85868a6729481e572
2ff2c1fb00154272dbe507b30a6a73438de75bcf
48682 F20101108_AAADZM yang_z_Page_012.pro
31202ba03a3a3919f268e87e53632c80
d8f0ae9de422524402260ddafdffab6a49bc383e
F20101108_AAADYY yang_z_Page_111.tif
822cc23a085332356173bc0cae225c9f
3051099bc2015df3b97e50eff0da1c65465a3694
2254 F20101108_AAAEFH yang_z_Page_085.txt
3a75ee9d9a23a80fd4be7f30b38521cc
839ce11860aca007a13e99a630e6935952b30a73
2001 F20101108_AAAEET yang_z_Page_066.txt
08989b04df4244f9e6d3c1b5f56f496e
fbe5e361026f267c86ddde65bcff1e6f9fa61a7b
54041 F20101108_AAADZN yang_z_Page_013.pro
dc1b242008194a8806462bca77052b8c
93d71e0b294c54f9317afc8c210b08c36bf4c007
F20101108_AAADYZ yang_z_Page_114.tif
0fdf457e09712a01d8694f7289d445b4
600109fe0ee8a8a4a9c7c4ebf0f1ab04300b90b6
F20101108_AAAEFI yang_z_Page_086.txt
efa83849c412b6e82972b1633cd3d5a0
a05625b0fe967e45a4ada47273f6ccbf3d5776c8
1460 F20101108_AAAEEU yang_z_Page_068.txt
81a8280365bc98449df1ea95ac2b73bf
1c277227b2699658c4b8624c314585a912a47d56
56093 F20101108_AAADZO yang_z_Page_014.pro
7f89d930c667230bbefb3973374b358b
4ff2f3b86e058f7e1a87c7a709ce2900092171a2
2100 F20101108_AAAEFJ yang_z_Page_087.txt
4395d3cb39fc2055ec438ae7f8449e74
43a26a53972b55e9f3202836fd95fa56290fdd94
2034 F20101108_AAAEEV yang_z_Page_069.txt
97d4384930cc7895a8db25ee028b93b9
eb56c48c25f63a72316c1d4a7b6cd0df837d74cb
53131 F20101108_AAADZP yang_z_Page_016.pro
290e8a303ee907ea12f0dcd4e8f03253
7b6c5a7b8608e5103435d4f9bf21fdb74e2527d2
1647 F20101108_AAAEFK yang_z_Page_089.txt
67a0f72ce4ced6820285ead6badd0ca9
9469cc94132b1b74ca240684ed6eed75c5574269
593 F20101108_AAAEEW yang_z_Page_070.txt
f6c19a7a67d413a98ae31b3a7a9635b9
3366542f216b64862b15eb854b6e3d938673878e
58303 F20101108_AAADZQ yang_z_Page_017.pro
0bb5a8960a6d74e07a0890f9385b0f4e
2487195c7c4e439b23a1f9be4a35ba95427b1fee
1482 F20101108_AAAEFL yang_z_Page_090.txt
77352085f013c640e65f580628dbcf3b
20c9944c550de7423a1f4a0eb0158984df330085
1618 F20101108_AAAEEX yang_z_Page_073.txt
653c8ae87f9163dca057f2ea4155d0d3
46ec437aead7e1acfc8e9a01232e254150ec5249
54820 F20101108_AAADZR yang_z_Page_018.pro
2ae2b506d900d83a45cbb3f9a8eba9d5
d388dacba08a7fb30bf478f3d2db330021326c0e
F20101108_AAAEGA yang_z_Page_108.txt
8ee6d811e6f01761b06a699eae34f2c9
4ec2155753a59066aa5085fbdb74ae4a294885ed
2309 F20101108_AAAEFM yang_z_Page_091.txt
0ae19d3ec0006077815000a7d3473936
b2b676cd60b8bce33bfdf7026f72f1c7db879ed5
2231 F20101108_AAAEEY yang_z_Page_074.txt
50133e95c6d84376973f6e63b23d050f
d1b3fb3a38848c707069ab5d530f9d69f652688e
54115 F20101108_AAADZS yang_z_Page_019.pro
2bb230704341f5c05603f6ef05faf406
2fb9d0e697f191c59eeb388f21ebde2e283bc537
956 F20101108_AAAEGB yang_z_Page_109.txt
a57e768a84c6b2e2d20a1296b5ceeb12
37b6bb382d36dc63a881eb0eeab8e885d385be5d
2256 F20101108_AAAEEZ yang_z_Page_075.txt
7d29cf55988e5415200ebc5faa591419
e1695e60011ddb2d4653b968545a45f4314e9093
57309 F20101108_AAADZT yang_z_Page_020.pro
15ae5804b1e2e63ffc76df1548fab3cd
4b8bf7606c3b4dffb1a89951ad518f3421652f22
2727 F20101108_AAAEGC yang_z_Page_110.txt
3533eb9a2196011a04e52684b5b720bb
a0da63d40b34893c08ed71271c45068de2414c09
F20101108_AAAEFN yang_z_Page_092.txt
9fc700e5d96821901d3a9643f9f5909d
29b2a9dc5c28cb79df52483c045999e02c3a202a
11392 F20101108_AAADZU yang_z_Page_023.pro
fd595b1719ad0be4fffb1969bae474f4
af0b765693e76a1aaecafff6f286cf9eed2a82bc
2791 F20101108_AAAEGD yang_z_Page_113.txt
8d38f9c4b814ca767b834b62ac3fd152
aa61ee5eaf4ad12da77e2313da08dd36272dd582
2139 F20101108_AAAEFO yang_z_Page_093.txt
1a4a041bb66c67c74c4dd972e3982ab8
832e83f6ea9421b10fcfc8063117bc5af573e2ee
50697 F20101108_AAADZV yang_z_Page_024.pro
05015603e62623bc9fd1bea536ad843c
292aeba58190646122aa3183d29b75b14d01eeed
2796 F20101108_AAAEGE yang_z_Page_114.txt
5acd2c24f328f90d0d6f5ffdadf698e6
fa71e45b66aa25ca634078ffd77fafdba04d2738
2315 F20101108_AAAEFP yang_z_Page_094.txt
7b02116975e153dd73783475fe2ee815
019c39c69e99b9238bf228eb80d0922aaa8eb3d5
54976 F20101108_AAADZW yang_z_Page_025.pro
cc08f18a0e848a63d96882eaee558ea1
b54ddd1b70735ac04fda7abaedfc21f6c6ac0e3f
2621 F20101108_AAAEGF yang_z_Page_115.txt
6ffe9339ac1936611a6d787456c70f0e
8e1fff6bdf912a4b0bca33fb3242db6a7ec9f62d
2265 F20101108_AAAEFQ yang_z_Page_095.txt
91cee919dd76cf14ad5de38bd382f214
a82e79e1356ff447d9aa893f7264c6fa9d019b82
53691 F20101108_AAADZX yang_z_Page_026.pro
417b510f809bf4f8e946d103cc38584a
3987a92e547bddce81e013b48203bee25cfb1cab
2633 F20101108_AAAEGG yang_z_Page_116.txt
b716286090b42397dbdc1d5565faf8bd
fb687483d8f6bfe663e9b586431f4a18e7004f7c
2288 F20101108_AAAEFR yang_z_Page_098.txt
2f0288d66b870a7bd7f3dcbaaa52de68
e235d017f9613e92e7b99fc918406b2d50137ad1
38645 F20101108_AAADZY yang_z_Page_027.pro
8c8a2affb6dec2087cc0368c504b60ff
b48ad21d5a084d7269c0019dee247231359b9d5a
323 F20101108_AAAEGH yang_z_Page_117.txt
9ea45066f56c8686b0b04d330a3fd8f0
2bdfa974363ba3f144a2923797c87954ad944f5b
2189 F20101108_AAAEFS yang_z_Page_099.txt
1acf3f89e49f895b976b7f7a9bd68434
8f8efcb2ff0d5dac41234b41b96e4fe3450eba3a
32238 F20101108_AAADZZ yang_z_Page_028.pro
5b429d4a1c2f031a0c9d1dfdc147d4cd
4fb1ad072784e7a2a56828b4cee4909f8b56154b
823583 F20101108_AAAEGI yang_z.pdf
33a8c6bf6eebb626577d40a9a376f594
a5f26ccdcc3eb291c9f5ef3600726bd409f20276
2143 F20101108_AAAEFT yang_z_Page_101.txt
cc44f8831e85d4f84529577469bca88a
b68d133c74da20f927187a123492916ee3bc8f0e
7303 F20101108_AAAEGJ yang_z_Page_041thm.jpg
2fda7f5ee818241f30e3689f95d00a31
f2cab75918af2b6668b9c940e036753496af1167
1504 F20101108_AAAEFU yang_z_Page_102.txt
484d9b8332bac223044052aed5e12021
b2cca6fb22de4c9104d98047e2415c3a2e771663
7818 F20101108_AAAEGK yang_z_Page_094thm.jpg
74f58c2ec98977d627b5f0a7a5a6f9f9
fd19aad5aec5250e2fd4f0f13fd65ee6c3d12ef9
1707 F20101108_AAAEFV yang_z_Page_103.txt
3ba43ef8cc4bcc025420ba8fc630f818
58a05e6647d071dca3387cd9d27860e0bad0c474
7785 F20101108_AAAEGL yang_z_Page_113thm.jpg
397f1a828d7037c21cf4a08866cb0554
27602e2247caf90218b60838e0aa94b9e983d2d4
1103 F20101108_AAAEFW yang_z_Page_104.txt
35db280221a80d8e86306814fa9bd9d7
7e7ae5cb4991922601becc078500922f7acd3e49
174264 F20101108_AAAEHA UFE0021625_00001.xml
f3e12a3338927c9b1d5fad33ae1cff34
b43c4e0e63cb56a47bc71460dd344f7564af34b8
6976 F20101108_AAAEGM yang_z_Page_072thm.jpg
08148d44d449dddd63708282e4b57beb
d59fe054aaf2d578a994d73946fef29b9b6c8132
2278 F20101108_AAAEFX yang_z_Page_105.txt
5eeaf640bcb707ba2b2c9048d7c46f9a
7be3297c57e688f372d65651bf8460a85229e918
7723 F20101108_AAAEHB yang_z_Page_001.QC.jpg
7b1c44edba456034347e1444f3a895c2
74a04dcde0779be39c7ae3c67480256345a65816
6200 F20101108_AAAEGN yang_z_Page_008thm.jpg
c091f968ff480b827befdc5c1ebb59de
3ace944c28a93ea42d3af5ed17c7e7b21865b8f5
2120 F20101108_AAAEFY yang_z_Page_106.txt
6ae7ca76aabcf8dc58f5a2d927ba2a77
ae1df2dfb22c460b02f06b6261473876aeb49888
2448 F20101108_AAAEHC yang_z_Page_001thm.jpg
f0ba4a50ae3671640b065ee766183687
b25ab13c3bb6fafc3f50436ca5d1a975fefca211
1640 F20101108_AAAEFZ yang_z_Page_107.txt
07a8f14793a071582d9022a659a0c1e7
373812e7827ea431e8066b62e3f73191a49e11a5
3194 F20101108_AAAEHD yang_z_Page_002.QC.jpg
18a6e1cb7b25cfc0416368c1ae6f700e
7212436e6912200dda2f5147039550938d00036b
28589 F20101108_AAAEGO yang_z_Page_085.QC.jpg
e97bdf24e689fcc401f68c852ec92245
e8048ca905c1a233bb0e137a4f262cac76287806
3712 F20101108_AAAEHE yang_z_Page_004.QC.jpg
9f84deea468a66e7c95e255348d4bb70
7a9e66cf9aa22ad9b3953d4e8447595e2b6f4278
7163 F20101108_AAAEGP yang_z_Page_034thm.jpg
09fc4a8ba4d96468e2bdba2e5ca8ad6f
8fea1242441857b15e3d974ea36b08adb774efb6
1487 F20101108_AAAEHF yang_z_Page_004thm.jpg
1a545f98143546e3acd7ae9e1fdaa00a
6b2e767f49960ecf92e6f11073d1ba47ef99d536
21187 F20101108_AAAEGQ yang_z_Page_027.QC.jpg
0c3f95fd66bb1b93577c5446b73749d8
3a129cb9400f9011e81520174213087a6d88ffdf
23086 F20101108_AAAEHG yang_z_Page_005.QC.jpg
9806ad9ff97aa5be135e81af3bab68f1
aa9c03109e98a98cbb420f5b1d738f7ac7736357
7714 F20101108_AAAEGR yang_z_Page_015thm.jpg
eaa4b2bc6f74cb72daf9a48a2076782a
51e5ab53d98d84e35cf6a2e64e415f8b277c54a6
6106 F20101108_AAAEHH yang_z_Page_005thm.jpg
e4292506730c5f8abe4742600c5ad7ff
86a301faecee2aba4afda66f9af91f09bbc9f46c
6183 F20101108_AAAEGS yang_z_Page_102thm.jpg
c6f1551a1944b3749a804a3ef7385387
05d27eaf75a9a2f21a4b8cbf26b19a33e7f4a1f4
22108 F20101108_AAAEHI yang_z_Page_006.QC.jpg
5c41575ec7f1167ebca46259d26787a5
799c9eeb744e72fdf4c128e270da945299503946
4269 F20101108_AAAEGT yang_z_Page_061thm.jpg
f43946c0961fc22eb53eef43d31a9f62
976018bf5665b59cb6023376c927654ddada8027
5716 F20101108_AAAEHJ yang_z_Page_006thm.jpg
d7fe1b9530103438a53c17019e7a89bc
a8adc2bfbb0ad323521d1f71e3a08203b85bcca0
3095 F20101108_AAAEGU yang_z_Page_003.QC.jpg
18eff030ae2ce3e23ced7450d3a1d66a
09fe4767f8e464a3147b221737cac9f0e130a53f
2101 F20101108_AAAEHK yang_z_Page_007thm.jpg
231885bc7ce5cf2f5894e6780b7128f5
cdf2479f0c778adba10c0c4dcd73f21c05155338
22674 F20101108_AAAEGV yang_z_Page_057.QC.jpg
f473828210296698b4ed458a3393545a
c3ffaa8793a210000e5071dbbb931de59b20b376
23702 F20101108_AAAEHL yang_z_Page_008.QC.jpg
c42feca0c52df90267bde3f18749a9df
628160db11a67f5ab0d66adb8dadbad48566da12
6561 F20101108_AAAEGW yang_z_Page_007.QC.jpg
17a71835eec721e51a8c76fe867d79f6
af49637c1bf683018a617063d101cf248fc5afe3
6040 F20101108_AAAEHM yang_z_Page_009thm.jpg
bca13bf3f9061b6ad42c27b25550a43e
f6eafd14021f54856f8ee7f9962ceec6642d76da
20208 F20101108_AAAEGX yang_z_Page_029.QC.jpg
42208adb3a0bc2f7819acf4fec6b7bf2
0c3b5f6541d86c61a276feecb447db646c490f71
7725 F20101108_AAAEIA yang_z_Page_017thm.jpg
5bdc7f5b66718964af56f80aeed52e81
7333ce40f60d71d735cc93330c63112d0dc1c11b
6358 F20101108_AAAEHN yang_z_Page_010thm.jpg
34f1bcde1f9773b36085c6289339e0a2
44b517e1a08aa84008382da1461197c1b6e87a36
26770 F20101108_AAAEGY yang_z_Page_035.QC.jpg
6534ade82d4b57fb154cddf354125b51
cb00d526f10baee12b3ef866f54a2291d9040122
27925 F20101108_AAAEIB yang_z_Page_018.QC.jpg
ced1e13eeca88bfe3f9016fb2ad4f0c4
b826d2a769923daabe57060514b44026c993fb45
9357 F20101108_AAAEHO yang_z_Page_011.QC.jpg
a197f5480e9bc8c7c9aff54fffe4196b
d0b2709476aae47b6999daf6be4a7740e0830ebf
23091 F20101108_AAAEGZ yang_z_Page_104.QC.jpg
1d92e6f23d420f5db48d891c6ccdd82b
afcc5dc23ace503a6926f958b3a5f990b697ead2
7215 F20101108_AAAEIC yang_z_Page_018thm.jpg
ffa59aa91ee0de9303762a4487f7e680
25c2985ef54ee757744cf214faf9feec49dc148c
27617 F20101108_AAAEID yang_z_Page_019.QC.jpg
84aba88b53bf070a5a93f2b619df7fd4
061ec2d0084b9b0243f3388f3c84cb48e560bcd6
2813 F20101108_AAAEHP yang_z_Page_011thm.jpg
b923618f702303da2b150b6ff8ce0d57
f1f6394ea0419414f4da13170c44044c67d66096
28418 F20101108_AAAEIE yang_z_Page_020.QC.jpg
c600afb28ab06959b4309c52bea23f5f
4a9d81ce1d5f44020bfea4f2b5f84ed26221ceb8
24908 F20101108_AAAEHQ yang_z_Page_012.QC.jpg
3cc06a91db52bbc04312885076e55389
26192517d2467c6e21c07ac85f6d2ea5085da47c
7496 F20101108_AAAEIF yang_z_Page_020thm.jpg
cdbe4c0eb2df2689dc2caf65280eb199
9006a3b591f8476c8472e9b83d60d040877cbf43
6746 F20101108_AAAEHR yang_z_Page_012thm.jpg
4e68ba31b87410f85d196a8e8c897134
6438e43e531332a226123ccb658440ed4ed66586
24876 F20101108_AAAEIG yang_z_Page_021.QC.jpg
0f6d620979020521b537bf87c6cc6a2f
974de5f4ece7aaf8dc62a2e7a0a0e5b5712f7fe6
26392 F20101108_AAAEHS yang_z_Page_013.QC.jpg
7da130ff373ca35284f75401492a61c4
99af9532c61586fdac66f4bd60f45b3266349277
6882 F20101108_AAAEIH yang_z_Page_021thm.jpg
bf39630ce394481e4f95e2aeaa45cb37
1d4187db6dccef6b7df66266b0031bec3920839e
24389 F20101108_AAAEII yang_z_Page_022.QC.jpg
0150de93547d86ede7df9e253c39b9b3
2ceb2b5edee0c8bbe496ac2d2bccb6faebe6957c
7290 F20101108_AAAEHT yang_z_Page_013thm.jpg
41136674ba85814ace62037b30f54e84
caa9fb9f7ad703629fa201bf4e226a83c316294f
6530 F20101108_AAAEIJ yang_z_Page_022thm.jpg
9c13c0c828e68126d8ac60fa2534993b
b86ef8e6210e20fb83263e73ca5f1329e6580d07
27872 F20101108_AAAEHU yang_z_Page_014.QC.jpg
d917ba36f5eaba7493131baf81b4baea
10c69ce676015449b2af84a57f45233abd99767d
7774 F20101108_AAAEIK yang_z_Page_023.QC.jpg
41e033c1e0f39510442aca8478cba914
3e1f3250d211ebcaf6af6a960591f8753469b7de
7375 F20101108_AAAEHV yang_z_Page_014thm.jpg
95cb853ac9c1f7b99ca688ecba76876b
c3a123b9bfabace4aebaa16179b1f67eb7b69af6
26197 F20101108_AAAEIL yang_z_Page_024.QC.jpg
9cfab7c67e1e2c02e183a8eca9204b5a
6e15036a2d30b39798a80b3b03a176910799ddc6
27810 F20101108_AAAEHW yang_z_Page_015.QC.jpg
dde992437f4372379b34a1c90de19421
4dc4cef81635e21ce72686e8d5b80c3d69c70560
7051 F20101108_AAAEJA yang_z_Page_035thm.jpg
31c774d330d0f94b64cadfc7c651ff3d
9a9d5f5ad44d33b2d798c20958eb1e218f477390
F20101108_AAAEIM yang_z_Page_024thm.jpg
c9376316a6229ccf2e2224b8fdcc067f
5262ff2451de55b919fe1b8ff0f40dc6f81e7166
26057 F20101108_AAAEHX yang_z_Page_016.QC.jpg
6a6c67f46597fb99e27bd7e80f8351df
c02fb68e12d5bbc37c788c79e108f91d538d7c5a
23260 F20101108_AAAEJB yang_z_Page_036.QC.jpg
db70d1511e4da21d1ab19f9216a84d22
42e36d5b1ce30ccc529e3bf31c69f4a5073d8eb8
27804 F20101108_AAAEIN yang_z_Page_025.QC.jpg
f9793ced2249724a4ad69333aea3f2ea
732cf907667f07ea2ba839419d94e6021f78ca38
7331 F20101108_AAAEHY yang_z_Page_016thm.jpg
077d6215cf5aac2b4291dd72057e25c1
bb83dcad6223d8e1dab3384f70a92b9b17af5039
6584 F20101108_AAAEJC yang_z_Page_036thm.jpg
73fce7d144cb1cffb50ffb83274489ef
00576b26fe3a71ab7d8a9df7bdb7fe557c50e558
7512 F20101108_AAAEIO yang_z_Page_025thm.jpg
fda4cb940ed962e3bc60ad49add50ca9
207bc0315b692e45b1e16443e65f56cefdfe1917
28315 F20101108_AAAEHZ yang_z_Page_017.QC.jpg
e002d9f352093697c7beae1a3e2c99f4
e538b7e6673e679fa94f2813810dbb4b39595c27
19071 F20101108_AAAEJD yang_z_Page_037.QC.jpg
d27c2ff3e232425195eae9a16be4ae5b
22f3225073995ed50dd2957d20a218a27bb89168
27011 F20101108_AAAEIP yang_z_Page_026.QC.jpg
dc3db08a7a318db9b46159d9012b2ca9
f85d7fe7943c23677c23f88ed346e83434e1e1c4
5682 F20101108_AAAEJE yang_z_Page_037thm.jpg
b3b9922373eb3ffae80923e6bc102e6a
b5b5f2e250d1b4e5607a0d45845d74dd61c6db48
26678 F20101108_AAAEJF yang_z_Page_038.QC.jpg
2307a35c46a24ff018918297635ff63a
baf588a917a866d06576f3168ea4f14200f27033
7268 F20101108_AAAEIQ yang_z_Page_026thm.jpg
0e4a10c533890fda82817c31e8b6e064
8c4bf40ca6e5d25103130eedbf2c630a15460abc
7518 F20101108_AAAEJG yang_z_Page_038thm.jpg
c806a018c9d3490a2debb1ffd74638a3
5482494e3ec060def5312547df80b8b7c3b2fe0d
16064 F20101108_AAAEIR yang_z_Page_028.QC.jpg
b5b7ced14edd666a00c4ff6c0f559585
247cc8d71bf2135b6fe3fc7c30fdc7ed023111ce
22011 F20101108_AAAEJH yang_z_Page_039.QC.jpg
362b5735b053460107485f815002eb4b
b61a3565a2745363fc3c9a7dae7d25a392bd1572
5044 F20101108_AAAEIS yang_z_Page_028thm.jpg
f25337a0ba42a88a07d384c5dc894762
6491047c1a26e8ca724746c3162112f8d7013346
18621 F20101108_AAAEJI yang_z_Page_040.QC.jpg
1ebbcee6b3c2a1d95983b5b55dff7774
018b85a0eb6609726c5c35260bd4656c39d46c7c
6110 F20101108_AAAEIT yang_z_Page_030thm.jpg
f8c219aff1276582c8e2c2834c130e8b
12a78f195b207748cbb083997ef302f24dc9226e
5618 F20101108_AAAEJJ yang_z_Page_040thm.jpg
556ee353a6b2084efbb862c289b102cf
1694b591ae25b71362291066ea796da3ceb2f980
27067 F20101108_AAAEIU yang_z_Page_031.QC.jpg
f090997b5185eedbbd4cc66b080e5970
ec5dc2ed4f45e52615546593b0dff9fdab0d0085
26990 F20101108_AAAEJK yang_z_Page_041.QC.jpg
d398223f674ac28a3f504a60046522b9
da1ef337e0687999a1f62995a599d24ef46019b0
7221 F20101108_AAAEIV yang_z_Page_031thm.jpg
d39c9bc397244364290ca79fa7c27eef
d17a1ea49184a43cde6eb4c18280ec4abd20f496
21595 F20101108_AAAEJL yang_z_Page_042.QC.jpg
b252fc591f367ad8c539bda09c5c1dbc
6787e1d4b32c15511f99e82a2e2b986eaba0abb5
21851 F20101108_AAAEIW yang_z_Page_032.QC.jpg
b8df6487f92a2616f869cac1c2afc2bb
5a7299f9f725dfd4bb02218d8b9c472de7fbc75e
6383 F20101108_AAAEJM yang_z_Page_042thm.jpg
b3ce7b51fdbdf32fbd9e1a90f4e4f1c5
b640b1d8746c96dbbe0b4dc9e300c65cb620f51e
6411 F20101108_AAAEIX yang_z_Page_032thm.jpg
25a7f2fb80331f33c54a32c93ef911b9
4d1308e39b64a8b56f121d804fda6c69b0bb48ef
24572 F20101108_AAAEKA yang_z_Page_050.QC.jpg
82841d91013592d7b360c3d6fde987df
b2cf9505fa0138a9cfd294190d653af916e570ad
17684 F20101108_AAAEJN yang_z_Page_043.QC.jpg
5047d0571994f88b695d4d8efc995258
49d60f15a6d05ca33ecd153710727311f67337a0
25994 F20101108_AAAEIY yang_z_Page_033.QC.jpg
4156df7e0e50ad70676049d6440ea686
41c350996ba81b19880f05077d17d2c6fffca81b
7074 F20101108_AAAEKB yang_z_Page_050thm.jpg
450ecb26f4e0b61c776f6385a2152e95
c83b461063cfbd69ff3060ac266029e793a30246
5198 F20101108_AAAEJO yang_z_Page_043thm.jpg
d80f850aa39b77187f6038c0d04fab1e
b5a08d87ae7c68fc6fc27d72fbbfb50a4ac663ed
7337 F20101108_AAAEIZ yang_z_Page_033thm.jpg
95cf5dcab79ec107a421298f1df5c7bf
412d22e40201f7ca6053b0848984538dba85e234
20024 F20101108_AAAEKC yang_z_Page_051.QC.jpg
db30c7d6f2e17f1ab490394f5eaac646
6ef1bc5567616915a6531e0bc2d2e806c2358ce7
26948 F20101108_AAAEJP yang_z_Page_044.QC.jpg
aa7229fdcd31ace396bd2f6ab2724201
0b483407ae26f9fa8cb7ac6d0bf9f7f6f9764833
5661 F20101108_AAAEKD yang_z_Page_051thm.jpg
8ae84279a411625189f865c50598fa42
1ef6633b57858834b2fc1dd1ada7b1a23a108dc4
7492 F20101108_AAAEJQ yang_z_Page_044thm.jpg
a6c33fcc5c101911519dce0750025f9e
55bd2348326b419e8d6550c622554f6dda5cb924
21740 F20101108_AAAEKE yang_z_Page_052.QC.jpg
8c8ae828717b4d44c59d8d41617d42f9
4daa9f7d031548ef20f0607f466761091ed59c6e
6210 F20101108_AAAEKF yang_z_Page_052thm.jpg
95c7f9bfaa076162b80667e2abf2b9eb
5c1101c1029cb2ccab9bf01075c53a10493880a2
28047 F20101108_AAAEJR yang_z_Page_045.QC.jpg
a17467919ab497b998bfdc9cd5d84c58
917a34661db2419f68b6dfa6087b2cd04423021b
21675 F20101108_AAAEKG yang_z_Page_053.QC.jpg
f4b8b01702f6c7dcef6d02367db587d4
292fe5b83f5fdc7bf037e30ac8c9f8cbcf55ff64
7460 F20101108_AAAEJS yang_z_Page_045thm.jpg
cbc979401fa001f4aadb85676ee1a060
bbf7bb951e9be97a091a42b23269257a71f60bcc
6046 F20101108_AAAEKH yang_z_Page_053thm.jpg
375534882e123b5934f0322045d1176f
927104bd896901675fe6b0836b8f5abd0643e7e2
6472 F20101108_AAAEJT yang_z_Page_046.QC.jpg
667881eb843c4b24ae676f5f3ae33c64
9fd2eb353f76f5f780f11bfab6e1a720f1dcaa7e
5369 F20101108_AAAEKI yang_z_Page_054thm.jpg
20fca3f753ee3c326d5626ff4ff9b6f6
58162aed3e0aebc9421452107a855f0b597a8000
2266 F20101108_AAAEJU yang_z_Page_046thm.jpg
5b4c6d4479b4ccf49e62a92ec139687a
0381111c59922ab04be73990e8e61cf853a463f7
7442 F20101108_AAAEKJ yang_z_Page_055thm.jpg
8dc6a7278e8993df03eea09a71ffaa1a
2f3bc1e91d6cb2068f1ad5aa233c2612afdbe108
27659 F20101108_AAAEJV yang_z_Page_047.QC.jpg
5407ed3f395b0ba7a585fe06ebe0e92f
ea65883863b52f8b8731366b8c6472cff9fde126
28400 F20101108_AAAEKK yang_z_Page_056.QC.jpg
5b076bdedea9d4ca6bbcc9214666bf8d
f1c54b43722bcd8b6a07bc7654072012f1156e43
7524 F20101108_AAAEJW yang_z_Page_047thm.jpg
560f001d7e0975e93436a77eb03fa292
ec410d12ec2e2eaa23c355484057b24dde3bcc21
7434 F20101108_AAAEKL yang_z_Page_056thm.jpg
b2367881c8614502ba52159bf0052116
2b7d67c210baa63652df9ec65d6f2bdc81b9b2bf
23022 F20101108_AAAEJX yang_z_Page_048.QC.jpg
0bbb80ac6b3a78e8558a70c83bcb1c4d
c8141f076c513b575d2bc3470a5f96b556f2af37
25509 F20101108_AAAELA yang_z_Page_066.QC.jpg
0a4cbfcfc72dab967b59e6fdca649840
9586dcc66278496a639d0b5e0fb30505771baaa7
6310 F20101108_AAAEKM yang_z_Page_057thm.jpg
78520ea570bc5c2de65822a217df0cc3
fc1e8e500a2cbf3a7bf155e8dd91ac0e9f0eb295
28473 F20101108_AAAEJY yang_z_Page_049.QC.jpg
c3e77184a93e1e8c27083a1f49ea362c
f70716c94e1983be3e540f6ac777e51f8b77a607
6896 F20101108_AAAELB yang_z_Page_066thm.jpg
136ad4b19dd173a42e034099f8b0603a
66cf60ff14f290928b8f60ccac75062c9fa5690f
28193 F20101108_AAAEKN yang_z_Page_058.QC.jpg
b81edad46f9df574be29b8e8b0fe3739
5f0303de11fd22bc97e5c610c5243ea944af7e07
7582 F20101108_AAAEJZ yang_z_Page_049thm.jpg
054d90fc74542fb712e13ed2cfda5424
ff90a4a169bb640f5768cab2864130496b642a08
5509 F20101108_AAAELC yang_z_Page_067thm.jpg
26f15c2fa66218a96c2b8aa6edf2f050
65ff0440d6a30092b19ac038f2524cfc6cc18c2f
7605 F20101108_AAAEKO yang_z_Page_058thm.jpg
142775aab1f981fee2df2160ca31cd6f
9eadf3039669989bb65712206111bc2b4c3757ac
20029 F20101108_AAAELD yang_z_Page_068.QC.jpg
35c7eeb988e59a75cb536a22ea97e835
1dd27d9e8ad9fbef8d4aea20b7c6db3449fcf094
7271 F20101108_AAAEKP yang_z_Page_059thm.jpg
4f9ae3d5ec02d1a5888187effa7b0cb9
00081b6f38bc8dc838a48f77f3e2d5d31a263048
5477 F20101108_AAAELE yang_z_Page_068thm.jpg
c0b20555f4a691324b29ef997ec03a51
a7e90e7c06e23a5c95cc8af8e88cbea9d9a31116
25510 F20101108_AAAEKQ yang_z_Page_060.QC.jpg
534f269254fece67d3a829a808f21436
438db0fd56cc82a5264364ae9b30a636ac465435
25186 F20101108_AAAELF yang_z_Page_069.QC.jpg
8d98da5a4ea7341dcf49c31b75207aea
a7365bd17bec452e82529a709432c5d4978f1500
7075 F20101108_AAAEKR yang_z_Page_060thm.jpg
c260d63637c58b05af18cb4413bab5be
bfd0a70cc2ae3c2f64b8d02c0d57ed4619b3ced5
7202 F20101108_AAAELG yang_z_Page_069thm.jpg
fe8ec1b5d0f60cdd55f2e895b00aebc2
6b7cd871c822fb57785577984c3863489e473cfa
8938 F20101108_AAAELH yang_z_Page_070.QC.jpg
996217df3d76e719225b39fba6618913
c01e04ce417a77cb7489e1e4ab344b89c22e055d
14351 F20101108_AAAEKS yang_z_Page_061.QC.jpg
7141d12cb99af96578c8e73f83c32b08
d7895130a653ad654be45242ffc5a2f0eb13d142
26079 F20101108_AAAELI yang_z_Page_071.QC.jpg
c80ac55f8b80c0375ce87d5dda16ea6b
f9270bbb5685e49a8d4bafed74be662b72f77d22
18361 F20101108_AAAEKT yang_z_Page_062.QC.jpg
e1d9a2548f61994d499ea8568cdde4ee
70c70523edf808ab8790ebc9277a6fdef266de78
7231 F20101108_AAAELJ yang_z_Page_071thm.jpg
2fad5fa3c3e185f13976b152f9df3a4b
17d5489f1ba4d22801730d4709fa3c5aeb6d5408
5487 F20101108_AAAEKU yang_z_Page_062thm.jpg
89a070060765ee96b5bb09303ae7ba43
af7d22bbe38fea2cbadbf1f661c13092a8955a41
24977 F20101108_AAAELK yang_z_Page_072.QC.jpg
70537f9574c78fac10e1ec4a72b019d2
27e2018dce971b048badf5dde1e72ba176e5c895
7135 F20101108_AAAEKV yang_z_Page_063thm.jpg
ad2ab5a3c83d45308e7379860615b0d5
f241f6ddecb3cf74b55b8d76b9b08b88d6a1b36b
4832 F20101108_AAAELL yang_z_Page_073thm.jpg
e03e38b54a05aee09aec9752258bc299
ec997ba0fa955989890b3b8ffcfdcaf96330ea44
26027 F20101108_AAAEKW yang_z_Page_064.QC.jpg
a721268e01e64dd2a23bbe958b123cde
19b960f857376faea68305f6a0d3c576bf370950
11861 F20101108_AAAEMA yang_z_Page_083.QC.jpg
ae09eeb635838fba3cd7a3fd2149e5b6
2049dd1d5497cdd2a78953e6a8452f4bc70618f1
7593 F20101108_AAAELM yang_z_Page_074thm.jpg
40f812aa689963a21b944e77a3316753
5eea3ed4f22d4865f4707e93ab7a92e041e6a791
7236 F20101108_AAAEKX yang_z_Page_064thm.jpg
a4377c652139e91e76fee087b661c160
5a29ada97527b27be6cc643ad9919a4259dffc0b
3650 F20101108_AAAEMB yang_z_Page_083thm.jpg
a63454aae92c42c47106ce7c45eccc8c
74d72580c9f5070aefcc5d9e91903604e0f4a0c8
22549 F20101108_AAAELN yang_z_Page_075.QC.jpg
0fc24aa7f2880c63c124d333ea7482e3
5142214a63cb0c84f78ad5f3f3b73548b1e52772
26036 F20101108_AAAEKY yang_z_Page_065.QC.jpg
57396805a92d2ed1a16edd1e509eac81
6efe3ab35f5e39e43a3859d921d1c1fa4838540e
90072 F20101108_AAADJA yang_z_Page_005.jpg
4c9ba6e21c856952f513767ac783a965
97b69fdaf88f59110b9b51b0a12229319b5c38a0
28306 F20101108_AAAEMC yang_z_Page_084.QC.jpg
fae1a360326750e10ed6142fdd4f9d10
46b47bdb9e523601e486a43f15d8631299b55b61
6078 F20101108_AAAELO yang_z_Page_075thm.jpg
df55ac104eefce46439e02d5a2b59da0
f05f5b6f5989e1ecdb2ddbd5d39179e1f9ee436f
7117 F20101108_AAAEKZ yang_z_Page_065thm.jpg
b55c519d3b66ab4871ef3b7e619c29b0
48ded68fd2d956b353a2f9caae93c44ffa4ba0aa
2221 F20101108_AAADJB yang_z_Page_096.txt
8529bdcfb00a1d07e2dede734a1f54e5
eae012e46b22bf3642edb5cc536b08f9be3f6d98
7587 F20101108_AAAEMD yang_z_Page_085thm.jpg
8da0d5bcae41bc410df85937683f38ed
7a5dbd2cfedbd59ef0e6d384401012142260a1aa
1307 F20101108_AAADIN yang_z_Page_002thm.jpg
7ea2f0b17edd1e30ebf5d26a199feaf5
02c99c5d453a7b762a5b7b22ee6330773af4e918
24007 F20101108_AAAELP yang_z_Page_077.QC.jpg
95bceee7e8df8e2112076e49226fd997
c66e1626d7071117a309182e89e4891e312529b4
F20101108_AAADJC yang_z_Page_034.tif
5800dc3d66c854c662a6ee74dedbbef7
2257c3aa417845fc935c03fb2d0d30b9a17c44f4
28456 F20101108_AAAEME yang_z_Page_086.QC.jpg
5732b88996f031960bebc0c095e2f08a
c51ddc7d5f734ca8855803a30190c293712d8a3e
97024 F20101108_AAADIO yang_z_Page_075.jp2
52e1c002a31e8190bf5ae2bd38eeacc7
4c8ffa67b814bbb5fbcd5fe15ca65b89707c485b
6320 F20101108_AAAELQ yang_z_Page_077thm.jpg
dd14db6e8b5f0ad11b607a49328be83f
3c64aa8d6244d261c794d325fd47e992f32f6a76
2063 F20101108_AAADJD yang_z_Page_022.txt
26c19cce0a6b51e697e937bc8588ed6e
a7a122bd06fa46e954bc60ad21a9179d3cc7730f
27148 F20101108_AAAEMF yang_z_Page_087.QC.jpg
8534020ebd9e4555edfce36190d6e067
e9e31e8ea99a020a8b3c05957ced66126767e592
F20101108_AAADIP yang_z_Page_024.tif
2a102c9394e40123b0d17eb28e4a75dd
ddb89e48cb586e9e68ea62da5f8f2583da98dcdb
15854 F20101108_AAAELR yang_z_Page_078.QC.jpg
64ecb8b685c761e663dcf514699e4cdf
d401567dadaa922a286260f0712ecd98f1c9b31f
41678 F20101108_AAADJE yang_z_Page_052.pro
c8e82db737245d12cfe27e19945ef388
4866c752116957ea24fca6f484adcd71b6b9c504
7364 F20101108_AAAEMG yang_z_Page_087thm.jpg
126c4aa6cae962e4260a9729734ed046
5edf06bc354ecf2d78341a307b7d17982509a7ea
22489 F20101108_AAADIQ yang_z_Page_009.QC.jpg
ed53e33d3dae715b44839c9c65089805
9d2a0f15a4e5ee0c7b2d5fe3695d6dd4f88285ab
4539 F20101108_AAAELS yang_z_Page_078thm.jpg
9416b29c01aeac4ad6495a5d8f36d8c9
f45d8fbea0aaf67c029e57c31f81a9e0faa742fb
1025121 F20101108_AAADJF yang_z_Page_036.jp2
9be508fb2ebc21c925723bf0ff750ec5
975650352c71a540a90d15a3a68aa3610db79ce3
22873 F20101108_AAAEMH yang_z_Page_088.QC.jpg
d6f8effc0241ebbfc427dde1fa1b8f5d
6bd6c1a386511d317bb1961fd94b102c9b4d9b3b
5920 F20101108_AAAEMI yang_z_Page_088thm.jpg
404dee35d43b0b030ae48f6b36a8257c
00ce9fa33ae5f373ac7ace57b7d69725c73c1c7b
F20101108_AAADIR yang_z_Page_112.jp2
c60a1789fd3075f7390ba2d3a5a14786
4a0ebf77eee15949a737c6e13fad0c5a490203d3
21647 F20101108_AAAELT yang_z_Page_079.QC.jpg
19c21e6235119af4473523876782320d
487f0bcc355db7bf236e0f998a846cb75df01e1d
22852 F20101108_AAADJG yang_z_Page_030.QC.jpg
673bd5010203ff430a7b6223109be022
4bfaad4ec117f9a1f84fe94ba4e54e33cb5b9b52
24114 F20101108_AAAEMJ yang_z_Page_089.QC.jpg
72ffb3987a9def61700e0636a174a1e4
32ffce19b25ffe898a5c10bcf32866458dc78216
F20101108_AAADIS yang_z_Page_082.jp2
5bd9ad62af82bfbd69770fe16a29508b
e9ba5607ae888c90a438f6a5459456adb66b231f
5814 F20101108_AAAELU yang_z_Page_079thm.jpg
edb780c547a95ee99f2649b152384b83
adbb4768b23d7dc36165af3c392aa1f480f0a8b2
941744 F20101108_AAADJH yang_z_Page_095.jp2
773bdd13e57221d10629a88dce3e1c9c
45beac9dbcb81ddb374e99341274279919bc17ee
F20101108_AAAEMK yang_z_Page_090.QC.jpg
84f41c42708685bde2d4d81bfc918a00
92476215a1ca66142fe3cd800e2231419c2f1e5f
F20101108_AAADIT yang_z_Page_002.tif
4fcf53685575ff8a2fa0d69c72960af9
c78db29d092f1a4c79bdcbcf164d79d0a4ed98f9
25409 F20101108_AAAELV yang_z_Page_080.QC.jpg
31d986cd3f5710ee4abb996d94eca1ce
4a98369e06880132a5521ea8ee8d95baa7fb6abe
22892 F20101108_AAADJI yang_z_Page_010.QC.jpg
1bcde6e6145db1dbfdf5144cf6553229
dde1511cbff711802c88a2c4ed7c2f0171df193f
6100 F20101108_AAAEML yang_z_Page_090thm.jpg
4b82b349b25747642d19a1039beee3b7
de22dc2f5524e316431f5d3b1ab35dd5f4a98f89
23587 F20101108_AAADIU yang_z_Page_037.pro
2040d70750a6df555423e7a2bd0ae497
1ddec274e471119b8e4289d22ac20745d70112ac
6914 F20101108_AAAELW yang_z_Page_080thm.jpg
9da76e8d1cb0b3954af543c86e4f8ecb
2979d4f79c1015f757463fe0f9db26895a8ff34c
F20101108_AAADJJ yang_z_Page_113.tif
7b8cec994997789c778c09cbbe8fda1c
9e023e98231c2e126300c4f7fbae8f7c7619e397
7387 F20101108_AAAENA yang_z_Page_099thm.jpg
53c7acd0ae17cdf84351395f31dd922c
eac673aad3136a7e7ab8b126ba5ab4ef357d04b6
22742 F20101108_AAAEMM yang_z_Page_091.QC.jpg
7529c9275c861b2b2b1a80d1232ab72a
25810bba15577a5f24e5a03c3d1212773e2f9df1
84727 F20101108_AAADIV yang_z_Page_021.jpg
a9b7a2a2f29d52c4a8e9110da39d9e48
681b12c8f2e9419626d103ec02c75cfd79a92bb6
19470 F20101108_AAAELX yang_z_Page_081.QC.jpg
228cd9d9ce767169553665873bf52c9b
48cbb7b090c6e66405aa76ee5795f6a48473dfb3
6174 F20101108_AAADJK yang_z_Page_039thm.jpg
44c3025a142392ab63a1a3d8a2e1a1fa
cd8e56e0d6a07fed330286891775d6c01da900ad
26048 F20101108_AAAENB yang_z_Page_100.QC.jpg
c9b89fba8661cb9c6afede53b7e43adc
534b54af2075fb9ca4e6a0a71418a79cfd4d906f
6352 F20101108_AAAEMN yang_z_Page_091thm.jpg
3e43911ffd6fb6f7c846b81f89f60a3c
a4f0b6f0401cbf3862f4b417ac10638b8d54a41a
2239 F20101108_AAADIW yang_z_Page_084.txt
fed9cd53390baa148333a117002539c4
9c78e73ae5d87bce7a357c025c548bc7165bb9e8
5426 F20101108_AAAELY yang_z_Page_081thm.jpg
b0e58cc0e5d389d8393231e488db637c
763a9cad0b0d775522c734d3927f3c0027a7d8e7
7542 F20101108_AAADJL yang_z_Page_084thm.jpg
798481d6a0a735aac5f122c490e105c0
930a568da49cc806e4eb2a3b8524a4c14641aa21
7043 F20101108_AAAENC yang_z_Page_100thm.jpg
aee5506a69cb254017b38202e92fd4fd
ff0edc2bfd6882c310b5a4ed165e3c8c2cc0d814
27319 F20101108_AAAEMO yang_z_Page_092.QC.jpg
66c402c997a9ec376f6824af9fb4f962
1fea06690bb4b61afbab1774d68dc8542c479b3c
21081 F20101108_AAADIX yang_z_Page_067.QC.jpg
97009e665ad7dc42261d0b2805a49d46
1e5bd9fcea88a68654b357f712f4662b8f057957
7031 F20101108_AAAELZ yang_z_Page_082thm.jpg
9bfe821a8ed29c5f2dfee16614c84e69
6287d5f1711148ac3f2ec2c3b373899310a3ff12
6395 F20101108_AAADKA yang_z_Page_089thm.jpg
a5b7cc006512e99581287fa53650d9be
ec83cfa4827c8c24a2b08ac39de72a9fa0d565d8
1860 F20101108_AAADJM yang_z_Page_077.txt
f46f202839a83df81c4bebba508c6a66
725b1da01ed02a4f2fdea3143dc5db8cadba3bee
27723 F20101108_AAAEND yang_z_Page_101.QC.jpg
348fe39f0aedaaf06038b311f34dd0b3
6a83916e3db68581bc0dda46af2ee4c44a938490
7457 F20101108_AAAEMP yang_z_Page_092thm.jpg
83d5329da1add7dd544a72378b174f33
2e985b0bbb11f6e6dd45145b42dae2614b81b5f2
F20101108_AAADIY yang_z_Page_104.tif
6e38c92e743e08eac09b590d03625ca5
0cba416fdbaf797b8a9c0f5e3e51410ee630767d
F20101108_AAADKB yang_z_Page_069.tif
3f8acaa0c704cb597eddf26da5ec337f
f58fa716d74c6c61fc2dde2ded16084ee2173801
5639 F20101108_AAADJN yang_z_Page_029thm.jpg
afa7de18ce44064ae86e059afe16974b
2b46da138db2623d07663e0d421f55ae52c09444
7494 F20101108_AAAENE yang_z_Page_101thm.jpg
56f4faa7e5228be58fafc3ca116ccf9e
862f7552539c1fecde240ec5256d81561d76a95e
27535 F20101108_AAAEMQ yang_z_Page_093.QC.jpg
2f16ccd3a48b075b201e56205beb9d43
bf5a3062476aa63bd1267b81f50df81798cfd032
2500 F20101108_AAADIZ yang_z_Page_023thm.jpg
eb60c7cc85bee170adba52c768be60ff
b4650a7b062a2b45801d4460ce225eab8e95a418
2658 F20101108_AAADKC yang_z_Page_112.txt
422a11de5004c47c9d4abfa2596e5189
e48bf645a1e7cc8500bf1bc79b6c498bd5ce9cd6
F20101108_AAADJO yang_z_Page_086.tif
e060fd8dbd4a0cd770bf5c5384df7410
ec3cfe54d7b244ef6bcf43208affeca88041b577
22103 F20101108_AAAENF yang_z_Page_102.QC.jpg
21b51c9504bc5e08b06cb3658208a4e6
c8ef7595cbbab1cb47ed057656d980a17ae0027f
F20101108_AAADKD yang_z_Page_115.jp2
81041e3a2954eba671460d14e9e1b184
5d52af5f2aaac6eefeb414596024cf3a897d0c52
2080 F20101108_AAADJP yang_z_Page_097.txt
b814d4fdbd8ea6ff5b935ce23224d6b5
6c5f134e4231f3b2e061d8953262450a2c5db927
29457 F20101108_AAAEMR yang_z_Page_094.QC.jpg
617fa6310b4bf23a0ed21fe7df93e901
4652017f19d628916520cb347979df16725d21c6
20442 F20101108_AAAENG yang_z_Page_103.QC.jpg
95e256ce7b88f45b3e7862b824fe650f
3499041b9acae66639d80b6be7cd98b00a4451c9
85880 F20101108_AAADKE yang_z_Page_026.jpg
dd33dcdabeac154ace51ab0526f46aa6
69c62a078f69f4a7b36e1e55d19ecf84416891a8
88496 F20101108_AAADJQ yang_z_Page_014.jpg
9c203c59b6341c7090d99519f4751044
276ee0f11464ef83895111bbe9a7f898865f224a
21993 F20101108_AAAEMS yang_z_Page_095.QC.jpg
1e24414061ef1fbcb697a25e48c7c636
3cbb18a46acd62f9b2a668e16238eccce9abc5c8
5629 F20101108_AAAENH yang_z_Page_103thm.jpg
c3b217c82b8bab66df5834b669801243
c315f6d81b8957ff915a77589e355b964316765f
27957 F20101108_AAAEMT yang_z_Page_096.QC.jpg
15c33702beb53a69c0b699f8231656f1
28615dbbd11319a9e7b388c04efb7931ef559434
37801 F20101108_AAADKF yang_z_Page_083.jpg
7b243721e02ec9d5294d1a658b652025
bff0fff582a0cedee0feba9b7fa426da658bbd1d
F20101108_AAADJR yang_z_Page_012.tif
14796feb0d53546d309b5d9e8862d753
4023b475fda3f75c8907b420b3d16dc12d2f4203
28215 F20101108_AAAENI yang_z_Page_105.QC.jpg
67635e1e2505f78b5360d45fecd162fa
56c894b78de11589e7a8564e4af875a86fe35bc7
6001 F20101108_AAADKG yang_z_Page_095thm.jpg
6ed798b8202ae9e22202da88e28f6953
a9f703bee490a7fc81656790fa3aac265539d5a4
7470 F20101108_AAAENJ yang_z_Page_105thm.jpg
e69be3c25f4b557b085b337861cdede2
7f7b9953e90395e4c122525bb9a7ef27a576d977
F20101108_AAAEMU yang_z_Page_096thm.jpg
be3cba7c38157efe2ec5d7fc5f792a8b
d288ebe4403f81bd91901dd7697cbd57d6424e82
39272 F20101108_AAADKH yang_z_Page_079.pro
3e09a59cd34abc24f4c4e54ee5470091
3c476f26d47ca73acc3c6bde9118ebeac379961e
82858 F20101108_AAADJS yang_z_Page_076.jpg
f349f0a664d2b4e99f5c2d2dd4219c97
343e4545f01dd58edb3c186992a131fee973b160
26596 F20101108_AAAENK yang_z_Page_106.QC.jpg
060be474e11ece40a0ccfaec941c38ff
22a911b46a3cf230b2f1e60ae267fee7f21e9a67
26261 F20101108_AAAEMV yang_z_Page_097.QC.jpg
fa84ee90e1dc28f4be5de6598161f1a2
8ec3a65beefb05ab42b0d8af2b6ce1b5fb7ce9ad
19684 F20101108_AAADKI yang_z_Page_067.pro
4705112f26891f9b97821631ef2e5d5f
82735d6d459918208bde486b806ff2c2c3d767b6
486 F20101108_AAADJT yang_z_Page_007.txt
33ef0e3f5979bea393c4fc6abd3a98b1
0c4e249e55135e87fe86825b965adcabedc14067
7153 F20101108_AAAENL yang_z_Page_106thm.jpg
e46af9bfb68984b267c6f30ae815c90f
003738f8c5f72024585b8411003f817e175e17fa
7378 F20101108_AAAEMW yang_z_Page_097thm.jpg
2d1925ab5fec12e29fc8e8c228d53930
44c1e06cf70868a2273476d091dfbbc9262361d4
2884 F20101108_AAADKJ yang_z_Page_111.txt
f24522ff13e099136f74ab34a45bcfc0
7a7a3682753d26f2a37604c229c609aba9a4c6b1
F20101108_AAADJU yang_z_Page_029.tif
a185894adcdb3d1c5014285421acb1da
3ae028dee1bdb0b543ad4e783de461cd759f43c4
7438 F20101108_AAAEOA yang_z_Page_115thm.jpg
1b68d97ad57aef65b90cb61407faa463
12c6bcea408dca29a21ed07bf50d84abeffae29a
24178 F20101108_AAAENM yang_z_Page_108.QC.jpg
a584f35219517cf7b3e3c048c8e1bb96
55b6105cd87f89e27abf7d05f030ec012f37414c
27851 F20101108_AAAEMX yang_z_Page_098.QC.jpg
83311c2679cc30c1b2195fd089e5bb58
62d1cb983fa3803d047c75a29aeb3a854b8ee856
F20101108_AAADKK yang_z_Page_048thm.jpg
9653c29304a5221f20faa7492e52e080
92f1d2125086589898d43c974e6c5743a152f87f
F20101108_AAADJV yang_z_Page_059.jp2
b5db4c3787a13f1c4d8c6b2da0c01950
da2184e53e777971d03087b622766b95b018e0c8
27545 F20101108_AAAEOB yang_z_Page_116.QC.jpg
3bc7c62f63976a14e3b941de0a51b02d
d4afbcc87a6d1f35ff0bad7c62ba467e6be65836
6717 F20101108_AAAENN yang_z_Page_108thm.jpg
fa28cdc81284f81f1bbc72a5bbf0fb10
d0b0cec7254f77f6925993b0d44c4a15f6056988
7597 F20101108_AAAEMY yang_z_Page_098thm.jpg
a1989e6d8d77ab86fe1283d53b3b4fe6
d83f509e128f3b14fe5e93b3e824202667f20162
85665 F20101108_AAADKL yang_z_Page_100.jpg
2380c07c2cea250eba4dbfe8cca0c96f
31b37202b58b83968aec211a7ee865ab2ea9e067
25981 F20101108_AAADJW yang_z_Page_034.QC.jpg
10c96f132108da5543bdd1c42b3d7e71
182ce949f8a0d558647e7b619980b5ea51cfebeb
7666 F20101108_AAAEOC yang_z_Page_116thm.jpg
19b76179978175d78d551deea8f78463
fb2708f903599e58f9558cfb86317779b3ead0c6
13728 F20101108_AAAENO yang_z_Page_109.QC.jpg
4ff1bf62de91acda9e1d40c2c0e7580b
783431d688a8e6417a7db58ca065a4f8f6a24bbd
27766 F20101108_AAAEMZ yang_z_Page_099.QC.jpg
0b89cba6f3b2dd2562ba09bc738292af
31884e29843942355ca7e23477530196cb91ad2f
29107 F20101108_AAADLA yang_z_Page_011.jpg
a67c8d61bc5940e27fbb2865ea25b7e3
e561542c6c0f055cc3ac9dda420dc8a1164fb018
F20101108_AAADKM yang_z_Page_003.tif
798d01edd847cdb819674d8b74f5d789
8d230317f99cea6050f84c2deb098db36b474b9d
F20101108_AAADJX yang_z_Page_112.tif
cd16b482590d11820588610de0f83552
c717d569e081277f68113892d3805b385836e6ff
5779 F20101108_AAAEOD yang_z_Page_117.QC.jpg
2d1298dc7bbc401b629ad81aa906764a
01a25a6e5b389deb637191ab68113ea3f9e7ed5a
4094 F20101108_AAAENP yang_z_Page_109thm.jpg
e8b105eec573c827dbee8ca5dd6fd33f
0638e763b5bbc8f9d29b675b254b316cbfc37596
2211 F20101108_AAADLB yang_z_Page_009.txt
6cf7e93b39282e5435d98fbe2d2359e9
ff5867769d291da2e56590771b2649cdb2849436
2099 F20101108_AAADKN yang_z_Page_016.txt
bcdcc0299ce1ce0ed9d7f1a566c4ec0d
20f369b4a0eed9d9e3c4a2fdcef7a8fc6d08df2a
53600 F20101108_AAADJY yang_z_Page_044.pro
5eea73070c1901da0b223ce3db80911b
e4c51756e5107cb977982ad2148a82215588266d
2119 F20101108_AAAEOE yang_z_Page_117thm.jpg
026b325421d89bbec8bce35b635bd0b6
124bb6700be85ff05cb36637aaa589c12590d7d6
28567 F20101108_AAAENQ yang_z_Page_110.QC.jpg
7e731edc367aed6d80a852fa31ae1b8b
420c50774e1b8bcd0b331ce63a67bf7a98527767
106607 F20101108_AAADLC yang_z_Page_111.jpg
eec85168910f80ed7cdcbb6582fc2d01
133c68fa6a4a55c407c73ec4e17647c18e8ea578
50546 F20101108_AAADKO yang_z_Page_021.pro
3b2e5c9f4a66b73384f1b090bb7ccd28
5413c14a726b6962a24c1b748238312890c204cb
2762 F20101108_AAADJZ yang_z_Page_070thm.jpg
c0c89816a54ae763ad3f2660e569ec82
7a51c468f5bb32a2a01791534fd5116fdad4e380
7734 F20101108_AAAENR yang_z_Page_110thm.jpg
b7b3ae3b44d1df31a0c719857156125c
9f99dc3ecd9f0a486bbb5e2d288443e4e29730b2
83584 F20101108_AAADLD yang_z_Page_087.jpg
7e4062ddc385a5a9a4f18abd0952c22a
83b76e2e8ab8b250e172605193f67f8b1323cb1b
27212 F20101108_AAADKP yang_z_Page_055.QC.jpg
cf0e352a6b25112c0fab70487d93cc20
52311ce206bd4d5e70ce688685dc7134fc5a4d64
29885 F20101108_AAAENS yang_z_Page_111.QC.jpg
007d71926a7679703f641be813dfdf7c
5fcf8a5d6bb03effe670a6738ba07b5b42c1610e
F20101108_AAADLE yang_z_Page_042.tif
25332b5901bc00068df239257ca8bf09
6d09e1cd378e2f0f4f28afbe84dfce715a019b5b
48662 F20101108_AAADKQ yang_z_Page_060.pro
63f463c8603f1c0d67b11f4bcc7de99b
d518467991d0af6da8f94458a7d65b17bb7cb0b3
7886 F20101108_AAAENT yang_z_Page_111thm.jpg
8196f7b8083e73508f81d033fc0716a4
a1d4b8c98d60195ba9399dbd917ddbaa9d72fe88
1302 F20101108_AAADLF yang_z_Page_003thm.jpg
d222261e8389bb049f178b3c4fc2e9cb
46b591e3f237df0eb87f7a2dcc7100d5e5cd4f99
2424 F20101108_AAADKR yang_z_Page_008.txt
d59cc51bcf527ee11c228cac2f657b88
a25b9233a222ec064b5c75dd8e7dfcaca02930ff
27849 F20101108_AAAENU yang_z_Page_112.QC.jpg
c36ccef3787211dc6cb1e160b28db59d
5528067093aaaf72994681159d1d64453506bd1a
17025 F20101108_AAADLG yang_z_Page_073.QC.jpg
df13cacfaf53f9bd548130d624aa1247
f8baee26089c88f51b333c2f298a065b3d3af9aa
5943 F20101108_AAADKS yang_z_Page_107thm.jpg
5a90fb3581ff6bc7499e773f944f3a48
405624d08cce957bb2767c369ba72d3cef5f3233
7573 F20101108_AAADLH yang_z_Page_076thm.jpg
722db242f3f001a193eb520dba256787
67b292ce6100cea789bc31ed47e3f384bfae8ca8
7602 F20101108_AAAENV yang_z_Page_112thm.jpg
682536ff73ab14dc96e1c0fb57ac8986
b328a4c3c7b7add1e91d44f966ba7d9944df5169
55931 F20101108_AAADLI yang_z_Page_056.pro
bd2988408d5371de28df1f7b773123f1
5ad006ff9287c42f875d2d3757e53e225358166b
1805 F20101108_AAADKT yang_z_Page_004.pro
89806b4c1e943e203ca81ea2b4cf9282
df0a555a4fc1b1a958898299422e3fa2a0a0480b
29569 F20101108_AAAENW yang_z_Page_113.QC.jpg
e402fe5a70edf93acdb6817552cd3ca7
fadd4fae6db642f7f592d5c37e249daca93742e5
22888 F20101108_AAADLJ yang_z_Page_043.pro
13a80d21ee1620cc7959393442477366
3b04511ca929afdf3e788da86889ec0d6c73340d
69848 F20101108_AAADKU yang_z_Page_042.jpg
d8433cb9e1163742b417e6acb75560f9
2a9b91c00f93e4b8b3557ae9ddcf2d0b73c23b91







IMPROVING MEMORY HIERARCHY PERFORMANCE WITH ADDRESSLESS
PRELOAD, ORDER-FREE LSQ, AND RUNAHEAD SCHEDULING


By

ZHEN YANG


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007


© 2007 Zhen Yang


To my family


ACKNOWLEDGMENTS

Thanks to all for their help and guidance.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

    1.1 Exploitation of Memory Level Parallelism with P-load
    1.2 Least Frequently Used Replacement Policy in Elbow Cache
    1.3 Tolerating Resource Contentions with Runahead on Multithreading Processors
    1.4 Order-Free Store Queue Using Decoupled Address Matching and Age-Priority Logic
    1.5 Benchmarks, Evaluation Methods, and Dissertation Outline

2 EXPLOITATION OF MEMORY LEVEL PARALLELISM WITH P-LOAD

    2.1 Introduction
    2.2 Missing Memory Level Parallelism Opportunities
    2.3 Overlapping Cache Misses with P-loads
        2.3.1 Issuing P-loads
        2.3.2 Memory Controller Design
        2.3.3 Issues and Enhancements
    2.4 Performance Evaluation
        2.4.1 Instructions Per Cycle Comparison
        2.4.2 Miss Coverage and Extra Traffic
        2.4.3 Large Window and Runahead
        2.4.4 Interconnect Delay
        2.4.5 Memory Request Window and P-load Buffer
    2.5 Related Work
    2.6 Conclusion

3 LEAST FREQUENTLY USED REPLACEMENT POLICY IN ELBOW CACHE

    3.1 Introduction
    3.2 Conflict Misses
    3.3 Cache Replacement Policies for Elbow Cache
        3.3.1 Scope of Replacement
        3.3.2 Previous Replacement Algorithms
        3.3.3 Elbow Cache Replacement Example
        3.3.4 Least Frequently Used Replacement Policy
    3.4 Performance Evaluation
        3.4.1 Miss Ratio Reduction
        3.4.2 Searching Length and Cost
        3.4.3 Impact of Cache Partitions
        3.4.4 Impact of Varying Cache Sizes
    3.5 Related Work
    3.6 Conclusion

4 TOLERATING RESOURCE CONTENTIONS WITH RUNAHEAD ON MULTITHREADING
PROCESSORS

    4.1 Introduction
    4.2 Resource Contentions on Multithreading Processors
    4.3 Runahead Execution on Multithreading Processors
    4.4 Performance Evaluation
        4.4.1 Instructions Per Cycle Improvement
        4.4.2 Weighted Speedups on Multithreading Processors
    4.5 Related Work
    4.6 Conclusion

5 ORDER-FREE STORE QUEUE USING DECOUPLED ADDRESS MATCHING AND
AGE-PRIORITY LOGIC

    5.1 Introduction
    5.2 Motivation and Opportunity
    5.3 Order-Free SQ Directory with Age-Order Vectors
        5.3.1 Basic Design without Data Alignment
        5.3.2 Handling Partial Store/Load with Mask
        5.3.3 Memory Dependence Detection for Stores/Loads Across 8-byte Boundary
    5.4 Performance Results
        5.4.1 IPC Comparison
        5.4.2 Sensitivity of the SQ Directory
    5.5 Related Work
    5.6 Conclusion

6 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

1-1. SimpleScalar simulation parameters

1-2. PTLSim simulation parameters

3-1. Searching levels, extra tag access, and block movement

5-1. Percentage of forwarded load using an ideal SQ

LIST OF FIGURES

Figure

2-1. Gaps between base and ideal memory level parallelism exploitations

2-2. Example tree-traversal function from Mcf

2-3. Pointer Chasing: A) Sequential accesses; B) Pipeline using P-load

2-4. Example of issuing P-loads seamlessly without load address

2-5. Basic design of the memory controller

2-6. Performance comparisons: A) Instructions Per Cycle; B) Normalized memory access time

2-7. Miss coverage and extra traffic

2-8. Sensitivity of P-load with respect to instruction window size

2-9. Performance impact from combining P-load with runahead

2-10. Sensitivity of P-load with respect to interconnect delay

2-11. Sensitivity of P-load with respect to memory request window size

2-12. Sensitivity of P-load with respect to P-load buffer size

3-1. Connected cache sets with multiple hashing functions

3-2. Cache miss ratios with different degrees of associativity

3-3. Example of search for replacement

3-4. Distribution of scopes using two sets of hashing functions

3-5. Replacement based on time-stamp in elbow caches

3-6. Miss ratio reduction with caching schemes and replacement policies

3-7. Miss ratio for different cache associativities

3-8. Miss rate for different cache sizes

4-1. Weighted instruction per cycle speedups for multiple threads vs. single thread on simultaneous multithreading

4-2. Average memory access time ratio for multiple threads vs. single thread on simultaneous multithreading

4-3. Basic function-driven pipeline model with runahead execution

4-4. Instructions per cycle with/without runahead on simultaneous multithreading

4-5. Weighted speedup of runahead execution on simultaneous multithreading

4-6. Average memory access time ratio of runahead execution on simultaneous multithreading

4-7. Weighted speedup of runahead execution between two threads running in the simultaneous multithreading mode and running separately in a single-thread mode

4-8. Ratios of runahead speedup in simultaneous multithreading mode vs. runahead speedup in single-thread mode

5-1. Accumulated percentage of stores and store addresses for SPEC applications

5-2. The average number of stores and unique store addresses

5-3. Mismatches between the latest and the last store for dependent load

5-4. SQ with decoupled address matching and age priority

5-5. Decoupled SQ with partial store/load using mask

5-6. IPC comparison

5-7. Comparison of directory full in decoupled SQ and late-binding SQ

5-8. Comparison of load re-execution

5-9. Sensitivity on the SQ directory size

5-10. Sensitivity on the leading-1 detector width

5-11. Sensitivity on way prediction accuracy

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

IMPROVING MEMORY HIERARCHY PERFORMANCE WITH ADDRESSLESS
PRELOAD, ORDER-FREE LSQ, AND RUNAHEAD SCHEDULING

By

Zhen Yang

December 2007

Chair: Jih-Kwon Peir
Major: Computer Engineering

The average memory access latency is determined by three primary factors, cache hit

latency, miss rate, and miss penalty. It is well known that cache miss penalty in processor cycles

continues to grow. For those memory-bound workloads, a promising alternative is to exploit

memory-level parallelism by overlapping multiple memory accesses. We study P-load scheme

(P-load stands for Preload), an efficient solution to reduce the cache miss penalty by overlapping

cache misses. To reduce cache misses, we also introduce a cache organization with an efficient

replacement policy to specifically reduce conflict misses.

A recent trend is to fetch and issue multiple instructions from multiple threads at the same

time on one processor. This design benefits much from resource sharing among multiple threads.

However, contentions of shared resources including caches, instruction issue window and

instruction window may hamper the performance improvement from multi-threading schemes. In

the third proposed research, we evaluate a technique to solve the resource contention problem in

multi-threading environment.

Store-load forwarding is a critical aspect of dynamically scheduled execution in modern

processors. Conventional processors implement store-load forwarding by buffering the addresses

and data values of all in-flight stores in an age-ordered store queue. A load accesses the data

cache and in parallel associatively searches the store queue for older stores with matching

addresses. Associative structures can be made fast, but often at the cost of substantial additional

energy, area, and/or design effort. We introduce a new order-free store queue that decouples the

matching of the store/load address and its corresponding age-based priority encoding logic from

the original store queue and largely decreases the hardware complexity.

CHAPTER 1
INTRODUCTION

Computer pioneers correctly predicted that programmers would want unlimited amounts of

fast memory. An economical solution to that desire is a memory hierarchy, which takes

advantage of locality and the cost-performance of memory technologies. The principle of locality, that

most programs do not access all instructions or data uniformly, combined with the guideline that

smaller hardware is faster, led to hierarchies based on memories of different speeds and sizes.

Cache is used to reduce the average memory access latency. It is a smaller, faster memory

which stores copies of instructions or data from the most frequently used main memory

locations. As long as most memory accesses are to cached memory locations, the average latency

of memory accesses will be closer to the cache access latency than to the latency of main

memory.

The average memory access latency can be determined by three primary factors: the time

needed to access the cache (hit latency), the fraction of memory references that cannot be

satisfied by the cache (miss rate), and the time needed to fetch data and instructions from the next

level of the memory hierarchy (miss penalty).

Average Memory Access Latency = (Hit Latency) + (Miss Rate) x (Miss Penalty)

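As a simple illustration (the parameter values are hypothetical, not measured): a 2-cycle hit
latency, a 5% miss rate, and a 300-cycle miss penalty give 2 + 0.05 x 300 = 17 cycles. The small
C sketch below computes the same quantity:

#include <stdio.h>

/* Illustrative only: average memory access latency from the three factors above. */
static double amat(double hit_latency, double miss_rate, double miss_penalty)
{
    return hit_latency + miss_rate * miss_penalty;
}

int main(void)
{
    printf("AMAT = %.1f cycles\n", amat(2.0, 0.05, 300.0));   /* prints 17.0 */
    return 0;
}
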
It is well known that cache miss penalty in processor cycles continues to grow because the

rapid improvements in processor clock frequencies have outpaced the improvements in memory

and interconnect speeds, which is the so-called "memory wall" problem. For those memory-

bound workloads, a promising alternative is to exploit memory-level parallelism (MLP) by

overlapping multiple memory accesses. In the first proposed research, we will study P-load

scheme (P-load stands for Preload), an efficient solution to reduce the cache miss penalty by

overlapping cache misses.

A cache array is broken into fixed-size blocks. Typically, a cache can be organized in three

ways. If each block has only one place it can appear in the cache, the cache is said to be direct

mapped. If a block can be placed anywhere in the cache, the cache is said to be fully associative.

If a block can be placed in a restricted set of places in the cache, the cache is set associative. A

block is first mapped onto a set, and then the block can be placed anywhere within that set. The

set is usually chosen by decoding a set of index bits from the block address.
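
As a minimal sketch of the index decoding just described (the 64-byte block size and 256 sets
below are assumed for illustration only):

#include <stdint.h>

/* Illustrative geometry: 64B blocks, 256 sets (both powers of two). */
#define BLOCK_SIZE 64u
#define NUM_SETS   256u

/* Drop the block-offset bits, then decode the index bits. */
static uint32_t set_index(uint32_t addr)
{
    return (addr / BLOCK_SIZE) % NUM_SETS;
}

/* The remaining high-order bits form the tag used for the match. */
static uint32_t tag_bits(uint32_t addr)
{
    return (addr / BLOCK_SIZE) / NUM_SETS;
}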

In order to lower cache miss rate, a great deal of analysis has been done on cache behavior

in an attempt to find the best combination of size, associativity, block size, and so on. Cache

misses can be divided into three categories (known as the Three Cs): Compulsory misses are

those misses caused by the first reference to a datum. Cache size and associativity make no

difference to the number of compulsory misses. Capacity misses are those misses that a cache of

a given size will have, regardless of its associativity or block size. Conflict misses are those

misses that could have been avoided, had the cache not evicted an entry earlier. In the second

proposed research, a cache organization with an efficient replacement policy is introduced to

specifically reduce conflict misses.

A recent trend is to fetch and issue multiple instructions from multiple threads at the same

time on one processor. This design benefits much from resource sharing among multiple threads.

It outperforms previous models of hardware multithreading primarily because it hides short

latencies much more effectively, which can often dominate performance on a uniprocessor.

However, contentions of shared resources including caches, instruction issue window and

instruction window may hamper the performance improvement from multi-threading schemes. In

the third proposed research, we will evaluate a technique to solve the resource contention

problem in multi-threading environment.

Store-load forwarding is a critical aspect of dynamically scheduled execution.

Conventional processors implement store-load forwarding by buffering the addresses and data

values of all in-flight stores in an age-ordered store queue (SQ). A load accesses the data cache

and in parallel associatively searches the SQ for older stores with matching addresses.

Associative structures can be made fast, but often at the cost of substantial additional energy,

area, and/or design effort. We introduce a new order-free load-store queue that decouples the

matching of the store/load address and its corresponding age-based priority encoding logic from

the original content-addressable memory SQ and largely decreases the hardware complexity. The

performance evaluation shows a significant improvement in the execution time is achievable

comparing with other existing scalable load-store queue proposals.

1.1 Exploitation of Memory Level Parallelism with P-load

Modern out-of-order processors with non-blocking caches exploit MLP by overlapping

cache misses in a wide instruction window. The exploitation of MLP, however, can be limited

due to long-latency operations in producing the base address of a cache miss load. Under this

circumstance, the child load cannot be issued and start to execute until its base register is

produced by the parent instruction. When the parent instruction is also a cache miss load, a

serialization of the two loads must be enforced to satisfy the load-load data dependence. One

typical example is the pointer-chasing problem in many applications with linked data structures

(LDS), where accessing the successor node cannot start until the pointer is available, possibly

from memory. Similarly, indirect accesses to large array structures may face the same problem

when both address and data accesses encounter cache misses. With limited numbers of

instruction window entries and issue window entries in an out-of-order execution and in-order

commit processor, these non-overlapped long-latency memory accesses can congest the

instruction and issue windows and stall the processor.

Each cache miss encounters delays in sending the request to memory controller, accessing

the DRAM array, and receiving the data. The sum of delays in sending request and receiving

data is called the interconnect delay. With a load-load data dependence, resolution of the

dependent load's address and triggering of the dependent load's execution are normally done at the

processor side after the parent's data is returned from memory. In fact, the resolution can be

done at the memory controller as soon as the parent load finishes its DRAM access. In this way, the

dependent load can start to access the DRAM array as soon as the parent's data is fetched from the

DRAM array. The dependent load then waits neither to receive the parent's data at the processor

nor to send its own request back to memory. Hence, the interconnect delays of the two loads can be

overlapped.

Based on the above observation, we propose a mechanism that dynamically captures the

load-load data dependence at runtime. A special P-load is issued in place of the dependent load

without waiting for the parent load, thus effectively overlapping the two loads. The P-load

provides necessary information for the memory controller to calculate the correct memory

address upon the availability of the parent's data to eliminate any interconnect delay between the

two loads. Performance evaluations based on SPEC2000 and Olden benchmarks show that

significant speedups up to 40% with an average of 16% are achievable using the P-load.

1.2 Least Frequently Used Replacement Policy in Elbow Cache

As the performance gap between processor and memory continues to widen, cache

hierarchy designs and performance become even more critical. Applications with regular patterns

of memory access can experience severe cache conflict misses in set-associative cache, in which

the number of cache frames that a memory block can be mapped into is fixed as the set

associativity. When all of the frames in a set are occupied, a newly missed block replaces an old

block according to the principle of memory reference locality. Furthermore, the same block

address bits are used to index every cache partition. If two blocks conflict for a single location in

one partition, they also conflict for the same location in the other partitions. Under these

constraints, heavy conflicts may occur in a few sets due to uneven distribution of memory

addresses across the entire cache sets that cause severe performance degradation.

To alleviate conflict misses, the skewed-associative cache [Seznec 1993a; Seznec and

Bodin 1993b; Bodin and Seznec 1997] employs multiple hashing functions for members in a set.

Each set also consists of one frame from each of the n cache partitions. But the location of the

frame in each partition can be different based on a different hashing function. The insight behind

the skewed-associative cache is that whenever two blocks conflict for a single location in cache

partition i, they have low probability to conflict for a location in cache partition j.

The elbow cache [Spjuth et al. 2005] extends the skewed-associative cache organization by

carefully selecting its victim and, in the case of a conflict, moving the conflicting cache block to its

alternative location in the other partition. In a sense, the new data block "uses its elbows" to

make space for conflicting data instead of evicting it. The enlarged replacement set provides

better opportunity to find a suitable victim for evicting.
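
The sketch below illustrates the idea under simplifying assumptions: a 2-way organization,
illustrative XOR-based skewing functions (not the exact functions from the cited papers), and an
elbow move that simply displaces the occupant of the alternative frame instead of searching the
full enlarged replacement set:

#include <stdint.h>

#define SET_BITS 8u
#define SETS     (1u << SET_BITS)

/* Two illustrative skewing functions: blocks that collide under hash0
   are unlikely to also collide under hash1. */
static uint32_t hash0(uint32_t blk) { return blk & (SETS - 1u); }
static uint32_t hash1(uint32_t blk) { return (blk ^ (blk >> SET_BITS)) & (SETS - 1u); }

struct frame { uint32_t blk; int valid; };
static struct frame part0[SETS], part1[SETS];

/* On a conflict in partition 0, the resident block is not evicted; it is
   "elbowed" into its alternative location in partition 1. */
static void insert_block(uint32_t blk)
{
    struct frame *f = &part0[hash0(blk)];
    if (f->valid)
        part1[hash1(f->blk)] = *f;   /* relocate the conflicting block */
    f->blk = blk;
    f->valid = 1;                    /* new block takes the vacated frame */
}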

It is imperative to design an effective replacement policy to identify a suitable victim for

eviction in the elbow cache, which features an enlarged replacement set. Recency-based

replacement policy like the least recently used (LRU) replacement is generally thought to be the

most efficient policy for processor cache. However, the traditional LRU replacement policy

based on the most recently used-least recently used (MRU-LRU) sequence is difficult to

implement with multiple hashing functions. Since the number of sets grows exponentially with

multiple hashing, it is prohibitively expensive to maintain the needed MRU-LRU sequences.

The least frequently used (LFU) replacement policy considers the frequency of block

references, such that the least frequently used block will be replaced when needed. In fully-

associative cache and set-associative cache, the performance of LRU and LFU is mixed. That's

because fully-associative and set-associative caches with the LRU replacement policy suffer

the worst cache pollution when a "never-reuse" block is moved into the cache. It takes c more

misses to replace a never-reused block, where c is the set associativity. Such a block can be

replaced much faster with LFU replacement once the block has the smallest frequency counter.

We propose a low-cost and effective LFU replacement policy for the elbow cache, which has

cache performance comparable to the recency-based replacement policy.
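
A minimal sketch of the proposed victim selection, assuming each cache frame keeps a small
reference-frequency counter (the structure and field names are illustrative):

#include <stdint.h>

struct lfu_frame {
    uint32_t blk;
    uint32_t freq;    /* incremented on every reference to the block */
    int      valid;
};

/* Pick the least frequently used frame among the n candidate frames
   reachable through all hashing functions, i.e., the enlarged
   replacement set of the elbow cache. */
static int lfu_victim(struct lfu_frame *cand[], int n)
{
    int victim = 0;
    for (int i = 1; i < n; i++)
        if (cand[i]->freq < cand[victim]->freq)
            victim = i;
    return victim;
}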

1.3 Tolerating Resource Contentions with Runahead on Multithreading Processors

Simultaneous Multithreading (SMT) processors exploit both instruction-level parallelism

(ILP) and thread-level parallelism (TLP) by fetching and issuing multiple instructions from

multiple threads to the function units of a superscalar architecture each cycle to utilize wide-issue

slots [Tullsen et al. 1995; Tullsen et al. 1996]. SMT outperforms previous models of hardware

multithreading primarily because it hides short latencies much more effectively, which can often

dominate performance on a uniprocessor. For example, neither fine-grain multithreaded

architectures [Alverson et al. 1990; Laudon et al. 1994], which context switch every cycle, nor

coarse-grain multithreaded architectures [Agarwal et al. 1993b; Saavedra-Barrera et al. 1990],

which context switch only on long-latency operations, can hide the latency of a single-cycle

integer add if there is not sufficient parallelism in the same thread.

In SMT, multiple threads share resources such as caches, functional units, instruction

queue, instruction issue window, and instruction window [Tullsen et al. 1995; Tullsen et al.

1996]. SMT typically benefits from giving threads complete access to all resources every cycle.

But contentions of these resources may significantly hamper the performance of individual

threads and hinder the benefit of exploiting more parallelism from multiple threads. First,

disruptive cache contentions lead to more cache misses and hurt overall performance. Serious

cache contention problems on SMT processors were reported [Tullsen and Brown 2001]. The

optimal allocation of cache memory between two competing threads was studied [Stone et al.

1992]. Dynamic partitioning of shared caches among concurrent threads based on "marginal

gains" was reported in [Suh et al. 2001]. The results showed that significantly higher hit ratios

over the global LRU replacement could be achieved.

Second, threads can hold critical resources while they are not making progress due to long-

latency operations, blocking other threads from executing normally. For example, if the

stalled thread fills the issue window and instruction window with waiting instructions, it shrinks

the window available for the other threads to find instructions to issue and bring in new

instructions to the pipeline. Thus, precisely when parallelism is most needed, when one or more threads are

no longer contributing to the instruction flow, fewer resources are available to expose that

parallelism. Previously proposed methods [El-Moursy and Albonesi 2003; Cazorla et al. 2004a;

Cazorla et al. 2004b] attempt to identify threads that will encounter long-latency operations. The

thread with long-latency operation may be delayed to prevent it from occupying more resources.

A balance scheme was proposed [Cazorla et al. 2003] to dynamically switch between flushing

and keeping long-latency threads to avoid overhead of flushing.

We propose a solution to this problem: runahead execution on SMTs. Runahead

execution was first proposed to improve MLP on single-thread processors [Dundas and Mudge

1997; Mutlu et al. 2003]. Effectively, runahead execution can achieve the same performance

level as that with a much bigger instruction window. We investigate and evaluate runahead

execution on SMT processors with multiple threads running simultaneously. Besides the inherent

advantage of memory prefetching, runahead execution can also prevent a thread with long

latency loads from occupying shared resources and impeding other threads from making forward

progress. Performance evaluation based on a mixture of SPEC2000 benchmarks demonstrates

the performance improvement of runahead execution on SMTs.

1.4 Order-Free Store Queue Using Decoupled Address Matching and Age-Priority
Logic

Store-load forwarding is a critical aspect of dynamically scheduled execution.

Conventional processors implement store-load forwarding by buffering the addresses and data

values of all in-flight stores in an age-ordered SQ. A load accesses the data cache and in parallel

associatively searches the SQ for older stores with matching addresses. The load obtains its value

from the youngest such store (if any) or from the data cache. Associative structures can be made

fast, but often at the cost of substantial additional energy, area, and/or design effort. Furthermore,

these implementation disadvantages compound super-linearly especially for ordered associative

structures like the SQ as structure size or bandwidth scales up. As SQ access is on the load

execution critical path, fully-associative search of a large SQ can result in load latency that is

longer than data cache access latency, which in turn complicates scheduling and introduces

replay overheads.

There have been many recent proposals to design a scalable SQ by getting rid of the

expensive and time-consuming full content-addressable memory (CAM) design. One category of

solution is to adopt a two-level SQ where the small first-level SQ enables fast and energy

efficient forwarding and a much larger second-level SQ corrects and complements the first-level

SQ [Buyuktosunoglu et al. 2002; Gandhi et al. 2005; Sethumadhavan et al. 2006]. A store-

forwarding cache is implemented in [Castro et al. 2006] for store-load forwarding. It relies on a

separate memory disambiguation table to resolve any dependence violation. In [Akkary et al.

2003], a much larger L0 cache is used to replace the first-level SQ for caching the latest store

data. A load, upon a hit, can fetch the data directly from the L0. In this approach, instead of

maintaining speculative load information, the load is also executed in an in-order pipeline

fashion. An inconsistency between the data in L0 and L1 can identify memory dependence

violations. Besides the complexity and extra space, a fundamental issue in this category of

approaches is the heavy mismatch between the latest store and the correct last store for the

dependent load. Such mismatches produce costly re-executions.

We introduce a new order-free SQ that decouples the store address matching and its

corresponding age-order priority logic from the original Content-Addressable Memory (CAM)

SQ where the outstanding store addresses and data are buffered. A separate SQ directory is

maintained for searching the stores in the SQ. Unlike a conventional SQ, a single address is

recorded in the SQ directory for multiple outstanding stores with the same address. Each entry in

the directory is augmented with a new age-order vector to indicate the correct program order of

the stores with the same address. The decoupled SQ directory allows stores to enter the directory

when they are issued which can be different from their program order. It relies on the age-order

vector to recover the correct program order of the stores. The relaxation of the program-order

requirement helps to reduce the directory size as well as to abandon the fully-associative CAM-

based directory that is the key obstacle for a scalable SQ.
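
As a rough sketch of one directory entry, assuming a 64-entry SQ so that the age-order vector
fits in a single 64-bit word (the field names are illustrative, not the exact hardware layout):

#include <stdint.h>

#define SQ_ENTRIES 64

struct sq_dir_entry {
    uint64_t addr;        /* one directory entry per unique store address */
    uint64_t age_vector;  /* bit i set: SQ slot i holds an outstanding
                             store to this address */
    int      valid;
};

/* A load that matches this entry masks off the SQ slots younger than
   itself and then priority-encodes the remaining bits of age_vector
   (e.g., with a leading-1 detector) to find the youngest older store. */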

1.5 Benchmarks, Evaluation Methods, and Dissertation Outline

The SimpleScalar tool set is used to evaluate performance for the first three works in this

dissertation. It consists of compiler, assembler, linker, simulation, and visualization tools for

modern processor architecture and provides researchers with an easily extensible, portable, high-

performance test bed for systems design. It can simulate an out-of-order issue processor that

supports non-blocking caches, speculative execution, and state-of-the-art branch prediction.

Table 1-1. SimpleScalar simulation parameters
Processor
  Fetch/Decode/Issue/Commit Width: 8
  Instruction Fetch Queue: 8
  Branch Predictor: 64K-entry G-share, 4K-entry Branch Target Buffer (BTB)
  Mis-Prediction Penalty: 10 cycles
  Instruction Window/Load-Store Queue Size: 512/512
  Instruction Issue Window: 32
  Processor Translation Lookaside Buffer (TLB): 2K-entry, 8-way
  Integer Arithmetic Logic Unit (ALU): 6 ALU (1 cycle); 2 Mult/Div: Mult (3 cycles), Div (20 cycles)
  Floating Point (FP) ALU: 4 ALU (2 cycles); 2 Mult/Div/Sqrt: Mult (4 cycles), Div (12 cycles), Sqrt (24 cycles)
Memory System
  Level 1 (L1) Instruction/Data Cache: 64KB, 4-way, 64B line, 2 cycles
  L1 Data Cache Ports: 4 read/write ports
  Level 2 (L2) Cache: 1MB, 8-way, 64B line, 15 cycles
  L1/L2 Memory Status Holding Registers (MSHR): 16/16
  Request/Dynamic Random Access Memory (DRAM)/Data Latency: 80/160/80 cycles
  Memory Channels: 4, line-based interleaving
  Memory Request Window: 32
  Channel/Return Queue: 8/8


We modified the SimpleScalar simulator to model an 8-wide superscalar, out-of-order

processor with Alpha 21264-like pipeline stages [Kessler 1999]. We made two enhancements.

First, the original Issue Execute stage is extended to the Issue, Register read, and Execute stages

to reflect the delay in instruction scheduling, operands read, and instruction execution. Second,

instead of waking up dependent instructions at the Writeback stage of the parent instruction, the

dependent instructions are pre-issued after the parent instruction is issued and the delay of

obtaining the result is known. Important simulation parameters are summarized in Table 1-1.

To evaluate the order-free SQ, we modified the PTLsim simulator [Yourst 2007] to model a

cycle accurate full system x86-64 microprocessor. We followed the basic PTLsim pipeline

design, which has 13 stages (1 fetch, 1 rename, 5 frontend, 1 dispatch, 1 issue, 1 execution, 1

transfer, 1 writeback, 1 commit). Important simulation parameters for PTLsim are summarized

in Table 1-2.

Table 1-2. PTLsim simulation parameters
  Fetch/Dispatch/Issue/Commit Width: 32/32/16/16
  Instruction Fetch Queue: 128
  Branch Predictor: 64K-entry G-share, 4K-entry 4-way BTB
  Branch Mis-Prediction Penalty: 7 cycles
  RUU/LQ/SQ Size: 512/128/64
  Instruction Issue Window int0/int1/ld/fp: 64/64/64/64
  ALU/STU/LDU/FPU: 8/8/8/8
  L1 I-Cache: 16KB, 4-way, 64B line, 2 cycles
  L1 D-Cache: 32KB, 4-way, 64B line, 2 cycles, 8 read/write ports
  L2 U-Cache: 256KB, 4-way, 64B line, 6 cycles
  L3 U-Cache: 2MB, 16-way, 128B line, 16 cycles
  L1/L2 MSHRs: 16/16
  Memory Latency: 100 cycles
  I-TLB: 64-entry, fully-associative
  D-TLB: 64-entry, fully-associative


Benchmark programs are used to provide a measure to compare performance. The

SPEC2000 [SPEC2000 Benchmarks] from the Standard Performance Evaluation Corporation is

one of the most widely used benchmark programs in our research community. It consists of two

types of benchmarks, one is the SPECint2000, a set of integer-intensive benchmarks, and the

other is the SPECfp2000, a set of floating-point intensive benchmarks. Another benchmark suite

we evaluated is the Olden benchmarks [Olden Benchmark], which are pointer-intensive

programs built by Princeton University. We follow the studies done in [Sair and Charney 2000]

to skip certain instructions, warm up caches and other system components with 100 million

instructions, and then collect statistics from the next 500 million instructions.

The outline of this dissertation is as follows. In chapter 2, we first study the missing

memory-level parallelism opportunities because of data dependence and then describe P-load

scheme. In chapter 3, the severity of cache conflict misses is demonstrated and a cache


organization with a frequency-based replacement policy is introduced to specifically reduce

conflict misses. In chapter 4, we will evaluate a technique to solve the resource contention

problem in multi-threading environment. In chapter 5, we introduce the order-free SQ that

decouples the matching of the store/load address from its corresponding age-based priority

encoding logic. The dissertation is concluded in chapter 6.

CHAPTER 2
EXPLOITATION OF MEMORY LEVEL PARALLELISM WITH P-LOAD

2.1 Introduction

Over the past two decades, ILP has been a primary focus of computer architecture research

and a variety of microarchitecture techniques to exploit ILP such as pipelining, very long

instruction word (VLIW), superscalar issue, branch prediction, and data speculation have been

developed and refined. These techniques enable current processors to effectively utilize deep

multiple-issue pipelines in applications such as media processing and scientific floating-point

intensive applications.

However, the performance of commercial applications such as databases is dominated by

the frequency and cost of memory accesses. Typically, they have large instruction and data

footprints that do not fit in caches, hence, requiring frequent accesses to memory. Furthermore,

these applications exhibit data-dependent irregular patterns in their memory accesses that are not

amenable to conventional prefetching schemes. For those memory-bound workloads, a

promising alternative is to exploit MLP by overlapping multiple memory accesses.

MLP is the number of outstanding cache misses that can be generated and executed in an

overlapped manner. It is essential to exploit MLP by overlapping multiple cache misses in a wide

instruction window [Chou et al. 2004]. The exploitation of MLP, however, can be limited due to

a load that depends on another load to produce the base address (referred to as load-load

dependence). If the parent load misses the cache, sequential execution of these two loads must

be enforced. One typical example is the pointer-chasing problem in many applications with LDS,

where accessing the successor node cannot start until the pointer is available, possibly from

memory. Similarly, indirect accesses to large array structures may face the same problem when

both address and data accesses encounter cache misses.

There have been several prefetching techniques to reduce penalties on consecutive cache

misses of tight load-load dependence [Luk and Mowry 1996; Roth et al. 1998; Yang and

Lebeck 2000; Vanderwiel and Lilja 2000; Cahoon and McKinley 2001; Collins et al. 2002;

Cooksey et al. 2002; Mutlu et al. 2003; Yang and Lebeck 2004; Hughes and Adve 2005]. Luk et

al. [1996] proposed using jump-pointers, which were further developed by Roth and Sohi [Roth

et al. 1998]. A LDS node is augmented with jump-pointers that point to nodes that will be

accessed in multiple iterations or recursive calls in the future. When a LDS node is visited,

prefetches are issued for the locations pointed by its jump-pointers. They focused on a software

implementation of the four jump-pointer idioms proposed by Roth and Sohi. They also proposed

hardware and cooperative hardware/software implementations that use significant additional

hardware support at the processor to overcome some of the software scheme's limitations. The

hardware automatically creates and updates jump-pointers and generates address for and issues

prefetches. The hardware can eliminate the instruction overhead of jump pointers and reduce the

steady state stall time for root and chain jumping, but it does not affect the startup stall time for

any case and does not eliminate the steady state stall time for root and chain jumping.

The push-pull scheme [Yang and Lebeck 2000; Yang and Lebeck 2004] proposed a

prefetch engine at each level of memory hierarchy to handle LDS. A kernel of load instructions,

which encompass the LDS traversals, is generated by software. The processor downloads this

kernel to the prefetch engine, which then executes the load instructions repeatedly to traverse the LDS.

The lack of address ordering hardware and comparison hardware restricts their scheme's

traversals to LDS and excludes some data dependence. The kernels and prefetch engine would

require significant changes to allow more general traversals. A similar approach with compiler

help has been presented in [Hughes and Adve 2005].

The content-aware data prefetcher [Cooksey et al. 2002] identifies potential pointers by

examining word-aligned content of cache-miss data blocks. The identified pointers are used to

initiate prefetching of the successor nodes. Using the same mechanism to identify pointer loads,

the pointer-cache approach [Collins et al. 2002] builds a correlation history between heap

pointers and the addresses of the heap objects they point to. A prefetch is issued when a pointer

load misses the data cache, but hits the pointer cache.

We introduce a new approach to overlap cache misses involved in load-load dependence.

After dispatch, if the base register of a load is not ready due to an early cache miss load, a special

P-load is issued in place of the dependent load. The P-load instructs the memory controller to

calculate the needed address once the parent load's data is available from the dynamic random

access memory (DRAM). The inherent interconnect delay between processor and memory can

thus be overlapped regardless of the location of the memory controller [Opteron Processors]. When

executing pointer-chasing loads, a sequence of P-loads can be initiated according to the

dispatching speed of these loads.

The proposed P-load makes three unique contributions. First of all, in contrast to the

existing methods, it does not require any special predictors and/or any software-inserted

prefetching hints. Instead, the P-load scheme issues the dependent load early following the

instruction stream. Secondly, the P-load exploits more MLP from a larger instruction window

without the need to enlarge the critical issue window [Akkary et al. 2003]. Thirdly, an enhanced

memory controller with proper processing power is introduced that can share certain

computations with the processor.

2.2 Missing Memory Level Parallelism Opportunities

Overlapping cache misses can reduce the performance loss due to long-latency memory

operations. However, data dependence between a load and an early instruction may stall the load

from issuing. In this section, we will show the performance loss due to such data dependence in

real applications by comparing a baseline model with an idealized MLP exploitation model.



Figure 2-1. Gaps between base and ideal memory level parallelism exploitations.

MLP can be quantified as the average number of memory requests during the period when

there is at least one outstanding memory request [Chou et al. 2004]. We compare the MLPs of

the baseline model and the ideal model. In the baseline model, all data dependence are strictly

enforced. On the contrary, in the ideal model, a cache miss load is issued right after the load is

dispatched regardless of whether the base register is ready or not.
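
The MLP metric itself is straightforward to compute from a cycle-by-cycle count of outstanding
misses; a minimal sketch follows (the outstanding[] array stands in for simulator state):

/* Average number of outstanding misses over the cycles that have at
   least one miss outstanding. */
static double mlp(const unsigned *outstanding, unsigned long cycles)
{
    unsigned long busy = 0, sum = 0;
    for (unsigned long c = 0; c < cycles; c++) {
        if (outstanding[c] > 0) {
            busy++;
            sum += outstanding[c];
        }
    }
    return busy ? (double)sum / (double)busy : 0.0;
}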

Nine workloads, Mcf, Twolf, Vpr, Gcc-200, Parser, and Gcc-scilab from the SPEC2000

integer benchmarks, and Health, Mst, and Em3d from the Olden benchmarks, are selected for this

experiment because of their high L2 miss rates. An Alpha 21264-like processor with a 1MB L2

cache is simulated.

Figure 2-1 illustrates the measured MLPs of the baseline and the ideal models. It shows

that there are huge gaps between them, especially for Mcf, Gcc-200, Parser, Gcc-scilab, Health,

and Mst. The results reveal that significant MLP improvement can be achieved if the delay of

issuing cache misses due to data dependence is reduced.

2.3 Overlapping Cache Misses with P-loads

We describe the P-load scheme using the function refresh_potential from Mcf as shown in

Figure 2-2. Refresh_potential is invoked frequently to refresh a huge tree structure that exceeds

4MB. The tree is initialized with a regular stride pattern among adjacent nodes on the traversal

path such that the address pattern can be accurately predicted. However, the tree structure is

slightly modified with insertions and deletions between two consecutive visits. After a period of

time, the address pattern on the traversal path becomes irregular and is hard to predict accurately.

Heavy misses are encountered when caches cannot accommodate the huge working set.

long refresh_potential (network_t *net)
{
    node_t *node, *tmp;
    node_t *root = net->nodes;
    long checksum = 0;

    tmp = node = root->child;
    while (node != root) {
        while (node) {
            if (node->orientation == UP)
                node->potential = node->basic_arc->cost + node->pred->potential;
            else {
                node->potential = node->pred->potential - node->basic_arc->cost;
                checksum++;
            }
            tmp = node;
            node = node->child;
        }
        node = tmp;
        while (node->pred) {
            tmp = node->sibling;
            if (tmp) {
                node = tmp;
                break;
            }
            else node = node->pred;
        }
    }
    return checksum;
}

Figure 2-2. Example tree-traversal function from Mcf

This function traverses a data structure with three traversal links: child, pred, and sibling

(highlighted in italics), and accesses basic records with a data link, basic_arc. In the first inner

while loop, the execution traverses down the path through the link node->child.


Figure 2-3. Pointer Chasing: A) Sequential accesses; B) Pipeline using P-load

With accurate branch predictions, several iterations of the while loop can be initiated in a

wide instruction window. The recurrent instruction, node = node->child, that advances the

pointer to the next node, becomes a potential bottleneck since accesses of the records in the next

node must wait until the pointer (base address) of the node is available. As shown in Figure 2-3

A), four consecutive node = node->child must be executed sequentially. In the case of a cache

miss, each of them encounters delays in sending the request, accessing the DRAM array, and

receiving the data. These non-overlapped long-latency memory accesses can congest the

instruction and issue windows and stall the processor. On the other hand, the proposed P-load

can effectively overlap the interconnect delay in sending/receiving data as shown in Figure 2-3

B). In the following subsections, detailed descriptions of identifying and issuing P-loads are

given first, followed by the design of the memory controller. Several issues and enhancements

about P-load will also be discussed.

2.3.1 Issuing P-Loads

We will describe P-load issuing and execution within the instruction window and the

memory request window (Figure 2-4) by walking through the first inner while loop of

refresh_potential from Mcf (Figure 2-2).

Instruction window:
ID   Instruction          Request type          Req ID   Disp
101  lw $v0,28($a0)       Load [28($a0)]        -        -
102  bne $v0,$a3,L1
103  lw $v0,32($a0)       (partial hit)
104  lw $v1,8($a0)        (partial hit)
105  lw $v0,16($v0)       P-load [addr(103)]    105      16
106  lw $v1,44($v1)       P-load [addr(104)]    106      44
107  addu $v0,$v0,$v1
108  j L2
109  sw $v0,44($a0)
110  addu $v0,$0,$a0
111  lw $a0,12($a0)       (partial hit)
112  bne $a0,$0,L0
113  lw $v0,28($a0)       P-load [addr(111)]    113      28
114  lw $v0,32($a0)       P-load [addr(111)]    114      32
115  lw $v1,8($a0)        P-load [addr(111)]    115      8
116  lw $v0,16($v0)       P-load [p-id(114)]    116      16
117  lw $v1,44($v1)       P-load [p-id(115)]    117      44
118  lw $a0,12($a0)       P-load [addr(111)]    118      12

Memory request window (thick lines in the original figure divide iterations):
ID    Link   Offset/Address   Disp
New   -      [28($a0)]        -
105   New    32($a0)          16
106   New    8($a0)           44
113   New    12($a0)          28
114   New    12($a0)          32
115   New    12($a0)          8
116   114    -                16
117   115    -                44
118   -      [12($a0)]*       -

* Assuming the entry New has been removed, request 118 uses the address [12($a0)] to fetch
from the DRAM or the P-load buffer.

Figure 2-4. Example of issuing P-loads seamlessly without load address

Assume the first load, lw $v0,28($a0), is a cache miss and is issued normally. The second

and third loads encounter partial hits to the same block as the first load, thus no memory request

is issued. After the fourth load, lw $v0,16($v0), is dispatched, a search through the current

instruction window finds that it depends on the second load, lw $v0,32($a0). Normally, the fourth

load must be stalled. In the proposed scheme, however, a special P-load will be inserted into a

small P-load issue window at this time. When the cache hit/miss of the parent load is known,

associative search for dependent loads in the P-load issue window is performed. All dependent

P-loads are either ready to be issued (if the parent load is a miss), or canceled (if the parent load

is a hit). The P-load consists of the address of the parent load, the displacement, and a unique

instruction ID to instruct the memory controller to calculate the address and fetch the correct

block. Details of the memory controller will be given in Section 2.3.2. The fifth load is similar to

the fourth. The sixth load, lw $a0,12($a0), advances the pointer and is also a partial hit to the

first load.

With correct branch prediction, instructions of the second iteration are placed in the

instruction window. The first three loads in the second iteration all depend on lw $a0,12($a0) in

the previous iteration. Three corresponding P-loads of them are issued accordingly with the

parent load's address. The fourth and fifth loads, however, depend on early loads that are

themselves also identified as P-loads. In this case, instead of the parent's address, the parent

load IDs (p-id), 114 and 115 for the fourth and fifth loads respectively, are encoded in the

address fields to instruct the memory controller to obtain correct base addresses. This process

continues to issue a sequence of P-loads within the entire instruction window seamlessly.
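
Putting the pieces together, a P-load request roughly carries the fields sketched below (the
names and widths are illustrative, not a hardware specification):

#include <stdint.h>

struct pload_req {
    uint16_t id;              /* unique instruction ID, e.g. 116           */
    int      has_addr;        /* 1: parent address is valid;               */
                              /* 0: a parent p-id is encoded instead       */
    union {
        uint64_t parent_addr; /* block address of the parent load          */
        uint16_t parent_id;   /* p-id when the parent is itself a P-load   */
    } u;
    uint16_t offset;          /* offset of the base word in parent's block */
    int16_t  disp;            /* displacement added to the fetched base    */
};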

A P-load does not occupy a separate location in the instruction window, nor does it keep a

record in the memory status holding registers (MSHRs). Similar to other memory-side

prefetching methods [Solihin et al. 2002], the returned data block of a P-load must come back

with its address. Upon receiving a returned P-load block from memory, the processor searches and

satisfies any existing memory requests located in the MSHRs. The block is then placed into

cache if it is not there. Searching in the MSHRs is necessary, since a P-load cannot prevent other

requests that target the same block from issuing. The load, from which a P-load was initiated,

will be issued normally when the base register is ready.

In general, the P-load can be viewed as an accurate data prefetching method. It should not

interfere with normal store-load forwarding. A P-load can be issued even when there are unresolved

previous stores in the load-store queue. Upon the completion of the parent miss-load, the address

of the dependent load can be calculated that will trigger any necessary store-load forwarding.

2.3.2 Memory Controller Design

Figure 2-5 illustrates the basic design of the memory controller. Normal cache misses and

P-loads are processed and issued in the memory request window similar to the out-of-order

execution in processor' s instruction window. The memory address, the offset for the base

address, the displacement for computing the target block address, and the dependence link, are

recorded for each request in arriving order. For a normal cache miss, its address and a unique ID

assigned by the request sequencer are recorded. Such cache miss requests will access the DRAM

without delay as soon as the target DRAM channel is open. A normal cache miss may be merged

with an early active P-load that targets the same block to achieve reduced penalties.




Figure 2-5. Basic design of the memory controller

Two different procedures are applied when a P-load arrives. Firstly, if a P-load comes with

valid address, the block address is used to search for any existing memory requests. Upon a

match, a dependence link is established between them; the offset within the block is used to

access the correct word from the parent's data block without the need to access the DRAM. In

the case of no match, the address that comes with the P-load is used to access the DRAM as

illustrated by request 118 assuming that the first request has been removed from the memory

request window. Secondly, if a P-load comes without a valid address, the dependence link

encoded in the address field is extracted and saved in the corresponding entry as shown by

requests 116 and 117 (Figure 2-4). In this case, the correct base addresses can be obtained from

116's and 117's parent requests, 114 and 115, respectively. The P-load is dropped if its parent P-

load is no longer in the memory request window.

Once a data block is fetched, all dependent P-loads will be woken up. For example, the

availability of the New block will trigger P-loads 105, 106, 113, 114, and 115 as shown in Figure

2-4. The target word in the block can be retrieved and forwarded to the dependent P-loads. The

memory address of the dependent P-load is then calculated by adding the target word (base

address) with the displacement value. The P-load's block is fetched if its address does not match

any early active P-load. The fetched P-load's block in turn triggers its dependent P-loads. A

memory request will be removed from the memory request window after its data block is sent

back.
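
The wakeup computation itself is essentially one word extraction plus one addition; a minimal
sketch follows (translation through the shadowed TLB is omitted, and the offset is assumed to be
word-aligned within the block):

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64

/* When the parent's data block returns, pull the base address out of the
   block at the recorded offset and add the displacement to form the
   dependent P-load's memory address. */
static uint64_t pload_target(const uint8_t parent_block[BLOCK_SIZE],
                             uint16_t offset, int16_t disp)
{
    uint64_t base;
    memcpy(&base, parent_block + offset, sizeof base);  /* target word */
    return base + (int64_t)disp;
}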

2.3.3 Issues and Enhancements

There are many essential issues that need to be resolved to implement the P-load scheme

efficiently.

Maintaining Base Register Identity: The base register of a qualified P-load may experience

renaming or constant increment/decrement after the parent load is dispatched. These indirect

dependences can be identified and established by proper adjustment of the displacement value of

the P-load. There are different implementation options. In our simulation model, we used a

separate register renaming table to provide association of the current dispatched load with the

parent load, if one exists. This direct association can be established whenever a "simple" register

update instruction is encountered and its parent (could be multiple levels) is a miss load. The

association is dropped when the register is modified again.
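
One possible realization of this bookkeeping is sketched below, assuming a small
per-architectural-register table (the structure and update policy are illustrative, not the exact
design):

#include <stdint.h>

#define NUM_REGS 32

/* Per-register association with an outstanding miss load. */
struct assoc {
    int     parent_load_id;  /* miss load that (indirectly) produces the base */
    int32_t disp_adjust;     /* accumulated constant increments/decrements    */
    int     valid;
};
static struct assoc reg_assoc[NUM_REGS];

/* On a simple update "r = r + k" of a tracked register, fold the constant
   into the displacement that a future P-load will carry. */
static void on_simple_update(int r, int32_t k)
{
    if (reg_assoc[r].valid)
        reg_assoc[r].disp_adjust += k;
}

/* Any other write to r drops the association. */
static void on_other_write(int r)
{
    reg_assoc[r].valid = 0;
}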

Address Translation at the Memory Controller: The memory controller must perform virtual

to physical address translation for a P-load in order to access the physical memory. A shadowed

TLB needs to be maintained at the memory controller for this purpose (Figure 2-5). The

processor issues a TLB update to the memory controller whenever a TLB miss occurs and the

new address translation is available. The TLB consistency can be handled similarly to that in a

multiprocessor environment. A P-load is simply dropped upon a TLB miss.

Reducing Excessive Memory Requests: Since a P-load is issued without a memory address,

it may generate unnecessary memory traffic if the target block is already in cache or multiple

requests address the same data block. Three approaches are considered here. Firstly, when a

normal cache miss request arrives, all outstanding P-loads are searched. In the case of a match,

the P-load is changed to a normal cache miss for saving variable delays. Secondly, a small P-load

buffer (Figure 2-5) buffers the data blocks of recent P-loads and normal cache miss requests. A

fast access to the buffer occurs when the requested block is located in the buffer. Thirdly, a

topologically equivalent cache directory of the lowest level cache is maintained to predict cache

hit/miss for filtering the returned blocks. By capturing normal cache misses, P-loads, and dirty

block writebacks, the memory-side cache directory can predict cache hits accurately.

Inconsistent Data Blocks between Caches and Memory: Similar to other memory-side

prefetching techniques, the P-load scheme fetches data blocks without knowing whether they are

already located in cache. It is possible to fetch a stale copy if the block has been modified. In

general, the stale copy is likely to be dropped either by cache-hit prediction or by searching

through the directory before updating the cache. However, in a rather rare case when a modified

block is written back to the memory, this modified block must be detected against outstanding P-

loads to avoid fetching the stale data.

Complexity, Overhead, and Need for Associative Search Structures: There are two new

structures: P-load issue window and memory request window (with 8 and 32 entries in our

simulations) that require associative searches. Others do not require expensive associative

searches. We carefully model the delays and access conflicts. For instance, although multiple P-

loads can be woken up simultaneously, it takes two memory controller cycles (10 processor

cycles) conservatively to initiate each DRAM access sequentially. The delay is charged due to

the associative wakeup as well as the need for TLB and directory accesses. Our current

simulation does not consider TLB shootdown overhead. Our results showed that it has negligible

impact due to the small number of TLB misses and the flexibility of dropping overflow P-loads during TLB

updates.

2.4 Performance Evaluation

To handle P-loads, the processor includes an 8-entry P-load issue window along with a

512-entry instruction window and a 32-entry issue window. Several new components are added

to the memory controller. A 32-entry memory request window with a 16-entry fully associative

P-load buffer is added to process both normal cache misses and P-loads. An 8-way 16K-entry

cache directory of the second level cache to predict cache hit/miss is simulated. A shadowed

TLB with the same configuration as the processor-side TLB is simulated for address translation on

the memory controller.

Nine workloads, Mcf, Twolf, Vpr, Gcc-200, Parser, and Gcc-scilab from the SPEC2000

integer benchmarks, and Health, Mst, and Em3d from the Olden benchmarks, are selected because of

their high L2 miss rates, listed in order of their appearance.

A processor-side stride prefetcher is included in all simulated models [Fu et al. 1992]. To

demonstrate the performance advantage of the P-load scheme, the historyless content-aware data

prefetcher [Cooksey et al. 2002] is also simulated. We search exhaustively to determine the

width (number of adjacent blocks) and the depth (level of prefetching) of the prefetcher for the best

performance improvement. Two configurations are selected. In the limited option (Content-limit;

width=1, depth=1), a single block is prefetched for each identified pointer from a missed data

block, i.e. both width and depth are equal to 1. In the best-performance option (Content-best;

width=3, depth=4), three adjacent blocks starting from the target block of each identified pointer

are fetched. The prefetched block initiates content-aware prefetching up to the fourth level. Other

prefetchers are excluded due to the need of huge history information and/or software prefetching

help.

2.4.1 Instructions Per Cycle Comparison

Figure 2-6 summarizes the Instructions Per Cycle (IPC) and the normalized memory

access time for the baseline model, the content-aware prefetching (Content-limit and Content-

best) and the P-load schemes without (Pload-no) and with (Pload-16) a 16-entry P-load buffer.

Generally, the P-load scheme shows better performance.







Figure 2-6. Performance comparisons: A) Instructions Per Cycle;
B) Normalized memory access time

Compared with the baseline model, the Pload-16 shows speedups of 28%, 5%, 2%, 14%,

5%, 17%, 39%, 18% and 14% for the respective workloads. In comparison with the Content-

best, the Pload-16 performs better by 11%, 4%, 2%, 2%, -8%, 11%, 16%, 22%, and 12%. The P-

load is most effective on the workloads that traverse linked data structures with tight load-load

dependence such as Mcf, Gcc-200, Gcc-scilab, Health, Mst, and Em3d. The content-aware

scheme, on the other hand, can prefetch more load-load dependent blocks beyond the instruction

window. For example, the traversal lists in Parser are very short, and thus provide limited room

for issuing P-loads. But the Content-best shows better improvement on Parser. Lastly, the results

show that a 16-entry P-load buffer provides about 1-10% performance improvement with an

average of 4%.

To further understand the P-load effect, we compare the memory access time of various

schemes normalized to the memory access time without prefetching (Figure 2-6 B). The

improvement of the memory access time matches the IPC improvement very well. In general, the

P-load reduces the memory access delay significantly. We observe 10-30% reduction of memory

access delay for Mcf, Gcc-200, Gcc-scilab, Health, Mst, and Em3d.

2.4.2 Miss Coverage and Extra Traffic

In Figure 2-7, the miss coverage and total traffic are plotted. The total traffic is classified

into five categories: misses, partial hits, miss reductions (i.e. successful P-load or prefetches),

extra prefetches, and wasted prefetches. The sum of the misses, partial hits and miss reductions is

equal to the baseline misses without prefetching, which is normalized to 1. The partial hits

represent normal misses that catch early P-loads or prefetches at the memory controller, so that

the memory access delays are reduced. The extra prefetch represents the prefetched blocks that

are replaced before any use. The wasted prefetches refer to the prefetched blocks that are

already present in the cache.

Except for Twolf and Vpr, the P-load reduces 20-80% overall misses. These miss

reductions are accomplished with little extra data traffic because the P-load is issued according to

the instruction stream. Among the workloads, Health has the highest miss reduction. It simulates

health-care systems using a 4-way B-tree structure. Each node in the B-tree consists of a link-list

with patient records. At the memory controller, each pointer-advance P-load usually wakes up a

large number of dependent P-loads ready to access DRAM. At the processor side, the return of a

parent load normally triggers dependent loads after their respective blocks are available from

early P-loads. Mcf, on the other hand, has much simpler operations on each node visit. The return

of a parent load may initiate the dependent loads before the blocks are ready from early P-loads.

Therefore, about 20% of the misses have reduced penalties due to the early P-loads. Twolf and

Vpr show insignificant miss reductions because of the very small amount of tight load-load

dependence.


Figure 2-7. Miss coverage and extra traffic




The content-aware prefetcher generates a large amount of extra traffic for aggressive data

prefetching. For Twolf and Vpr, such aggressive and incorrect prefetching actually increases the

overall misses due to cache pollution. For Parser, the Content-best out-performs the Pload-16,

but at the cost of 5 times the memory traffic. In many workloads, the Content-best generates

high percentages of wasted prefetches. For example for Parser,~PPP~~~~PPP~~~PPP the cache prediction at the

memory controller is very accurate with only 0.6% false-negative prediction (predicted hit,

actual miss) and 3.2% false-positive prediction (predicted miss, actual hit). However, the total


predicted misses are only 10%, which makes 30% of the returned P-load blocks wasted.


2.4.3 Large Window and Runahead


The scope of the MLP exploitation with P-load is confined within the instruction window.


In Figure 2-8, the IPC speedups of the P-load with five window sizes: 128, 256, 384, 512 and


640 in comparison with the baseline model of the same window size are plotted.




Figure 2-8. Sensitivity of P-load with respect to instruction window size





Figure 2-9. Performance impact from combining P-load with runahead

The advantage of larger window is obvious, since the bigger the instruction window, the

more P-loads can be discovered and issued. It is important to point out that issuing P-loads is

independent of the issue window size. In our simulation, the issue window size remains 32 for all

five instruction windows.

The speculative runahead execution effectively enlarges the instruction window by

removing cache miss instructions from the top of the instruction window. More instructions and

potential P-loads can thus be processed on the runahead path. Figure 2-9 shows the IPC speedups

of Runahead, Pload-16, and the combined Pload-16+Runahead. All three schemes use a 512-

entry instruction window and a 32-entry issue window. Runahead execution is very effective on

Twolf, Vpr, and Mst. It out-performs Pload-16 due to the ability to enlarge both the instruction

and the issue windows. On the other hand, Mcf, Gcc-200, Gcc-scilab, Health, and Em3d show

little benefit from runahead because of intensive load-load dependence. The performance of

Mcf is actually degraded because of the overhead associated with canceling instructions on the

runahead path.

The benefit of issuing P-loads on the runahead path is very significant for all workloads as

shown in the figure. Basically, these two schemes are complementary to each other and show an

additive speedup benefit. The average IPC speedups ofrunahead', P-load', and P-load runahead'

relative to the baseline model are 10%, 16% and 34% respectively. Combining P-load with

runahead provides an average of 22% speedup over using only runahead execution, and 16%

average speedup over using P-load alone.

2.4.4 Interconnect Delay

To reduce memory latency, a recent trend is to integrate the memory controller into the

processor die with reduced interconnect delay [Opteron Processors]. However, in a multiple

processor-die system, significant interconnect delay is still encountered in accessing another
memory controller located off-die. In Figure 2-10, the IPC speedups of the P-load with different

interconnect delays relative to the baseline model with the same interconnect delay are plotted.

The delay indeed impacts the overall IPC significantly. But the P-load still demonstrates

performance improvement even with fast interconnect. The average IPC improvements of the

nine workloads are 18%, 16%, 12%, 8% and 5% with 100-, 80-, 60-, 40-, and 20-cycle one-way

delays respectively.




Figure 2-10. Sensitivity of P-load with respect to interconnect delay

2.4.5 Memory Request Window and P-load Buffer

Recall that the memory request window records normal cache misses and P-loads. The size

of this window determines the total number of outstanding memory requests that can be handled by

the memory controller. The issuing and execution of requests in the memory request window are

similar to the out-of-order execution in the processor's instruction window. In Figure 2-11, the IPC

speedups of the P-load with four memory request window sizes: 16, 32, 64, and 128 relative to

the baseline model without P-load are plotted. A 32-entry window size is enough to hold almost

all of the requests at the memory controller for all workloads except Health.

Figure 2-11. Sensitivity of P-load with respect to memory request window size

Figure 2-12. Sensitivity of P-load with respect to P-load buffer size


The performance impacts of the P-load buffer with 0, 16, 32, 64 and 128 entries are


simulated. Figure 2-12 shows the IPC speedups of the five P-load buffer sizes relative to the


baseline model. In all of the workloads, adding the P-load buffer increases the performance gain.


For most of the workloads, a 16-entry buffer can capture the majority of the benefit.









2.5 Related Work

There have been many software and hardware oriented prefetching proposals for

alleviating performance penalties on cache misses [Jouppi 1990; Chen and Baer 1992; Luk and

Mowry 1996; Joseph and Grunwald 1997; Yang and Lebeck 2000; Vanderwiel and Lilja 2000;

Cahoon and McKinley 2001; Solihin et al. 2002; Cooksey et al. 2002; Collins et al. 2002; Wang

et al. 2003; Yang and Lebeck 2004; Hughes and Adve 2005]. Traditional hardware-oriented

sequential or stride-based prefetchers work well for applications with regular memory access

patterns [Chen and Baer 1992; Jouppi 1990]. However, in many modern applications and

runtime environments, dynamic memory allocations and linked data structure accesses are very

common. Accurate prefetching is difficult due to their irregular address patterns. Correlated and Markov prefetchers [Charney and Reeves 1995; Joseph and Grunwald 1997] record patterns of miss addresses and use past miss correlations to predict future cache misses. These approaches require a huge history table to record the past miss correlations. In addition, these prefetchers face challenges in providing accurate and timely prefetches.

A memory-side correlation-based prefetcher moves the prefetcher to the memory controller

[Solihin et al. 2002]. To handle timely prefetches, a chain of prefetches based on a pair-wise

correlation history can be pushed from memory. Accuracy and memory traffic, however, remain

difficult issues. To overlap load-load dependent misses, a cooperative hardware-software

approach called push-pull uses a hardware prefetch engine to execute software-inserted pointer-

based instructions ahead of the actual computation to supply the needed data [Yang and Lebeck

2000; Yang and Lebeck 2004]. A similar approach has been presented in [Hughes and Adve

2005].

A stateless, content-aware data prefetcher identifies potential pointers by examining word-

based content of a missed data block and eliminates the need to maintain a huge miss history










[Cooksey et al. 2002]. After the prefetching of the target memory block by a hardware-identified

pointer, a match of the block address with the content of the block can recognize any other

pointers in the block. The newly identified pointer can trigger a chain of prefetches. However, to

overlap long latency in sending the request and receiving the pointer data for a chain of

dependent load-loads, the stateless prefetcher needs to be implemented at the memory side. Both

virtual and physical addresses are required in order to identify pointers in a block. Furthermore,

since all identified pointers are prefetched continuously, the accuracy issue still exists. Using the same

mechanism to identify pointer loads, the pointer-cache approach [Collins et al. 2002] builds a

correlation history between heap pointers and the addresses of the heap objects they point to. A

prefetch is issued when a pointer load misses the data cache, but hits the pointer cache.

Additional complications occur when the pointer values are updated.

The proposed P-load abandons the traditional approach of predicting prefetches with huge

miss histories. It also gives up the idea of using hardware and/or software to discover special

pointer instructions. With deep instruction windows in future out-of-order processors, the

proposed approach identifies existing load-load dependence in the instruction stream that may

delay the dependent loads. By issuing a P-load in place of the dependent load, any pointer-

chasing or indirect addressing that causes serialized memory accesses can be overlapped to

effectively exploit memory-level parallelism. The execution-driven P-load can precisely preload

the needed block without involving any prediction.

2.6 Conclusion

Processor performance is significantly hampered by limited MLP exploitation due to the

serialization of loads that are dependent on one another and miss the cache. The proposed special

P-load has demonstrated its ability to effectively overlap these loads. Instead of relying on miss

predictions of the requested blocks, the execution-driven P-load precisely instructs the memory










controller in fetching the needed data block non-speculatively. The simulation results

demonstrate high accuracy and significant speedups using the P-load. The proposed P-load

scheme can be integrated with other aggressive MLP exploitation methods for even greater

performance benefit.









CHAPTER 3
LEAST FREQUENTLY USED REPLACEMENT POLICY IN ELBOW CACHE

3.1 Introduction

In cache designs, a set includes a number of cache frames that a memory block can be

mapped into. When all of the frames in a set are occupied, a newly missed block replaces an old

block according to the principle of memory reference locality. In classical set-associative caches,

both the lookup for identifying cache hit/miss and the replacement are within the same set,

normally based on hashing of a few index bits from the block address. For fast cache access

time, the set size (also referred to as associativity) is usually small. In addition, all of the sets have identical size and are disjoint to simplify the cache design. Under these constraints, heavy conflicts may occur in a few sets (referred to as hot sets) due to uneven distribution of memory addresses across the cache sets, causing severe performance degradation.

There have been several efforts to alleviate conflicts in heavily accessed sets. The hash-

rehash cache [Agarwal et al. 1988] and the column-associative cache [Agarwal and Pudar 1993a]

establish a secondary set for each block using a different hashing function from the primary set.

Cache replacement is extended across both sets to reduce conflicts. An additional cache lookup

is required for blocks that are not located in the primary set. The group-associative cache [Peir et

al. 1998] maintains a separate cache directory for a more flexible secondary set. A different

hashing function is used to lookup blocks in the secondary set. Similar to the hash-rehash,

lookups in both of the directories are necessary. In addition, a link is added for each entry of the

secondary directory to locate the data block. Recently, the V-way cache [Qureshi et al. 2005]

eliminates multiple lookups by doubling the cache directory size with respect to the actual

number of data blocks. In the V-way cache, any unused directory entry in the lookup set can be

used to record a newly missed block without replacing an existing block in the set. The existence









































of unused entries allows searching for a replacement block across the entire cache, and thus

decouples the replacement set from the lookup set. Although flexible, the V-way cache requires a

bi-directional link between each directory entry and its corresponding data block in the data

array. Data accesses must go through the link indirectly, which lengthens the cache access time.

Also, even with extra directory space, the V-way cache cannot eliminate the hot sets and must

replace a block within the lookup set if all directory frames in the set are occupied.

Figure 3-1. Connected cache sets with multiple hashing functions

The skewed-associative cache [Seznec 1993a; Seznec and Bodin 1993b; Bodin and Seznec

1997] is another cache organization to alleviate conflict misses. In contrast to the conventional

set-associative caches, the skewed-associative cache employs multiple hashing functions for

members in a set. In an n-way cache, the cache is partitioned equally into n banks. For set-

associative caches, each set consists of one frame from each partition in the same position

addressed by the index bits. In caches with multiple-hashing, each set also consists of one frame

from each of the n cache partitions. But the location of the frame in each partition can be

different based on a different hashing function. To look up a cache block, the n independent










hashing functions address the n frames where the n existing cache blocks can be matched against

the requested block to determine a cache hit or a miss. The insight behind the skewed-associative

cache is that whenever two blocks conflict for a single location in partition i, they have a low probability of conflicting for a location in partition j.

The elbow cache [Spjuth et al. 2005], an extension to the skewed-associative cache, can

expand the replacement set without any extra cache tag directory. In set-associative caches, two

blocks are either mapped into the same set, or they belong to two disjoint sets. In contrast, the multiple-hashing cache presents an interesting property that two blocks can be mapped into two sets which share a common frame in one or more partitions. Let us assume a 4-way partitioned cache as illustrated in Figure 3-1; through four hashing functions, blocks A and B can be mapped to different locations in the four cache partitions. In this example, A and B share the same frame, a1/b1 in Partition 1, but are disjoint in the others. When two sets share the same frame in one or more cache partitions, the two sets are connected. The common frame provides a link to expand the replacement set beyond the original lookup set. Instead of replacing a block from the original lookup set, the new block can take over the shared frame; the block located in the shared frame can then be moved to the connected set, replacing a block there. For example, assume that when block A is requested, A is not present in any of the four allocated frames a0, a1, a2, and a3, and a1 is occupied by block B. Instead of forcing out a block in a0, a1, a2, or a3, block A can take over frame a1 and push block B to one of the other frames b0, b2, or b3 in the connected set. It is essential that relocating block B does not change the lookup mechanism. Furthermore, instead of replacing blocks in b0, b2, or b3, the recursive interconnection allows those blocks to be moved to the other frames in their own connected sets. The elbow cache extends the skewed-associative cache organization by carefully selecting its victim and, in the case of a conflict, moving the










conflicting cache block to its alternative location in the other partition. In a sense, the new data

block "uses its elbows" to make space for conflicting data instead of evicting it. The enlarged

replacement set provides a better opportunity to find a suitable victim for eviction.

It is imperative to design an effective replacement policy to identify a suitable victim for eviction in the elbow cache, which features an enlarged replacement set. A recency-based replacement policy such as LRU is generally thought to be the most efficient policy for processor caches, but it can be expensive to implement in the elbow cache. In this dissertation, we introduce a frequency-based cache replacement policy based on the concept that the least frequently used block is the most suitable for replacement.

3.2 Conflict Misses

The severity of set conflicts is demonstrated using SPEC2000 benchmarks. Twelve applications, Twolf, Bzip, Gzip, Parser, Equake, Vpr, Gcc, Vortex, Perlbmk, Crafty, Apsi, and Eon, were chosen for our study because of their high conflict misses. They appear from left to right in order of increasing conflict severity. In this study, we simulate a 32KB L1 data cache with 64-byte block size. The severity of conflicts is measured by the miss ratio reduction from a fully-associative to a 4-way set-associative design. Figure 3-2 shows cache miss ratios of 2-way, 4-way, 16-way, and fully-associative caches.

As expected, both 2-way and 4-way set-associative caches suffer significant conflict misses for all selected workloads. It is interesting to see that even with a 16-way set-associative cache, Bzip, Gzip, Gcc, Perlbmk, Crafty, and Apsi still suffer significant performance degradation due to conflicts. For Apsi, more than 60% of the misses can be saved using a fully-associative cache compared to a 16-way cache.











Figure 3-2. Cache miss ratios with different degrees of associativity

3.3 Cache Replacement Policies for Elbow Cache

The fundamental idea behind the elbow cache is to think of an n-way partitioned cache as an n-dimensional Cartesian coordinate system. The coordinates of a cache block are the components of a tuple of natural numbers (index_0, index_1, ..., index_(n-1)), which are generated by applying the corresponding hashing function to the cache block address for each partition. Since the theme of this dissertation is not to invent new hashing functions, the multiple hashing functions described in [Seznec 1993a; Seznec and Bodin 1993b; Bodin and Seznec 1997] are borrowed. The definitions of these skewed mapping functions are given as follows.

In the original skewed-associative cache [Seznec 1993a; Seznec and Bodin 1993b], two functions H and G are defined, where G is the inverse function of H and n is the width of the index bits:

H: {0, 1, ..., 2^n - 1} → {0, 1, ..., 2^n - 1}

G: {0, 1, ..., 2^n - 1} → {0, 1, ..., 2^n - 1}









For a 4-way partitioned cache, four hashing functions are defined (referred to as Mapping Function 93). A data block at memory address A = A3·2^(2n+c) + A2·2^(n+c) + A1·2^c + A0, where c is the width of the block offset, is mapped to:

1. cache frame f0(A) = H(A1) ⊕ G(A2) ⊕ A1 in cache partition 0, where ⊕ represents the exclusive-or operator;

2. cache frame f1(A) = H(A1) ⊕ G(A2) ⊕ A2 in cache partition 1;

3. cache frame f2(A) = G(A1) ⊕ H(A2) ⊕ A1 in cache partition 2;

4. cache frame f3(A) = G(A1) ⊕ H(A2) ⊕ A2 in cache partition 3.

In an alternative skewed function family reported later [Bodin and Seznec 1997] (referred to as Mapping Function 97), let σ be the one-position circular shift on n bits [Stone 1971]. A data block at memory address A = A3·2^(2n+c) + A2·2^(n+c) + A1·2^c + A0 can be mapped to:

1. cache frame f0(A) = A1 ⊕ A2 in cache partition 0;

2. cache frame f1(A) = σ(A1) ⊕ A2 in cache partition 1;

3. cache frame f2(A) = σ^2(A1) ⊕ A2 in cache partition 2;

4. cache frame f3(A) = σ^3(A1) ⊕ A2 in cache partition 3.
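To make the family concrete, the C sketch below implements one plausible reading of Mapping Function 97, assuming 64-byte blocks (c = 6), 7 index bits per partition, and a right circular shift for σ; the field widths and the shift direction are illustrative assumptions, not taken from the cited papers.

#include <stdint.h>
#include <stdio.h>

#define N_BITS 7                         /* index width: 128 frames per partition (assumed) */
#define C_BITS 6                         /* 64-byte blocks, so offset width c = 6           */
#define MASK   ((1u << N_BITS) - 1)

/* One-position circular shift on n bits: the sigma operator. */
static uint32_t sigma(uint32_t x)
{
    return ((x >> 1) | (x << (N_BITS - 1))) & MASK;
}

/* Mapping Function 97: the frame of address A in partition i is
 * sigma^i(A1) XOR A2, where A1 and A2 are the two consecutive
 * n-bit fields above the block offset. */
uint32_t map97(uint32_t addr, int partition)
{
    uint32_t a1 = (addr >> C_BITS) & MASK;
    uint32_t a2 = (addr >> (C_BITS + N_BITS)) & MASK;
    uint32_t h  = a1;
    for (int i = 0; i < partition; i++)
        h = sigma(h);                    /* apply sigma i times for partition i */
    return h ^ a2;
}

int main(void)
{
    uint32_t a = 0x050001b3;             /* an example block address */
    for (int p = 0; p < 4; p++)
        printf("partition %d -> frame %u\n", p, map97(a, p));
    return 0;
}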

3.3.1 Scope of Replacement

To expand the replacement set beyond the lookup set boundary, the coordinates of the blocks in the lookup set provide links for reaching other connected sets. Each block has n coordinates and can thus reach n-1 new frames in the other partitions, as long as those frames have not been reached before. The coordinates of each block in the connected sets can in turn link to other connected sets recursively until no new sets can be reached. We call the union of all connected sets the scope of the replacement set.
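The scope is simply the transitive closure of the "connected" relation starting from the lookup set. The following C sketch (the data layout is hypothetical; in hardware the coordinates would be recomputed from the stored tags) expands the scope recursively until no new frame is reached:

#include <stdbool.h>
#include <string.h>

#define P 4    /* number of partitions (assumed) */
#define F 128  /* frames per partition (assumed) */

/* coords[p][f][q]: the frame, on dimension q, of the block currently
 * stored at frame f of partition p; filled in from the stored tags. */
static int  coords[P][F][P];
static bool reached[P][F];

/* Recursively expand: every reached block links to its alternative
 * frames in the other partitions. Returns the frames newly reached. */
static int expand(int p, int f)
{
    if (reached[p][f])
        return 0;
    reached[p][f] = true;
    int count = 1;
    for (int q = 0; q < P; q++)
        if (q != p)
            count += expand(q, coords[p][f][q]);
    return count;
}

/* Scope of a missed address: the closure from its lookup set. */
int scope_size(const int lookup[P])
{
    memset(reached, 0, sizeof reached);
    int count = 0;
    for (int p = 0; p < P; p++)
        count += expand(p, lookup[p]);
    return count;
}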













Figure 3-3. Example of search for replacement












In Figure 3-3, we use a simple example to illustrate the search for replacement in a 2-way elbow cache. The two integer numbers in the parentheses next to each block address represent the coordinate values of the block obtained from the two hashing functions. When a request to 050001b3 (6,3) arrives, a cache miss occurs since the block is not located in the lookup set of frame 6 on coordinate x and frame 3 on coordinate y. The search for replacement begins from the lookup set. Block 0500017d (6,4), located in frame 6 on coordinate x, connects to frame 4 on coordinate y. Similarly, block 050001a7 (0,3), located in frame 3 on coordinate y, connects to frame 0 on coordinate x. The blocks located in the connected frames, 05000183 (7,4) and 050034e (0,7), can make a further connection to 050001b9 (7,1) and 05000505 (2,7) respectively. The search continues until block 0500019b (5,5) in frame 5 on coordinate x and 0500018f (3,5) in frame 5 on coordinate y are revisited, as illustrated by the arrows in the figure. In this example, the replacement scope covers the entire set of cache frames.


Figure 3-4. Distribution of scopes using two sets of hashing functions

Although the interconnection of multiple sets for replacement is recursive, the scope for each requested block is limited once all newly expanded frames have already been reached. With the selected integer and floating-point applications from SPEC2000, we can demonstrate the scope of replacement. Figure 3-4 shows the accumulated scope distributions for the selected applications using the two skewed mapping function families described before. It is interesting to observe that the scope of the elbow cache using Mapping Function 93 almost covers the entire cache. But when using Mapping Function 97, the scope is limited to half of the cache frames. This is due to certain constraints imposed on the selected randomization functions. Further discussion of the mathematical properties of these hashing functions is beyond the scope of this dissertation. It is important to emphasize that for all practical purposes, the scopes of both skew-based hashing schemes are sufficient to find a proper victim for replacement.









3.3.2 Previous Replacement Algorithms

The study of cache block replacement policies is, in essence, a study of the characteristics or behavior of workloads on a system. Specifically, it is a study of access patterns to blocks within the cache. Based on the recognition of access patterns through acquisition and analysis of past behavior or history, replacement policies strive to identify the block that will be used farthest in the future, so that that block may be replaced when needed. The LRU policy does this by tracking the recency of block references, such that the least recently used block will be replaced when needed. The LFU policy considers the frequency of block references, such that the least frequently used block will be replaced when needed. These respective policies inherently assume that the future behavior of the workload will be dominated by the recency or frequency factors of past behavior.

The ability of the elbow cache to reduce conflict misses depends primarily on the

intelligence of the cache replacement policy. Different replacement policies may be used. The

random replacement policy is the simplest to implement but it increases the miss rate compared

to the baseline configuration (a 4-way set-associative cache).

The LRU replacement policy is more effective than random replacement. The traditional LRU policy based on the MRU-LRU sequence is difficult to implement with multiple hashing functions. It is well known that the number of possible MRU-LRU orderings is s!, where s is the set associativity. The LRU replacement can be applied to set-associative caches due to their limited set size. Furthermore, pseudo-LRU schemes can be used to reduce complexity for highly associative caches. Since the number of sets grows exponentially with multiple hashing, it is prohibitively expensive to maintain the needed MRU-LRU sequences.

Instead of maintaining the precise MRU-LRU sequence for replacement, a scheme based on a time stamp can be considered. The time-stamp (TS) scheme emulates LRU replacement by maintaining a global memory request ID for each cache block. When a miss occurs, the block with the oldest time-stamp in the set (or the connected set) is replaced [Seznec 1993a; Seznec and Bodin 1993b; Bodin and Seznec 1997]. To save directory space as well as to simplify calculations, a smaller time-stamp is desirable. A more practical scheme, which uses a small number of bits both in the counter and in the time-stamp, works by shifting the counter and all the time-stamps one bit to the right whenever the reference counter overflows [Gonzalez et al. 1997].
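A minimal C sketch of this practical scheme, assuming an 8-bit stamp per block and one global reference counter (the array size and names are illustrative):

#include <stdint.h>

#define BLOCKS 2048  /* total cache blocks (assumed) */

static uint8_t stamp[BLOCKS];  /* small per-block time-stamp    */
static uint8_t clock_ctr;      /* global memory request counter */

/* Stamp the touched block; when the counter is about to overflow,
 * shift every stamp (and the counter) one bit to the right, which
 * halves all values while preserving their relative age order. */
void touch(int block)
{
    if (clock_ctr == UINT8_MAX) {
        for (int b = 0; b < BLOCKS; b++)
            stamp[b] >>= 1;
        clock_ctr >>= 1;
    }
    stamp[block] = ++clock_ctr;
}

/* Emulated LRU: the victim is the block with the oldest stamp
 * among the candidate (connected) replacement set. */
int oldest(const int set[], int n)
{
    int victim = set[0];
    for (int i = 1; i < n; i++)
        if (stamp[set[i]] < stamp[victim])
            victim = set[i];
    return victim;
}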

The Not Recently Used, Not Recently Written (NRUNRW) replacement policy is an effective policy implemented on the skewed-associative cache [Stone 1971; Seznec 1993a]. The Recently Used (RU) tag bit is set when the cache block is accessed. Periodically, the RU bits of all the cache blocks are zeroed. When a cache access misses, the replaced block is chosen from the replacement set in the following priority order. First, randomly pick among the blocks for which the RU tag is not set. Second, randomly pick among the blocks for which the RU tag is set but which have not been modified since they were loaded into the cache. Last, randomly pick among the blocks for which the RU tag is set and which have been modified.
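The three-class priority order can be rendered compactly; the following C sketch is an illustrative reading of the policy, not the cited implementation:

#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    bool ru;     /* Recently Used: set on access, zeroed periodically */
    bool dirty;  /* modified since the block was loaded               */
} tag_bits;

/* Pick a victim from the replacement set in NRUNRW priority order:
 * pass 0: RU clear; pass 1: RU set but clean; pass 2: RU set and
 * dirty. Ties inside the winning class are broken randomly. */
int nrunrw_victim(const tag_bits blk[], const int set[], int n)
{
    for (int pass = 0; pass < 3; pass++) {
        int cand[64], c = 0;
        for (int i = 0; i < n && c < 64; i++) {
            const tag_bits *b = &blk[set[i]];
            bool match = (pass == 0) ? !b->ru
                       : (pass == 1) ? (b->ru && !b->dirty)
                                     : (b->ru && b->dirty);
            if (match)
                cand[c++] = set[i];
        }
        if (c > 0)
            return cand[rand() % c];
    }
    return set[0];  /* defensive: every block falls into one class */
}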

Another key issue in implementing cache replacement in the elbow cache is that a linear search for the replacement block among all the connected sets is necessary. It is prohibitively expensive to traverse the entire scope to find a suitable victim for replacement. Restrictions must be added to confine the search within a small set of cache frames.

3.3.3 Elbow Cache Replacement Example

In Figure 3-5, a sequence of memory requests is used to illustrate how a time-stamp based 2-way elbow cache replacement works. Each request is denoted as Tag-f0,f1-(ID), where Tag is the block address tag; f0 and f1 represent the location of the block in the two coordinates based on two different hashing functions; and (ID) represents the request ID used as a time stamp. For simplicity, we assume that f0 and f1 are taken directly from address bits and may be needed as part of the tag to determine a cache hit or miss. Within each partition, there are only four cache frames addressed by the two hashing bits.

Figure 3-5. Replacement based on time-stamp in elbow caches

When the first request a-00,10-(0) is issued, both frame 00 on coordinate x and frame 10 on coordinate y are empty. A miss occurs and a-00,10-(0) is allocated to frame 00 on coordinate x. The second request b-00,11-(1) is also a miss and is allocated to frame 11 on coordinate y since it is empty in the lookup set. For the third request, c-00,11-(2), both frames in the lookup set are now occupied by the first two requests. However, frame 10 on coordinate y is empty, which is in the connected set of the current lookup set through the shared frame 00 on coordinate x. Therefore, block a-00,10-(0) can be moved to fill frame 10 on coordinate y, as indicated by the arrow with the request ID (2), which leaves the shared frame 00 on coordinate x for the newly missed request c-00,11-(2). The fourth request b-10,10-(3) finds an empty frame 10 on coordinate x. The fifth request, a-10,11-(4), again misses the cache, and both frames in the lookup set are occupied. Assume in this case that both existing blocks are not "old" enough to be replaced. Through the block b-10,10-(3) in the shared frame, an "older" block a-00,10-(0) in the connected set of frame 10 on coordinate y is found and can be replaced, as indicated by the arrow with the request ID (4). Finally, the last request b-10,10-(5) can easily be located as a hit even though the block has been relocated.

3.3.4 Least Frequently Used Replacement Policy

The cost and performance associated with LRU replacement depend on the number of bits devoted to the time-stamp. The implementation of wide-bit comparison for multiple parallel time-stamp comparisons is very expensive and time consuming. Furthermore, the time-stamp also requires extra storage for each block in the cache. The more bits devoted to the time-stamp, the more accurately the LRU sequence can be maintained, but the higher the implementation cost. Even using the most optimized time-stamp [Gonzalez et al. 1997], the counter for each cache block requires at least 8 bits. To reduce the implementation complexity while maintaining equal cache performance, we introduce a new cache replacement policy for the elbow cache.

The LFU replacement policy maintains the most frequently used blocks in the cache. The least frequently used blocks are evicted when new blocks are brought in. It is based on the observation that a block tends to be reused if it has been used more frequently after it was moved into the cache [Qureshi et al. 2005]. We propose the reuse-count (RC) scheme, an implementation of LFU replacement for the elbow cache. A reuse counter is maintained for each cache block. Upon a miss, the block with the least reuse frequency is replaced. The reuse count is given an initial value when the block is moved into the cache, and the value is incremented when the block is accessed. This results in the following problem: certain blocks are referenced relatively infrequently overall, and yet when they are referenced, due to locality there are short intervals of repeated re-references, thus building up high reuse counts. After such an interval is over, the high reuse count is misleading: it is due to locality, and cannot be used to estimate the probability that such a block will be reused following the end of this interval. Here, this problem is addressed by "factoring out locality" from reuse counts, as follows. The reuse count is decremented when the cache block is searched but mismatched with the requested block. A block can be replaced when the count reaches zero. In this way, recency information is also incorporated into the frequency-based replacement policy. Performance evaluation shows that on an elbow cache, a reuse-count replacement policy with a 3-bit reuse counter can perform as well as an LRU replacement policy, with very small storage and low hardware complexity.
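The per-block update rules are small enough to state directly. A minimal C sketch, assuming the 3-bit saturating counter with initial value 3 that the evaluation suggests:

#include <stdbool.h>
#include <stdint.h>

#define RC_INIT 3  /* initial value on fill (3-bit counter assumed) */
#define RC_MAX  7  /* saturation limit of a 3-bit counter           */

typedef struct { uint8_t rc; } block_meta;

/* A newly filled block starts with a modest budget of "lives". */
void on_fill(block_meta *b) { b->rc = RC_INIT; }

/* A hit shows the block is useful: raise the count, saturating. */
void on_hit(block_meta *b)
{
    if (b->rc < RC_MAX)
        b->rc++;
}

/* Searched but mismatched: "factor out locality" by decaying the
 * count, so a past burst of reuse cannot protect a block forever. */
void on_probe_mismatch(block_meta *b)
{
    if (b->rc > 0)
        b->rc--;
}

/* A block becomes a replacement candidate at count zero. */
bool replaceable_rc(const block_meta *b) { return b->rc == 0; }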

To avoid searching for a victim through all of the connected replacement sets, a block is considered replaceable when the recorded reference count reaches a certain threshold (zero). The search stops when a replaceable block is found. Furthermore, the replacement search can be confined within a limited search domain. For instance, the elbow cache search can be confined within the original lookup set plus the sets connected directly to it. If no replaceable block is found, a block in the lookup set is replaced. Since the search and replacement are only performed on a cache miss, they are not on the critical path of cache access. In addition, our simulation results show that about 40% to 70% of the replacements are still located in the lookup set, and thus incur no extra overhead for searching and replacement.

Vacating the shared frame to make room for the newly missed block involves a data block movement. To limit this data movement, a breadth-first traversal is used to search all possible first-level connected sets through the blocks located in the lookup set. If no replaceable block is found, the oldest block in the lookup set can be picked for replacement, limiting the block movements to at most one per cache miss. The search can be extended to further levels at the cost of an additional block movement per interconnection level. A more drastic approach to avoid the data movement entirely is to establish a pointer from each directory entry to its corresponding block in the data array, similar to those in [Peir et al. 1998; Qureshi et al. 2005]. However, this indirect pointer approach lengthens the cache access time.
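A C sketch of the bounded two-level search follows; the helpers replaceable() and alt_frame() are placeholder interfaces standing in for the reuse counters and the hashing functions, not a real design:

#include <stdbool.h>

#define P 4    /* partitions           */
#define F 128  /* frames per partition */

extern bool replaceable(int p, int f);       /* e.g., reuse count == 0       */
extern int  alt_frame(int p, int f, int q);  /* coordinate on dimension q of
                                                the block stored at (p, f)   */

/* Two-level breadth-first search. Level 1 is the lookup set (no block
 * movement needed). Level 2 is the first-level connected sets: the
 * block in the shared lookup frame (via_p, via_f) "elbows" over into
 * the victim's frame, vacating its own frame for the missed block.
 * Returns false if no replaceable block exists within two levels. */
bool find_victim(const int lookup[P], int *vp, int *vf, int *via_p, int *via_f)
{
    for (int p = 0; p < P; p++)  /* level 1 */
        if (replaceable(p, lookup[p])) {
            *vp = p; *vf = lookup[p];
            *via_p = *via_f = -1;  /* no block movement */
            return true;
        }
    for (int p = 0; p < P; p++)  /* level 2 */
        for (int q = 0; q < P; q++) {
            if (q == p)
                continue;
            int f = alt_frame(p, lookup[p], q);
            if (replaceable(q, f)) {
                *vp = q; *vf = f;                /* victim in a connected set */
                *via_p = p; *via_f = lookup[p];  /* shared frame to vacate    */
                return true;
            }
        }
    return false;  /* caller falls back to the best block in the lookup set */
}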

3.4 Performance Evaluation

3.4.1 Miss Ratio Reduction

In this dissertation, we use miss ratios as the primary performance metric to demonstrate the effectiveness of the reuse-count replacement policy for the elbow cache. Various caching schemes and replacement policies were implemented on the L1 data cache, which is 32KB and 4-way partitioned with 64B lines. For comparison purposes, we consider a 32KB, 4-way set-associative L1 cache with LRU replacement as the baseline cache.

To evaluate the elbow cache, we excluded workloads that have less than a 3% miss ratio gap between a 32KB fully-associative cache and the baseline cache. Based on these criteria, twelve workloads from SPEC2000, Twolf, Bzip, Gzip, Parser, Equake, Vpr, Gcc, Vortex, Perlbmk, Crafty, Apsi, and Eon, were selected.

Three categories of existing caching schemes are evaluated and compared with the elbow cache. The first category is conventional caches with high associativity, including 16-way and fully-associative caches using the LRU replacement policy, denoted as 16-way+LRU and Full+LRU. The second category is the skewed-associative caches. Three improved replacement policies, NRUNRW [Seznec and Bodin 1993b], time-stamp [Gonzalez et al. 1997], and reuse-count [Qureshi et al. 2005], are considered, denoted as Skew+NRUNRW, Skew+TS, and Skew+RC respectively. The third category is the V-way cache. Only the reuse-count replacement policy is applied due to the nature of the V-way replacement, denoted as V-way+RC. Finally, for the elbow caches, the same three replacement policies as for the skewed-associative caches are implemented, denoted as Elbow+NRUNRW, Elbow+TS, and Elbow+RC.














Figure 3-6. Miss ratio reduction with caching schemes and replacement policies



A practical time-stamp scheme is evaluated, which uses an 8-bit time stamp as reported in [Gonzalez et al. 1997]. For the reuse-count scheme, our evaluation suggests a 3-bit counter with an initial value of 3 and a victim value of zero for the best performance. If no zero reuse count is found, a victim is picked randomly within the lookup set. For skewed-associative caches and









elbow caches, we borrow Mapping Function 97 as the hashing functions [Bodin and Seznec

1997]. For the V-way cache, we simulate a directory with twice as many entries as the total cache

frames. Due to the overhead in sequential search for the replacement and the extra data

movement involved in moving the block from the connected frame, we limit the replacement

scope within two levels (the lookup set, plus one level of connected set) in elbow caches. As a

result, during the replacement search, up to 16 frames can be reached, and at most four directory

searches and one block movement may be needed. In case that a replaceable block is not found,

the best candidate in the lookup set is replaced.

We compare the miss ratio reduction for various caching mechanisms with different

replacement policies. Figure 3-6 summarizes the relative miss ratio reductions for the nine

caching schemes compared with the baseline cache (32KB, 4-way set-associative cache). Due to

a wide range of reductions, the twelve workloads are divided into four groups as can be

identified from the figures with four different y-axis scales.

Several interesting observations can be made. First, the elbow caches show more miss ratio reduction than the skewed-associative caches. Obviously, this is due to the advantage that the connected sets extend the search domain from 4 frames to 16 frames. The Elbow+RC has miss ratio reductions ranging from 2% to as high as 52% with an average reduction of 21%, while the Elbow+TS has miss ratio reductions ranging from 3% to as high as 57% with an average of 22%. The Skew+RC has miss ratio reductions ranging from less than 1% to 45% with an average of 11%. On the other hand, the Skew+TS's reduction ranges from 3% to 50% with an average of about 17%. For the elbow cache, the time-stamp based and the reuse-count based replacement show mixed results. Both methods work effectively, with a slight edge to the time-stamp scheme. In contrast, the time-stamp works much better than the reuse-count on skewed-associative caches. Apparently, LRU-like replacement performs better when the replacement is confined within the lookup set. Although the time-stamp scheme performs slightly better, its cost is also higher compared with the reuse-count scheme.

Second, in general the elbow cache out-performs the V-way cache by a significant margin. The average miss ratio reduction for the V-way is about 11%. The relative V-way performance fluctuates considerably among different applications. For example, the V-way cache shows the most reduction compared with other schemes in Gzip. It also performs nearly the best in Vortex. However, for Equake and Apsi, V-way's performance is at the very bottom. The main reason for this discrepancy is its inability to handle hot sets. For Gzip and Vortex, 92% and 75% of the time, an unused directory entry in the lookup set can be allocated for the missed block, while for Equake and Apsi, a replacement search outside the lookup set is permitted only 27% and 30% of the time. This confirms that even with double the directory size, the V-way cache is constrained in solving the hot set problem.

Third, it is interesting to observe that the elbow cache can out-perform a fully-associative cache by a significant margin in many applications. The average miss-ratio reductions for Full+LRU, Elbow+TS, and Elbow+RC are 21.4%, 21.6%, and 20.9% respectively. These interesting results are due to two reasons. First, the fully-associative cache suffers the worst cache pollution when a "never-reuse" block is moved into the cache. It takes c more misses to replace a never-reused block, where c is the number of frames in the cache. Such a block can be replaced much faster in elbow caches once the block becomes replaceable. Vortex is a good example of the cache pollution problem, in which Full+LRU performs much worse than 16-way+LRU. Second, the skew hashing functions provide good randomization for alleviating set conflicts. By searching through the connected sets in elbow caches, the hot set issue is largely eliminated.

3.4.2 Searching Length and Cost

In this section, we evaluate the extra cost associated with elbow caches. A normal cache access involves a single tag directory access to determine a hit/miss. During replacement in the elbow cache, extra tag accesses are required when traversing the replacement set. Moreover, an additional block movement may happen between search levels when the replaced block is not located in the lookup set. We simulated the replacement policy with 2-level searching (the lookup set plus one level of connected sets). If a victim is found at the first level, there is no extra tag access or data movement. Otherwise, an extra block movement along with up to three additional tag accesses is needed if the replaced block is found in the second level. If no replaceable block can be found within 2 levels, a victim is chosen from the lookup set and no extra block movement is required. However, this case still incurs 3 additional tag accesses.

Table 3-1. Searching levels, extra tag access, and block movement

Workload    Replacement search                       Overhead/Access
            1st level   2nd level   Not found        Extra tag accesses   Block movement
Bzip        61.8%       35.0%       3.2%             1.5%                 1.0%
Vpr         47.0%       45.0%       8.0%             2.7%                 1.5%
Perlbmk     39.0%       45.4%       15.6%            3.3%                 1.6%
Apsi        45.6%       45.0%       9.4%             2.5%                 1.4%


Table 3-1 summarizes the cost of the elbow cache with four workloads, Bzip, Vpr, Perlbmk, and Apsi. Note that we selected these four workloads, one from each miss reduction range described previously, to simplify the presentation. The percentage of replacements that find a replaceable block at each search level is shown. About 40%-60% of the replaceable blocks are located in the lookup set, and about 35%-45% are found in the connected sets. The percentage of replacements for which no replaceable block is found in the first two levels varies from 3.2% to 15.6%. We also count the extra tag accesses and block movements. As shown in the table, an extra 1.5% to 3.3% of tag accesses are encountered in the elbow cache. Also, on average, an extra block movement is needed for 1% to 1.6% of the memory accesses. It is important to notice that the extra tag accesses and block movements are not on the critical cache access path, because they are only encountered on cache misses. These extra tag accesses and block movements can be delayed in case of a conflict with normal cache accesses.

3.4.3 Impact of Cache Partitions

So far, our evaluations of the elbow cache have been based on a 4-way partitioned structure. In this section, we show the results of 2-, 4-, and 8-way partitions. Again, the four workloads, Bzip, Vpr, Perlbmk, and Apsi, one from each miss reduction range, are selected. The miss ratio, instead of the miss ratio reduction, is used for the comparison.

As shown in Figure 3-7, increasing the degree of partitioning improves the miss ratios for all four workloads. These results are obtained using Elbow+RC. Similar results are also observed using Elbow+TS. From 2-way to 8-way, the miss ratios are reduced accordingly: 2.8%, 2.6%, 2.5% for Bzip; 4.2%, 3.4%, 3.2% for Vpr; 12.2%, 10.3%, 8.8% for Perlbmk; and 2.0%, 1.6%, 1.4% for Apsi, respectively. These reduction rates are much faster than the miss reduction rates for set-associative caches when the associativity increases from 2-way to 8-way.

In a 2-level elbow cache, the search domain is equal to p^2, where p is the number of partitions. For elbow caches from 2-way to 8-way, the replacement scope can reach from 4 and 16 up to 64 frames. This quadratic increase in the replacement set out-performs the linear increase in the replacement set for set-associative caches. In terms of cost, however, the extra directory tag











access only increases linearly with the number of partitions. Moreover, the increase of partitions


requires no extra block movement when the replacement is confined within two levels.

Figure 3-7. Miss ratios for different cache associativities

In comparison with the fully-associative cache, the further miss reductions with the elbow cache make it surpass the fully-associative cache performance on three of the four workloads, Bzip, Vpr, and Perlbmk.

3.4.4 Impact of Varying Cache Size

We analyze the performance impact of cache size on elbow caches. Four L1 data cache sizes, 8KB, 16KB, 32KB, and 64KB, are simulated using the same four workloads as those in Section 3.4.3. The miss ratios are plotted in Figure 3-8 for three caching schemes: 4-way+LRU,

Full+LRU, and 4-way Elbow+RC. As observed, the cache size makes a huge impact on the miss ratios of the three caching schemes. It is straightforward that bigger caches reduce the miss ratios for all three caching schemes.


Figure 3-8. Miss rate for different cache sizes


However, the relative gaps among the three schemes vary widely among different workloads and cache sizes. Generally speaking, conflict misses are reduced with bigger caches, which makes the elbow cache less effective. This is true for Bzip and Perlbmk. However, for Vpr, the gap between 4-way+LRU and Full+LRU stays relatively the same with 32KB and 64KB caches. Therefore, the elbow cache is equally effective with all four cache sizes. Apsi acts oppositely: the elbow cache is much more effective for 32KB and 64KB caches.

A detailed study of Apsi indicates that the 32KB Full+LRU can hold the working set, but the 32KB 4-way+LRU cannot due to conflicts in hot sets. Consequently, Elbow+RC shows a huge gap against 4-way+LRU due to better replacement. At 16KB, none of the caching schemes can hold the working set, which creates heavy capacity misses. As a result, all three caching schemes show similar performance.

3.5 Related Work

Applications with regular memory access patterns can experience severe cache conflict misses in set-associative caches. There have been a few works on finding better mapping functions for cache memories. Most of the prior hashing functions permute the accesses using some form of Exclusive-OR (XOR) operations. The elbow cache is not limited to any specific randomization/hashing method; other possible functions can be found in [Yang and Adina 1994; Kharbutli et al. 2004]. The skewed-associative cache applies different mapping functions to different partitions. Although various replacement policies [Seznec 1993a; Seznec and Bodin 1993b; Bodin and Seznec 1997; Gonzalez et al. 1997] have been proposed for the skewed-associative cache, finding an efficient and effective one remains an open issue.

The hash-rehash [Agarwal et al. 1988], the column-associative [Agarwal and Pudar 1993a], and the group-associative [Peir et al. 1998] caches use extra directories to increase associativity. In contrast, the elbow cache uses links to connect blocks in the cache without any extra directory storage. The V-way cache [Qureshi et al. 2005] can be viewed as a new way to increase associativity by doubling the cache directory size. However, it requires indirect links between each entry in the directory and its corresponding block in the data array. Furthermore, even with extra directory space, it cannot solve the hot set problem since the directory entries in the hot sets are always occupied.









3.6 Conclusion

The efficiency of the traditional set-associative cache is degraded by severe conflict misses. The elbow cache has demonstrated its ability to expand the replacement set beyond the lookup set boundary without adding any complexity to the lookup path. Because of the characteristics of the elbow cache, it is difficult to implement recency-based replacement. The proposed low-cost reuse-count replacement policy can achieve cache performance comparable to the recency-based replacement policy.









CHAPTER 4
TOLERATING RESOURCE CONTENTIONS WITH RUNAHEAD ON MULTITHREADING
PROCESSORS

4.1 Introduction

SMT processors exploit both ILP and TLP by fetching and issuing multiple instructions

from multiple threads at the same cycle to utilize wide-issue slots. In SMT, multiple threads

share resources such as caches, functional units, instruction queue, instruction issue window, and

instruction window [Tullsen et al. 1995; Tullsen et al. 1996]. SMT typically benefits from giving

threads complete access to all resources every cycle. However, contention for these resources may significantly hamper the performance of individual threads and hinder the benefit of exploiting more parallelism from multiple threads.

First, disruptive cache contention leads to more cache misses and hurts overall performance. Second, threads can hold critical resources while they are not making progress due to long-latency operations, blocking other threads from normal execution. For example, if a stalled thread fills the issue window and instruction window with waiting instructions, it shrinks the window available for the other threads to find instructions to issue and to bring new instructions into the pipeline. Thus, exactly when parallelism is most needed, when one or more threads are no longer contributing to the instruction flow, fewer resources are available to expose that parallelism.

We investigate and evaluate a valuable solution to this problem, runahead execution on

SMTs. Runahead execution was first proposed to improve MLP on single-thread processors

[Dundas and Mudge 1997; Mutlu et al. 2003]. Effectively, runahead execution can achieve the

same performance level as that with a much bigger instruction window. With heavier cache

contentions on SMTs, runahead execution is more effective in exploiting MLP. Besides the

inherent advantage of memory prefetching, by removing long-latency memory operations from









the instruction window, runahead execution can ease resource blocking among multiple threads

on SMTs, thus making other sophisticated thread-scheduling mechanisms unnecessary [Tullsen

and Brown 2001; Cazorla et al. 2003].

4.2 Resource Contentions on Multithreading Processors

This section demonstrates the resource contention problem in SMT processors. Several

SPEC2000 benchmarks are chosen for this study based on their L2 cache performance [Hu et al.

The weighted speedup [Snavely and Tullsen 2000] is used to compare the IPC of multithreaded execution against the IPC when each thread is executed independently. IPC_SMT represents an individual thread's IPC in the SMT mode.

Figure 4-1 shows the weighted speedups of eight combinations of two threads on SMTs using the ICOUNT2.8 scheduling strategy [Tullsen et al. 1996]. These results show that running two threads on SMTs may present worse IPC improvement than running the two threads separately. We can categorize the workloads into three groups. The first group includes Twolf/Art, Twolf/Mcf, and Art/Mcf, which are composed of programs with relatively high L2 miss penalties [Hu et al. 2003]. The second group includes Parser/Vpr, Vpr/Gcc, and Twolf/Gcc, which are composed of programs with median L2 miss penalties. Finally, the third group has two workloads, Gap/Bzip and Gap/Mcf, in which either one or both programs have low L2 miss penalties. In general, except for the third group, the workloads have poor performance with median-size L2 caches. For the first group, the median size is about 2MB to 8MB, while for the second group, the median size is about 512KB to 1MB. The weighted speedups in the SMT mode at these cache sizes can be significantly lower than 1.

Weighted Speedup = Σ_threads (IPC_SMT / IPC_single-thread)
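For illustration with purely hypothetical IPC values (not taken from the experiments): suppose thread A achieves an IPC of 1.2 alone but 0.50 in the SMT mode, while thread B achieves 0.8 alone but 0.45. Then

Weighted Speedup = 0.50/1.2 + 0.45/0.8 ≈ 0.42 + 0.56 = 0.98 < 1

so co-scheduling the two threads is a slight net loss, even though both threads can issue instructions every cycle.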












Figure 4-1. Weighted instruction per cycle speedups for multiple threads vs. single thread on simultaneous multithreading


Figure 4-2. Average memory access time ratio for multiple threads vs. single thread on simultaneous multithreading


Generally speaking, with small caches, heavy cache misses are encountered regardless of the number of threads. Thus, no significant performance degradation is observed in the SMT mode. With large caches, both threads may incur very few cache misses even when two threads are running together in the SMT mode. This U-shaped speedup curve is evident for workloads in the second group. It is generally true for workloads in the first group too. However, negative IPC speedups can still result for workloads in the first group even with very large L2 caches, because degradation from other resource contentions outweighs the benefit of exploiting TLP. Workloads in the third group benefit from TLP consistently.

To demonstrate the impact of cache contention on SMT performance, we plot the average memory access time ratio between two threads running in the SMT mode and running sequentially in a single-thread mode (Figure 4-2). Except for Gap/Bzip and Gap/Mcf, the workloads experience increases in the average memory access time with median-size caches ranging from 512KB to 4MB. The degree of increase in the average memory access time matches well with the IPC losses in Figure 4-1. Nevertheless, although little increase in average memory access time can be observed with 8MB or 16MB L2 caches, workloads in the first group still show huge IPC degradations. This is again due to contention on other resources.

4.3 Runahead Execution on Multithreading Processors

Runahead execution on SMT processors follows the same general principle as on single-thread processors [Mutlu et al. 2003]. It prevents the instruction window from stalling on long-latency memory operations by executing speculative instructions. Runahead execution of a thread starts once a long-latency load reaches the top of the instruction window. An invalid value is assigned to the long-latency load to allow the load to be pseudo-committed without blocking the instruction window. A checkpoint of all the architectural state must be made before entering the runahead execution mode. During runahead mode, the processor speculatively executes instructions relying on the invalid value. All the instructions that operate on the invalid value will also produce invalid results. However, the instructions that do not depend on the invalid value will be pre-executed. When the memory operation that started runahead mode is resolved, the processor rolls back to the initial checkpoint and resumes normal execution. As a consequence, all the speculative work done by the processor is discarded. Nevertheless, this previous execution is not completely useless. The main advantage of runahead is that the speculative execution will have generated useful data prefetches, improving the behavior of the memory hierarchy during the real execution. In some sense, runahead execution has the same effect as physically enlarging the instruction window.
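The mode transitions can be summarized with the C-style sketch below; the register count, helper names, and checkpoint representation are illustrative assumptions, not the simulator's actual interfaces:

#include <stdbool.h>

typedef struct { int regs[32]; long pc; } ckpt_t;  /* architectural snapshot */

static ckpt_t checkpoint;
static bool   reg_invalid[32];  /* INV bit per architectural register */

extern ckpt_t capture_state(void);
extern void   restore_state(ckpt_t c);
extern void   flush_pipeline(void);

/* A long-latency load reaches the head of the instruction window:
 * checkpoint the state, mark the load's destination invalid, and
 * pseudo-commit it so the window keeps draining. */
void enter_runahead(int dest_reg)
{
    checkpoint = capture_state();
    reg_invalid[dest_reg] = true;  /* INV propagates to dependents */
}

/* During runahead, an instruction whose sources are all valid is
 * pre-executed (its loads act as prefetches); otherwise its result
 * is merely marked invalid as well. */
bool propagate_invalid(const int src[], int nsrc, int dest)
{
    bool inv = false;
    for (int i = 0; i < nsrc; i++)
        inv |= reg_invalid[src[i]];
    reg_invalid[dest] = inv;
    return !inv;  /* true: safe to pre-execute */
}

/* The miss that triggered runahead returns: discard all speculative
 * work and resume normal execution from the checkpoint. */
void exit_runahead(void)
{
    flush_pipeline();
    restore_state(checkpoint);
    for (int r = 0; r < 32; r++)
        reg_invalid[r] = false;
}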

We adapted and modified a SimpleScalar-based SMT model from the Vortex Project [Dorai and Yeung 2002]. The out-of-order SimpleScalar processor separates in-order execution at the functional level from the detailed pipelined timing simulation. At the functional level, instructions are executed one at a time without any overlap. The results from the functional execution drive the pipelined timing model. One important implementation issue is that the checkpoint for runahead execution must be made at the functional level. The actual invalid value from runahead execution is not simulated. Instead, registers or memory locations are marked when their content is invalid. In the runahead mode, only those L2 misses with correct memory addresses will be issued.


Figure 4-3. Basic function-driven pipeline model with runahead execution

Figure 4-3 illustrates the pipeline microarchitecture. The checkpoint is made at the Dec/Dep stage in the function mode when a load misses the L2 cache. During runahead execution, all destination registers and memory locations are marked invalid for the L2 miss and all its dependent instructions. All memory writes are buffered in the MUB (Memory Update Buffer) to allow correct execution while maintaining the memory state at the checkpoint.
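A minimal C sketch of such a memory update buffer, with assumed sizes and field names (the actual structures are not spelled out here):

#include <stdbool.h>
#include <stdint.h>

#define MUB_SIZE 64  /* buffer capacity (assumed) */

/* Runahead stores write here instead of to memory, so the memory
 * image at the checkpoint stays intact; runahead loads search the
 * buffer first to see their own thread's speculative stores. */
typedef struct {
    bool     valid;
    bool     invalid_data;  /* value descends from the INV L2 miss */
    uint64_t addr, data;
} mub_entry;

static mub_entry mub[MUB_SIZE];
static unsigned  mub_tail;

void mub_write(uint64_t addr, uint64_t data, bool invalid)
{
    mub_entry *e = &mub[mub_tail++ % MUB_SIZE];
    e->valid = true;
    e->addr = addr;
    e->data = data;
    e->invalid_data = invalid;
}

/* Youngest-first search so a load sees the latest runahead store. */
bool mub_lookup(uint64_t addr, uint64_t *data, bool *inv)
{
    for (int i = MUB_SIZE - 1; i >= 0; i--) {
        unsigned j = (mub_tail + i) % MUB_SIZE;
        if (mub[j].valid && mub[j].addr == addr) {
            *data = mub[j].data;
            *inv  = mub[j].invalid_data;
            return true;
        }
    }
    return false;
}

/* Exiting runahead simply discards the buffered updates. */
void mub_flush(void)
{
    for (int i = 0; i < MUB_SIZE; i++)
        mub[i].valid = false;
}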

4.4 Performance Evaluation

Performance evaluations of runahead execution are carried out on the modified out-of-order SimpleScalar-based SMT model [Dorai and Yeung 2002]. The ICOUNT2.8 scheduling policy [Tullsen et al. 1996] is used. Except for replicated program counters, register files, and completion logic, other resources are shared among multiple threads. Threads share a 256-entry instruction window and a 256-entry load-store queue. The issue window is assumed to be the same size as the instruction window. We fix the size of the L1 caches at 64KB and vary the size of the L2 cache from 128KB to 16MB. Memory access latency is set to 400 cycles. There can be at most 60 outstanding memory requests at the same time.

Eight mixed workload combinations from SPEC2000, Twolf/Art, Twolf/Mcf, Art/Mcf, Parser/Vpr, Vpr/Gcc, Twolf/Gcc, Gap/Bzip, and Gap/Mcf, are used. To compute the weighted speedup, the total simulated instructions of individual threads are kept the same between the multiple-thread execution mode and the single-thread execution mode.

4.4.1 Instructions Per Cycle Improvement

Figure 4-4 summarizes the IPC improvement of runahead on SMTs with 1MB L2 caches,

where the IPCs of individual threads as well as the IPCs of the mixed threads are plotted. The

three bars on the left of each workload represent the IPCs without runahead execution, while the

right three bars are IPCs with runahead.

In general, runahead improves IPCs in both single-thread and multithread modes. The improvement is more significant in the SMT mode than in the single-thread mode. Among the three groups, workloads in the second group benefit the most. With runahead, the combined IPCs










on SMTs are consistently higher than the IPC of each individual thread. This is consistent with

the results in Figure 4-2, where workloads in the second group show the most increases in the

average memory access time. Although runahead on SMTs also shows much higher

improvement, the resulting IPCs in the other two groups of workloads still fall between the IPCs of

the two individual threads. Therefore, we decided to compare IPC improvements using the

weighted speedup as suggested in [Tullsen and Brown 2001] with various cache sizes.

Weighted Speedup = Σ_threads (IPC_runahead / IPC_baseline)


Figure 4-4. Instructions per cycle with/without runahead on simultaneous multithreading

4.4.2 Weighted Speedups on Multithreading Processors

Figure 4-5 shows the weighted IPC speedups of runahead execution on SMTs. In general, significant performance improvement can be observed as long as the cache size is not very large. As a result of very few misses with 8MB or 16MB caches, runahead execution is not effective in overlapping the scattered cache misses. Similarly, since Gap/Bzip has a very small combined working set, runahead execution is ineffective for all cache sizes. Among the eight workloads, it is unexpected that Gap/Mcf displays the highest speedup. Cache contention in Gap/Mcf should not be as severe as in the workloads of the first group, since Gap has the lowest L2 miss penalty among the selected programs. However, runahead execution not only exploits MLP for Mcf; it also releases





























resources to unblock Gap from the frequent L2 misses of Mcf. Other workloads show performance benefits of various degrees from runahead execution with small and median caches. Because of heavier cache misses for workloads in the first group, the weighted speedup is generally higher than that of workloads in the second group.


Figure 4-5. Weighted speedup of runahead execution on simultaneous multithreading


Figure 4-6. Average memory access time ratio of runahead execution on simultaneous multithreading











The improvement of the average memory access time with runahead execution on SMTs is shown in Figure 4-6, where the ratio is the average memory access time with runahead over that without runahead. Significant drops in the average memory access time are evident for all workloads except with 16MB caches. It is interesting to observe that because of differences in working sets, the significant jumps in memory access time ratios occur from 2MB to 8MB for workloads in the first group, but from 512KB to 2MB for workloads in the second group. As expected, Gap/Bzip does not benefit as much from runahead due to its small working set.


Figure 4-7. Weighted speedup of runahead execution between two threads running in the simultaneous multithreading mode and running separately in a single-thread mode

Recall that the overall IPC improvement with runahead execution comes both from exploiting MLP and from better sharing of other resources. Therefore, minor discrepancies between the average memory access time improvement and the overall IPC improvement can be expected. For example, the average memory access time improvement of Gap/Mcf is less than that of workloads in the first group. However, Gap/Mcf displays significantly more IPC improvement (Figure 4-5). Similarly, Twolf/Mcf has worse improvement in memory access time










than the other two workloads in the first group, but its IPC improvement is the highest among the

three workloads.

Figure 4-7 shows the weighted speedups of SMTs with runahead over single thread

execution with runahead. The advantages of runahead execution on the SMT mode are clearly

displayed for workloads in both the second and the third groups. However, workloads in the first

group are still experiencing negative speedups when cache sizes are 1MB or bigger. In

comparison with negative speedups without runahead execution (Figure 4-1), runahead

execution helps to pull negative speedups in the positive direction. For 4MB caches especially,

weighted speedups are improved from 0.61, 0.60, 0.44 without runahead to 0.90, 0.71, 0.70 with

runahead for Twolf/Mcf, Twolf/Art, and Art/Mcf, respectively. As a result of very poor SMT

performance due to cache and other resource contentions, runahead execution can alleviate but

cannot overcome the huge loss from running two threads in the SMT mode.

The weighted speedup in Figure 4-7 is a combination of two factors: the benefit of

runahead execution when two threads run together vs. run separately, and the impact of SMT

itself. In order to separate effects of runahead execution from effects of SMT execution, we

define a new Weighted Speedup Ratio. The basic idea is to calculate speedup ratios between

individual thread's runahead speedup in the SMT mode and its runahead speedup in the single-

thread mode.

\text{Weighted Speedup Ratio} = \frac{1}{\text{Number of threads}} \sum_{threads} \frac{IPC_{SMT\text{-}runahead} \,/\, IPC_{SMT\text{-}norunahead}}{IPC_{Single\text{-}runahead} \,/\, IPC_{Single\text{-}norunahead}}
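Both metrics are simple arithmetic over per-thread IPCs. The following minimal Python sketch only illustrates the two definitions above; the IPC values in the usage lines are hypothetical numbers for a two-thread workload, not measured data.

    # Weighted speedup [Tullsen and Brown 2001]: each thread's SMT-mode IPC
    # relative to its single-thread baseline IPC, summed over threads.
    def weighted_speedup(ipc_smt, ipc_baseline):
        return sum(s / b for s, b in zip(ipc_smt, ipc_baseline))

    # Weighted speedup ratio: each thread's runahead speedup in SMT mode
    # divided by its runahead speedup in single-thread mode, averaged.
    def weighted_speedup_ratio(smt_ra, smt_nora, single_ra, single_nora):
        n = len(smt_ra)
        return sum((smt_ra[i] / smt_nora[i]) / (single_ra[i] / single_nora[i])
                   for i in range(n)) / n

    # Hypothetical IPCs for two threads (for illustration only).
    print(weighted_speedup([0.9, 0.6], [1.2, 0.8]))           # 1.5
    print(weighted_speedup_ratio([0.9, 0.6], [0.7, 0.5],
                                 [1.4, 0.9], [1.2, 0.8]))     # ~1.08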


As shown in Figure 4-8, the speedup of runahead execution in SMT mode is generally

better than the speedup of runahead execution in single-thread execution mode. For example,

Gap/Mcf shows a huge benefit from runahead execution in the SMT mode. This explains why Gap/Mcf has the highest overall IPC speedup (Figure 4-5). Similarly, Twolf/Mcf shows higher overall speedups compared with the other workloads in the first group due to more benefit from runahead execution in the SMT mode. Two workloads in the second group, Vpr/Gcc and Twolf/Gcc, exhibit negative improvement with tiny caches. Recall that programs in this group have moderate L2 cache penalties. With unrealistically small caches, cache misses can increase to the point that runahead execution becomes very effective in the single-thread mode.




Figure 4-8. Ratios of runahead speedup in simultaneous multithreading mode vs. runahead
speedup in single-thread mode

4.5 Related Work

SMT permits the processor to issue multiple instructions from multiple threads at the same

cycle [Tullsen et al. 1995; Tullsen et al. 1996]. The scheduling strategy based on the instruction

count (ICOUNT) of each active thread regulates the fetching policy to prevent any thread from

taking more resources than its fair share [Tullsen et al. 1996]. Other proposed methods [El-

Moursy and Albonesi 2003; Cazorla et al. 2004a; Cazorla et al. 2004b] attempt to identify









threads that will encounter long-latency operations such as L2 cache misses. A thread with a long-latency operation may be delayed to prevent it from occupying more resources.

Serious cache contention problems on SMT processors were reported [Tullsen and Brown

2001]. Instead of keeping a thread that issues a long-latency load ready to immediately resume execution upon return of the loaded data, they proposed methods to identify threads that are likely stalled and to free all resources associated with those threads. A balance scheme was proposed [Cazorla et al. 2003] to dynamically switch between flushing and keeping long-latency threads to avoid the overhead of flushing.

The optimal allocation of cache memory between two competing threads was studied

[Stone et al. 1992]. Dynamic partitioning of shared caches among concurrent threads based on

"marginal gains" was reported in [Suh et al. 2001]. The results showed that significantly higher

hit ratios over the global LRU replacement could be achieved.

Runahead execution was first proposed to improve MLP on single-thread processors

[Dundas and Mudge 1997; Mutlu et al. 2003]. Effectively, runahead execution can achieve the

same performance level as that with a much bigger instruction window. We investigate and

evaluate runahead execution on SMT processors with multiple threads running simultaneously. It

is our understanding that this is the first work to apply runahead execution on SMT processors to

tolerate shared resource contentions. Besides the inherent advantage of memory prefetching,

runahead execution can also prevent a thread with long latency loads from occupying shared

resources and impeding other threads from making forward progress.

4.6 Conclusion

Simultaneous Multithreading techniques have moved from laboratory ideas into real

and commercially successful processors. However, studies have shown that without proper










mechanisms to regulate the shared resources, especially shared caches and the instruction

window, multiple threads show lower overall performance when running simultaneously.

Runahead execution, proposed initially for achieving better performance for single-thread

applications, works very well in the multiple-thread environment. In runahead execution,

multiple long-latency memory operations can be discovered and overlapped to exploit the

memory-level parallelism; meanwhile, shared critical resources held by the stalling thread can be

released to keep the other thread running smoothly to exploit the thread-level parallelism.

Performance evaluations have demonstrated that speedups of up to 4-5 times are achievable with runahead execution in SMT environments.









CHAPTER 5
ORDER-FREE STORE QUEUE USING DECOUPLED ADDRESS MATCHING AND
AGE-PRIORITY LOGIC

5.1 Introduction

Timely handling of correct memory dependences in a dynamically scheduled, out-of-order execution processor has posed a long-standing challenge, especially when the instruction

window is scaled up to hundreds or even thousands of instructions [Sankaralingam et al. 2003;

Sankaralingam et al. 2006; Sethumadhavan et al. 2007]. Two types of queues are usually

implemented in resolving memory dependence. A Store Queue (SQ) records all in-flight stores

for determining store-load forwarding and a Load Queue (LQ) records all in-flight loads for

detecting any memory dependence violation [Kessler 1999; Hinton et al. 2001]. There are two

fundamental challenges in enforcing correct memory dependence. The first one is to forward

values from the youngest older in-flight store with matched address to a dependent load. In a

conventional processor, this is implemented by forcing stores to enter the SQ in program order and

finding the parent store using expensive fully-associative search. The second challenge is to

maintain correct memory dependence when a load is issued but early store addresses have not

been resolved. Speculation based on memory dependence prediction or other aggressive methods

[Adams et al. 1997; Hesson et al. 1997; Chrysos and Emer 1998; Kessler 1999; Yoaz et al. 1999;

Hinton et al. 2001; Subramaniam and Loh 2006] enables the load to proceed without waiting for

the early store addresses. Any offending load that violates the dependence must be identified

later by searching the LQ, causing a pipeline flush. In a conventional processor, the program

order and fully-associative search are also required in the LQ for identifying any memory

dependence violation by early executed loads when an older store is executed [Sha et al. 2005].

There have been many proposals for improving the scalability of the SQ and LQ

[Moshovos et al. 1997; Park et al. 2003; Sethumadhavan et al. 2003; Roth 2004; Srinivasan et al.










2004; Cristal et al. 2005; Sethumadhavan et al. 2006; Sha et al. 2005; Stone et al. 2005; Torres et

al. 2005; Castro et al. 2006; Garg et al. 2006; Sha et al. 2006; Subramaniam and Loh 2006]. In

this work, we focus on an efficient SQ design for store-load forwarding. Since store addresses

can be generated out of program order, the SQ cannot be partitioned into smaller address-based

banks while maintaining program order in each bank for avoiding fully-associative searches in

the entire SQ. Among many proposed solutions for scalable SQ [Akkary et al. 2003;

Sethumadhavan et al. 2003; Gandhi et al. 2005; Sha et al. 2005; Torres et al. 2005; Baugh and

Zilles 2006; Garg et al. 2006; Sethumadhavan et al. 2007], two recent approaches are of great

significance and related to our proposal. The first approach is to accept potential unordered

stores and loads by speculatively forwarding the matched latest store based on the execution

sequence, instead of the correct last store in program order [Gandhi et al. 2005; Garg et al. 2006].

The second approach is to allow an unordered banked SQ indexed by store address, but record

the correct age along with the store address [Sethumadhavan et al. 2007]. Sophisticated hardware

can match the last store according to the age of the load without requiring the stores to be

ordered by their ages in the SQ. Our simulation results show that a significant number of mismatches exist between the latest and the last stores for dependent loads, causing expensive re-executions. Furthermore, there is at most one matching parent store in the SQ even though

multiple stores could have the same address. Recording each store and age pair complicates the

age priority logic and may become the source of conflicts with limited capacity in each SQ bank.

In this work, we introduce an innovative SQ design that decouples the store/load address

matching unit and its corresponding age-order priority encoding logic from the original SQ. In

our design, renamed/dispatched stores' addresses and data enter an SQ RAM array in program order.

Instead of relying on fully-associative searches in the entire SQ, a separate SQ directory is









maintained for matching the store addresses in the SQ. A store enters the SQ directory when its

address is available. Only a single entry is allocated in the SQ directory for multiple outstanding

stores that have the same address. Each entry in the SQ directory is augmented with an age-order

vector to indicate the correct locations (ages) of multiple stores with the same address in the SQ

RAM. The width of the age-order vector is equal to the size of the SQ RAM. When a store is

issued, a directory entry is created if it does not already exist. The corresponding bit in the age-

order vector is turned on based on the location of the store. When a load is issued, an address

match to an entry in the SQ directory triggers the age-order priority logic on the associated age-

order vector. Based on the age of the load, a simple leading-one detector can locate the youngest

store that is older than the load for data forwarding. With the age-order vector, the store address

is free from imposing any order in the SQ directory. Consequently, the decoupled SQ directory

can be organized as a set-associative structure to avoid fully-associative searches. Besides the

basic decoupled SQ without considering the data alignment, we further extend the design to

handle partial stores and loads by using byte masks to identify which bytes within an 8-byte

range are read or written. We also include the memory dependence resolution for the misaligned

stores and loads that cross the 8-byte boundary.

The decoupled address matching and age-order priority logic presents significant

contributions over the existing full-CAM SQ or other scalable SQ designs. First, because the size

and configuration of the decoupled SQ directory are independent of the program-ordered SQ

RAM, it provides new opportunities to optimize the SQ directory design for locating the parent

store. Second, the relaxation of the program-order requirement in the SQ directory using a

detached age-order vector helps to abandon the fully-associative search that is the key obstacle

for a scalable SQ. Third, a store need not be present in the SQ directory until its address









becomes available. As reported in [Sethumadhavan et al. 2007], a significant amount of renamed

stores do not have their address available and hence need not occupy any SQ directory space.

Fourth, our evaluation shows that on average, close to 30% of the executed stores have

duplicated store addresses in the SQ. The SQ directory only needs to cover the unique addresses of

the issued stores, and hence the SQ directory size can be further reduced. Moreover, by getting

rid of duplicated addresses in the SQ directory, the potential set (bank) conflict can also be

alleviated, since duplicated store addresses would all map to the same set, causing more

conflicts. Fifth, a full-CAM directory is inflexible for parallel searches that are often necessary in

handling memory dependence for stores and loads misaligned across the 8-byte entry boundary.

The set-associative (banked) SQ directory, on the other hand, permits concurrent searches on

different sets and hence can eliminate the need to duplicate the SQ directory. Lastly, we believe

this is the first proposal that correctly accounts for and handles frequently occurring partial and misaligned stores and loads, such as those in the x86 architecture, while previous proposals have not dealt

with this important issue. The performance evaluation results show that the decoupled SQ

outperforms the latest-store forwarding scheme by 20-24% and the late-binding

SQ by 8-12%. In comparison with an expensive full-CAM SQ, the decoupled SQ only loses less

than 0.5% of the IPC.

5.2 Motivation and Opportunity

In this section, we demonstrate that multiple in-flight stores with the same address

constantly exist in the instruction window and have a significant impact on the SQ design.

We also show the severity of mismatches between the latest and the last store to dependent load.

The simulation is carried out on PTLsim [Yourst 2007] running SPEC workloads. We simulated

a 512-entry instruction window with unlimited LQ/SQ.











(Figure 5-1 plots two accumulated-percentage curves, one for store addresses and one for stores, against the total number of outstanding stores and store addresses.)

Figure 5-1. Accumulated percentage of stores and store addresses for SPEC applications

Figure 5-1 plots the accumulated percentages of stores and unique store addresses with

respect to the total number of outstanding stores and store addresses for the twenty simulated

SPEC2000 applications. The statistics are collected after each store is issued and its address is available. An infinite SQ is maintained to record all outstanding store addresses. Each address

has an associated counter to track the number of stores that have the same store address. After


the store is issued, a search through the SQ is carried out. Upon a hit, the counter of the matched

store address is incremented by 1. In case of a miss, the new address is inserted into the SQ with

the counter initialized to 1. Meanwhile, the total unique store addresses and the total outstanding

stores are counted after each issued store. The total number of unique store addresses is equal to

the size of the SQ, while the outstanding stores are the summation of the counter for each store

address in the SQ. These two numbers indicate the SQ size required for recording only the


unique store addresses or all individual stores in the SQ. Once a store is committed, the counter

associated with the store address in the SQ is decremented by 1. The address is removed from the















































SQ when the counter reaches 0. When a branch mis-prediction occurs, all stores younger than

the branch are removed from the SQ.
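This bookkeeping amounts to maintaining a multiset of outstanding store addresses. A minimal Python sketch of the measurement follows; the event-handler names are our own shorthand, not actual PTLsim hooks.

    from collections import Counter

    outstanding = Counter()   # store address -> number of in-flight stores

    def store_issued(addr):
        # On a hit the counter is incremented; on a miss the address is
        # inserted with a count of 1 (Counter does both uniformly).
        outstanding[addr] += 1
        unique_addresses = len(outstanding)        # per-address SQ demand
        total_stores = sum(outstanding.values())   # per-store SQ demand
        return unique_addresses, total_stores

    def store_committed(addr):
        # Decrement on commit; remove the address when the count reaches 0.
        outstanding[addr] -= 1
        if outstanding[addr] == 0:
            del outstanding[addr]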

The resulting curves reveal two important messages. First, there is a substantial difference

between the two accumulated curves. For example, given an SQ size of 64, 95% of the stores can insert their addresses into the SQ if no duplicated store address is allowed in the SQ. On the other hand, only 75% of the issued stores can find an empty slot in the SQ if all stores, regardless of their addresses, must be recorded in the SQ. Insufficient SQ space causes pipeline flushes and

degrades the overall IPC. Second, with a 512-entry instruction window, the SQ must be large

enough to avoid constant overflow. For example, to cover 99% of the stores, the per-store based

SQ needs 195 entries, while the per-address based SQ requires 120 entries. The required large

CAM-based SQ increases the design complexity, access delay, and power dissipation.




Figure 5-2. The average number of stores and unique store addresses

Figure 5-2 shows the average number of stores and unique store addresses for each

application throughout the simulation. On average, the number of stores is considerably larger

than the number of store addresses by 30%. Among the applications, only Gzip, Mcf, and Mgrid










have a rather small difference. The figures imply that better SQ solutions exist which record only unique store addresses for store-load forwarding.

Figure 5-3 shows the mismatches between the latest store in execution order and the last

store in program order when a dependent load is issued. To collect the mismatch information, we

simulated two 64-entry CAM-based SQs. One is ordered by the execution order for detecting the

latest matched store, and the other is ordered by program order for detecting the last matched

store. Note that we consider the latest store to match the last if the parent store address is unavailable.

The mismatch is very significant. On average, about 25% of the forwarded latest stores are

incorrect and cause costly re-executions.

(Figure 5-3 plots, for each application, the percentage of forwarded loads for which the latest store matches the last store and for which it does not.)






Figure 5-3. Mismatches between the latest and the last store for dependent load

5.3 Order-Free SQ Directory with Age-Order Vectors

In this section, we first describe the basic design of the decoupled SQ without considering

data alignment. The basic scheme is then extended to handle partial stores and loads that are

commonly defined in many instruction set architectures. Handling misaligned stores and loads

that cross the 8-byte boundary is presented at the end.











5.3.1 Basic Design without Data Alignment

The decoupled SQ consists of three key components: the SQ RAM array, the SQ directory, and the corresponding age-order vectors, as shown in Figure 5-4. The SQ RAM buffers the address and data for each store until the store commits. Similar to the conventional SQ, a store enters the


SQ RAM when it is renamed and is removed after commit in program order. The store data in

each SQ RAM entry is 8-byte wide with eight associated ready bits to indicate the readiness of

the respective byte for forwarding. The SQ RAM is organized as a circular queue with two

pointers where the head points to the oldest store and the next points to the next available

location. When a store is renamed, the next location in the SQ RAM is reserved.
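A minimal sketch of this circular SQ RAM management, assuming simple head/next pointers and a stall signal when the queue is full; the class and method names are ours, for illustration only.

    class SQRam:
        # Circular queue: head points to the oldest store, nxt to the next
        # available slot; a store's slot index is its SQ_age.
        def __init__(self, size):
            self.size = size
            self.entries = [None] * size   # (address, data, ready bits)
            self.head = 0
            self.nxt = 0
            self.count = 0

        def rename_store(self):
            # Reserve the next location when the store is renamed.
            if self.count == self.size:
                return None                # SQ RAM full: stall the pipeline
            sq_age = self.nxt
            self.nxt = (self.nxt + 1) % self.size
            self.count += 1
            return sq_age

        def commit_store(self):
            # Stores retire from the head in program order.
            self.entries[self.head] = None
            self.head = (self.head + 1) % self.size
            self.count -= 1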

(Figure 5-4 depicts the SQ RAM, the SQ directory, and the age-order vectors; box 1 marks a Load A issued with SQ_age=7, and box 2 marks the commit of store B.)
Figure 5-4. SQ with decoupled address matching and age priority

The age of a store is defined by its position in the SQ RAM (denoted as SQ_age), which is saved in the ROB. Upon a store commit, the SQ_age can locate the store in the SQ RAM for putting away the data into the cache. The pipeline stalls when the SQ RAM is full. Since the SQ directory is decoupled from the SQ RAM, the SQ RAM size is independent of the SQ directory size and imposes minimal impact on searching for the parent store. When a load is renamed, the SQ_age of the last store in the SQ is recorded along with the load in the LQ. When the load is issued later, the SQ_age is used for searching for the parent store.

Organized similarly to a conventional cache directory, the SQ directory records the store addresses for matching dependent load addresses in store-load forwarding. Instead of keeping a

directory entry for every individual store in the SQ RAM, the SQ directory records a single

address for all outstanding stores with the same address. Since a store is recorded in the SQ

directory after the store has been issued and its address is available, the SQ directory can be

partitioned into multiple banks (sets) based on the store address to avoid fully-associative

searches.

The age-order vector is a new structure that provides the ordered SQ RAM locations for

all stores with the same address. Each address entry in the SQ directory has an associated age-

order vector. The width of the vector is equal to the size of the SQ RAM. When a store address is

available, the SQ directory is searched for recording the new store. If there is a match, the bit in

the corresponding age-order vector indexed by the SQ_age of the current store is set to indicate

its location in the SQ RAM. If the store address is not already in the SQ directory, a new entry is

created. The corresponding bit in the age-order vector is turned on in the same way as when a

match is found. In case there is no empty entry in the SQ directory, the store is simply dropped

and cannot be recorded in the SQ directory. Consequently, the dependent load may not see the

store, causing a re-execution. Since stores are always recorded in program order in the SQ

RAM, imprecision in recording the in-flight store addresses in the directory is allowed.

When a load is issued, a search through the SQ directory determines proper store-load

forwarding. If there is no hit, the load proceeds to access the cache. If there is a matched store

address in the SQ directory, the corresponding age-order vector is scanned to locate the youngest









older store for the load. The search starts from the first bit defined by the SQ_age of the load and

ends with the "head" of the SQ RAM. A simple leading-one detector is used to find the closest

(youngest) location where the bit is set indicating the location of the parent store.
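In software terms, the priority logic is a backward scan of the age-order vector from the load's SQ_age toward the head pointer. A minimal sketch, assuming the vector is a list of bits indexed by SQ RAM position:

    def find_parent(age_vector, load_sq_age, head, size):
        # Scan from younger to older, starting just before the load's
        # SQ_age and ending at the head of the circular SQ RAM; the first
        # set bit marks the youngest store older than the load. Hardware
        # performs this scan with a leading-one detector.
        if load_sq_age == head:
            return None                   # no stores older than the load
        pos = (load_sq_age - 1) % size
        while True:
            if age_vector[pos]:
                return pos                # SQ RAM index of the parent store
            if pos == head:
                return None               # reached the oldest in-flight store
            pos = (pos - 1) % size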

Two enhancements are considered in shortening the critical timing in searching for the

parent store. First, the well-known way-prediction technique enables the SQ directory lookup

and the targeted age-order vector scanning in parallel. By establishing a small but accurate way

history table, the targeted age-order vector can be predicted and scanned before the directory

lookup result comes out. Second, the delay of the leading-one detector is logarithmically

proportional to the width of the age-order vector. Given the fact that a majority of parents can be located without searching the entire SQ RAM, searching only within a partial age-order vector starting from the SQ_age may be enough to catch the correct parent store. The accuracy of these

enhancements will be evaluated in Section 5.4.

When a store commits, its SQ_age is used to update the SQ directory and the age-order vector. The bit position indexed by the SQ_age is reset in all age-order vectors. When all bits in a vector are reset, the entry in the SQ directory is freed. The store also retires from the SQ RAM. When a mis-predicted branch occurs, all stores after the branch can be removed from the SQ in a similar way. The last SQ_age is saved in the ROB with a branch when the branch is renamed. When a mis-prediction is detected, all entries younger than the SQ_age in the SQ RAM are emptied.

Meanwhile, all "columns" in all age-order vectors that correspond to the removed entries from

the SQ RAM are reset. When all bits in any age-order vector are reset, the corresponding entry in

the SQ directory is freed.
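Commit and mis-prediction recovery both reduce to clearing a "column" across every age-order vector. A sketch under the same assumptions as the previous one, with the directory modeled as a dictionary from address to age-order vector:

    def clear_column(directory, sq_age):
        # Reset bit sq_age in the (at most one) vector that has it set;
        # free the directory entry if its vector becomes all zero.
        for addr, vec in list(directory.items()):
            if vec[sq_age]:
                vec[sq_age] = 0
                if not any(vec):
                    del directory[addr]
                break      # each SQ RAM slot holds a single address

    def squash_younger(directory, squashed_ages):
        # Branch mis-prediction: clear the column of every removed store.
        for age in squashed_ages:
            clear_column(directory, age)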

A simple example is illustrated in Figure 5-4. Assume that a sequence of memory stores

are dispatched and recorded in the SQ RAM. Among them, the first, the third, the sixth, and the









eleventh stores have the same address A. These four requests may be issued out of order, but

eventually are recorded in the SQ directory and the associated age-order vector. Since all four

stores have the same address A, they only occupy a single entry in the SQ directory indexed by

certain lower-order bits of A. The corresponding age-order vector records the locations of the

four stores in the SQ RAM by setting the corresponding bits, as shown in the figure. Assuming a Load A is finally issued with an SQ_age of 7 as indicated in box 1, it finds an entry with a matched address in the SQ directory. The priority logic uses the SQ_age of the load and the age-order vector associated with A to locate the parent store at location 5 in the SQ RAM. In this figure, we also illustrate an example in which store B commits, as indicated in box 2. The SQ_age of B from the ROB is used to reset the corresponding position in all age-order vectors. The address of B can be freed from the SQ directory when its age-order vector contains all zeros. Given the fact that each store in the SQ RAM cannot have two addresses, at most one vector can have a '1' in the SQ_age position. Therefore, at most one directory entry can be freed for each committed store.

5.3.2 Handling Partial Store/Load with Mask

In the basic design, we assume loads and stores are always aligned within the 8-byte

boundary and access the entire 8 bytes every time. Realistically, partial loads/stores are

commonly encountered. The address of partial loads and stores is always aligned in the 8-byte

boundary. An 8-bit mask is used to indicate the precise accessed bytes. The decoupled order-free

SQ can be extended to handle memory dependence detection and forwarding for partial

stores/loads. The age-order vector for each store address in the SQ directory is expanded to 8

vectors; each covers one mask bit in the 8-bit mask. If a load address matches a store address in

the SQ directory, the leading-1 detector finds the youngest store older than the load for each individual valid mask bit of the load. In other words, each age-order vector identifies the parent store for each byte of the load. If the found youngest older store covers all of the bytes of the load, store-load forwarding is detected. Otherwise, unless forwarding from multiple stores is permitted, the load cannot proceed until the youngest store which updates a subset of the load bytes commits and puts the store data away into the cache.

(Figure 5-5 shows eight age-order vectors for address A, one per mask bit, recording the partial stores to A in the SQ RAM.)

Figure 5-5. Decoupled SQ with partial store/load using mask

In Figure 5-5, the example from Figure 5-4 is modified to illustrate how the decoupled SQ works with partial loads and stores. Now, the first (A,0) and the sixth (A,5) stores are partial stores, where (A,0) stores bytes 2 and 3, and (A,5) stores bytes 0, 1, 2, and 3, as indicated by the 8-bit masks. Assume that a partial Load A with SQ_age=7 is issued. To illustrate forwarding detection, we assume the load is issued twice; once with a mask of '00001111' and once with '00111100', for loading bytes 4, 5, 6, and 7, and bytes 2, 3, 4, and 5, respectively. A matching









address A is found in the SQ directory for the load. The subsequent searches in the

corresponding age-order vectors find that the youngest older store is (A,2), which covers all the bytes for the first load. As a result, the load gets the forwarded data from the SQ RAM. The second load, however, has two parent stores, where (A,2) produces bytes 4 and 5, and (A,5) produces

bytes 2 and 3 for the load. Unless a merge from multiple stores is permitted for forwarding, the

second load will be stalled until (A,5) is retired.

When a store is retired, the SQ directory and the age-order vectors are updated the same

way as without the mask. However, given the fact that each store may write only some of the

bytes, instead of detecting zeros on a single age-order vector, all eight age-order vectors that are

associated with the 8 mask bits must be all zeroed before the corresponding directory entry can

be freed.
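The per-byte decision can be sketched by running the backward scan of Section 5.3.1 once per valid mask bit and requiring a single parent store to cover every requested byte. A minimal sketch; youngest_older repeats the leading-one scan so the example stands alone:

    def youngest_older(vec, load_sq_age, head, size):
        # Leading-one scan from the load's SQ_age back to the head.
        if load_sq_age == head:
            return None
        pos = (load_sq_age - 1) % size
        while True:
            if vec[pos]:
                return pos
            if pos == head:
                return None
            pos = (pos - 1) % size

    def forward_partial(byte_vectors, load_mask, load_sq_age, head, size):
        # byte_vectors[b] is the age-order vector for byte b of the address.
        parents = set()
        for b in range(8):
            if load_mask[b]:
                parents.add(youngest_older(byte_vectors[b], load_sq_age,
                                           head, size))
        if parents == {None}:
            return "cache"        # no in-flight store writes these bytes
        if len(parents) == 1:
            return parents.pop()  # one store covers every byte: forward
        return "stall"            # partial coverage or multiple parents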

5.3.3 Memory Dependence Detection for Stores/Loads Across 8-byte Boundary

Intel's x86 ISA permits an individual store or load to cross an 8-byte aligned boundary, referred to as a misaligned store or load. Misaligned loads and stores complicate the memory

dependence detection and data forwarding. When a misaligned load is issued, a simple but costly

solution is to stall until all stores ahead of it are drained from the SQ. To avoid this costly wait

for those misaligned loads that are independent of any outstanding store, duplicated full CAMs

for SQ are implemented for detecting the dependence as described in [Abramson et al. 1997].

When a misaligned store is issued, its starting aligned address and the byte mask are stored in the

SQ along with an additional overflow bit to indicate that the store spills out of range. In this case,

a load falling into the next 8-byte range of the misaligned store will miss this store during the

search if there is only one CAM. Therefore, two SQ CAMs are needed for searching both the

address of the load (load) and the adjacent lower 8-byte address (load-8) in parallel. The

decremented load-address CAM will match the store address, and the overflow bit will indicate









to the forwarding logic that there is a misaligned hit. The load is stalled until the misaligned store

is retired and the data is stored into the cache. For handling misaligned loads, a third SQ CAM is

needed in order to provide parallel searches of the next higher 8-byte address (load+8) for

potential misaligned hits.

The banked (set-associative) order-free SQ directory provides another distinct advantage in

detecting misaligned store/load dependence. By using the lower-order bits from the aligned

address to select the bank (set), all three addresses (load-8), (load), and (load+8) of a misaligned load are likely to be located in different banks. Hence, the costly duplication of the CAM for

parallel searches can be avoided.

When Load A is issued, there are several cases in dependence detections and data

forwarding. If Load A does not cross the 8-byte boundary, two searches for store A and store A-

8 from the SQ directory are carried out. If A hits but A-8 misses, or both A and A-8 hit but the

store A-8 does not cross the 8-byte boundary, a search of the youngest parent of store A for Load

A can follow the algorithm described in Section 5.3.2. If A-8 hits and the store A-8 crosses the 8-

byte boundary, a misaligned hit is detected. Load A must stall until the store A-8 ahead of the

load retires and puts its data away into the cache. In case A also hits, Load A is stalled until both of the stores ahead of the load retire and their data is stored into the cache. If none of the

above conditions is true, load A proceeds to access the cache.

If Load A crosses the 8-byte boundary, three searches for store A, store A-8, and store A+8

from the SQ directory must be performed. Any hit of A, A+8, or a hit of A-8 with overflow bit

set in the SQ directory indicates a misaligned hit. Load A is stalled until all of these stores ahead

of it retire and their data is stored into the cache. If none of these conditions is true, Load A proceeds

to access the cache.
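The case analysis above condenses into a short decision routine. A sketch, where hit() and overflow() stand in for the SQ directory lookup and the overflow bit of the matching entry; all addresses are 8-byte aligned:

    def check_load(addr, load_misaligned, hit, overflow):
        if load_misaligned:
            # A misaligned load stalls on any store it may overlap.
            if hit(addr) or hit(addr + 8) or \
               (hit(addr - 8) and overflow(addr - 8)):
                return "stall"
            return "cache"
        if hit(addr - 8) and overflow(addr - 8):
            return "stall"    # a misaligned store covers our low bytes
        if hit(addr):
            return "search"   # run the normal age-order vector search
        return "cache"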









5.4 Performance Results

We modified the PTLsim simulator to model a cycle-accurate, full-system x86-64

microprocessor. We followed the basic PTLsim pipeline design, which has 13 stages (1 fetch, 1

rename, 5 frontend, 1 dispatch, 1 issue, 1 execution, 1 transfer, 1 writeback, 1 commit). Note that

the 5-cycle "frontend" stages are functionless and were inserted to more closely model realistic

pipeline delays in the front-end stages. In this pipeline design, there are 7 cycles between a store

entering the SQ RAM in the rename stage and entering the SQ directory in the issue stage. When any

memory dependence violation is detected for an early load, the penalty of re-dispatching the

replayed load is 2 cycles (from dispatch to execution). Any uops after this load in program order

are also re-dispatched. To reduce the memory dependence violations, a Load Store Alias

Predictor (LSAP) is added [Chrysos and Emer 1998]. This fully-associative 16-entry LSAP

records the loads that were mispredicted in the recent history. In case there is any unresolved

store address ahead of a load in the SQ when the load is issued, the LSAP is looked up. If a

match is found, the load is delayed until the unresolved store address is resolved. A similar method

of memory aliasing prediction is used by Alpha 21264 [Kessler 1999].
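A minimal sketch of such an alias predictor follows; the FIFO replacement is our assumption, since only the size and associativity are specified above.

    from collections import deque

    class LSAP:
        # Fully-associative table of load PCs that recently triggered a
        # memory dependence violation.
        def __init__(self, entries=16):
            self.table = deque(maxlen=entries)  # FIFO eviction when full

        def record_violation(self, load_pc):
            if load_pc not in self.table:
                self.table.append(load_pc)

        def should_delay(self, load_pc, unresolved_older_store):
            # Delay the load only if some older store address is still
            # unresolved and this load has misbehaved before.
            return unresolved_older_store and load_pc in self.table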

The x86 architecture is known for its relatively widespread use of unaligned memory

operations. In PTLsim, once a given load or store is known to have a misaligned address, it is

preemptively split into two aligned loads or stores at the decode time. PTLsim does this by

initially causing all misaligned loads and stores to raise an internal exception that forces a

pipeline flush. At this point, the special misaligned bit is set for the problem load or store in

PTLsim's internal translated basic block representation. When the offending load or store is

encountered again, it will be split into two aligned loads or stores early in the pipeline. The split

loads and stores will be merged later in the commit stage. In our simulation, we followed PTLsim in handling misaligned loads and stores. The simulation results of all applications show









that pipeline flushes due to misaligned stores and loads happen very infrequently with only about

0.5% of the total stores and loads.

To gain insight into the impact of various store queue implementations, we stretch the other out-of-order pipeline parameters [Akkary et al. 2003] as summarized in Table 1-2. Other detailed

parameter setting can be found in the original PTLsim source code. SPEC 2000 integer and

floating-point applications are used to drive the performance evaluation. We skip the

initialization part of the workloads and collect statistics from the next 200 million instructions.

Four SQ designs are evaluated and compared, including the traditional full CAM, the latest-store forwarding, the late-binding SQ, and the decoupled SQ. We evaluated 32- and 64-entry full CAMs, denoted as Conventional 32-CAM and Conventional 64-CAM; latest-store forwarding using a 64-entry full CAM, denoted as LS 64-CAM; the late-binding SQ with a 4x8 directory and an 8x8 directory recording address/age pairs together with a 64-entry SQ RAM, denoted as LB 4x8 64-RAM and LB 8x8 64-RAM; and lastly the decoupled SQ with a 4x8 directory and an 8x8 directory, also with a 64-entry SQ RAM, denoted as Decoupled 4x8 64-RAM and Decoupled 8x8 64-RAM. The notation a x b represents the configuration of the SQ directory, where a is the number of sets and b is the set associativity. In the decoupled SQ, we simulated a 48-bit leading-

1 detector in finding the youngest parent. We also implemented a way predictor with a 256-entry

prediction table using a total of 96 bytes. When a load is issued, the way is predicted. If a

misprediction is detected from the address comparison, the way prediction table is updated and

the load is re-issued with a 2-cycle delay. We do not consider forwarding from multiple stores.

We simplified the latest-store and late-binding forwarding schemes. A full CAM is used for the

latest-store forwarding. Instead of maintaining stores in program order as in a conventional full

CAM, stores are maintained in execution order for searching the latest store. We do not










implement the Bloom Filter and other techniques in evaluating the late-binding scheme. If the

SQ directory is full, a store is simply dropped without recording it in the directory.

5.4.1 IPC Comparison

Figure 5-6 shows the IPC comparison of the seven SQ designs running the SPEC2000

programs. We can make a few observations. First, the decoupled SQ outperforms both the latest-

store forwarding and the late-binding SQ by a sizeable margin for most applications. On average,

Decoupled 4 x 8 64-RAM~ and Decoupled 8 x 8 64-RAM~ outperform LS 64-CAI~by 20.8% and

23.8% respectively. With an equal directory size of 4 x 8 and 8 x 8, they outperform the late-

binding counterparts by 12.4% and 7.6%. Given the fact that the decoupled SQ records one

address for all outstanding stores with the same address, it tolerates a smaller 4 x8 SQ directory

much better than the late-binding scheme. In fact, the decoupled SQ with a bigger 8 x 8 directory

only improves the IPC by about 2.4% comparing with that using a 4 x8 directory.


Figure 5-6. IPC comparison

Second, Decoupled 4x8 64-RAM shows better performance than Conventional 32-CAM by an average of 4.2%. Since the SQ directory is decoupled from the SQ RAM, a 64-entry SQ RAM allows more outstanding stores than a 32-entry CAM without requiring any bigger










directory for matching the store addresses. With the same number of directory entries as Conventional 32-CAM, Decoupled 4x8 64-RAM achieves better IPCs using only an 8-way comparator for faster speed and lower power consumption.

Third, even though the SQ RAM size is no larger than the full-CAM size, the performance of Decoupled 8x8 64-RAM with a banked directory is close to the expensive Conventional 64-CAM performance. On average, Decoupled 8x8 64-RAM degrades the IPC by less than 0.5% compared with Conventional 64-CAM.

Fourth, Applu, Mcf, Mgrid, and Swim are less sensitive to different SQ designs, except for Conventional 32-CAM, because only a very small amount of store-load forwarding exists in these applications regardless of the SQ design. Table 5-1 summarizes the percentage of loads that get forwarded data from stores in an ideal SQ for all the simulated applications. An ideal SQ has infinite size, and all store addresses before a load are known when the load is issued. Conventional 32-CAM shows worse performance in these applications because its smaller CAM

size hinders stores from being renamed. Among all the loads that can be forwarded in an ideal

SQ, the seven simulated SQ designs can correctly capture 83.8%, 97.2%, 77.6%, 87.1%, 93.0%,

95.0%, and 96.4% of the ideal forwarded loads respectively.

Table 5-1. Percentage of forwarded load using an ideal SQ
Workload Ammp Applu Apsi Art Bzip2 Eon Equake Facerec Fma3d Gzip
Forward % 12.7% 0.5% 8.4% 4.7% 13.0% 18.4% 5.4% 6.8% 23.2% 4.3%
Workload Lucas Mcf Mesa Mgrid Parser Perlbmk Sixtrack Swim Vpr Wupwise
Forward % 7.5% 0.3% 4.7% 0.1% 8.8% 14.8% 8.9% 0.0% 15.6% 7.4%



By getting rid of the duplicated stores with the same address in the SQ directory, the

decoupled SQ requires a smaller directory than the late-binding SQ. Figure 5-7 plots the

total number of dropped stores from entering the SQ directory per 10K instructions due to a full

SQ directory. The huge gaps in terms of dropped stores between the late-binding and the













decoupled SQ are very evident. Within each method, the 8x8 directory causes far fewer dropped stores than the 4x8 directory.

Figure 5-7. Comparison of directory full in decoupled SQ and late-binding SQ

Figure 5-8. Comparison of load re-execution


There are various memory dependence mis-speculations which require loads to be re-


executed: lack of the parent store address, dropped stores due to full SQ directory, incorrectly


identifying the youngest older store, etc. Figure 5-8 summarizes the total number of re-executed


loads per 10K instructions due to mis-speculations of store-load dependence. As expected, the


latest-store SQ scheme causes the most re-executions, while the conventional CAM schemes cause the least. Note that Conventional 32-CAM causes fewer re-executions than










Conventional 64-CAM because the smaller 32-CAM causes more stalls at the rename stage, in effect placing a limit on the out-of-order execution of stores and loads.

5.4.2 Sensitivity of the SQ Directory

The size and configuration of the decoupled SQ directory are flexible. In Figure 5-9, we compare four SQ directory sizes with 2x8, 4x8, 4x12, and 8x8 configurations that are decoupled from a 64-entry SQ RAM, using the average IPC of all twenty applications. Note that we increase the set associativity to 12 for the directory with 48 entries to simplify the set selection. We also run full CAMs with 16, 32, 48, and 64 entries.






(Figure 5-9 plots the average IPC of the decoupled SQ and the full-CAM SQ against the directory size: 16, 32, 48, and 64 entries.)

Figure 5-9. Sensitivity on the SQ directory size

As shown in the figure, the smaller directories of 2x8 and 4x8 with a 64-entry SQ RAM perform much better than the full-CAM counterparts of equal directory sizes, by 17% and 4%

respectively. The advantage of the decoupled SQ is obvious since it is much cheaper to enlarge

the circular SQ RAM while keeping the associative directory small. When the directory increases

to 48, the performance gap becomes very narrow due to the smaller size difference between the

CAM and the RAM. Even with a full 64-entry CAM that matches the SQ RAM size, the

decoupled SQ loses less than 0.5% of the overall IPC.













The age distance from the load to its parent affects the timing of the leading-1 detector. Figure 5-10 shows the accumulated distribution of the searching distances. With a distance of 32, half of the total distance of 64, 15 out of 20 applications can locate almost 100% of their parents. Three of the remaining five also achieve 97-99% accuracy. When the search distance


increases to 48, all but Fma3d can find 100% of their parents. Fma3d can also find 98.3% of its


parents at this distance. In all our simulations, the search distance is set to 48.







(Figure 5-10 plots, for each application, the accumulated percentage of parent stores found within a given search distance.)

Figure 5-10. Accumulated distribution of the searching distances

(Figure 5-11 plots the way prediction accuracy of each application for different way-prediction table sizes.)

Figure 5-11. Sensitivity of way prediction accuracy










To predict the way in an SQ directory set where the parent store is located, a 256-entry way-prediction table is maintained that reaches close to 100% accuracy for a majority of the applications, as shown in Figure 5-11. The prediction table is indexed by the lower-order 8 bits of load and store addresses aligned on the 8-byte boundary. Each entry has 3 bits to record which way was touched by a store/load most recently. The total size of the table is only 96 bytes. When a

load is issued, the way ID is fetched out of the way prediction table to access the age-order

vectors for searching the parent store. When a mis-prediction is identified, the correct way is

updated and the load is recycled with a 2-cycle delay. From Figure 5-11, the way prediction

accuracy is proportional to the table size. The tables with 64 or 128 entries show very poor

prediction accuracy. Significant improvement is observed with 192 and 256 entries. All

simulations are based on an 8x8 decoupled SQ directory.
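The way predictor reduces to a small direct-mapped table of 3-bit way IDs. A minimal sketch following the description above; the class and method names are ours, for illustration only.

    class WayPredictor:
        # 256 entries x 3 bits = 96 bytes, indexed by the 8 address bits
        # just above the 3-bit byte offset of the 8-byte-aligned address.
        def __init__(self):
            self.table = [0] * 256

        def _index(self, addr):
            return (addr >> 3) & 0xFF

        def predict(self, load_addr):
            return self.table[self._index(load_addr)]

        def update(self, addr, way):
            # Record the way most recently touched by a store/load; a
            # mis-predicted load is recycled with a 2-cycle delay after
            # this update.
            self.table[self._index(addr)] = way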

5.5 Related Work

There have been many recent proposals in designing a scalable SQ by getting rid of the

expensive and time-consuming full-CAM design. One category is to adopt a two-level SQ which

has a smaller first-level SQ to enable fast and energy efficient forwarding and a much larger

second-level SQ to correct and complement the first-level SQ [Roth 2004; Srinivasan et al. 2004;

Torres et al. 2005]. In general, the first-level SQ records the latest executed stores. A store to

load forwarding occurs when the load address matches a store address in the first-level store

queue. Accessed after a first-level miss, the bigger and slower second-level SQ records those

stores that cannot fit into the first-level SQ based on FIFO or some prediction mechanisms.

Instead of a first-level SQ, a store-forwarding cache is implemented in [Stone et al. 2005] for

store-load forwarding. It relies on a separate memory disambiguation table to resolve any

dependence violation. In [Garg et al. 2006], a much larger L0 cache is used to replace the first-

level SQ for caching the latest store data. A load, upon a hit, can fetch the data directly from the









L0. In this approach, instead of maintaining speculative load information, the load is also executed in in-order pipeline fashion. An inconsistency between the data in L0 and L1 can

identify memory dependence violations. Besides the complexity and extra space, a fundamental

issue in this category of approaches is the heavy mismatch between the latest store and the

correct last store for a dependent load, as reported in Section 5.2. Such mismatches produce costly

re-executions.

The late-binding technique enables an un-ordered SQ as reported in [Sethumadhavan et al.

2007]. A load or a store enters the SQ when the instruction is issued, instead of when it is renamed. In

order to get correct store-load forwarding, the age of each load/store is also recorded along with

the address. When there are multiple hits in the SQ for an issued load, complicated

decoding logic can re-create the order of the stores based on the recorded age. Afterwards, the

search for the youngest matched store that is older than the load can locate the correct parent

store for forwarding. The late-binding avoids full-CAM search and allows small bank

implementation for the SQ. However, it records the address/age pair for every memory

instruction unnecessarily, which requires more directory entries and intensifies the banking

conflicts. Furthermore, it relies on complicated age-based priority logic to locate the parent that

may lengthen the access time for store-load forwarding.

Another category of solutions is also prediction based, aiming to efficiently implement or even to get rid of the SQ entirely. A Bloom filter [Sethumadhavan et al. 2003] or a predicted forwarding

technique [Park et al. 2003] can filter out a majority of SQ searches. A sizeable memory

dependence predictor is used in [Sha et al. 2005] to match loads with the precise store buffer

slots for the forwarded data so as to abandon the associative SQ structure. Proposals in [Sha et al.

2006; Subramaniam and Loh 2006] can completely eliminate the SQ by bypassing store data









through the register file or through the LQ. However, these prediction-based approaches always come at the expense of an extra sizeable prediction table.

5.6 Conclusion

When the instruction window scales up to hundreds or even thousands of instructions, an efficient SQ design is required to detect memory dependences in a timely manner and forward the store data to dependent loads. Although there have been many proposed scalable SQ solutions, they

generally suffer from certain inaccuracies and inefficiencies, require complicated hardware logic along

with additional buffers or tables, and compromise the performance. In this work, we propose a

new scheme that decouples the address matching unit and the age-based priority logic from the

SQ RAM array. The proposed solution enables an efficient set-associative SQ directory for

searching the parent store using a detached age-order vector. Moreover, by recording a single

address for multiple stores with the same address in the SQ directory, the decoupled SQ further

reduces the directory size and alleviates potential bank conflicts. We also provide solutions in

handling the commonly used partial and misaligned stores and loads in designing a scalable SQ.

The performance evaluation shows that the new scheme outperforms other scalable SQ proposals

based on latest-store forwarding and late SQ binding techniques and is comparable with full-

CAM SQ. By removing the costly fully-associative CAM structure, the new scheme is both

power-efficient and scalable to large-window designs.









CHAPTER 6
CONCLUSIONS

In this dissertation, we propose three works related to cache performance improvement and

resource contention resolution, and one work related to LSQ design.

The proposed special P-load has demonstrated its ability to effectively overlap load-load

data dependence. Instead of relying on miss predictions of the requested blocks, the execution-

driven P-load precisely instructs the memory controller in fetching the needed data block non-

speculatively. The simulation results demonstrate high accuracy and significant speedups using

the P-load.

The elbow cache has demonstrated its ability to expand the replacement set beyond the

lookup set boundary without adding any complexity on the lookup path. Because of the

characteristics of the elbow cache, it is difficult to implement recency-based replacement. The proposed low-cost reuse-count replacement policy can achieve cache performance

comparable to the recency-based replacement policy.

Simultaneous Multithreading techniques have moved from laboratory ideas into real

and commercially successful processors. However, studies have shown that without proper

mechanisms to regulate the shared resources, especially shared caches and the instruction

window, multiple threads show lower overall performance when running simultaneously.

Runahead execution, proposed initially for achieving better performance for single-thread

applications, works very well in the multiple-thread environment. In runahead execution,

multiple long-latency memory operations can be discovered and overlapped to exploit the

memory-level parallelism; meanwhile, shared critical resources held by the stalling thread can be

released to keep the other thread running smoothly to exploit the thread-level parallelism.









The order-free SQ design decouples the address matching unit and the age-based priority

logic from the original store queue. The proposed solution enables an efficient set-associative SQ

directory for searching the parent store using a detached age-order vector. Moreover, by

recording a single address for multiple stores with the same address in the SQ directory, the

decoupled SQ further reduces the directory size and alleviates potential bank conflicts. We also

provide solutions in handling the commonly used partial and misaligned stores and loads in

designing a scalable SQ. The performance evaluation shows that the new scheme outperforms

other scalable SQ proposals based on latest-store forwarding and late SQ binding techniques and

is comparable with full-CAM SQ. By removing the costly fully-associative CAM structure, the

new scheme is both power-efficient and scalable to large-window designs.










LIST OF REFERENCES


Abramson, J. M., Akkary, H., Glew, A. F., Hinton, G. J., Konigsfeld, K. G., Madland, P. D., and
Papworth, D. B. 1997. Method and Apparatus for Signalling a Store Buffer to Output
Buffered Store Data for a Load Operation on an Out-of-Order Execution Computer
System, Intel, US Patent 5606670.

Adams, D., Allen, A., Bergkvist, R., Hesson, J., and LeBlanc, J. 1997. A 5ns Store Barrier Cache
with Dynamic Prediction of Load/Store Conflicts in Superscalar Processors, Proceedings
of the 1997 International Solid-State Circuits Conference, 414-415, 496.

Agarwal, A., Hennessy, J., and Horowitz, M. 1988. Cache Performance of Operating System and
Multiprogramming Workloads, ACM Transactions on Computer Systems 6, 4, 393-431.

Agarwal, A., Kubiatowicz, J., Kranz, D., Lim, B.-H., Yeung, D., D'Souza, G., and Parkin, M.
1993. Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors, IEEE
Micro 13, 3, 48.

Agarwal, A., and Pudar, S.D. 1993a. Column-Associative Caches: a Technique for Reducing the
Miss Rate of Direct-Mapped Caches, Proceedings of the 20th Annual International
Symposium on Computer Architecture, 179-190.

Akkary, H., Rajwar, R., and Srinivasan, S. T. 2003. Checkpoint Processing and Recovery:
Towards Scalable Large Instruction Window Processors, Proceedings of the 36th
International Symposium on Microarchitecture, 423-434.

Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., and Smith, B. 1990. The
Tera Computer System, Proceedings of the 4th International Conference on
Supercomputing, 1-6.

Baugh, L., and Zilles, C. B. 2006. Decomposing the Load-Store Queue by Function for Power
Reduction and Scalability, IBM Journal of Research and Development 50, 2-3, 287-298.

Bodin, F., and Seznec, A. 1997. Skewed Associativity Improves Performance and Enhances
Predictability, IEEE Transactions on Computers 46, 5, 530-544.

Burger, D., and Austin, T. 1997. The SimpleScalar Tool Set, Version 2.0, Technical Report
#1342, Computer Science Department, University of Wisconsin-Madison.

Buyuktosunoglu, A., Albonesi, D.H., Bose, P., Cook, P., and Schuster, S. 2002. Tradeoffs in
Power-Efficient Issue Queue Design, Proceedings of the 2002 International Symposium
on Low Power Electronics and Design, 184-189.

Cahoon, B., and McKinley, K.S. 2001. Data Flow Analysis for Software Prefetching Linked
Data Structures in Java, Proceedings of the 10th International Conference on Parallel
Architectures and Compilation Techniques, 280-291.










Castro, F., Pinuel, L., Chaver, D., Prieto, M., Huang, M., and Tirado, F. 2006. DMDC: Delayed
Memory Dependence Checking through Age-Based Filtering, Proceedings of the 38th
International Symposium on Microarchitecture, 297-306.

Cazorla, F. J., Fernandez, E., Ramirez, A., and Valero, M. 2003. Improving Memory Latency
Aware Fetch Policies for SMT Processors, Proceedings of the 5th International
Symposium on High Performance Computing, 70-85.

Cazorla, F. J., Ramirez, A., Valero, M., and Fernandez, E. 2004a. DCache Warn: An I-Fetch
Policy to Increase SMT Efficiency, Proceedings of the 18th International Parallel and
Distributed Processing Symposium, 74a.

Cazorla, F. J., Ramirez, A., Valero, M., and Fernandez, E. 2004b. Dynamically Controlled
Resource Allocation in SMT Processors, Proceedings of the 37th International
Symposium on Microarchitecture, 171-182.

Charney, M., and Reeves, A. 1995. Generalized Correlation Based Hardware Prefetching,
Technical Report EE-CEG-95-1, Cornell University.

Chen, T., and Baer, J. 1992. Reducing Memory Latency Via Non-Blocking and Prefetching
Caches, Proceedings of the 5th International Conference on Architectural Support for
Programming Languages and Operating Systems, 51-61.

Chou, Y., Fahs, B., and Abraham, S. 2004. Microarchitecture Optimizations for Exploiting
Memory-Level Parallelism, Proceedings of the 31st International Symposium on
Computer Architecture, 76-87.

Chrysos, G. Z., and Emer, J. S. 1998. Memory Dependence Prediction Using Store Sets,
Proceedings of the 25th International Symposium on Computer Architecture, 142-153.

Collins, J., Sair, S., Calder, B., and Tullsen, D. M. 2002. Pointer Cache Assisted Prefetching,
Proceedings of the 35th International Symposium on Microarchitecture, 62-73.

Cooksey, R., Jourdan, S., and Grunwald, D. 2002. A Stateless, Content-Directed Data Prefetching
Mechanism, Proceedings of the 10th International Conference on Architectural Support
for Programming Languages and Operating Systems, 279-290.

Cristal, A., Santana, O. J., Cazorla, F., Galluzzi, M., Ramirez, T., Pericas, M., and Valero, M.
2005. Kilo-Instruction Processors: Overcoming the Memory Wall, IEEE Micro 25, 3, 48-
57.

Dorai, G., and Yeung, D. 2002. Transparent Threads: Resource Sharing in SMT Processors for
High Single-Thread Performance, Proceedings of the 2002 International Conference on
Parallel Architectures and Compilation Techniques, 30-41.

Dundas, J. and Mudge, T. 1997. Improving Data Cache Performance by Pre-Executing
Instructions Under a Cache Miss, Proceedings of the 11th International Conference on
Supercomputing, 68-75.

El-Moursy, A., and Albonesi, D. H. 2003. Front-End Policies for Improved Issue Efficiency in
SMT Processors, Proceedings of the 9th International Symposium on High-Performance
Computer Architecture, 31-42.

Fu, J., Patel, J.H., and Janssens, B.L. 1992. Stride Directed Prefetching in Scalar Processors,
Proceedings of the 25th Annual International Symposium on Microarchitecture,
102-110.

Gandhi, A., Akkary, H., Rajwar, R., Srinivasan, S. T., and Lai, K. K. 2005. Scalable Load and
Store Processing in Latency Tolerant Processors, Proceedings of the 32nd International
Symposium on Computer Architecture, 446-457.

Garg, A., Rashid, M. W., and Huang, M. C. 2006. Slackened Memory Dependence Enforcement:
Combining Opportunistic Forwarding with Decoupled Verification, Proceedings of the
33rd International Symposium on Computer Architecture, 142-154.

Gonzalez, A., Valero, M., Topham, N., and Parcerisa, J. 1997. Eliminating Cache Conflict
Misses through XOR-Based Placement Functions, Proceedings of the 11th International
Conference on Supercomputing, 76-83.

Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M., and Olukotun, K. 2000. The
Stanford Hydra CMP, IEEE Micro 20, 2, 71-84.

Hesson, J., LeBlanc, J., and Ciavaglia, S. 1997. Apparatus to dynamically control the out-of-order
execution of load-store instructions in a processor capable of dispatching, issuing and
executing multiple instructions in a single processor cycle, IBM, US Patent 5615350.

Hinton, G., Sager, D., Upton, M., Boggs, D., Kyker, A., and Roussel, P. 2001. The
Microarchitecture of the Pentium 4 Processor, Intel Technology Journal 5, 1.

Hu, Z., Martonosi, M., and Kaxiras, S. 2003. TCP: Tag Correlating Prefetchers, Proceedings of
the 9th International Symposium on High-Performance Computer Architecture, 317-328.

Hughes, C. J., and Adve, S. V. 2005. Memory-Side Prefetching for Linked Data Structures for
Processor-in-Memory Systems, Journal of Parallel and Distributed Computing 65, 4,
448-463.

Joseph, D., and Grunwald, D. 1997. Prefetching Using Markov Predictors, Proceedings of the
24th International Symposium on Computer Architecture, 252-263.

Jouppi, N. P. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small
Fully-Associative Cache and Prefetch Buffers, Proceedings of the 17th International
Symposium on Computer Architecture, 364-373.

Kessler, R. E. 1999. The Alpha 21264 Microprocessor, IEEE Micro 19, 2, 24-36.

Kharbutli, M., Irwin, K., Solihin, Y., and Lee, J. 2004. Using Prime Numbers for Cache Indexing
to Eliminate Conflict Misses, Proceedings of the 10th International Symposium on High
Performance Computer Architecture, 288-299.

Kirman, N., Kirman, M., Chaudhuri, M., and Martinez, J. F. 2005. Checkpointed Early Load
Retirement, Proceedings of the 11th International Symposium on High Performance
Computer Architecture, 16-27.

Laudon, J., Gupta, A., and Horowitz, M. 1994. Interleaving: A Multithreading Technique
Targeting Multiprocessors and Workstations, Proceedings of the 6th International
Conference on Architectural Support for Programming Languages and Operating
Systems, 308-318.

Lin, W.-F., Reinhardt, S. K., and Burger, D. 2001. Reducing DRAM Latencies with an
Integrated Memory Hierarchy Design, Proceedings of the 7th International Symposium
on High Performance Computer Architecture, 301-312.

Luk, C., and Mowry, T. C. 1996. Compiler-Based Prefetching for Recursive Data Structures,
Proceedings of the 7th International Conference on Architectural Support for
Programming Languages and Operating Systems, 222-233.

Moshovos, A., Breach, S. E., Vijaykumar, T. N., and Sohi, G. S. 1997. Dynamic Speculation and
Synchronization of Data Dependence, Proceedings of the 24th International Symposium
on Computer Architecture, 181-193.

Mowry, T. C., Lam, M. S., and Gupta, A. 1992. Design and Evaluation of a Compiler Algorithm
for Prefetching, Proceedings of the 5th International Conference on Architectural
Support for Programming Languages and Operating Systems, 62-73.

Mutlu, O., Stark, J., Wilkerson, C., and Patt, Y. 2003. Runahead Execution: An Alternative to
Very Large Instruction Windows for Out-of-order Processors, Proceedings of the 9th
International Symposium on High Performance Computer Architecture, 129-140.

Olden Benchmark, http://www.cs.princeton.edu/~mcc/olden.ht.

Opteron Processors, http://www.amd.com.

Park, I., Ooi, C.-L., and Vijaykumar, T.N. 2003. Reducing Design Complexity of the Load-Store
Queue, Proceedings of the 36th International Symposium on Microarchitecture, 411-422.

Peir, J. K., Lee, Y., and Hsu, W. W. 1998. Capturing Dynamic Memory Reference Behavior with
Adaptive Cache Topology, Proceedings of the 8th International Conference on
Architectural Support for Programming Languages and Operating Systems, 240-250.

Qureshi, M. K., Thompson, D., and Patt, Y. N. 2005. The V-Way Cache: Demand Based
Associativity via Global Replacement, Proceedings of the 32nd Annual International
Symposium on Computer Architecture, 544-555.

Roth, A. 2004. A High-Bandwidth Load-Store Unit for Single- and Multi-Threaded Processors,
Technical Report MS-CIS-04-09, University of Pennsylvania.

Roth, A., Moshovos, A., and Sohi, G. 1998. Dependence Based Prefetching for Linked Data
Structures, Proceedings of the 8th International Conference on Architectural Support for
Programming Languages and Operating Systems, 115-126.

Saavedra-Barrera, R., Culler, D., and von Eicken, T. 1990. Analysis of Multithreaded
Architectures for Parallel Computing, Proceedings of the 2nd Annual ACM Symposium
on Parallel Algorithms and Architectures, 169-178.

Sair, S. and Charney, M. 2000. Memory Behavior of the SPEC2000 Benchmark Suite, Technical
Report, IBM Corp.

Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Keckler, S. W., Burger, D., and
Moore, C. R. 2003. Exploiting ILP, TLP and DLP with the Polymorphous TRIPS
Architecture, Proceedings of the 30th International Symposium on Computer
Architecture, 422-433.

Sankaralingam, K., Nagarajan, R., McDonald, R., Desikan, R., Drolia, S., Govindan, M., Gratz,
P., Gulati, D., Hanson, H., Kim, C., Liu, H., Ranganathan, N., Sethumadhavan, S., Sharif,
S., Shivakumar, P., Keckler, S. W., and Burger, D. 2006. Distributed Microarchitectural
Protocols in the TRIPS Prototype Processor, Proceedings of the 39th International
Symposium on Microarchitecture, 480-491.

Sethumadhavan, S., Desikan, R., Burger, D., Moore, C. R., and Keckler, S.W. 2003. Scalable
Hardware Memory Disambiguation for High ILP Processors, Proceedings of the 36th
International Symposium on Microarchitecture, 399-410.

Sethumadhavan, S., McDonald, R., Desikan, R., Burger, D., and Keckler, S.W. 2006. Design and
Implementation of the TRIPS Primary Memory System, Proceedings of the 24th
International Conference on Computer Design, 470-476.

Sethumadhavan, S., Roesner, F., Emer, J. S., Burger, D., and Keckler, S. W. 2007. Late-Binding:
Enabling Unordered Load-Store Queues, Proceedings of the 34th Annual International
Symposium on Computer Architecture, 347-357.

Seznec, A. 1993a. A Case for Two-Way Skewed-Associative Caches, Proceedings of the 20th
Annual International Symposium on Computer Architecture, 169-178.

Seznec, A., and Bodin, F. 1993b. Skewed-Associative Caches, Proceedings of the 5th
International Conference on Parallel Architectures and Languages Europe, 304-316.

Sha, T., Martin, M. M. K., and Roth, A. 2005. Scalable Store-Load Forwarding via Store Queue
Index Prediction, Proceedings of the 38th International Symposium on
Microarchitecture, 159-170.

Sha, T., Martin, M. M. K., and Roth, A. 2006. NoSQ: Store-Load Communication without a
Store Queue, Proceedings of the 39th International Symposium on Microarchitecture,
285-296.

Snavely, A., and Tullsen, D. M. 2000. Symbiotic Job Scheduling for a Simultaneous
Multithreading Processor, Proceedings of the 9th International Conference on
Architectural Support for Programming Languages and Operating Systems, 234-244.

Solihin, Y., Lee, J., and Torrellas, J. 2002. Using a User-Level Memory Thread for Correlation
Prefetching, Proceedings of the 29th Annual International Symposium on Computer
Architecture, 171-182.

SPEC2000 Alpha Binaries from SimpleScalar web site,
http://www.eecs.umich.edu/~chriswea/benchmarks/spec2000.html.

SPEC2000 Benchmarks, http://www.spec.org/osg/cpu2000/.

Spjuth, M., Karlsson, M., and Hagersten, E. 2005. Skewed Caches from a Low-Power
Perspective, Proceedings of the 2nd Conference on Computing Frontiers, 152-160.

Spracklen, L., and Abraham, S. 2005. Chip Multithreading: Opportunities and Challenges,
Proceedings of the 11th International Symposium on High Performance Computer
Architecture, 248-252.

Srinivasan, S. T., Rajwar, R., Akkary, H., Gandhi, A., and Upton, M. 2004. Continual Flow
Pipelines, Proceedings of the 11th International Conference on Architectural Support for
Programming Languages and Operating Systems, 107-119.

Stone, H. S. 1971. Parallel Processing with the Perfect Shuffle, IEEE Transactions on
Computers 20, 2, 153-161.

Stone, H. S., Turek, J., and Wolf, J. L. 1992. Optimal Partitioning of Cache Memory, IEEE
Transactions on Computers 41, 9, 1054-1068.

Stone, S. S., Woley, K. M., and Frank, M. I. 2005. Address-Indexed Memory Disambiguation
and Store-to-Load Forwarding, Proceedings of the 38th International Symposium on
Microarchitecture, 171-182.

Subramaniam, S., and Loh, G. 2006. Store Vectors for Scalable Memory Dependence Prediction
and Scheduling, Proceedings of the 12th International Symposium on High Performance
Computer Architecture, 64-75.

Suh, G. E., Rudolph, L., and Devadas, S. 2001. Dynamic Cache Partitioning for Simultaneous
Multithreading Systems, Proceedings of the 13th International Conference on Parallel
and Distributed Computing Systems, 116-127.

Tendler, J. M., Dodson, J. S., Fields, J. S., Jr., Le, H., and Sinharoy, B. 2002. POWER4 System
Microarchitecture, IBM Journal of Research and Development 46, 1, 5-25.

Topham, N. P., and Gonzalez, A. 1997. Randomized Cache Placement for Eliminating Conflicts,
IEEE Transactions on Computers 48, 2, 185-192.

Torres, E. F., Ibanez, P., Vinals, V., and Llaberia, J. M. 2005. Store Buffer Design in First-Level
Multibanked Data Caches, Proceedings of the 32nd International Symposium on
Computer Architecture, 469-480.

Tullsen, D. M., and Brown, J. A. 2001. Handling Long-Latency Loads in a Simultaneous
Multithreading Processor, Proceedings of the 34th International Symposium on
Microarchitecture, 318-327.

Tullsen, D. M., Eggers, S.J., Emer, J.S., Levy, H.M., Lo, J.L., and Stamm, R.L. 1996. Exploiting
Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading
Processor, Proceedings of the 23rd International Symposium on Computer Architecture,
191-202.

Tullsen, D. M., Eggers, S.J., and Levy, H.M. 1995. Simultaneous Multithreading: Maximizing
On-Chip Parallelism, Proceedings of the 22nd International Symposium on Computer
Architecture, 392-403.

Vanderwiel, S., and Lilja, D. 2000. Data Prefetch Mechanisms, ACM Computing Surveys 32, 2,
174-199.

Wang, Z., Burger, D., McKinley, K. S., Reinhardt, S. K., and Weems, C. C. 2003. Guided Region
Prefetching: a Cooperative Hardware/Software Approach, Proceedings of the 30th
International Symposium on Computer Architecture, 388-398.

Wilton, S.J.E., and Jouppi, N.P. 1996. CACTI: An Enhanced Cache Access and Cycle Time
Model, IEEE Journal of Solid-State Circuits 31, 5, 677-688.

Yang, C.-L., and Lebeck, A. R. 2000. Push vs. Pull: Data Movement for Linked Data Structures,
Proceedings of the 14th International Conference on Supercomputing, 176-186.

Yang, C.-L., and Lebeck, A. R. 2004. Tolerating Memory Latency through Push Prefetching for
Pointer-Intensive Applications, ACM Transactions on Architecture and Code
Optimization 1, 4, 445-475.

Yang, Q., and Adina, S. 1994. A One's Complement Cache Memory, Proceedings of the 1994
International Conference on Parallel Processing, 250-257.

Yoaz, A., Erez, M., Ronen, R., and Jourdan, S. 1999. Speculation Techniques for Improving
Load-Related Instruction Scheduling, Proceedings of the 26th Annual International
Symposium on Computer Architecture, 42-53.

Yourst, M. T. 2007. PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural
Simulator, Proceedings of the 2007 International Symposium on Performance Analysis of
Systems & Software, 23-34.

BIOGRAPHICAL SKETCH

Zhen Yang was born in 1977 in Tianjin, China. She earned her B.S. and M.S. in computer

science from Nankai University in 1999 and 2002, respectively. She earned her Ph.D. in

computer engineering from the University of Florida in December 2007.
PAGE 22

transfer, 1 writeback, 1 commit). Important simu lation parameters for PTLSim are summarized in Table 1-2. Table 1-2. PTLSim simulation parameters Fetch/Dispatch/Issue/Co mmit Width: 32/32/16/16 Instruction Fetch Queue: 128 Branch Predictor: 64K-entry G-share, 4K-entry 4-way BTB Branch Mis-Prediction Penalty: 7 cycles RUU/LQ/SQ size: 512/128/64 Instruction Issue Window int0/int1/ld/fp: 64/64/64/64 ALU/STU/LDU/FPU: 8/8/8/8 L1 I-Cache: 16KB, 4-way, 64B line, 2 cycles L1 D-Cache: 32KB 4-way, 64B line, 2 cycles, 8 read/write port L2 U-Cache: 256KB, 4-way, 64B Line, 6 cycles L3 U-Cache: 2MB, 16-way, 128B Line, 16 cycles L1/L2 MSHRs: 16/16 Memory Latency: 100 cycles I-TLB: 64-entry fully-associative D-TLB: 64-entry fully-associative Benchmark programs are used to provide a measure to compare performance. The SPEC2000 [SPEC2000 Benchmarks] from the Standa rd Performance Evaluation Corporation is one of the most widely used benchmark programs in our research community. It consists of two types of benchmarks, one is the SPECint2000, a set of integer-intensive benchmarks, and the other is the SPECfp2000, a set of floating-point intensive benchmarks. Another benchmark suite we evaluated is the Olden benchmarks [Old en Benchmark], which are pointer-intensive programs built by Princeton University. We follo w the studies done in [Sair and Charney 2000] to skip certain instructions, warm up caches and other system components with 100 million instructions, and then collect statistics from the next 500 million instructions. The outline of this dissertation is as follo wed. In chapter 2, we first study the missing memory-level parallelism opportuni ties because of data dependences and then describe P-load scheme. In chapter 3, the severity of cache c onflict misses is demonstrated and a cache 22

PAGE 23

organization with a frequency-based replacement policy is introduced to specifically reduce conflict misses. In chapter 4, we will evaluate a technique to solve the resource contention problem in multi-threading environment. In ch apter 5, we introduce th e order-free SQ that decouples the matching of the store/load addr ess from its corresponding age-based priority encoding logic. The dissertation is concluded in chapter 6. 23

PAGE 24

CHAPTER 2 EXPLOITATION OF MEMORY LEV EL PARALLELISM WITH P-LOAD 2.1 Introduction Over the past two decades, ILP has been a prim ary focus of computer architecture research and a variety of microarch itecture techniques to exploi t ILP such as pipelining, very long instruction word ( VLIW ), superscalar issue, branch prediction, and data speculation have been developed and refined. These techniques make curre nt processors to eff ectively utilize deep multiple-issue pipelines in applications such as media processing and scientific floating-point intensive applications. However, the performance of commercial app lications such as databases is dominated by the frequency and cost of memory accesses. T ypically, they have larg e instruction and data footprints that do not fit in caches, hence, requ iring frequent accesses to memory. Furthermore, these applications exhibit data-d ependent irregula r patterns in their memory accesses that are not amenable to conventional prefetching sche mes. For those memory-bound workloads, a promising alternative is to exploit MLP by overlapping multiple memory accesses. MLP is the number of outstand ing cache misses that can be ge nerated and executed in an overlapped manner. It is essential to exploit MLP by overlapping multiple cache misses in a wide instruction window [Chou et al. 2004]. The exploitation of MLP, however, can be limited due to a load that depends on another load to produce the base address (referred as load-load dependences ). If the parent load misses the cache, sequential execution of these two loads must be enforced. One typical example is the pointer-c hasing problem in many a pplications with LDS, where accessing the successor node cannot start until the pointer is available, possibly from memory. Similarly, indirect accesses to large arra y structures may face the same problem when both address and data accesse s encounter cache misses. 24

PAGE 25

There have been several prefetching techniques to reduce penalties on consecutive cache misses of tight load-load dependences [Luk and Mowry 1996; Roth et al. 1998; Yang and Lebeck 2000; Vanderwiel and Lilja 2000; Cahoon and McKinley 2001; Collins et al. 2002; Cooksey et al. 2002; Mutlu et al. 2003; Yang a nd Lebeck 2004; Hughes and Adve 2005]. Luk et al. [1996] proposed using jump-pointers, which we re further developed by Roth and Sohi [Roth et al. 1998]. A LDS node is augmented with jumppointers that point to nodes that will be accessed in multiple iterations or recursive calls in the future. When a LDS node is visited, prefetches are issued for the locations pointed by its jump-pointers. They focused on a software implementation of the four jump-pointer idioms proposed by Roth and Sohi. They also proposed hardware and cooperative hardware/software implemen tations that use significant additional hardware support at the processor to overcome so me of the software schemes limitations. The hardware automatically creates and updates jumppointers and generates address for and issues prefetches. The hardware can eliminate the inst ruction overhead of jump pointers and reduce the steady state stall time for root and chain jumping, but it does not affect the startup stall time for any case and does not eliminate the steady state stall time for root and chain jumping. The push-pull scheme [Yang and Lebeck 2000; Yang and Lebeck 2004] proposed a prefetch engine at each level of memory hierarch y to handle LDS. A kernel of load instructions, which encompass the LDS traversals, is genera ted by software. The processor downloads this kernel to the prefetch engine, then executes the load instructions repeatedly traverse the LDS. The lack of address ordering hardware and co mparison hardware restricts their schemes traversals to LDS and excludes some data depe ndences. The kernels and prefetch engine would require significant changes to allow more genera l traversals. A similar approach with compiler help has been presented in [Hughes and Adve 2005]. 25


The content-aware data prefetcher [Cooksey et al. 2002] identifies potential pointers by examining the word-aligned content of cache-miss data blocks. The identified pointers are used to initiate prefetching of the successor nodes. Using the same mechanism to identify pointer loads, the pointer-cache approach [Collins et al. 2002] builds a correlation history between heap pointers and the addresses of the heap objects they point to. A prefetch is issued when a pointer load misses the data cache but hits the pointer cache.

We introduce a new approach to overlap cache misses involved in load-load dependences. After dispatch, if the base register of a load is not ready due to an earlier cache-miss load, a special P-load is issued in place of the dependent load. The P-load instructs the memory controller to calculate the needed address once the parent load's data is available from the dynamic random access memory (DRAM). The inherent interconnect delay between processor and memory can thus be overlapped regardless of the location of the memory controller [Opteron Processors]. When executing pointer-chasing loads, a sequence of P-loads can be initiated according to the dispatching speed of these loads.

The proposed P-load makes three unique contributions. First, in contrast to the existing methods, it does not require any special predictors and/or software-inserted prefetching hints. Instead, the P-load scheme issues the dependent load early, following the instruction stream. Second, the P-load exploits more MLP from a larger instruction window without the need to enlarge the critical issue window [Akkary et al. 2003]. Third, an enhanced memory controller with proper processing power is introduced that can share certain computations with the processor.

2.2 Missing Memory Level Parallelism Opportunities

Overlapping cache misses can reduce the performance loss due to long-latency memory operations. However, a data dependence between a load and an earlier instruction may stall the load


from issuing. In this section, we show the performance loss due to such data dependences in real applications by comparing a baseline model with an idealized MLP exploitation model.

Figure 2-1. Gaps between base and ideal memory level parallelism exploitations

MLP can be quantified as the average number of memory requests during the period when there is at least one outstanding memory request [Chou et al. 2004]. We compare the MLPs of the baseline model and the ideal model. In the baseline model, all data dependences are strictly enforced. On the contrary, in the ideal model, a cache-miss load is issued right after the load is dispatched, regardless of whether the base register is ready or not. Nine workloads, Mcf, Twolf, Vpr, Gcc-200, Parser, and Gcc-scilab from the SPEC2000 integer benchmarks, and Health, Mst, and Em3d from the Olden benchmarks, are selected for this experiment because of their high L2 miss rates. An Alpha 21264-like processor with a 1MB L2 cache is simulated. Figure 2-1 illustrates the measured MLPs of the baseline and ideal models. It shows that there are huge gaps between them, especially for Mcf, Gcc-200, Parser, Gcc-scilab, Health, and Mst. The results reveal that significant MLP improvement can be achieved if the delay of issuing cache misses due to data dependences is reduced.
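This definition of MLP can be computed directly from the lifetime of each memory request. The following minimal sketch illustrates the measurement; the request record and function names are our own illustration, not part of the simulator used in this study.

    #include <stdio.h>

    /* One memory request, with its lifetime in processor cycles. */
    typedef struct { long start, end; } request_t;

    /* MLP = (sum, over cycles with at least one outstanding request,
     * of the number outstanding) / (number of such cycles).          */
    double measure_mlp(const request_t *req, int n, long horizon)
    {
        long busy = 0, weighted = 0;
        for (long cycle = 0; cycle < horizon; cycle++) {
            int outstanding = 0;
            for (int i = 0; i < n; i++)
                if (req[i].start <= cycle && cycle < req[i].end)
                    outstanding++;
            if (outstanding > 0) {
                busy++;
                weighted += outstanding;
            }
        }
        return busy ? (double)weighted / busy : 0.0;
    }

    int main(void)
    {
        /* Two overlapped misses yield MLP = 2; serialized ones yield 1. */
        request_t overlapped[] = { {0, 100}, {0, 100} };
        request_t serialized[] = { {0, 100}, {100, 200} };
        printf("overlapped: %.1f\n", measure_mlp(overlapped, 2, 200));
        printf("serialized: %.1f\n", measure_mlp(serialized, 2, 200));
        return 0;
    }

Under this metric, the ideal model widens the overlap among dependent misses, which is exactly the gap Figure 2-1 quantifies.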


2.3 Overlapping Cache Misses with P-loads

We describe the P-load scheme using the function refresh_potential from Mcf, shown in Figure 2-2. Refresh_potential is invoked frequently to refresh a huge tree structure that exceeds 4MB. The tree is initialized with a regular stride pattern among adjacent nodes on the traversal path, such that the address pattern can be accurately predicted. However, the tree structure is slightly modified with insertions and deletions between two consecutive visits. After a period of time, the address pattern on the traversal path becomes irregular and hard to predict accurately. Heavy misses are encountered when caches cannot accommodate the huge working set.

    long refresh_potential(network_t *net)
    {
        tmp = node = root->child;
        while (node != root) {
            while (node) {
                if (node->orientation == UP)
                    node->potential = node->basic_arc->cost
                                      + node->pred->potential;
                else {
                    node->potential = node->pred->potential
                                      - node->basic_arc->cost;
                    checksum++;
                }
                tmp = node;
                node = node->child;
            }
            node = tmp;
            while (node->pred) {
                tmp = node->sibling;
                if (tmp) {
                    node = tmp;
                    break;
                } else
                    node = node->pred;
            }
        }
        return checksum;
    }

Figure 2-2. Example tree-traversal function from Mcf


This function traverses a data structure with three traversal links, child, pred, and sibling, and accesses basic records through a data link, basic_arc. In the first inner while loop, the execution traverses down the path through the link node = node->child.

Figure 2-3. Pointer chasing: A) Sequential accesses; B) Pipelining using P-load

With accurate branch predictions, several iterations of the while loop can be initiated in a wide instruction window. The recurrent instruction node = node->child, which advances the pointer to the next node, becomes a potential bottleneck since accesses of the records in the next node must wait until the pointer (base address) of the node is available. As shown in Figure 2-3 A), four consecutive executions of node = node->child must proceed sequentially. In the case of a cache miss, each of them encounters delays in sending the request, accessing the DRAM array, and receiving the data. These non-overlapped long-latency memory accesses can congest the instruction and issue windows and stall the processor. On the other hand, the proposed P-load can effectively overlap the interconnect delay in sending/receiving data, as shown in Figure 2-3 B). In the following subsections, detailed descriptions of identifying and issuing P-loads are


given first, followed by the design of the memory controller. Several issues and enhancements of the P-load will also be discussed.

2.3.1 Issuing P-Loads

We will describe P-load issuing and execution within the instruction window and the memory request window (Figure 2-4) by walking through the first inner while loop of refresh_potential from Mcf (Figure 2-2).

Figure 2-4. Example of issuing P-loads seamlessly without load address

Assume the first load, lw $v0,28($a0), is a cache miss and is issued normally. The second and third loads encounter partial hits to the same block as the first load, thus no memory request is issued. After the fourth load, lw $v0,16($v0), is dispatched, a search through the current instruction window finds that it depends on the second load, lw $v0,32($a0). Normally, the fourth load must be stalled. In the proposed scheme, however, a special P-load is inserted into a small P-load issue window at this time. When the cache hit/miss of the parent load is known,


an associative search for dependent loads in the P-load issue window is performed. All dependent P-loads are either ready to be issued (if the parent load is a miss) or canceled (if the parent load is a hit). The P-load consists of the address of the parent load, the displacement, and a unique instruction ID to instruct the memory controller to calculate the address and fetch the correct block. Details of the memory controller are given in Section 2.3.2. The fifth load is similar to the fourth. The sixth load, lw $a0,12($a0), advances the pointer and is also a partial hit to the first load.

With correct branch prediction, the instructions of the second iteration are placed in the instruction window. The first three loads in the second iteration all depend on lw $a0,12($a0) in the previous iteration. Three corresponding P-loads are issued accordingly with the parent load's address. The fourth and fifth loads, however, depend on earlier loads that are themselves also identified as P-loads. In this case, instead of the parents' addresses, the parent load IDs (p-id), 114 and 115 for the fourth and fifth loads respectively, are encoded in the address fields to instruct the memory controller to obtain the correct base addresses. This process continues to issue a sequence of P-loads seamlessly within the entire instruction window. A P-load does not occupy a separate location in the instruction window, nor does it keep a record in the memory status holding registers (MSHRs).

Similar to other memory-side prefetching methods [Solihin et al. 2002], the returned data block of a P-load must come back with its address. Upon receiving a P-load's returned block from memory, the processor searches and satisfies any existing memory requests located in the MSHRs. The block is then placed into the cache if it is not already there. Searching the MSHRs is necessary, since a P-load cannot prevent other requests that target the same block from issuing. The load from which a P-load was initiated is issued normally when its base register becomes ready.
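The dispatch-side bookkeeping can be summarized in a few lines of code. The sketch below is our own minimal rendering of the 8-entry P-load issue window, with illustrative structure and function names; it is not the exact hardware design.

    #include <stdio.h>

    #define PLOAD_WINDOW 8

    /* One pending P-load candidate awaiting its parent's hit/miss outcome. */
    typedef struct {
        int valid;
        int dep_load_id;    /* instruction ID of the dependent load    */
        int parent_id;      /* ID of the load producing the base reg   */
        int displacement;   /* immediate offset of the dependent load  */
    } pload_entry_t;

    static pload_entry_t win[PLOAD_WINDOW];

    /* Stand-in for sending the P-load to the memory controller. The
     * packet carries either the parent's address or, if the parent is
     * itself a P-load, the parent's ID (p-id) in the address field.   */
    static void send_pload(const pload_entry_t *e)
    {
        printf("P-load for load %d: parent %d, disp %d\n",
               e->dep_load_id, e->parent_id, e->displacement);
    }

    /* Invoked when the parent load's hit/miss outcome becomes known:
     * dependent P-loads are issued on a miss or canceled on a hit, in
     * which case the dependent loads simply issue normally later.     */
    void resolve_parent(int parent_id, int parent_missed)
    {
        for (int i = 0; i < PLOAD_WINDOW; i++) {
            if (win[i].valid && win[i].parent_id == parent_id) {
                if (parent_missed)
                    send_pload(&win[i]);
                win[i].valid = 0;   /* issued or canceled either way */
            }
        }
    }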


In general, the P-load can be viewed as an accurate data prefetching method. It should not interfere with normal store-load forwarding. A P-load can be issued even when there are unresolved earlier stores in the load-store queue. Upon the completion of the parent miss-load, the address of the dependent load can be calculated, which will trigger any necessary store-load forwarding.

2.3.2 Memory Controller Design

Figure 2-5 illustrates the basic design of the memory controller. Normal cache misses and P-loads are processed and issued in the memory request window, similar to the out-of-order execution in the processor's instruction window. The memory address, the offset for the base address, the displacement for computing the target block address, and the dependence link are recorded for each request in arriving order. For a normal cache miss, its address and a unique ID assigned by the request sequencer are recorded. Such cache miss requests access the DRAM without delay as soon as the target DRAM channel is open. A normal cache miss may be merged with an earlier active P-load that targets the same block to reduce its penalty.

Figure 2-5. Basic design of the memory controller (request window, request sequencer, P-load buffer, shadowed TLB, cache directory, channel queue, and return queue)
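The per-request state just listed suggests a compact entry layout. The sketch below is a hypothetical encoding of a memory request window entry, together with the wakeup step detailed in the next paragraphs; field and function names are ours.

    #define MRW_SIZE    32
    #define BLOCK_BYTES 64

    /* One memory request window entry: a normal miss has a valid
     * address and no parent; a P-load may instead carry a link to
     * its parent entry plus an offset and a displacement.          */
    typedef struct {
        int  valid, ready;
        int  id;            /* unique ID from the request sequencer  */
        int  parent;        /* index of the parent entry, -1 if none */
        int  offset;        /* byte offset of the base pointer word  */
        int  displacement;  /* added to the loaded pointer           */
        long addr;          /* block address, once known             */
    } mrw_entry_t;

    static mrw_entry_t mrw[MRW_SIZE];

    /* When a fetched block returns, each dependent P-load extracts its
     * base address from the block, adds its displacement, and becomes
     * ready for DRAM access (unless it merges with an active request). */
    void on_block_return(int parent_idx, const long *block_words)
    {
        for (int i = 0; i < MRW_SIZE; i++) {
            if (mrw[i].valid && mrw[i].parent == parent_idx) {
                long base = block_words[mrw[i].offset / (int)sizeof(long)];
                mrw[i].addr  = (base + mrw[i].displacement)
                               & ~(long)(BLOCK_BYTES - 1);
                mrw[i].ready = 1;
            }
        }
    }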


Two different procedures are applied when a P-load arrives. First, if a P-load comes with a valid address, the block address is used to search for any existing memory requests. Upon a match, a dependence link is established between them; the offset within the block is used to access the correct word from the parent's data block without the need to access the DRAM. In the case of no match, the address that comes with the P-load is used to access the DRAM, as illustrated by request 118, assuming that the first request has been removed from the memory request window. Second, if a P-load comes without a valid address, the dependence link encoded in the address field is extracted and saved in the corresponding entry, as shown by requests 116 and 117 (Figure 2-4). In this case, the correct base addresses can be obtained from 116's and 117's parent requests, 114 and 115, respectively. The P-load is dropped if its parent P-load is no longer in the memory request window.

Once a data block is fetched, all dependent P-loads are woken up. For example, the availability of the New block triggers P-loads 105, 106, 113, 114, and 115, as shown in Figure 2-4. The target word in the block can be retrieved and forwarded to the dependent P-loads. The memory address of each dependent P-load is then calculated by adding the target word (base address) to the displacement value. The P-load's block is fetched if its address does not match any earlier active P-load. The fetched P-load's block in turn triggers its dependent P-loads. A memory request is removed from the memory request window after its data block is sent back.

2.3.3 Issues and Enhancements

There are many essential issues that need to be resolved to implement the P-load scheme efficiently.

Maintaining Base Register Identity: The base register of a qualified P-load may experience renaming or constant increment/decrement after the parent load is dispatched. These indirect


dependences can be identified and established by proper adjustment of the displacement value of the P-load. There are different implementation options. In our simulation model, we used a separate register renaming table to associate the currently dispatched load with the parent load, if one exists. This direct association can be established whenever a simple register update instruction is encountered and its parent (possibly multiple levels up) is a miss load. The association is dropped when the register is modified again.

Address Translation at the Memory Controller: The memory controller must perform virtual-to-physical address translation for a P-load in order to access the physical memory. A shadowed TLB is maintained at the memory controller for this purpose (Figure 2-5). The processor issues a TLB update to the memory controller whenever a TLB miss occurs and the new address translation becomes available. TLB consistency can be handled similarly to that in a multiprocessor environment. A P-load is simply dropped upon a TLB miss.

Reducing Excessive Memory Requests: Since a P-load is issued without a memory address, it may generate unnecessary memory traffic if the target block is already in cache or if multiple requests address the same data block. Three approaches are considered here. First, when a normal cache miss request arrives, all outstanding P-loads are searched. In the case of a match, the P-load is changed to a normal cache miss to save variable delays. Second, a small P-load buffer (Figure 2-5) holds the data blocks of recent P-loads and normal cache miss requests. A fast access to the buffer occurs when the requested block is located in the buffer. Third, a topologically equivalent cache directory of the lowest-level cache is maintained to predict cache hits/misses for filtering the returned blocks. By capturing normal cache misses, P-loads, and dirty block writebacks, the memory-side cache directory can predict cache hits accurately.
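The third filter can be captured in a few lines. Below is a minimal sketch of a tag-only shadow directory; the organization (8-way, 16K entries) follows the simulated configuration, while the names, the modulo indexing, and the LRU shifting are our simplifications.

    #include <string.h>

    #define DIR_SETS 2048
    #define DIR_WAYS 8

    static long shadow_tag[DIR_SETS][DIR_WAYS];   /* 0 = invalid */

    /* Predict whether a block is present in the processor's cache. */
    int predict_hit(long block_addr)
    {
        int set = (int)(block_addr % DIR_SETS);
        for (int w = 0; w < DIR_WAYS; w++)
            if (shadow_tag[set][w] == block_addr)
                return 1;
        return 0;
    }

    /* Any fill observed at the controller (a miss return or a P-load
     * return) installs the tag at the MRU position, mirroring the
     * processor-side replacement; writebacks can invalidate entries. */
    void observe_fill(long block_addr)
    {
        int set = (int)(block_addr % DIR_SETS);
        memmove(&shadow_tag[set][1], &shadow_tag[set][0],
                (DIR_WAYS - 1) * sizeof(long));
        shadow_tag[set][0] = block_addr;
    }

A P-load whose target is predicted present in the cache can then be dropped instead of returning a redundant block.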


Inconsistent Data Blocks between Caches and Memory: Similar to other memory-side prefetching techniques, the P-load scheme fetches data blocks without knowing whether they are already located in cache. It is possible to fetch a stale copy if the block has been modified. In general, the stale copy is likely to be dropped either by cache-hit prediction or by searching through the directory before updating the cache. However, in the rather rare case when a modified block is written back to memory, the modified block must be checked against outstanding P-loads to avoid fetching the stale data.

Complexity, Overhead, and Need for Associative Search Structures: There are two new structures, the P-load issue window and the memory request window (with 8 and 32 entries in our simulations), that require associative searches. The others do not require expensive associative searches. We carefully model the delays and access conflicts. For instance, although multiple P-loads can be woken up simultaneously, we conservatively charge two memory controller cycles (10 processor cycles) to initiate each DRAM access sequentially. The delay accounts for the associative wakeup as well as the need for TLB and directory accesses. Our current simulation does not consider TLB shootdown overhead. Our results showed that it has negligible impact due to the small number of TLB misses and the flexibility of dropping overflow P-loads during TLB updates.

2.4 Performance Evaluation

To handle P-loads, the processor includes an 8-entry P-load issue window along with a 512-entry instruction window and a 32-entry issue window. Several new components are added to the memory controller. A 32-entry memory request window with a 16-entry fully associative P-load buffer is added to process both normal cache misses and P-loads. An 8-way, 16K-entry cache directory of the second-level cache to predict cache hits/misses is simulated. A shadowed


TLB with the same configuration as the processor-side TLB is simulated for address translation at the memory controller. Nine workloads, Mcf, Twolf, Vpr, Gcc-200, Parser, and Gcc-scilab from the SPEC2000 integer benchmarks, and Health, Mst, and Em3d from the Olden benchmarks, are selected because of their high L2 miss rates, listed in order of their appearance. A processor-side stride prefetcher is included in all simulated models [Fu et al. 1992]. To demonstrate the performance advantage of the P-load scheme, the history-less content-aware data prefetcher [Cooksey et al. 2002] is also simulated. We search exhaustively to determine the width (number of adjacent blocks) and the depth (level of prefetching) of the prefetcher for the best performance improvement. Two configurations are selected. In the limited option (Content-limit; width=1, depth=1), a single block is prefetched for each identified pointer from a missed data block, i.e., both width and depth are equal to 1. In the best-performance option (Content-best; width=3, depth=4), three adjacent blocks starting from the target block of each identified pointer are fetched, and the prefetched blocks initiate content-aware prefetching up to the fourth level. Other prefetchers are excluded due to their need for huge history information and/or software prefetching help.

2.4.1 Instructions Per Cycle Comparison

Figure 2-6 summarizes the Instructions Per Cycle (IPC) and the normalized memory access time for the baseline model, the content-aware prefetching (Content-limit and Content-best), and the P-load schemes without (Pload-no) and with (Pload-16) a 16-entry P-load buffer. Generally, the P-load scheme shows better performance.


Figure 2-6. Performance comparisons: A) Instructions Per Cycle; B) Normalized memory access time


window. For example, the traversal lists in Parser are very short, and thus provide limited room for issuing P-loads. But the Content-best shows better improvement on Parser. Lastly, the results show that a 16-entry P-load buffer provides ab out 1-10% performance improvement with an average of 4%. To further understand the P-load effect, we compare the memory access time of various schemes normalized to the memory access time without prefetching (Figure 2-6 B). The improvement of the memory access time matches the IPC improvement very well. In general, the P-load reduces the memory access delay significantly. We observe 10-30% reduction of memory access delay for Mcf, Gcc-200, Gcc-scilab, Health, Mst, and Em3d. 2.4.2 Miss Coverage and Extra Traffic In Figure 2-7, the miss coverage and total traffic are plotted. The total traffic is classified into five categories: misses, partial hits, miss re ductions (i.e. successful P-load or prefetches), extra prefetches, and wasted prefetches. The sum of the misses, partial hits and miss reductions is equal to the baseline misses without prefetchi ng, which is normalized to 1. The partial hits represent normal misses that catch early P-loads or prefetches at the memory controller, so that the memory access delays are reduced. The extra pref etch represents the prefetched blocks that are replaced before any use. The wasted prefetches are referred to the prefetched blocks that are presented in cache already. Except for Twolf and Vpr the P-load reduces 20-80% overall misses. These miss reductions are accomplished with little extra data tr affic because the P-load is issued according to the instruction stream. Among the workloads, Health has the highest miss reduction. It simulates health-care systems using a 4-way B-tree structure. Each node in the B-tree consists of a link-list with patient records. At the memory controller, each pointer-advance P-load usually wakes up a large number of dependent P-loads ready to access DRAM. At the pr ocessor side, the return of a 38


parent load normally triggers dependent loads after their respective blocks are available from early P-loads. Mcf, on the other hand, has much simpler ope rations on each node visit. The return of a parent load may initiate the dependent loads before the blocks are ready from early P-loads. Therefore, about 20% of the misses have re duced penalties due to the early P-loads. Twolf and Vpr show insignificant miss reductions because of very small amount of tight load-load dependences. 0 1 2 3 4 5Content-limit Content-best Pload-no Pload-16 Content-limit Content-best Pload-no Pload-16 Content-limit Content-best Pload-no Pload-16 Content-limit Content-best Pload-no Pload-16 Content-limit Content-best Pload-no Pload-16 Content-limit Content-best Pload-no Pload-16 Content-limit Content-best Pload-no Pload-16 Content-limit Content-best Pload-no Pload-16 Content-limit Content-best Pload-no Pload-16 McfTwolfVprGcc-200ParserGcc-scilabHealthMstEm3d Miss Coverage and Traffic Wasted Prefetch Extra Prefetch Miss Reduction Partial Hits Miss 5.1 Figure 2-7. Miss covera ge and extra traffic The content-aware prefetcher ge nerates a large amount of extra traffic for aggressive data prefetching. For Twolf and Vpr such aggressive and incorrect prefetching actually increases the overall misses due to cache pollution. For Parser the Content-best out-performs the Pload-16 that is accomplished with 5 times memory traffic. In many workloads, the Contest-best generates high percentages of wasted prefetches. For example for Parser, the cache prediction at the memory controller is very accurate with only 0.6% false-negative prediction (predicted hit, 39


actual miss) and 3.2% false-positiv e prediction (predicted miss, actual hit). However, the total predicted misses are only 10%, which makes 30% of the return P-load blocks wasted. 2.4.3 Large Window and Runahead The scope of the MLP exploitation with P-load is confined within the instruction window. In Figure 2-8, the IPC speedups of the P-load with five window sizes: 128, 256, 384, 512 and 640 in comparison with the baseline mode l of the same window size are plotted. 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 McfTwolfVprGcc200 ParserGccscilab HealthMstEm3dIPC Speedup 128 entries 256 entries 384 entries 512 entries 640 entries Figure 2-8. Sensitivity of P-load with respect to instru ction window size 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 McfTwolfVprGcc200 ParserGccscilab HealthMstEm3dIPC Speedup Runahead Pload-16 Pload-16+Runahead Figure 2-9. Performance impact from combining P-load with runahead 40


The advantage of larger window is obvious, si nce the bigger the instruction window, the more the P-loads can be discovered and issued. It is important to poi nt out that issuing P-loads is independent of the issue window size. In our si mulation, the issue window size remains 32 for all five instruction windows. The speculative runahead execution effec tively enlarges the instruction window by removing cache miss instructions from the top of the instruction window. More instructions and potential P-loads can thus be processed on the runahead path. Figure 2-9 shows the IPC speedups of runahead, pload-16, and the combined pload-16 + Runahead All three schemes use a 512entry instruction window a nd a 32-entry issue window. Runahead execution is very effective on Twolf Vpr and Mst It out-performs Pload-16 due to the ability to enlarge both the instruction and the issue windows. On the other hand, Mcf, Gcc-200, Gcc-scilab Health, and em3d show little benefits from runahead because of intensive load-load dependences. The performance of Mcf is actually degraded because of the overhead associated with canceling instructions on the runahead path. The benefit of issuing P-loads on the runahead path is very significant for all workloads as shown in the figure. Basically, these two scheme s are complementary to each other and show an additive speedup benefit. The average IPC speedups of runahead, P-load and P-load+runahead relative to the baseline model are 10%, 16% and 34% respecti vely. Combining P-load with runahead provides on average of 22% speedup ov er using only runahead execution, and 16% average speedup over using P-load alone. 2.4.4 Interconnect Delay To reduce memory latency, a recent trend is to integrate the memory controller into the processor die with reduced interconnect delay [Opteron Processors]. However, in a multiple processor-die system, significant interconnect de lay is still encountered in accessing another 41


memory controller located off-di e. In Figure 2-10, the IPC speedups of the P-load with different interconnect delays relative to the baseline mode l with the same interconnect delay are plotted. The delay indeed impacts the overall IPC signifi cantly. But the P-load still demonstrates performance improvement even with fast inte rconnect. The average IPC improvements of the nine workloads are 18%, 16%, 12%, 8% and 5% with 100-, 80-, 60-, 40-, and 20-cycle one-way delays respectively. 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 McfTwolfVprGcc200 ParserGccscilab HealthMstEm3dIPC Speedup 20 cycles 40 cycles 60 cycles 80 cycles 100 cycles Figure 2-10. Sensitivity of P-load with respect to interconnect delay 2.4.5 Memory Request Window and P-load Buffer Recall that the memory request window record s normal cache misses and P-loads. The size of this window determines the total number of outstanding memory requests can be handled on the memory controller. The issu ing and execution of requests in the memory request window are similar to the out-of-order execution in processo r's instruction window. In Figure 2-11, the IPC speedups of the P-load with four memory reque st window sizes: 16, 32, 64, and 128 relative to the baseline model without P-load are plotted. A 32-entry window size is enough to hold almost all of the requests at the memory controller for all workloads except health. 42


0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 McfTwolfVprGcc200 ParserGccscilab HealthMstEm3dIPC Speedup MRW-16 MRW-32 MRW-64 MRW-128 Figure 2-11. Sensitivity of P-load with respect to memory request window size 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 McfTwolfVprGcc200 ParserGccscilab HealthMstEm3dIPC Speedup Pload-no Pload-16 Pload-32 Pload-64 Pload-128 Figure 2-12. Sensitivity of P-load with respect to P-load buffer size The performance impacts of the P-load buffer with 0, 16, 32, 64 and 128 entries are simulated. Figure 2-12 shows the IPC speedups of the five P-load buffer sizes relative to the baseline model. In all of the workloads, addi ng the P-load buffer increases the performance gain. For most of the workloads, a 16-entry buffer can capture the majority of the benefit. 43


2.5 Related Work There have been many software and hardwa re oriented prefetching proposals for alleviating performance penalties on cache mi sses [Jouppi 1990; Chen and Baer 1992; Luk and Mowry 1996; Joseph and Grunwald 1997; Yang a nd Lebeck 2000; Vanderwiel and Lilja 2000; Cahoon and McKinley 2001; Solihin et al. 2002; Cooksey et al. 2002; Collins et al. 2002; Wang et al. 2003; Yang and Lebeck 2004; Hughes and Adve 2005]. Traditional hardware-oriented sequential or stride-based prefetchers work we ll for applications with regular memory access patterns [Chen and Baer 1992; Jouppi 1990]. However, in many modern applications and runtime environments, dynamic memory allocations and linked data structure accesses are very common. It is difficult to accurately prefetch due to their irregul ar address patterns. Correlated and Markov prefetchers [Charney and Reeves 1995; Joseph and Grunwald 1997] record patterns of miss addresses and use the past miss correlations to predict future cache misses. These approaches require a huge histor y table to record the past mi ss correlations. Besides, these prefetchers also face challenges in pr oviding accurate and timely prefetches. A memory-side correlation-based prefetcher moves the prefetcher to th e memory controller [Solihin et al. 2002]. To handle timely prefetch es, a chain of prefetches based on a pair-wise correlation history can be pushed from memory. Accuracy and memo ry traffic, however, remain difficult issues. To overlap load-load dependent misses, a cooperative hardware-software approach called push-pull uses a hardware prefet ch engine to execute so ftware-inserted pointerbased instructions ahead of the actual computation to supply the needed data [Yang and Lebeck 2000; Yang and Lebeck 2004]. A similar approach has been presented in [Hughes and Adve 2005]. A stateless, content-aware data prefetcher identifies potenti al pointers by examining wordbased content of a missed data block and elimin ates the need to maintain a huge miss history 44


[Cooksey et al. 2002]. After the pr efetching of the target memory block by a hardware-identified pointer, a match of the block address with th e content of the block can recognize any other pointers in the block. The newly id entified pointer can trigger a chain of prefetches. However, to overlap long latency in sending the request and receiving the pointer data for a chain of dependent load-loads, the statele ss prefetcher needs to be impleme nted at the memory side. Both virtual and physical addresses are required in order to identify pointers in a block. Furthermore, by prefetching all identified pointers continuously, the accuracy issu e still exists. Using the same mechanism to identify pointer loads, the pointer-cache approach [Collins et al. 2002] builds a correlation history between heap poi nters and the addresses of the heap objects they point to. A prefetch is issued when a pointer load misse s the data cache, but hits the pointer cache. Additional complications occur when the pointer values are updated. The proposed P-load abandons the traditional ap proach of predicting prefetches with huge miss histories. It also gives up the idea of usi ng hardware and/or softwa re to discover special pointer instructions. With deep instruction wi ndows in future out-of-order processors, the proposed approach identifies existing load-load de pendences in the instru ction stream that may delay the dependent loads. By issuing a P-load in place of the dependent load, any pointerchasing or indirect addressing that causes se rialized memory access, can be overlapped to effectively exploit memory-level parallelism. Th e execution-driven P-load can precisely preload the needed block withou t involving any prediction. 2.6 Conclusion Processor performance is significantly hamp ered by limited MLP exploitation due to the serialization of loads that are dependent on one another and miss the cache. The proposed special P-load has demonstrated its ability to effectivel y overlap these loads. In stead of relying on miss predictions of the requested bloc ks, the execution-driven P-load precisely instructs the memory 45


controller in fetching the needed data bl ock non-speculatively. The simulation results demonstrate high accuracy and significant speed ups using the P-load. The proposed P-load scheme can be integrated with other aggressi ve MLP exploitation methods for even greater performance benefit. 46


CHAPTER 3 LEAST FREQUENTLY USED REPLAC EMENT POLICY IN ELBOW CACHE 3.1 Introduction In cache designs, a set includes a number of cache frames that a memory block can be mapped into. When all of the frames in a set ar e occupied, a newly missed block replaces an old block according to the principle of memory referen ce locality. In classical set-associative caches, both the lookup for identifying cache hit/miss and the replacement are within the same set, normally based on hashing of a few index bits from the block address. For fast cache access time, the set size (also referred as associativity) is usually small. In addi tion, all of the sets have identical size and are disjoint to simplify the cache design. Un der these constraints, heavy conflicts may occur in a few sets (referred as hot sets ) due to uneven distribution of memory addresses across the entire cache sets that cause severe performance degradation. There have been several efforts to alleviat e conflicts in heavily accessed sets. The hashrehash cache [Agarwal et al. 1988] and the co lumn-associative cache [Agarwal and Pudar 1993a] establish a secondary set for each block using a di fferent hashing function from the primary set. Cache replacement is extended across both sets to reduce conflicts. An additional cache lookup is required for blocks that are not located in the primary set. Th e group-associative cache [Peir et al. 1998] maintains a separate cache directory fo r more flexible secondary set. A different hashing function is used to lookup blocks in th e secondary set. Similar to the hash-rehash, lookups in both of the directories are necessary. In addition, a link is added for each entry of the secondary directory to locate the data block. Recently, the V-way cache [Qureshi et al. 2005] eliminates multiple lookups by doubling the cache directory size with respect to the actual number of data blocks. In the V-way cache, a ny unused directory entry in the lookup set can be used to record a newly missed bl ock without replacing an existing block in the set. The existence 47


of unused entries allows searching for a replacement block across the entire cache, and thus decouples the replacement set from the lookup set. Although flexible, the V-way cache requires a bi-directional link between each directory entry and its corresponding data block in the data array. Data accesses must go through the link indirectly, which lengthens the cache access time. Also, even with the extra directory space, the V-way cache cannot eliminate hot sets and must replace a block within the lookup set if all directory frames in the set are occupied.

Figure 3-1. Connected cache sets with multiple hashing functions

The skewed-associative cache [Seznec 1993a; Seznec and Bodin 1993b; Bodin and Seznec 1997] is another cache organization that alleviates conflict misses. In contrast to conventional set-associative caches, the skewed-associative cache employs multiple hashing functions for the members of a set. In an n-way cache, the cache is partitioned equally into n banks. In set-associative caches, each set consists of one frame from each partition in the same position, addressed by the index bits. In caches with multiple hashing, each set also consists of one frame from each of the n cache partitions, but the location of the frame in each partition can differ, based on a different hashing function. To look up a cache block, the n independent


hashing functions address the n frames where the n existing cache blocks can be matched against the requested block to determine a cache hit or a miss. The insight behind the skewed-associative cache is that whenever two blocks conflict for a single location in partition i, they have a low probability of conflicting for a location in partition j.

The elbow cache [Spjuth et al. 2005], an extension of the skewed-associative cache, can expand the replacement set without any extra cache tag directory. In set-associative caches, two blocks are either mapped into the same set, or they belong to two disjoint sets. In contrast, the multiple-hashing cache presents an interesting property: two blocks can be mapped into two sets which share a common frame in one or more partitions. Let us assume a 4-way partitioned cache as illustrated in Figure 3-1; through four hashing functions, blocks A and B can be mapped to different locations in the four cache partitions. In this example, A and B share the same frame, a1/b1, in Partition 1, but are disjoint in the others. When two sets share the same frame in one or more cache partitions, the two sets are connected. The common frame provides a link to expand the replacement set beyond the original lookup set. Instead of replacing a block from the original lookup set, the new block can take over the shared frame, and then the block located in the shared frame can be moved to and replace a block in the connected set. For example, assume that when block A is requested, A is not present in any of the four allocated frames, a0, a1, a2, and a3, and a1 is occupied by block B. Instead of forcing out a block in a0, a1, a2, or a3, block A can take over frame a1 and push block B to one of the other frames, b0, b2, or b3, in the connected set. It is essential that relocating block B does not change the lookup mechanism. Furthermore, instead of replacing blocks in b0, b2, or b3, the recursive interconnection allows those blocks to be moved to other frames in their own connected sets. The elbow cache extends the skewed-associative cache organization by carefully selecting its victim and, in the case of a conflict, moving the


conflicting cache block to its alternative location in the other partition. In a sense, the new data block uses its elbows to make space for conflicting data instead of evicting it. The enlarged replacement set provides a better opportunity to find a suitable victim for eviction.

It is imperative to design an effective replacement policy to identify a suitable victim for eviction in the elbow cache, which features an enlarged replacement set. A recency-based replacement policy like LRU is generally thought to be the most efficient policy for processor caches, but it can be expensive to implement in the elbow cache. In this dissertation, we introduce a frequency-based cache replacement policy based on the concept that the least frequently used block is more suitable for replacement.

3.2 Conflict Misses

The severity of set conflicts is demonstrated using the SPEC2000 benchmarks. Twelve applications, Twolf, Bzip, Gzip, Parser, Equake, Vpr, Gcc, Vortex, Perlbmk, Crafty, Apsi, and Eon, were chosen for our study because of their high conflict misses. Their appearance from left to right shows the severity of conflicts from the least to the most. In this study, we simulate a 32KB L1 data cache with a 64-byte block size. The severity of conflicts is measured by the miss ratio reduction obtained when going from a 4-way set-associative to a fully-associative design. Figure 3-2 shows the cache miss ratios of 2-way, 4-way, 16-way, and fully-associative caches. As expected, both 2-way and 4-way set-associative caches suffer significant conflict misses for all selected workloads. It is interesting to see that even with a 16-way set-associative cache, Bzip, Gzip, Gcc, Perlbmk, Crafty, and Apsi still suffer significant performance degradation due to conflicts. For Apsi, more than 60% of the misses can be saved using a fully-associative cache compared to a 16-way cache.


Figure 3-2. Cache miss ratios with different degrees of associativity

3.3 Cache Replacement Policies for Elbow Cache

The fundamental idea behind the elbow cache is to think of an n-way partitioned cache as an n-dimensional Cartesian coordinate system. The coordinates of a cache block are the components of a tuple (index_0, index_1, ..., index_{n-1}) of natural numbers, generated by applying the corresponding hashing function to the cache block address for each partition. Since the theme of this dissertation is not to invent new hashing functions, the multiple hashing functions described in [Seznec 1993a; Seznec and Bodin 1993b; Bodin and Seznec 1997] are borrowed. The definitions of these skewed mapping functions are given as follows.

In the original skewed-associative cache [Seznec 1993a; Seznec and Bodin 1993b], two functions H and G are defined, where G is the inverse function of H and n is the width of the index bits:

    H: {0, ..., 2^n - 1} -> {0, ..., 2^n - 1}
       (y_n, y_{n-1}, ..., y_1) -> (y_1 ⊕ y_n, y_n, y_{n-1}, ..., y_3, y_2)

    G: {0, ..., 2^n - 1} -> {0, ..., 2^n - 1}
       (y_n, y_{n-1}, ..., y_1) -> (y_{n-1}, y_{n-2}, ..., y_1, y_n ⊕ y_{n-1})


For a 4-way partitioned cache, four hashing functions are defined (referred to as Mapping Function 93). A data block at memory address A = A_3·2^{2n+c} + A_2·2^{n+c} + A_1·2^c + A_0, where c is the width of the block offset, is mapped to:

1. cache frame f_0(A) = H(A_1) ⊕ G(A_2) ⊕ A_2 in cache partition 0, where ⊕ represents the exclusive-or operator;
2. cache frame f_1(A) = H(A_1) ⊕ G(A_2) ⊕ A_1 in cache partition 1;
3. cache frame f_2(A) = G(A_1) ⊕ H(A_2) ⊕ A_2 in cache partition 2;
4. cache frame f_3(A) = G(A_1) ⊕ H(A_2) ⊕ A_1 in cache partition 3.

In an alternative skewed function family reported later [Bodin and Seznec 1997] (referred to as Mapping Function 97), let σ be the one-position circular shift on n bits [Stone 1971]. A data block at memory address A = A_3·2^{2n+c} + A_2·2^{n+c} + A_1·2^c + A_0 is mapped to:

1. cache frame f_0(A) = A_1 ⊕ A_2 in cache partition 0;
2. cache frame f_1(A) = σ(A_1) ⊕ A_2 in cache partition 1;
3. cache frame f_2(A) = σ^2(A_1) ⊕ A_2 in cache partition 2;
4. cache frame f_3(A) = σ^3(A_1) ⊕ A_2 in cache partition 3.

3.3.1 Scope of Replacement

To expand the replacement set beyond the lookup set boundary, the coordinates of the blocks in the lookup set provide links for reaching other connected sets. Each block has n coordinates and can thus reach n-1 new frames in the other partitions, as long as those frames have not been reached before. The coordinates of each block in the connected sets can in turn link to other connected sets recursively until no new sets can be reached. We call the union of all connected sets the scope of a replacement set.
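A bit-level rendering may make the two families concrete. The sketch below encodes A_1 and A_2 as n-bit unsigned integers with y_n as the most significant bit; the encoding and function names are our assumptions, while the formulas follow the definitions above.

    /* H: circular right shift with the shifted-out low bit XORed into
     * the new top bit: (y_n,...,y_1) -> (y_1^y_n, y_n, ..., y_2).     */
    static unsigned H(unsigned y, int n)
    {
        unsigned msb = (y >> (n - 1)) & 1u;
        return ((((y & 1u) ^ msb) << (n - 1)) | (y >> 1))
               & ((1u << n) - 1);
    }

    /* G, the inverse of H:
     * (y_n,...,y_1) -> (y_{n-1}, ..., y_1, y_n ^ y_{n-1}).            */
    static unsigned G(unsigned y, int n)
    {
        unsigned top  = (y >> (n - 1)) & 1u;
        unsigned next = (y >> (n - 2)) & 1u;
        return ((y << 1) | (top ^ next)) & ((1u << n) - 1);
    }

    /* sigma^k: k-position circular shift on n bits. */
    static unsigned sigma_k(unsigned y, int k, int n)
    {
        unsigned mask = (1u << n) - 1;
        k %= n;
        return k ? (((y << k) | (y >> (n - k))) & mask) : (y & mask);
    }

    /* Mapping Function 93: frame index in partition i (0..3). */
    unsigned map93(int i, unsigned a1, unsigned a2, int n)
    {
        unsigned base = (i < 2) ? (H(a1, n) ^ G(a2, n))
                                : (G(a1, n) ^ H(a2, n));
        return base ^ ((i & 1) ? a1 : a2);
    }

    /* Mapping Function 97: frame index in partition i (0..3). */
    unsigned map97(int i, unsigned a1, unsigned a2, int n)
    {
        return sigma_k(a1, i, n) ^ (a2 & ((1u << n) - 1));
    }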


Figure 3-3. Example of search for replacement

In Figure 3-3, we use a simple example of a 2-way partitioned cache to illustrate that the replacement scope can cover the whole cache with the elbow cache mechanism. The cache has 8 frames in each partition, where the x and y coordinates represent partition 0 and partition 1, respectively. This snapshot was taken from Vortex of SPEC2000 running on the 2-way partitioned cache. All frames are occupied, as indicated by the corresponding hexadecimal block addresses. The two integers in the parentheses next to each block address represent the coordinate values of the block obtained from the two hashing functions. When a request 050001b3 (6,3) arrives, a miss occurs since the block is not located in the lookup set of frame 6 on coordinate x and frame 3 on coordinate y. The search for a replacement begins from the lookup set. Block 0500017d (6,4), located in frame 6 on coordinate x, connects to frame 4 on coordinate y. Similarly, block 050001a7 (0,3), located in frame 3 on coordinate y, connects to frame 0 on coordinate x. The blocks located in the connected frames, 05000183 (7,4) and 050034e (0,7), can make further connections to 050001b9 (7,1) and 05000505 (2,7), respectively. The search continues until block 0500019b (5,5) in frame 5 on coordinate x and 0500018f (3,5)


in frame 5 on coordinate y are revisited, as illustrated by the arrows in the figure. In this example, the replacement scope covers the entire set of cache frames.

Figure 3-4. Distribution of scopes using two sets of hashing functions

Although the interconnection of multiple sets for replacement is recursive, the scope for each requested block becomes bounded once all newly expanded frames have already been reached. With the selected integer and floating-point applications from SPEC2000, we can measure the scope of replacement. Figure 3-4 shows the accumulated scope distributions for the selected applications using the two skewed mapping function families described before. It is interesting to observe that the scope of the elbow cache using Mapping Function 93 covers almost the entire cache, but when using Mapping Function 97, the scope is limited to half of the cache frames. This is due to certain constraints imposed on the selected randomization functions. Further discussion of the mathematical properties of these hashing functions is outside the scope of this dissertation. It is important to emphasize that, for all practical purposes, the scopes of both skew-based hashing schemes are sufficient to find a proper victim for replacement.
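The connected-set walk of Figure 3-3 reduces to a small loop: each visited frame links to the other partition through the block that resides in it. The sketch below follows the 2-way example; the placeholder coordinate functions and the free-frame test stand in for the real skewed mappings and replacement policy.

    #include <stdbool.h>

    #define FRAMES 8                /* frames per partition, as in Fig. 3-3 */

    typedef struct { long tag; bool valid; } frame_t;
    static frame_t part[2][FRAMES];

    /* Placeholder coordinate functions; the real design uses the
     * skewed mappings of Section 3.3.                             */
    static int coord(int p, long addr)
    {
        return (int)((p == 0 ? addr : addr >> 3) & (FRAMES - 1));
    }

    /* Walk the connected sets of a missed block and count how many
     * distinct frames the replacement scope reaches. A victim search
     * would instead stop at the first evictable block, relocating the
     * blocks along the chain one step each.                           */
    int scope_size(long miss_addr)
    {
        bool seen[2][FRAMES] = {{false}};
        int count = 0;
        for (int p = 0; p < 2; p++) {
            int q = p, f = coord(p, miss_addr);
            while (!seen[q][f]) {
                seen[q][f] = true;
                count++;
                if (!part[q][f].valid)
                    break;                  /* a free frame ends the chain */
                long resident = part[q][f].tag;
                q = 1 - q;                  /* follow the resident block's */
                f = coord(q, resident);     /* other coordinate            */
            }
        }
        return count;
    }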


3.3.2 Previous Replacement Algorithms

The study of cache block replacement policies is, in essence, a study of the characteristics or behavior of the workloads running on a system. Specifically, it is a study of the access patterns to blocks within the cache. Based on the recognition of access patterns through the acquisition and analysis of past behavior, or history, replacement policies try to identify the block that will be used furthest in the future, so that that block may be replaced when needed. The LRU policy does this by tracking the recency of block references, such that the least recently used block is replaced when needed. The LFU policy considers the frequency of block references, such that the least frequently used block is replaced when needed. These respective policies inherently assume that the future behavior of the workload will be dominated by the recency or frequency factors of past behavior.

The ability of the elbow cache to reduce conflict misses depends primarily on the intelligence of the cache replacement policy. Different replacement policies may be used. The random replacement policy is the simplest to implement, but it increases the miss rate compared to the baseline configuration (a 4-way set-associative cache). The LRU replacement policy is more effective than random. However, the traditional LRU replacement policy based on the MRU-LRU sequence is difficult to implement with multiple hashing functions. It is well known that the complexity of maintaining an MRU-LRU sequence is s!, where s is the set associativity. LRU replacement can be applied to set-associative caches due to their limited set sizes, and pseudo-LRU schemes can be used to reduce the complexity for highly associative caches. Since the number of sets grows exponentially with multiple hashing, however, it is prohibitively expensive to maintain the needed MRU-LRU sequences.

Instead of maintaining the precise MRU-LRU sequence for replacement, a scheme based on a time stamp can be considered. The time-stamp (TS) scheme emulates LRU


replacement by maintaining a global memory request ID for each cache block. When a miss occurs, the block with the oldest time-stamp in the set (or the connected sets) is replaced [Seznec 1993a; Seznec and Bodin 1993b; Bodin and Seznec 1997]. To save directory space as well as to simplify calculations, a smaller time-stamp is desirable. A more practical scheme that uses a small number of bits both in the counter and the time-stamp works by shifting the counter and all the time-stamps one bit to the right whenever the reference counter overflows [Gonzalez et al. 1997].

The Not Recently Used Not Recently Written (NRUNRW) replacement policy is an effective replacement policy implemented on the skewed-associative cache [Stone 1971; Seznec 1993a]. The Recently Used (RU) bit tag is set when the cache block is accessed. Periodically, the RU bit tags of all the cache blocks are zeroed. When a cache access misses in the cache, the replaced block is chosen from the replacement set in the following priority order. First, randomly pick among the blocks for which the RU tag is not set. Second, randomly pick among the blocks for which the RU tag is set but which have not been modified since they were loaded into the cache. Last, randomly pick among the blocks for which the RU tag is set and which have been modified.

Another key issue in implementing cache replacement in the elbow cache is that a linear search for the replacement block among all the connected sets is necessary. It is prohibitively expensive to traverse the entire scope to find a suitable victim for replacement. A restriction must be added to confine the search within a small set of cache frames.
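The compact time-stamp bookkeeping might look as follows; this is a minimal sketch of the shift-on-overflow idea of [Gonzalez et al. 1997] with an 8-bit stamp per block, and the array and function names are ours.

    #include <stdint.h>

    #define NBLOCKS 512          /* one stamp per cache block */

    static uint8_t stamp[NBLOCKS];
    static uint8_t clock_count;

    /* Record a reference: the block copies the global counter. When
     * the counter is about to overflow, it and every stamp are shifted
     * right one bit, coarsely preserving the relative age order.      */
    void ts_touch(int b)
    {
        if (clock_count == 0xFF) {
            clock_count >>= 1;
            for (int i = 0; i < NBLOCKS; i++)
                stamp[i] >>= 1;
        }
        stamp[b] = ++clock_count;
    }

    /* Among candidate frames, the smaller stamp marks the older block. */
    int ts_older(int a, int b)
    {
        return stamp[a] <= stamp[b] ? a : b;
    }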


3.3.3 Elbow Cache Replacement Example

In Figure 3-5, a sequence of memory requests is used to illustrate how a time-stamp-based 2-way elbow cache replacement works. Each request is denoted as Tag-f0,f1-(ID), where Tag is the block address tag; f0 and f1 represent the location of the block in the two coordinates based on the two different hashing functions; and (ID) represents the request ID used as a time stamp. For simplicity, we assume that f0 and f1 are taken directly from address bits and may be needed as part of the tag to determine a cache hit or miss. Within each partition, there are only four cache frames, addressed by the two hashing bits.

Figure 3-5. Replacement based on time-stamp in elbow caches (request sequence: a-00,10-(0), b-00,11-(1), c-00,11-(2), b-10,10-(3), a-10,11-(4), b-10,10-(5))

When the first request, a-00,10-(0), is issued, both frame 00 on coordinate x and frame 10 on coordinate y are empty. A miss occurs and a-00,10-(0) is allocated to frame 00 on coordinate x. The second request, b-00,11-(1), is also a miss and is allocated to frame 11 on coordinate y, since that frame in the lookup set is empty. For the third request, c-00,11-(2), both frames in the lookup set are now occupied by the first two requests. However, frame 10 on coordinate y is empty, and it is in the connected set of the current lookup set through the shared frame 00 on coordinate x. Therefore, block a-00,10-(0) can be moved to fill frame 10 on coordinate y, as indicated by the arrow with the request ID (2), which leaves the shared frame 00 on coordinate x for the newly missed request c-00,11-(2). The fourth request, b-10,10-(3), finds an empty frame 10 on coordinate x. The fifth request, a-10,11-(4), again misses the cache, and both frames in the lookup set are occupied. Assume in this case that both existing blocks are not old enough to be replaced. Through the block b-10,10-(3) in the shared frame, an older block, a-00,10-(0), in the connected set of frame 10 on coordinate y is found and can be replaced, as indicated by the arrow


with the request ID (4). Finally, the last request, b-10,10-(5), can easily be located as a hit even though the block has been relocated.

3.3.4 Least Frequently Used Replacement Policy

The cost and performance associated with LRU replacement depend on the number of bits devoted to the time-stamp. The wide-bit comparisons required for multiple parallel time-stamp comparisons are expensive and time consuming. Furthermore, the time-stamp also requires extra storage for each block in the cache. The more bits devoted to the time-stamp, the more accurate the maintained LRU sequence, and the higher the implementation cost. Even using the most optimized time-stamp [Gonzalez et al. 1997], the counter for each cache block has at least 8 bits. To reduce the implementation complexity while maintaining equal cache performance, we introduce a new cache replacement policy for the elbow cache.

An LFU replacement policy keeps the most frequently used blocks in the cache. The least frequently used blocks are evicted from the cache when new blocks are put into it. It is based on the observation that a block tends to be reused if it has been used more frequently after it was moved into the cache [Qureshi et al. 2005]. We propose the reuse-count scheme (RC), an implementation of LFU replacement for the elbow cache. A reuse counter is maintained for each cache block. Upon a miss, the block with the lowest reuse frequency is replaced. The reuse count is given an initial value when the block is moved into the cache, and the value is incremented when the block is accessed. This results in the following kind of problem: certain blocks are referenced relatively infrequently overall, and yet when they are referenced, locality produces short intervals of repeated re-references, building up high reuse counts. After such an interval is over, the high reuse count is misleading: it is due to locality, and cannot be used to estimate the probability that such a block will be reused after the end of the interval. Here, this problem is addressed by factoring out locality from reuse counts, as


follows. The reuse count is decremented when the cache block is searched but mismatches the requested block. A block can be replaced when its count reaches zero. In this way, recency information is also counted in the frequency-based replacement policy. Our performance evaluation shows that on an elbow cache, a reuse-count replacement policy with a 3-bit reuse counter can perform as well as an LRU replacement policy, with very small storage and low hardware complexity.

To avoid searching for a replacement victim through all of the connected replacement sets, a block is considered replaceable when its recorded reference count reaches a certain threshold (zero). The search stops when a replaceable block is found. Furthermore, the replacement search can be confined within a limited search domain. For instance, the elbow cache search can be confined within the original lookup set plus the sets connected to it at a single level. In case no replaceable block is found, a block in the lookup set is replaced. Since the search and replacement are only encountered on a cache miss, they are not on the critical path of the cache access. In addition, our simulation results show that about 40% to 70% of the replacements are still located in the lookup set, so no extra overhead for searching and replacement is incurred.

Vacating the shared frame to make room for the newly missed block involves a data block movement. To limit this data movement, a breadth-first traversal is used to search all possible first-level connected sets through the blocks located in the lookup set. In case no replaceable block is found, the oldest block in the lookup set can be picked for replacement, which limits the block movements to at most one per cache miss. The search can be extended to further levels at the cost of an additional block movement per interconnection level.
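The counter updates of the reuse-count scheme are compact. Below is a minimal sketch using the 3-bit counter with an initial value of 3 that Section 3.4 evaluates; the array layout and function names are our own.

    #include <stdint.h>

    #define PARTS   4
    #define FRAMES  128          /* frames per partition          */
    #define RC_INIT 3            /* starting credit on a fill     */
    #define RC_MAX  7            /* 3-bit saturating counter      */

    static uint8_t rc[PARTS][FRAMES];

    /* A newly filled block starts with the initial credit. */
    void rc_on_fill(int p, int f)  { rc[p][f] = RC_INIT; }

    /* A hit rewards the block, saturating at the 3-bit maximum. */
    void rc_on_hit(int p, int f)   { if (rc[p][f] < RC_MAX) rc[p][f]++; }

    /* A frame probed during a lookup whose tag mismatches the request
     * is aged; this folds recency back into the frequency count.      */
    void rc_on_mismatch(int p, int f) { if (rc[p][f] > 0) rc[p][f]--; }

    /* The victim search over the lookup set plus one level of
     * connected sets stops at the first zero counter.          */
    int rc_replaceable(int p, int f) { return rc[p][f] == 0; }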


A more dramatic approach to avoid the data movement is to establish a pointer from each directory entry to its corresponding block in the data array, similar to those in [Peir et al. 1998; Qureshi et al. 2005]. However, this indirect pointer approach lengthens the cache access time.

3.4 Performance Evaluation

3.4.1 Miss Ratio Reduction

In this dissertation, we use miss ratios as the primary performance metric to demonstrate the effectiveness of the reuse-count replacement policy for the elbow cache. The various caching schemes and replacement policies were implemented on the L1 data cache, which is 32KB and 4-way partitioned with 64B lines. For comparison purposes, we consider a 32KB, 4-way set-associative L1 cache with LRU replacement as the baseline cache. To evaluate the elbow cache, we excluded workloads that have less than a 3% miss ratio gap between a 32KB fully-associative cache and the baseline cache. Based on these criteria, twelve workloads from SPEC2000, Twolf, Bzip, Gzip, Parser, Equake, Vpr, Gcc, Vortex, Perlbmk, Crafty, Apsi, and Eon, were selected.

Three categories of existing caching schemes are evaluated and compared with the elbow cache. The first category is conventional caches with high associativity, including 16-way and fully-associative caches using the LRU replacement policy, denoted as 16-way+LRU and Full+LRU. The second category is the skewed-associative caches. Three improved replacement policies, NRUNRW [Seznec and Bodin 1993b], time-stamp [Gonzalez et al. 1997], and reuse-count [Qureshi et al. 2005], are considered, denoted as Skew+NRUNRW, Skew+TS, and Skew+RC, respectively. The third category is the V-way cache. Only the reuse-count replacement policy is applied, due to the nature of the V-way replacement, denoted as V-way+RC. Finally, for the elbow caches, the same three replacement policies as for the skewed-associative caches are implemented, denoted as Elbow+NRUNRW, Elbow+TS, and Elbow+RC.


[Figure 3-6. Miss ratio reduction with caching schemes and replacement policies. Panels A-F plot the miss ratio reduction (%) of the nine caching schemes for Twolf, Bzip, Gzip, Parser, Equake, and Vpr.]


[Figure 3-6 continued. Panels G-L plot the miss ratio reduction (%) of the nine caching schemes for Gcc, Vortex, Perlbmk, Crafty, Apsi, and Eon.]

A practical time-stamp scheme is evaluated, which uses an 8-bit time-stamp as reported in [Gonzalez et al. 1997]. For the reuse-count scheme, our evaluation suggested a 3-bit counter with an initial value of 3 and a victim value of zero for the best performance. If no zero reuse count is found, a victim is picked randomly within the lookup set. For skewed-associative caches and


elbow caches, we borrow Mapping Function 97 as the hashing functions [Bodin and Seznec 1997]. For the V-way cache, we simulate a directory with twice as many entries as the total cache frames. Due to the overhead of the sequential search for the replacement and the extra data movement involved in moving the block from the connected frame, we limit the replacement scope to two levels (the lookup set plus one level of connected sets) in elbow caches. As a result, during the replacement search, up to 16 frames can be reached, and at most four directory searches and one block movement may be needed. In case a replaceable block is not found, the best candidate in the lookup set is replaced.

We compare the miss ratio reduction for the various caching mechanisms with different replacement policies. Figure 3-6 summarizes the relative miss ratio reductions for the nine caching schemes compared with the baseline cache (32KB, 4-way set-associative). Due to the wide range of reductions, the twelve workloads are divided into four groups, identifiable in the figures by four different y-axis scales. Several interesting observations can be made.

First, the elbow caches show more miss ratio reduction than the skewed-associative caches. This is due to the advantage that the connected sets extend the search domain from 4 frames to 16 frames. Elbow+RC has miss ratio reductions ranging from 2% to as high as 52% with an average reduction of 21%, while Elbow+TS has miss ratio reductions ranging from 3% to as high as 57% with an average of 22%. Skew+RC has miss ratio reductions ranging from less than 1% to 45% with an average of 11%, while the Skew+TS reduction ranges from 3% to 50% with an average of about 17%. For the elbow cache, the time-stamp based and the reuse-count based replacement show mixed results; both methods work effectively, with a slight edge to the time-stamp scheme. In contrast, the time-stamp works much better than the reuse-count on


skewed-associative caches. Apparently, LRU replacement performs better when the replacement is confined within the lookup set. Although the time-stamp scheme performs slightly better, its cost is also higher compared with the reuse-count scheme.

Second, in general the elbow cache outperforms the V-way cache by a significant margin. The average miss ratio reduction for the V-way is about 11%. The relative V-way performance fluctuates considerably among applications. For example, the V-way cache shows the most reduction compared with the other schemes in Gzip, and it also performs nearly the best in Vortex. However, for Equake and Apsi, V-way's performance is at the very bottom. The main reason for this discrepancy is the inability to handle hot sets. For Gzip and Vortex, 92% and 75% of the time an unused directory entry in the lookup set can be allocated for the missed block, while for Equake and Apsi there is only a 27% and 30% chance that a search for a replacement outside the lookup set is permitted. This confirms that even with double the directory size, the V-way cache is constrained in solving the hot set problem.

Third, it is interesting to observe that the elbow cache can outperform a fully-associative cache by a significant margin in many applications. The average miss-ratio reductions for Full+LRU, Elbow+TS, and Elbow+RC are 21.4%, 21.6%, and 20.9%, respectively. These results are due to two reasons. First, the fully-associative cache suffers the worst cache pollution when a never-reused block is moved into the cache. It takes c more misses to replace a never-reused block, where c is the number of frames in the cache. Such a block can be replaced much faster in elbow caches once it becomes replaceable. Vortex is a good example of the cache pollution problem, in which Full+LRU performs much worse than 16-way+LRU. Second, the skew hashing functions provide good randomization for alleviating set


conflicts. By searching through the connected sets in elbow caches, the hot set issue is largely eliminated.

3.4.2 Searching Length and Cost

In this section, we evaluate the extra cost associated with elbow caches. A normal cache access involves a single tag directory access to determine a hit/miss. During replacement in the elbow cache, extra tag accesses are required when traversing through the replacement set. Moreover, an additional block movement may happen between search levels when the replaced block is not located in the lookup set. We simulated a replacement policy with 2-level searching (the lookup set plus one level of connected sets). If a victim is found at the first level, there is no extra tag access and no data movement. Otherwise, an extra block movement along with up to three additional tag accesses is needed if the replaced block is found in the second level. In case no replaceable block can be found within 2 levels, a victim is chosen from the lookup set and no extra block movement is required; however, this does incur 3 additional tag accesses.

Table 3-1. Searching levels, extra tag access, and block movement

               Replacement search                 Overhead/Access
    Workload   1st level  2nd level  Not found   Extra tag accesses  Block movement
    Bzip       61.8%      35.0%      3.2%        1.5%                1.0%
    Vpr        47.0%      45.0%      8.0%        2.7%                1.5%
    Perlbmk    39.0%      45.4%      15.6%       3.3%                1.6%
    Apsi       45.6%      45.0%      9.4%        2.5%                1.4%

Table 3-1 summarizes the cost of the elbow cache with four workloads, Bzip, Vpr, Perlbmk, and Apsi. Note that we selected these four workloads, one from each miss reduction range described previously, to simplify the presentation. The percentage of replacements finding a replaceable block at each searching level is shown. About 40%-60% of the replaceable blocks are located in the lookup set and about 35%-45% are found at the connected sets. The


percentages that no replaceable block is found in the first two levels vary from 3% to 15%. We also count the extra tag accesses and block movements. As shown in the table, an extra 1.5% to 3.3% of tag accesses are encountered in the elbow cache. Also, on average, an extra block movement is needed for 1% to 1.6% of the memory accesses. It is important to note that the extra tag accesses and block movements are not on the critical cache access path, because they are only encountered on cache misses. These extra tag accesses and block movements can be delayed in case of a conflict with normal cache accesses.

3.4.3 Impact of Cache Partitions

So far, our evaluations of the elbow cache are based on a 4-way partitioned structure. In this section, we show the results of 2-, 4-, and 8-way partitions. Again, the four workloads Bzip, Vpr, Perlbmk, and Apsi, one from each miss reduction range, are selected. The miss ratio, instead of the miss ratio reduction, is used for the comparison. As shown in Figure 3-7, increasing the degree of partitioning improves the miss ratios for all four workloads. These results are obtained using Elbow+RC; similar results are also observed using Elbow+TS. From 2-way to 8-way, the miss ratios are reduced accordingly: 2.8%, 2.6%, and 2.5% for Bzip; 4.2%, 3.4%, and 3.2% for Vpr; 12.2%, 10.3%, and 8.8% for Perlbmk; and 2.0%, 1.6%, and 1.4% for Apsi, respectively. These reduction rates are much faster than the miss reduction rates for set-associative caches when the associativity increases from 2-way to 8-way.

In a 2-level elbow cache, the search domain is equal to p^2, where p is the number of partitions. For elbow caches from 2-way to 8-way, the replacement scope can thus reach 4, 16, and 64 frames. This p^2 growth of the replacement set outperforms the linear growth of the replacement set in set-associative caches. In terms of cost, however, the extra directory tag


access only increases linearly with the number of partitions. Moreover, increasing the number of partitions requires no extra block movement when the replacement is confined within two levels.

[Figure 3-7. Miss ratio for different cache associativities. Panels A-D plot the miss ratio (%) of 2-, 4-, 8-, and 16-way+LRU, Full+LRU, and 2-, 4-, and 8-way Elbow+RC caches for Bzip, Vpr, Perlbmk, and Apsi.]

In comparison with the fully-associative cache, the further miss reductions of 8-way elbow caches let them surpass fully-associative cache performance for three of the four workloads. For Bzip, Vpr, and Perlbmk, the miss ratios are reduced by 3.2%, 12.3%, and 16.6%, respectively.

3.4.4 Impact of Varying Cache Sizes

We analyze the performance impact of cache sizes on elbow caches. Four L1 data cache sizes, 8KB, 16KB, 32KB, and 64KB, are simulated using the same four workloads as in Section 3.4.3. The miss ratios are plotted in Figure 3-8 for three caching schemes: 4-way+LRU,


Full+LRU, and 4-way Elbow+RC. As observed, the cache size makes a huge impact on the miss ratios of the three caching schemes. It is straightforward that bigger caches reduce the miss ratios for all three caching schemes.

[Figure 3-8. Miss rate for different cache sizes. Panels A-D plot the miss ratio (%) of 4-way+LRU, Full+LRU, and 4-way Elbow+RC for 8KB to 64KB caches on Bzip, Vpr, Perlbmk, and Apsi.]

However, the relative gaps among the three schemes vary widely among the workloads with different cache sizes. Generally speaking, conflict misses are reduced with bigger caches, which makes the elbow cache less effective. This is true for Bzip and Perlbmk. However, for Vpr, the gap between 4-way+LRU and Full+LRU stays relatively the same with 32KB and 64KB caches; therefore, the elbow cache is equally effective with all four cache sizes. Apsi behaves in the opposite way: the elbow cache is much more effective for 32KB and 64KB caches.


A detailed study of Apsi indicates that the 32KB Full+LRU can hold the working set, but the 32KB 4-way+LRU cannot due to conflicts in hot sets. Consequently, Elbow+RC shows a huge gap against 4-way+LRU due to better replacement. At 16KB, none of the caching schemes can hold the working set, which creates heavy capacity misses; as a result, all three caching schemes show similar performance.

3.5 Related Work

Applications with regular patterns of memory access can experience severe cache conflict misses in a set-associative cache. There are few works on finding better mapping functions for cache memories. Most of the prior hashing functions permute the accesses using some form of Exclusive-OR (XOR) operation. The elbow cache is not limited to any specific randomization/hashing method; other possible functions can be found in [Yang and Adina 1994; Kharbutli et al. 2004]. The skewed-associative cache applies different mapping functions on different partitions. Although various replacement policies [Seznec 1993a; Seznec and Bodin 1993b; Bodin and Seznec 1997; Gonzalez et al. 1997] have been proposed for the skewed-associative cache, it is still an open issue to find an efficient and effective one.

The hash-rehash [Agarwal et al. 1988], the column-associative [Agarwal and Pudar 1993a], and the group-associative [Peir et al. 1998] caches use extra directories to increase associativity. In contrast, the elbow cache uses links to connect blocks in the cache without any extra directory storage. The V-way cache [Qureshi et al. 2005] can be viewed as a new way to increase associativity by doubling the cache directory size. However, it requires indirect links between each entry in the directory and its corresponding block in the data array. Furthermore, even with extra directory space, it cannot solve the hot set problem, since the directory entries in the hot sets are always occupied.


3.6 Conclusion

The efficiency of the traditional set-associative cache is degraded by severe conflict misses. The elbow cache has demonstrated its ability to expand the replacement set beyond the lookup set boundary without adding any complexity on the lookup path. Because of the characteristics of the elbow cache, it is difficult to implement recency-based replacement. The proposed low-cost reuse-count replacement policy can achieve cache performance comparable to a recency-based replacement policy.


CHAPTER 4
TOLERATING RESOURCE CONTENTIONS WITH RUNAHEAD ON MULTITHREADING PROCESSORS

4.1 Introduction

SMT processors exploit both ILP and TLP by fetching and issuing multiple instructions from multiple threads in the same cycle to utilize wide-issue slots. In SMT, multiple threads share resources such as caches, functional units, the instruction queue, the instruction issue window, and the instruction window [Tullsen et al. 1995; Tullsen et al. 1996]. SMT typically benefits from giving threads complete access to all resources every cycle. But contention for these resources may significantly hamper the performance of individual threads and hinder the benefit of exploiting more parallelism from multiple threads.

First, disruptive cache contention leads to more cache misses and hurts overall performance. Second, threads can hold critical resources while they are not making progress due to long-latency operations, blocking other threads from normal execution. For example, if a stalled thread fills the issue window and instruction window with waiting instructions, it shrinks the window available for the other threads to find instructions to issue and to bring new instructions into the pipeline. Thus, exactly when parallelism is most needed, when one or more threads are no longer contributing to the instruction flow, fewer resources are available to expose that parallelism.

We investigate and evaluate a valuable solution to this problem: runahead execution on SMTs. Runahead execution was first proposed to improve MLP on single-thread processors [Dundas and Mudge 1997; Mutlu et al. 2003]. Effectively, runahead execution can achieve the same performance level as a much bigger instruction window. With heavier cache contention on SMTs, runahead execution is more effective in exploiting MLP. Besides the inherent advantage of memory prefetching, by removing long-latency memory operations from


the instruction window, runahead execution can ease resource blocking among multiple threads on SMTs, making other sophisticated thread-scheduling mechanisms unnecessary [Tullsen and Brown 2001; Cazorla et al. 2003].

4.2 Resource Contentions on Multithreading Processors

This section demonstrates the resource contention problem in SMT processors. Several SPEC2000 benchmarks are chosen for this study based on their L2 cache performance [Hu et al. 2003]. The weighted speedup [Snavely and Tullsen 2000] is used to compare the IPC of multithreaded execution against the IPC when each thread is executed independently, where IPC_SMT represents an individual thread's IPC in the SMT mode:

    \text{Weighted Speedup} = \sum_{\text{threads}} \frac{IPC_{SMT}}{IPC_{Single\ thread}}

Figure 4-1 shows the weighted speedups of eight combinations of two threads on SMTs using the ICOUNT2.8 scheduling strategy [Tullsen et al. 1996]. These results show that running two threads on an SMT may yield worse IPC than running the two threads separately. We can categorize the workloads into three groups. The first group includes Twolf/Art, Twolf/Mcf, and Art/Mcf, which are composed of programs with relatively high L2 miss penalties [Hu et al. 2003]. The second group includes Parser/Vpr, Vpr/Gcc, and Twolf/Gcc, which consist of programs with median L2 miss penalties. Finally, the third group has two workloads, Gap/Bzip and Gap/Mcf, in which either one or both programs have low L2 miss penalties. In general, except for the third group, the workloads have poor performance with median-size L2 caches. For the first group, the median size is about 2MB to 8MB, while for the second group, the median size is about 512KB to 1MB. The weighted speedups in the SMT mode at these cache sizes can be significantly lower than 1.
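As a concrete illustration of this metric, the short C function below computes the weighted speedup from per-thread IPC measurements; it is a minimal sketch of the formula only, with hypothetical names, not code from our simulation infrastructure.

    /* Weighted speedup: sum over threads of each thread's IPC in SMT
     * mode divided by its IPC when running alone. With two threads, a
     * value near 1 means fair sharing with no net gain; below 1,
     * contention outweighs the benefit of multithreading. */
    double weighted_speedup(const double ipc_smt[], const double ipc_single[],
                            int num_threads)
    {
        double sum = 0.0;
        for (int i = 0; i < num_threads; i++)
            sum += ipc_smt[i] / ipc_single[i];
        return sum;
    }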


[Figure 4-1. Weighted instructions-per-cycle speedups for multiple threads vs. single thread on simultaneous multithreading, plotted against L2 cache sizes from 128KB to 16MB for the eight workload pairs.]

[Figure 4-2. Average memory access time ratio for multiple threads vs. single thread on simultaneous multithreading, plotted against L2 cache sizes from 128KB to 16MB for the eight workload pairs.]

Generally speaking, with small caches, heavy cache misses are encountered regardless of the number of threads; thus, no significant performance degradation is observed in the SMT mode. With large caches, both threads may incur very few cache misses even when running together in the SMT mode. This U-shaped speedup curve is evident for workloads in the second group, and it is generally true for workloads in the first group too. However, negative IPC speedups can still result for workloads in the first group even with very large L2 caches, because


degradation from other resource contention outweighs the benefit of exploiting TLP. Workloads in the third group benefit from TLP consistently.

To show the impact of cache contention on SMT performance, we plot the average memory access time ratio between two threads running in the SMT mode and running sequentially in a single-thread mode (Figure 4-2). Except for Gap/Bzip and Gap/Mcf, the workloads experience increases in the average memory access time with median-size caches ranging from 512KB to 4MB. The degree of increase in the average memory access time matches well against the IPC losses in Figure 4-1. Nevertheless, although little increase in the average memory access time is observed with 8MB or 16MB L2 caches, workloads in the first group still show huge IPC degradations. This is again due to contention on other resources.

4.3 Runahead Execution on Multithreading Processors

Runahead execution on SMT processors follows the same general principle as on single-thread processors [Mutlu et al. 2003]. It prevents the instruction window from stalling on long-latency memory operations by executing speculative instructions. Runahead execution of a thread starts once a long-latency load reaches the top of the instruction window. An invalid value is assigned to the long-latency load to allow the load to be pseudo-committed without blocking the instruction window. A checkpoint of all the architectural state must be made before entering the runahead execution mode. During runahead mode, the processor speculatively executes instructions relying on the invalid value. All the instructions that operate on the invalid value will also produce invalid results; however, the instructions that do not depend on the invalid value will be pre-executed. When the memory operation that started runahead mode is resolved, the processor rolls back to the initial checkpoint and resumes normal execution. As a consequence, all the speculative work done by the processor is discarded. Nevertheless, this previous execution is not completely useless. The main advantage of runahead is that the


speculative execution would have generated useful data prefetches, improving the behavior of the memory hierarchy during the real execution. In some sense, runahead execution has the same effect as physically enlarging the instruction window.

We adapted and modified a SimpleScalar-based SMT model from the Vortex Project [Dorai and Yeung 2002]. The out-of-order SimpleScalar processor separates in-order execution at the functional level from the detailed pipelined timing simulation. At the functional level, instructions are executed one at a time without any overlap. The results from the functional execution drive the pipelined timing model. One important implementation issue is that the checkpoint for runahead execution must be made at the functional level. The actual invalid value from runahead execution is not simulated; instead, registers or memory locations are marked when their content is invalid. In runahead mode, only those L2 misses with correct memory addresses will be issued.

[Figure 4-3. Basic function-driven pipeline model with runahead execution. The function execution stage feeds the out-of-order timing model (I-Fetch, Dec/Dep, Issue, Register, Execute, Writeback, Commit with ROB/LSQ/IQ and memory). Checkpoints are taken on branch misprediction and on an L2 miss; invalid registers and memory are marked, write values are saved to the MUB, and on load return the architectural state is recovered, runahead instructions are flushed, and the rename registers are reset to exit runahead.]

Figure 4-3 illustrates the pipeline microarchitecture. The checkpoint is made at the Dec/Dep stage in the function mode when a load misses the L2 cache. During runahead execution, all destination registers and memory locations are marked invalid for the L2 miss and all its descendent instructions. All memory writes are buffered in the MUB (Memory Update Buffer) to allow correct execution while maintaining the memory state at the checkpoint.
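To make this bookkeeping concrete, the C sketch below shows one plausible shape of the invalid-value propagation and the MUB; all structure and function names are illustrative assumptions, not taken from the simulator.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_REGS 64
    #define MUB_SIZE 128

    /* Poison bits: set while a register holds an invalid runahead value. */
    static bool inv_reg[NUM_REGS];

    /* Memory Update Buffer: runahead stores are buffered here so that the
     * memory image at the checkpoint is never modified. */
    struct mub_entry { uint64_t addr; uint64_t data; bool invalid; };
    static struct mub_entry mub[MUB_SIZE];
    static int mub_count;

    /* Execute one simplified ALU/store instruction in runahead mode. */
    void runahead_execute(int dst, int src1, int src2,
                          bool is_store, uint64_t addr, uint64_t value)
    {
        /* A result is invalid if any source operand is invalid. */
        bool poisoned = inv_reg[src1] || inv_reg[src2];

        if (is_store) {
            /* Redirect the write into the MUB instead of memory. */
            if (mub_count < MUB_SIZE)
                mub[mub_count++] = (struct mub_entry){ addr, value, poisoned };
        } else {
            inv_reg[dst] = poisoned;      /* propagate the poison bit */
        }
    }

    /* Exit runahead: throw away the speculative state and roll back to
     * the checkpoint; only the prefetches issued along the way survive. */
    void runahead_exit(void)
    {
        mub_count = 0;
        for (int r = 0; r < NUM_REGS; r++)
            inv_reg[r] = false;
        /* restore_checkpoint();  -- architectural rollback (not shown) */
    }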


4.4 Performance Evaluation

Performance evaluations of runahead execution are carried out on the modified out-of-order SimpleScalar-based SMT model [Dorai and Yeung 2002]. The ICOUNT2.8 scheduling policy [Tullsen et al. 1996] is used. Except for replicated program counters, register files, and completion logic, all other resources are shared among the threads. Threads share a 256-entry instruction window and a 256-entry load-store queue; the issue window is assumed to be the same size as the instruction window. We fix the size of the L1 caches at 64KB and vary the size of the L2 cache from 128KB to 16MB. Memory access latency is set to 400 cycles, and there can be at most 60 outstanding memory requests at the same time. Eight mixed workload combinations from SPEC2000, Twolf/Art, Twolf/Mcf, Art/Mcf, Parser/Vpr, Vpr/Gcc, Twolf/Gcc, Gap/Bzip, and Gap/Mcf, are selected. For measuring the weighted speedup, the total simulated instructions of individual threads are kept the same between the multiple-thread execution mode and the single-thread execution mode.

4.4.1 Instructions Per Cycle Improvement

Figure 4-4 summarizes the IPC improvement of runahead on SMTs with 1MB L2 caches, where the IPCs of the individual threads as well as the IPCs of the mixed threads are plotted. The three bars on the left of each workload represent the IPCs without runahead execution, while the right three bars are the IPCs with runahead. In general, runahead improves IPCs in both single-thread and multithread modes, with more significant improvement in the SMT mode than in the single-thread mode.


Among the three groups, workloads in the second group benefit the most. With runahead, their combined IPCs on SMTs are consistently higher than the IPC of each individual thread. This is consistent with the results in Figure 4-2, where workloads in the second group show the largest increases in the average memory access time. Although runahead on SMTs also shows much higher improvement for the other two groups of workloads, their resulting IPCs still fall between the IPCs of the two individual threads. Therefore, we compare IPC improvements using the weighted speedup as suggested in [Tullsen and Brown 2001] with various cache sizes, where IPC_new and IPC_baseline are measured with and without runahead, respectively:

    \text{Weighted Speedup} = \sum_{\text{threads}} \frac{IPC_{new}}{IPC_{baseline}}

[Figure 4-4. Instructions per cycle with/without runahead on simultaneous multithreading. For each workload pair, the IPCs of Thread0, Thread1, and the combined execution are plotted, without and with runahead.]

4.4.2 Weighted Speedups on Multithreading Processors

Figure 4-5 shows the weighted IPC speedups of runahead execution on SMTs. In general, significant performance improvement can be observed as long as the cache size is not very large. Because there are very few misses with 8MB or 16MB caches, runahead execution is not effective in overlapping the scattered cache misses. Similarly, since Gap/Bzip has a very small combined working set, runahead execution is ineffective for it at all cache sizes. Among the eight workloads, it is unexpected that Gap/Mcf displays the highest speedup. Cache contention in Gap/Mcf should not be as severe as in the workloads of the first group, since Gap has the lowest L2 miss penalty among the selected programs. However, runahead execution not only exploits MLP for Mcf, it also releases


resources to unblock Gap from the frequent L2 misses of Mcf. Other workloads show performance benefits of various degrees from runahead execution with small/median caches. Because of heavier cache misses for workloads in the first group, their weighted speedup is generally higher than that of workloads in the second group.

[Figure 4-5. Weighted speedup of runahead execution on simultaneous multithreading, plotted against L2 cache sizes from 128KB to 16MB for the eight workload pairs.]

[Figure 4-6. Average memory access time ratio of runahead execution on simultaneous multithreading, plotted against L2 cache sizes from 128KB to 16MB for the eight workload pairs.]


The improvement in the average memory access time from runahead execution on SMTs is shown in Figure 4-6, where the ratio is the average memory access time with runahead over that without runahead. Significant drops in the average memory access time are evident for all workloads except with 16MB caches. It is interesting to observe that, because of differences in working sets, the significant jumps in the memory access time ratios occur from 2MB to 8MB for workloads in the first group, but from 512KB to 2MB for workloads in the second group. As expected, Gap/Bzip does not benefit as much from runahead due to its small working set.

[Figure 4-7. Weighted speedup of runahead execution between two threads running in the simultaneous multithreading mode and running separately in a single-thread mode, plotted against L2 cache sizes from 128KB to 16MB.]

Recall that the overall IPC improvement with runahead execution comes both from exploiting MLP and from better sharing of other resources. Therefore, minor discrepancies between the average memory access time improvement and the overall IPC improvement can be expected. For example, the average memory access time improvement of Gap/Mcf is less than that of the workloads in the first group; however, Gap/Mcf displays significantly more IPC improvement (Figure 4-5). Similarly, Twolf/Mcf has worse improvement in memory access time


than the other two workloads in the first group, but its IPC improvement is the highest among the three.

Figure 4-7 shows the weighted speedups of SMTs with runahead over single-thread execution with runahead. The advantages of runahead execution in the SMT mode are clearly displayed for workloads in both the second and the third groups. However, workloads in the first group still experience negative speedups when cache sizes are 1MB or bigger. In comparison with the negative speedups without runahead execution (Figure 4-1), runahead execution helps to pull the negative speedups in the positive direction. For 4MB caches especially, the weighted speedups are improved from 0.61, 0.60, and 0.44 without runahead to 0.90, 0.71, and 0.70 with runahead for Twolf/Mcf, Twolf/Art, and Art/Mcf, respectively. Because of the very poor SMT performance due to cache and other resource contention, runahead execution can alleviate, but cannot overcome, the huge loss from running two threads in the SMT mode.

The weighted speedup in Figure 4-7 is a combination of two factors: the benefit of runahead execution when two threads run together vs. run separately, and the impact of SMT itself. In order to separate the effects of runahead execution from the effects of SMT execution, we define a new Weighted Speedup Ratio. The basic idea is to calculate, for each individual thread, the ratio between its runahead speedup in the SMT mode and its runahead speedup in the single-thread mode, averaged over the N threads:

    \text{Weighted Speedup Ratio} = \frac{1}{N} \sum_{\text{threads}} \frac{IPC_{SMT,runahead} / IPC_{SMT,norunahead}}{IPC_{Single,runahead} / IPC_{Single,norunahead}}
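For clarity, the metric can be computed directly from the four IPC measurements per thread; the following C sketch (illustrative names only, not simulator code) shows the calculation.

    /* Weighted speedup ratio: for each thread, the speedup that runahead
     * delivers in SMT mode divided by the speedup it delivers in
     * single-thread mode, averaged over all threads. A value above 1
     * means runahead helps more under SMT than it does stand-alone. */
    double weighted_speedup_ratio(const double smt_ra[], const double smt_nora[],
                                  const double single_ra[], const double single_nora[],
                                  int num_threads)
    {
        double sum = 0.0;
        for (int i = 0; i < num_threads; i++)
            sum += (smt_ra[i] / smt_nora[i]) / (single_ra[i] / single_nora[i]);
        return sum / num_threads;
    }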


Gap/Mcf has the highest overall IPC speedup (Figure 4-5). Similarly, Twolf/Mcf shows higher overall speedups comparing with other workloads in the first group due to more benefit of runahead execution in the SMT mode. Two workloads Vpr/Gcc and Twolf/Gcc from the second group exhibit negative improvement with tiny ca ches. Recall that programs in this group have moderate L2 cache penalties. With unrealistica lly small caches, cache misses can increase to the point that runahead execution becomes ve ry effective in the single-thread mode. 0.5 1 1.5 2 2.5 3 128k256k512k1m2m4m8m16mL2 cache sizeSpeedup Ratio Gap/Bzip Gap/Mcf Parser/Vpr Vpr/Gcc Twolf/Gcc Twolf/Art Twolf/Mcf Art/Mcf Figure 4-8. Ratios of runahead speedup in si multaneous multithreading mode vs. runahead speedup in single-thread mode 4.5 Related Work SMT permits the processor to issue multiple inst ructions from multiple threads at the same cycle [Tullsen et al. 1995; Tullsen et al. 1996]. The scheduling strategy based on the instruction count (ICOUNT) of each active thre ad regulates the fetching policy to prevent any thread from taking more resources than its fair share [T ullsen et al. 1996]. Other proposed methods [ElMoursy and Albonesi 2003; Cazo rla et al. 2004a; Cazorla et al. 2004b] attempt to identify 81


threads that will encounter long-latency opera tions (L2 cache miss). The thread with longlatency operation may be delayed to pr event it from occupying more resources. Serious cache contention problems on SMT processors were reported [Tullsen and Brown 2001]. Instead of keeping the thr ead that involves long-latency lo ad ready to immediately begin execution upon return of the loaded data, they proposed methods to identify threads that are likely stalled and to free all resources associated with those threads. A balance scheme was proposed [Cazorla et al. 2003] to dynamically swit ch between flushing and keeping long-latency threads to avoid overhead of flushing. The optimal allocation of cache memory between two competing threads was studied [Stone et al. 1992]. Dynamic partitioning of sh ared caches among concurrent threads based on marginal gains was reported in [Suh et al. 2001] The results showed that significantly higher hit ratios over the global LRU replacement could be achieved. Runahead execution was first proposed to improve MLP on single-thread processors [Dundas and Mudge 1997; Mutlu et al. 2003]. Effec tively, runahead execu tion can achieve the same performance level as that with a much bigger instruction window. We investigate and evaluate runahead execution on SMT processors with multiple threads running simultaneously. It is our understanding that this is the first work to apply runahead execution on SMT processors to tolerate shared resource contentions. Besides th e inherent advantage of memory prefetching, runahead execution can also prev ent a thread with long latency loads from occupying shared resources and impeding other thread s from making forward progress. 4.6 Conclusion Simultaneous Multithreading technique has been moved from laborator y ideas into real and commercially successful processors. Howe ver, studies have show n that without proper 82


mechanisms to regulate the shared resources, especially shared caches and the instruction window, multiple threads show lower overa ll performance when running simultaneously. Runahead execution, proposed initially for achie ving better performance for single-thread applications, works very well in the multiple -thread environment. In runahead execution, multiple long-latency memory operations can be discovered and overlapped to exploit the memory-level parallelism; meanwh ile, shared critical recourses held by the stalling thread can be released to keep the other thread running smoot hly to exploit the thread-level parallelism. Performance evaluations have demonstrated that up to 4-5 times the speedups are achievable with runahead executions on SMT environments. 83


CHAPTER 5 ORDER-FREE STORE QUEUE USING DE COUPLED ADDRESS MATCHING AND AGE-PRIORITY LOGIC 5.1 Introduction Timely handling correct memory dependences in a dynamically scheduled, out-of-order execution processor has posted a long-standing ch allenge, especially when the instruction window is scaled up to hundreds or even thousands of instructi ons [Sankaralingam et al. 2003; Sankaralingam et al. 2006; Sethumadhavan et al. 2007]. Two types of queues are usually implemented in resolving memory dependences. A Store Queue (SQ) records all in-flight stores for determining store-load forwarding and a Load Queue (LQ) records all in-flight loads for detecting any memory dependence violation [K essler 1999; Hinton et al. 2001]. There are two fundamental challenges in enforcing correct memo ry dependences. The first one is to forward values from the youngest older in -flight store w ith matched address to a dependent load. In a conventional processor, this is implemented by fo rcing stores enter the SQ in program order and finding the parent store using expensive fully-associative search. The second challenge is to maintain correct memory dependence when a load is issued but early st ore addresses have not been resolved. Speculation based on memory depe ndence prediction or other aggressive methods [Adams et al. 1997; Hesson et al. 1997; Chryso s and Emer 1998; Kessler 1999; Yoaz et al. 1999; Hinton et al. 2001; Subramaniam and Loh 2006] enab les the load to proceed without waiting for the early store addresses. Any offending load th at violates the depende nce must be identified later by searching the LQ and cau ses a pipeline flush. In a conven tional processor, the program order and fully-associative search are also re quired in the LQ for identifying any memory dependence violation by early executed loads when an older store is executed [Sha et al. 2005]. There have been many proposals for improvi ng the scalability of the SQ and LQ [Moshovos et al 1997; Park et al. 2003; Sethuma dhavan et al. 2003; Roth 2004; Srinivasan et al. 84


2004; Cristal et al. 2005; Sethumadhavan et al. 2006; Sha et al. 2005; Stone et al. 2005; Torres et al. 2005; Castro et al. 2006; Ga rg et al. 2006; Sha et al. 2006 ; Subramaniam and Loh 2006]. In this work, we focus on an efficient SQ design for store-load forwarding. Since store addresses can be generated out of program order, the SQ cannot be partitioned into smaller address-based banks while maintaining program order in each bank for avoiding fully-associative searches in the entire SQ. Among many proposed solution s for scalable SQ [Akkary et al. 2003; Sethumadhavan et al. 2003; Gandi et al. 2005; Sha et al. 2005; Torres et al. 2005; Baugh and Zilles 2006; Garg et al. 2006; Sethumadhavan et al. 2007], two recent approaches are of great significance and related to our proposal. The first approach is to accept potential unordered stores and loads by speculati vely forwarding the matched latest store based on the execution sequence, instead of the correct last store in program order [Gandi et al. 2005; Garg et al. 2006]. The second approach is to allow an unordered banked SQ indexed by store address, but record the correct age along with the store address [Sethumadha van et al. 2007]. Sophisticated hardware can match the last store according to the age of the load without requiring the stores to be ordered by their ages in the SQ. Our simulation results show that a significant amount of mismatches exist between the latest and the last stores for dependent load s that causes expensive re-executions. Furthermore, there is at most one matching parent store in the SQ even though multiple stores could have the same address. R ecording each store and age pair complicates the age priority logic and may become the source of conflicts with limited ca pacity in each SQ bank. In this work, we introduce an innovative SQ design that decouples the store/load address matching unit and its corresponding age-order priori ty encoding logic from the original SQ. In our design, renamed/dispatched st ores address and data enter a SQ RAM array in program order. Instead of relying on fully-associative s earches in the entire SQ, a separate SQ directory is 85


maintained for matching the store addresses in th e SQ. A store enters the SQ directory when its address is available. Only a single entry is allo cated in the SQ directory for multiple outstanding stores that have the same address. Each entr y in the SQ directory is augmented with an age-order vector to indicate the correct locations ( ages) of multiple stores with the same address in the SQ RAM. The width of the age-order vector is equal to the size of the SQ RAM. When a store is issued, a directory entry is crea ted if it does not alr eady exist. The corresp onding bit in the ageorder vector is turned on based on the location of the store. When a load is issued, an address match to an entry in the SQ directory triggers the age-order priority l ogic on the associated ageorder vector. Based on the age of the load, a simp le leading-one detector can locate the youngest store that is older than the load for data forwarding. With the ag e-order vector, the store address is free from imposing any order in the SQ dir ectory. Consequently, the decoupled SQ directory can be organized as a set-associative structure to avoid fully-associative searches. Besides the basic decoupled SQ without considering the data alignment, we further extend the design to handle partial stores and loads by using byte masks to identify which bytes within an 8-byte range are read or written. We also include th e memory dependence resolution for the misaligned stores and loads that cross the 8-byte boundary. The decoupled address matching and age-orde r priority logic pr esents significant contributions over the existing full-CAM SQ or ot her scalable SQ designs. First, because the size and configuration of the decoupled SQ directory are independent of the program-ordered SQ RAM, it provides new opportunities to optimize th e SQ directory design for locating the parent store. Second, the relaxation of the program-order requirement in the SQ directory using a detached age-order vector helps to abandon the fu lly-associative search that is the key obstacle for a scalable SQ. Third, a store needs not be present in the SQ directory until its address 86


becomes available. As reported in [Sethumadhavan et al. 2007], a significant amount of renamed stores do not have their address available and hence need not occupy any SQ directory space. Fourth, our evaluation shows that on average, close to 30% of the executed stores have duplicated store addresses in SQ. The SQ direct ory only needs to cover the unique addresses of the issued stores, and hence the SQ directory si ze can be further reduced. Moreover, by getting rid of duplicated addresses in the SQ directory, the potential set (bank) conflict can also be alleviated since the duplicated store addresses must be located in the same set causing more conflicts. Fifth, a full-CAM directory is inflexible fo r parallel searches that are often necessary in handling memory dependences for stores and loads misaligned across the 8-byte entry boundary. The set-associative (ban ked) SQ directory, on the other ha nd, permits concurrent searches on different sets and hence can eliminate the need to duplicate the SQ directory. Lastly, we believe this is the first proposal that correctly acc ounts and handles frequently occurring partial and misaligned stores and loads, such as in x86 arch itecture, while previous proposals have not dealt with this important issue. Th e performance evaluation results show that the decoupled SQ outperforms the latest-store forwarding scheme by 20-24%, while outperforms the late-binding SQ by 8-12%. In comparison with an expensive full-CAM SQ, the decoupled SQ only loses less than 0.5% of the IPC. 5.2 Motivation and Opportunity In this section, we demonstrate that multip le in-flight stores with the same address constantly exist in the instru ction window and they present si gnificant impact on the SQ design. We also show the severity of mismatches between the latest and the last store to dependent load. The simulation is carried out on PTLsim [Yourst 2007] running SPEC workloads. We simulated a 512-entry instruction window with unlimited LQ/SQ. 87


0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%5 1 5 2 5 3 5 4 5 5 5 6 5 7 5 8 5 9 5 105 115 125 135 145 155 165 1 7 5 185 1 9 5 2 0 5 2 15 2 25 2 35 2 45 >25 5Accumulated store and store address Accumulated store addresses Accumulated stores Total outstanding stores and store addresses Figure 5-1. Accumulated percen tage of stores and store addresses for SPEC applications Figure 5-1 plots the accumulated percentages of stores and unique store addresses with respect to the total number of outstanding stores and store addr esses for the twenty simulated SPEC2000 applications. The statistics is collected after each store is issued and its address is available. An infinite SQ is maintained to re cords all outstanding store addresses. Each address has an associated counter to track the number of stores that have the same store address. After the store is issued, a search through the SQ is car ried out. Upon a hit, th e counter of the matched store address is incremented by 1. In case of a miss, the new address is inserted into the SQ with the counter initiated to 1. Meanwhile, the total unique store ad dresses and the total outstanding stores are counted after each issu ed store. The total num ber of unique store addresses is equal to the size of the SQ, while the out standing stores are the summation of the counter for each store address in the SQ. These two numbers indicate the SQ size required fo r recording only the unique store addresses or all individual stores in the SQ. Once a store is committed, the counter associated with the store address in the SQ is decremented by 1. The address is removed from the 88


SQ when the counter reaches to 0. When a branch mis-prediction occurs, all stores younger than the branch are removed from the SQ. The resulting curves reveal two important messa ges. First, there is a substantial difference between the two accumulated curves. For example, given a SQ size of 64, 95% of the stores can insert their addresses into the SQ if no duplicat ed store address is allowed in the SQ. On the other hand, only 75% of the issued st ores can find an empty slot in the SQ if all stores regardless their addresses must be recorded in the SQ. Insufficient SQ space causes pipeline flush and degrades the overall IPC. Second, with 512-entr y instruction window, th e SQ must be large enough to avoid constant overflow. For example, to cover 99% of the stores, the per-store based SQ needs 195 entries, while the per-address base d SQ requires 120 entries. The required large CAM-based SQ increases the design comple xity, access delay, and power dissipation. 0 10 20 30 40 50 60 70a m mp a pplu ap s i a r t b zip2 e on e q u ak e facer e c fma3 d gz i p l u c as mcf me s a mg r id parser pe r lb mk six t r a c k s wim vp r w up w i s eAverage Number of Stores and Addresses Average unique store address Average store instruction Figure 5-2. The average number of st ores and unique store addresses Figure 5-2 shows the average number of stores and unique store addresses for each application throughout the simulation. On average, the number of stores is considerably larger than the number of store addresse s by 30%. Among the applications, only Gzip Mcf and Mgrid 89


have rather small difference. The figures impl y there are better SQ solutions which records unique store addresses fo r store-load forwarding. Figure 5-3 shows the mismatches between the la test store in execution order and the last store in program order when dependent load is issued. To collect the mismatch information, we simulated two 64-entry CAM-based SQs. One is ordered by the execution order for detecting the latest matched store, and the ot her is ordered by program order for detecting the last matched store. Note that we consider the latest matches the last if the pare nt store address is unavailable. The mismatch is very significant. On average, about 25% of the forwarded latest stores are incorrect and cause costly re-executions. 0% 20% 40% 60% 80% 100%a m m p ap p lu apsi art bzip2 e on e qu a ke fa c er ec fma3 d gz i p lu c as m cf m e sa m g rid pa rs er p erlbmk s ixtrack sw i m v p r wup wi seMismatches Between Latest and Last Store latest matches last latest not matches last Figure 5-3. Mismatches between the latest and the last store for dependent load 5.3 Order-Free SQ Directory with Age-Order Vectors In this section, we first describe the basic design of the decoupled SQ without considering data alignment. The basic scheme is then extend ed to handle partial stores and loads that are commonly defined in many instruction set archit ectures. Handling misaligned stores and loads that cross the 8-byte boundary is presented at the end. 90


5.3.1 Basic Design without Data Alignment The decoupled SQ consists of three key components, the SQ RAM array, the SQ directory and the corresponding age-order vector as shown in Figure 5-4. The SQ RAM buffers the address and data for each store until the store commit. Sim ilar to the conventional SQ, a store enters the SQ RAM when it is renamed and is removed after commit in program order. The store data in each SQ RAM entry is 8-byte wide with eight associated ready bits to indi cate the readiness of the respective byte for forwardi ng. The SQ RAM is organized as a circular queue with two pointers where the head points to the oldest store and the next points to the next available location. When a store is renamed, the next location in the SQ RAM is reserved. Figure 5-4. SQ with decoupled a ddress matching and age priority The age of the store is defined by its pos ition in the SQ RAM (denoted as SQ_age ), which is saved in the ROB. Upon a store commit, the SQ_age can locate the store from the SQ RAM for putting away the data into the cache. The pipeline stalls when the SQ RAM is full. Since the SQ directory is decoupled from the SQ RAM, the SQ RAM size is independent of the SQ directory size and imposes minimum impact on sear ching for the parent store. When a load is 91


renamed, the SQ_age of the last store in the SQ is r ecorded along with the load in the LQ When the load is issued later, the SQ_age is used for searching the parent store Organized similar to the conventional cache directory, the SQ directory records the store addresses for matching dependent load address in store-load forwarding. Instead of keeping a directory entry for every individual store in the SQ RAM, the SQ directory records a single address for all outstanding stores with the same address. Since a store is recorded in the SQ directory after the store has been issued and its address is available, the SQ directory can be partitioned into multiple banks (sets) based on the store address to avoid fully-associative searches. The age-order vector is a new structure that provides the ordered SQ RAM locations for all stores with the same address. Each address entry in the SQ directory has an associated ageorder vector. The width of the vect or is equal to the size of the SQ RAM. When a store address is available, the SQ directory is searched for record ing the new store. If there is a match, the bit in the corresponding age-order vector indexed by the SQ_age of the current stor e is set to indicate its location in the SQ RAM. If the store address is not already in the SQ directory, a new entry is created. The corresponding bit in th e age-order vector is turned on in the same way as when a match is found. In case there is no empty entry in the SQ directory, the store is simply dropped and cannot be recorded in the SQ directory. C onsequently, the dependent load may not see the store and causes re-execution. Si nce stores are always recorded in program order in the SQ RAM, imprecision in recording the in-flight st ore addresses in the directory is allowed. When a load is issued, a search through th e SQ directory determines proper store-load forwarding. If there is no hit, the load proceeds to access the cache. If th ere is a matched store address in the SQ directory, th e corresponding age-order vector is scanned to locate the youngest 92


older store for the load. The search st arts from the first bit defined by the SQ_age of the load and ends with the head of the SQ RAM. A simple l eading-one detector is used to find the closest (youngest) location where the bit is set indi cating the location of the parent store. Two enhancements are considered in shorteni ng the critical timing in searching for the parent store. First, the well -known way-prediction technique enables the SQ directory lookup and the targeted age-order vector scanning in pa rallel. By establishing a small but accurate way history table, the targeted age-order vector can be predicted and scanned before the directory lookup result comes out. Second, the delay of th e leading-one detector is logarithmically proportional to the width of the ag e-order vector. Given the fact th at a majority parents can be located without searching the enti re SQ RAM, only searching with in a partial age-order vector starting from the SG_age may be enough to catch the correct parent store. The accuracy of these enhancements will be evaluated in Section 5.4. When a store commits, its SQ_age is used to update the SQ directory and the age-order vector. The bit position of a ll age vectors pointed by the SQ_age is reset. When all bits in a vector are reset, the entry in th e SQ directory is freed. The store also retires from the SQ RAM. When a mis-predicted branch occurs, all stores af ter the branch can be re moved from the SQ in a similar way. The last SQ_age is saved in the ROB with a branch when the branch is renamed. When a mis-prediction is dete cted, all entries younger than SQ_age in the SQ RAM are emptied. Meanwhile, all columns in all age-order vector s that correspond to the removed entries from the SQ RAM are reset. When all bits in any ageorder vector are reset, the corresponding entry in the SQ directory is freed. A simple example is illustrated in Figure 5-4. Assume that a sequence of memory stores are dispatched and recorded in the SQ RAM. Among them, the firs t, the third, the sixth, and the 93


eleventh stores have the same address A These four requests may be issued out of order, but eventually are recorded in the SQ directory and the associated age-order vector. Since all four stores have the same address A they only occupy a single entry in the SQ directory indexed by certain lower-order bits of A The corresponding age-order vector records the locations of the four stores in the SQ RAM by setting the corres ponding bits as shown in the figure. Assuming a load A is finally issued with the SQ_age of 7 as indicated in box 1, it finds an entry with matched address in the SQ directory. The priority logic uses the SQ_age of the load and the age-ordered vector associated with A to locate the parent st ore at location 5 in the SQ RAM. In this figure, we also illustrate an example when store B commits as indicated in box 2. The SQ_age of B from the ROB is used to reset the corresponding pos ition in all age-order vectors. The address of B can be freed from the SQ directory when the ageorder vector contains a ll zeros. Given the fact that each store in the SQ RAM cannot have two addr esses, at most one vect or can have in the SQ_age position. Therefore, at most one directory en try can be freed for each committed store. 5.3.2 Handling Partial Store/Load with Mask In the basic design, we assume loads and stor es are always aligned within the 8-byte boundary and access the entire 8 bytes every time Realistically, partial loads/stores are commonly encountered. The address of partial loads and stores is always aligned in the 8-byte boundary. An 8-bit mask is used to indicate th e precise accessed bytes. The decoupled order-free SQ can be extended to handle memory dependence detection and forwarding for partial stores/loads. The age-order vector for each store address in the SQ directory is expanded to 8 vectors; each covers one mask bit in the 8-bit ma sk. If a load address matches a store address in the SQ directory, the leading-1 detector finds th e youngest stores older th an the load for each individual valid mask bit of the load. In other words, each age-order vect or identifies the parent store for each byte of the load. If the found younges t older store covers all of the bytes of the 94


load, store-load forwarding is detected. Otherwise, unless forwarding from multiple stores are permitted, the load cannot proceed until the youngest store which updates a subset of the load bytes commits and puts the store data away into the cache. Figure 5-5. Decoupled SQ with Partial Store/load using Mask In Figure 5-5, the example from Figure 5-4 is modified to illustrate how the decoupled SQ works with partial loads and stores. Now, the fi rst (A,0) and the sixth (A,5) stores are partial stores, where (A,0) stores bytes 2 and 3, and (A,5) stores bytes 0, 1, 2 and 3 as indicated by the 8-bit masks. Assume that a partial Load A with SQ_age=7 is issued. To illustrate forwarding detection, we assume the load issued twice; one with a mask of and the other with for loading bytes 4, 5, 6, and 7, and bytes 2, 3, 4, and 5, respectively. A matched 95
A matching address A is found in the SQ directory for the load. For the first load, the subsequent searches in the corresponding age-order vectors find that the youngest older store is (A,2), which covers all the bytes of the load; as a result, the load gets the forwarded data from the SQ RAM. The second load, however, has two parent stores: (A,2) produces bytes 4 and 5, and (A,5) produces bytes 2 and 3 for the load. Unless a merge from multiple stores is permitted for forwarding, the second load is stalled until (A,5) retires.

When a store retires, the SQ directory and the age-order vectors are updated the same way as without the mask. However, since each store may write only some of the bytes, instead of detecting all zeros in a single age-order vector, all eight age-order vectors associated with the 8 mask bits must be all zeros before the corresponding directory entry can be freed.
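The per-byte dependence check can be summarized with the following C++ sketch. The structure and names are our own illustration; following the policy above, a load whose bytes would have to be merged from multiple stores, or from a store plus the cache, is conservatively stalled.

```cpp
#include <algorithm>
#include <array>
#include <bitset>
#include <cstdint>
#include <optional>

// Bounded leading-one search (as in the earlier sketch): youngest set bit
// older than the load, scanning at most 48 positions.
std::optional<int> find_parent(const std::bitset<64>& vec, int load_sq_age) {
    for (int i = load_sq_age - 1; i >= std::max(0, load_sq_age - 48); --i)
        if (vec.test(i)) return i;
    return std::nullopt;
}

// One age-order vector per byte lane of the 8-bit mask.
struct DirectoryEntry {
    std::array<std::bitset<64>, 8> byte_vectors;
};

enum class LoadAction { AccessCache, ForwardFromSQ, Stall };

// Decide what a partial load may do once its address hits in the SQ
// directory: forward only when a single store produces every loaded byte.
LoadAction check_partial_load(const DirectoryEntry& e, uint8_t load_mask,
                              int load_sq_age) {
    std::optional<int> parent;     // unique parent seen so far, if any
    bool byte_from_cache = false;  // some loaded byte has no older store
    for (int b = 0; b < 8; ++b) {
        if (!(load_mask & (1u << b))) continue;  // byte not accessed
        auto p = find_parent(e.byte_vectors[b], load_sq_age);
        if (!p) { byte_from_cache = true; continue; }
        if (parent && *parent != *p) return LoadAction::Stall;  // two parents
        parent = p;
    }
    if (!parent) return LoadAction::AccessCache;  // no older store at all
    return byte_from_cache ? LoadAction::Stall : LoadAction::ForwardFromSQ;
}
```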


5.3.3 Memory Dependence Detection for Stores/Loads Across the 8-byte Boundary

Intel's x86 ISA permits an individual store or load to cross an 8-byte aligned boundary, referred to as a misaligned store or load. Misaligned loads and stores complicate memory dependence detection and data forwarding. When a misaligned load is issued, a simple but costly solution is to stall until all stores ahead of it are drained from the SQ. To avoid this costly wait for misaligned loads that are independent of any outstanding store, duplicated full CAMs for the SQ are implemented for detecting the dependences, as described in [Abramson et al. 1997]. When a misaligned store is issued, its starting aligned address and the byte mask are stored in the SQ along with an additional overflow bit to indicate that the store spills out of range. In this case, a load falling into the next 8-byte range of the misaligned store would miss this store during the search if there were only one CAM. Therefore, two SQ CAMs are needed to search both the address of the load (load) and the adjacent lower 8-byte address (load-8) in parallel. The decremented load-address CAM will match the store address, and the overflow bit will indicate to the forwarding logic that there is a misaligned hit. The load is stalled until the misaligned store retires and its data is stored into the cache. For handling misaligned loads, a third SQ CAM is needed to provide a parallel search of the next higher 8-byte address (load+8) for potential misaligned hits.

The banked (set-associative) order-free SQ directory provides another distinct advantage in detecting misaligned store/load dependences. By using the lower-order bits of the aligned address to select the bank (set), the three addresses (load-8), (load), and (load+8) of a misaligned load are likely to be located in different banks. Hence, the costly duplication of the CAM for parallel searches can be avoided.

When load A is issued, there are several cases in dependence detection and data forwarding, summarized in the sketch below. If load A does not cross the 8-byte boundary, two searches, for store A and store A-8, are carried out in the SQ directory. If A hits but A-8 misses, or both A and A-8 hit but store A-8 does not cross the 8-byte boundary, the search for the youngest parent store of load A follows the algorithm described in Section 5.3.2. If A-8 hits and store A-8 crosses the 8-byte boundary, a misaligned hit is detected, and load A must stall until store A-8 retires and puts its data away into the cache. If A also hits, load A is stalled until both of the stores ahead of the load retire and their data is stored into the cache. If none of the above conditions is true, load A proceeds to access the cache.

If load A crosses the 8-byte boundary, three searches, for store A, store A-8, and store A+8, must be performed in the SQ directory. Any hit on A or A+8, or a hit on A-8 with the overflow bit set, indicates a misaligned hit, and load A is stalled until all of these stores ahead of it retire and their data is stored into the cache. Otherwise, load A proceeds to access the cache.
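The following C++ sketch condenses this case analysis. The directory probe dir_hit and the overflow test store_overflows are stub stand-ins of our own, not a real interface; an actual design would perform the probes in parallel across the banked directory.

```cpp
#include <cstdint>

// Stubs standing in for a banked SQ directory probe and the per-store
// overflow bit recorded with a misaligned store.
bool dir_hit(uint64_t) { return false; }          // address present in directory?
bool store_overflows(uint64_t) { return false; }  // matching store spills over?

enum class LoadAction { AccessCache, SearchParent, StallForMisalignedStore };

LoadAction issue_load(uint64_t addr, bool load_crosses_boundary) {
    const uint64_t a = addr & ~7ull;  // 8-byte-aligned load address
    if (load_crosses_boundary) {
        // Three probes: any hit on A or A+8, or a hit on A-8 whose store
        // spills over, is a misaligned hit.
        if (dir_hit(a) || dir_hit(a + 8) ||
            (dir_hit(a - 8) && store_overflows(a - 8)))
            return LoadAction::StallForMisalignedStore;
        return LoadAction::AccessCache;
    }
    // Aligned load: probe A and A-8. A spilling store at A-8 forces a
    // stall whether or not A also hits.
    if (dir_hit(a - 8) && store_overflows(a - 8))
        return LoadAction::StallForMisalignedStore;
    if (dir_hit(a)) return LoadAction::SearchParent;  // Section 5.3.2 search
    return LoadAction::AccessCache;
}
```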


5.4 Performance Results

We modified the PTLsim simulator to model a cycle-accurate, full-system x86-64 microprocessor. We followed the basic PTLsim pipeline design, which has 13 stages (1 fetch, 1 rename, 5 frontend, 1 dispatch, 1 issue, 1 execution, 1 transfer, 1 writeback, 1 commit). Note that the 5-cycle frontend stages are functionless and were inserted to more closely model realistic pipeline delays in the front-end stages. In this pipeline design, there are 7 cycles between a store entering the SQ RAM in the rename stage and entering the SQ directory in the issue stage. When a memory dependence violation is detected for an early load, the penalty of re-dispatching the replayed load is 2 cycles (from dispatch to execution); any uops after this load in program order are also re-dispatched. To reduce memory dependence violations, a Load Store Alias Predictor (LSAP) is added [Chrysos and Emer 1998]. This fully-associative, 16-entry LSAP records the loads that were mispredicted in the recent history. If there is any unresolved store address ahead of a load in the SQ when the load is issued, the LSAP is looked up; if a match is found, the load is delayed until the unresolved store address is resolved. A similar method of memory aliasing prediction is used by the Alpha 21264 [Kessler 1999].
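A minimal sketch of such an alias predictor is shown below; the table organization and names are our assumptions for illustration, not PTLsim's actual implementation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// 16-entry, fully-associative table of load PCs that recently suffered
// memory-ordering violations.
struct LSAP {
    std::array<uint64_t, 16> pcs{};  // load PCs with recent mispredictions
    std::size_t next = 0;            // simple FIFO replacement (assumed)

    bool contains(uint64_t pc) const {
        for (uint64_t p : pcs)
            if (p == pc) return true;
        return false;
    }
    void record_violation(uint64_t pc) {  // called when a load is replayed
        if (!contains(pc)) { pcs[next] = pc; next = (next + 1) % pcs.size(); }
    }
};

// A load may issue past unresolved older store addresses unless the LSAP
// predicts it will alias; a predicted-aliasing load waits for resolution.
bool may_issue_load(const LSAP& lsap, uint64_t load_pc,
                    bool unresolved_older_store) {
    return !unresolved_older_store || !lsap.contains(load_pc);
}
```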


The x86 architecture is known for its relatively widespread use of unaligned memory operations. In PTLsim, once a given load or store is known to have a misaligned address, it is preemptively split into two aligned loads or stores at decode time. PTLsim does this by initially causing all misaligned loads and stores to raise an internal exception that forces a pipeline flush. At this point, a special misaligned bit is set for the problem load or store in PTLsim's internal translated basic block representation. When the offending load or store is encountered again, it is split into two aligned loads or stores early in the pipeline, and the split loads and stores are merged later in the commit stage. In our simulation, we followed PTLsim in handling misaligned loads and stores. The simulation results of all applications show that pipeline flushes due to misaligned stores and loads happen very infrequently, for only about 0.5% of the total stores and loads.

To gain insight into the impact of the various store queue implementations, we stretch other out-of-order pipeline parameters [Akkary et al. 2003] as summarized in Table 1-2. Other detailed parameter settings can be found in the original PTLsim source code. SPEC 2000 integer and floating-point applications are used to drive the performance evaluation. We skip the initialization part of the workloads and collect statistics from the next 200 million instructions.

Four categories of SQ design, seven configurations in total, are evaluated and compared: the traditional full CAM, latest-store forwarding, the late-binding SQ, and the decoupled SQ. We evaluated 32- and 64-entry full CAMs, denoted Conventional 32-CAM and Conventional 64-CAM; latest-store forwarding using a 64-entry full CAM, denoted LS 64-CAM; the late-binding SQ with a 4×8 directory or an 8×8 directory recording address/age pairs, together with a 64-entry SQ RAM, denoted LB 4×8 64-RAM and LB 8×8 64-RAM; and lastly the decoupled SQ with a 4×8 directory or an 8×8 directory, also with a 64-entry SQ RAM, denoted Decoupled 4×8 64-RAM and Decoupled 8×8 64-RAM. The notation a×b represents the configuration of the SQ directory, where a is the number of sets and b is the set associativity. In the decoupled SQ, we simulated a 48-bit leading-1 detector for finding the youngest parent. We also implemented a way predictor with a 256-entry prediction table using a total of 96 bytes: when a load is issued, the way is predicted, and if a misprediction is detected from the address comparison, the way prediction table is updated and the load is re-issued with a 2-cycle delay. We do not consider forwarding from multiple stores.
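As a small illustration of the a×b notation, the sketch below shows one plausible set-selection function for the banked SQ directory; the exact index bits are our assumption.

```cpp
#include <cstdint>

// An a x b directory: a sets (banks), b ways per set.
struct DirectoryConfig {
    unsigned sets;  // a: number of sets
    unsigned ways;  // b: set associativity
};

unsigned select_set(uint64_t addr, const DirectoryConfig& cfg) {
    // Drop the 3 byte-offset bits of the 8-byte-aligned address, then use
    // the lowest remaining bits to pick the set: 2 bits for a 4x8
    // directory, 3 bits for an 8x8 directory.
    return static_cast<unsigned>((addr >> 3) % cfg.sets);
}
// Within the selected set, only cfg.ways address comparators are needed
// (an 8-way compare here) instead of a CAM search over the entire queue.
```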


We simplified the latest-store and late-binding forwarding schemes. A full CAM is used for latest-store forwarding; instead of maintaining stores in program order as in a conventional full CAM, stores are maintained in execution order for searching the latest store. We do not implement the Bloom filter and other techniques in evaluating the late-binding scheme. If the SQ directory is full, a store is simply dropped without being recorded in the directory.

5.4.1 IPC Comparison

Figure 5-6 shows the IPC comparison of the seven SQ designs running the SPEC2000 programs. We can make a few observations. First, the decoupled SQ outperforms both latest-store forwarding and the late-binding SQ by a sizeable margin for most applications. On average, Decoupled 4×8 64-RAM and Decoupled 8×8 64-RAM outperform LS 64-CAM by 20.8% and 23.8% respectively. With an equal directory size of 4×8 or 8×8, they outperform their late-binding counterparts by 12.4% and 7.6%. Because the decoupled SQ records one address for all outstanding stores with the same address, it tolerates the smaller 4×8 SQ directory much better than the late-binding scheme. In fact, the decoupled SQ with the bigger 8×8 directory improves IPC by only about 2.4% compared with the 4×8 directory.

Figure 5-6. IPC comparison (IPC of each SPEC2000 application under the seven SQ designs)

Second, Decoupled 4×8 64-RAM shows better performance than Conventional 32-CAM by an average of 4.2%.
Since the SQ directory is decoupled from the SQ RAM, a 64-entry SQ RAM allows more outstanding stores than a 32-entry CAM without requiring a bigger directory for matching the store addresses. Besides having the same number of directory entries as Conventional 32-CAM, Decoupled 4×8 64-RAM achieves better IPC using only an 8-way comparator, for faster speed and lower power consumption.

Third, even though the SQ RAM size is no larger than the full-CAM size, the performance of Decoupled 8×8 64-RAM with its banked directory is close to that of the expensive Conventional 64-CAM. On average, Decoupled 8×8 64-RAM degrades IPC by less than 0.5% compared with Conventional 64-CAM.

Fourth, Applu, Mcf, Mgrid and Swim are insensitive to the different SQ designs, except for Conventional 32-CAM, because very little store-load forwarding exists in these applications regardless of the SQ design. Table 5-1 summarizes the percentage of loads that get forwarded data from stores in an ideal SQ for all the simulated applications; an ideal SQ has infinite size, and all store addresses before a load are known when the load is issued. Conventional 32-CAM shows worse performance in these applications because its smaller CAM size hinders stores from being renamed. Among all the loads that can be forwarded in an ideal SQ, the seven simulated SQ designs correctly capture 83.8%, 97.2%, 77.6%, 87.1%, 93.0%, 95.0%, and 96.4% of the ideal forwarded loads, respectively.

Table 5-1. Percentage of forwarded loads using an ideal SQ
Workload   Ammp   Applu  Apsi  Art    Bzip2   Eon      Equake    Facerec  Fma3d  Gzip
Forward %  12.7%  0.5%   8.4%  4.7%   13.0%   18.4%    5.4%      6.8%     23.2%  4.3%
Workload   Lucas  Mcf    Mesa  Mgrid  Parser  Perlbmk  Sixtrack  Swim     Vpr    Wupwise
Forward %  7.5%   0.3%   4.7%  0.1%   8.8%    14.8%    8.9%      0.0%     15.6%  7.4%

By getting rid of the duplicated stores with the same address in the SQ directory, the decoupled SQ requires a smaller directory than the late-binding SQ. Figure 5-7 plots the total number of stores dropped from entering the SQ directory per 10K instructions due to a full SQ directory.
The huge gaps in dropped stores between the late-binding and the decoupled SQ are very evident. Within each method, the 8×8 directory causes far fewer dropped stores than the 4×8 directory.

Figure 5-7. Comparison of directory-full drops in the decoupled SQ and the late-binding SQ (drops per 10K instructions)

Figure 5-8. Comparison of load re-execution (re-executions per 10K instructions)

There are various memory dependence mis-speculations that require loads to be re-executed: lack of the parent store address, stores dropped due to a full SQ directory, incorrectly identifying the youngest older store, etc. Figure 5-8 summarizes the total number of re-executed loads per 10K instructions due to mis-speculation of store-load dependences. As expected, the latest-store SQ scheme causes the largest number of re-executions, while the conventional CAM schemes cause the fewest.
Note that Conventional 32-CAM causes fewer re-executions than Conventional 64-CAM because the smaller 32-entry CAM causes more stalls at the rename stage, which effectively limits the out-of-order execution of stores and loads.

5.4.2 Sensitivity of the SQ Directory

The size and configuration of the decoupled SQ directory are flexible. In Figure 5-9, we compare four SQ directory sizes, with 2×8, 4×8, 4×12, and 8×8 configurations, decoupled from a 64-entry SQ RAM, using the average IPC of all twenty applications. Note that we increase the set associativity to 12 for the directory with 48 entries to simplify the set selection. We also run full CAMs with 16, 32, 48, and 64 entries.

Figure 5-9. Sensitivity to the SQ directory size (average IPC versus number of directory entries for the decoupled SQ and the CAM SQ)

As shown in the figure, the smaller directories of 2×8 and 4×8 with a 64-entry SQ RAM perform much better than the full-CAM counterparts of equal directory size, by 17% and 4% respectively. The advantage of the decoupled SQ is obvious, since it is much cheaper to enlarge the circular SQ RAM while keeping the associative directory small. When the directory increases to 48 entries, the performance gap becomes very narrow due to the smaller size difference between the CAM and the RAM. Even with a full 64-entry CAM that matches the SQ RAM size, the decoupled SQ loses less than 0.5% of the overall IPC.


The age distance from the load to its parent affects the timing of the leading-1 detector. Figure 5-10 shows the accumulated distribution of the search distances. With a distance of 32, half of the total distance of 64, 15 out of 20 applications can locate almost 100% of their parents, and 3 of the remaining 5 achieve 97-99% accuracy. When the search distance increases to 48, all applications but Fma3d find 100% of their parents, and Fma3d finds 98.3% of its parents at this distance. In all our simulations, the search distance is set to 48.

Figure 5-10. Sensitivity to the leading-1 detector width (accumulated distribution of search distances for all twenty applications)

Figure 5-11. Sensitivity to way prediction accuracy (accuracy per application for 64-, 128-, 192-, and 256-entry tables)


To predict the way in the SQ directory set where the parent store is located, a 256-entry way-prediction table is maintained, which reaches close to 100% accuracy for a majority of the applications, as shown in Figure 5-11. The prediction table is indexed by the lower-order 8 bits of the load and store addresses aligned on the 8-byte boundary. Each entry has 3 bits to record which way was touched most recently by a store or load, so the total size of the table is only 96 bytes. When a load is issued, the way ID is fetched from the way-prediction table to access the age-order vectors for the parent-store search. When a misprediction is identified, the correct way is recorded and the load is recycled with a 2-cycle delay. As Figure 5-11 shows, the way prediction accuracy is proportional to the table size: tables with 64 or 128 entries show very poor prediction accuracy, while significant improvement is observed with 192 and 256 entries. All simulations are based on an 8×8 decoupled SQ directory.
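A minimal sketch of this way predictor, under our own naming, follows; it captures the indexing and the 3-bit payload described above.

```cpp
#include <array>
#include <cstdint>

// 256 entries x 3 bits = 96 bytes of state, indexed by the low 8 bits of
// the 8-byte-aligned address.
struct WayPredictor {
    std::array<uint8_t, 256> way{};  // only the low 3 bits are meaningful

    static unsigned index(uint64_t addr) {
        return static_cast<unsigned>((addr >> 3) & 0xFF);
    }
    unsigned predict(uint64_t load_addr) const {  // read when a load issues
        return way[index(load_addr)] & 0x7u;
    }
    void update(uint64_t addr, unsigned touched_way) {  // touch or mispredict
        way[index(addr)] = static_cast<uint8_t>(touched_way & 0x7u);
    }
};
// On a misprediction (the address compare in the predicted way fails), the
// table is updated with the correct way and the load is recycled with a
// 2-cycle delay.
```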


5.5 Related Work

There have been many recent proposals for designing a scalable SQ that avoids the expensive and time-consuming full-CAM design. One category adopts a two-level SQ, with a smaller first-level SQ to enable fast and energy-efficient forwarding and a much larger second-level SQ to correct and complement the first level [Roth 2004; Srinivasan et al. 2004; Torres et al. 2005]. In general, the first-level SQ records the latest executed stores, and store-to-load forwarding occurs when the load address matches a store address in the first-level store queue. Accessed after a first-level miss, the bigger and slower second-level SQ records the stores that cannot fit into the first-level SQ, based on FIFO order or some prediction mechanism.

Instead of a first-level SQ, a store-forwarding cache is implemented in [Stone et al. 2005] for store-load forwarding; it relies on a separate memory disambiguation table to resolve any dependence violation. In [Garg et al. 2006], a much larger L0 cache is used to replace the first-level SQ for caching the latest store data, and a load, upon a hit, can fetch the data directly from the L0. In this approach, instead of maintaining speculative load information, the load is also executed in in-order pipeline fashion, and an inconsistency between the data in the L0 and the L1 identifies memory dependence violations. Besides the complexity and extra space, a fundamental issue in this category of approaches is the heavy mismatch between the latest store and the correct last store for a dependent load, as reported in Section 5.2. Such mismatches produce costly re-executions.

The late-binding technique enables an unordered SQ, as reported in [Sethumadhavan et al. 2007]. A load or store enters the SQ when the instruction is issued, instead of when it is renamed. To get correct store-load forwarding, the age of each load/store is recorded along with the address. When there are multiple hits in the SQ for an issued load, complicated decoding logic re-creates the order of the stores based on the recorded ages; afterwards, the search for the youngest matched store that is older than the load locates the correct parent store for forwarding. Late binding avoids the full-CAM search and allows a small banked implementation of the SQ. However, it unnecessarily records the address/age pair for every memory instruction, which requires more directory entries and intensifies banking conflicts. Furthermore, it relies on complicated age-based priority logic to locate the parent, which may lengthen the access time for store-load forwarding.

Another category of solutions is also prediction based, either to implement the SQ efficiently or to get rid of it entirely. A Bloom filter [Sethumadhavan et al. 2003] or a predicted-forwarding technique [Park et al. 2003] can filter out a majority of SQ searches. A sizeable memory dependence predictor is used in [Sha et al. 2005] to match loads with the precise store buffer slots holding the forwarded data, so as to abandon the associative SQ structure.
Proposals in [Sha et al. 2006; Subramaniam and Loh 2006] eliminate the SQ completely by bypassing store data through the register file or through the LQ. However, these prediction-based approaches always come at the expense of an extra sizeable prediction table.

5.6 Conclusion

When the instruction window scales up to hundreds or even thousands of instructions, an efficient SQ design is required to detect memory dependences in time and forward store data to dependent loads. Although many scalable SQ solutions have been proposed, they generally suffer from some inaccuracy and inefficiency, require complicated hardware logic along with additional buffers or tables, and compromise performance. In this work, we propose a new scheme that decouples the address-matching unit and the age-based priority logic from the SQ RAM array. The proposed solution enables an efficient set-associative SQ directory that searches for the parent store using a detached age-order vector. Moreover, by recording a single address for multiple stores with the same address in the SQ directory, the decoupled SQ further reduces the directory size and alleviates potential bank conflicts. We also provide solutions for handling the commonly used partial and misaligned stores and loads in designing a scalable SQ. The performance evaluation shows that the new scheme outperforms other scalable SQ proposals based on latest-store forwarding and late SQ binding techniques, and is comparable with a full-CAM SQ. By removing the costly fully-associative CAM structure, the new scheme is both power-efficient and scalable to large-window designs.


CHAPTER 6
CONCLUSIONS

In this dissertation, we propose three works related to cache performance improvement and resource contention resolution, and one work related to LSQ design.

The proposed special P-load has demonstrated its ability to effectively overlap load-load data dependences. Instead of relying on miss predictions of the requested blocks, the execution-driven P-load precisely instructs the memory controller to fetch the needed data block non-speculatively. The simulation results demonstrate high accuracy and significant speedups using the P-load.

The elbow cache has demonstrated its ability to expand the replacement set beyond the lookup set boundary without adding any complexity to the lookup path. Because of the characteristics of the elbow cache, it is difficult to implement recency-based replacement. The proposed low-cost reuse-count replacement policy can achieve cache performance comparable to a recency-based replacement policy.

Simultaneous multithreading techniques have moved from laboratory ideas into real and commercially successful processors. However, studies have shown that without proper mechanisms to regulate the shared resources, especially shared caches and the instruction window, multiple threads show lower overall performance when running simultaneously. Runahead execution, proposed initially to achieve better performance for single-thread applications, works very well in the multiple-thread environment. In runahead execution, multiple long-latency memory operations can be discovered and overlapped to exploit memory-level parallelism; meanwhile, shared critical resources held by the stalling thread can be released to keep the other thread running smoothly to exploit thread-level parallelism.


The order-free SQ design decouples the address-matching unit and the age-based priority logic from the original store queue. The proposed solution enables an efficient set-associative SQ directory for searching for the parent store using a detached age-order vector. Moreover, by recording a single address for multiple stores with the same address in the SQ directory, the decoupled SQ further reduces the directory size and alleviates potential bank conflicts. We also provide solutions for handling the commonly used partial and misaligned stores and loads in designing a scalable SQ. The performance evaluation shows that the new scheme outperforms other scalable SQ proposals based on latest-store forwarding and late SQ binding techniques and is comparable with a full-CAM SQ. By removing the costly fully-associative CAM structure, the new scheme is both power-efficient and scalable to large-window designs.


LIST OF REFERENCES

Abramson, J. M., Akkary, H., Glew, A. F., Hinton, G. J., Konigsfeld, K. G., Madland, P. D., and Papworth, D. B. 1997. Method and Apparatus for Signalling a Store Buffer to Output Buffered Store Data for a Load Operation on an Out-of-Order Execution Computer System, Intel, US Patent 5606670.

Adams, D., Allen, A., Bergkvist, R., Hesson, J., and LeBlanc, J. 1997. A 5ns Store Barrier Cache with Dynamic Prediction of Load/Store Conflicts in Superscalar Processors, Proceedings of the 1997 International Solid-State Circuits Conference, 414, 496.

Agarwal, A., Hennessy, J., and Horowitz, M. 1988. Cache Performance of Operating System and Multiprogramming Workloads, ACM Transactions on Computer Systems 6, 4, 393.

Agarwal, A., Kubiatowicz, J., Kranz, D., Lim, B.-H., Yeung, D., D'Souza, G., and Parkin, M. 1993. Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors, IEEE Micro 13, 3, 48.

Agarwal, A., and Pudar, S. D. 1993a. Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches, Proceedings of the 20th Annual International Symposium on Computer Architecture, 179.

Akkary, H., Rajwar, R., and Srinivasan, S. T. 2003. Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors, Proceedings of the 36th International Symposium on Microarchitecture, 423.

Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., and Smith, B. 1990. The Tera Computer System, Proceedings of the 4th International Conference on Supercomputing, 1.

Baugh, L., and Zilles, C. B. 2006. Decomposing the Load-Store Queue by Function for Power Reduction and Scalability, IBM Journal of Research and Development 50, 2-3, 287-298.

Bodin, F., and Seznec, A. 1997. Skewed Associativity Improves Performance and Enhances Predictability, IEEE Transactions on Computers 46, 5, 530.

Burger, D., and Austin, T. 1997. The SimpleScalar Tool Set, Version 2.0, Technical Report #1342, Computer Science Department, University of Wisconsin-Madison.

Buyuktosunoglu, A., Albonesi, D. H., Bose, P., Cook, P., and Schuster, S. 2002. Tradeoffs in Power-Efficient Issue Queue Design, Proceedings of the 2002 International Symposium on Low Power Electronics and Design, 184.

Cahoon, B., and McKinley, K. S. 2001. Data Flow Analysis for Software Prefetching Linked Data Structures in Java, Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques, 280.


Castro, F., Pinuel, L., Chaver, D., Prieto, M., Huang, M., and Tirado, F. 2006. DMDC: Delayed Memory Dependence Checking through Age-Based Filtering, Proceedings of the 39th International Symposium on Microarchitecture, 297.

Cazorla, F. J., Fernandez, E., Ramirez, A., and Valero, M. 2003. Improving Memory Latency Aware Fetch Policies for SMT Processors, Proceedings of the 5th International Symposium on High Performance Computing, 70.

Cazorla, F. J., Ramirez, A., Valero, M., and Fernandez, E. 2004a. DCache Warn: An I-Fetch Policy to Increase SMT Efficiency, Proceedings of the 18th International Parallel and Distributed Processing Symposium, 74a.

Cazorla, F. J., Ramirez, A., Valero, M., and Fernandez, E. 2004b. Dynamically Controlled Resource Allocation in SMT Processors, Proceedings of the 37th International Symposium on Microarchitecture, 171.

Charney, M., and Reeves, A. 1995. Generalized Correlation Based Hardware Prefetching, Technical Report EE-CEG-95-1, Cornell University.

Chen, T., and Baer, J. 1992. Reducing Memory Latency Via Non-Blocking and Prefetching Caches, Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, 51.

Chou, Y., Fahs, B., and Abraham, S. 2004. Microarchitecture Optimizations for Exploiting Memory-Level Parallelism, Proceedings of the 31st International Symposium on Computer Architecture, 76.

Chrysos, G. Z., and Emer, J. S. 1998. Memory Dependence Prediction Using Store Sets, Proceedings of the 25th International Symposium on Computer Architecture, 142-153.

Collins, J., Sair, S., Calder, B., and Tullsen, D. M. 2002. Pointer Cache Assisted Prefetching, Proceedings of the 35th International Symposium on Microarchitecture, 62.

Cooksey, R., Jourdan, S., and Grunwald, D. 2002. A Stateless, Content-Directed Data Prefetching Mechanism, Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, 279.

Cristal, A., Santana, O. J., Cazorla, F., Galluzzi, M., Ramirez, T., Pericas, M., and Valero, M. 2005. Kilo-Instruction Processors: Overcoming the Memory Wall, IEEE Micro 25, 3, 48-57.

Dorai, G., and Yeung, D. 2002. Transparent Threads: Resource Sharing in SMT Processors for High Single-Thread Performance, Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, 30.

Dundas, J., and Mudge, T. 1997. Improving Data Cache Performance by Pre-Executing Instructions Under a Cache Miss, Proceedings of the 11th International Conference on Supercomputing, 68.


El-Moursy, A., and Albonesi, D. H. 2003. Front-End Policies for Improved Issue Efficiency in SMT Processors, Proceedings of the 9th International Symposium on High-Performance Computer Architecture, 31.

Fu, J., Patel, J. H., and Janssens, B. L. 1992. Stride Directed Prefetching in Scalar Processors, Proceedings of the 25th Annual International Symposium on Microarchitecture, 102-110.

Gandhi, A., Akkary, H., Rajwar, R., Srinivasan, S. T., and Lai, K. K. 2005. Scalable Load and Store Processing in Latency Tolerant Processors, Proceedings of the 32nd International Symposium on Computer Architecture, 446.

Garg, A., Rashid, M. W., and Huang, M. C. 2006. Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification, Proceedings of the 33rd International Symposium on Computer Architecture, 142.

Gonzalez, A., Valero, M., Topham, N., and Parcerisa, J. 1997. Eliminating Cache Conflict Misses through XOR-Based Placement Functions, Proceedings of the 11th International Conference on Supercomputing, 76.

Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M., and Olukotun, K. 2000. The Stanford Hydra CMP, IEEE Micro 20, 2, 71.

Hesson, J., LeBlanc, J., and Ciavaglia, S. 1997. Apparatus to Dynamically Control the Out-of-Order Execution of Load-Store Instructions in a Processor Capable of Dispatching, Issuing and Executing Multiple Instructions in a Single Processor Cycle, IBM, US Patent 5615350.

Hinton, G., Sager, D., Upton, M., Boggs, D., Kyker, A., and Roussel, P. 2001. The Microarchitecture of the Pentium 4 Processor, Intel Technology Journal, 2001.

Hu, Z., Martonosi, M., and Kaxiras, S. 2003. TCP: Tag Correlating Prefetchers, Proceedings of the 9th International Symposium on High-Performance Computer Architecture, 317.

Hughes, H. J., and Adve, S. V. 2005. Memory-Side Prefetching for Linked Data Structures for Processor-in-Memory Systems, Journal of Parallel and Distributed Computing 65, 4, 448.

Joseph, D., and Grunwald, D. 1997. Prefetching Using Markov Predictors, Proceedings of the 24th International Symposium on Computer Architecture, 252.

Jouppi, N. P. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, Proceedings of the 17th International Symposium on Computer Architecture, 364.

Kessler, R. E. 1999. The Alpha 21264 Microprocessor, IEEE Micro 19, 2, 24.


Kharbutli, M., Irwin, K., Solihin, Y., and Lee, J. 2004. Using Prime Numbers for Cache Indexing to Eliminate Conflict Misses, Proceedings of the 10th International Symposium on High Performance Computer Architecture, 288.

Kirman, N., Kirman, M., Chaudhuri, M., and Martinez, J. F. 2005. Checkpointed Early Load Retirement, Proceedings of the 11th International Symposium on High Performance Computer Architecture, 16.

Laudon, J., Gupta, A., and Horowitz, M. 1994. Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations, Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, 308.

Lin, W.-F., Reinhardt, S. K., and Burger, D. 2001. Reducing DRAM Latencies with an Integrated Memory Hierarchy Design, Proceedings of the 7th International Symposium on High Performance Computer Architecture, 301.

Luk, C., and Mowry, T. C. 1996. Compiler-Based Prefetching for Recursive Data Structures, Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 222.

Moshovos, A., Breach, S. E., Vijaykumar, T. N., and Sohi, G. S. 1997. Dynamic Speculation and Synchronization of Data Dependence, Proceedings of the 24th International Symposium on Computer Architecture, 181-193.

Mowry, T. C., Lam, M. S., and Gupta, A. 1992. Design and Evaluation of a Compiler Algorithm for Prefetching, Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, 62.

Mutlu, O., Stark, J., Wilkerson, C., and Patt, Y. 2003. Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors, Proceedings of the 9th International Symposium on High Performance Computer Architecture, 129.

Olden Benchmark, http://www.cs.princeton.edu/~mcc/olden.html.

Opteron Processors, http://www.amd.com.

Park, I., Ooi, C.-L., and Vijaykumar, T. N. 2003. Reducing Design Complexity of the Load-Store Queue, Proceedings of the 36th International Symposium on Microarchitecture, 411.

Peir, J. K., Lee, Y., and Hsu, W. W. 1998. Capturing Dynamic Memory Reference Behavior with Adaptive Cache Topology, Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, 240.

Qureshi, M. K., Thompson, D., and Patt, Y. N. 2005. The V-Way Cache: Demand Based Associativity via Global Replacement, Proceedings of the 32nd Annual International Symposium on Computer Architecture, 544.


Roth, A. 2004. A High-Bandwidth Load-Store Unit for Single- and Multi-Threaded Processors, Technical Report MS-CIS-04-09, University of Pennsylvania.

Roth, A., Moshovos, A., and Sohi, G. 1998. Dependence Based Prefetching for Linked Data Structures, Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, 115.

Saavedra-Barrera, R., Culler, D., and von Eicken, T. 1990. Analysis of Multithreaded Architectures for Parallel Computing, Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, 169.

Sair, S., and Charney, M. 2000. Memory Behavior of the SPEC2000 Benchmark Suite, Technical Report, IBM Corp.

Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Keckler, S. W., Burger, D., and Moore, C. R. 2003. Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture, Proceedings of the 30th International Symposium on Computer Architecture, 422-433.

Sankaralingam, K., Nagarajan, R., McDonald, R., Desikan, R., Drolia, S., Govindan, M., Gratz, P., Gulati, D., Hanson, H., Kim, C., Liu, H., Ranganathan, N., Sethumadhavan, S., Sharif, S., Shivakumar, P., Keckler, S. W., and Burger, D. 2006. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor, Proceedings of the 39th International Symposium on Microarchitecture, 480.

Sethumadhavan, S., Desikan, R., Burger, D., Moore, C. R., and Keckler, S. W. 2003. Scalable Hardware Memory Disambiguation for High ILP Processors, Proceedings of the 36th International Symposium on Microarchitecture, 399-410.

Sethumadhavan, S., McDonald, R., Desikan, R., Burger, D., and Keckler, S. W. 2006. Design and Implementation of the TRIPS Primary Memory System, Proceedings of the 24th International Conference on Computer Design, 470.

Sethumadhavan, S., Roesner, F., Emer, J. S., Burger, D., and Keckler, S. W. 2007. Late-Binding: Enabling Unordered Load-Store Queues, Proceedings of the 34th Annual International Symposium on Computer Architecture, 347.

Seznec, A. 1993. A Case for Two-Way Skewed-Associative Caches, Proceedings of the 20th Annual International Symposium on Computer Architecture, 169.

Seznec, A., and Bodin, F. 1993. Skewed-Associative Caches, Proceedings of the 5th International Conference on Parallel Architectures and Languages Europe, 304.

Sha, T., Martin, M. M. K., and Roth, A. 2005. Scalable Store-Load Forwarding via Store Queue Index Prediction, Proceedings of the 38th International Symposium on Microarchitecture, 159.


Sha, T., Martin, M. M. K., and Roth, A. 2006. NoSQ: Store-Load Communication without a Store Queue, Proceedings of the 39th International Symposium on Microarchitecture, 285.

Snavely, A., and Tullsen, D. M. 2000. Symbiotic Job Scheduling for a Simultaneous Multithreading Processor, Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, 234.

Solihin, Y., Lee, J., and Torrellas, J. 2002. Using a User-Level Memory Thread for Correlation Prefetching, Proceedings of the 29th Annual International Symposium on Computer Architecture, 171.

SPEC2000 Alpha Binaries from the SimpleScalar website, http://www.eecs.umich.edu/~chriswea/benchmarks/spec2000.html.

SPEC2000 Benchmarks, http://www.spec.org/osg/cpu2000/.

Spjuth, M., Karlsson, M., and Hagersten, E. 2005. Skewed Caches from a Low-Power Perspective, Proceedings of the 2nd Conference on Computing Frontiers, 152.

Spracklen, L., and Abraham, S. 2005. Chip Multithreading: Opportunities and Challenges, Proceedings of the 11th International Symposium on High Performance Computer Architecture, 248.

Srinivasan, S. T., Rajwar, R., Akkary, H., Gandhi, A., and Upton, M. 2004. Continual Flow Pipelines, Proceedings of the 11th International Symposium on Architectural Support for Programming Languages and Operating Systems, 107-119.

Stone, H. S. 1971. Parallel Processing with the Perfect Shuffle, IEEE Transactions on Computers 20, 6, 153.

Stone, H. S., Turek, J., and Wolf, J. L. 1992. Optimal Partitioning of Cache Memory, IEEE Transactions on Computers 41, 9.

Stone, S. S., Woley, K. M., and Frank, M. I. 2005. Address-Indexed Memory Disambiguation and Store-to-Load Forwarding, Proceedings of the 38th International Symposium on Microarchitecture, 171.

Subramaniam, S., and Loh, G. 2006. Store Vectors for Scalable Memory Dependence Prediction and Scheduling, Proceedings of the 12th International Symposium on High Performance Computer Architecture, 64-75.

Suh, G. E., Rudolph, L., and Devadas, S. 2001. Dynamic Cache Partitioning for Simultaneous Multithreading Systems, Proceedings of the 13th International Conference on Parallel and Distributed Computing Systems, 116.

Tendler, J. M., Dodson, J. S., Field, J. S. Jr., Le, H., and Sinharoy, B. 2002. POWER4 System Microarchitecture, IBM Journal of Research and Development 46, 1, 5.


Topham, N. P., and González, A. 1997. Randomized Cache Placement for Eliminating Conflicts, IEEE Transactions on Computers 48, 2, 185.

Torres, E. F., Ibanez, P., Vinals, V., and Llaberia, J. M. 2005. Store Buffer Design in First-Level Multibanked Data Caches, Proceedings of the 32nd International Symposium on Computer Architecture, 469.

Tullsen, D. M., and Brown, J. A. 2001. Handling Long-Latency Loads in a Simultaneous Multithreading Processor, Proceedings of the 34th International Symposium on Microarchitecture, 318.

Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L., and Stamm, R. L. 1996. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, Proceedings of the 23rd International Symposium on Computer Architecture, 191.

Tullsen, D. M., Eggers, S. J., and Levy, H. M. 1995. Simultaneous Multithreading: Maximizing On-Chip Parallelism, Proceedings of the 22nd International Symposium on Computer Architecture, 392.

Vanderwiel, S., and Lilja, D. 2000. Data Prefetch Mechanisms, ACM Computing Surveys, 174-199.

Wang, Z., Burger, D., McKinley, K. S., Reinhardt, S. K., and Weems, C. C. 2003. Guided Region Prefetching: A Cooperative Hardware/Software Approach, Proceedings of the 30th International Symposium on Computer Architecture, 388.

Wilton, S. J. E., and Jouppi, N. P. 1996. CACTI: An Enhanced Cache Access and Cycle Time Model, IEEE Journal of Solid-State Circuits 31, 5, 677.

Yang, C.-L., and Lebeck, A. R. 2000. Push vs. Pull: Data Movement for Linked Data Structures, Proceedings of the 14th International Conference on Supercomputing, 176.

Yang, C.-L., and Lebeck, A. R. 2004. Tolerating Memory Latency through Push Prefetching for Pointer-Intensive Applications, ACM Transactions on Architecture and Code Optimization 1, 4, 445.

Yang, Q., and Adina, S. 1994. A One's Complement Cache Memory, Proceedings of the 1994 International Conference on Parallel Processing, 250.

Yoaz, A., Erez, M., Ronen, R., and Jourdan, S. 1999. Speculation Techniques for Improving Load-Related Instruction Scheduling, Proceedings of the 26th Annual International Symposium on Computer Architecture, 42-53.

Yourst, M. T. 2007. PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator, Proceedings of the 2007 International Symposium on Performance Analysis of Systems & Software, 23-34.


BIOGRAPHICAL SKETCH

Zhen Yang was born in 1977 in Tianjin, China. She earned her B.S. and M.S. in computer science from Nankai University in 1999 and 2002, respectively. She earned her Ph.D. in computer engineering from the University of Florida in December 2007.