
Accurate, Scalable, and Informative Modeling and Analysis of Complex Workloads and Large-Scale Microprocessor Architectures

University of Florida Institutional Repository
Permanent Link: http://ufdc.ufl.edu/UFE0021941/00001

Material Information

Title: Accurate, Scalable, and Informative Modeling and Analysis of Complex Workloads and Large-Scale Microprocessor Architectures
Physical Description: 1 online resource (119 p.)
Language: English
Creator: Cho, Chang
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Modeling and analyzing how workload and architecture interact are at the foundation of computer architecture research and practical design. As contemporary microprocessors become increasingly complex, solving many of the challenges in designing, evaluating and optimizing their architectures relies crucially on exploiting workload characteristics. Conventional workload characterization methods measure aggregated workload behavior, and state-of-the-art tools can detect time-varying program patterns and cluster them into phases. Existing techniques nevertheless generally cannot provide insight into the complex interaction between software and hardware, a necessary first step toward designing cost-effective computer architectures. This limitation will only be exacerbated by the rapid growth of software functionality and runtime and of hardware design complexity and integration scale. For instance, while large real-world applications manifest drastically different behavior across a wide spectrum of their runtime, existing methods analyze workload characteristics using only a single time scale. Conventional architecture modeling techniques assume a centralized and monolithic hardware substrate. This assumption, however, will not hold, since the design trend toward multi-/many-core processors will result in large-scale and distributed microarchitectures. Beyond any specific processor core, global and cooperative resource management for large-scale many-core processors requires obtaining workload characteristics across a large number of distributed hardware components (cores, cache banks, interconnect links, etc.) at different levels of abstraction. Therefore, there is a pressing need for novel and efficient approaches to model and analyze workloads and architectures of rapidly increasing complexity and integration scale.
We aim to develop computationally efficient methods and models that allow architects and designers to explore, rapidly yet informatively, the large performance, power, reliability and thermal design space of uni-/multi-core architectures. Our models achieve a speedup of several orders of magnitude over simulation-based methods while significantly improving prediction accuracy over conventional predictive models of the same complexity. Moreover, our models can capture complex workload behavior and can be used to forecast workload dynamics during performance, power, reliability and thermal design space exploration.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Chang Cho.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Li, Tao.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0021941:00001






ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF
COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES























By

CHANG BURM CHO


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008































© 2008 Chang Burm Cho









ACKNOWLEDGMENTS

Many people contributed to my Ph.D. research. Most of all, I would like to express my gratitude
to my supervisor, Dr. Tao Li, for his patient guidance and invaluable advice, and for numerous
discussions and encouragement throughout the course of the research.

I would also like to thank all the members of my advisory committee, Dr. Renato Figueiredo,
Dr. Rizwan Bashirullah, and Dr. Prabhat Mishra, for their valuable time and interest in serving
on my supervisory committee.

I am also indebted to all the members of IDEAL (Intelligent Design of Efficient Architectures
Laboratory), Clay Hughes, James Michael Poe II, Xin Fu and Wangyuan Zhang, for their
companionship and support throughout the time spent working on my research.

Finally, I would like to express my greatest gratitude to my family, especially my wife,
Eun-Hee Choi, for her relentless support and love.









TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 WAVELET TRANSFORM

Discrete Wavelet Transform (DWT)
Apply DWT to Capture Workload Execution Behavior
2D Wavelet Transform

3 COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION

Characterizing and classifying the program dynamic behavior
Profiling Program Dynamics and Complexity
Classifying Program Phases based on their Dynamics Behavior
Experimental Results

4 IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE ANALYSIS

Workload-statics-based phase analysis
Exploring Wavelet Domain Phase Analysis

5 INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION

Neural Network
Combining Wavelet and Neural Network for Workload Dynamics Prediction
Experimental Methodology
Evaluation and Results
Workload Dynamics Driven Architecture Design Space Exploration

6 ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTI-CORE ARCHITECTURES

Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction
Experimental Methodology
Evaluation and Results
Leveraging 2D Geometric Characteristics to Explore Cooperative Multi-core Oriented Architecture Design and Optimization

7 THERMAL DESIGN SPACE EXPLORATION OF 3D DIE STACKED MULTI-CORE PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS

Combining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction
Experimental Methodology
Experimental Results

8 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH









LIST OF TABLES

Table
3-1 Baseline machine configuration

3-2 A classification of benchmarks based on their complexity

4-1 Baseline machine configuration

4-2 Efficiency of different hybrid wavelet signatures in phase classification

5-1 Simulated machine configuration

5-2 Microarchitectural parameter ranges used for generating train/test data

6-1 Simulated machine configuration (baseline)

6-2 The considered architecture design parameters and their ranges

6-3 Multi-programmed workloads

6-4 Error comparison of predicting raw vs. 2D DWT cache banks

6-5 Design space evaluation speedup (simulation vs. prediction)

7-1 Architecture configuration for different issue width

7-2 Simulation configurations

7-3 Design space parameters

7-4 Simulation time vs. prediction time









LIST OF FIGURES

Figure
2-1 Example of Haar wavelet transform

2-2 Comparison execution characteristics of time and wavelet domain

2-3 Sampled time domain program behavior

2-4 Reconstructing the workload dynamic behaviors

2-5 Variation of wavelet coefficients

2-6 2D wavelet transforms on 4 data points

2-7 2D wavelet transforms on 16 cores/hardware components

2-8 Example of applying 2D DWT on a non-uniformly accessed cache

3-1 XCOR vectors for each program execution interval

3-2 Dynamic complexity profile of benchmark gcc

3-3 XCOR value distributions

3-4 XCORs in the same phase by the Simpoint

3-5 BBVs with different resolutions

3-6 Multiresolution analysis of the projected BBVs

3-7 Weighted COV calculation

3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics

3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics

4-1 Phase analysis methods time domain vs. wavelet domain

4-2 Phase classification accuracy: time domain vs. wavelet domain

4-3 Phase classification using hybrid wavelet coefficients

4-4 Phase classification accuracy of using 16 x 1 hybrid scheme

4-5 Different methods to handle counter overflows

4-6 Impact of counter overflows on phase analysis accuracy

4-7 Method for modeling workload variability

4-8 Effect of using wavelet denoising to handle workload variability

4-9 Efficiency of different denoising schemes

5-1 Variation of workload performance, power and reliability dynamics

5-2 Basic architecture of a neural network

5-3 Using wavelet neural network for workload dynamics prediction

5-4 Magnitude-based ranking of 128 wavelet coefficients

5-5 MSE boxplots of workload dynamics prediction

5-6 MSE trends with increased number of wavelet coefficients

5-7 MSE trends with increased sampling frequency

5-8 Roles of microarchitecture design parameters

5-9 Threshold-based workload execution scenarios

5-10 Threshold-based workload execution

5-11 Threshold-based workload scenario prediction

5-12 Dynamic Vulnerability Management

5-13 IQ DVM Pseudo Code

5-14 Workload dynamic prediction with scenario-based architecture optimization

5-15 Heat plot that shows the MSE of IQ AVF and processor power

5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds

6-1 Variation of cache hits across a 256-bank non-uniform access cache on 8-core

6-2 Using wavelet neural networks for forecasting architecture 2D characteristics

6-3 Baseline CMP with 8 cores that share a NUCA L2 cache

6-4 ME boxplots of prediction accuracies with different number of wavelet coefficients

6-5 Predicted 2D NUCA behavior using different number of wavelet coefficients

6-6 Roles of design parameters in predicting 2D NUCA

6-7 2D NUCA footprint (geometric shape) of mesa

6-8 2D cache interference in NUCA

6-9 Pearson correlation coefficient (all 50 test cases are shown)

6-10 2D NUCA thermal profile (simulation vs. prediction)

6-11 NUCA 2D thermal prediction error

6-12 Temperature profile before and after a DTM policy

7-1 2D within-die and cross-dies thermal variation in 3D die stacked multi-core processors

7-2 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations

7-3 Example of using 2D DWT to capture thermal spatial characteristics

7-4 Hybrid neuro-wavelet thermal prediction framework

7-5 Selected floor-plans

7-6 Processor core floor-plan

7-7 Cross section view of the simulated 3D quad-core chip

7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16)

7-9 Simulated and predicted thermal behavior

7-10 ME boxplots of prediction accuracies with different number of wavelet coefficients

7-11 Benefit of predicting wavelet coefficients

7-12 Roles of input parameters









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF
COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES

By

CHANG BURM CHO

December 2008

Chair: Tao Li
Major: Electrical and Computer Engineering

Modeling and analyzing how workload and architecture interact are at the foundation of computer
architecture research and practical design. As contemporary microprocessors become increasingly
complex, solving many of the challenges in designing, evaluating and optimizing their
architectures relies crucially on exploiting workload characteristics. Conventional workload
characterization methods measure aggregated workload behavior, and state-of-the-art tools can
detect time-varying program patterns and cluster them into phases. Existing techniques
nevertheless generally cannot provide insight into the complex interaction between software and
hardware, a necessary first step toward designing cost-effective computer architectures. This
limitation will only be exacerbated by the rapid growth of software functionality and runtime
and of hardware design complexity and integration scale. For instance, while large real-world
applications manifest drastically different behavior across a wide spectrum of their runtime,
existing methods analyze workload characteristics using only a single time scale. Conventional
architecture modeling techniques assume a centralized and monolithic hardware substrate. This
assumption, however, will not hold, since the design trend toward multi-/many-core processors
will result in large-scale and distributed microarchitectures. Beyond any specific processor
core, global and cooperative resource management for large-scale many-core processors requires
obtaining workload characteristics across a large number of distributed hardware components
(cores, cache banks, interconnect links, etc.) at different levels of abstraction. Therefore,
there is a pressing need for novel and efficient approaches to model and analyze workloads and
architectures of rapidly increasing complexity and integration scale.

We aim to develop computationally efficient methods and models that allow architects and
designers to explore, rapidly yet informatively, the large performance, power, reliability and
thermal design space of uni-/multi-core architectures. Our models achieve a speedup of several
orders of magnitude over simulation-based methods while significantly improving prediction
accuracy over conventional predictive models of the same complexity. Moreover, our models can
capture complex workload behavior and can be used to forecast workload dynamics during
performance, power, reliability and thermal design space exploration.









CHAPTER 1
INTRODUCTION

Modeling and analyzing how workloads behave on the underlying hardware have been essential
ingredients of computer architecture research. Once program behavior is known, both hardware and
software can be tuned to better suit the needs of applications. As computer systems become more
adaptive, their efficiency increasingly depends on the dynamic behavior that programs exhibit at
runtime. Previous studies [1-5] have shown that program runtime characteristics exhibit
time-varying phase behavior: workload execution manifests similar behavior within each phase
while showing distinct characteristics between different phases. Many challenges related to the
design, analysis and optimization of complex computer systems can be efficiently solved by
exploiting program phases [1, 6-9]. For this reason, there is a growing interest in studying
program phase behavior. Recently, several phase analysis techniques have been proposed
[4, 7, 10-19]. Very few of these studies, however, focus on understanding and characterizing
program phases from the perspective of their dynamics and complexity. Consequently, these
techniques generally cannot characterize the dynamic behavior of phases. To complement current
phase analysis techniques, which pay little or no attention to phase dynamics, we develop new
methods, metrics and frameworks that can analyze, quantify, and classify program phases based on
their dynamics and complexity characteristics. Our techniques are built on wavelet-based
multiresolution analysis, which provides a clear and orthogonal view of phase dynamics by
presenting the complex dynamic structures of program phases with respect to both the time and
frequency domains. Consequently, key tendencies can be efficiently identified.

As microprocessor architectures become more complex, architects increasingly rely on exploiting
workload dynamics to achieve cost- and complexity-effective designs. Therefore, there is a
growing need for methods that can quickly and accurately explore workload dynamic behavior at the
early microarchitecture design stage. Such techniques can quickly give architects insight into
application execution scenarios across a large design space without resorting to detailed,
case-by-case simulations. Researchers have proposed several predictive models [20-25] to reason
about workload behavior at the architecture design stage. However, these models focus on
predicting aggregated program statistics (e.g., CPI of the entire workload execution). Such
monolithic global models are incapable of capturing and revealing program dynamics, which
contain interesting fine-grain behavior. To overcome the limitations of monolithic, global
predictive models, we propose a novel scheme that combines wavelet-based multiresolution
decomposition techniques with neural network prediction.

As the number of cores on a processor increases, these large and sophisticated
multi-core-oriented architectures exhibit increasingly complex and heterogeneous characteristics.
Processors with two, four and eight cores have already entered the market. Processors with tens
or possibly hundreds of cores may be a reality within the next few years. In the upcoming
multi-/many-core era, the design, evaluation and optimization of architectures will demand
analysis methods that are very different from those targeting traditional, centralized and
monolithic hardware structures. To enable global and cooperative management of hardware
resources and efficiency at large scales, it is imperative to analyze and exploit architecture
characteristics beyond the scope of individual cores and hardware components (e.g., a single
cache bank or a single interconnect link). To address this important and urgent research task,
we developed novel 2D multi-scale predictive models that can efficiently reason about the
characteristics of large and sophisticated multi-core-oriented architectures during the design
space exploration stage without using detailed cycle-level simulations.









Three-dimensional (3D) integrated circuit design [55] is an emerging technology that greatly
improves transistor integration density and reduces on-chip wire communication latency. It places
planar circuit layers in the vertical dimension and connects these layers with a high-density,
low-latency interface. In addition, 3D integration offers the opportunity to bind dies
implemented with different techniques, enabling the integration of heterogeneous active layers
for new system architectures. Leveraging 3D die stacking technologies to build uni-/multi-core
processors has drawn increased attention from both the chip design industry and the research
community [56-62]. The realization of 3D chips faces many challenges. One of the most daunting
is inefficient heat dissipation. In conventional 2D chips, the generated heat is dissipated
through an external heat sink. In 3D chips, all of the layers contribute to the generation of
heat. Stacking multiple dies vertically increases power density, and dissipating heat from the
layers far from the heat sink is more challenging because of the distance between the heat
source and the external heat sink. Therefore, 3D technologies not only exacerbate existing
on-chip hotspots but also create new thermal hotspots. High die temperature leads to thermally
induced performance degradation and reduced chip lifetime, which threatens the reliability of
the whole system, making the modeling and analysis of thermal characteristics crucial to
effective 3D microprocessor design. Previous studies [59, 60] show that 3D chip temperature is
affected by factors such as the configuration and floor-plan of microarchitectural components.
For example, instead of putting hot components together, thermal-aware floor-planning places hot
components next to cooler components, reducing the global temperature. Thermal-aware
floor-planning [59] uses intensive and iterative simulations to estimate the thermal effect of
microarchitecture components at the early architectural design stage. However, using detailed
yet slow cycle-level simulations to explore thermal effects across the large design space of 3D
multi-core processors is very expensive in terms of time and cost.









CHAPTER 2
WAVELET TRANSFORM

We use wavelets as an efficient tool for capturing workload behavior. To familiarize the

reader with general methods used in this research, we provide a brief overview on wavelet

analysis and show how program execution characteristics can be represented using wavelet

analysis.

Discrete Wavelet Transform (DWT)

Wavelets are mathematical tools that use a prototype function (called the analyzing or mother wavelet) to transform data of interest into different frequency components, and then analyze each component with a resolution matched to its scale. Therefore, the wavelet transform is capable of providing a compact and effective mathematical representation of data. In contrast to Fourier transforms, which only offer frequency representations, wavelet transforms provide time and frequency localization simultaneously. Wavelet analysis allows one to choose the wavelet function from numerous candidates [26, 27]. In this section, we provide a quick primer on wavelet analysis using the Haar wavelet, which is the simplest form of wavelet.

Consider a data series $X_{n,k}$, $k = 0, 1, 2, \ldots$, at the finest time scale resolution level $2^{n}$. This time series might represent a specific program characteristic (e.g., number of executed instructions, branch mispredictions and cache misses) measured at a given time scale. We can coarsen this event series by averaging (with a slightly different normalization factor) over non-overlapping blocks of size two

$$X_{n-1,k} = \frac{1}{\sqrt{2}}\left(X_{n,2k} + X_{n,2k+1}\right) \tag{2-1}$$

and generate a new time series $X_{n-1}$, which is a coarser granularity representation of the original series $X_n$. The difference between the two representations, known as details, is

$$D_{n-1,k} = \frac{1}{\sqrt{2}}\left(X_{n,2k} - X_{n,2k+1}\right) \tag{2-2}$$

Note that the original time series $X_n$ can be reconstructed from its coarser representation $X_{n-1}$ by simply adding in the details $D_{n-1}$; i.e., $X_n = 2^{-1/2}(X_{n-1} + D_{n-1})$. We can repeat this process (i.e., write $X_{n-1}$ as the sum of yet another coarser version $X_{n-2}$ of $X_n$ and the details $D_{n-2}$, and iterate) for as many scales as are present in the original time series, i.e.,

$$X_n = 2^{-n/2} X_0 + 2^{-n/2} D_0 + \cdots + 2^{-1/2} D_{n-1} \tag{2-3}$$

We refer to the collection of $X_0$ and $D_j$ as the discrete Haar wavelet coefficients. The calculations of all $D_{j,k}$, which can be done iteratively using equations (2-1) and (2-2), make up the so-called discrete wavelet transform (DWT). As can be seen, the DWT offers a natural hierarchical structure to represent data behavior at multiple resolution levels: the first few wavelet coefficients contain an overall, coarser approximation of the data; additional coefficients add finer detail. This property can be used to capture workload execution behavior.

Figure 2-1 illustrates the procedure of using the Haar-based DWT to transform a series of data {3, 4, 20, 25, 15, 5, 20, 3}. As can be seen, scale 1 is the finest representation of the data. At scale 2, the approximations {3.5, 22.5, 10, 11.5} are obtained by taking the average of {3, 4}, {20, 25}, {15, 5} and {20, 3} at scale 1 respectively. The details {-0.5, -2.5, 5, 8.5} are the differences of {3, 4}, {20, 25}, {15, 5} and {20, 3} divided by 2 respectively. The process continues by decomposing the scaling coefficient (approximation) vector using the same steps, and completes when only one coefficient remains.
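The decomposition steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name `haar_dwt` is hypothetical, and it uses the plain average and half-difference of the Figure 2-1 example instead of the 1/sqrt(2) normalization of equations (2-1) and (2-2).

```python
def haar_dwt(data):
    """Average/half-difference Haar decomposition (Figure 2-1 convention).

    Assumes len(data) is a power of two. Returns the overall average
    followed by the detail coefficients from coarsest to finest.
    """
    approx = list(data)
    details = []  # detail vectors, finest level first
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail: first value minus second value, divided by 2.
        details.append([(a - b) / 2 for a, b in pairs])
        # Approximation: average of each non-overlapping pair.
        approx = [(a + b) / 2 for a, b in pairs]
    coeffs = approx  # single remaining coefficient: the overall average
    for level in reversed(details):
        coeffs = coeffs + level
    return coeffs


coefficients = haar_dwt([3, 4, 20, 25, 15, 5, 20, 3])
print(coefficients)
```

For the example series this yields the overall average 11.875 followed by the details 1.125, then -9.5 and -0.75, then -0.5, -2.5, 5 and 8.5.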

As a result, the wavelet decomposition is the collection of average and detail coefficients at all scales. In other words, the wavelet transform of the original data is the single coefficient representing the overall average of the original data, followed by the detail coefficients in order of increasing resolution. Different resolutions can be obtained by adding difference values back to, or subtracting differences from, the averages.

Original Data: 3, 4, 20, 25, 15, 5, 20, 3

Scaling Filter (G0): 3.5, 22.5, 10, 11.5      Wavelet Filter (H0): -0.5, -2.5, 5, 8.5
Scaling Filter (G1): 13, 10.75                Wavelet Filter (H1): -9.5, -0.75
Scaling Filter (G2): 11.875                   Wavelet Filter (H2): 1.125

Approximation (Level 0):        11.875
Detail (Level 1):               1.125
Detail Coefficients (Level 2):  -9.5, -0.75
Detail Coefficients (Level 3):  -0.5, -2.5, 5, 8.5

Figure 2-1 Example of Haar wavelet transform.

For instance, {13, 10.75} = {11.875+1.125, 11.875-1.125} where 11.875 and 1.125 are the

first and the second coefficient respectively. This process can be performed recursively until the

finest scale is reached. Therefore, through an inverse transform, the original data can be

recovered from wavelet coefficients. The original data can be perfectly recovered if all wavelet

coefficients are involved. Alternatively, an approximation of the time series can be reconstructed

using a subset of wavelet coefficients. Using a wavelet transform gives time-frequency

localization of the original data. As a result, the time domain signal can be accurately

approximated using only a few wavelet coefficients since they capture most of the energy of the

input data.
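The inverse step can be illustrated with a short Python sketch. It assumes the same average/half-difference convention as the Figure 2-1 example; the function name `haar_idwt` and its `keep` parameter are hypothetical and are introduced only to show reconstruction from a subset of coefficients.

```python
def haar_idwt(coeffs, keep=None):
    """Invert an average/half-difference Haar decomposition.

    coeffs: overall average followed by details, coarsest level first.
    keep:   if given, only the first `keep` coefficients are used and the
            remaining ones are treated as zero (lossy reconstruction).
    """
    c = list(coeffs)
    if keep is not None:
        c = c[:keep] + [0.0] * (len(c) - keep)
    approx = c[:1]
    pos = 1
    while pos < len(c):
        details = c[pos:pos + len(approx)]
        finer = []
        for a, d in zip(approx, details):
            # average + detail gives the first value, average - detail the second
            finer.extend([a + d, a - d])
        pos += len(approx)
        approx = finer
    return approx


coeffs = [11.875, 1.125, -9.5, -0.75, -0.5, -2.5, 5, 8.5]
print(haar_idwt(coeffs))          # all coefficients: exact recovery
print(haar_idwt(coeffs, keep=2))  # first two coefficients: coarse approximation
```

With all eight coefficients the original series {3, 4, 20, 25, 15, 5, 20, 3} is recovered; with only the first two, the output is the coarse approximation in which the first half of the series is replaced by 13 and the second half by 10.75.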

Apply DWT to Capture Workload Execution Behavior

Since variation of program characteristics over time can be viewed as signals, we apply

discrete wavelet analysis to capture program execution behavior. To obtain time domain

workload execution characteristics, we break down entire program execution into intervals and










then sample multiple data points within each interval. Therefore, at the finest resolution level,

program time domain behavior is represented by a data series within each interval. Note that the

sampled data can be any runtime program characteristics of interest. We then apply discrete

wavelet transform (DWT) to each interval. As described in the previous section, the result of DWT

is a set of wavelet coefficients which represent the behavior of the sampled time series in the

wavelet domain.

(a) Time domain representation (b) Wavelet domain representation (x-axis: wavelet coefficients 1 to 16)

Figure 2-2 Comparison of execution characteristics in the time and wavelet domains

Figure 2-2 (a) shows the sampled time domain workload execution statistics (The y-axis

represents the number of cycles a processor spends on executing a fixed amount of instructions)

on benchmark gcc within one execution interval. In this example, the program execution interval

is represented by 1024 sampled data points. Figure 2-2 (b) illustrates the wavelet domain

representation of the original time series after a discrete wavelet transform is applied. Although

the DWT operations can produce as many wavelet coefficients as the original input data, the first

few wavelet coefficients usually contain the important trend. In Figure 2-2 (b), we show the

values of the first 16 wavelet coefficients. As can be seen, the discrete wavelet transform

provides a compact representation of the original large volume of data. This feature can be

exploited to create concise yet informative fingerprints to capture program execution behavior.










One advantage of using wavelet coefficients to fingerprint program execution is that

program time domain behavior can be reconstructed from these wavelet coefficients. Figures 2-3

and 2-4 show that the time domain workload characteristics can be recovered using the inverse

discrete wavelet transforms.

Figure 2-3 Sampled time domain program behavior






(a) 1 wavelet coefficient (b) 2 wavelet coefficients
(c) 4 wavelet coefficients (d) 8 wavelet coefficients
(e) 16 wavelet coefficients (f) 64 wavelet coefficients

Figure 2-4 Reconstructing the workload dynamic behaviors

In Figure 2-4 (a)-(e), the first 1, 2, 4, 8, and 16 wavelet coefficients were used to restore

program time domain behavior with increasing fidelity. As shown in Figure 2-4 (f), when all

(e.g. 64) wavelet coefficients are used for recovery, the original signal can be completely

restored. However, this could involve storing and processing a large number of wavelet

coefficients. Using a wavelet transform gives time-frequency localization of the original data. As

a result, most of the energy of the input data can be represented by only a few wavelet











coefficients. As can be seen, using 16 wavelet coefficients can recover program time domain


behavior with sufficient accuracy.

To classify program execution into phases, it is essential that the generated wavelet


coefficients across intervals preserve the dynamics that workloads exhibit within the time


domain. Figure 2-5 shows the variation of the first 16 wavelet coefficients (coeff 1 to coeff 16) which


represent the wavelet domain behavior of branch misprediction and L1 data cache hit on the


benchmark gcc. The data are shown for the entire program execution which contains a total of

1024 intervals.


[Figure: variation of the first 16 wavelet coefficients (legend: coeff 1 to coeff 16) over program execution]
(a) branch misprediction (b) L1 data cache hit

Figure 2-5 Variation of wavelet coefficients

Figure 2-5 shows that wavelet domain transforms largely preserve program dynamic


behavior. Another interesting observation is that the first order wavelet coefficient exhibits much


more significant variation than the high order wavelet coefficients. This suggests that wavelet


domain workload dynamics can be effectively captured using a few, low order wavelet


coefficients.










2D Wavelet Transform

To effectively capture the two-dimensional spatial characteristics across large-scale multi-

core architecture substrates, we also use the 2D wavelet analysis. With 1D wavelet analysis that

uses Haar wavelet filters, each adjacent pair of data in a discrete interval is replaced with its

average and difference.

Original:                 2x2 block of adjacent points a, b, c, d
Average:                  (a+b+c+d)/4
Detailed (D-horizontal):  ((a+b)/2-(c+d)/2)/2
Detailed (D-vertical):    ((b+d)/2-(a+c)/2)/2
Detailed (D-diagonal):    ((a+d)/2-(b+c)/2)/2

Figure 2-6 2D wavelet transforms on 4 data points

A similar concept can be applied to obtain a 2D wavelet transform of data in a discrete

plane. As shown in Figure 2-6, each adjacent four points in a discrete 2D plane can be replaced

by their averaged value and three detailed values. The detailed values (D-horizontal, D-vertical,

and D-diagonal) correspond to the average of the difference of: 1) the summation of the rows, 2)

the summation of the columns, and 3) the summation of the diagonals.
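The four formulas of Figure 2-6 can be written out directly. The sketch below is illustrative only: the function names are hypothetical, and the block layout (a and b in the top row, c and d in the bottom row) is an assumption inferred from the row, column and diagonal sums described above.

```python
def haar2d_block(a, b, c, d):
    """Transform one 2x2 block, assumed to be laid out as [[a, b], [c, d]]."""
    average = (a + b + c + d) / 4
    d_horizontal = ((a + b) / 2 - (c + d) / 2) / 2  # difference of the row sums
    d_vertical = ((b + d) / 2 - (a + c) / 2) / 2    # difference of the column sums
    d_diagonal = ((a + d) / 2 - (b + c) / 2) / 2    # difference of the diagonal sums
    return average, d_horizontal, d_vertical, d_diagonal


def inverse_haar2d_block(average, d_horizontal, d_vertical, d_diagonal):
    """Recover the four points from the average and the three detail values."""
    a = average + d_horizontal - d_vertical + d_diagonal
    b = average + d_horizontal + d_vertical - d_diagonal
    c = average - d_horizontal - d_vertical - d_diagonal
    d = average - d_horizontal + d_vertical + d_diagonal
    return a, b, c, d


block = haar2d_block(2, 4, 6, 8)
print(block)
print(inverse_haar2d_block(*block))
```

The inverse shows that no information is lost: the average and the three detail values together determine the original four points.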

To obtain wavelet coefficients for 2D data, we apply a 1D wavelet transform to the data along the X-axis first, resulting in low-pass and high-pass signals (average and difference). Next, we apply 1D wavelet transforms to both signals along the Y-axis, generating one averaged and three detailed signals. A 2D wavelet decomposition is then obtained by recursively repeating this procedure on the averaged signal. Figure 2-7 (a) illustrates the procedure. As can be seen, the 2D wavelet decomposition can be represented by a tree-based structure. The root node of the tree contains the original data (in row-major order) of the mesh of values (for example, performance or temperatures of the four adjacent cores, network-on-chip links, cache banks etc.). First, we apply 1D wavelet transforms along the X-axis, i.e. for each two points along the X-axis we compute the average and difference, so we obtain (3 5 7 1 9 1 5 9) and (1 1 1 -1 5 -1 1 1). Next, we apply 1D wavelet transforms along the Y-axis; for each two points along the Y-axis we compute the average and difference (at level 0 in the example shown in Figure 2-7 (a)). We perform this process recursively until the number of elements in the averaged signal becomes 1 (at level 1 in the example shown in Figure 2-7 (a)).
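The row-then-column procedure can be traced on the 16-value example with a short Python sketch. The helper names are hypothetical. The difference is taken as the second value minus the first, divided by 2, which is the convention that reproduces the high-pass values quoted in the text; applying the same convention along the Y-axis is an assumption, and only the Y-axis average is used here.

```python
def haar_along_x(grid):
    """Pairwise average (low-pass) and half-difference (high-pass) along each row."""
    low = [[(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)] for row in grid]
    high = [[(row[i + 1] - row[i]) / 2 for i in range(0, len(row), 2)] for row in grid]
    return low, high


def transpose(grid):
    return [list(column) for column in zip(*grid)]


def haar_along_y(grid):
    """Same operation along each column, implemented by transposing."""
    low_t, high_t = haar_along_x(transpose(grid))
    return transpose(low_t), transpose(high_t)


# The 16 values of the example, arranged in row-major order on a 4x4 mesh.
mesh = [[2, 4, 4, 6],
        [6, 8, 2, 0],
        [4, 14, 2, 0],
        [4, 6, 8, 10]]

low, high = haar_along_x(mesh)              # X-axis pass on the original data
average_l0, _ = haar_along_y(low)           # Y-axis pass: averaged signal at L=0
low_l1, high_l1 = haar_along_x(average_l0)  # next recursion step on the average

print([v for row in low for v in row])      # low-pass signal
print([v for row in high for v in row])     # high-pass signal
print(average_l0)
```

The first two printed lists correspond to the low-pass and high-pass signals given in the text, and `average_l0` is the 2x2 averaged signal on which the procedure is repeated.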

(a) Decomposition tree
Original Data (row-majored): 2 4 4 6 6 8 2 0 4 14 2 0 4 6 8 10
1D wavelet along x-axis: lowpass signal 3 5 7 1 9 1 5 9; highpass signal 1 1 1 -1 5 -1 1 1
1D wavelet along y-axis (L=0): Average, Horizontal Details, Vertical Details, Diagonal Details
1D wavelet along x-axis (applied to the average): lowpass signal 4 6; highpass signal -1 -1
1D wavelet along y-axis (L=1): Avg., Horiz. Det., Vert. Det., Diag. Det.

(b) Original Data; 1D wavelet along x-axis; 1D wavelet along y-axis; Average (L=0) and Details

Figure 2-7 2D wavelet transforms on 16 cores/hardware components

Figure 2-7.b shows the wavelet domain multi-resolution representation of the 2D spatial data.


Figure 2-8 further demonstrates that the 2D architecture characteristics can be effectively


captured using a small number of wavelet coefficients (e.g. Average (L=0) or Average (L= 1)).


Since a small set of wavelet coefficients provide concise yet insightful information on


architecture 2D spatial characteristics, we use predictive models (i.e. neural networks) to relate


them individually to various architecture design parameters. Through inverse 2D wavelet


transform, we use the small set of predicted wavelet coefficients to synthesize architecture 2D


spatial characteristics across the design space. Compared with a simulation-based method,









predicting a small set of wavelet coefficients using analytical models is computationally efficient

and is scalable to large scale architecture design.











(a) NUCA hit numbers (b) 2D DWT (L=0) (c) 2D DWT (L=1)
Figure 2-8 Example of applying 2D DWT on a non-uniformly accessed cache









CHAPTER 3
COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION

Obtaining phase dynamics, in many cases, is of great interest to accurately capture

program behavior and to precisely apply runtime application oriented optimizations. For

example, complex, real-world workloads may run for hours, days or even months before

completion. Their long execution time implies that program time varying behavior can manifest

across a wide range of scales, making modeling phase behavior using a single time scale less

informative. To overcome the limitations of conventional phase analysis techniques, we propose using

wavelet-based multiresolution analysis to characterize phase dynamic behavior and develop metrics to

quantitatively evaluate the complexity of phase structures. We also propose methodologies

to classify program phases from their dynamics and complexity perspectives. Specifically, the

goal of this chapter is to answer the following questions: How to define the complexity of

program dynamics? How do program dynamics change over time? If classified using existing

methods, how similar are the program dynamics in each phase? How to better identify phases

with homogeneous dynamic behavior?

In this chapter, we implement our complexity-based phase analysis technique and

evaluate its effectiveness over existing phase analysis methods based on program control flow

and runtime information. We show that in both cases the proposed technique produces

phases that exhibit more homogeneous dynamic behavior than existing methods do.

Characterizing and classifying the program dynamic behavior

Using the wavelet-based multiresolution analysis which is described in chapter 2, we

characterize, quantify and classify program dynamic behavior on a high-performance, out-of-

order execution superscalar processor coupled with a multi-level memory hierarchy.









Experimental setup

We performed our analysis using ten SPEC CPU 2000 benchmarks: crafty, gap, gcc, gzip,

mcf, parser, perlbmk, swim, twolf and vortex. All programs were run with reference input to

completion. We chose to focus on only 10 programs because of the lengthy simulation time

incurred by executing all of the programs to completion. The statistics of workload dynamics

were measured on the SimpleScalar 3.0 [28] sim-outorder simulator for the Alpha ISA. The

baseline microarchitecture model is detailed in Table 3-1.

Table 3-1 Baseline machine configuration
Parameter               Configuration
Processor Width         8
ITLB                    128 entries, 4-way, 200 cycle miss
Branch Prediction       combined 8K tables, 10 cycle misprediction, 2 predictions/cycle
BTB                     2K entries, 4-way
Return Address Stack    32 entries
L1 Instruction Cache    32KB, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access
RUU Size                128 entries
Load/Store Queue        64 entries
Store Buffer            16 entries
Integer ALU             4 I-ALU, 2 I-MUL/DIV
FP ALU                  2 FP-ALU, 1 FP-MUL/DIV
DTLB                    256 entries, 4-way, 200 cycle miss
L1 Data Cache           64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access
L2 Cache                unified 1MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access           100 cycles

Metrics to Quantify Phase Complexity

To quantify phase complexity, we measure the similarity between phase dynamics observed at

different time scales. To be more specific, we use cross-correlation coefficients to measure the

similarity between the original data sampled at the finest granularity and the approximated

version reconstructed from wavelet scaling coefficients obtained at a coarser scale. The cross-

correlation coefficients (XCOR) of the two data series are defined as:









$$\mathrm{XCOR}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{3-1}$$

where X is the original data series and Y is the approximated data series. Note that XCOR = 1 if

program dynamics observed at the finest scale and its approximation at coarser granularity

exhibit perfect correlation, and XCOR = 0 if the program dynamics and its approximation vary

independently across time scales.
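As a small illustration, the cross-correlation coefficient of equation (3-1) can be computed as follows. The function name `xcor` is hypothetical, and returning 0 when one of the series is constant (where the expression is undefined) is a convention chosen for this sketch only.

```python
import math


def xcor(x, y):
    """Cross-correlation coefficient of two equal-length series, as in equation (3-1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    # A constant series makes the denominator zero; report 0 in that case
    # (an assumption of this sketch, not something stated in the text).
    return numerator / denominator if denominator else 0.0


print(xcor([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated series
print(xcor([1, 2, 3, 4], [4, 3, 2, 1]))  # perfectly anti-correlated series
```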

X and Y can be any runtime program characteristics of interest. In this chapter, we use

instructions per cycle (IPC) as a metric due to its wide usage in computer architecture design and

performance evaluation. To sample IPC dynamics, we break down the entire program execution

into 1024 intervals and then sample 1024 IPC data within each interval. Therefore, at the finest

resolution level, the program dynamics of each execution interval are represented by an IPC data

series with a length of 1024. We then apply wavelet multiresolution analysis to each interval. In

a wavelet transform, each DWT operation produces an approximation coefficients vector with a

length equal to half of the input data. We remove the detail coefficients after each wavelet

transform and only use the approximation part to reconstruct IPC dynamics and then calculate

the XCOR between the original data and the reconstructed data. We apply discrete wavelet

transform to the approximation part iteratively until the length of the approximation coefficient

vector is reduced to 1. Each approximation coefficient vector is used to reconstruct a full IPC

trace with a length of 1024 and the XCOR between the original and reconstructed traces are

calculated using equation (3-1). As a result, for each program execution interval, we obtain an

XCOR vector, in which each element represents the cross-correlation coefficients between the

original workload dynamics and the approximated workload dynamics at different scales. Since










we use 1024 samples within each interval, we create an XCOR vector with a length of 10 for

each interval, as shown in Figure 3-1.
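The construction of one XCOR vector can be sketched as below. Several details are assumptions made for illustration: the Haar wavelet is used, so that dropping the details and reconstructing amounts to replacing each block of samples by its average; scale 1 is taken to be the original series and each further scale halves the number of approximation coefficients; and all function names are hypothetical.

```python
import math


def xcor(x, y):
    """Cross-correlation coefficient (equation 3-1); 0.0 if a series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator if denominator else 0.0


def approximate(series, levels):
    """Keep only Haar approximation coefficients after `levels` transforms,
    then expand back to the original length (details treated as zero)."""
    approx = list(series)
    for _ in range(levels):
        approx = [(approx[i] + approx[i + 1]) / 2 for i in range(0, len(approx), 2)]
    block = len(series) // len(approx)
    return [value for value in approx for _ in range(block)]


def xcor_vector(series, scales):
    """XCOR between the series and its approximation at scales 1..scales."""
    return [xcor(series, approximate(series, level)) for level in range(scales)]


interval = [3, 4, 20, 25, 15, 5, 20, 3]  # toy stand-in for a 1024-sample IPC trace
print(xcor_vector(interval, 3))
```

For a 1024-sample interval, `xcor_vector(samples, 10)` would return a vector of length 10 under these assumptions.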



[Figure: one XCOR vector (scales 1 to 10) per interval: 1st Interval, 2nd Interval, ..., 1024th Interval]
Figure 3-1 XCOR vectors for each program execution interval

Profiling Program Dynamics and Complexity

We use XCOR metrics to quantify program dynamics and complexity of the studied SPEC

CPU 2000 benchmarks. Figure 3-2 shows the results of the total 1024 execution intervals across

ten levels of abstraction for the benchmark gcc.


[Figure: XCOR (y-axis) versus Scales 1 to 10 (x-axis) for the execution intervals of gcc]

Figure 3-2 Dynamic complexity profile of benchmark gcc

As can be seen, the benchmark gcc shows a wide variety of changing dynamics during its execution. As the time scale increases, XCOR values decrease monotonically. This is because wavelet approximation at a coarse scale removes details in program dynamics observed at a fine-grained level. A rapidly decreasing XCOR implies highly complex structures that cannot be captured by coarse-level approximation. In contrast, a slowly decreasing XCOR suggests that program dynamics can be largely preserved using few samples. Figure 3-2 also shows a dotted line along which XCOR decreases linearly with increasing time scale. The XCOR plots below that dotted line indicate rapidly decreasing XCOR values, i.e. complex program dynamics. As can be seen, a significant fraction of the gcc execution intervals manifest quickly decreasing XCOR values, indicating that the program exhibits highly complex structure at the fine-grained level. Figure 3-2 also reveals that a few gcc execution intervals have good scalability in their dynamics. On these execution intervals, the XCOR values drop by only 0.1 when the time scale is increased from 1 to 8. The results (Figure 3-2) clearly indicate that some program execution intervals can be accurately approximated by their high-level abstractions while others cannot.

We further break down the XCOR values into 10 categories ranging from 0 to 1 and analyze

their distribution across time scales. Due to space limitations, we only show the results of three

programs (swim, crafty and gcc, see Figure 3-3) which represent the characteristics of all

analyzed benchmarks. Note that at scale 1, the XCOR values of all execution intervals are

always 1. Programs show heterogeneous XCOR value distributions starting from scale level 2.

As can be seen, the benchmark swim exhibits good scalability in its dynamic complexity. The

XCOR values of all execution intervals remain above 0.9 when the time scale is increased from 1

to 7. This implies that the captured program behavior is not sensitive to any time scale in that

range. Therefore, we classify swim as a low complexity program. On the benchmark crafty,

XCOR values decrease uniformly with the increase of time scales, indicating the observed

program dynamics are sensitive to the time scales used to obtain it. We refer to this behavior as

medium complexity. On the benchmark gcc, program dynamics decay rapidly. This suggests that











abundant program dynamics could be lost if coarser time scales are used to characterize it. We

refer to this characteristic as high complexity behavior.


[Figure: stacked bars showing the percentage of execution intervals (0 to 100) in each XCOR range [0, 0.1), [0.1, 0.2), ..., [0.9, 1) at scales 1 to 10]
(a) swim (low complexity) (b) crafty (medium complexity) (c) gcc (high complexity)

Figure 3-3 XCOR value distributions

The dynamics complexity and the XCOR value distribution plots (Figure 3-2 and Figure 3-3)

provide a quantitative and informative representation of runtime program complexity.

Table 3-2 Classification of benchmarks based on their complexity
Category             Benchmarks
Low complexity       swim
Medium complexity    crafty, gzip, parser, perlbmk, twolf
High complexity      gap, gcc, mcf, vortex


Using the above information, we classify the studied programs in terms of their complexity

and the results are shown in Table 3-2.












Classifying Program Phases based on their Dynamics Behavior

In this section, we show that program execution manifests heterogeneous complexity

behavior. We further examine the efficiency of using current methods in classifying program

dynamics into phases and propose a new method that can better identify program complexity.

Classifying complexity based phase behavior enables us to understand program dynamics

progressively in a fine-to-coarse fashion, to operate on different resolutions, to manipulate

features at different scales, and to localize characteristics in both spatial and frequency domains.

Simpoint

Sherwood and Calder [1] proposed a phase analysis tool called Simpoint to automatically

classify the execution of a program into phases. They found that intervals of program execution

grouped into the same phase had similar statistics. The Simpoint tool clusters program execution

based on code signature and execution frequency. We identified program execution intervals

grouped into the same phase by the Simpoint tool and analyzed their dynamic complexity.

Figure 3-4 shows the results for the benchmark mcf, on which Simpoint generates 55 clusters. The figure shows program dynamics within three of these clusters. Each cluster represents a unique phase. In cluster 7, the classified phase shows homogeneous dynamics. In cluster 5, program execution intervals show two distinct dynamics but they are classified as the same phase. In cluster 48, program execution complexity varies widely; however, Simpoint classifies them as a single phase. The results (Figure 3-4) suggest that program execution intervals classified as the same phase by Simpoint can still exhibit widely varied behavior in their dynamics.

Complexity-aware Phase Classification

To enhance the capability of current methods in characterizing program dynamics, we

propose complexity-aware phase classification. Our method uses the multiresolution property of

wavelet transforms to identify and classify the changing of program code execution across

different scales.

We assume a baseline phase analysis technique that uses basic block vectors (BBV) [10].

A basic block is a section of code that is executed from start to finish with one entry and one

exit. A BBV represents the code blocks executed during a given interval of execution. To

represent program dynamics at different time scales, we create a set of basic block vectors for

each interval at different resolutions. For example, at the coarsest level (scale = 10), a program

execution interval is represented by one BBV. At the most detailed level, the same program

execution interval is represented by 1024 BBVs from 1024 consecutively subdivided

intervals (Figure 3-5). To reduce the amount of data that needs to be processed, we use random

projection to reduce the dimensionality of all BBVs to 15, as suggested in [1].
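The dimensionality reduction step can be illustrated with a minimal random projection sketch. The details here are assumptions, not taken from this work or from Simpoint: the projection matrix entries are drawn uniformly from [-1, 1], the BBV is used as given without normalization, and the names `make_projection` and `project` are hypothetical.

```python
import random


def make_projection(num_blocks, dims=15, seed=0):
    """Random matrix with one row per basic block and `dims` columns.
    Entries uniform in [-1, 1] (an assumption made for this sketch)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dims)] for _ in range(num_blocks)]


def project(bbv, matrix):
    """Project a BBV (one entry per basic block) down to len(matrix[0]) dimensions."""
    dims = len(matrix[0])
    return [sum(bbv[b] * matrix[b][d] for b in range(len(bbv))) for d in range(dims)]


# Toy example: a program with 6 basic blocks and one interval's execution counts.
matrix = make_projection(num_blocks=6)
bbv = [120, 0, 35, 35, 0, 7]
reduced = project(bbv, matrix)
print(len(reduced))  # 15
```

The same matrix must be reused for every interval so that the projected vectors remain comparable with each other.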

[Figure: one interval subdivided at successively finer resolutions]
Resolution = 2^0: {BBV_{1,1}}
Resolution = 2^1: {BBV_{2,1}, BBV_{2,2}}
Resolution = 2^2: {BBV_{4,1}, BBV_{4,2}, BBV_{4,3}, BBV_{4,4}}
Resolution = 2^3: {BBV_{8,1}, BBV_{8,2}, ..., BBV_{8,7}, BBV_{8,8}}
Resolution = 2^n: {BBV_{2^n,1}, BBV_{2^n,2}, ..., BBV_{2^n,2^n-1}, BBV_{2^n,2^n}}
Figure 3-5 BBVs with different resolutions










The coarser scale BBVs are the approximations of the finest scale BBVs generated by the

wavelet-based multiresolution analysis.

[Figure: the 1024 projected BBVs of an interval, each with 15 dimensions (e.g. -0.11 0.13 -0.03 0.16 0.06 ... -0.02), pass through multiresolution analysis and XCOR calculation, producing one XCOR vector over scales 1 to 10 per dimension; these are combined into a single XCOR vector over scales 1 to 10]
Figure 3-6 Multiresolution analysis of the projected BBVs

As shown in Figure 3-6, the discrete wavelet transform is applied to each dimension of a set of BBVs at the finest scale. The XCOR calculation is used to estimate the correlations between a BBV element and its approximations at coarser scales. The results are 15 XCOR vectors representing the complexity of each BBV dimension across 10 levels of abstraction. The 15 XCOR vectors are then averaged together to obtain an aggregated XCOR vector that represents the entire BBV complexity characteristics for that execution interval. Using the above steps, we obtain an aggregated XCOR vector for each program execution interval. We then run the k-means clustering algorithm [29] on the collected XCOR vectors, which represent the dynamic complexity of program execution intervals, and classify the intervals into phases. This is similar to what Simpoint does. The difference is that the Simpoint tool uses raw BBVs and our method uses aggregated BBV XCOR vectors as the input for k-means clustering.
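The aggregation and clustering steps can be sketched as follows. This is a simplified illustration under stated assumptions: the k-means routine is a plain Lloyd iteration seeded with the first k points (a deterministic choice made for this example, not the initialization used by Simpoint or in this work), the XCOR values are made-up toy numbers with 3 scales instead of 10, and all names are hypothetical.

```python
def aggregate(xcor_vectors):
    """Average the per-dimension XCOR vectors element-wise."""
    count = len(xcor_vectors)
    return [sum(column) / count for column in zip(*xcor_vectors)]


def kmeans(points, k, iterations=20):
    """Basic Lloyd's algorithm; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return labels


# Toy aggregated XCOR vectors for four intervals (hypothetical values).
intervals = [
    aggregate([[1.0, 0.9, 0.8], [1.0, 0.9, 0.8]]),
    aggregate([[1.0, 0.3, 0.1], [1.0, 0.1, 0.1]]),
    aggregate([[1.0, 0.88, 0.79]]),
    aggregate([[1.0, 0.25, 0.05]]),
]
print(kmeans(intervals, k=2))
```

Intervals whose XCOR decays slowly end up in one phase and those with rapidly decaying XCOR in the other.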










Experimental Results

We compare Simpoint and the proposed approach in their capability of classifying phase

complexity. Since we use wavelet transform on program basic block vectors, we refer to our

method as multiresolution analysis of BBV (MRA-BBV).













Figure 3-7 Weighted COV calculation

We examine the similarity of program complexity within each classified phase by the two

approaches. Instead of using IPC, we use IPC dynamics as the metric for evaluation. After

classifying all program execution intervals into phases, we examine each phase and compute the

IPC XCOR vectors for all the intervals in that phase. We then calculate the standard deviation in

IPC XCOR vectors within each phase, and we divide the standard deviation by the average to get

the Coefficient of Variation (COV).

As shown in Figure 3-7, we calculate an overall COV metric for a phase classification

method by taking the COV of each phase, weighting it by the percentage of execution that the

phase accounts for. This produces an overall metric (i.e. weighted COVs) used to compare

different phase classifications for a given program. Since COV measures the standard deviation as

a percentage of the average, a lower COV value indicates a better phase classification technique.
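The weighted COV calculation can be sketched for a single scale as follows. The sketch assumes that every interval covers the same amount of execution, so a phase's weight is its share of the intervals, that the population standard deviation is used, and that the per-phase mean is non-zero; the function name and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def weighted_cov(values, labels):
    """Weighted coefficient of variation of `values` grouped by phase label.

    values: one metric value per execution interval (e.g. IPC XCOR at one scale)
    labels: phase label assigned to each interval by a classification method
    """
    total = len(values)
    result = 0.0
    for phase in set(labels):
        members = [v for v, label in zip(values, labels) if label == phase]
        cov = pstdev(members) / mean(members)     # standard deviation over average
        result += (len(members) / total) * cov    # weight by share of execution
    return result


xcor_at_scale = [1.0, 1.0, 2.0, 4.0]  # toy values for four intervals
phases = [0, 0, 1, 1]                 # two phases of two intervals each
print(weighted_cov(xcor_at_scale, phases))
```

Here phase 0 is perfectly homogeneous (COV of 0) and phase 1 has a COV of 1/3, so the weighted COV is 1/6.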









[Figure: COV (0% to 100%) per scale for crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf and vortex; legend: BBV, MRA-BBV]
Figure 3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics

Figure 3-8 shows experimental results for all the studied benchmarks. As can be seen, the MRA-BBV method can produce phases which exhibit more homogeneous dynamics and complexity than the standard, BBV-based method. This can be seen from the lower COV values generated by the MRA-BBV method. In general, the COV values yielded by both methods increase when coarse time scales are used for complexity approximation. MRA-BBV is capable of achieving significantly better classification on benchmarks with high complexity, such as gap, gcc and mcf. On programs which exhibit medium complexity, such as crafty, gzip, parser, and twolf, the two schemes show comparable effectiveness. On a benchmark with trivial complexity (e.g. swim), both schemes work well.

We further examine the capability of using runtime performance metrics to capture complexity-aware phase behavior. Instead of using BBV, the sampled IPC is used directly as the input to the k-means phase clustering algorithm. Similarly, we apply multiresolution analysis to the IPC data and then use the gathered information for phase classification. We call this method multiresolution analysis of IPC (MRA-IPC). Figure 3-9 shows the phase classification results. As can be seen, the observations we made on the BBV-based cases hold valid on the IPC-based cases. This implies that the proposed multiresolution analysis can be applied to both methods to improve the capability of capturing phase dynamics.


[Figure 3-9 plot: COV (%) per benchmark, including parser, perlbmk, swim, twolf and vortex; legend: IPC, MRA-IPC]
Figure 3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics









CHAPTER 4
IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE
ANALYSIS

In this chapter, we focus on workload-statistics-based phase analysis since on a given

machine configuration and environment, it is more suitable to identify how the targeted

architecture features vary during program execution. In contrast, phase classification using

program code structures lacks the capability of informing how workloads behave architecturally

[13, 30]. Therefore, phase analysis using specified workload characteristics allows one to

explicitly link the targeted architecture features to the classified phases. For example, if phases

are used to optimize cache efficiency, the workload characteristics that reflect cache behavior

can be used to explicitly classify program execution into cache performance/power/reliability

oriented phases. Program code structure based phase analysis identifies similar phases only if

they have similar code flow. There can be cases where two sections of code can have different

code flow, but exhibit similar architectural behavior [13]. Code flow based phase analysis would

then classify them as different phases. Another advantage of workload-statistics-based phase

analysis is that when multiple threads share the same resource (e.g. pipeline, cache), using

workload execution information to classify phases makes it possible to capture program

dynamic behavior due to the interactions between threads.

The key goal of workload execution based phase analysis is to accurately and reliably

discern and recover phase behavior from various program runtime statistics represented as large-

volume, high-dimension and noisy data. To effectively achieve this objective, recent work [30,

31] proposes using wavelets as a tool to assist phase analysis. The basic idea is to transform

workload time domain behavior into the wavelet domain. The generated wavelet coefficients,

which extract compact yet informative program runtime features, are then assembled together to









facilitate phase classification. Nevertheless, in current work, the examined scope of workload

characteristics and the explored benefits due to wavelet transform are quite limited.

In this chapter, we extend the research of Chapter 3 by applying wavelets to a wide range of

program execution statistics and quantifying the benefits of using wavelets for improving

accuracy, scalability and robustness in phase classification. We conclude that wavelet domain

phase analysis has the following advantages: 1) accuracy: the wavelet transform significantly

reduces temporal dependence in the sampled workload statistics. As a result, simple models

which are insufficient in the time domain become quite accurate in the wavelet domain. More

attractively, wavelet coefficients transformed from various dimensions of program execution

characteristics can be dynamically assembled together to further improve phase classification

accuracy; 2) scalability: phase classification using wavelet analysis of high-dimension sampled

workload statistics can alleviate the counter overflow problem which has a negative impact on

phase detection. Therefore, it is much more scalable to analyze large-scale phases exhibited on

long-running, real-world programs; and 3) robustness: wavelets offer denoising capabilities

which allow phase classification to be performed robustly in the presence of workload

execution variability.

Workload-Statistics-Based Phase Analysis

Using the wavelet-based method, we explore program phase analysis on a high-

performance, out-of-order execution superscalar processor coupled with a multi-level memory

hierarchy. We use the Daubechies wavelet [26, 27] with an order of 8 for the rest of the experiments

due to its high accuracy and low computation overhead. This section describes our experimental

methodology, the simulated machine configuration, the benchmarks used and the evaluated

metrics.









We performed our analysis using twelve SPEC CPU 2000 integer benchmarks: bzip2,

crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex and vpr. All programs were run

with the reference input to completion. The runtime workload execution statistics were measured

on the SimpleScalar 3.0 sim-outorder simulator for the Alpha ISA. The baseline

microarchitecture model we used is detailed in Table 4-1.

Table 4-1 Baseline machine configuration
Parameter               Configuration
Processor Width         8
ITLB                    128 entries, 4-way, 200 cycle miss
Branch Prediction       combined 8K tables, 10 cycle misprediction, 2
BTB                     2K entries, 4-way
Return Address Stack    32 entries
L1 Instruction Cache    32K, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access
RUU Size                128 entries
Load/Store Queue        64 entries
Store Buffer            16 entries
Integer ALU             4 I-ALU, 2 I-MUL/DIV
FP ALU                  2 FP-ALU, 1 FP-MUL/DIV
DTLB                    256 entries, 4-way, 200 cycle miss
L1 Data Cache           64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access
L2 Cache                unified 1MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access           100 cycles


We use IPC (instructions per cycle) as the metric to evaluate the similarity of program

execution within each classified phase. To quantify phase classification accuracy, we use the

weighted COV metric proposed by Calder et al. [15]. After classifying all program execution

intervals into phases, we examine each phase and compute the IPC for all the intervals in that

phase. We then calculate the standard deviation in IPC within each phase, and we divide the

standard deviation by the average to get the Coefficient of Variation (COV). We then calculate

an overall COV metric for a phase classification method by taking the COV of each phase and

weighting it by the percentage of execution that the phase accounts for. This produces an overall

metric (i.e. weighted COVs) used to compare different phase classifications for a given program.









Since COV measures standard deviations as a percentage of the average, a lower COV value

means a better phase classification technique.
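The weighted COV computation described above can be sketched as follows. This is an illustrative Python sketch, not the code used in this study: the function name weighted_cov, the use of the population standard deviation, and the assumption that every interval covers the same number of instructions (so that a phase's weight is simply its share of the intervals) are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def weighted_cov(ipc, phase_ids):
    """Weighted COV of IPC over a phase classification.

    ipc[i] is the IPC of interval i and phase_ids[i] is the phase it was
    assigned to. Intervals are assumed to be of equal length, so a phase's
    weight is the fraction of intervals it contains.
    """
    phases = {}
    for value, pid in zip(ipc, phase_ids):
        phases.setdefault(pid, []).append(value)

    total = len(ipc)
    result = 0.0
    for values in phases.values():
        avg = mean(values)
        # COV = standard deviation divided by the average (0 if avg is 0).
        cov = pstdev(values) / avg if avg else 0.0
        result += cov * len(values) / total
    return result

ipc = [1.0, 1.0, 2.0, 2.0]
good = weighted_cov(ipc, [0, 0, 1, 1])  # homogeneous phases
poor = weighted_cov(ipc, [0, 1, 0, 1])  # each phase mixes both behaviors
```

In this toy input the first classification groups identical intervals together and therefore yields a lower weighted COV than the second, which is the sense in which a lower value indicates a better classification.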

Exploring Wavelet Domain Phase Analysis

We first evaluate the efficiency of wavelet analysis on a wide range of program execution

characteristics by comparing its phase classification accuracy with methods that use information

in the time domain. We then explore methods to further improve phase classification

accuracy in the wavelet domain.

Phase Classification: Time Domain vs. Wavelet Domain

The wavelet analysis method provides a cost-effective representation of program behavior.

Since wavelet coefficients are generally decorrelated, we can transform the original data into the

wavelet domain and then carry out the phase classification task. The generated wavelet

coefficients can be used as signatures to classify program execution intervals into phases: if two

program execution intervals show similar fingerprints (represented as a set of wavelet

coefficients), they can be classified into the same phase. To quantify the benefit of using wavelet

based analysis, we compare phase classification methods that use time domain and wavelet

domain program execution information.

With our time domain phase analysis method, each program execution interval is

represented by a time series which consists of 1024 sampled program execution statistics. We

first apply random projection to reduce the data dimensionality to 16. We then use the k-means

clustering algorithm to classify program intervals into phases. This is similar to the method used

by the popular Simpoint tool where the basic block vectors (BBVs) are used as input. For the

wavelet domain method, the original time series are first transformed into the wavelet domain

using DWT. The first 16 wavelet coefficients of each program execution interval are extracted










and used as the input to the k-means clustering algorithm. Figure 4-1 illustrates the above

described procedure.
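The two signature paths can be sketched as below. This is a minimal, hedged illustration rather than the implementation used in the experiments: it substitutes a Haar transform for the Daubechies-8 wavelet so that the example stays self-contained, it draws the random projection matrix from a uniform distribution, and the function names (haar_dwt, wavelet_signature, random_projection) are hypothetical.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def haar_dwt(series):
    """Full Haar DWT of a sequence whose length is a power of two.

    Returns coefficients ordered coarse to fine: the overall approximation
    first, then detail coefficients from the coarsest to the finest scale.
    """
    approx = list(series)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / SQRT2 for a, b in pairs] + details
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx + details

def wavelet_signature(series, n_coeffs=16):
    """Wavelet-domain signature: the first n_coeffs (coarsest) coefficients."""
    return haar_dwt(series)[:n_coeffs]

def random_projection(series, dim=16, seed=0):
    """Time-domain signature: project the raw samples onto dim random vectors.

    Re-seeding on every call regenerates the same projection matrix, so all
    intervals are projected consistently.
    """
    rng = random.Random(seed)
    return [sum(rng.uniform(-1.0, 1.0) * x for x in series) for _ in range(dim)]

# One execution interval with 1024 sampled statistics (synthetic data).
interval = [100.0 + 10.0 * math.sin(i / 32.0) for i in range(1024)]
time_sig = random_projection(interval)
wave_sig = wavelet_signature(interval)
flat_sig = wavelet_signature([1.0] * 1024)
```

Either 16-element signature would then be handed to a k-means implementation; for a perfectly flat interval all of the energy ends up in the first wavelet coefficient.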

[Figure 4-1 diagram. Time domain: Program Runtime Statistics -> Random Projection (dimensionality = 16) -> K-means Clustering -> COV. Wavelet domain: Program Runtime Statistics -> DWT (number of wavelet coefficients = 16) -> K-means Clustering -> COV]
Figure 4-1 Phase analysis methods: time domain vs. wavelet domain

We investigated the efficiency of applying wavelet domain analysis on 10 different

workload execution characteristics, namely, the numbers of executed loads (load), stores (store),

branches (branch), the number of cycles a processor spends on executing a fixed number of

instructions (cycles), the number of branch mispredictions (branch miss), the number of L1

instruction cache, L1 data cache and L2 cache hits (il1 hit, dl1 hit and ul2 hit), and the number

of instruction and data TLB hits (itlb hit and dtlb hit). Figure 4-2 shows the COVs of phase

classifications in time and wavelet domains when each type of workload execution characteristic

is used as an input. As can be seen, compared with using raw, time domain workload data, the

wavelet domain analysis significantly improves phase classification accuracy and this

observation holds for all the investigated workload characteristics across all the examined

benchmarks. This is because in the time domain, collected program runtime statistics are treated

as high-dimension time series data. Random projection methods are used to reduce the

dimensionality of feature vectors which represent a workload signature at a given execution

interval. However, the simple random projection function can increase the aliasing between

phases and reduce the accuracy of phase detection.














[Figure 4-2 plots: phase classification COV (%) for each workload characteristic, one panel per benchmark (legible panel titles: gap, mcf, twolf, crafty, gcc, parser, vortex, perlbmk); legend: Time Domain, Wavelet Domain]
Figure 4-2 Phase classification accuracy: time domain vs. wavelet domain


By transforming program runtime statistics into the wavelet domain, workload behavior



can be represented by a series of wavelet coefficients which are much more compact and



efficient than their counterparts in the time domain. The wavelet transform significantly reduces



temporal dependence and therefore simple models which are insufficient in the time domain



become quite accurate in the wavelet domain.










Figure 4-2 shows that in the wavelet domain, the efficiency of using a single type of

program characteristic to classify program phases can vary significantly across different

benchmarks. For example, while ul2 hit achieves accurate phase classification on the benchmark

vortex, it results in a high phase classification COV on the benchmark gcc. To overcome the

above disadvantages and to build phase classification methods that can achieve high accuracy

across a wide range of applications, we explore using wavelet coefficients derived from different

types of workload characteristics.


[Figure 4-3 diagram: Program Runtime Statistics 1 ... n -> DWT -> Wavelet Coefficient Set 1 ... n -> Hybrid Wavelet Coefficients -> K-means Clustering -> COV]
Figure 4-3 Phase classification using hybrid wavelet coefficients

As shown in Figure 4-3, a DWT is applied to each type of workload characteristic. The

generated wavelet coefficients from different categories can be assembled together to form a

signature for a data clustering algorithm.

Our objective is to improve wavelet domain phase classification accuracy across different

programs while using an equivalent amount of information to represent program behavior. We

choose a set of 16 wavelet coefficients as the phase signature since it provides sufficient

precision in capturing program dynamics when a single type of program characteristic is used. If

a phase signature can be composed using multiple workload characteristics, there are many ways

to form a 16-dimension phase signature. For example, a phase signature can be generated using

one wavelet coefficient from 16 different workload characteristics (16 x 1), or it can be composed









using 8 workload characteristics with 2 wavelet coefficients from each type of workload

characteristic (8 x 2). Alternatively, a phase signature can be formed using 4 workload

characteristics with 4 wavelet coefficients each (4 x 4), or 2 workload characteristics with 8

wavelet coefficients each (2 x 8). We extend the 10 workload execution

characteristics (Figure 4-2) to 16 by adding the following events: the number of accesses to

instruction cache (il1 access), data cache (dl1 access), L2 cache (ul2 access), instruction TLB

(itlb access) and data TLB (dtlb access). To understand the trade-offs in choosing different

methods to generate hybrid signatures, we did an exhaustive search using the above 4 schemes

on all benchmarks to identify the best COVs that each scheme can achieve. The results (their

ranks in terms of phase classification accuracy and the COVs of phase analysis) are shown in

Table 4-2. As can be seen, statistically, hybrid wavelet signatures generated using 16 (16 x 1) and

8 (8 x 2) workload characteristics achieve higher accuracy. This suggests that combining multiple

dimension wavelet domain workload characteristics to form a phase signature is beneficial in

phase analysis.
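The four hybrid schemes can be illustrated with a short sketch. The code below is an assumption-laden illustration, not the search procedure used in this study: it treats each scheme as "pick a subset of the available characteristics and keep the leading wavelet coefficients of each", and it counts the candidate subsets an exhaustive search would have to visit under that reading. The names SCHEMES, hybrid_signature and search_space, and the stat0 ... stat15 labels, are hypothetical.

```python
from math import comb

# scheme name -> (number of characteristics, coefficients kept per characteristic)
SCHEMES = {"16x1": (16, 1), "8x2": (8, 2), "4x4": (4, 4), "2x8": (2, 8)}

def hybrid_signature(coeff_sets, chosen, per_characteristic):
    """Concatenate the leading coefficients of the chosen characteristics.

    coeff_sets maps a characteristic name to its wavelet coefficients for one
    execution interval (coarsest coefficient first).
    """
    signature = []
    for name in chosen:
        signature.extend(coeff_sets[name][:per_characteristic])
    return signature

def search_space(scheme, available=16):
    """Number of characteristic subsets an exhaustive search must try."""
    n_chars, _ = SCHEMES[scheme]
    return comb(available, n_chars)

# Synthetic coefficient sets for 16 hypothetical characteristics.
coeff_sets = {f"stat{i}": [float(i * 100 + k) for k in range(16)]
              for i in range(16)}
sig = hybrid_signature(coeff_sets, ["stat0", "stat1", "stat2", "stat3"], 4)
sizes = {name: search_space(name) for name in SCHEMES}
```

Every scheme yields a 16-element signature, so the schemes differ only in how that fixed budget is split between breadth (more characteristics) and depth (more coefficients per characteristic).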

Table 4-2 Efficiency of different hybrid wavelet signatures in phase classification
(each entry: hybrid wavelet signature and its phase classification COV)
Benchmark   Rank #1         Rank #2         Rank #3         Rank #4
Bzip2       16x1 (6.5%)     8x2 (10.5%)     4x4 (10.5%)     2x8 (10.5%)
Crafty      4x4 (1.2%)      16x1 (1.6%)     8x2 (1.9%)      2x8 (3.9%)
Eon         8x2 (1.3%)      4x4 (1.6%)      16x1 (1.8%)     2x8 (3.6%)
Gap         4x4 (4.2%)      16x1 (6.3%)     8x2 (7.2%)      2x8 (9.3%)
Gcc         8x2 (4.7%)      16x1 (5.8%)     4x4 (6.5%)      2x8 (14.1%)
Gzip        16x1 (2.5%)     4x4 (3.7%)      8x2 (4.4%)      2x8 (4.9%)
Mcf         16x1 (9.5%)     4x4 (10.2%)     8x2 (12.1%)     2x8 (87.8%)
Parser      16x1 (4.7%)     8x2 (5.2%)      4x4 (7.3%)      2x8 (8.4%)
Perlbmk     8x2 (0.7%)      16x1 (0.8%)     4x4 (0.8%)      2x8 (1.5%)
Twolf       16x1 (0.2%)     8x2 (0.2%)      4x4 (0.4%)      2x8 (0.5%)
Vortex      16x1 (2.4%)     8x2 (4%)        2x8 (4.4%)      4x4 (5.8%)
Vpr         16x1 (3%)       8x2 (14.9%)     4x4 (15.9%)     2x8 (16.3%)










We further compare the efficiency of using the 16 x 1 hybrid scheme (Hybrid), the best

case that a single type workload characteristic can achieve (Individual Best) and the Simpoint

based phase classification that uses basic block vector (BBV). The results of the 12 SPEC integer

benchmarks are shown in Figure 4-4.

[Figure 4-4 plot: COV (%) for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr and AVG]
Figure 4-4 Phase classification accuracy of using 16x1 hybrid scheme

As can be seen, the Hybrid outperforms the Individual Best on 10 out of the 12

benchmarks. The Hybrid also outperforms the BBV-based Simpoint method on 10 out of the 12

cases.

Scalability

Above we can see that wavelet domain phase analysis can achieve higher accuracy. In this

subsection, we address another important issue in phase analysis using workload execution

characteristics: scalability. Counters are usually used to collect workload statistics during

program execution. The counters may overflow if they are used to track large scale phase

behavior on long running workloads. Today, many large and real world workloads can run for days,

weeks or even months before completion and this trend is likely to continue in the future. To

perform phase analysis on the next generation of computer workloads and systems, phase

classification methods should be capable of scaling with the increasing program execution time.










To understand the impact of counter overflow on phase analysis accuracy, we use 16

accumulative counters to record the 16-dimension workload characteristic. The values of the 16

accumulative counters are then used as a signature to perform phase classification. We gradually

reduce the number of bits in the accumulative counters. As a result, counter overflows start to

occur. We use two schemes to handle a counter overflow. In our first method, a counter saturates

at its maximum value once it overflows. In our second method, the counter is reset to zero after

an overflow occurs. After all counter overflows are handled, we then use the 16-dimension

accumulative counter values to perform phase analysis and calculate the COVs. Figure 4-5 (a)

describes the above procedure.
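The two overflow-handling policies can be sketched as follows. This is a hedged illustration: the function name accumulate is hypothetical, and the choice to drop the increment that triggers an overflow under the reset policy is an assumption, since the text only states that the counter is reset to zero.

```python
def accumulate(events, bits, policy):
    """Accumulate event counts in a `bits`-wide counter.

    policy="saturate": the counter sticks at its maximum once it overflows.
    policy="reset": the counter is cleared to zero when it overflows (the
    increment that caused the overflow is dropped; this is an assumption).
    Returns (final_value, overflowed).
    """
    if policy not in ("saturate", "reset"):
        raise ValueError(policy)
    limit = (1 << bits) - 1
    value, overflowed = 0, False
    for count in events:
        if value + count > limit:
            overflowed = True
            if policy == "saturate":
                return limit, True
            value = 0
        else:
            value += count
    return value, overflowed

events = [100] * 10                       # 1000 events in total
wide = accumulate(events, 16, "saturate")  # large enough: no overflow
saturated = accumulate(events, 8, "saturate")
reset = accumulate(events, 8, "reset")
```

With an 8-bit counter neither policy preserves the true total of 1000, which is why signatures built from overflowed counters lose information about the interval.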

[Figure 4-5 diagram: over a large-scale phase interval, Program Runtime Statistics 1 ... n are each recorded either by (a) an n-bit accumulative counter, whose values are passed to clustering, or by (b) an n-bit sampling counter]
Figure 4-5 Different methods to handle counter overflows

Our counter overflow analysis results are shown in Figure 4-6. Figure 4-6 also shows the

counter overflow rate (i.e. the percentage of overflowed counters) when counters with different

sizes are used to collect workload statistics within program execution intervals. For example, on

the benchmark crafty, when the number of bits used in counters is reduced to 20, 100% of the

counters overflow. For the purpose of clarity, we only show a region within which the counter

overflow rate is greater than zero and less than or equal to one. Since each program has different

execution time, the region varies from one program to another. As can be seen, counter

overflows have a negative impact on phase classification accuracy. In general, COVs increase with














the counter overflow rate. Interestingly, as the overflow rate increases, there are cases where

overflow handling can reduce the COVs. This is because overflow handling has the effect of



normalizing and smoothing irregular peaks in the original statistics.


[Figure 4-6 plots: COV (%) versus # of bits in counter, one panel per benchmark (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr); series: Saturate, Reset, Wavelet; data labels give the counter overflow rate]
Figure 4-6 Impact of counter overflows on phase analysis accuracy


One solution to avoid counter overflows is to use sampling counters instead of


accumulative counters, as shown in Figure 4-5 (b). However, when sampling counters are used,


the collected statistics are represented as time series that have a large volume of data. The results









shown in Figure 4-2 suggest that directly employing runtime samples in phase classification is

less desirable. To address the scalability issue in characterizing large scale program phases using

workload execution statistics, wavelet based dimensionality reduction techniques can be applied

to extract the essential features of workload behavior from the sampled statistics. The

observations we made in previous sections motivate the use of DWT to absorb large volume

sampled raw data and produce highly efficient wavelet domain signatures for phase analysis, as

shown in Figure 4-5 (b).

Figure 4-6 further shows phase analysis accuracy after applying wavelet techniques on the

sampled workload statistics using sampling counters with different sizes. As can be seen,

sampling enables using counters with limited size to study large program phases. In general,

sampling can scale up naturally with the interval size as long as the sampled values do not

overflow the counters. Therefore, with an increasing mismatch between phase interval and

counter size, the sampling frequency is increased, resulting in an even higher volume of sampled

data. Using wavelet domain phase analysis can effectively infer program behavior from a large

set of data collected over a long time span, resulting in low COVs in phase analysis.
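The sampling-counter alternative can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the counter is read and cleared every fixed number of events (the period parameter), and the function name sample_counter is hypothetical. The resulting time series is what a DWT would then compress into a short signature.

```python
def sample_counter(events, bits, period):
    """Read and clear a `bits`-wide counter every `period` events.

    Returns the list of sampled values. Raises OverflowError if the counter
    would overflow between two reads, which signals that the sampling
    frequency must be increased.
    """
    limit = (1 << bits) - 1
    samples, value = [], 0
    for i, count in enumerate(events, start=1):
        value += count
        if value > limit:
            raise OverflowError("counter overflowed between samples")
        if i % period == 0:
            samples.append(value)
            value = 0
    return samples

events = [3] * 1024                  # 3072 events over the whole interval
samples = sample_counter(events, bits=8, period=64)
bits_if_accumulated = sum(events).bit_length()
```

Here an 8-bit counter sampled 16 times covers an interval that a single accumulative counter could only record with 12 bits; the price is a 16-element time series instead of one value, and that series grows as the interval lengthens.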

Workload Variability

As described earlier, our methods collect various program execution statistics and use them

to classify program execution into different phases. Such phase classification generally relies on

comparing the similarity of the collected statistics. Ideally, different runs of the same code

segment should be classified into the same phase. Existing phase detection techniques assume

that workloads have deterministic execution. On real systems, with operating system

interventions and other threads, applications manifest behavior that is not the same from run to

run. This variability can stem from changes in system state that alter cache, TLB or I/O behavior,

system calls or interrupts, and can result in noticeably different timing and performance behavior









[18, 32]. This cross-run variability can confuse similarity based phase detection. In order for a

phase analysis technique to be applicable on real systems, it should be able to perform robustly

under variability.

Program cross-run variability can be thought of as noise which is a random variance of a

measured statistic. There are many possible reasons for noisy data, such as

measurement/instrument errors and interventions of the operating systems. Removing this

variability from the collected runtime statistics can be considered as a process of denoising. In

this chapter, we explore using wavelets as an effective way to perform denoising. Due to the

vanishing moment property of the wavelets, only some wavelet coefficients are significant in

most cases. By retaining selective wavelet coefficients, a wavelet transform could be applied to

reduce the noise. The main idea of wavelet denoising is to transform the data into the wavelet

basis, where the large coefficients mainly contain the useful information and the smaller ones

represent noise. By suitably modifying the coefficients in the new basis, noise can be directly

removed from the data. The general de-noising procedure involves three steps: 1) decompose:

compute the wavelet decomposition of the original data; 2) threshold wavelet coefficients: select

a threshold and apply thresholding to the wavelet coefficients; and 3) reconstruct: compute

wavelet reconstruction using the modified wavelet coefficients. More details on the wavelet-

based denoising techniques can be found in [33].
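The three-step procedure (decompose, threshold, reconstruct) can be sketched as below. This is a minimal illustration and not the denoising setup used in this study: it uses a Haar transform in place of Daubechies-8 to stay self-contained, applies a single global soft threshold, and estimates the noise level from the median absolute value of the finest-scale detail coefficients, which is a common convention assumed here. All function names are hypothetical.

```python
import math
import random
from statistics import median

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """Step 1: decompose. Returns (approximation, list of detail levels)."""
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.insert(0, [(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details  # details[0] is the coarsest level

def haar_inverse(approx, details):
    """Step 3: reconstruct the signal from (possibly modified) coefficients."""
    approx = list(approx)
    for level in details:  # coarsest to finest
        nxt = []
        for s, d in zip(approx, level):
            nxt += [(s + d) / SQRT2, (s - d) / SQRT2]
        approx = nxt
    return approx

def denoise(signal):
    approx, details = haar_forward(signal)
    # Step 2: threshold. Noise level from the finest details, then one
    # global threshold sigma * sqrt(2 ln n) applied with soft shrinkage.
    sigma = median(abs(d) for d in details[-1]) / 0.6745
    threshold = sigma * math.sqrt(2.0 * math.log(len(signal)))
    shrunk = [[math.copysign(max(abs(d) - threshold, 0.0), d) for d in level]
              for level in details]
    return haar_inverse(approx, shrunk)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(1)
clean = [1.0] * 32 + [3.0] * 32                    # two synthetic "phases"
noisy = [x + rng.gauss(0.0, 0.2) for x in clean]
denoised = denoise(noisy)
roundtrip = haar_inverse(*haar_forward(noisy))
```

Without thresholding the transform is lossless (roundtrip reproduces noisy), so any change in the reconstructed data comes only from shrinking the small coefficients.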

To model workload runtime variability, we use additive noise models and randomly inject

noise into the time series that represents workload execution behavior. We vary the SNR (signal-

to-noise ratio) to simulate scenarios with different degrees of variability. To classify program

execution into phases, we generate a 16-dimension feature vector where each element contains










the average value of the collected program execution characteristic for each interval. The k-means

algorithm is then used for data clustering. Figure 4-7 illustrates the above described procedure.
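The additive variability model and the averaged feature vector can be sketched as follows. This is a hedged illustration: it assumes zero-mean Gaussian noise and interprets the SNR in decibels, neither of which is stated in the text, and the function names add_noise and feature_vector are hypothetical.

```python
import math
import random

def add_noise(series, snr_db, seed=0):
    """Return series + noise, with the noise power set by the target SNR.

    Assumes SNR is given in dB: SNR = 10 * log10(signal_power / noise_power).
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in series) / len(series)
    sigma = math.sqrt(signal_power / 10 ** (snr_db / 10.0))
    return [x + rng.gauss(0.0, sigma) for x in series]

def feature_vector(characteristics):
    """One element per characteristic: its average value over the interval."""
    return [sum(s) / len(s) for s in characteristics]

# Synthetic statistic for one interval, then the same data with variability.
s1 = [1.0 + 0.5 * math.sin(i / 10.0) for i in range(4096)]
s2 = add_noise(s1, snr_db=20)

noise_power = sum((b - a) ** 2 for a, b in zip(s1, s2)) / len(s1)
signal_power = sum(x * x for x in s1) / len(s1)
measured_snr_db = 10.0 * math.log10(signal_power / noise_power)
features = feature_vector([[1.0, 3.0], [2.0, 2.0]])
```

A lower SNR value produces proportionally stronger noise, so SNR=5 perturbs the statistics far more than SNR=20 under this reading.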

[Figure 4-7 diagram: sampled workload statistics S1(t) are combined with a workload variability model N(t) to give S2(t) = S1(t) + N(t); wavelet denoising of S2(t) gives D2(t); each data set goes through phase classification and the resulting COVs are compared]
Figure 4-7 Method for modeling workload variability

We use the Daubechies-8 wavelet with a global wavelet coefficients thresholding policy to

perform denoising. We then compare the phase classification COVs of using the original data,

the data with variability injected and the data after we perform denoising. Figure 4-8 shows our

experimental results.

[Figure 4-8 plot: COV (%) per benchmark; legible legend entries: Original, Noised(SNR=20), Denoised(SNR=20), Noised(SNR=5)]
Figure 4-8 Effect of using wavelet denoising to handle workload variability

The SNR=20 represents scenarios with a low degree of variability and the SNR=5 reflects

situations with a high degree of variability. As can be seen, introducing variability in workload

execution statistics reduces phase analysis accuracy. Wavelet denoising is capable of recovering

phase behavior from the noised data, resulting in higher phase analysis accuracy. Interestingly,

on some benchmarks (e.g. eon, mcf), the denoised data achieve better phase classification










accuracy than the original data. This is because in phase classification, randomly occurring peaks

in the gathered workload execution data could have a deleterious effect on the phase

classification results. Wavelet denoising smooths these irregular peaks and makes the phase

classification method more robust.

Various types of wavelet denoising can be performed by choosing different threshold

selection rules (e.g. rigrsure, heursure, sqtwolog and minimaxi), by performing hard (h) or soft

(s) thresholding, and by specifying a multiplicative threshold rescaling model (e.g. one, sln, and

mln).
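The difference between hard and soft thresholding, together with one of the selection rules named above, can be illustrated briefly. The sketch below is written from the standard definitions rather than taken from the MATLAB tool: the sqtwolog rule is taken to be the universal threshold sqrt(2 ln n), and the optional sigma argument stands in for a rescaling factor. Function names are hypothetical.

```python
import math

def hard_threshold(c, t):
    """Hard (h): keep a coefficient unchanged if it exceeds t, else zero it."""
    return c if abs(c) > t else 0.0

def soft_threshold(c, t):
    """Soft (s): zero small coefficients and shrink the rest toward zero by t."""
    return math.copysign(max(abs(c) - t, 0.0), c)

def sqtwolog(n, sigma=1.0):
    """Universal threshold sqrt(2 ln n), scaled by a noise level sigma."""
    return sigma * math.sqrt(2.0 * math.log(n))

coeffs = [3.0, -3.0, 0.5, -0.5]
hard = [hard_threshold(c, 1.0) for c in coeffs]
soft = [soft_threshold(c, 1.0) for c in coeffs]
t_1024 = sqtwolog(1024)
```

Hard thresholding preserves the magnitude of surviving coefficients, while soft thresholding trades a small bias for a smoother reconstruction.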

We compare the efficiency of different denoising techniques that have been implemented

into the MATLAB tool [34]. Due to the space limitation, only the results on benchmarks bzip2,

gcc and mcf are shown in Figure 4-9. As can be seen, different wavelet denoising schemes

achieve comparable accuracy in phase classification.


[Figure 4-9 plot: COV (%) for each wavelet denoising scheme; legend: bzip2, gcc, mcf]
Figure 4-9 Efficiency of different denoising schemes










CHAPTER 5
INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION

It has been well known to the processor design community that program runtime

characteristics exhibit significant variation. To obtain the dynamic behavior that programs

manifest on complex microprocessors and systems, architects resort to the detailed, cycle-

accurate simulations. Figure 5-1 illustrates the variation in workload dynamics for SPEC CPU

2000 benchmarks gap, crafty and vpr, within one of their execution intervals. The results show

the time-varying behavior of the workload performance (gap), power (crafty) and reliability

(vpr) metrics across simulations with different microarchitecture configurations.

[Figure 5-1 plots: three panels of a workload metric versus Samples (0 to 140), one trace per microarchitecture configuration]
Figure 5-1 Variation of workload performance, power and reliability dynamics

As can be seen, the manifested workload dynamics while executing the same code base

varies widely across processors with different configurations. As the number of parameters in

the design space increases, such variation in workload dynamics cannot be captured without using

slow, detailed simulations. However, using the simulation-based methods for architecture design

space exploration where numerous design parameters have to be considered is prohibitively

expensive.

Recently, researchers have proposed several predictive models [20-25] to reason about workload

aggregated behavior at the architecture design stage. Among them, linear regression and neural

network models have been the most used approaches. Linear models are straightforward to

understand and provide accurate estimates of the significance of parameters and their









interactions. However, they are usually inadequate for modeling the non-linear dynamics of real-

world workloads which exhibit widely different characteristics and complexity. Of the non-linear

methods, neural network models can accurately predict the aggregated program statistics (e.g.

CPI of the entire workload execution). Such models are termed as global models as only one

model is used to characterize the measured programs. The monolithic global models are

incapable of capturing and revealing program dynamics which contain interesting fine-grain

behavior. On the other hand, a workload may produce different dynamics when the underlying

architecture configurations have changed. Therefore, new methods are needed for accurately

predicting complex workload dynamics.

To overcome the problems of monolithic, global predictive models, we propose a novel

scheme that incorporates wavelet-based multiresolution decomposition techniques, which can

produce a good local representation of the workload behavior in both time and frequency

domains. The proposed analytical models, which combine wavelet-based multiscale data

representation and neural network based regression prediction, can efficiently reason about

program dynamics without resorting to detailed simulations. With our schemes, the complex

workload dynamics are decomposed into a series of wavelet coefficients. In the transform domain,

each individual wavelet coefficient is modeled by a separate neural network. We extensively

evaluate the efficiency of using wavelet neural networks for predicting the dynamics that the

SPEC CPU 2000 benchmarks manifest on high performance microprocessors with a

microarchitecture design space that consists of 9 key parameters. Our results show that the

models achieve high accuracy in forecasting workload dynamics across a large microarchitecture

design space.
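The idea of predicting each wavelet coefficient separately and then reconstructing the workload dynamics can be sketched as follows. This is only a structural illustration under several assumptions: plain Python callables stand in for the trained neural networks, a Haar inverse transform is used for brevity, coefficients that are not modeled are set to zero, and the configuration keys fetch_width and l2_mb are hypothetical parameter names.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_idwt(coeffs):
    """Inverse Haar DWT for coefficients ordered coarse to fine."""
    approx, pos = list(coeffs[:1]), 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(approx)]
        approx = [v for s, d in zip(approx, level)
                  for v in ((s + d) / SQRT2, (s - d) / SQRT2)]
        pos += len(level)
    return approx

def predict_dynamics(config, coeff_models, n_samples):
    """Predict a time series of n_samples points (a power of two).

    coeff_models[i] maps a configuration to the i-th wavelet coefficient;
    coefficients without a model are assumed to be zero.
    """
    coeffs = [model(config) for model in coeff_models]
    coeffs += [0.0] * (n_samples - len(coeffs))
    return haar_idwt(coeffs)

# Hypothetical stand-ins for two trained per-coefficient networks.
models = [lambda cfg: 2.0 * cfg["fetch_width"],
          lambda cfg: 0.5 * cfg["l2_mb"]]
trace = predict_dynamics({"fetch_width": 4, "l2_mb": 2}, models, 8)
```

In this toy case two coefficients already produce a trace with two distinct levels, which illustrates how a few coefficients can describe coarse time-varying behavior that a single aggregate number cannot.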










In this chapter, we propose the use of wavelet neural networks to build accurate predictive

models for workload dynamic driven microarchitecture design space exploration. We show that

wavelet neural networks can be used to accurately and cost-effectively capture complex workload

dynamics across different microarchitecture configurations. We evaluate the efficiency of using

the proposed techniques to predict workload dynamic behavior in performance, power and

reliability domains. We perform extensive simulations to analyze the impact of wavelet

coefficient selection and sampling rate on prediction accuracy and identify microarchitecture

parameters that significantly affect workload dynamic behavior.

We present a case study of using workload dynamic aware predictive models to quickly

estimate the efficiency of scenario-driven architecture optimizations across different domains.

Experimental results show that the predictive models are highly efficient in rendering workload

execution scenarios.

Neural Network

An Artificial Neural Network (ANN) [42] is an information processing paradigm that is

inspired by the way biological nervous systems process information. It is composed of a set of

interconnected processing elements working in unison to solve problems.



[Figure: input layer (X1 ... Xn), hidden layer (H1(x), H2(x), ...) and output layer; inset shows the Radial Basis Function (RBF) response versus distance]

Figure 5-2 Basic architecture of a neural network









The most common type of neural network (Figure 5-2) consists of three layers of units: a

layer of input units is connected to a layer of hidden units, which is connected to a layer of

output units. The input is fed into the network through the input units. Each hidden unit receives the

entire input vector and generates a response. The output of a hidden unit is determined by the

input-output transfer function that is specified for that unit. Commonly used transfer functions

include the sigmoid, linear threshold function and Radial Basis Function (RBF) [35]. The ANN

output, which is determined by the output unit, is computed using the responses of the hidden

units and the weights between the hidden and output units. Neural networks outperform linear

models in capturing complex, non-linear relations between input and output, which makes them a

promising technique for tracking and forecasting complex behavior.

In this chapter, we use the RBF transfer function to model and estimate important wavelet

coefficients on unexplored design spaces because of its superior ability to approximate complex

functions. The basic architecture of an RBF network with n-inputs and a single output is shown

in Figure 5-2. The nodes in adjacent layers are fully connected. In a linear single-layer neural

network model, a 1-dimensional function f is expressed as a linear combination of a set of n fixed

functions, often called basis functions by analogy with the concept of a vector being composed

of a linear combination of basis vectors.

$$ f(x) = \sum_{j=1}^{n} w_j h_j(x) \tag{5-1} $$

Here $w \in \mathbb{R}^n$ is the adaptable or trainable weight vector and $\{h_j(\cdot)\}_{j=1}^{n}$ are fixed basis functions

or the transfer function of the hidden units. The flexibility of f, its ability to fit many different

functions, derives only from the freedom to choose different values for the weights. The basis









functions and any parameters which they might contain are fixed. If the basis functions can

change during the learning process, then the model is nonlinear.

Radial functions are a special class of function. Their characteristic feature is that their

response decreases (or increases) monotonically with distance from a central point. The center,

the distance scale, and the precise shape of the radial function are parameters of the model, all

fixed if it is linear. A typical radial function is the Gaussian which, in the case of a scalar input,

is


$$ h(x) = \exp\left( -\frac{(x-c)^2}{r^2} \right) \tag{5-2} $$

Its parameters are its center c and its radius r. Radial functions are simply a class of

functions. In principle, they could be employed in any sort of model, linear or nonlinear, and any

sort of network (single-layer or multi-layer).
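As an illustration of Equations 5-1 and 5-2, the following minimal sketch evaluates a linear RBF model with Gaussian basis functions for a scalar input. The function names and the two-unit example values are assumptions made for illustration; they are not taken from the models built in this chapter.

```python
import math


def gaussian_rbf(x, center, radius):
    """Gaussian radial function for a scalar input (form of Eq. 5-2)."""
    return math.exp(-((x - center) ** 2) / (radius ** 2))


def rbf_model(x, weights, centers, radii):
    """Weighted sum of fixed Gaussian basis functions (form of Eq. 5-1)."""
    return sum(w * gaussian_rbf(x, c, r)
               for w, c, r in zip(weights, centers, radii))


# Hypothetical two-unit model; the numbers are illustrative only.
weights = [1.0, 0.5]
centers = [0.0, 2.0]
radii = [1.0, 1.0]
print(rbf_model(0.0, weights, centers, radii))
```

Because the centers and radii are held fixed, the model stays linear in the weights, which is the property the text relies on.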

The training of the RBF network involves selecting the center locations and radii (which

are eventually used to determine the weights) using a regression tree. A regression tree

recursively partitions the input data set into subsets with decision criteria. As a result, there will

be a root node, non-terminal nodes (having sub nodes) and terminal nodes (having no sub nodes)

which are associated with an input dataset. Each node contributes one unit to the RBF network's

center and radius vectors. The selection of RBF centers is performed by recursively parsing

regression tree nodes using a strategy proposed in [35].

Combining Wavelet and Neural Network for Workload Dynamics Prediction

We view workload dynamics as a time series produced by the processor which is a

nonlinear function of its design parameter configuration. Instead of predicting this function at

every sampling point, we employ wavelets to approximate it. Previous work [21, 23, 25] shows









that neural networks can accurately predict aggregated workload behavior during design space

exploration. Nevertheless, the monolithic global neural network models lack the capability of

revealing complex workload dynamics. To overcome this disadvantage, we propose using

wavelet neural networks that incorporate multiscale wavelet analysis into a set of neural

networks for workload dynamics prediction. The wavelet transform is a very powerful tool for

dealing with dynamic behavior since it captures both workload global and local behavior using a

set of wavelet coefficients. The short-term workload characteristics are decomposed into the lower

scales of wavelet coefficients (high frequencies) which are utilized for detailed analysis and

prediction, while the global workload behavior is decomposed into higher scales of wavelet

coefficients (low frequencies) that are used for the analysis and prediction of slow trends in the

workload execution. Collectively, these coordinated scales of time and frequency provide an

accurate interpretation of workload dynamics. Our wavelet neural networks use a separate RBF

neural network to predict individual wavelet coefficients at different scales. The separate

predictions of the wavelet coefficients proceed independently. Predicting each wavelet

coefficient by a separate neural network simplifies the training task of each sub-network. The

prediction results for the wavelet coefficients can be combined directly by the inverse wavelet

transform to predict the workload dynamics.

Figure 5-3 shows our hybrid neuro-wavelet scheme for workload dynamics prediction.

Given the observed workload dynamics on training data, our aim is to predict workload dynamic

behavior under different architecture configurations. The hybrid scheme basically involves three

stages. In the first stage, the time series is decomposed by wavelet multiresolution analysis. In

the second stage, each wavelet coefficient is predicted by a separate ANN, and in the third stage,

the approximated time series is recovered from the predicted wavelet coefficients.
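The three stages can be sketched as follows. This is only an illustration under stated assumptions: it uses an orthonormal Haar wavelet (this section does not name the wavelet family), and the per-coefficient "models" are stand-in callables rather than trained RBF networks. All function names are hypothetical.

```python
import math


def haar_forward(signal):
    """Stage 1: full multilevel orthonormal Haar transform (length must be a power of two)."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = approx + detail
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Stage 3: inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / math.sqrt(2))
            merged.append((a - d) / math.sqrt(2))
        out[:2 * n] = merged
        n *= 2
    return out


def predict_trace(design_point, models, trace_length):
    """Stage 2 then stage 3: one model per retained coefficient, then reconstruction."""
    coeffs = [0.0] * trace_length
    for index, model in models.items():
        coeffs[index] = model(design_point)
    return haar_inverse(coeffs)


# Stand-in "models" that simply return the observed coefficients (illustrative only).
trace = [4.0, 2.0, 5.0, 5.0]
observed = haar_forward(trace)
models = {i: (lambda design_point, c=c: c) for i, c in enumerate(observed)}
rebuilt = predict_trace(None, models, len(trace))
print(rebuilt)
```

In a real setting each entry of `models` would be a regression model trained on the coefficient values observed at the training design points.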










[Figure: panels "Workload Dynamics (Time Domain)" and "Synthesized Workload Dynamics (Time Domain)"]

Figure 5-3 Using wavelet neural network for workload dynamics prediction

Each RBF neural network receives the entire microarchitectural design space vector and

predicts a wavelet coefficient. The training of an RBF network involves determining the center

point and radius of each RBF and the weight of each RBF, which together determine the predicted wavelet
coefficient.

Experimental Methodology
We evaluate the efficiency of using wavelet neural networks to explore workload dynamics

in performance, power and reliability domains during microarchitecture design space exploration.

We use a unified, detailed microarchitecture simulator in our experiments. Our simulation

framework, built using a heavily modified and extended version of the SimpleScalar tool set,

models pipelined, multiple-issue, out-of-order execution microprocessors with multi-level

caches. Our framework uses a Wattch-based power model [36]. In addition, we implemented the

Architecture Vulnerability Factor (AVF) analysis methods proposed in [37, 38] to estimate

processor microarchitecture vulnerability to transient faults. A microarchitecture structure's AVF

refers to the probability that a transient fault in that hardware structure will result in incorrect

program results. The AVF metric can be used to estimate how vulnerable the hardware is to soft









errors during program execution. Table 5-1 summarizes the baseline machine configurations of

our simulator.

Table 5-1 Simulated machine configuration
Parameter               Configuration
Processor Width         8-wide fetch/issue/commit
Issue Queue             96
ITLB                    128 entries, 4-way, 200 cycle miss
Branch Predictor        2K entries Gshare, 10-bit global history
BTB                     2K entries, 4-way
Return Address Stack    32 entries RAS
L1 Instruction Cache    32KB, 2-way, 32 Byte/line, 2 ports, 1 cycle access
ROB Size                96 entries
Load/Store Queue        48 entries
Integer ALU             8 I-ALU, 4 I-MUL/DIV, 4 Load/Store
FP ALU                  8 FP-ALU, 4 FP-MUL/DIV/SQRT
DTLB                    256 entries, 4-way, 200 cycle miss
L1 Data Cache           64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle access
L2 Cache                unified 2MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access           64 bit wide, 200 cycles access latency

We perform our analysis using twelve SPEC CPU 2000 benchmarks: bzip2, crafty, eon,

gap, gcc, mcf, parser, perlbmk, twolf, swim, vortex and vpr. We use the SimPoint tool to pick the

most representative simulation point for each benchmark (with full reference input set) and each

benchmark is fast-forwarded to its representative point before detailed simulation takes place.

Each simulation contains 200M instructions. In this chapter, we consider a design space that

consists of 9 microarchitectural parameters (see Table 5-2) of the superscalar architecture.

These microarchitectural parameters have been shown to have the largest impact on processor

performance [21]. The ranges for these parameters were set to include both typical and feasible

design points within the explored design space. Using the detailed, cycle-accurate simulations,

we measure processor performance, power and reliability characteristics on all design points

within both training and testing data sets. We build a separate model for each program and use

the model to predict workload dynamics in performance, power and reliability domains at









unexplored points in the design space. The training data set is used to build the wavelet-based

neural network models. An estimate of the model's accuracy is obtained by using the design

points in the testing data set.

Table 5-2 Microarchitectural parameter ranges used for generating train/test data
Parameter      Train ranges                Test ranges           # of Levels
Fetch_width    2, 4, 8, 16                 2, 8                  4
ROB_size       96, 128, 160                128, 160              3
IQ_size        32, 64, 96, 128             32, 64                4
LSQ_size       16, 24, 32, 64              16, 24, 32            4
L2_size        256, 1024, 2048, 4096 KB    256, 1024, 4096 KB    4
L2_lat         8, 12, 14, 16, 20           8, 12, 14             5
il1_size       8, 16, 32, 64 KB            8, 16, 32 KB          4
dl1_size       8, 16, 32, 64 KB            16, 32, 64 KB         4
dl1_lat        1, 2, 3, 4                  1, 2, 3               4

To build a representative design space, one needs to ensure that the sample points are spread

throughout the design space, yet are unique and few enough to keep the model building

cost low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our

sampling strategy since it provides better coverage compared to a naive random sampling

scheme. We generate multiple LHS matrices and apply a space-filling metric called L2-star

discrepancy [40] to each LHS matrix to find the most representative design space sample,

which is the one with the lowest L2-star discrepancy. We use a randomly and independently

generated set of test data points to empirically estimate the predictive accuracy of the resulting

models. We use 200 training points and 50 test points for workload dynamics prediction since our

study shows that this offers a good tradeoff between simulation time and prediction accuracy for the

design space we considered. In our study, each workload dynamic trace is represented by 128

samples.
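The sampling procedure can be sketched as below. This is a plain Latin hypercube generator combined with Warnock's closed form for the squared L2-star discrepancy, not the specific LHS variant used in this work. The dictionary keys, function names, seed and the small sample count in the demo are assumptions for illustration; the level values are the training ranges of Table 5-2.

```python
import random

# Training levels from Table 5-2; the key names are illustrative.
TRAIN_LEVELS = {
    "fetch_width": [2, 4, 8, 16],
    "rob_size": [96, 128, 160],
    "iq_size": [32, 64, 96, 128],
    "lsq_size": [16, 24, 32, 64],
    "l2_size_kb": [256, 1024, 2048, 4096],
    "l2_lat": [8, 12, 14, 16, 20],
    "il1_size_kb": [8, 16, 32, 64],
    "dl1_size_kb": [8, 16, 32, 64],
    "dl1_lat": [1, 2, 3, 4],
}


def latin_hypercube(n_points, n_dims, rng):
    """Samples in [0, 1)^n_dims with exactly one sample per stratum in every dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_points for s in strata])
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_points)]


def l2_star_discrepancy_sq(points):
    """Squared L2-star discrepancy (Warnock's formula); lower means better space filling."""
    n, d = len(points), len(points[0])
    single = 0.0
    for p in points:
        prod = 1.0
        for x in p:
            prod *= (1.0 - x * x) / 2.0
        single += prod
    pairs = 0.0
    for p in points:
        for q in points:
            prod = 1.0
            for x, y in zip(p, q):
                prod *= 1.0 - max(x, y)
            pairs += prod
    return (1.0 / 3.0) ** d - 2.0 * single / n + pairs / (n * n)


def best_lhs(n_points, n_dims, n_candidates, seed=0):
    """Generate several LHS matrices and keep the one with the lowest discrepancy."""
    rng = random.Random(seed)
    candidates = [latin_hypercube(n_points, n_dims, rng) for _ in range(n_candidates)]
    return min(candidates, key=l2_star_discrepancy_sq)


def to_design_point(unit_point, levels=TRAIN_LEVELS):
    """Map one sample in the unit hypercube onto discrete parameter levels."""
    return {name: values[min(int(u * len(values)), len(values) - 1)]
            for u, (name, values) in zip(unit_point, levels.items())}


plan = best_lhs(n_points=20, n_dims=len(TRAIN_LEVELS), n_candidates=5)
designs = [to_design_point(p) for p in plan]
print(designs[0])
```

The sketch does not enforce uniqueness after the samples are mapped to discrete levels; a duplicate check would be needed to meet that requirement.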

Predicting each wavelet coefficient by a separate neural network simplifies the learning

task. Since complex workload dynamics can be captured using a limited number of wavelet










coefficients, the total size of wavelet neural networks can be small. Because small-magnitude

wavelet coefficients contribute less to the reconstructed data, we opt to only

predict a small set of important wavelet coefficients.


[Figure: color map of coefficient rank; x-axis: Wavelet Coefficient Index; annotation marks the wavelet coefficients with large magnitude]

Figure 5-4 Magnitude-based ranking of 128 wavelet coefficients

Specifically, we consider the following two schemes for selecting important wavelet

coefficients for prediction: (1) magnitude-based: select the largest k coefficients and

approximate the rest with 0 and (2) order-based: select the first k coefficients and approximate

the rest with 0. In this study, we choose to use the magnitude-based scheme since it always

outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection

scheme, it is essential that the significance of the selected wavelet coefficients does not change

drastically across the design space. Figure 5-4 illustrates the magnitude-based ranking (shown as

a color map where red indicates high ranks and blue indicates low ranks) of a total of 128 wavelet

coefficients (decomposed from benchmark gcc dynamics) across 50 different microarchitecture

configurations. As can be seen, the top ranked wavelet coefficients largely remain consistent

across different processor configurations.
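The two selection schemes can be illustrated with a short sketch. The function names and the example coefficient vector are hypothetical; the retained energy (sum of squared coefficients) is used here only as a simple way to compare the two schemes.

```python
def select_by_magnitude(coeffs, k):
    """Keep the k largest-magnitude coefficients and approximate the rest with 0."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(ranked[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


def select_by_order(coeffs, k):
    """Keep the first k coefficients and approximate the rest with 0."""
    return [c if i < k else 0.0 for i, c in enumerate(coeffs)]


def energy(coeffs):
    """Sum of squared coefficients."""
    return sum(c * c for c in coeffs)


# Illustrative coefficient vector, not measured data.
coeffs = [8.0, -0.5, 3.0, 0.1, -6.0, 0.2, 0.0, 1.0]
by_magnitude = select_by_magnitude(coeffs, 3)
by_order = select_by_order(coeffs, 3)
print(by_magnitude, energy(by_magnitude))
print(by_order, energy(by_order))
```

For this vector the magnitude-based choice keeps indices 0, 2 and 4 and retains more energy than the order-based choice, which keeps indices 0, 1 and 2.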










Evaluation and Results

In this section, we present detailed experimental results on using wavelet neural networks to

predict workload dynamics in performance, reliability and power domains. The workload

dynamic prediction accuracy measure is the mean square error (MSE) defined as follows:

$$ \mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} \left( x(k) - \hat{x}(k) \right)^2 \tag{5-3} $$

where x(k) is the actual value, $\hat{x}(k)$ is the predicted value and N is the total number of

samples. As prediction accuracy increases, the MSE becomes smaller.
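A minimal sketch of the accuracy measure follows, assuming the standard squared-difference form of the mean square error. The function name and the example traces are illustrative.

```python
def mean_square_error(actual, predicted):
    """Average of squared differences between actual and predicted samples."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("traces must be non-empty and of equal length")
    return sum((x - p) ** 2 for x, p in zip(actual, predicted)) / len(actual)


# Illustrative four-sample traces, not measured data.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.5, 2.5, 4.0]
print(mean_square_error(actual, predicted))  # 0.125
```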




[Figure: MSE boxplots per benchmark (bzip2, crafty, eon, gap, gcc, mcf, parser, perlbmk, swim, twolf, vortex, vpr)]

Figure 5-5 MSE boxplots of workload dynamics prediction














The workload dynamics prediction accuracies in performance, power and reliability

domains are plotted as boxplots (Figure 5-5). Boxplots are graphical displays that measure

location (median) and dispersion (interquartile range), identify possible outliers, and indicate the

symmetry or skewness of the distribution. The central box shows the data between "hinges"

which are approximately the first and third quartiles of the MSE values. Thus, about 50% of the

data are located within the box and its height is equal to the interquartile range. The horizontal

line in the interior of the box is located at the median of the data; it shows the center of the

distribution for the MSE values. The whiskers (the dotted lines extending from the top and

bottom of the box) extend to the extreme values of the data or a distance 1.5 times the

interquartile range from the median, whichever is less. The outliers are marked as circles. In

Figure 5-5, the line with diamond-shaped markers indicates the statistical average of MSE across

all test cases. Figure 5-5 shows that the performance model achieves median errors ranging from

0.5 percent (swim) to 8.6 percent (mcf) with an overall median error across all benchmarks of 2.3

percent. As can be seen, even though the maximum error at any design point for any benchmark

is 30%, most benchmarks show MSE less than 10%. This indicates that our proposed neuro-

wavelet scheme can forecast the dynamic behavior of program performance characteristics with

high accuracy. Figure 5-5 shows that power models are slightly less accurate with median errors

ranging from 1.3 percent (vpr) to 4.9 percent (crafty) and an overall median of 2.6 percent. The

power prediction error reaches a maximum of 35%. These errors are much smaller in the reliability

domain.

In general, the workload dynamic prediction accuracy is increased when more wavelet

coefficients are involved. However, the complexity of the predictive models is proportional to

the number of wavelet coefficients. The cost-effective models should provide high prediction










accuracy while maintaining low complexity. Figure 5-6 shows the trend of prediction accuracy

(averaged over all benchmarks) when various numbers of wavelet coefficients are used.

[Figure: x-axis: Number of Wavelet Coefficients (16, 32, 64, 96, 128); series: CPI, Power, AVF]

Figure 5-6 MSE trends with increased number of wavelet coefficients

As can be seen, for the programs we studied, a set of 16 wavelet coefficients

combines good accuracy with low model complexity; increasing the number of wavelet

coefficients beyond this point reduces the error at a lower rate. This is because wavelets provide a

good time and locality characterization capability and most of the energy is captured by a limited

set of important wavelet coefficients. Using fewer parameters than other methods, the

coordinated wavelet coefficients provide interpretation of the series structures across scales of

time and frequency domains. The capability of using a limited set of wavelet coefficients to

capture workload dynamics varies with resolution level.


[Figure: x-axis: Number of Samples (64, 128, 256, 512, 1024); series: CPI, Power, AVF]

Figure 5-7 MSE trends with increased sampling frequency









Figure 5-7 illustrates the MSE (averaged over all benchmarks) yielded by predictive

models that use 16 wavelet coefficients when the number of samples varies from 64 to 1024. As

the sampling frequency increases, using the same number of wavelet coefficients is less accurate

in terms of capturing workload dynamic behavior. As can be seen, the increase of MSE is not

significant. This suggests that the proposed schemes can capture workload dynamic behavior

with increasing complexity.

Our RBF neural networks were built using a regression tree based method. In the

regression tree algorithm, all input microarchitecture parameters were ranked based on either

split order or split frequency. The microarchitecture parameters which cause the most output

variation tend to be split earliest and most often in the constructed regression tree. Therefore,

the microarchitecture parameters that largely determine the value of a wavelet coefficient are located

higher in the regression tree than the others and are split more often.
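The intuition behind split order can be sketched with a simplified stand-in: rank each parameter by the largest reduction in sum of squared errors that a single threshold split on it achieves. This is not the regression tree construction of [35]; the function names and the four-point data set are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split_gain(xs, ys):
    """Largest SSE reduction obtainable from one threshold split on a single parameter."""
    total = sse(ys)
    best = 0.0
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        best = max(best, total - sse(left) - sse(right))
    return best


def rank_parameters(samples, targets):
    """Order parameter names by the gain of their best single split, largest first."""
    gains = {name: best_split_gain(xs, targets) for name, xs in samples.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Illustrative data: the target coefficient depends mostly on fetch width.
samples = {"fetch_width": [2, 2, 8, 8], "rob_size": [96, 128, 96, 128]}
targets = [1.0, 1.1, 3.0, 3.1]
print(rank_parameters(samples, targets))
```

In this toy data set the parameter that explains most of the output variation is ranked first, which mirrors why such parameters tend to appear near the root of the tree.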

[Figure: star plots per benchmark; panels: CPI, Power, AVF; spokes: microarchitecture design parameters]

Figure 5-8 Roles of microarchitecture design parameters









We present in Figure 5-8 (shown as star plot) the initial and most frequent splits within the

regression trees that model the most significant wavelet coefficients. A star plot [41] is a

graphical data analysis method for representing the relative behavior of all variables in a

multivariate data set. The star plot consists of a sequence of equi-angular spokes, called radii,

with each spoke representing one of the variables. The data length of a spoke is proportional to

the magnitude of the variable for the data point relative to the maximum magnitude of the

variable across all data points. From the star plot, we can obtain information such as: What

variables are dominant for a given dataset? Which observations show similar behavior? For

example, on benchmark gcc, Fetch, dl1 and LSQ have significant roles in predicting dynamic

behavior in the performance domain while ROB, Fetch and dl1_lat largely affect reliability domain

workload dynamic behavior. For the benchmark gcc, the most frequently involved

microarchitecture parameters in regression tree constructions are ROB, LSQ, L2 and L2_lat in

performance domain and LSQ and Fetch in reliability domain.

Compared with models that only predict workload aggregated behavior, our proposed

methods can forecast workload runtime execution scenarios. The feature is essential if the

predictive models are employed to trigger runtime dynamic management mechanisms for power

and reliability optimizations. Inadequate workload worst-case scenario predictions could make

microprocessors fail to meet the desired power and reliability targets. On the contrary, false

alarms caused by over-prediction of the worst-case scenarios can trigger responses too frequently,

resulting in significant overhead. In this section, we study the suitability of using the proposed

schemes for workload execution scenario based classification. Specifically, for a given workload

characteristics threshold, we calculate how many sampling points in a trace that represents

workload dynamics are above or below the threshold. We then apply the same calculation to the










predicted workload dynamics trace. We use the directional symmetry (DS) metric, i.e., the

percentage of correctly predicted directions with respect to the target variable, defined as


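$$ \mathrm{DS} = \frac{1}{N} \sum_{k=1}^{N} \varphi\left( x(k), \hat{x}(k) \right) \tag{5-4} $$

where $\varphi(\cdot) = 1$ if $x$ and $\hat{x}$ are both above or below the threshold and $\varphi(\cdot) = 0$ otherwise.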

Thus, the DS provides a measure of the number of times the sign of the target is correctly

forecast. For example, DS=50% implies that the predicted direction was correct for half of

the predictions. In this work, we set three threshold levels (named 1Q, 2Q and 3Q in

Figure 5-9) between the max and min values in each trace as follows, where 1Q is the lowest threshold level

and 3Q is the highest threshold level.

1Q = MIN + (MAX-MIN)*(1/4)
2Q = MIN + (MAX-MIN)*(2/4)
3Q = MIN + (MAX-MIN)*(3/4)

[Figure: a trace between MIN and MAX with the 1Q, 2Q and 3Q threshold levels marked]

Figure 5-9 Threshold-based workload execution scenarios
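The threshold levels and the directional symmetry metric can be sketched as follows. The sketch assumes that DS is the fraction of samples on the same side of the threshold in both traces, that a sample equal to the threshold counts as not above it, and that the thresholds are derived from the actual trace. The function names and example traces are illustrative.

```python
def threshold_levels(trace):
    """Return the 1Q, 2Q and 3Q levels between the min and max of a trace."""
    lo, hi = min(trace), max(trace)
    return [lo + (hi - lo) * q / 4 for q in (1, 2, 3)]


def directional_symmetry(actual, predicted, threshold):
    """Fraction of samples where actual and predicted fall on the same side of the threshold."""
    hits = sum(1 for x, p in zip(actual, predicted)
               if (x > threshold) == (p > threshold))
    return hits / len(actual)


# Illustrative traces, not measured data.
actual = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [0.5, 0.5, 2.5, 3.5, 3.5]
q1, q2, q3 = threshold_levels(actual)
ds = directional_symmetry(actual, predicted, q2)
print(q1, q2, q3, ds, 1 - ds)
```

The last printed value is the directional asymmetry, 1 - DS, for the 2Q threshold.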

Figure 5-10 shows the results of threshold-based workload dynamic behavior classification.

The results are presented as directional asymmetry, which can be expressed as 1 - DS. As can be

seen, our wavelet-based RBF neural networks not only effectively capture workload

dynamics but also accurately classify workload execution into different scenarios. This

suggests that proactive dynamic power and reliability management schemes can be built using

the proposed models. For instance, given a power/reliability threshold, our wavelet RBF neural

networks can be used to forecast workload execution scenarios. If the predicted workload











characteristics exceed the threshold level, processors can start to respond before

power/reliability reaches or surpasses the threshold level.


[Figure: directional asymmetry per benchmark at the 1Q, 2Q and 3Q thresholds; panels: CPI, Power, AVF]

Figure 5-10 Threshold-based workload execution


Figure 5-11 further illustrates detailed workload execution scenario predictions on


benchmark bzip2. Both simulation and prediction results are shown. The predicted results closely


track the varied program dynamic behavior in different domains.


[Figure: simulation and prediction traces with threshold versus Samples; panels: (a) performance, (b) power, (c) reliability]

Figure 5-11 Threshold-based workload scenario prediction


Workload Dynamics Driven Architecture Design Space Exploration

In this section, we present a case study to demonstrate the benefit of applying workload


dynamics prediction in early architecture design space exploration. Specifically, we show that


workload dynamics prediction models can effectively forecast the worst-case operation












conditions to soft error vulnerability and accurately estimate the efficiency of soft error

vulnerability management schemes.

Because of technology scaling, radiation-induced soft errors contribute more and more to

the failure rate of CMOS devices. Therefore, soft error rate is an important reliability issue in

deep-submicron microprocessor design. Processor microarchitecture soft error vulnerability

exhibits significant runtime variation, and it is neither economical nor practical to design

fault-tolerant schemes that target the worst-case operation condition. Dynamic Vulnerability

Management (DVM) refers to a set of strategies to control hardware runtime soft-error

susceptibility under a tolerable threshold. DVM allows designers to achieve higher dependability

on hardware designed for a lower reliability setting. If a particular execution period exceeds the

pre-defined vulnerability threshold, a DVM response (Figure 5-12) will work to reduce hardware

vulnerability.

[Figure: vulnerability over time, showing the designed-for reliability capacity with and without DVM, the DVM trigger level, the DVM engaged and disengaged intervals, and the DVM performance overhead]

Figure 5-12 Dynamic Vulnerability Management

A primary goal of DVM is to maintain vulnerability within a pre-defined reliability

target during the entire program execution. The DVM will be triggered once the hardware soft

error vulnerability exceeds the predefined threshold. Once the trigger goes on, a DVM response

begins. Depending on the type of response chosen, there may be some performance degradation.

A DVM response can be turned off as soon as the vulnerability drops below the threshold. To











successfully achieve the desired reliability target and effectively mitigate the overhead of DVM,

architects need techniques to quickly infer application worst-case operation conditions across

design alternatives and accurately estimate the efficiency of DVM schemes at an early design stage.

We developed a DVM scheme to manage runtime instruction queue (IQ) vulnerability to

soft error.


DVM_IQ
{
    ACE bits counter updating();
    if current context has L2 cache misses
    then stall dispatching instructions for current context;

    every (sample_interval/5) cycles
    {
        if online IQ AVF > trigger threshold
        then wqratio = wqratio/2;
        else wqratio = wqratio+1;
    }
    if (ratio of waiting instruction # to ready instruction # > wqratio)
    then stall dispatching instructions;
}




Figure 5-13 IQ DVM Pseudo Code

Figure 5-13 shows the pseudo code of our DVM policy. The DVM scheme computes

online IQ AVF to estimate runtime microarchitecture vulnerability. The estimated AVF is

compared against a trigger threshold to determine whether it is necessary to enable a response

mechanism. To reduce IQ soft error vulnerability, we throttle the instruction dispatching from

the ROB to the IQ upon an L2 cache miss. Additionally, we sample the IQ AVF at a finer

granularity and compare the sampled AVF with the trigger threshold. If the IQ AVF exceeds the

trigger threshold, a parameter wqratio, which specifies the ratio of number of waiting

instructions to that of ready instructions in the IQ, is updated. The purpose of setting this

parameter is to maintain the performance by allowing an appropriate fraction of waiting

instructions in the IQ to exploit ILP. By maintaining a desired ratio between the waiting










instructions and the ready instructions, vulnerability can be reduced at negligible performance

cost. The wqratio update is triggered by the estimated IQ AVF. In our DVM design, wqratio is

adapted through slow increases and rapid decreases in order to ensure a quick response to a

vulnerability emergency.
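The wqratio adaptation described above can be sketched as a small controller. The halve-on-trigger and increment-otherwise rule follows Figure 5-13; the class name, the initial wqratio value and the handling of an IQ with no ready instructions are assumptions added for illustration.

```python
class IqDvmController:
    """Sketch of the wqratio update rule: rapid decrease, slow increase."""

    def __init__(self, trigger_threshold, initial_wqratio=4.0):
        self.trigger_threshold = trigger_threshold
        self.wqratio = initial_wqratio  # initial value is an assumption

    def on_sample(self, online_iq_avf):
        """Called at each fine-grained sampling point."""
        if online_iq_avf > self.trigger_threshold:
            self.wqratio = self.wqratio / 2   # rapid decrease on an emergency
        else:
            self.wqratio = self.wqratio + 1   # slow increase otherwise

    def should_stall_dispatch(self, waiting, ready, l2_miss_pending):
        """Decide whether dispatch into the IQ should be stalled this cycle."""
        if l2_miss_pending:
            return True
        if ready == 0:
            # Assumption: with no ready instructions, stall if anything is waiting.
            return waiting > 0
        return waiting / ready > self.wqratio


# Illustrative run with a trigger threshold of 0.3.
ctrl = IqDvmController(trigger_threshold=0.3, initial_wqratio=4.0)
for avf in (0.5, 0.5, 0.1):
    ctrl.on_sample(avf)
print(ctrl.wqratio)  # 4.0 -> 2.0 -> 1.0 -> 2.0
print(ctrl.should_stall_dispatch(waiting=6, ready=2, l2_miss_pending=False))
```

Two consecutive emergencies cut the allowed ratio to a quarter of its initial value, while a single quiet sample restores it by only one, which reflects the asymmetric adaptation described in the text.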

We built workload dynamics predictive models which incorporate DVM as a new design

parameter. Therefore, our models can predict workload execution scenarios with/without DVM

feature across different microarchitecture configurations. Figure 5-14 shows the results of using

the predictive models to forecast IQ AVF on benchmark gcc across two microarchitecture

configurations.

[Figure: IQ AVF versus Samples with the DVM target marked; panels: (a) Scenario 1, (b) Scenario 2, each shown with DVM disabled and DVM enabled]

Figure 5-14 Workload dynamic prediction with scenario-based architecture optimization


We set the DVM target to 0.3, which means the DVM policy, when enabled, should

maintain the IQ AVF below 0.3 during workload execution. In both cases, the IQ AVF

dynamics were predicted with DVM disabled and enabled. As can be seen, in scenario 1, the

DVM successfully achieves its goal. In scenario 2, despite the enabled DVM feature, the IQ

AVF during certain execution periods is still above the threshold. This implies that the developed

DVM mechanism is suitable for the microarchitecture configuration used in scenario 1. On the

other hand, architects have to choose another DVM policy if the microarchitecture configuration










shown in scenario 2 is chosen in their design. Figure 5-14 shows that in all cases, the predictive

models can accurately forecast the trends in IQ AVF dynamics due to architecture optimizations.

Figure 5-15 (a) shows prediction accuracy of IQ AVF dynamics when the DVM policy is

enabled. The results are shown for all 50 microarchitecture configurations in our test dataset.

[Figure: heat plots with color key for MSE(%); panels: (a) IQ AVF, (b) Power]

Figure 5-15 Heat plot that shows the MSE of IQ AVF and processor power

Since deploying the DVM policy will also affect runtime processor power behavior, we

further build models to forecast processor power dynamic behavior due to the DVM. The results

are shown in Figure 5-15 (b). The data is presented as a heat plot, which maps the actual data

values into a color scale with a dendrogram added to the top. A dendrogram consists of many

U-shaped lines connecting objects in a hierarchical tree. The height of each U represents the

distance between the two objects being connected. For a given benchmark, a vertical trace line

shows the scaled MSE values across all test cases.

Figure 5-15 (a) shows the predictive models yield high prediction accuracy across all test

cases on benchmarks swim, eon and vpr. The prediction accuracy of the models varies on benchmarks












gcc, crafty and vortex. In the power domain, prediction accuracy is more uniform across benchmarks


and microarchitecture configurations. In Figure 5-16, we show the IQ AVF MSE when different


DVM thresholds are set. The results suggest that our predictive models work well when different


DVM targets are considered.


[Figure: IQ AVF MSE for DVM Threshold = 0.2, 0.3 and 0.5]

Figure 5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds










CHAPTER 6
ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN
MULTI-CORE ARCHITECTURES

Early design space exploration is an essential ingredient in modern processor development.

It significantly reduces the time to market and post-silicon surprises. The trend toward multi-

/many-core processors will result in sophisticated large-scale architecture substrates (e.g. non-

uniformly accessed cache [43] interconnected by network-on-chip [44]) with self-contained

hardware components (e.g. cache banks, routers and interconnect links) proximate to the

individual cores but globally distributed across all cores. As the number of cores on a processor

increases, these large and sophisticated multi-core-oriented architectures exhibit increasingly

complex and heterogeneous characteristics. As an example, to alleviate the deleterious impact of

wire delays, architects have proposed splitting up large L2/L3 caches into multiple banks, with

each bank having different access latency depending on its physical proximity to the cores.

Figure 6-1 illustrates normalized cache hits (results are plotted as color maps) across the 256

cache banks of a non-uniform cache architecture (NUCA) [43] design on an 8-core chip

multiprocessor (CMP) running the SPLASH-2 Ocean-c workload. The 2D architecture spatial

patterns yielded on NUCA with different architecture design parameters are shown.







Figure 6-1 Variation of cache hits across a 256-bank non-uniform access cache on 8-core

As can be seen, there is a significant variation in cache access frequency across individual

cache banks. At larger scales, the manifested 2-dimensional spatial characteristics across the









entire NUCA substrate vary widely with different design choices while executing the same code

base. In this example, various NUCA cache configurations such as network topologies (e.g.

hierarchical, point-to-point and crossbar) and data management schemes (e.g. static (SNUCA)

[43], dynamic (DNUCA) [45, 46] and dynamic with replication (RNUCA) [47-49]) are used. As

the number of parameters in the design space increases, such variation and characteristics at

large scales cannot be captured without using slow and detailed simulations. However, using

simulation-based methods for architecture design space exploration where numerous design

parameters have to be considered is prohibitively expensive.

Recently, various predictive models [20-25, 50] have been proposed to cost-effectively

reason about processor performance and power characteristics at the design exploration stage. A

common weakness of existing analytical models is that they assume centralized and monolithic

hardware structures and therefore lack the ability to forecast the complex and heterogeneous

behavior of large and distributed architecture substrates across the design space. This limitation

will only be exacerbated with the rapidly increasing integration scale (e.g. number of cores per

chip). Therefore, there is a pressing need for novel and cost-effective approaches to achieve

accurate and informative design trade-off analysis for large and sophisticated architectures in the

upcoming multi-/many-core era.

Thus, in this chapter, instead of quantifying these large and sophisticated architectures by a

single number or a simple statistical distribution, we propose techniques that employ 2D wavelet

multiresolution analysis and neural network non-linear regression modeling. With our schemes,

the complex spatial characteristics that workloads exhibit across large architecture substrates are

decomposed into a series of wavelet coefficients. In the transform domain, each individual

wavelet coefficient is modeled by a separate neural network. By predicting only a small set of









wavelet coefficients, our models can accurately reconstruct architecture 2D spatial behavior

across the design space. Using both multi-programmed and multi-threaded workloads, we

extensively evaluate the efficiency of using 2D wavelet neural networks for predicting the

complex behavior of non-uniformly accessed cache designs with widely varied configurations.

Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics
Prediction

We view the 2D spatial characteristics yielded on large and distributed architecture

substrates as a nonlinear function of architecture design parameters. Instead of inferring the

spatial behavior via exhaustively obtaining architecture characteristics on each individual

node/component, we employ wavelet analysis to approximate it and then use a neural network to

forecast the approximated behavior across a large architecture design space. Previous work [21,

23, 25, 50] shows that neural networks can accurately predict the aggregated workload behavior

across varied architecture configurations. Nevertheless, monolithic global neural network models

lack the ability to informatively reveal complex workload/architecture interactions at a large

scale. To overcome this disadvantage, we propose combining 2D wavelet transforms and neural

networks that incorporate multiresolution analysis into a set of neural networks for spatial

characteristics prediction of multi-core oriented architecture substrates. The 2D wavelet

transform is a very powerful tool for characterizing spatial behavior since it captures both global

trend and local variation of large data sets using a small set of wavelet coefficients. The local

characteristics are decomposed into lower scales of wavelet coefficients (high frequencies)

which are utilized for detailed analysis and prediction of individual or subsets of

cores/components, while the global trend is decomposed into higher scales of wavelet

coefficients (low frequencies) that are used for the analysis and prediction of slow trends across

many cores or distributed hardware components. Collectively, these wavelet coefficients provide










an accurate interpretation of the spatial trend and details of complex workload behavior at a large

scale. Our wavelet neural networks use a separate RBF neural network to predict individual

wavelet coefficients. The separate predictions of wavelet coefficients proceed independently.

Predicting each wavelet coefficient by a separate neural network simplifies the training task

(which can be performed concurrently) of each sub-network. The prediction results for the

wavelet coefficients can be combined directly by the inverse wavelet transforms to synthesize

the spatial patterns on large scale architecture substrates.
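As an illustration only, the decomposition and synthesis steps described above can be sketched in Python. The sketch assumes an orthonormal Haar basis and a square, power-of-two matrix (for example 16 x 16 for 256 banks); the text does not state which wavelet family was used, and the function names (haar_1d, dwt2, idwt2) are hypothetical.

```python
import math

S = math.sqrt(2.0)

def haar_1d(v):
    """One level of the orthonormal Haar transform: averages, then details."""
    half = len(v) // 2
    avg = [(v[2 * i] + v[2 * i + 1]) / S for i in range(half)]
    det = [(v[2 * i] - v[2 * i + 1]) / S for i in range(half)]
    return avg + det

def ihaar_1d(v):
    """Inverse of haar_1d."""
    half = len(v) // 2
    out = []
    for a, d in zip(v[:half], v[half:]):
        out += [(a + d) / S, (a - d) / S]
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def dwt2(m):
    """Multi-level 2D Haar transform of a square, power-of-two matrix."""
    out = [list(row) for row in m]
    size = len(m)
    while size > 1:
        block = [haar_1d(row[:size]) for row in out[:size]]          # rows
        block = transpose([haar_1d(c) for c in transpose(block)])    # columns
        for i in range(size):
            out[i][:size] = block[i]
        size //= 2  # recurse on the low-frequency (top-left) quadrant
    return out

def idwt2(coeffs):
    """Inverse of dwt2: rebuild the spatial pattern from the coefficients."""
    out = [list(row) for row in coeffs]
    size = 2
    while size <= len(coeffs):
        block = [row[:size] for row in out[:size]]
        block = transpose([ihaar_1d(c) for c in transpose(block)])   # columns
        block = [ihaar_1d(row) for row in block]                     # rows
        for i in range(size):
            out[i][:size] = block[i]
        size *= 2
    return out

# Hypothetical 4x4 "bank" matrix: transform, then reconstruct.
banks = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
coeffs = dwt2(banks)
rebuilt = idwt2(coeffs)
```

In this layout the top-left coefficient carries the global trend (the scaled sum of all banks), while coefficients further from it describe progressively more local variation.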

Figure 6-2 shows our hybrid neuro-wavelet scheme for architecture 2D spatial

characteristics prediction. Given the observed spatial behavior on training data, our aim is to

predict the 2D behavior of large-scale architecture under different design configurations.

[Diagram labels: Architecture 2D Characteristics; RBF Neural Networks (Architecture Design Parameters -> Predicted Wavelet Coefficient 1 ... n); Synthesized Architecture 2D Characteristics]



Figure 6-2 Using wavelet neural networks for forecasting architecture 2D characteristics

The hybrid scheme basically involves three stages. In the first stage, the observed spatial

behavior is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet

coefficient is predicted by a separate ANN. In the third stage, the approximated 2D

characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network

receives the entire architecture design space vector and predicts a wavelet coefficient. The









training of an RBF network involves determining the center point and a radius for each RBF, and

the weights of each RBF which determine the wavelet coefficients.
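A minimal sketch of one such RBF network is given below, under simplifying assumptions that are not taken from this work: Gaussian basis functions, one center per training configuration, a single shared radius, and weights obtained by exact interpolation. All function names are hypothetical; the regression-tree-based construction used in this chapter selects centers and radii differently.

```python
import math

def gaussian_rbf(x, center, radius):
    """Response of one Gaussian basis function to the design vector x."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-dist2 / (2.0 * radius ** 2))

def solve_linear(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w

def train_rbf(configs, targets, radius):
    """Fit weights so the network reproduces one wavelet coefficient
    at every training configuration (assumed: centers = training points)."""
    centers = [list(c) for c in configs]
    phi = [[gaussian_rbf(x, c, radius) for c in centers] for x in configs]
    weights = solve_linear(phi, targets)
    return centers, radius, weights

def predict_rbf(model, x):
    """Predict the wavelet coefficient for an unseen design vector x."""
    centers, radius, weights = model
    return sum(w * gaussian_rbf(x, c, radius) for w, c in zip(weights, centers))

# Hypothetical normalized design vectors and one coefficient's observed values.
configs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coeff_values = [1.0, 2.0, 3.0, 5.0]
model = train_rbf(configs, coeff_values, radius=0.7)
estimate = predict_rbf(model, [0.5, 0.5])
```

One such model would be trained per selected wavelet coefficient, and because the models share no state they can be trained concurrently.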

Experimental Methodology

We evaluate the efficiency of 2D wavelet neural networks for forecasting spatial

characteristics of large-scale multi-core NUCA design using the GEMS 1.2 [51] toolset

interfaced with the Simics [52] full-system functional simulator. We simulate a SPARC V9 8-

core CMP running Solaris 9. We model in-order issue cores for this study to keep the simulation

time tractable. The processors have private L1 caches and the shared L2 is a 256-bank 16MB

NUCA. The private L1 caches of different processors are maintained coherent using a distributed

directory based protocol. To model the L2 cache, we use the Ruby NUCA cache simulator

developed in [47] which includes an on-chip network model. The network models all messages

communicated in the system including all requests, responses, replacements, and

acknowledgements. Table 6-1 summarizes the baseline machine configurations of our simulator.

Table 6-1 Simulated machine configuration (baseline)
Parameter         Configuration
Number of         8
Issue Width       1
L1 (split I/D)    64KB, 64B line, write-allocation
L2 (NUCA)         16 MB (256 x 64KB), 64B line
Memory            Sequential
Memory            4 GB of DRAM, 250 cycle latency, 4KB

Our baseline processor/L2 NUCA organization is similar to that of Beckmann and Wood

[47] and is illustrated in Figure 6-3.

Each processor core (including L1 data and instruction caches) is placed on the chip

boundary and eight such cores surround a shared L2 cache. The L2 is partitioned into 256 banks

(grouped as 16 bank clusters) and connected with an interconnection network. Each core has a

cache controller that routes the core's requests to the appropriate cache bank. The NUCA design









space is very large. In this chapter, we consider a design space that consists of 9 parameters (see

Table 6-2) of CMP NUCA architecture.



Figure 6-3 Baseline CMP with 8 cores that share a NUCA L2 cache

Table 6-2 Considered architecture design parameters and their ranges
Parameter                         Values
NUCA Management Policy (NUCA)     SNUCA, DNUCA, RNUCA
Network Topology (net)            Hierarchical, PTtoPT, Crossbar
Network Link Latency (net_lat)    20, 30, 40, 50
L1_latency (L1_lat)               1, 3, 5
L2_latency (L2_lat)               6, 8, 10, 12
L1_associativity (L1_aso)         1, 2, 4, 8
L2_associativity (L2_aso)         2, 4, 8, 16
Directory Latency (d_lat)         30, 60, 80, 100
Processor Buffer Size (p_buf)     5, 10, 20


These design parameters cover NUCA data management policy (NUCA), interconnection

topology and latency (net and net_lat), the configurations of the L1 and L2 caches (L1_lat,

L2_lat, L1_aso and L2_aso), cache coherency directory latency (d_lat) and the number of cache

accesses that a processor core can issue to the L1 (p_buf). The ranges for these parameters were

set to include both typical and feasible design points within the explored design space.









We studied the CMP NUCA designs using various multi-programmed and multi-threaded

workloads (listed in Table 6-3).

Table 6-3 Multi-programmed workloads
Multi-programmed Workloads        Description
Homogeneous     Group 1           gcc (8 copies)
                Group 2           mcf (8 copies)
Heterogeneous   Group 1 (CPU)     gap, bzip2, equake, gcc, mesa, perlbmk, parser, ammp
                Group 2 (MIX)     perlbmk, mcf, bzip2, vpr, mesa, art, gcc, equake
                Group 3 (MEM)     mcf, twolf, art, ammp, equake, mcf, art, mesa
Multithreaded Workloads           Data Set
Splash2         barnes            16k particles
                fmm               input.16348
                ocean-co          514x514 ocean body
                ocean-nc          258x258 ocean body
                water-ns          512 molecules
                cholesky          tk15.O
                fft               65,536 complex data points
                radix             256k keys, 1024 radix



Our heterogeneous multi-programmed workloads consist of a mix of programs from the

SPEC 2000 benchmarks with full reference input sets. The homogeneous multi-programmed

workloads consist of multiple copies of an identical SPEC 2000 program. For multi-programmed

workload simulations, we perform fast-forwards until all benchmarks pass initialization phases.

For multithreaded workloads, we used 8 benchmarks from the SPLASH-2 suite [53] and mark an

initialization phase in the software code and skip it in our simulations. In all simulations, we first

warm up the cache model. After that, each simulation runs 500 million instructions or to

benchmark completion, whichever is less. Using detailed simulation, we obtain the 2D

architecture characteristics of large scale NUCA at all design points within both training and









testing data sets. We build a separate model for each workload and use the model to predict

architecture 2D spatial behavior at unexplored points in the design space. The training data set is

used to build the 2D wavelet neural network models. An estimate of the model's accuracy is

obtained by using the design points in the testing data set.

To build a representative design space, one needs to ensure that the sample data sets

disperse points throughout the design space but keep the space small enough to keep the cost of

building the model low. To achieve this goal, we use a variant of Latin Hypercube Sampling

(LHS) as our sampling strategy since it provides better coverage compared to a naive random

sampling scheme. We generate multiple LHS matrices and use a space-filling metric called

L2-star discrepancy. The L2-star discrepancy is applied to each LHS matrix to find the

representative design space that has the lowest value of L2-star discrepancy. We use a randomly

and independently generated set of test data points to empirically estimate the predictive

accuracy of the resulting models. In this chapter, we used 200 training points and 50 test points for

workload dynamic prediction since our study shows that this offers a good tradeoff between

simulation time and prediction accuracy for the design space we considered. The 2D NUCA

architecture characteristics (normalized cache hit numbers) across 256 banks (with the geometry

layout, Figure 6-3) are represented by a matrix.
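The sampling procedure described above can be sketched as follows. This is a hedged illustration rather than the variant of LHS used in this work: it draws several plain Latin hypercube designs in the unit cube, scores each with the L2-star discrepancy (computed with Warnock's closed-form expression), and keeps the design with the lowest score. The helper names, the number of candidate designs and the mapping from unit values to parameter levels are all assumptions.

```python
import math
import random

def latin_hypercube(n_points, n_dims, rng):
    """One LHS design in [0, 1)^d: every dimension has one point per stratum."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_points for s in strata])
    return [list(p) for p in zip(*columns)]

def l2_star_discrepancy(points):
    """L2-star discrepancy of a point set via Warnock's formula."""
    n, d = len(points), len(points[0])
    term1 = 3.0 ** (-d)
    term2 = 0.0
    for p in points:
        prod = 1.0
        for x in p:
            prod *= 1.0 - x * x
        term2 += prod
    term2 *= 2.0 ** (1 - d) / n
    term3 = 0.0
    for p in points:
        for q in points:
            prod = 1.0
            for a, b in zip(p, q):
                prod *= 1.0 - max(a, b)
            term3 += prod
    term3 /= n * n
    return math.sqrt(max(term1 - term2 + term3, 0.0))

def best_lhs(n_points, n_dims, n_candidates, seed=0):
    """Generate several LHS designs and keep the most uniform one."""
    rng = random.Random(seed)
    candidates = [latin_hypercube(n_points, n_dims, rng)
                  for _ in range(n_candidates)]
    return min(candidates, key=l2_star_discrepancy)

def to_levels(point, level_lists):
    """Map a unit-cube point onto discrete parameter levels (assumed mapping)."""
    return [levels[min(int(u * len(levels)), len(levels) - 1)]
            for u, levels in zip(point, level_lists)]

# Three of the Table 6-2 parameters, used here only as an example.
levels = [[20, 30, 40, 50],      # net_lat
          [1, 3, 5],             # L1_lat
          [6, 8, 10, 12]]        # L2_lat
design = best_lhs(n_points=8, n_dims=3, n_candidates=5, seed=1)
configs = [to_levels(p, levels) for p in design]
```

The pairwise term makes the discrepancy cost quadratic in the number of sample points, which is inexpensive for a few hundred points compared with a single detailed simulation.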

Predicting each wavelet coefficient by a separate neural network simplifies the learning

task. Since complex spatial patterns on large scale multi-core architecture substrates can be

captured using a limited number of wavelet coefficients, the total size of wavelet neural networks

is small and the computation overhead is low. Due to the fact that small magnitude wavelet

coefficients have less contribution to the reconstructed data, we opt to only predict a small set of

important wavelet coefficients. Specifically, we consider the following two schemes for selecting









important wavelet coefficients for prediction: (1) magnitude-based: select the largest k

coefficients and approximate the rest with 0 and (2) order-based: select the first k coefficients

and approximate the rest with 0. In this study, we choose to use the magnitude-based scheme

since it always outperforms the order-based scheme. To apply the magnitude-based wavelet

coefficient selection scheme, it is essential that the significance of the selected wavelet

coefficients does not change drastically across the design space. Our experimental results show that

the top ranked wavelet coefficients largely remain consistent across different architecture

configurations.
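The two selection schemes, and a simple check of how stable the top-ranked coefficients are between two configurations, can be sketched as below. The sketch operates on a flattened list of coefficients; the function names and the overlap measure are illustrative assumptions, not the procedure used in our experiments.

```python
def keep_largest(coeffs, k):
    """Magnitude-based: keep the k largest |coefficients|, zero the rest."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                    reverse=True)
    keep = set(ranked[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def keep_first(coeffs, k):
    """Order-based: keep the first k coefficients, zero the rest."""
    return [c if i < k else 0.0 for i, c in enumerate(coeffs)]

def retained_energy(original, kept):
    """Fraction of the squared magnitude that survives the selection."""
    total = sum(c * c for c in original)
    return sum(c * c for c in kept) / total

def rank_overlap(coeffs_a, coeffs_b, k):
    """Share of the top-k coefficient positions common to two configurations."""
    def top(coeffs):
        ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                        reverse=True)
        return set(ranked[:k])
    return len(top(coeffs_a) & top(coeffs_b)) / k

# Hypothetical coefficient vectors from two design points.
config_a = [4.0, 0.1, -3.0, 0.2, 2.5, -0.05]
config_b = [3.6, 0.3, -2.7, 0.1, 2.9, 0.02]
by_magnitude = keep_largest(config_a, 2)
by_order = keep_first(config_a, 2)
stability = rank_overlap(config_a, config_b, 3)
```

For an orthonormal transform, the retained energy fraction equals one minus the relative squared reconstruction error, so for a single data set the magnitude-based choice is never worse than the order-based choice for the same k.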

Evaluation and Results

In this section, we present detailed experimental results using 2D wavelet neural networks

to forecast complex, heterogeneous patterns of large scale multi-core substrates running various

workloads without using detailed simulation. The prediction accuracy measure is the mean error

defined as follows:

\[ ME = \frac{1}{N} \sum_{k=1}^{N} \frac{\left| \hat{x}(k) - x(k) \right|}{x(k)} \tag{6-1} \]

where: $x$ is the actual value, $\hat{x}$ is the predicted value and N is the total number of samples

(e.g. 256 NUCA banks). As prediction accuracy increases, the ME becomes smaller.
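As a worked illustration of Equation 6-1, the following short function computes the mean error over a list of per-bank values. It assumes the per-sample error is taken as an absolute relative error and that no actual value is zero; the function name is hypothetical.

```python
def mean_error(actual, predicted):
    """Mean relative error over N samples (for example 256 NUCA banks)."""
    n = len(actual)
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / n

# Three hypothetical banks: two mispredicted by 10 percent, one exact.
me = mean_error([1.0, 2.0, 4.0], [1.1, 1.8, 4.0])
```

Here the three relative errors are 0.1, 0.1 and 0, so the ME is their average, about 0.067.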

The prediction accuracies are plotted as boxplots (Figure 6-4). Boxplots are graphical

displays that measure location (median) and dispersion (interquartile range), identify possible

outliers, and indicate the symmetry or skewness of the distribution. The central box shows the

data between "hinges" which are approximately the first and third quartiles of the ME values.












(a) 16 Wavelet Coefficients (b) 32 Wavelet Coefficients (c) 64 Wavelet Coefficients (d) 128 Wavelet Coefficients
Figure 6-4 ME boxplots of prediction accuracies with different number of wavelet coefficients

Thus, about 50% of the data are located within the box and its height is equal to the

interquartile range. The horizontal line in the interior of the box is located at the median of the

data, and it shows the center of the distribution for the ME values. The whiskers (the dotted lines

extending from the top and bottom of the box) extend to the extreme values of the data or a

distance 1.5 times the interquartile range from the median, whichever is less. The outliers are

marked as circles.

Figure 6-4 (a) shows that using 16 wavelet coefficients, the predictive models achieve

median errors ranging from 5.2 percent (fft) to 9.3 percent (ocean-co) with an overall median

error of 6.6 percent across all experimented workloads. As can be seen, the maximum error at

any design point for any benchmark is 13%, and most benchmarks show an error less than 10%.

This indicates that our proposed neuro-wavelet scheme can forecast the 2D spatial workload














behavior across large and sophisticated architecture with high accuracy. Figure 6-4 (b-d) shows

that in general, the geospatial characteristics prediction accuracy is increased when more wavelet

coefficients are involved. Note that the complexity of the predictive models is proportional to the

number of wavelet coefficients. The cost-effective models should provide high prediction

accuracy while maintaining low complexity and computation overhead. The trend of prediction

accuracy (Figure 6-4) indicates that for the programs we studied, a set of wavelet coefficients

with a size of 16 combines good accuracy with low model complexity; increasing the number of

wavelet coefficients beyond this point improves error at a reduced rate. This is because wavelets

provide a good time and locality characterization capability and most of the energy is captured

by a limited set of important wavelet coefficients. Using fewer parameters than other methods,

the coordinated wavelet coefficients provide interpretation of the spatial patterns among a large

number of NUCA banks on a two-dimensional plane. Figure 6-5 illustrates the predicted 2D

NUCA behavior across four different configurations (A-D) on the heterogeneous

multi-programmed workload MIX (see Table 6-3) when different numbers of wavelet coefficients

(16 to 256) are used.

[Rows: configurations A-D. Columns: simulation (org) and predictions using 16, 32, 64, 96, 128 and 256 wavelet coefficients (16wc to 256wc)]


Figure 6-5 Predicted 2D NUCA behavior using different number of wavelet coefficients









The simulation results (org) are also shown for comparison purposes. Since we can

accurately forecast the behavior of large scale NUCA by only predicting a small set of wavelet

coefficients, we expect our methods are scalable to even larger architecture design.

We further compare the accuracy of our proposed scheme with that of approximating

NUCA spatial patterns via predicting the hit rates of 16 evenly distributed cache banks across a

2D plane. The results shown in Table 6-4 indicate that using the same number of neural networks,

our scheme yields a significantly higher accuracy than conventional predictive models. If current

neural network models were built at fine-grain scales (e.g. construct a model for each NUCA

bank), the model building/training overhead would be non-trivial. Since we can accurately

forecast the behavior of large scale NUCA structures by only predicting a small set of wavelet

coefficients, we expect our methods are scalable to even larger architecture substrates.

Table 6-4 Error comparison of predicting raw vs. 2D DWT cache banks
Benchmarks    Error (Raw), %    Error (2D DWT), %
gcc(x8)       126               8
mcf(x8)       71                7
CPU           102               9
MIX           86                8
MEM           122               8
barnes        136               6
fmm           363               6
ocean-co      99                9
ocean-nc      136               6
water-sp      97                7
cholesky      71                7
fft           64                7
radix         92                7

Table 6-5 shows that exploring multi-core NUCA design space using the proposed

predictive models can lead to several orders of magnitude speedup, compared with using detailed

simulations. The speedup is calculated using the total simulation time across all 50 test cases

divided by the time spent on model training and predicting 50 test cases. The model construction









is a one-time overhead and can be amortized in the design space exploration stage where a large

number of cases need to be examined.

Table 6-5 Design space evaluation speedup (simulation vs. prediction)
Benchmarks    Speedup (Simulation vs. Prediction)
gcc(x8)       2,181x
mcf(x8)       3,482x
CPU           3,691x
MIX           472x
MEM           435x
barnes        659x
fmm           1,824x
ocean-co      1,077x
ocean-nc      1,169x
water-sp      738x
cholesky      696x
fft           670x
radix         1,010x

Our RBF neural networks were built using a regression tree based method. In the

regression tree algorithm, all input architecture design parameters were ranked based on either

split order or split frequency. The design parameters which cause the most output variation tend

to be split earliest and most often in the constructed regression tree. Therefore, architecture

parameters that largely determine the values of a wavelet coefficient are located higher than

others in the regression tree and they have a larger number of splits than others. We present in

Figure 6-6 (shown as star plots) the initial and most frequent splits within the regression trees that

model the most significant wavelet coefficients.

A star plot is a graphical data analysis method for representing the relative behavior of all

variables in a multivariate data set. The star plot consists of a sequence of equi-angular spokes,

called radii, with each spoke representing one of the variables. The data length of a spoke is

proportional to the magnitude of the variable for the data point relative to the maximum

magnitude of the variable across all data points. From the star plot, we can obtain information










such as: Which variables are dominant for a given dataset? Which observations show similar

behavior?


[Star plots in two panels (Order, Frequency), one star per workload: gcc, mcf, cpu, mix, mem, barnes, ocean_co, ocean_nc, water_sp, cholesky, fft, radix, fmm. Spokes: p_buf, net_lat, net, L1_lat, NUCA, L2_lat, d_lat, L1_aso, L2_aso]
Figure 6-6 Roles of design parameters in predicting 2D NUCA

For example, on the Splash-2 benchmark fmm, network latency (net_lat), processor buffer

size (p_buf), L2 latency (L2_lat) and L1 associativity (L1_aso) have significant roles in

predicting the 2D NUCA spatial behavior while the NUCA data management policy (NUCA)

and network topology (net) largely affect the 2D spatial pattern when running the homogeneous

multi-programmed workload gcc(x8). For the benchmark cholesky, the most frequently involved

architecture parameters in regression tree construction are NUCA, net_lat, p_buf, L2_lat and

L1_aso.

Differing from models that predict aggregated workload characteristics on monolithic

architecture design, our proposed methods can accurately and informatively reveal the complex

patterns that workloads exhibit on large-scale architectures. This feature is essential if the

predictive models are employed to examine the efficiency of design tradeoffs or explore novel







optimizations that consider multi-/many-core designs. In this work, we study the suitability of using the
proposed models for novel multi-core oriented NUCA optimizations.
Leveraging 2D Geometric Characteristics to Explore Cooperative Multi-core Oriented
Architecture Design and Optimization
In this section, we present case studies to demonstrate the benefit of incorporating 2D
workload/architecture behavior prediction into the early stages of microarchitecture design. In
the first case study, we show that our geospatial-aware predictive models can effectively estimate
workloads' 2D working sets and that such information can be beneficial in searching cache
friendly workload/core mapping in multi-core environments. In the second case study, we
explore using 2D thermal profile predictive models to accurately and informatively forecast the
area and location of thermal hotspots across large NUCA substrates.
Case Study 1: Geospatial-aware Application/Core Mapping
Our 2D geometry-aware architecture predictive models can be used to explore global,
cooperative, resource management and optimization in multi-core environments.
[Panels: Core 0 through Core 7]

Figure 6-7 2D NUCA footprint (geometric shape) of mesa
For example, as shown in Figure 6-7, a workload will exhibit a 2D working set with
different geometric shapes when running on different cores. The exact shape of the access










distribution depends on several factors such as the application and the data mapping/migration

policy. As shown in the previous section, our predictive models can forecast workload 2D spatial

patterns across the architecture design space. To predict workload 2D geometric footprints when

running on different cores, we incorporate the core location as a new design parameter and build

the location-aware 2D predictive models. As a result, the new model can forecast a workload's 2D

NUCA footprint (represented as a cache access distribution) when it is assigned to a specific core

location. We assign 8 SPEC CPU 2000 workloads to the 8-core CMP system and then predict

each workload's 2D NUCA footprint when running on the assigned core and use the predicted

2D geometric working set for each workload to estimate the cache interference among the cores.

[Diagram labels: Program A 2D NUCA footprint @ Core 0; Program B 2D NUCA footprint @ Core 1; Program C 2D NUCA footprint @ Core 2; Interferenced Area]

Figure 6-8. 2D cache interference in NUCA

As shown in Figure 6-8, to estimate the interference for a given core/workload mapping,

we estimate both the area and the degree of overlap among the workloads' 2D NUCA footprints.

We only consider the interference of a core and its two neighbors. As a result, for a given

core/workload layout, we can quickly estimate the overall interference. For each NUCA

configuration, we estimate the interference when workloads are randomly assigned to different

cores. We use simulation to count the actual cache interference among workloads. For each test










case (e.g., a specific NUCA configuration), we generate two series of cache interference

statistics (e.g., one from simulation and one from the predictive model) which correspond to the

scenarios when workloads are mapped to the different cores. We compute the Pearson

correlation coefficient of the two data series. The Pearson correlation coefficient of two data

series X and Y is defined as


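\[ r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[ n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 \right] \left[ n \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2 \right]}} \tag{6-2} \]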


If two data series, X and Y, show highly positive correlation, their Pearson correlation

coefficient will be close to 1. Consequently, if the cache interference can be accurately estimated

using the overlap between the predicted 2D NUCA footprints, we should observe nearly perfect

correlation between the two metrics.
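The evaluation described above can be sketched in a few lines of Python. The overlap measure is an assumption made for illustration: the area is the number of banks touched by both footprints, the degree is the summed per-bank minimum of the two normalized access distributions, and cores are treated as a ring so that each core interferes only with its two neighbors. The Pearson function follows the standard sums-based form of Equation 6-2. All names are hypothetical.

```python
import math

def overlap(fp_a, fp_b):
    """Return (area, degree) of overlap between two 2D footprints."""
    area = 0
    degree = 0.0
    for row_a, row_b in zip(fp_a, fp_b):
        for a, b in zip(row_a, row_b):
            shared = min(a, b)
            if shared > 0:
                area += 1          # bank used by both workloads
                degree += shared   # how strongly both use it
    return area, degree

def mapping_interference(footprints):
    """Total overlap degree for one core/workload layout (3 or more cores).

    footprints[i] is the predicted footprint of the workload on core i;
    each pair of neighboring cores on the ring is counted once.
    """
    n = len(footprints)
    total = 0.0
    for core in range(n):
        _, degree = overlap(footprints[core], footprints[(core + 1) % n])
        total += degree
    return total

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long data series."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# Hypothetical series: estimated vs. simulated interference for four mappings.
estimated = [0.8, 1.5, 2.1, 3.0]
simulated = [10.0, 17.0, 25.0, 33.0]
r = pearson(estimated, simulated)
```

A value of r close to 1 across the mappings indicates that ranking mappings by predicted overlap would order them in nearly the same way as ranking them by simulated interference.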

[Three panels: Group 1 (CPU), Group 2 (MIX), Group 3 (MEM); x-axis: Test Cases]
Figure 6-9 Pearson correlation coefficient (all 50 test cases are shown)

Figure 6-9 shows that there is a strong correlation between the interference estimated using

the predicted 2D NUCA footprint and the interference statistics obtained using simulation. The

highly positive Pearson correlation coefficient values show that by using the predictive model,

designers can quickly devise the optimal core allocation for a given set of workloads.

Alternatively, the information can be used by the OS to guide cache friendly thread scheduling in

multi-core environments.












Case Study 2: 2D Thermal Hot-Spot Prediction

Thermal issues are becoming a first order design parameter for large-scale CMP

architectures. High operational temperatures and hotspots can limit performance and

manufacturability. We use the HotSpot [54] thermal model to obtain the temperature variation

across 256 NUCA banks. We then build analytical models using the proposed methods to

forecast 2D thermal behavior of large NUCA cache with different configurations. Our predictive

model can help designers insightfully predict the potential thermal hotspots and assess the

severity of thermal emergencies. Figure 6-10 shows the simulated thermal profile and predicted

thermal behavior on different workloads. The temperatures are normalized to a value between

the maximal and minimal value across the NUCA chip. As can be seen, the 2D thermal

predictive models can accurately and informatively forecast the size and the location of thermal

hotspots.

[Panels, each comparing simulation and prediction with thermal hotspots marked: (a) Ocean-NC (b) gccx8 (c) MEM (d) Radix]
Figure 6-10 2D NUCA thermal profile (simulation vs. prediction)























Figure 6-11 NUCA 2D thermal prediction error

The thermal prediction accuracy (average statistics) across three workload categories is

shown in Figure 6-11. The accuracy of using different numbers of wavelet coefficients in

prediction is also shown in that figure. The results show that our predictive model can be used to

cost-effectively analyze the thermal behavior of large architecture substrates. In addition, our

proposed technique can be used to evaluate the efficiency of thermal management policies at a

large scale. For example, thermal hotspots can be mitigated by throttling the number of accesses

to a cache bank for a certain period when its temperature reaches a threshold. We build analytical

models which incorporate thermal-aware cache access throttling as a design parameter. As a

result, our predictive model can forecast thermal hot spot distribution in the 2D NUCA cache

banks when the dynamic thermal management (DTM) policy is enabled or disabled. Figure 6-12

shows the thermal profiles before and after thermal management policies are applied (both

prediction and simulation results) for benchmark Ocean-NC. As can be seen, they track each

other very well. In terms of time taken for design space exploration, our proposed models have

orders of magnitude less overhead. The time required to predict the thermal behavior is much

less than that of full-system multi-core simulation. For example, thermal hotspot estimation is

over 2 x 10^5 times faster than thermal simulation, justifying our decision to use the predictive










models. Similarly, searching a cache friendly workload/core mapping is 3 x 10^4 times faster than

using the simulation-based method.


[Panels: simulation and prediction thermal profiles, without and with DTM]


Figure 6-12 Temperature profile before and after a DTM policy










CHAPTER 7
THERMAL DESIGN SPACE EXPLORATION OF 3D DIE STACKED MULTI-CORE
PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS

To achieve thermally efficient 3D multi-core processor design, architects and chip designers

need models with low computation overhead, which allow them to quickly explore the design

space and compare different design options. One challenge in modeling the thermal behavior of

3D die stacked multi-core architecture is that the manifested thermal patterns show significant

variation within each die and across different dies (as shown in Figure 7-1).

[Panels: Die 1, Die 2, Die 3, Die 4]


Figure 7-1 2D within-die and cross-dies thermal variation in 3D die stacked multi-core
processors

The results were obtained by simulating a 3D die stacked quad-core processor running

multi-programmed CPU (bzip2, eon, gcc, perlbmk), MEM (mcf, equake, vpr, swim) and MIX

(gcc, mcf, vpr, perlbmk) workloads. Each program within a multi-programmed workload was

assigned to a die that contains a processor core and caches.

Figure 7-2 shows the 2D thermal variation on die 4 under different microarchitecture and

floor-plan configurations. On the given die, the 2-dimensional thermal spatial characteristics

vary widely with different design choices. As the number of architectural parameters in the

design space increases, the complex thermal variation and characteristics cannot be captured

without using slow and detailed simulations. As shown in Figures 7-1 and 7-2, to explore the

thermal-aware design space accurately and informatively, we need computationally effective









methods that not only predict aggregate thermal behavior but also identify both size and

geographic distribution of thermal hotspots. In this work, we aim to develop fast and accurate

predictive models to achieve this goal.

Figure 7-2 2D thermal variation on die 4 under different microarchitecture and floor-plan
configurations (panels: Config. A, Config. B, Config. C, Config. D)

Figure 7-3 illustrates the original thermal behavior and 2D wavelet transformed thermal

behavior.







(a) Original thermal behavior (b) 2D wavelet transformed thermal behavior
Figure 7-3 Example of using 2D DWT to capture thermal spatial characteristics

As can be seen, the 2D thermal characteristics can be effectively captured using a small number

of wavelet coefficients (e.g. Average (LL=1) or Average (LL=2)). Since a small set of wavelet

coefficients provides concise yet insightful information on 2D thermal spatial characteristics, we

use predictive models (i.e. neural networks) to relate them individually to various design

parameters. Through inverse 2D wavelet transform, we use the small set of predicted wavelet

coefficients to synthesize 2D thermal spatial characteristics across the design space. Compared

with a simulation-based method, predicting a small set of wavelet coefficients using analytical









models is computationally efficient and is scalable to explore the large thermal design space of

3D multi-core architecture.
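The compression idea described here can be illustrated with a minimal sketch. It assumes an orthonormal Haar wavelet (the text does not state which wavelet family is used) and a synthetic 64 x 64 temperature map; `haar_dwt2`, `haar_idwt2` and `approximate` are hypothetical helper names. Keeping only the level-4 approximation band leaves 4 x 4 = 16 coefficients, from which a coarse map is synthesized by the inverse transform.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt2(x):
    """One level of an orthonormal 2D Haar transform (assumed wavelet).

    Returns the approximation band followed by three detail bands, each
    half the size of the input along both axes.
    """
    lo = (x[0::2, :] + x[1::2, :]) / SQRT2   # pairwise sums along axis 0
    hi = (x[0::2, :] - x[1::2, :]) / SQRT2   # pairwise differences along axis 0
    ll = (lo[:, 0::2] + lo[:, 1::2]) / SQRT2
    lh = (lo[:, 0::2] - lo[:, 1::2]) / SQRT2
    hl = (hi[:, 0::2] + hi[:, 1::2]) / SQRT2
    hh = (hi[:, 0::2] - hi[:, 1::2]) / SQRT2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    rows, cols = ll.shape
    lo = np.empty((rows, 2 * cols))
    hi = np.empty((rows, 2 * cols))
    lo[:, 0::2], lo[:, 1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    hi[:, 0::2], hi[:, 1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((2 * rows, 2 * cols))
    x[0::2, :], x[1::2, :] = (lo + hi) / SQRT2, (lo - hi) / SQRT2
    return x

def approximate(temp_map, levels):
    """Keep only the approximation band after `levels` decompositions,
    then invert with all detail bands set to zero."""
    ll = np.asarray(temp_map, dtype=float)
    for _ in range(levels):
        ll = haar_dwt2(ll)[0]
    coeffs = ll.copy()
    rec = ll
    for _ in range(levels):
        zeros = np.zeros_like(rec)
        rec = haar_idwt2(rec, zeros, zeros, zeros)
    return coeffs, rec

# Synthetic temperature map (K) with a single hotspot; illustrative only.
yy, xx = np.mgrid[0:64, 0:64]
temp_map = 320.0 + 30.0 * np.exp(-((xx - 20) ** 2 + (yy - 40) ** 2) / 200.0)

coeffs, coarse = approximate(temp_map, levels=4)
print(coeffs.shape, coarse.shape)
```

With the Haar wavelet, the coarse map holds the mean temperature of each 16 x 16 block, so the global trend survives while fine detail is dropped; adding detail bands back would sharpen the reconstruction.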

Prior work has proposed various predictive models [20-25, 50] to cost-effectively reason about

processor performance and power characteristics at the design exploration stage. A common

weakness of existing analytical models is that they assume centralized and monolithic hardware

structures and therefore lack the ability to forecast the complex and heterogeneous thermal

behavior across large and distributed 3D multi-core architecture substrates. In this chapter, we

address this important and urgent research task by developing novel, 2D multi-scale predictive

models, which can efficiently reason about the geo-spatial thermal characteristics within die and across

different dies during the design space exploration stage without using detailed cycle-level

simulations. Instead of quantifying the complex geo-spatial thermal characteristics using a single

number or a simple statistical distribution, our proposed techniques employ 2D wavelet

multiresolution analysis and neural network non-linear regression modeling. With our schemes,

the thermal spatial characteristics are decomposed into a series of wavelet coefficients. In the

transform domain, each individual wavelet coefficient is modeled by a separate neural network.

By predicting only a small set of wavelet coefficients, our models can accurately reconstruct 2D

spatial thermal behavior across the design space.

Combining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction

We view the 2D spatial thermal characteristics yielded in 3D integrated multi-core chips as

a nonlinear function of architecture design parameters. Instead of inferring the spatial thermal

behavior via exhaustively obtaining temperature on each individual location, we employ wavelet

analysis to approximate it and then use a neural network to forecast the approximated thermal

behavior across a large architectural design space. Previous work [21, 23, 25, 50] shows that

neural networks can accurately predict the aggregated workload behavior across varied









architecture configurations. Nevertheless, monolithic global neural network models lack the

ability to reveal complex thermal behavior on a large scale. To overcome this disadvantage, we

propose combining 2D wavelet transforms and neural networks that incorporate multiresolution

analysis into a set of neural networks for spatial thermal characteristics prediction of 3D die

stacked multi-core design.














Figure 7-4 Hybrid neuro-wavelet thermal prediction framework

The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since

it captures both global trend and local variation of large data sets using a small set of wavelet

coefficients. The local characteristics are decomposed into lower scales of wavelet coefficients

(high frequencies) which are utilized for detailed analysis and prediction of individual or subsets

of components, while the global trend is decomposed into higher scales of wavelet coefficients

(low frequencies) that are used for the analysis and prediction of slow trends across each die.

Collectively, these wavelet coefficients provide an accurate interpretation of the spatial trend and

details of complex thermal behavior at a large scale. Our wavelet neural networks use a separate

RBF neural network to predict individual wavelet coefficients. The separate predictions of

wavelet coefficients proceed independently. Predicting each wavelet coefficient by a separate

neural network simplifies the training task (which can be performed concurrently) of each sub-









network. The prediction results for the wavelet coefficients can be combined directly by the

inverse wavelet transforms to synthesize the 2D spatial thermal patterns across each die. Figure

7-4 shows our hybrid neuro-wavelet scheme for 2D spatial thermal characteristics prediction.

Given the observed spatial thermal behavior on training data, our aim is to predict the 2D

thermal behavior of each die in 3D die stacked multi-core processors under different design

configurations. The hybrid scheme involves three stages. In the first stage, the observed spatial

thermal behavior in each layer is decomposed by wavelet multiresolution analysis. In the second

stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the

approximated 2D thermal characteristics are recovered from the predicted wavelet coefficients.

Each RBF neural network receives the entire architecture design space vector and predicts a

wavelet coefficient. The training of an RBF network involves determining the center point and a

radius for each RBF, and the weights of each RBF, which determine the wavelet coefficients.
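The second stage of the scheme can be sketched as follows. This is a simplified illustration, not the training procedure used in this work: it assumes Gaussian basis functions, centers taken from a subset of the training points, a single fixed radius, and output weights solved by linear least squares. The data are synthetic and the names `rbf_design`, `fit_rbf` and `predict_rbf` are hypothetical.

```python
import numpy as np

def rbf_design(X, centers, radius):
    """Gaussian RBF activations, one column per center."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * radius ** 2))

def fit_rbf(X, y, centers, radius):
    """Fit the output weights of one RBF network by linear least squares."""
    weights, *_ = np.linalg.lstsq(rbf_design(X, centers, radius), y, rcond=None)
    return weights

def predict_rbf(X, centers, radius, weights):
    """Evaluate one fitted RBF network on design vectors X."""
    return rbf_design(X, centers, radius) @ weights

rng = np.random.default_rng(0)
n_train, n_params, n_coeffs = 40, 3, 16

# Synthetic, normalized design vectors and synthetic "wavelet coefficients"
# that vary smoothly with the design vector (stand-ins for simulated data).
X_train = rng.random((n_train, n_params))
mix = rng.normal(size=(n_params, n_coeffs))
Y_train = np.sin(X_train @ mix)

# Hypothetical choices: every 4th training point is a center, fixed radius.
centers, radius = X_train[::4], 0.5

# One independent network (weight vector) per wavelet coefficient.
models = [fit_rbf(X_train, Y_train[:, j], centers, radius) for j in range(n_coeffs)]

# Predict all 16 coefficients for an unseen design point.
x_new = rng.random((1, n_params))
predicted = np.array([predict_rbf(x_new, centers, radius, w)[0] for w in models])
print(predicted.shape)
```

Because each coefficient has its own weight vector, the fits are independent and could run concurrently. Applying the inverse wavelet transform to the predicted coefficients would complete the third stage.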

Experimental Methodology

Floorplanning and Hotspot Thermal Model

In this study, we model four floor-plans that involve processor core and cache structures as

illustrated in Figure 7-5.







Figure 7-5 Selected floor-plans

As can be seen, the processor core is placed at different locations across the different floor-

plans. Each floor-plan can be chosen by a layer in the studied 3D die stacking quad-core

processors. The size and adjacency of blocks are critical parameters for deriving the thermal









model. The baseline core architecture and floorplan we modeled is an Alpha processor, closely

resembling the Alpha 21264. Figure 7-6 shows the baseline core floorplan.












Figure 7-6 Processor core floor-plan

We assume a 65 nm process technology and the floor-plan is scaled accordingly. The

entire die size is 21 x 21mm and the core size is 5.8 x 5.8mm. We consider three core

configurations: 2-issue (5.8x 5.8 mm), 4-issue (8.14x 8.14 mm) and 8-issue (11.5x 11.5 mm).

Since the total die area is fixed, the more aggressive core configurations lead to smaller L2

caches. For all three types of core configurations, we calculate the size of the L2 caches based on

the remaining die area available. Table 7-1 lists the detailed processor core and cache

configurations.
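As a worked check of the area budget, the following arithmetic assumes (for illustration only) that the area left for the L2 cache is simply the die area minus the core area, ignoring any other blocks. The superscript denotes the issue width.

```latex
\begin{align*}
A_{\text{die}} &= 21 \times 21 = 441\ \text{mm}^2 \\
A_{\text{rem}}^{(2)} &= 441 - 5.8^2 = 441 - 33.64 = 407.36\ \text{mm}^2 \\
A_{\text{rem}}^{(4)} &= 441 - 8.14^2 \approx 441 - 66.26 = 374.74\ \text{mm}^2 \\
A_{\text{rem}}^{(8)} &= 441 - 11.5^2 = 441 - 132.25 = 308.75\ \text{mm}^2
\end{align*}
```

The remaining area shrinks as the issue width grows, which is consistent with the smaller L2 capacities listed for the wider cores. The exact mapping from area to cache capacity is not given in the text.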

We use Hotspot-4.0 [54] to simulate the thermal behavior of the 3D quad-core chip shown in

Figure 7-7. The Hotspot tool can specify the multiple layers of silicon and metal required to

model a three-dimensional IC. We choose the grid-based thermal modeling mode by specifying a set

of 64 x 64 thermal grid cells per die and the average temperature of each cell (32um x 32um) is

represented by a value. Hotspot takes power consumption data for each component block, the

layer parameters and the floor-plans as inputs and generates the steady-state temperature for each

active layer. To build a 3D multi-core processor simulator, we heavily modified and extended the

M-Sim simulator [63] and incorporated the Wattch power model [36]. The power trace is











generated from the developed framework with an interval size of 500K cycles. We simulate a


3D-stacked quad-core processor with one core assigned to each layer.

Table 7-1 Architecture configuration for different issue width

Parameter | 2-issue | 4-issue | 8-issue
Processor Width | 2-wide fetch/issue/commit | 4-wide fetch/issue/commit | 8-wide fetch/issue/commit
Issue Queue | 32 | 64 | 128
ITLB | 32 entries, 4-way, 200 cycle miss | 64 entries, 4-way, 200 cycle miss | 128 entries, 4-way, 200 cycle miss
Branch Predictor | 512 entries Gshare, 10-bit global history | 1K entries Gshare, 10-bit global history | 2K entries Gshare, 10-bit global history
BTB | 512K entries, 4-way | 1K entries, 4-way | 2K entries, 4-way
Return Address | 8 entries RAS | 16 entries RAS | 32 entries RAS
L1 Inst. Cache | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access | 64K, 2-way, 32 Byte/line, 2 ports, 1 cycle access | 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access
ROB Size | 32 entries | 64 entries | 96 entries
Load/Store | 24 entries | 48 entries | 72 entries
Integer ALU | 2 I-ALU, 1 I-MUL/DIV, 2 Load/Store | 4 I-ALU, 2 I-MUL/DIV, 2 Load/Store | 8 I-ALU, 4 I-MUL/DIV, 4 Load/Store
FP ALU | 1 FP-ALU, 1 FP-MUL/DIV/SQRT | 2 FP-ALU, 2 FP-MUL/DIV/SQRT | 4 FP-ALU, 4 FP-MUL/DIV/SQRT
DTLB | 64 entries, 4-way, 200 cycle miss | 128 entries, 4-way, 200 cycle miss | 256 entries, 4-way, 200 cycle miss
L1 Data Cache | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access | 64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle | 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access
L2 Cache | unified 4MB, 4-way, 128 Byte/line, 12 cycle access | unified 3.7MB, 4-way, 128 Byte/line, 12 cycle access | unified 3.2MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access | 32 bit wide, 200 cycles access latency | 64 bit wide, 200 cycles access latency | 64 bit wide, 200 cycles access latency



Figure 7-7 Cross section view of the simulated 3D quad-core chip









Workloads and System Configurations

We use both integer and floating-point benchmarks from the SPEC CPU 2000 suite (e.g.

bzip2, crafty, eon, facerec, galgel, gap, gcc, lucas, mcf, parser, perlbmk, twolf, swim, vortex and

vpr) to compose our experimental multiprogrammed workloads (see Table 7-2). We categorize

all benchmarks into two classes: CPU-bound and MEM-bound applications. We design three

types of experimental workloads: CPU, MEM and MIX. The CPU and MEM workloads consist

of programs from only the CPU intensive and memory intensive categories respectively. MIX

workloads are the combination of two benchmarks from the CPU intensive group and two from

the memory intensive group.

Table 7-2 Simulation configurations
Chip Frequency | 3 GHz
Voltage | 1.2 V
Proc. Technology | 65 nm
Die Size | 21 mm x 21 mm
Workloads | CPU1: bzip2, eon, gcc, perlbmk
          | CPU2: perlbmk, mesa, facerec, lucas
          | CPU3: gap, parser, eon, mesa
          | MIX1: gcc, mcf, vpr, perlbmk
          | MIX2: perlbmk, mesa, twolf, applu
          | MIX3: eon, gap, mcf, vpr
          | MEM1: mcf, equake, vpr, swim
          | MEM2: twolf, galgel, applu, lucas
          | MEM3: mcf, twolf, swim, vpr


These multi-programmed workloads were simulated on our multi-core simulator

configured as 3D quad-core processors. We use the Simpoint tool [1] to obtain a representative

slice for each benchmark (with full reference input set) and each benchmark is fast-forwarded to

its representative point before detailed simulation takes place. The simulations continue until one

benchmark within a workload finishes the execution of the representative interval of 250M

instructions.










Design Parameters

In this study, we consider a design space that consists of 23 parameters (see Table 7-3)

spanning from floor-planning to packaging technologies.

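Table 7-3 Design space parameters

Category | Parameter | Key | Low | High
3D Configurations, Layer0 | Thickness (m) | ly0_th | 5e-5 | 3e-4
3D Configurations, Layer0 | Floorplan | ly0_fl | Flp 1/2/3/4 |
3D Configurations, Layer0 | Bench | ly0_bench | CPU/MEM/MIX |
3D Configurations, Layer1 | Thickness (m) | ly1_th | 5e-5 | 3e-4
3D Configurations, Layer1 | Floorplan | ly1_fl | Flp 1/2/3/4 |
3D Configurations, Layer1 | Bench | ly1_bench | CPU/MEM/MIX |
3D Configurations, Layer2 | Thickness (m) | ly2_th | 5e-5 | 3e-4
3D Configurations, Layer2 | Floorplan | ly2_fl | Flp 1/2/3/4 |
3D Configurations, Layer2 | Bench | ly2_bench | CPU/MEM/MIX |
3D Configurations, Layer3 | Thickness (m) | ly3_th | 5e-5 | 3e-4
3D Configurations, Layer3 | Floorplan | ly3_fl | Flp 1/2/3/4 |
3D Configurations, Layer3 | Bench | ly3_bench | CPU/MEM/MIX |
TIM (Thermal Interface Material) | Heat Capacity (J/m^3K) | TIM_cap | 2e6 | 4e6
TIM (Thermal Interface Material) | Resistivity (m K/W) | TIM_res | 2e-3 | 5e-2
TIM (Thermal Interface Material) | Thickness (m) | TIM_th | 2e-5 | 75e-6
General Configurations, Heat sink | Convection capacity (J/K) | HS_cap | 140.4 | 1698
General Configurations, Heat sink | Convection resistance (K/W) | HS_res | 0.1 | 0.5
General Configurations, Heat sink | Side (m) | HS_side | 0.045 | 0.08
General Configurations, Heat sink | Thickness (m) | HS_th | 0.02 | 0.08
General Configurations, Heat Spreader | Side (m) | HP_side | 0.025 | 0.045
General Configurations, Heat Spreader | Thickness (m) | HP_th | 5e-4 | 5e-3
General Configurations, Others | Ambient temperature (K) | Am_temp | 293.15 | 323.15
Archi. | Issue width | Issue width | 2 or 4 or 8 |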


These design parameters have been shown to have a large impact on processor thermal

behavior. The ranges for these parameters were set to include both typical and feasible design

points within the explored design space. Using detailed cycle-accurate simulations, we measure

processor power and thermal characteristics on all design points within both training and testing

data sets. We build a separate model for each benchmark domain and use the model to predict

thermal behavior at unexplored points in the design space. The training data set is used to build

the wavelet-based neural network models. An estimate of the model's accuracy is obtained by

using the design points in the testing data set. To train an accurate and prompt neural network

prediction model, one needs to ensure that the sample data sets disperse points throughout the











design space but remain small enough to keep the model building cost low. To

achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling

strategy since it provides better coverage compared to a naive random sampling scheme. We

generate multiple LHS matrices and use a space-filling metric called L2-star discrepancy [40].

The L2-star discrepancy is applied to each LHS matrix to find the representative design space

that has the lowest value of L2-star discrepancy. We use a randomly and independently

generated set of test data points to empirically estimate the predictive accuracy of the resulting

models. In this work, we used 200 training and 50 test data points to reach high accuracy for thermal

behavior prediction since our study shows that it offers a good tradeoff between simulation time

and prediction accuracy for the design space we considered. In our study, the thermal

characteristics across each die are represented by 64 x 64 samples.
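The sampling strategy can be sketched as follows. This is a minimal illustration rather than the exact LHS variant used in this work: it draws several basic Latin hypercube samples in the unit cube, scores each with the L2-star discrepancy (computed with Warnock's closed-form formula), and keeps the sample with the lowest score. The function names, the number of candidate matrices, and the random seed are assumptions made for the example.

```python
import numpy as np

def latin_hypercube(n_points, n_dims, rng):
    """Basic LHS in the unit cube: one point per stratum in every dimension."""
    sample = np.empty((n_points, n_dims))
    for j in range(n_dims):
        strata = rng.permutation(n_points)
        sample[:, j] = (strata + rng.random(n_points)) / n_points
    return sample

def l2_star_discrepancy(sample):
    """L2-star discrepancy of points in [0, 1)^d (Warnock's formula)."""
    n, d = sample.shape
    term1 = 3.0 ** (-d)
    term2 = (2.0 ** (1 - d) / n) * np.prod(1.0 - sample ** 2, axis=1).sum()
    pair_max = np.maximum(sample[:, None, :], sample[None, :, :])
    term3 = np.prod(1.0 - pair_max, axis=2).sum() / n ** 2
    # Guard against tiny negative values caused by floating-point rounding.
    return np.sqrt(max(term1 - term2 + term3, 0.0))

rng = np.random.default_rng(1)

# 200 training points over the 23 design parameters; 10 candidates is arbitrary.
candidates = [latin_hypercube(200, 23, rng) for _ in range(10)]
best = min(candidates, key=l2_star_discrepancy)

# Map one unit-cube column onto a physical range, here the layer thickness
# range (m) listed in Table 7-3.
low, high = 5e-5, 3e-4
ly0_th = low + best[:, 0] * (high - low)
print(l2_star_discrepancy(best))
```

Categorical parameters such as the floor-plan or workload type would need the unit interval to be split into equal bins, one per option; that step is omitted here.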

Experimental Results

In this section, we present detailed experimental results using 2D wavelet neural networks

to forecast thermal behaviors of large scale 3D multi-core structures running various

CPU/MIX/MEM workloads without using detailed simulation.

Simulation Time vs. Prediction Time

To evaluate the effectiveness of our thermal prediction models, we compute the speedup

metric (defined as simulation time vs. prediction time) across all experimented workloads

(shown in Table 7-4). To calculate simulation time, we measured the time that the Hotspot

simulator takes to obtain steady thermal characteristics on a given design configuration. As can

be seen, the Hotspot tool simulation time varies with design configurations. We report both

shortest (best) and longest (worst) simulation time in Table 7-4.

The prediction time, which includes the time for the neural networks to predict the targeted

thermal behavior, remains constant for all studied cases. In our experiment, a total of 16









neural networks were used to predict 16 2D wavelet coefficients which efficiently capture

workload thermal spatial characteristics. As can be seen, our predictive models achieve a

speedup ranging from 285 (MEM1) to 5339 (CPU2), making them suitable for rapidly exploring

large thermal design space.

Table 7-4 Simulation time vs. prediction time
Workload | Simulation (sec) [best : worst] | Prediction (sec) | Speedup (Sim./Pred.) [best : worst]
CPU1 | 362 : 6,091 | 1.23 | 294 : 4,952
CPU2 | 366 : 6,567 | 1.23 | 298 : 5,339
CPU3 | 365 : 6,218 | 1.23 | 297 : 5,055
MEM1 | 351 : 5,890 | 1.23 | 285 : 4,789
MEM2 | 355 : 6,343 | 1.23 | 289 : 5,157
MEM3 | 367 : 5,997 | 1.23 | 298 : 4,876
MIX1 | 352 : 5,944 | 1.23 | 286 : 4,833
MIX2 | 365 : 6,091 | 1.23 | 297 : 4,952
MIX3 | 360 : 6,024 | 1.23 | 293 : 4,898
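As a quick check of how the speedup column follows from the two time columns, the arithmetic below uses the constant prediction time of 1.23 s reported in Table 7-4 and the two extreme cases quoted in the text.

```latex
\begin{align*}
\text{Speedup} &= \frac{T_{\text{sim}}}{T_{\text{pred}}} \\
\text{MEM1 (best case)} &: \frac{351}{1.23} \approx 285 \\
\text{CPU2 (worst case)} &: \frac{6567}{1.23} \approx 5339
\end{align*}
```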


Prediction Accuracy

The prediction accuracy measure is the mean error (ME), defined as follows:

$$ME = \frac{1}{N}\sum_{k=1}^{N}\frac{\left|x(k)-\hat{x}(k)\right|}{x(k)} \qquad (7\text{-}1)$$

where $x(k)$ is the actual value generated by the Hotspot thermal model, $\hat{x}(k)$ is the predicted

value and $N$ is the total number of samples (a set of 64 x 64 temperature samples per layer). As

prediction accuracy increases, the ME becomes smaller.
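The ME metric can be computed as in the sketch below. It reads Equation 7-1 as a mean absolute relative error; the absolute value is an interpretation of the formula, and the temperature values and the name `mean_error` are made up for illustration.

```python
import numpy as np

def mean_error(actual, predicted):
    """Mean absolute relative error between two temperature maps."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted) / actual))

# Hypothetical 2 x 2 temperature maps (K); a real map would be 64 x 64.
simulated = np.array([[350.0, 340.0], [330.0, 320.0]])
predicted = simulated * np.array([[1.02, 0.99], [1.00, 0.97]])

me_percent = 100.0 * mean_error(simulated, predicted)
print(round(me_percent, 3))
```

Because each term is divided by the simulated value, the result is dimensionless and can be reported as a percentage, as in the boxplots that follow.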

We present boxplots to observe the average prediction errors and their deviations for the 50

test configurations against Hotspot simulation results. Boxplots are graphical displays that

measure location (median) and dispersion (interquartile range), identify possible outliers, and

indicate the symmetry or skewness of the distribution. The central box shows the data between

"hinges" which are approximately the first and third quartiles of the ME values. Thus, about 50%

of the data are located within the box and its height is equal to the interquartile range. The










horizontal line in the interior of the box is located at the median of the data; it shows the center

of the distribution for the ME values. The whiskers (the dotted lines extending from the top and

bottom of the box) extend to the extreme values of the data or a distance 1.5 times the

interquartile range from the median, whichever is less. The outliers are marked as circles. In

Figure 7-8, the blue line with diamond-shaped markers indicates the statistical average of ME

across all benchmarks.

Figure 7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16)
(x-axis: CPU1, CPU2, CPU3, MEM1, MEM2, MEM3, MIX1, MIX2, MIX3)

Figure 7-8 shows that using 16 wavelet coefficients, the predictive models achieve median

errors ranging from 2.8% (CPU1) to 15.5% (MEM1) with an overall median error of 6.9% across

all experimented workloads. As can be seen, the maximum error at any design point for any

benchmark is 17.5% (MEM1), and most benchmarks show an error less than 9%. This indicates

that our hybrid neuro-wavelet framework can predict 2D spatial thermal behavior across large

and sophisticated 3D multi-core architecture with high accuracy. Figure 7-8 also indicates that

CPU (average 4.4%) workloads have smaller error rates than MEM (average 9.4%) and MIX

(average 6.7%) workloads. This is because the CPU workloads usually have higher temperature

on the small core area than the large L2 cache area. These small and sharp hotspots can be easily

captured using just a few wavelet coefficients. On MEM and MIX workloads, the complex thermal

pattern can spread across the entire die area, resulting in higher prediction error. Figure 7-9 illustrates

the simulated and predicted 2D thermal spatial behavior of die 4 (for one configuration) on

CPU1, MEM1 and MIX1 workloads.











Figure 7-9 Simulated and predicted thermal behavior
(columns: CPU1, MEM1, MIX1; rows: prediction, simulation)

The results show that our predictive models can track both size and location of thermal


hotspots. We further examine the accuracy of predicting locations and area of the hottest spots


and the results are similar to those presented in Figure 7-8.

Figure 7-10 ME boxplots of prediction accuracies with different number of wavelet coefficients
(panels: CPU1, MEM1, MIX1; x-axis: 16wc, 32wc, 64wc, 96wc, 128wc, 256wc)

Figure 7-10 shows the prediction accuracies with different number of wavelet coefficients


on multi-programmed workloads CPU1, MEM1 and MIX1. In general, the 2D thermal spatial


pattern prediction accuracy is increased when more wavelet coefficients are involved. However,


the complexity of the predictive models is proportional to the number of wavelet coefficients.


The cost-effective models should provide high prediction accuracy while maintaining low










complexity. The trend of prediction accuracy (Figure 7-10) suggests that for the programs we

studied, a set of 16 wavelet coefficients combines good accuracy with low model

complexity; increasing the number of wavelet coefficients beyond this point improves error at a

lower rate except on MEM1 workload. Thus, we select 16 wavelet coefficients in this work to

minimize the complexity of prediction models while achieving good accuracy.

We further compare the accuracy of our proposed scheme with that of approximating 3D

stacked die spatial thermal patterns via predicting the temperature of 16 evenly distributed

locations across the 2D plane. The results (Figure 7-11) indicate that using the same number of

neural networks, our scheme yields significantly higher accuracy than conventional predictive

models. This is because wavelets provide good spatial and frequency localization and

most of the energy is captured by a limited set of important wavelet coefficients. The coordinated

wavelet coefficients provide superior interpretation of the spatial patterns across scales in the

spatial and frequency domains.
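The representational side of this comparison can be illustrated in isolation. The sketch below is not the experiment reported here: no neural networks are involved, the temperature map is synthetic, and a Haar wavelet is assumed. It compares how well 16 numbers describe a 64 x 64 map when they are (a) temperatures sampled at 16 evenly spaced locations and held over each block, or (b) the 16 coarsest Haar approximation coefficients, whose reconstruction equals the mean of each 16 x 16 block.

```python
import numpy as np

def block_mean_approx(temp_map, block):
    """Reconstruction from Haar approximation coefficients only:
    every block x block tile is replaced by its mean."""
    n = temp_map.shape[0] // block
    means = temp_map.reshape(n, block, n, block).mean(axis=(1, 3))
    return np.kron(means, np.ones((block, block)))

def point_sample_approx(temp_map, block):
    """Sample one location per tile (its center) and hold it over the tile."""
    samples = temp_map[block // 2::block, block // 2::block]
    return np.kron(samples, np.ones((block, block)))

def rmse(a, b):
    """Root-mean-square error between two maps."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic temperature map (K) with one sharp hotspot; illustrative only.
yy, xx = np.mgrid[0:64, 0:64]
temp_map = 320.0 + 30.0 * np.exp(-((xx - 20) ** 2 + (yy - 40) ** 2) / 50.0)

err_wavelet = rmse(temp_map, block_mean_approx(temp_map, 16))
err_points = rmse(temp_map, point_sample_approx(temp_map, 16))
print(err_wavelet, err_points)
```

Among reconstructions that are constant over each tile, the tile mean minimizes the squared error, so in this setup the wavelet-based approximation cannot have a larger RMSE than the point-sampled one.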

Figure 7-11 Benefit of predicting wavelet coefficients
(x-axis: CPU1, CPU2, CPU3, MEM1, MEM2, MEM3, MIX1, MIX2, MIX3; recovered legend entry: Predicting the wavelet coefficients)

Our RBF neural networks were built using a regression tree based method. In the

regression tree algorithm, all input parameters (refer to Table 7-3) were ranked based on split

frequency. The input parameters which cause the most output variation tend to be split frequently

in the constructed regression tree. Therefore, the input parameters that largely determine the

values of a wavelet coefficient have a larger number of splits.











Figure 7-12 Roles of input parameters
(Design parameters by regression tree; one star plot per parameter: ly0_th, ly0_fl, ly0_bench, ly1_th, ly1_fl, ly1_bench, ly2_th, ly2_fl, ly2_bench, ly3_th, ly3_fl, ly3_bench, TIM_cap, TIM_res, TIM_th, HS_cap, HS_res, HS_side, HS_th, HP_side, HP_th, am_temp, Iss_size; clockwise: CPU1, MEM1, MIX1)

Figure 7-12 uses star plots to show the most frequent splits within the regression tree that

models the most significant wavelet coefficient. A star plot [41] is a graphical data analysis

method for representing the relative behavior of all variables in a multivariate data set. The

size of each parameter's segment is proportional to the magnitude of the variable for the data point

relative to the maximum magnitude of the variable across all data points. From the star plot, we

can obtain information such as: What variables are dominant for a given data set? Which

observations show similar behavior? As can be seen, floor-planning of each layer and core

configuration largely affect thermal spatial behavior of the studied workloads.









CHAPTER 8
CONCLUSIONS

Studying program workload behavior is of growing interest in computer architecture

research. The performance, power and reliability optimizations of future computer workloads

and systems could involve analyzing program dynamics across many time scales. Modeling and

predicting program behavior at a single scale has many limitations. For example, samples

taken from a single, fine-grained interval may not be useful in forecasting how a program

behaves at medium or large time scales. In contrast, observing program behavior using a

coarse-grained time scale may lose opportunities that can be exploited by hardware and software

in tuning resources to optimize workload execution at a fine-grained level.

In chapter 3, we proposed new methods, metrics and a framework that can help researchers

and designers to better understand phase complexity and the changing of program dynamics

across multiple time scales. We proposed using wavelet transformations of code execution and

runtime characteristics to produce a concise yet informative view of program dynamic

complexity. We demonstrated the use of this information in phase classification which aims to

produce phases that exhibit a similar degree of complexity. Characterizing phase dynamics across

different scales provides insightful knowledge and abundant features that can be exploited by

hardware and software in tuning resources to meet the requirement of workload execution at

different granularities.

In chapter 4, we extend the scope of chapter 3 by (1) exploring and contrasting the

effectiveness of using wavelets on a wide range of program execution statistics for phase

analysis; and (2) investigating techniques that can further optimize the accuracy of wavelet-based

phase classification. More importantly, we identify additional benefits that wavelets can offer in

the context of phase analysis. For example, wavelet transforms can provide efficient









dimensionality reduction of large volume, high dimension raw program execution statistics from

the time domain and hence can be integrated with a sampling mechanism to efficiently increase

the scalability of phase analysis of large scale phase behavior on long-running workloads. To

address workload variability issues in phase classification, wavelet-based denoising can be used

to extract the essential features of workload behavior from their run-time non-deterministic (i.e.,

noisy) statistics.

In chapter 5, on workload prediction, we propose the use of wavelet neural

networks to build accurate predictive models for workload dynamic driven microarchitecture

design space exploration to overcome the problems of monolithic, global predictive models. We

show that wavelet neural networks can be used to accurately and cost-effectively capture

complex workload dynamics across different microarchitecture configurations. We evaluate the

efficiency of using the proposed techniques to predict workload dynamic behavior in

performance, power, and reliability domains. We also perform extensive simulations to

analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy and

identify microarchitecture parameters that significantly affect workload dynamic behavior. To

evaluate the efficiency of scenario-driven architecture optimizations across different domains,

we also present a case study of using workload-dynamics-aware predictive models. Experimental

results show that the predictive models are highly efficient in rendering workload execution

scenarios. To our knowledge, the model we proposed is the first one that can track complex

program dynamic behavior across different microarchitecture configurations. We believe our

workload dynamics forecasting techniques will allow architects to quickly evaluate a rich set of

architecture optimizations that target workload dynamics at an early microarchitecture design stage.









In Chapter 6, we explore novel predictive techniques that can quickly, accurately and

informatively analyze the design trade-offs of future large-scale multi-/many-core architectures

in a scalable fashion. The characteristics that workloads exhibit on these architectures are

complex phenomena since they typically contain a mixture of behavior localized at different

scales. Applying wavelet analysis, our method can capture the heterogeneous behavior across a

wide range of spatial scales using a limited set of parameters. We show that these parameters can

be cost-effectively predicted using non-linear modeling techniques such as neural networks with

low computational overhead. Experimental results show that our scheme can accurately predict

the heterogeneous behavior of large-scale multi-core oriented architecture substrates. To our

knowledge, the model we proposed is the first that can track complex 2D workload/architecture

interaction across design alternatives. We further examined using the proposed models to

effectively explore multi-core aware resource allocations and design evaluations. For example,

we build analytical models that can quickly forecast workloads' 2D working sets across different

NUCA configurations. Combined with interference estimation, our models can determine the

geometric-aware workload/core mappings that lead to minimal interference. We also show that

our models can be used to predict the location and the area of thermal hotspots during thermal-

aware design exploration. In light of the emerging multi-/many-core design era, we believe

that the proposed 2D predictive model will allow architects to quickly yet informatively examine

a rich set of design alternatives and optimizations for large and sophisticated architecture

substrates at an early design stage.

Leveraging 3D die stacking technologies in multi-core processor design has received

increased momentum in both the chip design industry and research community. One of the major

roadblocks to realizing 3D multi-core design is its inefficient heat dissipation. To ensure thermal









efficiency, processor architects and chip designers rely on detailed yet slow simulations to model

thermal characteristics and analyze various design tradeoffs. However, due to the sheer size of

the design space, such techniques are very expensive in terms of time and cost.

In chapter 7, we aim to develop computationally efficient methods and models which

allow architects and designers to rapidly yet informatively explore the large thermal design space

of 3D multi-core architecture. Our models achieve several orders of magnitude speedup

compared to simulation based methods. Meanwhile, our model significantly improves prediction

accuracy compared to conventional predictive models of the same complexity. More attractively,

our models have the capability of capturing complex 2D thermal spatial patterns and can be used

to forecast both the location and the area of thermal hotspots during thermal-aware design

exploration. In light of the emerging 3D multi-core design era, we believe that the proposed

thermal predictive models will be valuable for architects to quickly and informatively examine a

rich set of thermal-aware design alternatives and thermal-oriented optimizations for large and

sophisticated architecture substrates at an early design stage.









LIST OF REFERENCES


[1] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, "Automatically Characterizing Large
Scale Program Behavior," in Proc. of the International Conference on Architectural Support for
Programming Languages and Operating Systems, 2002.

[2] E. Duesterwald, C. Cascaval and S. Dwarkadas, "Characterizing and Predicting Program
Behavior and Its Variability," in Proc. of the International Conference on Parallel
Architectures and Compilation Techniques, 2003.

[3] J. Cook, R. L. Oliver, and E. E. Johnson, "Examining Performance Differences in Workload
Execution Phases," in Proc. of the IEEE International Workshop on Workload
Characterization, 2001.

[4] X. Shen, Y. Zhong and C. Ding, "Locality Phase Prediction," in Proc. of the International
Conference on Architectural Support for Programming Languages and Operating Systems,
2004.

[5] C. Isci and M. Martonosi, "Runtime Power Monitoring in High-End Processors:
Methodology and Empirical Data," in Proc. of the International Symposium on
Microarchitecture, 2003.

[6] T. Sherwood, S. Sair and B. Calder, "Phase Tracking and Prediction," in Proc. of the
International Symposium on Computer Architecture, 2003.

[7] A. Dhodapkar and J. Smith, "Managing Multi-Configurable Hardware via Dynamic Working
Set Analysis," in Proc. of the International Symposium on Computer Architecture, 2002.

[8] M. Huang, J. Renau and J. Torrellas, "Positional Adaptation of Processors: Application to
Energy Reduction," in Proc. of the International Symposium on Computer Architecture,
2003.

[9] W. Liu and M. Huang, "EXPERT: Expedited Simulation Exploiting Program Behavior
Repetition," in Proc. of the International Conference on Supercomputing, 2004.

[10] T. Sherwood, E. Perelman and B. Calder, "Basic Block Distribution Analysis to Find
Periodic Behavior and Simulation Points in Applications," in Proc. of the International
Conference on Parallel Architectures and Compilation Techniques, 2001.

[11] A. Dhodapkar and J. Smith, "Comparing Program Phase Detection Techniques," in Proc.
of the International Symposium on Microarchitecture, 2003.

[12] C. Isci and M. Martonosi, "Identifying Program Power Phase Behavior using Power
Vectors," in Proc. of the International Workshop on Workload Characterization, 2003.









[13] C. Isci and M. Martonosi, "Phase Characterization for Power: Evaluating
Control-Flow-Based and Event-Counter-Based Techniques," in Proc. of the International
Symposium on High-Performance Computer Architecture, 2006.

[14] M. Annavaram, R. Rakvic, M. Polito, J.-Y. Bouguet, R. Hankins and B. Davies, "The
Fuzzy Correlation between Code and Performance Predictability," in Proc. of the
International Symposium on Microarchitecture, 2004.

[15] J. Lau, S. Schoenmackers and B. Calder, "Structures for Phase Classification," in Proc.
of the International Symposium on Performance Analysis of Systems and Software, 2004.

[16] J. Lau, J. Sampson, E. Perelman, G. Hamerly and B. Calder, "The Strong Correlation
between Code Signatures and Performance," in Proc. of the International Symposium on
Performance Analysis of Systems and Software, 2005.

[17] J. Lau, S. Schoenmackers and B. Calder, "Transition Phase Classification and
Prediction," in Proc. of the International Symposium on High Performance Computer
Architecture, 2005.

[18] C. Isci and M. Martonosi, "Detecting Recurrent Phase Behavior under Real-System
Variability," in Proc. of the IEEE International Symposium on Workload
Characterization, 2005.

[19] E. Perelman, M. Polito, J.-Y. Bouguet, J. Sampson, B. Calder and C. Dulong, "Detecting
Phases in Parallel Applications on Shared Memory Architectures," in Proc. of the
International Parallel and Distributed Processing Symposium, April 2006.

[20] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, "Construction and Use of Linear
Regression Models for Processor Performance Analysis," in Proc. of the International
Symposium on High-Performance Computer Architecture, 2006.

[21] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, "A Predictive Performance Model
for Superscalar Processors," in Proc. of the International Symposium on Microarchitecture,
2006.

[22] B. Lee and D. Brooks, "Accurate and Efficient Regression Modeling for
Microarchitectural Performance and Power Prediction," in Proc. of the International
Conference on Architectural Support for Programming Languages and Operating Systems,
2006.

[23] E. Ipek, S. A. McKee, B. R. de Supinski, M. Schulz and R. Caruana, "Efficiently Exploring
Architectural Design Spaces via Predictive Modeling," in Proc. of the International
Conference on Architectural Support for Programming Languages and Operating Systems,
2006.









[24] B. Lee and D. Brooks, "Illustrative Design Space Studies with Microarchitectural
Regression Models," in Proc. of the International Symposium on High-Performance
Computer Architecture, 2007.

[25] R. M. Yoo, H. Lee, K. Chow and H. H. S. Lee, "Constructing a Non-Linear Model with
Neural Networks For Workload Characterization," in Proc. of the International Symposium
on Workload Characterization, 2006.

[26] I. Daubechies, Ten Lectures on Wavelets, Capital City Press, Montpelier, Vermont, 1992.

[27] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets," Communications
on Pure and Applied Mathematics, vol. 41, pages 906-966, 1988.

[28] T. Austin, "Tutorial of Simplescalar V4.0," in Conj. with the International Symposium
on Microarchitecture, 2001.

[29] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate
Observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and
Probability, 1967.

[30] T. Huffmire and T. Sherwood, "Wavelet-Based Phase Classification," in Proc. of the
International Conference on Parallel Architectures and Compilation Techniques, 2006.

[31] D. Brooks and M. Martonosi, "Dynamic Thermal Management for High-Performance
Microprocessors," in Proc. of the International Symposium on High-Performance Computer
Architecture, 2001.

[32] A. Alameldeen and D. Wood, "Variability in Architectural Simulations of Multi-threaded
Workloads," in Proc. of the International Symposium on High Performance Computer
Architecture, 2003.

[33] D. L. Donoho, "De-noising by Soft-thresholding," IEEE Transactions on Information
Theory, Vol. 41, No. 3, pp. 613-627, 1995.

[34] MATLAB User Manual, MathWorks, MA, USA.

[35] M. Orr, K. Takezawa, A. Murray, S. Ninomiya and T. Leonard, "Combining Regression
Trees and Radial Basis Function Networks," International Journal of Neural Systems, 2000.

[36] D. Brooks, V. Tiwari and M. Martonosi, "Wattch: A Framework for
Architectural-Level Power Analysis and Optimizations," in Proc. of the 27th International
Symposium on Computer Architecture, 2000.

[37] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A Systematic
Methodology to Compute the Architectural Vulnerability Factors for a High-Performance
Microprocessor," in Proc. of the International Symposium on Microarchitecture, 2003.










[38] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas and R. Rangan,
"Computing Architectural Vulnerability Factors for Address-Based Structures," in Proc. of
the International Symposium on Computer Architecture, 2005.

[39] J. Cheng and M. J. Druzdzel, "Latin Hypercube Sampling in Bayesian Networks," in Proc. of
the 13th Florida Artificial Intelligence Research Society Conference, 2000.

[40] B. Vandewoestyne and R. Cools, "Good Permutations for Deterministic Scrambled Halton
Sequences in terms of L2-discrepancy," Journal of Computational and Applied Mathematics,
Vol. 189, Issues 1-2, 2006.

[41] J. Chambers, W. Cleveland, B. Kleiner and P. Tukey, Graphical Methods for Data
Analysis, Wadsworth, 1983.

[42] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall,
ISBN 0-13-273350-1, 1999.

[43] C. Kim, D. Burger and S. Keckler, "An Adaptive, Non-Uniform Cache Structure for
Wire-Delay Dominated On-Chip Caches," in Proc. of the International Conference on
Architectural Support for Programming Languages and Operating Systems, 2002.

[44] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm," Computer, Vol.
35, Issue 1, January 2002, pp. 70-78.

[45] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger and S. Keckler, "A NUCA Substrate for
Flexible CMP Cache Sharing," in Proc. of the International Conference on Supercomputing, 2005.

[46] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Distance Associativity for High-
Performance Energy-Efficient Non-Uniform Cache Architectures," in Proc. of the
International Symposium on Microarchitecture, 2003.

[47] B. M. Beckmann and D. A. Wood, "Managing Wire Delay in Large Chip-Multiprocessor
Caches," in Proc. of the International Symposium on Microarchitecture, 2004.

[48] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Optimizing Replication,
Communication, and Capacity Allocation in CMPs," in Proc. of the International Symposium
on Computer Architecture, 2005.

[49] M. Zhang and K. Asanovic, "Victim Replication: Maximizing Capacity while Hiding
Wire Delay in Tiled Chip Multiprocessors," in Proc. of the International Symposium on
Computer Architecture, 2005.

[50] B. Lee, D. Brooks, B. de Supinski, M. Schulz, K. Singh and S. McKee, "Methods of Inference
and Learning for Performance Modeling of Parallel Applications," in Proc. of the ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming (PPoPP), 2007.









[51] M. M. K. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill
and D. Wood, "Multifacet's General Execution-driven Multiprocessor Simulator (GEMS)
Toolset," Computer Architecture News (CAN), 2005.

[52] Virtutech Simics, http://www.virtutech.com/products/

[53] S. Woo, M. Ohara, E. Torrie, J. Singh, A. Gupta, "The SPLASH-2 Programs:
Characterization and Methodological Considerations," in Proc. of the International
Symposium on Computer Architecture, 1995.

[54] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan,
"Temperature-Aware Microarchitecture," in Proc. of the International Symposium on
Computer Architecture, 2003.

[55] K. Banerjee, S. Souri, P. Kapur, and K. Saraswat, "3-D ICs: A Novel Chip Design for
Improving Deep-Submicrometer Interconnect Performance and Systems-on-Chip
Integration," Proceedings of the IEEE, vol. 89, pp. 602-633, May 2001.

[56] Y. F. Tsai, F. Wang, Y. Xie, N. Vijaykrishnan and M. J. Irwin, "Design Space Exploration
for 3-D Cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol.
16, No. 4, April 2008.

[57] B. Black, D. Nelson, C. Webb, and N. Samra, "3D Processing Technology and its Impact
on IA32 Microprocessors," in Proc. of the 22nd International Conference on Computer
Design, pp. 316-318, 2004.

[58] P. Reed, G. Yeung, and B. Black, "Design Aspects of a Microprocessor Data Cache using
3D Die Interconnect Technology," in Proc. of the International Conference on Integrated
Circuit Design and Technology, pp. 15-18, 2005.

[59] M. Healy, M. Vittes, M. Ekpanyapong, C. S. Ballapuram, S. K. Lim, H. S. Lee and G. H. Loh,
"Multiobjective Microarchitectural Floorplanning for 2-D and 3-D ICs," IEEE Trans. on
Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 1, pp. 38-52, 2007.

[60] S. K. Lim, "Physical design for 3D system on package," IEEE Design & Test of
Computers, vol. 22, no. 6, pp. 532-539, 2005.

[61] K. Puttaswamy, G. H. Loh, "Thermal Herding: Microarchitecture Techniques for
Controlling Hotspots in High-Performance 3D-Integrated Processors," in Proc. of the
International Symposium on High-Performance Computer Architecture, 2007.

[62] Y. Wu and Y. Chang, "Joint Exploration of Architectural and Physical Design Spaces with
Thermal Consideration," in Proc. of the International Symposium on Low Power Electronics and
Design, 2005.









[63] J. Sharkey, D. Ponomarev and K. Ghose, "M-Sim: A Flexible, Multithreaded Architectural
Simulation Environment," Technical Report CS-TR-05-DP01, Department of Computer
Science, State University of New York at Binghamton, 2005.









BIOGRAPHICAL SKETCH

Chang Burm Cho earned B.E. and M.A. degrees in electrical engineering at Dan-kook University,

Seoul, Korea, in 1993 and 1995, respectively. Over the next 9 years, he worked as a senior

researcher at the Korea Aerospace Research Institute (KARI), developing the On-Board

Computer (OBC) for two satellites, KOMPSAT-1 and KOMPSAT-2. His research interests are

computer architecture and workload characterization and prediction in large microarchitectural

design spaces.






ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES

By

CHANG BURM CHO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008


© 2008 Chang Burm Cho


ACKNOWLEDGMENTS

There are many people who are responsible for my Ph.D. research. Most of all, I would like to express my gratitude to my supervisor, Dr. Tao Li, for his patient guidance and invaluable advice, and for numerous discussions and encouragement throughout the course of the research.

I would also like to thank all the members of my advisory committee, Dr. Renato Figueiredo, Dr. Rizwan Bashirullah, and Dr. Prabhat Mishra, for their valuable time and interest in serving on my supervisory committee.

I am also indebted to all the members of IDEAL (Intelligent Design of Efficient Architectures Laboratory), Clay Hughes, James Michael Poe II, Xin Fu and Wangyuan Zhang, for their companionship and support throughout the time spent working on my research.

Finally, I would like to express my greatest gratitude to my family, especially my wife, Eun-Hee Choi, for her relentless support and love.


TABLE OF CONTENTS

    Discrete Wavelet Transform (DWT)
    Apply DWT to Capture Workload Execution Behavior
    2D Wavelet Transform

3 COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION
    Characterizing and classifying the program dynamic behavior
    Profiling Program Dynamics and Complexity
    Classifying Program Phases based on their Dynamics Behavior
    Experimental Results

4 IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE ANALYSIS
    Workload-statics-based phase analysis
    Exploring Wavelet Domain Phase Analysis

5 INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION
    Neural Network
    Combining Wavelet and Neural Network for Workload Dynamics Prediction
    Experimental Methodology
    Evaluation and Results
    Workload Dynamics Driven Architecture Design Space Exploration

6 ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTI-CORE ARCHITECTURES
    Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction
    Experimental Methodology
    Evaluation and Results
    Leveraging 2D Geometric Characteristics to Explore Cooperative Multi-core Oriented Architecture Design and Optimization

7 THERMAL DESIGN SPACE EXPLORATION OF 3D DIE STACKED MULTI-CORE PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS
    Combining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction
    Experimental Methodology
    Experimental Results

8 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

Table

3-1 Baseline machine configuration
3-2 A classification of benchmarks based on their complexity
4-1 Baseline machine configuration
4-2 Efficiency of different hybrid wavelet signatures in phase classification
5-1 Simulated machine configuration
5-2 Microarchitectural parameter ranges used for generating train/test data
6-1 Simulated machine configuration (baseline)
6-2 The considered architecture design parameters and their ranges
6-3 Multi-programmed workloads
6-4 Error comparison of predicting raw vs. 2D DWT cache banks
6-5 Design space evaluation speedup (simulation vs. prediction)
7-1 Architecture configuration for different issue width
7-2 Simulation configurations
7-3 Design space parameters
7-4 Simulation time vs. prediction time


LIST OF FIGURES

Figure

2-1 Example of Haar wavelet transform
2-2 Comparison execution characteristics of time and wavelet domain
2-3 Sampled time domain program behavior
2-4 Reconstructing the workload dynamic behaviors
2-5 Variation of wavelet coefficients
2-6 2D wavelet transforms on 4 data points
2-7 2D wavelet transforms on 16 cores/hardware components
2-8 Example of applying 2D DWT on a non-uniformly accessed cache
3-1 XCOR vectors for each program execution interval
3-2 Dynamic complexity profile of benchmark gcc
3-3 XCOR value distributions
3-4 XCORs in the same phase by the Simpoint
3-5 BBVs with different resolutions
3-6 Multiresolution analysis of the projected BBVs
3-7 Weighted COV calculation
3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics
3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics
4-1 Phase analysis methods time domain vs. wavelet domain
4-2 Phase classification accuracy: time domain vs. wavelet domain
4-3 Phase classification using hybrid wavelet coefficients
4-4 Phase classification accuracy of using 16 1 hybrid scheme
4-5 Different methods to handle counter overflows
4-6 Impact of counter overflows on phase analysis accuracy
4-7 Method for modeling workload variability
4-8 Effect of using wavelet denoising to handle workload variability
4-9 Efficiency of different denoising schemes
5-1 Variation of workload performance, power and reliability dynamics
5-2 Basic architecture of a neural network
5-3 Using wavelet neural network for workload dynamics prediction
5-4 Magnitude-based ranking of 128 wavelet coefficients
5-5 MSE boxplots of workload dynamics prediction
5-6 MSE trends with increased number of wavelet coefficients
5-7 MSE trends with increased sampling frequency
5-8 Roles of microarchitecture design parameters
5-9 Threshold-based workload execution scenarios
5-10 Threshold-based workload execution
5-11 Threshold-based workload scenario prediction
5-12 Dynamic Vulnerability Management
5-13 IQ DVM Pseudo Code
5-14 Workload dynamic prediction with scenario-based architecture optimization
5-15 Heat plot that shows the MSE of IQ AVF and processor power
5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds
6-1 Variation of cache hits across a 256-bank non-uniform access cache on 8-core
6-2 Using wavelet neural networks for forecasting architecture 2D characteristics
6-3 Baseline CMP with 8 cores that share a NUCA L2 cache
6-4 ME boxplots of prediction accuracies with different number of wavelet coefficients
6-5 Predicted 2D NUCA behavior using different number of wavelet coefficients
6-6 Roles of design parameters in predicting 2D NUCA
6-7 2D NUCA footprint (geometric shape) of mesa
6-8 2D cache interference in NUCA
6-9 Pearson correlation coefficient (all 50 test cases are shown)
6-10 2D NUCA thermal profile (simulation vs. prediction)
6-11 NUCA 2D thermal prediction error
6-12 Temperature profile before and after a DTM policy
7-1 2D within-die and cross-dies thermal variation in 3D die stacked multi-core processors
7-2 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations
7-3 Example of using 2D DWT to capture thermal spatial characteristics
7-4 Hybrid neuro-wavelet thermal prediction framework
7-5 Selected floor-plans
7-6 Processor core floor-plan
7-7 Cross section view of the simulated 3D quad-core chip
7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16)
7-9 Simulated and predicted thermal behavior
7-10 ME boxplots of prediction accuracies with different number of wavelet coefficients
7-11 Benefit of predicting wavelet coefficients
7-12 Roles of input parameters


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES

By

Chang Burm Cho

December 2008

Chair: Tao Li
Major: Electrical and Computer Engineering

Modeling and analyzing how workload and architecture interact are at the foundation of computer architecture research and practical design. As contemporary microprocessors become increasingly complex, many challenges related to the design, evaluation and optimization of their architectures crucially rely on exploiting workload characteristics. While conventional workload characterization methods measure aggregated workload behavior and state-of-the-art tools can detect time-varying program patterns and cluster them into different phases, existing techniques generally lack the capability of gaining insightful knowledge of the complex interaction between software and hardware, a necessary first step in designing cost-effective computer architectures. This limitation will only be exacerbated by the rapid growth of software functionality and runtime and of hardware design complexity and integration scale. For instance, while large real-world applications manifest drastically different behavior across a wide spectrum of their runtime, existing methods only focus on analyzing workload characteristics using a single time scale. Conventional architecture modeling techniques assume a centralized and monolithic hardware substrate. This assumption, however, will not hold, since the design trends of multi-/many-core processors will result in large-scale and distributed microarchitectures. Beyond the specific processor core, global and cooperative resource management for large-scale many-core processors requires obtaining workload characteristics across a large number of distributed hardware components (cores, cache banks, interconnect links, etc.) at different levels of abstraction. Therefore, there is a pressing need for novel and efficient approaches to model and analyze workloads and architectures with rapidly increasing complexity and integration scale.

We aim to develop computationally efficient methods and models which allow architects and designers to rapidly yet informatively explore the large performance, power, reliability and thermal design space of uni-/multi-core architectures. Our models achieve several orders of magnitude speedup compared to simulation-based methods. Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models have the capability of capturing complex workload behavior and can be used to forecast workload dynamics during performance, power, reliability and thermal design space exploration.


CHAPTER 1
INTRODUCTION

Modeling and analyzing how workloads behave on the underlying hardware have been essential ingredients of computer architecture research. By knowing program behavior, both hardware and software can be tuned to better suit the needs of applications. As computer systems become more adaptive, their efficiency increasingly depends on the dynamic behavior that programs exhibit at runtime. Previous studies [1-5] have shown that program runtime characteristics exhibit time-varying phase behavior: workload execution manifests similar behavior within each phase while showing distinct characteristics between different phases. Many challenges related to the design, analysis and optimization of complex computer systems can be efficiently solved by exploiting program phases [1, 6-9]. For this reason, there is a growing interest in studying program phase behavior.

Recently, several phase analysis techniques have been proposed [4, 7, 10-19]. Very few of these studies, however, focus on understanding and characterizing program phases from their dynamics and complexity perspectives. Consequently, these techniques generally lack the capability of informing phase dynamic behavior. To complement current phase analysis techniques, which pay little or no attention to phase dynamics, we develop new methods, metrics and frameworks that have the capability to analyze, quantify, and classify program phases based on their dynamics and complexity characteristics. Our techniques are built on wavelet-based multiresolution analysis, which provides a clear and orthogonal view of phase dynamics by presenting complex dynamic structures of program phases with respect to both the time and frequency domains. Consequently, key tendencies can be efficiently identified.

As microprocessor architectures become more complex, architects increasingly rely on exploiting workload dynamics to achieve cost- and complexity-effective design. Therefore, there is a growing need for methods that can quickly and accurately explore workload dynamic behavior at the early microarchitecture design stage. Such techniques can quickly provide architects with insights into application execution scenarios across a large design space without resorting to detailed, case-by-case simulations. Researchers have proposed several predictive models [20-25] to reason about workload aggregated behavior at the architecture design stage. However, these models focus on predicting aggregated program statistics (e.g. the CPI of the entire workload execution). Such monolithic global models are incapable of capturing and revealing program dynamics, which contain interesting fine-grain behavior. To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques and neural network prediction.

As the number of cores on a processor increases, these large and sophisticated multi-core-oriented architectures exhibit increasingly complex and heterogeneous characteristics. Processors with two, four and eight cores have already entered the market. Processors with tens or possibly hundreds of cores may be a reality within the next few years. In the upcoming multi-/many-core era, the design, evaluation and optimization of architectures will demand analysis methods that are very different from those targeting traditional, centralized and monolithic hardware structures. To enable global and cooperative management of hardware resources and efficiency at large scales, it is imperative to analyze and exploit architecture characteristics beyond the scope of individual cores and hardware components (e.g. a single cache bank or a single interconnect link). To address this important and urgent research task, we developed novel 2D multi-scale predictive models which can efficiently reason about the characteristics of large and sophisticated multi-core-oriented architectures during the design space exploration stage without using detailed cycle-level simulations.

Three-dimensional (3D) integrated circuit design [55] is an emerging technology that greatly improves transistor integration density and reduces on-chip wire communication latency. It places planar circuit layers in the vertical dimension and connects these layers with a high-density, low-latency interface. In addition, 3D offers the opportunity of binding dies that are implemented with different techniques, enabling the integration of heterogeneous active layers for new system architectures. Leveraging 3D die stacking technologies to build uni-/multi-core processors has drawn increased attention from both the chip design industry and the research community [56-62]. The realization of 3D chips faces many challenges. One of the most daunting is inefficient heat dissipation. In conventional 2D chips, the generated heat is dissipated through an external heat sink. In 3D chips, all of the layers contribute to the generation of heat. Stacking multiple dies vertically increases power density, and dissipating heat from the layers far away from the heat sink is more challenging due to the distance between the heat source and the external heat sink. Therefore, 3D technologies not only exacerbate existing on-chip hotspots but also create new thermal hotspots. High die temperature leads to thermally induced performance degradation and reduced chip lifetime, which threatens the reliability of the whole system, making modeling and analyzing thermal characteristics crucial in effective 3D microprocessor design. Previous studies [59, 60] show that 3D chip temperature is affected by factors such as the configuration and floorplan of microarchitectural components. For example, instead of putting hot components together, thermal-aware floorplanning places hot components next to cooler components, reducing the global temperature. Thermal-aware floorplanning [59] uses intensive and iterative simulations to estimate the thermal effect of microarchitecture components at the early architectural design stage. However, using detailed yet
slow cycle-level simulations to explore thermal effects across the large design space of 3D multi-core processors is very expensive in terms of time and cost.

CHAPTER 2
WAVELET TRANSFORM

We use wavelets as an efficient tool for capturing workload behavior. To familiarize the reader with the general methods used in this research, we provide a brief overview of wavelet analysis and show how program execution characteristics can be represented using wavelet analysis.

Discrete Wavelet Transform (DWT)

Wavelets are mathematical tools that use a prototype function (called the analyzing or mother wavelet) to transform data of interest into different frequency components, and then analyze each component with a resolution matched to its scale. Therefore, the wavelet transform is capable of providing a compact and effective mathematical representation of data. In contrast to Fourier transforms, which only offer frequency representations, wavelet transforms provide time and frequency localization simultaneously. Wavelet analysis allows one to choose from numerous wavelet functions [26, 27]. In this section, we provide a quick primer on wavelet analysis using the Haar wavelet, which is the simplest form of wavelet.

Consider a data series $X_{n,k}$, $k = 0, 1, 2, \ldots$, at the finest time scale resolution level $2^{-n}$. This time series might represent a specific program characteristic (e.g., number of executed instructions, branch mispredictions and cache misses) measured at a given time scale. We can coarsen this event series by averaging (with a slightly different normalization factor) over non-overlapping blocks of size two

$$X_{n-1,k} = \frac{1}{\sqrt{2}} \left( X_{n,2k} + X_{n,2k+1} \right) \tag{2-1}$$

and generate a new time series $X_{n-1}$, which is a coarser granularity representation of the original series $X_n$. The difference between the two representations, known as details, is
$$D_{n-1,k} = \frac{1}{\sqrt{2}} \left( X_{n,2k} - X_{n,2k+1} \right) \tag{2-2}$$

Note that the original time series $X_n$ can be reconstructed from its coarser representation $X_{n-1}$ by simply adding in the details $D_{n-1}$; i.e., $X_n = 2^{-1/2} (X_{n-1} + D_{n-1})$. We can repeat this process (i.e., write $X_{n-1}$ as the sum of yet another coarser version $X_{n-2}$ of $X_n$ and the details $D_{n-2}$, and iterate) for as many scales as are present in the original time series, i.e.,

$$X_n = 2^{-n/2} X_0 + 2^{-n/2} D_0 + \cdots + 2^{-1/2} D_{n-1} \tag{2-3}$$

We refer to the collection of $X_0$ and $D_j$ as the discrete Haar wavelet coefficients. The calculation of all $D_{j,k}$, which can be done iteratively using equations (2-1) and (2-2), makes up the so-called discrete wavelet transform (DWT). As can be seen, the DWT offers a natural hierarchical structure to represent data behavior at multiple resolution levels: the first few wavelet coefficients contain an overall, coarser approximation of the data; additional coefficients add finer detail. This property can be used to capture workload execution behavior.

Figure 2-1 illustrates the procedure of using the Haar-based DWT to transform a series of data {3, 4, 20, 25, 15, 5, 20, 3}. As can be seen, scale 1 is the finest representation of the data. At scale 2, the approximations {3.5, 22.5, 10, 11.5} are obtained by taking the average of {3, 4}, {20, 25}, {15, 5} and {20, 3} at scale 1 respectively. The details {-0.5, -2.5, 5, 8.5} are the differences of {3, 4}, {20, 25}, {15, 5} and {20, 3} divided by 2 respectively. Note that this example uses plain averages and half-differences (a factor of 1/2) rather than the $1/\sqrt{2}$ normalization of equations (2-1) and (2-2). The process continues by decomposing the scaling coefficient (approximation) vector using the same steps, and completes when only one coefficient remains. As a result, the wavelet decomposition is the collection of average and detail coefficients at all scales. In other words, the wavelet transform of the original data is the single coefficient representing the overall average of the original data, followed by the detail coefficients in order
of increasing resolution. Different resolutions can be obtained by adding difference values back to or subtracting differences from the averages.

Figure 2-1 Example of Haar wavelet transform. (Original data: 3, 4, 20, 25, 15, 5, 20, 3. Scaling filter G0: 3.5, 22.5, 10, 11.5; wavelet filter H0: -0.5, -2.5, 5, 8.5. Scaling filter G1: 13, 10.75; wavelet filter H1: -9.5, -0.75. Scaling filter G2: 11.875; wavelet filter H2: 1.125. Resulting coefficients: 11.875 (approximation, level 0); 1.125 (detail, level 1); -9.5, -0.75 (detail coefficients, level 2); -0.5, -2.5, 5, 8.5 (detail coefficients, level 3).)

For instance, {13, 10.75} = {11.875+1.125, 11.875-1.125}, where 11.875 and 1.125 are the first and the second coefficient respectively. This process can be performed recursively until the finest scale is reached. Therefore, through an inverse transform, the original data can be recovered from the wavelet coefficients. The original data can be perfectly recovered if all wavelet coefficients are involved. Alternatively, an approximation of the time series can be reconstructed using a subset of the wavelet coefficients. A wavelet transform gives time-frequency localization of the original data. As a result, the time domain signal can be accurately approximated using only a few wavelet coefficients since they capture most of the energy of the input data.

Apply DWT to Capture Workload Execution Behavior

Since the variation of program characteristics over time can be viewed as a signal, we apply discrete wavelet analysis to capture program execution behavior. To obtain time domain workload execution characteristics, we break down the entire program execution into intervals and
then sample multiple data points within each interval. Therefore, at the finest resolution level, program time domain behavior is represented by a data series within each interval. Note that the sampled data can be any runtime program characteristic of interest. We then apply the discrete wavelet transform (DWT) to each interval. As described in the previous section, the result of the DWT is a set of wavelet coefficients which represent the behavior of the sampled time series in the wavelet domain.

Figure 2-2 Comparison of execution characteristics in the time and wavelet domains: (a) time domain representation (sampled time domain workload execution statistics), (b) wavelet domain representation (values of wavelet coefficients 1 to 16).

Figure 2-2 (a) shows the sampled time domain workload execution statistics (the y-axis represents the number of cycles a processor spends on executing a fixed number of instructions) of the benchmark gcc within one execution interval. In this example, the program execution interval is represented by 1024 sampled data points. Figure 2-2 (b) illustrates the wavelet domain representation of the original time series after a discrete wavelet transform is applied. Although the DWT operations can produce as many wavelet coefficients as there are input data points, the first few wavelet coefficients usually contain the important trend. In Figure 2-2 (b), we show the values of the first 16 wavelet coefficients. As can be seen, the discrete wavelet transform provides a compact representation of the original large volume of data. This feature can be exploited to create concise yet informative fingerprints to capture program execution behavior.
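The averaging and differencing procedure behind the Figure 2-1 example can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `haar_dwt` is hypothetical, the input length is assumed to be a power of two, and the code follows the plain average and half-difference convention of the worked example rather than the 1/sqrt(2) normalization of equations (2-1) and (2-2).

```python
def haar_dwt(data):
    """Haar decomposition with the average / half-difference convention
    of the Figure 2-1 example. len(data) must be a power of two."""
    approx = list(data)
    details = []  # detail vectors, finest scale first
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]
    # Overall average first, then details from coarsest to finest scale.
    coeffs = approx
    for d in reversed(details):
        coeffs = coeffs + d
    return coeffs


coeffs = haar_dwt([3, 4, 20, 25, 15, 5, 20, 3])
print(coeffs)  # [11.875, 1.125, -9.5, -0.75, -0.5, -2.5, 5.0, 8.5]
```

The output ordering (overall average, then details in order of increasing resolution) matches the description of the wavelet transform given for Figure 2-1, so the first few entries are the coefficients one would keep as a compact fingerprint of an interval.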

One advantage of using wavelet coefficients to fingerprint program execution is that program time domain behavior can be reconstructed from these wavelet coefficients. Figures 2-3 and 2-4 show that the time domain workload characteristics can be recovered using the inverse discrete wavelet transform.

Figure 2-3 Sampled time domain program behavior

Figure 2-4 Reconstructing the workload dynamic behaviors: (a) 1 wavelet coefficient, (b) 2 wavelet coefficients, (c) 4 wavelet coefficients, (d) 8 wavelet coefficients, (e) 16 wavelet coefficients, (f) 64 wavelet coefficients.

In Figure 2-4 (a)-(e), the first 1, 2, 4, 8, and 16 wavelet coefficients were used to restore program time domain behavior with increasing fidelity. As shown in Figure 2-4 (f), when all (e.g. 64) wavelet coefficients are used for recovery, the original signal can be completely restored. However, this could involve storing and processing a large number of wavelet coefficients. A wavelet transform gives time-frequency localization of the original data. As a result, most of the energy of the input data can be represented by only a few wavelet
coefficients. As can be seen, using 16 wavelet coefficients can recover program time domain behavior with sufficient accuracy.

To classify program execution into phases, it is essential that the generated wavelet coefficients across intervals preserve the dynamics that workloads exhibit within the time domain. Figure 2-5 shows the variation of the first 16 wavelet coefficients (coeff 1 to coeff 16), which represent the wavelet domain behavior of branch mispredictions and L1 data cache hits on the benchmark gcc. The data are shown for the entire program execution, which contains a total of 1024 intervals.

Figure 2-5 Variation of wavelet coefficients (coeff 1 to coeff 16): (a) branch misprediction, (b) L1 data cache hit.

Figure 2-5 shows that wavelet domain transforms largely preserve program dynamic behavior. Another interesting observation is that the first-order wavelet coefficient exhibits much more significant variation than the higher-order wavelet coefficients. This suggests that wavelet domain workload dynamics can be effectively captured using a few low-order wavelet coefficients.
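The reconstruction from a truncated set of coefficients described for Figure 2-4 can be illustrated as follows. This is a sketch under stated assumptions: `haar_dwt`, `haar_idwt` and `reconstruct` are hypothetical helper names, the average and half-difference convention of the Figure 2-1 example is used, and discarded coefficients are simply set to zero before the inverse transform.

```python
def haar_dwt(data):
    """Forward Haar transform: [overall average, details coarse -> fine]."""
    approx, details = list(data), []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]
    for d in reversed(details):
        approx = approx + d
    return approx


def haar_idwt(coeffs):
    """Inverse of haar_dwt: rebuild the series level by level."""
    approx, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        finer = []
        for a, d in zip(approx, details):
            finer.extend([a + d, a - d])  # undo average / half-difference
        approx = finer
    return approx


def reconstruct(data, keep):
    """Approximate data using only its first `keep` wavelet coefficients."""
    coeffs = haar_dwt(data)
    truncated = coeffs[:keep] + [0.0] * (len(coeffs) - keep)
    return haar_idwt(truncated)


series = [3, 4, 20, 25, 15, 5, 20, 3]
for keep in (1, 2, 4, 8):
    print(keep, reconstruct(series, keep))
```

With one coefficient the reconstruction is the constant overall average; each additional group of coefficients refines it, and keeping all eight returns the original series, mirroring the progression shown in Figure 2-4.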

2D Wavelet Transform

To effectively capture the two-dimensional spatial characteristics across large-scale multi-core architecture substrates, we also use 2D wavelet analysis. With 1D wavelet analysis that uses Haar wavelet filters, each adjacent pair of data in a discrete interval is replaced with its average and difference.

Figure 2-6 2D wavelet transforms on 4 data points. (For an original block with rows (a, b) and (c, d), the average is (a+b+c+d)/4, and the three detail panels, D-horizontal, D-vertical and D-diagonal, show ((a+b)/2-(c+d)/2)/2, ((b+d)/2-(a+c)/2)/2 and ((a+d)/2-(b+c)/2)/2.)

A similar concept can be applied to obtain a 2D wavelet transform of data in a discrete plane. As shown in Figure 2-6, each group of four adjacent points in a discrete 2D plane can be replaced by their averaged value and three detailed values. The detailed values (D-horizontal, D-vertical, and D-diagonal) correspond to the average of the difference of: 1) the summation of the rows, 2) the summation of the columns, and 3) the summation of the diagonals.

To obtain wavelet coefficients for 2D data, we apply a 1D wavelet transform to the data along the X-axis first, resulting in low-pass and high-pass signals (average and difference). Next, we apply 1D wavelet transforms to both signals along the Y-axis, generating one averaged and three detailed signals. A 2D wavelet decomposition is then obtained by recursively repeating this procedure on the averaged signal. Figure 2-7 (a) illustrates the procedure. As can be seen, the 2D wavelet decomposition can be represented by a tree-based structure. The root node of the tree contains the original data (in row-major order) of the mesh of values (for example, performance or temperatures of the four adjacent cores, network-on-chip links, cache banks etc.). First, we apply 1D wavelet transforms along the X-axis, i.e. for each two points along the X-axis
we compute the average and difference, so we obtain (3 5 7 1 9 1 5 9) and (1 1 1 -1 5 -1 1 1). Next, we apply 1D wavelet transforms along the Y-axis; for each two points along the Y-axis we compute the average and difference (at level 0 in the example shown in Figure 2-7 (a)). We perform this process recursively until the number of elements in the averaged signal becomes 1 (at level 1 in the example shown in Figure 2-7 (a)).

Figure 2-7 2D wavelet transforms on 16 cores/hardware components. ((a) Decomposition tree. Original data in row-major order: 2 4 4 6 6 8 2 0 4 14 2 0 4 6 8 10. After the 1D wavelet along the x-axis: low-pass signal 3 5 7 1 9 1 5 9, high-pass signal 1 1 1 -1 5 -1 1 1. After the 1D wavelet along the y-axis at L=0: 5 3 7 5, 2 -2 -2 4, 1 0 3 0 and 0 -1 -2 1, forming the average and the horizontal, vertical and diagonal details. Repeating both steps on the average at L=1: low-pass 4 6, high-pass -1 -1, then 5, 1, -1 and 0. (b) Wavelet domain layout: Avg. (L=1) together with the horizontal, vertical and diagonal details at L=1 and L=0.)

Figure 2-7 (b) shows the wavelet domain multi-resolution representation of the 2D spatial data. Figure 2-8 further demonstrates that the 2D architecture characteristics can be effectively captured using a small number of wavelet coefficients (e.g. Average (L=0) or Average (L=1)). Since a small set of wavelet coefficients provides concise yet insightful information on architecture 2D spatial characteristics, we use predictive models (i.e. neural networks) to relate them individually to various architecture design parameters. Through the inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize architecture 2D spatial characteristics across the design space. Compared with a simulation-based method,
predicting a small set of wavelet coefficients using analytical models is computationally efficient and is scalable to large-scale architecture design.

Figure 2-8 Example of applying 2D DWT on a non-uniformly accessed cache: (a) NUCA hit numbers, (b) 2D DWT (L=0), (c) 2D DWT (L=1).
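The row-then-column procedure of the 2D Haar transform can be sketched as below, using the 4x4 mesh from the Figure 2-7 example. The function names are hypothetical, and two points are assumptions read off the example rather than stated rules: the difference is taken as (second - first) / 2, which is what reproduces the numbers quoted for Figure 2-7, and the naming of the detail bands as horizontal, vertical and diagonal follows the row, column and diagonal reading of Figure 2-6.

```python
def haar_step_1d(values):
    """One Haar level: pairwise averages and half-differences.
    The difference is (second - first) / 2, as in the Figure 2-7 numbers."""
    avg = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    diff = [(values[i + 1] - values[i]) / 2 for i in range(0, len(values), 2)]
    return avg, diff


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def haar_step_rows(matrix):
    """Apply one 1D Haar level along the X-axis (within each row)."""
    low, high = [], []
    for row in matrix:
        avg, diff = haar_step_1d(row)
        low.append(avg)
        high.append(diff)
    return low, high


def haar_step_cols(matrix):
    """Apply one 1D Haar level along the Y-axis (within each column)."""
    low_t, high_t = haar_step_rows(transpose(matrix))
    return transpose(low_t), transpose(high_t)


def haar_2d_level(matrix):
    """One 2D level: average plus three detail bands (assumed naming)."""
    low, high = haar_step_rows(matrix)
    average, horizontal = haar_step_cols(low)
    vertical, diagonal = haar_step_cols(high)
    return average, horizontal, vertical, diagonal


mesh = [[2, 4, 4, 6],
        [6, 8, 2, 0],
        [4, 14, 2, 0],
        [4, 6, 8, 10]]
level0 = haar_2d_level(mesh)
level1 = haar_2d_level(level0[0])  # recurse on the averaged signal
print(level0)
print(level1)
```

Recursing on the first returned band until it holds a single value gives the full decomposition; for this mesh the level-0 average is (5 3 7 5) in row-major order and the level-1 average is 5, in agreement with the values listed for Figure 2-7.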

CHAPTER 3
COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION

Obtaining phase dynamics is, in many cases, of great interest for accurately capturing program behavior and for precisely applying runtime application-oriented optimizations. For example, complex, real-world workloads may run for hours, days or even months before completion. Their long execution time implies that program time-varying behavior can manifest across a wide range of scales, making modeling phase behavior using a single time scale less informative. To overcome this limitation of conventional phase analysis techniques, we propose using wavelet-based multiresolution analysis to characterize phase dynamic behavior and develop metrics to quantitatively evaluate the complexity of phase structures. We also propose methodologies to classify program phases from the perspectives of their dynamics and complexity.

Specifically, the goal of this chapter is to answer the following questions: How should the complexity of program dynamics be defined? How do program dynamics change over time? If classified using existing methods, how similar are the program dynamics in each phase? How can phases with homogeneous dynamic behavior be better identified?

In this chapter, we implement our complexity-based phase analysis technique and evaluate its effectiveness against existing phase analysis methods based on program control flow and runtime information. We show that in both cases the proposed technique produces phases that exhibit more homogeneous dynamic behavior than existing methods do.

Characterizing and Classifying the Program Dynamic Behavior

Using the wavelet-based multiresolution analysis described in Chapter 2, we characterize, quantify and classify program dynamic behavior on a high-performance, out-of-order execution superscalar processor coupled with a multi-level memory hierarchy.

Experimental Setup

We performed our analysis using ten SPEC CPU 2000 benchmarks: crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf and vortex. All programs were run with reference input to completion. We chose to focus on only 10 programs because of the lengthy simulation time incurred by executing all of the programs to completion. The statistics of workload dynamics were measured on the SimpleScalar 3.0 [28] sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model is detailed in Table 3-1.

Table 3-1 Baseline machine configuration

Parameter              Configuration
Processor Width        8
ITLB                   128 entries, 4-way, 200 cycle miss
Branch Prediction      combined 8K tables, 10 cycle misprediction, 2 predictions/cycle
BTB                    2K entries, 4-way
Return Address Stack   32 entries
L1 Instruction Cache   32K, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access
RUU Size               128 entries
Load/Store Queue       64 entries
Store Buffer           16 entries
Integer ALU            4 I-ALU, 2 I-MUL/DIV
FP ALU                 2 FP-ALU, 1 FP-MUL/DIV
DTLB                   256 entries, 4-way, 200 cycle miss
L1 Data Cache          64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access
L2 Cache               unified 1MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access          100 cycles

Metrics to Quantify Phase Complexity

To quantify phase complexity, we measure the similarity between phase dynamics observed at different time scales. More specifically, we use cross-correlation coefficients to measure the similarity between the original data sampled at the finest granularity and the approximated version reconstructed from wavelet scaling coefficients obtained at a coarser scale. The cross-correlation coefficient (XCOR) of the two data series is defined as:
$$XCOR(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{3-1}$$

where X is the original data series, Y is the approximated data series, and $\bar{X}$ and $\bar{Y}$ are their means. Note that XCOR = 1 if the program dynamics observed at the finest scale and their approximation at a coarser granularity exhibit perfect correlation, and XCOR = 0 if the program dynamics and their approximation vary independently across time scales. X and Y can be any runtime program characteristics of interest. In this chapter, we use instructions per cycle (IPC) as the metric due to its wide usage in computer architecture design and performance evaluation.

To sample IPC dynamics, we break down the entire program execution into 1024 intervals and then sample 1024 IPC data points within each interval. Therefore, at the finest resolution level, the program dynamics of each execution interval are represented by an IPC data series with a length of 1024. We then apply wavelet multiresolution analysis to each interval. In a wavelet transform, each DWT operation produces an approximation coefficient vector with a length equal to half of the input data. We remove the detail coefficients after each wavelet transform and use only the approximation part to reconstruct the IPC dynamics, and then calculate the XCOR between the original data and the reconstructed data. We apply the discrete wavelet transform to the approximation part iteratively until the length of the approximation coefficient vector is reduced to 1. Each approximation coefficient vector is used to reconstruct a full IPC trace with a length of 1024, and the XCOR between the original and reconstructed traces is calculated using equation (3-1). As a result, for each program execution interval, we obtain an XCOR vector in which each element represents the cross-correlation coefficient between the original workload dynamics and the approximated workload dynamics at a different scale. Since
we use 1024 samples within each interval, we create an XCOR vector with a length of 10 for each interval, as shown in Figure 3-1.

Figure 3-1 XCOR vectors for each program execution interval

Profiling Program Dynamics and Complexity

We use XCOR metrics to quantify the program dynamics and complexity of the studied SPEC CPU 2000 benchmarks. Figure 3-2 shows the results for all 1024 execution intervals across ten levels of abstraction for the benchmark gcc.

Figure 3-2 Dynamic complexity profile of benchmark gcc

As can be seen, the benchmark gcc shows a wide variety of changing dynamics during its execution. As the time scale increases, XCOR values decrease monotonically. This is because wavelet approximation at a coarse scale removes details in program dynamics observed at a fine-grained level. A rapidly decreasing XCOR implies highly complex structures that
cannot be captured by coarse-level approximation. In contrast, a slowly decreasing XCOR suggests that program dynamics can be largely preserved using few samples. Figure 3-2 also shows a dotted line along which XCOR decreases linearly with increasing time scale. The XCOR plots below that dotted line indicate rapidly decreasing XCOR values, i.e. complex program dynamics. As can be seen, a significant fraction of the gcc execution intervals manifest quickly decreasing XCOR values, indicating that the program exhibits highly complex structure at the fine-grained level. Figure 3-2 also reveals that there are a few gcc execution intervals that have good scalability in their dynamics. On these execution intervals, the XCOR values drop by only 0.1 when the time scale is increased from 1 to 8. The results (Figure 3-2) clearly indicate that some program execution intervals can be accurately approximated by their high-level abstractions while others cannot.

We further break down the XCOR values into 10 categories ranging from 0 to 1 and analyze their distribution across time scales. Due to space limitations, we only show the results of three programs (swim, crafty and gcc, see Figure 3-3), which represent the characteristics of all analyzed benchmarks. Note that at scale 1, the XCOR values of all execution intervals are always 1. Programs show heterogeneous XCOR value distributions starting from scale level 2. As can be seen, the benchmark swim exhibits good scalability in its dynamic complexity. The XCOR values of all execution intervals remain above 0.9 when the time scale is increased from 1 to 7. This implies that the captured program behavior is not sensitive to the time scale in that range. Therefore, we classify swim as a low complexity program. On the benchmark crafty, XCOR values decrease uniformly as the time scale increases, indicating that the observed program dynamics are sensitive to the time scale used to obtain them. We refer to this behavior as medium complexity. On the benchmark gcc, program dynamics decay rapidly. This suggests that
abundant program dynamics could be lost if coarser time scales are used to characterize them. We refer to this characteristic as high complexity behavior.

Figure 3-3 XCOR value distributions (%) across scales 1 to 10, binned as [0, 0.1), [0.1, 0.2), ..., [0.9, 1) and 1: (a) swim (low complexity), (b) crafty (medium complexity), (c) gcc (high complexity).

The dynamics complexity and the XCOR value distribution plots (Figure 3-2 and Figure 3-3) provide a quantitative and informative representation of runtime program complexity. Using the above information, we classify the studied programs in terms of their complexity; the results are shown in Table 3-2.

Table 3-2 Classification of benchmarks based on their complexity

Category             Benchmarks
Low complexity       swim
Medium complexity    crafty, gzip, parser, perlbmk, twolf
High complexity      gap, gcc, mcf, vortex
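The XCOR vector built from equation (3-1) can be sketched as follows. This is an illustration under several assumptions that the text does not pin down: the helper names are hypothetical; each coarser approximation is obtained by pairwise averaging and expanded back to full length by repeating values (a Haar reconstruction with all details dropped); scale 1 is taken to be the original series, so a series of 1024 samples yields the 10 entries for approximation lengths 1024 down to 2; and a zero-variance series is assigned XCOR = 0.

```python
import math


def pearson(x, y):
    """Cross-correlation coefficient of equation (3-1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)
                    * sum((b - mean_y) ** 2 for b in y))
    return num / den if den > 0 else 0.0  # assumption: constant series -> 0


def xcor_vector(series):
    """XCOR between a series and its approximation at each coarser scale.
    len(series) must be a power of two."""
    n = len(series)
    approx = list(series)
    xcor = []
    while len(approx) >= 2:
        block = n // len(approx)
        # Expand the approximation back to full length (details dropped).
        rebuilt = [value for value in approx for _ in range(block)]
        xcor.append(pearson(series, rebuilt))
        approx = [(approx[i] + approx[i + 1]) / 2
                  for i in range(0, len(approx), 2)]
    return xcor


print(xcor_vector([3, 4, 20, 25, 15, 5, 20, 3]))
```

Under these assumptions the first entry is 1 (the series compared with itself) and the entries never increase with scale, consistent with the monotonic decrease observed in Figure 3-2; how quickly they fall is what separates the low, medium and high complexity categories.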

Classifying Program Phases Based on Their Dynamic Behavior

In this section, we show that program execution manifests heterogeneous complexity behavior. We further examine the efficiency of current methods in classifying program dynamics into phases and propose a new method that can better identify program complexity. Classifying complexity-based phase behavior enables us to understand program dynamics progressively in a fine-to-coarse fashion, to operate at different resolutions, to manipulate features at different scales, and to localize characteristics in both spatial and frequency domains.

Simpoint

Sherwood and Calder [1] proposed a phase analysis tool called Simpoint to automatically classify the execution of a program into phases. They found that intervals of program execution grouped into the same phase had similar statistics. The Simpoint tool clusters program execution based on code signature and execution frequency. We identified program execution intervals grouped into the same phase by the Simpoint tool and analyzed their dynamic complexity.

Figure 3-4 XCORs in the same phase by Simpoint: (a) Simpoint cluster #7, (b) Simpoint cluster #5, (c) Simpoint cluster #48.

Figure 3-4 shows the results for the benchmark mcf, on which Simpoint generates 55 clusters. The figure shows program dynamics within three of the clusters generated by Simpoint. Each cluster represents a unique phase. In cluster 7, the classified phase shows homogeneous dynamics. In cluster 5, program execution intervals show two distinct dynamics
but they are classified as the same phase. In cluster 48, program execution complexity varies widely; however, Simpoint classifies these intervals as a single phase. The results (Figure 3-4) suggest that program execution intervals classified as the same phase by Simpoint can still exhibit widely varied behavior in their dynamics.

Complexity-aware Phase Classification

To enhance the capability of current methods in characterizing program dynamics, we propose complexity-aware phase classification. Our method uses the multiresolution property of wavelet transforms to identify and classify changes in program code execution across different scales.

We assume a baseline phase analysis technique that uses basic block vectors (BBV) [10]. A basic block is a section of code that is executed from start to finish with one entry and one exit. A BBV represents the code blocks executed during a given interval of execution. To represent program dynamics at different time scales, we create a set of basic block vectors for each interval at different resolutions. For example, at the coarsest level (scale = 10), a program execution interval is represented by one BBV. At the most detailed level, the same program execution interval is represented by 1024 BBVs from 1024 consecutively subdivided intervals (Figure 3-5). To reduce the amount of data that needs to be processed, we use random projection to reduce the dimensionality of all BBVs to 15, as suggested in [1].

Figure 3-5 BBVs with different resolutions

The coarser scale BBVs are the approximations of the finest scale BBVs generated by the wavelet-based multiresolution analysis.

Figure 3-6 Multiresolution analysis of the projected BBVs

As shown in Figure 3-6, the discrete wavelet transform is applied to each dimension of a set of BBVs at the finest scale. The XCOR calculation is used to estimate the correlations between a BBV element and its approximations at coarser scales. The results are 15 XCOR vectors representing the complexity of each dimension in the BBVs across 10 levels of abstraction. The 15 XCOR vectors are then averaged together to obtain an aggregated XCOR vector that represents the entire BBV complexity characteristics for that execution interval. Using the above steps, we obtain an aggregated XCOR vector for each program execution interval. We then run the k-means clustering algorithm [29] on the collected XCOR vectors, which represent the dynamic complexity of program execution intervals, and classify them into phases. This is similar to what Simpoint does. The difference is that the Simpoint tool uses raw BBVs whereas our method uses aggregated BBV XCOR vectors as the input for k-means clustering.
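The aggregation and clustering steps above can be sketched as follows. This is a toy illustration, not the tooling used in this work: the function names are hypothetical, the input numbers are made up and use 2 dimensions and 3 scales instead of 15 and 10, and the clustering is a plain Lloyd's k-means whose initial centroids are simply the first k points, a simplification chosen only to keep the example deterministic.

```python
def mean_vector(vectors):
    """Element-wise mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=50):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


# Hypothetical per-dimension XCOR vectors for four execution intervals.
intervals = [
    [[1.0, 0.90, 0.80], [1.0, 0.95, 0.90]],  # XCOR decays slowly
    [[1.0, 0.40, 0.10], [1.0, 0.50, 0.20]],  # XCOR decays quickly
    [[1.0, 0.92, 0.85], [1.0, 0.90, 0.80]],
    [[1.0, 0.45, 0.15], [1.0, 0.35, 0.10]],
]
aggregated = [mean_vector(per_dimension) for per_dimension in intervals]
phases = kmeans(aggregated, k=2)
print(phases)  # [0, 1, 0, 1]
```

Intervals whose aggregated XCOR vectors decay at a similar rate end up in the same phase, which is the intended effect of clustering on complexity rather than on the raw BBVs.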

Experimental Results

We compare Simpoint and the proposed approach in their capability of classifying phase complexity. Since we use the wavelet transform on program basic block vectors, we refer to our method as multiresolution analysis of BBV (MRA-BBV).

Figure 3-7 Weighted COV calculation. (For each of clusters #1 to #N, the XCOR vectors of length 10 yield a CoV vector of length 10; the per-cluster CoVs are combined with weights W into weighted CoVs of length 10.)

We examine the similarity of program complexity within each phase classified by the two approaches. Instead of using IPC, we use IPC dynamics as the metric for evaluation. After classifying all program execution intervals into phases, we examine each phase and compute the IPC XCOR vectors for all the intervals in that phase. We then calculate the standard deviation of the IPC XCOR vectors within each phase and divide the standard deviation by the average to get the coefficient of variation (COV).

As shown in Figure 3-7, we calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e. weighted COVs) used to compare different phase classifications for a given program. Since COV measures the standard deviation as a percentage of the average, a lower COV value means a better phase classification technique.
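The weighted COV metric can be sketched as below. The function names are hypothetical, and two details are assumptions: the population standard deviation is used, and each phase's weight is its share of execution intervals, which equals its share of execution when all intervals have the same length.

```python
import math


def cov(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    mean = sum(values) / len(values)
    if mean == 0:
        return 0.0
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


def weighted_cov(xcor_vectors, labels):
    """Per-scale COV of each phase, weighted by the phase's share of intervals."""
    n_scales = len(xcor_vectors[0])
    total = len(xcor_vectors)
    result = [0.0] * n_scales
    for phase in sorted(set(labels)):
        members = [v for v, l in zip(xcor_vectors, labels) if l == phase]
        weight = len(members) / total
        for scale in range(n_scales):
            result[scale] += weight * cov([m[scale] for m in members])
    return result


# Hypothetical IPC XCOR vectors (2 scales) for four intervals in two phases.
vectors = [[1.0, 0.8], [1.0, 0.4], [1.0, 0.6], [1.0, 0.6]]
print(weighted_cov(vectors, labels=[0, 0, 1, 1]))
```

The result has one entry per scale, so two classification methods can be compared scale by scale for a given program, with lower values indicating phases whose intervals have more similar dynamics.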

Figure 3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics. (CoV, on an axis from 0% to 100%, at scales 1 to 10 for crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf and vortex; off-scale bars are labeled 104%, 128% and 101%.)

Figure 3-8 shows experimental results for all the studied benchmarks. As can be seen, the MRA-BBV method produces phases which exhibit more homogeneous dynamics and complexity than the standard, BBV-based method. This can be seen from the lower COV values generated by the MRA-BBV method. In general, the COV values yielded by both methods increase when coarse time scales are used for complexity approximation. MRA-BBV is capable of achieving significantly better classification on benchmarks with high complexity, such as gap, gcc and mcf. On programs which exhibit medium complexity, such as crafty, gzip, parser, and twolf, the two schemes show comparable effectiveness. On benchmarks with trivial complexity (e.g. swim), both schemes work well.

We further examine the capability of using runtime performance metrics to capture complexity-aware phase behavior. Instead of using BBVs, the sampled IPC is used directly as the input to the k-means phase clustering algorithm. Similarly, we apply multiresolution analysis to the IPC data and then use the gathered information for phase classification. We call this method

multiresolution analysis of IPC (MRA-IPC). Figure 3-9 shows the phase classification results. As can be seen, the observations we made in the BBV-based cases also hold in the IPC-based cases. This implies that the proposed multiresolution analysis can be applied to both methods to improve the capability of capturing phase dynamics.

Figure 3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics (two panels: crafty, gap, gcc, gzip, mcf and parser, perlbmk, swim, twolf, vortex; y-axis: CoV, 0% to 100%; x-axis: 1 to 10 for each benchmark; series: IPC and MRA-IPC; off-scale bars are labeled with their values, which range from 104% to 173%)


CHAPTER 4
IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE ANALYSIS

In this chapter, we focus on workload-statistics-based phase analysis because, for a given machine configuration and environment, it is better suited to identifying how the targeted architecture features vary during program execution. In contrast, phase classification using program code structures cannot reveal how workloads behave architecturally [13, 30]. Phase analysis using specified workload characteristics therefore allows one to explicitly link the targeted architecture features to the classified phases. For example, if phases are used to optimize cache efficiency, the workload characteristics that reflect cache behavior can be used to explicitly classify program execution into cache performance/power/reliability oriented phases. Program code structure based phase analysis identifies similar phases only if they have similar code flow. There can be cases where two sections of code have different code flow but exhibit similar architectural behavior [13]; code flow based phase analysis would then classify them as different phases. Another advantage of workload-statistics-based phase analysis is that when multiple threads share the same resource (e.g. pipeline, cache), using workload execution information to classify phases makes it possible to capture program dynamic behavior due to the interactions between threads.

The key goal of workload execution based phase analysis is to accurately and reliably discern and recover phase behavior from various program runtime statistics represented as large-volume, high-dimension and noisy data. To achieve this objective effectively, recent work [30, 31] proposes using wavelets as a tool to assist phase analysis. The basic idea is to transform workload time domain behavior into the wavelet domain. The generated wavelet coefficients, which extract compact yet informative program runtime features, are then assembled together to


facilitate phase classification. Nevertheless, in current work, the examined scope of workload characteristics and the explored benefits of the wavelet transform are quite limited.

In this chapter, we extend the research of Chapter 3 by applying wavelets to a wide range of program execution statistics and quantifying the benefits of using wavelets for improving accuracy, scalability and robustness in phase classification. We conclude that wavelet domain phase analysis has the following advantages: 1) accuracy: the wavelet transform significantly reduces temporal dependence in the sampled workload statistics. As a result, simple models which are insufficient in the time domain become quite accurate in the wavelet domain. In addition, wavelet coefficients transformed from various dimensions of program execution characteristics can be dynamically assembled together to further improve phase classification accuracy; 2) scalability: phase classification using wavelet analysis of high-dimension sampled workload statistics can alleviate the counter overflow problem, which has a negative impact on phase detection. It therefore scales much better to the large-scale phases exhibited by long-running, real-world programs; and 3) robustness: wavelets offer denoising capabilities which allow phase classification to be performed robustly in the presence of workload execution variability.

Workload-Statistics-Based Phase Analysis

Using the wavelet-based method, we explore program phase analysis on a high-performance, out-of-order execution superscalar processor coupled with a multi-level memory hierarchy. We use the Daubechies wavelet [26, 27] with an order of 8 for the rest of the experiments due to its high accuracy and low computation overhead. This section describes our experimental methodologies, the simulated machine configuration, the benchmarks used and the evaluated metrics.


We performed our analysis using twelve SPEC CPU 2000 integer benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex and vpr. All programs were run with the reference input to completion. The runtime workload execution statistics were measured on the SimpleScalar 3.0 sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model we used is detailed in Table 4-1.

Table 4-1 Baseline machine configuration

Parameter               Configuration
Processor Width         8
ITLB                    128 entries, 4-way, 200 cycle miss
Branch Prediction       combined 8K tables, 10 cycle misprediction, 2
BTB                     2K entries, 4-way
Return Address Stack    32 entries
L1 Instruction Cache    32K, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access
RUU Size                128 entries
Load/Store Queue        64 entries
Store Buffer            16 entries
Integer ALU             4 I-ALU, 2 I-MUL/DIV
FP ALU                  2 FP-ALU, 1 FP-MUL/DIV
DTLB                    256 entries, 4-way, 200 cycle miss
L1 Data Cache           64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access
L2 Cache                unified 1MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access           100 cycles

We use IPC (instructions per cycle) as the metric to evaluate the similarity of program execution within each classified phase. To quantify phase classification accuracy, we use the weighted COV metric proposed by Calder et al. [15]. After classifying all program execution intervals into phases, we examine each phase and compute the IPC for all the intervals in that phase. We then calculate the standard deviation of the IPC within each phase and divide it by the average to get the Coefficient of Variation (COV). We then calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e. weighted COVs) used to compare different phase classifications for a given program.


Since COV measures the standard deviation as a percentage of the average, a lower COV value means a better phase classification technique.

Exploring Wavelet Domain Phase Analysis

We first evaluate the efficiency of wavelet analysis on a wide range of program execution characteristics by comparing its phase classification accuracy with that of methods that use information in the time domain. We then explore methods to further improve phase classification accuracy in the wavelet domain.

Phase Classification: Time Domain vs. Wavelet Domain

The wavelet analysis method provides a cost-effective representation of program behavior. Since wavelet coefficients are generally decorrelated, we can transform the original data into the wavelet domain and then carry out the phase classification task. The generated wavelet coefficients can be used as signatures to classify program execution intervals into phases: if two program execution intervals show similar fingerprints (represented as a set of wavelet coefficients), they can be classified into the same phase. To quantify the benefit of using wavelet-based analysis, we compare phase classification methods that use time domain and wavelet domain program execution information.

With our time domain phase analysis method, each program execution interval is represented by a time series which consists of 1024 sampled program execution statistics. We first apply random projection to reduce the data dimensionality to 16. We then use the k-means clustering algorithm to classify program intervals into phases. This is similar to the method used by the popular Simpoint tool, where the basic block vectors (BBVs) are used as input. For the wavelet domain method, the original time series are first transformed into the wavelet domain using the DWT. The first 16 wavelet coefficients of each program execution interval are extracted


and used as the input to the k-means clustering algorithm. Figure 4-1 illustrates the procedure described above.

Figure 4-1 Phase analysis methods: time domain vs. wavelet domain (program runtime statistics feed either random projection, dimensionality = 16, or the DWT, number of wavelet coefficients = 16; each path is followed by k-means clustering and COV calculation)

We investigated the efficiency of applying wavelet domain analysis to 10 different workload execution characteristics, namely, the numbers of executed loads (load), stores (store) and branches (branch), the number of cycles a processor spends on executing a fixed amount of instructions (cycles), the number of branch mispredictions (branch_miss), the numbers of L1 instruction cache, L1 data cache and L2 cache hits (il1_hit, dl1_hit and ul2_hit), and the numbers of instruction and data TLB hits (itlb_hit and dtlb_hit). Figure 4-2 shows the COVs of phase classifications in the time and wavelet domains when each type of workload execution characteristic is used as an input. As can be seen, compared with using raw, time domain workload data, the wavelet domain analysis significantly improves phase classification accuracy, and this observation holds for all the investigated workload characteristics across all the examined benchmarks. This is because in the time domain, collected program runtime statistics are treated as high-dimension time series data. Random projection methods are used to reduce the dimensionality of the feature vectors which represent a workload signature at a given execution interval. However, the simple random projection function can increase the aliasing between phases and reduce the accuracy of phase detection.
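The two signature paths compared above can be sketched as follows. This is only an illustration under stated assumptions: the Haar wavelet stands in for the Daubechies wavelet used in the experiments (to keep the code short and dependency-free), the projection matrix is Gaussian with a fixed seed, and the function names are hypothetical. Either 16-element signature would then be passed to a k-means implementation, which is omitted here.

```python
import math
import random


def haar_dwt(signal):
    """Full Haar decomposition, coefficients ordered coarse to fine.

    Haar is an assumption made for brevity; len(signal) must be a
    power of two (e.g. the 1024 samples of one execution interval).
    """
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Newly computed (coarser) details go in front of finer ones.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


def wavelet_signature(signal, k=16):
    """Wavelet domain path: keep the first k (coarsest) coefficients."""
    return haar_dwt(signal)[:k]


def random_projection_signature(signal, k=16, seed=0):
    """Time domain path: project the raw samples onto k random directions.

    Re-seeding on every call regenerates the same projection matrix,
    so all intervals are projected consistently.
    """
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) * x for x in signal) / math.sqrt(k)
        for _ in range(k)
    ]
```

Both functions reduce a 1024-sample interval to 16 numbers, so the clustering step sees the same amount of information in either domain.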


Figure 4-2 Phase classification accuracy: time domain vs. wavelet domain (one panel per benchmark: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex and vpr; x-axis: load, store, branch, cycle, branch_miss, il1_hit, dl1_hit, ul2_hit, itlb_hit, dtlb_hit; y-axis: CoV; series: Time Domain and Wavelet Domain)

By transforming program runtime statistics into the wavelet domain, workload behavior can be represented by a series of wavelet coefficients which are much more compact and efficient than their counterparts in the time domain. The wavelet transform significantly reduces temporal dependence, and therefore simple models which are insufficient in the time domain become quite accurate in the wavelet domain.


Figure 4-2 shows that in the wavelet domain, the efficiency of using a single type of program characteristic to classify program phases can vary significantly across different benchmarks. For example, while ul2_hit achieves accurate phase classification on the benchmark vortex, it results in a high phase classification COV on the benchmark gcc. To overcome this disadvantage and to build phase classification methods that can achieve high accuracy across a wide range of applications, we explore using wavelet coefficients derived from different types of workload characteristics.

Figure 4-3 Phase classification using hybrid wavelet coefficients (program runtime statistics 1 to n are each transformed by a DWT into wavelet coefficient sets 1 to n; the sets are combined into hybrid wavelet coefficients, followed by k-means clustering and COV calculation)

As shown in Figure 4-3, a DWT is applied to each type of workload characteristic. The generated wavelet coefficients from different categories can be assembled together to form a signature for a data clustering algorithm.

Our objective is to improve wavelet domain phase classification accuracy across different programs while using an equivalent amount of information to represent program behavior. We choose a set of 16 wavelet coefficients as the phase signature since it provides sufficient precision in capturing program dynamics when a single type of program characteristic is used. If a phase signature can be composed using multiple workload characteristics, there are many ways to form a 16-dimension phase signature. For example, a phase signature can be generated using one wavelet coefficient from each of 16 different workload characteristics (16×1), or it can be composed


using 8 workload characteristics with 2 wavelet coefficients from each type of workload characteristic (8×2). Alternatively, a phase signature can be formed using 4 workload characteristics with 4 wavelet coefficients each or 2 workload characteristics with 8 wavelet coefficients each (4×4 and 2×8, respectively). We extend the 10 workload execution characteristics (Figure 4-2) to 16 by adding the following events: the number of accesses to the instruction cache (il1_access), data cache (dl1_access), L2 cache (ul2_access), instruction TLB (itlb_access) and data TLB (dtlb_access). To understand the trade-offs in choosing different methods to generate hybrid signatures, we performed an exhaustive search using the above 4 schemes on all benchmarks to identify the best COVs that each scheme can achieve. The results (their ranks in terms of phase classification accuracy and the COVs of phase analysis) are shown in Table 4-2. As can be seen, statistically, hybrid wavelet signatures generated using 16 (16×1) and 8 (8×2) workload characteristics achieve higher accuracy. This suggests that combining wavelet domain workload characteristics from multiple dimensions to form a phase signature is beneficial in phase analysis.

Table 4-2 Efficiency of different hybrid wavelet signatures in phase classification
(each entry: hybrid wavelet signature and its phase classification COV)

Benchmark   Rank #1        Rank #2        Rank #3        Rank #4
Bzip2       16×1 (6.5%)    8×2 (10.5%)    4×4 (10.5%)    2×8 (10.5%)
Crafty      4×4 (1.2%)     16×1 (1.6%)    8×2 (1.9%)     2×8 (3.9%)
Eon         8×2 (1.3%)     4×4 (1.6%)     16×1 (1.8%)    2×8 (3.6%)
Gap         4×4 (4.2%)     16×1 (6.3%)    8×2 (7.2%)     2×8 (9.3%)
Gcc         8×2 (4.7%)     16×1 (5.8%)    4×4 (6.5%)     2×8 (14.1%)
Gzip        16×1 (2.5%)    4×4 (3.7%)     8×2 (4.4%)     2×8 (4.9%)
Mcf         16×1 (9.5%)    4×4 (10.2%)    8×2 (12.1%)    2×8 (87.8%)
Parser      16×1 (4.7%)    8×2 (5.2%)     4×4 (7.3%)     2×8 (8.4%)
Perlbmk     8×2 (0.7%)     16×1 (0.8%)    4×4 (0.8%)     2×8 (1.5%)
Twolf       16×1 (0.2%)    8×2 (0.2%)     4×4 (0.4%)     2×8 (0.5%)
Vortex      16×1 (2.4%)    8×2 (4%)       2×8 (4.4%)     4×4 (5.8%)
Vpr         16×1 (3%)      8×2 (14.9%)    4×4 (15.9%)    2×8 (16.3%)
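The four ways of composing a 16-dimension hybrid signature can be illustrated with a short sketch. It assumes the wavelet coefficients of each workload characteristic have already been computed and are ordered coarse to fine; the names `SCHEMES` and `hybrid_signature` are hypothetical, and the exhaustive search over which characteristics to combine is not shown.

```python
# Scheme name -> (number of workload characteristics, coefficients from each).
SCHEMES = {"16x1": (16, 1), "8x2": (8, 2), "4x4": (4, 4), "2x8": (2, 8)}


def hybrid_signature(scheme, coeffs_by_metric, metrics):
    """Concatenate leading wavelet coefficients of the chosen characteristics.

    coeffs_by_metric -- dict: characteristic name -> list of its wavelet
                        coefficients for one execution interval (assumed
                        to be precomputed)
    metrics          -- the characteristics selected for this signature
    """
    n_metrics, per_metric = SCHEMES[scheme]
    if len(metrics) != n_metrics:
        raise ValueError("scheme %s needs %d characteristics" % (scheme, n_metrics))
    signature = []
    for name in metrics:
        coeffs = coeffs_by_metric[name]
        if len(coeffs) < per_metric:
            raise ValueError("not enough coefficients for " + name)
        signature.extend(coeffs[:per_metric])
    return signature  # always 16 values, whichever scheme is chosen
```

Because every scheme yields 16 values, the clustering step receives the same amount of information regardless of how the signature is composed.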


We further compare the efficiency of using the 16×1 hybrid scheme (Hybrid), the best case that a single type of workload characteristic can achieve (Individual_Best) and the Simpoint-based phase classification that uses basic block vectors (BBV). The results for the 12 SPEC integer benchmarks are shown in Figure 4-4.

Figure 4-4 Phase classification accuracy of using the 16×1 hybrid scheme (y-axis: COV (%), 0 to 15; bars: Individual_Best, Hybrid and BBV for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr and AVG; one off-scale bar is labeled 25%)

As can be seen, Hybrid outperforms Individual_Best on 10 out of the 12 benchmarks. Hybrid also outperforms the BBV-based Simpoint method in 10 out of the 12 cases.

Scalability

The results above show that wavelet domain phase analysis can achieve higher accuracy. In this subsection, we address another important issue in phase analysis using workload execution characteristics: scalability. Counters are usually used to collect workload statistics during program execution. The counters may overflow if they are used to track large-scale phase behavior on long-running workloads. Today, many large, real-world workloads can run for days, weeks or even months before completion, and this trend is likely to continue. To perform phase analysis on the next generation of computer workloads and systems, phase classification methods should be capable of scaling with the increasing program execution time.


To understand the impact of counter overflow on phase analysis accuracy, we use 16 accumulative counters to record the 16-dimension workload characteristic. The values of the 16 accumulative counters are then used as a signature to perform phase classification. We gradually reduce the number of bits in the accumulative counters. As a result, counter overflows start to occur. We use two schemes to handle a counter overflow. In our first method, a counter saturates at its maximum value once it overflows. In our second method, the counter is reset to zero after an overflow occurs. After all counter overflows are handled, we use the 16-dimension accumulative counter values to perform phase analysis and calculate the COVs. Figure 4-5 (a) describes the above procedure.

Figure 4-5 Different methods to handle counter overflows ((a) n-bit accumulative counters: over a large scale phase interval, program runtime statistics 1 to n feed accumulative counters 1 to n, followed by overflow handling, k-means clustering and COV calculation; (b) n-bit sampling counters: program runtime statistics 1 to n feed sampling counters 1 to n, followed by the DWT/hybrid wavelet signature, k-means clustering and COV calculation)

Our counter overflow analysis results are shown in Figure 4-6. Figure 4-6 also shows the counter overflow rate (i.e. the percentage of counters that overflow) when counters with different sizes are used to collect workload statistics within program execution intervals. For example, on the benchmark crafty, when the number of bits used in the counters is reduced to 20, 100% of the counters overflow. For clarity, we only show the region within which the counter overflow rate is greater than zero and less than or equal to one.
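The two overflow-handling schemes for an n-bit accumulative counter can be sketched as below. This is an illustrative model only: it assumes the counter keeps counting from zero after a reset (the text does not say what happens after the reset), and the function names are hypothetical.

```python
def accumulate(samples, bits, policy):
    """Model an n-bit accumulative counter over one phase interval.

    policy -- "saturate": pin the counter at its maximum after overflow
              "reset":    set the counter to zero after overflow and
                          keep counting (an assumption of this sketch)
    Returns (final counter value, whether any overflow occurred).
    """
    limit = (1 << bits) - 1
    count = 0
    overflowed = False
    for s in samples:
        if overflowed and policy == "saturate":
            continue  # a saturated counter stays at its maximum
        count += s
        if count > limit:
            overflowed = True
            count = limit if policy == "saturate" else 0
    return count, overflowed


def overflow_rate(flags):
    """Fraction of counters that overflowed in an interval."""
    return sum(1 for f in flags if f) / len(flags)
```

For example, with 8-bit counters and three events of 100 each, the saturating counter ends at 255 and the resetting counter at 0, while a 16-bit counter records the true total of 300.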
Since each program has a different execution time, the region varies from one program to another. As can be seen, counter overflows have a negative impact on phase classification accuracy. In general, COVs increase with


the counter overflow rate. Interestingly, as the overflow rate increases, there are cases in which overflow handling reduces the COVs. This is because overflow handling has the effect of normalizing and smoothing irregular peaks in the original statistics.

Figure 4-6 Impact of counter overflows on phase analysis accuracy (one panel per benchmark: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex and vpr; x-axis: # of bits in counter; y-axis: CoV; series: Saturate, Reset and Wavelet; each counter size is annotated with its counter overflow rate)

One solution to avoid counter overflows is to use sampling counters instead of accumulative counters, as shown in Figure 4-5 (b). However, when sampling counters are used, the collected statistics are represented as time series that have a large volume of data. The results


shown in Figure 4-2 suggest that directly employing runtime samples in phase classification is less desirable. To address the scalability issue in characterizing large-scale program phases using workload execution statistics, wavelet-based dimensionality reduction techniques can be applied to extract the essential features of workload behavior from the sampled statistics. The observations we made in previous sections motivate the use of the DWT to absorb large volumes of sampled raw data and produce highly efficient wavelet domain signatures for phase analysis, as shown in Figure 4-5 (b).

Figure 4-6 further shows phase analysis accuracy after applying wavelet techniques to workload statistics sampled using counters with different sizes. As can be seen, sampling enables the use of counters with limited size to study large program phases. In general, sampling scales naturally with the interval size as long as the sampled values do not overflow the counters. Consequently, as the mismatch between phase interval and counter size grows, the sampling frequency must increase, producing an even larger volume of sampled data. Wavelet domain phase analysis can effectively infer program behavior from a large set of data collected over a long time span, resulting in low COVs in phase analysis.

Workload Variability

As described earlier, our methods collect various program execution statistics and use them to classify program execution into different phases. Such phase classification generally relies on comparing the similarity of the collected statistics. Ideally, different runs of the same code segment should be classified into the same phase. Existing phase detection techniques assume that workloads have deterministic execution. On real systems, with operating system interventions and other threads, applications manifest behavior that is not the same from run to run. This variability can stem from changes in system state that alter cache, TLB or I/O behavior, or from system calls or interrupts, and can result in noticeably different timing and performance behavior


[18, 32]. This cross-run variability can confuse similarity-based phase detection. For a phase analysis technique to be applicable on real systems, it should be able to perform robustly under variability.

Program cross-run variability can be thought of as noise, i.e. a random variance of a measured statistic. There are many possible sources of noisy data, such as measurement/instrument errors and interventions of the operating system. Removing this variability from the collected runtime statistics can be considered a process of denoising. In this chapter, we explore using wavelets as an effective way to perform denoising. Due to the vanishing moment property of wavelets, only some wavelet coefficients are significant in most cases. By retaining only selected wavelet coefficients, a wavelet transform can be applied to reduce the noise. The main idea of wavelet denoising is to transform the data into the wavelet basis, where the large coefficients mainly contain the useful information and the smaller ones represent noise. By suitably modifying the coefficients in the new basis, noise can be directly removed from the data. The general denoising procedure involves three steps: 1) decompose: compute the wavelet decomposition of the original data; 2) threshold wavelet coefficients: select a threshold and apply thresholding to the wavelet coefficients; and 3) reconstruct: compute the wavelet reconstruction using the modified wavelet coefficients. More details on wavelet-based denoising techniques can be found in [33].

To model workload runtime variability, we use additive noise models and randomly inject noise into the time series that represents workload execution behavior. We vary the SNR (signal-to-noise ratio) to simulate scenarios with different degrees of variability. To classify program execution into phases, we generate a 16-dimension feature vector where each element contains


the average value of the collected program execution characteristic for each interval. The k-means algorithm is then used for data clustering. Figure 4-7 illustrates the procedure described above.

Figure 4-7 Method for modeling workload variability (the sampled workload statistics S1(t) and the workload variability model N(t) give S2(t) = S1(t) + N(t); wavelet denoising of S2(t) gives D2(t); phase classification is performed on each signal and the resulting COVs are compared)

We use the Daubechies-8 wavelet with a global wavelet coefficient thresholding policy to perform denoising. We then compare the phase classification COVs obtained using the original data, the data with variability injected and the data after denoising. Figure 4-8 shows our experimental results.

Figure 4-8 Effect of using wavelet denoising to handle workload variability (y-axis: COV, 0% to 15%; x-axis: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex and vpr; series: Original, Noised (SNR=20), Denoised (SNR=20), Noised (SNR=5) and Denoised (SNR=5))

SNR=20 represents scenarios with a low degree of variability and SNR=5 reflects situations with a high degree of variability. As can be seen, introducing variability into workload execution statistics reduces phase analysis accuracy. Wavelet denoising is capable of recovering phase behavior from the noised data, resulting in higher phase analysis accuracy. Interestingly, on some benchmarks (e.g. eon, mcf), the denoised data achieve better phase classification


accuracy than the original data. This is because randomly occurring peaks in the gathered workload execution data can have a deleterious effect on the phase classification results. Wavelet denoising smooths these irregular peaks and makes the phase classification method more robust.

Various types of wavelet denoising can be performed by choosing different threshold selection rules (e.g. rigrsure, heursure, sqtwolog and minimaxi), by performing hard (h) or soft (s) thresholding, and by specifying the multiplicative threshold rescaling model (e.g. one, sln and mln).

We compare the efficiency of the different denoising techniques that are implemented in the MATLAB tool [34]. Due to space limitations, only the results on benchmarks bzip2, gcc and mcf are shown in Figure 4-9. As can be seen, different wavelet denoising schemes achieve comparable accuracy in phase classification.

Figure 4-9 Efficiency of different denoising schemes (y-axis: COV, 0% to 10%; x-axis: wavelet denoising schemes, given as rule:thresholding:rescaling for heursure, rigrsure, sqtwolog and minimaxi, each with s or h thresholding and mln or sln rescaling; series: bzip2, gcc and mcf)
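The variability model and the three-step denoising procedure (decompose, threshold, reconstruct) can be sketched as follows. This is a simplified illustration, not the MATLAB setup used in the experiments: it assumes the Haar wavelet instead of Daubechies-8, one global universal threshold sqrt(2 ln n) scaled by a noise estimate from the finest-level coefficients (similar in spirit to the sqtwolog rule with sln rescaling), and an SNR expressed in decibels, which the text does not specify. All function names are hypothetical.

```python
import math
import random
from statistics import median

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """Step 1, decompose: returns (approximation, detail levels coarse to fine)."""
    approx, levels = list(x), []
    while len(approx) > 1:
        evens, odds = approx[0::2], approx[1::2]
        levels.insert(0, [(a - b) / SQRT2 for a, b in zip(evens, odds)])
        approx = [(a + b) / SQRT2 for a, b in zip(evens, odds)]
    return approx, levels


def haar_inverse(approx, levels):
    """Step 3, reconstruct the signal from (possibly modified) coefficients."""
    x = list(approx)
    for detail in levels:
        nxt = []
        for a, d in zip(x, detail):
            nxt += [(a + d) / SQRT2, (a - d) / SQRT2]
        x = nxt
    return x


def add_noise(x, snr_db, rng):
    """Additive variability model S2(t) = S1(t) + N(t); SNR in dB is assumed."""
    power = sum(v * v for v in x) / len(x)
    sigma = math.sqrt(power / (10.0 ** (snr_db / 10.0)))
    return [v + rng.gauss(0.0, sigma) for v in x]


def denoise(x, mode="soft"):
    """Step 2, threshold all detail coefficients with one global threshold."""
    approx, levels = haar_forward(x)
    # Noise level estimated from the finest-level detail coefficients.
    sigma = median([abs(d) for d in levels[-1]]) / 0.6745
    thr = sigma * math.sqrt(2.0 * math.log(len(x)))

    def shrink(d):
        if abs(d) <= thr:
            return 0.0
        if mode == "hard":
            return d
        return math.copysign(abs(d) - thr, d)

    return haar_inverse(approx, [[shrink(d) for d in lvl] for lvl in levels])
```

Calling `denoise(add_noise(trace, 20.0, random.Random(1)))` on a sampled trace whose length is a power of two returns a smoothed trace of the same length, which could then be averaged per interval and clustered as described above.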


CHAPTER 5
INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION

It is well known in the processor design community that program runtime characteristics exhibit significant variation. To obtain the dynamic behavior that programs manifest on complex microprocessors and systems, architects resort to detailed, cycle-accurate simulations. Figure 5-1 illustrates the variation in workload dynamics for the SPEC CPU 2000 benchmarks gap, crafty and vpr within one of their execution intervals. The results show the time-varying behavior of the workload performance (gap), power (crafty) and reliability (vpr) metrics across simulations with different microarchitecture configurations.

Figure 5-1 Variation of workload performance, power and reliability dynamics (three panels, x-axis: Samples, 0 to 140; gap: CPI, 0.4 to 2; crafty: Power (W), 20 to 140; vpr: AVF, 0.1 to 0.35)

As can be seen, the workload dynamics manifested while executing the same code base vary widely across processors with different configurations. As the number of parameters in the design space increases, such variation in workload dynamics cannot be captured without using slow, detailed simulations. However, using simulation-based methods for architecture design space exploration, where numerous design parameters have to be considered, is prohibitively expensive.

Recently, researchers have proposed several predictive models [20-25] to reason about aggregated workload behavior at the architecture design stage. Among them, linear regression and neural network models have been the most used approaches. Linear models are straightforward to understand and provide accurate estimates of the significance of parameters and their


interactions. However, they are usually inadequate for modeling the non-linear dynamics of real-world workloads, which exhibit widely different characteristics and complexity. Of the non-linear methods, neural network models can accurately predict aggregated program statistics (e.g. the CPI of the entire workload execution). Such models are termed global models, as only one model is used to characterize the measured programs. Monolithic global models are incapable of capturing and revealing program dynamics, which contain interesting fine-grain behavior. Moreover, a workload may produce different dynamics when the underlying architecture configuration changes. Therefore, new methods are needed for accurately predicting complex workload dynamics.

To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques, which can produce a good local representation of the workload behavior in both the time and frequency domains. The proposed analytical models, which combine wavelet-based multiscale data representation and neural network based regression prediction, can efficiently reason about program dynamics without resorting to detailed simulations. With our scheme, the complex workload dynamics are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. We extensively evaluate the efficiency of using wavelet neural networks for predicting the dynamics that the SPEC CPU 2000 benchmarks manifest on high performance microprocessors with a microarchitecture design space that consists of 9 key parameters. Our results show that the models achieve high accuracy in forecasting workload dynamics across a large microarchitecture design space.
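The structure of the scheme, one predictor per wavelet coefficient followed by an inverse transform, can be illustrated with a deliberately simplified sketch. It is not the proposed model: a per-coefficient least-squares line stands in for each neural network, the Haar wavelet is assumed, and the design space is reduced to a single hypothetical scalar parameter instead of the 9 parameters studied. All names are illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """Flat list of Haar coefficients, coarse to fine (len(x) a power of two)."""
    approx, details = list(x), []
    while len(approx) > 1:
        evens, odds = approx[0::2], approx[1::2]
        details = [(a - b) / SQRT2 for a, b in zip(evens, odds)] + details
        approx = [(a + b) / SQRT2 for a, b in zip(evens, odds)]
    return approx + details


def haar_inverse(coeffs):
    """Rebuild the time series from a flat coefficient list."""
    x, pos = list(coeffs[:1]), 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(x)]
        pos += len(x)
        x = [v for a, d in zip(x, detail)
             for v in ((a + d) / SQRT2, (a - d) / SQRT2)]
    return x


def fit_line(xs, ys):
    """Least-squares slope and intercept (stand-in for one neural network)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def train(params, traces, keep=16):
    """Fit one separate model per retained wavelet coefficient."""
    coeffs = [haar_forward(t) for t in traces]
    return [fit_line(params, [c[i] for c in coeffs]) for i in range(keep)]


def predict(models, param, length):
    """Predict coefficients for an unseen configuration, then reconstruct."""
    coeffs = [a * param + b for a, b in models]
    coeffs += [0.0] * (length - len(coeffs))  # dropped coefficients are zeroed
    return haar_inverse(coeffs)
```

Here `params` holds the parameter value of each simulated configuration and `traces` the corresponding sampled workload dynamics; `predict` then estimates the time-varying behavior for a configuration that was never simulated.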


In this chapter, we propose the use of wavelet neural networks to build accurate predictive models for workload dynamics driven microarchitecture design space exploration. We show that wavelet neural networks can be used to accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluate the efficiency of using the proposed techniques to predict workload dynamic behavior in the performance, power and reliability domains. We perform extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy and identify microarchitecture parameters that significantly affect workload dynamic behavior. We present a case study of using workload dynamics aware predictive models to quickly estimate the efficiency of scenario-driven architecture optimizations across different domains. Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios.

Neural Network

An Artificial Neural Network (ANN) [42] is an information processing paradigm that is inspired by the way biological nervous systems process information. It is composed of a set of interconnected processing elements working in unison to solve problems.

[Figure 5-2 diagram: inputs X1, X2, ..., Xn (input layer); RBF units H1(x), H2(x), ..., Hn(x) (hidden layer); weights w1, w2, ..., wn; output f(x) (output layer); inset: Radial Basis Function (RBF) response vs. distance]
Figure 5-2 Basic architecture of a neural network
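As an illustration of the structure in Figure 5-2, the following minimal Python sketch evaluates a single-output network whose hidden units produce radial responses and whose output unit forms a weighted sum of those responses. It assumes Gaussian hidden units, one common choice of radial function; the function names (`gaussian_rbf`, `rbf_network`) and the centers, radii and weights are hypothetical values chosen for illustration, not parameters from this study.

```python
import math

def gaussian_rbf(x, center, radius):
    """Gaussian radial response: 1.0 at the center, decaying with distance."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist_sq / radius ** 2)

def rbf_network(x, centers, radii, weights):
    """Output unit: weighted sum of the hidden-unit responses."""
    return sum(w * gaussian_rbf(x, c, r)
               for w, c, r in zip(weights, centers, radii))

# Hypothetical two-input network with two hidden units (illustrative values).
centers = [(0.0, 0.0), (1.0, 1.0)]
radii = [1.0, 0.5]
weights = [2.0, -1.0]

# Evaluate at the first center: the first unit responds fully,
# the second unit contributes only a small negative term.
y = rbf_network((0.0, 0.0), centers, radii, weights)
```

Each hidden unit receives the entire input vector, so adding an input dimension only changes the distance computation, not the structure of the output sum.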


The most common type of neural network (Figure 5-2) consists of three layers of units: a layer of input units is connected to a layer of hidden units, which is connected to a layer of output units. The input is fed into the network through the input units. Each hidden unit receives the entire input vector and generates a response. The output of a hidden unit is determined by the input-output transfer function that is specified for that unit. Commonly used transfer functions include the sigmoid, the linear threshold function and the Radial Basis Function (RBF) [35]. The ANN output, which is determined by the output unit, is computed using the responses of the hidden units and the weights between the hidden and output units. Neural networks outperform linear models in capturing complex, nonlinear relations between input and output, which makes them a promising technique for tracking and forecasting complex behavior.

In this chapter, we use the RBF transfer function to model and estimate important wavelet coefficients on unexplored design spaces because of its superior ability to approximate complex functions. The basic architecture of an RBF network with n inputs and a single output is shown in Figure 5-2. The nodes in adjacent layers are fully connected. A linear single-layer neural network model of a 1-dimensional function f is expressed as a linear combination of a set of n fixed functions, often called basis functions by analogy with the concept of a vector being composed of a linear combination of basis vectors:

$$f(x) = \sum_{j=1}^{n} w_j h_j(x) \qquad (5\text{-}1)$$

Here $w \in \mathbb{R}^n$ is the adaptable or trainable weight vector and $\{h_j(\cdot)\}_{j=1}^{n}$ are the fixed basis functions, or the transfer functions of the hidden units. The flexibility of f, its ability to fit many different functions, derives only from the freedom to choose different values for the weights. The basis


functions and any parameters which they might contain are fixed. If the basis functions can change during the learning process, then the model is nonlinear.

Radial functions are a special class of functions. Their characteristic feature is that their response decreases (or increases) monotonically with distance from a central point. The center, the distance scale, and the precise shape of the radial function are parameters of the model, all fixed if it is linear. A typical radial function is the Gaussian which, in the case of a scalar input, is

$$h(x) = \exp\left(-\frac{(x-c)^2}{r^2}\right) \qquad (5\text{-}2)$$

Its parameters are its center c and its radius r. Radial functions are simply a class of functions. In principle, they could be employed in any sort of model, linear or nonlinear, and any sort of network (single-layer or multi-layer).

The training of the RBF network involves selecting the center locations and radii (which are eventually used to determine the weights) using a regression tree. A regression tree recursively partitions the input data set into subsets with decision criteria. As a result, there will be a root node, non-terminal nodes (having sub-nodes) and terminal nodes (having no sub-nodes) which are associated with an input dataset. Each node contributes one unit to the RBF network's center and radius vectors. The selection of RBF centers is performed by recursively parsing the regression tree nodes using a strategy proposed in [35].

Combining Wavelet and Neural Network for Workload Dynamics Prediction

We view workload dynamics as a time series produced by the processor, which is a nonlinear function of its design parameter configuration. Instead of predicting this function at every sampling point, we employ wavelets to approximate it. Previous work [21, 23, 25] shows


that neural networks can accurately predict aggregated workload behavior during design space exploration. Nevertheless, monolithic global neural network models lack the capability of revealing complex workload dynamics. To overcome this disadvantage, we propose using wavelet neural networks that incorporate multiscale wavelet analysis into a set of neural networks for workload dynamics prediction. The wavelet transform is a very powerful tool for dealing with dynamic behavior since it captures both global and local workload behavior using a set of wavelet coefficients. Short-term workload characteristics are decomposed into the lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction, while the global workload behavior is decomposed into the higher scales of wavelet coefficients (low frequencies), which are used for the analysis and prediction of slow trends in the workload execution. Collectively, these coordinated scales of time and frequency provide an accurate interpretation of workload dynamics. Our wavelet neural networks use a separate RBF neural network to predict individual wavelet coefficients at different scales. The predictions of the individual wavelet coefficients proceed independently. Predicting each wavelet coefficient with a separate neural network simplifies the training task of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to predict the workload dynamics.

Figure 5-3 shows our hybrid neuro-wavelet scheme for workload dynamics prediction. Given the observed workload dynamics on training data, our aim is to predict workload dynamic behavior under different architecture configurations. The hybrid scheme involves three stages. In the first stage, the time series is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated time series is recovered from the predicted wavelet coefficients.
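The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: it assumes an orthonormal Haar wavelet and a trace length that is a power of two, and the per-coefficient predictors in `models` are hypothetical stand-in callables that take the place of trained RBF networks. The names `haar_decompose`, `haar_reconstruct` and `predict_dynamics` are likewise illustrative.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_decompose(signal):
    """Stage 1: full Haar decomposition of a length-2^m series.
    Returns [approximation, coarsest detail, ..., finest details]."""
    details = []
    approx = list(signal)
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        avg = [(approx[i] + approx[i + 1]) / SQRT2 for i in pairs]
        det = [(approx[i] - approx[i + 1]) / SQRT2 for i in pairs]
        details = det + details   # coarser scales go in front
        approx = avg
    return approx + details

def haar_reconstruct(coeffs):
    """Stage 3: inverse of haar_decompose."""
    approx = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        det = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        nxt = []
        for a, d in zip(approx, det):
            nxt.append((a + d) / SQRT2)
            nxt.append((a - d) / SQRT2)
        approx = nxt
    return approx

def predict_dynamics(config, models, n_coeffs):
    """Stage 2 + 3: one predictor per selected coefficient, zeros for
    the rest, then the inverse transform back to the time domain."""
    coeffs = [0.0] * n_coeffs
    for index, model in models.items():
        coeffs[index] = model(config)
    return haar_reconstruct(coeffs)

# Hypothetical predictors for coefficients 0 and 1 (not trained models).
models = {0: lambda cfg: 10.0 * cfg["scale"], 1: lambda cfg: -4.0}
trace = predict_dynamics({"scale": 1.0}, models, n_coeffs=4)
```

Because every coefficient has its own predictor, the predictors can be trained and evaluated independently; only the final inverse transform combines their outputs.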


[Figure 5-3 diagram: workload dynamics (time domain) pass through wavelet decomposition (filters G0/H0, G1/H1, ..., Gk/Hk) to give wavelet coefficients; RBF neural networks, each fed the microarchitecture design parameters, output predicted wavelet coefficients 1, 2, ..., n; the predicted wavelet coefficients (with zero entries) pass through wavelet reconstruction (filters G*0/H*0, G*1/H*1, ..., G*k/H*k) to give the synthesized workload dynamics (time domain)]
Figure 5-3 Using wavelet neural network for workload dynamics prediction

Each RBF neural network receives the entire microarchitectural design space vector and predicts a wavelet coefficient. The training of an RBF network involves determining the center point and radius for each RBF and the weights of each RBF, which determine the wavelet coefficients.

Experimental Methodology

We evaluate the efficiency of using wavelet neural networks to explore workload dynamics in the performance, power and reliability domains during microarchitecture design space exploration. We use a unified, detailed microarchitecture simulator in our experiments. Our simulation framework, built using a heavily modified and extended version of the SimpleScalar tool set, models pipelined, multiple-issue, out-of-order execution microprocessors with multiple levels of caches. Our framework uses a Wattch-based power model [36]. In addition, we implemented the Architecture Vulnerability Factor (AVF) analysis methods proposed in [37, 38] to estimate processor microarchitecture vulnerability to transient faults. A microarchitecture structure's AVF refers to the probability that a transient fault in that hardware structure will result in incorrect program results. The AVF metric can be used to estimate how vulnerable the hardware is to soft


errors during program execution. Table 5-1 summarizes the baseline machine configuration of our simulator.

Table 5-1 Simulated machine configuration
Parameter              Configuration
Processor Width        8-wide fetch/issue/commit
Issue Queue            96
ITLB                   128 entries, 4-way, 200 cycle miss
Branch Predictor       2K entries Gshare, 10-bit global history
BTB                    2K entries, 4-way
Return Address Stack   32 entries RAS
L1 Instruction Cache   32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access
ROB Size               96 entries
Load/Store Queue       48 entries
Integer ALU            8 I-ALU, 4 I-MUL/DIV, 4 Load/Store
FP ALU                 8 FP-ALU, 4 FP-MUL/DIV/SQRT
DTLB                   256 entries, 4-way, 200 cycle miss
L1 Data Cache          64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle access
L2 Cache               unified 2MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access          64 bit wide, 200 cycles access latency

We perform our analysis using twelve SPEC CPU 2000 benchmarks: bzip2, crafty, eon, gap, gcc, mcf, parser, perlbmk, twolf, swim, vortex and vpr. We use the SimPoint tool to pick the most representative simulation point for each benchmark (with the full reference input set), and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. Each simulation contains 200M instructions. In this chapter, we consider a design space that consists of 9 microarchitectural parameters (see Table 5-2) of the superscalar architecture. These microarchitectural parameters have been shown to have the largest impact on processor performance [21]. The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed, cycle-accurate simulations, we measure processor performance, power and reliability characteristics on all design points within both the training and testing data sets. We build a separate model for each program and use the model to predict workload dynamics in the performance, power and reliability domains at


unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of the models' accuracy is obtained by using the design points in the testing data set.

Table 5-2 Microarchitectural parameter ranges used for generating train/test data
Parameter     Ranges (Train)              Ranges (Test)           # of Levels
Fetch_width   2, 4, 8, 16                 2, 8                    4
ROB_size      96, 128, 160                128, 160                3
IQ_size       32, 64, 96, 128             32, 64                  4
LSQ_size      16, 24, 32, 64              16, 24, 32              4
L2_size       256, 1024, 2048, 4096 KB    256, 1024, 4096 KB      4
L2_lat        8, 12, 14, 16, 20           8, 12, 14               5
il1_size      8, 16, 32, 64 KB            8, 16, 32 KB            4
dl1_size      8, 16, 32, 64 KB            16, 32, 64 KB           4
dl1_lat       1, 2, 3, 4                  1, 2, 3                 4

To build a representative design space, one needs to ensure that the sample data set spreads points throughout the design space yet is unique and small enough to keep the model building cost low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling strategy since it provides better coverage than a naive random sampling scheme. We generate multiple LHS matrices and apply a space-filling metric called L2-star discrepancy [40] to each LHS matrix to find the unique and best representative design space, which has the lowest value of L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. We used 200 training points and 50 test points for workload dynamics prediction since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. In our study, each workload dynamics trace is represented by 128 samples.

Predicting each wavelet coefficient by a separate neural network simplifies the learning task. Since complex workload dynamics can be captured using a limited number of wavelet


coefficients, the total size of the wavelet neural networks can be small. Because small-magnitude wavelet coefficients contribute less to the reconstructed data, we opt to predict only a small set of important wavelet coefficients.

[Figure 5-4 color map: processor configuration vs. wavelet coefficient index; color scale from high magnitude to low magnitude; annotation: wavelet coefficients with large magnitude]
Figure 5-4 Magnitude-based ranking of 128 wavelet coefficients

Specifically, we consider the following two schemes for selecting important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0; and (2) order-based: select the first k coefficients and approximate the rest with 0. In this study, we choose the magnitude-based scheme since it always outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients does not change drastically across the design space. Figure 5-4 illustrates the magnitude-based ranking (shown as a color map where red indicates high ranks and blue indicates low ranks) of a total of 128 wavelet coefficients (decomposed from the dynamics of benchmark gcc) across 50 different microarchitecture configurations. As can be seen, the top-ranked wavelet coefficients largely remain consistent across different processor configurations.
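The two selection schemes can be illustrated with a short Python sketch. The function names and the example coefficient vector are hypothetical and serve only to show the difference between the schemes; they are not data from this study.

```python
def select_magnitude_based(coeffs, k):
    """Keep the k largest-magnitude coefficients; approximate the rest with 0."""
    ranked = sorted(range(len(coeffs)),
                    key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(ranked[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def select_order_based(coeffs, k):
    """Keep the first k coefficients; approximate the rest with 0."""
    return [c if i < k else 0.0 for i, c in enumerate(coeffs)]

def dropped_energy(coeffs, selected):
    """Sum of squares of the coefficients that were zeroed out."""
    return sum((c - s) ** 2 for c, s in zip(coeffs, selected))

# Illustrative coefficient vector (made-up values).
coeffs = [10.0, -4.0, 0.5, -6.0, 0.1, 3.0, -0.2, 0.05]
by_magnitude = select_magnitude_based(coeffs, 3)
by_order = select_order_based(coeffs, 3)
```

Assuming an orthonormal wavelet basis, the squared reconstruction error equals the energy of the zeroed coefficients, so for a fixed k the magnitude-based choice discards the least energy for a given trace. In this example the magnitude-based scheme keeps the coefficient at index 3 that the order-based scheme drops.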


Evaluation and Results

In this section, we present detailed experimental results on using wavelet neural networks to predict workload dynamics in the performance, reliability and power domains. The workload dynamics prediction accuracy measure is the mean square error (MSE), defined as follows:

$$MSE = \frac{1}{N}\sum_{k=1}^{N}\left(x(k) - \hat{x}(k)\right)^2 \qquad (5\text{-}3)$$

where $x(k)$ is the actual value, $\hat{x}(k)$ is the predicted value and $N$ is the total number of samples. As prediction accuracy increases, the MSE becomes smaller.

[Figure 5-5 panels: (a) CPI, (b) Power, (c) AVF; each panel shows boxplots of MSE (%) for bzip, crafty, eon, gap, gcc, mcf, parser, perl, swim, twolf, vortex and vpr]
Figure 5-5 MSE boxplots of workload dynamics prediction
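Equation 5-3 translates directly into code. The sketch below is a plain Python rendering of the definition with made-up sample values; the function name is illustrative, and the conversion to the percentage scale used in the figures is not specified in the text and is therefore not shown.

```python
def mean_square_error(actual, predicted):
    """MSE per Equation 5-3: mean of squared differences over N samples."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("traces must be non-empty and of equal length")
    n = len(actual)
    return sum((x - x_hat) ** 2 for x, x_hat in zip(actual, predicted)) / n

# Illustrative traces (made-up values, not simulation data).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.5, 2.5, 4.0]
mse = mean_square_error(actual, predicted)
```

A perfect prediction yields an MSE of zero, and every deviation contributes quadratically, so a few large misses dominate the metric.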


The workload dynamics prediction accuracies in the performance, power and reliability domains are plotted as boxplots (Figure 5-5). Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the MSE values. Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data; it shows the center of the distribution of the MSE values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance of 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. In Figure 5-5, the line with diamond-shaped markers indicates the statistical average of the MSE across all test cases.

Figure 5-5 shows that the performance model achieves median errors ranging from 0.5 percent (swim) to 8.6 percent (mcf), with an overall median error across all benchmarks of 2.3 percent. As can be seen, even though the maximum error at any design point for any benchmark is 30%, most benchmarks show an MSE of less than 10%. This indicates that our proposed neuro-wavelet scheme can forecast the dynamic behavior of program performance characteristics with high accuracy. Figure 5-5 also shows that the power models are slightly less accurate, with median errors ranging from 1.3 percent (vpr) to 4.9 percent (crafty) and an overall median of 2.6 percent. The power prediction has a high maximum error of 35%. The errors are much smaller in the reliability domain.

In general, the workload dynamics prediction accuracy increases when more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients. Cost-effective models should provide high prediction


accuracy while maintaining low complexity. Figure 5-6 shows the trend of prediction accuracy (the average across all benchmarks) when various numbers of wavelet coefficients are used.

[Figure 5-6 plot: MSE (%) vs. number of wavelet coefficients (16, 32, 64, 96, 128) for CPI, Power and AVF]
Figure 5-6 MSE trends with increased number of wavelet coefficients

As can be seen, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point reduces the error at a lower rate. This is because wavelets provide a good time and locality characterization capability and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide an interpretation of the series structures across scales of the time and frequency domains. The capability of using a limited set of wavelet coefficients to capture workload dynamics varies with the resolution level.

[Figure 5-7 plot: MSE (%) vs. number of samples (64, 128, 256, 512, 1024) for CPI, Power and AVF]
Figure 5-7 MSE trends with increased sampling frequency


Figure 5-7 illustrates the MSE (the average across all benchmarks) yielded by predictive models that use 16 wavelet coefficients when the number of samples varies from 64 to 1024. As the sampling frequency increases, using the same number of wavelet coefficients is less accurate in terms of capturing workload dynamic behavior. As can be seen, however, the increase in MSE is not significant. This suggests that the proposed schemes can capture workload dynamic behavior with increasing complexity.

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input microarchitecture parameters were ranked based on either split order or split frequency. The microarchitecture parameters which cause the most output variation tend to be split earliest and most often in the constructed regression tree. Therefore, the microarchitecture parameters that largely determine the value of a wavelet coefficient are located higher in the regression tree than the others and have a larger number of splits.

[Figure 5-8 star plots: columns CPI, Power and AVF; rows (a) Split Order and (b) Split Frequency; one star per benchmark (bzip, crafty, eon, gap, gcc, mcf, parser, perlbmk, swim, twolf, vortex, vpr) with spokes Fetch, ROB, IQ, LSQ, L2, L2_lat, il1, dl1 and dl1_lat]
Figure 5-8 Roles of microarchitecture design parameters


We present in Figure 5-8 (shown as star plots) the initial and most frequent splits within the regression trees that model the most significant wavelet coefficients. A star plot [41] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information such as: Which variables are dominant for a given data set? Which observations show similar behavior? For example, on benchmark gcc, Fetch, dl1 and LSQ have significant roles in predicting dynamic behavior in the performance domain, while ROB, Fetch and dl1_lat largely affect workload dynamic behavior in the reliability domain. For the benchmark gcc, the microarchitecture parameters most frequently involved in regression tree construction are ROB, LSQ, L2 and L2_lat in the performance domain and LSQ and Fetch in the reliability domain.

Compared with models that only predict aggregated workload behavior, our proposed methods can forecast workload runtime execution scenarios. This feature is essential if the predictive models are employed to trigger runtime dynamic management mechanisms for power and reliability optimizations. Inadequate predictions of workload worst-case scenarios could make microprocessors fail to meet the desired power and reliability targets. Conversely, false alarms caused by over-prediction of the worst-case scenarios can trigger responses too frequently, resulting in significant overhead. In this section, we study the suitability of using the proposed schemes for workload execution scenario based classification. Specifically, for a given workload characteristics threshold, we calculate how many sampling points in a trace that represents workload dynamics are above or below the threshold. We then apply the same calculation to the


predicted workload dynamics trace. We use the directional symmetry (DS) metric, i.e., the percentage of correctly predicted directions with respect to the target variable, defined as

$$DS = \frac{1}{N}\sum_{k=1}^{N}\varphi\left(x(k), \hat{x}(k)\right) \qquad (5\text{-}4)$$

where $\varphi(\cdot) = 1$ if $x$ and $\hat{x}$ are both above or both below the threshold and $\varphi(\cdot) = 0$ otherwise. Thus, the DS provides a measure of the number of times the sign of the target is correctly forecasted. In other words, DS = 50% implies that the predicted direction was correct for half of the predictions. In this work, we set three threshold levels (named 1Q, 2Q and 3Q in Figure 5-9) between the max and min values in each trace as follows, where 1Q is the lowest threshold level and 3Q is the highest threshold level.

1Q = MIN + (MAX-MIN)*(1/4)
2Q = MIN + (MAX-MIN)*(2/4)
3Q = MIN + (MAX-MIN)*(3/4)

[Figure 5-9 diagram: a trace between MIN and MAX with threshold levels 1Q, 2Q and 3Q]
Figure 5-9 Threshold-based workload execution scenarios

Figure 5-10 shows the results of threshold-based workload dynamic behavior classification. The results are presented as directional asymmetry, which can be expressed as 1 - DS. As can be seen, our wavelet-based RBF neural networks can not only effectively capture workload dynamics but also accurately classify workload execution into different scenarios. This suggests that proactive dynamic power and reliability management schemes can be built using the proposed models. For instance, given a power/reliability threshold, our wavelet RBF neural networks can be used to forecast workload execution scenarios. If the predicted workload


characteristics exceed the threshold level, processors can start to respond before power/reliability reaches or surpasses the threshold level.

[Figure 5-10 bar charts: directional asymmetry (%) per benchmark (bzip, crafty, eon, gap, gcc, mcf, parser, perlbmk, swim, twolf, vortex, vpr) for CPI_1Q, CPI_2Q, CPI_3Q; Power_1Q, Power_2Q, Power_3Q; and AVF_1Q, AVF_2Q, AVF_3Q]
Figure 5-10 Threshold-based workload execution

Figure 5-11 further illustrates detailed workload execution scenario predictions on benchmark bzip2. Both simulation and prediction results are shown. The predicted results closely track the varied program dynamic behavior in different domains.

[Figure 5-11 panels: (a) performance, (b) power, (c) reliability]
Figure 5-11 Threshold-based workload scenario prediction

Workload Dynamics Driven Architecture Design Space Exploration

In this section, we present a case study to demonstrate the benefit of applying workload dynamics prediction in early architecture design space exploration. Specifically, we show that workload dynamics prediction models can effectively forecast the worst-case operation


conditions for soft error vulnerability and accurately estimate the efficiency of soft error vulnerability management schemes.

Because of technology scaling, radiation-induced soft errors contribute more and more to the failure rate of CMOS devices. Therefore, the soft error rate is an important reliability issue in deep-submicron microprocessor design. Processor microarchitecture soft error vulnerability exhibits significant runtime variation, and it is neither economical nor practical to design fault-tolerant schemes that target the worst-case operation condition. Dynamic Vulnerability Management (DVM) refers to a set of strategies to keep hardware runtime soft-error susceptibility under a tolerable threshold. DVM allows designers to achieve higher dependability on hardware designed for a lower reliability setting. If a particular execution period exceeds the pre-defined vulnerability threshold, a DVM response (Figure 5-12) will work to reduce hardware vulnerability.

[Figure 5-12 diagram: vulnerability vs. time, showing the designed-for reliability capacity without DVM, the designed-for reliability capacity with DVM, the DVM trigger level, the points where DVM is engaged and disengaged, and the DVM performance overhead]
Figure 5-12 Dynamic Vulnerability Management

A primary goal of DVM is to keep vulnerability within a pre-defined reliability target during the entire program execution. The DVM will be triggered once the hardware soft error vulnerability exceeds the predefined threshold. Once the trigger goes on, a DVM response begins. Depending on the type of response chosen, there may be some performance degradation. A DVM response can be turned off as soon as the vulnerability drops below the threshold. To


successfully achieve the desired reliability target and effectively mitigate the overhead of DVM, architects need techniques to quickly infer application worst-case operation conditions across design alternatives and accurately estimate the efficiency of DVM schemes at an early design stage.

We developed a DVM scheme to manage runtime instruction queue (IQ) vulnerability to soft errors.

DVM_IQ {
    ACE bits counter updating();
    if current context has L2 cache misses
        then stall dispatching instructions for current context;
    every (sample_interval/5) cycles {
        if online IQ_AVF > trigger threshold
            then wq_ratio = wq_ratio/2;
            else wq_ratio = wq_ratio+1;
    }
    if (ratio of waiting instruction # to ready instruction # > wq_ratio)
        then stall dispatching instructions;
}
Figure 5-13 IQ DVM Pseudo Code

Figure 5-13 shows the pseudo code of our DVM policy. The DVM scheme computes the online IQ AVF to estimate runtime microarchitecture vulnerability. The estimated AVF is compared against a trigger threshold to determine whether it is necessary to enable a response mechanism. To reduce IQ soft error vulnerability, we throttle instruction dispatching from the ROB to the IQ upon an L2 cache miss. Additionally, we sample the IQ AVF at a finer granularity and compare the sampled AVF with the trigger threshold. If the IQ AVF exceeds the trigger threshold, a parameter wq_ratio, which specifies the ratio of the number of waiting instructions to that of ready instructions in the IQ, is updated. The purpose of setting this parameter is to maintain performance by allowing an appropriate fraction of waiting instructions in the IQ to exploit ILP. By maintaining a desired ratio between the waiting


instructions and the ready instructions, vulnerability can be reduced at negligible performance cost. The wq_ratio update is triggered by the estimated IQ AVF. In our DVM design, wq_ratio is adapted through slow increases and rapid decreases in order to ensure a quick response to a vulnerability emergency.

We built workload dynamics predictive models which incorporate DVM as a new design parameter. Therefore, our models can predict workload execution scenarios with and without the DVM feature across different microarchitecture configurations. Figure 5-14 shows the results of using the predictive models to forecast IQ AVF on benchmark gcc across two microarchitecture configurations.

[Figure 5-14 panels: IQ_AVF vs. samples, showing simulation, prediction and the DVM target, with DVM disabled and DVM enabled, for (a) Scenario 1 and (b) Scenario 2]
Figure 5-14 Workload dynamic prediction with scenario-based architecture optimization

We set the DVM target to 0.3, which means that the DVM policy, when enabled, should maintain the IQ AVF below 0.3 during workload execution. In both cases, the IQ AVF dynamics were predicted with DVM disabled and enabled. As can be seen, in scenario 1, the DVM successfully achieves its goal. In scenario 2, despite the enabled DVM feature, the IQ AVF during certain execution periods is still above the threshold. This implies that the developed DVM mechanism is suitable for the microarchitecture configuration used in scenario 1. On the other hand, architects have to choose another DVM policy if the microarchitecture configuration


shown in scenario 2 is chosen in their design. Figure 5-14 shows that in all cases, the predictive models can accurately forecast the trends in IQ AVF dynamics due to architecture optimizations.

Figure 5-15 (a) shows the prediction accuracy of IQ AVF dynamics when the DVM policy is enabled. The results are shown for all 50 microarchitecture configurations in our test dataset.

[Figure 5-15 heat plots: MSE (%) for test configurations 1 to 50 (rows) by benchmark (columns), each with a color key and histogram; (a) IQ AVF, (b) Power]
Figure 5-15 Heat plot that shows the MSE of IQ AVF and processor power

Since deploying the DVM policy will also affect runtime processor power behavior, we further build models to forecast processor power dynamic behavior due to the DVM. The results are shown in Figure 5-15 (b). The data are presented as a heat plot, which maps the actual data values onto a color scale, with a dendrogram added to the top. A dendrogram consists of many U-shaped lines connecting objects in a hierarchical tree. The height of each U represents the distance between the two objects being connected. For a given benchmark, a vertical trace line shows the scaled MSE values across all test cases. Figure 5-15 (a) shows that the predictive models yield high prediction accuracy across all test cases on benchmarks swim, eon and vpr. The models yield prediction variation on benchmarks

PAGE 73

gcc, crafty and vortex. In the power domain, prediction accuracy is more uniform across benchmarks and microarchitecture configurations. In Figure 5-16, we show the IQ AVF MSE when different DVM thresholds are set. The results suggest that our predictive models work well when different DVM targets are considered.

Figure 5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds. Axes: benchmark (bzip, crafty, eon, gap, gcc, mcf, parser, perlbmk, swim, twolf, vortex, vpr) versus IQ AVF MSE (%), 0 to 0.5. Series: DVM Threshold = 0.2, 0.3, 0.5.

PAGE 74

CHAPTER 6
ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTI-CORE ARCHITECTURES

Early design space exploration is an essential ingredient in modern processor development. It significantly reduces the time to market and post-silicon surprises. The trend toward multi-/many-core processors will result in sophisticated large-scale architecture substrates (e.g. non-uniformly accessed cache [43] interconnected by network-on-chip [44]) with self-contained hardware components (e.g. cache banks, routers and interconnect links) proximate to the individual cores but globally distributed across all cores. As the number of cores on a processor increases, these large and sophisticated multi-core-oriented architectures exhibit increasingly complex and heterogeneous characteristics. As an example, to alleviate the deleterious impact of wire delays, architects have proposed splitting up large L2/L3 caches into multiple banks, with each bank having a different access latency depending on its physical proximity to the cores. Figure 6-1 illustrates normalized cache hits (results are plotted as color maps) across the 256 cache banks of a non-uniform cache architecture (NUCA) [43] design on an 8-core chip multiprocessor (CMP) running the SPLASH-2 Ocean-c workload. The 2D architecture spatial patterns yielded on NUCA with different architecture design parameters are shown.

Figure 6-1 Variation of cache hits across a 256-bank non-uniform access cache on 8-core (color scale: 0 to 1)

As can be seen, there is a significant variation in cache access frequency across individual cache banks. At larger scales, the manifested 2-dimensional spatial characteristics across the

PAGE 75

entire NUCA substrate vary widely with different design choices while executing the same code base. In this example, various NUCA cache configurations such as network topologies (e.g. hierarchical, point-to-point and crossbar) and data management schemes (e.g. static (SNUCA) [43], dynamic (DNUCA) [45, 46] and dynamic with replication (RNUCA) [47-49]) are used. As the number of parameters in the design space increases, such variation and characteristics at large scales cannot be captured without using slow and detailed simulations. However, using simulation-based methods for architecture design space exploration, where numerous design parameters have to be considered, is prohibitively expensive.

Recently, various predictive models [20-25, 50] have been proposed to cost-effectively reason about processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous behavior of large and distributed architecture substrates across the design space. This limitation will only be exacerbated by the rapidly increasing integration scale (e.g. number of cores per chip). Therefore, there is a pressing need for novel and cost-effective approaches to achieve accurate and informative design trade-off analysis for large and sophisticated architectures in the upcoming multi-/many-core era.

Thus, in this chapter, instead of quantifying these large and sophisticated architectures by a single number or a simple statistical distribution, we propose techniques that employ 2D wavelet multiresolution analysis and neural network nonlinear regression modeling. With our schemes, the complex spatial characteristics that workloads exhibit across large architecture substrates are decomposed into a series of wavelet coefficients.
In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of

PAGE 76

wavelet coefficients, our models can accurately reconstruct architecture 2D spatial behavior across the design space. Using both multi-programmed and multi-threaded workloads, we extensively evaluate the efficiency of using 2D wavelet neural networks for predicting the complex behavior of non-uniformly accessed cache designs with widely varied configurations.

Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction

We view the 2D spatial characteristics yielded on large and distributed architecture substrates as a nonlinear function of architecture design parameters. Instead of inferring the spatial behavior by exhaustively obtaining architecture characteristics on each individual node/component, we employ wavelet analysis to approximate it and then use a neural network to forecast the approximated behavior across a large architecture design space. Previous work [21, 23, 25, 50] shows that neural networks can accurately predict the aggregated workload behavior across varied architecture configurations. Nevertheless, monolithic global neural network models lack the ability to informatively reveal complex workload/architecture interactions at a large scale. To overcome this disadvantage, we propose combining 2D wavelet transforms and neural networks, so that multiresolution analysis is incorporated into a set of neural networks for spatial characteristics prediction of multi-core oriented architecture substrates. The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both the global trend and the local variation of large data sets using a small set of wavelet coefficients.
The local characteristics are decomposed into lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction of individual cores/components or subsets of them, while the global trend is decomposed into higher scales of wavelet coefficients (low frequencies) that are used for the analysis and prediction of slow trends across many cores or distributed hardware components. Collectively, these wavelet coefficients provide

PAGE 77

an accurate interpretation of the spatial trend and details of complex workload behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict individual wavelet coefficients. The separate predictions of wavelet coefficients proceed independently. Predicting each wavelet coefficient by a separate neural network simplifies the training task (which can be performed concurrently) of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transforms to synthesize the spatial patterns on large scale architecture substrates. Figure 6-2 shows our hybrid neuro-wavelet scheme for architecture 2D spatial characteristics prediction. Given the observed spatial behavior on training data, our aim is to predict the 2D behavior of large-scale architecture under different design configurations.

Figure 6-2 Using wavelet neural networks for forecasting architecture 2D characteristics. Flow shown: Architecture 2D Characteristics; Wavelet Decomposition (filters G0/H0 to Gk/Hk); Wavelet Coefficients; RBF Neural Networks (input: Architecture Design Parameters; output: Predicted Wavelet Coefficients 1 to n); Wavelet Reconstruction (filters G*0/H*0 to G*k/H*k); Synthesized Architecture 2D Characteristics.

The hybrid scheme basically involves three stages. In the first stage, the observed spatial behavior is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D characteristics are recovered from the predicted wavelet coefficients.
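The three stages above can be sketched in a few lines. This is a minimal illustration only, assuming an orthonormal Haar wavelet, a square matrix whose side is a power of two, and a rows-then-columns decomposition; none of these choices is stated at this point in the text. Stage 2 is stubbed out: `predict_coefficients` is a hypothetical stand-in that keeps the k largest true coefficients where the scheme would query one trained RBF network per coefficient.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_1d(values):
    """Full orthonormal Haar transform of a list whose length is a power of two."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / SQRT2 for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / SQRT2 for i in range(half)]
        out[:n] = avg + dif  # averages first, then details
        n = half
    return out

def inv_haar_1d(coeffs):
    """Inverse of haar_1d."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [(a + d) / SQRT2, (a - d) / SQRT2]
        out[:2 * n] = merged
        n *= 2
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def haar_2d(matrix):
    """Stage 1: transform every row, then every column."""
    rows = [haar_1d(r) for r in matrix]
    return transpose([haar_1d(c) for c in transpose(rows)])

def inv_haar_2d(coeffs):
    """Stage 3: undo the column transform, then the row transform."""
    cols = [inv_haar_1d(c) for c in transpose(coeffs)]
    return [inv_haar_1d(r) for r in transpose(cols)]

def predict_coefficients(coeffs, k):
    """Stage 2 stand-in (hypothetical): keep the k largest-magnitude
    coefficients and zero the rest, instead of querying trained networks."""
    ranked = sorted(((abs(v), i, j) for i, row in enumerate(coeffs)
                     for j, v in enumerate(row)), reverse=True)
    keep = {(i, j) for _, i, j in ranked[:k]}
    return [[v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(coeffs)]

# Toy 4x4 "bank" matrix with a hot upper-left quadrant (made-up values).
banks = [[0.9, 0.8, 0.1, 0.1],
         [0.8, 0.9, 0.1, 0.1],
         [0.1, 0.1, 0.1, 0.1],
         [0.1, 0.1, 0.1, 0.1]]
approx = inv_haar_2d(predict_coefficients(haar_2d(banks), k=4))
```

Because the transform is orthonormal, dropping a coefficient changes the reconstruction by exactly that coefficient's energy, which is why a small set of large coefficients can approximate the full 2D pattern.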
Each RBF neural network receives the entire architecture design space vector and predicts a wavelet coefficient. The

PAGE 78

training of an RBF network involves determining the center point and a radius for each RBF, and the weights of each RBF which determine the wavelet coefficients.

Experimental Methodology

We evaluate the efficiency of 2D wavelet neural networks for forecasting spatial characteristics of large-scale multi-core NUCA design using the GEMS 1.2 [51] toolset interfaced with the Simics [52] full-system functional simulator. We simulate a SPARC V9 8-core CMP running Solaris 9. We model in-order issue cores for this study to keep the simulation time tractable. The processors have private L1 caches and the shared L2 is a 256-bank 16MB NUCA. The private L1 caches of different processors are kept coherent using a distributed directory-based protocol. To model the L2 cache, we use the Ruby NUCA cache simulator developed in [47], which includes an on-chip network model. The network models all messages communicated in the system, including all requests, responses, replacements, and acknowledgements. Table 6-1 summarizes the baseline machine configuration of our simulator.

Table 6-1 Simulated machine configuration (baseline)
Parameter          Configuration
Number of cores    8
Issue Width        1
L1 (split I/D)     64KB, 64B line, write-allocation
L2 (NUCA)          16 MB (256 x 64 KB), 64B line
Memory             Sequential
Memory             4 GB of DRAM, 250 cycle latency, 4KB

Our baseline processor/L2 NUCA organization is similar to that of Beckmann and Wood [47] and is illustrated in Figure 6-3. Each processor core (including L1 data and instruction caches) is placed on the chip boundary and eight such cores surround a shared L2 cache. The L2 is partitioned into 256 banks (grouped as 16 bank clusters) and connected with an interconnection network. Each core has a cache controller that routes the core's requests to the appropriate cache bank. The NUCA design

PAGE 79

space is very large. In this chapter, we consider a design space that consists of 9 parameters (see Table 6-2) of CMP NUCA architecture.

Figure 6-3 Baseline CMP with 8 cores that share a NUCA L2 cache. The figure shows the 256 L2 banks (numbered 1 to 256) surrounded by CPU 0 to CPU 7, each with its own L1 D$ and L1 I$.

Table 6-2 Considered architecture design parameters and their ranges
Parameter                          Values
NUCA Management Policy (NUCA)      SNUCA, DNUCA, RNUCA
Network Topology (net)             Hierarchical, PT_to_PT, Crossbar
Network Link Latency (net_lat)     20, 30, 40, 50
L1_latency (L1_lat)                1, 3, 5
L2_latency (L2_lat)                6, 8, 10, 12
L1_associativity (L1_aso)          1, 2, 4, 8
L2_associativity (L2_aso)          2, 4, 8, 16
Directory Latency (d_lat)          30, 60, 80, 100
Processor Buffer Size (p_buf)      5, 10, 20

These design parameters cover NUCA data management policy (NUCA), interconnection topology and latency (net and net_lat), the configurations of the L1 and L2 caches (L1_lat, L2_lat, L1_aso and L2_aso), cache coherency directory latency (d_lat) and the number of cache accesses that a processor core can issue to the L1 (p_buf). The ranges for these parameters were set to include both typical and feasible design points within the explored design space.
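To make the size of this design space concrete, the parameter ranges of Table 6-2 can be enumerated directly. The sketch below is illustrative only; the dictionary layout and the function name `design_space_size` are assumptions, while the parameter names and values are those listed in the table.

```python
from math import prod

# Parameter ranges copied from Table 6-2.
DESIGN_SPACE = {
    "NUCA":    ["SNUCA", "DNUCA", "RNUCA"],
    "net":     ["Hierarchical", "PT_to_PT", "Crossbar"],
    "net_lat": [20, 30, 40, 50],
    "L1_lat":  [1, 3, 5],
    "L2_lat":  [6, 8, 10, 12],
    "L1_aso":  [1, 2, 4, 8],
    "L2_aso":  [2, 4, 8, 16],
    "d_lat":   [30, 60, 80, 100],
    "p_buf":   [5, 10, 20],
}

def design_space_size(space):
    """Number of distinct configurations in a full-factorial design space."""
    return prod(len(values) for values in space.values())

total = design_space_size(DESIGN_SPACE)  # 82944 configurations
```

With 82,944 possible configurations per workload, exhaustive full-system simulation is impractical, which motivates sampling a small number of design points and predicting the rest.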

PAGE 80

We studied the CMP NUCA designs using various multi-programmed and multi-threaded workloads (listed in Table 6-3).

Table 6-3 Multi-programmed workloads
Multi-programmed Workloads    Description
Homogeneous
  Group1                      gcc (8 copies)
  Group2                      mcf (8 copies)
Heterogeneous
  Group1 (CPU)                gap, bzip2, equake, gcc, mesa, perlbmk, parser, ammp
  Group2 (MIX)                perlbmk, mcf, bzip2, vpr, mesa, art, gcc, equake
  Group3 (MEM)                mcf, twolf, art, ammp, equake, mcf, art, mesa

Multithreaded Workloads       Data Set
Splash2
  barnes                      16k particles
  fmm                         input.16348
  ocean-co                    514x514 ocean body
  ocean-nc                    258x258 ocean body
  water-ns                    512 molecules
  cholesky                    tk15.O
  fft                         65,536 complex data points
  radix                       256k keys, 1024 radix

Our heterogeneous multi-programmed workloads consist of a mix of programs from the SPEC 2000 benchmarks with full reference input sets. The homogeneous multi-programmed workloads consist of multiple copies of an identical SPEC 2000 program. For multi-programmed workload simulations, we fast-forward until all benchmarks pass their initialization phases. For multithreaded workloads, we use 8 benchmarks from the SPLASH-2 suite [53]; we mark the initialization phase in the software code and skip it in our simulations. In all simulations, we first warm up the cache model. After that, each simulation runs 500 million instructions or to benchmark completion, whichever is less. Using detailed simulation, we obtain the 2D architecture characteristics of large scale NUCA at all design points within both training and

PAGE 81

testing data sets. We build a separate model for each workload and use the model to predict architecture 2D spatial behavior at unexplored points in the design space. The training data set is used to build the 2D wavelet neural network models. An estimate of the models' accuracy is obtained by using the design points in the testing data set.

To build a representative design space, one needs to ensure that the sample data sets disperse points throughout the design space but keep the space small enough to keep the cost of building the model low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) as our sampling strategy since it provides better coverage compared to a naive random sampling scheme. We generate multiple LHS matrices and use a space-filling metric called L2-star discrepancy. The L2-star discrepancy is applied to each LHS matrix to find the representative design space that has the lowest value of L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this chapter, we used 200 training points and 50 test points for workload dynamic prediction, since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. The 2D NUCA architecture characteristics (normalized cache hit numbers) across the 256 banks (with the geometric layout shown in Figure 6-3) are represented by a matrix.

Predicting each wavelet coefficient by a separate neural network simplifies the learning task. Since complex spatial patterns on large scale multi-core architecture substrates can be captured using a limited number of wavelet coefficients, the total size of the wavelet neural networks is small and the computation overhead is low.
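The sampling step described above can be illustrated as follows. This is a hedged sketch, not the variant of LHS used in this work: it draws several plain Latin hypercube candidates in the unit cube, scores each with the squared L2-star discrepancy (computed with Warnock's closed-form expression), and keeps the lowest-scoring one. The function names, the number of candidates, and the mapping from unit-cube coordinates to discrete parameter levels are all assumptions made for illustration.

```python
import random

def latin_hypercube(n, d, rng):
    """n points in [0, 1)^d with exactly one point per stratum in each dimension."""
    columns = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n for s in strata])
    return [[columns[k][i] for k in range(d)] for i in range(n)]

def l2_star_sq(points):
    """Squared L2-star discrepancy (Warnock's formula); lower means more uniform."""
    n, d = len(points), len(points[0])
    single = 0.0
    for x in points:
        p = 1.0
        for xk in x:
            p *= 1.0 - xk * xk
        single += p
    pair = 0.0
    for x in points:
        for y in points:
            p = 1.0
            for xk, yk in zip(x, y):
                p *= 1.0 - max(xk, yk)
            pair += p
    return 3.0 ** -d - (2.0 ** (1 - d) / n) * single + pair / (n * n)

def best_lhs(n, d, candidates, seed=0):
    """Generate several LHS matrices and keep the one with the lowest discrepancy."""
    rng = random.Random(seed)
    return min((latin_hypercube(n, d, rng) for _ in range(candidates)),
               key=l2_star_sq)

def to_levels(point, levels):
    """Map a unit-cube point onto discrete parameter levels (assumed mapping)."""
    return [lv[int(u * len(lv))] for u, lv in zip(point, levels)]

# Example with two of the Table 6-2 parameters: net_lat and L1_lat.
levels = [[20, 30, 40, 50], [1, 3, 5]]
samples = [to_levels(p, levels) for p in best_lhs(8, 2, candidates=20)]
```

The pairwise term makes the discrepancy quadratic in the number of points, which is inexpensive at the scale of a few hundred training configurations.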
Due to the fact that small magnitude wavelet coefficients have less contribution to the reconstructed data, we opt to only predict a small set of important wavelet coefficients. Specifically, we consider the following two schemes for selecting

PAGE 82

important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0, and (2) order-based: select the first k coefficients and approximate the rest with 0. In this study, we choose to use the magnitude-based scheme since it always outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients does not change drastically across the design space. Our experimental results show that the top ranked wavelet coefficients largely remain consistent across different architecture configurations.

Evaluation and Results

In this section, we present detailed experimental results using 2D wavelet neural networks to forecast complex, heterogeneous patterns of large scale multi-core substrates running various workloads without using detailed simulation. The prediction accuracy measure is the mean error, defined as follows:

ME = \frac{1}{N} \sum_{k=1}^{N} \left| \frac{x(k) - \hat{x}(k)}{x(k)} \right|    (6-1)

where x is the actual value, \hat{x} is the predicted value and N is the total number of samples (e.g. 256 NUCA banks). As prediction accuracy increases, the ME becomes smaller.

The prediction accuracies are plotted as boxplots (Figure 6-4). Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the ME values.
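The mean error of Equation 6-1 and the two coefficient selection schemes described above can be sketched as follows. This is an illustration under stated assumptions: the error is taken as the mean absolute relative error, actual values are assumed to be non-zero, and coefficients are given as a flat list in transform order. All function names are hypothetical.

```python
def mean_error(actual, predicted):
    """Mean absolute relative error over N samples (assumes non-zero actual values)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def select_by_magnitude(coeffs, k):
    """Magnitude-based: keep the k largest-magnitude coefficients, zero the rest."""
    top = set(sorted(range(len(coeffs)),
                     key=lambda i: abs(coeffs[i]), reverse=True)[:k])
    return [c if i in top else 0.0 for i, c in enumerate(coeffs)]

def select_by_order(coeffs, k):
    """Order-based: keep the first k coefficients, zero the rest."""
    return [c if i < k else 0.0 for i, c in enumerate(coeffs)]

# Made-up coefficient list for illustration.
coeffs = [4.0, -0.5, 3.0, 0.1]
by_magnitude = select_by_magnitude(coeffs, 2)  # keeps 4.0 and 3.0
by_order = select_by_order(coeffs, 2)          # keeps 4.0 and -0.5
```

In this toy list the magnitude-based scheme retains more of the signal energy than the order-based one for the same k, which matches the reason given for preferring it.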

PAGE 83

Figure 6-4 ME boxplots of prediction accuracies with different numbers of wavelet coefficients. Panels: (a) 16, (b) 32, (c) 64 and (d) 128 wavelet coefficients. Axes: workload (gcc_x8, mcf_x8, CPU, MIX, MEM, barnes, ocean.co, ocean.nc, water.sp, cholesky, fft, radix, fmm) versus Error (%).

Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data, and it shows the center of the distribution for the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles.

Figure 6-4 (a) shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 5.2 percent (fft) to 9.3 percent (ocean.co), with an overall median error of 6.6 percent across all tested workloads. As can be seen, the maximum error at any design point for any benchmark is 13%, and most benchmarks show an error of less than 10%.
This indicates that our proposed neuro-wavelet scheme can forecast the 2D spatial workload

PAGE 84

behavior across large and sophisticated architecture with high accuracy. Figure 6-4 (b-d) shows that, in general, the geospatial characteristics prediction accuracy increases when more wavelet coefficients are involved. Note that the complexity of the predictive models is proportional to the number of wavelet coefficients. Cost-effective models should provide high prediction accuracy while maintaining low complexity and computation overhead. The trend of prediction accuracy (Figure 6-4) indicates that for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point reduces the error at a diminishing rate. This is because wavelets provide a good time and locality characterization capability and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide an interpretation of the spatial patterns among a large number of NUCA banks on a two-dimensional plane. Figure 6-5 illustrates the predicted 2D NUCA behavior across four different configurations (A to D) on the heterogeneous multi-programmed workload MIX (see Table 6-3) when different numbers of wavelet coefficients (e.g. 16 to 256) are used.

Figure 6-5 Predicted 2D NUCA behavior using different numbers of wavelet coefficients. Labels: Simulation, Prediction; configurations A, B, C, D.

PAGE 85

The simulation results (org) are also shown for comparison purposes. Since we can accurately forecast the behavior of large scale NUCA by only predicting a small set of wavelet coefficients, we expect our methods to be scalable to even larger architecture designs.

We further compare the accuracy of our proposed scheme with that of approximating NUCA spatial patterns via predicting the hit rates of 16 evenly distributed cache banks across a 2D plane. The results shown in Table 6-4 indicate that, using the same number of neural networks, our scheme yields a significantly higher accuracy than conventional predictive models. If current neural network models were built at fine-grain scales (e.g. constructing a model for each NUCA bank), the model building/training overhead would be non-trivial. Since we can accurately forecast the behavior of large scale NUCA structures by only predicting a small set of wavelet coefficients, we expect our methods to be scalable to even larger architecture substrates.

Table 6-4 Error comparison of predicting raw vs. 2D DWT cache banks
Benchmarks    Error (Raw), %    Error (2D DWT), %
gcc(x8)       126               8
mcf(x8)       71                7
CPU           102               9
MIX           86                8
MEM           122               8
barnes        136               6
fmm           363               6
ocean-co      99                9
ocean-nc      136               6
water-sp      97                7
cholesky      71                7
fft           64                7
radix         92                7

Table 6-5 shows that exploring the multi-core NUCA design space using the proposed predictive models can lead to several orders of magnitude speedup compared with using detailed simulations. The speedup is calculated as the total simulation time across all 50 test cases divided by the time spent on model training and on predicting the 50 test cases. The model construction

PAGE 86

is a one-time overhead and can be amortized in the design space exploration stage, where a large number of cases need to be examined.

Table 6-5 Design space evaluation speedup (simulation vs. prediction)
Benchmarks    Simulation vs. Prediction
gcc(x8)       2,181x
mcf(x8)       3,482x
CPU           3,691x
MIX           472x
MEM           435x
barnes        659x
fmm           1,824x
ocean-co      1,077x
ocean-nc      1,169x
water-sp      738x
cholesky      696x
fft           670x
radix         1,010x

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input architecture design parameters were ranked based on either split order or split frequency. The design parameters which cause the most output variation tend to be split earliest and most often in the constructed regression tree. Therefore, architecture parameters that largely determine the values of a wavelet coefficient are located higher than others in the regression tree and they have a larger number of splits than others. We present in Figure 6-6 (shown as a star plot) the initial and most frequent splits within the regression trees that model the most significant wavelet coefficients. A star plot is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The data length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information

PAGE 87

such as: Which variables are dominant for a given dataset? Which observations show similar behavior?

Figure 6-6 Roles of design parameters in predicting 2D NUCA. Panels: Order, Frequency.

For example, on the Splash-2 benchmark fmm, network latency (net_lat), processor buffer size (p_buf), L2 latency (L2_lat) and L1 associativity (L1_aso) have significant roles in predicting the 2D NUCA spatial behavior, while the NUCA data management policy (NUCA) and network topology (net) largely affect the 2D spatial pattern when running the homogeneous multi-programmed workload gccx8. For the benchmark cholesky, the most frequently involved architecture parameters in regression tree construction are NUCA, net_lat, p_buf, L2_lat and L1_aso.

Differing from models that predict aggregated workload characteristics on monolithic architecture design, our proposed methods can accurately and informatively reveal the complex patterns that workloads exhibit on large-scale architectures. This feature is essential if the predictive models are employed to examine the efficiency of design tradeoffs or explore novel

PAGE 88

optimizations that consider multi-/many-cores. In this work, we study the suitability of using the proposed models for novel multi-core oriented NUCA optimizations.

Leveraging 2D Geometric Characteristics to Explore Cooperative Multi-core Oriented Architecture Design and Optimization

In this section, we present case studies to demonstrate the benefit of incorporating 2D workload/architecture behavior prediction into the early stages of microarchitecture design. In the first case study, we show that our geospatial-aware predictive models can effectively estimate workloads' 2D working sets and that such information can be beneficial in searching for cache-friendly workload/core mappings in multi-core environments. In the second case study, we explore using 2D thermal profile predictive models to accurately and informatively forecast the area and location of thermal hotspots across large NUCA substrates.

Case Study 1: Geospatial-aware Application/Core Mapping

Our 2D geometry-aware architecture predictive models can be used to explore global, cooperative resource management and optimization in multi-core environments.

Figure 6-7 2D NUCA footprint (geometric shape) of mesa

For example, as shown in Figure 6-7, a workload will exhibit a 2D working set with different geometric shapes when running on different cores. The exact shape of the access

PAGE 89

distribution depends on several factors such as the application and the data mapping/migration policy. As shown in the previous section, our predictive models can forecast workload 2D spatial patterns across the architecture design space. To predict workload 2D geometric footprints when running on different cores, we incorporate the core location as a new design parameter and build location-aware 2D predictive models. As a result, the new model can forecast a workload's 2D NUCA footprint (represented as a cache access distribution) when it is assigned to a specific core location. We assign 8 SPEC CPU 2000 workloads to the 8-core CMP system and then predict each workload's 2D NUCA footprint when running on the assigned core. We use the predicted 2D geometric working set of each workload to estimate the cache interference among the cores.

Figure 6-8 2D cache interference in NUCA. Labels: Core 0 to Core 7; Program A 2D NUCA footprint @ Core 0; Program B 2D NUCA footprint @ Core 1; Program C 2D NUCA footprint @ Core 2; interference area.

As shown in Figure 6-8, to estimate the interference for a given core/workload mapping, we estimate both the area and the degree of overlap among the workloads' 2D NUCA footprints. We only consider the interference between a core and its two neighbors. As a result, for a given core/workload layout, we can quickly estimate the overall interference. For each NUCA configuration, we estimate the interference when workloads are randomly assigned to different cores. We use simulation to count the actual cache interference among workloads. For each test

PAGE 90

case (e.g., a specific NUCA configuration), we generate two series of cache interference statistics (one from simulation and one from the predictive model), which correspond to the scenarios in which workloads are mapped to the different cores. We compute the Pearson correlation coefficient of the two data series. The Pearson correlation coefficient of two data series X and Y is defined as

r = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \, \sqrt{n \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2}}    (6-2)

If two data series, X and Y, show highly positive correlation, their Pearson correlation coefficient will be close to 1. Consequently, if the cache interference can be accurately estimated using the overlap between the predicted 2D NUCA footprints, we should observe nearly perfect correlation between the two metrics.

Figure 6-9 Pearson correlation coefficient (all 50 test cases are shown). Panels: Group 1 (CPU), Group 2 (MIX), Group 3 (MEM). Axes: Test Cases (0 to 50) versus Pearson Correlation Coefficient (0.5 to 1).

Figure 6-9 shows that there is a strong correlation between the interference estimated using the predicted 2D NUCA footprint and the interference statistics obtained using simulation. The highly positive Pearson correlation coefficient values show that by using the predictive model, designers can quickly devise the optimal core allocation for a given set of workloads. Alternatively, the information can be used by the OS to guide cache-friendly thread scheduling in multi-core environments.
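For reference, Equation 6-2 translates directly into code. The sketch below is illustrative: the function name `pearson` and the two interference series are made up for the example and are not data from this study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series (Equation 6-2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Hypothetical interference counts for five core/workload mappings.
simulated = [120.0, 95.0, 140.0, 80.0, 110.0]
predicted = [115.0, 99.0, 133.0, 85.0, 108.0]
r = pearson(simulated, predicted)
```

A value of r close to 1 means the predicted interference ranks and scales the mappings in the same way as simulation, which is the property needed when comparing candidate mappings. The function is undefined when either series is constant, since the denominator is then zero.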

PAGE 91

Case Study 2: 2D Thermal Hot-Spot Prediction

Thermal issues are becoming a first-order design parameter for large-scale CMP architectures. High operational temperatures and hotspots can limit performance and manufacturability. We use the HotSpot [54] thermal model to obtain the temperature variation across the 256 NUCA banks. We then build analytical models using the proposed methods to forecast the 2D thermal behavior of large NUCA caches with different configurations. Our predictive model can help designers insightfully predict the potential thermal hotspots and assess the severity of thermal emergencies. Figure 6-10 shows the simulated thermal profile and the predicted thermal behavior on different workloads. The temperatures are normalized to the range between the minimum and maximum values across the NUCA chip. As can be seen, the 2D thermal predictive models can accurately and informatively forecast the size and the location of thermal hotspots.

Figure 6-10 2D NUCA thermal profile (simulation vs. prediction). Panels: (a) Ocean-NC, (b) gccx8, (c) MEM, (d) Radix, each showing Simulation and Prediction with thermal hotspots marked (color scale: 0 to 1).
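The normalization mentioned above, and one simple way to flag hotspots on the normalized map, can be sketched as follows. The min-max scaling follows the description in the text; the 0.8 threshold and the function names are assumptions made purely for illustration and are not taken from this study.

```python
def normalize(temps):
    """Min-max scale a 2D temperature grid to the range [0, 1]."""
    lo = min(min(row) for row in temps)
    hi = max(max(row) for row in temps)
    span = hi - lo
    return [[(t - lo) / span if span else 0.0 for t in row] for row in temps]

def hotspots(normalized, threshold=0.8):
    """Return (row, col) positions whose normalized temperature reaches the
    threshold (the threshold value is an illustrative assumption)."""
    return [(i, j) for i, row in enumerate(normalized)
            for j, v in enumerate(row) if v >= threshold]

# Made-up 2x2 grid of bank temperatures.
grid = [[60.0, 70.0],
        [80.0, 100.0]]
norm = normalize(grid)
hot = hotspots(norm)
```

Applying the same two functions to a simulated and a predicted grid gives two sets of positions whose size and overlap can be compared, in the spirit of the size and location comparison shown in Figure 6-10.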

PAGE 92

Figure 6-11 NUCA 2D thermal prediction error. x-axis: number of wavelet coefficients (16wc, 32wc, 64wc, 96wc, 128wc, 256wc); y-axis: Error (%); series: Multiprogrammed (Homogeneous), Multiprogrammed (Heterogeneous), Multithreaded (SPLASH).

The thermal prediction accuracy (average statistics) across three workload categories is shown in Figure 6-11. The accuracy obtained with different numbers of wavelet coefficients is also shown in that figure. The results show that our predictive model can be used to cost-effectively analyze the thermal behavior of large architecture substrates. In addition, our proposed technique can be used to evaluate the efficiency of thermal management policies at a large scale. For example, thermal hotspots can be mitigated by throttling the number of accesses to a cache bank for a certain period when its temperature reaches a threshold. We build analytical models which incorporate thermal-aware cache access throttling as a design parameter. As a result, our predictive model can forecast the thermal hotspot distribution in the 2D NUCA cache banks when the dynamic thermal management (DTM) policy is enabled or disabled. Figure 6-12 shows the thermal profiles before and after thermal management policies are applied (both prediction and simulation results) for the benchmark Ocean-NC. As can be seen, they track each other very well. In terms of time taken for design space exploration, our proposed models have orders of magnitude less overhead. The time required to predict the thermal behavior is much less than that of full-system multi-core simulation. For example, thermal hotspot estimation is over 5×10^2 times faster than thermal simulation, justifying our decision to use the predictive

PAGE 93

models. Similarly, searching for a cache friendly workload/core mapping is 4×10^3 times faster than using the simulation-based method.

Figure 6-12 Temperature profile before and after a DTM policy. Panel labels: DTM, Simulation, Prediction.
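The access-throttling policy described in this case study (limit accesses to a cache bank for a certain period once its temperature reaches a threshold) can be sketched as a small state machine. This is only an illustrative sketch: the class name `BankThrottle`, the threshold value and the length of the throttling period are assumptions for the example, not the parameters used in the experiments.

```python
class BankThrottle:
    """Tracks whether one cache bank is currently being throttled."""

    def __init__(self, threshold, throttle_intervals):
        self.threshold = threshold                    # trigger temperature
        self.throttle_intervals = throttle_intervals  # length of one throttling period
        self.remaining = 0                            # intervals left in the current period

    def step(self, temperature):
        """Advance one interval; return True if accesses should be throttled."""
        if temperature >= self.threshold:
            # Reaching the threshold starts (or restarts) a throttling period.
            self.remaining = self.throttle_intervals
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


# Hypothetical temperature trace (kelvin) for a single bank.
throttle = BankThrottle(threshold=350.0, throttle_intervals=2)
decisions = [throttle.step(t) for t in [345.0, 351.0, 346.0, 346.0, 346.0]]
print(decisions)
```

In a predictive model of the kind described above, whether such a policy is enabled would simply be one more input parameter alongside the architectural ones.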

PAGE 94

CHAPTER 7
THERMAL DESIGN SPACE EXPLORATION OF 3D DIE STACKED MULTI-CORE PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS

To achieve thermally efficient 3D multi-core processor design, architects and chip designers need models with low computation overhead, which allow them to quickly explore the design space and compare different design options. One challenge in modeling the thermal behavior of 3D die stacked multi-core architecture is that the manifested thermal patterns show significant variation within each die and across different dies (as shown in Figure 7-1).

Figure 7-1 2D within-die and cross-dies thermal variation in 3D die stacked multi-core processors. Panel labels: Die1, Die2, Die3, Die4; CPU, MEM, MIX.

The results were obtained by simulating a 3D die stacked quad-core processor running multi-programmed CPU (bzip2, eon, gcc, perlbmk), MEM (mcf, equake, vpr, swim) and MIX (gcc, mcf, vpr, perlbmk) workloads. Each program within a multi-programmed workload was assigned to a die that contains a processor core and caches. Figure 7-2 shows the 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations. On the given die, the 2-dimensional thermal spatial characteristics vary widely with different design choices. As the number of architectural parameters in the design space increases, the complex thermal variation and characteristics cannot be captured without using slow and detailed simulations. As shown in Figures 7-1 and 7-2, to explore the thermal-aware design space accurately and informatively, we need computationally effective

PAGE 95

methods that not only predict aggregate thermal behavior but also identify both the size and the geographic distribution of thermal hotspots. In this work, we aim to develop fast and accurate predictive models to achieve this goal.

Figure 7-2 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations. Panel labels: Config. A, Config. B, Config. C, Config. D; CPU, MEM, MIX.

Figure 7-3 illustrates the original thermal behavior and the 2D wavelet transformed thermal behavior.

Figure 7-3 Example of using 2D DWT to capture thermal spatial characteristics. Panels: (a) Original thermal behavior (color scale 340 to 346), (b) 2D wavelet transformed thermal behavior (sub-band labels HL1, HH1, LH1, HH2, HL2, LH2, LL2, LL1).

As can be seen, the 2D thermal characteristics can be effectively captured using a small number of wavelet coefficients (e.g., Average (LL=1) or Average (LL=2)). Since a small set of wavelet coefficients provides concise yet insightful information on 2D thermal spatial characteristics, we use predictive models (i.e., neural networks) to relate them individually to various design parameters. Through the inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize 2D thermal spatial characteristics across the design space. Compared with a simulation-based method, predicting a small set of wavelet coefficients using analytical

PAGE 96

models is computationally efficient and is scalable to explore the large thermal design space of 3D multi-core architecture.

Prior work has proposed various predictive models [20-25, 50] to cost-effectively reason about processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous thermal behavior across large and distributed 3D multi-core architecture substrates. In this paper, we address this important and urgent research task by developing novel, 2D multi-scale predictive models, which can efficiently reason about the geo-spatial thermal characteristics within a die and across different dies during the design space exploration stage without using detailed cycle-level simulations. Instead of quantifying the complex geo-spatial thermal characteristics using a single number or a simple statistical distribution, our proposed techniques employ 2D wavelet multiresolution analysis and neural network nonlinear regression modeling. With our schemes, the thermal spatial characteristics are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of wavelet coefficients, our models can accurately reconstruct 2D spatial thermal behavior across the design space.

Combining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction

We view the 2D spatial thermal characteristics yielded in 3D integrated multi-core chips as a nonlinear function of architecture design parameters.
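The decomposition and reconstruction idea summarized above can be illustrated with one level of a 2D Haar transform. This is a minimal sketch under stated assumptions: it uses an averaging normalization (so the approximation band holds plain 2x2 block means), the band names `approx`, `horiz`, `vert` and `diag` are chosen for this example, and the wavelet family actually used in the experiments may differ. The 4x4 temperature map is hypothetical.

```python
def haar_2d_level(grid):
    """One level of an averaging 2D Haar transform (even grid dimensions required).

    Returns (approx, horiz, vert, diag): local averages plus three detail bands.
    """
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(grid), 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(grid[0]), 2):
            tl, tr = grid[r][c], grid[r][c + 1]
            bl, br = grid[r + 1][c], grid[r + 1][c + 1]
            row_a.append((tl + tr + bl + br) / 4.0)  # mean of the 2x2 block
            row_h.append((tl - tr + bl - br) / 4.0)  # left vs. right columns
            row_v.append((tl + tr - bl - br) / 4.0)  # top vs. bottom rows
            row_d.append((tl - tr - bl + br) / 4.0)  # diagonal contrast
        approx.append(row_a)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return approx, horiz, vert, diag


def inverse_haar_2d_level(approx, horiz, vert, diag):
    """Exact inverse of haar_2d_level."""
    rows, cols = len(approx), len(approx[0])
    grid = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for r in range(rows):
        for c in range(cols):
            a, h, v, d = approx[r][c], horiz[r][c], vert[r][c], diag[r][c]
            grid[2 * r][2 * c] = a + h + v + d
            grid[2 * r][2 * c + 1] = a - h + v - d
            grid[2 * r + 1][2 * c] = a + h - v - d
            grid[2 * r + 1][2 * c + 1] = a - h - v + d
    return grid


def zeros_like(band):
    """Detail band of the same shape with every coefficient dropped."""
    return [[0.0] * len(band[0]) for _ in band]


# Hypothetical 4x4 temperature map in kelvin (not data from this study).
thermal_map = [
    [340.0, 342.0, 344.0, 346.0],
    [342.0, 344.0, 346.0, 348.0],
    [341.0, 341.0, 345.0, 345.0],
    [343.0, 343.0, 347.0, 347.0],
]

approx, horiz, vert, diag = haar_2d_level(thermal_map)

# All bands together reproduce the map; the averages alone give a coarse version.
exact = inverse_haar_2d_level(approx, horiz, vert, diag)
coarse = inverse_haar_2d_level(approx, zeros_like(horiz),
                               zeros_like(vert), zeros_like(diag))
```

Here `coarse` is rebuilt from 4 coefficients instead of 16 temperature samples, which is the kind of trade-off between the number of coefficients and spatial detail that the chapter exploits.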
Instead of inferring the spatial thermal behavior by exhaustively obtaining the temperature at each individual location, we employ wavelet analysis to approximate it and then use a neural network to forecast the approximated thermal behavior across a large architectural design space. Previous work [21, 23, 25, 50] shows that neural networks can accurately predict the aggregated workload behavior across varied

PAGE 97

architecture configurations. Nevertheless, monolithic global neural network models lack the ability to reveal complex thermal behavior on a large scale. To overcome this disadvantage, we propose combining 2D wavelet transforms and neural networks that incorporate multiresolution analysis into a set of neural networks for spatial thermal characteristics prediction of 3D die stacked multi-core design.

Figure 7-4 Hybrid neuro-wavelet thermal prediction framework

The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both the global trend and the local variation of large data sets using a small set of wavelet coefficients. The local characteristics are decomposed into lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction of individual components or subsets of components, while the global trend is decomposed into higher scales of wavelet coefficients (low frequencies) that are used for the analysis and prediction of slow trends across each die. Collectively, these wavelet coefficients provide an accurate interpretation of the spatial trend and details of complex thermal behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict each individual wavelet coefficient. The separate predictions of wavelet coefficients proceed independently. Predicting each wavelet coefficient by a separate neural network simplifies the training task (which can be performed concurrently) of each sub-network.

PAGE 98

The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transforms to synthesize the 2D spatial thermal patterns across each die. Figure 7-4 shows our hybrid neuro-wavelet scheme for 2D spatial thermal characteristics prediction. Given the observed spatial thermal behavior on training data, our aim is to predict the 2D thermal behavior of each die in 3D die stacked multi-core processors under different design configurations. The hybrid scheme involves three stages. In the first stage, the observed spatial thermal behavior in each layer is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D thermal characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design space vector and predicts a wavelet coefficient. The training of an RBF network involves determining the center point and a radius for each RBF, and the weights of each RBF which determine the wavelet coefficients.

Experimental Methodology

Floorplanning and Hotspot Thermal Model

In this study, we model four floor-plans that involve processor core and cache structures as illustrated in Figure 7-5.

Figure 7-5 Selected floor-plans

As can be seen, the processor core is placed at different locations across the different floor-plans. Each floor-plan can be chosen by a layer in the studied 3D die stacking quad-core processors. The size and adjacency of blocks are critical parameters for deriving the thermal

PAGE 99

model. The baseline core architecture and floorplan we modeled is an Alpha processor, closely resembling the Alpha 21264. Figure 7-6 shows the baseline core floorplan.

Figure 7-6 Processor core floor-plan

We assume a 65 nm processing technology and the floor-plan is scaled accordingly. The entire die size is 21 × 21 mm and the core size is 5.8 × 5.8 mm. We consider three core configurations: 2-issue (5.8 × 5.8 mm), 4-issue (8.14 × 8.14 mm) and 8-issue (11.5 × 11.5 mm). Since the total die area is fixed, the more aggressive core configurations lead to smaller L2 caches. For all three types of core configurations, we calculate the size of the L2 caches based on the remaining die area available. Table 7-1 lists the detailed processor core and cache configurations.

We use Hotspot-4.0 [54] to simulate the thermal behavior of a 3D quad-core chip, shown in Figure 7-7. The Hotspot tool can specify the multiple layers of silicon and metal required to model a three dimensional IC. We choose the grid-like thermal modeling mode by specifying a set of 64 x 64 thermal grid cells per die, and the average temperature of each cell (32um x 32um) is represented by a value. Hotspot takes power consumption data for each component block, the layer parameters and the floor-plans as inputs and generates the steady-state temperature for each active layer. To build a 3D multi-core processor simulator, we heavily modified and extended the M-Sim simulator [63] and incorporated the Wattch power model [36]. The power trace is

PAGE 100

generated from the developed framework with an interval size of 500K cycles. We simulate a 3D-stacked quad-core processor with one core assigned to each layer.

Table 7-1 Architecture configuration for different issue width

| Parameter | 2 issue | 4 issue | 8 issue |
|---|---|---|---|
| Processor Width | 2-wide fetch/issue/commit | 4-wide fetch/issue/commit | 8-wide fetch/issue/commit |
| Issue Queue | 32 | 64 | 128 |
| ITLB | 32 entries, 4-way, 200 cycle miss | 64 entries, 4-way, 200 cycle miss | 128 entries, 4-way, 200 cycle miss |
| Branch Predictor | 512 entries Gshare, 10-bit global history | 1K entries Gshare, 10-bit global history | 2K entries Gshare, 10-bit global history |
| BTB | 512K entries, 4-way | 1K entries, 4-way | 2K entries, 4-way |
| Return Address | 8 entries RAS | 16 entries RAS | 32 entries RAS |
| L1 Inst. Cache | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access | 64K, 2-way, 32 Byte/line, 2 ports, 1 cycle access | 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access |
| ROB Size | 32 entries | 64 entries | 96 entries |
| Load/Store | 24 entries | 48 entries | 72 entries |
| Integer ALU | 2 I-ALU, 1 I-MUL/DIV, 2 Load/Store | 4 I-ALU, 2 I-MUL/DIV, 2 Load/Store | 8 I-ALU, 4 I-MUL/DIV, 4 Load/Store |
| FP ALU | 1 FP-ALU, 1 FP-MUL/DIV/SQRT | 2 FP-ALU, 2 FP-MUL/DIV/SQRT | 4 FP-ALU, 4 FP-MUL/DIV/SQRT |
| DTLB | 64 entries, 4-way, 200 cycle miss | 128 entries, 4-way, 200 cycle miss | 256 entries, 4-way, 200 cycle miss |
| L1 Data Cache | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access | 64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle | 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access |
| L2 Cache | unified 4MB, 4-way, 128 Byte/line, 12 cycle access | unified 3.7MB, 4-way, 128 Byte/line, 12 cycle access | unified 3.2MB, 4-way, 128 Byte/line, 12 cycle access |
| Memory Access | 32 bit wide, 200 cycles access latency | 64 bit wide, 200 cycles access latency | 64 bit wide, 200 cycles access latency |

Figure 7-7 Cross section view of the simulated 3D quad-core chip

PAGE 101

Workloads and System Configurations

We use both integer and floating-point benchmarks from the SPEC CPU 2000 suite (e.g., bzip2, crafty, eon, facerec, galgel, gap, gcc, lucas, mcf, parser, perlbmk, twolf, swim, vortex and vpr) to compose our experimental multiprogrammed workloads (see Table 7-2). We categorize all benchmarks into two classes: CPU-bound and MEM-bound applications. We design three types of experimental workloads: CPU, MEM and MIX. The CPU and MEM workloads consist of programs from only the CPU intensive and memory intensive categories respectively. MIX workloads are the combination of two benchmarks from the CPU intensive group and two from the memory intensive group.

Table 7-2 Simulation configurations

| Item | Value |
|---|---|
| Chip Frequency | 3 GHz |
| Voltage | 1.2 V |
| Proc. Technology | 65 nm |
| Die Size | 21 mm × 21 mm |

| Workload | Benchmarks |
|---|---|
| CPU1 | bzip2, eon, gcc, perlbmk |
| CPU2 | perlbmk, mesa, facerec, lucas |
| CPU3 | gap, parser, eon, mesa |
| MIX1 | gcc, mcf, vpr, perlbmk |
| MIX2 | perlbmk, mesa, twolf, applu |
| MIX3 | eon, gap, mcf, vpr |
| MEM1 | mcf, equake, vpr, swim |
| MEM2 | twolf, galgel, applu, lucas |
| MEM3 | mcf, twolf, swim, vpr |

These multi-programmed workloads were simulated on our multi-core simulator configured as 3D quad-core processors. We use the Simpoint tool [1] to obtain a representative slice for each benchmark (with full reference input set) and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. The simulations continue until one benchmark within a workload finishes the execution of the representative interval of 250M instructions.

PAGE 102

Design Parameters

In this study, we consider a design space that consists of 23 parameters (see Table 7-3) spanning from floor-planning to packaging technologies.

Table 7-3 Design space parameters

| Group | Component | Parameter | Key | Low | High |
|---|---|---|---|---|---|
| 3D Configurations | Layer0 | Thickness (m) | ly0_th | 5e-5 | 3e-4 |
| | | Floorplan | ly0_fl | Flp 1/2/3/4 | |
| | | Bench | ly0_bench | CPU/MEM/MIX | |
| | Layer1 | Thickness (m) | ly1_th | 5e-5 | 3e-4 |
| | | Floorplan | ly1_fl | Flp 1/2/3/4 | |
| | | Bench | ly1_bench | CPU/MEM/MIX | |
| | Layer2 | Thickness (m) | ly2_th | 5e-5 | 3e-4 |
| | | Floorplan | ly2_fl | Flp 1/2/3/4 | |
| | | Bench | ly2_bench | CPU/MEM/MIX | |
| | Layer3 | Thickness (m) | ly3_th | 5e-5 | 3e-4 |
| | | Floorplan | ly3_fl | Flp 1/2/3/4 | |
| | | Bench | ly3_bench | CPU/MEM/MIX | |
| | TIM (Thermal Interface Material) | Heat Capacity (J/m^3 K) | TIM_cap | 2e6 | 4e6 |
| | | Resistivity (m K/W) | TIM_res | 2e-3 | 5e-2 |
| | | Thickness (m) | TIM_th | 2e-5 | 75e-6 |
| General Configurations | Heat sink | Convection capacity (J/K) | HS_cap | 140.4 | 1698 |
| | | Convection resistance (K/W) | HS_res | 0.1 | 0.5 |
| | | Side (m) | HS_side | 0.045 | 0.08 |
| | | Thickness (m) | HS_th | 0.02 | 0.08 |
| | Heat Spreader | Side (m) | HP_side | 0.025 | 0.045 |
| | | Thickness (m) | HP_th | 5e-4 | 5e-3 |
| | Others | Ambient temperature (K) | Am_temp | 293.15 | 323.15 |
| Archi. | | Issue width | Issue width_ | 2 or 4 or 8 | |

These design parameters have been shown to have a large impact on processor thermal behavior. The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed cycle-accurate simulations, we measure processor power and thermal characteristics on all design points within both the training and testing data sets. We build a separate model for each benchmark domain and use the model to predict thermal behavior at unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of the models' accuracy is obtained by using the design points in the testing data set. To train an accurate and prompt neural network prediction model, one needs to ensure that the sample data sets disperse points throughout the

PAGE 103

design space while keeping the sample set small enough to maintain a low model building cost. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling strategy since it provides better coverage compared to a naive random sampling scheme. We generate multiple LHS matrices and use a space-filling metric called L2-star discrepancy [40]. The L2-star discrepancy is applied to each LHS matrix to find the representative design space that has the lowest value of L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this work, we used 200 training and 50 test data points to reach a high accuracy for thermal behavior prediction, since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. In our study, the thermal characteristics across each die are represented by 64 × 64 samples.

Experimental Results

In this section, we present detailed experimental results using 2D wavelet neural networks to forecast the thermal behavior of large scale 3D multi-core structures running various CPU/MIX/MEM workloads without using detailed simulation.

Simulation Time vs. Prediction Time

To evaluate the effectiveness of our thermal prediction models, we compute the speedup metric (defined as simulation time vs. prediction time) across all experimented workloads (shown in Table 7-4). To calculate the simulation time, we measured the time that the Hotspot simulator takes to obtain steady thermal characteristics on a given design configuration. As can be seen, the Hotspot tool simulation time varies with design configurations. We report both the shortest (best) and longest (worst) simulation time in Table 7-4. The prediction time, which includes the time for the neural networks to predict the targeted thermal behavior, remains constant for all studied cases.
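As a worked check of the speedup metric, the sketch below divides Hotspot simulation times reported in Table 7-4 by the constant prediction time of 1.23 seconds listed there. Only three of the nine workloads are included to keep the example short; the function and variable names are chosen for illustration.

```python
PREDICTION_SEC = 1.23  # prediction time reported in Table 7-4

# (best, worst) Hotspot simulation times in seconds, taken from Table 7-4.
simulation_sec = {
    "CPU1": (362, 6091),
    "CPU2": (366, 6567),
    "MEM1": (351, 5890),
}


def speedup(sim_seconds, pred_seconds=PREDICTION_SEC):
    """Speedup defined as simulation time divided by prediction time."""
    return sim_seconds / pred_seconds


speedups = {
    name: (round(speedup(best)), round(speedup(worst)))
    for name, (best, worst) in simulation_sec.items()
}
for name, (best, worst) in speedups.items():
    print(f"{name}: {best} : {worst}")
```

Rounded to the nearest integer, these ratios reproduce the corresponding speedup entries of Table 7-4, including the smallest (285, MEM1) and largest (5339, CPU2) values quoted in the text.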
In our experiment, a total number of 16

PAGE 104

neural networks were used to predict 16 2D wavelet coefficients which efficiently capture workload thermal spatial characteristics. As can be seen, our predictive models achieve a speedup ranging from 285 (MEM1) to 5339 (CPU2), making them suitable for rapidly exploring a large thermal design space.

Table 7-4 Simulation time vs. prediction time

| Workload | Simulation (sec) [best : worst] | Prediction (sec) | Speedup (Sim./Pred.) [best : worst] |
|---|---|---|---|
| CPU1 | 362 : 6,091 | 1.23 | 294 : 4,952 |
| CPU2 | 366 : 6,567 | 1.23 | 298 : 5,339 |
| CPU3 | 365 : 6,218 | 1.23 | 297 : 5,055 |
| MEM1 | 351 : 5,890 | 1.23 | 285 : 4,789 |
| MEM2 | 355 : 6,343 | 1.23 | 289 : 5,157 |
| MEM3 | 367 : 5,997 | 1.23 | 298 : 4,876 |
| MIX1 | 352 : 5,944 | 1.23 | 286 : 4,833 |
| MIX2 | 365 : 6,091 | 1.23 | 297 : 4,952 |
| MIX3 | 360 : 6,024 | 1.23 | 293 : 4,898 |

Prediction Accuracy

The prediction accuracy measure is the mean error defined as follows:

\[ ME = \frac{1}{N}\sum_{k=1}^{N} \frac{\left| x(k) - \tilde{x}(k) \right|}{x(k)} \tag{7-1} \]

where x(k) is the actual value generated by the Hotspot thermal model, \tilde{x}(k) is the predicted value and N is the total number of samples (a set of 64 x 64 temperature samples per layer). As prediction accuracy increases, the ME becomes smaller.

We present boxplots to observe the average prediction errors and their deviations for the 50 test configurations against Hotspot simulation results. Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the ME values. Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The

PAGE 105

horizontal line in the interior of the box is located at the median of the data; it shows the center of the distribution for the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. In Figure 7-8, the blue line with diamond shape markers indicates the statistical average of ME across all benchmarks.

Figure 7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16). x-axis: CPU1, CPU2, CPU3, MEM1, MEM2, MEM3, MIX1, MIX2, MIX3; y-axis: Error (%), 0 to 20.

Figure 7-8 shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 2.8% (CPU1) to 15.5% (MEM1) with an overall median error of 6.9% across all experimented workloads. As can be seen, the maximum error at any design point for any benchmark is 17.5% (MEM1), and most benchmarks show an error of less than 9%. This indicates that our hybrid neuro-wavelet framework can predict 2D spatial thermal behavior across large and sophisticated 3D multi-core architecture with high accuracy. Figure 7-8 also indicates that CPU workloads (average 4.4%) have smaller error rates than MEM (average 9.4%) and MIX (average 6.7%) workloads. This is because the CPU workloads usually have higher temperature on the small core area than on the large L2 cache area. These small and sharp hotspots can be easily captured using just a few wavelet coefficients. On MEM and MIX workloads, the complex thermal pattern can spread across the entire die area, resulting in higher prediction error. Figure 7-9 illustrates the simulated and predicted 2D thermal spatial behavior of die 4 (for one configuration) on the CPU1, MEM1 and MIX1 workloads.
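For reference, the ME metric of Eq. (7-1) can be computed over a pair of temperature grids as sketched below. The sketch assumes the error is taken as an absolute relative difference per sample and that all temperatures are positive (e.g., in kelvin); the function name `mean_error` and the small 2x2 grids are hypothetical stand-ins for the 64 x 64 samples per layer.

```python
def mean_error(actual, predicted):
    """Mean relative error between two equally sized 2D grids (cf. Eq. 7-1)."""
    total = 0.0
    count = 0
    for row_actual, row_pred in zip(actual, predicted):
        for x, x_pred in zip(row_actual, row_pred):
            total += abs(x - x_pred) / x  # relative error of one sample
            count += 1
    return total / count


# Hypothetical simulated vs. predicted values (not data from this study).
simulated = [[100.0, 200.0],
             [400.0, 500.0]]
predicted = [[110.0, 190.0],
             [400.0, 525.0]]
me = mean_error(simulated, predicted)
print(f"ME = {100 * me:.1f}%")
```

Computing one such ME value per test configuration yields the per-workload distributions that the boxplots in Figure 7-8 summarize.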

PAGE 106

Figure 7-9 Simulated and predicted thermal behavior. Panel labels: CPU1, MEM1, MIX1; Prediction, Simulation.

The results show that our predictive models can track both the size and the location of thermal hotspots. We further examine the accuracy of predicting the locations and area of the hottest spots and the results are similar to those presented in Figure 7-8.

Figure 7-10 ME boxplots of prediction accuracies with different number of wavelet coefficients. Panels: CPU1, MEM1, MIX1; x-axis: 16wc, 32wc, 64wc, 96wc, 128wc, 256wc; y-axis: Error (%).

Figure 7-10 shows the prediction accuracies with different numbers of wavelet coefficients on the multi-programmed workloads CPU1, MEM1 and MIX1. In general, the 2D thermal spatial pattern prediction accuracy increases when more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients. Cost-effective models should provide high prediction accuracy while maintaining low

PAGE 107

complexity. The trend of prediction accuracy (Figure 7-10) suggests that for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point improves error at a lower rate, except on the MEM1 workload. Thus, we select 16 wavelet coefficients in this work to minimize the complexity of the prediction models while achieving good accuracy.

We further compare the accuracy of our proposed scheme with that of approximating 3D stacked die spatial thermal patterns by predicting the temperature of 16 evenly distributed locations across the 2D plane. The results (Figure 7-11) indicate that, using the same number of neural networks, our scheme yields significantly higher accuracy than conventional predictive models. This is because wavelets provide a good time and locality characterization capability and most of the energy is captured by a limited set of important wavelet coefficients. The coordinated wavelet coefficients provide superior interpretation of the spatial patterns across scales of time and frequency domains.

Figure 7-11 Benefit of predicting wavelet coefficients. x-axis: CPU1, CPU2, CPU3, MEM1, MEM2, MEM3, MIX1, MIX2, MIX3; y-axis: Error (%), 0 to 100; series: Predicting the wavelet coefficients, Predicting the raw data.

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input parameters (refer to Table 7-3) were ranked based on split frequency. The input parameters which cause the most output variation tend to be split frequently in the constructed regression tree. Therefore, the input parameters that largely determine the values of a wavelet coefficient have a larger number of splits.
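The split-frequency ranking described above can be sketched as a simple tree traversal that counts how often each input parameter is used as a split variable. The nested-dictionary tree representation, the function name `count_splits` and the toy tree are assumptions made for this example; they do not reflect the actual trees built in the study, although the parameter keys are borrowed from Table 7-3 and Figure 7-12.

```python
from collections import Counter


def count_splits(node, counts=None):
    """Count how many internal nodes split on each input parameter."""
    if counts is None:
        counts = Counter()
    if node is None or "feature" not in node:
        return counts  # leaf node: nothing to count
    counts[node["feature"]] += 1
    count_splits(node.get("left"), counts)
    count_splits(node.get("right"), counts)
    return counts


# Hypothetical regression tree: internal nodes name the split parameter,
# leaves hold a predicted wavelet-coefficient value.
toy_tree = {
    "feature": "ly3_fl",
    "left": {
        "feature": "Iss_size",
        "left": {"value": 341.2},
        "right": {
            "feature": "ly3_fl",
            "left": {"value": 343.0},
            "right": {"value": 344.5},
        },
    },
    "right": {
        "feature": "HS_res",
        "left": {"value": 346.1},
        "right": {"value": 348.9},
    },
}

ranking = count_splits(toy_tree).most_common()
print(ranking)
```

Parameters at the top of such a ranking are the ones the tree relies on most to explain variation in the modeled wavelet coefficient.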

PAGE 108

Figure 7-12 Roles of input parameters. Star plot titled "Design Parameters by Regression Tree" with axes ly0_th, ly0_fl, ly0_bench, ly1_th, ly1_fl, ly1_bench, ly2_th, ly2_fl, ly2_bench, ly3_th, ly3_fl, ly3_bench, TIM_cap, TIM_res, TIM_th, HS_cap, HS_res, HS_side, HS_th, HP_side, HP_th, am_temp, Iss_size.

Figure 7-12 shows the most frequent splits within the regression tree that models the most significant wavelet coefficient. A star plot [41] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The size of each parameter's segment is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information such as: Which variables are dominant for a given data set? Which observations show similar behavior? As can be seen, the floor-planning of each layer and the core configuration largely affect the thermal spatial behavior of the studied workloads.

PAGE 109

CHAPTER 8
CONCLUSIONS

Studying program workload behavior is of growing interest in computer architecture research. The performance, power and reliability optimizations of future computer workloads and systems could involve analyzing program dynamics across many time scales. Modeling and predicting program behavior at a single scale has many limitations. For example, samples taken from a single, fine-grained interval may not be useful in forecasting how a program behaves at medium or large time scales. In contrast, observing program behavior using a coarse-grained time scale may lose opportunities that can be exploited by hardware and software in tuning resources to optimize workload execution at a fine-grained level.

In Chapter 3, we proposed new methods, metrics and a framework that can help researchers and designers better understand phase complexity and the changing of program dynamics across multiple time scales. We proposed using wavelet transformations of code execution and runtime characteristics to produce a concise yet informative view of program dynamic complexity. We demonstrated the use of this information in phase classification, which aims to produce phases that exhibit a similar degree of complexity. Characterizing phase dynamics across different scales provides insightful knowledge and abundant features that can be exploited by hardware and software in tuning resources to meet the requirements of workload execution at different granularities.

In Chapter 4, we extend the scope of Chapter 3 by (1) exploring and contrasting the effectiveness of using wavelets on a wide range of program execution statistics for phase analysis; and (2) investigating techniques that can further optimize the accuracy of wavelet-based phase classification. More importantly, we identify additional benefits that wavelets can offer in the context of phase analysis. For example, wavelet transforms can provide efficient

PAGE 110

dimensionality reduction of large volume, high dimension raw program execution statistics from the time domain and hence can be integrated with a sampling mechanism to efficiently increase the scalability of phase analysis of large scale phase behavior on long-running workloads. To address workload variability issues in phase classification, wavelet-based denoising can be used to extract the essential features of workload behavior from their run-time non-deterministic (i.e., noisy) statistics.

For workload prediction, in Chapter 5 we propose the use of wavelet neural networks to build accurate predictive models for workload dynamics driven microarchitecture design space exploration, overcoming the problems of monolithic, global predictive models. We show that wavelet neural networks can be used to accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluate the efficiency of using the proposed techniques to predict workload dynamic behavior in the performance, power, and reliability domains. We also perform extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy and identify microarchitecture parameters that significantly affect workload dynamic behavior. To evaluate the efficiency of scenario-driven architecture optimizations across different domains, we also present a case study of using workload dynamics aware predictive models. Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios. To our knowledge, the model we proposed is the first one that can track complex program dynamic behavior across different microarchitecture configurations. We believe our workload dynamics forecasting techniques will allow architects to quickly evaluate a rich set of architecture optimizations that target workload dynamics at an early microarchitecture design stage.

PAGE 111

111 In Chapter 6, we explore novel predictive t echniques that can quickly, accurately and informatively analyze the design trade-offs of fu ture large-scale multi-/manycore architectures in a scalable fashion. The characteristics that workloads exhibited on these architectures are complex phenomena since they typically contain a mixture of behavior localized at different scales. Applying wavelet analysis, our method can capture the heterogeneous behavior across a wide range of spatial scales usi ng a limited set of parameters. We show that these parameters can be cost-effectively predicted usi ng non-linear modeling techniques su ch as neural networks with low computational overhead. Experimental results show that our scheme can accurately predict the heterogeneous behavior of large-scale multi-core oriented architecture substrates. To our knowledge, the model we proposed is the first that can track complex 2D workload/architecture interaction across design alternatives. we fu rther examined using the proposed models to effectively explore multi-core aw are resource allocations and design evaluations. For example, we build analytical models that can quickly fore cast workloads 2D working sets across different NUCA configurations. Combined with interference estimation, our models can determine the geometric-aware workload/core mappings that lead to minimal interference. We also show that our models can be used to predict the locati on and the area of thermal hotspots during thermalaware design exploration. In the light of the emerging multi-/ manycore design era, we believe that the proposed 2D predictive model will allo w architects to quickly yet informatively examine a rich set of design alternativ es and optimizations for large and sophisticated architecture substrates at an early design stage. Leveraging 3D die stacking technologies in multi-core pr ocessor design has received increased momentum in both the chip design industry and research community. 
One of the major roadblocks to realizing 3D multi-core design is its inefficient heat dissipation. To ensure thermal efficiency, processor architects and chip designers rely on detailed yet slow simulations to model thermal characteristics and analyze various design trade-offs. However, due to the sheer size of the design space, such techniques are very expensive in terms of time and cost. In Chapter 7, we aim to develop computationally efficient methods and models that allow architects and designers to rapidly yet informatively explore the large thermal design space of 3D multi-core architectures. Our models achieve several orders of magnitude speedup compared to simulation-based methods. Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models can capture complex 2D thermal spatial patterns and can be used to forecast both the location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging 3D multi-core design era, we believe that the proposed thermal predictive models will be valuable for architects to quickly and informatively examine a rich set of thermal-aware design alternatives and thermal-oriented optimizations for large and sophisticated architecture substrates at an early design stage.
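
The idea of describing a 2D spatial pattern with a limited set of wavelet parameters can be illustrated with a small sketch. The code below is not the implementation used in this work: it assumes an orthonormal Haar wavelet, a synthetic 16x16 temperature map with a single Gaussian hotspot, and hypothetical helper names (make_hotspot_map, haar2d, ihaar2d, keep_top_k). It keeps only the k largest-magnitude coefficients and reconstructs the map from them, so the hotspot location and area can be read from the approximation. In the predictive setting described above, the retained coefficients would be the quantities that a non-linear model is trained to predict for each design point; that step is omitted here.

```python
import numpy as np

def make_hotspot_map(size, center, base=60.0, peak=25.0, width=2.0):
    """Synthetic temperature map (assumed data): flat base plus one Gaussian hotspot."""
    y, x = np.mgrid[0:size, 0:size]
    r2 = (y - center[0]) ** 2 + (x - center[1]) ** 2
    return base + peak * np.exp(-r2 / (2.0 * width ** 2))

def _step(x):
    """One orthonormal Haar step along axis 0: sums on top, differences below."""
    even, odd = x[0::2], x[1::2]
    return np.vstack([even + odd, even - odd]) / np.sqrt(2.0)

def _unstep(y):
    """Inverse of _step along axis 0."""
    half = y.shape[0] // 2
    s, d = y[:half], y[half:]
    x = np.empty_like(y)
    x[0::2] = (s + d) / np.sqrt(2.0)
    x[1::2] = (s - d) / np.sqrt(2.0)
    return x

def haar2d(a):
    """Multi-level 2D Haar transform of a square array whose side is a power of two."""
    out = np.array(a, dtype=float)
    n = out.shape[0]
    while n > 1:
        blk = _step(out[:n, :n])   # transform columns
        blk = _step(blk.T).T       # transform rows
        out[:n, :n] = blk
        n //= 2
    return out

def ihaar2d(c):
    """Inverse of haar2d."""
    out = np.array(c, dtype=float)
    n = 2
    while n <= out.shape[0]:
        blk = _unstep(out[:n, :n].T).T  # undo rows
        blk = _unstep(blk)              # undo columns
        out[:n, :n] = blk
        n *= 2
    return out

def keep_top_k(coeffs, k):
    """Zero all but the k largest-magnitude coefficients."""
    flat = coeffs.reshape(-1)
    idx = np.argsort(np.abs(flat))[-k:]
    kept = np.zeros(flat.size)
    kept[idx] = flat[idx]
    return kept.reshape(coeffs.shape)

thermal = make_hotspot_map(16, center=(11, 4))
coeffs = haar2d(thermal)
approx = ihaar2d(keep_top_k(coeffs, 16))  # 16 of 256 coefficients

# Hotspot location and area (cells above an assumed 75-degree threshold)
print("hotspot cell:", np.unravel_index(np.argmax(approx), approx.shape))
print("hotspot area (cells):", int((approx > 75.0).sum()))
print("RMS error:", float(np.sqrt(np.mean((approx - thermal) ** 2))))
```

Because the transform is orthonormal, the squared reconstruction error equals the energy of the discarded coefficients, which is why retaining the largest-magnitude coefficients is a natural choice for a compact description.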

BIOGRAPHICAL SKETCH

Chang Burm Cho earned B.E. and M.A. degrees in electrical engineering at Dan-kook University, Seoul, Korea, in 1993 and 1995, respectively. Over the next 9 years, he worked as a senior researcher at the Korea Aerospace Research Institute (KARI), developing the On-Board Computer (OBC) for two satellites, KOMPSAT-1 and KOMPSAT-2. His research interests are computer architecture and workload characterization and prediction in large microarchitectural design spaces.


3cfd5c9ca407933019e20e6bad69ca49607e0b8c
6123 F20101111_AAAQIG cho_c_Page_086thm.jpg
d7d20892b5dba875a3506b35e20dbef7
b05068b368cbb7caac211f708d209eb3a7194038
23522 F20101111_AAAQHS cho_c_Page_068.QC.jpg
a8aed69334fdcee7773df3fe6b937429
6c198c93cbb9f1fb8a8eec663b7b7457e36699bc
74830 F20101111_AAAPFE cho_c_Page_008.jpg
2fe6d559efbd6770e7e3fe32d1589f44
19168889afdcaa6695f5a1a76199e7270e06edc7
2070 F20101111_AAAPEQ cho_c_Page_041.txt
39c2a81fb42a40876d667f459f247fc9
cd215f8a0b53a42d0282b53674120e83ed72c738
22995 F20101111_AAAQIH cho_c_Page_089.QC.jpg
43ee53d91d0c44628eb69473472e0020
7a8a13234fe4d7ba2646bcfc4d917c8c86545933
7341 F20101111_AAAQHT cho_c_Page_071thm.jpg
cf7fcafcf6433deafff3ccc86d5e9219
25569f7702643737fbb88934175e93f252cf851e
57189 F20101111_AAAPFF cho_c_Page_009.jpg
22a9dec0186a3ad87041a847ae84a7bf
0ff5f2c6e35ddf595cf5c00c73bd8ff3fd0adefb
25507 F20101111_AAAPER cho_c_Page_010.QC.jpg
235d293d06c415ce0de7f05d0b5f9437
8487ecbd0d74f5cc83be4358af26de24d63db30d
8734 F20101111_AAAQII cho_c_Page_091thm.jpg
2eeeb58da5fe10137a87e9d6f5d7aa43
33f3f380d6965fe45e805f30c4bce24b1c586b98
24545 F20101111_AAAQHU cho_c_Page_072.QC.jpg
bc726f606f76bedfcd1cc3d2f990f6d6
906b2be7cfca21820dc4b3764a139e06694e24cc
84203 F20101111_AAAPFG cho_c_Page_010.jpg
9244cb6c5a0f40e3317ee261c5b34335
8ab0b677238dadd800793e191d3695149f9e64e5
21619 F20101111_AAAPES cho_c_Page_086.QC.jpg
1e5390c89cbc1eaacff9766fb2354589
f7489e5f604eabbc460cd46f20b998eb751fd601
23845 F20101111_AAAQIJ cho_c_Page_092.QC.jpg
c2af4dc07e9e24c18cb680351ca5ea31
189a45ee61d138cbf8040ebe572945d29e491305
3345 F20101111_AAAQHV cho_c_Page_073thm.jpg
1cadfa0a00627ba7352b8531d1b65ff4
f552d415cd19dedbc7dcd1109434ce237f034e9c
49663 F20101111_AAAPFH cho_c_Page_011.jpg
0cf0d46f25811c2c235e84612f850873
82589932063cb3c365f5388f9f46af4a4ae34acb
F20101111_AAAPET cho_c_Page_098.tif
406075c05d70d92813291f1902be0d8a
373784ee91fc6b71502fa86e7d97f84c29e18638
4093 F20101111_AAAQIK cho_c_Page_093thm.jpg
486eb3cb63481159aac0d8b21bff059a
fac9515d3a0d99ff4cb9071e0683c4ecd759e750
7795 F20101111_AAAQHW cho_c_Page_074thm.jpg
e2bbc40d39a83b3d99f0b33a1a0a42d3
2b29c7b13cae644115b9894bb013038e8c2adcaa
84862 F20101111_AAAPFI cho_c_Page_012.jpg
e89f046dd302775922202d80b344c026
7e0c3dfb9dedb7e9a9e7ec7cc1f46595a3bf7b6c
136516 F20101111_AAAPEU UFE0021941_00001.mets
70d86a43d4563074f31113b4a91abf44
b3c5798f839a4cd47c585258c4a2c967b31f8487
26013 F20101111_AAAQIL cho_c_Page_094.QC.jpg
1a07fb668784a27051f449a2c4ab287c
bcd644ee5603ac663c017b64405cd10aab267a56
25890 F20101111_AAAQHX cho_c_Page_077.QC.jpg
21357eb01a8a48387af2046e52e38780
d51ecdba89dd7cde6839e2db1953d86ead64849c
87690 F20101111_AAAPFJ cho_c_Page_013.jpg
624d6678fc40ed7d1c6f95d91ea700e4
c4b127775e5fcd229491b2fd39b305a3a8420460
26413 F20101111_AAAQJA cho_c_Page_113.QC.jpg
5f98a4c44d304748ce8f62096fd42cb9
a68fdf75ad657fef1bfbee6eb5cd5f5e8802340d
7578 F20101111_AAAQIM cho_c_Page_096thm.jpg
293c60c814008a7c4c019cf826ded0db
43711970717ee370fa18dacc975249647797db48
25191 F20101111_AAAQHY cho_c_Page_078.QC.jpg
002db615b94256a97450a52022f94976
5b817c6dffa265ba5c60c440e96020dc2caddcfb
86808 F20101111_AAAPFK cho_c_Page_014.jpg
a052cd0da872143615ad5b9ca31db061
eeaeb4145e13ac42c42e23a612b54de79b1295d2
27786 F20101111_AAAQJB cho_c_Page_114.QC.jpg
d6fa3ea5a4184c803e89005a6dc2388c
46f662ce3c4e9951618d446beef12392b1307e14
23410 F20101111_AAAQIN cho_c_Page_097.QC.jpg
aa353eef707e6928123b627338410391
77e1c19072fed53e81606dc88c5c501c4431aa54
22155 F20101111_AAAQHZ cho_c_Page_080.QC.jpg
ac945629758d492a0df28e56b18d0b6e
e2ea9ed00c8707290488cbd75d4124aa873a632e
14500 F20101111_AAAPFL cho_c_Page_015.jpg
8c38e9f3a290cbba0825249c5faa455e
42990f33e93389aa66e5b2159fba4e9726768d1e
27868 F20101111_AAAPEX cho_c_Page_001.jpg
625083c1b8484ff4a09b8d432d038854
602f61fc06a528b4d690cb3e2810857a6794d627
7022 F20101111_AAAQJC cho_c_Page_114thm.jpg
2103cc052ce166a7919e3a8f6ebc9d63
e8df3c4b22666dd09db5975696b8308140a25106
6306 F20101111_AAAQIO cho_c_Page_099thm.jpg
4c9e578c787e224c50efe6ca449828a9
050e6f353f8c4a439b4664bb41bafd6a778c14b5
80436 F20101111_AAAPGA cho_c_Page_030.jpg
008551b032440d2a9a07ee31506b8148
979a342634941a50e273fa1cd52640c8d0148576
72485 F20101111_AAAPFM cho_c_Page_016.jpg
e6466f2b171a8edab192527ecdd18b78
b05710a2c904429d4246c6828c8c3f2e0284a25d
10098 F20101111_AAAPEY cho_c_Page_002.jpg
8413b5413d0c3145ff5aa2fcac8e8434
9c047e5369518e98e0283c04b85a60e3b68ef786
7506 F20101111_AAAQJD cho_c_Page_115thm.jpg
30cade386c1eaf8875bf1f3bba638055
b15a703137ff06fc66e1808d856c20f9fc276b0a
21728 F20101111_AAAQIP cho_c_Page_101.QC.jpg
042737d85fe3d2a71ab01175726aa9dd
00395aa630d0382649e7338d2653ab8b5d256b73
82494 F20101111_AAAPGB cho_c_Page_031.jpg
ccc071d2ab364cfbaf0869c6ca12e046
02fd9e39012a9c71066554ba46bf8b3b159ca96e
78535 F20101111_AAAPFN cho_c_Page_017.jpg
b4196356c2aae07d50e65998658cfd52
22da85c6d49d643cdcc95bf514410541f48af145
7587 F20101111_AAAQJE cho_c_Page_117thm.jpg
fb8bffab4f33fcccdcf943a1b57e3d03
dc3ea1702f868e9feb3d60db4bfbd3cde63fae26
6229 F20101111_AAAQIQ cho_c_Page_101thm.jpg
1e63a7129c6acc30ade20f3415705d83
f55e832174c0bdfd9430af9d73954a158d8df2e7
72508 F20101111_AAAPGC cho_c_Page_032.jpg
b851944216f1bcb6dd9386bb2ace9eba
4c203be39ad6f25837671f617f1f781ed44eb6c8
70183 F20101111_AAAPFO cho_c_Page_018.jpg
be17c4c89aa72a28e31268b67b598e7a
c2de4537cfc8a94f2c2759dba3f81ddae4677cf0
44042 F20101111_AAAPEZ cho_c_Page_003.jpg
bed8600c565733ae60c51b6fbb9d1a5a
ba73749723a1563e93cf059471be99607a6d7f12
5466 F20101111_AAAQJF cho_c_Page_118.QC.jpg
249067baf988b61df2be7248086dce46
40396de6c630876e05fad022a5b6bd522df37054
6749 F20101111_AAAQIR cho_c_Page_102thm.jpg
b0912a6e55e1fc7c8aa416ead5eb3e4e
54373ab0c16d8d9013c2375d6a0bf13c724ffd6e
63438 F20101111_AAAPGD cho_c_Page_034.jpg
1b1714ab9b6f9d2376591380f722bd05
68f6995e0210a2ab94bae02d7d8fe5c0029f56f9
75715 F20101111_AAAPFP cho_c_Page_019.jpg
5b7d5ef6cdbffda1cc9689e10f0ad2f6
9d69498a3faccec962121679eaa41a75ac809eda
9307 F20101111_AAAQJG cho_c_Page_119.QC.jpg
526e67b78a90283c9185e700b879553b
464f6a52a17c55fbeeb55ef6a9a3313863defe71
25427 F20101111_AAAQIS cho_c_Page_105.QC.jpg
e4de94a8eff40f08ba74d294d6dcee38
2080335a813d858d41c9f8d45bcc22513e267e24
56320 F20101111_AAAPFQ cho_c_Page_020.jpg
685f3a82ff668d1d74e6d9d9ab27e359
88b614884b062c3ab25b407075c37d7a50170109
77771 F20101111_AAAPGE cho_c_Page_035.jpg
3f5d6b6aefa38269dc93c82baed16bef
3751b8961b4d83ca3e015d6a153283b90fdb52e9
6795 F20101111_AAAQIT cho_c_Page_105thm.jpg
d478296c32bd2a184c3e4c0a7641088d
cc0d9da49243fc19eff9db598770c718ada6e16f
66217 F20101111_AAAPFR cho_c_Page_021.jpg
1888a99cba8e527522d1d50ee03dec27
052514a90bf1261022f717071827fca3b4f04e41
49609 F20101111_AAAPGF cho_c_Page_036.jpg
92d0862182a357edfa915692e377fa03
a918761a1927438f9662a38bc2653396f10b7d8e
19277 F20101111_AAAQIU cho_c_Page_106.QC.jpg
e5b9d2d2a7bf33d0c8e09a10c917db1b
0876247157e6cdc8756262163e8308ed7863febd
83861 F20101111_AAAPFS cho_c_Page_022.jpg
a506868101fe9b0a795985f65e031faf
a47d3e5ddfefe50e94abcab951e1b2297032ec6f
86752 F20101111_AAAPGG cho_c_Page_037.jpg
6728d5edf27ba71db45537938a8cb031
2a73ddd499ffd613e4784fffcdbfbe494f3cc0a4
5621 F20101111_AAAQIV cho_c_Page_106thm.jpg
34f33652a9aa8cfbec79e2f9b4fcb738
81e61d8822d457342aea55b33c4b1b2481c456ce
84677 F20101111_AAAPFT cho_c_Page_023.jpg
76a2e44f0abed3bd8214914b164126f0
49a659842ab80e9fa51b46f5707091e9385d6a9a
80727 F20101111_AAAPGH cho_c_Page_038.jpg
68b98f540ec68f82a48f51d982df5b52
a99f068b2ac857f7cdbe9e8085441a11cf8889f8
24142 F20101111_AAAQIW cho_c_Page_107.QC.jpg
f1ec3159f50cf6e8f4eed95cd11dbae6
5f2a2d8744e43d975dc9536d3b0a7230de1c318b
33531 F20101111_AAAPFU cho_c_Page_024.jpg
18a3d26d2c11596f6b9617cf0a2fe502
2f5bc7d8aef7d7e04c5d8e3ac4de5becaae07f9d
85947 F20101111_AAAPGI cho_c_Page_039.jpg
de1dee9f695fa95022ec90c50fa193a4
73891c65e9d60c9820b4ff3e2f0b865d8fa95043
4781 F20101111_AAAQIX cho_c_Page_108thm.jpg
d074f54df7857c3192fd5148e8f8ded3
9e1cbcbff03ccc941cb1b1f2f9620d8ecc26e654
82775 F20101111_AAAPFV cho_c_Page_025.jpg
164999d87e3acd7c066a1ae18ea21b09
ff8fe84a77cd97bbbb637f7b322dcf0acbc420b2
80426 F20101111_AAAPGJ cho_c_Page_040.jpg
63cd53f8b21b8ba315161629123cf8fb
47d577c86fde850f8b2156f296162ad7ccd43c8c
26046 F20101111_AAAQIY cho_c_Page_109.QC.jpg
cf3296df4fc6018ddfe7499314b18cf6
dc73880a090ebde7f0c22a9dec68a39b8cdd9be7
75945 F20101111_AAAPFW cho_c_Page_026.jpg
0cba6b251740b4ec154a82d560cb959b
b26fca39bcaf551e7b0c29bd5c47d22ff40d9471
65697 F20101111_AAAPGK cho_c_Page_041.jpg
eb75bdf9c450fc1c72efb8138984ad6b
b8119a1882155e0fbd6ba7fae43c31b8b072f9df
5077 F20101111_AAAQIZ cho_c_Page_112thm.jpg
1fb804c521fe025ad5f8da59b09e8c50
22a144c769c70426b29bc02cafd479c9087b7c72
81135 F20101111_AAAPFX cho_c_Page_027.jpg
7273e899374c3eb9511c2e1964c66f75
762fde30d3e99c119be127cc84ef7b93b7c67b51
89679 F20101111_AAAPGL cho_c_Page_042.jpg
21c8e4b799f2eca07e53bf59cad9c75e
c981f855643a6d62e1ea6027b96f0a7c72cba54d
66684 F20101111_AAAPFY cho_c_Page_028.jpg
3327acc8974c684dbdb2b1d75a3cf63f
93c82401cb1e87a62c7035ba4d122cdd9ecfed8f
68266 F20101111_AAAPHA cho_c_Page_058.jpg
29767afb9b943076b13264f84d62c7ab
5ed0645f2be066061f3561aba9787a6809ecab8e
67900 F20101111_AAAPGM cho_c_Page_043.jpg
8949fe2f49bb5c659d2f6457c5008511
ddff28285b0524e3344be334ce4d60fd6b82dc50
87007 F20101111_AAAPFZ cho_c_Page_029.jpg
14b1193dfa8568e683a6c165cc1a73c4
2c57aaf4afa29428914b5d0e32e28e5bd8cfdf96
84162 F20101111_AAAPHB cho_c_Page_059.jpg
238fa3ad1105573758dbf7da3f860144
d41b920a2bc3f6de87bcf9799af473838610c2bd
72854 F20101111_AAAPGN cho_c_Page_045.jpg
9486ca4ab10bd920814ac420fcaeab8e
d6f2ed55ea6dc6da8b75b91d42ebebb593e49cd4
81756 F20101111_AAAPHC cho_c_Page_060.jpg
b6fc78d37f4ff80b0c2e3de7b1469b62
5aa4a751ef20c5071dff043f61b42d38dcf8102a
80306 F20101111_AAAPGO cho_c_Page_046.jpg
dffbf3bbb1495b92b69fdfd7aa187b4f
bb83bd14d699705d19c305b4d06e077d6b65c54b
94764 F20101111_AAAPHD cho_c_Page_061.jpg
eb970eaec14ab9c3e1ed80ea00549c14
3f628f019da607b1fe229c96c84fff6c11218d2e
90584 F20101111_AAAPGP cho_c_Page_047.jpg
35a18fca4bf6d2296c2e94b0daad8dc2
bf3cc68c2254df2ca0782f9caaa0e7f28a0ab4ff
69923 F20101111_AAAPHE cho_c_Page_062.jpg
6152a1a6c64cec59d9da1de974ff13e9
3f39a952f6a01ca9d959ce9e53861d1fb9ce0147
87831 F20101111_AAAPGQ cho_c_Page_048.jpg
bd8aa22d338acc7cb384c68c97fb3d5e
a22c5d8f91827663d66097e35c60568d175227c8
85677 F20101111_AAAPHF cho_c_Page_063.jpg
097ba492f5af4e8805ba3b40dabb6c61
54f0ea6284be90f91e57425cd61a659146bf3348
81945 F20101111_AAAPGR cho_c_Page_049.jpg
755651bf83103ebd1638c6c6a2dc16e5
2351d65286adca9094be6224f683ef7e359862ae
63048 F20101111_AAAPHG cho_c_Page_064.jpg
d9a9d2842de1c754031bb6f59ff2d5d5
13bb2ceb5d9266be6d929d3771eb3019bf484865
74575 F20101111_AAAPGS cho_c_Page_050.jpg
e136a33639d2bc69bbea783cbf2b8a00
baac1145a1d170a28011fa7ac946d8c7eb1b9d03
76766 F20101111_AAAPHH cho_c_Page_065.jpg
7a61775ccb2b6b20d451f12066638784
6b246a3eff2663768a9c9086cd7de7b30dec4d21
62634 F20101111_AAAPGT cho_c_Page_051.jpg
bb431ef6756a619a46db9f77403328ca
388fb67dfe04cf5b29aa4d9f1bb93e2dc1c0e70f
87383 F20101111_AAAPHI cho_c_Page_066.jpg
7a5d44074c013c41bd7a552a6abdf625
5864a1edf4837e9646645aea06ac191782b9c40a
92369 F20101111_AAAPGU cho_c_Page_052.jpg
7aa0af923c26ced027fd3ca679067496
ec598d1f9ab50b950be3e0b0bbb3a1f9469a22b9
70454 F20101111_AAAPHJ cho_c_Page_067.jpg
93559084e7cc0305cfa95bb748498b9f
7b5dc366d06ab0ababc730f56ecafc642e7dd38f
79548 F20101111_AAAPGV cho_c_Page_053.jpg
48c46cfcf406684e22bd355df3be1fe4
edb340b1f776e12ff20742016c5b37f8e9e63d28
72955 F20101111_AAAPHK cho_c_Page_068.jpg
9f52c86926e61db781ebaa75f2e1f4be
cefd9aa35f1995ca849d8f39b1835ee4651b61b0
72380 F20101111_AAAPGW cho_c_Page_054.jpg
a002247aa2adabe6c2b97225ed509858
b2f29c073e5c2ebb2a96a59ae94de0976f88dd75
71737 F20101111_AAAPHL cho_c_Page_069.jpg
d72792cbaa957495fb316463ce88bb29
1cee11ee89fc610699268f6db74ea595ef87ad55
78345 F20101111_AAAPGX cho_c_Page_055.jpg
94f5eef5bb1067aa07d508763d797ad5
ca683ddc452bdf8ca11e1854b99a67003ce583df
98891 F20101111_AAAPIA cho_c_Page_084.jpg
bf22ee1342d25098216be11e657f39e9
040c0671299698cc2761593596fcd8bab5264bc1
F20101111_AAAPHM cho_c_Page_070.jpg
9bf4bbc23f4108ba94bcca872bf86930
745ea1479459292274293b9dbd554c843d4701db
76577 F20101111_AAAPGY cho_c_Page_056.jpg
8cb6855e6664c46814da046a799e95cf
06d95cd105d67cfae9014b1b32340d3064ff4ba1
73042 F20101111_AAAPIB cho_c_Page_085.jpg
d7064105188f535e3bef17c8c608b4d4
96c9ea79e5f1111193792afa7967eae43316cb20
78597 F20101111_AAAPHN cho_c_Page_071.jpg
36f4a47faaec24cb81457c1aa3e179c9
b702e7952976d3b475c54da31bfc32de4b229e3e
86816 F20101111_AAAPGZ cho_c_Page_057.jpg
75aaf06494bb860745df2246baea0a25
10008dc4b62d8784ccbacdf4b9540b97214adefd
69707 F20101111_AAAPIC cho_c_Page_086.jpg
4fa573fde6701664dfa0ac0325462215
4bc77fe8400f196a7f52353a6e2236bf9abda9d3
80158 F20101111_AAAPHO cho_c_Page_072.jpg
fb9f15d8dacb658c83d1d872e3130936
cf507c163df526cf84ba082b339c17b3de357588
72224 F20101111_AAAPID cho_c_Page_087.jpg
931cbacec86d0addd16a4a4cd18252e8
fa44f27bf937fd3b8e4e2b7d7b180298da2f4e55
32496 F20101111_AAAPHP cho_c_Page_073.jpg
6aa2cef9a1729dc6af4e1716fde6e17c
65488c994f6201c787f8a915043b3f21f7566229
73906 F20101111_AAAPIE cho_c_Page_088.jpg
c5ffb37b2186e367bfe67c2ff0ad962f
b702e3be179fcac5efa3487c0cefa9215d8ab6cc
88635 F20101111_AAAPHQ cho_c_Page_074.jpg
5874f98b48ea97521db4c019328bdd29
392aad09dfd58ab13684883625eed176754ae1ed
75297 F20101111_AAAPIF cho_c_Page_090.jpg
c9cbabf668cc955e8b1db09ce6de7ae9
c11790cda232ef4bd32b4421390a101aa793ce8f
85675 F20101111_AAAPHR cho_c_Page_075.jpg
a3212d6c4f0ac3031e432ce526b6355b
179e6606806a5d8c3fbe7f04775ea3f2a1024283
92880 F20101111_AAAPIG cho_c_Page_091.jpg
20bcef5ed2846d4e4837f1ed20f69c27
4cc52a1de4e77617b24ea617ae3850a432bfccd7
89265 F20101111_AAAPHS cho_c_Page_076.jpg
8df6a5ca3beb688a3bbc098e3ad51428
4de0220469985e563bbe8477bf2e1a2f4446f70b
76004 F20101111_AAAPIH cho_c_Page_092.jpg
5e9964fbbc20fd5dd3152877e09b3927
5ea92e202fd310f210bcdb1eacf14671d0714eac
82152 F20101111_AAAPHT cho_c_Page_077.jpg
222e09e446fab031278a946998ca1968
feee5414cd8fb59d4a67e613d3dff46985cc0e14
85287 F20101111_AAAPII cho_c_Page_094.jpg
7e2798a41ce8d34b479ea283b77f9a82
7f10273585b501be269dd750516f06dad0d7190a
79532 F20101111_AAAPHU cho_c_Page_078.jpg
3f4cdbccbad4fb505e7d6711ec6f93a8
5afdf6b0fe0b42defc14c4a80f837a636f237123
77395 F20101111_AAAPIJ cho_c_Page_095.jpg
aeecc4d33609b8789863af1ebf85f3c6
0b814e92d6c6652f8e34d9ad8d0e6a8cf5505040
65088 F20101111_AAAPHV cho_c_Page_079.jpg
d02aae8e11bed97fd87bb2f819854164
1f171bc51c38d6e405dce28efac75db1a3822e45
74744 F20101111_AAAPIK cho_c_Page_097.jpg
22af0adaf83685ce59ece2d5b96f8d34
2d217cb907f33a4624df8b7564251e87b2e7b2f3
70929 F20101111_AAAPHW cho_c_Page_080.jpg
691c06843351504769eb2604abc9acf9
bde4d08834f0bfe1102d53aee2fb10d6eb5f2014
75769 F20101111_AAAPIL cho_c_Page_098.jpg
cadaf63c1a160e7c65e6496ffb202544
725cfa4c44d36805bc84bb819fe6590567779d04
85692 F20101111_AAAPHX cho_c_Page_081.jpg
d4f54a61edc006b88ae6700550d31196
f42415311a35dec625921e736f3cb1a178779e4d
70390 F20101111_AAAPIM cho_c_Page_099.jpg
15dafdbcf6f2a92328bbd2186ff7d589
8854c534a8b09a44906e069051edf5a92fdc5320
67764 F20101111_AAAPHY cho_c_Page_082.jpg
0eeecb4668256f4bcf4473290b9dd436
d2b27390a181e2b2f397322d6ff72b060b9731ea
93866 F20101111_AAAPJA cho_c_Page_113.jpg
6fef58e4dc6aac37ef125f9ce8896612
493fab52d1915a1d67a1d77859ad6cfc525aaae3
77555 F20101111_AAAPIN cho_c_Page_100.jpg
0584d0a8397bfa332e7400f956cdddfd
2f908d774496b43efd3c0eef1d509e30d6df357f
77641 F20101111_AAAPHZ cho_c_Page_083.jpg
fec38b4d46ca60d7e979180ba4c63e7c
31c10099eec0ab211fa4b9c80fab26053fa0902e
104591 F20101111_AAAPJB cho_c_Page_114.jpg
f8b53fabc0f5831d42c78b4a853d4313
7be943eaf8704d257451b8300af9eba6d2e4375c
71483 F20101111_AAAPIO cho_c_Page_101.jpg
29600bc6702f0726df8425a79330e30f
94f312e0dd888e7dbe18abb3c529dee563d126c0
100333 F20101111_AAAPJC cho_c_Page_115.jpg
c5d7d52a15aa92800eebe957d7e2a66a
1a36ff164d5f099bdbdc9bcc09b31b786c56b597
79239 F20101111_AAAPIP cho_c_Page_102.jpg
f7f58bd9e1329a96199d9974f738daae
905e5f9859f9936ffcc9c868fa4ee654adb22bfd
29081 F20101111_AAAPJD cho_c_Page_119.jpg
f10e7f3e8eb48f9b97192ee0ee99a36a
8d3b75fbab980aeb2186e8a8d5dd120e23f0b1ee
83995 F20101111_AAAPIQ cho_c_Page_103.jpg
0e550764ff76201ded5a90d6e1333633
ef2f3516294796089855b8da1309a483d5681f1a
288706 F20101111_AAAPJE cho_c_Page_001.jp2
83a8c7fdc8767f5da5cb36216a7f19bb
8f05dcea74c3538b88e09669c0c26a6955eb3915
73151 F20101111_AAAPIR cho_c_Page_104.jpg
634b7777c0f1e236a85b706e495927d5
e3ff3ae70444ca44494f9646220501c144db0151
26063 F20101111_AAAPJF cho_c_Page_002.jp2
53df46f31d265232b9804e525ee2c82e
be263ca610af847d45a949e8f011da7a208e2896
78907 F20101111_AAAPIS cho_c_Page_105.jpg
321e43067535ef8077d03bb9f1252786
2b8056d32550847eeaff5aa6a80cb103c2351863
1051979 F20101111_AAAPJG cho_c_Page_004.jp2
809fdf88332b30dff8d7de8a5219bc83
b29268b638e2a4fce44f28d2cf8d45a9444d043c
61171 F20101111_AAAPIT cho_c_Page_106.jpg
73d778fec23cfcac660b442d5fc43999
88af72df207b1242897795be2c79bb7d03062d73
980045 F20101111_AAAPJH cho_c_Page_005.jp2
55a74c5e0ae31f0bf3b2d6a5d29bd28b
41b573f2b33b0de29fcebab5401ebe1cc8650b84
76643 F20101111_AAAPIU cho_c_Page_107.jpg
c1a49fdac22b62c7b9bfa8982f8101da
81075a357f6c4e6b32712c4f2f2d27f5fe81dbf5
1051971 F20101111_AAAPJI cho_c_Page_006.jp2
e2659f3a216d64a927f8cf6b5d53fc49
0b5935e4f585f81d54a7db4535fa290d6254f70f
47960 F20101111_AAAPIV cho_c_Page_108.jpg
e21dca6a8ea51b9c02db6d94ba11c104
318726a7e75424bdcc7c951bf9d31a27ff1faa5e
1051976 F20101111_AAAPJJ cho_c_Page_007.jp2
4df5c2b99119a5e4726549daa2ac6549
d7dc29244089c1813a629a5ed11028e50ed0943e
83292 F20101111_AAAPIW cho_c_Page_109.jpg
14f1154e88372977450095715c192a9e
fc8f37acd492843966d0f20fc9d45469de3941a8
1051972 F20101111_AAAPJK cho_c_Page_008.jp2
9887fc8fc3a528e96cd1ee5b18735e72
e07d10f348b2f4bb1b57103e9d0129edd4c48179
83360 F20101111_AAAPIX cho_c_Page_110.jpg
b62b66ea987442638dbc0a3fd85b74c7
0bf2c80803ce01f5ba6f2925fc0c78e1c97cf6b1
652183 F20101111_AAAPJL cho_c_Page_011.jp2
90df050cb60f1f41f17700a1b9bca122
12f056f8f82008c89a516a6208c4eaa0e685b503
87578 F20101111_AAAPIY cho_c_Page_111.jpg
2ccfc3cefe3652870b26b9042561ea97
6a434614569797285ffe2feaae6c689bf5118d60
1025948 F20101111_AAAPKA cho_c_Page_026.jp2
11505e8685d55e4822070f090ef03214
d0e99c320c7583a10aa96f34bd17ac9aa421b02e
1051974 F20101111_AAAPJM cho_c_Page_012.jp2
fb64f3f69637f7fe8f3792a02573bda8
e2e69fbb41e70fc75adc1e3ce3dcb918ce8d7d41
56672 F20101111_AAAPIZ cho_c_Page_112.jpg
02ebf7ac32562adbd3564508220110cd
fb72fff85aa0d4860cd773eb86058866599c27bb
1051954 F20101111_AAAPKB cho_c_Page_027.jp2
7b48464e266354e57a856e75ab7a17cd
d3f052cb75a7c5c3f6e43e4b0e4717ae3450e6b8
1051938 F20101111_AAAPJN cho_c_Page_013.jp2
a868ddcd0b4348019160d0e59bd4a39c
ddd6b5952b4f4bafe95edf444e24b8fe292f8f72
1051985 F20101111_AAAPKC cho_c_Page_028.jp2
8f6485110bb53f71a8c68985f4f56f79
77c7f448e4d2376561c32b05b113e3508a6c489b
F20101111_AAAPJO cho_c_Page_014.jp2
dee2bb2c8853df56753f23a1337b0ead
f6236d1612a2056d129c8d0eb326daff5c2af7c1
1051917 F20101111_AAAPKD cho_c_Page_029.jp2
c2c6014ffa1866ae90da4708e6b89c24
7f562df7792313f2657890b8d293113bbe4423c1
101357 F20101111_AAAPJP cho_c_Page_015.jp2
374562623b9b303d010b896898048b7a
d68cf82c612bee178e7959fc1acf89f4162222a4
1051881 F20101111_AAAPKE cho_c_Page_030.jp2
d9980f5edce870a54f83661af33e18ee
9f870ad235ec7a8ec68df5a0bf6ad9a0337c6ed4
1012692 F20101111_AAAPJQ cho_c_Page_016.jp2
fd57407a6c603e427089e608966bd9fc
748fdde4c7259eca83cb0212dc88dff151a778e6
994253 F20101111_AAAPKF cho_c_Page_032.jp2
881a6c5d5145f9bcf606805d14f4600d
2fd0ab9f8525dc1a0795c4ef5e0c194865e9a0b7
1051963 F20101111_AAAPJR cho_c_Page_017.jp2
67af3071735dc3f4b21ce970f4a52c04
58bc8d708c1e5b13d623a7fee4ee9f7e8d8a4196
866956 F20101111_AAAPKG cho_c_Page_034.jp2
1e3a81f62a1779ceae8004e7972590de
7e59e97d0486f56471bcfb9ca91c8226183fc5f6
983764 F20101111_AAAPJS cho_c_Page_018.jp2
021828308d79f4049b22131e8a149108
ff4a81c106448b6bcab17feb160d959902203e26
1051903 F20101111_AAAPKH cho_c_Page_035.jp2
49ba9618d4a01196dea32f02c355f8bc
140e27228c18bf392f75245828bd8ca7ddb9c638
1051982 F20101111_AAAPJT cho_c_Page_019.jp2
5ea667cd41df3fecea284e6c87e90f75
0252df448fedef02f6e075f4a85771d095fda10e
604651 F20101111_AAAPKI cho_c_Page_036.jp2
445fa957aba760ababe87fa76dec650b
c8672261e9cd8126af875ead354a21e514db0f53
757189 F20101111_AAAPJU cho_c_Page_020.jp2
0e9ca8555d4536dfea3853e97c76718d
511fb88f8d00173775e719fd24ee118c61aa3011
1051959 F20101111_AAAPKJ cho_c_Page_037.jp2
bcfd836afcd174c78e28c82c3975646a
2007943d9d560707c1b838edbc55dd260edd2178
F20101111_AAAPJV cho_c_Page_021.jp2
3b904825dcf46fa2dcc95ac375062a93
c0418d47acbab6ea2ad2d83fb1ab34be178c36a5
1051935 F20101111_AAAPKK cho_c_Page_038.jp2
820100a7050c1dea9297cc24c92a94a1
9fb56d6bf37e9e31b9a2f92c40963f8e15f734dd
1051900 F20101111_AAAPJW cho_c_Page_022.jp2
8f85baf95ae093ba86c9ff23f470b639
b28c0a2353ec3f5eff9d9ee79da281dc385c9818
1051975 F20101111_AAAPKL cho_c_Page_039.jp2
c1c039a5fc1666d6ef35a258944228cf
77ae9925950d7a059dd95332640a4cebe8529963
F20101111_AAAPJX cho_c_Page_023.jp2
506623c3289a358185634abdafe24cb3
e023253e55c7ca10b9b65a62a7de62ab5693f4ba
1030917 F20101111_AAAPLA cho_c_Page_054.jp2
40003f7ba0ab9db544737ac32e7fb6d9
107bf13aadf1e280307110795c9b26fbaca99bfe
1051936 F20101111_AAAPKM cho_c_Page_040.jp2
d3300fe8df7893568b0d8d756be4319b
03248e270e6cb7d1a97bba7f320f8f4cc09506ea
514433 F20101111_AAAPJY cho_c_Page_024.jp2
b86225193412b80c3540a8f09abe05d8
6518f53505e243a8f2471bf8923c03c572a49d94
98719 F20101111_AAAPKN cho_c_Page_041.jp2
6fa510fb9fcdd1e5c64e4167c9443d3d
0d00d96f97c5032f690787e96fb1616b855e1ec6
1051944 F20101111_AAAPJZ cho_c_Page_025.jp2
90efc78bfc3b8f6d3baed50bf7044a8c
ee770bdd75b60e09395fe4a406ab1c51b5bb7d31
1051965 F20101111_AAAPLB cho_c_Page_055.jp2
f4eba8ce25144a76c6c3895906342c11
9dec04c0448364f56979bab87e1fd19679ac2ba0
F20101111_AAAPKO cho_c_Page_042.jp2
40d87fb2317fad06e0b3959bbc44c7bf
ba464f17956e5a200eaf0f18ec96ab33f2380184
1047128 F20101111_AAAPLC cho_c_Page_056.jp2
82b0446d99e43141e9ed018c791172a0
ed43709ee014bb58a64f0a32c5feb44f44108105
103811 F20101111_AAAPKP cho_c_Page_043.jp2
d08c3dc3012860abf4ad4bfdcf6e9464
219e8caf25b433b73d8057eb7ae2f2157b79c3a0
F20101111_AAAPLD cho_c_Page_057.jp2
38ba505ba07a15a4b9e8fc4c7ef76ba0
616d29d87aedde9ebf74961330c1bae88404c76e
1051981 F20101111_AAAPKQ cho_c_Page_044.jp2
52342aad8d87cdb8398f96ad89fd357c
9175be3a486c4c5765a5dd8be64863dacee2f015
102093 F20101111_AAAPLE cho_c_Page_058.jp2
a904c2aa84b372d4f00bab988a4929b5
05a380b2311eef99cf0fa9d7e676b3ccf0676ad7
1041685 F20101111_AAAPKR cho_c_Page_045.jp2
6ac049bb9f33b65da89ef6f7abcfe23b
b38823fae14dc036b17ade2599d9df896dfeab3e
1051967 F20101111_AAAPLF cho_c_Page_059.jp2
b2d4bb3e815bd220799571fc019a1713
6ed8c4268e2a3eaf832408190a8244217d7d74d4
F20101111_AAAPKS cho_c_Page_046.jp2
211a8661097b6cd232518a58c12084b2
6eab10c39700fc18f16ffdbf5f044442faf1df93
1051960 F20101111_AAAPLG cho_c_Page_060.jp2
b5b4ec51e33623f827b1db5a2466d078
9121610b6cb0d103357c9133b566f04c8a9b111f
132909 F20101111_AAAPKT cho_c_Page_047.jp2
d729343a96289974a17cd77326ab7b32
25aa63faa3818e16eaa4b48d73b16a5336750374
F20101111_AAAPLH cho_c_Page_061.jp2
295d53efd6b2dcac48ef5fceac09b522
a01a5b748302d24ddc91808587c4578eaa7f7039
1051927 F20101111_AAAPKU cho_c_Page_048.jp2
65aa155caebcd83a17c2eb0291ccfaf6
e2d32b18ba7698602e4ab6ef3e4190a0affe796e
1051986 F20101111_AAAPLI cho_c_Page_062.jp2
e456906d01b72731ae52c0c8c1d7d3aa
aa962f9c9da30307a88f45395508b1336e9e6df6
F20101111_AAAPKV cho_c_Page_049.jp2
d6626e7f5bda1a053300eaccc5a6e42d
1292c73cc0dbd2a8bb9d92a8161918689ba81138
1051950 F20101111_AAAPLJ cho_c_Page_063.jp2
49f208f6e11e8570101981e751326b50
2a814580a5f8b3634a1e8ed64fb9db1800b77f20
989825 F20101111_AAAPKW cho_c_Page_050.jp2
e58557ba28e3a3f9c075b4730622872b
5faebd9380659ca0c4f335b20694c0df3d3cd28e
876225 F20101111_AAAPLK cho_c_Page_064.jp2
eb375048a2f142aecd2eab6c00176358
086dd102b3c9687756e97cf74359035af6310d1d
845398 F20101111_AAAPKX cho_c_Page_051.jp2
8a9ec44962ca8c1269e38fb06f2dd1b5
ce1a75238a60f976417e70463971b85161404f70
960892 F20101111_AAAPMA cho_c_Page_080.jp2
f1d34a319557d08773bfc61be5792c96
99c78d6f145ca72025e4e99a23431ca92e8c2323
F20101111_AAAPLL cho_c_Page_065.jp2
a08071eb2d6bebd7e84a72f95c66648e
fd22f58fff0d84bd16136ab3fde45190089657b3
1051978 F20101111_AAAPKY cho_c_Page_052.jp2
20ea72872566f0403d1b080b4f59187b
4bd2b46b3c722c713bdde28375012e9d74995e2c
F20101111_AAAPLM cho_c_Page_066.jp2
ad669d193bd05774e445207e1dbe6411
ff021240f3f19f1623d5cca144703a1c0d5d56d5
F20101111_AAAPKZ cho_c_Page_053.jp2
a4862382e5fc139a5ec682a4ec4afe76
389ea4b955958a61b26bcd8914aec314871dbaf3
1051973 F20101111_AAAPMB cho_c_Page_081.jp2
50fef9cef92ab4c16b6b72a8d25e6516
201eb0c0339330c91ef63511b1a7df30211d989c
978707 F20101111_AAAPLN cho_c_Page_067.jp2
8a1b224e3b67664e449a28336c839a98
2fbb7a83431dcb0f7fd418d533c29f8a7c9bb2f9
945022 F20101111_AAAPMC cho_c_Page_082.jp2
9a38e71f2ba2724cf443ccfa30e512d3
d595b9f0ef80ac2aba13ec5839ced61faa2acb3f
1051946 F20101111_AAAPLO cho_c_Page_068.jp2
482a0d61a53a7a5ba55c86436c52dbe2
257dd3845f1a6dd3b58aff9d146213675ab6f2ae
1047237 F20101111_AAAPMD cho_c_Page_083.jp2
35c840d4fcf206d40c20cc9bbc5c7c4e
4be8555a5b5a7220f1681707918d8d9d0c549562
1016243 F20101111_AAAPLP cho_c_Page_069.jp2
5bfe23f5fff1e8e07cdb03b72a09b419
a9b13f7d90d1264ab0bb676074249ff6eb26a681
F20101111_AAAPME cho_c_Page_084.jp2
af6b554fe7f2fcd29cd9ab94b91fa20d
4fcb7a69d73372974f703c4f8f64fb2a8fa9dfa0
943379 F20101111_AAAPLQ cho_c_Page_070.jp2
88998d90a54f6aa1491066d782fa198b
dcd38a22c9d4f7d1793482772ba60a35a28ca27d
999975 F20101111_AAAPMF cho_c_Page_085.jp2
b102734ac8da305e9214a30a33cd4df1
01889b849b6915fa75faed246f3d907683a19ab0
1051941 F20101111_AAAPLR cho_c_Page_071.jp2
ad838941efa5ff998ca54e8952e3c277
39cb6048e8008e2953e2e115e28db0aca6f09c87
950761 F20101111_AAAPMG cho_c_Page_086.jp2
a8fc4baaf4ad4511a8410d421d306983
51d8806e933e9c084da25a77adf3abe3d9cef730
F20101111_AAAPLS cho_c_Page_072.jp2
485a581aa08e6608feaf8618a7bd2264
80e62209e8b4f41d51ce8400596994cf8768e98c
957294 F20101111_AAAPMH cho_c_Page_087.jp2
2f5f22f650feb75e48e3975452bd3c23
516a2e15862334118ce23f93f1f4e3448b2c6b77
364942 F20101111_AAAPLT cho_c_Page_073.jp2
fcab4d2df181170d0e5945ec768c2143
cea55a57e2eea92f572fca1a81e7dc91c19a8b6a
1051945 F20101111_AAAPMI cho_c_Page_088.jp2
ca73a3c58246ec24815734e89f509557
ccd5068486a6324dd439132072a01bbd2ef40e38
1051984 F20101111_AAAPLU cho_c_Page_074.jp2
25149c3d938e12195faa3801e451d38c
9477045447c653be40cd4d12fad7abe61de919b9
F20101111_AAAPMJ cho_c_Page_089.jp2
851901d049da111faee162790cc7a3a3
7f8c3dca84efa978afee2ca70fab7f1bb7cdab76
F20101111_AAAPLV cho_c_Page_075.jp2
8082c00d12e3a3dd41909ccdadeddf27
cb4bbd5f2dedc9f862b6f04b51cb8c9da24bdc27
F20101111_AAAPMK cho_c_Page_090.jp2
87e017ab5349ad34a204793f2c5e019d
b1fb168baec5584dbb5e02b1f5afc83bb3cc9dd6
1051955 F20101111_AAAPLW cho_c_Page_076.jp2
8182d13210d3870a4df0ec73d2acb772
1356ca20efc76dc40d34f744b474e43de4e9bf6e
F20101111_AAAPNA cho_c_Page_107.jp2
8c753d2c77c401f5d4b17f049ae35b0c
f29e0ecdf2d8b78336420b5bea17a6959b7198c2
1051931 F20101111_AAAPML cho_c_Page_091.jp2
e8642a46474adaa0e418b693377ebec6
8f0df07b363142bdb38bf322e3f916f44fd1a86d
F20101111_AAAPLX cho_c_Page_077.jp2
4b65b66ca0e4e2def5707cad46065ea2
3ab1b80750a2455325bf653858af00db0cd8c413
647841 F20101111_AAAPNB cho_c_Page_108.jp2
8118fd7205bd76aa87abc3e68be9fdc2
d2a5bff7a112bf12c89ce77559af5674d0567d12
1051964 F20101111_AAAPMM cho_c_Page_092.jp2
44d1c79ca131af192042b12fafec3924
c820f18b7f7e99dbca5c1cab35b370bd0a288ded
F20101111_AAAPLY cho_c_Page_078.jp2
94cb893c6054f3c97c8c5e1c6ea56f06
24a79c3169c3fd87343457dfd6345f533ed3bff4
1005150 F20101111_AAAPMN cho_c_Page_093.jp2
cb20ee47b68cda4fde31726d969b8095
a06124b22007f3c4b226078d80ce1fb747475514
1036328 F20101111_AAAPLZ cho_c_Page_079.jp2
3aa390760f80735a74664504e180536f
05885a003e3cf2756247b004436bbbf2ebba69a9
F20101111_AAAPNC cho_c_Page_109.jp2
9c932225f76bba00e2c041c69e1e171f
e5b26bf6e8df57f0df1e9242f6e8197d34121105
F20101111_AAAPMO cho_c_Page_094.jp2
94a8c161c4de6f029df4a44cc338e9e2
0c53539df9ab353c5bc23f4a7f90c8fe173499ae
1051969 F20101111_AAAPND cho_c_Page_110.jp2
a06da5a96820a9960329b2554ed1f8fa
6ac483b61739659d06a6a37eeb55f4baadc6b3a5
F20101111_AAAPMP cho_c_Page_095.jp2
7808a431fe5923123bd3880cda51a3994672fd26
2160 F20101111_AAAPXE cho_c_Page_027.txt
6f9273b1336373b5d6119c7c64dfad2c
47b0a3c42ada722d590205cb2cc73b99496070a6
4443 F20101111_AAAQBV cho_c_Page_009thm.jpg
0e53de3b3c59b945ca42b3197dfcfaf0
9da97d75bdf603d4d1cffbe6f319cf23a98d43ae
2117 F20101111_AAAPWP cho_c_Page_010.txt
1f94308ceebf2e477d5d56fb656acfee
d51167ab0a35ce4e1b5abc6b9235d8677b033bbc
20250 F20101111_AAAQCK cho_c_Page_079.QC.jpg
b5602999f2422a3827983d6bb283ce7a
04f1c3c61ac32775ea06a1f92aa19aa1e666d02b
1783 F20101111_AAAPXF cho_c_Page_028.txt
693e81dbd69767e88cc35b170d32f1f8
0f2e8d56394396fdd77d05ff2887647097c0b3d0
23681 F20101111_AAAQBW cho_c_Page_030.QC.jpg
8763cb481c544d585e171c1eb1b8e251
cbda1924bc0013e256b6f7e247955a278a3e1fd2
1122 F20101111_AAAPWQ cho_c_Page_011.txt
17332538ee6655d99d51877b30425a35
a06d2ccffb01cc962b225226928f56bd0712cb7f
7164 F20101111_AAAQCL cho_c_Page_110thm.jpg
1c80acf79c026d9c37284be760659762
daebf6b1f778aeb174fa2df5f3505edb4e3de58c
2168 F20101111_AAAPXG cho_c_Page_029.txt
c34794d82604ea2bcf232d1e6615312a
1a9c060823cc4429c06f98d0e303ccbde9a110f8
7196 F20101111_AAAQBX cho_c_Page_031thm.jpg
7cbe21c2f141caedb19c1548a638f990
b6908fc6371a5a5b46c827a6224d5a4aaf4a33e3
F20101111_AAAPWR cho_c_Page_012.txt
04f02f8bd5c7ad12c7113065cfb4cb12
5e9f5f4098dc02982c1c1b1c1f117edfec70e96c
8026 F20101111_AAAQDA cho_c_Page_047thm.jpg
1a9c4652eae8abbdb3f12b9d82c9f732
cda4ad153071ef05435897efa30a9037d3028bfe
28085 F20101111_AAAQCM cho_c_Page_061.QC.jpg
5de75d19b3a582d7f2d8b70727e47647
fcc7f6ca5116784c2e442258f2009f3ccdc738f2
1778 F20101111_AAAPXH cho_c_Page_031.txt
92fbd1aa4b3d510c04dd6f7a199f1c87
76c711a700b0d83d337811d3221a0c5bfd057225
6311 F20101111_AAAQBY cho_c_Page_030thm.jpg
2fc0ac4dadaafc3d33968197dab3b736
bf923575c6d5f11f05cc2ace43be0d9a1cd72afe
2177 F20101111_AAAPWS cho_c_Page_013.txt
b19c0fd0c470e237cdafc1c28d247d8e
8dd8ab79edc2a69de1d8beb9aa31ce88da2b4c53
1886 F20101111_AAAQDB cho_c_Page_118thm.jpg
577701f80164645fa6ff6030fddb7e3f
767090905b16639283001b1449a16796ca061bff
28773 F20101111_AAAQCN cho_c_Page_047.QC.jpg
bf4ad0a52a0feb972a6b9a144619c6f1
1720719acda6dcb2f1dd303d84301ab4579b6f4f
F20101111_AAAPXI cho_c_Page_032.txt
f652b37fb33bd6eaeafca521362047a3
4b9d69c5c1757e64f352e663a38626e1e7621505
6767 F20101111_AAAQBZ cho_c_Page_056thm.jpg
668ba33ea2af12f75702ac64d93afcf2
107fdac54da9f346fcd24861073adb115f4e084d
2137 F20101111_AAAPWT cho_c_Page_014.txt
12ba5471964ea063a8e2729bce57b172
650ecc309381b5b163e2677dba10820621d3cf61
17133 F20101111_AAAQDC cho_c_Page_009.QC.jpg
5fb609c0c654ea832d9caddfc13779d3
fa0fbd5693208edb0ac72b81542a6b7b01cb0982
6521 F20101111_AAAQCO cho_c_Page_100thm.jpg
99eeec600cc4f26fa816aec8d918350a
b82a4572593d34b33bc5009ad02ab1ab5a2ab188
1922 F20101111_AAAPXJ cho_c_Page_033.txt
9bada3039c8b4256fab2861ae3d51222
95ab50b18b0f7005d86a820f3e99d7de2308bae4
1977 F20101111_AAAPWU cho_c_Page_016.txt
e02806c326cbc78cd8e09d611ada2a80
f6c69681c1d596447a77e0c17e5ad64f695b49c9
21311 F20101111_AAAQDD cho_c_Page_041.QC.jpg
b40aa2cdb03b3c5ebdb0b3bbdca65a5d
0af94668d149533cbd0c59965729f49661ebdfba
6529 F20101111_AAAQCP cho_c_Page_104thm.jpg
f5d70376fd8bbc9dae2b87c44b85b961
dfab0d63861688dd54d27469610d1810857c3dd4
1536 F20101111_AAAPXK cho_c_Page_034.txt
2b21c9bbfd9baa7c7fb9e7ee4045f88c
f8ae90b3943ceea5f3091268ecbf41ebf517c172
2149 F20101111_AAAPWV cho_c_Page_017.txt
540e52a7a4d68301306af969c7beb787
f1913b2d5bacf3ef2a9962e2bf50d9050ee2b927
25637 F20101111_AAAQDE cho_c_Page_110.QC.jpg
9e8b2dd5a3f895c37eecfa45232e6f43
ebdaf275825d63efbd3f3ff7c5b8ccd7d02a4ecf
7454 F20101111_AAAQCQ cho_c_Page_075thm.jpg
1b387240d01ac0f48c3ae46a3abe9595
181987b330961acac0a7f446ce574ba1abea6640
2109 F20101111_AAAPXL cho_c_Page_035.txt
8195d4c8465e25bd5d55b5b1b5f9fde3
a02713f2c028635511ef4db63b6c56badd0d95cd
2028 F20101111_AAAPWW cho_c_Page_018.txt
d52474e964339cd241c1876de8b9b6de
71730cb7ef42dae2abbe0967b943b0b793597d63
6879 F20101111_AAAQDF cho_c_Page_113thm.jpg
efb37452869dad0587e010f34fa3002d
03ff460b317cd3a0786a3130ebbb4d91c90098b7
25013 F20101111_AAAQCR cho_c_Page_035.QC.jpg
229fe35f891908c155d07e3a31111ea1
98e7f97ec2ee1461c93cdfc23bb528b7c58358b6
2202 F20101111_AAAPWX cho_c_Page_019.txt
ac47d742d51cb8f9bf22ab95165101f6
162174d8908f1dbdb0e65710291c424ec3c2b330
6309 F20101111_AAAQDG cho_c_Page_054thm.jpg
00301902f4edbc6d778346cbd8db18bf
885c9e6aa19a2889a3b5c2e7bf2c98bc1c7431ad
1277 F20101111_AAAPYA cho_c_Page_051.txt
136bb1e6ee327f241b0d5101f55e2788
535ab0e5cfa02d6fee5477f6cf49d2bcba43b0af
26211 F20101111_AAAQCS cho_c_Page_055.QC.jpg
4b2be044a57a21f44a849a56048e95ab
4325f94877e2d60c6d8edb50c3beb2c7289272f4
814 F20101111_AAAPXM cho_c_Page_036.txt
44525987109b579ea988016be5c3c515
451898260cea2930228bdde9d5ce23e6927778df
2800 F20101111_AAAPWY cho_c_Page_021.txt
cabdd06db7f0d8eb48231e1bd4fc566d
ba52c52df4ef55947cb4d194d36854b0c01fd115
23797 F20101111_AAAQDH cho_c_Page_056.QC.jpg
15622ebe8e8c89fe5469f9ae42ce0039
3eb45abf3b63de9c8fa22c63168da16f720145ee
2078 F20101111_AAAPYB cho_c_Page_052.txt
620c188eb11f91ecede72eaf252a2684
c1d826a25e519a724ef4423943d6eb82d5bdbcea
27581 F20101111_AAAQCT cho_c_Page_081.QC.jpg
896046472d2f01b8dc754ba8be1463cf
fa062285c451500d707203f7c2548500204d7ed4
2181 F20101111_AAAPXN cho_c_Page_037.txt
b50c78661a3819608848c28029baf068
f0e4fda66c667452281a8b97895b20d87c5e1503
2433 F20101111_AAAPWZ cho_c_Page_022.txt
500139a94dec96a8b9f8ced532c01447
6ee3f914f60ca1ab57efbf92b666747db18a7251
22648 F20101111_AAAQDI cho_c_Page_104.QC.jpg
7ed1cfa693534e31d6dbc4bb0876c0ea
84c010f1697ac8359d308b88f302864a49f4489f
1948 F20101111_AAAPYC cho_c_Page_053.txt
78528d7e4209dcbab4460d86bef5fa9f
d37215e2b720d2b98df14152218205ea83b5bd81
3298 F20101111_AAAQCU cho_c_Page_002.QC.jpg
ac6000da80665b7156945d997f9f73c4
f4eb1d6c8a828a54f19aa870cacbd1c9ad9a38e7
F20101111_AAAPXO cho_c_Page_038.txt
dd96dd4b97dc77eb7f94ce79c0a4d3be
134dfd7bf35b7a29b4c29afcc3e7f1ca3d00d9dc
20325 F20101111_AAAQDJ cho_c_Page_051.QC.jpg
62cd72427e26bd807561410bdec58738
2c1be0b700baf060a2c30082304e04cdcf1347ab
1916 F20101111_AAAPYD cho_c_Page_054.txt
9aef5f2beacb00f819be0cff3b7249ca
2191b9abe88e1f328052b78addf941b92172129b
16065 F20101111_AAAQCV cho_c_Page_036.QC.jpg
8fb8c205ca7c8625c6e7169fc70acd01
cfd56c755a9f50524d73f0d1f913c294f03cd2b4
2241 F20101111_AAAPXP cho_c_Page_039.txt
4afe18af1aa97ccd9cc98f9e9b3ecfe4
233f0baa74a5c29eb9a437da94b3e49a60f03c05
F20101111_AAAQDK cho_c_Page_060thm.jpg
46ebf71485b181d7f131dda39c1dc20f
6fb6618a3c1c5059494455065f2023d0b93ccdd6
2077 F20101111_AAAPYE cho_c_Page_055.txt
aded948e7f9e3bf96ceec0894572fa90
79aefb04b25455b3a7b2e729071963894137e269
1996 F20101111_AAAPXQ cho_c_Page_040.txt
92f287c21ac3f8ce375ddcb0de227798
91999e47c2e77a3987589892ad5eac80590e9968
28354 F20101111_AAAQDL cho_c_Page_076.QC.jpg
0057bd02192dc06681aab7968212c963
574b9414ed474543af94a0d4111a8f49967b7442
F20101111_AAAPYF cho_c_Page_056.txt
e4ae89458616b13fd91549f864721e77
906153f49d487b7c3b70e29d590df518ec09618a
22198 F20101111_AAAQCW cho_c_Page_099.QC.jpg
711cf10cf8d6dba4e2ed418390a92b4c
66c537cb0f1eddba7f371c870db40e41652f4eb7
1799 F20101111_AAAPXR cho_c_Page_042.txt
c7791d8fe64cd952dcc0dde87d75242b
7e205b6f6ce44e2406d62ce74748f0f52bdbb900
4856 F20101111_AAAQEA cho_c_Page_015.QC.jpg
d0365fd768e193720d02a484b628c52f
8f96f34972116062bcfcdb80fb73e518ad13a2fe
2711 F20101111_AAAQDM cho_c_Page_119thm.jpg
f9a8e865364f4b69b1267dae38aa92f8
269047c3d1dd3bdf9ac3e7ea4dd712b93c1f3d8d
2182 F20101111_AAAPYG cho_c_Page_057.txt
08728d0ec0c651fa62b9199f02a75e4f
82a6b72e970bcb6c315a784fdacdb376bd942a8d
5555 F20101111_AAAQCX cho_c_Page_004thm.jpg
8f00e3a5a13cdc52b1b6f3db7a0bc782
f3f607672a53221d7803de073dd7c42da66a4f27
2164 F20101111_AAAPXS cho_c_Page_043.txt
4da5d9689006372f42734d3323adf56c
95ca4c52d24714fb0e276dcedcfa9a8de9bd6144
28193 F20101111_AAAQEB cho_c_Page_096.QC.jpg
917f8a4beae1e762045434b81a0d35fe
f94b08fdc36e2482c86a5d173153894511fa0610
7470 F20101111_AAAQDN cho_c_Page_029thm.jpg
963249445f40b14a55f82b641b26cb84
d06851c3aba41d311deb3bdcc534976e9dce5c9b
1789 F20101111_AAAPYH cho_c_Page_058.txt
5fdf65297b05efd9fefeece7cdaadf15
b4b9115fcbfc910d11a66e6e71af3358674ee59f
6784 F20101111_AAAQCY cho_c_Page_068thm.jpg
5714e1b42639aee60d8294bdf72fb237
d9a26106c4961c9d666a099148146961cdfbea8f
2529 F20101111_AAAPXT cho_c_Page_044.txt
ba7643330a691cb91cdd6b67ee50363d
345ab635d09aa9327497c2bca42b7b18a85a57f7
2953 F20101111_AAAQEC cho_c_Page_005thm.jpg
23311a6bb64ef4727362e4d19f226c8f
896dc4f6b996d025896a9b5dbf2dd390b77586f5
26651 F20101111_AAAQDO cho_c_Page_023.QC.jpg
a71074cf7e5838096278810f9475504d
33e9e50ae9dce3e3a7d57c891b7fdfef41a514b6
F20101111_AAAPYI cho_c_Page_060.txt
410e914d14e0af88f70f78fb8fdf0b6d
cda00402b4946f4f53fa38b72c286d0ccdc8c66d
F20101111_AAAQCZ cho_c_Page_046.QC.jpg
b69fab596936b7cb5612ffd9510ae1db
f911df4ed65d069b5298ed20920ed16a8bed17e7
10242 F20101111_AAAQED cho_c_Page_073.QC.jpg
79ac93abdf77acc26b26a109a95fb528
dd115b15fc2eb2242558ca33d0fbb6fb716b578f
26965 F20101111_AAAQDP cho_c_Page_012.QC.jpg
18baa65d10d8deb618e66e85fbd57ece
98d9d3fb4f503be1e142a75261f83ea1e94c071d
1779 F20101111_AAAPYJ cho_c_Page_061.txt
e30874ae8021b28a2b7bebe184738754
a38f52f0d1a6d96d710d962cab6d0a2c509757e5
1710 F20101111_AAAPXU cho_c_Page_045.txt
c10f80a38f50fafb7483840a09c29b2b
8848aadd1c146d926c374ce7fdc7015e9e5f4869
6950 F20101111_AAAQEE cho_c_Page_010thm.jpg
b997dea209a87d0c40dd78a0e24e70ce
a3261c521c1ff8bb29340bec1f4433fe3e51eca1
1755 F20101111_AAAQDQ cho_c_Page_015thm.jpg
f160e05105814e2b5ab6012e92296f47
61b4202248cc8084344d2638f2276dc1f2df2075
1226 F20101111_AAAPYK cho_c_Page_062.txt
8a1891401e089c7f52a9f1e8af941d05
190b9e8e14625a10bb665ab914f7b972503129a7
2673 F20101111_AAAPXV cho_c_Page_046.txt
65eda77caf37573db92000f3e9537fc7
2fc097fea6b406f45888e0f090b97ddce4ef91d5
18057 F20101111_AAAQEF cho_c_Page_112.QC.jpg
8053991201a0ddef83e0afad190ae368
6f7f68ddb82144c392b54cb4bda497bf5692df31
28017 F20101111_AAAQDR cho_c_Page_074.QC.jpg
a0d09b0c947046e6e20781a65c9ab46f
28150191c3bd374dbd6a58e235a3145a2ef73e1d
2150 F20101111_AAAPYL cho_c_Page_063.txt
e2edf0dc79dbd195cd6a05a96d8d83b8
b83be433789478c47e83432ed838cf15183d2ef5
3972 F20101111_AAAPXW cho_c_Page_047.txt
290d7dae3524ef250885c469e34d1c51
33a2722782fa8d82b92a645454ac070bbdbe3a56
25422 F20101111_AAAQEG cho_c_Page_060.QC.jpg
864cdcded02b3dfa385126fa186ddd97
afc5b52e9ad2c1f79800f04b79d4cd1951480ffe
2110 F20101111_AAAPZA cho_c_Page_078.txt
d9413d2306bf516ac8e55118db79ac11
49b02f82704a7c0511bcd88fbd18d1f46641994f
5775 F20101111_AAAQDS cho_c_Page_079thm.jpg
eea4dfe134201ae5f7d8543015f11a74
20fc28b8aa3f6f76e6324edcaf0481dfbee50fd9
2116 F20101111_AAAPYM cho_c_Page_064.txt
8bab50bd0b304a19aaaa25681500eae8
a4126a30c2f798849bc12a4e086cddc7e0660f63
2192 F20101111_AAAPXX cho_c_Page_048.txt
8e49277c7822138eaf684bbf3518257c
e351f46b15fbc6d43137b218a28bf16832620e38
6946 F20101111_AAAQEH cho_c_Page_072thm.jpg
21abf7f99c8291ccd67319a974be4baa
ef085ec5a4703d12f3137b1759bfc3f1dbce7e12
1602 F20101111_AAAPZB cho_c_Page_079.txt
b6d36f8baa50b524545bdd0b7de458c0
216e633b93a3d0043583894df0fc42fd5b0e7fda
27825 F20101111_AAAQDT cho_c_Page_066.QC.jpg
c5c35888c4d5499609d54f27fb2f63c2
621243dcf8bc4685386c5131de4332c5fe23ccc8
2018 F20101111_AAAPXY cho_c_Page_049.txt
5605c91dae01a9ef4aa2e82ef7c83ae7
fbdfef5b693099bde77aef89b4ab510c944a7c13
22536 F20101111_AAAQEI cho_c_Page_070.QC.jpg
b45cc5171bf816479c64a10bddc2a760
209fbaf645ad524c5b2a9aa97ec7393250dd107c
1992 F20101111_AAAPZC cho_c_Page_080.txt
725b024c554f1035ed46727f2eab4fe6
f7dc691d0c6e86a1fe5303a2f7b93a127ac139d4
21904 F20101111_AAAQDU cho_c_Page_043.QC.jpg
f664a8f9f5e7afa0044fbc81b6fe752f
0c112213132b5b1b50f90f74e18d5b2059c21bb0
1796 F20101111_AAAPYN cho_c_Page_065.txt
98bff39fa83822405cda294a18ee7007
cc4db21651ca8f981dfc9dd8881ce290f7702ffd
2185 F20101111_AAAPXZ cho_c_Page_050.txt
40f7a0dfac07195860c3589c147b4236
e713a5c18bc2b50700368e8ae3c7219a0cf60a08
27191 F20101111_AAAQEJ cho_c_Page_037.QC.jpg
5ed43f4a415259b7c7c952608213833f
d4bc714210bf77bc635f54fd13f0123ff780462f
2159 F20101111_AAAPZD cho_c_Page_081.txt
1bec9dad57c8f045a070015a4f608f67
0e7e05e3b2222bdfbd7c8a3a548efc818dbb3d67
22225 F20101111_AAAQDV cho_c_Page_008.QC.jpg
6a4e11ea8ab997d8b79d30739d91a358
b21dfcd67320465df8f0dbb507f08922b4ce8f99
F20101111_AAAPYO cho_c_Page_066.txt
7795dbdf7a1e127744cdfef4311d4469
48487bbc6302e36acad3c91f77937645cde0ee0f
7301 F20101111_AAAQEK cho_c_Page_023thm.jpg
9880f59125d20c9c9ed2b3ca0af572a1
b48bd28ac654acc36bf1ad7fb90297ebb1c2b19d
1828 F20101111_AAAPZE cho_c_Page_082.txt
456738b1afafa5cf5469ab1db6eced9e
44dc80b31b4541415381c6479a6707abaeb6aa15
7163 F20101111_AAAQDW cho_c_Page_109thm.jpg
ff36f5eda2f66c9a87745406c1b33083
f4385d1390ea3316deb2ae8d948267b09f95d2e7
2029 F20101111_AAAPYP cho_c_Page_067.txt
a9d6e5e3eacd9b59f057323378af3d2a
9412cd5b04277a8ed64193ea62dd0710d2889304