ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES

By

CHANG BURM CHO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008

Copyright 2008 Chang Burm Cho

ACKNOWLEDGMENTS

Many people contributed to my Ph.D. research. Most of all, I would like to express my gratitude to my supervisor, Dr. Tao Li, for his patient guidance, invaluable advice, and encouragement throughout the course of this research, and for our numerous discussions. I would also like to thank the members of my advisory committee, Dr. Renato Figueiredo, Dr. Rizwan Bashirullah, and Dr. Prabhat Mishra, for their valuable time and interest in serving on my supervisory committee. I am indebted to all the members of IDEAL (Intelligent Design of Efficient Architectures Laboratory), Clay Hughes, James Michael Poe II, Xin Fu, and Wangyuan Zhang, for their companionship and support throughout the time spent working on my research. Finally, I would like to express my greatest gratitude to my family, especially my wife, EunHee Choi, for her relentless support and love.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
2 WAVELET TRANSFORM
   Discrete Wavelet Transform (DWT)
   Applying DWT to Capture Workload Execution Behavior
   2D Wavelet Transform
3 COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION
   Characterizing and Classifying the Program Dynamic Behavior
   Profiling Program Dynamics and Complexity
   Classifying Program Phases Based on Their Dynamics Behavior
   Experimental Results
4 IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE ANALYSIS
   Workload-Statistics-Based Phase Analysis
   Exploring Wavelet Domain Phase Analysis
5 INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION
   Neural Network
   Combining Wavelets and Neural Networks for Workload Dynamics Prediction
   Experimental Methodology
   Evaluation and Results
   Workload Dynamics Driven Architecture Design Space Exploration
6 ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTICORE ARCHITECTURES
   Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction
   Experimental Methodology
   Evaluation and Results
   Leveraging 2D Geometric Characteristics to Explore Cooperative Multicore Oriented Architecture Design and Optimization
7 THERMAL DESIGN SPACE EXPLORATION OF 3D DIE-STACKED MULTICORE PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS
   Combining Wavelets and Neural Networks for 2D Thermal Spatial Behavior Prediction
   Experimental Methodology
   Experimental Results
8 CONCLUSIONS

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Baseline machine configuration
3-2 A classification of benchmarks based on their complexity
4-1 Baseline machine configuration
4-2 Efficiency of different hybrid wavelet signatures in phase classification
5-1 Simulated machine configuration
5-2 Microarchitectural parameter ranges used for generating train/test data
6-1 Simulated machine configuration (baseline)
6-2 The considered architecture design parameters and their ranges
6-3 Multiprogrammed workloads
6-4 Error comparison of predicting raw vs. 2D DWT cache banks
6-5 Design space evaluation speedup (simulation vs. prediction)
7-1 Architecture configuration for different issue widths
7-2 Simulation configurations
7-3 Design space parameters
7-4 Simulation time vs. prediction time

LIST OF FIGURES

2-1 Example of Haar wavelet transform
2-2 Comparison of execution characteristics in the time and wavelet domains
2-3 Sampled time domain program behavior
2-4 Reconstructing the workload dynamic behaviors
2-5 Variation of wavelet coefficients
2-6 2D wavelet transforms on 4 data points
2-7 2D wavelet transforms on 16 cores/hardware components
2-8 Example of applying 2D DWT on a non-uniformly accessed cache
3-1 XCOR vectors for each program execution interval
3-2 Dynamic complexity profile of benchmark gcc
3-3 XCOR value distributions
3-4 XCORs in the same phase by SimPoint
3-5 BBVs with different resolutions
3-6 Multiresolution analysis of the projected BBVs
3-7 Weighted COV calculation
3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics
3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics
4-1 Phase analysis methods: time domain vs. wavelet domain
4-2 Phase classification accuracy: time domain vs. wavelet domain
4-3 Phase classification using hybrid wavelet coefficients
4-4 Phase classification accuracy of using the 16 x 1 hybrid scheme
4-5 Different methods to handle counter overflows
4-6 Impact of counter overflows on phase analysis accuracy
4-7 Method for modeling workload variability
4-8 Effect of using wavelet denoising to handle workload variability
4-9 Efficiency of different denoising schemes
5-1 Variation of workload performance, power and reliability dynamics
5-2 Basic architecture of a neural network
5-3 Using wavelet neural networks for workload dynamics prediction
5-4 Magnitude-based ranking of 128 wavelet coefficients
5-5 MSE boxplots of workload dynamics prediction
5-6 MSE trends with increased number of wavelet coefficients
5-7 MSE trends with increased sampling frequency
5-8 Roles of microarchitecture design parameters
5-9 Threshold-based workload execution scenarios
5-10 Threshold-based workload execution
5-11 Threshold-based workload scenario prediction
5-12 Dynamic Vulnerability Management
5-13 IQ DVM pseudo code
5-14 Workload dynamics prediction with scenario-based architecture optimization
5-15 Heat plot that shows the MSE of IQ AVF and processor power
5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds
6-1 Variation of cache hits across a 256-bank non-uniform access cache on an 8-core CMP
6-2 Using wavelet neural networks for forecasting architecture 2D characteristics
6-3 Baseline CMP with 8 cores that share a NUCA L2 cache
6-4 ME boxplots of prediction accuracies with different numbers of wavelet coefficients
6-5 Predicted 2D NUCA behavior using different numbers of wavelet coefficients
6-6 Roles of design parameters in predicting 2D NUCA
6-7 2D NUCA footprint (geometric shape) of mesa
6-8 2D cache interference in NUCA
6-9 Pearson correlation coefficient (all 50 test cases are shown)
6-10 2D NUCA thermal profile (simulation vs. prediction)
6-11 NUCA 2D thermal prediction error
6-12 Temperature profile before and after a DTM policy
7-1 2D within-die and cross-die thermal variation in 3D die-stacked multicore processors
7-2 2D thermal variation on die 4 under different microarchitecture and floorplan configurations
7-3 Example of using 2D DWT to capture thermal spatial characteristics
7-4 Hybrid neuro-wavelet thermal prediction framework
7-5 Selected floorplans
7-6 Processor core floorplan
7-7 Cross-section view of the simulated 3D quad-core chip
7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16)
7-9 Simulated and predicted thermal behavior
7-10 ME boxplots of prediction accuracies with different numbers of wavelet coefficients
7-11 Benefit of predicting wavelet coefficients
7-12 Roles of input parameters

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES

By Chang Burm Cho

December 2008

Chair: Tao Li
Major: Electrical and Computer Engineering

Modeling and analyzing how workload and architecture interact are at the foundation of computer architecture research and practical design. As contemporary microprocessors become increasingly complex, many challenges related to the design, evaluation and optimization of their architectures crucially rely on exploiting workload characteristics. While conventional workload characterization methods measure aggregated workload behavior, and state-of-the-art tools can detect program time-varying patterns and cluster them into different phases, existing techniques generally lack the capability of gaining insightful knowledge of the complex interaction between software and hardware, a necessary first step to designing cost-effective computer architectures. This limitation will only be exacerbated by the rapid growth of software functionality and runtime, and of hardware design complexity and integration scale. For instance, while large real-world applications manifest drastically different behavior across a wide spectrum of their runtime, existing methods only focus on analyzing workload characteristics using a single time scale.
Conventional architecture modeling techniques assume a centralized and monolithic hardware substrate. This assumption, however, will no longer hold, since the design trend toward multi-/many-core processors will result in large-scale and distributed microarchitectures. Beyond a specific processor core, global and cooperative resource management for large-scale many-core processors requires obtaining workload characteristics across a large number of distributed hardware components (cores, cache banks, interconnect links, etc.) at different levels of abstraction. Therefore, there is a pressing need for novel and efficient approaches to model and analyze workloads and architectures with rapidly increasing complexity and integration scale. We aim to develop computationally efficient methods and models which allow architects and designers to rapidly yet informatively explore the large performance, power, reliability and thermal design space of uni-/multi-core architectures. Our models achieve several orders of magnitude speedup compared to simulation-based methods. Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models can capture complex workload behavior and can be used to forecast workload dynamics during performance, power, reliability and thermal design space exploration.

CHAPTER 1
INTRODUCTION

Modeling and analyzing how workloads behave on the underlying hardware have been essential ingredients of computer architecture research. By knowing program behavior, both hardware and software can be tuned to better suit the needs of applications. As computer systems become more adaptive, their efficiency increasingly depends on the dynamic behavior that programs exhibit at runtime.
Previous studies [1-5] have shown that program runtime characteristics exhibit time-varying phase behavior: workload execution manifests similar behavior within each phase while showing distinct characteristics between different phases. Many challenges related to the design, analysis and optimization of complex computer systems can be efficiently solved by exploiting program phases [1, 6-9]. For this reason, there is a growing interest in studying program phase behavior. Recently, several phase analysis techniques have been proposed [4, 7, 10-19]. Very few of these studies, however, focus on understanding and characterizing program phases from the perspectives of their dynamics and complexity. Consequently, these techniques generally lack the capability of revealing phase dynamic behavior. To complement current phase analysis techniques, which pay little or no attention to phase dynamics, we develop new methods, metrics and frameworks that can analyze, quantify, and classify program phases based on their dynamics and complexity characteristics. Our techniques are built on wavelet-based multiresolution analysis, which provides a clear and orthogonal view of phase dynamics by presenting the complex dynamic structures of program phases with respect to both the time and frequency domains. Consequently, key tendencies can be efficiently identified. As microprocessor architectures become more complex, architects increasingly rely on exploiting workload dynamics to achieve cost- and complexity-effective designs. Therefore, there is a growing need for methods that can quickly and accurately explore workload dynamic behavior at an early microarchitecture design stage. Such techniques can quickly provide architects with insights into application execution scenarios across a large design space without resorting to detailed, case-by-case simulations.
Researchers have proposed several predictive models [20-25] to reason about workload aggregate behavior at the architecture design stage. However, these models focus on predicting aggregated program statistics (e.g., the CPI of the entire workload execution). Such monolithic, global models are incapable of capturing and revealing program dynamics, which contain interesting fine-grain behavior. To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques and neural network prediction. As the number of cores on a processor increases, these large and sophisticated multi-core oriented architectures exhibit increasingly complex and heterogeneous characteristics. Processors with two, four and eight cores have already entered the market. Processors with tens or possibly hundreds of cores may be a reality within the next few years. In the upcoming multi-/many-core era, the design, evaluation and optimization of architectures will demand analysis methods that are very different from those targeting traditional, centralized and monolithic hardware structures. To enable global and cooperative management of hardware resources and efficiency at large scales, it is imperative to analyze and exploit architecture characteristics beyond the scope of individual cores and hardware components (e.g., a single cache bank or a single interconnect link). To address this important and urgent research task, we developed novel 2D multiscale predictive models which can efficiently reason about the characteristics of large and sophisticated multi-core oriented architectures during the design space exploration stage without using detailed cycle-level simulations. Three-dimensional (3D) integrated circuit design [55] is an emerging technology that greatly improves transistor integration density and reduces on-chip wire communication latency.
It places planar circuit layers in the vertical dimension and connects these layers with a high-density, low-latency interface. In addition, 3D integration offers the opportunity of bonding dies implemented with different process technologies, enabling the integration of heterogeneous active layers into new system architectures. Leveraging 3D die stacking technologies to build uni-/multi-core processors has drawn increased attention from both the chip design industry and the research community [56-62]. The realization of 3D chips faces many challenges. One of the most daunting of these challenges is the problem of inefficient heat dissipation. In conventional 2D chips, the generated heat is dissipated through an external heat sink. In 3D chips, all of the layers contribute to the generation of heat. Stacking multiple dies vertically increases power density, and dissipating heat from the layers far away from the heat sink is more challenging due to the greater distance between the heat source and the external heat sink. Therefore, 3D technologies not only exacerbate existing on-chip hotspots but also create new thermal hotspots. High die temperature leads to thermally induced performance degradation and reduced chip lifetime, which threatens the reliability of the whole system, making modeling and analyzing thermal characteristics crucial for effective 3D microprocessor design. Previous studies [59, 60] show that 3D chip temperature is affected by factors such as the configuration and floorplan of microarchitectural components. For example, instead of placing hot components together, thermal-aware floorplanning places hot components next to cooler components, reducing the global temperature. Thermal-aware floorplanning [59] uses intensive and iterative simulations to estimate the thermal effect of microarchitecture components at an early architectural design stage.
However, using detailed yet slow cycle-level simulations to explore thermal effects across the large design space of 3D multi-core processors is very expensive in terms of time and cost.

CHAPTER 2
WAVELET TRANSFORM

We use wavelets as an efficient tool for capturing workload behavior. To familiarize the reader with the general methods used in this research, we provide a brief overview of wavelet analysis and show how program execution characteristics can be represented using it.

Discrete Wavelet Transform (DWT)

Wavelets are mathematical tools that use a prototype function (called the analyzing or mother wavelet) to transform data of interest into different frequency components, and then analyze each component with a resolution matched to its scale. Therefore, the wavelet transform is capable of providing a compact and effective mathematical representation of data. In contrast to Fourier transforms, which only offer frequency representations, wavelet transforms provide time and frequency localization simultaneously. Wavelet analysis allows one to choose among numerous wavelet functions [26, 27]. In this section, we provide a quick primer on wavelet analysis using the Haar wavelet, which is the simplest form of wavelet. Consider a data series X_{n,k}, k = 0, 1, 2, ..., at the finest time-scale resolution level 2^{-n}. This time series might represent a specific program characteristic (e.g., the number of executed instructions, branch mispredictions or cache misses) measured at a given time scale. We can coarsen this event series by averaging (with a slightly different normalization factor) over non-overlapping blocks of size two,

    X_{n-1,k} = (X_{n,2k} + X_{n,2k+1}) / \sqrt{2}    (2-1)

and generate a new time series X_{n-1}, which is a coarser-granularity representation of the original series X_n.
The difference between the two representations, known as the details, is

    D_{n-1,k} = (X_{n,2k} - X_{n,2k+1}) / \sqrt{2}    (2-2)

Note that the original time series X_n can be reconstructed from its coarser representation X_{n-1} by simply adding in the details D_{n-1}; i.e., X_n = 2^{-1/2}(X_{n-1} + D_{n-1}). We can repeat this process (i.e., write X_{n-1} as the sum of yet another coarser version X_{n-2} and the details D_{n-2}, and iterate) for as many scales as are present in the original time series, i.e.,

    X_n = 2^{-n/2} X_0 + 2^{-n/2} D_0 + ... + 2^{-1/2} D_{n-1}    (2-3)

We refer to the collection of X_0 and the D_j as the discrete Haar wavelet coefficients. The calculation of all the D_{j,k}, which can be done iteratively using equations (2-1) and (2-2), makes up the so-called discrete wavelet transform (DWT). As can be seen, the DWT offers a natural hierarchical structure to represent data behavior at multiple resolution levels: the first few wavelet coefficients contain an overall, coarser approximation of the data; additional coefficients add finer detail. This property can be used to capture workload execution behavior. Figure 2-1 illustrates the procedure of using the Haar-based DWT to transform the data series {3, 4, 20, 25, 15, 5, 20, 3}. As can be seen, scale 1 is the finest representation of the data. At scale 2, the approximations {3.5, 22.5, 10, 11.5} are obtained by taking the averages of {3, 4}, {20, 25}, {15, 5} and {20, 3} at scale 1, respectively. The details {-0.5, -2.5, 5, 8.5} are the differences of {3, 4}, {20, 25}, {15, 5} and {20, 3}, each divided by 2. The process continues by decomposing the scaling coefficient (approximation) vector using the same steps, and completes when only one coefficient remains. As a result, the wavelet decomposition is the collection of the average and detail coefficients at all scales. In other words, the wavelet transform of the original data is the single coefficient representing the overall average of the original data, followed by the detail coefficients in order of increasing resolution.
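The decomposition just described can be sketched in a few lines of Python. The sketch below uses the average/half-difference convention of the running example (rather than the 1/sqrt(2) normalization of equations (2-1) and (2-2)); the helper name haar_dwt is ours, not part of any tool used in this work.

```python
# Haar wavelet decomposition using the average / half-difference
# convention of the example in the text.
def haar_dwt(data):
    """Return [overall average, detail coefficients in order of
    increasing resolution] for a list whose length is a power of two."""
    approx = list(data)
    details = []
    while len(approx) > 1:
        avg = [(approx[2 * i] + approx[2 * i + 1]) / 2 for i in range(len(approx) // 2)]
        det = [(approx[2 * i] - approx[2 * i + 1]) / 2 for i in range(len(approx) // 2)]
        details.insert(0, det)  # coarser levels go to the front
        approx = avg
    coeffs = approx  # the single overall average
    for det in details:
        coeffs = coeffs + det
    return coeffs

# The data series of the example:
print(haar_dwt([3, 4, 20, 25, 15, 5, 20, 3]))
# [11.875, 1.125, -9.5, -0.75, -0.5, -2.5, 5.0, 8.5]
```

The output starts with the overall average 11.875 and the coarsest detail 1.125, followed by the detail coefficients of the finer scales, matching the bottom-up decomposition described above.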
Different resolutions can be obtained by adding the difference values back to, or subtracting them from, the averages.

Figure 2-1. Example of Haar wavelet transform.

For instance, {13, 10.75} = {11.875 + 1.125, 11.875 - 1.125}, where 11.875 and 1.125 are the first and second coefficients, respectively. This process can be performed recursively until the finest scale is reached. Therefore, through an inverse transform, the original data can be recovered from the wavelet coefficients. The original data can be perfectly recovered if all wavelet coefficients are involved. Alternatively, an approximation of the time series can be reconstructed using a subset of the wavelet coefficients. The wavelet transform gives a time-frequency localization of the original data; as a result, the time domain signal can be accurately approximated using only a few wavelet coefficients, since they capture most of the energy of the input data.

Applying DWT to Capture Workload Execution Behavior

Since the variation of program characteristics over time can be viewed as a signal, we apply discrete wavelet analysis to capture program execution behavior. To obtain time domain workload execution characteristics, we break the entire program execution into intervals and then sample multiple data points within each interval. Therefore, at the finest resolution level, program time domain behavior is represented by a data series within each interval. Note that the sampled data can be any runtime program characteristic of interest. We then apply the discrete wavelet transform (DWT) to each interval.
As described in the previous section, the result of the DWT is a set of wavelet coefficients which represent the behavior of the sampled time series in the wavelet domain.

Figure 2-2. Comparison of execution characteristics in the time domain (a) and the wavelet domain (b).

Figure 2-2(a) shows the sampled time domain workload execution statistics (the y-axis represents the number of cycles the processor spends executing a fixed number of instructions) for the benchmark gcc within one execution interval. In this example, the program execution interval is represented by 1024 sampled data points. Figure 2-2(b) illustrates the wavelet domain representation of the original time series after a discrete wavelet transform is applied. Although the DWT can produce as many wavelet coefficients as there are original input data points, the first few wavelet coefficients usually contain the important trend. In Figure 2-2(b), we show the values of the first 16 wavelet coefficients. As can be seen, the discrete wavelet transform provides a compact representation of the original large volume of data. This feature can be exploited to create concise yet informative fingerprints of program execution behavior. One advantage of using wavelet coefficients to fingerprint program execution is that program time domain behavior can be reconstructed from these wavelet coefficients. Figures 2-3 and 2-4 show that the time domain workload characteristics can be recovered using the inverse discrete wavelet transform.
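The recovery from a subset of coefficients can likewise be sketched under the same average/half-difference convention used earlier: zero out all but the first k coefficients, then invert. The helper names haar_idwt and reconstruct are ours, for illustration only.

```python
def haar_idwt(coeffs):
    """Invert the average / half-difference Haar transform. coeffs is
    [overall average, details in order of increasing resolution]."""
    approx = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        det = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        nxt = []
        for a, d in zip(approx, det):
            nxt += [a + d, a - d]  # undo the average/difference step
        approx = nxt
    return approx

def reconstruct(coeffs, k):
    """Approximate reconstruction keeping only the first k coefficients
    (the remaining detail coefficients are set to zero)."""
    truncated = list(coeffs[:k]) + [0] * (len(coeffs) - k)
    return haar_idwt(truncated)

# Coefficients of the {3, 4, 20, 25, 15, 5, 20, 3} example:
coeffs = [11.875, 1.125, -9.5, -0.75, -0.5, -2.5, 5, 8.5]
print(reconstruct(coeffs, len(coeffs)))  # exact: [3, 4, 20, 25, 15, 5, 20, 3]
print(reconstruct(coeffs, 2))            # coarse, piecewise-constant approximation
```

Keeping all coefficients recovers the original series exactly; keeping only the first few yields the progressively coarser approximations of the kind shown in Figure 2-4.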
Figure 2-3. Sampled time domain program behavior.

Figure 2-4. Reconstructing the workload dynamic behaviors using (a) 1, (b) 2, (c) 4, (d) 8, (e) 16, and (f) 64 wavelet coefficients.

In Figure 2-4(a)-(e), the first 1, 2, 4, 8, and 16 wavelet coefficients are used to restore program time domain behavior with increasing fidelity. As shown in Figure 2-4(f), when all (here, 64) wavelet coefficients are used for recovery, the original signal is completely restored. However, this involves storing and processing a large number of wavelet coefficients. The wavelet transform gives a time-frequency localization of the original data; as a result, most of the energy of the input data can be represented by only a few wavelet coefficients. As can be seen, using 16 wavelet coefficients recovers program time domain behavior with sufficient accuracy. To classify program execution into phases, it is essential that the generated wavelet coefficients across intervals preserve the dynamics that workloads exhibit in the time domain. Figure 2-5 shows the variation of the first 16 wavelet coefficients (coeff 1 through coeff 16), which represent the wavelet domain behavior of branch mispredictions and L1 data cache hits for the benchmark gcc. The data are shown for the entire program execution, which contains a total of 1024 intervals.
Figure 2-5 Variation of wavelet coefficients (coeff 1 - coeff 16) for (a) branch mispredictions and (b) L1 data cache hits

Figure 2-5 shows that the wavelet domain transform largely preserves program dynamic behavior. Another interesting observation is that the first-order wavelet coefficient exhibits much more significant variation than the higher-order wavelet coefficients. This suggests that wavelet domain workload dynamics can be effectively captured using a few low-order wavelet coefficients.

2-D Wavelet Transform

To effectively capture the two-dimensional spatial characteristics across large-scale multi-core architecture substrates, we also use 2-D wavelet analysis. In 1-D wavelet analysis using Haar wavelet filters, each adjacent pair of data points in a discrete interval is replaced with its average and difference.

Figure 2-6 2-D wavelet transforms on 4 data points: the original data, the averaged value, and the detailed values (Dhorizontal, Dvertical, Ddiagonal)

A similar concept can be applied to obtain a 2-D wavelet transform of data in a discrete plane. As shown in Figure 2-6, each group of four adjacent points in a discrete 2-D plane can be replaced by their averaged value and three detailed values. The detailed values (Dhorizontal, Dvertical, and Ddiagonal) correspond to the average of the difference of: 1) the summation of the rows, 2) the summation of the columns, and 3) the summation of the diagonals. To obtain wavelet coefficients for 2-D data, we first apply a 1-D wavelet transform to the data along the x-axis, resulting in low-pass and high-pass signals (average and difference).
Next, we apply 1-D wavelet transforms to both signals along the y-axis, generating one averaged and three detailed signals. A 2-D wavelet decomposition is then obtained by recursively repeating this procedure on the averaged signal. Figure 2-7 (a) illustrates the procedure. As can be seen, the 2-D wavelet decomposition can be represented by a tree-based structure. The root node of the tree contains the original data (row-majored) of the mesh of values (for example, performance or temperatures of four adjacent cores, network-on-chip links, cache banks, etc.). First, we apply 1-D wavelet transforms along the x-axis, i.e., for each two points along the x-axis we compute the average and difference, obtaining (3 5 7 1 9 1 5 9) and (1 1 1 1 5 1 1 1). Next, we apply 1-D wavelet transforms along the y-axis; for each two points along the y-axis we compute the average and difference (at level 0 in the example shown in Figure 2-7 (a)). We perform this process recursively until the number of elements in the averaged signal becomes 1 (at level 1 in the example shown in Figure 2-7 (a)).

Figure 2-7 2-D wavelet transforms on 16 cores/hardware components: (a) recursive decomposition into average, horizontal, vertical, and diagonal details; (b) wavelet domain multiresolution representation

Figure 2-7 (b) shows the wavelet domain multiresolution representation of the 2-D spatial data. Figure 2-8 further demonstrates that the 2-D architecture characteristics can be effectively captured using a small number of wavelet coefficients (e.g., Average (L=0) or Average (L=1)).
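One level of this 2-D Haar decomposition can be sketched as follows. The 4 x 4 input grid is a hypothetical mesh chosen so that the x-axis pass reproduces the low-pass (3 5 7 1 9 1 5 9) and high-pass (1 1 1 1 5 1 1 1) magnitudes quoted above (up to the sign of the differences); the detail-band variable names follow the text's Dhorizontal/Dvertical/Ddiagonal labels only loosely.

```python
# One level of a 2-D Haar decomposition: 1-D transform along x, then
# along y, yielding an average band plus three detail bands. The grid
# values are hypothetical, chosen to match the example in the text.

grid = [
    [2, 4, 4, 6],
    [6, 8, 2, 0],
    [4, 14, 2, 0],
    [4, 6, 8, 10],
]

def haar_rows(m):
    """1-D Haar along the x-axis: (average, difference) per adjacent pair."""
    avg = [[(r[i] + r[i + 1]) / 2 for i in range(0, len(r), 2)] for r in m]
    diff = [[(r[i] - r[i + 1]) / 2 for i in range(0, len(r), 2)] for r in m]
    return avg, diff

def haar_cols(m):
    """1-D Haar along the y-axis, via transpose-rows-transpose."""
    t = [list(c) for c in zip(*m)]
    avg, diff = haar_rows(t)
    return [list(c) for c in zip(*avg)], [list(c) for c in zip(*diff)]

# x-axis pass: low-pass and high-pass signals.
low, high = haar_rows(grid)
# y-axis pass on both signals: average plus three detail bands.
average, d_vertical = haar_cols(low)
d_horizontal, d_diagonal = haar_cols(high)
```

Each entry of `average` is the mean of one 2 x 2 block of the original grid, which is the "averaged value" of Figure 2-6; repeating the procedure on `average` yields the next level of the decomposition tree.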
Since a small set of wavelet coefficients provides concise yet insightful information on architecture 2-D spatial characteristics, we use predictive models (i.e., neural networks) to relate them individually to various architecture design parameters. Through the inverse 2-D wavelet transform, we then use the small set of predicted wavelet coefficients to synthesize architecture 2-D spatial characteristics across the design space. Compared with a simulation-based method, predicting a small set of wavelet coefficients using analytical models is computationally efficient and scales to large architecture designs.

Figure 2-8 Example of applying the 2-D DWT on a non-uniformly accessed cache: (a) NUCA hit numbers; (b) 2-D DWT (L=0); (c) 2-D DWT (L=1)

CHAPTER 3
COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION

Obtaining phase dynamics is, in many cases, of great interest for accurately capturing program behavior and precisely applying runtime application-oriented optimizations. For example, complex, real-world workloads may run for hours, days or even months before completion. Their long execution time implies that program time-varying behavior can manifest across a wide range of scales, making the modeling of phase behavior at a single time scale less informative. To overcome the limitations of conventional phase analysis techniques, we propose using wavelet-based multiresolution analysis to characterize phase dynamic behavior and develop metrics to quantitatively evaluate the complexity of phase structures. We also propose methodologies to classify program phases from the perspectives of their dynamics and complexity. Specifically, the goal of this chapter is to answer the following questions: How do we define the complexity of program dynamics? How do program dynamics change over time? If classified using existing methods, how similar are the program dynamics in each phase? How can we better identify phases with homogeneous dynamic behavior?
In this chapter, we implement our complexity-based phase analysis technique and evaluate its effectiveness against existing phase analysis methods based on program control flow and runtime information. We show that in both cases the proposed technique produces phases that exhibit more homogeneous dynamic behavior than existing methods do.

Characterizing and classifying the program dynamic behavior

Using the wavelet-based multiresolution analysis described in Chapter 2, we characterize, quantify and classify program dynamic behavior on a high-performance, out-of-order execution superscalar processor coupled with a multi-level memory hierarchy.

Experimental setup

We performed our analysis using ten SPEC CPU 2000 benchmarks: crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf and vortex. All programs were run with the reference input to completion. We chose to focus on only 10 programs because of the lengthy simulation time incurred by executing all of the programs to completion. The statistics of workload dynamics were measured on the SimpleScalar 3.0 [28] sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model is detailed in Table 3-1.

Table 3-1 Baseline machine configuration
Processor Width: 8
ITLB: 128 entries, 4-way, 200 cycle miss
Branch Prediction: combined 8K tables, 10 cycle misprediction, 2 predictions/cycle
BTB: 2K entries, 4-way
Return Address Stack: 32 entries
L1 Instruction Cache: 32K, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access
RUU Size: 128 entries
Load/Store Queue: 64 entries
Store Buffer: 16 entries
Integer ALU: 4 I-ALU, 2 I-MUL/DIV
FP ALU: 2 FP-ALU, 1 FP-MUL/DIV
DTLB: 256 entries, 4-way, 200 cycle miss
L1 Data Cache: 64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access
L2 Cache: unified, 1MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access: 100 cycles

Metrics to Quantify Phase Complexity

To quantify phase complexity, we measure the similarity between phase dynamics observed at different time scales.
To be more specific, we use cross-correlation coefficients to measure the similarity between the original data sampled at the finest granularity and the approximated version reconstructed from wavelet scaling coefficients obtained at a coarser scale. The cross-correlation coefficient (XCOR) of two data series is defined as:

XCOR(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}    (3-1)

where X is the original data series and Y is the approximated data series. Note that XCOR = 1 if the program dynamics observed at the finest scale and its approximation at coarser granularity exhibit perfect correlation, and XCOR = 0 if the program dynamics and its approximation vary independently across time scales. X and Y can be any runtime program characteristics of interest. In this chapter, we use instructions per cycle (IPC) as the metric due to its wide usage in computer architecture design and performance evaluation. To sample IPC dynamics, we break the entire program execution into 1024 intervals and then sample 1024 IPC data points within each interval. Therefore, at the finest resolution level, the program dynamics of each execution interval are represented by an IPC data series of length 1024. We then apply wavelet multiresolution analysis to each interval. In a wavelet transform, each DWT operation produces an approximation coefficient vector with a length equal to half that of the input data. We remove the detail coefficients after each wavelet transform and use only the approximation part to reconstruct the IPC dynamics, and then calculate the XCOR between the original data and the reconstructed data. We apply the discrete wavelet transform to the approximation part iteratively until the length of the approximation coefficient vector is reduced to 1. Each approximation coefficient vector is used to reconstruct a full IPC trace of length 1024, and the XCOR between the original and reconstructed traces is calculated using equation (3-1).
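Equation (3-1) and the iterative approximate-reconstruct-correlate procedure can be sketched as follows. The pairwise-average approximation stands in for the Haar scaling coefficients, the 16-sample IPC trace is hypothetical, and stopping at a length-2 approximation is our reading of how a 1024-sample interval yields a length-10 XCOR vector whose scale-1 entry is fixed at 1.

```python
# Sketch of equation (3-1) and the multiresolution XCOR vector.
from math import sqrt

def xcor(x, y):
    """Cross-correlation coefficient of equation (3-1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0   # guard: constant reconstruction

def approximate(data):
    """One DWT step, approximation part only: adjacent pairs -> averages."""
    return [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]

def reconstruct(approx, length):
    """Rebuild a full-length trace from an approximation (details dropped)."""
    rep = length // len(approx)
    return [a for a in approx for _ in range(rep)]

def xcor_vector(data):
    """XCOR between the original trace and each coarser approximation."""
    vec = [1.0]                        # scale 1: the original trace itself
    approx = list(data)
    while len(approx) > 2:             # iterate toward the coarsest scales
        approx = approximate(approx)
        vec.append(xcor(data, reconstruct(approx, len(data))))
    return vec

# Hypothetical 16-sample IPC trace for one execution interval.
ipc = [1.0, 1.0, 0.6, 0.6, 1.4, 1.4, 0.8, 0.8,
       1.2, 1.2, 0.7, 0.7, 1.3, 1.3, 0.9, 0.9]
vec = xcor_vector(ipc)
```

For this trace, adjacent samples are pairwise equal, so the first approximation reconstructs the trace exactly and the scale-2 XCOR is 1; coarser scales then degrade as detail is discarded.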
As a result, for each program execution interval, we obtain an XCOR vector in which each element represents the cross-correlation coefficient between the original workload dynamics and the approximated workload dynamics at a different scale. Since we use 1024 samples within each interval, we create an XCOR vector of length 10 for each interval, as shown in Figure 3-1.

Figure 3-1 XCOR vectors for each program execution interval

Profiling Program Dynamics and Complexity

We use the XCOR metric to quantify the program dynamics and complexity of the studied SPEC CPU 2000 benchmarks. Figure 3-2 shows the results of all 1024 execution intervals across ten levels of abstraction for the benchmark gcc.

Figure 3-2 Dynamic complexity profile of benchmark gcc

As can be seen, the benchmark gcc shows a wide variety of changing dynamics during its execution. As the time scale increases, the XCOR values decrease monotonically. This is due to the fact that wavelet approximation at a coarse scale removes details in program dynamics observed at a fine-grained level. A rapidly decreasing XCOR implies a highly complex structure that cannot be captured by coarse-level approximation. In contrast, a slowly decreasing XCOR suggests that program dynamics can be largely preserved using few samples. Figure 3-2 also shows a dotted line along which XCOR decreases linearly with increasing time scale. XCOR plots below that dotted line indicate rapidly decreasing XCOR values, i.e., complex program dynamics. As can be seen, a significant fraction of the benchmark gcc execution intervals manifest quickly decreasing XCOR values, indicating that the program exhibits a highly complex structure at the fine-grained level.
Figure 3-2 also reveals that a few gcc execution intervals have good scalability in their dynamics: on these execution intervals, the XCOR values drop by only 0.1 when the time scale is increased from 1 to 8. The results (Figure 3-2) clearly indicate that some program execution intervals can be accurately approximated by their high-level abstractions while others cannot. We further break down the XCOR values into 10 categories ranging from 0 to 1 and analyze their distribution across time scales. Due to space limitations, we only show the results of three programs (swim, crafty and gcc; see Figure 3-3), which represent the characteristics of all analyzed benchmarks. Note that at scale 1, the XCOR values of all execution intervals are always 1. Programs show heterogeneous XCOR value distributions starting from scale level 2. As can be seen, the benchmark swim exhibits good scalability in its dynamic complexity. The XCOR values of all execution intervals remain above 0.9 when the time scale is increased from 1 to 7. This implies that the captured program behavior is not sensitive to any time scale in that range. Therefore, we classify swim as a low-complexity program. On the benchmark crafty, XCOR values decrease uniformly with increasing time scale, indicating that the observed program dynamics are sensitive to the time scales used to obtain them. We refer to this behavior as medium complexity. On the benchmark gcc, program dynamics decay rapidly. This suggests that abundant program dynamics could be lost if coarser time scales are used to characterize them. We refer to this characteristic as high-complexity behavior.
Figure 3-3 XCOR value distributions (bins [0, 0.1) through [0.9, 1)) for (a) swim (low complexity), (b) crafty (medium complexity), and (c) gcc (high complexity)

The dynamic complexity and XCOR value distribution plots (Figure 3-2 and Figure 3-3) provide a quantitative and informative representation of runtime program complexity. Using the above information, we classify the studied programs in terms of their complexity; the results are shown in Table 3-2.

Table 3-2 Classification of benchmarks based on their complexity
Low complexity: swim
Medium complexity: crafty, gzip, parser, perlbmk, twolf
High complexity: gap, gcc, mcf, vortex

Classifying Program Phases based on their Dynamic Behavior

In this section, we show that program execution manifests heterogeneous complexity behavior. We further examine the efficiency of current methods in classifying program dynamics into phases and propose a new method that can better identify program complexity. Classifying complexity-based phase behavior enables us to understand program dynamics progressively in a fine-to-coarse fashion, to operate on different resolutions, to manipulate features at different scales, and to localize characteristics in both the spatial and frequency domains.

Simpoint

Sherwood and Calder [1] proposed a phase analysis tool called Simpoint to automatically classify the execution of a program into phases.
They found that intervals of program execution grouped into the same phase had similar statistics. The Simpoint tool clusters program execution based on code signatures and execution frequency. We identified program execution intervals grouped into the same phase by the Simpoint tool and analyzed their dynamic complexity.

Figure 3-4 Program dynamics within three clusters generated by Simpoint on the benchmark mcf

Figure 3-4 shows the results for the benchmark mcf, on which Simpoint generates 55 clusters. Figure 3-4 shows program dynamics within three of these clusters; each cluster represents a unique phase. In cluster 7, the classified phase shows homogeneous dynamics. In cluster 5, program execution intervals show two distinct dynamics, but they are classified as the same phase. In cluster 48, program execution complexity varies widely; however, Simpoint classifies these intervals as a single phase. The results (Figure 3-4) suggest that program execution intervals classified as the same phase by Simpoint can still exhibit widely varied dynamic behavior.

Complexity-aware Phase Classification

To enhance the capability of current methods in characterizing program dynamics, we propose complexity-aware phase classification. Our method uses the multiresolution property of wavelet transforms to identify and classify changes in program code execution across different scales. We assume a baseline phase analysis technique that uses basic block vectors (BBVs) [10]. A basic block is a section of code that is executed from start to finish, with one entry and one exit. A BBV represents the code blocks executed during a given interval of execution. To represent program dynamics at different time scales, we create a set of basic block vectors for each interval at different resolutions. For example, at the coarsest level (scale = 10), a program execution interval is represented by one BBV.
At the most detailed level, the same program execution interval is represented by 1024 BBVs from 1024 consecutively subdivided intervals (Figure 3-5). To reduce the amount of data that needs to be processed, we use random projection to reduce the dimensionality of all BBVs to 15, as suggested in [1].

Figure 3-5 BBVs at different resolutions, from one BBV at resolution 2^0 up to 1024 BBVs at the finest resolution

The coarser-scale BBVs are approximations of the finest-scale BBVs generated by the wavelet-based multiresolution analysis.

Figure 3-6 Multiresolution analysis of the projected BBVs

As shown in Figure 3-6, the discrete wavelet transform is applied to each dimension of the set of BBVs at the finest scale. The XCOR calculation is used to estimate the correlations between a BBV element and its approximations at coarser scales. The results are 15 XCOR vectors representing the complexity of each BBV dimension across the 10 levels of abstraction. The 15 XCOR vectors are then averaged together to obtain an aggregated XCOR vector that represents the entire BBV complexity characteristics for that execution interval. Using the above steps, we obtain an aggregated XCOR vector for each program execution interval. We then run the k-means clustering algorithm [29] on the collected XCOR vectors, which represent the dynamic complexity of the program execution intervals, and classify them into phases. This is similar to what Simpoint does.
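The random projection step above can be sketched as follows; the BBV dimensionality of 64, the random seed, and the uniform projection entries are illustrative assumptions (only the target dimensionality of 15 comes from the text).

```python
# Sketch of random projection: each high-dimension BBV is multiplied by
# one fixed random matrix to reduce it to 15 dimensions. The block count
# (64) and the [-1, 1] uniform entries are hypothetical choices.
import random

random.seed(42)
N_BLOCKS, TARGET_DIM = 64, 15

# One fixed projection matrix shared by every BBV.
projection = [[random.uniform(-1, 1) for _ in range(TARGET_DIM)]
              for _ in range(N_BLOCKS)]

def project(bbv):
    """Project one basic block vector down to TARGET_DIM dimensions."""
    return [sum(bbv[i] * projection[i][d] for i in range(N_BLOCKS))
            for d in range(TARGET_DIM)]

bbv = [random.randint(0, 100) for _ in range(N_BLOCKS)]  # hypothetical counts
reduced = project(bbv)
```

Because the same matrix is reused for every interval, similar BBVs stay similar after projection, which is what allows clustering to operate on the reduced vectors.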
The difference is that the Simpoint tool uses raw BBVs, whereas our method uses aggregated BBV XCOR vectors as the input to k-means clustering.

Experimental Results

We compare Simpoint and the proposed approach in their capability to classify phase complexity. Since we use the wavelet transform on program basic block vectors, we refer to our method as multiresolution analysis of BBVs (MRA-BBV).

Figure 3-7 Weighted COV calculation

We examine the similarity of program complexity within each phase classified by the two approaches. Instead of using IPC, we use IPC dynamics as the metric for evaluation. After classifying all program execution intervals into phases, we examine each phase and compute the IPC XCOR vectors for all the intervals in that phase. We then calculate the standard deviation of the IPC XCOR vectors within each phase and divide the standard deviation by the average to get the coefficient of variation (COV). As shown in Figure 3-7, we calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e., weighted COV) used to compare different phase classifications for a given program. Since the COV measures the standard deviation as a percentage of the average, a lower COV value means a better phase classification technique.

Figure 3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics

Figure 3-8 shows the experimental results for all the studied benchmarks. As can be seen, the MRA-BBV method produces phases which exhibit more homogeneous dynamics and complexity than the standard, BBV-based method.
This can be seen from the lower COV values generated by the MRA-BBV method. In general, the COV values yielded by both methods increase when coarse time scales are used for complexity approximation. MRA-BBV achieves significantly better classification on benchmarks with high complexity, such as gap, gcc and mcf. On programs which exhibit medium complexity, such as crafty, gzip, parser, and twolf, the two schemes show comparable effectiveness. On a benchmark with trivial complexity (e.g., swim), both schemes work well. We further examine the capability of using runtime performance metrics to capture complexity-aware phase behavior. Instead of using BBVs, the sampled IPC is used directly as the input to the k-means phase clustering algorithm. Similarly, we apply multiresolution analysis to the IPC data and then use the gathered information for phase classification. We call this method multiresolution analysis of IPC (MRA-IPC). Figure 3-9 shows the phase classification results. As can be seen, the observations we made in the BBV-based cases hold for the IPC-based cases. This implies that the proposed multiresolution analysis can be applied to both methods to improve the capability of capturing phase dynamics.

Figure 3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics

CHAPTER 4
IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE ANALYSIS

In this chapter, we focus on workload-statistics-based phase analysis since, on a given machine configuration and environment, it is more suitable for identifying how the targeted architecture features vary during program execution.
In contrast, phase classification using program code structures lacks the capability of indicating how workloads behave architecturally [13, 30]. Therefore, phase analysis using specified workload characteristics allows one to explicitly link the targeted architecture features to the classified phases. For example, if phases are used to optimize cache efficiency, the workload characteristics that reflect cache behavior can be used to explicitly classify program execution into cache performance/power/reliability oriented phases. Program code structure based phase analysis identifies similar phases only if they have similar code flow. There can be cases where two sections of code have different code flow but exhibit similar architectural behavior [13]; code flow based phase analysis would then classify them as different phases. Another advantage of workload-statistics-based phase analysis is that when multiple threads share the same resource (e.g., pipeline, cache), using workload execution information to classify phases makes it possible to capture program dynamic behavior due to the interactions between threads. The key goal of workload execution based phase analysis is to accurately and reliably discern and recover phase behavior from various program runtime statistics, represented as large-volume, high-dimension and noisy data. To effectively achieve this objective, recent work [30, 31] proposes using wavelets as a tool to assist phase analysis. The basic idea is to transform workload time domain behavior into the wavelet domain. The generated wavelet coefficients, which extract compact yet informative program runtime features, are then assembled together to facilitate phase classification. Nevertheless, in current work, the examined scope of workload characteristics and the explored benefits of the wavelet transform are quite limited.
In this chapter, we extend the research of Chapter 3 by applying wavelets to abundant types of program execution statistics and quantifying the benefits of using wavelets for improving accuracy, scalability and robustness in phase classification. We conclude that wavelet domain phase analysis has the following advantages. 1) Accuracy: the wavelet transform significantly reduces temporal dependence in the sampled workload statistics. As a result, simple models which are insufficient in the time domain become quite accurate in the wavelet domain. More attractively, wavelet coefficients transformed from various dimensions of program execution characteristics can be dynamically assembled together to further improve phase classification accuracy. 2) Scalability: phase classification using wavelet analysis of high-dimension sampled workload statistics can alleviate the counter overflow problem, which has a negative impact on phase detection. It is therefore much more scalable for analyzing the large-scale phases exhibited by long-running, real-world programs. 3) Robustness: wavelets offer denoising capabilities which allow phase classification to be performed robustly in the presence of workload execution variability.

Workload-statistics-based phase analysis

Using the wavelet-based method, we explore program phase analysis on a high-performance, out-of-order execution superscalar processor coupled with a multi-level memory hierarchy. We use the Daubechies wavelet [26, 27] with an order of 8 for the rest of the experiments due to its high accuracy and low computation overhead. This section describes our experimental methodologies, the simulated machine configuration, the experimented benchmarks and the evaluated metrics. We performed our analysis using twelve SPEC CPU 2000 integer benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex and vpr. All programs were run with the reference input to completion.
The runtime workload execution statistics were measured on the SimpleScalar 3.0 sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model we used is detailed in Table 4-1.

Table 4-1 Baseline machine configuration
Processor Width: 8
ITLB: 128 entries, 4-way, 200 cycle miss
Branch Prediction: combined 8K tables, 10 cycle misprediction, 2 predictions/cycle
BTB: 2K entries, 4-way
Return Address Stack: 32 entries
L1 Instruction Cache: 32K, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access
RUU Size: 128 entries
Load/Store Queue: 64 entries
Store Buffer: 16 entries
Integer ALU: 4 I-ALU, 2 I-MUL/DIV
FP ALU: 2 FP-ALU, 1 FP-MUL/DIV
DTLB: 256 entries, 4-way, 200 cycle miss
L1 Data Cache: 64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access
L2 Cache: unified, 1MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access: 100 cycles

We use IPC (instructions per cycle) as the metric to evaluate the similarity of program execution within each classified phase. To quantify phase classification accuracy, we use the weighted COV metric proposed by Calder et al. [15]. After classifying all program execution intervals into phases, we examine each phase and compute the IPC for all the intervals in that phase. We then calculate the standard deviation of the IPC within each phase, and we divide the standard deviation by the average to get the coefficient of variation (COV). We then calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e., weighted COV) used to compare different phase classifications for a given program. Since the COV measures the standard deviation as a percentage of the average, a lower COV value means a better phase classification technique.
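The weighted-COV metric just described can be sketched as follows; the per-phase IPC values are hypothetical.

```python
# Sketch of the weighted-COV metric: per-phase COVs weighted by the
# fraction of execution each phase accounts for. `phases` maps a phase
# id to the per-interval IPC values of its intervals (hypothetical data).
from math import sqrt

def cov(values):
    """Coefficient of variation: standard deviation / mean."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return sqrt(var) / mean

def weighted_cov(phases):
    """Overall metric: per-phase COVs weighted by execution fraction."""
    total = sum(len(v) for v in phases.values())
    return sum(cov(v) * len(v) / total for v in phases.values())

phases = {
    0: [1.0, 1.0, 1.0, 1.0],   # perfectly homogeneous phase: COV = 0
    1: [0.8, 1.2, 0.8, 1.2],   # more varied phase: COV = 0.2
}
score = weighted_cov(phases)   # lower is a better classification
```

A classification that puts dissimilar intervals into the same phase inflates that phase's COV and hence the overall score, which is why lower weighted COV indicates a better technique.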
Exploring Wavelet Domain Phase Analysis

We first evaluate the efficiency of wavelet analysis on a wide range of program execution characteristics by comparing its phase classification accuracy with methods that use information in the time domain. We then explore methods to further improve phase classification accuracy in the wavelet domain.

Phase Classification: Time Domain vs. Wavelet Domain

The wavelet analysis method provides a cost-effective representation of program behavior. Since wavelet coefficients are generally decorrelated, we can transform the original data into the wavelet domain and then carry out the phase classification task. The generated wavelet coefficients can be used as signatures to classify program execution intervals into phases: if two program execution intervals show similar fingerprints (represented as sets of wavelet coefficients), they can be classified into the same phase. To quantify the benefit of wavelet-based analysis, we compare phase classification methods that use time domain and wavelet domain program execution information. With our time domain phase analysis method, each program execution interval is represented by a time series which consists of 1024 sampled program execution statistics. We first apply random projection to reduce the data dimensionality to 16. We then use the k-means clustering algorithm to classify program intervals into phases. This is similar to the method used by the popular Simpoint tool, where basic block vectors (BBVs) are used as input. For the wavelet domain method, the original time series are first transformed into the wavelet domain using the DWT. The first 16 wavelet coefficients of each program execution interval are extracted and used as the input to the k-means clustering algorithm. Figure 4-1 illustrates the above described procedure.
Figure 4-1 Phase analysis methods, time domain vs. wavelet domain: random projection to 16 dimensions followed by k-means clustering, versus DWT with the first 16 wavelet coefficients followed by k-means clustering

We investigated the efficiency of applying wavelet domain analysis to 10 different workload execution characteristics, namely: the numbers of executed loads (load), stores (store) and branches (branch); the number of cycles the processor spends executing a fixed amount of instructions (cycles); the number of branch mispredictions (branch miss); the numbers of L1 instruction cache, L1 data cache and L2 cache hits (il1 hit, dl1 hit and ul2 hit); and the numbers of instruction and data TLB hits (itlb hit and dtlb hit). Figure 4-2 shows the COVs of phase classifications in the time and wavelet domains when each type of workload execution characteristic is used as the input. As can be seen, compared with using raw, time domain workload data, wavelet domain analysis significantly improves phase classification accuracy, and this observation holds for all the investigated workload characteristics across all the examined benchmarks. This is because in the time domain, collected program runtime statistics are treated as high-dimension time series data. Random projection methods are used to reduce the dimensionality of the feature vectors which represent a workload signature at a given execution interval. However, the simple random projection function can increase aliasing between phases and reduce the accuracy of phase detection.
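The wavelet-domain path of Figure 4-1 can be sketched end-to-end as follows. The toy intervals, the signature length of 4, and the simple Haar filter are illustrative assumptions (the dissertation uses a Daubechies-8 wavelet, 1024-sample intervals, and 16 coefficients).

```python
# Toy end-to-end sketch of the wavelet-domain path: Haar-transform each
# interval's samples, keep the first few coefficients as the signature,
# and cluster the signatures with k-means.

def haar_signature(data, n_coeffs):
    """Coarsest-first Haar (average/difference) coefficients, truncated."""
    details, approx = [], list(data)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]
    coeffs = approx + [c for d in reversed(details) for c in d]
    return coeffs[:n_coeffs]

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm seeded with the first k distinct points."""
    centers = []
    for p in points:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(pt, centers[c])))
                  for pt in points]
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two synthetic behaviors: flat intervals and ramping intervals.
intervals = ([[1.0] * 8 for _ in range(4)]
             + [[float(i) for i in range(8)] for _ in range(4)])
signatures = [haar_signature(iv, 4) for iv in intervals]
labels = kmeans(signatures, k=2)
```

Even with only four coefficients per interval, the two synthetic behaviors map to clearly separated signatures, so k-means assigns each behavior its own phase.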
Figure 4-2 Phase classification accuracy for each workload characteristic: time domain vs. wavelet domain

By transforming program runtime statistics into the wavelet domain, workload behavior can be represented by a series of wavelet coefficients which are much more compact and efficient than their counterpart in the time domain. The wavelet transform significantly reduces temporal dependence, and therefore simple models which are insufficient in the time domain become quite accurate in the wavelet domain. Figure 4-2 also shows that in the wavelet domain, the efficiency of using a single type of program characteristic to classify program phases can vary significantly across different benchmarks. For example, while ul2 hit achieves accurate phase classification on the benchmark vortex, it results in a high phase classification COV on the benchmark gcc. To overcome the above disadvantages and to build phase classification methods that can achieve high accuracy across a wide range of applications, we explore using wavelet coefficients derived from different types of workload characteristics.
Figure 4-3 Phase classification using hybrid wavelet coefficients

As shown in Figure 4-3, a DWT is applied to each type of workload characteristic. The generated wavelet coefficients from different categories can be assembled together to form a signature for a data clustering algorithm. Our objective is to improve wavelet domain phase classification accuracy across different programs while using an equivalent amount of information to represent program behavior. We choose a set of 16 wavelet coefficients as the phase signature since it provides sufficient precision in capturing program dynamics when a single type of program characteristic is used. If a phase signature can be composed using multiple workload characteristics, there are many ways to form a 16-dimension phase signature. For example, a phase signature can be generated using one wavelet coefficient from each of 16 different workload characteristics (16×1), or it can be composed using 8 workload characteristics with 2 wavelet coefficients from each (8×2). Alternatively, a phase signature can be formed using 4 workload characteristics with 4 wavelet coefficients each (4×4) or 2 workload characteristics with 8 wavelet coefficients each (2×8). We extend the 10 workload execution characteristics (Figure 4-2) to 16 by adding the following events: the number of accesses to the instruction cache (il1 access), data cache (dl1 access), L2 cache (ul2 access), instruction TLB (itlb access) and data TLB (dtlb access). To understand the tradeoffs in choosing different methods to generate hybrid signatures, we performed an exhaustive search using the above 4 schemes on all benchmarks to identify the best COVs that each scheme can achieve.
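The 16×1 / 8×2 / 4×4 / 2×8 assembly can be sketched as follows. The Haar transform and the choice of keeping the leading (coarsest) coefficients of each characteristic are assumptions made for illustration; the dissertation does not pin down which coefficients are kept per characteristic:

```python
import random

def haar_dwt(x):
    """Haar DWT returning [approximation, coarse-to-fine details]."""
    approx, details = list(x), []
    while len(approx) > 1:
        a = [(approx[i] + approx[i + 1]) / 2.0 for i in range(0, len(approx), 2)]
        d = [(approx[i] - approx[i + 1]) / 2.0 for i in range(0, len(approx), 2)]
        details = d + details
        approx = a
    return approx + details

def hybrid_signature(characteristics, coeffs_each):
    """Concatenate the first k wavelet coefficients of each characteristic
    into one phase signature (k * number-of-characteristics = 16)."""
    sig = []
    for series in characteristics:
        sig.extend(haar_dwt(series)[:coeffs_each])
    return sig

random.seed(2)
stats = [[random.random() for _ in range(64)] for _ in range(16)]
sig_16x1 = hybrid_signature(stats, 1)       # 16 characteristics x 1 coeff
sig_8x2 = hybrid_signature(stats[:8], 2)    # 8 characteristics x 2 coeffs
print(len(sig_16x1), len(sig_8x2))          # both 16-dimensional
```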
The results (their ranks in terms of phase classification accuracy and the COVs of phase analysis) are shown in Table 4-2. As can be seen, statistically, hybrid wavelet signatures generated using 16 (16×1) and 8 (8×2) workload characteristics achieve higher accuracy. This suggests that combining multiple types of wavelet domain workload characteristics to form a phase signature is beneficial in phase analysis.

Table 4-2 Efficiency of different hybrid wavelet signatures in phase classification (hybrid wavelet signature and its phase classification COV)

Benchmark   Rank #1        Rank #2        Rank #3        Rank #4
Bzip2       16×1 (6.5%)    8×2 (10.5%)    4×4 (10.5%)    2×8 (10.5%)
Crafty      4×4 (1.2%)     16×1 (1.6%)    8×2 (1.9%)     2×8 (3.9%)
Eon         8×2 (1.3%)     4×4 (1.6%)     16×1 (1.8%)    2×8 (3.6%)
Gap         4×4 (4.2%)     16×1 (6.3%)    8×2 (7.2%)     2×8 (9.3%)
Gcc         8×2 (4.7%)     16×1 (5.8%)    4×4 (6.5%)     2×8 (14.1%)
Gzip        16×1 (2.5%)    4×4 (3.7%)     8×2 (4.4%)     2×8 (4.9%)
Mcf         16×1 (9.5%)    4×4 (10.2%)    8×2 (12.1%)    2×8 (87.8%)
Parser      16×1 (4.7%)    8×2 (5.2%)     4×4 (7.3%)     2×8 (8.4%)
Perlbmk     8×2 (0.7%)     16×1 (0.8%)    4×4 (0.8%)     2×8 (1.5%)
Twolf       16×1 (0.2%)    8×2 (0.2%)     4×4 (0.4%)     2×8 (0.5%)
Vortex      16×1 (2.4%)    8×2 (4%)       2×8 (4.4%)     4×4 (5.8%)
Vpr         16×1 (3%)      8×2 (14.9%)    4×4 (15.9%)    2×8 (16.3%)

We further compare the efficiency of the 16×1 hybrid scheme (Hybrid), the best case that a single type of workload characteristic can achieve (Individual Best), and the Simpoint-based phase classification that uses basic block vectors (BBV). The results for the 12 SPEC integer benchmarks are shown in Figure 4-4.

Figure 4-4 Phase classification accuracy of using the 16×1 hybrid scheme

As can be seen, the Hybrid outperforms the Individual Best on 10 out of the 12 benchmarks. The Hybrid also outperforms the BBV-based Simpoint method on 10 out of the 12 cases.
Scalability

Above we saw that wavelet domain phase analysis can achieve higher accuracy. In this subsection, we address another important issue in phase analysis using workload execution characteristics: scalability. Counters are usually used to collect workload statistics during program execution. These counters may overflow if they are used to track large-scale phase behavior in long-running workloads. Today, many large, real-world workloads run for days, weeks or even months before completion, and this trend is likely to continue in the future. To perform phase analysis on the next generation of computer workloads and systems, phase classification methods should be capable of scaling with increasing program execution time. To understand the impact of counter overflow on phase analysis accuracy, we use 16 accumulative counters to record the 16-dimension workload characteristics. The values of the 16 accumulative counters are then used as a signature to perform phase classification. We gradually reduce the number of bits in the accumulative counters; as a result, counter overflows start to occur. We use two schemes to handle a counter overflow. In the first, a counter saturates at its maximum value once it overflows. In the second, the counter is reset to zero after an overflow occurs. After all counter overflows are handled, we use the 16-dimension accumulative counter values to perform phase analysis and calculate the COVs. Figure 4-5 (a) illustrates this procedure.
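The two overflow-handling schemes can be sketched as below. Modeling "reset to zero" as modular wraparound (the counter keeps the remainder past the overflow point) is one plausible reading and an assumption of this sketch:

```python
def simulate_counter(interval_counts, bits, policy):
    """Accumulate per-sample event counts into an n-bit counter.
    policy='saturate': stick at the maximum value once exceeded.
    policy='reset'   : wrap to zero on overflow (modular arithmetic)."""
    limit = (1 << bits) - 1
    total = 0
    for c in interval_counts:
        total += c
        if total > limit:
            total = limit if policy == 'saturate' else total % (limit + 1)
    return total

events = [900] * 100   # 90,000 events fed into a 16-bit counter
print(simulate_counter(events, 16, 'saturate'))  # caps at 65535
print(simulate_counter(events, 16, 'reset'))     # 90000 mod 65536 = 24464
```

Either policy discards information, which is why the COVs in Figure 4-6 generally rise with the overflow rate.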
Figure 4-5 Different methods to handle counter overflows: (a) n-bit accumulative counter; (b) n-bit sampling counter

Our counter overflow analysis results are shown in Figure 4-6, which also shows the counter overflow rate (i.e., the percentage of overflowed counters) when counters of different sizes are used to collect workload statistics within program execution intervals. For example, on the benchmark crafty, when the number of bits used in the counters is reduced to 20, 100% of the counters overflow. For clarity, we only show the region within which the counter overflow rate is greater than zero and less than or equal to one. Since each program has a different execution time, this region varies from one program to another. As can be seen, counter overflows have a negative impact on phase classification accuracy. In general, COVs increase with the counter overflow rate. Interestingly, as the overflow rate increases, there are cases where overflow handling reduces the COVs. This is because overflow handling has the effect of normalizing and smoothing irregular peaks in the original statistics.
Figure 4-6 Impact of counter overflows on phase analysis accuracy

One solution to avoid counter overflows is to use sampling counters instead of accumulative counters, as shown in Figure 4-5 (b). However, when sampling counters are used, the collected statistics are represented as time series with a large volume of data. The results shown in Figure 4-2 suggest that directly employing runtime samples in phase classification is less desirable. To address the scalability issue in characterizing large-scale program phases using workload execution statistics, wavelet-based dimensionality reduction techniques can be applied to extract the essential features of workload behavior from the sampled statistics.
The observations made in previous sections motivate the use of the DWT to absorb a large volume of sampled raw data and produce highly efficient wavelet domain signatures for phase analysis, as shown in Figure 4-5 (b). Figure 4-6 further shows phase analysis accuracy after applying wavelet techniques to the workload statistics gathered with sampling counters of different sizes. As can be seen, sampling enables the use of counters with limited size to study large program phases. In general, sampling scales naturally with the interval size as long as the sampled values do not overflow the counters. However, with an increasing mismatch between phase interval and counter size, the sampling frequency must be increased, resulting in an even larger volume of sampled data. Wavelet domain phase analysis can effectively infer program behavior from a large set of data collected over a long time span, resulting in low COVs in phase analysis.

Workload Variability

As described earlier, our methods collect various program execution statistics and use them to classify program execution into different phases. Such phase classification generally relies on comparing the similarity of the collected statistics. Ideally, different runs of the same code segment should be classified into the same phase. Existing phase detection techniques assume that workloads have deterministic execution. On real systems, with operating system interventions and other threads, applications manifest behavior that is not the same from run to run. This variability can stem from changes in system state that alter cache, TLB or I/O behavior, system calls or interrupts, and can result in noticeably different timing and performance behavior [18, 32]. This cross-run variability can confuse similarity-based phase detection. For a phase analysis technique to be applicable on real systems, it should perform robustly under variability.
Program cross-run variability can be thought of as noise, a random variance of a measured statistic. There are many possible sources of noisy data, such as measurement/instrument errors and interventions by the operating system. Removing this variability from the collected runtime statistics can be considered a process of denoising. In this chapter, we explore using wavelets as an effective way to perform denoising. Due to the vanishing moment property of wavelets, only some wavelet coefficients are significant in most cases. By retaining selective wavelet coefficients, a wavelet transform can be applied to reduce the noise. The main idea of wavelet denoising is to transform the data into the wavelet basis, where the large coefficients mainly contain the useful information and the smaller ones represent noise. By suitably modifying the coefficients in the new basis, noise can be directly removed from the data. The general denoising procedure involves three steps: 1) decompose: compute the wavelet decomposition of the original data; 2) threshold wavelet coefficients: select a threshold and apply thresholding to the wavelet coefficients; and 3) reconstruct: compute the wavelet reconstruction using the modified wavelet coefficients. More details on wavelet-based denoising techniques can be found in [33]. To model workload runtime variability, we use additive noise models and randomly inject noise into the time series that represents workload execution behavior. We vary the SNR (signal-to-noise ratio) to simulate scenarios with different degrees of variability. To classify program execution into phases, we generate a 16-dimension feature vector where each element contains the average value of the collected program execution characteristic for each interval. The k-means algorithm is then used for data clustering. Figure 4-7 illustrates the above procedure.
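The decompose/threshold/reconstruct loop can be sketched with a Haar transform and hard thresholding. The dissertation itself uses the Daubechies-8 wavelet with global thresholding in MATLAB, so the basis, threshold value and noise level below are illustrative assumptions:

```python
import random

def haar_dwt(x):
    """Haar decomposition: [approximation, coarse-to-fine details]."""
    approx, details = list(x), []
    while len(approx) > 1:
        a = [(approx[i] + approx[i + 1]) / 2.0 for i in range(0, len(approx), 2)]
        d = [(approx[i] - approx[i + 1]) / 2.0 for i in range(0, len(approx), 2)]
        details = d + details
        approx = a
    return approx + details

def haar_idwt(coeffs):
    """Inverse of haar_dwt."""
    out, k = coeffs[:1], 1
    while 2 * k <= len(coeffs):
        d = coeffs[k:2 * k]
        out = [v for a, det in zip(out, d) for v in (a + det, a - det)]
        k *= 2
    return out

def denoise(series, threshold):
    """Step 1 decompose, step 2 hard-threshold small coefficients,
    step 3 reconstruct from the modified coefficients."""
    kept = [c if abs(c) > threshold else 0.0 for c in haar_dwt(series)]
    return haar_idwt(kept)

random.seed(3)
clean = [100.0] * 32 + [300.0] * 32                # two clear phases
noisy = [v + random.gauss(0, 10) for v in clean]   # additive variability
rec = denoise(noisy, threshold=20.0)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

print(round(mse(noisy, clean), 1), round(mse(rec, clean), 1))
```

The large coefficients (the phase mean and the phase gap) survive the threshold, while the small noise-dominated details are zeroed, pulling the reconstruction back toward the clean trace.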
Figure 4-7 Method for modeling workload variability

We use the Daubechies-8 wavelet with a global wavelet coefficient thresholding policy to perform denoising. We then compare the phase classification COVs of using the original data, the data with variability injected, and the data after denoising. Figure 4-8 shows our experimental results.

Figure 4-8 Effect of using wavelet denoising to handle workload variability

SNR=20 represents scenarios with a low degree of variability and SNR=5 reflects situations with a high degree of variability. As can be seen, introducing variability into workload execution statistics reduces phase analysis accuracy. Wavelet denoising is capable of recovering phase behavior from the noised data, resulting in higher phase analysis accuracy. Interestingly, on some benchmarks (e.g. eon, mcf), the denoised data achieve better phase classification accuracy than the original data. This is because randomly occurring peaks in the gathered workload execution data can have a deleterious effect on the phase classification results. Wavelet denoising smoothes these irregular peaks and makes the phase classification method more robust. Various types of wavelet denoising can be performed by choosing different threshold selection rules (e.g. rigrsure, heursure, sqtwolog and minimaxi), by performing hard (h) or soft (s) thresholding, and by specifying a multiplicative threshold rescaling model (e.g. one, sln, and mln). We compare the efficiency of the different denoising techniques implemented in the MATLAB tool [34]. Due to space limitations, only the results on benchmarks bzip2, gcc and mcf are shown in Figure 4-9.
As can be seen, different wavelet denoising schemes achieve comparable accuracy in phase classification.

Figure 4-9 Efficiency of different denoising schemes

CHAPTER 5
INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION

It is well known to the processor design community that program runtime characteristics exhibit significant variation. To observe the dynamic behavior that programs manifest on complex microprocessors and systems, architects resort to detailed, cycle-accurate simulations. Figure 5-1 illustrates the variation in workload dynamics for the SPEC CPU 2000 benchmarks gap, crafty and vpr within one of their execution intervals. The results show the time-varying behavior of workload performance (gap), power (crafty) and reliability (vpr) metrics across simulations with different microarchitecture configurations.

Figure 5-1 Variation of workload performance, power and reliability dynamics

As can be seen, the manifested workload dynamics while executing the same code base vary widely across processors with different configurations. As the number of parameters in the design space increases, such variation in workload dynamics cannot be captured without using slow, detailed simulations. However, using simulation-based methods for architecture design space exploration, where numerous design parameters have to be considered, is prohibitively expensive. Recently, researchers have proposed several predictive models [20-25] to reason about workload aggregated behavior at the architecture design stage. Among them, linear regression and neural network models have been the most used approaches. Linear models are straightforward to understand and provide accurate estimates of the significance of parameters and their interactions.
However, they are usually inadequate for modeling the nonlinear dynamics of real-world workloads, which exhibit widely different characteristics and complexity. Of the nonlinear methods, neural network models can accurately predict aggregated program statistics (e.g. the CPI of the entire workload execution). Such models are termed global models, as only one model is used to characterize the measured programs. Monolithic global models are incapable of capturing and revealing program dynamics, which contain interesting fine-grain behavior. Moreover, a workload may produce different dynamics when the underlying architecture configuration changes. Therefore, new methods are needed to accurately predict complex workload dynamics. To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques, which produce a good local representation of workload behavior in both the time and frequency domains. The proposed analytical models, which combine wavelet-based multiscale data representation and neural-network-based regression prediction, can efficiently reason about program dynamics without resorting to detailed simulations. With our schemes, complex workload dynamics are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. We extensively evaluate the efficiency of using wavelet neural networks to predict the dynamics that the SPEC CPU 2000 benchmarks manifest on high-performance microprocessors with a microarchitecture design space that consists of 9 key parameters. Our results show that the models achieve high accuracy in forecasting workload dynamics across a large microarchitecture design space.
In this chapter, we propose using wavelet neural networks to build accurate predictive models for workload-dynamics-driven microarchitecture design space exploration. We show that wavelet neural networks can accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluate the efficiency of using the proposed techniques to predict workload dynamic behavior in the performance, power and reliability domains. We perform extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy, and identify microarchitecture parameters that significantly affect workload dynamic behavior. We present a case study of using workload-dynamics-aware predictive models to quickly estimate the efficiency of scenario-driven architecture optimizations across different domains. Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios.

Neural Network

An Artificial Neural Network (ANN) [42] is an information processing paradigm inspired by the way biological nervous systems process information. It is composed of a set of interconnected processing elements working in unison to solve problems.

Figure 5-2 Basic architecture of a neural network

The most common type of neural network (Figure 5-2) consists of three layers of units: a layer of input units is connected to a layer of hidden units, which is connected to a layer of output units. The input is fed into the network through the input units. Each hidden unit receives the entire input vector and generates a response. The output of a hidden unit is determined by the input-output transfer function specified for that unit. Commonly used transfer functions include the sigmoid, the linear threshold function and the Radial Basis Function (RBF) [35].
The ANN output, which is determined by the output unit, is computed using the responses of the hidden units and the weights between the hidden and output units. Neural networks outperform linear models in capturing complex, nonlinear relations between input and output, which makes them a promising technique for tracking and forecasting complex behavior. In this chapter, we use the RBF transfer function to model and estimate important wavelet coefficients on unexplored design spaces because of its superior ability to approximate complex functions. The basic architecture of an RBF network with n inputs and a single output is shown in Figure 5-2. The nodes in adjacent layers are fully connected. In a linear single-layer neural network, a model of a 1-dimensional function f is expressed as a linear combination of a set of n fixed functions, often called basis functions by analogy with the concept of a vector being composed of a linear combination of basis vectors:

f(x) = Σ_{j=1}^{n} w_j h_j(x)   (5-1)

Here w ∈ R^n is the adaptable or trainable weight vector and {h_j(·)} are the fixed basis functions, i.e., the transfer functions of the hidden units. The flexibility of f, its ability to fit many different functions, derives only from the freedom to choose different values for the weights. The basis functions and any parameters they might contain are fixed. If the basis functions can change during the learning process, then the model is nonlinear. Radial functions are a special class of function: their characteristic feature is that their response decreases (or increases) monotonically with distance from a central point. The center, the distance scale, and the precise shape of the radial function are parameters of the model, all fixed if it is linear. A typical radial function is the Gaussian which, in the case of a scalar input, is

h(x) = exp(-(x - c)² / r²)   (5-2)

Its parameters are its center c and its radius r. Radial functions are simply a class of functions.
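Equations 5-1 and 5-2 combine into a tiny evaluator. The two-unit toy network below (its centers, radii and weights) is a made-up illustration, not a trained model:

```python
import math

def gaussian_rbf(x, c, r):
    """Equation 5-2: h(x) = exp(-((x - c) / r)^2), peaking at the center c."""
    return math.exp(-((x - c) / r) ** 2)

def rbf_output(x, centers, radii, weights):
    """Equation 5-1: f(x) = sum_j w_j * h_j(x)."""
    return sum(w * gaussian_rbf(x, c, r)
               for w, c, r in zip(weights, centers, radii))

# Two hidden units; near x = 0 the first unit dominates the response, since
# the second unit's Gaussian has decayed to almost nothing by then.
centers, radii, weights = [0.0, 5.0], [1.0, 1.0], [2.0, -1.0]
print(rbf_output(0.0, centers, radii, weights))
```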
In principle, radial functions could be employed in any sort of model, linear or nonlinear, and in any sort of network, single-layer or multilayer. Training the RBF network involves selecting the center locations and radii (which are eventually used to determine the weights) using a regression tree. A regression tree recursively partitions the input data set into subsets according to decision criteria. As a result, there is a root node, nonterminal nodes (having sub-nodes) and terminal nodes (having no sub-nodes), each associated with a subset of the input data. Each node contributes one unit to the RBF network's center and radius vectors; the selection of RBF centers is performed by recursively parsing the regression tree nodes using a strategy proposed in [35].

Combining Wavelet and Neural Network for Workload Dynamics Prediction

We view workload dynamics as a time series produced by the processor, which is a nonlinear function of its design parameter configuration. Instead of predicting this function at every sampling point, we employ wavelets to approximate it. Previous work [21, 23, 25] shows that neural networks can accurately predict aggregated workload behavior during design space exploration. Nevertheless, monolithic global neural network models lack the capability to reveal complex workload dynamics. To overcome this disadvantage, we propose using wavelet neural networks that incorporate multiscale wavelet analysis into a set of neural networks for workload dynamics prediction. The wavelet transform is a very powerful tool for dealing with dynamic behavior since it captures both global and local workload behavior using a set of wavelet coefficients.
The short-term workload characteristics are decomposed into the lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction, while the global workload behavior is decomposed into the higher scales of wavelet coefficients (low frequencies), which are used for the analysis and prediction of slow trends in the workload execution. Collectively, these coordinated scales of time and frequency provide an accurate interpretation of workload dynamics. Our wavelet neural networks use a separate RBF neural network to predict each wavelet coefficient at each scale. The predictions of the individual wavelet coefficients proceed independently. Predicting each wavelet coefficient with a separate neural network simplifies the training task of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to predict the workload dynamics. Figure 5-3 shows our hybrid neuro-wavelet scheme for workload dynamics prediction. Given the observed workload dynamics on training data, our aim is to predict workload dynamic behavior under different architecture configurations. The hybrid scheme involves three stages. In the first stage, the time series is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN, and in the third stage, the approximated time series is recovered from the predicted wavelet coefficients.

Figure 5-3 Using wavelet neural networks for workload dynamics prediction

Each RBF neural network receives the entire microarchitectural design space vector and predicts a wavelet coefficient.
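The three-stage scheme reduces to "predict the retained coefficients, zero the rest, inverse-transform". The per-coefficient predictors below are toy linear stand-ins for the trained RBF networks, and the single fetch_width input is a hypothetical design parameter, both assumptions for illustration:

```python
def haar_idwt(coeffs):
    """Inverse Haar transform (coefficients ordered approximation first)."""
    out, k = coeffs[:1], 1
    while 2 * k <= len(coeffs):
        d = coeffs[k:2 * k]
        out = [v for a, det in zip(out, d) for v in (a + det, a - det)]
        k *= 2
    return out

# One predictor per retained coefficient: stand-ins for the per-coefficient
# RBF networks, each mapping a design configuration to one coefficient.
predictors = [
    lambda cfg: 150.0 + 10.0 * cfg['fetch_width'],  # approximation (mean level)
    lambda cfg: -40.0 - 5.0 * cfg['fetch_width'],   # coarsest detail (trend)
]

def predict_dynamics(cfg, n_samples=8):
    """Stage 2: predict each retained coefficient; stage 3: reconstruct."""
    coeffs = [p(cfg) for p in predictors]
    coeffs += [0.0] * (n_samples - len(coeffs))     # unmodeled coeffs -> 0
    return haar_idwt(coeffs)

trace = predict_dynamics({'fetch_width': 4})
print(trace)   # an 8-sample predicted dynamics trace
```

Because the coefficient predictions are independent, each sub-model stays small, and the inverse transform stitches them back into a time-domain trace.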
The training of a RBF network involves determining the center point and a radius for each RBF and the weights of each RBF which determine the wavelet coefficients. Experimental Methodology We evaluate the efficiency of using wavelet neural networks to explore workload dynamics in performance, power and reliability domains during microarchitecture design space exploration. We use a unified, detailed microarchitecture simulator in our experiments. Our simulation framework, built using a heavily modified and extended version of the Simplescalar tool set, models pipelined, multipleissue, outoforder execution microprocessors with multiple level caches. Our framework uses Wattchbased power model [36]. In addition, we built the Architecture Vulnerability Factor (AVF) analysis methods proposed in [37, 38] to estimate processor microarchitecture vulnerability to transient faults. A microarchitecture structure's AVF refers to the probability that a transient fault in that hardware structure will result in incorrect program results. The AVF metric can be used to estimate how vulnerable the hardware is to soft errors during program execution. Table 51 summarizes the baseline machine configurations of our simulator. 
Table 5-1 Simulated machine configuration

Parameter              Configuration
Processor Width        8-wide fetch/issue/commit
Issue Queue            96 entries
ITLB                   128 entries, 4-way, 200 cycle miss
Branch Predictor       2K entries Gshare, 10-bit global history
BTB                    2K entries, 4-way
Return Address Stack   32 entries RAS
L1 Instruction Cache   32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access
ROB Size               96 entries
Load/Store Queue       48 entries
Integer ALU            8 IALU, 4 IMUL/DIV, 4 Load/Store
FP ALU                 8 FPALU, 4 FPMUL/DIV/SQRT
DTLB                   256 entries, 4-way, 200 cycle miss
L1 Data Cache          64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle access
L2 Cache               unified, 2MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access          64-bit wide, 200 cycles access latency

We perform our analysis using twelve SPEC CPU 2000 benchmarks: bzip2, crafty, eon, gap, gcc, mcf, parser, perlbmk, twolf, swim, vortex and vpr. We use the Simpoint tool to pick the most representative simulation point for each benchmark (with the full reference input set), and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. Each simulation covers 200M instructions. In this chapter, we consider a design space that consists of 9 microarchitectural parameters (see Table 5-2) of the superscalar architecture. These microarchitectural parameters have been shown to have the largest impact on processor performance [21]. The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed, cycle-accurate simulations, we measure processor performance, power and reliability characteristics at all design points within both the training and testing data sets. We build a separate model for each program and use the model to predict workload dynamics in the performance, power and reliability domains at unexplored points in the design space. The training data set is used to build the wavelet-based neural network models.
An estimate of each model's accuracy is obtained using the design points in the testing data set.

Table 5-2 Microarchitectural parameter ranges used for generating train/test data

Parameter    Train Range                 Test Range           # of Levels
Fetch width  2, 4, 8, 16                 2, 8                 4
ROB size     96, 128, 160                128, 160             3
IQ size      32, 64, 96, 128             32, 64               4
LSQ size     16, 24, 32, 64              16, 24, 32           4
L2 size      256, 1024, 2048, 4096 KB    256, 1024, 4096 KB   4
L2 lat       8, 12, 14, 16, 20           8, 12, 14            5
il1 size     8, 16, 32, 64 KB            8, 16, 32 KB         4
dl1 size     8, 16, 32, 64 KB            16, 32, 64 KB        4
dl1 lat      1, 2, 3, 4                  1, 2, 3              4

To build a representative design space, one needs to ensure that the sampled data sets spread points throughout the design space yet remain unique and small enough to keep the model building cost low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling strategy, since it provides better coverage than a naive random sampling scheme. We generate multiple LHS matrices and apply a space-filling metric called L2-star discrepancy [40] to each LHS matrix to find the unique and most representative design space, i.e., the one with the lowest L2-star discrepancy value. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. We used 200 training points and 50 test points for workload dynamics prediction, since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. In our study, each workload dynamics trace is represented by 128 samples. Predicting each wavelet coefficient with a separate neural network simplifies the learning task. Since complex workload dynamics can be captured using a limited number of wavelet coefficients, the total size of the wavelet neural networks can be small.
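The LHS-style sampling described above can be sketched with a stripped-down, discrete-level variant. The level lists below mirror a few Table 5-2 parameters, and the L2-star discrepancy filtering over multiple candidate matrices is omitted here:

```python
import random

def latin_hypercube(levels_per_param, n_samples, seed=0):
    """For each parameter, cycle its levels to length n_samples and shuffle,
    so every level appears (nearly) equally often in that column; then zip
    the columns into design points. A simplified, discrete-level LHS."""
    rng = random.Random(seed)
    columns = []
    for levels in levels_per_param:
        col = (levels * (n_samples // len(levels) + 1))[:n_samples]
        rng.shuffle(col)
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

space = [[2, 4, 8, 16],      # fetch width
         [96, 128, 160],     # ROB size
         [32, 64, 96, 128]]  # IQ size
points = latin_hypercube(space, 8)
print(len(points), points[0])
```

With 8 samples over 4 fetch-width levels, each level shows up exactly twice, which is the per-dimension stratification that naive random sampling does not guarantee.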
Because small-magnitude wavelet coefficients contribute little to the reconstructed data, we opt to predict only a small set of important wavelet coefficients.

Figure 5-4 Magnitude-based ranking of 128 wavelet coefficients

Specifically, we consider the following two schemes for selecting important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0; and (2) order-based: select the first k coefficients and approximate the rest with 0. In this study, we choose the magnitude-based scheme since it always outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients does not change drastically across the design space. Figure 5-4 illustrates the magnitude-based ranking (shown as a color map where red indicates high ranks and blue indicates low ranks) of the 128 wavelet coefficients (decomposed from the dynamics of benchmark gcc) across 50 different microarchitecture configurations. As can be seen, the top-ranked wavelet coefficients remain largely consistent across different processor configurations.

Evaluation and Results

In this section, we present detailed experimental results on using wavelet neural networks to predict workload dynamics in the performance, reliability and power domains. The workload dynamics prediction accuracy measure is the mean squared error (MSE), defined as follows:

MSE = (1/N) Σ_{k=1}^{N} (x(k) - x̂(k))²   (5-3)

where x(k) is the actual value, x̂(k) is the predicted value, and N is the total number of samples. As prediction accuracy increases, the MSE becomes smaller.
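The magnitude-based selection and the MSE of Equation 5-3 translate directly to code; the coefficient values below are made up for illustration:

```python
def top_k_by_magnitude(coeffs, k):
    """Magnitude-based scheme: keep the k largest-|.| coefficients, zero the
    rest (the order-based alternative would keep the first k instead)."""
    keep = set(sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def mse(actual, predicted):
    """Equation 5-3: mean squared error over N samples."""
    return sum((x - p) ** 2 for x, p in zip(actual, predicted)) / len(actual)

coeffs = [200.0, -100.0, 3.0, -0.5, 12.0, 0.1, -7.0, 2.5]
print(top_k_by_magnitude(coeffs, 3))  # keeps 200.0, -100.0 and 12.0
print(mse([1.0, 2.0], [1.0, 4.0]))    # (0 + 4) / 2 = 2.0
```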
Figure 5-5 MSE boxplots of workload dynamics prediction

The workload dynamics prediction accuracies in the performance, power, and reliability domains are plotted as boxplots (Figure 5-5). Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between "hinges," which are approximately the first and third quartiles of the MSE values. Thus, about 50% of the data are located within the box, and its height equals the interquartile range. The horizontal line in the interior of the box is located at the median of the data; it shows the center of the distribution of the MSE values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. In Figure 5-5, the line with diamond-shaped markers indicates the average MSE across all test cases. Figure 5-5 shows that the performance model achieves median errors ranging from 0.5 percent (swim) to 8.6 percent (mcf), with an overall median error across all benchmarks of 2.3 percent. As can be seen, even though the maximum error at any design point for any benchmark is 30%, most benchmarks show an MSE of less than 10%. This indicates that our proposed neuro-wavelet scheme can forecast the dynamic behavior of program performance characteristics with high accuracy. Figure 5-5 also shows that the power models are slightly less accurate, with median errors ranging from 1.3 percent (vpr) to 4.9 percent (crafty) and an overall median of 2.6 percent. The power prediction has a high maximum value of 35%. These errors are much smaller in the reliability domain.
In general, workload dynamics prediction accuracy increases when more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients. Cost-effective models should provide high prediction accuracy while maintaining low complexity. Figure 5-6 shows the trend of prediction accuracy (the average statistics over all benchmarks) when various numbers of wavelet coefficients are used.

Figure 5-6 MSE trends (CPI, power, AVF) with an increasing number of wavelet coefficients (16 to 128)

As can be seen, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point reduces error at a lower rate. This is because wavelets provide good time and locality characterization, and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide an interpretation of the series structures across scales of the time and frequency domains. The capability of a limited set of wavelet coefficients to capture workload dynamics varies with the resolution level.

Figure 5-7 MSE trends (CPI, power, AVF) with increased sampling frequency (64 to 1024 samples)

Figure 5-7 illustrates the MSE (the average statistics over all benchmarks) of predictive models that use 16 wavelet coefficients when the number of samples varies from 64 to 1024. As the sampling frequency increases, the same number of wavelet coefficients becomes less accurate at capturing workload dynamic behavior. As can be seen, however, the increase in MSE is not significant. This suggests that the proposed schemes can capture workload dynamic behavior of increasing complexity.
Our RBF neural networks were built using a regression-tree-based method. In the regression tree algorithm, all input microarchitecture parameters are ranked based on either split order or split frequency. The microarchitecture parameters that cause the most output variation tend to be split earliest and most often in the constructed regression tree. Therefore, the microarchitecture parameters that largely determine the value of a wavelet coefficient are located higher in the regression tree and have a larger number of splits than the others.

Figure 5-8 Roles of microarchitecture design parameters (star plots for CPI, power, and AVF across benchmarks)

We present in Figure 5-8 (shown as star plots) the initial and most frequent splits within the regression trees that model the most significant wavelet coefficients. A star plot [41] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equiangular spokes, called radii, with each spoke representing one of the variables. The data length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information such as: which variables are dominant for a given data set, and which observations show similar behavior. For example, on benchmark gcc, Fetch, dl1, and LSQ play significant roles in predicting dynamic behavior in the performance domain, while ROB, Fetch, and dl1_lat largely affect workload dynamic behavior in the reliability domain. For the benchmark gcc, the microarchitecture parameters most frequently involved in regression tree construction are ROB, LSQ, L2, and L2_lat in the performance domain and LSQ and Fetch in the reliability domain.
Compared with models that only predict aggregated workload behavior, our proposed methods can forecast workload runtime execution scenarios. This feature is essential if the predictive models are employed to trigger runtime dynamic management mechanisms for power and reliability optimizations. Inadequate prediction of workload worst-case scenarios could make microprocessors fail to meet the desired power and reliability targets. Conversely, false alarms caused by over-prediction of worst-case scenarios can trigger responses too frequently, resulting in significant overhead. In this section, we study the suitability of the proposed schemes for workload execution scenario based classification. Specifically, for a given workload characteristics threshold, we calculate how many sampling points in a trace that represents workload dynamics are above or below the threshold. We then apply the same calculation to the predicted workload dynamics trace. We use the directional symmetry (DS) metric, i.e., the percentage of correctly predicted directions with respect to the target variable, defined as

DS = (100/N) Σ_{k=1}^{N} φ(x(k), x̂(k))    (5-4)

where φ(x, x̂) = 1 if x and x̂ are both above or both below the threshold and φ(x, x̂) = 0 otherwise. Thus, the DS provides a measure of how often the sign of the target is correctly forecasted; DS = 50% implies that the predicted direction was correct for half of the predictions. In this work, we set three threshold levels (named 1Q, 2Q, and 3Q in Figure 5-9) between the max and min values in each trace as follows, where 1Q is the lowest threshold level and 3Q is the highest:

1Q = MIN + (MAX − MIN) × (1/4)
2Q = MIN + (MAX − MIN) × (2/4)
3Q = MIN + (MAX − MIN) × (3/4)

Figure 5-9 Threshold-based workload execution scenarios

Figure 5-10 shows the results of threshold-based workload dynamic behavior classification. The results are presented as directional asymmetry, which can be expressed as 1 − DS.
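The DS metric and the quartile-style thresholds can be computed directly from a pair of traces. The sketch below is illustrative only; the toy traces stand in for simulated and predicted workload dynamics, and the 2Q threshold is formed exactly as defined above.

```python
import numpy as np

def directional_symmetry(actual, predicted, threshold):
    """DS (Eq. 5-4): percentage of samples where the predicted and actual
    values fall on the same side of the threshold."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    same_side = (actual > threshold) == (predicted > threshold)
    return 100.0 * np.mean(same_side)

# Toy traces standing in for simulated vs. predicted workload dynamics.
trace = np.array([0.2, 0.55, 0.9, 0.4, 0.7, 0.1])
pred  = np.array([0.3, 0.40, 0.8, 0.6, 0.9, 0.2])
q2 = trace.min() + (trace.max() - trace.min()) * 2 / 4   # the 2Q threshold
ds = directional_symmetry(trace, pred, q2)
```

A perfect predictor gives DS = 100%, and the directional asymmetry reported in Figure 5-10 is simply 100% minus this value.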
As can be seen, not only can our wavelet-based RBF neural networks effectively capture workload dynamics, but they can also accurately classify workload execution into different scenarios. This suggests that proactive dynamic power and reliability management schemes can be built on the proposed models. For instance, given a power/reliability threshold, our wavelet RBF neural networks can be used to forecast the workload execution scenario. If the predicted workload characteristics exceed the threshold level, the processor can start to respond before power/reliability reaches or surpasses the threshold level.

Figure 5-10 Threshold-based workload execution scenario classification (CPI, power, and AVF at thresholds 1Q, 2Q, and 3Q)

Figure 5-11 further illustrates detailed workload execution scenario predictions on benchmark bzip2. Both simulation and prediction results are shown. The predicted results closely track the varied program dynamic behavior in the different domains.

Figure 5-11 Threshold-based workload scenario prediction: (a) performance, (b) power, (c) reliability

Workload Dynamics Driven Architecture Design Space Exploration

In this section, we present a case study to demonstrate the benefit of applying workload dynamics prediction to early architecture design space exploration. Specifically, we show that workload dynamics prediction models can effectively forecast the worst-case operating conditions for soft-error vulnerability and accurately estimate the efficiency of soft-error vulnerability management schemes. Because of technology scaling, radiation-induced soft errors contribute more and more to the failure rate of CMOS devices.
Therefore, soft error rate is an important reliability issue in deep-submicron microprocessor design. Processor microarchitecture soft-error vulnerability exhibits significant runtime variation, and it is neither economical nor practical to design fault-tolerant schemes that target the worst-case operating condition. Dynamic Vulnerability Management (DVM) refers to a set of strategies that control hardware runtime soft-error susceptibility under a tolerable threshold. DVM allows designers to achieve higher dependability on hardware designed for a lower reliability setting. If a particular execution period exceeds the predefined vulnerability threshold, a DVM response (Figure 5-12) works to reduce hardware vulnerability.

Figure 5-12 Dynamic Vulnerability Management: designed-for reliability capacity with and without DVM, the DVM trigger level, performance overhead, and the periods when DVM is engaged and disengaged

A primary goal of DVM is to maintain vulnerability within a predefined reliability target during the entire program execution. DVM is triggered once the hardware soft-error vulnerability exceeds the predefined threshold. Once the trigger goes on, a DVM response begins. Depending on the type of response chosen, there may be some performance degradation. A DVM response can be turned off as soon as the vulnerability drops below the threshold. To successfully achieve the desired reliability target and effectively mitigate the overhead of DVM, architects need techniques to quickly infer application worst-case operating conditions across design alternatives and to accurately estimate the efficiency of DVM schemes at an early design stage. We developed a DVM scheme to manage runtime instruction queue (IQ) vulnerability to soft error.
DVM_IQ {
    ACE_bits_counter_updating();
    if current context has L2 cache misses then
        stall dispatching instructions for current context;
    every (sample_interval/5) cycles {
        if online IQ AVF > trigger threshold then
            wqratio = wqratio / 2;
        else
            wqratio = wqratio + 1;
    }
    if (ratio of waiting instruction # to ready instruction # > wqratio) then
        stall dispatching instructions;
}

Figure 5-13 IQ DVM pseudocode

Figure 5-13 shows the pseudocode of our DVM policy. The DVM scheme computes the online IQ AVF to estimate runtime microarchitecture vulnerability. The estimated AVF is compared against a trigger threshold to determine whether it is necessary to enable a response mechanism. To reduce IQ soft-error vulnerability, we throttle instruction dispatching from the ROB to the IQ upon an L2 cache miss. Additionally, we sample the IQ AVF at a finer granularity and compare the sampled AVF with the trigger threshold. If the IQ AVF exceeds the trigger threshold, a parameter wqratio, which specifies the ratio of the number of waiting instructions to that of ready instructions in the IQ, is updated. The purpose of this parameter is to maintain performance by allowing an appropriate fraction of waiting instructions in the IQ to exploit ILP. By maintaining a desired ratio between the waiting instructions and the ready instructions, vulnerability can be reduced at negligible performance cost. The wqratio update is triggered by the estimated IQ AVF. In our DVM design, wqratio is adapted through slow increases and rapid decreases in order to ensure a quick response to a vulnerability emergency. We built workload dynamics predictive models that incorporate DVM as a new design parameter. Therefore, our models can predict workload execution scenarios with and without the DVM feature across different microarchitecture configurations. Figure 5-14 shows the results of using the predictive models to forecast the IQ AVF of benchmark gcc across two microarchitecture configurations.
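The control loop of Figure 5-13 can be paraphrased as the hypothetical sketch below. The function names and the floor on wqratio are illustrative assumptions, and the ACE-bit accounting that produces the online IQ AVF is abstracted into a plain input value.

```python
def dvm_step(sampled_iq_avf, wqratio, trigger_threshold):
    """Adapt wqratio as in Figure 5-13: rapid decrease (halving) on a
    vulnerability emergency, slow increase (+1) otherwise."""
    if sampled_iq_avf > trigger_threshold:
        return max(1, wqratio // 2)   # floor at 1 is an assumption of this sketch
    return wqratio + 1

def stall_dispatch(n_waiting, n_ready, wqratio, l2_miss):
    """Dispatch-throttling conditions: an L2 cache miss in the current
    context, or too many waiting instructions relative to ready ones."""
    if l2_miss:
        return True
    return n_ready > 0 and n_waiting / n_ready > wqratio
```

The asymmetric update (halving vs. incrementing) is what gives the policy its fast reaction to vulnerability emergencies and its slow recovery toward higher ILP.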
Figure 5-14 Workload dynamics prediction with scenario-based architecture optimization: (a) scenario 1 and (b) scenario 2, each with DVM disabled and enabled

We set the DVM target to 0.3, which means the DVM policy, when enabled, should maintain the IQ AVF below 0.3 during workload execution. In both cases, the IQ AVF dynamics were predicted with DVM disabled and enabled. As can be seen, in scenario 1 the DVM successfully achieves its goal. In scenario 2, despite the enabled DVM feature, the IQ AVF of certain execution periods is still above the threshold. This implies that the developed DVM mechanism is suitable for the microarchitecture configuration used in scenario 1. On the other hand, architects have to choose another DVM policy if the microarchitecture configuration shown in scenario 2 is chosen in their design. Figure 5-14 shows that in all cases, the predictive models can accurately forecast the trends in IQ AVF dynamics due to architecture optimizations. Figure 5-15 (a) shows the prediction accuracy of IQ AVF dynamics when the DVM policy is enabled. The results are shown for all 50 microarchitecture configurations in our test data set.

Figure 5-15 Heat plots showing the MSE of (a) IQ AVF and (b) processor power

Since deploying the DVM policy also affects runtime processor power behavior, we further build models to forecast the processor power dynamic behavior due to the DVM. The results are shown in Figure 5-15 (b). The data are presented as heat plots, which map the actual data values onto a color scale with a dendrogram added at the top. A dendrogram consists of many U-shaped lines connecting objects in a hierarchical tree.
The height of each U represents the distance between the two objects being connected. For a given benchmark, a vertical trace line shows the scaled MSE values across all test cases. Figure 5-15 (a) shows that the predictive models yield high prediction accuracy across all test cases on benchmarks swim, eon, and vpr. The models yield prediction variation on benchmarks gcc, crafty, and vortex. In the power domain, prediction accuracy is more uniform across benchmarks and microarchitecture configurations. In Figure 5-16, we show the IQ AVF MSE when different DVM thresholds are set. The results suggest that our predictive models work well when different DVM targets are considered.

Figure 5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds (0.2, 0.3, and 0.5)

CHAPTER 6
ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTICORE ARCHITECTURES

Early design space exploration is an essential ingredient of modern processor development. It significantly reduces time to market and post-silicon surprises. The trend toward multi-/many-core processors will result in sophisticated large-scale architecture substrates (e.g., non-uniformly accessed caches [43] interconnected by networks-on-chip [44]) with self-contained hardware components (e.g., cache banks, routers, and interconnect links) proximate to the individual cores but globally distributed across all cores. As the number of cores on a processor increases, these large and sophisticated multicore-oriented architectures exhibit increasingly complex and heterogeneous characteristics. As an example, to alleviate the deleterious impact of wire delays, architects have proposed splitting large L2/L3 caches into multiple banks, with each bank having a different access latency depending on its physical proximity to the cores.
Figure 6-1 illustrates normalized cache hits (results are plotted as color maps) across the 256 cache banks of a non-uniform cache architecture (NUCA) [43] design on an 8-core chip multiprocessor (CMP) running the SPLASH-2 ocean-co workload. The 2D architecture spatial patterns yielded on NUCA with different architecture design parameters are shown.

Figure 6-1 Variation of cache hits across a 256-bank non-uniform access cache on an 8-core CMP

As can be seen, there is significant variation in cache access frequency across individual cache banks. At larger scales, the manifested two-dimensional spatial characteristics across the entire NUCA substrate vary widely with different design choices while executing the same code base. In this example, various NUCA cache configurations such as network topologies (e.g., hierarchical, point-to-point, and crossbar) and data management schemes (e.g., static (S-NUCA) [43], dynamic (D-NUCA) [45, 46], and dynamic with replication (R-NUCA) [47-49]) are used. As the number of parameters in the design space increases, such variation and characteristics at large scales cannot be captured without slow and detailed simulations. However, using simulation-based methods for architecture design space exploration where numerous design parameters must be considered is prohibitively expensive. Recently, various predictive models [20-25, 50] have been proposed to cost-effectively reason about processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous behavior of large and distributed architecture substrates across the design space. This limitation will only be exacerbated by the rapidly increasing integration scale (e.g., number of cores per chip).
Therefore, there is a pressing need for novel and cost-effective approaches that achieve accurate and informative design tradeoff analysis for large and sophisticated architectures in the upcoming multi-/many-core era. Thus, in this chapter, instead of quantifying these large and sophisticated architectures with a single number or a simple statistical distribution, we propose techniques that employ 2D wavelet multiresolution analysis and neural network nonlinear regression modeling. With our schemes, the complex spatial characteristics that workloads exhibit across large architecture substrates are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of wavelet coefficients, our models can accurately reconstruct architecture 2D spatial behavior across the design space. Using both multiprogrammed and multithreaded workloads, we extensively evaluate the efficiency of using 2D wavelet neural networks for predicting the complex behavior of non-uniformly accessed cache designs with widely varied configurations.

Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction

We view the 2D spatial characteristics yielded on large and distributed architecture substrates as a nonlinear function of the architecture design parameters. Instead of inferring the spatial behavior by exhaustively obtaining architecture characteristics on each individual node/component, we employ wavelet analysis to approximate it and then use a neural network to forecast the approximated behavior across a large architecture design space. Previous work [21, 23, 25, 50] shows that neural networks can accurately predict aggregated workload behavior across varied architecture configurations. Nevertheless, monolithic global neural network models lack the ability to informatively reveal complex workload/architecture interactions at a large scale.
To overcome this disadvantage, we propose combining 2D wavelet transforms and neural networks, incorporating multiresolution analysis into a set of neural networks for spatial characteristics prediction of multicore-oriented architecture substrates. The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both the global trend and the local variation of large data sets using a small set of wavelet coefficients. The local characteristics are decomposed into lower-scale wavelet coefficients (high frequencies), which are used for detailed analysis and prediction of individual or subsets of cores/components, while the global trend is decomposed into higher-scale wavelet coefficients (low frequencies), which are used for the analysis and prediction of slow trends across many cores or distributed hardware components. Collectively, these wavelet coefficients provide an accurate interpretation of the spatial trend and details of complex workload behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict each individual wavelet coefficient. The separate predictions of wavelet coefficients proceed independently. Predicting each wavelet coefficient with a separate neural network simplifies the training task (which can be performed concurrently) of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to synthesize the spatial patterns on large-scale architecture substrates. Figure 6-2 shows our hybrid neuro-wavelet scheme for architecture 2D spatial characteristics prediction. Given the observed spatial behavior on training data, our aim is to predict the 2D behavior of large-scale architectures under different design configurations.
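To make the decomposition and reconstruction stages concrete, the following sketch applies a one-level 2D Haar transform (a simple stand-in for the 2D wavelet analysis used in this work) to a toy 16x16 bank-hit matrix, keeps the k largest-magnitude coefficients (the ones the per-coefficient neural networks would predict), and inverts the transform. The matrix contents and k are arbitrary illustrative choices.

```python
import numpy as np

def haar2d(a):
    """One-level 2D orthonormal Haar transform: transform rows, then columns.
    The result packs the LL/LH/HL/HH subbands into one matrix."""
    def rows(m):
        lo = (m[:, 0::2] + m[:, 1::2]) / np.sqrt(2.0)
        hi = (m[:, 0::2] - m[:, 1::2]) / np.sqrt(2.0)
        return np.hstack([lo, hi])
    return rows(rows(a).T).T

def ihaar2d(c):
    """Exact inverse of haar2d: undo the column transform, then the rows."""
    def irows(m):
        h = m.shape[1] // 2
        lo, hi = m[:, :h], m[:, h:]
        out = np.empty_like(m)
        out[:, 0::2] = (lo + hi) / np.sqrt(2.0)
        out[:, 1::2] = (lo - hi) / np.sqrt(2.0)
        return out
    return irows(irows(c.T).T)

# Toy stand-in for normalized per-bank cache hits on a 16x16 layout.
rng = np.random.default_rng(1)
hits = rng.random((16, 16))
coeffs = haar2d(hits)
k = 64                                           # coefficients to retain/predict
cutoff = np.sort(np.abs(coeffs).ravel())[-k]
trimmed = np.where(np.abs(coeffs) >= cutoff, coeffs, 0.0)
approx = ihaar2d(trimmed)                        # synthesized 2D spatial pattern
```

In the full scheme, each retained coefficient would come from a separate RBF network fed with the design-parameter vector rather than from the simulated matrix itself; the inverse transform step is the same either way.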
Figure 6-2 Using wavelet neural networks for forecasting architecture 2D characteristics

The hybrid scheme involves three stages. In the first stage, the observed spatial behavior is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design space vector and predicts one wavelet coefficient. Training an RBF network involves determining the center point and a radius for each RBF, and the weights of each RBF, which determine the wavelet coefficients.

Experimental Methodology

We evaluate the efficiency of 2D wavelet neural networks for forecasting spatial characteristics of large-scale multicore NUCA designs using the GEMS 1.2 [51] toolset interfaced with the Simics [52] full-system functional simulator. We simulate a SPARC V9 8-core CMP running Solaris 9. We model in-order issue cores for this study to keep the simulation time tractable. The processors have private L1 caches, and the shared L2 is a 256-bank 16 MB NUCA. The private L1 caches of the different processors are kept coherent using a distributed directory-based protocol. To model the L2 cache, we use the Ruby NUCA cache simulator developed in [47], which includes an on-chip network model. The network models all messages communicated in the system, including all requests, responses, replacements, and acknowledgements. Table 6-1 summarizes the baseline machine configuration of our simulator.
Table 6-1 Simulated machine configuration (baseline)

  Parameter           Configuration
  Number of cores     8
  Issue width         1
  L1 (split I/D)      64 KB, 64 B line, write-allocate
  L2 (NUCA)           16 MB (256 x 64 KB), 64 B line
  Memory consistency  Sequential
  Memory              4 GB of DRAM, 250-cycle latency, 4 KB

Our baseline processor/L2 NUCA organization is similar to that of Beckmann and Wood [47] and is illustrated in Figure 6-3. Each processor core (including L1 data and instruction caches) is placed on the chip boundary, and eight such cores surround a shared L2 cache. The L2 is partitioned into 256 banks (grouped as 16 bank clusters) and connected with an interconnection network. Each core has a cache controller that routes the core's requests to the appropriate cache bank. The NUCA design space is very large. In this chapter, we consider a design space that consists of 9 parameters (see Table 6-2) of the CMP NUCA architecture.

Figure 6-3 Baseline CMP with 8 cores that share a NUCA L2 cache

Table 6-2 Considered architecture design parameters and their ranges

  Parameter                        Range
  NUCA management policy (NUCA)    S-NUCA, D-NUCA, R-NUCA
  Network topology (net)           Hierarchical, PT-to-PT, Crossbar
  Network link latency (net_lat)   20, 30, 40, 50
  L1 latency (L1_lat)              1, 3, 5
  L2 latency (L2_lat)              6, 8, 10, 12
  L1 associativity (L1_aso)        1, 2, 4, 8
  L2 associativity (L2_aso)        2, 4, 8, 16
  Directory latency (d_lat)        30, 60, 80, 100
  Processor buffer size (p_buf)    5, 10, 20

These design parameters cover the NUCA data management policy (NUCA), interconnection topology and latency (net and net_lat), the configurations of the L1 and L2 caches (L1_lat, L2_lat, L1_aso and L2_aso), cache coherency directory latency (d_lat), and the number of cache accesses that a processor core can issue to the L1 (p_buf). The ranges for these parameters were set to include both typical and feasible design points within the explored design space.
We studied the CMP NUCA designs using various multiprogrammed and multithreaded workloads (listed in Table 6-3).

Table 6-3 Workloads

  Multiprogrammed Workloads                                              Description
  Group 1: gcc (8 copies)                                                Homogeneous
  Group 2: mcf (8 copies)                                                Homogeneous
  Group 1 (CPU): gap, bzip2, equake, gcc, mesa, perlbmk, parser, ammp    Heterogeneous
  Group 2 (MIX): perlbmk, mcf, bzip2, vpr, mesa, art, gcc, equake        Heterogeneous
  Group 3 (MEM): mcf, twolf, art, ammp, equake, mcf, art, mesa           Heterogeneous

  Multithreaded Workloads (SPLASH-2)    Data Set
  barnes                                16k particles
  fmm                                   input.16348
  ocean-co                              514x514 ocean body
  ocean-nc                              258x258 ocean body
  water-ns                              512 molecules
  cholesky                              tk15.0
  fft                                   65,536 complex data points
  radix                                 256k keys, 1024 radix

Our heterogeneous multiprogrammed workloads consist of a mix of programs from the SPEC 2000 benchmarks with full reference input sets. The homogeneous multiprogrammed workloads consist of multiple copies of an identical SPEC 2000 program. For multiprogrammed workload simulations, we fast-forward until all benchmarks pass their initialization phases. For multithreaded workloads, we used 8 benchmarks from the SPLASH-2 suite [53]; we mark the initialization phase in the software code and skip it in our simulations. In all simulations, we first warm up the cache model. After that, each simulation runs 500 million instructions or to benchmark completion, whichever is less. Using detailed simulation, we obtain the 2D architecture characteristics of the large-scale NUCA at all design points within both the training and testing data sets. We build a separate model for each workload and use the model to predict architecture 2D spatial behavior at unexplored points in the design space. The training data set is used to build the 2D wavelet neural network models. An estimate of the model's accuracy is obtained by using the design points in the testing data set.
To build a representative design space, one needs to ensure that the sample data sets spread points throughout the design space while keeping the sample small enough to keep the cost of building the model low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) as our sampling strategy since it provides better coverage than a naive random sampling scheme. We generate multiple LHS matrices and apply a space-filling metric called the L2-star discrepancy to each LHS matrix to find the representative design space that has the lowest value of L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this chapter, we used 200 training points and 50 test points since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. The 2D NUCA architecture characteristics (normalized cache hit numbers) across the 256 banks (with the geometry layout of Figure 6-3) are represented by a matrix. Predicting each wavelet coefficient with a separate neural network simplifies the learning task. Since complex spatial patterns on large-scale multicore architecture substrates can be captured using a limited number of wavelet coefficients, the total size of the wavelet neural networks is small and the computation overhead is low. Because small-magnitude wavelet coefficients contribute little to the reconstructed data, we opt to predict only a small set of important wavelet coefficients. Specifically, we consider the following two schemes for selecting important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0, and (2) order-based: select the first k coefficients and approximate the rest with 0.
In this study, we use the magnitude-based scheme since it consistently outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients does not change drastically across the design space. Our experimental results show that the top-ranked wavelet coefficients largely remain consistent across different architecture configurations.

Evaluation and Results

In this section, we present detailed experimental results using 2D wavelet neural networks to forecast the complex, heterogeneous patterns of large-scale multicore substrates running various workloads without detailed simulation. The prediction accuracy measure is the mean error (ME), defined as follows:

ME = (1/N) Σ_{k=1}^{N} |x̂(k) − x(k)| / x(k)    (6-1)

where x(k) is the actual value, x̂(k) is the predicted value, and N is the total number of samples (e.g., the 256 NUCA banks). As prediction accuracy increases, the ME becomes smaller. The prediction accuracies are plotted as boxplots (Figure 6-4). Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between "hinges," which are approximately the first and third quartiles of the ME values.

Figure 6-4 ME boxplots of prediction accuracies with different numbers of wavelet coefficients: (a) 16, (b) 32, (c) 64, and (d) 128 wavelet coefficients

Thus, about 50% of the data are located within the box, and its height equals the interquartile range. The horizontal line in the interior of the box is located at the median of the data, and it shows the center of the distribution of the ME values.
The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or to a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. Figure 6-4 (a) shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 5.2 percent (fft) to 9.3 percent (ocean.co), with an overall median error of 6.6 percent across all experimented workloads. As can be seen, the maximum error at any design point for any benchmark is 13%, and most benchmarks show an error of less than 10%. This indicates that our proposed neuro-wavelet scheme can forecast 2D spatial workload behavior across a large and sophisticated architecture with high accuracy. Figure 6-4 (b-d) shows that, in general, the geospatial characteristics prediction accuracy increases when more wavelet coefficients are involved. Note that the complexity of the predictive models is proportional to the number of wavelet coefficients. Cost-effective models should provide high prediction accuracy while maintaining low complexity and computation overhead. The trend of prediction accuracy (Figure 6-4) indicates that for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point reduces error at a diminishing rate. This is because wavelets provide good localization capability and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide an interpretation of the spatial patterns among a large number of NUCA banks on a two-dimensional plane.

Figure 6-5 illustrates the predicted 2D NUCA behavior across four different configurations (A-D) on the heterogeneous multiprogrammed workload MIX (see Table 6-3) when different numbers of wavelet coefficients (16 to 256) are used.

Figure 6-5 Predicted 2D NUCA behavior using different numbers of wavelet coefficients

The simulation results (org) are also shown for comparison purposes. We further compare the accuracy of our proposed scheme with that of approximating NUCA spatial patterns by predicting the hit rates of 16 evenly distributed cache banks across the 2D plane. The results shown in Table 6-4 indicate that, using the same number of neural networks, our scheme yields significantly higher accuracy than conventional predictive models. If conventional neural network models were built at fine-grain scales (e.g., one model per NUCA bank), the model building/training overhead would be nontrivial. Since we can accurately forecast the behavior of large-scale NUCA structures by predicting only a small set of wavelet coefficients, we expect our methods to be scalable to even larger architecture substrates.

Table 6-4 Error comparison of predicting raw cache banks vs. 2D DWT

Benchmark    Error (Raw), %    Error (2D DWT), %
gcc(x8)      126               8
mcf(x8)      71                7
CPU          102               9
MIX          86                8
MEM          122               8
barnes       136               6
fmm          363               6
ocean.co     99                9
ocean.nc     136               6
water.sp     97                7
cholesky     71                7
fft          64                7
radix        92                7

Table 6-5 shows that exploring the multicore NUCA design space using the proposed predictive models can lead to several orders of magnitude speedup compared with using detailed simulations.
The speedup is calculated as the total simulation time across all 50 test cases divided by the time spent on model training plus predicting the 50 test cases. The model construction is a one-time overhead and can be amortized in the design space exploration stage, where a large number of cases need to be examined.

Table 6-5 Design space evaluation speedup (simulation vs. prediction)

Benchmark    Speedup (Simulation vs. Prediction)
gcc(x8)      2,181x
mcf(x8)      3,482x
CPU          3,691x
MIX          472x
MEM          435x
barnes       659x
fmm          1,824x
ocean.co     1,077x
ocean.nc     1,169x
water.sp     738x
cholesky     696x
fft          670x
radix        1,010x

Our RBF neural networks were built using a regression-tree-based method. In the regression tree algorithm, all input architecture design parameters are ranked based on either split order or split frequency. The design parameters that cause the most output variation tend to be split earliest and most often in the constructed regression tree. Therefore, architecture parameters that largely determine the value of a wavelet coefficient are located higher than others in the regression tree and have a larger number of splits. We present in Figure 6-6 (shown as star plots) the initial and most frequent splits within the regression trees that model the most significant wavelet coefficients. A star plot is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equiangular spokes, called radii, with each spoke representing one of the variables. The length of a spoke is proportional to the magnitude of the variable at the data point relative to the maximum magnitude of the variable across all data points. From a star plot, we can answer questions such as: which variables are dominant for a given data set, and which observations show similar behavior?
Figure 6-6 Roles of design parameters (split order and split frequency) in predicting 2D NUCA behavior

For example, on the Splash-2 benchmark fmm, the network latency (net lat), processor buffer size (p buf), L2 latency (L2 lat) and L1 associativity (L1 aso) play significant roles in predicting the 2D NUCA spatial behavior, while the NUCA data management policy (NUCA) and network topology (net) largely affect the 2D spatial pattern when running the homogeneous multiprogrammed workload gcc(x8). For the benchmark cholesky, the architecture parameters most frequently involved in regression tree construction are NUCA, net lat, p buf, L2 lat and L1 aso. Differing from models that predict aggregated workload characteristics on a monolithic architecture design, our proposed methods can accurately and informatively reveal the complex patterns that workloads exhibit on large-scale architectures. This feature is essential if the predictive models are employed to examine the efficiency of design tradeoffs or to explore novel optimizations that target multi-/many-core processors. In this work, we study the suitability of using the proposed models for novel multicore-oriented NUCA optimizations.

Leveraging 2D Geometric Characteristics to Explore Cooperative Multicore-Oriented Architecture Design and Optimization

In this section, we present case studies to demonstrate the benefit of incorporating 2D workload/architecture behavior prediction into the early stages of microarchitecture design. In the first case study, we show that our geospatial-aware predictive models can effectively estimate workloads' 2D working sets and that such information can be beneficial in searching for cache-friendly workload/core mappings in multicore environments.
In the second case study, we explore using 2D thermal profile predictive models to accurately and informatively forecast the area and location of thermal hotspots across large NUCA substrates.

Case Study 1: Geospatial-aware Application/Core Mapping

Our 2D geometry-aware architecture predictive models can be used to explore global, cooperative resource management and optimization in multicore environments.

Figure 6-7 2D NUCA footprint (geometric shape) of mesa

For example, as shown in Figure 6-7, a workload exhibits 2D working sets with different geometric shapes when running on different cores. The exact shape of the access distribution depends on several factors, such as the application and the data mapping/migration policy. As shown in the previous section, our predictive models can forecast workload 2D spatial patterns across the architecture design space. To predict workload 2D geometric footprints when running on different cores, we incorporate the core location as a new design parameter and build location-aware 2D predictive models. As a result, the new model can forecast a workload's 2D NUCA footprint (represented as a cache access distribution) when it is assigned to a specific core location. We assign 8 SPEC CPU 2000 workloads to the 8-core CMP system, predict each workload's 2D NUCA footprint when running on its assigned core, and use the predicted 2D geometric working set of each workload to estimate the cache interference among the cores.

Figure 6-8.
2D cache interference in NUCA

As shown in Figure 6-8, to estimate the interference for a given core/workload mapping, we estimate both the area and the degree of overlap among the workloads' 2D NUCA footprints. We only consider the interference between a core and its two neighbors. As a result, for a given core/workload layout, we can quickly estimate the overall interference. For each NUCA configuration, we estimate the interference when workloads are randomly assigned to different cores. We use simulation to count the actual cache interference among workloads. For each test case (e.g., a specific NUCA configuration), we generate two series of cache interference statistics (one from simulation and one from the predictive model) which correspond to the scenarios in which workloads are mapped to the different cores. We then compute the Pearson correlation coefficient of the two data series. The Pearson correlation coefficient r of two data series X and Y is defined as

r = [ n Σ_{i=1}^{n} x_i y_i − (Σ_{i=1}^{n} x_i)(Σ_{i=1}^{n} y_i) ] / sqrt( [ n Σ_{i=1}^{n} x_i² − (Σ_{i=1}^{n} x_i)² ] [ n Σ_{i=1}^{n} y_i² − (Σ_{i=1}^{n} y_i)² ] )    (6-2)

If two data series X and Y show a highly positive correlation, their Pearson correlation coefficient is close to 1. Consequently, if the cache interference can be accurately estimated using the overlap between the predicted 2D NUCA footprints, we should observe a nearly perfect correlation between the two metrics.

Figure 6-9 Pearson correlation coefficients for Group 1 (CPU), Group 2 (MIX) and Group 3 (MEM) (all 50 test cases are shown)

Figure 6-9 shows that there is a strong correlation between the interference estimated using the predicted 2D NUCA footprints and the interference statistics obtained using simulation.
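Equation (6-2) can be computed directly term by term. The sketch below (function name ours) mirrors that form:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series,
    written term by term as in Eq. (6-2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.dot(x, y) - x.sum() * y.sum()
    den = np.sqrt((n * np.dot(x, x) - x.sum() ** 2) *
                  (n * np.dot(y, y) - y.sum() ** 2))
    return num / den
```

For two interference series that agree up to an affine rescaling, `pearson_r` returns 1, which is exactly the "nearly perfect correlation" criterion used above.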
The highly positive Pearson correlation coefficient values show that, by using the predictive model, designers can quickly devise the optimal core allocation for a given set of workloads. Alternatively, the information can be used by the OS to guide cache-friendly thread scheduling in multicore environments.

Case Study 2: 2D Thermal Hotspot Prediction

Thermal issues are becoming a first-order design parameter for large-scale CMP architectures. High operating temperatures and hotspots can limit performance and manufacturability. We use the HotSpot [54] thermal model to obtain the temperature variation across the 256 NUCA banks. We then build analytical models using the proposed methods to forecast the 2D thermal behavior of a large NUCA cache with different configurations. Our predictive model can help designers insightfully predict potential thermal hotspots and assess the severity of thermal emergencies. Figure 6-10 shows the simulated and predicted thermal behavior on different workloads. The temperatures are normalized between the maximal and minimal values across the NUCA chip. As can be seen, the 2D thermal predictive models can accurately and informatively forecast both the size and the location of thermal hotspots in a large-scale architecture.

Figure 6-10 2D NUCA thermal profile (simulation vs. prediction) for (a) Ocean-NC, (b) gcc(x8), (c) MEM and (d) Radix

Figure 6-11 NUCA 2D thermal prediction error

The thermal prediction accuracy (average statistics) across the three workload categories is shown in Figure 6-11, which also shows the accuracy when different numbers of wavelet coefficients are used in prediction.
The results show that our predictive model can be used to cost-effectively analyze the thermal behavior of large architecture substrates. In addition, our proposed technique can be used to evaluate the efficiency of thermal management policies at a large scale. For example, thermal hotspots can be mitigated by throttling the number of accesses to a cache bank for a certain period when its temperature reaches a threshold. We build analytical models that incorporate thermal-aware cache access throttling as a design parameter. As a result, our predictive model can forecast the thermal hotspot distribution across the 2D NUCA cache banks when the dynamic thermal management (DTM) policy is enabled or disabled. Figure 6-12 shows the thermal profiles before and after the thermal management policy is applied (both prediction and simulation results) for the benchmark Ocean-NC. As can be seen, they track each other very well. In terms of the time taken for design space exploration, our proposed models have orders of magnitude less overhead: the time required to predict the thermal behavior is much less than that of full-system multicore simulation. For example, thermal hotspot estimation is over 2 x 10^5 times faster than thermal simulation, justifying our decision to use the predictive models. Similarly, searching for a cache-friendly workload/core mapping is 3 x 10^4 times faster than using the simulation-based method.

Figure 6-12 Temperature profile before and after a DTM policy (simulation vs. prediction)

CHAPTER 7
THERMAL DESIGN SPACE EXPLORATION OF 3D DIE STACKED MULTICORE PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS

To achieve thermally efficient 3D multicore processor designs, architects and chip designers need models with low computation overhead, which allow them to quickly explore the design space and compare different design options.
One challenge in modeling the thermal behavior of 3D die stacked multicore architectures is that the manifested thermal patterns show significant variation within each die and across different dies (as shown in Figure 7-1).

Figure 7-1 2D within-die and cross-die thermal variation (dies 1-4) in 3D die stacked multicore processors

The results were obtained by simulating a 3D die stacked quad-core processor running multiprogrammed CPU (bzip2, eon, gcc, perlbmk), MEM (mcf, equake, vpr, swim) and MIX (gcc, mcf, vpr, perlbmk) workloads. Each program within a multiprogrammed workload was assigned to a die that contains a processor core and caches. Figure 7-2 shows the 2D thermal variation on die 4 under different microarchitecture and floorplan configurations. On a given die, the two-dimensional thermal spatial characteristics vary widely with different design choices. As the number of architectural parameters in the design space increases, the complex thermal variation and characteristics cannot be captured without slow and detailed simulations. As shown in Figures 7-1 and 7-2, to explore the thermal-aware design space accurately and informatively, we need computationally efficient methods that not only predict aggregate thermal behavior but also identify both the size and the geographic distribution of thermal hotspots. In this work, we aim to develop fast and accurate predictive models to achieve this goal.

Figure 7-2 2D thermal variation on die 4 under different microarchitecture and floorplan configurations (A-D)

Figure 7-3 illustrates the original thermal behavior and the 2D wavelet transformed thermal behavior.
Figure 7-3 Example of using the 2D DWT to capture thermal spatial characteristics: (a) original thermal behavior; (b) 2D wavelet transformed thermal behavior

As can be seen, the 2D thermal characteristics can be effectively captured using a small number of wavelet coefficients (e.g., the averages at levels LL=1 or LL=2). Since a small set of wavelet coefficients provides concise yet insightful information on the 2D thermal spatial characteristics, we use predictive models (i.e., neural networks) to relate them individually to the various design parameters. Through the inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize the 2D thermal spatial characteristics across the design space. Compared with a simulation-based method, predicting a small set of wavelet coefficients using analytical models is computationally efficient and scales to the large thermal design space of 3D multicore architectures. Prior work has proposed various predictive models [20-25, 50] to cost-effectively reason about processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous thermal behavior across large and distributed 3D multicore architecture substrates. In this chapter, we address this important and urgent research task by developing novel 2D multiscale predictive models, which can efficiently reason about the geospatial thermal characteristics within a die and across different dies during the design space exploration stage without using detailed cycle-level simulations. Instead of quantifying the complex geospatial thermal characteristics using a single number or a simple statistical distribution, our proposed techniques employ 2D wavelet multiresolution analysis and neural network nonlinear regression modeling.
With our schemes, the thermal spatial characteristics are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of wavelet coefficients, our models can accurately reconstruct the 2D spatial thermal behavior across the design space.

Combining Wavelets and Neural Networks for 2D Thermal Spatial Behavior Prediction

We view the 2D spatial thermal characteristics of 3D integrated multicore chips as a nonlinear function of the architecture design parameters. Instead of inferring the spatial thermal behavior by exhaustively obtaining the temperature at each individual location, we employ wavelet analysis to approximate it and then use neural networks to forecast the approximated thermal behavior across a large architectural design space. Previous work [21, 23, 25, 50] shows that neural networks can accurately predict aggregated workload behavior across varied architecture configurations. Nevertheless, monolithic global neural network models lack the ability to reveal complex thermal behavior at a large scale. To overcome this disadvantage, we propose combining 2D wavelet transforms and neural networks, incorporating multiresolution analysis into a set of neural networks for spatial thermal characteristics prediction of 3D die stacked multicore designs.

Figure 7-4 Hybrid neuro-wavelet thermal prediction framework

The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both the global trend and the local variation of large data sets using a small set of wavelet coefficients.
The local characteristics are decomposed into lower-scale wavelet coefficients (high frequencies), which are used for detailed analysis and prediction of individual components or subsets of components, while the global trend is decomposed into higher-scale wavelet coefficients (low frequencies), which are used for the analysis and prediction of slow trends across each die. Collectively, these wavelet coefficients provide an accurate interpretation of the spatial trend and details of complex thermal behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict each individual wavelet coefficient. The separate predictions of the wavelet coefficients proceed independently. Predicting each wavelet coefficient with a separate neural network simplifies the training task of each sub-network (and the training can be performed concurrently). The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to synthesize the 2D spatial thermal patterns across each die. Figure 7-4 shows our hybrid neuro-wavelet scheme for 2D spatial thermal characteristics prediction. Given the observed spatial thermal behavior on the training data, our aim is to predict the 2D thermal behavior of each die in 3D die stacked multicore processors under different design configurations. The hybrid scheme involves three stages. In the first stage, the observed spatial thermal behavior in each layer is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D thermal characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design parameter vector and predicts one wavelet coefficient. The training of an RBF network involves determining the center point and radius of each RBF, as well as the RBF weights, which together determine the predicted wavelet coefficients.
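The second stage (one RBF model per wavelet coefficient) can be sketched as a Gaussian RBF regression fit by least squares. This is a simplified illustration, not the dissertation's implementation: here the centers and a shared width are given, whereas the actual models derive centers and radii via the regression-tree-based method; all names are ours.

```python
import numpy as np

def _design(X, centers, width):
    """Gaussian RBF design matrix: phi[i, j] = exp(-||x_i - c_j||^2 / (2 w^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def rbf_fit(X, y, centers, width):
    """Fit the weights of one wavelet-coefficient model by least squares.
    X: (n, d) design-parameter vectors; y: (n,) observed coefficient values."""
    Phi = _design(X, centers, width)
    weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return weights

def rbf_predict(X, centers, width, weights):
    """Predict the wavelet coefficient at new design points."""
    return _design(X, centers, width) @ weights
```

One such model is trained per retained coefficient; the predictions are then assembled and passed through the inverse 2D DWT to synthesize the per-die thermal map.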
Experimental Methodology

Floorplanning and HotSpot Thermal Model

In this study, we model four floorplans that involve processor core and cache structures, as illustrated in Figure 7-5.

Figure 7-5 Selected floorplans

As can be seen, the processor core is placed at a different location in each floorplan. Each floorplan can be chosen for a layer in the studied 3D die stacked quad-core processors. The size and adjacency of blocks are critical parameters for deriving the thermal model. The baseline core architecture and floorplan we modeled is an Alpha processor closely resembling the Alpha 21264. Figure 7-6 shows the baseline core floorplan.

Figure 7-6 Processor core floorplan

We assume a 65 nm process technology, and the floorplan is scaled accordingly. The entire die size is 21 x 21 mm and the baseline core size is 5.8 x 5.8 mm. We consider three core configurations: 2-issue (5.8 x 5.8 mm), 4-issue (8.14 x 8.14 mm) and 8-issue (11.5 x 11.5 mm). Since the total die area is fixed, the more aggressive core configurations lead to smaller L2 caches. For all three core configurations, we calculate the size of the L2 cache based on the remaining die area available. Table 7-1 lists the detailed processor core and cache configurations. We use HotSpot 4.0 [54] to simulate the thermal behavior of a 3D quad-core chip, shown in Figure 7-7. The HotSpot tool can specify the multiple layers of silicon and metal required to model a three-dimensional IC. We choose the grid-like thermal modeling mode by specifying a set of 64 x 64 thermal grid cells per die, and the average temperature of each cell (32um x 32um) is represented by a single value. HotSpot takes the power consumption data for each component block, the layer parameters and the floorplans as inputs and generates the steady-state temperature for each active layer. To build a 3D multicore processor simulator, we heavily modified and extended the M-Sim simulator [63] and incorporated the Wattch power model [36].
The power trace is generated from the developed framework with an interval size of 500K cycles. We simulate a 3D-stacked quad-core processor with one core assigned to each layer.

Table 7-1 Architecture configuration for different issue widths

Parameter            2-issue                            4-issue                            8-issue
Processor width      2-wide fetch/issue/commit          4-wide fetch/issue/commit          8-wide fetch/issue/commit
Issue queue          32                                 64                                 128
ITLB                 32 entries, 4-way, 200 cycle miss  64 entries, 4-way, 200 cycle miss  128 entries, 4-way, 200 cycle miss
Branch predictor     512-entry Gshare, 10-bit history   1K-entry Gshare, 10-bit history    2K-entry Gshare, 10-bit history
BTB                  512K entries, 4-way                1K entries, 4-way                  2K entries, 4-way
Return address stack 8 entries                          16 entries                         32 entries
L1 inst. cache       32K, 2-way, 32 byte/line, 2 ports, 1 cycle   64K, 2-way, 32 byte/line, 2 ports, 1 cycle   128K, 2-way, 32 byte/line, 2 ports, 1 cycle
ROB size             32 entries                         64 entries                         96 entries
Load/store queue     24 entries                         48 entries                         72 entries
Integer units        2 IALU, 1 IMUL/DIV, 2 load/store   4 IALU, 2 IMUL/DIV, 2 load/store   8 IALU, 4 IMUL/DIV, 4 load/store
FP units             1 FPALU, 1 FP MUL/DIV/SQRT         2 FPALU, 2 FP MUL/DIV/SQRT         4 FPALU, 4 FP MUL/DIV/SQRT
DTLB                 64 entries, 4-way, 200 cycle miss  128 entries, 4-way, 200 cycle miss 256 entries, 4-way, 200 cycle miss
L1 data cache        32K, 2-way, 32 byte/line, 2 ports, 1 cycle   64KB, 4-way, 64 byte/line, 2 ports, 1 cycle  128K, 2-way, 32 byte/line, 2 ports, 1 cycle
L2 cache             unified 4MB, 4-way, 128 byte/line, 12 cycle  unified 3.7MB, 4-way, 128 byte/line, 12 cycle  unified 3.2MB, 4-way, 128 byte/line, 12 cycle
Memory               32-bit wide, 200 cycle latency     64-bit wide, 200 cycle latency     64-bit wide, 200 cycle latency
Figure 7-7 Cross-section view of the simulated 3D quad-core chip

Workloads and System Configurations

We use both integer and floating-point benchmarks from the SPEC CPU 2000 suite (bzip2, crafty, eon, facerec, galgel, gap, gcc, lucas, mcf, parser, perlbmk, twolf, swim, vortex and vpr) to compose our experimental multiprogrammed workloads (see Table 7-2). We categorize all benchmarks into two classes: CPU-bound and MEM-bound applications. We design three types of experimental workloads: CPU, MEM and MIX. The CPU and MEM workloads consist of programs from only the CPU-intensive and memory-intensive categories, respectively. The MIX workloads combine two benchmarks from the CPU-intensive group and two from the memory-intensive group.

Table 7-2 Simulation configurations

Chip: frequency 3 GHz; voltage 1.2 V; process technology 65 nm; die size 21 mm x 21 mm
Workloads:
  CPU1: bzip2, eon, gcc, perlbmk
  CPU2: perlbmk, mesa, facerec, lucas
  CPU3: gap, parser, eon, mesa
  MIX1: gcc, mcf, vpr, perlbmk
  MIX2: perlbmk, mesa, twolf, applu
  MIX3: eon, gap, mcf, vpr
  MEM1: mcf, equake, vpr, swim
  MEM2: twolf, galgel, applu, lucas
  MEM3: mcf, twolf, swim, vpr

These multiprogrammed workloads were simulated on our multicore simulator configured as a 3D quad-core processor. We use the SimPoint tool [1] to obtain a representative slice of each benchmark (with the full reference input set), and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. The simulations continue until one benchmark within a workload finishes the execution of its representative interval of 250M instructions.

Design Parameters

In this study, we consider a design space that consists of 23 parameters (see Table 7-3) spanning floorplanning to packaging technologies.

Table 7-3 Design space parameters (ranges as [low : high]; lengths in m)

3D configurations:
  Layer 0: thickness (ly0_th) [5e-5 : 3e-4]; floorplan (ly0_fl) Flp 1/2/3/4; benchmark (ly0_bench) CPU/MEM/MIX
  Layer 1: thickness (ly1_th) [5e-5 : 3e-4]; floorplan (ly1_fl) Flp 1/2/3/4; benchmark (ly1_bench) CPU/MEM/MIX
  Layer 2: thickness (ly2_th) [5e-5 : 3e-4]; floorplan (ly2_fl) Flp 1/2/3/4; benchmark (ly2_bench) CPU/MEM/MIX
  Layer 3: thickness (ly3_th) [5e-5 : 3e-4]; floorplan (ly3_fl) Flp 1/2/3/4; benchmark (ly3_bench) CPU/MEM/MIX
TIM (thermal interface material):
  heat capacity (TIM_cap, J/m^3K) [2e6 : 4e6]; resistivity (TIM_res, m K/W) [2e-3 : 5e-2]; thickness (TIM_th) [2e-5 : 7.5e-5]
Heat sink:
  convection capacity (HS_cap, J/K) [140.4 : 1698]; convection resistance (HS_res, K/W) [0.1 : 0.5]; side (HS_side) [0.045 : 0.08]; thickness (HS_th) [0.02 : 0.08]
Heat spreader:
  side (HP_side) [0.025 : 0.045]; thickness (HP_th) [5e-4 : 5e-3]
Others:
  ambient temperature (Am_temp, K) [293.15 : 323.15]; issue width 2/4/8

These design parameters have been shown to have a large impact on processor thermal behavior. The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed cycle-accurate simulations, we measure processor power and thermal characteristics at all design points within both the training and testing data sets. We build a separate model for each benchmark domain and use the model to predict the thermal behavior at unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of each model's accuracy is obtained using the design points in the testing data set. To train an accurate and prompt neural network prediction model, one needs to ensure that the sample data sets disperse points throughout the design space while keeping the sample small enough to maintain a low model building cost. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling strategy since it provides better coverage compared to a naive random sampling scheme.
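A minimal sketch of this sampling step, with all names ours: generate several LHS candidate matrices in the unit hypercube and keep the one with the lowest L2-star discrepancy (computed here via Warnock's closed-form expression; the actual design parameters would then be obtained by scaling each column into its range from Table 7-3).

```python
import numpy as np

def lhs(n_samples, n_dims, rng):
    """Latin Hypercube Sample in [0,1]^d: exactly one point per
    stratum in every dimension, strata shuffled per dimension."""
    u = (rng.random((n_samples, n_dims)) +
         np.arange(n_samples)[:, None]) / n_samples
    for j in range(n_dims):
        u[:, j] = u[rng.permutation(n_samples), j]
    return u

def l2_star_discrepancy_sq(P):
    """Squared L2-star discrepancy of points P in [0,1]^d
    (Warnock's closed form)."""
    n, d = P.shape
    term1 = 3.0 ** (-d)
    term2 = (2.0 / n) * np.prod((1.0 - P ** 2) / 2.0, axis=1).sum()
    pairwise = 1.0 - np.maximum(P[:, None, :], P[None, :, :])
    term3 = np.prod(pairwise, axis=2).sum() / n ** 2
    return term1 - term2 + term3

def best_lhs(n_samples, n_dims, n_candidates, seed=0):
    """Generate several LHS matrices; keep the lowest-discrepancy one."""
    rng = np.random.default_rng(seed)
    cands = [lhs(n_samples, n_dims, rng) for _ in range(n_candidates)]
    return min(cands, key=l2_star_discrepancy_sq)
```

The stratification guarantees coverage of every parameter's range even with few samples, which is the advantage over naive random sampling noted above.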
We generate multiple LHS matrices and use a space filling metric called the L2-star discrepancy [40]: the L2-star discrepancy is computed for each LHS matrix, and the matrix with the lowest value is chosen as the representative design space. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this work, we used 200 training and 50 test data points to reach a high accuracy for thermal behavior prediction, since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. In our study, the thermal characteristics across each die are represented by 64 x 64 samples.

Experimental Results

In this section, we present detailed experimental results using 2D wavelet neural networks to forecast the thermal behavior of large-scale 3D multicore structures running various CPU/MIX/MEM workloads without using detailed simulation.

Simulation Time vs. Prediction Time

To evaluate the effectiveness of our thermal prediction models, we compute the speedup metric (defined as simulation time divided by prediction time) across all experimented workloads (shown in Table 7-4). To calculate the simulation time, we measured the time that the HotSpot simulator takes to obtain steady-state thermal characteristics for a given design configuration. As can be seen, the HotSpot simulation time varies with the design configuration; we report both the shortest (best) and longest (worst) simulation times in Table 7-4. The prediction time, which includes the time for the neural networks to predict the targeted thermal behavior, remains constant for all studied cases. In our experiments, a total of 16 neural networks were used to predict the 16 2D wavelet coefficients which efficiently capture workload thermal spatial characteristics.
As can be seen, our predictive models achieve speedups ranging from 285x (MEM1) to 5,339x (CPU2), making them suitable for rapidly exploring a large thermal design space.

Table 7-4 Simulation time vs. prediction time

Workload   Simulation (sec) [best : worst]   Prediction (sec)   Speedup (Sim./Pred.) [best : worst]
CPU1       362 : 6,091                       1.23               294 : 4,952
CPU2       366 : 6,567                       1.23               298 : 5,339
CPU3       365 : 6,218                       1.23               297 : 5,055
MEM1       351 : 5,890                       1.23               285 : 4,789
MEM2       355 : 6,343                       1.23               289 : 5,157
MEM3       367 : 5,997                       1.23               298 : 4,876
MIX1       352 : 5,944                       1.23               286 : 4,833
MIX2       365 : 6,091                       1.23               297 : 4,952
MIX3       360 : 6,024                       1.23               293 : 4,898

Prediction Accuracy

The prediction accuracy measure is the mean error defined as follows:

ME = (1/N) Σ_{k=1}^{N} |x̂(k) − x(k)| / x(k)    (7-1)

where x(k) is the actual value generated by the HotSpot thermal model, x̂(k) is the predicted value, and N is the total number of samples (a set of 64 x 64 temperature samples per layer). As prediction accuracy increases, the ME becomes smaller. We present boxplots to observe the average prediction errors and their deviations for the 50 test configurations against the HotSpot simulation results. Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the "hinges," which are approximately the first and third quartiles of the ME values. Thus, about 50% of the data are located within the box, and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data and shows the center of the distribution of the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or to a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. In Figure 7-8, the blue line with diamond-shaped markers indicates the average ME across all benchmarks.
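Equation (7-1) is a mean relative error over the per-die temperature grid; as a small sketch (function name ours):

```python
import numpy as np

def mean_error(predicted, actual):
    """Mean relative error (Eq. 7-1): average of |x_hat(k) - x(k)| / x(k)
    over all N samples (e.g., a 64 x 64 temperature grid per layer)."""
    predicted = np.asarray(predicted, dtype=float).ravel()
    actual = np.asarray(actual, dtype=float).ravel()
    return np.mean(np.abs(predicted - actual) / actual)
```

For instance, predictions of 110 and 90 against actual values of 100 and 100 give an ME of 0.1 (10%).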
Figure 7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16)

Figure 7-8 shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 2.8% (CPU1) to 15.5% (MEM1), with an overall median error of 6.9% across all experimented workloads. As can be seen, the maximum error at any design point for any benchmark is 17.5% (MEM1), and most benchmarks show an error of less than 9%. This indicates that our hybrid neuro-wavelet framework can predict 2D spatial thermal behavior across large and sophisticated 3D multicore architectures with high accuracy. Figure 7-8 also indicates that CPU workloads (average 4.4%) have smaller error rates than MEM (average 9.4%) and MIX (average 6.7%) workloads. This is because the CPU workloads usually have higher temperatures on the small core area than on the large L2 cache area; these small and sharp hotspots can be easily captured using just a few wavelet coefficients. On MEM and MIX workloads, the complex thermal pattern can spread across the entire die area, resulting in higher prediction error. Figure 7-9 illustrates the simulated and predicted 2D thermal spatial behavior of die 4 (for one configuration) on the CPU1, MEM1 and MIX1 workloads.

Figure 7-9 Simulated and predicted thermal behavior

The results show that our predictive models can track both the size and the location of thermal hotspots. We further examined the accuracy of predicting the locations and area of the hottest spots, and the results are similar to those presented in Figure 7-8.

Figure 7-10 ME boxplots of prediction accuracies with different numbers of wavelet coefficients (16 to 256, shown for CPU1, MEM1 and MIX1)

Figure 7-10 shows the prediction accuracies with different numbers of wavelet coefficients on the multiprogrammed workloads CPU1, MEM1 and MIX1.
In general, the 2D thermal spatial pattern prediction accuracy increases as more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients; cost-effective models should provide high prediction accuracy while maintaining low complexity. The trend of prediction accuracy (Figure 7-10) suggests that, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point reduces the error at a lower rate, except on the MEM1 workload. Thus, we select 16 wavelet coefficients in this work to minimize the complexity of the prediction models while achieving good accuracy. We further compare the accuracy of our proposed scheme with that of approximating 3D stacked die spatial thermal patterns by predicting the temperature of 16 evenly distributed locations across the 2D plane. The results (Figure 7-11) indicate that, using the same number of neural networks, our scheme yields significantly higher accuracy than conventional predictive models. This is because wavelets provide good time and locality characterization, and most of the energy is captured by a limited set of important wavelet coefficients. The coordinated wavelet coefficients provide a superior interpretation of the spatial patterns across scales of the time and frequency domains.

Figure 7-11 Benefit of predicting wavelet coefficients

Our RBF neural networks were built using a regression-tree-based method. In the regression tree algorithm, all input parameters (see Table 7-3) were ranked based on split frequency. The input parameters that cause the most output variation tend to be split frequently in the constructed regression tree.
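The split-frequency ranking can be illustrated with a minimal CART-style regression tree that records which feature each internal node splits on. This is an illustrative sketch under our own assumptions (variance-reduction splits, a simple `min_leaf` stopping rule, hypothetical helper names), not the dissertation's regression-tree/RBF construction from [35].

```python
import numpy as np
from collections import Counter

def best_split(X, y):
    """Find the (score, feature, threshold) split minimizing total child variance."""
    n, d = X.shape
    best = None
    for f in range(d):
        order = np.argsort(X[:, f])
        xs, ys = X[order, f], y[order]
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue  # cannot split between equal feature values
            score = i * ys[:i].var() + (n - i) * ys[i:].var()
            if best is None or score < best[0]:
                best = (score, f, (xs[i] + xs[i - 1]) / 2.0)
    return best

def split_counts(X, y, min_leaf=4, counts=None):
    """Grow the tree recursively, counting how often each feature is split on."""
    if counts is None:
        counts = Counter()
    if len(y) < 2 * min_leaf or y.var() == 0.0:
        return counts  # node becomes a leaf
    found = best_split(X, y)
    if found is None:
        return counts
    _, f, t = found
    counts[f] += 1
    mask = X[:, f] <= t
    split_counts(X[mask], y[mask], min_leaf, counts)
    split_counts(X[~mask], y[~mask], min_leaf, counts)
    return counts
```

With this scheme, an input parameter that drives most of the output variation (e.g., a layer's floorplan in the thermal models) accumulates the largest split count, which is exactly the ranking used to read off dominant design parameters.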
Therefore, the input parameters that largely determine the value of a wavelet coefficient have a larger number of splits.

Figure 7-12 Roles of input parameters (star plots of the most frequently split design parameters for CPU1, MEM1 and MIX1)

Figure 7-12 presents the most frequent splits within the regression tree that models the most significant wavelet coefficient. A star plot [41] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The size of each parameter's sector is proportional to the magnitude of that variable for the data point, relative to the maximum magnitude of the variable across all data points. From the star plot, we can answer questions such as: Which variables are dominant for a given dataset? Which observations show similar behavior? As can be seen, the floorplanning of each layer and the core configuration largely affect the thermal spatial behavior of the studied workloads.

CHAPTER 8 CONCLUSIONS

Studying program workload behavior is of growing interest in computer architecture research. The performance, power and reliability optimizations of future computer workloads and systems could involve analyzing program dynamics across many time scales. Modeling and predicting program behavior at a single scale has many limitations. For example, samples taken from a single, fine-grained interval may not be useful in forecasting how a program behaves at medium or large time scales. In contrast, observing program behavior using a coarse-grained time scale may lose opportunities that can be exploited by hardware and software in tuning resources to optimize workload execution at a fine-grained level.
In Chapter 3, we proposed new methods, metrics and a framework that help researchers and designers better understand phase complexity and the changing of program dynamics across multiple time scales. We proposed using wavelet transformations of code execution and runtime characteristics to produce a concise yet informative view of program dynamic complexity. We demonstrated the use of this information in phase classification, which aims to produce phases that exhibit a similar degree of complexity. Characterizing phase dynamics across different scales provides insightful knowledge and abundant features that can be exploited by hardware and software in tuning resources to meet the requirements of workload execution at different granularities. Chapter 4 extends the scope of Chapter 3 by (1) exploring and contrasting the effectiveness of using wavelets on a wide range of program execution statistics for phase analysis, and (2) investigating techniques that can further optimize the accuracy of wavelet-based phase classification. More importantly, we identify additional benefits that wavelets can offer in the context of phase analysis. For example, wavelet transforms can provide efficient dimensionality reduction of large-volume, high-dimension raw program execution statistics from the time domain, and hence can be integrated with a sampling mechanism to efficiently increase the scalability of phase analysis of large-scale phase behavior on long-running workloads. To address workload variability issues in phase classification, wavelet-based denoising can be used to extract the essential features of workload behavior from their runtime nondeterministic (i.e., noisy) statistics. In Chapter 5, on workload prediction, we propose the use of wavelet neural networks to build accurate predictive models for workload-dynamics-driven microarchitecture design space exploration, overcoming the problems of monolithic, global predictive models.
We show that wavelet neural networks can accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluate the efficiency of using the proposed techniques to predict workload dynamic behavior in the performance, power, and reliability domains. We also perform extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy, and identify the microarchitecture parameters that significantly affect workload dynamic behavior. To evaluate the efficiency of scenario-driven architecture optimizations across different domains, we also present a case study using a workload-dynamics-aware predictive model. Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios. To our knowledge, the model we proposed is the first that can track complex program dynamic behavior across different microarchitecture configurations. We believe our workload dynamics forecasting techniques will allow architects to quickly evaluate a rich set of architecture optimizations that target workload dynamics at an early microarchitecture design stage. In Chapter 6, we explore novel predictive techniques that can quickly, accurately and informatively analyze the design tradeoffs of future large-scale multi-/many-core architectures in a scalable fashion. The characteristics that workloads exhibit on these architectures are complex phenomena, since they typically contain a mixture of behavior localized at different scales. Applying wavelet analysis, our method can capture this heterogeneous behavior across a wide range of spatial scales using a limited set of parameters. We show that these parameters can be cost-effectively predicted using nonlinear modeling techniques, such as neural networks, with low computational overhead.
Experimental results show that our scheme can accurately predict the heterogeneous behavior of large-scale multicore-oriented architecture substrates. To our knowledge, the model we proposed is the first that can track complex 2D workload/architecture interactions across design alternatives. We further examined using the proposed models to effectively explore multicore-aware resource allocations and design evaluations. For example, we build analytical models that can quickly forecast workloads' 2D working sets across different NUCA configurations; combined with interference estimation, our models can determine the geometric-aware workload/core mappings that lead to minimal interference. We also show that our models can be used to predict the location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging multi-/many-core design era, we believe that the proposed 2D predictive model will allow architects to quickly yet informatively examine a rich set of design alternatives and optimizations for large and sophisticated architecture substrates at an early design stage. Leveraging 3D die stacking technologies in multicore processor design has gained momentum in both the chip design industry and the research community. One of the major roadblocks to realizing 3D multicore designs is their inefficient heat dissipation. To ensure thermal efficiency, processor architects and chip designers rely on detailed yet slow simulations to model thermal characteristics and analyze various design tradeoffs. However, due to the sheer size of the design space, such techniques are very expensive in terms of time and cost. In Chapter 7, we aim to develop computationally efficient methods and models that allow architects and designers to rapidly yet informatively explore the large thermal design space of 3D multicore architectures. Our models achieve several orders of magnitude speedup compared to simulation-based methods.
Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models can capture complex 2D thermal spatial patterns and can be used to forecast both the location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging 3D multicore design era, we believe that the proposed thermal predictive models will be valuable for architects to quickly and informatively examine a rich set of thermal-aware design alternatives and thermal-oriented optimizations for large and sophisticated architecture substrates at an early design stage.

LIST OF REFERENCES

[1] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[2] E. Duesterwald, C. Cascaval and S. Dwarkadas, "Characterizing and Predicting Program Behavior and Its Variability," in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2003.
[3] J. Cook, R. L. Oliver, and E. E. Johnson, "Examining Performance Differences in Workload Execution Phases," in Proc. of the IEEE International Workshop on Workload Characterization, 2001.
[4] X. Shen, Y. Zhong and C. Ding, "Locality Phase Prediction," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2004.
[5] C. Isci and M. Martonosi, "Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data," in Proc. of the International Symposium on Microarchitecture, 2003.
[6] T. Sherwood, S. Sair and B. Calder, "Phase Tracking and Prediction," in Proc. of the International Symposium on Computer Architecture, 2003.
[7] A. Dhodapkar and J. Smith, "Managing Multi-Configurable Hardware via Dynamic Working Set Analysis," in Proc.
of the International Symposium on Computer Architecture, 2002.
[8] M. Huang, J. Renau and J. Torrellas, "Positional Adaptation of Processors: Application to Energy Reduction," in Proc. of the International Symposium on Computer Architecture, 2003.
[9] W. Liu and M. Huang, "EXPERT: Expedited Simulation Exploiting Program Behavior Repetition," in Proc. of the International Conference on Supercomputing, 2004.
[10] T. Sherwood, E. Perelman and B. Calder, "Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications," in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2001.
[11] A. Dhodapkar and J. Smith, "Comparing Program Phase Detection Techniques," in Proc. of the International Symposium on Microarchitecture, 2003.
[12] C. Isci and M. Martonosi, "Identifying Program Power Phase Behavior using Power Vectors," in Proc. of the International Workshop on Workload Characterization, 2003.
[13] C. Isci and M. Martonosi, "Phase Characterization for Power: Evaluating Control-Flow-Based and Event-Counter-Based Techniques," in Proc. of the International Symposium on High Performance Computer Architecture, 2006.
[14] M. Annavaram, R. Rakvic, M. Polito, J. Y. Bouguet, R. Hankins and B. Davies, "The Fuzzy Correlation between Code and Performance Predictability," in Proc. of the International Symposium on Microarchitecture, 2004.
[15] J. Lau, S. Schoenmackers and B. Calder, "Structures for Phase Classification," in Proc. of the International Symposium on Performance Analysis of Systems and Software, 2004.
[16] J. Lau, J. Sampson, E. Perelman, G. Hamerly and B. Calder, "The Strong Correlation between Code Signatures and Performance," in Proc. of the International Symposium on Performance Analysis of Systems and Software, 2005.
[17] J. Lau, S. Schoenmackers and B. Calder, "Transition Phase Classification and Prediction," in Proc. of the International Symposium on High Performance Computer Architecture, 2005.
[18] C. Isci and M. Martonosi, "Detecting Recurrent Phase Behavior under Real System Variability," in Proc. of the IEEE International Symposium on Workload Characterization, 2005.
[19] E. Perelman, M. Polito, J. Y. Bouguet, J. Sampson, B. Calder and C. Dulong, "Detecting Phases in Parallel Applications on Shared Memory Architectures," in Proc. of the International Parallel and Distributed Processing Symposium, April 2006.
[20] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, "Construction and Use of Linear Regression Models for Processor Performance Analysis," in Proc. of the International Symposium on High-Performance Computer Architecture, 2006.
[21] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, "A Predictive Performance Model for Superscalar Processors," in Proc. of the International Symposium on Microarchitecture, 2006.
[22] B. Lee and D. Brooks, "Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[23] E. Ipek, S. A. McKee, B. R. Supinski, M. Schulz and R. Caruana, "Efficiently Exploring Architectural Design Spaces via Predictive Modeling," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[24] B. Lee and D. Brooks, "Illustrative Design Space Studies with Microarchitectural Regression Models," in Proc. of the International Symposium on High-Performance Computer Architecture, 2007.
[25] R. M. Yoo, H. Lee, K. Chow and H. H. S. Lee, "Constructing a Non-Linear Model with Neural Networks for Workload Characterization," in Proc. of the International Symposium on Workload Characterization, 2006.
[26] I. Daubechies, Ten Lectures on Wavelets, Capital City Press, Montpelier, Vermont, 1992.
[27] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets," Communications on Pure and Applied Mathematics, vol.
41, pages 906-966, 1988.
[28] T. Austin, "Tutorial of Simplescalar V4.0," in conjunction with the International Symposium on Microarchitecture, 2001.
[29] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[30] T. Huffmire and T. Sherwood, "Wavelet-Based Phase Classification," in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2006.
[31] D. Brooks and M. Martonosi, "Dynamic Thermal Management for High-Performance Microprocessors," in Proc. of the International Symposium on High-Performance Computer Architecture, 2001.
[32] A. Alameldeen and D. Wood, "Variability in Architectural Simulations of Multithreaded Workloads," in Proc. of the International Symposium on High Performance Computer Architecture, 2003.
[33] D. L. Donoho, "De-noising by Soft-thresholding," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 613-627, 1995.
[34] MATLAB User Manual, MathWorks, MA, USA.
[35] M. Orr, K. Takezawa, A. Murray, S. Ninomiya and T. Leonard, "Combining Regression Trees and Radial Basis Function Networks," International Journal of Neural Systems, 2000.
[36] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," in Proc. of the 27th International Symposium on Computer Architecture, 2000.
[37] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," in Proc. of the International Symposium on Microarchitecture, 2003.
[38] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas and R. Rangan, "Computing Architectural Vulnerability Factors for Address-Based Structures," in Proc. of the International Symposium on Computer Architecture, 2005.
[39] J. Cheng and M. J. Druzdzel, "Latin Hypercube Sampling in Bayesian Networks," in Proc.
of the 13th Florida Artificial Intelligence Research Society Conference, 2000.
[40] B. Vandewoestyne and R. Cools, "Good Permutations for Deterministic Scrambled Halton Sequences in Terms of L2-discrepancy," Journal of Computational and Applied Mathematics, vol. 189, issues 1-2, 2006.
[41] J. Chambers, W. Cleveland, B. Kleiner and P. Tukey, Graphical Methods for Data Analysis, Wadsworth, 1983.
[42] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, ISBN 0-13-273350-1, 1999.
[43] C. Kim, D. Burger, and S. Keckler, "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[44] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm," Computer, vol. 35, issue 1, pp. 70-78, January 2002.
[45] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. Keckler, "A NUCA Substrate for Flexible CMP Cache Sharing," in Proc. of the International Conference on Supercomputing, 2005.
[46] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures," in Proc. of the International Symposium on Microarchitecture, 2003.
[47] B. M. Beckmann and D. A. Wood, "Managing Wire Delay in Large Chip-Multiprocessor Caches," in Proc. of the International Symposium on Microarchitecture, 2004.
[48] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Optimizing Replication, Communication, and Capacity Allocation in CMPs," in Proc. of the International Symposium on Computer Architecture, 2005.
[49] M. Zhang and K. Asanovic, "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors," in Proc. of the International Symposium on Computer Architecture, 2005.
[50] B. Lee, D. Brooks, B. Supinski, M. Schulz, K. Singh and S. McKee, "Methods of Inference and Learning for Performance Modeling of Parallel Applications," in Proc. of the Symposium on Principles and Practice of Parallel Programming (PPoPP), 2007.
[51] K. Martin, D.
Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill and D. Wood, "Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset," Computer Architecture News (CAN), 2005.
[52] Virtutech Simics, http://www.virtutech.com/products/
[53] S. Woo, M. Ohara, E. Torrie, J. Singh and A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," in Proc. of the International Symposium on Computer Architecture, 1995.
[54] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-Aware Microarchitecture," in Proc. of the International Symposium on Computer Architecture, 2003.
[55] K. Banerjee, S. Souri, P. Kapur, and K. Saraswat, "3D ICs: A Novel Chip Design for Improving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration," Proceedings of the IEEE, vol. 89, pp. 602-633, May 2001.
[56] Y. F. Tsai, F. Wang, Y. Xie, N. Vijaykrishnan and M. J. Irwin, "Design Space Exploration for 3D Cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 4, April 2008.
[57] B. Black, D. Nelson, C. Webb, and N. Samra, "3D Processing Technology and its Impact on IA32 Microprocessors," in Proc. of the 22nd International Conference on Computer Design, pp. 316-318, 2004.
[58] P. Reed, G. Yeung, and B. Black, "Design Aspects of a Microprocessor Data Cache using 3D Die Interconnect Technology," in Proc. of the International Conference on Integrated Circuit Design and Technology, pp. 15-18, 2005.
[59] M. Healy, M. Vittes, M. Ekpanyapong, C. S. Ballapuram, S. K. Lim, H. S. Lee and G. H. Loh, "Multiobjective Microarchitectural Floorplanning for 2D and 3D ICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 1, pp. 38-52, 2007.
[60] S. K. Lim, "Physical Design for 3D System on Package," IEEE Design & Test of Computers, vol. 22, no. 6, pp. 532-539, 2005.
[61] K. Puttaswamy and G. H.
Loh, "Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors," in Proc. of the International Symposium on High-Performance Computer Architecture, 2007.
[62] Y. Wu and Y. Chang, "Joint Exploration of Architectural and Physical Design Spaces with Thermal Consideration," in Proc. of the International Symposium on Low Power Electronics and Design, 2005.
[63] J. Sharkey, D. Ponomarev and K. Ghose, "M-Sim: A Flexible, Multithreaded Architectural Simulation Environment," Technical Report CS-TR-05-DP01, Department of Computer Science, State University of New York at Binghamton, 2005.

BIOGRAPHICAL SKETCH

Chang Burm Cho earned a B.E. and an M.A. in electrical engineering at Dankook University, Seoul, Korea, in 1993 and 1995, respectively. Over the next nine years, he worked as a senior researcher at the Korea Aerospace Research Institute (KARI), developing the On-Board Computer (OBC) for two satellites, KOMPSAT-1 and KOMPSAT-2. His research interests are computer architecture and workload characterization and prediction in large microarchitectural design spaces.
And I am indebted to all the members of IDEAL(Intelligent Design of Efficient Architectures Laboratory), Clay Hughes, Jame s Michael Poe II, Xin Fu and Wangyuan Zhang, for their companionship and support throughout the time spent working on my research. Finally, I would also like to expr ess my greatest gratitude to my family especially my wife, EunHee Choi, for her rele ntless support and love. PAGE 4 4 TABLE OF CONTENTS page ACKNOWLEDGMENTS .............................................................................................................. 3 LIST OF TABLES .......................................................................................................................... 6 LIST OF FI GURES ........................................................................................................................ 7 ABSTRACT ...................................................................................................................... ............ 10 CHAP TER 1 INTRODUCTION ................................................................................................................ 12 2 WAVELET TRANSFORM .................................................................................................. 16 Discrete W avelet Transform(DWT) ..................................................................................... 16 Apply DW T to Capture Workload Execution Behavior ....................................................... 18 2D W avelet Transform ......................................................................................................... 22 3 COMPLEXITYBASED PROGRAM PHASE ANALYSIS AND CLA SSIFICATION .... 25 Characterizing and classifying the program dynamic behavior ............................................ 25 Profiling Program Dynamics and Complexity ...................................................................... 
28 Classifying Program Phases based on their Dynamics Behavior ......................................... 31 Experim ental Results .......................................................................................................... .. 34 4 IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRM PHASE ANALYSIS ...................................................................................................................... ..... 37 Workloadstaticsbased phase analysis ................................................................................. 38 Exploring Wavelet Dom ain Phase Analysis ......................................................................... 40 5 INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION ................... 52 Neural Network ..................................................................................................................... 54 Com bing Wavelet and Neural Network for Workload Dynamics Prediction ...................... 56 Experim ental Methodology .................................................................................................. 58 Evaluation and Results .......................................................................................................... 62 Workload Dynam ics Driven Architecture Design Space Exploration ................................. 68 6 ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTICORE ARCHITECTURES ..................................................................................... 74 Com bining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction .................................................................................................................... .......... 76 Experim ental Methodology .................................................................................................. 
78 PAGE 5 5 Evaluation and Results .......................................................................................................... 82 Leveraging 2D Geom etric Characteristics to Explore Cooperative Multicore Oriented Architecture Design and Optimization ................................................................................. 88 7 THERMAL DESIGN SPACE EXPLORATIO N OF 3D DIE STACKED MULTICORE PROCESSORS USING GEOSPATIALBASED PREDICTIVE MODELS ....................... 94 Com bining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction ... 96 Experim ental Methodology .................................................................................................. 98 Experim ental Results .......................................................................................................... 103 8 CONCLUSI ONS ................................................................................................................. 109 LIST OF REFERENCES ............................................................................................................ 113 BIOGRAPHICAL SKETCH ...................................................................................................... 119 PAGE 6 6 LIST OF TABLES Table page 31 Baseline machine configuration ............................................................................................. 26 32 A classification of benchmar ks based on their com plexity .................................................... 30 41 Baseline machine configuration ............................................................................................. 39 42 Efficiency of different hybrid wavelet signatures in pha se classification .............................. 44 51 Simulated machine configuration ........................................................................................... 59 52 Microarchitectural parameter ranges used for generating train/test data ............................... 
60 61 Simulated machine configuration (baseline) .......................................................................... 78 62 The considered architecture de sign param eters and their ranges ........................................... 79 63 Multiprogrammed workloads ................................................................................................ 80 64 Error comparison of predicti ng raw vs. 2D DWT cache banks .............................................. 85 65 Design space evaluation speedup (sim ulation vs. prediction) ................................................ 86 71 Architecture configurati on for different issue width ............................................................ 100 72 Simulation configurations ..................................................................................................... 101 73 Design space parameters ...................................................................................................... 102 74 Simulation time vs. prediction time ...................................................................................... 104 PAGE 7 7 LIST OF FIGURES Figure page 21 Example of Haar wavelet transform. ...................................................................................... 18 22 Comparison execution characteris tics of tim e and wavelet domain ....................................... 19 23 Sampled time domain program behavior ................................................................................ 20 24 Reconstructing the work load dynam ic behaviors ................................................................... 20 25 Variation of wavelet coefficients ............................................................................................ 21 26 2D wavelet transforms on 4 data points ................................................................................. 
22
2-7 2D wavelet transforms on 16 cores/hardware components .......... 23
2-8 Example of applying 2D DWT on a non-uniformly accessed cache .......... 24
3-1 XCOR vectors for each program execution interval .......... 28
3-2 Dynamic complexity profile of benchmark gcc .......... 28
3-3 XCOR value distributions .......... 30
3-4 XCORs in the same phase by Simpoint .......... 31
3-5 BBVs with different resolutions .......... 32
3-6 Multiresolution analysis of the projected BBVs .......... 33
3-7 Weighted COV calculation .......... 34
3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics .......... 35
3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics .......... 36
4-1 Phase analysis methods: time domain vs. wavelet domain .......... 41
4-2 Phase classification accuracy: time domain vs. wavelet domain .......... 42
4-3 Phase classification using hybrid wavelet coefficients .......... 43
4-4 Phase classification accuracy of using the 16-1 hybrid scheme .......... 45
4-5 Different methods to handle counter overflows ..........
46
4-6 Impact of counter overflows on phase analysis accuracy .......... 47
4-7 Method for modeling workload variability .......... 50
4-8 Effect of using wavelet denoising to handle workload variability .......... 50
4-9 Efficiency of different denoising schemes .......... 51
5-1 Variation of workload performance, power and reliability dynamics .......... 52
5-2 Basic architecture of a neural network .......... 54
5-3 Using wavelet neural networks for workload dynamics prediction .......... 58
5-4 Magnitude-based ranking of 128 wavelet coefficients .......... 61
5-5 MSE boxplots of workload dynamics prediction .......... 62
5-6 MSE trends with increased number of wavelet coefficients .......... 64
5-7 MSE trends with increased sampling frequency .......... 64
5-8 Roles of microarchitecture design parameters .......... 65
5-9 Threshold-based workload execution scenarios .......... 67
5-10 Threshold-based workload execution .......... 68
5-11 Threshold-based workload scenario prediction .......... 68
5-12 Dynamic Vulnerability Management ..........
69
5-13 IQ DVM pseudo code .......... 70
5-14 Workload dynamics prediction with scenario-based architecture optimization .......... 71
5-15 Heat plot that shows the MSE of IQ AVF and processor power .......... 72
5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds .......... 73
6-1 Variation of cache hits across a 256-bank non-uniform access cache on 8 cores .......... 74
6-2 Using wavelet neural networks for forecasting architecture 2D characteristics .......... 77
6-3 Baseline CMP with 8 cores that share a NUCA L2 cache .......... 79
6-4 ME boxplots of prediction accuracies with different numbers of wavelet coefficients .......... 83
6-5 Predicted 2D NUCA behavior using different numbers of wavelet coefficients .......... 84
6-6 Roles of design parameters in predicting 2D NUCA .......... 87
6-7 2D NUCA footprint (geometric shape) of mesa .......... 88
6-8 2D cache interference in NUCA .......... 89
6-9 Pearson correlation coefficient (all 50 test cases are shown) .......... 90
6-10 2D NUCA thermal profile (simulation vs. prediction) .......... 91
6-11 NUCA 2D thermal prediction error .......... 92
6-12 Temperature profile before and after a DTM policy .......... 93
7-1 2D within-die and cross-die thermal variation in 3D die-stacked multicore processors ..........
94
7-2 2D thermal variation on die 4 under different microarchitecture and floorplan configurations .......... 95
7-3 Example of using 2D DWT to capture thermal spatial characteristics .......... 95
7-4 Hybrid neuro-wavelet thermal prediction framework .......... 97
7-5 Selected floorplans .......... 98
7-6 Processor core floorplan .......... 99
7-7 Cross-section view of the simulated 3D quad-core chip .......... 100
7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16) .......... 105
7-9 Simulated and predicted thermal behavior .......... 106
7-10 ME boxplots of prediction accuracies with different numbers of wavelet coefficients .......... 106
7-11 Benefit of predicting wavelet coefficients .......... 107
7-12 Roles of input parameters .......... 108

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES

By
CHANG BURM CHO

December 2008

Chair: Tao Li
Major: Electrical and Computer Engineering

Modeling and analyzing how workload and architecture interact are at the foundation of computer architecture research and practical design.
As contemporary microprocessors become increasingly complex, many challenges related to the design, evaluation and optimization of their architectures crucially rely on exploiting workload characteristics. While conventional workload characterization methods measure aggregated workload behavior, and state-of-the-art tools can detect program time-varying patterns and cluster them into different phases, existing techniques generally lack the capability of gaining insightful knowledge on the complex interaction between software and hardware, a necessary first step to design cost-effective computer architecture. This limitation will only be exacerbated by the rapid growth of software functionality and runtime, and of hardware design complexity and integration scale. For instance, while large real-world applications manifest drastically different behavior across a wide spectrum of their runtime, existing methods only focus on analyzing workload characteristics using a single time scale. Conventional architecture modeling techniques assume a centralized and monolithic hardware substrate. This assumption, however, will not hold valid, since the design trends of multi-/many-core processors will result in large-scale and distributed microarchitectures. Beyond individual processor cores, global and cooperative resource management for large-scale many-core processors requires obtaining workload characteristics across a large number of distributed hardware components (cores, cache banks, interconnect links, etc.) at different levels of abstraction. Therefore, there is a pressing need for novel and efficient approaches to model and analyze workloads and architectures with rapidly increasing complexity and integration scale. We aim to develop computationally efficient methods and models which allow architects and designers to rapidly yet informatively explore the large performance, power, reliability and thermal design space of uni-/multicore architectures.
Our models achieve several orders of magnitude speedup compared to simulation-based methods. Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models have the capability of capturing complex workload behavior and can be used to forecast workload dynamics during performance, power, reliability and thermal design space exploration.

CHAPTER 1
INTRODUCTION

Modeling and analyzing how workloads behave on the underlying hardware have been essential ingredients of computer architecture research. By knowing program behavior, both hardware and software can be tuned to better suit the needs of applications. As computer systems become more adaptive, their efficiency increasingly depends on the dynamic behavior that programs exhibit at runtime. Previous studies [1-5] have shown that program runtime characteristics exhibit time-varying phase behavior: workload execution manifests similar behavior within each phase while showing distinct characteristics between different phases. Many challenges related to the design, analysis and optimization of complex computer systems can be efficiently solved by exploiting program phases [1, 6-9]. For this reason, there is a growing interest in studying program phase behavior. Recently, several phase analysis techniques have been proposed [4, 7, 10-19]. Very few of these studies, however, focus on understanding and characterizing program phases from their dynamics and complexity perspectives. Consequently, these techniques generally lack the capability of informing phase dynamic behavior. To complement current phase analysis techniques, which pay little or no attention to phase dynamics, we develop new methods, metrics and frameworks that have the capability to analyze, quantify, and classify program phases based on their dynamics and complexity characteristics.
Our techniques are built on wavelet-based multiresolution analysis, which provides a clear and orthogonal view of phase dynamics by presenting the complex dynamic structures of program phases with respect to both the time and frequency domains. Consequently, key tendencies can be efficiently identified. As microprocessor architectures become more complex, architects increasingly rely on exploiting workload dynamics to achieve cost- and complexity-effective design. Therefore, there is a growing need for methods that can quickly and accurately explore workload dynamic behavior at an early microarchitecture design stage. Such techniques can quickly provide architects with insights into application execution scenarios across a large design space without resorting to detailed, case-by-case simulations. Researchers have proposed several predictive models [20-25] to reason about workload aggregated behavior at the architecture design stage. However, these models focus on predicting aggregated program statistics (e.g., the CPI of the entire workload execution). Such monolithic, global models are incapable of capturing and revealing program dynamics, which contain interesting fine-grain behavior. To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques and neural network prediction. As the number of cores on a processor increases, these large and sophisticated multicore-oriented architectures exhibit increasingly complex and heterogeneous characteristics. Processors with two, four and eight cores have already entered the market. Processors with tens or possibly hundreds of cores may be a reality within the next few years. In the upcoming multi-/many-core era, the design, evaluation and optimization of architectures will demand analysis methods that are very different from those targeting traditional, centralized and monolithic hardware structures.
To enable global and cooperative management of hardware resources and efficiency at large scales, it is imperative to analyze and exploit architecture characteristics beyond the scope of individual cores and hardware components (e.g., a single cache bank or a single interconnect link). To address this important and urgent research task, we developed novel 2D multiscale predictive models which can efficiently reason about the characteristics of large and sophisticated multicore-oriented architectures during the design space exploration stage without using detailed cycle-level simulations.

Three-dimensional (3D) integrated circuit design [55] is an emerging technology that greatly improves transistor integration density and reduces on-chip wire communication latency. It stacks planar circuit layers in the vertical dimension and connects these layers with a high-density, low-latency interface. In addition, 3D offers the opportunity of bonding dies implemented with different techniques, enabling the integration of heterogeneous active layers into new system architectures. Leveraging 3D die stacking technologies to build uni-/multicore processors has drawn increased attention from both the chip design industry and the research community [56-62]. The realization of 3D chips faces many challenges. One of the most daunting of these challenges is the problem of inefficient heat dissipation. In conventional 2D chips, the generated heat is dissipated through an external heat sink. In 3D chips, all of the layers contribute to the generation of heat. Stacking multiple dies vertically increases power density, and dissipating heat from the layers far away from the heat sink is more challenging due to the distance of the heat source to the external heat sink. Therefore, 3D technologies not only exacerbate existing on-chip hotspots but also create new thermal hotspots.
High die temperature leads to thermal-induced performance degradation and reduced chip lifetime, which threatens the reliability of the whole system, making modeling and analyzing thermal characteristics crucial in effective 3D microprocessor design. Previous studies [59, 60] show that 3D chip temperature is affected by factors such as the configuration and floorplan of microarchitectural components. For example, instead of putting hot components together, thermal-aware floorplanning places hot components next to cooler components, reducing the global temperature. Thermal-aware floorplanning [59] uses intensive and iterative simulations to estimate the thermal effect of microarchitecture components at an early architectural design stage. However, using detailed yet slow cycle-level simulations to explore thermal effects across the large design space of 3D multicore processors is very expensive in terms of time and cost.

CHAPTER 2
WAVELET TRANSFORM

We use wavelets as an efficient tool for capturing workload behavior. To familiarize the reader with the general methods used in this research, we provide a brief overview of wavelet analysis and show how program execution characteristics can be represented using it.

Discrete Wavelet Transform (DWT)

Wavelets are mathematical tools that use a prototype function (called the analyzing or mother wavelet) to transform data of interest into different frequency components, and then analyze each component with a resolution matched to its scale. Therefore, the wavelet transform is capable of providing a compact and effective mathematical representation of data. In contrast to Fourier transforms, which only offer frequency representations, wavelet transforms provide time and frequency localization simultaneously. Wavelet analysis allows one to choose from numerous wavelet functions [26, 27].
In this section, we provide a quick primer on wavelet analysis using the Haar wavelet, which is the simplest form of wavelet. Consider a data series $X_{n,k},\ k = 0, 1, 2, \ldots$, at the finest time-scale resolution level $2^{-n}$. This time series might represent a specific program characteristic (e.g., the number of executed instructions, branch mispredictions or cache misses) measured at a given time scale. We can coarsen this event series by averaging (with a slightly different normalization factor) over non-overlapping blocks of size two,

$X_{n-1,k} = 2^{-1/2}\,(X_{n,2k} + X_{n,2k+1})$   (2-1)

and generate a new time series $X_{n-1}$, which is a coarser-granularity representation of the original series $X_n$. The difference between the two representations, known as the details, is

$D_{n-1,k} = 2^{-1/2}\,(X_{n,2k} - X_{n,2k+1})$   (2-2)

Note that the original time series $X_n$ can be reconstructed from its coarser representation $X_{n-1}$ by simply adding in the details $D_{n-1}$; i.e., $X_n = 2^{-1/2}(X_{n-1} + D_{n-1})$. We can repeat this process (i.e., write $X_{n-1}$ as the sum of yet another coarser version $X_{n-2}$ and the details $D_{n-2}$, and iterate) for as many scales as are present in the original time series, i.e.,

$X_n = 2^{-n/2}X_0 + 2^{-n/2}D_0 + \cdots + 2^{-1/2}D_{n-1}$   (2-3)

We refer to the collection of $X_0$ and the $D_j$ as the discrete Haar wavelet coefficients. The calculations of all $D_{j,k}$, which can be done iteratively using equations (2-1) and (2-2), make up the so-called discrete wavelet transform (DWT). As can be seen, the DWT offers a natural hierarchical structure to represent data behavior at multiple resolution levels: the first few wavelet coefficients contain an overall, coarser approximation of the data; additional coefficients add increasingly fine detail. This property can be used to capture workload execution behavior. Figure 2-1 illustrates the procedure of using the Haar-based DWT to transform the data series {3, 4, 20, 25, 15, 5, 20, 3}. As can be seen, scale 1 is the finest representation of the data.
At scale 2, the approximations {3.5, 22.5, 10, 11.5} are obtained by taking the averages of {3, 4}, {20, 25}, {15, 5} and {20, 3} at scale 1, respectively. The details {0.5, 2.5, 5, 8.5} are the differences of {3, 4}, {20, 25}, {15, 5} and {20, 3} divided by 2, respectively. The process continues by decomposing the scaling-coefficient (approximation) vector using the same steps, and completes when only one coefficient remains. As a result, the wavelet decomposition is the collection of average and detail coefficients at all scales. In other words, the wavelet transform of the original data is the single coefficient representing the overall average of the original data, followed by the detail coefficients in order of increasing resolution. Different resolutions can be obtained by adding the difference values back to, or subtracting them from, the averages.

Figure 2-1. Example of Haar wavelet transform. The original data {3, 4, 20, 25, 15, 5, 20, 3} are passed through scaling (averaging) filters G0-G2, producing {3.5, 22.5, 10, 11.5}, {13, 10.75} and {11.875}, and through wavelet (difference) filters H0-H2, producing the details {0.5, 2.5, 5, 8.5}, {9.5, 0.75} and {1.125}. The resulting coefficients are the approximation 11.875 (level 0), followed by the details 1.125 (level 1), {9.5, 0.75} (level 2) and {0.5, 2.5, 5, 8.5} (level 3).

For instance, {13, 10.75} = {11.875+1.125, 11.875-1.125}, where 11.875 and 1.125 are the first and second coefficients, respectively. This process can be performed recursively until the finest scale is reached. Therefore, through an inverse transform, the original data can be recovered from the wavelet coefficients. The original data can be perfectly recovered if all wavelet coefficients are involved. Alternatively, an approximation of the time series can be reconstructed using a subset of the wavelet coefficients. Using a wavelet transform gives time-frequency localization of the original data.
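The Figure 2-1 decomposition and its inverse can be sketched in a few lines of Python. This is an illustrative sketch, not the dissertation's implementation: it uses the figure's unnormalized average/half-difference convention rather than the $2^{-1/2}$-normalized filters of equations (2-1) and (2-2), and the signs of the detail coefficients depend on the subtraction order (the figure lists their magnitudes).

```python
def haar_dwt(data):
    """Haar decomposition in the average / half-difference convention of
    Figure 2-1 (no 1/sqrt(2) normalization). Input length must be a power
    of two. Returns the overall average plus the detail coefficients,
    ordered from the coarsest scale to the finest."""
    approx, details = list(data), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / 2 for a, b in pairs])  # half-differences
        approx = [(a + b) / 2 for a, b in pairs]         # averages
    return approx[0], details[::-1]

def haar_reconstruct(avg, details, k=None):
    """Inverse transform keeping only the first k coefficients (overall
    average first, then details from coarse to fine); the rest are
    zeroed, mirroring the partial reconstructions of Figure 2-4."""
    flat = [avg] + [d for level in details for d in level]
    if k is not None:
        flat = flat[:k] + [0] * (len(flat) - k)
    approx, pos = [flat[0]], 1
    for level in details:
        det = flat[pos:pos + len(level)]
        pos += len(level)
        # each (average, detail) pair expands to (avg + d, avg - d)
        approx = [v for a, d in zip(approx, det) for v in (a + d, a - d)]
    return approx

avg, details = haar_dwt([3, 4, 20, 25, 15, 5, 20, 3])
full = haar_reconstruct(avg, details)         # exact recovery
coarse = haar_reconstruct(avg, details, k=1)  # every sample = overall average
```

With all eight coefficients the original series is recovered exactly; with k=1 the reconstruction collapses to the overall average 11.875, the coarsest approximation of Figure 2-4 (a).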
As a result, the time domain signal can be accurately approximated using only a few wavelet coefficients, since they capture most of the energy of the input data.

Apply DWT to Capture Workload Execution Behavior

Since the variation of program characteristics over time can be viewed as a signal, we apply discrete wavelet analysis to capture program execution behavior. To obtain time domain workload execution characteristics, we break down the entire program execution into intervals and then sample multiple data points within each interval. Therefore, at the finest resolution level, program time domain behavior is represented by a data series within each interval. Note that the sampled data can be any runtime program characteristic of interest. We then apply the discrete wavelet transform (DWT) to each interval. As described in the previous section, the result of the DWT is a set of wavelet coefficients which represent the behavior of the sampled time series in the wavelet domain.

Figure 2-2. Comparison of execution characteristics in the time and wavelet domains: (a) time domain representation, (b) wavelet domain representation.

Figure 2-2 (a) shows the sampled time domain workload execution statistics (the y-axis represents the number of cycles the processor spends executing a fixed amount of instructions) on benchmark gcc within one execution interval. In this example, the program execution interval is represented by 1024 sampled data points. Figure 2-2 (b) illustrates the wavelet domain representation of the original time series after a discrete wavelet transform is applied. Although the DWT operations can produce as many wavelet coefficients as the original input data, the first few wavelet coefficients usually contain the important trend.
In Figure 2-2 (b), we show the values of the first 16 wavelet coefficients. As can be seen, the discrete wavelet transform provides a compact representation of the original large volume of data. This feature can be exploited to create concise yet informative fingerprints that capture program execution behavior.

One advantage of using wavelet coefficients to fingerprint program execution is that program time domain behavior can be reconstructed from these wavelet coefficients. Figures 2-3 and 2-4 show that the time domain workload characteristics can be recovered using the inverse discrete wavelet transform.

Figure 2-3. Sampled time domain program behavior.

Figure 2-4. Reconstructing the workload dynamic behaviors: (a) 1 wavelet coefficient, (b) 2 wavelet coefficients, (c) 4 wavelet coefficients, (d) 8 wavelet coefficients, (e) 16 wavelet coefficients, (f) 64 wavelet coefficients.

In Figure 2-4 (a)-(e), the first 1, 2, 4, 8, and 16 wavelet coefficients were used to restore program time domain behavior with increasing fidelity. As shown in Figure 2-4 (f), when all (e.g., 64) wavelet coefficients are used for recovery, the original signal can be completely restored. However, this could involve storing and processing a large number of wavelet coefficients. Using a wavelet transform gives time-frequency localization of the original data. As a result, most of the energy of the input data can be represented by only a few wavelet coefficients. As can be seen, using 16 wavelet coefficients can recover program time domain behavior with sufficient accuracy.

To classify program execution into phases, it is essential that the generated wavelet coefficients across intervals preserve the dynamics that workloads exhibit in the time domain. Figure 2-5 shows the variation of the first 16 wavelet coefficients (coeff 1 - coeff 16), which represent the wavelet domain behavior of branch mispredictions and L1 data cache hits on the benchmark gcc.
The data are shown for the entire program execution, which contains a total of 1024 intervals.

Figure 2-5. Variation of wavelet coefficients: (a) branch misprediction, (b) L1 data cache hit.

Figure 2-5 shows that wavelet domain transforms largely preserve program dynamic behavior. Another interesting observation is that the first-order wavelet coefficient exhibits much more significant variation than the higher-order wavelet coefficients. This suggests that wavelet domain workload dynamics can be effectively captured using a few low-order wavelet coefficients.

2D Wavelet Transform

To effectively capture the two-dimensional spatial characteristics across large-scale multicore architecture substrates, we also use 2D wavelet analysis. With 1D wavelet analysis that uses Haar wavelet filters, each adjacent pair of data points in a discrete interval is replaced with its average and difference.

Figure 2-6. 2D wavelet transforms on 4 data points. A 2x2 block with values a, b (top row) and c, d (bottom row) is replaced by the average (a+b+c+d)/4 and three details: Dhorizontal = ((a+b)/2 - (c+d)/2)/2, Dvertical = ((b+d)/2 - (a+c)/2)/2, and Ddiagonal = ((a+d)/2 - (b+c)/2)/2.

A similar concept can be applied to obtain a 2D wavelet transform of data in a discrete plane. As shown in Figure 2-6, each group of four adjacent points in a discrete 2D plane can be replaced by their averaged value and three detail values. The detail values (Dhorizontal, Dvertical, and Ddiagonal) correspond to the average of the difference of: 1) the summation of the rows, 2) the summation of the columns, and 3) the summation of the diagonals.
To obtain wavelet coefficients for 2D data, we first apply a 1D wavelet transform to the data along the X-axis, resulting in low-pass and high-pass signals (averages and differences). Next, we apply 1D wavelet transforms to both signals along the Y-axis, generating one averaged and three detailed signals. Consequently, a 2D wavelet decomposition is obtained by recursively repeating this procedure on the averaged signal. Figure 2-7 (a) illustrates the procedure. As can be seen, the 2D wavelet decomposition can be represented by a tree-based structure. The root node of the tree contains the original data (row-majored) of the mesh of values (for example, the performance or temperatures of adjacent cores, network-on-chip links, cache banks, etc.). First, we apply 1D wavelet transforms along the X-axis, i.e., for each two points along the X-axis we compute the average and difference, so we obtain (3 5 7 1 9 1 5 9) and (1 1 1 1 5 1 1 1). Next, we apply 1D wavelet transforms along the Y-axis; for each two points along the Y-axis we compute the average and difference (at level 0 in the example shown in Figure 2-7 (a)). We perform this process recursively until the number of elements in the averaged signal becomes 1 (at level 1 in the example shown in Figure 2-7 (a)).
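The x-then-y walkthrough above can be sketched in Python. This is an illustrative sketch using the unnormalized average/half-difference convention; which detail subband is labeled horizontal, vertical or diagonal varies by convention, so the subbands are named generically here.

```python
def haar_2d_step(grid):
    """One level of an unnormalized 2-D Haar transform: an
    average/half-difference pass along the x-axis (within rows),
    then the same pass along the y-axis (within columns). Returns
    the averaged subband and three detail subbands."""
    low = [[(r[i] + r[i + 1]) / 2 for i in range(0, len(r), 2)] for r in grid]
    high = [[(r[i] - r[i + 1]) / 2 for i in range(0, len(r), 2)] for r in grid]

    def y_pass(sub):
        avg = [[(sub[j][c] + sub[j + 1][c]) / 2 for c in range(len(sub[0]))]
               for j in range(0, len(sub), 2)]
        det = [[(sub[j][c] - sub[j + 1][c]) / 2 for c in range(len(sub[0]))]
               for j in range(0, len(sub), 2)]
        return avg, det

    average, det_lh = y_pass(low)   # y-details of the x-averaged signal
    det_hl, det_hh = y_pass(high)   # y-averages / y-details of the x-differences
    return average, det_lh, det_hl, det_hh

# the 4x4 mesh of Figure 2-7, row-majored: 2 4 4 6 / 6 8 2 0 / 4 14 2 0 / 4 6 8 10
mesh = [[2, 4, 4, 6], [6, 8, 2, 0], [4, 14, 2, 0], [4, 6, 8, 10]]
level0, _, _, _ = haar_2d_step(mesh)    # 2x2 averaged subband (level 0)
level1, _, _, _ = haar_2d_step(level0)  # single overall average (level 1)
```

Run on the Figure 2-7 mesh, the x-axis low-pass of the first step is (3 5 7 1 9 1 5 9), the level-0 averaged subband is (5 3 7 5), and recursing once more yields the single level-1 average 5, matching the values quoted in the text (up to the sign convention of the details).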
Figure 2-7. 2D wavelet transforms on 16 cores/hardware components: (a) the recursive decomposition procedure, (b) the wavelet domain multiresolution representation.

Figure 2-7 (b) shows the wavelet domain multiresolution representation of the 2D spatial data. Figure 2-8 further demonstrates that 2D architecture characteristics can be effectively captured using a small number of wavelet coefficients (e.g., Average (L=0) or Average (L=1)). Since a small set of wavelet coefficients provides concise yet insightful information on architecture 2D spatial characteristics, we use predictive models (i.e., neural networks) to relate them individually to various architecture design parameters. Through the inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize architecture 2D spatial characteristics across the design space. Compared with a simulation-based method, predicting a small set of wavelet coefficients using analytical models is computationally efficient and is scalable to large-scale architecture design.

Figure 2-8. Example of applying 2D DWT on a non-uniformly accessed cache: (a) NUCA hit numbers, (b) 2D DWT (L=0), (c) 2D DWT (L=1).

CHAPTER 3
COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION

Obtaining phase dynamics is, in many cases, of great interest for accurately capturing program behavior and precisely applying runtime, application-oriented optimizations. For example, complex, real-world workloads may run for hours, days or even months before completion. Their long execution time implies that program time-varying behavior can manifest across a wide range of scales, making modeling phase behavior with a single time scale less informative. To overcome the limitations of conventional phase analysis techniques, we propose using wavelet-based multiresolution analysis to characterize phase dynamic behavior, and we develop metrics to quantitatively evaluate the complexity of phase structures.
We also propose methodologies to classify program phases from their dynamics and complexity perspectives. Specifically, the goal of this chapter is to answer the following questions: How should the complexity of program dynamics be defined? How do program dynamics change over time? If classified using existing methods, how similar are the program dynamics within each phase? How can phases with homogeneous dynamic behavior be better identified?

In this chapter, we implement our complexity-based phase analysis technique and evaluate its effectiveness against existing phase analysis methods based on program control flow and runtime information. We show that in both cases the proposed technique produces phases that exhibit more homogeneous dynamic behavior than existing methods do.

Characterizing and Classifying Program Dynamic Behavior

Using the wavelet-based multiresolution analysis described in Chapter 2, we characterize, quantify and classify program dynamic behavior on a high-performance, out-of-order execution superscalar processor coupled with a multilevel memory hierarchy.

Experimental Setup

We performed our analysis using ten SPEC CPU 2000 benchmarks: crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf and vortex. All programs were run with reference inputs to completion. We chose to focus on only 10 programs because of the lengthy simulation time incurred by executing all of the programs to completion. The statistics of workload dynamics were measured on the SimpleScalar 3.0 [28] sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model is detailed in Table 3-1.
Table 3-1 Baseline machine configuration

Parameter: Configuration
Processor Width: 8
ITLB: 128 entries, 4-way, 200-cycle miss
Branch Prediction: combined, 8K tables, 10-cycle misprediction, 2 predictions/cycle
BTB: 2K entries, 4-way
Return Address Stack: 32 entries
L1 Instruction Cache: 32KB, 2-way, 32 bytes/line, 2 ports, 4 MSHRs, 1-cycle access
RUU Size: 128 entries
Load/Store Queue: 64 entries
Store Buffer: 16 entries
Integer ALU: 4 IALU, 2 IMUL/DIV
FP ALU: 2 FPALU, 1 FPMUL/DIV
DTLB: 256 entries, 4-way, 200-cycle miss
L1 Data Cache: 64KB, 4-way, 64 bytes/line, 2 ports, 8 MSHRs, 1-cycle access
L2 Cache: unified, 1MB, 4-way, 128 bytes/line, 12-cycle access
Memory Access: 100 cycles

Metrics to Quantify Phase Complexity

To quantify phase complexity, we measure the similarity between phase dynamics observed at different time scales. Specifically, we use cross-correlation coefficients to measure the similarity between the original data sampled at the finest granularity and the approximated version reconstructed from the wavelet scaling coefficients obtained at a coarser scale. The cross-correlation coefficient (XCOR) of the two data series is defined as

XCOR(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}    (3-1)

where X is the original data series and Y is the approximated data series. Note that XCOR = 1 if the program dynamics observed at the finest scale and their approximation at a coarser granularity exhibit perfect correlation, and XCOR = 0 if the program dynamics and their approximation vary independently across time scales. X and Y can be any runtime program characteristics of interest. In this chapter, we use instructions per cycle (IPC) as the metric due to its wide usage in computer architecture design and performance evaluation. To sample IPC dynamics, we break the entire program execution into 1024 intervals and then sample 1024 IPC data points within each interval.
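As a concrete illustration, equation (3-1) can be computed with a few lines of Python. This is a minimal sketch for exposition, not code from the dissertation's toolchain:

```python
from math import sqrt

def xcor(x, y):
    """Cross-correlation coefficient of equation (3-1): 1 means the two
    series are perfectly correlated, 0 means they vary independently."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0
```

In our setting, x would be an interval's 1024-sample IPC series and y its reconstruction from coarse-scale wavelet approximation coefficients.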
Therefore, at the finest resolution level, the program dynamics of each execution interval are represented by an IPC data series with a length of 1024. We then apply wavelet multiresolution analysis to each interval. In a wavelet transform, each DWT operation produces an approximation coefficient vector with a length equal to half of the input data. We remove the detail coefficients after each wavelet transform and use only the approximation part to reconstruct the IPC dynamics, and then calculate the XCOR between the original data and the reconstructed data. We apply the discrete wavelet transform to the approximation part iteratively until the length of the approximation coefficient vector is reduced to 1. Each approximation coefficient vector is used to reconstruct a full IPC trace with a length of 1024, and the XCOR between the original and reconstructed traces is calculated using equation (3-1). As a result, for each program execution interval, we obtain an XCOR vector in which each element represents the cross-correlation coefficient between the original workload dynamics and the approximated workload dynamics at a different scale. Since we use 1024 samples within each interval, we create an XCOR vector with a length of 10 for each interval, as shown in Figure 3-1.

Figure 3-1 XCOR vectors for each program execution interval

Profiling Program Dynamics and Complexity

We use the XCOR metric to quantify the program dynamics and complexity of the studied SPEC CPU 2000 benchmarks. Figure 3-2 shows the results of the total 1024 execution intervals across ten levels of abstraction for the benchmark gcc.

Figure 3-2 Dynamic complexity profile of benchmark gcc

As can be seen, the benchmark gcc shows a wide variety of changing dynamics during its execution. As the time scale increases, XCOR values decrease monotonically. This is because wavelet approximation at a coarse scale removes details in the program dynamics observed at a fine-grained level.
A rapidly decreasing XCOR implies highly complex structures that cannot be captured by coarse-level approximation. In contrast, a slowly decreasing XCOR suggests that the program dynamics can be largely preserved using few samples. Figure 3-2 also shows a dotted line along which XCOR decreases linearly with increasing time scale. XCOR plots below that dotted line indicate rapidly decreasing XCOR values, i.e., complex program dynamics. As can be seen, a significant fraction of the benchmark gcc's execution intervals manifest quickly decreasing XCOR values, indicating that the program exhibits highly complex structure at the fine-grained level. Figure 3-2 also reveals that a few gcc execution intervals have good scalability in their dynamics: on these intervals, the XCOR values drop only 0.1 when the time scale is increased from 1 to 8. The results (Figure 3-2) clearly indicate that some program execution intervals can be accurately approximated by their high-level abstractions while others cannot.

We further break the XCOR values down into 10 categories ranging from 0 to 1 and analyze their distribution across time scales. Due to space limitations, we only show the results of three programs (swim, crafty, and gcc; see Figure 3-3), which represent the characteristics of all analyzed benchmarks. Note that at scale 1, the XCOR values of all execution intervals are always 1. Programs show heterogeneous XCOR value distributions starting from scale level 2. As can be seen, the benchmark swim exhibits good scalability in its dynamic complexity: the XCOR values of all execution intervals remain above 0.9 when the time scale is increased from 1 to 7. This implies that the captured program behavior is not sensitive to any time scale in that range. Therefore, we classify swim as a low-complexity program.
On the benchmark crafty, XCOR values decrease uniformly as the time scale increases, indicating that the observed program dynamics are sensitive to the time scales used to obtain them. We refer to this behavior as medium complexity. On the benchmark gcc, the program dynamics decay rapidly. This suggests that abundant program dynamics could be lost if coarser time scales are used to characterize them. We refer to this characteristic as high-complexity behavior.

Figure 3-3 XCOR value distributions: (a) swim (low complexity), (b) crafty (medium complexity), (c) gcc (high complexity)

The dynamic complexity and XCOR value distribution plots (Figure 3-2 and Figure 3-3) provide a quantitative and informative representation of runtime program complexity. Using this information, we classify the studied programs in terms of their complexity; the results are shown in Table 3-2.

Table 3-2 Classification of benchmarks based on their complexity

Low complexity: swim
Medium complexity: crafty, gzip, parser, perlbmk, twolf
High complexity: gap, gcc, mcf, vortex

Classifying Program Phases Based on Their Dynamic Behavior

In this section, we show that program execution manifests heterogeneous complexity behavior. We further examine the efficiency of current methods in classifying program dynamics into phases and propose a new method that can better identify program complexity.
Classifying complexity-based phase behavior enables us to understand program dynamics progressively in a fine-to-coarse fashion, to operate at different resolutions, to manipulate features at different scales, and to localize characteristics in both the spatial and frequency domains.

Simpoint

Sherwood and Calder [1] proposed a phase analysis tool called Simpoint to automatically classify the execution of a program into phases. They found that intervals of program execution grouped into the same phase had similar statistics. The Simpoint tool clusters program execution based on code signatures and execution frequency. We identified program execution intervals grouped into the same phase by the Simpoint tool and analyzed their dynamic complexity.

Figure 3-4 XCORs within the same phase identified by Simpoint: (a) cluster #7, (b) cluster #5, (c) cluster #48

Figure 3-4 shows the results for the benchmark mcf, on which Simpoint generates 55 clusters; the figure shows the program dynamics within three of these clusters. Each cluster represents a unique phase. In cluster 7, the classified phase shows homogeneous dynamics. In cluster 5, program execution intervals show two distinct dynamics but are classified as the same phase. In cluster 48, program execution complexity varies widely; nevertheless, Simpoint classifies these intervals as a single phase. The results (Figure 3-4) suggest that program execution intervals classified as the same phase by Simpoint can still exhibit widely varied dynamic behavior.

Complexity-aware Phase Classification

To enhance the capability of current methods in characterizing program dynamics, we propose complexity-aware phase classification. Our method uses the multiresolution property of wavelet transforms to identify and classify changes in program code execution across different scales. We assume a baseline phase analysis technique that uses basic block vectors (BBVs) [10].
A basic block is a section of code that is executed from start to finish, with one entry and one exit. A BBV represents the code blocks executed during a given interval of execution. To represent program dynamics at different time scales, we create a set of basic block vectors for each interval at different resolutions. For example, at the coarsest level (scale = 10), a program execution interval is represented by one BBV. At the most detailed level, the same program execution interval is represented by 1024 BBVs from 1024 consecutively subdivided intervals (Figure 3-5). To reduce the amount of data that needs to be processed, we use random projection to reduce the dimensionality of all BBVs to 15, as suggested in [1].

Figure 3-5 BBVs with different resolutions

The coarser-scale BBVs are approximations of the finest-scale BBVs generated by the wavelet-based multiresolution analysis.

Figure 3-6 Multiresolution analysis of the projected BBVs

As shown in Figure 3-6, the discrete wavelet transform is applied to each dimension of the set of BBVs at the finest scale. The XCOR calculation is used to estimate the correlation between a BBV element and its approximations at coarser scales. The results are 15 XCOR vectors representing the complexity of each dimension of the BBVs across 10 levels of abstraction. The 15 XCOR vectors are then averaged together to obtain an aggregated XCOR vector that represents the entire BBV complexity characteristics for that execution interval.

Using the above steps, we obtain an aggregated XCOR vector for each program execution interval. We then run the k-means clustering algorithm [29] on the collected XCOR vectors, which represent the dynamic complexity of the program execution intervals, and classify them into phases. This is similar to what Simpoint does; the difference is that the Simpoint tool uses raw BBVs, whereas our method uses aggregated BBV XCOR vectors as the input to k-means clustering.
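The per-dimension multiresolution XCOR computation and its aggregation can be sketched as follows. This is a simplified Python illustration with hypothetical function names: a Haar-style pairwise-averaging step stands in for the DWT, and reconstruction simply repeats each coarse coefficient:

```python
from math import sqrt

def _xcor(x, y):
    """Cross-correlation coefficient (equation (3-1))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def mra_xcor_vector(series):
    """XCOR between a length-2^k series and its approximations at every
    coarser scale; details are dropped at each level, and a full-length
    trace is reconstructed from the approximation alone."""
    out = []
    approx = list(series)
    while len(approx) > 1:
        # one analysis step: keep pairwise averages, drop the details
        approx = [(approx[i] + approx[i + 1]) / 2
                  for i in range(0, len(approx), 2)]
        rep = len(series) // len(approx)
        recon = [v for v in approx for _ in range(rep)]
        out.append(_xcor(series, recon))
    return out

def aggregated_xcor(bbvs):
    """bbvs: one projected BBV per sub-interval (rows). The analysis runs
    down each BBV dimension, then the per-dimension XCOR vectors are
    averaged into one aggregated vector for the whole interval."""
    dims = list(zip(*bbvs))
    vectors = [mra_xcor_vector(list(d)) for d in dims]
    levels = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(levels)]
```

Feeding each interval's aggregated XCOR vector to k-means then yields the complexity-aware phases described above.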
Experimental Results

We compare Simpoint and the proposed approach in their capability of classifying phase complexity. Since we apply the wavelet transform to program basic block vectors, we refer to our method as multiresolution analysis of BBVs (MRA-BBV).

Figure 3-7 Weighted COV calculation

We examine the similarity of program complexity within each phase classified by the two approaches. Instead of using IPC, we use IPC dynamics as the metric for evaluation. After classifying all program execution intervals into phases, we examine each phase and compute the IPC XCOR vectors for all the intervals in that phase. We then calculate the standard deviation of the IPC XCOR vectors within each phase and divide the standard deviation by the average to get the coefficient of variation (COV). As shown in Figure 3-7, we calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e., weighted COV) used to compare different phase classifications for a given program. Since the COV measures the standard deviation as a percentage of the average, a lower COV value indicates a better phase classification technique.

Figure 3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics

Figure 3-8 shows experimental results for all the studied benchmarks.
As can be seen, the MRA-BBV method produces phases that exhibit more homogeneous dynamics and complexity than the standard BBV-based method, as reflected in the lower COV values generated by MRA-BBV. In general, the COV values of both methods increase when coarse time scales are used for complexity approximation. MRA-BBV achieves significantly better classification on benchmarks with high complexity, such as gap, gcc, and mcf. On programs that exhibit medium complexity, such as crafty, gzip, parser, and twolf, the two schemes show comparable effectiveness. On a benchmark with trivial complexity (e.g., swim), both schemes work well.

We further examine the capability of using runtime performance metrics to capture complexity-aware phase behavior. Instead of using BBVs, the sampled IPC is used directly as the input to the k-means phase clustering algorithm. Similarly, we apply multiresolution analysis to the IPC data and then use the gathered information for phase classification. We call this method multiresolution analysis of IPC (MRA-IPC). Figure 3-9 shows the phase classification results. As can be seen, the observations we made in the BBV-based cases hold in the IPC-based cases. This implies that the proposed multiresolution analysis can be applied to both methods to improve their capability of capturing phase dynamics.
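The weighted-COV metric used to score these classifications can be sketched in Python. The mapping name `phases` is hypothetical: it takes each phase id to the per-interval metric values (IPC, or elements of the IPC XCOR vectors) assigned to that phase:

```python
def weighted_cov(phases):
    """Overall weighted COV: each phase's COV (std dev / mean) is weighted
    by the fraction of all execution intervals that the phase contains."""
    total = sum(len(v) for v in phases.values())
    overall = 0.0
    for vals in phases.values():
        mean = sum(vals) / len(vals)
        std = (sum((x - mean) ** 2 for x in vals) / len(vals)) ** 0.5
        cov = std / mean if mean else 0.0
        overall += (len(vals) / total) * cov
    return overall
```

A lower weighted COV means the intervals grouped into each phase behave more homogeneously, which is exactly what the comparisons in this chapter measure.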
Figure 3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics

CHAPTER 4
IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE ANALYSIS

In this chapter, we focus on workload-statistics-based phase analysis because, on a given machine configuration and environment, it is better suited to identifying how the targeted architecture features vary during program execution. In contrast, phase classification using program code structures lacks the capability of informing how workloads behave architecturally [13, 30]. Phase analysis using specified workload characteristics therefore allows one to explicitly link the targeted architecture features to the classified phases. For example, if phases are used to optimize cache efficiency, the workload characteristics that reflect cache behavior can be used to explicitly classify program execution into cache performance/power/reliability-oriented phases. Program-code-structure-based phase analysis identifies similar phases only if they have similar code flow; there can be cases where two sections of code have different code flow but exhibit similar architectural behavior [13], and code-flow-based phase analysis would then classify them as different phases. Another advantage of workload-statistics-based phase analysis is that when multiple threads share the same resource (e.g., pipeline, cache), using workload execution information to classify phases makes it possible to capture program dynamic behavior arising from the interactions between threads.
The key goal of workload-execution-based phase analysis is to accurately and reliably discern and recover phase behavior from various program runtime statistics, which are represented as large-volume, high-dimension, and noisy data. To effectively achieve this objective, recent work [30, 31] proposes using wavelets as a tool to assist phase analysis. The basic idea is to transform workload time-domain behavior into the wavelet domain. The generated wavelet coefficients, which extract compact yet informative program runtime features, are then assembled together to facilitate phase classification. Nevertheless, in current work, the examined scope of workload characteristics and the explored benefits of the wavelet transform are quite limited.

In this chapter, we extend the research of Chapter 3 by applying wavelets to many types of program execution statistics and quantifying the benefits of using wavelets to improve accuracy, scalability, and robustness in phase classification. We conclude that wavelet-domain phase analysis has the following advantages: 1) accuracy: the wavelet transform significantly reduces temporal dependence in the sampled workload statistics; as a result, simple models that are insufficient in the time domain become quite accurate in the wavelet domain. More attractively, wavelet coefficients transformed from various dimensions of program execution characteristics can be dynamically assembled together to further improve phase classification accuracy; 2) scalability: phase classification using wavelet analysis of high-dimension sampled workload statistics can alleviate the counter overflow problem, which has a negative impact on phase detection; it is therefore much more scalable for analyzing the large-scale phases exhibited by long-running, real-world programs; and 3) robustness: wavelets offer denoising capabilities, which allows phase classification to be performed robustly in the presence of workload execution variability.
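To illustrate the denoising advantage (point 3), the sketch below performs a one-level Haar analysis, soft-thresholds the detail coefficients, and inverts the transform, so that small fluctuations are suppressed while trends survive. This is an illustrative simplification: the experiments in this chapter use an order-8 Daubechies wavelet, and Haar is chosen here only to keep the example short.

```python
def haar_denoise(signal, threshold):
    """One-level Haar transform, soft-threshold the details, invert.
    Assumes an even-length input signal."""
    avgs = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    dets = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    # soft thresholding: shrink each detail coefficient toward zero
    shrunk = [max(abs(d) - threshold, 0.0) * (1.0 if d >= 0 else -1.0)
              for d in dets]
    out = []
    for a, d in zip(avgs, shrunk):
        out.extend([a + d, a - d])
    return out
```

With a threshold of zero the transform is inverted exactly; with a positive threshold, sampling noise in the workload statistics is attenuated before phase signatures are formed.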
Workload-Statistics-Based Phase Analysis

Using the wavelet-based method, we explore program phase analysis on a high-performance, out-of-order execution superscalar processor coupled with a multi-level memory hierarchy. We use the Daubechies wavelet [26, 27] with an order of 8 for the rest of the experiments due to its high accuracy and low computation overhead. This section describes our experimental methodologies, the simulated machine configuration, the experimented benchmarks, and the evaluated metrics.

We performed our analysis using twelve SPEC CPU 2000 integer benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, and vpr. All programs were run with the reference input to completion. The runtime workload execution statistics were measured on the SimpleScalar 3.0 sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model we used is detailed in Table 4-1.

Table 4-1 Baseline machine configuration

Parameter: Configuration
Processor Width: 8
ITLB: 128 entries, 4-way, 200-cycle miss
Branch Prediction: combined, 8K tables, 10-cycle misprediction, 2 predictions/cycle
BTB: 2K entries, 4-way
Return Address Stack: 32 entries
L1 Instruction Cache: 32KB, 2-way, 32 bytes/line, 2 ports, 4 MSHRs, 1-cycle access
RUU Size: 128 entries
Load/Store Queue: 64 entries
Store Buffer: 16 entries
Integer ALU: 4 IALU, 2 IMUL/DIV
FP ALU: 2 FPALU, 1 FPMUL/DIV
DTLB: 256 entries, 4-way, 200-cycle miss
L1 Data Cache: 64KB, 4-way, 64 bytes/line, 2 ports, 8 MSHRs, 1-cycle access
L2 Cache: unified, 1MB, 4-way, 128 bytes/line, 12-cycle access
Memory Access: 100 cycles

We use IPC (instructions per cycle) as the metric to evaluate the similarity of program execution within each classified phase. To quantify phase classification accuracy, we use the weighted COV metric proposed by Calder et al. [15]. After classifying all program execution intervals into phases, we examine each phase and compute the IPC for all the intervals in that phase.
We then calculate the standard deviation of the IPC within each phase and divide it by the average to get the coefficient of variation (COV). We then calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e., weighted COV) used to compare different phase classifications for a given program. Since the COV measures the standard deviation as a percentage of the average, a lower COV value means a better phase classification technique.

Exploring Wavelet-Domain Phase Analysis

We first evaluate the efficiency of wavelet analysis on a wide range of program execution characteristics by comparing its phase classification accuracy with methods that use information in the time domain. We then explore methods to further improve phase classification accuracy in the wavelet domain.

Phase Classification: Time Domain vs. Wavelet Domain

The wavelet analysis method provides a cost-effective representation of program behavior. Since wavelet coefficients are generally decorrelated, we can transform the original data into the wavelet domain and then carry out the phase classification task. The generated wavelet coefficients can be used as signatures to classify program execution intervals into phases: if two program execution intervals show similar fingerprints (represented as sets of wavelet coefficients), they can be classified into the same phase. To quantify the benefit of wavelet-based analysis, we compare phase classification methods that use time-domain and wavelet-domain program execution information. In our time-domain phase analysis method, each program execution interval is represented by a time series consisting of 1024 sampled program execution statistics. We first apply random projection to reduce the data dimensionality to 16.
We then use the k-means clustering algorithm to classify program intervals into phases. This is similar to the method used by the popular Simpoint tool, where basic block vectors (BBVs) are used as input. For the wavelet-domain method, the original time series are first transformed into the wavelet domain using the DWT. The first 16 wavelet coefficients of each program execution interval are extracted and used as the input to the k-means clustering algorithm. Figure 4-1 illustrates the above procedure.

Figure 4-1 Phase analysis methods: time domain vs. wavelet domain

We investigated the efficiency of applying wavelet-domain analysis to 10 different workload execution characteristics, namely, the numbers of executed loads (load), stores (store), and branches (branch); the number of cycles the processor spends executing a fixed amount of instructions (cycle); the number of branch mispredictions (branch_miss); the numbers of L1 instruction cache, L1 data cache, and L2 cache hits (il1_hit, dl1_hit, and ul2_hit); and the numbers of instruction and data TLB hits (itlb_hit and dtlb_hit). Figure 4-2 shows the COVs of phase classification in the time and wavelet domains when each type of workload execution characteristic is used as input. As can be seen, compared with using raw, time-domain workload data, wavelet-domain analysis significantly improves phase classification accuracy, and this observation holds for all the investigated workload characteristics across all the examined benchmarks. This is because, in the time domain, collected program runtime statistics are treated as high-dimension time series data, and random projection methods are used to reduce the dimensionality of the feature vectors that represent a workload signature at a given execution interval.
However, the simple random projection function can increase aliasing between phases and reduce the accuracy of phase detection.

Figure 4-2 Phase classification accuracy, per workload characteristic and benchmark: time domain vs.
wavelet domain.

By transforming program runtime statistics into the wavelet domain, workload behavior can be represented by a series of wavelet coefficients that is much more compact and efficient than its counterpart in the time domain. The wavelet transform significantly reduces temporal dependence, and therefore simple models that are insufficient in the time domain become quite accurate in the wavelet domain.

Figure 4-2 also shows that in the wavelet domain, the efficiency of using a single type of program characteristic to classify program phases can vary significantly across benchmarks. For example, while ul2_hit achieves accurate phase classification on the benchmark vortex, it results in a high phase classification COV on the benchmark gcc. To overcome this disadvantage and to build phase classification methods that achieve high accuracy across a wide range of applications, we explore using wavelet coefficients derived from different types of workload characteristics.

Figure 4-3 Phase classification using hybrid wavelet coefficients

As shown in Figure 4-3, a DWT is applied to each type of workload characteristic. The generated wavelet coefficients from the different categories can then be assembled together to form a signature for a data clustering algorithm. Our objective is to improve wavelet-domain phase classification accuracy across different programs while using an equivalent amount of information to represent program behavior. We choose a set of 16 wavelet coefficients as the phase signature, since it provides sufficient precision in capturing program dynamics when a single type of program characteristic is used.
If a phase signature can be composed from multiple workload characteristics, there are many ways to form a 16-dimension phase signature. For example, a phase signature can be generated using one wavelet coefficient from each of 16 different workload characteristics (16x1), or it can be composed using 8 workload characteristics with 2 wavelet coefficients each (8x2). Alternatively, a phase signature can be formed using 4 workload characteristics with 4 wavelet coefficients each (4x4), or 2 workload characteristics with 8 wavelet coefficients each (2x8). We extend the 10 workload execution characteristics (Figure 4-2) to 16 by adding the following events: the numbers of accesses to the instruction cache (il1_access), data cache (dl1_access), L2 cache (ul2_access), instruction TLB (itlb_access), and data TLB (dtlb_access). To understand the trade-offs among the different methods of generating hybrid signatures, we performed an exhaustive search using the above 4 schemes on all benchmarks to identify the best COV each scheme can achieve. The results (their ranks in terms of phase classification accuracy and the COVs of phase analysis) are shown in Table 4-2. As can be seen, statistically, hybrid wavelet signatures generated using 16 (16x1) and 8 (8x2) workload characteristics achieve higher accuracy. This suggests that combining wavelet-domain workload characteristics from multiple dimensions to form a phase signature is beneficial for phase analysis.
Table 4-2 Efficiency of different hybrid wavelet signatures in phase classification (scheme and its phase classification COV, ranked #1 to #4)

bzip2: 16x1 (6.5%); 8x2 (10.5%); 4x4 (10.5%); 2x8 (10.5%)
crafty: 4x4 (1.2%); 16x1 (1.6%); 8x2 (1.9%); 2x8 (3.9%)
eon: 8x2 (1.3%); 4x4 (1.6%); 16x1 (1.8%); 2x8 (3.6%)
gap: 4x4 (4.2%); 16x1 (6.3%); 8x2 (7.2%); 2x8 (9.3%)
gcc: 8x2 (4.7%); 16x1 (5.8%); 4x4 (6.5%); 2x8 (14.1%)
gzip: 16x1 (2.5%); 4x4 (3.7%); 8x2 (4.4%); 2x8 (4.9%)
mcf: 16x1 (9.5%); 4x4 (10.2%); 8x2 (12.1%); 2x8 (87.8%)
parser: 16x1 (4.7%); 8x2 (5.2%); 4x4 (7.3%); 2x8 (8.4%)
perlbmk: 8x2 (0.7%); 16x1 (0.8%); 4x4 (0.8%); 2x8 (1.5%)
twolf: 16x1 (0.2%); 8x2 (0.2%); 4x4 (0.4%); 2x8 (0.5%)
vortex: 16x1 (2.4%); 8x2 (4%); 2x8 (4.4%); 4x4 (5.8%)
vpr: 16x1 (3%); 8x2 (14.9%); 4x4 (15.9%); 2x8 (16.3%)

We further compare the efficiency of the 16x1 hybrid scheme (Hybrid), the best case that a single type of workload characteristic can achieve (Individual_Best), and the Simpoint-based phase classification that uses basic block vectors (BBV). The results for the 12 SPEC integer benchmarks are shown in Figure 4-4.

Figure 4-4 Phase classification accuracy of the 16x1 hybrid scheme

As can be seen, Hybrid outperforms Individual_Best on 10 of the 12 benchmarks. Hybrid also outperforms the BBV-based Simpoint method in 10 of the 12 cases.

Scalability

We saw above that wavelet-domain phase analysis can achieve higher accuracy. In this subsection, we address another important issue in phase analysis using workload execution characteristics: scalability.
Counters are usually used to collect workload statistics during program execution. These counters may overflow if they are used to track large-scale phase behavior on long-running workloads. Today, many large, real-world workloads run for days, weeks or even months before completion, and this trend is likely to continue in the future. To perform phase analysis on the next generation of computer workloads and systems, phase classification methods should be capable of scaling with increasing program execution time. To understand the impact of counter overflow on phase analysis accuracy, we use 16 accumulative counters to record the 16-dimension workload characteristics. The values of the 16 accumulative counters are then used as a signature to perform phase classification. We gradually reduce the number of bits in the accumulative counters; as a result, counter overflows start to occur. We use two schemes to handle a counter overflow. In the first method, a counter saturates at its maximum value once it overflows. In the second method, the counter is reset to zero after an overflow occurs. After all counter overflows are handled, we use the 16-dimension accumulative counter values to perform phase analysis and calculate the COVs. Figure 4-5 (a) describes this procedure.

Figure 4-5 Different methods to handle counter overflows: (a) n-bit accumulative counters with overflow handling, feeding k-means clustering; (b) n-bit sampling counters feeding DWT/hybrid wavelet signature generation and k-means clustering

Our counter overflow analysis results are shown in Figure 4-6.
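The two overflow-handling schemes can be sketched as follows. This is a minimal illustration; the bit widths are placeholders, and the reset scheme is modeled as modulo wrap-around (an assumption about the counter hardware, rather than a literal clear-to-zero).

```python
def saturating_add(counter, increment, n_bits):
    """First scheme: an n-bit counter that sticks at its maximum value
    once it overflows."""
    max_val = (1 << n_bits) - 1
    return min(counter + increment, max_val)

def wrapping_add(counter, increment, n_bits):
    """Second scheme: an n-bit counter that resets on overflow, modeled
    here as modulo-2^n wrap-around."""
    return (counter + increment) & ((1 << n_bits) - 1)
```

For example, adding 5 to a 4-bit counter holding 14 yields 15 under saturation and 3 under wrap-around.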
Figure 4-6 also shows the counter overflow rate (i.e., the percentage of overflowed counters) when counters of different sizes are used to collect workload statistics within program execution intervals. For example, on the benchmark crafty, when the number of bits used in the counters is reduced to 20, 100% of the counters overflow. For clarity, we only show the region within which the counter overflow rate is greater than zero and less than or equal to one. Since each program has a different execution time, this region varies from one program to another. As can be seen, counter overflows have a negative impact on phase classification accuracy. In general, COVs increase with the counter overflow rate. Interestingly, as the overflow rate increases, there are cases where overflow handling can reduce the COVs. This is because overflow handling has the effect of normalizing and smoothing irregular peaks in the original statistics.

Figure 4-6 Impact of counter overflows on phase analysis accuracy (per-benchmark COV of the Saturate, Reset and Wavelet schemes as the number of counter bits decreases, annotated with the counter overflow rates)

One solution to avoid counter overflows is to use sampling counters instead of accumulative counters, as shown in Figure 4-5 (b). However, when sampling counters are used, the collected statistics are represented as time series with a large volume of data. The results shown in Figure 4-2 suggest that directly employing runtime samples in phase classification is less desirable. To address the scalability issue in characterizing large-scale program phases using workload execution statistics, wavelet-based dimensionality reduction techniques can be applied to extract the essential features of workload behavior from the sampled statistics. The observations we made in previous sections motivate the use of the DWT to absorb the large volume of sampled raw data and produce highly efficient wavelet domain signatures for phase analysis, as shown in Figure 4-5 (b). Figure 4-6 further shows phase analysis accuracy after applying wavelet techniques to the statistics collected using sampling counters of different sizes. As can be seen, sampling enables the use of counters of limited size to study large program phases. In general, sampling scales naturally with the interval size as long as the sampled values do not overflow the counters. Note, however, that with an increasing mismatch between phase interval and counter size, the sampling frequency must be increased, resulting in an even higher volume of sampled data.
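The sampling-counter path of Figure 4-5 (b) can be sketched as follows. This is an illustrative assumption about the mechanism, not the dissertation's implementation: the counter is read and reset every fixed number of events (so it never grows unbounded), and the resulting series is compacted with a Haar decomposition, keeping only the coarsest coefficients as a fixed-size signature. Series lengths are assumed to be powers of two.

```python
import math

def sample_counter(event_stream, sample_period):
    """Model a sampling counter: accumulate event increments and
    read/reset the counter every `sample_period` events, producing a
    time series instead of one large accumulated value."""
    samples, counter = [], 0
    for i, delta in enumerate(event_stream, 1):
        counter += delta
        if i % sample_period == 0:
            samples.append(counter)
            counter = 0        # reading the counter resets it
    return samples

def wavelet_signature(samples, keep):
    """Compress the sampled series into a fixed-size signature via a full
    Haar decomposition, keeping only the `keep` coarsest coefficients."""
    approx, details = list(samples), []
    while len(approx) > 1:
        d = [(approx[i] - approx[i + 1]) / math.sqrt(2) for i in range(0, len(approx), 2)]
        approx = [(approx[i] + approx[i + 1]) / math.sqrt(2) for i in range(0, len(approx), 2)]
        details = d + details
    return (approx + details)[:keep]
```

The signature size stays fixed no matter how many samples an interval produces, which is the scalability property the text describes.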
Using wavelet domain phase analysis, we can effectively infer program behavior from a large set of data collected over a long time span, resulting in low COVs in phase analysis.

Workload Variability

As described earlier, our methods collect various program execution statistics and use them to classify program execution into different phases. Such phase classification generally relies on comparing the similarity of the collected statistics. Ideally, different runs of the same code segment should be classified into the same phase. Existing phase detection techniques assume that workloads have deterministic execution. On real systems, with operating system interventions and other threads, applications manifest behavior that is not the same from run to run. This variability can stem from changes in system state that alter cache, TLB or I/O behavior, from system calls or interrupts, and can result in noticeably different timing and performance behavior [18, 32]. This cross-run variability can confuse similarity-based phase detection. For a phase analysis technique to be applicable on real systems, it should perform robustly under variability. Program cross-run variability can be thought of as noise, i.e., a random variance of a measured statistic. There are many possible sources of noisy data, such as measurement/instrument errors and interventions of the operating system. Removing this variability from the collected runtime statistics can be considered a process of denoising. In this chapter, we explore using wavelets as an effective way to perform denoising. Due to the vanishing moment property of wavelets, only some wavelet coefficients are significant in most cases. By retaining selected wavelet coefficients, a wavelet transform can be applied to reduce the noise.
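The retain-significant-coefficients idea can be sketched as follows, using the Haar wavelet as an illustrative stand-in for the Daubechies-8 wavelet the chapter actually uses; the hard threshold value is a placeholder, and input lengths are assumed to be powers of two.

```python
import math

def haar_denoise(samples, threshold):
    """Wavelet denoising sketch: (1) decompose with the Haar wavelet,
    (2) hard-threshold the detail coefficients, (3) reconstruct."""
    # 1) decompose: full Haar analysis, remembering details per level
    approx, levels = list(samples), []
    while len(approx) > 1:
        a = [(approx[i] + approx[i + 1]) / math.sqrt(2) for i in range(0, len(approx), 2)]
        d = [(approx[i] - approx[i + 1]) / math.sqrt(2) for i in range(0, len(approx), 2)]
        approx, levels = a, levels + [d]    # levels[0] is the finest scale
    # 2) threshold: zero out small (presumed noise) detail coefficients
    levels = [[c if abs(c) > threshold else 0.0 for c in d] for d in levels]
    # 3) reconstruct: inverse Haar from the modified coefficients,
    #    coarsest level first
    for d in reversed(levels):
        nxt = []
        for a_c, d_c in zip(approx, d):
            nxt.append((a_c + d_c) / math.sqrt(2))
            nxt.append((a_c - d_c) / math.sqrt(2))
        approx = nxt
    return approx
```

With a threshold of zero the reconstruction is exact; with a large threshold all details are suppressed and the output collapses to the series mean, which is the smoothing effect described above.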
The main idea of wavelet denoising is to transform the data into the wavelet basis, where the large coefficients mainly contain the useful information and the smaller ones represent noise. By suitably modifying the coefficients in the new basis, noise can be directly removed from the data. The general denoising procedure involves three steps: 1) decompose: compute the wavelet decomposition of the original data; 2) threshold wavelet coefficients: select a threshold and apply thresholding to the wavelet coefficients; and 3) reconstruct: compute the wavelet reconstruction using the modified wavelet coefficients. More details on wavelet-based denoising techniques can be found in [33]. To model workload runtime variability, we use additive noise models and randomly inject noise into the time series that represents workload execution behavior. We vary the SNR (signal-to-noise ratio) to simulate scenarios with different degrees of variability. To classify program execution into phases, we generate a 16-dimension feature vector where each element contains the average value of the collected program execution characteristic for each interval. The k-means algorithm is then used for data clustering. Figure 4-7 illustrates this procedure.

Figure 4-7 Method for modeling workload variability: noise N(t) from the variability model is added to the sampled workload statistics S1(t) to produce S2(t) = S1(t) + N(t); wavelet denoising recovers D2(t), and the phase classification COVs are compared

We use the Daubechies-8 wavelet with a global wavelet coefficient thresholding policy to perform denoising. We then compare the phase classification COVs of using the original data, the data with variability injected, and the data after denoising. Figure 4-8 shows our experimental results.
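The additive noise injection at a target SNR can be sketched as follows. The Gaussian noise shape and the seeded generator are illustrative assumptions; the dissertation only specifies an additive model with a controlled SNR.

```python
import math, random

def add_noise(signal, snr_db, rng):
    """Inject additive Gaussian noise into a time series so the result
    has approximately the requested signal-to-noise ratio in dB."""
    power = sum(x * x for x in signal) / len(signal)       # mean signal power
    noise_power = power / (10 ** (snr_db / 10.0))          # SNR = P_sig / P_noise
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]
```

For example, `add_noise(series, 20, random.Random(0))` models low variability and `add_noise(series, 5, random.Random(0))` models high variability.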
Figure 4-8 Effect of using wavelet denoising to handle workload variability (phase classification COV of the original, noised and denoised data at SNR=20 and SNR=5 for each benchmark)

SNR=20 represents scenarios with a low degree of variability and SNR=5 reflects situations with a high degree of variability. As can be seen, introducing variability into the workload execution statistics reduces phase analysis accuracy. Wavelet denoising is capable of recovering phase behavior from the noised data, resulting in higher phase analysis accuracy. Interestingly, on some benchmarks (e.g., eon and mcf), the denoised data achieve better phase classification accuracy than the original data. This is because, in phase classification, randomly occurring peaks in the gathered workload execution data can have a deleterious effect on the classification results. Wavelet denoising smoothes these irregular peaks and makes the phase classification method more robust. Various types of wavelet denoising can be performed by choosing different threshold selection rules (e.g., rigrsure, heursure, sqtwolog and minimaxi), by performing hard (h) or soft (s) thresholding, and by specifying the multiplicative threshold rescaling model (e.g., one, sln and mln). We compare the efficiency of the different denoising techniques implemented in the MATLAB tool [34]. Due to space limitations, only the results for the benchmarks bzip2, gcc and mcf are shown in Figure 4-9. As can be seen, the different wavelet denoising schemes achieve comparable accuracy in phase classification.
Figure 4-9 Efficiency of different denoising schemes (phase classification COV of each threshold rule/thresholding/rescaling combination on bzip2, gcc and mcf)

CHAPTER 5
INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION

It has been well known to the processor design community that program runtime characteristics exhibit significant variation. To capture the dynamic behavior that programs manifest on complex microprocessors and systems, architects resort to detailed, cycle-accurate simulations. Figure 5-1 illustrates the variation in workload dynamics for the SPEC CPU 2000 benchmarks gap, crafty and vpr within one of their execution intervals. The results show the time-varying behavior of workload performance (gap), power (crafty) and reliability (vpr) metrics across simulations with different microarchitecture configurations.

Figure 5-1 Variation of workload performance, power and reliability dynamics (time-varying CPI of gap, power of crafty and AVF of vpr across different microarchitecture configurations)

As can be seen, the workload dynamics manifested while executing the same code base vary widely across processors with different configurations. As the number of parameters in the design space increases, such variation in workload dynamics cannot be captured without using slow, detailed simulations. However, using simulation-based methods for architecture design space exploration, where numerous design parameters have to be considered, is prohibitively expensive.
Recently, researchers have proposed several predictive models [20-25] to reason about aggregated workload behavior at the architecture design stage. Among them, linear regression and neural network models have been the most widely used approaches. Linear models are straightforward to understand and provide accurate estimates of the significance of parameters and their interactions. However, they are usually inadequate for modeling the nonlinear dynamics of real-world workloads, which exhibit widely different characteristics and complexity. Of the nonlinear methods, neural network models can accurately predict aggregated program statistics (e.g., the CPI of the entire workload execution). Such models are termed global models, as only one model is used to characterize the measured programs. These monolithic global models are incapable of capturing and revealing program dynamics, which contain interesting fine-grain behavior. Moreover, a workload may produce different dynamics when the underlying architecture configuration changes. Therefore, new methods are needed to accurately predict complex workload dynamics. To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques, which can produce a good local representation of the workload behavior in both the time and frequency domains. The proposed analytical models, which combine wavelet-based multiscale data representation and neural-network-based regression prediction, can efficiently reason about program dynamics without resorting to detailed simulations. With our schemes, the complex workload dynamics is decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network.
We extensively evaluate the efficiency of using wavelet neural networks to predict the dynamics that the SPEC CPU 2000 benchmarks manifest on high performance microprocessors, with a microarchitecture design space that consists of 9 key parameters. Our results show that the models achieve high accuracy in forecasting workload dynamics across a large microarchitecture design space. In this chapter, we propose the use of wavelet neural networks to build accurate predictive models for workload-dynamics-driven microarchitecture design space exploration. We show that wavelet neural networks can accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluate the efficiency of using the proposed techniques to predict workload dynamic behavior in the performance, power and reliability domains. We perform extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy, and identify the microarchitecture parameters that significantly affect workload dynamic behavior. We also present a case study of using workload-dynamics-aware predictive models to quickly estimate the efficiency of scenario-driven architecture optimizations across different domains. Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios.

Neural Network

An Artificial Neural Network (ANN) [42] is an information processing paradigm inspired by the way biological nervous systems process information. It is composed of a set of interconnected processing elements working in unison to solve problems.
Figure 5-2 Basic architecture of a neural network (an input layer x1, ..., xn fully connected to a hidden layer of Radial Basis Function units H1(x), ..., Hn(x), whose responses are combined with weights w1, ..., wn to produce the output f(x))

The most common type of neural network (Figure 5-2) consists of three layers of units: a layer of input units is connected to a layer of hidden units, which is connected to a layer of output units. The input is fed into the network through the input units. Each hidden unit receives the entire input vector and generates a response. The output of a hidden unit is determined by the input-output transfer function specified for that unit. Commonly used transfer functions include the sigmoid, the linear threshold function and the Radial Basis Function (RBF) [35]. The ANN output, which is determined by the output unit, is computed using the responses of the hidden units and the weights between the hidden and output units. Neural networks outperform linear models in capturing complex, nonlinear relations between input and output, which makes them a promising technique for tracking and forecasting complex behavior. In this chapter, we use the RBF transfer function to model and estimate important wavelet coefficients on unexplored design spaces because of its superior ability to approximate complex functions. The basic architecture of an RBF network with n inputs and a single output is shown in Figure 5-2. The nodes in adjacent layers are fully connected. A linear single-layer neural network models a 1-dimensional function f as a linear combination of a set of n fixed functions, often called basis functions by analogy with the concept of a vector being composed of a linear combination of basis vectors:

f(x) = \sum_{j=1}^{n} w_j h_j(x)    (5-1)

Here, (w_j)_{j=1}^{n} is the adaptable or trainable weight vector and (h_j)_{j=1}^{n} are the fixed basis functions, i.e., the transfer functions of the hidden units.
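Equation 5-1 with a Gaussian radial basis can be sketched as follows. The Gaussian is extended here from the scalar case to vector inputs via Euclidean distance, and the centers, radii and weights are illustrative placeholders; in the actual scheme they come from training.

```python
import math

def gaussian_rbf(x, center, radius):
    """Gaussian radial basis: the response decays monotonically with
    distance from the center."""
    dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist2 / radius ** 2)

def rbf_output(x, centers, radii, weights):
    """Single-output RBF network of Equation 5-1:
    f(x) = sum_j w_j * h_j(x)."""
    return sum(w * gaussian_rbf(x, c, r)
               for w, c, r in zip(weights, centers, radii))
```

At a hidden unit's center the Gaussian response is exactly 1, so a single-unit network returns that unit's weight there.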
The flexibility of f, its ability to fit many different functions, derives only from the freedom to choose different values for the weights. The basis functions and any parameters they might contain are fixed. If the basis functions can change during the learning process, then the model is nonlinear. Radial functions are a special class of functions. Their characteristic feature is that their response decreases (or increases) monotonically with distance from a central point. The center, the distance scale, and the precise shape of the radial function are parameters of the model, all fixed if the model is linear. A typical radial function is the Gaussian which, in the case of a scalar input, is

h(x) = \exp\left( -\frac{(x - c)^2}{r^2} \right)    (5-2)

Its parameters are its center c and its radius r. Radial functions are simply a class of functions; in principle, they could be employed in any sort of model, linear or nonlinear, and in any sort of network (single-layer or multilayer). The training of the RBF network involves selecting the center locations and radii (which are eventually used to determine the weights) using a regression tree. A regression tree recursively partitions the input data set into subsets according to decision criteria. As a result, there will be a root node, nonterminal nodes (having sub-nodes) and terminal nodes (having no sub-nodes), each associated with a subset of the input data. Each node contributes one unit to the RBF network's center and radius vectors. The selection of RBF centers is performed by recursively parsing regression tree nodes using a strategy proposed in [35].

Combining Wavelet and Neural Network for Workload Dynamics Prediction

We view workload dynamics as a time series produced by the processor, which is a nonlinear function of its design parameter configuration. Instead of predicting this function at every sampling point, we employ wavelets to approximate it. Previous work [21, 23, 25] shows that neural networks can accurately predict aggregated workload behavior during design space exploration. Nevertheless, monolithic global neural network models lack the capability of revealing complex workload dynamics. To overcome this disadvantage, we propose using wavelet neural networks, which incorporate multiscale wavelet analysis into a set of neural networks for workload dynamics prediction. The wavelet transform is a very powerful tool for dealing with dynamic behavior since it captures both global and local workload behavior using a set of wavelet coefficients. Short-term workload characteristics are decomposed into the lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction, while global workload behavior is decomposed into the higher scales of wavelet coefficients (low frequencies), which are used for the analysis and prediction of slow trends in the workload execution. Collectively, these coordinated scales of time and frequency provide an accurate interpretation of workload dynamics. Our wavelet neural networks use a separate RBF neural network to predict each individual wavelet coefficient at the different scales. The predictions of the individual wavelet coefficients proceed independently; predicting each wavelet coefficient with a separate neural network simplifies the training task of each subnetwork. The prediction results for the wavelet coefficients can then be combined directly by the inverse wavelet transform to predict the workload dynamics. Figure 5-3 shows our hybrid neuro-wavelet scheme for workload dynamics prediction. Given the observed workload dynamics on the training data, our aim is to predict workload dynamic behavior under different architecture configurations. The hybrid scheme involves three stages. In the first stage, the time series is decomposed by wavelet multiresolution analysis.
In the second stage, each wavelet coefficient is predicted by a separate ANN, and in the third stage, the approximated time series is recovered from the predicted wavelet coefficients.

Figure 5-3 Using wavelet neural networks for workload dynamics prediction: the workload dynamics (time domain) is decomposed into wavelet coefficients; each coefficient is predicted from the microarchitecture design parameters by a separate RBF neural network; and the predicted coefficients are recombined by wavelet reconstruction into the synthesized workload dynamics (time domain)

Each RBF neural network receives the entire microarchitectural design space vector and predicts one wavelet coefficient. The training of an RBF network involves determining the center point and radius of each RBF, and the weights of the RBFs, which determine the wavelet coefficient.

Experimental Methodology

We evaluate the efficiency of using wavelet neural networks to explore workload dynamics in the performance, power and reliability domains during microarchitecture design space exploration. We use a unified, detailed microarchitecture simulator in our experiments. Our simulation framework, built using a heavily modified and extended version of the Simplescalar tool set, models pipelined, multiple-issue, out-of-order execution microprocessors with multiple levels of caches. Our framework uses a Wattch-based power model [36]. In addition, we built in the Architecture Vulnerability Factor (AVF) analysis methods proposed in [37, 38] to estimate processor microarchitecture vulnerability to transient faults. A microarchitecture structure's AVF refers to the probability that a transient fault in that hardware structure will result in incorrect program results.
The AVF metric can be used to estimate how vulnerable the hardware is to soft errors during program execution. Table 5-1 summarizes the baseline machine configuration of our simulator.

Table 5-1 Simulated machine configuration
Parameter               Configuration
Processor Width         8-wide fetch/issue/commit
Issue Queue             96
ITLB                    128 entries, 4-way, 200 cycle miss
Branch Predictor        2K entries Gshare, 10-bit global history
BTB                     2K entries, 4-way
Return Address Stack    32 entries RAS
L1 Instruction Cache    32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access
ROB Size                96 entries
Load/Store Queue        48 entries
Integer ALU             8 IALU, 4 IMUL/DIV, 4 Load/Store
FP ALU                  8 FPALU, 4 FPMUL/DIV/SQRT
DTLB                    256 entries, 4-way, 200 cycle miss
L1 Data Cache           64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle access
L2 Cache                unified 2MB, 4-way, 128 Byte/line, 12 cycle access
Memory Access           64 bit wide, 200 cycles access latency

We perform our analysis using twelve SPEC CPU 2000 benchmarks: bzip2, crafty, eon, gap, gcc, mcf, parser, perlbmk, twolf, swim, vortex and vpr. We use the Simpoint tool to pick the most representative simulation point for each benchmark (with the full reference input set), and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. Each simulation covers 200M instructions. In this chapter, we consider a design space that consists of 9 microarchitectural parameters (see Table 5-2) of the superscalar architecture. These microarchitectural parameters have been shown to have the largest impact on processor performance [21]. The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed, cycle-accurate simulations, we measure processor performance, power and reliability characteristics at all design points within both the training and testing data sets.
We build a separate model for each program and use the model to predict workload dynamics in the performance, power and reliability domains at unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of the models' accuracy is obtained using the design points in the testing data set.

Table 5-2 Microarchitectural parameter ranges used for generating train/test data
Parameter    Train Range                Test Range           # of Levels
Fetch_width  2, 4, 8, 16                2, 8                 4
ROB_size     96, 128, 160               128, 160             3
IQ_size      32, 64, 96, 128            32, 64               4
LSQ_size     16, 24, 32, 64             16, 24, 32           4
L2_size      256, 1024, 2048, 4096 KB   256, 1024, 4096 KB   4
L2_lat       8, 12, 14, 16, 20          8, 12, 14            5
il1_size     8, 16, 32, 64 KB           8, 16, 32 KB         4
dl1_size     8, 16, 32, 64 KB           16, 32, 64 KB        4
dl1_lat      1, 2, 3, 4                 1, 2, 3              4

To build a representative design space, one needs to ensure that the sampled data sets space out points throughout the design space while remaining unique and small enough to keep the model building cost low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling strategy, since it provides better coverage than a naive random sampling scheme. We generate multiple LHS matrices and apply a space-filling metric called L2-star discrepancy [40] to each LHS matrix to find the unique and most representative design space, i.e., the one with the lowest L2-star discrepancy value. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. We used 200 training points and 50 test points for workload dynamics prediction, since our study shows that this offers a good trade-off between simulation time and prediction accuracy for the design space we considered. In our study, each workload dynamics trace is represented by 128 samples. Predicting each wavelet coefficient by a separate neural network simplifies the learning task.
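The stratified structure of Latin Hypercube Sampling can be sketched as follows. This is a simplified variant, not the dissertation's exact procedure: each parameter's range is cut into one stratum per sample, the strata are visited in a random order, and stratum indices are mapped onto the discrete parameter levels; the L2-star discrepancy filtering over multiple LHS matrices is not reproduced.

```python
import random

def latin_hypercube(n_samples, levels_per_param, rng):
    """Draw a Latin hypercube sample over a discrete design space: every
    parameter is sampled once from each of n_samples strata, so the
    marginal coverage of each parameter is even."""
    columns = []
    for levels in levels_per_param:
        strata = list(range(n_samples))
        rng.shuffle(strata)                 # random permutation of strata
        # map stratum index [0, n_samples) onto a discrete level index
        columns.append([levels[s * len(levels) // n_samples] for s in strata])
    # transpose: one design-point tuple per sample
    return list(zip(*columns))
```

For example, sampling 4 points over Fetch_width {2, 4, 8, 16} and ROB_size {96, 128, 160} covers every Fetch_width level exactly once.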
Since complex workload dynamics can be captured using a limited number of wavelet coefficients, the total size of the wavelet neural networks can be small. Because small-magnitude wavelet coefficients contribute little to the reconstructed data, we opt to predict only a small set of important wavelet coefficients.

Figure 5-4 Magnitude-based ranking of 128 wavelet coefficients (a color map of wavelet coefficient index versus processor configuration, from high to low magnitude)

Specifically, we consider the following two schemes for selecting important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0; and (2) order-based: select the first k coefficients and approximate the rest with 0. In this study, we choose the magnitude-based scheme since it always outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients does not change drastically across the design space. Figure 5-4 illustrates the magnitude-based ranking (shown as a color map where red indicates high ranks and blue indicates low ranks) of all 128 wavelet coefficients (decomposed from the dynamics of benchmark gcc) across 50 different microarchitecture configurations. As can be seen, the top-ranked wavelet coefficients largely remain consistent across different processor configurations.

Evaluation and Results

In this section, we present detailed experimental results on using wavelet neural networks to predict workload dynamics in the performance, reliability and power domains. The workload dynamics prediction accuracy measure is the mean square error (MSE), defined as follows:

\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} \left( x(k) - \hat{x}(k) \right)^2    (5-3)

where x(k) is the actual value, \hat{x}(k) is the predicted value and N is the total number of samples. As prediction accuracy increases, the MSE becomes smaller.
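The magnitude-based selection scheme and the MSE metric of Equation 5-3 can be sketched as follows (a minimal illustration; the coefficient values are placeholders):

```python
def top_k_by_magnitude(coeffs, k):
    """Magnitude-based selection: keep the k largest-magnitude wavelet
    coefficients in place and approximate the rest with 0."""
    keep = set(sorted(range(len(coeffs)),
                      key=lambda i: abs(coeffs[i]), reverse=True)[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def mse(actual, predicted):
    """Mean square error of Equation 5-3."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
```

The order-based scheme mentioned above would instead keep `coeffs[:k]` regardless of magnitude.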
Figure 5-5 MSE boxplots of workload dynamics prediction in the (a) CPI, (b) power and (c) AVF domains for the twelve benchmarks

The workload dynamics prediction accuracies in the performance, power and reliability domains are plotted as boxplots (Figure 5-5). Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the MSE values; thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data; it shows the center of the distribution of the MSE values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or to a distance 1.5 times the interquartile range from the median, whichever is less. Outliers are marked as circles. In Figure 5-5, the line with diamond-shaped markers indicates the average MSE across all test cases. Figure 5-5 shows that the performance model achieves median errors ranging from 0.5 percent (swim) to 8.6 percent (mcf), with an overall median error across all benchmarks of 2.3 percent. Even though the maximum error at any design point for any benchmark is 30%, most benchmarks show an MSE of less than 10%. This indicates that our proposed neuro-wavelet scheme can forecast the dynamic behavior of program performance characteristics with high accuracy. Figure 5-5 also shows that the power models are slightly less accurate, with median errors ranging from 1.3 percent (vpr) to 4.9 percent (crafty) and an overall median of 2.6 percent.
The power prediction has a higher maximum error of 35%. These errors are much smaller in the reliability domain. In general, workload dynamics prediction accuracy increases as more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients; cost-effective models should provide high prediction accuracy while maintaining low complexity. Figure 5-6 shows the trend of prediction accuracy (averaged over all benchmarks) as the number of wavelet coefficients varies.

Figure 5-6 MSE trends with increased number of wavelet coefficients (CPI, power and AVF MSE for 16 to 128 coefficients)

As can be seen, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point improves the error at a lower rate. This is because wavelets provide good time and locality characterization capability, and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide an interpretation of the series structure across scales in the time and frequency domains. The capability of using a limited set of wavelet coefficients to capture workload dynamics varies with the resolution level.

Figure 5-7 MSE trends with increased sampling frequency (CPI, power and AVF MSE for 64 to 1024 samples)

Figure 5-7 illustrates the MSE (averaged over all benchmarks) yielded by predictive models that use 16 wavelet coefficients when the number of samples varies from 64 to 1024. As the sampling frequency increases, using the same number of wavelet coefficients is less accurate in terms of capturing workload dynamic behavior. As can be seen, however, the increase in MSE is not significant.
This suggests that the proposed schemes can capture workload dynamic behavior of increasing complexity. Our RBF neural networks were built using a regression-tree-based method. In the regression tree algorithm, all input microarchitecture parameters were ranked based on either split order or split frequency. The microarchitecture parameters that cause the most output variation tend to be split earliest and most often in the constructed regression tree. Therefore, microarchitecture parameters that largely determine the value of a wavelet coefficient are located higher than others in the regression tree, and they have a larger number of splits than others.

Figure 5-8 Roles of microarchitecture design parameters: star plots of (a) split order and (b) split frequency for CPI, power and AVF (radii: Fetch, ROB, IQ, LSQ, L2, L2_lat, il1, dl1, dl1_lat)

We present in Figure 5-8 (shown as star plots) the initial and most frequent splits within the regression trees that model the most significant wavelet coefficients. A star plot [41] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equiangular spokes, called radii, with each spoke representing one of the variables. The length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points.
From the star plot, we can obtain information such as: Which variables are dominant for a given data set? Which observations show similar behavior? For example, on benchmark gcc, Fetch, dl1 and LSQ play significant roles in predicting dynamic behavior in the performance domain, while ROB, Fetch and dl1_lat largely affect workload dynamic behavior in the reliability domain. For the benchmark gcc, the microarchitecture parameters most frequently involved in regression tree construction are ROB, LSQ, L2 and L2_lat in the performance domain and LSQ and Fetch in the reliability domain. Compared with models that only predict workload aggregated behavior, our proposed methods can forecast workload runtime execution scenarios. This feature is essential if the predictive models are employed to trigger runtime dynamic management mechanisms for power and reliability optimizations. Inadequate prediction of workload worst-case scenarios could make microprocessors fail to meet the desired power and reliability targets. On the contrary, false alarms caused by over-prediction of worst-case scenarios can trigger responses too frequently, resulting in significant overhead. In this section, we study the suitability of using the proposed schemes for workload-execution-scenario-based classification. Specifically, for a given workload characteristics threshold, we calculate how many sampling points in a trace that represents workload dynamics are above or below the threshold. We then apply the same calculation to the predicted workload dynamics trace. We use the directional symmetry (DS) metric, i.e., the percentage of correctly predicted directions with respect to the target variable, defined as

DS = (1/N) Σ_{k=1}^{N} d(k)    (5-4)

where d(k) = 1 if the actual value x(k) and the predicted value x̂(k) are both above or both below the threshold, and d(k) = 0 otherwise. Thus, the DS provides a measure of the number of times the direction of the target is correctly forecasted.
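A direct implementation of Equation (5-4), together with the quarter-point threshold levels used later in this section, might look as follows (function and variable names are our own, not the dissertation's):

```python
def directional_symmetry(actual, predicted, threshold):
    """DS of Eq. (5-4): fraction of samples where the prediction falls
    on the same side of the threshold as the actual trace."""
    hits = sum(1 for x, p in zip(actual, predicted)
               if (x > threshold) == (p > threshold))
    return hits / len(actual)

def quartile_thresholds(trace):
    """The three levels Q1..Q3 used in this work, placed at quarter
    points between the trace minimum and maximum."""
    lo, hi = min(trace), max(trace)
    return [lo + (hi - lo) * q / 4 for q in (1, 2, 3)]
```

A DS of 1.0 means every sample was classified into the correct side of the threshold; directional asymmetry, reported later, is simply 1 − DS.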
In other words, DS = 50% implies that the predicted direction was correct for half of the predictions. In this work, we set three threshold levels (named Q1, Q2 and Q3 in Figure 5-9) between the max and min values in each trace as follows, where Q1 is the lowest threshold level and Q3 is the highest:

Q1 = MIN + (MAX − MIN) × (1/4)
Q2 = MIN + (MAX − MIN) × (2/4)
Q3 = MIN + (MAX − MIN) × (3/4)

Figure 5-9 Threshold-based workload execution scenarios

Figure 5-10 shows the results of threshold-based workload dynamic behavior classification. The results are presented as directional asymmetry, which can be expressed as 1 − DS. As can be seen, our wavelet-based RBF neural networks not only effectively capture workload dynamics, but also accurately classify workload execution into different scenarios. This suggests that proactive dynamic power and reliability management schemes can be built using the proposed models. For instance, given a power/reliability threshold, our wavelet RBF neural networks can be used to forecast the workload execution scenario. If the predicted workload characteristics exceed the threshold level, the processor can start to respond before power/reliability reaches or surpasses the threshold level.

Figure 5-10 Threshold-based workload execution classification: directional asymmetry (%) for CPI, power and AVF at thresholds Q1, Q2 and Q3

Figure 5-11 further illustrates detailed workload execution scenario predictions on benchmark bzip2. Both simulation and prediction results are shown. The predicted results closely track the varied program dynamic behavior in the different domains.
Figure 5-11 Threshold-based workload scenario prediction for (a) performance, (b) power and (c) reliability

Workload Dynamics Driven Architecture Design Space Exploration

In this section, we present a case study to demonstrate the benefit of applying workload dynamics prediction in early architecture design space exploration. Specifically, we show that workload dynamics prediction models can effectively forecast the worst-case operating conditions for soft error vulnerability and accurately estimate the efficiency of soft error vulnerability management schemes. Because of technology scaling, radiation-induced soft errors contribute more and more to the failure rate of CMOS devices. Therefore, the soft error rate is an important reliability issue in deep-submicron microprocessor design. Processor microarchitecture soft error vulnerability exhibits significant runtime variation, and it is neither economical nor practical to design fault-tolerant schemes that target the worst-case operating condition. Dynamic Vulnerability Management (DVM) refers to a set of strategies for controlling hardware runtime soft-error susceptibility under a tolerable threshold. DVM allows designers to achieve higher dependability on hardware designed for a lower reliability setting. If a particular execution period exceeds the predefined vulnerability threshold, a DVM response (Figure 5-12) will work to reduce hardware vulnerability.

Figure 5-12 Dynamic Vulnerability Management: vulnerability over time, showing the designed-for reliability capacity with and without DVM, the DVM trigger level, the periods when DVM is engaged and disengaged, and the DVM performance overhead

A primary goal of DVM is to maintain vulnerability within a predefined reliability target during the entire program execution. The DVM will be triggered once the hardware soft error vulnerability exceeds the predefined threshold. Once the trigger goes on, a DVM response begins.
Depending on the type of response chosen, there may be some performance degradation. A DVM response can be turned off as soon as the vulnerability drops below the threshold. To successfully achieve the desired reliability target and effectively mitigate the overhead of DVM, architects need techniques to quickly infer application worst-case operating conditions across design alternatives and accurately estimate the efficiency of DVM schemes at an early design stage. We developed a DVM scheme to manage runtime instruction queue (IQ) vulnerability to soft error.

DVM_IQ {
    ACE bits counter updating();
    if current context has L2 cache misses then
        stall dispatching instructions for current context;
    every (sample_interval/5) cycles {
        if online IQ_AVF > trigger threshold then
            wq_ratio = wq_ratio/2;
        else
            wq_ratio = wq_ratio+1;
    }
    if (ratio of waiting instruction # to ready instruction # > wq_ratio) then
        stall dispatching instructions;
}

Figure 5-13 IQ DVM pseudo code

Figure 5-13 shows the pseudo code of our DVM policy. The DVM scheme computes the online IQ AVF to estimate runtime microarchitecture vulnerability. The estimated AVF is compared against a trigger threshold to determine whether it is necessary to enable a response mechanism. To reduce IQ soft error vulnerability, we throttle instruction dispatch from the ROB to the IQ upon an L2 cache miss. Additionally, we sample the IQ AVF at a finer granularity and compare the sampled AVF with the trigger threshold. If the IQ AVF exceeds the trigger threshold, a parameter wq_ratio, which specifies the ratio of the number of waiting instructions to that of ready instructions in the IQ, is updated. The purpose of this parameter is to maintain performance by allowing an appropriate fraction of waiting instructions in the IQ to exploit ILP. By maintaining a desired ratio between the waiting instructions and the ready instructions, vulnerability can be reduced at negligible performance cost.
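Read as a control loop, the policy of Figure 5-13 can be sketched in executable form. The function names and the floor of 1 on wq_ratio are our own assumptions, not part of the original pseudo code:

```python
def dvm_iq_step(iq_avf_online, wq_ratio, trigger_threshold):
    """Periodic wq_ratio adaptation (every sample_interval/5 cycles in
    the pseudo code): rapid decrease on a vulnerability emergency, slow
    increase otherwise."""
    if iq_avf_online > trigger_threshold:
        return max(1, wq_ratio // 2)  # floor of 1 is our own guard
    return wq_ratio + 1

def should_stall_dispatch(waiting, ready, wq_ratio, l2_miss_pending):
    """Throttle dispatch from the ROB to the IQ on an outstanding L2
    miss, or when the waiting-to-ready instruction ratio in the IQ
    exceeds wq_ratio."""
    if l2_miss_pending:
        return True
    return ready > 0 and waiting / ready > wq_ratio
```

The asymmetric update (halving vs. incrementing) is what gives the policy its quick reaction to vulnerability emergencies and its gradual return to a performance-friendly ratio afterward.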
The wq_ratio update is triggered by the estimated IQ AVF. In our DVM design, wq_ratio is adapted through slow increases and rapid decreases in order to ensure a quick response to a vulnerability emergency. We built workload dynamics predictive models that incorporate DVM as a new design parameter. Therefore, our models can predict workload execution scenarios with and without the DVM feature across different microarchitecture configurations. Figure 5-14 shows the results of using the predictive models to forecast the IQ AVF on benchmark gcc across two microarchitecture configurations.

Figure 5-14 Workload dynamics prediction with scenario-based architecture optimization: simulated and predicted IQ AVF traces with DVM disabled and enabled for (a) scenario 1 and (b) scenario 2

We set the DVM target to 0.3, which means the DVM policy, when enabled, should maintain the IQ AVF below 0.3 during workload execution. In both cases, the IQ AVF dynamics were predicted with DVM disabled and enabled. As can be seen, in scenario 1 the DVM successfully achieves its goal. In scenario 2, despite the enabled DVM feature, the IQ AVF of certain execution periods is still above the threshold. This implies that the developed DVM mechanism is suitable for the microarchitecture configuration used in scenario 1. On the other hand, architects have to choose another DVM policy if the microarchitecture configuration shown in scenario 2 is chosen in their design.
Figure 5-14 shows that in all cases, the predictive models can accurately forecast the trends in IQ AVF dynamics due to architecture optimizations. Figure 5-15 (a) shows the prediction accuracy of IQ AVF dynamics when the DVM policy is enabled. The results are shown for all 50 microarchitecture configurations in our test dataset.

Figure 5-15 Heat plots showing the MSE of (a) IQ AVF and (b) processor power across all benchmarks and the 50 test configurations

Since deploying the DVM policy will also affect runtime processor power behavior, we further build models to forecast processor power dynamic behavior under the DVM. The results are shown in Figure 5-15 (b). The data are presented as heat plots, which map the actual data values onto a color scale, with a dendrogram added at the top. A dendrogram consists of many U-shaped lines connecting objects in a hierarchical tree; the height of each U represents the distance between the two objects being connected. For a given benchmark, a vertical trace line shows the scaled MSE values across all test cases. Figure 5-15 (a) shows that the predictive models yield high prediction accuracy across all test cases on benchmarks swim, eon and vpr. The models yield prediction variation on benchmarks gcc, crafty and vortex. In the power domain, prediction accuracy is more uniform across benchmarks and microarchitecture configurations. In Figure 5-16, we show the IQ AVF MSE when different DVM thresholds are set.
The results suggest that our predictive models work well when different DVM targets are considered.

Figure 5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds (0.2, 0.3 and 0.5)

CHAPTER 6
ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTICORE ARCHITECTURES

Early design space exploration is an essential ingredient in modern processor development. It significantly reduces time to market and post-silicon surprises. The trend toward multi-/many-core processors will result in sophisticated large-scale architecture substrates (e.g. non-uniformly accessed caches [43] interconnected by networks-on-chip [44]) with self-contained hardware components (e.g. cache banks, routers and interconnect links) proximate to the individual cores but globally distributed across all cores. As the number of cores on a processor increases, these large and sophisticated multicore-oriented architectures exhibit increasingly complex and heterogeneous characteristics. As an example, to alleviate the deleterious impact of wire delays, architects have proposed splitting large L2/L3 caches into multiple banks, with each bank having a different access latency depending on its physical proximity to the cores. Figure 6-1 illustrates normalized cache hits (results are plotted as color maps) across the 256 cache banks of a non-uniform cache architecture (NUCA) [43] design on an 8-core chip multiprocessor (CMP) running the SPLASH-2 ocean-co workload. The 2D architecture spatial patterns yielded on NUCA with different architecture design parameters are shown.

Figure 6-1 Variation of cache hits across a 256-bank non-uniform access cache on an 8-core CMP

As can be seen, there is significant variation in cache access frequency across individual cache banks.
At larger scales, the manifested two-dimensional spatial characteristics across the entire NUCA substrate vary widely with different design choices while executing the same code base. In this example, various NUCA cache configurations, such as network topologies (e.g. hierarchical, point-to-point and crossbar) and data management schemes (e.g. static (S-NUCA) [43], dynamic (D-NUCA) [45, 46] and dynamic with replication (R-NUCA) [47-49]), are used. As the number of parameters in the design space increases, such variation and characteristics at large scales cannot be captured without slow and detailed simulations. However, using simulation-based methods for architecture design space exploration where numerous design parameters have to be considered is prohibitively expensive. Recently, various predictive models [20-25, 50] have been proposed to cost-effectively reason about processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous behavior of large and distributed architecture substrates across the design space. This limitation will only be exacerbated by the rapidly increasing integration scale (e.g. the number of cores per chip). Therefore, there is a pressing need for novel and cost-effective approaches to achieve accurate and informative design trade-off analysis for large and sophisticated architectures in the upcoming multi-/many-core era. Thus, in this chapter, instead of quantifying these large and sophisticated architectures by a single number or a simple statistical distribution, we propose techniques that employ 2D wavelet multiresolution analysis and neural network nonlinear regression modeling.
With our schemes, the complex spatial characteristics that workloads exhibit across large architecture substrates are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of wavelet coefficients, our models can accurately reconstruct architecture 2D spatial behavior across the design space. Using both multiprogrammed and multithreaded workloads, we extensively evaluate the efficiency of using 2D wavelet neural networks for predicting the complex behavior of non-uniformly accessed cache designs with widely varied configurations.

Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction

We view the 2D spatial characteristics yielded on large and distributed architecture substrates as a nonlinear function of architecture design parameters. Instead of inferring the spatial behavior by exhaustively obtaining architecture characteristics for each individual node/component, we employ wavelet analysis to approximate it and then use a neural network to forecast the approximated behavior across a large architecture design space. Previous work [21, 23, 25, 50] shows that neural networks can accurately predict aggregated workload behavior across varied architecture configurations. Nevertheless, monolithic global neural network models lack the ability to informatively reveal complex workload/architecture interactions at a large scale. To overcome this disadvantage, we propose combining 2D wavelet transforms and neural networks, incorporating multiresolution analysis into a set of neural networks for spatial characteristics prediction of multicore-oriented architecture substrates. The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both the global trend and the local variation of large data sets using a small set of wavelet coefficients.
The local characteristics are decomposed into lower-scale wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction of individual or subsets of cores/components, while the global trend is decomposed into higher-scale wavelet coefficients (low frequencies), which are used for the analysis and prediction of slow trends across many cores or distributed hardware components. Collectively, these wavelet coefficients provide an accurate interpretation of the spatial trend and details of complex workload behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict each individual wavelet coefficient. The separate predictions of the wavelet coefficients proceed independently. Predicting each wavelet coefficient with a separate neural network simplifies the training task (which can be performed concurrently) of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to synthesize the spatial patterns on large-scale architecture substrates. Figure 6-2 shows our hybrid neuro-wavelet scheme for architecture 2D spatial characteristics prediction. Given the observed spatial behavior on the training data, our aim is to predict the 2D behavior of the large-scale architecture under different design configurations.
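The per-coefficient prediction followed by inverse-transform synthesis can be sketched as follows, with simple stand-in models in place of the RBF sub-networks; the one-level Haar inverse and all names here are illustrative assumptions, not the dissertation's implementation:

```python
def predict_spatial_pattern(config, coeff_models, inverse_transform):
    """Predict each retained wavelet coefficient with its own model
    (one per RBF sub-network), then synthesize the spatial pattern via
    the inverse wavelet transform."""
    coeffs = [model(config) for model in coeff_models]
    return inverse_transform(coeffs)

def inv_haar4(c):
    """One-level Haar inverse for a length-4 signal with coefficient
    layout [average, coarse detail, two fine details]."""
    avg, d0, d1, d2 = c
    a0, a1 = avg + d0, avg - d0
    return [a0 + d1, a0 - d1, a1 + d2, a1 - d2]

# Toy stand-in models: each maps a design configuration to one coefficient.
models = [lambda cfg, w=w: w * cfg["scale"] for w in (2.0, 0.5, 0.25, -0.25)]
pattern = predict_spatial_pattern({"scale": 1.0}, models, inv_haar4)
```

Because each coefficient model is independent, training can proceed concurrently, and the inverse transform is the only step that couples the predictions.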
Figure 6-2 Using wavelet neural networks for forecasting architecture 2D characteristics: architecture 2D characteristics are decomposed into wavelet coefficients, each coefficient is predicted by an RBF neural network from the architecture design parameters, and the predicted coefficients are recombined by wavelet reconstruction

The hybrid scheme involves three stages. In the first stage, the observed spatial behavior is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design space vector and predicts one wavelet coefficient. The training of an RBF network involves determining the center point and a radius for each RBF, and the weights of each RBF, which determine the wavelet coefficients.

Experimental Methodology

We evaluate the efficiency of 2D wavelet neural networks for forecasting the spatial characteristics of large-scale multicore NUCA designs using the GEMS 1.2 [51] toolset interfaced with the Simics [52] full-system functional simulator. We simulate a SPARC V9 8-core CMP running Solaris 9. We model in-order issue cores for this study to keep the simulation time tractable. The processors have private L1 caches, and the shared L2 is a 256-bank 16 MB NUCA. The private L1 caches of the different processors are kept coherent using a distributed directory-based protocol. To model the L2 cache, we use the Ruby NUCA cache simulator developed in [47], which includes an on-chip network model. The network models all messages communicated in the system, including all requests, responses, replacements, and acknowledgements. Table 6-1 summarizes the baseline machine configuration of our simulator.
Table 6-1 Simulated machine configuration (baseline)

Parameter          Configuration
Number of cores    8
Issue width        1
L1 (split I/D)     64 KB, 64 B line, write-allocation
L2 (NUCA)          16 MB (256 banks of 64 KB), 64 B line
Memory             Sequential consistency; 4 GB of DRAM, 250 cycle latency, 4 KB pages

Our baseline processor/L2 NUCA organization is similar to that of Beckmann and Wood [47] and is illustrated in Figure 6-3. Each processor core (including L1 data and instruction caches) is placed on the chip boundary, and eight such cores surround a shared L2 cache. The L2 is partitioned into 256 banks (grouped into 16 bank clusters) and connected with an interconnection network. Each core has a cache controller that routes the core's requests to the appropriate cache bank. The NUCA design space is very large. In this chapter, we consider a design space that consists of 9 parameters (see Table 6-2) of the CMP NUCA architecture.

Figure 6-3 Baseline CMP with 8 cores that share a NUCA
L2 cache

Table 6-2 Considered architecture design parameters and their ranges

Parameter                          Range
NUCA management policy (NUCA)      S-NUCA, D-NUCA, R-NUCA
Network topology (net)             Hierarchical, PT_to_PT, Crossbar
Network link latency (net_lat)     20, 30, 40, 50
L1 latency (L1_lat)                1, 3, 5
L2 latency (L2_lat)                6, 8, 10, 12
L1 associativity (L1_aso)          1, 2, 4, 8
L2 associativity (L2_aso)          2, 4, 8, 16
Directory latency (d_lat)          30, 60, 80, 100
Processor buffer size (p_buf)      5, 10, 20

These design parameters cover the NUCA data management policy (NUCA), interconnection topology and latency (net and net_lat), the configurations of the L1 and L2 caches (L1_lat, L2_lat, L1_aso and L2_aso), cache coherence directory latency (d_lat) and the number of cache accesses that a processor core can issue to the L1 (p_buf). The ranges for these parameters were set to include both typical and feasible design points within the explored design space. We studied the CMP NUCA designs using various multiprogrammed and multithreaded workloads (listed in Table 6-3).

Table 6-3 Multiprogrammed and multithreaded workloads

Multiprogrammed workloads:
  Homogeneous
    Group 1: gcc (8 copies)
    Group 2: mcf (8 copies)
  Heterogeneous
    Group 1 (CPU): gap, bzip2, equake, gcc, mesa, perlbmk, parser, ammp
    Group 2 (MIX): perlbmk, mcf, bzip2, vpr, mesa, art, gcc, equake
    Group 3 (MEM): mcf, twolf, art, ammp, equake, mcf, art, mesa

Multithreaded workloads (SPLASH-2) and data sets:
    barnes: 16k particles
    fmm: input.16348
    ocean-co: 514x514 ocean body
    ocean-nc: 258x258 ocean body
    water-ns: 512 molecules
    cholesky: tk15.O
    fft: 65,536 complex data points
    radix: 256k keys, 1024 radix

Our heterogeneous multiprogrammed workloads consist of a mix of programs from the SPEC 2000 benchmarks with full reference input sets. The homogeneous multiprogrammed workloads consist of multiple copies of an identical SPEC 2000 program. For multiprogrammed workload simulations, we fast-forward until all benchmarks have passed their initialization phases.
For the multithreaded workloads, we used 8 benchmarks from the SPLASH-2 suite [53]; we mark an initialization phase in the software code and skip it in our simulations. In all simulations, we first warm up the cache model. After that, each simulation runs 500 million instructions or to benchmark completion, whichever is less. Using detailed simulation, we obtain the 2D architecture characteristics of the large-scale NUCA at all design points within both the training and testing data sets. We build a separate model for each workload and use the model to predict architecture 2D spatial behavior at unexplored points in the design space. The training data set is used to build the 2D wavelet neural network models. An estimate of a model's accuracy is obtained by using the design points in the testing data set. To build a representative design space, one needs to ensure that the sampled data sets disperse points throughout the design space while keeping the space small enough to keep the cost of building the model low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) as our sampling strategy, since it provides better coverage compared to a naive random sampling scheme. We generate multiple LHS matrices and use a space-filling metric called the L2-star discrepancy: it is applied to each LHS matrix to find the representative design space with the lowest L2-star discrepancy value. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this chapter, we used 200 training points and 50 test points for workload prediction, since our study shows that this offers a good trade-off between simulation time and prediction accuracy for the design space we considered. The 2D NUCA architecture characteristics (normalized cache hit numbers) across the 256 banks (with the geometric layout of Figure 6-3) are represented by a matrix.
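A minimal discrete-space LHS sketch is shown below; the actual study additionally scores multiple LHS matrices with the L2-star discrepancy and keeps the best one, a step omitted here. The function name and stratum mapping are our own:

```python
import random

def latin_hypercube(n_samples, levels_per_param, seed=0):
    """Latin Hypercube Sampling over a discrete design space: for each
    parameter, the n samples are assigned to evenly spread strata, each
    shuffled independently, so every parameter's range is covered."""
    rng = random.Random(seed)
    columns = []
    for levels in levels_per_param:
        # One stratum index per sample; balanced across the parameter's
        # levels, then shuffled independently of the other parameters.
        strata = [i * levels // n_samples for i in range(n_samples)]
        rng.shuffle(strata)
        columns.append(strata)
    return [tuple(col[s] for col in columns) for s in range(n_samples)]
```

With n_samples equal to the number of levels, each parameter value appears exactly once per column, which is the stratification property that gives LHS its coverage advantage over naive random sampling.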
Predicting each wavelet coefficient with a separate neural network simplifies the learning task. Since complex spatial patterns on large-scale multicore architecture substrates can be captured using a limited number of wavelet coefficients, the total size of the wavelet neural networks is small and the computation overhead is low. Because small-magnitude wavelet coefficients contribute less to the reconstructed data, we opt to predict only a small set of important wavelet coefficients. Specifically, we consider the following two schemes for selecting important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0; and (2) order-based: select the first k coefficients and approximate the rest with 0. In this study, we choose the magnitude-based scheme since it always outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients not change drastically across the design space. Our experimental results show that the top-ranked wavelet coefficients largely remain consistent across different architecture configurations.

Evaluation and Results

In this section, we present detailed experimental results using 2D wavelet neural networks to forecast the complex, heterogeneous patterns of large-scale multicore substrates running various workloads without using detailed simulation. The prediction accuracy measure is the mean error, defined as follows:

ME = (1/N) Σ_{k=1}^{N} | (x(k) − x̂(k)) / x(k) |    (6-1)

where x is the actual value, x̂ is the predicted value, and N is the total number of samples (e.g. 256 NUCA banks). As prediction accuracy increases, the ME becomes smaller. The prediction accuracies are plotted as boxplots (Figure 6-4).
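Both the magnitude-based selection rule and the mean-error metric of Equation (6-1) are straightforward to implement; a sketch with our own helper names:

```python
def mean_error(actual, predicted):
    """ME of Eq. (6-1): mean relative error over N components
    (e.g. the 256 NUCA banks)."""
    return sum(abs((a - p) / a)
               for a, p in zip(actual, predicted)) / len(actual)

def magnitude_select(coeffs, k):
    """Magnitude-based scheme: keep the k largest-magnitude wavelet
    coefficients and approximate the rest with 0."""
    keep = set(sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                      reverse=True)[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
```

The order-based alternative would simply take `coeffs[:k]` and zero the tail; magnitude-based selection wins because the high-energy coefficients are not guaranteed to come first in the transform's natural ordering.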
Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the ME values.

Figure 6-4 ME boxplots of prediction accuracies with different numbers of wavelet coefficients: (a) 16, (b) 32, (c) 64 and (d) 128

Thus, about 50% of the data are located within the box, and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data, and it shows the center of the distribution of the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles.
Figure 6-4 (a) shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 5.2 percent (fft) to 9.3 percent (ocean.co), with an overall median error of 6.6 percent across all experimented workloads. As can be seen, the maximum error at any design point for any benchmark is 13%, and most benchmarks show an error of less than 10%. This indicates that our proposed neuro-wavelet scheme can forecast 2D spatial workload behavior across large and sophisticated architectures with high accuracy. Figure 6-4 (b-d) shows that, in general, the geospatial characteristics prediction accuracy increases when more wavelet coefficients are involved. Note that the complexity of the predictive models is proportional to the number of wavelet coefficients. Cost-effective models should provide high prediction accuracy while maintaining low complexity and computation overhead. The trend of prediction accuracy (Figure 6-4) indicates that for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point reduces error at a diminishing rate. This is because wavelets provide a good time and locality characterization capability and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide an interpretation of the spatial patterns among a large number of NUCA banks on a two-dimensional plane. Figure 6-5 illustrates the predicted 2D NUCA behavior across four different configurations (A-D) on the heterogeneous multiprogrammed workload MIX (see Table 3) when different numbers of wavelet coefficients (16-256) are used.
Figure 6-5. Predicted 2D NUCA behavior using different numbers of wavelet coefficients (simulation vs. prediction, configurations A-D)

The simulation results are also shown for comparison purposes. Since we can accurately forecast the behavior of a large-scale NUCA by predicting only a small set of wavelet coefficients, we expect our methods to be scalable to even larger architecture designs. We further compare the accuracy of our proposed scheme with that of approximating NUCA spatial patterns via predicting the hit rates of 16 evenly distributed cache banks across a 2D plane. The results shown in Table 6-4 indicate that using the same number of neural networks, our scheme yields significantly higher accuracy than conventional predictive models. If conventional neural network models were built at fine-grain scales (e.g., constructing a model for each NUCA bank), the model building/training overhead would be nontrivial. Since we can accurately forecast the behavior of large-scale NUCA structures by predicting only a small set of wavelet coefficients, we expect our methods to be scalable to even larger architecture substrates.

Table 6-4. Error comparison of predicting raw vs. 2D DWT cache banks

Benchmarks | Error (Raw), % | Error (2D DWT), %
gcc(x8)    | 126 | 8
mcf(x8)    | 71  | 7
CPU        | 102 | 9
MIX        | 86  | 8
MEM        | 122 | 8
barnes     | 136 | 6
fmm        | 363 | 6
ocean.co   | 99  | 9
ocean.nc   | 136 | 6
water.sp   | 97  | 7
cholesky   | 71  | 7
fft        | 64  | 7
radix      | 92  | 7

Table 6-5 shows that exploring the multicore NUCA design space using the proposed predictive models can lead to several orders of magnitude speedup compared with using detailed simulations. The speedup is calculated as the total simulation time across all 50 test cases divided by the time spent on model training and on predicting the 50 test cases. The model construction is a one-time overhead and can be amortized in the design space exploration stage, where a large number of cases need to be examined.

Table 6-5. Design space evaluation speedup (simulation vs. prediction)

Benchmarks | Speedup (Simulation vs. Prediction)
gcc(x8)  | 2,181x
mcf(x8)  | 3,482x
CPU      | 3,691x
MIX      | 472x
MEM      | 435x
barnes   | 659x
fmm      | 1,824x
ocean.co | 1,077x
ocean.nc | 1,169x
water.sp | 738x
cholesky | 696x
fft      | 670x
radix    | 1,010x

Our RBF neural networks were built using a regression-tree-based method. In the regression tree algorithm, all input architecture design parameters were ranked based on either split order or split frequency. The design parameters that cause the most output variation tend to be split earliest and most often in the constructed regression tree. Therefore, architecture parameters that largely determine the values of a wavelet coefficient are located higher than others in the regression tree and have a larger number of splits. We present in Figure 6-6 (shown as star plots) the initial and most frequent splits within the regression trees that model the most significant wavelet coefficients. A star plot is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equiangular spokes, called radii, with each spoke representing one of the variables. The data length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information such as: Which variables are dominant for a given dataset? Which observations show similar behavior?

Figure 6-6. Roles of design parameters in predicting 2D NUCA (split order and split frequency)

For example, on the SPLASH-2 benchmark fmm, network latency (net_lat), processor buffer size (p_buf), L2 latency (L2_lat), and L1 associativity (L1_aso) play significant roles in predicting the 2D NUCA spatial behavior, while the NUCA data management policy (NUCA) and network topology (net) largely affect the 2D spatial pattern when running the homogeneous multiprogrammed workload gcc_x8.
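The split-based ranking described above rests on the variance-reduction criterion a regression tree optimizes: parameters that explain the most output variation win the earliest and most frequent splits. The sketch below illustrates that idea on synthetic data, ranking parameters by the best single-split variance reduction each offers. It is a simplified stand-in for full regression-tree construction, and the parameter names and data are invented for illustration.

```python
import numpy as np

def split_gain(x, y):
    """Best variance reduction achievable by a single split on parameter x
    (the criterion a regression tree uses to pick its next split)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = 0.0
    for i in range(1, len(ys)):
        if xs[i] == xs[i - 1]:
            continue                      # cannot split between equal values
        left, right = ys[:i], ys[i:]
        gain = np.var(ys) - (len(left) * np.var(left) +
                             len(right) * np.var(right)) / len(ys)
        best = max(best, gain)
    return best

rng = np.random.default_rng(0)
params = {"net_lat": rng.uniform(size=200), "p_buf": rng.uniform(size=200),
          "L1_aso": rng.uniform(size=200)}
# Hypothetical wavelet coefficient dominated by net_lat, mildly by p_buf.
coeff = 5.0 * params["net_lat"] + 1.0 * params["p_buf"] + rng.normal(0, 0.1, 200)

ranking = sorted(params, key=lambda k: split_gain(params[k], coeff), reverse=True)
print(ranking)   # net_lat ranks first
```

A real regression tree repeats this greedy search recursively; counting how often and how early each parameter wins a split yields the split-frequency and split-order statistics visualized in the star plots.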
For the benchmark cholesky, the most frequently involved architecture parameters in regression tree construction are NUCA, net_lat, p_buf, L2_lat, and L1_aso. Differing from models that predict aggregated workload characteristics on monolithic architecture designs, our proposed methods can accurately and informatively reveal the complex patterns that workloads exhibit on large-scale architectures. This feature is essential if the predictive models are employed to examine the efficiency of design trade-offs or to explore novel optimizations that consider multi-/many-cores. In this work, we study the suitability of using the proposed models for novel multicore-oriented NUCA optimizations.

Leveraging 2D Geometric Characteristics to Explore Cooperative Multicore-Oriented Architecture Design and Optimization

In this section, we present case studies to demonstrate the benefit of incorporating 2D workload/architecture behavior prediction into the early stages of microarchitecture design. In the first case study, we show that our geospatial-aware predictive models can effectively estimate workloads' 2D working sets and that such information can be beneficial in searching for cache-friendly workload/core mappings in multicore environments. In the second case study, we explore using 2D thermal profile predictive models to accurately and informatively forecast the area and location of thermal hotspots across large NUCA substrates.

Case Study 1: Geospatial-aware Application/Core Mapping

Our 2D geometry-aware architecture predictive models can be used to explore global, cooperative resource management and optimization in multicore environments.

Figure 6-7. 2D NUCA footprint (geometric shape) of mesa

For example, as shown in Figure 6-7, a workload will exhibit a 2D working set with different geometric shapes when running on different cores. The exact shape of the access
The exact shape of the access PAGE 89 89 distribution depends on several f actors such as the application and the data mapping/migration policy. As shown in previous section, our pred ictive models can forecast workload 2D spatial patterns across the architecture design space. To pr edict workload 2D geometric footprints when running on different cores, we inco rporate the core loca tion as a new design parameter and build the locationaware 2D predictive models. As a re sult, the new model can forecast workloads 2D NUCA footprint (represented as a cache access distri bution) when it is assigned to a specific core location. We assign 8 SPEC CPU 2000 workloads to the 8core CMP system and then predict each workloads 2D NUCA footprint when runni ng on the assigned core and use the predicted 2D geometric working set for each workload to estimate the cach e interference among the cores. Core 0 Core 5 Core 4Core 7 Core 6Core 3 Core 2Core 1 Program A 2D NUCA footprint @ Core 0 Program B 2D NUCA footprint @ Core 1 Program C 2D NUCA footprint @ Core 2 Interferenced Area Figure 68. 2D cache interference in NUCA As shown in Figure 68, to estimate the in terference for a given core/workload mapping, we estimate both the area and the degree of ove rlap among a workloads 2D NUCA footprint. We only consider the interference of a core a nd its two neighbors. As a result, for a given core/workload layout, we can quickly estimate the overall interference. For each NUCA configuration, we estimate the interference when workloads are randomly assigned to different cores. We use simulation to count the actual c ache interference among workloads. For each test PAGE 90 90 case (e.g., a specific NUCA configuration), we generate two series of cache interference statistics (e.g., one from simu lation and one from the predictiv e model) which correspond to the scenarios when workloads are mapped to the different cores. We compute the Pearson correlation coefficient of the two data series. 
The Pearson correlation coefficient of two data series X and Y is defined as

r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}    (6-2)

If two data series X and Y show a highly positive correlation, their Pearson correlation coefficient will be close to 1. Consequently, if the cache interference can be accurately estimated using the overlap between the predicted 2D NUCA footprints, we should observe a nearly perfect correlation between the two metrics.

Figure 6-9. Pearson correlation coefficients for Group 1 (CPU), Group 2 (MIX), and Group 3 (MEM) (all 50 test cases are shown)

Figure 6-9 shows that there is a strong correlation between the interference estimated using the predicted 2D NUCA footprints and the interference statistics obtained using simulation. The highly positive Pearson correlation coefficient values show that by using the predictive model, designers can quickly devise the optimal core allocation for a given set of workloads. Alternatively, the information can be used by the OS to guide cache-friendly thread scheduling in multicore environments.

Case Study 2: 2D Thermal Hotspot Prediction

Thermal issues are becoming a first-order design parameter for large-scale CMP architectures. High operational temperatures and hotspots can limit performance and manufacturability. We use the HotSpot [54] thermal model to obtain the temperature variation across 256 NUCA banks. We then build analytical models using the proposed methods to forecast the 2D thermal behavior of a large NUCA cache with different configurations. Our predictive model can help designers insightfully predict potential thermal hotspots and assess the severity of thermal emergencies.
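Equation 6-2 translates directly into code. The sketch below implements the summation form and can be cross-checked against a library routine; the two interference series are toy values, not the dissertation's data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, following Equation 6-2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt(n * np.sum(x**2) - np.sum(x)**2) * \
          np.sqrt(n * np.sum(y**2) - np.sum(y)**2)
    return num / den

# Predicted-overlap interference vs. simulated interference (toy series):
predicted = [3.0, 5.0, 2.0, 8.0, 6.0]
simulated = [2.8, 5.3, 2.2, 7.5, 6.1]
print(round(pearson_r(predicted, simulated), 3))   # close to 1 when the
                                                   # estimate tracks simulation
```

The same value can be obtained with `np.corrcoef(predicted, simulated)[0, 1]`, which is a convenient sanity check on the hand-rolled formula.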
Figure 6-10 shows the simulated thermal profile and the predicted thermal behavior for different workloads. The temperatures are normalized to a value between the maximal and minimal value across the NUCA chip. As can be seen, the 2D thermal predictive models can accurately and informatively forecast both the size and the location of thermal hotspots in a large-scale architecture.

Figure 6-10. 2D NUCA thermal profile (simulation vs. prediction) for (a) Ocean-NC, (b) gcc_x8, (c) MEM, and (d) Radix

Figure 6-11. NUCA 2D thermal prediction error for 16-256 wavelet coefficients

The thermal prediction accuracy (average statistics) across the three workload categories (homogeneous multiprogrammed, heterogeneous multiprogrammed, and multithreaded SPLASH) is shown in Figure 6-11, which also shows the accuracy when different numbers of wavelet coefficients are used in prediction. The results show that our predictive model can be used to cost-effectively analyze the thermal behavior of large architecture substrates. In addition, our proposed technique can be used to evaluate the efficiency of thermal management policies at a large scale. For example, thermal hotspots can be mitigated by throttling the number of accesses to a cache bank for a certain period when its temperature reaches a threshold. We build analytical models that incorporate thermal-aware cache access throttling as a design parameter. As a result, our predictive model can forecast the thermal hotspot distribution in the 2D NUCA cache banks when the dynamic thermal management (DTM) policy is enabled or disabled.
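The access-throttling policy sketched in words above might look as follows. The threshold, throttling fraction, and per-bank interface are all hypothetical choices made for illustration, not the DTM policy actually evaluated in the dissertation.

```python
def throttle_accesses(bank_temps, requested, threshold=355.0, duty=0.25):
    """Hypothetical thermal-aware throttling sketch: when a bank's
    temperature reaches the threshold (K), only a fraction `duty` of the
    requested accesses is admitted during the throttling period."""
    admitted = {}
    for bank, temp in bank_temps.items():
        req = requested.get(bank, 0)
        admitted[bank] = int(req * duty) if temp >= threshold else req
    return admitted

temps = {0: 348.0, 1: 356.5, 2: 360.2}   # per-bank steady-state temps (K)
reqs = {0: 1000, 1: 1000, 2: 800}        # accesses requested this period
print(throttle_accesses(temps, reqs))    # banks 1 and 2 are throttled
```

In the predictive-model setting, enabling or disabling such a policy simply becomes one more input bit of the design vector, which is how the models forecast hotspot distributions with DTM on or off.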
Figure 6-12 shows the thermal profiles before and after the thermal management policy is applied (both prediction and simulation results) for the benchmark Ocean-NC. As can be seen, they track each other very well. In terms of the time taken for design space exploration, our proposed models have orders of magnitude less overhead: the time required to predict the thermal behavior is much less than that of full-system multicore simulation. For example, thermal hotspot estimation is over 5×10² times faster than thermal simulation, justifying our decision to use the predictive models. Similarly, searching for a cache-friendly workload/core mapping is 4×10³ times faster than using the simulation-based method.

Figure 6-12. Temperature profile before and after a DTM policy (simulation vs. prediction)

CHAPTER 7
THERMAL DESIGN SPACE EXPLORATION OF 3D DIE-STACKED MULTICORE PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS

To achieve thermally efficient 3D multicore processor designs, architects and chip designers need models with low computation overhead, which allow them to quickly explore the design space and compare different design options. One challenge in modeling the thermal behavior of 3D die-stacked multicore architectures is that the manifested thermal patterns show significant variation within each die and across different dies (as shown in Figure 7-1).

Figure 7-1. 2D within-die and cross-die thermal variation in 3D die-stacked multicore processors (dies 1-4; CPU, MEM, and MIX workloads)

The results were obtained by simulating a 3D die-stacked quad-core processor running multiprogrammed CPU (bzip2, eon, gcc, perlbmk), MEM (mcf, equake, vpr, swim), and MIX (gcc, mcf, vpr, perlbmk) workloads. Each program within a multiprogrammed workload was assigned to a die that contains a processor core and caches. Figure 7-2 shows the 2D thermal variation on die 4 under different microarchitecture and floorplan configurations.
On a given die, the two-dimensional thermal spatial characteristics vary widely with different design choices. As the number of architectural parameters in the design space increases, the complex thermal variation and characteristics cannot be captured without slow and detailed simulations. As shown in Figures 7-1 and 7-2, to explore the thermal-aware design space accurately and informatively, we need computationally effective methods that not only predict aggregate thermal behavior but also identify both the size and the geographic distribution of thermal hotspots. In this work, we aim to develop fast and accurate predictive models to achieve this goal.

Figure 7-2. 2D thermal variation on die 4 under different microarchitecture and floorplan configurations (configurations A-D; CPU, MEM, and MIX workloads)

Figure 7-3 illustrates the original thermal behavior and the 2D wavelet transformed thermal behavior.

Figure 7-3. Example of using 2D DWT to capture thermal spatial characteristics: (a) original thermal behavior; (b) 2D wavelet transformed behavior, decomposed into the LL2, HL2, LH2, HH2, HL1, LH1, and HH1 subbands

As can be seen, the 2D thermal characteristics can be effectively captured using a small number of wavelet coefficients (e.g., the level-1 or level-2 approximation subband LL). Since a small set of wavelet coefficients provides concise yet insightful information on the 2D thermal spatial characteristics, we use predictive models (i.e., neural networks) to relate them individually to the various design parameters. Through the inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize the 2D thermal spatial characteristics across the design space. Compared with a simulation-based method, predicting a small set of wavelet coefficients using analytical models is computationally efficient and scales to the large thermal design space of 3D multicore architectures.
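The subband decomposition of Figure 7-3 can be illustrated with one level of a 2D Haar transform, the simplest wavelet (the dissertation's specific wavelet basis is not restated here). Each 2×2 neighborhood maps to one approximation (LL) and three detail (HL, LH, HH) coefficients, and the orthonormal transform preserves energy, which is why a hot region concentrates into a few large coefficients.

```python
import numpy as np

def haar2d_level1(field):
    """One level of a 2D Haar DWT: split a thermal field into the
    approximation (LL) and detail (HL, LH, HH) subbands."""
    a = field[0::2, 0::2]; b = field[0::2, 1::2]   # 2x2 neighborhoods
    c = field[1::2, 0::2]; d = field[1::2, 1::2]
    LL = (a + b + c + d) / 2.0      # smooth trend
    HL = (a - b + c - d) / 2.0      # horizontal detail
    LH = (a + b - c - d) / 2.0      # vertical detail
    HH = (a - b - c + d) / 2.0      # diagonal detail
    return LL, HL, LH, HH

# A constant 4x4 "die" with one hot 2x2 corner: the LL (trend) subband
# carries essentially all of the signal's energy.
field = np.full((4, 4), 340.0)
field[:2, :2] += 6.0
LL, HL, LH, HH = haar2d_level1(field)
print(LL)
```

Applying the same step again to LL yields the level-2 subbands (LL2, HL2, LH2, HH2) shown in Figure 7-3.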
Prior work has proposed various predictive models [20-25, 50] to cost-effectively reason about processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous thermal behavior across large and distributed 3D multicore architecture substrates. In this paper, we address this important and urgent research task by developing novel 2D multiscale predictive models, which can efficiently reason about the geospatial thermal characteristics within a die and across different dies during the design space exploration stage without using detailed cycle-level simulations. Instead of quantifying the complex geospatial thermal characteristics using a single number or a simple statistical distribution, our proposed techniques employ 2D wavelet multiresolution analysis and neural network nonlinear regression modeling. With our schemes, the thermal spatial characteristics are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of wavelet coefficients, our models can accurately reconstruct the 2D spatial thermal behavior across the design space.

Combining Wavelets and Neural Networks for 2D Thermal Spatial Behavior Prediction

We view the 2D spatial thermal characteristics yielded in 3D integrated multicore chips as a nonlinear function of the architecture design parameters. Instead of inferring the spatial thermal behavior by exhaustively obtaining the temperature at each individual location, we employ wavelet analysis to approximate it and then use a neural network to forecast the approximated thermal behavior across a large architectural design space.
Previous work [21, 23, 25, 50] shows that neural networks can accurately predict aggregated workload behavior across varied architecture configurations. Nevertheless, monolithic global neural network models lack the ability to reveal complex thermal behavior at a large scale. To overcome this disadvantage, we propose combining 2D wavelet transforms and neural networks, incorporating multiresolution analysis into a set of neural networks for spatial thermal characteristics prediction of 3D die-stacked multicore designs.

Figure 7-4. Hybrid neuro-wavelet thermal prediction framework

The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both the global trend and the local variation of large data sets using a small set of wavelet coefficients. The local characteristics are decomposed into lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction of individual components or subsets of components, while the global trend is decomposed into higher scales of wavelet coefficients (low frequencies), which are used for the analysis and prediction of slow trends across each die. Collectively, these wavelet coefficients provide an accurate interpretation of the spatial trend and the details of complex thermal behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict each individual wavelet coefficient. The separate predictions of the wavelet coefficients proceed independently. Predicting each wavelet coefficient by a separate neural network simplifies the training task (which can be performed concurrently) of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to synthesize the 2D spatial thermal patterns across each die. Figure 7-4 shows our hybrid neuro-wavelet scheme for 2D spatial thermal characteristics prediction.
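The hybrid scheme of Figure 7-4 (decompose, predict per coefficient, inverse-transform) can be sketched end-to-end. The sketch below substitutes per-coefficient least-squares regressors for the RBF networks and a toy linear "simulator" for HotSpot, so it illustrates only the data flow of the three stages, not the dissertation's actual models.

```python
import numpy as np

def haar2d(f):
    """One-level 2D Haar DWT packed into a single array [[LL, HL], [LH, HH]]."""
    a, b = f[0::2, 0::2], f[0::2, 1::2]
    c, d = f[1::2, 0::2], f[1::2, 1::2]
    return np.vstack([np.hstack([(a+b+c+d)/2, (a-b+c-d)/2]),
                      np.hstack([(a+b-c-d)/2, (a-b-c+d)/2])])

def ihaar2d(w):
    """Inverse of haar2d: synthesize the field from its subbands."""
    h = w.shape[0] // 2
    LL, HL, LH, HH = w[:h, :h], w[:h, h:], w[h:, :h], w[h:, h:]
    f = np.empty(w.shape)
    f[0::2, 0::2] = (LL + HL + LH + HH) / 2
    f[0::2, 1::2] = (LL - HL + LH - HH) / 2
    f[1::2, 0::2] = (LL + HL - LH - HH) / 2
    f[1::2, 1::2] = (LL - HL - LH + HH) / 2
    return f

# Stage 1: decompose each training thermal map into wavelet coefficients.
rng = np.random.default_rng(1)
n_train, dim, grid = 40, 3, 4
X = rng.uniform(size=(n_train, dim))          # design-space vectors
def thermal_map(x):                           # hypothetical toy "simulator"
    g = np.linspace(0, 1, grid)
    return 340 + 10*x[0] + 5*x[1]*np.add.outer(g, g) + x[2]
maps = np.stack([thermal_map(x) for x in X])
W = np.stack([haar2d(m) for m in maps]).reshape(n_train, -1)

# Stage 2: one predictor per coefficient (least-squares stand-ins for the
# per-coefficient RBF networks; the columns are trained independently).
A = np.hstack([X, np.ones((n_train, 1))])
betas = np.linalg.lstsq(A, W, rcond=None)[0]

# Stage 3: predict coefficients for an unseen design, then inverse-transform.
x_new = np.array([0.3, 0.7, 0.5])
w_pred = np.hstack([x_new, 1.0]) @ betas
predicted = ihaar2d(w_pred.reshape(grid, grid))
actual = thermal_map(x_new)
print(np.abs(predicted - actual).max())       # small reconstruction error
```

Because the per-coefficient predictors are independent, stage 2 parallelizes trivially, which is the training-concurrency point made above.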
Given the observed spatial thermal behavior in the training data, our aim is to predict the 2D thermal behavior of each die in 3D die-stacked multicore processors under different design configurations. The hybrid scheme involves three stages. In the first stage, the observed spatial thermal behavior in each layer is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D thermal characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design space vector and predicts one wavelet coefficient. The training of an RBF network involves determining the center point and a radius for each RBF, as well as the weights of each RBF, which determine the wavelet coefficients.

Experimental Methodology

Floorplanning and HotSpot Thermal Model

In this study, we model four floorplans that involve processor core and cache structures, as illustrated in Figure 7-5.

Figure 7-5. Selected floorplans

As can be seen, the processor core is placed at different locations across the different floorplans. Each floorplan can be chosen by a layer in the studied 3D die-stacked quad-core processors. The size and adjacency of the blocks are critical parameters for deriving the thermal model. The baseline core architecture and floorplan we modeled is an Alpha processor, closely resembling the Alpha 21264. Figure 7-6 shows the baseline core floorplan.

Figure 7-6. Processor core floorplan

We assume a 65 nm process technology and the floorplan is scaled accordingly. The entire die size is 21×21 mm and the core size is 5.8×5.8 mm. We consider three core configurations: 2-issue (5.8×5.8 mm), 4-issue (8.14×8.14 mm), and 8-issue (11.5×11.5 mm). Since the total die area is fixed, the more aggressive core configurations lead to smaller L2 caches.
For all three core configurations, we calculate the size of the L2 cache based on the remaining die area available. Table 7-1 lists the detailed processor core and cache configurations. We use HotSpot 4.0 [54] to simulate the thermal behavior of the 3D quad-core chip shown in Figure 7-7. The HotSpot tool can specify the multiple layers of silicon and metal required to model a three-dimensional IC. We choose the grid-like thermal modeling mode by specifying a set of 64 x 64 thermal grid cells per die, and the average temperature of each cell (32um x 32um) is represented by a single value. HotSpot takes the power consumption data for each component block, the layer parameters, and the floorplans as inputs and generates the steady-state temperature for each active layer. To build a 3D multicore processor simulator, we heavily modified and extended the M-Sim simulator [63] and incorporated the Wattch power model [36]. The power trace is generated from the developed framework with an interval size of 500K cycles. We simulate a 3D-stacked quad-core processor with one core assigned to each layer.

Table 7-1. Architecture configuration for different issue widths

Component        | 2-issue                   | 4-issue                   | 8-issue
Processor Width  | 2-wide fetch/issue/commit | 4-wide fetch/issue/commit | 8-wide fetch/issue/commit
Issue Queue      | 32                        | 64                        | 128
ITLB             | 32 entries, 4-way, 200-cycle miss | 64 entries, 4-way, 200-cycle miss | 128 entries, 4-way, 200-cycle miss
Branch Predictor | 512-entry Gshare, 10-bit global history | 1K-entry Gshare, 10-bit global history | 2K-entry Gshare, 10-bit global history
BTB              | 512 entries, 4-way        | 1K entries, 4-way         | 2K entries, 4-way
Return Address   | 8-entry RAS               | 16-entry RAS              | 32-entry RAS
L1 Inst. Cache   | 32K, 2-way, 32 Byte/line, 2 ports, 1-cycle access | 64K, 2-way, 32 Byte/line, 2 ports, 1-cycle access | 128K, 2-way, 32 Byte/line, 2 ports, 1-cycle access
ROB Size         | 32 entries                | 64 entries                | 96 entries
Load/Store Queue | 24 entries                | 48 entries                | 72 entries
Integer ALU      | 2 IALU, 1 IMUL/DIV, 2 Load/Store | 4 IALU, 2 IMUL/DIV, 2 Load/Store | 8 IALU, 4 IMUL/DIV, 4 Load/Store
FP ALU           | 1 FPALU, 1 FPMUL/DIV/SQRT | 2 FPALU, 2 FPMUL/DIV/SQRT | 4 FPALU, 4 FPMUL/DIV/SQRT
DTLB             | 64 entries, 4-way, 200-cycle miss | 128 entries, 4-way, 200-cycle miss | 256 entries, 4-way, 200-cycle miss
L1 Data Cache    | 32K, 2-way, 32 Byte/line, 2 ports, 1-cycle access | 64KB, 4-way, 64 Byte/line, 2 ports, 1-cycle access | 128K, 2-way, 32 Byte/line, 2 ports, 1-cycle access
L2 Cache         | unified 4MB, 4-way, 128 Byte/line, 12-cycle access | unified 3.7MB, 4-way, 128 Byte/line, 12-cycle access | unified 3.2MB, 4-way, 128 Byte/line, 12-cycle access
Memory Access    | 32-bit wide, 200-cycle latency | 64-bit wide, 200-cycle latency | 64-bit wide, 200-cycle latency

Figure 7-7. Cross-section view of the simulated 3D quad-core chip

Workloads and System Configurations

We use both integer and floating-point benchmarks from the SPEC CPU 2000 suite (bzip2, crafty, eon, facerec, galgel, gap, gcc, lucas, mcf, parser, perlbmk, twolf, swim, vortex, and vpr) to compose our experimental multiprogrammed workloads (see Table 7-2). We categorize all benchmarks into two classes: CPU-bound and MEM-bound applications. We design three types of experimental workloads: CPU, MEM, and MIX. The CPU and MEM workloads consist of programs from only the CPU-intensive and memory-intensive categories, respectively. MIX workloads are the combination of two benchmarks from the CPU-intensive group and two from the memory-intensive group.

Table 7-2. Simulation configurations

Chip: Frequency 3 GHz | Voltage 1.2 V | Process Technology 65 nm | Die Size 21 mm × 21 mm

Workloads:
CPU1: bzip2, eon, gcc, perlbmk
CPU2: perlbmk, mesa, facerec, lucas
CPU3: gap, parser, eon, mesa
MIX1: gcc, mcf, vpr, perlbmk
MIX2: perlbmk, mesa, twolf, applu
MIX3: eon, gap, mcf, vpr
MEM1: mcf, equake, vpr, swim
MEM2: twolf, galgel, applu, lucas
MEM3: mcf, twolf, swim, vpr

These multiprogrammed workloads were simulated on our multicore simulator configured as a 3D quad-core processor. We use the SimPoint tool [1] to obtain a representative slice of each benchmark (with the full reference input set), and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. The simulations continue until one benchmark within a workload finishes the execution of its representative interval of 250M instructions.

Design Parameters

In this study, we consider a design space that consists of 23 parameters (see Table 7-3), spanning from floorplanning to packaging technologies.

Table 7-3. Design space parameters

Parameter | Key | Low | High
3D configurations (one set per layer, 0-3):
Layer 0 thickness (m) | ly0_th | 5e-5 | 3e-4
Layer 0 floorplan | ly0_fl | Flp 1/2/3/4
Layer 0 benchmark | ly0_bench | CPU/MEM/MIX
Layer 1 thickness (m) | ly1_th | 5e-5 | 3e-4
Layer 1 floorplan | ly1_fl | Flp 1/2/3/4
Layer 1 benchmark | ly1_bench | CPU/MEM/MIX
Layer 2 thickness (m) | ly2_th | 5e-5 | 3e-4
Layer 2 floorplan | ly2_fl | Flp 1/2/3/4
Layer 2 benchmark | ly2_bench | CPU/MEM/MIX
Layer 3 thickness (m) | ly3_th | 5e-5 | 3e-4
Layer 3 floorplan | ly3_fl | Flp 1/2/3/4
Layer 3 benchmark | ly3_bench | CPU/MEM/MIX
TIM (thermal interface material):
Heat capacity (J/m^3-K) | TIM_cap | 2e6 | 4e6
Resistivity (m-K/W) | TIM_res | 2e-3 | 5e-2
Thickness (m) | TIM_th | 2e-5 | 75e-6
General configurations:
Heat sink convection capacity (J/K) | HS_cap | 140.4 | 1698
Heat sink convection resistance (K/W) | HS_res | 0.1 | 0.5
Heat sink side (m) | HS_side | 0.045 | 0.08
Heat sink thickness (m) | HS_th | 0.02 | 0.08
Heat spreader side (m) | HP_side | 0.025 | 0.045
Heat spreader thickness (m) | HP_th | 5e-4 | 5e-3
Ambient temperature (K) | Am_temp | 293.15 | 323.15
Issue width | Issue_width | 2 or 4 or 8

These design parameters have been shown to have a large impact on processor thermal behavior.
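Before any per-coefficient predictor can be trained, each design point in a space like Table 7-3 must be encoded as a numeric input vector. A common approach, sketched below with a hypothetical subset of the parameters, is to scale continuous parameters to [0, 1] using their low/high bounds and one-hot encode the discrete ones; the dissertation does not specify its exact encoding, so this is an assumption.

```python
import numpy as np

# A hypothetical subset of the Table 7-3 design space: (low, high) ranges
# for continuous parameters and category lists for discrete ones.
CONTINUOUS = {"ly0_th": (5e-5, 3e-4), "TIM_res": (2e-3, 5e-2),
              "Am_temp": (293.15, 323.15)}
CATEGORICAL = {"ly0_fl": ["Flp1", "Flp2", "Flp3", "Flp4"],
               "issue_width": [2, 4, 8]}

def encode(config):
    """Encode one design point as the input vector a per-coefficient
    predictor receives: continuous parameters scaled to [0, 1], discrete
    parameters one-hot encoded."""
    vec = []
    for key, (lo, hi) in CONTINUOUS.items():
        vec.append((config[key] - lo) / (hi - lo))
    for key, choices in CATEGORICAL.items():
        vec.extend(1.0 if config[key] == c else 0.0 for c in choices)
    return np.array(vec)

v = encode({"ly0_th": 3e-4, "TIM_res": 2e-3, "Am_temp": 308.15,
            "ly0_fl": "Flp2", "issue_width": 8})
print(v)   # [1.0, 0.0, 0.5,  0.0, 1.0, 0.0, 0.0,  0.0, 0.0, 1.0]
```

Scaling keeps parameters with very different magnitudes (layer thickness in meters vs. ambient temperature in kelvin) from dominating the distance computations inside an RBF network.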
The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed cycle-accurate simulations, we measure processor power and thermal characteristics at all design points within both the training and testing data sets. We build a separate model for each benchmark domain and use the model to predict thermal behavior at unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of each model's accuracy is obtained using the design points in the testing data set. To train an accurate and prompt neural network prediction model, one needs to ensure that the sampled data set disperses points throughout the design space while remaining small enough to maintain a low model building cost. To achieve this goal, we use a variant of Latin hypercube sampling (LHS) [39] as our sampling strategy since it provides better coverage compared to a naive random sampling scheme. We generate multiple LHS matrices and use a space-filling metric called the L2-star discrepancy [40]: the L2-star discrepancy is computed for each LHS matrix, and the matrix with the lowest value is chosen as the representative sample of the design space. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this work, we used 200 training and 50 test data points to reach a high accuracy for thermal behavior prediction, since our study shows that this offers a good trade-off between simulation time and prediction accuracy for the design space we considered. In our study, the thermal characteristics across each die are represented by 64 x 64 samples.
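The LHS-plus-discrepancy selection described above can be sketched with NumPy alone: generate several Latin hypercube matrices and keep the one with the lowest L2-star discrepancy. The discrepancy is evaluated here with Warnock's closed-form expression, which is assumed to be the metric of [40]; a minimal sketch, not the dissertation's implementation.

```python
import numpy as np

def latin_hypercube(n, dim, rng):
    """One LHS matrix in [0, 1)^dim: each of the n rows samples every
    dimension from a distinct one-of-n stratum, giving better coverage
    than naive random sampling."""
    strata = rng.permuted(np.tile(np.arange(n), (dim, 1)), axis=1).T
    return (strata + rng.uniform(size=(n, dim))) / n

def l2_star_discrepancy(p):
    """L2-star discrepancy via Warnock's formula (lower = more uniform)."""
    n, d = p.shape
    term1 = 3.0 ** (-d)
    term2 = np.prod(1.0 - p**2, axis=1).sum() / (n * 2 ** (d - 1))
    term3 = sum(np.prod(1.0 - np.maximum(p[i], p), axis=1).sum()
                for i in range(n)) / n**2
    return np.sqrt(term1 - term2 + term3)

# Generate several candidate LHS matrices over the 23-parameter space and
# keep the most uniform one, mirroring the selection described in the text.
rng = np.random.default_rng(7)
candidates = [latin_hypercube(200, 23, rng) for _ in range(5)]
best = min(candidates, key=l2_star_discrepancy)
print(round(l2_star_discrepancy(best), 4))
```

Each row of the chosen matrix is then mapped from [0, 1) onto the actual parameter ranges of Table 7-3 before being handed to the simulator.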
Experimental Results

In this section, we present detailed experimental results using 2D wavelet neural networks to forecast the thermal behavior of large-scale 3D multicore structures running various CPU/MIX/MEM workloads without using detailed simulation.

Simulation Time vs. Prediction Time

To evaluate the effectiveness of our thermal prediction models, we compute the speedup metric (defined as simulation time vs. prediction time) across all experimented workloads (shown in Table 7-4). To calculate the simulation time, we measured the time that the HotSpot simulator takes to obtain steady-state thermal characteristics for a given design configuration. As can be seen, the HotSpot simulation time varies with the design configuration; we report both the shortest (best) and longest (worst) simulation times in Table 7-4. The prediction time, which includes the time for the neural networks to predict the targeted thermal behavior, remains constant for all studied cases. In our experiments, a total of 16 neural networks were used to predict the 16 2D wavelet coefficients that efficiently capture the workload thermal spatial characteristics. As can be seen, our predictive models achieve speedups ranging from 285x (MEM1) to 5,339x (CPU2), making them suitable for rapidly exploring a large thermal design space.

Table 7-4. Simulation time vs. prediction time

Workloads | Simulation (sec) [best : worst] | Prediction (sec) | Speedup (Sim./Pred.) [best : worst]
CPU1 | 362 : 6,091 | 1.23 | 294 : 4,952
CPU2 | 366 : 6,567 | 1.23 | 298 : 5,339
CPU3 | 365 : 6,218 | 1.23 | 297 : 5,055
MEM1 | 351 : 5,890 | 1.23 | 285 : 4,789
MEM2 | 355 : 6,343 | 1.23 | 289 : 5,157
MEM3 | 367 : 5,997 | 1.23 | 298 : 4,876
MIX1 | 352 : 5,944 | 1.23 | 286 : 4,833
MIX2 | 365 : 6,091 | 1.23 | 297 : 4,952
MIX3 | 360 : 6,024 | 1.23 | 293 : 4,898

Prediction Accuracy

The prediction accuracy measure is the mean error defined as follows:

\mathrm{ME} = \frac{1}{N}\sum_{k=1}^{N}\left|\frac{x(k)-\tilde{x}(k)}{x(k)}\right|    (7-1)

where x(k) is the actual value generated by the HotSpot thermal model, \tilde{x}(k) is the predicted value, and N is the total number of samples (a set of 64 x 64 temperature samples per layer). As prediction accuracy increases, the ME becomes smaller. We present boxplots to observe the average prediction errors and their deviations for the 50 test configurations against the HotSpot simulation results. Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the ME values. Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data and shows the center of the distribution of the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. In Figure 7-8, the blue line with diamond-shaped markers indicates the average ME across all benchmarks.
Figure 7-8. ME boxplots of prediction accuracies (number of wavelet coefficients = 16)

Figure 7-8 shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 2.8% (CPU1) to 15.5% (MEM1), with an overall median error of 6.9% across all experimented workloads. The maximum error at any design point for any benchmark is 17.5% (MEM1), and most benchmarks show an error of less than 9%. This indicates that our hybrid neuro-wavelet framework can predict 2D spatial thermal behavior across large and sophisticated 3D multicore architectures with high accuracy. Figure 7-8 also indicates that CPU workloads (average 4.4%) have smaller error rates than MEM (average 9.4%) and MIX (average 6.7%) workloads. This is because the CPU workloads usually exhibit higher temperature in the small core area than in the large L2 cache area; these small, sharp hotspots can be captured using just a few wavelet coefficients. On MEM and MIX workloads, complex thermal patterns can spread across the entire die area, resulting in higher prediction error. Figure 7-9 illustrates the simulated and predicted 2D thermal spatial behavior of die 4 (for one configuration) on the CPU1, MEM1 and MIX1 workloads.

Figure 7-9. Simulated and predicted thermal behavior (CPU1, MEM1, MIX1)

The results show that our predictive models can track both the size and the location of thermal hotspots. We further examined the accuracy of predicting the locations and areas of the hottest spots, and the results are similar to those presented in Figure 7-8.
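How hotspot location and area might be extracted from a predicted 2D temperature map can be sketched as follows. The `hotspot_stats` helper and its peak-relative threshold are illustrative assumptions of ours, not the dissertation's actual criterion:

```python
import numpy as np

def hotspot_stats(temp, margin=5.0):
    """Return (location, peak, area) for the hottest region of a 2D grid.

    A cell counts toward the hotspot area when its temperature lies
    within `margin` degrees of the peak (an illustrative criterion).
    """
    temp = np.asarray(temp, dtype=float)
    i, j = np.unravel_index(np.argmax(temp), temp.shape)
    peak = float(temp[i, j])
    area = int(np.count_nonzero(temp >= peak - margin))
    return (int(i), int(j)), peak, area

# Toy 4x4 "layer" with a single hot cell at row 1, column 2.
grid = np.full((4, 4), 340.0)
grid[1, 2] = 365.0
print(hotspot_stats(grid))  # ((1, 2), 365.0, 1)
```

Comparing these statistics between a simulated and a predicted grid gives a direct check of how well a model tracks hotspot size and position.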
Figure 7-10. ME boxplots of prediction accuracies with different numbers of wavelet coefficients (CPU1, MEM1, MIX1)

Figure 7-10 shows the prediction accuracies with different numbers of wavelet coefficients on the multiprogrammed workloads CPU1, MEM1 and MIX1. In general, the 2D thermal spatial pattern prediction accuracy increases as more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients, and cost-effective models should provide high prediction accuracy while maintaining low complexity. The trend in prediction accuracy (Figure 7-10) suggests that for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point reduces error at a lower rate, except on the MEM1 workload. Thus, we select 16 wavelet coefficients in this work to minimize the complexity of the prediction models while achieving good accuracy.

We further compare the accuracy of our proposed scheme with that of approximating 3D stacked-die spatial thermal patterns by predicting the temperature of 16 evenly distributed locations across the 2D plane. The results (Figure 7-11) indicate that using the same number of neural networks, our scheme yields significantly higher accuracy than conventional predictive models. This is because wavelets provide good time and locality characterization, and most of the energy is captured by a limited set of important wavelet coefficients. The coordinated wavelet coefficients provide a superior interpretation of the spatial patterns across scales of the time and frequency domains.
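The energy-compaction property referred to here is easy to demonstrate. The sketch below uses a hand-rolled orthonormal 2D Haar decomposition (our own illustrative code, not the dissertation's tooling) to show that a handful of coefficients carries almost all the energy of a smooth synthetic thermal field:

```python
import numpy as np

def haar2d_full(x):
    """Full orthonormal 2D Haar decomposition for a square power-of-two array."""
    c = np.asarray(x, dtype=float).copy()
    n = c.shape[0]
    s = np.sqrt(2.0)
    while n > 1:
        sub = c[:n, :n]
        rows = np.hstack([(sub[:, 0::2] + sub[:, 1::2]) / s,   # row low-pass
                          (sub[:, 0::2] - sub[:, 1::2]) / s])  # row high-pass
        c[:n, :n] = np.vstack([(rows[0::2, :] + rows[1::2, :]) / s,   # column low-pass
                               (rows[0::2, :] - rows[1::2, :]) / s])  # column high-pass
        n //= 2
    return c

def energy_captured(field, k=16):
    """Fraction of total energy held by the k largest-magnitude coefficients."""
    mags = np.sort(np.abs(haar2d_full(field)).ravel())[::-1]
    energy = mags ** 2
    return float(energy[:k].sum() / energy.sum())

# Smooth synthetic 64x64 "thermal field" with one Gaussian hotspot.
yy, xx = np.mgrid[0:64, 0:64]
field = 340.0 + 25.0 * np.exp(-((xx - 20) ** 2 + (yy - 40) ** 2) / 50.0)
print(energy_captured(field, k=16))  # close to 1.0 for a smooth field
```

This is why predicting 16 coordinated wavelet coefficients can outperform predicting 16 raw temperature samples: the coefficients summarize the whole 2D pattern rather than 16 isolated points.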
Figure 7-11. Benefit of predicting wavelet coefficients vs. predicting the raw data

Our RBF neural networks were built using a regression-tree-based method. In the regression tree algorithm, all input parameters (refer to Table 7-3) were ranked based on split frequency. The input parameters that cause the most output variation tend to be split frequently in the constructed regression tree; therefore, the input parameters that largely determine the values of a wavelet coefficient have a larger number of splits.

Figure 7-12. Roles of input parameters (star plot of the design parameters of Table 7-3, ranked by the regression tree)

Figure 7-12 presents the most frequent splits within the regression tree that models the most significant wavelet coefficient. A star plot [41] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The length of each parameter's ray is proportional to the magnitude of that variable for the data point, relative to the maximum magnitude of the variable across all data points. From the star plot, we can answer questions such as: Which variables are dominant for a given data set? Which observations show similar behavior? As can be seen, the floorplanning of each layer and the core configuration largely affect the thermal spatial behavior of the studied workloads.

CHAPTER 8
CONCLUSIONS

Studying program workload behavior is of growing interest in computer architecture research. The performance, power and reliability optimizations of future computer workloads and systems could involve analyzing program dynamics across many time scales. Modeling and predicting program behavior at a single scale has many limitations.
For example, samples taken from a single, fine-grained interval may not be useful in forecasting how a program behaves at medium or large time scales. In contrast, observing program behavior using a coarse-grained time scale may lose opportunities that can be exploited by hardware and software in tuning resources to optimize workload execution at a fine-grained level.

In Chapter 3, we proposed new methods, metrics and a framework that can help researchers and designers better understand phase complexity and the changing of program dynamics across multiple time scales. We proposed using wavelet transformations of code execution and runtime characteristics to produce a concise yet informative view of program dynamic complexity. We demonstrated the use of this information in phase classification, which aims to produce phases that exhibit a similar degree of complexity. Characterizing phase dynamics across different scales provides insightful knowledge and abundant features that can be exploited by hardware and software in tuning resources to meet the requirements of workload execution at different granularities.

In Chapter 4, we extended the scope of Chapter 3 by (1) exploring and contrasting the effectiveness of using wavelets on a wide range of program execution statistics for phase analysis; and (2) investigating techniques that can further optimize the accuracy of wavelet-based phase classification. More importantly, we identified additional benefits that wavelets can offer in the context of phase analysis. For example, wavelet transforms can provide efficient dimensionality reduction of large-volume, high-dimension raw program execution statistics from the time domain, and hence can be integrated with a sampling mechanism to efficiently increase the scalability of phase analysis of large-scale phase behavior on long-running workloads.
To address workload variability issues in phase classification, wavelet-based denoising can be used to extract the essential features of workload behavior from runtime non-deterministic (i.e., noisy) statistics.

In Chapter 5, which addresses workload prediction, we proposed using wavelet neural networks to build accurate predictive models for workload-dynamics-driven microarchitecture design space exploration, overcoming the problems of monolithic, global predictive models. We showed that wavelet neural networks can accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluated the efficiency of the proposed techniques in predicting workload dynamic behavior in the performance, power, and reliability domains. We also performed extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy, and identified microarchitecture parameters that significantly affect workload dynamic behavior. To evaluate the efficiency of scenario-driven architecture optimizations across different domains, we also presented a case study using a workload-dynamics-aware predictive model. Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios. To our knowledge, the model we proposed is the first that can track complex program dynamic behavior across different microarchitecture configurations. We believe our workload dynamics forecasting techniques will allow architects to quickly evaluate a rich set of architecture optimizations that target workload dynamics at an early microarchitecture design stage.

In Chapter 6, we explored novel predictive techniques that can quickly, accurately and informatively analyze the design tradeoffs of future large-scale multi-/many-core architectures in a scalable fashion.
The characteristics that workloads exhibit on these architectures are complex phenomena, since they typically contain a mixture of behavior localized at different scales. Applying wavelet analysis, our method can capture this heterogeneous behavior across a wide range of spatial scales using a limited set of parameters. We showed that these parameters can be cost-effectively predicted using nonlinear modeling techniques such as neural networks with low computational overhead. Experimental results show that our scheme can accurately predict the heterogeneous behavior of large-scale multicore-oriented architecture substrates. To our knowledge, the model we proposed is the first that can track complex 2D workload/architecture interaction across design alternatives. We further examined using the proposed models to effectively explore multicore-aware resource allocations and design evaluations. For example, we built analytical models that can quickly forecast workloads' 2D working sets across different NUCA configurations. Combined with interference estimation, our models can determine the geometry-aware workload/core mappings that lead to minimal interference. We also showed that our models can be used to predict the location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging multi-/many-core design era, we believe that the proposed 2D predictive model will allow architects to quickly yet informatively examine a rich set of design alternatives and optimizations for large and sophisticated architecture substrates at an early design stage.

Leveraging 3D die stacking technologies in multicore processor design has received increased momentum in both the chip design industry and the research community. One of the major roadblocks to realizing 3D multicore design is its inefficient heat dissipation.
To ensure thermal efficiency, processor architects and chip designers rely on detailed yet slow simulations to model thermal characteristics and analyze various design tradeoffs. However, due to the sheer size of the design space, such techniques are very expensive in terms of time and cost. In Chapter 7, we aimed to develop computationally efficient methods and models that allow architects and designers to rapidly yet informatively explore the large thermal design space of 3D multicore architectures. Our models achieve several orders of magnitude speedup compared to simulation-based methods, while significantly improving prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models can capture complex 2D thermal spatial patterns and can be used to forecast both the location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging 3D multicore design era, we believe that the proposed thermal predictive models will be valuable for architects to quickly and informatively examine a rich set of thermal-aware design alternatives and thermal-oriented optimizations for large and sophisticated architecture substrates at an early design stage.

LIST OF REFERENCES

[1] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, Automatically Characterizing Large Scale Program Behavior, in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.

[2] E. Duesterwald, C. Cascaval and S. Dwarkadas, Characterizing and Predicting Program Behavior and Its Variability, in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2003.

[3] J. Cook, R. L. Oliver, and E. E. Johnson, Examining Performance Differences in Workload Execution Phases, in Proc. of the IEEE International Workshop on Workload Characterization, 2001.

[4] X. Shen, Y.
Zhong and C. Ding, Locality Phase Prediction, in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2004.

[5] C. Isci and M. Martonosi, Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data, in Proc. of the International Symposium on Microarchitecture, 2003.

[6] T. Sherwood, S. Sair and B. Calder, Phase Tracking and Prediction, in Proc. of the International Symposium on Computer Architecture, 2003.

[7] A. Dhodapkar and J. Smith, Managing Multi-Configurable Hardware via Dynamic Working Set Analysis, in Proc. of the International Symposium on Computer Architecture, 2002.

[8] M. Huang, J. Renau and J. Torrellas, Positional Adaptation of Processors: Application to Energy Reduction, in Proc. of the International Symposium on Computer Architecture, 2003.

[9] W. Liu and M. Huang, EXPERT: Expedited Simulation Exploiting Program Behavior Repetition, in Proc. of the International Conference on Supercomputing, 2004.

[10] T. Sherwood, E. Perelman and B. Calder, Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications, in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2001.

[11] A. Dhodapkar and J. Smith, Comparing Program Phase Detection Techniques, in Proc. of the International Symposium on Microarchitecture, 2003.

[12] C. Isci and M. Martonosi, Identifying Program Power Phase Behavior using Power Vectors, in Proc. of the International Workshop on Workload Characterization, 2003.

[13] C. Isci and M. Martonosi, Phase Characterization for Power: Evaluating Control-Flow-Based and Event-Counter-Based Techniques, in Proc. of the International Symposium on High-Performance Computer Architecture, 2006.

[14] M. Annavaram, R. Rakvic, M. Polito, J.-Y. Bouguet, R. Hankins and B. Davies, The Fuzzy Correlation between Code and Performance Predictability, in Proc.
of the International Symposium on Microarchitecture, 2004.

[15] J. Lau, S. Schoenmackers and B. Calder, Structures for Phase Classification, in Proc. of the International Symposium on Performance Analysis of Systems and Software, 2004.

[16] J. Lau, J. Sampson, E. Perelman, G. Hamerly and B. Calder, The Strong Correlation between Code Signatures and Performance, in Proc. of the International Symposium on Performance Analysis of Systems and Software, 2005.

[17] J. Lau, S. Schoenmackers and B. Calder, Transition Phase Classification and Prediction, in Proc. of the International Symposium on High Performance Computer Architecture, 2005.

[18] C. Isci and M. Martonosi, Detecting Recurrent Phase Behavior under Real-System Variability, in Proc. of the IEEE International Symposium on Workload Characterization, 2005.

[19] E. Perelman, M. Polito, J.-Y. Bouguet, J. Sampson, B. Calder and C. Dulong, Detecting Phases in Parallel Applications on Shared Memory Architectures, in Proc. of the International Parallel and Distributed Processing Symposium, April 2006.

[20] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, Construction and Use of Linear Regression Models for Processor Performance Analysis, in Proc. of the International Symposium on High-Performance Computer Architecture, 2006.

[21] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, A Predictive Performance Model for Superscalar Processors, in Proc. of the International Symposium on Microarchitecture, 2006.

[22] B. Lee and D. Brooks, Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction, in Proc. of the International Symposium on Architectural Support for Programming Languages and Operating Systems, 2006.

[23] E. Ipek, S. A. McKee, B. R. Supinski, M. Schulz and R. Caruana, Efficiently Exploring Architectural Design Spaces via Predictive Modeling, in Proc.
of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.

[24] B. Lee and D. Brooks, Illustrative Design Space Studies with Microarchitectural Regression Models, in Proc. of the International Symposium on High-Performance Computer Architecture, 2007.

[25] R. M. Yoo, H. Lee, K. Chow and H. H. S. Lee, Constructing a Non-Linear Model with Neural Networks for Workload Characterization, in Proc. of the International Symposium on Workload Characterization, 2006.

[26] I. Daubechies, Ten Lectures on Wavelets, Capital City Press, Montpelier, Vermont, 1992.

[27] I. Daubechies, Orthonormal Bases of Compactly Supported Wavelets, Communications on Pure and Applied Mathematics, vol. 41, pp. 909-996, 1988.

[28] T. Austin, Tutorial of SimpleScalar V4.0, in conj. with the International Symposium on Microarchitecture, 2001.

[29] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.

[30] T. Huffmire and T. Sherwood, Wavelet-Based Phase Classification, in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2006.

[31] D. Brooks and M. Martonosi, Dynamic Thermal Management for High-Performance Microprocessors, in Proc. of the International Symposium on High-Performance Computer Architecture, 2001.

[32] A. Alameldeen and D. Wood, Variability in Architectural Simulations of Multithreaded Workloads, in Proc. of the International Symposium on High Performance Computer Architecture, 2003.

[33] D. L. Donoho, De-noising by Soft-thresholding, IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 613-627, 1995.

[34] MATLAB User Manual, MathWorks, MA, USA.

[35] M. Orr, K. Takezawa, A. Murray, S. Ninomiya and T. Leonard, Combining Regression Trees and Radial Basis Function Networks, International Journal of Neural Systems, 2000.
[36] D. Brooks, V. Tiwari and M. Martonosi, Wattch: A Framework for Architectural-Level Power Analysis and Optimizations, in Proc. of the 27th International Symposium on Computer Architecture, 2000.

[37] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt and T. Austin, A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor, in Proc. of the International Symposium on Microarchitecture, 2003.

[38] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas and R. Rangan, Computing Architectural Vulnerability Factors for Address-Based Structures, in Proc. of the International Symposium on Computer Architecture, 2005.

[39] J. Cheng and M. J. Druzdzel, Latin Hypercube Sampling in Bayesian Networks, in Proc. of the 13th Florida Artificial Intelligence Research Society Conference, 2000.

[40] B. Vandewoestyne and R. Cools, Good Permutations for Deterministic Scrambled Halton Sequences in Terms of L2-discrepancy, Journal of Computational and Applied Mathematics, vol. 189, issues 1-2, 2006.

[41] J. Chambers, W. Cleveland, B. Kleiner and P. Tukey, Graphical Methods for Data Analysis, Wadsworth, 1983.

[42] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, ISBN 0132733501, 1999.

[43] C. Kim, D. Burger and S. Keckler, An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches, in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.

[44] L. Benini and G. De Micheli, Networks on Chips: A New SoC Paradigm, Computer, vol. 35, issue 1, pp. 70-78, January 2002.

[45] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger and S. Keckler, A NUCA Substrate for Flexible CMP Cache Sharing, in Proc. of the International Conference on Supercomputing, 2005.

[46] Z. Chishti, M. D. Powell and T. N. Vijaykumar, Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures, in Proc.
of the International Symposium on Microarchitecture, 2003.

[47] B. M. Beckmann and D. A. Wood, Managing Wire Delay in Large Chip-Multiprocessor Caches, in Proc. of the International Symposium on Microarchitecture, 2004.

[48] Z. Chishti, M. D. Powell and T. N. Vijaykumar, Optimizing Replication, Communication, and Capacity Allocation in CMPs, in Proc. of the International Symposium on Computer Architecture, 2005.

[49] M. Zhang and K. Asanovic, Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors, in Proc. of the International Symposium on Computer Architecture, 2005.

[50] B. Lee, D. Brooks, B. Supinski, M. Schulz, K. Singh and S. McKee, Methods of Inference and Learning for Performance Modeling of Parallel Applications, in PPoPP, 2007.

[51] M. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill and D. Wood, Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset, Computer Architecture News (CAN), 2005.

[52] Virtutech Simics, http://www.virtutech.com/products/

[53] S. Woo, M. Ohara, E. Torrie, J. Singh and A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, in Proc. of the International Symposium on Computer Architecture, 1995.

[54] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan and D. Tarjan, Temperature-Aware Microarchitecture, in Proc. of the International Symposium on Computer Architecture, 2003.

[55] K. Banerjee, S. Souri, P. Kapur and K. Saraswat, 3-D ICs: A Novel Chip Design for Improving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration, Proceedings of the IEEE, vol. 89, pp. 602-633, May 2001.

[56] Y. F. Tsai, F. Wang, Y. Xie, N. Vijaykrishnan and M. J. Irwin, Design Space Exploration for 3-D Cache, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 4, April 2008.

[57] B. Black, D. Nelson, C. Webb, and N.
Samra, 3D Processing Technology and its Impact on IA-32 Microprocessors, in Proc. of the 22nd International Conference on Computer Design, pp. 316, 2004.

[58] P. Reed, G. Yeung and B. Black, Design Aspects of a Microprocessor Data Cache using 3D Die Interconnect Technology, in Proc. of the International Conference on Integrated Circuit Design and Technology, pp. 15, 2005.

[59] M. Healy, M. Vittes, M. Ekpanyapong, C. S. Ballapuram, S. K. Lim, H.-H. S. Lee and G. H. Loh, Multiobjective Microarchitectural Floorplanning for 2-D and 3-D ICs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 1, pp. 38-52, 2007.

[60] S. K. Lim, Physical Design for 3D System on Package, IEEE Design & Test of Computers, vol. 22, no. 6, pp. 532, 2005.

[61] K. Puttaswamy and G. H. Loh, Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors, in Proc. of the International Symposium on High-Performance Computer Architecture, 2007.

[62] Y. Wu and Y. Chang, Joint Exploration of Architectural and Physical Design Spaces with Thermal Consideration, in Proc. of the International Symposium on Low Power Electronics and Design, 2005.

[63] J. Sharkey, D. Ponomarev and K. Ghose, M-Sim: A Flexible, Multithreaded Architectural Simulation Environment, Technical Report CS-TR-05-DP01, Department of Computer Science, State University of New York at Binghamton, 2005.

BIOGRAPHICAL SKETCH

Chang Burm Cho earned his B.E. and M.A. in electrical engineering at Dankook University, Seoul, Korea, in 1993 and 1995, respectively. Over the next nine years, he worked as a senior researcher at the Korea Aerospace Research Institute (KARI), developing the On-Board Computer (OBC) for two satellites, KOMPSAT-1 and KOMPSAT-2. His research interests are computer architecture and workload characterization and prediction in large microarchitectural design spaces.
F20101111_AAAPDL cho_c_Page_037.pro be4221fbc1f71bb607bf6db603906d27 dcff98d8c40139171d38d5ddafbc558cb5c6b39e 6327 F20101111_AAAQGN cho_c_Page_018thm.jpg a3a91a0efdc5efa09736dc08d5fa1b16 90b6972c7b8b13c713b701fa83a7e44dd06bccca 6120 F20101111_AAAQHB cho_c_Page_041thm.jpg 4938cd2c3ddfe36c1b26c73929b2c652 adc987f08c28cde32408a9d6cc93694e984b4221 4556 F20101111_AAAPDM cho_c_Page_036thm.jpg 1f7be656e5f9ddd37daf349bbccb4259 0cc76286409be3d3648546fbb41eae36401bd48f 24395 F20101111_AAAQGO cho_c_Page_019.QC.jpg 316dcc13ad9d91be504b6f87b4cb9903 67ea2ea86885790926a20b7e9c0b003cfb4cf139 6543 F20101111_AAAQFZ cho_c_Page_097thm.jpg 8932136896e278f0b5c6a398038746f2 e5bbb6dddc775a2190ff1eddc403ebfe8421b052 1051947 F20101111_AAAPEA cho_c_Page_102.jp2 d004d9c88fa6db6acabf796130cdcbf9 b3c41b20b1c223320c756ab03d41c56290a8cef1 27262 F20101111_AAAQHC cho_c_Page_044.QC.jpg 7269b2cc00bb7c12ddfb6397b0aca5ce f610d8d02a26ab832f77d1d22ae35538e8e2e181 1053954 F20101111_AAAPDN cho_c_Page_043.tif 3a849c8ad466962f9c607c64f5eae4f5 113bc3de368ab1f7a5bcbbf821f4193517509d54 18404 F20101111_AAAQGP cho_c_Page_020.QC.jpg 075172a7fec960102a7064ed39c75887 7d09798ce46073da7a4c097d2e46ca034e3f4e9b F20101111_AAAPEB cho_c_Page_030.txt cc877651394b40519b580c21a9780b45 9f468f8636e9e569b96b68b79318d11b14942e54 F20101111_AAAQHD cho_c_Page_044thm.jpg f9c6887de2633d48ab4ecda24c875e1b b2af868a4aef279012f6210382316433479763f0 F20101111_AAAPDO cho_c_Page_070.tif 4dcb73ece5b4b2ba81c03b248cf4c521 839253429aacc24308fb2a80b6188e6e054387b2 5415 F20101111_AAAQGQ cho_c_Page_020thm.jpg b01607d82ffe8475f26656825fbb44fb 8b5bc2289cf41fae1f245f6276cc462c5f845e23 553326 F20101111_AAAPEC cho_c_Page_003.jp2 a8ba8c03223fd0d6846c583d5d1c8ca1 83ba78fb20291411a5a0909946cba92bc4dc0729 7571 F20101111_AAAQHE cho_c_Page_048thm.jpg 6fb00f601423f26c95c7651b001e8900 adbefdb33c6b52749cfebbdf94d11490e7c21c77 1051968 F20101111_AAAPDP cho_c_Page_010.jp2 9235b4130c423f27dda21ec2b9d84bc0 2bf1664b083fc8dc47158f29aac8b7ec27c3638d 26379 
F20101111_AAAQGR cho_c_Page_022.QC.jpg 05ddb78a25ba6bc2edcf8639325da4a1 bbb0db3608f95e416e84d0442a6a6e5455c4a6a8 18644 F20101111_AAAPED cho_c_Page_118.jpg 7854cf7e52a564bc89e4d341655bed66 032a6702b6c6398d6fa4e8cbeff9e338a8f4af91 7111 F20101111_AAAQHF cho_c_Page_049thm.jpg eababa7a79d9b6d529889993f9e02572 d8609c5961ffb82f6706e9d5a97db23d73a3a8bc 73554 F20101111_AAAPDQ cho_c_Page_089.jpg 0bbd64ae115dfb6d3de55403e26919f7 9471db114d3c4ed27883fe57aa649a8e51f04c8f 11151 F20101111_AAAQGS cho_c_Page_024.QC.jpg e9a45180d1e03d134528d5afa7210fdb 490319907fd751cf061ef1aa3bfee75118365c13 1979 F20101111_AAAPEE cho_c_Page_094.txt 1988373bd249b08ac8ec4a5358f070bb 7b2e1a8e3451522d1c47ea71894592792301149b 6467 F20101111_AAAQHG cho_c_Page_050thm.jpg c0c3ff62492fc68e4ff819ed7b9fb42a 24af3ea8f84303aae8a1ba1c20d4d25e65cc40ef 5306 F20101111_AAAQHH cho_c_Page_051thm.jpg 584a906e538a163d3bdcc0664ab68890 fc0755f9358d036b9975811cde3ff4422e82c880 25481 F20101111_AAAPDR cho_c_Page_049.QC.jpg ff409faa09523bd90892d5cbd75a5972 1c80f170204de4f1bee9ccdd28581b34df4021e2 23274 F20101111_AAAQGT cho_c_Page_026.QC.jpg b358bbb533f66991a78dc4aa12979ef5 14ff233834b9d53122589a4168add9c333dd3abd 26993 F20101111_AAAPEF cho_c_Page_014.QC.jpg ff823f433310eeca767cd7174d8f2a16 e243b85b10783ed9c8d8488962f841a072085aa7 6972 F20101111_AAAQHI cho_c_Page_053thm.jpg 36864514d8d1faf0ec376ba475f14b8d 764c19d010586dd949be8b0edd3a8e3d6e8a4e95 2323 F20101111_AAAPDS cho_c_Page_059.txt 84cda51aa2b0aca11a54655d76e33d02 32bc26504807d6fe598dca3f91c4ee76b4e59cf7 26014 F20101111_AAAQGU cho_c_Page_027.QC.jpg 8671a8b083f87341ef08710a54ca98c4 527008a5ba5d860d41084b1d2d377800a53c6b5d 55816 F20101111_AAAPEG cho_c_Page_111.pro a3d2ab0761c29fa9e269ff0d483c74cd 2811f55e0611171e1ca90f5aa7c654490a175508 7411 F20101111_AAAQHJ cho_c_Page_057thm.jpg cae17813aac626f0676e018011e37ec5 d7179ccc385e8f2eefd0ab9b74ccf0bb4f7b6c63 27475 F20101111_AAAPDT cho_c_Page_063.QC.jpg 77ac1e57a889d14d6833143b41b27310 70475cb12832b410ec0a65481b3b769287e95cfe 
20935 F20101111_AAAQGV cho_c_Page_028.QC.jpg 72e2894f7bac5a71793efaf60aa6e386 3ce024216cf32d2617c4371673307e4eda183b9c 157013 F20101111_AAAPEH cho_c_Page_118.jp2 52194e61c14e78526a90d5421034fea1 7556ad206d7425edee9f0cce96dcc755b7725509 6397 F20101111_AAAQHK cho_c_Page_058thm.jpg d7b48dd41ea4b681f2c72057bc88db62 fcb1887ccf0a70a1192976e24c26222c79edf440 7437 F20101111_AAAPDU cho_c_Page_022thm.jpg 9fa9d4c1b615e67ad6f6bafb75e82c6d 56e7699168346c55409b8874cd1679a7479be560 27574 F20101111_AAAQGW cho_c_Page_029.QC.jpg ead2eba084882938e3d90ef8850da621 9424e1d19ae0eb1c71ec1985394fda1f0a3b148f 1051800 F20101111_AAAPEI cho_c_Page_031.jp2 1f09cb8da385e79265f3ed5cc54c2130 d3b514eb71de447d765090b2841ed14fcf2d706d 25272 F20101111_AAAQHL cho_c_Page_059.QC.jpg 1068632f5543749b7b297bdd8362a4ed 0b95e4a2db658ffc3bd99bf43751290fdcd1e712 F20101111_AAAPDV cho_c_Page_091.tif 39ea28e34ae5282543ce8ef6327ec0b4 bddf953073eee45d61215ed6e613a261babfe250 5954 F20101111_AAAQGX cho_c_Page_033thm.jpg dee78483fb3125b1931abdfb42c2f116 412b01f8cab288395800c2b2c77e60d5cf301116 24771 F20101111_AAAPEJ cho_c_Page_088.QC.jpg 9e830c58beb3afe5974759307af09d81 1ccb40099047a3cb1280cc4206657bdd9909f235 6395 F20101111_AAAQIA cho_c_Page_080thm.jpg 9b6ceda8297f808e93eb51578ccf3e3e d0ccaa478c25b9f4d5f2a3cff4b1be47520fdc81 7444 F20101111_AAAQHM cho_c_Page_061thm.jpg d553b707f490ee8163267e9a77e362d8 92a46ac9fe2f64409a6dd9b76243b0fda3a67b1e 6419 F20101111_AAAPDW cho_c_Page_026thm.jpg 02cfe8301be2e7288ff79b6bcaa53e43 902ea4e8e253b6ca8934f1796c5bbb9935656e4b 6744 F20101111_AAAQGY cho_c_Page_035thm.jpg 73572e8dff6f8559664ea413b5ca4da8 168e997c6ce20b7d8bbdcabb2aa706c5b6eb59d3 89562 F20101111_AAAPEK cho_c_Page_044.jpg 2735d84dfe05b2b8682d178719c7ed1c cfad00ee50abb557a51d7c2bb8b2410f9cff5134 7497 F20101111_AAAQIB cho_c_Page_081thm.jpg 2ccd5e1ee4b7a2b7219598ecad1868ab effa601813d3722cce3e80c66bdd3402fd4d78f1 20570 F20101111_AAAQHN cho_c_Page_064.QC.jpg 1947cdd216ba7238f7f146b2199716bf c0957b1664c6c7d43ba36a5ccc1ec36d5943bf00 
306566 F20101111_AAAPDX cho_c_Page_119.jp2 45928a37b5a39557ce72029a3cd2370e 77127aee0207bf5009ee631dbe856c270c9d64ac 7154 F20101111_AAAQGZ cho_c_Page_038thm.jpg bc9fe155bbc8df6dd5bf8f5887b51e9e 616c42b4835009521993c17c6802e512012ab2c9 99273 F20101111_AAAPEL cho_c_Page_116.jpg 16ad00d31a3ec47fe9ebe52a01851baa 104c88525e7187c0d499b168f7243508a46c547a 21673 F20101111_AAAQIC cho_c_Page_082.QC.jpg 97a3dddc7b0744a6cd597cab4a3e206b 612d80b6d47fec8e30ff359f03c47221a847869a 24322 F20101111_AAAQHO cho_c_Page_065.QC.jpg 6d5d3f49c09a4b2deea9f0762da3faac 363ed0a32511dd487115ca33cd24335055fca2f0 78146 F20101111_AAAPFA cho_c_Page_004.jpg b1823c29e6687c9e4292de29bc358cf8 931ff8d68433d0b65cffac0add1d7abafa76fa5e 918363 F20101111_AAAPEM cho_c_Page_033.jp2 b27f0e9b0400fbef5dfde301270675ab 0a2c61266c1a80abba00b23fba8ca5cf8b0ad331 25771 F20101111_AAAQID cho_c_Page_083.QC.jpg 4881a8a2242b48e3fdfcb8ba8e68dfd7 29374a67d97c849b50ac550303682b43880b026d 6845 F20101111_AAAQHP cho_c_Page_065thm.jpg bafc1f569c248bde934d951d66ca1d26 9526062a5d08cff3b7ece6b8baea5f9f053c1a94 27662 F20101111_AAAPDY cho_c_Page_057.QC.jpg 1cf8f147e73a128bddd4410994865060 e2b598f0867e92808b2d457efd53c2c681e3acbe 37072 F20101111_AAAPFB cho_c_Page_005.jpg 734765313e005c93f2dfab3d38d0c074 248a4d071952ec4332b3edb5c18670ebcbc626cf F20101111_AAAPEN cho_c_Page_114.tif b71080f9099c679989f3e96348857376 2b5c7a13fefb278b3585c4d486601c3b7a68e540 30839 F20101111_AAAQIE cho_c_Page_084.QC.jpg 18d8a6a9cd9dff765d5b7d299fb1f82d 08a33c5cfbfbe9b15d15e68f9b3805b0c0b30525 22511 F20101111_AAAQHQ cho_c_Page_067.QC.jpg 83ec7d7a3498da6c7d36588ff0d3cb37 80583cd32b254d341c6371cc80403d08d6fabb86 87721 F20101111_AAAPDZ cho_c_Page_096.jpg b6743f37026c403e51ac8a5275ea6a2a c5b1ba8ec8295abb6de59131832edebb78e420df 47406 F20101111_AAAPFC cho_c_Page_006.jpg 59114717a9f90ba7bba285a59944a680 6b850b7c1a25cd2a33d95d1f07a57e01afeeb36d 25867 F20101111_AAAPEO cho_c_Page_031.QC.jpg 414c403594afdfb98c1e4cebaceaf85b ce331b8fcb2f53a2bee3c90923878a370ae6d2aa 8529 
F20101111_AAAQIF cho_c_Page_084thm.jpg 6a35c7117b1fb5486a065ca428e7cbd7 d3d47ca8307baf4b8c932bb8ba3b8ebd49fedd84 6368 F20101111_AAAQHR cho_c_Page_067thm.jpg 2a64fbce66511dd0365ff49907e4875e 4b7c7f2da5a44f0b222b6518f32ae8c99a164ced 68276 F20101111_AAAPFD cho_c_Page_007.jpg 002fa00614cb52a2d8dd4ff84295e3ec 999cb6934d544c4e92cff39b74da45888ba9ed25 36481 F20101111_AAAPEP cho_c_Page_093.jpg 9654977fe3fb8c52a381b885707c9c09 3cfd5c9ca407933019e20e6bad69ca49607e0b8c 6123 F20101111_AAAQIG cho_c_Page_086thm.jpg d7d20892b5dba875a3506b35e20dbef7 b05068b368cbb7caac211f708d209eb3a7194038 23522 F20101111_AAAQHS cho_c_Page_068.QC.jpg a8aed69334fdcee7773df3fe6b937429 6c198c93cbb9f1fb8a8eec663b7b7457e36699bc 74830 F20101111_AAAPFE cho_c_Page_008.jpg 2fe6d559efbd6770e7e3fe32d1589f44 19168889afdcaa6695f5a1a76199e7270e06edc7 2070 F20101111_AAAPEQ cho_c_Page_041.txt 39c2a81fb42a40876d667f459f247fc9 cd215f8a0b53a42d0282b53674120e83ed72c738 22995 F20101111_AAAQIH cho_c_Page_089.QC.jpg 43ee53d91d0c44628eb69473472e0020 7a8a13234fe4d7ba2646bcfc4d917c8c86545933 7341 F20101111_AAAQHT cho_c_Page_071thm.jpg cf7fcafcf6433deafff3ccc86d5e9219 25569f7702643737fbb88934175e93f252cf851e 57189 F20101111_AAAPFF cho_c_Page_009.jpg 22a9dec0186a3ad87041a847ae84a7bf 0ff5f2c6e35ddf595cf5c00c73bd8ff3fd0adefb 25507 F20101111_AAAPER cho_c_Page_010.QC.jpg 235d293d06c415ce0de7f05d0b5f9437 8487ecbd0d74f5cc83be4358af26de24d63db30d 8734 F20101111_AAAQII cho_c_Page_091thm.jpg 2eeeb58da5fe10137a87e9d6f5d7aa43 33f3f380d6965fe45e805f30c4bce24b1c586b98 24545 F20101111_AAAQHU cho_c_Page_072.QC.jpg bc726f606f76bedfcd1cc3d2f990f6d6 906b2be7cfca21820dc4b3764a139e06694e24cc 84203 F20101111_AAAPFG cho_c_Page_010.jpg 9244cb6c5a0f40e3317ee261c5b34335 8ab0b677238dadd800793e191d3695149f9e64e5 21619 F20101111_AAAPES cho_c_Page_086.QC.jpg 1e5390c89cbc1eaacff9766fb2354589 f7489e5f604eabbc460cd46f20b998eb751fd601 23845 F20101111_AAAQIJ cho_c_Page_092.QC.jpg c2af4dc07e9e24c18cb680351ca5ea31 189a45ee61d138cbf8040ebe572945d29e491305 3345 
F20101111_AAAQHV cho_c_Page_073thm.jpg 1cadfa0a00627ba7352b8531d1b65ff4 f552d415cd19dedbc7dcd1109434ce237f034e9c 49663 F20101111_AAAPFH cho_c_Page_011.jpg 0cf0d46f25811c2c235e84612f850873 82589932063cb3c365f5388f9f46af4a4ae34acb F20101111_AAAPET cho_c_Page_098.tif 406075c05d70d92813291f1902be0d8a 373784ee91fc6b71502fa86e7d97f84c29e18638 4093 F20101111_AAAQIK cho_c_Page_093thm.jpg 486eb3cb63481159aac0d8b21bff059a fac9515d3a0d99ff4cb9071e0683c4ecd759e750 7795 F20101111_AAAQHW cho_c_Page_074thm.jpg e2bbc40d39a83b3d99f0b33a1a0a42d3 2b29c7b13cae644115b9894bb013038e8c2adcaa 84862 F20101111_AAAPFI cho_c_Page_012.jpg e89f046dd302775922202d80b344c026 7e0c3dfb9dedb7e9a9e7ec7cc1f46595a3bf7b6c 136516 F20101111_AAAPEU UFE0021941_00001.mets 70d86a43d4563074f31113b4a91abf44 b3c5798f839a4cd47c585258c4a2c967b31f8487 26013 F20101111_AAAQIL cho_c_Page_094.QC.jpg 1a07fb668784a27051f449a2c4ab287c bcd644ee5603ac663c017b64405cd10aab267a56 25890 F20101111_AAAQHX cho_c_Page_077.QC.jpg 21357eb01a8a48387af2046e52e38780 d51ecdba89dd7cde6839e2db1953d86ead64849c 87690 F20101111_AAAPFJ cho_c_Page_013.jpg 624d6678fc40ed7d1c6f95d91ea700e4 c4b127775e5fcd229491b2fd39b305a3a8420460 26413 F20101111_AAAQJA cho_c_Page_113.QC.jpg 5f98a4c44d304748ce8f62096fd42cb9 a68fdf75ad657fef1bfbee6eb5cd5f5e8802340d 7578 F20101111_AAAQIM cho_c_Page_096thm.jpg 293c60c814008a7c4c019cf826ded0db 43711970717ee370fa18dacc975249647797db48 25191 F20101111_AAAQHY cho_c_Page_078.QC.jpg 002db615b94256a97450a52022f94976 5b817c6dffa265ba5c60c440e96020dc2caddcfb 86808 F20101111_AAAPFK cho_c_Page_014.jpg a052cd0da872143615ad5b9ca31db061 eeaeb4145e13ac42c42e23a612b54de79b1295d2 27786 F20101111_AAAQJB cho_c_Page_114.QC.jpg d6fa3ea5a4184c803e89005a6dc2388c 46f662ce3c4e9951618d446beef12392b1307e14 23410 F20101111_AAAQIN cho_c_Page_097.QC.jpg aa353eef707e6928123b627338410391 77e1c19072fed53e81606dc88c5c501c4431aa54 22155 F20101111_AAAQHZ cho_c_Page_080.QC.jpg ac945629758d492a0df28e56b18d0b6e e2ea9ed00c8707290488cbd75d4124aa873a632e 14500 
F20101111_AAAPFL cho_c_Page_015.jpg 8c38e9f3a290cbba0825249c5faa455e 42990f33e93389aa66e5b2159fba4e9726768d1e 27868 F20101111_AAAPEX cho_c_Page_001.jpg 625083c1b8484ff4a09b8d432d038854 602f61fc06a528b4d690cb3e2810857a6794d627 7022 F20101111_AAAQJC cho_c_Page_114thm.jpg 2103cc052ce166a7919e3a8f6ebc9d63 e8df3c4b22666dd09db5975696b8308140a25106 6306 F20101111_AAAQIO cho_c_Page_099thm.jpg 4c9e578c787e224c50efe6ca449828a9 050e6f353f8c4a439b4664bb41bafd6a778c14b5 80436 F20101111_AAAPGA cho_c_Page_030.jpg 008551b032440d2a9a07ee31506b8148 979a342634941a50e273fa1cd52640c8d0148576 72485 F20101111_AAAPFM cho_c_Page_016.jpg e6466f2b171a8edab192527ecdd18b78 b05710a2c904429d4246c6828c8c3f2e0284a25d 10098 F20101111_AAAPEY cho_c_Page_002.jpg 8413b5413d0c3145ff5aa2fcac8e8434 9c047e5369518e98e0283c04b85a60e3b68ef786 7506 F20101111_AAAQJD cho_c_Page_115thm.jpg 30cade386c1eaf8875bf1f3bba638055 b15a703137ff06fc66e1808d856c20f9fc276b0a 21728 F20101111_AAAQIP cho_c_Page_101.QC.jpg 042737d85fe3d2a71ab01175726aa9dd 00395aa630d0382649e7338d2653ab8b5d256b73 82494 F20101111_AAAPGB cho_c_Page_031.jpg ccc071d2ab364cfbaf0869c6ca12e046 02fd9e39012a9c71066554ba46bf8b3b159ca96e 78535 F20101111_AAAPFN cho_c_Page_017.jpg b4196356c2aae07d50e65998658cfd52 22da85c6d49d643cdcc95bf514410541f48af145 7587 F20101111_AAAQJE cho_c_Page_117thm.jpg fb8bffab4f33fcccdcf943a1b57e3d03 dc3ea1702f868e9feb3d60db4bfbd3cde63fae26 6229 F20101111_AAAQIQ cho_c_Page_101thm.jpg 1e63a7129c6acc30ade20f3415705d83 f55e832174c0bdfd9430af9d73954a158d8df2e7 72508 F20101111_AAAPGC cho_c_Page_032.jpg b851944216f1bcb6dd9386bb2ace9eba 4c203be39ad6f25837671f617f1f781ed44eb6c8 70183 F20101111_AAAPFO cho_c_Page_018.jpg be17c4c89aa72a28e31268b67b598e7a c2de4537cfc8a94f2c2759dba3f81ddae4677cf0 44042 F20101111_AAAPEZ cho_c_Page_003.jpg bed8600c565733ae60c51b6fbb9d1a5a ba73749723a1563e93cf059471be99607a6d7f12 5466 F20101111_AAAQJF cho_c_Page_118.QC.jpg 249067baf988b61df2be7248086dce46 40396de6c630876e05fad022a5b6bd522df37054 6749 
F20101111_AAAQIR cho_c_Page_102thm.jpg b0912a6e55e1fc7c8aa416ead5eb3e4e 54373ab0c16d8d9013c2375d6a0bf13c724ffd6e 63438 F20101111_AAAPGD cho_c_Page_034.jpg 1b1714ab9b6f9d2376591380f722bd05 68f6995e0210a2ab94bae02d7d8fe5c0029f56f9 75715 F20101111_AAAPFP cho_c_Page_019.jpg 5b7d5ef6cdbffda1cc9689e10f0ad2f6 9d69498a3faccec962121679eaa41a75ac809eda 9307 F20101111_AAAQJG cho_c_Page_119.QC.jpg 526e67b78a90283c9185e700b879553b 464f6a52a17c55fbeeb55ef6a9a3313863defe71 25427 F20101111_AAAQIS cho_c_Page_105.QC.jpg e4de94a8eff40f08ba74d294d6dcee38 2080335a813d858d41c9f8d45bcc22513e267e24 56320 F20101111_AAAPFQ cho_c_Page_020.jpg 685f3a82ff668d1d74e6d9d9ab27e359 88b614884b062c3ab25b407075c37d7a50170109 77771 F20101111_AAAPGE cho_c_Page_035.jpg 3f5d6b6aefa38269dc93c82baed16bef 3751b8961b4d83ca3e015d6a153283b90fdb52e9 6795 F20101111_AAAQIT cho_c_Page_105thm.jpg d478296c32bd2a184c3e4c0a7641088d cc0d9da49243fc19eff9db598770c718ada6e16f 66217 F20101111_AAAPFR cho_c_Page_021.jpg 1888a99cba8e527522d1d50ee03dec27 052514a90bf1261022f717071827fca3b4f04e41 49609 F20101111_AAAPGF cho_c_Page_036.jpg 92d0862182a357edfa915692e377fa03 a918761a1927438f9662a38bc2653396f10b7d8e 19277 F20101111_AAAQIU cho_c_Page_106.QC.jpg e5b9d2d2a7bf33d0c8e09a10c917db1b 0876247157e6cdc8756262163e8308ed7863febd 83861 F20101111_AAAPFS cho_c_Page_022.jpg a506868101fe9b0a795985f65e031faf a47d3e5ddfefe50e94abcab951e1b2297032ec6f 86752 F20101111_AAAPGG cho_c_Page_037.jpg 6728d5edf27ba71db45537938a8cb031 2a73ddd499ffd613e4784fffcdbfbe494f3cc0a4 5621 F20101111_AAAQIV cho_c_Page_106thm.jpg 34f33652a9aa8cfbec79e2f9b4fcb738 81e61d8822d457342aea55b33c4b1b2481c456ce 84677 F20101111_AAAPFT cho_c_Page_023.jpg 76a2e44f0abed3bd8214914b164126f0 49a659842ab80e9fa51b46f5707091e9385d6a9a 80727 F20101111_AAAPGH cho_c_Page_038.jpg 68b98f540ec68f82a48f51d982df5b52 a99f068b2ac857f7cdbe9e8085441a11cf8889f8 24142 F20101111_AAAQIW cho_c_Page_107.QC.jpg f1ec3159f50cf6e8f4eed95cd11dbae6 5f2a2d8744e43d975dc9536d3b0a7230de1c318b 33531 
F20101111_AAAPFU cho_c_Page_024.jpg 18a3d26d2c11596f6b9617cf0a2fe502 2f5bc7d8aef7d7e04c5d8e3ac4de5becaae07f9d 85947 F20101111_AAAPGI cho_c_Page_039.jpg de1dee9f695fa95022ec90c50fa193a4 73891c65e9d60c9820b4ff3e2f0b865d8fa95043 4781 F20101111_AAAQIX cho_c_Page_108thm.jpg d074f54df7857c3192fd5148e8f8ded3 9e1cbcbff03ccc941cb1b1f2f9620d8ecc26e654 82775 F20101111_AAAPFV cho_c_Page_025.jpg 164999d87e3acd7c066a1ae18ea21b09 ff8fe84a77cd97bbbb637f7b322dcf0acbc420b2 80426 F20101111_AAAPGJ cho_c_Page_040.jpg 63cd53f8b21b8ba315161629123cf8fb 47d577c86fde850f8b2156f296162ad7ccd43c8c 26046 F20101111_AAAQIY cho_c_Page_109.QC.jpg cf3296df4fc6018ddfe7499314b18cf6 dc73880a090ebde7f0c22a9dec68a39b8cdd9be7 75945 F20101111_AAAPFW cho_c_Page_026.jpg 0cba6b251740b4ec154a82d560cb959b b26fca39bcaf551e7b0c29bd5c47d22ff40d9471 65697 F20101111_AAAPGK cho_c_Page_041.jpg eb75bdf9c450fc1c72efb8138984ad6b b8119a1882155e0fbd6ba7fae43c31b8b072f9df 5077 F20101111_AAAQIZ cho_c_Page_112thm.jpg 1fb804c521fe025ad5f8da59b09e8c50 22a144c769c70426b29bc02cafd479c9087b7c72 81135 F20101111_AAAPFX cho_c_Page_027.jpg 7273e899374c3eb9511c2e1964c66f75 762fde30d3e99c119be127cc84ef7b93b7c67b51 89679 F20101111_AAAPGL cho_c_Page_042.jpg 21c8e4b799f2eca07e53bf59cad9c75e c981f855643a6d62e1ea6027b96f0a7c72cba54d 66684 F20101111_AAAPFY cho_c_Page_028.jpg 3327acc8974c684dbdb2b1d75a3cf63f 93c82401cb1e87a62c7035ba4d122cdd9ecfed8f 68266 F20101111_AAAPHA cho_c_Page_058.jpg 29767afb9b943076b13264f84d62c7ab 5ed0645f2be066061f3561aba9787a6809ecab8e 67900 F20101111_AAAPGM cho_c_Page_043.jpg 8949fe2f49bb5c659d2f6457c5008511 ddff28285b0524e3344be334ce4d60fd6b82dc50 87007 F20101111_AAAPFZ cho_c_Page_029.jpg 14b1193dfa8568e683a6c165cc1a73c4 2c57aaf4afa29428914b5d0e32e28e5bd8cfdf96 84162 F20101111_AAAPHB cho_c_Page_059.jpg 238fa3ad1105573758dbf7da3f860144 d41b920a2bc3f6de87bcf9799af473838610c2bd 72854 F20101111_AAAPGN cho_c_Page_045.jpg 9486ca4ab10bd920814ac420fcaeab8e d6f2ed55ea6dc6da8b75b91d42ebebb593e49cd4 81756 F20101111_AAAPHC 
cho_c_Page_060.jpg b6fc78d37f4ff80b0c2e3de7b1469b62 5aa4a751ef20c5071dff043f61b42d38dcf8102a 80306 F20101111_AAAPGO cho_c_Page_046.jpg dffbf3bbb1495b92b69fdfd7aa187b4f bb83bd14d699705d19c305b4d06e077d6b65c54b 94764 F20101111_AAAPHD cho_c_Page_061.jpg eb970eaec14ab9c3e1ed80ea00549c14 3f628f019da607b1fe229c96c84fff6c11218d2e 90584 F20101111_AAAPGP cho_c_Page_047.jpg 35a18fca4bf6d2296c2e94b0daad8dc2 bf3cc68c2254df2ca0782f9caaa0e7f28a0ab4ff 69923 F20101111_AAAPHE cho_c_Page_062.jpg 6152a1a6c64cec59d9da1de974ff13e9 3f39a952f6a01ca9d959ce9e53861d1fb9ce0147 87831 F20101111_AAAPGQ cho_c_Page_048.jpg bd8aa22d338acc7cb384c68c97fb3d5e a22c5d8f91827663d66097e35c60568d175227c8 85677 F20101111_AAAPHF cho_c_Page_063.jpg 097ba492f5af4e8805ba3b40dabb6c61 54f0ea6284be90f91e57425cd61a659146bf3348 81945 F20101111_AAAPGR cho_c_Page_049.jpg 755651bf83103ebd1638c6c6a2dc16e5 2351d65286adca9094be6224f683ef7e359862ae 63048 F20101111_AAAPHG cho_c_Page_064.jpg d9a9d2842de1c754031bb6f59ff2d5d5 13bb2ceb5d9266be6d929d3771eb3019bf484865 74575 F20101111_AAAPGS cho_c_Page_050.jpg e136a33639d2bc69bbea783cbf2b8a00 baac1145a1d170a28011fa7ac946d8c7eb1b9d03 76766 F20101111_AAAPHH cho_c_Page_065.jpg 7a61775ccb2b6b20d451f12066638784 6b246a3eff2663768a9c9086cd7de7b30dec4d21 62634 F20101111_AAAPGT cho_c_Page_051.jpg bb431ef6756a619a46db9f77403328ca 388fb67dfe04cf5b29aa4d9f1bb93e2dc1c0e70f 87383 F20101111_AAAPHI cho_c_Page_066.jpg 7a5d44074c013c41bd7a552a6abdf625 5864a1edf4837e9646645aea06ac191782b9c40a 92369 F20101111_AAAPGU cho_c_Page_052.jpg 7aa0af923c26ced027fd3ca679067496 ec598d1f9ab50b950be3e0b0bbb3a1f9469a22b9 70454 F20101111_AAAPHJ cho_c_Page_067.jpg 93559084e7cc0305cfa95bb748498b9f 7b5dc366d06ab0ababc730f56ecafc642e7dd38f 79548 F20101111_AAAPGV cho_c_Page_053.jpg 48c46cfcf406684e22bd355df3be1fe4 edb340b1f776e12ff20742016c5b37f8e9e63d28 72955 F20101111_AAAPHK cho_c_Page_068.jpg 9f52c86926e61db781ebaa75f2e1f4be cefd9aa35f1995ca849d8f39b1835ee4651b61b0 72380 F20101111_AAAPGW cho_c_Page_054.jpg 
a002247aa2adabe6c2b97225ed509858 b2f29c073e5c2ebb2a96a59ae94de0976f88dd75 71737 F20101111_AAAPHL cho_c_Page_069.jpg d72792cbaa957495fb316463ce88bb29 1cee11ee89fc610699268f6db74ea595ef87ad55 78345 F20101111_AAAPGX cho_c_Page_055.jpg 94f5eef5bb1067aa07d508763d797ad5 ca683ddc452bdf8ca11e1854b99a67003ce583df 98891 F20101111_AAAPIA cho_c_Page_084.jpg bf22ee1342d25098216be11e657f39e9 040c0671299698cc2761593596fcd8bab5264bc1 F20101111_AAAPHM cho_c_Page_070.jpg 9bf4bbc23f4108ba94bcca872bf86930 745ea1479459292274293b9dbd554c843d4701db 76577 F20101111_AAAPGY cho_c_Page_056.jpg 8cb6855e6664c46814da046a799e95cf 06d95cd105d67cfae9014b1b32340d3064ff4ba1 73042 F20101111_AAAPIB cho_c_Page_085.jpg d7064105188f535e3bef17c8c608b4d4 96c9ea79e5f1111193792afa7967eae43316cb20 78597 F20101111_AAAPHN cho_c_Page_071.jpg 36f4a47faaec24cb81457c1aa3e179c9 b702e7952976d3b475c54da31bfc32de4b229e3e 86816 F20101111_AAAPGZ cho_c_Page_057.jpg 75aaf06494bb860745df2246baea0a25 10008dc4b62d8784ccbacdf4b9540b97214adefd 69707 F20101111_AAAPIC cho_c_Page_086.jpg 4fa573fde6701664dfa0ac0325462215 4bc77fe8400f196a7f52353a6e2236bf9abda9d3 80158 F20101111_AAAPHO cho_c_Page_072.jpg fb9f15d8dacb658c83d1d872e3130936 cf507c163df526cf84ba082b339c17b3de357588 72224 F20101111_AAAPID cho_c_Page_087.jpg 931cbacec86d0addd16a4a4cd18252e8 fa44f27bf937fd3b8e4e2b7d7b180298da2f4e55 32496 F20101111_AAAPHP cho_c_Page_073.jpg 6aa2cef9a1729dc6af4e1716fde6e17c 65488c994f6201c787f8a915043b3f21f7566229 73906 F20101111_AAAPIE cho_c_Page_088.jpg c5ffb37b2186e367bfe67c2ff0ad962f b702e3be179fcac5efa3487c0cefa9215d8ab6cc 88635 F20101111_AAAPHQ cho_c_Page_074.jpg 5874f98b48ea97521db4c019328bdd29 392aad09dfd58ab13684883625eed176754ae1ed 75297 F20101111_AAAPIF cho_c_Page_090.jpg c9cbabf668cc955e8b1db09ce6de7ae9 c11790cda232ef4bd32b4421390a101aa793ce8f 85675 F20101111_AAAPHR cho_c_Page_075.jpg a3212d6c4f0ac3031e432ce526b6355b 179e6606806a5d8c3fbe7f04775ea3f2a1024283 92880 F20101111_AAAPIG cho_c_Page_091.jpg 20bcef5ed2846d4e4837f1ed20f69c27 
4cc52a1de4e77617b24ea617ae3850a432bfccd7 89265 F20101111_AAAPHS cho_c_Page_076.jpg 8df6a5ca3beb688a3bbc098e3ad51428 4de0220469985e563bbe8477bf2e1a2f4446f70b 76004 F20101111_AAAPIH cho_c_Page_092.jpg 5e9964fbbc20fd5dd3152877e09b3927 5ea92e202fd310f210bcdb1eacf14671d0714eac 82152 F20101111_AAAPHT cho_c_Page_077.jpg 222e09e446fab031278a946998ca1968 feee5414cd8fb59d4a67e613d3dff46985cc0e14 85287 F20101111_AAAPII cho_c_Page_094.jpg 7e2798a41ce8d34b479ea283b77f9a82 7f10273585b501be269dd750516f06dad0d7190a 79532 F20101111_AAAPHU cho_c_Page_078.jpg 3f4cdbccbad4fb505e7d6711ec6f93a8 5afdf6b0fe0b42defc14c4a80f837a636f237123 77395 F20101111_AAAPIJ cho_c_Page_095.jpg aeecc4d33609b8789863af1ebf85f3c6 0b814e92d6c6652f8e34d9ad8d0e6a8cf5505040 65088 F20101111_AAAPHV cho_c_Page_079.jpg d02aae8e11bed97fd87bb2f819854164 1f171bc51c38d6e405dce28efac75db1a3822e45 74744 F20101111_AAAPIK cho_c_Page_097.jpg 22af0adaf83685ce59ece2d5b96f8d34 2d217cb907f33a4624df8b7564251e87b2e7b2f3 70929 F20101111_AAAPHW cho_c_Page_080.jpg 691c06843351504769eb2604abc9acf9 bde4d08834f0bfe1102d53aee2fb10d6eb5f2014 75769 F20101111_AAAPIL cho_c_Page_098.jpg cadaf63c1a160e7c65e6496ffb202544 725cfa4c44d36805bc84bb819fe6590567779d04 85692 F20101111_AAAPHX cho_c_Page_081.jpg d4f54a61edc006b88ae6700550d31196 f42415311a35dec625921e736f3cb1a178779e4d 70390 F20101111_AAAPIM cho_c_Page_099.jpg 15dafdbcf6f2a92328bbd2186ff7d589 8854c534a8b09a44906e069051edf5a92fdc5320 67764 F20101111_AAAPHY cho_c_Page_082.jpg 0eeecb4668256f4bcf4473290b9dd436 d2b27390a181e2b2f397322d6ff72b060b9731ea 93866 F20101111_AAAPJA cho_c_Page_113.jpg 6fef58e4dc6aac37ef125f9ce8896612 493fab52d1915a1d67a1d77859ad6cfc525aaae3 77555 F20101111_AAAPIN cho_c_Page_100.jpg 0584d0a8397bfa332e7400f956cdddfd 2f908d774496b43efd3c0eef1d509e30d6df357f 77641 F20101111_AAAPHZ cho_c_Page_083.jpg fec38b4d46ca60d7e979180ba4c63e7c 31c10099eec0ab211fa4b9c80fab26053fa0902e 104591 F20101111_AAAPJB cho_c_Page_114.jpg f8b53fabc0f5831d42c78b4a853d4313 
7be943eaf8704d257451b8300af9eba6d2e4375c 71483 F20101111_AAAPIO cho_c_Page_101.jpg 29600bc6702f0726df8425a79330e30f 94f312e0dd888e7dbe18abb3c529dee563d126c0 100333 F20101111_AAAPJC cho_c_Page_115.jpg c5d7d52a15aa92800eebe957d7e2a66a 1a36ff164d5f099bdbdc9bcc09b31b786c56b597 79239 F20101111_AAAPIP cho_c_Page_102.jpg f7f58bd9e1329a96199d9974f738daae 905e5f9859f9936ffcc9c868fa4ee654adb22bfd 29081 F20101111_AAAPJD cho_c_Page_119.jpg f10e7f3e8eb48f9b97192ee0ee99a36a 8d3b75fbab980aeb2186e8a8d5dd120e23f0b1ee 83995 F20101111_AAAPIQ cho_c_Page_103.jpg 0e550764ff76201ded5a90d6e1333633 ef2f3516294796089855b8da1309a483d5681f1a 288706 F20101111_AAAPJE cho_c_Page_001.jp2 83a8c7fdc8767f5da5cb36216a7f19bb 8f05dcea74c3538b88e09669c0c26a6955eb3915 73151 F20101111_AAAPIR cho_c_Page_104.jpg 634b7777c0f1e236a85b706e495927d5 e3ff3ae70444ca44494f9646220501c144db0151 26063 F20101111_AAAPJF cho_c_Page_002.jp2 53df46f31d265232b9804e525ee2c82e be263ca610af847d45a949e8f011da7a208e2896 78907 F20101111_AAAPIS cho_c_Page_105.jpg 321e43067535ef8077d03bb9f1252786 2b8056d32550847eeaff5aa6a80cb103c2351863 1051979 F20101111_AAAPJG cho_c_Page_004.jp2 809fdf88332b30dff8d7de8a5219bc83 b29268b638e2a4fce44f28d2cf8d45a9444d043c 61171 F20101111_AAAPIT cho_c_Page_106.jpg 73d778fec23cfcac660b442d5fc43999 88af72df207b1242897795be2c79bb7d03062d73 980045 F20101111_AAAPJH cho_c_Page_005.jp2 55a74c5e0ae31f0bf3b2d6a5d29bd28b 41b573f2b33b0de29fcebab5401ebe1cc8650b84 76643 F20101111_AAAPIU cho_c_Page_107.jpg c1a49fdac22b62c7b9bfa8982f8101da 81075a357f6c4e6b32712c4f2f2d27f5fe81dbf5 1051971 F20101111_AAAPJI cho_c_Page_006.jp2 e2659f3a216d64a927f8cf6b5d53fc49 0b5935e4f585f81d54a7db4535fa290d6254f70f 47960 F20101111_AAAPIV cho_c_Page_108.jpg e21dca6a8ea51b9c02db6d94ba11c104 318726a7e75424bdcc7c951bf9d31a27ff1faa5e 1051976 F20101111_AAAPJJ cho_c_Page_007.jp2 4df5c2b99119a5e4726549daa2ac6549 d7dc29244089c1813a629a5ed11028e50ed0943e 83292 F20101111_AAAPIW cho_c_Page_109.jpg 14f1154e88372977450095715c192a9e 
cee345e803a93bfed536c829a5df9f34f9c1b5d4 20817 F20101111_AAAQAX cho_c_Page_033.QC.jpg 445b0eeba9691e5d70e57aed8ec818dd 83fe244e505a002b3943db722dd8ee161c1717aa 49485 F20101111_AAAPVR cho_c_Page_104.pro c9631f039feaa0d7216b7ff0a4213b08 046917624b7287ef0b08eba0b8c1b0839c3db02b 6188 F20101111_AAAQCA cho_c_Page_082thm.jpg 8c0ded4fe3c11db79c0cca5717a877c5 38de382ac1180e6acf4ef8425c6735963d421a5c 12177 F20101111_AAAQBN cho_c_Page_093.QC.jpg 46498eedb64bfa3d2da7893c45c3afc7 dc0eead44c1ca3daf41ef5995de435b3b06d2df2 84 F20101111_AAAPWH cho_c_Page_002.txt cf8b380a18743b698d02f88af07d756f a28992730ddf838972fcaf7620106ca8df750e80 23185 F20101111_AAAQAY cho_c_Page_090.QC.jpg 9e742f30253463c97919608d52df7678 cae9243c02248bd7ca8a9055634182a24897b0d2 47588 F20101111_AAAPVS cho_c_Page_105.pro ca8de5a9b0f08e31d1123e80531c6c3b 27d14fa14751e98ff62f45e6bf2f1d222966babb 7233 F20101111_AAAQCB cho_c_Page_063thm.jpg 3dc4710803c48f2b9a65f4406e8a83e2 2dd6c467ee8c87f067425439f91157c61f369a82 5829 F20101111_AAAQBO cho_c_Page_034thm.jpg 578b4817203e0e057e35814c9e4d38d9 8e7e340952d968c542aca61a6294a7b59a5cafb6 1003 F20101111_AAAPWI cho_c_Page_003.txt b725eec970bb1395b59ad39d6973a49c 3f60ce8ee30f2548a9e76c8390bc74e63b845b12 24216 F20101111_AAAQAZ cho_c_Page_098.QC.jpg 15772308f062a3b76a83f8b84209a85b b9cea8781335426aff1507ea25afbcac55213ebb 29295 F20101111_AAAPVT cho_c_Page_106.pro 74438bb52961914d35c4d25a32fbedd9 a1fba5724d5ffa5469e0875064adc9bb6a63baf4 23179 F20101111_AAAQCC cho_c_Page_045.QC.jpg 4ff85dc3305a570c9d7e6c9007cdb64a b9bd53bb0d3f324b7486c2312ea9b630a566a78b 6703 F20101111_AAAQBP cho_c_Page_016thm.jpg fc19c02466e1d802c2cdf72ec26c859d bff03b8c57a062a0a4c0e61b5e8e13e696486a9f 3139 F20101111_AAAPWJ cho_c_Page_004.txt 2530429ade385845c1a9de481acddcdb 93b9fd8301047a028b4596b5e0a41c1ea6c485d4 46911 F20101111_AAAPVU cho_c_Page_107.pro 50bd6c15d557f83c98d6f65c97680456 072c4a5a1754737fef56409eb594e79e23aeb8b3 7028 F20101111_AAAQCD cho_c_Page_025thm.jpg 67f86e2cd0cd126a03906ecc4e7ec96d 
7b6cde82933e79d9d76fd30d24a685856a14f468 8316 F20101111_AAAQBQ cho_c_Page_001.QC.jpg 0ea2155d48d1094e1f78f84c30b9a771 2c2bf272515bf140f39e8ae8033631ced40ee163 1175 F20101111_AAAPWK cho_c_Page_005.txt 2d13c558d0ab6565f58b64ec7d1b9dbe ede1f53aed2a71a57cb64426429fed5c04963804 27926 F20101111_AAAPVV cho_c_Page_108.pro 20a29565768239f6ef8aaab9e3fdc619 cd790574bf95873fa1a1d8cffeef3a11bec0b0da 26077 F20101111_AAAQCE cho_c_Page_038.QC.jpg fb0a4f97635c1542591b0190f314d68d b0b410f324b12749ccf07906393b8759ec02f92a 7123 F20101111_AAAQBR cho_c_Page_012thm.jpg be06e02b748319fdefcf9daec79172b7 22e0189fa8a36d4f330af47a75679bfbd8822fb5 51895 F20101111_AAAPVW cho_c_Page_109.pro 696dc61503db85ac4dd07cbf3f73f6f9 eda1cd865bc300e9f36025439b17bf6ec67bad2d 6649 F20101111_AAAQCF cho_c_Page_032thm.jpg 2c87e988b04d6f39d3e1f0c2dcf89a38 9c9c91b5ca11eafb867669e7576ef986433cbb9c 2530 F20101111_AAAPXA cho_c_Page_023.txt fbd632611c2cd6fde34328727b0e7ed1 d0aaa2b83d52d051188327e157ceaaffb77d9d5d 7590 F20101111_AAAQBS cho_c_Page_052thm.jpg f6eb7e3bca2faaebcbb6cbdec2a0078c 4f9e1e6d4f85ed3d29c3cb841c91cad263ceb024 2023 F20101111_AAAPWL cho_c_Page_006.txt 488104026c6e9b2455ca6c5d89332d36 294c90ff9968633758f695b79fef6b08ad646019 52426 F20101111_AAAPVX cho_c_Page_110.pro e35a13ff3569dfd7761943e473f2ff94 4b5170b84c55517885e5976dfa1b22ea0e12ef95 6716 F20101111_AAAQCG cho_c_Page_092thm.jpg 809ed7fe1d1bd45902171bdfb0d52298 547345ec648824a4f906bfee79228b32e5e525b6 352 F20101111_AAAPXB cho_c_Page_024.txt dc02c239d24946f92d7174cdb620548a 9ab1832545d3e2ff1005755917bc82cf755a4a5d 7189 F20101111_AAAQBT cho_c_Page_027thm.jpg 9e952d3b1c4ab01a944512a1b7c061ec 5d01cbd005e3d9bdb3176585ac49e22f9d6c073e 2894 F20101111_AAAPWM cho_c_Page_007.txt a5a1b0ed709f3be6265ae16be9427d96 c46aa11c73711e4116a57c26106e6089e16df928 33396 F20101111_AAAPVY cho_c_Page_112.pro 66b1617881bbc5dbb1ee34cec0e6cc8b 801fa2483a10413818524a92806072663498f112 6403 F20101111_AAAQCH cho_c_Page_087thm.jpg 5eea9ba018f5b1c3c3d3d86798703589 
25bb34280c43a71900f7818f6e0c6bebf4fd064d 2013 F20101111_AAAPXC cho_c_Page_025.txt 79db648841fc00e587182f4585e90807 3749fa857d24121daaaebd64850ae4c35958060b 7355 F20101111_AAAQBU cho_c_Page_111thm.jpg f2ca055660cc22c0a712b1b1534cd0d7 d605a27d7a2c385e32da4fc888e3bc743fa802c9 2827 F20101111_AAAPWN cho_c_Page_008.txt 8b6b947009d6f8ef62dbc3463e7c9351 6ad3a14ca319f5e618aeffadabc861dc95445164 60219 F20101111_AAAPVZ cho_c_Page_113.pro 83847745b15cb80befb07b1661191a2e 66148066cfdd28c69662b714a66da58a99291e37 9949 F20101111_AAAQCI cho_c_Page_005.QC.jpg cbf308159d35f5c8f8c8f6568a00811f 197030f0e1bab418c8d55f108b3a7b9c45c3c38c 1963 F20101111_AAAPXD cho_c_Page_026.txt 339e3e6102f9323049468969f7e83bdd 995700159a97ab9055a309497eacb498b14e682d 2127 F20101111_AAAPWO cho_c_Page_009.txt ff21234c681640d233a2605c3472972c 14094da7e43a3799bcd469e6ffd212a063d1821c 3795 F20101111_AAAQCJ cho_c_Page_024thm.jpg 10c3a3c317c56d7364da9e8c21af0f86 7808a431fe5923123bd3880cda51a3994672fd26 2160 F20101111_AAAPXE cho_c_Page_027.txt 6f9273b1336373b5d6119c7c64dfad2c 47b0a3c42ada722d590205cb2cc73b99496070a6 4443 F20101111_AAAQBV cho_c_Page_009thm.jpg 0e53de3b3c59b945ca42b3197dfcfaf0 9da97d75bdf603d4d1cffbe6f319cf23a98d43ae 2117 F20101111_AAAPWP cho_c_Page_010.txt 1f94308ceebf2e477d5d56fb656acfee d51167ab0a35ce4e1b5abc6b9235d8677b033bbc 20250 F20101111_AAAQCK cho_c_Page_079.QC.jpg b5602999f2422a3827983d6bb283ce7a 04f1c3c61ac32775ea06a1f92aa19aa1e666d02b 1783 F20101111_AAAPXF cho_c_Page_028.txt 693e81dbd69767e88cc35b170d32f1f8 0f2e8d56394396fdd77d05ff2887647097c0b3d0 23681 F20101111_AAAQBW cho_c_Page_030.QC.jpg 8763cb481c544d585e171c1eb1b8e251 cbda1924bc0013e256b6f7e247955a278a3e1fd2 1122 F20101111_AAAPWQ cho_c_Page_011.txt 17332538ee6655d99d51877b30425a35 a06d2ccffb01cc962b225226928f56bd0712cb7f 7164 F20101111_AAAQCL cho_c_Page_110thm.jpg 1c80acf79c026d9c37284be760659762 daebf6b1f778aeb174fa2df5f3505edb4e3de58c 2168 F20101111_AAAPXG cho_c_Page_029.txt c34794d82604ea2bcf232d1e6615312a 
1a9c060823cc4429c06f98d0e303ccbde9a110f8 7196 F20101111_AAAQBX cho_c_Page_031thm.jpg 7cbe21c2f141caedb19c1548a638f990 b6908fc6371a5a5b46c827a6224d5a4aaf4a33e3 F20101111_AAAPWR cho_c_Page_012.txt 04f02f8bd5c7ad12c7113065cfb4cb12 5e9f5f4098dc02982c1c1b1c1f117edfec70e96c 8026 F20101111_AAAQDA cho_c_Page_047thm.jpg 1a9c4652eae8abbdb3f12b9d82c9f732 cda4ad153071ef05435897efa30a9037d3028bfe 28085 F20101111_AAAQCM cho_c_Page_061.QC.jpg 5de75d19b3a582d7f2d8b70727e47647 fcc7f6ca5116784c2e442258f2009f3ccdc738f2 1778 F20101111_AAAPXH cho_c_Page_031.txt 92fbd1aa4b3d510c04dd6f7a199f1c87 76c711a700b0d83d337811d3221a0c5bfd057225 6311 F20101111_AAAQBY cho_c_Page_030thm.jpg 2fc0ac4dadaafc3d33968197dab3b736 bf923575c6d5f11f05cc2ace43be0d9a1cd72afe 2177 F20101111_AAAPWS cho_c_Page_013.txt b19c0fd0c470e237cdafc1c28d247d8e 8dd8ab79edc2a69de1d8beb9aa31ce88da2b4c53 1886 F20101111_AAAQDB cho_c_Page_118thm.jpg 577701f80164645fa6ff6030fddb7e3f 767090905b16639283001b1449a16796ca061bff 28773 F20101111_AAAQCN cho_c_Page_047.QC.jpg bf4ad0a52a0feb972a6b9a144619c6f1 1720719acda6dcb2f1dd303d84301ab4579b6f4f F20101111_AAAPXI cho_c_Page_032.txt f652b37fb33bd6eaeafca521362047a3 4b9d69c5c1757e64f352e663a38626e1e7621505 6767 F20101111_AAAQBZ cho_c_Page_056thm.jpg 668ba33ea2af12f75702ac64d93afcf2 107fdac54da9f346fcd24861073adb115f4e084d 2137 F20101111_AAAPWT cho_c_Page_014.txt 12ba5471964ea063a8e2729bce57b172 650ecc309381b5b163e2677dba10820621d3cf61 17133 F20101111_AAAQDC cho_c_Page_009.QC.jpg 5fb609c0c654ea832d9caddfc13779d3 fa0fbd5693208edb0ac72b81542a6b7b01cb0982 6521 F20101111_AAAQCO cho_c_Page_100thm.jpg 99eeec600cc4f26fa816aec8d918350a b82a4572593d34b33bc5009ad02ab1ab5a2ab188 1922 F20101111_AAAPXJ cho_c_Page_033.txt 9bada3039c8b4256fab2861ae3d51222 95ab50b18b0f7005d86a820f3e99d7de2308bae4 1977 F20101111_AAAPWU cho_c_Page_016.txt e02806c326cbc78cd8e09d611ada2a80 f6c69681c1d596447a77e0c17e5ad64f695b49c9 21311 F20101111_AAAQDD cho_c_Page_041.QC.jpg b40aa2cdb03b3c5ebdb0b3bbdca65a5d 
0af94668d149533cbd0c59965729f49661ebdfba 6529 F20101111_AAAQCP cho_c_Page_104thm.jpg f5d70376fd8bbc9dae2b87c44b85b961 dfab0d63861688dd54d27469610d1810857c3dd4 1536 F20101111_AAAPXK cho_c_Page_034.txt 2b21c9bbfd9baa7c7fb9e7ee4045f88c f8ae90b3943ceea5f3091268ecbf41ebf517c172 2149 F20101111_AAAPWV cho_c_Page_017.txt 540e52a7a4d68301306af969c7beb787 f1913b2d5bacf3ef2a9962e2bf50d9050ee2b927 25637 F20101111_AAAQDE cho_c_Page_110.QC.jpg 9e8b2dd5a3f895c37eecfa45232e6f43 ebdaf275825d63efbd3f3ff7c5b8ccd7d02a4ecf 7454 F20101111_AAAQCQ cho_c_Page_075thm.jpg 1b387240d01ac0f48c3ae46a3abe9595 181987b330961acac0a7f446ce574ba1abea6640 2109 F20101111_AAAPXL cho_c_Page_035.txt 8195d4c8465e25bd5d55b5b1b5f9fde3 a02713f2c028635511ef4db63b6c56badd0d95cd 2028 F20101111_AAAPWW cho_c_Page_018.txt d52474e964339cd241c1876de8b9b6de 71730cb7ef42dae2abbe0967b943b0b793597d63 6879 F20101111_AAAQDF cho_c_Page_113thm.jpg efb37452869dad0587e010f34fa3002d 03ff460b317cd3a0786a3130ebbb4d91c90098b7 25013 F20101111_AAAQCR cho_c_Page_035.QC.jpg 229fe35f891908c155d07e3a31111ea1 98e7f97ec2ee1461c93cdfc23bb528b7c58358b6 2202 F20101111_AAAPWX cho_c_Page_019.txt ac47d742d51cb8f9bf22ab95165101f6 162174d8908f1dbdb0e65710291c424ec3c2b330 6309 F20101111_AAAQDG cho_c_Page_054thm.jpg 00301902f4edbc6d778346cbd8db18bf 885c9e6aa19a2889a3b5c2e7bf2c98bc1c7431ad 1277 F20101111_AAAPYA cho_c_Page_051.txt 136bb1e6ee327f241b0d5101f55e2788 535ab0e5cfa02d6fee5477f6cf49d2bcba43b0af 26211 F20101111_AAAQCS cho_c_Page_055.QC.jpg 4b2be044a57a21f44a849a56048e95ab 4325f94877e2d60c6d8edb50c3beb2c7289272f4 814 F20101111_AAAPXM cho_c_Page_036.txt 44525987109b579ea988016be5c3c515 451898260cea2930228bdde9d5ce23e6927778df 2800 F20101111_AAAPWY cho_c_Page_021.txt cabdd06db7f0d8eb48231e1bd4fc566d ba52c52df4ef55947cb4d194d36854b0c01fd115 23797 F20101111_AAAQDH cho_c_Page_056.QC.jpg 15622ebe8e8c89fe5469f9ae42ce0039 3eb45abf3b63de9c8fa22c63168da16f720145ee 2078 F20101111_AAAPYB cho_c_Page_052.txt 620c188eb11f91ecede72eaf252a2684 
c1d826a25e519a724ef4423943d6eb82d5bdbcea 27581 F20101111_AAAQCT cho_c_Page_081.QC.jpg 896046472d2f01b8dc754ba8be1463cf fa062285c451500d707203f7c2548500204d7ed4 2181 F20101111_AAAPXN cho_c_Page_037.txt b50c78661a3819608848c28029baf068 f0e4fda66c667452281a8b97895b20d87c5e1503 2433 F20101111_AAAPWZ cho_c_Page_022.txt 500139a94dec96a8b9f8ced532c01447 6ee3f914f60ca1ab57efbf92b666747db18a7251 22648 F20101111_AAAQDI cho_c_Page_104.QC.jpg 7ed1cfa693534e31d6dbc4bb0876c0ea 84c010f1697ac8359d308b88f302864a49f4489f 1948 F20101111_AAAPYC cho_c_Page_053.txt 78528d7e4209dcbab4460d86bef5fa9f d37215e2b720d2b98df14152218205ea83b5bd81 3298 F20101111_AAAQCU cho_c_Page_002.QC.jpg ac6000da80665b7156945d997f9f73c4 f4eb1d6c8a828a54f19aa870cacbd1c9ad9a38e7 F20101111_AAAPXO cho_c_Page_038.txt dd96dd4b97dc77eb7f94ce79c0a4d3be 134dfd7bf35b7a29b4c29afcc3e7f1ca3d00d9dc 20325 F20101111_AAAQDJ cho_c_Page_051.QC.jpg 62cd72427e26bd807561410bdec58738 2c1be0b700baf060a2c30082304e04cdcf1347ab 1916 F20101111_AAAPYD cho_c_Page_054.txt 9aef5f2beacb00f819be0cff3b7249ca 2191b9abe88e1f328052b78addf941b92172129b 16065 F20101111_AAAQCV cho_c_Page_036.QC.jpg 8fb8c205ca7c8625c6e7169fc70acd01 cfd56c755a9f50524d73f0d1f913c294f03cd2b4 2241 F20101111_AAAPXP cho_c_Page_039.txt 4afe18af1aa97ccd9cc98f9e9b3ecfe4 233f0baa74a5c29eb9a437da94b3e49a60f03c05 F20101111_AAAQDK cho_c_Page_060thm.jpg 46ebf71485b181d7f131dda39c1dc20f 6fb6618a3c1c5059494455065f2023d0b93ccdd6 2077 F20101111_AAAPYE cho_c_Page_055.txt aded948e7f9e3bf96ceec0894572fa90 79aefb04b25455b3a7b2e729071963894137e269 1996 F20101111_AAAPXQ cho_c_Page_040.txt 92f287c21ac3f8ce375ddcb0de227798 91999e47c2e77a3987589892ad5eac80590e9968 28354 F20101111_AAAQDL cho_c_Page_076.QC.jpg 0057bd02192dc06681aab7968212c963 574b9414ed474543af94a0d4111a8f49967b7442 F20101111_AAAPYF cho_c_Page_056.txt e4ae89458616b13fd91549f864721e77 906153f49d487b7c3b70e29d590df518ec09618a 22198 F20101111_AAAQCW cho_c_Page_099.QC.jpg 711cf10cf8d6dba4e2ed418390a92b4c 
66c537cb0f1eddba7f371c870db40e41652f4eb7 1799 F20101111_AAAPXR cho_c_Page_042.txt c7791d8fe64cd952dcc0dde87d75242b 7e205b6f6ce44e2406d62ce74748f0f52bdbb900 4856 F20101111_AAAQEA cho_c_Page_015.QC.jpg d0365fd768e193720d02a484b628c52f 8f96f34972116062bcfcdb80fb73e518ad13a2fe 2711 F20101111_AAAQDM cho_c_Page_119thm.jpg f9a8e865364f4b69b1267dae38aa92f8 269047c3d1dd3bdf9ac3e7ea4dd712b93c1f3d8d 2182 F20101111_AAAPYG cho_c_Page_057.txt 08728d0ec0c651fa62b9199f02a75e4f 82a6b72e970bcb6c315a784fdacdb376bd942a8d 5555 F20101111_AAAQCX cho_c_Page_004thm.jpg 8f00e3a5a13cdc52b1b6f3db7a0bc782 f3f607672a53221d7803de073dd7c42da66a4f27 2164 F20101111_AAAPXS cho_c_Page_043.txt 4da5d9689006372f42734d3323adf56c 95ca4c52d24714fb0e276dcedcfa9a8de9bd6144 28193 F20101111_AAAQEB cho_c_Page_096.QC.jpg 917f8a4beae1e762045434b81a0d35fe f94b08fdc36e2482c86a5d173153894511fa0610 7470 F20101111_AAAQDN cho_c_Page_029thm.jpg 963249445f40b14a55f82b641b26cb84 d06851c3aba41d311deb3bdcc534976e9dce5c9b 1789 F20101111_AAAPYH cho_c_Page_058.txt 5fdf65297b05efd9fefeece7cdaadf15 b4b9115fcbfc910d11a66e6e71af3358674ee59f 6784 F20101111_AAAQCY cho_c_Page_068thm.jpg 5714e1b42639aee60d8294bdf72fb237 d9a26106c4961c9d666a099148146961cdfbea8f 2529 F20101111_AAAPXT cho_c_Page_044.txt ba7643330a691cb91cdd6b67ee50363d 345ab635d09aa9327497c2bca42b7b18a85a57f7 2953 F20101111_AAAQEC cho_c_Page_005thm.jpg 23311a6bb64ef4727362e4d19f226c8f 896dc4f6b996d025896a9b5dbf2dd390b77586f5 26651 F20101111_AAAQDO cho_c_Page_023.QC.jpg a71074cf7e5838096278810f9475504d 33e9e50ae9dce3e3a7d57c891b7fdfef41a514b6 F20101111_AAAPYI cho_c_Page_060.txt 410e914d14e0af88f70f78fb8fdf0b6d cda00402b4946f4f53fa38b72c286d0ccdc8c66d F20101111_AAAQCZ cho_c_Page_046.QC.jpg b69fab596936b7cb5612ffd9510ae1db f911df4ed65d069b5298ed20920ed16a8bed17e7 10242 F20101111_AAAQED cho_c_Page_073.QC.jpg 79ac93abdf77acc26b26a109a95fb528 dd115b15fc2eb2242558ca33d0fbb6fb716b578f 26965 F20101111_AAAQDP cho_c_Page_012.QC.jpg 18baa65d10d8deb618e66e85fbd57ece 
98d9d3fb4f503be1e142a75261f83ea1e94c071d 1779 F20101111_AAAPYJ cho_c_Page_061.txt e30874ae8021b28a2b7bebe184738754 a38f52f0d1a6d96d710d962cab6d0a2c509757e5 1710 F20101111_AAAPXU cho_c_Page_045.txt c10f80a38f50fafb7483840a09c29b2b 8848aadd1c146d926c374ce7fdc7015e9e5f4869 6950 F20101111_AAAQEE cho_c_Page_010thm.jpg b997dea209a87d0c40dd78a0e24e70ce a3261c521c1ff8bb29340bec1f4433fe3e51eca1 1755 F20101111_AAAQDQ cho_c_Page_015thm.jpg f160e05105814e2b5ab6012e92296f47 61b4202248cc8084344d2638f2276dc1f2df2075 1226 F20101111_AAAPYK cho_c_Page_062.txt 8a1891401e089c7f52a9f1e8af941d05 190b9e8e14625a10bb665ab914f7b972503129a7 2673 F20101111_AAAPXV cho_c_Page_046.txt 65eda77caf37573db92000f3e9537fc7 2fc097fea6b406f45888e0f090b97ddce4ef91d5 18057 F20101111_AAAQEF cho_c_Page_112.QC.jpg 8053991201a0ddef83e0afad190ae368 6f7f68ddb82144c392b54cb4bda497bf5692df31 28017 F20101111_AAAQDR cho_c_Page_074.QC.jpg a0d09b0c947046e6e20781a65c9ab46f 28150191c3bd374dbd6a58e235a3145a2ef73e1d 2150 F20101111_AAAPYL cho_c_Page_063.txt e2edf0dc79dbd195cd6a05a96d8d83b8 b83be433789478c47e83432ed838cf15183d2ef5 3972 F20101111_AAAPXW cho_c_Page_047.txt 290d7dae3524ef250885c469e34d1c51 33a2722782fa8d82b92a645454ac070bbdbe3a56 25422 F20101111_AAAQEG cho_c_Page_060.QC.jpg 864cdcded02b3dfa385126fa186ddd97 afc5b52e9ad2c1f79800f04b79d4cd1951480ffe 2110 F20101111_AAAPZA cho_c_Page_078.txt d9413d2306bf516ac8e55118db79ac11 49b02f82704a7c0511bcd88fbd18d1f46641994f 5775 F20101111_AAAQDS cho_c_Page_079thm.jpg eea4dfe134201ae5f7d8543015f11a74 20fc28b8aa3f6f76e6324edcaf0481dfbee50fd9 2116 F20101111_AAAPYM cho_c_Page_064.txt 8bab50bd0b304a19aaaa25681500eae8 a4126a30c2f798849bc12a4e086cddc7e0660f63 2192 F20101111_AAAPXX cho_c_Page_048.txt 8e49277c7822138eaf684bbf3518257c e351f46b15fbc6d43137b218a28bf16832620e38 6946 F20101111_AAAQEH cho_c_Page_072thm.jpg 21abf7f99c8291ccd67319a974be4baa ef085ec5a4703d12f3137b1759bfc3f1dbce7e12 1602 F20101111_AAAPZB cho_c_Page_079.txt b6d36f8baa50b524545bdd0b7de458c0 
216e633b93a3d0043583894df0fc42fd5b0e7fda 27825 F20101111_AAAQDT cho_c_Page_066.QC.jpg c5c35888c4d5499609d54f27fb2f63c2 621243dcf8bc4685386c5131de4332c5fe23ccc8 2018 F20101111_AAAPXY cho_c_Page_049.txt 5605c91dae01a9ef4aa2e82ef7c83ae7 fbdfef5b693099bde77aef89b4ab510c944a7c13 22536 F20101111_AAAQEI cho_c_Page_070.QC.jpg b45cc5171bf816479c64a10bddc2a760 209fbaf645ad524c5b2a9aa97ec7393250dd107c 1992 F20101111_AAAPZC cho_c_Page_080.txt 725b024c554f1035ed46727f2eab4fe6 f7dc691d0c6e86a1fe5303a2f7b93a127ac139d4 21904 F20101111_AAAQDU cho_c_Page_043.QC.jpg f664a8f9f5e7afa0044fbc81b6fe752f 0c112213132b5b1b50f90f74e18d5b2059c21bb0 1796 F20101111_AAAPYN cho_c_Page_065.txt 98bff39fa83822405cda294a18ee7007 cc4db21651ca8f981dfc9dd8881ce290f7702ffd 2185 F20101111_AAAPXZ cho_c_Page_050.txt 40f7a0dfac07195860c3589c147b4236 e713a5c18bc2b50700368e8ae3c7219a0cf60a08 27191 F20101111_AAAQEJ cho_c_Page_037.QC.jpg 5ed43f4a415259b7c7c952608213833f d4bc714210bf77bc635f54fd13f0123ff780462f 2159 F20101111_AAAPZD cho_c_Page_081.txt 1bec9dad57c8f045a070015a4f608f67 0e7e05e3b2222bdfbd7c8a3a548efc818dbb3d67 22225 F20101111_AAAQDV cho_c_Page_008.QC.jpg 6a4e11ea8ab997d8b79d30739d91a358 b21dfcd67320465df8f0dbb507f08922b4ce8f99 F20101111_AAAPYO cho_c_Page_066.txt 7795dbdf7a1e127744cdfef4311d4469 48487bbc6302e36acad3c91f77937645cde0ee0f 7301 F20101111_AAAQEK cho_c_Page_023thm.jpg 9880f59125d20c9c9ed2b3ca0af572a1 b48bd28ac654acc36bf1ad7fb90297ebb1c2b19d 1828 F20101111_AAAPZE cho_c_Page_082.txt 456738b1afafa5cf5469ab1db6eced9e 44dc80b31b4541415381c6479a6707abaeb6aa15 7163 F20101111_AAAQDW cho_c_Page_109thm.jpg ff36f5eda2f66c9a87745406c1b33083 f4385d1390ea3316deb2ae8d948267b09f95d2e7 2029 F20101111_AAAPYP cho_c_Page_067.txt a9d6e5e3eacd9b59f057323378af3d2a 9412cd5b04277a8ed64193ea62dd0710d2889304 