## Citation

- Permanent Link:
- https://ufdc.ufl.edu/UFE0021941/00001
## Material Information
- Title:
- Accurate, Scalable, and Informative Modeling and Analysis of Complex Workloads and Large-Scale Microprocessor Architectures
- Creator:
- Cho, Chang
- Place of Publication:
- [Gainesville, Fla.]
- Publisher:
- University of Florida
- Publication Date:
- 2008
- Language:
- english
- Physical Description:
- 1 online resource (119 p.)
## Thesis/Dissertation Information
- Degree:
- Doctorate ( Ph.D.)
- Degree Grantor:
- University of Florida
- Degree Disciplines:
- Electrical and Computer Engineering
- Committee Chair:
- Li, Tao
- Committee Members:
- Figueiredo, Renato J.
- Bashirullah, Rizwan
- Mishra, Prabhat
- Graduation Date:
- 12/19/2008
## Subjects
- Subjects / Keywords:
- Architectural design ( jstor )
- Architectural models ( jstor )
- Computer architecture ( jstor )
- Modeling ( jstor )
- Neural networks ( jstor )
- Predictive modeling ( jstor )
- Simulations ( jstor )
- Statistics ( jstor )
- Wavelet analysis ( jstor )
- Workloads ( jstor )
- Electrical and Computer Engineering -- Dissertations, Academic -- UF
- Genre:
- Electronic Thesis or Dissertation
- born-digital ( sobekcm )
- Electrical and Computer Engineering thesis, Ph.D.
## Notes
- Abstract:
- Modeling and analyzing how workload and architecture interact are at the foundation of computer architecture research and practical design. As contemporary microprocessors become increasingly complex, many challenges related to the design, evaluation and optimization of their architectures crucially rely on exploiting workload characteristics. While conventional workload characterization methods measure aggregated workload behavior and state-of-the-art tools can detect time-varying program patterns and cluster them into different phases, existing techniques generally lack the capability of gaining insightful knowledge of the complex interaction between software and hardware, a necessary first step toward designing cost-effective computer architectures. This limitation will only be exacerbated by the rapid growth of software functionality and runtime, and of hardware design complexity and integration scale. For instance, while large real-world applications manifest drastically different behavior across a wide spectrum of their runtime, existing methods only focus on analyzing workload characteristics using a single time scale. Conventional architecture modeling techniques assume a centralized and monolithic hardware substrate. This assumption, however, will not hold, since the design trends of multi-/many-core processors will result in large-scale and distributed microarchitectures. Global and cooperative resource management for large-scale many-core processors requires obtaining workload characteristics across a large number of distributed hardware components (cores, cache banks, interconnect links, etc.) at different levels of abstraction. Therefore, there is a pressing need for novel and efficient approaches to model and analyze workloads and architectures with rapidly increasing complexity and integration scale.
We aim to develop computationally efficient methods and models which allow architects and designers to rapidly yet informatively explore the large performance, power, reliability and thermal design space of uni-/multi-core architectures. Our models achieve several orders of magnitude speedup compared to simulation-based methods. Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models have the capability of capturing complex workload behavior and can be used to forecast workload dynamics during performance, power, reliability and thermal design space exploration. ( en )
- General Note:
- In the series University of Florida Digital Collections.
- General Note:
- Includes vita.
- Bibliography:
- Includes bibliographical references.
- Source of Description:
- Description based on online resource; title from PDF title page.
- Source of Description:
- This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
- Thesis:
- Thesis (Ph.D.)--University of Florida, 2008.
- Local:
- Adviser: Li, Tao.
- Statement of Responsibility:
- by Chang Cho.
## Record Information
- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- Copyright Cho, Chang. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
- Classification:
- LD1780 2008 ( lcc )
## Full Text

but they are classified as the same phase. In cluster 48, program execution complexity varies widely; however, Simpoint classifies them as a single phase. The results (Figure 3-4) suggest that program execution intervals classified as the same phase by Simpoint can still exhibit widely varied behavior in their dynamics.

### Complexity-aware Phase Classification

To enhance the capability of current methods in characterizing program dynamics, we propose complexity-aware phase classification. Our method uses the multiresolution property of wavelet transforms to identify and classify the changing of program code execution across different scales. We assume a baseline phase analysis technique that uses basic block vectors (BBV) [10]. A basic block is a section of code that is executed from start to finish with one entry and one exit. A BBV represents the code blocks executed during a given interval of execution. To represent program dynamics at different time scales, we create a set of basic block vectors for each interval at different resolutions. For example, at the coarsest level (scale = 10), a program execution interval is represented by one BBV. At the most detailed level, the same program execution interval is represented by 1024 BBVs from 1024 consecutively subdivided intervals (Figure 3-5). To reduce the amount of data that needs to be processed, we use random projection to reduce the dimensionality of all BBVs to 15, as suggested in [1].

Figure 3-5 BBVs with different resolutions
- Resolution = 2^0: {BBV_{1,1}}
- Resolution = 2^1: {BBV_{2,1}, BBV_{2,2}}
- Resolution = 2^2: {BBV_{4,1}, BBV_{4,2}, BBV_{4,3}, BBV_{4,4}}
- Resolution = 2^3: {BBV_{8,1}, BBV_{8,2}, ..., BBV_{8,7}, BBV_{8,8}}
- Resolution = 2^n: {BBV_{2^n,1}, BBV_{2^n,2}, ..., BBV_{2^n,2^n-1}, BBV_{2^n,2^n}}

## LIST OF TABLES
- 3-1 Baseline machine configuration
- 3-2 A classification of benchmarks based on their complexity
- 4-1 Baseline machine configuration
- 4-2 Efficiency of different hybrid wavelet signatures in phase classification
- 5-1 Simulated machine configuration
- 5-2 Microarchitectural parameter ranges used for generating train/test data
- 6-1 Simulated machine configuration (baseline)
- 6-2 The considered architecture design parameters and their ranges
- 6-3 Multi-programmed workloads
- 6-4 Error comparison of predicting raw vs. 2D DWT cache banks
- 6-5 Design space evaluation speedup (simulation vs. prediction)
- 7-1 Architecture configuration for different issue width
- 7-2 Simulation configurations
- 7-3 Design space parameters
- 7-4 Simulation time vs. prediction time

## LIST OF FIGURES
- 2-1 Example of Haar wavelet transform
- 2-2 Comparison execution characteristics of time and wavelet domain
- 2-3 Sampled time domain program behavior
- 2-4 Reconstructing the workload dynamic behaviors
- 2-5 Variation of wavelet coefficients
- 2-6 2D wavelet transforms on 4 data points
- 2-7 2D wavelet transforms on 16 cores/hardware components
- 2-8 Example of applying 2D DWT on a non-uniformly accessed cache
- 3-1 XCOR vectors for each program execution interval
- 3-2 Dynamic complexity profile of benchmark gcc
- 3-3 XCOR value distributions
- 3-4 XCORs in the same phase by the Simpoint
- 3-5 BBVs with different resolutions
- 3-6 Multiresolution analysis of the projected BBVs
- 3-7 Weighted COV calculation
- 3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics
- 3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics
- 4-1 Phase analysis methods time domain vs. wavelet domain
- 4-2 Phase classification accuracy: time domain vs. wavelet domain
- 4-3 Phase classification using hybrid wavelet coefficients
- 4-4 Phase classification accuracy of using 16 x 1 hybrid scheme
- 4-5 Different methods to handle counter overflows
- 4-6 Impact of counter overflows on phase analysis accuracy

neural networks were used to predict 16 2D wavelet coefficients which efficiently capture workload thermal spatial characteristics. As can be seen, our predictive models achieve a speedup ranging from 285 (MEM1) to 5339 (CPU2), making them suitable for rapidly exploring large thermal design space.

Table 7-4 Simulation time vs. prediction time

| Workload | Simulation (sec) [best : worst] | Prediction (sec) | Speedup (Sim./Pred.) [best : worst] |
|---|---|---|---|
| CPU1 | 362 : 6,091 | 1.23 | 294 : 4,952 |
| CPU2 | 366 : 6,567 | 1.23 | 298 : 5,339 |
| CPU3 | 365 : 6,218 | 1.23 | 297 : 5,055 |
| MEM1 | 351 : 5,890 | 1.23 | 285 : 4,789 |
| MEM2 | 355 : 6,343 | 1.23 | 289 : 5,157 |
| MEM3 | 367 : 5,997 | 1.23 | 298 : 4,876 |
| MIX1 | 352 : 5,944 | 1.23 | 286 : 4,833 |
| MIX2 | 365 : 6,091 | 1.23 | 297 : 4,952 |
| MIX3 | 360 : 6,024 | 1.23 | 293 : 4,898 |

### Prediction Accuracy

The prediction accuracy measure is the mean error (ME), defined as follows:

$$ME = \frac{1}{N}\sum_{k=1}^{N}\frac{\left|\hat{x}(k) - x(k)\right|}{x(k)} \qquad (7\text{-}1)$$

where $x(k)$ is the actual value generated by the Hotspot thermal model, $\hat{x}(k)$ is the predicted value and $N$ is the total number of samples (a set of 64 x 64 temperature samples per layer). As prediction accuracy increases, the ME becomes smaller. We present boxplots to observe the average prediction errors and their deviations for the 50 test configurations against Hotspot simulation results. Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between "hinges" which are approximately the first and third quartiles of the ME values.
Thus, about 50% of the data are located within the box and its height is equal to the interquartile range.

The design space but keeps the space small enough to maintain the low model building cost. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling strategy since it provides better coverage compared to a naive random sampling scheme. We generate multiple LHS matrices and use a space-filling metric called L2-star discrepancy [40]. The L2-star discrepancy is applied to each LHS matrix to find the representative design space that has the lowest value of L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this work, we used 200 training and 50 test data points to reach a high accuracy for thermal behavior prediction, since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. In our study, the thermal characteristics across each die are represented by 64 x 64 samples.

### Experimental Results

In this section, we present detailed experimental results using 2D wavelet neural networks to forecast thermal behaviors of large scale 3D multi-core structures running various CPU/MIX/MEM workloads without using detailed simulation.

### Simulation Time vs. Prediction Time

To evaluate the effectiveness of our thermal prediction models, we compute the speedup metric (defined as simulation time vs. prediction time) across all experimented workloads (shown in Table 7-4). To calculate simulation time, we measured the time that the Hotspot simulator takes to obtain steady thermal characteristics on a given design configuration. As can be seen, the Hotspot tool simulation time varies with design configurations. We report both the shortest (best) and longest (worst) simulation time in Table 7-4.
The prediction time, which includes the time for the neural networks to predict the targeted thermal behavior, remains constant for all studied cases. In our experiment, a total number of 16

errors during program execution. Table 5-1 summarizes the baseline machine configurations of our simulator.

Table 5-1 Simulated machine configuration

| Parameter | Configuration |
|---|---|
| Processor Width | 8-wide fetch/issue/commit |
| Issue Queue | 96 |
| ITLB | 128 entries, 4-way, 200 cycle miss |
| Branch Predictor | 2K entries Gshare, 10-bit global history |
| BTB | 2K entries, 4-way |
| Return Address Stack | 32 entries RAS |
| L1 Instruction Cache | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access |
| ROB Size | 96 entries |
| Load/Store Queue | 48 entries |
| Integer ALU | 8 I-ALU, 4 I-MUL/DIV, 4 Load/Store |
| FP ALU | 8 FP-ALU, 4 FP-MUL/DIV/SQRT |
| DTLB | 256 entries, 4-way, 200 cycle miss |
| L1 Data Cache | 64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle access |
| L2 Cache | unified 2MB, 4-way, 128 Byte/line, 12 cycle access |
| Memory Access | 64 bit wide, 200 cycles access latency |

We perform our analysis using twelve SPEC CPU 2000 benchmarks: bzip2, crafty, eon, gap, gcc, mcf, parser, perlbmk, twolf, swim, vortex and vpr. We use the Simpoint tool to pick the most representative simulation point for each benchmark (with full reference input set) and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. Each simulation contains 200M instructions. In this chapter, we consider a design space that consists of 9 microarchitectural parameters (see Table 5-2) of the superscalar architecture. These microarchitectural parameters have been shown to have the largest impact on processor performance [21]. The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using the detailed, cycle-accurate simulations, we measure processor performance, power and reliability characteristics on all design points within both training and testing data sets.
We build a separate model for each program and use the model to predict workload dynamics in performance, power and reliability domains at

We studied the CMP NUCA designs using various multi-programmed and multi-threaded workloads (listed in Table 6-3).

Table 6-3 Multi-programmed workloads

| Multi-programmed Workloads | Group | Description |
|---|---|---|
| Homogeneous | Group1 | gcc (8 copies) |
| Homogeneous | Group2 | mcf (8 copies) |
| Heterogeneous | Group1 (CPU) | gap, bzip2, equake, gcc, mesa, perlbmk, parser, ammp |
| Heterogeneous | Group2 (MIX) | perlbmk, mcf, bzip2, vpr, mesa, art, gcc, equake |
| Heterogeneous | Group3 (MEM) | mcf, twolf, art, ammp, equake, mcf, art, mesa |

| Multithreaded Workloads (Splash2) | Data Set |
|---|---|
| barnes | 16k particles |
| fmm | input.16348 |
| ocean-co | 514x514 ocean body |
| ocean-nc | 258x258 ocean body |
| water-ns | 512 molecules |
| cholesky | tk15.0 |
| fft | 65,536 complex data points |
| radix | 256k keys, 1024 radix |

Our heterogeneous multi-programmed workloads consist of a mix of programs from the SPEC 2000 benchmarks with full reference input sets. The homogeneous multi-programmed workloads consist of multiple copies of an identical SPEC 2000 program. For multi-programmed workload simulations, we perform fast-forwards until all benchmarks pass initialization phases. For multithreaded workloads, we used 8 benchmarks from the SPLASH-2 suite [53] and mark an initialization phase in the software code and skip it in our simulations. In all simulations, we first warm up the cache model. After that, each simulation runs 500 million instructions or to benchmark completion, whichever is less. Using detailed simulation, we obtain the 2D architecture characteristics of large scale NUCA at all design points within both training and

### Case Study 2: 2D Thermal Hot-Spot Prediction

Thermal issues are becoming a first order design parameter for large-scale CMP architectures. High operational temperatures and hotspots can limit performance and manufacturability. We use the HotSpot [54] thermal model to obtain the temperature variation across 256 NUCA banks.
We then build analytical models using the proposed methods to forecast 2D thermal behavior of large NUCA cache with different configurations. Our predictive model can help designers insightfully predict the potential thermal hotspots and assess the severity of thermal emergencies. Figure 6-10 shows the simulated thermal profile and predicted thermal behavior on different workloads. The temperatures are normalized to a value between the maximal and minimal value across the NUCA chip. As can be seen, the 2D thermal predictive models can accurately and informatively forecast the size and the location of thermal hotspots.

Figure 6-10 2D NUCA thermal profile (simulation vs. prediction). Panels: (a) Ocean-NC, (b) gccx8, (c) MEM, (d) Radix.

Figure 6-4 ME boxplots of prediction accuracies with different number of wavelet coefficients. Panels: (a) 16, (b) 32, (c) 64 and (d) 128 wavelet coefficients.

Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data, and it shows the center of the distribution for the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles.
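The quantities summarized by such a boxplot can be sketched as follows. This is a minimal illustration, not the evaluation code used in this work: it assumes ME is the mean relative absolute error between actual and predicted samples, and the function names `mean_error` and `box_summary` as well as the sample values are hypothetical.

```python
from statistics import median, quantiles


def mean_error(actual, predicted):
    """Mean relative error between actual and predicted samples (assumed ME form)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)


def box_summary(errors):
    """Location (median) and dispersion (interquartile range) of a set of ME values."""
    q1, _, q3 = quantiles(errors, n=4, method="inclusive")
    return {"q1": q1, "median": median(errors), "q3": q3, "iqr": q3 - q1}


# Hypothetical ME values (in percent) for five test configurations.
summary = box_summary([5.0, 6.0, 7.0, 8.0, 9.0])
print(summary)
```

The box spans `q1` to `q3`, so its height equals `iqr`, and the interior line sits at `median`.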
Figure 6-4 (a) shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 5.2 percent (fft) to 9.3 percent (ocean.co) with an overall median error of 6.6 percent across all experimented workloads. As can be seen, the maximum error at any design point for any benchmark is 13%, and most benchmarks show an error less than 10%. This indicates that our proposed neuro-wavelet scheme can forecast the 2D spatial workload

### Design Parameters

In this study, we consider a design space that consists of 23 parameters (see Table 7-3) spanning from floor-planning to packaging technologies.

Table 7-3 Design space parameters

| Group | Parameter | Key | Low to High (or allowed values) |
|---|---|---|---|
| 3D Configurations: Layer0 | Thickness (m) | ly0_th | 5e-5 to 3e-4 |
| 3D Configurations: Layer0 | Floorplan | ly0_fl | Flp 1/2/3/4 |
| 3D Configurations: Layer0 | Bench | ly0_bench | CPU/MEM/MIX |
| 3D Configurations: Layer1 | Thickness (m) | ly1_th | 5e-5 to 3e-4 |
| 3D Configurations: Layer1 | Floorplan | ly1_fl | Flp 1/2/3/4 |
| 3D Configurations: Layer1 | Bench | ly1_bench | CPU/MEM/MIX |
| 3D Configurations: Layer2 | Thickness (m) | ly2_th | 5e-5 to 3e-4 |
| 3D Configurations: Layer2 | Floorplan | ly2_fl | Flp 1/2/3/4 |
| 3D Configurations: Layer2 | Bench | ly2_bench | CPU/MEM/MIX |
| 3D Configurations: Layer3 | Thickness (m) | ly3_th | 5e-5 to 3e-4 |
| 3D Configurations: Layer3 | Floorplan | ly3_fl | Flp 1/2/3/4 |
| 3D Configurations: Layer3 | Bench | ly3_bench | CPU/MEM/MIX |
| 3D Configurations: TIM (Thermal Interface Material) | Heat Capacity (J/m^3K) | TIM_cap | 2e6 to 4e6 |
| 3D Configurations: TIM (Thermal Interface Material) | Resistivity (m K/W) | TIM_res | 2e-3 to 5e-2 |
| 3D Configurations: TIM (Thermal Interface Material) | Thickness (m) | TIM_th | 2e-5 to 75e-6 |
| General Configurations: Heat sink | Convection capacity (J/K) | HS_cap | 140.4 to 1698 |
| General Configurations: Heat sink | Convection resistance (K/W) | HS_res | 0.1 to 0.5 |
| General Configurations: Heat sink | Side (m) | HS_side | 0.045 to 0.08 |
| General Configurations: Heat sink | Thickness (m) | HS_th | 0.02 to 0.08 |
| General Configurations: Heat spreader | Side (m) | HP_side | 0.025 to 0.045 |
| General Configurations: Heat spreader | Thickness (m) | HP_th | 5e-4 to 5e-3 |
| General Configurations: Others | Ambient temperature (K) | Am_temp | 293.15 to 323.15 |
| Archi. | Issue width | Issue width | 2 or 4 or 8 |

These design parameters have been shown to have a large impact on processor thermal behavior. The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed cycle-accurate simulations, we measure processor power and thermal characteristics on all design points within both training and testing data sets.
We build a separate model for each benchmark domain and use the model to predict thermal behavior at unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of the model's accuracy is obtained by using the design points in the testing data set. To train an accurate and prompt neural network prediction model, one needs to ensure that the sample data sets disperse points throughout the

interactions. However, they are usually inadequate for modeling the non-linear dynamics of real-world workloads, which exhibit widely different characteristics and complexity. Of the non-linear methods, neural network models can accurately predict the aggregated program statistics (e.g. CPI of the entire workload execution). Such models are termed global models, as only one model is used to characterize the measured programs. The monolithic global models are incapable of capturing and revealing program dynamics which contain interesting fine-grain behavior. On the other hand, a workload may produce different dynamics when the underlying architecture configurations have changed. Therefore, new methods are needed for accurately predicting complex workload dynamics.

To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques, which can produce a good local representation of the workload behavior in both time and frequency domains. The proposed analytical models, which combine wavelet-based multiscale data representation and neural network based regression prediction, can efficiently reason about program dynamics without resorting to detailed simulations. With our schemes, the complex workload dynamics are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network.
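The decomposition step can be illustrated with a short sketch. It assumes an unnormalized Haar wavelet (pairwise average and half-difference) and a trace whose length is a power of two; the neural network regression itself is not shown, and the helper names `haar_decompose`, `haar_reconstruct` and `keep_largest` are hypothetical rather than taken from this work.

```python
def haar_decompose(signal):
    """Return [overall average, coarsest detail, ..., finest details]."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    details = []
    current = [float(v) for v in signal]
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        level = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = level + details  # coarser levels end up in front
        current = averages
    return current + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose."""
    current = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(current)]
        pos += len(current)
        expanded = []
        for avg, det in zip(current, level):
            expanded.extend([avg + det, avg - det])
        current = expanded
    return current


def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


coeffs = haar_decompose([4, 2, 6, 8])
approx = haar_reconstruct(keep_largest(coeffs, 2))
print(coeffs, approx)
```

In a scheme of this kind, each retained coefficient would be the regression target of its own model, and a predicted trace would be obtained by feeding the predicted coefficients to the inverse transform.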
We extensively evaluate the efficiency of using wavelet neural networks for predicting the dynamics that the SPEC CPU 2000 benchmarks manifest on high performance microprocessors with a microarchitecture design space that consists of 9 key parameters. Our results show that the models achieve high accuracy in forecasting workload dynamics across a large microarchitecture design space.

To understand the impact of counter overflow on phase analysis accuracy, we use 16 accumulative counters to record the 16-dimension workload characteristic. The values of the 16 accumulative counters are then used as a signature to perform phase classification. We gradually reduce the number of bits in the accumulative counters. As a result, counter overflows start to occur. We use two schemes to handle a counter overflow. In our first method, a counter saturates at its maximum value once it overflows. In our second method, the counter is reset to zero after an overflow occurs. After all counter overflows are handled, we then use the 16-dimension accumulative counter values to perform phase analysis and calculate the COVs. Figure 4-5 (a) describes the above procedure.

Figure 4-5 Different methods to handle counter overflows. Panels: (a) n-bit accumulative counter, (b) n-bit sampling counter.

Our counter overflow analysis results are shown in Figure 4-6. Figure 4-6 also shows the counter overflow rate (e.g. percentage of the overflowed counters) when counters with different sizes are used to collect workload statistics within program execution intervals.
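The two overflow-handling schemes and the overflow rate can be sketched as below. This is an illustrative model only: the names `accumulate` and `overflow_rate` are hypothetical, and the sketch assumes that a reset discards the overflowing amount and that accumulation then continues from zero.

```python
def accumulate(increments, bits, policy):
    """Accumulate increments in an n-bit counter.

    policy is "saturate" (stick at the maximum value) or "reset" (restart from zero).
    Returns (final counter value, whether any overflow occurred).
    """
    if policy not in ("saturate", "reset"):
        raise ValueError("policy must be 'saturate' or 'reset'")
    max_value = (1 << bits) - 1
    value = 0
    overflowed = False
    for inc in increments:
        value += inc
        if value > max_value:
            overflowed = True
            # Assumption: on reset, the overflowing amount is discarded.
            value = max_value if policy == "saturate" else 0
    return value, overflowed


def overflow_rate(per_counter_increments, bits):
    """Fraction of counters that overflow at the given width."""
    flags = [accumulate(incs, bits, "saturate")[1] for incs in per_counter_increments]
    return sum(flags) / len(flags)


print(accumulate([5, 5, 5, 5], 4, "saturate"))
print(accumulate([5, 5, 5, 5, 3], 4, "reset"))
```

Shrinking `bits` while holding the increments fixed raises `overflow_rate`, which is the effect the experiment varies.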
For example, on the benchmark crafty, when the number of bits used in counters is reduced to 20, 100% of the counters overflow. For the purpose of clarity, we only show a region within which the counter overflow rate is greater than zero and less than or equal to one. Since each program has different execution time, the region varies from one program to another. As can be seen, counter overflows have negative impact on phase classification accuracy. In general, COVs increase with

### 2D Wavelet Transform

To effectively capture the two-dimensional spatial characteristics across large-scale multi-core architecture substrates, we also use the 2D wavelet analysis. With 1D wavelet analysis that uses Haar wavelet filters, each adjacent pair of data in a discrete interval is replaced with its average and difference.

Figure 2-6 2D wavelet transforms on 4 data points. Panels: Original, Average, Detailed (D-horizontal), Detailed (D-vertical), Detailed (D-diagonal).

A similar concept can be applied to obtain a 2D wavelet transform of data in a discrete plane. As shown in Figure 2-6, each adjacent four points in a discrete 2D plane can be replaced by their averaged value and three detailed values. The detailed values (D-horizontal, D-vertical, and D-diagonal) correspond to the average of the difference of: 1) the summation of the rows, 2) the summation of the columns, and 3) the summation of the diagonals. To obtain wavelet coefficients for 2D data, we apply a 1D wavelet transform to the data along the X-axis first, resulting in low-pass and high-pass signals (average and difference). Next, we apply 1D wavelet transforms to both signals along the Y-axis, generating one averaged and three detailed signals. Consequently, a 2D wavelet decomposition is obtained by recursively repeating this procedure on the averaged signal. Figure 2-7 (a) illustrates the procedure.
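The X-then-Y procedure described above can be sketched in a few lines. This is a minimal illustration assuming an unnormalized Haar filter (average and half-difference) and a square grid whose side is a power of two; the function names are hypothetical.

```python
def haar_1d_step(values):
    """Replace each adjacent pair with its average and its half-difference."""
    pairs = range(0, len(values), 2)
    averages = [(values[i] + values[i + 1]) / 2 for i in pairs]
    details = [(values[i] - values[i + 1]) / 2 for i in pairs]
    return averages, details


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def along_y(block):
    """Apply one 1D Haar step down every column of block."""
    lows, highs = [], []
    for column in transpose(block):
        avg, det = haar_1d_step(column)
        lows.append(avg)
        highs.append(det)
    return transpose(lows), transpose(highs)


def haar_2d_step(grid):
    """One 2D level: returns (average, D-horizontal, D-vertical, D-diagonal)."""
    low_x, high_x = [], []
    for row in grid:  # 1D transform along the X-axis first
        avg, det = haar_1d_step(row)
        low_x.append(avg)
        high_x.append(det)
    average, d_horizontal = along_y(low_x)   # row-sum difference
    d_vertical, d_diagonal = along_y(high_x)  # column-sum and diagonal-sum differences
    return average, d_horizontal, d_vertical, d_diagonal


def haar_2d(grid, levels):
    """Recursively repeat the 2D step on the averaged signal."""
    details = []
    average = grid
    for _ in range(levels):
        average, dh, dv, dd = haar_2d_step(average)
        details.append((dh, dv, dd))
    return average, details


step = haar_2d_step([[1, 2], [3, 4]])
top, detail_levels = haar_2d([[4 * i + j for j in range(4)] for i in range(4)], 2)
print(step, top)
```

For the 2 x 2 input the four outputs equal the mean of the four points and one quarter of the row-sum, column-sum and diagonal-sum differences, matching the description of Figure 2-6.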
As can be seen, the 2D wavelet decomposition can be represented by a tree-based structure. The root node of the tree contains the original data (row-major) of the mesh of values (for example, performance or temperatures of the four adjacent cores, network-on-chip links, cache banks etc.). First, we apply 1D wavelet transforms along the X-axis, i.e. for each two points along the X-axis

Figure 4-2 Phase classification accuracy: time domain vs. wavelet domain (one panel per benchmark, each comparing Time Domain and Wavelet Domain).

By transforming program runtime statistics into the wavelet domain, workload behavior can be represented by a series of wavelet coefficients which are much more compact and efficient than its counterpart in the time domain. The wavelet transform significantly reduces temporal dependence and therefore simple models which are insufficient in the time domain become quite accurate in the wavelet domain.

## CHAPTER 6 ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTI-CORE ARCHITECTURES

Early design space exploration is an essential ingredient in modern processor development. It significantly reduces the time to market and post-silicon surprises. The trend toward multi-/many-core processors will result in sophisticated large-scale architecture substrates (e.g.
non-uniformly accessed cache [43] interconnected by network-on-chip [44]) with self-contained hardware components (e.g. cache banks, routers and interconnect links) proximate to the individual cores but globally distributed across all cores. As the number of cores on a processor increases, these large and sophisticated multi-core-oriented architectures exhibit increasingly complex and heterogeneous characteristics. As an example, to alleviate the deleterious impact of wire delays, architects have proposed splitting up large L2/L3 caches into multiple banks, with each bank having different access latency depending on its physical proximity to the cores. Figure 6-1 illustrates normalized cache hits (results are plotted as color maps) across the 256 cache banks of a non-uniform cache architecture (NUCA) [43] design on an 8-core chip multiprocessor (CMP) running the SPLASH-2 Ocean-c workload. The 2D architecture spatial patterns yielded on NUCA with different architecture design parameters are shown.

Figure 6-1 Variation of cache hits across a 256-bank non-uniform access cache on 8-core

As can be seen, there is a significant variation in cache access frequency across individual cache banks. At larger scales, the manifested 2-dimensional spatial characteristics across the

### Experimental setup

We performed our analysis using ten SPEC CPU 2000 benchmarks: crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf and vortex. All programs were run with reference input to completion. We chose to focus on only 10 programs because of the lengthy simulation time incurred by executing all of the programs to completion. The statistics of workload dynamics were measured on the SimpleScalar 3.0 [28] sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model is detailed in Table 3-1.
Table 3-1 Baseline machine configuration

| Parameter | Configuration |
|---|---|
| Processor Width | 8 |
| ITLB | 128 entries, 4-way, 200 cycle miss |
| Branch Prediction | combined 8K tables, 10 cycle misprediction, 2 predictions/cycle |
| BTB | 2K entries, 4-way |
| Return Address Stack | 32 entries |
| L1 Instruction Cache | 32K, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access |
| RUU Size | 128 entries |
| Load/Store Queue | 64 entries |
| Store Buffer | 16 entries |
| Integer ALU | 4 I-ALU, 2 I-MUL/DIV |
| FP ALU | 2 FP-ALU, 1 FP-MUL/DIV |
| DTLB | 256 entries, 4-way, 200 cycle miss |
| L1 Data Cache | 64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access |
| L2 Cache | unified 1MB, 4-way, 128 Byte/line, 12 cycle access |
| Memory Access | 100 cycles |

Metrics to Quantify Phase Complexity

To quantify phase complexity, we measure the similarity between phase dynamics observed at different time scales. More specifically, we use cross-correlation coefficients to measure the similarity between the original data sampled at the finest granularity and the approximated version reconstructed from wavelet scaling coefficients obtained at a coarser scale. The cross-correlation coefficients (XCOR) of the two data series are defined as:

complexity. The trend of prediction accuracy (Figure 7-10) suggests that, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point improves error at a lower rate, except on the MEM1 workload. Thus, we select 16 wavelet coefficients in this work to minimize the complexity of prediction models while achieving good accuracy.

We further compare the accuracy of our proposed scheme with that of approximating 3D stacked die spatial thermal patterns by predicting the temperature of 16 evenly distributed locations across the 2D plane. The results (Figure 7-11) indicate that, using the same number of neural networks, our scheme yields significantly higher accuracy than conventional predictive models.
This is because wavelets provide a good time and locality characterization capability and most of the energy is captured by a limited set of important wavelet coefficients. The coordinated wavelet coefficients provide superior interpretation of the spatial patterns across scales of time and frequency domains.

[Figure: bar chart over workloads CPU1-CPU3, MEM1-MEM3 and MIX1-MIX3; legend includes "Predicting the wavelet coefficients"]

Figure 7-11 Benefit of predicting wavelet coefficients

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input parameters (refer to Table 7-3) were ranked based on split frequency. The input parameters that cause the most output variation tend to be split frequently in the constructed regression tree. Therefore, the input parameters that largely determine the values of a wavelet coefficient have a larger number of splits.

successfully achieve the desired reliability target and effectively mitigate the overhead of DVM, architects need techniques to quickly infer application worst-case operation conditions across design alternatives and accurately estimate the efficiency of DVM schemes at an early design stage. We developed a DVM scheme to manage runtime instruction queue (IQ) vulnerability to soft errors.

DVM IQ {
    ACE bits counter updating();
    if current context has L2 cache misses
    then stall dispatching instructions for current context;
    every (sampleinterval/5) cycles {
        if online IQ AVF > trigger threshold
        then wqratio = wqratio/2;
        else wqratio = wqratio+1;
    }
    if (ratio of waiting instruction # to ready instruction # > wqratio)
    then stall dispatching instructions;
}

Figure 5-13 IQ DVM Pseudo Code

Figure 5-13 shows the pseudo code of our DVM policy. The DVM scheme computes online IQ AVF to estimate runtime microarchitecture vulnerability. The estimated AVF is compared against a trigger threshold to determine whether it is necessary to enable a response mechanism.
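As a rough illustration, the policy in Figure 5-13 can be sketched in Python. This is only a sketch under stated assumptions: the names `DvmState`, `update_wq_ratio` and `should_stall_dispatch`, the initial values, and the handling of a zero ready count are hypothetical and do not come from the simulator used in this work.

```python
from dataclasses import dataclass


@dataclass
class DvmState:
    # Hypothetical starting values; the thesis text does not state them here.
    wq_ratio: float = 4.0           # allowed waiting/ready instruction ratio
    trigger_threshold: float = 0.3  # AVF level that triggers a response


def update_wq_ratio(state: DvmState, online_iq_avf: float) -> None:
    """Periodic adjustment (every sampleinterval/5 cycles in Figure 5-13)."""
    if online_iq_avf > state.trigger_threshold:
        state.wq_ratio = state.wq_ratio / 2   # vulnerable: tighten the limit
    else:
        state.wq_ratio = state.wq_ratio + 1   # safe: relax the limit


def should_stall_dispatch(state: DvmState, waiting: int, ready: int,
                          has_l2_miss: bool) -> bool:
    """Decide whether dispatch into the IQ is stalled for this context."""
    if has_l2_miss:
        return True
    if ready == 0:
        # Assumption: with no ready instructions the ratio is unbounded,
        # so stall whenever anything is waiting.
        return waiting > 0
    return waiting / ready > state.wq_ratio
```

Halving the ratio on a trigger and incrementing it otherwise gives a fast response to high vulnerability and a gradual recovery of performance, which is the behavior the pseudo code expresses.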
To reduce IQ soft error vulnerability, we throttle the instruction dispatching from the ROB to the IQ upon an L2 cache miss. Additionally, we sample the IQ AVF at a finer granularity and compare the sampled AVF with the trigger threshold. If the IQ AVF exceeds the trigger threshold, a parameter wqratio, which specifies the ratio of the number of waiting instructions to that of ready instructions in the IQ, is updated. The purpose of setting this parameter is to maintain performance by allowing an appropriate fraction of waiting instructions in the IQ to exploit ILP. By maintaining a desired ratio between the waiting

the counter overflow rate. Interestingly, as the overflow rate increases, there are cases in which overflow handling can reduce the COVs. This is because overflow handling has the effect of normalizing and smoothing irregular peaks in the original statistics.

[Figure: per-benchmark line plots for the Saturate, Reset and Wavelet schemes; x-axis: # of bits in counter; legible panel titles include bzip2, gap, mcf, twolf and gcc]
[Figure, continued: line plots for the Saturate, Reset and Wavelet schemes; x-axis: # of bits in counter; legible panel titles include parser]

Figure 4-6 Impact of counter overflows on phase analysis accuracy

One solution to avoid counter overflows is to use sampling counters instead of accumulative counters, as shown in Figure 4-5 (b). However, when sampling counters are used, the collected statistics are represented as time series that have a large volume of data. The results

[Figure: line plots for the Saturate, Reset and Wavelet schemes; legible panel titles include crafty, eon, gzip, perlbmk, vortex and vpr]

behavior across large and sophisticated architecture with high accuracy. Figure 6-4 (b-d) shows that, in general, the geospatial characteristics prediction accuracy increases when more wavelet coefficients are involved. Note that the complexity of the predictive models is proportional to the number of wavelet coefficients. The cost-effective models should provide high prediction accuracy while maintaining low complexity and computation overhead.
The trend of prediction accuracy (Figure 6-4) indicates that, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point improves error at a reduced rate. This is because wavelets provide a good time and locality characterization capability and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide interpretation of the spatial patterns among a large number of NUCA banks on a two-dimensional plane. Figure 6-5 illustrates the predicted 2D NUCA behavior across four different configurations (A-D) on the heterogeneous multi-programmed workload MIX (see Table 3) when different numbers of wavelet coefficients (16 to 256) are used.

[Figure: simulated (org) versus predicted 2D NUCA maps using 16, 32, 64, 96, 128 and 256 wavelet coefficients (wc), one row per configuration]

Figure 6-5 Predicted 2D NUCA behavior using different number of wavelet coefficients

4-7 Method for modeling workload variability
4-8 Effect of using wavelet denoising to handle workload variability
4-9 Efficiency of different denoising schemes
5-1 Variation of workload performance, power and reliability dynamics
5-2 Basic architecture of a neural network
5-3 Using wavelet neural network for workload dynamics prediction
5-4 Magnitude-based ranking of 128 wavelet coefficients
5-5 MSE boxplots of workload dynamics prediction
5-6 MSE trends with increased number of wavelet coefficients
5-7 MSE trends with increased sampling frequency
5-8 Roles of microarchitecture design parameters
5-9 Threshold-based workload execution scenarios
5-10 Threshold-based workload execution
5-11 Threshold-based workload scenario prediction
5-12 Dynamic Vulnerability Management
5-13 IQ DVM Pseudo Code
5-14 Workload dynamic prediction with scenario-based architecture optimization
5-15 Heat plot that shows the MSE of IQ AVF and processor power
5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds
6-1 Variation of cache hits across a 256-bank non-uniform access cache on 8-core
6-2 Using wavelet neural networks for forecasting architecture 2D characteristics
6-3 Baseline CMP with 8 cores that share a NUCA L2 cache
6-4 ME boxplots of prediction accuracies with different number of wavelet coefficients
6-5 Predicted 2D NUCA behavior using different number of wavelet coefficients
6-6 Roles of design parameters in predicting 2D NUCA

[Figure: bar charts comparing BBV and MRA-BBV; panels crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf and vortex]

Figure 3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics

Figure 3-8 shows experimental results for all the studied benchmarks. As can be seen, the MRA-BBV method can produce phases that exhibit more homogeneous dynamics and complexity than the standard, BBV-based method. This can be seen from the lower COV values generated by the MRA-BBV method. In general, the COV values yielded by both methods increase when coarse time scales are used for complexity approximation. The MRA-BBV method is capable of achieving significantly better classification on benchmarks with high complexity, such as gap, gcc and mcf. On programs that exhibit medium complexity, such as crafty, gzip, parser and twolf, the two schemes show comparable effectiveness. On benchmarks with trivial complexity (e.g. swim), both schemes work well.

We further examine the capability of using runtime performance metrics to capture complexity-aware phase behavior. Instead of using BBV, the sampled IPC is used directly as the input to the k-means phase clustering algorithm. Similarly, we apply multiresolution analysis to the IPC data and then use the gathered information for phase classification. We call this method

[Figure: bar chart with legend BBV, MRA-BBV]

case (e.g., a specific NUCA configuration), we generate two series of cache interference statistics (one from simulation and one from the predictive model) which correspond to the scenarios when workloads are mapped to the different cores. We compute the Pearson correlation coefficient of the two data series.
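A minimal sketch of this comparison, written with the standard sample form of the Pearson coefficient and the Python standard library only. The function name `pearson` and the example series are illustrative assumptions, not data from the experiments.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
                   * math.sqrt(n * sum_yy - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant series")
    return numerator / denominator


# Hypothetical interference statistics: simulated vs. predicted.
simulated = [0.12, 0.30, 0.45, 0.51, 0.80]
predicted = [0.10, 0.33, 0.41, 0.55, 0.78]
correlation = pearson(simulated, predicted)
```

A value of `correlation` near 1 would indicate that the predicted series tracks the simulated one closely.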
The Pearson correlation coefficient of two data series X and Y is defined as

$$ r_{XY} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} \tag{6-2} $$

If two data series, X and Y, show highly positive correlation, their Pearson correlation coefficient will be close to 1. Consequently, if the cache interference can be accurately estimated using the overlap between the predicted 2D NUCA footprints, we should observe nearly perfect correlation between the two metrics.

[Figure: Pearson correlation coefficient versus test case; panels Group 1 (CPU), Group 2 (MIX) and Group 3 (MEM); x-axis: Test Cases]

Figure 6-9 Pearson correlation coefficient (all 50 test cases are shown)

Figure 6-9 shows that there is a strong correlation between the interference estimated using the predicted 2D NUCA footprint and the interference statistics obtained using simulation. The highly positive Pearson correlation coefficient values show that, by using the predictive model, designers can quickly devise the optimal core allocation for a given set of workloads. Alternatively, the information can be used by the OS to guide cache-friendly thread scheduling in multi-core environments.

Figure 6-11 NUCA 2D thermal prediction error

The thermal prediction accuracy (average statistics) across three workload categories is shown in Figure 6-11. The accuracy of using different numbers of wavelet coefficients in prediction is also shown in that figure. The results show that our predictive model can be used to cost-effectively analyze the thermal behavior of large architecture substrates. In addition, our proposed technique can be used to evaluate the efficiency of thermal management policies at a large scale. For example, thermal hotspots can be mitigated by throttling the number of accesses to a cache bank for a certain period when its temperature reaches a threshold.
We build analytical models that incorporate thermal-aware cache access throttling as a design parameter. As a result, our predictive model can forecast the thermal hot spot distribution in the 2D NUCA cache banks when the dynamic thermal management (DTM) policy is enabled or disabled. Figure 6-12 shows the thermal profiles before and after thermal management policies are applied (both prediction and simulation results) for benchmark Ocean-NC. As can be seen, they track each other very well. In terms of time taken for design space exploration, our proposed models have orders of magnitude less overhead. The time required to predict the thermal behavior is much less than that of full-system multi-core simulation. For example, thermal hotspot estimation is over 2 x 10^5 times faster than thermal simulation, justifying our decision to use the predictive

Figure 5-7 illustrates the MSE (the average statistics of all benchmarks) yielded by predictive models that use 16 wavelet coefficients when the number of samples varies from 64 to 1024. As the sampling frequency increases, using the same number of wavelet coefficients is less accurate in terms of capturing workload dynamic behavior. As can be seen, the increase in MSE is not significant. This suggests that the proposed schemes can capture workload dynamic behavior with increasing complexity.

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input microarchitecture parameters were ranked based on either split order or split frequency. The microarchitecture parameters that cause the most output variation tend to be split earliest and most often in the constructed regression tree. Therefore, the microarchitecture parameters that largely determine the values of a wavelet coefficient are located higher in the regression tree and have a larger number of splits than the others.
[Figure: per-benchmark plots of design parameter roles; panels CPI, Power and AVF]

Figure 5-8 Roles of microarchitecture design parameters

facilitate phase classification. Nevertheless, in current work, the examined scope of workload characteristics and the explored benefits due to the wavelet transform are quite limited. In this chapter, we extend the research of Chapter 3 by applying wavelets to a wide variety of program execution statistics and quantifying the benefits of using wavelets for improving accuracy, scalability and robustness in phase classification. We conclude that wavelet domain phase analysis has the following advantages: 1) accuracy: the wavelet transform significantly reduces temporal dependence in the sampled workload statistics. As a result, simple models that are insufficient in the time domain become quite accurate in the wavelet domain. More attractively, wavelet coefficients transformed from various dimensions of program execution characteristics can be dynamically assembled together to further improve phase classification accuracy; 2) scalability: phase classification using wavelet analysis of high-dimension sampled workload statistics can alleviate the counter overflow problem, which has a negative impact on phase detection. Therefore, it is much more scalable for analyzing large-scale phases exhibited by long-running, real-world programs; and 3) robustness: wavelets offer denoising capabilities that allow phase classification to be performed robustly in the presence of workload execution variability.

Workload-statistics-based phase analysis

Using the wavelet-based method, we explore program phase analysis on a high-performance, out-of-order execution superscalar processor coupled with a multi-level memory hierarchy.
We use the Daubechies wavelet [26, 27] with an order of 8 for the rest of the experiments due to its high accuracy and low computation overhead. This section describes our experimental methodologies, the simulated machine configuration, the benchmarks used and the evaluated metrics.

BIOGRAPHICAL SKETCH

Chang Burm Cho earned B.E and M.A in electrical engineering at Dan-kook University, Seoul, Korea in 1993 and 1995, respectively. Over the next 9 years, he worked as a senior researcher at the Korea Aerospace Research Institute (KARI) to develop the On-Board Computer (OBC) for two satellites, KOMPSAT-1 and KOMPSAT-2. His research interest is computer architecture and workload characterization and prediction in large microarchitectural design spaces.

model. The baseline core architecture and floorplan we modeled is an Alpha processor, closely resembling the Alpha 21264. Figure 7-6 shows the baseline core floorplan.

Figure 7-6 Processor core floor-plan

We assume a 65 nm processing technology and the floor-plan is scaled accordingly. The entire die size is 21 x 21 mm and the core size is 5.8 x 5.8 mm. We consider three core configurations: 2-issue (5.8 x 5.8 mm), 4-issue (8.14 x 8.14 mm) and 8-issue (11.5 x 11.5 mm). Since the total die area is fixed, the more aggressive core configurations lead to smaller L2 caches. For all three types of core configurations, we calculate the size of the L2 caches based on the remaining die area available. Table 7-1 lists the detailed processor core and cache configurations.

We use Hotspot-4.0 [54] to simulate the thermal behavior of the 3D quad-core chip shown in Figure 7-7. The Hotspot tool can specify the multiple layers of silicon and metal required to model a three-dimensional IC. We choose the grid-like thermal modeling mode by specifying a set of 64 x 64 thermal grid cells per die, and the average temperature of each cell (32um x 32um) is represented by a value.
Hotspot takes power consumption data for each component block, the layer parameters and the floor-plans as inputs and generates the steady-state temperature for each active layer. To build a 3D multi-core processor simulator, we heavily modified and extended the M-Sim simulator [63] and incorporated the Wattch power model [36]. The power trace is

[18, 32]. This cross-run variability can confuse similarity-based phase detection. For a phase analysis technique to be applicable on real systems, it should be able to perform robustly under variability.

Program cross-run variability can be thought of as noise, i.e. a random variance of a measured statistic. There are many possible reasons for noisy data, such as measurement/instrument errors and interventions of the operating system. Removing this variability from the collected runtime statistics can be considered a process of denoising. In this chapter, we explore using wavelets as an effective way to perform denoising. Due to the vanishing moment property of wavelets, only some wavelet coefficients are significant in most cases. By retaining selected wavelet coefficients, a wavelet transform can be applied to reduce the noise. The main idea of wavelet denoising is to transform the data into the wavelet basis, where the large coefficients mainly contain the useful information and the smaller ones represent noise. By suitably modifying the coefficients in the new basis, noise can be directly removed from the data. The general denoising procedure involves three steps: 1) decompose: compute the wavelet decomposition of the original data; 2) threshold wavelet coefficients: select a threshold and apply thresholding to the wavelet coefficients; and 3) reconstruct: compute the wavelet reconstruction using the modified wavelet coefficients. More details on wavelet-based denoising techniques can be found in [33].
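The three steps can be sketched as follows. This is a simplified illustration rather than the setup used in the experiments: it substitutes the Haar wavelet for the Daubechies wavelet to keep the code short, uses the soft-thresholding rule described in [33], and takes a hand-picked threshold. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """Step 1: full Haar decomposition of a power-of-two-length series.

    Returns [approximation, coarsest detail, ..., finest details]."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = list(signal)
    length = n
    while length > 1:
        half = length // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:length] = approx + detail
        length = half
    return coeffs


def soft_threshold(coefficient, threshold):
    """Step 2: shrink a coefficient toward zero by `threshold`."""
    return math.copysign(max(abs(coefficient) - threshold, 0.0), coefficient)


def haar_inverse(coeffs):
    """Step 3: rebuild the series from (possibly modified) coefficients."""
    out = list(coeffs)
    length = 1
    while length < len(out):
        merged = []
        for a, d in zip(out[:length], out[length:2 * length]):
            merged.append((a + d) / SQRT2)
            merged.append((a - d) / SQRT2)
        out[:2 * length] = merged
        length *= 2
    return out


def denoise(signal, threshold):
    coeffs = haar_forward(signal)
    # Leave the overall approximation untouched; shrink only detail coefficients.
    shrunk = [coeffs[0]] + [soft_threshold(c, threshold) for c in coeffs[1:]]
    return haar_inverse(shrunk)


clean = [4.0, 4.0, 4.0, 4.0, 0.0, 0.0, 0.0, 0.0]
noisy = [4.1, 3.9, 4.1, 3.9, 0.1, -0.1, 0.1, -0.1]
smoothed = denoise(noisy, threshold=0.2)
```

In this toy example the small, fine-scale detail coefficients carry the injected noise and are zeroed, while the large coefficient that encodes the step between the two levels is only slightly shrunk.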
To model workload runtime variability, we use additive noise models and randomly inject noise into the time series that represents workload execution behavior. We vary the SNR (signal-to-noise ratio) to simulate scenarios with different degrees of variability. To classify program execution into phases, we generate a 16-dimension feature vector where each element contains

[24] B. Lee and D. Brooks, "Illustrative Design Space Studies with Microarchitectural Regression Models," in Proc. of the International Symposium on High-Performance Computer Architecture, 2007.
[25] R. M. Yoo, H. Lee, K. Chow and H. H. S. Lee, "Constructing a Non-Linear Model with Neural Networks For Workload Characterization," in Proc. of the International Symposium on Workload Characterization, 2006.
[26] I. Daubechies, Ten Lectures on Wavelets, Capital City Press, Montpelier, Vermont, 1992.
[27] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets," Communications on Pure and Applied Mathematics, vol. 41, pages 909-996, 1988.
[28] T. Austin, "Tutorial of Simplescalar V4.0," in conj. with the International Symposium on Microarchitecture, 2001.
[29] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[30] T. Huffmire and T. Sherwood, "Wavelet-Based Phase Classification," in Proc. of the International Conference on Parallel Architecture and Compilation Technique, 2006.
[31] D. Brooks and M. Martonosi, "Dynamic Thermal Management for High-Performance Microprocessors," in Proc. of the International Symposium on High-Performance Computer Architecture, 2001.
[32] A. Alameldeen and D. Wood, "Variability in Architectural Simulations of Multi-threaded Workloads," in Proc. of the International Symposium on High Performance Computer Architecture, 2003.
[33] D. L. Donoho, "De-noising by Soft-thresholding," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 613-627, 1995.
[34] MATLAB User Manual, MathWorks, MA, USA.
[35] M. Orr, K. Takezawa, A. Murray, S. Ninomiya and T. Leonard, "Combining Regression Tree and Radial Based Function Networks," International Journal of Neural Systems, 2000.
[36] D. Brooks, V. Tiwari and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," 27th International Symposium on Computer Architecture, 2000.
[37] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt and T. Austin, "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," in Proc. of the International Symposium on Microarchitecture, 2003.

an accurate interpretation of the spatial trend and details of complex workload behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict each individual wavelet coefficient. The separate predictions of wavelet coefficients proceed independently. Predicting each wavelet coefficient by a separate neural network simplifies the training task of each sub-network, and the training can be performed concurrently. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to synthesize the spatial patterns on large-scale architecture substrates. Figure 6-2 shows our hybrid neuro-wavelet scheme for architecture 2D spatial characteristics prediction. Given the observed spatial behavior on training data, our aim is to predict the 2D behavior of large-scale architecture under different design configurations.

[Figure: architecture design parameters feed a set of RBF neural networks, each producing one predicted wavelet coefficient; the predicted coefficients are combined to synthesize the architecture 2D characteristics]

Figure 6-2 Using wavelet neural networks for forecasting architecture 2D characteristics

The hybrid scheme basically involves three stages.
In the first stage, the observed spatial behavior is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design space vector and predicts a wavelet coefficient. The

© 2008 Chang Burm Cho

unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of the model's accuracy is obtained by using the design points in the testing data set.

Table 5-2 Microarchitectural parameter ranges used for generating train/test data

| Parameter | Ranges (Train) | Ranges (Test) | # of Levels |
|---|---|---|---|
| Fetch width | 2, 4, 8, 16 | 2, 8 | 4 |
| ROB size | 96, 128, 160 | 128, 160 | 3 |
| IQ size | 32, 64, 96, 128 | 32, 64 | 4 |
| LSQ size | 16, 24, 32, 64 | 16, 24, 32 | 4 |
| L2 size | 256, 1024, 2048, 4096 KB | 256, 1024, 4096 KB | 4 |
| L2 lat | 8, 12, 14, 16, 20 | 8, 12, 14 | 5 |
| il1 size | 8, 16, 32, 64 KB | 8, 16, 32 KB | 4 |
| dl1 size | 8, 16, 32, 64 KB | 16, 32, 64 KB | 4 |
| dl1 lat | 1, 2, 3, 4 | 1, 2, 3 | 4 |

To build a representative design space, one needs to ensure that the sample data sets spread points throughout the design space while remaining unique and small enough to keep the model building cost low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling strategy since it provides better coverage than a naive random sampling scheme. We generate multiple LHS matrices and apply a space-filling metric called L2-star discrepancy [40] to each LHS matrix to find the unique and most representative design space, i.e. the one with the lowest value of L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models.
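The sampling step can be illustrated with a short standard-library sketch. It is not the LHS variant used in this work: it draws plain Latin hypercube samples in the unit cube, scores each candidate with the closed-form (Warnock) expression for the L2-star discrepancy, and keeps the lowest-scoring one. The function names, the candidate count and the seed are assumptions made for the example.

```python
import math
import random


def latin_hypercube(n_points, n_dims, rng):
    """One LHS matrix in [0, 1): each dimension has one point per stratum."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_points))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_points for s in strata])
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_points)]


def l2_star_discrepancy(points):
    """L2-star discrepancy of points in the unit cube (Warnock's formula)."""
    n, d = len(points), len(points[0])
    first = 3.0 ** (-d)
    second = 0.0
    for p in points:
        second += math.prod(1.0 - x * x for x in p)
    second *= 2.0 ** (1 - d) / n
    third = 0.0
    for p in points:
        for q in points:
            third += math.prod(1.0 - max(x, y) for x, y in zip(p, q))
    third /= n * n
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(first - second + third, 0.0))


def best_lhs(n_points, n_dims, n_candidates=20, seed=0):
    """Generate several LHS matrices and keep the most space-filling one."""
    rng = random.Random(seed)
    candidates = [latin_hypercube(n_points, n_dims, rng)
                  for _ in range(n_candidates)]
    return min(candidates, key=l2_star_discrepancy)


design = best_lhs(n_points=16, n_dims=3)
```

Each unit-interval coordinate would then be mapped onto the discrete levels of one parameter, for example by indexing into that parameter's list of allowed values.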
We used 200 training points and 50 test points for workload dynamic prediction since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. In our study, each workload dynamic trace is represented by 128 samples.

Predicting each wavelet coefficient by a separate neural network simplifies the learning task. Since complex workload dynamics can be captured using limited number of wavelet

The workload dynamics prediction accuracies in the performance, power and reliability domains are plotted as boxplots (Figure 5-5). Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the "hinges", which are approximately the first and third quartiles of the MSE values. Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data; it shows the center of the distribution of the MSE values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. In Figure 5-5, the line with diamond-shaped markers indicates the statistical average of MSE across all test cases.

Figure 5-5 shows that the performance model achieves median errors ranging from 0.5 percent (swim) to 8.6 percent (mcf), with an overall median error across all benchmarks of 2.3 percent. As can be seen, even though the maximum error at any design point for any benchmark is 30%, most benchmarks show an MSE of less than 10%. This indicates that our proposed neuro-wavelet scheme can forecast the dynamic behavior of program performance characteristics with high accuracy.
Figure 5-5 shows that the power models are slightly less accurate, with median errors ranging from 1.3 percent (vpr) to 4.9 percent (crafty) and an overall median of 2.6 percent. The power prediction has a high maximum error of 35%. These errors are much smaller in the reliability domain.

In general, the workload dynamic prediction accuracy increases when more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients. The cost-effective models should provide high prediction

testing data sets. We build a separate model for each workload and use the model to predict architecture 2D spatial behavior at unexplored points in the design space. The training data set is used to build the 2D wavelet neural network models. An estimate of the model's accuracy is obtained by using the design points in the testing data set.

To build a representative design space, one needs to ensure that the sample data sets disperse points throughout the design space while remaining small enough to keep the cost of building the model low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) as our sampling strategy since it provides better coverage than a naive random sampling scheme. We generate multiple LHS matrices and use a space-filling metric called L2-star discrepancy. The L2-star discrepancy is applied to each LHS matrix to find the representative design space that has the lowest value of L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this chapter, we used 200 training points and 50 test points for workload dynamic prediction since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered.
The 2D NUCA architecture characteristics (normalized cache hit numbers) across 256 banks (with the geometry layout shown in Figure 6-3) are represented by a matrix. Predicting each wavelet coefficient by a separate neural network simplifies the learning task. Since complex spatial patterns on large-scale multi-core architecture substrates can be captured using a limited number of wavelet coefficients, the total size of the wavelet neural networks is small and the computation overhead is low. Because small-magnitude wavelet coefficients contribute less to the reconstructed data, we opt to predict only a small set of important wavelet coefficients. Specifically, we consider the following two schemes for selecting

Since COV measures standard deviations as a percentage of the average, a lower COV value means a better phase classification technique.

Exploring Wavelet Domain Phase Analysis

We first evaluate the efficiency of wavelet analysis on a wide range of program execution characteristics by comparing its phase classification accuracy with methods that use information in the time domain. We then explore methods to further improve phase classification accuracy in the wavelet domain.

Phase Classification: Time Domain vs. Wavelet Domain

The wavelet analysis method provides a cost-effective representation of program behavior. Since wavelet coefficients are generally decorrelated, we can transform the original data into the wavelet domain and then carry out the phase classification task. The generated wavelet coefficients can be used as signatures to classify program execution intervals into phases: if two program execution intervals show similar fingerprints (represented as a set of wavelet coefficients), they can be classified into the same phase. To quantify the benefit of using wavelet-based analysis, we compare phase classification methods that use time domain and wavelet domain program execution information.
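Because this comparison is scored with COV, a small sketch of a weighted COV may help. It assumes each phase is given as a list of scalar per-interval measurements and that a phase's weight is its share of the execution intervals; the function names and the scalar simplification are illustrative assumptions, not the exact procedure used in this work.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation expressed as a fraction of the mean."""
    mean = statistics.fmean(values)
    if len(values) < 2 or mean == 0:
        return 0.0
    return statistics.pstdev(values) / abs(mean)

def weighted_cov(phases):
    """Weight each phase's COV by the fraction of intervals that fall in that phase."""
    total = sum(len(values) for values in phases.values())
    return sum(
        (len(values) / total) * coefficient_of_variation(values)
        for values in phases.values()
    )

# Hypothetical example: phase A is perfectly homogeneous, phase B is not.
example = {"A": [1.0, 1.0, 1.0, 1.0], "B": [2.0, 4.0]}
score = weighted_cov(example)
```

A classification that groups more homogeneous intervals together yields a lower score.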
With our time domain phase analysis method, each program execution interval is represented by a time series which consists of 1024 sampled program execution statistics. We first apply random projection to reduce the data dimensionality to 16. We then use the k-means clustering algorithm to classify program intervals into phases. This is similar to the method used by the popular Simpoint tool, where the basic block vectors (BBVs) are used as input. For the wavelet domain method, the original time series are first transformed into the wavelet domain using DWT. The first 16 wavelet coefficients of each program execution interval are extracted

accuracy while maintaining low complexity. Figure 5-6 shows the trend of prediction accuracy (the average statistics of all benchmarks) when various numbers of wavelet coefficients are used.

Figure 5-6 MSE trends with increased number of wavelet coefficients (series: CPI, Power, AVF; x-axis: Number of Wavelet Coefficients, 16 to 128)

As can be seen, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point improves error at a lower rate. This is because wavelets provide a good time and locality characterization capability and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide interpretation of the series structures across scales of time and frequency domains. The capability of using a limited set of wavelet coefficients to capture workload dynamics varies with resolution level.

Figure 5-7 MSE trends with increased sampling frequency (series: CPI, Power, AVF; x-axis: Number of Samples, 64 to 1024)

conditions to soft error vulnerability and accurately estimate the efficiency of soft error vulnerability management schemes.
Because of technology scaling, radiation-induced soft errors contribute more and more to the failure rate of CMOS devices. Therefore, soft error rate is an important reliability issue in deep-submicron microprocessor design. Processor microarchitecture soft error vulnerability exhibits significant runtime variation, and it is neither economical nor practical to design fault-tolerant schemes that target the worst-case operating condition. Dynamic Vulnerability Management (DVM) refers to a set of strategies to keep hardware runtime soft-error susceptibility under a tolerable threshold. DVM allows designers to achieve higher dependability on hardware designed for a lower reliability setting. If a particular execution period exceeds the pre-defined vulnerability threshold, a DVM response (Figure 5-12) will work to reduce hardware vulnerability.

Figure 5-12 Dynamic Vulnerability Management (labels: Designed-for Reliability Capacity w/out DVM, Designed-for Reliability Capacity w/ DVM, DVM Performance Overhead, DVM Trigger Level, DVM Engaged, DVM Disengaged; x-axis: Time)

A primary goal of DVM is to keep vulnerability within a pre-defined reliability target during the entire program execution. The DVM will be triggered once the hardware soft error vulnerability exceeds the predefined threshold. Once the trigger goes on, a DVM response begins. Depending on the type of response chosen, there may be some performance degradation. A DVM response can be turned off as soon as the vulnerability drops below the threshold. To

space is very large. In this chapter, we consider a design space that consists of 9 parameters (see Table 6-2) of the CMP NUCA architecture.
Figure 6-3 Baseline CMP with 8 cores that share a NUCA L2 cache

Table 6-2 Considered architecture design parameters and their ranges
Parameters                          Description
NUCA Management Policy (NUCA)       SNUCA, DNUCA, RNUCA
Network Topology (net)              Hierarchical, PTtoPT, Crossbar
Network Link Latency (net_lat)      20, 30, 40, 50
L1_latency (L1_lat)                 1, 3, 5
L2_latency (L2_lat)                 6, 8, 10, 12
L1_associativity (L1_aso)           1, 2, 4, 8
L2_associativity (L2_aso)           2, 4, 8, 16
Directory Latency (d_lat)           30, 60, 80, 100
Processor Buffer Size (p_buf)       5, 10, 20

These design parameters cover the NUCA data management policy (NUCA), interconnection topology and latency (net and net_lat), the configurations of the L1 and L2 caches (L1_lat, L2_lat, L1_aso and L2_aso), cache coherency directory latency (d_lat) and the number of cache accesses that a processor core can issue to the L1 (p_buf). The ranges for these parameters were set to include both typical and feasible design points within the explored design space.

CHAPTER 3 COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION

Obtaining phase dynamics is, in many cases, of great interest for accurately capturing program behavior and for precisely applying runtime application-oriented optimizations. For example, complex, real-world workloads may run for hours, days or even months before completion. Their long execution time implies that program time-varying behavior can manifest across a wide range of scales, making phase behavior models that use a single time scale less informative. To overcome the limitations of conventional phase analysis techniques, we propose using wavelet-based multiresolution analysis to characterize phase dynamic behavior and develop metrics to quantitatively evaluate the complexity of phase structures. We also propose methodologies to classify program phases from their dynamics and complexity perspectives.
Specifically, the goal of this chapter is to answer the following questions: How to define the complexity of program dynamics? How do program dynamics change over time? If classified using existing methods, how similar are the program dynamics in each phase? How to better identify phases with homogeneous dynamic behavior? In this chapter, we implement our complexity-based phase analysis technique and evaluate its effectiveness against existing phase analysis methods based on program control flow and runtime information. We show that in both cases the proposed technique produces phases that exhibit more homogeneous dynamic behavior than existing methods do.

Characterizing and classifying the program dynamic behavior

Using the wavelet-based multiresolution analysis described in Chapter 2, we characterize, quantify and classify program dynamic behavior on a high-performance, out-of-order execution superscalar processor coupled with a multi-level memory hierarchy.

large-scale many-core processor requires obtaining workload characteristics across a large number of distributed hardware components (cores, cache banks, interconnect links etc.) in different levels of abstraction. Therefore, there is a pressing need for novel and efficient approaches to model and analyze workload and architecture with rapidly increasing complexity and integration scale. We aim to develop computationally efficient methods and models which allow architects and designers to rapidly yet informatively explore the large performance, power, reliability and thermal design space of uni-/multi-core architecture. Our models achieve several orders of magnitude speedup compared to simulation-based methods. Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity.
More attractively, our models have the capability of capturing complex workload behavior and can be used to forecast workload dynamics during performance, power, reliability and thermal design space exploration.

Figure 5-3 Using wavelet neural network for workload dynamics prediction

Each RBF neural network receives the entire microarchitectural design space vector and predicts a wavelet coefficient. The training of an RBF network involves determining the center point and a radius for each RBF and the weights of each RBF, which determine the wavelet coefficients.

Experimental Methodology

We evaluate the efficiency of using wavelet neural networks to explore workload dynamics in the performance, power and reliability domains during microarchitecture design space exploration. We use a unified, detailed microarchitecture simulator in our experiments. Our simulation framework, built using a heavily modified and extended version of the Simplescalar tool set, models pipelined, multiple-issue, out-of-order execution microprocessors with multiple-level caches. Our framework uses a Wattch-based power model [36]. In addition, we built the Architectural Vulnerability Factor (AVF) analysis methods proposed in [37, 38] to estimate processor microarchitecture vulnerability to transient faults. A microarchitecture structure's AVF refers to the probability that a transient fault in that hardware structure will result in incorrect program results. The AVF metric can be used to estimate how vulnerable the hardware is to soft

Experimental Results

We compare Simpoint and the proposed approach in their capability of classifying phase complexity. Since we use the wavelet transform on program basic block vectors, we refer to our method as multiresolution analysis of BBV (MRA-BBV).
Figure 3-7 Weighted COV calculation

We examine the similarity of program complexity within each phase classified by the two approaches. Instead of using IPC, we use IPC dynamics as the metric for evaluation. After classifying all program execution intervals into phases, we examine each phase and compute the IPC XCOR vectors for all the intervals in that phase. We then calculate the standard deviation of the IPC XCOR vectors within each phase, and we divide the standard deviation by the average to get the Coefficient of Variation (COV). As shown in Figure 3-7, we calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e. weighted COV) used to compare different phase classifications for a given program. Since COV measures standard deviations as a percentage of the average, a lower COV value means a better phase classification technique.

The most common type of neural network (Figure 5-2) consists of three layers of units: a layer of input units is connected to a layer of hidden units, which is connected to a layer of output units. The input is fed into the network through the input units. Each hidden unit receives the entire input vector and generates a response. The output of a hidden unit is determined by the input-output transfer function that is specified for that unit. Commonly used transfer functions include the sigmoid, the linear threshold function and the Radial Basis Function (RBF) [35]. The ANN output, which is determined by the output unit, is computed using the responses of the hidden units and the weights between the hidden and output units. Neural networks outperform linear models in capturing complex, non-linear relations between input and output, which makes them a promising technique for tracking and forecasting complex behavior.
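As an illustration of the three-layer structure just described, the following sketch evaluates a network whose hidden units are Gaussian radial basis functions and whose single output is a weighted sum of the hidden responses. The function names, the use of one scalar radius per hidden unit, and the example values are assumptions made for this sketch; it shows only the forward computation, not how centers, radii or weights are trained.

```python
import math

def gaussian_rbf(x, center, radius):
    """Response of one hidden unit: decays with squared distance from its center."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist_sq / radius ** 2)

def rbf_network_output(x, centers, radii, weights):
    """Each hidden unit sees the whole input vector; the output is a weighted sum."""
    hidden = [gaussian_rbf(x, c, r) for c, r in zip(centers, radii)]
    return sum(w * h for w, h in zip(weights, hidden))

# Hypothetical two-unit network evaluated at the first unit's center.
y = rbf_network_output(
    x=[0.0, 0.0],
    centers=[[0.0, 0.0], [1.0, 0.0]],
    radii=[1.0, 1.0],
    weights=[2.0, 3.0],
)
```

In the setting of this chapter, `x` would be a design parameter vector and the output one predicted wavelet coefficient.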
In this chapter, we use the RBF transfer function to model and estimate important wavelet coefficients at unexplored points of the design space because of its superior ability to approximate complex functions. The basic architecture of an RBF network with n inputs and a single output is shown in Figure 5-2. The nodes in adjacent layers are fully connected. In a linear single-layer neural network model, a 1-dimensional function f is expressed as a linear combination of a set of n fixed functions, often called basis functions by analogy with the concept of a vector being composed of a linear combination of basis vectors.

$f(x) = \sum_{j=1}^{n} w_j h_j(x)$ (5-1)

Here $w \in \mathbb{R}^n$ is the adaptable or trainable weight vector and $\{h_j(\cdot)\}_{j=1}^{n}$ are fixed basis functions, i.e. the transfer functions of the hidden units. The flexibility of f, its ability to fit many different functions, derives only from the freedom to choose different values for the weights. The basis

[38] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas and R. Rangan, "Computing Architectural Vulnerability Factors for Address-Based Structures," in Proc. of the International Symposium on Computer Architecture, 2005.
[39] J. Cheng, M. J. Druzdzel, "Latin Hypercube Sampling in Bayesian Networks," in Proc. of the 13th Florida Artificial Intelligence Research Society Conference, 2000.
[40] B. Vandewoestyne, R. Cools, "Good Permutations for Deterministic Scrambled Halton Sequences in terms of L2-discrepancy," Journal of Computational and Applied Mathematics, Vol. 189, Issues 1-2, 2006.
[41] J. Chambers, W. Cleveland, B. Kleiner and P. Tukey, Graphical Methods for Data Analysis, Wadsworth, 1983.
[42] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, ISBN 0-13-273350-1, 1999.
[43] C. Kim, D. Burger, and S. Keckler, "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[44] L. Benini, G. De Micheli, "Networks On Chips: A New SoC Paradigm," Computer, Vol. 35, Issue 1, January 2002, pp. 70-78.
[45] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. Keckler, "A NUCA Substrate for Flexible CMP Cache Sharing," in Proc. of the International Conference on Supercomputing, 2005.
[46] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures," in Proc. of the International Symposium on Microarchitecture, 2003.
[47] B. M. Beckmann and D. A. Wood, "Managing Wire Delay in Large Chip-Multiprocessor Caches," in Proc. of the International Symposium on Microarchitecture, 2004.
[48] Z. Chishti, M. D. Powell, and T. N. Vijaykumar, "Optimizing Replication, Communication, and Capacity Allocation in CMPs," in Proc. of the International Symposium on Computer Architecture, 2005.
[49] A. Zhang and K. Asanovic, "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors," in Proc. of the International Symposium on Computer Architecture, 2005.
[50] B. Lee, D. Brooks, B. Supinski, M. Schulz, K. Singh, S. McKee, "Methods of Inference and Learning for Performance Modeling of Parallel Applications," PPoPP, 2007.

gcc, crafty and vortex. In the power domain, prediction accuracy is more uniform across benchmarks and microarchitecture configurations. In Figure 5-16, we show the IQ AVF MSE when different DVM thresholds are set. The results suggest that our predictive models work well when different DVM targets are considered.

Figure 5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds (series: DVM Threshold = 0.2, 0.3, 0.5)

functions and any parameters which they might contain are fixed. If the basis functions can change during the learning process, then the model is nonlinear. Radial functions are a special class of function.
Their characteristic feature is that their response decreases (or increases) monotonically with distance from a central point. The center, the distance scale, and the precise shape of the radial function are parameters of the model, all fixed if it is linear. A typical radial function is the Gaussian which, in the case of a scalar input, is

$h(x) = \exp\left(-\frac{(x - c)^2}{r^2}\right)$ (5-2)

Its parameters are its center c and its radius r. Radial functions are simply a class of functions. In principle, they could be employed in any sort of model, linear or nonlinear, and any sort of network (single-layer or multi-layer). The training of the RBF network involves selecting the center locations and radii (which are eventually used to determine the weights) using a regression tree. A regression tree recursively partitions the input data set into subsets with decision criteria. As a result, there will be a root node, non-terminal nodes (having sub-nodes) and terminal nodes (having no sub-nodes), each associated with a subset of the input data. Each node contributes one unit to the RBF network's center and radius vectors. The selection of RBF centers is performed by recursively parsing the regression tree nodes using a strategy proposed in [35].

Combining Wavelet and Neural Network for Workload Dynamics Prediction

We view workload dynamics as a time series produced by the processor, which is a nonlinear function of its design parameter configuration. Instead of predicting this function at every sampling point, we employ wavelets to approximate it. Previous work [21, 23, 25] shows

such as: Which variables are dominant for a given dataset? Which observations show similar behavior?
Figure 6-6 Roles of design parameters in predicting 2D NUCA (panels: Order, Frequency; parameters: p_buf, net_lat, net, L1_lat, NUCA, L2_lat, d_lat, L1_aso, L2_aso; workloads: gcc, mcf, cpu, mix, mem, barnes, ocean_co, ocean_nc, water_sp, cholesky, fft, radix, fmm)

For example, on the Splash-2 benchmark fmm, network latency (net_lat), processor buffer size (p_buf), L2 latency (L2_lat) and L1 associativity (L1_aso) have significant roles in predicting the 2D NUCA spatial behavior, while the NUCA data management policy (NUCA) and network topology (net) largely affect the 2D spatial pattern when running the homogeneous multi-programmed workload gccx8. For the benchmark cholesky, the most frequently involved architecture parameters in regression tree construction are NUCA, net_lat, p_buf, L2_lat and L1_aso. Differing from models that predict aggregated workload characteristics on monolithic architecture designs, our proposed methods can accurately and informatively reveal the complex patterns that workloads exhibit on large-scale architectures. This feature is essential if the predictive models are employed to examine the efficiency of design tradeoffs or explore novel

abundant program dynamics could be lost if coarser time scales are used to characterize it. We refer to this characteristic as high complexity behavior.
Figure 3-3 XCOR value distributions: (a) swim (low complexity), (b) crafty (medium complexity), (c) gcc (high complexity). Each panel has Scales 1 to 10 on the x-axis and a 0 to 100 y-axis, with XCOR values binned into [0, 0.1) through [0.9, 1).

The dynamics complexity and the XCOR value distribution plots (Figure 3-2 and Figure 3-3) provide a quantitative and informative representation of runtime program complexity.

Table 3-2 Classification of benchmarks based on their complexity
Category             Benchmarks
Low complexity       swim
Medium complexity    crafty, gzip, parser, perlbmk, twolf
High complexity      gap, gcc, mcf, vortex

Using the above information, we classify the studied programs in terms of their complexity and the results are shown in Table 3-2.

then sample multiple data points within each interval. Therefore, at the finest resolution level, program time domain behavior is represented by a data series within each interval. Note that the sampled data can be any runtime program characteristics of interest. We then apply the discrete wavelet transform (DWT) to each interval. As described in the previous section, the result of the DWT is a set of wavelet coefficients which represent the behavior of the sampled time series in the wavelet domain.

Figure 2-2 Comparison execution characteristics of time and wavelet domain: (a) Time domain representation, (b) Wavelet domain representation

Figure 2-2 (a) shows the sampled time domain workload execution statistics (the y-axis represents the number of cycles a processor spends on executing a fixed number of instructions) on the benchmark gcc within one execution interval. In this example, the program execution interval is represented by 1024 sampled data points. Figure 2-2 (b) illustrates the wavelet domain representation of the original time series after a discrete wavelet transform is applied. Although the DWT can produce as many wavelet coefficients as there are original input data points, the first few wavelet coefficients usually contain the important trend. In Figure 2-2 (b), we show the values of the first 16 wavelet coefficients. As can be seen, the discrete wavelet transform provides a compact representation of the original large volume of data. This feature can be exploited to create concise yet informative fingerprints to capture program execution behavior.

we use 1024 samples within each interval, we create an XCOR vector with a length of 10 for each interval, as shown in Figure 3-1.

Figure 3-1 XCOR vectors for each program execution interval

Profiling Program Dynamics and Complexity

We use XCOR metrics to quantify the program dynamics and complexity of the studied SPEC CPU 2000 benchmarks. Figure 3-2 shows the results of the total 1024 execution intervals across ten levels of abstraction for the benchmark gcc.

Figure 3-2 Dynamic complexity profile of benchmark gcc (axes: XCOR, Scales 1 to 10)

As can be seen, the benchmark gcc shows a wide variety of changing dynamics during its execution.
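To make the idea of an XCOR vector of length 10 for a 1024-sample interval concrete, here is a rough sketch. It assumes that XCOR at a given scale is the Pearson correlation between the sampled series and a Haar-style approximation obtained by averaging blocks of 2^scale samples, and that a constant approximation is assigned a correlation of 0. Both choices, and all function names, are assumptions for illustration; the exact XCOR definition used in this work is not restated in this passage.

```python
import math

def block_mean_approximation(series, scale):
    """Haar-style approximation: replace each block of 2**scale samples by its mean."""
    block = 2 ** scale
    approx = []
    for start in range(0, len(series), block):
        chunk = series[start:start + block]
        approx.extend([sum(chunk) / len(chunk)] * len(chunk))
    return approx

def correlation(a, b):
    """Pearson correlation; returns 0.0 when either series is constant."""
    if max(a) == min(a) or max(b) == min(b):
        return 0.0
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def xcor_vector(series):
    """One correlation value per scale; len(series) must be a power of two."""
    levels = len(series).bit_length() - 1
    return [
        correlation(series, block_mean_approximation(series, s))
        for s in range(1, levels + 1)
    ]

# Synthetic 1024-sample interval standing in for sampled execution statistics.
interval = [math.sin(0.05 * i) + 0.3 * math.sin(1.7 * i) for i in range(1024)]
vector = xcor_vector(interval)  # 10 values, one per scale
```

Under these assumptions the values cannot increase as the scale grows, because each coarser approximation discards detail that the finer one retained.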
As the time scale increases, XCOR values decrease monotonically. This is due to the fact that wavelet approximation at a coarse scale removes details in program dynamics observed at a fine-grained level. Rapidly decreasing XCOR implies highly complex structures that

horizontal line in the interior of the box is located at the median of the data; it shows the center of the distribution of the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. In Figure 7-8, the blue line with diamond-shaped markers indicates the statistical average of ME across all benchmarks.

Figure 7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16; workloads: CPU1, CPU2, CPU3, MEM1, MEM2, MEM3, MIX1, MIX2, MIX3)

Figure 7-8 shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 2.8% (CPU1) to 15.5% (MEM1) with an overall median error of 6.9% across all experimented workloads. As can be seen, the maximum error at any design point for any benchmark is 17.5% (MEM1), and most benchmarks show an error of less than 9%. This indicates that our hybrid neuro-wavelet framework can predict 2D spatial thermal behavior across large and sophisticated 3D multi-core architectures with high accuracy. Figure 7-8 also indicates that CPU workloads (average 4.4%) have smaller error rates than MEM (average 9.4%) and MIX (average 6.7%) workloads. This is because the CPU workloads usually have higher temperatures on the small core area than on the large L2 cache area. These small and sharp hotspots can be easily captured using just a few wavelet coefficients. On MEM and MIX workloads, the complex thermal pattern can spread across the entire die area, resulting in higher prediction error.
Figure 7-9 illustrates the simulated and predicted 2D thermal spatial behavior of die 4 (for one configuration) on the CPU1, MEM1 and MIX1 workloads.

The coarser scale BBVs are the approximations of the finest scale BBVs generated by the wavelet-based multiresolution analysis.

Figure 3-6 Multiresolution analysis of the projected BBVs (1024 BBVs with 15 dimensions each, followed by multiresolution analysis and XCOR calculation across scales 1 to 10)

As shown in Figure 3-6, the discrete wavelet transform is applied to each dimension of a set of BBVs at the finest scale. The XCOR calculation is used to estimate the correlations between a BBV element and its approximations at coarser scales. The results are 15 XCOR vectors representing the complexity of each BBV dimension across 10 levels of abstraction. The 15 XCOR vectors are then averaged together to obtain an aggregated XCOR vector that represents the complexity characteristics of the entire BBV for that execution interval. Using the above steps, we obtain an aggregated XCOR vector for each program execution interval. We then run the k-means clustering algorithm [29] on the collected XCOR vectors, which represent the dynamic complexity of program execution intervals, and classify them into phases. This is similar to what Simpoint does. The difference is that the Simpoint tool uses raw BBVs and our method uses aggregated BBV XCOR vectors as the input for k-means clustering.

coefficients, the total size of wavelet neural networks can be small. Because small-magnitude wavelet coefficients contribute less to the reconstructed data, we opt to predict only a small set of important wavelet coefficients.

Figure 5-4 Magnitude-based ranking of 128 wavelet coefficients (x-axis: Wavelet Coefficient Index; color scale: high to low magnitude rank)

Specifically, we consider the following two schemes for selecting important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0 and (2) order-based: select the first k coefficients and approximate the rest with 0. In this study, we choose to use the magnitude-based scheme since it always outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients does not change drastically across the design space. Figure 5-4 illustrates the magnitude-based ranking (shown as a color map where red indicates high ranks and blue indicates low ranks) of a total of 128 wavelet coefficients (decomposed from the dynamics of benchmark gcc) across 50 different microarchitecture configurations. As can be seen, the top-ranked wavelet coefficients largely remain consistent across different processor configurations.

In this chapter, we propose the use of wavelet neural networks to build accurate predictive models for workload dynamics driven microarchitecture design space exploration. We show that wavelet neural networks can be used to accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluate the efficiency of using the proposed techniques to predict workload dynamic behavior in the performance, power and reliability domains. We perform extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy and identify microarchitecture parameters that significantly affect workload dynamic behavior. We present a case study of using workload dynamics aware predictive models to quickly estimate the efficiency of scenario-driven architecture optimizations across different domains.
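The two coefficient selection schemes described earlier in this chapter (magnitude-based and order-based) can be sketched as below. The function names and the example coefficient values are hypothetical. The comparison relies on one stated assumption: for an orthonormal wavelet transform, the squared error in the coefficient domain equals the squared reconstruction error in the time domain.

```python
def select_order_based(coeffs, k):
    """Keep the first k coefficients and approximate the rest with 0."""
    return [c if i < k else 0.0 for i, c in enumerate(coeffs)]

def select_magnitude_based(coeffs, k):
    """Keep the k coefficients with the largest magnitude and approximate the rest with 0."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(ranked[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def squared_error(original, approximation):
    """Energy of the coefficients that were dropped."""
    return sum((a - b) ** 2 for a, b in zip(original, approximation))

# Hypothetical coefficients of one trace, with k = 2 retained coefficients.
coeffs = [5.0, -0.2, 0.1, 3.0, -4.0, 0.05]
by_order = select_order_based(coeffs, 2)          # keeps 5.0 and -0.2
by_magnitude = select_magnitude_based(coeffs, 2)  # keeps 5.0 and -4.0
```

For any fixed k, keeping the largest-magnitude coefficients discards the least energy, so the magnitude-based error computed this way is never larger than the order-based error. This covers only the truncation error, not the additional error introduced when the retained coefficients are themselves predicted.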
Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios.

Neural Network

An Artificial Neural Network (ANN) [42] is an information processing paradigm that is inspired by the way biological nervous systems process information. It is composed of a set of interconnected processing elements working in unison to solve problems.

Figure 5-2 Basic architecture of a neural network (input layer x1 ... xn, hidden layer of radial basis functions H1(x), H2(x), ..., output layer)

instructions and the ready instructions, vulnerability can be reduced at negligible performance cost. The wqratio update is triggered by the estimated IQ AVF. In our DVM design, wqratio is adapted through slow increases and rapid decreases in order to ensure a quick response to a vulnerability emergency. We built workload dynamics predictive models which incorporate DVM as a new design parameter. Therefore, our models can predict workload execution scenarios with and without the DVM feature across different microarchitecture configurations. Figure 5-14 shows the results of using the predictive models to forecast IQ AVF on the benchmark gcc across two microarchitecture configurations.

Figure 5-14 Workload dynamic prediction with scenario-based architecture optimization: (a) Scenario 1, (b) Scenario 2, each shown with DVM disabled and DVM enabled (x-axis: Samples; marked level: DVM Target)

We set the DVM target to 0.3, which means the DVM policy, when enabled, should maintain the IQ AVF below 0.3 during workload execution. In both cases, the IQ AVF dynamics were predicted with DVM disabled and enabled. As can be seen, in scenario 1, the DVM successfully achieves its goal.
In scenario 2, despite the enabled DVM feature, the IQ AVF of certain execution periods is still above the threshold. This implies that the developed DVM mechanism is suitable for the microarchitecture configuration used in scenario 1. On the other hand, architects have to choose another DVM policy if the microarchitecture configuration

One advantage of using wavelet coefficients to fingerprint program execution is that program time domain behavior can be reconstructed from these wavelet coefficients. Figures 2-3 and 2-4 show that the time domain workload characteristics can be recovered using the inverse discrete wavelet transform.

Figure 2-3 Sampled time domain program behavior

Figure 2-4 Reconstructing the workload dynamic behaviors: (a) 1 wavelet coefficient, (b) 2 wavelet coefficients, (c) 4 wavelet coefficients, (d) 8 wavelet coefficients, (e) 16 wavelet coefficients, (f) 64 wavelet coefficients

In Figure 2-4 (a)-(e), the first 1, 2, 4, 8, and 16 wavelet coefficients were used to restore program time domain behavior with increasing fidelity. As shown in Figure 2-4 (f), when all (e.g. 64) wavelet coefficients are used for recovery, the original signal can be completely restored. However, this could involve storing and processing a large number of wavelet coefficients. Using a wavelet transform gives time-frequency localization of the original data. As a result, most of the energy of the input data can be represented by only a few wavelet

entire NUCA substrate vary widely with different design choices while executing the same code base. In this example, various NUCA cache configurations such as network topologies (e.g. hierarchical, point-to-point and crossbar) and data management schemes (e.g. static (SNUCA) [43], dynamic (DNUCA) [45, 46] and dynamic with replication (RNUCA) [47-49]) are used.
As the number of parameters in the design space increases, such variation and characteristics at large scales cannot be captured without using slow and detailed simulations. However, using simulation-based methods for architecture design space exploration, where numerous design parameters have to be considered, is prohibitively expensive. Recently, various predictive models [20-25, 50] have been proposed to cost-effectively reason about processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous behavior of large and distributed architecture substrates across the design space. This limitation will only be exacerbated by the rapidly increasing integration scale (e.g. number of cores per chip). Therefore, there is a pressing need for novel and cost-effective approaches to achieve accurate and informative design trade-off analysis for large and sophisticated architectures in the upcoming multi-/many-core era. Thus, in this chapter, instead of quantifying these large and sophisticated architectures by a single number or a simple statistical distribution, the techniques we propose employ 2D wavelet multiresolution analysis and neural network non-linear regression modeling. With our schemes, the complex spatial characteristics that workloads exhibit across large architecture substrates are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of

coefficients. As can be seen, using 16 wavelet coefficients can recover program time domain behavior with sufficient accuracy.
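The reconstruction from a truncated set of coefficients can be illustrated with a minimal sketch. It assumes an orthonormal Haar wavelet, a coarse-to-fine coefficient ordering, and a short made-up trace; the function names `haar_dwt`, `haar_idwt` and `reconstruct_first_k` are hypothetical and are not taken from the dissertation's tooling.

```python
import math

def haar_dwt(signal):
    """Full orthonormal Haar decomposition (length must be a power of two).

    Returns coefficients ordered coarse to fine:
    [approximation, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    s = 1.0 / math.sqrt(2.0)
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Finer details were computed earlier, so prepend the coarser ones.
        details = [s * (a - b) for a, b in pairs] + details
        approx = [s * (a + b) for a, b in pairs]
    return approx + details

def haar_idwt(coeffs):
    """Inverse of haar_dwt for the same coefficient ordering."""
    s = 1.0 / math.sqrt(2.0)
    approx = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        rebuilt = []
        for a, d in zip(approx, detail):
            rebuilt.extend([s * (a + d), s * (a - d)])
        approx = rebuilt
    return approx

def reconstruct_first_k(signal, k):
    """Keep the first k coefficients, zero the rest, and invert."""
    coeffs = haar_dwt(signal)
    kept = coeffs[:k] + [0.0] * (len(coeffs) - k)
    return haar_idwt(kept)

# Made-up 8-sample trace, for illustration only.
trace = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
for k in (1, 2, 4, 8):
    print(k, [round(v, 2) for v in reconstruct_first_k(trace, k)])
```

With one coefficient the sketch returns the mean of the trace everywhere; each additional coefficient refines the shape, and keeping all of them restores the input up to rounding.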
To classify program execution into phases, it is essential that the generated wavelet coefficients across intervals preserve the dynamics that workloads exhibit within the time domain. Figure 2-5 shows the variation of the first 16 wavelet coefficients (coeff 1 to coeff 16), which represent the wavelet domain behavior of branch misprediction and L1 data cache hit on the benchmark gcc. The data are shown for the entire program execution, which contains a total of 1024 intervals.

Figure 2-5 Variation of wavelet coefficients: (a) branch misprediction, (b) L1 data cache hit (series: coeff 1 through coeff 16)

Figure 2-5 shows that wavelet domain transforms largely preserve program dynamic behavior. Another interesting observation is that the first order wavelet coefficient exhibits much more significant variation than the high order wavelet coefficients. This suggests that wavelet domain workload dynamics can be effectively captured using a few low order wavelet coefficients.

Figure 4-2 shows that in the wavelet domain, the efficiency of using a single type of program characteristic to classify program phases can vary significantly across different benchmarks. For example, while ul2 hit achieves accurate phase classification on the benchmark vortex, it results in a high phase classification COV on the benchmark gcc. To overcome the above disadvantages and to build phase classification methods that can achieve high accuracy across a wide range of applications, we explore using wavelet coefficients derived from different types of workload characteristics.
Figure 4-3 Phase classification using hybrid wavelet coefficients (program runtime statistics 1 through n are each passed through a DWT to produce a wavelet coefficient set; the sets are combined into hybrid wavelet coefficients, which feed K-means clustering and the COV evaluation)

As shown in Figure 4-3, a DWT is applied to each type of workload characteristic. The generated wavelet coefficients from different categories can be assembled together to form a signature for a data clustering algorithm. Our objective is to improve wavelet domain phase classification accuracy across different programs while using an equivalent amount of information to represent program behavior. We choose a set of 16 wavelet coefficients as the phase signature since it provides sufficient precision in capturing program dynamics when a single type of program characteristic is used. If a phase signature can be composed using multiple workload characteristics, there are many ways to form a 16-dimension phase signature. For example, a phase signature can be generated using one wavelet coefficient from 16 different workload characteristics (16×1), or it can be composed

CHAPTER 4
IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE ANALYSIS

In this chapter, we focus on workload-statistics-based phase analysis since, on a given machine configuration and environment, it is more suitable for identifying how the targeted architecture features vary during program execution. In contrast, phase classification using program code structures lacks the capability of informing how workloads behave architecturally [13, 30]. Therefore, phase analysis using specified workload characteristics allows one to explicitly link the targeted architecture features to the classified phases.
For example, if phases are used to optimize cache efficiency, the workload characteristics that reflect cache behavior can be used to explicitly classify program execution into cache performance/power/reliability oriented phases. Program code structure based phase analysis identifies similar phases only if they have similar code flow. There can be cases where two sections of code have different code flow but exhibit similar architectural behavior [13]. Code flow based phase analysis would then classify them as different phases. Another advantage of workload-statistics-based phase analysis is that when multiple threads share the same resource (e.g. pipeline, cache), using workload execution information to classify phases makes it possible to capture program dynamic behavior due to the interactions between threads. The key goal of workload execution based phase analysis is to accurately and reliably discern and recover phase behavior from various program runtime statistics represented as large-volume, high-dimension and noisy data. To effectively achieve this objective, recent work [30, 31] proposes using wavelets as a tool to assist phase analysis. The basic idea is to transform workload time domain behavior into the wavelet domain. The generated wavelet coefficients, which extract compact yet informative program runtime features, are then assembled together to

Figure 7-12 Roles of input parameters (star plots of design parameters by regression tree for CPU1, MEM1 and MIX1; parameters, clockwise: ly0_th, ly0_fl, ly0_bench, ly1_th, ly1_fl, ly1_bench, ly2_th, ly2_fl, ly2_bench, ly3_th, ly3_fl, ly3_bench, TIM_cap, TIM_res, TIM_th, HS_cap, HS_res, HS_side, HS_th, HP_side, HP_th, am_temp, Iss_size)

Figure 7-12 shows the most frequent splits within the regression tree that models the most significant wavelet coefficient. A star plot [41] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set.
The size shown for each parameter is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information such as: Which variables are dominant for a given dataset? Which observations show similar behavior? As can be seen, the floor-planning of each layer and the core configuration largely affect the thermal spatial behavior of the studied workloads.

dimensionality reduction of large volume, high dimension raw program execution statistics from the time domain and hence can be integrated with a sampling mechanism to efficiently increase the scalability of phase analysis of large scale phase behavior on long-running workloads. To address workload variability issues in phase classification, wavelet-based denoising can be used to extract the essential features of workload behavior from their run-time non-deterministic (i.e., noisy) statistics. For workload prediction, in chapter 5 we propose the use of wavelet neural networks to build accurate predictive models for workload dynamics driven microarchitecture design space exploration, overcoming the problems of monolithic, global predictive models. We show that wavelet neural networks can be used to accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluate the efficiency of using the proposed techniques to predict workload dynamic behavior in the performance, power, and reliability domains. We also perform extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy and to identify microarchitecture parameters that significantly affect workload dynamic behavior. To evaluate the efficiency of scenario-driven architecture optimizations across different domains, we also present a case study using workload-dynamics-aware predictive models.
Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios. To our knowledge, the model we proposed is the first one that can track complex program dynamic behavior across different microarchitecture configurations. We believe our workload dynamics forecasting techniques will allow architects to quickly evaluate a rich set of architecture optimizations that target workload dynamics at an early microarchitecture design stage.

efficiency, processor architects and chip designers rely on detailed yet slow simulations to model thermal characteristics and analyze various design tradeoffs. However, due to the sheer size of the design space, such techniques are very expensive in terms of time and cost. In chapter 7, we aim to develop computationally efficient methods and models which allow architects and designers to rapidly yet informatively explore the large thermal design space of 3D multi-core architectures. Our models achieve several orders of magnitude speedup compared to simulation based methods. Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity. Moreover, our models have the capability of capturing complex 2D thermal spatial patterns and can be used to forecast both the location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging 3D multi-core design era, we believe that the proposed thermal predictive models will be valuable for architects to quickly and informatively examine a rich set of thermal-aware design alternatives and thermal-oriented optimizations for large and sophisticated architecture substrates at an early design stage.

wavelet coefficients, our models can accurately reconstruct architecture 2D spatial behavior across the design space.
Using both multi-programmed and multi-threaded workloads, we extensively evaluate the efficiency of using 2D wavelet neural networks for predicting the complex behavior of non-uniformly accessed cache designs with widely varied configurations.

Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction

We view the 2D spatial characteristics yielded on large and distributed architecture substrates as a nonlinear function of architecture design parameters. Instead of inferring the spatial behavior via exhaustively obtaining architecture characteristics on each individual node/component, we employ wavelet analysis to approximate it and then use a neural network to forecast the approximated behavior across a large architecture design space. Previous work [21, 23, 25, 50] shows that neural networks can accurately predict the aggregated workload behavior across varied architecture configurations. Nevertheless, monolithic global neural network models lack the ability to informatively reveal complex workload/architecture interactions at a large scale. To overcome this disadvantage, we propose combining 2D wavelet transforms and neural networks that incorporate multiresolution analysis into a set of neural networks for spatial characteristics prediction of multi-core oriented architecture substrates. The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both global trend and local variation of large data sets using a small set of wavelet coefficients. The local characteristics are decomposed into lower scales of wavelet coefficients (high frequencies) which are utilized for detailed analysis and prediction of individual or subsets of cores/components, while the global trend is decomposed into higher scales of wavelet coefficients (low frequencies) that are used for the analysis and prediction of slow trends across many cores or distributed hardware components.
Collectively, these wavelet coefficients provide

characteristics exceed the threshold level, processors can start to respond before power/reliability reaches or surpasses the threshold level.

Figure 5-10 Threshold-based workload execution (panels for CPI, power and AVF, each with 1Q, 2Q and 3Q series)

Figure 5-11 further illustrates detailed workload execution scenario predictions on benchmark bzip2. Both simulation and prediction results are shown. The predicted results closely track the varied program dynamic behavior in different domains.

Figure 5-11 Threshold-based workload scenario prediction: (a) performance, (b) power, (c) reliability (simulation and prediction plotted over samples, with the threshold marked)

Workload Dynamics Driven Architecture Design Space Exploration

In this section, we present a case study to demonstrate the benefit of applying workload dynamics prediction in early architecture design space exploration. Specifically, we show that workload dynamics prediction models can effectively forecast the worst-case operation

CHAPTER 2
WAVELET TRANSFORM

We use wavelets as an efficient tool for capturing workload behavior. To familiarize the reader with the general methods used in this research, we provide a brief overview of wavelet analysis and show how program execution characteristics can be represented using wavelet analysis.

Discrete Wavelet Transform (DWT)

Wavelets are mathematical tools that use a prototype function (called the analyzing or mother wavelet) to transform data of interest into different frequency components, and then analyze each component with a resolution matched to its scale.
Therefore, the wavelet transform is capable of providing a compact and effective mathematical representation of data. In contrast to Fourier transforms, which only offer frequency representations, wavelet transforms provide time and frequency localization simultaneously. Wavelet analysis allows one to choose wavelet functions from numerous functions [26, 27]. In this section, we provide a quick primer on wavelet analysis using the Haar wavelet, which is the simplest form of wavelet. Consider a data series $X_{n,k}$, $k = 0, 1, 2, \ldots$, at the finest time scale resolution level $2^n$. This time series might represent a specific program characteristic (e.g., number of executed instructions, branch mispredictions and cache misses) measured at a given time scale. We can coarsen this event series by averaging (with a slightly different normalization factor) over non-overlapping blocks of size two

$$X_{n-1,k} = \frac{1}{\sqrt{2}}\left(X_{n,2k} + X_{n,2k+1}\right) \tag{2-1}$$

and generate a new time series $X_{n-1}$, which is a coarser granularity representation of the original series $X_n$. The difference between the two representations, known as details, is

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES

By CHANG BURM CHO
December 2008
Chair: Tao Li
Major: Electrical and Computer Engineering

Modeling and analyzing how workload and architecture interact are at the foundation of computer architecture research and practical design. As contemporary microprocessors become increasingly complex, many challenges related to the design, evaluation and optimization of their architectures crucially rely on exploiting workload characteristics.
While conventional workload characterization methods measure aggregated workload behavior and the state-of-the-art tools can detect program time-varying patterns and cluster them into different phases, existing techniques generally lack the capability of gaining insightful knowledge on the complex interaction between software and hardware, a necessary first step to design cost-effective computer architecture. This limitation will only be exacerbated by the rapid growth of software functionality and runtime and hardware design complexity and integration scale. For instance, while large real-world applications manifest drastically different behavior across a wide spectrum of their runtime, existing methods only focus on analyzing workload characteristics using a single time scale. Conventional architecture modeling techniques assume a centralized and monolithic hardware substrate. This assumption, however, will not hold valid since the design trends of multi-/many-core processors will result in large-scale and distributed microarchitecture specific processor core, global and cooperative resource management for

important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0, and (2) order-based: select the first k coefficients and approximate the rest with 0. In this study, we choose to use the magnitude-based scheme since it always outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients does not change drastically across the design space. Our experimental results show that the top ranked wavelet coefficients largely remain consistent across different architecture configurations.
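The two selection schemes can be contrasted with a short sketch. The coefficient values below are made up for illustration, and the helper names `keep_first_k`, `keep_largest_k` and `dropped_energy` are hypothetical. The sketch assumes the coefficients come from an orthonormal transform, in which case the squared error of the reconstruction equals the energy of the zeroed coefficients.

```python
def keep_first_k(coeffs, k):
    """Order-based selection: keep the first k coefficients, zero the rest."""
    return [c if i < k else 0.0 for i, c in enumerate(coeffs)]

def keep_largest_k(coeffs, k):
    """Magnitude-based selection: keep the k largest |coefficients|."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(ranked[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def dropped_energy(coeffs, approx):
    """Sum of squared coefficients that were zeroed out.

    For an orthonormal transform (an assumption of this sketch) this equals
    the squared reconstruction error in the original domain.
    """
    return sum((c - a) ** 2 for c, a in zip(coeffs, approx))

# Made-up coefficient vector, for illustration only.
coeffs = [10.0, 0.5, -6.0, 0.1, 3.0, -0.2, 0.0, 4.0]
k = 3
by_order = keep_first_k(coeffs, k)
by_magnitude = keep_largest_k(coeffs, k)
print("order-based error:    ", dropped_energy(coeffs, by_order))
print("magnitude-based error:", dropped_energy(coeffs, by_magnitude))
```

For a known coefficient vector, keeping the k largest magnitudes can never leave more energy behind than keeping the first k. When the coefficients are themselves predicted, the scheme additionally relies on the ranking staying stable across configurations, which is the condition discussed in the text.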
Evaluation and Results

In this section, we present detailed experimental results using 2D wavelet neural networks to forecast complex, heterogeneous patterns of large scale multi-core substrates running various workloads without using detailed simulation. The prediction accuracy measure is the mean error defined as follows:

$$ME = \frac{1}{N}\sum_{k=1}^{N}\left|\frac{\hat{x}(k) - x(k)}{x(k)}\right| \tag{6-1}$$

where $x$ is the actual value, $\hat{x}$ is the predicted value and $N$ is the total number of samples (e.g. 256 NUCA banks). As prediction accuracy increases, the ME becomes smaller. The prediction accuracies are plotted as boxplots (Figure 6-4). Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between "hinges", which are approximately the first and third quartiles of the ME values.

models. Similarly, searching for a cache friendly workload/core mapping is $3 \times 10^4$ times faster than using the simulation-based method.

Figure 6-12 Temperature profile before and after a DTM policy (simulation and prediction)

is a growing need for methods that can quickly and accurately explore workload dynamic behavior at an early microarchitecture design stage. Such techniques can quickly provide architects with insights into application execution scenarios across a large design space without resorting to detailed, case by case simulations. Researchers have proposed several predictive models [20-25] to reason about aggregated workload behavior at the architecture design stage. However, these models focus on predicting aggregated program statistics (e.g. CPI of the entire workload execution). These monolithic global models are incapable of capturing and revealing program dynamics which contain interesting fine-grain behavior.
To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques and neural network prediction.

As the number of cores on a processor increases, these large and sophisticated multi-core-oriented architectures exhibit increasingly complex and heterogeneous characteristics. Processors with two, four and eight cores have already entered the market. Processors with tens or possibly hundreds of cores may be a reality within the next few years. In the upcoming multi-/many-core era, the design, evaluation and optimization of architectures will demand analysis methods that are very different from those targeting traditional, centralized and monolithic hardware structures. To enable global and cooperative management of hardware resources and efficiency at large scales, it is imperative to analyze and exploit architecture characteristics beyond the scope of individual cores and hardware components (e.g. single cache bank and single interconnect link). To address this important and urgent research task, we developed novel 2D multi-scale predictive models which can efficiently reason about the characteristics of large and sophisticated multi-core oriented architectures during the design space exploration stage without using detailed cycle-level simulations.

CHAPTER 8
CONCLUSIONS

Studying program workload behavior is of growing interest in computer architecture research. The performance, power and reliability optimizations of future computer workloads and systems could involve analyzing program dynamics across many time scales. Modeling and predicting program behavior at a single scale has many limitations. For example, samples taken from a single, fine-grained interval may not be useful in forecasting how a program behaves at medium or large time scales.
In contrast, observing program behavior using a coarse-grained time scale may lose opportunities that can be exploited by hardware and software in tuning resources to optimize workload execution at a fine-grained level. In chapter 3, we proposed new methods, metrics and a framework that can help researchers and designers to better understand phase complexity and the changing of program dynamics across multiple time scales. We proposed using wavelet transformations of code execution and runtime characteristics to produce a concise yet informative view of program dynamic complexity. We demonstrated the use of this information in phase classification, which aims to produce phases that exhibit a similar degree of complexity. Characterizing phase dynamics across different scales provides insightful knowledge and abundant features that can be exploited by hardware and software in tuning resources to meet the requirements of workload execution at different granularities. In chapter 4, we extend the scope of chapter 3 by (1) exploring and contrasting the effectiveness of using wavelets on a wide range of program execution statistics for phase analysis; and (2) investigating techniques that can further optimize the accuracy of wavelet-based phase classification. More importantly, we identify additional benefits that wavelets can offer in the context of phase analysis. For example, wavelet transforms can provide efficient

Evaluation and Results
Leveraging 2D Geometric Characteristics to Explore Cooperative Multi-core Oriented Architecture Design and Optimization
7 THERMAL DESIGN SPACE EXPLORATION OF 3D DIE STACKED MULTI-CORE PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS
Combining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction
Experimental Methodology
Experimental Results
8 CONCLUSIONS
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

In Chapter 6, we explore novel predictive techniques that can quickly, accurately and informatively analyze the design trade-offs of future large-scale multi-/many-core architectures in a scalable fashion. The characteristics that workloads exhibit on these architectures are complex phenomena since they typically contain a mixture of behavior localized at different scales. Applying wavelet analysis, our method can capture the heterogeneous behavior across a wide range of spatial scales using a limited set of parameters. We show that these parameters can be cost-effectively predicted using non-linear modeling techniques such as neural networks with low computational overhead. Experimental results show that our scheme can accurately predict the heterogeneous behavior of large-scale multi-core oriented architecture substrates. To our knowledge, the model we proposed is the first that can track complex 2D workload/architecture interaction across design alternatives. We further examined using the proposed models to effectively explore multi-core aware resource allocations and design evaluations. For example, we build analytical models that can quickly forecast workloads' 2D working sets across different NUCA configurations. Combined with interference estimation, our models can determine the geometric-aware workload/core mappings that lead to minimal interference.
We also show that our models can be used to predict the location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging multi-/many-core design era, we believe that the proposed 2D predictive model will allow architects to quickly yet informatively examine a rich set of design alternatives and optimizations for large and sophisticated architecture substrates at an early design stage. Leveraging 3D die stacking technologies in multi-core processor design has received increased momentum in both the chip design industry and the research community. One of the major roadblocks to realizing 3D multi-core design is its inefficient heat dissipation. To ensure thermal

methods that not only predict aggregate thermal behavior but also identify both the size and the geographic distribution of thermal hotspots. In this work, we aim to develop fast and accurate predictive models to achieve this goal.

Figure 7-2 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations (panels: Config. A, Config. B, Config. C, Config. D)

Figure 7-3 illustrates the original thermal behavior and the 2D wavelet transformed thermal behavior.

Figure 7-3 Example of using 2D DWT to capture thermal spatial characteristics: (a) Original thermal behavior, (b) 2D wavelet transformed thermal behavior

As can be seen, the 2D thermal characteristics can be effectively captured using a small number of wavelet coefficients (e.g. Average (LL=1) or Average (LL=2)). Since a small set of wavelet coefficients provides concise yet insightful information on 2D thermal spatial characteristics, we use predictive models (i.e. neural networks) to relate them individually to various design parameters. Through the inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize 2D thermal spatial characteristics across the design space.
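As a rough illustration of how a few 2D coefficients can summarize a spatial map, the following sketch performs one level of an orthonormal 2D Haar transform and then synthesizes the map from the low-frequency (LL) sub-band alone. The Haar wavelet, the sub-band naming (LL, LH, HL, HH), the 4x4 grid values and the function names `haar2d` and `ihaar2d` are all assumptions made for this example rather than details confirmed by the dissertation.

```python
def haar2d(grid):
    """One level of an orthonormal 2D Haar transform on an even-sized grid.

    Each 2x2 block [[a, b], [c, d]] yields one value in each sub-band.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid dimensions must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for q in range(0, cols, 2):
            a, b = grid[r][q], grid[r][q + 1]
            c, d = grid[r + 1][q], grid[r + 1][q + 1]
            r_ll.append((a + b + c + d) / 2.0)  # local average (scaled by 2)
            r_lh.append((a - b + c - d) / 2.0)  # change across columns
            r_hl.append((a + b - c - d) / 2.0)  # change across rows
            r_hh.append((a - b - c + d) / 2.0)  # diagonal change
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Inverse of haar2d: rebuild the full-resolution grid."""
    out = []
    for r in range(len(ll)):
        top, bottom = [], []
        for q in range(len(ll[0])):
            s, h, v, g = ll[r][q], lh[r][q], hl[r][q], hh[r][q]
            top.extend([(s + h + v + g) / 2.0, (s - h + v - g) / 2.0])
            bottom.extend([(s + h - v - g) / 2.0, (s - h - v + g) / 2.0])
        out.append(top)
        out.append(bottom)
    return out

# Made-up 4x4 map with a hot region in the upper right corner.
grid = [[1.0, 1.0, 5.0, 7.0],
        [1.0, 1.0, 9.0, 11.0],
        [2.0, 2.0, 2.0, 2.0],
        [2.0, 2.0, 2.0, 2.0]]
ll, lh, hl, hh = haar2d(grid)
zeros = [[0.0] * len(ll[0]) for _ in ll]
coarse = ihaar2d(ll, zeros, zeros, zeros)  # synthesis from LL only
for row in coarse:
    print(row)
```

Using only the four LL values, the synthesized map still shows where the hot region is and how large it is, although the variation inside each 2x2 block is averaged away. Applying the transform again to the LL sub-band would give coarser levels.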
Compared with a simulation-based method, predicting a small set of wavelet coefficients using analytical

that neural networks can accurately predict aggregated workload behavior during design space exploration. Nevertheless, the monolithic global neural network models lack the capability of revealing complex workload dynamics. To overcome this disadvantage, we propose using wavelet neural networks that incorporate multiscale wavelet analysis into a set of neural networks for workload dynamics prediction. The wavelet transform is a very powerful tool for dealing with dynamic behavior since it captures both global and local workload behavior using a set of wavelet coefficients. The short-term workload characteristics are decomposed into the lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction, while the global workload behavior is decomposed into higher scales of wavelet coefficients (low frequencies) that are used for the analysis and prediction of slow trends in the workload execution. Collectively, these coordinated scales of time and frequency provide an accurate interpretation of workload dynamics. Our wavelet neural networks use a separate RBF neural network to predict individual wavelet coefficients at different scales. The predictions of the individual wavelet coefficients proceed independently. Predicting each wavelet coefficient with a separate neural network simplifies the training task of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to predict the workload dynamics. Figure 5-3 shows our hybrid neuro-wavelet scheme for workload dynamics prediction. Given the observed workload dynamics on training data, our aim is to predict workload dynamic behavior under different architecture configurations. The hybrid scheme basically involves three stages.
In the first stage, the time series is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN, and in the third stage, the approximated time series is recovered from the predicted wavelet coefficients.

accuracy than the original data. This is because in phase classification, randomly occurring peaks in the gathered workload execution data could have a deleterious effect on the phase classification results. Wavelet denoising smooths these irregular peaks and makes the phase classification method more robust. Various types of wavelet denoising can be performed by choosing different threshold selection rules (e.g. rigrsure, heursure, sqtwolog and minimaxi), by performing hard (h) or soft (s) thresholding, and by specifying the multiplicative threshold rescaling model (e.g. one, sln, and mln). We compare the efficiency of different denoising techniques that have been implemented in the MATLAB tool [34]. Due to the space limitation, only the results on benchmarks bzip2, gcc and mcf are shown in Figure 4-9. As can be seen, different wavelet denoising schemes achieve comparable accuracy in phase classification.

Figure 4-9 Efficiency of different denoising schemes (benchmarks bzip2, gcc and mcf, plotted across wavelet denoising schemes)

We further compare the efficiency of using the 16×1 hybrid scheme (Hybrid), the best case that a single type of workload characteristic can achieve (Individual Best) and the Simpoint based phase classification that uses basic block vectors (BBV). The results for the 12 SPEC integer benchmarks are shown in Figure 4-4.

Figure 4-4 Phase classification accuracy of using 16×1 hybrid scheme (benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, and AVG)

As can be seen, the Hybrid outperforms the Individual Best on 10 out of the 12 benchmarks.
The Hybrid also outperforms the BBV-based Simpoint method on 10 out of the 12 cases.

Scalability

Above we can see that wavelet domain phase analysis can achieve higher accuracy. In this subsection, we address another important issue in phase analysis using workload execution characteristics: scalability. Counters are usually used to collect workload statistics during program execution. The counters may overflow if they are used to track large scale phase behavior on long running workloads. Today, many large, real-world workloads can run for days, weeks or even months before completion, and this trend is likely to continue in the future. To perform phase analysis on the next generation of computer workloads and systems, phase classification methods should be capable of scaling with the increasing program execution time.

[63] J. Sharkey, D. Ponomarev, K. Ghose, "M-Sim: A Flexible, Multithreaded Architectural Simulation Environment," Technical Report CS-TR-05-DPO1, Department of Computer Science, State University of New York at Binghamton, 2005.

$$\mathrm{XCOR}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \qquad (3\text{-}1)$$

where X is the original data series and Y is the approximated data series. Note that XCOR = 1 if program dynamics observed at the finest scale and its approximation at coarser granularity exhibit perfect correlation, and XCOR = 0 if the program dynamics and its approximation vary independently across time scales. X and Y can be any runtime program characteristics of interest. In this chapter, we use instructions per cycle (IPC) as a metric due to its wide usage in computer architecture design and performance evaluation. To sample IPC dynamics, we break down the entire program execution into 1024 intervals and then sample 1024 IPC data within each interval. Therefore, at the finest resolution level, the program dynamics of each execution interval are represented by an IPC data series with a length of 1024.
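As a rough illustration of equation (3-1), the cross-correlation between an original series and its approximation can be computed as below. This is only a sketch: the function name `xcor` is hypothetical, and returning 0.0 when either series has no variation is an assumption made here, since equation (3-1) is undefined in that case.

```python
import math


def xcor(original, approximated):
    """Cross-correlation coefficient between two equal-length series (equation 3-1)."""
    n = len(original)
    if n == 0 or n != len(approximated):
        raise ValueError("series must be non-empty and of equal length")
    mean_x = sum(original) / n
    mean_y = sum(approximated) / n
    # Numerator: sum of products of deviations from the means.
    covariance = sum((x - mean_x) * (y - mean_y)
                     for x, y in zip(original, approximated))
    # Denominator: square root of the product of the summed squared deviations.
    spread_x = sum((x - mean_x) ** 2 for x in original)
    spread_y = sum((y - mean_y) ** 2 for y in approximated)
    denominator = math.sqrt(spread_x * spread_y)
    if denominator == 0:
        # Assumption: treat a constant series as uncorrelated.
        return 0.0
    return covariance / denominator
```

Applied to an interval's 1024-sample IPC series and a same-length reconstruction, the function returns a single value between -1 and 1.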
We then apply wavelet multiresolution analysis to each interval. In a wavelet transform, each DWT operation produces an approximation coefficient vector with a length equal to half of the input data. We remove the detail coefficients after each wavelet transform and only use the approximation part to reconstruct IPC dynamics and then calculate the XCOR between the original data and the reconstructed data. We apply the discrete wavelet transform to the approximation part iteratively until the length of the approximation coefficient vector is reduced to 1. Each approximation coefficient vector is used to reconstruct a full IPC trace with a length of 1024, and the XCOR between the original and reconstructed traces is calculated using equation (3-1). As a result, for each program execution interval, we obtain an XCOR vector, in which each element represents the cross-correlation coefficient between the original workload dynamics and the approximated workload dynamics at a different scale. Since

[51] K. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill, D. Wood, "Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset," Computer Architecture News (CAN), 2005.
[52] Virtutech Simics, http://www.virtutech.com/products/
[53] S. Woo, M. Ohara, E. Torrie, J. Singh, A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," in Proc. of the International Symposium on Computer Architecture, 1995.
[54] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-Aware Microarchitecture," in Proc. of the International Symposium on Computer Architecture, 2003.
[55] K. Banerjee, S. Souri, P. Kapur, and K. Saraswat, "3-D ICs: A Novel Chip Design for Improving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration," Proceedings of the IEEE, vol. 89, pp. 602-633, May 2001.
[56] Y. F. Tsai, F. Wang, Y. Xie, N. Vijaykrishnan, M. J. Irwin, "Design Space Exploration for 3-D Cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 4, April 2008.
[57] B. Black, D. Nelson, C. Webb, and N. Samra, "3D Processing Technology and its Impact on IA32 Microprocessors," in Proc. of the 22nd International Conference on Computer Design, pp. 316-318, 2004.
[58] P. Reed, G. Yeung, and B. Black, "Design Aspects of a Microprocessor Data Cache using 3D Die Interconnect Technology," in Proc. of the International Conference on Integrated Circuit Design and Technology, pp. 15-18, 2005.
[59] M. Healy, M. Vittes, M. Ekpanyapong, C.S. Ballapuram, S.K. Lim, H.S. Lee, G.H. Loh, "Multiobjective Microarchitectural Floorplanning for 2-D and 3-D ICs," IEEE Trans. on Computer Aided Design of IC and Systems, vol. 26, no. 1, pp. 38-52, 2007.
[60] S. K. Lim, "Physical design for 3D system on package," IEEE Design & Test of Computers, vol. 22, no. 6, pp. 532-539, 2005.
[61] K. Puttaswamy, G. H. Loh, "Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors," in Proc. of the International Symposium on High-Performance Computer Architecture, 2007.
[62] Y. Wu, Y. Chang, "Joint Exploration of Architectural and Physical Design Spaces with Thermal Consideration," in Proc. of International Symposium on Low Power Electronics and Design, 2005.

of increasing resolutions. Different resolutions can be obtained by adding difference values back or subtracting differences from the averages.

Original Data
Scaling Filter (G0): 3.5, 22.5, 10, 11.5; Wavelet Filter (H0): -0.5, -2.5, 5, 8.5
Scaling Filter (G1): 13, 10.75; Wavelet Filter (H1): -9.5, -0.75
Scaling Filter (G2): 11.875; Wavelet Filter (H2): 1.125
Resulting coefficients: 11.875 (Approximation, Lev 0); 1.125 (Detail, Lev 1); -9.5, -0.75 (Detail Coefficients, Level 2); -0.5, -2.5, 5, 8.5 (Detail Coefficients, Level 3)
Figure 2-1 Example of Haar wavelet transform.
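The averaging and differencing behind Figure 2-1 can be sketched in a few lines of Python. This is a minimal illustration rather than the dissertation's implementation: the function names are hypothetical, and the eight-sample input is not given in the text. It was back-computed from the coefficients listed in the figure, so it should be read as an assumption.

```python
def haar_step(data):
    """One Haar level: pairwise averages and half-differences (first minus second)."""
    pairs = list(zip(data[0::2], data[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details


def haar_transform(data):
    """Repeat haar_step on the averages until a single approximation value remains."""
    approx = list(data)
    details_by_level = []
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details_by_level.insert(0, detail)  # keep the coarsest details first
    return approx[0], details_by_level


def haar_inverse(approximation, details_by_level):
    """Rebuild the series by adding and subtracting details, coarsest level first."""
    approx = [approximation]
    for detail in details_by_level:
        approx = [value for a, d in zip(approx, detail) for value in (a + d, a - d)]
    return approx


# Assumed input: inferred from the figure's coefficients, not stated in the text.
signal = [3, 4, 20, 25, 15, 5, 20, 3]
approximation, details = haar_transform(signal)
restored = haar_inverse(approximation, details)
```

With this input, the intermediate averages and details match the values shown for each filter stage in the figure, and the inverse transform returns the input series when all coefficients are used.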
For instance, {13, 10.75} = {11.875+1.125, 11.875-1.125}, where 11.875 and 1.125 are the first and the second coefficient respectively. This process can be performed recursively until the finest scale is reached. Therefore, through an inverse transform, the original data can be recovered from wavelet coefficients. The original data can be perfectly recovered if all wavelet coefficients are involved. Alternatively, an approximation of the time series can be reconstructed using a subset of wavelet coefficients. Using a wavelet transform gives time-frequency localization of the original data. As a result, the time domain signal can be accurately approximated using only a few wavelet coefficients since they capture most of the energy of the input data.

Apply DWT to Capture Workload Execution Behavior

Since variation of program characteristics over time can be viewed as signals, we apply discrete wavelet analysis to capture program execution behavior. To obtain time domain workload execution characteristics, we break down entire program execution into intervals and

is a one-time overhead and can be amortized in the design space exploration stage where a large number of cases need to be examined.

Table 6-5 Design space evaluation speedup (simulation vs. prediction)

| Benchmarks | Simulation vs. Prediction |
|---|---|
| gcc(x8) | 2,181x |
| mcf(x8) | 3.482x |
| CPU | 3,691x |
| MIX | 472x |
| MEM | 435x |
| barnes | 659x |
| fmm | 1,824x |
| ocean-co | 1,077x |
| ocean-nc | 1,169x |
| water-sp | 738x |
| cholesky | 696x |
| fft | 670x |
| radix | 1,010x |

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input architecture design parameters were ranked based on either split order or split frequency. The design parameters which cause the most output variation tend to be split earliest and most often in the constructed regression tree.
Therefore, architecture parameters that largely determine the values of a wavelet coefficient are located higher than others in the regression tree and they have a larger number of splits than others. We present in Figure 6-6 (shown as star plot) the initial and most frequent splits within the regression trees that model the most significant wavelet coefficients. A star plot is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The data length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information

6-7 2D NUCA footprint (geometric shape) of mesa
6-8 2D cache interference in NUCA
6-9 Pearson correlation coefficient (all 50 test cases are shown)
6-10 2D NUCA thermal profile (simulation vs. prediction)
6-11 NUCA 2D thermal prediction error
6-12 Temperature profile before and after a DTM policy
7-1 2D within-die and cross-dies thermal variation in 3D die stacked multi-core processors
7-2 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations
7-3 Example of using 2D DWT to capture thermal spatial characteristics
7-4 Hybrid neuro-wavelet thermal prediction framework
7-5 Selected floor-plans
7-6 Processor core floor-plan
7-7 Cross section view of the simulated 3D quad-core chip
7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16)
7-9 Simulated and predicted thermal behavior
7-10 ME boxplots of prediction accuracies with different number of wavelet coefficients
7-11 Benefit of predicting wavelet coefficients
7-12 Roles of input parameters

shown in Figure 4-2 suggest that directly employing runtime samples in phase classification is less desirable. To address the scalability issue in characterizing large scale program phases using workload execution statistics, wavelet based dimensionality reduction techniques can be applied to extract the essential features of workload behavior from the sampled statistics. The observations we made in previous sections motivate the use of DWT to absorb large volume sampled raw data and produce highly efficient wavelet domain signatures for phase analysis, as shown in Figure 4-5 (b). Figure 4-6 further shows phase analysis accuracy after applying wavelet techniques on the sampled workload statistics using sampling counters with different sizes. As can be seen, sampling enables using counters with limited size to study large program phases. In general, sampling can scale up naturally with the interval size as long as the sampled values do not overflow the counters.
Therefore, with an increasing mismatch between phase interval and counter size, the sampling frequency is increased, resulting in an even higher volume of sampled data. Using wavelet domain phase analysis can effectively infer program behavior from a large set of data collected over a long time span, resulting in low COVs in phase analysis.

Workload Variability

As described earlier, our methods collect various program execution statistics and use them to classify program execution into different phases. Such phase classification generally relies on comparing the similarity of the collected statistics. Ideally, different runs of the same code segment should be classified into the same phase. Existing phase detection techniques assume that workloads have deterministic execution. On real systems, with operating system interventions and other threads, applications manifest behavior that is not the same from run to run. This variability can stem from changes in system state that alter cache, TLB or I/O behavior, system calls or interrupts, and can result in noticeably different timing and performance behavior

Evaluation and Results

In this section, we present detailed experimental results on using wavelet neural networks to predict workload dynamics in the performance, reliability and power domains. The workload dynamics prediction accuracy measure is the mean square error (MSE), defined as follows:

$$\mathrm{MSE} = \frac{1}{N} \sum_{k=1}^{N} \left( x(k) - \hat{x}(k) \right)^2 \qquad (5\text{-}3)$$

where x(k) is the actual value, $\hat{x}(k)$ is the predicted value and N is the total number of samples. As prediction accuracy increases, the MSE becomes smaller.

Figure 5-5 MSE boxplots of workload dynamics prediction (x-axis labels as extracted: bzip, crafty, eon, gap, gcc, mcf, parser, per, swim, twolf, vortex, vpr)

CHAPTER 1 INTRODUCTION

Modeling and analyzing how workloads behave on the underlying hardware have been essential ingredients of computer architecture research. By knowing program behavior, both hardware and software can be tuned to better suit the needs of applications. As computer systems become more adaptive, their efficiency increasingly depends on the dynamic behavior that programs exhibit at runtime. Previous studies [1-5] have shown that program runtime characteristics exhibit time varying phase behavior: workload execution manifests similar behavior within each phase while showing distinct characteristics between different phases. Many challenges related to the design, analysis and optimization of complex computer systems can be efficiently solved by exploiting program phases [1, 6-9]. For this reason, there is a growing interest in studying program phase behavior. Recently, several phase analysis techniques have been proposed [4, 7, 10-19]. Very few of these studies, however, focus on understanding and characterizing program phases from their dynamics and complexity perspectives. Consequently, these techniques generally lack the capability of informing phase dynamic behavior. To complement current phase analysis techniques which pay little or no attention to phase dynamics, we develop new methods, metrics and frameworks that have the capability to analyze, quantify, and classify program phases based on their dynamics and complexity characteristics. Our techniques are built on wavelet-based multiresolution analysis, which provides a clear and orthogonal view of phase dynamics by presenting complex dynamic structures of program phases with respect to both time and frequency domains. Consequently, key tendencies can be efficiently identified.
As microprocessor architectures become more complex, architects increasingly rely on exploiting workload dynamics to achieve cost and complexity effective design. Therefore, there

multiresolution analysis of IPC (MRA-IPC). Figure 3-9 shows the phase classification results. As can be seen, the observations we made on the BBV-based cases hold valid on the IPC-based cases. This implies that the proposed multiresolution analysis can be applied to both methods to improve the capability of capturing phase dynamics.

Figure 3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics (series: IPC, MRA-IPC; y-axis: 0% to 100%)

LIST OF REFERENCES

[1] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[2] E. Duesterwald, C. Cascaval and S. Dwarkadas, "Characterizing and Predicting Program Behavior and Its Variability," in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2003.
[3] J. Cook, R. L. Oliver, and E. E. Johnson, "Examining Performance Differences in Workload Execution Phases," in Proc. of the IEEE International Workshop on Workload Characterization, 2001.
[4] X. Shen, Y. Zhong and C. Ding, "Locality Phase Prediction," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2004.
[5] C. Isci and M. Martonosi, "Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data," in Proc. of the International Symposium on Microarchitecture, 2003.
[6] T. Sherwood, S. Sair and B. Calder, "Phase Tracking and Prediction," in Proc. of the International Symposium on Computer Architecture, 2003.
[7] A. Dhodapkar and J. Smith, "Managing Multi-Configurable Hardware via Dynamic Working Set Analysis," in Proc. of the International Symposium on Computer Architecture, 2002.
[8] M. Huang, J. Renau and J. Torrellas, "Positional Adaptation of Processors: Application to Energy Reduction," in Proc. of the International Symposium on Computer Architecture, 2003.
[9] W. Liu and M. Huang, "EXPERT: Expedited Simulation Exploiting Program Behavior Repetition," in Proc. of International Conference on Supercomputing, 2004.
[10] T. Sherwood, E. Perelman and B. Calder, "Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications," in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2001.
[11] A. Dhodapkar and J. Smith, "Comparing Program Phase Detection Techniques," in Proc. of the International Symposium on Microarchitecture, 2003.
[12] C. Isci and M. Martonosi, "Identifying Program Power Phase Behavior using Power Vectors," in Proc. of the International Workshop on Workload Characterization, 2003.

predicting a small set of wavelet coefficients using analytical models is computationally efficient and is scalable to large scale architecture design.

Figure 2-8 Example of applying 2D DWT on a non-uniformly accessed cache (panels: (a) NUCA hit numbers, (b) 2D DWT (L=0), (c) 2D DWT (L=1))

and used as the input to the k-means clustering algorithms. Figure 4-1 illustrates the above described procedure.

Figure 4-1 Phase analysis methods: time domain vs. wavelet domain (time domain: Program Runtime Statistics, Random Projection with Dimensionality = 16, K-means Clustering, COV; wavelet domain: Program Runtime Statistics, DWT with Number of Wavelet Coefficients = 16, K-means Clustering, COV)

We investigated the efficiency of applying wavelet domain analysis on 10 different workload execution characteristics, namely, the numbers of executed loads (load), stores (store), branches (branch), the number of cycles a processor spends on executing a fixed amount of instructions (cycles), the number of branch mispredictions (branch miss), the number of L1 instruction cache, L1 data cache and L2 cache hits (il1 hit, dl1 hit and ul2 hit), and the number of instruction and data TLB hits (itlb hit and dtlb hit). Figure 4-2 shows the COVs of phase classifications in time and wavelet domains when each type of workload execution characteristic is used as an input. As can be seen, compared with using raw, time domain workload data, the wavelet domain analysis significantly improves phase classification accuracy, and this observation holds for all the investigated workload characteristics across all the examined benchmarks. This is because in the time domain, collected program runtime statistics are treated as high-dimension time series data. Random projection methods are used to reduce the dimensionality of feature vectors which represent a workload signature at a given execution interval. However, the simple random projection function can increase the aliasing between phases and reduce the accuracy of phase detection.

optimizations that consider multi-/many-cores. In this work, we study the suitability of using the proposed models for novel multi-core oriented NUCA optimizations.

Leveraging 2D Geometric Characteristics to Explore Cooperative Multi-core Oriented Architecture Design and Optimization

In this section, we present case studies to demonstrate the benefit of incorporating 2D workload/architecture behavior prediction into the early stages of microarchitecture design.
In the first case study, we show that our geospatial-aware predictive models can effectively estimate workloads' 2D working sets and that such information can be beneficial in searching for cache friendly workload/core mappings in multi-core environments. In the second case study, we explore using 2D thermal profile predictive models to accurately and informatively forecast the area and location of thermal hotspots across large NUCA substrates.

Case Study 1: Geospatial-aware Application/Core Mapping

Our 2D geometry-aware architecture predictive models can be used to explore global, cooperative resource management and optimization in multi-core environments.

Figure 6-7 2D NUCA footprint (geometric shape) of mesa (panels: Core 0 through Core 7)

For example, as shown in Figure 6-7, a workload will exhibit a 2D working set with different geometric shapes when running on different cores. The exact shape of the access

training of an RBF network involves determining the center point and a radius for each RBF, and the weights of each RBF which determine the wavelet coefficients.

Experimental Methodology

We evaluate the efficiency of 2D wavelet neural networks for forecasting spatial characteristics of large-scale multi-core NUCA design using the GEMS 1.2 [51] toolset interfaced with the Simics [52] full-system functional simulator. We simulate a SPARC V9 8-core CMP running Solaris 9. We model in-order issue cores for this study to keep the simulation time tractable. The processors have private L1 caches and the shared L2 is a 256-bank 16MB NUCA. The private L1 caches of different processors are maintained coherent using a distributed directory based protocol. To model the L2 cache, we use the Ruby NUCA cache simulator developed in [47], which includes an on-chip network model.
The network models all messages communicated in the system including all requests, responses, replacements, and acknowledgements. Table 6-1 summarizes the baseline machine configurations of our simulator.

Table 6-1 Simulated machine configuration (baseline)

| Parameter | Configuration |
|---|---|
| Number of | 8 |
| Issue Width | 1 |
| L1 (split I/D) | 64KB, 64B line, write-allocation |
| L2 (NUCA) | 16 MB (256 x 64KB), 64B line |
| Memory | Sequential |
| Memory | 4 GB of DRAM, 250 cycle latency, 4KB |

Our baseline processor/L2 NUCA organization is similar to that of Beckmann and Wood [47] and is illustrated in Figure 6-3. Each processor core (including L1 data and instruction caches) is placed on the chip boundary and eight such cores surround a shared L2 cache. The L2 is partitioned into 256 banks (grouped as 16 bank clusters) and connected with an interconnection network. Each core has a cache controller that routes the core's requests to the appropriate cache bank. The NUCA design

Figure 7-9 Simulated and predicted thermal behavior (panels: CPU1, MEM1, MIX1; rows: Prediction, Simulation)

The results show that our predictive models can track both the size and location of thermal hotspots. We further examine the accuracy of predicting the locations and area of the hottest spots, and the results are similar to those presented in Figure 7-8.

Figure 7-10 ME boxplots of prediction accuracies with different number of wavelet coefficients (panels: CPU1, MEM1, MIX1; x-axis: 16wc, 32wc, 64wc, 96wc, 128wc, 256wc)

Figure 7-10 shows the prediction accuracies with different numbers of wavelet coefficients on multi-programmed workloads CPU1, MEM1 and MIX1. In general, the 2D thermal spatial pattern prediction accuracy is increased when more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients.
The cost-effective models should provide high prediction accuracy while maintaining low

we compute the average and difference, so we obtain (3 5 7 1 9 1 5 9) and (1 1 1 -1 5 -1 1 1). Next, we apply 1D wavelet transforms along the Y-axis; for each two points along the Y-axis we compute average and difference (at level 0 in the example shown in Figure 2-7.a). We perform this process recursively until the number of elements in the averaged signal becomes 1 (at level 1 in the example shown in Figure 2-7.a).

Figure 2-7 2D wavelet transforms on 16 cores/hardware components. Panel (a): the original data (row-major: 2 4 4 6 6 8 2 0 4 14 2 0 4 6 8 10) passes through a 1D wavelet along the x-axis, giving the lowpass signal (3 5 7 1 9 1 5 9) and the highpass signal (1 1 1 -1 5 -1 1 1), and then a 1D wavelet along the y-axis, giving the average, horizontal details, vertical details and diagonal details at L=0; the same two steps are repeated on the average at L=1. Panel (b): layout of the average (Avg.), horizontal (Horiz. Det.), vertical (Vert. Det.) and diagonal (Diag. Det.) coefficients.

Figure 2-7.b shows the wavelet domain multi-resolution representation of the 2D spatial data. Figure 2-8 further demonstrates that the 2D architecture characteristics can be effectively captured using a small number of wavelet coefficients (e.g. Average (L=0) or Average (L=1)). Since a small set of wavelet coefficients provides concise yet insightful information on architecture 2D spatial characteristics, we use predictive models (i.e. neural networks) to relate them individually to various architecture design parameters. Through the inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize architecture 2D spatial characteristics across the design space. Compared with a simulation-based method,

predicted workload dynamics trace.
We use the directional symmetry (DS) metric, i.e., the percentage of correctly predicted directions with respect to the target variable, defined as

$$\mathrm{DS} = \frac{1}{N} \sum_{k=1}^{N} \varphi\left( x(k), \hat{x}(k) \right) \qquad (5\text{-}4)$$

where $\varphi(\cdot) = 1$ if $x$ and $\hat{x}$ are both above or both below the threshold and $\varphi(\cdot) = 0$ otherwise. Thus, the DS provides a measure of the number of times the sign of the target is correctly forecasted. In other words, DS = 50% implies that the predicted direction was correct for half of the predictions. In this work, we set three threshold levels (named as Q1, Q2 and Q3 in Figure 5-9) between the max and min values in each trace as follows, where 1Q is the lowest threshold level and 3Q is the highest threshold level.

1Q = MIN + (MAX-MIN)*(1/4)
2Q = MIN + (MAX-MIN)*(2/4)
3Q = MIN + (MAX-MIN)*(3/4)

Figure 5-9 Threshold-based workload execution scenarios (levels from top to bottom: MAX, 3Q, 2Q, 1Q, MIN)

Figure 5-10 shows the results of threshold-based workload dynamic behavior classification. The results are presented as directional asymmetry, which can be expressed as 1 - DS. As can be seen, our wavelet-based RBF neural networks not only effectively capture workload dynamics but also accurately classify workload execution into different scenarios. This suggests that proactive dynamic power and reliability management schemes can be built using the proposed models. For instance, given a power/reliability threshold, our wavelet RBF neural networks can be used to forecast workload execution scenarios. If the predicted workload

network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transforms to synthesize the 2D spatial thermal patterns across each die. Figure 7-4 shows our hybrid neuro-wavelet scheme for 2D spatial thermal characteristics prediction.
Given the observed spatial thermal behavior on training data, our aim is to predict the 2D thermal behavior of each die in 3D die stacked multi-core processors under different design configurations. The hybrid scheme involves three stages. In the first stage, the observed spatial thermal behavior in each layer is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D thermal characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design space vector and predicts a wavelet coefficient. The training of an RBF network involves determining the center point and a radius for each RBF, and the weights of each RBF, which determine the wavelet coefficients. Experimental Methodology Floorplanning and Hotspot Thermal Model In this study, we model four floor-plans that involve processor core and cache structures as illustrated in Figure 7-5. Figure 7-5 Selected floor-plans As can be seen, the processor core is placed at different locations across the different floor- plans. Each floor-plan can be chosen by a layer in the studied 3D die stacking quad-core processors. The size and adjacency of blocks are critical parameters for deriving the thermal Workloads and System Configurations We use both integer and floating-point benchmarks from the SPEC CPU 2000 suite (e.g. bzip2, crafty, eon, facerec, galgel, gap, gcc, lucas, mcf, parser, perlbmk, twolf swim, vortex and vpr) to compose our experimental multiprogrammed workloads (see Table 7-2). We categorize all benchmarks into two classes: CPU-bound and MEM bound applications. We design three types of experimental workloads: CPU, MEM and MIX. The CPU and MEM workloads consist of programs from only the CPU intensive and memory intensive categories respectively. 
MIX workloads are the combination of two benchmarks from the CPU intensive group and two from the memory intensive group.

Table 7-2 Simulation configurations

| Parameter | Configuration |
|---|---|
| Chip Frequency | 3G |
| Voltage | 1.2 V |
| Proc. Technology | 65 nm |
| Die Size | 21 mm x 21 mm |
| Workloads: CPU1 | bzip2, eon, gcc, perlbmk |
| Workloads: CPU2 | perlbmk, mesa, facerec, lucas |
| Workloads: CPU3 | gap, parser, eon, mesa |
| Workloads: MIX1 | gcc, mcf, vpr, perlbmk |
| Workloads: MIX2 | perlbmk, mesa, twolf, applu |
| Workloads: MIX3 | eon, gap, mcf, vpr |
| Workloads: MEM1 | mcf, equake, vpr, swim |
| Workloads: MEM2 | twolf, galgel, applu, lucas |
| Workloads: MEM3 | mcf, twolf, swim, vpr |

These multi-programmed workloads were simulated on our multi-core simulator configured as 3D quad-core processors. We use the Simpoint tool [1] to obtain a representative slice for each benchmark (with full reference input set) and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. The simulations continue until one benchmark within a workload finishes the execution of the representative interval of 250M instructions.

We performed our analysis using twelve SPEC CPU 2000 integer benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex and vpr. All programs were run with the reference input to completion. The runtime workload execution statistics were measured on the SimpleScalar 3.0 sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model we used is detailed in Table 4-1.
Table 4-1 Baseline machine configuration

| Parameter | Configuration |
|---|---|
| Processor Width | 8 |
| ITLB | 128 entries, 4-way, 200 cycle miss |
| Branch Prediction | combined 8K tables, 10 cycle misprediction, 2 |
| BTB | 2K entries, 4-way |
| Return Address Stack | 32 entries |
| L1 Instruction Cache | 32K, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access |
| RUU Size | 128 entries |
| Load/Store Queue | 64 entries |
| Store Buffer | 16 entries |
| Integer ALU | 4 I-ALU, 2 I-MUL/DIV |
| FP ALU | 2 FP-ALU, 1 FP-MUL/DIV |
| DTLB | 256 entries, 4-way, 200 cycle miss |
| L1 Data Cache | 64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access |
| L2 Cache | unified 1MB, 4-way, 128 Byte/line, 12 cycle access |
| Memory Access | 100 cycles |

We use IPC (instructions per cycle) as the metric to evaluate the similarity of program execution within each classified phase. To quantify phase classification accuracy, we use the weighted COV metric proposed by Calder et al. [15]. After classifying all program execution intervals into phases, we examine each phase and compute the IPC for all the intervals in that phase. We then calculate the standard deviation in IPC within each phase, and we divide the standard deviation by the average to get the Coefficient of Variation (COV). We then calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e. weighted COVs) used to compare different phase classifications for a given program.
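The weighted COV computation described above can be sketched as follows. This is an illustrative reading of the description, not the original tooling: the function name `weighted_cov` is hypothetical, intervals are assumed to be of equal length (so a phase's share of execution is its share of intervals), and the population standard deviation is assumed because the text does not say which estimator is used.

```python
from collections import defaultdict
import statistics


def weighted_cov(phase_ids, ipc_values):
    """Overall COV: per-phase (std / mean) of IPC, weighted by the phase's share of intervals."""
    if len(phase_ids) != len(ipc_values) or not ipc_values:
        raise ValueError("need one phase id per IPC sample")
    ipc_by_phase = defaultdict(list)
    for phase, ipc in zip(phase_ids, ipc_values):
        ipc_by_phase[phase].append(ipc)
    total_intervals = len(ipc_values)
    overall = 0.0
    for samples in ipc_by_phase.values():
        mean_ipc = statistics.fmean(samples)
        std_ipc = statistics.pstdev(samples)  # assumption: population standard deviation
        cov = std_ipc / mean_ipc if mean_ipc else 0.0
        # Assumption: equal-length intervals, so weight = fraction of intervals in the phase.
        overall += cov * len(samples) / total_intervals
    return overall
```

A lower value means the intervals grouped into each phase have more similar IPC, which is how the text compares phase classification methods for a given program.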

Full Text |

ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES

By

CHANG BURM CHO

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008

© 2008 Chang Burm Cho

ACKNOWLEDGMENTS

There are many people who are responsible for my Ph.D. research. Most of all I would like to express my gratitude to my supervisor, Dr. Tao Li, for his patient guidance and invaluable advice, and for numerous discussions and encouragement throughout the course of the research. I would also like to thank all the members of my advisory committee, Dr. Renato Figueiredo, Dr. Rizwan Bashirullah, and Dr. Prabhat Mishra, for their valuable time and interest in serving on my supervisory committee. I am also indebted to all the members of IDEAL (Intelligent Design of Efficient Architectures Laboratory), Clay Hughes, James Michael Poe II, Xin Fu and Wangyuan Zhang, for their companionship and support throughout the time spent working on my research. Finally, I would like to express my greatest gratitude to my family, especially my wife, Eun-Hee Choi, for her relentless support and love.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
2 WAVELET TRANSFORM
    Discrete Wavelet Transform (DWT)
    Apply DWT to Capture Workload Execution Behavior
    2D Wavelet Transform
3 COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION
    Characterizing and classifying the program dynamic behavior
    Profiling Program Dynamics and Complexity
    Classifying Program Phases based on their Dynamics Behavior
    Experimental Results
4 IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE ANALYSIS
    Workload-statistics-based phase analysis
    Exploring Wavelet Domain Phase Analysis
5 INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION
    Neural Network
    Combining Wavelet and Neural Network for Workload Dynamics Prediction
    Experimental Methodology
    Evaluation and Results
    Workload Dynamics Driven Architecture Design Space Exploration
6 ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTI-CORE ARCHITECTURES
    Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction
    Experimental Methodology
    Evaluation and Results
    Leveraging 2D Geometric Characteristics to Explore Cooperative Multi-core Oriented Architecture Design and Optimization
7 THERMAL DESIGN SPACE EXPLORATION OF 3D DIE STACKED MULTI-CORE PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS
    Combining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction
    Experimental Methodology
    Experimental Results
8 CONCLUSIONS
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Baseline machine configuration
3-2 A classification of benchmarks based on their complexity
4-1 Baseline machine configuration
4-2 Efficiency of different hybrid wavelet signatures in phase classification
5-1 Simulated machine configuration
5-2 Microarchitectural parameter ranges used for generating train/test data
6-1 Simulated machine configuration (baseline)
6-2 The considered architecture design parameters and their ranges
6-3 Multi-programmed workloads
6-4 Error comparison of predicting raw vs. 2D DWT cache banks
6-5 Design space evaluation speedup (simulation vs. prediction)
7-1 Architecture configuration for different issue width
7-2 Simulation configurations
7-3 Design space parameters
7-4 Simulation time vs. prediction time

LIST OF FIGURES

2-1 Example of Haar wavelet transform
2-2 Comparison execution characteristics of time and wavelet domain
2-3 Sampled time domain program behavior
2-4 Reconstructing the workload dynamic behaviors
2-5 Variation of wavelet coefficients
2-6 2D wavelet transforms on 4 data points
2-7 2D wavelet transforms on 16 cores/hardware components
2-8 Example of applying 2D DWT on a non-uniformly accessed cache
3-1 XCOR vectors for each program execution interval
3-2 Dynamic complexity profile of benchmark gcc
3-3 XCOR value distributions
3-4 XCORs in the same phase by the Simpoint
3-5 BBVs with different resolutions
3-6 Multiresolution analysis of the projected BBVs
3-7 Weighted COV calculation
3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics
3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics
4-1 Phase analysis methods time domain vs. wavelet domain
4-2 Phase classification accuracy: time domain vs. wavelet domain
4-3 Phase classification using hybrid wavelet coefficients
4-4 Phase classification accuracy of using 16 1 hybrid scheme
4-5 Different methods to handle counter overflows
4-6 Impact of counter overflows on phase analysis accuracy
4-7 Method for modeling workload variability
4-8 Effect of using wavelet denoising to handle workload variability
4-9 Efficiency of different denoising schemes
5-1 Variation of workload performance, power and reliability dynamics
5-2 Basic architecture of a neural network
5-3 Using wavelet neural network for workload dynamics prediction
5-4 Magnitude-based ranking of 128 wavelet coefficients
5-5 MSE boxplots of workload dynamics prediction
5-6 MSE trends with increased number of wavelet coefficients
5-7 MSE trends with increased sampling frequency
5-8 Roles of microarchitecture design parameters
5-9 Threshold-based workload execution scenarios
5-10 Threshold-based workload execution
5-11 Threshold-based workload scenario prediction
5-12 Dynamic Vulnerability Management
5-13 IQ DVM Pseudo Code
5-14 Workload dynamic prediction with scenario-based architecture optimization
5-15 Heat plot that shows the MSE of IQ AVF and processor power
5-16 IQ AVF dynamics prediction accuracy across different DVM thresholds
6-1 Variation of cache hits across a 256-bank non-uniform access cache on 8-core
6-2 Using wavelet neural networks for forecasting architecture 2D characteristics
6-3 Baseline CMP with 8 cores that share a NUCA L2 cache
6-4 ME boxplots of prediction accuracies with different number of wavelet coefficients
6-5 Predicted 2D NUCA behavior using different number of wavelet coefficients
6-6 Roles of design parameters in predicting 2D NUCA
6-7 2D NUCA footprint (geometric shape) of mesa
6-8 2D cache interference in NUCA
6-9 Pearson correlation coefficient (all 50 test cases are shown)
6-10 2D NUCA thermal profile (simulation vs. prediction)
6-11 NUCA 2D thermal prediction error
6-12 Temperature profile before and after a DTM policy
7-1 2D within-die and cross-dies thermal variation in 3D die stacked multi-core processors
7-2 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations
7-3 Example of using 2D DWT to capture thermal spatial characteristics
7-4 Hybrid neuro-wavelet thermal prediction framework
7-5 Selected floor-plans
7-6 Processor core floor-plan
7-7 Cross section view of the simulated 3D quad-core chip
7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16)
7-9 Simulated and predicted thermal behavior
7-10 ME boxplots of prediction accuracies with different number of wavelet coefficients
7-11 Benefit of predicting wavelet coefficients
7-12 Roles of input parameters

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ACCURATE, SCALABLE, AND INFORMATIVE MODELING AND ANALYSIS OF COMPLEX WORKLOADS AND LARGE-SCALE MICROPROCESSOR ARCHITECTURES

By

CHANG BURM CHO

December 2008

Chair: Tao Li
Major: Electrical and Computer Engineering

Modeling and analyzing how workload and architecture interact are at the foundation of computer architecture research and practical design. As contemporary microprocessors become increasingly complex, many challenges related to the design, evaluation and optimization of their architectures crucially rely on exploiting workload characteristics. While conventional workload characterization methods measure aggregated workload behavior and the state-of-the-art tools can detect program time-varying patterns and cluster them into different phases, existing techniques generally lack the capability of gaining insightful knowledge on the complex interaction between software and hardware, a necessary first step to design cost-effective computer architecture. This limitation will only be exacerbated by the rapid growth of software functionality and runtime and hardware design complexity and integration scale.
For instance, while large real-world applications manifest drastically different behavior across a wide spectrum of their runtime, existing methods only focus on analyzing workload characteristics using a single time scale. Conventional architecture modeling techniques assume a centralized and monolithic hardware substrate. This assumption, however, will not hold, since the design trends of multi-/many-core processors will result in large-scale and distributed microarchitectures. Beyond any specific processor core, global and cooperative resource management for large-scale many-core processors requires obtaining workload characteristics across a large number of distributed hardware components (cores, cache banks, interconnect links, etc.) at different levels of abstraction. Therefore, there is a pressing need for novel and efficient approaches to model and analyze workload and architecture with rapidly increasing complexity and integration scale.

We aim to develop computationally efficient methods and models which allow architects and designers to rapidly yet informatively explore the large performance, power, reliability and thermal design space of uni-/multi-core architectures. Our models achieve several orders of magnitude speedup compared to simulation-based methods. Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models have the capability of capturing complex workload behavior and can be used to forecast workload dynamics during performance, power, reliability and thermal design space exploration.

CHAPTER 1
INTRODUCTION

Modeling and analyzing how workloads behave on the underlying hardware have been essential ingredients of computer architecture research. By knowing program behavior, both hardware and software can be tuned to better suit the needs of applications.
As computer systems become more adaptive, their efficiency increasingly depends on the dynamic behavior that programs exhibit at runtime. Previous studies [1-5] have shown that program runtime characteristics exhibit time-varying phase behavior: workload execution manifests similar behavior within each phase while showing distinct characteristics between different phases. Many challenges related to the design, analysis and optimization of complex computer systems can be efficiently solved by exploiting program phases [1, 6-9]. For this reason, there is a growing interest in studying program phase behavior. Recently, several phase analysis techniques have been proposed [4, 7, 10-19]. Very few of these studies, however, focus on understanding and characterizing program phases from their dynamics and complexity perspectives. Consequently, these techniques generally lack the capability of revealing phase dynamic behavior. To complement current phase analysis techniques, which pay little or no attention to phase dynamics, we develop new methods, metrics and frameworks that have the capability to analyze, quantify, and classify program phases based on their dynamics and complexity characteristics. Our techniques are built on wavelet-based multiresolution analysis, which provides a clear and orthogonal view of phase dynamics by presenting complex dynamic structures of program phases with respect to both time and frequency domains. Consequently, key tendencies can be efficiently identified.

As microprocessor architectures become more complex, architects increasingly rely on exploiting workload dynamics to achieve cost and complexity effective design. Therefore, there is a growing need for methods that can quickly and accurately explore workload dynamic behavior at the early microarchitecture design stage.
Such techniques can quickly provide architects with insights on application execution scenarios across a large design space without resorting to detailed, case-by-case simulations. Researchers have proposed several predictive models [20-25] to reason about workload aggregated behavior at the architecture design stage. However, they have focused on predicting aggregated program statistics (e.g. CPI of the entire workload execution). These monolithic global models are incapable of capturing and revealing program dynamics, which contain interesting fine-grain behavior. To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques and neural network prediction.

As the number of cores on a processor increases, these large and sophisticated multi-core-oriented architectures exhibit increasingly complex and heterogeneous characteristics. Processors with two, four and eight cores have already entered the market. Processors with tens or possibly hundreds of cores may be a reality within the next few years. In the upcoming multi-/many-core era, the design, evaluation and optimization of architectures will demand analysis methods that are very different from those targeting traditional, centralized and monolithic hardware structures. To enable global and cooperative management of hardware resources and efficiency at large scales, it is imperative to analyze and exploit architecture characteristics beyond the scope of individual cores and hardware components (e.g. a single cache bank or a single interconnect link). To address this important and urgent research task, we developed novel 2D multi-scale predictive models which can efficiently reason about the characteristics of large and sophisticated multi-core-oriented architectures during the design space exploration stage without using detailed cycle-level simulations.
Three-dimensional (3D) integrated circuit design [55] is an emerging technology that greatly improves transistor integration density and reduces on-chip wire communication latency. It places planar circuit layers in the vertical dimension and connects these layers with a high-density and low-latency interface. In addition, 3D offers the opportunity of binding dies that are implemented with different techniques, which enables integrating heterogeneous active layers for new system architectures. Leveraging 3D die stacking technologies to build uni-/multi-core processors has drawn increased attention from both the chip design industry and the research community [56-62].

The realization of 3D chips faces many challenges. One of the most daunting of these challenges is the problem of inefficient heat dissipation. In conventional 2D chips, the generated heat is dissipated through an external heat sink. In 3D chips, all of the layers contribute to the generation of heat. Stacking multiple dies vertically increases power density, and dissipating heat from the layers far away from the heat sink is more challenging due to the distance from the heat source to the external heat sink. Therefore, 3D technologies not only exacerbate existing on-chip hotspots but also create new thermal hotspots. High die temperature leads to thermal-induced performance degradation and reduced chip lifetime, which threatens the reliability of the whole system, making modeling and analyzing thermal characteristics crucial in effective 3D microprocessor design.

Previous studies [59, 60] show that 3D chip temperature is affected by factors such as the configuration and floor-plan of microarchitectural components. For example, instead of putting hot components together, thermal-aware floor-planning places the hot components next to cooler components, reducing the global temperature.
Thermal-aware floor-planning [59] uses intensive and iterative simulations to estimate the thermal effect of microarchitecture components at the early architectural design stage. However, using detailed yet slow cycle-level simulations to explore thermal effects across the large design space of 3D multi-core processors is very expensive in terms of time and cost.

CHAPTER 2
WAVELET TRANSFORM

We use wavelets as an efficient tool for capturing workload behavior. To familiarize the reader with the general methods used in this research, we provide a brief overview of wavelet analysis and show how program execution characteristics can be represented using wavelet analysis.

Discrete Wavelet Transform (DWT)

Wavelets are mathematical tools that use a prototype function (called the analyzing or mother wavelet) to transform data of interest into different frequency components, and then analyze each component with a resolution matched to its scale. Therefore, the wavelet transform is capable of providing a compact and effective mathematical representation of data. In contrast to Fourier transforms, which only offer frequency representations, wavelet transforms provide time and frequency localizations simultaneously. Wavelet analysis allows one to choose wavelet functions from numerous functions [26, 27]. In this section, we provide a quick primer on wavelet analysis using the Haar wavelet, which is the simplest form of wavelets.

Consider a data series $X_{n,k}$, $k = 0, 1, 2, \ldots$, at the finest time scale resolution level $2^{-n}$. This time series might represent a specific program characteristic (e.g., number of executed instructions, branch mispredictions and cache misses) measured at a given time scale.
We can coarsen this event series by averaging (with a slightly different normalization factor) over non-overlapping blocks of size two

$$X_{n-1,k} = \frac{1}{\sqrt{2}}\left(X_{n,2k} + X_{n,2k+1}\right) \tag{2-1}$$

and generate a new time series $X_{n-1}$, which is a coarser granularity representation of the original series $X_n$. The difference between the two representations, known as details, is

$$D_{n-1,k} = \frac{1}{\sqrt{2}}\left(X_{n,2k} - X_{n,2k+1}\right) \tag{2-2}$$

Note that the original time series $X_n$ can be reconstructed from its coarser representation $X_{n-1}$ by simply adding in the details $D_{n-1}$; i.e., $X_n = 2^{-1/2}(X_{n-1} + D_{n-1})$. We can repeat this process (i.e., write $X_{n-1}$ as the sum of yet another coarser version $X_{n-2}$ of $X_n$ and the details $D_{n-2}$, and iterate) for as many scales as are present in the original time series, i.e.,

$$X_n = 2^{-n/2} X_0 + 2^{-n/2} D_0 + \cdots + 2^{-1/2} D_{n-1} \tag{2-3}$$

We refer to the collection of $X_0$ and $D_j$ as the discrete Haar wavelet coefficients. The calculations of all $D_{j,k}$, which can be done iteratively using equations (2-1) and (2-2), make up the so-called discrete wavelet transform (DWT). As can be seen, the DWT offers a natural hierarchy structure to represent data behavior at multiple resolution levels: the first few wavelet coefficients contain an overall, coarser approximation of the data; additional coefficients illustrate high detail. This property can be used to capture workload execution behavior.

Figure 2-1 illustrates the procedure of using the Haar-based DWT to transform a series of data {3, 4, 20, 25, 15, 5, 20, 3}. As can be seen, scale 1 is the finest representation of the data. At scale 2, the approximations {3.5, 22.5, 10, 11.5} are obtained by taking the average of {3, 4}, {20, 25}, {15, 5} and {20, 3} at scale 1 respectively. The details {-0.5, -2.5, 5, 8.5} are the differences of {3, 4}, {20, 25}, {15, 5} and {20, 3} divided by 2 respectively. The process continues by decomposing the scaling coefficient (approximation) vector using the same steps, and completes when only one coefficient remains.
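The decomposition steps described for Figure 2-1 can be sketched in a few lines of Python. This is a minimal illustration, not the dissertation's implementation: the function name `haar_dwt` is hypothetical, and it follows the plain averaging convention used in the Figure 2-1 numbers (pairwise sum and difference divided by 2) rather than the $1/\sqrt{2}$ normalization of equations (2-1) and (2-2). The input length is assumed to be a power of two.

```python
def haar_dwt(data):
    """Haar DWT using the averaging convention of the Figure 2-1 example.

    Returns [overall average, coarsest detail, ..., finest details].
    Assumption: len(data) is a power of two.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(data)
    details = []  # one detail vector per scale, finest scale first
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / 2 for a, b in pairs])  # half the difference
        approx = [(a + b) / 2 for a, b in pairs]         # pairwise average
    coeffs = approx  # the single remaining approximation coefficient
    for level in reversed(details):  # append coarsest details first
        coeffs = coeffs + level
    return coeffs


print(haar_dwt([3, 4, 20, 25, 15, 5, 20, 3]))
# [11.875, 1.125, -9.5, -0.75, -0.5, -2.5, 5.0, 8.5]
```

The printed coefficients match the values listed for Figure 2-1: the overall average, then the detail coefficients from the coarsest to the finest scale.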
As a result, the wavelet decomposition is the collection of average and detail coefficients at all scales. In other words, the wavelet transform of the original data is the single coefficient representing the overall average of the original data, followed by the detail coefficients in order of increasing resolution. Different resolutions can be obtained by adding difference values back to, or subtracting differences from, the averages.

Original data: 3, 4, 20, 25, 15, 5, 20, 3
Scaling filter (G0): 3.5, 22.5, 10, 11.5; wavelet filter (H0): -0.5, -2.5, 5, 8.5
Scaling filter (G1): 13, 10.75; wavelet filter (H1): -9.5, -0.75
Scaling filter (G2): 11.875; wavelet filter (H2): 1.125
Resulting coefficients: 11.875 (approximation, level 0); 1.125 (detail, level 1); -9.5, -0.75 (detail coefficients, level 2); -0.5, -2.5, 5, 8.5 (detail coefficients, level 3)
Figure 2-1 Example of Haar wavelet transform.

For instance, {13, 10.75} = {11.875+1.125, 11.875-1.125}, where 11.875 and 1.125 are the first and the second coefficient respectively. This process can be performed recursively until the finest scale is reached. Therefore, through an inverse transform, the original data can be recovered from the wavelet coefficients. The original data can be perfectly recovered if all wavelet coefficients are involved. Alternatively, an approximation of the time series can be reconstructed using a subset of wavelet coefficients. Using a wavelet transform gives time-frequency localization of the original data. As a result, the time domain signal can be accurately approximated using only a few wavelet coefficients since they capture most of the energy of the input data.

Apply DWT to Capture Workload Execution Behavior

Since variation of program characteristics over time can be viewed as signals, we apply discrete wavelet analysis to capture program execution behavior.
To obtain time domain workload execution characteristics, we break down the entire program execution into intervals and then sample multiple data points within each interval. Therefore, at the finest resolution level, program time domain behavior is represented by a data series within each interval. Note that the sampled data can be any runtime program characteristics of interest. We then apply the discrete wavelet transform (DWT) to each interval. As described in the previous section, the result of the DWT is a set of wavelet coefficients which represent the behavior of the sampled time series in the wavelet domain.

Figure 2-2 Comparison execution characteristics of time and wavelet domain: (a) time domain representation; (b) wavelet domain representation

Figure 2-2 (a) shows the sampled time domain workload execution statistics (the y-axis represents the number of cycles a processor spends on executing a fixed number of instructions) for benchmark gcc within one execution interval. In this example, the program execution interval is represented by 1024 sampled data points. Figure 2-2 (b) illustrates the wavelet domain representation of the original time series after a discrete wavelet transform is applied. Although the DWT operations can produce as many wavelet coefficients as the original input data, the first few wavelet coefficients usually contain the important trend. In Figure 2-2 (b), we show the values of the first 16 wavelet coefficients. As can be seen, the discrete wavelet transform provides a compact representation of the original large volume of data. This feature can be exploited to create concise yet informative fingerprints to capture program execution behavior.
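To make the idea of a compact representation concrete, the sketch below keeps only the first few coefficients, zeroes the rest, and maps the result back to the time domain. It is an illustration under stated assumptions: `haar_dwt`, `haar_idwt` and `approximate` are hypothetical helper names, the averaging convention of the Figure 2-1 example is used, the input length is assumed to be a power of two, and the eight-point series from Figure 2-1 stands in for real sampled workload data.

```python
def haar_dwt(samples):
    """Forward Haar transform: [average, coarsest detail, ..., finest details]."""
    approx, details = list(samples), []
    while len(approx) > 1:
        evens, odds = approx[0::2], approx[1::2]
        # Prepend so that coarser details end up ahead of finer ones.
        details = [(a - b) / 2 for a, b in zip(evens, odds)] + details
        approx = [(a + b) / 2 for a, b in zip(evens, odds)]
    return approx + details


def haar_idwt(coeffs):
    """Inverse transform: rebuild the series by adding/subtracting details."""
    approx, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        finer = []
        for a, d in zip(approx, details):
            finer += [a + d, a - d]
        approx = finer
    return approx


def approximate(samples, keep):
    """Reconstruct the series from only its first `keep` coefficients."""
    coeffs = haar_dwt(samples)
    truncated = coeffs[:keep] + [0.0] * (len(coeffs) - keep)
    return haar_idwt(truncated)


series = [3, 4, 20, 25, 15, 5, 20, 3]  # toy data taken from Figure 2-1
for keep in (1, 2, 4, 8):
    print(keep, approximate(series, keep))
```

With all eight coefficients the original series is recovered exactly; with fewer, the output is a progressively coarser, piecewise-constant approximation.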
One advantage of using wavelet coefficients to fingerprint program execution is that program time domain behavior can be reconstructed from these wavelet coefficients. Figures 2-3 and 2-4 show that the time domain workload characteristics can be recovered using the inverse discrete wavelet transform.

Figure 2-3 Sampled time domain program behavior

Figure 2-4 Reconstructing the workload dynamic behaviors: (a) 1 wavelet coefficient, (b) 2 wavelet coefficients, (c) 4 wavelet coefficients, (d) 8 wavelet coefficients, (e) 16 wavelet coefficients, (f) 64 wavelet coefficients

In Figure 2-4 (a)-(e), the first 1, 2, 4, 8, and 16 wavelet coefficients were used to restore program time domain behavior with increasing fidelity. As shown in Figure 2-4 (f), when all (e.g. 64) wavelet coefficients are used for recovery, the original signal can be completely restored. However, this could involve storing and processing a large number of wavelet coefficients. Using a wavelet transform gives time-frequency localization of the original data. As a result, most of the energy of the input data can be represented by only a few wavelet coefficients. As can be seen, using 16 wavelet coefficients can recover program time domain behavior with sufficient accuracy.

To classify program execution into phases, it is essential that the generated wavelet coefficients across intervals preserve the dynamics that workloads exhibit within the time domain. Figure 2-5 shows the variation of the first 16 wavelet coefficients (coeff 1 to coeff 16), which represent the wavelet domain behavior of branch misprediction and L1 data cache hit on the benchmark gcc. The data are shown for the entire program execution, which contains a total of 1024 intervals.
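The reconstruction from a subset of coefficients discussed with Figure 2-4 can be sketched as follows. This is an illustration under stated assumptions, not the code used for the figures: it uses the averaging Haar convention and the coefficient values of Figure 2-1, and the helper names `haar_idwt` and `reconstruct` are hypothetical.

```python
def haar_idwt(coeffs):
    """Invert the averaging Haar transform.

    Each (average, difference) pair expands to (avg + diff, avg - diff).
    Assumes len(coeffs) is a power of two.
    """
    approx = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        expanded = []
        for avg, diff in zip(approx, detail):
            expanded.extend([avg + diff, avg - diff])
        pos += len(approx)
        approx = expanded
    return approx


def reconstruct(coeffs, keep):
    """Keep the first `keep` coefficients, zero the rest, then invert."""
    truncated = list(coeffs[:keep]) + [0.0] * (len(coeffs) - keep)
    return haar_idwt(truncated)


coeffs = [11.875, 1.125, -9.5, -0.75, -0.5, -2.5, 5, 8.5]  # from Figure 2-1
for keep in (1, 2, 4, 8):
    print(keep, reconstruct(coeffs, keep))
```

Keeping more coefficients refines the piecewise-constant approximation step by step, and keeping all of them restores the original series exactly.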
Figure 2-5 Variation of wavelet coefficients (coeff 1 to coeff 16): (a) branch misprediction, (b) L1 data cache hit

Figure 2-5 shows that wavelet domain transforms largely preserve program dynamic behavior. Another interesting observation is that the first order wavelet coefficient exhibits much more significant variation than the high order wavelet coefficients. This suggests that wavelet domain workload dynamics can be effectively captured using a few low order wavelet coefficients.

2D Wavelet Transform

To effectively capture the two-dimensional spatial characteristics across large-scale multi-core architecture substrates, we also use 2D wavelet analysis. With 1D wavelet analysis that uses Haar wavelet filters, each adjacent pair of data in a discrete interval is replaced with its average and difference.

Figure 2-6 2D wavelet transforms on 4 data points. For four adjacent points a, b, c, d, the figure lists the average (a+b+c+d)/4 and the detail values ((a+b)/2-(c+d)/2)/2, ((b+d)/2-(a+c)/2)/2, and ((a+d)/2-(b+c)/2)/2 (labeled D-horizontal, D-vertical, and D-diagonal in the figure).

A similar concept can be applied to obtain a 2D wavelet transform of data in a discrete plane. As shown in Figure 2-6, each adjacent four points in a discrete 2D plane can be replaced by their averaged value and three detailed values. The detailed values (D-horizontal, D-vertical, and D-diagonal) correspond to the average of the difference of: 1) the summation of the rows, 2) the summation of the columns, and 3) the summation of the diagonals.
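The 2x2 replacement rule can be sketched in Python as below. This is an illustrative sketch: the function names are hypothetical, the block layout [[a, b], [c, d]] and the subtraction order are assumptions taken from the expressions listed with Figure 2-6 (a different subtraction order only flips the sign of a detail), and the 4x4 mesh is the example data used with Figure 2-7.

```python
def haar2d_block(a, b, c, d):
    """Transform one 2x2 block laid out as [[a, b], [c, d]].

    Returns (average, row detail, column detail, diagonal detail).
    """
    average = (a + b + c + d) / 4
    d_rows = ((a + b) / 2 - (c + d) / 2) / 2  # top row vs. bottom row
    d_cols = ((b + d) / 2 - (a + c) / 2) / 2  # right column vs. left column
    d_diag = ((a + d) / 2 - (b + c) / 2) / 2  # main vs. anti-diagonal
    return average, d_rows, d_cols, d_diag


def haar2d_level(mesh):
    """Apply haar2d_block to every non-overlapping 2x2 block of a mesh.

    Returns four sub-band grids, each half the input size in both
    dimensions. Assumes an even number of rows and columns.
    """
    bands = ([], [], [], [])
    for r in range(0, len(mesh), 2):
        band_rows = ([], [], [], [])
        for c in range(0, len(mesh[0]), 2):
            block = haar2d_block(mesh[r][c], mesh[r][c + 1],
                                 mesh[r + 1][c], mesh[r + 1][c + 1])
            for band_row, value in zip(band_rows, block):
                band_row.append(value)
        for band, band_row in zip(bands, band_rows):
            band.append(band_row)
    return bands


mesh = [[2, 4, 4, 6],
        [6, 8, 2, 0],
        [4, 14, 2, 0],
        [4, 6, 8, 10]]
average, d_rows, d_cols, d_diag = haar2d_level(mesh)
coarser = haar2d_level(average)[0]  # recursing on the average gives the next level
print(average, coarser)
```

Recursing only on the average sub-band is what produces the multi-resolution layout: the details of each level are kept, while the average is decomposed again.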
To obtain wavelet coefficients for 2D data, we apply a 1D wavelet transform to the data along the X-axis first, resulting in low-pass and high-pass signals (average and difference). Next, we apply 1D wavelet transforms to both signals along the Y-axis, generating one averaged and three detailed signals. Consequently, a 2D wavelet decomposition is obtained by recursively repeating this procedure on the averaged signal. Figure 2-7 (a) illustrates the procedure. As can be seen, the 2D wavelet decomposition can be represented by a tree-based structure. The root node of the tree contains the original data (row-majored) of the mesh of values (for example, performance or temperatures of the four adjacent cores, network-on-chip links, cache banks etc.). First, we apply 1D wavelet transforms along the X-axis, i.e. for each two points along the X-axis we compute the average and difference, so we obtain (3 5 7 1 9 1 5 9) and (1 1 1 -1 5 -1 1 1). Next, we apply 1D wavelet transforms along the Y-axis; for each two points along the Y-axis we compute the average and difference (at level 0 in the example shown in Figure 2-7.a). We perform this process recursively until the number of elements in the averaged signal becomes 1 (at level 1 in the example shown in Figure 2-7.a).

Original data (row-majored): 2 4 4 6 6 8 2 0 4 14 2 0 4 6 8 10
After the 1D wavelet along the x-axis: lowpass signal 3 5 7 1 9 1 5 9; highpass signal 1 1 1 -1 5 -1 1 1
After the 1D wavelet along the y-axis (L=0): average 5 3 7 5; details 2 -2 -2 4, 1 0 3 0, and 0 -1 -2 1
Repeating both steps on the average (L=1): lowpass signal 4 6; highpass signal -1 -1; average 5; details 1, -1, and 0
Figure 2-7 2D wavelet transforms on 16 cores/hardware components: (a) decomposition procedure, (b) multi-resolution layout of the average and the horizontal, vertical, and diagonal details at L=0 and L=1

Figure 2-7.b shows the wavelet domain multi-resolution representation of the 2D spatial data. Figure 2-8 further demonstrates that the 2D architecture characteristics can be effectively captured using a small number of wavelet coefficients (e.g. Average (L=0) or Average (L=1)). Since a small set of wavelet coefficients provides concise yet insightful information on architecture 2D spatial characteristics, we use predictive models (i.e. neural networks) to relate them individually to various architecture design parameters. Through the inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize architecture 2D spatial characteristics across the design space. Compared with a simulation-based method, predicting a small set of wavelet coefficients using analytical models is computationally efficient and is scalable to large-scale architecture design.

Figure 2-8 Example of applying 2D DWT on a non-uniformly accessed cache: (a) NUCA hit numbers, (b) 2D DWT (L=0), (c) 2D DWT (L=1)

CHAPTER 3
COMPLEXITY-BASED PROGRAM PHASE ANALYSIS AND CLASSIFICATION

Obtaining phase dynamics is, in many cases, of great interest for accurately capturing program behavior and precisely applying runtime, application-oriented optimizations. For example, complex, real-world workloads may run for hours, days or even months before completion. Their long execution time implies that program time-varying behavior can manifest across a wide range of scales, making the modeling of phase behavior using a single time scale less informative. To overcome this limitation of conventional phase analysis techniques, we propose using wavelet-based multiresolution analysis to characterize phase dynamic behavior and develop metrics to quantitatively evaluate the complexity of phase structures.
We also propose methodologies to classify program phases from the perspectives of their dynamics and complexity. Specifically, the goal of this chapter is to answer the following questions:

- How to define the complexity of program dynamics?
- How do program dynamics change over time?
- If classified using existing methods, how similar are the program dynamics in each phase?
- How to better identify phases with homogeneous dynamic behavior?

In this chapter, we implement our complexity-based phase analysis technique and evaluate its effectiveness over existing phase analysis methods based on program control flow and runtime information. We show that in both cases the proposed technique produces phases that exhibit more homogeneous dynamic behavior than existing methods do.

Characterizing and Classifying the Program Dynamic Behavior

Using the wavelet-based multiresolution analysis described in Chapter 2, we characterize, quantify and classify program dynamic behavior on a high-performance, out-of-order execution superscalar processor coupled with a multi-level memory hierarchy.

Experimental Setup

We performed our analysis using ten SPEC CPU 2000 benchmarks: crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf and vortex. All programs were run with the reference input to completion. We chose to focus on only 10 programs because of the lengthy simulation time incurred by executing all of the programs to completion. The statistics of workload dynamics were measured on the SimpleScalar 3.0 [28] sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model is detailed in Table 3-1.
Table 3-1 Baseline machine configuration

| Parameter | Configuration |
|---|---|
| Processor Width | 8 |
| ITLB | 128 entries, 4-way, 200 cycle miss |
| Branch Prediction | combined 8K tables, 10 cycle misprediction, 2 predictions/cycle |
| BTB | 2K entries, 4-way |
| Return Address Stack | 32 entries |
| L1 Instruction Cache | 32K, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access |
| RUU Size | 128 entries |
| Load/Store Queue | 64 entries |
| Store Buffer | 16 entries |
| Integer ALU | 4 I-ALU, 2 I-MUL/DIV |
| FP ALU | 2 FP-ALU, 1 FP-MUL/DIV |
| DTLB | 256 entries, 4-way, 200 cycle miss |
| L1 Data Cache | 64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access |
| L2 Cache | unified 1MB, 4-way, 128 Byte/line, 12 cycle access |
| Memory Access | 100 cycles |

Metrics to Quantify Phase Complexity

To quantify phase complexity, we measure the similarity between phase dynamics observed at different time scales. To be more specific, we use cross-correlation coefficients to measure the similarity between the original data sampled at the finest granularity and the approximated version reconstructed from wavelet scaling coefficients obtained at a coarser scale. The cross-correlation coefficient (XCOR) of the two data series is defined as:

$$XCOR(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \qquad (3\text{-}1)$$

where X is the original data series and Y is the approximated data series. Note that XCOR = 1 if the program dynamics observed at the finest scale and their approximation at a coarser granularity exhibit perfect correlation, and XCOR = 0 if the program dynamics and their approximation vary independently across time scales.

X and Y can be any runtime program characteristics of interest. In this chapter, we use instructions per cycle (IPC) as the metric due to its wide usage in computer architecture design and performance evaluation. To sample IPC dynamics, we break down the entire program execution into 1024 intervals and then sample 1024 IPC data points within each interval.
Therefore, at the finest resolution level, the program dynamics of each execution interval are represented by an IPC data series with a length of 1024. We then apply wavelet multiresolution analysis to each interval. In a wavelet transform, each DWT operation produces an approximation coefficient vector with a length equal to half of the input data. We remove the detail coefficients after each wavelet transform and use only the approximation part to reconstruct the IPC dynamics, and then calculate the XCOR between the original data and the reconstructed data. We apply the discrete wavelet transform to the approximation part iteratively until the length of the approximation coefficient vector is reduced to 1. Each approximation coefficient vector is used to reconstruct a full IPC trace with a length of 1024, and the XCOR between the original and reconstructed traces is calculated using equation (3-1). As a result, for each program execution interval, we obtain an XCOR vector, in which each element represents the cross-correlation coefficient between the original workload dynamics and the approximated workload dynamics at a different scale. Since we use 1024 samples within each interval, we create an XCOR vector with a length of 10 for each interval, as shown in Figure 3-1.

Figure 3-1 XCOR vectors for each program execution interval

Profiling Program Dynamics and Complexity

We use XCOR metrics to quantify the program dynamics and complexity of the studied SPEC CPU 2000 benchmarks. Figure 3-2 shows the results for all 1024 execution intervals across ten levels of abstraction for the benchmark gcc.

Figure 3-2 Dynamic complexity profile of benchmark gcc

As can be seen, the benchmark gcc shows a wide variety of changing dynamics during its execution. As the time scale increases, XCOR values decrease monotonically. This is because wavelet approximation at a coarse scale removes details in program dynamics observed at a fine-grained level.
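The XCOR vector computation can be sketched as follows. This is a simplified illustration, not the implementation used in this work: it assumes a Haar wavelet, for which reconstructing from approximation coefficients alone is equivalent to replacing each block of samples by its mean; it takes scale 1 to be the original series and scale s to use blocks of 2^(s-1) samples, which gives 10 values for 1024 samples; and it defines XCOR as 0 when either series is constant. The names `xcor`, `block_mean_approx` and `xcor_vector` are hypothetical.

```python
import math


def xcor(x, y):
    """Cross-correlation coefficient of equation (3-1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
    return num / den if den > 0 else 0.0  # assumption: constant series -> 0


def block_mean_approx(x, block):
    """Haar approximation-only reconstruction: each block becomes its mean."""
    out = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        out.extend([sum(chunk) / len(chunk)] * len(chunk))
    return out


def xcor_vector(x):
    """XCOR between x and its approximation at each scale (finest first).

    Assumes len(x) is a power of two.
    """
    levels = len(x).bit_length() - 1  # log2 of the length
    return [xcor(x, block_mean_approx(x, 2 ** s)) for s in range(levels)]


series = [3, 4, 20, 25, 15, 5, 20, 3]
vector = xcor_vector(series)
print(vector)
```

With block-mean approximations the values cannot increase from one scale to the next, which is consistent with the monotonic decrease noted above.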
A rapidly decreasing XCOR implies highly complex structures that cannot be captured by a coarse level approximation. In contrast, a slowly decreasing XCOR suggests that program dynamics can be largely preserved using few samples. Figure 3-2 also shows a dotted line along which XCOR decreases linearly with increasing time scale. The XCOR plots below that dotted line indicate rapidly decreasing XCOR values, i.e. complex program dynamics. As can be seen, a significant fraction of the gcc execution intervals manifest quickly decreasing XCOR values, indicating that the program exhibits highly complex structure at the fine-grained level. Figure 3-2 also reveals that there are a few gcc execution intervals that have good scalability in their dynamics. On these execution intervals, the XCOR values drop by only 0.1 when the time scale is increased from 1 to 8. The results (Figure 3-2) clearly indicate that some program execution intervals can be accurately approximated by their high level abstractions while others cannot.

We further break down the XCOR values into 10 categories ranging from 0 to 1 and analyze their distribution across time scales. Due to space limitations, we only show the results of three programs (swim, crafty and gcc, see Figure 3-3), which represent the characteristics of all analyzed benchmarks. Note that at scale 1, the XCOR values of all execution intervals are always 1. Programs show heterogeneous XCOR value distributions starting from scale level 2. As can be seen, the benchmark swim exhibits good scalability in its dynamic complexity. The XCOR values of all execution intervals remain above 0.9 when the time scale is increased from 1 to 7. This implies that the captured program behavior is not sensitive to any time scale in that range. Therefore, we classify swim as a low complexity program.
On the benchmark crafty, XCOR values decrease uniformly with the increase of time scales, indicating that the observed program dynamics are sensitive to the time scales used to obtain them. We refer to this behavior as medium complexity. On the benchmark gcc, program dynamics decay rapidly. This suggests that abundant program dynamics could be lost if coarser time scales are used to characterize them. We refer to this characteristic as high complexity behavior.

Figure 3-3 XCOR value distributions: (a) swim (low complexity), (b) crafty (medium complexity), (c) gcc (high complexity). Each panel plots the XCOR value distribution (%) over scales 1 to 10, with XCOR values binned into [0, 0.1), [0.1, 0.2), ..., [0.9, 1), and 1.

The dynamics complexity and the XCOR value distribution plots (Figure 3-2 and Figure 3-3) provide a quantitative and informative representation of runtime program complexity.

Table 3-2 Classification of benchmarks based on their complexity

| Category | Benchmarks |
|---|---|
| Low complexity | swim |
| Medium complexity | crafty, gzip, parser, perlbmk, twolf |
| High complexity | gap, gcc, mcf, vortex |

Using the above information, we classify the studied programs in terms of their complexity; the results are shown in Table 3-2.

Classifying Program Phases Based on Their Dynamics Behavior

In this section, we show that program execution manifests heterogeneous complexity behavior. We further examine the efficiency of current methods in classifying program dynamics into phases and propose a new method that can better identify program complexity.
Classifying complexity-based phase behavior enables us to understand program dynamics progressively in a fine-to-coarse fashion, to operate on different resolutions, to manipulate features at different scales, and to localize characteristics in both spatial and frequency domains.

Simpoint

Sherwood and Calder [1] proposed a phase analysis tool called Simpoint to automatically classify the execution of a program into phases. They found that intervals of program execution grouped into the same phase had similar statistics. The Simpoint tool clusters program execution based on code signature and execution frequency. We identified program execution intervals grouped into the same phase by the Simpoint tool and analyzed their dynamic complexity.

Figure 3-4 XCORs in the same phase by the Simpoint: (a) Simpoint Cluster #7, (b) Simpoint Cluster #5, (c) Simpoint Cluster #48

Figure 3-4 shows the results for the benchmark mcf. Simpoint generates 55 clusters on the benchmark mcf. Figure 3-4 shows program dynamics within three clusters generated by Simpoint. Each cluster represents a unique phase. In cluster 7, the classified phase shows homogeneous dynamics. In cluster 5, program execution intervals show two distinct dynamics but they are classified as the same phase. In cluster 48, program execution complexity varies widely; however, Simpoint classifies the intervals as a single phase. The results (Figure 3-4) suggest that program execution intervals classified as the same phase by Simpoint can still exhibit widely varied behavior in their dynamics.

Complexity-aware Phase Classification

To enhance the capability of current methods in characterizing program dynamics, we propose complexity-aware phase classification. Our method uses the multiresolution property of wavelet transforms to identify and classify the changing of program code execution across different scales.

We assume a baseline phase analysis technique that uses basic block vectors (BBV) [10].
A basic block is a section of code that is executed from start to finish with one entry and one exit. A BBV represents the code blocks executed during a given interval of execution. To represent program dynamics at different time scales, we create a set of basic block vectors for each interval at different resolutions. For example, at the coarsest level (scale = 10), a program execution interval is represented by one BBV. At the most detailed level, the same program execution interval is represented by 1024 BBVs from 1024 consecutively subdivided intervals (Figure 3-5). To reduce the amount of data that needs to be processed, we use random projection to reduce the dimensionality of all BBVs to 15, as suggested in [1].

Figure 3-5 BBVs with different resolutions

The coarser scale BBVs are the approximations of the finest scale BBVs generated by the wavelet-based multiresolution analysis.

Figure 3-6 Multiresolution analysis of the projected BBVs

As shown in Figure 3-6, the discrete wavelet transform is applied to each dimension of a set of BBVs at the finest scale. The XCOR calculation is used to estimate the correlations between a BBV element and its approximations at coarser scales. The results are 15 XCOR vectors representing the complexity of each dimension of the BBVs across 10 levels of abstraction. The 15 XCOR vectors are then averaged together to obtain an aggregated XCOR vector that represents the entire BBV complexity characteristics for that execution interval. Using the above steps, we obtain an aggregated XCOR vector for each program execution interval. We then run the k-means clustering algorithm [29] on the collected XCOR vectors, which represent the dynamic complexity of program execution intervals, and classify them into phases. This is similar to what Simpoint does. The difference is that the Simpoint tool uses raw BBVs, whereas our method uses aggregated BBV XCOR vectors as the input for k-means clustering.
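The steps above can be sketched end to end in Python. The sketch rests on several assumptions that are not part of the original method description: a Haar wavelet (block means) stands in for the wavelet approximation, XCOR is defined as 0 for a constant series, and the small k-means routine uses a deterministic farthest-point initialization rather than whatever the clustering tool uses internally. All function names are hypothetical.

```python
import math


def xcor(x, y):
    """Cross-correlation coefficient of equation (3-1); 0 if undefined."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den if den > 0 else 0.0


def xcor_vector(x):
    """XCOR of x against block-mean approximations, finest scale first."""
    out = []
    for s in range(len(x).bit_length() - 1):
        block = 2 ** s
        approx = []
        for i in range(0, len(x), block):
            chunk = x[i:i + block]
            approx.extend([sum(chunk) / block] * block)
        out.append(xcor(x, approx))
    return out


def aggregated_xcor(bbvs):
    """bbvs: projected BBVs of one interval, one per finest sub-interval.

    Computes one XCOR vector per BBV dimension and averages them.
    """
    dims = len(bbvs[0])
    per_dim = [xcor_vector([v[d] for v in bbvs]) for d in range(dims)]
    return [sum(col) / dims for col in zip(*per_dim)]


def kmeans(points, k, iters=20):
    """Plain Lloyd iterations with farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(
            points, key=lambda p: min(math.dist(p, c) for c in centers))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Two toy intervals with 8 sub-intervals and 2 BBV dimensions each.
smooth = [[0, 0]] * 4 + [[1, 1]] * 4                 # one slow change
noisy = [[i % 2, (i + 1) % 2] for i in range(8)]     # rapid alternation
vectors = [aggregated_xcor(iv) for iv in (smooth, noisy, smooth, noisy)]
phases = kmeans(vectors, k=2)
print(vectors[0], vectors[1], phases)
```

In this toy input the two intervals would look alike if only their mean BBVs were compared, yet their aggregated XCOR vectors separate them cleanly.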
Experimental Results

We compare Simpoint and the proposed approach in their capability of classifying phase complexity. Since we use the wavelet transform on program basic block vectors, we refer to our method as multiresolution analysis of BBV (MRA-BBV).

Figure 3-7 Weighted COV calculation

We examine the similarity of program complexity within each phase classified by the two approaches. Instead of using IPC, we use IPC dynamics as the metric for evaluation. After classifying all program execution intervals into phases, we examine each phase and compute the IPC XCOR vectors for all the intervals in that phase. We then calculate the standard deviation of the IPC XCOR vectors within each phase and divide the standard deviation by the average to get the Coefficient of Variation (COV).

As shown in Figure 3-7, we calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e. weighted COVs) used to compare different phase classifications for a given program. Since COV measures the standard deviation as a percentage of the average, a lower COV value means a better phase classification technique.

Figure 3-8 Comparison of BBV and MRA-BBV in classifying phase dynamics (panels: crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf, vortex; x-axis: scales 1 to 10; y-axis: CoV)

Figure 3-8 shows experimental results for all the studied benchmarks.
As can be seen, the MRA-BBV method can produce phases which exhibit more homogeneous dynamics and complexity than the standard, BBV-based method. This can be seen from the lower COV values generated by the MRA-BBV method. In general, the COV values yielded by both methods increase when coarse time scales are used for complexity approximation. MRA-BBV is capable of achieving significantly better classification on benchmarks with high complexity, such as gap, gcc and mcf. On programs which exhibit medium complexity, such as crafty, gzip, parser, and twolf, the two schemes show comparable effectiveness. On benchmarks with trivial complexity (e.g. swim), both schemes work well.

We further examine the capability of using runtime performance metrics to capture complexity-aware phase behavior. Instead of using BBVs, the sampled IPC is used directly as the input to the k-means phase clustering algorithm. Similarly, we apply multiresolution analysis to the IPC data and then use the gathered information for phase classification. We call this method multiresolution analysis of IPC (MRA-IPC). Figure 3-9 shows the phase classification results. As can be seen, the observations we made on the BBV-based cases also hold for the IPC-based cases. This implies that the proposed multiresolution analysis can be applied to both methods to improve the capability of capturing phase dynamics.
Figure 3-9 Comparison of IPC and MRA-IPC in classifying phase dynamics (panels: crafty, gap, gcc, gzip, mcf, parser, perlbmk, swim, twolf, vortex; x-axis: scales 1 to 10; y-axis: CoV)

CHAPTER 4
IMPROVING ACCURACY, SCALABILITY AND ROBUSTNESS IN PROGRAM PHASE ANALYSIS

In this chapter, we focus on workload-statistics-based phase analysis since, on a given machine configuration and environment, it is more suitable for identifying how the targeted architecture features vary during program execution. In contrast, phase classification using program code structures lacks the capability of informing how workloads behave architecturally [13, 30]. Therefore, phase analysis using specified workload characteristics allows one to explicitly link the targeted architecture features to the classified phases. For example, if phases are used to optimize cache efficiency, the workload characteristics that reflect cache behavior can be used to explicitly classify program execution into cache performance/power/reliability oriented phases. Program code structure based phase analysis identifies similar phases only if they have similar code flow. There can be cases where two sections of code have different code flow but exhibit similar architectural behavior [13]. Code flow based phase analysis would then classify them as different phases. Another advantage of workload-statistics-based phase analysis is that when multiple threads share the same resource (e.g.
pipeline, cache), using workload execution information to classify phases makes it possible to capture program dynamic behavior due to the interactions between threads.

The key goal of workload execution based phase analysis is to accurately and reliably discern and recover phase behavior from various program runtime statistics represented as large-volume, high-dimension and noisy data. To effectively achieve this objective, recent work [30, 31] proposes using wavelets as a tool to assist phase analysis. The basic idea is to transform workload time domain behavior into the wavelet domain. The generated wavelet coefficients, which extract compact yet informative program runtime features, are then assembled together to facilitate phase classification. Nevertheless, in current work, the examined scope of workload characteristics and the explored benefits due to the wavelet transform are quite limited.

In this chapter, we extend the research of Chapter 3 by applying wavelets to abundant types of program execution statistics and quantifying the benefits of using wavelets for improving accuracy, scalability and robustness in phase classification. We conclude that wavelet domain phase analysis has the following advantages: 1) accuracy: the wavelet transform significantly reduces temporal dependence in the sampled workload statistics. As a result, simple models which are insufficient in the time domain become quite accurate in the wavelet domain. More attractively, wavelet coefficients transformed from various dimensions of program execution characteristics can be dynamically assembled together to further improve phase classification accuracy; 2) scalability: phase classification using wavelet analysis of high-dimension sampled workload statistics can alleviate the counter overflow problem, which has a negative impact on phase detection.
Therefore, it is much more scalable to analyze large-scale phases exhibited on long-running, real-world programs; and 3) robustness: wavelets offer denoising capabilities which allow phase classification to be performed robustly in the presence of workload execution variability.

Workload-statistics-based Phase Analysis

Using the wavelet-based method, we explore program phase analysis on a high-performance, out-of-order execution superscalar processor coupled with a multi-level memory hierarchy. We use the Daubechies wavelet [26, 27] with an order of 8 for the rest of the experiments due to its high accuracy and low computation overhead. This section describes our experimental methodologies, the simulated machine configuration, the benchmarks used and the evaluated metrics.

We performed our analysis using twelve SPEC CPU 2000 integer benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex and vpr. All programs were run with the reference input to completion. The runtime workload execution statistics were measured on the SimpleScalar 3.0 sim-outorder simulator for the Alpha ISA. The baseline microarchitecture model we used is detailed in Table 4-1.

Table 4-1 Baseline machine configuration

| Parameter | Configuration |
|---|---|
| Processor Width | 8 |
| ITLB | 128 entries, 4-way, 200 cycle miss |
| Branch Prediction | combined 8K tables, 10 cycle misprediction, 2 predictions/cycle |
| BTB | 2K entries, 4-way |
| Return Address Stack | 32 entries |
| L1 Instruction Cache | 32K, 2-way, 32 Byte/line, 2 ports, 4 MSHR, 1 cycle access |
| RUU Size | 128 entries |
| Load/Store Queue | 64 entries |
| Store Buffer | 16 entries |
| Integer ALU | 4 I-ALU, 2 I-MUL/DIV |
| FP ALU | 2 FP-ALU, 1 FP-MUL/DIV |
| DTLB | 256 entries, 4-way, 200 cycle miss |
| L1 Data Cache | 64KB, 4-way, 64 Byte/line, 2 ports, 8 MSHR, 1 cycle access |
| L2 Cache | unified 1MB, 4-way, 128 Byte/line, 12 cycle access |
| Memory Access | 100 cycles |

We use IPC (instructions per cycle) as the metric to evaluate the similarity of program execution within each classified phase.
To quantify phase classification accuracy, we use the weighted COV metric proposed by Calder et al. [15]. After classifying all program execution intervals into phases, we examine each phase and compute the IPC for all the intervals in that phase. We then calculate the standard deviation in IPC within each phase, and we divide the standard deviation by the average to get the Coefficient of Variation (COV). We then calculate an overall COV metric for a phase classification method by taking the COV of each phase and weighting it by the percentage of execution that the phase accounts for. This produces an overall metric (i.e. weighted COVs) used to compare different phase classifications for a given program. Since COV measures the standard deviation as a percentage of the average, a lower COV value means a better phase classification technique.

Exploring Wavelet Domain Phase Analysis

We first evaluate the efficiency of wavelet analysis on a wide range of program execution characteristics by comparing its phase classification accuracy with methods that use information in the time domain. We then explore methods to further improve phase classification accuracy in the wavelet domain.

Phase Classification: Time Domain vs. Wavelet Domain

The wavelet analysis method provides a cost-effective representation of program behavior. Since wavelet coefficients are generally decorrelated, we can transform the original data into the wavelet domain and then carry out the phase classification task. The generated wavelet coefficients can be used as signatures to classify program execution intervals into phases: if two program execution intervals show similar fingerprints (represented as a set of wavelet coefficients), they can be classified into the same phase. To quantify the benefit of using wavelet based analysis, we compare phase classification methods that use time domain and wavelet domain program execution information.
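The weighted COV metric defined earlier in this section can be sketched in a few lines of Python. Two points are assumptions made for illustration: all execution intervals are taken to be of equal length, so that a phase's share of execution equals its share of intervals, and the population standard deviation is used. The function name `weighted_cov` is hypothetical.

```python
import statistics


def weighted_cov(values, labels):
    """Weighted coefficient of variation of a phase classification.

    values[i] is the metric (e.g. IPC) of interval i and labels[i] is the
    phase that interval was assigned to.
    """
    phases = {}
    for value, label in zip(values, labels):
        phases.setdefault(label, []).append(value)
    total = len(values)
    result = 0.0
    for members in phases.values():
        mean = statistics.fmean(members)
        cov = statistics.pstdev(members) / mean if mean else 0.0
        result += cov * len(members) / total  # weight by share of execution
    return result


ipc = [1.0, 1.0, 2.0, 4.0]
print(weighted_cov(ipc, [0, 0, 1, 1]))  # intervals 3 and 4 share a phase
print(weighted_cov(ipc, [0, 0, 1, 2]))  # every phase is homogeneous
```

A classification that separates the intervals with IPC 2.0 and 4.0 into different phases yields a lower weighted COV than one that groups them, which is the sense in which lower values indicate a better classification.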
With our time domain phase analysis method, each program execution interval is represented by a time series which consists of 1024 sampled program execution statistics. We first apply random projection to reduce the data dimensionality to 16. We then use the k-means clustering algorithm to classify program intervals into phases. This is similar to the method used by the popular Simpoint tool, where the basic block vectors (BBVs) are used as input. For the wavelet domain method, the original time series are first transformed into the wavelet domain using the DWT. The first 16 wavelet coefficients of each program execution interval are extracted and used as the input to the k-means clustering algorithm. Figure 4-1 illustrates the above described procedure.

Figure 4-1 Phase analysis methods, time domain vs. wavelet domain. Time domain path: program runtime statistics, random projection (dimensionality = 16), k-means clustering, COV. Wavelet domain path: program runtime statistics, DWT (number of wavelet coefficients = 16), k-means clustering, COV.

We investigated the efficiency of applying wavelet domain analysis on 10 different workload execution characteristics, namely, the numbers of executed loads (load), stores (store) and branches (branch), the number of cycles a processor spends on executing a fixed amount of instructions (cycles), the number of branch mispredictions (branch_miss), the numbers of L1 instruction cache, L1 data cache and L2 cache hits (il1_hit, dl1_hit and ul2_hit), and the numbers of instruction and data TLB hits (itlb_hit and dtlb_hit). Figure 4-2 shows the COVs of phase classifications in the time and wavelet domains when each type of workload execution characteristic is used as an input. As can be seen, compared with using raw, time domain workload data, the wavelet domain analysis significantly improves phase classification accuracy, and this observation holds for all the investigated workload characteristics across all the examined benchmarks.
This is because in the time domain, collected program runtime statistics are treated as high-dimension time series data. Random projection methods are used to reduce the dimensionality of feature vectors which represent a workload signature at a given execution interval. However, the simple random projection function can increase the aliasing between phases and reduce the accuracy of phase detection.

Figure 4-2 Phase classification accuracy: time domain vs. wavelet domain (one panel per benchmark: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr; x-axis: load, store, branch, cycle, branch_miss, il1_hit, dl1_hit, ul2_hit, itlb_hit, dtlb_hit; y-axis: CoV; series: Time Domain, Wavelet Domain)

By transforming program runtime statistics into the wavelet domain, workload behavior can be represented by a series of wavelet coefficients which are much more compact and efficient than their counterpart in the time domain. The wavelet transform significantly reduces temporal dependence, and therefore simple models which are insufficient in the time domain become quite accurate in the wavelet domain.

Figure 4-2 shows that in the wavelet domain, the efficiency of using a single type of program characteristic to classify program phases can vary significantly across different benchmarks. For example, while ul2_hit achieves accurate phase classification on the benchmark vortex, it results in a high phase classification COV on the benchmark gcc. To overcome this disadvantage and to build phase classification methods that can achieve high accuracy across a wide range of applications, we explore using wavelet coefficients derived from different types of workload characteristics.

Figure 4-3 Phase classification using hybrid wavelet coefficients (diagram: a DWT is applied to each of program runtime statistics 1 to n; wavelet coefficient sets 1 to n are combined into hybrid wavelet coefficients, followed by k-means clustering and a COV computation)

As shown in Figure 4-3, a DWT is applied to each type of workload characteristic. The generated wavelet coefficients from different categories can be assembled together to form a signature for a data clustering algorithm.

Our objective is to improve wavelet domain phase classification accuracy across different programs while using an equivalent amount of information to represent program behavior.
We choose a set of 16 wavelet coefficients as the phase signature since it provides sufficient precision in capturing program dynamics when a single type of program characteristic is used. If a phase signature can be composed using multiple workload characteristics, there are many ways to form a 16-dimension phase signature. For example, a phase signature can be generated using one wavelet coefficient from each of 16 different workload characteristics (16×1), or it can be composed using 8 workload characteristics with 2 wavelet coefficients from each type of workload characteristic (8×2). Alternatively, a phase signature can be formed using 4 workload characteristics with 4 wavelet coefficients each (4×4) or 2 workload characteristics with 8 wavelet coefficients each (2×8).

We extend the 10 workload execution characteristics (Figure 4-2) to 16 by adding the following events: the number of accesses to instruction cache (il1_access), data cache (dl1_access), L2 cache (ul2_access), instruction TLB (itlb_access) and data TLB (dtlb_access). To understand the trade-offs in choosing different methods to generate hybrid signatures, we did an exhaustive search using the above 4 schemes on all benchmarks to identify the best COVs that each scheme can achieve. The results (their ranks in terms of phase classification accuracy and the COVs of phase analysis) are shown in Table 4-2. As can be seen, statistically, hybrid wavelet signatures generated using 16 (16×1) and 8 (8×2) workload characteristics achieve higher accuracy. This suggests that combining multiple-dimension wavelet domain workload characteristics to form a phase signature is beneficial in phase analysis.
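The construction of the four hybrid signature shapes can be illustrated as follows. This is a sketch under stated assumptions rather than the implementation used here: it uses an orthonormal Haar DWT for simplicity (the text does not name the wavelet used for signatures), the function names `haar_dwt` and `hybrid_signature` are hypothetical, and the traces are synthetic.

```python
import math


def haar_dwt(series):
    """Full Haar decomposition, coefficients ordered coarse to fine.

    Assumes len(series) is a power of two. Haar is an illustrative choice.
    """
    approx = list(series)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Coarser detail coefficients are placed in front of finer ones.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


def hybrid_signature(traces, coeffs_per_trace):
    """Concatenate the first few wavelet coefficients of every trace."""
    signature = []
    for trace in traces:
        signature.extend(haar_dwt(trace)[:coeffs_per_trace])
    return signature


# Sixteen synthetic 32-sample traces standing in for workload characteristics.
traces = [[float((t * (i + 1)) % (i + 3)) for t in range(32)] for i in range(16)]

sig_16x1 = hybrid_signature(traces[:16], 1)
sig_8x2 = hybrid_signature(traces[:8], 2)
sig_4x4 = hybrid_signature(traces[:4], 4)
sig_2x8 = hybrid_signature(traces[:2], 8)
```

Every scheme yields a 16-element vector, so the clustering step receives the same amount of information regardless of how many characteristics contribute to it.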
Table 4-2 Efficiency of different hybrid wavelet signatures in phase classification (hybrid wavelet signature and its phase classification COV)

| Benchmark | Rank #1 | Rank #2 | Rank #3 | Rank #4 |
|---|---|---|---|---|
| Bzip2 | 16×1 (6.5%) | 8×2 (10.5%) | 4×4 (10.5%) | 2×8 (10.5%) |
| Crafty | 4×4 (1.2%) | 16×1 (1.6%) | 8×2 (1.9%) | 2×8 (3.9%) |
| Eon | 8×2 (1.3%) | 4×4 (1.6%) | 16×1 (1.8%) | 2×8 (3.6%) |
| Gap | 4×4 (4.2%) | 16×1 (6.3%) | 8×2 (7.2%) | 2×8 (9.3%) |
| Gcc | 8×2 (4.7%) | 16×1 (5.8%) | 4×4 (6.5%) | 2×8 (14.1%) |
| Gzip | 16×1 (2.5%) | 4×4 (3.7%) | 8×2 (4.4%) | 2×8 (4.9%) |
| Mcf | 16×1 (9.5%) | 4×4 (10.2%) | 8×2 (12.1%) | 2×8 (87.8%) |
| Parser | 16×1 (4.7%) | 8×2 (5.2%) | 4×4 (7.3%) | 2×8 (8.4%) |
| Perlbmk | 8×2 (0.7%) | 16×1 (0.8%) | 4×4 (0.8%) | 2×8 (1.5%) |
| Twolf | 16×1 (0.2%) | 8×2 (0.2%) | 4×4 (0.4%) | 2×8 (0.5%) |
| Vortex | 16×1 (2.4%) | 8×2 (4%) | 2×8 (4.4%) | 4×4 (5.8%) |
| Vpr | 16×1 (3%) | 8×2 (14.9%) | 4×4 (15.9%) | 2×8 (16.3%) |

We further compare the efficiency of using the 16×1 hybrid scheme (Hybrid), the best case that a single type of workload characteristic can achieve (Individual_Best) and the Simpoint based phase classification that uses basic block vectors (BBV). The results for the 12 SPEC integer benchmarks are shown in Figure 4-4.

Figure 4-4 Phase classification accuracy of using the 16×1 hybrid scheme (bars: Individual_Best, Hybrid and BBV for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr and AVG; y-axis: COV (%))

As can be seen, the Hybrid outperforms the Individual_Best on 10 out of the 12 benchmarks. The Hybrid also outperforms the BBV based Simpoint method on 10 out of the 12 cases.

Scalability

Above we can see that wavelet domain phase analysis can achieve higher accuracy. In this subsection, we address another important issue in phase analysis using workload execution characteristics: scalability.
Counters are usually used to collect workload statistics during program execution. The counters may overflow if they are used to track large scale phase behavior on long running workloads. Today, many large, real world workloads can run for days, weeks or even months before completion, and this trend is likely to continue in the future. To perform phase analysis on the next generation of computer workloads and systems, phase classification methods should be capable of scaling with the increasing program execution time.

To understand the impact of counter overflow on phase analysis accuracy, we use 16 accumulative counters to record the 16-dimension workload characteristic. The values of the 16 accumulative counters are then used as a signature to perform phase classification. We gradually reduce the number of bits in the accumulative counters. As a result, counter overflows start to occur. We use two schemes to handle a counter overflow. In our first method, a counter saturates at its maximum value once it overflows. In our second method, the counter is reset to zero after an overflow occurs. After all counter overflows are handled, we use the 16-dimension accumulative counter values to perform phase analysis and calculate the COVs. Figure 4-5 (a) describes the above procedure.

Figure 4-5 Different methods to handle counter overflows (diagram: (a) n-bit accumulative counters 1 to n collect program runtime statistics 1 to n over a large scale phase interval, followed by overflow handling, k-means clustering and COV; (b) n-bit sampling counters 1 to n collect program runtime statistics 1 to n, followed by DWT/hybrid wavelet signature, k-means clustering and COV)

Our counter overflow analysis results are shown in Figure 4-6.
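The two overflow handling schemes described above can be sketched as follows. This is an illustration only: the function name `accumulate` is hypothetical, and the reset policy is read literally as "set the counter to zero and keep counting", which is one possible interpretation of the description.

```python
def accumulate(events, bits, policy):
    """Accumulate event counts in an n-bit counter.

    policy "saturate": the counter stays at its maximum once it overflows.
    policy "reset":    the counter returns to zero and keeps counting
                       (an assumed reading of the reset scheme).
    Returns (final_value, overflowed).
    """
    if policy not in ("saturate", "reset"):
        raise ValueError("policy must be 'saturate' or 'reset'")
    max_value = (1 << bits) - 1
    count = 0
    overflowed = False
    for increment in events:
        count += increment
        if count > max_value:
            overflowed = True
            if policy == "saturate":
                return max_value, overflowed
            count = 0
    return count, overflowed


events = [100, 100, 100, 50]
wide = accumulate(events, 16, "saturate")      # (350, False): no overflow
saturated = accumulate(events, 8, "saturate")  # (255, True)
reset = accumulate(events, 8, "reset")         # (50, True)
```

Both policies discard information once the counter is too narrow for the interval, which is consistent with the accuracy loss reported for small counters.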
Figure 4-6 also shows the counter overflow rate (i.e., the percentage of counters that overflow) when counters with different sizes are used to collect workload statistics within program execution intervals. For example, on the benchmark crafty, when the number of bits used in the counters is reduced to 20, 100% of the counters overflow. For clarity, we only show the region within which the counter overflow rate is greater than zero and less than or equal to one. Since each program has a different execution time, the region varies from one program to another. As can be seen, counter overflows have a negative impact on phase classification accuracy. In general, COVs increase with the counter overflow rate. Interestingly, as the overflow rate increases, there are cases in which overflow handling reduces the COVs. This is because overflow handling has the effect of normalizing and smoothing irregular peaks in the original statistics.

Figure 4-6 Impact of counter overflows on phase analysis accuracy (one panel per benchmark: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr; x-axis: number of bits in counter, annotated with the counter overflow rate; y-axis: CoV; series: Saturate, Reset, Wavelet)

One solution to avoid counter overflows is to use sampling counters instead of accumulative counters, as shown in Figure 4-5 (b). However, when sampling counters are used, the collected statistics are represented as time series that have a large volume of data. The results shown in Figure 4-2 suggest that directly employing runtime samples in phase classification is less desirable. To address the scalability issue in characterizing large scale program phases using workload execution statistics, wavelet based dimensionality reduction techniques can be applied to extract the essential features of workload behavior from the sampled statistics. The observations we made in previous sections motivate the use of the DWT to absorb a large volume of sampled raw data and produce highly efficient wavelet domain signatures for phase analysis, as shown in Figure 4-5 (b). Figure 4-6 further shows phase analysis accuracy after applying wavelet techniques on the sampled workload statistics using sampling counters with different sizes. As can be seen, sampling enables using counters with limited size to study large program phases. In general, sampling can scale up naturally with the interval size as long as the sampled values do not overflow the counters. Therefore, with an increasing mismatch between phase interval and counter size, the sampling frequency is increased, resulting in an even higher volume of sampled data.
Using wavelet domain phase analysis can effectively infer program behavior from a large set of data collected over a long time span, resulting in low COVs in phase analysis.

Workload Variability

As described earlier, our methods collect various program execution statistics and use them to classify program execution into different phases. Such phase classification generally relies on comparing the similarity of the collected statistics. Ideally, different runs of the same code segment should be classified into the same phase. Existing phase detection techniques assume that workloads have deterministic execution. On real systems, with operating system interventions and other threads, applications manifest behavior that is not the same from run to run. This variability can stem from changes in system state that alter cache, TLB or I/O behavior, system calls or interrupts, and can result in noticeably different timing and performance behavior [18, 32]. This cross-run variability can confuse similarity based phase detection. In order for a phase analysis technique to be applicable on real systems, it should be able to perform robustly under variability.

Program cross-run variability can be thought of as noise, which is a random variance of a measured statistic. There are many possible reasons for noisy data, such as measurement/instrument errors and interventions of the operating system. Removing this variability from the collected runtime statistics can be considered as a process of denoising. In this chapter, we explore using wavelets as an effective way to perform denoising. Due to the vanishing moment property of the wavelets, only some wavelet coefficients are significant in most cases. By retaining selected wavelet coefficients, a wavelet transform can be applied to reduce the noise.
The main idea of wavelet denoising is to transform the data into the wavelet basis, where the large coefficients mainly contain the useful information and the smaller ones represent noise. By suitably modifying the coefficients in the new basis, noise can be directly removed from the data. The general denoising procedure involves three steps: 1) decompose: compute the wavelet decomposition of the original data; 2) threshold wavelet coefficients: select a threshold and apply thresholding to the wavelet coefficients; and 3) reconstruct: compute the wavelet reconstruction using the modified wavelet coefficients. More details on wavelet-based denoising techniques can be found in [33].

To model workload runtime variability, we use additive noise models and randomly inject noise into the time series that represents workload execution behavior. We vary the SNR (signal-to-noise ratio) to simulate scenarios with different degrees of variability. To classify program execution into phases, we generate a 16-dimension feature vector where each element contains the average value of the collected program execution characteristic for each interval. The k-means algorithm is then used for data clustering. Figure 4-7 illustrates the above described procedure.

Figure 4-7 Method for modeling workload variability (diagram elements: sampled workload statistics S1(t), workload variability model N(t), S2(t) = S1(t) + N(t), wavelet denoising producing D2(t), phase classification, COV comparison)

We use the Daubechies-8 wavelet with a global wavelet coefficient thresholding policy to perform denoising. We then compare the phase classification COVs of using the original data, the data with variability injected and the data after we perform denoising. Figure 4-8 shows our experimental results.
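The three denoising steps and the additive noise model can be sketched as below. Several choices here are assumptions made for brevity and are not taken from the experiments: an orthonormal Haar wavelet replaces Daubechies-8, the SNR is treated as a linear power ratio, the global threshold is the universal rule sigma * sqrt(2 ln n) with sigma estimated from the finest detail coefficients, hard thresholding is used, and all function names are hypothetical.

```python
import math
import random
import statistics

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """Step 1, decompose: orthonormal Haar DWT (length must be a power of two)."""
    approx, details = list(x), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / SQRT2 for a, b in pairs] + details
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx + details  # [approximation, coarse details ..., fine details]


def haar_inverse(coeffs):
    """Step 3, reconstruct: inverse of haar_forward."""
    approx, pos = list(coeffs[:1]), 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        approx = [v for a, d in zip(approx, detail)
                  for v in ((a + d) / SQRT2, (a - d) / SQRT2)]
    return approx


def add_noise(signal, snr, rng):
    """S2(t) = S1(t) + N(t), with Gaussian N(t) scaled to a linear power SNR."""
    power = sum(v * v for v in signal) / len(signal)
    sigma = math.sqrt(power / snr)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def denoise(noisy):
    coeffs = haar_forward(noisy)
    finest = coeffs[len(coeffs) // 2:]
    # Noise level estimated from the median absolute finest-scale coefficient.
    sigma = statistics.median(abs(c) for c in finest) / 0.6745
    threshold = sigma * math.sqrt(2.0 * math.log(len(coeffs)))
    # Step 2, hard threshold: keep the approximation, zero small details.
    kept = coeffs[:1] + [c if abs(c) > threshold else 0.0 for c in coeffs[1:]]
    return haar_inverse(kept)


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
clean = [1.0] * 64 + [3.0] * 64          # synthetic two-phase trace
noisy = add_noise(clean, snr=20.0, rng=rng)
restored = denoise(noisy)
```

On this synthetic trace the clean signal occupies only two Haar coefficients, so thresholding removes most of the injected noise while leaving the phase change intact.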
Figure 4-8 Effect of using wavelet denoising to handle workload variability (x-axis: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr; y-axis: COV; series: Original, Noised (SNR=20), Denoised (SNR=20), Noised (SNR=5), Denoised (SNR=5))

SNR=20 represents scenarios with a low degree of variability and SNR=5 reflects situations with a high degree of variability. As can be seen, introducing variability in workload execution statistics reduces phase analysis accuracy. Wavelet denoising is capable of recovering phase behavior from the noised data, resulting in higher phase analysis accuracy. Interestingly, on some benchmarks (e.g., eon and mcf), the denoised data achieve better phase classification accuracy than the original data. This is because in phase classification, randomly occurring peaks in the gathered workload execution data could have a deleterious effect on the phase classification results. Wavelet denoising smooths these irregular peaks and makes the phase classification method more robust.

Various types of wavelet denoising can be performed by choosing different threshold selection rules (e.g., rigrsure, heursure, sqtwolog and minimaxi), by performing hard (h) or soft (s) thresholding, and by specifying the multiplicative threshold rescaling model (e.g., one, sln and mln).

We compare the efficiency of different denoising techniques that have been implemented in the MATLAB tool [34]. Due to space limitations, only the results on benchmarks bzip2, gcc and mcf are shown in Figure 4-9. As can be seen, different wavelet denoising schemes achieve comparable accuracy in phase classification.
Figure 4-9 Efficiency of different denoising schemes (x-axis: wavelet denoising schemes, i.e., each combination of rule heursure, rigrsure, sqtwolog or minimaxi, thresholding s or h, and rescaling mln or sln; y-axis: COV; series: bzip2, gcc, mcf)

CHAPTER 5
INFORMED MICROARCHITECTURE DESIGN SPACE EXPLORATION

It is well known in the processor design community that program runtime characteristics exhibit significant variation. To obtain the dynamic behavior that programs manifest on complex microprocessors and systems, architects resort to detailed, cycle-accurate simulations. Figure 5-1 illustrates the variation in workload dynamics for the SPEC CPU 2000 benchmarks gap, crafty and vpr within one of their execution intervals. The results show the time-varying behavior of the workload performance (gap), power (crafty) and reliability (vpr) metrics across simulations with different microarchitecture configurations.

Figure 5-1 Variation of workload performance, power and reliability dynamics (panels: CPI for gap, Power (W) for crafty, AVF for vpr; x-axis: samples)

As can be seen, the manifested workload dynamics while executing the same code base vary widely across processors with different configurations. As the number of parameters in the design space increases, such variation in workload dynamics cannot be captured without using slow, detailed simulations. However, using simulation-based methods for architecture design space exploration, where numerous design parameters have to be considered, is prohibitively expensive.
Recently, researchers have proposed several predictive models [20-25] to reason about aggregated workload behavior at the architecture design stage. Among them, linear regression and neural network models have been the most widely used approaches. Linear models are straightforward to understand and provide accurate estimates of the significance of parameters and their interactions. However, they are usually inadequate for modeling the non-linear dynamics of real-world workloads, which exhibit widely different characteristics and complexity. Of the non-linear methods, neural network models can accurately predict aggregated program statistics (e.g., the CPI of the entire workload execution). Such models are termed global models, as only one model is used to characterize the measured programs. The monolithic global models are incapable of capturing and revealing program dynamics, which contain interesting fine-grain behavior. On the other hand, a workload may produce different dynamics when the underlying architecture configuration changes. Therefore, new methods are needed for accurately predicting complex workload dynamics.

To overcome the problems of monolithic, global predictive models, we propose a novel scheme that incorporates wavelet-based multiresolution decomposition techniques, which can produce a good local representation of the workload behavior in both time and frequency domains. The proposed analytical models, which combine wavelet-based multiscale data representation and neural network based regression prediction, can efficiently reason about program dynamics without resorting to detailed simulations. With our scheme, the complex workload dynamics are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network.
We extensively evaluate the efficiency of using wavelet neural networks for predicting the dynamics that the SPEC CPU 2000 benchmarks manifest on high performance microprocessors with a microarchitecture design space that consists of 9 key parameters. Our results show that the models achieve high accuracy in forecasting workload dynamics across a large microarchitecture design space.

In this chapter, we propose the use of wavelet neural networks to build accurate predictive models for workload dynamics driven microarchitecture design space exploration. We show that wavelet neural networks can be used to accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluate the efficiency of using the proposed techniques to predict workload dynamic behavior in the performance, power and reliability domains. We perform extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy and identify microarchitecture parameters that significantly affect workload dynamic behavior. We present a case study of using workload dynamics aware predictive models to quickly estimate the efficiency of scenario-driven architecture optimizations across different domains. Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios.

Neural Network

An Artificial Neural Network (ANN) [42] is an information processing paradigm that is inspired by the way biological nervous systems process information. It is composed of a set of interconnected processing elements working in unison to solve problems.
Figure 5-2 Basic architecture of a neural network (diagram: inputs X1 to Xn in the input layer, hidden units H1(x) to Hn(x) with radial basis function (RBF) responses over distance in the hidden layer, weights w1 to wn, and output f(x) in the output layer)

The most common type of neural network (Figure 5-2) consists of three layers of units: a layer of input units is connected to a layer of hidden units, which is connected to a layer of output units. The input is fed into the network through the input units. Each hidden unit receives the entire input vector and generates a response. The output of a hidden unit is determined by the input-output transfer function that is specified for that unit. Commonly used transfer functions include the sigmoid, the linear threshold function and the Radial Basis Function (RBF) [35]. The ANN output, which is determined by the output unit, is computed using the responses of the hidden units and the weights between the hidden and output units. Neural networks outperform linear models in capturing complex, nonlinear relations between input and output, which makes them a promising technique for tracking and forecasting complex behavior.

In this chapter, we use the RBF transfer function to model and estimate important wavelet coefficients on unexplored design spaces because of its superior ability to approximate complex functions. The basic architecture of an RBF network with n inputs and a single output is shown in Figure 5-2. The nodes in adjacent layers are fully connected. In a linear single-layer neural network model, a 1-dimensional function f is expressed as a linear combination of a set of n fixed functions, often called basis functions by analogy with the concept of a vector being composed of a linear combination of basis vectors.

$$f(x) = \sum_{j=1}^{n} w_j h_j(x) \qquad (5\text{-}1)$$

Here $w \in \mathbb{R}^n$ is the adaptable or trainable weight vector and $\{h_j(\cdot)\}_{j=1}^{n}$ are fixed basis functions, or the transfer functions of the hidden units.
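As an illustration of Equation 5-1, the following sketch evaluates the output of a single-output RBF network for a scalar input. It assumes Gaussian basis functions of the form exp(-(x - c)^2 / r^2); the function names, weights, centers and radii are made up for the example and are not trained values.

```python
import math


def gaussian_rbf(x, center, radius):
    """Response of one hidden unit: largest at the center, decaying with distance."""
    return math.exp(-((x - center) ** 2) / radius ** 2)


def rbf_output(x, weights, centers, radii):
    """Network output: weighted sum of the hidden unit responses (Equation 5-1)."""
    return sum(w * gaussian_rbf(x, c, r)
               for w, c, r in zip(weights, centers, radii))


# Illustrative parameters for a network with two hidden units.
weights = [2.0, 3.0]
centers = [0.0, 10.0]
radii = [1.0, 1.0]

near_first = rbf_output(0.0, weights, centers, radii)    # dominated by unit 1
near_second = rbf_output(10.0, weights, centers, radii)  # dominated by unit 2
far_away = rbf_output(100.0, weights, centers, radii)    # both units near zero
```

With the basis functions held fixed, only the weights change the output, which is the sense in which the model is linear.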
The flexibility of f, its ability to fit many different functions, derives only from the freedom to choose different values for the weights. The basis functions and any parameters which they might contain are fixed. If the basis functions can change during the learning process, then the model is nonlinear.

Radial functions are a special class of function. Their characteristic feature is that their response decreases (or increases) monotonically with distance from a central point. The center, the distance scale, and the precise shape of the radial function are parameters of the model, all fixed if it is linear. A typical radial function is the Gaussian which, in the case of a scalar input, is

$$h(x) = \exp\left(-\frac{(x - c)^2}{r^2}\right) \qquad (5\text{-}2)$$

Its parameters are its center c and its radius r. Radial functions are simply a class of functions. In principle, they could be employed in any sort of model, linear or nonlinear, and any sort of network (single-layer or multi-layer).

The training of the RBF network involves selecting the center locations and radii (which are eventually used to determine the weights) using a regression tree. A regression tree recursively partitions the input data set into subsets with decision criteria. As a result, there will be a root node, non-terminal nodes (having sub-nodes) and terminal nodes (having no sub-nodes), each of which is associated with an input data set. Each node contributes one unit to the RBF network's center and radius vectors. The selection of RBF centers is performed by recursively parsing regression tree nodes using a strategy proposed in [35].

Combining Wavelet and Neural Network for Workload Dynamics Prediction

We view workload dynamics as a time series produced by the processor, which is a nonlinear function of its design parameter configuration. Instead of predicting this function at every sampling point, we employ wavelets to approximate it.
Previous work [21, 23, 25] shows that neural networks can accurately predict aggregated workload behavior during design space exploration. Nevertheless, the monolithic global neural network models lack the capability of revealing complex workload dynamics. To overcome this disadvantage, we propose using wavelet neural networks that incorporate multiscale wavelet analysis into a set of neural networks for workload dynamics prediction. The wavelet transform is a very powerful tool for dealing with dynamic behavior since it captures both global and local workload behavior using a set of wavelet coefficients. The short-term workload characteristics are decomposed into the lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction, while the global workload behavior is decomposed into higher scales of wavelet coefficients (low frequencies) that are used for the analysis and prediction of slow trends in the workload execution. Collectively, these coordinated scales of time and frequency provide an accurate interpretation of workload dynamics. Our wavelet neural networks use a separate RBF neural network to predict individual wavelet coefficients at different scales. The predictions of the individual wavelet coefficients proceed independently. Predicting each wavelet coefficient by a separate neural network simplifies the training task of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to predict the workload dynamics.

Figure 5-3 shows our hybrid neuro-wavelet scheme for workload dynamics prediction. Given the observed workload dynamics on training data, our aim is to predict workload dynamic behavior under different architecture configurations. The hybrid scheme involves three stages. In the first stage, the time series is decomposed by wavelet multiresolution analysis.
In the second stage, each wavelet coefficient is predicted by a separate ANN, and in the third stage, the approximated time series is recovered from the predicted wavelet coefficients.

Figure 5-3 Using wavelet neural network for workload dynamics prediction (diagram: workload dynamics in the time domain pass through wavelet decomposition filters G0/H0 to Gk/Hk to give wavelet coefficients; RBF neural networks take the microarchitecture design parameters and output predicted wavelet coefficients 1 to n; wavelet reconstruction filters G*0/H*0 to G*k/H*k turn the predicted wavelet coefficients, with the remaining coefficients set to 0, into synthesized workload dynamics in the time domain)

Each RBF neural network receives the entire microarchitectural design space vector and predicts a wavelet coefficient. The training of an RBF network involves determining the center point and radius of each RBF and the weights of each RBF, which determine the wavelet coefficients.

Experimental Methodology

We evaluate the efficiency of using wavelet neural networks to explore workload dynamics in the performance, power and reliability domains during microarchitecture design space exploration. We use a unified, detailed microarchitecture simulator in our experiments. Our simulation framework, built using a heavily modified and extended version of the Simplescalar tool set, models pipelined, multiple-issue, out-of-order execution microprocessors with multiple level caches. Our framework uses a Wattch-based power model [36]. In addition, we built the Architecture Vulnerability Factor (AVF) analysis methods proposed in [37, 38] to estimate processor microarchitecture vulnerability to transient faults. A microarchitecture structure's AVF refers to the probability that a transient fault in that hardware structure will result in incorrect program results.
The AVF metric can be used to estimate how vulnerable the hardware is to soft errors during program execution. Table 5-1 summarizes the baseline machine configurations of our simulator.

Table 5-1 Simulated machine configuration

| Parameter | Configuration |
|---|---|
| Processor Width | 8-wide fetch/issue/commit |
| Issue Queue | 96 |
| ITLB | 128 entries, 4-way, 200 cycle miss |
| Branch Predictor | 2K entries Gshare, 10-bit global history |
| BTB | 2K entries, 4-way |
| Return Address Stack | 32 entries RAS |
| L1 Instruction Cache | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access |
| ROB Size | 96 entries |
| Load/Store Queue | 48 entries |
| Integer ALU | 8 I-ALU, 4 I-MUL/DIV, 4 Load/Store |
| FP ALU | 8 FP-ALU, 4 FP-MUL/DIV/SQRT |
| DTLB | 256 entries, 4-way, 200 cycle miss |
| L1 Data Cache | 64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle access |
| L2 Cache | unified 2MB, 4-way, 128 Byte/line, 12 cycle access |
| Memory Access | 64 bit wide, 200 cycles access latency |

We perform our analysis using twelve SPEC CPU 2000 benchmarks: bzip2, crafty, eon, gap, gcc, mcf, parser, perlbmk, twolf, swim, vortex and vpr. We use the Simpoint tool to pick the most representative simulation point for each benchmark (with the full reference input set), and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. Each simulation contains 200M instructions. In this chapter, we consider a design space that consists of 9 microarchitectural parameters (see Table 5-2) of the superscalar architecture. These microarchitectural parameters have been shown to have the largest impact on processor performance [21]. The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed, cycle-accurate simulations, we measure processor performance, power and reliability characteristics on all design points within both the training and testing data sets.
We build a separate model for each program and use the model to predict workload dynamics in the performance, power and reliability domains at unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of the models' accuracy is obtained by using the design points in the testing data set.

Table 5-2. Microarchitectural parameter ranges used for generating train/test data

| Parameter | Train | Test | # of Levels |
|---|---|---|---|
| Fetch_width | 2, 4, 8, 16 | 2, 8 | 4 |
| ROB_size | 96, 128, 160 | 128, 160 | 3 |
| IQ_size | 32, 64, 96, 128 | 32, 64 | 4 |
| LSQ_size | 16, 24, 32, 64 | 16, 24, 32 | 4 |
| L2_size | 256, 1024, 2048, 4096 KB | 256, 1024, 4096 KB | 4 |
| L2_lat | 8, 12, 14, 16, 20 | 8, 12, 14 | 5 |
| il1_size | 8, 16, 32, 64 KB | 8, 16, 32 KB | 4 |
| dl1_size | 8, 16, 32, 64 KB | 16, 32, 64 KB | 4 |
| dl1_lat | 1, 2, 3, 4 | 1, 2, 3 | 4 |

To build a representative design space, one needs to ensure that the sampled points are spread throughout the design space, yet are unique and few enough to keep the model-building cost low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling strategy since it provides better coverage than a naive random sampling scheme. We generate multiple LHS matrices and apply a space-filling metric called L2-star discrepancy [40] to each LHS matrix to find the unique and most representative design space, namely the one with the lowest L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. We used 200 training points and 50 test points for workload dynamics prediction since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. In our study, each workload dynamics trace is represented by 128 samples.

Predicting each wavelet coefficient by a separate neural network simplifies the learning task.
Since complex workload dynamics can be captured using a limited number of wavelet coefficients, the total size of the wavelet neural networks can be small. Because small-magnitude wavelet coefficients contribute less to the reconstructed data, we opt to predict only a small set of important wavelet coefficients.

Figure 5-4. Magnitude-based ranking of 128 wavelet coefficients (x-axis: wavelet coefficient index; y-axis: processor configuration; color scale from high to low magnitude)

Specifically, we consider the following two schemes for selecting important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0, and (2) order-based: select the first k coefficients and approximate the rest with 0. In this study, we choose the magnitude-based scheme since it always outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients does not change drastically across the design space. Figure 5-4 illustrates the magnitude-based ranking (shown as a color map where red indicates high ranks and blue indicates low ranks) of a total of 128 wavelet coefficients (decomposed from benchmark gcc dynamics) across 50 different microarchitecture configurations. As can be seen, the top-ranked wavelet coefficients largely remain consistent across different processor configurations. Wavelet coefficients with large magnitude

Evaluation and Results

In this section, we present detailed experimental results on using wavelet neural networks to predict workload dynamics in the performance, reliability and power domains. The workload dynamics prediction accuracy measure is the mean square error (MSE), defined as follows:

$$ MSE = \frac{1}{N}\sum_{k=1}^{N}\left(x(k) - \hat{x}(k)\right)^{2} \qquad (5\text{-}3) $$

where $x(k)$ is the actual value, $\hat{x}(k)$ is the predicted value and $N$ is the total number of samples. As prediction accuracy increases, the MSE becomes smaller.
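The magnitude-based and order-based selection schemes and the MSE measure of Equation 5-3 can be illustrated with a minimal sketch. It assumes an orthonormal Haar wavelet and a synthetic 128-sample trace; the wavelet choice, the trace and all function names (`haar_forward`, `haar_inverse`, `keep_largest`, `keep_first`, `mse`) are illustrative assumptions rather than the implementation used in this study, and the coefficients here are exact rather than predicted by a neural network.

```python
import math

def haar_forward(signal):
    """Full orthonormal Haar decomposition of a sequence whose length is a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = approx + detail  # coarse part first, details after it
        n = half
    return coeffs

def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / math.sqrt(2))
            merged.append((a - d) / math.sqrt(2))
        out[:2 * n] = merged
        n *= 2
    return out

def keep_largest(coeffs, k):
    """Magnitude-based scheme: keep the k largest coefficients, zero the rest."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def keep_first(coeffs, k):
    """Order-based scheme: keep the first k coefficients, zero the rest."""
    return [c if i < k else 0.0 for i, c in enumerate(coeffs)]

def mse(actual, predicted):
    """Mean square error between two equally long traces (Equation 5-3)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Synthetic stand-in for a 128-sample workload dynamics trace (an assumption).
trace = [1.0 + 0.5 * math.sin(2 * math.pi * i / 32) + (0.8 if 40 <= i < 48 else 0.0)
         for i in range(128)]
coeffs = haar_forward(trace)
mse_magnitude = mse(trace, haar_inverse(keep_largest(coeffs, 16)))
mse_order = mse(trace, haar_inverse(keep_first(coeffs, 16)))
print(mse_magnitude, mse_order)
```

Because the transform in this sketch is orthonormal, the reconstruction error equals the energy of the discarded coefficients, so for a fixed k the magnitude-based scheme cannot do worse than the order-based scheme on exact coefficients.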
Figure 5-5. MSE boxplots of workload dynamics prediction: (a) CPI, (b) Power, (c) AVF (x-axis: benchmark; y-axis: MSE (%))

The workload dynamics prediction accuracies in the performance, power and reliability domains are plotted as boxplots (Figure 5-5). Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the MSE values. Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data and shows the center of the distribution of the MSE values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or to a distance of 1.5 times the interquartile range from the box, whichever is less. The outliers are marked as circles. In Figure 5-5, the line with diamond-shaped markers indicates the average MSE across all test cases.

Figure 5-5 shows that the performance model achieves median errors ranging from 0.5 percent (swim) to 8.6 percent (mcf), with an overall median error across all benchmarks of 2.3 percent. As can be seen, even though the maximum error at any design point for any benchmark is 30%, most benchmarks show an MSE of less than 10%. This indicates that our proposed neuro-wavelet scheme can forecast the dynamic behavior of program performance characteristics with high accuracy. Figure 5-5 shows that the power models are slightly less accurate, with median errors ranging from 1.3 percent (vpr) to 4.9 percent (crafty) and an overall median of 2.6 percent.
The power prediction has a high maximum value of 35%. These errors are much smaller in the reliability domain.

In general, the workload dynamics prediction accuracy increases when more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients. Cost-effective models should provide high prediction accuracy while maintaining low complexity. Figure 5-6 shows the trend of prediction accuracy (averaged over all benchmarks) when various numbers of wavelet coefficients are used.

Figure 5-6. MSE trends with increased number of wavelet coefficients (x-axis: number of wavelet coefficients, 16 to 128; y-axis: MSE (%); series: CPI, Power, AVF)

As can be seen, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point improves error at a lower rate. This is because wavelets provide a good time and locality characterization capability, and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide an interpretation of the series structures across scales of the time and frequency domains. The capability of using a limited set of wavelet coefficients to capture workload dynamics varies with the resolution level.

Figure 5-7. MSE trends with increased sampling frequency (x-axis: number of samples, 64 to 1024; y-axis: MSE (%); series: CPI, Power, AVF)

Figure 5-7 illustrates the MSE (averaged over all benchmarks) yielded by predictive models that use 16 wavelet coefficients when the number of samples varies from 64 to 1024. As the sampling frequency increases, using the same number of wavelet coefficients is less accurate in terms of capturing workload dynamic behavior. As can be seen, the increase in MSE is not significant.
This suggests that the proposed schemes can capture workload dynamic behavior with increasing complexity.

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input microarchitecture parameters were ranked based on either split order or split frequency. The microarchitecture parameters that cause the most output variation tend to be split earliest and most often in the constructed regression tree. Therefore, the microarchitecture parameters that largely determine the value of a wavelet coefficient are located higher in the regression tree than the others and have a larger number of splits.

Figure 5-8. Roles of microarchitecture design parameters: (a) split order, (b) split frequency (star plots per benchmark for CPI, Power and AVF; spokes: Fetch, ROB, IQ, LSQ, L2, L2_lat, il1, dl1, dl1_lat)

We present in Figure 5-8 (shown as star plots) the initial and most frequent splits within the regression trees that model the most significant wavelet coefficients. A star plot [41] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points.
From the star plot, we can obtain information such as: Which variables are dominant for a given data set? Which observations show similar behavior? For example, on benchmark gcc, Fetch, dl1 and LSQ have significant roles in predicting dynamic behavior in the performance domain, while ROB, Fetch and dl1_lat largely affect workload dynamic behavior in the reliability domain. For benchmark gcc, the most frequently involved microarchitecture parameters in regression tree construction are ROB, LSQ, L2 and L2_lat in the performance domain, and LSQ and Fetch in the reliability domain.

Compared with models that only predict aggregated workload behavior, our proposed methods can forecast workload runtime execution scenarios. This feature is essential if the predictive models are employed to trigger runtime dynamic management mechanisms for power and reliability optimizations. Inadequate predictions of workload worst-case scenarios could make microprocessors fail to meet the desired power and reliability targets. Conversely, false alarms caused by over-prediction of the worst-case scenarios can trigger responses too frequently, resulting in significant overhead. In this section, we study the suitability of using the proposed schemes for workload execution scenario based classification. Specifically, for a given workload characteristics threshold, we calculate how many sampling points in a trace that represents workload dynamics are above or below the threshold. We then apply the same calculation to the predicted workload dynamics trace. We use the directional symmetry (DS) metric, i.e., the percentage of correctly predicted directions with respect to the target variable, defined as

$$ DS = \frac{1}{N}\sum_{k=1}^{N}\varphi(k) \qquad (5\text{-}4) $$

where $\varphi(k) = 1$ if $x(k)$ and $\hat{x}(k)$ are both above or both below the threshold, and $\varphi(k) = 0$ otherwise. Thus, the DS provides a measure of the number of times the sign of the target is correctly forecasted.
For example, DS = 50% implies that the predicted direction was correct for half of the predictions. In this work, we set three threshold levels (named 1Q, 2Q and 3Q in Figure 5-9) between the max and min values in each trace as follows, where 1Q is the lowest threshold level and 3Q is the highest threshold level.

1Q = MIN + (MAX - MIN) * (1/4)
2Q = MIN + (MAX - MIN) * (2/4)
3Q = MIN + (MAX - MIN) * (3/4)

Figure 5-9. Threshold-based workload execution scenarios

Figure 5-10 shows the results of threshold-based workload dynamic behavior classification. The results are presented as directional asymmetry, which can be expressed as 1 - DS. As can be seen, our wavelet-based RBF neural networks not only effectively capture workload dynamics, but also accurately classify workload execution into different scenarios. This suggests that proactive dynamic power and reliability management schemes can be built using the proposed models. For instance, given a power/reliability threshold, our wavelet RBF neural networks can be used to forecast the workload execution scenario. If the predicted workload characteristics exceed the threshold level, processors can start to respond before power/reliability reaches or surpasses the threshold level.

Figure 5-10. Threshold-based workload execution (x-axis: benchmark; y-axis: directional asymmetry (%); panels: CPI, Power and AVF, each at thresholds 1Q, 2Q and 3Q)

Figure 5-11 further illustrates detailed workload execution scenario predictions on benchmark bzip2. Both simulation and prediction results are shown. The predicted results closely track the varied program dynamic behavior in different domains.
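The threshold levels and the directional symmetry metric of Equation 5-4 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the thresholds are derived from the simulated (actual) trace, "above" is taken to mean strictly greater than the threshold, and the function names and the six-sample traces are hypothetical.

```python
def threshold_levels(trace):
    """Return the 1Q, 2Q and 3Q levels between the min and max of a trace."""
    lo, hi = min(trace), max(trace)
    return [lo + (hi - lo) * q / 4 for q in (1, 2, 3)]

def directional_symmetry(actual, predicted, threshold):
    """Fraction of samples where both traces fall on the same side of the threshold."""
    hits = sum((a > threshold) == (p > threshold) for a, p in zip(actual, predicted))
    return hits / len(actual)

# Hypothetical simulated and predicted traces (not data from this study).
simulated = [1, 4, 9, 6, 2, 8]
predicted = [2, 4.5, 7, 4, 1, 9]

for name, level in zip(("1Q", "2Q", "3Q"), threshold_levels(simulated)):
    ds = directional_symmetry(simulated, predicted, level)
    print(name, level, "DS =", ds, "directional asymmetry =", 1 - ds)
```

Samples that sit exactly on a threshold are counted as "not above" here; a different tie-breaking rule would change the result only for such boundary samples.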
Figure 5-11. Threshold-based workload scenario prediction: (a) performance, (b) power, (c) reliability

Workload Dynamics Driven Architecture Design Space Exploration

In this section, we present a case study to demonstrate the benefit of applying workload dynamics prediction in early architecture design space exploration. Specifically, we show that workload dynamics prediction models can effectively forecast the worst-case operating conditions for soft error vulnerability and accurately estimate the efficiency of soft error vulnerability management schemes.

Because of technology scaling, radiation-induced soft errors contribute more and more to the failure rate of CMOS devices. Therefore, the soft error rate is an important reliability issue in deep-submicron microprocessor design. Processor microarchitecture soft error vulnerability exhibits significant runtime variation, and it is neither economical nor practical to design fault-tolerant schemes that target the worst-case operating condition. Dynamic Vulnerability Management (DVM) refers to a set of strategies to keep hardware runtime soft-error susceptibility under a tolerable threshold. DVM allows designers to achieve higher dependability on hardware designed for a lower reliability setting. If a particular execution period exceeds the pre-defined vulnerability threshold, a DVM response (Figure 5-12) will work to reduce hardware vulnerability.

Figure 5-12. Dynamic Vulnerability Management (vulnerability over time, showing the designed-for reliability capacity with and without DVM, the DVM trigger level, the points where DVM is engaged and disengaged, and the DVM performance overhead)

A primary goal of DVM is to keep vulnerability within a pre-defined reliability target during the entire program execution. The DVM will be triggered once the hardware soft error vulnerability exceeds the predefined threshold. Once the trigger goes on, a DVM response begins.
Depending on the type of response chosen, there may be some performance degradation. A DVM response can be turned off as soon as the vulnerability drops below the threshold. To successfully achieve the desired reliability target and effectively mitigate the overhead of DVM, architects need techniques to quickly infer application worst-case operating conditions across design alternatives and accurately estimate the efficiency of DVM schemes at an early design stage.

We developed a DVM scheme to manage runtime instruction queue (IQ) vulnerability to soft errors.

    DVM_IQ {
        ACE bits counter updating();
        if current context has L2 cache misses
            then stall dispatching instructions for current context;
        every (sample_interval/5) cycles {
            if online IQ_AVF > trigger threshold
                then wq_ratio = wq_ratio/2;
                else wq_ratio = wq_ratio+1;
        }
        if (ratio of waiting instruction # to ready instruction # > wq_ratio)
            then stall dispatching instructions;
    }

Figure 5-13. IQ DVM Pseudo Code

Figure 5-13 shows the pseudo code of our DVM policy. The DVM scheme computes the online IQ AVF to estimate runtime microarchitecture vulnerability. The estimated AVF is compared against a trigger threshold to determine whether it is necessary to enable a response mechanism. To reduce IQ soft error vulnerability, we throttle instruction dispatching from the ROB to the IQ upon an L2 cache miss. Additionally, we sample the IQ AVF at a finer granularity and compare the sampled AVF with the trigger threshold. If the IQ AVF exceeds the trigger threshold, a parameter wq_ratio, which specifies the ratio of the number of waiting instructions to that of ready instructions in the IQ, is updated. The purpose of setting this parameter is to maintain performance by allowing an appropriate fraction of waiting instructions in the IQ to exploit ILP. By maintaining a desired ratio between the waiting instructions and the ready instructions, vulnerability can be reduced at negligible performance cost.
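The control logic of the pseudo code in Figure 5-13 can be restated as a small runnable sketch. It only models the wq_ratio update and the stall decision; the ACE-bit counting and the online AVF computation are outside its scope. The class name, the method names and the initial wq_ratio value are hypothetical, and the handling of a queue with no ready instructions is an assumption, since the pseudo code does not specify it.

```python
from dataclasses import dataclass

@dataclass
class IqDvmController:
    """Illustrative model of the IQ DVM policy's wq_ratio control loop."""
    trigger_threshold: float
    wq_ratio: float = 4.0  # hypothetical starting value, not taken from the study

    def on_sample(self, online_iq_avf):
        """Called every (sample_interval / 5) cycles with the sampled IQ AVF."""
        if online_iq_avf > self.trigger_threshold:
            self.wq_ratio = self.wq_ratio / 2   # rapid decrease on an emergency
        else:
            self.wq_ratio = self.wq_ratio + 1   # slow increase otherwise

    def should_stall_dispatch(self, waiting, ready, l2_miss_pending):
        """Decide whether dispatch into the IQ is stalled this cycle."""
        if l2_miss_pending:
            return True
        # Same test as waiting / ready > wq_ratio, written without a division
        # so that ready == 0 is handled (assumed: stall if anything is waiting).
        return waiting > self.wq_ratio * ready

controller = IqDvmController(trigger_threshold=0.3)
controller.on_sample(0.5)   # AVF above the threshold halves wq_ratio
controller.on_sample(0.1)   # AVF below the threshold adds one to wq_ratio
print(controller.wq_ratio, controller.should_stall_dispatch(7, 2, False))
```

Halving on an emergency and incrementing by one otherwise gives the asymmetric response the policy relies on: the allowed fraction of waiting instructions shrinks quickly when vulnerability is high and recovers gradually afterwards.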
The wq_ratio update is triggered by the estimated IQ AVF. In our DVM design, wq_ratio is adapted through slow increases and rapid decreases in order to ensure a quick response to a vulnerability emergency.

We built workload dynamics predictive models that incorporate DVM as a new design parameter. Therefore, our models can predict workload execution scenarios with and without the DVM feature across different microarchitecture configurations. Figure 5-14 shows the results of using the predictive models to forecast IQ AVF on benchmark gcc across two microarchitecture configurations.

Figure 5-14. Workload dynamic prediction with scenario-based architecture optimization: (a) Scenario 1, (b) Scenario 2, each with DVM disabled and DVM enabled (x-axis: samples; y-axis: IQ_AVF; series: Simulation, Prediction, DVM Target)

We set the DVM target to 0.3, which means that the DVM policy, when enabled, should maintain the IQ AVF below 0.3 during workload execution. In both cases, the IQ AVF dynamics were predicted with DVM disabled and enabled. As can be seen, in scenario 1, the DVM successfully achieves its goal. In scenario 2, despite the enabled DVM feature, the IQ AVF of certain execution periods is still above the threshold. This implies that the developed DVM mechanism is suitable for the microarchitecture configuration used in scenario 1. On the other hand, architects have to choose another DVM policy if the microarchitecture configuration shown in scenario 2 is chosen in their design.
Figure 5-14 shows that in all cases, the predictive models can accurately forecast the trends in IQ AVF dynamics due to architecture optimizations. Figure 5-15 (a) shows the prediction accuracy of IQ AVF dynamics when the DVM policy is enabled. The results are shown for all 50 microarchitecture configurations in our test data set.

Figure 5-15. Heat plot that shows the MSE of IQ AVF and processor power: (a) IQ AVF, (b) Power (columns: benchmarks; rows: test configurations 1 to 50; color key: MSE (%))

Since deploying the DVM policy will also affect runtime processor power behavior, we further build models to forecast processor power dynamic behavior due to the DVM. The results are shown in Figure 5-15 (b). The data is presented as a heat plot, which maps the actual data values onto a color scale, with a dendrogram added to the top. A dendrogram consists of many U-shaped lines connecting objects in a hierarchical tree. The height of each U represents the distance between the two objects being connected. For a given benchmark, a vertical trace line shows the scaled MSE values across all test cases.

Figure 5-15 (a) shows that the predictive models yield high prediction accuracy across all test cases on benchmarks swim, eon and vpr. The models yield prediction variation on benchmarks gcc, crafty and vortex. In the power domain, prediction accuracy is more uniform across benchmarks and microarchitecture configurations.

In Figure 5-16, we show the IQ AVF MSE when different DVM thresholds are set.
The results suggest that our predictive models work well when different DVM targets are considered.

Figure 5-16. IQ AVF dynamics prediction accuracy across different DVM thresholds (x-axis: benchmark; y-axis: IQ AVF MSE (%); series: DVM threshold = 0.2, 0.3, 0.5)

CHAPTER 6
ACCURATE, SCALABLE AND INFORMATIVE DESIGN SPACE EXPLORATION IN MULTI-CORE ARCHITECTURES

Early design space exploration is an essential ingredient in modern processor development. It significantly reduces the time to market and post-silicon surprises. The trend toward multi-/many-core processors will result in sophisticated large-scale architecture substrates (e.g. non-uniformly accessed cache [43] interconnected by network-on-chip [44]) with self-contained hardware components (e.g. cache banks, routers and interconnect links) proximate to the individual cores but globally distributed across all cores. As the number of cores on a processor increases, these large and sophisticated multi-core-oriented architectures exhibit increasingly complex and heterogeneous characteristics. As an example, to alleviate the deleterious impact of wire delays, architects have proposed splitting up large L2/L3 caches into multiple banks, with each bank having a different access latency depending on its physical proximity to the cores. Figure 6-1 illustrates normalized cache hits (results are plotted as color maps) across the 256 cache banks of a non-uniform cache architecture (NUCA) [43] design on an 8-core chip multiprocessor (CMP) running the SPLASH-2 Ocean-c workload. The 2D architecture spatial patterns yielded on NUCA with different architecture design parameters are shown.

Figure 6-1. Variation of cache hits across a 256-bank non-uniform access cache on 8-core (color scale: normalized cache hits, 0 to 1)

As can be seen, there is a significant variation in cache access frequency across individual cache banks.
At larger scales, the manifested 2-dimensional spatial characteristics across the entire NUCA substrate vary widely with different design choices while executing the same code base. In this example, various NUCA cache configurations such as network topologies (e.g. hierarchical, point-to-point and crossbar) and data management schemes (e.g. static (SNUCA) [43], dynamic (DNUCA) [45, 46] and dynamic with replication (RNUCA) [47-49]) are used. As the number of parameters in the design space increases, such variation and characteristics at large scales cannot be captured without using slow and detailed simulations. However, using simulation-based methods for architecture design space exploration, where numerous design parameters have to be considered, is prohibitively expensive.

Recently, various predictive models [20-25, 50] have been proposed to cost-effectively reason about processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous behavior of large and distributed architecture substrates across the design space. This limitation will only be exacerbated by the rapidly increasing integration scale (e.g. number of cores per chip). Therefore, there is a pressing need for novel and cost-effective approaches to achieve accurate and informative design trade-off analysis for large and sophisticated architectures in the upcoming multi-/many-core era.

Thus, in this chapter, instead of quantifying these large and sophisticated architectures by a single number or a simple statistical distribution, we propose techniques that employ 2D wavelet multiresolution analysis and neural network nonlinear regression modeling.
With our schemes, the complex spatial characteristics that workloads exhibit across large architecture substrates are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of wavelet coefficients, our models can accurately reconstruct architecture 2D spatial behavior across the design space. Using both multi-programmed and multi-threaded workloads, we extensively evaluate the efficiency of using 2D wavelet neural networks for predicting the complex behavior of non-uniformly accessed cache designs with widely varied configurations.

Combining Wavelets and Neural Networks for Architecture 2D Spatial Characteristics Prediction

We view the 2D spatial characteristics yielded on large and distributed architecture substrates as a nonlinear function of architecture design parameters. Instead of inferring the spatial behavior by exhaustively obtaining architecture characteristics on each individual node/component, we employ wavelet analysis to approximate it and then use a neural network to forecast the approximated behavior across a large architecture design space. Previous work [21, 23, 25, 50] shows that neural networks can accurately predict aggregated workload behavior across varied architecture configurations. Nevertheless, monolithic global neural network models lack the ability to informatively reveal complex workload/architecture interactions at a large scale. To overcome this disadvantage, we propose combining 2D wavelet transforms and neural networks, incorporating multiresolution analysis into a set of neural networks for spatial characteristics prediction of multi-core-oriented architecture substrates. The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both the global trend and the local variation of large data sets using a small set of wavelet coefficients.
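How a 2D wavelet transform separates global trend from local variation can be seen in a minimal one-level sketch. It assumes a Haar wavelet in its unnormalized averaging/differencing form and a hypothetical 4x4 matrix of per-bank cache hits; the function names and the data are illustrative assumptions, not the transform or measurements used in this chapter.

```python
def haar2d_level(matrix):
    """One level of a 2D Haar transform on a matrix with even dimensions.

    Returns the approximation sub-band and three detail sub-bands
    (column differences, row differences, diagonal differences).
    """
    approx, col_diff, row_diff, diag_diff = [], [], [], []
    for r in range(0, len(matrix), 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for c in range(0, len(matrix[0]), 2):
            p, q = matrix[r][c], matrix[r][c + 1]
            s, t = matrix[r + 1][c], matrix[r + 1][c + 1]
            a_row.append((p + q + s + t) / 4)   # local average (global trend)
            c_row.append((p - q + s - t) / 4)   # left column minus right column
            r_row.append((p + q - s - t) / 4)   # top row minus bottom row
            d_row.append((p - q - s + t) / 4)   # diagonal contrast
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag_diff.append(d_row)
    return approx, col_diff, row_diff, diag_diff

def haar2d_inverse(approx, col_diff, row_diff, diag_diff):
    """Rebuild the original matrix from the four sub-bands."""
    out = []
    for a_row, c_row, r_row, d_row in zip(approx, col_diff, row_diff, diag_diff):
        top, bottom = [], []
        for a, c, r, d in zip(a_row, c_row, r_row, d_row):
            top += [a + c + r + d, a - c + r - d]
            bottom += [a + c - r - d, a - c - r + d]
        out += [top, bottom]
    return out

# Hypothetical per-bank hit counts: one hot quadrant and one isolated hot bank.
hits = [[8, 8, 1, 1],
        [8, 8, 1, 1],
        [2, 2, 1, 1],
        [2, 2, 1, 5]]
approx, col_diff, row_diff, diag_diff = haar2d_level(hits)
print(approx)
print(col_diff, row_diff, diag_diff)
```

In this example the approximation sub-band summarizes each 2x2 group of banks, while the detail sub-bands are zero everywhere except in the group that contains the isolated hot bank, which is the kind of sparsity that lets a small set of coefficients describe a large substrate.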
The local characteristics are decomposed into lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction of individual cores/components or subsets of them, while the global trend is decomposed into higher scales of wavelet coefficients (low frequencies) that are used for the analysis and prediction of slow trends across many cores or distributed hardware components. Collectively, these wavelet coefficients provide an accurate interpretation of the spatial trend and details of complex workload behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict each individual wavelet coefficient. The separate predictions of wavelet coefficients proceed independently. Predicting each wavelet coefficient by a separate neural network simplifies the training task (which can be performed concurrently) of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to synthesize the spatial patterns on large-scale architecture substrates.

Figure 6-2 shows our hybrid neuro-wavelet scheme for architecture 2D spatial characteristics prediction. Given the observed spatial behavior on training data, our aim is to predict the 2D behavior of a large-scale architecture under different design configurations.
Figure 6-2. Using wavelet neural networks for forecasting architecture 2D characteristics

The hybrid scheme involves three stages. In the first stage, the observed spatial behavior is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design space vector and predicts a wavelet coefficient. The training of an RBF network involves determining the center point and a radius for each RBF, and the weights of each RBF, which determine the wavelet coefficients.

Experimental Methodology

We evaluate the efficiency of 2D wavelet neural networks for forecasting spatial characteristics of large-scale multi-core NUCA designs using the GEMS 1.2 [51] toolset interfaced with the Simics [52] full-system functional simulator. We simulate a SPARC V9 8-core CMP running Solaris 9. We model in-order issue cores for this study to keep the simulation time tractable. The processors have private L1 caches, and the shared L2 is a 256-bank 16MB NUCA. The private L1 caches of different processors are kept coherent using a distributed directory-based protocol. To model the L2 cache, we use the Ruby NUCA cache simulator developed in [47], which includes an on-chip network model. The network models all messages communicated in the system, including all requests, responses, replacements and acknowledgements. Table 6-1 summarizes the baseline machine configuration of our simulator.
Table 6-1. Simulated machine configuration (baseline)

| Parameter | Configuration |
|---|---|
| Number of cores | 8 |
| Issue Width | 1 |
| L1 (split I/D) | 64KB, 64B line, write-allocation |
| L2 (NUCA) | 16 MB (256 x 64 KB), 64B line |
| Memory | Sequential |
| Memory | 4 GB of DRAM, 250 cycle latency, 4KB |

Our baseline processor/L2 NUCA organization is similar to that of Beckmann and Wood [47] and is illustrated in Figure 6-3. Each processor core (including L1 data and instruction caches) is placed on the chip boundary, and eight such cores surround a shared L2 cache. The L2 is partitioned into 256 banks (grouped as 16 bank clusters) and connected with an interconnection network. Each core has a cache controller that routes the core's requests to the appropriate cache bank. The NUCA design space is very large. In this chapter, we consider a design space that consists of 9 parameters (see Table 6-2) of the CMP NUCA architecture.

Figure 6-3. Baseline CMP with 8 cores that share a NUCA L2 cache (CPU 0 to CPU 7, each with L1 D$ and L1 I$, placed around 256 numbered L2 banks)

Table 6-2. Considered architecture design parameters and their ranges

| Parameters | Description |
|---|---|
| NUCA Management Policy (NUCA) | SNUCA, DNUCA, RNUCA |
| Network Topology (net) | Hierarchical, PT_to_PT, Crossbar |
| Network Link Latency (net_lat) | 20, 30, 40, 50 |
| L1_latency (L1_lat) | 1, 3, 5 |
| L2_latency (L2_lat) | 6, 8, 10, 12 |
| L1_associativity (L1_aso) | 1, 2, 4, 8 |
| L2_associativity (L2_aso) | 2, 4, 8, 16 |
| Directory Latency (d_lat) | 30, 60, 80, 100 |
| Processor Buffer Size (p_buf) | 5, 10, 20 |

These design parameters cover the NUCA data management policy (NUCA), interconnection topology and latency (net and net_lat), the configurations of the L1 and L2 caches (L1_lat, L2_lat, L1_aso and L2_aso), cache coherency directory latency (d_lat) and the number of cache accesses that a processor core can issue to the L1 (p_buf). The ranges for these parameters were set to include both typical and feasible design points within the explored design space.

We studied the CMP NUCA designs using various multi-programmed and multi-threaded workloads (listed in Table 6-3).

Table 6-3. Multi-programmed workloads

| Type | Workload | Description / Data Set |
|---|---|---|
| Multi-programmed, homogeneous | Group1 | gcc (8 copies) |
| Multi-programmed, homogeneous | Group2 | mcf (8 copies) |
| Multi-programmed, heterogeneous | Group1 (CPU) | gap, bzip2, equake, gcc, mesa, perlbmk, parser, ammp |
| Multi-programmed, heterogeneous | Group2 (MIX) | perlbmk, mcf, bzip2, vpr, mesa, art, gcc, equake |
| Multi-programmed, heterogeneous | Group3 (MEM) | mcf, twolf, art, ammp, equake, mcf, art, mesa |
| Multithreaded (Splash2) | barnes | 16k particles |
| Multithreaded (Splash2) | fmm | input.16348 |
| Multithreaded (Splash2) | ocean-co | 514x514 ocean body |
| Multithreaded (Splash2) | ocean-nc | 258x258 ocean body |
| Multithreaded (Splash2) | water-ns | 512 molecules |
| Multithreaded (Splash2) | cholesky | tk15.O |
| Multithreaded (Splash2) | fft | 65,536 complex data points |
| Multithreaded (Splash2) | radix | 256k keys, 1024 radix |

Our heterogeneous multi-programmed workloads consist of a mix of programs from the SPEC 2000 benchmarks with full reference input sets. The homogeneous multi-programmed workloads consist of multiple copies of an identical SPEC 2000 program. For multi-programmed workload simulations, we fast-forward until all benchmarks pass their initialization phases.
For multithreaded workloads, we used 8 benchmarks from the SPLASH-2 suite [53]; we mark the initialization phase in the software code and skip it in our simulations. In all simulations, we first warm up the cache model. After that, each simulation runs 500 million instructions or to benchmark completion, whichever is less. Using detailed simulation, we obtain the 2D architecture characteristics of the large-scale NUCA at all design points within both the training and testing data sets. We build a separate model for each workload and use the model to predict the 2D spatial architecture behavior at unexplored points in the design space. The training data set is used to build the 2D wavelet neural network models. An estimate of the models' accuracy is obtained by using the design points in the testing data set.

To build a representative design space, one needs to ensure that the sample data sets disperse points throughout the design space while keeping the sample small enough that the cost of building the model stays low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) as our sampling strategy since it provides better coverage compared to a naive random sampling scheme. We generate multiple LHS matrices and evaluate each of them with a space-filling metric called the L2-star discrepancy; the LHS matrix with the lowest L2-star discrepancy is selected as the representative design space. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this chapter, we used 200 training points and 50 test points for workload dynamic prediction, since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. The 2D NUCA architecture characteristics (normalized cache hit numbers) across the 256 banks (with the geometric layout shown in Figure 6-3) are represented by a matrix.
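The sampling step described above can be sketched as follows. This is a minimal illustration, not the exact LHS variant used in this study: it draws several plain Latin hypercube samples, scores each with the L2-star discrepancy (computed with Warnock's closed-form expression) and keeps the lowest-scoring one. The function names, the number of candidates and the mapping from unit-interval samples to the discrete levels of Table 6-2 are assumptions made for the example.

```python
import random


def latin_hypercube(n_points, n_dims, rng):
    """Return n_points samples in [0, 1)^n_dims with one sample per stratum per dimension."""
    columns = []
    for _ in range(n_dims):
        # One random value inside each of the n_points equal-width strata.
        strata = [(i + rng.random()) / n_points for i in range(n_points)]
        rng.shuffle(strata)  # decouple the stratum order between dimensions
        columns.append(strata)
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_points)]


def l2_star_discrepancy(points):
    """L2-star discrepancy of points in the unit hypercube (Warnock's formula)."""
    n, d = len(points), len(points[0])
    single = 0.0
    for p in points:
        prod = 1.0
        for x in p:
            prod *= 1.0 - x * x
        single += prod
    pairs = 0.0
    for p in points:
        for q in points:
            prod = 1.0
            for a, b in zip(p, q):
                prod *= 1.0 - max(a, b)
            pairs += prod
    squared = 3.0 ** (-d) - (2.0 ** (1 - d) / n) * single + pairs / (n * n)
    return max(squared, 0.0) ** 0.5  # guard against tiny negative rounding error


def best_lhs(n_points, n_dims, n_candidates, seed=0):
    """Generate n_candidates LHS matrices and keep the one with the lowest discrepancy."""
    rng = random.Random(seed)
    candidates = [latin_hypercube(n_points, n_dims, rng) for _ in range(n_candidates)]
    return min(candidates, key=l2_star_discrepancy)


def to_design_point(unit_sample, levels_per_dim):
    """Map a unit-interval sample onto discrete parameter levels (illustrative mapping)."""
    return [levels[min(int(u * len(levels)), len(levels) - 1)]
            for u, levels in zip(unit_sample, levels_per_dim)]


# Levels for three of the nine parameters in Table 6-2 (net_lat, L1_lat, L2_aso).
EXAMPLE_LEVELS = [[20, 30, 40, 50], [1, 3, 5], [2, 4, 8, 16]]
example_points = [to_design_point(u, EXAMPLE_LEVELS) for u in best_lhs(12, 3, 5)]
```

The pairwise term makes each discrepancy evaluation quadratic in the number of points, so for 200 training points and 9 parameters a vectorized implementation would be preferable to this pure-Python version.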
Predicting each wavelet coefficient by a separate neural network simplifies the learning task. Since complex spatial patterns on large-scale multi-core architecture substrates can be captured using a limited number of wavelet coefficients, the total size of the wavelet neural networks is small and the computation overhead is low. Because small-magnitude wavelet coefficients contribute less to the reconstructed data, we opt to predict only a small set of important wavelet coefficients. Specifically, we consider the following two schemes for selecting important wavelet coefficients for prediction: (1) magnitude-based: select the largest k coefficients and approximate the rest with 0, and (2) order-based: select the first k coefficients and approximate the rest with 0. In this study, we choose the magnitude-based scheme since it always outperforms the order-based scheme. To apply the magnitude-based wavelet coefficient selection scheme, it is essential that the significance of the selected wavelet coefficients does not change drastically across the design space. Our experimental results show that the top-ranked wavelet coefficients largely remain consistent across different architecture configurations.

Evaluation and Results

In this section, we present detailed experimental results on using 2D wavelet neural networks to forecast the complex, heterogeneous patterns of large-scale multi-core substrates running various workloads without using detailed simulation. The prediction accuracy measure is the mean error (ME), defined as follows:

$$ME = \frac{1}{N}\sum_{k=1}^{N}\left(\frac{x(k)-\hat{x}(k)}{x(k)}\right) \qquad (6\text{-}1)$$

where x is the actual value, \hat{x} is the predicted value and N is the total number of samples (e.g. 256 NUCA banks). As prediction accuracy increases, the ME becomes smaller. The prediction accuracies are plotted as boxplots (Figure 6-4).
Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the ME values. Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data, and it shows the center of the distribution of the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles.

Figure 6-4 ME boxplots of prediction accuracies with different numbers of wavelet coefficients: (a) 16 wavelet coefficients, (b) 32 wavelet coefficients, (c) 64 wavelet coefficients, (d) 128 wavelet coefficients (y-axis: Error (%); x-axis: gcc_x8, mcf_x8, CPU, MIX, MEM, barnes, ocean.co, ocean.nc, water.sp, cholesky, fft, radix, fmm)
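As a small worked illustration of the accuracy measure and the boxplot statistics, the sketch below computes a mean error per design point and then the quartiles of a set of such errors. Taking the magnitude of each relative error, so that errors of opposite sign do not cancel, is an assumption of this sketch, as are the function names; the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from the hinges drawn by a plotting package.

```python
import statistics


def mean_error(actual, predicted):
    """Mean relative error over N samples, in the spirit of Eq. 6-1.

    Assumption: the magnitude of each relative error is used. Samples whose
    actual value is zero are not handled and raise ZeroDivisionError.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs((x - x_hat) / x) for x, x_hat in zip(actual, predicted)) / len(actual)


def box_summary(errors):
    """Quartiles and interquartile range of a list of ME values."""
    q1, median, q3 = statistics.quantiles(errors, n=4)
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}
```

For one workload, `mean_error` would be evaluated once per test design point (over the 256 banks), and `box_summary` would then be applied to the resulting list of 50 values.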
Figure 6-4 (a) shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 5.2 percent (fft) to 9.3 percent (ocean.co), with an overall median error of 6.6 percent across all experimented workloads. As can be seen, the maximum error at any design point for any benchmark is 13%, and most benchmarks show an error of less than 10%. This indicates that our proposed neuro-wavelet scheme can forecast the 2D spatial workload behavior across large and sophisticated architectures with high accuracy. Figure 6-4 (b-d) shows that, in general, the prediction accuracy of the geospatial characteristics increases when more wavelet coefficients are involved. Note that the complexity of the predictive models is proportional to the number of wavelet coefficients. Cost-effective models should provide high prediction accuracy while maintaining low complexity and computation overhead. The trend of prediction accuracy (Figure 6-4) indicates that, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point improves the error at a reduced rate. This is because wavelets provide a good time and locality characterization capability and most of the energy is captured by a limited set of important wavelet coefficients. Using fewer parameters than other methods, the coordinated wavelet coefficients provide an interpretation of the spatial patterns among a large number of NUCA banks on a two-dimensional plane. Figure 6-5 illustrates the predicted 2D NUCA behavior across four different configurations (e.g. A-D) on the heterogeneous multi-programmed workload MIX (see Table 6-3) when different numbers of wavelet coefficients (e.g. 16 to 256) are used.
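The energy-compaction argument above can be illustrated with a short sketch: transform a 2D matrix, keep only the k largest-magnitude coefficients (the magnitude-based scheme) and reconstruct. The sketch assumes an orthonormal Haar wavelet with a full row-then-column decomposition on a matrix whose side is a power of two (such as the 16 x 16 bank layout); the wavelet family and all function names are illustrative assumptions rather than the exact configuration used in this study.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_1d(values):
    """Full orthonormal Haar decomposition of a sequence whose length is a power of two."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / SQRT2 for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / SQRT2 for i in range(half)]
        out[:n] = avg + dif  # averages first, details after them
        n = half
    return out


def inverse_haar_1d(coeffs):
    """Invert haar_1d."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [(a + d) / SQRT2, (a - d) / SQRT2]
        out[:2 * n] = merged
        n *= 2
    return out


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def haar_2d(matrix):
    """Transform every row, then every column."""
    rows = [haar_1d(row) for row in matrix]
    return transpose([haar_1d(col) for col in transpose(rows)])


def inverse_haar_2d(coeffs):
    """Undo the column transform, then the row transform."""
    rows = transpose([inverse_haar_1d(col) for col in transpose(coeffs)])
    return [inverse_haar_1d(row) for row in rows]


def keep_largest(coeffs, k):
    """Magnitude-based selection: keep the k largest coefficients, zero the rest."""
    ranked = sorted(((abs(v), r, c) for r, row in enumerate(coeffs)
                     for c, v in enumerate(row)), reverse=True)
    kept = {(r, c) for _, r, c in ranked[:k]}
    return [[v if (r, c) in kept else 0.0 for c, v in enumerate(row)]
            for r, row in enumerate(coeffs)]
```

Under these assumptions, `inverse_haar_2d(keep_largest(haar_2d(m), 16))` gives a 16-coefficient approximation of a bank matrix `m`; in the predictive setting the kept coefficients would come from the neural networks rather than from the transform of simulated data.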
Figure 6-5 Predicted 2D NUCA behavior using different numbers of wavelet coefficients (simulation vs. prediction for configurations A-D)

The simulation results (org) are also shown for comparison purposes. We further compare the accuracy of our proposed scheme with that of approximating NUCA spatial patterns by predicting the hit rates of 16 evenly distributed cache banks across a 2D plane. The results shown in Table 6-4 indicate that, using the same number of neural networks, our scheme yields a significantly higher accuracy than conventional predictive models. If current neural network models were built at fine-grain scales (e.g. constructing a model for each NUCA bank), the model building/training overhead would be non-trivial. Since we can accurately forecast the behavior of large-scale NUCA structures by predicting only a small set of wavelet coefficients, we expect our methods to scale to even larger architecture substrates.

Table 6-4 Error comparison of predicting raw vs. 2D DWT cache banks

| Benchmarks | Error (Raw), % | Error (2D DWT), % |
|---|---|---|
| gcc(x8) | 126 | 8 |
| mcf(x8) | 71 | 7 |
| CPU | 102 | 9 |
| MIX | 86 | 8 |
| MEM | 122 | 8 |
| barnes | 136 | 6 |
| fmm | 363 | 6 |
| ocean-co | 99 | 9 |
| ocean-nc | 136 | 6 |
| water-sp | 97 | 7 |
| cholesky | 71 | 7 |
| fft | 64 | 7 |
| radix | 92 | 7 |

Table 6-5 shows that exploring the multi-core NUCA design space using the proposed predictive models can lead to several orders of magnitude speedup compared with using detailed simulations. The speedup is calculated as the total simulation time across all 50 test cases divided by the time spent on model training and on predicting the 50 test cases. The model construction is a one-time overhead and can be amortized in the design space exploration stage, where a large number of cases need to be examined.

Table 6-5 Design space evaluation speedup (simulation vs. prediction)

| Benchmarks | Simulation vs. Prediction |
|---|---|
| gcc(x8) | 2,181x |
| mcf(x8) | 3,482x |
| CPU | 3,691x |
| MIX | 472x |
| MEM | 435x |
| barnes | 659x |
| fmm | 1,824x |
| ocean-co | 1,077x |
| ocean-nc | 1,169x |
| water-sp | 738x |
| cholesky | 696x |
| fft | 670x |
| radix | 1,010x |

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input architecture design parameters were ranked based on either split order or split frequency. The design parameters which cause the most output variation tend to be split earliest and most often in the constructed regression tree. Therefore, architecture parameters that largely determine the values of a wavelet coefficient are located higher than others in the regression tree and have a larger number of splits than others. We present in Figure 6-6 (shown as a star plot) the initial and most frequent splits within the regression trees that model the most significant wavelet coefficients.

A star plot is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information such as: Which variables are dominant for a given dataset? Which observations show similar behavior?

Figure 6-6 Roles of design parameters in predicting 2D NUCA (panels: Order, Frequency)

For example, on the Splash-2 benchmark fmm, network latency (net_lat), processor buffer size (p_buf), L2 latency (L2_lat) and L1 associativity (L1_aso) have significant roles in predicting the 2D NUCA spatial behavior, while the NUCA data management policy (NUCA) and network topology (net) largely affect the 2D spatial pattern when running the homogeneous multi-programmed workload gccx8.
For the benchmark cholesky, the architecture parameters most frequently involved in regression tree construction are NUCA, net_lat, p_buf, L2_lat and L1_aso.

Differing from models that predict aggregated workload characteristics on monolithic architecture designs, our proposed methods can accurately and informatively reveal the complex patterns that workloads exhibit on large-scale architectures. This feature is essential if the predictive models are employed to examine the efficiency of design tradeoffs or to explore novel optimizations that consider multi-/many-cores. In this work, we study the suitability of using the proposed models for novel multi-core oriented NUCA optimizations.

Leveraging 2D Geometric Characteristics to Explore Cooperative Multi-core Oriented Architecture Design and Optimization

In this section, we present case studies to demonstrate the benefit of incorporating 2D workload/architecture behavior prediction into the early stages of microarchitecture design. In the first case study, we show that our geospatial-aware predictive models can effectively estimate workloads' 2D working sets and that such information can be beneficial in searching for cache-friendly workload/core mappings in multi-core environments. In the second case study, we explore using 2D thermal profile predictive models to accurately and informatively forecast the area and location of thermal hotspots across large NUCA substrates.

Case Study 1: Geospatial-aware Application/Core Mapping

Our 2D geometry-aware architecture predictive models can be used to explore global, cooperative resource management and optimization in multi-core environments.

Figure 6-7 2D NUCA footprint (geometric shape) of mesa

For example, as shown in Figure 6-7, a workload will exhibit a 2D working set with different geometric shapes when running on different cores.
The exact shape of the access distribution depends on several factors such as the application and the data mapping/migration policy. As shown in the previous section, our predictive models can forecast workload 2D spatial patterns across the architecture design space. To predict workload 2D geometric footprints when running on different cores, we incorporate the core location as a new design parameter and build location-aware 2D predictive models. As a result, the new model can forecast a workload's 2D NUCA footprint (represented as a cache access distribution) when it is assigned to a specific core location. We assign 8 SPEC CPU 2000 workloads to the 8-core CMP system, predict each workload's 2D NUCA footprint when running on its assigned core, and use the predicted 2D geometric working set of each workload to estimate the cache interference among the cores.

Figure 6-8 2D cache interference in NUCA (Programs A, B and C with 2D NUCA footprints at Core 0, Core 1 and Core 2; the overlapping region is the interference area)

As shown in Figure 6-8, to estimate the interference for a given core/workload mapping, we estimate both the area and the degree of overlap among the workloads' 2D NUCA footprints. We only consider the interference between a core and its two neighbors. As a result, for a given core/workload layout, we can quickly estimate the overall interference. For each NUCA configuration, we estimate the interference when workloads are randomly assigned to different cores. We use simulation to count the actual cache interference among workloads. For each test case (e.g., a specific NUCA configuration), we generate two series of cache interference statistics (one from simulation and one from the predictive model), which correspond to the scenarios in which workloads are mapped to the different cores. We compute the Pearson correlation coefficient of the two data series.
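The interference estimate and its validation can be sketched as follows. The text does not specify how area and degree of overlap are quantified, so this sketch makes explicit assumptions: a footprint is a 2D matrix of normalized access counts per bank, the overlap area is the number of banks that both footprints touch, the overlap degree is the sum of the smaller of the two access intensities over those banks, and the eight cores are treated as a ring so that each core has two neighbors. The Pearson coefficient uses the standard sums-based form. All function names are hypothetical.

```python
import math


def overlap(fp_a, fp_b, threshold=0.0):
    """Return (area, degree) of overlap between two footprints of equal shape."""
    area, degree = 0, 0.0
    for row_a, row_b in zip(fp_a, fp_b):
        for a, b in zip(row_a, row_b):
            if a > threshold and b > threshold:
                area += 1            # bank is used by both workloads
                degree += min(a, b)  # shared intensity on that bank (assumption)
    return area, degree


def layout_interference(footprints):
    """Sum the overlap degree over neighboring cores, assuming a ring of cores.

    footprints[i] is the predicted footprint of the workload placed on core i.
    Each adjacent pair (i, i + 1) is visited once for layouts with more than two cores.
    """
    n = len(footprints)
    return sum(overlap(footprints[i], footprints[(i + 1) % n])[1] for i in range(n))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long data series."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy))
```

For one NUCA configuration, `layout_interference` would be evaluated for each random workload/core assignment, and `pearson` would then compare that series against the interference counts measured in simulation.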
The Pearson correlation coefficient of two data series X and Y is defined as

$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\;\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}} \qquad (6\text{-}2)$$

If two data series, X and Y, show highly positive correlation, their Pearson correlation coefficient will be close to 1. Consequently, if the cache interference can be accurately estimated using the overlap between the predicted 2D NUCA footprints, we should observe nearly perfect correlation between the two metrics.

Figure 6-9 Pearson correlation coefficient (all 50 test cases are shown); panels: Group 1 (CPU), Group 2 (MIX), Group 3 (MEM); x-axis: test cases; y-axis: Pearson correlation coefficient

Figure 6-9 shows that there is a strong correlation between the interference estimated using the predicted 2D NUCA footprint and the interference statistics obtained using simulation. The highly positive Pearson correlation coefficient values show that, by using the predictive model, designers can quickly devise the optimal core allocation for a given set of workloads. Alternatively, the information can be used by the OS to guide cache-friendly thread scheduling in multi-core environments.

Case Study 2: 2D Thermal Hot-Spot Prediction

Thermal issues are becoming a first-order design parameter for large-scale CMP architectures. High operational temperatures and hotspots can limit performance and manufacturability. We use the HotSpot [54] thermal model to obtain the temperature variation across the 256 NUCA banks. We then build analytical models using the proposed methods to forecast the 2D thermal behavior of large NUCA caches with different configurations.
Our predictive model can help designers insightfully predict the potential thermal hotspots and assess the severity of thermal emergencies. Figure 6-10 shows the simulated thermal profile and the predicted thermal behavior on different workloads. The temperatures are normalized to a value between the maximal and minimal value across the NUCA chip. As can be seen, the 2D thermal predictive models can accurately and informatively forecast the size and the location of thermal hotspots.

Figure 6-10 2D NUCA thermal profile (simulation vs. prediction): (a) Ocean-NC, (b) gccx8, (c) MEM, (d) Radix

Figure 6-11 NUCA 2D thermal prediction error (x-axis: 16, 32, 64, 96, 128 and 256 wavelet coefficients; y-axis: Error (%); series: Multiprogrammed (Homogeneous), Multiprogrammed (Heterogeneous), Multithreaded (SPLASH))

The thermal prediction accuracy (average statistics) across the three workload categories is shown in Figure 6-11. The accuracy of using different numbers of wavelet coefficients in prediction is also shown in that figure. The results show that our predictive model can be used to cost-effectively analyze the thermal behavior of large architecture substrates. In addition, our proposed technique can be used to evaluate the efficiency of thermal management policies at a large scale. For example, thermal hotspots can be mitigated by throttling the number of accesses to a cache bank for a certain period when its temperature reaches a threshold. We build analytical models which incorporate thermal-aware cache access throttling as a design parameter.
As a result, our predictive model can forecast the thermal hotspot distribution in the 2D NUCA cache banks when the dynamic thermal management (DTM) policy is enabled or disabled. Figure 6-12 shows the thermal profiles before and after thermal management policies are applied (both prediction and simulation results) for the benchmark Ocean-NC. As can be seen, they track each other very well. In terms of the time taken for design space exploration, our proposed models have orders of magnitude less overhead. The time required to predict the thermal behavior is much less than that of full-system multi-core simulation. For example, thermal hotspot estimation is over 5 x 10^2 times faster than thermal simulation, justifying our decision to use the predictive models. Similarly, searching for a cache-friendly workload/core mapping is 4 x 10^3 times faster than using the simulation-based method.

Figure 6-12 Temperature profile before and after a DTM policy (simulation vs. prediction)

CHAPTER 7
THERMAL DESIGN SPACE EXPLORATION OF 3D DIE STACKED MULTI-CORE PROCESSORS USING GEOSPATIAL-BASED PREDICTIVE MODELS

To achieve thermally efficient 3D multi-core processor designs, architects and chip designers need models with low computation overhead, which allow them to quickly explore the design space and compare different design options. One challenge in modeling the thermal behavior of 3D die stacked multi-core architectures is that the manifested thermal patterns show significant variation within each die and across different dies (as shown in Figure 7-1).

Figure 7-1 2D within-die and cross-die thermal variation in 3D die stacked multi-core processors (Die1 to Die4 under the CPU, MEM and MIX workloads)

The results were obtained by simulating a 3D die stacked quad-core processor running multi-programmed CPU (bzip2, eon, gcc, perlbmk), MEM (mcf, equake, vpr, swim) and MIX (gcc, mcf, vpr, perlbmk) workloads.
Each program within a multi-programmed workload was assigned to a die that contains a processor core and caches. Figure 7-2 shows the 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations. On the given die, the 2-dimensional thermal spatial characteristics vary widely with different design choices. As the number of architectural parameters in the design space increases, the complex thermal variation and characteristics cannot be captured without using slow and detailed simulations. As shown in Figures 7-1 and 7-2, to explore the thermal-aware design space accurately and informatively, we need computationally effective methods that not only predict aggregate thermal behavior but also identify both the size and the geographic distribution of thermal hotspots. In this work, we aim to develop fast and accurate predictive models to achieve this goal.

Figure 7-2 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations (Config. A to Config. D under the CPU, MEM and MIX workloads)

Figure 7-3 illustrates the original thermal behavior and the 2D wavelet transformed thermal behavior.

Figure 7-3 Example of using 2D DWT to capture thermal spatial characteristics: (a) original thermal behavior, (b) 2D wavelet transformed thermal behavior (sub-bands LL2, HL2, LH2, HH2, HL1, LH1, HH1)

As can be seen, the 2D thermal characteristics can be effectively captured using a small number of wavelet coefficients (e.g. Average (LL=1) or Average (LL=2)). Since a small set of wavelet coefficients provides concise yet insightful information on the 2D thermal spatial characteristics, we use predictive models (i.e. neural networks) to relate them individually to various design parameters. Through the inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize the 2D thermal spatial characteristics across the design space.
Compared with a simulation-based method, predicting a small set of wavelet coefficients using analytical models is computationally efficient and scales to the large thermal design space of 3D multi-core architectures.

Prior work has proposed various predictive models [20-25, 50] to cost-effectively reason about processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous thermal behavior across large and distributed 3D multi-core architecture substrates. In this work, we address this important and urgent research task by developing novel 2D multi-scale predictive models, which can efficiently reason about the geo-spatial thermal characteristics within a die and across different dies during the design space exploration stage without using detailed cycle-level simulations. Instead of quantifying the complex geo-spatial thermal characteristics using a single number or a simple statistical distribution, our proposed techniques employ 2D wavelet multiresolution analysis and neural network nonlinear regression modeling. With our schemes, the thermal spatial characteristics are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of wavelet coefficients, our models can accurately reconstruct the 2D spatial thermal behavior across the design space.

Combining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction

We view the 2D spatial thermal characteristics yielded in 3D integrated multi-core chips as a nonlinear function of architecture design parameters.
Instead of inferring the spatial thermal behavior by exhaustively obtaining the temperature at each individual location, we employ wavelet analysis to approximate it and then use a neural network to forecast the approximated thermal behavior across a large architectural design space. Previous work [21, 23, 25, 50] shows that neural networks can accurately predict the aggregated workload behavior across varied architecture configurations. Nevertheless, monolithic global neural network models lack the ability to reveal complex thermal behavior on a large scale. To overcome this disadvantage, we propose combining 2D wavelet transforms and neural networks, incorporating multiresolution analysis into a set of neural networks for predicting the spatial thermal characteristics of 3D die stacked multi-core designs.

Figure 7-4 Hybrid neuro-wavelet thermal prediction framework

The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both the global trend and the local variation of large data sets using a small set of wavelet coefficients. The local characteristics are decomposed into lower scales of wavelet coefficients (high frequencies), which are utilized for detailed analysis and prediction of individual components or subsets of components, while the global trend is decomposed into higher scales of wavelet coefficients (low frequencies), which are used for the analysis and prediction of slow trends across each die. Collectively, these wavelet coefficients provide an accurate interpretation of the spatial trend and details of complex thermal behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict each individual wavelet coefficient. The separate predictions of wavelet coefficients proceed independently. Predicting each wavelet coefficient by a separate neural network simplifies the training task (which can be performed concurrently) of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transform to synthesize the 2D spatial thermal patterns across each die.

Figure 7-4 shows our hybrid neuro-wavelet scheme for 2D spatial thermal characteristics prediction. Given the observed spatial thermal behavior on training data, our aim is to predict the 2D thermal behavior of each die in 3D die stacked multi-core processors under different design configurations. The hybrid scheme involves three stages. In the first stage, the observed spatial thermal behavior in each layer is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D thermal characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design space vector and predicts a wavelet coefficient. The training of an RBF network involves determining the center point and a radius for each RBF, and the weights of each RBF, which determine the wavelet coefficients.

Experimental Methodology

Floorplanning and Hotspot Thermal Model

In this study, we model four floor-plans that involve processor core and cache structures, as illustrated in Figure 7-5.

Figure 7-5 Selected floor-plans

As can be seen, the processor core is placed at different locations across the different floor-plans. Each floor-plan can be chosen by a layer in the studied 3D die stacked quad-core processors. The size and adjacency of blocks are critical parameters for deriving the thermal model. The baseline core architecture and floor-plan we modeled is an Alpha processor, closely resembling the Alpha 21264. Figure 7-6 shows the baseline core floor-plan.

Figure 7-6 Processor core floor-plan

We assume a 65 nm process technology and the floor-plan is scaled accordingly. The entire die size is 21 x 21 mm and the core size is 5.8 x 5.8 mm.
We consider three core configurations: 2-issue (5.8 × 5.8 mm), 4-issue (8.14 × 8.14 mm) and 8-issue (11.5 × 11.5 mm). Since the total die area is fixed, the more aggressive core configurations lead to smaller L2 caches. For all three types of core configurations, we calculate the size of the L2 cache based on the remaining die area. Table 7-1 lists the detailed processor core and cache configurations.

We use Hotspot-4.0 [54] to simulate the thermal behavior of the 3D quad-core chip shown in Figure 7-7. The Hotspot tool can specify the multiple layers of silicon and metal required to model a three-dimensional IC. We choose the grid-like thermal modeling mode by specifying a set of 64 × 64 thermal grid cells per die, and the average temperature of each cell (32 µm × 32 µm) is represented by a value. Hotspot takes power consumption data for each component block, the layer parameters and the floor-plans as inputs and generates the steady-state temperature for each active layer. To build a 3D multi-core processor simulator, we heavily modified and extended the M-Sim simulator [63] and incorporated the Wattch power model [36]. The power trace is generated from the developed framework with an interval size of 500K cycles. We simulate a 3D-stacked quad-core processor with one core assigned to each layer.

Table 7-1 Architecture configuration for different issue width

| Parameter | 2 issue | 4 issue | 8 issue |
|---|---|---|---|
| Processor Width | 2-wide fetch/issue/commit | 4-wide fetch/issue/commit | 8-wide fetch/issue/commit |
| Issue Queue | 32 | 64 | 128 |
| ITLB | 32 entries, 4-way, 200 cycle miss | 64 entries, 4-way, 200 cycle miss | 128 entries, 4-way, 200 cycle miss |
| Branch Predictor | 512 entries Gshare, 10-bit global history | 1K entries Gshare, 10-bit global history | 2K entries Gshare, 10-bit global history |
| BTB | 512K entries, 4-way | 1K entries, 4-way | 2K entries, 4-way |
| Return Address | 8 entries RAS | 16 entries RAS | 32 entries RAS |
| L1 Inst. Cache | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access | 64K, 2-way, 32 Byte/line, 2 ports, 1 cycle access | 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access |
| ROB Size | 32 entries | 64 entries | 96 entries |
| Load/Store | 24 entries | 48 entries | 72 entries |
| Integer ALU | 2 I-ALU, 1 I-MUL/DIV, 2 Load/Store | 4 I-ALU, 2 I-MUL/DIV, 2 Load/Store | 8 I-ALU, 4 I-MUL/DIV, 4 Load/Store |
| FP ALU | 1 FP-ALU, 1 FP-MUL/DIV/SQRT | 2 FP-ALU, 2 FP-MUL/DIV/SQRT | 4 FP-ALU, 4 FP-MUL/DIV/SQRT |
| DTLB | 64 entries, 4-way, 200 cycle miss | 128 entries, 4-way, 200 cycle miss | 256 entries, 4-way, 200 cycle miss |
| L1 Data Cache | 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access | 64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle | 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access |
| L2 Cache | unified 4MB, 4-way, 128 Byte/line, 12 cycle access | unified 3.7MB, 4-way, 128 Byte/line, 12 cycle access | unified 3.2MB, 4-way, 128 Byte/line, 12 cycle access |
| Memory Access | 32 bit wide, 200 cycles access latency | 64 bit wide, 200 cycles access latency | 64 bit wide, 200 cycles access latency |

Figure 7-7 Cross section view of the simulated 3D quad-core chip

Workloads and System Configurations

We use both integer and floating-point benchmarks from the SPEC CPU 2000 suite (e.g. bzip2, crafty, eon, facerec, galgel, gap, gcc, lucas, mcf, parser, perlbmk, twolf, swim, vortex and vpr) to compose our experimental multiprogrammed workloads (see Table 7-2). We categorize all benchmarks into two classes: CPU-bound and MEM-bound applications. We design three types of experimental workloads: CPU, MEM and MIX. The CPU and MEM workloads consist of programs from only the CPU-intensive and memory-intensive categories, respectively. MIX workloads combine two benchmarks from the CPU-intensive group and two from the memory-intensive group.

Table 7-2 Simulation configurations

| Parameter | Value |
|---|---|
| Chip Frequency | 3 GHz |
| Voltage | 1.2 V |
| Proc. Technology | 65 nm |
| Die Size | 21 mm × 21 mm |

| Workload | Benchmarks |
|---|---|
| CPU1 | bzip2, eon, gcc, perlbmk |
| CPU2 | perlbmk, mesa, facerec, lucas |
| CPU3 | gap, parser, eon, mesa |
| MIX1 | gcc, mcf, vpr, perlbmk |
| MIX2 | perlbmk, mesa, twolf, applu |
| MIX3 | eon, gap, mcf, vpr |
| MEM1 | mcf, equake, vpr, swim |
| MEM2 | twolf, galgel, applu, lucas |
| MEM3 | mcf, twolf, swim, vpr |

These multi-programmed workloads were simulated on our multi-core simulator configured as a 3D quad-core processor. We use the Simpoint tool [1] to obtain a representative slice for each benchmark (with the full reference input set), and each benchmark is fast-forwarded to its representative point before detailed simulation takes place. The simulation continues until one benchmark within a workload finishes executing its representative interval of 250M instructions.

Design Parameters

In this study, we consider a design space that consists of 23 parameters (see Table 7-3), spanning from floor-planning to packaging technologies.

Table 7-3 Design space parameters

| Group | Parameter | Key | Low | High |
|---|---|---|---|---|
| 3D Configurations: Layer0 | Thickness (m) | ly0_th | 5e-5 | 3e-4 |
| | Floorplan | ly0_fl | Flp 1/2/3/4 | |
| | Bench | ly0_bench | CPU/MEM/MIX | |
| 3D Configurations: Layer1 | Thickness (m) | ly1_th | 5e-5 | 3e-4 |
| | Floorplan | ly1_fl | Flp 1/2/3/4 | |
| | Bench | ly1_bench | CPU/MEM/MIX | |
| 3D Configurations: Layer2 | Thickness (m) | ly2_th | 5e-5 | 3e-4 |
| | Floorplan | ly2_fl | Flp 1/2/3/4 | |
| | Bench | ly2_bench | CPU/MEM/MIX | |
| 3D Configurations: Layer3 | Thickness (m) | ly3_th | 5e-5 | 3e-4 |
| | Floorplan | ly3_fl | Flp 1/2/3/4 | |
| | Bench | ly3_bench | CPU/MEM/MIX | |
| 3D Configurations: TIM (Thermal Interface Material) | Heat Capacity (J/m^3 K) | TIM_cap | 2e6 | 4e6 |
| | Resistivity (m K/W) | TIM_res | 2e-3 | 5e-2 |
| | Thickness (m) | TIM_th | 2e-5 | 75e-6 |
| General Configurations: Heat sink | Convection capacity (J/K) | HS_cap | 140.4 | 1698 |
| | Convection resistance (K/W) | HS_res | 0.1 | 0.5 |
| | Side (m) | HS_side | 0.045 | 0.08 |
| | Thickness (m) | HS_th | 0.02 | 0.08 |
| General Configurations: Heat spreader | Side (m) | HP_side | 0.025 | 0.045 |
| | Thickness (m) | HP_th | 5e-4 | 5e-3 |
| Others | Ambient temperature (K) | Am_temp | 293.15 | 323.15 |
| Archi. | Issue width | Issue width | 2 or 4 or 8 | |

These design parameters have been shown to have a large impact on processor thermal behavior.
The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed cycle-accurate simulations, we measure processor power and thermal characteristics on all design points within both the training and testing data sets. We build a separate model for each benchmark domain and use the model to predict thermal behavior at unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of the models' accuracy is obtained by using the design points in the testing data set. To train an accurate and prompt neural network prediction model, one needs to ensure that the sample data set disperses points throughout the design space while remaining small enough to keep the model building cost low. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [39] as our sampling strategy, since it provides better coverage than a naive random sampling scheme. We generate multiple LHS matrices and evaluate each with a space-filling metric called L2-star discrepancy [40]; the LHS matrix with the lowest L2-star discrepancy is selected as the representative sample of the design space. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this work, we used 200 training and 50 test design points, since our study shows that this offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. In our study, the thermal characteristics across each die are represented by 64 × 64 samples.
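The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the sampling code used in this study: the function names are hypothetical, the L2-star discrepancy is computed with Warnock's closed-form expression, only four continuous parameters from Table 7-3 are included, and the sample and candidate counts are kept small so the example runs quickly.

```python
import random

def latin_hypercube(n_samples, n_dims, rng):
    """Return n_samples points in [0, 1)^n_dims, one point per stratum in each dimension."""
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)  # random pairing of strata across dimensions
        columns.append([(s + rng.random()) / n_samples for s in strata])
    return [[col[i] for col in columns] for i in range(n_samples)]

def l2_star_discrepancy(points):
    """L2-star discrepancy via Warnock's closed form (lower means more uniform)."""
    n, d = len(points), len(points[0])
    single = 0.0
    for p in points:
        prod = 1.0
        for x in p:
            prod *= 1.0 - x * x
        single += prod
    pair = 0.0
    for p in points:
        for q in points:
            prod = 1.0
            for x, y in zip(p, q):
                prod *= 1.0 - max(x, y)
            pair += prod
    squared = 3.0 ** (-d) - (2.0 ** (1 - d)) * single / n + pair / (n * n)
    return max(squared, 0.0) ** 0.5  # guard against tiny negative rounding error

def best_lhs(n_samples, n_dims, n_candidates, seed=0):
    """Generate several LHS matrices and keep the one with the lowest discrepancy."""
    rng = random.Random(seed)
    candidates = [latin_hypercube(n_samples, n_dims, rng) for _ in range(n_candidates)]
    return min(candidates, key=l2_star_discrepancy)

def to_design_point(unit_point, ranges):
    """Map a point in the unit cube onto the [low, high] range of each parameter."""
    return {key: lo + u * (hi - lo) for u, (key, (lo, hi)) in zip(unit_point, ranges.items())}

# Illustrative subset of the continuous parameters in Table 7-3 (assumption: subset only).
RANGES = {
    "ly0_th": (5e-5, 3e-4),
    "TIM_res": (2e-3, 5e-2),
    "HS_res": (0.1, 0.5),
    "Am_temp": (293.15, 323.15),
}

samples = best_lhs(n_samples=20, n_dims=len(RANGES), n_candidates=10)
design_points = [to_design_point(p, RANGES) for p in samples]
```

Categorical parameters such as the floor-plan or workload class could be handled by mapping each unit-interval value to an index into the list of choices; that step is omitted here.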
Experimental Results

In this section, we present detailed experimental results using 2D wavelet neural networks to forecast the thermal behavior of large-scale 3D multi-core structures running various CPU/MIX/MEM workloads without using detailed simulation.

Simulation Time vs. Prediction Time

To evaluate the effectiveness of our thermal prediction models, we compute the speedup metric (defined as the ratio of simulation time to prediction time) across all experimented workloads (shown in Table 7-4). To calculate simulation time, we measured the time that the Hotspot simulator takes to obtain steady-state thermal characteristics for a given design configuration. As can be seen, the Hotspot simulation time varies with the design configuration. We report both the shortest (best) and longest (worst) simulation times in Table 7-4. The prediction time, which includes the time for the neural networks to predict the targeted thermal behavior, remains constant for all studied cases. In our experiment, a total of 16 neural networks were used to predict 16 2D wavelet coefficients, which efficiently capture workload thermal spatial characteristics. As can be seen, our predictive models achieve a speedup ranging from 285 (MEM1) to 5339 (CPU2), making them suitable for rapidly exploring a large thermal design space.

Table 7-4 Simulation time vs. prediction time

| Workload | Simulation (sec) [best : worst] | Prediction (sec) | Speedup (Sim./Pred.) [best : worst] |
|---|---|---|---|
| CPU1 | 362 : 6,091 | 1.23 | 294 : 4,952 |
| CPU2 | 366 : 6,567 | 1.23 | 298 : 5,339 |
| CPU3 | 365 : 6,218 | 1.23 | 297 : 5,055 |
| MEM1 | 351 : 5,890 | 1.23 | 285 : 4,789 |
| MEM2 | 355 : 6,343 | 1.23 | 289 : 5,157 |
| MEM3 | 367 : 5,997 | 1.23 | 298 : 4,876 |
| MIX1 | 352 : 5,944 | 1.23 | 286 : 4,833 |
| MIX2 | 365 : 6,091 | 1.23 | 297 : 4,952 |
| MIX3 | 360 : 6,024 | 1.23 | 293 : 4,898 |

Prediction Accuracy

The prediction accuracy measure is the mean error (ME), defined as follows:

$$ME = \frac{1}{N}\sum_{k=1}^{N}\frac{\left|\tilde{x}(k) - x(k)\right|}{x(k)} \qquad (7\text{-}1)$$

where $x(k)$ is the actual value generated by the Hotspot thermal model, $\tilde{x}(k)$ is the predicted value and $N$ is the total number of samples (a set of 64 × 64 temperature samples per layer). As prediction accuracy increases, the ME becomes smaller.

We present boxplots to show the average prediction errors and their deviations for the 50 test configurations against Hotspot simulation results. Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the ME values. Thus, about 50% of the data are located within the box, and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data; it shows the center of the distribution of the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or to a distance of 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. In Figure 7-8, the blue line with diamond-shaped markers indicates the statistical average of ME across all benchmarks.
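As a small illustration of the accuracy measure in Equation 7-1 and of the statistics a boxplot summarizes, the following Python sketch computes the ME between a simulated and a predicted temperature grid and then the quartiles of a list of ME values. The function names, the use of the absolute difference, and the tiny 2 × 2 grids are assumptions made for the example; the study itself uses 64 × 64 samples per layer.

```python
import statistics

def mean_error(actual, predicted):
    """Mean relative error between two equally sized 2D grids (after Equation 7-1)."""
    flat_actual = [v for row in actual for v in row]
    flat_predicted = [v for row in predicted for v in row]
    total = sum(abs(p - a) / a for a, p in zip(flat_actual, flat_predicted))
    return total / len(flat_actual)

def box_summary(errors):
    """Quartiles and interquartile range of a list of ME values."""
    q1, _, q3 = statistics.quantiles(errors, n=4)
    return {"q1": q1, "median": statistics.median(errors), "q3": q3, "iqr": q3 - q1}

# Hypothetical 2 x 2 temperature grids in kelvin (illustrative values only).
simulated = [[350.0, 340.0], [330.0, 320.0]]
predicted = [[353.5, 336.6], [331.0, 318.0]]
me_percent = 100.0 * mean_error(simulated, predicted)

# One ME value per test configuration would feed the boxplot; seven made-up values here.
summary = box_summary([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

Note that a relative error depends on the temperature unit; the source does not state which unit was used when computing ME.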
Figure 7-8 ME boxplots of prediction accuracies (number of wavelet coefficients = 16; x-axis: workloads CPU1-3, MEM1-3, MIX1-3; y-axis: Error (%))

Figure 7-8 shows that, using 16 wavelet coefficients, the predictive models achieve median errors ranging from 2.8% (CPU1) to 15.5% (MEM1), with an overall median error of 6.9% across all experimented workloads. As can be seen, the maximum error at any design point for any benchmark is 17.5% (MEM1), and most benchmarks show an error of less than 9%. This indicates that our hybrid neuro-wavelet framework can predict 2D spatial thermal behavior across large and sophisticated 3D multi-core architectures with high accuracy. Figure 7-8 also indicates that CPU workloads (average 4.4%) have smaller error rates than MEM (average 9.4%) and MIX (average 6.7%) workloads. This is because CPU workloads usually produce higher temperatures on the small core area than on the large L2 cache area. These small, sharp hotspots can be easily captured using just a few wavelet coefficients. On MEM and MIX workloads, the complex thermal pattern can spread across the entire die area, resulting in higher prediction error. Figure 7-9 illustrates the simulated and predicted 2D thermal spatial behavior of die 4 (for one configuration) on the CPU1, MEM1 and MIX1 workloads.

Figure 7-9 Simulated and predicted thermal behavior (columns: CPU1, MEM1, MIX1; rows: Prediction, Simulation)

The results show that our predictive models can track both the size and the location of thermal hotspots. We further examine the accuracy of predicting the location and area of the hottest spots, and the results are similar to those presented in Figure 7-8.
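One possible way to compare hotspot location and area between a simulated and a predicted thermal map is sketched below in Python. The source does not define how a hotspot's area is delimited, so the rule used here (all grid cells within a fixed margin of the peak temperature) and every function name are assumptions made purely for illustration.

```python
def hotspot_location(grid):
    """Return (row, col) of a hottest cell in a 2D temperature grid."""
    _, row, col = max((value, r, c)
                      for r, cells in enumerate(grid)
                      for c, value in enumerate(cells))
    return row, col

def hotspot_area(grid, margin):
    """Count the cells whose temperature is within `margin` of the peak (assumed definition)."""
    peak = max(max(cells) for cells in grid)
    return sum(1 for cells in grid for value in cells if value >= peak - margin)

def compare_hotspots(simulated, predicted, margin=5.0):
    """Summarize how far apart the two peaks are and how the hotspot areas differ."""
    sim_r, sim_c = hotspot_location(simulated)
    pre_r, pre_c = hotspot_location(predicted)
    return {
        "location_offset_cells": abs(sim_r - pre_r) + abs(sim_c - pre_c),
        "area_simulated": hotspot_area(simulated, margin),
        "area_predicted": hotspot_area(predicted, margin),
    }

# Hypothetical 3 x 3 maps in kelvin (illustrative values only).
simulated_map = [[320.0, 322.0, 321.0],
                 [323.0, 350.0, 347.0],
                 [321.0, 324.0, 322.0]]
predicted_map = [[321.0, 322.0, 320.0],
                 [322.0, 348.0, 346.0],
                 [320.0, 323.0, 321.0]]
report = compare_hotspots(simulated_map, predicted_map)
```

Each grid cell corresponds to a fixed die area, so a cell count can be converted to a physical area by multiplying by the cell size.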
Figure 7-10 ME boxplots of prediction accuracies with different numbers of wavelet coefficients (panels: CPU1, MEM1, MIX1; x-axis: 16, 32, 64, 96, 128 and 256 wavelet coefficients; y-axis: Error (%))

Figure 7-10 shows the prediction accuracies with different numbers of wavelet coefficients on the multi-programmed workloads CPU1, MEM1 and MIX1. In general, the 2D thermal spatial pattern prediction accuracy increases when more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients. Cost-effective models should provide high prediction accuracy while maintaining low complexity. The trend of prediction accuracy (Figure 7-10) suggests that, for the programs we studied, a set of 16 wavelet coefficients combines good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point reduces error at a lower rate, except on the MEM1 workload. Thus, we select 16 wavelet coefficients in this work to minimize the complexity of the prediction models while achieving good accuracy.

We further compare the accuracy of our proposed scheme with that of approximating 3D stacked die spatial thermal patterns by predicting the temperature of 16 evenly distributed locations across the 2D plane. The results (Figure 7-11) indicate that, using the same number of neural networks, our scheme yields significantly higher accuracy than conventional predictive models. This is because wavelets provide good time and locality characterization capability, and most of the energy is captured by a limited set of important wavelet coefficients. The coordinated wavelet coefficients provide a superior interpretation of the spatial patterns across scales of the time and frequency domains.
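The energy-compaction argument above can be illustrated with a short Python sketch: a 2D thermal map is decomposed with a wavelet transform, only the 16 largest-magnitude coefficients are kept, and the map is rebuilt from them. Everything here is an assumption for illustration: the orthonormal Haar wavelet, the "standard" row-then-column decomposition, the synthetic 16 × 16 map with one hot block, and the function names. The neural network prediction stage is omitted; in the scheme described in this chapter each retained coefficient would be predicted by its own network rather than taken from the true transform.

```python
SQRT2 = 2.0 ** 0.5

def haar_1d(values):
    """Full orthonormal Haar decomposition of a list whose length is a power of two."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / SQRT2 for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / SQRT2 for i in range(half)]
        out[:n] = avg + dif
        n = half
    return out

def inverse_haar_1d(coeffs):
    """Invert haar_1d."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.extend([(a + d) / SQRT2, (a - d) / SQRT2])
        out[:2 * n] = merged
        n *= 2
    return out

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def haar_2d(grid):
    """Transform every row, then every column."""
    rows = [haar_1d(row) for row in grid]
    return transpose([haar_1d(col) for col in transpose(rows)])

def inverse_haar_2d(coeffs):
    """Undo the column transform, then the row transform."""
    rows = transpose([inverse_haar_1d(col) for col in transpose(coeffs)])
    return [inverse_haar_1d(row) for row in rows]

def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients."""
    ranked = sorted(((abs(v), r, c) for r, row in enumerate(coeffs)
                     for c, v in enumerate(row)), reverse=True)
    kept = [[0.0] * len(row) for row in coeffs]
    for _, r, c in ranked[:k]:
        kept[r][c] = coeffs[r][c]
    return kept

# Synthetic map in kelvin: a hot core block in one quadrant on a cooler cache area.
SIZE = 16
thermal_map = [[350.0 if r < 8 and c < 8 else 320.0 for c in range(SIZE)]
               for r in range(SIZE)]
coeffs = haar_2d(thermal_map)
approx = inverse_haar_2d(keep_largest(coeffs, 16))
max_abs_error = max(abs(a - t) for row_a, row_t in zip(approx, thermal_map)
                    for a, t in zip(row_a, row_t))
```

For a sharp, block-shaped hotspot like this one, very few coefficients carry the pattern, which is consistent with the lower errors reported for CPU workloads. Sampling 16 evenly spaced points of the same map would instead give only a coarse picture of where the block begins and ends.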
Figure 7-11 Benefit of predicting wavelet coefficients (series: predicting the wavelet coefficients, predicting the raw data; x-axis: workloads CPU1-3, MEM1-3, MIX1-3; y-axis: Error (%))

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input parameters (refer to Table 7-3) were ranked based on split frequency. The input parameters that cause the most output variation tend to be split frequently in the constructed regression tree. Therefore, the input parameters that largely determine the value of a wavelet coefficient have a larger number of splits.

Figure 7-12 Roles of input parameters (star plot of design parameters by regression tree: ly0_th, ly0_fl, ly0_bench, ly1_th, ly1_fl, ly1_bench, ly2_th, ly2_fl, ly2_bench, ly3_th, ly3_fl, ly3_bench, TIM_cap, TIM_res, TIM_th, HS_cap, HS_res, HS_side, HS_th, HP_side, HP_th, am_temp, Iss_size)

Figure 7-12 shows the most frequent splits within the regression tree that models the most significant wavelet coefficient. A star plot [41] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The size of each parameter's segment is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information such as: Which variables are dominant for a given data set? Which observations show similar behavior? As can be seen, the floor-planning of each layer and the core configuration largely affect the thermal spatial behavior of the studied workloads.

CHAPTER 8 CONCLUSIONS

Studying program workload behavior is of growing interest in computer architecture research. The performance, power and reliability optimizations of future computer workloads and systems could involve analyzing program dynamics across many time scales. Modeling and predicting program behavior at a single scale has many limitations.
For example, samples taken from a single, fine-grained interval may not be useful in forecasting how a program behaves at medium or large time scales. In contrast, observing program behavior using a coarse-grained time scale may lose opportunities that can be exploited by hardware and software in tuning resources to optimize workload execution at a fine-grained level.

In Chapter 3, we proposed new methods, metrics and a framework that can help researchers and designers better understand phase complexity and the changing of program dynamics across multiple time scales. We proposed using wavelet transformations of code execution and runtime characteristics to produce a concise yet informative view of program dynamic complexity. We demonstrated the use of this information in phase classification, which aims to produce phases that exhibit a similar degree of complexity. Characterizing phase dynamics across different scales provides insightful knowledge and abundant features that can be exploited by hardware and software in tuning resources to meet the requirements of workload execution at different granularities.

In Chapter 4, we extended the scope of Chapter 3 by (1) exploring and contrasting the effectiveness of using wavelets on a wide range of program execution statistics for phase analysis; and (2) investigating techniques that can further optimize the accuracy of wavelet-based phase classification. More importantly, we identified additional benefits that wavelets can offer in the context of phase analysis. For example, wavelet transforms can provide efficient dimensionality reduction of large-volume, high-dimension raw program execution statistics from the time domain and hence can be integrated with a sampling mechanism to efficiently increase the scalability of phase analysis of large-scale phase behavior on long-running workloads.
To address workload variability issues in phase classification, wavelet-based denoising can be used to extract the essential features of workload behavior from their run-time non-deterministic (i.e., noisy) statistics.

For workload prediction, in Chapter 5, we proposed the use of wavelet neural networks to build accurate predictive models for workload-dynamics-driven microarchitecture design space exploration, overcoming the problems of monolithic, global predictive models. We show that wavelet neural networks can be used to accurately and cost-effectively capture complex workload dynamics across different microarchitecture configurations. We evaluate the efficiency of using the proposed techniques to predict workload dynamic behavior in the performance, power and reliability domains. We also perform extensive simulations to analyze the impact of wavelet coefficient selection and sampling rate on prediction accuracy and to identify microarchitecture parameters that significantly affect workload dynamic behavior. To evaluate the efficiency of scenario-driven architecture optimizations across different domains, we also present a case study using a workload-dynamics-aware predictive model. Experimental results show that the predictive models are highly efficient in rendering workload execution scenarios. To our knowledge, the model we proposed is the first that can track complex program dynamic behavior across different microarchitecture configurations. We believe our workload dynamics forecasting techniques will allow architects to quickly evaluate a rich set of architecture optimizations that target workload dynamics at an early microarchitecture design stage.

In Chapter 6, we explore novel predictive techniques that can quickly, accurately and informatively analyze the design trade-offs of future large-scale multi-/many-core architectures in a scalable fashion.
The characteristics that workloads exhibit on these architectures are complex phenomena, since they typically contain a mixture of behavior localized at different scales. By applying wavelet analysis, our method can capture the heterogeneous behavior across a wide range of spatial scales using a limited set of parameters. We show that these parameters can be cost-effectively predicted using non-linear modeling techniques such as neural networks with low computational overhead. Experimental results show that our scheme can accurately predict the heterogeneous behavior of large-scale multi-core oriented architecture substrates. To our knowledge, the model we proposed is the first that can track complex 2D workload/architecture interaction across design alternatives. We further examined using the proposed models to effectively explore multi-core aware resource allocations and design evaluations. For example, we build analytical models that can quickly forecast workloads' 2D working sets across different NUCA configurations. Combined with interference estimation, our models can determine the geometry-aware workload/core mappings that lead to minimal interference. We also show that our models can be used to predict the location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging multi-/many-core design era, we believe that the proposed 2D predictive model will allow architects to quickly yet informatively examine a rich set of design alternatives and optimizations for large and sophisticated architecture substrates at an early design stage.

Leveraging 3D die stacking technologies in multi-core processor design has received increased momentum in both the chip design industry and the research community. One of the major roadblocks to realizing 3D multi-core design is its inefficient heat dissipation.
To ensure thermal efficiency, processor architects and chip designers rely on detailed yet slow simulations to model thermal characteristics and analyze various design tradeoffs. However, due to the sheer size of the design space, such techniques are very expensive in terms of time and cost.

In Chapter 7, we aim to develop computationally efficient methods and models that allow architects and designers to rapidly yet informatively explore the large thermal design space of 3D multi-core architectures. Our models achieve several orders of magnitude speedup compared to simulation-based methods. Meanwhile, our models significantly improve prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models can capture complex 2D thermal spatial patterns and can be used to forecast both the location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging 3D multi-core design era, we believe that the proposed thermal predictive models will be valuable for architects to quickly and informatively examine a rich set of thermal-aware design alternatives and thermal-oriented optimizations for large and sophisticated architecture substrates at an early design stage.

LIST OF REFERENCES

[1] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[2] E. Duesterwald, C. Cascaval and S. Dwarkadas, "Characterizing and Predicting Program Behavior and Its Variability," in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2003.
[3] J. Cook, R. L. Oliver and E. E. Johnson, "Examining Performance Differences in Workload Execution Phases," in Proc. of the IEEE International Workshop on Workload Characterization, 2001.
[4] X. Shen, Y. Zhong and C. Ding, "Locality Phase Prediction," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2004.
[5] C. Isci and M. Martonosi, "Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data," in Proc. of the International Symposium on Microarchitecture, 2003.
[6] T. Sherwood, S. Sair and B. Calder, "Phase Tracking and Prediction," in Proc. of the International Symposium on Computer Architecture, 2003.
[7] A. Dhodapkar and J. Smith, "Managing Multi-Configurable Hardware via Dynamic Working Set Analysis," in Proc. of the International Symposium on Computer Architecture, 2002.
[8] M. Huang, J. Renau and J. Torrellas, "Positional Adaptation of Processors: Application to Energy Reduction," in Proc. of the International Symposium on Computer Architecture, 2003.
[9] W. Liu and M. Huang, "EXPERT: Expedited Simulation Exploiting Program Behavior Repetition," in Proc. of International Conference on Supercomputing, 2004.
[10] T. Sherwood, E. Perelman and B. Calder, "Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications," in Proc. of the International Conference on Parallel Architectures and Compilation Techniques, 2001.
[11] A. Dhodapkar and J. Smith, "Comparing Program Phase Detection Techniques," in Proc. of the International Symposium on Microarchitecture, 2003.
[12] C. Isci and M. Martonosi, "Identifying Program Power Phase Behavior using Power Vectors," in Proc. of the International Workshop on Workload Characterization, 2003.
[13] C. Isci and M. Martonosi, "Phase Characterization for Power: Evaluating Control-Flow-Based and Event-Counter-Based Techniques," in Proc. of the International Symposium on High-Performance Computer Architecture, 2006.
[14] M. Annavaram, R. Rakvic, M. Polito, J.-Y. Bouguet, R. Hankins and B. Davies, "The Fuzzy Correlation between Code and Performance Predictability," in Proc. of the International Symposium on Microarchitecture, 2004.
[15] J. Lau, S. Schoenmackers and B. Calder, "Structures for Phase Classification," in Proc. of International Symposium on Performance Analysis of Systems and Software, 2004.
[16] J. Lau, J. Sampson, E. Perelman, G. Hamerly and B. Calder, "The Strong Correlation between Code Signatures and Performance," in Proc. of the International Symposium on Performance Analysis of Systems and Software, 2005.
[17] J. Lau, S. Schoenmackers and B. Calder, "Transition Phase Classification and Prediction," in Proc. of the International Symposium on High Performance Computer Architecture, 2005.
[18] C. Isci and M. Martonosi, "Detecting Recurrent Phase Behavior under Real-System Variability," in Proc. of the IEEE International Symposium on Workload Characterization, 2005.
[19] E. Perelman, M. Polito, J. Y. Bouguet, J. Sampson, B. Calder and C. Dulong, "Detecting Phases in Parallel Applications on Shared Memory Architectures," in Proc. of the International Parallel and Distributed Processing Symposium, April 2006.
[20] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, "Construction and Use of Linear Regression Models for Processor Performance Analysis," in Proc. of the International Symposium on High-Performance Computer Architecture, 2006.
[21] P. J. Joseph, K. Vaswani and M. J. Thazhuthaveetil, "A Predictive Performance Model for Superscalar Processors," in Proc. of the International Symposium on Microarchitecture, 2006.
[22] B. Lee and D. Brooks, "Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction," in Proc. of the International Symposium on Architectural Support for Programming Languages and Operating Systems, 2006.
[23] E. Ipek, S. A. McKee, B. R. Supinski, M. Schulz and R. Caruana, "Efficiently Exploring Architectural Design Spaces via Predictive Modeling," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[24] B. Lee and D. Brooks, "Illustrative Design Space Studies with Microarchitectural Regression Models," in Proc. of the International Symposium on High-Performance Computer Architecture, 2007.
[25] R. M. Yoo, H. Lee, K. Chow and H. H. S. Lee, "Constructing a Non-Linear Model with Neural Networks for Workload Characterization," in Proc. of the International Symposium on Workload Characterization, 2006.
[26] I. Daubechies, Ten Lectures on Wavelets, Capital City Press, Montpelier, Vermont, 1992.
[27] I. Daubechies, "Orthonormal Bases of Compactly Supported Wavelets," Communications on Pure and Applied Mathematics, vol. 41, pages 906-966, 1988.
[28] T. Austin, Tutorial of Simplescalar V4.0, in conjunction with the International Symposium on Microarchitecture, 2001.
[29] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[30] T. Huffmire and T. Sherwood, "Wavelet-Based Phase Classification," in Proc. of the International Conference on Parallel Architecture and Compilation Technique, 2006.
[31] D. Brooks and M. Martonosi, "Dynamic Thermal Management for High-Performance Microprocessors," in Proc. of the International Symposium on High-Performance Computer Architecture, 2001.
[32] A. Alameldeen and D. Wood, "Variability in Architectural Simulations of Multi-threaded Workloads," in Proc. of International Symposium on High Performance Computer Architecture, 2003.
[33] D. L. Donoho, "De-noising by Soft-thresholding," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 613-627, 1995.
[34] MATLAB User Manual, MathWorks, MA, USA.
[35] M. Orr, K. Takezawa, A. Murray, S. Ninomiya and T. Leonard, "Combining Regression Trees and Radial Basis Function Networks," International Journal of Neural Systems, 2000.
[36] D. Brooks, V. Tiwari and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," 27th International Symposium on Computer Architecture, 2000.
[37] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt and T. Austin, "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," in Proc. of the International Symposium on Microarchitecture, 2003.
[38] A. Biswas, R. Cheveresan, J. Emer, S. S. Mukherjee, P. B. Racunas and R. Rangan, "Computing Architectural Vulnerability Factors for Address-Based Structures," in Proc. of the International Symposium on Computer Architecture, 2005.
[39] J. Cheng and M. J. Druzdzel, "Latin Hypercube Sampling in Bayesian Networks," in Proc. of the 13th Florida Artificial Intelligence Research Society Conference, 2000.
[40] B. Vandewoestyne and R. Cools, "Good Permutations for Deterministic Scrambled Halton Sequences in terms of L2-discrepancy," Journal of Computational and Applied Mathematics, vol. 189, issues 1-2, 2006.
[41] J. Chambers, W. Cleveland, B. Kleiner and P. Tukey, Graphical Methods for Data Analysis, Wadsworth, 1983.
[42] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, ISBN 0-13-273350-1, 1999.
[43] C. Kim, D. Burger and S. Keckler, "An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches," in Proc. of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.
[44] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm," Computer, vol. 35, issue 1, January 2002, pp. 70-78.
[45] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger and S. Keckler, "A NUCA Substrate for Flexible CMP Cache Sharing," in Proc. of International Conference on Supercomputing, 2005.
[46] Z. Chishti, M. D. Powell and T. N. Vijaykumar, "Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures," in Proc. of the International Symposium on Microarchitecture, 2003.
[47] B. M. Beckmann and D. A. Wood, "Managing Wire Delay in Large Chip-Multiprocessor Caches," in Proc. of the International Symposium on Microarchitecture, 2004.
[48] Z. Chishti, M. D. Powell and T. N. Vijaykumar, "Optimizing Replication, Communication, and Capacity Allocation in CMPs," in Proc. of the International Symposium on Computer Architecture, 2005.
[49] M. Zhang and K. Asanovic, "Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors," in Proc. of the International Symposium on Computer Architecture, 2005.
[50] B. Lee, D. Brooks, B. Supinski, M. Schulz, K. Singh and S. McKee, "Methods of Inference and Learning for Performance Modeling of Parallel Applications," PPoPP, 2007.
[51] K. Martin, D. Sorin, B. Beckmann, M. Marty, M. Xu, A. Alameldeen, K. Moore, M. Hill and D. Wood, "Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset," Computer Architecture News (CAN), 2005.
[52] Virtutech Simics, http://www.virtutech.com/products/
[53] S. Woo, M. Ohara, E. Torrie, J. Singh and A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," in Proc. of the International Symposium on Computer Architecture, 1995.
[54] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan and D. Tarjan, "Temperature-Aware Microarchitecture," in Proc. of the International Symposium on Computer Architecture, 2003.
[55] K. Banerjee, S. Souri, P. Kapur and K. Saraswat, "3-D ICs: A Novel Chip Design for Improving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration," Proceedings of the IEEE, vol. 89, pp. 602-633, May 2001.
[56] Y. F. Tsai, F. Wang, Y. Xie, N. Vijaykrishnan and M. J. Irwin, "Design Space Exploration for 3-D Cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 4, April 2008.
[57] B. Black, D. Nelson, C. Webb and N. Samra, "3D Processing Technology and its Impact on IA32 Microprocessors," in Proc. of the 22nd International Conference on Computer Design, pp. 316, 2004.
[58] P. Reed, G. Yeung and B. Black, "Design Aspects of a Microprocessor Data Cache using 3D Die Interconnect Technology," in Proc. of the International Conference on Integrated Circuit Design and Technology, pp. 15, 2005.
[59] M. Healy, M. Vittes, M. Ekpanyapong, C. S. Ballapuram, S. K. Lim, H. S. Lee and G. H. Loh, "Multiobjective Microarchitectural Floorplanning for 2-D and 3-D ICs," IEEE Trans. on Computer Aided Design of IC and Systems, vol. 26, no. 1, pp. 38-52, 2007.
[60] S. K. Lim, "Physical Design for 3D System on Package," IEEE Design & Test of Computers, vol. 22, no. 6, pp. 532, 2005.
[61] K. Puttaswamy and G. H. Loh, "Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors," in Proc. of the International Symposium on High-Performance Computer Architecture, 2007.
[62] Y. Wu and Y. Chang, "Joint Exploration of Architectural and Physical Design Spaces with Thermal Consideration," in Proc. of International Symposium on Low Power Electronics and Design, 2005.
[63] J. Sharkey, D. Ponomarev and K. Ghose, "M-Sim: A Flexible, Multithreaded Architectural Simulation Environment," Technical Report CS-TR-05-DP01, Department of Computer Science, State University of New York at Binghamton, 2005.

BIOGRAPHICAL SKETCH

Chang Burm Cho earned B.E. and M.A. degrees in electrical engineering at Dan-kook University, Seoul, Korea, in 1993 and 1995, respectively. Over the next 9 years, he worked as a senior researcher at the Korea Aerospace Research Institute (KARI), developing the On-Board Computer (OBC) for two satellites, KOMPSAT-1 and KOMPSAT-2. His research interests are computer architecture and workload characterization and prediction in large microarchitectural design spaces. |