Citation
Analytical Modeling and Analysis of Memory Access for Cache Tuning

Material Information

Title:
Analytical Modeling and Analysis of Memory Access for Cache Tuning
Creator:
Zang, Wei
Publisher:
University of Florida
Publication Date:
2013
Language:
English

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Gordon-Ross, Ann M
Committee Members:
Li, Tao
Stitt, Greg M
Wang, Zhe
Graduation Date:
5/4/2013

Subjects

Subjects / Keywords:
Acceleration ( jstor )
Analytical estimating ( jstor )
Analytical models ( jstor )
Average speed ( jstor )
Data processing ( jstor )
Evictions ( jstor )
Index sets ( jstor )
Instructional material evaluation ( jstor )
Limited liability companies ( jstor )
Simulations ( jstor )
cache
configuration
modeling
Genre:
Unknown ( sobekcm )

Notes

General Note:
Cache tuning is the process of determining the optimal cache configuration given an application's requirements for reducing energy consumption and/or improving performance. To facilitate fast design-time cache tuning, we propose single-pass trace-driven cache simulation methodologies, T-SPaCS and U-SPaCS, for two-level exclusive instruction/data cache and unified cache hierarchies, respectively. T-SPaCS and U-SPaCS significantly reduce storage space and simulation time using a small set of stacks that only record the complete cache access pattern independent of the cache configuration, and can simulate all cache configurations for both the level one and level two caches simultaneously. Experimental results show that T-SPaCS is 21X faster on average than sequential simulation for instruction caches and 33X faster for data caches, and U-SPaCS can accurately determine the cache miss rates with a 41X speedup in average simulation time. Even though the fundamentals of T-SPaCS and U-SPaCS for single-core systems can be adapted to chip multi-processor systems (CMPs), additional complexities must be considered. In CMPs, last-level caches (LLCs) are typically partitioned to manage resource contention to shared cache resources and unbalanced utilization of private caches. We propose cache partitioning with partial sharing (CaPPS), which reduces LLC contention using cache partitioning and improves utilization using sharing configuration. Sharing configuration enables the partitions to be privately allocated to a single core, partially shared with a subset of cores, or fully shared with all cores based on dynamic runtime application requirements. To facilitate fast design space exploration, we develop an analytical model to quickly estimate the miss rates of all CaPPS configurations using the applications' isolated LLC access traces to predict runtime LLC contention for co-executing applications. Experimental results demonstrate that the analytical model estimates cache miss rates with an average error of only 0.73% and with an average speedup of 3,966X as compared to a cycle-accurate simulator. Due to CaPPS's extensive design space, CaPPS can reduce the average LLC miss rate by 22% as compared to a baseline configuration and 14% as compared to prior works.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright by Wei Zang. Permission granted to University of Florida to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
5/31/2015

Full Text

ANALYTICAL MODELING AND ANALYSIS OF MEMORY ACCESS FOR CACHE TUNING

By

WEI ZANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

2013 Wei Zang

To my husband and parents

ACKNOWLEDGMENTS

The completion of this dissertation and my Ph.D. degree was a challenging but wonderful experience. My sincere gratitude goes to all the people whose guidance, support, and companionship helped me through this long journey and made my success possible.

First and foremost, I deeply appreciate my advisor, Dr. Ann Gordon-Ross, for motivating and inspiring my interest in the computer architecture area. Her in-depth knowledge, valuable advice, and timely feedback helped me to improve my capability for new idea development, logical thinking, and problem solving. She is not only a professional mentor in my research but also a concerned friend in my life. Additionally, she never complains about my inexperienced English expressions, and always encourages me and teaches me technical writing and speaking skills with all her patience.

I also thank Dr. Tao Li, Dr. Greg Stitt, and Dr. Daisy Zhe Wang for their constant motivation and support in my Ph.D. study, insightful comments on my proposal, and their valuable time and effort in serving on my supervisory committee.

In addition, I am also indebted to my colleagues in the Embedded System Lab and the CHREC F4 group: Arslan Munir, Marisha Rawlins, Tosiron Adegbija, M. Hammam Alsafrjalani, Abelardo Jara-Berrocal, Rohit Kumar, Shaon Yousuf, Aurelio Morales, and Zubin Kumar, for their knowledgeable discussions during my research. I particularly thank Marisha Rawlins for sharing her ideas and technologies without reservation. With her kind help, I was able to conduct my first research topic smoothly.

Finally, I would like to express my gratitude to my beloved husband, Rui Cao, and my parents for their infinite support in everything and for accompanying me through all the difficulties.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 RELATED WORK
   Trace-driven Cache Simulation
   Last-level Cache Partitioning

3 T-SPACS: A TWO-LEVEL SINGLE-PASS CACHE SIMULATION METHODOLOGY
   Two-Level Cache Characteristics
   Two-Level Single-Pass Cache Simulation Methodology: T-SPaCS
      Stack-based Two-Level Cache Analysis
         Compare-exclude scenario: =
         Compare-exclude scenario: <
         Compare-exclude scenario: >
      Acceleration Strategies
         Tree-assisted acceleration
         Array-assisted acceleration
   Experimental Results and Analysis
      Miss Rate Accuracy
      Optimal Cache Configurations
      Simulation Time Efficiency
      Comparison with TCaT
   Summary

4 U-SPACS: A SINGLE-PASS CACHE SIMULATION METHODOLOGY FOR TWO-LEVEL UNIFIED CACHES
   Single-Pass Cache Simulation Methodology for Two-level Unified Caches: U-SPaCS
      Overview
      Second-level Unified Cache Analysis
      Accelerated Stack Processing

      U-SPaCS's Processing Algorithm
      Occupied Blank Labeling
      Write-back Counting
   Experimental Results and Analysis
      Accuracy Evaluation
      Simulation Time Evaluation
   Summary

5 CAPPS: CACHE PARTITIONING WITH PARTIAL SHARING FOR MULTI-CORE SYSTEMS
   Cache Partitioning with Partial Sharing (CaPPS)
      Architecture and Sharing Configurations
         Modified LRU replacement policy
         Column caching
      Analytical Modeling Overview
      Isolated Access Trace Processing
      Analysis of Contention in the Shared Ways
         Calculation of …
         Calculation of …
         Calculation of …
         Calculation of the LLC miss rates
   Experiment Results
      Experiment Setup
      Analytical Model Evaluation
         Accuracy evaluation
         Simulation time evaluation
      CaPPS Evaluation
         Comparison with baseline configurations
         Comparison with private partitioning
         Comparison with constrained partial sharing
   Summary

6 CONCLUSIONS AND FUTURE WORK

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Optimal instruction and data cache configurations
5-1 The starting instructions for the benchmarks' simulation intervals
5-2 CMP system parameters

LIST OF FIGURES

2-1 Two address processing cases in trace-driven cache simulation using the stack-based algorithm
2-2 An example of the commonly used tree/forest structure in trace-driven cache simulation using the tree/forest-based algorithm
3-1 Storage requirements for the stack-based algorithm for a two-level cache
3-2 T-SPaCS's functional overview
3-3 Cache addressing formats for the compare-exclude operation scenarios
3-4 Special case when < where fetching from L2 results in an occupied blank (BLK)
3-5 A sample tree structure where rectangles correspond to tree nodes and values in the nodes are the indexes of the recorded conflicts
3-6 Tree-assisted stack processing acceleration algorithm
3-7 Array-assisted stack processing acceleration algorithm
3-8 Energy model for energy consumption measurement
3-9 Energy savings for the optimal instruction and data cache configurations normalized to the base cache configuration
3-10 Normalized energy consumption of the two-level optimal cache configurations to the single-level optimal cache configurations
3-11 Simulation time speedup of T-SPaCS without acceleration, T-SPaCS, and simplified T-SPaCS compared to the modified Dinero for instruction caches
3-12 Simulation time speedup of T-SPaCS without acceleration, T-SPaCS, and simplified T-SPaCS compared to the modified Dinero for data caches
3-13 Simulation time speedup of T-SPaCS and simplified T-SPaCS compared to TCaT for instruction caches
3-14 Simulation time speedup of T-SPaCS and simplified T-SPaCS as compared to TCaT for data caches
3-15 The comparison of the normalized energy savings between the optimal cache configurations and TCaT's configurations for instruction caches

3-16 The comparison of normalized energy savings between the optimal cache configurations and TCaT's configurations for data caches
4-1 Overview of U-SPaCS's operation
4-2 Sample instruction and data frames
4-3 U-SPaCS's algorithm for an instruction address for a particular B
4-4 U-SPaCS's simulation time speedup as compared to Dinero
4-5 The logarithmic-scaled simulation time for Dinero and U-SPaCS and the speedup attained by U-SPaCS as compared to Dinero
4-6 U-SPaCS's speedup with respect to the product of the average L1 cache miss rate and the average stack size
4-7 U-SPaCS's speedup with respect to the product of the square of the average L1 cache miss rate and the average stack size
5-1 Three sharing configurations: (A) a core's quota is configured as private, (B) partially shared with a subset of cores, or (C) fully shared with all other cores
5-2 Maintaining LRU information using a linked list for (A) a conventional LRU cache and (B) a CaPPS sharing configuration
5-3 Extension of the linked list implementation for CaPPS
5-4 Two cores' isolated (C1, C2) and interleaved (C1&C2) access traces for an arbitrary cache set
5-5 C1's isolated access trace to an arbitrary cache set to illustrate the calculation of …
5-6 The average and standard deviation of the average LLC miss rate error determined by the analytical model

5-7 The analytical model's simulation time speedup as compared to GEM5
5-8 Average LLC miss rate reductions for CaPPS's optimal configurations as compared to the two baseline configurations
5-9 Average LLC miss rate reductions for CaPPS's optimal configurations as compared to private partitioning's optimal configurations for a 1 MB LLC
5-10 LLC miss rates for different numbers of cache ways and cache sizes when the benchmarks execute in isolation
5-11 Average LLC miss rate reductions for CaPPS's optimal configuration as compared to private partitioning for 512 KB and 256 KB LLCs, respectively
5-12 Average LLC miss rate reductions for CaPPS's optimal configuration as compared to the two constrained partial sharing design spaces

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ANALYTICAL MODELING AND ANALYSIS OF MEMORY ACCESS FOR CACHE TUNING

By

Wei Zang

May 2013

Chair: Ann Gordon-Ross
Major: Electrical and Computer Engineering

Cache tuning is the process of determining the optimal cache configuration given an application's requirements for reducing energy consumption and/or improving performance. To facilitate fast design-time cache tuning, we propose single-pass trace-driven cache simulation methodologies, T-SPaCS and U-SPaCS, for two-level exclusive instruction/data cache and unified cache hierarchies, respectively. T-SPaCS and U-SPaCS significantly reduce storage space and simulation time using a small set of stacks that only record the complete cache access pattern independent of the cache configuration, and can simulate all cache configurations for both the level one and level two caches simultaneously. Experimental results show that T-SPaCS is 21X faster on average than sequential simulation for instruction caches and 33X faster for data caches, and U-SPaCS can accurately determine the cache miss rates with a 41X speedup in average simulation time.

Even though the fundamentals of T-SPaCS and U-SPaCS for single-core systems can be adapted to chip multi-processor systems (CMPs), additional complexities must be considered.

In CMPs, last-level caches (LLCs) are typically partitioned to manage resource contention to shared cache resources and unbalanced utilization of private caches. We propose cache partitioning with partial sharing (CaPPS), which reduces LLC contention using cache partitioning and improves utilization using sharing configuration. Sharing configuration enables the partitions to be privately allocated to a single core, partially shared with a subset of cores, or fully shared with all cores based on dynamic runtime application requirements. To facilitate fast design space exploration, we develop an analytical model to quickly estimate the miss rates of all CaPPS configurations using the applications' isolated LLC access traces to predict runtime LLC contention for co-executing applications. Experimental results demonstrate that the analytical model estimates cache miss rates with an average error of only 0.73% and with an average speedup of 3,966X as compared to a cycle-accurate simulator. Due to CaPPS's extensive design space, CaPPS can reduce the average LLC miss rate by 22% as compared to a baseline configuration and 14% as compared to prior works.

CHAPTER 1
INTRODUCTION

Due to high memory latency and memory bandwidth limitations (i.e., the memory wall), optimizing the cache subsystem is critical for improving overall system performance. In addition, since the cache subsystem generally consumes a large percentage of total microprocessor system energy [58], optimizing the cache subsystem's energy consumption results in significant total system energy savings. For example, the ARM 920T microprocessor's cache subsystem consumes 44% of the total power [58]. Even though the StrongARM SA-110 processor specifically targets low-power applications, the processor's instruction cache consumes 27% of the total power [52]. The 21164 DEC Alpha's cache subsystem consumes 25% to 30% of the total power [19].

Previous research shows that since applications require diverse cache configurations (i.e., specific values for cache parameters such as total size, block size, and associativity), cache parameters can be specialized for optimal performance or energy consumption [72]. In general, the larger the cache's total size, block size, and associativity, the better the cache performance. However, if the cache size is larger than the working set size, excess dynamic energy is consumed by fetching blocks from an excessively large cache and excess static energy is consumed to power the large cache. Alternatively, if the cache size is smaller than the working set size, excess energy is wasted due to thrashing (portions of the working set are continually swapped into and out of the cache because of capacity misses, which affects cache performance as well).

Spatial locality dictates the appropriate cache block size. If the block size is too large, fetching unused information from main memory wastes energy and increases fetching latency, and if the block size is too small, high cache miss rates waste energy due to frequent memory fetching and added stall cycles. Similarly, temporal locality dictates the appropriate cache associativity.

Adjusting cache parameters based on applications is appropriate for small applications with little dynamic behavior, where a single cache configuration can capture the entire application's temporal and spatial locality behaviors. However, since many larger applications show considerable variation in cache requirements throughout execution [60], a single cache configuration cannot capture all temporal and spatial locality behaviors. Phase-specified cache optimization allows cache parameters and organization to be configured for each execution phase (a phase is the set of intervals within an application's execution that have similar cache behavior [60]), resulting in more energy savings and better performance than application-based cache optimization [29].

Cache tuning is the process of determining the best cache configuration (specific cache parameter values and organization) in the design space (the collection of all possible cache configurations) for a particular application (application-based cache tuning) [27][72] or an application's phases (phase-based cache tuning) [29][60]. Cache tuning determines the best cache configuration with respect to one or more design metrics, such as minimizing the leakage power, minimizing the energy per access, minimizing the average cache access time, minimizing the total energy consumption, minimizing the number of cache misses, and maximizing performance.
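To make the size of such a design space concrete, the short Python sketch below enumerates candidate cache configurations from a few illustrative parameter ranges; the specific values are assumptions for this example only, not the design spaces evaluated in later chapters.

    from itertools import product

    # Illustrative (assumed) parameter ranges; the dissertation's actual
    # design spaces differ.
    cache_sizes_kb  = [2, 4, 8]      # total cache size in KB
    block_sizes_b   = [16, 32, 64]   # block (line) size in bytes
    associativities = [1, 2, 4]      # ways per set

    design_space = []
    for size_kb, block_b, ways in product(cache_sizes_kb, block_sizes_b, associativities):
        num_sets = (size_kb * 1024) // (block_b * ways)
        if num_sets >= 1:
            design_space.append({"size_kb": size_kb, "block_b": block_b,
                                 "ways": ways, "sets": num_sets})

    print(len(design_space), "candidate configurations")   # 27 for these ranges
    # Cache tuning evaluates each candidate (by simulation or an analytical
    # model) and keeps the one that minimizes the chosen design metric.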

Cache tuning is classified into design-time offline static cache tuning and runtime online dynamic cache tuning. We introduce the two categories in the following.

Offline static cache tuning: Given the availability of synthesizable core-based processors with tunable cache parameters [3][4][51], designers can perform design space exploration offline during design time to determine the optimal cache configuration. The designer synthesizes the processor with these optimal cache parameter values and the cache configuration remains fixed during runtime. Numerous techniques exist for offline static cache tuning, including techniques that directly simulate the cache behavior using a simulator (simulation-based cache tuning) and techniques that formulate cache miss rates with mathematical models according to theoretical analysis (analytical modeling).

Simulation-based tuning is a common technique not only for tuning the cache, but also for tuning any design parameter in the processor. Cache simulation leverages software [6][8][48][53] to model cache operations and estimates cache miss rates or other metrics based on a representative application input. The simulator allows cache parameters to be varied such that different cache configurations in the design space can be simulated easily without building costly physical hardware prototypes. However, the simulation model can be cumbersome to set up since accurately modeling a system's environment can be more difficult than modeling the system itself. Additionally, since cache simulation is generally slow, simulating only seconds of application execution can require hours or days of simulation time for each cache configuration in the design space. Due to this lengthy simulation time, simulation-based cache tuning is generally not feasible for large design spaces and/or complex applications.

Aside from general simulation of all modules in the microarchitecture, specialized trace-driven cache simulation, which only simulates the cache module, is widely used for cache tuning. For specialized trace-driven cache simulation, cache behavior can be simulated using a sequence of time-ordered memory accesses, typically referred to as a memory access trace. The access trace can be collected using any microarchitecture simulator. Since simulating the entire processor/system is slow, trace-driven cache simulation can significantly reduce cache tuning time by simulating an application only once to produce the access trace and then processing that access trace quickly to evaluate the cache configurations. Most trace-driven cache simulation methods iteratively evaluate the design space, processing only one configuration on each simulation pass [18], resulting in lengthy design space exploration time. Instead of iteratively exploring the design space, single-pass trace-driven simulation evaluates multiple configurations simultaneously in a single simulation pass [34][50][66], achieving simulation speedups on the order of tens as compared to iterative simulation.

Unlike cache simulation techniques that require lengthy simulation/tuning time, analytical-modeling-based cache tuning directly calculates cache misses for each cache configuration using mathematical models. Since the cache miss rates are calculated nearly instantaneously using computational formulas, the cache tuning time is significantly reduced. Analytical models require detailed statistical information and/or information on critical application events, which can be collected using profilers such as Cprof [44], ProfileMe [15], and DCPI [2]. Analytical-modeling-based cache tuning is typically less accurate than simulation-based cache tuning.

Previous works mostly focus on two distinct analytical modeling categories based on either application structures or access traces.

Since an application's spatial and temporal locality characteristics, which are mainly dictated by loops, determine cache behavior, cache misses can be estimated based on application structures gathered from specially designed loop profilers [25][33][69]. However, loop characteristics do not sufficiently predict exact cache behavior, which also depends on data layout and traversal order; thus estimating an application's total cache behavior based only on an application's loop behavior may be inaccurate.

Analytical modeling based on a memory access trace is similar to trace-driven cache simulation in that it uses a functional simulation to produce the memory access trace. However, instead of simulating cache behavior as in trace-driven cache simulation, mathematical equations statistically or empirically analyze the memory access trace to determine the cache miss rates. Most analytical modeling based on a memory access trace leverages the reuse distance between successive accesses to the same memory address [9][22][74]. The reuse distance is the number of memory accesses (trace address or block address) between these two successive accesses [9][16][26]. The reuse distance captures the temporal distance between successive accesses, and also captures spatial locality when individual cache block addresses are used as the unique memory access for determining the reuse distance.
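As a concrete illustration of the reuse distance defined above, the following Python sketch computes block-level reuse distances for a toy trace; the trace, the 32-byte block size, and the convention of returning -1 for a first (compulsory) access are assumptions made only for this example.

    def reuse_distances(trace, block_size=32):
        # Per-access reuse distance: the number of accesses between successive
        # accesses to the same cache block; -1 marks a first (compulsory) access.
        last_seen = {}          # block address -> index of its last access
        distances = []
        for i, addr in enumerate(trace):
            block = addr // block_size
            if block in last_seen:
                distances.append(i - last_seen[block] - 1)
            else:
                distances.append(-1)
            last_seen[block] = i
        return distances

    # Tiny made-up trace of byte addresses (illustrative only).
    trace = [0x00, 0x20, 0x40, 0x04, 0x60, 0x24]
    print(reuse_distances(trace))   # [-1, -1, -1, 2, -1, 3]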

Online dynamic cache tuning: Online dynamic cache tuning alleviates many of the drawbacks of offline static cache tuning by adjusting cache parameters during runtime. Since online cache tuning executes entirely during runtime, designers are not burdened with setting up realistic simulation environments or modeling accurate input stimuli, and thus all design-time efforts are eliminated. Additionally, since online cache tuning can dynamically react to changing execution, the cache configuration can also change to accommodate new execution behavior or application updates. However, online dynamic cache tuning requires some hardware overhead to monitor cache behavior and control cache parameter changes, which introduces additional power, energy, and/or performance overheads during cache tuning and can diminish the total savings. Dynamic cache parameter adaptation leverages several techniques, such as unused way/set shutdown [1][7][42], merging/splitting cache sets [23][40][54], and way/block concatenation [72]. Since exhaustive design space exploration is not feasible for online cache tuning due to lengthy cache tuning time, heuristic search techniques reduce the number of configurations explored, which significantly reduces the online cache tuning time and energy/performance overheads. Zhang et al. [72] analyzed the impact of each cache parameter on the cache miss rate and energy consumption. Based on Zhang's impact-ordered single-level cache tuning heuristic, Gordon-Ross et al. [27] developed a two-level cache tuner called TCaT. TCaT leveraged the impact-ordered heuristic to interlace the exploration of separate instruction and data level one and level two caches, and TCaT was then extended for a unified second-level cache [30]. Based on Zhang et al.'s single-core configurable cache architecture and cache tuning techniques [72], Rawlins and Gordon-Ross [57] developed an efficient design space exploration technique for level one data cache tuning in heterogeneous dual-core architectures.

In addition to adjusting cache parameters in multi-core architectures, most research has focused on the design and optimization of a CMP's last-level cache (LLC) due to the large effects that the LLC has on system performance and energy consumption.

The LLC is usually the second- or third-level cache, which is typically very large and thus occupies a large die area and consumes considerably high leakage power. Additionally, large LLCs require long access latencies. Shared LLCs have more efficient cache capacity utilization than private LLCs because shared LLCs have no shared-address replication and no coherence overhead for shared data modification. However, large shared LLCs may have lengthy hit latencies, and multiple cores may have high contention for the shared LLC's resources even when the threads do not share data. Such contention generates high interference between threads and considerably affects the performance of individual threads and of the entire system.

In a private LLC, the cache capacity is allocated to each core statically. If the application running on a core is small and the allocated private cache capacity is large, then the excess cache capacity is wasted. Alternatively, if the application running on a core is large and the allocated cache capacity is small, performance may be compromised due to a large cache miss rate. In a shared LLC, the cache capacity occupied by each core is flexible and can change based on the applications' demands. However, if an application sequentially processes a large amount of data, such as multimedia streaming reads/writes, the application will occupy a large portion of the cache with data that is only accessed a single time; thus a large cache capacity may not reduce the cache miss rate for this application. As a result, the co-scheduled applications running on the other cores experience high cache miss rates due to shared cache contention.

To overcome the drawbacks of both shared and private LLCs, combining the advantages of both into a hybrid architecture is required.

In a shared LLC, the capacity can be dynamically partitioned and each partition can be allocated to one core as that core's private cache, and this private cache's size and associativity can be configured. Alternatively, private LLCs can be partitioned to be partially or entirely shared with other cores.

Our work focuses on offline static cache tuning due to the advantages afforded by offline cache tuning for embedded systems and the overheads induced by dynamic cache tuning. Since offline cache tuning is done prior to system runtime, there is no runtime overhead (e.g., performance, energy) introduced with respect to online design space exploration. Furthermore, many embedded systems execute a fixed application or fixed set of applications, thus offline cache tuning is feasible for these systems. Based on the prior successes in offline static cache tuning, using a memory access trace for cache simulation or as input to an analytical model provides accurate and sufficient information for cache miss rate prediction. Additionally, for online dynamic cache tuning, pre-profiling the memory access trace can reduce the online cache tuning hardware overhead that is necessary to dynamically record and monitor cache access patterns, thereby avoiding this cache tuning intrusion into normal system execution. Furthermore, recording the memory access trace offline provides a basic view of the application's locality characteristics and gives a predictable guide for online cache tuning. Our work focuses on the analytical modeling and analysis of memory access traces to rapidly evaluate cache miss rates for all cache configurations in the design space to assist offline cache tuning.

To reduce the lengthy simulation time in offline cache tuning, single-pass trace-driven cache simulation was developed. However, several drawbacks limit single-pass trace-driven cache simulation's versatility.

First, some single-pass algorithms that simulate caches with only one or two variable parameters must re-execute the algorithm several times in order to cover the entire design space. Secondly, single-pass trace-driven cache simulation for multi-level caches is complex. For example, in a two-level cache hierarchy, the level one cache filters the access trace and produces one filtered access trace for each level two cache (i.e., each unique level one cache configuration's misses form a unique filtered access trace for the level two cache). When directly leveraging single-pass trace-driven cache simulation for single-level caches to simulate each of the two-level caches, the two-level cache simulation must store and process each filtered access trace separately to combine each level two cache configuration with each level one cache configuration. Since the number of filtered access traces is dictated by the level one cache's design space size, the time and storage complexities can be considerably large for multi-level cache simulation. We designed a Two-level Single-Pass trace-driven Cache Simulation methodology (T-SPaCS) for exclusive instruction and data caches with low simulation time and storage complexity, which will be presented in Chapter 3.
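The filtering effect described above is easy to see in a sketch: the Python fragment below passes a toy trace through a direct-mapped L1 and emits the misses as the access trace seen by L2 (the addresses are made up and write handling is ignored, so this is an illustration, not T-SPaCS). Because each L1 configuration produces a different filtered trace, a naive two-level approach must store and process one such trace per L1 configuration.

    def l1_filtered_trace(trace, num_sets, block_size=32):
        # Filter a trace through a direct-mapped L1; the misses form the
        # access trace seen by L2 (read-only sketch, write effects omitted).
        lines = [None] * num_sets            # one tag per direct-mapped set
        l2_trace = []
        for addr in trace:
            block = addr // block_size
            idx, tag = block % num_sets, block // num_sets
            if lines[idx] != tag:            # L1 miss: access propagates to L2
                l2_trace.append(addr)
                lines[idx] = tag
        return l2_trace

    trace = [0x000, 0x080, 0x000, 0x080, 0x040]
    print(l1_filtered_trace(trace, num_sets=4))   # -> [0, 128, 0, 128, 64]
    print(l1_filtered_trace(trace, num_sets=8))   # -> [0, 128, 64]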

Due to the broad usage of unified caches, a single-pass trace-driven cache simulation design for unified caches is also needed. In unified cache simulation, the relative ordering of cache misses in the separate level one instruction and level one data caches must be maintained to simulate the unified level two cache. Unfortunately, this required history of the relative access ordering cannot be captured given the current data structures and algorithms in existing single-pass trace-driven cache simulation designs. In Chapter 4, we will present our design of a two-level Unified Single-Pass Cache Simulation methodology (U-SPaCS).

Although the fundamentals of T-SPaCS and U-SPaCS for single-core systems can be adapted to CMPs with only private caches, in CMPs with shared caches, the additional complexities introduced by shared cache resource contention must be considered. In CMPs, LLCs are typically partitioned to manage resource contention for shared caches and unbalanced utilization of private caches. Cache partitioning partitions the cache, allocates quotas (a subset of partitions) to the cores, and optionally configures the partitions/quotas (e.g., size and/or associativity) to the allocated cores' requirements. Each core's cache occupancy is constrained to the core's quota to ensure fair utilization. Set partitioning partitions and allocates quotas at the cache set granularity and is typically implemented using operating system (OS) based page coloring [41]. However, due to this OS modification requirement, hardware-based way partitioning is more widely used. Way partitioning partitions and allocates quotas at the cache way granularity [55][65] and is implemented using column caching or a modified replacement policy to ensure that a core's occupancy does not exceed the core's quota [14][17][43].

Way partitioning for shared LLCs typically uses private partitioning, which restricts quotas to exclusive use by the allocated core and can lead to poor cache utilization: if a core temporarily does not occupy its entire allocated quota, other cores cannot temporarily utilize the vacant quota. Thus, partially sharing a core's quota with other cores can potentially improve cache utilization.

Therefore, in Chapter 5, we propose a hybrid LLC organization that combines the benefits of private and shared partitioning, Cache Partitioning with Partial Sharing (CaPPS), which enables a core's quota to be configured as private, partially shared with a subset of cores, or fully shared with all other cores. To facilitate fast design space exploration, we develop an analytical model to quickly estimate the miss rates of all CaPPS configurations.

CHAPTER 2
RELATED WORK

Since our work derives mainly from trace-driven cache simulation and LLC partitioning, this chapter gives an extensive review of the two areas.

Trace-driven Cache Simulation

A simple naive technique for leveraging trace-driven cache simulation for cache tuning is to sequentially process the access trace for each cache configuration, where the number of trace processing passes is equal to the number of cache configurations in the design space. This sequential trace-driven cache simulation technique results in prohibitively lengthy simulation time for a large access trace, thus requiring lengthy tuning time for exhaustive design space exploration. Dinero [18] is the most widely used trace simulator for single-processor systems. Dinero models each cache set as a linked list, where the number of nodes in each list is equal to the cache set associativity and each node records the tag information for the addresses that map to that set. Cache operations are modeled as linked list manipulations. Trace-driven cache simulators for multi-core architectures include CASPER [37] and CMP$IM [38], where cache parameters are configurable for each cache level and heterogeneous configurations are allowed, but the cores' private cache configurations must be homogeneous.
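A minimal sketch of this sequential, one-configuration-per-pass style of simulation follows, with each set modeled as an LRU-ordered list of tags in the spirit of Dinero's per-set linked lists. Dinero itself is written in C; the Python stand-in, class names, and toy trace below are illustrative assumptions only.

    from collections import deque

    class CacheSet:
        # One set as an ordered list of tags, MRU at the front (a Python
        # stand-in for a per-set linked list).
        def __init__(self, assoc):
            self.assoc = assoc
            self.tags = deque()              # at most `assoc` tags, MRU first

        def access(self, tag):
            if tag in self.tags:             # hit: move tag to the MRU position
                self.tags.remove(tag)
                self.tags.appendleft(tag)
                return True
            self.tags.appendleft(tag)        # miss: insert at MRU position
            if len(self.tags) > self.assoc:
                self.tags.pop()              # evict the LRU tag
            return False

    class Cache:
        def __init__(self, num_sets, assoc, block_size):
            self.sets = [CacheSet(assoc) for _ in range(num_sets)]
            self.num_sets, self.block_size = num_sets, block_size
            self.misses = 0

        def access(self, addr):
            block = addr // self.block_size
            if not self.sets[block % self.num_sets].access(block // self.num_sets):
                self.misses += 1

    cache = Cache(num_sets=4, assoc=2, block_size=32)
    for addr in [0x000, 0x200, 0x000, 0x400, 0x200]:
        cache.access(addr)
    print(cache.misses)    # prints 4; sequential tuning repeats a full pass
                           # like this once per configuration in the design space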

To speed up the total trace processing time as compared to sequential trace-driven cache simulation, single-pass trace-driven cache simulation simulates multiple cache configurations during a single trace processing pass, thereby reducing the cache tuning time. The trace simulator processes the trace addresses sequentially and evaluates each processed address to determine if the access results in a cache hit/miss for all cache configurations simultaneously. For each processed address, the number of conflicts (previously accessed blocks that map to the same cache set as the currently processed address) directly determines whether or not the processed address is a cache hit/miss. The conflict evaluation is very time consuming because all previously accessed unique addresses must be evaluated for conflicts. Single-pass trace-driven cache simulation leverages two cache properties, the inclusion property and the set refinement property, to speed up this conflict evaluation. The inclusion property [50] states that larger caches contain a superset of the blocks present in smaller caches. If two cache configurations have the same block size, the same number of sets, and use access-order-based replacement policies (e.g., least recently used (LRU)), the inclusion property indicates that the conflicts for the cache configuration with a smaller associativity also form the conflicts for the cache configuration with a larger associativity. The set refinement property [34] states that blocks that map to the same cache set in larger caches also map to the same cache set in smaller caches. The set refinement property implies that the conflicts for the cache configuration with a larger number of sets are also conflicts for the cache configuration with a smaller number of sets if both cache configurations have the same block size.
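The set refinement property follows directly from how set indices are computed for power-of-two set counts: the index for a smaller number of sets is simply the low-order bits of the index for a larger number of sets, so blocks that share a set in the larger configuration necessarily share a set in the smaller one. A small sketch with arbitrarily chosen block addresses:

    # num_sets is assumed to be a power of two; the index for S sets is the
    # low-order bits of the index for any larger power-of-two set count.
    def set_index(block_addr, num_sets):
        return block_addr % num_sets

    blocks = [0b101100, 0b001100, 0b000100]
    for b in blocks:
        print("block {:06b}: idx@16 sets={:2d}, idx@8 sets={}, idx@4 sets={}".format(
            b, set_index(b, 16), set_index(b, 8), set_index(b, 4)))
    # The first two blocks share a set for 16, 8, and 4 sets; the third joins
    # them only once the index narrows to 8 or 4 sets. Once blocks fall into
    # the same set for a larger number of sets, they stay together for every
    # smaller number of sets.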

Previous works on single-pass trace-driven cache simulation can be divided into two categories based on the algorithm and data structure used: the stack-based algorithm and the tree/forest-based algorithm. These structures maintain the previous trace access information to enable cache hit/miss determination for each processed address. The stack-based algorithm leverages simple data structures but requires long trace processing time, while the tree/forest-based algorithm leverages complex data structures but reduces the trace processing time.

Figure 2-1. Two address processing cases in trace-driven cache simulation using the stack-based algorithm. Case 1 depicts the situation where the processed address is not found in the stack and case 2 depicts the situation where the processed address is found in the stack.

Early work by Mattson et al. [50] developed the stack-based algorithm for fully associative caches, which laid the foundation for all future trace-driven cache simulation work. Figure 2-1 depicts two address processing cases in trace-driven cache simulation using the stack-based algorithm. Letters A, B, C, D, E, and F represent different addresses that map to different cache lines. The trace simulator processes the access trace one address at a time. For the examples in Figure 2-1, case 1 is processing address A and case 2 is processing address E. For each processed address, the algorithm first performs a stack search for a previous access to the currently processed address. Case 1 depicts the situation where the currently processed address has not been accessed before and is not present in any previously accessed cache line (the address is not present in the stack). Therefore this access is a compulsory cache miss. Case 2 depicts the situation where the currently processed address is located in the stack and has been previously accessed.

Conflict evaluation evaluates the potential conflict set (all addresses in the stack between the stack's top and the currently processed address's previous access location in the stack) to determine the conflicts. For example, in case 2, addresses B, C, and D form the potential conflict set for the processed address E. For a fully associative cache, all potential conflicts are conflicts since all addresses map to the same (single) cache set. The number of conflicts directly determines the minimum associativity (i.e., cache size for a fully associative cache) necessary to result in a cache hit/miss for the currently processed address. After conflict evaluation, the stack update process updates the stack's stored address access order by pushing the currently processed address onto the stack and removing the previous access to the currently processed address if the current access was not a compulsory miss (case 2). Otherwise, the currently processed address is directly pushed onto the stack (case 1). In this way, the stack only stores the uniquely accessed addresses in most recently used (MRU) ordering.

Hill and Smith [34] extended the stack-based algorithm to simulate direct-mapped and set-associative caches. Unlike fully associative cache simulation, in direct-mapped and set-associative cache simulation, conflict evaluation must further evaluate the potential conflict set to determine the conflicts that map to the same cache set as the currently processed address. If the number of conflicts is large enough to evict the previously fetched line that the currently processed address maps to, the currently processed address's access results in a cache miss. Furthermore, Hill and Smith [34] leveraged the set refinement property to efficiently determine the number of conflicts for caches with different numbers of sets. Thompson and Smith [66] introduced dirty-level analysis and included write-back counters for write-back caches.
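The sketch below pulls together the steps described above (stack search, conflict evaluation, and stack update) into a simplified single-pass simulator that classifies every access for several (number of sets, associativity) pairs with a fixed block size in one pass over the trace. LRU replacement is assumed, and the trace and configuration lists are illustrative; this is a sketch of the Mattson/Hill-and-Smith idea, not of T-SPaCS itself.

    def single_pass(trace, block_size, sets_list, max_assoc):
        # One pass over the trace classifies every access as hit/miss for every
        # (number of sets, associativity) pair with a fixed block size, using
        # the stack search, conflict evaluation, and stack update steps.
        stack = []                                 # block addresses, MRU first
        misses = {(s, a): 0 for s in sets_list for a in range(1, max_assoc + 1)}
        for addr in trace:
            block = addr // block_size
            if block not in stack:                 # compulsory miss for all configs
                for key in misses:
                    misses[key] += 1
            else:
                depth = stack.index(block)         # potential conflict set = stack[:depth]
                for s in sets_list:
                    # conflicts: prior blocks that map to the same set for s sets
                    conflicts = sum(1 for b in stack[:depth] if b % s == block % s)
                    for a in range(1, max_assoc + 1):
                        if conflicts >= a:         # block was already evicted
                            misses[(s, a)] += 1
                stack.remove(block)
            stack.insert(0, block)                 # stack update: MRU ordering
        return misses

    trace = [0x00, 0x20, 0x40, 0x00, 0x20, 0x60, 0x00]   # toy byte-address trace
    for (s, a), m in sorted(single_pass(trace, 32, [1, 2], 2).items()):
        print("sets=%d assoc=%d: misses=%d" % (s, a, m))
    # For this trace: 7, 7, 5, and 4 misses, respectively; every configuration
    # is resolved in the single pass.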

Figure 2-2. An example of the commonly used tree/forest structure in trace-driven cache simulation using the tree/forest-based algorithm. The rectangles correspond to tree nodes and the values in the rectangles correspond to the indexes of the cache sets.

The time complexity of both the stack search for the processed address and the conflict evaluation is on the order of the stack size, which is equal to the number of uniquely accessed addresses in the worst case. Therefore, the stack search and conflict evaluation can be very time consuming. To speed up the trace processing time, much work has focused on replacing the stack structure with a tree/forest structure that stores and traverses accesses more efficiently than the stack structure by storing the cache contents for multiple cache configurations in a single structure. The commonly used tree/forest structure in trace-driven cache simulation uses different tree levels to represent cache configurations with different numbers of cache sets. The smallest configurable number of sets dictates the number of trees in the forest, so that in the extreme case, only one tree is required if the cache can be configured as fully associative. Figure 2-2 depicts an example of this tree/forest structure. In this example, the number of cache sets can be configured as 2, 4, or 8. Thus there are two trees and each tree has three levels in the structure. Each rectangle in the figure represents a single node, where each node stores a cache set's contents, and the rectangle's value corresponds to the cache set's index address. Therefore, the number of addresses stored in a node directly indicates the number of conflicts.

When processing a new address, the processed address is searched in the tree and added to the tree nodes based on the sets that the processed address would map to for each number of sets. We note that the tree/forest structure described in Figure 2-2 can only simulate cache configurations with a fixed line size. Therefore, multiple trees/forests are typically used in previous works to simulate multiple line sizes in a single pass. Hill and Smith [34] leveraged multiple trees/forests for direct-mapped cache simulation. Janapsatya et al. [39] extended Hill and Smith's technique to simulate set-associative caches. In their trees/forests, each node contained a stack to maintain address tags, and the cache hit/miss classification of the processed address was directly dictated by the address tags in the corresponding node's stack. Tojo et al. [67] and Haque et al. [32] reduced Janapsatya's technique's processing time using the cache inclusion property. One drawback of the tree/forest structure is the redundant storage overhead, because each unique line address must be stored in one node of each level in order to record conflict information for the cache configurations with different numbers of sets. Sugumar and Abraham [64] developed novel tree/forest structures to simulate cache configurations with a fixed line size and cache configurations with a fixed size but varying line size. Their tree/forest structures only maintained the line addresses for the cache configuration with the largest number of sets, from which the line addresses for the cache configurations with smaller numbers of sets were derived. Therefore, the redundant storage was reduced by a factor of two. The simulation time of their tree/forest structures outperformed that of previous tree/forest structures by factors of 1.0 to 5.0.

Due to the complex data structures and processing operations, the tree/forest algorithms are not amenable to hardware implementation for runtime cache tuning. Thus, the stack-based algorithm is still widely used. For example, Viana et al. [70] proposed SPCE, a stack-based algorithm that evaluated cache size, line size, and associativity simultaneously using simple bit-wise operations. Gordon-Ross et al. [28] designed SPCE's hardware prototype for runtime cache tuning. Whereas these single-pass cache simulation methodologies (stack- and tree-based) are highly efficient, these methods are limited to single-level cache simulation. In this dissertation, we enhance previous works to include two-level cache simulation. We combine a stack-based algorithm [70] to record the memory access trace with a tree-based data structure to support the stack search for processing-time acceleration.

Last-level Cache Partitioning

Cache partitioning consists of three components: the cache partitioning controller, system metric tracking, and the cache partitioning decision. The cache partitioning controller can be implemented using a modified replacement policy in a shared LLC or a coherence policy in a private LLC. By enforcing replacement constraints on the candidate replacement blocks in a cache set and on the insertion location for each core's incoming blocks, an individual core's cache occupancy can be controlled. System metric tracking tracks metrics for the entire system or for each core. One possible technique is dynamic set sampling [55], which samples several cache sets whose tags are maintained as if the sampled sets were configured with one potential configuration, to approximate the global performance of the entire cache. Another technique is to monitor the cache access profile in hardware or software and analyze each core's cache requirements and interactions with other cores. The cache partitioning decision evaluates the system metrics to determine the best cache partitioning for either optimizing overall performance or maintaining performance fairness across the cores [35].
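To make the quota-enforcing replacement policy mentioned above concrete, the sketch below models one set of a shared LLC in which a modified LRU policy keeps each core's occupancy within its allotted ways. The two-core setup, the quotas, and the victim-selection details are assumptions for illustration only; real partitioning controllers (e.g., column caching or UCP-style schemes) differ in their specifics.

    class PartitionedSet:
        # One set of a shared LLC with way partitioning enforced by a
        # modified LRU replacement policy (illustrative sketch).
        def __init__(self, assoc, quota):
            self.assoc = assoc
            self.quota = quota                 # ways allotted to each core id
            self.blocks = []                   # (tag, core) pairs, MRU first

        def occupancy(self, core):
            return sum(1 for _, c in self.blocks if c == core)

        def access(self, tag, core):
            for i, (t, _) in enumerate(self.blocks):
                if t == tag:                   # hit: promote to MRU
                    self.blocks.insert(0, self.blocks.pop(i))
                    return True
            # Miss: choose a victim so that no core exceeds its quota.
            if len(self.blocks) >= self.assoc:
                if self.occupancy(core) >= self.quota[core]:
                    victims = [i for i, (_, c) in enumerate(self.blocks) if c == core]
                else:                          # evict from an over-quota core
                    victims = [i for i, (_, c) in enumerate(self.blocks)
                               if self.occupancy(c) > self.quota[c]] or \
                              list(range(len(self.blocks)))
                self.blocks.pop(victims[-1])   # LRU candidate is nearest the tail
            self.blocks.insert(0, (tag, core))
            return False

    s = PartitionedSet(assoc=4, quota={0: 3, 1: 1})
    for tag, core in [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1)]:
        s.access(tag, core)
    print(s.blocks)   # core 1 never holds more than its single allotted way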

For CMPs with a distributed shared LLC, either an entire way or an entire set is located in one cache bank, allowing the cache to be partitioned on either a way or a set basis. Due to the OS modification requirement of set partitioning, hardware-based way partitioning is more widely used.

Previous shared cache partitioning typically used private partitioning. Qureshi and Patt [55] developed dynamic utility-based cache partitioning (UCP), which used an online monitor to track the cache misses for all possible numbers of ways assigned to each core. Greedy and refined heuristics determined the cores' quotas. Varadarajan et al. [68] dynamically partitioned the cache into small direct-mapped cache partitions that were privately assigned to the cores; the cache partitions had configurable size, block size, and associativity. Kim et al. [43] developed cache partitioning for fairness optimization using static and dynamic partitioning. Static cache partitioning used the stack distance profile of cache accesses to determine the cores' requirements. Dynamic cache partitioning monitored the cache misses and increased/decreased the cores' quotas in accordance with the miss rate changes between evaluation intervals. Suh et al. [65] partitioned the cache and developed a greedy dynamic partitioning method to allocate each partition to the cores for exclusive use.

Although Suh et al.'s method allowed partitions to be shared across multiple cores when the number of cores exceeded the number of partitions, the equation used to estimate the number of misses in the shared partitions did not consider all aspects. For example, the equation assumed that, with a random replacement policy, the number of replacements in a core's quota was proportional to the quota's size, which did not consider the differences between private and shared utilization.

Private LLCs also benefit from cache partitioning, where the cores' private caches are partitioned to be partially/fully shared with other cores. In CloudCache [45], the private caches were partitioned and a core could share the private caches of nearby cores (limiting the cores to only sharing nearby caches restricted the additional access latency). MorphCache [62] partitioned the level two and level three caches and allowed subsets of cores' private caches to be merged and fully shared by the subset. Huh et al. [36] subsetted the cores, partitioned the cache evenly, and each partition was fully shared by a subset. Dybdahl and Stenstrom [17] developed an adaptive cache partitioning method in which a core's private cache could be partially shared among all cores. An adaptive spill-receive caching method [56] classified private caches as spiller or receiver caches, where the spiller caches could store evicted blocks into receiver caches.

Although private LLC partitioning enabled a core to share other cores' quotas, only two kinds of constrained partial sharing were provided: 1) subset sharing, in which the cores were subsetted and the cores' quotas were fully shared by the subset [36][45][62]; and 2) joint sharing, in which partial sharing allowed a portion/all of a core's quota to be shared by all cores [17] (i.e., the cores still retained some private portion of the quota).

As compared to partially sharing a core's quota with all cores, CaPPS is more flexible than these works by enabling a portion of a core's quota to be shared with any subset of cores, and thus the constrained partial sharing design spaces are subsets of CaPPS's design space.

To facilitate fast design space exploration, we develop an offline analytical model to quickly estimate cache miss rates for all configurations in CaPPS. Prior works on analytical modeling to determine cache miss rates targeted only fully shared caches; therefore, our proposed analytical model can leverage the fundamentals established in these prior works. Chandra et al. [12] proposed a model using access traces for isolated threads to predict inter-thread contention for a shared cache. Reuse distance profiles were analyzed to predict the extra cache misses for each thread due to cache sharing, but the model did not consider the interaction between cycles-per-instruction (CPI) variations and cache contention. Eklov et al. [21] proposed a simpler model that calculated the CPI considering the cache misses caused by contention by predicting the reuse distance distribution of an application when co-executed with other applications, based on the isolated reuse distance distribution of each application. Chen and Aamodt [13] proposed a Markov model to estimate the cache miss rates for multi-threaded applications with inter-thread communication.

Since CaPPS allows a core's quota to be partially shared with a subset of cores, as compared to fully shared by all cores as in prior works, analytically predicting the cache miss rates is more challenging. In CaPPS, the time-ordered interleaving of the cores' accesses to the LLC must be considered, since only the LLC accesses that access partially shared ways affect a core's miss rate. However, since the analytical model executes offline, statically determining, quantifying, and predicting the dynamic effects significantly complicates miss rate estimations.


CHAPTER 3
T-SPACS: A TWO-LEVEL SINGLE-PASS CACHE SIMULATION METHODOLOGY

All previous single-pass trace-driven methods, to the best of our knowledge, only simulate single-level caches. We present, for the first time, a two-level single-pass trace-driven cache simulation methodology (T-SPaCS) for exclusive instruction and data caches for offline static cache tuning. Designers can employ T-SPaCS a priori at design time to evaluate all cache configurations for a particular embedded application, and then determine the optimal (lowest-energy) cache configuration to be used during execution time. We leverage an exclusive cache hierarchy to limit area and processing overheads and to enable the level one cache (L1) and level two cache (L2) to be logically analyzed as one single cache, followed by a supplementary processing step to extract the exclusive L2 contents. Our proposed methodology determines the optimal cache configuration with high simulation speedup and low storage requirements compared to iterative simulation.

Two-Level Cache Characteristics

As described in the introduction, two of the major challenges in two-level single-pass cache simulation are the storage and simulation time required to process each filtered access trace for the L2 cache. In this section, we motivate our selection of an exclusive cache hierarchy, as opposed to an inclusive cache hierarchy, to address these challenges.

In an inclusive hierarchy with the least recently used (LRU) replacement policy for both L1 and L2, each cache level contains a subset of the contents of the lower level caches (closer to the memory). On an L1 miss and an L2 hit, the cache block is copied from L2 to L1. If both L1 and L2 miss, the cache block is copied to both L1 and L2.


The evicted LRU blocks are discarded if the blocks are not dirty (assuming the data cache uses write-allocate and write-back policies); otherwise, the evicted blocks are written back to main memory. In an exclusive hierarchy [71] with LRU for L1 and first-in first-out (FIFO)-like replacement for L2 (the exclusive hierarchy complicates L2 evictions, making the process similar to FIFO), each cache level's contents are disjoint from the contents of all other cache levels. On an L1 miss and an L2 hit, the cache block is moved from L2 to L1, the evicted LRU L1 block is moved to L2, and the evicted oldest block from L2 is discarded if the block is not dirty. When L1 and L2 both miss, the missed block is only fetched into L1 from main memory. This lack of replication across L1 and L2 provides an opportunity to logically view L1 and L2 as one combined cache, whose analysis can be processed based solely on the complete access trace. Deriving the L2 miss rate using this combined analysis eliminates the need to store and process each filtered L2 access trace and alters the basic stack-based algorithm processing.

To exemplify the reduced storage requirements afforded by the exclusive hierarchy, Figure 3-1 depicts the stack-based algorithm's cache layout view (dotted boxes) and storage requirements for a two-level cache with inclusive (A) and exclusive (B) hierarchies. More specifically, for the inclusive hierarchy, each cache is processed separately. The complete access trace is recorded in the L1 stack and, for each L1 configuration, the filtered access trace is recorded in an L2 stack (one distinct L2 stack is required for each L1 configuration). Each L2 stack is processed separately using the same process as for single-level cache simulation [70]. In the exclusive hierarchy, only one stack is required since L1 and L2 are treated as one combined cache (denoted using the dotted boxes) and are evaluated simultaneously.
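To make the exclusive move semantics described above concrete, the following sketch simulates one access to a single cache set in an exclusive L1/L2 pair with LRU L1 and FIFO-like L2. This is a minimal illustration under our own naming, not T-SPaCS itself.

    from collections import deque

    class ExclusiveTwoLevelSet:
        """One cache set of an exclusive L1/L2 hierarchy (illustrative sketch)."""
        def __init__(self, l1_ways, l2_ways):
            self.l1 = []            # MRU-first list of block addresses (LRU at the end)
            self.l2 = deque()       # FIFO queue of block addresses (oldest at the left)
            self.l1_ways, self.l2_ways = l1_ways, l2_ways

        def access(self, block):
            if block in self.l1:                    # L1 hit: refresh LRU order
                self.l1.remove(block)
                self.l1.insert(0, block)
                return "L1 hit"
            hit_l2 = block in self.l2
            if hit_l2:                              # L1 miss, L2 hit: move block up to L1
                self.l2.remove(block)
            self.l1.insert(0, block)                # on both misses, fetch into L1 only
            if len(self.l1) > self.l1_ways:
                victim = self.l1.pop()              # evicted LRU L1 block goes to L2
                self.l2.append(victim)
                if len(self.l2) > self.l2_ways:
                    self.l2.popleft()               # oldest L2 block is discarded
            return "L2 hit" if hit_l2 else "L2 miss"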


Figure 3-1. Storage requirements for the stack-based algorithm for a two-level cache with (A) an inclusive hierarchy, where L1 and L2 are processed separately, and (B) an exclusive hierarchy, where L1 and L2 are treated as one combined cache, denoted using the dotted box.

This difference in stack processing has a large impact on the storage and time complexity. The inclusive cache hierarchy requires one L1 stack and c L2 stacks, where c is the number of L1 configurations. The L1 stack has a storage complexity of O(n), where n is the number of unique addresses in the access trace. Each of the c L2 stacks has the same storage complexity of O(n), assuming the worst case where all L1 accesses are misses. The exclusive cache hierarchy requires only one L1 stack to generate both L1 and L2 results. Therefore, the storage and time complexities for two-level inclusive and exclusive caches are O((c + 1) n) and O(n), respectively. The lightweight storage and time complexities of the exclusive cache are important for T-SPaCS since the single-pass simulation feature makes T-SPaCS amenable to hardware implementation in dynamic cache tuning without runtime system intrusion [28]. Additionally, exclusive caches are widely applied in modern commercial processors, such as the AMD Athlon and Duron processors [1] and the ARM Cortex-A8 and Cortex-A9 processors [4].


Zheng et al. [73] evaluated exclusive cache performance and concluded that the exclusive cache is suitable for embedded systems with limited on-chip cache area, since an exclusive hierarchy can provide a larger effective cache size than an inclusive hierarchy. The tradeoff for the exclusive cache's reduced storage and time complexities is a design space reduction, since an exclusive hierarchy requires the L1 and L2 block sizes to be equal. However, previous work [29] showed that for a large design space, several cache configurations offer nearly equal energy and performance; thus, this restriction has a nominal effect on cache tuning.

Two-Level Single-Pass Cache Simulation Methodology: T-SPaCS

T-SPaCS is suitable for a highly configurable two-level exclusive cache hierarchy, simultaneously evaluating cache configurations with varying size, block size, and associativity. T-SPaCS's output is the miss rates for all cache configurations. When combining the miss rates with a performance and energy model [30], a system designer can determine an appropriate cache configuration based on the application requirements.

T-SPaCS evaluates (determines a cache hit or miss) a trace address for a particular cache configuration by locating the conflicts. T-SPaCS's goal is to simultaneously determine the conflicts with each trace address in L1 and L2 for all cache configurations in the design space. Figure 3-2 illustrates T-SPaCS's functional overview. The application is executed once to produce the time-ordered sequence of accessed addresses, which is denoted by the vector T; T[t] is an arbitrary address in T.


The corresponding block address A is calculated by A = T[t] >> b, where >> is a bitwise right shift operator and B = 2^b is the cache block size. During T-SPaCS's simulation, the time-ordered sequence of unique block addresses that map to the same set with index i for the minimum number of sets S1_min (without loss of generality, we assume S1_min < S2_min) is recorded into one stack structure K_i for every cache block size B. Thus, the number of required stack structures for a particular B is equal to S1_min (determined by the minimum L1 cache size and the maximum L1 associativity). In the set of a particular B's stacks, we denote each stack's contents as the vector K_i; K_i[m] is one element in K_i representing a uniquely accessed block address (counting starts from the stack's top, thus K_i[1] is the stack top and represents the address of the most recently accessed cache block that maps to the set with index i for S1_min and B). During T-SPaCS's processing for a particular B, the cache configurations with different numbers of sets S are simulated sequentially using the corresponding set of stacks. Since T-SPaCS's processing of each T[t] for each B is the same, we limit our discussion in the remainder of this chapter to T-SPaCS's processing of one arbitrary trace address T for one arbitrary cache block size B (B in [B_min, B_max], where B_min and B_max represent the minimum and maximum block size values, respectively).

T-SPaCS processes each trace address T for the set of stack structures for each B using four steps: stack processing, L1 analysis, L2 analysis, and stack update, as shown in Figure 3-2. For a trace address T (whose block address is A), the set index i is determined from A and S1_min (i = A mod S1_min) and locates the stack structure K_i that stores A's conflicts for all possible numbers of sets (S1 in [S1_min, S1_max] and S2 in [S2_min, S2_max], where S1_min/S1_max and S2_min/S2_max represent the minimum and maximum numbers of cache sets in L1 and L2, respectively).
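The address-to-stack mapping can be sketched directly from the notation reconstructed above (T[t], A, b, S1_min); the names and the example sizes below are illustrative assumptions, not part of T-SPaCS.

    def block_address(t, b):
        """Block address A = T[t] >> b, where B = 2**b is the cache block size."""
        return t >> b

    def stack_index(a, s1_min):
        """The set index under the minimum number of sets selects the stack K_i."""
        return a % s1_min

    # One set of stacks per block size B; assuming a 2 KB minimum L1 cache and
    # 4-way maximum associativity (the design space used later in this chapter),
    # S1_min is 32, 16, and 8 sets for B = 16, 32, and 64 bytes, respectively.
    stacks = {b: [[] for _ in range(2048 // ((1 << b) * 4))] for b in (4, 5, 6)}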


Figure 3-2. T-SPaCS's functional overview: Executing the application generates the time-ordered access trace, which is sequentially fed into a different set of stacks based on the trace address's set index for S1_min under each B. T-SPaCS's processing for each trace address consists of four steps (encompassed by the dotted box). After processing the entire access trace, T-SPaCS produces the accumulated number of L1 and L2 misses for all cache configurations.

Stack processing scans the current stack structure K_i to determine whether the block was recorded (a cache line with that block address has already been fetched). If there exists h satisfying K_i[h] = A, the block that A maps to was accessed previously and the most recent access was recorded in the stack as K_i[h]. For all S, stack processing scans the stack from K_i[1] to K_i[h] to evaluate the conflicts with A. The conflicts with A for a particular S are denoted by c_S(A), and their collection is represented by {C_S(A)}. We note that conflict evaluation is trivial for S1_min, since all the stack addresses in K_i conflict with A for S1_min, and thus the conflict evaluation for S1_min simply records the stack addresses in K_i into {C_S1_min(A)}.


After identifying these conflicts, L1 analysis directly determines T to be an L1 hit/miss for the L1 configurations based on the number of conflicts relative to each L1 associativity W1. If there is an L1 miss, L2 analysis is required. When the particular S1 and any S2 (S2 in [S2_min, S2_max]) are combined to form one two-level cache configuration, L2 analysis categorizes the evaluated conflicts of the combined cache as either L1 or L2 conflicts. Similarly, the number of conflicts in L2 dictates an L2 hit/miss. After stack processing, L1 analysis, and L2 analysis (if necessary) for all cache configurations, the stack update removes K_i[h] from K_i if h exists and then pushes A on the top of K_i. If there is no K_i[h] = A in K_i, the stack update directly pushes A on the top of K_i. After the entire trace T is processed, T-SPaCS accumulates the number of L1 and L2 misses for all two-level cache configurations.
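The four steps above can be summarized as a per-address loop. The sketch below is a self-contained illustration restricted to the S1 = S2 compare-exclude scenario (the other scenarios are handled by Equations (2-1) and (2-2) later in this section); the function and counter names are ours, not the dissertation's.

    def process_trace_address(A, K_i, l1_configs, l2_configs, misses):
        """One T-SPaCS iteration for block address A (sketch; S1 = S2 only).
        l1_configs/l2_configs: lists of (sets, ways); misses: dict of counters."""
        h = K_i.index(A) if A in K_i else len(K_i)       # depth of previous occurrence
        new_block = (h == len(K_i))

        # Step 1: stack processing -- conflicts with A for a given number of sets S
        def conflicts(S):
            return [x for x in K_i[:h] if x % S == A % S]

        for (S1, W1) in l1_configs:
            C_S1 = conflicts(S1)
            # Step 2: L1 analysis -- hit if fewer than W1 conflicts precede A
            if not new_block and len(C_S1) < W1:
                continue
            misses[("L1", S1, W1)] = misses.get(("L1", S1, W1), 0) + 1
            # Step 3: L2 analysis (S2 = S1) -- the first W1 conflicts reside in L1;
            # the remaining conflicts form the exclusive L2 conflicts C_L2
            for (S2, W2) in l2_configs:
                if S2 != S1:
                    continue
                C_L2 = C_S1[W1:]
                if new_block or len(C_L2) >= W2:
                    key = ("L2", S1, W1, S2, W2)
                    misses[key] = misses.get(key, 0) + 1

        # Step 4: stack update -- move A to (or insert A at) the stack top
        if not new_block:
            K_i.remove(A)
        K_i.insert(0, A)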


To simulate a data cache that uses the write-back policy, the cache block's dirty status must be considered. We track the number of write-backs using a write-avoid counter [66]. In a write-back cache, not all cache writes result in write-backs to main memory. For example, if a cache block is written X times, X - 1 write-backs to main memory are avoided, since all writes to the cache block are coalesced and are only written back to memory once when the cache block is evicted. Therefore, during T-SPaCS's data cache processing, if writing A results in an L1/L2 hit and the stack-searched address K_i[h] is dirty, this write is avoided and the write-avoid counter is incremented. The number of write-backs is equal to the total number of writes minus the number of write-avoids. We use a bit array attached to each stack address, in which each bit indicates whether the address is dirty for each cache configuration. The bit array's size is equal to the number of cache configurations with the same B in the design space. The dirty status of a stack address is maintained in the stack update step as follows: writing sets the newly inserted A as dirty; if reading results in an L2 miss, A is set as clean, since an L2 miss implies fetching from main memory, which is always clean; and if reading results in an L1/L2 hit, the dirty status of A is dictated by the dirty status of the removed K_i[h].

We note that although T-SPaCS uses several stack structures to record block-size-specific cache access patterns to simplify conflict determination, the storage complexities are similar to those analyzed in the previous section. Since one cache block encapsulates multiple addresses, the combined storage space for all stacks is similar to the storage space required by one stack. The remainder of this section presents T-SPaCS's detailed operations. Since L1 analysis uses a conventional stack-based algorithm for simulating set-associative caches [34] [70], we refer the reader to Chapter 2 for the L1 analysis description and directly extend the stack-based algorithm to L2 analysis. We then discuss acceleration strategies to assist stack processing.

Stack-Based Two-Level Cache Analysis

When using an exclusive hierarchy, L1 and L2 can be treated as one combined cache. The conflict evaluation for the combined cache is processed using the L2 configurations in order to extract the L2 conflicts. We represent the conflicts for the combined cache as {C_S2(A)} (for cache configurations with S2 sets). {C_S2(A)} can be identified using conflict evaluation from K_i[1] to K_i[h] in stack processing. {C_S2(A)} consists of the conflicts in both L1 and L2. Since {C_S2(A)} contains the inclusive L2 conflicts, exclusion requires the removal of the L1 conflicts from {C_S2(A)} to isolate the exclusive L2 conflicts, whose collection is denoted by {C_L2}.


The number of L2 conflicts |C_L2| determines the minimum L2 associativity necessary for T to be an L2 hit. Due to the LRU replacement policy for L1 and the FIFO policy for L2, stack processing orders the conflicts in the conflict collections {C_S1(A)} and {C_S2(A)} in MRU (most recently used) time order to facilitate L1 and L2 analysis. When T results in an L1 conflict miss for a particular L1 configuration (S1, W1, B), the first W1 conflicts in {C_S1(A)}, denoted {C_S1,W1(A)}, are the blocks present in L1. The subsequent L2 analysis evaluates all possible L2 configurations (with the same B as L1). For each S2 (S2 in [S2_min, S2_max]), the L2 analysis consists of three steps: 1) stack processing to determine {C_S2(A)}; 2) removing {C_S1,W1(A)} (effectively removing the L1 blocks) from {C_S2(A)} to deduce {C_L2}, which we refer to as the compare-exclude operation; and 3) evaluating T as an L2 hit/miss for each set associativity W2 based on |C_L2|.

Figure 3-3. Cache addressing formats for the compare-exclude operation scenarios: (A) S1 = S2 (k = l), (B) S1 < S2 (k < l), and (C) S1 > S2 (k > l), in which k and l represent the number of L1 and L2 index bits, respectively.


The compare-exclude operation is divided into three scenarios based on three different inclusion relationships between {C_S1,W1(A)} and {C_S2(A)}. Figure 3-3 depicts the cache addressing formats of L1 and L2, in which k and l represent the number of L1 and L2 index bits, respectively. The three compare-exclude scenarios are: (A) the number of L1 and L2 sets are equal (S1 = S2, and {C_S1,W1(A)} comprises the first W1 conflicts in {C_S2(A)}); (B) the number of L1 sets is less than the number of L2 sets (S1 < S2, and {C_S2(A)} contains only some of the conflicts in {C_S1,W1(A)}); and (C) the number of L2 sets is less than the number of L1 sets (S1 > S2, and {C_S1,W1(A)} is a subset of {C_S2(A)}).

Compare-Exclude Scenario: S1 = S2

For L1 and L2 configurations with the same B and S values (Figure 3-3 (A)), {C_S2(A)} is the same as {C_S1(A)}. Thus, stack processing for {C_S2(A)} is not necessary, and {C_S1(A)} can be directly applied to deduce {C_L2}. In this scenario, the conflicts in {C_S1(A)} are divided into two categories: the first W1 conflicts are present in L1, and the remaining conflicts form {C_L2}. The condition W2 > |C_L2| indicates that K_i[h] has not been evicted from L2 by the subsequently accessed blocks, and T results in an L2 hit. For example, if |C_S1(A)| = 5 and W1 = 2, the first two conflicts in {C_S1(A)} are present in L1 and the remaining conflicts compose {C_L2}. Therefore, |C_L2| = 3, and the L2 configurations with associativities greater than 3 result in an L2 hit.

Compare-Exclude Scenario: S1 < S2

As depicted in Figure 3-3 (B), the number of L1 index bits is less than the number of L2 index bits in the S1 < S2 scenario. The cache blocks evicted from one L1 set map to multiple (a power of two) L2 sets. We refer to these multiple L2 sets as the affinity group associated with one L1 set; the number of L2 sets in the affinity group is equal to S2/S1.


The cache indexes for addressing one L1 set and the multiple L2 sets in its affinity group retain the following relationship: the least significant k index bits in L1 and L2 are equal, and the L2 index's most significant (l - k) bits take all values from all 0s incremented to all 1s. Since the stack processing for {C_S2(A)} begins from the stack's top, {C_S2(A)}, determined by the L2 index of A, contains some conflicts that are still present in L1. The collection of these conflicts is the subset of {C_S1,W1(A)} with the same L2 index as A, and thereby can be determined by the intersection of {C_S2(A)} and {C_S1,W1(A)}. After removing these intersecting conflicts, the remaining conflicts in {C_S2(A)} are the L2 conflicts {C_L2}:

{C_L2} = {C_S2(A)} − ({C_S2(A)} ∩ {C_S1,W1(A)})     (2-1)

However, Figure 3-4 illustrates a special case that must be considered in this scenario. Figure 3-4 (A) shows a time-ordered access trace segment from T[ti] to T[ti+8] to indicate the time order of access addresses. We assume the following: T[ti] = T[ti+8]; T[ti+3] = T[ti+7]; and the other addresses map to different unique cache blocks. Among these blocks, T[ti] maps to the same cache set as T[ti+1], T[ti+2], T[ti+3], and T[ti+4] under both S1 and S2, while T[ti] maps to the same cache set as T[ti+5] and T[ti+6] under S1 but not S2. For simplification, we represent T[ti] to T[ti+8] with the block addresses z, x4, x3, x2, x1, y1, y2, x2, and z, respectively, as shown in Figure 3-4 (A). For W1 = 2 and W2 = 4, Figure 3-4 (B) shows the L1 and L2 set contents at the time points ti+5, ti+6, and ti+7, and Figure 3-4 (C) shows the stack contents before ti+8. According to the cache set contents, accessing T[ti+8] results in L1 and L2 misses. Stack processing for T[ti+8] produces the conflicts {C_S1(A)} = {x2, y2, y1, x1, x3, x4} and {C_S2(A)} = {x2, x1, x3, x4}.


The compare-exclude operation produces the conflicts {C_L2} = {x1, x3, x4}. Since |C_L2| = 3 and W2 = 4, T[ti+8] is incorrectly classified as a hit.

Figure 3-4. Special case when S1 < S2, where fetching from L2 results in an occupied blank (BLK): (A) access trace, (B) cache contents in time order, and (C) stack contents before ti+8.

To explain this incorrect classification, we note that accessing T[ti+7] moves x2 from L2 to L1, leaving an empty way in L2, which we call an occupied blank (BLK), as shown in Figure 3-4 (B). The occupied blank occurs because, at ti+7, y1 was evicted from L1 to accommodate x2, but y1 maps to a different L2 set than the set that z maps to. The occupied blank means that x2 was in L2 and was involved in evicting z from L2 (at ti+6); thus, x2 should be counted as a conflict for T[ti+8] in {C_L2}. In order to account for the occupied blank, occupied blank labeling is a supplemental process that labels occupied blanks using a bit array associated with each stack address. The bit array's size is equal to the number of cache configurations with the same B in the design space. A set bit indicates that an occupied blank follows the labeled block in the corresponding cache configuration. The compare-exclude operation is augmented to include blank label examination. If the label associated with the last conflict in {C_L2} is set, there is an occupied blank behind the last conflict, which means the last conflict is the LRU block present in the L2 set and the block that A maps to has already been evicted. As a result, T is classified as an L2 miss even though |C_L2| < W2 in this case.
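The augmented hit/miss decision is small enough to show directly; the sketch below assumes a blank_label() lookup for the per-configuration bit array described above (the names are ours).

    def l2_hit_with_blank_check(C_L2, W2, blank_label):
        """Augmented L2 hit test for the S1 < S2 scenario. blank_label(x) returns
        True if an occupied blank was recorded behind block x for this configuration."""
        if len(C_L2) >= W2:
            return False                  # enough younger blocks: A was evicted from L2
        if C_L2 and blank_label(C_L2[-1]):
            return False                  # blank behind the last conflict: A already evicted
        return True

Applied to the example of Figure 3-4, C_L2 = [x1, x3, x4] with W2 = 4 and the blank label set on x4 correctly yields an L2 miss for T[ti+8].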


The occupied blank is introduced when there is an L2 hit, such as in the example described in Figure 3-4 (B) (i.e., the L2 hit for T[ti+7] results in a BLK by fetching x2 into L1). Therefore, occupied blank labeling must proceed whenever the processed address results in an L2 hit. We refer to the previously fetched block that maps to A as K_i[h], following our previous notation. There are two cases for occupied blank labeling: 1) when K_i[h] is the W2-th block (in MRU order) in the L2 set (i.e., |C_L2| = W2 - 1), the last conflict in {C_L2} is labeled as an occupied blank, since K_i[h] will be fetched into L1 after accessing T; and 2) when K_i[h] is not the W2-th block in the L2 set (i.e., |C_L2| < W2 - 1), stack processing must continue past K_i[h] to locate and label the W2-th block in the L2 set. The example shown in Figure 3-4 (B) follows the second case: since x2 is not the W2-th block (in MRU order) in L2 at ti+6, stack processing continues and labels x4 to indicate a BLK behind x4 when processing T[ti+7].

Compare-Exclude Scenario: S1 > S2

In the S1 > S2 scenario (Figure 3-3 (C)), blocks evicted from multiple L1 sets map to one L2 set. Similarly to the S1 < S2 scenario, one L2 set corresponds to an affinity group in L1, and the number of L1 sets in the affinity group is equal to S1/S2. The L2 set that A maps to has an affinity group consisting of multiple L1 sets, and A maps to one of these L1 sets. We denote the L1 sets in the affinity group, excluding the set that A maps to, as the complementary sets of A. The conflicts associated with one of the complementary sets are denoted by p_j, where the subscript j ∈ [1, S1/S2 − 1] differentiates between each of the complementary sets.


The collection of conflicts for the j-th complementary set is denoted by {P_j}, whose cardinality is denoted by |P_j|. In Figure 3-3 (C), A's address uses k bits for the L1 index and l bits for the L2 index (k > l). The complementary sets' indexes in L1 can be composed by joining the least significant l bits with each combination of 0s and 1s for the most significant (k − l) bits, excluding the combination associated with A's L1 index. For example, if A's indexes are 101101 for L1 and 1101 for L2, the collection of {P_j} for all j will include all conflicts associated with {001101, 011101, 111101}. The conflicts in {C_S2(A)} that are still present in L1 include both the L1 conflicts in the set that A maps to ({C_S1,W1(A)}) and the L1 conflicts associated with the complementary sets ({P_j,W1}). Stack processing determines these additional conflicts by simply considering the complementary sets' indexes. Therefore, the compare-exclude operation in this scenario produces:

{C_L2} = {C_S2(A)} − {C_S1,W1(A)} − (∪_j {P_j,W1}), j = 1, ..., S1/S2 − 1     (2-2)

where {P_j,W1} represents the first W1 conflicts (in MRU order) in {P_j}.
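To make the complementary-set construction and Equation (2-2) concrete, the sketch below enumerates the complementary L1 set indexes arithmetically (equivalent to the bit-joining description above) and performs the S1 > S2 compare-exclude; the function names are ours, not the dissertation's.

    def complementary_l1_indexes(a, S1, S2):
        """L1 set indexes in the affinity group of A's L2 set, excluding A's own
        L1 set (S1 > S2): all L1 indexes that share A's L2 index."""
        own = a % S1
        return [idx for idx in range(a % S2, S1, S2) if idx != own]

    def compare_exclude_s1_gt_s2(C_S2, C_S1, P, W1):
        """Equation (2-2): remove the blocks still resident in L1 -- the first W1
        conflicts of A's own L1 set and of each complementary set P_j -- from C_S2."""
        resident = set(C_S1[:W1])
        for P_j in P:
            resident.update(P_j[:W1])
        return [x for x in C_S2 if x not in resident]

For instance, with S1 = 64 and S2 = 16, passing a block address whose L1 index is 0b101101 returns the indexes [13, 29, 61], i.e., 001101, 011101, and 111101, matching the example above.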


Acceleration Strategies

During T-SPaCS's processing, stack processing is the most time-consuming operation. For every T in the access trace, stack processing repeatedly evaluates all stack addresses K_i[m] (m ∈ [1, h)) for conflicts with every number of cache sets in the design space. If |S| denotes the total number of configurable S1 and S2 values in the design space, the conflict evaluation's complexity is O(|S|) (without considering the complementary sets) for each K_i[m], unless T results in L1 hits for all L1 configurations (in which case, the complexity is O(|S1|), where |S1| is the total number of configurable S1 values in the L1 design space only). To reduce the conflict evaluation runtime, stack processing can be accelerated using the set refinement property [34], which can be leveraged by processing S from the smallest to the largest value: a stack address K_i[m] is evaluated for conflicts with A for S only if K_i[m] conflicts with A for a smaller S. Alternatively, processing from the largest to the smallest S can also leverage the set refinement property. However, since most stack addresses are not conflicts even for small S in a typical application, starting from the smallest S eliminates more of the trivial conflict evaluations. This acceleration strategy can be applied to both stack processing steps: 1) determining all conflicts {C_S1(A)}, {C_S2(A)}, and {P_j} for all possible S1 and S2 using a tree data structure; and 2) occupied blank labeling using an array data structure.

Tree-assisted acceleration

When processing T for an arbitrary B with stack K_i on an L1 miss, the compare-exclude operation compares the conflicts in {C_S1(A)} for each L1 configuration with the conflicts in {C_S2(A)} for all possible L2 configurations. An efficient method to determine these conflicts is to determine the conflicts for all possible S initially and store these conflicts in a tree structure for later reference. We note that this data structure is not a traditional tree structure, but a hierarchical representation that we refer to as a tree for simplicity. The tree structure stores A's conflicts and the conflicts associated with the complementary sets for all S with the same B. Each tree level corresponds to a different S, with S increasing from root to leaf (higher level to lower level) by powers of two. Tree nodes store the conflict information, and the maximum L1 and L2 associativities dictate the maximum number of conflicts stored at each node (the per-node conflict storage).


Every conflict is represented by the conflict's block address and a pointer to the block's stack location, which assists in occupied blank labeling, since the blank labels (linked using the stack address) of the recorded conflicts are examined to correct the compare-exclude results in the S1 < S2 scenario. We accelerate stack processing by determining all conflicts for all S simultaneously. We denote all S from the minimum (S1_min) to the maximum (S2_max) with a subscript r, where r is an integer satisfying r ∈ [1, log2(S2_max/S1_min) + 1], such that the r-th tree level corresponds to S_r and the number of tree levels is log2(S2_max/S1_min) + 1. Due to the set refinement property, evaluating the conflicts for K_i[m] (m ∈ [1, h)) with A for a particular S_r (r ∈ [2, log2(S2_max/S1_min) + 1]) depends on whether K_i[m] conflicts with A for S_(r-1). When K_i[m] is A's conflict for S_(r-1), K_i[m] will be A's conflict for S_r on the condition that the most significant bit in the indexes of both A and K_i[m] under S_r is the same. On the contrary, if K_i[m] is not A's conflict for S_(r-1), the indexes of A and K_i[m] under all larger S_q (q ∈ [r, log2(S2_max/S1_min) + 1]) are also different. When the combined L1 and L2 configurations satisfy the S1 > S2 scenario, the compare-exclude operation must exclude the conflicts associated with the L1 complementary sets as well as the L1 conflicts. Therefore, for each S1 in the L1 design space that is larger than S2_min, the additional conflicts with each of the complementary sets {P_j} must also be searched and recorded in that S1's tree level as one node per complementary set. The number of nodes at each tree level is dictated by the number of complementary sets required for S_r; more specifically, if S_r is within the L1 design space and S_r > S2_min, the number of nodes in the r-th tree level is S_r/S2_min. When S1 is combined with an S2 other than S2_min, while still complying with the condition that S1 > S2, the conflicts with the complementary sets can be determined by selecting the corresponding nodes in S1's tree level based on the difference between the number of index bits for S1 and S2.


Figure 3-5 provides a sample tree structure for a processed address T with block address A = 100110110110. The configurable numbers of sets in the design space for the given B are bounded by S1_min = 4, S1_max = 64, S2_min = 16, and S2_max = 256. Rectangles correspond to tree nodes, and the values in the nodes correspond to the indexes of the recorded conflicts. For levels S1, S2, S3, S6, and S7, only one node is required in each level to record the conflicts with A. For levels S4 and S5, the complementary sets' conflicts must also be recorded, since S4 and S5 are larger than S2_min (i.e., 32 > 16 and 64 > 16). When S1 = 64 and S2 = 16, the conflicts selected from all four nodes in the fifth level are excluded during the compare-exclude operation. When S1 = 64 and S2 = 32, only the conflicts in the first two nodes are excluded.

Figure 3-5. A sample tree structure for block address A = (1001)10110110, where rectangles correspond to tree nodes and the values in the nodes are the indexes of the recorded conflicts; the levels correspond to S1 = 4 through S7 = 256, and the extra nodes in levels S4 and S5 hold the complementary sets' conflicts.


Figure 3-6 summarizes the tree-assisted stack processing acceleration algorithm. For each T and B, the tree contents are cleared and S_start is initialized to S_1 (line 1). For each stack address K_i[m] (m ∈ [1, h)) (lines 2-19), conflict evaluation determines the conflicts with A or with the complementary sets from S_start to S_max (lines 3-13). If the complementary sets' conflicts are not required for the tree level, K_i[m] is directly evaluated for conflict with A for S by comparing the most significant index bits of A and K_i[m] under S (since S increases by powers of two) (lines 4-6). If K_i[m] conflicts with A for S, K_i[m] is recorded into the node in the corresponding tree level (lines 7-8); otherwise, conflict evaluation for K_i[m] is terminated (conflict evaluations for larger S are not necessary) (lines 9-10). If the complementary sets' conflicts are required in the tree level, K_i[m] is stored into a node in that level based on the most significant log2(S/S2_min) index bits (lines 11-13). In this situation, checking only the most significant log2(S/S2_min) bits requires the condition that K_i[m] conflicts with A for S2_min when processing proceeds to line 12; this condition is guaranteed by the constraint on changing S_start in line 18. Since the required number of conflicts stored in each node is limited by the L1 and L2 associativities, if all nodes in the S_start-th level are full after processing K_i[m] for all S (S ∈ [S_start, S_max]), searching for more conflicts in that level is trivial. Therefore, the value of S_start may change (line 14). If the complementary sets' conflicts are not needed for the level with S_(start+1), S_start is updated to S_(start+1) (lines 15-16); otherwise, conflict evaluation always starts from S2_min (lines 17-18), since the evaluation of the conflicts for S2_min is a prior requirement for determining the complementary sets' conflicts.
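The level-to-level conflict test in lines 4-10 relies on the set refinement property; a minimal standalone sketch of that check, with our own helper names, is shown below.

    def refine_conflict(a, x, S_prev):
        """Given that x conflicts with a for S_prev sets, they also conflict for
        2*S_prev sets iff the newly added index bit (bit log2(S_prev)) matches."""
        return ((a // S_prev) & 1) == ((x // S_prev) & 1)

    def conflict_set_counts(a, x, S_min, S_max):
        """All set counts (powers of two from S_min to S_max) for which block x
        conflicts with block a, stopping at the first level where refinement fails."""
        if a % S_min != x % S_min:
            return []
        levels, S = [S_min], S_min
        while 2 * S <= S_max and refine_conflict(a, x, S):
            S *= 2
            levels.append(S)
        return levels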


Figure 3-6. Tree-assisted stack processing acceleration algorithm:

    Stack processing for each T and B:
    1    -Clear tree contents; S_start = S_1;
    2    for (m = 1; m < h; m++)    // check from K_i[1] to K_i[h]
    3      for (S = S_start; S <= S_max; S *= 2)
    4        if (complementary sets' conflicts are not required in the r-th level)
    5          -evaluate the conflict of K_i[m] for S when S != S1_min;
    6           otherwise, directly record K_i[m] in the 1st level;
    7          if (conflict)
    8            -record into the node in the r-th level;
    9          else
    10           -go to END_m;
    11       else
    12         -check the most significant log2(S/S2_min) index bits
    13          and store into the corresponding node in the r-th level;
    14     if (all nodes in the S_start-th level are full)
    15       if (complementary sets' conflicts are not required in the (start+1)-th level)
    16         -S_start = S_(start+1);
    17       else
    18         -S_start = S2_min;
    19   END_m;

Since only one tree is built and the tree's contents are cleared for every T and B, the storage space required by the tree is minimal as compared to the stack structures.

Array-assisted acceleration

Stack processing for T is limited to evaluating only the stack addresses before K_i[h], except during occupied blank labeling for the S1 < S2 scenario. In this scenario, additional stack processing after K_i[h] is required if K_i[h] is not the W2-th (MRU order) block in the L2 set. Since this additional stack processing may be required for all possible S2, we propose array-assisted acceleration for this additional stack processing. An array, whose size (number of elements) is dictated by the maximum L2 associativity, records the additional conflicts in the stack after K_i[h]. Each element stores the information for one conflict using the same format as the tree structure nodes.


Figure 3-7 depicts the array-assisted stack processing acceleration algorithm. When T results in an L1 miss for a particular L1 configuration, the array's contents are cleared (line 3), and all possible L2 configurations for all combinations of S2 and W2 are analyzed by the compare-exclude operation (lines 4-6). If there is a hit in an L2 configuration and K_i[h] is not the W2-th block in the L2 set, additional stack processing is required to determine and label the W2-th block (line 7). If the array is empty, this L2 configuration is the first configuration to require the W2-th block, and stack processing determines the additional conflicts after K_i[h] for S2 and records these additional conflicts in the array (lines 8-11). If the array is not empty, one of two situations occurs. 1) If the W2-th block is first required for that particular S2 (lines 13-18), the additional conflicts can be obtained by evaluating the array's elements (line 14), since the array already stores the conflicts for the previously processed smaller S2 and thus contains the first several conflicts for the larger S2 (according to the set refinement property). If the conflicts available in the array are not enough to determine the W2-th block for the new S2, additional stack processing continues to evaluate the stack addresses until the W2-th block is determined (lines 15-16), and all additional conflicts for the new S2 replace the array's contents (lines 17-18). 2) If the additional conflicts for that particular S2 were already determined for a previously processed larger W2 and stored in the array, the W2-th block for the smaller W2 can be determined directly in the array (lines 19-20). Since the array's contents are updated for each T with each specific cache configuration, only one array is required, and the storage overhead of array-assisted acceleration is negligible as compared to the stack structures.


Figure 3-7. Array-assisted stack processing acceleration algorithm:

    For processed T with certain B, S1, W1:
    1    -L1 analysis;
    2    if (L1 miss)
    3      -clear the array;
    4      for (S2 = S2_min; S2 <= S2_max; S2 *= 2)
    5        for (W2 = W2_max; W2 >= 1; W2 /= 2)
    6          -L2 analysis;
    7          if (the W2-th block in L2 is required)
    8            if (the array is NULL)
    9              -stack processing for additional conflicts with S2
    10              until the W2-th block is determined;
    11             -store all searched additional conflicts in the array;
    12           else
    13             if (the W2-th block is first required for the certain S2)
    14               -evaluate conflicts in the array for S2;
    15               -continue stack processing to determine the W2-th block
    16                if the conflicts searched in the array are not enough;
    17               -replace the array's contents with all obtained
    18                additional conflicts for S2;
    19             else
    20               -determine the W2-th block in the array directly;

Experimental Results and Analysis

We verified T-SPaCS using fifteen benchmarks from the EEMBC benchmark suite [20], five arbitrarily selected benchmarks from the Powerstone benchmark suite [49], and four arbitrarily selected benchmarks from the MediaBench benchmark suite [46]. We gathered the access trace for each benchmark by modifying sim-cache in SimpleScalar 3.0d [10], and these traces served as input to T-SPaCS. For comparison, we modified the widely used trace-driven cache simulator Dinero IV [18] to simulate two-level exclusive instruction and data caches, respectively, for each benchmark.

We used the same design space for the two-level configurable cache hierarchy as in [30]. The design space consisted of 243 configurations by varying (in increments of powers of 2) the L1 size from 2 to 8 Kbytes, the L2 size from 16 to 64 Kbytes, the L1 and L2 associativities from direct-mapped to 4-way, and the cache block size from 16 to 64 bytes.


We note that we selected this design space for comparison convenience; T-SPaCS itself does not impose any restriction on the configurable cache parameters and is thus valid for any design space. In order to determine T-SPaCS's accuracy and efficiency, we gathered the cache miss rates for all 243 configurations using the modified Dinero, which produces exact results, and using T-SPaCS, and then evaluated the margin of error in T-SPaCS with respect to the exact miss rates and the optimal (lowest-energy) cache configuration.

Miss Rate Accuracy

We compared the miss rates determined by T-SPaCS with the exact miss rates determined by the modified Dinero for each benchmark. The results showed that for both instruction and data caches, T-SPaCS's L1 miss rates were 100% accurate for all configurations, and the L2 miss rates were 100% accurate for 240 out of 243 configurations, which accounts for 99% of the design space. For each benchmark, we calculated the average and standard deviation of the miss rate errors across the three inaccurate cache configurations. For the instruction cache, across all benchmarks, the maximum values for the average and standard deviation of the miss rate errors were 1.16% and 0.64%, respectively. For the data cache, across all benchmarks, the maximum values for the average and standard deviation of the miss rate errors were 0.69% and 0.32%, respectively. Since inaccurate miss rates result in inaccurate write-back rates in the data cache, we also note that the maximum values for the average and standard deviation of the write-back rate errors across all benchmarks were only 0.15% and 0.07%, respectively.

For the three inaccurate configurations, multiple L1 sets in one affinity group corresponded to one L2 set (i.e., the S1 > S2 scenario). In this scenario, the eviction order of blocks from the different L1 sets to the same L2 set does not follow the memory access order, and the blocks in the L2 set are disordered.


When we determine the conflicts of T in L2, only the blocks that are evicted into L2 after K_i[h] affect K_i[h]'s eviction from L2. Since the stack structure only records the latest memory access order, the historical eviction order of the blocks from multiple L1 sets to the same L2 set cannot be obtained from the stack. Therefore, the blocks in {C_L2} generated by the compare-exclude operation are not guaranteed to be the blocks present in the L2 set. However, an inaccurate {C_L2} does not necessarily produce an incorrect cache hit/miss classification, since a cache miss is determined when |C_L2| >= W2. Only when the error in the inaccurate |C_L2| is larger than the difference between W2 and the accurate |C_L2| is the cache hit/miss classification affected. Our experimental results showed that the effect of errors in |C_L2| on miss rate estimation was nominal.

Optimal Cache Configurations

Since low energy/power consumption is a critical optimization for both embedded systems and desktop computers, we evaluated T-SPaCS's ability to determine the optimal (lowest-energy) cache configuration. Extending a two-level inclusive cache hierarchy energy model [30] to include evicted-block write energy, Figure 3-8 depicts the energy model that we used to determine the energy consumption for each cache configuration. We used T-SPaCS and the modified Dinero for a two-level exclusive cache to determine L1_accesses, L1_misses, L2_hits, L2_misses, L1_evicts, and write_backs. We obtained the dynamic cache and memory read/write energies using CACTI 6.5 [11] for 0.09-micron technology, and the cache static energy consumption accounted for 20% of the total cache energy [30]. We assumed the CPU_stall_energy to be 20% [30] of that of a 0.09-micron ARM1156 microprocessor [5], and we estimated bandwidth and latency based on a reasonable system architecture: an L2 fetch is 4 times longer than an L1 fetch; a main memory fetch is 20 times longer than an L2 fetch; and the memory throughput is 50% of the latency [30].


Figure 3-8. Energy model for energy consumption measurement. The model computes total_energy = static_energy + dynamic_energy, where dynamic_energy = L1_dynamic_energy + L2_dynamic_energy + offchip_access_energy + (miss_cycles × CPU_stall_energy). The L1 and L2 dynamic energies are computed from L1_accesses, L2_hits, and L1_evicts and the corresponding per-access energies; offchip_access_energy is computed from L2_misses and write_backs and the memory per-access energies; miss_cycles = L1_miss_cycles + L2_miss_cycles, derived from L1_misses, L2_misses, and the fetch latencies; and static_energy = total_cycles × static_energy_per_cycle, where static_energy_per_cycle is derived from energy_per_Kbyte = (dynamic_energy_of_base_cache × 20%) / base_cache_size_in_Kbytes.

We applied this energy model to the miss rates determined by T-SPaCS and to the exact miss rates determined by the modified Dinero. The calculated results showed that the optimal cache configurations determined by T-SPaCS were exactly the same as those determined by Dinero for all benchmarks. Table 2-1 shows the optimal instruction and data cache configurations for each benchmark. Despite the incorrect miss rates for the three configurations where S1 > S2, the errors were too small to affect the determined optimal cache configurations. Considering that the configurations with S1 > S2 generally occupy a small percentage of the design space (3 out of 243, or 1%, in our experiment), since cache sizes are technically limited such that L1 is smaller than L2, and that the introduced errors do not affect the determined optimal cache configuration, there is no need to eliminate these small miss rate errors, since doing so would significantly increase the simulation time.
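As a concrete illustration, the sketch below computes the energy of one configuration following the structure of Figure 3-8 from the simulator's counters; the per-event energy and latency factor names are our own placeholders (the dissertation derives these values from CACTI 6.5 and the latency ratios stated above), so this is a sketch of the model's shape rather than its exact equations.

    def total_energy(counts, e, t):
        """Energy of one cache configuration (illustrative, per Figure 3-8's structure).
        counts: event counts from the simulator; e: assumed per-event energies;
        t: assumed miss latencies in cycles."""
        l1_dynamic = counts["L1_accesses"] * e["l1_access"]
        l2_dynamic = (counts["L2_hits"] * e["l2_access"]
                      + counts["L1_evicts"] * e["l2_write"])
        offchip = (counts["L2_misses"] * e["mem_read"]
                   + counts["write_backs"] * e["mem_write"])
        miss_cycles = (counts["L1_misses"] * t["l2_fetch"]
                       + counts["L2_misses"] * t["mem_fetch"])
        dynamic = l1_dynamic + l2_dynamic + offchip + miss_cycles * e["cpu_stall"]
        static = counts["total_cycles"] * e["static_per_cycle"]
        return static + dynamic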


Table 2-1. Optimal instruction and data cache configurations. L1 and L2 configurations for instruction caches and data caches (separated by a semicolon) are listed as the total size in Kbytes (kB), the block size in bytes (B), and the associativity in ways (W).

    Benchmark      Optimal instruction cache        Optimal data cache
    bcnt           2kB_64B_1W; 16kB_64B_4W          2kB_64B_1W; 16kB_64B_4W
    bilv           4kB_64B_4W; 16kB_64B_2W          2kB_64B_1W; 16kB_64B_4W
    blit           2kB_64B_4W; 16kB_64B_2W          8kB_32B_4W; 16kB_32B_2W
    brev           4kB_64B_4W; 16kB_64B_4W          2kB_64B_1W; 16kB_64B_1W
    fir            4kB_64B_4W; 16kB_64B_2W          2kB_64B_1W; 16kB_64B_1W
    A2TIME01       8kB_64B_4W; 32kB_64B_4W          4kB_64B_4W; 16kB_64B_1W
    AIFFTR01       2kB_32B_2W; 32kB_32B_4W          8kB_64B_4W; 64kB_64B_2W
    AIFIRF01       4kB_32B_4W; 16kB_32B_4W          2kB_64B_4W; 16kB_64B_1W
    AIIFFT01       2kB_32B_2W; 32kB_32B_4W          8kB_64B_4W; 64kB_64B_4W
    BaseFP01       4kB_64B_4W; 64kB_64B_4W          4kB_64B_4W; 32kB_64B_1W
    BITMNP01       4kB_64B_4W; 16kB_64B_4W          4kB_64B_4W; 16kB_64B_1W
    CACHEB01       8kB_64B_4W; 32kB_64B_4W          8kB_32B_4W; 16kB_32B_4W
    CANRDR01       4kB_64B_4W; 32kB_64B_4W          8kB_64B_4W; 16kB_64B_4W
    IDCTRN01       8kB_64B_4W; 16kB_64B_4W          4kB_64B_4W; 16kB_64B_2W
    IIRFLT01       8kB_32B_1W; 16kB_32B_4W          2kB_64B_4W; 16kB_64B_1W
    PNTRCH01       2kB_32B_1W; 16kB_32B_4W          8kB_64B_4W; 16kB_64B_4W
    PUWMOD01       4kB_64B_4W; 32kB_64B_4W          2kB_64B_4W; 16kB_64B_1W
    RSPEED01       4kB_64B_4W; 32kB_64B_4W          4kB_64B_4W; 16kB_64B_1W
    TBLOOK01       4kB_64B_4W; 64kB_64B_4W          8kB_64B_2W; 16kB_64B_4W
    TTSPRK01       8kB_16B_4W; 16kB_16B_4W          8kB_64B_4W; 16kB_64B_4W
    epic           2kB_32B_2W; 32kB_32B_4W          8kB_64B_2W; 64kB_64B_1W
    jpegencode     8kB_64B_4W; 32kB_64B_4W          8kB_64B_4W; 64kB_64B_1W
    mpeg2decode    4kB_16B_4W; 32kB_16B_4W          8kB_64B_4W; 64kB_64B_4W
    pegwitencode   8kB_32B_1W; 16kB_32B_4W          8kB_16B_2W; 64kB_16B_4W

Figure 3-9 depicts the normalized energy savings of the optimal cache configurations compared to a base cache configuration for each benchmark. The base cache configuration represents a configuration that may be commonly found on a platform intended to run benchmarks similar to those we studied: an 8 Kbyte L1 cache with a 32-byte block size and 4-way set associativity, and a 64 Kbyte L2 cache with a 32-byte block size and 4-way set associativity.


Figure 3-9 shows average and maximum energy savings of 22% and 46%, respectively, for instruction caches, and average and maximum energy savings of 26% and 48%, respectively, for data caches.

Figure 3-9. Energy savings for the optimal instruction and data cache configurations normalized to the base cache configuration.

To further corroborate the significance of two-level cache configuration over single-level cache configuration in embedded systems intended for low power, we compared the energy savings using the two-level optimal cache configurations with the energy savings using the single-level optimal cache configurations. The single-level configurable cache design space consisted of the same L1 configurations as in the two-level configurable cache. We determined the optimal single-level cache configurations using an exhaustive search. Figure 3-10 depicts the energy consumption of the two-level optimal cache configurations normalized to the energy consumption of the single-level optimal cache configurations. The results indicate that for instruction caches, 6 of the 24 benchmarks consumed less energy using a single-level cache as compared to a two-level cache, while the remaining 18 benchmarks showed increased energy savings using two-level caches; for data caches, 16 of the 24 benchmarks showed increased energy savings using two-level caches.


On average, over all benchmarks, the two-level optimal instruction caches consumed 28% less energy than the single-level optimal instruction caches, and the two-level optimal data caches consumed 22% less energy than the single-level optimal data caches.

Figure 3-10. Energy consumption of the two-level optimal cache configurations normalized to the energy consumption of the single-level optimal cache configurations.

Simulation Time Efficiency

To illustrate T-SPaCS's efficiency, we compared the simulation time required for T-SPaCS to simultaneously evaluate all 243 configurations with the simulation time required to sequentially simulate all 243 configurations with the modified Dinero. We tabulated the user time reported by the Linux time command for simulations running on a Red Hat Linux Server v5.2 with a 2.66 GHz processor and 4 gigabytes of RAM. To verify the speedup improvement obtained using our acceleration strategies, we simulated the benchmarks using two T-SPaCS versions: T-SPaCS without acceleration and T-SPaCS with acceleration.


The actual simulation times for evaluating all 243 instruction cache configurations for a single benchmark ranged from 9.2 to 419.5 minutes for Dinero, 41 to 1,409 seconds for T-SPaCS without acceleration, and 24 to 1,108 seconds for T-SPaCS with acceleration. For all simulators, the benchmarks that required the most and least simulation times were blit and BaseFP01, respectively. Figure 3-11 shows the simulation speedups as compared to the modified Dinero for instruction caches. T-SPaCS with acceleration (second bar) achieved maximum and average speedups of 25.42X and 21.02X, respectively, which improved on the speedups of T-SPaCS without acceleration (first bar) by 6.79X and 7.84X, respectively.

Figure 3-11. Simulation time speedup of T-SPaCS without acceleration, T-SPaCS, and simplified-T-SPaCS compared to the modified Dinero for instruction caches.

The actual simulation times for evaluating all 243 data cache configurations for a single benchmark ranged from 14.6 to 804.3 minutes for Dinero, 22 to 1,774 seconds for T-SPaCS without acceleration, and 19 to 1,499 seconds for T-SPaCS with acceleration. For all simulators, the benchmark that required the most simulation time was blit, and the benchmark that required the least simulation time was jpegencode for Dinero and TBLOOK01 for both versions of T-SPaCS. Figure 3-12 shows the simulation speedups for data caches.


T-SPaCS with acceleration (second bar) achieved maximum and average speedups of 46.8X and 33.34X, respectively, which reduced the simulation time by 6.52X and 2.07X as compared to T-SPaCS without acceleration (first bar), respectively. We note that two benchmarks in Figure 3-12 did not show acceleration improvement, which was due to the fact that our acceleration strategies reduce the simulation time using the set refinement property: the conflict determinations are omitted for larger S only if the stack address is not a conflict for a smaller S. Thus, in the rare case that a large number of stack addresses are conflicts for most S in the design space, our acceleration strategy yields no significant speedup, and due to the processing overhead introduced by acceleration, the total simulation time with acceleration can be longer than the simulation time without acceleration.

To avoid the miss rate error introduced in the S1 < S2 scenario, we supplemented the compare-exclude operation with occupied blank labeling for each L2 hit. Experiments revealed that occupied blank labeling accounted for a large portion of T-SPaCS's simulation time, even when leveraging the array-assisted acceleration. Therefore, we evaluated a simplified version of T-SPaCS (simplified-T-SPaCS) by removing occupied blank labeling. The measured simulation times for evaluating all 243 configurations for a single benchmark using simplified-T-SPaCS ranged from 18 (BaseFP01) to 873 (blit) seconds for instruction caches and from 16 (TBLOOK01) to 1,254 (blit) seconds for data caches. Figure 3-11 shows the simulation time speedups obtained by simplified-T-SPaCS for each benchmark (third bar) as compared to the modified Dinero for instruction caches. Simplified-T-SPaCS's maximum and average speedups increased to 33.92X and 30.15X, respectively. Figure 3-12 depicts the simulation time speedups of simplified-T-SPaCS (third bar) for data caches; the maximum and average speedups were 54.71X and 41.31X, respectively.


Figure 3-12. Simulation time speedup of T-SPaCS without acceleration, T-SPaCS, and simplified-T-SPaCS compared to the modified Dinero for data caches.

The tradeoff for the increased simulation speedups of simplified-T-SPaCS is additional L2 miss rate errors for the 228 configurations where S1 < S2. In order to quantify the degradation in miss rate accuracy without occupied blank labeling, we counted the number of occurrences of occupied blanks and the number of inaccurate L2 hit/miss classifications without labeling the occupied blanks. Averaged across all benchmarks, occupied blanks accounted for 92% of the L2 hits (only L2 hits introduce occupied blanks) for instruction caches and 90% of the L2 hits for data caches. However, the average number of L2 hit/miss classifications corrected by occupied blank labeling was only 0.47% of the occurrences of occupied blanks for instruction caches and 0.52% for data caches. We further calculated the average and standard deviation of the miss rate errors across the 228 inaccurate cache configurations for each benchmark. Across all benchmarks, the maximum values of the average and standard deviation of the miss rate errors were 0.71% and 0.90%, respectively, for instruction caches. For data caches, the maximum values of the average and standard deviation of the miss rate errors were 1.02% and 1.65%, respectively.


We also examined the maximum values for the average and standard deviation of the write-back rate errors across all benchmarks, which were 0.13% and 0.14%, respectively. Furthermore, we determined the optimal cache configurations using simplified-T-SPaCS. Results revealed that even with the inaccurate miss rates, simplified-T-SPaCS produced optimal configurations identical to those determined using the exact miss rates for both instruction and data caches. Therefore, simplified-T-SPaCS is an ideal choice for cache configuration due to its competitively fast simulation time and accurate optimal configuration determination. Alternatively, T-SPaCS is suitable in situations that require more accurate cache miss rate estimation (e.g., performance analysis) while still providing simulation speedup.

Comparison with TCaT

To further verify T-SPaCS's and simplified-T-SPaCS's efficiency, we compared them to a state-of-the-art two-level cache tuner, TCaT [27]. TCaT is an efficient heuristic that determines the optimal-energy cache configuration using an interlaced exploration methodology. Since TCaT sequentially simulates the design space using SimpleScalar's [10] sim-cache [27], we modified sim-cache to simulate a two-level exclusive cache. Even though TCaT sequentially simulates the cache configurations, TCaT only simulates 6.5% of the configurations on average and can thus determine the optimal-energy cache configuration quickly. Figure 3-13 shows the simulation time speedup of T-SPaCS (first bar) and simplified-T-SPaCS (second bar) as compared to TCaT for instruction caches. The results indicate that T-SPaCS required more simulation time than TCaT for all 24 benchmarks. For simplified-T-SPaCS, 12 benchmarks required less simulation time than TCaT, with a maximum speedup of 1.22X, and the average simulation time of simplified-T-SPaCS was approximately equal to that of TCaT.


Figure 3-14 shows the simulation time speedup obtained by T-SPaCS (first bar) and simplified-T-SPaCS (second bar) as compared to TCaT for data caches. The results reveal that T-SPaCS simulated 22 benchmarks faster than TCaT, and simplified-T-SPaCS simulated all 24 benchmarks faster than TCaT. The average speedups of T-SPaCS and simplified-T-SPaCS were 1.54X and 1.90X, respectively.

Figure 3-13. Simulation time speedup of T-SPaCS and simplified-T-SPaCS compared to TCaT for instruction caches.

Figure 3-14. Simulation time speedup of T-SPaCS and simplified-T-SPaCS compared to TCaT for data caches.

Even though TCaT is generally faster than T-SPaCS, TCaT is an inexact heuristic; TCaT trades off fast simulation time for reduced accuracy and is unable to determine the optimal-energy cache configuration for all benchmarks.


We refer to the cache configuration determined by TCaT as TCaT's configuration, which may be suboptimal. Figure 3-15 compares the normalized (normalized to the base cache) energy savings between the optimal cache configurations (first bar) and TCaT's configurations (second bar) for instruction caches. For four benchmarks, TCaT's configurations were the same as the optimal cache configurations. TCaT's configurations consumed 24% more energy than the optimal cache configurations in the worst case, and the average degradation in energy savings for TCaT's configurations across all 24 benchmarks was 4%. Figure 3-16 provides the normalized energy savings for the optimal cache configurations (first bar) and TCaT's configurations (second bar) for data caches. TCaT's configurations were the same as the optimal cache configurations for ten benchmarks. However, in the worst case, TCaT's configuration consumed 47% more energy than the optimal cache configuration, and the average degradation in energy savings for TCaT's configurations across all 24 benchmarks was 10%.

Figure 3-15. Comparison of the normalized energy savings between the optimal cache configurations and TCaT's configurations for instruction caches.


Figure 3-16. Comparison of the normalized energy savings between the optimal cache configurations and TCaT's configurations for data caches.

These results suggest that TCaT is less effective than originally reported in [27]. To explain this discrepancy, we point out that TCaT was originally designed for a two-level inclusive cache, and the adaptation to an exclusive cache hierarchy explains the relatively poor performance. Therefore, despite T-SPaCS's generally longer simulation time and simplified-T-SPaCS's similar or slightly better simulation time as compared to TCaT, T-SPaCS and simplified-T-SPaCS estimate the miss rates for all cache configurations in the design space accurately enough to determine the optimal cache configuration.

Summary

In this chapter, we presented T-SPaCS, a two-level single-pass trace-driven cache simulation methodology for exclusive instruction and data cache hierarchies that uses a stack-based algorithm to simulate both the level one and level two caches simultaneously. T-SPaCS reduces the storage and time complexity required for simulating two-level caches as compared to the direct adaptation of existing single-pass cache simulation methods to two-level caches using sequential simulation. T-SPaCS produces 100% accurate results for 99% of the design space, and the average simulation time speedups compared to sequential simulation for instruction and data caches are 21.02X and 33.34X, respectively.

simulation time speedups compared to sequential simulation time for instruction and data caches are 21.02X and 33.34X, respectively. A simplified version of T-SPaCS (simplified T-SPaCS) increases the average simulation speedup to 30.15X for instruction caches and 41.31X for data caches, at the expense of inaccurate miss rates for 95% of the design space. However, even with these miss rate errors, both T-SPaCS and simplified T-SPaCS are still able to accurately determine the optimal energy configuration for all studied benchmarks, thereby facilitating rapid design space exploration for cache configuration.

CHAPTER 4
U-SPACS: A SINGLE-PASS CACHE SIMULATION METHODOLOGY FOR TWO-LEVEL UNIFIED CACHES

In Chapter 3, we presented T-SPaCS for two-level single-pass trace-driven cache simulation of instruction caches and data caches. T-SPaCS leverages an exclusive cache hierarchy's characteristics to simultaneously evaluate both the L1 and L2 caches using only the complete (unfiltered) access trace in a single processing pass, thereby efficiently and accurately determining the optimal cache configuration. However, T-SPaCS's methodology is not applicable to a two-level cache hierarchy with a unified second-level cache (referred to as a two-level unified cache herein) due to the interlacing of the L1 instruction and data cache misses, which precludes independent processing of the data and instruction access traces. The relative access ordering of the L1 caches' misses to the L2 cache must be maintained, which cannot be captured with T-SPaCS's data structures and algorithms. In this chapter, we present our two-level Unified Single-Pass Cache Simulation methodology (U-SPaCS).

Broadening the ability to simulate two-level unified caches introduces several processing challenges. First, unlike a two-level instruction/data cache, which includes only one L1 cache and one L2 cache that both store instructions/data only, in a two-level unified cache, both the L1 instruction and data caches share the L2 unified cache. Thus, all three of these caches cannot logically be considered as one combined cache, and U-SPaCS cannot directly leverage the compare-exclude operations developed in T-SPaCS to derive the L2 cache performance. Second, since the L2 cache stores the blocks evicted from both the L1 instruction and data caches and the storage of both caches' evicted blocks is interlaced in the L2 cache, the instruction and data address processing cannot be

isolated. This interdependency precludes the use of T-SPaCS to individually process the instruction and data addresses since, during the processing of the instruction (or data) addresses, an interlaced data (or instruction) address stored in the L2 cache may affect the instruction (or data) address's L2 hit/miss. Third, the relative access ordering (interlacing) of the cache blocks evicted from the two L1 caches into the L2 cache is critical for L2 cache analysis. Different L1 cache configurations generate different relative L1 cache eviction orderings to the L2 cache. If there are M L1 instruction cache configurations and N L1 data cache configurations, the blocks evicted from the two L1 caches to the L2 cache have MN unique eviction orderings. Therefore, efficiently maintaining and processing a potentially large number of unique eviction orderings for large L1 cache design spaces is very challenging. In U-SPaCS, we leverage an exclusive cache hierarchy, as in T-SPaCS, due to the inherently fast runtime complexity afforded by single-pass exclusive cache simulation.

Single-Pass Cache Simulation Methodology for Two-level Unified Caches: U-SPaCS

Overview

U-SPaCS's target cache architecture is a two-level exclusive unified cache consisting of configurable L1 instruction and data caches and a configurable L2 unified cache, each with independently configurable total size, block size, and associativity. The L1 caches use the least recently used (LRU) replacement policy and the L2 cache uses a first-in first-out (FIFO)-like replacement policy. The cache hierarchy configuration design space essentially contains all combinations of the three caches' configurable parameter values. However, due to the disjoint cache contents in an exclusive cache hierarchy, all three caches must be

configured with the same block size. We denote the L1 instruction cache configuration as c1_inst(S1_inst, W1_inst, B), where S1_inst, W1_inst, and B represent the number of cache sets, the associativity, and the block size, respectively. The L1 data cache and L2 cache configurations are similarly denoted as c1_data(S1_data, W1_data, B) and c2(S2, W2, B), respectively, and the cache hierarchy configuration is denoted as c_hier(c1_inst, c1_data, c2). Each cache parameter's minimum and maximum values are denoted by the subscripts min and max, respectively, and the parameter values increase by powers of two. We use the cardinality |X| to denote the number of different values for a cache parameter X. U-SPaCS simultaneously evaluates the cache hierarchy configurations with varying total size, block size, and associativity. For each cache hierarchy configuration, U-SPaCS outputs each cache's (i.e., the L1 instruction and data caches and the L2 cache) number of misses and the L2 cache's number of write-backs (there are no write-backs in the L1 caches since the L1 caches propagate (evict) dirty blocks to the L2 cache instead of directly writing these blocks back to main memory). Cache tuning combines these outputs with a performance/energy model (e.g., [30]) to determine the optimal hierarchy configuration. U-SPaCS sequentially processes the time-ordered trace and evaluates each processed trace address as a cache hit/miss for each cache hierarchy configuration by determining the number of conflicts in the L1 and L2 caches. U-SPaCS is a stack-based trace-driven cache simulator and maintains separate stack structures for the instruction and data addresses. For each cache block size B, U-SPaCS processes the trace address and records the time-ordered sequence of unique instruction/data block

addresses that map to the same cache set for the minimum number of sets into one instruction/data stack. The total number of required stacks for each B is S1min_inst instruction stacks and S1min_data data stacks. Since the processing of each trace address for each block size is the same, we discuss U-SPaCS's processing for an arbitrary trace address T and an arbitrary block size B. Additionally, since the processing of an instruction and a data address is the same, except for the dirty status and write-backs for data address writes, without loss of generality, we describe U-SPaCS's instruction address processing first and discuss the additional data address processing operation details thereafter. We use Ki_inst and Ki_data for the instruction and data stacks, respectively. The instruction stacks are differentiated from each other using the set index for the minimum number of sets, which is denoted by i_inst (0 <= i_inst < S1min_inst). We denote the instruction stack for the corresponding i_inst as Ki_inst, and Ki_inst[m] denotes the m-th block address in the stack, where m = 1 denotes the stack's top. The data stacks are similarly differentiated.

Figure 4-1 depicts an overview of U-SPaCS's operation. A single execution of the application (using any arbitrary method such as an instruction set simulator) generates the time-ordered access trace of instruction and data addresses, including an additional read/write designation for each data address. As is the case in most offline cache performance analysis techniques, we assume an in-order processor, in which the relative order of instruction and data addresses in the access trace does not change with different cache configurations. U-SPaCS sequentially processes the trace addresses. Given a particular B, we have the block address A of a processed instruction address T, and A's set index i_inst for the minimum number of sets for the L1 instruction cache, which indicates the stack Ki_inst from which the instruction conflicts for all possible S1_inst (denoted as (S1_inst)) and S2 (denoted as (S2_inst)) are determined.
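As a concrete illustration of the address-to-stack mapping just described, the following Python sketch (an illustration only, with hypothetical helper names, not U-SPaCS's own code) computes a block address, selects its stack, and performs the initial stack search:

# Minimal sketch of U-SPaCS's per-address preprocessing, assuming power-of-two
# parameters; names (block_address, stack_index, find_in_stack) are illustrative.

def block_address(trace_address: int, block_size: int) -> int:
    # A = T >> log2(B): strip the block-offset bits.
    return trace_address >> (block_size.bit_length() - 1)

def stack_index(block_addr: int, s_min: int) -> int:
    # i = A mod S_min: index of the stack for the minimum number of sets.
    return block_addr % s_min

def find_in_stack(stack: list, block_addr: int):
    # Return the 1-based position h such that stack[h] == A, or None if the
    # block was never accessed before (a compulsory miss for all configurations).
    for h, addr in enumerate(stack, start=1):
        if addr == block_addr:
            return h
    return None

# Usage example (hypothetical values): a 32-byte block size and 64 minimum L1 sets.
A = block_address(0x40012345, 32)
i_inst = stack_index(A, 64)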

Similarly, A's set index i_data indicates the stack Ki_data from which the data conflicts for all possible S2 (denoted as (S2_data)) are determined (since A is an instruction address, the data conflicts are only required when analyzing the L2 cache). When processing T, U-SPaCS first scans Ki_inst for A to determine if A was fetched previously. If A is not located in Ki_inst, A is being fetched into the cache for the first time, accessing T results in a compulsory miss for all c_hier, and U-SPaCS continues processing the next trace address.

Figure 4-1. Overview of U-SPaCS's operation. U-SPaCS sequentially processes the access trace and outputs the number of cache misses and write-backs for all cache hierarchy configurations.

If A is found in Ki_inst as Ki_inst[h], U-SPaCS begins L1 analysis. L1 analysis evaluates the L1 cache conflicts for A to determine if T is a hit/miss for each c1_inst. For each c1_inst that results in an L1 cache

miss, the c1_inst is combined with all possible c1_data and c2 to form the c_hier, which will be analyzed for L2 hits/misses during L2 analysis. L2 analysis performs conflict evaluation using both Ki_inst and Ki_data for all possible S2. Next, for each evaluated c_hier, L2 analysis determines the L2 cache conflicts for A, which are the conflicts evicted from the L1 instruction and data caches after Ki_inst[h]'s eviction from the L1 instruction cache. The number of L2 cache conflicts dictates whether T is an L2 cache hit/miss. After evaluating all c_hier for T, the stack update process modifies Ki_inst to reflect T's access. After processing all of the trace addresses, U-SPaCS outputs the number of L1 instruction cache misses, L1 data cache misses, L2 unified cache misses, and write-backs for all cache hierarchy configurations.

The remainder of this section is organized as follows. Since L1 analysis uses a conventional stack-based algorithm for simulating set-associative caches [34] [70], we refer the reader to Chapter 2 for the L1 analysis description. First, we present the processing details for L2 analysis, and then the accelerated stack processing is addressed. Next, U-SPaCS's algorithm for processing an instruction address is summarized. Finally, the integration of occupied blank labeling and write-back counting for data addresses in U-SPaCS is described.

Second-level Unified Cache Analysis

If there is an L1 miss, L2 analysis is required. L2 analysis determines the number of L2 cache conflicts for the processed instruction address T. Since the number of L2 conflicts |L2conf| dictates an L2 cache hit/miss for T, only the conflicts that contribute to A's eviction from the L2 cache set are counted into {L2conf}. Due to the FIFO-like L2 cache replacement policy, {L2conf} is not only the instruction/data block addresses that map to the

same L2 cache set as A (i.e., A's conflicts for S2), but also the blocks that were evicted from the L1 instruction/data caches after Ki_inst[h]'s eviction from the L1 instruction cache. Therefore, L2 analysis includes two steps. In the first step, A's conflicts for S2 are determined using conflict evaluation. In the second step, L2 analysis isolates the blocks that have been evicted from the L1 caches (referred to as the evicted instruction and data block collections, respectively) from the instruction conflicts (S2_inst) and the data conflicts (S2_data) by excluding the blocks that are still stored in the L1 caches. Thereafter, the L2 conflicts collection {L2conf} is selected from the union of these two evicted-block collections by keeping only the blocks whose eviction times are later than Ki_inst[h]'s eviction time from the L1 cache. The following discusses these two steps in detail.

Step One: The first step determines the instruction conflicts (S2_inst) and the data conflicts (S2_data) for all possible S2: (S2_inst) and (S2_data) are determined from the stacks Ki_inst and Ki_data, respectively. However, the L1 cache eviction order does not follow the address accessing order recorded in the stacks, and any of the stack addresses could be an L2 conflict. For example, a block Ki_inst[m] (m > h) that was accessed (stored into the stack) before Ki_inst[h]'s access could be evicted from the L1 cache after Ki_inst[h]'s eviction from the L1 cache and thereby should be included in the L2 conflicts. This situation occurs when the block Ki_inst[m] (m > h) and Ki_inst[h] map to different L1 cache sets but to the same L2 cache set. Therefore, step one must evaluate all stack addresses in Ki_inst and Ki_data for conflicts (this is in contrast to L1 analysis, where only the stack addresses stored on top of Ki_inst[h]'s stack location are evaluated for conflicts).

Step Two: Step two determines {L2conf} from (S2_inst) and (S2_data). Since the conflicts for A could include conflicts (block addresses) currently stored in the L1 caches, step two excludes (removes) the conflicts still stored in the L1 caches from (S2_inst) and (S2_data), and the remaining conflicts (the evicted instruction and data block collections, respectively) are the conflicts stored exclusively in the L2 cache. However, not all of these conflicts can be classified as A's L2 conflicts since {L2conf} must only contain the conflicts that were evicted from the L1 caches after Ki_inst[h] was evicted from the L1 instruction cache. However, the blocks' relative stack locations are not sufficient to determine the blocks' eviction orders since the stack only stores each block's latest access and maintains no previous access history. Therefore, each stack location includes an array to explicitly record that block's eviction time from the L1 cache. Each array element corresponds to the eviction time for each L1 cache configuration, and the array size is equal to the associated cache's (instruction or data) design space size. Each time that a block is evicted from the L1 cache, the currently processed address's trace order value is added to the evicted block's array to indicate the block's eviction time. We begin counting the trace order from 1 such that an eviction time array value of 0 indicates that the block is currently stored in the L1 cache for the corresponding L1 cache configuration. If there is an L1 cache miss determined for T during the L1 analysis, the W1_inst-th L1 cache conflict, which is acquired from the instruction frame layer's MRU ordering, is the block evicted from the L1 cache after fetching A. Thus, T's trace order is assigned to the corresponding (dictated by the c1_inst) element in the eviction time array of the W1_inst-th L1 cache conflict.
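The bookkeeping just described, and the filtering it enables in the next paragraph, can be sketched in a few lines of Python. This is an illustration only, simplified to a single L1 configuration and a single combined candidate list; the names (StackEntry, l2_conflicts) are invented:

# Hedged sketch of Step Two, assuming each stack entry carries an eviction_time
# value per L1 configuration (0 = still resident in the L1 cache).

from dataclasses import dataclass, field

@dataclass
class StackEntry:
    block_addr: int
    eviction_time: dict = field(default_factory=dict)   # c1 config -> trace order (0 = in L1)

def l2_conflicts(candidates: list, reused_entry: StackEntry, c1_cfg) -> list:
    a_time = reused_entry.eviction_time.get(c1_cfg, 0)   # when Ki_inst[h] left the L1 cache
    kept = []
    for entry in candidates:                             # conflicts from (S2_inst) and (S2_data)
        t = entry.eviction_time.get(c1_cfg, 0)
        if t == 0:
            continue                                     # still in an L1 cache: not an L2 conflict
        if t > a_time:
            kept.append(entry)                           # evicted after Ki_inst[h]: counts toward W2
    return kept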

Since an eviction time value of 0 implies that the block is located in the L1 cache, during the stack update process, all of the elements in the eviction time array of the newly pushed address are cleared to 0. By maintaining the eviction time array, {L2conf} can be easily determined using the eviction times of the conflicts in (S2_inst) and (S2_data). First, the evicted instruction and data conflict collections are derived from (S2_inst) and (S2_data) by excluding the conflicts with an eviction time value of 0. Next, {L2conf} is determined by selecting the conflicts from these collections with the condition that the conflict's eviction time is larger than the eviction time of Ki_inst[h].

Accelerated Stack Processing

Since the stacks can be very large, the conflict evaluation in either L1 analysis or L2 analysis is very time consuming. Using the same acceleration strategies as in T-SPaCS, U-SPaCS evaluates each stack address for conflicts for all S simultaneously based on the set refinement property and stores the conflict information in layered structures. The conflicts evaluated from the instruction stack and the data stack are stored in layered structures referred to as the instruction frame and the data frame, respectively. Each layer stores the conflict information for each S in MRU (most recently used) order to preserve the conflicts' relative access order. The conflict information is recorded by a pointer that points to the conflict's stack location, which is leveraged to quickly locate the conflict.
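To make the set refinement property concrete, the following Python sketch (an illustration, not U-SPaCS's implementation; the frame here is a plain dictionary from number of sets to a list of stack positions) evaluates one stack address against A for all set counts at once:

# Hedged sketch of conflict evaluation: record addr's stack position in the frame
# layer for every number of sets S (powers of two in [s_min, s_max]) for which
# addr maps to the same cache set as A. By the set refinement property, once the
# set indices differ for some S they differ for all larger S, so we can stop early.

def conflict_evaluation(A: int, addr: int, stack_pos: int,
                        s_min: int, s_max: int, frame: dict) -> None:
    s = s_min
    while s <= s_max:
        if (addr % s) == (A % s):                       # addr conflicts with A for S = s
            frame.setdefault(s, []).append(stack_pos)   # scan order preserves MRU ordering
        else:
            break                                       # no conflict for s, hence none for 2s, 4s, ...
        s *= 2

# Evaluating the stack entries top-down naturally stores each layer's conflicts in MRU order.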

Figure 4-2 depicts sample instruction (A) and data (B) frames for a processed address with block address A = 100110110110 for an arbitrary B and a design space bounded by S1min_inst = 4, S1max_inst = 64, S2min = 16, and S2max = 256. During L1 analysis, the top five layers in the instruction frame are built to store A's conflicts for all S1_inst (during L1 analysis, conflict evaluation is performed only for the stack addresses Ki_inst[m] with m < h). During L2 analysis, new layers that store the instruction conflicts for S2 are added to the existing instruction frame, which is correspondingly extended from the S1_inst layers to the S2 layers (conflict evaluation for the entire stack). Since, in our example, S2min is less than S1max_inst, some values of S overlap; thus, the number of layers in the instruction frame is equal to |S1_inst| plus |S2| minus the number of overlapped values. The data frame is built to store the data conflicts for each S2 as well.

Figure 4-2. Sample instruction and data frames. Each rectangle represents one layer that stores the conflict information for the layer's corresponding S. The conflict information is recorded using a pointer that points to the conflict's stack location. The number shown in each rectangle indicates the cache index of the recorded conflicts in that layer.

U-SPaCS's Processing Algorithm

In this section, we summarize U-SPaCS's processing algorithm for an instruction trace address T and a particular block size B. During each T's processing, U-SPaCS simply repeats this algorithm for each B.

Figure 4-3 depicts U-SPaCS's algorithm for processing an instruction address T for a particular B. First, A, i_inst, and i_data are calculated (lines 1-3). Next, U-SPaCS searches the stack Ki_inst to determine whether the block A was accessed before (line 4). If there is no h such that Ki_inst[h] is equal to A, accessing T results in a compulsory cache miss for all c_hier (lines 5-6), T's processing is finished (line 7), and the stack update process pushes A onto the top of Ki_inst (line 51). If there exists an h such that Ki_inst[h] is equal to A, U-SPaCS begins L1 analysis and performs conflict evaluation for the stack addresses Ki_inst[m], where 0 < m < h, for all S from S1min_inst to S2max, and stores the conflict information into the corresponding layers (dictated by S1_inst) of the instruction frame (lines 9-10). For each S1_inst, the conflicts stored in the corresponding layer of the instruction frame are all of the L1 cache conflicts (lines 11-12). Therefore, every W1_inst larger than the number of L1 cache conflicts results in an L1 cache hit, and thereby, all c_hier with these c1_inst also result in an L1 cache hit (lines 13-16). Otherwise, c1_inst results in an L1 cache miss and the c1_inst is recorded into an L1 cache miss list for future reference (lines 17-19). Since the cache miss results in a block eviction after fetching A into the L1 instruction cache, U-SPaCS locates the evicted block (which is the W1_inst-th L1 cache conflict according to the MRU order) and sets the corresponding element (dictated by c1_inst) in the evicted block's eviction time array to T's trace access order (lines 20-21). As long as there is at least one L1 cache miss recorded in the L1 cache miss list, U-SPaCS performs L2 analysis (line 22). L2 analysis continually evaluates the conflicts for the stack addresses Ki_inst[m], where m > h, for all possible S2 and stores the conflict information in the corresponding layers (dictated by S2) of the instruction frame. L2

Figure 4-3. U-SPaCS's algorithm for an instruction address T for a particular B.

Process instruction address (T, B):
1  - A = T >> log2(B);
2  - i_inst = A mod S1min_inst;
3  - i_data = A mod S1min_data;
4  - search for Ki_inst[h] satisfying (Ki_inst[h] == A) in stack Ki_inst;
5  if (Ki_inst[h] is not found)
6    - all c_hier result in a miss; // compulsory miss of T
7    - goto END_PROCESSING;
8  else // L1 cache analysis
9    for (m = 1; m < h; m++)
10     - ConflictEvaluation(A, Ki_inst[m], S1min_inst, S2max);
11   for (S1_inst = S1min_inst; S1_inst <= S1max_inst; S1_inst *= 2)
12     - check the number of L1 cache conflicts recorded in the corresponding layer (dictated by S1_inst) of the instruction frame;
13     for (W1_inst = W1min_inst; W1_inst <= W1max_inst; W1_inst *= 2)
14       if (W1_inst > the number of L1 cache conflicts)
15         - c1_inst(S1_inst, W1_inst, B) results in an L1 cache hit;
16         - each c_hier including the c1_inst results in an L1 cache hit;
17       else
18         - c1_inst(S1_inst, W1_inst, B) results in an L1 cache miss;
19         - record the c1_inst in an L1 cache miss list;
           // set the eviction time for the L1 evicted block
20         - L1_eviction = the W1_inst-th L1 cache conflict, which is determined from the corresponding layer (dictated by S1_inst) of the instruction frame;
21         - the corresponding element (dictated by c1_inst) in the eviction time array of L1_eviction = T's trace order;
   // L2 cache analysis
22 if (L1 cache miss list != NULL) // there are L1 cache misses
23   for (m = h + 1; Ki_inst[m] != NULL; m++)
24     - ConflictEvaluation(A, Ki_inst[m], S2min, S2max);
25   for (m = 1; Ki_data[m] != NULL; m++)
26     - ConflictEvaluation(A, Ki_data[m], S2min, S2max);
27   for (S2 = S2min; S2 <= S2max; S2 *= 2)
28     - the conflicts recorded in the corresponding layer (dictated by S2) of the instruction frame form {(S2_inst)};
29     - the conflicts recorded in the corresponding layer (dictated by S2) of the data frame form {(S2_data)};
30     for (each c1_inst in the L1 cache miss list)
31       - A_EvictionTime = the value of the eviction time array's corresponding element (dictated by c1_inst) of Ki_inst[h];
32       for (each (S2_inst) in {(S2_inst)})
33         - chk_EvictionTime = the value of the eviction time array's corresponding element (dictated by c1_inst) of (S2_inst);
34         if (chk_EvictionTime < A_EvictionTime)
35           - exclude the (S2_inst) from the collection {(S2_inst)};
36       - {L2conf} = {(S2_inst)};
37       for (each c1_data)
38         for (each (S2_data) in {(S2_data)})
39           - chk_EvictionTime = the value of the eviction time array's corresponding element (dictated by c1_data) of (S2_data);
40           if (chk_EvictionTime < A_EvictionTime)
41             - exclude the (S2_data) from the collection {(S2_data)};
42         - {L2conf} = {L2conf} U {(S2_data)};
43         for (W2 = W2min; W2 <= W2max; W2 *= 2)
44           if (W2 > |L2conf|)
45             - c_hier(c1_inst, c1_data, c2(S2, W2, B)) results in an L2 cache hit (and L1 cache miss);
46           else
47             - c_hier(c1_inst, c1_data, c2(S2, W2, B)) results in an L2 cache miss (and L1 cache miss);
48 END_PROCESSING: // stack update process for T
49 if (Ki_inst[h] was found)
50   - remove Ki_inst[h] from the stack Ki_inst;
51 - push A to the top of Ki_inst as Ki_inst[1];
52 - clear all the elements in Ki_inst[1]'s eviction time array to 0;
53 - free the instruction frame and data frame;

analysis similarly evaluates all stack addresses in Ki_data for all S2 and stores the conflict information in the corresponding layers (dictated by S2) of the data frame (lines 23-26). After conflict evaluation for both the instruction and data stacks, for each S2, the conflicts in the corresponding layer (dictated by S2) of the instruction and data frames form {(S2_inst)} and {(S2_data)}, respectively (lines 28-29). For each c1_inst that resulted in an L1 cache miss (line 30), every c_hier that is composed of that c1_inst and all combinations of every S2 (line 27), c1_data (line 37), and W2 (line 43) is analyzed for L2 cache hits/misses. The subcollection of all the instruction conflicts in {L2conf} is derived from {(S2_inst)} by excluding those (S2_inst) whose eviction time array's corresponding element (dictated by c1_inst) value is less than (including the value 0) the value of Ki_inst[h]'s (lines 31-36). Thereafter, the subcollection of all the data conflicts in {L2conf} is derived from {(S2_data)} in the same way, by comparing the eviction time array's corresponding element (dictated by c1_data) value of (S2_data) and Ki_inst[h] (lines 38-42). With the total number of derived conflicts |L2conf|, the c2 with different W2 generates the L2 hit/miss results (lines 43-47). After L1 and L2 analysis, U-SPaCS performs the stack update process for T, which removes Ki_inst[h] from the stack (lines 49-50) and pushes A onto the top of Ki_inst (line 51). All of the elements in the eviction time array of the newly pushed stack address are cleared to 0 (line 52), and the instruction and data frames built for T's processing are freed/cleared (line 53).

Occupied Blank Labeling

As in the analogous scenario in T-SPaCS, the special case of occupied blanks (BLK) occurs and introduces incorrect L2 cache hit/miss classification in U-SPaCS.

T-SPaCS's results revealed that even though the BLK introduced an average miss rate error of 0.71%, T-SPaCS still determined the optimal cache configurations for all of the studied benchmarks, and therefore, BLK labeling was not necessary. However, since our experimental results revealed that the introduced miss rate error for a two-level unified cache was as high as 20% and this error caused cache tuning to incorrectly determine the optimal cache configuration, we integrate BLK labeling into U-SPaCS. A bit array associated with each stack address is used to label the BLK. The bit array's size is equal to the number of c_hier with the same B. A set bit in the array denotes that a BLK follows the block for the corresponding c_hier. If accessing T results in an L2 cache hit and the block evicted from the L1 cache does not map to the same L2 cache set as A, a BLK label is set for the W2-th (MRU ordering) conflict in the L2 cache, which is determined by sorting the conflicts in (S2_inst) and (S2_data) based on the conflicts' eviction times. L2 analysis is augmented by including this BLK label examination. If there is a set label for any L2 cache conflict, Ki_inst[h] has already been evicted from the L2 cache, and thereby, accessing T results in an L2 cache miss even though the number of L2 cache conflicts is less than W2.

Write-back Counting

The processing for data and instruction addresses is essentially the same, except that data address processing must consider the data write-backs. As in T-SPaCS, assuming a write-back policy (the write-through policy requires no special processing), we distinguish between writes that propagate to lower levels of cache (write-backs of dirty blocks due to evictions) and writes that avoid propagation to lower levels of cache. During L1/L2 analysis, U-SPaCS records the number of write-avoids to calculate the number of write-backs [66], where the number of write-backs is equal to the total number of writes minus the number of write-avoids.
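The following Python sketch (illustrative only, simplified to a single cache configuration, with invented names) shows the write-avoid bookkeeping described above:

# Hedged sketch of write-avoid counting under a write-back policy: a write that
# hits a block already marked dirty does not propagate to the next level, so it
# is counted as a write-avoid; write_backs = total_writes - write_avoids.

class WriteBackCounter:
    def __init__(self):
        self.total_writes = 0
        self.write_avoids = 0
        self.dirty = set()              # block addresses currently dirty in this cache

    def on_write(self, block_addr: int, is_hit: bool) -> None:
        self.total_writes += 1
        if is_hit and block_addr in self.dirty:
            self.write_avoids += 1      # the already-dirty block absorbs this write
        self.dirty.add(block_addr)      # the block is dirty after any write

    def on_eviction(self, block_addr: int) -> None:
        self.dirty.discard(block_addr)  # dirty evictions are what the write-backs count

    def write_backs(self) -> int:
        return self.total_writes - self.write_avoids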

Each data stack address includes a bit array to indicate the dirty status of the block for all c_hier. The bit array's size is equal to the number of c_hier with the same B. During the stack update process, the maintenance of the dirty status of the stack address is the same as in T-SPaCS, where the dirty status is dictated by an L2 hit/miss for the corresponding cache configuration. In the L1/L2 analysis, if writing A results in an L1/L2 cache hit and Ki_data[h] is dirty, the write's propagation to memory is avoided and the number of write-avoids is incremented.

Experimental Results and Analysis

We verified the cache hierarchy miss and write-back rates output by U-SPaCS and examined U-SPaCS's simulation time efficiency using the entire EEMBC [20] benchmark suite, five arbitrarily selected benchmarks from Powerstone [49], and four arbitrarily selected benchmarks from MediaBench [46] (due to incorrect execution, we could not evaluate the complete Powerstone and MediaBench suites). We generated the access traces using SimpleScalar's [10] sim-cache module and compared U-SPaCS with the widely used trace-driven cache simulator Dinero IV [18]. We modified Dinero to simulate an exclusive cache hierarchy. We leveraged the same configurable L1 instruction cache, L1 data cache, and L2 unified cache design space as in [30] ([30] showed that this design space provided an appropriate variation in cache configurations for similar benchmarks). The L1 instruction and data cache sizes ranged from 2 to 8 Kbytes, the L2 cache size ranged from 16 to 64 Kbytes, the L1/L2 cache associativities ranged from direct-mapped to 4-way, and the L1/L2 cache block sizes ranged from 16 to 64 bytes. Given this configurability and considering the block size restriction, the total number of cache hierarchy configurations was 2,187.
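As a sanity check on the design space size quoted above, the following Python sketch (illustrative; parameter lists taken from the ranges just described) enumerates the cache hierarchy configurations under the shared-block-size restriction:

from itertools import product

# Powers-of-two parameter values from the design space described above.
l1_sizes    = [2048, 4096, 8192]        # L1 instruction/data cache sizes (bytes)
l2_sizes    = [16384, 32768, 65536]     # L2 unified cache sizes (bytes)
assocs      = [1, 2, 4]                 # direct-mapped to 4-way
block_sizes = [16, 32, 64]              # shared by all three caches (exclusive hierarchy)

count = 0
for b in block_sizes:
    # Each cache contributes (size, associativity) combinations for this block size.
    l1i = list(product(l1_sizes, assocs))
    l1d = list(product(l1_sizes, assocs))
    l2  = list(product(l2_sizes, assocs))
    count += len(l1i) * len(l1d) * len(l2)

print(count)   # 3 block sizes * 9 * 9 * 9 = 2,187 cache hierarchy configurations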

Accuracy Evaluation

We verified U-SPaCS's accuracy by comparing U-SPaCS's cache miss and write-back rates with Dinero's exact cache miss and write-back rates for each benchmark. U-SPaCS's miss rates for all of the caches and the write-back rates for the L2 caches for all of the cache hierarchy configurations were 100% identical to Dinero's. Since U-SPaCS provides accurate results, cache tuning can always determine the optimal cache configuration considering the application requirements and/or design constraints.

Simulation Time Evaluation

Since we are the first to propose single-pass trace-driven cache simulation for two-level unified caches, there is no prior work to directly compare to. Therefore, we quantified U-SPaCS's simulation efficiency by comparing U-SPaCS's total simulation time to simultaneously evaluate the entire design space with the simulation time required by the most widely used trace-driven cache simulator, Dinero, to iteratively evaluate the design space. We tabulated the user time reported by the Linux time command for simulations running on a Red Hat Linux Server version 5.2 with a 2.66 GHz processor and 4 gigabytes of RAM. Figure 4-4 depicts U-SPaCS's simulation time speedup as compared to Dinero. U-SPaCS's simulation speedups reached as high as 72X, with an average speedup of 41X.

Figure 4-4. U-SPaCS's simulation time speedup as compared to Dinero.

Since the speedups varied significantly across different applications, we further analyzed the results to determine the cause of this wide variation. The differentiating factor between applications was the access trace length (the number of addresses in the access trace file). Figure 4-5 depicts the logarithmically scaled simulation time (in seconds) for Dinero and U-SPaCS, and the logarithmically scaled speedup of U-SPaCS as compared to Dinero, with respect to increasing access trace length (also logarithmically scaled). Each graph point represents a benchmark such that vertically correlated points represent the benchmark's performance for both simulation methods and the resulting speedup. The results indicated that Dinero's simulation time increased linearly as the access trace length increased due to Dinero's constant simulation time for each trace address. U-SPaCS's simulation time does not strictly follow this linear relationship because U-SPaCS's simulation time also depends on the instruction and data stacks' sizes and the L1 caches' miss rates. The stacks' sizes dictate the complexity of the conflict evaluation during L1 and L2 analysis. Furthermore, only L1 cache misses require L2 analysis, which is lengthy as compared to L1 analysis. The combination of application-specific

behavior and the cache hierarchy configuration dictates the stacks' sizes and cache miss rates, thus resulting in the observed speedup variations.

Figure 4-5. The logarithmically scaled simulation time (in seconds) for Dinero and U-SPaCS and the logarithmically scaled speedup attained by U-SPaCS as compared to Dinero with respect to increasing access trace length.

Since U-SPaCS's simulation time increased as the L1 cache miss rates and stacks' sizes increased, thereby decreasing U-SPaCS's speedup, we evaluated the relationship between the L1 cache miss rates and the stacks' sizes and U-SPaCS's speedup. Figure 4-6 plots U-SPaCS's speedup for each benchmark (graph point) with respect to the product of the benchmark's average L1 miss rate and average stack size. For each benchmark, the average L1 miss rate was the summation of the average L1 instruction/data cache miss rates, averaged across all L1 instruction/data cache configurations and weighted by the percentage of instruction/data accesses in the access trace. Since the stacks' sizes increase during U-SPaCS's processing, we recorded the corresponding stacks' sizes for each trace address's processing and generated a histogram of all stack sizes. Based on the histogram, we calculated the average stack size. The results in Figure 4-6 verified that the speedup generally

decreased as the product of the average L1 cache miss rate and the average stack size increased. One outlying point, the bilv benchmark, did not match this decreasing speedup trend due to a relatively higher average L1 cache miss rate as compared to the neighboring graph points. This outlying point revealed that the L1 cache miss rate had a larger impact on the speedup than the stack size. Therefore, we replotted U-SPaCS's speedup with respect to the product of the square of the average L1 cache miss rate and the average stack size in Figure 4-7. With a higher importance placed on the L1 cache miss rate, all benchmark points follow the decreasing speedup trend.

Figure 4-6. U-SPaCS's speedup with respect to the product of the average L1 cache miss rate and the average stack size.

Figure 4-7. U-SPaCS's speedup with respect to the product of the square of the average L1 cache miss rate and the average stack size.

Summary

In this chapter, we presented U-SPaCS, which is, to the best of our knowledge, the first single-pass cache simulation methodology for two-level unified caches. U-SPaCS simultaneously evaluates all cache hierarchy configurations using a stack-based algorithm to store and evaluate uniquely accessed addresses with additionally recorded per-block eviction time information. Experimental results indicated that U-SPaCS's cache miss rates and write-backs were 100% accurate for all cache configurations and the average simulation time speedup was 41X as compared to the most widely used trace-driven cache simulator.

CHAPTER 5
CAPPS: CACHE PARTITIONING WITH PARTIAL SHARING FOR MULTI-CORE SYSTEMS

In CMPs, shared resources are optimized to manage access contention from multiple cores. Shared LLCs should be large enough to accommodate all sharing cores' data; however, due to long access latencies and high power consumption, large LLCs are typically precluded from embedded systems with strict area/energy/power constraints. Therefore, optimizing small LLCs is significantly more challenging due to contention for limited cache space. CaPPS combines the benefits of private and shared partitioning and thereby can reduce shared-cache contention and improve the poor cache utilization of private partitioning. CaPPS controls each core's cache utilization using sharing configuration, which enables a core's quota to be configured as private, partially shared with a subset of cores, or fully shared with all other cores. Whereas CaPPS's sharing configuration increases the design space and thus increases optimization potential, this large design space significantly increases design space exploration time. Since using a CMP simulator to exhaustively simulate all configurations in CaPPS's design space is prohibitively lengthy for realistic applications (several months or more), to facilitate fast design space exploration, we develop an offline analytical model to quickly estimate cache miss rates for all configurations, which enables determining LLC configurations for any optimization that evaluates the cache miss rates (e.g., performance, energy, energy-delay product, power, etc.). The analytical model probabilistically predicts the miss rates when multiple applications are co-executing using the isolated cache access distribution for each application (i.e., the application is run in isolation with no co-executing applications). This probabilistic prediction provides

a fair and realistic offline method for evaluating any combination of co-executed applications, which cannot be determined at design time for dynamically scheduled systems.

Cache Partitioning with Partial Sharing (CaPPS)

To accommodate LLC requirements for multiple applications co-executing on different cores of a CMP, CaPPS partitions the shared LLC at the way granularity and leverages sharing configuration to allocate the partitions to each core's quota. To facilitate fast design space exploration, an analytical model estimates the cache miss rates for the CaPPS configurations using the applications' isolated LLC access traces. We assume that each core executes a different application in an independent address space; thus, there is no shared instruction/data address or coherence management, which is a common case in CMPs and is similar to assumptions made in prior works [12] [21].

Architecture and Sharing Configurations

CaPPS's sharing configurations enable a core's quota to be configured as private, partially shared by a subset of cores, or fully shared by all cores. Figure 5-1 (A)-(C) illustrates sample configurations, respectively, for a 4-core CMP (C1 to C4) and an 8-way LLC: (A) each core's quota has a configurable number of private ways; (B) C1's quota has four ways, two of which are private and two of which are shared with C2, C2's quota contains an additional private way, and C3's quota has three ways, one of which is private and two of which are shared with C4; and (C) all four cores fully share all ways.

CaPPS uses the least recently used (LRU) replacement policy, but we note that the analytical model could be extended to approximate cache miss rates for other replacement policies, such as pseudo-LRU; this is beyond the scope of this work. To

reduce the sharing configurability with no effect on cache performance and to minimize contention, cores share an arbitrary number of ways starting with the LRU way, then the second LRU way, and so on, since these ways are least likely to be accessed. This pruning method intelligently removes the redundant sharing configurations that have higher contention potential. For example, in Figure 5-1 (B), two of C1's ways are shared with C2; therefore, C1's two most recently used (MRU) blocks are cached in C1's two private ways, the two LRU blocks are cached in the two ways shared with C2, and these two LRU blocks are the only replacement candidates for C2's accesses.

Figure 5-1. Three sharing configurations: (A) a core's quota is configured as private, (B) partially shared with a subset of cores, or (C) fully shared with all other cores.

Prior works primarily used two hardware support approaches for cache partitioning. One approach leveraged a modified LRU replacement policy [17] [43] [55] [65] and selected replacement candidates based on the blocks' MRU orderings and the number of blocks occupied by each core. The other approach used column caching [14] [43] [65] to globally control which ways a core's data could be cached in (i.e., the ways that contained candidate replacement blocks for a particular core). Neither approach increased the cache access time since the new logic was only activated during a cache miss and replacement block selection occurred in parallel with the miss fetch.

Since both approaches are equally applicable to CaPPS, in the following subsections, we detail how these approaches could be modified for CaPPS.

Modified LRU replacement policy

A conventional LRU cache associates a counter with each block to denote the block's MRU ordering in the cache set. To adapt this basic hardware for CaPPS, a per-block core identification (ID) is required. For example, assume two cores, C1 and C2, share Wsh ways in a CaPPS configuration and have Wp1 and Wp2 private ways, respectively. On a cache hit, the blocks' counters are updated to indicate the new MRU ordering, similarly to a conventional LRU cache. On a cache miss, the numbers of blocks, N1 and N2, currently occupied by C1 and C2, respectively, in the set must be determined. If C1 caused the miss and there are unused/invalid blocks in C1's private ways or in C1 and C2's shared ways (which can be determined by validating (N1 < Wp1) | ((N1 + max(N2, Wp2)) < (Wp1 + Wp2 + Wsh))), the newly fetched data can be cached in an unused/invalid block; otherwise, a replacement block must be selected. From C1 and C2's occupied blocks, after excluding each core's private-way number of MRU blocks (i.e., Wp1 and Wp2 MRU blocks, respectively), the replacement block is the LRU block among the remaining blocks. A similar method determines the replacement block when more than two cores share cache ways.

The additional hardware required to use this approach for CaPPS is the per-block core ID, which can be evaluated as log(P) bits per block, where P is the total number of cores. Additionally, when changing sharing configurations, all cache blocks must be invalidated and dirty blocks must be written back in the case of a write-back cache, which can induce additional cache misses; however, we point out that this overhead is required for any reconfigurable cache and is not an additional overhead for CaPPS as compared to prior work.
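To illustrate the replacement rule just described, the following Python sketch (an illustration with invented names; per-set state only, two cores C1 and C2, and assuming the cores actually share at least one way) selects the victim under the modified LRU policy:

# Hedged sketch of victim selection in the modified LRU policy. Each set keeps
# its blocks in global MRU-to-LRU order; block.core identifies the owning core.
# wp holds the private way counts per core; the miss was caused by `requestor`.

from dataclasses import dataclass

@dataclass
class Block:
    tag: int
    core: str   # "C1" or "C2"

def select_victim(mru_order: list, wp: dict, wsh: int, requestor: str):
    n = {c: sum(1 for b in mru_order if b.core == c) for c in wp}
    total_ways = sum(wp.values()) + wsh
    other = "C2" if requestor == "C1" else "C1"
    # Unused/invalid block available in the requestor's private ways or the shared ways?
    if n[requestor] < wp[requestor] or \
       n[requestor] + max(n[other], wp[other]) < total_ways:
        return None   # cache the fetched block in a free block; no eviction needed
    # Otherwise: skip each core's wp MRU blocks, then evict the LRU of the remainder
    # (assumes wsh >= 1 so the remainder is non-empty).
    candidates = []
    seen = {c: 0 for c in wp}
    for b in mru_order:                  # mru_order[0] is MRU, last is LRU
        seen[b.core] += 1
        if seen[b.core] > wp[b.core]:
            candidates.append(b)
    return candidates[-1]                # LRU block among the non-private-resident blocks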

Column caching

Column caching uses a per-core partition vector to globally control the core's candidate replacement ways for all cache sets. The partition vector is a bit vector where the number of bits is equal to the cache associativity, and a set bit (1) denotes that the bit's associated way is assigned to that core. For example, if the cache associativity is eight and a core's partition vector is 00111001, four ways are allocated to the core: the third, fourth, fifth, and eighth ways. Cache fetches are the same as in a conventional cache (i.e., all tags in the cache set are compared with the fetched block's tag); thus, the partition vector does not increase the cache access time. On a cache miss, the replacement block is selected from the core's allocated ways as denoted by the core's associated partition vector. Column caching introduces minimal hardware overhead since only per-core partition vectors are required and the vectors are globally used by all sets. Changing sharing configurations requires new partition vector contents to be loaded, but unlike the modified LRU replacement policy, cache block invalidation and dirty block write-backs are not required since all of the tags in the ways are compared with the fetched block. On a cache miss, the replacement block is selected using the new partition vector; thus, column caching does not induce additional cache misses as compared to the modified LRU replacement policy. However, the conventional LRU cache implementation that uses counters to denote the MRU orderings cannot be used in column caching. Column caching uses the partition vectors to globally control which physical ways a core's data is cached in for all cache sets.
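A minimal software model of the partition-vector lookup may help fix the convention; the sketch below is illustrative only (the lru_rank structure is an invented stand-in for whatever recency state the cache keeps) and reads the vector string left to right as way 1 through way 8, matching the example above:

# Hedged sketch of column caching's replacement-candidate selection: the partition
# vector marks the ways allocated to a core; on a miss, the victim must come from
# those ways.

def allocated_ways(partition_vector: str) -> list:
    # "00111001" -> [3, 4, 5, 8]: the third, fourth, fifth, and eighth ways.
    return [i + 1 for i, bit in enumerate(partition_vector) if bit == "1"]

def select_victim_way(partition_vector: str, lru_rank: dict) -> int:
    # lru_rank maps way number -> recency rank (higher = less recently used).
    # The victim is the least recently used way among the core's allocated ways.
    ways = allocated_ways(partition_vector)
    return max(ways, key=lambda w: lru_rank[w])

# Example from the text: an 8-way cache and partition vector "00111001".
assert allocated_ways("00111001") == [3, 4, 5, 8]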

Since CaPPS's sharing configurations restrict sharing to start from the LRU ways, a simple counter-implemented LRU cache cannot be used since the LRU blocks can be stored in any physical way (dictated by the associated counters' values) in different cache sets and at different times. Thus, there is no physical way (which would store the LRU blocks of a core in all cache sets) that can be shared with other cores, and thereby globally controlled by a partition vector bit.

Figure 5-2. Maintaining LRU information using a linked list for (A) a conventional LRU cache and (B) a CaPPS sharing configuration where C1 and C2's partition vectors are 11101100 and 00011100, respectively.

Instead of counters, linked lists can be used to denote the MRU orderings, and prior works [24] [63] showed that linked lists used less hardware resources and afforded faster cache access times as compared to counters for conventional LRU cache implementations. In the linked list implementation, a cache set's blocks are indexed and the indexes of the blocks are separately maintained in a linked list. Figure 5-2 (A) depicts a sample linked list for an 8-way cache, where the linked list registers D1

through D8 store the blocks' indexes. When a replacement is required, the index stored in the LRU register (i.e., D8) identifies the way containing the replacement block. Instead of associating the partition vector bits with each physical cache way [14], CaPPS can associate the partition vector bits with the linked list registers (i.e., the most and least significant bits are associated with the head and tail registers of the linked list, respectively). Thus, in a core's partition vector, partial sharing enables other cores to share ways starting from the right-most set bit's associated linked list register, which always stores the core's LRU block's index. For example, in an 8-way cache where two cores, C1 and C2, share two ways and C1 has three private ways and C2 has one private way, C1 and C2's partition vectors are 11101100 and 00011100, respectively. If a third core, C3, also shares ways with the two cores and C3 has two private ways, the partition vectors for C1, C2, and C3 are 11100011, 00010011, and 00001111, respectively. Figure 5-2 (B) depicts the associated linked lists for C1 and C2 with partition vectors 11101100 and 00011100, respectively. Since each bit corresponds to a linked list register, the blocks allocated to C1 are the indexes stored in D1, D2, D3, D5, and D6, and the blocks allocated to C2 are the indexes stored in D4, D5, and D6, where D5 and D6, shared by the two cores, indicate the LRU and second LRU blocks for C1 and C2.

The additional hardware overhead when comparing Figure 5-2 (A) with Figure 5-2 (B) is the data path from D3 to D5, which can be implemented by adding a bypass transfer line to the linked list registers, as shown in Figure 5-3. The transfer lines and the linked list registers' outputs are connected using nMOS switches. Therefore, the bypass source and destination registers can be connected by turning on the registers' nMOS switches.
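The register behavior in Figure 5-2 (B) can be modeled in software to check the bookkeeping; the Python sketch below is an illustration only (list indexes stand in for the D registers, and the block index values are hypothetical), not a hardware description:

# Hedged sketch of CaPPS's linked-list MRU maintenance with partition vectors.
# D models one set's linked-list registers D1..D8 (index 0 = MRU end), each
# holding a block index. A core's partition vector selects the registers that
# belong to its quota; registers with a 0 bit are bypassed (their contents are
# untouched), which is what the bypass transfer line in Figure 5-3 provides.

def update_on_access(D: list, partition_vector: str, accessed_block: int) -> None:
    positions = [i for i, bit in enumerate(partition_vector) if bit == "1"]
    view = [D[i] for i in positions]              # the core's own MRU-to-LRU ordering
    if accessed_block in view:
        view.remove(accessed_block)
    else:
        view.pop()                                # miss: the core's LRU entry is replaced
    view.insert(0, accessed_block)                # accessed block becomes the core's MRU
    for i, value in zip(positions, view):         # write back only this core's registers
        D[i] = value

# Example mirroring Figure 5-2 (B): C1's vector 11101100 selects D1, D2, D3, D5, D6;
# an access by C1 shifts only those registers and leaves D4 (C2's private way) intact.
D = [1, 2, 3, 4, 5, 6, 7, 8]
update_on_access(D, "11101100", 5)
print(D)   # [5, 1, 2, 4, 3, 6, 7, 8]  (D4 bypassed; block indexes are hypothetical)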

By carefully controlling the on/off status of the switches and the load enables of the linked list registers based on the partition vectors' bit values, the linked lists, as in Figure 5-2 (B), can be easily maintained. For example, to maintain C1's linked list in Figure 5-2 (B), the load enables of the registers associated with the 1 bit values in C1's partition vector (i.e., En1, En2, En3, En5, and En6) are set as valid, and the switches Sw3 and Sw4 are turned on to bypass the D4 register. To maintain C2's linked list in Figure 5-2 (B), the load enables of the registers associated with the 1 bit values in C2's partition vector (i.e., En4, En5, and En6) are set as valid. Since no bypass is required, all the switches are off. Since [24] provides the linked list implementation details for a conventional LRU cache, these details are excluded from Figure 5-3; thus, Figure 5-3 only depicts the additional hardware required for CaPPS. The bypass transfer line and nMOS switches can be shared by all sets; therefore, the additional hardware cost is minimal. The control logic for the load enables of the linked list registers and switches is straightforward, thus we omit those details.

Figure 5-3. Extension of the linked list implementation for CaPPS.

An alternative approach to implementing the linked list for a conventional LRU cache is to maintain a previous and a next register for each block [63]. A current block's previous register stores the index of the most recently previously accessed block and the

next register stores the index of the block accessed next after the current block. Extending this approach to achieve the bypass transfer line shown in Figure 5-2 (B) is trivial since the previous and next registers' contents can be changed to move a block's position in the linked list.

Analytical Modeling

Overview

For applications with fully/partially shared ways, the analytical model probabilistically determines the miss rates, considering contention effects, using the isolated cache access distributions for the co-executing applications. These distributions are recorded during isolated access trace processing. The isolated LLC access traces can be generated with a simulator/profiler by running each application in isolation on a single core with all other cores idle. For applications with only private ways, there is no cache contention and the miss rate can be directly determined from the isolated LLC access trace distribution.

Figure 5-4 exemplifies the contention effects in the shared ways using sample time-ordered isolated (C1, C2) and interleaved/co-executed (C1&C2) access traces to an arbitrary cache set from cores C1 and C2. C1's and C2's accesses are denoted as Xi and Yi, respectively, where i differentiates accesses to unique cache blocks. The first access to X3 and the second access to X1 occurred at times t1 and t2, respectively. C1's second access to X1 will be a cache hit if C1's number of private ways is greater than or equal to five because four unique blocks are accessed between the two accesses to X1. Alternatively, if C1's number of private ways is smaller than five and C1 shares ways with C2, X1's hit/miss is dictated by the interleaved accesses from C2. For example, if C1 has six allocated ways and two of the LRU ways are shared with C2, X3 evicts X1 from C1's private ways into a shared way. Therefore, C2's accesses between

t1 (when X1 is evicted from C1's private ways) and t2 (when X1 is re-accessed) dictate whether X1 is in a shared way or has been evicted from the cache. If C2's accesses between t1 and t2 (i.e., Y1, Y2, and Y3) evict two or more blocks into the shared ways, X1's second access will be a cache miss.

Figure 5-4. Two cores' isolated (C1, C2) and interleaved (C1&C2) access traces for an arbitrary cache set.

In order to determine the contention effects on C1's miss rate, C1's and C2's numbers of accesses during the time period (t1, t2) must be estimated. Since the number of blocks from C2 evicted into the shared ways dictates whether C1's blocks (e.g., X1 in Figure 5-4) are still in the shared ways, we calculate the probability that a given number of blocks are evicted into the shared ways to estimate C1's miss rate.

Isolated Access Trace Processing

To accumulate the isolated cache access distribution, we record the reuse distance (RD) and stack distance (SD) for each access in the isolated LLC access trace, which can be obtained using a stack-based trace-driven simulator [34]. For an accessed address T that maps to a cache set, the reuse distance is the number of accesses to that set between this access to T and the previous access to any address in the same block as T, including this access to T. The stack distance is the number of unique

block addresses, or conflicts, in this set of accesses, excluding T. For example, in Figure 5-4, C1's second access to X1 has RD = 7, since there are seven accesses between the two accesses to X1 including the second X1, and SD = 4, with the conflicts X5, X4, X2, and X3.

In each cache set, we accumulate the number of accesses for each stack distance SD (SD in [0, A]), where A is the LLC associativity. We accumulate the number of accesses with SD > A together with the number of accesses with SD = A since all accesses with SD >= A are cache misses in any configuration. Given this information, for any access, the probabilistic information for the access's stack distance is P(SD < sd), the fraction of the set's accesses whose stack distance is less than sd, and P(SD >= sd) = 1 - P(SD < sd) (sd in [1, A]). For all of the accesses for each stack distance, we accumulate a histogram of the different reuse distances and calculate the average reuse distance.

The analytical model uses the base (best-case) CPU cycles, which assume that all LLC accesses are hits, to calculate the CPU cycles required to complete the application when it is co-executed with other applications. An application's total number of CPU cycles is recorded for a single isolated execution, and the base CPU cycles are calculated by subtracting from this total the product of the number of LLC misses in the application's isolated execution and the delay cycles incurred by each LLC miss.

Since the access distributions across the cache sets are different, the distributions are individually accumulated and recorded for each set to estimate the number of misses in each set's accesses while analyzing the contention in the shared ways, which enables the miss rate of the application to be calculated. Since the analysis is the same for all cache sets, we present the analytical model for one arbitrary cache set.
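The reuse-distance and stack-distance bookkeeping described above can be sketched as follows; this is an illustration only (names invented, and the trace is assumed to be already filtered to one cache set), not the dissertation's profiler:

from collections import defaultdict

# Hedged sketch of isolated access trace processing for one cache set: compute
# each access's reuse distance (RD) and stack distance (SD) and accumulate the
# per-stack-distance access counts used to form P(SD < sd). `assoc` is the LLC
# associativity A; SD values of A or more are folded into one bucket (sure misses).

def process_set_trace(block_trace: list, assoc: int):
    history = []                               # time-ordered prior accesses (block addresses)
    counts = defaultdict(int)                  # counts[sd] = number of accesses with that SD
    for block in block_trace:
        if block in history:
            last = max(i for i, b in enumerate(history) if b == block)
            window = history[last + 1:]
            rd = len(window) + 1               # accesses since the last use, incl. this one
            sd = len(set(window))              # unique conflicting blocks, excl. this access
        else:
            sd = assoc                         # first use: a compulsory miss in any configuration
        counts[min(sd, assoc)] += 1
        history.append(block)
    total = len(block_trace)
    prob_sd_less = {sd: sum(counts[k] for k in counts if k < sd) / total
                    for sd in range(1, assoc + 1)}
    return counts, prob_sd_less

# For the C1 trace of Figure 5-4, the second access to X1 yields RD = 7 and SD = 4.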

Analysis of Contention in the Shared Ways

First, we describe the analytical model to analyze the contention in the shared ways for a sample CMP with two cores, C1 and C2, and then generalize the analytical model to any number of cores. A sharing configuration allocates K_C1 ways to core C1, where K_p,C1 ways are private and the remaining (K_s,C1 = K_C1 - K_p,C1) ways are shared with core C2. K_C2, K_p,C2, and K_s,C2 similarly denote these values for C2. For C1, all accesses with a stack distance SD <= K_p,C1 - 1 result in cache hits in the private ways, and all accesses with SD >= K_C1 are cache misses. The only undetermined cache hits/misses are the accesses where K_p,C1 - 1 < SD < K_C1, which depend on the interleaved accesses from C2. Thus, the following subsections elaborate on the estimation method for these accesses. If C1 only has private ways, then K_C1 = K_p,C1 and estimating the contention in the shared ways is not required since C1 has no ways shared with other cores, and the number of hits for C1 can be directly calculated using the per-stack-distance access counts.

Calculation of the number of accesses from C1

For an arbitrary stack distance sd in [K_p,C1, K_C1 - 1], the associated average reuse distance was determined during isolated access trace processing. This subsection presents the calculation of the number of C1's accesses during (t1, t2) for C1's accesses with stack distance sd, based on the average reuse distance. Figure 5-5 depicts C1's isolated access trace to an arbitrary cache set, where the second access to X1 has a stack distance SD and reuse distance RD. X3's access evicts X1 from C1's private ways; therefore, the numbers of conflicts before and after X3 are K_p,C1 - 1 (excluding X3) and SD - K_p,C1 + 1 (including X3), respectively. Conf_i denotes the first access of the i-th conflict with X1.
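The hit/miss classification just described can be summarized in a few lines; the sketch below is illustrative (names invented) and returns which accesses need the probabilistic contention analysis:

# Hedged sketch of the per-access classification for core C1 under a CaPPS
# sharing configuration: k_total ways allocated, k_private of them private.
# Small stack distances hit in the private ways, large stack distances miss
# regardless of contention, and the band in between is "undetermined".

def classify_access(stack_distance: int, k_private: int, k_total: int) -> str:
    if stack_distance <= k_private - 1:
        return "hit"            # fits within C1's private ways
    if stack_distance >= k_total:
        return "miss"           # would miss even if all allocated ways were private
    return "undetermined"       # depends on C2's interleaved evictions into the shared ways

# Example: 6 allocated ways, 4 private (2 shared): SD = 4 or 5 is undetermined.
assert classify_access(3, 4, 6) == "hit"
assert classify_access(5, 4, 6) == "undetermined"
assert classify_access(6, 4, 6) == "miss"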

We denote the number of accesses before X3 as n_b; n_b can be any integer in [K_p,C1 - 1, RD - SD + K_p,C1 - 2] since there are at least K_p,C1 - 1 conflicts before X3 and at least SD - K_p,C1 + 1 conflicts after (and including) X3. We denote the number of accesses after (and including) X3 as n_a, which can be calculated by n_a = RD - 1 - n_b. To simplify the computation, we represent n_b and n_a using their expected values in the subsequent calculations. After determining, for each n_b, the probability P(n_b) that the n_b accesses contain exactly K_p,C1 - 1 conflicts (the number of conflicts in the n_b accesses is deterministic in one configuration), we can calculate n_b's expected value for the evaluated configuration's associated sd using:

E[n_b] = sum over all possible n_b of ( n_b * P(n_b) )    (5-1)

and n_a's expected value is:

E[n_a] = RD - 1 - E[n_b]    (5-2)

For a particular n_b in [K_p,C1 - 1, RD - SD + K_p,C1 - 2], the probability P(n_b) is:

P(n_b) = P(E1 | E) = P(E1) * P(E2) / P(E)    (5-3)

where E1 is the event that the n_b accesses have exactly K_p,C1 - 1 conflicts and E2 is the event that the n_a accesses have exactly (SD - K_p,C1 + 1) conflicts. P(E1) and P(E2) are the occurrence probabilities of E1 and E2, respectively. E is the event that the RD - 1 accesses have exactly SD conflicts, and P(E) is the probability of E's occurrence, which is the summation of (P(E1) * P(E2)) over all possible n_b in [K_p,C1 - 1, RD - SD + K_p,C1 - 2].

To calculate P(E1) and P(E2), we examine the sufficient conditions under which E1 and E2 occur. In the example of Figure 5-5, the first access following X1 must be different from X1 (for n_b > 0); this access is Conf_1, which satisfies SD >= 1 since Conf_1 has at least one conflict: X1. The second conflict, Conf_2, satisfies SD >= 2 since Conf_2 has at least two conflicts: Conf_1 and X1. The accesses between Conf_1 and Conf_2 satisfy SD < 1 since these accesses can only be Conf_1. Conf_3 satisfies SD >= 3 since Conf_3 has at least three conflicts: Conf_2, Conf_1, and X1. The accesses between Conf_2 and Conf_3 satisfy SD < 2 since these accesses can only be Conf_2 or Conf_1, etc. Similarly, Conf_(K_p,C1 - 1) satisfies SD >= (K_p,C1 - 1), and the accesses between Conf_(K_p,C1 - 1) and X3 satisfy SD < (K_p,C1 - 1). Therefore, defining a vector d = (d_1, ..., d_(K_p,C1 - 1)), where each d_i is in [0, n_b - (K_p,C1 - 1)], P(E1) is given by Equation (5-4), where D is the set including all d satisfying sum_i d_i = n_b - (K_p,C1 - 1). The first multiplicand in the equation computes the probability that there are K_p,C1 - 1 conflicts Conf_i, and the second multiplicand computes the probability, summed over all d in D, of all of the cases in which, among the remaining n_b - (K_p,C1 - 1) accesses, exactly d_i accesses occur between Conf_i and Conf_(i+1) for each i in [1, K_p,C1 - 1]. Similarly, defining a vector g = (g_1, ..., g_(SD - K_p,C1 + 1)), where each g_i is in [0, n_a - (SD - K_p,C1 + 1)], P(E2) is given by Equation (5-5), where G is the set including all g satisfying sum_i g_i = n_a - (SD - K_p,C1 + 1).

To reduce the computational complexity, the second multiplicand in Equation (5-4) can be substituted with P_A(k, m), where k represents the number of conflicts Conf_i and m represents the remaining number of accesses among the g accesses before X3 (i.e., k = K_p,C1 - 1 and m = g - (K_p,C1 - 1)). Thus:

P(A) = P_A(K_p,C1 - 1, g - (K_p,C1 - 1))   (5-6)

P_A(k, m) can be derived using induction as:

P_A(k, m) = P_A(k, m - 1) * P(d < k) + P_A(k - 1, m) * P(d >= k)   (5-7)

with the initial cases P_A(1) = P(d < 1) and P_A(0) = 1. P_A is calculated starting from K_p,C1 - 1 = 1 (i.e., K_p,C1 = 2), since K_p,C1 = 0 indicates that there is no private way and K_p,C1 = 1 indicates one private way, in which case the first access after X1 evicts X1 into the shared ways and the number of accesses after the eviction is simply r - 1. The induction of P_A(k, m) means that, in the considered accesses, if the last access is not Conf_k, the previous accesses must contain all k conflicts Conf_i and the last access satisfies d < k; if the last access is Conf_k, the previous accesses must contain k - 1 conflicts Conf_i. Similarly, the second multiplicand in Equation (5-5) can be substituted with P_B(k, m), where k = d - K_p,C1 and m = j - (d - K_p,C1), and the induction of P_B(k, m) can be derived as:

P_B(k, m) = P_B(k, m - 1) * P(d < K_p,C1 + k) + P_B(k - 1, m) * P(d >= K_p,C1 + k)   (5-8)

with the initial cases P_B(1) = P(d < K_p,C1 + 1) and P_B(0) = 1.
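
Because the original equations did not survive digitization cleanly, the exact base cases above are uncertain; the following minimal C++ sketch only illustrates how a recurrence of the assumed form in Equation (5-7) can be evaluated with dynamic programming. The probability table prLess, holding cumulative stack-distance probabilities P(d < k), and the chosen base cases are hypothetical inputs, not values taken from the dissertation.

#include <cstdio>
#include <vector>

// Evaluate P_A(k, m) under the assumed recurrence
//   P_A(k, m) = P_A(k, m-1) * P(d < k) + P_A(k-1, m) * P(d >= k)
// with an assumed base case P_A(0, m) = 1.
// prLess[k] is an assumed table of P(d < k) for the cache set under study.
double evalPA(int k, int m, const std::vector<double>& prLess) {
    std::vector<std::vector<double>> p(k + 1, std::vector<double>(m + 1, 0.0));
    for (int col = 0; col <= m; ++col) p[0][col] = 1.0;    // assumed base case
    for (int row = 1; row <= k; ++row) {
        double less = prLess[row];
        p[row][0] = p[row - 1][0] * (1.0 - less);           // no extra accesses left
        for (int col = 1; col <= m; ++col)
            p[row][col] = p[row][col - 1] * less            // last access is an extra access
                        + p[row - 1][col] * (1.0 - less);   // last access is Conf_row
    }
    return p[k][m];
}

int main() {
    // Hypothetical cumulative probabilities P(d < k) for k = 0..4.
    std::vector<double> prLess = {0.0, 0.30, 0.55, 0.75, 0.90};
    std::printf("P_A(3, 4) = %f\n", evalPA(3, 4, prLess));
    return 0;
}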

After substituting P_A and P_B into Equations (5-4) and (5-5), respectively, P_g is calculated using Equation (5-3), and E[g] and E[j] can be determined using Equations (5-1) and (5-2).

Figure 5-5. C1's isolated access trace to an arbitrary cache set, used to illustrate the calculation of E[g] and E[j] (the trace shows, in time order, the first access to X1, the conflicts Conf1, Conf2, Conf3, ..., Conf_{K_p,C1-1}, the evicting access X3, and the reuse of X1).

Calculation of the expected number of interleaved accesses from C2

To determine the contention effects from C2, the expected number of accesses from C2 that interleave with C1's E[j] accesses, denoted n_C2, is estimated based on the ratio of the cache set access rates of C1 and C2:

n_C2 = E[j] * (A_C2 / T_C2) / (A_C1 / T_C1)   (5-9)

where A_C1 and A_C2 are the total numbers of LLC accesses from C1 and C2, respectively, and T_C1 and T_C2 are the corresponding execution times in CPU cycles.
T_C1 is the number of CPU cycles required to execute the application on C1 when C2 is co-executing, and is estimated as the isolated execution time plus the contention delay:

T_C1 = T_C1,iso + Delay   (5-10)

where Delay is the delay imposed by the shared bus contention from the higher-level caches (closer to the CPU) of each core to the shared LLC. Delay can be estimated by considering that each higher-level cache miss requests the bus twice: once when a read/write request is sent to the LLC and once when the LLC sends back the requested block. Thus, the number of bus cycles used by each higher-level cache miss is the sum of the request cycles and the block transfer cycles, and the probability p_i that a given bus cycle is used by core Ci is the total number of bus cycles consumed by Ci's higher-level cache misses divided by Ci's execution time expressed in bus cycles (i.e., scaled by the ratio of the CPU and bus frequencies f_CPU and f_bus). To calculate Delay, we assume that the bus scheduler schedules simultaneous requests from different cores randomly and that m additional cores are using/requesting the bus concurrently with C1's bus request in one bus cycle. There are three cases to consider. In the first case, the m cores are sending read/write requests to the LLC, and the expected number of delayed bus cycles for C1 is:

D_1(m) = SUM_{i=0..m} i / (m + 1) = m / 2   (5-11)

This equation indicates that C1's bus request may be serviced directly or may stall for one, two, three, or more cycles while up to m other cores are serviced, and each of the possible waiting times occurs with probability 1/(m + 1). In the second case, the m cores are sending requests on the bus to receive requested blocks from the LLC. Similar to Equation (5-11), the expected number of delayed bus cycles for C1 is:

D_2(m) = SUM_{i=0..m} i / (m + 1) = m / 2   (5-12)

In the third case, the m cores are in the process of receiving requested blocks from the LLC, and C1's request must stall while the other cores are using the bus. Defining a vector v whose entries denote, for each of the m cores, the number of bus cycles remaining to finish the transfer of its requested block, the expected number of delayed bus cycles for C1 is:

D_3(m) = SUM_{v in V} P(v) * (total remaining bus cycles in v)   (5-13)

where V is the set of all such vectors. The probability of each vector in V is equal, P(v) = 1/|V|, where |V| is the cardinality of V and is calculated as the binomial coefficient C(m + 2, 2), with C(a, b) denoting the number of combinations of b items selected from a set of a items. Considering the three cases and defining a vector u = (u_1, u_2, u_3), where u_i in [0, m] is the number of the m additional cores that fall into case i, the expected number of delayed bus cycles for C1 when m additional cores are using/requesting the bus is:

D(m) = SUM_{u in U} ( D_1(u_1) + D_2(u_2) + D_3(u_3) ) * P(u)   (5-14)

where U is the set of all u satisfying u_1 + u_2 + u_3 = m. The probability of each u in U is equal, P(u) = 1/|U|, where the cardinality |U| is calculated as C(m + 2, 2). Therefore, the Delay for C1 with respect to CPU cycles is:

Delay = 2 * Miss_C1 * SUM_{m=0..N-1} D(m) * P(m) * (f_CPU / f_bus)   (5-15)

where N is the total number of cores, 2 * Miss_C1 is the total number of bus requests from C1 (two per higher-level cache miss), and P(m) is the probability that m additional cores are using/requesting the bus when C1 is requesting the bus.

Using s_1 < s_2 < ... < s_m and t_1 < t_2 < ... < t_{N-1-m} to represent the identifiers of the cores that are and are not using/requesting the bus, respectively, when C1 is requesting the bus, P(m) can be calculated by summing, over all such selections of cores, the product of the bus-utilization probabilities of the active cores and the complements of those probabilities for the idle cores:

P(m) = SUM ( PROD_i p_{s_i} * PROD_k (1 - p_{t_k}) )   (5-16)

Calculation of P(n, j)

P(n, j) is the probability that n blocks are evicted from C2's private ways into the shared ways during j interleaved accesses from C2. We model the number of interleaved accesses from C2 using a Poisson distribution, P(j) = lambda^j * e^(-lambda) / j!, where lambda is the expected number of interleaved accesses from Equation (5-9) if the LLC is accessed randomly. Directly using only this expected number to determine the evictions would introduce a large bias (approximately 10% error) in the estimated LLC miss rate, since different numbers of interleaved accesses result in different hit/miss determinations, and a single expected value would classify all of the corresponding accesses identically as hits or misses. For example, in an extreme case in which only one shared way is available to hold the evicted block and d = K_p,C1 + 1, the accesses with this stack distance result in cache hits in a shared way only if n = 0; if the expected value were used to determine the cache hit/miss and the expected n were greater than zero, all of the accesses with this stack distance would be evaluated as cache misses, even though some individual values of n can be zero, which results in cache hits.

Moreover, since the LLC accesses are generally not random and not uniformly distributed in time (the assumption under which Equation (5-9) holds), we use an empirical variable alpha to adjust the Poisson parameter to lambda/alpha. Our experiments indicated that alpha = 5 was appropriate for our training benchmark suite, which contains a wide variety of typical CMP applications, and is thus generally applicable. Since the range of j is infinite in the Poisson distribution, and values of j with very small P(j) have a minimal effect on the miss rate estimation, we only consider the j with P(j) > 0.01 and calculate the associated P(n, j). To calculate P(n, j) for an arbitrary j, n is determined by evaluating the j accesses in chronological order with an initial value of n = 0. If one access has d > K_p,C2, fetching this address into C2's private ways evicts one block into the shared ways, and n is incremented by 1. n then remains the same until a subsequent access has d > K_p,C2 + 1, in which case one additional block is evicted into the shared ways when the accessed block is fetched into C2's private ways, and n is incremented again. The same condition (i.e., one access with d > K_p,C2 + n) for incrementing n applies to the remaining accesses; therefore, we can calculate P(n, j) inductively:

P(n, j) = P(n - 1, j - 1) * P(d > K_p,C2 + n - 1)                                            if n = j
P(n, j) = P(n, j - 1) * P(d <= K_p,C2 + n) + P(n - 1, j - 1) * P(d > K_p,C2 + n - 1)         if 0 < n < j
P(n, j) = P(n, j - 1) * P(d <= K_p,C2)                                                       if n = 0
   (5-17)

with the initial case P(n = 0, j = 0) = 1. There are three cases in the induction. In the first case, n = j: every access evicts one additional block into the shared ways; thus, among the j accesses, the previous j - 1 accesses evict n - 1 blocks and the last access evicts one more, satisfying d > K_p,C2 + (n - 1).

In the second case, 0 < n < j: among the j accesses, either the previous j - 1 accesses have already evicted n blocks into the shared ways, in which case the last access satisfies d <= K_p,C2 + n, or the previous j - 1 accesses have evicted n - 1 blocks, in which case the last access satisfies d > K_p,C2 + (n - 1) and evicts one additional block. In the third case, n = 0: no block is evicted into the shared ways in the j accesses.

Calculation of the LLC miss rates

Considering the impact of C2's evictions on the accesses with stack distances between K_p,C1 + 1 and the total number of ways allocated to C1, the number of cache hits for C1 is:

H_C1 = H_private + SUM_d ( N_d * P_hit(d) )   (5-18)

where the first addend calculates the total number of hits in the private ways and the second addend calculates the total number of hits in the shared ways for the accesses in this stack distance range. P_hit(d) is the probability of a hit for the N_d accesses with stack distance d, which is calculated as:

P_hit(d) = SUM_{j : P(j) > 0.01} P(j) * SUM_n P(n, j)   (5-19)

The calculation of P_hit(d) includes all of the j with P(j) > 0.01 and, for each j, only those n for which the number of blocks evicted from C2's private ways is small enough that C1's access with stack distance d still hits in the shared ways. After accumulating H_C1 over all cache sets, the number of LLC misses and the LLC miss rates can be determined.
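
Read together, Equations (5-9), (5-17), and (5-19) describe a weighted sum over possible amounts of interference. The following C++ sketch illustrates that flow under stated assumptions: a single per-access eviction probability prEvict replaces the dissertation's stack-distance-conditioned terms in Equation (5-17), and lambdaIso, alpha, freeWays, and the 0.01 cutoff are illustrative values rather than measured ones.

#include <cmath>
#include <cstdio>
#include <vector>

// Truncated-Poisson weighting of interleaved-access counts, combined with a
// simple eviction-count recurrence, to estimate a shared-way hit probability.
double poissonPmf(double lambda, int j) {
    return std::exp(-lambda + j * std::log(lambda) - std::lgamma(j + 1.0));
}

// P[j][n]: probability that exactly n evictions occur in j accesses, assuming
// each access independently causes an eviction with probability prEvict (a
// simplification of Equation (5-17), whose thresholds depend on how many
// blocks have already been evicted).
std::vector<std::vector<double>> evictionDistribution(int jMax, double prEvict) {
    std::vector<std::vector<double>> P(jMax + 1, std::vector<double>(jMax + 1, 0.0));
    P[0][0] = 1.0;                                   // initial case P(n = 0, j = 0) = 1
    for (int j = 1; j <= jMax; ++j)
        for (int n = 0; n <= j; ++n)
            P[j][n] = P[j - 1][n] * (1.0 - prEvict)
                    + (n > 0 ? P[j - 1][n - 1] * prEvict : 0.0);
    return P;
}

int main() {
    double lambdaIso = 6.0;   // expected interleaved accesses (role of Equation (5-9))
    double alpha     = 5.0;   // empirical adjustment from the text
    double lambda    = lambdaIso / alpha;
    double prEvict   = 0.4;   // hypothetical per-access eviction probability
    int    freeWays  = 2;     // shared ways still available to the evaluated access
    int    jMax      = 32;

    auto P = evictionDistribution(jMax, prEvict);

    // Equation (5-19)-style accumulation: only j with P(j) > 0.01 contribute,
    // and the access hits if fewer than freeWays blocks were evicted.
    double hitProb = 0.0;
    for (int j = 0; j <= jMax; ++j) {
        double pj = poissonPmf(lambda, j);
        if (pj <= 0.01) continue;
        for (int n = 0; n < freeWays && n <= j; ++n) hitProb += pj * P[j][n];
    }
    std::printf("estimated hit probability = %f\n", hitProb);
    return 0;
}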

Finally, we generalize the analytical model to estimate the LLC miss rate for any core Ci when several additional cores share cache ways with Ci. The expected number of interleaved accesses from each additional core during Ci's reuse window is calculated, and the corresponding access-count and eviction probabilities are estimated in the same manner as P(j) and P(n, j) were estimated for C2. The generalized expression of Equation (5-18) is:

H_Ci = H_private + SUM_d ( N_d * P_hit(d) )   (5-20)

where:

P_hit(d) = SUM P(j_1, ..., j_c) * SUM P(n_1, j_1) * ... * P(n_c, j_c)   (5-21)

in which the outer summation ranges over the combinations of interleaved access counts from the c sharing cores whose probability exceeds 0.01, and the inner summation includes all combinations of eviction counts for which the total number of evicted blocks is still small enough for Ci's access to hit in the shared ways.

According to Equation (5-10), a circular dependency exists: the execution time is used to estimate the contention delay, and the delay is used to calculate the execution time. The solution cannot be represented in closed form; thus, we solve iteratively. The initial miss rate is acquired by assuming there is no contention (i.e., all of the allocated ways are privately used by Ci) and is used in Equation (5-10) to calculate the initial execution time. The execution time is provided back into the analytical model to update the miss rate, and the new miss rate is used to update the execution time. This iterative process continues until a stable execution time (with a precision of 0.001%) is achieved. Experimental results indicated that only four iterations were required for the results to converge.
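
The iterative solution can be pictured with a short C++ sketch. The two model functions below are placeholders standing in for the full analytical model and Equation (5-10); only the iterate-until-0.001%-stable structure reflects the text.

#include <cmath>
#include <cstdio>

// Placeholder: contention (and hence miss rate) grows as the co-runners
// inject more accesses into a window of length T.
double missRateGivenTime(double T) {
    return 0.10 + 0.02 * std::tanh(T / 1.0e8);
}

// Placeholder: execution time grows with the miss penalty.
double timeGivenMissRate(double missRate) {
    const double baseCycles = 1.0e8, missPenaltyCycles = 4.0e7;
    return baseCycles + missPenaltyCycles * missRate;
}

int main() {
    // Initial value: assume no contention (all ways private), as in the text.
    double T = timeGivenMissRate(missRateGivenTime(0.0));
    for (int iter = 1; iter <= 20; ++iter) {
        double miss = missRateGivenTime(T);
        double Tnew = timeGivenMissRate(miss);
        double relChange = std::fabs(Tnew - T) / T;
        T = Tnew;
        std::printf("iter %d: T = %.1f cycles, miss = %.4f\n", iter, T, miss);
        if (relChange < 1.0e-5) break;   // 0.001% precision, as in the text
    }
    return 0;
}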

The analytical model's runtime complexity depends on the shared LLC's associativity, the number of cores in the CMP, the evaluated sharing configuration (such as the number of ways shared among cores), and the isolated cache access distribution for each application (such as the average reuse distance for each stack distance value). Due to the large number of complex and interdependent variables and unknowns, the complexity analysis is intractable; thus, we evaluate the analytical model's measured execution time in the experiments.

Experiment Results

Our experiments evaluated the accuracy and time efficiency of the analytical model in estimating the LLC miss rates for CaPPS. We also verified the advantages of partial sharing as compared to two baseline configurations, private partitioning, and constrained partial sharing.

Experiment Setup

We used twelve benchmarks from the SPEC CPU2006 suite [61], which were compiled to Alpha_OSF binaries and executed using the ref input data sets. Due to incorrect execution, we could not evaluate the complete suite. Even though our work targets embedded systems, we did not use embedded system benchmark suites, since these suites contain only small kernels that do not sufficiently access the LLC and do not represent our targeted embedded CMP domain. Since complete execution of the large SPEC benchmarks prohibits exhaustive examination of the entire CaPPS design space, and since most embedded benchmarks have stable behavior during execution, for each SPEC benchmark we selected 500 million consecutive instructions with similar behavior to use as the simulation interval, mimicking an embedded application with high LLC occupancy.

To select the simulation interval, we performed phase classification on the SPEC benchmarks using SimpleScalar 3.0d [6] and SimPoint 3.2 [31]. Within a benchmark's entire execution, non-overlapping intervals with a fixed length of 100 million instructions were classified into phases, where all of the intervals in the same phase had similar behavior. Since all of the intervals belonging to a phase were not necessarily contiguous, we selected five contiguous intervals that were classified as belonging to the same phase to form the benchmark's simulation interval; a sketch of this selection follows Table 5-1. The SPEC benchmarks' phases were long enough that every benchmark had five contiguous intervals belonging to the same phase. Table 5-1 lists the starting instructions for the benchmarks' simulation intervals. Our work can easily be extended to applications with varying behavior (i.e., multiple phases throughout execution) by integrating offline phase change detection methodologies [31] [59].

Table 5-1. The starting instructions (counted from the beginning of the benchmark's execution) for the benchmarks' simulation intervals

Benchmark       Starting instruction (million)    Benchmark        Starting instruction (million)
400.perlbench   6000                              445.gobmk        13500
401.bzip2       64100                             456.hmmer        97000
429.mcf         13500                             458.sjeng        10700
433.milc        30000                             462.libquantum   67800
435.gromacs     15400                             464.h264ref      13500
444.namd        13500                             473.astar        19400
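
As referenced above, selecting five contiguous same-phase intervals reduces to a scan over per-interval phase labels. The following C++ sketch assumes the labels have already been parsed from the SimPoint output (the parsing step and the label values shown are hypothetical).

#include <cstdint>
#include <cstdio>
#include <vector>

// Find the first run of runLen contiguous intervals sharing a phase label and
// report the starting instruction of that run.
int64_t findSimulationStart(const std::vector<int>& phaseLabels,
                            int64_t intervalLen = 100000000LL,
                            int runLen = 5) {
    int run = 1;
    for (size_t i = 1; i < phaseLabels.size(); ++i) {
        run = (phaseLabels[i] == phaseLabels[i - 1]) ? run + 1 : 1;
        if (run == runLen)
            return static_cast<int64_t>(i - runLen + 1) * intervalLen;
    }
    return -1;   // no run of the requested length
}

int main() {
    // Made-up per-interval phase labels; intervals 3..7 form the same phase.
    std::vector<int> labels = {3, 3, 1, 2, 2, 2, 2, 2, 4, 1};
    int64_t start = findSimulationStart(labels);
    if (start >= 0)
        std::printf("simulation interval starts at instruction %lld million\n",
                    static_cast<long long>(start / 1000000));
    return 0;
}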

We generated the exact cache miss rates for comparison purposes using GEM5 [8] and modeled four in-order cores with the TimingSimple CPU model, which stalls the CPU when fetching from the caches and memory. Each core had private level one (L1) instruction and data caches. The unified level two (L2) cache and all lower-level memory hierarchy components were shared among all cores. We modified the L2 cache replacement operation in GEM5 to model the shared LLC for CaPPS. Table 5-2 shows the parameters used for each system component. Since four cores shared the eight-way LLC, there were 3,347 configurations in the CaPPS design space.

Table 5-2. CMP system parameters

Components                  Parameters
CPU                         2 GHz clock, 1 thread
L1 instruction cache        Private, total size of 8 KB, block size of 64 B, 2-way associativity, LRU replacement, access latency of 2 CPU cycles
L1 data cache               Private, total size of 8 KB, block size of 64 B, 2-way associativity, LRU replacement, access latency of 2 CPU cycles
L2 unified cache            Shared, total size of 1 MB, block size of 64 B, 8-way associativity, LRU replacement, access latency of 20 CPU cycles
Memory                      3 GB size, access latency of 200 CPU cycles
L1 caches to L2 cache bus   Shared, 64 B width, 1 GHz clock, first-come first-serve (FCFS) scheduling
Memory bus                  64 B width, 1 GHz clock

Before CaPPS simulation, we executed each benchmark in isolation during the benchmark's simulation interval and recorded the isolated LLC access traces and the CPU cycles. For CaPPS simulation, we arbitrarily selected four benchmarks to be co-executed, which formed a benchmark set, and we evaluated sixteen benchmark sets. Since the four benchmarks' simulation intervals were at different execution points, we forced the system to simultaneously begin executing the benchmarks at the benchmarks' associated simulation intervals' starting instructions using a full-system checkpoint. A full-system checkpoint gives a snapshot of the four-core system state, including the register state, the memory state, the system calls' inputs and outputs, etc. To create the full-system checkpoint, the CMP simulator must terminate each core individually after that core reaches the simulation interval's starting instruction for that core's benchmark. However, since this instruction number is different for each benchmark and GEM5 does not support selective core termination (all cores must terminate simultaneously), we created the full-system checkpoint by aggregating the checkpoints of the individual benchmarks executing in isolation.

We refer to these checkpoints as isolated-benchmark checkpoints, and we generated these checkpoints by fast-forwarding each benchmark to the starting instruction of the benchmark's corresponding simulation interval. For each simulation, the full-system checkpoint was restored and then the system started execution. The system execution was terminated when any core reached 500 million instructions. Due to varying CPU stall cycles across the benchmarks, at the termination point not all cores had completed executing the simulation interval. However, this termination approach guaranteed that the cache miss rates reflected a fully loaded system (i.e., full LLC contention, since all cores were running during the entire system execution). Since we focused on the cache miss rates and not the absolute number of cache misses, the incomplete benchmark executions had no impact on our evaluation. Similarly, due to the statistical nature of the predictions, the applications are not required to begin execution simultaneously to garner accurate results. Although our experiments used only four cores and the LLC was a shared 8-way L2 cache, the analytical model itself places no limitations on the number of cores, the LLC's hierarchical level, or the cache parameters (e.g., total size, block size, and associativity in our experiments).

Analytical Model Evaluation

We verified the accuracy of our estimated LLC miss rates obtained using the analytical model and evaluated the analytical model's ability to determine the optimal (minimum LLC miss rate) configuration in the CaPPS design space. Additionally, we illustrated the analytical model's efficiency by comparing the time required to calculate the LLC miss rates against using a cycle-accurate simulator to generate the exact cache miss rates for all configurations.

Accuracy evaluation

For each benchmark set, we compared the average LLC miss rate for the four cores determined by the analytical model with the exact miss rate determined by GEM5 for each configuration in CaPPS's design space. We calculated the average and standard deviation of the miss rate errors across the 3,347 configurations. Figure 5-6 depicts the results for each benchmark set. The black markers indicate the average miss rate errors, and the upper and lower ranges of the gray bars relative to the black markers show the corresponding standard deviation. On average, over all sixteen benchmark sets, the absolute average miss rate error and standard deviation were 0.73% and 1.30%, respectively.

Figure 5-6. The average and standard deviation of the average LLC miss rate error determined by the analytical model.

Since the analytical model's estimated cache miss rates are not exact, we compared the absolute difference between the LLC miss rates of the analytical model's minimum-miss-rate configuration and the actual minimum LLC miss rate configuration as determined via exhaustive search.

Comparing with an exhaustive search is appropriate for evaluating the analytical model's efficacy, which is only affected by the estimated miss rate errors when determining the optimal configuration. The results indicate that for fourteen of the sixteen benchmark sets the differences were less than 1%, and the maximum and average differences over all benchmark sets were small: 1.3% and 0.36%, respectively.

Simulation time evaluation

To evaluate the execution time efficiency of the analytical model, we compared the time required to estimate the LLC miss rates (including the time for isolated access trace generation) for all configurations in the CaPPS design space against using GEM5. We compared to exhaustive exploration since we are the first to propose CaPPS; therefore, there is no heuristic search to compare to, and a designer would have to use exhaustive exploration. Furthermore, since both the analytical model and GEM5 evaluate each configuration individually, the average simulation time speedups were nearly independent of the number of evaluated configurations. Therefore, the analytical model would show similar speedups even if compared with a heuristic search, since the heuristic method could be leveraged by both the analytical model and GEM5. We implemented the analytical model in C++ compiled with -O3 optimizations. We tabulated the user time reported by the Linux time command for the simulations running on a Red Hat Linux Server v5.2 with a 2.66 GHz processor and 4 gigabytes of RAM. Figure 5-7 depicts the speedup of the analytical model for each benchmark set as compared to GEM5. Over all benchmark sets, the average speedup was 3,966X, with maximum and minimum speedups of 13,554X and 1,277X, respectively. For one benchmark set, the time for simulating all 3,347 configurations using GEM5 was approximately three months; comparatively, the analytical model took only two to three hours.

Figure 5-7. The analytical model's simulation time speedup as compared to GEM5.

CaPPS Evaluation

To validate the advantages of CaPPS, we compared CaPPS's ability to reduce the LLC miss rate (i.e., the optimal configuration) against two baseline configurations and against configurations proposed in prior works, including private partitioning and constrained partial sharing.

Comparison with baseline configurations

Figure 5-8 depicts the average LLC miss rate reductions for CaPPS's optimal configurations (the configurations with the minimum average LLC miss rate in CaPPS's design space) as compared to two baseline configurations: 1) even-private-partitioning, in which the LLC is evenly partitioned using private partitioning; and 2) fully shared, in which the LLC is fully shared by all cores. Across all benchmark sets, the average and maximum average LLC miss rate reductions for CaPPS's optimal configurations were 25.58% and 50.15%, respectively, as compared to even-private-partitioning, and 19.39% and 41.10%, respectively, as compared to fully shared.

Figure 5-8. Average LLC miss rate reductions for CaPPS's optimal configurations as compared to the two baseline configurations: even-private-partitioning and fully shared.

Comparison with private partitioning

We compared CaPPS with private partitioning since prior works typically partition shared caches using private partitioning. Figure 5-9 depicts the average LLC miss rate reductions for CaPPS's optimal configurations as compared to private partitioning's optimal configuration, which is the configuration with the minimum LLC miss rate in private partitioning's design space of 35 configurations, approximately 1% of CaPPS's design space. Across all benchmark sets, the average and maximum reductions in CaPPS's average LLC miss rates as compared to private partitioning were 16.92% and 43.02%, respectively. The first three benchmark sets in the figure showed small reductions (less than 2.5%), which indicates that for these combinations of co-executing applications, exploring the private partitioning design space is sufficient to obtain low LLC miss rates.

Figure 5-9. Average LLC miss rate reductions for CaPPS's optimal configurations as compared to private partitioning's optimal configurations for a 1 MB LLC.

Since private partitioning was sufficient for only three of the benchmark sets, we further evaluated the results to determine which combinations of co-executing applications benefit most from CaPPS's increased design space. We examined the benchmarks' temporal locality characteristics using an in-house stack-based trace-driven cache simulator [34] to process the isolated L2 cache access traces. Figure 5-10 depicts the LLC miss rates for varying numbers of cache ways and sizes for each benchmark executing in isolation. Based on these results, we determined the best number of cache ways for each benchmark, which allowed the benchmark's entire working set to fit into the LLC. The best number of cache ways is the number of cache ways beyond which allocating additional ways does not reduce the miss rate, and therefore wastes cache resources.
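
The best number of cache ways can be read off each curve in Figure 5-10 as the point where the miss rate stops improving. The following C++ sketch shows one way to automate that reading; the 0.5% improvement tolerance and the sample miss-rate curve are assumptions for illustration, not values used in the dissertation.

#include <cstdio>
#include <vector>

// Smallest way count beyond which adding ways no longer reduces the isolated
// LLC miss rate by more than the tolerance.
int bestNumberOfWays(const std::vector<double>& missRateByWays,
                     double tolerance = 0.005) {
    // missRateByWays[w - 1] is the miss rate with w ways allocated.
    int best = static_cast<int>(missRateByWays.size());
    for (size_t w = 1; w < missRateByWays.size(); ++w) {
        if (missRateByWays[w - 1] - missRateByWays[w] < tolerance) {
            best = static_cast<int>(w);   // allocating way w + 1 no longer helps
            break;
        }
    }
    return best;
}

int main() {
    // Hypothetical curve resembling a "group two" benchmark: a plateau at 4 ways.
    std::vector<double> missRate = {0.60, 0.41, 0.25, 0.18, 0.177, 0.176, 0.176, 0.175};
    std::printf("best number of ways = %d\n", bestNumberOfWays(missRate));
    return 0;
}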

Figure 5-10. LLC miss rates for different numbers of cache ways and cache sizes when the benchmarks execute in isolation (one panel per benchmark: 401.bzip2, 473.astar, 435.gromacs, 444.namd, 458.sjeng, 445.gobmk, 464.h264ref, 400.perlbench, 456.hmmer, 429.mcf, 462.libquantum, and 433.milc; each panel plots the LLC cache miss rate versus the number of ways for 1 MB, 512 KB, and 256 KB LLC sizes).

We classified the benchmarks into three groups based on the benchmarks' best number of cache ways by evaluating the LLC miss rate trends in Figure 5-10: (1) the LLC miss rates for 401.bzip2, 473.astar, and 435.gromacs show a continual decrease as the number of cache ways increases, and the best number of cache ways for these benchmarks is the maximum LLC associativity;

(2) the LLC miss rates for 444.namd, 458.sjeng, 445.gobmk, 464.h264ref, 400.perlbench, 456.hmmer, and 429.mcf reach a minimum plateau at a certain number of ways, and the plateau point is the best number of cache ways for the associated benchmark; and (3) the LLC miss rates for 462.libquantum and 433.milc are independent of the number of cache ways, since the majority of the cache accesses are compulsory misses, and the best number of cache ways for these benchmarks is one.

We can extend this benchmark analysis to predict the LLC partitioning requirements of any set of co-executing applications. If the summation of the best numbers of cache ways for all of the co-executing applications exceeds the LLC associativity, a privately partitioned cache cannot meet the applications' demands. Partial sharing enables co-executing applications to collectively share the LRU ways, allowing a larger number of cache ways to be allocated to each core to further reduce the LLC miss rates. However, if the LLC is large enough to accommodate the summation of the applications' best numbers of cache ways, which is the case for the first three benchmark sets in Figure 5-9, private partitioning satisfies the applications' requirements and partial sharing is not necessary.

We also evaluated the benefits of partial sharing for small LLCs, since energy- and size-constrained embedded systems must typically use a small LLC. Since the number of cache sets decreases with the cache size, more blocks may map to the same cache set, which may increase cache conflicts. This is evident in Figure 5-10, where the miss rate trends for the 512 KB and 256 KB LLCs show that the best number of cache ways for the group two benchmarks increases as the LLC size decreases. Additionally, the group one and group two benchmarks' LLC miss rates decrease more rapidly in a small LLC as compared to a large LLC.

Therefore, more cache ways are required in a small LLC to obtain low miss rates as compared to a large LLC. Partial sharing alleviates this requirement by enabling larger quotas to be assigned to group one and group two applications for small LLCs, which can significantly improve overall cache performance as compared to private partitioning.

Figure 5-11. Average LLC miss rate reductions for CaPPS's optimal configurations as compared to private partitioning for 512 KB and 256 KB LLCs, respectively.

To further verify that CaPPS is more suitable than private partitioning for embedded systems with small cache sizes, we created six benchmark sets. Each set contained four arbitrarily selected benchmarks such that the summation of the benchmarks' best numbers of cache ways exceeded the LLC associativity. Figure 5-11 depicts the average LLC miss rate reductions for CaPPS's optimal configurations as compared to private partitioning's optimal configurations for 512 KB and 256 KB LLCs, respectively. For the 512 KB LLC, the average and maximum LLC miss rate reductions were 28.99% and 69.75%, respectively, and for the 256 KB LLC they were 30.63% and 45.36%, respectively. The first five benchmark sets showed that the LLC miss rate reductions increase as the cache size decreases. For the last benchmark set, the LLC miss rate reduction for the 256 KB LLC was smaller than for the 512 KB LLC, since the 256 KB cache was too small to accommodate all of the applications' data, and the LLC miss rate reduction afforded by partial sharing in such a small LLC could not be as prominent as in a larger LLC.

Comparison with constrained partial sharing

Even though CaPPS targets shared cache partitioning, the fundamentals associated with partitioning shared and private caches are similar; thus, we compare to prior works using constrained partial sharing [17] [36] [45] [62]. If a shared LLC is evenly partitioned across the cores and each partition is considered as the core's private LLC, the constrained partial sharing design space is a subset of CaPPS's design space. In the related works on LLC partitioning, we classified prior works into two kinds of constrained partial sharing, subset sharing and joint sharing, based on the partitioning and sharing configurability. Using the experiment settings in Table 5-2, the numbers of configurations in the subset and joint sharing design spaces are 15 and 81, which account for 0.4% and 2% of CaPPS's design space, respectively. Since CaPPS determines optimal configurations offline, whereas prior works on constrained partial sharing determine configurations online by monitoring the cache performance and then greedily or heuristically selecting the configurations, providing a fair comparison is difficult. Since prior constrained partial sharing works used online greedy/heuristic methods, the determined configurations may be suboptimal. Therefore, we determined the optimal offline configurations for subset and joint sharing via exhaustive exploration of their design spaces, and evaluated the miss rate reduction of CaPPS's optimal configurations as compared to the subset and joint sharing optimal offline configurations. Exhaustive exploration ensures that the optimal configurations were determined for subset and joint sharing, which places a lower bound on the results; in practice, CaPPS's miss rate reductions would likely be greater than reported.

Figure 5-12. Average LLC miss rate reductions for CaPPS's optimal configurations as compared to the two constrained partial sharing design spaces.

Figure 5-12 depicts the average LLC miss rate reductions for CaPPS's optimal configurations as compared to the subset and joint sharing optimal offline configurations for a 1 MB LLC. The average and maximum LLC miss rate reductions for CaPPS's optimal configurations were 14.20% and 29.99%, respectively, as compared to subset sharing, and 13.61% and 31.18%, respectively, as compared to joint sharing.

Summary

In this chapter, we presented cache partitioning with partial sharing (CaPPS), a novel cache partitioning and sharing architecture that improves shared last-level cache (LLC) performance with low hardware overhead for chip multi-processor systems (CMPs). Our experiments showed that CaPPS reduced the average LLC miss rates by 20% to 25% as compared to two baseline configurations, by 17% as compared to private partitioning, and by 14% as compared to constrained partial sharing.

To quickly estimate the miss rates of CaPPS's sharing configurations, we developed an offline analytical model that achieved an average miss rate estimation error of only 0.73%. As compared to exhaustive exploration of the CaPPS design space to determine the lowest-energy cache configuration, the analytical model affords an average speedup of 3,966X. Finally, CaPPS and the analytical model are applicable to CMPs with any number of cores and place no limitation on the configurable cache parameters.

CHAPTER 6
CONCLUSIONS AND FUTURE WORK

In this dissertation, we investigated two offline cache optimization methods: cache parameter tuning and cache partitioning. By configuring the cache total size, block size, and associativity for a particular application or its phases, overall system performance and energy savings can be improved. Additionally, in CMPs, the contention over shared LLC resources increases as the number of cores increases. Cache partitioning eliminates the LLC contention and configures the partitioning to the allocated cores' requirements to improve cache performance.

For single-core systems, to facilitate fast design-time cache parameter tuning, we proposed the single-pass trace-driven cache simulation methodologies T-SPaCS and U-SPaCS for two-level exclusive instruction/data cache and unified cache hierarchies, respectively. Direct adaptation of conventional trace-driven cache simulation to two-level caches requires prohibitive storage and simulation time, since numerous stacks are required to record the cache access patterns for each combination of level one and level two cache configurations and each stack is repeatedly processed. T-SPaCS and U-SPaCS significantly reduce storage space and simulation time using a small set of stacks that record the complete cache access pattern independently of the cache configuration. Thereby, T-SPaCS and U-SPaCS can simulate all cache configurations for both the level one and level two caches simultaneously in a single pass. Experimental results show that T-SPaCS is 21X faster on average than sequential simulation for instruction caches and 33X faster for data caches. A simplified, but minimally lossy, version of T-SPaCS (simplified T-SPaCS) increases the average simulation speedup to 30X for instruction caches and 41X for data caches.

We leveraged T-SPaCS and simplified T-SPaCS to determine the lowest-energy cache configuration and quantify the effects of lossiness, and observed that T-SPaCS and simplified T-SPaCS determine the same lowest-energy cache configuration as exact simulation, even in the presence of cache miss rate inaccuracies of 1% on average. Additionally, U-SPaCS can accurately determine the cache miss rates for a configurable cache design space consisting of 2,187 cache configurations with a 41X speedup in average simulation time as compared to the most widely used sequential trace-driven cache simulation.

For shared LLC partitioning in CMPs, we proposed cache partitioning with partial sharing (CaPPS), which reduces LLC contention via cache partitioning and improves utilization via sharing configuration. Sharing configuration enables the partitions to be privately allocated to a single core, partially shared with a subset of cores, or fully shared with all cores based on dynamic runtime application requirements. CaPPS imposes low hardware overhead and affords an extensive design space to increase the optimization potential. Building on T-SPaCS and U-SPaCS, and to facilitate fast CaPPS design space exploration, we developed an analytical model to quickly estimate the miss rates of all CaPPS configurations using the applications' isolated LLC access traces. Experimental results demonstrate that the analytical model estimates cache miss rates with an average error of only 0.73% and with an average speedup of 3,966X as compared to a cycle-accurate simulator. Due to CaPPS's extensive design space as compared to the private partitioning and constrained partial sharing of prior works, CaPPS can reduce the average LLC miss rate by as much as 25% as compared to baseline configurations and by as much as 17% as compared to prior works.

Although offline cache tuning has advantages such as no runtime overhead and no intrusion into normal system execution, online cache tuning is still attractive, since the in-system exploration capability enables dynamic cache tuning to adaptively react to changes in the system's environment and inputs, thereby potentially resulting in more accurate cache configurations. Therefore, future work includes implementing T-SPaCS and U-SPaCS in hardware for dynamic cache tuning. In CaPPS, the analytical model could theoretically be incorporated into the operating system's scheduling routine to analyze the scheduled co-executing applications for dynamic cache partitioning. However, in practice, online usage of the analytical model would introduce large overheads, such as long execution stalls and significant energy consumed while running the analytical model. Thus, simplifying the analytical model to be more amenable to online cache partitioning is challenging and is left as future work. Future work also includes extending the analytical model to optimize for any design goal (e.g., performance, energy-delay product, etc.), leveraging the offline analytical results to guide online scheduling for performance optimizations in real-time embedded systems, including accesses to shared address spaces, and incorporating cache prefetching into our analytical model. Since the LLC is typically very large and requires long access latencies that may be non-uniform across cache banks (e.g., non-uniform cache access (NUCA) architectures), proximity-aware cache partitioning and the allocation of cache partitions to each core in NUCA caches [17] [36] [47] is a promising research area. Therefore, in future work, CaPPS can be extended to proximity-aware cache partitioning for NUCA caches.

LIST OF REFERENCES

[1] D. H. Albonesi, Selective Cache Ways: On-demand Cache Resource Allocation, In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pp. 248-259, Nov. 1999.
[2] J. M. Anderson, L. M. Berc, J. Dean, S. Ghemawat, M. R. Henzinger, S. A. Leung, R. L. Sites, M. T. Vandevoorde, C. A. Waldspurger, and W. E. Weihl, Continuous Profiling: Where Have All the Cycles Gone? ACM Transactions on Computer Systems, vol. 15, no. 4, pp. 357-390, Nov. 1997.
[3] ARC cores, http://www.synopsys.com/IP/ProcessorIP/ARCProcessors/Pages/default.aspx, 2013.
[4] ARM, http://www.arm.com/products/processors/, 2013.
[5] ARM 1156 Processor, http://www.arm.com/products/processors/classic/arm11, 2013.
[6] T. Austin, E. Larson, and D. Ernst, SimpleScalar: an Infrastructure for Computer System Modeling, IEEE Computer, vol. 35, no. 2, pp. 59-67, Feb. 2002.
[7] R. Balasubramonian, N. P. Jouppi, and N. Muralimanohar, Multi-core Cache Hierarchies, Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, Nov. 2011.
[8] N. Binkert, et al., The gem5 Simulator, http://gem5.org/Main_Page, 2013.
[9] M. Brehob and R. J. Enbody, An Analytical Model of Locality and Caching, Technical Report MSU-CSE-99-31, Michigan State University, Sept. 1999.
[10] D. Burger, T. Austin, and S. Bennet, Evaluating Future Microprocessors: the SimpleScalar Toolset, Technical Report CS-TR-1308, University of Wisconsin-Madison, Computer Science Department, July 2000.
[11] CACTI, http://www.hpl.hp.com/research/cacti/, 2013.
[12] D. Chandra, F. Guo, S. Kim, and Y. Solihin, Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture, In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pp. 340-351, Feb. 2005.
[13] X. E. Chen and T. M. Aamodt, A First-Order Fine-Grained Multithreaded Throughput Model, IEEE 15th International Symposium on High Performance Computer Architecture, pp. 329-340, Feb. 2009.

[14] D. Chiou, L. Rudolph, S. Devadas, and B. S. Ang, Dynamic Cache Partitioning via Columnization, Computation Structures Group Memo 430, Massachusetts Institute of Technology, Nov. 1999.
[15] J. Dean, J. E. Hicks, C. A. Waldspurger, W. E. Weihl, and G. Chrysos, ProfileMe: Hardware Support for Instruction-level Profiling in Out-of-order Processors, In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, pp. 292-302, Dec. 1997.
[16] C. Ding and Y. Zhong, Predicting Whole-program Locality through Reuse Distance Analysis, In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 245-257, May 2003.
[17] H. Dybdahl and P. Stenstrom, An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors, IEEE 13th International Symposium on High Performance Computer Architecture, pp. 10-14, Feb. 2007.
[18] J. Edler and M. D. Hill, Dinero IV Trace-driven Uniprocessor Cache Simulator, http://www.cs.wisc.edu/~markhill/DineroIV, 1998.
[19] J. Edmondson, P. I. Rubinfeld, P. J. Bannon, et al., Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor, Digital Technical Journal, Special 10th Anniversary Issue, vol. 7, no. 1, pp. 119-135, Jan. 1995.
[20] EEMBC, the Embedded Microprocessor Benchmark Consortium, www.eembc.org, 2013.
[21] D. Eklov, D. Black-Schaffer, and E. Hagersten, Fast Modeling of Shared Caches in Multicore Systems, In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers, pp. 147-157, Jan. 2011.
[22] C. Fang, S. Carr, S. Onder, and Z. Wang, Reuse-distance-based Miss-rate Prediction on a Per-instruction Basis, In Proceedings of the 2004 Workshop on Memory System Performance, pp. 60-68, June 2004.
[23] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge, Drowsy Caches: Simple Techniques for Reducing Leakage Power, In Proceedings of the 29th Annual International Symposium on Computer Architecture, pp. 148-157, May 2002.
[24] H. Ghasemzadeh, S. Mazrouee, H. G. Moghaddam, H. Shojaei, and M. R. Kakoee, Hardware Implementation of Stack-Based Replacement Algorithms, In Proceedings of World Academy of Science and Technology, vol. 16, pp. 135-139, Nov. 2006.
[25] S. Ghosh, M. Martonosi, and S. Malik, Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior, ACM Transactions on Programming Languages and Systems, vol. 21, no. 4, pp. 703-746, July 1999.

[26] A. Ghosh and T. Givargis, Cache Optimization for Embedded Processor Cores: An Analytical Approach, ACM Transactions on Design Automation of Electronic Systems, vol. 9, no. 4, pp. 419-440, Oct. 2004.
[27] A. Gordon-Ross, F. Vahid, and N. Dutt, Automatic Tuning of Two-Level Caches to Embedded Applications, IEEE/ACM Design, Automation and Test in Europe, pp. 208-213, Feb. 2004.
[28] A. Gordon-Ross, P. Viana, F. Vahid, W. Najjar, and E. Barros, A One-Shot Configurable-Cache Tuner for Improved Energy and Performance, IEEE/ACM Design, Automation and Test in Europe, pp. 1-6, Apr. 2007.
[29] A. Gordon-Ross, J. Lau, and B. Calder, Phase-Based Cache Reconfiguration for a Highly Configurable Two-Level Cache Hierarchy, ACM Great Lakes Symposium on VLSI, pp. 323-337, May 2008.
[30] A. Gordon-Ross, F. Vahid, and N. Dutt, Fast Configurable-Cache Tuning with a Unified Second-Level Cache, IEEE Transactions on VLSI Systems, vol. 17, no. 1, pp. 80-91, Jan. 2009.
[31] G. Hamerly, E. Perelman, J. Lau, and B. Calder, SimPoint 3.0: Faster and More Flexible Program Analysis, Journal of Instruction-Level Parallelism, pp. 1-28, 2005.
[32] M. S. Haque, A. Janapsatya, and S. Parameswaran, SuSeSim: A Fast Simulation Strategy to Find Optimal L1 Cache Configuration for Embedded Systems, In Proceedings of the 7th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, pp. 295-304, Oct. 2009.
[33] J. S. Harper, D. J. Kerbyson, and G. R. Nudd, Analytical Modeling of Set-associative Cache Behavior, IEEE Transactions on Computers, vol. 48, no. 10, pp. 1009-1024, Oct. 1999.
[34] M. D. Hill and A. J. Smith, Evaluating Associativity in CPU Caches, IEEE Transactions on Computers, vol. 38, no. 12, pp. 1612-1630, Dec. 1989.
[35] L. Hsu, S. Reinhardt, R. Iyer, and S. Makineni, Communist, Utilitarian, and Capitalist Cache Policies on CMPs: Caches as a Shared Resource, In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pp. 13-22, Jan. 2006.
[36] J. Huh, C. Kim, H. Shafi, and L. Zhang, A NUCA Substrate for Flexible CMP Cache Sharing, IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 8, pp. 1028-1040, Aug. 2007.
[37] R. Iyer, On Modeling and Analyzing Cache Hierarchies Using CASPER, In Proceedings of the 11th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 182-187, Oct. 2003.

[38] A. Jaleel, R. S. Cohn, C.-K. Luk, and B. Jacob, CMP$im: A Pin-Based On-The-Fly Multi-Core Cache Simulator, The Fourth Annual Workshop on Modeling, Benchmarking and Simulation, pp. 28-36, June 2008.
[39] A. Janapsatya and S. Parameswaran, Finding Optimal L1 Cache Configuration for Embedded Systems, Asia and South Pacific Design Automation Conference, pp. 796-801, Jan. 2006.
[40] S. Kaxiras, Z. Hu, and M. Martonosi, Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power, In Proceedings of the 28th International Symposium on Computer Architecture, pp. 240-251, June 2001.
[41] R. E. Kessler and M. D. Hill, Page Placement Algorithms for Large Real-indexed Caches, ACM Transactions on Computer Systems, vol. 10, no. 4, pp. 338-359, Nov. 1992.
[42] N. S. Kim, K. Flautner, D. Blaauw, and T. Mudge, Drowsy Instruction Caches: Leakage Power Reduction Using Dynamic Voltage Scaling, In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, pp. 219-230, Nov. 2002.
[43] S. Kim, D. Chandra, and Y. Solihin, Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture, In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pp. 111-122, Oct. 2004.
[44] A. Lebeck and D. Wood, Cache Profiling and the SPEC Benchmarks: A Case Study, IEEE Computer, vol. 27, no. 10, pp. 15-26, Oct. 1994.
[45] H. Lee, S. Cho, and B. Childers, CloudCache: Expanding and Shrinking Private Caches, In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture, pp. 219-230, Feb. 2011.
[46] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communication Systems, In Proceedings of the 30th Annual International Symposium on Microarchitecture, pp. 330-335, Dec. 1997.
[47] C. Liu, A. Sivasubramaniam, and M. Kandemir, Organizing the Last Line of Defense Before Hitting the Memory Wall for CMPs, In Proceedings of the Symposium on High Performance Computer Architecture, pp. 176-185, Feb. 2004.
[48] P. S. Magnusson, M. Christensson, K. J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, Simics: A Full System Simulation Platform, IEEE Computer, vol. 35, no. 2, pp. 50-58, Feb. 2002.
[49] A. Malik, W. Moyer, and D. Cermak, A Low Power Unified Cache Architecture Providing Power and Performance Flexibility, International Symposium on Low Power Electronics and Design, pp. 241-243, July 2000.

[50] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, Evaluation Techniques for Storage Hierarchies, IBM Systems Journal, vol. 9, no. 2, pp. 78-117, June 1970.
[51] MIPS32 4K Processor Core Family Software User's Manual, http://www.mips.com/products/product-materials/processor/processor-cores/, 2001.
[52] J. Montanaro, R. T. Witek, K. Anne, et al., A 160-MHz, 32-b, 0.5-W CMOS RISC Microprocessor, Digital Technical Journal, vol. 9, no. 1, pp. 49-62, Nov. 1997.
[53] P. M. Ortego and P. Sack, SESC: SuperESCalar Simulator, http://iacoma.cs.uiuc.edu/~paulsack/sescdoc/, 2004.
[54] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, Reducing Leakage in a High-performance Deep-submicron Instruction Cache, IEEE Transactions on VLSI Systems, vol. 9, no. 1, pp. 77-89, Feb. 2001.
[55] M. Qureshi and Y. Patt, Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches, In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 423-432, Dec. 2006.
[56] M. Qureshi, Adaptive Spill-Receive for Robust High-Performance Caching in CMPs, In Proceedings of the Symposium on High Performance Computer Architecture, pp. 45-54, Feb. 2009.
[57] M. Rawlins and A. Gordon-Ross, CPACT - The Conditional Parameter Adjustment Cache Tuner for Dual-core Architectures, IEEE 29th International Conference on Computer Design, pp. 396-403, Oct. 2011.
[58] S. Segars, Low Power Design Techniques for Microprocessors, International Solid-State Circuits Conference, Feb. 2001.
[59] X. Shen, Y. Zhong, and C. Ding, Locality Phase Prediction, In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 165-176, Mar. 2004.
[60] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder, Discovering and Exploiting Program Phases, IEEE Micro: Top Picks from Computer Architecture Conferences, pp. 84-93, Dec. 2003.
[61] SPEC CPU2006, http://www.spec.org/cpu2006, 2013.
[62] S. Srikantaiah, E. T. Zhang, M. Kandemir, M. Irwin, and Y. Xie, MorphCache: A Reconfigurable Adaptive Multi-level Cache Hierarchy for CMPs, In Proceedings of High Performance Computer Architecture, pp. 231-242, Feb. 2011.

[63] T. S. B. Sudarshan, R. A. Mir, and S. Vijayalakshmi, Highly Efficient LRU Implementations for High Associativity Cache Memory, In Proceedings of the International Conference on Advanced Computer Control, pp. 87-95, Jan. 2004.
[64] R. Sugumar and S. Abraham, Efficient Simulation of Multiple Cache Configurations Using Binomial Trees, Technical Report, University of Michigan, 1991.
[65] G. E. Suh, L. Rudolph, and S. Devadas, Dynamic Cache Partitioning for Simultaneous Multithreading Systems, In Proceedings of the International Conference on Parallel and Distributed Computing and Systems, pp. 116-127, Nov. 2001.
[66] J. G. Thompson and A. J. Smith, Efficient (Stack) Algorithms for Analysis of Write-back and Sector Memories, ACM Transactions on Computer Systems, vol. 7, no. 1, pp. 78-117, Feb. 1989.
[67] N. Tojo, N. Togawa, M. Yanagisawa, and T. Ohtsuki, Exact and Fast L1 Cache Simulation for Embedded Systems, In Proceedings of the 2009 Asia and South Pacific Design Automation Conference, pp. 817-822, Jan. 2009.
[68] K. Varadarajan, S. Nandy, V. Sharda, A. Bharadwaj, R. Iyer, S. Makineni, and D. Newell, Molecular Caches: A Caching Structure for Dynamic Creation of Application-specific Heterogeneous Cache Regions, In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 433-442, Dec. 2006.
[69] X. Vera, N. Bermudo, J. Llosa, and A. Gonzalez, A Fast and Accurate Framework to Analyze and Optimize Cache Memory Behavior, ACM Transactions on Programming Languages and Systems, vol. 26, no. 2, pp. 263-300, Mar. 2004.
[70] P. Viana, A. Gordon-Ross, E. Barros, and F. Vahid, A Table-based Method for Single-Pass Cache Optimization, ACM Great Lakes Symposium on VLSI, pp. 71-76, May 2008.
[71] Z. Ying, B. T. Davis, and M. Jordan, Performance Evaluation of Exclusive Cache Hierarchies, IEEE International Symposium on Performance Analysis of Systems and Software, pp. 89-96, Apr. 2004.
[72] C. Zhang, F. Vahid, and R. Lysecky, A Self-tuning Cache Architecture for Embedded Systems, ACM Transactions on Embedded Computing Systems, vol. 3, no. 2, pp. 407-425, May 2004.
[73] Y. Zheng, B. T. Davis, and M. Jordan, Performance Evaluation of Exclusive Cache Hierarchies, IEEE International Symposium on Performance Analysis of Systems and Software, pp. 89-96, Mar. 2004.
[74] Y. Zhong, S. Dropsho, and C. Ding, Miss Rate Prediction Across All Program Inputs, In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, pp. 91-101, Sept. 2003.

BIOGRAPHICAL SKETCH

Wei Zang was born in Baotou, Nei Mongol, China. She received the B.S. and M.S. degrees from Zhejiang University, Hangzhou, China, in 2006 and 2008, respectively. In the spring of 2013, she received the Ph.D. degree in electrical and computer engineering from the University of Florida, Gainesville, FL. Her research interests include low-power embedded system design, system modeling and simulation, and design automation, with an emphasis on cache reconfiguration and cache partitioning on multicore platforms.