DYNAMIC LOW ENERGY CACHE OPTIMIZATIONS FOR EMBEDDED SYSTEMS

By

MARISHA RAWLINS

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2012
© 2012 Marisha Rawlins
TABLE OF CONTENTS

LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION

2 PREVIOUS INSTRUCTION CACHE OPTIMIZATIONS FETCHING FROM SMALL DEVICES
  2.1 Overview
  2.2 The Filter Cache and Other Designs
  2.3 Loop Caching
  2.4 Previous Loop Cache Designs
    2.4.1 Dynamic Loop Cache (DLC)
    2.4.2 Preloaded Tagless Loop Cache (PLC)
    2.4.3 Hybrid Tagless Loop Cache (HLC)
    2.4.4 Dynamic Loop Buffer (DLB)

3 LIGHTWEIGHT RUNTIME CONTROL FLOW ANALYSIS FOR ADAPTIVE LOOP CACHING
  3.1 Overview
  3.2 Basic Adaptive Loop Cache Operation
  3.3 The ALC Architectural Layout
  3.4 ALC Functionality for Critical Regions with Forward Branches
    3.4.1 Address Translation Unit (ATU)
    3.4.2 Loop Cache Controller (LCC) and Runtime Control Flow Analysis
      3.4.2.1 Operating in the idle state
      3.4.2.2 Preparing for a new loop
      3.4.2.3 Operating in the fill state
      3.4.2.4 Operating in the active state
      3.4.2.5 ALC overhead
      3.4.2.6 ALC operation in the presence of branch prediction
  3.5 ALC Functionality for Critical Regions with Nested Loops
  3.6 Experimental Results
    3.6.1 Experimental Setup
    3.6.2 Results and System Scenarios
      3.6.2.1 ALC loop cache access rate
      3.6.2.2 Loop cache access rate for the ALC with nested loops
      3.6.2.3 ALC energy savings
      3.6.2.4 Energy savings for the ALC with nested loops
      3.6.2.5 Results summary
  3.7 Comparing the Filter Cache with the ALC

4 COMBINING LOOP CACHING WITH OTHER DYNAMIC INSTRUCTION CACHE OPTIMIZATIONS
  4.1 Overview
  4.2 Instruction Cache Optimizations
    4.2.1 Loop Caching
    4.2.2 Level One (L1) Cache Tuning
    4.2.3 The Combined Effects of Cache Tuning and other Optimizations
    4.2.4 Code Compression
  4.3 Loop Cache and Level One Cache Tuning
    4.3.1 Experimental Setup
    4.3.2 Analysis
  4.4 Code Compression, Loop Caching, and Cache Tuning
    4.4.1 Loop Caching to Reduce Decompression Overhead
    4.4.2 Experimental Setup
    4.4.3 Analysis
    4.4.4 Using LZW encoding
    4.4.5 Analysis

5 MULTI-CORE LEVEL ONE (L1) DATA CACHE TUNING FOR ENERGY SAVINGS IN EMBEDDED SYSTEMS
  5.1 Overview
  5.2 Previous Work on Single-Core Cache Tuning
  5.3 Additional Multi-Core Cache Tuning Challenges

6 LEVEL ONE (L1) DATA CACHE TUNING IN DUAL-CORE EMBEDDED SYSTEMS
  6.1 Overview
  6.2 Previous Multi-Core Optimizations
  6.3 Runtime Dual-Core Data Cache Tuning
  6.4 Experimental Results
    6.4.1 Experimental Setup
    6.4.2 Maximum Attainable Energy Savings
    6.4.3 CPACT Results and Analysis
    6.4.4 Cross-Core Data Decomposition Analysis

7 AN APPLICATION CLASSIFICATION GUIDED CACHE TUNING HEURISTIC FOR MULTI-CORE ARCHITECTURES
  7.1 Overview
  7.2 Overview of Data Parallel Applications and Application Classification
  7.3 Runtime Multi-Core Data Cache Tuning
    7.3.1 Multi-core Architectural Layout
    7.3.2 Application Classification Guided Cache Tuning Heuristic
  7.4 Experimental Results
    7.4.1 Experimental Setup
    7.4.2 Results and Analysis
    7.4.3 Application Classification

8 CONCLUSIONS

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF FIGURES

Figure

2-1 Architectural placement of the filter cache and loop caches
3-1 Flowchart of basic adaptive loop cache (ALC) operation
3-2 An example of adaptive loop cache operation
3-3 Summary of transitions and events of the adaptive loop cache
3-4 Average loop cache access rate and energy savings for the ALC
3-5 Experimental results for selected benchmarks for the ALC
3-6 Benchmark grouping based on the benchmark characteristics
3-7 Summary of results for the average loop cache access rate and average energy savings
3-8 Average energy savings and execution times for the ALC and filter cache
4-1 The Decompression on Cache Refill, Decompression on Fetch (DF) and Decompression on Fetch architecture with a Loop Cache to store Decompressed Instructions
4-2 Energy savings for loop caching and cache tuning
4-3 Energy savings for code compression, loop caching, and cache tuning (Huffman encoding)
4-4 Energy savings for code compression, loop caching, and cache tuning (LZW encoding)
6-1 The Conditional Parameter Adjustment Cache Tuner (CPACT)
6-2 Architectural layout of dual-core system
6-3 Energy model used for the dual-core system
6-4 Energy savings and performance for CPACT
6-5 Percentage difference between the energy consumed by the optimal configuration and CPACT and number of configurations explored by CPACT
7-1 Sample architectural layout for a dual-core system with global data cache tuner
7-2 An application classification example
7-3 Application classification guided cache tuning heuristic
7-4 Energy model for the multi-core system
7-5 Energy savings and performance for the application classification cache tuning heuristic
LIST OF ABBREVIATIONS

ALC      Adaptive Loop Cache
ALC+N    Adaptive Loop Cache with Nested looping
CPACT    Conditional Parameter Adjustment Cache Tuner
DCR      Decompression on Cache Refill
DF       Decompression on Fetch
DLB      Dynamic Loop Buffer
DLC      Dynamic Loop Cache
PLC      Preloaded Loop Cache
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DYNAMIC LOW ENERGY CACHE OPTIMIZATIONS FOR EMBEDDED SYSTEMS

By

Marisha Rawlins

May 2012

Chair: Ann Gordon-Ross
Major: Electrical and Computer Engineering

Caches often consume at least 50% of a microprocessor's system power, making the memory hierarchy a suitable target for system optimization. Some optimizations focus on replacing instruction cache accesses with accesses to a smaller, low-powered device such as a loop cache, while others tune the cache to fit the application. First, we present an adaptive loop cache (ALC) that can dynamically store loops containing forward branches and provides an additional 20% average instruction cache energy savings (with individual benchmark energy savings as high as 69%) compared to previous loop cache designs. Next, we explore the interaction of three optimizations: loop caching, cache tuning, and code compression. Previous work explores various instruction cache optimization techniques, such as loop caching, individually; however, little work explores the combined effects of these techniques. Our investigation combining loop caching, cache tuning, and code compression shows that, when combined with cache tuning, loop caching increases energy savings by as much as 26% compared to cache tuning alone and, when combined with code compression, loop caching reduces decompression energy by as much as 73%.
Since multi-core architectures are becoming more popular, recent multi-core optimizations focus on energy consumption. Cache tuning reveals substantial energy savings for single-core architectures, but has yet to be explored for multi-core architectures. First, we explore level one (L1) data cache tuning in a heterogeneous dual-core system where each data cache can have a different configuration. We present the dual-core tuning heuristic CPACT, which finds cache configurations within 1% of the optimal configuration while searching only 1% of the design space, and achieves an average of 25% energy savings. We also provide valuable insights on core interactions and data coherence revealed when tuning the multithreaded SPLASH-2 benchmarks. Finally, we develop an L1 data cache tuning heuristic for a heterogeneous multi-core system, which classifies applications based on data sharing and cache behavior, and uses this classification to guide cache tuning and reduce the number of cores that need to be tuned. Results reveal average energy savings of 25% for 2-, 4-, 8-, and 16-core systems while searching only 1% of the design space.
CHAPTER 1
INTRODUCTION

Embedded systems are used in many diverse domains such as consumer and general devices, communications, signal and graphics processing, medical devices, and automotive control. Embedded system applications and devices designed for different domains can have diverse design constraints. For example, an embedded system for a consumer device should be low cost and low powered, with a short time-to-market, while an embedded system for a medical device is designed to be reliable and to meet hard real-time deadlines, with less focus on device cost or time-to-market. Additionally, these design constraints may compete with each other, requiring design tradeoffs such as speed versus area or power versus performance.

Unlike desktop computing and supercomputing, where high performance is a main optimization goal, in embedded systems low power consumption is a main optimization goal. In particular, embedded systems are designed to be low powered to increase the battery life and reduce the cooling requirements of the device. Much previous research has therefore focused on power and energy reduction in embedded systems; however, diverse design constraints across domains and design tradeoffs make energy optimization in embedded systems difficult.

Since caches consume a large fraction of the total system power, there exists much previous work on cache energy optimization techniques. Cache energy can be reduced by replacing accesses to the cache with accesses to a smaller device, such as with loop caching or a filter cache, or by increasing cache utilization with code optimizations such as code reordering and code compression. Other techniques reduced overall cache
energy consumption by changing the cache structure. For example, cache tuning/configuration specialized the cache configuration (e.g., specific values for cache parameters such as cache size, line size, and associativity) to the needs of an application.

These cache energy optimizations can be static or dynamic. Our work focuses on dynamic optimizations because: (1) dynamic optimizations require no user or application designer effort (i.e., the user does not need to know the optimization exists, and the application designer does not need to profile the application and perform the required optimizations), and (2) dynamic optimizations can react to changes in the environment, such as changing inputs and updates done after the device is deployed.

Previous embedded systems cache energy optimizations were designed for single-core systems; however, multi-core architectures are becoming a popular method to increase system performance through parallelism without increasing the processor frequency and/or energy. Additionally, some embedded system domains, such as signal and video processing, benefit greatly from parallel execution because of the parallel nature of their applications. Previous cache energy optimizations, such as cache configuration, were successful in single-core systems but are difficult to apply to multi-core systems, since additional multi-core challenges must be considered, including a larger design space and interaction among different cores for multithreaded applications.

This dissertation focuses on dynamic cache energy optimizations in single- and multi-core embedded systems. First, we present an adaptive loop cache (ALC) that reduces cache energy consumption during runtime by replacing cache accesses with
accesses to a smaller, more energy-efficient device. Next, we investigate the interactions of three previous dynamic cache energy optimizations: loop caching, cache tuning, and code compression. Finally, we adapt the single-core dynamic cache configuration to a multi-core embedded system executing multithreaded applications to reduce the overall energy consumption of the memory system.

In this dissertation we first present the ALC, which reduces the energy consumption of the system by storing loop instructions in a small, energy-efficient device. Several optimization techniques exploit the 90-10 rule, which states that 90% of execution time is spent in only 10% of the code. This 10% of the code is referred to as the critical regions. Frequently, these critical regions are composed of only a few loops (two to four), which iterate many times in succession. To exploit the 90-10 rule, filter caches [28] and loop caches store critical regions in a small cache that is smaller than a traditional level one (L1) cache.

The ALC is a tagless loop cache architecture integrating the complex functionality of a preloaded loop cache (PLC), able to cache loops containing branches, with the runtime flexibility of a dynamic loop cache (DLC). The ALC performs lightweight runtime control flow analysis on critical regions to dynamically store and evaluate loops, enabling the ALC to adapt to changing application behavior and inputs, and eliminating the need for costly designer pre-analysis. The ALC reduces instruction cache hierarchy energy when loop instructions are fetched from the smaller, lower energy ALC while the instruction cache remains in a low energy idle state.
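The energy argument in the preceding paragraph can be illustrated with a minimal first-order sketch. This is not the energy model used in this dissertation: the function name, its parameters, and the per-fetch energy values are hypothetical, and the sketch ignores fill overhead, controller energy, and static energy other than an optional idle term for the L1 cache.

```python
def fetch_energy_savings(loop_cache_rate, e_l1, e_lc, e_l1_idle=0.0):
    """Fraction of instruction fetch energy saved versus an L1-only baseline.

    loop_cache_rate: fraction of fetches serviced by the loop cache (0..1)
    e_l1:            assumed energy per L1 instruction cache fetch
    e_lc:            assumed energy per loop cache fetch
    e_l1_idle:       assumed energy the idle L1 consumes per loop cache fetch
    All energy values are hypothetical and only need to share one unit.
    """
    if not 0.0 <= loop_cache_rate <= 1.0:
        raise ValueError("loop_cache_rate must be between 0 and 1")
    if e_l1 <= 0:
        raise ValueError("e_l1 must be positive")
    # Baseline: every fetch is serviced by the L1 cache.
    baseline = e_l1
    # With a loop cache: a fraction of fetches move to the cheaper device.
    with_loop_cache = (loop_cache_rate * (e_lc + e_l1_idle)
                       + (1.0 - loop_cache_rate) * e_l1)
    return 1.0 - with_loop_cache / baseline


# Hypothetical numbers: half of all fetches hit a loop cache that costs
# one fifth of an L1 fetch, so about 40% of fetch energy is saved.
print(round(fetch_energy_savings(0.5, e_l1=1.0, e_lc=0.2), 3))  # 0.4
```

Under these assumptions the savings grow linearly with the loop cache access rate, which is why the fraction of fetches the loop cache can service matters as much as the per-access energy of the device.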
In Chapter 2 we describe the related work on filter caches and loop caches. The filter cache, a direct predecessor of tagless loop caches, reduced L1 instruction cache fetches with an additional small direct-mapped level zero (L0) cache between the L1 cache and the processor. However, the original filter cache reduced L1 fetches at the expense of an increased cache miss penalty.

To ensure a 100% hit rate, not all loop cache designs can store all types of critical region code structures. The simplest code structure for a loop cache is straight-line code, or basic blocks, since all instructions are fetched in succession. The DLC stored loops with straight-line code. These loops were identified during runtime using short backwards branch (sbb) instructions with a small negative offset. DLC operation was completely transparent, requiring no application designer effort; however, the DLC was able to store only loops containing straight-line code.

The PLC stored multiple complex critical regions (i.e., multiple regions containing control of flow changes (cofs) such as forward jumps). Although the PLC was not restricted to storing loops with straight-line code, the PLC required an application designer to perform an offline pre-analysis step to determine the critical regions, making the PLC applicable to only static situations.

The hybrid loop cache (HLC) combined the advantages of both the DLC and the PLC. The HLC separated the loop cache into two partitions: a larger PLC partition and a smaller DLC partition (for critical regions not preloaded). The HLC's
disadvantage was that the PLC could not cache complex critical regions that were not preloaded.

In Chapter 3 we present the design and evaluation of our ALC. The ALC is highly flexible and suitable for all system scenarios, and eliminates the costly pre-analysis step required to cache complex critical regions in the PLC/HLC while accurately identifying critical regions using actual operating inputs during runtime. Additionally, whereas the PLC and HLC can be useful in system scenarios where application behavior is static, this inflexibility makes the PLC and HLC unsuitable for system scenarios where behavior changes due to application phase changes, input vector changes, or application updates. Chapter 3 includes details on the architectural layout of our ALC design, the functionality implemented to allow our ALC to cache loops with forward branches during runtime, the functionality implemented to allow our ALC to cache nested loops during runtime, and the energy savings achieved by the ALC. We conclude Chapter 3 with a comparison between the ALC and the filter cache.

Chapter 4 explores additional energy savings revealed by combining loop caching with two other dynamic cache energy optimization techniques: cache tuning and code compression. In this chapter we evaluate how these optimization techniques interact (i.e., do these techniques complement each other, degrade each other, or does one technique obviate the other). Off-the-shelf microprocessors typically fix the cache configuration to a configuration that performs well on average across all applications. However, this average-case configuration is not ideal for every application, since
applications exhibit different runtime behaviors. Instruction cache tuning analyzes the instruction stream and configures the cache to the lowest energy (or highest performance) configuration by configuring the cache size, block size, and associativity. Cache tuning therefore enables application-specific energy/performance optimizations. Code compression techniques were initially developed to reduce the static code size in embedded systems. However, recent code compression work investigated the effects of code compression on instruction fetch energy in embedded systems. In these systems, energy is saved by storing compressed instructions in the L1 instruction cache and decompressing these instructions (during runtime) with a low energy/performance overhead decompression unit. The studies in this chapter show that although cache tuning dominates energy savings, loop caching can provide additional energy savings. Also, our experiments on loop caching and code compression revealed that the loop cache can effectively reduce the decompression overhead of a system while providing overall energy savings.

Finally, we present cache tuning optimizations that dynamically reduce the energy consumption in a multi-core system executing multithreaded applications. Multi-core architectures are becoming a popular method to increase system performance via exploited parallelism without increasing the processor frequency and/or energy. To further improve these systems, multi-core optimizations focus on improving system performance or energy efficiency. In our work we focus on cache tuning for multi-core systems, motivated by cache energy consumption and previous single-core successes. Cache tuning enables application-specific
optimizations by selecting the lowest energy configuration that satisfies the application's runtime behavior, since inappropriate configurations can waste energy. Given previous single-core successes, we believe cache tuning in multi-core systems may yield similar results if the additional multi-core-specific challenges are addressed.

One of the main multi-core cache tuning challenges is runtime cache configuration design space exploration, which incurs energy and performance penalties while executing inferior configurations. In a multi-core system, a linear increase in the number of heterogeneous cores with different cache configurations exponentially increases the design space and the overhead. New challenges include core interactions across applications with a high degree of sharing, which affect cache behavior (e.g., increased cache miss rates) and processor stalls.

In Chapter 6 we quantify data cache tuning energy savings in a heterogeneous dual-core system with highly configurable caches able to configure the total size, line size, and associativity, and show that the energy savings of dual-core cache tuning are comparable to those of single-core cache tuning. We present the Conditional Parameter Adjustment Cache Tuner (CPACT), which performs runtime cache tuning by searching a small fraction of the design space to find the optimal, or near optimal, lowest energy cache configuration. We conclude Chapter 6 with insights on core interactions and data coherence considerations for cache tuning.

In Chapter 7 we extend cache tuning to a larger multi-core system of up to 16 cores. We propose an application classification guided cache tuning heuristic for L1 multi-core data caches to determine the lowest energy optimal cache configuration. The heuristic leverages runtime profiling techniques to classify the application based on the
cache behavior, core interactions, and data coherence (data sharing). As the number of cores increases, minimizing cache tuning overheads becomes more critical in a multi-core system due to overhead accumulation across each core and the potential power increase if all cores simultaneously tune the cache. Our heuristic uses application classification to minimize the number of cores that are tuned simultaneously, therefore reducing tuning overhead. Finally, in Chapter 8 we present our conclusions.
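To make the design space growth noted earlier in this chapter concrete, the following minimal sketch counts the configurations of a heterogeneous multi-core system in which each core's cache can be configured independently. The parameter values are hypothetical placeholders, not the configurable cache evaluated in later chapters, and the sketch assumes every combination of size, line size, and associativity is valid.

```python
from itertools import product

# Hypothetical tunable parameters for one configurable L1 data cache.
SIZES_KB = (2, 4, 8)
LINE_SIZES_BYTES = (16, 32, 64)
ASSOCIATIVITIES = (1, 2, 4)


def per_cache_configs(sizes, line_sizes, assocs):
    """Enumerate every (size, line size, associativity) combination.

    Assumes all combinations are valid, which a real cache may not allow.
    """
    return list(product(sizes, line_sizes, assocs))


def design_space_size(configs_per_cache, num_cores):
    """Number of system configurations when each core's cache is set independently."""
    return configs_per_cache ** num_cores


configs = per_cache_configs(SIZES_KB, LINE_SIZES_BYTES, ASSOCIATIVITIES)
for cores in (1, 2, 4):
    print(cores, design_space_size(len(configs), cores))
# 1 27
# 2 729
# 4 531441
```

Even with only 27 assumed configurations per cache, exhaustive exploration is impractical beyond a few cores, which is the motivation for heuristics that search a small fraction of the design space.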
CHAPTER 2
PREVIOUS INSTRUCTION CACHE OPTIMIZATIONS FETCHING FROM SMALL DEVICES

2.1 Overview

Much previous work focused on decreasing memory hierarchy energy consumption. In this chapter, we present an overview of previous work focusing on runtime instruction cache optimizations where instruction fetch energy is reduced by replacing costly instruction cache accesses with accesses to smaller, more energy-efficient structures such as filter caches and loop caches. To exploit the 90-10 rule, which states that 90% of execution is spent in only 10% of code, filter caches and loop caches store critical regions in a cache that is smaller than a traditional level one (L1) cache. Filter caches and loop caches consume less energy than traditional L1 caches for several reasons. First, since filter and loop caches are smaller than an L1 cache, shorter internal wires result in reduced power dissipation. Furthermore, their small size allows the filter and loop caches to be placed in closer proximity to the microprocessor, resulting in shorter bus wires. Additionally, loop caches are tagless, thus no tag comparison is required.

2.2 The Filter Cache and Other Designs

The filter cache, a direct predecessor of tagless loop caches, reduced L1 instruction cache fetches with an additional small direct-mapped level zero (L0) cache between the L1 cache and the processor (Figure 2-1 (a)). Even though the original filter cache reduced L1 fetches at the expense of an increased cache miss penalty, several techniques introduced methods to reduce or eliminate filter cache misses. The predictive filter cache used a profile-aware compiler to map critical regions to a
reserved portion of the address space, and only this address space could occupy the filter cache. The tagless hit instruction cache (TH-IC) eliminated the additional miss penalty of the filter cache by including meta bits with each instruction and cache line to ensure a 100% filter cache hit rate. However, the TH-IC imposed several architectural overheads. The TH-IC required considerable instruction fetch unit augmentations to evaluate TH-IC meta bits for potential TH-IC misses. In addition, the microprocessor was responsible for meta bit invalidation following TH-IC replacements. Furthermore, precise meta bit invalidation could require a significant area overhead due to TH-IC bookkeeping bits: some invalidation schemes could require up to 141 additional bits for every 4 instructions, compared to the Preloaded Loop Cache (PLC), Hybrid Loop Cache (HLC), or our design, the Adaptive Loop Cache (ALC), which would require only 8 additional bits for every 4 instructions. Finally, meta bit invalidation appeared to impose numerous TH-IC updates per TH-IC replacement; however, the impact of this overhead was not addressed. In the next chapter we introduce the Adaptive Loop Cache for caching critical regions with branches without incurring the microprocessor architectural modifications and complex meta bit invalidation overheads (area and performance) imposed by the TH-IC.

Trace caches dynamically store execution traces in a small cache and provide explicit branch prediction by sequentially storing taken branch target instructions. Different branch outcomes required the trace cache to store multiple nearly identical traces, each of which provided a different branch prediction outcome.
Prior work showed that high branch misprediction rates increased energy consumption and decreased performance, as well as increased trace cache size pressure. In the next chapter we introduce the tagless Adaptive Loop Cache, which eliminates the energy consumed by the trace cache during tag comparisons.

2.3 Loop Caching

Previous work introduces several loop cache design variations; however, the purpose of any loop cache is to service as many instruction fetches as possible instead of fetching from the L1 cache. In addition, all loop cache designs provide tagless cache accesses, which eliminate the power and time of costly cache tag comparisons by using simple indexing counters that increment through loop cache locations.

The main goal of the first loop cache design was to achieve energy savings without affecting system performance or incurring a performance penalty, such as the cycle penalty incurred by the filter cache. The loop cache must therefore have a 100% hit rate. Loop cache designs vary in how the loop cache guarantees a 100% hit rate, since no loop cache fetch can miss. This restriction makes loop cache design difficult because the instruction fetch location (loop cache or L1 cache) must be determined with 100% certainty before the instruction fetch is issued.

The appropriate loop cache design depends on the system scenario, such as the application behavior and the desired application designer effort. The dynamically loaded tagless loop cache (DLC) required no designer effort, but could not store complex critical regions consisting of control of flow changes. A control of flow change is any two consecutively executed instructions where the instructions are not stored consecutively in memory (i.e., the first instruction causes the control of flow change if
the instruction is a taken jump, branch, subroutine call, etc.). The preloaded loop cache (PLC) and hybrid loop cache (HLC) required designer effort and offline analysis to store loops with control of flow changes; however, the loop cache contents were static and could not react to changes in application behavior. A more adaptive and flexible loop cache is therefore needed for system scenarios that include highly input-dependent and dynamic application behavior that is difficult to model accurately at design time and/or system scenarios where application designers do not want to or cannot perform offline pre-analysis.

2.4 Previous Loop Cache Designs

Even though several previous loop cache designs provided methodologies suitable for different system scenarios, no single loop cache design is the best (lowest energy) loop cache for every system scenario. An appropriate loop cache design choice based on the system scenario is critical, as inappropriate loop caches can increase memory hierarchy energy consumption (Section 3.6.2). Figure 2-1 (b) and (c) depict the loop cache architectural layout. The loop cache is a small instruction cache placed in parallel with the L1 cache such that either the loop cache or the L1 cache services instruction fetches. Microprocessor architectural modification includes two control signals asserted by the instruction decode and branch resolution phases: sbb indicates that a backwards branch with a small negative offset, known as a short backwards branch, is taken; and cof indicates that a control of flow change has occurred when a forward branch is taken. Even though the DLC is limited to storing a single critical region at a time and the PLC and HLC may cache multiple critical regions with branches simultaneously, all loop cache designs may store critical regions that are larger than the loop cache size. In
these cases, given a loop cache of size M, the loop cache simply stores the first M static instructions. In the remainder of this chapter, we provide the operational background and architectural fundamentals of previous loop cache designs necessary to build a foundation for our ALC.

2.4.1 Dynamic Loop Cache (DLC)

The DLC (Figure 2-1 (b)) has three operational states: idle, fill, and active. In the idle state, the DLC monitors the sbb signal for a triggering short backwards branch instruction. A triggering short backwards branch transitions the DLC to the fill state, and the DLC fills as the L1 cache services instruction fetches on the second loop iteration. Finally, on the third loop iteration, the DLC transitions to the active state and services instruction fetches. Thus, the DLC requires the L1 cache to service the first two loop iterations, but the DLC services all subsequent loop iterations until the triggering short backwards branch is not taken. The DLC's main limitation is that, in order to provide a 100% hit rate, the DLC must cache all loop instructions during one loop iteration when it enters the fill state. Therefore, internal loop control of flow changes, such as forward jumps, halt DLC filling and return the DLC to the idle state; otherwise not all loop instructions would be cached and the DLC would contain gaps, that is, DLC locations that do not store valid instructions. This restriction is due in part to the indexing counter. However, even with a more complex indexing methodology, DLC gaps cannot be identified during the active state, thus a 100% hit rate could not be guaranteed.
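The three-state DLC behavior described above can be illustrated with a small model. This is a minimal sketch rather than the original hardware design: the event names ("seq", "sbb", "cof", "exit"), the `DlcState` enum, and the `simulate` helper are hypothetical, and the sketch assumes a single triggering short backwards branch and that any other control of flow change returns the DLC to idle.

```python
from enum import Enum


class DlcState(Enum):
    IDLE = "idle"
    FILL = "fill"
    ACTIVE = "active"


def dlc_next_state(state, event):
    """Next DLC state for one fetched instruction.

    event (hypothetical encoding, one per instruction):
      "seq"  straight-line instruction
      "sbb"  triggering short backwards branch, taken
      "cof"  any other taken control of flow change (e.g. forward jump)
      "exit" triggering short backwards branch reached but not taken
    """
    if event == "sbb":
        # idle -> fill on the first taken sbb, fill -> active on the next,
        # and the DLC stays active while the loop keeps iterating.
        return DlcState.FILL if state is DlcState.IDLE else DlcState.ACTIVE
    if event in ("cof", "exit"):
        # Assumption: any internal control of flow change or loop exit
        # sends the DLC back to idle (filling is abandoned).
        return DlcState.IDLE
    return state  # straight-line code keeps the current state


def simulate(events):
    """Return (final state, number of fetches served by the DLC)."""
    state = DlcState.IDLE
    served_by_dlc = 0
    for event in events:
        if state is DlcState.ACTIVE:
            served_by_dlc += 1  # this instruction came from the loop cache
        state = dlc_next_state(state, event)
    return state, served_by_dlc


# A 4-instruction straight-line loop executed for 4 iterations:
# the L1 cache serves iterations 1 and 2, the DLC serves iterations 3 and 4.
body = ["seq", "seq", "seq", "sbb"]
last = ["seq", "seq", "seq", "exit"]
print(simulate(body * 3 + last))

# A loop whose forward jump is taken on every iteration never reaches the
# active state, which is the thrashing behavior discussed for the DLC.
print(simulate(["seq", "cof", "seq", "sbb"] * 5))
```

In this model the straight-line loop has 8 of its 16 fetches served by the DLC, while the loop with a taken forward jump has none, because each fill pass is aborted before the next taken short backwards branch.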
DLC advantages include zero application designer effort and excellent performance for suitable scenarios, but inappropriate scenarios can cause DLC thrashing. Thrashing occurs when the DLC constantly transitions between the idle and fill states and never transitions to the active state. Thrashing is caused by forward jumps within a loop and by nested loops where the inner loop has few iterations.

2.4.2 Preloaded Tagless Loop Cache (PLC)

Even though the PLC architecture (Figure 2-1 (b)) is similar to the DLC, PLC operation is very different and requires additional offline application designer effort. During an application pre-analysis step, ap critical regions and perform control flow analysis on these regions to determine loop exit conditions and exit bit values. Then, during system startup, the PLC immediately enters the fill state to preload the critical regions (application execution may be paused), which remain fixed throughout system lifetime. In order to transition from the idle to the active state, the PLC contains sets of loop address registers (LARs) indicating the start and end address of each stored critical of flow change, the PLC compares the Two exit bits stored with each PLC instruction provide a seamless transition with no cycle penalties between the active and idle states. An instr control of flow change instructions and their associated targets and indicate whether the PLC or the L1 cache should service the next instruction fetched based on whether the branch is taken or not taken.
Since critical regions are preloaded and require no application runtime filling, the PLC provides higher loop cache access rates for straight-line loops as compared to the DLC. In addition, the PLC can efficiently cache loops that would cause thrashing in the DLC. However, PLC drawbacks include limitations on the number of stored critical regions and additional PLC index address translation (the loop cache offset must be subtracted from the instruction address to obtain the loop cache index), which augments the simple counter-based DLC indexing. In addition, the inherent static nature of the PLC makes it unsuitable for dynamic system scenarios.

2.4.3 Hybrid Tagless Loop Cache (HLC)

The HLC (Figure 2-1 (c)) leverages the advantages of both the DLC and the PLC to provide a loop cache design more amenable to a larger range of system scenarios. The HLC separates the loop cache into two partitions: the larger partition functions as a PLC (the PLC is larger since critical regions with branches are more common and are generally larger than straight-line loops) and the smaller partition control of flow change. If this check fails, and the control of flow change is a triggering short backwards branch, the DLC partition begins filling. Whereas the HLC appears to provide the best of both techniques, one main disadvantage is that if the application behavior changes or a designer does not perform the necessary pre-analysis step, the PLC partition is not used but still expends static energy and increases DLC dynamic energy, and therefore reduces HLC operation to only a DLC.
2.4.4 Dynamic Loop Buffer (DLB)

The DLB is a tagless buffer developed to leverage the structure of digital signal processing (DSP) applications, which contain many loops. The DLB adds a fourth state to the three states already implemented by the DLC. This additional state, known as the overflow state, allows the DLB to store loops larger than the buffer size. The DLB dynamically identifies and captures complex loops and achieves a reduction in energy consumption by a factor of three for six DSP applications. In this paper, we present our adaptive loop cache, which caches complex loops containing branches similarly to the DLB. However, since there are few details on how the DLB caches complex loops (the authors simply state that enough information is kept for loop detection and management), it is difficult to directly compare the ALC and the DLB.

Figure 2-1. Architectural placement of (a) the filter cache, (b) the dynamic, preloaded, and adaptive loop caches (DLC, PLC, ALC, respectively), and (c) the hybrid loop cache (HLC).
CHAPTER 3
LIGHTWEIGHT RUNTIME CONTROL FLOW ANALYSIS FOR ADAPTIVE LOOP CACHING

3.1 Overview

In this chapter we present our adaptive loop cache (ALC) architecture and methodology. The goal of the adaptive loop cache is to reduce energy consumption by replacing L1 cache accesses with accesses to the smaller, more energy efficient loop cache. Compared with previous work, t runtime control flow that of the DLC and the PLC. Section 3.2 gives an overview of basic ALC operation. We then describe the base architectural layout and functionality for the ALC, which can cache critical regions with forward branches, in Sections 3.3 and 3.4, respectively. Then, in Section 3.5, we discuss the ALC changes necessary to cache critical regions with nested loops.

3.2 Basic Adaptive Loop Cache Operation

The ALC is highly flexible and suitable for all system scenarios and eliminates the costly pre-analysis step required to cache critical regions with branches in the PLC/HLC, while accurately identifying critical regions using actual operating inputs during runtime. For some system scenarios it is important to optimize at runtime using real inputs because it may be difficult for designers to provide accurate, realistic inputs during profiling, and unrealistic inputs during profiling cause the PLC/HLC to be less accurate than the ALC. Additionally, whereas the PLC and HLC can be useful in system scenarios where application behavior is static, this inflexibility makes the PLC and HLC
unsuitable for system scenarios where behavior changes due to application phase changes, changes in input vectors, or application updates.

The ALC has four states:
  the idle state: the ALC is idle, the L1 cache supplies instructions
  the buffer state: the ALC prepares for the new loop
  the fill state: the ALC is filling, the L1 cache supplies instructions
  the active state: the ALC supplies the processor with instructions instead of the L1 cache, the L1 cache is idle

Figure 3-1 (a) depicts an overview of the ALC operation and highlights the main transitions between the four states. Figure 3-2 gives an illustrative example, highlighting common ALC operations. Additional possible transitions and detailed operation are presented in Sections 3.4 and 3.5. Figure 3-2 (f) shows a sample loop containing several control of flow changes: the loop body begins at instruction , the short backwards branch instruction  loops back to the start of the loop body [2], the branch instruction is instruction ; and the cache this loop if the branch  is taken and the PLC/HLC require offline analysis to cache this loop. The ALC i instructions are fetched from the L1 cache, and the ALC supplies the processor with instructions for the remaining iterations.

Each ALC entry contains an instruction and the corresponding pair of valid bits (Figure 3-1 (c)) indicating whether the next instruction should exit the loop cache and fetch from the L1 cache or continue fetching from the loop cache. The valid bits are critical for transferring control from the ALC to the L1
cache without a cycle penalty, as the ALC must guarantee a 100% hit rate to not affect system performance.

The ALC begins operation in the idle state (Figure 3-2 (a)). If the short backwards branch is taken, a new loop is encountered, the ALC invalidates the loop cache contents, and begins filling on the next cycle. The ALC spends one iteration in the fill state filling the loop cache with instructions and setting the appropriate valid bits (Figure 3-2 (b)). The ALC uses the current instruction and the previously buffered instruction to determine the valid bits, and then writes the previously buffered instruction and its corresponding valid bits to the loop cache. Note that since the branch instruction was taken in this example and the branch target fits in the loop cache, the taken_next_valid (tnv) bit is set as shown in Figure 3-2 (b). If the short backwards branch is taken again, the ALC transitions to the active state on the next cycle (Figure 3-2 (b)).

The ALC remains in the active state for several iterations, supplies the processor with instructions, and uses the valid bits to determine the location of each next instruction fetch. The ALC supplies the next instruction for straight-line code or when a branch is not taken and the next_valid bit is set, or when a branch is taken and the taken_next_valid bit is set. In Figure 3-2 (c) the ALC remains in the active state as long as the branch instruction is taken. However, if the branch instruction is not taken, the ALC knows that the next sequential instruction is not available since the (Figure 3-2 (c)). In this situation, the ALC returns to the fill state to fill the gap (instructions  and ) left during the initial filling cycle, while the L1 cache resumes serving instruction fetches.
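The role of the two valid bits in choosing the next fetch location can be illustrated as follows. This is a minimal sketch of the decision rule stated above, not the hardware implementation: `AlcEntry`, `next_fetch_source`, and the "LC"/"L1" labels are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlcEntry:
    """One loop cache entry: an instruction plus its two valid bits."""
    instruction: str
    next_valid: bool = False        # nv: next sequential instruction is cached
    taken_next_valid: bool = False  # tnv: taken-branch target is cached


def next_fetch_source(entry, cof_taken):
    """Return "LC" if the loop cache can serve the next fetch, else "L1".

    cof_taken is True when the current instruction causes a control of
    flow change (a taken branch); otherwise execution falls through.
    """
    if cof_taken:
        return "LC" if entry.taken_next_valid else "L1"
    return "LC" if entry.next_valid else "L1"


# After the first fill pass in which the branch was taken, only tnv is set
# (the instruction name "beq" is a placeholder).
branch = AlcEntry("beq", next_valid=False, taken_next_valid=True)
print(next_fetch_source(branch, cof_taken=True))   # taken path is cached
print(next_fetch_source(branch, cof_taken=False))  # fall-through is a gap

# A later fill pass over the not-taken path sets nv as well.
branch.next_valid = True
print(next_fetch_source(branch, cof_taken=False))
```

Because the decision depends only on bits stored with the instruction just fetched, the fetch location for the next instruction is known before that fetch is issued, which is what allows the loop cache to hand control to the L1 cache without a miss.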
The ALC spends one iteration in the fill state and transitions to the active state if the short backwards branch is taken again (Figure 3-2 (d)). As shown in Figure 3-2 (e), the ALC has all loop instructions (instructions  ) and associated valid bits stored in the loop cache. The ALC can remain in the active state for any number of consecutive iterations and supplies the processor with instructions regardless of the branch outcome. If the short backwards branch is not taken again or if a control of flow change instruction causes a jump to a target outside of the ALC range (no such instruction is present in this sample piece of code), the ALC transitions to the idle state and the L1 cache resumes servicing instruction fetches.

The ALC can cache loops larger than the ALC size by storing as many instructions from the beginning of the loop as can fit in the ALC. The ALC utilizes a signal to indicate these overflow conditions where only the beginning of the loop is cached. If the ALC is in the fill or active states when an overflow occurs, the ALC becomes idle while the L1 cache supplies the instructions that could not fit in the ALC. If the short backwards branch is taken again, the ALC transitions directly to the active state to supply the processor with as many instructions as possible.

3.3 The ALC Architectural Layout

The ALC architectural placement and use of the sbb and cof microprocessor signals are identical to the DLC and PLC (Figure 2-1 (b)). Figure 3-1 (b) depicts the detailed architectural layout of the ALC (some control signals are omitted for clarity). The ALC contains a loop cache controller (LCC) to orchestrate loop cache operation, the loop cache to store instructions, and an address translation unit (ATU) for loop cache indexing.
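As a rough illustration of the ATU's indexing role (detailed in Section 3.4.1), the sketch below models the index as either a counter increment for straight-line code or a subtraction from the start address of the critical region on a control of flow change. The names cr_start and ad_trans come from the text; the class itself, the `instr_bytes` parameter (8 bytes is assumed here as the instruction size), and the way overflow is derived from the index are assumptions made for illustration only.

```python
class AddressTranslationUnit:
    """Toy model of tagless loop cache indexing (illustrative only)."""

    def __init__(self, cr_start, num_entries, instr_bytes=8):
        self.cr_start = cr_start        # address of the first cached instruction
        self.num_entries = num_entries  # loop cache size in instructions
        self.instr_bytes = instr_bytes  # assumed fixed instruction size
        self.index = 0                  # current indexing counter

    def translate(self, address, ad_trans):
        """Return the loop cache index for the instruction at `address`."""
        if ad_trans:
            # Control of flow change: recompute the index by subtraction.
            self.index = (address - self.cr_start) // self.instr_bytes
        else:
            # Straight-line code: a simple counter increment suffices.
            self.index += 1
        return self.index

    def overflow(self):
        """Assumed overflow test: the index falls past the last entry."""
        return self.index >= self.num_entries


atu = AddressTranslationUnit(cr_start=0x400, num_entries=4)
print(atu.translate(0x400, ad_trans=True))   # loop start after the taken sbb
print(atu.translate(0x408, ad_trans=False))  # next sequential instruction
print(atu.translate(0x418, ad_trans=True))   # forward branch target
print(atu.translate(0x420, ad_trans=False), atu.overflow())
```

No tag comparison appears anywhere in this model: the index alone selects the entry, and the valid bits stored with each entry decide whether the loop cache may serve the next fetch.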
The LCC contains three internal registers, tr_sbb, last_PC, and i_buffer, which collectively enable caching critical regions with branches and provide support for loop backwards branch address, thus the tr_sbb effectively identifies the currently cached critical region. Therefore, when the microprocessor asserts sbb, the LCC uses the new triggering short backwards currently cached critical region or if execution has entered a new critical region. If the new triggering short backwards branch address matches tr_sbb, the LCC asserts an internal signal, same_loop, which enables the ALC to resume fetching the currently cached critical region (for critical regions larger than the loop cache). I_buffer and last_PC assist in control flow analysis by buffering the previously fetched instruction and address while the next instruction fetch location is evaluated. The valid bits appended to each instruction in the loop cache store control flow analysis information.

3.4 ALC Functionality for Critical Regions with Forward Branches

Complete ALC operation requires cooperation between the LCC and the ATU in order to fill the loop cache and service instruction fetches. In this section, we describe the ALC functionality necessary for caching critical regions with forward branches.

3.4.1 Address Translation Unit (ATU)

In order to cache critical regions with branches, maintain tagless loop cache accesses, and use indexing counters, the ATU translates the current instruction address into a loop cache index. During straight-line code, the ATU uses an indexing counter to step through the loop cache. However, when a control of flow change occurs, a simple counter increment is not sufficient to determine the correct loop cache index and the
When the LCC asserts ad_trans, indicating a control of flow change, the ATU subtracts cr_start from the current instruction address (last_PC during loop cache filling or PC during loop cache fetching) to obtain the loop cache index. This new loop cache index becomes the indexing counter, which is incremented for each subsequent straight-line instruction access. The PLC performs a similar subtract operation as does the ATU, thus critical path delay is not increased as compared to previous loop cache designs.

3.4.2 Loop Cache Controller (LCC) and Runtime Control Flow Analysis

Figure 3-3 (a) highlights important LCC state transitions and events, and Figure 3-3 (b) depicts detailed LCC state machine operation. LCC operation consists of four states: idle, buffer, fill, and active. In the idle and active states, instructions are fetched from the L1 cache and the loop cache, respectively. In the buffer and fill states, instructions are fetched from the instruction cache and written to the loop cache after runtime control flow analysis.

3.4.2.1 Operating in the idle state

The ALC begins operation in the idle state, and there are two possible transitions out of the idle state. If same_loop is asserted (corresponding to a triggering short backwards branch for a currently cached critical region), the LCC transitions directly to the active state. If sbb is asserted and same_loop is not asserted, the triggering short backwards branch corresponds to a new critical region and the LCC transitions to the buffer state.

3.4.2.2 Preparing for a new loop

The LCC spends one cycle in the buffer state to prepare the loop cache for filling by resetting all valid bits to invalidate the currently stored critical region. Special
hardware built into the loop cache resets all valid bits in a single clock cycle when the LCC asserts inv to the loop cache. The LCC asserts tr_sbb_ld to store the triggering short backwards branch address in tr_sbb and asserts i_buffer_ld to buffer the current instruction and address into i_buffer and last_PC, respectively.

3.4.2.3 Operating in the fill state

On the next clock cycle, the LCC transitions to the fill state and performs an initial runtime control flow analysis pass (on the second loop iteration) to dynamically determine the values of the two valid bits, next_valid and taken_next_valid (nv and tnv, respectively, in Figure 3-1 (c) and Figure 3-3 (b)), which store loop exit condition whether the instruction cache or loop cache should service the next instruction fetch. Valid bits are critical to loop cache operation and to maintaining a 100% hit rate, as valid bits identify control of flow changes with targets outside of the loop cache and allow gaps within the loop cache.

During straight-line code execution in the fill state (including untaken branch instructions), control flow analysis sets the next_valid bit for these instructions, indicating that the next sequential instruction is stored in the loop cache, thus the next instruction fetch will be a loop cache hit. If a control of flow change occurs and the control of flow analysis sets the taken_next_valid bit for this instruction indicating that when the control of flow change occurs, the target instruction is stored in the loop cache. Note that at this point, control flow analysis sets only one valid bit, leaving the other valid bit unset. Since the initial control flow analysis pass operates on one loop iteration, both valid bits cannot be set.

Since we do not restrict the ALC to caching only loops with straight-line code, it is possible to have gaps in the ALC. In order to fill in
unset valid bits and loop cache gaps, the LCC may return to the fill state for additional control flow analysis passes. For example, if a loop contains an if/else control of flow change instruction, the instructions belonging to either the if clause or the else clause will be initially cached, potentially leaving a gap in the ALC. However, since the loop cache returns to the fill state when gaps are encountered, it is possible for the ALC to contain instructions from both the if and else clause and therefore, for the control of flow

During the fill state, i_buffer serves two purposes. Firstly, i_buffer enables control flow analysis to compare the previous instruction executed with the subsequent instruction executed to evaluate loop exit conditions. The second purpose significantly reduces loop cache writes and eliminates the need for a dual-ported loop cache. Previous loop cache designs simultaneously write instructions as the instructions are after the next instruction fetch, each loop cache instruction would require two updates: one update to write the current instruction and one update to different loop cache lines, resulting in two writes per clock cycle necessitating a dual-ported loop cache. In order to avoid a dual-ported loop cache, i_buffer buffers the previous instruction during control flow analysis such that each loop cache instruction only requires one loop cache update. The loop cache is backfilled with the previous instruction and associated valid bits while the current instruction is fetched from the L1 cache. Because i_buffer is read at the beginning of the clock cycle and written at the
end of the clock cycle, i_buffer always latches the current instruction after the previous instruction is written to the loop cache.

The LCC continues this backfilling process until one of several conditions is met. If a new triggering short backwards branch is taken (same_loop is not asserted), the LCC transitions to the buffer state to prepare for the new loop. If execution reaches the end backwards branch (tr_sbb) is taken again (same_loop is asserted), the LCC transitions directly to the active state (the LCC instruction is actually the first instruction written to the loop cache).

3.4.2.4 Operating in the active state

In the active state, the LCC deasserts loop_fill to allow the loop cache to service instruction fetches. The LCC buffers the instruction fetched from the loop cache and associated valid bits in i_buffer to prepare for additional control flow analysis passes. The LCC analyzes vali current instruction is not a control of flow change and the next_valid bit is set or if the current instruction is a control of flow change and the taken_next_valid bit is set, the LCC remains in the active state. However, if the current instruction is not a control of flow change and the next_valid bit is not set or if the current instruction is a control of flow change and the taken_next_valid bit is not set, then the L1 cache must service the next instruction fetch and the LCC transitions out of the active state.

There are several LCC transitions out of the active state. If the next_valid bit is not set and the instruction corresponds to the last instruction in the loop cache (overflow is asserted), the LCC transitions to the idle state. Otherwise, the unset next_valid bit indicates a loop cache gap and the LCC transitions to the fill state to perform another
control flow analysis pass. If the LCC detects a new loop (a triggering short backwards branch is taken and is not tr_sbb), the LCC transitions to the buffer state to prepare the backwards branch (tr_sbb) is not taken, the LCC transitions to the idle state. Lastly, if a control of flow range (past the triggering short backwards out_of_loop signal is asserted and the controller transitions to the idle state.

3.4.2.5 ALC overhead

The ALC provides a lightweight solution for identifying, storing, and fetching loop instructions for an application containing loops with branches. The ALC reuses the sbb and cof microprocessor signals that are already present in the three previous loop cache designs and introduces a few additional internal signals. The largest contributor to energy overhead is the table storing the instructions. The LCC itself has very little overhead besides the control logic, i.e., the registers added to the LCC for control flow analysis are very small and have a negligible impact on overall energy consumption.

In the DLC, writing an instruction to the loop cache while providing the processor with the instruction, or fetching an instruction from the loop cache, takes a single cycle (the same as fetching from the L1 cache). Similarly, our ALC can backfill the ALC or fetch an instruction in a single cycle. Additionally, the ALC incurs no cycle penalties as there are no loop cache misses. Given the simplicity of the control logic as compared to a single pipeline stage, the ALC control logic could easily operate at embedded system microprocessor clock speeds, thus minimizing runtime overhead. The ALC is designed so that each
instruction is analyzed and stored in the loop table in a single clock cycle. The DLC is able to use the sbb and cof signals to determine when to transition between filling, fetching from, or exiting the loop cache without incurring a performance penalty. Our ALC uses the sbb and cof signals for the same purpose and is also able to use the same sbb and cof signals to simultaneously set one of the valid bits without impacting the cycle time. compared with previous loop cache designs able to cache loops with branches since both the PLC and HLC also required the same address translation operations. Additionally, the DLC is able to address the contents of the loop cache using a counter increment operation. In our ALC design we either increment a counter to access sequential instructions or perform a subtraction operation to access an instruction if there is a taken branch, where a subtraction operation does not increase cycle time compared to an increment operation. Finally, control flow analysis performed by the ALC is done without using the complex external tools and analysis required to identify and cache loops in the PLC and HLC.

3.4.2.6 ALC operation in the presence of branch prediction

The backwards branch or forward branch, respectively) that is taken. The branch information fetched from the ALC or the L1 cache. The ALC is therefore applicable to architectures where the branch outcome will be known with certainty before the next instruction is fetched, such as unconditional jumps and architectures with branch target buffers.
However, when the branch outcome is not known early enough, some architectures use branch prediction to guess the branch outcome and to improve the flow of instructions in the pipeline. In cases where branch prediction is used, the ALC will depend on the branch predictor to assert the sbb and cof signals; however, the functionality of the ALC remains the same. For example, if the branch predictor predicts that the short backwards branch will be taken again when the ALC is in the fill state, the ALC transitions to the active state and provides the cached instructions. If the prediction is correct, then the ALC will have supplied the correct instructions. However, if the prediction is incorrect, the processor is stalled while the incorrect instructions are discarded. Note that any processor stalls and cycle penalties incurred by an incorrect prediction are not caused by the ALC and would have occurred even if the ALC had not been implemented.

Similarly, an incorrect prediction may cause the wrong set of instructions to be filled in the ALC if a control of flow change within the loop is mispredicted. For example, if a loop contains an if/else construct and the branch is mispredicted as taken, the ALC is incorrectly filled with the if clause. If the ALC is still in the fill state when the branch outcome becomes known, the correct instructions belonging to the else clause will be filled. If the ALC has transitioned to the active state and is providing instructions when the branch outcome becomes known, the branch instruction unset indicating that the L1 cache must supply the requested instruction on the next t a cycle penalty from occurring even when a branch is incorrectly predicted.
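The LCC state transitions described in Section 3.4.2 for critical regions with forward branches can be summarized in a small model. This is a simplified sketch under stated assumptions, not the state machine of Figure 3-3 (b): the function and parameter names other than sbb, same_loop, and overflow are hypothetical, `loop_exit` lumps together the cases where tr_sbb is not taken or out_of_loop is asserted, `path_valid` stands for whichever valid bit (next_valid or taken_next_valid) applies to the path actually taken, and treating a loop exit during the fill state as a return to idle is an assumption.

```python
from enum import Enum


class LccState(Enum):
    IDLE = "idle"
    BUFFER = "buffer"
    FILL = "fill"
    ACTIVE = "active"


def lcc_next_state(state, sbb=False, same_loop=False, overflow=False,
                   loop_exit=False, path_valid=True):
    """Next LCC state for one cycle (forward-branch functionality only)."""
    new_loop = sbb and not same_loop  # taken sbb that does not match tr_sbb
    if state is LccState.BUFFER:
        return LccState.FILL          # exactly one cycle is spent in buffer
    if state is LccState.IDLE:
        if same_loop:
            return LccState.ACTIVE    # resume the currently cached region
        return LccState.BUFFER if new_loop else LccState.IDLE
    # Remaining cases: FILL and ACTIVE.
    if new_loop:
        return LccState.BUFFER        # invalidate and prepare for the new loop
    if overflow or loop_exit:
        return LccState.IDLE          # the L1 cache takes over
    if state is LccState.FILL:
        return LccState.ACTIVE if same_loop else LccState.FILL
    # ACTIVE: an unset valid bit on the executed path marks a gap to fill.
    return LccState.ACTIVE if path_valid else LccState.FILL


def run(events, state=LccState.IDLE):
    """Apply a list of per-cycle signal dictionaries; return visited states."""
    trace = []
    for signals in events:
        state = lcc_next_state(state, **signals)
        trace.append(state)
    return trace


# Walk-through loosely following the example of Section 3.2.
events = [
    {"sbb": True},                     # new loop detected
    {},                                # buffer cycle completes
    {},                                # filling
    {"sbb": True, "same_loop": True},  # tr_sbb taken again
    {"path_valid": False},             # gap found while active
    {"sbb": True, "same_loop": True},  # gap filled, tr_sbb taken again
    {"loop_exit": True},               # tr_sbb not taken
]
print([s.value for s in run(events)])
```

The model makes the key property visible: the controller leaves the active state in the same cycle in which an unset valid bit is observed, so the L1 cache can serve the next fetch without a loop cache miss.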
3.5 ALC Functionality for Critical Regions with Nested Loops

In order to enhance ALC functionality to cache critical regions with nested loops, we modified the LCC described in Sections 3.2 and 3.4. The LCC (for critical regions with forward branches) treats all taken short backwards branch instructions, which do not correspond backwards branch, as the trigger for a new critical region. However, when caching nested loops, and the Additionally, since the inner loop is executed many times in succession, followed by one iteration of the outer loop, the inner loop will be evicted from the ALC upon each outer loop iteration, requiring the inner loop to be recached and reevaluated (control flow analysis) after each outer loop iteration. In order to eliminate this mild form of loop f the outer loop, and not as a new critical region. We point out that the modifications described in this section enhance ALC functionality and do not remove any previously described functionality.

To add the ability to cache nested loops, the LCC requires three additional internal registers, inner_sbb, inner_overflow, and inner_start, and an internal signal, inner. Inner_sbb stores the short backwards branch of the current inner loop being filled. Inner_overflow stores the short backwards branch of the inner loop if that inner loop does not fit in the loop cache. Inner_start stores the start address of all new loops and is used to determine if this new loop is an inner loop of the currently cached critical region or a new critical region. The LCC identif starting (currently held in inner_start) and ending addresses fall within the bounds of the
current critical region (i.e., starting and ending addresses are bounded by cr_start and tr_sbb, respectively). The LCC asserts inner if an inner loop is identified, thus signaling that the ALC should continue to fill or fetch the currently cached critical region instead of evicting the currently cached critical region.

Note that inner_sbb and inner_start temporarily store the short backwards branch address and the address of the first instruction of the inner loop currently being filled during control flow analysis. When the inner_sbb is taken again, control flow analysis on the inner loop is complete, same_loop is asserted, and the ALC transitions from the fill to the active state. Once control flow analysis is complete on the inner loop, the short backwards branch of the inner loop is treated like any other control of flow change instruction of the outer loop, and the appropriate valid bit is set indicating the target of backwards control of flow change instruction in the loop cache. This allows the inner_sbb and inner_start registers to be reused during the control flow analysis of additional inner loops, making it possible to cache multiple inner loops and achieve several levels of loop nesting.

LCC state transitions are governed by both inner and same_loop. We modified same_loop functionality compared to the description in Section 3.4 to account for nested loops. The LCC asserts same_loop when the tr_sbb, the inner_sbb, or the inner_overflow branch is taken again. This assertion results in the LCC transitioning to or remaining in the active state, as shown in Figure 3-3 (b). Allowing same_loop to be asserted when the inner_sbb is taken again is necessary since the inner loop can iterate severa
backwards branch (stored in tr_sbb) is taken again. Thus, if same_loop is asserted by a taken inner_sbb in the fill state, the LCC transitions to active and the loop cache supplies the microprocessor with instructions without having to wait for the remainder of the outer loop to be filled and for tr_sbb to be taken again. The LCC remains in the active state until execution leaves the same critical region or until a gap in either an inner or outer loop needs to be filled.

Allowing same_loop to be asserted when the inner_overflow is taken again is necessary since the loop cache can store loops larger than the loop cache size (it is possible that only the first part of an inner loop will fit in the loop cache). If the inner backwards signal seen in Figure 3 short backwards branch is stored in the inner_overflow register. The LCC transitions to the active state if same_loop is asserted in the idle state. In the case where the inner_overflow register causes same_loop to be asserted in the idle state, the LCC returns to the active state because part of the current inner loop is stored in the loop cache and can therefore be fetched from the loop cache.

If a short backwards branch is taken and inner is asserted while same_loop is deasserted, another inner loop within the same critical region is found. If the LCC is currently in the fill sta instructions are treated as outer loop instructions. If the LCC is currently in the active state, the LCC transitions to the fill state to add the new inner loop to the current critical region. In
backwards branch instruction is evaluated and cached like any other control of flow change instruction in the loop cache. If a short backwards branch is taken and both the inner and same_loop signals are deasserted, a new critical region is being executed, the LCC transitions to the buffer state, and operation continues as described in Section 3.4. The functionality of the ATU and all other transitions shown in Figure 3-3(b) remain the same.

3.6 Experimental Results

3.6.1 Experimental Setup

To determine the number of instructions fetched from the loop cache and calculate energy savings for the DLC, PLC, HLC, and our ALC designs, we executed 31 benchmarks from the EEMBC, MiBench, and Powerstone benchmark suites (due to incorrect execution not related to the loop caches, we could not evaluate the complete suites). Since the loop cache is supposed to be small (up to 128 entries for the DLC), we implemented each loop cache design in SimpleScalar and evaluated small loop cache sizes ranging from 4 to 256 entries (larger loop cache sizes resulted in either a decrease or only a negligible increase in energy savings, as explained in Section 3.6.2). Since  was 32 entries, we present HLC results for 64-, 128-, and 256-entry loop caches only. Each benchmark was executed once to completion for each loop cache size. To calculate energy consumption for each loop cache configuration, we augmented the energy model adopted by Zhang et al. to include loop cache fill and fetch operations as shown here (IC = instruction cache and LC = loop cache):

\[
\begin{aligned}
\mathit{total\_energy} &= \mathit{IC\_energy} + \mathit{LC\_energy} \\
\mathit{IC\_energy} &= \mathit{IC\_fill\_energy} + \mathit{IC\_dynamic\_energy} + \mathit{IC\_static\_energy} \\
\mathit{IC\_fill\_energy} &= \mathit{IC\_misses} \times (\mathit{IC\_linesize} / \mathit{wordsize}) \times \mathit{mem\_energy\_perword} + \mathit{cpu\_stall\_energy} \\
\mathit{IC\_dynamic\_energy} &= \mathit{IC\_accesses} \times \mathit{IC\_access\_energy} \\
\mathit{IC\_static\_energy} &= \big( (\mathit{IC\_misses} \times \mathit{miss\_latency\_cycles}) + (\mathit{IC\_accesses} - \mathit{IC\_misses}) + \mathit{LC\_hits} \big) \times 0.15 \times \mathit{IC\_access\_energy} \\
\mathit{LC\_energy} &= \big( (\mathit{LC\_hits} + \mathit{LC\_fills}) \times \mathit{LC\_access\_energy} \big) + \mathit{LC\_static\_energy} \\
\mathit{LC\_static\_energy} &= \big( (\mathit{IC\_misses} \times \mathit{miss\_latency\_cycles}) + (\mathit{IC\_accesses} - \mathit{IC\_misses}) + \mathit{LC\_hits} \big) \times 0.15 \times \mathit{LC\_access\_energy}
\end{aligned}
\tag{3-1}
\]

We gathered IC_accesses, LC_hits, and LC_fills cache statistics using SimpleScalar. We used the PISA instruction set architecture and the single-issue five-stage pipeline w cache simulator. We used CACTI to determine dynamic cache energy dissipation for 0.18 µm technology. Since the loop cache is tagless, we obtained energy consumption for same-sized direct-mapped caches with an 8-byte line size (corresponding to one loop cache entry of one instruction) and removed the tag energy. We assumed static energy per clock cycle for the instruction and loop caches as 15% of their respective dynamic access energies. Registers added to the LCC are relatively small compared to the entire ALC architecture and therefore do not make a significant impact on ALC energy consumption or overall instruction memory hierarchy energy consumption.

L1 cache tuning revealed that the embedded system benchmarks used in our experiments require small L1 caches ranging in size from 2 KB to 8 KB. In order to compare to a system with no loop cache, we define a base system configuration consisting of an 8 KB L1 instruction cache with 4-way set associativity and a 32-byte line size, a configuration shown in prior work to perform well for a variety of benchmarks run on several embedded microprocessors. To calculate energy savings,
we normalize the energy consumption of a system with a loop cache and the L1 base cache to the base system with no loop cache.

3.6.2 Results and System Scenarios

Figure 3-4 depicts experimental results comparing the DLC, ALC (without nested loop caching), ALC+N (with nested loop caching), PLC, and HLC loop cache designs for various loop cache sizes in number of entries on the x axis. In order to show different loop cache design trends across different benchmark suites, we present results averaging the EEMBC, MiBench, and Powerstone suites separately. In addition, we evaluate our ALC without the nested loop caching ability compared to previous work and then evaluate the benefits of adding nested loop cache ability.

We analyzed every benchmark and observed that several benchmarks exhibited the same behavior. Therefore, to keep the discussion concise, we identified the application characteristics and discuss the individual benchmark results by grouping the benchmarks and giving example benchmarks from each group where necessary. Figure 3-6 depicts our benchmark grouping based on the benchmark characteristics we identify and discuss in the remainder of this section. Figure 3-5 depicts experimental results comparing the loop cache access rate and energy savings for four benchmarks for the DLC, ALC, ALC+N, PLC, and HLC. These four benchmarks were chosen to highlight the effectiveness of the ALC and ALC+N for different types of application scenarios.

3.6.2.1 ALC loop cache access rate

Figure 3-4(a), (b), and (c) depict the loop cache access rate (i.e., the percentage of instruction fetches resulting in loop cache hits, or the percentage of instructions fetched from the loop cache instead of the L1 cache) and reveal that our ALC always
outperforms the DLC for all three benchmark suites and all loop cache sizes. This result is expected since the DLC caches a subset of the loops that the ALC caches. These ing particular system scenarios. For the Powerstone suite and all loop cache sizes (Figure 3-4(c)), the DLC never resulted in loop cache hits due to the absence of straight-line loops (we verified this using a loop analysis tool). Average loop cache access rate improvements for the ALC compared to the DLC reach as high as 18%, 40%, and 74% for the EEMBC, MiBench, and Powerstone suites, respectively. For individual benchmarks, the ALC loop cache access rate reaches as high as 97% for the EEMBC suite and as high as 99% and 99% for the MiBench and certain application scenarios results in loop cache access rate improvements as high as 5 (c)) and as

Figure 3-4(a), (b), and (c) also show that for the EEMBC, MiBench, and Powerstone suites, on average, our ALC either outperforms or performs as well as the PLC and HLC for loop cache sizes up to 128 entries. In addition, Figure 3-4(b) shows that for the MiBench suite, on average, the ALC always outperforms the PLC, and the difference between the ALC and HLC loop cache access rates is only 5% for the 256-entry loop cache. Thus, the ALC is ideal for size-constrained applications. Increasing the loop cache size to 256 entries results in the PLC/HLC storing almost all of the average since the PLC/HLC requires no runtime filling cycles. However, since the ALC
outperforms the PLC/HLC for small loop caches, this shows t overhead is minimal.

Individual results show that the PLC can outperform the ALC only for the 256-entry loop cache for the 14 (out of 31) benchmarks listed as Group A in Figure 3-6. Analyzing these benchmarks with the loop analysis tool reveals that the ALC consistently outperforms the PLC for applications with a large critical region or a large number of critical regions, which would require prohibitively large loop caches. For example, Figure 3 PLC/HLC with the exception of a 256-entry PLC, which only achieves a loop cache access rate 1% higher than a same-sized ALC and with a lower energy savings than a 32-, 64-, and 128-entry ALC (Figure 3-5(d)). Although these larger PLCs/HLCs could potentially incur a higher loop cache access rate, in some cases the additional loop cache accesses would not compensate for the increased energy consumption incurred by fetching from a larger loop cache, thus increasing overall energy consumption. In addition, highly size-constrained system scenarios may not afford the larger loop caches required for optimal PLC/HLC energy savings or performance. loop cache sizes including the 256-entry loop cache. Figure 3-5(e) shows that for sizes since a 256-entry PLC (and 224-entry PLC partition in the case of the HLC) was not large enough to store a single critical region in its entirety.
We further note that certain system scenarios may result in the PLC outperforming the ALC for smaller loop cache sizes since the PLC does not require runtime filling cycles. For these seven benchmarks (Group B, Figure 3-6) the ALC loop cache access rate is usua 5 (g), the PLC has a higher loop cache access rate than the ALC (by as much as 7%) for loop caches larger than 32 entries. The PLC can also outperform the ALC for all loop cache sizes when critical loops are executed frequently but only iterate a few times successively (i.e., enough iterations to fill the ALC but not enough iterations to take advantage of fetching from the ALC, such as in EB01, PUWMOD01, and RSPEED01). However, we point out that even in these scenarios the flexibility of the ALC may still outweigh the increased PLC performance.

3.6.2.2 Loop cache access rate for the ALC with nested loops

Next, we evaluate the added benefits of enabling the ALC to cache nested loops as compared to the ALC without the ability to cache nested loops. Even though initial loop analysis (using the loop analysis tool) suggests that this added functionality should provide additional ALC savings, experimental results revealed no average improvement and only little improvement (2% average improvement in loop cache access rate and 3% average improvement in energy savings) for a very small set of benchmarks (7 out of 31 benchmarks).

Figure 3-4(a) reveals that on average, ALC+N always results in a lower loop cache access rate compared to the ALC for the EEMBC suite. Loop analysis on the EEMBC suite reveals that most benchmarks contain nested loops with a large outer loop which iterates on average only once or twice in succession
ALC+N caches only the beginning of the outer loop, which caches only the first few inner loops. This results in only a small portion of the critical region (large outer loop) being cached. Thus, for these benchmarks it is advantageous to individually cache each inner loop as it executes, which results in a larger percentage of the critical region being cached.

Figure 3-4(c) depicts similar results for the Powerstone suite, with one exception. With a 256-entry loop cache, the ALC+N achieves a higher average loop cache access rate than the ALC. Loop analysis of the Powerstone suite reveals that these benchmarks generally contain small outer loops with few (usually only three or four) inner loops, which is in stark contrast to the EEMBC suite. As the loop cache size increases to 256 entries, the ALC+N is able to cache the entire critical region (outer and all inner loops) and achieves a higher loop cache access rate than the ALC since cycles are not spent refilling the loop cache with inner loops as in the case of the ALC. The ALC+N is therefore best suited for applications with smaller loops where the outer loop and most/all inner loops fit entirely into the loop cache, thus eliminating the need to continually replace the loop cache contents with the different inner loops, such as the seven benchmarks listed in Group C ( Figure 3-6 instructions (containing forward jumps) and thus achieves a 99% loop access rate for the 256-entry ALC+N loop cache as seen in Figure 3-5(a).

In terms of average loop cache accesses, the ALC+N performs worse than the PLC and HLC as seen in Figure 3-4(a), (b), and (c). Although the ALC+N, PLC, and HLC all cache nested loops, the PLC/HLC usually outperforms the ALC+N due to the
elimination of runtime loop cache filling cycles. However, the ALC+N can outperform the PLC/HLC for applications with a large number of critical regions, which cannot TBLOOK01. Figure 3-5(f) reveals that using a 64-entry ALC+N for TBLOOK01 improves on the loop cache access rates of the PLC and HLC by 42% and 56%, respectively.

For certain application scenarios, the ALC+N can perform as well as the ALC for larger loop cache sizes. Figure 3-4(b) shows that for the MiBench suite, the ALC and ALC+N have the same average loop cache access rate for the 128- and 256-entry loop caches. The ALC caches all inner loops including the most frequently executed inner loops. For smaller loop cache sizes, ALC+N performance suffers because the ALC+N can only cache the first portion of the outer loop and the first few inner loops. For 5 (c)), the first portion of the outer loop and the first few inner loops do not iterate as frequently as the remaining uncached inner loops that cannot fit in the loop cache (for small loop cache sizes). As the ALC+N grows larger, the more frequently executed inner loops are cached and the ALC+N performance approaches that of the ALC. Figure 3-5(c) shows that for loop cache sizes greater than 32 entries, the ALC+N and ALC both achieve a 97% loop access rate. A similar trend is observed for the nine benchmarks listed in Group D (Figure 3-6).

3.6.2.3 ALC energy savings

Figure 3-4(d), (e), and (f) depict average memory hierarchy energy savings for the EEMBC, MiBench, and Powerstone suites, respectively. As expected, for all three suites, our ALC outperforms the DLC due to increased ALC hits for a same-sized DLC. Average ALC energy savings improvements over the DLC reach as high as 12%, 26%,
and 49% for the EEMBC, MiBench, and Powerstone suites, respectively, and reach as high ys increases energy consumption compared to the base system because thrashing results in no loop cache accesses. The PLC and ALC can also result in negative energy savings for the 4-entry loop cache since a 4-entry loop cache can be too small to incur enough loop cache accesses needed to translate into energy savings.

Figure 3-4(d), (e), and (f) show that our ALC saves more energy than the PLC for loop cache sizes less than 64 entries. We point out that on average the ALC does not outperform the PLC/HLC for larger loop cache sizes due to the absence of PLC/HLC (PLC partition) filling. However, individual benchmarks show additional energy savings entry loop cache shown in Figure 3-5(c).

Figure 3-4(d) shows a decrease in energy savings for the ALC with a 256-entry loop cache. At this point, the additional loop cache hits for increased loop cache sizes do not outweigh the additional energy required for the larger loop cache (Figure 3-4(a) shows that the loop cache hit rate for the ALC levels off at 64 entries). Additional experiments run for loop cache sizes greater than 256 entries revealed a negligible increase in loop cache energy savings or a decrease in energy savings for all loop cache designs.

3.6.2.4 Energy savings for the ALC with nested loops

In general, the ALC+N does not increase energy savings due to lower loop cache access rates as compared to the ALC (Figure 3-4(c) and (d)). However, for highly size-constrained systems with some specific benchmarks, the ALC+N can achieve energy
entry loop cache in Figure 3-5(b)), which corresponds to a 12% improvement over a same-sized ALC.

3.6.2.5 Results summary

Figure 3-7(a) and (b) depict the average loop cache access rate and energy savings for the EEMBC, MiBench, and Powerstone suites, respectively. EEMBC-Avg presents the EEMBC results averaged across the 64-, 128-, and 256-entry loop cache sizes. Similar results are reported for the MiBench and Powerstone suites. Even though the ALC outperforms all other loop cache designs at smaller sizes, larger sizes are used for the results summary because the loop cache access rates and energy savings are best at these sizes for all loop cache designs.

Figure 3-7(a) and (b) show that, as expected, the average loop cache access rate and the average energy savings are much higher for the ALC compared to the DLC. This is because the ALC caches loops containing branches while the DLC can only cache loops with straight-line code. In general, the loop cache access rates and energy savings are higher for the MiBench and Powerstone suites compared with those of the EEMBC suite (Figure 3-7(a) and (b)). The MiBench and Powerstone suites have few loops that iterate many times in succession before exiting the loop and caching another new loop. The EEMBC suite has more loops that do not iterate as many times as those in the MiBench and Powerstone suites. The EEMBC suite therefore spends more time refilling the loop cache with different loops and does not achieve average loop cache access rates or energy savings as high as the MiBench and Powerstone suites. Figure 3-7(a) and (b) also show that the PLC and HLC generally outperform the ALC for larger loop cache sizes since larger PLCs and HLCs ca regions and do not require runtime filling cycles.
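The energy savings reported throughout Section 3.6 follow from the energy model of Equation 3-1 in Section 3.6.1. As a minimal sketch of how that model could be evaluated from simulator statistics, the Python below restates each term; the class names, the default word size, and the default miss latency are illustrative assumptions rather than values from the experiments, and cpu_stall_energy is treated as a precomputed total, following the grouping as printed in the equation.

```python
from dataclasses import dataclass


@dataclass
class FetchStats:
    """Cache statistics gathered from a simulation run."""
    ic_accesses: int  # L1 instruction cache accesses
    ic_misses: int    # L1 instruction cache misses
    lc_hits: int      # instructions fetched from the loop cache
    lc_fills: int     # loop cache fill operations


@dataclass
class EnergyParams:
    """Per-event energies; numeric defaults are placeholders (assumptions)."""
    ic_access_energy: float
    lc_access_energy: float       # 0.0 models the base system with no loop cache
    mem_energy_per_word: float
    cpu_stall_energy: float       # precomputed total, as grouped in Equation 3-1
    ic_line_size: int = 32        # bytes, matching the base L1 configuration
    word_size: int = 4            # bytes (assumed)
    miss_latency_cycles: int = 20 # cycles (assumed)
    static_fraction: float = 0.15 # static energy per cycle as a share of access energy


def total_energy(s: FetchStats, p: EnergyParams) -> float:
    # Total cycles: miss stall cycles + L1 hit cycles + loop cache hit cycles.
    cycles = (s.ic_misses * p.miss_latency_cycles
              + (s.ic_accesses - s.ic_misses)
              + s.lc_hits)
    ic_fill = (s.ic_misses * (p.ic_line_size / p.word_size) * p.mem_energy_per_word
               + p.cpu_stall_energy)
    ic_dynamic = s.ic_accesses * p.ic_access_energy
    ic_static = cycles * p.static_fraction * p.ic_access_energy
    lc_static = cycles * p.static_fraction * p.lc_access_energy
    lc_energy = (s.lc_hits + s.lc_fills) * p.lc_access_energy + lc_static
    return ic_fill + ic_dynamic + ic_static + lc_energy


def energy_savings(with_loop_cache: float, base: float) -> float:
    """Savings normalized to the base system with no loop cache."""
    return 1.0 - with_loop_cache / base
```

Setting lc_access_energy to zero and the loop cache counters to zero reproduces the base system, so the same function can produce both the numerator and the denominator of the normalized savings.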
3.7 Comparing the Filter Cache with the ALC

Like the ALC, the original filter cache is filled during runtime, requires no application profiling, and does not rely on a compiler or interpreter to identify or modify the filter cache contents. Unlike loop caches, however, the filter cache is not restricted to caching only loops and subroutines. Because of the similarities between the ALC and filter cache, we conclude this chapter by comparing the energy savings of our ALC to the energy savings of the filter cache.

Figure 3-8(a), (b), and (c) depict the average energy savings of the ALC, the filter cache with an 8-byte line size (filter-8), the filter cache with a 16-byte line size (filter-16), and the filter cache with a 32-byte line size (filter-32) for our 31 benchmarks, normalized to a base system without a loop or filter cache. The ALC and filter cache size was varied from 4 to 64 instructions (512 bytes). Filter caches larger than 512 bytes were not used since the filter cache is supposed to be very small compared to the L1 cache (8 KB in our case).

Since loop caches are direct mapped and can only store one instruction per line (8 bytes in our case), our initial experiment compared the ALC to a direct-mapped, 8-byte line size filter cache (filter-8). Figure 3-8(a), (b), and (c) show that, on average, the ALC always achieved higher energy savings than the 8-byte line size filter cache. We also observed that the filter cache only had significant energy savings for the 64-entry filter cache for the EEMBC and MiBench suites, with 10% and 14% energy savings, respectively. For the remaining sizes, the 8-byte line size filter cache configurations achieved negative energy savings (i.e., consumed more energy than the base cache) as low as -60% for the EEMBC and Powerstone 4-entry caches (the negative values are not shown on the graphs). However, for individual benchmarks, the 8-byte filter cache
achieved higher energy savings than the ALC when the ALC loop cache access rate ALC achieved 25% and 1% energy savings respectively for the 64-entry cache.

On average, the ALC always saved more energy than the 8-byte line size filter cache because the dynamic energy per access of the filter cache is larger than that of the ALC (for the same size) since the filter cache includes tag access energy while the ALC is tagless. Furthermore, the filter cache is accessed for every instruction fetched, increasing the total dynamic energy of the memory hierarchy, while the ALC is only accessed when an ALC hit is guaranteed. Additionally, the size of the loops in the embedded benchmarks is typically larger than the filter cache size; therefore, most instructions in the filter cache will be replaced before reuse. For loops larger than the ALC, the ALC is filled only once, then instructions are fetched from either the ALC or the L1 cache while the other device remains idle. For the filter cache however, the filter total dynamic energy of the filter cache. Also, since the filter cache has a low hit rate (less than 80%) due to its small size, the L1 cache is accessed frequently, which increases the dynamic access energy of the L1 cache and total system energy.

We increased the line size of the filter cache to improve the filter cache hit rate and instruction reuse. In our experiments we did not increase associativity since increasing associativity in prior work did not improve the overall energy savings. Figure 3-8(a), (b), and (c) show that, as expected, increasing the filter cache line size to 16 bytes (filter-16) increased the filter cache energy savings. We observed that the ALC still outperformed the 16-byte line size filter cache for all cases except the 64-entry loop cache for the
EEMBC suite, where the filter cache and ALC achieved energy savings of 27% and 19%, respectively. Increasing the filter cache line size to 32 bytes (filter-32) resulted in the filter cache achieving higher energy savings than the ALC since the 32-byte filter cache had an increased hit rate and instruction reuse. Figure 3-8(a), (b), and (c) show that the average energy savings for the 32-byte line size filter cache were 25% for the EEMBC suite, 18% for the MiBench suite, and 27% for the Powerstone suite, while the ALC achieved average energy savings of 14%, 12%, and 10%, respectively.

While the 32-byte line size filter cache saved more energy than the ALC and the filter cache is not restricted to caching loops, these energy savings came at the expense of an increase in execution time. Figure 3-8(d), (e), and (f) depict the execution time of the ALC and filter cache, normalized to the execution time of the base cache system, where execution time is measured as the total number of cycles needed to execute the benchmark to completion. Figure 3-8(d), (e), and (f) show that, compared to the base cache system, the normalized ALC execution time is 1. This is expected since the ALC has a 100% hit rate, meaning that instructions are fetched from the ALC only if the instruction is guaranteed to be in the loop cache. Instructions are fetched either from the ALC or the L1 cache, and there is no cycle penalty when transitioning from fetching from the ALC to fetching from the L1 cache (Section 3.4.2). However, because of the architectural placement of the filter cache between the L1 cache and processor (Figure 2-1(a)), a miss in the filter cache incurs a cycle penalty. Increasing the filter cache line size from 8 bytes to 32 bytes reduced the filter cache misses and performance overhead but did not completely eliminate the cycle penalty. The average execution time of the filter cache ranged from as low as 1.1x (normalized to the base cache) for
the 64-entry, 32-byte line size filter cache for the MiBench suite to 2.0x (normalized to the base cache) for the 4- and 8-entry, 8-byte line size filter caches for the Powerstone suite (Figure 3-8(d), (e), and (f)).

Figure 3-1. (a) Flowchart of basic adaptive loop cache (ALC) operation, (b) the ALC architecture, and (c) an ALC entry; each entry contains an instruction and the next_valid (nv) and taken_next_valid (tnv) valid bits.
Figure 3-2. An example of ALC operation.
Figure 3-3. The adaptive loop cache (ALC): (a) important ALC transitions and events and (b) detailed loop cache controller (LCC) state machine operation. Runtime control flow analysis occurs in the fill state.
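The nested-loop transitions of the LCC described in Section 3.5 can be sketched in a few lines of Python. This is only an illustration of the rules stated in the text, not the hardware design: the function and class names are hypothetical, the test used for the inner signal (branch and target both inside the region bounded by cr_start and tr_sbb) is an assumption, and the fill-state behaviour for a newly found inner loop is assumed to be "remain in fill" because the source text is truncated at that point.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    BUFFER = "buffer"
    FILL = "fill"
    ACTIVE = "active"


@dataclass
class Registers:
    cr_start: int                         # first instruction of the critical region
    tr_sbb: int                           # short backwards branch closing the region
    inner_sbb: Optional[int] = None       # sbb of the inner loop being analyzed
    inner_overflow: Optional[int] = None  # sbb of a partially cached inner loop


def same_loop(regs: Registers, sbb_addr: int) -> bool:
    # Asserted when tr_sbb, inner_sbb, or inner_overflow is taken again.
    return sbb_addr in (regs.tr_sbb, regs.inner_sbb, regs.inner_overflow)


def inner(regs: Registers, sbb_addr: int, target_addr: int) -> bool:
    # Assumed test: the loop lies inside the current critical region.
    return regs.cr_start <= target_addr and sbb_addr < regs.tr_sbb


def on_taken_sbb(state: State, regs: Registers,
                 sbb_addr: int, target_addr: int) -> State:
    """Next LCC state when a short backwards branch (sbb) is taken."""
    if same_loop(regs, sbb_addr):
        return State.ACTIVE      # transition to, or remain in, active
    if inner(regs, sbb_addr, target_addr):
        if state in (State.FILL, State.ACTIVE):
            return State.FILL    # add the new inner loop to the region
        return state             # not described in the text; left unchanged
    return State.BUFFER          # a new critical region is being executed
```

For example, with cr_start = 0x100, tr_sbb = 0x180, and inner_sbb = 0x140, retaking the branch at 0x140 in the fill state moves the controller to active, while a taken branch outside the region sends it to the buffer state.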
Figure 3-4. Experimental results for the DLC, ALC (without nested loop caching), ALC+N (with nested loop caching), PLC, and HLC for varying loop cache sizes (entries), showing the percentage of instruction fetches resulting in loop cache hits averaged for the (a) EEMBC, (b) MiBench, and (c) Powerstone suites and energy savings (normalized to the base system) averaged for the (d) EEMBC, (e) MiBench, and (f) Powerstone suites. [Plots: panels (a) to (c) show loop cache access rate (%) and panels (d) to (f) show energy savings (%), both versus loop cache entries.]
Figure 3-5. Experimental results for the DLC, ALC (without nested loop caching), ALC+N (with nested loop caching), PLC, and HLC for varying loop cache sizes (entries) for four individual benchmarks, showing the percentage of instruction fetches resulting in loop cache hits for (a) bcnt (Powerstone), (c) PNTRCH01 (EEMBC), (e) TBLOOK01 (EEMBC), and (g) AIFFTR01 (EEMBC) and corresponding energy savings (normalized to the base system) for (b) bcnt, (d) PNTRCH01, (f) TBLOOK01, and (h) AIFFTR01. [Plots: loop cache access rate and energy savings (%) versus loop cache entries (4, 8, 16, 32, 64, 128, 256).]
Figure 3-6. Benchmark grouping based on the benchmark characteristics discussed in this chapter.

Figure 3-7. Summary results for the (a) average loop cache access rate and (b) average energy savings. [Plots: loop cache access rate and energy savings (%) for EEMBC-Avg, MiBench-Avg, and Pstone-Avg, for the DLC, ALC, ALC+N, PLC, and HLC.]
Figure 3-8. Experimental results for the ALC and 8-byte, 16-byte, and 32-byte line size filter caches for varying sizes (entries), showing average energy savings for the (a) EEMBC, (b) MiBench, and (c) Powerstone suites, and execution time measured in number of cycles for the (d) EEMBC, (e) MiBench, and (f) Powerstone suites, normalized to the base cache system. [Plots: energy savings (%) and normalized execution time versus entries (4, 8, 16, 32, 64) for the ALC, filter-8, filter-16, and filter-32.]
CHAPTER 4 COMBINING LOOP CACHING WITH OTHER DYNAMIC INSTRUCTION CACHE OPTIMIZATIONS

4.1 Overview

Power and energy reduction is an important concern in embedded system design; therefore, several different energy optimization techniques have been proposed in the past. Since a system designer may choose to apply several different energy optimization techniques, it is important to evaluate how dependent optimization techniques interact (i.e., do these techniques complement each other, degrade each other, or does one technique obviate the other). Studying the interaction of existing techniques reveals the practicality of combining optimization techniques. For example, if combining certain techniques provides additional energy savings but the combination process is nontrivial (e.g., circular dependencies for highly dependent techniques), new design techniques must be developed to maximize savings. On the other hand, less dependent techniques may be easier to combine but may reveal little additional savings. Finally, some combined techniques may even degrade each other. These studies provide designers with valuable insights for determining whether the combined savings are worth the additional design effort. In this chapter, we focus on the interactions between three popular cache optimization techniques: loop caching, cache configuration, and code compression.

4.2 Instruction Cache Optimizations

4.2.1 Loop Caching

Details on the Preloaded Loop Cache and the Adaptive Loop Cache can be found in Chapters 2 and 3, respectively.
4.2.2 Level One (L1) Cache Tuning

Previous work on cache tuning can be found in Chapter 5.

4.2.3 The Combined Effects of Cache Tuning and other Optimizations

on, numerous energy optimization techniques for the instruction cache exist (we point out that many techniques exist for the entire cache hierarchy, but we focus on the instruction cache since a loop cache has little to no impact on data cache utilization). However, even though there exists a plethora of research on these optimization techniques applied individually, little research exists that explores the combined interaction of these techniques. Since a system designer may choose to apply several different optimization techniques, it is important to evaluate how dependent optimization techniques interact (do these techniques complement each other, degrade each other, or does one technique obviate the other). In this section, we present a summary of previous work highlighting the interaction of cache tuning with three interdependent optimization techniques: dynamic voltage scaling, hardware/software partitioning, and code reordering.

Nacul et al. investigated the effects of combining dynamic voltage scaling (DVS) with dynamic cache reconfiguration (DCR). DVS scales the supply voltage to reduce the processor and memory hierarchy operating frequencies, resulting in reduced energy consumption. Nacul proposed a DVS algorithm (which was invoked as part of making a best effort to meet task deadlines (this technique could result in minor performance degradation). Results showed that, when applied individually, DVS and DCR reduced energy consumption by similar amounts on average. However,
combining DVS and DCR resulted in up to 27% additional energy savings, versus using either technique individually, for tasks with longer deadlines.

In previous work, we evaluated the effects of combining hardware/software partitioning with cache tuning. Hardware/software partitioning removes the critical regions (and thus the major source of temporal and spatial locality) from the software and implements these critical regions in custom hardware, such as a field programmable gate array (FPGA). Hardware/software partitioning reduces energy consumption and increases performance since critical regions are executed in fast custom hardware instead of slower software. However, removing critical region instructions from software, and thus the instruction cache, affects cache locality, which may not only alter the optimal cache configuration but could also degrade established cache tuning heuristics (since these heuristics exploit cache locality to quickly explore the design space). Results showed that a non-partitioned system achieved average instruction cache energy savings of 53% while a partitioned system achieved average instruction energy savings of 55% with increased performance (note that these numbers do not include the additional energy and performance savings revealed by the hardware/software partitioning technique itself, and thus total system savings would be much larger). Even though the additional energy savings were small, these results showed that hardware/software partitioning and cache tuning complement each other, as cache tuning is still beneficial even after hardware/software partitioning is applied. In addition, existing cache tuning heuristics proved to be effective even after removal of temporal and spatial locality from the instruction stream.
In other previous work, we evaluated the effects of combining code reordering and cache tuning. Code reordering is a well-studied technique that attempts to improve system performance by placing frequently executed instructions contiguously in memory, thus improving spatial locality and cache utilization (it is well known that code reordering does not always improve performance). The basic form of code reordering uses execution profile information to determine frequently executed basic blocks, which are chained together to form larger regions of straight-line code. Infrequently executed basic blocks are moved to another location in memory, and jump instructions are inserted to retain proper execution flow. Code reordering therefore saves energy by reducing cache misses through increased spatial locality. We used the Pentium Link Time Optimizer (PLTO), which implements an improved version of the fundamental Pettis and Hansen code reordering algorithm that reorders basic blocks to reduce the number of taken branches and increase instruction locality. Studying the combined interaction of code reordering and cache tuning is interesting because code reordering tunes the instruction stream to the cache and cache tuning tunes the cache to the instruction stream. Results revealed that cache tuning obviates code reordering on average for the majority of system scenarios. Combining code reordering with cache tuning resulted in only a 2% increase in energy savings compared to cache tuning individually. However, cache tuning eliminated the performance degradation for system scenarios that did not benefit from code reordering individually. Finally, for certain system scenarios, code reordering resulted in optimal cache configurations that reduced area overhead, since the increased spatial locality provided by code reordering resulted in smaller, more efficient cache configurations.
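To make the chaining step of basic code reordering concrete, the following Python sketch greedily merges basic blocks along the most frequently taken edges, in the spirit of the bottom-up Pettis and Hansen approach. It is a simplified illustration under stated assumptions: the function name and the (source, destination, count) edge format are hypothetical, the improved PLTO variant is not modeled, and the final ordering of the chains and the insertion of jump instructions are left out.

```python
def chain_blocks(blocks, edges):
    """Greedily chain basic blocks along hot control flow edges.

    blocks: list of basic block names in original order.
    edges:  list of (src, dst, count) tuples from an execution profile.
    Returns a list of chains; each chain is a list of blocks to be
    laid out contiguously as straight-line code.
    """
    chain_of = {b: [b] for b in blocks}  # each block starts in its own chain
    for src, dst, _count in sorted(edges, key=lambda e: -e[2]):
        a, b = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst starts a different one,
        # so the hot edge becomes a fall-through.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Collect distinct chains in order of first appearance.
    seen, chains = set(), []
    for blk in blocks:
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            chains.append(chain)
    return chains
```

With blocks A, B, C, D and a profile in which A to B and B to D are hot while A to C and C to D are cold, the hot path ends up in one chain and the cold block C is left in a chain of its own, which corresponds to moving infrequently executed blocks elsewhere in memory.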
To the best of our knowledge, there exists no previous work studying the interaction between loop caching and cache tuning.

4.2.4 Code Compression

Several code compression techniques are based on well-known lossless data compression techniques. Wolfe and Chanin used Huffman coding to compress and decompress code for RISC processors. They also introduced the Line Address Table (LAT), used for mapping program instruction addresses to their corresponding compressed code instruction addresses. Lefurgy et al., Liao et al., and Lin et al. [10] developed code compression techniques to reduce program size based on LZW and other dictionary-based data compression schemes. Lekatsas and Wolf presented SAMC, a method for reducing memory requirements in an embedded system based on arithmetic coding and Markov models.

Decompression is typically implemented in a decompression unit, which can be placed either between the L1 instruction cache and main memory, or between the L1 instruction cache and the processor. If the decompression unit is placed between the L1 cache and main memory, decompression is done on every cache refill (decompression on cache refill (DCR) architecture, Figure 4-1(a)) and the instruction cache holds uncompressed instructions. If the decompression unit is placed between the L1 cache and the processor, decompression is done when instructions are fetched (decompression on fetch (DF) architecture, Figure 4-1(b)) and the instruction cache holds compressed instructions. In both architectures, however, the processor receives uncompressed instructions. Lekatsas et al. showed that the DF architecture could save more energy than the DCR architecture if the decompression overhead was small enough.
In their approach, Lekatsas et al. divided instructions into four groups, which they identified with codes prepended to the compressed instruction. The first group of instructions (instructions with immediates) was compressed with arithmetic coding and a Markov model. The second group of instructions (branches) was encoded by compressing only the destination field. The third group of instructions (fast dictionary instructions) was replaced by a one-byte index into a lookup table. The last group of instructions was left uncompressed. This approach achieved energy savings between 22% and 82% for the entire system (cache, processor, and busses). Lekatsas et al. concluded that the DF architecture performed better than the DCR architecture in their experiments since bit toggling and energy expended on busses were reduced and the L1 cache capacity was effectively increased.

Benini et al. proposed a low overhead DF method based on fast dictionary instructions. Benini et al. noted that, whereas the decompression unit was used only on cache refills for the DCR architecture, for the DF architecture the decompression unit was on the critical path (since the decompression unit was invoked for every instruction executed) and therefore must have a low decompression (performance) overhead. In their approach, Benini et al. profiled the executable to find the 256 most frequently executed instructions, which they denoted as SN. During compression, if an instruction belonged to SN, the instruction would be replaced with an 8-bit code only if that instruction and its neighboring instructions could be compressed into a single cache line. This approach achieved average energy savings of 30% for the entire system (cache, processor, and controller).
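The profile-driven dictionary idea described above can be illustrated with a minimal sketch. This is not the implementation of Benini et al.; the function names, the `(flag, value)` encoding, and the use of integers for instruction words are assumptions made for illustration, and the sketch omits the constraint that a group of neighboring instructions must fit into a single cache line.

```python
from collections import Counter

def build_dictionary(profile, size=256):
    """Map the `size` most frequently executed instruction words to small indices.

    `profile` is a hypothetical execution trace: one instruction word per
    executed instruction.
    """
    counts = Counter(profile)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(size))}

def compress(program, dictionary):
    """Encode each instruction as (1, index) if it is in the dictionary,
    otherwise as (0, original word)."""
    return [(1, dictionary[w]) if w in dictionary else (0, w) for w in program]

def decompress(encoded, dictionary):
    """Invert `compress` with a table lookup, as a decompression unit would."""
    table = [None] * len(dictionary)
    for word, idx in dictionary.items():
        table[idx] = word
    return [table[value] if flag else value for flag, value in encoded]
```

With a 256-entry dictionary each index fits in 8 bits, which is why a single table lookup is enough to restore a dictionary instruction on the fetch path.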
4.3 Loop Cache and Level One Cache Tuning

4.3.1 Experimental Setup

To determine the combined effects of loop caching and cache tuning, we determined the optimal (lowest energy) loop cache and L1 configurations for systems using the ALC and the PLC for 31 benchmarks from the EEMBC, MiBench, and Powerstone benchmark suites (all benchmarks were run to completion; however, due to incorrect execution not related to the loop caches, we could not evaluate the complete suites). We used the energy model and methods in Chapter 3 to calculate energy consumption for each configuration. For comparison purposes, we normalized energy consumption to a base system configuration with an 8 KB, 4-way set associative L1 instruction cache with a 32 byte line size (a configuration previously shown to perform well for a variety of benchmarks on several embedded microprocessors) and with no loop cache. We implemented each loop cache design in SimpleScalar. We varied the L1 instruction cache size from 2 KB to 8 KB, the line size from 16 bytes to 64 bytes, and the associativity from direct mapped to 4-way, and varied the loop cache size from 4 to 256 entries (Chapter 3). In our experiments we searched all possible configurations to find the optimal (lowest energy) configuration; however, previously proposed tuning heuristics can also be applied for dynamic configuration.

Our experiments evaluated three different system configurations. In the first experiment, we tuned the L1 cache with a fixed 32 entry ALC for EEMBC and MiBench and a fixed 128 entry ALC for Powerstone (denoted as tuneL1+ALC) (Chapter 3 showed that these sizes performed well on average for the respective benchmark suites). In the second experiment, we quantified additional energy savings gained by
tuning both the L1 instruction cache and the ALC (denoted as tuneL1+tuneALC). In our final experiment, we tuned the L1 cache while using a fixed 128 entry PLC (denoted as tuneL1+PLC). For thorough comparison purposes, we also report energy savings obtained by tuning the ALC using a fixed L1 base cache configuration (denoted as tuneLC+base) and tuning the L1 cache in a system with no loop cache (denoted as noLC).

4.3.2 Analysis

Figure 4-2 depicts energy savings for all experiments described in Section 4.3.1, normalized to the base system. In summary, these results compare the energy savings for combining loop caching and L1 cache tuning with the energy savings for applying loop caching and cache tuning individually.

First, we evaluated energy savings for each technique individually. L1 cache tuning alone achieved average energy savings of 54%, 60%, and 37% for the EEMBC, Powerstone, and MiBench benchmark suites, respectively. ALC tuning in a system with a base L1 cache achieved average energy savings of 23%, 46%, and 26% for the EEMBC, Powerstone, and MiBench benchmark suites, respectively. These results revealed that, in general, ALC tuning alone did not match the energy savings of L1 cache tuning alone. In these cases a smaller optimal L1 cache saved more energy than the ALC combined with the (much larger) base cache. For example, tuning the ALC with a fixed base L1 cache achieved 34% energy savings for IDCTRN01. However, when L1 cache tuning was applied, the 8 KB, 4-way, 32 byte line size base L1 cache was replaced with a much smaller 2 KB, direct mapped, 64 byte line size L1 cache, resulting in energy savings of 55%.
However, loop cache tuning alone can save more energy than L1 cache tuning without a loop cache when the optimal L1 cache configuration is already similar to the base cache (such as dijkstra in Figure 4-2). Also, when ALC loop cache access rates are high (greater than 90%), ALC tuning alone is sufficient.

Next, we evaluated the combined effects of a fixed sized ALC with L1 cache tuning (tuneL1+ALC in Figure 4-2). Additional energy savings were minor as compared to L1 cache tuning alone (average energy savings were 54%, 61%, and 42% for the EEMBC, Powerstone, and MiBench benchmark suites, respectively). Although the average improvement in energy savings across the benchmark suites was approximately 1%, adding a fixed sized ALC improved energy savings by as much as 15% for the stringsearch benchmark. Also, in cases where loop caching alone resulted in negative energy savings (benchmarks with less than a 10% loop cache access rate), this negative impact was offset using L1 cache tuning. For example, even when the ALC caused a 9% increase in energy consumption, the overall energy savings was still 48%.

Our next experiment investigated the effects of tuning both the L1 and the loop cache (tuneL1+tuneALC in Figure 4-2). Although there were improvements in energy savings, the resulting average energy savings of 57%, 64%, and 46% for the EEMBC, Powerstone, and MiBench benchmark suites, respectively, were not significantly better than the energy savings achieved by L1 cache tuning alone. Even though the average
energy savings improvement across all benchmark suites was approximately 4%, additional energy savings reached as high as 26% for Additionally, when comparing a system with a tuned L1 cache and a fixed sized ALC, the improvement in average energy savings was minor, averaging only 2% over all benchmark suites, ho RSPEED01 benchmark. The additional energy savings were minor because the energy savings for the optimal ALC size were very close to the savings for the fixed sized ALC, for two reasons: 1) the optimal ALC size was typically similar to the for each particular suite); and 2) loop cache access rates leveled off as the loop cache size increased (Chapter 3).

This finding is significant in that it reveals that L1 cache tuning obviates ALC tuning. If a system designer wishes to incorporate an ALC, simply significance is also important for dynamic cache tuning, since using a fixed sized ALC decreases the design exploration space by a factor of seven because we eliminate the need to combine each L1 configuration with seven ALC sizes.

The results presented thus far suggest that, in general, in a system optimized using L1 cache tuning, an ALC can improve energy savings, but it is not necessary to tune the ALC since L1 cache tuning dominates the energy savings. We observed that, since the optimal ALC configuration does not change the optimal L1 cache configuration, there is no need to consider the ALC during L1 cache tuning. The L1 cache configuration remains the same regardless of the presence of the ALC because using an ALC does not remove any instructions from the instruction stream, nor does
the ALC prevent those instructions from being cached in the L1 cache, and therefore the ALC does not affect the locality. In fact, the L1 cache supplies the processor with instructions during the first two loop iterations to fill the ALC (Chapter 3). The additional energy savings achieved by adding an ALC to the optimal L1 cache configuration result from fetching instructions from the smaller, lower energy ALC (Chapter 3). The tradeoff for adding the ALC is an increase in area, which can be as high as 12%. However, this area increase is only a concern in highly area constrained systems, in which case the system designer should choose to apply L1 cache tuning with no ALC.

Since the ALC does not change the actual instructions stored in the L1 cache (the ALC only changes the number of times each instruction is fetched from the L1 cache), our final experiment involved combining L1 cache tuning with a fixed sized PLC, since the PLC actually eliminates instructions from the L1 cache. Tuning the L1 cache and using a fixed sized PLC resulted in average energy savings of 61%, 69%, and 49% for the EEMBC, Powerstone, and MiBench benchmark suites, respectively. On average, adding the PLC to L1 cache tuning revealed additional energy savings of 10% as compared to L1 cache tuning alone (with no loop cache), with individual additional savings ranging from 10% to 27% for 12 of the 31 benchmarks. Furthermore, since the PLC is preloaded and the preloaded instructions never enter the L1 instruction cache, using a PLC can change the optimal L1 cache configuration, especially when PLC access rates are very high. Adding the PLC changed the optimal L1 cache configuration for 14 benchmarks, which resulted in area savings as high as 33%. Whereas these additional savings may be attractive, we reiterate that they come at the expense of the PLC pre-analysis step and require a stable application.
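The factor-of-seven reduction in the exploration space mentioned in this analysis can be checked with a small enumeration. The sketch below is illustrative only: the parameter ranges come from Section 4.3.1, but the restriction of associativity by cache size (modeled here as way shutdown of 2 KB banks) is an assumption about the configurable cache, chosen because it yields the 18 L1 configurations quoted for previous tuning heuristics. All names are hypothetical.

```python
from itertools import product

L1_SIZES_KB = (2, 4, 8)
LINE_SIZES_B = (16, 32, 64)
# Loop cache sizes from 4 to 256 entries in powers of two: seven sizes.
ALC_ENTRIES = tuple(2 ** k for k in range(2, 9))

def l1_configs(bank_kb=2):
    """Yield (size_kb, line_bytes, ways) tuples.

    Assumption: the number of ways cannot exceed size_kb / bank_kb, so a
    2 KB cache is direct mapped only and 4-way is available only at 8 KB.
    """
    for size, line in product(L1_SIZES_KB, LINE_SIZES_B):
        for ways in (1, 2, 4):
            if ways <= size // bank_kb:
                yield (size, line, ways)

def exhaustive_search(energy, configs):
    """Return the lowest energy configuration under a caller-supplied
    (hypothetical) energy model `energy(config) -> float`."""
    return min(configs, key=energy)

l1_space = list(l1_configs())
joint_space = list(product(l1_space, ALC_ENTRIES))
print(len(l1_space), len(joint_space))
```

Under these assumptions, fixing the ALC size leaves 18 L1 configurations to explore instead of 126 joint configurations.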
4.4 Code Compression, Loop Caching, and Cache Tuning

4.4.1 Loop Caching to Reduce Decompression Overhead

The basic Huffman decompression technique imposes both performance and energy overheads, which may impact the feasibility of using this type of instruction compression in some system scenarios. With respect to performance overhead, Huffman decompression can require several cycles to decompress instructions (Huffman decompression requires one clock cycle to traverse the Huffman tree for every bit entering the decompression unit). With respect to energy consumption, the decompression unit and the LAT (in the case of a taken branch) consume energy during decompression. Additionally, the CPU stall energy, instruction cache energy, and loop cache energy (if the loop cache is being filled) are increased during decompression.

Therefore, we propose using a loop cache to decrease these decompression overheads by storing/caching uncompressed instructions and providing them to the processor (Figure 4-1(c)). Unlike previous DF techniques, the decompression unit is not necessarily invoked for every instruction executed, since uncompressed instructions are stored in the loop cache. The decompression overhead is eliminated when the loop cache provides the processor with uncompressed instructions.

Since the type of loop cache used affects the magnitude of decompression overhead that is alleviated, we consider both the ALC and PLC loop cache architectures. With the ALC, the decompression unit is invoked during the first two loop iterations each time the loop is encountered. During these two iterations, the decompression unit provides the processor with decompressed instructions while simultaneously filling the ALC. The decompression overhead is eliminated beginning with the third iteration, and for all subsequent iterations, when the ALC provides the
processor with uncompressed instructions. However, since the PLC can be preloaded with uncompressed instructions, the PLC can potentially provide greater energy savings since the decompression unit would not be invoked for any instruction in the loop cache (the tradeoff being reduced flexibility due to the PL instructions stored in the PLC decreases the total number of instructions compressed, and thus potentially decreasing the instruction mix, the PLC could reduce the number of decompression cycles required for instructions not stored in the PLC, resulting in improved overall system performance.

4.4.2 Experimental Setup

To determine the combined effects of code compression with cache tuning, we determined the optimal (lowest energy) L1 cache configuration for a system using a modified DF architecture (Figure 4-1(c)) for the same 31 benchmarks and experimental setup as described in Section 4.3.1. For comparison purposes, energy consumption and performance were normalized to a base system configuration with an 8 KB, 4-way set associative L1 base cache with a 32 byte line size (with no loop cache). Based on Chapter 3, we used a 32 entry ALC and a 64 entry PLC for our experiments. We used Huffman encoding for instruction compression/decompression. Branch targets were byte aligned to enable random access decompression, and a LAT translated uncompressed addresses to corresponding compressed addresses for branch and jump targets.

We modified SimpleScalar to include the decompression unit, LAT, and loop cache. The energy model used in Section 4.3.1 was modified to include decompression energy. We also measured the performance (total number of clock cycles needed for
execution). The performance measured was normalized to the performance of the base system with uncompressed instructions and no loop cache.

4.4.3 Analysis

Figure 4-3 depicts the (a) energy and (b) performance of the optimal (lowest energy) L1 cache configuration for a system that stores compressed instructions in the L1 cache and uncompressed instructions in a loop cache (ALC or PLC), normalized to the base system with no loop cache. For brevity, Figure 4-3 shows average energy and performance for each benchmark suite and selected individual benchmarks that revealed interesting results.

Figure 4-3(a) shows that, on average, for the EEMBC benchmarks, the optimal L1 cache configuration combined with the 32 entry ALC did not result in energy savings. However, on average, the Powerstone and MiBench benchmarks achieved energy savings of 20% and 19%, respectively (Figure 4-3(a)), for the system with an ALC. Analysis of the benchmark structure revealed that both Powerstone and MiBench benchmarks contain only a few loops that iterate many times (several Powerstone and MiBench benchmarks stay in the same loop for hundreds of consecutive iterations), resulting in energy savings and a lower performance overhead. EEMBC benchmarks, however, contain many loops that iterate fewer times than the Powerstone and MiBench benchmarks (several EEMBC benchmarks stay in the same loop for fewer than 20 consecutive iterations). EEMBC benchmarks spend a short time fetching uncompressed instructions from the ALC before a new loop is encountered and the decompression unit is invoked again, resulting in low energy savings and a large performance overhead. However, EEMBC benchmarks with a high loop cache access rate achieved energy savings (for example, PNTRCH01 with a 97% loop cache access rate (Figure 3-5)
achieved 69% energy savings (Figure 4-3(a)) with only a small decompression overhead (Figure 4-3(b))).

Figure 4-3(a) shows that, on average, the Powerstone and MiBench benchmark suites both achieved energy savings of 30% for the system with an optimal L1 cache configuration combined with a 64 entry PLC. An additional 10% in average energy savings was gained by eliminating the decompression overhead, which would have been consumed while filling the ALC. Figure 4 adpcm e benchmarks saved 56% and 38% more energy, respectively, when using the act of the decompression overhead. For blit, the loop cache access rate for the 32 entry ALC is higher than the loop cache access rate for the 64 entry PLC (80% compared with 30% in Chapter 3), but by removing the decompression energy consumed during the first two iterations of the loop, the system with the PLC saved almost as much energy as the system with the ALC (Figure 4-3(a)).

Figure 4-3(a) also shows that, on average, for the EEMBC benchmarks, using the PLC did not result in energy savings and that the ALC outperformed the PLC. This result is expected since, for the EEMBC benchmarks, the PLC only outperformed the ALC for the 256 entry loop cache (Chapter 3).

Figure 4-3(b) shows that, on average, the execution time of the system (our performance metric) increased for both the ALC and the PLC because of the large decompression overhead (the loop cache itself does not affect system performance since it guarantees a 100% loop cache hit rate). The average increase in execution time due to decompression overhead ranged from as much as 4.7x for EEMBC benchmarks with a PLC to 1.7x for MiBench
benchmarks with a PLC (Figure 4-3(b)). We also observed that using the PLC instead of the ALC reduced the decompression overhead by approximately 40% for Powerstone and MiBench benchmarks. Individual results showed that, for most benchmarks, the PLC reduced the decompression overhead but increased system performance as compared to a system with no PLC. As shown in Figure 4-3, for the system with the e achieved 73% energy savings (38% more than the system with the ALC) and reduced the performance overhead to only 2% more than the performance of the base system.

For our experiments, we tuned the L1 cache while keeping the loop cache size fixed to find the optimal (lowest energy) combination of L1 cache and loop cache. We compared these new L1 cache configurations to the lowest energy L1 cache configurations for a system with uncompressed instructions and no loop cache. We found that for 12 out of 31 benchmarks the new L1 cache configurations were smaller for the systems using compression compared with the L1 cache configurations for the systems not using compression. These benchmarks were able to use smaller L1 cache configurations since the L1 cache stored compressed instructions, which effectively increased the cache capacity. However, we did not observe a change in L1 cache configuration for systems with low loop cache access rates and no energy savings. Additionally, for some benchmarks, the optimal L1 configuration for the uncompressed system was already the smallest size (2 KB), so adding a loop cache did not result in a smaller L1 cache configuration.

We calculated the area savings gained by replacing the L1 cache storing uncompressed instructions with the smaller L1 cache storing compressed instructions
combined with the loop cache for the 12 benchmarks with new optimal L1 configurations. The benchmarks that replaced an 8 KB L1 cache with a 2 KB L1 cache and loop cache achieved approximately 50% area savings. The benchmarks that replaced an 8 KB L1 cache with a 4 KB L1 cache and loop cache, and those that replaced a 4 KB L1 cache with a 2 KB L1 cache and loop cache, achieved approximately 30% and 20% area savings, respectively. For the remaining benchmarks, the L1 cache configuration did not change, and thus adding a loop cache increased the area of the system. Some benchmarks achieved energy savings but not area savings. For example, the PNTRCH01 benchmark had a loop cache access rate of 97% and achieved 69% energy savings with the ALC, but the L1 configuration was the same for both the uncompressed and compressed systems, which resulted in an increase in area of approximately 14%.

4.4.4 Using LZW Encoding

In the previous subsection we investigated the possibility of using the loop cache to reduce the decompression overhead when using Huffman encoding. We found that although we achieved energy savings as high as 69% for an application with a high loop cache access rate, on average we did not achieve energy savings for EEMBC benchmarks and had only 30% energy savings for Powerstone benchmarks. These low energy savings are due to Huffman decompression requiring several cycles to decompress instructions (Huffman decompression requires one clock cycle to traverse the Huffman tree for every bit entering the decompression unit) and the CPU stall energy, instruction cache energy, and loop cache energy (if the loop cache is being filled) accumulated during the decompression cycles.
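To make the one-cycle-per-bit cost concrete, the following minimal sketch builds a textbook Huffman code over a toy instruction stream and counts one cycle for every bit consumed during decoding. It is an illustration under stated assumptions, not the decompression unit modeled in this chapter: the symbols are whole mnemonics, the function names are hypothetical, and the cycle count ignores LAT lookups and stalls.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a prefix-free code (symbol -> bit string) from symbol frequencies."""
    counts = Counter(symbols)
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

def decode(bits, codes):
    """Decode a bit string, charging one cycle per bit (one tree edge per bit)."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current, cycles = [], "", 0
    for bit in bits:
        current += bit
        cycles += 1
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out, cycles

stream = ["add"] * 6 + ["ld"] * 3 + ["br"]
codes = huffman_codes(stream)
bits = "".join(codes[s] for s in stream)
decoded, cycles = decode(bits, codes)
```

In this toy stream the frequent symbol receives a one-bit code, so it costs one cycle to decode, while rare symbols cost more; the total cycle count equals the length of the compressed bit stream.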
In our next experiment we investigated the possibility of using the ALC to reduce the decompression overhead in EEMBC and Powerstone benchmarks when using LZW encoding. We chose LZW encoding because LZW has a lower decompression overhead compared with Huffman encoding; therefore, by using LZW encoding we were able to achieve energy savings for EEMBC and increase the energy savings achieved by Powerstone benchmarks. During LZW decompression the codeword is retrieved and looked up in the dictionary, then the uncompressed instructions are output while the dictionary is updated (if necessary). In our experiments we compressed branch blocks (the set of instructions between two consecutive branch targets) and byte aligned branch targets to enable random access decompression. A LAT translated uncompressed addresses to corresponding compressed addresses for branch and jump targets. The remaining details for the experimental setup are the same as in Section 4.4.2.

4.4.5 Analysis

Figure 4-4 depicts the (a) energy savings and (b) performance of the optimal (lowest energy) L1 cache configuration for a system that stores compressed instructions in the L1 cache and uncompressed instructions in the ALC, normalized to the base system with no loop cache. Figure 4-4(a) shows that, on average, the EEMBC and Powerstone benchmarks achieved energy savings of 30% and 50%, respectively, for LZW encoding, and average energy savings of 35% and 20%, respectively, for Huffman encoding (note that negative values are not displayed on the graph). The increase in average energy savings was because of the lower decompression overhead of LZW encoding.
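The LZW decompression step described above (look up the codeword, output the entry, update the dictionary) can be sketched as follows. This is the textbook byte-oriented LZW algorithm, shown only to illustrate the lookup and update loop; it does not model the branch block compression, byte alignment, or LAT used in these experiments, and the function names are hypothetical.

```python
def lzw_compress(data):
    """Textbook LZW: emit one integer code per longest dictionary match."""
    table = {bytes([i]): i for i in range(256)}
    w = b""
    out = []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = len(table)   # learn the new phrase
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

def lzw_decompress(codes):
    """Rebuild the dictionary on the fly: one lookup and at most one
    dictionary update per codeword."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code refers to the phrase that is being defined right now.
            entry = prev + prev[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = prev + entry[:1]
        prev = entry
    return b"".join(out)

data = bytes([1, 2, 1, 2, 1, 2, 1, 2])
codes = lzw_compress(data)
restored = lzw_decompress(codes)
```

Each codeword is resolved with a single table lookup and can expand to several bytes, in contrast to the bit-serial tree traversal of Huffman decoding.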
Figure 4-4(a) shows that for individual benchmarks, in most cases LZW encoding achieves much higher energy savings as compared to Huffman encoding. For the benchmarks differences in energy savings are very small, Huffman encoding was able to achieve energy savings close to LZW encoding because these benchmarks have a high loop cache access rate of at least 97% (Section 22.214.171.124). These benchmarks spend at least 97% of the time fetching uncompressed instructions from the loop cache and very little time decompressing instructions; therefore, even Huffman encoding with a high decompression overhead achieves energy savings comparable to the energy savings of LZW encoding.

Figure 4-4(b) shows that, on average, the execution time of the system (our performance metric) increased for both Huffman and LZW encoding because of the decompression overhead. The average increase in execution time due to decompression overhead for Huffman encoding was 4.7x for EEMBC and 3.2x for Powerstone compared to executing the benchmark in the base system. Using LZW encoding decreased the decompression overhead and overall performance penalty by half. Figure 4-4(b) shows that the average increase in execution time due to decompression overhead for LZW encoding was 2.0x for EEMBC and 1.5x for Powerstone compared to executing the benchmark in the base system.
Figure 4-1. (a) The Decompression on Cache Refill (DCR) architecture; (b) the Decompression on Fetch (DF) architecture; (c) the Decompression on Fetch (DF) architecture with a loop cache to store decompressed instructions.

Figure 4-2. Energy savings (compared with the base system with no loop cache) for loop caching and cache tuning for the (a) EEMBC, (b) Powerstone, and (c) MiBench benchmark suites. [Bar charts. Y-axis: Energy Savings (%). Series: tuneL1+ALC, tuneL1+tuneALC, tuneLC+base, tuneL1+PLC, noLC.]
Figure 4-3. (a) Energy and (b) performance (energy and performance normalized to the base system with no loop cache) for the lowest energy cache configuration averaged across the EEMBC benchmarks (EEMBC Avg), Powerstone benchmarks (Powerstone Avg), MiBench benchmarks (MiBench Avg), and for selected individual benchmarks (PNTRCH01, blit, dijkstra, and adpcm e). [Bar charts. Y-axes: Energy; Performance. Series: ALC, PLC.]

Figure 4-4. (a) Energy savings and (b) performance (normalized to the base system with no loop cache) for a system that stores LZW compressed instructions in the L1 cache and uncompressed instructions in the ALC for the EEMBC and Powerstone benchmark suites. [Bar charts. Y-axes: Energy Savings (%); Normalized Performance. Series: huffman, lzw.]
CHAPTER 5
MULTI-CORE LEVEL ONE (L1) DATA CACHE TUNING FOR ENERGY SAVINGS IN EMBEDDED SYSTEMS

5.1 Overview

For the remainder of the dissertation we focus on dynamic cache tuning for energy reduction in a multi-core system executing multithreaded applications that share data. Multi-core architectures are becoming a popular method to increase system performance via exploited parallelism without increasing the processor frequency and/or energy. To further improve these systems, multi-core optimizations focus on improving system performance via scheduling, proximity aware caching, and cooperative caching, or on improving energy efficiency with voltage and frequency scaling. In particular, fine grained multi-core architectures allow applications to be decomposed into smaller subtasks executing on several processors. Application decomposition enables fine grained energy management techniques by allowing individual devices (e.g., processors, caches, pipelines, etc.) to be shut down or placed in a lower power operating mode. Additionally, considering both homogeneous and heterogeneous processors increases energy reduction potential since the smaller subtasks can be highly specialized to a particular processor. In Chapters 6 and 7 we focus on cache tuning for multi-core systems due to the  and previous single core successes.

5.2 Previous Work on Single Core Cache Tuning

Cache tuning enables application specific optimizations by selecting the lowest energy configuration configurations can waste energy.
Since different applications exhibit different runtime behavior and no cache configuration satisfies all system scenarios, off-the-shelf microprocessors typically fix the cache configuration to a configuration that performs well on average across all applications  affect the optimal cache size since, if the cache is larger than a critical region, energy is wasted fetching from an unnecessarily large cache, while, if the cache is smaller than a critical region, energy is wasted during frequent cache miss stalls. Similarly, an affects the optimal line size and associativity. A larger line size is preferred when an application exhibits a high degree of spatial locality, since the larger line size will result in fewer cache misses due to prefetching. However, if an application does not exhibit much spatial locality, a smaller line size is preferred since, if the line size is too large, energy can be wasted fetching unused instructions. Finally, temporal locality affects the associativity of the cache. Increasing associativity exploits temporal locality by reducing the frequency of cache replacements (and stalls incurred during cache filling).

Thus, since cache tuning enables application specific energy/performance optimizations, there exists much previous work detailing different ca configuration      enabled/disabled cache ways using a cache control register for power and performance tuning. The cache control registers allowed each cache way to be designated as either an instruction only way, a data only way, or a unified way, or, to save power (reduce the powered on cache size), to be shut down
entirely. Albonesi described a dynamic technique to shut down and activate ways to tune the cache size and decrease power consumption for different application phases. Zhang et al. achieved an average of 30% energy savings using way concatenation, a method that logically concatenated ways to adjust associativity. To adjust line size, Veidenbaum et al. proposed a technique to change the cache line size based on application accesses to the line, and Zhang et al. used line concatenation to logically implement larger line sizes as multiples of smaller physical line sizes.

Each of these techniques configured only one cache parameter while holding the remaining cache parameters fixed (i.e., tuning the cache size but leaving the associativity and line size fixed). More advanced methods increased energy savings by tuning multiple cache parameters simultaneously. Janapsatya et al. developed a technique to tune all cache parameters simultaneously, and explored all cache configurations using a tree based structure where each node of the tree pointed to a linked list containing cache statistics. Software based methods such as the single pass cache evaluation (SPCE) technique evaluated all values for all cache parameters simultaneously in a single pass using a stack to store previously executed instructions for cache hit analysis, bitwise operations to determine cache hits, and a simple table structure to store cache hit information for all cache configurations. A hardware based version of SPCE, the one shot cache tuner, updated the cache to the best configuration in one shot without having to physically explore inferior configurations. In order to make SPCE amenable to runtime use, algorithmic modifications included reducing address processing time by sampling the address stream instead of processing every
address and leveraging a small TCAM structure to store previously executed instructions for cache hit analysis.

The application designer can either determine optimal cache parameters during design time or the cache parameters can be determined dynamically during runtime. Design time cache tuning can require significant effort in terms of simulation setup and run time and may be inappropriate for highly dynamic applications. Alternatively, runtime cache tuning requires no application designer effort and is appropriate for highly dynamic applications because runtime cache tuning can react to actual input stimulus, environmental conditions, and application phase changes.

Runtime cache tuning determines the optimal cache configuration by executing the application in each explored cache configuration for a period of time. During this design space exploration, several inferior, non-optimal cache configurations may be explored before the optimal or near optimal configuration is determined. Since previous work shows that these inferior configurations can consume an exorbitant amount of energy and introduce performance overheads as compared to executing in a base cache configuration, it is imperative to reduce the number of inferior configurations explored.

Tuning heuristics reduce the number of inferior configurations explored and find the optimal or near optimal configuration while searching only a fraction of the design space. Zhang et al. introduced an L1 cache tuning heuristic that searched configurations in order of their impact on energy, resulting in 40% average energy savings, which was within 7% of the optimal, by searching only 5 of 18 possible ordered, single core L1 cache tuning heuristic tunes the
cache size followed by the line size and finally the associativity. During cache tuning, in order to isolate the effects of a single cache parameter, one cache parameter is explored while the other parameters are held constant. For example, during size tuning, the line size and associativity are held constant at their smallest values while the cache size is explored from smallest to largest, which minimizes cache flushing and misses incurred while switching configurations. The heuristic increases the cache size as long as the size increase results in a decrease in energy consumption. This process continues until there is an increase in energy consumption or the largest cache size is reached, at which point the cache size is fixed at the lowest-energy size. The line size and associativity are tuned in a similar way. Since a system designer may deem the optimal cache configuration to be the lowest-energy configuration, the highest-performance configuration, or, for a flexible system, a set of configurations with a reasonable tradeoff between energy and performance (the Pareto-optimal set), determining the optimal cache configuration is time consuming and difficult. Previous work has developed heuristics to decrease the design space exploration time as compared to exhaustive search techniques. Platune, a configurable system-on-a-chip platform, quickly finds the Pareto-optimal set by separating the dependent and independent tunable parameters. Palesi et al. by pruning the design space using a genetic algorithm. TCaT which extended cache, interleaved the L1 and L2 exploration, and included a final associativity adjustment. TCaT searched 6.5% of over 400 possible configurations, found configurations within 3% of the optimal, and achieved 53% energy savings on average.
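The impact-ordered search described above (size first, then line size, then associativity, each stepped upward only while energy keeps falling) can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the parameter ranges, the `energy` callback, and the toy energy function are all assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

Config = Dict[str, int]

# Illustrative parameter ranges (assumed, not taken from the cited work).
SIZES_KB = [2, 4, 8]
LINE_BYTES = [16, 32, 64]
WAYS = [1, 2, 4]


def impact_ordered_tune(energy: Callable[[Config], float]) -> Tuple[Config, int]:
    """Tune size, then line size, then associativity, one parameter at a time.

    `energy` is a hypothetical callback returning the measured energy of a
    configuration. Returns the chosen configuration and the number of
    distinct configurations that were evaluated.
    """
    seen: Dict[Tuple[int, int, int], float] = {}

    def measure(cfg: Config) -> float:
        # Each distinct configuration is evaluated (executed) only once.
        key = (cfg["size_kb"], cfg["line_bytes"], cfg["ways"])
        if key not in seen:
            seen[key] = energy(cfg)
        return seen[key]

    # Start with every parameter at its smallest value.
    cfg = {"size_kb": SIZES_KB[0], "line_bytes": LINE_BYTES[0], "ways": WAYS[0]}
    ordered = (("size_kb", SIZES_KB), ("line_bytes", LINE_BYTES), ("ways", WAYS))
    for name, values in ordered:
        for value in values[1:]:
            trial = dict(cfg, **{name: value})
            if measure(trial) >= measure(cfg):
                break  # no further decrease in energy: fix this parameter
            cfg = trial
    return cfg, len(seen)


def toy_energy(cfg: Config) -> float:
    # Made-up energy surface whose minimum is at 4 KB, 32-byte lines, 2-way.
    return (abs(cfg["size_kb"] - 4) * 3.0
            + abs(cfg["line_bytes"] - 32) / 16.0
            + abs(cfg["ways"] - 2) * 0.5
            + 10.0)


best, explored = impact_ordered_tune(toy_energy)
print(best, explored)
```

On this toy surface the search evaluates only a handful of the 27 combinations in the assumed ranges, which is the point of tuning one parameter at a time.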
ACE-AWT increased of over 18,000 possible configurations, found configurations within 1% of the optimal cache configuration, and achieved 62% energy savings on average. Wang et al. introduced SACR (Scheduling-Aware Cache Reconfiguration) for real-time embedded systems, which applied dynamic cache tuning to embedded systems with soft real-time constraints. SACR used static profiling to dynamically configure the L1 cache, resulting in an average reduction in energy consumption of 50% while ensuring that deadlines are met for most tasks. Wang et al. extended SACR to a two-level cache hierarchy and achieved energy savings as high as 46%. Sundararajan et al. introduced a reconfigurable cache architecture and a runtime cache tuning heuristic for reducing the energy-delay product of the cache hierarchy. The Set and way Management cache Architecture for Run-Time reconfiguration (Smart cache) allowed the cache size and cache associativity to be reconfigured during runtime. The Smart cache used a decision tree machine learning model to reconfigure the cache during runtime and achieved an average reduction in energy-delay product of 17% with a performance overhead of less than 2%. Each of the previous techniques configured the cache on a per-application basis. Hajimiri et al. introduced intra-task dynamic cache tuning, which tuned the cache on a per-phase basis instead of performing cache tuning once for the entire application. Intra-task cache tuning used static analysis to identify the application phases and to choose when to apply cache tuning. The energy consumption and performance were evaluated for every possible cache configuration for each phase, and the lowest-energy configuration for the different phases was chosen during runtime. Energy savings for
intra-task configurations were on average 12% and 7% for the instruction and data cache, respectively, with a performance penalty of up to 1%.

5.3 Additional Multi-Core Cache Tuning Challenges

Whereas these single-core heuristics were highly effective, developing cache tuning heuristics was challenging given large design spaces, cache interdependencies, and the need to minimize system intrusion. Multi-core systems not only exacerbate these challenges, but also introduce new challenges. For example, in heterogeneous multi-core systems, where each core can have a different cache configuration, the design space grows exponentially with the number of cores. Also, during design space exploration, cache tuning incurs energy and performance penalties while executing inferior, non-optimal configurations. Minimizing these cache tuning overheads is critical in a multi-core system due to overhead accumulation across each core and the potential power increase if all cores simultaneously tune the cache. New challenges include core interdependencies across applications with a high degree of data sharing, which affects cache behavior (e.g., increased cache miss rates) and processor stalls. Applications with core interactions have circular tuning caches. For example, increasing the cache size increases the amount of data that the cache can store and decreases the miss rate. However, this larger cache is more likely to store shared data, which may increase the number of cache coherence evictions and forced write backs for all cores, thus increasing energy consumption. Therefore, cores executing data-sharing applications cannot be tuned individually without coordinating the tuning and considering the core interactions.
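The exponential growth of the heterogeneous design space can be made concrete with a short calculation. The sketch below assumes every core can independently take any of 36 configurations; that per-core count is inferred from the parameter ranges in Section 6.4.1 (four power-of-two sizes from 8 KB to 64 KB, three line sizes, three associativities) and is an assumption of this example rather than a figure stated in this section.

```python
def design_space_size(configs_per_core: int, num_cores: int) -> int:
    """Number of system-wide configurations when each core is tuned independently."""
    return configs_per_core ** num_cores


# Assumed per-core count: 4 sizes x 3 line sizes x 3 associativities.
PER_CORE = 4 * 3 * 3

for cores in (1, 2, 4, 8, 16):
    print(f"{cores:2d} cores: {design_space_size(PER_CORE, cores):,} configurations")
```

For two cores this reproduces the 1,296 configurations used for the dual-core experiments in Chapter 6; each doubling of the core count squares the size of the space.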
CHAPTER 6
LEVEL ONE (L1) DATA CACHE TUNING IN DUAL-CORE EMBEDDED SYSTEMS

6.1 Overview

In this chapter, we introduce the Conditional Parameter Adjustment Cache Tuner (CPACT) heuristic for L1 cache tuning in a dual-core system. CPACT finds the optimal, or a near-optimal, lowest-energy cache configuration by searching only a small fraction of possible configurations in a heterogeneous system where each core is able to have a different L1 data cache configuration.

6.2 Previous Multi-Core Optimizations

Most previous work in multi-core optimizations focused on improving cache performance by reducing off-chip accesses and resource contention using methods such as cooperative caching, proximity-aware caches, scheduling heuristics, cache partitioning (for maximizing throughput, minimizing miss rate, or maximizing speedup and fairness), or cooperative cache partitioning. Some performance-centric techniques (e.g., scheduling) can also target energy consumption. Merkel et al. reduced the energy-delay product using specialized chip frequencies to minimize delay and by co-scheduling tasks to minimize resource contention. Park et al. proposed three synchronization-aware algorithms to estimate thread-dependent slowdowns and achieved energy savings using different processor power modes. Seo et al. balanced the task loads of multiple cores to optimize power consumption and adjusted the number of active cores, resulting in energy savings as high as 26%. Although multi-core optimizations were typically applied to homogeneous architectures, heterogeneous architectures were introduced to improve energy and performance using application-specific processor optimizations. Kumar et al.
introduced a single-ISA heterogeneous multi-core architecture where system software evaluated resource requirements and selected the best core for good performance while minimizing energy during runtime. The proposed system achieved more than a 3x reduction in energy with only an 18% performance penalty. Kumar et al. also used a hill-climbing search heuristic to design the single-ISA heterogeneous architecture, which produced performance improvements as high as 40%, was within 5% of the optimal (best performance), and searched 14% of the design space. Wang et al. [65] combined dynamic cache tuning in private caches with cache partitioning in shared caches in a multi-core system. The authors used static profiling to assist in choosing the best (low-energy) cache configuration for L1 caches while partitioning the shared L2 cache, and achieved average energy savings of 30% for the L1 and L2 caches. Wang et al. executed multiprogrammed benchmarks; therefore, unlike our multi-core cache tuning work, their work does not consider the core interactions from shared data in a multi-core system using multithreaded benchmarks with shared data.

6.3 Runtime Dual-Core Data Cache Tuning

Results in Section 6.4.2 will show that dual-core data cache tuning can achieve an average energy savings of 25%, with energy savings as high as 50% (Figure 6-4 (a)) for the optimal configuration (i.e., the lowest-energy configuration in our defined design space (Section 6.4.1)). To evaluate a naive adaptation of a state-of-the-art single-core cache tuning heuristic to a dual-core system, we applied the impact-ordered single-core L1 tuning heuristic to both processors in the dual-core system. This heuristic was not sufficient for dual-core systems, producing configurations that increased the energy consumption by as much as 26% as compared
to the optimal dual-core configuration and achieved less than 1% average energy savings as compared to a base dual-core configuration. Since these results revealed that existing heuristics are not appropriate for runtime dual-core cache tuning, we developed a dual-core cache tuning heuristic to achieve energy savings by efficiently exploring a fraction of the design space. Figure 6-1 depicts our dual-core cache tuning heuristic, the Conditional Parameter Adjustment Cache Tuner (CPACT), which determines the final cache configuration (optimal or near-optimal) during runtime with no application designer effort. CPACT consists of an initial step and a size adjustment step that determine initial parameter values, and conditions that govern final tuning actions that perform additional parameter value adjustments to home in on the optimal configuration. Conditions are critical to reducing the number of configurations explored, as the governed actions are only performed when additional parameter value adjustment is highly likely to improve the explored, CPACT also minimizes the energy and performance overhead penalties incurred during design space exploration when the application is executed in inferior configurations. direct-mapped cache with a 16-byte line size initial step applies impact simultaneously, which reduces the total tuning time (design exploration time) and number of dual-core configurations explored as compared to sequentially exploring
tuning time will become more pronounced as the number of cores in the system increases. While holding the line size and associativity constant at their smallest values, CPACT increases the size from 8 KB to 64 KB, in increments of powers of two, until there is no decrease in energy. CPACT similarly tunes the line size and then the associativity. Since the cache size has the largest impact on energy consumption and the cache size can be greatly affected by the actual line size and associativity values, CPACT performs a size adjustment step after the initial step tunes the associativity (i.e., CPACT tunes the cache size, line size, associativity, and then the cache size again). Typically, the initial step selects a larger size than the optimal size to compensate for the small line size and associativity; therefore, the additional size adjustment step begins with the size determined during the initial step and decreases the size until the smallest size is reached or there is no decrease in the energy consumption. Finally, CPACT conditionally adjusts the line size and associativity again. Similar to the initial step, this adjustment increases the parameter values until the largest size is reached or there is no decrease in energy consumption. Condition 1 evaluates the energy results of the size adjustment step. If the size adjustment step decreased the energy consumption, action 1 performs a final adjustment of the line size followed by the associativity. Alternatively, if the size adjustment step increased the energy consumption, action 2 evaluates one additional configuration with the cache size halved and the line size doubled from the configuration
Condition 2 evaluates the energy results of action 2. If action 2 decreased the energy consumption, action 3 performs a final adjustment on the line size and associativity. If condition 2 determines that action 2 increased the energy consumption, action 4 halts cache tuning and no additional parameter adjustment is necessary. Once the final configuration is determined, the benchmark is executed in that final combined with phase classification and configuration techniques to tune for each execution phase. In our experiments, we model a dual-core system where each L1 data cache uses cache The physical total size of each data cache is 64 KB; however, the data cache is composed of 32 banks of 2 KB each that can be shut down and/or concatenated to vary the size and associativity, thus creating different configurations. To enable runtime cache tuning, each data cache is coupled with a private hardware cache tuner that executes the CPACT algorithm. A shared signal between these tuners enables a tuner to signal the other tuner when the initial tuning step has completed (Figure 6-2). This signal is critical to the quality of the configured cache, as our experimental results revealed that without coordinating the start of the size adjustment step, CPACT does not find the optimal configuration for some benchmarks. Previous work showed that the hardware overhead of a similar-purposed cache tuner is minimal.

6.4 Experimental Results

6.4.1 Experimental Setup

2 multithreaded benchmarks (2 SPLASH
long execution times) on the SESC simulator for a dual-core system. For each processor, we varied the L1 data cache size from 8 KB to 64 KB, the line size from 16 bytes to 64 bytes, and the associativity from direct-mapped to 4-way, resulting in a design space of 1,296 possible cache configurations. In SESC, the L1 instruction cache and L2 unified cache were fixed at a 64 KB, 4-way set-associative cache with a 64-byte line size and a 256 KB, 4-way set-associative cache with a 64-byte line size, respectively. SESC has a 4-instruction issue width and allows out-of-order execution. The integer multiplication and division functional units and the floating-point multiplication and division functional units have latencies of 4, 12, 2, and 10 cycles, respectively. All other functional units have a single-cycle latency. Figure 6-3 depicts the dual-core energy model (adapted from a single-core model to as the system) energy consumption given a particular cache configuration. Our energy model considers the dynamic access energy and static energy for the L1 data (dL1) cache, main memory fetch energy for a cache miss (fill_energy), main memory write energy for a cache write back (mem_write_energy_perword), and processor stall energy during a cache miss (CPU_stall_energy) for each processor. We gathered dL1_misses, dL1_hits, and dL1_writebacks using SESC. CACTI v6.5 provided the dynamic cache energy dissipation and main memory energy per word for 90 nm technology. We assumed static energy per clock cycle for each data cache configuration as 25% processor idle energy (CPU_idle_energy) as 25% of the processor's active energy of the MIPS32 M14K processor. We assumed a 2-cycle hit latency and a 30-, 60-, and
90-cycle miss penalty for reading and writing off-chip memory for the 16-, 32-, and 64-byte line sizes, respectively. energy consumption with the given configuration to that of a system where each pr way set-associative L1 cache with a 64-byte line size. Our base configuration is set to the largest size since caches not specialized for specific applications are typically set to the largest co performance. several e performance impacts when executing the entire benchmark in the optimal configuration as compared to the fixed base configuration. Next, we quantified the ineffectiveness of the naive adaptation of the single-core L1 tuning heuristic (i.e., ordered tuning for dual-core systems). Finally, we evaluated CPACT's improvements beyond the impact-ordered tuning heuristic via the size adjustment step and full conditional parameter adjustment options. Thus, we calculated the percentage difference between the energy consumed when executing the benchmark in the optimal configuration and executing the benchmark in the three configurations determined after the initial step, after the size adjustment step, and at the final configuration (summarized in Figure 6-5).
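The energy terms named in the experimental setup can be combined in a small calculator to show how the per-core counts feed a system-level energy figure. This is only a sketch: the exact equations are those of Figure 6-3, which are not reproduced in the text, so the way the terms are summed here, the 4-byte word size, the whole-line write back, and the per-stall-cycle interpretation of CPU_stall_energy are all assumptions. The miss penalties per line size are the ones stated in the text.

```python
from dataclasses import dataclass
from typing import Sequence

# Miss penalty in cycles for each line size in bytes, as stated in the text.
MISS_PENALTY_CYCLES = {16: 30, 32: 60, 64: 90}


@dataclass
class CoreStats:
    """Per-core counts gathered from simulation (names follow the text)."""
    dL1_hits: int
    dL1_misses: int
    dL1_writebacks: int
    total_cycles: int
    line_bytes: int


@dataclass
class EnergyParams:
    """Per-event energies in arbitrary units; values are caller-supplied."""
    dynamic_energy_per_access: float
    static_energy_per_cycle: float
    fill_energy: float               # main memory fetch energy per miss
    mem_write_energy_perword: float  # main memory write energy per word
    CPU_stall_energy: float          # assumed to be energy per stall cycle


def core_energy(s: CoreStats, p: EnergyParams, word_bytes: int = 4) -> float:
    """Assumed combination of the terms listed in the text for one core."""
    dynamic = (s.dL1_hits + s.dL1_misses) * p.dynamic_energy_per_access
    static = s.total_cycles * p.static_energy_per_cycle
    fill = s.dL1_misses * p.fill_energy
    # Assumption: each write back writes one full line, word by word.
    words_per_line = s.line_bytes // word_bytes
    writeback = s.dL1_writebacks * words_per_line * p.mem_write_energy_perword
    stall = s.dL1_misses * MISS_PENALTY_CYCLES[s.line_bytes] * p.CPU_stall_energy
    return dynamic + static + fill + writeback + stall


def system_energy(stats: Sequence[CoreStats], params: Sequence[EnergyParams]) -> float:
    """System energy as the sum over cores (assumed)."""
    return sum(core_energy(s, p) for s, p in zip(stats, params))


def percent_difference(energy: float, optimal_energy: float) -> float:
    """Percentage by which `energy` exceeds the optimal configuration's energy."""
    return 100.0 * (energy - optimal_energy) / optimal_energy
```

`percent_difference` corresponds to the comparison against the optimal configuration described above; in a heterogeneous system each core would be given its own `EnergyParams`, since the per-access and static energies depend on that core's cache configuration.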
To simulate runtime cache tuning in SESC, we defined a tuning interval of 500,000 cycles (i.e., the data cache configuration is changed after every 500,000 cycles until the CPACT algorithm is complete). The SPLASH-2 benchmark suite consists of computational kernels and complete applications. Our tuning interval is enough cycles to execute a sm all SPLASH benchmarks several times. CPACT changed the data cache configuration of each processor at the beginning of each tuning interval. While CPACT explores the cache configurations, the benchmarks are executed in several inferior configurations, each for one tuning interval. Once the final configuration is determined, the benchmark is executed in that final configuration for the remainder of the benchmark's execution. Each benchmark is run to completion for every configuration explored by CPACT, and the energy and performance in total cycles are calculated using our energy model (Figure 6-3). The final_energy_per_cycle for each benchmark for each configuration is calculated. We use the final_energy_per_cycle to determine the energy consumed during each tuning interval and during execution in the consumed by all cache configurations explored as well as the energy consumed while executing in the final configuration. The reported execution time included the stall cycles (100 cycles) incurred between tuning intervals. During these stall cycles, energy consumption was calculated, the next cache configuration was chosen, the cache configuration was changed, and
overhead compared to executing the entire benchmark in the optimal configuration is defined as (number of configurations explored − 1) × (number of tuning stall cycles), where the number of configurations explored is reported in Figure 6-5 (Section 6.4.3). y also includes processor stall energy (CPU_idle_energy) consumed during these stall cycles. the tuning stall cycles and the additional energy consumed when inferior configurations are executed during exploration. Section 6.4.3 shows that this overhead is no more than 2% compared to executing the entire benchmark in the optimal configuration.

6.4.2 Maximum Attainable Energy Savings

Figure 6-4 (a) and (b) depict the energy savings and performance, respectively, for the optimal energy configuration for a single- and a dual-core system normalized to the core system achieved an average energy savings of 25%, with energy savings as high as 50% for radiosity. The average energy savings for the dual-core system are comparable to single-core energy savings (25%) across the same benchmarks (Figure 6-4 (a)). One motivation for switching from single- to dual-core systems is that dual-core systems can be more energy efficient because the application is decomposed across both processors, with each processor consuming a smaller amount of energy. Our results showed that the dual-core system consumed less energy than the single-core system by as much as 10% for ocean-non, and by applying dual-core cache tuning, energy consumption is reduced even further. With respect to performance, cache tuning in the dual-core system imposed an average performance penalty of 5%, with a maximum performance penalty of 14% for
water-spatial, which also achieved 41% energy savings (for comparison, Kumar et al. showed an average performance penalty of 18% for their proposed heterogeneous multi-core system). Another motivation for switching from single- to dual-core systems is to improve the performance without increasing processor frequency. As expected, all benchmarks in our experiments required fewer cycles to complete execution on the dual-core system as compared to the single-core system. On average, the optimal dual-core cache configuration required only 61% of the number of cycles needed to complete execution on a single-core base system (Figure 6-4 (b), dual core (optimal) % of cycles), or a 1.64x speedup.

6.4.3 CPACT Results and Analysis

CPACT achieved an average of 24% energy savings as compared to the base configuration (Figure 6-4 (a)) and explored at most 13 out of 1,296 possible configurations, equivalent to 1% of the design space. CPACT found the optimal configuration for 10 out of 11 benchmarks and resulted in a configuration within 1% of the optimal configuration for the remaining benchmark. Although CPACT found the optimal configuration for 10 of 11 benchmarks, the energy savings achieved by CPACT can be lower than the energy savings achieved by executing the entire benchmark in the optimal configuration. This discrepancy is due to the energy overhead of executing the benchmark in inferior, much higher-energy configurations, and the processor stall energy incurred while configurations are changed during design space exploration. Results showed that the energy overhead during design space exploration was small, with the largest overhead decreasing energy savings by only 2% for radix as compared to executing the entire benchmark in the optimal configuration. However, even with the energy consumption penalty, using
CPACT, radix still achieved 30% energy savings compared to the base configuration (Figure 6-4 (a)). Executing benchmarks in inferior configurations and stalling execution between tuning intervals also incurred a performance penalty. On average, CPACT incurred an 8% performance penalty, 3 percentage points higher than the penalty of executing only in the optimal configuration (Figure 6-4 (b)). Additionally, on average, CPACT required only 62% of the number of cycles needed to complete execution on a single-core base system (Figure 6-4 (b), CPACT % of cycles), or a 1.60x speedup. This shows that the overhead of CPACT is very small and that, even with the added stall cycles, CPACT achieves speedups comparable to dual-core optimal speedups. Figure 6-5 depicts the percentage difference (% DIFF) between the energy consumed when executing the benchmark in the optimal configuration and executing the benchmark in the three configurations chosen by CPACT at the different evaluation points. Figure 6-5 also reports the total number of configurations explored by CPACT (# CFGS EXP). For most ben tuning heuristic The configuration found by the initial step achieves an average of 1% energy savings compared to the base cache, and can consume as much as 26% more energy when compared to the dual-core optimal configuration (Figure 6-5). To determine the final configuration, CPACT explored 11 configurations on average (# CFGS EXP), while after the initial step, only 8 configurations were explored on average. This means that, on average, to find the final configuration, CPACT required 300
additional stall cycles between tuning intervals as compared to applying just the initial step. However, we point out that: 1) 300 cycles is very small compared to the SPLASH-2 benchmarks, which execute for up to 80 million cycles; 2) the performance penalty of CPACT is only 3% compared to executing in the optimal dual-core configuration; and 3) the energy savings for the initial step are not enough to justify the reduced performance overhead. optimal cache size was bypassed since the line size and associativity were fixed at 16 bytes and direct-mapped, respectively, which did not necessarily reflect the optimal line size and associativity. For example, for lunon, the 64 KB direct-mapped cache with a 16-byte line size consumed less energy than the 32 KB direct-mapped cache with a 16-byte line size for both processors; therefore, 64 KB was chosen as the size. However, the 64 KB 2-way cache with a 32-byte line size configuration found after the initial step consumed more energy than the optimal configuration (32 KB 4-way cache with a 64-byte line size). The optimal configuration was not found by the initial step since the remaining configurations with a 32 KB cache size were not explored. However, including the size adjustment step was beneficial for 6 of the 11 benchmarks, resulting in configurations that were as much as 22% closer to the optimal configuration. Although the size adjustment step had a large impact on the results, final line size and associativity adjustment was required to find the optimal (or near-optimal) configuration for the benchmarks. Radiosity, lucon, and lunon were particularly difficult benchmarks to tune. Radiosity is a graphics program with unstructured access patterns to irregular data structures, which reduces spatial locality and complicates the analysis
of the impact of cache parameters and runtime cache tuning. Unlike the graphics programs with irregular data structures, both LU (Lower triangular matrix, Upper triangular matrix) Decomposition benchmarks (lucon and lunon) are decomposed into regular blocks. When the cache size increases from 8 KB to 64 KB, the miss rate remains nearly constant, which makes cache tuning difficult, but larger sizes (128 KB and 256 KB) cause a 90% decrease in the miss rate due to the size of these energy per access and static energy per cycle. Final line size and associativity adjustments address both of these issues and produced final configurations much closer to the optimal for all benchmarks. If the size adjustment step decreased the energy consumption (condition 1, Figure 6-1), a final adjustment of the line size followed by the associativity improved energy savings (action 1, Figure 6-1). This applies to benchmarks such as radiosity and results in the final configuration being within 1% of the optimal configuration. Alternatively, if the size adjustment step increased the energy consumption, one additional configuration with the cache size halved and the line size doubled from the 1). If action 2 decreased the energy consumption, a final adjustment of the line size and associativity is required for improved energy savings (condition 2, action 3, Figure 6-1). This applies to benchmarks such as lucon, where the cache size chosen after the initial step is double that of the optimal cache size, but additional energy savings are not
revealed when the size is tuned again in the size adjustment step given the line size and associativity chosen after the initial step is complete. However, if the additional configuration with the cache size halved and the line consumption (i.e., condition 2 determines that action 2 increased the energy consumption in Figure 6-1), cache tuning is halted and no additional parameter adjustment is necessary. This applies to benchmarks such as raytrace, where the configuration chosen after the size adjustment step was the optimal configuration, thus eliminating unnecessary configuration exploration for these benchmarks.

6.4.4 Cross-Core Data Decomposition Analysis

core interactions and data coherence. Results revealed that the benchmarks typically SIMD (single instruction multiple data)-like data decomposition across the processors, where each processor executes the same functions on different data sets. However, three benchmarks, fft, radiosity, and raytrace, required heterogeneous configurations with different associativities and/or line sizes. Evaluating the benchmarks that required heterogeneous configurations revealed that core interactions did affect the data cache configurations. In these benchmarks, one processor (arbitrarily referred to as P0) does significantly more work than the other processor (arbitrarily referred to as P1). For example, P0 may perform tasks that would not be necessary in a single-core system, such as building large data structures and distributing the data to reduce communication during execution, and data aggregation at the end of execution.
With respect to data coherence, we modified SESC to identify cache misses due to data sharing (coherence misses) and observed that for fft, radiosity, and raytrace, coherence misses accounted for as much as 17% of the overall cache misses for a 64 KB cache, even though the benchmark suite consists of multithreaded applications that are optimized to reduce inter-processor communication and data sharing. As expected, the number of coherence misses increases as the cache size increases, since larger caches are more likely to store data needed by another processor. Increasing the cache size to 128 KB and 256 KB increased the percentage of coherence misses for the three benchmarks (as high as 29% for raytrace) and revealed additional benchmarks with coherence misses (as high as 11% for water-nsquared). However, the 128 KB and 256 KB cache sizes are not explored by CPACT since these large cache sizes are never optimal configurations due to their high dynamic and static energy consumption. Although the coherence misses increase with the cache size, benchmarks with significant coherence misses do not necessarily choose a small cache size as the optimal configuration. Analysis revealed that the optimal cache size is dictated by the size of the working set and is not affected by data sharing and coherence misses for the cache sizes used in our e even though the number of coherence misses in the 32 KB cache was 5x larger than in the 8 KB cache. As the cache size increased, the increase in coherence misses was offset by the decrease in total cache misses due to the ability to store more of the working set in the larger cache. In summary, our analysis has revealed that core interactions have a much greater effect on the cache configurations than data coherence does. However, data coherence
does have an effect on cache configurations, and we observed that the three benchmarks that required heterogeneous configurations were the benchmarks with data sharing among the processors, while benchmarks with no data sharing chose the same configuration for both processors. Although the SPLASH-2 benchmark suite uses data decomposition, the SPLASH-2 benchmarks can be classified into two types based on our core interaction analysis of the dual-core system: 1) benchmarks with very little or no data sharing, where coherence misses make up less than 1% of total misses, and 2) benchmarks with data sharing. Some of our preliminary work shows that this classification also holds for 4-core systems. Furthermore, for the benchmarks that exhibit data sharing and coherence in both the dual- and 4-core systems, one processor, the coordinating processor, performs pre- and post-execution processing while the other processors, the working processors, perform SIMD-like work. By extending our core interaction analysis to systems with more than two cores, we project that this trend will continue for 4-, 8-, and 16-core systems. We predict that, for benchmarks with no data sharing, CPACT may only need to be applied to a single processor while the other processors remain in a fixed configuration. Once the single processor is tuned, the remaining processors can be set to the same configuration. For benchmarks that exhibit data sharing, cache tuning may only need to be done for two processors: the coordinating processor and one working processor, wherein the one tuned working processor can convey the optimal configuration to the other working processors without having to perform cache tuning on every processor. These observations were leveraged
in the design of our application classification guided cache tuning heuristic for larger multi-core systems presented in Chapter 7, since reducing the number of cores tuned simultaneously reduces the energy and performance overhead of cache tuning, which will become more important as the number of cores in the system increases.

Figure 6-1. The Conditional Parameter Adjustment Cache Tuner (CPACT)

Figure 6-2. Architectural layout of dual-core system
Figure 6-3. Energy model used for the dual-core system

Figure 6-4. (a) Energy savings for the optimal (lowest-energy) cache for single- and dual-core systems and final configuration applied to a dual-core system as compared to their respective base configurations. (b) Performance (measured in cycles) normalized to their respective base configurations for the optimal cache for single and dual core system performance, and percentage of execution cycles needed by the dual-core system compared to the execution cycles needed by the single-core system (% of cycles), for the dual-core optimal configuration and CPACT

[Panel (a): y-axis Energy Savings (%); series: single core (optimal), dual core (optimal). Panel (b): y-axis Normalized Performance; series: single core (optimal), dual core (optimal), CPACT, dual core (optimal) % of cycles, CPACT % of cycles.]
108 Figure 6 5 Percentage difference between the energy consumed by the optimal configuration and CPACT after three evaluation points ( % diff ), and number of configurations explored by CPACT ( # cfgs exp )
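The prediction that closes this chapter's core interaction analysis (tune one processor when there is no data sharing, or the coordinating processor plus one working processor when there is) can be illustrated with a short sketch. Only the 1% coherence-miss threshold comes from the text; the function names, the convention that processor 0 is the coordinating processor, and the use of aggregate miss counts are assumptions made for this example.

```python
from typing import List

# From the text: benchmarks with little or no data sharing have coherence
# misses that make up less than 1% of total misses.
NO_SHARING_THRESHOLD = 0.01


def has_data_sharing(coherence_misses: int, total_misses: int) -> bool:
    """Classify a benchmark from aggregate miss counts (assumed inputs)."""
    if total_misses == 0:
        return False
    return coherence_misses / total_misses >= NO_SHARING_THRESHOLD


def processors_to_tune(num_processors: int,
                       coherence_misses: int,
                       total_misses: int,
                       coordinator: int = 0) -> List[int]:
    """Return the processors that would undergo cache tuning.

    No data sharing: a single processor is tuned.
    Data sharing: the coordinating processor and one working processor.
    Which working processor is chosen is arbitrary in this sketch.
    """
    if num_processors < 1:
        raise ValueError("need at least one processor")
    sharing = has_data_sharing(coherence_misses, total_misses)
    if not sharing or num_processors == 1:
        return [coordinator]
    worker = next(p for p in range(num_processors) if p != coordinator)
    return [coordinator, worker]
```

Under these assumptions the number of tuned processors stays at one or two regardless of the core count, which is the property the chapter projects for 4-, 8-, and 16-core systems.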
CHAPTER 7
AN APPLICATION CLASSIFICATION GUIDED CACHE TUNING HEURISTIC FOR MULTI-CORE ARCHITECTURES

7.1 Overview

In this chapter we introduce an application classification guided cache tuning heuristic for level one (L1) multi-core data caches to determine the lowest energy cache configuration. Minimizing cache tuning overheads is critical in a multi-core system due to overhead accumulation across each core and the potential power increase if all cores simultaneously tune the cache. The heuristic leverages runtime profiling techniques to classify the application based on the cache behavior and data sharing and to minimize the number of cores that are tuned simultaneously using insights from our cross-core data decomposition analysis (Section 6.4.4). [...] heterogeneous 2-, 4-, 8-, and 16-core systems with highly configurable caches and evaluate energy and performance overheads incurred during cache tuning.

7.2 Overview of Data Parallel Applications and Application Classification

In a data parallel multi-core system, applications are decomposed into equal data sets that are distributed over several cores, where each core performs the same behavior. Two data sets have similar cache behavior if the data sets have similar miss rates when run with the same cache configuration, and thus would require the same optimal cache configuration. We note that this similarity assumption is valid only because each core performs the same function on its data set; in general, similar cache miss rates would not necessarily indicate similar cache behavior.
This application behavior can be leveraged to guide the cache tuning heuristic. For example, [...] similar cache behavior, data sharing and core interactions do not need to be considered. In this situation, cache tuning is relatively simple since tuning could be applied [...] similarly behaving cores, thus avoiding redundant cache tuning. Alternatively, data sharing applications where the data sets have different cache behavior may require cache tuning on several or all cores since the optimal configuration will be different across the cores. In this situation, the tuning heuristic should coordinate cache tuning among the cores to avoid simultaneously tuning all caches.

7.3 Runtime Multi-Core Data Cache Tuning

Our runtime L1 multi-core data cache tuning heuristic leverages application classification to guide cache tuning and determines the optimal, lowest energy cache configuration (i.e., the lowest energy configuration in our defined design space (Section 7.4.1)). The heuristic classifies the application using cache statistics (accesses, misses, write-backs, and coherence misses) gathered at runtime. These cache statistics are combined with a cache subsystem energy model (detailed in Section 7.4.1) to calculate [...] Section 7.3.1 details our target multi-core architecture and Section 7.3.2 describes our runtime application classification methodology and cache tuning heuristic.

7.3.1 Multi-core Architectural Layout

Our multi-core system consists of an arbitrary number of cores and a cache tuner, all placed on a single chip, where each core has a private, highly configurable L1 data cache. We chose parameter value ranges based on our experimental results for the SPLASH-2 applications, which required optimal cache sizes ranging from 8 to 64 KB, associativities ranging from direct-mapped to 4-way, and line sizes ranging from 16 to [...] constructed using 32 2 KB banks. The cache banks can be shut down and/or concatenated to tune the cache size and associativity. The caches have a physical line size of 16 bytes, which can be increased by fetching multiple physical lines. The caches have also been augmented with a small amount of custom hardware to identify coherence misses, which are misses that occur when there is a tag hit for an invalid cache block.

Figure 7-1 depicts a sample architectural layout for a 2-core system, which [...] an n-core system, the cache tuner connects to all n caches). The global tuner orchestrates cache tuning among the cores [...] tuning, applications incur stall cycles while the tuner gathers cache statistics, calculates energy consumption, and changes the cache configuration. These tuning stall cycles introduce energy and performance overhead. Additionally, the tuning stall cycles could increase if the global tuner becomes a bottleneck while cache statistics are collected from several cores simultaneously. Our tuning heuristic considers these overheads incurred during the tuning stall cycles, and thus minimizes the number of simultaneously tuned cores and the tuning energy and performance overheads.

7.3.2 Application Classification Guided Cache Tuning Heuristic

Cache tuning is relatively simple for non-data-sharing applications where only one [...] cache behavior. However, tuning for data sharing [...] requiring additional tuning actions and coordinated tuning among cores. In order to determine the minimum required cache tuning effort, application classification must be done during runtime to guide cache tuning, reduce the tuning overhead, and reduce the number of simultaneously tuned cores.

Application classification determines data sharing and cache behavior at runtime using cache statistics. Coherence misses delineate data sharing from non-data-sharing applications, where a data sharing application has coherence misses that account for more than 5% of the total cache misses, otherwise the application is non-data-sharing. Cache accesses and misses are used to determine if data sets have similar cache behavior. Data parallel architectures execute the same function on similar data sets. Since the cores are performing the same function on equal portions of data, data sets that have similar accesses and misses, and therefore similar miss rates, when run on caches of the same configuration are classified as having the same cache behavior.

Figure 7-2 illustrates these similarities using actual data cache miss rates for an 8-core system (the cores are denoted as P0 to P7) for SPLASH [...] non (top table) and fft (bottom table) [...] the core with the lowest miss rate (P0 in this example). Since ocean [...] normalized miss rates are nearly 1.0 for all cores, all caches are classified as having similar [...] different cache behavior than P1 to P7.
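The classification rules above can be sketched in a few lines of Python. This is a minimal illustration, not the tuner's hardware implementation: the names `CacheStats`, `is_data_sharing`, and `group_cores` are hypothetical, the 5% test is applied to miss counts summed over all cores (the text does not say whether the test is per core or aggregate), and the 10% similarity tolerance is an assumed stand-in for "nearly 1.0".

```python
from dataclasses import dataclass
from typing import List

DATA_SHARING_THRESHOLD = 0.05  # from the text: coherence misses vs. total misses
SIMILARITY_TOLERANCE = 0.10    # assumed; the text only says "nearly 1.0"


@dataclass
class CacheStats:
    """Per-core counters gathered during one profiling interval."""
    accesses: int
    misses: int
    coherence_misses: int

    @property
    def miss_rate(self) -> float:
        return self.misses / self.accesses if self.accesses else 0.0


def is_data_sharing(stats: List[CacheStats]) -> bool:
    """Data sharing if coherence misses exceed 5% of total misses."""
    total = sum(s.misses for s in stats)
    coherence = sum(s.coherence_misses for s in stats)
    return total > 0 and coherence / total > DATA_SHARING_THRESHOLD


def normalized_miss_rates(stats: List[CacheStats]) -> List[float]:
    """Normalize each core's miss rate to the lowest miss rate."""
    rates = [s.miss_rate for s in stats]
    lowest = min(rates)
    # If the lowest rate is zero, normalization is undefined; use raw rates.
    return [r / lowest for r in rates] if lowest > 0 else rates


def group_cores(stats: List[CacheStats],
                tolerance: float = SIMILARITY_TOLERANCE) -> List[List[int]]:
    """Group cores whose normalized miss rates are within the tolerance.

    Each core joins the first group whose first member (its representative)
    has a similar normalized miss rate; otherwise it starts a new group.
    """
    norm = normalized_miss_rates(stats)
    groups: List[List[int]] = []
    for core, rate in enumerate(norm):
        for group in groups:
            representative = norm[group[0]]
            if abs(rate - representative) <= tolerance * representative:
                group.append(core)
                break
        else:
            groups.append([core])
    return groups
```

With miss rates shaped like the fft example (one core with a low miss rate and seven cores with a similar, higher rate), `group_cores` returns two groups, and only the first core of each group would be selected for tuning.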
Figure 7-3 depicts our application classification guided cache tuning heuristic, which consists of three main steps: 1) application profiling and initial tuning, 2) application classification, and 3) final tuning actions.

Using these steps, the heuristic profiles the application to gather the [...] 2. Since evaluating cache behavior is most effective when the caches have the same configuration and determining data sharing is most effective when the caches are large (larger caches have more coherence misses), the heuristic initializes all of the caches to a base configuration. The base configuration is a 64 KB, 4-way set associative cache with a 64-byte line size. The heuristic profiles the application for one initial tuning interval using this base configuration.

Step 1 is critical for avoiding redundant cache tuning in situations where the data sets have similar cache behavior and similar optimal configurations. If these cache statistics indicate that all data sets have the same cache behavior, only one cache [...] is determined, the cache tuner can immediately convey this configuration to all other caches, thus avoiding the cache tuning process on all other cores. Alternatively, if the cache statistics indicate that the cores have different cache behavior, additional cores may need to be tuned in addition to P0; however, these additional cores and whether or not these cores share data cannot be determined until application classification in Step 2. Regardless of cache behavior similarities, Step 1 applies the initial impact ordered [...] configuration. While P0 is being tuned, the heuristic continues to identify coherence misses, which will be used to determine data sharing in Step 2.

Step 2 uses the cache behavior and coherence misses from Step 1 for application classification. Condition 1 and Condition 2 classify the applications based on whether or not the cores have similar cache behavior and/or exhibit data sharing, respectively. Evaluating these conditions determines the necessary cache tuning effort in Step 3. Since the single-core impact ordered tuning heuristic does not consider core [...] final configuration.

Step 3 determines the final configuration using several final tuning [...] application classification to determine how to perform these parameter adjustments, coordinated among the cores. Cache tuning is simplified for situations with non-data-sharing applications and when all data sets have similar cache behavior, or when Condition 1 is evaluated as true. In these situations, only a single cache needs to be tuned and the heuristic performs parameter adjustments on P0 while the other cores remain fixed at the current (base) configuration. Additionally, since there is no data sharing, P0 can be tuned independently without affecting the behavior of the other cores. The final tuning actions start with a size adjustment for P0 (Action 1). Since Step [...] size for non-data-sharing applications, size adjustment begins with the size from Step 1 and decreases the cache size as long as decreasing the cache size decreases the energy consumption. Size adjustment is followed by similar line size and associativity adjustments, which each begin with the line size/associativity from Step 1 and increase the line size/associativity as long as increasing the line size/associativity decreases [...] final configuration is conveyed to the other cores and the remainder of the application is executed with this final configuration.

If the data sets have different cache behavior, or Condition 1 is false, tuning is more complex and several cores must be tuned. The heuristic minimizes the number of [...] behavior, where data sets with similar cache miss rates belong to the same group. The heuristic then tunes only one (arbitrarily chosen) cache from each group while all other cores in the group remain in the base configuration. For example, using an 8-core system and the cache miss rates in Figure 7-2, fft has two groups: P0 belongs in one group and P1 to P7 belong in the second group. Given this grouping, only P0 and P1 [...] if the cores do not share data, or Condition 2 is false, the cores can be tuned independently without affecting the behavior of the other cores. The other cores chosen [...] the cache size) and line size/associativity adjustments (increasing line size/associativity) (Action 2) are performed on all cores identified for tuning. To complete the final tuning actions, the tuned cores convey the final configuration to the other cores in the tuned [...]

Finally, if the application shares data, or Condition 2 is true, the heuristic still only tunes one core from each group, but the tuning must be coordinated among the cores and additional configurations must be explored. Action 3 performs size adjustment on the cores identified for tuning. Since data is shared and tuning one core affects the behavior of the other cores, tuning must be coordinated. Tuning coordination requires size adjustment to complete on all cores before adjusting the remaining parameters. Additionally, instead of exploring only smaller cache sizes and larger line sizes/associativities in Step 3, the heuristic explores both smaller and larger values for each parameter since applications with shared data require additional exploration.

7.4 Experimental Results

7.4.1 Experimental Setup

We quantified the energy savings and performance of our heuristic using 11 SPLASH-2 multithreaded applications (2 SPLASH-2 applications were not evaluated [...]) on the SESC simulator for 1-, 2-, 4-, 8-, and 16-core systems. In SESC, we modeled a heterogeneous system with the L1 data cache parameters identified in Section 7.3.1. Since the L1 data cache has 36 possible configurations, our design space is 36^n where n is the number of cores in the system. The L1 instruction cache and L2 unified cache were fixed at the base configuration and a 256 KB, 4-way set associative cache with a 64-byte line size, respectively. We modified SESC to identify coherence misses. Figure 7-4 depicts the multi-core energy model used to calculate the energy consumption of each data cache configuration. The remaining details on the energy model can be found in Section 6.4.1.
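To make the size of the search concrete, the sketch below computes the 36^n design space and runs the one-directional parameter adjustments described for non-data-sharing applications (decrease size, then increase line size, then increase associativity, each while energy keeps falling). Several things here are assumptions rather than details from the text: the parameter values are taken to be powers of two within the ranges of Section 7.3.1, with 64 bytes assumed as the largest line size so that 4 x 3 x 3 gives the stated 36 configurations; `energy` is a hypothetical callback standing in for one profiling interval plus the energy model; and the coordinated, two-directional exploration used for data sharing applications is not modeled.

```python
from typing import Callable, Dict, NamedTuple, Sequence, Tuple


class Config(NamedTuple):
    size_kb: int
    assoc: int
    line_bytes: int


# Assumed parameter values (see the lead-in); 4 * 3 * 3 = 36 configurations.
SIZES_KB = (8, 16, 32, 64)
ASSOCS = (1, 2, 4)
LINES = (16, 32, 64)


def design_space_size(num_cores: int) -> int:
    """Number of system configurations: 36^n for n independently set caches."""
    per_cache = len(SIZES_KB) * len(ASSOCS) * len(LINES)
    return per_cache ** num_cores


def tune_non_sharing(start: Config,
                     energy: Callable[[Config], float]) -> Tuple[Config, int]:
    """Greedy adjustments for one cache.

    Returns the final configuration and the number of distinct
    configurations whose energy was evaluated.
    """
    seen: Dict[Config, float] = {}

    def measure(cfg: Config) -> float:
        # Evaluate each configuration at most once.
        if cfg not in seen:
            seen[cfg] = energy(cfg)
        return seen[cfg]

    def adjust(cfg: Config, field: str, values: Sequence[int], step: int) -> Config:
        # Move one parameter in one direction while energy keeps decreasing.
        idx = values.index(getattr(cfg, field))
        while 0 <= idx + step < len(values):
            candidate = cfg._replace(**{field: values[idx + step]})
            if measure(candidate) >= measure(cfg):
                break
            cfg, idx = candidate, idx + step
        return cfg

    cfg = adjust(start, "size_kb", SIZES_KB, -1)    # smaller sizes
    cfg = adjust(cfg, "line_bytes", LINES, +1)      # larger line sizes
    cfg = adjust(cfg, "assoc", ASSOCS, +1)          # larger associativities
    return cfg, len(seen)
```

With n = 2 the design space is 36^2 = 1,296 configurations, and it grows to 1,679,616 for n = 4, which is why exhaustive search becomes impractical as the core count rises while the greedy adjustments visit only a handful of points per tuned cache.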
7.4.2 Results and Analysis

Figure 7-5 (a) and (b) depict the energy savings and performance, respectively, for the optimal configuration determined via exhaustive design space exploration (optimal) for 2- and 4-core systems and for the final configuration found by our application classification cache tuning heuristic (heuristic) for 2-, 4-, 8-, and 16-core systems, for each application and averaged across all applications (Avg.). Given the exponential increase in design space size with respect to the number of cores, it was not possible to find the optimal configurations for the 8- and 16-core systems.

Our heuristic achieved an average of 25% energy savings for all systems (Figure 7-5 (a)) and explored at most 14 configurations [...] 1% of the design space. Our heuristic found the optimal configuration for 10 out of 11 applications for the 2-core system and for all 11 applications in the 4-core system. On the 2-core system, the heuristic found a final configuration within 2% of the optimal for ocean-non. Even though the heuristic found the optimal configuration in all but one application for the 2- and 4-core systems, executing the entire application in the optimal configuration resulted in 26% average energy savings, while the heuristic achieved 25% average energy savings. This minor energy difference is due to the tuning energy overhead. Our results showed that the largest tuning energy overhead was 4% for radix on the 4-core system; however, even with a 4% energy overhead, radix still achieved 30% energy savings.

Figure 7-5 (b) shows that the average performance penalties for our heuristic for the 2- and 4-core systems were 6% and 8%, respectively, while the average performance penalties for running the application in the optimal configuration were 5% and 8%, respectively, compared to executing the application in the base cache. The tuning performance overhead due to the additional tuning stall cycles is 1% for the 2-core and less than 1% for the 4-core system. However, our results showed that, even with the tuning performance overhead, the 2- and 4-core systems achieved a 1.7x and 2.8x average speedup, respectively, compared to running the application on a single-core system.

Our heuristic achieved 26% and 25% energy savings, incurred 9% and 6% performance penalties, and achieved 4.8x and 7.9x average speedups for the 8- and 16-core systems, respectively. Although we were unable to compare these results to the optimal configuration, we estimate the tuning energy overhead as no more than 4% and the tuning performance overhead to be on average 1% for the 8- and 16-core systems based on the results for the 2- and 4-core systems.

7.4.3 Application Classification

Our heuristic classified the SPLASH-2 applications into two categories: 1) non-data [...] 2) data sharing applications where the cores have different behaviors. For the non-data [...] homogeneous final configurations across all cores. We used the optimal cache [...] any configuration (i.e., the cores were allowed to select heterogeneous configurations), for the 2- and 4-core systems to confirm that these applications require homogeneous [...] lucon in the 4 [...] search were homogeneous configurations where all 4 cores selected 16 KB, 4-way, 64-byte line size configurations.
Three applications, fft, radiosity, and raytrace, were classified as data sharing applications where the cores had different cache behavior. We observed that for these three applications one core (arbitrarily referred to as P0) typically had different behavior than the remaining cores; therefore, to determine the final configuration, our heuristic [...] final configuration to the remaining cores, resulting in heterogeneous final configurations. For example, the miss rates for P1, P2, and P3 were nearly 3.0 times the miss rate of P0 for raytrace in a 4-core system. Our heuristic selected final configurations of 16 KB, 4-way, 16-byte line size for P0 and 32 KB, 4-way, 16-byte line size for P1, P2, and P3, which is the same configuration found via an exhaustive search. Note that these three applications also share data; therefore, it was necessary to coordinate data cache tuning among cores to determine the optimal configuration.

Figure 7-1. Sample architectural layout for a 2-core system showing the global data cache tuner connected to each private L1 data cache.
Figure 7-2. Application classification: an example using data cache miss rates for an 8-core system where each cache is set to the base configuration

Figure 7-3. Application classification guided cache tuning heuristic

Figure 7-4. Energy model for the multi-core system

Figure 7-5. (a) Energy savings and (b) normalized performance for the optimal cache (optimal) for 2- and 4-core systems and the final configuration for the application classification cache tuning heuristic (heuristic) for 2-, 4-, 8-, and 16-core systems
[Plots (a) and (b): x-axis applications cholesky, fft, lucon, lunon, ocean-con, ocean-non, radiosity, radix, raytrace, water-nsq, water-sp, Avg; y-axis (a) Energy Savings (%), (b) Normalized Performance; series: 2-core (optimal), 2-core (heuristic), 4-core (optimal), 4-core (heuristic), 8-core (heuristic), 16-core (heuristic).]
CHAPTER 8
CONCLUSIONS

In this dissertation we investigated dynamic cache energy savings techniques for embedded systems. First we introduced the adaptive loop cache (ALC), a dynamic tagless loop cache that combines the flexibility of the dynamic loop cache (DLC) with [...] ALC significantly increases system scenario amenability as compared to previous loop cache designs using a lightweight runtime control flow analysis technique to dynamically cache complex critical regions containing branches in an area-efficient manner. Furthermore, the ALC eliminates the need for the costly designer pre-analysis effort required for the PLC and HLC, making the ALC the most appropriate design for applications with changing behavior. The ALC increases the average number of instructions fetched from the loop cache by as much as 74% as compared to previous loop cache designs. The ALC also increased average energy savings by as much as 20% (for the 32-entry ALC compared with the PLC, averaged for the Powerstone suite) and increased individual benchmark savings by as much as 69% (for a 64-entry ALC [...]).

Allowing the ALC to cache nested loops (ALC+N) resulted in very little average improvement in the loop cache access rate or energy savings compared to the ALC (less than 4% improvement in both cases). However, we have observed that the ALC+N works best for smaller outer loops, while it is best to cache inner loops separately using the ALC for larger outer loops. In future work we will investigate the benefits of switching between the ALC and ALC+N depending on the application characteristics.
Compared with the filter cache, the ALC achieves higher energy savings than the 8-byte and 16-byte line size filter caches by as much as 63% and 10% (as compared to the 8-byte and 16-byte line size for the Powerstone suite averages, respectively). The ALC does not achieve higher energy savings than the 32-byte line size filter cache (i.e., we observed that the filter cache achieved 17% higher energy savings than the ALC for the 16-byte line size of the Powerstone suite average). However, unlike the filter cache, the ALC does not incur a performance penalty.

We then investigated the effects of combining loop caching with level one cache tuning and found that, in general, cache tuning dominated overall energy savings, indicating that cache tuning is sufficient for energy savings. However, we observed that adding a loop cache to an optimal (lowest energy) cache increased energy savings by as much as 26%.

Next we investigated the possibility of using a loop cache to minimize runtime decompression overhead and quantified the effects of combining code compression with cache tuning. Our results showed that a loop cache can effectively reduce the decompression overhead. Using LZW encoding we were able to achieve average energy savings of 30% and 50% for EEMBC and Powerstone benchmarks, respectively. However, to fully exploit combining cache tuning, code compression, and loop caching, a compression/decompression algorithm with lower decompression overhead is required.

Next we investigated dynamic energy savings in multi-core systems. First we extended single-core cache tuning to a dual-core system. We presented a Conditional Parameter Adjustment Cache Tuner (CPACT) that achieved average data cache energy savings of 24%. CPACT found the optimal configuration for 10 out of 11 benchmarks and a final configuration within 1% of the optimal configuration for the remaining benchmark while searching, at most, 13 out of 1,296 configurations. Benchmark data decomposition analysis revealed that core interactions have a greater effect on the optimal configurations than data coherence.

Finally, we presented an application classification guided cache tuning heuristic for level one data caches that found the optimal, or near optimal, lowest energy cache configuration for 2-, 4-, 8-, and 16-core systems. Our heuristic classified applications based on data sharing and cache behavior, and used this classification to identify which cores needed to be tuned and to reduce the number of cores being tuned simultaneously. Our heuristic searched at most 1% of the design space, yielded configurations within 2% of the optimal, and achieved an average of 25% energy savings. In future work we plan to implement the multi-core cache tuner hardware in VHDL on 90 nm technology to evaluate the power and area of the tuner.
BIOGRAPHICAL SKETCH

Marisha Rawlins graduated with a Diploma in computer engineering technology from the Trinidad and Tobago Institute of Technology in 2004, and received the Best Overall Student at Trinidad and Tobago Institute of Technology and Most Outstanding Student in Computer Engineering Technology awards. She graduated summa cum laude with a Bachelor of Science in computer engineering from Florida Institute of Technology in 2007, and received her Master of Science in electrical and computer engineering from the University of Florida in 2008. She received her Doctor of Philosophy degree in electrical and computer engineering from the University of Florida in 2012.