
Developing Representative Workloads for Future Hardware Transactional Memory Research Using a Cycle-Accurate, Multi-dimensional Hardware Transactional Memory Model

Permanent Link: http://ufdc.ufl.edu/UFE0041005/00001

Material Information

Title: Developing Representative Workloads for Future Hardware Transactional Memory Research Using a Cycle-Accurate, Multi-dimensional Hardware Transactional Memory Model
Physical Description: 1 online resource (132 p.)
Language: english
Creator: Poe II, James
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2009

Subjects

Subjects / Keywords: architecture, benchmarks, characterization, computer, memory, modeling, simulation, synthetic, transactional, workload
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography (marcgt)
theses (marcgt)
government publication (state, provincial, territorial, dependent) (marcgt)
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Transactional memory is emerging as a parallel programming paradigm for multi-core processors. Transactional memory provides a means to bridge the discrepancies between programmer productivity and the difficulty in exploiting thread-level parallelism gains offered by emerging chip multiprocessors. Because the hardware has outpaced the software, there are very few modern multithreaded benchmarks available and even fewer for transactional memory researchers. To make this problem worse, the architecture community has no means of characterizing the similarity of the benchmarks that do exist, which poses serious questions concerning the comprehensiveness of current evaluations. This hurdle must be overcome for transactional memory research to mature and gain widespread acceptance. Currently, for performance evaluations, most researchers rely on manually converted lock-based multithreaded workloads or the small group of programs written explicitly for transactional memory. A new parameterized methodology that can automatically generate a program based on the desired high-level program characteristics benefits the transactional memory community. In this work, all of the above issues are addressed. First, a cycle-accurate, multiple-issue multi-core hardware transactional memory model that is capable of simulating each of the three most common dimensions of hardware transactional memory is developed, the first of its kind. That model is then used to perform the first comprehensive study of the interaction that occurs between transactional memory and multi-core architecture. The results of that study are used to develop analytical models that are capable of predicting performance. A set of transaction-oriented workload characteristics is proposed that can accurately capture the behavior of transactional code and, when used in conjunction with principal component analysis and clustering algorithms, expose the similarity that exists in current transactional workloads. Methods to reduce overlap (the number of required simulations) and maximize the comprehensiveness of an evaluation based upon the architectural areas of interest to the transactional developer are described in detail. All of these tools and the experience gained are used to develop TransPlant, a framework that is capable of generating synthesized transactional benchmarks based on an array of different inputs. It is shown that TransPlant can mimic the behavior of current transactional memory workloads. Further, TransPlant is shown to be capable of generating benchmarks with features that lie outside the boundary occupied by traditional benchmarks. Finally, TransPlant is used to perform a case study on the behavior of future transactional memory workloads.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by James Poe II.
Thesis: Thesis (Ph.D.)--University of Florida, 2009.
Local: Adviser: Li, Tao.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2009
System ID: UFE0041005:00001



DEVELOPING REPRESENTATIVE WORKLOADS FOR FUTURE HARDWARE TRANSACTIONAL MEMORY RESEARCH USING A CYCLE-ACCURATE, MULTI-DIMENSIONAL HARDWARE TRANSACTIONAL MEMORY MODEL

By

JAMES MICHAEL POE II

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2009

© 2009 James Michael Poe II

To Whom It May Concern

ACKNOWLEDGMENTS

While there are countless individuals, both in my life and that have come before me, to whom I am indebted for the opportunity to be able to attain this highest degree, I would like to take a moment to acknowledge those most directly responsible. First, I would like to thank those organizations that placed their confidence in me and financially supported my academic career: Florida International University, the University of Florida, and the National Science Foundation. I would also like to thank my advisory committee, Dr. Jing Guo, Dr. Renato Figueiredo, and Dr. Shigang Chen, for their time. Moreover, there will always be a special place in my heart for the Intelligent Design of Efficient Architectures Laboratory (IDEAL) and the members of its inaugural class, ChangBurm Cho, Xin Fu, Wangyuan Zhang, Asmita Chande, and my primary co-author, Clay Hughes, without whose tireless effort this dissertation might very well be a single chapter. I would also like to thank Girish Venkatasubramanian, Fernando Hernandez, Pierre St. Juste, Aaron Blom, and my flight instructor Derek Vierra for helping me to keep everything in valuable perspective.

I would like to thank my family, whose unconditional support I knew that I always had: my grandparents Ruth and Salvatore Sarazen, my sister Jennifer, mother Sandra, and my father James "Mike," without whose long hours of proofreading most of my work would have likely remained illegible; and my uncle, Dr. Robert Sarazen, whose own title served as an inspiration to my competitive nature. Last, but certainly not least, my beautiful wife Dima, who over the course of this research has always provided a constant source of love and comfort.

Most importantly, however, I would like to express my deepest gratitude to my advisor and friend, Dr. Tao Li, who is the sole reason that any of this research exists. It didn't matter how many times I expressed a desire to quit, his response was always steadfast and the same: "no, and you will thank me for it some day." He was right.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 SUPERTRANS: A MULTI-DIMENSIONAL, CYCLE-ACCURATE HARDWARE TRANSACTIONAL MEMORY MODEL
    Hardware Transactional Memory Framework
    Simulated Workloads

3 USING ANALYTICAL MODELS TO EFFICIENTLY EXPLORE HARDWARE TRANSACTIONAL MEMORY AND MULTI-CORE CO-DESIGN
    Analytical Modeling Techniques
        Linear Regression Models
        Radial Basis Function Neural Networks
    Experimental Methodology
    Results
        Linear Models
        Non-linear Model Results
        The Heterogeneity of Transaction/Core Microarchitecture/TM Mechanism Interaction

4 A SET OF TRANSACTION-ORIENTED WORKLOAD CHARACTERISTICS CAPABLE OF DESCRIBING TRANSACTIONAL WORKLOAD SIMILARITY
    Transaction-Oriented Workload Characteristics
    Data Analysis
        Principal Component Analysis
        Cluster Analysis
    Experimental Methodology
        Baseline Transactional Model / Microarchitecture Configuration
        Transactional Metrics
        Experimental Design
        Transactional Memory and Microarchitecture Parameters
    Results
        Independence
        Benchmark Subsets
        Transactional Model Evaluation
        Microarchitecture Model Evaluation
        Time Savings
        Variable Subsets
    Related Work

5 A PARAMETERIZED METHODOLOGY FOR GENERATING TRANSACTIONAL MEMORY WORKLOADS
    Related Work
        Parallel Benchmarks
        Transactional Memory Benchmarks
        Benchmark Redundancy
        Benchmark Synthesis
    TransPlant
        Design
        Capabilities
        Implementation
            Validation and Skeleton Creation
            Spine
            Vertebrae
            Code Generation
        Example
    Results
        Stressing TM Hardware
        Workload Comparison
            Clustering
            Performance
        Benchmark Mimicking

6 CASE STUDY: UNDERSTANDING THE BEHAVIOR OF TRANSACTIONAL WORKLOADS; OBSERVATIONS, IMPLICATIONS, AND DESIGN RECOMMENDATIONS
    Related Work
    Methodology
        Simulation Environment
        Program Generation
    Results
        Transaction Granularity
            Impact Using Array Access
            Impact Using Object Access
        Transactional Duty Cycle
        Memory Conflict Stride
    Case Study Conclusions

7 CONCLUSIONS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Transactional characteristics of a subset of SPLASH-2, STAMP, and PARSEC run on the hardware transactional model
3-1 TM and core microarchitectural parameters and their ranges
3-2 TM workloads and their transactional characteristics (8-core CMP)
3-3 Most significant model terms and beta values (SPLASH-2)
3-4 Most significant model terms and beta values (STAMP)
4-1 Transaction-oriented workload characteristics used for measuring similarity between TM workloads
4-2 Baseline transactional and microarchitecture model
4-3 Transactional memory parameters
4-4 Microarchitecture parameters
4-5 Average error and Cv (σ/μ) as the transaction model is varied
4-6 Average and Cv (σ/μ) as the microarchitecture model is varied
4-7 Speedup of subsets to full suite
5-1 Transactional- and microarchitecture-independent characteristics
5-2 Transaction-oriented workload characteristics used for measuring similarity between TM workloads
5-3 Microarchitecture configuration (8-core CMP)
5-4 TM workloads and their transactional characteristics (8-core CMP)
5-5 Transactional cycles-total cycles for original and synthetic
6-1 Transactional and microarchitecture independent characteristics
6-2 Baseline configuration
6-3 Transactional program characteristics
6-4 Memory conflict stride program characteristics

LIST OF FIGURES

3-1 Basic architecture of a neural network
3-2 RBF neural network speedup prediction accuracy
3-3 Starplots of key microarchitectural parameters in determining relative speedup
3-4 Bayes + Fmm transposition
4-1 Dendrogram showing clustering based on characteristics defined in Table 4-1
4-2 Cumulative distribution of the variance by PC
4-3 Performance comparison of NACK'd Cycles:Trans Cycles as the transactional model is varied
4-4 Percentage of committed instructions as the transactional model is varied
4-5 Performance comparison of NACK'd Cycles:Trans Cycles as the microarchitecture model is varied
4-6 Percentage of committed instructions as the microarchitecture model is varied
4-7 Dendrogram showing clustering based on transaction size
4-8 Dendrogram showing clustering based on transaction conflicts
4-9 Dendrogram showing clustering based on read- and write-set sizes
4-10 Dendrogram showing clustering based on the combination of transaction size, R-/W-set sizes, and conflict ratio
4-11 PC 1 versus PC 2 plot of transaction sizes
4-12 PC 1 versus PC 2 plot of conflict ratios
4-13 PC 1 versus PC 2 plot of R-/W-set sizes
4-14 PC 1 versus PC 2 plot of contention domain
4-15 Factor loadings of the contention domain
5-1 PC plot of STAMP and SPLASH
5-2 Dendrogram of cluster analysis of STAMP and SPLASH
5-3 High-level representation of TransPlant
5-4 TransPlant step-by-step example
5-5 PC1-PC2 plot of synthetic programs
5-6 Dendrogram of unified cluster analysis
5-7 PC1-PC2 plot of unified PCA
5-8 PC3-PC4 plot of unified PCA
5-9 PC1-PC2 plot of original applications
5-10 Dendrogram from original cluster analysis
5-11 PC1-PC2 plot of synthetic applications
5-12 Dendrogram of synthetic cluster analysis
6-1 Performance scaling on 50% transactional, array-based memory accesses, as the transactional granularity increases on an eager-eager system (left) and a lazy-lazy system (right)
6-2 Relative execution time, eager-eager (array-based)
6-3 Relative execution time, lazy-lazy (array-based)
6-4 Performance scaling on 50% transactional, object-based memory accesses, as the transactional granularity increases on an eager-eager system (left) and a lazy-lazy system (right)
6-5 Relative execution time, eager-eager (object-based)
6-6 Relative execution time, lazy-lazy (object-based)
6-7 Retries per transaction for lazy-lazy on 50% transactional as the transactional granularity increases on array-based memory accesses (left) and object-based memory accesses (right)
6-8 Performance scaling as the transactional granularity increases and as the transactional percentage changes on an eager-eager system (left) and a lazy-lazy system (right)
6-9 Memory conflict stride (EE system)
6-10 Abort count (mem_18k_1_1)
6-11 Cycle count (mem_18k_1_1)
6-12 Abort count (mem_256_x_y)
6-13 Cycle count (mem_256_x_y)

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DEVELOPING REPRESENTATIVE WORKLOADS FOR FUTURE HARDWARE TRANSACTIONAL MEMORY RESEARCH USING A CYCLE-ACCURATE, MULTI-DIMENSIONAL HARDWARE TRANSACTIONAL MEMORY MODEL

By

James Michael Poe II

December 2009

Chair: Tao Li
Major: Electrical and Computer Engineering

Transactional memory is emerging as a parallel programming paradigm for multi-core processors. Transactional memory provides a means to bridge the discrepancies between programmer productivity and the difficulty in exploiting thread-level parallelism gains offered by emerging chip multiprocessors. Because the hardware has outpaced the software, there are very few modern multithreaded benchmarks available and even fewer for transactional memory researchers. To make this problem worse, the architecture community has no means of characterizing the similarity of the benchmarks that do exist, which poses serious questions concerning the comprehensiveness of current evaluations. This hurdle must be overcome for transactional memory research to mature and gain widespread acceptance. Currently, for performance evaluations, most researchers rely on manually converted lock-based multithreaded workloads or the small group of programs written explicitly for transactional memory. A new parameterized methodology that can automatically generate a program based on the desired high-level program characteristics benefits the transactional memory community.

In this work, all of the above issues are addressed. First, a cycle-accurate, multiple-issue multi-core hardware transactional memory model that is capable of simulating each of the three most common dimensions of hardware transactional memory is developed, the first of its kind. That model is then used to perform the first comprehensive study of the interaction that occurs between transactional memory and multi-core architecture. The results of that study are used to develop analytical models that are capable of predicting performance. A set of transaction-oriented workload characteristics is proposed that can accurately capture the behavior of transactional code and, when used in conjunction with principal component analysis and clustering algorithms, expose the similarity that exists in current transactional workloads. Methods to reduce overlap (the number of required simulations) and maximize the comprehensiveness of an evaluation based upon the architectural areas of interest to the transactional developer are described in detail. All of these tools and the experience gained are used to develop TransPlant, a framework that is capable of generating synthesized transactional benchmarks based on an array of different inputs. It is shown that TransPlant can mimic the behavior of current transactional memory workloads. Further, TransPlant is shown to be capable of generating benchmarks with features that lie outside the boundary occupied by traditional benchmarks. Finally, TransPlant is used to perform a case study on the behavior of future transactional memory workloads.

CHAPTER 1
INTRODUCTION

Transactional memory (TM) simplifies parallel programming by supporting the speculative execution of atomic blocks of code without relying upon complex locking conventions. TM systems and supporting mechanisms can be implemented in hardware [1, 2], software [3, 4, 5], or a hybrid [6, 7, 8] of the two. This study focuses on hardware implementations of transactional memory. Conflict detection and version management represent major design dimensions in the hardware transactional memory (HTM) design space. Conflict detection describes the methodology used to detect conflicts, which can occur either immediately when a memory request is made (Eager) or as a batch operation when a transaction is committed (Lazy). Version management refers to whether new values are written directly to memory while old values are logged (Eager), or whether new values are buffered and then written in batch at the time of commit (Lazy). Recent HTM design studies largely focus on optimizing parameters only within the TM domain, and the interaction between the TM and core configuration has received little attention.

In an HTM-based multi-core system, the TM design dimensions and core configurations interact with each other. A multi-core processor relies on a number of microarchitectural optimizations that exploit dynamic properties of code and data to facilitate the execution of instructions in parallel on each core, whereas the HTM system orchestrates how the transactionalized threads can be executed concurrently across multiple cores. Therefore, the overall system performance is determined by the TM design decisions as well as by the underlying multi-core architecture parameters and should not be considered in isolation. Understanding their interactions and the corresponding performance impact is critical for the efficient co-design of TM-based multi-core systems.
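To make the two design dimensions concrete, the sketch below models a single transaction under the four policy combinations. It is a toy Python illustration written for this text, not SuperTrans code; the class, method, and parameter names are assumptions made for the example.

```python
from enum import Enum

class Policy(Enum):
    EAGER = "eager"
    LAZY = "lazy"

class Transaction:
    """Toy illustration of the version-management and conflict-detection dimensions."""

    def __init__(self, tid, version_mgmt, conflict_det, memory):
        self.tid = tid
        self.version_mgmt = version_mgmt  # EAGER: write in place + undo log; LAZY: write buffer
        self.conflict_det = conflict_det  # EAGER: check each access; LAZY: check at commit
        self.memory = memory              # shared dict: address -> value
        self.undo_log = {}
        self.write_buffer = {}
        self.write_set = set()

    def write(self, addr, value, other_write_sets):
        # Eager conflict detection: a conflicting access is caught immediately
        # (a real HTM would NACK the request or abort one of the transactions).
        if self.conflict_det is Policy.EAGER and any(addr in ws for ws in other_write_sets):
            raise RuntimeError(f"tx {self.tid}: eager conflict on address {addr}")
        self.write_set.add(addr)
        if self.version_mgmt is Policy.EAGER:
            self.undo_log.setdefault(addr, self.memory.get(addr))  # log the old value
            self.memory[addr] = value                              # update memory in place
        else:
            self.write_buffer[addr] = value                        # defer the update to commit

    def commit(self, other_write_sets):
        # Lazy conflict detection: conflicts are checked as a batch at commit time.
        if self.conflict_det is Policy.LAZY and any(self.write_set & ws for ws in other_write_sets):
            self.abort()
            raise RuntimeError(f"tx {self.tid}: lazy conflict detected at commit")
        if self.version_mgmt is Policy.LAZY:
            self.memory.update(self.write_buffer)  # publish buffered writes in batch
        self.undo_log.clear(); self.write_buffer.clear(); self.write_set.clear()

    def abort(self):
        if self.version_mgmt is Policy.EAGER:
            self.memory.update(self.undo_log)  # roll back in-place writes from the undo log
        self.undo_log.clear(); self.write_buffer.clear(); self.write_set.clear()
```

The three combinations studied later (Eager/Eager, Eager/Lazy, Lazy/Lazy) correspond to choosing one policy per dimension in this sketch.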

For example, prior work [9] shows that the interaction of TM system design and program transactions can lead to pathological patterns of execution that can degrade performance. However, the intrinsic interaction between TM mechanisms, core architecture configuration, and program transactions is largely unknown.

In the first part of this study, an HTM is developed on top of a common chip-multiprocessor simulator that is able to model all three of the common conflict detection/version management design dimensions (Eager/Eager, Eager/Lazy, and Lazy/Lazy). Using this framework, design points are identified that have a first-order effect on the performance of transactional memory workloads. The significance of the complex interactions between TM and core architecture parameters is quantified. Next, computationally efficient non-linear predictive models are built to accurately forecast TM workload performance at arbitrary HTM/core configuration combinations. By analyzing the regression trees generated from the neural network model-building methods, further heterogeneous interaction between TM workloads, core microarchitectures, and TM mechanisms is revealed.

The second part of this study focuses on characterization of transactional workloads. Because transactional memory has only returned to the forefront of software and computer engineering in recent years, there is a dearth of transactional programs. Compounding this is the lack of a modern, cohesive parallel benchmark suite, forcing researchers to transactionalize older parallel programs while speculating about what future transactional and even parallel programs may look like. The concern from an architecture standpoint is that blindly choosing benchmarks may not accurately provide a picture of the entire design space, especially if their characteristics are unknown. Even if their multi-threaded behavior is well known, as in the case of the SPLASH-2 suite, little attention has been paid to their similarities outside of the traditional lock/barrier model.

This is a pitfall for computer architects. If too few benchmarks are chosen, the applications may not provide the stressors needed to evaluate a design. If too many benchmarks are chosen, their behavior may overlap, increasing design time without providing additional information.

To measure the behavior of a transactional program, a set of metrics must be defined. In this study, a set of transactional characteristics that can be used to quantify the similarity of transactional workloads is provided, and it is shown that the chosen set of transactional features is independent of the transactional model. These metrics are then combined and monitored over the lifetime of the program, giving more insight than the simple aggregate performance of a single trait can provide. However, this is more data than can be accurately analyzed, and some of these characteristics may be correlated with one another, further hindering any meaningful observations about which characteristics may be important for a given program. To solve these problems, principal component analysis is used to reduce the size of the input vector. Cluster analysis is then used on the reduced vectors to evaluate the similarity of the chosen benchmarks, providing an easy-to-understand representation of the complete data set. The clusters can then be analyzed according to their linkage distance, with short distances indicating strong clustering and larger distances indicating weak clustering, to determine their overall similarity. The (dis)similarities between 15 benchmarks taken from the SPLASH-2 [10], STAMP [11], and PARSEC [12] suites across the three most common dimensions of transactional memory are then identified. It is shown that by discarding similar programs and selecting a subset of the programs, the overall behavior of the entire set can be reproduced. Moreover, this behavior can be captured even when the transactional and microarchitecture models are varied.
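A minimal sketch of the PCA-plus-clustering flow described above is shown below. The feature matrix and benchmark names are placeholders rather than the dissertation's measured characteristics, and scikit-learn/SciPy are simply one convenient way to express the steps.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder feature matrix: one row per benchmark, one column per
# transactional characteristic (e.g., transaction size, read-/write-set
# size, conflict ratio).  Real values would come from simulator output.
benchmarks = ["barnes", "fmm", "cholesky", "genome", "vacation", "kmeans"]
rng = np.random.default_rng(0)
features = rng.random((len(benchmarks), 8))

# Normalize, then project onto the principal components that explain
# most of the variance, reducing the size of the input vector.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=0.95)           # keep PCs covering 95% of the variance
reduced = pca.fit_transform(scaled)

# Hierarchical clustering on the reduced vectors; linkage distance
# indicates how strongly benchmarks cluster together.
Z = linkage(reduced, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
for name, cluster in zip(benchmarks, labels):
    print(f"{name}: cluster {cluster}")
# One representative per cluster can then be kept as a reduced benchmark subset.
```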

Next, it is shown that, using the proposed transactional characteristics, fine-grained subsets can be generated tailored to explore specific architectural areas of interest to the transactional developer.

The third and final part of this study builds upon the knowledge and insight of transactional workload characteristics that has been gained to develop mechanisms and methodologies that can automatically generate parameterized synthetic transactional workloads. Even though programmers have several software-based transactional tools [13, 14] at their disposal, the production of valid transactional programs is almost non-existent. This forces researchers to convert lock-based or message-passing programs manually, which is itself exacerbated by the lack of a modern, cohesive, parallel benchmark suite.

Fundamentally, designing a transactional memory system involves making decisions about its conflict detection, version management, and conflict resolution mechanisms. Despite the increasing momentum in transactional memory research, it is unclear which designs will lead to optimal performance, ease of use, and decreased complexity. Further evaluation using a wide spectrum of transactional applications is crucial to quantify the trade-offs among different design criteria. To date, the majority of research into transactional memory systems has been performed using converted lock-based code or microbenchmarks. Because many of these benchmarks are from the scientific community, they have been optimized for SMP systems and clusters and represent only a fraction of potential transactional memory programs. Microbenchmarks are often too simplistic to stress increasingly large and complex multi-core designs and their interaction with the TM system. Several earlier studies [6, 15-18] have shown that implementing a realistic application using transactional memory requires a clear understanding of the particular algorithm, and the effort is non-trivial. Therefore, there is an urgent need for techniques and frameworks that can automatically produce representative transactional benchmarks with a variety of characteristics, allowing architects and designers to explore the emerging multi-core transactional memory design space efficiently.

Traditional synthetic benchmarks preserve the behavior of single- [19] or multi-threaded [20, 21] workloads. The parameterized transaction synthesizer proposed in this research is independent of any input behavior and is capable of producing transactional code with widely varied behavior that can effectively stress transactional memory designs in multiple dimensions. This novel parameterized transaction framework can effectively 1) represent the heterogeneous concurrency patterns of a wide variety of applications and 2) mimic both the way that regular programmers use transactional memory and the way experienced parallel programmers can exploit concurrency opportunities. Moreover, the parameterized synthetic workloads have significantly reduced runtime. The reduced runtime allows architects and designers to explore large design spaces within which numerous design tradeoffs need to be evaluated quickly.

The framework that is proposed in this study, TransPlant, synthesizes benchmarks based on a parameter set whose inputs are defined by the user. These benchmarks can be run on real systems, simulators, or RTL modeling systems. The abstract set of parameters provides a robust method for generating programs for workspace exploration. It is shown that TransPlant can be used to replicate the behavior of existing transactional programs. More importantly, it is shown that this methodology can be used to generate programs that represent all of the different high-level transactional characteristics, giving researchers a much wider breadth of program selection.
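The chapter does not reproduce TransPlant's actual input format here, so the following is a hypothetical illustration of what a user-defined parameter set for a synthetic transactional benchmark generator might look like; every field name and value is invented for the example.

```python
# Hypothetical parameter set for a synthetic transactional benchmark generator.
# The field names are illustrative and are not TransPlant's real input format.
from dataclasses import dataclass

@dataclass
class SyntheticTMConfig:
    threads: int = 8                       # number of worker threads
    transactions_per_thread: int = 1000
    tx_length_insts: tuple = (50, 400)     # min/max instructions per transaction
    read_set_size: tuple = (4, 32)         # min/max unique addresses read
    write_set_size: tuple = (2, 16)        # min/max unique addresses written
    conflict_probability: float = 0.05     # chance a write targets contended data
    transactional_fraction: float = 0.5    # fraction of dynamic work inside transactions

# The user varies only these high-level knobs; the generator emits the program.
config = SyntheticTMConfig(threads=16, conflict_probability=0.20)
print(config)
```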

CHAPTER 2
SUPERTRANS: A MULTI-DIMENSIONAL, CYCLE-ACCURATE HARDWARE TRANSACTIONAL MEMORY MODEL

This chapter describes the development of SuperTrans, a multiple-issue, cycle-accurate transactional memory model that is capable of simulating all of the common conflict detection/version management dimensions. While other examples of HTM models exist, this model is the first that is capable of cycle-accurate simulation of a multiple-issue, common chip multiprocessor and that is able to combine all dimensions in a common framework for direct comparison. Moreover, whereas most other models focus on specific implementations, this model provides the architect with both real cycle-accurate timing information as well as abstractions that allow for easy modification of the coherence protocol, speculative timing information, etc. After a description of the hardware transactional model, the simulated workloads that will be used throughout the study are described, as well as the process that was required to run them on the HTM model. Finally, the chapter is concluded with transactional results obtained by running the workloads on the HTM model.

Hardware Transactional Memory Framework

SuperTrans, a hardware-based TM framework, was developed on top of the cycle-accurate, common chip multiprocessor simulator SESC [22]. The framework can model a variety of TM design points (e.g., Eager/Lazy conflict detection and Eager/Lazy version management) and report a wide range of TM execution statistics. While the majority of existing HTM studies have used in-order, single-issue processor models, SuperTrans provides detailed cycle-accurate simulation of multiple-issue, out-of-order execution chip multiprocessors with parameterized core microarchitecture and TM models. SuperTrans is also less coupled to a specific architectural implementation of transactional memory, and instead takes a parameterized approach.

While all microarchitectural structures are still simulated by the detailed pipeline, additional parameters have been added to the transactional model that allow various aspects of the transactional model to be tuned independently. For example, instead of the amount of time a commit takes being exclusively a function of the write-back times of the specific implementation, the SuperTrans model is orthogonal to the detailed pipeline, allowing commit time to be a function of both the actual L1/L2 write-back times as well as static and variable parameterized transactional values. This abstract approach allows one to directly compare different transactional memory design points in a unified environment, providing valuable insight into which areas are of most concern to the transactional memory architect when seeking to optimize specific implementations. For example, using the SuperTrans framework an architect can quickly determine whether it is more productive to improve the static commit overhead time versus the variable overhead time that is dependent upon write-set size, or whether conflict granularity is more important in reducing conflicts than backoff policy, etc.

SuperTrans is also capable of producing various levels of detailed output depending on what the architect is interested in studying. At the highest level of detail, the simulator will produce a cycle-accurate, time-domain trace of every memory reference (as well as any NACKs that occurred as a result of that request), commit, or abort. At the medium level of detail, the model will produce a line of output for every committed and aborted transaction that describes the size of the read/write set, the length in instructions/cycle count, as well as any conflict information. Finally, at the lowest level of detail the model will simply produce aggregate information on the size, frequency, and time spent in all of the states of the transactions (running, aborting, committing, nacking).

To extend SESC with the transactional model required two primary areas of focus: functional correctness and detailed cycle accuracy.
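As an illustration of the static-plus-variable commit overhead just described, the sketch below computes a commit latency from a fixed overhead, a per-write-set-entry cost, and the pipeline's write-back cost. It is not SuperTrans code; the parameter names and default values are assumptions chosen for the example.

```python
# Illustrative model of a parameterized commit latency: a static overhead,
# a variable per-entry overhead, and the pipeline's actual write-back cost.
# These names are hypothetical, not SuperTrans configuration keys.
def commit_latency_cycles(write_set_size: int,
                          l1_writeback_cycles: int,
                          static_commit_overhead: int = 20,
                          per_entry_overhead: int = 2) -> int:
    variable_overhead = per_entry_overhead * write_set_size
    writeback_cost = l1_writeback_cycles * write_set_size
    return static_commit_overhead + variable_overhead + writeback_cost

# Sweeping the static versus variable components shows which knob matters
# more for a workload's typical write-set size.
for ws in (2, 8, 32):
    print(ws, commit_latency_cycles(ws, l1_writeback_cycles=3))
```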

To ensure proper functional correctness, it was necessary to modify the MINT front-end, which governs the proper emulation of the MIPS binary. This required constructing transactional caches, adding checkpoint capabilities (to support potential aborts), intercepting and buffering memory references within transactional bounds (since atomicity must be guaranteed), as well as supporting write-back on transactional commits. Next, the detailed pipeline within the SESC framework had to be augmented to support cycle-accurate calculations based on the actual commit times of the instructions, to initiate stalls when necessary (commit/abort overheads, commit contention, memory contention, etc.), and to back-propagate relevant timing information to the functional model through the coherence module. Finally, a coherence module was added that allows detailed coordination of both the functional and cycle-accurate aspects of the simulator.

Simulated Workloads

For evaluation, a set of SPLASH-2 [10] benchmarks was transactionalized and the STAMP [6] benchmark suite was modified to work with the HTM framework. While SPLASH-2 provides a good comparison of design points for fine-grained transactions and highly optimized lock-based behavior, the STAMP suite allows for results based on coarse-granularity transactions and un-tuned coarse-grain locks. For SPLASH-2, only those benchmarks which contain a sufficient amount of lock-based activity were used, and the locked regions were converted directly into their equivalent transactions. For the STAMP suite, transactions were modified by adopting the transactional annotation used within the SuperTrans framework. Additionally, several portions of the thread-spawning routines were modified to use the lower-level pthread API to match the subset of pthread functions supported on the base SESC simulator. All other code remains the same. Finally, since the STAMP suite doesn't provide lock-based versions of the benchmarks, equivalent lock-based versions were created using the same level of granularity. This results in very coarse-grained locks that limit the performance of the lock-based versions.
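Conceptually, the conversion of a locked region into its equivalent transaction looks like the Python sketch below. The real benchmarks are C code annotated with the framework's own begin/commit macros; the `transaction()` placeholder here is hypothetical and only stands in for that annotation.

```python
import threading
from contextlib import contextmanager

lock = threading.Lock()
accounts = {"a": 100, "b": 50}

# Original lock-based critical section, typical of the converted SPLASH-2 regions.
def transfer_locked(src, dst, amount):
    with lock:
        accounts[src] -= amount
        accounts[dst] += amount

# Placeholder standing in for whatever begin/commit annotation the HTM
# framework recognizes; in the real benchmarks this is a C macro pair.
@contextmanager
def transaction():
    yield  # begin-transaction / commit-transaction markers would go here

# Direct conversion: the lock is dropped and the same region runs atomically.
def transfer_transactional(src, dst, amount):
    with transaction():
        accounts[src] -= amount
        accounts[dst] += amount
```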

Since the purpose of transactional memory is to ease the burden on lock-based programmers, it can be estimated that this is how a lock-based version of similar design time would perform. More importantly, since the goal of this work is not to compare transactional memory performance to lock-based performance, but instead to provide a baseline case with which to compare the performance of different HTM design dimensions, this transformation does not detract from the results. Finally, these un-tuned locks provide an interesting comparison to SPLASH's highly optimized locks. All benchmarks were run to completion and include both the sequential and parallel portions of code.

Table 2-1 summarizes the transactional behavior of the studied workloads on an 8-core CMP. Included in the table are results for the Eager/Eager, Eager/Lazy, and Lazy/Lazy models of conflict detection and version management. As can be seen, many of the values (e.g., read/write set size) are inherent to the algorithm and don't vary depending upon the choice of transactional dimension. Other values, such as the number of aborts, are largely determined by these policy decisions. This table also clearly demonstrates the need for more workload diversity than is achievable through SPLASH-2 alone. While the number of transactions across SPLASH-2 differs, the size of the transactions, read sets, and write sets is fairly uniform. In contrast, STAMP provides a greater level of TM workload diversity.

Table 2-1. Transactional characteristics of a subset of SPLASH-2, STAMP, and PARSEC run on the hardware transactional model

Benchmark (input dataset) | TM model | Trans. started | Commits | Aborts | NACK-stalled cycles (M) | Avg. read-set size | Avg. write-set size | Read/write ratio | Avg. committed trans. length (instructions)
Barnes (16K particles) | EE | 70533 | 68979 | 1554 | 2.33 | 6.71 | 6.53 | 1.07 | 204.09
Barnes (16K particles) | EL | 69495 | 68981 | 514 | 0.665177 | 6.71 | 6.53 | 1.07 | 204.09
Barnes (16K particles) | LL | 69336 | 68974 | 362 | 6.880302 | 6.71 | 6.53 | 1.07 | 204.10
Fmm (16K particles) | EE | 45256 | 45253 | 3 | 0.001771 | 13.43 | 7.34 | 1.82 | 175.60
Fmm (16K particles) | EL | 45261 | 45261 | 0 | 0.004679 | 13.43 | 7.34 | 1.82 | 175.57
Fmm (16K particles) | LL | 45302 | 45276 | 26 | 0.516338 | 13.43 | 7.34 | 1.82 | 175.52
Cholesky (Tk15.O) | EE | 15904 | 15885 | 19 | 0.015719 | 3.13 | 1.95 | 2.01 | 27.18
Cholesky (Tk15.O) | EL | 15891 | 15885 | 6 | 0.011696 | 3.12 | 1.95 | 2.01 | 27.18
Cholesky (Tk15.O) | LL | 15963 | 15885 | 78 | 0.057466 | 3.12 | 1.95 | 2.01 | 27.16
Ocean-con (258x258) | EE | 2161 | 1664 | 497 | 0.091549 | 3.00 | 0.27 | 12.93 | 10.39
Ocean-con (258x258) | EL | 1857 | 1664 | 193 | 0.031215 | 3.00 | 0.27 | 12.93 | 10.38
Ocean-con (258x258) | LL | 1800 | 1664 | 136 | 0.022013 | 3.00 | 0.26 | 13.44 | 10.38
Ocean-non (66x66) | EE | 7200 | 2000 | 5200 | 0.783498 | 3.00 | 0.38 | 9.79 | 13.25
Ocean-non (66x66) | EL | 2965 | 2000 | 965 | 0.092604 | 3.00 | 0.38 | 9.76 | 13.24
Ocean-non (66x66) | LL | 2778 | 2000 | 778 | 0.057183 | 3.00 | 0.36 | 10.22 | 13.17
Raytrace (Teapot) | EE | 141020 | 76741 | 64279 | 22.43765 | 6.49 | 2.46 | 5.33 | 60.87
Raytrace (Teapot) | EL | 129550 | 76741 | 52809 | 21.945076 | 7.19 | 2.46 | 6.14 | 69.17
Raytrace (Teapot) | LL | 307376 | 76741 | 230635 | 0.260170 | 7.49 | 2.46 | 6.51 | 73.54
Water-nsq (512 molecules) | EE | 10398 | 10376 | 22 | 0.002693 | 10.87 | 2.97 | 2.66 | 59.26
Water-nsq (512 molecules) | EL | 10377 | 10376 | 1 | 0.005997 | 10.87 | 2.97 | 2.66 | 59.26
Water-nsq (512 molecules) | LL | 10482 | 10376 | 106 | 0.654037 | 10.87 | 2.97 | 2.66 | 59.26
Water-sp (512 molecules) | EE | 153 | 153 | 0 | 0.000146 | 2.48 | 1.37 | 1.68 | 133.25
Water-sp (512 molecules) | EL | 153 | 153 | 0 | 0.004655 | 2.48 | 1.37 | 1.68 | 133.25
Water-sp (512 molecules) | LL | 226 | 153 | 73 | 0.003986 | 2.57 | 1.46 | 1.89 | 366.78
Bayes (384 records) | EE | 263 | 166 | 97 | 2.977146 | 793.16 | 713.67 | 1.78 | 448887.36
Bayes (384 records) | EL | 186 | 166 | 20 | 2.961192 | 793.16 | 713.67 | 1.78 | 448887.36
Bayes (384 records) | LL | 168 | 140 | 28 | 0.007756 | 437.66 | 371.51 | 1.78 | 230090.91
Genome (g256 s15 n16384) | EE | 6922 | 6594 | 328 | 0.486603 | 27.96 | 5.05 | 8.16 | 1199.45
Genome (g256 s15 n16384) | EL | 6702 | 6594 | 108 | 0.534763 | 27.91 | 5.05 | 8.16 | 1199.49
Genome (g256 s15 n16384) | LL | 6869 | 6594 | 275 | 1.750257 | 27.89 | 5.05 | 8.16 | 1199.65
Kmeans (Random1000_12) | EE | 6705 | 6705 | 0 | 0.005030 | 6.97 | 2.49 | 2.27 | 100.20
Kmeans (Random1000_12) | EL | 8049 | 8046 | 3 | 0.019816 | 6.97 | 2.49 | 2.27 | 100.20
Kmeans (Random1000_12) | LL | 718850 | 671841 | 47009 | 6.430020 | 6.97 | 2.49 | 2.27 | 100.20
Labyrinth (512 molecules) | EL | 2788 | 112 | 2676 | 14951.773 | 601.71 | 403.18 | 5.13 | 517518.72
Labyrinth (512 molecules) | LL | 448 | 112 | 336 | 0.004661 | 606.77 | 408.10 | 5.12 | 517329.69
Vacation (4096 tasks) | EE | 7023 | 4096 | 2927 | 7.405184 | 67.11 | 13.22 | 3.95 | 1612.62
Vacation (4096 tasks) | EL | 5586 | 4096 | 1490 | 6.018982 | 66.46 | 13.18 | 3.95 | 1610.51
Vacation (4096 tasks) | LL | 4493 | 4096 | 397 | 0.047728 | 66.18 | 13.19 | 3.95 | 1610.69
Fluidanimate (Simsmall) | EE | 1822317 | 1822271 | 46 | 0.025786 | 1.58 | 1.11 | 1.37 | 6.95
Fluidanimate (Simsmall) | EL | 1822219 | 1822185 | 34 | 0.055536 | 1.58 | 1.11 | 1.37 | 6.95
Fluidanimate (Simsmall) | LL | 1840142 | 1838346 | 17956 | 43.700007 | 1.58 | 1.11 | 1.37 | 6.92
Streamcluster (Simsmall) | EE | 19599 | 19599 | 0 | 0.000068 | 2.97 | 1.97 | 1.17 | 20.75
Streamcluster (Simsmall) | EL | 19796 | 19796 | 0 | 0.002186 | 2.97 | 1.97 | 1.17 | 20.75

CHAPTER 3
USING ANALYTICAL MODELS TO EFFICIENTLY EXPLORE HARDWARE TRANSACTIONAL MEMORY AND MULTI-CORE CO-DESIGN

In this chapter, the HTM framework is used to identify the design points that have a first-order effect on the performance of transactional memory workloads. The significance of the complex interactions between TM and core architecture parameters is then quantified. Next, computationally efficient non-linear predictive models are built that can accurately forecast TM workload performance at arbitrary HTM/core configuration combinations. By analyzing the regression trees generated from the neural network model-building methods, further heterogeneous interactions between TM workloads, core microarchitectures, and TM mechanisms are revealed.

Analytical Modeling Techniques

In this chapter, two widely used analytical modeling techniques are employed, namely linear regression and radial basis function based neural networks. This section describes these two techniques.

Linear Regression Models

Linear regression models are used to discover complex interactions between TM and processor core design parameters, since linear models can explicitly reveal how various design parameters (inputs) and their interactions affect the performance of TM workloads (outputs). Linear regression is a method to establish the relationship between a response variable and a set of input parameters. A linear model specifies the relationship between a response variable Y and a set of regressor variables, the x's, so that

Y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_k x_k + ε.

The residual ε, which is a random variable with mean zero, stands for the error variability that cannot fit into the model. In this equation, β_0 is the regression coefficient for the intercept and the β_i values are the regression coefficients for variables 1 through k.
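A minimal sketch of fitting such a model, including a two-way interaction term as in Equation 3-1, by ordinary least squares is given below. The data are synthetic placeholders generated for the example, not the dissertation's simulation results.

```python
import numpy as np

# Synthetic example: two design parameters x1, x2 and a response Y that
# includes an interaction effect (the x1*x2 term), as in Equation 3-1.
rng = np.random.default_rng(1)
x1 = rng.uniform(0, 1, 200)
x2 = rng.uniform(0, 1, 200)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + 3.0 * x1 * x2 + rng.normal(0, 0.1, 200)

# Design matrix with an intercept, both main effects, and the 2-way interaction.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients (b0, b1, b2, b3):", np.round(beta, 2))
```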

They are both determined by the condition that the sum of the squared residuals is as small as possible. An effect of interaction occurs when a relation between x_i and Y depends on at least one other variable x_j. In other words, the strength or the sign of a relation between at least two variables is different depending on the value of some other variables. For example, a regression model with a 2-way interaction is shown in Equation 3-1:

Y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_1 x_2 + ε.    (3-1)

Equation 3-2 represents a complete model that includes three-way, four-way, and all higher-order interactions. Note that there are 2^m terms in this model and an equal number of unknown regression coefficients:

Y = β_0 + Σ_{i=1..m} β_i x_i + Σ_{i=1..m} Σ_{j=i+1..m} β_{i,j} x_i x_j + Σ_{i=1..m} Σ_{j=i+1..m} Σ_{k=j+1..m} β_{i,j,k} x_i x_j x_k + ... + β_{1,2,...,m} x_1 x_2 ... x_m + ε.    (3-2)

To analyze the performance of transactional memory workloads across varied TM and multi-core microarchitecture design parameters, those parameters should be included in the built linear regression models. Equation 3-2 indicates that with an increased number of design parameters, constructing a complete linear regression model will require a prohibitively large number of experiments. In general, a linear regression model can be represented as a sum of k terms (k ≤ 2^m) from the complete model shown in Equation 3-2. To obtain accurate models that capture all significant terms with a minimum number of simulations, the D-optimal experimental design method [23] is used, which chooses locations at which to take measurements to minimize the number of terms k and the residual error simultaneously. D-optimality is one of the most commonly used design criteria for constructing linear regression models.
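A small sketch of the D-optimality criterion follows: from a pool of candidate configurations, points are selected greedily so as to maximize det(FᵀF) for a chosen set of model terms. This is an illustrative simplification written for this text (exhaustive greedy search over a random pool), not the coordinate-exchange algorithm cited above.

```python
import numpy as np
from itertools import combinations

def model_terms(x):
    """Expand a configuration vector into intercept, main effects, and 2-way interactions."""
    terms = [1.0, *x]
    terms += [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.array(terms)

def greedy_d_optimal(candidates, n_points):
    """Greedily pick points that maximize det(F^T F) of the information matrix."""
    chosen, remaining = [], list(range(len(candidates)))
    for _ in range(n_points):
        def score(idx):
            F = np.array([model_terms(candidates[i]) for i in chosen + [idx]])
            # Small regularization keeps early (rank-deficient) iterations comparable.
            return np.linalg.det(F.T @ F + 1e-9 * np.eye(F.shape[1]))
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(2)
pool = rng.uniform(-1, 1, size=(100, 3))   # candidate design-parameter settings
design = greedy_d_optimal(pool, n_points=12)
print("selected candidate indices:", design)
```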

Radial Basis Function Neural Networks

In this study, neural network models are used to predict the system response at unexplored design points, enabling efficient exploration of the design space. An Artificial Neural Network (ANN) is an information processing paradigm inspired by the way biological nervous systems process information. It is composed of a set of interconnected processing elements working in unison to solve problems.

The most common type of neural network (shown in Figure 3-1) consists of three layers of units: a layer of input units is connected to a layer of hidden units, which is connected to a layer of output units. The input is fed into the network through the input units. Each hidden unit receives the entire input vector and generates a response. The output of a hidden unit is determined by the input-output transfer function that is specified for that unit. Commonly used transfer functions include the sigmoid, the linear threshold function, and the Radial Basis Function (RBF) [26]. The ANN output, which is determined by the output unit, is computed using the responses of the hidden units and the weights between the hidden and output units. Neural networks outperform linear models in capturing the complex, non-linear relation between input and output.

In this study, the RBF transfer function is used to predict the impact of TM and core microarchitecture design parameters on the performance of transactional memory workloads. The basic architecture of an RBF network with n inputs and a single output is shown in Figure 3-1. The nodes in adjacent layers are fully connected. Such a network can be represented by the following parametric model:

f(x) = \sum_{i=1}^{r} w_i \phi_i(\|X - \mu_i\|, \sigma_i)    (3-3)

where X is an input vector, r is the number of hidden units, φ_i is the basis function of the network, the w_i are the weights of the network, μ = (μ_1, μ_2, ..., μ_n)^T is the center vector of the nodes, σ = (σ_1, σ_2, ..., σ_n)^T is the radius vector of the nodes, and || || denotes the Euclidean norm. In this study, a Gaussian function is used as the basis function of the network, i.e.,

\phi_i(\|X - \mu_i\|, \sigma_i) = \exp(-\|X - \mu_i\|^2 / \sigma_i^2)    (3-4)

The above RBF function has the highest response when the input corresponds to the center vector, and the response decreases monotonically as the input data move away from the center, at a rate controlled by the radius vector σ. The training of the RBF network involves selecting the center locations and radii, which are eventually used to determine the weights, using a regression tree [26].

A regression tree recursively partitions the input data set into subsets using decision criteria. As a result, there will be a root node, non-terminal nodes (having sub-nodes), and terminal nodes (having no sub-nodes), each of which is associated with a subset of the input data. Each node contributes one unit to the RBF network's center and radius vectors. In this study, the selection of RBF centers is performed by recursively parsing the regression tree nodes using a strategy proposed in [26].

Experimental Methodology

Table 3-1 enumerates the set of TM design parameters that were chosen to be modeled in this study, along with their ranges. Trans_model defines the conflict detection and version management policy chosen. Conflict detection describes the methodology used to detect conflicts, which can occur either immediately when a memory request is made (Eager) or as a batch operation when a transaction is committed (Lazy). Version management refers to whether new values are written directly to memory while old values are logged (Eager), or whether new values are buffered and then written in batch at the time of commit (Lazy). Backoff policy refers to the method of backoff used after an abort (this helps to avoid several of the detrimental transactional pathologies described in [9]). Conflict detection granularity refers to the level of granularity at which conflicts are detected (usually, but not always, the size of a cache line). Finally, the primary/secondary baseline latencies and the primary variable latency quantify the latencies associated with a commit or an abort. In a lazy version management system, the primary latency is associated with a commit (since new values must be written back), and in an eager version management system the primary latency is associated with an abort (since the old values must be written back). The baseline latency is the static overhead required (e.g., arbitrating for the bus, maintenance, cleanup, etc.), and the variable latency is the additional latency required based upon the write-set size.

Additionally, a set of processor core microarchitecture parameters that previous studies have shown to have a high influence on performance is included in order to study the interaction of these parameters within a transactional memory context. Analytical models are built for 6 TM design dimensions and 12 key core design parameters.

Throughout all of the results, speedup is used as the metric of choice for transactional workload analysis. Speedup is defined as the number of cycles a lock-based version of the workload takes to complete divided by the number of cycles a transactional version of the same code takes to complete, using the same microarchitectural parameters. This representation of speedup was chosen as the primary metric since it allows one to focus on the differences due exclusively to the use of transactional memory. For example, if raw execution time were used and the percentage of code executed within transactional bounds was relatively low (as it is in many of the finely tuned SPLASH-2 benchmarks), then varying the core architecture parameters, which affect both transactional and non-transactional code portions, would dominate the results. This would make prediction easier and reduce the error across the models; however, it would also provide substantially less information on the underlying interaction between the microarchitecture and transactional memory. Note that it is not the intent of this study to draw a direct comparison between transactional and lock-based workloads, but instead to use the lock-based workloads as a baseline to draw comparisons between transactional models.

The 18 HTM and core microarchitecture parameters listed in Table 3-1 are analyzed.

Table 3-1 also shows the range of each parameter setting. The range is set to be as wide as possible while taking into consideration technology and implementation constraints. During the linear regression modeling experiments, each parameter is varied between two levels (encoded as -1 and 1) corresponding to the low and high values of its range. The procedure described in the linear model subsection is used to build error-bounded linear models relating the TM performance metric to the above HTM and core microarchitecture design parameters.

The D-optimal design can be used to construct models at any level of accuracy; in this study, the error bound is set to 0.01. In the non-linear analytical modeling experiments, a training data set is first used to build the neural network models. An estimate of each model's accuracy is then obtained using the design points in a test data set. To build a representative design space, one needs to ensure that the sample data sets disperse points throughout the design space while keeping the space small enough to ensure that the model building cost remains low. To achieve this goal, a variant of Latin Hypercube Sampling (LHS) [27] is used as the sampling strategy, since it provides better coverage compared to a naive random sampling scheme. Multiple LHS matrices are generated and a space-filling metric called the L2-star discrepancy [28] is used: the L2-star discrepancy is computed for each LHS matrix in order to find the representative design space that has the lowest value of L2-star discrepancy. A randomly and independently generated set of test data points is used to empirically estimate the predictive accuracy of the resulting models. In this study, 200 training data sets and 50 test data sets were used, since this was found to offer a good tradeoff between simulation time and prediction accuracy for the design space under consideration.
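
The sampling step can be made concrete with a short sketch. The code below is illustrative only and does not reproduce the exact variant used in the study: it draws several Latin Hypercube samples over a unit hypercube, scores each candidate with a simple space-filling criterion (the minimum pairwise distance, used here purely as a stand-in for the L2-star discrepancy cited above), keeps the best candidate, and finally maps the unit-cube values onto one hypothetical parameter range.

    import numpy as np

    def latin_hypercube(n_samples, n_dims, rng):
        """One LHS matrix in [0, 1]^d: exactly one sample per stratum in every dimension."""
        offsets = rng.uniform(size=(n_samples, n_dims))
        strata = np.array([rng.permutation(n_samples) for _ in range(n_dims)]).T
        return (strata + offsets) / n_samples

    def min_pairwise_distance(points):
        """Space-filling score (stand-in for the L2-star discrepancy); larger is better."""
        dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        return dist.min()

    rng = np.random.default_rng(7)
    candidates = [latin_hypercube(200, 18, rng) for _ in range(20)]
    training_design = max(candidates, key=min_pairwise_distance)

    # Map the first unit-cube column onto a hypothetical L2 cache size range (in KB).
    l2_size_kb = 1024 + training_design[:, 0] * (4096 - 1024)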

Results

Linear Models

In this section, regression model details are presented for the studied transactional memory workloads. In a linear regression model, the coefficient values (β_i) represent the expected change in the response Y per unit change in x_i and indicate the relative significance of the corresponding terms. Tables 3-3 and 3-4 show the 10 most significant model terms and their beta values. The regression coefficients are sorted in decreasing order of magnitude. The listed parameters and their interactions are the key factors in determining the performance variation on these HTM and core microarchitecture structures.

The first thing one notices when the SPLASH-2 results (Table 3-3) are compared to the STAMP results (Table 3-4) is that in the former case, the parameter choices play a much smaller role in the overall execution speed of the benchmark. This is reflected by the small beta value of the most significant term in each benchmark (< 0.002 for all but one of the SPLASH-2 benchmarks, Raytrace, which is discussed subsequently). This result is to be expected because of the finely tuned nature of the SPLASH-2 benchmarks. Because of this tuning and the fine granularity of the locks, there are relatively few periods in which atomic regions overlap. Thus, the performance of both locks and transactions is similar. Moreover, since actual contention is low, transactional parameters such as conflict detection, version management, and conflict granularity also result in very similar execution patterns. A slight performance increase can be noticed for E/E (since E/E is encoded as +1 and the beta value is positive); this can be attributed to the performance advantage eager version management has on the more frequent commits. One should also note that because of the fine granularity of the transactions as well as the small amount of actual conflict, there is very little opportunity for the core microarchitecture parameters to either assist or hinder the transactions, and thus they play very little role in the model. In fact, across several of the benchmarks, since more than sufficient resources exist within the atomic region, the speedup is actually reduced by applying more resources, since the lock-based version is able to capitalize on these additional resources outside of the atomic region while the overhead of the transactional model remains roughly constant. An interesting exception to this can be seen in the benchmark FMM, where the performance of the transactional model is actually improved over that of the lock-based version by increasing the size of the floating point issue window. This improvement is due to four atomic regions that comprise 97% of the dynamically executed atomic regions.

Each of these fine-grained transactions occurs at the end of a chain of floating point operations and consists of two floating point read-add/sub-write patterns. By increasing the width of the saturated floating point issue, the processor is able to issue the floating point arithmetic operation sooner, decreasing the delay between the read/write operations. This both reduces the overall time spent within the atomic region and, more importantly, increases the chance that an abort can be avoided through the use of stalls, since with a smaller delay between the read/write operations, transactions are more likely to encounter a written value when they attempt to access the address, instead of only a read value that will subsequently need to be written. This is supported by the last term, which shows additional benefit if both the floating point issue window is increased and the transactional model is E/E, where a stall can be used to avoid an abort. It can be inferred from this that when atomic regions become limited by the core architecture, there can potentially be more to gain by allocating resources to a transactional region than to an equivalent lock-based region.

As noted previously, one exception within the SPLASH-2 suite is Raytrace. As can be seen from Table 3-2, Raytrace experiences a much higher rate of actual contention than the other SPLASH-2 benchmarks (as is apparent from its higher abort rate on both E/E and L/L and its increased NACKed cycles on E/E), and this contention results in the transactional parameters playing a more significant role. Interestingly, it can be seen that the best way to improve speedup is to actually increase the granularity of the conflict detection. This is a result of conflicts occurring earlier within the transaction, forcing the E/E model to stall sooner but potentially avoid an abort. This result is supported by the reduced abort and increased stall rates of the E/E system.

For the STAMP benchmark suite, one finds a much more diverse workload set where generalities are harder to come by.

First, it can be seen that on several of the benchmarks there is a clear better choice between E/E and L/L, but the better option varies across workloads. It is also apparent that the microarchitectural core parameters begin to take on a much more important role in determining the execution time. For example, on the memory-intensive Bayes transactions, a significant improvement is seen as the LSQ is increased (in particular when using E/E), and on the less memory-intensive Genome, Kmeans, and Vacation, an improvement is seen as the issue width is increased. This is a particularly significant result since the metric of interest is speedup as compared to an equivalent lock version. Since one would expect both a lock-based version and a transactional version to share roughly equal benefit from the reduction in atomic region cycle length due to the increase in microarchitecture core parameters, this implies that there is a compounding effect that can be achieved by allocating additional resources to transactional atomic regions as compared to lock-based atomic regions. In particular, E/E versions are able to detect possible conflicts sooner and avoid more expensive aborts in favor of stalls (as is indicated by the positive interaction between the microarchitecture parameter of interest and Trans_model). It is important to note, however, that as the granularity of the transactions becomes so large that contention and aborting are unavoidable (as it is in the coarsest-granularity benchmark, Labyrinth), this effect is reduced and a return to transactional parameters playing a dominant role is apparent. Finally, an unexpected result obtained from this research is that on the learning algorithms Kmeans and, to a lesser degree, Bayes, the choice of transactional model can actually affect the algorithm itself. In all cases of L/L, Kmeans required roughly 64 more iterations before convergence was possible, resulting in a substantial increase in the number of transactions committed and aborted and in the overall execution time needed to produce the same result.

This suggests that future designs must be careful to analyze the unanticipated side effects that altering the normal interleaving of a lock-based workload may have on a transactional version, and further implies that future studies should avoid simply using committed work as a metric to compare transactional models, and instead use a more holistic metric.

To summarize, it has been shown that SPLASH-2 alone is not sufficient for transactional memory research. Overall, it is not the small length of the transactions, but the relatively low percentage of atomic regions that are executed, and the even lower percentage of atomic regions that experience contention, that results in the relatively small changes across variations in the transactional and microarchitectural parameters. Moreover, the greatest potential for optimization is in shorter, more commonly occurring contentious regions. This suggests that it is not enough to simply increase the granularity of transactions; more work must be done to create workloads that have greater levels of contention, more atomic regions, and mixtures of granularities in order to really determine the performance of a transactional model. It was also shown that transactions may benefit more from an increase in the microarchitectural core resources allocated to them than even similarly designed lock-based atomic regions, particularly for E/E designs. In general, if contention is low, E/E performs slightly better due to its quicker commit speed, but this rule is broken as contention becomes a factor.

Non-linear Model Results

The TM workload speedup prediction accuracies using the RBF neural network are plotted as boxplots in Figure 3-2. Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between the hinges, which are approximately the first and third quartiles of the error values. Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data; it shows the center of the distribution of the error values.

The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance of 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles.

Figure 3-2A shows that on the SPLASH-2 benchmarks, the performance model achieves median errors ranging from 4.3 percent (Cholesky) to 7.4 percent (Raytrace), with an overall median error across all benchmarks of 5.8 percent. Figure 3-2B shows that on the STAMP benchmarks, the predictive models are less accurate, with median errors ranging from 4.6 percent (Genome) to 14.8 percent (Kmeans) and an overall median of 9.0 percent. This increase in error over the SPLASH-2 benchmarks can be attributed both to the increase in transactional conflict and to the more complex, harder-to-predict probabilistic algorithms (for example, the two worst errors occur on the learning algorithms Kmeans and Bayes). Using only 200 training points to predict across 18 dimensions is limiting, however, and as the number of training points used to build the neural network is increased, the prediction accuracy increases accordingly. For the worst case, Kmeans, the prediction error can be reduced to 6.8% with 500 training data points.

The Heterogeneity of Transaction/Core Microarchitecture/TM Mechanism Interaction

In the linear model results subsection, linear regression models were used to draw conclusions about the most important parameters to consider in the co-design of transactional memory and core microarchitecture. These models suggest that the relative performance of a benchmark is related not only to each parameter, but also to the specific interactions that occur between the microarchitecture and transactional parameters. In this section, this is further expanded upon using the NN models developed in the neural network results subsection to explore whether these microarchitectural interactions are constant across benchmarks and transactional models, or whether they also vary and offer opportunities for future optimizations.
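
For readers who want to reproduce the kind of accuracy summary plotted in Figure 3-2, the short sketch below computes per-design-point percentage errors and the statistics a boxplot displays (median, quartiles, whisker limits, outliers). The measured and predicted speedups here are placeholder values, not results from the study, and the whiskers are computed from the quartiles, which is the common convention.

    import numpy as np

    def error_summary(measured, predicted):
        """Percentage error per test point plus the statistics a boxplot displays."""
        err = 100.0 * np.abs(predicted - measured) / measured
        q1, med, q3 = np.percentile(err, [25, 50, 75])
        iqr = q3 - q1
        in_range = err[(err >= q1 - 1.5 * iqr) & (err <= q3 + 1.5 * iqr)]
        outliers = err[(err < q1 - 1.5 * iqr) | (err > q3 + 1.5 * iqr)]
        return {"median": med, "iqr": iqr,
                "whiskers": (in_range.min(), in_range.max()),
                "outliers": outliers}

    # Placeholder data standing in for the 50 held-out test configurations.
    rng = np.random.default_rng(3)
    measured = rng.uniform(0.8, 1.6, size=50)
    predicted = measured * (1.0 + rng.normal(0.0, 0.06, size=50))
    print(error_summary(measured, predicted))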

To increase the focus on the individual interactions between the microarchitecture parameters and the transactional models, the neural networks are re-constructed by training them while holding the transactional model parameters constant and varying only the microarchitecture parameters. This provides a better understanding of the level of variance in the importance of the microarchitecture parameters across the different benchmarks. Figure 3-3 (shown as a star plot) presents the split order of the core microarchitecture parameters within the regression trees that model the performance of the TM workloads across the design space. A star plot [29] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The length of a spoke is proportional to the magnitude of the variable across all data points. From the star plot, one can obtain information such as: Which microarchitecture variables are dominant in determining the relative performance of a given benchmark/transactional model? Which observations show similar behavior?

The first observation that can be quickly drawn from the figure is that the most significant core architectural parameters involved in increasing the relative performance of the transactional sections vary widely between the different benchmarks, even with the transactional model held constant. This is apparent from the unique shape of each of the star plots. This is important for two reasons. First, it confirms the previous observation that no single architectural configuration is suitable in all cases to increase transactional performance and, moreover, shows that even with a common transactional model this observation holds true. Second, it implies that there may be great room for optimization in future SMT and heterogeneous CMPs that must make resource allocation decisions if multiple transactional memory workloads are run concurrently.

For example, if the Eager/Eager versions of FMM and Bayes are transposed (Figure 3-4), it can be seen that the area of overlap is minimal; scheduling both of these to run on an SMT machine will result in increased transactional performance and higher utilization of the available resources. Similar approaches also exist for heterogeneous multicores. For example, scheduling the transactional workloads that are able to leverage higher levels of issue width on the higher-performance cores can result in increased relative performance.

A second observation that can be made from Figure 3-3 is that, in general, the ordering of significance also varies across the transactional dimension (i.e., Eager/Eager vs. Lazy/Lazy). For instance, it can be seen from Figure 3-3 that the relative performance of the Eager/Eager version of FMM is largely governed by the size of the floating point window (similar to the result from the linear model); however, this parameter is significantly less crucial in the Lazy/Lazy version. Instead, in the Lazy/Lazy version, it can be seen that the L2 cache latency and the floating point registers take on a more critical role. This observation presents even greater opportunities for optimization in the future. For example, even in cases where suitable diversity among workloads may not be present (e.g., running two copies of the same benchmark), it may be possible to use hybrid approaches to transactional memory that can elicit the optimal microarchitectural characteristics in multiple domains.
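
The co-scheduling idea suggested by Figure 3-4 can be expressed as a small resource-overlap calculation. In the sketch below, each workload/TM-model pair is described by a hypothetical vector of normalized parameter importances (conceptually, the spoke lengths of its star plot), and the pair of workloads with the least element-wise overlap is proposed as an SMT co-schedule. The importance values and the overlap measure are invented for illustration and are not taken from the study.

    import itertools
    import numpy as np

    # Hypothetical normalized importances over four core resources:
    # [fp_issue_win, lsq_size, l2_latency, issue_width]
    importance = {
        "fmm_EE":   np.array([0.70, 0.05, 0.10, 0.15]),
        "bayes_EE": np.array([0.05, 0.60, 0.20, 0.15]),
        "fmm_LL":   np.array([0.20, 0.05, 0.45, 0.30]),
    }

    def overlap(a, b):
        """Element-wise overlap of two importance profiles; lower means more complementary."""
        return float(np.minimum(a, b).sum())

    best_pair = min(itertools.combinations(importance, 2),
                    key=lambda pair: overlap(importance[pair[0]], importance[pair[1]]))
    print("Most complementary co-schedule:", best_pair)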

Table 3-1. TM and core microarchitectural parameters and their ranges

TM parameters:
  Trans_model    Conflict detection / version management   Eager/Eager, Eager/Lazy, Lazy/Lazy    3 levels
  Backoff        Backoff time after an abort               Random linear, exponential            2 levels
  Conf_det_g     Conflict detection granularity            16B, 32B, 64B, 128B                   4 levels
  PBL            Primary baseline latency                  10, 50, 100                           3 levels
  PVL            Primary variable latency                  4, 8, 16                              3 levels
  SBL            Secondary baseline latency                5, 10, 15                             3 levels

Processor core parameters:
  Issue_width    Processor issue width                     1, 2, 4                               3 levels
  ROB_size       Reorder buffer size                       18*issuewidth+32, 36*issuewidth+32    2 levels
  LSQ_size       Load/store queue size                     10*issuewidth+16, 20*issuewidth+16    2 levels
  int_regs       Integer registers                         8*issuewidth+32, 16*issuewidth+32     2 levels
  fp_regs        Floating point registers                  6*issuewidth+32, 12*issuewidth+32     2 levels
  int_issue_win  Integer issue window size                 6*issuewidth+32, 12*issuewidth+32     2 levels
  fp_issue_win   Floating point issue window size          4*issuewidth, 8*issuewidth            2 levels
  L2_size        L2 cache size                             1024, 2048, 4096 KB                   3 levels
  L2_lat         L2 cache latency                          8, 12, 16, 20                         4 levels
  il1_size       L1 instruction cache size                 16, 32, 64 KB                         3 levels
  dl1_size       L1 data cache size                        16, 32, 64 KB                         3 levels
  dl1_lat        L1 data cache latency                     1, 2, 4                               3 levels

Table 3-2. TM workloads and their transactional characteristics (8-core CMP). Columns: transactional model; transactions started / committed / aborted; NACK-stalled cycles (M); average read set size; average write set size; read/write ratio; average committed transaction length (instructions).

Stanford Parallel Applications for Shared Memory (SPLASH-2) [10]
  Barnes (16K particles)         EE  70068 / 68979 / 1089      0.74    6.71    6.53    1.07   204.09
                                 LL  69275 / 68970 / 305       4.56    6.71    6.53    1.07   204.12
  Fmm (16K particles)            EE  45298 / 45297 / 1         <0.01   13.42   7.34    1.82   175.44
                                 LL  45321 / 45295 / 26        0.31    13.42   7.34    1.82   175.45
  Cholesky (tk15.O)              EE  15902 / 15885 / 17        0.02    3.14    1.94    1.99   27.27
                                 LL  15946 / 15885 / 61        0.06    3.14    1.94    1.99   27.28
  Raytrace (teapot)              EE  136597 / 76741 / 59856    20.95   6.33    2.46    5.16   59.01
                                 LL  258179 / 76741 / 181438   0.24    6.03    2.46    4.8    55.85
  Water-nsq (512 molecules)      EE  10376 / 10376 / 0         <0.01   10.87   2.97    2.66   59.26
                                 LL  10465 / 10376 / 89        0.614   10.87   2.97    2.66   59.26
  Water-sp (512 molecules)       EE  153 / 153 / 0             <0.01   2.48    1.37    1.68   133.26
                                 LL  220 / 153 / 67            <0.01   2.48    1.36    1.64   92.51

Stanford Transactional Applications for Multi-Processing (STAMP) [6]
  Genome (g256 s16 n16384)       EE  6781 / 6594 / 187         0.46    27.93   5.06    8.16   1199.65
                                 LL  6845 / 6594 / 251         1.79    27.77   5.06    8.16   1199.46
  Kmeans (random1000_12)         EE  8047 / 8046 / 1           <0.01   6.97    2.49    2.26   100.20
                                 LL  715470 / 671841 / 43629   6.74    6.97    2.49    2.27   100.20
  Labyrinth (randomx48y48z3n48)  EE  253 / 112 / 141           231.16  602.51  403.38  5.13   520454.02
                                 LL  448 / 112 / 336           <0.01   606.79  408.10  5.12   517335.40
  Vacation (4096 tasks)          EE  7176 / 4096 / 3080        8.43    66.19   13.21   3.95   1612.90
                                 LL  4549 / 4096 / 453         0.04    66.22   13.19   3.95   1611.02
  Bayes (384 records)            EE  490 / 273 / 217           57.41   750.22  647.40  1.76   419744.55
                                 LL  170 / 140 / 30            <0.01   437.66  371.51  1.78   230090.91

Table 3-3. Most significant model terms and beta values (SPLASH-2). Benchmark columns in order: Barnes, Cholesky, Fmm, Raytrace, Water-nsq, Water-sp; each row lists the terms of a given rank, and benchmarks with fewer significant terms contribute fewer entries per row.
  Rank 1:  Intercept 1.0020 | Intercept 1.0016 | Intercept 1.0017 | Intercept 0.9765 | Intercept 0.9994 | Intercept 0.9997
  Rank 2:  Trans_model 0.0015 | L2 -0.0007 | fp_issue_win 0.0007 | Conf_det 0.0910 | Trans_model 0.0007 | Trans_model 0.0003
  Rank 3:  PVL 0.0012 | Trans_model 0.0005 | Trans_model 0.0723 | int_regs -0.0002 | ROB -0.0002
  Rank 4:  L2 -0.0012 | Back_off 0.0005 | Trans_model*Conf_det -0.0718 | PBL 0.0002
  Rank 5:  Back_off 0.0009 | PVL 0.0005 | Conf_det*Issue 0.0287 | ROB -0.0002
  Rank 6:  Trans_model*PVL -0.0008 | L2 -0.0003 | Trans_model*Issue 0.0193
  Rank 7:  int_regs -0.0003 | Trans_model*Conf_det*Issue -0.0184
  Rank 8:  L2_lat 0.0003 | Issue -0.0150
  Rank 9:  Trans_model*fp_issue_win 0.0002 | Back_off -0.0113

Table 3-4. Most significant model terms and beta values (STAMP). Terms are listed per benchmark in decreasing order of coefficient magnitude.
  Bayes:      Intercept 0.9641; Trans_model*LSQ 0.2811; LSQ*il1 0.2367; ROB*il1 -0.2366; SBL*int_regs 0.2087; fp_issue_win*L2_lat*dl1_lat 0.2007; PVL*dl1 -0.1983; LSQ*L2_lat -0.1873; LSQ 0.1783; ROB -0.1636
  Genome:     Intercept 3.0594; Issue 0.1163; Conf_det 0.0461; PBL 0.0460; Trans_model 0.0421; Trans_model*Conf_det 0.0380; PVL 0.0330; Trans_model*PBL -0.0291; Trans_model*PVL -0.0259; int_regs 0.0217
  Kmeans:     Intercept 0.4626; Trans_model 0.4005; Issue 0.0693; Trans_model*Issue 0.0634; Trans_model*Conf_det 0.0356; Conf_det 0.0353; Conf_det*Issue -0.0300; Trans_model*Conf_det*Issue -0.0299; Issue*L2_lat 0.0183; Trans_model*Issue*L2_lat 0.0174
  Labyrinth:  Intercept 0.5143; Trans_model -0.4696; Trans_model*Back_off 0.0225; Back_off 0.0188; Back_off*PBL -0.0157; Trans_model*Back_off*PBL -0.0153; PBL -0.0120; PVL 0.0120; Trans_model*PBL -0.0104; SBL 0.0100
  Vacation:   Intercept 1.0414; Issue 0.0019; Trans_model -0.0013; Conf_det 0.0012; SBL -0.0008

Figure 3-1. Basic architecture of a neural network (input layer x1...xn, hidden layer with weights w1...wn, output layer f(x))

Figure 3-2. RBF neural network speedup prediction accuracy, shown as error (%) boxplots. A: SPLASH-2 benchmarks (barnes, cholesky, fmm, raytrace, waterNSQ, waterSP). B: STAMP benchmarks (bayes, genome, kmeans, labyrinth, vacation).

Figure 3-3. Starplots of key microarchitectural parameters in determining relative speedup

Figure 3-4. Bayes + Fmm transposition

CHAPTER 4
A SET OF TRANSACTION-ORIENTED WORKLOAD CHARACTERISTICS CAPABLE OF DESCRIBING TRANSACTIONAL WORKLOAD SIMILARITY

In this chapter, a set of transactional characteristics is proposed that can be used to quantify the similarity of transactional workloads, and it is shown that the chosen set of transactional features is independent of the transactional model. Next, the (dis)similarities between 15 benchmarks taken from the SPLASH-2, STAMP, and PARSEC suites are identified across the three most common dimensions of transactional memory. Finally, it is shown that by discarding similar programs and selecting a subset of the programs, it is possible to reproduce the overall behavior of the entire set. Moreover, this behavior can be captured even when the transactional and microarchitecture models are varied.

Transaction-Oriented Workload Characteristics

In general, transactional workloads can be characterized by a small set of features. Table 4-1 shows how these individual traits can be combined into a vector capable of quantifying workload behavior. This characteristic vector is then used as the input to the principal component analysis. The goal in choosing these metrics was to provide attributes that are able to describe the unique characteristics of individual transactions. To this end, it was necessary to define the different aspects of a transaction that play a dominant role in determining the runtime characteristics, contention, and interaction across transactional workloads. Moreover, a set of attributes that is largely architecture independent was desirable, providing a single set of metrics to describe the characteristics of a workload set regardless of the underlying transactional memory system. For example, conflict behavior is the result of the intrinsic characteristics of a workload when run on a specific architecture design. While the same workload will exhibit different conflict behavior when run on two different transactional models (EE/LL), two different workloads will experience similar conflict behavior if they have similar characteristics when run on the same underlying model.

From this point of reference, the conflict behavior is an output used to validate an input. This allows the researcher to use these classifications to identify common traits among a set of workloads and use those traits to compare different underlying models.

For this evaluation, the transaction percentage, transaction size, read-/write-set ratios, read-/write-set conflict densities, read-/write-set sizes, and the write-read ratio of each transaction are recorded. Since many transactional workloads exhibit heterogeneous transactions and different synchronization patterns throughout runtime execution, these characteristics are recorded as histograms, which are able to provide more information than a simple aggregate value.

The transaction percentage is the total number of retired committed transactional instructions divided by the total number of instructions retired. This ratio provides insight into how large a role the transactional portions of a workload play in the overall execution of an entire benchmark. This metric is the only one that is not a histogram; however, it is important because it helps to quantify the effect that the remaining characteristics have on the overall execution of a benchmark. For example, a workload comprised of transactions that are highly contentious but are only in execution for brief intervals may exhibit less actual contention than a workload comprised of less contentious transactions that occur with greater frequency. It is also important to note that only committed, and not aborted, transactions are considered within the transaction percentage. This is because while the amount of work completed or committed is largely determined by the workload and its inputs, aborted transactions are largely a function of the underlying architecture and can vary widely depending on architectural decisions.

Transaction size is defined as the total number of instructions committed by a transaction. This characteristic is comprised of a histogram describing the individual sizes of transactions across the entire execution time of a workload. This metric describes the granularity of the transactions across a workload.

The granularity of a transaction is directly related to the period for which a transaction maintains ownership over its read/write set, as well as to the amount of work lost on an abort.

The read set ratio and write set ratio are defined as the total number of reads divided by the number of unique reads and the total number of writes divided by the number of unique writes, respectively. This metric provides insight into the spatial locality of the memory patterns of each individual transaction and is important in describing the characteristics of a transaction. While a transaction may contain many load/store operations, if these operations reference a relatively small number of unique locations, this can result in substantially different workload execution than a transaction that contains the same number of load/store operations all referencing unique memory addresses.

While the read/write set ratio helps to describe the overall spatial locality of the memory operations, it alone is insufficient to characterize a transaction. To assist in the overall characterization of the contentious memory access patterns, the read conflict density and the write conflict density are also included. The read conflict density is defined as the total number of potentially contentious addresses within a transaction's read set divided by the total read set size of the transaction, and the write conflict density is defined as the total number of potentially contentious addresses within a transaction's write set divided by the total write set size of the transaction. To calculate the addresses that could potentially result in contention within a transaction, the entire workload is first run and the read/write sets are calculated for each transaction. Next, each memory address within a read set is marked as potentially contentious if any other transaction that was not located within the same thread wrote to that address.

For addresses belonging to the write set, each memory address is marked as potentially contentious if any other transaction that was not located within the same thread either read or wrote to that address. Using this method, the worst-case contention rate of the read/write set for all possible thread alignments can be captured without the need to run exhaustive numbers of simulations. Using this characteristic of a transaction, the contentiousness of a specific transaction can be categorized not simply based on the aggregate size of a memory set, but on the actual contentiousness of the memory locations within those sets.

While the read/write set ratios and the read/write conflict density ratios are crucial for describing the underlying characteristics of individual read/write sets, they are unable to characterize the aggregate size of individual sets within a transaction. To meet this demand, the read set size and write set size metrics are also included, which quantify the number of unique memory addresses from which a program reads (read set size) as well as the number of unique memory addresses to which a program writes (write set size). The sizes of the read and write sets are important because they affect the total data footprint of each transaction as well as the time that commits and aborts take.

Finally, the write-read ratio is calculated, which describes the relative frequency of writes to reads within a transaction. Again, this metric builds upon and expands the previous memory-related metrics by relating the number of writes to the number of reads. While multiple transactions are permitted to read an address, only a single write is allowed. Thus, even if two transactions share similar read and write conflict ratios, if one transaction is heavily weighted toward the more contentious writes and the other is more heavily weighted toward less contentious reads, this can result in different workload execution.
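
The conflict-density definitions above translate directly into a small post-processing pass over the per-transaction read/write sets. The sketch below assumes a hypothetical trace format in which each committed transaction is recorded as a (thread id, read set, write set) tuple; it marks addresses as potentially contentious exactly as described: a read address conflicts if a transaction on any other thread wrote it, and a write address conflicts if a transaction on any other thread read or wrote it.

    from collections import defaultdict

    def conflict_densities(transactions):
        """transactions: list of (thread_id, read_set, write_set) tuples."""
        readers, writers = defaultdict(set), defaultdict(set)
        for tid, rset, wset in transactions:
            for addr in rset:
                readers[addr].add(tid)
            for addr in wset:
                writers[addr].add(tid)

        densities = []
        for tid, rset, wset in transactions:
            r_conf = sum(1 for a in rset if writers[a] - {tid})
            w_conf = sum(1 for a in wset if (readers[a] | writers[a]) - {tid})
            densities.append((r_conf / len(rset) if rset else 0.0,
                              w_conf / len(wset) if wset else 0.0))
        return densities

    # Hypothetical three-transaction trace.
    trace = [(0, {0x10, 0x14}, {0x20}),
             (1, {0x20}, {0x14}),
             (2, {0x30}, {0x34})]
    print(conflict_densities(trace))   # [(0.5, 1.0), (1.0, 1.0), (0.0, 0.0)]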

When combined, the different transactional aspects that can be gathered from the characteristics described in Table 4-1 provide an excellent means of quantifying the behavior of transactional workloads. However, due to the extensive nature of the data, a means of processing the data is necessary.

Data Analysis

This section describes the data processing techniques, namely principal component analysis and cluster analysis.

Principal Component Analysis

Principal component analysis (PCA) is a multivariate analysis technique that exposes patterns in a high-dimensional data set. The patterns are revealed because PCA reduces the dimensionality of the data by linearly transforming a set of correlated variables into a smaller set of uncorrelated variables called principal components. These principal components account for most of the information (variance) in the original data set and provide a different presentation of the data, making the interpretation of large data sets easier.

Principal components are linear combinations of the original variables. For a dataset with p correlated variables (X_1, X_2, ..., X_p), a principal component Y_1 is represented as Y_1 = a_11 X_1 + a_12 X_2 + ... + a_1p X_p, where (Y_1, Y_2, ..., Y_p) are the new uncorrelated variables (principal components) and (a_11, a_12, ..., a_1p) are weights that maximize the variation of the linear combination. A property of the transformation is that the principal components are ordered according to their variance, where λ_1 > λ_2 > ... > λ_p. Thus, principal component Y_1 has more variance (information) than Y_2, Y_2 has more variance than Y_3, and Y_p has the least variance. If k principal components are retained, where k << p, then Y_1, Y_2, ..., Y_k contain most of the information in the original variables. By retaining the first k principal components and ignoring the rest, one can achieve a reduction in the dimensionality of the dataset.

The sum of the variances of the principal components equals the sum of the variances of the original variables.

The number of selected principal components controls the amount of information retained. The amount of information retained is proportional to the ratio of the variances of the retained principal components to the variances of the original variables. The Kaiser criterion suggests choosing only the PCs with eigenvalues greater than or equal to one. In general, principal components are retained such that they account for greater than 85% of the variance. These principal components are the exposed patterns and trends that exist in the dataset.

Cluster Analysis

Cluster analysis [30] is a statistical inference tool that allows researchers to group data based on some measure of perceived similarity. There are two branches of cluster analysis: hierarchical and partitional clustering. This study uses hierarchical clustering, a bottom-up approach that begins with a matrix containing the distances between the cases and progressively adds elements to the cluster hierarchy, in effect building a tree based on the similarity distance of the cases. Unlike partitional clustering, where the researcher must pre-define the number of clusters, hierarchical clustering allows researchers to choose the number of clusters themselves based on the linkage distance.

In hierarchical clustering, each variable begins in a cluster by itself. Then the closest pair of clusters is matched and merged, and the linkage distance (discussed later) between the old cluster and the new cluster is measured. This step is repeated until all of the variables are grouped into a single cluster. The resulting figure is a dendrogram (tree) with one axis showing the linkage distance between the variables. The linkage distance can be calculated in several ways: single linkage (SLINK) defines the similarity between two clusters as the similarity of the most similar pair of objects in each cluster, complete linkage (CLINK) defines the similarity as the similarity of the least similar pair of objects in each cluster, and average linkage (UPGMA) defines the similarity as the mean distance between the clusters.
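
The PCA-plus-clustering flow described above maps onto a few library calls. The snippet below is a sketch of that flow, not the STATISTICA procedure used in the study: it normalizes a hypothetical characteristic matrix to zero mean and unit variance, keeps enough principal components to cover 85% of the variance, and builds a single-linkage clustering from Euclidean distances, mirroring the steps used later to compare workloads.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.decomposition import PCA

    # Hypothetical characteristic matrix: 15 workloads x 81 transactional features.
    rng = np.random.default_rng(5)
    features = rng.uniform(size=(15, 81))
    names = [f"workload_{i}" for i in range(15)]

    # Normalize each feature to zero mean and unit variance.
    normalized = (features - features.mean(axis=0)) / features.std(axis=0)

    # Retain the smallest number of principal components covering 85% of the variance.
    scores = PCA(n_components=0.85).fit_transform(normalized)

    # Hierarchical clustering with Euclidean distance and single linkage;
    # scipy.cluster.hierarchy.dendrogram(tree) would draw the corresponding tree.
    tree = linkage(scores, method="single", metric="euclidean")
    labels = fcluster(tree, t=4, criterion="maxclust")    # e.g., cut into 4 clusters
    print(dict(zip(names, labels)))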

Experimental Methodology

Baseline Transactional Model / Microarchitecture Configuration

Table 4-2 describes the baseline 8-core CMP transactional model and microarchitecture configuration that was used for comparison throughout the study. The trans_model parameter describes the conflict detection and version management schemes that were employed. Conflict detection describes the methodology used to detect conflicts and can occur either immediately when a memory request is made (Eager) or as a batch operation when a transaction is committed (Lazy). Version management refers to whether new values are written directly to memory while old values are logged (Eager), or whether new values are buffered and then written in batch at the time of commit (Lazy). Throughout the analysis, all three commonly accepted transactional models are evaluated: Eager/Eager, Eager/Lazy, and Lazy/Lazy. Conflict detection granularity refers to the level of granularity at which conflicts are detected (usually the size of a cache line), and backoff policy is the backoff method that is used after a transaction is aborted.

Since the SuperTrans model strives to be implementation independent, in addition to the actual time taken to write memory values back, values can be assigned to the transactional model characteristics and varied independently from the microarchitecture configuration. The primary/secondary baseline latencies and the primary variable latency quantify the latencies associated with a commit or an abort. In a lazy version management system, the primary latency is associated with a commit (since new values must be written back), while in an eager version management system the primary latency is associated with an abort (since the old values must be written back). The baseline latency is the static overhead required (e.g., arbitrating for the bus, maintenance, cleanup, etc.), and the variable latency is the additional latency required based upon the write-set size.

Finally, the microarchitecture core parameters describe a set of processor core features that previous studies have shown to have a large impact on performance. This configuration was chosen because it is representative of an average machine. In the evaluation section, the transactional and microarchitecture parameters are varied independently to evaluate the results across more conservative as well as more aggressive machines.

Transactional Metrics

To examine the accuracy of the similarity classification, the following transactional metrics were used: speedup, speedup-total cycles, abort-trans ratio, NACK-trans ratio, and percent committed instructions (see Table 4-2). The goal in choosing these metrics was to capture the global effects of the transactional workload as well as the effects that are more specific to the transactional model. To this end, two forms of speedup metrics have been chosen, as well as contention metrics specific to transactional memory.

Speedup is defined as the amount of time in cycles a lock-based version of the program takes to complete divided by the amount of time the transactional version takes to complete. This definition of speedup is used, as opposed to a comparison with a sequential version of the benchmark, because it accentuates the differences caused by the transactional portions of the workload. For example, in many of the SPLASH-2 benchmarks where the synchronization has been heavily tuned, the overall percentage of transactional code is relatively small compared to the entire execution time of the benchmark. Because of this, varying parameters that affect the transactional characteristics would result in very small differences in the overall execution time of the program, and would provide a false sense of similarity. Instead, by comparing the transactional version to an equivalent lock version, the differences introduced by the transactions are magnified.

In addition to the normal definition of speedup, a speedup-total-cycles metric is also used that compares the total number of cycles consumed across all threads on the lock version to the total number of cycles consumed across all threads on the transactional version. This elicits more of the transactional behavior by accentuating the differences in transactional behavior across all threads, as opposed to simply the longest running thread.

The three remaining metrics all focus on transaction-specific behavior. The aborted-cycles trans ratio is defined as the total number of aborted cycles divided by the total number of transactionally executed cycles. This metric provides insight into the percentage of work that was wasted due to aborts. The NACK-cycles trans ratio is defined as the total number of cycles spent stalled due to NACK conditions divided by the total number of transactionally executed cycles. While both metrics give an indication of the number of cycles that were lost due to contention, it is important to analyze both because this provides a more comprehensive analysis across transactional models. For example, a simple analysis of aborts may not be as interesting on an Eager/Eager system that uses NACK stalls to avoid aborts, and a similar argument can be made for NACK stall cycles on a Lazy/Lazy system. Finally, also included is the commit/abort instruction ratio, which is defined as the total number of committed instructions divided by the total number of transactional instructions. This metric is similar to the previous contention-based metrics except that it goes further, describing the percentage of transactional work that was useful using instruction counts instead of cycles, allowing for a less architecturally dependent analysis.

Experimental Design

The transactional workload characterization presented here consists of the following steps:

(1) For each of the three transactional models (EE/EL/LL), the program characteristics in Table 4-1 are gathered using SESC. This produces a vector of 81 components for each input program: a total of 81 variables and 15 cases for each transactional model (an 81x15 matrix when considered independently and an 81x45 matrix when combined).

(2) Instead of using the variables directly, PCA is used to remove variables that are correlated. The p variables are normalized to a unit normal distribution and then PCA is performed using STATISTICA [31], producing p principal components.

(3) These p principal components are reduced using some heuristic, in this case choosing a minimum amount of variance that needs to be retained by the principal components. The chosen components can then be used for further analysis.

(4) STATISTICA is used to hierarchically cluster the principal components chosen in (3), and a similarity study is performed using these clusters and the input configurations from Table 4-2.

(5) From the dendrogram produced in (4), the program set can be divided into smaller subsets. The representative performance of the subsets is determined using a sensitivity study in which the transactional and microarchitecture models are varied independently.

(6) To show how architects can use only the transaction characteristics that they are specifically interested in, without considering the entire range, steps (1) through (5) are repeated using only portions of the characteristics in Table 4-1 rather than its entirety.

Transactional Memory and Microarchitecture Parameters

In addition to the baseline configuration (Table 4-2) used to generate the program clusters, five transactional models, shown in Table 4-3, and four microarchitecture models, shown in Table 4-4, are used to verify the accuracy of benchmark subsetting. The results are normalized against the baseline case and the average is used to compare each subset against the performance of the whole benchmark set across the five transactional dimensions. To cover a wide range of designs, the dimensions in each parameter set were chosen to range from conservative (EE0/EL0/LL0 and A0) to unconventional (EE4/EL4/LL4 and A3).

Results

In this section, it is first shown that the characteristics in Table 4-1 are independent of the underlying transactional model. Then, it is shown that nearly half of the chosen benchmarks can be eliminated; the results are evaluated by varying the transactional memory model while holding the microarchitecture constant, and by varying the microarchitecture while holding the transactional memory model constant. The results for the complete benchmark set are obtained, and it is then shown how clustering can be used to eliminate redundant benchmarks using subsets, as well as the simulation speedup obtained by eliminating the redundant benchmarks. Finally, a method that architects can use to choose which programs to run, based on the transactional feature that needs to be evaluated, is shown.

Independence

So that architects do not need to redefine their benchmark set for each transactional model, it is necessary to show that the characteristics used as inputs are implementation independent. Using a Lazy/Lazy, Eager/Lazy, and Eager/Eager model for each program, it can be shown that the characteristics in Table 4-1 are implementation independent. Based on Figure 4-2, 10 principal components are retained that account for 91% of the variance. Figure 4-1 shows the dendrogram produced from the characteristics in Table 4-1. The resulting clusters are formed using the Euclidean distance between the benchmarks and are generated hierarchically using the single linkage distance. The Y-axis represents the linkage distance, which is used to derive the similarity between the benchmarks. The smaller the linkage distance, the more tightly the programs are clustered, and the larger the linkage distance, the more loosely the programs are clustered.

From Figure 4-1, it can be seen that each benchmark is tightly clustered with itself (with average distances less than 1), no matter what the underlying conflict detection and version management policy is. This shows that regardless of the transactional model, the characteristics in Table 4-1 can provide a good indication of the expected performance.

Benchmark Subsets

This section evaluates the accuracy of benchmark subsetting. By eliminating benchmarks with similar behavior, researchers can reduce the time needed to evaluate a design. The effectiveness of the subsets is evaluated as the transactional model is varied and as the microarchitecture model is varied. From Figure 4-1, the linkage distance can be used to select a subset of benchmarks representative of the entire group, allowing researchers to reduce the number of required simulations. Sets of 4, 6, and 8 benchmarks were chosen based on their linkage distance. For the set of 4 benchmarks (vacation, genome, barnes, and water-N2), shown in Figure 4-1, the cutoff was set at a linkage distance of 8.6. For the set of 6 benchmarks (vacation, genome, raytrace, water-N2, barnes, and bayes), a distance of 6.8 was chosen as the cutoff. Finally, a cutoff distance of 5.9 was chosen for the set of 8 benchmarks; this set is comprised of vacation, genome, raytrace, water-N2, ocean-non, barnes, water-spatial, and bayes. By changing the cutoff value, researchers can effectively choose any subset of benchmarks. Interestingly, the STAMP benchmarks appear to have more diversity than the SPLASH-2 suite, with kmeans showing some similarity to fmm, water-spatial, and cholesky. [32] suggested that using both single linkage and complete linkage can give an indication as to whether the clusters are well defined or are artifacts of the clustering method. To validate the clusters, complete linkage was also used to cluster the programs (not shown); it produces only a single deviation in the 4-case subset, switching water-N2 for bayes, and no deviations in the 6- and 8-case subsets.

Transactional Model Evaluation

Figures 4-3 and 4-4 show the comparison of the complete benchmark set against the NACK cycle ratio and the percentage of committed instructions, respectively, as the transactional model is varied, and Table 4-5 shows the average error for all metrics as well as the coefficient of variation across all models. The average error for the 4-benchmark subset was 7.05%. As expected, the 6-benchmark performance improves, reducing the average error to 5.81%. Adding the 2 additional programs in the 8-benchmark subset offers no significant improvement except in the NACK trans ratio.

Further analysis allows several conclusions to be drawn about the effectiveness of the workload characteristics chosen to describe the similarity of the TM workloads. First, as the transactional memory model is varied, metrics that capture aggregate information about the overall execution of the workload (e.g., speedup and total speedup), as well as metrics that focus specifically on transactional performance (e.g., percent committed), perform very impressively, even with as few as four representative benchmarks. Finer granularity metrics, however, such as those that monitor the precise number of cycles aborted or cycles stalled due to NACKs, have higher error rates. This is because these metrics are not governed solely by the aggregate statistical nature of the workload composition, but vary based on both the statistical nature of the workload and the specific timing characteristics of the workload. For example, while the overall characteristics of two workloads in which thirty percent of the transactions were aborted may be very similar, the actual number of aborted cycles or cycles spent stalled due to NACKs can vary widely between the two, depending upon when the conflicts occurred, which transactions were aborted, how long into the transactions the aborts took place, and so on. Thus, the similarity of the workloads and the number of benchmarks that must be run vary depending upon the granularity of the results the architect is looking for. While four benchmarks may be enough to determine the overall execution time, speedup, or commit percentage, eight benchmarks may be more appropriate for an in-depth analysis of the periods of stall time.
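
Selecting a subset by linkage-distance cutoff, as done for the 4-, 6-, and 8-benchmark sets above, can be automated once the clustering tree exists. The sketch below continues the earlier clustering example and is illustrative only: it cuts a single-linkage tree at a chosen distance, keeps one representative per cluster, and estimates the subsetting error as the difference between the subset mean and the full-suite mean of a metric. The tree, workload names, and speedup dictionary are assumed inputs, not artifacts of the study.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster

    def representatives(tree, names, cutoff):
        """Cut the dendrogram at `cutoff` and keep the first member of each cluster."""
        labels = fcluster(tree, t=cutoff, criterion="distance")
        reps = {}
        for name, label in zip(names, labels):
            reps.setdefault(label, name)
        return sorted(reps.values())

    def subset_error(metric_by_name, subset):
        """Percentage error of the subset mean relative to the full-suite mean."""
        full = np.mean(list(metric_by_name.values()))
        part = np.mean([metric_by_name[n] for n in subset])
        return 100.0 * abs(part - full) / full

    # Assumed inputs: `tree` and `names` from the linkage() call shown earlier, and a
    # hypothetical {workload: speedup} dictionary `speedup_by_name`.
    # subset = representatives(tree, names, cutoff=8.6)
    # print(subset, subset_error(speedup_by_name, subset))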

Microarchitecture Model Evaluation

Next, the microarchitecture model was varied using Table 4-4 as the inputs. Figures 4-5 and 4-6 show the comparison of the complete benchmark set with the 4-, 6-, and 8-benchmark subsets, comparing NACKs and the percentage of committed instructions. Although the trend is the same as the number of programs is increased, the total error is higher than that shown in the transactional evaluation subsection. The 4-benchmark subset had an average error of 11.57%, the 6-benchmark subset had an average error of 10.85%, and the 8-benchmark subset had an average error of 8.55% across all of the dimensions.

Because the analysis parameters focus on transactional behavior, as the underlying microarchitecture varies in ways not directly related to the transactional performance, differences within the benchmarks that occur outside the bounds of transactions result in a higher overall error across the metrics. The error across metrics that are a direct result of transactional performance (i.e., aborted cycles, NACKed cycles, and percent committed) increases, but much less than for those metrics that incorporate both transactional qualities and microarchitecture features (i.e., speedup and speedup total cycles).

Time Savings

Table 4-7 shows the speedup of the subsets compared to running the entire set of benchmarks. As can be seen, there are considerable time savings between running the full set of benchmarks and running the 4-, 6-, and 8-benchmark subsets. The speedup decreases as the number of included benchmarks is increased, allowing architects to make a tradeoff between fidelity and simulation time.


Variable Subsets

As was shown in the benchmark subsetting subsection, the complete set of transactional characteristics chosen for clustering allows an architect to substantially reduce the number of simulated benchmarks and obtain coarse-grained results with minimal error. Fine-grained results can be obtained with minimal error as well. However, in many cases the number of simulations must be increased. The difference lies in the fact that the complete set of transactional characteristics is targeted at describing the similarity of all aspects of the workload. For some evaluations, architects may not be interested in the complete characteristics of a program. For example, if the target of an evaluation is the conflict manager, the designer may only want to choose those programs with potentially large amounts of contention. In this section, it is shown that a subset of the characteristics in Table 4-1 can be used to select a group of benchmarks that stress a specific architectural feature.

Figure 4-7 shows the dendrogram produced if only the transaction size characteristic is used as input into the PCA algorithm. Here, the linkage distances are relatively low (implying strong clustering) and the architect can obtain good performance estimates by using a small number of simulations. If the suggested clustering is compared to the workload analysis table (Table 2-1), it can be seen that the clusters largely agree with the simulated results. Bayes and labyrinth are similar to one another, but largely different from vacation, which is different from genome, water-N2, and the rest of the SPLASH-2 suite. Of particular interest is the relative proximity of water-N2 and genome. Had one simply elected to use the aggregated average values presented in the workload characterization table to determine the clustering, it is unlikely he/she would have seen much similarity in the transactional size between these two benchmarks. However, over the lifetime of the program, more than half of the transaction sizes are the same for water-N2 and genome, and the average size of a transaction in genome is artificially increased by several large outliers.

Figure 4-11 shows a plot of the first two principal components (which comprise more than 70% of the total variance). Here, the first principal component is positively dominated by transaction sizes between 5 and 25 instructions and negatively dominated by transactions larger than 390k instructions. The second component is positively dominated by transactions between 5 and 25 instructions and negatively dominated by transactions between 625 and 3125 instructions. From this figure, it can be seen that there are two very strong clusters, one weak cluster, and two isolated programs. Although bayes has the most diversity in transaction size, both it and labyrinth are dominated by large transactions. The other clusters are comprised of programs that are dominated by one or two specific transaction sizes. As with the cumulative result, the STAMP benchmarks show a wider range of transaction sizes than SPLASH-2 or the two PARSEC benchmarks.

While transaction size is a very simple example, and it may be possible to detect such patterns directly from the input data in such cases, as the complexity of the characteristics of interest increases this becomes infeasible. Returning to the original example, for an architect who is designing a conflict manager and is specifically looking for benchmarks of interest as they relate to contention, the transaction size is just one small piece of the puzzle. Using the transactional characteristics, actual contention is a function of the transaction size, the size of the read- and write-sets, and the conflict density of those read- and write-sets. Figures 4-8 and 4-9 show the dendrograms for read/write conflict density and read/write set size, respectively. Figure 4-10 shows the clustering for the combination of transaction size, read/write set size, and read/write conflict density, which will be referred to as the contention domain. If these dendrograms are compared, it can be seen that there are some similarities between the clusters for several of the characteristics, yet there are also distinct differences.

The PC plots show these distinctions in more detail. Figures 4-11, 4-12, and 4-13 show the plots of the first two principal components from the transaction size, conflict results, and read/write set size, respectively. While barnes and raytrace are grouped with the rest of SPLASH-2 if one only considers the transaction size (Figure 4-11) or read/write set size (Figure 4-13), the results from the conflict density domain (Figure 4-12) suggest that they have distinct differences from the other benchmarks in SPLASH-2. This result carries over into the final contention clustering results (Figure 4-14), suggesting that an architect interested in a conflict management system should consider running one or both of these benchmarks.

Returning to the workload characterization table, one finds that both barnes and raytrace stand out among the SPLASH-2 benchmarks in their contention, reflected by their abort counts and NACK stall periods, which are among the highest in the entire suite. This result could easily have been overlooked by a simple comparison of the read/write sets or transaction sizes, yet is easily discovered using variable subsetting. Finally, one also sees that as the level of complexity of the domain of interest increases, so does the linkage distance. Further examination of the principal components from the amalgamated grouping provides more insight.

Figure 4-14 shows the plot of the first two principal components (accounting for 68% of the total variance) from the contention domain and Figure 4-15 shows the factor loadings. From Figure 4-15, it can be seen that the first component is positively dominated by small transactions and small write sets and negatively dominated by small read set conflict ratios, large read sets, and large write sets. The second component is positively dominated by transactions larger than 390k instructions and large read and write sets and negatively dominated by read conflict ratios of up to 50%. In this figure, there is one strong cluster, one weak cluster, and the rest of the benchmarks are clustered individually. Going back to barnes and raytrace, this suggests that the read conflict distribution of these two benchmarks is what sets them apart.
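The variable-subsetting procedure itself is mechanical: restrict the feature matrix to the columns that belong to the characteristic of interest, run PCA on the restricted data, and feed the retained components into hierarchical clustering. A minimal sketch using SciPy and scikit-learn follows; the random feature matrix, the assumed column range, and the choice of Ward linkage are illustrative assumptions rather than the exact settings used in this study.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from scipy.cluster.hierarchy import linkage, dendrogram

    # One row per benchmark, one column per workload characteristic
    # (e.g. the columns of Table 4-1); filled with random placeholders here.
    rng = np.random.default_rng(0)
    benchmarks = ['barnes', 'raytrace', 'bayes', 'labyrinth', 'genome', 'vacation']
    features = rng.random((len(benchmarks), 81))

    # Variable subsetting: keep only the transaction-size histogram
    # (assumed to be columns 1-10 in this sketch) instead of the full set.
    tx_size = StandardScaler().fit_transform(features[:, 1:11])

    # PCA, retaining enough components to cover roughly 70% of the variance.
    pca = PCA(n_components=0.70, svd_solver='full')
    reduced = pca.fit_transform(tx_size)

    # Hierarchical clustering on the reduced data mirrors the dendrogram of Figure 4-7.
    Z = linkage(reduced, method='ward')
    dendrogram(Z, labels=benchmarks, no_plot=True)
    print(pca.explained_variance_ratio_)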


Related Work

Quantifying benchmark similarity is not a new concept and has roots dating back to 1996, when [33] developed a metric comprised of the dynamic execution characteristics of a program to analyze individual benchmarks and benchmark suites. However, the authors do not eliminate highly correlated characteristics when classifying the programs, instead relying on all of the measured characteristics. Building from this work, [34] and [35] used program characteristics to measure the similarity of the SPECint95 and TPC-D benchmarks as the input data sets change. They show how program-input pairs that are clustered tightly exhibit the same behavior as the microarchitecture changes. [36] analyzed data from 340 machines to show that the SPEC2000 benchmark suite only stresses four bottlenecks. Using PCA, the authors reduce their input set to four principal components and show that machines from the same generation are strongly clustered and the number of benchmarks can be reduced without affecting the performance analysis. More recently, [37] performed a similar analysis on the SPEC2006 suite, showing that 6 of the 12 CINT2006 programs and 8 of the 17 CFP2006 programs are able to represent the majority of each benchmark set. [38] compared the SPLASH-2 and PARSEC benchmark suites. Here the authors use instruction mix, memory operations, and shared memory accesses to compare the programs in each suite. This research differs in that it focuses on transactional behavior and delves deeper into the sharing patterns exhibited by these applications.


Table 4-1. Transaction-oriented workload characteristics used for measuring similarity between TM workloads

No.    Program Characteristic      Synopsis
1      Transaction Percentage      Fraction of instructions executed by committed transactions.
2-11   Transaction Size            Total number of instructions executed by committed transactions. This metric is computed as a cumulative distribution across ten categories: [0,5], (5,25], (25,125], (125,625], (625,3125], (3125,15625], (15625,78125], (78125,390625], (390625,1953125], (1953125, ∞).
12-21  Read Set Size Ratio         Total number of reads within a transaction divided by the number of unique addresses read. The read set size ratio across all transactions is summarized as a histogram with 10 buckets: [0,0.1), [0.1,0.2), ..., [0.9,1.0].
22-31  Write Set Size Ratio        Total number of writes within a transaction divided by the number of unique addresses written. The write set size ratio across all transactions is summarized as a histogram with 10 buckets: [0,0.1), [0.1,0.2), ..., [0.9,1.0].
32-41  Read Set Conflict Density   The total number of potential conflict addresses read by a transaction divided by that transaction's total read set. The read set conflict density across all transactions is summarized as a histogram with 10 buckets: [0,0.1), [0.1,0.2), ..., [0.9,1.0].
42-51  Write Set Conflict Density  The total number of potential conflict addresses written by a transaction divided by that transaction's total write set. The write set conflict density across all transactions is summarized as a histogram with 10 buckets: [0,0.1), [0.1,0.2), ..., [0.9,1.0].
52-61  Read Set Size               Total number of unique memory addresses read by committed transactions. This metric is computed as a cumulative distribution across ten categories: [0,2), [2,4), [4,8), [8,16), [16,32), [32,64), [64,128), [128,256), [256,512), [512,1024), [1024, ∞).
62-71  Write Set Size              Total number of unique memory addresses written by committed transactions. This metric is computed as a cumulative distribution across ten categories: [0,2), [2,4), [4,8), [8,16), [16,32), [32,64), [64,128), [128,256), [256,512), [512,1024), [1024, ∞).
72-81  Write Read Ratio            The total number of writes within a transaction divided by the number of reads within the transaction. The write/read ratio across all transactions is summarized as a histogram with 10 buckets: [0,0.1), [0.1,0.2), ..., [0.9,1.0].
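As an illustration of how one of these characteristics is derived, the sketch below bins the per-transaction read set conflict density of a small, hypothetical transaction trace into the ten histogram buckets described above. The trace format and field names are invented for the example; only the bucketing scheme follows the table.

    import numpy as np

    def read_conflict_density_histogram(transactions):
        """Per-transaction read-set conflict density, summarized as ten buckets
        [0,0.1), [0.1,0.2), ..., [0.9,1.0] and normalized over all transactions."""
        densities = []
        for tx in transactions:
            read_set = len(tx['unique_reads'])          # unique addresses read
            conflicts = len(tx['conflicting_reads'])    # potentially shared addresses
            if read_set:
                densities.append(conflicts / read_set)
        counts, _ = np.histogram(densities, bins=10, range=(0.0, 1.0))
        return counts / max(len(densities), 1)

    # Two hypothetical transactions: 1 of 4 reads shared, then 3 of 4 reads shared.
    trace = [{'unique_reads': {0, 1, 2, 3}, 'conflicting_reads': {0}},
             {'unique_reads': {4, 5, 6, 7}, 'conflicting_reads': {5, 6, 7}}]
    print(read_conflict_density_histogram(trace))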


Table 4-2. Baseline transactional and microarchitecture model

TM Model
  Trans_model     Conflict Detection / Version Management   Eager/Eager, Eager/Lazy, Lazy/Lazy
  Back_off        Backoff Policy                            Random Linear
  Conf_det_g      Conflict Detection Granularity            32 Bytes
  PBL             Primary Baseline Latency                  50
  PVL             Primary Variable Latency                  8
  SBL             Secondary Baseline Latency                9

Processor Core
  issue_width     Processor Issue Width                     4
  ROB_size        Reorder Buffer Size                       18*issue_width + 32
  LSQ_size        Load/store queue size                     10*issue_width + 32
  int_regs        Integer Registers                         8*issue_width + 32
  fp_regs         Floating Point Registers                  6*issue_width + 32
  int_issue_win   Integer Issue Win Size                    6*issue_width + 32
  fp_issue_win    Floating Point Issue Win Size             4*issue_width
  L2_size         L2 cache size                             4096
  L2_lat          L2 cache latency                          12
  il1_size        L1 instruction cache size                 32
  il1_lat         L1 instruction cache latency              1
  dl1_size        L1 data cache size                        32
  dl1_lat         L1 data cache latency                     2

Table 4-3. Transactional memory parameters

Parameter    EE0/EL0/LL0    EE1/EL1/LL1    EE2/EL2/LL2    EE3/EL3/LL3    EE4/EL4/LL4
Back_off     Random Linear  Random Linear  Random Linear  Random Linear  Random Linear
Conf_det_g   32             32             32             32             32
PBL          10             30             70             90             110
PVL          4              6              10             12             14
SVL          5              7              11             13             15


Table 4-4. Microarchitecture parameters

Parameter      A0                  A1                  A2                  A3
Issue_width    2                   4                   4                   8
ROB_Size       9*issue_width+32    14*issue_width+32   36*issue_width+32   36*issue_width+32
LSQ_Size       5*issue_width+32    7*issue_width+32    20*issue_width+32   20*issue_width+32
Int_regs       4*issue_width+32    6*issue_width+32    16*issue_width+32   16*issue_width+32
Fp_regs        3*issue_width+32    4*issue_width+32    12*issue_width+32   12*issue_width+32
Int_issue_win  3*issue_width+32    4*issue_width+32    12*issue_width+32   12*issue_width+32
Fp_issue_win   2*issue             3*issue             6*issue             6*issue
L2_size        1024                2048                8192                8192
L2_lat         8                   10                  20                  20
Il1_size       8                   16                  64                  128
Dl1_size       8                   16                  64                  128
Dl1_lat        1                   2                   3                   3

Table 4-5. Average error and Cv (σ/μ) as the transaction model is varied

Metric                        Cv      S4 Average Error   S6 Average Error   S8 Average Error
Speedup                       0.68    4.34               3.95               3.50
Speedup Total Cycles          1.14    5.66               4.95               4.04
Aborted Cycles:Trans Cycles   1.72    7.40               6.36               6.60
NACKed Cycles:Trans Cycles    1.39    13.70              9.96               8.05
Percent Committed             0.32    4.16               3.85               3.57
Average                               7.05               5.81               5.15

Table 4-6. Average error and Cv (σ/μ) as the microarchitecture model is varied

Metric                        Cv      S4 Average Error   S6 Average Error   S8 Average Error
Speedup                       0.72    11.49              11.56              7.95
Speedup Total Cycles          1.32    11.91              11.98              8.37
Aborted Cycles:Trans Cycles   1.95    18.92              16.05              15.39
NACKed Cycles:Trans Cycles    13.2    10.64              9.23               6.49
Percent Committed             0.37    4.86               5.43               4.57
Average                               11.57              10.85              8.55

Table 4-7. Speedup of subsets relative to the full suite

Subset   Speedup
S4       5.73
S6       2.58
S8       2.38


Figure 4-1. Dendrogram showing clustering based on the characteristics defined in Table 4-1

Figure 4-2. Cumulative distribution of the variance by PC

Figure 4-3. Performance comparison of NACKed Cycles:Trans Cycles as the transactional model is varied

Figure 4-4. Percentage of committed instructions as the transactional model is varied

Figure 4-5. Performance comparison of NACKed Cycles:Trans Cycles as the microarchitecture model is varied

Figure 4-6. Percentage of committed instructions as the microarchitecture model is varied

Figure 4-7. Dendrogram showing clustering based on transaction size

Figure 4-8. Dendrogram showing clustering based on transaction conflicts

Figure 4-9. Dendrogram showing clustering based on read- and write-set sizes

Figure 4-10. Dendrogram showing clustering based on the combination of transaction size, R/W-set sizes, and conflict ratio

Figure 4-11. PC 1 versus PC 2 plot of transaction sizes

Figure 4-12. PC 1 versus PC 2 plot of conflict ratios

Figure 4-13. PC 1 versus PC 2 plot of R-/W-set sizes

Figure 4-14. PC 1 versus PC 2 plot of the contention domain

Figure 4-15. Factor loadings of the contention domain


CHAPTER 5
A PARAMETERIZED METHODOLOGY FOR GENERATING TRANSACTIONAL MEMORY WORKLOADS

This chapter describes the development of TransPlant, a framework that is capable of generating synthesized transactional benchmarks based on an array of different inputs. It is shown that TransPlant can mimic the behavior of current transactional memory workloads. It is also shown that TransPlant can generate benchmarks with features that lie outside the boundary occupied by these traditional benchmarks.

Related Work

There are many benchmarks available for evaluating parallel computing systems, both traditional and transactional. Prior studies have attempted to quantify the redundancy in these and other frequently used application suites. Other authors have even proposed methods to reproduce the behavior of these programs using statistical models and workload synthesis. This section addresses how this previous research contributes to and reflects on this work.

Parallel Benchmarks

One roadblock that the TM/multi-core research and design community faces today is the lack of representative transactional memory benchmarks. As a result, a common practice in evaluating today's TM designs is to convert existing lock-based multi-threaded benchmarks into transactional versions. There are several multi-threaded benchmark suites to draw from: NPB [39], BioParallel [40], ALPBench [41], MineBench [42], SPEComp [43], SPLASH-2 [10], and PARSEC [38]. Most of these suites are domain specific (e.g., bioinformatics, multimedia, and data mining), which makes running all of the programs from one of these suites problematic. Of the above suites, only SPLASH-2 and PARSEC are not limited to a single application domain. Even so, converting many of these applications is not an attractive option because of complex libraries or their threading model.


Worse, even a successful conversion does not mean that these programs are appropriate for use in a transactional memory evaluation. While these conventional multi-threaded workloads may reflect the thread-level concurrency of transactional workloads to some extent, in many cases they have been heavily optimized to minimize the overhead associated with communication and synchronization. The fine-grain locking that these traditional programs exhibit does not represent the wide variety of expected behavior from transactional memory programs, since any conversion leads to programs with infrequent transactions relative to the entire program. Much of the up-front benefit of transactional memory comes from its ease of use; programmers will be able to write parallel code bypassing much of the complex logic involved in providing correctness and minimizing time spent in synchronization regions. While programmers familiar with the pitfalls associated with parallel programming will be able to extract nearly the same performance out of transactions, those new to the field or those more deadline-over-performance oriented will be more interested in knowing that their code is correct and safe regardless of the size of the parallel regions and possible interactions.

Transactional Memory Benchmarks

Researchers have already begun thinking about how to test transactional memory systems and have developed microbenchmarks and applications to evaluate their behavior. The microbenchmarks used for these evaluations typically contain only a few small transactions, making them too simple to stress increasingly large and complex multi-core designs. While these benchmarks are easily portable, they can be tedious to create and may not have any complex control flow, either inter- or intra-thread. To address the shortcomings of these microbenchmarks, a few real applications have been ported for use in transactional memory, but these are stand-alone applications, many of which are not publicly available, and their domain coverage is limited. Perfumo [17] and Minh [11] both offer transactional memory suites that attempt to expand this coverage. The problem with Perfumo's applications is that they are implemented in Haskell, making them extraordinarily difficult to port. On the other hand, Minh's contribution, STAMP, contains eight programs covering a wide range of applications and is written in C. But do these applications truly offer an expanded view of the transactional performance domain?

Benchmark Redundancy

Previous authors have shown that many programs within a benchmark suite exhibit tremendous amounts of redundancy [33, 35]. This is true of SPLASH-2 and STAMP, and even the new PARSEC suite contains programs that not only share characteristics of the SPLASH-2 programs but also show some similarities with one another [43]. Computer architects need programs with widely varying behavior in order to evaluate design changes, and some of these suites fall short. Shown in Figure 5-1 is an evaluation of STAMP and SPLASH-2 across a range of transactional features (the feature set is shown in Table 5-2). Figure 5-1 is a plot of the first three principal components, which account for 64.6% of the total variance.

Only 8 of the 18 benchmarks contain any real differences in their behavior in this domain. The rest of the benchmarks form a strong cluster, which indicates that many of the examined characteristics are similar if not the same. The hierarchical clustering (Figure 5-2) based on these three principal components shows the results more clearly. Beginning at the bottom with labyrinth and working up the dendrogram, one can see that the benchmarks beyond (and, in a relaxed interpretation, including) fmm and genome form relatively tight clusters. At a linkage distance of 4, 50% of the benchmarks have been clustered, showing that any evaluation of a transactional memory system using these benchmarks may not stress all of the elements in its design and that new programs may be needed.
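The "linkage distance of 4" observation corresponds to cutting the dendrogram at a fixed height and counting how many programs have already merged into clusters at that point. A minimal sketch of such a cut is shown below; the feature values are random placeholders and Ward linkage is an assumed choice, so the grouping it prints is illustrative only.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    benchmarks = ['labyrinth', 'bayes', 'vacation', 'genome', 'fmm', 'barnes',
                  'raytrace', 'kmeans', 'cholesky', 'water-n2']
    pcs = rng.normal(scale=2.0, size=(len(benchmarks), 3))   # first three PCs

    Z = linkage(pcs, method='ward')
    labels = fcluster(Z, t=4.0, criterion='distance')   # cut the dendrogram at 4

    # Benchmarks sharing a label below the cut are largely redundant; one
    # representative per cluster is enough for a coarse-grained evaluation.
    clusters = {}
    for name, lab in zip(benchmarks, labels):
        clusters.setdefault(lab, []).append(name)
    print(clusters)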


Benchmark Synthesis

Statistical simulation [44] and workload synthesis [19] capture the underlying statistical behavior of a program and use this information to generate a trace or a new representative program that maintains the behavior of the original program. This new representation has a reduced simulation time compared to the original application, making it ideal for coarse-grain tuning of early designs. Although most of this research has been on sequential programs, researchers have recently delved into multi-threaded lock-based programs [20, 21]. Although this previous work does produce small, fast-running programs, it differs in that we do not use any program as a starting point. Our synthesis model works from an abstract starting point, and the programs produced by TransPlant are built from the ground up from user-supplied inputs. This enables researchers to specify the program characteristics precisely in order to test whatever system aspect they wish, similar to the work done by Joshi et al. [45], who showed how an abstract set of program characteristics could be used with machine learning to generate single-threaded stress benchmarks in the power domain.

TransPlant

In the following section the TransPlant model for generating transactional workloads is described, along with how it both differs from and expands upon currently available transactional benchmarks. Next, its capabilities are detailed, and this section ends with a discussion of the implementation of the TransPlant framework.

Design

As long as there has been a need to quantify the behavior of a design using test workloads, there has been debate over the type of workload to use. Running real world applications has the advantage of providing designers with realistic inputs that may actually occur after production. However, running real applications also has substantial disadvantages. It can often be difficult to find real applications that cover a diverse design space, anticipate future workload patterns, and are easily executable on the system of choice. Moreover, while a diverse set of real applications can provide significant insight into overall, common-case system performance, they can be inefficient at exploring the results of a specific design decision. Microbenchmarks, on the other hand, are much better suited to quickly assess the result of a specific execution pattern, but they lack much of the context provided by real world applications. The goal of the TransPlant framework is to bridge the advantages of these two worlds within a transactional memory context. Using the TransPlant framework, a TM designer can efficiently construct a workload that is tuned precisely to the characteristics that he/she wishes to stress, starting either from a real application or by using the tool to construct a design point that differs from any available workload.

Capabilities

The input to the TransPlant framework is a file describing the transactional characteristics the designer wishes to test, and the output of the framework is a source file that can be compiled to produce a binary that meets those specifications. Table 5-1 describes the first-order design parameters that the user can specify. Threads specifies the total number of active threads, while the Homogeneity flag indicates whether all threads will be homogeneous or whether the user will enumerate different characteristics for each thread. Transactional granularity specifies the size of the transaction with respect to instruction count, and stride specifies the sequential distance between transactions. The Read Set and Write Set parameters describe the number of unique cache line accesses for reads and writes, respectively, and the Shared Memory parameter describes the percentage of those locations that occur within shared memory regions. A key determinant of the overall transactional characteristics of a program is how the memory references are physically located within the transaction. The Conflict Distribution parameter indicates whether the shared memory references are evenly distributed throughout the transaction or whether a high-conflict model is constructed, where a read/write pair is located at the beginning and end of the transaction to maximize contention. Finally, the instruction mix of integer, floating point, and memory operations can be controlled independently for the sequential and transactional portions.

A key feature of this input set is that while it covers most of the architecturally independent transactional characteristics, the level of granularity at which a user must specify the input set can be adjusted based upon what the designer is interested in. For example, most of the above inputs can be enumerated as a simple average, a histogram, a time-ordered list, or any combination thereof. Thus, if a designer is interested in an exact stride, alignment, or instruction count across threads and less interested in the read/write set sizes, the granularity and stride values can be defined in a time-sequenced list while the read/write set values are provided using a normalized histogram. This detailed level of control can prove invaluable in stressing a specific design implementation or in producing precise, deterministic workloads to be used as a tool for debugging.

Finally, the framework allows for an assimilate mode where a complete description of the program is provided as an input. When this mode is combined with a profiling mechanism, TransPlant can be used to reproduce a synthetic copy of an existing workload. This synthetic copy can be run in place of the original application (for example, in circumstances where the original code is proprietary), or can be used as a baseline and modified to test how possible changes will affect future designs.
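To make the parameters of Table 5-1 concrete, the sketch below expresses one possible two-thread specification as a Python dictionary, mixing a time-ordered list for transaction granularity with normalized histograms for the read and write sets. The field names and the dictionary format are hypothetical stand-ins for TransPlant's actual input syntax, which is not reproduced here.

    # Hypothetical TransPlant-style input; the keys mirror Table 5-1, but the
    # concrete syntax is an assumption made for illustration.
    spec = {
        "threads": 2,
        "homogeneity": True,                 # all threads share these characteristics
        "tx_granularity": [625, 625, 3125],  # time-ordered list of transaction sizes
        "tx_stride": {"histogram": {100: 0.5, 500: 0.5}},
        "read_set":  {"histogram": {8: 0.25, 16: 0.50, 32: 0.25}},
        "write_set": {"histogram": {4: 0.50, 8: 0.50}},
        "shared_memory": 0.30,               # fraction of accesses to shared regions
        "conflict_distribution": "high",     # or "random"
        "tx_instruction_mix": {"memory": 0.3, "integer": 0.5, "floating_point": 0.2},
        "sq_instruction_mix": {"memory": 0.2, "integer": 0.6, "floating_point": 0.2},
    }

    def validate(spec):
        """First-pass check in the spirit of TransPlant's validation stage (sketch)."""
        for mix in ("tx_instruction_mix", "sq_instruction_mix"):
            assert abs(sum(spec[mix].values()) - 1.0) < 1e-9, f"{mix} must sum to 1"
        assert spec["threads"] >= 1
        return True

    validate(spec)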


Implementation

Because SESC is used as the simulation environment, TransPlant was developed for the MIPS ISA, but the backend can be decoupled for use with any ISA. The framework is comprised of four steps: input validation and construction of high-level program characteristics (skeleton), opcode and operand generation (spine), operand population (vertebrae), and code generation (program body). A high-level view of the framework is shown in Figure 5-3.

Validation and Skeleton Creation

The first stage of benchmark generation within the TransPlant framework is to validate the input provided by the user. Since TransPlant accepts a wide variety of input formats (e.g., averages, lists, histograms, or any combination thereof), it is important that the input be validated to ensure that it describes a realizable binary. For example, since read set, write set, and transaction size can all be varied independently, TransPlant must validate each read set/write set combination to ensure there is a suitable transaction in which to fit the memory operations. The first pass in the validation stage confirms that the user has specified all of the required options.

Once all required options have been specified, the validation stage calculates the number of Cells required to represent the final binary described by the input. A Cell is the basic building block within the TransPlant framework and can be either transactional or sequential. If any of the inputs provided by the user is in a list format, then the total number of cells is equal to the number of entries within that list. If the user provides all histogram inputs, TransPlant will calculate the minimum number of cells required to meet the histogram specifications perfectly (for example, if all normalized histogram inputs are multiples of 0.05, then 20 cells can be used to meet the specifications). Once the minimum number of cells has been instantiated, each cell is populated with values described by a list input or derived from a histogram input. In the case of histogram inputs, the cell lists are ordered based upon size and then the read set and write set values are populated from largest to smallest to ensure proper fitting. Other values, such as instruction mixes, shared memory percentages, and conflict distributions, are randomly assigned based upon their histogram frequency.
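The minimum cell count for all-histogram inputs amounts to finding the smallest integer N such that every histogram probability times N is a whole number, which is what the multiples-of-0.05 example above reduces to. A small sketch of that calculation follows; it uses exact fractions and is an illustrative reconstruction, not TransPlant's actual code.

    from fractions import Fraction
    from math import lcm

    def minimum_cells(*histograms):
        """Smallest cell count that realizes every normalized histogram exactly.

        Each histogram maps a bucket value to its probability; probabilities are
        converted to fractions, so the answer is the LCM of their denominators.
        """
        denominators = []
        for hist in histograms:
            for p in hist.values():
                denominators.append(Fraction(p).limit_denominator(10000).denominator)
        return lcm(*denominators)

    read_set  = {8: 0.25, 16: 0.50, 32: 0.25}     # all multiples of 0.05
    write_set = {4: 0.55, 8: 0.45}
    print(minimum_cells(read_set, write_set))      # -> 20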


Spine

Once the program contents have been validated, the cell list is sent to the next portion of the framework to generate a series of basic blocks derived from the individual cell characteristics. For each cell, the spinal column generator performs a second round of validation to ensure that it can meet the memory and size requirements of the cell. Because cells can be arbitrarily large, we attempt to form a loop within the cell. The loop must be able to preserve the instruction mix, shared memory distribution, and conflict distribution of the cell. The base value of the loop is determined by the number of unique memory references in the cell and is then adjusted to meet the remaining characteristics. A minimization algorithm is used to identify the optimal number of instructions to be included in the loop such that the remainder is as small as possible to control program size. This allows much more flexibility in terms of transaction stride and granularity without introducing much variation in the program. Once the cells have passed the second round of validation and any loop counters have been assigned, the spine generates opcodes for each instruction within the cell based on the instruction mix distribution. The last step in this phase attempts to privatize, localize, and add conflicts to the memory operations. The privatization mechanism assigns the memory type based on the number of shared reads and writes in each basic block by tagging the opcode as being private or global. Localization parses the memory references, determining which ones should be unique (only occur once per cell) and which ones are permitted to be accessed multiple times within the same block. Memory conflicts are assigned based on the conflict distribution model, which determines where each load and store within each block is placed.

Vertebrae

For each non-memory instruction, operands are assigned based on a uniform distribution of the registers, using registers t0-t5 and s2-s7 for non-floating-point operations and f2-f12 for floating-point operations. This ensures that the program contains instruction dependencies but does not tie the population to any specific input. For memory operations, we assign a stride value based on the instruction's privatization, localization, and conflict parameters. We maintain maps matching private and conflicted addresses for reuse in order to maintain the program's shared memory and conflict distribution models across threads. In addition, each instruction accesses memory as a stream, beginning with the base offset and walking through the array using the stride value assigned to it, restarting from the beginning when it reaches the boundary. The length of the array is predetermined based on the size of the private and global memory pools and the number of unique references in the program.

Code Generation

The completed program is emitted in C as a series of header files, each containing a function for one of the program's threads. The main thread is written with a header containing initialization for the global memory as well as its own internal memory and variables. Both global and private memory are allocated using calls to malloc(). The base address of the memory pool is stored in a register, which, along with offsets, is used to model the memory streams. SESC uses the MIPS ISA, and instructions within each thread are emitted as MIPS assembly using the asm keyword, effectively combining the high-level C used for memory allocation with low-level assembly. To prevent the compiler from optimizing and reordering the code, the volatile keyword is used. The completed source code is then enclosed in a loop, which is used to control the dynamic instruction count for each thread. This is primarily used to adjust the number of dynamic instructions required for the program to reach a steady state.
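The flavor of the emitted output can be illustrated with a small generator that writes C statements carrying inline MIPS assembly guarded by the volatile keyword, in the spirit described above. The snippet below is a simplified, hypothetical emitter written for illustration; the register pool, opcode selection, and output layout are assumptions and do not reproduce TransPlant's actual backend.

    import random

    INT_REGS = ["$t0", "$t1", "$t2", "$t3", "$t4", "$t5",
                "$s2", "$s3", "$s4", "$s5", "$s6", "$s7"]

    def emit_cell(instr_mix, size, seed=0):
        """Emit one cell as C statements containing inline MIPS assembly."""
        rng = random.Random(seed)
        lines = []
        for _ in range(size):
            kind = rng.choices(["integer", "memory"],
                               weights=[instr_mix["integer"], instr_mix["memory"]])[0]
            if kind == "integer":
                dst, a, b = rng.sample(INT_REGS, 3)    # uniform register selection
                lines.append(f'    asm volatile("addu {dst}, {a}, {b}");')
            else:
                reg = rng.choice(INT_REGS)
                # Stream access: walk the memory pool with a fixed stride and wrap
                # at the boundary, as the vertebrae stage describes.
                lines.append(f'    asm volatile("lw {reg}, 0(%0)" :: "r"(pool + off));')
                lines.append("    off = (off + STRIDE) % POOL_LEN;")
        return "\n".join(lines)

    body = emit_cell({"integer": 0.7, "memory": 0.3}, size=6)
    print("#define STRIDE 4\n#define POOL_LEN 1024\n"
          "void thread0(int *pool) {\n    int off = 0;\n" + body + "\n}")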


Example

Figure 5-4 shows an example of the tool at each upper-level stage using two input cells from the assimilation profiler. The cell characteristics are shown at the top of the figure. In this example, because the input is a predefined list of cells, the skeleton simply joins the cells and passes them to the population section. The second box shows the output of the spine (the detailed characteristics of each cell). Specifically, it shows how the spine translates the values in each cell to determine the memory distribution and loop calculations. The first cell has two load operations, one of which is unique, and one store operation, which is unique. The second cell has four load operations and zero store operations. Of the loads, one is shared and three are unique. The populated instructions are sent to the tool body, where specific addresses are calculated, and finally to the code generator. The final box shows the code generated for these two cells (excluding initialization information).

Results

This section provides an evaluation of TransPlant using benchmarks generated to show program diversity as well as synthetic versions of the STAMP and SPLASH-2 benchmarks. For both sections, the transactional characteristics of the new benchmarks are measured and the results are evaluated using principal component analysis and clustering (please refer to Chapter 4 for a description of the characteristics that were used to analyze the workloads as well as a discussion of the principal component analysis and clustering). All benchmarks are run to completion with 8 threads on SuperTrans. Table 5-3 presents the microarchitecture configuration that was used for each core in the 8-core CMP simulation.

Stressing TM Hardware

In any evaluation, it is useful to be able to test a variety of design points quickly. To this end, TransPlant was used to generate a set of programs with widely varying transactional characteristics. These programs, Q1-1 through Q4-1, represent the average behavior of each test quadrant. Figure 5-5 shows a plot of the first two principal components for the benchmarks generated here. These first two PCs account for 77.4% of the total variance. The first principal component is positively dominated by transaction sizes between 625 and 15k instructions and negatively dominated by transactions larger than 390k instructions and read-/write-sets larger than 256 unique addresses. The second component is positively dominated by the extremes in write set (more than 1024 addresses) and read set (fewer than 2 unique addresses) and negatively dominated by the opposite extremes. Program Q1-1 is comprised of transactions varying from 625 instructions to 78k instructions and read- and write-sets with 8 to 32 unique addresses. Program Q2-1 is comprised of large transactions (between 390k and 976k instructions) with read- and write-sets ranging from 512 to 1024 unique addresses. Programs Q3-1 and Q4-1 are composed of large and small transactions, respectively, with read- and write-sets varying from 2 to 4 unique addresses for Q4-1 and 64 to 128 addresses for Q3-1. Using the same variables, these programs were then compared to the benchmarks traditionally used to test transactional memory systems.

Workload Comparison

In this section, the overall program characteristics of the benchmarks generated in Section 5.3, Q1-1 through Q4-1, are compared with those of the SPLASH-2 and STAMP benchmarks. Specifically, the same principal component analysis as above is applied with the addition of the new benchmarks. With the reduced data from PCA, hierarchical clustering is used to group the benchmarks. The transactional performance of the benchmarks across two different transactional designs is also evaluated.

Clustering

Figure 5-7 shows the first two principal components plotted against one another for all of the benchmarks. The first two principal components are largely dominated by the same characteristics described in Section 5.1. However, there are more factors considered in this case and the first two components only comprise 47.1% of the total variance, changing the factor weightings.


The PC1-PC2 plot in Figure 5-7 shows that programs Q2-1 and Q3-1 are separated from the rest of the benchmarks because they are comprised of medium to large transactions and have high contention. The PCA weights these variables more heavily in this evaluation. Q1-1 and Q4-1 are made up of transactions ranging from 5 to 625 instructions (with very few large transactions) with moderately sized read- and write-sets. Because their behavior is not skewed toward any particular feature in this domain, they fall in between the STAMP and SPLASH benchmarks.

Figure 5-8 shows principal components three and four plotted against one another. Factors three and four contain 24.6% of the variance, with the third component positively dominated by small transactions and small write sets and negatively dominated by large write sets and small read sets. The fourth component is positively dominated by moderate read and write conflict ratios and large write sets and negatively dominated by moderately sized transactions, read sets, and write sets. The program distribution here shows much stronger clustering because of the limited variance, but even so Q3-1 and Q2-1 stand out, while Q4-1 remains near the SPLASH programs and Q1-1 maintains the same relative distance to genome, fmm, and vacation. The performance metrics confirm this behavior.

The clustering results in Figure 5-6 show that Q2-1 and Q3-1 are the last in the amalgamation schedule and share the fewest program characteristics, while Q1-1 and Q4-1 remain clustered with STAMP and SPLASH, showing that these programs share many of the inherent program characteristics of the traditional benchmarks. Q1-1 through Q4-1 show that TransPlant is capable of generating not only outlier programs but also programs with traditional performance characteristics. Going further, if a cutoff value is used to choose a subset of programs able to represent the general behavior of all of the benchmarks [46], Q2-1 and Q3-1 are always included.


Performance

In order to validate the abstract characteristics discussed above, this section presents results for several transactional characteristics measured across the two primary hardware transaction models of Conflict Detection/Version Management, Eager/Eager and Lazy/Lazy respectively. The results are shown in Table 5-4. From this table it can be seen that while the synthetic benchmarks do not separate themselves in any single program characteristic, their metrics taken as a whole do differentiate them from the SPLASH and STAMP benchmarks. For example, while Q2-1 is mostly comprised of very large transactions like bayes and labyrinth and has average read- and write-set sizes similar to bayes, it spends more time NACKing than any of the other programs and is about average in the number of aborts that it experiences. What's more, when the differences between EE and LL are examined, it can be seen that Q2-1 behaves more like labyrinth, and Q3-1 behaves similarly but with much smaller read- and write-sets. In the above clustering, Q1-1 was clustered with genome (loosely). In this case, they are both comprised of transactions that vary greatly in size, skewing the average length. Because they share this layout, their read and write conflict ratios are very similar. This also explains Q4-1, whose read/write ratio resembles that of barnes but whose general read set behavior is more closely related to cholesky. This shows that the tool is able to produce programs with vastly different high-level characteristics while maintaining a realistic representation of program behavior.

Benchmark Mimicking

While being able to create benchmarks based on an arbitrary input is useful for testing and debugging, it is important that the tool be able to replicate the behavior of existing benchmarks. In this section, PCA and clustering are used to show how the tool can use a trace to generate a synthetic benchmark that maintains the high-level program characteristics of the SPLASH and STAMP benchmarks.

Figure 5-9 shows the plot of the first two principal components of the STAMP and SPLASH benchmarks using the inputs from Table 5-2. These two factors comprise 48.9% of the total variance. Figure 5-11 shows the same plot of the first two factors of the synthetic representation, representing 33.4% of the variance. While these figures match almost perfectly, there is some deviation brought about by the read- and write-conflict ratios. These are calculated using an absolute worst-case estimate, as described in Chapter 4. When the profiler generates the input for the tool, it has the best case of the actual contentious memory addresses, producing a less conservative, more accurate representation. Figure 5-10 shows the hierarchical clustering for the original applications and Figure 5-12 shows the clustering for the synthetic representation. While the amalgamation schedule is slightly off, the overall representation is almost exact.

Finally, Table 5-5 shows actual results of the original and synthetic benchmarks when run on the cycle-accurate CMP simulator. The results present the ratio between the total transactional cycles and the total execution cycles. This metric is of particular significance because transactional cycles include both those cycles due to committed work (i.e., real work completed) as well as all cycles wasted in contentious behavior (e.g., aborted transactions, NACK-stalled cycles, commit arbitration, etc.). While much of the committed work is within the direct control of TransPlant in the synthetic benchmark creation, the contentious behavior is purely a result of the workload's interplay with the underlying transactional model. Moreover, if one refers to Table 5-4, it can be seen that for many of the benchmarks, these contentious cycles account for a significant portion of the transactional work. Thus, while the PCA results provide validation that the synthetic benchmarks are able to preserve the architecture-independent workload characteristics of the original benchmarks, Table 5-5 clearly shows that the synthetic benchmarks also preserve the alignment and fine-grain behavior of the original benchmarks across real transactional models.


Table 5-1. Transactional- and microarchitecture-independent characteristics

Characteristic         Description                                    Values
Threads                Total number of threads in the program         Integer
Homogeneity            All threads have the same characteristics      Boolean
Tx Granularity         Number of instructions in a transaction        List, Normalized Histogram
Tx Stride              Number of instructions between transactions    List, Normalized Histogram
Read Set               Number of unique reads in a transaction        List, Normalized Histogram
Write Set              Number of unique writes in a transaction       List, Normalized Histogram
Shared Memory          Number of global memory accesses               List, Normalized Histogram (complete, high, low, minimal, none)
Conflict Distribution  Distribution of global memory accesses         List, Normalized Histogram (high, random)
Tx Instruction Mix     Instruction mix within transactions            Normalized Histogram (memory, integer, floating point)
Sq Instruction Mix     Instruction mix in sequential code             Normalized Histogram (memory, integer, floating point)

Table 5-2. Transaction-oriented workload characteristics used for measuring similarity between TM workloads

No.    Program Characteristic   Synopsis
1      Transaction Percentage   Fraction of instructions executed by committed transactions.
2-11   Transaction Size         Total number of instructions executed by committed transactions. This metric is computed as a cumulative distribution across ten categories: [0,5], (5,25], (25,125], (125,625], (625,3125], (3125,15625], (15625,78125], (78125,390625], (390625,1953125], (1953125, ∞).
12-21  Read Conflict Density    The total number of potential conflict addresses read by a transaction divided by that transaction's total read set. The read conflict ratio across all transactions is summarized as a histogram with 10 buckets: [0,0.1), [0.1,0.2), ..., [0.9,1.0].
22-31  Write Conflict Density   The total number of potential conflict addresses written by a transaction divided by that transaction's total write set. The write conflict ratio across all transactions is summarized as a histogram with 10 buckets: [0,0.1), [0.1,0.2), ..., [0.9,1.0].
32-41  Read Set Size            Total number of unique memory addresses read by committed transactions. This metric is computed as a cumulative distribution across ten categories: [0,2), [2,4), [4,8), [8,16), [16,32), [32,64), [64,128), [128,256), [256,512), [512,1024), [1024, ∞).
42-51  Write Set Size           Total number of unique memory addresses written by committed transactions. This metric is computed as a cumulative distribution across ten categories: [0,2), [2,4), [4,8), [8,16), [16,32), [32,64), [64,128), [128,256), [256,512), [512,1024), [1024, ∞).


Table 5-3. Microarchitecture configuration (8-core CMP)

Processor Core
  Processor issue width               4
  Reorder buffer size                 104
  Load/store queue size               72
  Integer registers                   64
  Floating point registers            56
  Integer issue window size           56
  Floating point issue window size    16
  L2 cache size                       4M
  L2 cache latency                    12
  L1 instruction cache size           32 KB
  L1 data cache size                  32 KB
  L1 data cache latency               2


Table 5-4. TM workloads and their transactional characteristics (8-core CMP)

Benchmark (input dataset)    Model  Trans. Started  Aborts  NACK Stalled Cycles (M)  Avg. Read Set Size  Avg. Write Set Size  Read/Write Ratio  Avg. Commit Trans. Length (Instructions)
Barnes (16K particles)       EE     70533    1554    2.33        6.71    6.53    1.07   204.09
                             LL     69336    362     6.880302    6.71    6.53    1.07   204.10
Fmm (16K particles)          EE     45256    3       0.001771    13.43   7.34    1.82   175.60
                             LL     45302    26      0.516338    13.43   7.34    1.82   175.52
Cholesky (tk15.O)            EE     15904    19      0.015719    3.13    1.95    2.01   27.18
                             LL     15963    78      0.057466    3.12    1.95    2.01   27.16
Ocean-con (258x258)          EE     2161     497     0.091549    3.00    0.27    12.93  10.39
                             LL     1800     136     0.022013    3.00    0.26    13.44  10.38
Ocean-non (66x66)            EE     7200     5200    0.783498    3.00    0.38    9.79   13.25
                             LL     2778     778     0.057183    3.00    0.36    10.22  13.17
Raytrace (Teapot)            EE     141020   64279   22.43765    6.49    2.46    5.33   60.87
                             LL     307376   230635  0.260170    7.49    2.46    6.51   73.54
Water-nsq (512 molecules)    EE     10398    22      0.002693    10.87   2.97    2.66   59.26
                             LL     10482    106     0.654037    10.87   2.97    2.66   59.26
Water-sp (512 molecules)     EE     153      0       0.000146    2.48    1.37    1.68   133.25
                             LL     226      73      0.003986    2.57    1.46    1.89   366.78
Bayes (384 records)          EE     263      97      2.977146    793.16  713.67  1.78   448887.36
                             LL     168      28      0.007756    437.66  371.51  1.78   230090.91
Genome (g256 s15 n16384)     EE     6922     328     0.486603    27.96   5.05    8.16   1199.45
                             LL     6869     275     1.750257    27.89   5.05    8.16   1199.65
Intruder (a10 l4 n2038 s1)   EE     16861    5645    4.313740    13.86   8.83    1.86   495.83
                             LL     18508    7292    0.451212    14.03   8.84    1.86   494.58
Kmeans (Random1000_12)       EE     6705     0       0.005030    6.97    2.49    2.27   100.20
                             LL     718850   47009   6.430020    6.97    2.49    2.27   100.20
Labyrinth (512 molecules)    EE     382      174     323.617883  287.10  199.29  2.06   387340.00
                             LL     694      486     0.0048669   276.74  199.18  2.05   517329.69
Ssca2 (s13 i1.0 u1.0 l3 p3)  EE     47314    35      0.013970    6.12    3.00    5.00   33.99
                             LL     37581    42      0.015890    6.07    3.00    5.00   33.99
Vacation (4096 tasks)        EE     7023     2927    7.405184    67.11   13.22   3.95   1612.62
                             LL     4493     397     0.047728    66.18   13.19   3.95   1610.69
Yada (a20 633.2)             EE     6400     1112    101.044     54.45   25.64   1.93   14409.92
                             LL     7247     1756    0.152548    54.16   25.35   1.93   14261.00
Q1-1                         EE     1701     101     6.733200    22.0    20.69   1.05   7125.00
                             LL     2660     1060    0.017968    22.0    20.69   1.05   7125.00
Q2-1                         EE     1387     587     4271.581    627.2   622.36  1.38   1896093.75
                             EL     3820     3020    0.116293    627.2   622.36  1.38   1896093.75
Q3-1                         EE     1960     360     898.042     96.0    166.37  0.584  265625.00
                             LL     4294     2694    0.0021559   96.0    166.37  0.584  265625.00
Q4-1                         EE     1689     89      1.415997    3.20    3.60    1.085  958.41
                             EL     2545     945     0.0032308   3.20    3.60    1.085  958.41


Table 5-5. Transactional cycles : total cycles for original and synthetic benchmarks

Benchmark    Original   Synthetic
barnes       0.00537    0.03313
bayes        0.03116    0.06698
cholesky     0.00136    0.01226
fmm          0.00239    0.01923
genome       0.71611    0.67171
labyrinth    0.99887    0.98735
ocean-con    0.00005    0.00026
ocean-non    0.00250    0.00258
ssca2        0.00549    0.00485
vacation     0.11671    0.13150
intruder     0.64076    0.60100
kmeans       0.01814    0.04539
yada         0.86605    0.92660
water-spa    0.00009    0.00007
water-n2     0.00140    0.00961


Figure 5-1. PC plot of STAMP and SPLASH

Figure 5-2. Dendrogram of cluster analysis of STAMP and SPLASH

Figure 5-3. High-level representation of TransPlant

Figure 5-4. TransPlant step-by-step example

Figure 5-5. PC1-PC2 plot of synthetic programs

Figure 5-6. Dendrogram of unified cluster analysis

Figure 5-7. PC1-PC2 plot of unified PCA

Figure 5-8. PC3-PC4 plot of unified PCA

Figure 5-9. PC1-PC2 plot of original applications

Figure 5-10. Dendrogram from original cluster analysis

Figure 5-11. PC1-PC2 plot of synthetic applications

Figure 5-12. Dendrogram of synthetic cluster analysis


CHAPTER 6
CASE STUDY: UNDERSTANDING THE BEHAVIOR OF HARDWARE TRANSACTIONAL WORKLOADS; OBSERVATIONS, IMPLICATIONS, AND DESIGN RECOMMENDATIONS

While there has been extensive research into the design of hardware and software transactional memory systems, there has been very little investigation of transactional memory program behavior. Understanding the behavior of transactional memory programs is essential for making efficient choices about hardware, compiler optimizations, and even the choice of versioning and conflict resolution mechanisms. Because these decisions often remain fixed throughout the life of a system, it is important that architects are able to make informed choices. Up to this point, architects have relied on SPLASH-2, STAMP, and microbenchmarks for transactional memory system evaluations. While these programs cover a wide domain, it may be premature to rely on them for any thorough evaluation of a new system; targeted, but comprehensive, experiments are warranted in this emerging area since the transactional memory program domain is unknown.

This chapter investigates how transaction granularity, stride, and thread count affect program performance using both array- and object-based memory accesses. The results suggest that conventional concurrent programming practices may lead to poor performance when applied to transactional memory. It may not be wise to spend development time shrinking critical sections; transaction count and thread count are more important. There are also vast differences in the performance characteristics of eager conflict/eager version management systems when switching between array and object accesses. Lazy/lazy systems are largely immune to these effects, which may make a lazy system more attractive to developers, since performance remains consistent even in the presence of increased conflicts.


To describe how conflicts occur and can be minimized, the conflict memory stride is introduced to describe the effect that contentious memory locations have on the abort rate and the overall program performance. Excluding microarchitecture and the overhead associated with each read and write in an eager system, read- and write-set sizes are largely irrelevant. The abort rate is independent not just of the reads and writes in a transaction but even of the unique reads and writes; it is dependent on the largest conflict memory stride and affects both eager and lazy transactional memory implementations. Together, the results from these experiments provide valuable insight into the response characteristics of EE/LL systems.

Related Work

Since the resurgence of transactional memory, the architecture community has been racing to provide new implementations and tweak existing ones. Transactional Coherence and Consistency [1] was one of the first models to use transactional memory and works under the assumption that transactions constitute the basic block for all parallel work, communication, coherence, and consistency. TCC uses a lazy-lazy approach, which makes aborts fast but commit time relatively slow [47]. Following TCC were UTM [48], VTM [49], Bulk [50], LogTM [2], and OneTM [51]. LogTM was different in that the designers chose to make commits fast and aborts slow by storing stale data values in a per-thread log, the assumption being that commits will be more frequent than aborts in typical applications. There is, however, some disagreement that this will indeed be the case. Titos et al. [52] suggest that realistic benchmarks should reflect the expected use of transactions, which naturally leads to conservative synchronization and long-running transactions. Indeed, architects seem to have become fixated on these specific designs without considering present and future workloads. With real hardware on the horizon [53], it is imperative that designers begin looking at how the characteristics of transactional memory programs affect performance.


In [1], Hammond et al. broke equake and radix into large and small transactions. For this lazy-lazy approach, they showed that longer transactions tended to experience much less speedup as the number of threads/processors increased. They also showed that longer bus arbitrations during the commit stage tend to have a greater impact on long-running transactions than on short-running ones. This work differs from theirs in that it provides a much more thorough and systematic approach for evaluating the effects of transaction size. In addition, the effects across implementations are examined and more insight is offered into why performance changes. In [11], Minh et al. examined how the different STAMP benchmarks performed across a variety of HTM, STM, and hybrid systems. They concluded that many of the assumptions made about eager-eager versus lazy-lazy were incorrect, in that eager-eager does not always outperform lazy-lazy, especially where there exists a great deal of contention. In this research, all benchmarks are evaluated on both eager-eager and lazy-lazy systems, and direct comparisons are able to be drawn between the benchmarks to show how transactions behave under different conditions.

Researchers have begun looking at more ways to bring transactional memory into the mainstream through language and compiler techniques. In [54], Crowl et al. describe the problems that must be overcome to integrate transactional memory into C and C++. Dice and Shavit [55] proposed a transactional locking algorithm that may allow compilers to convert sequential and coarse-grain code into concurrent transactional code automatically. This research will be beneficial to the compiler community; transactional memory may be able to provide a boost for compilers and automatic parallelization due to the easing of restrictions on correctness, but not without a more generalized understanding of the semantics of transactional workloads.

Methodology

This section discusses the simulation environment that was used as well as program creation details.


Simulation Environment

All of the workloads were run to completion on SuperTrans [56], a cycle-accurate, detailed hardware transactional memory model that is capable of simulating all of the major dimensions of conflict detection and version management (eager-eager, eager-lazy, and lazy-lazy). SuperTrans is built on top of SESC [22], a cycle-accurate MIPS CMP simulator. SuperTrans has been used in previous research for workload characterization of hardware transactional memory systems [57] and is particularly suited to this task because it allows for abstraction of the underlying model implementation. While the SuperTrans implementations of the eager-eager and lazy-lazy systems are based on LogTM [2,58,59] and TCC [1], respectively, the model allows for abstraction of the specific overheads associated with tasks such as bus arbitration, NACK and backoff stall policies, conflict detection granularity, etc. This allows the researcher to gain insight into fundamental characteristics within and across design dimensions without being limited to any single design implementation.

Table 6-2 shows the baseline configuration that was used for all experimental results unless otherwise stated. The trans_model parameter describes the conflict detection and version management schemes that were employed. Conf_det_g refers to the level of granularity at which conflicts are detected. Back_off is the backoff method that is used after a transaction is aborted. The primary/secondary baseline latencies and the primary variable latency (PBL, SBL, and PVL) quantify the latencies associated with a commit or an abort. In a lazy version management system, the primary latency is associated with a commit, since new values must be written back, while in an eager version management system the primary latency is associated with an abort, because the old values must be written back. The baseline latency is the static overhead required (e.g., arbitrating for the bus, maintenance, cleanup, etc.) and the variable latency is the additional latency required based upon the write-set size. For a fair comparison, the latencies were kept equal across both models; the latency of the fast operation (SBL) was set to the number of cycles required for a single write-back within the simulator.
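As a concrete reading of these parameters, the sketch below shows how the Table 6-2 latencies could be combined into per-transaction commit and abort penalties. It is an illustrative model only, not SuperTrans source code, and it assumes the variable component scales linearly with the number of cache lines in the write set.

// Illustrative latency model only (not SuperTrans source). Parameter names and
// values follow Table 6-2; the linear scaling of the variable cost with the
// write-set size (in cache lines) is an assumption.
#include <cstdint>
#include <cstdio>

struct TMLatencyConfig {
    uint64_t primaryBaseline   = 50;  // PBL: static cost of the "slow" operation
    uint64_t primaryVariable   = 12;  // PVL: per-cache-line write-back cost
    uint64_t secondaryBaseline = 12;  // SBL: static cost of the "fast" operation
};

enum class Versioning { Eager, Lazy };

// Lazy versioning pays the primary latency at commit (new values written back);
// eager versioning pays it at abort (old values restored from the per-thread log).
uint64_t commitLatency(const TMLatencyConfig& c, Versioning v, uint64_t writeSetLines) {
    return v == Versioning::Lazy ? c.primaryBaseline + c.primaryVariable * writeSetLines
                                 : c.secondaryBaseline;
}

uint64_t abortLatency(const TMLatencyConfig& c, Versioning v, uint64_t writeSetLines) {
    return v == Versioning::Eager ? c.primaryBaseline + c.primaryVariable * writeSetLines
                                  : c.secondaryBaseline;
}

int main() {
    TMLatencyConfig cfg;
    // Example: a transaction whose write set spans 16 cache lines.
    printf("lazy-lazy   commit: %llu cycles, abort: %llu cycles\n",
           (unsigned long long)commitLatency(cfg, Versioning::Lazy, 16),
           (unsigned long long)abortLatency(cfg, Versioning::Lazy, 16));
    printf("eager-eager commit: %llu cycles, abort: %llu cycles\n",
           (unsigned long long)commitLatency(cfg, Versioning::Eager, 16),
           (unsigned long long)abortLatency(cfg, Versioning::Eager, 16));
    return 0;
}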


Table 6-2 also lists the underlying microarchitecture parameters that were used throughout the experimentation. The microarchitecture core configuration was chosen because it is representative of an average machine. For a more in-depth analysis of the effect of microarchitecture on transactional performance, the reader is encouraged to read [56].

Program Generation

One of the distinguishing aspects of this work is the isolation of the program semantics that dominate performance. Vital to this is the ability to build workloads that are functionally equivalent yet provide independently tunable characteristics. The TransPlant [61] framework enables the construction of workloads in which the transactional work, both in terms of instruction count and instruction composition, is held constant. In these experiments, each thread is responsible for committing 262,144 instructions. This number was chosen because it allows the representation of both extremes present in current transactional evaluation frameworks, high quantities of fine-grain transactions and low quantities of coarse-grain transactions, as well as all points in between. The instruction count was chosen as a power of two so the memory accesses could be evenly distributed across all granularities, removing any concern that specific granularities might receive unequal benefit based upon different memory alignments. Unless otherwise noted, transactions are evenly spaced throughout the program, allowing for a direct comparison across dimensions. The next requirement was that each transaction be responsible for at least one unique load and one unique store so that all transactions have at least some chance of conflicting.


Thus, the total number of loads and stores to unique locations per thread was based upon the total number of transactions at the finest granularity, which results in a total of 32,768 loads and 32,768 stores to unique, globally shared cache lines in each thread. We define a unique cache-line access as the first access to a cache line within a transaction; subsequent accesses to the same cache line are defined as non-unique. This is an important distinction because only the first load or store to a cache line within a transaction increases the read or write set of that transaction. The remaining instructions were comprised of 65,536 loads and stores to non-unique cache lines, 117,964 integer operations, and 13,108 floating-point operations.

In the granularity experiments, the work is broken down into successively smaller granularities, each representing a point along an axis into which a programmer could decompose the transactional work. Thus, as the granularity of the transactions becomes finer, transactions contain fewer instructions but the total number of transactions required to complete the work increases. Table 6-3 shows this breakdown and includes a range from very fine, where each transaction comprises only 3.05x10^-5 of the total work, to very coarse, where all of the transactional work is contained within a single transaction. While TransPlant provides two modes of conflict modeling, a high mode in which the distance between pairs of load/store operations to the same cache line is maximized and a random mode in which this distance is randomized, only the random mode is used for the granularity experiments. This is because, as will be discussed subsequently, this distance plays a dominant role in determining the overall performance of a transaction. If the high conflict model were used, it would unfairly penalize the larger granularity transactions in a way that is not representative of a programmer simply choosing a different decomposition for a workload.
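The sketch below regenerates the decomposition used in the granularity experiments (the rows of Table 6-3) from the fixed per-thread totals described above. The per-thread totals and the even division are taken directly from the text; the helper program itself is only illustrative.

// Illustrative sketch: how the fixed per-thread workload decomposes across
// transaction granularities (mirrors Table 6-3). Assumes the 262,144
// transactional instructions and 32,768 unique loads/stores per thread are
// divided evenly among the transactions.
#include <cstdio>

int main() {
    const long totalInstructions = 262144;  // transactional instructions per thread
    const long uniqueLoads       = 32768;   // unique shared cache-line reads per thread
    const long uniqueStores      = 32768;   // unique shared cache-line writes per thread

    printf("%-12s %-13s %-9s %-9s %s\n",
           "Granularity", "Transactions", "ReadSet", "WriteSet", "RatioOfWork");
    for (long granularity = 8; granularity <= totalInstructions; granularity *= 2) {
        long transactions = totalInstructions / granularity;
        long readSet      = uniqueLoads  / transactions;  // unique reads per transaction
        long writeSet     = uniqueStores / transactions;  // unique writes per transaction
        double ratio      = (double)granularity / totalInstructions;
        printf("%-12ld %-13ld %-9ld %-9ld %.2e\n",
               granularity, transactions, readSet, writeSet, ratio);
    }
    return 0;
}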


The sequential work that comprises the stride distance of each transaction is also constant and equal to the transactional work across all experiments, except those in which the transactional percentage or transactional duty cycle is varied. In those tests, the sequential work is varied while the transactional work is held constant so that the transactional characteristics can be directly compared across experiments. Finally, it should be noted that since transactional work is calculated on a per-thread basis, trends can be compared across a varying number of processing elements; however, the raw total cycle counts will differ based upon the number of threads.

Results

This section presents an analysis detailing how varying a specific transactional characteristic affects a program's execution. Several experiments were used to evaluate system performance as transactional program characteristics vary on both eager-eager and lazy-lazy HTM systems. The first set of experiments measures the performance of programs representative of array-type memory accesses; specifically, how the number of threads affects the overall execution time as transaction granularity changes and how transaction stride, the distance between successive transactions, affects overall performance in the presence of increasing granularity. The second set of experiments looks at performance using an object-based memory model and introduces the concept of conflict memory stride. Finally, the effect of conflict memory stride on execution time is shown.

Transaction Granularity

Transaction granularity refers to the raw size of a transaction and can be related directly to the period that a transaction maintains ownership of its read and write sets as well as the amount of work lost on an abort. While previous evaluations have shown that transaction size plays a role in execution time [1], there has not been a rigorous comparative study of how transaction granularity affects performance. In this section, we examine how performance scales using both array- and object-based memory accesses. For array accesses, the thread memory is modeled as a circular array; on a per-thread basis, there is no reuse outside of the transaction that first references a specific location. This ensures that a single transaction in each thread can only interfere with a single transaction in another thread. For example, in a program with n threads, TX1-A can interfere with TX2-A, TX3-A, and TXn-1-A but never with TX2-B. To model object accesses, it is assumed that each transaction will access part of some global object and that every transaction has the possibility of conflicting with every other transaction.
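A minimal sketch of the two access models follows, assuming a shared base address and the 32-byte conflict-detection granularity from Table 6-2. The function names are hypothetical and are not part of TransPlant, but they illustrate why the array model limits each transaction to conflicting with its same-index counterpart in other threads while the object model allows any transaction to conflict with any other.

// Illustrative sketch (not TransPlant source) of the two shared-memory access
// patterns compared in this section.
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr uint64_t kCacheLine = 32;  // conflict-detection granularity (Table 6-2)

// Array model: the shared region is walked as a circular array, and every
// thread's k-th transaction touches the same slice, so transaction k in one
// thread can only conflict with transaction k in another thread.
std::vector<uint64_t> arrayModelAddresses(uint64_t regionBase, uint64_t regionLines,
                                          uint64_t txIndex, uint64_t linesPerTx) {
    std::vector<uint64_t> addrs;
    for (uint64_t i = 0; i < linesPerTx; ++i) {
        uint64_t line = (txIndex * linesPerTx + i) % regionLines;  // circular, no intra-thread reuse
        addrs.push_back(regionBase + line * kCacheLine);
    }
    return addrs;
}

// Object model: every transaction re-visits the same shared cache lines, which
// removes the spatial and temporal staggering of the array model.
std::vector<uint64_t> objectModelAddresses(uint64_t objectBase, uint64_t linesPerTx) {
    std::vector<uint64_t> addrs;
    for (uint64_t i = 0; i < linesPerTx; ++i)
        addrs.push_back(objectBase + i * kCacheLine);
    return addrs;
}

int main() {
    auto a0 = arrayModelAddresses(0x10000, 32768, /*txIndex=*/0, /*linesPerTx=*/4);
    auto a1 = arrayModelAddresses(0x10000, 32768, /*txIndex=*/1, /*linesPerTx=*/4);
    auto o  = objectModelAddresses(0x20000, /*linesPerTx=*/4);
    printf("array model: tx0 starts at 0x%llx, tx1 starts at 0x%llx (disjoint lines)\n",
           (unsigned long long)a0[0], (unsigned long long)a1[0]);
    printf("object model: every transaction starts at 0x%llx (same lines)\n",
           (unsigned long long)o[0]);
    return 0;
}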


Impact Using Array Access

Figure 6-1 shows how granularity affects overall program performance as the number of program threads increases on both eager-eager (left) and lazy-lazy (right) systems. The transactional granularity is scaled by powers of 2, beginning with 8 instructions and continuing to 256k instructions, as shown in Table 6-3. The normalized cycle count is on the y-axis. From these results, there is a clear performance advantage for coarse-grain transactions when there are only two threads of execution on both designs, and even for four threads on the lazy-lazy implementation. As the number of threads increases, it becomes preferable to use more and more fine-grained transactions. While these experiments only show up to 16 threads, the downward trend in execution time suggests that it will hold true for increasing thread counts. There are several reasons for these results.

Figure 6-2 shows the breakdown of the transactional cycles (sequential cycles are excluded) for eager-eager. Regardless of thread count, the commit overhead has the largest impact on the low end of the granularity scale, diminishing in all cases as the granularity increases. This is to be expected because the total number of transactions decreases as the granularity increases. What is more, the overhead as a percentage of execution time remains nearly constant regardless of thread count, which is part of the reason the two-thread case on eager-eager performs better with coarser granularities. From Figure 6-1 it can be seen that the 8-instruction granularity transaction takes the longest to complete in the two-threaded case. All other thread counts consume the most cycles at the largest granularity.


For these programs, the high end is completely dominated by conflicts; actual committed instructions account for less than 50% of the execution time in the best case. Some of the conflicts are resolved through NACKs, but others require aborts. Of those resolved through useful stalls, the ratio of useful stall cycles to total transactional cycles remains nearly constant as the number of threads increases at the high end, accounting for 60-65% of the total cycles for the 4-, 8-, and 16-thread programs, suggesting that useful stalls are independent of thread count. Aborts, on the other hand, increase dramatically as the number of threads increases, both as a percentage of execution time and in rate. The increase in aborts can be attributed to the greater propensity of write overlaps as the number of threads increases. A deeper discussion of this effect is given in the subsequent section on conflict memory stride.

For the lazy-lazy system, there is a trough in the performance breakdown for all thread counts. This suggests that neither a fine-grained nor a coarse-grained program will see the best performance but that a middling approach might work best. Figure 6-3 shows the lazy-lazy cycle breakdown (overhead is included in the abort cycles). As can be seen, aborts play a big role in the performance, which is why the relative performance at smaller granularities with increasing thread counts is better than at lower thread counts. As the number of threads increases, there is more time spent in useful stalls (a useful stall in this case implies that a completed, waiting transaction was able to commit successfully). From Figure 6-3a to 6-3d, as the number of threads and the granularity increase, there is a greater probability of an abort occurring. This is especially apparent when comparing 6-3a to 6-3d. For all granularities, the two-threaded case has an average retry rate of 0.4 while the 16-threaded case experiences approximately 2.5 retries per transaction. So, the increasing number of aborts coupled with the increasing penalty indicates that coarse granularities with more than two threads will result in poor performance.


However, this is not always the case; there is a delicate balance with the abort rate where choosing any granularity will result in roughly the same performance. Figure 6-7 shows how the abort rate changes with the granularity and the number of threads for lazy-lazy. For all but the 16-thread case, the abort rate is nearly asymptotic.

From the above analysis it is apparent that minimizing the granularity of the critical section does not always result in the best performance, which is the opposite of what one would expect from lock-based programs. Instead, the granularity of the transactions should be scaled based upon the number of actively committing threads. This is largely a result of the commit overhead and is exacerbated on the lazy-lazy model, which suffers a larger penalty during the commit phase. Moreover, across both models, there are rapidly diminishing returns for continued reduction in the granularity of the transactions; the peak performance is achieved at granularities as large as 25% of the total work to be completed. This implies that, because of the speculative nature of transactions, program developers using mostly arrays should not needlessly focus development time on achieving minimal critical region size.

Impact Using Object Access

Object-based accesses assume that each transaction will access the same set of cache lines repeatedly throughout the lifetime of the program. This is an important distinction because it dramatically increases the probability of an abort; subsequent transactions are more likely to interfere with previous transactions. In other words, it removes the spatial and temporal conflict aspects of array accesses. This type of behavior is often seen in scientific workloads, such as tree accesses that begin at a common root node/pointer, and in commercial workloads that frequently visit or update a common entry or set. The general performance results for both eager-eager and lazy-lazy are shown in Figure 6-4.


From Figure 6-4, the trends in both the eager-eager and lazy-lazy graphs are immediately apparent; what is most striking is the fact that lazy-lazy's performance is almost the inverse of the eager-eager system's. On eager-eager, the high and low ends of the granularity scale experience much better performance than the middle portion. The opposite is true for lazy-lazy, where performance is better for the middle granularities and over a much wider range.

Referring back to the array-based results in Figure 6-1, it can be seen that eager-eager performance has changed dramatically but that the lazy-lazy results are very similar to the results from the array experiment. The 2- and 4-thread programs in lazy-lazy reach a steady state at or around the 8k granularity mark in both cases. The 8- and 16-thread programs look as though they are mirrors of the array results because the fine-grained programs experience the worst results here. However, the trend is clearly toward increasing runtime for the coarse transactions, just as in the array experiments. When the cycle breakdowns are examined, the results become clearer.

Figures 6-5a through 6-5d show the cycle breakdown for the eager-eager programs with 2, 4, 8, and 16 threads. Above 2 threads, the entire execution is dominated by aborts. These programs experience very few useful stalls, and the overhead is negligible in the presence of the overwhelming aborts. The reason for this is the way the programs end up being structured. Since each transaction can conflict with every other transaction, including those that are subsequent to it, the threads have no ability to reach a staggered state where a transaction in one thread is unlikely to conflict with a transaction in a separate thread, and thus they have no choice except to abort. Because the number of potential conflicts increases with the thread count, the effect is exacerbated as the number of threads grows, reflected in the ever-growing abort ratios. The change in the ratio of useful cycles to abort cycles as the number of threads increases is dramatic. With the array-based accesses, up to the 1k granularity, the cycle ratios remain nearly constant.


After 1k, an increasing portion of the execution time is consumed by abort cycles. With object-based accesses, the ratio of transactional work remains nearly constant across all granularities within each thread, but no other part of the execution time scales; aborts completely dominate the execution. The same is true of lazy-lazy, shown in Figure 6-6.

The execution of the lazy-lazy programs is also dominated by abort cycles. The first noticeable difference is that there are no useful stalls; for all thread counts, the number of useful stall cycles is zero. The reason for this behavior is the same as outlined above: since all transactions can conflict with one another and conflicts are not detected until commit time, when a transaction does commit, all other currently active transactions are almost guaranteed to abort. Nevertheless, this is not very different from the array-based result for lazy-lazy, where a significant portion of the execution time is consumed by abort cycles for all thread counts.

Consider the breakdown of retries per transaction for lazy-lazy, provided in Figure 6-7. For array-based accesses, the retry rate approaches half the total number of threads. This is because, initially, many of the threads conflict with one another until they reach some steady state where commits are serialized [9]. Even in the coarsest cases, where a steady state is unattainable, since each transaction can only conflict with a single transaction per thread and a committing transaction is guaranteed success, the maximum average number of retries is limited to (ThreadCount / 2). For object-based accesses, because the memory addresses are reset in each transaction to mimic access to a shared object, a transaction is virtually guaranteed to conflict not only with those transactions with which it shares spatial locality, but also with those transactions subsequent to it on other threads. Further, since this is a lazy-lazy system, if a conflict occurs, an abort is the only option, which is similar to the restart convoy problem discussed in [9]. The initial dip in the retry rate for the 8- and 16-thread programs is indicative of the convoy problem.


The sequential sections are just large enough that the threads become staggered, allowing some threads to execute in sequential sections while others are executing in transactional sections, reducing the overall contention. As soon as the commit overhead becomes large enough that transactions are unable to execute and commit entirely while other threads are within sequential sections, the retry rate quickly approaches its practical limit of (ThreadCount - 1). The contention is only reduced when the total number of transactions is reduced and threads are able to complete their shared access region, reducing the number of active, contentious threads.

In summary, the lazy-lazy implementation provides nearly constant performance for either array or object accesses across all thread counts between the 64 and 32k transaction granularities. The same cannot be said for the eager-eager implementation. While eager-eager appears to scale better with small transactions as the number of threads increases, the move to an object-based access pattern, where there are large numbers of conflicts, makes the reverse true. As the number of threads increases, the performance gets progressively worse because of the contention. The results for object access imply that unless the workload developer can drastically minimize the size of the transactions in the eager-eager model, the high cost of aborts coupled with the high probability of conflicts with future transactions suggests that the developer should instead focus on minimizing the number of transactions. Because of this, in regions of high contention, a lazy approach may be preferable, particularly for novice developers, as the general trend remains consistent with array-based access.

Transactional Duty Cycle

Finally, to test the sensitivity of the previous experiments to changes in the transactional stride, the total transactional percentage is varied. Since the number of transactional instructions that must be completed is fixed, the sequential portion of the code must be increased in order to reduce the percentage of the workload that is transactional, which results in an increase in the stride distance between transactions. Similarly, to increase the transactional percentage, the sequential section must be decreased, producing a correspondingly smaller stride distance between transactions.
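A small sketch of this sizing follows. It assumes the transactional percentage is defined over the combined transactional and sequential instruction counts, which matches the description above; the helper name is hypothetical.

// Illustrative sketch: sizing the sequential (non-transactional) section for a
// target transactional percentage while the transactional instruction count is
// held fixed, so lowering the percentage stretches the stride between transactions.
#include <cstdio>
#include <initializer_list>

long sequentialInstructions(long transactionalInstructions, double transactionalPct) {
    // transactionalPct = tx / (tx + seq)  =>  seq = tx * (1 - pct) / pct
    return (long)(transactionalInstructions * (1.0 - transactionalPct) / transactionalPct);
}

int main() {
    const long txWork = 262144;  // fixed transactional instructions per thread
    for (double pct : {0.25, 0.50, 0.75})
        printf("%2.0f%% transactional -> %ld sequential instructions per thread\n",
               pct * 100, sequentialInstructions(txWork, pct));
    return 0;
}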


Figure 6-8 shows the transactional cycle count results for eager-eager (left) and lazy-lazy (right) workloads that are comprised of 25%, 50%, and 75% transactional percentages for both array and object access patterns. Transactional cycle count is defined as the summation of all transactional activities (aborts, commits, stalls, and overhead) across all threads. This metric is used (as opposed to total cycles) so that direct comparisons may be drawn across the experiments where the sequential instruction count varies.

From both graphs, the minimal effect that the sequential stride distance has within this range is apparent. In the eager-eager model, for object access there is a slight performance improvement in the simulations that had a larger transactional stride. The 25% workloads outperform the 50% workloads in all but two granularities, with an average 7.8% reduction in execution time across all granularities, and the 50% workloads outperform the 75% workloads in all but three granularities, with an average 6.1% reduction in execution time. This improvement all but disappears for array access, however, where the average distance between the two extreme stride lengths is less than 0.04%.

In the lazy-lazy model, for all but the very finest granularities, across both array and object access, increasing or decreasing the sequential section has little to no effect on the transactional performance. Notable exceptions to this are the very fine, 8- and 16-instruction granularities. In these cases, by increasing the sequential distance between the transactions, one is able to increase the probability that a thread will be in a sequential section. Since the total length of the transactions within these granularities is very short, this improves the chance for threads to commit while other threads are within sequential sections.


This advantage is much more apparent in the lazy model than in the eager model because of the all-or-nothing nature of lazy models. Whereas in an eager model this slight increase in probability might result in a cycle or two of saved stall time, in the lazy model it often prevents entire aborts. Moreover, it can be seen that as the granularity of the transactions increases above 16 instructions, and thus the time for an entire transaction to complete and commit in the lazy model increases, this advantage all but disappears.

Memory Conflict Stride

Most previous transactional memory research used characteristics such as read- and write-set sizes, raw transaction size, or the raw number of aborts/retries per transaction to describe a transactional memory workload. In particular, when discussing a transaction's conflict rate, prior work often references the size of the read and write sets and the average transaction size [11, 52]. While these attributes do influence the abort rate, there is a more fundamental aspect of program design apart from set sizes that needs to be considered: memory conflict stride. The definition of memory conflict stride, in general, is the distance between a read to a shared cache line and a commit within the same transaction. For systems that allow stalls, this can be refined to the distance between a read and a write to the same shared cache line within the same transaction. This is an important attribute because it is the window in which an abort is most likely to happen. In fact, when describing transaction size within the context of conflict potential, the memory conflict stride should be used in place of, or in conjunction with, transaction size.
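The sketch below illustrates one way the largest conflict memory stride of a single transaction could be computed from its dynamic memory-reference trace, under the definition above. The trace format and function name are hypothetical rather than part of SuperTrans or TransPlant.

// Illustrative sketch: largest read-to-write distance over shared lines (the
// eager-style conflict window). A read that is never followed by a write to the
// same line has its window closed by the commit, which matches the lazy-style
// refinement of the definition.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct MemRef {
    uint64_t instrIndex;  // position within the transaction
    uint64_t cacheLine;   // shared cache line touched
    bool     isWrite;
};

uint64_t largestConflictStride(const std::vector<MemRef>& trace, uint64_t commitIndex) {
    std::unordered_map<uint64_t, uint64_t> firstRead;  // cache line -> index of first read
    uint64_t largest = 0;
    for (const MemRef& ref : trace) {
        if (!ref.isWrite) {
            firstRead.emplace(ref.cacheLine, ref.instrIndex);  // keep the earliest read only
        } else {
            auto it = firstRead.find(ref.cacheLine);
            if (it != firstRead.end()) {
                largest = std::max(largest, ref.instrIndex - it->second);
                firstRead.erase(it);  // window closed by the transaction's own write
            }
        }
    }
    for (const auto& kv : firstRead)  // windows still open at commit time
        largest = std::max(largest, commitIndex - kv.second);
    return largest;
}

int main() {
    // Line 7 is read early (instr 3) and written late (instr 900): a coarse stride of 897.
    // Line 9 is read just before the commit, so its window is only 10 instructions.
    std::vector<MemRef> tx = {{3, 7, false}, {900, 7, true}, {990, 9, false}};
    printf("largest conflict stride: %llu instructions\n",
           (unsigned long long)largestConflictStride(tx, /*commitIndex=*/1000));
    return 0;
}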


Figure 6-9 shows how conflict memory stride can affect a program's abort rate in an eager conflict detection/eager versioning model. The figure is broken into three boxes: coarse memory stride, fine memory stride, and coarse memory stride with additional reads and writes. In the left-most figure, AS is placed in the first thread's read set; it is subsequently placed in the read sets of threads T1 and T2. When T0 attempts to write AS, it is forced to stall because AS has been detected in the read sets of T1 and T2. T0 is then aborted when T1 attempts to write; T1 is aborted when T2 attempts to write. T0 and T1 restart and the conflicts are finally resolved. The center figure shows the same threads but with much shorter distances between the reads and writes. In this example, the threads never conflict because the potential for a conflict has been greatly reduced due to the smaller distance between the load and store to AS. Finally, the third figure shows the threads with additional reads and writes, but to private or non-contentious global addresses. While these additional reads and writes are stored in each transaction's read and write set, there is no possibility that they could ever cause a conflict with another transaction, and so the conflicts are resolved exactly the same way as they were in the original coarse-grain example. A similar model can be used for LL, replacing the second memory reference with a commit.

To demonstrate this phenomenon, TransPlant was used to generate two basic programs (shown in Table 6-4): one with transactions of 18k instructions and one with transactions of 256 instructions. In the first experiment, each transaction contains only a single read and a single write, each to the same shared address. Each program begins with a memory conflict stride equal to the transaction size, and then the stride is slowly decreased. To show that memory stride is independent of the size of the read- and write-sets, the set sizes are varied for the 256-instruction transaction size program (33 reads/9 writes and 65 reads/17 writes). Results provided are the median values for 100 runs of each program.

Figure 6-10 shows the raw abort count plotted against memory stride for mem_18k_1_1 on eager-eager and lazy-lazy systems. This figure clearly shows that larger memory strides have a greater chance of forcing aborts for both eager-eager and lazy-lazy systems. The result is more pronounced in the eager-eager system because the conflicts are detected prior to commit. The abort count is reflected in the overall execution time for this program, shown in Figure 6-11.


The eager-eager system experiences many more aborts than the lazy-lazy system, which induces additional overhead for the program since its abort penalty is the dominating factor. The lazy-lazy system shows nearly constant performance because there is little additional overhead for an abort and commits are more frequent, making the commit time the dominant performance factor.

Figure 6-12 shows the raw number of aborts as the memory stride of the 256-instruction programs is scaled down. The total number of aborts has decreased since the conflict window has been globally reduced, but there is still a clear downward trend in the total number of conflicts. For all of the 256-instruction transaction size programs, the lazy-lazy system experiences a greater number of aborts because the eager-eager system can detect conflicts sooner and stall rather than abort. There is also a clear difference in the abort trend for the lazy-lazy system between mem_18k_1_1 and all of the mem_256 programs. Because of the reduced instruction count of each transaction compared to the 18k case, the time required for commits in the lazy-lazy system is reduced. When the memory stride is also reduced, the distance to the first contentious load is increased, allowing an increased number of aborts to be prevented by permitting some transactions to commit before the conflicting load in otherwise contentious transactions is reached. This is the reason that there is a steady decrease in the abort rate. However, the overall performance, shown in Figure 6-13, follows the same trend as the 18k program. The lazy-lazy system performance is nearly constant while the eager-eager system improves as the memory stride, and thus the number of aborts, decreases.

For both the 18k and the 256-instruction transaction size programs, the performance of the eager-eager system increases quickly as the memory stride is reduced. This is because when aborts occur, they occur much closer together, meaning that the eager-eager system has fewer instructions to roll back.


The size of the read and write sets makes almost no difference in the raw number of aborts; the abort count is almost identical for all of the eager-eager programs and for all of the lazy-lazy programs. The overall performance of the lazy-lazy system remains nearly constant, within 2%, regardless of stride and read-/write-set sizes. The performance trend of the eager-eager system is similar but with a much more defined difference in the overall performance. The discrepancy between the performance benefit of reducing the raw abort count within the eager-eager model when compared to the lazy-lazy model is due to the nature of the two systems. While in both systems an abort prevents the re-execution of a number of instructions (limited by the transaction granularity), each abort that is prevented within the eager-eager system also saves expensive bus arbitration, write-back, and back-off time.

Case Study Conclusions

Although there have been many proposed transactional memory designs, there has not been a systematic evaluation of how transactional memory program characteristics affect system performance. In this emerging area, marginalizing this type of research could adversely affect design decisions and force future research along the wrong path. With this in mind, this case study explores the ramifications of transaction granularity, stride, and thread count on program performance using both array- and object-based memory accesses. A new metric, conflict memory stride, is introduced to describe the effect that contentious memory locations have on the abort rate and the overall program performance.

Experimental results show that for typical array-based accesses using an eager-eager implementation with low thread counts, transactions should not be too fine-grained. However, as the number of threads scales upward, it becomes worthwhile to use more and more fine-grained transactions because of the increased contention; the results for the lazy-lazy system follow a trend similar to eager-eager, albeit with an even greater delineation in its performance.


These results suggest that while it may be wise to spend some development time on adjusting transaction sizes, the focus should not be on finding minima for the critical regions.

For eager-eager systems, the object-based memory access results suggest that, regardless of thread count, either fine-grain or coarse-grain transactions should be used; the overhead for aborts outweighs any contention reduction gained for the middle-sized transactions. For lazy-lazy systems, the opposite is true; fine- and coarse-grain transactions should be avoided because of the increased contention. It should be noted that the trends for both array- and object-based access are similar, which may make a lazy system more attractive to developers since performance remains consistent even in the presence of increased conflicts. This effect is also seen in the presence of variable conflict stride. Transaction conflicts are largely a result of the memory conflict stride. Excluding microarchitecture and the overhead associated with each read and write, read- and write-set sizes are largely irrelevant. The abort rate is independent of not just the reads and writes in a transaction but even the unique reads and writes; it is dependent on the largest conflict memory stride, and this dependence holds for both eager and lazy transactional memory implementations.

The experimental results presented in this case study indicate that some of the knowledge gained from decades of traditional concurrent programming may not hold true, and both system designers and program developers should be conscious of the implications. The community may also need to redefine how to gauge the performance of transactional programs; average transaction size, transactional percentage, read set, and write set may not be enough to effectively categorize a program's likely performance on a given transactional memory system.


Table 6-1. Transactional and microarchitecture independent characteristics

Characteristic          Description                                     Values
Threads                 Total number of threads in the program          Integer
Homogeneity             All threads have the same characteristics       Boolean
Tx Granularity          Number of instructions in a transaction         List, Normalized Histogram
Tx Stride               Number of instructions between transactions     List, Normalized Histogram
Read Set                Number of unique reads in a transaction         List, Normalized Histogram
Write Set               Number of unique writes in a transaction        List, Normalized Histogram
Shared Memory           Number of global memory accesses                List, Normalized Histogram (complete, high, low, minimal, none)
Conflict Distribution   Distribution of global memory accesses          List, Normalized Histogram (high, random)
Tx Instruction Mix      Instruction mix of transactional section(s)     Normalized Histogram (memory, integer, floating point)
Sq Instruction Mix      Instruction mix of sequential section(s)        Normalized Histogram (memory, integer, floating point)

Table 6-2. Baseline configuration

TM Model
  Trans_model     Conflict Detection/Version Management    eager-eager, lazy-lazy
  Back_off        Backoff Policy                           Exponential
  Conf_det_g      Conflict Detection Granularity           32 Bytes
  PBL             Primary Baseline Latency                 50
  PVL             Primary Variable Latency                 12
  SBL             Secondary Baseline Latency               12

Processor Core
  issue_width     Processor Issue Width                    4
  ROB_size        Reorder Buffer Size                      18*issue_width+32
  LSQ_size        Load/Store Queue Size                    10*issue_width+32
  int_regs        Integer Registers                        8*issue_width+32
  fp_regs         Floating Point Registers                 6*issue_width+32
  int_issue_win   Integer Issue Window Size                6*issue_width+32
  fp_issue_win    Floating Point Issue Window Size         4*issue_width
  L2_size         L2 Cache Size                            4096kB
  L2_lat          L2 Cache Latency                         12
  il1_size        L1 Instruction Cache Size                32kB
  il1_lat         L1 Instruction Cache Latency             1
  dl1_size        L1 Data Cache Size                       32kB
  dl1_lat         L1 Data Cache Latency                    2


Table 6-3. Transactional program characteristics

Granularity   Transactions   Read Set   Write Set   Ratio of Total Work
8             32768          1          1           3.05E-05
16            16384          2          2           6.10E-05
32            8192           4          4           1.22E-04
64            4096           8          8           2.44E-04
128           2048           16         16          4.88E-04
256           1024           32         32          9.77E-04
512           512            64         64          1.95E-03
1024          256            128        128         3.91E-03
2048          128            256        256         7.81E-03
4096          64             512        512         1.56E-02
8192          32             1024       1024        3.13E-02
16384         16             2048       2048        6.25E-02
32768         8              4096       4096        1.25E-01
65536         4              8192       8192        2.50E-01
131072        2              16384      16384       5.00E-01
262144        1              32768      32768       1.00E+00

Table 6-4. Memory conflict stride program characteristics

Name            Transaction Size   Conflict R-Set   Conflict W-Set   Total R-Set   Total W-Set
mem_18k_1_1     18k                1                1                1             1
mem_256_1_1     256                1                1                1             1
mem_256_32_8    256                1                1                33            9
mem_256_64_16   256                1                1                65            17


Figure 6-1. Performance scaling on 50% transactional, array-based memory accesses, as the transactional granularity increases on an eager-eager system (left) and a lazy-lazy system (right)

Figure 6-2. Relative execution time, eager-eager (array-based): a) two threads, b) four threads, c) eight threads, d) sixteen threads


Figure 6-3. Relative execution time, lazy-lazy (array-based): a) two threads, b) four threads, c) eight threads, d) sixteen threads

Figure 6-4. Performance scaling on 50% transactional, object-based memory accesses, as the transactional granularity increases on an eager-eager system (left) and a lazy-lazy system (right)


Figure 6-5. Relative execution time, eager-eager (object-based): a) two threads, b) four threads, c) eight threads, d) sixteen threads

Figure 6-6. Relative execution time, lazy-lazy (object-based): a) two threads, b) four threads, c) eight threads, d) sixteen threads


Figure 6-7. Retries per transaction for lazy-lazy on 50% transactional workloads as the transactional granularity increases, on array-based memory accesses (left) and object-based memory accesses (right)

Figure 6-8. Performance scaling as the transactional granularity increases and as the transactional percentage changes (25%, 50%, and 75%, array and object access) on an eager-eager system (left) and a lazy-lazy system (right)

Figure 6-9. Memory conflict stride (EE system)


Figure 6-10. Abort count (mem_18k_1_1)

Figure 6-11. Cycle count (mem_18k_1_1)


Figure 6-12. Abort count (mem_256_x_y)

Figure 6-13. Cycle count (mem_256_x_y)


CHAPTER 7
CONCLUSIONS

Understanding the nature of TM and multi-core design dimensions and their interactions is critical for the design of efficient HTM-based multi-core systems. Toward this end, the author has built a generic HTM system on top of a cycle-accurate processor simulator that can simulate out-of-order execution multi-core processors. Analytical models were constructed that relate transactional memory workload performance to 6 TM design dimensions and 12 key core design parameters. Using these models, the TM/core design parameters that have the largest impact on TM workload performance were identified. Furthermore, it was shown that these models can be used for accurate performance prediction across the joint TM/multi-core design space.

The major conclusions that can be drawn from the first part of this study are: (1) SPLASH-2 alone is not sufficient for transactional memory research. This is primarily due to its low percentage of atomic regions, and the even lower percentage of atomic regions that experience actual contention. However, there are interesting and different optimization opportunities that exist in both fine and coarse granularity transactions, and they largely depend on transactional interaction. Therefore, work must be focused on creating truly dissimilar workloads across all granularities. (2) There can potentially be greater benefit in optimizing the microarchitecture to suit transactional atomic regions than even their equivalent lock-based counterparts. Additionally, particularly as the level of contention increases, overall transactional performance is not simply a result of the resources applied, but rather largely the result of how various TM mechanisms interact with the underlying core configurations. Thus, choosing appropriate resources for a transaction/workload configuration can result in compounded improvements. Therefore, it is crucial that future transactional architects take into consideration the specific interaction between workloads, transactional parameters, and the underlying architecture in a co-design method and not in isolation.


With no universally accepted transactional memory program suite, researchers are left in the dark about what may or may not constitute an acceptable benchmark. If the wrong benchmarks are chosen, they may not be able to stress the design in the ways that the architect expects or may provide superfluous results. In fact, previous research has shown that many of the SPEC CPU programs exhibit similarities in conventional design spaces and are, in fact, redundant. In this work, a set of characteristics has been provided that architects can use to evaluate transactional memory programs, and it has been shown that programs can be selected based on all or a subset of these traits. Using these characteristics, it is shown that using SPLASH-2 to evaluate a transactional memory system may not expose all of the benefits or flaws in a design and that the PARSEC programs used in this evaluation have features that make them redundant when using SPLASH-2. When the STAMP benchmarks are considered, there is more diversity in the entire benchmark set, but it is still limited in its scope; more emphasis should be placed on the design and implementation of transactional memory programs if this field is going to continue to grow. The program traits and mathematical methods described in this work can be used to guide and evaluate the comprehensiveness of these new programs. Finally, this study shows that picking and choosing feature sets can be used to select a set of programs that can stress a particular element in a design, allowing architects to quickly select a benchmark to test a specific design implementation.

Using principal component analysis, clustering, and raw transactional performance metrics, it has been shown that TransPlant is capable of creating programs with a wide range of transactional features. These features are independent of the underlying transactional model and can be tuned in multiple dimensions, giving researchers the freedom they need in testing new transactional memory designs. In addition, it was shown that TransPlant can use profiling information to create synthetic benchmarks that mimic the high-level characteristics of existing benchmarks. This allows for the creation of equivalent transactional memory programs without manually converting an existing program and provides a venue for the dissemination of possibly proprietary benchmarks without dispersing the source code. The framework presented in this work provides a limitless number of potential transactional memory programs usable by transactional memory architects for quick design evaluations.


The case study results show that for typical array-based accesses using an eager-eager implementation with low thread counts, transactions should not be too fine-grained. However, as the number of threads scales upward, it becomes worthwhile to use more and more fine-grained transactions because of the increased contention; the results for the lazy-lazy system follow a trend similar to eager-eager, albeit with an even greater delineation in its performance. These results suggest that while it may be wise to spend some development time on adjusting transaction sizes, the focus should not be on finding minima for the critical regions.

For eager-eager systems, the object-based memory access results suggest that, regardless of thread count, either fine-grain or coarse-grain transactions should be used; the overhead for aborts outweighs any contention reduction gained for the middle-sized transactions. For lazy-lazy systems, the opposite is true; fine- and coarse-grain transactions should be avoided because of the increased contention. It should be noted that the trends for both array- and object-based access are similar, which may make a lazy system more attractive to developers since performance remains consistent even in the presence of increased conflicts. This effect is also seen in the presence of variable conflict stride. Transaction conflicts are largely a result of the memory conflict stride. Excluding microarchitecture and the overhead associated with each read and write, read- and write-set sizes are largely irrelevant. The abort rate is independent of not just the reads and writes in a transaction but even the unique reads and writes; it is dependent on the largest conflict memory stride, and this dependence holds for both eager and lazy transactional memory implementations.


The experimental results presented in the case study indicate that some of the knowledge gained from decades of traditional concurrent programming may not hold true, and both system designers and program developers should be conscious of the implications. The community may also need to redefine how to gauge the performance of transactional programs; average transaction size, transactional percentage, read set, and write set may not be enough to effectively categorize a program's likely performance on a given transactional memory system.


LIST OF REFERENCES

[1] L. Hammond, et al., Transactional Memory Coherence and Consistency, In Proceedings of the 31st Annual International Symposium on Computer Architecture, 2004.
[2] K. Moore, J. Bobba, M. Moravan, M. Hill, and D. Wood, LogTM: Log-based Transactional Memory, In Proceedings of the International Symposium on High-Performance Computer Architecture, 2006.
[3] D. Dice, O. Shalev, and N. Shavit, Transactional Locking II, In Proceedings of the International Symposium on Distributed Computing, 2006.
[4] T. Harris and K. Fraser, Language Support for Lightweight Transactions, In Proceedings of the International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2003.
[5] B. Saha, A.-R. Adl-Tabatabai, et al., A High Performance Software Transactional Memory System for a Multi-Core Runtime, In Proceedings of the International Symposium on Principles and Practice of Parallel Programming, 2006.
[6] C. C. Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis, and K. Olukotun, An Effective Hybrid Transactional Memory System With Strong Isolation Guarantees, In Proceedings of the 34th Annual International Symposium on Computer Architecture, 2007.
[7] B. Saha, A.-R. Adl-Tabatabai, and Q. Jacobson, Architectural Support for Software Transactional Memory, In Proceedings of the International Symposium on Microarchitecture, 2006.
[8] A. Shriraman, M. Spear, et al., Integrated Hardware-Software Approach to Flexible Transactional Memory, In Proceedings of the International Symposium on Computer Architecture, 2007.
[9] J. Bobba, K. Moore, L. Yen, H. Volos, M. Hill, M. Swift, and D. Wood, Performance Pathologies in Hardware Transactional Memory, In Proceedings of the International Symposium on Computer Architecture, 2007.
[10] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, SIGARCH Computer Architecture News, vol. 23, pp. 24-36, 1995.
[11] C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun, STAMP: Stanford Transactional Applications for Multi-Processing, In Proceedings of the International Symposium on Workload Characterization, 2008.


[12] C. Bienia, S. Kumar, J. P. Singh, and K. Li, The PARSEC Benchmark Suite: Characterization and Architectural Implications, Princeton University Technical Report, TR-811-08, 2008.
[13] V. J. Marathe, M. F. Spear, C. Heriot, A. Acharya, D. Eisenstat, W. N. S. III, and M. L. Scott, Lowering the Overhead of Nonblocking Software Transactional Memory, In Proceedings of the 1st ACM SIGPLAN Workshop on Transactional Computing, 2006.
[14] Intel C++ STM Compiler, Prototype Edition 2.0. http://software.intel.com/.
[15] J. Chung, H. Chafi, A. McDonald, C. C. Minh, B. Carlstrom, C. Kozyrakis, and K. Olukotun, The Common Case Transactional Behavior of Multithreaded Programs, In Proceedings of the International Conference on High Performance Computer Architecture, 2006.
[16] I. Watson, C. Kirkham, and M. Lujan, A Study of a Transactional Parallel Routing Algorithm, In Proceedings of the International Conference on Parallel Architecture and Compilation Techniques, 2007.
[17] C. Perfumo, N. Sonmez, A. Cristal, O. S. Unsal, M. Valero, and T. Harris, Dissecting Transactional Executions in Haskell, In Proceedings of the Second ACM SIGPLAN Workshop on Transactional Computing, 2007.
[18] M. Scott, M. Spear, L. Dalessandro, and V. Marathe, Delaunay Triangulation with Transactions and Barriers, In Proceedings of the International Symposium on Workload Characterization, 2007.
[19] L. Eeckhout, R. Bell Jr., B. Stougie, K. De Bosschere, and L. John, Improved Control Flow in Statistical Simulation for Accurate and Efficient Processor Design Studies, In Proceedings of the International Symposium on Computer Architecture, 2004.
[20] S. Nussbaum and J. E. Smith, Statistical Simulation of Symmetric Multiprocessor Systems, In Proceedings of the Annual Simulation Symposium, 2002.
[21] C. Hughes and T. Li, Accelerating Multi-core Processor Performance Evaluation using Automatic Multi-threaded Workload Synthesis, In Proceedings of the International Conference on Workload Characterization, 2008.
[22] Josep Torrellas, SourceForge, SESC: A Simulator of Superscalar Multiprocessors and Memory Systems with Thread-Level Speculation Support, 2005. http://sourceforge.net/projects/sesc, August 2008.
[23] R. Meyer et al., The Coordinate-Exchange Algorithm for Constructing Exact Optimal Experimental Designs, Technometrics, vol. 37, no. 1, 1995.
[24] P. Joseph et al., Construction and Use of Linear Regression Models for Processor Performance Analysis, In Proceedings of the International Symposium on High Performance Computer Architecture, 2006.


[25] Y. Sakamoto et al., Akaike Information Criterion Statistics, Kluwer Academic Publishers, 1987.
[26] M. Orr et al., Combining Regression Trees and Radial Basis Function Networks, International Journal of Neural Systems, vol. 10, 2000.
[27] J. Cheng et al., Latin Hypercube Sampling in Bayesian Networks, In Proceedings of the Thirteenth International Florida Artificial Intelligence Research Society Conference, 2000.
[28] B. Vandewoestyne et al., Good Permutations for Deterministic Scrambled Halton Sequences in Terms of L2-discrepancy, Journal of Computational and Applied Mathematics, vol. 189, issues 1-2, 2006.
[29] J. Chambers et al., Graphical Methods for Data Analysis, Wadsworth, 1983.
[30] C. H. Romesburg, Cluster Analysis for Researchers, Lifetime Learning Publications, 1984.
[31] StatSoft, Inc., STATISTICA, 2007. http://www.statsoft.com, August 2008.
[32] M. M. Hughes, Exploration and Play Re-visited: A Hierarchical Analysis, International Journal of Behavioral Development, pp. 225-233, 1979.
[33] R. H. Saavedra and A. J. Smith, Analysis of Benchmark Characteristics and Benchmark Performance Prediction, ACM Transactions on Computer Systems, vol. 14, issue 4, pp. 344-384, 1996.
[34] L. Eeckhout, H. Vandierendonck, and K. De Bosschere, Designing Computer Architecture Research Workloads, Computer, vol. 36, no. 2, pp. 65-71, Feb. 2003.
[35] L. Eeckhout, H. Vandierendonck, and K. De Bosschere, Quantifying the Impact of Input Data Sets on Program Behavior and Its Applications, Journal of Instruction Level Parallelism, vol. 5, pp. 1-33, 2003.
[36] H. Vandierendonck and K. De Bosschere, Many Benchmarks Stress the Same Bottlenecks, In Proceedings of the Workshop on Computer Architecture Evaluation Using Commercial Workloads, pp. 57-71, 2004.
[37] A. Phansalkar, A. Joshi, and L. K. John, Analysis of Redundancy and Application Balance in the SPEC CPU2006 Benchmark Suite, In Proceedings of the 34th Annual International Symposium on Computer Architecture, 2007.
[38] C. Bienia, S. Kumar, and K. Li, PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites on Chip-Multiprocessors, In Proceedings of the International Symposium on Workload Characterization, 2008.


[39] H. Jin, M. Frumkin, et al., The OpenMP Implementation of NAS Parallel Benchmarks and its Performance, Technical Report, 1999.
[40] A. Jaleel, M. Mattina, et al., Last Level Cache (LLC) Performance of Data Mining Workloads on a CMP: A Case Study of Parallel Bioinformatics Workloads, In Proceedings of High-Performance Computer Architecture, 2006.
[41] M.-L. Li, R. Sasanka, S. V. Adve, Y.-K. Chen, and E. Debes, The ALPBench Benchmark Suite for Complex Multimedia Applications, In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC-2005), October 2005.
[42] R. Narayanan, B. Ozisikyilmaz, et al., MineBench: A Benchmark Suite for Data Mining Workloads, In Proceedings of the 2006 IEEE International Symposium on Workload Characterization, pp. 182, Oct. 2006.
[43] Standard Performance Evaluation Corporation, SPEC OpenMP Benchmark Suite, 2001. http://www.spec.org/omp, August 2008.
[44] L. Eeckhout and K. De Bosschere, Hybrid Analytical-Statistical Modeling for Efficiently Exploring Architecture and Workload Design Spaces, In Proceedings of Parallel Architectures and Compilation Techniques, 2001.
[45] A. Joshi, L. Eeckhout, L. John, and C. Isen, Automated Microprocessor Stressmark Generation, In Proceedings of High Performance Computer Architecture, 2008.
[46] J. Yi, R. Sendag, L. Eeckhout, A. Joshi, D. Lilja, and L. John, Evaluating Benchmark Subsetting Approaches, In Proceedings of the 33rd Annual International Symposium on Workload Characterization, 2006.
[47] A. McDonald, J. Chung, H. Chafi, C. C. Minh, B. Carlstrom, L. Hammond, C. Kozyrakis, and K. Olukotun, Characterization of TCC on Chip-Multiprocessors, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2005.
[48] C. Ananian, K. Asanovic, B. Kuszmaul, C. Leiserson, and S. Lie, Unbounded Transactional Memory, In Proceedings of the International Conference on High Performance Computer Architecture, 2005.
[49] R. Rajwar, M. Herlihy, and K. Lai, Virtualizing Transactional Memory, SIGARCH Computer Architecture News, vol. 33, issue 2, pp. 494-505, 2005.
[50] L. Ceze, J. Tuck, C. Cascaval, and J. Torrellas, Bulk Disambiguation of Speculative Threads in Multiprocessors, In Proceedings of the 33rd Annual International Symposium on Computer Architecture, 2006.


[51] C. Blundell, J. Devietti, E. C. Lewis, and M. M. K. Martin, Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory, In Proceedings of the 34th Annual International Symposium on Computer Architecture, 2007.
[52] J. R. Titos, M. E. Acacio, and J. M. Garcia, Characterization of Conflicts in Log-Based Transactional Memory (LogTM), In Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2008.
[53] D. Dice, Y. Lev, M. Moir, and D. Nussbaum, Early Experience with a Commercial Hardware Transactional Memory Implementation, In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009.
[54] L. Crowl, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum, Integrating Transactional Memory into C++, TRANSACT, 2007.
[55] D. Dice and N. Shavit, Understanding Tradeoffs in Software Transactional Memory, In Proceedings of the International Symposium on Code Generation and Optimization, 2007.
[56] J. Poe, C. Cho, and T. Li, Using Analytical Models to Efficiently Explore Hardware Transactional Memory and Multi-Core Co-Design, In Proceedings of the 20th International Symposium on Computer Architecture and High Performance Computing, 2008.
[57] J. Poe, C. Hughes, and T. Li, TransMetric: Architecture Independent Workload Characterization for Transactional Memory Benchmarks, In Proceedings of the 23rd International Conference on Supercomputing, 2009.
[58] L. Yen, et al., LogTM-SE: Decoupling Hardware Transactional Memory from Caches, In Proceedings of the International Symposium on High Performance Computer Architecture, 2007.
[59] M. Moravan et al., Supporting Nested Transactional Memory in LogTM, In Proceedings of Architectural Support for Programming Languages and Operating Systems, 2006.
[60] M. Herlihy and J. E. Moss, Transactional Memory: Architectural Support for Lock-Free Data Structures, In Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993.
[61] J. Poe, C. Hughes, and T. Li, TransPlant: A Parameterized Methodology for Generating Transactional Memory Workloads, In Proceedings of the 17th Annual International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2009.


BIOGRAPHICAL SKETCH

James Michael Poe II was born in Miami, Florida in 1981. He graduated from Miami Palmetto High School in 2000 and attended Florida International University. While at Florida International University he served as President of the Florida Theta chapter of the Tau Beta Pi national honor society, Vice President of the student branch of the IEEE, and was a member of the Engineering Student Council. He graduated Summa Cum Laude with a Bachelor of Science degree in Computer Engineering in 2004 and was selected as the 2004 Outstanding Graduate in Computer Engineering.

He earned his Master of Science degree in Electrical and Computer Engineering from the University of Florida in 2006 and was awarded a National Science Foundation Graduate Research Fellowship that same year. In 2008 he was certified as a Private Pilot by the Federal Aviation Administration.