Citation
Analysis of General Purpose Graphics Processing Units Workload and Network on-Chip Traffic Behavior

Material Information

Title:
Analysis of General Purpose Graphics Processing Units Workload and Network on-Chip Traffic Behavior
Creator:
Shankar, Ramkumar
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
2010
Language:
English
Physical Description:
1 online resource (63 p.)

Thesis/Dissertation Information

Degree:
Master's (M.S.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Li, Tao
Committee Members:
Figueiredo, Renato J.
Peir, Jih-Kwon
Graduation Date:
12/17/2010

Subjects

Subjects / Keywords:
Bytes (jstor)
Computer memory (jstor)
Distance functions (jstor)
DRAM (jstor)
Global education (jstor)
Memory (jstor)
Simulations (jstor)
Stall (jstor)
Traffic congestion (jstor)
Workloads (jstor)
Electrical and Computer Engineering -- Dissertations, Academic -- UF
amd, benchmarks, characterization, computer, cuda, dendrogram, gpgpu, gpu, graphics, ideal, kernel, nvidia, opencl, pca, workload
Genre:
Electronic Thesis or Dissertation
born-digital (sobekcm)
Electrical and Computer Engineering thesis, M.S.

Notes

Abstract:
ANALYSIS OF GENERAL PURPOSE GRAPHICS PROCESSING UNITS WORKLOAD AND NETWORK ON-CHIP TRAFFIC BEHAVIOR. The Graphics Processing Unit (GPU) is emerging as a general-purpose, high-performance computing device. Growing general-purpose GPU (GPGPU) research has made a wealth of GPGPU workloads available; there is, however, no systematic approach to characterizing these benchmarks and analyzing their implications for the evaluation of GPU microarchitecture designs. This research proposes a set of microarchitecture-agnostic characteristics that represent GPGPU kernels in a microarchitecture-independent space. Dimensionality reduction of the correlated characteristics and clustering analysis are used to understand these kernels. In addition, we propose a set of evaluation metrics to accurately evaluate the GPGPU design space. As the number of GPGPU workloads grows, this approach provides meaningful, accurate and thorough simulation for a proposed GPU architecture design choice. Architects also benefit by choosing a set of workloads that stresses the functional block of the GPU microarchitecture they intend to study. We present a diversity analysis of GPU benchmark suites such as the Nvidia Compute Unified Device Architecture (CUDA) Software Development Kit (SDK), Parboil and Rodinia. Our results show that, among a large number of diverse kernels, workloads such as Similarity Score, Parallel Reduction and Scan of Large Arrays show diverse characteristics in different workload spaces. We have also explored diversity in different workload subspaces (memory coalescing, branch divergence, etc.). The Similarity Score, Scan of Large Arrays, MUMmerGPU, Hybrid Sort and Nearest Neighbor workloads exhibit relatively large variation in branch divergence characteristics compared with the others. Memory coalescing behavior is diverse in Scan of Large Arrays, K-Means, Similarity Score and Parallel Reduction. Our work on workload characterization led to a study of the on-chip network of the GPU. It shows that many workloads across a diverse range of application domains, including 64-bin Histogram, K-Means and Matrix Transpose, are slowed down by stalls in the shader cores caused by congestion in the interconnection network. This work performs a thorough and detailed analysis of the on-chip network of the GPU to evaluate the causes of this congestion. (en)
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (M.S.)--University of Florida, 2010.
Local:
Adviser: Li, Tao.
Statement of Responsibility:
by Ramkumar Shankar.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Shankar, Ramkumar. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
773144844 (OCLC)
Classification:
LD1780 2010 (lcc)

Full Text

ANALYSIS OF GENERAL PURPOSE GRAPHICS PROCESSING UNITS WORKLOAD AND NETWORK ON-CHIP TRAFFIC BEHAVIOR

By

RAMKUMAR SHANKAR

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2010

© 2010 Ramkumar Shankar

To my grandparents, L. K. Venkateswaran and Narayani Venkateswaran

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor, Dr. Tao Li, for giving me the opportunity to work with him and for guiding me throughout my graduate study. I have had many fruitful discussions with him, and for that I am eternally grateful. I would like to take this opportunity to thank Dr. Renato Figueiredo and Dr. Jih-Kwon Peir for their support and guidance during the course of this research work. I would like to thank all my team members at the Intelligent Design of Efficient Architectures Laboratory (IDEAL), including Amer Quoneh, Clay Hughes and Madhura Joshi, for all the great discussions and support they extended all through my research work. I especially thank Nilanjan Goswami, without whose support this thesis would not have been possible.

I thank all my friends back in India for their continued support all along. I would like to thank Sudha for the long hours we spent in discussions and for the motivation and support she gave me to complete this research. I would like to thank my friends Nischal, Srini and Vineet for putting up with me and for all the fun we had during my study here at the University of Florida.

Most importantly, I express my greatest gratitude to my parents, to whom I owe my life. I would like to thank Sriram, Santosh, Poorni, Indu and Prasanna for nudging me, pushing me and motivating me to work harder. I am ever indebted to my aunt and uncle, who have been nothing less than parents to me. Lastly, I would like to thank my grandparents, whom I have always depended on for guidance. I dedicate this thesis to them.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
  GPU Microarchitecture
  Programming Abstraction
BACKGROUND
  GPGPU-Sim and Simulator Configurations
  Statistical Methods
    Principal Component Analysis
    Hierarchical Clustering Analysis
  GPGPU Workloads
WORKLOAD CHARACTERIZATION, ANALYSIS AND MICROARCHITECTURE EVALUATION IMPLICATIONS
  Workload Evaluation Metrics
  Experiment Stages
  Results and Analysis
    Microarchitecture Impact on GPGPU Workload Characteristics
    GPGPU Workload Classification
    GPU Microarchitecture Evaluation
    GPGPU Workload Fundamentals Analysis
    Characteristics-Based Classification
    Discussions and Limitations
INTERCONNECT TRAFFIC CHARACTERIZATION
  GPU Interconnection Network Characterization
    Modeled System Behavior
    Network Bottleneck Analysis
  Case Study: Network Traffic Pattern
    Matrix Transpose
    Breadth-First Search
    K-Means
    64-Bin Histogram
  Network Behavior Characterization

RELATED WORK
CONCLUSION
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1  GPU Workload Characteristics
2-2  Generic Workload Characteristics
2-3  GPGPU-Sim Shader Configuration
2-4  GPGPU-Sim Interconnect Configuration
2-5  GPGPU Workload Synopsis
3-1  GPGPU Workload Evaluation Metrics
3-2  GPGPU Workload Subset for 90% Variance
3-3  Average % of error as microarchitecture varies

LIST OF FIGURES

1-1   GPU Microarchitecture
3-1   Cumulative Distribution of Variance by PC
3-2   Dendrogram of all workloads
3-3   Dendrogram of Workloads (SL, 90% Var)
3-4   Dendrogram of Workloads (CL, 90% Var)
3-5   Dendrogram of Workloads (SL, 70% Var)
3-6   Dendrogram of Workloads (CL, 70% Var)
3-7   Performance comparison (normalized to total workloads) of different evaluation metrics across different GPU microarchitectures
3-8   Factor Loading of PC 1 and PC 2
3-9   PC 1 vs. PC 2 Scatter Plot
3-10  Factor Loading of PC 3 and PC 4
3-11  PC 3 vs. PC 4 Scatter Plot
3-12  Dendrogram based on Divergence Characteristics
3-13  Dendrogram based on Instruction Mix
3-14  Dendrogram based on Merge Miss and Row Locality
3-15  Dendrogram based on Kernel Characteristics
3-16  Dendrogram based on Kernel Stress
4-1   Different Types of Shader Stall Cycles
4-2   Matrix Transpose: Memory Requests per Cycle (in bytes)
4-3   Matrix Transpose: Different Memory Region Access Statistics
4-4   Breadth-First Search: Memory Requests per Cycle (in bytes)
4-5   Breadth-First Search: Different Memory Region Access Statistics

4-6   K-Means: Memory Requests per Cycle (in bytes)
4-7   K-Means: Different Memory Region Access Statistics
4-8   64-bin Histogram: Memory Requests per Cycle (in bytes)
4-9   64-bin Histogram: Different Memory Region Access Statistics
4-10  Maximum Channel Load experienced by different workloads
4-11  Latency Statistics: Shader Core to Memory Controller
4-12  Latency Statistics: Memory Controller to Shader Core

LIST OF ABBREVIATIONS

AMD    Advanced Micro Devices
CUDA   Compute Unified Device Architecture
GPU    Graphics Processing Unit
GPGPU  General Purpose Graphics Processing Unit
CMOS   Complementary Metal Oxide Semiconductor
DWDM   Dense Wavelength Division Multiplexing
SWMR   Single Write Multiple Read
MWSR   Multiple Write Single Read
PCA    Principal Component Analysis

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

ANALYSIS OF GENERAL PURPOSE GRAPHICS PROCESSING UNITS WORKLOAD AND NETWORK ON-CHIP TRAFFIC BEHAVIOR

By

Ramkumar Shankar

December 2010

Chair: Tao Li
Major: Electrical and Computer Engineering

The Graphics Processing Unit (GPU) is emerging as a general-purpose, high-performance computing device. Growing general-purpose GPU (GPGPU) research has made a wealth of GPGPU workloads available; there is, however, no systematic approach to characterizing these benchmarks and analyzing their implications for the evaluation of GPU microarchitecture designs. This research proposes a set of microarchitecture-agnostic characteristics that represent GPGPU kernels in a microarchitecture-independent space. Dimensionality reduction of the correlated characteristics and clustering analysis are used to understand these kernels. In addition, we propose a set of evaluation metrics to accurately evaluate the GPGPU design space. As the number of GPGPU workloads grows, this approach provides meaningful, accurate and thorough simulation for a proposed GPU architecture design choice. Architects also benefit by choosing a set of workloads that stresses the functional block of the GPU microarchitecture they intend to study. We present a diversity analysis of GPU benchmark suites such as the Nvidia Compute Unified Device Architecture (CUDA) Software Development Kit (SDK), Parboil and Rodinia. Our results show that, among a large number of diverse kernels, workloads such as Similarity Score, Parallel Reduction and Scan of Large Arrays show diverse characteristics in different workload spaces.

We have also explored diversity in different workload subspaces (memory coalescing, branch divergence, etc.). The Similarity Score, Scan of Large Arrays, MUMmerGPU, Hybrid Sort and Nearest Neighbor workloads exhibit relatively large variation in branch divergence characteristics compared with the others. Memory coalescing behavior is diverse in Scan of Large Arrays, K-Means, Similarity Score and Parallel Reduction. Our work on workload characterization led to a study of the on-chip network of the GPU. It shows that many workloads across a diverse range of application domains, including 64-bin Histogram, K-Means and Matrix Transpose, are slowed down by stalls in the shader cores caused by congestion in the interconnection network. This work performs a thorough and detailed analysis of the on-chip network of the GPU to evaluate the causes of this congestion.

CHAPTER 1
INTRODUCTION

With the increasing number of cores per central processing unit (CPU) chip, the performance of microprocessors has increased tremendously over the past few years. However, data-level parallelism is still not well exploited by general-purpose chip multiprocessors for a given chip area and power budget. With hundreds of in-order cores per chip, the GPU provides high throughput on data-parallel, computation-intensive applications. Therefore, a heterogeneous microarchitecture consisting of chip multiprocessors and GPUs seems to be a good choice for data-parallel algorithms. As large computation problems are solved using parallel computing on these many-core processors, the main memory bandwidth requirement also increases: large amounts of data will be required from the off-chip memory at a very high rate. Relatively high off-chip memory access latency and low bandwidth have kept the overall performance increase from matching the increase in processing performance. Recent advancements in CMOS-process-based silicon nano-photonics have substantially mitigated the power, latency and bandwidth problems, and 3D stacking technology provides low-latency, high-bandwidth cross-layer communication and a compact design. With the significant bandwidth demand in GPUs, it is anticipated that power consumption will reach a point at which an electrical interconnection network and DRAM subsystem will be insufficient. Using emerging silicon nano-photonics and 3D chip integration technology, optically connected shader cores and DRAM subsystems appear to be an attractive alternative to their electrical counterparts. In this research, we propose a novel 3D-stacked GPU microarchitecture based on silicon nano-photonics technology that has shader cores in one layer, a cache layer, and a built-in nano-photonically connected on-chip interconnection network layer.

The optical network layer enables a Dense Wavelength Division Multiplexing (DWDM) based high-speed network for GPU core communication, and the memory controllers communicate with the memory modules over a DWDM-based, high-bandwidth optical fiber. The use of optical fiber for off-chip communication is the key to solving the pin constraints that present technology is facing. Since thread scheduling has a direct impact on on-chip network traffic, we propose a novel thread scheduling algorithm for the GPU that alleviates network traffic congestion. On the memory interface front, to best utilize all the wavelengths available in the fiber, I implemented a novel adaptive wavelength assignment algorithm that achieves optimum usage of the channel while keeping memory stalls in the GPU core low.

Recent enhancements in general-purpose programming models [1, 2] and continuous microarchitectural improvements [3] of GPUs have motivated researchers and application developers to explore opportunities for optimizing scientific, financial, biological, graphical and other general-purpose applications on GPUs. GPUs provide excellent single instruction multiple data (SIMD) style computation hardware to perform the efficient data-parallel computations that are prevalent in those domains. Programming models such as Nvidia CUDA and AMD Stream have provided a thrust to data-parallel application development by significantly reducing development effort.

GPU Microarchitecture

In a GPU, several simple in-order cores are connected to memory controllers using an on-chip interconnection network fabric [3], and an off-chip array of memory cells communicates with the cores via the on-chip memory controllers. Figure 1-1 shows the general-purpose GPU microarchitecture, where unified shader cores work as in-order, multi-core streaming processors [19]. In each streaming multiprocessor (SM) there are several in-order cores, each known as a streaming processor (SP); the number of SPs per SM varies across GPU generations. Each SM consists of an instruction cache, a read-only constant cache, a shared memory, registers, a multi-threaded instruction fetch and issue unit (MT unit), and several load/store and special functional units [21, 3]. Two or more SMs are grouped together to share a single texture processor unit and the L1 texture cache [21]. An on-chip interconnection network connects the different SMs to the L2 cache and the memory controllers. Note that, in addition to the CPU main memory, the GPU device has its own external (off-chip) memory, which is connected to the on-chip memory controllers. The host CPU and the GPU device communicate with each other through the PCI Express bus. Some high-end GPUs also have an L2 cache, configurable shared memory, an L1 data cache and an error checking and correction (ECC) unit for memory.

Programming Abstraction

General-purpose programming abstractions, such as Nvidia CUDA and AMD Stream, are often used to facilitate GPGPU application development. In this research I have used the Nvidia CUDA programming model, but most of the basic constructs hold for other programming models as well. Using CUDA, thousands of parallel, concurrent, lightweight threads can be launched by the host CPU; these are grouped together in an execution entity called a block, and different blocks are grouped into an entity called a grid. During execution, a particular block is assigned to a given SM, and blocks running on different SMs cannot communicate with each other. Based on the resources available in an SM, one or more blocks can be assigned to the same SM. All the registers of the SM are allocated to the threads of the different resident blocks, which leads to fast context switching among the executing threads. Usually 32 threads from a block are grouped into a warp that executes the same instruction; this number can vary with the GPU generation. Threads on a GPU execute according to the single instruction multiple thread (SIMT) [20] model. During execution, branch instructions may cause some threads in a warp to jump while others fall through; this phenomenon is called warp divergence. Threads in a diverged warp execute in a serial fashion, and diverged warps possess inactive thread lanes, which reduces the thread activity of the kernel below 100%. Kernels with fewer divergent warps work efficiently both for independent scalar threads and for data-parallel coordinated threads in SIMT mode. Load/store requests issued by different threads in a warp get coalesced in the load/store unit into a single memory request according to their access pattern; memory coalescing improves performance by hiding the individual smaller memory accesses in a single large memory access. This phenomenon is termed intra-warp memory coalescing. Threads in a block execute in a synchronized fashion.
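The warp divergence and coalescing behavior just described can be made concrete with a small sketch. This is a minimal illustration assuming 32-thread warps, equal-length divergent paths and a 64-byte coalescing segment; it is not GPGPU-Sim's exact rule set.

```python
WARP_SIZE = 32        # assumed warp width; varies with GPU generation
SEGMENT_BYTES = 64    # assumed coalescing granularity

def warp_divergence(taken_mask):
    """A warp diverges when some, but not all, of its threads take a branch;
    the taken and fall-through paths then execute serially."""
    taken = sum(taken_mask)
    diverged = 0 < taken < WARP_SIZE
    # With equal-length paths, each serialized pass leaves the other path's
    # lanes idle, so average thread activity drops to 50%.
    activity = 0.5 if diverged else 1.0
    return diverged, activity

def coalesced_transactions(byte_addresses):
    """Intra-warp coalescing: one address per thread collapses into one
    memory transaction per distinct segment touched."""
    return len({addr // SEGMENT_BYTES for addr in byte_addresses})

# Threads 0-15 take the branch: a divergent warp at 50% activity.
print(warp_divergence([lane < 16 for lane in range(WARP_SIZE)]))
# Unit-stride 4-byte accesses coalesce into 2 transactions; a 256-byte
# stride needs 32 separate transactions.
print(coalesced_transactions([lane * 4 for lane in range(WARP_SIZE)]))
print(coalesced_transactions([lane * 256 for lane in range(WARP_SIZE)]))
```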

Figure 1-1. GPU Microarchitecture

CHAPTER 2
BACKGROUND

GPGPU-Sim and Simulator Configurations

In this work I used GPGPU-Sim [6], a cycle-accurate PTX ISA simulator. GPGPU-Sim encapsulates a PTX ISA [5] functional simulator (CUDA-Sim) and a parallel SIMD execution performance simulator (GPU-Sim). The simulator models shader cores, the interconnection network, memory controllers, texture caches, constant caches, L1 caches and off-chip DRAM. In the baseline configuration, the different streaming multiprocessors (SMs) are connected to the memory controllers through a mesh or crossbar interconnection network, and the on-chip memory controllers are connected to the off-chip memory chips. A thread scheduler distributes blocks among the shader cores in a breadth-first manner; the least-used SM is assigned the new block, to achieve load balancing among cores [6]. The SIMD execution pipeline in the shader core has the following stages: fetch, decode, pre-execute, execute, pre-memory (optional), memory access (texture, constant and other) and write-back. GPGPU-Sim provides L1 global and local memory caches. In the simulator compilation flow, cudafe [22] separates out the GPU code, which is further compiled by nvopencc [22] to produce PTX assembly. The ptxas assembler generates the per-thread shared memory, register and constant memory usage, which GPGPU-Sim uses during thread allocation in each SM. GPGPU-Sim uses the actual PTX assembly instructions generated for functional simulation. Moreover, GPGPU-Sim implements a custom CUDA runtime library to divert the CUDA runtime API calls to GPGPU-Sim; the host C code is linked with the GPGPU-Sim custom runtime.

Emerging GPGPU benchmarks are not well understood, because these workloads are significantly different from conventional parallel applications that utilize control-dominated, heavyweight threads; GPGPU benchmarks are inherently data parallel, with lightweight kernels. Until now, the effects of different workload characterization parameters have not been explored to understand the fundamentals of GPGPU benchmarks. In this work, we propose and characterize GPGPU workload characteristics from a microarchitectural viewpoint. Table 2-1 lists the features that specify the GPGPU-specific workload characteristics, and Table 2-2 lists the generic workload behaviors. The index in Table 2-1 refers to the dimensions of the data in the 129×38 input matrix. To capture the overall workload stress on the underlying microarchitecture, we use the dynamic instruction count, which is the total number of instructions executed on a GPU across all streaming multiprocessors. The dynamic instruction count per kernel provides the per-kernel instruction count of the workload, and the average instruction count per thread captures the average stress applied to each streaming processor during kernel execution. To express the degree of parallelism, we propose the thread count per kernel and total thread count metrics. The metrics mentioned above represent the kernel stress subspace. The instruction mix is captured using the floating point instruction count, integer instruction count, special instruction count and memory instruction count metrics; these parameters provide insight into the usage of the different functional blocks of the GPU. The inherent capability of hiding memory access latency is encapsulated in the arithmetic intensity. Memory-type-based classification is captured using the memory instruction count,

shared memory instruction count, texture memory instruction count, constant memory instruction count, local memory instruction count and global memory instruction count. The local memory instruction count and global memory instruction count encapsulate the frequency of off-chip DRAM accesses, which radically improves or degrades the performance of the workload. The branch instruction count captures the control flow behavior; owing to warp divergence, the frequency of branch instructions plays a significant role in characterizing the benchmarks. The barrier instruction count and atomic instruction count show the synchronization and mutual exclusion behavior of the workload, respectively. Branch and divergence behaviors are characterized by the percentage of divergent branches, number of divergent warps, percentage of divergent branches in the serialized section, and length of the serialized section. Branches due to loop constructs in a workload, or thread-level separate execution paths, do not contribute to warp divergence. The number of threads in the taken path or the fall-through path defines the length of the serialized section. The number of divergent warps is defined as the total number of divergent warps during the execution of the benchmark; it expresses the amount of SIMD lane inactivity, in terms of idle SPs in an SM. The length of the serialized section is the number of instructions executed between a divergence point and the corresponding convergence point. Thread-level resource utilization information is retrieved using metrics such as registers per thread, shared memory used per thread, constant memory used per thread and local memory used per thread. For example, in an SM, registers are allocated to a thread from a pool of registers, and the minimum granularity of thread allocation is a block of threads; too many threads per block may exhaust the register pool. The same holds true for shared memory.

Thread-level local arrays are allocated in off-chip memory arrays, designated by local memory used per thread. CPU-GPU communication information is gathered from the bytes transferred from host to device and the bytes transferred from device to host; the former provides a notion of off-chip memory usage on the GPU. The scenario in which the GPU computes its data by itself, without using any input data from the CPU, is distinct from the scenario in which the GPU processes data received from the CPU and returns it to the CPU. The kernel count gives the number of different parallel instruction sequences present in the workload. These metrics provide the kernel characteristics subspace. To capture intra-warp and inter-warp memory access coalescing, we use row locality [1] and merge miss [6], as described in Table 2-1. Merge miss reveals the number of cache misses or uncached accesses that can be merged into another in-flight memory request [6]; these accesses can be coalesced into a single memory access, improving memory bandwidth. DRAM row access locality in the data access pattern is captured before it is shuffled by the memory controller. We assume GDDR3 memory in this study; with different generations of GDDR memory these numbers may differ slightly, but the metric remains microarchitecture agnostic. These two metrics comprise the coalescing characteristics subspace. Some of the previously mentioned parameters are correlated with each other. For example, if a workload exhibits a large integer instruction count, it will also show a large dynamic instruction count, and if the percentage of divergent branches in a serialized section is high, then it is probable that the length of the serialized section will also be high. This high degree of data correlation justifies the use of PCA.
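As a concrete illustration, a few of the characteristics above can be computed from per-kernel statistics as sketched below. The record fields here are hypothetical stand-ins, not actual GPGPU-Sim counter names.

```python
# Hypothetical per-kernel statistics; the field names are illustrative
# stand-ins, not actual GPGPU-Sim counter names.
kernel = {
    "int_insn": 9000, "fp_insn": 2000, "mem_insn": 1000,
    "branches": 400, "divergent_branches": 120,
    "dram_accesses": 5000, "dram_row_activations": 1250,
}

# Arithmetic intensity (A_i): ALU operations per memory operation; higher
# values hide memory access latency better.
arithmetic_intensity = (kernel["int_insn"] + kernel["fp_insn"]) / kernel["mem_insn"]

# Percentage of divergent branches (D_bra).
pct_divergent_branches = 100.0 * kernel["divergent_branches"] / kernel["branches"]

# Row locality (R): average accesses served per DRAM row activation.
row_locality = kernel["dram_accesses"] / kernel["dram_row_activations"]

print(arithmetic_intensity, pct_divergent_branches, row_locality)  # 11.0 30.0 4.0
```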

Statistical Methods

A brief description of principal component analysis and hierarchical clustering analysis is provided here.

Principal Component Analysis

Principal component analysis (PCA) is a multivariate data analysis technique used to remove the correlations among the different input dimensions and to significantly reduce the number of data dimensions. In PCA, a matrix is formed whose columns represent the different input dimensions and whose rows represent the different observations. PCA transforms these observations into the principal component (PC) domain, where the PCs are linear combinations of the input dimensions. Mathematically, in a dataset of n correlated variables the k-th observation has the form D_k = (d_{0k}, d_{1k}, d_{2k}, d_{3k}, ..., d_{(n-1)k}). The k-th observation in the PC domain is represented as P_k = (p_{0k}, p_{1k}, p_{2k}, p_{3k}, ..., p_{(n-1)k}), where p_{0k} = c_0 d_{0k} + c_1 d_{1k} + c_2 d_{2k} + c_3 d_{3k} + ... + c_{n-1} d_{(n-1)k}. The terms c_0, c_1, ..., c_{n-1} are referred to as factor loadings, and they are selected in the PCA process to maximize the variance of a particular PC. The variance of the i-th PC, σ²_i, satisfies σ²_{i+1} ≤ σ²_i ≤ σ²_{i-1}; that is, the PCs are arranged in decreasing order of eigenvalue, where the eigenvalue describes the amount of information present in a principal component. The Kaiser criterion [46] suggests considering all the PCs that have an eigenvalue greater than 1. In general, a certain number of the first principal components is chosen so as to retain 90% of the total variance of the original data; usually this count is much smaller than the number of input dimensions, so dimensionality reduction is achieved without losing much information.
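The thesis performs PCA in STATISTICA; the sketch below reproduces the same linear algebra with numpy on a random stand-in matrix, including the Kaiser criterion and the 90% variance cutoff described above.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(38, 93))  # stand-in: 38 workloads x 93 characteristics

# Standardize each characteristic to zero mean and unit variance.
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Eigendecomposition of the covariance matrix: eigenvectors are the factor
# loadings, eigenvalues the variance captured by each PC.
eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
order = np.argsort(eigvals)[::-1]            # decreasing-variance order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pcs = z @ eigvecs                             # observations in the PC domain
explained = eigvals / eigvals.sum()

# Kaiser criterion: keep PCs with eigenvalue > 1 (for standardized data).
kaiser = int((eigvals > 1).sum())
# Cutoff used in this work: smallest p reaching 90% of the total variance.
p90 = int(np.searchsorted(np.cumsum(explained), 0.90) + 1)
print(kaiser, p90)
```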

Hierarchical Clustering Analysis

Hierarchical clustering analysis is a statistical data classification technique based on some perceived similarity defined over the dataset. Clustering techniques [8] can be broadly categorized into hierarchical and partitional clustering. In hierarchical clustering, clusters can be chosen according to the linkage distance without defining the number of clusters a priori; in contrast, partitional clustering requires the number of clusters to be defined as an input to the process. Hierarchical clustering can be done in an agglomerative (bottom-up) or divisive (top-down) manner. In the bottom-up approach, all of the points start out as separate clusters; the most similar clusters are found and grouped together, and this step is repeated until all the data points are clustered. In single linkage clustering, similarity is checked by considering the minimum distance between two clusters; in complete linkage clustering, the maximum distance between the two clusters is chosen; and average linkage clustering checks similarity based on the mean distance between clusters. The whole process of clustering produces a tree-like structure (a dendrogram), where one axis represents the different data points and the other represents the linkage distance. The whole dataset can then be categorized by selecting a particular linkage distance; a higher linkage distance expresses greater dissimilarity between the data points.
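A minimal sketch of this clustering step using scipy, with random stand-in PC coordinates; single and complete linkage correspond to the two dendrogram variants discussed above.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
pcs = rng.normal(size=(38, 16))  # stand-in: 38 workloads x 16 retained PCs

# Single linkage merges on the minimum inter-cluster distance, complete
# linkage on the maximum; comparing both guards against clustering artifacts.
single = linkage(pcs, method="single", metric="euclidean")
complete = linkage(pcs, method="complete", metric="euclidean")

# Cutting the tree at a linkage distance (or requesting a cluster count)
# yields the workload groups; dendrogram(single) would draw the tree itself.
labels = fcluster(complete, t=8, criterion="maxclust")
print(labels)
```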

I have used the GPGPU-Sim configurations specified in Tables 2-3 and 2-4. To show that the GPU configuration has no impact on the workload data, we have used six different configurations. Conf 2, the baseline configuration, and Conf 5 are significantly different from each other, whereas the other configurations are changed in subtle ways to expose the effect of a few selected components. Additionally, we have heavily instrumented the simulator to extract GPU characteristics such as the floating point instruction count, arithmetic intensity, percentage of divergent branches, percentage of divergent warps, percentage of divergent branches in the serialized section, length of the serialized section, etc.

GPGPU Workloads

To perform the GPU benchmark characterization analysis, we collected a large set of available GPU workloads from the Nvidia CUDA SDK [28], the Rodinia benchmark suite [25], the Parboil benchmark suite [26] and some third-party applications. Due to simulation issues (e.g., simulator deadlock) we excluded some benchmarks from our analysis. The problem sizes of the workloads are scaled to avoid long simulation times or unrealistically small workload stress. Table 2-5 lists the basic workload characteristics.

Table 2-1. GPU Workload Characteristics
Index    Characteristic: synopsis
1        Special instruction count (I_sop): total number of special functional unit instructions executed.
2        Parameter memory instruction count (I_par): total number of parameter memory instructions.
3        Shared memory instruction count (I_shd): total number of shared memory instructions.
4        Texture memory instruction count (I_tex): total number of texture memory instructions.
5        Constant memory instruction count (I_const): total number of constant memory instructions.
6        Local memory instruction count (I_loc): total number of local memory instructions.
7        Global memory instruction count (I_glo): total number of global memory instructions.
8-17     Percentage of divergent branches (D_bra): ratio of divergent branches to total branches.
18-27    Number of divergent warps (D_wrp): total number of divergent warps in a benchmark.
28-37    Percentage of divergent branches in serialized section (B_ser): ratio of divergent branches to total branches inside a serialized section.
38-47    Length of serialized section (l_serial): instructions executed between a divergence and a convergence point.
48-57    Thread count per kernel (T_kernel): count of total threads spawned per kernel.
58       Total thread count (T_total): count of total threads spawned in a workload.
59-63    Registers used per thread (Reg): total number of registers used per thread.
64-68    Shared memory used per thread (Sh_mem): total amount of shared memory used per thread.
69-73    Constant memory used per thread (c_mem): total amount of constant memory used per thread.
74-78    Local memory used per thread (l_mem): total amount of local memory used per thread.

Table 2-1. Continued
Index    Characteristic: synopsis
79       Bytes transferred from host to device (H2D): bytes transferred from host to device using the cudaMemcpy() API.
80       Bytes transferred from device to host (D2H): bytes transferred from device to host using the cudaMemcpy() API.
81       Kernel count (K_n): number of kernel calls in the workload.
82-91    Row locality (R): average number of accesses to the same row in DRAM.
92-101   Merge miss (M_m) [6]: cache misses/uncached accesses that can be merged with an ongoing request.

Table 2-2. Generic Workload Characteristics
Index    Characteristic: synopsis
102      Dynamic instruction count (I_total): dynamic instruction count for all the kernels in a workload.
103-112  Dynamic instruction count per kernel (I_kernel): per-kernel split of the dynamic instructions executed in a workload.
113-122  Average instruction count per thread (I_avg): average number of dynamic instructions executed by a thread.
123      Floating point instruction count (I_fop): total number of floating point instructions executed in a given workload.
124      Integer instruction count (I_op): total number of integer instructions executed in a given workload.
125      Memory instruction count (I_mop): total number of memory instructions.
126      Branch instruction count (I_b): total number of branch instructions.
127      Barrier instruction count (I_bar): total number of barrier synchronization instructions.
128      Atomic instruction count (I_ato): total number of atomic instructions.
129      Arithmetic intensity (A_i): arithmetic and logical operations per memory operation across all kernels.

Table 2-3. GPGPU-Sim Shader Configuration
                       Conf 1   Conf 2   Baseline   Conf 3   Conf 4   Conf 5
Shader cores           8        8        28         64       110      110
Warp size              32       32       32         32       32       32
SIMD width             32       32       32         32       32       32
DRAM controllers       8        8        8          8        11       11
DRAM queue size        32       32       32         32       32       128
Blocks/SM              8        2        8          8        8        16
Bus width              8 bytes/cycle (all configurations)
Const./texture cache   8 KB / 64 KB (2-way, 64 B line, LRU) (all configurations)
Shared memory          16 KB    8 KB     16 KB      16 KB    16 KB    16 KB
Register count         8192     8192     16384      16384    16384    32768
Threads per SM         1024     512      1024       1024     1024     1024

Table 2-4. GPGPU-Sim Interconnect Configuration
                       Conf 1   Conf 2   Baseline   Conf 3     Conf 4   Conf 5
Topology               Mesh     Mesh     Mesh       Crossbar   Mesh     Mesh
Routing                Dimension order (all configurations)
Virtual channels       2        2        2          1          4        4
VC buffers             4        4        4          8          16       16
Flit size              16 B     8 B      8 B        32 B       32 B     64 B

Table 2-5. GPGPU Workload Synopsis
Abbr.   Workload                                      Insn. count   Arith. intensity   Branch div.?/merge miss?/shared mem.?/barriers?
LV      Levenshtein edit distance calculation         32K           221.5              Y/Y/Y/N
BFS     Breadth-First Search [29]                     589K          99.7               Y/Y/Y/N

Table 2-5. Continued
Abbr.   Workload                                      Insn. count   Arith. intensity   Branch div.?/merge miss?/shared mem.?/barriers?
BP      Back Propagation [25]                         1048K         167.6              Y/Y/Y/Y
BS      BlackScholes Option Pricing [28]              61K           1000               Y/Y/Y/N
CP      Coulombic Potential [26]                      8K            471.8              Y/Y/Y/N
CS      Separable Convolution [28]                    10K           3.1                Y/Y/Y/Y
FWT     Fast Walsh Transform [28]                     32K           289.1              Y/Y/Y/Y
GS      Gaussian Elimination [25]                     921K          5.7                Y/Y/Y/Y
HS      Hot Spot [31]                                 432K          101.4              Y/Y/Y/Y
LIB     LIBOR [32]                                    8K            353.6              Y/Y/Y/N
LPS     3D Laplace Solver [30]                        12K           118.9              Y/Y/Y/Y
MM      Matrix Multiplication [28]                    10K           42.4               Y/Y/Y/Y
MT      Matrix Transpose [28]                         65K           195.6              Y/Y/Y/Y
NN      Neural Network [34]                           66K           71.5               Y/Y/Y/N
NQU     N-Queen Solver [35]                           24K           209.3              Y/Y/Y/Y
NW      Needleman-Wunsch [25]                         64            80.1               Y/Y/Y/Y
PF      Path Finder [25]                              1K            378.1              Y/Y/Y/Y
PNS     Petri Net Simulation [26]                     2.5K          411.5              Y/Y/Y/Y
PR      Parallel Reduction [28]                       830K          1.0                Y/Y/Y/Y
RAY     Ray Trace [36]                                65K           335.0              Y/Y/Y/Y
SAD     Sum of Absolute Difference [26]               11K           67.7               Y/Y/Y/Y
SP      Scalar Product [28]                           32K           284.9              Y/Y/Y/Y
SRAD    Speckle Reducing Anisotropic Diffusion [26]   460K          14.0               Y/Y/Y/Y
STO     Store GPU [37]                                49K           361.9              Y/Y/Y/N
CL      Cell [25]                                     64            765.2              Y/Y/Y/Y
HY      Hybrid Sort [38]                              541K          32.1               Y/Y/Y/Y
KM      K-Means [39]                                  1497K         20.2               Y/Y/Y/N
MUM     MUMmerGPU [40]                                50K           468.5              Y/Y/Y/N
NE      Nearest Neighbor [25]                         60K           196.8              Y/Y/Y/N
BN      Binomial Options [28]                         131K          39.4               Y/Y/Y/Y
MRIF    Magnetic Resonance Imaging FHD [44]           1050K         86.7               Y/Y/Y/N
MRIQ    Magnetic Resonance Imaging Q [26]             526K          107.6              Y/Y/Y/N
DG      Galerkin time domain solver [41]              1035K         11.4               Y/Y/Y/Y
SLA     Scan of Large Arrays [28]                     1310K         1.9                Y/Y/Y/Y
SS      Similarity Score [25]                         51K           1.1                Y/Y/Y/Y
AES     AES Encryption [42]                           65K           70.25              N/Y/Y/Y
WP      Weather Prediction [43]                       4.6K          459.8              Y/Y/Y/N
64H     64-bin Histogram [28]                         2878K         22.8               Y/Y/Y/Y

CHAPTER 3
WORKLOAD CHARACTERIZATION, ANALYSIS AND MICROARCHITECTURE EVALUATION IMPLICATIONS

Workload Evaluation Metrics

To evaluate the accuracy and effectiveness of the proposed workload characteristics, we use the set of metrics listed in Table 3-1. The activity factor [24] is defined as the average number of active threads at a given time during the execution phase. Several branch divergence related characteristics of a benchmark change this parameter; for example, the absence of branch divergence produces an activity factor of 100%. SIMD parallelism [24] captures the scalability of a workload: a higher value indicates that the workload's performance will improve on a GPU with a wider SIMD width. DRAM efficiency [6] describes how frequently memory accesses are requested during kernel computation, and it captures the fraction of overall kernel execution time spent performing DRAM memory transfers. If a benchmark has a large number of shared memory accesses, or its ALU operations are properly balanced in between memory operations, then this metric will show a higher value.

Experiment Stages

The experimental procedure consists of the following steps. (1) The heavily instrumented GPGPU-Sim simulator retrieves the statistics for the GPU characteristics listed in Tables 2-1, 2-2 and 3-1 for all the configurations described in Tables 2-3 and 2-4; six different configurations are simulated to demonstrate the microarchitectural independence of the GPGPU workload characteristics. (2) We vectorize some characteristics into 10-bin or 5-bin histograms from the GPGPU-Sim output. This process provides an input matrix for principal component analysis of size 129×38 (dimensions × workloads); the few dimensions of the matrix with zero standard deviation are kept out of the analysis, producing a 93×38 normalized data matrix. (3) PCA is performed using STATISTICA [45]; several PCAs are performed on the whole workload set, over all the characteristics and over the different workload characteristics subspaces. (4) Based on the required percentage of total variance, p principal components are chosen. (5) STATISTICA is used to perform hierarchical cluster analysis; both single linkage and complete linkage dendrograms are generated using m principal components, where m < p (p: total number of PCs retained).
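Stage (2), the vectorization and pruning that produce the 93×38 matrix, can be sketched as follows. The bin edges are a simplifying assumption, since the exact binning scheme is not specified here.

```python
import numpy as np

def vectorize(per_kernel_values, n_bins=10):
    """Turn a variable-length per-kernel characteristic into a fixed-length
    histogram so every workload contributes the same number of dimensions."""
    hist, _ = np.histogram(per_kernel_values, bins=n_bins)
    return hist

# E.g., percentage of divergent branches for each kernel of one workload.
print(vectorize([2.0, 5.5, 40.0, 41.2, 90.0]))

# Dimensions with zero standard deviation across all workloads carry no
# information for PCA and are dropped (129 -> 93 dimensions in the thesis).
rng = np.random.default_rng(0)
matrix = rng.normal(size=(129, 38))  # stand-in: characteristics x workloads
matrix[7, :] = 4.2                   # one constant, zero-variance dimension
reduced = matrix[matrix.std(axis=1) > 0]
print(matrix.shape, "->", reduced.shape)  # (129, 38) -> (128, 38)
```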

Results and Analysis

Microarchitecture Impact on GPGPU Workload Characteristics

We first verify that the GPGPU workload characteristics are independent of the underlying microarchitecture. If the microarchitecture has little or no impact on the set of benchmark characteristics, then the same benchmark executed on different microarchitectures will be placed close to itself in the uncorrelated principal component domain. Figures 3-1 and 3-2 show the distribution of variance across the different PCs and the dendrogram of all the benchmarks. According to the Kaiser criterion [46] we would need to consider all the principal components up to PC 20; instead, we consider the first 7 principal components, which retain 70% of the total variance, and the first 16 principal components, which retain 90% of the total variance. In Figure 3-2 we see that executions of the same benchmark on different microarchitectures have small linkage distances, which demonstrates strong clustering. This suggests that, irrespective of the varied microarchitecture configurations (e.g., shader cores, register file size, shared memory size, interconnection network configuration), our proposed metrics are capable of capturing GPGPU workload characteristics. Note that in Figure 3-2 some executions on a particular microarchitecture are missing because of a lack of microarchitectural resources on the GPU or because of simulation issues; for example, Conf 2 (Tables 2-3 and 2-4) is incapable of executing the benchmark and problem-size pairs of STO, NQU, HY and WP.

GPGPU Workload Classification

The program and input pairs defined as workloads in Table 2-5 show significantly distinct behavior in the PCA domain. The hierarchically clustered principal components are shown in Figures 3-3, 3-4, 3-5 and 3-6. To avoid hierarchical clustering artifacts we generated both single linkage and complete linkage dendrograms [27]. We observe that for the sets of 4 and 6 workloads, single linkage and complete linkage clustering show identical results in both the 70% variance and 90% variance cases. The benchmarks in the set of 4 are SS, SLA, PR and NE; the set of 6 comprises SS, SLA, PR, BN, KM and NE. For set sizes of 8 and 12 we find that single linkage and complete linkage clustering deviate by 1 and 2 workloads, respectively, for the 90% variance case. Table 3-2 depicts the results and notes the differences between single and complete linkage clustering. To represent a cluster, we choose the benchmark closest to the centroid of that cluster. The results for 70% of total variance are presented in Figures 3-5 and 3-6. The dendrograms also highlight that the Rodinia, Parboil and Nvidia CUDA SDK benchmark suites demonstrate diversity in decreasing order, relative to their available workloads. Table 3-2 also shows the time savings obtained for the different sets, expressed as speedup. Of all the benchmarks, SS and 64H contribute 26% and 29% of the total simulation time, respectively; the speedup numbers are in fact higher without SS. SS demonstrates very distinct characteristics from the rest, such as a high instruction count, low arithmetic intensity, a very high kernel count, diverse inter-warp coalescing behavior, branch divergence diversity, different types of kernels and a diverse instruction mix.

As expected, the overall speedup decreases as we increase the set size. Due to the inclusion of 64H, the set of 12 under complete linkage clustering does not show a high speedup. Architects can choose the set size to trade off the available simulation time against the amount of accuracy desired.

GPU Microarchitecture Evaluation

Figure 3-7 shows the performance comparison of the different evaluation metrics as the GPU microarchitecture varies, and Table 3-3 shows the average error observed. Due to a lack of computing resources and simulation issues with some benchmarks (e.g., SS, STO, HY, NQU), the configurations Conf 2, Conf 4 and Conf 5 are not considered; this should not affect the results, as the benchmark characteristics are microarchitecture agnostic. The maximum average error for the activity factor is less than 17%, which shows that the subsets are capable of representing the branch divergence characteristics present in the whole set. Divergence-based clustering shows that branch divergence behavior is diverse; therefore, the small set of 4 is relatively less accurate at capturing divergence behavior. As we increase the set size to 6 and 8, the average error comes as close as 1%. The average error in SIMD parallelism suggests that we capture kernel-level parallelism through the activity factor and dynamic instruction count, with a maximum error of less than 18%; increasing the set size also decreases the average error for SIMD parallelism. DRAM efficiency shows that coalescing characteristics, memory request and memory transfer behaviors are captured closely, with a maximum error of less than 6%. We also observe that increasing the subset size decreases the average error.
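The subsets above are formed by picking, for each cluster, the benchmark closest to the cluster centroid in PC space. A sketch of that selection follows, with toy coordinates and labels standing in for the real PC data.

```python
import numpy as np

def representatives(pcs, labels, names):
    """For each cluster, return the workload closest (Euclidean distance)
    to the cluster centroid in principal component space."""
    reps = {}
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        centroid = pcs[members].mean(axis=0)
        nearest = members[np.argmin(np.linalg.norm(pcs[members] - centroid, axis=1))]
        reps[int(c)] = names[nearest]
    return reps

# Toy usage: five workloads in a 2-PC space, grouped into two clusters.
pcs = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [4.0, 4.0], [4.2, 3.8]])
labels = np.array([1, 1, 1, 2, 2])
names = ["SS", "SLA", "PR", "KM", "NE"]
print(representatives(pcs, labels, names))  # one medoid-like pick per cluster
```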

GPGPU Workload Fundamentals Analysis

In this subsection we analyze the workloads from the point of view of the input characteristics. Architects are often interested in identifying the most influential characteristics in order to tune the performance of a workload on the microarchitecture. The different PCs are arranged in decreasing order of variance; hence, by observing their factor loadings we can infer the impact of the input characteristics. We choose to analyze the top 4 principal components, which account for 53% of the total variance. Figure 3-8 shows the factor loadings for principal components 1 and 2, which account for 39% of the total variance, and Figure 3-9 shows the corresponding scatter plot. SS, SLA, PR, GS, SRAD, 64H, HS and KM show relatively lower arithmetic intensity than the others; they have a large number of spawned threads, a small number of instructions per kernel, a moderate merge miss count and good row locality. The exceptions are CS and HY, which possess large kernel counts. Moreover, the workloads excluding these are relatively less diverse in the previously mentioned behavioral domains. Close examination of the simulation statistics verifies that the benchmarks other than SS, SLA, PR, GS, SRAD, 64H and KM have good arithmetic intensity, a large number of kernels and kernel-level diversity. There are no benchmarks with very high arithmetic intensity. SS exhibits a high integer instruction count and host-to-device data transfer, with low branch divergence and per-thread shared memory usage. In contrast, SLA demonstrates low merge miss, good row locality and branch divergence; the simulation statistics verify the observed branch divergence, merge miss and row locality. Figures 3-10 and 3-11 show the factor loadings and scatter plot for PC 3 vs. PC 4 (14% of the total variance). PC 3 shows that PR has a large number of kernels, low merge miss and moderate branch divergence. BN, SLA, MRIF and MRIQ possess moderate branch divergence and heavyweight threads.

The BN, MRIQ, MRIF, KM and PR program and input pairs have long serialized sections due to branch divergence (large l_serial), high barrier and branch instruction counts, heavyweight threads and more computation.

Characteristics-Based Classification

The previous chapter described several subspaces, such as the instruction mix subspace, the coalescing characteristics subspace and the divergence characteristics subspace, to categorize the workload characteristics. Figures 3-12, 3-13, 3-14, 3-15 and 3-16 present dendrograms generated from the principal components (90% of total variance) obtained from PCA of the workloads over those different subspaces. Architects interested in improving branch divergence hardware might want to simulate SS, SLA, MUM, HS, BN and NE, as they exhibit a relatively large amount of branch divergence diversity. On the other hand, SS, BN, KM, MRIF, MRIQ and HY have distinct behavior in terms of instruction mix; these are valuable benchmarks for evaluating the effectiveness of a design with ISA optimization. Interestingly, the workloads other than SLA, KM, SS and PR show similar types of inter-warp and intra-warp memory coalescing behavior; hence, SLA, KM, SS and PR provide the diversity needed to evaluate microarchitecture-level memory access optimizations. It is possible, though, that workloads with similar inter- and intra-warp coalescing show large variations in cache behavior when the cache size grows beyond that used to measure intra-warp coalescing. Figure 3-15 shows the static kernel characteristics of the different benchmarks: PR, STO, SLA, NQU, WP, SS, CP and LIB show distinct kernel characteristics. SS, BN, LIB, MRIF, MRIQ, WP, BP and NE demonstrate diverse behavior in terms of the instructions executed by each thread, total thread count, total instruction count, etc.

Discussions and Limitations

This research explores PTX-translated GPGPU workload diversity in terms of different microarchitecture-independent program characteristics. The GPU design used in our research closely resembles Nvidia GPUs. However, the growing GPU industry has several other microarchitecture designs (AMD ATI [23], Intel Larrabee [50]), programming models (ATI Stream [23], OpenCL [49]) and ISAs (ATI Intermediate Language, x86, Nvidia native device ISAs, ATI native device ISAs). The microarchitecture-independent characteristics proposed here are by and large applicable to ATI GPUs, even though they use a different virtual ISA (ATI IL). For example, SFU instructions may be specific to Nvidia, but AMD ATI also processes these instructions, in a transcendental unit inside the thread processor. The branch divergence and memory coalescing behavior of the kernels is a characteristic of the workload and does not affect the results produced here for ATI GPUs. However, for the Intel Larrabee architecture these characteristics will be different because of dissimilarities in the programming model; the Larrabee microarchitecture is also different from that of traditional GPUs. Hence, for Larrabee, the suitability of the proposed metrics for representing workload characteristics needs further exploration. PTX ISA changes are minor across generations and therefore have an insignificant effect on the results. In addition, the mapping of PTX instructions onto different generations of native Nvidia ISAs, and programming model changes, have little impact on the GPGPU kernel characteristics, because we characterize the kernels in terms of the PTX virtual ISA.

Table 3-1. GPGPU Workload Evaluation Metrics
Activity factor     Average percentage of threads active at a given time.
SIMD parallelism    Speedup with an infinite number of SPs per SM.
DRAM efficiency     Percentage of time spent sending data across the pins of the DRAM while other commands are pending or being serviced.

Table 3-2. GPGPU Workload Subset for 90% Variance
Set 4    SL: SS, SLA, PR, NE (speedup 3.60)
         CL: SS, SLA, PR, NE (speedup 3.60)
Set 6    SL: SS, SLA, PR, KM, BN, NE (speedup 2.38)
         CL: SS, SLA, PR, KM, BN, NE (speedup 2.38)
Set 8    SL: SS, SLA, PR, KM, BN, LIB, MRIQ, NE (speedup 1.80)
         CL: SS, SLA, PR, KM, BN, LIB, MRIQ, MRIF (speedup 1.53)
Set 12   SL: SS, SLA, PR, KM, BN, LIB, MRIQ, MRIF, STO, HS, BP, NE (speedup 1.51)
         CL: SS, SLA, PR, KM, BN, LIB, MRIQ, MRIF, STO, HS, 64H, MUM (speedup 1.05)
(SL: single linkage; CL: complete linkage. The clusterings differ only in the last picks: NE vs. MRIF for set 8, and BP, NE vs. 64H, MUM for set 12.)

Table 3-3. Average % of error as microarchitecture varies
Evaluation metric   Set 4   Set 6   Set 8
Activity factor     16.7    5.5     1.1
SIMD parallelism    17.2    6.0     0.5
DRAM efficiency     5.2     5.0     2.5
Average             13.1    5.5     1.4

Figure 3-1. Cumulative Distribution of Variance by PC

Figure 3-2. Dendrogram of all workloads

Figure 3-3. Dendrogram of Workloads (SL, 90% Var)

Figure 3-4. Dendrogram of Workloads (CL, 90% Var)

Figure 3-5. Dendrogram of Workloads (SL, 70% Var)

Figure 3-6. Dendrogram of Workloads (CL, 70% Var)

Figure 3-7. Performance comparison (normalized to total workloads) of different evaluation metrics across different GPU microarchitectures (C1: Config. 1, B: Baseline, C3: Config. 3)

Figure 3-8. Factor Loading of PC 1 and PC 2


Figure 3-9. PC 1 vs. PC 2 Scatter Plot
Figure 3-10. Factor Loading of PC 3 and PC 4


Figure 3-11. PC 3 vs. PC 4 Scatter Plot
Figure 3-12. Dendrogram based on Divergence Characteristics


Figure 3-13. Dendrogram based on Instruction Mix
Figure 3-14. Dendrogram based on Merge Miss and Row Locality


Figure 3-15. Dendrogram based on Kernel Characteristics
Figure 3-16. Dendrogram based on Kernel Stress


CHAPTER 4
INTERCONNECT TRAFFIC CHARACTERIZATION

GPU Interconnection Network Characterization

This chapter explores the traffic behavior of the on-chip interconnection network between shader cores and memory controllers for several GPGPU applications.

Modeled System Behavior

Using GPGPU-Sim, this research models the shader core and memory controller interconnection network as a 6x6 mesh¹ to observe the network traffic; performance is usually not very sensitive to the interconnection network topology [6]. The system is composed of 28 shader cores connected to 8 memory controllers, which are placed in a diamond pattern. The channel bandwidth of this electrical network is 16 bytes per cycle. All routers have 2 virtual channels with 4-byte buffers and communicate using dimension-order routing. Off-chip DRAM is connected to the on-chip memory controllers. Three different types of requests are examined in the experiments: read requests (8 bytes of header and address), read replies (64 bytes of header, address, and data), and write requests (80 bytes of header, address, data, and mask). Mainly global, local, constant, and texture memory region accesses are examined. All of these memories are physically mapped to external DRAM, excluding local variables². Therefore, all of these memory accesses generate interconnect traffic from shader core to memory controller or from memory controller to shader core.

¹ Mesh and crossbar also outperform other topologies.
² Local variables are mapped to shader core registers.
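The packet sizes above determine how a given request mix loads the two directions of the network. A small sketch follows; the identifiers are illustrative, not GPGPU-Sim's.

    #include <cstdio>

    enum PacketType { READ_REQUEST, READ_REPLY, WRITE_REQUEST };

    int packet_bytes(PacketType t)
    {
        switch (t) {
            case READ_REQUEST:  return 8;   // header + address
            case READ_REPLY:    return 64;  // header + address + data
            case WRITE_REQUEST: return 80;  // header + address + data + mask
        }
        return 0;
    }

    int main()
    {
        // Hypothetical request mix: each read costs 8 B toward the memory
        // controller but 64 B on the return path, so read-dominated kernels
        // load the reply (memory controller to shader core) direction most.
        long reads = 1000, writes = 200;
        long to_mc = reads * packet_bytes(READ_REQUEST)
                   + writes * packet_bytes(WRITE_REQUEST);
        long to_shader = reads * packet_bytes(READ_REPLY);
        printf("shader->MC: %ld B, MC->shader: %ld B\n", to_mc, to_shader);
        return 0;
    }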


Network Bottleneck Analysis

To determine whether or not a workload is bottlenecked by the interconnection network, we calculate the shader core and DRAM channel stall cycles due to interconnect network congestion³. Figure 4-1 reveals that 13 workloads demonstrate a significant number of stalls in memory traffic (shader core to memory controller and memory controller to shader core). BFS, FWT, LIB, MT, NN, PNS, RAY, SRAD, KM, MUM, NE, SS, and 64H are the workloads that experience a network bottleneck. Of those, 5 workloads show network congestion in both directions. However, 14 other workloads experience memory stage stalls in the shader core pipeline yet do not experience network congestion, because sufficiently interleaved memory accesses⁴ do not congest the network. The 13 congested GPGPU workloads, therefore, emit bursts of network traffic at irregular intervals. It is expected that with larger GPGPU workload input sizes and simultaneous execution of multiple kernels, the situation will be aggravated even further. While lowering the memory stage stall cycles of the shader core pipeline requires increasing the arithmetic intensity of GPGPU applications (a programming effort and algorithmic change), GPU architects can mitigate the interconnection network stalls through proper design optimization. We believe that efficient design optimization of the interconnection network requires close examination of the traffic behavior of a few of the heavy-traffic workloads.

³ Shader core to interconnect stall (shader core memory stage stall) and interconnect to shader core stall (DRAM channel stall).
⁴ Applications with high arithmetic intensity (ALU instructions per memory access) demonstrate such behavior.
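The bottleneck test described above can be summarized in a few lines; the counters below are hypothetical, and the 5% threshold is our assumption rather than a value used in this study.

    #include <cstdio>

    struct NetStats {
        long long cycles;               // total simulated cycles
        long long shader_to_icnt_stall; // memory stage stalls (toward the MCs)
        long long dram_channel_stall;   // DRAM channel stalls (toward shaders)
    };

    // Flag a workload as network-bound in either direction when congestion
    // stalls form a noticeable share of all cycles.
    bool network_bound(const NetStats& s, double threshold = 0.05)
    {
        double fwd = (double)s.shader_to_icnt_stall / s.cycles;
        double rev = (double)s.dram_channel_stall / s.cycles;
        return fwd > threshold || rev > threshold;
    }

    int main()
    {
        NetStats mt = {1000000, 120000, 0};  // bursty writes, one direction only
        printf("network bound: %s\n", network_bound(mt) ? "yes" : "no");
        return 0;
    }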


Case Study: Network Traffic Pattern

This study considered 36 workloads from various domains. Within time frames of 10K cycles, we sampled all the different memory accesses described in the earlier sections. In the following subsections, we look into the overall memory access statistics of 4 memory-access-intensive workloads (those that stall due to network congestion) over their entire execution periods.

Matrix Transpose

This workload demonstrates a heavy, irregular memory access pattern spread over different frames, as shown in Figure 4-2. Traffic is also irregular across the different network nodes. Figure 4-3 shows that most of the accesses are write requests from different shader cores to the external DRAM banks. The read requests also generate a large number of read replies; the two orange patches in the contour plot show bursts of read replies. However, we do not see any interconnect-to-shader-core stalls for this workload. Bursts of write requests (the orange strips in Figure 4-2) generate a significant amount of traffic concentrated in a few of the 10K-cycle frames, and the traffic generated by the different shader nodes is not the same. These figures confirm that the shader-core-to-interconnect stall cycles are in fact due to bursts of network traffic from the shader cores to the memory controllers. We expect that with a larger problem size and simultaneous execution of multiple kernels, this type of traffic burst will spread over the entire execution period. Even closer examination shows that the network traffic is not spread out within the 10K-cycle frames either.
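A naive CUDA transpose of the kind sketched below (our illustration, not the benchmark's source) makes this traffic pattern plausible: each warp reads 32 consecutive elements of an input row (coalesced) but scatters them into 32 different output rows, so the stores leave the shader core as bursts of write requests to many DRAM segments.

    #include <cuda_runtime.h>

    __global__ void transpose_naive(const float *in, float *out,
                                    int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // column in input
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // row in input
        if (x < width && y < height)
            out[x * height + y] = in[y * width + x];    // coalesced read,
                                                        // scattered write
    }

    int main()
    {
        const int W = 1024, H = 1024;                   // contents irrelevant;
        float *d_in, *d_out;                            // only the generated
        cudaMalloc(&d_in, W * H * sizeof(float));       // traffic matters here
        cudaMalloc(&d_out, W * H * sizeof(float));
        dim3 block(16, 16), grid(W / 16, H / 16);
        transpose_naive<<<grid, block>>>(d_in, d_out, W, H);
        cudaDeviceSynchronize();
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }

The optimized SDK variant stages tiles in shared memory to make the writes coalesced as well, which is one software-level way such write bursts can be reduced.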


Breadth First Search

This workload demonstrates a heavy, irregular memory access pattern spread over different frames, as shown in Figure 4-4, and it has a larger request size (10 B/cycle) in most of the network-congested frames than the matrix transpose workload (6 B/cycle). Traffic is also irregular across the different shader nodes. Moreover, the memory controller nodes almost constantly experience DRAM channel stalls caused by interconnect network congestion for read replies. Figure 4-5 confirms that this two-way network (shader core to memory controller and memory controller to shader core) has very high congestion due to write requests and read replies; the workload also issues a large number of write requests. Breadth First Search therefore experiences network congestion from the traffic in both directions.

K-Means

K-Means demonstrates heavy memory accesses spread over its entire execution time, as shown in Figure 4-6. The situation is even worse here, as the maximum average request size grows to 30 B/cycle. Traffic is also irregular across the different shader nodes. However, read replies are evenly spread over the different memory controller nodes and create heavy, constant traffic toward the shader cores. Figure 4-7 shows that most of the accesses are either global or texture read requests; they generate a significant number of global and texture read replies travelling back to the shader cores. It also shows that the traffic generated by the different shader nodes is not the same. The bursts of packets in the network are also prominent in these diagrams, which explains the large number of stall cycles (in both directions) this workload experiences due to network congestion.

64-Bin Histogram

This workload demonstrates heavy read reply traffic from all the different memory controller nodes, as shown in Figure 4-8. Traffic is almost regular across the different shader nodes, with a few yellow patches of relatively large memory requests. The average request size (1 B/cycle) is much lower than in the previous 3 workloads.


There are therefore fewer memory requests inside the 10K-cycle frames. The read replies, however, are concentrated and large (10 B/cycle) and can certainly be expected to congest the traffic toward the shader cores. Figure 4-9 verifies that most of the accesses are global read requests from different shader cores to the external DRAM banks; very few global write requests are present. The large number of read requests also generates a huge amount of read reply traffic⁵. This is consistent with the Figure 4-1 data for this workload, which shows no network congestion stalls for traffic toward the memory controllers.

Network Behavior Characterization

The four workloads discussed above (out of the 13 mentioned) reveal bursts of network traffic as the key cause of network congestion. With the advent of simultaneous execution of multiple kernels on a shader core⁶, we expect an even heavier traffic load on the shader-to-memory-controller interconnection network. Increasing the network bandwidth can reduce this type of congestion. However, to learn more about the nature of the network load in terms of channel bandwidth utilization and packet travel latency⁷, we present a few more metrics in the following paragraphs.

Channel link contention can be assessed by calculating the maximum channel load [57]. A network channel operating at its maximum channel load is expected to eventually congest the network, and maximum channel load acts as a substitute for channel bandwidth [58]. If the network traffic is spread over time, then maximum channel load might not be a good representative of channel bandwidth; for most of our workloads, however, this is not the case.

⁵ Read requests are much smaller than read replies.
⁶ The Fermi generation of GPUs supports simultaneous execution of multiple kernels.
⁷ Represents unit-time behavior.


To calculate the maximum channel load, we computed the maximum of the total number of packets traversing a given network link during the entire execution of the program. Latency behavior is captured by the average latency metric for traffic in both directions (shader core to memory controller and memory controller to shader core). All the workloads presented here experience similar input loads and have comparable execution times, which avoids artifacts in the maximum channel load calculation due to longer execution times. LV, BFS, BS, CS, LIB, MT, NN, PR, HY, KM, MUM, NE, SS, SLA, MRIF, WP, and 64H show high values of maximum channel load (Figure 4-10). Closer examination of some of the workloads with relatively low maximum channel load (with respect to the others) shows that their memory requests and replies are not spread over the entire execution period, and bursts of packets congest the network. For example, matrix transpose shows a lower maximum channel load than the neural network benchmark, yet only matrix transpose suffers network-congestion-related stalls; the case study discussed earlier showed that matrix transpose produces bursts of network traffic. Workloads like NN, SLA, and PR have a high maximum channel load, but their memory requests are distributed over time. To examine packet latency behavior, we calculated the average and maximum latency⁸ for the different workloads, shown in Figures 4-11 and 4-12. Shader core to memory controller traffic consumes relatively more cycles than traffic in the other direction: with 28 shader cores and 8 memory controllers, the traffic toward the memory controllers converges onto a smaller number of nodes, whereas the traffic from the memory controllers to the shader cores diverges.

⁸ The mean and maximum of the average latency are calculated over all the kernels of a workload.
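Both metrics can be computed from per-link and per-packet simulator statistics roughly as follows; the inputs shown are hypothetical.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Maximum channel load: the largest total packet count carried by any
    // single network link over the whole run.
    long max_channel_load(const std::vector<long>& packets_per_link)
    {
        if (packets_per_link.empty()) return 0;
        return *std::max_element(packets_per_link.begin(),
                                 packets_per_link.end());
    }

    // Average packet latency for one traffic direction.
    double average_latency(const std::vector<double>& packet_latency)
    {
        double sum = 0.0;
        for (double l : packet_latency) sum += l;
        return packet_latency.empty() ? 0.0 : sum / packet_latency.size();
    }

    int main()
    {
        std::vector<long> links = {120, 4800, 310, 95};  // hypothetical counts
        printf("max channel load: %ld packets\n", max_channel_load(links));
        std::vector<double> lat = {42.0, 65.5, 51.0};    // hypothetical cycles
        printf("average latency: %.1f cycles\n", average_latency(lat));
        return 0;
    }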


Figure 4-1. Different Types of Shader Stall Cycles
Figure 4-2. Matrix Transpose: Memory Requests per Cycle (in bytes)


Figure 4-3. Matrix Transpose: Different Memory Region Access Statistics
Figure 4-4. Breadth First Search: Memory Requests per Cycle (in bytes)


Figure 4-5. Breadth First Search: Different Memory Region Access Statistics
Figure 4-6. K-Means: Memory Requests per Cycle (in bytes)


Figure 4-7. K-Means: Different Memory Region Access Statistics
Figure 4-8. 64-bin Histogram: Memory Requests per Cycle (in bytes)


Figure 4-9. 64-bin Histogram: Different Memory Region Access Statistics
Figure 4-10. Maximum Channel Load experienced by different workloads


Figure 4-11. Latency Statistics: Shader Core to Memory Controller
Figure 4-12. Latency Statistics: Memory Controller to Shader Core


CHAPTER 5
RELATED WORK

Saavedra et al. [47] demonstrated how workload performance on a new microarchitecture can be predicted from a study of microarchitecture-independent and -dependent workload characteristics. However, that research did not consider the correlated nature of the different characteristics. In [11, 48], Eeckhout et al. demonstrated a technique to reduce simulation time while keeping the benchmark diversity information intact by performing PCA and clustering analysis on correlated microarchitecture-independent workload characteristics. In [12], researchers showed, by collecting data from 340 different machines, that the 26 workloads of the CPU2000 benchmark suite stress only four different bottlenecks. Using the same technique, researchers in [9] found redundancy in the SPEC2006 benchmarks and showed how 6 CINT2006 and 8 CFP2006 benchmarks can represent the whole workload set. In the previously mentioned works, the authors mostly used the following program characteristics: instruction mix, branch prediction, instruction-level parallelism, cache miss rates, sequential flow breaks, etc. In [10], researchers used the same technique to cluster transactional workloads using microarchitecture- and transactional-architecture-independent characteristics such as transaction percentage, transaction size, read/write set size ratio, and conflict density. Our approach differs from all of the above in that we characterize data-parallel GPGPU workloads using several newly proposed microarchitecture-independent characteristics. Very little research [24, 25] has been done on GPGPU workload classification. In [24], researchers characterized 50 different PTX kernels, including the NVIDIA CUDA SDK kernels and the Parboil benchmark suite. That research does


not consider correlation among the various workload characterization metrics. In contrast, we use a standard statistical methodology with a wide range of workload characterization metrics. Moreover, they did not consider diverse benchmark suites like Rodinia and Mars [33]. In [25], the authors used the MICA framework [11] to characterize single-core CPU versions of the GPGPU kernels, which fails to capture the branch divergence, row access locality, memory coalescing, etc., of the numerous parallel threads running on a massively parallel GPU microarchitecture. Moreover, instruction-level parallelism has very little effect on GPU performance due to the simplicity of the processor core architecture. More than data stream size, the data access pattern plays the important role in boosting application performance on a GPU. We capture these overlooked features in our program characteristics to examine GPGPU kernel behavior in detail.


CHAPTER 6
CONCLUSION

The emerging GPGPU workload space has not been explored methodically to understand the fundamentals of the workloads. To understand GPU workloads, this research proposed a set of microarchitecture-independent GPGPU workload characteristics capable of capturing five important behaviors: kernel stress, kernel characteristics, divergence characteristics, instruction mix, and coalescing characteristics. This thesis also demonstrated that the proposed evaluation metrics accurately represent the input characteristics, giving GPU architects a clear understanding of the available GPGPU workload space so that they can design better GPU microarchitectures. My endeavor also demonstrates that the workload space is not properly balanced by the available benchmarks; while the SS, SLA, and PR workloads show significantly different behavior due to their large number of diverse kernels, the rest of the workloads exhibit similar characteristics. We also observe that benchmark suites like Nvidia CUDA SDK, Parboil, and Rodinia behave differently in terms of workload space diversity. This research shows that branch divergence diversity is best captured by SS, SLA, MUM, HS, BN, and NE. SLA, KM, SS, and PR show relatively larger memory coalescing behavior diversity than the rest. SS, BN, KM, MRIF, and HY have distinct instruction mixes. Kernel characteristics are diverse in PR, STO, NQU, SLA, WP, SS, CP, and LIB. I also show that, among the chosen benchmarks, the divergence characteristics are the most diverse and the coalescing characteristics are the least diverse; therefore, we need benchmarks with more diverse memory coalescing behavior. In addition, we show that a simulation speedup of 3.5x can be achieved by removing the redundant benchmarks.


The study of interconnect congestion shows that a number of current workloads suffer from a network bottleneck. The network traffic caused by memory accesses was categorized by type as read requests, read replies, and write requests; these were further divided based on the type of memory access: local, global, and texture memory. Our study conclusively showed that the on-chip network will lead to considerable slowdown in future data-intensive workloads. It is therefore essential to utilize the emerging technologies of 3D integration and optical network-on-chip to best utilize the compute power of the shader cores.


LIST OF REFERENCES

[1] G. Yuan, A. Bakhoda, T. Aamodt, Complexity-Effective Memory Access Scheduling for Many-Core Accelerator Architectures, MICRO, 2009.
[2] W. Fung, I. Sham, G. Yuan, T. Aamodt, Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, MICRO, 2007.
[3] T. Halfhill, Looking Beyond Graphics, Microprocessor Report, 2009.
[4] D. Tarjan, J. Meng, K. Skadron, Increasing Memory Miss Tolerance for SIMD Cores, SC, 2009.
[5] NVIDIA Compute PTX: Parallel Thread Execution ISA, Version 1.4, 2009.
[6] A. Bakhoda, G. Yuan, W. Fung, H. Wong, T. Aamodt, Analyzing CUDA Workloads Using a Detailed GPU Simulator, ISPASS, 2009.
[7] G. Dunteman, Principal Components Analysis, SAGE Publications, 1989.
[8] 1984.
[9] A. Joshi, A. Phansalkar, L. Eeckhout, L. John, Measuring Benchmark Similarity Using Inherent Program Characteristics, IEEE Transactions on Computers, Vol. 55, No. 6, 2006.
[10] C. Hughes, J. Poe, A. Qouneh, T. Li, On the (Dis)similarity of Transactional Memory Workloads, IISWC, 2009.
[11] L. Eeckhout, H. Vandierendonck, K. De Bosschere.
[12] H. Vandierendonck and K. De Bosschere, Many Benchmarks Stress the Same Bottlenecks, Workshop on Computer Architecture Evaluation Using Commercial Workloads, 2004.
[13] J. Henning, SPEC CPU2000: Measuring CPU Performance in the New Millennium, IEEE Computer, pp. 28-35, July 2000.
[14] C. Lee, M. Potkonjak, W. Smith, MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communication Systems, MICRO, 1997.
[15] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, R. Brown, MiBench: A Free, Commercially Representative Embedded Benchmark Suite, WWC, 2001.


[16] S. Woo, M. Ohara, E. Torrie, J. Singh, A. Gupta, The SPLASH-2 Programs: Characterization and Methodological Considerations, ISCA, 1995.
[17] C. Minh, J. Chung, C. Kozyrakis, K. Olukotun, STAMP: Stanford Transactional Applications for Multi-Processing, IISWC, 2008.
[18] C. Bienia, S. Kumar, J. Singh, K. Li, The PARSEC Benchmark Suite: Characterization and Architectural Implications, Princeton University Technical Report, 2008.
[19]
[20] J. Nickolls, I. Buck, M. Garland, K. Skadron, Scalable Parallel Programming with CUDA, ACM Queue 6, 2 (Mar. 2008), 40-53.
[21] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro, Vol. 28, No. 2, 2008.
[22] The CUDA Compiler Driver NVCC, Nvidia Corporation, 2008.
[23] Technical Overview, ATI Stream Computing, AMD Inc., 2009.
[24] A. Kerr, G. Diamos, S. Yalamanchili, A Characterization and Analysis of PTX Kernels, IISWC, 2009.
[25] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S. Lee, K. Skadron, Rodinia: A Benchmark Suite for Heterogeneous Computing, IISWC, 2009.
[26] Parboil Benchmark Suite. URL: http://impact.crhc.illinois.edu/parboil.php.
[27] M. Hughes, Exploration and Play Re-visited: A Hierarchical Analysis, International Journal of Behavioral Development, 1979.
[28] http://www.nvidia.com/object/cuda_sdks.html
[29] P. Harish and P. J. Narayanan, Accelerating Large Graph Algorithms on the GPU Using CUDA, HiPC, 2007.
[30] M. Giles, Jacobi Iteration for a Laplace Discretisation on a 3D Structured Grid, http://people
[31] W. Huang, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, M. R. Stan, HotSpot: A Compact Thermal Modeling Methodology for Early-Stage VLSI Design, IEEE Transactions on VLSI Systems, 14(5), 2006.


[32]
[33] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang, Mars: A MapReduce Framework on Graphics Processors, PACT, 2008.
[34] Billconan and Kavinguy, A Neural Network on GPU, http://www.codeproject.com/KB/graphics/GPUNN.aspx
[35] Pcchen, N-Queens Solver, http://forums.nvidia.com/index.php?showtopic=76893, 2008.
[36] Maxime, Ray Tracing, http://www.nvidia.com/cuda
[37] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu, StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, HPDC, 2008.
[38] E. Sintorn, U. Assarsson, Fast Parallel GPU-Sorting Using a Hybrid Algorithm, Journal of Parallel and Distributed Computing, Volume 68, Issue 10, October 2008.
[39] R. Narayanan, B. Ozisikyilmaz, J. Zambreno, J. Pisharath, G. Memik, A. Choudhary, MineBench: A Benchmark Suite for Data Mining Workloads, IISWC, 2006.
[40] M. Schatz, C. Trapnell, A. Delcher, and A. Varshney, High-Throughput Sequence Alignment Using Graphics Processing Units, BMC Bioinformatics, 8(1):474, 2007.
[41] T. Warburton, Mini Discontinuous Galerkin Solvers, http://www.caam.rice.edu/~timwar/RMMC/MIDG.html, 2008.
[42] S. Manavski, CUDA-Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography, ICSPC, 2007.
[43] J. Michalakes and M. Vachharajani, GPU Acceleration of Numerical Weather Prediction, IPDPS, 2008.
[44] S. Stone, J. Haldar, S. Tsao, W. Hwu, Z. Liang, and B. Sutton, Accelerating Advanced MRI Reconstructions on GPUs, Computing Frontiers, 2008.
[45] StatSoft, Inc., STATISTICA, http://www.statsoft.com/
[46] K. Yeomans and P. Golder, The Guttman-Kaiser Criterion as a Predictor of the Number of Common Factors, The Statistician, Vol. 31, No. 3 (Sep., 1982).
[47] R. H. Saavedra and A. J. Smith, Analysis of Benchmark Characteristics and Benchmark Performance Prediction, ACM Transactions on Computer Systems, 1998.


[48] K. Hoste and L. Eeckhout, Microarchitecture-Independent Workload Characterization, IEEE Micro, 27(3):63-72, 2007.
[49] http://www.khronos.org/opencl/
[50] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan, Larrabee: A Many-Core x86 Architecture for Visual Computing, ACM Transactions on Graphics, 27(3), Aug. 2008, 1-15.
[51] A. Phansalkar, A. Joshi, and L. K. John, Analysis of Redundancy and Application Balance in the SPEC CPU2006 Benchmark Suite, SIGARCH Computer Architecture News, 35(2), Jun. 2007, 412-423.
[52] C.-K. Luk, S. Hong, and H. Kim, Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping, MICRO, 2009.


BIOGRAPHICAL SKETCH

Ramkumar Shankar was born in Chennai, India, in 1986. He completed his Bachelor of Engineering in electrical and electronics in 2007. Soon after, he joined Tata Consultancy Services, a software solution provider, as a software developer. He began his Master of Science program in the Department of Electrical and Computer Engineering at the University of Florida in 2008. In the summer of 2010, he interned at AMD, developing multimedia codecs that use OpenCL for acceleration on graphics processing units.