
Power Performance Co-Optimization of Throughput Architecture Using Emerging Technologies


Material Information

Title:
Power Performance Co-Optimization of Throughput Architecture Using Emerging Technologies
Physical Description:
1 online resource (156 p.)
Language:
english
Creator:
Goswami, Nilanjan
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Li, Tao
Committee Members:
Figueiredo, Renato Jansen
Gordon-Ross, Ann M
Wu, Samuel Shangwu

Subjects

Subjects / Keywords:
architecture -- emerging-technology -- energy-saving -- gpgpu -- power-efficiency -- renewable-energy-architecture -- throughput-computing
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre:
Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
GPUs are emerging as general-purpose high-performance computing devices. Growing GPGPU research is enriching the data-parallel computing capabilities of GPUs in areas such as scientific high-performance computing, financial applications, and medical imaging. However, this growing trend exposes the impending impediment of unbalanced growth between semiconductor technology and throughput architecture evolution. In this dissertation, I explore several techniques that help throughput microarchitecture evolve in step with emerging semiconductor technologies. Beginning with a GPGPU workload exploration initiative, I propose several emerging-technology-based GPGPU microarchitecture designs that reduce the overall energy/power consumption of the chip while minimizing the performance loss of the throughput architecture. I then delve into a power-performance co-optimized GPGPU on-chip interconnect design. Furthermore, I propose a throughput architecture power management system that achieves balanced improvement in workload performance, sustainability, and reliability using a hybrid workload-/supply-driven power management scheme. Finally, I propose ideas for power-performance co-optimization of throughput shader on-chip memory architectures. In the following paragraphs, I briefly highlight these ideas. As a first step, I took a systematic approach to characterizing GPGPU benchmarks and analyzing their implications for GPU microarchitecture design evaluation, which is still lacking. In this research, I propose a set of microarchitecture-agnostic throughput workload characteristics to represent workloads in a microarchitecture-independent space. A correlated dimensionality reduction process and clustering analysis are used to understand these workloads. In addition, I propose a set of evaluation metrics to accurately evaluate the GPGPU design space.
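The dimensionality-reduction-plus-clustering pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the dissertation's implementation: the feature values are invented, and `pc_distance` stands in for the linkage metric a full clustering analysis would use.

```python
import numpy as np

# Hypothetical microarchitecture-agnostic feature vectors for a few GPGPU
# kernels (rows: kernels; columns: characteristics such as instruction mix,
# branch-divergence ratio, coalescing ratio, ...). Values are made up.
features = np.array([
    [0.62, 0.10, 0.85, 0.30],
    [0.58, 0.12, 0.80, 0.35],
    [0.20, 0.70, 0.25, 0.90],
    [0.25, 0.65, 0.30, 0.85],
    [0.90, 0.05, 0.10, 0.50],
])

# Standardize each characteristic, then diagonalize the covariance matrix
# to obtain principal components of the correlated characteristics.
z = (features - features.mean(axis=0)) / features.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
order = np.argsort(eigvals)[::-1]            # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the fewest components that explain at least 90% of the variance.
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.90)) + 1
scores = z @ eigvecs[:, :k]                  # kernels in the reduced PC space

def pc_distance(a, b):
    """Distance between two kernels in PC space (clustering would link on this)."""
    return float(np.linalg.norm(scores[a] - scores[b]))
```

Kernels with nearly identical characteristics (rows 0 and 1 above) land close together in the reduced space, while dissimilar ones separate, which is what lets the clustering step pick a small representative workload subset.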
With the growing number of GPGPU workloads, this approach provides meaningful, accurate, and thorough simulation for a proposed GPU architecture design choice. Architects also benefit by choosing a set of workloads that stresses the intended functional block of the GPU microarchitecture. I present a diversity analysis of GPU benchmark suites such as Nvidia CUDA SDK, Parboil, and Rodinia. Our results show that, with a large number of diverse kernels, workloads such as Similarity Score, Parallel Reduction, and Scan of Large Arrays show diverse characteristics in different workload spaces. I have also explored diversity in different workload subspaces (memory coalescing, branch divergence, etc.). Similarity Score, Scan of Large Arrays, MUMmerGPU, Hybrid Sort, and Nearest Neighbor exhibit relatively large variation in branch divergence characteristics compared to other workloads. Memory coalescing behavior is diverse in Scan of Large Arrays, K-Means, Similarity Score, and Parallel Reduction. In high-performance throughput processors, the interconnect experiences power and energy bottlenecks due to massive parallelism and ever-increasing memory accesses in emerging workloads. These processors have also exposed bandwidth and latency bottlenecks in the on-chip interconnect and off-chip memory access. To eliminate such bottlenecks, we propose silicon-nanophotonics and 3D stacking technologies in the throughput architecture, which provide higher communication bandwidth and lower-latency signaling at reduced power. Furthermore, to insulate the performance of the throughput cores from interconnect bottlenecks, we propose a novel interconnect-aware thread scheduling scheme that alleviates traffic congestion. We further employ dense-wavelength-division-multiplexed (DWDM) optical links between the on-chip memory controller and off-chip DRAM arrays.
To maximize off-chip link utilization, we organize the DRAM chips into independently operating sub-channels, each allocated an equal portion of the total bandwidth. To improve memory access performance, we propose memory request scheduling schemes that efficiently leverage the high-bandwidth off-chip optical data link. We evaluate a 3D-stacked GPU with 2048 SIMD cores and a photonic interconnect. The photonic multiple-write-single-read crossbar network with 32B channel bandwidth achieves, on average, a 96% network power reduction. In addition, it improves average core-to-memory access latency by 59% and 87% for northbound and southbound traffic, respectively. The sub-channel allocation schemes and the thread scheduler improve performance by as much as 48% and 9%, respectively. The overall performance improvement using all of the proposed schemes is 13% on average across the memory-intensive benchmarks, along with the 96% power saving. We anticipate that the implications of the proposed ideas for emerging workloads and microarchitectures are far-reaching. Massively parallel computing on throughput processors such as GPUs requires myriad memory accesses to register files, on-chip scratchpad, caches, and off-chip DRAM. Unlike CPUs, these processors have a large register file and on-chip scratchpad memory, which consume a significant portion of compute core power (35%-45%). In this work, we introduce a novel throughput architecture that integrates resistive memory (Spin Transfer Torque RAM) inside the compute core, which reduces leakage significantly but introduces write power overhead and longer write latencies in GPU shared memory and register file accesses. We enhance the compute core with a register file organization that uses a differential memory update mechanism to remove update redundancy during write operations. Furthermore, using a merged register-write mechanism and a write-back buffer, we coalesce multithreaded GPU register write accesses to save write energy.
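The merged register-write idea can be illustrated with a toy write-back buffer: pending writes are keyed by register location, writes from different threads of a warp to the same location merge in the buffer, and only buffer evictions cost an expensive STT-RAM array write. This is a minimal sketch under assumed names (`WriteBackBuffer`, the `(bank, row)` key, the capacity) that are illustrative, not the dissertation's design.

```python
from collections import OrderedDict

class WriteBackBuffer:
    """Toy write-back buffer that coalesces register writes to the same
    (bank, row) before committing them to a write-costly array (e.g. STT-RAM)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = OrderedDict()   # (bank, row) -> {lane: value}
        self.array_writes = 0          # committed (expensive) array writes

    def write(self, bank, row, lane, value):
        key = (bank, row)
        if key not in self.pending and len(self.pending) == self.capacity:
            self._evict()                                  # make room (FIFO)
        self.pending.setdefault(key, {})[lane] = value     # merge same-row writes

    def _evict(self):
        self.pending.popitem(last=False)   # commit oldest entry to the array
        self.array_writes += 1

    def flush(self):
        while self.pending:
            self._evict()

buf = WriteBackBuffer()
# 32 lanes of a warp writing the same register row coalesce into one array write.
for lane in range(32):
    buf.write(bank=0, row=5, lane=lane, value=lane)
buf.flush()
```

Here 32 per-lane writes cost a single array write after the flush, which is the source of the write-energy saving the abstract describes.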
In addition, we introduce a hybrid shared memory design using SRAM and STT-MRAM that provides significant leakage/dynamic power savings without affecting performance. On average, across 23 GPGPU/graphics workloads, our schemes save 46% of the dynamic power due to register access (and 83% of leakage power) with negligible performance degradation. On average, the hybrid shared memory provides a 10% reduction in dynamic power with up to 1.6× performance improvement for the current workloads at no additional area overhead. Eco-friendly energy sources (i.e., green power) are attracting great attention as lowering the carbon footprint of computing has become a necessity. Existing proposals for managing green-energy-powered systems show sub-optimal results because they either use rigid load power capping or rely heavily on backup power. We propose Chameleon, a novel adaptive green throughput server. Chameleon comprises multiple flexible power management policies and leverages a learning algorithm to select the optimal operating mode at runtime. The proposed design outperforms the state-of-the-art approach by 13% in performance, improves system MTBF by 42%, and still maintains up to 95% green energy utilization.
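The abstract names Chameleon's learning algorithm only at a high level. As one possible illustration of runtime mode selection, the sketch below uses an epsilon-greedy bandit learner; the mode names, the reward signal (green-energy utilization), and epsilon-greedy itself are my stand-ins, not necessarily the dissertation's algorithm.

```python
import random

MODES = ["workload-driven", "supply-driven", "hybrid"]  # illustrative mode set

class ModeSelector:
    """Epsilon-greedy selector: mostly exploit the mode with the best average
    observed reward, occasionally explore the others."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.totals = {m: 0.0 for m in MODES}   # summed rewards per mode
        self.counts = {m: 0 for m in MODES}     # times each mode was tried

    def choose(self):
        # Explore with probability epsilon, or until every mode has data.
        if random.random() < self.epsilon or not all(self.counts.values()):
            return random.choice(MODES)
        # Otherwise exploit: pick the mode with the best average reward.
        return max(MODES, key=lambda m: self.totals[m] / self.counts[m])

    def update(self, mode, reward):
        self.totals[mode] += reward
        self.counts[mode] += 1

sel = ModeSelector(epsilon=0.0)         # pure exploitation once rewards exist
sel.update("hybrid", 0.95)              # pretend 95% green-energy utilization
sel.update("supply-driven", 0.60)
sel.update("workload-driven", 0.60)
best = sel.choose()
```

After the three hypothetical reward updates, the selector exploits the "hybrid" mode, mirroring how a learned policy would settle on whichever power-management mode best balances performance, sustainability, and reliability for the observed supply and load.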
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Nilanjan Goswami.
Thesis:
Thesis (Ph.D.)--University of Florida, 2013.
Local:
Adviser: Li, Tao.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2015-08-31

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0045706:00001



This item has the following downloads:


Full Text

PAGE 1

POWER PERFORMANCE CO-OPTIMIZATION OF THROUGHPUT ARCHITECTURE USING EMERGING TECHNOLOGIES

By

NILANJAN GOSWAMI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

PAGE 2

© 2013 Nilanjan Goswami

PAGE 3

I dedicate the dissertation to my family.

PAGE 4

ACKNOWLEDGMENTS

This dissertation would not have been possible without the guidance and the help of several individuals who in one way or another contributed and extended their valuable assistance in the preparation and completion of this research work. First and foremost, my utmost gratitude to my advisor Dr. Tao Li, whose consistent encouragement I will never forget. Dr. Li has been my guide as I hurdle all the obstacles in the completion of this research work. I would like to thank all my colleagues at IDEAL research laboratory; especially Chao Li, Amer Qouneh, and Zhongqi Li for their constant support and valuable suggestions. I acknowledge the University of Florida High-Performance Computing Center for providing computational resources and support that have contributed to the research results reported within this paper. URL: http://hpc.ufl.edu I thank the professors and staff in the Electrical and Computer Engineering (ECE) Departments for the use of facilities in the laboratory, consultations and moral support. I also thank all the anonymous reviewers of my research work for their insightful suggestions and support. Last but not the least, my family and friends, for staying beside me always, especially during hard times.

PAGE 5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS .... 4
LIST OF TABLES .... 8
LIST OF FIGURES .... 9
ABSTRACT .... 13

CHAPTER

1 INTRODUCTION .... 15
1.1 Throughput Workload Exploration .... 16
1.2 Nanophotonic Throughput Interconnect .... 18
1.3 Resistive Memory Based Shader Core Architecture .... 20
1.4 Renewable Energy Based Throughput Architecture .... 22

2 BACKGROUND .... 25
2.1 Throughput Architecture .... 25
2.1.1 Functional Blocks of Throughput Architecture .... 25
2.1.2 Throughput Shader Pipeline .... 27
2.1.3 Programming Models .... 28
2.2 Silicon Nanophotonics .... 30
2.2.1 Why Nanophotonics in Throughput Architecture? .... 30
2.2.2 An Introduction to Nanophotonics in Throughput System .... 33
2.3 Resistive Memory .... 34

3 EXPLORING THROUGHPUT WORKLOAD BEHAVIOR .... 37
3.1 Understanding GPGPU Kernel Characteristics .... 37
3.1.1 Throughput Kernel Characteristics .... 37
3.1.2 Throughput Workload Analysis and Evaluation .... 42
3.1.2.1 Microarchitecture Impact on Workload Characteristics .... 42
3.1.2.2 GPGPU Workload Classification .... 44
3.1.2.3 GPGPU Microarchitecture Evaluation .... 46
3.1.2.4 GPGPU Workload Fundamental Analysis .... 48
3.1.2.5 GPGPU Workload Characteristics Based Classification .... 50
3.1.2.6 Discussion and Limitation .... 52
3.1.3 Related Work .... 53
3.2 Characterization of Throughput Interconnect Traffic .... 55
3.2.1 Case Studies: Throughput Interconnect Traffic Pattern .... 55
3.2.1.1 Matrix transpose .... 56
3.2.1.2 Breadth first search .... 57
3.2.1.3 K-means .... 58

PAGE 6

3.2.1.4 64 bin histogram .... 60
3.2.2 Traffic Behavior Analysis .... 61
3.3 Characterization of Shader On-chip Memory Accesses .... 64

4 INTEGRATING SILICON NANO-PHOTONICS IN THROUGHPUT ARCHITECTURE .... 68
4.1 Design and Implementation .... 68
4.1.1 An Overview of the Microarchitecture .... 68
4.1.2 Sub-Channel Based Memory Controller .... 75
4.1.3 Interconnect Aware Warp Scheduler .... 78
4.1.3.1 Warp Classification .... 80
4.1.3.2 Deterministic Warp Scheduling .... 81
4.2 Evaluation .... 82
4.2.1 Power Efficient Optical Interconnect .... 82
4.2.2 Impact of Sub-Channel Allocation .... 86
4.2.3 Analysis of Interconnect Aware Scheduler .... 86
4.3 Related Work .... 88

5 RENEWABLE ENERGY BASED THROUGHPUT ARCHITECTURE .... 90
5.1 Chameleon: The Mode-Switching Power Management .... 90
5.1.1 Overview and Rationale .... 90
5.1.2 Chameleon Microcontroller .... 92
5.1.3 Learning-Based Mode Switching .... 93
5.1.4 Solving the Mode Switching Problem .... 96
5.2 Evaluation .... 97
5.2.1 Energy Utilization .... 98
5.2.2 Performance Acceleration .... 100
5.2.3 Reliability Benefits .... 100
5.3 Related Work .... 102
5.3.1 Matching Server Load .... 102
5.3.2 Power Source Management .... 102
5.3.3 Supply-Load Co-Management .... 102

6 ON-CHIP MEMORY OPTIMIZATION OF COMPUTE CORE .... 104
6.1 Resistive Memory Based Compute Core Design .... 104
6.1.1 Arrayed Register File Organization with Differential Memory Update .... 104
6.1.2 Coalesced Register Update .... 107
6.1.3 Hybrid Shared Memory Architecture .... 109
6.1.4 Resistive Read Only Caches .... 112
6.2 Evaluation .... 112
6.2.1 Register Architecture Evaluation .... 113
6.2.2 Hybrid Shared Memory Evaluation .... 116
6.2.3 Evaluation of STT Based On-Chip Caches .... 118
6.2.4 Overall Power and Performance Impact .... 119

PAGE 7

6.3 Related Work .... 120

7 EXPERIMENTAL SETUP AND METHODOLOGY .... 123
7.1 Simulation Infrastructure .... 123
7.1.1 GPGPU-Sim - The General Purpose GPU Simulator .... 123
7.1.2 GPU-PowerSim - The Power Simulator for GPU Architectures .... 125
7.1.3 Throughput Nanophotonic Interconnect Simulator .... 126
7.1.4 GPGPU Temperature Profile Generation .... 128
7.1.5 Resistive Memory Based GPGPU Architecture Simulator .... 129
7.2 GPGPU Workloads .... 132
7.2.1 GPGPU Workload Analysis .... 132
7.2.2 Renewable Energy Based Throughput Architecture .... 132
7.2.3 Workloads for Nanophotonic GPU Evaluation .... 132
7.3 Evaluation Metrics and Methodology of GPGPU Kernel Characterization .... 132
7.3.1 Evaluation Metrics .... 132
7.3.2 Methodology .... 135
7.4 Renewable Energy Supply Traces .... 136

8 CONCLUSION .... 137
8.1 Throughput Workload Exploration .... 137
8.2 Nanophotonic Throughput Interconnect .... 138
8.3 Renewable Energy Based GPU Architecture .... 138
8.4 Resistive Memory Based Shader Core Architecture .... 139

APPENDIX: STATISTICAL METHODS .... 140
REFERENCES .... 142
BIOGRAPHICAL SKETCH .... 156

PAGE 8

LIST OF TABLES

Table page

2-1 Redesigned STT-RAM and baseline SRAM cell parameters .... 36
3-1 GPGPU workload characteristics .... 38
3-2 Generic workload characteristics .... 40
3-3 GPGPU workload subset for 90% variance .... 45
3-4 Average percentage of error as microarchitecture configuration varies .... 48
4-1 Warp classification bit pattern .... 80
7-1 GPGPU-Sim 2.1.1b configuration for workload exploration .... 124
7-2 GPGPU-Sim 2.1.1b on-chip interconnect configuration for workload exploration .... 124
7-3 GPGPU-Sim configurations for renewable energy based GPU .... 125
7-4 GPU configuration (optical/electrical) .... 126
7-5 Optical crossbar configuration .... 126
7-6 Optical loss in different components .... 127
7-7 Power management configurations for renewable energy based GPGPU .... 128
7-8 HotSpot parameters for renewable energy based GPGPU architecture .... 128
7-9 GPGPU-Sim 3.0.1b simulator configuration .... 129
7-10 Synopsis of GPGPU-Sim simulated workloads .... 131
7-11 GPGPU workload synopsis .... 133
7-12 Synopsis of evaluated GPGPU workloads for renewable energy based GPU architecture .... 134
7-13 GPGPU workloads (?: memory intensive) for nanophotonic GPU evaluation .... 134
7-14 GPGPU workload analysis evaluation metrics .... 134
7-15 Evaluated renewable power supply traces .... 136

PAGE 9

LIST OF FIGURES

Figure page

1-1 GPU total power consumption in Watts for LU Decomposition workload .... 15
1-2 GPU total power consumption in Watts for Nearest Neighbor workload .... 16
2-1 Functional Block Diagram of Throughput Microarchitecture .... 25
2-2 GPGPU Shader Core Pipeline .... 27
2-3 Abstraction of CUDA Programming Model .... 28
2-4 GPGPU Chip Temperature Profile .... 32
2-5 Optical Communication on GPGPU System .... 34
2-6 Block Diagram of STT-RAM Cell .... 35
3-1 Dendrogram of all GPGPU workloads .... 41
3-2 Cumulative distribution of variance by PC .... 42
3-3 Dendrogram of workloads (SL, 90% Var) .... 43
3-4 Dendrogram of workloads (CL, 90% Var) .... 43
3-5 Dendrogram of workloads (SL, 70% Var) .... 44
3-6 Dendrogram of workloads (CL, 70% Var) .... 44
3-7 Performance comparison (normalized to total workloads) of different evaluation metrics across different GPU microarchitectures (C1: Config. 1, B: Baseline, C3: Config. 3) .... 46
3-8 Factor loading of PC-1 and PC-2 .... 46
3-9 PC-1 vs. PC-2 scatter plot .... 47
3-10 Factor loading of PC-1 and PC-2 .... 47
3-11 PC-3 vs. PC-4 scatter plot .... 48
3-12 Dendrogram based on divergence characteristics .... 49
3-13 Dendrogram based on instruction mix .... 50
3-14 Dendrogram based on merge miss and row locality .... 50
3-15 Dendrogram based on kernel characteristics .... 51
3-16 Dendrogram based on kernel stress .... 52

PAGE 10

3-17 Various shader stalls in GPGPU interconnect .... 54
3-18 Matrix Transpose: memory requests per cycle (in bytes) .... 56
3-19 Matrix Transpose: different memory region access statistics .... 57
3-20 Breadth First Search: memory requests per cycle (in bytes) .... 57
3-21 Breadth First Search: different memory region access statistics .... 58
3-22 K-Means: memory requests per cycle (in bytes) .... 59
3-23 K-Means: different memory region access statistics .... 59
3-24 64 Bin Histogram: memory requests per cycle (in bytes) .... 60
3-25 64 Bin Histogram: different memory region access statistics .... 61
3-26 Maximum Channel Load experienced by different workloads .... 62
3-27 Latency statistics: shader core to memory controller .... 63
3-28 Latency statistics: memory controller to shader core .... 63
3-29 Dynamic power profile of compute cores .... 65
3-30 Reusability of registers write-back data .... 65
3-31 Percentage register file write (in bits) .... 66
3-32 Shared memory write (in bits) profile .... 66
4-1 Shader core layer layout .... 68
4-2 Cache/Memory Controller Layer Layout .... 69
4-3 Optical layer .... 70
4-4 Crossbar module structure .... 73
4-5 Optically connected memory .... 74
4-6 Cross section schematic of the chip .... 76
4-7 Memory transaction scheduling .... 77
4-8 Flow of operations in warp scheduler .... 79
4-9 Shader to memory controller latency (normalized to electrical mesh 16B, GM: geometric mean) .... 82
4-10 Memory controller to shader latency (normalized to electrical mesh 16B, GM: geometric mean) .... 83

PAGE 11

4-11 Speedup with respect to electrical mesh network 16B channel bandwidth (GM: geometric mean) .... 84
4-12 Network power consumption normalized to electrical mesh 16B channel (GM: geometric mean) .... 85
4-13 Normalized IPC improvement due to sub-channel allocation schemes (GM: geometric mean) .... 85
4-14 IPC and percentage memory stall reduction .... 87
4-15 IPC and percentage memory stall reduction .... 88
5-1 The Chameleon power management framework .... 91
5-2 The control architecture of mode-switching power management .... 93
5-3 The mapping of power management modes to load adaptation behaviors .... 94
5-4 Online-learning control flow and timeline .... 96
5-5 Renewable energy utilization HPLV .... 98
5-6 Renewable energy utilization HPHV .... 98
5-7 Renewable energy utilization TPLV .... 99
5-8 Renewable energy utilization TPHV .... 99
5-9 Energy utilization of W-driven power systems (normalized to S-driven) .... 99
5-10 Comparison of average throughput .... 100
5-11 Comparison of processor lifetime across different power management schemes .... 101
6-1 Differential Memory Update based GPU register unit architecture with STT-RAM register file .... 104
6-2 Microarchitecture of DMU (K=4) with multi-array register bank organization .... 106
6-3 Coalescing register access using write back buffer .... 108
6-4 Register write-back coalescing flowchart .... 109
6-5 Hybrid shared memory architecture .... 111
6-6 Power saving in DMU based (K=32) register file .... 113
6-7 Comparison of power saving in the register file .... 113
6-8 Performance impact of STT-RAM based register .... 115


6-9 Power saving and performance impact of coalesced register update (baseline: pure SRAM) ... 116
6-10 Power saving in hybrid shared memory ............................. 116
6-11 Performance of only-STT shared memory ............................ 117
6-12 Performance improvement using hybrid shared memory ............... 117
6-13 Percentage power savings using STT-RAM based read-only memory in GPU ... 119
6-14 Overall power saving and performance improvement ................. 120
7-1 Implementation block diagram of GPU-PowerSim ...................... 125


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

POWER PERFORMANCE CO-OPTIMIZATION OF THROUGHPUT ARCHITECTURE USING EMERGING TECHNOLOGIES

By Nilanjan Goswami

August 2013
Chair: Tao Li
Major: Electrical and Computer Engineering

GPUs are emerging as general-purpose high-performance computing devices. Growing GPGPU research is enriching the data-parallel computing capabilities of GPUs in areas such as scientific computing, financial applications, and medical imaging. However, this growing trend exposes the impending impediment of unbalanced growth between semiconductor technology and throughput architecture evolution. In this dissertation, I explore several techniques that help throughput microarchitecture evolve in step with emerging semiconductor technologies.

To begin with, I propose a set of microarchitecture-agnostic throughput workload characteristics to represent the workloads in a microarchitecture-independent space. A correlated dimensionality reduction process and clustering analysis are used to understand these workloads. Then, we explored the on-chip interconnect traffic behavior, power dissipation, and energy consumption characteristics of GPGPU workloads.

After understanding workload traits, I delve into exploring power-performance co-optimized architecture design. In throughput processors, the interconnect experiences power and energy bottlenecks due to massive parallelism and ever-increasing memory accesses in emerging workloads. To eliminate such bottlenecks, we propose silicon nanophotonics and 3D stacking technologies for throughput architecture. Furthermore, to insulate the


performance of the throughput cores from the interconnect bottlenecks, we propose a novel interconnect-aware thread scheduling scheme to alleviate traffic congestion.

Unlike CPUs, throughput processors have a large register file and on-chip scratchpad memory, which consume a significant portion of compute core power (35%-45%). Hence, we introduce a novel architecture that integrates resistive memory inside the compute core, which reduces leakage significantly but introduces write power overhead and longer write latencies in GPU shared memory and register file accesses. We enhance the compute core with a register file organization that uses a differential memory update mechanism to remove update redundancy during write operations.

Eco-friendly energy sources attract great attention as lowering the carbon footprint of computing has become a necessity. Existing proposals for managing green-energy-powered systems show sub-optimal results since they either use rigid load power capping or rely heavily on backup power. We propose Chameleon, a novel adaptive green GPGPU system. Chameleon comprises multiple flexible power management policies and leverages a learning algorithm to select the optimal operating mode.


CHAPTER 1
INTRODUCTION

With the increasing number of cores per CPU chip, the performance of microprocessors has increased tremendously over the past few years. However, data-level parallelism is still not well exploited by general-purpose chip multiprocessors for a given chip area and power budget. With hundreds of in-order cores per chip, GPUs provide throughput performance on data-parallel and computation-intensive applications. As throughput-computing devices, GPUs have multiple shader cores composed of thread schedulers, ALUs, load/store units, a large register file, scratchpad memory, caches, etc. Unlike graphics computing, general-purpose GPU computing (GPGPU) exposes shader cores as massively parallel compute cores. Using a high-bandwidth, low-latency on-chip interconnect, compute cores communicate with the on-chip cache and off-chip memory. GPGPUs achieve teraFLOP peak performance by executing thousands of threads in parallel, while consuming large amounts of energy [4]. This trend of throughput computing has pushed the limits of GPU design to optimize its microarchitecture and programming paradigm to achieve higher performance per watt [4, 7], while still lowering average power (around 130 W [4]). We have verified these claims for the Nvidia GTX 470 GPU using current sensors [8] for Nearest Neighbor (NN, 140 W peak) and LU Decomposition (LU, 170 W peak) in Figures 1-1 and 1-2.

Figure 1-1. GPU total power consumption in Watts for LU Decomposition workload.
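Average power and energy follow directly from such sensor traces. The sketch below shows the arithmetic; the sample values and the 1 ms sampling period are made up for illustration, not actual GTX 470 measurements.

```python
# Illustrative: average power and energy from a sampled power trace,
# as one would compute from current-sensor measurements.
# Sample values and the 1 ms period are hypothetical.

def average_power(samples_w):
    """Mean power (W) over a list of sensor samples."""
    return sum(samples_w) / len(samples_w)

def energy_joules(samples_w, period_s):
    """Energy (J): sum of power samples times the sampling period."""
    return sum(p * period_s for p in samples_w)

trace = [132.0, 145.5, 170.0, 168.2, 150.3]  # hypothetical samples (W)
print(average_power(trace))        # mean power of the trace
print(energy_joules(trace, 1e-3))  # energy over 5 ms of samples
```

The same integration, run over a full kernel execution, gives the per-workload energy figures compared later in the dissertation.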


Figure 1-2. GPU total power consumption in Watts for Nearest Neighbor workload.

Heterogeneous microarchitectures consisting of chip multiprocessors and GPUs seem to be a good choice for data-parallel algorithms. Nvidia CUDA [9, 11], AMD Stream [12, 13] and OpenCL [14] programming abstractions have given data-parallel application development a thrust by removing a significant amount of development effort. Furthermore, such models have motivated researchers to explore the optimization opportunities of scientific, financial, biological, and graphical applications on throughput architectures. These applications possess a large amount of data-parallel computation that is efficiently performed by single-instruction-multiple-data (SIMD) style throughput processors such as GPUs.

Furthermore, in recent times, throughput systems (GPGPUs) have inevitably entered the landscape of design for sustainability, as the IT power footprint has become a global concern. According to the Uptime Institute, the 3-year energy expenditure has exceeded the server equipment cost since 2012 [15]. If we continue to rely on conventional fossil-fuel-based electricity, a 1 MW datacenter will cause over 10,000 metric tons of CO2 emissions annually [16]. Driven by the rising energy price and the warning of climate change, industry and academia alike are focusing more attention than ever on eco-friendly sources of power such as wind and solar energy.

1.1 Throughput Workload Exploration

Emerging CPU-GPU heterogeneous multi-core processing has motivated the computer architecture research community to study various microarchitectural designs,


optimizations, and analyses for GPUs, such as an efficient GPU on-chip interconnect arbitration scheme [17], more efficient GPU SIMD branch execution mechanisms [18], and a technique to diverge on a memory miss to better tolerate memory latencies in SIMD cores [19]. However, it is largely unknown whether the currently available GPGPU workloads are capable of evaluating the whole design space. An ideal evaluation mechanism must be accurate, thorough and realistic. Accuracy is provided by the applicability of the deduced conclusions, which should be minimally affected by the chosen benchmarks. A thorough evaluation mechanism covers a large number of diverse benchmarks, where each workload stresses different aspects of the design. Realistic evaluation guarantees that thoroughness can be achieved with fewer simulations: it narrows down the workload simulation space (and time), while keeping the simulation fidelity and the resulting conclusions within an acceptable threshold.

In order to achieve the above goals, we propose a set of GPU microarchitecture-agnostic GPGPU workload characteristics to accurately capture workload behavior and use a wide range of metrics to evaluate the effectiveness of our characterization. We employ principal component analysis [20] and clustering analysis methods [21], which have been shown [22-25] to be effective in analyzing benchmark suites such as SPEC CPU2000 [26], SPEC CPU2006 [27], MediaBench [28], MiBench [29], SPLASH-2 [30], STAMP [31] and PARSEC [32]. This exploration makes the following contributions:

1. We propose a set of GPGPU workload characterization metrics. Using 386 design points, we show that these metrics are independent of the underlying GPU microarchitecture. These metrics will allow GPGPU researchers to evaluate the performance of emerging GPU microarchitectures regardless of their microarchitectural improvements. Though we characterize the GPGPU workloads using the Nvidia GPU microarchitecture [9], the conclusions drawn here are mostly applicable to other GPU microarchitectures such as AMD ATI [12].
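The characterization flow behind these metrics (per-kernel metric vectors, standardization, dimensionality reduction, then clustering) can be sketched as follows. The kernel names and metric values are made up for illustration; the dissertation's actual pipeline uses the full metric set and 386 design points.

```python
import numpy as np

# Illustrative: microarchitecture-agnostic metric vectors per kernel
# -> standardization -> PCA -> k-means clustering of similar kernels.
# Kernel names and metric values are hypothetical.

kernels = ["K1", "K2", "K3", "K4"]
metrics = np.array([   # rows: kernels; cols: e.g. divergence, coalescing, ILP
    [0.9, 0.1, 2.0],
    [0.8, 0.2, 2.2],
    [0.1, 0.9, 7.5],
    [0.2, 0.8, 7.0],
])

# Standardize each metric so no single dimension dominates the distance.
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0)

# PCA via SVD: project onto the top-2 principal components.
u, s, vt = np.linalg.svd(z, full_matrices=False)
pcs = z @ vt[:2].T

# Minimal 2-means in the PCA space (deterministic initial centers).
centers = pcs[[0, 2]]
for _ in range(10):
    labels = np.argmin(((pcs[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([pcs[labels == c].mean(axis=0) for c in (0, 1)])

print(dict(zip(kernels, labels)))  # similar kernels land in the same cluster
```

Kernels in the same cluster stress the same bottlenecks, which is what lets redundant simulations be pruned.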


2. Using the proposed GPGPU workload metrics, we study the similarities between existing GPGPU kernels and observe that they often stress the same bottlenecks. We show that removing redundancy can significantly save simulation time.

3. We provide workload categorization based on various workload subspaces, such as divergence characteristics, kernel characteristics, memory coalescing, etc. We categorize different workload characteristics according to their importance. We also show that the available workload space is most diverse in terms of branch divergence characteristics and least diverse in terms of thread-batch-level coalescing behavior.

1.2 Nanophotonic Throughput Interconnect

Today's state-of-the-art GPUs have several homogeneous in-order cores. The core count is doubling every 18 months or so [33]. The in-order cores are connected to several on-chip memory controllers using an on-chip interconnect [11]. The cores receive data from off-chip memory via the memory controllers. Traditionally, these cores run multiple instances of the same thread concurrently. However, GPUs can also concurrently compute multiple instances of different threads with larger problem sizes [11]. A larger computation load requires a large number of off-chip memory accesses at a significantly higher rate. Unlike multicore CPUs, simultaneously executing threads in throughput architectures generate relatively more memory requests and create interconnect traffic hotspots. In addition, throughput architectures have comparatively more active cores than multicore CPUs, which complicates optimizing the throughput interconnect architecture for bandwidth and latency. Moreover, as the number of cores and the problem size increase, data transfer in the on-chip interconnect will consume more power and dissipate more heat. This will result in temperature hotspots in the throughput interconnect and affect the reliability of the chip. Furthermore, on-chip interconnect and main memory bandwidth requirements are also increasing. According to ITRS projections [34], off-chip memory bandwidth and pin count are not expected to increase at similar rates. Hence, with increased request rates, off-chip memory access latency and low bandwidth will become bottlenecks.
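A back-of-envelope calculation makes the bandwidth gap concrete. All parameter values below (core count, request rate, request size, clock, and the pin-limited supply figure) are assumed for illustration only.

```python
# Illustrative: aggregate off-chip bandwidth demand of a throughput chip
# versus a pin-limited supply. All numbers are hypothetical.

def demanded_bandwidth_gbs(shader_cores, reqs_per_core_per_cycle,
                           bytes_per_req, clock_ghz):
    """Aggregate memory bandwidth demand in GB/s."""
    return shader_cores * reqs_per_core_per_cycle * bytes_per_req * clock_ghz

# Hypothetical chip: 64 shader cores, 0.1 coalesced requests per core
# per cycle, 64 B per request, 1.5 GHz core clock.
demand = demanded_bandwidth_gbs(64, 0.1, 64, 1.5)
pin_limited_supply = 384.0  # hypothetical off-chip bandwidth (GB/s)

print(demand)                       # 614.4 GB/s demanded
print(demand > pin_limited_supply)  # True: requests must queue and stall
```

When demand exceeds the pin-limited supply, requests queue at the memory controllers, which is exactly the latency bottleneck the ITRS projection anticipates.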


Recent advancements in CMOS-process-based silicon nanophotonics have substantially mitigated the power, latency and bandwidth problems [35-38]. 3D stacking technology [39, 40] provides low-latency and high-bandwidth cross-layer communication in a compact form. With significant bandwidth demand in throughput architectures, it is anticipated that power consumption will reach a point at which electrical interconnect and memory subsystem design will become infeasible. On the contrary, optically connected shader cores and a memory interface in a 3D-stacked multi-layer chip seem to be an attractive alternative [35-40].

In this paper, we propose a 3D-stacked throughput architecture based on silicon nanophotonics technology. The chip has a shader core layer, a cache layer, and a built-in optically connected on-chip network layer. The optical network layer enables a DWDM high-speed interconnect for core-memory communication. In addition, on-chip memory controllers communicate with the off-chip memory using DWDM links. To best utilize all the wavelengths available in the multi-wavelength DWDM off-chip link, we propose dividing the memory into independent sub-channels and incorporate wavelength allocation policies for different sub-channels. Furthermore, instruction scheduling (in this case, the issue of memory requests) has a direct impact on on-chip network traffic. In throughput architectures, myriad threads execute the same instruction, which unleashes the possibility of program counter classification after decoding one instruction. Based on network congestion status, we can regulate memory instruction scheduling using this classification information. To this end, we propose a novel instruction scheduling policy that helps alleviate interconnect traffic congestion. This paper makes the following contributions:

1. We propose a 3D-stacked GPU design based on a 16 nm CMOS process. It connects 2048 in-order cores (64 shader cores) and 16 memory controllers through an all-optical DWDM-based crossbar interconnect. We also explore the power consumption implications of several low-power photonic crossbar designs in throughput architectures.
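The arithmetic behind dividing a DWDM link's wavelengths among independent memory sub-channels can be sketched as follows. The wavelength count, per-wavelength bit rate, and sub-channel count are assumed values, not the design parameters of the proposed chip.

```python
# Illustrative: splitting a DWDM off-chip link's wavelengths equally among
# independently operating memory stacks (sub-channels). All numbers are
# hypothetical.

def subchannel_bandwidth_gbs(total_wavelengths, gbps_per_wavelength,
                             subchannels):
    """Per-sub-channel bandwidth (GB/s) with an equal wavelength split."""
    lam = total_wavelengths // subchannels  # wavelengths per sub-channel
    return lam * gbps_per_wavelength / 8.0  # bits -> bytes

# Hypothetical 64-wavelength link at 10 Gb/s per wavelength, divided
# among 4 memory stacks: each stack gets 16 wavelengths.
print(subchannel_bandwidth_gbs(64, 10, 4))  # 20.0 GB/s per sub-channel
```

Because each sub-channel serves a different memory stack, the controller can keep all wavelengths busy even when a single stack is stalled on a row activation.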


2. We introduce a nanophotonic memory interface in GPGPUs that uses DWDM-based high-bandwidth links to connect on-chip memory controllers with the off-chip memory. To best utilize all the wavelengths in the links, we use independently operating off-chip memory stacks and divide the total wavelengths equally among the stacks (sub-channels). To achieve better bandwidth utilization and overall performance, we also propose sub-channel allocation policies.

3. We propose an interconnect-aware thread scheduling policy in the shader core that helps reduce interconnect congestion and hides the memory latency.

1.3 Resistive Memory Based Shader Core Architecture

Compute cores have large on-chip memory and a register file built using CMOS-based SRAM. With growing transistor density in GPU dies, leakage power has begun to dominate the overall chip power in deep-sub-micron CMOS processes. To characterize the power behavior of GPUs, we have developed an architecture-level GPU power model in GPGPU-Sim [41] (see Section 7.1.2). The characterization reveals that, on average across 23 GPGPU workloads, register file and shared memory dynamic power consumption is 44% and 7% of the total core power, respectively. Leakage power is 17% and 20% for the register file and shared memory, respectively. Leakage is an even greater design impediment for sub-32 nm technology nodes. Hence, to reduce overall compute core power consumption, it is imperative to re-architect various memory components of the compute core using emerging technologies.

The latest developments in magnetic tunnel junction (MTJ) based spin-transfer torque magnetic random access memory (STT-MRAM) demonstrate almost zero leakage power, better scalability, smaller footprint, non-volatility, and radiation resistance [42, 43]. These improvements suggest that STT-MRAM may be a suitable replacement for on-chip SRAM-based memory. Unlike SRAM, resistive memories rely on non-volatile, resistive information storage in a cell, and thus exhibit near-zero leakage in the data array [43]. Moreover, according to ITRS projections, STT-MRAM is expected to replace SRAM for on-chip memory designs [44]. Recently, Everspin Technologies [45] has officially announced commercialization of the world's first STT-MRAM chip. In


addition, IBM, Samsung and Grandis Inc. are currently actively working on the STT-RAM commercialization process. STT-MRAM provides almost 4x more cell density and similar read latency compared to SRAM [46, 47]. STT-MRAM does not have a write endurance problem (10^15 writes [48]). Furthermore, STT-MRAM is capable of high-performance operation and is CMOS-process compatible, which makes it suitable for a wide range of applications [49]. However, STT-MRAM suffers from longer write latencies and higher write energies compared to SRAM [42]. To write a 0 or 1 into an STT-RAM cell, a strong current is necessary to reverse the magnetic direction of the storage node (Magnetic Tunnel Junction, or MTJ). The latency and energy overhead associated with the write operations of these emerging memories has become a major obstacle to their widespread adoption [50].

In this work, we propose architectural enhancements in GPU compute cores to recuperate the possible performance and energy losses due to STT-MRAM writes; hence, we achieve significant leakage power savings with negligible write latency and energy overhead. As throughput-computing devices, GPUs execute multiple instances of identical or different instruction streams (GPGPU kernels) simultaneously to achieve higher overall performance, similar to graphics applications. Therefore, to preserve the overall throughput of graphics and GPGPU applications while lowering the overall power budget, we propose architectural enhancements in STT-MRAM-based GPU on-chip memory design that hide the higher write latency of STT-MRAM-based memory while decreasing overall write energy by using an arrayed register file design. The key idea is to reduce write accesses to the STT-MRAM-based memory at the granularity of updated register arrays; explicitly unmodified register arrays are not written. Furthermore, using small SRAM-based write-back buffers, we coalesce (wider update) multiple register writes from different threads in GPGPU thread-batches within a single register bank to reduce the register write energy. To address shared memory performance degradation due to slower STT-MRAM writes, we introduce a hybrid shared memory


architecture based on SRAM and STT-MRAM. Because different applications exhibit temporal locality, spatial locality or a mixture of both in shared memory accesses, we propose SRAM configurable as either cache or scratchpad and dedicate STT-MRAM exclusively to scratchpad use. Heavily write-intensive applications with inherent temporal locality access the STT-MRAM scratchpad through the SRAM cache. For other applications, the combined STT-MRAM and SRAM scratchpad reduces write latency. We obtain this hybrid design without any additional overhead, thanks to the area saving of the STT-MRAM design. Our work makes the following contributions:

1. A novel STT-MRAM design customized for GPUs to achieve lower write latency, lower energy, higher performance, and higher endurance at the cost of lower retention time (see Section 2.3).

2. An STT-MRAM-based GPU register file design that reduces leakage power and the overall energy footprint of the compute cores. To recuperate the STT-MRAM write power overhead, we have designed a register file that is comprised of several arrays. Instead of updating the entire register during a write operation, we update an array within the register only if any bit within that array is flipped.

3. An SRAM/STT-MRAM-based hybrid shared memory design to obtain leakage/dynamic power savings due to resistive memory, while keeping the performance intact.

4. A method to reduce the overall energy consumption and leakage power by augmenting the shader core design with STT-MRAM-based read-only on-chip caches.

1.4 Renewable Energy Based Throughput Architecture

There have been several emerging system designs that exploit renewable energy. For example, HP has announced its carbon-free datacenter prototype that is entirely powered by onsite green energy sources. Similarly, eBay is experimenting with a small datacenter that integrates a 665 kW solar array. Recently, Apple has also patented several solar-energy-driven electronic devices and renewable energy power management circuitry. For these systems, minimizing power consumption is no longer the sole aim of the design. Instead, they emphasize efficient use of renewable energy resources to lower the reliance on conventional utility power.


The time-varying nature of renewable power supplies poses unconventional challenges for system power management. Due to their sensitivity to power disturbances, computing systems must maintain a continuous balance between the fluctuating renewable power budget and the variable IT power demand. To cope with this issue, recent studies have pursued two power management approaches, which we refer to as energy-oriented design and performance-oriented design. The former approach emphasizes matching load power demand to the power budget to maximize the benefit of renewable energy usage. The latter approach, on the other hand, leverages backup power (i.e., battery or utility grid) to maintain desired workload performance when green energy generation is inadequate.

Nevertheless, neither approach provides a desirable trade-off between workload performance and energy efficiency. When the green power drops significantly or becomes intermittently unavailable, the energy-oriented design often puts the computer system into low-power states and therefore compromises performance. In contrast, the performance-oriented design suffers from degraded energy utilization, since the typical round-trip energy efficiency of a lead-acid battery is only 75% [51]. Furthermore, biased power management schemes can cause various troubles beyond the performance and efficiency problems. For example, an avid user of energy-oriented design can experience thermal cycling (TC) issues due to the excessive on/off power cycles during aggressive load matching [52]. This issue refers to repeated heating and cooling of sections of the microelectronics and can lead to permanent device failures (e.g., cracks on circuit boards). On the other hand, relying heavily on energy backup to handle the stochastic workload surges and supply sags increases the required energy storage capacity, which adds to the overall cost of infrastructure.

In this study we tackle the above challenges using a unique hybrid power management strategy. The main idea is to gracefully switch between the two mutually exclusive power


management policies to boost system performance and enhance reliability without sacrificing overall renewable energy usage effectiveness. We propose Chameleon, an adaptive many-core system that explores intelligent adaptation of power management policy for maximizing the benefits of green energy usage. Our system comprises multiple power management schemes, and each scheme is referred to as a power management mode. During operation, Chameleon learns from the feedback-based system-environment interaction and selects the optimal power management mode based on previous experience and the observed outcome. We term this power management design mode-switching power management (MSPM). The advantage of Chameleon is two-fold. First, as it integrates multiple power management schemes, it can overcome the shortcomings of relying on any single one of them. Second, the online learning capability enables the system to better adapt to the diverse and complex operating environment. This paper makes the following contributions:

1. We propose a novel mode-switching power management (MSPM) mechanism to harvest the benefits of different renewable power management schemes. It enables green servers to adapt themselves to the time-varying renewable power budget automatically and efficiently. We show that MSPM can achieve up to 95% energy utilization depending on the actual renewable energy resource availability.

2. We explore a learning algorithm to help Chameleon achieve high adaptability in various environmental conditions. We show that the Chameleon server outperforms the state-of-the-art design by 12.8% on system throughput. We also feed the system with load power variability information to make it reliability-aware. By curtailing unnecessary power tuning activities, Chameleon can increase the processor MTBF by 42% on average.
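The feedback-driven mode selection at the heart of MSPM can be sketched as a simple bandit-style learner: pick a mode, observe a reward, update that mode's value estimate, and exploit the best-known mode. The two modes, the reward signal, and the exploration rate below are hypothetical; Chameleon's actual online-learning controller is more elaborate.

```python
import random

# Illustrative epsilon-greedy sketch of mode-switching power management:
# the modes, reward function, and epsilon are hypothetical stand-ins.

random.seed(0)
MODES = ["energy_oriented", "performance_oriented"]
value = {m: 0.0 for m in MODES}  # running reward estimate per mode
count = {m: 0 for m in MODES}

def pick_mode(step, eps=0.1):
    if step < len(MODES):
        return MODES[step]                    # try every mode once first
    if random.random() < eps:
        return random.choice(MODES)           # explore occasionally
    return max(MODES, key=value.__getitem__)  # exploit best-known mode

def observe(mode, reward):
    count[mode] += 1
    value[mode] += (reward - value[mode]) / count[mode]  # incremental mean

# Hypothetical feedback loop: under this made-up reward signal, the
# performance-oriented mode happens to pay off more.
for step in range(200):
    m = pick_mode(step)
    observe(m, 1.0 if m == "performance_oriented" else 0.4)

print(max(MODES, key=value.__getitem__))
```

In a real deployment the reward would fold together throughput, renewable energy utilization, and the power-cycling cost that drives the thermal cycling and MTBF concerns above.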


CHAPTER 2
BACKGROUND

2.1 Throughput Architecture

2.1.1 Functional Blocks of Throughput Architecture

Figure 2-1. Functional Block Diagram of Throughput Microarchitecture.

As a general-purpose processor, the GPU (GPGPU/throughput architecture) exposes its shader cores as a SIMD processing engine with augmented support for minimizing control flow divergence and optimizing the memory access mechanism. This architecture is a power- and performance-efficient data-parallel computing processor. Figure 2-1 shows the block diagram of a traditional GPGPU architecture. The GPU has multiple shader cores that are connected to the on-chip L2 cache/memory controllers through a high-bandwidth, low-latency on-chip interconnection fabric. Usually multiple shader cores share a single texture processing engine (which can be used as a parallel data pre-fetcher) and expose a single on-chip network interface; the shader core cluster is referred to as a texture processing cluster (TPC). The shader core is the main SIMD


processing engine and has several functional blocks (referred to as streaming processors, SPs) such as integer/floating-point ALUs, load/store units, special functional blocks, atomic units, etc. Traditionally these SPs are used to perform extremely data-parallel computations such as matrix multiplication, matrix transpose, etc. The shader core also has a hardware thread scheduler that picks the appropriate instruction per thread-batch (warp) in every scheduling cycle. Several caches (instruction, data, texture memory, constant memory) provide data and instructions to the instruction decoder unit, which initiates the proper functional units (SPs).

Unlike CPUs, GPGPU processors have a large register file and on-chip scratchpad memory, which is referred to as shared memory. Due to the large register file, concurrently executing threads keep their thread contexts active in the register file; hence, simultaneous execution of thousands of threads does not incur thread context switching overhead. On-chip shared memory provides an excellent mechanism to share common data among several concurrently executing threads. It also avoids unnecessary bandwidth wastage by avoiding repetitive data transfer during execution.

Usually multiple threads (warps) execute simultaneously in the GPU in lock-step fashion. Every thread scheduling cycle, the fetch/schedule unit chooses the instruction PC based on warp formation and scheduling policy. The instruction cache provides the instruction to be decoded based on the resource dependency information obtained from the scoreboard. Due to the lock-step execution paradigm, only one instruction is decoded per thread-batch, and individual thread indexes provide access to the thread-specific data. Based on instruction type, the appropriate functional blocks are executed simultaneously for all the threads in the thread-batch. Load/store units access the on-chip shared memory, or the data request travels through the on-chip interconnect and memory controller to get the off-chip memory data. Load/store units also implement atomic read-modify-write operations and memory access coalescing behavior. Multiple consecutive accesses to the same DRAM row are coalesced to generate a single off-chip memory request. This kind of access behavior provides a performance and power benefit


for accessing rows and columns of matrices, where consecutive elements are batched together to generate a single request. Memory controllers communicate with the off-chip DRAM arrays using GDDR3/GDDR5 buses. The host CPU usually connects to the GPU using a PCI Express bus and operates in synchronous or asynchronous fashion based on the application. Based on GPU hardware resource constraints, the host application launches several GPU threads with the same instruction behavior and variable data access patterns.

2.1.2 Throughput Shader Pipeline

Figure 2-2. GPGPU Shader Core Pipeline.

Figure 2-2 illustrates the simplified compute core pipeline. Usually multiple threads (warps) execute simultaneously in the compute cores. The thread scheduler fetches one or more instructions for a thread-batch (warp) in every scheduling cycle. The instruction cache provides the instruction to the decode unit. The decoder decodes the instruction and places it in the instruction buffer. Based on resource dependency information from the scoreboard, the instruction issue unit gathers the operands (Figure 2-2, left) from the register file unit to issue the instruction to the appropriate SP. Different


operand collectors gather operands from different register banks and provide them to different dispatch units [53]. The SP receives the operands from the dispatchers. The load/store unit accesses the shared memory, or travels through the interconnect and memory controller to access the off-chip memory. Memory controllers communicate with the off-chip DRAM arrays using GDDR3/GDDR5 buses. The host CPU usually connects to the GPU using a PCI Express bus and operates in synchronous or asynchronous mode. Based on GPU hardware resource constraints, the host application launches several GPU threads with the same instruction behavior and variable data access patterns. To expedite thread context switching, individual GPU threads maintain individual register usage and cooperate with each other while accessing shared memory.

2.1.3 Programming Models

Figure 2-3. Abstraction of CUDA Programming Model.

Execution of general-purpose programs on heterogeneous GPU/CPU architectures is realized by different application programming interfaces (APIs) such as CUDA (Nvidia) [9] and OpenCL [14]. Using these APIs we can launch thousands of parallel


threads onto the GPU device from the host CPU. However, recent GPU architectures support thread launch from GPU kernels as well. This might provide an opportunity to overlap multiple stages of the cause and state inference processes from different computation blocks. Figure 2-3 shows the block diagram of the typical GPGPU programming model. The execution paradigm generates a hierarchy of threads that are grouped together at each layer. At the top layer, threads are grouped together to form a GPGPU kernel based on execution similarity; they all share an instruction sequence that operates on different data elements. These kernels can execute simultaneously to overlap different independent stages of the execution. In the second tier, threads within a kernel are grouped together to form the resource-sharing abstraction of thread blocks. These blocks execute on a single shader core of the GPU and share the computational resources, such as on-chip shared memory and caches, among themselves. The threads within a block synchronize either implicitly at the end of execution or explicitly after a barrier instruction. This provides a mechanism to interface different stages of execution with a consistent view of the memory. Finally, at the bottom layer, threads within a block are clubbed together to form thread-batches (Nvidia calls them warps). Usually the thread count within a thread-batch is an integer multiple of the total SIMD lanes in the shader core. Multiple thread-batches are active at a given point of time during execution; overlapping these batches provides memory access latency hiding opportunities while executing extremely data-parallel applications that often require fetching off-chip data.

General-purpose programming abstractions, such as Nvidia CUDA and AMD Stream, are often used to facilitate GPGPU application development. In this paper, we have used the Nvidia CUDA programming model, but some of the basic constructs will hold for most programming models. Using CUDA, thousands of parallel and concurrent lightweight threads can be launched by the host CPU, and those are grouped together in an execution entity called blocks. Different blocks are grouped into an entity called a


grid. During execution, a particular block is assigned to a given SM. Blocks running on different SMs cannot communicate with each other. Based on the available resources in an SM, one or more blocks can be assigned to the same SM. All the registers of the SM are allocated to different threads of different blocks. This leads to faster context switching among the executing threads. Usually 32 threads from a block are grouped into a warp that executes the same instruction. However, this number can vary depending upon the GPU generation. Threads in a GPU execute based on the single-instruction-multiple-thread (SIMT) [10] model. The execution of branch instructions may cause some threads in a warp to jump, while others fall through. This phenomenon is called warp divergence. Threads in a diverged warp execute in serial fashion. Diverged warps possess inactive thread lanes. This reduces the thread activity of the kernel below 100%. Kernels with less divergent warps work efficiently for both independent scalar threads as well as data-parallel coordinated threads in SIMT mode. Load/store requests issued by different threads in a warp get coalesced in the load/store unit into a single memory request according to their access pattern. Memory coalescing improves performance by hiding the individual smaller memory accesses in a single large memory access. This phenomenon is termed intra-warp memory coalescing. Threads in a block execute in synchronized fashion.

2.2 Silicon Nanophotonics

2.2.1 Why Nanophotonics in Throughput Architecture?

Workloads in throughput architectures generate myriad memory requests due to the data-parallel execution paradigm of thousands of threads. Moreover, off-chip memory accesses often cannot be avoided due to limited on-chip cache and scratchpad memory sizes. Since off-chip DRAM access speed is not improving at a fast rate over the generations, it is imperative to design a low-latency on-chip interconnect to reduce overall off-chip access overhead. Such an interconnect design will complement the


architectural optimizations of memory accesses, such as inter-thread memory access coalescing, to reduce overall off-chip latency and bandwidth demand. Unlike throughput processors, traditional multicore architectures execute fewer threads simultaneously and inter-thread access coalescing is limited. In addition, the active core count in throughput architectures is much higher than in multicore architectures. Hence, the bandwidth requirement is drastically different for the two architecture genres. Instead of improving individual thread performance as in CPUs, throughput architectures mostly rely on the overall throughput of the thousands of simultaneously executing threads. To meet the throughput demand of existing and emerging throughput workloads, interconnect bandwidth improvement is imperative. Moreover, prospective throughput architectures will include inter-shader cache coherence. With numerous active cores and thousands of concurrent threads, coherence traffic will easily surpass the existing throughput interconnect bandwidth and latency specifications; eventually it will supersede traditional multicore interconnect traffic demand. Since bandwidth-latency scalability is limited, power and heating problems will restrict large electrical interconnect designs for deep submicron technology nodes. ITRS predicts silicon nanophotonics to be one of the promising future on-chip communication media. Based on the bandwidth, latency, and traffic demand in throughput architectures, silicon nanophotonics becomes even more apt as a network-on-chip design choice than it is for multicore systems.

Our investigation of interconnect load for various workloads reveals bursty traffic behavior. It suggests that the shader cores execute several memory instructions at a stretch for a large number of threads. For example, a shader core cache miss generates interconnect traffic when the core executes a thread-batch (Nvidia: warp; AMD: wavefront) and encounters a memory instruction. Even though throughput workloads are optimized for memory access coalescing, there are several throughput workloads in which algorithm behavior and memory size compel non-contiguous memory accesses. In such scenarios, serving a cache miss will not result in a cache hit for the rest of the threads


in the warp. Moreover, in case of a memory miss, a new warp is scheduled to execute on the SIMD pipeline due to the SIMD lock-step execution paradigm. For a large number of threads, it is expected that the next warp will also execute the same instruction. Hence, it is anticipated that random memory access patterns and repeated execution of memory instructions will cause on-chip network congestion. To avoid such scenarios, the warps should be scheduled in interleaved fashion, so that an instruction causing on-chip network congestion is avoided when non-memory instructions are available in the thread scheduler. We have proposed a deterministic hardware solution to this problem.

Figure 2-4. GPGPU Chip Temperature Profile.

Furthermore, emerging GPGPU and graphics workloads are expected to place more computation load on shader cores and will exert more traffic on the on-chip interconnect and memory controllers. Figure 2-4 shows a HotSpot 5.0 [54] based temperature profile (Matrix Transpose benchmark) of a GPU [11] with 16 shader cores and 6 memory controllers connected using a 2D-mesh interconnect. It asserts that on-chip network and memory controller components dissipate the maximum power (heat). Moreover, according to ITRS projections [34], 80% of chip power will be consumed by all-metal-layer interconnects. Recently, chip architects have explored alternative high-throughput,

PAGE 33

low latency, and low power on-chip communication media, such as nanophotonic interconnects. Silicon nanophotonic interconnects consume static and dynamic power during operation. Static power consumption is due to carrier laser generation, and dynamic power consumption is due to switching of the optical switching elements. Usually, static power loss becomes a power overhead in multicore interconnects, unless memory bound workloads are requesting data transfers regularly and keeping the network busy. On the contrary, in throughput architectures, data transfer is more frequent even for compute bound workloads; memory bound workloads seldom leave the network free. This phenomenon amortizes the static power overhead over the application runtime. Eventually, integrating silicon nanophotonics in throughput architectures improves the energy-per-bit rating compared to multicore CPUs. To this end, we propose a photonic GPU that leverages optical media for on/off-chip data transfer.

2.2.2 An Introduction to Nanophotonics in Throughput System

As shown in Figure 2-5, the basic blocks of a photonic throughput architecture are: a laser source (not shown), optical waveguides, and ring resonators (active and passive). Using total internal reflection, the optical waveguide passes the light wave by confining its energy within the waveguide. The waveguide is composed of a high refractive index core and a low refractive index cladding. In a conventional CMOS process, the waveguide is typically made using silicon crystal (refractive index: 3.5) as the core and silicon dioxide (refractive index: 1.5) as the cladding. Closed-ring-shaped resonators are also made using the same material as the waveguides. Active and passive are the two different types of ring resonators available. A passive ring resonator's resonant frequency cannot be modified, as the refractive index of the resonator is fixed during fabrication. In an active ring resonator, the resonant frequency can be altered electrically by changing its refractive index. This property is used to perform signal modulation, detection and injection.
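The static-power amortization argument above can be made concrete with a back-of-the-envelope sketch: the laser's static power is paid regardless of traffic, so a busier link earns a better energy-per-bit rating. This is illustrative only; the power and traffic numbers below are invented and are not taken from the dissertation's models.

```python
# Sketch of static-power amortization on a photonic link (invented numbers).
def energy_per_bit(static_power_w, dynamic_j_per_bit, bits, runtime_s):
    """Total link energy (static + dynamic) divided by bits transferred."""
    return (static_power_w * runtime_s + dynamic_j_per_bit * bits) / bits

busy = energy_per_bit(0.5, 1e-12, bits=8e11, runtime_s=1.0)  # heavily used link
idle = energy_per_bit(0.5, 1e-12, bits=8e9, runtime_s=1.0)   # lightly used link
print(busy < idle)  # True: more traffic amortizes the laser's static power
```

This is why the frequent data transfers of throughput workloads favor nanophotonics more than sparse multicore coherence traffic does.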
Figure 2-5. Optical Communication on GPGPU System.

The use of an optical network-on-chip (NoC) and off-chip DWDM links enables multiplexing of multiple requests in a single cycle. Such multiplexing will lead to lower average stall time and increased overall bandwidth. Simultaneous scheduling of multiple memory accesses that fetch data from different sub-channels, banks and rows of the memory array can recover the DRAM row activation and pre-charge penalties. The memory controller can efficiently distribute the available bandwidth among different memory requests to minimize overall memory access latency. Furthermore, the use of emerging technologies such as 3D die stacking will reduce the on-chip data access latencies by using vertical communication paths when possible.

2.3 Resistive Memory

The STT-MRAM memory cell comprises an MTJ and an access transistor (Figure 2-6). The MTJ device consists of two magnetic layers separated by a thin dielectric material. The dielectric acts as a tunnel for moving current between the magnetic layers. The access transistor is a CMOS device, and the MTJ magnetic material is grown over the source and drain regions of the transistor. The MTJ device stores a 0 or 1 based on the direction of the free layer relative to the pinned layer. If the direction of both magnetic layers is the same, the MTJ exhibits low resistance and represents the 0 state. Contrasting magnetic directions represent 1. Therefore, to perform a write operation on the MTJ,
either a positive or a negative voltage should be applied between the top and the bottom electrodes for writing a 0 or 1, respectively. A transistor and an MTJ cell are coupled through word-lines and bit-lines to form memory arrays. Each cell is read by driving the appropriate word-line (WL) that connects the relevant MTJ to the bit-line (BL) and source line (SL). When a small bias voltage (0.1 V) is applied across the WL and BL, the current passing through the MTJ is sensed using a current sensing amplifier connected to the BL.

Figure 2-6. Block Diagram of STT-RAM Cell.

Unlike SRAM, MTJ based resistive memory has high read speed, unlimited read and write endurance, and good compatibility with CMOS processes. Unlike the 6T CMOS-based SRAM cell, the leaky CMOS based access transistor of the STT-MRAM cell has no gate or sub-threshold leakage due to the grounded SL, BL and WL lines. The MTJ device of STT-MRAM acts as a resistor only. Therefore, STT-MRAM consumes almost no leakage power compared to SRAM. However, the STT-MRAM cell shows faster read access and slower write access compared to the SRAM cell. The read speed is determined by three factors: first, how fast the capacitive WL can be charged to turn on the access transistor; second, how fast the BL can be raised to the required read voltage to sample the read-out current; and finally, how fast the sense amplifier reads. On the contrary, the write operation requires activating the access transistor and applying a comparatively higher voltage that can generate enough current to modify the spin of
Table 2-1. Redesigned STT-RAM and baseline SRAM cell parameters.

Parameters                     STT-RAM        SRAM
Cell size                      9F²            50F²
Switching current              54 µA
Switching time                 5 ns
Write energy                   0.58 pJ/bit    0.32 pJ/bit
MTJ resistance (Rlow/Rhigh)    1500/3000 Ω
Retention time                 0.1 ms
Write latency                  1.4 ns         0.77 ns

the free layer. The MTJ performs such an operation in three different modes: under a thermal activation mode through the application of a long, low-amplitude current pulse (>10 ns); under a dynamic reversal regime with intermediate current pulses (3-10 ns); or in a precessional switching regime with a short (<3 ns), high-amplitude current pulse [19]. In a 1T-1MTJ cell with a fixed size MTJ, a trade-off exists between volatility and write-time. In our design, we sacrifice the non-volatility of the STT-MRAM based register file and shared memory. Unlike [55], we have redesigned the MTJ cell (see Table 2-1) with a smaller free layer by using a relaxed non-volatility mechanism [43]; a smaller technology node (22 nm) compared to [55] makes the free layer even smaller. It provides a much smaller cell size and a 33% reduction in write energy. This results in reduced retention time, which is still long (0.1 ms, i.e., 125,000 cycles at a 1.25 GHz clock) enough for the on-chip memory in a GPU, where data is updated much faster. However, special purpose long term registers such as program counters, stack pointers etc. are implemented using SRAM to address the data remembrance issue.
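The retention margin quoted above follows directly from the retention time and the clock frequency. The check below is a trivial sketch; `retention_cycles` is a hypothetical helper, not part of any simulator used in this work.

```python
# Sanity-check the retention margin from Table 2-1: 0.1 ms at a 1.25 GHz clock.
def retention_cycles(retention_s: float, clock_hz: float) -> int:
    """Number of clock cycles for which the cell retains data without refresh."""
    return int(round(retention_s * clock_hz))

print(retention_cycles(0.1e-3, 1.25e9))  # 125000
```

Since on-chip register file and shared memory contents are typically overwritten well within 125,000 cycles, the relaxed retention is acceptable for this design point.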
CHAPTER 3
EXPLORING THROUGHPUT WORKLOAD BEHAVIOR

3.1 Understanding GPGPU Kernel Characteristics

3.1.1 Throughput Kernel Characteristics

Emerging GPGPU benchmarks are not well understood, because these workloads are significantly different from conventional parallel applications that utilize control-dominated heavyweight threads. GPGPU benchmarks are inherently data parallel with lightweight kernels. Until now, the effects of different workload characterization parameters have not been explored to understand the fundamentals of GPGPU benchmarks. In this research, I propose and characterize GPGPU workload characteristics from a microarchitectural viewpoint. Table 3-1 lists the features that specify the GPGPU specific workload characteristics. Table 3-2 lists the generic workload behaviors. Index in Table 3-1 refers to the dimensions of the data in the input matrix (129×38). To capture the overall workload stress on the underlying microarchitecture, I have used dynamic instruction count, which is the total number of instructions executed on a GPU inclusive of all streaming multiprocessors. The dynamic instruction count per kernel provides the per-kernel instruction count of the workload. Average instruction count per thread captures the average stress applied to each streaming processor during kernel execution. To express the degree of parallelism, I propose the thread count per kernel and total thread count metrics. The above mentioned metrics represent the kernel stress subspace. Instruction mix is captured by counting floating-point instructions, integer instructions, special instructions, and memory instructions. These parameters provide an insight into the usage of different functional blocks in the GPU. The inherent capability of hiding memory access latency is encapsulated within arithmetic intensity. Memory type based classification is captured using memory instruction count, shared memory instruction count, texture memory instruction count, constant memory instruction count,
Table 3-1. GPGPU workload characteristics.

1       Special instruction count (Isop): Total number of special functional unit instructions executed
2       Parameter memory instruction count (IPar): Total number of parameter memory instructions
3       Shared memory instruction count (Ishd): Total number of shared memory instructions
4       Texture memory instruction count (Itex): Total number of texture memory instructions
5       Constant memory instruction count (Iconst): Total number of constant memory instructions
6       Local memory instruction count (Iloc): Total number of local memory instructions
7       Global memory instruction count (Iglo): Total number of global memory instructions
8-17    Percentage of divergent branches (Dbra): Ratio of divergent branches to total branches
18-27   Number of divergent warps (Dwrp): Total number of divergent warps in a benchmark
28-37   Percentage of divergent branches in serialized section (Bser): Ratio of divergent branches to total branches inside a serialized section
38-47   Length of serialized section (lserial): Instructions executed between a divergence and a convergence point
48-57   Thread count per kernel (Tkernel): Count of total threads spawned per kernel
58      Total thread count (Ttotal): Count of total threads spawned in a workload
59-63   Registers used per thread (Reg): Total number of registers used per thread
64-68   Shared memory used per thread (Shmem): Total amount of shared memory used per thread
69-73   Constant memory used per thread (cmem): Total amount of constant memory used per thread
74-78   Local memory used per thread (lmem): Total amount of local memory used per thread
79      Bytes transferred from host to device (H2D): Bytes transferred from host to device using the cudaMemcpy() API
80      Bytes transferred from device to host (D2H): Bytes transferred from device to host using the cudaMemcpy() API
81      Kernel count (Kn): Number of kernel calls in the workload
82-91   Row locality (R): Average number of accesses to the same row in DRAM
92-101  Merge miss (Mm) [41]: Cache misses/uncached accesses that can be merged with an ongoing request
local memory instruction count, and global memory instruction count. Local memory instruction count and global memory instruction count encapsulate the off-chip DRAM access frequency, which radically improves or degrades the performance of the workload. Branch instruction count captures the control flow behavior. Due to warp divergence, the frequency of branch instructions plays a significant role in characterizing the benchmarks. Barrier instruction count and atomic instruction count show the synchronization and mutual exclusion behavior of the workload, respectively. Branch and divergence behaviors are characterized by the percentage of divergent branches, number of divergent warps, percentage of divergent branches in a serialized section, and length of the serialized section. Branches due to loop constructs in a workload or thread-level separate execution paths do not contribute to warp divergence. The number of threads in the taken path or fall-through path defines the length of the serialized section. Number of divergent warps is defined as the total number of divergent warps during the execution of the benchmark. It expresses the amount of SIMD lane inactivity in terms of idle SPs in an SM. Length of serialized section is the number of instructions executed between a divergence point and the corresponding convergence point. Thread level resource utilization information is retrieved using metrics such as registers per thread, shared memory used per thread, constant memory used per thread and local memory used per thread. For example, in an SM, registers are allocated to a thread from a pool of registers; the minimum granularity of thread allocation is a block of threads. Too many threads per block may exhaust the register pool. The same holds true for shared memory. Thread-level local arrays are allocated into off-chip memory arrays, designated by local memory used per thread. CPU-GPU communication information is gathered from bytes transferred from host to device and bytes transferred from device to host. The former provides a notion of off-chip memory usage in the GPU. The scenario when the data in the GPU is computed by itself without using any input data from the CPU is distinct from the scenario when the GPU processes the data
received from the CPU and returns it back to the CPU. Kernel count gives the number of different parallel instruction sequences present in the workload. These metrics provide the kernel characteristics subspace.

Table 3-2. Generic workload characteristics.

102      Dynamic instruction count (Itotal): Dynamic instruction count for all the kernels in a workload
103-112  Dynamic instruction count per kernel (Ikernel): Per kernel split-up of the dynamic instructions executed in a workload
113-122  Average instruction count per thread (Iavg): Average number of dynamic instructions executed by a thread
123      Floating point instruction count (Ifop): Total number of floating point instructions executed in a given workload
124      Integer instruction count (Iop): Total number of integer instructions executed in a given workload
125      Memory instruction count (Imop): Total number of memory instructions
126      Branch instruction count (Ib): Total number of branch instructions
127      Barrier instruction count (Ibar): Total number of barrier synchronization instructions
128      Atomic instruction count (Iato): Total number of atomic instructions
129      Arithmetic intensity (Ai): Arithmetic and logical operations per memory operation across all kernels

In order to capture intra-warp and inter-warp memory access coalescing, I use row locality [17] and merge miss [41] as described in Table 3-1. Merge miss reveals the number of cache misses or uncached accesses that can be merged into another in-flight memory request [41]. These accesses can be coalesced into a single memory access, and memory bandwidth can thereby be improved. DRAM row access locality in the data access pattern is captured before it is shuffled by the memory controller. I assume GDDR3 memory in this study. With different generations of GDDR memory these numbers may slightly differ, but the metric still remains microarchitecture agnostic. These two metrics comprise the coalescing characteristics subspace. Some of the previously mentioned parameters are correlated with each other. For example, if a workload exhibits a large integer instruction count, it will also show a large dynamic instruction count. If the percentage of divergent branches in a serialized section is
high, then it is probable that the length of the serialized section will also be high. This nature of high data correlation justifies the use of PCA analysis (see Appendix A).

Figure 3-1. Dendrogram of all GPGPU workloads.
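The variance-retention criterion used in this analysis (keeping the leading principal components that cover 70% or 90% of the total variance) can be sketched as follows. The variance profile below is invented for illustration; the real analysis uses the eigenvalue spectrum obtained from the PCA of the workload-characteristics matrix.

```python
# Pick how many leading PCs to keep for a target retained-variance fraction.
def pcs_to_retain(variance_fractions, target):
    """variance_fractions: per-PC fraction of total variance, in decreasing order."""
    total = 0.0
    for i, v in enumerate(variance_fractions, start=1):
        total += v
        if total >= target:
            return i
    return len(variance_fractions)

# Hypothetical variance profile, for illustration only.
profile = [0.30, 0.15, 0.10, 0.08, 0.07, 0.06, 0.05, 0.04] + [0.15 / 12] * 12
print(pcs_to_retain(profile, 0.70))
```

The same routine applied at both thresholds reproduces the style of trade-off discussed next: a small PC set for a coarse view, a larger one when 90% of the variance must be retained.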
3.1.2 Throughput Workload Analysis and Evaluation

3.1.2.1 Microarchitecture Impact on Workload Characteristics

Figure 3-2. Cumulative distribution of variance by PC.

We verify that GPGPU workload characteristics are independent of the underlying microarchitecture. If the microarchitecture has little or no impact on the set of benchmark characteristics, then benchmarks executed on different microarchitectures will be placed close to each other in the uncorrelated principal component domain. Figures 3-2 and 3-1 show the distribution of the variance obtained in different PCs and the dendrogram of all the benchmarks. According to the Kaiser Criterion [56], we would need to consider all principal components up to PC-20. Yet we have decided to consider the first 7 principal components, which retain 70% of the total variance, and the first 16 principal components, which retain 90% of the total variance. In Figure 3-1 we see that executions of the same benchmark on different microarchitectures have small linkage distance, which demonstrates strong clustering. This suggests that irrespective of the varied microarchitecture configurations
(e.g. shader cores, register file size, shared memory size, interconnection network configuration etc.), our proposed metrics are capable of capturing GPGPU workload characteristics. Note that in Figure 3-1 some benchmarks' execution on a particular microarchitecture is missing, due to a lack of microarchitectural resources of the GPU or simulation issues. For example, Config-2 (Tables 7-1 and 7-2) is incapable of executing the benchmark-problem size pairs of STO, NQU, HY and WP.

Figure 3-3. Dendrogram of workloads (SL, 90% Var).

Figure 3-4. Dendrogram of workloads (CL, 90% Var).
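A minimal sketch of the single-linkage (SL) and complete-linkage (CL) agglomerative clustering behind these dendrograms, in pure Python. The 2-D points stand in for the benchmarks' principal-component scores and are invented; a production analysis would use a statistics package rather than this O(n³) greedy loop.

```python
# Greedy agglomerative clustering with single or complete linkage.
def cluster(points, k, linkage="single"):
    """Merge clusters until k remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def cluster_dist(ca, cb):
        pair = [dist(points[i], points[j]) for i in ca for j in cb]
        # Single linkage: nearest pair. Complete linkage: farthest pair.
        return min(pair) if linkage == "single" else max(pair)

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(cluster(pts, 3, "single"))  # [[0, 1], [2, 3], [4]]
```

Running both linkages over the same scores and comparing the resulting partitions is exactly the artifact check described in the next subsection.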
3.1.2.2 GPGPU Workload Classification

Figure 3-5. Dendrogram of workloads (SL, 70% Var).

Figure 3-6. Dendrogram of workloads (CL, 70% Var).

The program-input pairs defined as workloads in Table 7-11 show significantly distinct behavior in the PCA domain. The hierarchically clustered principal components are shown in Figures 3-3, 3-4, 3-5, and 3-6. To avoid hierarchical clustering artifacts, we have generated both single linkage and complete linkage clustering dendrograms [57]. We
have observed that for the sets of 4 and 6 workloads, single linkage and complete linkage clustering show identical results for both the 70% variance and 90% variance cases. The benchmarks in the set of 4 are SS, SLA, PR and NE. The set of 6 includes SS, SLA, PR, BN, KM and NE. For set sizes of 8 and 12, we have found that single linkage and complete linkage clustering deviate by 1 and 2 workloads respectively for the 90% variance case. Table 3-3 depicts the results; the entries that differ between single and complete linkage are NE vs. MRIF for Set 8, and BP, NE vs. 64H, MUM for Set 12. To represent a cluster, we choose the benchmark closest to the centroid of that cluster. The results for 70% of total variance are presented in Figures 3-5 and 3-6. The dendrograms also highlight that the Rodinia, Parboil and Nvidia CUDA SDK benchmark suites demonstrate diversity in decreasing order according to the available workloads. Table 3-3 shows the time savings obtained for the different sets in the form of speedup. Out of all the benchmarks, SS and 64H contribute 26% and 29% of the total simulation time respectively. Speedup numbers are in fact higher without SS. SS demonstrates very distinct characteristics from the rest, such as high instruction count, low arithmetic intensity, very high kernel count, diverse inter-warp coalescing behavior, branch divergence diversity, different types of kernels and a diverse instruction mix. As expected, overall speedup decreases as we increase the set size. Due to the inclusion of 64H, the set of 12 in complete linkage clustering does not show high speedup. Architects can choose the set size to achieve the trade-off between available simulation time and the amount of accuracy desired.

Table 3-3. GPGPU workload subset for 90% variance.

Set 4:  SL: SS, SLA, PR, NE (speedup 3.60); CL: SS, SLA, PR, NE (speedup 3.60)
Set 6:  SL: SS, SLA, PR, KM, BN, NE (speedup 2.38); CL: SS, SLA, PR, KM, BN, NE (speedup 2.38)
Set 8:  SL: SS, SLA, PR, KM, BN, LIB, MRIQ, NE (speedup 1.80); CL: SS, SLA, PR, KM, BN, LIB, MRIQ, MRIF (speedup 1.53)
Set 12: SL: SS, SLA, PR, KM, BN, LIB, MRIQ, MRIF, STO, HS, BP, NE (speedup 1.51); CL: SS, SLA, PR, KM, BN, LIB, MRIQ, MRIF, STO, HS, 64H, MUM (speedup 1.05)
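The representative-selection rule described above (pick the benchmark closest to the cluster centroid) can be sketched as follows; the PCA scores below are invented for illustration.

```python
# Select the cluster member nearest the centroid as the cluster representative.
def representative(cluster_points):
    """Return the index (within the cluster) of the point nearest the centroid."""
    dims = len(cluster_points[0])
    centroid = [sum(p[d] for p in cluster_points) / len(cluster_points)
                for d in range(dims)]

    def dist2(p):
        return sum((p[d] - centroid[d]) ** 2 for d in range(dims))

    return min(range(len(cluster_points)), key=lambda i: dist2(cluster_points[i]))

scores = [(0.0, 0.0), (2.0, 0.0), (1.2, 0.1)]  # centroid is roughly (1.07, 0.03)
print(representative(scores))  # 2
```

Only the representatives need to be simulated, which is the source of the speedup figures reported in Table 3-3.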
Figure 3-7. Performance comparison (normalized to total workloads) of different evaluation metrics across different GPU microarchitectures (C1: Config. 1, B: Baseline, C3: Config. 3).

Figure 3-8. Factor loading of PC-1 and PC-2.

3.1.2.3 GPGPU Microarchitecture Evaluation

Figure 3-7 shows the performance comparison of different evaluation metrics as the GPU microarchitecture varies. Table 3-4 shows the average error observed. Due to a lack of computing resources and simulation issues with some benchmarks (e.g. SS, STO, HY, NQU), configurations such as Config-2, Config-4 and Config-5 are not considered. This should not affect the results, as the benchmark characteristics are microarchitecture agnostic. The maximum average error for activity factor is less than 17%. It reveals that
Figure 3-9. PC-1 vs. PC-2 scatter plot.

Figure 3-10. Factor loading of PC-3 and PC-4.

the subsets are capable of representing the branch divergence characteristics present in the whole set. Divergence based clustering shows that branch divergence behavior is diverse. Therefore, a small set of 4 is relatively less accurate in capturing divergence behavior. As we increase the set size to 6 and 8, we reach an average error as low as 1%. The average error in SIMD parallelism suggests that I am capturing the kernel level parallelism through activity factor and dynamic instruction count, with a maximum error of less than 18%. Increasing the set size also decreases the average error for SIMD parallelism.
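A sketch of how the per-metric errors in Table 3-4 can be computed, assuming the error is the mean absolute percentage deviation of a metric measured on the subset from the same metric measured on the full workload set, across microarchitecture configurations. All values below are invented.

```python
# Mean absolute percentage error of subset-derived metrics vs. full-set metrics.
def avg_pct_error(full_set_values, subset_values):
    """One value per microarchitecture configuration; returns the mean % error."""
    errors = [abs(s - f) / f * 100.0
              for f, s in zip(full_set_values, subset_values)]
    return sum(errors) / len(errors)

full = [0.80, 0.60, 0.50]    # e.g. activity factor on three configurations
subset = [0.76, 0.60, 0.51]  # same metric evaluated on a workload subset
print(round(avg_pct_error(full, subset), 2))  # 2.33
```

Averaging this quantity over the three evaluation metrics gives the bottom row of Table 3-4.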
DRAM efficiency shows that coalescing characteristics, memory request and memory transfer behaviors are captured closely, with a maximum error of less than 6%. We also observe that an increase in subset size decreases the average error.

Table 3-4. Average percentage of error as microarchitecture configuration varies.

Evaluation metric    Set 4   Set 6   Set 8
Activity factor      16.7    5.5     1.1
SIMD parallelism     17.2    6.0     0.5
DRAM efficiency      5.2     5.0     2.5
Average              13.1    5.5     1.4

Figure 3-11. PC-3 vs. PC-4 scatter plot.

3.1.2.4 GPGPU Workload Fundamental Analysis

In this subsection we analyze the workloads from the point of view of input characteristics. Architects are often interested in identifying the most influential characteristics to tune the performance of a workload on the microarchitecture. The different PCs are arranged in decreasing order of variance. Hence, by observing their factor loadings we can infer the impact of the input characteristics. We choose to analyze the top 4 principal components, which account for 53% of the total variance. Figure 3-8 shows the factor loadings for principal components 1 and 2, which account for 39% of total variance. Figure 3-9 shows the scatter plot. SS, SLA, PR,
Figure 3-12. Dendrogram based on divergence characteristics.

GS, SRAD, 64H, HS, and KM show relatively lower arithmetic intensity than the others. SS, SLA, PR, GS, SRAD, 64H, HS, and KM have a large number of threads spawned, a small number of instructions in a kernel, a moderate merge miss count, and good row locality. The exceptions are CS and HY, which possess a large kernel count. Moreover, the workloads excluding these are relatively less diverse in the previously mentioned behavioral domain. Close examination of the simulation statistics verifies that the benchmarks except SS, SLA, PR, GS, SRAD, 64H, and KM have good arithmetic intensity, a large number of kernels and kernel level diversity. There are no benchmarks with very high arithmetic intensity. SS exhibits high integer instruction count, host to device data transfer, low branch divergence, and per-thread shared memory usage. In contrast, SLA demonstrates low merge miss, good row locality and branch divergence. The simulation statistics verify the observed branch divergence, merge miss, and row locality. Figures 3-10 and 3-11 show the PC-3 vs. PC-4 (14% of total variance) factor loadings and scatter plot. PC-3 shows that PR has a large number of kernels, low merge miss and moderate branch divergence. BN, SLA, MRIF and MRIQ possess moderate branch divergence and heavyweight threads. BN, MRIQ, MRIF, KM, and PR
Figure 3-13. Dendrogram based on instruction mix.

program-input pairs have long serialized sections due to branch divergence (large lserial), high barrier and branch instruction counts, heavyweight threads and more computation.

Figure 3-14. Dendrogram based on merge miss and row locality.

3.1.2.5 GPGPU Workload Characteristics Based Classification

In Section 3.1 we have mentioned several subspaces, such as the instruction mix subspace, coalescing characteristics subspace, and divergence characteristics subspace,
Figure 3-15. Dendrogram based on kernel characteristics.

to categorize the workload characteristics. In Figures 3-12, 3-13, 3-14, 3-15 and 3-16, we present dendrograms generated from the principal components (90% of total variance) obtained from PCA analysis of the workloads based on those different subspaces. Architects interested in improving branch divergence hardware might want to simulate SS, SLA, MUM, HS, BN and NE, as they exhibit a relatively large amount of branch divergence diversity. On the other hand, SS, BN, KM, MRIF, MRIQ, and HY have distinct behavior in terms of instruction mix. These are valuable benchmarks for evaluating the effectiveness of a design with ISA optimization. Interestingly, the workloads excluding SLA, KM, SS and PR show similar types of inter-warp and intra-warp memory coalescing behavior. Hence, SLA, KM, SS and PR provide diversity to evaluate microarchitecture level memory access optimizations. It is possible, though, that workloads with similar inter- and intra-warp coalescing have large variations in cache behavior when the cache size increases beyond that used to measure intra-warp coalescing. Figure 3-15 shows the static kernel characteristics of the different benchmarks. PR, STO, SLA, NQU, WP, SS, CP and LIB show distinct kernel characteristics. SS, BN, LIB, MRIF, MRIQ, WP, BP and NE demonstrate diverse
behavior in terms of instructions executed by each thread, total thread count, total instruction count etc.

Figure 3-16. Dendrogram based on kernel stress.

3.1.2.6 Discussion and Limitation

This research explores PTX translated GPGPU workload diversity in terms of different microarchitecture independent program characteristics. The GPU design used in our research closely resembles Nvidia GPUs. However, the growing GPU industry has several other microarchitecture designs (AMD-ATI [13], Intel Larrabee [58]), programming models (ATI Stream [12], OpenCL [14]) and ISAs (ATI intermediate language, x86, Nvidia Native Device ISA, ATI Native Device ISA). The microarchitecture independent characteristics proposed in this research are by and large applicable to ATI GPUs, though they use a different virtual ISA (ATI IL). For example, SFU instructions may be specific to Nvidia, but AMD-ATI also processes these instructions in a transcendental unit inside the thread processor. Branch divergence and memory coalescing behaviors of the kernels are characteristics of the workload and do not affect the results produced here for ATI GPUs. However, for the Intel Larrabee architecture, these characteristics will be different because of dissimilarity in the programming model. Also, the Larrabee
microarchitecture is different from traditional GPUs. Hence, for Larrabee the suitability of using the proposed metrics to represent workload characteristics needs further exploration. PTX ISA changes are minor over the generations; therefore they have an insignificant effect on the results. In addition, PTX instruction mapping to different generations of native Nvidia ISAs and programming model changes have little impact on the GPGPU kernel characteristics, because we characterize the kernels in terms of the PTX virtual ISA.

3.1.3 Related Work

Saavedra et al. [59] demonstrated how workload performance on a new microarchitecture can be predicted using the study of microarchitecture independent and dependent workload characteristics. However, that research did not consider the correlated nature of the different characteristics. In [24, 60] Eeckhout et al. demonstrated a technique to reduce simulation time while keeping the benchmark diversity information intact by performing PCA and clustering analysis on correlated microarchitecture independent workload characteristics. In [25] researchers have shown that the 26 CPU2000 benchmark suite workloads only stress four different bottlenecks, by collecting data from 340 different machines. Using the same technique, researchers in [23] have found redundancy in the SPEC2006 benchmarks. They showed how 6 CINT2006 and 8 CFP2006 benchmarks can be representative of the whole workload set. In the previously mentioned works, the authors mostly used the following characteristics of the program: instruction mix, branch prediction, instruction level parallelism, cache miss rates, sequential flow breaks etc. In [22] researchers have used the same technique to cluster transactional workloads by using microarchitecture and transactional architecture independent characteristics like transaction percentage, transaction size, read/write set size ratio, conflict density etc. My approach differs from all of the above, as I characterize data parallel GPGPU workloads using several newly proposed microarchitecture independent characteristics.
There has been very little research [1, 61] done on GPGPU workload classification. In [61] researchers have characterized 50 different PTX kernels, which include NVIDIA CUDA SDK kernels and the Parboil benchmark suite. That research does not consider correlation among the various workload characterization metrics. In contrast, I use a standard statistical methodology with a wide range of workload characterization metrics. Moreover, they have not considered diverse benchmark suites like Rodinia and Mars [62]. In [1] the authors have used the MICA framework [24] to characterize single-core CPU versions of the GPGPU kernels, which fails to capture the branch divergence, row access locality, memory coalescing etc. of numerous parallel threads running on a massively parallel GPU microarchitecture. Moreover, instruction level parallelism has very little effect on GPU performance due to the simplicity of the processor core architecture. More than data stream size, the data access pattern plays an important role in boosting application performance on a GPU. I capture these overlooked features in my program characteristics to look into GPGPU kernel behavior in detail.

Figure 3-17. Various shader stalls in GPGPU interconnect.
3.2 Characterization of Throughput Interconnect Traffic

To know whether a workload faces a bottleneck in the interconnection network, I calculate the shader core and DRAM channel stall cycles due to interconnect network congestion¹. Figure 3-17 reveals that 13 workloads demonstrate a significant amount of stalls in memory traffic (shader core to memory controller and memory controller to shader core). BFS, FWT, LIB, MT, NN, PNS, RAY, SRAD, KM, MUM, NE, SS and 64H are the workloads that experience a network bottleneck. Out of those, 5 workloads show network congestion in both directions. However, there are another 14 workloads that experience memory stage stalls in the shader core pipeline, yet do not experience network congestion. The reason is that sufficiently interleaved memory accesses² do not congest the network. Therefore, 13 GPGPU workloads have bursts of network traffic at irregular intervals. It is expected that with larger GPGPU workload input sizes and simultaneous execution of multiple kernels, the situation will aggravate even more. While lowering the memory stage stall cycles of the shader core pipeline requires increasing the arithmetic intensity of GPGPU applications (programming effort and algorithmic change), GPU architects can mitigate the interconnection network stalls by proper design optimization. I believe that for efficient design optimization of the interconnection network, I need a close examination of the traffic behavior for a few of the heavy-traffic workloads. Section 3.2.1 provides such details.

3.2.1 Case Studies: Throughput Interconnect Traffic Pattern

I have experimented with 36 workloads from various domains. In time frames of 10k cycles I have sampled all the different memory accesses. In the following sections, I

¹Shader core to interconnect stalls and interconnect to shader core stalls.
²Applications with high arithmetic intensity (ALU instructions per memory access) demonstrate such behavior.
look into the overall memory access statistics for 4 memory access intensive (stalls due to network congestion) workloads during the entire execution period.

Figure 3-18. Matrix Transpose: memory requests per cycle (in bytes).

3.2.1.1 Matrix transpose

This workload demonstrates a heavy irregular memory access pattern spread over different frames, as shown in Figure 3-18. Traffic is also irregular from the different network nodes. Figure 3-19 shows that most of the accesses are write requests from different shader cores to the external DRAM banks. Read requests also generate a large number of read replies. The two patches of orange color in the contour plot show bursts of read replies. However, I do not see any interconnect to shader core stalls for this workload. Bursts of write requests (orange strips in Figure 3-18) generate a significant amount of traffic concentrated in a few of the 10K cycle frames. It also shows that the traffic generated from different shader nodes is not the same. These figures confirm that the shader core to interconnect stall cycles are in fact due to bursts of network traffic from shader core to memory controller. I expect that with larger problem sizes and simultaneous execution of multiple kernels, this type of traffic burst will be spread over the entire execution period. Even closer examination shows that, inside the 10K cycle frames, the network traffic is not spread out.
Figure 3-19. Matrix Transpose: different memory region access statistics.

Figure 3-20. Breadth First Search: memory requests per cycle (in bytes).

3.2.1.2 Breadth first search

This workload demonstrates a heavy irregular memory access pattern spread over different frames, as shown in Figure 3-20. This workload has a larger request size (10 B/cycle) for most of the network congested frames compared to the matrix transpose workload (6 B/cycle). Traffic is also irregular from the different shader nodes. However, memory controller nodes almost constantly experience DRAM channel stalls due to
interconnect network congestion for read reply requests. Figure 3-21 asserts that the network in both directions (shader core to memory controller and memory controller to shader core) has very high congestion due to write requests and read replies. This workload also has a large number of write requests. Breadth First Search experiences network congestion from the traffic in both directions.

Figure 3-21. Breadth First Search: different memory region access statistics.

3.2.1.3 K-means

K-Means demonstrates heavy memory accesses spread over the entire execution time, as shown in Figure 3-22. The situation is even worse as the maximum average request size grows to 30 B/cycle. Traffic is also irregular from the different shader nodes. However, read replies are evenly spread over the different memory controller nodes and create heavy constant traffic towards the shader cores. Figure 3-23 shows that most of the accesses are either global or texture read requests. They generate a significant number of global and texture read replies traveling back to the shader cores. It also shows that the traffic generated from different shader nodes is not the same. The burst of packets in the network is also prominent in these diagrams. It explains the large number of stall cycles (in both directions) experienced by this workload due to network congestion.
Figure 3-22. K-Means: memory requests per cycle (in bytes).

Figure 3-23. K-Means: different memory region access statistics.
3.2.1.4 64 bin histogram

This workload demonstrates heavy read reply traffic from all the different memory controller nodes, as shown in Figure 3-24. Traffic is almost regular from the different shader nodes, with a few yellow patches of relatively large amounts of memory requests. The average request size (1 B/cycle) is much lower than in the previous 3 workloads. Therefore, inside the 10K cycle frames the memory requests are fewer. Read replies are concentrated and huge in size (10 B/cycle). They are certainly expected to congest the traffic towards the shader cores. Figure 3-25 verifies that most of the accesses are global read requests from different shader cores to the external DRAM banks. Very few global write requests are also present. The large number of read requests also generates a huge amount of read reply traffic. This is consistent with the Figure 3-17 data for this workload, which shows that no network congestion stalls are experienced for traffic towards the memory controller.

Figure 3-24. 64 Bin Histogram: memory requests per cycle (in bytes).
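The 10K-cycle frame sampling used throughout these case studies can be sketched as a simple bucketing of a memory-request trace into fixed-size cycle frames, reporting average bytes per cycle in each frame; the event trace below is invented.

```python
# Bucket (cycle, bytes) memory-request events into fixed cycle frames.
def frame_rates(events, frame_cycles=10_000):
    """events: iterable of (cycle, bytes); returns {frame index: bytes/cycle}."""
    totals = {}
    for cycle, nbytes in events:
        frame = cycle // frame_cycles
        totals[frame] = totals.get(frame, 0) + nbytes
    return {f: b / frame_cycles for f, b in sorted(totals.items())}

trace = [(100, 64), (9_500, 128), (12_000, 64), (35_000, 256)]
print(frame_rates(trace))  # {0: 0.0192, 1: 0.0064, 3: 0.0256}
```

A frame whose bytes/cycle spikes far above the run average is exactly the kind of traffic burst visible as the orange strips in the contour plots.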
Figure3-25. 64BinHistogram:differentmemoryregionaccessstatistics. 3.2.2TrafcBehaviorAnalysisFourworkloadsdiscussedabove(outof13mentioned)revealsburstofnetworktrafcasthekeyreasonofnetworktrafccongestion.WiththeadventofsimultaneousexecutionofmultiplekernelsinashadercoreIexpecttohaveevenmoretrafcloadontheshader-to-memorycontrollerinterconnectionnetwork.Withtheincreaseinnetworkbandwidth,thistypeofnetworkcongestioncanbereduced.However,tolearnmoreaboutthenatureofthenetworkloadintermsofchannelbandwidthutilizationandpackettravellatency,Ipresentfewmoremetricsinthefollowingparagraphs.Channellinkstatusintermsofcontentioncanbedecidedbycalculatingmaximumchannelload[ 63 ].Thisisexpectedthatanetworkchannelwithmaximumchannelloadwilleventuallycongestthenetwork.Maximumchannelloadactsasasubstituteforchannelbandwidth[ 64 ].Ifthenetworktrafcisspreadovertimethenmaximumchannelloadmightnotbeagoodrepresentativeofchannelbandwidth.However,formostofourworkloadsthisisnottrue.TocalculatemaximumchannelloadIhavecalculatedthemaximumoftotalnumberofpacketstraversedinagivennetworklinkduringthetotalexecutionoftheprogram.Latencybehavioriscapturedintheaveragelatencymetric 61


for both directions of traffic (shader core to memory controller and memory controller to shader core).

Figure 3-26. Maximum channel load experienced by different workloads.

Figure 3-26 shows the maximum channel load experienced by different workloads. All the workloads presented here experience similar input load and have comparable execution times, which avoids artifacts in the maximum channel load calculation caused by longer execution time. LV, BFS, BS, CS, LIB, MT, NN, PR, HY, KM, MUM, NE, SS, SLA, MRIF, WP, and 64H show high maximum channel load. Closer examination of a few workloads with relatively low maximum channel load (with respect to the others) shows that their memory requests and replies are not spread over the entire execution period, and bursts of packets congest the network. For example, matrix transpose shows a lower maximum channel load than the neural network benchmark, yet only matrix transpose suffers network-congestion-related stalls. From Section 3.2.1.1, I see that matrix transpose exhibits bursts of network traffic, whereas workloads like NN, SLA, and PR have high maximum channel load but memory requests that are distributed over time.
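The maximum-channel-load computation described above reduces to counting packet traversals per link and taking the maximum; a minimal sketch (the link names and the trace are hypothetical, not taken from the simulator):

```python
from collections import Counter

def max_channel_load(packet_traversals):
    """packet_traversals: one link id per packet-link traversal recorded
    over the whole program execution. Returns (link, load) for the most
    heavily used network link."""
    loads = Counter(packet_traversals)      # total packets per link
    link, load = loads.most_common(1)[0]    # busiest link
    return link, load

# Hypothetical trace: each entry is the link a packet crossed.
trace = ["sc0->mc1", "sc0->mc1", "sc3->mc1", "sc0->mc1", "sc2->mc0"]
print(max_channel_load(trace))  # sc0->mc1 carries the most packets
```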


Figure 3-27. Latency statistics: shader core to memory controller.

Figure 3-28. Latency statistics: memory controller to shader core.


To characterize packet latency behavior, I have calculated the average and maximum latency for different workloads in Figure 3-27 (shader core to memory controller traffic) and Figure 3-28 (memory controller to shader core traffic). Shader core to memory controller traffic consumes relatively more cycles than the reverse direction. With 28 shader cores and 8 memory controllers, the traffic toward the memory controllers converges onto fewer nodes, whereas traffic from the memory controllers to the shader cores diverges. Therefore, the bandwidth requirement is higher in the shader-to-memory-controller direction.

3.3 Characterization of Shader On-chip Memory Accesses

In single-instruction-multiple-data (SIMD) throughput processors, data computations in the compute cores consume 35% to 45% (50 W-85 W across workloads) of total GPU power. This includes power consumed in integer/floating point units (ALU), special functional units (SFU), the thread scheduler, various on-chip caches, the register file, and shared memory accesses [4]. In Figure 3-29, we show the power consumption breakdown of the compute core components across several GPGPU and graphics workloads, simulated using a customized McPAT [65] based GPU power model and GPGPU-Sim v3.0.1b [41] (see Section 7.1.2). Around 70%-90% of total compute core power is consumed by execution units (25%-45%), register file (20%-40%), fetch-decode unit (1%-10%), and shared memory (2%-40%) accesses. Since increasing the compute core count increases overall GPU power consumption almost linearly (assuming all cores are occupied), power optimization of intra-core components (such as the register file, shared memory, and on-chip caches) is imperative to limit the energy consumption of the whole system.

In this work, we primarily focus on the dynamic and leakage energy that the compute cores consume in on-chip memory accesses. To reduce leakage power at deep-submicron technology nodes, we propose a resistive memory (STT-MRAM) based GPU compute core design. However, STT-MRAM based memory incurs a dynamic power overhead due to its larger write energy requirement. In addition,


longer write operations to these memory cells pose a performance threat for GPGPU applications.

Figure 3-29. Dynamic power profile of compute cores.

Figure 3-30. Reusability of register write-back data.

Intuitively, the temporal locality of register file accesses can be exploited using a small SRAM based cache to reduce write accesses to the large STT-MRAM based register file. To justify the feasibility of such a design, we characterize the temporal locality of GPU register write operations. Figure 3-30 reveals the reusability of dynamically generated register updates during kernel execution. Across 23 GPGPU/graphics workloads, only 15 workloads have 20% or more one-time register updates. Using a small register file cache (RFC) [5] and dedicated memory updates for one-time register writes, it is possible to reduce the write overhead of the main register file. For the rest of the workloads, and for unforeseen emerging throughput workloads, an RFC might be insufficient. Moreover,


for deep-submicron technologies, leakage cannot be reduced by such methods. Hence, resistive memory (low or almost no leakage) on-chip storage solutions are imperative for deep-submicron GPU core designs. We choose an STT-MRAM based memory design due to the significantly lower leakage power of these memory cells at sub-32nm technologies [42].

Figure 3-31. Percentage register file write (in bits).

Figure 3-32. Shared memory write (in bits) profile.

Therefore, we propose an STT-MRAM based on-chip GPU memory design with architecturally optimized write access techniques. Due to the larger energy consumption overhead, write accesses need to be regulated to reduce compute core power. Moreover, the longer write latency also limits the performance of the compute core. Intuitively, write energy consumption is directly proportional to the number of memory cell updates. Hence, to understand the write update behavior of on-chip memories, we


have characterized the actual bit-flip operations of register writes for 23 GPGPU/graphics workloads. In Figures 3-31 (register file) and 3-32 (shared memory), we show that the actual bit flips during write operations are significantly smaller than the total write count multiplied by the write access width. Surprisingly, for our workload set, only NW shows a maximum of 45% bit flips relative to total write bits. On average, only 20% of write operations flip the register cells. A similar trend is observable for shared memory in Figure 3-32. Write-dominated workloads (PR, FWT, SRAD, GS, BN) show relatively small fractions of bit flips during shared memory writes (15%, 40%, 35%, 2%, 20%). On average, less than 10% of bits are flipped during shared memory writes. This motivates architectural memory write enhancements that restrict unnecessary memory cell updates and reduce write energy.
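The bit-flip characterization above amounts to XOR-ing the old and new contents on every write and counting the set bits; a minimal sketch (the word width and the write trace are illustrative assumptions):

```python
def count_bit_flips(old_word: int, new_word: int) -> int:
    """Bits that actually toggle when new_word overwrites old_word."""
    return bin(old_word ^ new_word).count("1")

def flip_fraction(writes, width=32):
    """writes: list of (old, new) register contents per write operation.
    Returns flipped bits as a fraction of total written bits."""
    flipped = sum(count_bit_flips(o, n) for o, n in writes)
    return flipped / (len(writes) * width)

# Illustrative trace: three 32-bit register writes (96 written bits).
trace = [(0x0, 0xFF), (0xFF, 0xFF), (0xF0F0, 0x0F0F)]
print(flip_fraction(trace))  # -> 0.25: only 24 of 96 written bits toggle
```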


CHAPTER 4
INTEGRATING SILICON NANO-PHOTONICS IN THROUGHPUT ARCHITECTURE

4.1 Design and Implementation

In this section, I provide an overview of the proposed nanophotonic GPU.

Figure 4-1. Shader core layer layout: includes 64 shader cores (SC). Each SC has 32 stream processing cores (SP), 16 load/store units (LD/ST), 4 transcendental functional blocks (TFB), a register file, a thread scheduler/dispatcher, and L1 caches/shared memory.

4.1.1 An Overview of the Microarchitecture

Figures 4-1, 4-2, and 4-3 illustrate the simplified functional unit layout of our proposed silicon-nanophotonics enabled 3D throughput architecture, which uses multiple DWDM based waveguides to optically connect multiple shader cores and off-chip memory arrays. To meet growing computation demand and off-chip memory access load, overall chip area and heat dissipation capabilities become major design issues for throughput architectures; a multi-layered 3D throughput architecture enables better area utilization, cooling capabilities, and integration of layers built in disparate technologies. We separate the shader core layer from the L2 cache/memory interface layer and use face-to-face bonding [66] to connect the two layers; face-to-face assembly decouples the number of TSVs from the total number of interconnections


between the layers. We incorporate a separate silicon nanophotonic optical layer that carries the optical signals enabling shader core and memory controller communication. Through-silicon vias (TSV) connect the L2 cache layer and the optical layer; TSVs also provide clocking, power, and ground signals. The optical layer includes the off-chip laser source, coupler, resonators, and the optical interface to off-chip memory. We use silicon waveguides to communicate between on-chip memory controllers and off-chip memory arrays with an optimized DRAM chip organization.

Figure 4-2. Cache/memory controller layer layout: includes memory controller, TSV array, directory, HUB, network interface, and crossbar interface (area of four shader cores).

The latest generation of Nvidia Kepler [11] GK110 series GPUs is manufactured in a 28nm process technology and requires 550 mm2 of silicon die area to fabricate 7.1 billion transistors producing 15 shader cores [67]. Our 3D stacked GPU microarchitecture design is based on a 16nm technology [35]. We expect to have 8.0 billion transistors on a silicon die area of 400 mm2 in the shader core layer of the nanophotonic 3D stacked throughput architecture [68]. Figure 4-1 shows the shader core layout, including an


instruction cache, a thread scheduling unit, a thread dispatch unit, 32 stream processing cores (SP), 48KB of shared memory, 32K registers, 3 types of L1 data caches (texture, constant, and global memory segment), 16 load/store units (LD/ST), 4 special functional units (SFU), multiple geometry units, multiple texture units, and the L1-L2 cache interface. Our shader core layer die includes 64 shader cores comprising 2048 shader pipelines, along with 16 texture processing functional blocks. In our design, the thread scheduler is located at the center of the core layer layout, and the host interface is placed at one side for better interfacing. Graphics-specific rasterization operations are performed by the raster engines, each shared by 8 shader cores; there are 8 raster engines in total.

Figure 4-3. Optical layer: 64 waveguides, optical x-bar (right), off-chip laser source/coupler, DIMM.
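The aggregate resources of the shader core layer follow directly from the per-core figures above; a quick tally as a sanity check (per-core numbers taken from the text):

```python
shader_cores = 64
sp_per_core = 32                 # stream processing cores per shader core
shared_mem_kb = 48               # shared memory per shader core
registers_per_core = 32 * 1024   # 32K registers per shader core

pipelines = shader_cores * sp_per_core
total_shared_mem_mb = shader_cores * shared_mem_kb / 1024
total_registers = shader_cores * registers_per_core
print(pipelines, total_shared_mem_mb, total_registers)
# 2048 shader pipelines, 3.0 MB of shared memory, ~2M registers in total
```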


Our conservative estimate of the shader core pipeline clock frequency is 2.0 GHz. To minimize thermal impact, the shader core layer, which consists of 2048 SIMD pipelines, is placed at the top of the 3D stack (i.e., beneath the heat sink). The 8192KB L2 data cache is placed underneath the shader core layer; emerging throughput workloads will require more on-chip storage to reduce off-chip traffic and to enhance overall workload throughput. In our design, four shader cores are clustered together (TPC: texture processor cluster) to share a memory controller, network interface, and crossbar interface. This layer also incorporates render output units (ROP) that perform pixel blending, anti-aliasing, and atomic memory operations, which are specific to graphics applications. Figure 4-2 depicts the simplified layout of the L2 cache/memory controller layer and a zoomed-in view of an individual L2 subset.

The L2 layer communicates with the shader core layer through the interconnection network; hence the L1-L2 interface is implemented using an array of vertical TSVs connecting to the network layer. Each shader core cluster has its own 512KB subset of the unified L2 cache. The local L2 subset of each shader cluster implements a direct shader core to local L2 interface (face-to-face bonding) and avoids the interconnect. In GPGPU, inter-shader communication is often limited, and if emerging workloads are optimized to access the local L2 subset on higher-level cache misses, interconnect traffic will only occur during non-local L2 subset accesses. However, this will not reduce inter-shader coherence traffic, if present. Therefore, a separate optical interconnection network layer implements a crossbar interface to connect non-local L2 subsets across different clusters. A memory controller is associated with the L2 cache subset of each cluster, and the memory controllers are connected to the off-chip memory modules using optical interconnects.

In our design, throughput memory accesses follow these steps: when an L1 cache miss occurs in the shader core, the memory request maps the address to the corresponding L2 subset. If the L1 miss address is mapped to the local L2 subset,


the request does not traverse the on-chip network; instead it uses the fast face-to-face L1-L2 interface. In throughput workloads, core-to-core communication is limited, and an application developer can in several cases control which off-chip data is mapped to a non-local L2 subset. This behavior of GPGPU workloads justifies direct interfacing of the local L2 subset. In case the address is mapped to the L2 subset of another cluster, the request has to traverse the optical interconnect. Since the optical interconnect carries L1 cache miss traffic, in a highly data intensive GPGPU application with arbitrary memory access patterns the memory misses will congest the on-chip network. Moreover, inter-kernel data communication between simultaneously executing threads in GPGPU applications, and in the future shader core coherence traffic, can generate heavy L2 cache traffic. In fact, we anticipate that with simultaneous execution of multiple kernels in emerging GPGPU workloads, the off-chip memory request traffic will surpass the L2 coherence traffic. We expect that a high bandwidth, low latency optical interconnect-based throughput architecture will provide an attractive solution to this problem.

Figures 4-3 and 4-4 show the simplified layouts of the optical interconnect layer and the crossbar interface design, respectively. Traditionally, throughput architectures require a wide off-chip memory interface to meet the data requests of thousands of concurrently running threads. Efficient interfacing of optically connected off-chip memory motivates the layout of the silicon waveguides. We use a many-writer-single-reader (MWSR) photonic crossbar to connect the 16 TPCs using 128 waveguides (16 channels × 4 waveguides/channel × 2 directions) in both directions (shader core to memory controller and memory controller to shader core). Each waveguide is capable of propagating 64 wavelengths of light (64 bits per clock) using DWDM, i.e., 32 bytes per clock period per direction per channel. With a 5 GHz clock, the 16 channels together achieve a raw interconnect bandwidth of 40.96 Tb/s. The dedicated floor for the optical interconnect allows us to place these waveguides in a serpentine structure (Figure 4-3), where the waveguide pitch is as low as 5.5 µm [69].
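The raw-bandwidth figure follows directly from the link parameters above; a quick arithmetic check (assuming one bit per wavelength per clock):

```python
channels = 16
waveguides_per_channel = 4
wavelengths_per_waveguide = 64   # 1 bit per wavelength per clock
directions = 2
clock_ghz = 5.0                  # 5 GHz link clock

bits_per_clock = (channels * waveguides_per_channel
                  * wavelengths_per_waveguide * directions)
raw_tbps = bits_per_clock * clock_ghz / 1000.0   # Gb/s -> Tb/s
print(bits_per_clock, raw_tbps)  # -> 8192 bits/clock, 40.96 Tb/s
```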


Figure 4-4. Crossbar module structure.

Our optical interconnect incorporates an MWSR based crossbar that requires an arbitration mechanism, similar to that used in token ring LAN systems [70], to resolve the read requests sent by multiple shader nodes. Unlike multi-threaded CPU workloads, throughput workloads execute the same instruction simultaneously (SIMD lock-step execution paradigm) in different shader cores that request data from different L2 subsets. Since the shader core count is considerably larger than the core count in a multicore system, the average token grab latency for channel reservation by a requesting shader is usually higher. An alternative to MWSR is the single-writer-multiple-reader (SWMR) crossbar, which is relatively power hungry: more laser power is required to drive (node count - 1) photo-detectors, compared to 1 photo-detector in MWSR. SWMR incorporates a separate dedicated channel for each source that sends packets to a destination; hence, no channel arbitration mechanism is required.

To implement a 16-node, 256-wavelength MWSR crossbar, we use 3840 modulators and 256 photo-detectors for data per node. For arbitration, we use 240 modulators and


16 photo-detectors. The static power of the crossbar is estimated at 0.33 W [69, 71]. The dynamic energy consumption is 200 fJ/bit for transmission, and the dynamic power for arbitration is negligible.

Figure 4-5. Optically connected memory (top: organization, bottom: typical memory controller operation flow).

Figure 4-5 shows the optically connected memory design, and Figure 4-6 shows the vertically stacked layers. Each memory controller has two 64-wavelength DWDM interface waveguides to the off-chip DRAM. Our design incorporates a 64b read bus, a 32b write bus, and a 16b address/command bus, based on [72]. Assuming both-edge signaling and a 5 GHz clock, the maximum data transmission bandwidth reaches up to 120 GB/s per memory controller. For 16 memory controllers, the aggregate off-chip pin bandwidth reaches 1536 bits per clock edge. In our proposed system, each memory controller is connected to 8 3D-DRAM stacks of 4GB each. The vertically stacked 3D DRAM chips are connected electrically using through-silicon vias. Each memory stack is connected to the memory controller via


two 64-wavelength DWDM single fibers for read/ACK traffic (north-bound link) and write/address-command traffic (south-bound link). An interface die per stack receives and transmits all optical communications to and from the memory controller [72]. All communication between the DRAM chips and the interface die is electrical. Two additional dedicated single-mode fibers provide optical power to the DRAM chips from the GPU chip laser source. To use the high bandwidth offering efficiently, the physical channel is subdivided into multiple logical memory channels, either statically or dynamically (see Section 4.1.2).

Unlike multicore CPUs, in a throughput architecture data requests from several threads in a warp can be mapped to the same memory controller but to different DRAM stacks within the interfacing controller. Logical subdivision of channels can provide warp-level memory access parallelism in case thread-level memory access coalescing is not possible, which significantly reduces the memory miss latency of the SIMD lock-step execution mechanism. The two transaction-processing schemes are sequential sub-channel allocation and concurrent sub-channel allocation. The sub-channel implementation increases memory controller design complexity: support for simultaneous command servicing and interleaved access to multiple DRAM banks connected to multiple DRAM sub-channels is added to the traditional transaction scheduler within the memory controller. The sub-channel allocation schemes aim to increase the bandwidth utilization of off-chip GPGPU photonic links by exploiting additional parallelism in the memory organization.

4.1.2 Sub-Channel Based Memory Controller

A wider and faster optical interface to the off-chip memory arrays cannot reduce the DRAM row access time, which is the main contributor to off-chip memory access latency in an optically connected GPU. In addition, increasing shader core clock frequency increases the memory request rate, and the faster on-chip photonic network carries the requests rapidly to the memory system. Hence, it is imperative to design a high throughput, low latency memory interface that can reduce DRAM access


overhead for emerging GPGPU workloads. In the context of throughput architecture, warp-level memory access parallelism due to logical subdivision of channels can further reduce the memory miss latency of the SIMD lock-step execution paradigm. To this end, we augment our proposed off-chip memory organization with a new scheduling policy that schedules multiple memory requests simultaneously to different, independently operating sub-channels (memory stacks) to parallelize DRAM accesses and increase overall bus utilization. While keeping DRAM row access locality and bank-level parallelism intact, the scheme implements concurrency across multiple memory-stack accesses. We have investigated sequential and concurrent sub-channel allocation policies.

Figure 4-6. Cross-section schematic of the chip.

Figure 4-7 explains the flow of operations executed in the memory controller for read/write requests in the sequential allocation scheme. New incoming requests from the network interface are placed in sub-channel specific queues. At a given cycle, the sequential scheme applies first-ready first-come-first-serve (FRFCFS) scheduling to a sub-channel specific queue. If no schedulable transaction is available in that sub-channel, it uses a round-robin policy to schedule a transaction from the next sub-channel, if possible. This policy ensures that the DRAM row hit rate, bank-level


parallelism, and rank-level parallelism are kept intact in each sub-channel. The memory controller uses a single transaction scheduler to schedule requests to a sub-channel each cycle. The sequential approach schedules only a single request to a particular sub-channel in a given memory clock and uses a round-robin policy to guarantee fairness. Each request carries a memory address with a 3-bit (log2 8) preamble identifying the sub-channel it belongs to. Each sub-channel reserves 8 wavelengths for read data (north-bound traffic) and 8 wavelengths for write data (south-bound traffic). Therefore, the 64-wavelength north-bound link maximally schedules 8 read replies, and the 32-wavelength south-bound link schedules 4 write requests. Memory bus bandwidth sharing is implemented by statically allocating wavelengths to individual sub-channels. The implementation cost of the scheme comprises the additional queues per sub-channel: 32 (queue size) × 80 B (maximum request size) × 8 bits × 0.039 µm2 (6T SRAM cell area in 16nm [73]) × 7 (additional queues) = 5591 µm2 (1% area overhead).

Figure 4-7. Memory transaction scheduling.

In contrast, the concurrent scheme implements multiple schedulers, one per sub-channel, and performs FRFCFS scheduling for each sub-channel simultaneously when possible. In this scheme, we assign equal numbers of wavelengths to the different


memory sub-channels, as in sequential scheduling. Unlike the sequential scheme, the concurrent scheme can schedule multiple transactions in a given memory cycle, at the additional cost of 7 transaction schedulers and 2 additional 64-wavelength DWDM links. In addition to the extra queues, the concurrent scheme thus requires additional schedulers as well. In the context of GPGPU architecture, the additional memory access latency savings from concurrent scheduling do not contribute to overall application throughput, for two main reasons. First, compute-bound workloads often have more ALU warps, which require more computation cycles than the saved memory access latency cycles; concurrent scheduling therefore yields no additional throughput enhancement. Second, memory-bound workloads can benefit from the parallelism, but at an additional cost in area (additional schedulers and waveguides) and energy consumption.

4.1.3 Interconnect Aware Warp Scheduler

In an optical on-chip interconnect, the dedicated optical channel between a source node (shader core/memory controller) and a destination node (memory controller/shader core) is occupied by the source node until the transmission completes. Any packets (memory requests or responses) from other requesters (shader cores/memory controllers) wait for the channel during the transmission period, which adds to the overall packet latency. Moreover, in contrast to electrical on-chip networks, optical on-chip networks lack mature realizations of intermediate routers and optical buffers to adaptively control interconnect traffic. Hence, adaptive memory request issue is imperative for on-chip network congestion mitigation in a nano-photonic interconnect based throughput architecture. To this end, we propose to regulate shader core memory access issue (load/store) in order to control bursty on-chip traffic in the throughput interconnect and to enforce hardware-level memory access latency hiding.
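This issue-regulation idea, together with the 2-bit warp classification and memory-miss stride detailed in the following subsections, can be sketched as follows (the warp records, the stride constant, and the miss-counter wiring are illustrative assumptions, not the hardware implementation):

```python
MEMORY_MISS_STRIDE = 128     # stride value found best in the evaluation

# 2-bit warp classification (cf. Table 4-1)
UNKNOWN, ALU, MEMORY = 0b00, 0b01, 0b10

class InterconnectAwareScheduler:
    """Round-robin issue until MEMORY_MISS_STRIDE last-level cache
    misses accumulate; then prefer ALU/unknown warps over memory warps
    to hide miss latency and throttle bursty interconnect traffic."""
    def __init__(self, warps):
        self.warps = warps           # list of (warp_id, warp_class)
        self.rr = 0                  # round-robin pointer
        self.misses = 0

    def on_llc_miss(self):
        self.misses += 1

    def next_warp(self):
        n = len(self.warps)
        if self.misses >= MEMORY_MISS_STRIDE:
            for i in range(n):       # look for a non-memory warp first
                wid, wclass = self.warps[(self.rr + i) % n]
                if wclass != MEMORY:
                    self.rr = (self.rr + i + 1) % n
                    return wid
            self.misses = 0          # none found: fall back
        wid, _ = self.warps[self.rr] # plain round-robin
        self.rr = (self.rr + 1) % n
        return wid
```

In the real scheduler the classification bits are filled in as warps commit (Table 4-1), and the policy reverts adaptively to plain round-robin when memory coalescing degrades; both mechanisms are omitted here for brevity.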


Figure 4-8. Flow of operations in the warp scheduler.

Unlike traditional multicore architectures, a GPU runs thousands of threads (with identical instruction sequences) concurrently in single-instruction-multiple-thread (SIMT) fashion. Due to control-flow divergence [18] and memory-miss divergence [19], threads running within a shader core progress unevenly. Hence, at any thread scheduling cycle in the shader core, several ready instructions of different types (ALU/memory access) are found. The availability of the different instruction types at the scheduling cycle varies with the instruction mix of the throughput workload. Moreover, even though throughput workloads are written to avoid irregular memory accesses, memory accesses during execution are not uniform, due to the nature of the algorithm, memory-miss divergence, and branch divergence. These throughput workload behaviors also help mix the warp pool with different types of instructions, even though all warps are programmed to execute the same instructions. Therefore, based on last-level cache miss information, we design a novel thread scheduler that can interleave memory access requests between arithmetic operations. First,


we propose a warp pool classification technique that builds on top of the existing immediate-post-dominator based warp formation policy proposed by Woop et al. [74] and Fung et al. [18]. Second, we extend round-robin warp scheduling to avoid memory instructions in case of several consecutive last-level cache misses. In addition to reducing on-chip optical network congestion, the proposed scheme also hides the memory access latency of various on-chip memories, such as the L1 cache, constant cache, texture cache, and shared memory, whenever possible. We carry out the operation in two stages: warp classification and deterministic scheduling.

4.1.3.1 Warp Classification

First, we add a warp-level classification mechanism to the hardware thread scheduler. The scheduler selects warps based on the instruction characteristics listed in Table 4-1. When the warp with a certain PC completes the decode stage of the shader core pipeline, the instruction classification for all other warps with the same PC is obtained. The scheduling unit maintains 2 bits for each warp to encode the information shown in Table 4-1. Thus, a scheduler capable of managing 2048 threads needs an additional 512 bytes of storage to implement the proposed scheme; for 64 shader cores, the total additional memory requirement is 32 KB. In a 16nm process, our scheduler adds negligible area overhead.

Table 4-1. Warp classification bit pattern

Bit pattern   Definition
00            No warp with the PC has ever committed
01            Warp with the PC has used the ALU unit
10            Warp with the PC has used the memory unit
11            Reserved

The classification information can be used in warp pool formation and scheduling in two ways: reactive and proactive. Using the reactive scheme, we can restrict warps that generate memory traffic (load/store instructions) from being scheduled in case of interconnect congestion. However, the reactive scheme requires network status


detection hardware and an additional optical waveguide. The reactive scheme would also experience substantial latency in obtaining network status information from several network-status collection points (e.g., network interfaces), assuming such hardware is available. Moreover, although we can minimize congested network traffic by stopping the issue of memory instructions, we still have limited scope for intelligently hiding memory access latency. In contrast, we propose to interleave memory access requests between strides of ALU operations to avoid interconnect congestion and reduce possible shader core stalls. The proactive policy achieves this goal by scheduling load/store warps between multiple ALU warps based on cache memory access outcomes (e.g., cache hit/miss). In this scheme, the additional hardware cost is only a few bytes of memory.

4.1.3.2 Deterministic Warp Scheduling

Compared to the traditional round-robin warp scheduling mechanism [74], our scheme deterministically hides the memory access latencies of several cache (off-chip memory, texture, and constant) misses by knowing the instruction classification at the scheduling stage. Since the application developer often heavily optimizes shared memory accesses, we do not consider shared memory accesses. Figure 4-8 shows the conceptual operation flow. The scheduling policy deviates from traditional round-robin scheduling only after a predefined number of cache misses (the memory-miss stride). The policy issues several warps with arithmetic instructions (if available) per memory-miss stride; otherwise it issues a warp with an unknown instruction class. An appropriate choice of memory-miss stride helps retain intra- and inter-warp memory coalescing behavior and the off-chip DRAM row buffer hit rate. Unfortunately, for several memory-bound applications, the shuffling of memory instructions reduces memory access coalescing across the threads, and such throughput workloads experience degraded performance. Therefore, we adaptively switch between the regular round-robin scheduler and the deterministic scheduler. For example, if 2048 concurrent


threads are present in a shader, there will be a total of 64 warps per PC. If we reach the memory-miss-stride number of memory misses, we can hide a maximum of 63 latency cycles by issuing 63 warps, assuming all 63 are ALU operations. Certainly, the problem size (i.e., warp count) will influence the overall improvement that can be achieved by this technique.

4.2 Evaluation

In this section we evaluate the design of the photonic interconnect, the optimized thread scheduler, and the memory subsystem design in terms of power and performance.

4.2.1 Power Efficient Optical Interconnect

Figure 4-9. Shader to memory controller latency (normalized to electrical mesh 16B; GM: geometric mean).

Figures 4-9 and 4-10 compare the different designs in terms of shader core to memory controller latency (flit level) and memory controller to shader core latency. Note that in Figure 4-9 the SWMR photonic interconnect experiences relatively lower latency (81%-90% reduction) for read and write requests than the MWSR-based crossbar (59%-66% reduction). The MWSR-based crossbar uses a token to implement mutually exclusive access to the destination node. In contrast, the SWMR-based crossbar has multiple dedicated channels from a source node to the destination nodes. Hence, in MWSR, a source node may wait even when the destination channel is idle, due to


unavailability of the destination token. This latency is completely hidden in the SWMR based crossbar. However, SWMR differs from a traditional electrical crossbar in that the latency of an SWMR based optical crossbar depends on the distance between the source and destination nodes along the waveguide. In a GPU microarchitecture, this phenomenon has adverse effects when memory traffic becomes bursty.

The memory controller to shader core traffic latency of the MWSR crossbar in Figure 4-10 is significantly lower than that of the electrical crossbar (49%-95% reduction). Comparatively, the SWMR crossbar shows lower latency (52%-94% reduction) with respect to the electrical baseline. Memory intensive benchmarks such as BFS (90%), MM (87%), DG (77%), RAY (72%), and 64H (75%) exhibit the maximum benefit from the optical network; all of these benchmarks have a large number of read and write memory accesses. As we increase the channel bandwidth, memory controller to shader core latency keeps decreasing: with a larger flit size, fewer flits need to be allocated in the network interface buffer, so the average buffer wait time decreases. Close examination of the network statistics also reveals that in MWSR the token allocation delay largely dominates the overall flit latency, and that the buffer wait time in SWMR is lower than in MWSR due to the absence of the token allocation penalty experienced in MWSR.

Figure 4-10. Memory controller to shader latency (normalized to electrical mesh 16B; GM: geometric mean).


Benchmarks such as BFS, SRAD, AES, PF, 64H, HY, and MM yield higher shader to memory controller latency when the channel bandwidth is increased. These benchmarks have more data read instructions (loads) and fewer data write instructions (stores). The data read request size is 8B, which is smaller than the smallest channel bandwidth used. With increasing channel bandwidth, more bytes are wasted, and the overall flit count does not decrease; a reduced flit count is the key reason behind the latency improvement at larger channel bandwidths. Since memory controller to shader core traffic has only 64B packets, it does not exhibit a similar effect: with increasing flit size, the flit count reduces linearly and the overall buffer wait time also decreases.

Figure 4-11. Speedup with respect to the electrical mesh network with 16B channel bandwidth (GM: geometric mean).

Figures 4-11 and 4-12 show the speedup and power consumption characteristics of the different designs. On average, the SWMR 16B crossbar has 0.04% lower performance than the 16B electrical mesh, but as we increase the bandwidth to 32B and 64B, the performance increase is 3% and 5%, respectively. Interestingly, benchmarks dominated by large numbers of memory accesses demonstrate a better performance increase; the maximum performance increase is observed in BFS (17%) with 64B channel bandwidth. However, the MWSR based crossbar shows almost 4% improvement in


performance; its smaller gain relative to SWMR is attributed to the large delay experienced by packets in grabbing the destination token.

Figure 4-12. Network power consumption normalized to the electrical mesh 16B channel (GM: geometric mean).

Figure 4-13. Normalized IPC improvement due to sub-channel allocation schemes (GM: geometric mean).

The power consumption of the optical crossbar decreases drastically (Figure 4-12). On average, the SWMR crossbar shows almost 87% power saving with 1% performance degradation compared to the electrical 16B mesh baseline. As expected, the average MWSR power saving is comparatively higher (98%), with only 2% performance degradation. Memory intensive workloads benefit the most from the optical interconnect. BFS


shows the maximum power saving of 89% in MWSR due to its heavy memory accesses. We recommend the MWSR based crossbar with 32B channel bandwidth as the best solution, providing 96% power saving with a 1% average increase in performance.

4.2.2 Impact of Sub-Channel Allocation

Figure 4-13 compares the sub-channel allocation-based memory subsystem to conventional electrical GDDR3/GDDR5 based memory systems. Memory intensive benchmarks with large DRAM utilization [41], such as BFS (26%), MM (13%), LPS (32%), RAY (48%), and SS (12%), demonstrate significant improvement over the 1.6GHz GDDR3 (e.g., average 9% improvement over Elec1.6G) and 3.6GHz GDDR5 (e.g., average 2% improvement over Elec3.6G) based memory. The experiments reveal that sequential sub-channel allocation (Opt10GSA) and concurrent sub-channel allocation (Opt10GCA) differ in performance by 3% on average, and memory intensive workloads show almost similar performance. The combined effect of the latency hiding scheme and sub-channel allocation (Opt10GSAIAS, Opt10GCAIAS) improves performance by 3%. The sequential sub-channel allocator design is based on a single-scheduler memory controller with 1% area overhead. With a faster command clock, the sequential allocator uses fewer optical links (2 64-wavelength DWDM links/controller) to achieve the performance of the concurrent scheduler. Hence, we opt for the sequential command scheduler in our system. However, with emerging workload pressure and improved hardware design, we expect the concurrent sub-channel allocation scheme to provide better results in the future.

4.2.3 Analysis of Interconnect Aware Scheduler

Figures 4-14 and 4-15 show the performance improvement using the scheduler-level control of photonic NoC traffic (IAS: interconnect aware scheduler) discussed in Section 4.1.3. We also show the percentage change in warp stalls due to memory misses for all workloads that successfully finished simulation. Since the proposed scheme aims to hide GPU memory access latency, the change in memory miss stalls clearly reveals the impact of the scheme. Figure 4-14 presents the results


Figure 4-14. IPC and percentage memory stall reduction (8 shader cores, 4 memory controllers, 4 TPCs).

for a 4 (TPC) × 4 (MC) network with a relatively small number of load/store requests (2 shaders/TPC). Figure 4-15 uses a 16 (TPC) × 16 (MC) network with a relatively large number of load/store requests (4 shaders/TPC). Performance improvement over the existing round-robin scheduling (RR) is observed in BFS, HS, NE, AES, and SRAD. In Figure 4-14, BFS shows the maximum improvement of 9.7%, which is a direct effect of the reduction in memory miss stalls (86% reduction). Since HS is less memory intensive than BFS, a 95% memory miss stall reduction in HS provides only a 6.1% IPC improvement. PF experiences performance loss due to reductions in intra/inter-warp memory coalescing and DRAM row buffer hit rate if only deterministic scheduling is used; using the adaptive mode of IAS, we switch to round-robin scheduling after an initialization interval and recover the loss in PF. Since memory latency depends significantly on the interconnect, severe network congestion impacts overall performance. Figure 4-15 shows the IPC improvement for the 4 shaders/TPC configuration, where MM, AES, FWT, and RAY show performance improvement. We varied the memory-miss stride from 1 to 256 and found that a stride of 128 provides optimum performance.


Figure 4-15. IPC and percentage memory stall reduction (64 shader cores, 16 memory controllers, 16 TPCs).

4.3 Related Work

Architectures like Corona [35], Firefly [37], Phastlane [36], and Flexishare [38] have demonstrated, respectively, a 3D-stacked multi-core chip with optical token-based arbitration; electrical intra-cluster communication with optical crossbar-based inter-cluster communication; a switch based photonic interconnect; and electrical-optical token stream based arbitration for channel assignment and credit distribution in the NoC. However, an investigation of the power and performance of a silicon-nanophotonic assisted GPU is still lacking. In this work, we make the first step in exploring a photonics enabled GPU microarchitecture that integrates shader cores, caches, and off-chip optical memory interfaces in the different layers of a 3D-stacked chip. We have customized [72] to design the off-chip memory stacking and developed sequential and concurrent multiple memory access scheduling. Although Ahmed et al. [6] have explored a 3D-stacked cache architecture in a multilayer GPU implementation, they have not addressed the issues of NoC congestion or the off-stack memory access bottleneck; moreover, that research has not explored the possibility of co-designing the NoC and the thread scheduler. In [75], Fung et al. have proposed an efficient block-level control-flow optimization; in contrast, we focus on interconnect aware warp scheduling mechanisms. Using a non-deterministic


approach, Gebhart et al. in [5] have proposed a two-level warp scheduling technique. Though switching between warp groups reduces the chance of scheduling memory warps back to back, the lack of knowledge about the warp type still increases the chance of consecutive memory warps being scheduled. Amit et al. [76] have proposed an optical memory controller (OCM) in place of the advanced memory buffer in the FBDIMM [77] design. Beamer et al. [40] have provided optically connected DRAM arrays using optical power guiding to achieve scalability. Dana et al. [35] propose a scalable DRAM organization by daisy-chaining the channel. None of these designs explores scheduling of multiple memory requests in GPU. In [78], Kim et al. have proposed inter-memory-controller coordination and a thread-aware scheduler that processes the threads with least-attained-service, assuming requests follow a Pareto distribution. The scheduler processes the requests adaptively but does not coordinate among them. In GPU, a thread-aware scheduling policy might not be sufficient; instead, warp-based scheduling will improve performance.


CHAPTER 5
RENEWABLE ENERGY BASED THROUGHPUT ARCHITECTURE

5.1 Chameleon: The Mode-Switching Power Management

Chameleon is a power management framework for servers powered by emerging green energy sources. In particular, it targets throughput servers that are used for scientific computing or data processing jobs which typically do not have a strict completion deadline. These servers often require a significant amount of time (days/weeks to run) and consume a large amount of energy. Leveraging green energy to power these systems could greatly save conventional utility power bills and lower negative environmental impact.

5.1.1 Overview and Rationale

Figure 5-1 depicts the power provisioning hierarchy of a green data center across facility level and server level. At the facility level, onsite renewable energy generation (e.g., wind turbine or solar panel) and utility grid are connected together through appropriate circuit breakers and power conversion interfaces. An automatic transfer switch (ATS) can seamlessly switch the load to an onsite diesel generator in case both the renewable power system and the utility grid fail. Eventually, power enters server racks through power distribution units (PDU). Each server rack is supported by distributed UPS batteries [79] for handling transient power interruptions. Such UPS placement topology has been adopted by Google and Facebook for improving energy conversion efficiency [79].

At the server level, each dual-corded server connects to two power strips (rack-level PDUs): one connects to the primary power and the other connects to local UPS batteries. Modern intelligent rack-level PDUs from HP and IBM have provided the capability of monitoring and managing the power budget on each individual power outlet. Normally, servers run at their designated speed if the rack-level PDU hasn't enforced power capping on the outlet. However, when the renewable power supply fluctuates severely


or becomes intermittently unavailable, it is crucial to change the power budget on each outlet and adjust the server power demand accordingly.

Figure 5-1. The Chameleon power management framework for server systems powered by intermittent green energy sources.

Chameleon provides a way to gracefully adapt server load to the time-varying green power budget. As shown in Figure 5-1, each server features a Chameleon micro-controller that interacts with the server power supply, throughput processor, and the power monitor. The controller enables the system to intelligently leverage energy resources (i.e., the variable renewable power budget and the limited stored energy) for a better tradeoff between efficiency and performance. Chameleon does not emphasize using the utility grid to provide backup energy for three reasons. First, from a sustainability point of view, utility power yields low power efficiency (due to power transmission loss) and high carbon footprint. Second, in crowded urban areas utility power feeds are often at their capacity and electricity access for data centers is restricted. Third, in remote areas or developing countries, the utility grid is less reliable, which in turn degrades the overall availability of a grid-dependent system.


5.1.2 Chameleon Microcontroller

Chameleon neither aggressively follows the variable power budget nor heavily borrows stored energy from the distributed UPS batteries. Instead, it combines the two approaches by defining multiple power management modes: energy-oriented control mode (E-mode) and performance-oriented control mode (P-mode). When E-mode is activated, the server gives high priority to the primary power supply for maximizing the direct use of green energy. When P-mode is activated, the server yields high priority to stored energy (secondary power supply) for maintaining desired workload performance. Such an approach is referred to as mode-switching power management (MSPM).

The Chameleon micro-controller is the major hardware addition in our design and the key component that supports MSPM. Figure 5-2 shows its internal architecture, which consists of a Chameleon memory that stores important system profiling information, a mode tuning agent that manages the mode switching activities, a mode register that specifies the current power management mode, and several pieces of firmware that execute actual load power control. During runtime, the micro-controller performs three tasks concurrently: monitoring, analyzing, and tuning. First, it maintains a load power consumption history record for the last N evaluation frames. It also tracks the average IPC and the power states of each processor core in a profiling table. All these data are updated dynamically in the Chameleon memory, ensuring timely diagnosis of the system. Second, the micro-controller analyzes its past power management effectiveness based on the stored monitoring data. At the end of each coarse-grained mode-tuning interval, it updates the mode register with a newly selected mode code. Finally, at each fine-grained cycle, the power control firmware executes load tuning instructions based on the specified power management mode and the supply-load power mismatch at that given timestamp.

In this study we mainly consider dynamic voltage and frequency scaling (DVFS) and processor power gating (PG) as major load tuning knobs. Each processor core is


Figure 5-2. The control architecture of the mode-switching power management mechanism in the Chameleon Microcontroller.

connected to an on-chip voltage regulator module (VRM) through a power-gating transistor. Therefore, the power control firmware can put the throughput processor into low power states by lowering the core V/F level or temporarily enabling the sleep state. Note that the firmware based power control implementation is fairly scalable and easy to update. Although the entire power management loop passes through several levels, the control latency is a negligible factor compared to the much longer time interval of power budget variation.

5.1.3 Learning-Based Mode Switching

The mode tuning agent is Chameleon's key optimization engine. It determines the state of the mode register, and thereby affects how servers will react to the time-varying green power budget. In Figure 5-3 we show the load adaptation policy in different power management modes. When the server is set to energy-oriented mode (E-mode), it


Figure 5-3. The mapping of power management modes to load adaptation behaviors.

will strive to maximize direct green energy usage. That is, the micro-controller will dynamically tune the processor voltage and frequency levels to follow the power supply budget. On the other hand, in performance-oriented mode (P-mode), the server prefers to use stored energy to fill the power shortfall when the load power demand is higher than the power supply budget.

Power mode switching is challenging. An intuitive approach is to simply adjust the mode based on a pre-defined power threshold or workload performance threshold: i.e., use energy-oriented mode when the power headroom increases and use performance-oriented mode when server throughput drops. However, such a straightforward scheduling scheme lacks robustness and often shows unsatisfactory results when faced with disturbing events. For instance, an increase in power budget is actually not a good indication of abundant renewable power supply: it is very likely that the renewable power output fluctuates heavily and it is just a temporary power surge. Similarly, a temporary drop in server throughput does not necessarily mean that we lack power budget: probably the throughput processor is waiting due to memory access.

In this study we explore a reinforcement learning based mode tuning approach. Rather than specifying a rigid mode switching policy, we give Chameleon the privilege to make its own decisions through controller-system interaction. Such an approach has two advantages. First, the learning-based design is more adaptive and is able to optimize its


power management effectiveness over the lifetime. Second, it shows better extensibility when we need more power management modes to handle the increasing complexity in hardware and workload. For a threshold-based mode switching scheme, it is difficult to analyze the entire design space when facing a large number of modes and threshold values.

The behavior of the mode tuning agent (i.e., the learner) is given by a sequence of state-mode pairs (Si, Mi), where Si is the supply-demand mismatch (in Watts) in the last control period and Mi is the power management mode that the agent decides to select based on its experience and observation of the state. Meanwhile, the agent receives numerical rewards Ri from the system as a feedback for the mode switching action it takes. The reward signal indicates if the mode selection is favorable in an immediate sense and helps the mode tuning agent to update its knowledge base in the future.

The reward function specifies the objective of the mode tuning agent. The goal is to maximize the cumulative reward over the long term. We define the reward as:

R = Performance / Supply Variability = (Σ IPS) / (A(·) × F(·))    (5-1)

where IPS is the instructions per second for each individual processor core, and A(·) and F(·) are the fluctuation amplitude and the fluctuation frequency of the power data series, respectively. A(·) is calculated as the standard deviation of the load power. The definition of F(·) is similar to the averaged zero-crossing rate widely used in the signal processing community:

F(·) = (1/2) Σ |sgn[x(n) − x̄] − sgn[x(n−1) − x̄]|, where sgn[x(n)] = +1 if x(n) ≥ 0, and −1 if x(n) < 0    (5-2)

where sgn[·] is the sign function and x̄ is the mean value of x. Note that in equation (5-1), the power variability is a lower-is-better metric. The definition of our reward function implies that we want the mode tuning agent to optimize the system throughput


while being aware that severe load power variation is harmful. Note that frequent load power cycles can result in thermal cycles and potentially degrade the lifetime of the system.

5.1.4 Solving the Mode Switching Problem

Figure 5-4. Online-learning control flow and timeline.

We solve the mode switching problem using the Q-learning algorithm [80]. It is easy to implement as a hardware controller since the algorithm naturally operates in an on-line, fully incremental fashion. The basic idea of Q-learning is to store a so-called Q-value for each state-mode pair (Si, Mi) in a lookup table. The Q-value is used to predict the possible return of each power management mode at any given system state. We evenly divide the range of supply-demand mismatch into four intervals, with each mapping to a given Q-table entry. We choose a four-entry table because more entries will increase the computation complexity significantly [80].

Figure 5-4 illustrates the control flow in a given mode tuning period. In the Switch stage, the mode tuning agent determines the current state Si based on monitored data and selects a power management mode. In the Analyze stage, the agent waits for a sequence of load tuning activities, collects profiling data (the feedback information), and calculates the reward value. In the Learn stage, the Chameleon controller updates


the Q-table based on a comprehensive evaluation of the previous mode decisions and the reward value.

Algorithm 1 shows our learning-based mode tuning scheme. The agent chooses the mode that has the larger Q-value at any given system state. Meanwhile, it occasionally takes a random power management mode instead of the optimal one observed from the Q-learning table. This is referred to as exploration of the environment and helps to avoid local optima [80]. In Algorithm 1, the discount factor determines the importance of future rewards; the learning rate determines to what extent the newly acquired information will override the old information.

Initialize: All state-action pairs are initialized to zero
Requires: discount factor, learning rate, exploration probability
Set mode register with mode if rand()
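The reward computation and the one-step Q-learning update described above can be sketched in Python as follows. The table sizes follow the four-interval, two-mode design in the text; the learning constants and core data structures are illustrative choices, not values from the study:

```python
import random

N_STATES = 4                    # four supply-demand mismatch intervals
MODES = ("E", "P")              # energy- and performance-oriented modes
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.05  # illustrative learning constants

# All state-mode pairs are initialized to zero, as in Algorithm 1
Q = [[0.0] * len(MODES) for _ in range(N_STATES)]

def reward(ips_per_core, load_power):
    """R = (sum of per-core IPS) / (A(.) x F(.)) of the load power series."""
    mean = sum(load_power) / len(load_power)
    amp = (sum((p - mean) ** 2 for p in load_power) / len(load_power)) ** 0.5
    sgn = [1 if p - mean >= 0 else -1 for p in load_power]
    freq = 0.5 * sum(abs(a - b) for a, b in zip(sgn[1:], sgn))  # zero crossings
    return sum(ips_per_core) / (amp * freq)

def select_mode(state):
    """Epsilon-greedy: occasionally explore a random mode to avoid local optima."""
    if random.random() < EPSILON:
        return random.randrange(len(MODES))
    return max(range(len(MODES)), key=lambda m: Q[state][m])

def learn(state, mode, r, next_state):
    """Standard one-step Q-learning update at the end of a mode-tuning interval."""
    Q[state][mode] += ALPHA * (r + GAMMA * max(Q[next_state]) - Q[state][mode])
```

The hardware agent performs the same three steps each period: map the measured supply-demand mismatch to a state, pick a mode from the Q-table (with occasional exploration), then fold the observed reward back into the table.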

5.2.1 Energy Utilization

We first analyze the energy utilization profile of different power management schemes. As shown in Figures 5-5, 5-6, 5-7 and 5-8, when the green power budget is high, the efficiency of Chameleon and Threshold are both close to the existing energy-oriented approach. On average, Chameleon could achieve a 97% utilization rate compared to an aggressively supply-tracking based design and shows 95% renewable energy utilization.

Figure 5-5. Renewable energy utilization for different workloads (normalized to energy-oriented design).

Figure 5-6. Renewable energy utilization for different workloads (normalized to energy-oriented design).

Second, if the green power budget is tight, the efficiency of different schemes can vary significantly when faced with different power budget variability. If the budget variation is low, mode-switching (i.e., Chameleon and Threshold) shows nearly 50% less renewable energy utilization since it trades off energy efficiency for performance. However, when the power variability is high, mode-switching based design outperforms


Figure 5-7. Renewable energy utilization for different workloads (normalized to energy-oriented design).

Figure 5-8. Renewable energy utilization for different workloads (normalized to energy-oriented design).

energy-oriented design significantly. This is because the latter approach loses its advantages of directly utilizing the green power budget when the power budget frequently drops below a very low level. In Figure 5-9 we show the energy utilization

Figure 5-9. Energy utilization of W-driven power systems (normalized to S-driven).

profile from the point of view of the stored energy. The system increasingly gives P-mode (i.e., using stored energy to boost system performance) high priority as the green power budget becomes tight and variable. Our results show that about 75% of


the green power must first be stored into a battery before it can be used by the computing system to meet its performance/reliability goal.

5.2.2 Performance Acceleration

We further evaluate the benefits of mode switching in terms of average system throughput, as shown in Figure 5-10. Compared to the energy-oriented design, Chameleon mode-switching power management could improve the performance by 1.5x when the power generation is high. When the green power budget is tight, our technique shows 16x increased performance compared to a rigid energy-oriented design. More importantly, the Chameleon design outperforms the simple threshold-based mode switching scheme due to its online learning capability. Although the actual effectiveness of our learning based mode-switching varies with different workloads, Chameleon shows 13% higher workload speed on average across all the workloads.

Figure 5-10. Comparison of average throughput across different power management schemes.

5.2.3 Reliability Benefits

The effect of aggressive load power adaptation on the processor is an open question. Prior study has shown concern for the lifetime issue of using power cycling on servers but does not dig into it. We argue that aggressive power tracking and load power tuning can result in an excessive number of thermal cycles (TC), which will result in permanent failure of semiconductors [52].

Chameleon alleviates this problem as the mode tuning agent is aware of the load power variability and tries to avoid it. Figure 5-11 shows the mean time between failures


A-HPLV B-HPHV C-LPLV D-LPHV

Figure 5-11. Comparison of processor lifetime across different power management schemes (HPLV: High Power Low Variability, LPHV: Low Power High Variability, LPLV: Low Power Low Variability, HPHV: High Power High Variability).

(MTBF) of throughput processors in the Chameleon and Threshold designs. The results are normalized to that of the energy-oriented approach. As we can see, the second and third plots show less MTBF improvement compared to the first and fourth plots. This is because the energy-oriented design causes fewer thermal cycles when the green power supply is relatively stable and abundant. Overall, mode switching power management could improve the processor lifetime significantly, and Chameleon even shows 42% longer MTBF than Threshold due to its variability-aware light-weight mode switching activities.


5.3 Related Work

Sustainability has become an increasing concern in recent studies. In this section we discuss the state-of-the-art designs in the emerging green energy-aware computing research.

5.3.1 Matching Server Load

There have been several pioneering studies on matching computer power demand with renewable power budget [81-88]. For example, Li et al. proposed a load adaptation scheme for solar energy powered multi-core servers [82] and Sharma et al. discussed the impact of tracking intermittent power on system performance [83]. They both chose hardware tuning knobs for load matching. On the other hand, Goiri et al. leveraged workload scheduling based on the availability of renewable energy [84, 85]. Also, Zhang et al. explored a workload routing algorithm for maximizing green energy usage [86] and Li et al. proposed energy-aware workload migration for improving load matching efficiency [87]. At HP, Arlitt et al. designed clusters that adjust their load based on the green power availability [88].

5.3.2 Power Source Management

Another design approach is to fully leverage ancillary power systems or the utility grid. For example, Govindan et al. evaluated the benefits of using onsite energy storage for managing power demand surge and power shortage problems [89, 90]. Deng et al. investigated distributed grid-tied inverters for integrating green energy at different design levels [91].

5.3.3 Supply-Load Co-Management

We notice that recent designs have highlighted cooperative tuning of renewable energy sources and computing workload to achieve better performance. For example, a joint control of computing power demand and onsite generator output is discussed in [92] for optimizing the green energy utilization and workload performance simultaneously. In [93], the system not only takes advantage of the self-tuning


capabilities of the computing load, but also intentionally switches the load between green energy and utility grid for better performance. Different from existing works, this paper explores intelligent integration of different power management schemes for the optimal tradeoff between performance and efficiency. To our knowledge, this is the first paper that applies a learning technique to throughput server power management to achieve reliability-aware adaptive green computing.


CHAPTER 6
ON-CHIP MEMORY OPTIMIZATION OF COMPUTE CORE

6.1 Resistive Memory Based Compute Core Design

In this section, we introduce the STT-MRAM and SRAM based GPU compute core on-chip memory architecture that includes a write-optimized (latency and energy) register file design for GPU, a hybrid shared memory architecture, and an STT-MRAM based on-chip read-only cache design.

Figure 6-1. Differential Memory Update based GPU register unit architecture with STT-RAM register file.

6.1.1 Arrayed Register File Organization with Differential Memory Update

In Figure 6-1, we have shown the register unit architecture for our throughput processor based on the operand collector architecture [53]. Unlike the traditional SRAM based register file, we design the entire register file memory array using STT-MRAM based memory cells. The design provides lower dynamic and leakage energy consumption compared to an SRAM based main register file. In addition, the design also enhances


read performance and reduces die footprint. However, STT-MRAM memory exhibits significantly longer write latency and consumes higher write energy. According to the register write characterization in Section 3.3, a large percentage of register writes does not result in memory cell modification. Intuitively, at register word granularity (width of the register write bus per register file bank), we can compute the actual memory cell modifications (0→1 or 1→0) by comparing the content of the register word and the register write-back data generated by the compute core pipeline. By ignoring the unchanged memory cells, we obtain a register write mask at register word granularity.

In each register bank interface, we propose to design a differential memory update unit (DMU), which will be responsible for calculating the mask at the register word granularity. In addition, by checking the set bits of the mask, the DMU will also provide a per-bit word-line enable signal. Since the bank arbitration unit in the operand collector always prioritizes the write operation over the read in each bank, for any incoming write request the DMU will read the current content of the register to generate the mask. Finally, the mask will restrict the large amount of current drawn by the unchanged bit-lines connected to the STT-MRAM cells. In the context of GPU, it will reduce a large percentage of register file access power without affecting the performance. Since the thread warps are scheduled in an interleaved fashion in the GPU compute core pipeline, the latency of the additional read operation during the write does not affect consecutive register read latencies and overall performance. Due to the fact that STT-MRAM read energy is significantly smaller compared to SRAM read, the overall energy consumption of register write is reduced due to masked memory update.

We choose read-before-write memory update as opposed to early write termination (EWT) [94] for three reasons. Firstly, the register reuse interval is long enough to prevent performance degradation due to the additional read. Secondly, considering the additional energy overhead (0.0457 nJ/cell) of the EWT circuit and unchanged cell (0.148 pJ/cell), the read energy (0.013 nJ/cell) is


significantly lower (71%) [94]. Thirdly, in terms of area and design complexity, the implementation overhead of EWT leaves read-before-write as a better choice.

Figure 6-2. Microarchitecture of DMU (K=4) with multi-array register bank organization.

Figure 6-2 shows the microarchitecture of the DMU. At the beginning, it treats each register write within a warp as a read operation. So it enables the read-line of the register word using the address of the write request register. It compares the existing and incoming values of the register word in the read-before-write unit (RBW) to generate a mask that acts as an enable signal. Finally, it uses the write enable signal (WES) to determine whether to execute the write operation for the cell or not. However, a per-bit write enable signal needs an additional address decoder; as a result it incurs large area overhead. To avoid area overhead and still perform masked register write, we propose a novel arrayed register bank architecture. We propose to split an N-bit register word into M arrays of K bits/array, where N = M × K. These arrays will be resident in a


single bank, but will have separate decoder and write enable signals. Therefore, each K bits of the array share a single word-line enable signal. Philosophically, this reduces the granularity of write operations. Register write characterization reveals that GPGPU kernels frequently update contiguous memory cells within a register. This opens up the opportunity of register update with a granularity of multiple contiguous bits. Hence, instead of per-bit modification, we introduce per-array modification in the bank. A single bit modification within the array will trigger a whole K-bit array write. We explore several values for determining the array width (K = 1, 2, 4, 8, 16) in terms of area overhead and power saving in Section 6.2. Moreover, in the context of GPUs, the arrayed register file memory layout opens up finer-granularity clock gating opportunities to further reduce dynamic energy consumption.

6.1.2 Coalesced Register Update

The register unit receives write requests from the write-back stage of the compute core pipeline. During a register-write operation, every bank can write at most one register word, which is determined by the width of the per-bank register file write bus. Generally, each register word update requires the entire register bank word-line to be activated. However, in the proposed arrayed architecture, the total array activation count depends upon the number of updated arrays, which is less than or equal to the total array count that constitutes a whole row. Unwanted activation results in unnecessary power consumption. Intuitively, using a wider access width can significantly reduce such power loss. With a wider write-port, a greater number of bits per row can be written, thus saving row activation energy.

In throughput architecture, thousands of threads work in smaller batches (warps). Within each warp, threads work in lockstep fashion and generate several register write operations in quick succession. All the register writes within a single batch go to a single register bank in the traditional GPU. For each thread, the write requests are serviced at register word granularity. Hence, to service the entire warp, several cycles


are wasted. However, using a wider register write port, GPU warps can finish the pending register writes at a faster rate. Naturally, it opens up a performance improvement opportunity as well.

Figure 6-3. Coalescing register access using write-back buffer.

At any given time, several active warps are resident in a GPU. Often, individual warps access several registers from a single bank. Therefore, using per-bank write request buffering, we can coalesce multiple register writes across different warps that are writing to the same row of the register bank. To this end, we propose a per-bank buffered write-back architecture to enable intra- and inter-warp register write coalescing within each bank. We explore an optimum register write access width to determine the size of the register write port while considering area, power and performance tradeoffs.

Figure 6-3 shows the buffered write-back stage architecture based on [95]. Unlike the latency based buffering in [95], we propose per-SIMD-lane (8-32 lanes/compute core) buffers and per-bank write address buffers. Figure 6-4 shows the steps involved in the coalescing process. The coalescing logic searches the per-bank address buffer to merge multiple register writes going to adjacent cells in a row of a register bank. Each bank with pending requests generates a wider write request in each cycle by releasing the data from the register content buffer. Simultaneously, the scoreboard that manages read-after-write hazards is also updated to indicate multiple register release. We propose


to implement SRAM based buffers due to the smaller sizes of these buffers and the faster performance requirement.

Figure 6-4. Register write-back coalescing flowchart.

6.1.3 Hybrid Shared Memory Architecture

Shared memory, or software managed on-chip scratchpad memory, in GPUs provides additional performance improvement opportunity for GPGPU applications. By exploiting application behavior, GPGPU developers manage repetitive data access using shared memory. In essence, it behaves like a software-managed cache for GPGPU applications. The data access behavior of GPGPU applications exhibits four different patterns: temporal locality across warps, spatial locality across warps, a mix of temporal/spatial locality across warps, and no locality. Applications with shared data locality among warps can further be optimized using shared memory caching with no additional programming overhead. In contrast, applications with no or low locality shared data


access can be optimized using increased shared memory size. Moreover, on average, most of the GPGPU applications have multiple kernels (execution phases) that are temporally separated. For those, the shared memory access pattern within a single application changes during the overall execution period. Intuitively, to resolve these issues we need a larger, cached shared memory. However, this will introduce power consumption overhead in the existing SRAM based shared memory design.

To this end, we propose to implement a configurable hybrid shared memory architecture that has SRAM and STT-MRAM based memory cells. Using STT-MRAM in the shared memory design, we can reduce dynamic power, leakage power and area. We propose an additional SRAM based shared memory area that can be configured using software level APIs to behave as cache or RAM. For temporal or spatial locality shared memory access, the additional SRAM based cache reduces STT-MRAM access. Using compiler assistance, we forward the last SRAM cache eviction data to the next level of memory, avoiding STT-MRAM. Moreover, to avoid redundant STT-MRAM cell write operations, we use the read-before-write based differential memory update (DMU) architecture. In essence, this architecture not only reduces STT-MRAM shared memory write access latency using the SRAM cache, but also lowers energy consumption by using differential memory update. In addition, we further power gate the SRAM section of the shared memory to reduce power for applications with fewer shared memory updates. In Section 6.2, we provide details about the area overhead and power/performance tradeoff of the design while varying the sizes of these memories. Note that existing GPGPU workloads do not optimize the kernels for cached shared memory access. Therefore, the maximum potential of the cached architecture is unexplored.

In brief, the hybrid shared memory architecture provides the following benefits for GPGPU architectures: it reduces overall leakage power using STT-MRAM, it addresses performance and energy overhead due to the STT-MRAM write process using the SRAM cache, and application level shared memory tuning offers more flexibility to the application developer to optimize


performance and power of GPGPU applications that have phase change behavior across the kernels.

Figure 6-5. Hybrid shared memory architecture.

Shared Memory Caching

Figure 6-5 shows the shared memory architecture with SRAM cache. In GPU, several threads of a warp access different banks of the shared memory. To keep shared memory performance intact, we should keep bank level access parallelism unchanged in our design. In the proposed architecture, every SRAM cache line compacts different threads of a warp that request the same tag in different banks. This reduces total cache line replacements per transaction. In the best-case scenario, this leads to 1 replacement per transaction. In the worst-case scenario, it leads to warp-size count replacements per transaction. For 32 threads per warp, it will generate 32 transactions. Since warps execute in lock-step fashion, multiple replacements per transaction increase the warp level access latency of the transaction. In Section 6.2, we evaluate different design alternatives (direct mapped, set associative and fully associative) and properties


(block count, block size) of the hybrid design. The SRAM cache follows a write-back policy. Intuitively, the write-back policy avoids unnecessary STT-MRAM writes. Any write-miss during cache access writes the data to the SRAM cache but does not fetch the data from the STT-MRAM (no-write-allocate policy). Since shared memory access is software managed, having a unified cache does not hamper the bank level parallelism much. During cache line replacement, data is received and extracted by the DMU of each bank. Finally, shared memory is updated only with the modified bits. However, a read miss is served directly by the banks, avoiding the DMU logic. Each cache line is equipped with an additional bit to indicate final update. After the final update, cache line eviction sends the write-back to the next level of memory, avoiding STT-MRAM write energy and latency overhead.

6.1.4 Resistive Read Only Caches

Excluding the register file and shared memory, GPU compute cores also contain several on-chip read-only memories for texture mapping and storing constant data. In brief, they are part of the fixed-function logic in the GPU. In GPGPU applications, these memories are used as read-only caches to further expedite application performance. Fixed-function logic brings data into these caches from the host processor or other logic blocks of the GPU pipeline. Due to repetitive read access of the cached data, STT-MRAM based memory significantly improves the read performance and reduces dynamic read power and leakage energy. Hence, the energy and latency overhead of infrequent data loads from the off-chip memory location is compensated by frequent read operations of these caches. Also, STT-MRAM reduces the area overhead of the caches.

6.2 Evaluation

The following subsections evaluate the STT-MRAM based on-chip memory design benefits.
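Before turning to the evaluation, the differential memory update and per-array write-enable logic described in Section 6.1.1 can be summarized as a small behavioral sketch. The register width and array width K are example parameters, and the function names are illustrative:

```python
def dmu_mask(current, incoming, width=32):
    """Differential memory update: only cells that actually change
    (0->1 or 1->0) are enabled, so unchanged bit-lines draw no write
    current. XOR of old and new word gives the per-bit write mask."""
    return (current ^ incoming) & ((1 << width) - 1)

def array_write_enables(current, incoming, n_bits=32, k=8):
    """Per-array enables for an N-bit word split into M = N/K arrays
    sharing one word-line enable each. A single modified bit inside an
    array triggers a whole K-bit array write."""
    mask = dmu_mask(current, incoming, n_bits)
    return [((mask >> (i * k)) & ((1 << k) - 1)) != 0
            for i in range(n_bits // k)]
```

A write that modifies nothing yields an all-zero mask, which is how the large fraction of redundant register writes seen in the characterization avoids STT-MRAM write energy; a coarser K trades a few spurious array writes for decoder and wiring area, matching the K = 4/8 sweet spot reported below.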


Figure 6-6. Power saving in DMU based (K=32) register file with STT-RAM based memory cell.

6.2.1 Register Architecture Evaluation

Figure 6-6 shows the power savings of the proposed STT-MRAM based register file architecture. The register read power of the STT based memory is reduced by 49% on average. Wire transfer power overhead of the read operations is largely mitigated by the copious amount of transistor switching power. Benchmarks such as NQU, MT, ST3D, and MM show lower power savings due to fewer read operations compared to the rest. On average, the non-differential update based register file with STT-MRAM cells shows

Figure 6-7. Comparison of power saving in the register file with variable array sizes (K = 1, 2, 4, 8, 16).

a 1.43x increase in power consumption due to write overhead. Heavy write dominated workloads such as MM (1.81x), MT (1.55x), 64H (1.55x), CP (1.55x), PNS (1.56x) and RAY (1.56x) suffer the most. Using per-bit (K=32) write tracking, the differential memory update saves write energy by 44% (SRAM baseline) on average. GPGPU


workloads with fewer bit updates save write power by 82% (64H), 82% (BFS), 85% (NE), 93% (PNS) and 92% (GS). On the contrary, workloads such as MM, NW, LU and BP suffer power overhead due to a large amount of bit modification during register write. Overall, the read and write operations for these workloads save power by 11% (MM), 21% (NW), 17% (LU) and 22% (BP). Across 23 workloads, combined read and write operations save 46% dynamic power on average. LPS (72%), BFS (68%), GS (72%), NE (66%), PNS (65%) and HY (63%) are the highest power saving workloads.

Due to area and power overhead, we have re-architected the register file with multiple arrays per bank. Figure 6-7 shows a gradual increase in power saving as we decrease the array width of the arrayed register file organization. On average, across 23 workloads, array widths of 2 bits, 4 bits and 8 bits provide 12%, 29% and 53% power savings respectively. The area savings for additional wiring and address decoders in these configurations are 50%, 25% and 12% compared to the 32 array/bank configuration (baseline). Due to the contiguous register bit update pattern in MM, MT, LU, NN, BP, LIB and 64H, configurations such as K=4/8/16 consume power close to the K=32 configuration. Interestingly, LPS shows almost no write power saving for K=4/8/16 configurations; the interleaved bit update pattern of this workload hides the benefit of the proposed scheme. Based on the power saving and area overhead, the K=8/4 configurations provide a good tradeoff. On average, the STT-MRAM based register file reduces leakage power by 32% across 23 GPGPU workloads in our design and saves 46% area compared to SRAM. Irrespective of lower read and higher write latencies, performance degradation of our design is negligible (see Figure 6-8) across most workloads. Only BFS, NE and FWT show 4%, 3% and 3% performance degradation respectively due to longer write latency of consecutive dependent instruction scheduling from the same warp. The read dominated 64H workload benefits from the faster read performance of the STT-MRAM cell and achieves 4% performance gain.

Figure 6-9 shows the power and performance improvement of register access coalescing in the STT-MRAM based register file. On average, coalesced writes save 9% of


Figure 6-8. Performance impact of STT-RAM based register file with DMU compared to pure SRAM (baseline) implementation.

write power. Benchmarks such as LIB (9.4%), PR (9.4%), RAY (9.3%), and 64H (9.4%) provide maximum power benefit due to contiguous active threads in a warp that access contiguous registers in a row. The interleaved access pattern in BFS (6.3%), NQU (6.7%) and NN (8%) restricts the power saving. In Figure 6-9, MM (79%), 64H (42%), ST3D (38%), BN (29%) and CP (25%) experience good performance improvement using the register write coalescing. Coalesced access with a wider write port reduces the total number of register writes generated by each warp. For example, N threads/warp generate N write requests per warp that reach the same STT register bank. Regular coalescing of M threads/coalesce with all active threads per warp reduces the total number of write requests to N/M. This phenomenon significantly improves the overall register write latency of the workload and overall performance. However, in reality, thread divergence and memory miss divergence within a warp reduce the active thread count, which leads to non-contiguous active threads in warps and increases the number of total inactive threads across all the warps. Therefore, MM, 64H, LIB, PR and RAY experience lower power saving and performance benefit. Note that the register write coalescing scheme will provide comparatively higher power benefit in SRAM based memory arrays due to the additional CMOS transistors present in each memory cell compared to STT. For a 32-entry buffer, the area overhead of the coalesced write-back architecture is only 0.1 mm2 per shader core.
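The coalescing step being evaluated here can be sketched behaviorally: pending word-granularity writes that target the same row of the same bank are merged into one wide write request, so the row is activated once. The buffer organization and request format are illustrative simplifications of the per-bank address/content buffers:

```python
from collections import defaultdict

def coalesce_writes(pending):
    """Merge pending register writes (bank, row, col, data) that target
    the same row of the same bank into a single wide write per bank row.
    Later writes to the same column win, as in a write buffer."""
    rows = defaultdict(dict)
    for bank, row, col, data in pending:
        rows[(bank, row)][col] = data
    # One wide request per (bank, row): a column -> data map
    return [(bank, row, cols) for (bank, row), cols in rows.items()]
```

With N active threads per warp writing to the same bank, the N word-granularity requests collapse toward N/M wide requests (M words coalesced per request), which is the source of the row-activation energy saving and write-latency improvement reported above.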


Figure 6-9. Power saving and performance impact of coalesced register update (baseline: pure SRAM).

Figure 6-10. Power saving in hybrid shared memory.

6.2.2 Hybrid Shared Memory Evaluation

Figure 6-10 shows the power savings due to hybrid shared memory. Across 23 GPGPU workloads, differential memory update based STT-MRAM saves only 3% power compared to the SRAM based design. However, STT-MRAM-only shared memory saves 41%, 46%, 33%, 48% and 45% memory access power for MM, NW, BFS, PNS and LU. On the contrary, due to a large percentage of memory writes, shared memory does not save additional power for several GPGPU workloads. Benchmarks such as LPS, BN, FWT, PR and GS have 29%, 27%, 51%, 98% and 49% write accesses out of total shared memory accesses respectively. Therefore, un-optimized STT-MRAM shared memory suffers significant power overhead in these workloads. LPS, BN, FWT, PR and GS suffer 8%, 4%, 53%, 140% and 48% power overhead respectively. On the other hand,


the average performance degradation for STT-only shared memory is 5% (see Figure 6-11). However, workloads dominated by consecutive memory writes suffer significant performance loss. For example, 64H, BN, NQU, SLA, LPS and FWT experience 18%, 16%, 27%, 18%, 9% and 8% performance degradation respectively. The correlation between performance loss and power savings justifies the co-optimization approach of hybrid shared memory. A cache is used to revive the performance, and the SRAM based scratchpad memory extension reduces the write energy overhead. Hybrid shared memory improves the power overhead of LPS (13%) and BN (15%). Using the proposed scheme, FWT, PR and GS reduce power overhead by 3.8×, 2× and 4.4× respectively. Excessive write accesses with a significant amount of bit modification during write operations make these workloads suffer power loss.

Figure 6-11. Performance of STT-only shared memory.

Figure 6-12. Performance improvement using hybrid shared memory.
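The differential memory update behind these savings can be illustrated in a few lines: on a write, only the K-bit arrays whose contents actually change are rewritten, so write energy tracks the changed-array count instead of the full word width. This is an illustrative sketch with invented names, not the hardware update logic.

```python
def arrays_to_update(old_word, new_word, width=32, k=8):
    """Indices of the k-bit arrays that differ between the old and new
    register/shared-memory word. In the differential update scheme only
    these arrays are written, so a rewrite that changes few bits costs a
    fraction of a full-width write (sketch; names are ours)."""
    diff = old_word ^ new_word          # bit positions that actually flip
    mask = (1 << k) - 1
    return [i for i in range(width // k)
            if (diff >> (i * k)) & mask]

# A write that only flips bits in the low byte touches one 8-bit array
# out of four, i.e. roughly a quarter of the full-word write energy.
```

Smaller K catches more unchanged bits (the 12%/29%/53% trend for K=8/4/2 above) at the cost of more per-array enable decoders.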


Figure 6-12 compares the performance improvement of hybrid shared memory across several SRAM based cache configurations. 16KB and 32KB caches with 2-way and 4-way set associativity recover the performance losses for BN, NQU, SLA, and 64H. Hybrid shared memory improves the performance of BN, NQU, SLA, and 64H by 40%, 42%, 30% and 40% respectively due to the SRAM based cache (16KB, 2-way set associative) that backs up the STT-MRAM based scratchpad memory. The workloads add 14%, 15%, 12% and 12% additional performance improvement after recovering the IPC loss caused by the STT-MRAM-only shared memory design. FWT does not achieve performance improvement, and LPS shows 2% performance degradation compared to the SRAM based design. Note that LPS and FWT do not improve using the hybrid shared memory design. The distinguishing feature of these workloads is simultaneous power and performance degradation due to the STT-MRAM write penalty. Close investigation reveals that these workloads reuse recently written data at a faster rate, and their overall write access density is also higher. Moreover, during kernel execution, the temporal/spatial locality (which favors a cache) and intense write access with no locality (which favors RAM) of shared memory change dynamically over time. In the future, dynamic switching between the cache and RAM access modes of shared memory might resolve this issue. A 32KB STT-MRAM with 16KB cache and a 64KB STT-MRAM with 16KB cache save leakage power by 82% and 106% respectively. 32KB and 64KB STT-MRAM shared memories save area by 0.011 mm² and 0.021 mm² respectively. 16KB and 32KB 2-way set associative SRAM caches add additional area of 0.015 mm² and 0.029 mm². Therefore, a 32KB STT-MRAM with 16KB SRAM cache or a 64KB STT-MRAM with 32KB cache has negligible area overhead.
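The dynamic cache/RAM mode switching suggested above could be driven by a simple per-phase heuristic. The sketch below is our own illustration of one plausible policy (the dissertation does not implement it): write-heavy phases with fast reuse of recently written data favor the SRAM cache mode, while balanced read/write phases stay in the STT-MRAM RAM mode for its leakage savings.

```python
def shared_memory_mode(write_fraction, recent_reuse_rate,
                       write_thresh=0.4, reuse_thresh=0.5):
    """Pick the shared-memory access mode for a kernel phase.
    write_fraction: writes / total shared-memory accesses in the phase.
    recent_reuse_rate: fraction of accesses hitting recently written data.
    Thresholds are hypothetical tuning knobs, not measured values."""
    if write_fraction > write_thresh and recent_reuse_rate > reuse_thresh:
        return "sram-cache"    # absorb the STT write penalty in SRAM
    return "stt-ram"           # balanced phases keep the leakage savings
```

Under this policy, a phase like FWT's (high write density, fast reuse) would run in cache mode, while read-balanced phases would not pay the SRAM cache's dynamic-power cost.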


power by 34% on average compared to SRAM. Excluding HY, PR and LU, most of the GPGPU workloads show around 35% power savings due to constant read access rates. However, PR, LU and HY have limited read-only data accesses, which restricts their power improvement. On average, the leakage power saving due to resistive memory is 31%.

Figure 6-13. Percentage power savings using STT-RAM based read-only memory in GPU.

6.2.4 Overall Power and Performance Impact

Figure 6-14 shows the overall power saving and performance improvement of the STT-MRAM based compute core architecture using the arrayed register file organization, hybrid shared memory, and register write coalescing. On average, across 23 GPGPU workloads the architecture saves 22% power and improves performance by 16%. 64H (31% power/68% IPC), BN (22% power/54% IPC), NQU (16% power/23% IPC) and ST3D (21% power/36% IPC) receive the most benefit from the architecture. All of these workloads possess a large number of redundant register updates, temporally localized shared memory accesses, and a significant percentage of contiguously active threads in a warp. On the contrary, BFS (32% power/18% IPC), HY (29% power/2% IPC), PNS (29% power/2% IPC), NW (30% power/-2% IPC) and GS (32% power/8% IPC) have an ample amount of redundant register file writes, localized shared memory accesses, and thread level inactivity in warps with frequent reuse of register writes. These reduce power consumption without improving the performance significantly. Only FWT (16%


power/-9% IPC) and BP (11% power/-5% IPC) have moderate power savings with performance degradation. FWT suffers the maximum performance degradation due to the shared memory write latency overhead introduced by STT. Shared memory write latency also affects the performance of BP. Dynamic switching between RAM and cache behavior during kernel execution might resolve the performance degradation issue of FWT and BP.

Figure 6-14. Overall power saving and performance improvement.

6.3 Related Work

Zhao et al. [96] have proposed a non-volatile FPGA circuit based on STT-MRAM. They further enhance the power efficiency and the startup time of the design in [97]. Guo et al. [43] proposed applying MTJ devices to most of the on-chip storage to improve power efficiency while keeping the performance decrease within 5%. [42] proposed relaxing the non-volatility of STT-MRAM cells to reduce the high dynamic energy and slow write latencies. In [94], Ping et al. proposed early write termination for caches, which avoids writing unnecessary bit flips by reading the current content of the register. Since a heavy current pulse is applied at the end of the write operation to alter the bit, [94] proposed to wait until the end of the write pulse to determine the memory bit flip status. In early write termination, a considerable amount of energy is wasted during this wait in the write process. Though it is faster, the scheme consumes unnecessary power during the wait. In GPUs, since the warps are further interleaved in time, register reuse by the same warp provides enough time


to perform energy efficient but slower read-after-write operations instead of faster but power consuming early write termination. In [98], Sun et al. proposed a buffered L2 cache to reduce the write accesses to the L2 cache. In addition, they proposed an SRAM-MRAM based hybrid 3D L2 cache architecture. In [99], Wu et al. proposed inter/intra cache level hierarchy designs based on disparate memory technologies. Unlike [98], we propose a differential memory update based shared memory organization using software-configurable SRAM based cache/RAM and resistive memory based RAM. For write intensive GPGPU shared memory access applications, we use SRAM-cached resistive shared memory to reduce write latency and power. On the other hand, GPGPU applications with balanced read-write shared memory accesses can still benefit from the dynamic/leakage power saving of resistive memory without compromising the performance. In essence, to achieve write power reduction and faster write performance, we exploit GPGPU/graphics workload behaviors to enhance the throughput core architecture through differential memory update in a novel arrayed resistive memory organization. Franklin et al. [100] analyzed the reusability of updated register values and the reuse interval for parallel processors. Previously, an ample amount of work discussed the viability of register file caches in terms of performance for CPUs [101-106]. Unlike register file latency oriented work, we aim to design energy efficient throughput processor register files for deep-sub-micron technology nodes (very low leakage). Recently, Gebhart [5] proposed register file caches with hierarchical thread scheduling for energy efficient design. However, that work does not address the leakage issue of deep-sub-micron technology nodes for any compute core memories. We also incorporate architectural enhancements to reduce write power and latency, which are overheads of the STT-MRAM technology. Satyamoorthy [107] implemented an SRAM write-buffer to address the write latency overhead of STT-MRAM based shared memory and reported 17% energy with 50% area saving. Instead, we have enhanced the shared memory architecture using a differential bit updater (write energy saving) and a small cache per shared memory


bank (to recuperate the write latency overhead). In the context of phase change memory, Lee et al. [108] proposed a PCM cell read mechanism to improve the endurance of off-chip PCM based memory by avoiding write operations. At the L2 cache level, they check the dirty bit at word granularity to determine the update status of the memory word. In [109], Zhou et al. proposed PCM memory with a cell level write redundancy removal scheme to improve endurance. Instead of word level or cell level granularity, our techniques employ STT-MRAM based register file and shared memory updates at 4-16 bit granularity. This reduces the update status check (1 check/cell) overhead compared to [109]. Moreover, using the new arrayed (2/4/8/16 bits) GPU register file organizations, we reduce the per-bit enable-signal-decoder area overhead with only 16/8/4/2 decoder signals.


CHAPTER 7
EXPERIMENTAL SETUP AND METHODOLOGY

7.1 Simulation Infrastructure

7.1.1 GPGPU-Sim - The General Purpose GPU Simulator

In this study, we used GPGPU-Sim [41], a cycle accurate PTX-ISA simulator. GPGPU-Sim encapsulates a PTX ISA [110] functional simulator (CUDA-Sim) unit and a parallel SIMD execution performance simulator (GPU-Sim) unit. The simulator models shader cores, the interconnection network, memory controllers, texture caches, constant caches, L1 caches, and off-chip DRAM. In the baseline configuration, the streaming multiprocessors (SM) are connected to the memory controllers through a mesh or crossbar interconnection network. On-chip memory controllers are connected to the off-chip memory chips. A thread scheduler distributes blocks among the shader cores in a breadth-first manner; the least used SM is assigned the new block to achieve load balancing among cores [41]. The SIMD execution pipeline in the shader core has the following stages: fetch, decode, pre-execute, execute, pre-memory (optional), memory access (texture, constant and other) and write-back. GPGPU-Sim provides L1 global and local memory caches. In the simulator compilation flow, cudafe [111] separates the GPU code, which is further compiled by nvopencc [111] to produce PTX assembly. The ptxas assembler generates thread level shared memory, register and constant memory usage that is used by GPGPU-Sim during thread allocation in individual SMs. GPGPU-Sim uses the actual PTX assembly instructions generated for functional simulation. Moreover, GPGPU-Sim implements a custom CUDA runtime library to divert the CUDA runtime API calls to GPGPU-Sim. Host C code is linked with the GPGPU-Sim custom runtime. We have used the GPGPU-Sim configurations specified in Tables 7-1 and 7-2. To show that the GPU configuration has no impact on the workload data, we have used 6 different configurations. Config-2, the baseline configuration, and Config-5 are significantly


Table 7-1. GPGPU-Sim 2.1.1b configuration for workload exploration

Component | Config1 | Config2 | Base | Config3 | Config4 | Config5
Shader cores | 8 | 8 | 28 | 64 | 110 | 110
Warp size | 32 | 32 | 32 | 32 | 32 | 32
SIMD width | 32 | 32 | 32 | 32 | 32 | 32
DRAM controllers | 8 | 8 | 8 | 8 | 11 | 11
DRAM queue size | 32 | 32 | 32 | 32 | 32 | 128
Blocks/SM | 8 | 2 | 8 | 8 | 8 | 16
Bus width | 8 bytes/cycle (all configurations)
Const./Texture cache | 8KB/64B (2-way, 64B line, LRU) (all configurations)
Shared memory | 16KB | 8KB | 16KB | 16KB | 16KB | 32KB
Reg. count | 8192 | 8192 | 16384 | 16384 | 16384 | 32768
Threads per SM | 1024 | 512 | 1024 | 1024 | 1024 | 2048

Table 7-2. GPGPU-Sim 2.1.1b on-chip interconnect configuration for workload exploration

Feature | Config1 | Config2 | Base | Config3 | Config4 | Config5
Topology | Mesh | Mesh | Mesh | Crossbar | Mesh | Mesh
Routing | Dim. order | Dim. order | Dim. order | Dim. order | Dim. order | Dim. order
Virtual channels | 2 | 2 | 2 | 1 | 4 | 4
VC buffers | 4 | 4 | 4 | 8 | 16 | 16
Flit size | 16B | 8B | 8B | 32B | 32B | 64B

different from each other, whereas the other configurations are changed in a subtle manner to observe the effect of a few selected components. Additionally, we have heavily instrumented the simulator to extract GPU characteristics such as floating-point instruction count, arithmetic intensity, percentage of divergent branches, percentage of divergent warps, percentage of divergent branches in the serialized section, length of the serialized section, etc. The renewable energy based GPU architecture evaluation framework is based on the GPGPU simulator. The baseline configuration in Table 7-3 supports four different clock domains: shader core, L2, on-chip interconnect, and DRAM. We have restructured the simulator so that it supports per-core power gating, and each shader core can run at a different voltage within its own clock domain.
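The breadth-first block distribution described in Section 7.1.1 (each new thread block goes to the least used SM) can be sketched in a few lines. The function and parameter names are ours; the blocks-per-SM capacity corresponds to the Blocks/SM limit in Table 7-1.

```python
def assign_blocks(n_blocks, n_sms, capacity):
    """Distribute thread blocks breadth-first across SMs: each new block
    goes to the currently least-loaded SM, as GPGPU-Sim's scheduler does
    (illustrative sketch). Stops when every SM is at capacity; in the
    simulator the remaining blocks would wait for a block to retire."""
    load = [0] * n_sms
    placed = []
    for b in range(n_blocks):
        sm = min(range(n_sms), key=lambda i: load[i])
        if load[sm] >= capacity:       # all SMs full: later blocks wait
            break
        load[sm] += 1
        placed.append((b, sm))
    return placed, load

# 10 blocks over 4 SMs land as loads [3, 3, 2, 2]: balanced to within one.
```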


Table 7-3. GPGPU-Sim v3.0.1b configuration for renewable energy based GPU architecture evaluation

Parameters | Configuration
Shader core numbers | 16
Thread batch size | 32
SIMD pipeline | 32
Cache (L1/Const/Tex) | 32KB/16KB/16KB
Shared memory | 48K
Register count | 16384 per shader core
Cache (L2) | 512KB
Threads per SM | 1024
Topology | Mesh
Channel BW | 16B
Memory ctrl. | 6

Figure 7-1. Implementation block diagram of GPU-PowerSim.

7.1.2 GPU-PowerSim - The Power Simulator for GPU Architectures

The architecture level GPU power simulator consists of GPGPU-Sim v3.0.1 [41] with a GPU power model (see Figure 7-1). The throughput architecture power model in the power simulator is organized as a three level hierarchy of the programmable logic blocks available in the GPU. At the architectural level, the GPU is decomposed into major components such as compute cores, L2 cache, interconnect and memory controllers. Compute cores are further divided into SIMD cores, fetch/decode units, instruction issue units, on-chip caches, the large register file, shared memory, and the thread scheduler. At the circuit level, the architectural blocks are mapped to circuit structures like arrays and complex logic. Caches are modeled as memory arrays using CACTI [112] at the circuit level. The interconnect is composed of signal links and routers. A router is composed of flit buffers, arbiters and crossbars, which are modeled analytically. The memory controller


models the design by Denali [113] and is composed of a front-end processing engine and a physical interface. While the front end is modeled using CAM and RAM structures, the latter two are modeled empirically. At the device level, the model uses data from the ITRS roadmap [44] to calculate physical parameters of devices, such as capacitance, resistance, etc. The power model uses relevant components from McPAT [65] and fits them into the compute core pipeline. Unlike McPAT, the GPU power simulator models a multiple SIMD lane based throughput processor pipeline that has a contrasting architecture compared to multicore processors. We validated our power model against actual power measurements taken with current sensors [8] attached to a GTX 470 GPU, which show a satisfactorily close match.

Table 7-4. GPU configuration (optical/electrical)

Parameters | Optical GPU | Electrical Baseline
Shader core | 64 | 64
Thread batch size | 32 | 32
SIMD pipeline | 32 | 32
Memory controllers | 16 | 16
MC queue size | 32 | 32
L1/L2 cache | 32/512KB | 32/512KB
Register file | 32K | 32K
Shared memory | 48KB | 48KB
Threads per SM | 2048 | 2048
Interconnect topology | Crossbar | 2D Mesh
Channel bandwidth | 16/32/64B | 16B
Clock (SHD/Icnt/MC) | 2/5/1.6GHz | 2/5/1.6GHz

7.1.3 Throughput Nanophotonic Interconnect Simulator

Table 7-5. Optical crossbar configuration

Optical Crossbar Structure | Abbreviation | Channel BW
Single Write Multiple Read | OXS-16B | 16 bytes
Single Write Multiple Read | OXS-32B | 32 bytes
Single Write Multiple Read | OXS-64B | 64 bytes
Multiple Write Single Read | OXM-16B | 16 bytes
Multiple Write Single Read | OXM-32B | 32 bytes
Multiple Write Single Read | OXM-64B | 64 bytes


Our evaluation is based on the cycle-based simulator GPGPU-Sim [41]. It contains a modified version of the electrical interconnect simulator Booksim [63] to simulate shader core and memory controller traffic. We have replaced Booksim (Intersim in GPGPU-Sim) with an in-house optical interconnect simulator. It models the serpentine network described in Figure 4-3 that connects 16 TPCs and 16 memory controllers using a crossbar, and simulates flit-level optical traffic. It implements optical token-based arbitration to realize the MWSR crossbar and also implements a token-less SWMR crossbar. All types of traffic (read request, write request, and read reply) enter the interconnect simulator through an interface buffer. In the MWSR crossbar, a reserved optical token leaves the source node and travels along with the tail of the last flit of the packet until it reaches the destination. Once the transfer is completed, the free token traverses the waveguide until another node grabs it. The destination nodes identify tokens. On the contrary, in SWMR, each optical channel is writable by only one source node and there are multiple channels per destination node. The SWMR crossbar behaves like a high throughput, low latency electrical crossbar with variable source-to-destination delay.

Table 7-6. Optical loss in different components

Parameter | Value
Optical coupler | 1 dB
Splitter | 0.2 dB
Waveguide loss | 1.3 dB/cm
Filter through | 1E-4 to 1E-2 dB
Modulator insertion loss | 0 to 1 dB

We have modeled our novel warp scheduler in the simulator and also incorporated the proposed memory subsystem model, which is capable of performing sequential and concurrent sub-channel allocation for simultaneously issued memory accesses. We used the two different system models presented in Table 7-4. The baseline system has a state-of-the-art electrical network, electrically connected memory, and an immediate post-dominator based round-robin warp scheduler. The main memory specification


is as described in Section 4.1.1. Table 7-5 shows the 6 optical crossbars implemented to evaluate our design. We instrumented GPGPU-Sim and GPU-PowerSim to extract the hardware access statistics needed to calculate power. We developed GPU-PowerSim, an architecture level GPGPU power simulator based on a heavily remodeled McPAT [65] that fits into the GPU pipeline. GPU-PowerSim interfaces with GPGPU-Sim to calculate the runtime power of the major components of the GPU, including an electrical NoC. The NoC model is composed of single links and a traditional four-stage router with flit buffers, arbiters and a crossbar. The interconnect simulator simulates the electrical NoC at the 22nm technology node. We further scale the NoC parameters down to 16nm. For the optical NoC power consumption, we used the statistics reported in [114, 115], as summarized in Table 7-6. We use an energy coupling efficiency of 40% in our power model. In order to achieve a bit error rate of 10⁻¹⁵, each photo-detector requires 5 µW of power to successfully receive data [71], and each modulator consumes 200 fJ of energy to modulate one bit of data [116]. To model the trimming issue [6], we assume 1 µW of heating power per ring per Kelvin.

Table 7-7. Power management configurations for renewable energy based GPGPU architecture

Parameters | Configuration
Core voltage (V) | 0.95, 1.05, 1.15, 1.25, 1.35, 1.45
Core frequency (GHz) | 1.2, 1.4, 1.6, 1.8, 2.0, 2.2
Interconnection | 1.05V/1.4G, 1.25V/1.8G, 1.45V/2.2G
Memory ctrl./L2 | 0.95V/1.0G

Table 7-8. HotSpot parameters for renewable energy based GPGPU architecture

Parameters | Value
Chip thickness/area | 0.15mm/530mm²
Convection resistance | 13.9 K/W
Heat sink width/thickness | 0.07m/69mm
Interface material thickness | 0.025mm

7.1.4 GPGPU Temperature Profile Generation

We evaluate the impact of power management on the processor temperature variation using HotSpot 5.0 [54]. We instrumented HotSpot to model the thermal


characteristics of the GPU die and also integrated it with our GPGPU simulation infrastructure. Table 7-8 summarizes the HotSpot parameters used in our experiment. Table 7-7 summarizes the GPGPU chip DVFS configurations. We have generated dynamic power traces for each processor component to use as input to HotSpot. In this study we monitored the highest temperature of the processor and counted the thermal cycles throughout the simulation. We adopt the reliability estimation method studied in [52].

7.1.5 Resistive Memory Based GPGPU Architecture Simulator

Table 7-9. GPGPU-Sim 3.0.1b simulator configuration

Parameters | Values
Compute core count | 30 (10 clusters, 3 cores/cluster)
Clock (core/Icnt/MC) | 1.25/0.65/0.8 GHz
Thread batch size | 32
SIMD pipe width | 32
Shared memory | 32KB
Shared memory banks | 16
SHM latency (cycles) | 20/20 (SRAM), 16/40 (STT)
Cache (L1/Cons/Tex) | 16/8/4 (KB) per core
Threads/core | 1024
Memory controllers | 8
Registers/core | 16K
Register file banks | 32
Register latency (R/W) | 2/2 (SRAM), 2/4 (STT) cycles
Interconnect topology | Mesh
Channel BW | 32B
Technology node | 22nm

Table 7-9 shows the architecture level GPU power simulator configuration, which consists of GPGPU-Sim v3.0.1 [41] with a GPU power model. The throughput architecture power model in the power simulator is organized as a three level hierarchy of the programmable logic blocks available in the GPU. At the architectural level, the GPU is decomposed into major components such as compute cores, L2 cache, interconnect and memory controllers. Compute cores are further divided into SIMD cores, fetch/decode units, instruction issue units, on-chip caches, the large register file,


shared memory, and the thread scheduler. At the circuit level, the architectural blocks are mapped to circuit structures like arrays and complex logic. Caches are modeled as memory arrays using CACTI [112] at the circuit level. The interconnect is composed of signal links and routers. A router is composed of flit buffers, arbiters and crossbars, which are modeled analytically. The memory controller models the design by Denali [113] and is composed of a front-end processing engine and a physical interface. While the front end is modeled using CAM and RAM structures, the latter two are modeled empirically. At the device level, the model uses data from the ITRS roadmap [44] to calculate physical parameters of devices, such as capacitance, resistance, etc. The power model uses relevant components from McPAT [65] and fits them into the compute core pipeline. Unlike McPAT, the GPU power simulator models a multiple SIMD lane based throughput processor pipeline that has a contrasting architecture compared to multicore processors. We validated our power model against actual power measurements taken with current sensors [8] attached to a GTX 470 GPU, which show a satisfactorily close match. We have heavily instrumented GPGPU-Sim v3.0.1 [41] to obtain register file, shared memory, and on-chip cache update statistics. Moreover, we have implemented the latency model for all the on-chip memory models. Since an STT-RAM cell has different latencies for read and write operations, we have customized the simulator to apply different latencies for register read, register write-back, shared memory load and shared memory store operations. In addition, we have implemented the differential memory update mechanism within the on-chip memory models of the simulator. We built an NGSPICE circuit model to obtain the STT-RAM cell electrical characteristics. We use a 22nm CMOS process technology with a supply voltage of 0.9V. The physical MOSFET model is based on 22nm BSIM [117]. The circuit model has two parameters: the write pulse duration (5 ns [98]) and the threshold current of the MTJ devices. We assume the size of the MTJs is 50 nm × 75 nm [118] with a critical current density of 9×10⁵ A/cm² [118].
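The asymmetric read/write latencies the simulator was customized to apply can be captured in a small lookup model. The numbers below are the published Table 7-9 values; the function and dictionary names are our sketch, not simulator code.

```python
# Per-structure access latencies in cycles (Table 7-9): STT-MRAM reads
# are as fast as or faster than SRAM, while writes are slower.
LATENCY = {
    ("register", "sram"): {"read": 2,  "write": 2},
    ("register", "stt"):  {"read": 2,  "write": 4},
    ("shared",   "sram"): {"read": 20, "write": 20},
    ("shared",   "stt"):  {"read": 16, "write": 40},
}

def access_cycles(structure, tech, op):
    """Cycles the timing model charges for one access to a given on-chip
    memory structure; this asymmetry is why write-dominated workloads
    lose performance on STT-only designs while read-dominated ones
    (e.g. 64H) gain."""
    return LATENCY[(structure, tech)][op]
```

For example, a shared-memory store costs 2× more cycles on STT than on SRAM (40 vs 20), while a load costs 20% fewer (16 vs 20).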


The derived threshold current is approximately 50 µA, which can be provided by transistors fabricated using existing CMOS technologies [108]. The MTJ is simulated using a constant resistor in the STT-RAM cell model, while multiple time-varying resistors are used during the switching simulation. To model the latency and power of the register file and caches, we implemented the STT-RAM cell model in CACTI 6.5 [112] using the NGSPICE simulation results as input parameters.

Table 7-10. Synopsis of GPGPU-Sim 3.0.1b simulated workloads [1-3]. (AI: ALU Inst./Memory Inst.; SHM: Shared memory)

Workload (Abbr.) | Arithmetic Intensity (AI) | SHM?
Matrix Multiplication (MM) | 42.4 | Y
Matrix Transpose (MT) | 195.6 | N
64 Bin Histogram (64H) | 22.8 | Y
LIBOR (LIB) | 353.6 | N
Needleman-Wunsch (NW) | 80.1 | Y
Laplace Solver (LPS) | 118.9 | Y
Raytrace (RAY) | 335.0 | N
Parallel Prefix Sum (SLA) | 1.9 | Y
Stencil 3D (ST3D) | 822.3 | Y
Breadth First Search (BFS) | 99.7 | N
Hybrid Sort (HY) | 32.1 | Y
Nearest Neighbor (NE) | 196.8 | N
Petri Net Simulation (PNS) | 411.5 | Y
Binomial Options (BN) | 39.4 | Y
Coulombic Potential (CP) | 471.8 | N
Fast Walsh Transform (FWT) | 289.1 | Y
LU Decomposition (LU) | 3.5 | Y
Neural Network (NN) | 71.5 | N
Parallel Reduction (PR) | 1.0 | Y
Back Propagation (BP) | 167.6 | Y
Gaussian Elimination (GS) | 5.7 | Y
N-Queen Solver (NQU) | 209.3 | Y
Speckle Reducing Anisotropic Diffusion (SRAD) | 14.0 | Y

Using 23 real world GPGPU/graphics workloads, we have evaluated the effectiveness of the proposed compute core architectural enhancements in reducing energy consumption. Table 7-10 lists the workloads, which are programmed using Nvidia CUDA APIs and possess a good mix of intense arithmetic computation (increased register file access) and large local data sharing behavior (increased shared memory access) [62, 119].
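The cited MTJ parameters admit a quick sanity check: the ideal critical current is the critical current density times the junction area. The sketch below reproduces that arithmetic (function name and the comparison to the derived threshold are our illustration); the NGSPICE-derived ~50 µA threshold sits above this ideal value, leaving switching margin.

```python
def mtj_critical_current(jc_a_per_cm2=9e5, length_nm=50, width_nm=75):
    """Back-of-envelope MTJ switching current: critical current density
    (9x10^5 A/cm^2 [118]) times the 50 nm x 75 nm junction area, with
    the nm dimensions converted to cm. Returns amperes."""
    area_cm2 = (length_nm * 1e-7) * (width_nm * 1e-7)   # 1 nm = 1e-7 cm
    return jc_a_per_cm2 * area_cm2

# mtj_critical_current() = 3.375e-05 A, i.e. about 34 uA of ideal
# switching current versus the ~50 uA threshold the circuit provides.
```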


7.2 GPGPU Workloads

7.2.1 GPGPU Workload Analysis

To perform the GPU benchmark characterization analysis, we collected a large set of available GPU workloads from the Nvidia CUDA SDK [3], the Rodinia Benchmark [1], the Parboil Benchmark [2] and some third party applications. Due to simulation issues (e.g. simulator deadlock), we excluded some benchmarks from our analysis. The problem size of the workloads is scaled to avoid long simulation times or unrealistically small workload stress. Table 7-11 lists the basic workload characteristics.

7.2.2 Renewable Energy Based Throughput Architecture

To evaluate the benefits of incorporating dynamic frequency adjustment and power gating into the GPU microarchitecture, we choose from a large set of available GPU workloads from the NVIDIA CUDA SDK [3], the Rodinia Benchmark [1], the Parboil Benchmark [2], and some third party applications. The selected workloads show a good mix of memory access and computation instructions. The problem size of the workloads is scaled to avoid long simulation times or unrealistically small workload stress. Table 7-12 lists the evaluated workloads.

7.2.3 Workloads for Nanophotonic GPU Evaluation

In the nanophotonic GPU evaluation, we use 17 real-world GPGPU workloads to evaluate the benefit of integrating the nano-photonic interconnect in the GPU microarchitecture design. Table 7-13 lists the evaluated GPGPU workloads, which are programmed using the NVIDIA CUDA programming model [9] and possess a good mix of heavy memory access and intense computation behaviors. The problem sizes are modified to produce large computation and heavy network traffic, while keeping the simulation time reasonable.

7.3 Evaluation Metrics and Methodology of GPGPU Kernel Characterization

7.3.1 Evaluation Metrics

To evaluate the accuracy and the effectiveness of the proposed workload characteristics, we use the set of metrics listed in Table 7-14. Activity factor [61] is defined as the


Table 7-11. GPGPU workload synopsis (IC: Instruction count, AI: Arithmetic intensity, BD: Branch divergence, MM: Merge miss, SHM: Shared memory access, B: Barrier instructions)

Abbr. | Workload | IC | AI | BD?/MM?/SHM?/B?
LV | Levenshtein Edit-distance Calculation | 32K | 221.5 | Y/Y/Y/N
BFS | Breadth First Search [120] | 589K | 99.7 | Y/Y/Y/N
BP | Back Propagation [1] | 1048K | 167.6 | Y/Y/Y/Y
BS | Black-Scholes Option Pricing [3] | 61K | 1000 | Y/Y/Y/N
CP | Coulombic Potential [2] | 8K | 471.8 | Y/Y/Y/N
CS | Separable Convolution [3] | 10K | 3.1 | Y/Y/Y/Y
FWT | Fast Walsh Transform [3] | 32K | 289.1 | Y/Y/Y/Y
GS | Gaussian Elimination [1] | 921K | 5.7 | Y/Y/Y/Y
HS | HotSpot [54] | 432K | 101.4 | Y/Y/Y/Y
LIB | LIBOR [119] | 8K | 353.6 | Y/Y/Y/N
LPS | 3D Laplace Solver [121] | 12K | 118.9 | Y/Y/Y/Y
MM | Matrix Multiplication [3] | 10K | 42.4 | Y/Y/Y/Y
MT | Matrix Transpose [3] | 65K | 195.6 | Y/Y/Y/Y
NN | Neural Network [122] | 66K | 71.5 | Y/Y/Y/N
NQU | N-Queen Solver [123] | 24K | 209.3 | Y/Y/Y/Y
NW | Needleman-Wunsch [1] | 64 | 80.1 | Y/Y/Y/Y
PF | PathFinder [1] | 1K | 378.1 | Y/Y/Y/Y
PNS | Petri Net Simulation [2] | 2.5K | 411.5 | Y/Y/Y/Y
PR | Parallel Reduction [3] | 830K | 1.0 | Y/Y/Y/Y
RAY | Ray Trace [124] | 65K | 335.0 | Y/Y/Y/Y
SAD | Sum of Absolute Differences [2] | 11K | 67.7 | Y/Y/Y/Y
SP | Scalar Product [3] | 32K | 284.9 | Y/Y/Y/Y
SRAD | Speckle Reducing Anisotropic Diffusion [2] | 460K | 14.0 | Y/Y/Y/Y
STO | StoreGPU [125] | 49K | 361.9 | Y/Y/Y/N
CL | Cell [1] | 647 | 65.2 | Y/Y/Y/Y
HY | Hybrid Sort [126] | 541K | 32.1 | Y/Y/Y/Y
KM | K-Means [127] | 1497K | 20.2 | Y/Y/Y/N
MUM | MUMmerGPU [128] | 50K | 468.5 | Y/Y/Y/N
NE | Nearest Neighbor [1] | 60K | 196.8 | Y/Y/Y/N
BN | Binomial Options [3] | 131K | 39.4 | Y/Y/Y/Y
MRIF | Magnetic Resonance Imaging FHD [129] | 1050K | 86.7 | Y/Y/Y/N
MRIQ | Magnetic Resonance Imaging Q [2] | 526K | 107.6 | Y/Y/Y/N
DG | Galerkin Time-domain Solver [130] | 1035K | 11.4 | Y/Y/Y/Y
SLA | Scan of Large Arrays [3] | 1310K | 1.9 | Y/Y/Y/Y
SS | Similarity Score [1] | 51K | 1.1 | Y/Y/Y/Y
AES | AES Encryption [131] | 65K | 70.25 | N/Y/Y/Y
WP | Weather Prediction [132] | 4.6K | 459.8 | Y/Y/Y/N
64H | 64 Bin Histogram [3] | 2878K | 22.8 | Y/Y/Y/Y


Table 7-12. Synopsis of evaluated GPGPU workloads for renewable energy based GPU architecture

Abbr. | Workload | Avg. Power | Avg. IPC
MUM | MUMmerGPU | 202.8 | 11.9
FWT | Fast Walsh Transform | 128.1 | 107.8
HY | Hybrid Sort | 80.6 | 86.1
MM | Matrix Multiplication | 176.7 | 300.1
MT | Matrix Transpose | 104.4 | 226.5
NW | Needleman-Wunsch | 80.6 | 21.6
PF | PathFinder | 122.4 | 113.3
PNS | Petri Net Simulation | 143.7 | 67.4
ST3D | Stencil 3D | 125.5 | 122.6
LIB | LIBOR | 112.3 | 96.5

Table 7-13. GPGPU workloads (*: memory intensive) for nanophotonic GPU evaluation

Workload (Abbr.) | Traffic (in 10⁶ bytes)
Breadth First Search (BFS)* | 78.1
Fast Walsh Transform (FWT) | 18.0
Gaussian Elimination (GS) | 0.7
HotSpot (HS) | 0.6
3D Laplace Solver (LPS)* | 23.4
Matrix Multiplication (MM)* | 120
Matrix Transpose (MT) | 0.3
PathFinder (PF) | 5.9
Ray Trace (RAY)* | 83.5
Speckle Reducing Anisotropic (SRAD) | 0.5
Hybrid Sort (HY) | 20.1
Similarity Score (SS)* | 150
Nearest Neighbor (NE) | 2.0
Parallel Prefix Sum (SLA) | 0.7
Adv. Encryption Std (AES) | 0.9
64 Bin Histogram (64H)* | 29.4
Galerkin Solver (DG)* | 93.0

Table 7-14. GPGPU workload analysis evaluation metrics

Evaluation Metric | Synopsis
Activity factor | Average percentage of threads active at a given time.
SIMD parallelism | Speedup with an infinite number of SPs per SM.
DRAM efficiency | % of time spent sending data across the pins of DRAM when other commands are pending/serviced.


average number of active threads at a given time during the execution phase. Several branch divergence related characteristics of the benchmark change this parameter. For example, the absence of branch divergence produces an activity factor of 100%. SIMD parallelism [61] captures the scalability of a workload. A higher value of SIMD parallelism indicates that the workload performance will improve on a GPU that has higher SIMD width. DRAM efficiency [41] describes how frequently memory accesses are requested during kernel computation. It also captures the amount of time spent performing DRAM memory transfers during overall kernel execution. If a benchmark has a large number of shared memory accesses, or the benchmark has ALU operations properly balanced in between memory operations, then the metric will show a higher value.

7.3.2 Methodology

The experimental procedure consists of the following steps:

1. The heavily instrumented GPGPU-Sim simulator retrieves the statistics of the GPU characteristics listed in Tables 3-1, 3-2 and 7-14 for all the configurations described in Tables 7-1 and 7-2. Six different configurations are simulated to demonstrate the microarchitectural independence of the GPGPU workload characteristics.

2. We perform vectorization of some characteristics to produce 10-bin or 5-bin histograms from the GPGPU-Sim output. This process provides an input matrix for principal component analysis of size 129×38 (dimensions × workloads). The few dimensions of the matrix with zero standard deviation are kept out of the analysis to produce a 93×38 normalized data matrix.

3. PCA is performed using STATISTICA [133]. Several PCAs are performed on the whole workload set according to different workload characteristics subspaces and all the characteristics.

4. Based on the required percentage of total variance, p principal components are chosen.

5. STATISTICA is used to perform hierarchical cluster analysis. Both single linkage and complete linkage based dendrograms are generated using m principal components, where m ≤ p.
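Steps 2-4 of this methodology can be reproduced mathematically in a short sketch. The actual study used STATISTICA; the NumPy version below (our illustration) drops zero-variance dimensions, z-normalizes, and keeps the first p principal components that explain the target fraction of total variance.

```python
import numpy as np

def pca_reduce(data, var_target=0.9):
    """PCA dimensionality reduction as in steps 2-4 (sketch): rows are
    observations (workloads), columns are characteristics. Zero-variance
    columns are removed (e.g. 129 -> 93 dimensions in the study), the
    rest are z-normalized, and the first p eigenvectors of the
    covariance matrix covering var_target of total variance are kept."""
    data = data[:, data.std(axis=0) > 0]            # drop zero-std dims
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    cov = np.cov(z, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]                  # decreasing eigenvalues
    vals, vecs = vals[order], vecs[:, order]
    explained = np.cumsum(vals) / vals.sum()
    p = int(np.searchsorted(explained, var_target) + 1)
    return z @ vecs[:, :p], vals[:p]
```

The returned scores feed the hierarchical clustering of step 5; the eigenvalues correspond to the variance (information) carried by each retained PC.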

7.4 Renewable Energy Supply Traces

We generated various power supply traces based on the raw renewable energy source data of HOMER [134]. HOMER is a computer model developed by the National Renewable Energy Laboratory that simulates both conventional and renewable energy technologies and evaluates various design options of power systems. The optimization and sensitivity analysis algorithms of HOMER also allow the user to evaluate the economic and technical feasibility of a large number of technology options and to account for variation in technology costs and energy resource availability. In this study, we use a hybrid power provisioning scheme, and we incorporate the outputs of HOMER into our wind/solar power model. Our power supply traces include slow generation periods, peak hours and daily surges. As shown in Table 7-15, we generate four traces with different combinations of variation and average power levels.

Table 7-15. Evaluated renewable power supply traces

Traces | Mean power (Watts) | Standard deviation (Watts)
HpHv | 135 (High average power) | 38 (High variation)
HpLv | 178 (High average power) | 17 (Low variation)
LpHv | 40 (Low average power) | 41 (High variation)
LpLv | 101 (Low average power) | 14 (Low variation)
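A synthetic trace with the summary statistics of Table 7-15 can be generated in a few lines. The real traces come from HOMER's raw wind/solar data; this sketch (our illustration) only reproduces the mean/variation levels, clamped at zero since supply power cannot go negative, and it does not model the diurnal surges the real traces contain.

```python
import random

def supply_trace(mean_w, std_w, samples=24, seed=1):
    """Synthesize a renewable power-supply trace (in Watts) matching a
    Table 7-15 mean/std pair, e.g. HpHv: mean 135 W, std 38 W. Gaussian
    samples clamped at zero; one sample per scheduling interval."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mean_w, std_w)) for _ in range(samples)]

# hp_hv = supply_trace(135, 38)   # high average power, high variation
# lp_hv = supply_trace(40, 41)    # low average power, high variation
```

The LpHv trace clamps frequently (mean close to the standard deviation), which is exactly the regime where the hybrid workload-/supply-driven power manager must fall back on power gating.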


CHAPTER 8
CONCLUSION

In the following sections, I draw conclusions on GPGPU workload exploration, nanophotonic interconnection based GPGPU architecture, renewable energy based GPU systems, resistive memory based GPU shader core architecture, and the overall impact of emerging technology on GPGPU architecture, respectively.

8.1 Throughput Workload Exploration

The emerging GPGPU workload space had not been explored methodically to understand the fundamentals of the workloads. To understand GPU workloads, I have proposed a set of microarchitecture independent GPGPU workload characteristics that are capable of capturing five important behaviors: kernel stress, kernel characteristics, divergence characteristics, instruction mix, and coalescing characteristics. I have also demonstrated that the proposed evaluation metrics accurately represent the input characteristics. This work provides GPU architects a clear understanding of the available GPGPU workload space for designing better GPU microarchitectures. Our endeavor also demonstrates that the workload space is not properly balanced with the available benchmarks; while the SS, SLA, and PR workloads show significantly different behavior due to their large number of diverse kernels, the rest of the workloads exhibit similar characteristics. I also observe that benchmark suites like the Nvidia CUDA SDK, Parboil, and Rodinia behave differently in terms of workload space diversity. Our research shows that branch divergence diversity is best captured by SS, SLA, MUM, HS, BN, and NE. SLA, KM, SS, and PR show relatively larger memory coalescing behavior diversity than the rest. SS, BN, KM, MRIF, and HY have distinct instruction mixes. Kernel characteristics are diverse in PR, STO, NQU, SLA, WP, SS, CP, and LIB. I also show that among the chosen benchmarks the divergence characteristics are most diverse and the coalescing characteristics are least diverse. Therefore, we need benchmarks with more diverse memory coalescing


behavior. In addition, I show that a simulation speedup of 3.5× can be achieved by removing the redundant benchmarks.

8.2 Nanophotonic Throughput Interconnect

We propose a 3D-stacked nanophotonic throughput architecture that provides a 96% average reduction in NoC power (and in turn heat) with a 4% average increase in performance and up to 95% average NoC latency reduction compared to a state-of-the-art electrical on-chip interconnect based GPU. To improve off-chip memory bandwidth utilization, we divide the single optical memory interface into multiple concurrently working sub-channels and schedule multiple memory requests to those sub-channels. We also modify the throughput processor off-chip memory organization to increase the memory capacity of each interfacing controller. Our detailed evaluation shows that the multiple sub-channel based memory system provides a 9% increase in overall system performance for GPGPU workloads. For memory bound workloads, the average performance improvement reaches as much as 48%, and on average across all memory intensive benchmarks it attains a 13% performance gain. With increasing off-chip memory demand, we expect that the proposed scheme will attain further improvements. To the best of our knowledge, we propose the first interconnect status aware thread scheduler in a throughput architecture that deterministically schedules warps based on the presumed interconnect behavior inferred from memory access characteristics. The scheduling mechanism provides as much as 9% improvement in GPU performance for real world GPGPU applications.

8.3 Renewable Energy Based GPU Architecture

Motivated by the energy crisis and environmental issues, this work investigates green energy powered throughput server systems. Existing energy-oriented and performance-oriented approaches lack a holistic view of managing the variable green power budget. We show that intelligent switching between different power management modes can boost the system performance by 12.8%, enhance reliability by 42%, and still maintain up to 95% green energy utilization.


8.4 Resistive Memory Based Shader Core Architecture

Larger on-chip memory components (register file, shared memory, various caches) in throughput processors become a design impediment for deep-sub-micron technology nodes due to excessive leakage power and overall energy footprint. Emerging resistive memory technologies such as STT-MRAM provide a feasible on-chip memory design alternative to address the power problems. However, such memory technologies have longer write latency and higher write energy consumption. To this end, we propose an on-chip memory organization for throughput architectures using resistive memory that implements a differential-memory-update based STT-MRAM register file and a hybrid shared memory. The proposed architecture reduces STT-MRAM write power for several on-chip memories, with an 83% saving in leakage power. The resistive memory based organization saves 46% of the register file dynamic power with less than 1% performance overhead. We further optimize the register file access power by coalescing several consecutive register writes with wider write ports and small SRAM-based write buffers. This provides an additional 8% power savings. By incorporating an SRAM and STT-MRAM based hybrid shared memory organization with interchangeable cache and RAM behavior, we can compensate for the performance loss due to STT-MRAM write latency and improve the power efficiency. On average, the hybrid shared memory provides 10% power savings and 1.6x performance improvement without any area overhead.
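The differential-update idea can be made concrete with a minimal sketch. This is not the dissertation's hardware implementation, only an illustration of the principle: a write touches only the bit cells whose value actually changes, so STT-MRAM write energy scales with the number of flipped bits rather than the full register width.

```python
def differential_write(stored, incoming, width=32):
    """Update a register word, touching only the bits that differ.

    Returns (new value, number of cells written). STT-MRAM pays write
    energy per flipped cell, so skipping unchanged bits cuts dynamic
    write power whenever successive values are similar.
    """
    mask = (1 << width) - 1
    diff = (stored ^ incoming) & mask     # positions that must change
    flipped = bin(diff).count("1")        # cells actually written
    return stored ^ diff, flipped

val, cost = differential_write(0xFFFF0000, 0xFFFF00FF)
print(hex(val), cost)  # only the low byte differs: 8 cells written
```

Register values in GPGPU kernels often change in only a few bit positions between consecutive writes, which is what makes this per-bit comparison pay off in practice.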


APPENDIX
STATISTICAL METHODS

A brief description of principal component analysis and hierarchical clustering analysis is provided here.

Principal Component Analysis

Principal component analysis (PCA) is a multivariate data analysis technique used to remove the correlations among the different input dimensions and to significantly reduce the data dimensionality. In PCA, a matrix is formed with different columns representing different input dimensions and different rows representing different observations. PCA transforms these observations into the principal component (PC) domain, where the PCs are linear combinations of the input dimensions. Mathematically, for a data set of n correlated variables, the k-th observation has the form D_k = (d_{0k}, d_{1k}, d_{2k}, d_{3k}, ..., d_{(n-1)k}). The k-th observation in the PC domain is represented as P_k = (p_{0k}, p_{1k}, p_{2k}, p_{3k}, ..., p_{(n-1)k}), where p_{0k} = c_0 d_{0k} + c_1 d_{1k} + c_2 d_{2k} + c_3 d_{3k} + ... + c_{n-1} d_{(n-1)k}. Terms like c_0, c_1, ..., c_{n-1} are referred to as factor loadings, and they are selected in the PCA process to maximize the variance of a particular PC. The variance of the i-th PC (σ²_i) has the property σ²_{i-1} ≥ σ²_i ≥ σ²_{i+1}; that is, the PCs are arranged in decreasing order of eigenvalues. Eigenvalues describe the amount of information present in a principal component. The Kaiser criterion [56] suggests considering all the PCs that have an eigenvalue greater than 1. In general, the first several principal components are chosen so as to retain 90% of the total variance of the original data. Usually this count is much smaller than the number of input dimensions; hence, dimensionality reduction is achieved without losing much information.

Hierarchical Clustering Analysis

Hierarchical clustering analysis is a type of statistical data classification technique based on some kind of perceived similarity defined on the data set. Clustering techniques [21] can be broadly categorized into hierarchical and partitional clustering.


In hierarchical clustering, clusters can be chosen according to the linkage distance without defining the number of clusters a priori. In contrast, partitional clustering requires that the number of clusters be defined as an input to the process. Hierarchical clustering can be done in an agglomerative (bottom-up) or divisive (top-down) manner. In the bottom-up approach, all of the points are defined as different clusters at the beginning. The most similar clusters are found and grouped together, and the previous steps are repeated until all the data points are clustered. In single linkage clustering, similarity checking is done by considering the minimum distance between two different clusters. Alternatively, in complete linkage clustering, the maximum distance between the two different clusters is chosen. Average linkage clustering refers to similarity checking based on the mean distance between different clusters. The whole process of clustering produces a tree-like structure (dendrogram), where one axis represents the different data points and the other represents linkage distance. The whole data set can be categorized by selecting a particular linkage distance. Higher linkage distance expresses dissimilarity between the data points.
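As a concrete illustration of the two methods above, the following sketch applies Kaiser-criterion PCA followed by complete-linkage agglomerative clustering to a synthetic stand-in for the workload-characteristic matrix. The data and all names are invented for illustration; the dissertation's actual analysis used the STATISTICA package [133]:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def pca_kaiser(data):
    """Project standardized observations onto the PCs whose
    eigenvalue exceeds 1 (Kaiser criterion)."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)  # standardize columns
    corr = np.corrcoef(z, rowvar=False)                # correlation matrix
    eigval, eigvec = np.linalg.eigh(corr)
    order = np.argsort(eigval)[::-1]                   # decreasing variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    keep = eigval > 1.0                                # Kaiser criterion
    return z @ eigvec[:, keep], eigval

rng = np.random.default_rng(0)
# Toy stand-in: 12 "workloads" x 6 characteristics, with two pairs of
# nearly duplicated (i.e., strongly correlated) dimensions.
base = rng.normal(size=(12, 2))
data = np.hstack([base, base + 0.05 * rng.normal(size=(12, 2)),
                  rng.normal(size=(12, 2))])
scores, eigval = pca_kaiser(data)

# Bottom-up clustering of the PC scores with complete linkage;
# cutting the dendrogram at a linkage distance yields the groups.
tree = linkage(scores, method="complete")
labels = fcluster(tree, t=3.0, criterion="distance")
print(scores.shape, labels)
```

Workloads that land in the same cluster are redundant for design-space evaluation; keeping one representative per cluster is what yields the simulation speedup reported in Section 8.1.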


REFERENCES

[1] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC). IEEE Computer Society, 2009, pp. 44. [Online]. Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5306797
[2] T. I. R. Group. (2010) Parboil benchmark suite. [Online]. Available: http://impact.crhc.illinois.edu/parboil.php
[3] N. Corporation, GPU Computing SDK, 2010. [Online]. Available: http://developer.nvidia.com/gpu-computing-sdk
[4] S. Hong and H. Kim, "An integrated GPU power and performance model," in Proceedings of the 37th Annual International Symposium on Computer Architecture. Saint-Malo, France: ACM, 2010, pp. 280. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1815961.1815998
[5] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron, "Energy-efficient mechanisms for managing thread context in throughput processors," in Proceedings of the 38th Annual International Symposium on Computer Architecture, 2011, pp. 235. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2000064.2000093
[6] M. Ahmed Al, S. Guangyu, D. Xiangyu, N. Vijay, and X. Yuan, "3D GPU Architecture Using Cache Stacking: Performance, Cost, Power and Thermal Analysis," in Proceedings of the 2009 IEEE International Conference on Computer Design. Lake Tahoe, California, USA: IEEE Press, 2009.
[7] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka, "Statistical power modeling of GPU kernels using performance counters," in Proceedings of the International Conference on Green Computing. IEEE Computer Society, 2010, pp. 115. [Online]. Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5598315
[8] E. U. D. Manufacturing, DC current sensor Model HCS-20-10-AP. [Online]. Available: http://www.eeurd.com/products/dc-current-sensors-all.php#02
[9] N. Corporation, The NVIDIA CUDA C Programming Guide, 2012. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[10] N. John, B. Ian, G. Michael, and S. Kevin, "Scalable Parallel Programming with CUDA," Queue, vol. 6, no. 2, pp. 40, 2008. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1365490.1365500
[11] N. Corporation, NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009.


[12] AMD Accelerated Parallel Processing (APP) SDK (formerly ATI Stream), 2012. [Online]. Available: http://developer.amd.com/TOOLS/HC/AMDAPPSDK/Pages/default.aspx
[13] N. Brookwood, AMD Fusion Family of APUs: Enabling a Superior, Immersive PC Experience, 2010.
[14] OpenCL: The open standard for parallel programming of heterogeneous systems, 2012. [Online]. Available: http://www.khronos.org/opencl/
[15] K. G. Brill, The Economic Meltdown of Moore's Law and the Green Data Center, Uptime Institute, 2007. [Online]. Available: http://static.usenix.org/event/lisa07/tech/brill talk.pdf
[16] S. Electric, IT Carbon Energy Allocation Calculator. [Online]. Available: http://www.apcmedia.com/salestools/WTOL-7DJLN9 R1 EN.swf
[17] G. L. Yuan, A. Bakhoda, and T. M. Aamodt, "Complexity effective memory access scheduling for many-core accelerator architectures," in MICRO. ACM, 2009. [Online]. Available: http://dblp.uni-trier.de/db/conf/micro/micro2009.html#YuanBA09
[18] W. L. F. Wilson, S. Ivan, Y. George, and M. A. Tor, "Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware," ACM Trans. Archit. Code Optim., vol. 6, no. 2, pp. 1, 2009. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1543753.1543756
[19] D. Tarjan, J. Meng, and K. Skadron, "Increasing memory miss tolerance for SIMD cores," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09. New York, NY, USA: ACM, 2009, pp. 22:1-22:11. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654082
[20] G. H. Dunteman, Principal Components Analysis. SAGE Publications, Inc, 1989, vol. 69.
[21] H. Romesburg, Cluster Analysis for Researchers. Belmont, California: Lifetime Learning Publications, 1984.
[22] C. Hughes, J. Poe, A. Qouneh, and T. Li, "On the (dis)similarity of transactional memory workloads," in Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), ser. IISWC '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 108. [Online]. Available: http://dx.doi.org/10.1109/IISWC.2009.5306790
[23] A. Joshi, A. Phansalkar, L. Eeckhout, and L. K. John, "Measuring benchmark similarity using inherent program characteristics," IEEE Trans. Comput., vol. 55, no. 6, pp. 769, Jun. 2006. [Online]. Available: http://dx.doi.org/10.1109/TC.2006.85


[24] L. Eeckhout, H. Vandierendonck, and K. De Bosschere, "Designing computer architecture research workloads," Computer, vol. 36, no. 2, pp. 65, Feb. 2003. [Online]. Available: http://dx.doi.org/10.1109/MC.2003.1178050
[25] H. Vandierendonck and K. De Bosschere, "Many benchmarks stress the same bottlenecks," in Workshop on Computer Architecture Evaluation Using Commercial Workloads, 2004, pp. 57.
[26] J. Henning, "SPEC CPU2000: measuring CPU performance in the New Millennium," Computer, vol. 33, no. 7, pp. 28, Jul. 2000.
[27] A. Phansalkar, A. Joshi, and L. K. John, "Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite," in Proceedings of the 34th Annual International Symposium on Computer Architecture, ser. ISCA '07. New York, NY, USA: ACM, 2007, pp. 412. [Online]. Available: http://doi.acm.org/10.1145/1250662.1250713
[28] C. Lee, M. Potkonjak, and W. Mangione-Smith, "MediaBench: a tool for evaluating and synthesizing multimedia and communications systems," in Microarchitecture, 1997. Proceedings., Thirtieth Annual IEEE/ACM International Symposium on, Dec. 1997, pp. 330.
[29] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop, ser. WWC '01. Washington, DC, USA: IEEE Computer Society, 2001, pp. 3. [Online]. Available: http://dx.doi.org/10.1109/WWC.2001.15
[30] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, ser. ISCA '95. New York, NY, USA: ACM, 1995, pp. 24. [Online]. Available: http://doi.acm.org/10.1145/223982.223990
[31] C. C. Minh, J. Chung, C. Kozyrakis, and K. Olukotun, "STAMP: Stanford transactional applications for multi-processing," in Workload Characterization, 2008. IISWC 2008. IEEE International Symposium on, Sept. 2008, pp. 35.
[32] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '08. New York, NY, USA: ACM, 2008, pp. 72. [Online]. Available: http://doi.acm.org/10.1145/1454115.1454128
[33] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, The Landscape of Parallel Computing Research:


A View from Berkeley, Tech. Rep., Dec. 2006. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
[34] INTERNATIONAL TECHNOLOGY ROADMAP FOR SEMICONDUCTORS (INTERCONNECT), 2007.
[35] D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. P. Jouppi, M. Fiorentino, A. Davis, N. Binkert, R. G. Beausoleil, and J. H. Ahn, "Corona: System Implications of Emerging Nanophotonic Technology," SIGARCH Comput. Archit. News, vol. 36, no. 3, pp. 153, 2008. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1394608.1382135
[36] J. C. Mark, C. K. Joseph, and H. A. David, "Phastlane: A Rapid Transit Optical Routing Network," in Proceedings of the 36th Annual International Symposium on Computer Architecture. Austin, TX, USA: ACM, 2009. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1555754.1555809
[37] P. Yan, K. Prabhat, K. John, M. Gokhan, Z. Yu, and C. Alok, "Firefly: illuminating future network-on-chip with nanophotonics," SIGARCH Comput. Archit. News, vol. 37, no. 3, pp. 429, 2009. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1555815.1555808
[38] Y. Pan, J. Kim, and G. Memik, "FlexiShare: Channel Sharing for an Energy-Efficient Nanophotonic Crossbar," in HPCA. Bangalore, India: ACM, 2010.
[39] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb, "Die Stacking (3D) Microarchitecture," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2006, pp. 469. [Online]. Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4041869
[40] S. Beamer, C. Sun, Y.-J. Kwon, A. Joshi, C. Batten, V. Stojanovic, and K. Asanovic, "Re-architecting DRAM with Monolithically Integrated Silicon Photonics," Tech. Rep., Dec. 2009. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-179.html
[41] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in ISPASS. IEEE, 2009. [Online]. Available: http://dblp.uni-trier.de/db/conf/ispass/ispass2009.html#BakhodaYFWA09
[42] C. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. Stan, "Relaxing non-volatility for fast and energy-efficient STT-RAM caches," in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, 2011, pp. 50.


[43] X. Guo, E. Ipek, and T. Soyata, "Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing," SIGARCH Comput. Archit. News, vol. 38, no. 3, pp. 371, 2010.
[44] ITRS, INTERNATIONAL TECHNOLOGY ROADMAP FOR SEMICONDUCTORS 2009 EDITION (DESIGN), Tech. Rep., 2009.
[45] E. Technologies. (2012) Everspin Debuts First Spin-Torque MRAM for High Performance Storage Systems.
[46] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen, "Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement," in Proceedings of the 45th Annual Design Automation Conference. Anaheim, California: ACM, 2008, pp. 554. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1391469.1391610
[47] S. Lai and T. Lowrey, "OUM - A 180 nm nonvolatile memory cell element technology for standalone and embedded applications," in Electron Devices Meeting, 2001. IEDM Technical Digest. International, 2001, pp. 36.5.1-36.5.4.
[48] F. Tabrizi. (2007) The future of scalable STT-RAM as a universal embedded memory. [Online]. Available: http://www.eetimes.com/design/embedded/4026000/The-future-of-scalable-STT-RAM-as-a-universal-embedded-memory
[49] C. W. Smullen, IV, A. Nigam, S. Gurumurthi, and M. R. Stan, "The STeTSiMS STT-RAM simulation and modeling system," in Proceedings of the International Conference on Computer-Aided Design, ser. ICCAD '11. Piscataway, NJ, USA: IEEE Press, 2011, pp. 318. [Online]. Available: http://dl.acm.org/citation.cfm?id=2132325.2132408
[50] A. K. Mishra, X. Dong, G. Sun, Y. Xie, N. Vijaykrishnan, and C. R. Das, "Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: ACM, 2011, pp. 69. [Online]. Available: http://doi.acm.org/10.1145/2000064.2000074
[51] IEEE Guide for Array and Battery Sizing in Stand-Alone Photovoltaic (PV) Systems, IEEE Std 1562-2007, pp. i, 2008.
[52] A. K. Coskun, R. Strong, D. M. Tullsen, and T. Simunic Rosing, "Evaluating the impact of job scheduling and power management on processor lifetime for chip multiprocessors," in Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '09. New York, NY, USA: ACM, 2009, pp. 169. [Online]. Available: http://doi.acm.org/10.1145/1555349.1555369
[53] B. W. Coon, J. E. Lindholm, S. Liu, S. F. Oberman, and M. Y. Siu, "Operand collector architecture," no. US 7834881 B2, November 2010.


[54] H. Wei, S. Ghosh, S. Velusamy, K. Sankaranarayanan, K. Skadron, and M. R. Stan, "HotSpot: A Compact Thermal Modeling Methodology for Early-Stage VLSI Design," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 14, no. 5, pp. 501, 2006.
[55] C. Yiran, W. Xiaobin, L. Hai, X. Haiwen, Y. Yuan, and Z. Wenzhong, "Design Margin Exploration of Spin-Transfer Torque RAM (STT-RAM) in Scaled Technologies," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 18, no. 12, pp. 1724, 2010.
[56] K. A. Yeomans and P. A. Golder, "The Guttman-Kaiser Criterion as a Predictor of the Number of Common Factors," Journal of the Royal Statistical Society. Series D (The Statistician), vol. 31, no. 3, pp. 221, September 1982.
[57] M. M. Hughes, "Exploration and Play Re-Visited: A Hierarchical Analysis," International Journal of Behavioral Development, vol. 2, no. 3, pp. 215, 1979. [Online]. Available: http://jbd.sagepub.com/content/2/3/215.abstract
[58] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan, "Larrabee: a many-core x86 architecture for visual computing," ACM Trans. Graph., vol. 27, no. 3, pp. 18:1-18:15, Aug. 2008. [Online]. Available: http://doi.acm.org/10.1145/1360612.1360617
[59] R. H. Saavedra and A. J. Smith, "Analysis of benchmark characteristics and benchmark performance prediction," ACM Trans. Comput. Syst., vol. 14, no. 4, pp. 344, Nov. 1996. [Online]. Available: http://doi.acm.org/10.1145/235543.235545
[60] K. Hoste and L. Eeckhout, "Microarchitecture-Independent Workload Characterization," IEEE Micro, vol. 27, no. 3, pp. 63, May 2007. [Online]. Available: http://dx.doi.org/10.1109/MM.2007.56
[61] A. Kerr, G. Diamos, and S. Yalamanchili, "A characterization and analysis of PTX kernels," in Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), ser. IISWC '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 3. [Online]. Available: http://dx.doi.org/10.1109/IISWC.2009.5306801
[62]
B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: a MapReduce framework on graphics processors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '08. New York, NY, USA: ACM, 2008, pp. 260. [Online]. Available: http://doi.acm.org/10.1145/1454115.1454152
[63] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2003.


[64] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti, "Achieving predictable performance through better memory controller placement in many-core CMPs," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY, USA: ACM, 2009, pp. 451. [Online]. Available: http://doi.acm.org/10.1145/1555754.1555810
[65] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures," in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. New York, New York: ACM, 2009, pp. 469. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1669112.1669172
[66] D. M. Malta, Microsystem Integration and Packaging - 3D Integration Technology.
[67] P. V. Bolotoff, A Quick Analysis of the NVIDIA Fermi Architecture, 2010. [Online]. Available: http://alasir.com/articles/nvidia fermi architecture/gt200 gt300 architecture.shtml
[68] C. Demerjian. (2009) Miracles happen, GT300 tapes out! Warning: flying pigs and huge chips. [Online]. Available: http://semiaccurate.com/2009/07/29/miracles-happen-gt300-tapes-out/
[69] J. Ajay, B. Christopher, K. Yong-Jin, B. Scott, S. Imran, A. Krste, and S. Vladimir, "Silicon-photonic Clos Networks for Global On-Chip Communication," in Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip. IEEE Computer Society, 2009, pp. 124. [Online]. Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5071460
[70] IEEE Standard for Information Technology - Telecommunications and Information Exchange Between Systems - Local and Metropolitan Area Networks - Part 5: Token Ring Access Method and Physical Layer Specifications - Corrigendum 1, IEEE Std 802.5w-2000, p. i, 2001.
[71] A. Melloni, M. Martinelli, G. Cusmai, and R. Siano, "Experimental Evaluation of Ring Resonator Filters Impact on the Bit Error Rate in Non-return to Zero Transmission Systems," Optics Communications, vol. 234, no. 1-6, pp. 211, 2004. [Online]. Available: http://www.sciencedirect.com/science/article/B6TVF-4BT6RV1-1/2/75808dbb5d34c893d74f7ec94eca9556
[72] A. N. Udipi, N. Muralimanohar, R. Balsubramonian, A. Davis, and N. P. Jouppi, "Combining Memory and a Controller with Photonics through 3D-Stacking to Enable Scalable and Energy-Efficient Systems," in International Symposium on Computer Architecture. San Jose: ACM, 2011.
[73] H.-Y. Chen, C.-C. Chen, F.-K. Hsueh, J.-T. Liu, C.-Y. Shen, C.-C. Hsu, S.-L. Shy, B.-T. Lin, H.-T. Chuang, C.-S. Wu, C. Hu, C.-C. Huang, and F.-L. Yang, 16 nm functional 0.039 µm2 6T-SRAM cell with nanoinjection lithography, nanowire


channel, and full TiN gate, in Electron Devices Meeting (IEDM), 2009 IEEE International, Dec. 2009, pp. 1.
[74] S. Woop, J. Schmittler, and P. Slusallek, "RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing," ACM Trans. Graph., vol. 24, no. 3, pp. 434, 2005. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1073204.1073211
[75] W. Fung and T. Aamodt, "Thread block compaction for efficient SIMT control flow," in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, Feb. 2011, pp. 25.
[76] H. Amit, B. Tony, S. J. Ben Yoo, A. Rajeevan, and A. Venkatesh, "OCDIMM: Scaling the DRAM Memory Wall Using WDM Based Optical Interconnects," in Proceedings of the 2008 16th IEEE Symposium on High Performance Interconnects. IEEE Computer Society, 2008. [Online]. Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4618577
[77] G. Brinda, J. Aamer, W. David, and J. Bruce, "Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling," in Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture. IEEE Computer Society, 2007. [Online]. Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4147653
[78] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers," in HPCA, Bangalore, India, 2010.
[79] V. Kontorinis, L. Zhang, B. Aksanli, J. Sampson, H. Homayoun, E. Pettis, D. Tullsen, and T. Simunic Rosing, "Managing distributed UPS energy for effective power capping in data centers," in Computer Architecture (ISCA), 2012 39th Annual International Symposium on, 2012, pp. 488.
[80] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998. [Online]. Available: http://www.cs.ualberta.ca/%7Esutton/book/ebook/the-book.html
[81] C. Li, A. Qouneh, and T. Li, "Characterizing and analyzing renewable energy driven data centers," in Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '11. New York, NY, USA: ACM, 2011, pp. 131. [Online]. Available: http://doi.acm.org/10.1145/1993744.1993791
[82] C. Li, W. Zhang, C.-B. Cho, and T. Li, "SolarCore: Solar energy driven multi-core architecture power management," in Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, ser. HPCA '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 205. [Online]. Available: http://dl.acm.org/citation.cfm?id=2014698.2014858


[83] N. Sharma, S. Barker, D. Irwin, and P. Shenoy, "Blink: managing server clusters on intermittent power," in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '11. New York, NY, USA: ACM, 2011, pp. 185. [Online]. Available: http://doi.acm.org/10.1145/1950365.1950389
[84] I. Goiri, K. Le, T. D. Nguyen, J. Guitart, J. Torres, and R. Bianchini, "GreenHadoop: leveraging green energy in data-processing frameworks," in Proceedings of the 7th ACM European Conference on Computer Systems, ser. EuroSys '12. New York, NY, USA: ACM, 2012, pp. 57. [Online]. Available: http://doi.acm.org/10.1145/2168836.2168843
[85] I. Goiri, K. Le, M. Haque, R. Beauchea, T. Nguyen, J. Guitart, J. Torres, and R. Bianchini, "GreenSlot: Scheduling energy consumption in green data centers," in High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, 2011, pp. 1.
[86] Y. Zhang, Y. Wang, and X. Wang, "GreenWare: greening cloud-scale data centers to maximize the use of renewable energy," in Proceedings of the 12th ACM/IFIP/USENIX International Conference on Middleware, ser. Middleware '11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 143. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-25821-3 8
[87] C. Li, A. Qouneh, and T. Li, "iSwitch: coordinating and optimizing renewable energy powered server clusters," in Proceedings of the 39th Annual International Symposium on Computer Architecture, ser. ISCA '12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 512. [Online]. Available: http://dl.acm.org/citation.cfm?id=2337159.2337218
[88] M. Arlitt, C. Bash, S. Blagodurov, Y. Chen, T. Christian, D. Gmach, C. Hyser, N. Kumari, Z. Liu, M. Marwah, A. McReynolds, C. Patel, A. Shah, Z. Wang, and R. Zhou, "Towards the design and operation of net-zero energy data centers," in Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), 2012 13th IEEE Intersociety Conference on, 2012, pp. 552.
[89] S. Govindan, D. Wang, A. Sivasubramaniam, and B. Urgaonkar, "Leveraging stored energy for handling power emergencies in aggressively provisioned datacenters," in Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVII. New York, NY, USA: ACM, 2012, pp. 75. [Online]. Available: http://doi.acm.org/10.1145/2150976.2150985
[90] S. Govindan, A. Sivasubramaniam, and B. Urgaonkar, "Benefits and limitations of tapping into stored energy for datacenters," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: ACM, 2011, pp. 341. [Online]. Available: http://doi.acm.org/10.1145/2000064.2000105


[91] N. Deng, C. Stewart, and J. Li, "Concentrating renewable energy in grid-tied datacenters," in Sustainable Systems and Technology (ISSST), 2011 IEEE International Symposium on, 2011, pp. 1.
[92] C. Li, R. Zhao, and T. Li, "Enabling Distributed Generation Powered Sustainable High-Performance Data Center," in Proceedings of the 2013 IEEE International Conference on High Performance Computer Architecture, ser. HPCA '13. IEEE, February 2013.
[93] I. Goiri, W. Katsak, K. Le, T. D. Nguyen, and R. Bianchini, "Parasol and GreenSwitch: managing datacenters powered by renewable energy," in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '13. New York, NY, USA: ACM, 2013, pp. 51. [Online]. Available: http://doi.acm.org/10.1145/2451116.2451123
[94] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "Energy reduction for STT-RAM using early write termination," in Proceedings of the 2009 International Conference on Computer-Aided Design, ser. ICCAD '09. New York, NY, USA: ACM, 2009, pp. 264. [Online]. Available: http://doi.acm.org/10.1145/1687399.1687448
[95] High performance architecture for a writeback stage, Intel Corporation, 2009.
[96] W. Zhao, E. Belhaire, Q. Mistral, E. Nicolle, T. Devolder, and C. Chappert, "Integration of Spin-RAM technology in FPGA circuits," in Solid-State and Integrated Circuit Technology, 2006. ICSICT '06. 8th International Conference on, Oct. 2006, pp. 799.
[97] W. Zhao, E. Belhaire, C. Chappert, F. Jacquet, and P. Mazoyer, "New non-volatile logic based on spin-MTJ," physica status solidi (a), vol. 205, no. 6, pp. 1373, 2008. [Online]. Available: http://dx.doi.org/10.1002/pssa.200778135
[98] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," in High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on, Feb. 2009, pp. 239.
[99] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie, "Hybrid cache architecture with disparate memory technologies," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ser. ISCA '09. New York, NY, USA: ACM, 2009, pp. 34. [Online]. Available: http://doi.acm.org/10.1145/1555754.1555761
[100] M. Franklin and G. S. Sohi, "Register traffic analysis for streamlining inter-operation communication in fine-grain parallel processors," SIGMICRO Newsl., vol. 23, no. 1-2, pp. 236, 1992. [Online]. Available: http://dl.acm.org/citation.cfm?doid=144965.145818


[101] R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi, "Reducing the complexity of the register file in dynamic superscalar processors," in Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2001, pp. 237.
[102] E. Borch, S. Manne, J. Emer, and E. Tune, "Loose Loops Sink Chips," in Proceedings of the 8th International Symposium on High-Performance Computer Architecture, 2002, p. 299.
[103] T. M. Jones, M. F. P. O'Boyle, J. Abella, A. González, and O. Ergin, "Energy-efficient register caching with compiler assistance," ACM Trans. Archit. Code Optim., vol. 6, no. 4, pp. 1, 2009. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1596510.1596511
[104] J. L. Cruz, A. González, M. Valero, and N. P. Topham, "Multiple-banked register file architectures," SIGARCH Comput. Archit. News, vol. 28, no. 2, pp. 316, 2000. [Online]. Available: http://dl.acm.org/citation.cfm?doid=342001.339708
[105] P. R. Nuth and W. J. Dally, "The Named-State Register File: Implementation and Performance," in Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture, 1995, p. 4.
[106] H. Zeng and K. Ghose, "Register file caching for energy efficiency," in Proceedings of the 2006 International Symposium on Low Power Electronics and Design, 2006, pp. 244. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1165573.1165633
[107] P. Satyamoorthy, STT-RAM for Shared Memory in GPUs, University of Virginia, 2011.
[108] S. Lee, H. Lee, S. Kim, S. Lee, and H. Shin, "A novel macro-model for spin-transfer-torque based magnetic-tunnel-junction elements," Solid-State Electronics, vol. 54, no. 4, pp. 497, 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0038110110000055
[109] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "A durable and energy efficient main memory using phase change memory technology," in Proceedings of the 36th Annual International Symposium on Computer Architecture. Austin, TX, USA: ACM, 2009, pp. 14. [Online]. Available: http://dl.acm.org/citation.cfm?doid=1555754.1555759


[112] S.Thoziyoor,N.Muralimanohar,J.H.Ahn,andN.Jouppi,Cacti6.5.[Online].Available: http://www.hpl.hp.com/research/cacti [113] Denali.(2007)UsingCongurableMemoryControllerDesignIPwithEncounterRTLComplier. [114] I.O'Connor,OpticalSolutionsforSystem-LevelInterconnect,inProceedingsofthe2004internationalworkshoponSystemlevelinterconnectprediction.Paris,France:ACM,2004,pp.79.[Online].Available: http://dl.acm.org/citation.cfm?doid=966747.966764 [115] N.Kirman,M.Kirman,R.K.Dokania,J.F.Martinez,A.B.Apsel,M.A.Watkins,andD.H.Albonesi,LeveragingOpticalTechnologyinFutureBus-basedChipMultiprocessors,inProceedingsofthe39thAnnualIEEE/ACMInternationalSymposiumonMicroarchitecture.IEEEComputerSociety,2006,pp.492.[Online].Available: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4041871 [116] C.Batten,A.Joshi,J.Orcutt,A.Khilo,B.Moss,C.Holzwarth,M.Popovic,H.Li,H.Smith,J.Hoyt,F.Kartner,R.Ram,V.Stojanovic,andK.Asanovic,BuildingManycoreProcessor-to-DRAMNetworkswithMonolithicSiliconPhotonics,inHighPerformanceInterconnects,2008.HOTI'08.16thIEEESymposiumon,aug.2008,pp.21. [117] Y.Cao,T.Sato,M.Orshansky,D.Sylvester,andC.Hu,NewparadigmofpredictiveMOSFETandinterconnectmodelingforearlycircuitsimulation,inCustomIntegratedCircuitsConference,2000.CICC.ProceedingsoftheIEEE2000,2000,pp.201. [118] M.Hosomi,H.Yamagishi,T.Yamamoto,K.Bessho,Y.Higo,K.Yamane,H.Yamada,M.Shoji,H.Hachino,C.Fukumoto,H.Nagao,andH.Kano,Anovelnonvolatilememorywithspintorquetransfermagnetizationswitching:spin-ram,inElectronDevicesMeeting,2005.IEDMTechnicalDigest.IEEEInternational,dec.2005,pp.459. [119] M.GilesandS.Xiaoke,NotesonusingthenVidia8800GTXgraphicscard,OxfordUniversity,Tech.Rep.,2009. 
[120] P. Harish and P. J. Narayanan, "Accelerating large graph algorithms on the GPU using CUDA," in Proceedings of the 14th International Conference on High Performance Computing, ser. HiPC '07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 197. [Online]. Available: http://dl.acm.org/citation.cfm?id=1782174.1782200
[121] M. Giles, Jacobi iteration for a Laplace discretisation on a 3D structured grid, Oxford University, Tech. Rep., April 2008.


[122] K. G. BillConan, A Neural Network on GPU, March 2008. [Online]. Available: http://www.codeproject.com/Articles/24361/A-Neural-Network-on-GPU
[123] Pcchen, N-Queens Solver, 2008. [Online]. Available: http://forums.nvidia.com/index.php?showtopic=76893
[124] M. Shevtsov, A. Soupikov, and A. Kapustin, "Highly Parallel Fast KD-tree Construction for Interactive Ray Tracing of Dynamic Scenes," Computer Graphics Forum, vol. 26, no. 3, pp. 395, 2007. [Online]. Available: http://dx.doi.org/10.1111/j.1467-8659.2007.01062.x
[125] S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, and M. Ripeanu, "StoreGPU: exploiting graphics processing units to accelerate distributed storage systems," in Proceedings of the 17th International Symposium on High Performance Distributed Computing, ser. HPDC '08. New York, NY, USA: ACM, 2008, pp. 165. [Online]. Available: http://doi.acm.org/10.1145/1383422.1383443
[126] E. Sintorn and U. Assarsson, "Fast parallel GPU-sorting using a hybrid algorithm," J. Parallel Distrib. Comput., vol. 68, no. 10, pp. 1381, Oct. 2008. [Online]. Available: http://dx.doi.org/10.1016/j.jpdc.2008.05.012
[127] R. Narayanan, B. Ozisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary, "MineBench: A Benchmark Suite for Data Mining Workloads," in Workload Characterization, 2006 IEEE International Symposium on, Oct. 2006, pp. 182.
[128] M. Schatz, C. Trapnell, A. Delcher, and A. Varshney, "High-throughput sequence alignment using Graphics Processing Units," BMC Bioinformatics, vol. 8, no. 1, pp. 474+, 2007. [Online]. Available: http://dx.doi.org/10.1186/1471-2105-8-474
[129] S. S. Stone, J. P. Haldar, S. C. Tsao, W.-m. W. Hwu, Z.-P. Liang, and B. P. Sutton, "Accelerating advanced MRI reconstructions on GPUs," in Proceedings of the 5th Conference on Computing Frontiers, ser. CF '08. New York, NY, USA: ACM, 2008, pp. 261. [Online]. Available: http://doi.acm.org/10.1145/1366230.1366276
[130] gpuDG, July 2008. [Online]. Available: http://www.caam.rice.edu/timwar/RMMC/MIDG.html
[131] S. Manavski, "CUDA Compatible GPU as an Efficient Hardware Accelerator for AES Cryptography," in Signal Processing and Communications, 2007. ICSPC 2007. IEEE International Conference on, Nov. 2007, pp. 65.
[132] J. Michalakes and M. Vachharajani, "GPU acceleration of numerical weather prediction," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, April 2008, pp. 1.
[133] STATISTICA, 2010. [Online]. Available: http://www.statsoft.com/products/


[134] N. R. E. Laboratory, Getting Started Guide for HOMER Version 2.1, 2005. [Online]. Available: https://analysis.nrel.gov/homer/includes/downloads/HOMERGettingStarted210.pdf


BIOGRAPHICAL SKETCH

Nilanjan Goswami received his undergraduate degree in electronics and communications from the University of Kalyani, India, in 2004. Subsequently, he worked at Tata Elxsi Limited, India (2004-2006), and Interra Systems India Private Limited (2006-2008) as an embedded DSP engineer. He received his Master of Science degree in computer engineering from the University of Florida in 2011. He received his Ph.D. in the area of throughput architecture optimization from the University of Florida under the supervision of Dr. Tao Li. During his Ph.D., he interned at Texas Instruments (as a DSP engineer and compiler engineer) and Nvidia Corporation (as a GPU architect). He is professionally affiliated with IEEE, ACM, HKN, and Envagrogen.