Thermal-aware Task Scheduling on Multicore Processors

Permanent Link: http://ufdc.ufl.edu/UFE0044648/00001

Material Information

Title: Thermal-aware Task Scheduling on Multicore Processors
Physical Description: 1 online resource (135 p.)
Language: English
Creator: Wang, Zhe
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2012

Subjects

Subjects / Keywords: dvfs -- management -- multicore -- partitioning -- processor -- scheduling -- task -- thermal
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The power and heat density of multicore processors are increasing exponentially with Moore's Law. High temperature negatively affects reliability and raises the cost of cooling and packaging, which makes thermal management a significant design challenge for multicore processors. High temperature also raises the leakage power of the chip and can thereby lead to thermal runaway. Existing power-aware techniques do not address the temperature issues in multicore processors. Localized high-temperature "hot spots" and spatial gradients can be caused by unbalanced power distribution and heat dissipation. Techniques such as effective scheduling of tasks on multicore processors are needed to reduce hot spots and high temperatures on the chip. In Chapter 3, we first focus on a compact temperature model of multicore processors. A major challenge in thermal-aware task scheduling is to estimate the on-chip temperature effectively and efficiently. Existing detailed thermal models are computationally intensive. An easily computed compact model that derives the temperature profile efficiently is needed in thermal-aware task scheduling problems. We propose a compact Matrix Model to estimate the steady-state temperature of multicore processors effectively and efficiently. The experimental results show that our Matrix Model is an order of magnitude faster than existing thermal models. Further, we use this model to develop a novel greedy slack allocation algorithm for dependent tasks represented by a Directed Acyclic Graph (DAG) on a multicore processor. The objective of the slack allocation is to minimize the peak temperature on any core for a given deadline. We also use this model to optimize the workload distribution for each core in a data-parallel problem. The goal of this algorithm is to maximize the total throughput across all cores while the maximum temperature is bounded by a given threshold.
We provide a thorough evaluation and comparison to show the effectiveness and efficiency of the thermal model and these algorithms. In Chapter 4, we propose to partition the "hot" tasks into multiple subtasks and interleave these subtasks with "cool" tasks to reduce the overall maximum temperature. We propose two heuristic task partitioning algorithms for single-core processors, one for a periodic set of tasks with a common period and one for tasks with individual periods. We also propose one heuristic task partitioning and scheduling algorithm for a set of independent tasks on multicore processors. Finally, we provide a thorough evaluation and comparison to show how task partitioning can assist in thermal-aware task scheduling problems. Thermal issues on multicore processors are mainly caused by high energy consumption. In Chapter 5, we focus on reducing energy consumption on multicore systems with a full memory hierarchy. We classify the energy consumption into computation energy on the cores and communication energy on the internal bus. We propose a general methodology to set the voltage and frequency levels for the cores as well as the internal bus to minimize the energy consumption of the whole processor. This methodology considers the processor's temperature, energy consumption, memory hierarchy, and the algorithm's input size together to achieve lower energy consumption of the chip as well as main memory. We analyze the energy consumption of different matrix multiplication algorithms with different input sizes.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Zhe Wang.
Thesis: Thesis (Ph.D.)--University of Florida, 2012.
Local: Adviser: Ranka, Sanjay.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2012
System ID: UFE0044648:00001

Full Text

PAGE 1

THERMAL-AWARE TASK SCHEDULING ON MULTICORE PROCESSORS

By

ZHE WANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2012

PAGE 2

© 2012 Zhe Wang

PAGE 3

To my beloved parents and Lei

PAGE 4

ACKNOWLEDGMENTS

First of all, my special gratitude goes to my advisor, Sanjay Ranka, for his great inspiration and excellent guidance throughout this dissertation and my Ph.D. education at the University of Florida, as well as his trust in the research and kind consideration in life. I would also like to thank Prof. Prabhat Mishra, Prof. Ye Xia, Prof. Shigang Chen, and Prof. Paul Avery for serving on my dissertation committee and providing valuable suggestions on this dissertation.

PAGE 5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS ... 4
LIST OF TABLES ... 8
LIST OF FIGURES ... 9
ABSTRACT ... 12

CHAPTER

1 INTRODUCTION ... 14
  1.1 Background ... 14
  1.2 Contributions ... 17
    1.2.1 Compact Thermal Model and its Applications ... 17
    1.2.2 Temperature-aware Task Partitioning ... 18
    1.2.3 Thermal and Energy-aware Scheduling for Matrix Multiplication ... 18
  1.3 Outline of the Dissertation ... 19
2 LITERATURE REVIEW ... 20
  2.1 Temperature Measurement ... 20
    2.1.1 Sensor-based Temperature Measurement ... 20
    2.1.2 Model-based Temperature Measurement ... 21
  2.2 Temperature-aware Task Scheduling Algorithms ... 22
  2.3 Preliminaries ... 24
    2.3.1 Task Model ... 24
    2.3.2 Power Model ... 24
    2.3.3 Thermal Model ... 25
      2.3.3.1 Coarse-grained thermal model ... 26
      2.3.3.2 Fine-grained thermal models ... 27
  2.4 Summary ... 29
3 COMPACT THERMAL MODEL AND ITS APPLICATIONS ... 31
  3.1 Background ... 31
  3.2 A Simple Thermal Model for Multicore Processors ... 33
    3.2.1 Thermal Equations ... 33
    3.2.2 Matrix Model ... 34
    3.2.3 HotSpot Model ... 39
  3.3 Application to Slack Allocation for Dependent Tasks ... 39
    3.3.1 Problem Background ... 39
    3.3.2 Slack Allocation Algorithm ... 40
    3.3.3 Greedy Algorithm for Slack Allocation ... 42
    3.3.4 Evaluation ... 45

PAGE 6

      3.3.4.1 Comparison of thermal models ... 46
      3.3.4.2 Greedy algorithm for slack allocation ... 51
  3.4 Application to Thermal Constrained Workload Distribution ... 52
    3.4.1 Problem Background ... 52
    3.4.2 Steady State Analysis ... 57
      3.4.2.1 System overview ... 57
      3.4.2.2 Uniform voltage without throttling ... 58
      3.4.2.3 Uniform voltage with throttling ... 59
      3.4.2.4 Non-uniform voltage with throttling ... 60
    3.4.3 Evaluation ... 62
    3.4.4 Transient Analysis ... 66
  3.5 Summary ... 71
4 TEMPERATURE-AWARE TASK PARTITIONING ALGORITHMS ... 72
  4.1 Background ... 72
  4.2 Related Work ... 74
  4.3 Preliminaries ... 76
    4.3.1 Thermal Profile of Individual Tasks ... 76
    4.3.2 Thermal Profile of Task Sequences on Single Core Processors ... 76
    4.3.3 Thermal Profile of Task Sequences on Multicore Processors ... 77
  4.4 Task Partitioning Algorithms on Single Core Processors ... 78
    4.4.1 Periodic Tasks with Common Period ... 82
    4.4.2 Periodic Tasks with Individual Period ... 84
  4.5 Task Partitioning and Scheduling on Multicore Processors ... 89
  4.6 Experimental Results ... 94
    4.6.1 Task Partitioning Algorithms on Single Core Processors ... 95
      4.6.1.1 Tasks with common period ... 95
      4.6.1.2 Tasks with individual period ... 97
      4.6.1.3 Context switching overhead ... 98
    4.6.2 Task Partitioning and Scheduling on Multicore Processors ... 99
      4.6.2.1 Synthetic tasks ... 100
      4.6.2.2 Real benchmarks ... 103
      4.6.2.3 Scheduling time ... 104
  4.7 Summary ... 105
5 THERMAL AND ENERGY AWARE SCHEDULING ALGORITHM FOR MATRIX MULTIPLICATION ... 107
  5.1 Background ... 107
  5.2 Related Work ... 108
  5.3 Processor Model and Energy Model ... 109
    5.3.1 Processor Model ... 109
    5.3.2 Energy Model ... 111
      5.3.2.1 Processor energy model ... 111
      5.3.2.2 Bus energy model ... 112

PAGE 7

      5.3.2.3 Shared memory model ... 112
  5.4 General Methodology ... 112
  5.5 Parallel Algorithm analysis ... 114
    5.5.1 Rowwise Matrix Multiplication ... 114
      5.5.1.1 Problem definition ... 114
      5.5.1.2 Small matrices case ... 114
      5.5.1.3 Large matrices case ... 118
    5.5.2 Block Matrix Multiplication ... 119
      5.5.2.1 Problem definition ... 119
      5.5.2.2 Small matrices case ... 120
      5.5.2.3 Large matrices case ... 122
  5.6 Summary ... 124
6 CONCLUSIONS AND FUTURE WORK ... 126
REFERENCES ... 128
BIOGRAPHICAL SKETCH ... 135

PAGE 8

LIST OF TABLES

Table ... page

2-1 The relationship between fractional increase of temperature with time (clock cycles) ... 27
3-1 The execution time of tasks in Figure 3-3 ... 42
3-2 The EST, LST and slack of each task based on the assignment DAG in Figure 3-4 ... 43
3-3 Specifications of processors ... 63
4-1 The four sets of benchmark tasks with common period ... 96
4-2 The four sets of benchmark tasks with individual period ... 98
4-3 The four sets of benchmark tasks for multicore processors ... 103

PAGE 9

LIST OF FIGURES

Figure ... page

2-1 The RC-circuit of a processor in HotSpot (derived from [69]) ... 28
2-2 A simplified analytical model of a multicore processor (derived from [63]) ... 29
3-1 A 4-core processor with 4 heat sinks ... 35
3-2 C matrix of a 4-core processor ... 39
3-3 A simple DAG with 6 tasks ... 42
3-4 The assignment DAG to a 2-core processor of tasks in Figure 3-3 ... 43
3-5 Temperature versus power consumption for P0 on a 4-core processor ... 47
3-6 Temperature versus power consumption for P0 and P1 on a 4-core processor ... 47
3-7 Peak temperature comparison between Matrix Model (MM) and HotSpot on a 4-core processor ... 48
3-8 Peak temperature comparison using the Matrix Model (MM) and the HotSpot Model on a 4-core multicore processor ... 49
3-9 Computation time comparison between Matrix Model (MM) and HotSpot Model on a 4-core multicore processor ... 50
3-10 Peak temperature comparison between greedy algorithm and uniform algorithm on a 4-core processor for variable number of tasks ... 52
3-11 Peak temperature comparison between greedy algorithm and uniform algorithm on a 16-core processor for variable number of tasks ... 53
3-12 Peak temperature comparison of original (no slack allocation), greedy algorithm and uniform algorithm under different extension rates on a 4-core processor ... 54
3-13 Peak temperature comparison of original (no slack allocation), greedy algorithm and uniform algorithm under different extension rates on a 16-core processor ... 55
3-14 Procedure for assigning voltages ... 58
3-15 Maximum temperature of MIP, BestP, BestP+1 and WorstP ... 64
3-16 Performance comparison between our algorithm and LP approach. ... 65
3-17 Computation time comparison between LP and our algorithm (Uniform power with throttling) ... 66
3-18 Performance comparison between heuristic approach and LP approach ... 67

PAGE 10

3-19 Computation time comparison between LP and heuristic solution (Non-uniform power with throttling) ... 68
3-20 Throughput comparison between NVT, UVT, UVNT ... 68
3-21 Transient analysis of a 16-core processor ... 69
3-22 Comparison of transient temperature between new threshold and original threshold temperature ... 70
4-1 An example of the task sequencing and task partitioning ... 80
4-2 Transient temperature comparison between task sequencing and task partitioning ... 82
4-3 Task Partitioning Algorithm (Step 1-3) ... 84
4-4 Task Partitioning Algorithm (Step 4-6) ... 85
4-5 Task Partitioning Algorithm (Step 7). ... 85
4-6 First portion of the slack allocation step in the EDF based partitioning algorithm ... 88
4-7 Second portion of the slack allocation step in the EDF based partitioning algorithm ... 89
4-8 Normalize the execution time of tasks to the smallest integer multiple of d ... 90
4-9 Intra-core scheduling ... 93
4-10 Inter-core scheduling ... 94
4-11 Peak temperature comparison between different number of categories ... 96
4-12 Peak temperature comparison between task sequencing algorithm and task partitioning algorithm on real tasks ... 97
4-13 Peak temperature comparison between EDF and EDFp (M=15) ... 97
4-14 Average number of context switches per task of task partitioning algorithm for various number of categories ... 99
4-15 Average number of context switches per task comparison between EDF and EDFp for various sets of tasks ... 99
4-16 Peak temperature comparison between Min-Min, PDTM and our TPS algorithms for synthetically generated tasks on a 16-core processor ... 101
4-17 Makespan comparison between Min-Min, PDTM and our TPS algorithms for synthetically generated tasks on a 16-core processor ... 102
4-18 Peak temperature comparison between TPS-1 relaxed and PDTM algorithms for synthetically generated tasks on a 16-core processor ... 102

PAGE 11

4-19 Peak temperature comparison between Min-Min, PDTM and our TPS algorithms for real benchmarks on a 16-core processor ... 103
4-20 Makespan comparison between Min-Min, PDTM and our TPS algorithms for real benchmarks on a 16-core processor ... 104
4-21 Scheduling time comparison between PDTM and our TPS algorithms ... 105
5-1 The processor model ... 110
5-2 Rowwise 1-D cyclic matrix multiplication ... 115
5-3 The minimum energy requirement comparison between cpu voltage only scaling and cpu+bus voltage scaling of rowwise matrix multiplication (small matrices) ... 118
5-4 The minimum energy requirement comparison between cpu only scaling and cpu+bus scaling of rowwise matrix multiplication (large matrices) ... 120
5-5 2-D block matrix multiplication (small matrices) ... 121
5-6 The minimum energy requirement comparison between cpu only scaling and cpu+bus scaling of block matrix multiplication (small matrices) ... 122
5-7 2-D block matrix multiplication (large matrices) ... 123
5-8 The minimum energy requirement comparison between cpu only scaling and cpu+bus scaling of block matrix multiplication (large matrices) ... 124

PAGE 12

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

THERMAL-AWARE TASK SCHEDULING ON MULTICORE PROCESSORS

By

Zhe Wang

August 2012

Chair: Sanjay Ranka
Major: Computer Engineering

The power and heat density of multicore processors are increasing exponentially with Moore's Law. High temperature negatively affects reliability and raises the cost of cooling and packaging, which makes thermal management a significant design challenge for multicore processors. High temperature also raises the leakage power of the chip and can thereby lead to thermal runaway. Existing power-aware techniques do not address the temperature issues in multicore processors. Localized high-temperature hot spots and spatial gradients can be caused by unbalanced power distribution and heat dissipation. Techniques such as effective scheduling of tasks on multicore processors are needed to reduce hot spots and high temperatures on the chip.

In Chapter 3, we first focus on a compact temperature model of multicore processors. A major challenge in thermal-aware task scheduling is to estimate the on-chip temperature effectively and efficiently. Existing detailed thermal models are computationally intensive. An easily computed compact model that derives the temperature profile efficiently is needed in thermal-aware task scheduling problems. We propose a compact Matrix Model to estimate the steady-state temperature of multicore processors effectively and efficiently. The experimental results show that our Matrix Model is an order of magnitude faster than existing thermal models. Further, we use this model to develop a novel greedy slack allocation algorithm for dependent tasks represented by a Directed Acyclic Graph (DAG) on a multicore processor. The objective of the slack

PAGE 13

allocation is to minimize the peak temperature on any core for a given deadline. We also use this model to optimize the workload distribution for each core in a data-parallel problem. The goal of this algorithm is to maximize the total throughput across all cores while the maximum temperature is bounded by a given threshold. We provide a thorough evaluation and comparison to show the effectiveness and efficiency of the thermal model and these algorithms.

In Chapter 4, we propose to partition the hot tasks into multiple subtasks and interleave these subtasks with cool tasks to reduce the overall maximum temperature. We propose two heuristic task partitioning algorithms for single-core processors, one for a periodic set of tasks with a common period and one for tasks with individual periods. We also propose one heuristic task partitioning and scheduling algorithm for a set of independent tasks on multicore processors. Finally, we provide a thorough evaluation and comparison to show how task partitioning can assist in thermal-aware task scheduling problems.

Thermal issues on multicore processors are mainly caused by high energy consumption. In Chapter 5, we focus on reducing energy consumption on multicore systems with a full memory hierarchy. We classify the energy consumption into computation energy on the cores and communication energy on the internal bus. We propose a general methodology to set the voltage and frequency levels for the cores as well as the internal bus to minimize the energy consumption of the whole processor. This methodology considers the processor's temperature, energy consumption, memory hierarchy, and the algorithm's input size together to achieve lower energy consumption of the chip as well as main memory. We analyze the energy consumption of different matrix multiplication algorithms with different input sizes.

PAGE 14

CHAPTER 1
INTRODUCTION

1.1 Background

The power density of multicore processors has doubled every three years, and this rate is expected to increase [68]. The resulting high temperature can lead to reliability issues and increase the cost of packaging and cooling. A 10°C-15°C increase in temperature could halve the lifetime of the device [65]. Due to the broad application of multicore processors in embedded systems as well as large servers and data centers, thermal-aware task scheduling on multicore processors is becoming an increasingly critical issue in software design.

Existing power-aware techniques do not address the temperature issues in embedded systems. The main reason is that the power distribution of multiprocessors is not uniform. Localized regions heat up much more rapidly than the chip as a whole, leading to a nonuniform temperature distribution on the chip with localized high-temperature hot spots and spatial gradients [70]. The traditional method of controlling the on-chip temperature is to employ better packaging and cooling techniques (e.g., active fan cooling, water cooling, and heat pipes). These active cooling systems may not always be suitable because of space limitations. Building thermal analysis ability into the EDA flow allows the system to address the thermal impact on various on-chip parameters and incorporate the effects of non-uniform thermal profiles during the IC design process. However, such technologies are unable to deal with a variety of runtime situations. In this dissertation, we focus on software approaches for thermal management. These approaches are flexible and do not have some of the limitations described above.

Effective scheduling of tasks on multicore processors can reduce hot spots. The use of Dynamic Voltage and Frequency Scaling (DVFS) [8] to differentially assign voltage and frequency to each core can effectively reduce the temperature. The first

PAGE 15

challenge in thermal-aware scheduling is how to evaluate the temperature throughout the chip effectively and efficiently. Existing models of the system such as HotSpot [70], a thermal simulator that carefully simulates the underlying heat transfer, are quite computationally intensive. Scheduling algorithms try to achieve the required objective by comparing many scheduling scenarios in a systematic fashion. This may require deriving the temperature profile of each scenario. Using detailed models such as HotSpot for this purpose is computationally intensive. Other models [10] require time proportional to the number of time slices times the number of cores. An easily computed analytical model is proposed in [63]. However, it neglects the lateral thermal resistances between cores and effectively ignores the floorplan.

In Chapter 3, we develop a thermal model called the Matrix Model (MM) that can be used to derive new temperature profiles for a given set of cores and their voltages. Our simulation results show that the model is comparable to the HotSpot model for predicting the peak temperature and requires an order of magnitude less computation. Besides having a lower computational cost, the MM is succinct (a single matrix) and can be used to derive algorithms for a variety of scenarios, including those where incremental changes to the temperature profile are necessary.

We use the Matrix Model to develop a novel slack allocation algorithm for a workflow represented by a Directed Acyclic Graph (DAG) on a multicore processor. The objective of the slack allocation is to minimize the peak temperature on any core for a given deadline. We also use this model to optimize the workload distribution of data-parallel applications on multicore processors, so that the total throughput across all cores is maximized and the maximum temperature for any core is bounded by a given threshold.

The second challenge is how to effectively reduce the chip temperature using task partitioning. Given a particular chip and its outside environment, the ambient temperature, thermal resistance, and capacitance are fixed. Based on the parameters of Equation

PAGE 16

(4), there are three major factors affecting the on-chip transient temperature: the average power of the processor, the initial temperature, and the execution time. Dynamic Voltage and Frequency Scaling (DVFS) can be used to reduce the power consumption by lowering the supply voltage and operating frequency, thereby reducing the on-chip temperature [3, 5, 20, 36, 57, 62]. However, DVFS faces a serious problem in time-constrained applications, because lowering the frequency lengthens execution time. The temperature-aware task sequencing algorithm [33] reduces the initial temperature to minimize the peak temperature. However, temperature-aware task sequencing fails to reduce temperature when one or more of the hot tasks are long. The algorithm that defers execution of hot tasks [11] fails to reduce temperature in the same situation. This is because when the execution time of a hot task is too long, it can lead to a high steady-state temperature irrespective of the initial temperature.

In Chapter 4, we propose to partition the hot tasks into multiple subtasks and interleave these subtasks with cool tasks to reduce the overall maximum temperature. The focus of this chapter is to use task partitioning effectively to reduce the maximum temperature. To the best of our knowledge, our work is the first attempt to develop efficient task partitioning algorithms that demonstrate significant temperature reduction. In this chapter, we first show how task partitioning can assist in thermal-aware management problems on single-core processors. We propose two heuristic task partitioning algorithms. One of them uses cool tasks to interleave hot tasks for a periodic set of tasks with a common period. The other is applicable to a periodic set of tasks with individual periods. Experimental results show that our single-core task partitioning algorithm outperforms the task sequencing algorithm [33], reducing the peak temperature by as much as 6°C. Second, we investigate the applicability of task partitioning in multicore processors. We propose a heuristic task partitioning and scheduling algorithm for a set of independent tasks on multicore processors.

The third challenge is how to further reduce the energy consumption of multicore processors with multiple clock domains and a full memory hierarchy. For a multi-core

PAGE 17

environment where the algorithms are executed, the energy consumption of computation can be obtained by summing the energy consumption of all active cores, but the communication energy is harder to estimate. The shared-address parallel model allows CPUs to communicate through a centralized memory via shared bus links. This leads to a natural question: what voltage and frequency level should be set for the bus to achieve global energy efficiency? Existing DVFS-based power-efficient algorithms scale the voltage/frequency of the whole processor chip; that is, the cores, bus, and L1 and L2 caches all operate at the same voltage level [4, 59, 60]. However, such a universal scaling method may reduce efficiency, since different applications may have different workload ratios between computation and communication. One solution to this problem is Multiple Clock Domain (MCD) processors, which allow components of the processor to operate at independent voltage levels [32]. In Chapter 5, we propose a general methodology for bus-based multi-core processors with a full memory hierarchy. Using this methodology, one can analyze the dynamic voltage settings for the bus and cores, estimate the energy consumption of each component separately, and obtain a good tradeoff among them. We examine this methodology using various parallel matrix multiplication algorithms that are suitable for shared-memory multicore machines with L1 and L2 caches. All the algorithms are assumed to execute on the full memory hierarchy.

1.2 Contributions

In this section, we present the main contributions of this dissertation.

1.2.1 Compact Thermal Model and its Applications

We propose a compact matrix model that can be used to derive new temperature profiles for a given set of cores on processors and their voltages. We show that the thermal model produces steady-state temperature estimates close to those of the HotSpot model [70] with an order of magnitude lower computation time. We utilize this matrix thermal model to develop novel thermal-aware scheduling algorithms for dependent tasks represented

PAGE 18

as a DAG. Our simulation results show that our algorithm is computationally efficient and outperforms the uniform slack allocation algorithm by up to 13.38 °C and on average 5.75 °C. We also utilize the matrix thermal model to optimize the workload distributions under certain thermal constraints for data parallel applications on multicore processors. Experimental results show that our algorithm is an order of magnitude faster than the standard linear programming method.

1.2.2 Temperature-aware Task Partitioning

We propose to partition the hot tasks into multiple subtasks and interleave these subtasks with cool tasks to reduce the overall maximum temperature. To the best of our knowledge, our work is the first attempt to develop efficient task partitioning algorithms that demonstrate significant temperature reduction. We propose a heuristic task partitioning algorithm that uses cool tasks to interleave hot tasks for a periodic set of tasks with a common period. We propose another heuristic task partitioning algorithm for a periodic set of tasks with individual periods. Further, we also investigate the applicability of task partitioning in multicore processors. We propose a heuristic task partitioning and scheduling algorithm for a set of independent tasks on multicore processors. Finally, we provide a thorough evaluation and comparison to show how task partitioning can assist in thermal-aware management problems. Experimental results show that our algorithm can reduce the peak temperature by up to 11.68 °C compared with the Min-Min algorithm and reduce the makespan by up to 40% compared with the PDTM algorithm. When the makespan is the same as that of PDTM, our algorithms improve the peak temperature substantially, by 18 °C.

1.2.3 Thermal and Energy-aware Scheduling for Matrix Multiplication

We propose a general methodology to minimize the energy consumption for bus-based multi-core processors with a full memory hierarchy. This method can analyze the dynamic voltage settings for the bus and cores, estimate the energy consumption of each component separately, and obtain tradeoffs between the global optimal energy


and different performance constraints. This method considers various factors of the processors, including working temperature, memory hierarchy and the algorithms' input size, together to achieve lower energy consumption. We examine this methodology using various parallel matrix multiplication algorithms that are suitable for shared-memory multicore machines with L1 and L2 caches. Our simulation results show that effectively setting the voltage of the buses along with the cores can result in up to a 20% reduction in the overall energy requirements without a substantial loss in overall performance.

1.3 Outline of the Dissertation

The rest of this dissertation is organized as follows: Chapter 2 gives an overview of the related research and preliminaries for thermal-aware task scheduling. Chapter 3 presents the matrix thermal model and applies it to dependent task scheduling on multicore processors and to workload distribution for data parallel applications. Chapter 4 presents the two task partitioning algorithms for real-time tasks with a common period and with individual periods on single-core processors. Chapter 4 also presents a task partitioning and scheduling algorithm on multicore processors. Chapter 5 presents a general methodology to minimize the energy consumption for bus-based multi-core processors with a full memory hierarchy. Chapter 6 concludes the dissertation and presents future work.


CHAPTER 2
LITERATURE REVIEW

2.1 Temperature Measurement

Techniques that can quickly estimate the temperature on chips are required for Dynamic Thermal Management (DTM) algorithms to make decisions. There are two main techniques to quickly estimate the chip temperature:

Sensor-based Methods: These methods use the processors' on-die thermal sensors to capture the real-time temperature on the chip [13, 56, 67]. Sensor-based methods have low computation overhead and can capture the temperature in real time. Their limitation is that only the temperature near the on-die sensors can be captured. The noise in the data obtained by thermal sensors will also lead to inaccuracy in the results.

Model-based Methods: These methods use RC thermal models [70] to estimate the temperature based on the chip's hardware characteristics, power consumption and ambient temperature. Model-based methods are more flexible and can estimate the temperature of the whole chip. The disadvantage of model-based methods is the high computation overhead. A wide range of thermal models have been proposed [63, 77, 79].

2.1.1 Sensor-based Temperature Measurement

Modern processors are equipped with thermal sensors to detect overheating on chip. The processors will be throttled if the on-chip temperature exceeds a certain predefined threshold. Using thermal sensors can also help DTM methods to capture the on-die temperature. In [56], Mukherjee et al. propose an approach to identify the best locations for on-die thermal sensors. The authors utilize the k-means clustering algorithm to group the on-die hotspots and assign thermal sensors to each group center. The authors also propose two strategies to place the thermal sensors: global sensor placement and local sensor placement. The former strategy puts sensors on global hotspots of the whole chip for overheat alerts. The latter focuses on the allocation of thermal sensors on individual processing components to capture the real-time temperature. Another challenge of sensor-based methods is the noise in the data obtained by thermal sensors, which will lead to inaccuracy in the final results.


Sharifi et al. propose a novel method to accurately estimate the on-chip temperature from thermal sensors [67]. They utilize a Kalman filter to assist in reducing the noise in the data. They also utilize model order reduction and a steady-state Kalman filter to further reduce the computation overhead. The temperature estimation error is reduced by an order of magnitude. Sensor-based methods also have the limitation that only the temperature near the on-die sensors can be captured. Cochran and Reda propose to estimate the temperature characteristics of the whole chip by using the Nyquist-Shannon sampling technique [13]. The absolute estimation error of the temperature characteristics of the chip is only 0.6%.

2.1.2 Model-based Temperature Measurement

Model-based temperature measurement methods utilize the well-known thermal-electrical duality to model the chip, as well as other cooling components such as the heat spreader and heat sink, as a lumped RC circuit. The temperature difference on chip is analogous to the voltage difference in the RC circuit. The heat flow on chip is analogous to the current in the RC circuit. The heating and cooling process of the chip is analogous to the charging and discharging of the capacitance in the RC circuit [70].

HotSpot [69] is a widely used thermal simulation package which is compatible with different kinds of processor models in computer architecture. It models three layers of processors: heat spreader, heat sink and silicon die. HotSpot supports steady-state temperature simulation as well as transient temperature simulation. It has a simple interface that can be integrated with many modern computer architecture simulators. To use HotSpot, one only needs to provide the floorplan of the processor, the thermal hardware characteristics and the trace file of the power consumption on the chip. HotSpot is able to generate the transient temperature trace file by modeling and evaluating the thermal behavior of the processor. The disadvantage of HotSpot is its high computation overhead. Rao et al. [63] propose a fast analytical thermal model simplified from HotSpot. The analytical model considers the heat sink and heat spreader


together as one cooling component and assumes that the temperature on this component is constant. The analytical model also ignores the lateral thermal resistance on the silicon layer because the lateral thermal resistance of silicon is four times higher than the vertical thermal resistance. Finally, the analytical model only estimates the steady-state temperature.

2.2 Temperature-aware Task Scheduling Algorithms

The work done on thermal-aware modeling and scheduling can be classified into categories based on the desired objectives. The first category of dynamic thermal management (DTM) research mainly focuses on optimizing processor speed while keeping the temperature under a certain temperature threshold [18, 20, 24, 36, 58, 62]. Rao et al. use linear programming to calculate the optimal core frequencies under a certain temperature constraint [62]. The authors propose solutions under both steady-state and transient temperature conditions by using a clock-gating duty cycle. Murali et al. use convex optimization to solve the problem of assigning frequencies to multicore processors to maximize performance under both energy and temperature constraints [58]. Ebi et al. propose an agent-based distributed DTM that dynamically trades available power units with other cores to achieve lower temperature, lower energy consumption and better scalability when applied to a large number of cores [20]. Cui et al. present an event-based thermal model to replace the traditional thermal model, owing to its low computational overhead and high accuracy [18].

The second category of DTM research focuses on minimizing the peak temperature while keeping the performance above a certain threshold [3, 10, 14, 16, 22, 45, 49, 61]. Coskun et al. propose a proactive DTM algorithm to balance the heat generated on an MPSoC to reduce the maximum temperature. They use a regression model to predict the temperature based on the previous temperature profile [14]. The authors claim that the proactive DTM method outperforms the reactive method in reducing the maximum temperature with little performance overhead. Li et al. propose to dynamically adjust


the priority of hot and cool processes based on the process thermal profile [49]. Thermal profiles include the data from the processes' performance counters and thermal sensors. Predictive Dynamic Thermal Management (PDTM) was proposed by Yeo et al. to dynamically migrate threads from cores with high temperature to cores with low temperature in the future [80]. The future temperature is estimated by both a core-based and an application-based thermal model. PDTM also reduces the thread priority to further reduce the temperature. Kursun et al. propose a variation-aware DTM algorithm [45]. The authors argue that variations exist in most computer architectures due to leakage current, supply voltage, etc. Variations will affect the efficiency of existing DTM algorithms that assume constant hardware parameters.

The third category of DTM research focuses on optimizing performance and temperature together. Most of this research uses heuristic algorithms to achieve lower temperature and shorter execution time [17, 40, 48, 57, 79]. Cui et al. propose an algorithm that utilizes look-up tables to dynamically assign tasks to homogeneous cores [17]. The look-up tables record the temperature differences on each core when the power configuration of some core changes. Task allocation can be made quickly based on the placement weight, which can be calculated from these tables. Khan et al. propose a hardware-software co-design architecture for DTM [40]. The hardware components include the thermal sensors, microprocessor platform and processor virtualization. The software component includes the virtual machine monitor with the Virtual Thermal Manager (VTM). VTM predicts the future hotspots on the chip and reconfigures the hardware to reduce or prevent them. A dynamic frequency assignment algorithm that proactively manages the temperature of multicore systems has been proposed by Murali et al. [57]. The authors generate the table of feasible frequencies for different workload and temperature configurations offline. At online time, the frequencies are assigned by matching the current workload and temperature to the offline-generated table.


2.3 Preliminaries

2.3.1 Task Model

The different task models studied in temperature-aware task scheduling can be classified along the following dimensions:

Independent tasks vs. Dependent tasks: A dependency between two tasks means that the execution of one task depends on the completion of the other. Independent tasks have no dependency relations between them. Most general-purpose applications on desktops are dependent tasks. Dependent tasks have dependency relations between them. The dependencies between tasks are usually represented as a Directed Acyclic Graph (DAG). In a DAG, tasks are denoted by nodes and the dependencies between tasks are denoted by edges.

Real-time tasks vs. Non-real-time tasks: Tasks with an arrival time and a deadline are considered real-time tasks. Real-time tasks can be further classified into hard real-time tasks and soft real-time tasks. Hard real-time tasks have strict deadline constraints: each task needs to finish strictly before its deadline. Hard real-time tasks include critical system control, critical condition detection, etc. Soft real-time tasks do not have strict deadlines and can finish after the deadline, but a penalty is incurred if a task finishes after its deadline. Soft real-time tasks include saving data, showing information on screen, etc.

Periodic tasks vs. Aperiodic tasks: Periodic tasks consist of an infinite sequence of identical instances. The instances of periodic tasks take place at regular intervals. The activation time of the first instance is the phase, denoted by $\phi$. For a periodic task $\tau$ with period $T$, the activation time of the $k$th instance is given by $\phi + (k-1)T$. The deadline of a task is often defined as the end of a period. Aperiodic tasks consist of an infinite sequence of activities that are activated irregularly.

2.3.2 Power Model

There are two components of power consumption in digital CMOS circuits: dynamic and static. Dynamic energy is consumed by the charging and discharging of circuit capacitances and by short-circuit power dissipation. Static energy is due to leakage current through reverse-biased diodes, etc.

The dynamic power consumption of CMOS circuits [37] is

$P_d = C_{ef} V_{dd}^2 f$  (2)


where $V_{dd}$ is the supply voltage, $C_{ef}$ is the effective capacitance and $f$ is the clock frequency.

The frequency is represented by

$f = k (V_{dd} - V_t)^2 / V_{dd}$  (2)

where $V_t$ is the threshold voltage and $k$ is a hardware constant.

Leakage power is the main source of static power consumption in CMOS circuits [41]. The leakage power consumption is the product of the supply voltage $V_{dd}$ and the leakage current $I_{leak}$:

$P_l = V_{dd} I_{leak}$  (2)

Leakage current can be divided into gate-oxide leakage current $I_g$ and sub-threshold leakage current $I_s$. Gate-oxide leakage current is calculated by:

$I_g = M W (V_{dd} / W_{ox})^2 e^{-\alpha W_{ox} / V_{dd}}$  (2)

where $M$ and $\alpha$ are hardware parameters, $W$ is the gate width and $W_{ox}$ is the oxide thickness. Sub-threshold leakage current is calculated by:

$I_s = K W e^{-V_{th} / (n V_h)} (1 - e^{-V_{dd} / V_h})$  (2)

where $K$ and $n$ are hardware parameters and $V_h$ is the voltage related to the current chip temperature.

2.3.3 Thermal Model

In this section, we introduce two types of thermal models: coarse-grained thermal models and fine-grained thermal models. Coarse-grained thermal models assume the whole processor has a uniform temperature. Coarse-grained thermal models are usually used to show the thermal behavior of the whole chip given the power consumption and


time. Fine-grained thermal models estimate the thermal profile of processors in detail. They are used in temperature estimation for multicore processors.

2.3.3.1 Coarse-grained thermal model

The processor thermal behavior can be effectively modeled using an RC model [70]. If the average power of a processor is $P$ over a time period $t$, then the transient temperature $T(t)$ at the end of this period, using this model, is given by:

$T(t) = PR + T_A - (PR + T_A - T_i) e^{-t/RC}$  (2)

where $R$ is the thermal resistance, $C$ is the thermal capacitance, $T_A$ is the ambient temperature and $T_i$ is the initial temperature.

By letting $t \to \infty$ in (2), we get the steady-state temperature:

$T_S = PR + T_A$  (2)

Suppose the average power consumption of a task $t_i$ is $P_i$, the execution time measured in cycles is $c_i$, and the initial temperature when the task starts to execute is $T_i$. Then the temperature after the task finishes is given by:

$T(t) = P_i R + T_A - (P_i R + T_A - T_i) e^{-c_i/RC}$  (2)

where $R$, $C$ and $T_A$ have the same definitions as in Equations (2) and (2).

Similarly, the steady-state temperature of task $t_i$ is given by:

$T_{S_i} = P_i R + T_A$  (2)

The time required to get close to steady state can be derived by substituting $T_{S_i}$ for $P_i R + T_A$ in Equation (2). Simplifying, we get

$T(t) = T_{S_i} - (T_{S_i} - T_i) e^{-c_i/RC}$  (2)


Let $\alpha$ denote the fraction of the overall temperature increase that has been reached, $\alpha = \frac{T(t) - T_i}{T_{S_i} - T_i}$. Then we have

$(T_{S_i} - T_i)\alpha + T_i = T_{S_i} - (T_{S_i} - T_i) e^{-c_i/RC}$  (2)

From Equation (2), we get

$c_i = -RC \ln(1 - \alpha)$  (2)

Equation (2) shows that the time taken by the heating process depends only on the product of the thermal resistance and the thermal capacitance, and on the fraction of the overall temperature increase. Setting the product of thermal resistance and thermal capacitance to 0.2053 Joules/Watt for a 1.5 GHz processor [33], we can derive the time requirements for various values of $\alpha$ (Table 2-1).

Table 2-1. The relationship between the fractional increase of temperature and time (clock cycles)

α      RC (J/W)   time (cycles)
0.99   0.2053     1.41 × 10^9
0.8    0.2053     4.95 × 10^8
0.5    0.2053     2.1 × 10^8

From Table 2-1, we can see that achieving a 99% change (very close to the final steady state) requires an execution time larger than 1.41 × 10^9 clock cycles, or 0.94 seconds on a 1.5 GHz processor. Tasks with an execution time larger than 1.41 × 10^9 cycles will come close to their steady-state temperature.

2.3.3.2 Fine-grained thermal models

HotSpot [69] is a widely used thermal simulation package which is compatible with different kinds of processor models in computer architecture. It models three layers of processors: heat spreader, heat sink and silicon die. HotSpot supports steady-state temperature simulation as well as transient temperature simulation. It has a simple interface that can be integrated with many modern computer architecture simulators.
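The relationship $c_i = -RC \ln(1 - \alpha)$ from Section 2.3.3.1 can be checked numerically. The following Python sketch is illustrative only: the function name `cycles_to_fraction` and the explicit multiplication by the clock frequency (to convert seconds into clock cycles) are assumptions made here, chosen because they are consistent with the entries of Table 2-1.

```python
import math


def cycles_to_fraction(alpha, rc_seconds, clock_hz):
    """Clock cycles needed to cover a fraction alpha of the rise toward steady state.

    rc_seconds is the product R*C (J/W, i.e. seconds); clock_hz converts the
    resulting time into cycles. Both parameter names are illustrative.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    seconds = -rc_seconds * math.log(1.0 - alpha)
    return seconds * clock_hz


RC = 0.2053       # thermal time constant quoted in the text (J/W)
CLOCK_HZ = 1.5e9  # 1.5 GHz processor, as in the text

for alpha in (0.99, 0.8, 0.5):
    cycles = cycles_to_fraction(alpha, RC, CLOCK_HZ)
    print(f"alpha={alpha}: {cycles:.3e} cycles ({cycles / CLOCK_HZ:.3f} s)")
```

Under these assumptions the computed values agree with the tabulated ones to within about 2%, which suggests the table entries were truncated rather than rounded.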


Figure 2-1. The RC circuit of a processor in HotSpot (derived from [69])

To use HotSpot, one only needs to provide the floorplan of the processor, the thermal hardware characteristics and the trace file of the power consumption on the chip. HotSpot is able to generate the transient temperature trace file by modeling and evaluating the thermal behavior of the processor. Figure 2-1 shows a processor with heat spreader and heat sink, and its corresponding RC circuit modeled in HotSpot. There are three layers in the RC circuit model: die layer, heat spreader layer and heat sink layer. The die layer is divided into three blocks, and each block connects with its neighboring blocks through an RC circuit. The heat spreader layer is divided into five blocks, with one center block covering the die layer and four equal blocks surrounding the center block. The heat sink layer has a similar layout to the heat spreader layer. There are also RC circuits between two neighboring layers as well as between the processor and the ambient temperature.

The disadvantage of HotSpot is its high computation overhead. Rao et al. [63] propose a fast analytical thermal model simplified from HotSpot. The analytical model considers the heat sink and heat spreader together as one cooling package and assumes


Figure 2-2. A simplified analytical model of a multicore processor (derived from [63])

that the temperature on this component is constant. The analytical model also ignores the lateral thermal resistance on the silicon layer because the lateral thermal resistance of silicon is four times higher than the vertical thermal resistance. Finally, the analytical model only estimates the steady-state temperature. Figure 2-2 shows the RC circuit of the analytical model. Each core is modeled in exactly the same way, with M power sources, M thermal resistances and no thermal capacitance. The cores share the single cooling package.

2.4 Summary

In this chapter, we have reviewed the related research on dynamic thermal management mechanisms for single-core and multicore processors in recent years. We first classify current temperature measurement methods into two categories: sensor-based methods and model-based methods. We present both the advantages and disadvantages of these two methods and the research done on improving them. Second, we classify thermal-aware task scheduling algorithms into three categories and present the state-of-the-art research in each category.


Finally, we present the detailed preliminaries needed for research on dynamic thermal management mechanisms. The preliminaries describe the task models, power models and state-of-the-art thermal models.
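As a compact illustration of the coarse-grained model from the preliminaries, the sketch below evaluates the transient equation of Section 2.3.3.1 for a long and a short task, each started from two different initial temperatures. All numeric values other than the RC product of 0.2053 s (the 50 W power, 0.8 K/W resistance, 45 °C ambient and the two task lengths) are hypothetical and chosen only for illustration, as is the function name.

```python
import math


def transient_temperature(power_w, r_th, c_th, t_ambient, t_initial, duration_s):
    """Temperature at the end of an interval under the lumped RC model.

    Implements T(t) = T_S - (T_S - T_i) * exp(-t / (R * C)) with T_S = P * R + T_A.
    """
    t_steady = power_w * r_th + t_ambient
    return t_steady - (t_steady - t_initial) * math.exp(-duration_s / (r_th * c_th))


R_TH = 0.8            # K/W, hypothetical thermal resistance
C_TH = 0.2053 / R_TH  # chosen so that R * C = 0.2053 s, the value used in the text
T_AMBIENT = 45.0      # degrees C, hypothetical
POWER = 50.0          # W, hypothetical "hot" task; steady state is 85 degrees C

for duration in (1.0, 0.05):     # a long task and a short task, in seconds
    for t_initial in (50.0, 70.0):
        t_end = transient_temperature(POWER, R_TH, C_TH, T_AMBIENT, t_initial, duration)
        print(f"duration={duration}s, start={t_initial}C -> end={t_end:.2f}C")
```

With these illustrative numbers, the long task ends within a fraction of a degree of its steady-state temperature regardless of where it started, while the short task's final temperature still depends strongly on the initial temperature. This is the behavior that limits sequencing-based approaches when hot tasks are long.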


CHAPTER 3
COMPACT THERMAL MODEL AND ITS APPLICATIONS

3.1 Background

The power density of multicore processors has doubled every three years, and this rate is expected to increase as frequencies scale faster than operating voltages [68]. This can lead to high heat density, which can degrade reliability and increase packaging and cooling cost. The cost of cooling increases super-linearly with power consumption [26]. Existing power-aware techniques do not address the temperature issues in multicore processors. Differential power distribution and heat dissipation can lead to a nonuniform temperature distribution on the chip with localized high-temperature hotspots and spatial gradients [70]. These hotspots and spatial gradients affect not only the performance of the system but also the lifetime of the chip [65].

Effective scheduling of tasks on multicore processors can reduce hotspots. The use of Dynamic Voltage Scaling (DVS) [8] to differentially assign voltage (and effectively computational power) to each core can effectively reduce the temperature in a localized fashion. A major challenge in temperature-aware scheduling is that the underlying model of the system is quite complex. Models such as HotSpot [70] that carefully simulate the underlying design are computationally very intensive. Scheduling algorithms try to achieve the required objective by comparing many scheduling scenarios in a systematic fashion. This may require deriving the temperature profile of each scenario. Using detailed models such as HotSpot for this purpose is computationally intensive. Other models [10] require time proportional to the number of time slices times the number of cores. An easily computed analytical model is proposed in [63]. However, it neglects the lateral thermal resistances between cores and effectively ignores the floorplan. In this chapter, we propose a compact Matrix Model that requires an order of magnitude less computation time compared with HotSpot. We also propose two applications of the Matrix Model to show its effectiveness and efficiency.


The first application of the compact Matrix Model is slack allocation for dependent tasks. Finding an optimal temperature-aware task scheduling solution is an NP-hard problem [12]. Several simple heuristic methods have been proposed, such as the coolest-first and neighbor-aware algorithms [72]. More complex methods [78] include temperature in the priority of each task to reduce the peak temperature. Convex optimization has been used for frequency scaling [57] to minimize power consumption under threshold temperature constraints. Mixed-integer linear programming [10] has also been used for the temperature minimization problem for a set of dependent tasks. However, none of these methods use a DVS algorithm for tasks with precedence relationships.

The second application of the compact Matrix Model is workload distribution optimization of multicore processors for data parallel applications. The goal of this application is to determine the workload distribution for each core so that the total throughput across all cores is maximized and the maximum temperature of any core is bounded by a given threshold. DVFS can control the distribution of power by assigning different voltages and frequencies to each core.

As the power of a core or set of cores is increased for a given computational phase, the temperature of the chip goes up. The steady-state analysis determines the maximum temperature that will be reached if the length of the phase is reasonably large (our experiments suggest that this could be between 10 seconds and half a minute, depending on the underlying architectural parameters). We develop linear and integer programming formulations based on the assumption that the maximum temperature on any core in the steady state is less than a user-provided threshold. Standard integer linear programming (ILP) solvers [30] can be used for each of the three cases, but are computationally expensive. We propose faster algorithms wherever applicable. If the duration of the phase is small, a steady state may be too pessimistic. We provide extensions to our approach.

The main contributions of this chapter are as follows:


1. We have developed a thermal model called the Matrix Model (MM) that can be used to derive new temperature profiles for a given set of cores and their voltages. Our simulation results show that the model is comparable to the HotSpot model for predicting the peak temperature and requires an order of magnitude less computation. Besides having a lower computational cost, the MM is succinct (a single matrix) and can be used to derive algorithms for a variety of scenarios, including those where incremental changes to the temperature profile are necessary.

2. We use this model to develop a novel slack allocation algorithm for a workflow represented by a Directed Acyclic Graph on a multicore processor. The objective of the slack allocation is to minimize the peak temperature on any core for a given deadline. Our simulation results show that our algorithm is computationally efficient and outperforms the uniform slack allocation algorithm by up to 13.38 °C and on average 5.75 °C.

3. We also use this model to optimize the workload distribution of data parallel applications on multicore processors, so that the total throughput across all cores is maximized and the maximum temperature of any core is bounded by a given threshold. Our simulation results show that our algorithm is an order of magnitude faster than the standard linear programming method.

The rest of this chapter is organized as follows. Section 3.2 describes the Matrix Model and its proof of correctness. Section 3.3 presents the greedy slack allocation algorithm utilizing the Matrix Model. Section 3.4 optimizes the thermally constrained workload distribution for data-parallel applications. We conclude in Section 3.5.

3.2 A Simple Thermal Model for Multicore Processors

3.2.1 Thermal Equations

We choose an RC model equation to express the underlying thermal process [70]. If the average power of a core is $P$ over a time period $t$, then the temperature $T(t)$ at the end of this period is given by:

$T(t) = PR + T_A - (PR + T_A - T_i) e^{-t/RC}$  (3)

where $R$ is the thermal resistance, $C$ is the thermal capacitance, $T_A$ is the ambient temperature and $T_i$ is the initial temperature. By ignoring the nonlinear factors in


Equation (3), we get the steady-state temperature as follows:

$T_S = PR + T_A$  (3)

It is easy to see from Equations (3) and (3) that, given the same value of thermal resistance and power profile, $T_S$ is either the highest or the lowest temperature that $T(t)$ can reach in the steady state. In particular, this should work well for a large class of task parallel applications, where the average execution time of tasks is large enough that a state close to steady state can be reached. Otherwise, the steady-state temperature should generally serve as an upper bound on the actual maximum temperature that can be realized.

3.2.2 Matrix Model

The accuracy and efficiency of the thermal model are very important in temperature-aware algorithms. For our modeling, we assume that the execution time of each time slice is long enough for the multicore processor to reach the steady-state temperature. In this section, we briefly describe a thermal model for this purpose.

Definition 1. We assume a processor with $u$ cores has $v$ heat sinks above these cores (Figure 3-1). We also assume that each core and heat sink in the multicore processor is a discrete thermal element. We identify the cores as thermal elements $0$ through $u-1$, and the heat sinks as thermal elements $u$ through $u+v-1$. The set of $u+v$ thermal elements is denoted by $M$. We use $T_m$ to denote the steady-state temperature of a thermal element $m$. We assume a simple thermal model in which $G(m,n)$ is the thermal conductance between thermal elements $m$ and $n$. We also assume $G(m,n) \ge 0$; $G(m,n) = G(n,m)$ if $m$, $n$ are neighbors; $G(m,n) = 0$ if $m$, $n$ are not neighbors. $P(m)$ indicates the power consumption of thermal element $m$. $A(m)$ is the vertical conductance between thermal element $m$ and the outside environment. $T_A$ is the ambient temperature, that is, the temperature of the outside environment. Figure 3-1 shows the floorplan and the above parameters of a 4-core processor.


Figure 3-1. A 4-core processor with 4 heat sinks

Since the heat sinks do not generate any power (we assume no heat sink has an active cooling system), we have the following equation:

$P(m) = \begin{cases} P(m), & \forall m \in [0, u-1] \\ 0, & \forall m \in [u, u+v-1] \end{cases}$  (3)

We also assume that there is a small amount of heat transfer between each core and the outside environment. Thus, we have the following equation:

$A(m) = \begin{cases} \epsilon_m, & \forall m \in [0, u-1] \\ A(m), & \forall m \in [u, u+v-1] \end{cases}$  (3)

where $\epsilon_m$ is a very small constant, $\epsilon_m > 0$.

Then the steady-state temperature of thermal element $m$, $T_m$, is given by:

$\sum_{n \in M} (T_m - T_n) G(m,n) + (T_m - T_A) A(m) = P(m), \quad \forall m \in M$  (3)

In Equation (3), the left-hand side represents the thermal power that element $m$ transfers to its neighbor elements $n$ and to the outside environment. For a core, its neighbor elements are the cores


surrounding it and the heat sink above it. For a heat sink, the reverse is true. The right-hand side represents the thermal power that element $m$ generates (note that a heat sink does not generate power). This expression makes the simplifying assumption that, in steady state, the thermal power that element $m$ transfers to others (neighbor elements) equals the power it generates.

Using the above assumption, we can rewrite Equation (3) as:

$RT = P + T_A A$  (3)

where $R$ is a $(u+v) \times (u+v)$ matrix consisting of constant coefficients given by

$R_{ij} = \begin{cases} -G(i,j), & \text{if } i \ne j \\ \sum_{k=0, k \ne i}^{u+v-1} G(i,k) + A(i), & \text{if } i = j \end{cases}$  (3)

$T$ is a $(u+v) \times 1$ vector of $T_m$, that is, $T = [T_0, T_1, \ldots, T_{u-1}, \ldots, T_{u+v-1}]^T$. $A$ is a $(u+v) \times 1$ vector of $A(m)$, that is, $A = [\epsilon_0, \epsilon_1, \ldots, \epsilon_{u-1}, A(u), \ldots, A(u+v-1)]^T$.

Lemma 1. Assuming that $A(m) > 0, \forall m \in M$, the matrix $R$ in Equation (3) is nonsingular.

Proof: The $ij$th entry of matrix $R$ is:

$R_{ij} = \begin{cases} -G(i,j), & \text{if } i \ne j \\ \sum_{k=0, k \ne i}^{u+v-1} G(i,k) + A(i), & \text{if } i = j \end{cases}$

Assuming $A(m) > 0, \forall m \in M$, it is easy to show

$|R_{ii}| > \sum_{j \ne i} |R_{ij}|, \quad \forall i$  (3)

Equation (3) shows that matrix $R$ is a strictly diagonally dominant matrix. According to the Levy-Desplanques theorem [54], matrix $R$ is nonsingular.
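The construction of $R$ and the diagonal-dominance argument of Lemma 1 can be sketched in a few lines of Python. This is a minimal sketch, not the dissertation's implementation: the function names, the 2-core/2-sink topology and all conductance values are hypothetical and chosen only to make the example small.

```python
def build_r(conductance, ambient):
    """Assemble R from pairwise conductances G(m, n) and ambient conductances A(m)."""
    size = len(ambient)
    r = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i != j:
                r[i][j] = -conductance[i][j]
        # Diagonal entry: sum of conductances to all neighbors plus A(i).
        r[i][i] = sum(conductance[i][k] for k in range(size) if k != i) + ambient[i]
    return r


def is_strictly_diagonally_dominant(matrix):
    """Check |R_ii| > sum over j != i of |R_ij| for every row i."""
    return all(
        abs(row[i]) > sum(abs(x) for j, x in enumerate(row) if j != i)
        for i, row in enumerate(matrix)
    )


# Hypothetical system: elements 0 and 1 are cores, elements 2 and 3 are heat sinks.
G = [
    [0.0, 0.5, 2.0, 0.0],  # core 0: lateral link to core 1, vertical link to sink 2
    [0.5, 0.0, 0.0, 2.0],  # core 1: lateral link to core 0, vertical link to sink 3
    [2.0, 0.0, 0.0, 1.0],  # sink 2
    [0.0, 2.0, 1.0, 0.0],  # sink 3
]
A = [1e-3, 1e-3, 0.8, 0.8]  # small epsilon for the cores, larger values for the sinks

R = build_r(G, A)
print(is_strictly_diagonally_dominant(R))
print(is_strictly_diagonally_dominant(build_r(G, [0.0, 0.0, 0.8, 0.8])))
```

The second call sets the core entries of $A$ to zero. In that case the core rows are only weakly dominant, so the strict-dominance argument used in the proof no longer applies, which is why the lemma assumes $A(m) > 0$ for every element.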


Lemma 2. The steady-state temperature of each core $m \in [0, u-1]$, $T_m$, is a linear combination of the power consumption of all the cores plus $T_A$, that is,

$T_{core} = T_A I + C P$  (3)

where $T_{core}$ is a $u \times 1$ vector of the steady-state temperatures $T_m$ of the cores, $P$ is a $u \times 1$ vector of the power consumption $P(m)$ of the cores, $C$ is a $u \times u$ matrix consisting of constant coefficients, $I = [1, 1, \ldots, 1]^T$, and $T_A$ is the ambient temperature.

Proof: From Lemma 1, we know $R$ is nonsingular. Multiplying both sides of Equation (3) by $R^{-1}$, we get:

$T = R^{-1} P + T_A R^{-1} A$  (3)

From Equations (3) and (3), we know that:

$R I = A$  (3)

where $I = [1, 1, \ldots, 1]^T$. This holds because the off-diagonal entries in row $i$ of $R$ cancel the conductance sum on the diagonal, so each row sums to $A(i)$. Substituting Equation (3) into (3), we have:

$T = R^{-1} P + T_A I$  (3)

Note that $T$ includes the temperatures of the $u$ cores and the $v$ heat sinks, that is, $T = [T_{core}, T_{sink}]^T$. Meanwhile, according to Equation (3), the power of the $v$ heat sinks is 0. Therefore, only the upper-left $u \times u$ block of $R^{-1}$ contributes to the temperature of the cores; this block is denoted by $C$ in (3):

$R^{-1} = \begin{bmatrix} C & R^{-1}_{uv} \\ R^{-1}_{vu} & R^{-1}_{vv} \end{bmatrix}$  (3)

Combining (3) and (3), we obtain Equation (3). Q.E.D.

Please note that the above analysis assumes $A(m) > 0, \forall m \in M$. It is not an unreasonable assumption that there is at least some non-zero conductance between a core and the environment or between a heat sink and the environment. The experimental results show that this model


can be used to derive a very good approximation of the temperature profiles generated by HotSpot [70] (Section 3.3.4).

A model similar to the Matrix Model (MM) has also been used for uniformly tiled multicore processors [17]. Our model assumes a more general relationship between thermal elements. The Matrix Model can be effectively used to determine the new profile after changes of the thermal power in one core. The amount of change in the temperature of the $m$th core caused by the $n$th core is given by $C_{m,n}$ times the change in thermal power of the $n$th core. Clearly, this can also be used for multiple simultaneous changes.

From expression (3), the steady-state temperature of core $m$ when some tasks are executing on these cores is computed by:

$T_m = T_A + \sum_{n \in M} C_{mn} P(n), \quad \forall m \in M$  (3)

where $P(n)$ is the average power consumption during task execution time on core $n \in M$ and $T_A$ is the ambient temperature. From this model, we can get every core $m$'s steady-state temperature at any time from the power trace of the whole chip. Matrix $C$ can be calculated based on the thermal parameters of the processor. An easier way to derive matrix $C$ is to use HotSpot (Section 3.3.4). Figure 3-2 shows the matrix $C$ of a 4-core processor, whose floorplan is shown in Figure 3-1 (see footnote 1).

Note that the steady-state temperature $T_m$ is determined only by the ambient temperature, the power of each core at the current time and the $C$ matrix. This is because we assume each time slice is long enough to reach the steady-state temperature.

Footnote 1: $C_{m,n}$ is the number of degrees by which core $m$'s temperature rises when core $n$'s power grows by 10 Watts. For example, $C_{0,2}$ means core 0 will rise by 1.56 °C when core 2's power grows by 10 Watts.
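Lemma 2 and the incremental-update property can be checked numerically on a small system. The sketch below is illustrative only: the 2-core/2-sink topology, the conductance values, the power values and all function names are hypothetical, and the linear solver is a plain Gaussian elimination used so that the example needs no external library. Here $C$ is expressed per Watt, whereas the footnote above quotes the entries of Figure 3-2 per 10 Watts.

```python
def build_r(g, a):
    """R[i][j] = -G(i, j) off the diagonal; sum of row i of G plus A(i) on it."""
    n = len(a)
    return [[sum(g[i]) - g[i][i] + a[i] if i == j else -g[i][j]
             for j in range(n)] for i in range(n)]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def core_coefficients(r, u):
    """Upper-left u x u block of R^-1, returned as C[m][n]."""
    size = len(r)
    columns = []
    for n in range(u):
        unit = [1.0 if k == n else 0.0 for k in range(size)]
        columns.append(solve(r, unit)[:u])  # column n of R^-1, core rows only
    return [[columns[n][m] for n in range(u)] for m in range(u)]


def core_temperatures(c, power, t_ambient):
    """T_m = T_A + sum over n of C[m][n] * P(n)."""
    return [t_ambient + sum(c[m][n] * power[n] for n in range(len(power)))
            for m in range(len(c))]


U = 2                       # cores are elements 0 and 1, sinks are elements 2 and 3
T_AMBIENT = 45.0            # degrees C, hypothetical
G = [[0.0, 0.5, 2.0, 0.0], [0.5, 0.0, 0.0, 2.0],
     [2.0, 0.0, 0.0, 1.0], [0.0, 2.0, 1.0, 0.0]]
A = [1e-3, 1e-3, 0.8, 0.8]
R = build_r(G, A)
C = core_coefficients(R, U)

power = [20.0, 10.0]        # W per core, hypothetical
via_c = core_temperatures(C, power, T_AMBIENT)

# Direct solution of R T = P + T_A * A for comparison (sinks generate no power).
rhs = [p + T_AMBIENT * a for p, a in zip(power + [0.0, 0.0], A)]
direct = solve(R, rhs)[:U]

# Incremental update: raise core 1 by 5 W without recomputing everything.
delta = 5.0
updated = [via_c[m] + C[m][1] * delta for m in range(U)]
print(via_c, direct, updated)
```

In this sketch the temperatures obtained from $C$ match the direct solution of the full system up to floating-point error, and the incremental update matches a full recomputation with the new power vector, which is the property that makes the single matrix convenient inside a scheduling loop.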


Figure 3-2. C matrix of a 4-core processor

3.2.3 HotSpot Model

HotSpot is an accurate thermal model suitable for use in architectural studies. It is based on an equivalent circuit of thermal resistances and capacitances that correspond to microarchitecture blocks and essential aspects of the thermal package [70]. HotSpot is computationally intensive but has been shown to be accurate enough to simulate real scenarios of tasks and processors.

To determine the accuracy of the Matrix Model, we have applied this model to the slack allocation algorithm in Section 3.3.2. The simulation results show that the quality of allocation using MM is very close to the quality of allocation using HotSpot. However, the computational requirements of the MM are substantially lower.

3.3 Application to Slack Allocation for Dependent Tasks

3.3.1 Problem Background

The first application of the thermal model is slack allocation for dependent tasks. Finding an optimal temperature-aware task scheduling solution is an NP-hard problem [12]. Several simple heuristic methods have been proposed, such as the coolest-first


and neighbor-aware algorithms [72]. More complex methods [78] include temperature in the priority of each task to reduce the peak temperature. Convex optimization has been used for frequency scaling [57] to minimize power consumption under threshold temperature constraints. Mixed-integer linear programming [10] has also been used for the temperature minimization problem for a set of dependent tasks. However, none of these methods use a DVS algorithm for tasks with precedence relationships.

3.3.2 Slack Allocation Algorithm

The input to the slack allocation problem is a set of tasks called $J$. The workflow of these tasks can be represented by a Directed Acyclic Graph (DAG). We assume that the overall time in the system is divided into unit time slices, and that all tasks occupy an integral number of unit time slices. Clearly, if the actual time requirements of each task are continuous, they can be suitably discretized without significant loss of accuracy by using a small unit time slice. The main reason for assuming a discrete time model will become clear when we discuss the details of our slack allocation algorithm.

Each task $t_i \in J$ is a 6-tuple $(pPred_i, pred_i, pSucc_i, succ_i, m_i, E_i)$, where $pPred_i$ is the task assigned prior to $t_i$ on the same core and $pred_i$ is the set of direct predecessors of $t_i$ in the DAG; $pSucc_i$ and $succ_i$ are the task assigned after $t_i$ on the same core and the set of direct successors, respectively. $m_i$ is the core on which this task executes. $E_i$ is the number of time slices needed to finish this task at the maximum voltage level on core $m_i$. The DAG has to be completed by a given deadline $D$.

We assume that an assignment of tasks to all the cores has already been done. Given a set of tasks and their DAG, the assignment of tasks to the appropriate cores is done by an assignment algorithm. The assignment algorithm determines the order in which tasks execute and the mapping of tasks to cores based on the execution time at the maximum voltage level, $E_i$, generating an assignment DAG. The assignment DAG provides the workflow of tasks after they are assigned to cores. There are a

number of assignment algorithms that minimize the makespan on parallel processors [37]. We assume that one of these algorithms has already been applied. Let the minimum overall execution time requirement (at maximum voltage) be $R$. Thus, the extension rate for $R$ is $(D - R)/R$.

The goal of slack allocation is to take an assignment DAG as input and allocate slack to tasks to minimize the peak temperature while meeting the assignment DAG and deadline constraints. The peak temperature is the maximum steady-state temperature on any core for any given time slot. The problem of temperature-aware slack allocation can be posed as follows: allocate a variable amount of slack to each task so that the peak temperature is minimized, the dependency relationships among tasks are maintained, and the deadline constraints are met.

Each task has a different amount of slack available, based on its dependencies on its predecessors and successors and on the deadline requirements. The overall slack is discretized in terms of an integral number of unit time slices. Therefore, the maximum amount of available slack of task $t_i$, $slack_i$, is defined as the difference between the latest start time of $t_i$, $LST_i$, and the earliest start time of $t_i$, $EST_i$. $LST_i$, $EST_i$ and $slack_i$ of task $t_i$ can be calculated as follows:

$$LST_i = \min\Bigl(deadline,\ LST_{pSucc_i},\ \min_{t_j \in succ_i} LST_j\Bigr) - compTime_i \tag{3-12}$$

$$EST_i = \max\Bigl(start,\ EST_{pPred_i} + compTime_{pPred_i},\ \max_{t_j \in pred_i} \bigl(EST_j + compTime_j\bigr)\Bigr) \tag{3-13}$$

$$slack_i = LST_i - EST_i \tag{3-14}$$

where $deadline$ is the overall deadline for the DAG, $start$ is the start time of the DAG, $succ_i$ is the set of direct successors of task $t_i$ in the DAG, $pSucc_i$ is the task assigned next to $t_i$ on the same core, $pred_i$ is the set of direct predecessors of $t_i$ in the DAG, and $pPred_i$

is the task assigned prior to $t_i$ on the same core. $compTime_i$ is the computation time for task $t_i$.

Figure 3-3. A simple DAG with 6 tasks. Each node represents a task. The arrow from task i to task j means that task j cannot start until task i is finished

Let us illustrate the above concepts with a simple example. Figure 3-3 gives a simple DAG with 6 tasks. Each node represents a task. The arrow from task i to task j means that task j cannot start until task i is finished. Assume the execution times of the tasks in Figure 3-3 are as shown in Table 3-1.

Table 3-1. The execution time of tasks in Figure 3-3

    Task Number    Execution Time
    0              5
    1              3
    2              6
    3              7
    4              3
    5              4

If we assign these 6 tasks to a 2-core processor, we get the assignment DAG in Figure 3-4. Assuming the start is 0 and the deadline is 30, the EST, LST and slack of each task are as given in Table 3-2.

3.3.3 Greedy Algorithm for Slack Allocation

We assume that the execution time $E_i$ of each task $t_i$ on core $m$ is long enough for all the cores to reach a steady-state temperature. The Matrix Model suggests that the

steady-state temperature is independent of the initial temperature and depends only on the ambient temperature and the input power of each core.

Figure 3-4. The assignment DAG to a 2-core processor of tasks in Figure 3-3

Table 3-2. The EST, LST and slack of each task based on the assignment DAG in Figure 3-4

    Task Number    EST    LST    slack
    0              0      5      5
    1              5      13     8
    2              5      10     5
    3              11     16     5
    4              18     23     5
    5              21     26     5

Our heuristic for slack allocation is based on the following observation. Given a fixed ambient temperature and a floorplan (including multiple cores, their heat sinks and their thermal conductances), the temperature of a core $m$ at time $t$, $T(t, m)$, depends on the power it generates and the power its neighbors generate at time $t$. The greedy algorithm (Algorithm 1) lowers the peak temperature by allocating variable slack to the tasks that have the greatest impact on reducing the peak temperature. A discrete model

of time is necessary because in each iteration only one unit of slack is assigned to a task. Details are provided in Algorithm 1.

Algorithm 1 Greedy Algorithm for Slack Allocation
     1: input: a DAG and a total deadline
     2: compute $slack_i$, $\forall t_i \in J$
     3: compute temperature profile
     4: while there exist tasks that have usable slack do
     5:     find the peak temperature $T(t, m)$. This is calculated over all cores and over all time slots. $\{t, m\}$ represent the time and the core corresponding to this peak temperature.
     6:     if there is a task $\tau$ executing at time $t$ on core $m$, such that $slack_\tau > 0$ then
     7:         assign one unit of slack to $\tau$
     8:     else
     9:         check all other tasks that are also running at time $t$ on neighboring cores. Find the candidateTask on these cores that maximizes the reduction of the peak temperature
    10:         assign one unit of slack to candidateTask
    11:     end if
    12:     update $slack_i$, $\forall t_i \in J$
    13:     recompute temperature
    14: end while

We first compute the available slack for each task (Step 2). Then we compute the temperature profile in terms of time $t$ and core $m \in M$ based on a thermal model (Matrix Model or HotSpot) (Step 3). Using this, we find the peak temperature (hot spot) on any core at any time. This peak temperature is denoted by $T(t, m)$ if it occurs on core $m$ at time $t$.

We check whether a unit of slack can be assigned to the task $\tau$ that is executing on core $m$ at time $t$. If so, we assign one unit of slack to $\tau$, update the available slack of all dependent tasks where needed, recompute the temperature profile and execute the next iteration.

If there is no slack available on the task executing at the hot spot (this could also be because no task is executing at the hot spot), we check all other tasks that are also running at time $t$, but on a neighboring core $n$. We then find a candidate task (running on a neighboring core) that maximizes the reduction of the peak temperature.
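
Step 2 of Algorithm 1 relies on the EST, LST and slack definitions of Section 3.3.2. The following Python sketch shows one way to evaluate them with a forward and a backward pass over a topological order. The function name `compute_slack` and the edge list `EDGES` are assumptions made for illustration: the edges of Figures 3-3 and 3-4 are not available here, so `EDGES` is a hypothetical assignment DAG, and same-core ordering (pPred/pSucc) is simply folded into the edge list.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def compute_slack(exec_time, edges, start, deadline):
    """Return (EST, LST, slack) dictionaries keyed by task.

    exec_time: {task: time slices at maximum voltage}
    edges:     (u, v) pairs meaning v cannot start before u finishes; the
               list mixes DAG edges (pred/succ) and same-core ordering
               (pPred/pSucc), since both enter the max/min the same way.
    """
    preds = {t: set() for t in exec_time}
    succs = {t: set() for t in exec_time}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    # Predecessors always appear before their successors in this order.
    order = list(TopologicalSorter(preds).static_order())

    est, lst = {}, {}
    for t in order:  # forward pass: earliest start times
        est[t] = max([start] + [est[p] + exec_time[p] for p in preds[t]])
    for t in reversed(order):  # backward pass: latest start times
        lst[t] = min([deadline] + [lst[s] for s in succs[t]]) - exec_time[t]

    slack = {t: lst[t] - est[t] for t in exec_time}
    return est, lst, slack


# Execution times from Table 3-1; EDGES is a hypothetical assignment DAG.
EXEC_TIME = {0: 5, 1: 3, 2: 6, 3: 7, 4: 3, 5: 4}
EDGES = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

est, lst, slack = compute_slack(EXEC_TIME, EDGES, start=0, deadline=30)
```

With start 0 and deadline 30, this assumed edge set yields the values listed in Table 3-2 (for example, task 1 gets EST 5, LST 13 and slack 8), although the actual assignment DAG of the source may differ.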

This is done by determining the impact of assigning one unit of slack to each neighbor and using the thermal model to find the one that has the maximum impact (Step 9). A unit of slack is assigned to the candidate task. This is followed by updating the slack for all tasks and recomputing the temperature profile.

The above step is repeated until there is no task on any core to which slack can be allocated that leads to a reduction in peak temperature.

3.3.4 Evaluation

Our simulation-based experimental evaluation is divided into two sets of experiments. The first set was conducted to demonstrate the accuracy of the Matrix Model in deriving the steady-state temperature of multicore processors. The second set was conducted to demonstrate the usefulness of the greedy slack allocation algorithm in reducing the peak temperature.

The following is a brief description of the key parameters used for the experimental setup:

Multicore parameters: The experiments were conducted on 4-core and 16-core processors. Each core is abstracted as an 8 mm × 8 mm square chip. The 4-core and 16-core processors have floorplans of 2 × 2 and 4 × 4 cores, respectively, arranged in a mesh. Each core dissipates 100 W at maximum voltage on the 4-core processor and 50 W on the 16-core processor. The silicon thermal conductivity is set to 100.0 W/(m·K) and the chip thickness to 0.15 mm. The ambient temperature used is 45.15 °C [70].

Matrix C generation: Matrix C can be estimated using HotSpot [70] by plugging in many different power configurations and comparing the corresponding temperature values. The details of computing matrix C are given in Algorithm 2. The main idea is as follows. In a given trial $j$, choose a random power configuration. Use this configuration to derive $u$ perturbed configurations by increasing or decreasing the power by a small amount $a$ for each of the $u$ cores (one core for each configuration). The difference between the original configuration and the perturbed configuration for core $i$ can be used to derive the $i$-th row of matrix $C_j$. The above process is repeated for a number of trials and the average of all the $C_j$ is used to compute C. For our experiments, we set $a$ to 10 W and the number of trials to 20.
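
The perturbation procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: HotSpot itself is not called; instead a hypothetical linear stand-in `linear_model` with a made-up symmetric matrix `C_TRUE` plays the role of the thermal simulator, and the names `estimate_c`, `p_max` and `seed` are not from the source. The sketch stores each difference as the $i$-th row, as the text does; for a symmetric C the row and the column coincide.

```python
import random


def estimate_c(thermal_model, num_cores, a=10.0, trials=20, p_max=100.0, seed=0):
    """Estimate matrix C by perturbing one core's power at a time.

    thermal_model(p) must return the steady-state temperature of every core
    for the power vector p (a stand-in for a HotSpot run).
    """
    rng = random.Random(seed)
    c = [[0.0] * num_cores for _ in range(num_cores)]
    for _ in range(trials):
        # Random base power configuration for this trial.
        p = [rng.uniform(0.0, p_max) for _ in range(num_cores)]
        t_base = thermal_model(p)
        for i in range(num_cores):
            p_i = list(p)
            p_i[i] += a  # perturb core i only
            t_i = thermal_model(p_i)
            for k in range(num_cores):
                # Accumulate the average of the i-th row over all trials.
                c[i][k] += (t_i[k] - t_base[k]) / a / trials
    return c


# Hypothetical stand-in for the simulator: T = T_A + C_TRUE * P.
T_AMBIENT = 45.15
C_TRUE = [[0.30, 0.10, 0.10, 0.05],
          [0.10, 0.30, 0.05, 0.10],
          [0.10, 0.05, 0.30, 0.10],
          [0.05, 0.10, 0.10, 0.30]]


def linear_model(p):
    return [T_AMBIENT + sum(c_ik * p_k for c_ik, p_k in zip(row, p))
            for row in C_TRUE]


c_est = estimate_c(linear_model, num_cores=4)
```

Because the stand-in model is exactly linear, the estimate recovers `C_TRUE` up to rounding; with a real simulator the averaging over trials would smooth out any nonlinearity.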

Algorithm 2 Generate matrix C using HotSpot
    // multicore processor with $u$ cores. $a$ is a positive constant
    for $j = 1$ to number of trials do
        randomly choose a power configuration $P = [P(0), P(1), \ldots, P(u-1)]^T$
        $T = \mathrm{HOTSPOT}(P)$, where $T = [T_0, T_1, \ldots, T_{u-1}]^T$
            // HOTSPOT(P) represents the steady-state temperature profile T based on the power configuration P using HotSpot
        for $i = 0$ to $u - 1$ do
            $P_i = [P(0), P(1), \ldots, P(i) + a, \ldots, P(u-1)]^T$
            $T_i = \mathrm{HOTSPOT}(P_i)$
            $C_j[i, ] = (T_i - T)/a$    // $C_j[i, ]$ is the $i$-th row of matrix $C_j$
        end for
    end for
    $C = \mathrm{AVG}(C_j)$

DAG generation: We generated a large number of random DAGs with 32, 64, 128 and 256 tasks (for 4-core processors) and 32, 64, 128, 256 and 512 tasks (for 16-core processors). The execution time of each task at the maximum voltage level is varied from 20 to 60 time slices. We generated the DAG in topological order, that is, Edge(i, j) was generated only when i

Figure 3-5. Temperature versus power consumption for P0 on a 4-core processor

Figure 3-6 shows the temperature change of the 4 cores as the power consumption of both P0 and P1 increases linearly. We can see that the temperature of all cores also grows linearly. This time, however, the temperatures of both P0 and P1 are higher than those of the other two cores because these two cores generate more power, which keeps them hotter.

Figure 3-6. Temperature versus power consumption for P0 and P1 on a 4-core processor

Peak Temperature Comparison. We generated a series of slack allocations using MM. For each slack allocation, we computed the peak temperature using MM and HotSpot. The peak temperature comparison for each slack allocation is shown in Figure 3-7. The processor has 4 identical cores; each core dissipates 100 W at the

maximum voltage level. From Figure 3-7, we can see that the peak temperature derived by the Matrix Model for each allocation is very close to the one derived using HotSpot.

Figure 3-7. Peak temperature comparison between Matrix Model (MM) and HotSpot on a 4-core processor

As described earlier, the greedy algorithm assigns a unit of slack in each iteration, and multiple iterations of this process are executed until the temperature can no longer be reduced. Whether the Matrix Model or HotSpot is used, the model evaluates the temperature after each iteration.

It is also worth noting that the predictive accuracy of the Matrix Model remains high even after multiple iterations of allocating unit slack with the greedy algorithm. This shows that the model is useful even after a large number of changes.

Figure 3-8 shows the peak temperature comparison for the greedy slack allocation algorithm using the Matrix Model and HotSpot for a variable amount of available slack. We conducted two subexperiments for each case:

1. Apply slack allocation using MM at each step. Report the final peak temperature of the final allocation using HotSpot.
2. Apply slack allocation using HotSpot at each step. Report the final peak temperature of the final allocation using HotSpot.

The amount of available slack is given by the deadline extension rate (percentage increase over the deadline). Also, the results are presented for a variable number of tasks.

Figure 3-8. Peak temperature comparison using the Matrix Model (MM) and the HotSpot Model on a 4-core multicore processor. A) extension rate 0.0. B) extension rate 0.1. C) extension rate 0.2. D) extension rate 0.4

These results show that the slack allocation algorithm produces allocations with nearly identical final temperatures whether it uses the Matrix Model or HotSpot. Note that the final temperature reported is based on the allocation generated with the Matrix Model and with HotSpot, respectively; for the final allocation in both cases, the peak temperature is calculated using HotSpot.

All of the above results demonstrate that the Matrix Model is accurate enough to estimate steady-state temperature.

Computation Time. Recall that our motivation for comparing these models is to find a precise and efficient thermal model for multicore processors. From Figures 3-7 and 3-8 we know that the Matrix Model is accurate. A comparison of the computation time using

these models is shown in Figure 3-9 in conjunction with the greedy algorithm. The number of tasks and the extension rates are varied. The experiments were conducted on a 4-core processor. The time requirement of HotSpot is at least an order of magnitude higher. Moreover, the gap between the two grows as the number of tasks increases. On the 4-core processor, HotSpot is roughly 10 times slower than the Matrix Model. On the 16-core processor, HotSpot is roughly 20 times slower than the Matrix Model. Hence, the Matrix Model is a very attractive scheme for deriving the steady-state temperature profile of multicore processors.

Figure 3-9. Computation time comparison between Matrix Model (MM) and HotSpot Model on a 4-core multicore processor. A) extension rate 0.0. B) extension rate 0.1. C) extension rate 0.2. D) extension rate 0.4

In the rest of this section, all the results for the greedy approach are based on the Matrix Model.

3.3.4.2 Greedy algorithm for slack allocation

In this section, we evaluate the performance of our greedy slack allocation algorithm under a variety of scenarios. We compare our algorithm to a simplistic slack allocation algorithm that assigns the slack equally to all the tasks while still guaranteeing deadline constraints.

The performance of the slack allocation algorithms is measured by peak temperature (in degrees Celsius). The deadline is determined by: deadline = (1 + extension rate) × finish time. The finish time is given by the assignment algorithm, which tries to minimize this time. We provide experimental results for deadline extension rates 0.0 (no extension), 0.1, 0.2 and 0.4, for 4 and 16 cores, and for 32, 64, 128, 256 and 512 tasks.

Figure 3-10 shows the improvement of the greedy algorithm over the uniform algorithm for a variable number of tasks and variable extension rates on a 4-core processor. The greedy algorithm outperforms the uniform algorithm, reducing peak temperature by as much as 13.38 °C. The average improvement is 5.75 °C. As expected, the amount of temperature reduction increases for both algorithms as the extension rate grows.

Figure 3-11 shows the improvement of the greedy algorithm over the uniform algorithm on a 16-core processor. The greedy algorithm outperforms the uniform algorithm, reducing peak temperature by as much as 6.71 °C. The average reduction is 2.28 °C.

Figures 3-12 and 3-13 show a comparison of the peak temperatures of the original schedule (no slack allocation), the greedy algorithm and the uniform algorithm under different extension rates. We can see that both greedy and uniform can dramatically reduce the peak temperature even at extension rate 0. This is because, for dependent tasks, there will be holes on some processors where no task can execute due to the finish time of its predecessors. Thus, a DVS-based slack allocation algorithm can utilize these holes and reduce peak temperature. With higher extension rates, both greedy and uniform algorithms can

reduce the peak temperature by larger amounts. Clearly, the greedy algorithm is more efficient at utilizing the available slack.

Figure 3-10. Peak temperature comparison between greedy algorithm and uniform algorithm on a 4-core processor for variable number of tasks. A) extension rate 0.0. B) extension rate 0.1. C) extension rate 0.2. D) extension rate 0.4

3.4 Application to Thermal Constrained Workload Distribution

3.4.1 Problem Background

Traditional power management policies [19, 25, 31, 42], which aim to reduce power consumption, fail to address the thermal-constraint problem caused by hot spots and spatial gradients. Temperature-aware design treats power consumption differently because power is the heating source of a chip. Due to the positive correlation between workload and power consumption, differential workload distribution affects the thermal profile of a chip. Carefully assigning workload to different cores on a chip

can effectively balance the temperature gradient and meet the thermal constraints. In this section, we assign differential workload by setting a differential maximum power threshold for each core to control the thermal behavior of the chip. We also assume a homogeneous multi-core processor for analysis; however, the algorithms can easily be extended to heterogeneous multi-core processors. A throughput-optimal task allocation algorithm under thermal constraints is proposed in [28]. However, it neglects the lateral thermal resistance between cores (effectively ignoring the floorplan). Also, the work in [28] is limited to Dynamic Frequency Scaling (DFS). Our work assumes a variety of Dynamic Voltage and Frequency Scaling (DVFS) schemes that are suitable for loosely synchronous and data parallel applications.

Figure 3-11. Peak temperature comparison between greedy algorithm and uniform algorithm on a 16-core processor for variable number of tasks: A) extension rate 0.0; B) extension rate 0.1; C) extension rate 0.2; D) extension rate 0.4

Figure 3-12. Peak temperature comparison of original (no slack allocation), greedy algorithm and uniform algorithm under different extension rates on a 4-core processor. A) 32 tasks. B) 64 tasks. C) 128 tasks. D) 256 tasks

Most parallel algorithms work in phases [44]. In each phase, the computation is divided across multiple cores or processors so that the load is balanced and communication is minimized. For a given phase, differential workload can be achieved in a number of ways:

• For task parallel applications where the tasks are independent, this can be achieved by assigning a variable number of tasks to different cores.
• For data parallel applications, this can be achieved by variable allocation of the underlying data elements to different cores. A number of methods for uniform or non-uniform partitioning of single and multidimensional arrays have been described in the literature [44].

Figure 3-13. Peak temperature comparison of original (no slack allocation), greedy algorithm and uniform algorithm under different extension rates on a 16-core processor. A) 64 tasks. B) 128 tasks. C) 256 tasks. D) 512 tasks

A simple approach, which assigns the same workload to all the cores, is suboptimal because some cores (e.g. outer cores) may be able to dissipate heat at a faster rate than other cores (e.g. inner cores). Therefore, it is useful to consider differential allocation of workload to the cores so that the overall power is maximized for a given temperature threshold. In this section, we assume that the workload can be partitioned non-uniformly across all the cores. The throughput of the workload assigned to a core is limited by the computing ability of the core under a certain power threshold. By setting a differential maximum power threshold for each core, we can control the throughput of each core and make sure that the thermal constraints are met.

The goal of this section is to determine the workload distribution for each core so that the total throughput across all cores is maximized and the maximum temperature on any core is bounded by a given threshold. Dynamic Voltage and Frequency Scaling (DVFS) can control the distribution of power by assigning different voltages and frequencies to each core. There are three types of power control schemes that are supported by different multi-core processors:

• Uniform voltage without throttling: Each core can work at full voltage or has to be shut off. In this case, the workload (or voltage level) of each core is fixed. The objective here is to choose a maximum-size subset of the cores that ensures the thermal constraint is met.
• Uniform voltage with throttling: The voltage of each core has to be the same but can be varied between a maximum and a minimum. The objective here is to determine the common voltage for all cores that maximizes the total throughput over all the cores while meeting the thermal constraints.
• Non-uniform voltage with throttling: The voltage of each core can be varied independently. The goal here is to choose the voltage for each core that maximizes the total throughput while meeting the thermal constraints.

As the power of a core or set of cores is increased for a given computational phase, the temperature of the chip goes up. The steady-state analysis determines the maximum temperature that will be reached if the length of the phase is reasonably large (our experiments suggest that this could be between 10 seconds and half a minute depending on the underlying architectural parameters). We develop linear and integer programming formulations for the above three cases based on the assumption that the maximum temperature on any core in the steady state is less than a user-provided threshold. Standard integer linear programming (ILP) solvers [30] can be used for each of the three cases, but are computationally expensive. We propose faster algorithms wherever applicable. If the duration of the phase is small, a steady-state analysis may be too pessimistic. We provide extensions to our approach for these cases. The proposed

work is limited to applications where the impact of the underlying communication on the maximum temperature can be ignored.

3.4.2 Steady State Analysis

In this section, we propose the steady-state analysis for thermal-constrained workload optimization of data-parallel applications. We first describe the system configuration and the three different DVFS schemes analyzed in this chapter. We then describe the algorithms proposed to optimize the workload distribution for multicore processors with these three DVFS schemes.

3.4.2.1 System overview

Consider a homogeneous multi-core processor with $p$ cores. Let $M$ be the set of $p$ cores and $P$ be the power of each core at the maximum voltage level. In this system, we assume that there is a universal controller that determines the workload distribution of each core under certain thermal constraints. We assume that the workload can be partitioned arbitrarily and non-uniformly across all the cores. Let $x_i$ be the fraction of the maximum power chosen by the controller for a given core $i$, where $0 \le x_i \le 1$. The value of $x_i$ is set by the controller to maximize the effective throughput on all or a subset of cores such that the thermal constraints are satisfied. Higher values of $x_i$ imply higher frequency, resulting in faster computation and higher throughput, but also higher power and higher steady-state temperature. The value of $x_i$, which reflects the voltage of each core, is constrained by the flexibility of the underlying DVFS or other voltage control mechanism. Figure 3-14 shows the underlying procedure.

Each processor with a DVFS scheme supports one or more of the following voltage-based schemes:

• Uniform voltage without throttling: Each core can work at full voltage or has to be shut off. In this case, the throughput of each core is fixed. The objective here is to choose a maximum-size subset of the cores that ensures that the thermal constraint is met.

• Uniform voltage with throttling: The voltage of each core has to be the same but can be varied between a maximum and a minimum. The objective here is to determine the common voltage for all cores that maximizes the total throughput over all the cores while meeting the thermal constraints.
• Non-uniform voltage with throttling: The voltage of each core can be varied independently. In this case, the throughput of each core can be variable. The goal here is to choose the voltage for each core that maximizes the sum of the throughput over all the cores while meeting the thermal constraints.

Figure 3-14. Procedure for assigning voltages

3.4.2.2 Uniform voltage without throttling

For some architectures, each core can work at full voltage or has to be shut off. In this case, the power of each core, if used, is fixed. The objective is to choose a maximum-size subset of the cores that ensures that the thermal constraint is met. This problem can be formulated as an integer programming problem as follows:

$$\max\ Z = \sum_{i \in M} x_i \tag{3-15}$$

$$\text{s.t.}\quad T_A + \sum_{j \in M} C_{ij}\, x_j\, P \le T_{th}, \quad \forall i \in M \tag{3-16}$$

$$x_i = 0 \text{ or } 1, \quad \forall i \in M \tag{3-17}$$

where $T_A$ is the ambient temperature and $C_{ij}$ is the $ij$-th element of matrix C.

Equation (3-16) constrains the voltage (and effectively the power and computational capabilities of the core) based on the thermal constraints, while Equation (3-17) limits it to

integer values. There may be many solutions that satisfy the above set of constraints. An additional step ensures that, among all the solutions that satisfy the threshold condition, the one that minimizes the temperature is chosen. This can be formulated as follows:

$$\min\ T_p \tag{3-18}$$

$$\text{s.t.}\quad T_p \ge T_A + \sum_{j \in M} C_{ij}\, x_j\, P, \quad \forall i \in M \tag{3-19}$$

$$\sum_{i \in M} x_i = Z \tag{3-20}$$

where $T_p$ is the maximum temperature across the processor.

Equation (3-19) ensures that the temperature of every core is at most $T_p$, while Equation (3-20) ensures that the number of active cores is equal to the value $Z$ derived in the step described earlier.

3.4.2.3 Uniform voltage with throttling

For some architectures, DVFS mechanisms require each core to have the same voltage (and effectively the same power and computational capabilities). The objective is then to determine this voltage so that the overall throughput is maximized and the temperature does not exceed the provided threshold $T_{th}$. The problem can be formulated as follows:

$$\max\ x \tag{3-21}$$

$$\text{s.t.}\quad T_A + \sum_{j \in M} C_{ij}\, x\, P \le T_{th}, \quad \forall i \in M \tag{3-22}$$

$$0 \le x \le 1 \tag{3-23}$$

Equation (3-22) ensures that the temperature is at most the threshold for all the cores. Equation (3-23) limits the value of $x$ to the range [0, 1]. It is worth noting that setting $x$ equal to zero satisfies all the constraints. Hence, the system always has a feasible solution. The linear programming formulation of the above problem is also defined in [36] and can be solved using standard packages. The time required to solve the above linear programming formulation is $O(p^3)$ [75]. In many real-time scenarios, the voltage for each core may have to be determined in real time. For such scenarios, we propose the following simplified approach.

The problem can be simplified to reduce the computation time by rewriting (3-22) as follows:

$$x \le \frac{D}{\sum_{j} C_{ij}}, \quad \forall i \in M \tag{3-24}$$

where $D$ is equal to $(T_{th} - T_A)/P$. The solution for $x$ is given by the following expression:

$$x = \min\Bigl\{ \min_{i \in M} \frac{D}{\sum_{j} C_{ij}},\ 1 \Bigr\} \tag{3-25}$$

The time requirement of the above algorithm is quadratic in the worst case. By precomputing the terms $\sum_j C_{ij}$, it can be reduced to a linear-time algorithm.

3.4.2.4 Non-uniform voltage with throttling

In this case, each core can have a different voltage level. Our goal is to maximize the overall throughput under the temperature threshold $T_{th}$. Since the throughput of each processor is proportional to the voltage, the problem can be formulated as:

$$\max\ \sum_{i \in M} x_i \tag{3-26}$$

$$\text{s.t.}\quad T_A + \sum_{j \in M} C_{ij}\, x_j\, P \le T_{th}, \quad \forall i \in M \tag{3-27}$$

$$0 \le x_i \le 1, \quad \forall i \in M \tag{3-28}$$

The power of each node is scaled down from the maximum power by $x_i$. The goal is to maximize the total throughput across all processors, which is equivalent to maximizing the sum of all $x_i$. Equation (3-27) constrains the steady-state temperature of each core to be at most the threshold temperature. Equation (3-28) limits the value of $x_i$ to the range [0, 1].

As mentioned above, the time required to solve this linear programming formulation is also $O(p^3)$. We propose a considerably faster heuristic approach for real-time scenarios. This approach considers three possible cases for the workload distribution:

• Case 1: The threshold temperature is high and all cores can execute at their maximum voltage without exceeding the threshold. In this case all $x_i$ are set to 1. Case 1 is easy to check (by substituting 1 for all $x_i$).
• Case 2: The threshold temperature is low and requires all $x_i$ to be less than or equal to 1. In this case the $x_i$ are all bounded by Equation (3-27).
• Case 3: The threshold temperature is such that the $x_i$ values of some of the cores are limited by the constraint given by Equation (3-28).

Case 2 results in dropping the constraint given by Equation (3-28), as follows. By setting $D_i$ equal to $(T_{th} - T_A)/P$ in Equation (3-27), we can rewrite the problem as follows:

$$\max\ \sum_{i \in M} x_i \tag{3-29}$$

$$\text{s.t.}\quad C X \le D \tag{3-30}$$

$$0 \le X \le 1 \tag{3-31}$$

where $X$ is the vector of $x_i$, that is, $X = [x_1, x_2, \ldots, x_p]^T$, C is a $p \times p$ matrix, and $D = \bigl[\tfrac{T_{th} - T_A}{P}, \tfrac{T_{th} - T_A}{P}, \ldots, \tfrac{T_{th} - T_A}{P}\bigr]^T$.

It can be shown that for most architectures, C is a positive symmetric matrix and has rank $p$ [77]. We can write (3-30) as

$$X \le C^{-1} D \tag{3-32}$$

We can see that $X$ is bounded by $C^{-1}D$, so we set:

$$x_i = (C^{-1} D)_i, \quad \forall i \in M \tag{3-33}$$

where $(C^{-1}D)_i$ is the $i$-th element of the vector $C^{-1}D$.

The heuristic works by first checking for Case 1, followed by Case 2. If both cases fail, an approximation is used as in Case 3 by assuming that all $x_i$ values are the same, and the algorithm for uniform voltage with throttling (described in the previous section) is applied. Since $C^{-1}$ is constant for a given floorplan [77], it can be computed offline. Therefore, the heuristic solution has a time complexity of only $O(p^2)$.

3.4.3 Evaluation

In this section, we evaluate the performance and quality of the different algorithms for a variety of architectural and workload scenarios. These experiments were conducted for multi-core processors with 4, 16, 32 and 64 cores. The specifications of these processors are shown in Table 3-3. The L2/L3 cache and I/O interconnects are omitted in our experiments because the power consumption of these elements is considerably smaller than that of the cores and they are unlikely to be the hot spot. The physical and thermal parameters of the chip, heat sink and heat spreader are set to the default values from HotSpot [70]. The ambient temperature used was 45.5 °C. The maximum allowable temperature was set to 70 °C. We use a metric called effective number of cores to measure the total throughput in terms of the maximum throughput that can be achieved (at maximum voltage) on any given core. This is given by:

$$\text{Effective Number of Cores} = \frac{\text{Total throughput}}{\text{Maximum throughput per core}} \tag{3-34}$$

The above can easily be shown to be the sum of all $x_i$, as the throughput of each core is proportional to the voltage (the higher the voltage, the higher the frequency).
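
The three-case heuristic for non-uniform voltage with throttling, together with the closed-form uniform solution it falls back on, can be sketched in Python as follows. This is a minimal illustration, not the author's implementation. The function names are hypothetical, `d` stands for the scalar $(T_{th} - T_A)/P$, and `c_inv` is assumed to be the precomputed inverse of C. The acceptance test for Case 2 (every entry of $C^{-1}D$ lies in [0, 1]) is an assumption, since the text does not state how Case 2 is detected.

```python
def uniform_throttle(c, d):
    """Common fraction x = min(min_i d / sum_j C_ij, 1) for all cores."""
    return min(min(d / sum(row) for row in c), 1.0)


def nonuniform_heuristic(c, c_inv, d):
    """Return the list of power fractions x_i chosen by the three-case heuristic."""
    p = len(c)
    # Case 1: every core at full power already satisfies C X <= D.
    if all(sum(row) <= d for row in c):
        return [1.0] * p
    # Case 2 (assumed test): X = C^-1 D is accepted if it stays within [0, 1].
    x = [sum(c_inv[i][j] * d for j in range(p)) for i in range(p)]
    if all(0.0 <= x_i <= 1.0 for x_i in x):
        return x
    # Case 3: approximate with the uniform-voltage solution.
    return [uniform_throttle(c, d)] * p


def effective_number_of_cores(x):
    """Total throughput relative to one core at maximum voltage (sum of x_i)."""
    return sum(x)
```

Each case costs at most one pass over a $p \times p$ matrix, which is where the quadratic bound quoted above comes from.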

We used the CPLEX package [30] to solve the linear and integer programming problems on Intel-based workstations. Once the $x_i$ (fractions of the maximum power) were derived using one of the algorithms, the temperatures were reported using HotSpot for the voltages derived from $x_i$ for each core. HotSpot uses the duality between RC circuits and thermal systems to model heat transfer in silicon. HotSpot was not used to determine the $x_i$ value of a core by any of the scheduling algorithms. For each algorithm, we conducted a large number of experiments by varying a number of the parameters described above. Representative results are presented wherever applicable.

Uniform voltage without throttling: The objective of this algorithm is to find the largest subset of cores such that the maximum temperature is below a threshold. We compare a number of different approaches to show the benefits of the algorithm presented in Section 3.4.2. In the following, the maximum number of active cores given by the integer linear programming formulation is denoted by P (not to be confused with the per-core power $P$). We compared the following four formulations:

• MIP (Mixed Integer Programming): The solution derived by our algorithm that minimizes the maximum temperature for the optimal value of P.
• Best P: Consider all subsets of size P. Find the subset that corresponds to the lowest maximum temperature.
• Best P+1: Consider all subsets of size P+1. Find the subset that corresponds to the lowest maximum temperature.
• Worst P: Consider all subsets of size P. Find the subset that corresponds to the highest maximum temperature.

Table 3-3. Specifications of processors

    Number of cores    Floorplan      Area per core    Maximum power per core
    4                  2 × 2 grid     8 mm × 8 mm      40.0 W
    16                 4 × 4 grid     4 mm × 4 mm      10.0 W
    32                 8 × 4 grid     3 mm × 3 mm      5.0 W
    64                 8 × 8 grid     2 mm × 2 mm      2.5 W
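
The Best P baseline and the two-phase selection (largest feasible subset first, then the coolest subset of that size) can be illustrated by exhaustive enumeration, which is practical only for small core counts. The sketch below is an assumption-laden illustration: the function names are hypothetical, and temperatures are evaluated with the Matrix Model relation $T_i = T_A + P \sum_j C_{ij} x_j$ rather than with HotSpot, which the source uses for reporting.

```python
from itertools import combinations


def peak_temperature(c, active, p_max, t_ambient):
    """Steady-state peak over all cores when the cores in `active` run at full power."""
    return max(t_ambient + p_max * sum(row[j] for j in active) for row in c)


def best_subset(c, size, p_max, t_ambient):
    """Enumerate all subsets of `size` cores; return (peak, subset) with the lowest peak."""
    return min((peak_temperature(c, s, p_max, t_ambient), s)
               for s in combinations(range(len(c)), size))


def max_feasible_size(c, p_max, t_ambient, t_threshold):
    """Largest subset size whose coolest subset meets the threshold (two-phase idea)."""
    for size in range(len(c), 0, -1):
        peak, subset = best_subset(c, size, p_max, t_ambient)
        if peak <= t_threshold:
            return size, subset, peak
    return 0, (), t_ambient
```

Calling `best_subset` with `size + 1` gives the Best P+1 baseline, and replacing `min` with `max` inside `best_subset` would give Worst P.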

Figure 3-15. Maximum temperature of MIP, Best P, Best P+1 and Worst P

Figure 3-15 shows the maximum temperature for the four algorithms (the thermal constraint was set to 70 °C). As expected, the performance of MIP and Best P is the same. However, MIP requires considerably less time in our experiments. Also, both MIP and Best P always provide solutions whose temperatures are below the threshold. The results also show that Best P+1 has a peak temperature higher than the threshold by as much as 4 °C. The results for Worst P show that an algorithm that randomly chooses a subset of P cores may produce a temperature that is higher than the threshold. These results justify the two-phase approach used in the MIP algorithm to determine the optimal subset.

Uniform voltage with throttling: Figure 3-16 shows that our algorithm gives the same results as the LP approach for a variety of temperature thresholds on 16 and 32 cores. It also shows that the fraction of total throughput grows almost linearly with the temperature until it saturates at the total number of cores.

Figure 3-17 shows the computation time comparison of LP and our algorithm for different numbers of cores. These results show that the computation time of the heuristic solution is at least an order of magnitude lower than that of LP. Further reductions in time can be achieved by precomputing the summation along each row of $C_{ij}$.

Figure 3-16. Performance comparison between our algorithm and LP approach. A) 16-core processor. B) 32-core processor

Non-uniform voltage with throttling: Figure 3-18 shows the performance comparison between the heuristic approach and the LP approach for a variety of temperature thresholds on 16 and 32 cores. These results show that for small and large thresholds, the heuristic approach is comparable to the LP approach. For both 16 and 32 cores, the maximum deterioration is less than 4%.

Figure 3-19 shows the computation time comparison of LP and the heuristic solution for different numbers of cores. These results show that the computation time of the heuristic solution is at least an order of magnitude lower than that of LP.

Comparison between the three schemes

Figure 3-20 shows the comparison of the throughput derived by the proposed algorithms for the three cases: Non-uniform voltage with throttling (NVT), Uniform voltage with throttling (UVT) and Uniform voltage without throttling (UVNT). The number of cores on the processor is 16. The x axis is the threshold temperature. The results are obtained by CPLEX based on the linear and integer programming formulations described in Section 3.4.2. These results show that NVT provides the maximum throughput, but the extra flexibility of variable voltage does not significantly add to the throughput as compared to UVT. This clearly depends on the underlying architectural design. If the heat sinks are poor or the floorplan results in non-uniform dissipation of heat, the differences between NVT and UVT would be larger. The results also show that the lack of variable DVFS and the lack of throttling can lead to substantial deterioration, as seen in the underperformance of UVNT.

Figure 3-17. Computation time comparison between LP and our algorithm (Uniform power with throttling)

3.4.4 Transient Analysis

We have assumed that the overall workflow is broken into a number of phases. Each phase results in a coordinated set of tasks that are executed; these tasks may be data driven and based on data decomposition techniques, or may be relatively independent. As discussed earlier, if the voltage of a core decreases (due to lower

computational requirements or a low frequency being acceptable), the temperature will go down and no threshold violations will occur. However, if the voltage goes up, it may lead to an increase in temperature, and careful scheduling is required to ensure that there are no threshold violations. The steady-state temperature serves as a conservative upper bound for analyzing the temperature. Another advantage of steady-state analysis is that it does not depend on the starting temperatures at the beginning of the phase.

However, if the time requirement of each phase is small and computational power increases lead to heating, the steady-state analysis can be overly pessimistic. In this section, we describe extensions of the steady-state analysis to address such scenarios.

Figure 3-18. Performance comparison between heuristic approach and LP approach. A) 16-core processor. B) 32-core processor

Figure 3-19. Computation time comparison between LP and heuristic solution (Non-uniform power with throttling)

Figure 3-20. Throughput comparison between NVT, UVT, UVNT

We assume that all the cores start at a given temperature at the beginning of the phase. This is realistic for the target applications, where a workflow consists of multiple phases. In each phase, each core executes a task on a fraction of the data or executes multiple independent tasks such that the overall time requirements of the cores are load balanced.

Figure 3-21 shows the temperature comparison between steady-state analysis and transient analysis (using HotSpot) for all 16 cores where non-uniform throttling is possible. The ambient and initial temperature is set at 45.15 °C. The threshold temperature is set at 60 °C. From Figure 3-21, we can see that it takes roughly

PAGE 69

Figure 3-21. Transient analysis of a 16-core processor. We assume that non-uniform throttling is possible. The threshold temperature is set at 60 °C

30 seconds for this processor to reach close to the steady-state temperature and that the transient behavior of all cores is very close. This is not surprising, as the differential voltages assigned to different cores are effectively proportional to their heat dissipation ability (based on their placement).

For the non-uniform voltage with throttling case, the behavior can be approximated by using Equation (2). Here the RC constant corresponds to the overall processor consisting of all the cores, and it can be determined by using the curve in Figure 3-21. We represent this constant by $RC'$ and use Equation (2) to model the transient temperature after $t$ units of execution. If we know that the time requirement of the phase is $t$, we raise the threshold temperature used in the algorithms described in the previous section to a higher value, so that a less pessimistic bound can be achieved after time $t$. This is achieved by setting the threshold temperature to $\widehat{T}_{th}$, given by

$$\widehat{T}_{th} = \max\left(T_{th},\; \frac{\beta\,(T_{th} - T_i)}{1 - e^{-t/RC'}} + T_i\right) \qquad (3)$$

PAGE 70

$\beta$ is a redundancy parameter. If $\beta$ is set to 1, this is exactly the same as Equation (2). Choosing a lower value of $\beta$ increases the probability that the original threshold is not violated.

Figure 3-22. Comparison of transient temperature between new threshold and original threshold temperature. The original threshold temperature is set at 60 °C

Suppose the ambient temperature and the initial temperature are 45.15 °C and the maximum steady-state temperature is 60 °C for a given computational power. Consider a phase that needs only 10 seconds to run. Using the data in Figure 3-21, we derive the value for $RC'$. The transient temperature for the 10-second phase on a 16-core processor is shown in Figure 3-22, using $\beta = 0.9$. It is worth noting that such an analysis results in significantly higher computational power, and the final temperature is very close to the threshold temperature. For the uniform case, the overall RC constant can potentially be approximated by the core with the maximum temperature. These results show that the threshold used in the previous section can potentially be suitably updated to provide better bounds if the time requirement of the phase is small. These results are based on limited experiments. Although this is a promising approach, further experiments are required and are part of our ongoing work.
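The threshold adjustment of Equation (3) can be sketched in a few lines of Python. This is only an illustration: it assumes that Equation (2) has the usual first-order form $T(t) = T_{ss} - (T_{ss} - T_i)e^{-t/RC'}$, the function names are hypothetical, and the value `RC_PRIME = 10.0` seconds is a made-up stand-in for the constant that the text derives from Figure 3-21.

```python
import math

def raised_threshold(t_th, t_init, phase_len, rc_prime, beta=1.0):
    """Steady-state threshold of Equation (3) for a phase of length phase_len."""
    rise = beta * (t_th - t_init) / (1.0 - math.exp(-phase_len / rc_prime))
    return max(t_th, rise + t_init)

def transient(t_steady, t_init, elapsed, rc_prime):
    """Assumed first-order form of Equation (2)."""
    return t_steady - (t_steady - t_init) * math.exp(-elapsed / rc_prime)

# Values from the worked example; RC_PRIME is a hypothetical placeholder.
T_TH, T_INIT, PHASE, BETA = 60.0, 45.15, 10.0, 0.9
RC_PRIME = 10.0

new_threshold = raised_threshold(T_TH, T_INIT, PHASE, RC_PRIME, BETA)
end_of_phase = transient(new_threshold, T_INIT, PHASE, RC_PRIME)
print(f"raised threshold: {new_threshold:.2f} C, temperature after phase: {end_of_phase:.2f} C")
```

Under these assumptions the temperature at the end of the phase equals $T_i + \beta(T_{th} - T_i)$, so any redundancy parameter below 1 leaves a margin under the original threshold.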

PAGE 71

3.5 Summary

In this chapter, we have developed a simple thermal model (called the Matrix Model (MM)) that can be used to derive temperature profiles for all the cores of a multicore processor. We theoretically demonstrate the correctness and efficiency of MM. Our simulation results show that the model is comparable to HotSpot for predicting the peak temperature. Besides the lower computational cost, the MM model is succinct (a single matrix) and can be used to derive algorithms for a variety of steady-state scenarios.

We use this model to develop a novel slack allocation algorithm for a Directed Acyclic Graph based workflow on a multicore processor. The objective of the slack allocation is to minimize the peak temperature for a given deadline. Our simulation results show that the greedy slack allocation algorithm is efficient and outperforms an algorithm that assigns slack uniformly.

We also use this model to propose several algorithms for setting the voltage of each core in a multicore processor, so that the total throughput across all cores is maximized and the maximum temperature on the chip is bounded by a given threshold. Our algorithms are suitable for a variety of DVFS schemes based on the level of non-uniformity and throttling. The experimental results show that many of our algorithms are very fast and suitable for runtime scheduling.

Most of our algorithms assume that the length of a computational phase is such that a steady state is reached. If the phase is shorter, this assumption may prove to be too pessimistic. We provide simple extensions to our approach that add a transient analysis and thereby give better solutions.

PAGE 72

CHAPTER 4
TEMPERATURE-AWARE TASK PARTITIONING ALGORITHMS

4.1 Background

The power density of on-chip systems is rising very rapidly and the rate will continue to grow [68]. This makes thermal management a significant design challenge for microprocessors. A rise in on-chip temperature directly impacts the performance and time-to-failure of switching devices. This is accentuated by the fact that the cooling and packaging cost rises super-linearly with the growth of power density. Also, the cost of cooling and packaging is one of the significant contributors to the cost of a computer system [26]. High temperature increases the leakage power of the chip, and can thereby potentially lead to thermal runaway. It has been shown that a 10-15 °C reduction in peak temperature can double the lifetime of the chip [34].

Existing power-aware techniques do not address the temperature issues in embedded systems. The main reason is that the power distribution of multiprocessors is not uniform. Localized heating rises much more rapidly than heating of the whole chip, leading to a non-uniform temperature distribution on the chip with localized high-temperature hot spots and spatial gradients [70]. The traditional way to control the on-chip temperature is to employ better packaging and cooling techniques (e.g., active fan cooling, water cooling and heat pipes). These active cooling systems may not always be suitable for embedded systems because of space and battery limitations. Building thermal analysis ability into the EDA flow allows the system to address the thermal impact on various on-chip parameters and incorporate effects of non-uniform thermal profiles during the IC design process. However, such technologies are unable to deal with a variety of runtime situations.

In this chapter, we focus on software approaches for thermal management. These approaches are flexible and do not have some of the limitations that are described above. The processor thermal behavior can be effectively modeled using the RC model

PAGE 73

[70]. If the average power of a processor is $P$ over a time period $t$, then the transient temperature $T(t)$ at the end of this period, using this model, is given by:

$$T(t) = PR + T_A - (PR + T_A - T_i)\,e^{-t/RC} \qquad (4)$$

where $R$ is the thermal resistance, $C$ is the thermal capacitance, $T_A$ is the ambient temperature and $T_i$ is the initial temperature.

Given a particular chip and its outside environment, the ambient temperature, thermal resistance and thermal capacitance are fixed. Based on the parameters of Equation (4), there are three major factors affecting the on-chip transient temperature: average power of the processor, initial temperature and execution time. Dynamic Voltage and Frequency Scaling (DVFS) can be used to reduce the power consumption by lowering the supply voltage and operating frequency, thereby reducing the on-chip temperature [3, 5, 20, 36, 57, 62]. However, DVFS faces a serious problem in time-constrained applications. The temperature-aware task sequencing algorithm [33] reduces the initial temperature to minimize peak temperature. However, temperature-aware task sequencing fails to reduce temperature in cases when one or more of the hot tasks¹ are long. The algorithm to defer execution of hot tasks [11] fails to reduce temperature in the same situation. This is because when the execution time of a hot task is too long, it can lead to a high steady-state temperature irrespective of the initial temperature.

We propose to partition the hot tasks into multiple subtasks and interleave these subtasks with cool tasks to reduce the overall maximum temperature (Figure 4-2). The focus of this chapter is to use task partitioning effectively to reduce the maximum temperature. To the best of our knowledge, our work is the first attempt to develop efficient task partitioning algorithms that demonstrate significant temperature reduction.

¹ We define hot tasks as tasks with higher average power consumption, and cool tasks as tasks with lower average power consumption.
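The RC model above is easy to evaluate directly. The following Python sketch computes the transient temperature and illustrates why a long hot task ends near its steady-state temperature regardless of where it starts. The function name and all numeric parameters (`R_TH`, `C_TH`, `T_AMB`, `HOT_POWER`) are hypothetical values chosen for illustration, not measurements from this work.

```python
import math

def transient_temperature(power, r_th, c_th, t_amb, t_init, duration):
    """Temperature at the end of `duration` seconds under the lumped RC model."""
    steady = power * r_th + t_amb  # steady-state temperature P*R + T_A
    return steady - (steady - t_init) * math.exp(-duration / (r_th * c_th))

# Hypothetical parameters, for illustration only.
R_TH, C_TH, T_AMB = 0.8, 0.25, 45.0   # K/W, J/K, degrees C
HOT_POWER = 25.0                       # W

# The same long hot task, started from a cool and from a warm processor.
from_cool = transient_temperature(HOT_POWER, R_TH, C_TH, T_AMB, 45.0, duration=2.0)
from_warm = transient_temperature(HOT_POWER, R_TH, C_TH, T_AMB, 60.0, duration=2.0)
print(f"end temperature from 45 C: {from_cool:.4f}, from 60 C: {from_warm:.4f}")
```

With a duration of ten RC time constants, the two end temperatures are practically identical, which is the situation in which lowering the initial temperature alone no longer helps.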

PAGE 74

In this chapter, we first show how task partitioning can assist in thermal-aware management problems on single core processors. We propose two heuristic task partitioning algorithms. One of them uses cool tasks to interleave hot tasks for a periodic set of tasks with a common period. The other is applicable to a periodic set of tasks with individual periods. Experimental results show that our single core task partitioning algorithm outperforms the task sequencing algorithm [33], reducing the peak temperature by as much as 6 °C. Second, we investigate the applicability of task partitioning to multicore processors. We propose a heuristic task partitioning and scheduling algorithm for a set of independent tasks on multicore processors. Experimental results show that our multicore task partitioning algorithm can achieve a peak temperature similar to that of PDTM [80] while reducing the makespan² of tasks by 30%. When the makespan is the same as that of PDTM, our algorithm lowers the peak temperature substantially, by 18 °C.

The rest of the chapter is organized as follows. Section 4.2 presents related work. Section 4.3 describes the background of thermal-aware analysis. Section 4.4 introduces the problems of periodic tasks with common and individual periods on single core processors, and proposes heuristic task partitioning algorithms to solve them. Section 4.5 extends task partitioning to multicore processors, and proposes a heuristic task partitioning and scheduling algorithm for a set of independent tasks on multicore processors. Section 4.6 compares these algorithms with the task sequencing algorithm, the EDF algorithm and PDTM, respectively. Section 4.7 concludes the chapter.

4.2 Related Work

Recently, considerable research has been done on Dynamic Thermal Management (DTM) [3, 5, 20, 25, 29, 33, 36, 52, 57, 62, 73, 78, 81]. These works can be classified into two broad categories:

² Makespan is the total time at which a set of tasks has finished processing.

PAGE 75

1. DVFS-based mechanisms that reduce temperature by reducing the thermal power consumption [3, 5, 20, 29, 36, 52, 57, 62, 81].
2. Task-sequencing and thread migration mechanisms that reduce temperature by lowering the initial temperature [25, 33, 73, 78].

Huang et al. [29] present a trigger technique that throttles the processor by either reducing the voltage and frequency or toggling other units (decode, I-cache) when the temperature is higher than some trigger threshold. Liu et al. [52] provide a design-time DVFS-based optimization framework to determine a proper thermal solution in embedded systems. Zhang and Chatha [81] provide a DVFS-based approximation algorithm for latency minimization of a set of periodic tasks executing on a processor subject to thermal constraints. Rao and Vrudhula [62] address the throughput maximization problem with steady-state and transient thermal conditions on a multicore system. By converting these problems into linear and non-linear programming formulations and solving them analytically, they provide an efficient online computation algorithm for these problems. A few proactive thermal management techniques have also been suggested to prevent overheating problems on chip. Ayoub and Rosing [3] introduce a new thermal predictor that is based on the band property of the temperature spectrum. They also provide a proactive thermal management policy using dynamic load balancing and thread migration to minimize the frequency in HotSpot [70] for a multicore platform. All of the above techniques utilize dynamic voltage and frequency scaling (DVFS) to reduce the power consumption by lowering the voltage and frequency of the processors. Unfortunately, these techniques have limited impact on the temperature if the available slack is small (i.e., the deadline constraints are strict).

Thread migration or coolest-first techniques have been used to lower the initial temperature of each task, thereby lowering the final temperature [25, 73, 78]. Jayaseelan and Mitra show that the sequencing of a set of heterogeneous tasks executed on a processor has a great influence on the peak temperature [33]. A heuristic algorithm

PAGE 76

that recursively pairs the tasks with opposite characteristics is given. The results show that their heuristic algorithm can reach a temperature within 0.5 °C of the optimal value.

Studies suggest [33] that peak temperature usually occurs on the tasks with a high power profile and long execution time. If even one of the hot tasks has a long execution time, the peak temperature reached by this task is relatively independent of the previous sequence of tasks, as the initial temperature of this task has limited impact on the peak temperature. In this chapter, we show that partitioning these tasks can reduce the peak temperature of a set of heterogeneous tasks. By partitioning the hot tasks and interleaving them with cool tasks, one can replace a curve that generates a high peak temperature with several smaller curves (Figure 4-2) that cumulatively lower the peak temperature.

4.3 Preliminaries

In this section, we briefly describe several related concepts. First, we discuss how to measure the thermal profile of a task. Next, we present how to analyze the thermal profile of a task sequence on single core processors. Finally, we present how to analyze the thermal profile of a task sequence on multicore processors.

4.3.1 Thermal Profile of Individual Tasks

The basic thermal equation has already been introduced in Equation (4). By letting $t \to \infty$ in Equation (4), we get the steady-state temperature:

$$T_S = PR + T_A \qquad (4)$$

Based on our experiments, it takes less than a second to reach the steady-state temperature for a 1.5 GHz processor whose product of thermal resistance and thermal capacitance is 0.2053 Joules/Watt.

4.3.2 Thermal Profile of Task Sequences on Single Core Processors

Consider a periodic set of heterogeneous tasks (i.e., tasks with different thermal profiles), with the execution times of these tasks given by $c_1, c_2, \ldots, c_N$. The periods of

PAGE 77

these tasks are given by $p_1, p_2, \ldots, p_N$. The average power consumption during execution time is given by $P_1, P_2, \ldots, P_N$. Suppose these tasks are ordered in a particular sequence $S = \langle \tau_1, \tau_2, \ldots, \tau_N \rangle$. The hyper-period of these tasks is defined by $\mathrm{LCM}_{i=1}^{N}\{p_i\}$, where LCM stands for the least common multiple. When these tasks are executed hyper-period after hyper-period for a large number of iterations (they are periodic tasks), the temperature rises from the initial temperature and reaches a state in which the hyper-period temperature profile repeats periodically [33]. We call this steady state the hyper-period steady state. Using the above arguments, we can analyze the thermal profile of one hyper-period sequence; the other hyper-period sequences will have exactly the same thermal profile.

Using the above simplifications, we have the final temperature of each task as follows:

$$T_1 = P_1 R + T_A - (P_1 R + T_A - T_N)\,e^{-c_1/RC}$$
$$\vdots$$
$$T_N = P_N R + T_A - (P_N R + T_A - T_{N-1})\,e^{-c_N/RC} \qquad (4)$$

As we can see, the temperature in Equation (4) is a monotonic function of time: when $P_i R + T_A > T_i$, the temperature of the processor increases during the task execution time, and vice versa. Therefore, either $T_i$ or $T(t)$ is the maximum temperature during the execution time of the task. Thus, we define the maximum final temperature of tasks in one hyper-period as the peak temperature of the sequence.

$$\text{peak temperature} = \max\{T_1, T_2, \ldots, T_N\} \qquad (4)$$

4.3.3 Thermal Profile of Task Sequences on Multicore Processors

It is a major challenge to develop task partitioning algorithms for multicore processors. The temperatures of the cores affect one another, and the heat generated by one core affects the temperature of other cores. We relax the problem

PAGE 78

definition a little to remove the explicit period constraint on tasks. Consider a set of independent heterogeneous tasks, with the same execution time and average power definitions as in Section 4.3.2. Assume we have a $p$-core processor and the assignment of tasks to cores has already been done. Time is discretized into time slots. The power consumption of core $j$ at time slot $k$ is equal to the power consumption of the task being executed, denoted by $P_{k,j}$. The temperature of core $j$ at time slot $k$ is denoted by $T_{k,j}$. The set of neighboring cores of core $j$ is denoted by $N_j$. The temperature of core $j$ at time slot $k+1$ is then:

$$T_{k+1,j} = T_{k,j} + \sum_{\forall n \in N_j} A_{n,j}\,(T_{k,n} - T_{k,j}) + B_j P_{k,j} + C_j\,(T_A - T_{k,j}) \qquad (4)$$

where $T_A$ is the ambient temperature. Matrix $A$ is a $p \times p$ matrix showing the thermal impact between cores. Vectors $B$ and $C$ are parameters showing the thermal impact of the core's own power consumption and of the ambient environment, respectively. Matrix $A$ and vectors $B$ and $C$ are derived from the thermal characteristics of the chip [57]. The peak temperature of task sequences is defined as the maximum temperature among all $T_{k,j}$.

4.4 Task Partitioning Algorithms on Single Core Processors

There are two major challenges in developing task partitioning algorithms to reduce peak temperature:

1. Number of Partitions: A task can be partitioned into a large number of very small pieces. However, this may result in significant overhead for preemption and restart. It is important to choose a number of partitions that carefully trades off this overhead against the resultant temperature reduction.
2. Sequencing of Subtasks: A reordering of hot and cool tasks has to ensure that the subtasks of a given task maintain their sequential order. For example, if a task A is decomposed into subtasks A1, A2 and A3, then A1 should always be executed before A2, and A2 should always be executed before A3.
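The multicore temperature recurrence of Section 4.3.3 can be sketched as a small time-stepping loop. The code below is an illustrative Python version only; the function names, the two-core layout and every coefficient in `A`, `B` and `C` are hypothetical values, not parameters derived from a real chip.

```python
def step_temperatures(temps, powers, A, B, C, neighbors, t_amb):
    """Advance all core temperatures by one time slot using the recurrence of Section 4.3.3."""
    updated = []
    for j, t_j in enumerate(temps):
        # Heat exchanged with the neighboring cores N_j.
        coupling = sum(A[n][j] * (temps[n] - t_j) for n in neighbors[j])
        updated.append(t_j + coupling + B[j] * powers[j] + C[j] * (t_amb - t_j))
    return updated

def peak_temperature(schedule, A, B, C, neighbors, t_amb, t_init):
    """Maximum T[k][j] over all time slots k and cores j for a per-slot power schedule."""
    temps = [t_init] * len(B)
    peak = max(temps)
    for powers in schedule:
        temps = step_temperatures(temps, powers, A, B, C, neighbors, t_amb)
        peak = max(peak, max(temps))
    return peak

# Hypothetical two-core chip: each core is the other's only neighbor.
A = [[0.0, 0.1], [0.1, 0.0]]
B = [0.5, 0.5]
C = [0.2, 0.2]
NEIGHBORS = [[1], [0]]
T_AMB = 45.0

# Core 0 runs a 10 W task for five slots while core 1 stays idle.
schedule = [[10.0, 0.0]] * 5
print(peak_temperature(schedule, A, B, C, NEIGHBORS, T_AMB, T_AMB))
```

Even in this toy configuration the idle core warms above ambient through the coupling term, which is the effect that the inter-core scheduling step has to account for.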

PAGE 79

Besides the obvious novelty of proposing a partitioning approach for addressing the thermal issues, this chapter develops novel algorithms to address the following two broad scenarios on single core processors:

1. A periodic set of tasks with a common period. All the tasks have the same arrival time and deadline.
2. A set of periodic tasks with individual periods. Each task may have a different arrival time and deadline.

In this section, we first give an illustrative example showing that the peak temperature obtained using our task partitioning algorithm is lower than that of the task sequencing algorithm [33]. Assume that two tasks, $\tau_1$ and $\tau_2$, have average power consumption $P_1$ and $P_2$, respectively. Their execution times are $t_1$ and $t_2$, with $t_1, t_2 > 0$. Without loss of generality, we assume $P_1 < P_2$ and $P_1 R + T_A > T(0)$. From Equation (4), we get the temperatures when tasks $\tau_1$ and $\tau_2$ are finished, $T(t_1)$ and $T(t_1 + t_2)$ (Figure 4-2), as:

$$T(t_1) = P_1 R + T_A - (P_1 R + T_A - T(0))\,e^{-t_1/RC} \qquad (4)$$
$$T(t_1 + t_2) = P_2 R + T_A - (P_2 R + T_A - T(t_1))\,e^{-t_2/RC} \qquad (4)$$

Lemma 3. The peak temperature using task sequencing is $T(t_1 + t_2)$.

Proof: Given a sequence without task partitioning, we first prove that $T(t_1) > T(t)$, $t \in [0, t_1]$. When $P_1 R + T_A > T(0)$, the function $T(t)$ in Equation (4) is monotonically increasing on $[0, t_1]$; therefore, we have $T(t_1) > T(t)$, $t \in [0, t_1]$. For the same reason,

PAGE 80

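Figure 4-1. An example of task sequencing and task partitioning. There are two tasks in task sequencing: a cool task $\tau_1$ followed by a hot task $\tau_2$. The execution times of $\tau_1$ and $\tau_2$ are $t_1$ and $t_2$, respectively. These two tasks are partitioned into four subtasks in task partitioning. The (task, execution time) sequence is $(\tau_{11}, t_{11}), (\tau_{21}, t_{21}), (\tau_{12}, t_{12}), (\tau_{22}, t_{22})$, where $t_1 = t_{11} + t_{12}$, $t_2 = t_{21} + t_{22}$

since $P_2 R + T_A > P_1 R + T_A > T(t_1)$, we have $T(t_1 + t_2) > T(t)$, $t \in [t_1, t_1 + t_2]$. Therefore, $T(t_1 + t_2)$ is the peak temperature in task sequencing.

Now, we introduce task partitioning. We assume that both tasks $\tau_1$ and $\tau_2$ are partitioned into two subtasks. By interleaving the subtasks, the (task, execution time) sequence is $(\tau_{11}, t_{11}), (\tau_{21}, t_{21}), (\tau_{12}, t_{12}), (\tau_{22}, t_{22})$, where $t_1 = t_{11} + t_{12}$, $t_2 = t_{21} + t_{22}$ and $t_{11}, t_{12}, t_{21}, t_{22} > 0$ (Figure 4-1). The temperatures after each subtask is finished (Figure 4-2) are denoted by $T_p(t_{11})$, $T_p(t_{11} + t_{21})$, $T_p(t_1 + t_{12})$ and $T_p(t_1 + t_2)$.

Lemma 4. The peak temperature using task partitioning is $\max\{T_p(t_{11} + t_{21}),\, T_p(t_1 + t_2)\}$.

Proof: Similar to Lemma 3, we can prove that $T_p(t_{11} + t_{21}) > T_p(t)$, $t \in [0, t_{11} + t_{21}]$. If $P_1 R + T_A > T_p(t_{11} + t_{21})$, Equation (4) is monotonically increasing on $[t_{11} + t_{21},\, t_1 + t_{12}]$. Therefore, $T_p(t_1 + t_{12}) > T_p(t)$, $t \in [t_{11} + t_{21},\, t_1 + t_{12}]$. For the same reason, we get $T_p(t_1 + t_2) > T_p(t)$, $t \in [t_1 + t_{12},\, t_1 + t_2]$. Then the peak temperature is $T_p(t_1 + t_2)$.

If $P_1 R + T_A \le T_p(t_{11} + t_{21})$, Equation (4) is monotonically decreasing on $[t_{11} + t_{21},\, t_1 + t_{12}]$. Then we have $T_p(t_{11} + t_{21}) \ge T_p(t)$, $t \in [t_{11} + t_{21},\, t_1 + t_{12}]$. For the same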

PAGE 81

reason, we get $T_p(t_1 + t_2) > T_p(t)$, $t \in [t_1 + t_{12},\, t_1 + t_2]$. Thus the peak temperature is $T_p(t_{11} + t_{21})$. So the peak temperature is $\max\{T_p(t_{11} + t_{21}),\, T_p(t_1 + t_2)\}$.

$$T_p(t_{11}) = P_1 R + T_A - (P_1 R + T_A - T(0))\,e^{-t_{11}/RC} \qquad (4)$$
$$T_p(t_{11} + t_{21}) = P_2 R + T_A - (P_2 R + T_A - T_p(t_{11}))\,e^{-t_{21}/RC} \qquad (4)$$
$$T_p(t_1 + t_{12}) = P_1 R + T_A - (P_1 R + T_A - T_p(t_{11} + t_{21}))\,e^{-t_{12}/RC} \qquad (4)$$
$$T_p(t_1 + t_2) = P_2 R + T_A - (P_2 R + T_A - T_p(t_1 + t_{12}))\,e^{-t_{22}/RC} \qquad (4)$$

Lemma 5. The peak temperature of task partitioning is less than the peak temperature of task sequencing.

Proof: If the peak temperature of task partitioning is $T_p(t_{11} + t_{21})$, then from Equations (4), (4), (4), (4), substituting $T_p(t_{11})$ and $T(t_1)$ into the expressions for $T_p(t_{11} + t_{21})$ and $T(t_1 + t_2)$, we get:

$$T_p(t_{11} + t_{21}) - T(t_1 + t_2) = R(P_2 - P_1)\left(e^{-t_2/RC} - e^{-t_{21}/RC}\right) + (P_1 R + T_A - T(0))\left(e^{-(t_1 + t_2)/RC} - e^{-(t_{11} + t_{21})/RC}\right) \qquad (4)$$

From the assumption and definition we can derive that $P_2 > P_1$ and $t_2 > t_{21}$. Then we have $R(P_2 - P_1)(e^{-t_2/RC} - e^{-t_{21}/RC}) < 0$. Similarly, since $P_1 R + T_A > T(0)$ and $t_1 + t_2 > t_{11} + t_{21}$, we also have $(P_1 R + T_A - T(0))(e^{-(t_1 + t_2)/RC} - e^{-(t_{11} + t_{21})/RC}) < 0$. Therefore, $T_p(t_{11} + t_{21}) < T(t_1 + t_2)$.
PAGE 82

If the peak temperature of task partitioning is $T_p(t_1 + t_2)$, then from Equations (4), (4), (4), (4), (4), (4), we get:

$$T_p(t_1 + t_2) - T(t_1 + t_2) = R\,e^{-t_{22}/RC}\,(P_2 - P_1)\left(1 - e^{-t_{12}/RC}\right)\left(e^{-t_{21}/RC} - 1\right) \qquad (4)$$

Similarly, we can derive that $P_2 - P_1 > 0$, $1 - e^{-t_{12}/RC} > 0$ and $e^{-t_{21}/RC} - 1 < 0$, while the factor $R\,e^{-t_{22}/RC}$ is positive. Therefore, $T_p(t_1 + t_2) < T(t_1 + t_2)$.
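The two-task comparison behind Lemmas 3 to 5 can be checked numerically. The Python sketch below applies the RC model piece by piece to a sequenced schedule (cool task, then hot task) and to a partitioned schedule (each task split in half and interleaved). All parameter values (`R_TH`, `C_TH`, `T_AMB`, the powers and the durations) are hypothetical and chosen so that $P_1 R + T_A > T(0)$ holds, as the lemmas assume.

```python
import math

R_TH, C_TH, T_AMB = 1.0, 1.0, 45.0   # hypothetical thermal parameters

def end_temperature(power, start_temp, duration):
    """Temperature at the end of one (sub)task under the RC model."""
    steady = power * R_TH + T_AMB
    return steady - (steady - start_temp) * math.exp(-duration / (R_TH * C_TH))

def peak_of_schedule(schedule, start_temp):
    """schedule: list of (power, duration) pieces; returns the highest end-of-piece temperature."""
    temp = start_temp
    peak = start_temp
    for power, duration in schedule:
        temp = end_temperature(power, temp, duration)
        peak = max(peak, temp)
    return peak

P1, P2 = 5.0, 20.0        # cool and hot task powers (W)
T0 = 45.0                 # initial temperature, so P1*R + T_A > T(0)

sequenced = [(P1, 2.0), (P2, 2.0)]
partitioned = [(P1, 1.0), (P2, 1.0), (P1, 1.0), (P2, 1.0)]

peak_seq = peak_of_schedule(sequenced, T0)
peak_part = peak_of_schedule(partitioned, T0)
print(f"sequencing peak: {peak_seq:.3f} C, partitioning peak: {peak_part:.3f} C")
```

Because the temperature is monotonic within each piece, checking only the end-of-piece temperatures is enough to find the peak, which is the same observation used in the proofs.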
PAGE 83

Methods that only reorder the sequence of tasks (such as TSA [33]) fail to further reduce the peak temperature when some hot tasks run long enough to come close to their steady-state temperature, because this temperature is relatively independent of the initial temperature. We propose a task partitioning algorithm to reduce the temperature using preemption. The main idea of the task partitioning algorithm is to partition the hot tasks into several subtasks and interleave them with cool tasks that absorb the heat generated by the hot subtasks. To partition hot tasks into more subtasks and generate enough cool subtasks to interleave with them, we divide the tasks into categories based on their power profile. The tasks in higher categories are partitioned into more subtasks. The subtasks in lower categories act as cool tasks and interleave with subtasks in higher categories. Details of the algorithm are shown in Algorithm 3.

Algorithm 3 Task Partitioning Algorithm (TPA)
1: Sort the tasks based on the power profile from coolest to hottest
2: Group the sorted tasks into $k$ categories with equal number of tasks. These categories are numbered from 1 to $k$.
3: Partition tasks in category $j$, $2 \le j \le k$, into $2^{j-1}$ equal subtasks. Partition tasks in category 1 into 2 equal subtasks.
4: for $i = 1$ to $k - 2$ do
5:   Interleave tasks of the $i$th category with tasks of the $(i+1)$th category to form the new $(i+1)$th category
6: end for // Now only two categories are left. The first one is category $k$ and the second one is a new category $k - 1$ derived by combining categories 1 through $k - 1$.
7: Insert the subtasks in new category $k - 1$ into the intervals of tasks in category $k$.
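Algorithm 3 can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: tasks are hypothetical `(name, power, execution_time)` tuples, the number of tasks is assumed to be divisible by `k`, and "interleave" is taken to mean a one-for-one alternation that starts with the cooler list. Subtasks of the same parent stay in order because each category list keeps them in order and alternation preserves relative order within each list.

```python
def split(task, parts):
    """Split a (name, power, time) task into `parts` equal subtasks with the same power."""
    name, power, exec_time = task
    return [(name, power, exec_time / parts)] * parts

def interleave(cool, hot):
    """Alternate elements of two lists, starting with the cooler one."""
    merged = []
    for i in range(max(len(cool), len(hot))):
        if i < len(cool):
            merged.append(cool[i])
        if i < len(hot):
            merged.append(hot[i])
    return merged

def task_partitioning(tasks, k):
    """Sketch of Algorithm 3 (TPA): returns the interleaved subtask sequence."""
    if k < 1 or len(tasks) % k != 0:
        raise ValueError("this sketch assumes len(tasks) is a positive multiple of k")
    ordered = sorted(tasks, key=lambda task: task[1])      # step 1: coolest to hottest
    size = len(ordered) // k
    categories = []
    for j in range(1, k + 1):                              # steps 2-3
        parts = 2 if j == 1 else 2 ** (j - 1)
        group = ordered[(j - 1) * size : j * size]
        categories.append([sub for task in group for sub in split(task, parts)])
    merged = categories[0]
    for j in range(1, k):                                  # steps 4-7
        merged = interleave(merged, categories[j])
    return merged

# Three hypothetical tasks, one per category.
tasks = [("C", 20.0, 4.0), ("A", 5.0, 4.0), ("B", 10.0, 4.0)]
sequence = task_partitioning(tasks, k=3)
print([name for name, _, _ in sequence])
```

With one task per category and `k = 3`, the hottest task is cut into four pieces and every piece is preceded by a cooler subtask.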
First, we sort the tasks based on their power consumption from coolest to hottest. In the second step, we group the sorted tasks into $k$ categories, from category 1 to category $k$. Category 1 contains the coolest $n/k$ tasks, and category $k$ contains the hottest $n/k$ tasks. We divide the tasks into different categories because hotter tasks need to be partitioned into more subtasks to reduce the peak temperature. We also need enough cooler subtasks to separate the hot subtasks (Figure 4-3). By having different categories of tasks, we can achieve both

PAGE 84

targets simultaneously. In this chapter, we assume that when a task is partitioned into several small subtasks, the subtasks have the same average power consumption as the original task. We used Wattch [6] to compute the average power of the subtasks. As expected, these numbers are comparable with the average power of the original tasks in our benchmark set.

Figure 4-3. Task Partitioning Algorithm (Steps 1-3): Sort the tasks based on the power consumption and group the tasks into 3 categories. The task in category 3 is the hottest and the task in category 1 is the coolest. Partition tasks into subtasks. The task in category 3 is partitioned into $2^{3-1} = 4$ equal subtasks. The task in category 2 is partitioned into $2^{2-1} = 2$ equal subtasks. The task in category 1 is partitioned into 2 equal subtasks

In the next step, we partition tasks in category $j$, $2 \le j \le k$, into $2^{j-1}$ equal subtasks. The tasks in category 1 are partitioned into 2 equal subtasks. After recursively interleaving tasks in the $i$th category with tasks in the $(i+1)$th category, there are only two categories left. The first one corresponds to category $k$ and the second one is derived by combining categories 1 through $k - 1$. For the sake of convenience, we call this combined set new category $k - 1$. We now have $\frac{n}{k} 2^{k-1}$ subtasks in category $k$, and $\frac{n}{k} 2^{k-1}$ intervals between these subtasks. We also have $\frac{n}{k}(2 + 2 + 4 + \ldots + 2^{k-2}) = \frac{n}{k} 2^{k-1}$ subtasks in new category $k - 1$. Therefore, we have enough subtasks from new category $k - 1$ to interleave with the subtasks in category $k$ (Figure 4-4). In the last step, we insert the subtasks of new category $k - 1$ into the intervals of category $k$ (Figure 4-5).

4.4.2 Periodic Tasks with Individual Period

Consider a set $L$ of $N$ periodic heterogeneous tasks, where each task has its own period $p_i$. The arrival time $a_i$ is equal to the start time of its period and the deadline $d_i$ is equal to the end time of its period. Recall that the temperature profile of

PAGE 85

Figure 4-4. Task Partitioning Algorithm (Steps 4-6): Interleave subtasks of category 1 with subtasks of category 2 to form the new category 2. Now we have enough subtasks from category 2 to interleave with the subtasks of category 3

Figure 4-5. Task Partitioning Algorithm (Step 7): We insert subtasks of new category 2 into the intervals between subtasks of category 3 to get the final sequence

one hyper-period is identical to that of other hyper-periods. We only need to analyze the task instances within one hyper-period.

Theoretically, each periodic task corresponds to an infinite sequence of identical activities, called instances. The first instance of each periodic task arrives at time 0. Let $P_i$ be the average power consumption during the execution time $c_i$ of task $\tau_i$. The goal is to find a sequence of these tasks, using task partitioning, that minimizes the peak temperature.

The algorithm developed in the previous section does not apply to this scenario, as the arrival and deadline constraints have to be carefully addressed. This additional constraint limits the task orderings that can be used. In this section, we develop a novel algorithm that integrates the task partitioning technique into the EDF scheduler³. We first

³ Earliest-Deadline-First (EDF) [59] is a dynamic scheduling algorithm that schedules the tasks according to their absolute deadlines. Tasks with earlier deadlines are executed at higher priorities.

PAGE 86

use the EDF scheduler to schedule the tasks and get an initial sequence $S_i$. Based on the initial sequence, we can use Equation (4) to get the thermal profile of the task sequence. Using this thermal profile, we call the task instance where the peak temperature occurs the hot task instance, denoted by $\tau_h$.

The peak temperature is reduced by partitioning $\tau_h$ into several subtask instances interleaved with other, cool task instances. Because the hot task instance $\tau_h$ cannot move before its arrival time or after its deadline, we only need to analyze the interval between the arrival time and deadline of $\tau_h$, represented as $a_{\tau_h}$ and $d_{\tau_h}$, called the hot interval. All task instances in the hot interval other than the hot task instance are called cool task instances, denoted by $\tau_c$. It is worth noting that this definition of hot and cool tasks is substantially different from the one in the previous section, which considered periodic tasks with a common period. A high-level description of EDF scheduling with our task partitioning is given in Algorithm 4.

Algorithm 4 EDF with task partitioning
1: Use the EDF scheduler to get the initial schedule of these tasks
2: while loop for $M$ times do
3:   Calculate the thermal profile of the task sequence, find the hot task instance $\tau_h$ where the peak temperature occurs.
4:   Partition the task instances whose execution period overlaps with the arrival time or deadline of the hot task instance.
5:   In the hot interval, remove all the subparts of $\tau_h$ and calculate the available slack for each cool task instance.
6:   while there are parts of $\tau_h$ unassigned and some cool task instance has available slack do
7:     for each cool task instance $\tau_{c_i}$ in the hot interval do
8:       if $slack_i > 0$ then
9:         Append one unit of $\tau_h$ to $\tau_{c_i}$ and update the slack for all cool task instances
10:       end if
11:     end for
12:   end while
13:   If there are still some subparts of $\tau_h$ unassigned, scan the hot interval and assign them uniformly into the idle time.
14: end while
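The inner loop of Algorithm 4 (lines 6 to 12) distributes the hot task instance over the cool instances one unit at a time. The Python sketch below illustrates only that round-robin step, under a loudly stated simplification: each cool instance is given an independent integer slack budget, whereas the algorithm above recomputes the slack of all cool instances after every append. The function name and the example numbers are hypothetical.

```python
def distribute_hot_units(hot_units, slacks):
    """Round-robin assignment of hot-task units to cool task instances.

    hot_units: number of unit-length pieces of the hot task instance.
    slacks:    available slack (in units) of each cool task instance,
               treated as independent budgets (a simplification).
    Returns (units appended after each cool instance, units left unassigned).
    """
    assigned = [0] * len(slacks)
    remaining = list(slacks)
    left = hot_units
    while left > 0 and any(s > 0 for s in remaining):
        for i in range(len(remaining)):
            if left == 0:
                break
            if remaining[i] > 0:
                assigned[i] += 1      # append one unit of the hot task after cool instance i
                remaining[i] -= 1
                left -= 1
    return assigned, left

# Five units of the hot task, four cool instances with different slack.
assigned, leftover = distribute_hot_units(5, [2, 0, 1, 4])
print(assigned, leftover)
```

Any units left over after the loop correspond to line 13 of Algorithm 4, where the remaining parts of the hot task are spread over idle time.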

PAGE 87

There are two major steps required for Algorithm 4, as follows:

Hot interval isolation. Partition the task instances whose execution period overlaps with the arrival time and/or deadline of the hot task instance. After finding the task instance $\tau_h$, we can limit our analysis to the hot interval. It is likely that some task instances extend across either $a_{\tau_h}$ or $d_{\tau_h}$. We partition such task instances across these timelines using the following equation. If some task $\tau_k$ extends across the timeline $\gamma$ (either $a_{\tau_h}$ or $d_{\tau_h}$), that is, $st_k$
PAGE 88

Figure 4-6. First portion of the slack allocation step in the EDF based partitioning algorithm. The peak temperature occurs at task instance $\tau_h$. $a_{\tau_h}$ and $d_{\tau_h}$ are the arrival time and deadline of $\tau_h$, respectively. There are two parts of $\tau_h$: $\tau_{h1}$ and $\tau_{h2}$. Both $\tau_{h1}$ and $\tau_{h2}$ are removed from the sequence. The other cool task instances ($\tau_{c1}$-$\tau_{c5}$) in the hot interval then have more flexibility. Therefore, the slacks of these cool task instances can be calculated (for ease of presentation in this limited space, only $\tau_{c2}$'s arrival time $a_{\tau_2}$, deadline $d_{\tau_2}$ and slack are shown)

and slack of cool task instance $\tau_i$ can be computed as follows:

$$EST_i = \max(a_{\tau_h},\, a_{\tau_i},\, EST_{pred_i} + c_{pred_i})$$
$$LST_i = \min(d_{\tau_h},\, d_{\tau_i},\, LST_{succ_i}) - c_i$$
$$slack_i = LST_i - EST_i \qquad (4)$$

where $pred_i$ is the predecessor of $\tau_i$ defined by the EDF schedule and $succ_i$ is the successor of $\tau_i$ defined by the EDF schedule. Here, $a_{\tau_i}$ and $d_{\tau_i}$ are the arrival time and deadline of task instance $\tau_i$, respectively, and $c_i$ is the execution time of $\tau_i$.

After the available slacks have been calculated for all the cool task instances, we insert the hot task instance back into these slacks as uniformly as possible. This is done by appending one unit slack of $\tau_h$ to each cool task instance at a time (one unit slack is a small constant representing a small period of time). If there are still some subparts of $\tau_h$ unassigned, some idle time will exist because no cool task instance can

PAGE 89

Figure 4-7. Second portion of the slack allocation step in the EDF based partitioning algorithm. Calculate the slacks of all the tasks in the hot interval. For each task $\tau_{c_i}$, append one unit of $\tau_h$ to $\tau_{c_i}$ at a time and update the slacks. When no slack is left and there are still some parts of $\tau_h$ unassigned, scan the hot interval and assign them uniformly into the idle time

have slack on it. The remaining part of $\tau_h$ is uniformly decomposed into these idle times (Figure 4-7).

The above process corresponds to a single iteration of the algorithm. This process is repeated for several iterations. In each iteration, a potentially new hot task instance is chosen based on the thermal profile. The number of iterations can be fixed or chosen based on the level of improvement achieved. For our experiments, we found that 10-15 iterations are sufficient for deriving most of the benefits.

4.5 Task Partitioning and Scheduling on Multicore Processors

The challenge in developing algorithms that utilize task partitioning to reduce peak temperature on multicore processors is to achieve good temporal and spatial heat balance simultaneously. We use the same task definition as in Section 4.3.3. The goal is to find a sequence of these tasks for a homogeneous multicore processor, using task partitioning, that minimizes the peak temperature. In this section, we propose a task partitioning and scheduling algorithm (TPS) to further reduce the peak temperature on multicore processors. Our algorithm mainly consists of four steps:

PAGE 90

1. In order to facilitate scheduling of the hot and cool tasks between different neighboring cores, we first need to normalize the execution time of each task.
2. Use a task assignment algorithm to assign tasks onto the cores of the multicore processor.
3. Use a task partitioning algorithm within each core to partition the long hot tasks into short hot subtasks and interleave them with cool tasks that absorb the heat generated by the hot subtasks.
4. Schedule the execution of subtasks between cores to make sure that when one core is executing a hot task, all the neighboring cores are executing cool tasks.

The rest of this section describes these steps in detail.

Normalize the Execution Time of Each Task: First, we normalize the execution time of all tasks. It is difficult to schedule hot and cool tasks within one core while also aligning them between cores at the same time. We normalize the execution time of each task to an integer multiple of $\delta$, which will be used as the minimum scheduling unit in the next steps. The DVFS technique is used to normalize the task execution. DVFS dynamically scales down $V_{dd}$ and $f$ on cores to extend the execution time of tasks [9]. Let $c_i$ be the execution time of task $\tau_i$. It is extended to the smallest integer multiple of $\delta$ that is not smaller than $c_i$. Let $E_i$ be the normalized execution time of task $\tau_i$ in terms of $\delta$; then $E_i = \lceil c_i / \delta \rceil$. Figure 4-8 shows an example of normalizing the execution time of two tasks.

Figure 4-8. Use DVFS to normalize the execution time of tasks to the smallest integer multiple of $\delta$. Task 1 is normalized to 3 times $\delta$, $E_1 = 3$. Task 2 is normalized to 4 times $\delta$, $E_2 = 4$

Task Assignment: The algorithm developed in this chapter is relatively independent of the task assignment that determines the processor on which each task is executed. We tried several different task assignment algorithms, such as the Min-Min and Max-Min algorithms

PAGE 91

[23] and the coolest-first assignment algorithm that assigns tasks to the currently coolest core. We use the Min-Min algorithm [23] as our task assignment algorithm because of its better performance. The Min-Min algorithm can generate a small makespan by assigning the task with the minimum completion time to the core expected to finish processing it earliest [53]. Makespan $M$ is the total time at which a set of tasks has finished processing. Makespan is defined as $M = \max_j \{\sum_{\tau_i \in core_j} E_i\}$. In step 1, we normalized the time into multiple time slots, each of them as long as $\delta$. We use the same definitions of $P_{k,j}$, $T_{k,j}$, $Adj$, $A$, $B$, $C$ and $T_A$ as in Section 4.3.3. Using DVFS to extend the execution time lowers the power consumption of the tasks [9]. Let $P'_i$ be the new power consumption of task $\tau_i$ after normalization. We use the integer $l_{k,i,j}$ as the label that records the task schedule: $l_{k,i,j} = 1$ means core $j$ executes task $i$ in time slot $k$; otherwise $l_{k,i,j} = 0$. $T_{init}$ is the initial temperature. Then the problem can be formulated as follows:

Equation (4) calculates the temperature in each time slot on each core. Equation (4) uses the power consumption of the task scheduled in time slot $k$ of core $j$ as the power consumption $P_{k,j}$. Equation (4) is the constraint that task $\tau_i$ needs $E_i$ time slots. Equation (4) is the constraint that only one task can be scheduled per time slot per core. This problem is an integer programming problem, and it is not feasible to generate optimal results in a reasonable time. Therefore we propose a heuristic algorithm in the next two steps to solve it.

$$\text{Minimize: } T_{peak} \qquad (4)$$
$$\text{such that: } T_{peak} \ge T_{k,j},\ \forall k, j \qquad (4)$$
$$T_{k+1,j} = T_{k,j} + \sum_{\forall n \in Adj} A_{n,j}\,(T_{k,n} - T_{k,j}) + B_j P_{k,j} + C_j\,(T_A - T_{k,j}),\ \forall k, j \qquad (4)$$

$T_{0,j} = T_{init}, \ \forall j$ (4)
$P_{k,j} = P'_i$ if $l_{k,i,j} = 1, \ \forall k, j$ (4)
$\sum_k l_{k,i,j} = E_i, \ \forall t_i \in core_j$ (4)
$\sum_i l_{k,i,j} = 1, \ \forall k, j$ (4)
$l_{k,i,j} \in \{0, 1\}$ (4)

Intra-core Scheduling: Intra-core scheduling reduces the peak temperature temporally. In this step, we focus only on the scheduling of task sequences within each core. Note that after task assignment, some cores may have a smaller total execution time than others. These cores may become idle while waiting for other cores to finish their tasks. We utilize these idle times to further reduce the temperature. The idle time of each core $j$ is $I_j = M - \sum_{t_i \in core_j} E_i$. We treat the idle time as a task $t_{I_j}$ whose power is the core idle power and whose execution time is $I_j$. Task $t_{I_j}$ is considered a regular task and is scheduled together with the other tasks.

Algorithm 5 Intra-core Scheduling
1: Sort the tasks within each core by their average power consumption in ascending order. Assume that $N_j$ is the number of tasks assigned to core $j$.
2: Compute the sum of $E_i$ over all tasks $t_i$ assigned to core $j$, denoted by $E^{sum}_j$.
3: Find the task $t_{m_j}$ such that $\sum_{i=1}^{m_j - 1} E_i$ [...]

are cool. Steps 2-3 make sure there are enough cool subtasks to separate the hot subtasks. Next, we partition all the tasks into subtasks with execution time $d$ (step 4). By recursively pairing the two subtasks from the two ends of the task list, cool and hot subtasks can be effectively interleaved (step 5). In this chapter, we assume that when a task is partitioned into several small subtasks, the subtasks have the same average power consumption as the original task. We used Wattch [6] to compute the average power of the subtasks. As expected, these numbers are comparable with the average power of the original tasks in our benchmark set. An example of intra-core scheduling on 4 tasks is shown in Figure 4-9.

Figure 4-9. Intra-core scheduling. 4 tasks are sorted in ascending order of their average power consumption. Tasks $t_1$ and $t_2$ are cool tasks. Tasks $t_3$ and $t_4$ are hot tasks. Task $t_1$ is partitioned into subtasks $t_{1a}$ and $t_{1b}$, task $t_2$ into subtasks $t_{2a}$, $t_{2b}$ and $t_{2c}$, etc. Cool and hot subtasks are paired while maintaining the order between subtasks from the same parent task.

Inter-core Scheduling: In intra-core scheduling, we focus only on the scheduling within each core. In this step, we need to consider the mutual influence between cores. Inter-core scheduling works as follows: fix the subtask sequence on one core, starting with a cool subtask, followed by a hot subtask, followed by a cool subtask, etc. All the neighboring cores are set with the opposite sequence. We continue to set the task

sequence on the other cores following this rule until the task sequences on all the cores are settled. Since each subtask has exactly the same execution time $d$, the subtasks align automatically, with every hot subtask surrounded by cool subtasks. Our inter-core scheduling algorithm is based on Breadth-First Search (BFS), which sets neighboring cores with opposite task sequences. An example of the final state of the task sequences on a 4-core processor is shown in Figure 4-10.

Figure 4-10. Inter-core scheduling. There are 4 cores. Core 0 has two neighbors: core 1 and core 2. Therefore, core 1 and core 2 have the opposite hot and cool subtask sequence to core 0. Core 3 is a neighbor of core 1 and core 2, and it has the same sequence as core 0.

4.6 Experimental Results

In this section, we evaluate the task partitioning algorithms on both single-core and multicore processors. Both synthetic tasks and real benchmarks are used to compare our task partitioning algorithm with state-of-the-art scheduling algorithms.
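The BFS-based inter-core step described before this section can be illustrated with a short sketch. It is not the thesis implementation: the function names `assign_phases` and `subtask_sequence` and the adjacency dictionary are hypothetical, and the neighbor layout is taken from the description of Figure 4-10. Phase 0 means a core starts with a cool subtask, phase 1 means it starts with a hot one.

```python
from collections import deque


def assign_phases(neighbors, start=0):
    """Breadth-first traversal that gives every core the opposite phase
    of the neighbor from which it was first reached."""
    phase = {start: 0}            # the starting core begins with a cool subtask
    queue = deque([start])
    while queue:
        core = queue.popleft()
        for nb in neighbors[core]:
            if nb not in phase:   # not yet settled
                phase[nb] = 1 - phase[core]
                queue.append(nb)
    return phase


def subtask_sequence(phase, slots):
    """Alternating cool/hot labels for `slots` time slots of length d."""
    return ["cool" if (k + phase) % 2 == 0 else "hot" for k in range(slots)]


# Assumed adjacency for the 4-core example of Figure 4-10.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
phase = assign_phases(neighbors)
sequences = {core: subtask_sequence(p, 4) for core, p in phase.items()}
```

On this layout, cores 1 and 2 receive the opposite phase to core 0 and core 3 receives the same phase as core 0, so in every slot a hot subtask has only cool neighbors. The sketch assumes the core adjacency graph is two-colorable, which holds for grid layouts.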

4.6.1 Task Partitioning Algorithms on Single Core Processors

To evaluate the task partitioning algorithms on single-core processors, we used a platform based on the ARM Cortex A8 [2]: 2-wide in-order issue with 32 KB instruction and data caches. The clock speed was set to 1.5 GHz. Using the default thermal configurations in HotSpot [70] and the floorplan and silicon area of the ARM Cortex A8, the thermal resistance and capacitance can be computed as 1.83 °C/W and 0.112 J/°C, respectively [33]. The ambient temperature was set at 45.15 °C. We used the architecture-level power simulator Wattch [6] to obtain the power consumption of tasks.

4.6.1.1 Tasks with common period

For tasks with a common period, synthetic tasks are generated to find the profitable number of categories for achieving a lower peak temperature, and real benchmarks are used to compare the performance of the task sequencing algorithm (denoted by TSA [33]) and our task partitioning algorithm (denoted by TPA). For tasks with individual periods, real benchmarks are used.

Synthetic Tasks. Tasks were generated to compare the thermal reduction achieved by our task partitioning algorithm with different numbers of categories. The numbers of clock cycles of the tasks are uniformly distributed in $[1.5 \times 10^8, 1.45 \times 10^9]$, and the power consumption of these jobs is uniformly distributed in [5, 25] W. The numbers of jobs tested are 32, 64, 128, 192 and 256. The numbers of categories in task partitioning are 2, 3, 4 and 5.

Figure 4-11 shows the peak temperature comparison using different numbers of categories for the task partitioning algorithm. The number of tasks used in this scenario is 64. Recall from the task partitioning algorithm that a different number of categories leads to a different number of partitions within each category. It is necessary to find a profitable number of categories to achieve a lower peak temperature. The results show that 3 categories are enough to achieve most of the benefits for reducing the

peak temperature. In fact, using a larger number of categories may result in a slightly higher temperature. This is because when the number of categories is large, the highest category contains fewer tasks. Some tasks originally in the highest category are pushed into the second-highest category. These tasks, which are already very hot, are treated as cooler tasks to interleave with tasks in the highest category. However, they cannot absorb enough heat from the hot tasks. As a result, the algorithm is not able to reduce the temperature significantly. Our experiments suggest that 3 categories should be ideal for most practical scenarios.

Figure 4-11. Peak temperature comparison between different numbers of categories.

Real Benchmarks. We use Mibench [27] and Mediabench [46] to form four sets of benchmark tasks. The characteristics of these benchmarks are shown in Table 4-1.

Table 4-1. The four sets of benchmark tasks with common period

Set Number   Benchmarks
set1         patricia, adpcm, rijndael, susan, crc, FFT, dijkstra, epic
set2         patricia, djpeg, adpcm, sha, FFT, rijndael, susan, rijndael
set3         sha, djpeg, FFT, rijndael, dijkstra, epic, rijndael, susan
set4         rijndael, dijkstra, FFT, gsm, sha, patricia, pegwit, djpeg

Figure 4-12 shows the peak temperature comparison between the task sequencing algorithm and our task partitioning algorithm on real tasks for 3 categories. The task partitioning algorithm reduces the peak temperature by as much as 5.8 °C compared with the task sequencing algorithm. Given the above results, a choice of 3 categories should

generally provide a good tradeoff between low context switching overhead and high reduction in peak temperature, as discussed in Section 4.6.1.3.

Figure 4-12. Peak temperature comparison between the task sequencing algorithm and the task partitioning algorithm on real tasks. The number of categories is 3.

4.6.1.2 Tasks with individual period

For tasks with individual periods, we also use Mibench [27] and Mediabench [46] to form four sets of benchmark tasks. These benchmarks are shown in Table 4-2. Each periodic task has its own period. We set the deadline of each task to be the end of its own period. We also assume that the arrival time of the first instance of every task is 0.

Figure 4-13. Peak temperature comparison between EDF and EDFp (M = 15).

Table 4-2. The four sets of benchmark tasks with individual period

Set Number   Benchmark   Execution Time (Clock Cycles)   Period (Clock Cycles)
set1         patricia    1.32 x 10^8                     7.2 x 10^8
             adpcm       1.81 x 10^7                     1.35 x 10^8
             susan       1.48 x 10^7                     2.66 x 10^8
             crc         2.13 x 10^8                     9.73 x 10^8
             dijkstra    2.9 x 10^7                      1.74 x 10^8
set2         pegwit      2.12 x 10^7                     2.72 x 10^7
             gsm         6.35 x 10^6                     9.52 x 10^7
             epic        3.21 x 10^7                     1.44 x 10^8
             crc         2.13 x 10^8                     9.73 x 10^8
             patricia    1.32 x 10^8                     7.2 x 10^8
set3         gsm         6.35 x 10^6                     9.52 x 10^7
             patricia    1.32 x 10^8                     7.2 x 10^8
             susan       1.48 x 10^7                     2.66 x 10^8
             djpeg       1.57 x 10^7                     9.42 x 10^7
             adpcm       1.81 x 10^7                     1.35 x 10^8
set4         rijndael    4.5 x 10^7                      3.6 x 10^8
             susan       1.48 x 10^7                     2.66 x 10^8
             FFT         1.54 x 10^8                     1.1 x 10^9
             sha         4.8 x 10^7                      2.7 x 10^8
             adpcm       1.81 x 10^7                     1.35 x 10^8

We compare our approach with EDF, which does not directly address thermal issues. Figure 4-13 shows the peak temperature comparison between EDF and our approach, called EDFp. The experimental results show that EDFp outperforms EDF by as much as 6 °C.

4.6.1.3 Context switching overhead

We first analyze the context switching overhead introduced by partitioning the tasks with a common period. Figure 4-14 shows the number of context switches per task for the task partitioning algorithm using a variable number of categories. For the task partitioning algorithm with 3 categories (which Figure 4-11 shows to be the most profitable), the number of context switches is about 1.8 per task, which is tolerable in many practical scenarios (footnote 4).

Footnote 4: Context switch time on an ARM CPU can be less than 10 µs [64].
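To put the figure of about 1.8 context switches per task in perspective, the following back-of-the-envelope sketch estimates the overhead for the shortest synthetic task. It assumes the 10 µs value from footnote 4 as a per-switch cost, the 1.5 GHz clock from Section 4.6.1, and the lower bound of the synthetic task length from Section 4.6.1.1; the variable names are illustrative only.

```python
CLOCK_HZ = 1.5e9               # clock speed used in Section 4.6.1
SWITCHES_PER_TASK = 1.8        # reported for 3 categories
SWITCH_TIME_S = 10e-6          # assumed per-switch cost, taken from footnote 4
SHORTEST_TASK_CYCLES = 1.5e8   # lower bound of the synthetic task length

overhead_s = SWITCHES_PER_TASK * SWITCH_TIME_S        # time lost per task
shortest_task_s = SHORTEST_TASK_CYCLES / CLOCK_HZ     # duration of the shortest task
overhead_fraction = overhead_s / shortest_task_s      # relative overhead
```

Under these assumptions the switching time is roughly 18 µs against a task duration of 0.1 s, that is, on the order of 0.02% of the shortest task.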

Second, we analyze the context switching overhead for tasks with individual periods. Figure 4-15 shows the average number of context switches per task for EDF and EDFp on various sets of tasks. The overhead of EDFp (less than 2 context switches per task) is tolerable in most practical scenarios.

Figure 4-14. Average number of context switches per task of the task partitioning algorithm for various numbers of categories.

Figure 4-15. Average number of context switches per task for EDF and EDFp on various sets of tasks.

4.6.2 Task Partitioning and Scheduling on Multicore Processors

In this part, we evaluate the task partitioning and scheduling algorithm on multicore processors. We used a platform based on the ARM Cortex A9 [2]: 2-wide out-of-order

issue with 32 KB instruction and data caches. The clock speed was set to 1.5 GHz. The default thermal configurations in HotSpot [70] and the floorplan of the ARM Cortex A9 are used. The ambient temperature was set at 45.15 °C. We used the architecture-level power simulator Wattch [6] to obtain the power consumption of tasks. We have conducted experiments using 4-core, 8-core as well as 16-core architectures. In this section, we present results for the 16-core scenario. The 16 cores are arranged as a 4 x 4 matrix. To measure the proper temperature when the processors run tasks for a long time, we set the average steady-state temperature of the task sequence as the initial temperature and run the tasks for three iterations. We take the temperature in the final iteration as the measured temperature. We also compared the makespan of the different algorithms. A lower makespan means that the schedule of tasks finishes earlier.

The algorithms we compared in this section are as follows:

Min-Min: We use only the Min-Min algorithm to assign tasks to the multicore processor [23].

PDTM: Predictive Dynamic Thermal Management, a state-of-the-art dynamic temperature-aware task scheduling algorithm. PDTM can achieve a lower temperature by migrating processes from high-temperature cores to the cores predicted to be coolest in the future [80].

TPS: Our Task Partitioning and Scheduling algorithm on multicore processors. We use four TPS algorithms based on different values of $d$: TPS-1 ($d$ = 0.33 ms), TPS-2 ($d$ = 0.66 ms), TPS-3 ($d$ = 1.32 ms) and TPS-4 ($d$ = 2.64 ms).

4.6.2.1 Synthetic tasks

Synthetic tasks were generated to compare the thermal reduction achieved by TPS. The numbers of clock cycles of the tasks are uniformly distributed in $[5 \times 10^5, 5 \times 10^7]$.

Figure 4-16 shows the peak temperature comparison between Min-Min, PDTM and our TPS algorithms for synthetically generated tasks on a 16-core processor. The numbers of tasks tested are 80, 120, 160 and 200. We can see that TPS achieves a significantly lower peak temperature, reducing the peak temperature by up to 11.68 °C

compared with the Min-Min algorithm. We can also see that PDTM achieves a slightly lower temperature than TPS-1, but a higher temperature than TPS-4.

Figure 4-16. Peak temperature comparison between Min-Min, PDTM and our TPS algorithms for synthetically generated tasks on a 16-core processor.

Figure 4-17 shows the makespan comparison between Min-Min, PDTM and our TPS algorithms for synthetically generated tasks on a 16-core processor. We can see that the Min-Min algorithm has the smallest makespan and PDTM has the largest. The makespan of the TPS algorithms increases with $d$. However, the TPS algorithms still have a makespan very close to that of the Min-Min algorithm: TPS-4 increases the makespan by 8.8 percent compared with Min-Min. Even though PDTM can achieve peak temperatures similar to those of TPS, it requires 33% more makespan to finish all the tasks. TPS can reduce the peak temperature on multicore processors to a level as low as that of the PDTM algorithm while keeping the makespan comparable to that of the Min-Min algorithm.

To further compare the TPS and PDTM algorithms, we use DVFS to relax the makespan of the TPS algorithm to be equal to the makespan of PDTM. We first use PDTM and TPS to generate the task schedules and calculate the ratio $r$ between the makespans of PDTM and TPS: $r = makespan_{PDTM} / makespan_{TPS}$. We then extend the execution time of all the tasks in the TPS schedule by a factor of $r$. The newly relaxed TPS

schedule, called TPS relaxed, has the same makespan as PDTM. Since all four TPS algorithms gave similar results, we only show TPS-1 relaxed in this chapter. Figure 4-18 shows the peak temperature comparison between the TPS-1 relaxed and PDTM algorithms for synthetically generated tasks on a 16-core processor. We can see that TPS-1 relaxed achieves up to 20 °C temperature reduction compared with PDTM.

Figure 4-17. Makespan comparison between Min-Min, PDTM and our TPS algorithms for synthetically generated tasks on a 16-core processor.

Figure 4-18. Peak temperature comparison between the TPS-1 relaxed and PDTM algorithms for synthetically generated tasks on a 16-core processor.
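The relaxation used to build the TPS relaxed schedule can be sketched in a few lines. This is only an illustration of the ratio $r$ described above; the function name `relax_schedule` and the sample numbers are hypothetical, and the accompanying reduction in power from DVFS is not modeled here.

```python
def relax_schedule(exec_times, makespan_tps, makespan_pdtm):
    """Stretch every task's execution time by r = makespan_PDTM / makespan_TPS."""
    r = makespan_pdtm / makespan_tps
    return r, [t * r for t in exec_times]


# Hypothetical schedule: three tasks on one core, with PDTM needing 33% more time.
tps_times = [20.0, 30.0, 50.0]
r, relaxed_times = relax_schedule(tps_times, makespan_tps=100.0, makespan_pdtm=133.0)
```

Because every task is stretched by the same factor, the relative order and alignment of the subtasks are preserved while the total length matches the PDTM makespan.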

4.6.2.2 Real benchmarks

We use Mibench [27] and Mediabench [46] to form four sets of benchmark tasks, which are shown in Table 4-3. We use different input data to generate more variants of these tasks. For example, cjpeg with different input data is considered a variant of cjpeg. Each benchmark set has 30 benchmark tasks, including the variants.

Table 4-3. The four sets of benchmark tasks for multicore processors

Set Number   Benchmarks
set1         dijkstra, patricia, adpcm, crc32, fft, gsm, sha, ghostscript, cjpeg, blowfish, stringsearch
set2         epic, unepic, rasta, epegwit, ifft, dpegwit, basicmath, ghostscript, dadpcm, ungsm, blowfish, susan
set3         dijkstra, adpcm, crc32, fft, gsm, rasta, ghostscript, patricia, rijndael, blowfish, stringsearch, sha, epic, jpeg, pegwit
set4         stringsearch, rasta, unepic, dadpcm, ifft, susan, c, sha, ghostscript, dpegwit, blowfish, crc32, dijkstra, djpeg

Figure 4-19. Peak temperature comparison between Min-Min, PDTM and our TPS algorithms for real benchmarks on a 16-core processor.

Figure 4-19 shows the peak temperature comparison between Min-Min, PDTM and our TPS algorithms for real benchmarks on a 16-core processor. We can see that TPS achieves a significantly lower peak temperature, reducing the peak temperature by up to 9.92 °C compared with the Min-Min algorithm. The TPS algorithms can also achieve a lower

temperature than the PDTM algorithm, reducing the peak temperature by up to 4.52 °C.

Figure 4-20. Makespan comparison between Min-Min, PDTM and our TPS algorithms for real benchmarks on a 16-core processor.

Figure 4-20 shows the makespan comparison between Min-Min, PDTM and our TPS algorithms for real benchmarks on a 16-core processor. We can see that the TPS algorithms achieve nearly identical makespans to the Min-Min algorithm. The PDTM algorithm requires as much as 44% more makespan than the Min-Min and TPS algorithms. TPS can therefore significantly reduce the peak temperature on multicore processors while keeping nearly the same makespan as the Min-Min algorithm.

4.6.2.3 Scheduling time

Figure 4-21 shows the scheduling time comparison between PDTM and our TPS algorithms. Scheduling time is defined as the time taken to run the scheduling algorithm itself, not including the execution time of the tasks. It is a measure of the overhead introduced by the scheduling algorithm. Our TPS algorithm is a static algorithm and does not require rescheduling while tasks are executing. Dynamic algorithms such as PDTM require a high scheduling time to dynamically reschedule tasks in order to achieve a low temperature. We can see that the scheduling time of TPS is close to zero

when the number of cores is 16 or fewer and rises slowly as the number of cores increases to 64. The scheduling time of PDTM is much larger and rises much more rapidly.

Figure 4-21. Scheduling time comparison between PDTM and our TPS algorithms. The number of tasks used is based on the number of cores. There are five tasks per core on average.

4.7 Summary

Both the power and the heat density of on-chip systems are increasing exponentially with Moore's Law. High temperature negatively affects reliability as well as the costs of cooling and packaging. In this chapter, we propose task partitioning as an effective way to reduce the peak temperature in embedded systems. Task partitioning can reduce the temperature by breaking hot tasks into subtasks and interleaving them with cool tasks while still maintaining the order of execution. We first utilize task partitioning on single-core processors. We developed novel algorithms that address two broad scenarios: (1) a set of periodic tasks with a common period and (2) a set of periodic tasks with individual periods. Experimental results show that our first algorithm outperforms the task sequencing algorithm [33], reducing the peak temperature by as much as 6 °C. For task sets with individual periods, EDF scheduling with task partitioning can lower the peak temperature, compared with simple EDF scheduling, by as much as 6 °C.

We have also explored the effectiveness of task partitioning on multicore processors. We developed a task partitioning and scheduling algorithm (TPS) for multicore processors.

The TPS algorithm breaks hot tasks into subtasks and interleaves them with cool tasks within the same core and between neighboring cores. Our task partitioning and scheduling algorithm on multicore processors can reduce the temperature by as much as 11.68 °C compared with the Min-Min algorithm [23] and reduce the makespan by up to 40% compared with the PDTM algorithm [80]. When the makespan is the same as that of PDTM, our algorithms improve the peak temperature substantially, by 18 °C. These results are promising and clearly demonstrate that task partitioning is an effective way to reduce the peak temperature in single-core and multicore systems.

CHAPTER 5
THERMAL AND ENERGY AWARE SCHEDULING ALGORITHM FOR MATRIX MULTIPLICATION

5.1 Background

The increasing speed and number of cores of today's multicore microprocessors have led to a significant increase in power consumption. There are two aspects of power consumption in digital CMOS circuits: dynamic energy and static energy. Dynamic energy is consumed by the charging and discharging of circuit capacitances and by short-circuit power dissipation. Static energy is due to leakage current through reverse-biased diodes and similar effects. Dynamic Voltage and Frequency Scaling (DVFS) can be used as the essential technique to reduce dynamic energy consumption by lowering the supply voltage and operating frequencies [7].

Most data-parallel algorithms consist of computation and communication. Existing DVFS-based power-efficient algorithms scale the voltage and frequency of the whole processor chip; that is, the cores, bus, L1 cache and L2 cache all operate at the same voltage level [4, 35, 59, 60]. However, such a universal scaling method may reduce efficiency, since different applications may have different workload ratios between computation and communication. One solution to this problem is the Multiple Clock Domain (MCD) processor, which allows the components of the processor to operate at independent voltage levels [32, 66]. This chapter studies the tradeoff between the voltage levels of the cores and the internal communication bus.

In this chapter, we propose a general methodology for bus-based multicore processors with a full memory hierarchy. Using this methodology, one can analyze the dynamic voltage settings for the bus and cores, estimate the energy consumption of each component separately and obtain a good tradeoff among them. We examine this methodology using various parallel matrix multiplication algorithms that are suitable for shared-memory multicore machines with L1 and L2 caches. All the algorithms are assumed to execute on the full memory hierarchy.

Our simulation results show that effectively setting the voltage of the buses along with that of the cores can result in up to a 20% reduction in the overall energy requirements without a substantial loss in overall performance. The methods proposed in this chapter demonstrate the usefulness of multiple-element optimization in multicore architectures. The experiments show that a good tradeoff among these elements, in terms of overall performance and energy requirements, can lead to improved energy consumption.

The rest of the chapter is organized as follows. Section 5.2 introduces the related work on this topic. Section 5.3 defines the processor model and the corresponding energy model. Section 5.4 presents the general methodology for analyzing dynamic voltage level settings for the cores and bus based on the application's parameters. Section 5.5 uses this methodology to analyze different matrix multiplication algorithms. Section 5.6 concludes the chapter.

5.2 Related Work

DVFS is widely adopted to reduce computation energy [21, 38, 39, 55, 76] on multicore processors. Kang et al. [38, 39] address the problem of minimizing the energy consumption of DAG-based tasks running on parallel machines under an execution deadline constraint. Research has also been conducted on how to model and reduce the energy consumption of the internal bus. H. Lee [47] explores a quantitative methodology to evaluate the energy consumption of communication approaches such as the shared bus, point-to-point links and Network-on-Chip (NoC) in multiprocessor systems. The authors of [71] proposed a bus energy model that derives the energy consumption from parameters such as the body voltage, the swing voltage and the bus circuit characteristics. In [74], the authors state that they can improve the encoding technique to reduce bit switching activity during data transmission and thus achieve energy savings. Similarly, Liu et al. [51] design a new bus architecture that reduces the number of bits transmitted by adding an extra cache to each link to record previously transmitted bits.

In the work described above, the energy requirements for communication are not taken into consideration. Andrei et al. [1] have investigated the possibility of reducing both computation and communication energy dissipation via DVFS and bus voltage scaling on multicore processors. They assume that the processors and the bus can work under different supply voltages. A scheduling algorithm can thus be designed to select the joint, but potentially different, optimal supply voltages for the processors and the communication bus. However, their algorithm only works for the message-passing parallel model, so memory accesses are not taken into account. The authors of [43] describe energy efficiency under the shared-address parallel model for parallel applications such as parallel quicksort and prefix sum. These algorithms execute parallel computation tasks on multiple cores but communicate via reads and writes to the shared memory. By estimating the memory access times, they can calculate the optimal frequency of the processors under a given deadline. By examining different problem sizes and processor counts, the energy-aware scalability of parallel applications can be evaluated. The potential drawback of the method is that it simply merges the overheads on the bus with those of the memory I/Os, which dominate the communication time. Their models do not explicitly consider the energy requirement of the bus.

5.3 Processor Model and Energy Model

In this section, we define the processor model (the structure of the different components in the processor and how they are connected) and the corresponding energy model (the model used to estimate the energy consumption of the processor model).

5.3.1 Processor Model

The processor we model in this chapter is a multicore processor with $p$ cores and a three-level memory hierarchy. The memory hierarchy includes a large main memory shared by all the cores, an L2 cache shared by all the cores and $p$ private L1 caches. An internal bus connects all the cores and the L2 cache. The data transferred between the L1

caches and the L2 cache needs to go through this internal bus. Similarly, a front-side bus connects the L2 cache to the main memory (Figure 5-1).

Figure 5-1. The processor model. Each core has its own private L1 cache. The L2 cache and main memory are shared by all the cores. The cores and the L2 cache are connected by an internal bus. The L2 cache and main memory are connected by a front-side bus.

Our model makes the following simplifying assumptions:

The internal bus is shared between all the cores, and only one core can transfer its data on the bus at a time. All the cores have the same data transfer latency on the bus no matter where they are located.

The voltages of both the cores (including their L1 caches) and the internal bus can be scaled continuously in the interval $[V_0, V]$, where $V$ is the maximum voltage and $V_0$ is the minimum voltage. All the active cores run at the same voltage, but the internal bus can run at a different voltage.

The voltage of the L2 cache, front-side bus and main memory cannot be scaled. The access times to the L2 cache and main memory are constant (footnote 1).

This model is representative of many extant multicore machines.

Footnote 1: To simplify the analysis, the main memory access time includes the data transfer latency on the front-side bus and the data access time in main memory.

5.3.2 Energy Model

Three different energy models are used to analyze the energy consumption of the processor, the shared bus and the memory, respectively. The details of these models are as follows.

5.3.2.1 Processor energy model

We adopt the energy model used in [43] for analyzing the energy requirements. The dynamic power consumption of one core (including its L1 cache) is:

$P_{dym} = C_{ef} X^2 f_x$ (5)

where $C_{ef}$ is the switched capacitance, $X$ is the supply voltage and $f_x$ is the running frequency of the core at voltage level $X$. $f_x$ can be written as:

$f_x = k \frac{(X - V_t)^2}{X}$ (5)

where $k$ is a circuit constant and $V_t$ is the threshold voltage [38]. The dynamic energy consumption of executing $c_{cpu}$ clock cycles is expressed by:

$E^{cpu}_{dym} = E_d \, c_{cpu} X^2$ (5)

where we redefine $C_{ef}$ as $E_d$ to make the parameter name consistent with the other equations in this chapter.

The leakage energy consumption is defined as:

$E^{cpu}_{leak} = E_l \, T^{cpu}_{act} f_x$ (5)

where $E_l$ depends on the hardware circuit and the current temperature of the core, and $T^{cpu}_{act}$ is the active time of the core. In this chapter, we assume that the energy consumption is nearly zero when the core is idle. This is a reasonable assumption because recent processors can reduce power consumption to near zero when no tasks are running [43] by shutting down the core.
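The processor energy model above translates directly into code. The sketch below is a plain transcription of the three formulas; the function names are illustrative and the numeric values in the example are hypothetical, chosen only to show how the quantities scale with the supply voltage.

```python
def frequency(voltage, k, v_t):
    """f_x = k * (X - V_t)^2 / X, the core frequency at supply voltage X."""
    return k * (voltage - v_t) ** 2 / voltage


def cpu_dynamic_energy(e_d, cycles, voltage):
    """E_dym^cpu = E_d * c_cpu * X^2 for c_cpu clock cycles at voltage X."""
    return e_d * cycles * voltage ** 2


def cpu_leakage_energy(e_l, active_time, freq):
    """E_leak^cpu = E_l * T_act^cpu * f_x."""
    return e_l * active_time * freq


# Hypothetical values: halving the voltage quarters the dynamic energy.
energy_full = cpu_dynamic_energy(e_d=1.0, cycles=100, voltage=1.0)
energy_half = cpu_dynamic_energy(e_d=1.0, cycles=100, voltage=0.5)
freq_example = frequency(voltage=1.0, k=2.0, v_t=0.5)
```

The quadratic dependence on $X$ is what makes voltage scaling attractive: the same number of cycles costs a quarter of the dynamic energy at half the voltage, at the price of a lower frequency.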

5.3.2.2 Bus energy model

We use the repeater-based bus model in [1], which is quite similar to the processor energy model. The dynamic and leakage energy consumption of a bus are given as:

$E^{bus}_{dym} = E_b \, c_{bus} Y^2$ (5)
$E^{bus}_{leak} = E_l \, T^{bus}_{act} f_y$ (5)

where $E_b$ is a hardware constant, $Y$ is the supply voltage of the bus, $f_y$ is the frequency at voltage level $Y$, $c_{bus}$ is the number of clock cycles that the bus needs to transfer its data and $T^{bus}_{act}$ is the bus active time.

5.3.2.3 Shared memory model

Recall that we assume the energy consumption of the shared L2 cache and main memory is constant for each access. The energy consumption of the L2 cache and main memory are given as:

$E_{cache} = E_c N_c$ (5)
$E_{mem} = E_m N_m$ (5)

where $E_c$ and $E_m$ are the energy consumption of each L2 access and memory access, respectively, and $N_c$ and $N_m$ are the numbers of L2 accesses and memory accesses while the parallel application is being executed.

Although our analysis uses the above models, the overall methodology is quite general, and many of the tradeoffs and conclusions presented should be valid for a variety of other similar models.

5.4 General Methodology

In this section, we introduce the general methodology and equations used in our analysis. Given a parallel problem, our goal is to find the optimal core and bus voltages that minimize the energy consumption under the constraint that the running time is relaxed to at most $k$ times the parallel running time without voltage scaling.
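As a companion to the bus and shared-memory models of Section 5.3.2, the following sketch evaluates the two voltage-dependent and voltage-independent terms side by side. The function names and all numeric inputs are hypothetical; only the formulas $E^{bus}_{dym} = E_b c_{bus} Y^2$, $E_{cache} = E_c N_c$ and $E_{mem} = E_m N_m$ come from the text.

```python
def bus_dynamic_energy(e_b, bus_cycles, bus_voltage):
    """E_dym^bus = E_b * c_bus * Y^2."""
    return e_b * bus_cycles * bus_voltage ** 2


def shared_memory_energy(e_c, n_c, e_m, n_m):
    """E_cache + E_mem = E_c * N_c + E_m * N_m (independent of any voltage)."""
    return e_c * n_c + e_m * n_m


# Hypothetical workload: 1000 bus cycles, 500 L2 accesses, 30 memory accesses.
bus_at_full_voltage = bus_dynamic_energy(e_b=1.0, bus_cycles=1000, bus_voltage=1.0)
bus_at_scaled_voltage = bus_dynamic_energy(e_b=1.0, bus_cycles=1000, bus_voltage=0.8)
memory_energy = shared_memory_energy(e_c=0.2, n_c=500, e_m=1000.0, n_m=30)
```

Lowering the bus voltage shrinks only the bus term; the cache and memory terms stay fixed, which is why workloads dominated by memory accesses leave less room for savings from voltage scaling.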

1. Propose appropriate algorithms for the parallel computing problem according to the input size. A smaller input may fit entirely into the L2 cache, so the algorithm mostly needs to consider the data transfer between the L1 caches and the L2 cache. For a larger input that cannot fit into the L2 cache, a more efficient algorithm that also considers the data transfer between the L2 cache and main memory is proposed.
2. Evaluate the total computation time $CompTime_{total}$ of the parallel algorithm (all time units in this chapter are clock cycles). $CompTime_{total}$ includes the time the cores need to execute the tasks and the time to access the L1 caches. It is the sum of the computation times of all cores needed to complete the algorithm, and it is used to compute the total dynamic energy consumption of computation.
3. Evaluate the bus communication time $CommTime_{bus}$, which is the time to transfer data on the internal bus.
4. Evaluate the total number of L2 cache accesses $N_c$ and main memory accesses $N_m$. Note that the data transfer time on the front-side bus is included in the access time to main memory.
5. Evaluate the critical path computation time $CompTime_{cp}$ of the parallel algorithm. The critical path is defined as the longest path in a graph representing the interdependencies of the subtasks in a parallel algorithm. Note that the critical path bus communication time is the same as $CommTime_{bus}$ because of our assumption that no concurrent accesses to the internal bus are allowed. (This can be suitably modified if a constant number of simultaneous accesses is allowed.)
6. Based on the above evaluation, we have the following equations:

$E_{total} = E_{comp} + E_{bus} + E_{cache} + E_{mem} + E_{leak}$ (5)
$E_{comp} = E_d \, CompTime_{total} \, X^2$ (5)
$E_{bus} = E_b \, CommTime_{bus} \, Y^2$ (5)
$E_{leak} = E^{cpu}_{leak} + E^{bus}_{leak}$ (5)

where $E_{cache}$, $E_{mem}$, $E^{cpu}_{leak}$ and $E^{bus}_{leak}$ can be obtained from Equations (5), (5), (5) and (5), respectively.
7. The parallel execution time without voltage scaling (voltage set at $V$), $T_p$, is:

$T_p = T_{comp} + T_{bus} + T_{cache} + T_{mem} = CompTime_{cp} / f_v + CommTime_{bus} / f_v + M_c N_c + M_m N_m$

where $M_c$ is the time for each L2 cache access, $M_m$ is the time for each main memory access and $f_v$ is the frequency at voltage level $V$. The actual execution time with dynamic voltage scaling, $T$, is:

$T = CompTime_{cp} / f_x + CommTime_{bus} / f_y + M_c N_c + M_m N_m$ (5)

where $f_x$ and $f_y$ are the frequencies at voltage levels $X$ and $Y$, respectively.
8. Formulate the energy consumption and execution time constraints as a convex optimization problem. Solving it with a standard convex optimization solver gives the minimum energy consumption under the given time constraints.

5.5 Parallel Algorithm Analysis

We analyze two matrix multiplication algorithms that have been described in the parallel computing literature [44]. These correspond to blocking along the rows (1D or Rowwise) and blocking along both rows and columns (2D Blocking). For both algorithms, we develop the analysis for two scenarios: small matrices and large matrices. The former corresponds to the case in which the matrices are small enough to fit into the L2 cache. The latter corresponds to the scenario in which this is not the case.

5.5.1 Rowwise Matrix Multiplication

5.5.1.1 Problem definition

Given a $p$-core processor in which each core has its own L1 cache and all cores share an L2 cache, we consider the problem of multiplying two $n \times n$ matrices ($A$ and $B$). The two matrices are initially stored in main memory. The resulting $n \times n$ matrix ($C$) is written back into main memory. We use the rowwise matrix multiplication algorithm.

5.5.1.2 Small matrices case

For this case, we assume that the two input matrices are small enough to fit directly into the L2 cache. Figure 5-2 shows the process of the rowwise matrix multiplication algorithm. The details of the rowwise matrix multiplication algorithm for the small matrix case are shown as Algorithm 6.

First of all, let us compute the total computation time. To multiply an $l_1 \times n$ block of $A$ by an $n \times l_1$ block of $B$, it takes $b n l_1^2$ clock cycles, where $b$ is the average number of clock cycles

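per operation (addition or multiplication). To simplify the analysis, we assume each operation takes one clock cycle. There are in total $\frac{n}{p l_1} \times p \times \frac{n}{l_1} = n^2 / l_1^2$ such block computations. Therefore, $CompTime_{total} = b n l_1^2 \times n^2 / l_1^2 = b n^3$. It follows directly that $CompTime_{cp} = b n^3 / p$.

Second, we evaluate the amount of data transferred on the bus. In steps 2-9, the CPU reads in total $\frac{n}{p l_1} \times p \times n l_1 \times (\frac{n}{l_1} + 1) = n^3 / l_1 + n^2$ data items and writes $\frac{n}{p l_1} \times p \times l_1^2 \times \frac{n}{l_1} = n^2$ data items through the bus; that is, $CommTime_{bus} = d (n^3 / l_1 + 2 n^2) / W$, where $d$ is the number of bits per datum and $W$ (bits per clock cycle) is the width of the bus.

Figure 5-2. Rowwise 1-D cyclic matrix multiplication. Each core brings $l$ rows of matrix $A$ and $l$ columns of matrix $B$ into its L1 cache, performs the computation and then writes the resulting $l \times l$ submatrix back to the L2 cache.

Algorithm 6 Rowwise matrix multiplication (small matrices)
1: Bring the $n \times n$ matrices $A$ and $B$ entirely from main memory into the L2 cache (Figure 5-2). // Without loss of generality, we assume that $n$ is a multiple of $p \times l_1$. $l_1$ is the maximum number of rows of matrix $A$ or $B$ that the L1 cache can hold. Therefore, $l_1 = \min\{\text{L1 cache size} / 2n, \ n / p\}$.
2: for Loop = 1 to $n / (p \times l_1)$ do
3:   for $i$ = 1 to $p$ do
4:     for $j$ = 1 to $n / l_1$ do
5:       Core $i$ brings the ($i \times$ Loop)-th set of $l_1$ rows of matrix $A$ and the $j$-th set of $l_1$ columns of matrix $B$ into its L1 cache, where $l_1 = \min\{\text{L1 cache size} / 2n, \ n / p\}$.
6:       Core $i$ computes, in parallel with the other cores, the product of the $i$-th $l_1 \times n$ block of $A$ and the $j$-th $n \times l_1$ block of $B$, obtains the resulting $l_1 \times l_1$ submatrix and writes it back to the L2 cache.
7:     end for
8:   end for
9: end for
10: Write the resulting $n \times n$ matrix $C$ back into main memory.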
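The closed forms for the small matrices case can be checked by walking the loop nest of Algorithm 6 and tallying operations and bus traffic. The sketch below is a counting exercise only, not an implementation of the multiplication. It assumes, consistently with the read count $n l_1 (n / l_1 + 1)$ per core and outer iteration, that a core loads its $l_1$ rows of $A$ once and then one block of $l_1$ columns of $B$ per inner iteration; the function name `count_small_case` is hypothetical.

```python
def count_small_case(n, p, l1, b=1):
    """Tally compute cycles, bus reads and bus writes for steps 2-9 of Algorithm 6."""
    comp = reads = writes = 0
    for _loop in range(n // (p * l1)):      # step 2
        for _core in range(p):              # step 3
            reads += l1 * n                 # l1 rows of A (assumed loaded once)
            for _j in range(n // l1):       # step 4
                reads += n * l1             # l1 columns of B
                comp += b * n * l1 * l1     # one l1 x l1 result block
                writes += l1 * l1           # write the block back to L2
    return comp, reads, writes


n, p, l1, b = 64, 4, 8, 3
comp, reads, writes = count_small_case(n, p, l1, b)
d, W = 32, 32                               # bits per datum and bus width
comm_time_bus = d * (reads + writes) / W
```

For these sample sizes the tallies agree with $b n^3$, $n^3 / l_1 + n^2$ and $n^2$, and the bus time agrees with $d (n^3 / l_1 + 2 n^2) / W$.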

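Third, we need to compute the number of shared memory accesses. It is easy to see that there are in total $2 n^2$ read accesses and $n^2$ write accesses to main memory; that is, $N_m = 3 n^2$. For the L2 cache, there are in total $3 n^2$ accesses in steps 1 and 10. In steps 2-9 there are in total $\frac{n}{p l_1} \times p \times n l_1 \times (\frac{n}{l_1} + 1) = n^3 / l_1 + n^2$ read accesses and $\frac{n}{p l_1} \times p \times l_1^2 \times \frac{n}{l_1} = n^2$ write accesses to the L2 cache. Therefore, $N_c = n^3 / l_1 + 5 n^2$. Then we have:

$E_{total} = E_{comp} + E_{bus} + E_{cache} + E_{mem} + E_{leak}$ (5)

where:

$E_{comp} = E_d (b n^3) X^2$ (5)
$E_{bus} = E_b \, \frac{d (n^3 / l_1 + 2 n^2)}{W} \, Y^2$ (5)
$E_{cache} = E_c (n^3 / l_1 + 5 n^2)$ (5)
$E_{mem} = E_m (3 n^2)$ (5)
$E_{leak} = E_l \left( \frac{b n^3 / p}{f_x} + \frac{n^3 / l_1 + 2 n^2}{f_y} \right) (f_x + f_y)$ (5)

The original parallel time without voltage scaling is:

$T_p = T_{comp} + T_{bus} + T_{cache} + T_{mem} = \frac{b n^3 / p}{f_v} + \frac{d (n^3 / l_1 + 2 n^2) / W}{f_v} + M_c (n^3 / l_1 + 5 n^2) + M_m (3 n^2)$ (5)

The parallel time with voltage scaling is:

$T = \frac{b n^3 / p}{f_x} + \frac{d (n^3 / l_1 + 2 n^2) / W}{f_y} + M_c (n^3 / l_1 + 5 n^2) + M_m (3 n^2)$ (5)

The optimization problem now is how to scale the voltages of the cores and the bus such that the energy consumption is minimized and the execution time with voltage scaling is at most $k$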

times the original parallel execution time without voltage scaling. That is:

$\min E_{total}$ (5)
s.t. $T \le k T_p$ (5)
$V_0 \le X \le V$ (5)
$V_0 \le Y \le V$ (5)

To solve the above convex optimization problem, we need to choose appropriate values for the constants. Without loss of generality, we choose the average number of clock cycles per operation $b$ to be 3: one cycle to read the data from the L1 cache, one cycle to compute and one cycle to write back to the L1 cache. Based on experimental results from the power simulator McPAT [50] and the power model parameter settings in [43], we set $E_d = 15.51 / f_v^2$ and the leakage constant $E_l = 0.1 E_d f_v^2$. The temperature of the working chip is set at 70 °C. We also set $E_c = 0.2 E_d f_v^2$ and $E_m = 1000 E_d f_v^2$. In addition, we set the L2 cache access time to $M_c = 10$ clock cycles and the main memory access time to $M_m = 100$ clock cycles. The datum width is $d = 32$ bits, and the bus width is also $W = 32$ bits per clock cycle. Clearly, other suitable values can be chosen based on actual architecture parameters.

Figure 5-3 shows the energy consumption when scaling the voltage of both the cores and the bus (cores+bus) and the energy consumption when scaling the voltage of the cores only (cores only). We vary the slack parameter $k$ to evaluate the differential energy needs when the time constraints are relaxed. For example, $k = 1$ means the time constraint is not relaxed, and $k = 1.2$ means that the algorithm can use 20% more time. The cores+bus energy is the minimum energy consumption when voltage scaling of both the cores and the bus is allowed. The cores-only energy is the minimum energy consumption when only voltage scaling of the cores is allowed. The results (based on the above parameter values) show that when $k$ is as small as 1.3 (i.e., 30% additional time), the energy reduction for voltage scaling of both the cores and the bus is 20% better than that for voltage scaling of the cores only.
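The structure of this optimization can be explored with a brute-force grid search, used here in place of a convex solver. The sketch is heavily simplified and should not be expected to reproduce the reported numbers: it keeps only the dynamic core and bus energy terms (leakage and the constant cache and memory terms are left out), measures energy in units of $E_d$ with the assumption $E_b = E_d$, normalizes $f_v$ to 1, and uses hypothetical values for $V_t$, $V_0$, $V$ and $l_1$, none of which are given in this passage.

```python
def freq(v, v_t=0.3, v_max=1.0):
    """f = k (v - V_t)^2 / v with k chosen so that freq(v_max) is 1 (assumed values)."""
    scale = v_max / (v_max - v_t) ** 2
    return scale * (v - v_t) ** 2 / v


def grid_search(n=1024, p=16, l1=2, b=3, k=1.3,
                v_min=0.6, v_max=1.0, steps=41, scale_bus=True):
    """Return (energy, X, Y) minimizing dynamic energy subject to T <= k * T_p."""
    comp_total = b * n ** 3
    comp_cp = comp_total / p
    comm = n ** 3 / l1 + 2 * n ** 2            # bus cycles, using d == W
    n_c = n ** 3 / l1 + 5 * n ** 2
    n_m = 3 * n ** 2
    fixed_time = 10 * n_c + 100 * n_m          # M_c N_c + M_m N_m
    t_p = comp_cp + comm + fixed_time          # unscaled time, f_v == 1
    step = (v_max - v_min) / (steps - 1)
    volts = [v_min + i * step for i in range(steps - 1)] + [v_max]
    bus_volts = volts if scale_bus else [v_max]
    best = None
    for x in volts:
        for y in bus_volts:
            t = comp_cp / freq(x) + comm / freq(y) + fixed_time
            if t <= k * t_p:
                energy = comp_total * x ** 2 + comm * y ** 2
                if best is None or energy < best[0]:
                    best = (energy, x, y)
    return best


cores_and_bus = grid_search(scale_bus=True)
cores_only = grid_search(scale_bus=False)
```

Because the cores-only candidates are a subset of the cores+bus candidates, the cores+bus minimum can never be worse, which mirrors the qualitative comparison in Figure 5-3.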


Figure 5-3. The minimum energy requirement comparison between CPU-only voltage scaling and CPU+bus voltage scaling for rowwise matrix multiplication (small matrices). Execution time is $k$ times the original parallel execution time at maximum frequency. $n = 1024$, $p = 16$. The L2 cache size is 8 MB. The L1 cache size for each core is 16 KB.

5.5.1.3 Large matrices case

For this case, we assume that the two input matrices are so large that they cannot fit directly into the L2 cache. The algorithm is shown as follows (Algorithm 7):

Algorithm 7 Rowwise matrix multiplication (large matrices)
1: for $i = 1$ to $n/l_2$ do
2:   for $j = 1$ to $n/l_2$ do
3:     Bring the $i$th $l_2$ rows of matrix A and the $j$th $l_2$ columns of matrix B into the L2 cache, where $l_2 = \text{L2 cache size}/(2n)$.
4:     Use Algorithm 6 to calculate the product of the $i$th $l_2 \times n$ block of A and the $j$th $n \times l_2$ block of B.
5:     Write the resulting $l_2 \times l_2$ submatrix back to main memory.
6:   end for
7: end for

The large matrices algorithm is similar to the small matrices algorithm. The cores first need to bring an $l_2 \times n$ block of matrix A and an $n \times l_2$ block of matrix B into the L2 cache (step 3), using the same rowwise 1-D data partitioning. In step 4, the small matrices algorithm is used to multiply these two blocks of data, which gives an $l_2 \times l_2$ submatrix of C. This submatrix is then written back to main memory (step 5).
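The panel structure of Algorithm 7 can be sketched in a few lines. This is only an illustration of the loop order and the data each step touches: a plain triple-loop product stands in for Algorithm 6, the cache is not modeled, and the sketch assumes that `l2` divides `n`. The function names are hypothetical.

```python
def matmul(X, Y):
    """Plain triple-loop product; stands in for Algorithm 6 on one panel pair."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][c] for k in range(inner)) for c in range(cols)]
            for row in X]


def rowwise_large_matmul(A, B, l2):
    """Multiply two n x n matrices panel by panel, following Algorithm 7."""
    n = len(A)
    assert n % l2 == 0, "sketch assumes l2 divides n"
    C = [[0] * n for _ in range(n)]
    for i in range(0, n, l2):
        a_panel = A[i:i + l2]                        # step 3: l2 x n rows of A
        for j in range(0, n, l2):
            b_panel = [row[j:j + l2] for row in B]   # step 3: n x l2 columns of B
            block = matmul(a_panel, b_panel)         # step 4: l2 x l2 result
            for r in range(l2):                      # step 5: write block of C
                C[i + r][j:j + l2] = block[r]
    return C


if __name__ == "__main__":
    A = [[i * 4 + j for j in range(4)] for i in range(4)]
    B = [[(i + 2 * j) % 5 for j in range(4)] for i in range(4)]
    print(rowwise_large_matmul(A, B, 2) == matmul(A, B))
```

The row panel of A is fetched once per outer iteration and reused across all column panels of B, which matches the $\frac{n}{l_2}\left(\frac{n}{l_2} + 1\right)$ panel loads counted in the memory access analysis.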


The total computation time is calculated in the same way as in the small matrices case, and it is easy to see that the result is the same. That is, $CompTime_{total} = b n^3$ and $CompTime_{cp} = b n^3/p$.

We compute the data transfer on the bus as follows. In step 4, the CPU reads $\frac{l_2}{p l_1} \cdot p \cdot n l_1 \left(\frac{l_2}{l_1} + 1\right) = \frac{l_2^2 n}{l_1} + l_2 n$ data and writes $\frac{l_2}{p l_1} \cdot p \cdot l_1^2 \cdot \frac{l_2}{l_1} = l_2^2$ data through the bus in total, which takes $d\,(l_2^2 n/l_1 + l_2 n + l_2^2)/W$. There are $n^2/l_2^2$ iterations of this data transfer. Therefore, in total,

$$CommTime_{bus} = \frac{n^2}{l_2^2} \cdot \frac{d\,(l_2^2 n/l_1 + l_2 n + l_2^2)}{W} = \frac{d\,(n^3/l_1 + n^3/l_2 + n^2)}{W}$$

The number of shared memory accesses is computed as follows. For main memory, there are $\frac{n}{l_2}\left(\frac{n}{l_2} + 1\right) l_2 n = (n^3 + n^2 l_2)/l_2$ read accesses and $n^2$ write accesses in total, that is, $N_m = (n^3 + 2n^2 l_2)/l_2$. Because each bus transfer and each main memory access is accompanied by an L2 cache access, the number of L2 cache accesses equals the sum of the bus data transfers and the memory accesses. Therefore,

$$N_c = \frac{n^3 + 2n^2 l_2}{l_2} + \frac{n^3}{l_1} + \frac{n^3}{l_2} + n^2 = n^3\left(\frac{1}{l_1} + \frac{2}{l_2}\right) + 3n^2$$

We formulate the same convex problem as in Section 5.5.1 and solve it using a convex optimization solver.

Figure 5-4 shows the minimum energy consumption when scaling the voltage of both cores and bus (cores+bus) and the energy consumption when scaling the voltage of cores only (cores only) for the large matrices case. We vary the slack factor $k$. We can see that the energy consumption for voltage scaling of both cores and bus is 8.6% less than that of cores only. This improvement is smaller than in the small matrices case, because the amount of data transferred to the L2 cache and memory is substantially larger in this case. The energy consumption of the L2 cache and memory becomes a large part of the total energy consumption. Therefore, the potential of voltage scaling of cores and bus is reduced.

5.5.2 Block Matrix Multiplication

5.5.2.1 Problem definition

Given a $p$-core processor in which each core has its own L1 cache and all cores share an L2 cache, we consider the problem of multiplying two $n \times n$ matrices (Matrix A and Matrix B). The two matrices are stored in the computer's main memory. The resulting $n \times n$ matrix (Matrix C)


will be written back to memory. We use a 2-D block-based partitioning algorithm to perform the matrix multiplication. The 2-D block algorithm generates a lower amount of communication for larger matrices on parallel computers. Our expectation is that the same may be true for the target architecture.

Figure 5-4. The minimum energy requirement comparison between CPU-only scaling and CPU+bus scaling for rowwise matrix multiplication (large matrices). Execution time is $k$ times the original parallel execution time at maximum frequency. $n = 32768$, $p = 16$. The L2 cache size is 8 MB. The L1 cache size for each core is 16 KB.

5.5.2.2 Small matrices case

Figure 5-5 shows the process of the 2-D block matrix multiplication algorithm for small matrices. The details of the algorithm for the small matrices case are given in Algorithm 8.

We first compute the total computation time. It is easy to see that the computation time is the same as for the rowwise matrix multiplication algorithm. That is, $CompTime_{total} = b n^3$ and $CompTime_{cp} = b n^3/p$.

Next, we compute the amount of data transferred on the bus. In steps 3-12, the CPU reads $3 r_1^2 \cdot \frac{n}{r_1} \cdot \frac{n^2}{r_1^2} = 3n^3/r_1$ data and writes $2 r_1^2 \cdot \frac{n^2}{r_1^2} = 2n^2$ data through the bus in total, that is, $CommTime_{bus} = d\,(3n^3/r_1 + 2n^2)/W$.
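The block size and the bus transfer expression for the small matrices case can be evaluated directly. The sketch below is illustrative only. It assumes that the L1 cache size is counted in 32-bit data words (so 16 KB corresponds to 4096 words) and that $r_1$ is rounded down to an integer; neither convention is stated in this section, and the function names are hypothetical.

```python
import math
from fractions import Fraction


def block_size_r1(l1_words, n, p):
    """r1 = min{ sqrt(L1 cache size / 2), n / p }, rounded down (assumption)."""
    return min(math.isqrt(l1_words // 2), n // p)


def block_small_bus_time(n, r1, d=32, W=32):
    """Bus transfer time d * (reads + writes) / W for the 2-D block algorithm."""
    blocks = Fraction(n, r1)                  # number of blocks per dimension
    reads = 3 * r1**2 * blocks * blocks**2    # 3 r1^2 * (n/r1) * (n^2/r1^2)
    writes = 2 * r1**2 * blocks**2            # 2 r1^2 * (n^2/r1^2)
    return Fraction(d, W) * (reads + writes)


if __name__ == "__main__":
    n, p = 1024, 16
    r1 = block_size_r1(4096, n, p)            # 16 KB of 32-bit words (assumption)
    print(r1, float(block_small_bus_time(n, r1)))
```

Using exact fractions keeps the term-by-term count identical to the closed form $d\,(3n^3/r_1 + 2n^2)/W$ even when $r_1$ does not divide $n$.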


Figure 5-5. 2-D block matrix multiplication (small matrices). Each input matrix is partitioned into $r_1 \times r_1$ submatrices. The shaded submatrices are multiplied correspondingly and summed into the resulting shaded $r_1 \times r_1$ submatrix in C.

Algorithm 8 2-D block matrix multiplication (small matrices)
1: Bring the two matrices A and B into the shared L2 cache.
2: Partition the two matrices into multiple $r_1 \times r_1$ submatrices, and label them in a 2-D fashion (Figure 5-5), where $r_1 = \min\{\sqrt{\text{L1 cache size}/2},\ n/p\}$.
3: for $i = 1$ to $n/r_1$ do
4:   for $j = 1$ to $n/r_1$ do
5:     for $L = 1$ to $n/(r_1 p)$ do
6:       for $t = 1$ to $p$ do
7:         Core $t$ brings $A_{i,Lt}$ and $B_{Lt,j}$ into its L1 cache and performs the matrix multiplication in parallel.
8:         Write the temporary result matrix back to the L2 cache.
9:       end for
10:     end for
11:     Bring all the temporary matrices into the L1 cache and sum them in parallel into the resulting $C_{i,j}$. Write $C_{i,j}$ back into memory.
12:   end for
13: end for
14: Write the results of matrix C back to main memory.

The number of shared memory accesses is as follows. For main memory, there are $2n^2$ read accesses and $n^2$ write accesses in total, that is, $N_m = 3n^2$. Since each bus transfer and each main memory access corresponds to an L2 cache access, the number of L2 cache accesses equals the sum of the bus data transfers and the memory accesses. Therefore, $N_c = 3n^3/r_1 + 5n^2$.

Figure 5-6 shows the energy consumption when scaling the voltage of both cores and bus (cores+bus) and the energy consumption when scaling the voltage of cores only (cores


only). We can see that when $k$ is 1.7, the energy reduction is 9.2%. This reduction is smaller than that obtained with rowwise (or 1-D block) matrix multiplication. This can mainly be attributed to the fact that block matrix multiplication requires less communication time than rowwise matrix multiplication. Thus, the advantage of bus voltage scaling is smaller.

Figure 5-6. The minimum energy requirement comparison between CPU-only scaling and CPU+bus scaling for block matrix multiplication (small matrices). Execution time is $k$ times the original parallel execution time at maximum frequency. $n = 1024$, $p = 16$. The L2 cache size is 8 MB. The L1 cache size for each core is 16 KB.

5.5.2.3 Large matrices case

The algorithm for the large matrices case using 2-D block decomposition is illustrated in Figure 5-7. The details of the algorithm are shown in Algorithm 9.

It is easy to see that the computation time is the same as in the small matrices case. That is, $CompTime_{total} = b n^3$ and $CompTime_{cp} = b n^3/p$.

The amount of data transferred on the bus can be computed as follows. In steps 6-8, the data transfer on the bus is the same as in Algorithm 8 if we substitute $r_2$ for $n$, that is, $d\,(3 r_2^3/r_1 + 2 r_2^2)/W$. There are $n/r_2$ iterations in steps 4-10. Therefore, the data transfer in these steps is $d\,[(3 r_2^3/r_1 + 2 r_2^2)\, n/r_2]/W$. In step 11, there are $d\,(r_2^2\, n/r_2 + r_2^2)/W$ data transfers on the bus. There are $n^2/r_2^2$ iterations of the above steps. Thus, in total,

$$CommTime_{bus} = \frac{d}{W}\left\{\frac{n^2}{r_2^2}\left[\left(\frac{3 r_2^3}{r_1} + 2 r_2^2\right)\frac{n}{r_2} + r_2^2\,\frac{n}{r_2} + r_2^2\right]\right\} = \frac{d\,(3n^3/r_1 + 3n^3/r_2 + n^2)}{W}$$
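The simplification of the bus transfer expression for the large matrices case can be checked mechanically. The sketch below, with hypothetical function names, accumulates the per-step terms exactly as described above (without the common factor $d/W$) and compares the sum against the closed form using exact rational arithmetic. The sample block sizes are arbitrary test values, not parameters from the experiments.

```python
from fractions import Fraction


def block_large_bus_words(n, r1, r2):
    """Sum the bus transfers of the large-matrix 2-D block algorithm step by step."""
    per_inner = 3 * Fraction(r2**3, r1) + 2 * r2**2   # steps 6-8, one iteration
    steps_4_to_10 = per_inner * Fraction(n, r2)       # n / r2 iterations
    step_11 = r2**2 * Fraction(n, r2) + r2**2         # gather temporaries, write C
    outer_iterations = Fraction(n, r2) ** 2           # n^2 / r2^2 (I, J) pairs
    return outer_iterations * (steps_4_to_10 + step_11)


def block_large_bus_words_closed_form(n, r1, r2):
    """Closed form 3 n^3 / r1 + 3 n^3 / r2 + n^2."""
    return 3 * Fraction(n**3, r1) + 3 * Fraction(n**3, r2) + n**2


if __name__ == "__main__":
    for n, r1, r2 in [(32768, 45, 1024), (4096, 32, 512), (1000, 7, 90)]:
        lhs = block_large_bus_words(n, r1, r2)
        rhs = block_large_bus_words_closed_form(n, r1, r2)
        print(n, r1, r2, lhs == rhs)
```

Multiplying either value by $d/W$ gives $CommTime_{bus}$ in bus clock cycles.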


Figure 5-7. 2-D block matrix multiplication (large matrices). Each input matrix is partitioned into $r_2 \times r_2$ submatrices. The shaded submatrices are multiplied correspondingly using Algorithm 8 and summed into the resulting shaded $r_2 \times r_2$ submatrix in C.

Algorithm 9 2-D block matrix multiplication (large matrices)
1: Partition the two matrices into multiple $r_2 \times r_2$ submatrices, and label them in a 2-D fashion (Figure 5-7), where $2 r_2^2 \le \text{L2 cache size}$.
2: for $I = 1$ to $n/r_2$ do
3:   for $J = 1$ to $n/r_2$ do
4:     for $L = 1$ to $n/(r_2 p)$ do
5:       for $t = 1$ to $p$ do
6:         Core $t$ brings $A_{I,Lt}$ and $B_{Lt,J}$ into the L2 cache.
7:         Apply Algorithm 8 to the multiplication of $A_{I,Lt}$ and $B_{Lt,J}$.
8:         Write the temporary result matrix back into main memory.
9:       end for
10:     end for
11:     Bring all the temporary matrices into the L2 cache and sum them in parallel into the resulting $C_{I,J}$. Write $C_{I,J}$ back into memory.
12:   end for
13: end for

The number of shared memory accesses can be derived as follows. For main memory, there are $3 r_2^2\, n/(r_2 B) = 3 n r_2/B$ memory accesses in steps 4-10. There are also $r_2^2\, n/(r_2 B) = n r_2/B$ memory accesses in step 11. In total, $N_m = (4n^3/r_2 + n^2)/B$. Each bus transfer and each main memory access corresponds to an L2 cache access, so the number of L2 cache accesses equals the sum of the bus data transfers and the memory accesses. Therefore, $N_c = 3n^3/r_1 + 7n^3/r_2 + 2n^2$.

Figure 5-8 shows the comparison of the energy consumption when scaling the voltage of both cores and bus (cores+bus) and the energy consumption when scaling the voltage of cores


only (cores only). We can see that when $k = 1.6$, the energy reduction is as large as 11.9%.

Figure 5-8. The minimum energy requirement comparison between CPU-only scaling and CPU+bus scaling for block matrix multiplication (large matrices). Execution time is $k$ times the original parallel execution time at maximum frequency. $n = 32768$, $p = 16$. The L2 cache size is 8 MB. The L1 cache size for each core is 16 KB.

5.6 Summary

In this chapter, we propose a general methodology for energy estimation of bus-based multi-core processors, assuming that DVS can be used for both buses and cores. Our formulation can provide tradeoffs between the DVS settings for buses and cores. We examine this methodology using various parallel matrix multiplication algorithms that are suitable for shared memory multicore machines with L1 and L2 caches. Our simulation results show that simultaneously changing the voltage of the buses along with the cores can result in a 10-20% reduction in the overall energy requirements as compared to changing only the core voltages. This is under the assumption that sufficient slack is available for DVS to be able to work at lower voltages to save energy. The methods proposed in this chapter demonstrate the usefulness of multiple-element optimization in multicore architectures. They show that a good understanding of the


overall tradeoffs between the effects of these elements on the overall performance and energy requirements can lead to improved results in the energy requirements.


CHAPTER 6
CONCLUSIONS AND FUTURE WORK

Both the power and the heat density of on-chip systems are increasing exponentially with Moore's Law. High temperature negatively affects reliability as well as the costs of cooling and packaging. In this chapter, we review our main contributions.

In Chapter 3, we proposed a compact matrix model that can be used to derive new temperature profiles for a given set of cores on a processor and their voltages. Experimental results show that the compact matrix model is an order of magnitude faster than the HotSpot model. We utilized this matrix thermal model to develop novel thermal-aware greedy slack allocation algorithms for dependent tasks represented as a Directed Acyclic Graph (DAG). The greedy slack allocation algorithm can reduce the peak temperature by as much as 13.68°C compared with the uniform slack allocation algorithm. We also utilized the matrix thermal model to optimize the workload distribution under given thermal constraints. Experimental results show that our algorithm is an order of magnitude faster than the standard linear programming method.

In Chapter 4, we proposed to partition hot tasks into multiple subtasks and interleave these subtasks with cool tasks to reduce the overall maximum temperature. We proposed a heuristic task partitioning algorithm that uses cool tasks to interleave hot tasks for a periodic set of tasks with a common period. We proposed another heuristic task partitioning algorithm for a periodic set of tasks with individual periods. Experimental results show that our task partitioning algorithms can reduce the peak temperature by as much as 6°C compared with the task sequencing algorithm and the EDF scheduling algorithm. We also proposed a task partitioning and scheduling algorithm for a set of independent tasks on multicore processors. Experimental results show that our algorithm can reduce the peak temperature by up to 11.68°C compared with the Min-Min algorithm and reduce the makespan by up to 40% compared with the PDTM algorithm.


In Chapter 5, we proposed a general methodology to minimize the energy consumption of bus-based multicore processors with a full memory hierarchy. The methodology can analyze the dynamic voltage settings for the bus and the cores separately based on the chip temperature, hardware configuration and memory hierarchy. The method estimates the energy consumption of each component and obtains tradeoffs between the globally optimal energy and different performance constraints. The experimental results show that our method can reduce the energy consumption by 10-20% compared with algorithms that set the core voltages only.

Task partitioning is very suitable for real-time tasks because task partitioning algorithms do not necessarily need to extend the execution time of tasks. In the future, we can develop task partitioning algorithms on multicore processors for a set of tasks with individual periods. For temporal heat balance, the task partitioning algorithm can begin with the schedule generated by the EDF algorithm and assign the hot tasks as slacks to other cool tasks. The algorithms also need to consider the spatial heat balance on multicore processors to achieve a lower peak temperature on the whole chip.


REFERENCES

[1] Andrei, A., Schmitz, M., Eles, P., Peng, Z., and Al Hashimi, B.M. Simultaneous communication and processor voltage scaling for dynamic and leakage energy reduction in time-constrained systems. ICCAD. IEEE, 2004, 362.
[2] ARM. Cortex-A Series. 2005. URL http://www.arm.com/products/processors/cortex-a/index.php
[3] Ayoub, R.Z. and Rosing, T.S. Predict and act: dynamic thermal management for multi-core processors. ISLPED. 2009, 99.
[4] Bini, E., Buttazzo, G., and Lipari, G. Minimizing CPU energy in real-time systems with discrete speed management. ACM Transactions on Embedded Computing Systems (TECS) 8 (2009). 4: 31.
[5] Brooks, D and Martonosi, M. Dynamic thermal management for high-performance microprocessors. HPCA. Published by the IEEE Computer Society, 2001, 0171.
[6] Brooks, D, Tiwari, V, and Martonosi, M. Wattch: a framework for architectural-level power analysis and optimizations. ACM SIGARCH Computer Architecture News 28 (2000). 2: 94.
[7] Burd, T.D. and Brodersen, R.W. Energy efficient CMOS microprocessor design. Proceedings of the Twenty-Eighth Hawaii International Conference on System Sciences. vol. 1. IEEE, 1995, 288.
[8] Burd, T.D., Pering, T.A., Stratakos, A.J., and Brodersen, R.W. A dynamic voltage scaled microprocessor system. IEEE Journal of Solid-State Circuits 35 (2000). 11: 1571.
[9] Chandrakasan, A.P., Sheng, S., and Brodersen, R.W. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits 27 (1992). 4: 473.
[10] Chantem, T., Dick, R.P., and Hu, X.S. Temperature-aware scheduling and assignment for hard real-time applications on MPSoCs. Design, Automation and Test in Europe (DATE). IEEE, 2008, 288.
[11] Choi, J., Cher, C.Y., Franke, H., Hamann, H., Weger, A., and Bose, P. Thermal-aware task scheduling at the system software level. ISLPED. ACM, 2007, 213.
[12] Chrobak, M., Durr, C., Hurand, M., and Robert, J. Algorithms for temperature-aware task scheduling in microprocessor systems. Sustainable Computing: Informatics and Systems (2011).
[13] Cochran, R. and Reda, S. Spectral techniques for high-resolution thermal characterization with limited sensor data. DAC. 2009, 478.


[14] Coskun, A.K., Rosing, T.S., and Gross, K.C. Proactive temperature management in MPSoCs. ISLPED. 2008, 165.
[15] Coskun, A.K., Rosing, T.S., Whisnant, K.A., and Gross, K.C. Temperature-aware MPSoC scheduling for reducing hot spots and gradients. ASPDAC. 2008, 49.
[16] Coskun, A.K., Rosing, TT, Whisnant, K.A., and Gross, K.C. Static and dynamic temperature-aware scheduling for multiprocessor SoCs. IEEE Transactions on VLSI 16 (2008). 9: 1127.
[17] Cui, J. and Maskell, D.L. Dynamic thermal-aware scheduling on chip multiprocessor for soft real-time system. Proceedings of the 19th ACM Great Lakes symposium on VLSI. 2009, 393.
[18] Cui, J. and Maskell, D.L. High level event driven thermal estimation for thermal aware task allocation and scheduling. Proceedings of the 2010 Asia and South Pacific Design Automation Conference. 2010, 793.
[19] Donald, J and Martonosi, M. Power efficiency for variation-tolerant multicore processors. Proceedings of the international symposium on Low power electronics and design (ISLPED). ACM, 2006, 309.
[20] Ebi, T, Al Faruque, MA, and Henkel, J. TAPE: thermal-aware agent-based power economy for multi/many-core architectures. ICCAD. ACM, 2009, 302.
[21] Fan, X., Ellis, C., and Lebeck, A. The synergy between power-aware memory systems and processor voltage scaling. Power-Aware Computer Systems (2005): 151.
[22] Fisher, N., Chen, J.J., Wang, S., and Thiele, L. Thermal-aware global real-time scheduling on multicore systems. RTAS. 2009, 131.
[23] Freund, R.F., Gherrity, M., Ambrosius, S., Campbell, M., Halderman, M., Hensgen, D., Keith, E., Kidd, T., Kussow, M., Lima, J.D., et al. Scheduling resources in multi-user, heterogeneous, computing environments with SmartNet. HCW. IEEE, 1998, 184.
[24] Fu, X., Wang, X., and Puster, E. Dynamic thermal and timeliness guarantees for distributed real-time embedded systems. RTCSA. 2009, 403.
[25] Gomaa, M., Powell, M.D., and Vijaykumar, TN. Heat-and-run: leveraging SMT and CMP to manage power density through the operating system. ACM SIGARCH Computer Architecture News. vol. 32. 2004, 260.
[26] Gunther, S., Binns, F., Carmean, D.M., and Hall, J.C. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal 5 (2001). 1: 1.


[27] Guthaus, MR, Ringenberg, JS, Ernst, D, Austin, TM, Mudge, T, and Brown, RB. MiBench: A free, commercially representative embedded benchmark suite. IEEE International Workshop on Workload Characterization (WWC-4). 2001, 3.
[28] Hanumaiah, V., Rao, R., Vrudhula, S., and Chatha, K.S. Throughput optimal task allocation under thermal constraints for multi-core processors. Design Automation Conference (DAC). IEEE/ACM, 2009, 776.
[29] Huang, M, Renau, J, Yoo, SM, and Torrellas, J. A framework for dynamic energy efficiency and temperature management. Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture. 2000, 202.
[30] IBM. ILOG CPLEX Optimizer, 2009. Www.ibm.com/software/integration/optimization/cplex-optimizer.
[31] Isci, C, Buyuktosunoglu, A, Cher, CY, Bose, P, and Martonosi, M. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. 2006, 347.
[32] Iyer, A. and Marculescu, D. Power and performance evaluation of globally asynchronous locally synchronous processors. ACM SIGARCH Computer Architecture News. vol. 30. IEEE Computer Society, 2002, 158.
[33] Jayaseelan, R and Mitra, T. Temperature aware task sequencing and voltage scaling. Proceedings of IEEE/ACM International Conference on Computer-Aided Design. IEEE Press, 2008, 618.
[34] JEDEC. Failure mechanisms and models for semiconductor devices. 2002. URL http://www.jedec.org/standards-documents/docs/jep-122e
[35] Jejurikar, R., Pereira, C., and Gupta, R. Leakage aware dynamic voltage scaling for real-time embedded systems. Design Automation Conference, 2004. Proceedings. 41st. IEEE, 2004, 275.
[36] Kadin, M and Reda, S. Frequency planning for multi-core processors under thermal constraints. Proceeding of the 13th international symposium on Low power electronics and design (ISLPED). ACM, 2008, 213.
[37] Kang, J., Adviser-Ranka, S., and Adviser-Sahni, S. Scheduling algorithms for energy minimization. University of Florida, 2008.
[38] Kang, J. and Ranka, S. DVS based energy minimization algorithm for parallel machines. IEEE International Symposium on Parallel and Distributed Processing (IPDPS). 2008, 1.


[39] Kang, J. and Ranka, S. Dynamic Algorithms for Energy Minimization on Parallel Machines. IEEE 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP). 2008, 399.
[40] Khan, O. and Kundu, S. Hardware/software co-design architecture for thermal management of chip multiprocessors. DATE. 2009, 952.
[41] Kim, N.S., Austin, T., Blaauw, D., Mudge, T., Flautner, K., Hu, J.S., Irwin, M.J., Kandemir, M., and Narayanan, V. Leakage current: Moore's law meets static power. Computer 36 (2003). 12: 68.
[42] Kontorinis, V, Shayan, A, Kumar, R, and Tullsen, DM. Reducing Peak Power with a Table-Driven Adaptive Processor Core. MICRO (2009).
[43] Korthikanti, V.A. and Agha, G. Towards optimizing energy costs of algorithms for shared memory architectures. Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures. ACM, 2010, 157.
[44] Kumar, V., Grama, A., Gupta, A., and Karypis, G. Introduction to parallel computing, vol. 110. Benjamin/Cummings USA, 1994.
[45] Kursun, E. and Cher, C.Y. Temperature variation characterization and thermal management of multicore architectures. IEEE Micro 29 (2009). 1: 116.
[46] Lee, C, Potkonjak, M, and Mangione-Smith, WH. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture. 1997, 330.
[47] Lee, H.G., Chang, N., Ogras, U.Y., and Marculescu, R. On-chip communication architecture exploration: A quantitative evaluation of point-to-point, bus, and network-on-chip approaches. ACM Transactions on Design Automation of Electronic Systems (TODAES) 12 (2007). 3: 23.
[48] Lee, J.S., Skadron, K., and Chung, S.W. Predictive temperature-aware DVFS. IEEE Transactions on Computers 59 (2010). 1: 127.
[49] Li, D., Chang, H.C., Pyla, H.K., and Cameron, K.W. System-level, thermal-aware, fully-loaded process scheduling. Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on. 2008, 1.
[50] Li, S., Ahn, J.H., Strong, R.D., Brockman, J.B., Tullsen, D.M., and Jouppi, N.P. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2009, 469.
[51] Liu, C., Sivasubramaniam, A., and Kandemir, M. Optimizing bus energy consumption of on-chip multiprocessors using frequent values. Journal of Systems Architecture 52 (2006). 2: 129.


[52] Liu, Y., Yang, H., Dick, R.P., Wang, H., and Shang, L. Thermal vs energy optimization for dvfs-enabled processors in embedded systems. Quality Electronic Design, 2007. ISQED'07. 8th International Symposium on. IEEE, 2007, 204.
[53] Maheswaran, M., Ali, S., Siegel, HJ, Hensgen, D., and Freund, R.F. Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems. HCW. IEEE, 1999, 30.
[54] Marcus, M. and Minc, H. A Survey of Matrix Theory and Matrix Inequalities. Boston, MA: Allyn and Bacon, 1964.
[55] Martin, T.L. and Siewiorek, D.P. Nonideal battery and main memory effects on CPU speed-setting for low power. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 9 (2001). 1: 29.
[56] Mukherjee, R. and Memik, S.O. Systematic temperature sensor allocation and placement for microprocessors. DAC. ACM, 2006, 542.
[57] Murali, S., Mutapcic, A., Atienza, D., Gupta, R., Boyd, S., Benini, L., and De Micheli, G. Temperature control of high-performance multi-core platforms using convex optimization. DATE. 2008, 110.
[58] Murali, S., Mutapcic, A., Atienza, D., Gupta, R., Boyd, S., and De Micheli, G. Temperature-aware processor frequency assignment for MPSoCs using convex optimization. CODES+ISSS. 2007, 111.
[59] Pillai, P and Shin, KG. Real-time dynamic voltage scaling for low-power embedded operating systems. Proceedings of the eighteenth ACM symposium on Operating systems principles. ACM, 2001, 102.
[60] Pouwelse, J., Langendoen, K., and Sips, H. Dynamic voltage scaling on a low-power microprocessor. Proceedings of the 7th annual international conference on Mobile computing and networking. 2001, 251.
[61] Puschini, D., Clermidy, F., Benoit, P., Sassatelli, G., and Torres, L. Temperature-aware distributed run-time optimization on MP-SoC using game theory. ISVLSI. 2008, 375.
[62] Rao, R and Vrudhula, S. Efficient online computation of core speeds to maximize the throughput of thermally constrained multi-core processors. IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 2008, 537.
[63] Rao, R., Vrudhula, S., and Chakrabarti, C. Throughput of multi-core processors under thermal constraints. Proceedings of the international symposium on Low power electronics and design (ISLPED). ACM, 2007, 206.
[64] SEGGER. Context switching time. 2006.


URL http://www.segger.com/cms/context-switching-time.html
[65] Semenov, O, Vassighi, A, and Sachdev, M. Impact of self-heating effect on long-term reliability and performance degradation in CMOS circuits. IEEE Transactions on Device and Materials Reliability 6 (2006): 17.
[66] Semeraro, G., Albonesi, D.H., Dropsho, S.G., Magklis, G., Dwarkadas, S., and Scott, M.L. Dynamic frequency and voltage control for a multiple clock domain microarchitecture. Proceedings. 35th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2002, 356.
[67] Sharifi, S., Liu, C., and Rosing, T.S. Accurate temperature estimation for efficient thermal management. ISQED. 2008, 137.
[68] Skadron, K., Stan, M.R., Huang, W., Velusamy, S., Sankaranarayanan, K., and Tarjan, D. Temperature-aware computer systems: Opportunities and challenges. Micro, IEEE 23 (2003). 6: 52.
[69] Skadron, K., Stan, M.R., Huang, W., Velusamy, S., Sankaranarayanan, K., and Tarjan, D. Temperature-aware microarchitecture. ACM SIGARCH Computer Architecture News. vol. 31. ACM, 2003, 2.
[70] Skadron, K., Stan, M.R., Sankaranarayanan, K., Huang, W., Velusamy, S., and Tarjan, D. Temperature-aware microarchitecture: Modeling and implementation. ACM Transactions on Architecture and Code Optimization (TACO) 1 (2004). 1: 94.
[71] Sotiriadis, P.P. and Chandrakasan, A.P. A bus energy model for deep submicron technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 10 (2002). 3: 341.
[72] Stavrou, K. and Trancoso, P. Thermal-aware scheduling: a solution for future chip multiprocessors thermal problems. IEEE 9th EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools (DSD). 2006, 123.
[73] Stavrou, K. and Trancoso, P. Thermal-aware scheduling for future chip multiprocessors. EURASIP Journal on Embedded Systems 2007 (2007). 1: 40.
[74] Suresh, D.C., Agrawal, B., Yang, J., and Najjar, W.A. Tunable and energy efficient bus encoding techniques. IEEE Transactions on Computers 58 (2009). 8: 1049.
[75] Traub, J.F. and Wozniakowski, H. Complexity of Linear Programming. Polynomial-Time Workshop (1980).
[76] Venkatachalam, V. and Franz, M. Power reduction techniques for microprocessor systems. ACM computing surveys 37 (2005). 3: 195.


[77] Wang, Z. and Ranka, S. A simple thermal model for multi-core processors and its application to slack allocation. IEEE International Symposium on Parallel & Distributed Processing (IPDPS). 2010, 1.
[78] Xie, Y and Hung, WL. Temperature-aware task allocation and scheduling for embedded multiprocessor systems-on-chip (MPSoC) design. The Journal of VLSI Signal Processing 45 (2006). 3: 177.
[79] Yang, J., Zhou, X., Chrobak, M., Zhang, Y., and Jin, L. Dynamic thermal management through task scheduling. ISPASS. 2008, 191.
[80] Yeo, I., Liu, C.C., and Kim, E.J. Predictive dynamic thermal management for multicore systems. DAC. 2008, 734.
[81] Zhang, S and Chatha, KS. Approximation algorithm for the temperature-aware scheduling problem. Proceedings of IEEE/ACM international conference on Computer-aided design. IEEE Press, 2007, 281.


BIOGRAPHICAL SKETCH

Zhe Wang received his B.S. degree in Electronics and Information Engineering from Huazhong University of Science and Technology, Wuhan, China, in 2007. He received his Ph.D. from the Department of Computer and Information Science and Engineering at the University of Florida, Gainesville, FL, United States, in 2012. His research topic is thermal/energy-aware task scheduling on multicore processors.