Tag Management Architecture and Policies for Hardware-Managed Translation Lookaside Buffers in Virtualized Platforms


Material Information

Title:
Tag Management Architecture and Policies for Hardware-Managed Translation Lookaside Buffers in Virtualized Platforms
Physical Description:
1 online resource (164 p.)
Language:
english
Creator:
Venkatasubramanian, Girish
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Figueiredo, Renato J
Committee Members:
Li, Tao
Boykin, P. Oscar
Fortes, Jose A
Mishra, Prabhat

Subjects

Subjects / Keywords:
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre:
Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
The use of virtualization to effectively harness the power of multi-core processors has emerged as a viable solution to meet the growing demand for computing resources, especially in the server segment of the computing industry. However, two significant issues in using virtualization for performance-critical workloads are: 1. the overhead of virtualization, which adversely impacts the performance of such virtualized workloads, and 2. the "noise" or variation in the performance of these virtualized workloads due to the platform resources being shared amongst multiple virtual machines (VMs). Thus, improving the performance of virtualized workloads and reducing the performance variations introduced by the sharing of platform resources are two challenges in the field of virtualization. Meeting these challenges, specifically in the context of hardware-managed Translation Lookaside Buffers (TLBs), forms the theme of this dissertation. To understand the performance impact of the TLB and to investigate the performance improvement due to various architectural modifications, a suitable simulation framework is imperative. Hence, the first contribution of this dissertation is developing a full-system execution-driven simulation framework supporting the x86 ISA and detailed TLB functional and timing models. Using this framework, it is observed that the performance of typical server workloads is reduced by as much as 8% to 35% due to the TLB misses on virtualized platforms, compared to the 1% to 5% reduction on non-virtualized single-O/S platforms. This clearly motivates the need for improving the TLB performance for virtualized workloads. The second part of this dissertation proposes the Tag Manager Table (TMT) for generating and managing process-specific tags for hardware-managed TLBs, in a software-transparent manner.
By tagging the TLB entries with process-specific identifiers, multiple processes can share the TLB, thereby avoiding TLB flushes that are triggered during context switches. Using the TMT reduces the TLB miss rates by 65% to 90% and the TLB-induced delay by 50% to 80% compared to a TLB without tags, thereby improving workload performance by 4.5% to 25%. The effect of various factors including the TLB and TMT design parameters, the workload characteristics and the TLB miss penalty on the benefit of using the TMT is explored. The use of the TMT in enabling shared Last Level TLBs is also investigated. Furthermore, the use of the TMT to tag I/O TLBs, in scenarios where address translation services and TLBs in the I/O fabric allow I/O devices to operate in virtual address space, is also explored. While the TMT enables multiple processes to share a TLB, this results in the TLB becoming a potential source of contention. The third part of this dissertation investigates the performance implications of such TLB contention and proposes the CShare TLB architecture to isolate the TLB behavior of virtualized workloads from one another using a TLB Sharing Table (TST) along with the TMT. The use of the CShare TLB in increasing the overall performance of consolidated workloads involving streaming applications with poor TLB usage as well as in selectively increasing the performance of a high priority workload by restricting the TLB usage of low priority workloads is explored. It is observed that the increase in the performance of a high priority workload due to using the TMT without controlled sharing can be further improved by 1.4x using such TLB usage restrictions. The use of dynamic usage control policies to achieve this selective performance increase while minimizing the performance reduction of the low priority workloads is also investigated.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Girish Venkatasubramanian.
Thesis:
Thesis (Ph.D.)--University of Florida, 2011.
Local:
Adviser: Figueiredo, Renato J.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2011
System ID:
UFE0042725:00001




Full Text

PAGE 1

TAG MANAGEMENT ARCHITECTURE AND POLICIES FOR HARDWARE-MANAGED TRANSLATION LOOKASIDE BUFFERS IN VIRTUALIZED PLATFORMS

By

GIRISH VENKATASUBRAMANIAN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2011

PAGE 2

© 2011 Girish Venkatasubramanian

PAGE 3

ACKNOWLEDGMENTS

My heartfelt gratitude and thanks are due to my advisor Dr. Renato J. Figueiredo for supporting, encouraging and guiding me in my academic journey culminating in the PhD degree. His patience and guidance, especially during the initial years, gave me the confidence to persevere. Learning from him about computer architecture and systems, virtualization, the art of research, techniques for good writing and strategies for creating good presentations has been a wonderful experience. I am privileged to have him as my advisor and mentor. I thank Dr. P. Oscar Boykin for teaching me techniques of analytical modeling and for the invigorating discussions on applying engineering principles to solve real-world problems. I am grateful to Dr. Jose Fortes for giving me an opportunity to be a part of the ACIS Lab at the University of Florida and for sharing his insight and perspective on research and the PhD process. I also thank Dr. Tao Li and Dr. Prabhat Mishra for serving on my committee and for their insightful questions and suggestions which have enhanced this dissertation. A good portion of my computer architecture knowledge and simulation skills were learned and honed during my internships at Intel Corporation. I thank Ramesh Illikkal, Greg Regnier, Donald Newell and Dr. Ravi Iyer for giving me these opportunities and Nilesh Jain, Jaideep Moses, Dr. Omesh Tickoo and Paul M. Stillwell Jr for helping me complete these internships successfully. I also thank the members of the SoC Platform and Architecture group at Intel Labs for their ideas and perspectives on my research. I am especially thankful to Dr. Omesh Tickoo for being a wonderful mentor during and after my internship. I would also like to thank my past and present colleagues at ACIS Labs and at University of Florida including Priya Bhat, Dr. Vineet Chadha, Dr. Arijit Ganguly, Dr. Clay Hughes, Selvi Kadirvel, Dr. Andrea Matsunaga, Dr. James M. Poe II, Prapaporn Rattanatamrong, Pierre St. Juste, Dr. Mauricio Tsugawa and David Wolinsky for their

PAGE 4

help and feedback on my work and for the many intellectual discussions on computer architecture, computer networks, modeling and simulation. This work was funded in part by the National Science Foundation under CRI collaborative awards 0751112, 0750847, 0750851, 0750852, 0750860, 0750868, 0750884, and 0751091 and by a grant from Intel Corporation. I would also like to acknowledge the University of Florida High-Performance Computing Center for computation resources. I also thank Virtutech for their support in using Simics and Naveen Neelakantam from the University of Illinois at Urbana-Champaign for his help with using FeS2. My motivation to obtain a PhD was inspired by my parents, Dr. N. K. Venkatasubramanian and Prabhavathy Venkatasubramanian, and my uncle Vaidyanathan. They, along with my sister Dr. Chitra Venkatasubramanian and my brother-in-law Murthy S. Krishna, have been a source of encouragement and support without which this dissertation would not have been completed. I thank them and dedicate this dissertation to them.

PAGE 5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS .................................. 3
LIST OF TABLES ..................................... 8
LIST OF FIGURES .................................... 9
ABSTRACT ......................................... 11

CHAPTER

1 INTRODUCTION .................................... 13
  1.1 Hardware-Managed TLBs in Virtualized Environments ........... 14
  1.2 Contributions of the Dissertation ........................ 15
    1.2.1 Simulation-Based Analysis of the TLB Performance on Virtualized Platforms ... 16
    1.2.2 Tag Manager Table for Process-Specific Tagging of the TLB .... 17
    1.2.3 Mechanisms and Policies for TLB Usage Control ........... 18
  1.3 Outline of the Dissertation ............................ 19

2 BACKGROUND: VIRTUAL MEMORY AND PLATFORM VIRTUALIZATION .. 21
  2.1 Virtual Memory in Non-Virtualized Systems .................. 22
    2.1.1 Implementing Virtual Memory Using Paging .............. 23
    2.1.2 Address Translation in x86 with Page Address Extension Enabled 24
  2.2 Translation Lookaside Buffer ........................... 26
  2.3 Virtual Memory in Virtualized Systems ..................... 28
    2.3.1 Full-System Virtualization and Shadow Page Tables ........ 29
    2.3.2 Paravirtualization and Page Tables ................... 30
    2.3.3 Hardware Virtualization and Two-Level Page Tables ........ 31
  2.4 Summary ...................................... 33

3 A SIMULATION FRAMEWORK FOR THE ANALYSIS OF TLB PERFORMANCE 34
  3.1 Survey of Simulation Frameworks Used in TLB-Related Research .... 35
  3.2 Developing the Simulation Framework ..................... 36
    3.2.1 Using Simics and FeS2 as Foundation ................. 37
    3.2.2 TLB Functional Model ........................... 38
    3.2.3 Validation of the TLB Functional Model ................ 39
    3.2.4 TLB Timing Model ............................. 40
    3.2.5 Validating the TLB Timing Model .................... 42
  3.3 Selection and Preparation of Workloads .................... 45
    3.3.1 Workload Applications .......................... 46
    3.3.2 Consolidated Workloads ......................... 46
    3.3.3 Multiprocessor Workloads ........................ 47

PAGE 6

    3.3.4 Checkpointing Workloads ........................ 48
  3.4 Evaluation of the Simulation Framework ................... 48
  3.5 Using the Framework to Investigate TLB Behavior in Virtualized Platforms 51
    3.5.1 Increase in TLB Flushes on Virtualization .............. 53
    3.5.2 Increase in TLB Miss Rate on Virtualization ............. 54
    3.5.3 Decrease in Workload Performance on Virtualization ....... 56
      3.5.3.1 I/O-intensive workloads .................... 57
      3.5.3.2 Memory-intensive workloads ................. 60
      3.5.3.3 Consolidated workloads .................... 61
    3.5.4 Impact of Architectural Parameters on TLB Performance ..... 63
  3.6 Summary ...................................... 65

4 A TLB TAG MANAGEMENT FRAMEWORK FOR VIRTUALIZED PLATFORMS 66
  4.1 Current State of the Art in Improving TLB Performance .......... 66
  4.2 Architecture of the Tag Manager Table .................... 68
    4.2.1 Avoiding Flushes Using the Tag Manager Table ........... 70
    4.2.2 TLB Lookup and Miss Handling Using the Tag Manager Table .. 72
  4.3 Modeling the Tag Manager Table ........................ 74
  4.4 Impact of the Tag Manager Table ........................ 74
    4.4.1 Reduction in TLB Flushes Due to the TMT .............. 74
    4.4.2 Reduction in TLB Miss Rate Due to the TMT ............. 79
    4.4.3 Increase in Workload Performance Due to the TMT ........ 82
  4.5 Architectural and Workload Parameters Affecting the Impact of the TMT 88
    4.5.1 Architectural Parameters ......................... 88
    4.5.2 Workload Parameters ........................... 88
      4.5.2.1 Effect of larger memory footprint .............. 89
      4.5.2.2 Effect of the number of processes in the workload ... 91
    4.5.3 Sensitivity Analysis ............................ 94
  4.6 Comparison of Process-Specific and Domain-Specific Tags ....... 96
  4.7 Using the Tag Manager Table on Non-Virtualized Platforms ....... 97
  4.8 Enabling Shared Last Level TLBs Using the Tag Manager Table .... 99
    4.8.1 Using the TMT as the Tagging Framework .............. 100
    4.8.2 Architecture of the Shared LL TLB ................... 101
    4.8.3 Miss Rate Improvement Due to Shared Last Level TLBs ..... 104
  4.9 Summary ...................................... 106

5 CONTROLLED SHARING OF HARDWARE-MANAGED TLB ........... 107
  5.1 Motivation ..................................... 109
  5.2 Architecture of the CShare TLB ......................... 111
  5.3 Experimental Framework ............................. 115
  5.4 Performance Isolation using CShare Architecture .............. 115
  5.5 Performance Enhancement Using CShare Architecture ........... 119
    5.5.1 Classification of TLB Usage Patterns ................. 119
    5.5.2 Performance Improvement With Static TLB Usage Control .... 122

PAGE 7

    5.5.3 Selective Performance Improvement With Static TLB Usage Control 127
    5.5.4 Performance Improvement With Dynamic TLB Usage Control ... 131
  5.6 Summary ...................................... 134

6 CONCLUSION AND FUTURE WORK ........................ 136

APPENDIX

A FULL FACTORIAL EXPERIMENT .......................... 139

B FULL FACTORIAL EXPERIMENTS USING THE SIMULATION FRAMEWORK 141

C USING THE TAG MANAGER TABLE FOR TAGGING I/O TLB .......... 142
  C.1 Architecture of VMA ............................... 143
  C.2 Prototyping and Simulating the VMA Architecture .............. 144
  C.3 Using the Tag Manager Table in VMA Architecture ............. 151
  C.4 Functional Verification of the Use of TMT in VMA .............. 152
  C.5 Summary ...................................... 153

REFERENCES ....................................... 154

BIOGRAPHICAL SKETCH ................................ 164

PAGE 8

LIST OF TABLES

Table page

3-1 Pseudocode of the microbenchmark for TLB timing model validation ...... 43
3-2 Throughput of the simulation framework for multiprocessor x86 simulations .. 52
3-3 Simulation parameters for investigating TLB behavior on virtualized platforms 53
3-4 Impact of Page Walk Latency on TLB-induced performance reduction RIPC .. 63
4-1 Flush profile for SPECjbb-based workloads with varying heap sizes ...... 90
4-2 Flush profile for TPCC-UVa based workloads with varying number of processes and varying TMT sizes ... 93
4-3 Factors and their levels for the sensitivity analysis ................ 95
4-4 Factors with significant influence on the reduction in TLB miss rates due to CR3 tagging ... 96
5-1 Algorithms for selection of victim SID ....................... 114

PAGE 9

LIST OF FIGURES

Figure page

2-1 Page walk for a 4KB page with PAE enabled ................... 26
2-2 Translation Lookaside Buffer ............................ 27
2-3 Memory virtualization in a virtualized platform .................. 29
3-1 Simulation framework for analyzing TLB performance ............. 38
3-2 Timing flow in the simulation framework ..................... 40
3-3 Validation of the TLB timing model ........................ 44
3-4 Screenshot of the simulation framework in use ................. 49
3-5 Throughput of the simulation framework for uniprocessor x86 simulations .. 50
3-6 Increase in TLB flushes on virtualization ..................... 54
3-7 Increase in TLB miss rate on virtualization .................... 55
3-8 Decrease in single-domain workload performance on virtualization ...... 58
3-9 Decrease in consolidated workload performance on virtualization ....... 62
3-10 Impact of the pipeline fetch width (FW) on TLB-induced performance reduction 64
4-1 TLB flush behavior with the Tag Manager Table ................. 70
4-2 TLB lookup behavior with the Tag Manager Table ................ 72
4-3 Reduction in TLB flushes using an 8-entry TMT ................. 75
4-4 Effect of Tag Manager Table size on the reduction in number of flushes ... 78
4-5 Reduction in TLB miss rate using an 8-entry TMT ................ 80
4-6 Effect of TLB associativity on the reduction in miss rate ............ 82
4-7 Increase in workload performance using an 8-entry TMT ............ 85
4-8 Effect of the Page Walk Latency on the improvement in performance ..... 87
4-9 Effect of workload memory footprint on the reduction in TLB miss rate .... 91
4-10 Effect of the number of workload processes on the reduction in ITLB miss rate 94
4-11 Comparison of the performance improvement due to process-specific and VM-specific tagging ... 97
4-12 Performance impact of TMT on non-virtualized platforms ........... 98

PAGE 10

4-13 Using the TMT for Shared Last Level TLBs ................... 102
4-14 Reduction in DTLB miss rate due to Shared Last Level TLB ......... 105
5-1 Performance improvement for consolidated workloads with uncontrolled TLB sharing ... 110
5-2 Controlled TLB usage using CShare architecture ................ 112
5-3 Effect of varying TLB reservation on miss rate .................. 117
5-4 Miss rate isolation using the TMT architecture .................. 118
5-5 Classification of TLB usage patterns ....................... 121
5-6 Overall miss rate improvement for consolidated workload with static TLB usage control ... 124
5-7 Overall performance improvement for consolidated workload with static TLB usage control ... 126
5-8 Selective performance improvement for consolidated workload with static TLB usage control ... 128
5-9 Dynamic TLB Usage Control ............................ 132
5-10 Selective performance improvement for consolidated workload with dynamic TLB usage control ... 133
C-1 Architecture and simulation-based prototype of VMA .............. 145
C-2 IPMMU and I/O TLB ................................. 150
C-3 Functional validation of the use of TMT in VMA ................. 152

PAGE 11

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TAG MANAGEMENT ARCHITECTURE AND POLICIES FOR HARDWARE-MANAGED TRANSLATION LOOKASIDE BUFFERS IN VIRTUALIZED PLATFORMS

By

Girish Venkatasubramanian

August 2011

Chair: Renato J. Figueiredo
Major: Electrical and Computer Engineering

The use of virtualization to effectively harness the power of multi-core processors has emerged as a viable solution to meet the growing demand for computing resources, especially in the server segment of the computing industry. However, two significant issues in using virtualization for performance-critical workloads are: 1. the overhead of virtualization, which adversely impacts the performance of such virtualized workloads, and 2. the noise or variation in the performance of these virtualized workloads due to the platform resources being shared amongst multiple virtual machines (VMs). Thus, improving the performance of virtualized workloads and reducing the performance variations introduced by the sharing of platform resources are two challenges in the field of virtualization. Meeting these challenges, specifically in the context of hardware-managed Translation Lookaside Buffers (TLBs), forms the theme of this dissertation. To understand the performance impact of the TLB and to investigate the performance improvement due to various architectural modifications, a suitable simulation framework is imperative. Hence, the first contribution of this dissertation is developing a full-system execution-driven simulation framework supporting the x86 ISA and detailed TLB functional and timing models. Using this framework, it is observed that the performance of typical server workloads is reduced by as much as 8% to 35% due to the TLB misses on virtualized platforms, compared to the 1% to 5% reduction on non-virtualized

PAGE 12

single-O/S platforms. This clearly motivates the need for improving the TLB performance for virtualized workloads. The second part of this dissertation proposes the Tag Manager Table (TMT) for generating and managing process-specific tags for hardware-managed TLBs, in a software-transparent manner. By tagging the TLB entries with process-specific identifiers, multiple processes can share the TLB, thereby avoiding TLB flushes that are triggered during context switches. Using the TMT reduces the TLB miss rates by 65% to 90% and the TLB-induced delay by 50% to 80% compared to a TLB without tags, thereby improving workload performance by 4.5% to 25%. The effect of various factors including the TLB and TMT design parameters, the workload characteristics and the TLB miss penalty on the benefit of using the TMT is explored. The use of the TMT in enabling shared Last Level TLBs is also investigated. Furthermore, the use of the TMT to tag I/O TLBs, in scenarios where address translation services and TLBs in the I/O fabric allow I/O devices to operate in virtual address space, is also explored. While the TMT enables multiple processes to share a TLB, this results in the TLB becoming a potential source of contention. The third part of this dissertation investigates the performance implications of such TLB contention and proposes the CShare TLB architecture to isolate the TLB behavior of virtualized workloads from one another using a TLB Sharing Table (TST) along with the TMT. The use of the CShare TLB in increasing the overall performance of consolidated workloads involving streaming applications with poor TLB usage as well as in selectively increasing the performance of a high priority workload by restricting the TLB usage of low priority workloads is explored. It is observed that the increase in the performance of a high priority workload due to using the TMT without controlled sharing can be further improved by 1.4x using such TLB usage restrictions. The use of dynamic usage control policies to achieve this selective performance increase while minimizing the performance reduction of the low priority workloads is also investigated.
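As a conceptual illustration of the process-specific tagging summarized above: on x86, a process's address space is identified by its CR3 page-table base register, so a tag manager can map CR3 values to short hardware tags. The sketch below is only illustrative; the 8-entry default follows the evaluated configuration, while the LRU tag-recycling policy and the Python interface are assumptions for exposition, not the exact TMT hardware design.

```python
from collections import OrderedDict

class TagManagerTable:
    """Conceptual sketch of a TMT: maps a process context (its CR3
    page-table base) to a short tag used to tag TLB entries.
    The LRU recycling policy here is an assumption for illustration."""

    def __init__(self, entries=8):
        self.cr3_to_tag = OrderedDict()        # process context -> tag
        self.free_tags = list(range(entries))  # unused hardware tags

    def tag_for(self, cr3):
        """Return (tag, recycled_tag). On a TMT miss with no free tag, the
        least recently used mapping is evicted and its tag recycled; TLB
        entries carrying the recycled tag must then be invalidated."""
        if cr3 in self.cr3_to_tag:
            self.cr3_to_tag.move_to_end(cr3)   # refresh LRU order on a hit
            return self.cr3_to_tag[cr3], None
        if self.free_tags:
            tag, recycled = self.free_tags.pop(), None
        else:
            _, tag = self.cr3_to_tag.popitem(last=False)  # evict LRU entry
            recycled = tag
        self.cr3_to_tag[cr3] = tag
        return tag, recycled
```

When a recycled tag is returned, TLB entries carrying that tag must be invalidated before the tag is reused; a TMT hit on a context switch means the incoming process's tagged TLB entries are still valid and no flush is needed.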

PAGE 13

CHAPTER 1
INTRODUCTION

The current paradigm of computing in the server industry is undergoing rapid changes. On one hand, the demand for computing resources has been growing, especially in the server segment. This growth is driven by the expansion of online service providers including cloud computing and social networking services, in addition to the traditional server-oriented high-performance computing and banking sectors. Facebook, a major social network provider, has increased the number of servers it uses from 10,000 to 30,000 in 2009 [1]. Cloud service providers including Amazon EC2, Rackspace and GoGrid have experienced increasing computing requirements in the past year [2]. On the other hand, Chip MultiProcessor (CMP) architectures, with an ever increasing number of processors on a single die, have emerged as the architectural solution for powerful servers [3]. Processors with 8 hardware threads are already being used and 16 thread processors have been demonstrated [4]. Virtualization has emerged as one of the key technologies allowing the power of CMPs to be tapped to meet the computing demands of the server segment in a flexible manner [5]. By encapsulating applications with their Operating System (O/S) and software stack in virtual machines (VMs), multiple applications can be consolidated on a single physical platform. Moreover, with the rising emphasis on green server rooms and low-cost autonomic management, virtualization has emerged as a convenient way to manage Quality of Service (QoS) and resource sharing among the consolidated applications. Virtualization is also being explored for ensuring application portability in High Performance Computing (HPC) systems by virtualizing different HPC systems with disparate architectures into a standardized platform abstraction [6, 7]. Estimates by Gartner [8, 9], predicting that the Hosted Virtual Desktop market will surpass $65 billion in 2013 and the Software as a Service (SaaS) model using

PAGE 14

virtualization will account for 20% of email services by 2012, clearly highlight the importance of virtualization in the server domain. Similarly, the recent virtualization of Sandia National Lab's Red Storm supercomputer using a specially designed hypervisor [10] is a testament to the applicability of virtualization in HPC domains. However, the benefit of using virtualization for performance critical server and HPC applications is accompanied by two significant challenges.

1.1 Hardware-Managed TLBs in Virtualized Environments

Full-system virtualization may be viewed as providing an environment where multiple applications, each belonging to different users and having different requirements (such as the software stack on which it runs), can coexist [11]. Typically, the application running in a VM is not aware of this fact and behaves in exactly the same way as it would on a real machine except for timing considerations. In this scenario, it is important to shield the state of one VM from the actions of another VM which runs on the same physical platform. To ensure this, Popek and Goldberg [12, 13] mandated that attempts to execute privileged instructions inside the VM should trap to the Virtual Machine Monitor (VMM). Satisfying this requirement causes a performance overhead for virtualized workloads. Specifically, an entry into the VMM or an exit from the VMM involves changing the CPU mode to the privileged mode and saving and restoring state related information. Apart from the apparent overhead, these switches also pose an additional demand on the CPU caches and thereby pollute them, posing further performance overheads. Reducing such performance degradation is a significant challenge in the area of platform virtualization. Another challenge on virtualized platforms is the need to shield the performance of the workload running in one VM from the noise or variation due to the resource consumption of other VMs which share platform resources. While this is not strictly required to maintain correctness of virtualization, performance isolation is imperative to achieve predictable performance and for ensuring that the performance

PAGE 15

of a high priority workload is not reduced due to the resource requirement of low priority workloads. When considering these performance related challenges, the most expensive CPU cache is the Translation Lookaside Buffer (TLB) [14]. The TLB caches the translations from the virtual to the physical address space and is in the critical path of memory operations. Hardware-managed TLBs, which are typical in most virtualized platforms [15], are flushed on every context switch to ensure that one process's TLB entries are not used for other processes. This flushing, however, means that every process which is switched into context experiences a large number of TLB misses until the required entries are brought back into the TLB. Thus, the flushing of the TLB and the subsequent TLB misses and page walks to service these misses constitute a delay which slows down the performance of the process. While typically tolerable in the case of non-virtualized systems, this performance slowdown is quite high in virtualized consolidated scenarios due to the large number of address spaces and the frequent switches between these address spaces as well as the switching between the VM and the VMM. It is vital to reduce this TLB-induced delay in virtualized platforms especially for performance critical applications. Many solutions attempt to reduce this TLB-induced performance penalty, as explained in Chapter 4, by sharing the TLB amongst multiple address spaces across context switch boundaries. This, however, makes the hardware-managed TLB a shared resource and yet another source of performance noise, necessitating TLB performance isolation solutions. Solving these performance improvement and performance isolation challenges in the context of the hardware-managed TLB forms the focus of this dissertation.

1.2 Contributions of the Dissertation

This dissertation makes three major contributions towards solving the challenges outlined in Section 1.1. A brief outline of the contributions is presented here.
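The flush-on-context-switch cost described in Section 1.1 can be illustrated with a toy model. The sketch below is not the dissertation's simulation framework; the TLB size, the LRU policy and the two-process round-robin workload are arbitrary assumptions chosen only to contrast a flushed TLB with a process-tagged one:

```python
from collections import OrderedDict

class TLB:
    """Tiny fully associative TLB with LRU replacement (illustrative only)."""
    def __init__(self, entries=64, tagged=False):
        self.entries, self.tagged, self.misses = entries, tagged, 0
        self.store = OrderedDict()          # lookup key -> cached translation

    def lookup(self, pid, vpage):
        key = (pid, vpage) if self.tagged else vpage
        if key in self.store:
            self.store.move_to_end(key)     # hit: refresh LRU position
            return
        self.misses += 1                    # miss: a page walk would occur here
        if len(self.store) >= self.entries:
            self.store.popitem(last=False)  # evict the LRU entry
        self.store[key] = "pte"

    def context_switch(self):
        if not self.tagged:
            self.store.clear()              # hardware-managed TLB: full flush

def run(tagged):
    tlb = TLB(tagged=tagged)
    # Two processes, each touching the same 32 pages, switched 10 times each.
    for _ in range(10):
        for pid in (1, 2):
            tlb.context_switch()
            for vpage in range(32):
                tlb.lookup(pid, vpage)
    return tlb.misses
```

In this configuration the untagged TLB re-misses on every page after every switch (`run(False)` yields 640 misses), while the tagged TLB takes only the compulsory misses for the two 32-page working sets (`run(True)` yields 64), mirroring the kind of miss-rate reduction this dissertation targets.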

PAGE 16

1.2.1 Simulation-Based Analysis of the TLB Performance on Virtualized Platforms

In order to understand the performance degradation caused by the high-frequency TLB flushing on virtualized platforms and to investigate the impact of various schemes that are proposed to reduce the TLB-induced delay, simulation frameworks supporting detailed and customizable performance and timing models for the TLB are needed. In fact, most works studying hardware-managed TLBs have used miss rates as the metric for measuring the impact of the TLB [16-18] due to the lack of suitable simulation frameworks supporting TLB timing models. While the reduction in miss rate is a suitable initial metric, the true impact of the TLBs on the system performance can be obtained only by using timing-based metrics. In addition to satisfying the requirement for TLB models, simulation frameworks that are used for studying virtualized scenarios should be full-system and execution-driven to capture the interaction between the hardware, VMM, VM and applications. Moreover, such simulation frameworks should support the simulation of the x86 ISA since that is one of the most popular virtualized platforms [15]. However, simulating x86 is difficult due to the complex architecture and the fact that every x86 instruction is broken down into micro operations (µops) which have to be simulated. A survey of currently available simulators, as conducted in Chapter 3, clearly shows that there are few academic simulators that satisfy all these requirements. To address this issue, a full-system simulation framework supporting the x86 ISA and TLB models is developed, validated and used to experimentally evaluate the performance implications of the TLB in virtualized environments. This framework uses two existing simulators (Simics and FeS2) as its foundation and incorporates a TLB timing model. This is the only academic simulation framework that provides a detailed timing model for the TLB and simulates the walking of page tables on a TLB miss. Moreover, this framework is capable of simulating multiprocessor multi-domain workloads, which makes it uniquely

PAGE 17

suitable for studying virtualized platforms. Using this framework, the TLB behavior of I/O-intensive and memory-intensive virtualized workloads is characterized and contrasted with their non-virtualized equivalents. It is shown that, unlike non-virtualized single-O/S scenarios, the adverse impact of the TLB on the workload performance is significant on virtualized platforms. Using the developed simulation framework, it is shown that this performance reduction for virtualized workloads is as much as 35% due to the TLB misses which are caused by the repeated flushing of the TLB and the subsequent page walks to service these misses.

1.2.2 Tag Manager Table for Process-Specific Tagging of the TLB

To address this issue of TLB-induced performance reduction, this dissertation proposes a novel microarchitectural approach called the Tag Manager Table (TMT). The TMT approach involves tagging the TLB entries with tags that are process-specific, thus associating them with the process which owns them. By tagging the TLB entries, TLB flushes can be avoided during context switches, as well as during switches between the VMM and the VM. This results in a reduction in the TLB miss rate. The TMT is a small, fast, fully associative cache which is implemented at the same level as the TLB. Every TLB has an associated TMT. Each entry in the TMT captures the context of a process and stores a unique tag associated with this process which is used to tag the TLB entries of this process. The TMT is designed to generate and manage these tags in a software-transparent fashion while ensuring low latency of TLB lookups and imposing a small area overhead. The benefit of using the TMT and process-specific tagged TLB in virtual platforms is estimated using the developed simulation framework. It is found that using process-specific tags reduces the TLB miss rate by about 65% to 90% for typical server workloads compared to using no tags. This reduction in miss rate effectively reduces the TLB-induced delay by about 50% to 80% which, depending on the TLB miss penalty, translates into a 4.5% to 25% improvement in the performance of the workloads. The

PAGE 18

effectiveness of the TMT approach depends on microarchitectural factors including the size of the TLB and TMT, the page walk latency and the workload characteristics, including the number of processes and the working set size of the workload. On the other hand, the associativity and replacement policy of the TLB play little role in deciding the impact of the TMT. These various architectural and workload-related factors are prioritized according to their impact on the benefit obtained from using the TMT. The primary motivation for the Tag Manager Table is avoiding TLB flushes by tagging the contents of the TLB with process-specific identifiers and thereby enabling multiple processes to share a TLB. Since the tags are generated at a process-level granularity and are not tied to any virtualization-specific aspect, the TMT may be used to avoid TLB flushes in non-virtualized scenarios as well. In addition, sharing across multiple per-core private TLBs using a hierarchical design with a shared Last Level TLB (LLTLB) in order to exploit inter-TLB sharing [19], is made possible on platforms with hardware-managed TLBs using the Tag Manager Table. This dissertation also shows that, even for two unrelated workloads with little scope for inter-TLB sharing, shared LLTLBs result in reducing the miss rate compared to private LLTLBs occupying the same on-chip area by 15% to 28% due to a better usage of the TLB space. Another scenario in which the TMT may be used is in tagging I/O TLBs, in scenarios where address translation services and TLBs in the I/O fabric allow I/O devices to operate in virtual address space, and synchronizing the I/O TLB flushes with the core TLB flushes. These scenarios are investigated in this dissertation.

1.2.3 Mechanisms and Policies for TLB Usage Control

One of the advantages of virtualization is that, by consolidating applications which stress different parts of the system, the average utilization of the entire system can be increased. However, even completely disparate applications will share core platform resources and influence the performance of one another depending on the consumption of these core resources. Since the TMT enables the sharing of the TLB among multiple


workloads, it makes the TLB one such shared resource and renders the performance of an application in one VM susceptible to variations due to the TLB usage of other VMs sharing the TLB. This necessitates mechanisms and policies for controlling the use of the TLB. The third part of this dissertation addresses this need. First, the TLB space utilization of consolidated workloads, with more than one VM running on the same physical platform, is characterized in order to understand the performance noise due to shared TLBs and to motivate the need for explicitly controlling the usage by different workloads sharing the TLB. Then, the CShareTLB architecture, consisting of the TMT with a TLB Sharing Table (TST) to control the usage of the shared TLB, is proposed. It is shown that the TLB behavior of a workload running in a VM can be isolated from the TLB usage of other VMs running on the same platform by assigning fixed slices of the shared TLB space to the various VMs using the TST. The use of the TST in improving the overall performance of consolidated workloads, or in selectively improving the performance of a high priority workload by restricting the TLB usage of other low priority workloads, is explored. This dissertation shows that the performance improvement for the high priority workload that is achieved by using the TMT without usage control can be further increased by 1.4x by restricting the TLB usage of low priority workloads using the TST. The cost of such selective performance enhancement for various types of workloads and the use of dynamic usage control policies for minimizing this cost are also investigated in this dissertation.

1.3 Outline of the Dissertation

The remaining part of this dissertation is organized as follows. Relevant background information about virtual memory, TLBs and memory management in virtualized systems is presented in Chapter 2. The design and validation of the full-system simulation framework with the TLB timing model is described in Chapter 3, along with an analysis of the TLB-induced performance degradation in virtualized workloads. The


architecture and functionality of the Tag Manager Table and the performance benefit of using it are presented in Chapter 4. The use of the TMT in enabling shared LLTLBs is also discussed in this chapter. The need for usage management policies in the TLB is motivated in Chapter 5 and the use of the CShareTLB for achieving usage control with static and dynamic policies is discussed in depth. The leveraging of the TMT to tag I/O TLBs is proposed, simulated and validated in Appendix C. The conclusions from this dissertation are summarized in Chapter 6.


CHAPTER 2
BACKGROUND: VIRTUAL MEMORY AND PLATFORM VIRTUALIZATION

Virtualization can be viewed as the successor to emulation [20]. In the case of computer systems, emulation is the process of duplicating the functions of a target system using a different source system, so that the source system behaves like the target system. The target system is usually emulated at the functional level. Virtualization takes this concept to the next level by allowing a host system to behave like multiple different guest systems [20]. Platform virtualization or full-system virtualization, one of the common types of virtualization, is defined as the hiding of the physical characteristics of a computing platform from users and the showing of an abstract computing platform. The abstraction thus exposed is called a Virtual Machine. The virtual machine monitor (VMM) or hypervisor acts as the control and translation system between the VMs and the physical platform hardware. A VM behaves in the exact same way as a physical machine and, except for timing considerations, is indistinguishable from a physical machine. The software stack running inside a VM is unaware that it is not directly running on a physical machine. Since the level at which the abstraction is provided tends to be the Instruction Set Architecture (ISA), such virtualization is also known as full-system virtualization or ISA-level virtualization.

In addition to server consolidation for harnessing the power of CMPs, as mentioned in Chapter 1, virtualization has many advantages:

- In a server environment, virtualization reduces the cost of infrastructure by maximizing the utilization of the resources and enhancing the management capabilities.

- Desktop Virtualization [21], the concept of using a thin and inexpensive client to access a virtual desktop running on powerful backend servers, enables simpler and inexpensive provisioning of desktops and lowers the costs for managing security and deploying new software by the system administrator.


- Hosted virtual machines, wherein the VM runs as an application on the host platform along with several host-level non-virtualized applications, can be used to provide an effective isolated sandbox for software testing and development [22].

- Virtualization enables utility computing and cloud computing [23]. Using service models such as Infrastructure as a Service (IaaS) and Applications as a Service (AaaS), virtualization can provide economical and secure utility computing with guarantees of privacy and isolation of data and performance.

- Virtualization enables computing grids spanning widely distributed resources. By providing different users with virtual machine images [24, 25] which can scavenge computing cycles from their resources, it becomes possible to create a pool of computing power which can be used for large-scale computing.

While virtualization provides better resource utilization and new paradigms of computing, virtualizing a computer system is challenging. Specifically, in the case of the memory subsystem, it is important to realize that memory is already virtualized even on non-virtualized single-O/S systems. Platform virtualization adds yet another layer of abstraction to this already-virtualized memory. Creating and managing these levels of abstraction makes memory virtualization challenging. Since the work in this dissertation lies in the domain of memory virtualization, some relevant background about memory virtualization in non-virtualized platforms as well as virtualized platforms is presented in this chapter.
2.1 Virtual Memory in Non-Virtualized Systems

Memory virtualization is a concept whereby an application is provided with an abstraction of an address space that is different from the actual physical memory. This abstracted address space is termed virtual memory, virtual address space or linear address space. By virtualizing memory and providing processes with unique virtual address spaces, multiple processes can share the physical memory [20]. Using this abstraction, applications can be written assuming a contiguous address space without the programmer having to consider issues such as the size of the physical memory and the range of addressable locations. Using virtual memory, a program can use absolute addressing modes and can be easily ported from one machine to another


without needing any change. Memory virtualization may also be used for providing the application with a memory space that may be in excess of the actual physical memory available. Moreover, virtual memory may be used to enforce memory isolation amongst multiple processes and to restrict the type of accesses allowed on different memory locations based on the semantics of the data stored at those locations.

2.1.1 Implementing Virtual Memory Using Paging

Memory is typically virtualized by paging. Here, the available physical memory is partitioned as multiple regular-sized blocks called page frames. The virtual memory is composed of blocks termed pages, whose size is the same as the frames. Whenever a certain virtual address needs to be accessed, the page containing that address is fit onto a page frame in physical memory by mapping the virtual page to the physical frame address. The page table stores the details of the virtual to physical mapping. The process of converting a virtual address to a physical address, in order to access memory, is known as address translation. Since address translation is a high-frequency operation in the critical path of all memory accesses, it is usually implemented in the Memory Management Unit (MMU) hardware.

Address translation. Address translation consists of looking up the virtual to physical address mapping from the page tables, and this process is termed the page walk. Since the page table also contains information such as the types of operations permitted on the page, address translation also provides some measure of isolation and protection. If the page is not currently mapped in memory, a page fault is raised and handled by the system software by mapping the virtual address page onto a free physical memory frame, or by evicting an existing page from its frame and reusing the frame for the new page. The page tables for the new page as well as the evicted victim page are updated. The contents of the page which has been evicted from physical memory are maintained in the virtual memory disk cache.
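The map-on-fault flow described above can be illustrated with a minimal sketch. This is not any particular O/S's handler; the single-level table, the free-frame list and the function name are hypothetical stand-ins, and victim eviction is omitted:

```python
PAGE_SIZE = 4096  # 4 KB pages, as in the x86 small-page case

def translate(page_table, free_frames, vaddr):
    """Translate a virtual address; on a page fault, map the faulting
    page to a free frame (eviction of a victim is omitted for brevity)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:                 # page fault
        if not free_frames:
            raise MemoryError("no free frame: a victim must be evicted")
        page_table[vpn] = free_frames.pop()   # fault handler maps the page
    return page_table[vpn] * PAGE_SIZE + offset
```

A first access to a page faults and consumes a frame; later accesses to the same page translate directly through the table, which is exactly the cost asymmetry the TLB (Section 2.2) exists to hide.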


A flat page table, which stores all the page mapping information in a single-level table, is conceptually simple. But the physical memory requirements for such a flat table make it prohibitively expensive. Hence multi-level page tables are used. Here, the starting address of the first level of page tables is usually stored in a register called the Page Table Base Register (PTBR). In conjunction with the PTBR, a part of the virtual address is used to index the first level of page tables. The contents of the indexed location in the first-level page table point to the start of the second level of page tables. Along with the next part of the virtual address, this is used to index the second-level page table. This process is continued till the last-level page table is indexed and the physical address corresponding to the virtual address is obtained. The set of hierarchical page tables may also be paged, i.e., parts of the hierarchical page tables may reside on disk and can be brought into physical memory when needed. In such cases the upper levels of the hierarchical page table are always maintained in memory to avoid deadlocks. It should be noted that most systems allow the existence of more than one page size. By using large pages, where a larger block of contiguous physical memory is mapped to a single page, the size of the page tables can be reduced. Such large pages are also termed superpages or big pages.

2.1.2 Address Translation in x86 with Physical Address Extension Enabled

Since x86 is the most popular virtualized architecture, the details of the address translation process on x86 warrant a close examination. Specifically, since the system simulated in this work uses PAE addressing mode and most virtualization solutions on 32-bit x86 use PAE addressing mode, the address translation in PAE mode is described in detail in this section.

32-bit x86 has several different modes of paging, one of which is the Physical Address Extension (PAE) virtual addressing mode. With a 32-bit physical address, the maximum addressable physical space is 4 GB (2^32 bytes). Physical Address Extension is a feature of the x86 architecture that allows access to more than 4 GB of RAM, if the operating


system supports it. In PAE mode, a virtual address belonging to a 4 KB small page is translated in a four-step process as shown in Figure 2-1. The CR3 register is the PTBR for the x86 architecture and points to the Page Directory Pointer Table. The two most significant bits (MSBs) of the virtual address (VA) are used as an offset from the starting address of the Page Directory Pointer Table, and the Page Directory Pointer Table Entry (PDPTE) is obtained, as shown in step 1 of Figure 2-1. The PDPTE points to the base of the Page Directory Table, which is the next level in the multi-level page table. The 9 Least Significant Bits (LSBs) of the PDPTE contain attributes of all the pages belonging to that Page Directory Table, such as the Read/Write attributes and the CPU privilege requirement for accessing these pages. These PDPTE attribute bits are masked and replaced by an offset composed of bits 29 to 21 from the virtual address. The resulting address is used to read the Page Directory Table Entry (PDE) for this virtual address, as shown in step 2. Similar to the PDPTE, the 9 LSBs of the PDE are also attribute bits, which are masked and replaced by the next 9 significant bits of the virtual address. This resulting address is used and the Page Table Entry (PTE) is read, as in step 3. The PTE points to the starting location of the physical memory page frame where the page containing the virtual address is fit. Hence the PTE is sometimes referred to as the Physical Frame Number (PFN) or Physical Page Number (PPN). The final step, step 4, consists of accessing the page that is pointed to by the PTE and adding the 12 LSBs to get the physical address (PA) corresponding to the virtual address. Since these 12 bits indicate a byte within a page, they are termed the Page Offset, and the remaining 20 MSBs the Virtual Page Number (VPN). It should be noted that the attributes of a page are determined as the logical AND of the attributes from the PDPTE, the PDE and the PTE.

In PAE mode, large pages of size 2 MB are identified by bit 7 of the PDE being set. Till the PDE is determined, the page walk for both large and small pages is identical. But once the PDE is read and the page is found to be a large page, the base


address of the large page, and not the PTE, is determined by using the PDE. Then, the remaining 21 bits of the virtual address are used as an index into the large page to access the physical address corresponding to the virtual address.

Figure 2-1. Page walk for a 4 KB page with PAE enabled

2.2 Translation Lookaside Buffer

To speed up the page walk process, a small associative cache called the Translation Lookaside Buffer (TLB) is used for caching the translations for the recently accessed pages. The structure of a typical TLB is shown in Figure 2-2. Every entry in the TLB contains three fields:

- The Virtual Page Number (VPN).

- The Physical Page Number (PPN) corresponding to the VPN.

- The attributes of the page, indicating the write permissions for the page (R/W), the CPU mode required to access the page (S/U), the cacheability of the page and the type of physical memory (MTRR, PAT), as well as the accessed and dirty state for the page table entries corresponding to this translation.
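Returning to the PAE walk of Section 2.1.2, the 2/9/9/12 slicing of the 32-bit virtual address and the four-step lookup can be sketched as follows. The nested dictionaries are made-up stand-ins for the Page Directory Pointer Table, Page Directory Table and Page Table, not real hardware structures, and attribute handling is omitted:

```python
def pae_split(va):
    """Slice a 32-bit VA into the PAE indices: PDPT (2 bits), page
    directory (9 bits), page table (9 bits) and page offset (12 bits)."""
    return (va >> 30, (va >> 21) & 0x1FF, (va >> 12) & 0x1FF, va & 0xFFF)

def pae_walk(pdpt, va):
    """Four-step walk over nested dicts standing in for the page tables."""
    d1, d2, d3, offset = pae_split(va)
    pde = pdpt[d1][d2]                  # steps 1 and 2: PDPTE, then PDE
    if pde.get("large"):                # bit 7 of the PDE in real hardware
        return pde["base"] + (va & 0x1FFFFF)   # 21-bit offset, 2 MB page
    pfn = pde["pt"][d3]                 # step 3: read the PTE (the PFN)
    return pfn * 4096 + offset          # step 4: frame base + page offset
```

The large-page branch shows how the walk short-circuits after the PDE, as described above.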


Whenever an address translation is required, the TLB is first looked up to check if the translation is cached, as shown in Figure 2-2. If the lookup hits in the TLB, the page offset from the virtual address is used along with the PPN from the TLB entry to get the physical address, without having to go through the entire page walk. On a TLB miss, however, the page tables are walked and the address translation is obtained. Depending on the replacement policy, a victim is evicted from the TLB and that slot is populated with the VPN, PPN and attributes obtained from the page walk.

Figure 2-2. Translation Lookaside Buffer for caching the recently used virtual to physical address translations.

TLBs can be broadly classified into Software-Managed TLBs or Architected TLBs, such as in SPARC and ALPHA [26, 27], and Hardware-Managed TLBs, such as in x86 [28], depending on the behavior on a TLB miss. In software-managed TLBs, the TLB raises a fault on a TLB miss, which is handled in a fashion similar to any general interrupt. The pipeline gets flushed [29] and the page walk is performed by the O/S. Once the page walk is completed, the TLB is populated and then the pipeline is restarted. The advantage of the software-managed TLB is that the O/S may use intelligent schemes to populate the TLB and redefine the organization of the page table to suit the new schemes. However, the time taken for the page walk is significantly higher than in hardware-managed TLBs, and the page walk process may pollute the instruction cache.
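The hit/miss flow at the start of this section can be modeled with a small fully associative TLB. FIFO replacement and all names here are illustrative choices for the sketch, not a model of any specific processor:

```python
from collections import OrderedDict

class TLB:
    """Fully associative TLB sketch with FIFO replacement."""
    def __init__(self, num_entries, page_walk):
        self.num_entries = num_entries
        self.page_walk = page_walk     # fallback invoked on a miss
        self.entries = OrderedDict()   # VPN -> (PPN, attributes)
        self.hits = self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, 4096)
        if vpn in self.entries:        # hit: no page walk needed
            self.hits += 1
        else:                          # miss: walk, evict a victim, fill
            self.misses += 1
            if len(self.entries) == self.num_entries:
                self.entries.popitem(last=False)   # FIFO victim
            self.entries[vpn] = self.page_walk(vpn)
        ppn, _attrs = self.entries[vpn]
        return ppn * 4096 + offset
```

Keeping the page walk as a pluggable callback mirrors the structural split between the TLB and the miss-handling machinery that the hardware/software classification below turns on.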


In hardware-managed TLBs, the structure of the page table and the format of the page table entries are defined by the ISA and are fixed. When a TLB miss occurs, a hardware state machine walks the page tables, determines the translation and populates the TLB. This mechanism is much faster than a software-managed TLB [30], since the page walk happens entirely in hardware. Moreover, it does not stall the pipeline, and instructions which are not dependent on this particular translation can be executed out of order [31]. The disadvantage of hardware-managed TLBs arises during a context switch. When there is a context switch from one process to another, the hardware-managed TLB gets flushed to avoid using the TLB entries of the first process for the second process. In software-managed TLBs, however, most operating systems tag the contents of the TLB with some ID which relates the entries to the process to which they belong, and thereby avoid flushing the TLB on context switches. Thus, with hardware-managed TLBs, every process which is switched into context experiences a large number of TLB misses until the required entries are brought back into the TLB.

2.3 Virtual Memory in Virtualized Systems

As seen in the previous section, a non-virtualized system has two levels of memory: the physical memory, and the virtual memory which is an abstraction of the physical memory and which gets exposed as a unique address space to every process. With platform virtualization, the physical memory is abstracted by the VMM and presented to the VM as its apparent physical memory; this memory is further virtualized by the guest O/S running on the VM. To avoid ambiguity, this intermediate level of memory is referred to as real memory. The three different levels of memory in a virtualized platform are clearly indicated in Figure 2-3.

In the three-level memory architecture of a virtualized platform, the page tables maintained by the guest O/S contain translations between virtual memory and real memory. Similarly, the page tables maintained by the VMM contain the mapping between real memory and physical memory. It is this abstraction of the physical


memory into real memory that achieves the goal of virtualizing memory at the VM-VMM interface.

Figure 2-3. Memory virtualization in a virtualized platform

Because of this three-level memory abstraction, the virtual address seen by an application inside a VM has to be translated to the real memory domain using the page tables of the VM. Then, this real address has to be translated by the VMM to physical memory and the required data accessed. However, while maintaining two sets of page tables is conceptually simple, it is rarely done due to the cost involved. Rather, this is handled in one of the following three ways.

2.3.1 Full-System Virtualization and Shadow Page Tables

Full-system virtualization solutions such as VMware use the concept of shadow page tables [32]. The VMM maintains a set of shadow page tables (SPTs), one for every process in every guest VM. These SPTs are invisible to the guest O/S and map the virtual memory pages directly to physical memory. By using the SPTs, one set of page walks can be eliminated, thereby making the address translation process faster.
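The shadow page table is, in effect, the composition of the two mappings. A minimal sketch, with dicts standing in for the guest and VMM page tables (the real SPT machinery also tracks attributes and keeps the tables consistent, which this omits):

```python
def build_shadow(guest_pt, vmm_pt):
    """Compose the guest mapping (virtual -> real) with the VMM mapping
    (real -> physical) into a shadow table (virtual -> physical).
    Virtual pages whose real page the VMM has not mapped are left out,
    so accesses to them still fault, as consistency requires."""
    return {vpn: vmm_pt[rpn]
            for vpn, rpn in guest_pt.items()
            if rpn in vmm_pt}
```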


To achieve this, the Page Table Base Register (PTBR) is virtualized. When starting a guest, the VMM populates the physical PTBR with the location of the shadow page tables and the virtual PTBR with the real memory location of the guest O/S's page tables. Whenever the guest attempts to read or write the PTBR, the instruction traps to the VMM. If this is a write attempt, which may be caused by a context switch inside the guest, the virtual PTBR is updated with the real memory address pointing to the page tables of the new process. The physical PTBR is then updated by the VMM to point to the physical memory location which contains the shadow page table of the new process of the guest VM. If the attempt is a read attempt, the VMM returns the virtual PTBR value to the guest O/S.

While the SPT effectively eliminates one level of memory indirection, it introduces the need to maintain consistency between SPTs and guest page tables. For instance, if a certain virtual page is not mapped to the real memory according to the guest page tables, then the shadow page tables for that process should not contain a mapping. This is needed in order to ensure that the occurrence of page faults is consistent, irrespective of whether the application is running in a guest O/S or on a non-virtualized platform. Thus, page table management becomes a source of virtualization overhead.

2.3.2 Paravirtualization and Page Tables

In a traditional VMM, the virtualized abstraction that is exposed as a VM is identical to the underlying physical machine [33, 34]. Hence, operating systems need not be modified to run in a guest VM. However, the cost of maintaining this abstraction of identical hardware is high. Xen [35] takes the approach of presenting the guest with a similar but non-identical abstraction of the real hardware, using a technique called paravirtualization. Due to the differences between real and virtual hardware, the O/S has to be patched to run in the paravirtualized VM (which is referred to as a domain or dom in Xen terminology).


However, only the O/S requires patching, and unmodified binaries can still be run on this patched O/S inside the doms. Xen handles memory virtualization by allowing guests to directly view the physical memory, thereby eliminating the intermediate real memory [35]. The configuration file for a user domain (domU) includes a request for a certain amount of memory. If sufficient physical memory is available, Xen allocates the requested amount of physical memory and reserves it for the domU. Such a reservation allows the guests to directly view their allocated physical memory and imposes strong isolation from other domains. Whenever a modified guest O/S needs memory, it allocates a page from its reserved pool of physical memory and registers this allocation with the Xen hypervisor. The page tables for the processes, which are maintained by the guest, are made unwritable by the guest. Whenever the guest O/S desires to update the page table, it does so by issuing a hypercall. Xen verifies that the write request from the guest O/S is valid and makes the requested changes in the page tables. To improve performance, multiple such hypercalls may be batched and issued together by the guest O/S to avoid frequent switching between the VM and the hypervisor.

Eliminating the real memory removes the need to maintain shadow page tables. However, this poses a conflict with the contiguous physical address space model that is assumed by most guest O/Ss. Xen handles this by providing a pseudo-physical memory, which may be thought of as an analog to real memory, and by rewriting the parts of the guest O/S which depend on physical memory contiguity to use this pseudo-physical memory.
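The hypercall batching mentioned above can be sketched as a small queue. The class and the callback are hypothetical illustrations, not Xen's actual interface (which is built around the mmu_update/multicall hypercalls); the point is the effect, one VM-to-hypervisor switch for many page-table writes:

```python
class BatchedPTUpdates:
    """Queue guest page-table writes and issue them in one hypercall."""
    def __init__(self, hypercall, batch_size):
        self.hypercall = hypercall   # VMM entry point: validates & applies
        self.batch_size = batch_size
        self.pending = []
        self.vm_exits = 0            # switches into the hypervisor

    def update(self, pte_addr, new_val):
        self.pending.append((pte_addr, new_val))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.hypercall(self.pending)   # one switch covers N updates
            self.vm_exits += 1
            self.pending = []
```

With a batch size of 4, eight page-table writes cost two VM/VMM switches instead of eight.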


of virtualization overhead in Xen. To avoid these overheads associated with software methods of virtualizing the memory, both Intel [37] and AMD [36] have developed hardware solutions by extending the MMU of the x86-64 and amd-64 architectures respectively. These solutions, involving two levels of page tables, are known as Nested Page Tables (NPT) and Extended Page Tables (EPT) by AMD and Intel respectively.

NPTs and EPTs provide two levels of page tables. The first level of page tables, called guest page tables (GPTs), are similar to regular page tables and are used to map virtual addresses to real addresses. The second level of page tables, called host page tables, are maintained by the VMM and contain the mappings between the real and physical address spaces. Both the guest and the VMM have their own copies of the PTBR (CR3). The guest CR3 points to the start of the guest page tables and the host CR3 points to the base of the EPT/NPT. When a virtual address has to be translated to a physical address, a two-dimensional page walk takes place. The guest CR3, along with the MSBs of the virtual address, indicates the address of the first-level page table entry in real memory. This address is translated to the physical memory domain by walking the host page tables using the host CR3. The translated physical address is used to read the first-level page table entry of the guest page tables, which is then translated from real to physical memory. By repeating this process, the physical address corresponding to the linear address is obtained.

By allowing the guests to manage their page tables, the need for trapping MMU-related instructions is avoided. This reduces the overhead of memory virtualization. It should be noted that, even with nested page tables, the TLB still caches virtual to physical address translations rather than virtual to real address translations. Moreover, the cost of a TLB miss increases significantly compared to non-nested page tables when NPTs/EPTs are used, further increasing the need to reduce the TLB misses.
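The cost growth noted above can be made concrete. If the guest table has n levels and the host (EPT/NPT) table has m levels, each of the n guest entries must itself be located by an m-level host walk, and the final guest-physical address needs one more host walk, giving n*(m+1) + m memory accesses versus n for a non-nested walk. This is a standard back-of-the-envelope model, not a figure from this dissertation:

```python
def nested_walk_accesses(n_guest, m_host):
    """Memory accesses in a two-dimensional page walk: each of the
    n guest levels costs one m-level host walk plus the guest entry
    read itself, and the resulting guest-physical address costs one
    more host walk."""
    return n_guest * (m_host + 1) + m_host

def flat_walk_accesses(n_levels):
    """Non-nested walk: one access per page-table level."""
    return n_levels
```

For 4-level guest and host tables this gives 24 accesses against 4, which is why a TLB miss becomes so much more expensive under NPT/EPT.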


2.4 Summary

The background information about memory virtualization in non-virtualized and virtualized systems presented in this chapter clearly demonstrates the complexities of virtualizing memory. In addition to this complexity, many of the strategies that have been used to reduce the latency of page table management, such as using EPT/NPT, as well as the switches between the VM and the VMM necessitated by page table management operations, have implications on the behavior of the Translation Lookaside Buffer. These implications, the performance delay caused by the TLB, and the avoidance of this performance delay form the focus of the remainder of this dissertation.


CHAPTER 3
A SIMULATION FRAMEWORK FOR THE ANALYSIS OF TLB PERFORMANCE

The growing use of virtualization for server consolidation on CMP platforms [5, 24, 38] has emerged as a new paradigm in the high-end server computing industry. However, one issue with such virtualization-based resource consolidation is the performance degradation of virtualized workloads. In fact, improving the performance of virtualized workloads to near-native levels has been the focus of much research [6, 39-45]. The x86 architecture, which is one of the most popular virtualized platforms [15], has also been modified with hardware virtualization extensions to improve the performance of virtual machines. Starting with the VT extensions [46], there have been many changes in this direction, including Intel VT for Connectivity and Intel VT for Directed I/O [47]. Similar developments from AMD include the AMD-V virtualization technology [36] and the Direct Connect Architecture [48].

As mentioned in Chapter 1, the TLB is critical in determining the performance of virtualized workloads [14]. Hence, it is no surprise that the most recent virtualization extensions to the x86 architecture have focused on the TLB. Specifically, the TLB architecture has been modified by the addition of tags as a part of the TLB entry and by providing hardware primitives for rapid tag comparison [36, 37, 48]. Due to these changes in the TLB architecture, there is a need for reexamining and understanding the TLB behavior of workloads in virtualized settings in order to solve issues involving tag generation and management. Furthermore, the optimum tagged TLB architecture, in terms of size and associativity, should be explored.

One way of obtaining this understanding is by conducting a simulation-based study wherein the effect of various architectural and workload-related parameters on the TLB performance can be explored. Moreover, using such a simulation-based approach will facilitate understanding the impact of the TLB on the performance of virtualized


workloads and will allow the comparison of various TLB-related performance-enhancing ideas.

3.1 Survey of Simulation Frameworks Used in TLB-Related Research

The Translation Lookaside Buffer has been the target of many research works. TLB prefetching [49-51] has been explored to increase the TLB hit ratio. Chadha et al. [52, 53] have used functional models with the SoftSDV [54] simulator to study the TLB behavior of I/O-intensive virtualized workloads. Tickoo et al. [18] have explored TLB tagging in their qTLB approach. Ekman et al. [55] estimate the TLB to be responsible for up to 40% of the power consumption in caches. Various circuit-level and architectural techniques [16, 56-59] as well as compiler-level code transformations [60] have been explored to reduce the TLB power consumption. However, these previous studies involving the hardware-managed TLB (such as the x86 TLB) have used SimpleScalar [61] or custom-built trace-driven simulators [62] and not TLB timing models in a full-system environment, thereby ignoring the interaction of the workload with the O/S and the VMM. Even in cases where full-system simulation has been used, the TLB timing has not been modeled [52, 53] or the x86 architecture has not been simulated [50, 51].

A possible reason why the studies involving hardware-managed TLBs on x86 have not used timing-based metrics, or use simplified simulators which are not full-system simulators and tend to ignore hypervisor effects, may be the lack of simulator support. Commonly used x86 simulators are either not full-system simulators or do not model the timing behavior of the TLB. Zesto [63], which supports cycle-accurate simulation for x86 and models the TLB, cannot boot an O/S and does not support full-system simulation. PTLsim/X [64] is a full-system simulator for x86 that can simulate an entire O/S and the binaries running inside it, by running the O/S as a guest on top of a modified version of Xen. However, it is not capable of simulating the hypervisor itself, which makes it unsuitable for full-system studies on virtualized platforms. SimOS [65] supports the x86 architecture, but it does not support running a virtual machine monitor. M5 [66], while


providing full-system support and timing models, does not support the x86 architecture. Simics [67] is a full-system simulator that is capable of booting and running Xen and multiple guest O/Ss, but requires extensions to support timing studies. GEMS [68] provides one such timing framework; however, it does not support the simulation of the x86 ISA. FeS2 [69] is an accurate execution-driven timing model that includes a cache hierarchy, branch predictors and a superscalar out-of-order core. It supports x86 and can be plugged into Simics. COTSon [70] is a similar timing simulator that can be plugged into AMD SimNow [71]. But neither FeS2 nor COTSon provides timing models for the TLB.

Thus, there is a clear need for a simulation framework for simulating the behavior of hardware-managed TLBs on virtualized platforms that meets the following requirements:

- The framework should support configurable TLB functional and timing models. Since recent hardware-managed TLBs incorporate tags as a part of the TLB entry, the functional TLB model should support the simulation of tagged TLB functionality as well.

- As x86 is the most common virtualized platform, the simulator should support the simulation of the x86 ISA. It is also desirable that the framework simulates the x86 ISA at the micro-operations (µops) granularity.

- To capture the interaction between the hardware, the VMM, the VM and the application, it is imperative that the simulator be a full-system execution-driven framework.

Developing such a simulation framework forms the focus of this chapter.

3.2 Developing the Simulation Framework

The full-system simulation framework developed for analyzing the TLB behavior on virtualized platforms uses Simics [67] and FeS2 [72] as foundations. The basic functional TLB model in Simics is replaced with a generic tagged TLB model. TLB timing models are also developed and incorporated into the timing flow of FeS2. These components of the simulation framework are described in this section.


3.2.1 Using Simics and FeS2 as Foundations

The simulation framework, shown in Figure 3-1, consists of Virtutech Simics [67] (version 3.0.1), a full-system simulation platform capable of simulating high-end target systems with sufficient fidelity and speed to boot and run operating systems and workloads. Simics uses a functional CPU model with atomic and sequential execution of instructions, wherein the execution of every instruction takes exactly one cycle. The processor model is non-pipelined, and only x86 CPUs without hardware virtualization support are modeled. Simics also provides a rich set of microarchitectural components, including the cache and TLB, which can be incorporated with the CPU. In such simulations, the execution time for an instruction is increased by any stalls that may be caused by the memory subsystem for that instruction, but the execution model is still sequential. Moreover, only the caches and the memory can stall an instruction, and the hit and miss latencies associated with the TLB are ignored. Simics also provides the capability to install callback functions and associate these with the occurrence of specific events such as TLB misses and context switches. While Simics provides a microarchitectural interface (MAI) timing model, which emulates a pipeline and out-of-order execution, it does not simulate at the granularity of x86 micro-operations (µops).

To support timing-based analysis, a timing model based on the FeS2 [69] simulator is used. FeS2 works on a timing-first methodology, where the functional correctness is provided by Simics and the timing information by FeS2. An x86 instruction is fetched and decoded into µops, using the decoder from PTLsim [64], which are then executed and retired. During the retirement phase, the corresponding x86 instruction is allowed to execute in Simics. Then, the state of the system maintained by FeS2 is compared to the functionally-correct state maintained by Simics. In case of these states not matching up, the FeS2 pipeline is flushed and restarted at the next instruction. FeS2 relies on Simics to supply the functional data such as the contents at a given memory location and the


translation for a given virtual address. Thus FeS2 provides an effective timing plugin to the Simics simulator. Coupling FeS2 with Simics creates a framework which satisfies all the requirements for simulation studies involving virtualized workloads, except for the lack of advanced TLB functionality (like a tagged TLB) and of timing models for the TLB.

Figure 3-1. Simulation framework for analyzing TLB performance. The framework is built using Simics and FeS2 as foundations. A generic tagged TLB functional model as well as a TLB timing model is incorporated.

3.2.2 TLB Functional Model

The x86 processor model in Simics [73] has a functional TLB model consisting of four 64-entry 4-way associative TLBs. These TLBs are organized as two DTLBs and two ITLBs, i.e., one each for the 4 KB small pages and the large pages. A First In First Out (FIFO) replacement policy is used in these TLBs. As this TLB functional model does not support storing tags as a part of the TLB entry or incorporating tag checking as a part of the TLB lookup, a generic tagged TLB functional model is created. The tagged TLB model consists of four components, as shown in Figure 3-1: 1. the Generation and Management of Tags (GMT) module, 2. the extended TLB which stores


a tag as a part of every entry, 3. the Tag Cache which stores the current tag, and 4. a tag comparator for comparing the tags during TLB lookup. Depending on the details of the specific tagged TLB solution being modeled, one or more of these components may not be needed. For instance, when modeling a tagged TLB solution where the assignment of tags is done by the system software, the GMT need not be simulated. However, creating models for all these components makes this tagged TLB model flexible enough to simulate any tagging solution.

To add the tagging functionality, the GMT, Tag Cache and comparator are added as model extensions to Simics, similar to the AntFarm extension by Jones [74]. The GMT is implemented in such a manner that it is capable of examining the state of the CPU of which it is a part. The Simics TLB model is extended by adding tags as a part of the data structure for every entry. In addition to the FIFO replacement policy, an LRU replacement policy with timestamps based on the Simics clock is added. The Tag Cache is modeled as a register which is wide enough to cache one entry of the GMT. The comparator functionality is implemented by looking up the current tag from the Tag Cache and using this as a part of the TLB lookup logic. APIs to facilitate communication between the GMT and the TLB are also implemented. Every time a TLB flush is triggered by writing a new value to the CR3 register, the extended TLB module communicates this new value to the GMT module using these APIs. The GMT makes the appropriate changes and updates the Tag Cache. The GMT then, depending on the functionality being simulated, indicates if the TLB flush can be avoided or not. If the TLB flush cannot be avoided, the extended TLB's contents are flushed.

3.2.3 Validation of the TLB Functional Model

The validation of the TLB functional model consists of verifying that the TLB is functionally correct when the tags are used to avoid TLB flushes. Any error in the functionality will result in retaining stale entries which are inconsistent with the page tables. Hence, verifying the consistency of the TLB entries serves to validate the


tagged TLB implementation. For this, a Functional Check mode is implemented. In this mode, whenever there is a hit in the tagged TLB, a page walk is performed to get the translation Trans_PW, consisting of the physical address corresponding to the linear address and all the page attributes, such as the read/write bit, the global bit, the page mode bit, and the PAT and MTRR bits. This translation is then compared to the translation Trans_TLB present in the tagged TLB. If these translations do not match, an inconsistency is declared. It should be noted that the Functional Check mode severely slows down the speed of simulation and is used only for validation of the TLB functional model.

3.2.4 TLB Timing Model

Figure 3-2. Timing flow in the simulation framework. FeS2 plugging into Simics and the TLB timing models plugging into FeS2 are shown. The flow of timing during a TLB lookup is illustrated.

FeS2 does not implement either the instruction or the data TLB. Whenever an address translation is needed, FeS2 queries Simics using a Simics-provided API. This API returns the translation irrespective of whether it is present in the Simics functional TLB or not. If the functional TLB does not contain the needed translation, Simics walks


the page table, computes the translation, populates the TLB and returns the translation, completely transparently to FeS2. Moreover, the details of any cache misses caused by the page walk are also not communicated to FeS2 by this API. Thus, FeS2 is unable to account for the different execution times of a µop depending on whether the lookup it triggered hit or missed in the TLB and, in the case of a miss, whether there were any cache misses. This behavior of FeS2 is modified by implementing timing models for the ITLB and DTLB and integrating them into FeS2, as shown in Figure 3-2.

After the addition of these models, the fetch-and-decode stage queries the timing model, instead of using the Simics API, whenever an address translation is needed. This path is shown by the arrow labeled 1 in Figure 3-2. The timing model queries the functional TLB model, as shown by arrow A. If the translation is not present in the functional TLB, the timing model reads the CR3 value and calculates the first address to be looked up in the page walk process. It then inserts a lookup for this address in the cache hierarchy maintained by FeS2. Once this lookup returns, the actual value stored at this address is obtained from the Simics functional memory, as shown by arrow B, and used to calculate the next address in the page walk process. This process is repeated until the entire translation is computed. Once computed, the functional TLB is populated using this translation, as shown in Figure 3-2 by arrow A. If the functional model is simulating a tagged TLB, the populated entry is tagged with the corresponding tag and timestamped. Then, the instruction which has been stalled during this process is released, as shown by arrow 2. Similarly, the DTLB timing model is queried if an address translation is needed for a Load or a Store instruction in the execute stage, and it returns after a certain latency, as shown by arrows 3 and 4 respectively. The flow of the functional data between the DTLB timing model and the functional TLB is shown by arrow C. In the case of the lookup missing in the DTLB and triggering a page walk, the data flow between the DTLB timing


model and the memory is shown by arrows D in Figure 3-2. After this lookup returns, the execution of the µop which was stalled is allowed to continue.

The latency of a TLB lookup depends on whether the required information is found in the functional TLB. If it is a miss, then the page walk latency (PW) determines the time for which the corresponding instruction or µop is stalled. This page walk latency (also referred to in this dissertation as the TLB Miss Penalty) is the minimum number of stall cycles experienced by a µop due to a TLB miss whose page walk does not miss in the L1 cache. If there are any cache misses in the page walk, the µop will be stalled for the latency of those misses in addition to this page walk latency. Thus, a proper choice of the page walk latency is important. To determine these values for the TLB, the RightMark Memory Analyzer (RMMA) [75] is utilized. RMMA allows the estimation of vital low-level system characteristics, including the latency and bandwidth of the RAM, and the average and minimal latencies, along with the size and associativity, of the different levels of cache and the TLB. The RMMA test suite is run on a 64-bit Intel Core 2 Duo CPU running 32-bit Windows XP. From the results of this experiment, a default page walk latency of 60 cycles is chosen.

3.2.5 Validating the TLB Timing Model

As described in Section 3.2.1, this simulation framework is built on top of well-documented and established simulators, i.e., Simics and FeS2. Hence, the validation process is confined to the TLB timing model that has been developed in this work. Validation of the timing part of the simulation framework consists of ensuring that the behavior of the TLB timing model is as expected. For this validation, a simplified pipeline, with the width of every stage set to one, is considered. This ensures that a stall in one particular µop will stall the entire pipeline and no out-of-order execution is possible. It should be noted that this simplification is only for the validation process; an un-simplified pipeline with out-of-order execution capability is used for the experiments discussed in the remainder of this dissertation. The sizes of the L1 and the L2 caches are set to large values of 2MB, thereby ensuring that the page tables are cached and the stalls due to page-walk-related cache misses are minimized. Thus, in this simplified scenario, the primary causes of memory subsystem stalls are the TLB misses and the ensuing page walks.

Table 3-1. Pseudocode of the microbenchmark for TLB timing model validation

    /* Pseudocode of the microbenchmark with well-defined TLB behavior */
    int main()
    {
        /* Step 1 */
        allocate_contiguous_pages(64);

        /* Step 2 - Warmup */
        touch_first_pages(64);

        /* Step 3 - TLB Miss Producing Section */
        /* The number of misses produced by Step 3 is a function of the
         * TLB size and the number of pages being touched, T. */
        touch_first_pages(T);

        return 0;
    }

Then, a micro-benchmark with a well-defined TLB behavior, for which the number of TLB misses for a given TLB size is predictable, is created. Its pseudocode is shown in the listing in Table 3-1. The micro-benchmark consists of three steps. In the first step, a contiguous block of N pages, each of size 4KB, is allocated. In step 2, the first byte of each of these N pages is accessed to warm up the TLB and cache with the necessary page table entries. Then, in step 3, the first T of these N pages are accessed and some value is written into these pages. If the TLB is large enough to hold all the N translations (along with the required O/S/VMM translations) which were looked up in step 2, then step 3 will not cause any


misses in the TLB. On the other hand, a smaller TLB will result in about T misses. Thus, the time for executing step 3 depends on the number of TLB misses, which in turn is decided by the TLB size. In such a scenario, the execution time for step 3 in the simplified pipeline can be theoretically estimated for various TLB sizes. Comparing these estimates to the values obtained from simulations using the TLB timing model serves to validate the TLB timing model.

Figure 3-3. Validation of the TLB timing model. The estimated value (D_Est) and simulated value (D_Sim) of the difference in the execution time for step 3 of the micro-benchmark in Table 3-1, with 64-entry and 256-entry TLBs, are obtained and compared. The simulation values match the estimated values quite closely.

Two fully-associative TLBs of sizes 64 entries and 256 entries are considered. By ensuring that the TLBs are fully-associative, the TLB size becomes the only determinant of the number of misses. Since the number of TLB misses for a given TLB size can be predicted, the time for executing step 3 with these two TLB sizes is estimated and the difference in these times, D_Est, is calculated. Then, the micro-benchmark is simulated with fully-associative 64-entry and 256-entry TLBs using the developed TLB timing


model. The execution time for step 3 is noted and the difference between the execution times obtained from the simulations, D_Sim, is calculated. This experiment is repeated for different values of T and different values of the page walk latency, and the comparison of the obtained D_Sim and D_Est values is shown in Figure 3-3. From this, it can be seen that the difference as obtained from the simulator tracks the theoretically estimated difference quite closely. The maximum deviation between D_Est and D_Sim is about 6.59%, for T=64 and a small page walk latency of 30 cycles. For larger page walk latencies, the deviation drops to less than 3.5%. This verifies that the behavior of the timing model is as expected.

3.3 Selection and Preparation of Workloads

The advantage of a full-system simulation framework, such as the one described in Section 3.2, is that it allows the running of the system and application software stack on the simulated platform. This section describes the software stack used in this dissertation. For the single-O/S scenario, a Debian Linux 2.6.18 kernel with PAE support is booted on the Simics-simulated physical machine and the workload applications are launched as processes in this Linux environment. For the virtualized scenario, Xen [35] is selected. Xen is an open-source hypervisor which can support para-virtual guests running modified versions of operating systems (XenoLinux), or Hybrid Virtual Machines running un-modified O/Ses (if the processor has virtualization support built in). Since virtualization extensions are not supported by the Simics x86 CPU models, the paravirtual version of Xen is used. On top of the Simics-simulated physical machine, a Xen-3.1.0/2.6.18-xen kernel, with PAE support and with haps compiled in to trigger various functions during inter-domain switches, is booted. On booting, Xen starts up a control VM or domain called dom0. From this domain, user domains or domUs are created and the workload applications are launched inside the user domains.
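The "haps" mentioned above are event-callback hooks: named simulation events to which bookkeeping functions can be attached, so that an instrumented guest kernel can notify the simulator of occurrences such as inter-domain switches. The following toy sketch illustrates only the idea; the class and event names here are hypothetical and are not the actual Simics API.

```python
# Toy illustration of an event-callback ("hap") registry. A simulated
# machine calls occurred() when a named event fires, and all callbacks
# registered for that event run with the event's data. Names are
# illustrative, not taken from Simics.

class HapRegistry:
    def __init__(self):
        self._callbacks = {}  # event name -> list of callables

    def add_callback(self, hap_name, fn):
        """Register fn to run whenever the named event occurs."""
        self._callbacks.setdefault(hap_name, []).append(fn)

    def occurred(self, hap_name, **data):
        """Invoked by the simulated machine when the event fires."""
        for fn in self._callbacks.get(hap_name, []):
            fn(**data)

# Example: count the inter-domain switches seen during a run.
registry = HapRegistry()
switches = []
registry.add_callback("Inter_Domain_Switch",
                      lambda old, new: switches.append((old, new)))

registry.occurred("Inter_Domain_Switch", old="dom1", new="dom0")
registry.occurred("Inter_Domain_Switch", old="dom0", new="dom2")
```

A callback registered this way is how, for instance, a flush counter or a per-domain statistics collector could be driven by the instrumented Xen kernel described above.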


3.3.1 Workload Applications

One common workload which is used to benchmark virtualized platforms is VMmark [76]. Here, common server applications, including the Outlook mail server, the Apache web server, the Oracle database server, SPECjbb and dbench, are put together to form a consolidated workload. Due to licensing issues in using VMmark, a similar suite of applications is created in order to have varied workloads. The applications included in this suite are:

- TPCC-UVa [77], an open-source implementation of the TPC-C benchmark standard, which represents typical database transaction processing server workloads. It uses the PostgreSQL database system and a simple transaction monitor to measure the performance of systems, and it forks off one client process per warehouse. In all the simulations in this dissertation, the number of warehouses is set to 4.

- dbench [78], a disk-I/O-intensive file server. Similar to TPCC-UVa, dbench is an I/O-intensive workload. However, the I/O component in dbench is much larger than in TPCC-UVa.

- SPECjbb2005 [79], another OLTP-class workload. SPECjbb differs from TPCC-UVa as it emulates only the server side of an OLTP system [79], whereas TPCC-UVa emulates both client and server operations. Moreover, SPECjbb2005 has a significantly larger memory requirement [80, 81], as its entire database is held in memory, whereas TPCC-UVa stores its database on disk and accesses it as needed. In all the simulations conducted in this research, the heap size of the JVM in which SPECjbb runs is set at 256MB.

- Vortex [82, 83], a database manipulation workload from the SPEC CPU2000 suite of benchmarks. This workload, similar to SPECjbb, also uses a significant amount of memory.

3.3.2 Consolidated Workloads

Consolidated workloads consist of multiple applications constituting the effective workload. On Linux, consolidated workloads are created by running the applications as different processes. To generate such consolidated workloads on Xen, the first application is run in its domain and paused, using the Xen management tools [84], when the point of interest is reached. The point of interest is the phase


where the warmup phase, like reading the database into memory for typical database transaction processing workloads, is completed and the long-running service phase begins execution. By repeating this process for all the applications, multiple virtual machines with the applications running inside them are brought up. All the paused VMs are then resumed at the same time, ensuring that all applications contribute to and influence the behavior of the consolidated workload.

3.3.3 Multiprocessor Workloads

Both uniprocessor and multiprocessor simulations may be performed using the developed simulation framework. In multiprocessor scenarios, Xen allows pinning [84] of virtual CPUs (VCPUs) to physical CPUs. Pinning is a concept wherein a certain VCPU is associated with one or more physical CPUs. This restricts the scheduling of the VCPU to one of the physical CPUs to which it is pinned. By an intelligent use of the pinning mechanism, long-running domains can be given their own CPUs to ensure uninterrupted performance. The terminology used in this dissertation to describe pinned configurations is illustrated by considering an example setup in which the simulated physical machine has two x86 CPUs. Xen is booted on this machine and dom0 is started with two VCPUs. In addition to dom0, two virtual machines with one VCPU each, dom1 and dom2, are created. The workload running on both the user domains is TPCC-UVa. This configuration is termed TPCC-TPCC-nopin, as no domain is explicitly pinned to any CPU. Then, using the pinning commands of Xen, the dom0 VCPUs are restricted to run only on physical CPU0 and the VCPUs of dom1 and dom2 are bound to physical CPU1. Since only dom0 can be scheduled on CPU0 and only dom1 and dom2 can be scheduled on CPU1, this pinning configuration is termed TPCC-TPCC-0012. In the case of uniprocessor simulations, pinning makes no difference, as there is only one CPU


on which all the VCPUs are scheduled. Hence, the nopin/pin annotation is omitted for single-processor scenarios.

3.3.4 Checkpointing Workloads

A typical usage model for low-throughput, high-fidelity simulators is checkpointing. In such cases, the simulation is run until a certain point in a mode where the simulation throughput is quite high. Invariably, the data obtained during this phase is small and is ignored. Once the point of interest has been reached, the simulation state is checkpointed. Then the simulation is restarted in a low-throughput mode where the fidelity of simulation and the quality of the data obtained are high. Such a usage scenario is possible using this developed framework. Simics [67] supports checkpointing, where the entire state of the system, including the memory and I/O subsystems, is saved in the form of compressed files. These files can be copied from one machine to another and used without any loss of data. Using this method, checkpoints of the single- and multi-domain workloads are prepared. A screenshot of a simulated machine running 6 domains is shown in Figure 3-4. Further details of using these checkpoints for long-running parametric-sweep type simulations in batch mode are discussed in Appendix B.

Figure 3-4. Screenshot of the simulation framework in use. The uniprocessor simulated machine has six user domains (domU) and one control domain (dom0). Five of the six user domains are paused, while dom1 is running TPCC-UVa.

3.4 Evaluation of the Simulation Framework

One of the biggest disadvantages of a full-system simulation framework is that the speed of simulation is much lower compared to trace-driven simulators. This is indeed one reason why trace-driven simulators are preferred when only one subsystem is under consideration. In this section, the speed of the simulation framework with and without timing models is examined. The speed of the various simulation modes is characterized by the throughput of the simulation framework, calculated, as shown in Equation 3-1, as the number of x86 instructions simulated using the framework in a given second of wall clock time. The results from these investigations will help understand the time requirements involved in simulation-based analysis and plan accordingly. Such an understanding is important when simulations are performed on shared resources using schedulers such as Maui [85] and Torque [86], where the user has to provide the anticipated time for the simulation to aid in scheduling the jobs properly.

    Throughput = (Simulated x86 instructions) / (Wall Clock Time)        (3-1)

To evaluate the speed of simulation when the TLB timing model is used, a 3GB, 1-CPU x86 physical machine is simulated using Simics. The TLB is configured to be fully associative with a size of 1024 entries and a page walk latency of 60 cycles. Xen is booted on this physical machine and TPCC-UVa, running on a domU, is simulated in three different simulator configurations:

1. just Simics, without FeS2 or the TLB timing model (Only Simics),
2. Simics with FeS2 plugged in, but without the TLB timing model (Simics+FeS2), and
3. Simics with FeS2 and the TLB timing model (Simics+FeS2+TLB Timing Model).

The length of the simulation is varied from 1 million to 1 billion x86 instructions. These simulations are run on an IBM System x tower server with two Intel


Xeon 2GHz cores and 7GB of memory, running 32-bit Linux 2.6.22.6 with PAE support. On this machine, the simulations are run until the specified number of x86 instructions is committed and the wall clock time for the run is noted. From these, the throughput of the simulation framework is computed and used to quantify the speed of the simulation framework; the results are presented in Figure 3-5.

Figure 3-5. Throughput of the simulation framework for uniprocessor simulations with the virtualized TPCC-UVa workload. The throughput of the simulation is measured as the x86 instructions retired per second of wall clock time and is presented as Kilo Instructions per Second (KIPS). The speed of simulation is reduced by more than an order of magnitude when FeS2 and the TLB timing models are used, compared to the throughput in purely functional mode with Simics.

As discussed in Section 3.2.1, Simics is primarily a functional-level simulator and does not provide timing models for the TLB. Hence, the throughput achieved by using just Simics is quite high, of the order of 0.1 million simulated instructions per second, as seen from Figure 3-5. Moreover, the throughput increases with the total number of x86 instructions simulated. This increase is caused by the amortization of the startup costs of the simulation (such as setting up the data structures representing various


microarchitectural components), which do not contribute towards the throughput, over the longer runs. For simulations involving 1 billion instructions and more, the throughput achieved is close to 0.7 million instructions per second. The slowdown in throughput from using FeS2 is considerable, even when the TLB timing model is not used (Simics+FeS2), as can be seen from Figure 3-5. Even for long-running simulations of 1 billion instructions, this slowdown is as much as 30 times, and the throughput achieved is only about 23,000 x86 instructions per second. This slowdown is further compounded when the TLB timing model is used, which lowers the throughput to about 12,000 x86 instructions per second.

The throughput of the simulation framework for multiprocessor simulations, where the simulated machine has more than one CPU, is also examined by simulating an x86 machine with 3GB of memory and two user domains running TPCC-UVa and Vortex. The number of CPUs in this simulated machine is varied between 2 and 8. For greater fidelity of simulation, Simics is set to simulate 1 x86 instruction on a CPU before it switches to the next in round-robin fashion. The throughput of these simulations is presented in Table 3-2. For brevity, only the speeds for long runs (1 billion x86 instructions) are presented. The high-frequency switching between the simulated CPUs causes a high overhead, degrading the throughput as the number of CPUs increases, even when FeS2 and the TLB timing model are not used. For instance, the speed of a 2-CPU simulation is only a third of that of the 1-CPU simulation. When FeS2 and the TLB timing model are used, the simulation speed reduces further and is more than an order of magnitude smaller than the speed without FeS2.

Table 3-2. Throughput of the simulation framework for multiprocessor x86 simulations

    #CPUs in simulated machine    Simulator Configuration           Simulated KIPS
    1                             Only Simics                               667.99
                                  Simics+FeS2                                23.08
                                  Simics+FeS2+TLB Timing Model               13.84
    2                             Only Simics                               260.41
                                  Simics+FeS2                                 5.24
                                  Simics+FeS2+TLB Timing Model                3.68
    4                             Only Simics                               208.33
                                  Simics+FeS2                                 4.23
                                  Simics+FeS2+TLB Timing Model                3.23
    8                             Only Simics                                98.32
                                  Simics+FeS2                                 1.95
                                  Simics+FeS2+TLB Timing Model                1.69

3.5 Using the Framework to Investigate TLB Behavior in Virtualized Platforms

One of the motivating factors for developing the simulation framework is to understand the TLB behavior in virtualized scenarios and quantify the impact of the TLB on the performance of virtualized workloads. To achieve this, consolidated and unconsolidated workloads consisting of the applications described in Section 3.3.1 are simulated using the framework described in Section 3.2. Three different metrics are used to characterize the TLB behavior for a workload:

1. the number of flushes,
2. the ITLB and DTLB miss rates, and
3. the impact of the TLB misses on the workload performance.

Each of these metrics characterizes the TLB behavior at a different granularity, and they are used to illustrate key insights into the behavior of the TLB in virtualized scenarios. The Simics-simulated machine in all the experiments in this chapter is configured to have one CPU and an untagged TLB. In these simulations, the values of parameters not related to the TLB, such as the pipeline width and cache sizes, are maintained at FeS2's default values, shown in Table 3-3. The TLB size is selected to cover both the range of existing TLB sizes found in modern x86 processors as well as larger sizes. As mentioned in Section 3.2.4, the value of the page walk latency is determined to be 60 cycles, based on RMMA experiments on an Intel Core 2 Duo processor. However, since the page walk latency (PW) has an effect on RIPC, a range of latencies from 30 cycles to 90 cycles is used for the simulations.
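The stall accounting that these PW values feed into (Section 3.2.4) can be sketched in a few lines: a TLB hit stalls a µop for no extra cycles, while a miss costs the page walk latency plus the latency of any cache misses incurred by the walk. The function and parameter names below are illustrative, not taken from the simulator; the 8-cycle and 100-cycle miss latencies are the L1 and L2 values from Table 3-3.

```python
# Sketch of the per-µop stall-cycle accounting used by the TLB timing
# model: zero on a hit; on a miss, the page walk latency (PW) is the
# minimum penalty, and cache misses during the walk add their latencies.

def tlb_stall_cycles(hit, pw_latency, walk_cache_miss_latencies=()):
    """Return the number of cycles a µop is stalled by address translation."""
    if hit:
        return 0
    return pw_latency + sum(walk_cache_miss_latencies)

# A hit stalls nothing; a miss whose walk hits in the L1 cache costs
# exactly PW; an L1 miss plus an L2 miss during the walk add 8 + 100
# cycles on top of the 60-cycle default PW.
a = tlb_stall_cycles(True, 60)             # hit
b = tlb_stall_cycles(False, 60)            # miss, walk hits in L1
c = tlb_stall_cycles(False, 60, (8, 100))  # miss, walk misses in L1 and L2
```

Sweeping `pw_latency` over the 30-90 cycle range used in this section changes only the fixed component of the miss penalty, which is why the RIPC results below are reported for several PW values.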


Table 3-3. Simulation parameters for investigating TLB behavior on virtualized platforms

    Parameter                      Values
    Number of Processors           1
    TLB Size                       64, 128, 256, 512, 1024
    TLB Associativity              8
    TLB Page Walk Latency (PW)     30-90 cycles
    L1 Cache Size                  8MB
    L1 Cache Miss Latency          8 cycles
    L2 Cache Size                  32MB
    L2 Cache Miss Latency          100 cycles
    Pipeline Fetch Width           4
    Pipeline Rename Width          4
    Pipeline Execute Width         4
    Pipeline Retire Width          4
    Memory Width                   2
    Length of Simulation           1 billion x86 instructions

3.5.1 Increase in TLB Flushes on Virtualization

The disadvantage of virtualization, with respect to the Translation Lookaside Buffer, is that it increases the number of processes which share the TLB, which raises the number of context switches between these address spaces. By the very nature of hardware-managed TLBs, consistency is maintained during these context switches by flushing the TLB, resulting in a large number of TLB flushes and subsequent TLB misses. The increase in the number of flushes is further compounded by the virtualization requirement that certain privileged instructions (such as I/O and page table updates) have to be trapped and executed by the hypervisor or the virtual machine monitor (VMM), even though they are issued by the virtual machine (VM). Conforming to this requirement causes switches between the VM and the VMM, which further increases the TLB miss rate. The comparison of the number of flushes obtained for the virtualized and non-virtualized workloads is shown in Figure 3-6. As explained in Section 3.3.1, TPCC-UVa consists of many processes, and the context switches between these processes flush the TLB quite frequently. Hence, TPCC-UVa exhibits a large number of flushes per instruction, even when it runs on a non-virtualized system. But the frequency of TLB flushes increases by almost 10× on virtualization. A similar behavior is seen for the dbench workload, as it is I/O-intensive in nature as well. This behavior is due to the I/O component of these benchmarks, which requires switching between the domU on which the application runs and dom0, which contains the I/O backend drivers. On the other hand, when SPECjbb and Vortex are considered, the number of flushes, while still larger on virtualized platforms than on Linux, is smaller compared to the I/O-intensive TPCC-UVa or dbench. For these applications, the ratio of the flushes in the virtualized and non-virtualized scenarios is smaller than for the I/O-intensive benchmarks.

Figure 3-6. Increase in TLB flushes on virtualization. Comparing the TLB flushes on non-virtualized and virtualized platforms reveals a 7× to 10× increase in the number of flushes for virtualized workloads.

3.5.2 Increase in TLB Miss Rate on Virtualization

The effect of the TLB being flushed more frequently is that the lifespan of the TLB entries reduces to the order of a few hundred thousand cycles, causing a big barrier for improved VM performance [14]. The impact of this increased number of flushes can be understood by examining the miss rates for the non-virtualized applications and contrasting them with their virtualized counterparts. In Figure 3-7, the number of TLB misses per thousand instructions (MPKI) for all four workloads, in both the virtualized and non-virtualized scenarios, is presented.

Figure 3-7. Increase in TLB miss rate on virtualization. A) DTLB miss rate for I/O-intensive workloads. B) DTLB miss rate for memory-intensive workloads. C) ITLB miss rate for I/O-intensive workloads. D) ITLB miss rate for memory-intensive workloads. Comparing TLB miss rates on non-virtualized and virtualized platforms shows significantly larger miss rates for the virtualized workloads.

When the change in miss rates with increasing TLB sizes is observed, it is seen that the DTLB miss rate for TPCC-UVa on Xen reduces until about a 256-entry TLB and then becomes constant at 0.5577 misses per thousand instructions. On the other hand, the virtualized SPECjbb and Vortex show a constantly reducing trend in the DTLB miss rates with increasing TLB sizes. It is also clear that the DTLB miss rate on Xen is 1.5× to 5× larger than on Linux for a large TLB of size 1024 entries. This virtualization-driven increase in ITLB miss rates is even larger and, for SPECjbb and Vortex, is as large as 70× for a 1024-entry TLB. Thus, this experiment clearly shows the significantly larger number of TLB misses on a virtualized platform. Depending on whether the page walk hits or misses in the cache, the cost of every TLB miss may be as much as the time taken for a few RAM accesses, i.e., upwards of a few hundred cycles.

3.5.3 Decrease in Workload Performance on Virtualization

To estimate the impact of the TLB on the performance of a workload, the workload is simulated in two different configurations. In the first configuration, Simics is configured to use FeS2 but not the TLB timing model, thereby capturing the behavior of an ideal TLB with zero latency for TLB lookups and a 100% TLB hit rate. The workload Instructions per Cycle (IPC) obtained from this simulation corresponds to an ideal IPC where the TLB is not realistically modeled. This IPC value represents the maximum IPC that could potentially be achieved by any improved TLB design. Then, the framework is configured to run Simics with FeS2 and the regular TLB timing model (finite capacity and non-zero page walk latency) and the realistic IPC of the workload is obtained. The


difference in the IPC values of both these configurations gives an estimate of the TLB's influence in determining the performance of the workload. The metric RIPC, shown in Equation 3-2, which is the ratio of the difference between the realistic and ideal IPCs to the ideal IPC, expressed as a percentage, is used to gauge the impact of the TLB timing model.

    RIPC = 100 × (1 − IPC_RegularTLB / IPC_IdealTLB)        (3-2)

The higher the value of RIPC, the farther the IPC obtained using the realistic TLB timing model deviates from the ideal IPC value. Hence, the RIPC captures the TLB-induced delay in the performance of the workload. Any improvement in the TLB architecture which reduces the TLB-induced delay will lower the RIPC value and, therefore, RIPC may also be used as a figure of merit to compare various TLB improvement schemes. Moreover, RIPC may be used as an estimate of the deviation of the IPC from a realistic IPC when simulation frameworks are used to study the characteristics of virtualized workloads without accounting for TLB timing. Thus, a large RIPC for a workload emphasizes the criticality of modeling the TLB behavior for accurately characterizing the performance of the workload. The RIPC values from these simulations are shown in Figure 3-8 and Figure 3-9 for single and consolidated workloads, respectively.

3.5.3.1 I/O-intensive workloads

TPCC-UVa uses a PostgreSQL database, which it reads from the disk as needed. Thus, TPCC-UVa causes some disk I/O activity. The I/O drivers in Xen use a split architecture, where the front-end driver on the domU uses the privileged backend driver on dom0 to perform I/O. As a result, there are a large number of flushes caused by the context switches between the domains. These I/O-related context switches cause TPCC-UVa on Xen to have a high TLB miss rate and, therefore, a large RIPC, as seen from Figure 3-8A. The RIPC is especially high at smaller TLB sizes, as seen from its value of 8.5% and 12% for a 64-entry TLB with PW values of 60 and 90 cycles respectively.

Figure 3-8. Decrease in single-domain workload performance on virtualization. A) RIPC for TPCC-UVa. B) RIPC for dbench. C) RIPC for SPECjbb. D) RIPC for Vortex. Performance is expressed using IPC and the decrease in performance using RIPC. The RIPC for virtualized workloads is significantly larger, especially at larger TLB sizes.

One advantage of a full-system simulator, such as the one presented in this work, is that the O/S and the software stack are simulated in addition to the workload application. Hence, access to performance monitoring tools like top [87, 88] is readily available. Using top, the memory usage of TPCC-UVa is estimated to be about 50MB. Thus, in addition to the TLB misses driven by the I/O-related context switches, some of the TLB misses are also caused by this memory footprint and the lack of sufficient space in the TLB to accommodate all the entries. When the change in RIPC with TLB size is observed, it can be seen that increasing the TLB size from 64 entries to 256 entries reduces the RIPC from 8.5% to 5.5%, a 35% reduction in the TLB-induced delay. This is because a larger TLB is able to accommodate more entries, thereby avoiding the TLB misses and page walk delays which arise due to the lack of TLB capacity. However, increasing the TLB size beyond 256 entries does not reduce the RIPC significantly. Beyond 256 entries, the dominant cause of TLB misses is the repeated flushing of the TLB and not TLB size limitations.

On comparing the RIPC values for Linux and Xen, a nine-fold increase in RIPC is observed in the virtualized scenario, primarily due to the I/O activity of TPCC-UVa. At a TLB size of 1024 entries, the RIPC for TPCC-UVa on Linux is between 0.35% and 0.9%, depending on the value of the page walk latency. Corresponding values for the virtualized TPCC-UVa lie in the range of 3.6% to 8%. This clearly underlines the increased impact of the TLB on the workload performance in virtualized scenarios and the importance of modeling the TLB timing behavior when simulating virtualized workloads.

The trend of very large RIPC values for the virtualized version of the workload is observed for dbench also, as dbench is another I/O-intensive workload. In fact, since dbench is the most I/O-intensive of all four workloads, it exhibits the highest increase


in RIPC as a result of virtualization. The RIPC on Xen for a 64-entry TLB is more than 10× that of dbench running on Linux. Similar to TPCC-UVa, increasing the TLB size to 256 entries lowers the TLB misses, and thereby the RIPC of virtualized dbench, to some extent. Beyond this point, however, the reduction in RIPC is limited, for the same reason as for TPCC-UVa, i.e., flushing of the TLB becomes the dominant cause of TLB misses. While the trend is similar to TPCC-UVa, the actual RIPC values are larger than those of TPCC-UVa for any given TLB size and page walk latency. It is also instructive to see that, on increasing the size of the TLB, the RIPC values for Linux do not change significantly and reduce only by 9%. This is in contrast to the 60% reduction shown by dbench on Xen.

3.5.3.2 Memory-intensive workloads

The values and the trends of RIPC for SPECjbb on Linux are quite different from both TPCC-UVa and dbench, as seen in Figure 3-8C. Even when SPECjbb runs on Linux, it runs inside a Java virtual machine which has a large heap size of 256MB. Moreover, as explained in Section 3.3, SPECjbb caches its database in memory, causing a wide spread in the pattern of the memory pages it accesses. Both these factors cause SPECjbb to exhibit high TLB miss rates, as reported by Shuf et al. [80]. Thus, even in the non-virtualized scenario, there is a significant RIPC for SPECjbb. In fact, at smaller TLB sizes, the RIPC on Linux is close to that on Xen. For instance, with a 64-entry TLB, the RIPC on Linux is almost 80% of that on Xen for a 60-cycle page walk latency. On increasing the TLB size, however, the additional RIPC due to virtualization becomes pronounced. This is due to the inability of increasing TLB sizes to cope with virtualization-related context switches and the resulting TLB flushes. Even in a workload like SPECjbb, which is not predominantly I/O-intensive, the RIPC for a 1024-entry TLB and a 60-cycle page walk latency increases twofold compared to Linux. Compared to TPCC-UVa and dbench, another notable difference is the scaling of the RIPC values on increasing the TLB size, even beyond the 256-entry TLB size. In fact, compared to the value at 256 entries, the RIPC value for the virtualized SPECjbb reduces


by11%fora512entryTLBand16%for1024entryTLB.Thisbehaviorisduetothememory-intensivenatureoftheworkload.AttheselargeTLBsizes,thecontributionoftheTLBmissesduetoalackofcapacityintheTLBisstillquitesignicantandtheworkloadisabletobenetfromincreasedspaceintheTLB.Moreover,sincevirtualizationdrivenTLBushesarenotpresentinLinux,itcanbeobservedthatthereductioninRIPCismoreforLinuxthanforXen.Vortex,inspiteofbeingapartofCPUintensivebenchmarksuite,alsohasasignicantmemoryusageofabout75MB.WhiletheamountofmemoryitusesislesserthanSPECjbb,itsspreadpatternofaccessingpagescausesittohaveamissratecomparabletomanyJavabasedworkloads[ 83 ].Hence,thetrendoftheRIPCvaluesissimilartoSPECjbb.TheimpactofvirtualizationissmallatsmallTLBsizes,asseenbytheRIPCvalueonLinuxwhichisabout90%ofthevalueonXen.AstheTLBsizeincreases,thereductionofRIPConLinuxismuchsteeperthanonXen.AtaTLBsizeof1024entries,VortexonXenhasanRIPCwhichisalmostfourfoldthatofLinux,for60cyclepagewalklatency.WhilethetrendinRIPCvaluesaresimilarforSPECjbbandVortex,onenotabledifferenceisthemagnitudebywhichtheyreduceonscalingupthesizeoftheTLB.EvenforvirtualizedVortex,theRIPCreducesby70%onscalingtheTLBsizefrom64entriesto1024entries,comparedtothe40%reductionforvirtualizedSPECjbb. 3.5.3.3ConsolidatedworkloadsTostudytheTLBbehaviorforconsolidatedmulti-domainworkloads,twoconsolidatedworkloadsTPCC-UVA SPECjbbandTPCC-UVA dbencharecreated,usingthemethodoutlinedinSection 3.3.2 .Intheseworkloads,bothcomponentapplicationstimeshareasingleCPUforthelengthofthesimulation,i.e.,1billioninstructions.TheseworkloadsaresimulatedusingFeS2,andtheRIPCsareplotted,asshowninFigure 3-9 .FromFigure 3-9A ,itcanbeseenthatthevaluesandthetrendsoftheRIPCsforTPCC-UVA SPECjbbareacombinationoftheindividualvaluesandtrendsfor 61

PAGE 62

ARIPCforTPCC-UVaconsolidatedwithSPECjbb BRIPCforTPCC-UVaconsolidatedwithdbenchFigure3-9. Decreaseinconsolidatedworkloadperformanceonvirtualization.PerformanceisexpressedusingIPCandthedecreaseinperformanceusingRIPC.TheRIPCforvirtualizedworkloadsissignicantlylarger.ThetrendinRIPCforconsolidatedworkloadisacombinationofthevaluesandtrendsoftheRIPCofthecomponentapplications. TPCC-UVaandSPECjbb.Asanexample,ataTLBsizeof64entriesandapagewalklatencyof60cycles,theincreaseinRIPCduetovirtualizationis1.45,8.66and1.26fortheconsolidatedworkload,TPCC-UVaandSPECjbbrespectively.Thisbehaviorisduetothefactthat,becauseofequalprioritiesinthescheduler,theseapplicationstimesharetheTLBcausingtheresultingbehaviortobeacombinationofboththeindividualapplications.ItcanalsobeseenthattheactualvaluesoftheTLBRIPCsarebetweenthoseofcomponentapplicationsforallTLBsizesandpagewalklatencies.AsimilarbehaviorisseenforTPCC-UVa dbenchasshowninFigure 3-9B .SincealltheworkloadsareI/O-intensive,theincreaseinRIPCduetovirtualizationisquitelarge,irrespectiveofTLBsize.Infact,theratioofRIPCvaluesonXenandLinuxisintherange 62


Table 3-4. Impact of page walk latency on TLB-induced performance reduction (RIPC)

PW Latency   RIPC (%), TPCC-UVa on Xen   RIPC (%), SPECjbb on Xen
(Cycles)     64      256     1024        64      256     1024
30           4.63    3.18    3.06        13.49   9.79    8.39
60           8.48    5.60    5.41        21.54   14.94   12.48
90           12.14   7.99    7.73        28.48   19.69   16.28
180          21.67   14.53   14.05       43.92   31.78   26.58
270          29.33   20.21   19.58       54.00   40.85   34.69

of 10 to 6 for the consolidated workloads. The trend of the RIPC on Xen, when scaling up the TLB sizes, also exhibits the behavior of the component applications and tapers off beyond 256 entries. From these observations, it is clear that:

- Independent of whether the virtualized workload is I/O- or memory-intensive, the TLB plays a significant role in determining the performance of virtualized workloads. The impact of the TLB ranges from as low as 1% to as much as 35% depending on the TLB size.

- The importance of the TLB in determining the performance of workloads in a virtualized scenario is significantly larger than in non-virtualized environments. In fact, for I/O-intensive workloads, the influence exerted by the TLB on the performance can be as much as 9 times greater for virtual than non-virtual settings.

- For consolidated workloads, the RIPC trends are a combination of the individual workloads and exhibit a significantly larger RIPC on virtualized platforms than in single-O/S scenarios.

- Not using TLB timing models will cause the IPC values to have large deviations from realistic values.

3.5.4 Impact of Architectural Parameters on TLB Performance

One of the virtualization extensions to the x86 hardware is the introduction of Nested Page Tables (NPT) [36] or Extended Page Tables (EPT) [37], where the VMs can handle page table updates without the help of the hypervisor. While this approach reduces the overhead of switching between the hypervisor and VM, it increases the cost of a TLB miss significantly, as described in Section 2.3.3. To investigate the impact of


the larger PW values on RIPC, TPCC-UVa running on the domU of a 1-CPU machine is simulated with the ideal as well as the regular TLB model for page walk latencies of 180 and 270 cycles, and the RIPC values are calculated. Similarly, the RIPC values for the memory-intensive SPECjbb are also determined for these large PW values. From these RIPC values, tabulated in Table 3-4, it can be seen that the impact of the TLB on the workload performance is significantly larger at large PW values. The RIPC for virtualized TPCC-UVa increases by a factor of about 6.3 on increasing the PW from 30 cycles to 270 cycles. A similar increase of four times is observed in the case of virtualized SPECjbb. This underscores the importance of the TLB and of incorporating detailed TLB timing models while characterizing virtualized workloads for modern platform architectures with multi-level page tables.

Figure 3-10. Impact of the pipeline fetch width (FW) on TLB-induced performance reduction. Performance reduction is expressed using RIPC. The interaction between the TLB and architectural components such as the pipeline can be captured only by using a TLB timing model as in this simulation framework.

Another advantage of having timing models for the TLB is the ability to study the performance impact of various architectural changes on the workload performance and RIPC, even when the said change is not in the TLB. To investigate the effect of one such parameter, i.e., the width of the fetch stage of the pipeline, virtualized TPCC-UVa


is simulated with two different fetch widths of 2 and 4 for multiple TLB sizes and page walk latencies. The IPCs from these simulations are used to determine the RIPCs, which are shown in Figure 3-10. From this, it can be seen that narrowing the fetch part of the pipeline reduces RIPC quite significantly. With a narrower stream of instructions, there is reduced pressure on the TLB, and thereby a smaller number of TLB-related stall cycles. For instance, the RIPC for a 64-entry TLB and 60-cycle page walk latency is almost a third smaller for a 2-wide fetch stage than for a 4-wide fetch stage. This trend is seen irrespective of the TLB size. It is also interesting to note that the reduction in RIPC from narrowing the fetch stage is less pronounced at larger page walk latencies, as the TLB-induced delay increases in comparison with the stall cycles caused by the rest of the system at large PW values. Thus, it is clear that using a timing model will help understand the impact of various non-TLB architectural parameters on the TLB behavior of workloads.

3.6 Summary

In this chapter, a full-system simulation framework based on Simics and FeS2, incorporating detailed TLB functional and timing models, is developed and used to investigate the TLB-induced delay for I/O-intensive, memory-intensive and consolidated workloads. The impact of the TLB on workload performance is found to depend on the TLB size as well as the value of the page walk latency. For typical server workloads, the performance of the workloads is reduced by 8% to 35% due to the increased TLB flushes and misses on virtualized platforms. It is also seen that the TLB-induced performance degradation, especially for TPCC-UVa and dbench, is as much as 7 to 8 times larger for the virtualized workload compared to non-virtualized scenarios.


CHAPTER 4
A TLB TAG MANAGEMENT FRAMEWORK FOR VIRTUALIZED PLATFORMS

While virtualization-based server consolidation offers advantages such as effective, flexible and controllable use of server resources, the workloads running in such virtualized platforms experience lower performance than their non-virtualized counterparts. One significant source of this performance degradation, as seen in Chapter 3, is the high-frequency context switch-related flushing of the Translation Lookaside Buffer, which increases the TLB miss rate and the page walks to service these misses, thereby reducing the performance of the virtualized workloads. Reducing this TLB-induced performance degradation is an important challenge in virtualization.

4.1 Current State of the Art in Improving TLB Performance

Hardware-managed TLBs, such as the x86 TLB, get completely flushed on context switches to ensure consistency of the entries and prevent the entries of one process's address space being used for another process. This repeated flushing causes TLB lookups to miss, necessitating high-latency page walks, and thereby reduces the workload performance. However, if the TLB entries are identified as belonging to a specific address space by using a tag, then the TLB need not be flushed on context switches. Avoiding TLB flushes by tagging the entries with address space identifiers is a well-established technique in software-managed TLBs [26, 27, 89]. The use of TLB tags on Itanium [90] as well as on PowerPC [91, 93] has also been investigated. Prior to the advent of virtualization, however, tagging of the entries in the hardware-managed x86 TLB was not exhaustively studied. This was primarily because the frequency of the hardware-managed x86 TLB being flushed is small in non-virtualized cases, about once every million cycles, and is not a major source of performance degradation.


On the other hand, TLB flushes and the resultant TLB-induced delay cannot be ignored on virtualized systems, as evident from Section 3.5. The introduction of TLB tags and a hardware-based tag-checking mechanism as part of the virtualization extensions, such as AMD-SVM [36, 48] and Intel VPID [37], is clearly a nod to the importance of the TLB on virtualized platforms. In AMD-SVM [48], each TLB entry has a 6-bit Address Space ID (ASID) as a part of its entry. Currently, Xen on AMD-SVM [94] uses ASID 0 for the hypervisor or the Host mode. As long as the CPU is in Host mode, the TLB entries are tagged with ASID 0. When the CPU switches to Guest mode, the TLB is not flushed, but the ASID is changed from 0 to the ASID of the guest VM. Thus, any TLB entry belonging to the hypervisor will not be declared a hit for a Guest-initiated TLB lookup, as the ASID tags will not match. Avoiding TLB flushes using ASID tags is found to yield about an 11% reduction in the overall runtime of kernbench, a kernel-compiling workload [94]. Similarly, in the Intel Nehalem [37], the TLB entries are tagged with a per-VM Virtual Processor Identifier (VPID). Intel platforms such as the Westmere [28] support PCID, a process-specific tag which is assigned and managed by the system software. Tickoo et al. [18] also explore TLB tagging in their qTLB approach, where the TLB entries belonging to the hypervisor, which are global within a domain, are not flushed during a switch from one domain to another.

The primary intent of these efforts is to make the switching between VMs more efficient by avoiding a TLB flush. However, using VM-specific tags¹ can avoid only a subset of the context switch-related TLB flushes compared to using process-specific tagging. In addition to this, while a software-transparent, hardware-only scheme is desirable for hardware-managed TLBs to keep in line with the hardware-managed

¹ In this dissertation, VM-specific tags are alternatively referred to as domain-specific tags, dom-specific tags, per-VM tags or per-domain tags.


design philosophy, the system software is involved in avoiding TLB flushes in all these approaches, including the PCID architecture [28]. To meet these requirements, the Tag Manager Table (TMT) is proposed in this dissertation. The TMT is a low-latency architecture which derives a tag from the PTBR (the CR3 register in x86) in a software-transparent manner and uses it to tag the TLB entries. Such an approach significantly reduces TLB miss rates and the number of TLB flushes, compared to using only VM-specific tags. The impact of the TMT is investigated, in terms of the reduction in TLB flushes, TLB miss rate and TLB-induced performance reduction, using the full-system simulation framework developed in Chapter 3. The influence of various hardware design parameters and workload characteristics on this impact of using the TMT is analyzed. The use of the TMT in enabling shared Last Level TLBs is also presented.

4.2 Architecture of the Tag Manager Table

VM-specific TLB tagging, as seen in qTLB [18], is aimed at avoiding the hypervisor entries being flushed when there is a context switch between two VMs, termed an Inter-VM switch. However, these tags do not prevent the TLB being flushed if there is a context switch between two processes within the same VM, i.e., an Intra-VM switch. By choosing tags that associate the TLB entries with a particular process's address space rather than a particular VM, it is possible to avoid TLB flushes triggered by all types of context switches. Furthermore, it is important that the tagging solution for hardware-managed TLBs preserves the hardware-based TLB management with minimal or no software involvement. These two requirements dictate the design of the Tag Manager Table.

One potential tag which conforms to these requirements is the Page Table Base Register (PTBR), which is stored in a hardware register (the CR3 register in the case of the x86 architecture). Since every process has a unique set of page tables, the value in the CR3 register is unique for every address space and the contents of the CR3 can be


obtained without a high-latency interaction with the system software stack. Hence, the TLB entries may be tagged with the CR3 value to identify the process or virtual address space to which they belong. However, the size of the CR3 register is quite large (32 or 64 bits); tagging the TLB entry with the CR3 increases the die area as well as the energy expenditure for the TLB lookup. Hence, the Tag Manager Table (TMT) is proposed to achieve this software-transparent process-specific tagging with minimal overheads.

The TMT, shown in Figure 4-1, is a small, fast cache implemented at the same level as the TLB. Every TLB in the platform has an associated TMT. Each entry in the TMT represents the context of a process and has three fields:

- The CR3 field, which contains the value of the CR3 register, a per-process unique pointer to the page tables for the process.

- The Virtual Address Space Identifier (VASI), which stores a unique identifier associated with the address space of the process. The VASI is generated as a function of the CR3 in a software-transparent manner. Any function which guarantees that all entries in the TMT have different VASIs, such as a perfect hash or the CR3 masked with an appropriate bitmask, can be used. In the work presented here, the position of the entry in the TMT is used as the VASI. For instance, the VASI for the first entry in the TMT is 0, the second entry is 1 and so on. This simple scheme eliminates the need for a complex hash function or a bitmask while guaranteeing a unique VASI for every TMT entry.

- The Sharing ID (SID) field, which stores the identifier of the sharing class to which the process belongs. The SID is needed only for controlling the sharing of the TLB and can be left unassigned in the case of uncontrolled sharing, as in all the experiments performed in this chapter. The selection of the sharing classes and the use of the SID are discussed in detail in Chapter 5.

The SID and the VASI together constitute the CR3 tag². Tagging the TLB entries with the CR3 tag instead of the CR3 itself results in a lower area overhead. For instance, with an 8-entry TMT and a 3-bit SID, the CR3 tag is only 6 bits compared to the 32- or 64-bit CR3. The TMT architecture also consists of a Current Context Register (CCR). The

² In this dissertation, the CR3 tag is also referred to as a process-specific tag or per-process tag.


CCR is a register with the same size as a Tag Manager Table entry, which caches the CR3, SID and VASI for the current context.

Figure 4-1. TLB flush behavior with the Tag Manager Table. In step 1, a value is written into CR3, prompting a flush. In step 2, the TMT is searched for the new CR3. Simultaneously, the new CR3 is compared to the current CR3 in the CCR. The TLB and the TMT are flushed if the new CR3 matches the CCR, or the new CR3 is inserted into the TMT after evicting an existing entry. This is shown in step 3.

4.2.1 Avoiding Flushes Using the Tag Manager Table

Whenever there is a context switch from process P1 to P2, a TLB flush is triggered by the MOV CR3 instruction which updates the value of the CR3 register, as shown in step 1 of Figure 4-1. On the triggering of the TLB flush, the TMT is searched to determine if the CR3 value of P2 already exists, as shown in Figure 4-1, step 2. Simultaneously, the new value being written into the CR3 is compared with the current CR3 value from the CCR. If the new CR3 value is different from the current CR3 value, it is deduced that the TLB flush was triggered by a context switch. The TMT is searched for the new CR3 value. If it exists in the TMT, that TMT entry is copied into the Current


Context Register. On the other hand, if the CR3 value of P2 is not found in the TMT, it is inserted into a free slot in the TMT and a VASI is assigned to it. Then, this TMT entry is copied into the CCR. Once the CCR is populated with the CR3 and the tags of P2, any TLB lookup will hit only if the TLB entry belongs to P2 and matches the tags in the CCR. Thus, in both these cases, updating the CCR is equivalent to flushing the TLB and the actual TLB flush can be avoided.

A situation may arise during a context switch from P1 to P2 where the CR3 of P2 is not in the TMT and, due to limited capacity, there are no free entries in the TMT. In this case a victim TMT entry, (CR33, SID3, VASI3) belonging to P3, is selected in accordance with the First In First Out (FIFO) replacement policy. The CR3 and SID values of P2 replace CR33 and SID3, while VASI3 is reused for P2. To avoid the TLB entries of P3 being used for P2, the entries with the tag VASI3 are flushed, as seen in Figure 4-1, step 3. This flush, caused by the lack of capacity in the Tag Manager Table, is termed a Capacity Flush. Since the latency for examining every TLB entry and flushing only those entries with tag VASI3 may be prohibitive, the capacity flush is implemented as a full TLB flush. However, the downside of such an implementation is the eviction of TLB entries whose tags are not VASI3, thereby potentially increasing the TLB miss rate. Moreover, ISA extensions [28] for flushing entries with a specific tag, and the hardware to implement this instruction without a prohibitive latency, are being introduced in modern processors. With such extensions, the capacity flush may be implemented as a selective flush and not result in the entire TLB being flushed.

Apart from context switches, TLB flushes may also be triggered by changes in the page tables. Whenever page tables are modified, any entry cached in the TLB which is affected by this change should be flushed from the TLB to maintain consistency between the TLB and the page tables. In both non-virtualized (Linux) and virtualized (Xen) systems, consistency is maintained by flushing the entire TLB. On examining the source code of both Linux and Xen, it is found that this flush is effected by a two-step


process. The current value in the CR3 register is read in the first step, and the same value is written into the CR3 register using a MOV CR3 instruction in the second step. Even though no change of context is involved, this MOV CR3 instruction still triggers a flush of the TLB. Such flushes are called Forced Flushes. The TMT is designed to recognize these Forced Flushes. As seen in step 2 of Figure 4-1, whenever a new value is written into the CR3, it is compared with the current CR3 value from the CCR. If both of them are the same, this flush is deduced to be a Forced Flush and the TLB is flushed completely, as depicted in Figure 4-1, step 3. Whenever the TLB is force-flushed, the TMT is also flushed, to free the slots being occupied by contexts none of which have any entries in the TLB. This behavior is shown in Figure 4-1.

4.2.2 TLB Lookup and Miss Handling Using the Tag Manager Table

Figure 4-2. TLB lookup behavior with the Tag Manager Table. In step 1, a possible match is found in the TLB by comparing the VPN of the virtual address with the TLB entries. In step 2, the VASI from the TLB entry and the VASI from the CCR are compared. The TLB lookup results in a hit only if both the VPNs and the VASIs match, as in step 3.


The TLB lookup happens as shown in Figure 4-2. The TLB is searched for any entry which has the same VPN as the virtual address. Simultaneously, the VASI of the current context is looked up from the CCR. The entry is declared a hit only when its VASI matches the VASI in the CCR and the VPN is the same as the VPN in the virtual address being looked up. Since the CCR is a dedicated register, the VASI can be looked up with minimum latency. It should be noted that the comparison of the VASI happens in parallel with the VPN comparison, as shown in Figure 4-1. Thus no additional latency is imposed by the TMT in the critical TLB lookup path. If the lookup results in a miss, the page walk proceeds to determine the physical address from the page tables. Once this translation is obtained, it is added to the TLB along with the CR3 tag (SID and VASI) of the current context.

One issue with enabling TLB sharing, as with caches which are indexed using virtual addresses, is aliasing [95]. Aliasing is the situation where the same translation may be cached once for every process's address space, thus creating multiple copies of the same entry. Such situations arise typically with Global entries, which are translations for virtual addresses in the 3GB-4GB range belonging to the kernel. For instance, the entries corresponding to the high memory range in Linux are valid in all process address spaces and are marked using the Global bit in the TLB entries. To avoid multiple copies of such Global entries with different VASI tags, the TLB lookup logic is modified to hit when either the VASI of the entry matches the VASI in the CCR or the Global bit in the entry is set. This ensures that only one copy of a Global entry is cached in the tagged TLB.

While the preceding explanation of the TMT is for x86 processors without hardware virtualization support, it can be used for processors with Extended/Nested page tables (EPT/NPT) [36, 37, 48] with minor modifications. In a processor with EPT/NPT support, as described in Section 2.3.3, both the guest and host CR3 values are cached in the TMT entry, ensuring that the CR3 tag will still be unique per process address space.
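The lookup rule described above — a hit requires a VPN match together with either a VASI match or a set Global bit — can be sketched with a minimal behavioral model. This is an illustrative sketch, not the hardware implementation; the class and field names are assumptions chosen for readability.

```python
from dataclasses import dataclass

@dataclass
class TLBEntry:
    vpn: int                  # virtual page number
    ppn: int                  # physical page number
    vasi: int                 # CR3 tag: Virtual Address Space Identifier
    global_bit: bool = False  # kernel mapping valid in all address spaces

class TaggedTLB:
    """Behavioral sketch of a fully associative CR3-tagged TLB."""

    def __init__(self):
        self.entries = []

    def lookup(self, vpn, ccr_vasi):
        # In hardware the VPN and VASI comparisons happen in parallel;
        # a Global entry hits regardless of the current context's VASI.
        for e in self.entries:
            if e.vpn == vpn and (e.vasi == ccr_vasi or e.global_bit):
                return e.ppn
        return None  # miss: a page walk would then fill the entry

    def fill(self, entry):
        # On a miss, the walked translation is inserted along with
        # the CR3 tag (VASI) of the current context.
        self.entries.append(entry)

tlb = TaggedTLB()
tlb.fill(TLBEntry(vpn=0x10, ppn=0x99, vasi=0))
tlb.fill(TLBEntry(vpn=0x20, ppn=0xAA, vasi=1, global_bit=True))

assert tlb.lookup(0x10, ccr_vasi=0) == 0x99  # VPN and VASI match: hit
assert tlb.lookup(0x10, ccr_vasi=1) is None  # VASI mismatch: miss
assert tlb.lookup(0x20, ccr_vasi=0) == 0xAA  # Global entry: hit for any VASI
```

The third assertion illustrates the aliasing fix: the Global entry was filled under VASI 1 but still hits under VASI 0, so only one copy of the kernel translation needs to be cached.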


4.3 Modeling the Tag Manager Table

The TMT and the process-specific tagged TLB are modeled using the generic tagged TLB simulation model described in Section 3.2. The functionality of the TMT is mapped to the GMT module. Thus, the TLB flush on every MOV CR3 instruction is intercepted by the GMT module, which performs the necessary changes in the TMT and uses the TMT functionality to decide whether this flush should be carried out or avoided. Similarly, the CCR is mapped to the Tag Cache, which gets updated on every MOV CR3 instruction. Since the TMT is designed to perform the tag comparison without imposing any additional delay during the TLB lookup, the TLB lookup latencies are maintained at the same values when simulating the regular TLB and the tagged TLB. The modeling of the TMT is validated using the Functional Check mode described in Section 3.2.

4.4 Impact of the Tag Manager Table

In this section, the benefit of using the Tag Manager Table is evaluated using three metrics, similar to those used in Section 3.5, namely:

1. the number of flushes,
2. the ITLB and DTLB miss rates, and
3. the increase in workload performance.

4.4.1 Reduction in TLB Flushes Due to the TMT

In a generic cache memory, the size of the cache is the main determinant of the miss rate. When a workload begins to execute, there will be a few misses as the data is brought into the cache for the very first time. Such misses are termed cold misses. Beyond this warmup phase, for an infinitely large cache, all the required data will be contained in the cache and the hit rate will asymptotically reach 100%. However, the situation is not the same for TLBs. Apart from the size, one of the main determinants of the TLB hit rate is the frequency at which the TLB is flushed. In TLBs where no tags are used, the hit rate in the TLB is limited because of the shortened lifespan of the entries. Even in the case of TLBs with unlimited size, the hit rate is still limited due to the periodic purging of the TLB. The benefit of using an identifier to tag the TLB is in avoiding flushes and lowering the


Figure 4-3. Reduction in TLB flushes using an 8-entry TMT. A) Flush profile of different applications running on a 1-CPU simulated machine. The intra-VM flushes are high when there is more than one process in a domain. B) Reduction in TLB flushes on using an 8-entry TMT. More than 90% of the flushes are eliminated in cases where the Forced Flushes do not dominate.

miss rate in the TLB. If more flushes are avoided, the increased lifespan of the TLB entries will result in higher hit rates. Thus, the reduction in the number of TLB flushes compared to an untagged TLB is a coarse, yet intuitive, figure of merit for understanding the impact of the TMT.

TLB flushes occurring in virtualized scenarios can be classified, based on the cause for the flush, into three types. The reason that the TLB has to be flushed can be either a context switch or that the page table has been modified, as described in Section 4.2. If the cause is a context switch, the two processes between which the switch happens could be within the same VM or could be part of different VMs. Based on this, a flush can be classified into three categories: Intra-VM flushes caused by an Intra-VM context switch, Inter-VM flushes caused by a domain-to-domain or Inter-VM context switch, and Forced flushes. This classification is called the flush profile and is a good indicator of


the gain that can be achieved by using a tagged TLB. For instance, if the forced flushes dominate, then, irrespective of whether process-specific tags or domain-specific tags are used, the TLB will still be frequently flushed. In such cases, the number of flushes that can be avoided will be small, leading to smaller gains from using tagged TLBs. On the other hand, using tags will reap significant benefits when the context switch flushes, which can be avoided, dominate.

The flush profiles of the four workloads mentioned in Section 3.3.1, running on a simulated x86 machine with one CPU and one domU, are presented in Figure 4-3A. It can be observed that TPCC-UVa, which is a typical server workload, has a significant number of context switch flushes, about 92%. Out of these, the numbers of intra-VM and inter-VM context switch flushes are almost equal. However, in the case of single-process workloads such as SPECjbb and Vortex, inter-VM flushes dominate the profile compared to intra-VM flushes. Moreover, since the only activity performed by dbench is file reads and writes, it is more I/O-intensive than TPCC-UVa. Hence, most of the flushes it experiences are due to the transitions between domU and dom0 for access to the privileged device drivers residing on dom0, and due to the forced flushes resulting from the actual transfer of data to/from the disk. Thus, the intra-VM flushes constitute only 2.5% of the total flushes for dbench.

The advantage of using the TMT and process-specific tags is that both inter-VM and intra-VM flushes can be avoided. From the reduction in the TLB flushes for these workloads using an 8-entry TMT, as shown in Figure 4-3B, it is seen that about 96% of the flushes for SPECjbb and Vortex are avoided, even though the inter-VM flushes dominate for these workloads. In cases where there are a substantial number of intra-VM flushes, as in TPCC-UVa, almost 90% of the flushes are eliminated. If, on the other hand, domain-specific tags were used, only about 50% of the flushes would have been eliminated. The reduction in the TLB flushes is smaller only for dbench, where 35%


of the TLB flushes are forced flushes and are unavoidable. Even for this workload, the elimination of context switch flushes reduces the total number of flushes by 65%.

Effect of the Tag Manager Table size

While the composition of the flushes determines the number of flushes that can be avoided, the size of the TMT (the number of entries in the TMT) also influences this reduction and is an important design parameter. The TMT size decides the number of processes or address spaces that can concurrently share the TLB. If the size is increased, additional processes can be represented in the TMT and the number of capacity flushes (context switch flushes which could not be avoided due to lack of capacity in a smaller TMT) can be reduced. On the other hand, increasing the TMT size causes the VASI to have a larger size and increases the die size as well as the energy required for tag comparison. If the number of capacity flushes is already small, increasing the TMT size will not result in a commensurate reduction of the TLB miss rate. Moreover, in cases where the size of the TLB entry tag is fixed, such as the 6 bits for the AMD SVM [48], a smaller TMT results in a smaller VASI, leaving free bits which may be used to store metadata for TLB usage management. Hence, determining the appropriate TMT size is quite important.

To study the size tradeoffs for the TMT, the TPCC-UVa application is run on a simulated x86 uniprocessor machine which has 256-entry, 8-way TLBs with CR3 tagging. The size of the TMT is varied from 0 entries (representing a situation with no CR3 tagging) to 16 entries. For each TMT size, the number and type of flushes as well as the reduction in TLB miss rates are observed. The results are shown in Figure 4-4.

Over 10 billion instructions of TPCC-UVa, there are 64738 flushes. Out of these, 5100 are forced flushes and the remaining 59638 are due to inter-VM and intra-VM context switches. When the TMT size is 0 entries, every context switch causes a capacity flush. Hence, the TLB is flushed 64738 times, as seen from Figure 4-4. However, with CR3 tagging and a 2-entry TMT, there is a substantial reduction in the


Figure 4-4. Effect of Tag Manager Table size on the reduction in number of flushes. The number of flushes for TPCC-UVa over 10 billion x86 instructions is shown on the left Y axis using a log scale. The reduction in DTLB and ITLB miss rates for a 256-entry 8-way TLB is shown on the right Y axis. While increasing the TMT size up to 8 entries reduces the total number of flushes and the miss rate, further increase does not reduce the total number of flushes significantly and therefore does not reduce the miss rate.

number of flushes from 64738 to 29484, as the number of capacity flushes reduces by more than 50%. This reduces the miss rate by about 25% for the DTLB and 30% for the ITLB. Further scaling up the TMT size, however, gives diminishing returns, and any size beyond 8 entries does not substantially reduce the miss rate, even though the capacity flushes are reduced. This is because, at TMT sizes larger than 8 entries, the dominant type of flush is the forced flush and not the capacity flush. Even if the capacity flushes are reduced by having a larger TMT, the forced flushes still periodically flush the TLB, limiting the lifetime of the entries. The simulations are repeated with SPECjbb, Vortex and dbench. In all the cases, it is found that an 8-entry Tag Manager Table is sufficient to ensure that the number of capacity flushes is much smaller than the number of forced flushes.
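The flush accounting above can be illustrated with a toy model that replays a trace of CR3 writes through a FIFO-replacement TMT, counting which flushes are avoided, which become capacity flushes, and which are forced (Section 4.2.1). This is a minimal sketch under simplified assumptions (no SIDs, CR3 values as plain integers); the function and trace are illustrative, not the simulator's implementation.

```python
from collections import deque

def replay_cr3_trace(trace, tmt_size):
    """Count flush outcomes for a trace of CR3 writes.

    Each MOV CR3 would flush an untagged TLB. With a TMT:
      - same CR3 rewritten       -> forced flush (TLB and TMT purged)
      - CR3 already in the TMT   -> flush avoided (only the CCR updates)
      - new CR3, TMT has room    -> flush avoided (entry inserted)
      - new CR3, TMT full        -> capacity flush (FIFO victim evicted)
    """
    tmt = deque()  # CR3 values in FIFO insertion order
    current_cr3 = None
    counts = {"forced": 0, "avoided": 0, "capacity": 0}
    for cr3 in trace:
        if cr3 == current_cr3:
            # Same value rewritten: a Forced Flush; TMT is purged too.
            counts["forced"] += 1
            tmt.clear()
            if tmt_size > 0:
                tmt.append(cr3)
        elif cr3 in tmt:
            counts["avoided"] += 1
        elif len(tmt) < tmt_size:
            counts["avoided"] += 1
            tmt.append(cr3)
        else:
            # TMT full (or size 0): evict the oldest entry, reuse its VASI.
            counts["capacity"] += 1
            if tmt:
                tmt.popleft()
                tmt.append(cr3)
        current_cr3 = cr3
    return counts

# Four processes switching round-robin: a 4-entry TMT avoids every flush,
# while a 2-entry TMT thrashes with capacity flushes.
trace = [1, 2, 3, 4] * 5
print(replay_cr3_trace(trace, tmt_size=4))  # all 20 switches avoided
print(replay_cr3_trace(trace, tmt_size=2))  # mostly capacity flushes
```

The round-robin trace shows the diminishing-returns behavior in miniature: once the TMT is large enough to hold the working set of contexts, growing it further cannot reduce flushes, and only the forced flushes remain.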


4.4.2 Reduction in TLB Miss Rate Due to the TMT

While the reduction in the number of flushes is a coarse metric and provides some insight into the advantage of using process-specific tags, it is not sufficient to investigate the benefit of the TMT thoroughly. For instance, if the flush profile for a workload is such that it experiences no intra-VM flushes, using either process-specific tags or domain-specific tags will avoid the same number of flushes. However, domain-specific tagging solutions such as qTLB [18] can retain only the hypervisor's TLB entries across context switches. Using process-specific tags such as the CR3 tags can retain all entries. Thus, though the same number of flushes is avoided, using process-specific tags may result in a lower TLB miss rate. In order to capture such differences, the reduction in TLB miss rate when using the TMT, compared to using an untagged TLB, is used. This metric is quantified as Reduction, as shown in Equation 4-1, and is expressed as a percentage of the untagged TLB miss rate. The advantage of using Reduction is its high sensitivity to any TLB- or TMT-related changes and its insensitivity to changes in other architectural subsystems such as the cache.

Reduction (%) = 100 × (1 − TLB miss rate with tags / TLB miss rate without tags)    (4-1)

The benefit of not flushing the TLB when switching from process P1 to P2 depends on the amount of TLB that is being used by P2. If P2 requires a large TLB space, any of P1's entries which survived the TLB flush will still be evicted to make space for P2's entries. In such cases, the reduction in the TLB miss rate due to tagging will be very small. Thus, the maximum benefit from tagging can be obtained when the TLB is large enough to accommodate the entries of both P1 and P2. On the other hand, a large TLB will consume valuable chip real estate which may be utilized better by other subsystems, such as a larger L1 cache. Thus, the TLB size should be made sufficiently large to optimize the reduction of the miss rate due to CR3 tagging, but no larger.
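The Reduction metric defined above is a straightforward ratio; a small worked example makes the units concrete. The miss rates are expressed in Misses per Thousand Instructions (MPKI), as used later in this section; the numerical values below are illustrative, not measured results.

```python
def reduction_percent(mpki_tagged, mpki_untagged):
    """Reduction (%) = 100 * (1 - tagged miss rate / untagged miss rate).

    Both rates are in Misses per Thousand Instructions (MPKI). A value
    of 65 means the tagged TLB eliminates 65% of the untagged misses.
    """
    return 100.0 * (1.0 - mpki_tagged / mpki_untagged)

# Illustrative values: an untagged DTLB at 4.0 MPKI vs 1.4 MPKI with CR3 tags.
print(round(reduction_percent(1.4, 4.0), 2))  # 65.0
```

Because Reduction is normalized by the untagged miss rate of the same configuration, it isolates the effect of tagging from changes in other subsystems such as the cache.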


Figure 4-5. Reduction in TLB miss rate using an 8-entry TMT and 8-way associativity. A) Reduction in DTLB miss rate. B) Reduction in ITLB miss rate. Larger TLB sizes allow the caching of more TLB entries across context switches, leading to a higher reduction in TLB miss rate.

To investigate the dependence between the benefit of using the TMT and the TLB size, the I/O-intensive and memory-intensive workloads are simulated on a uniprocessor Simics machine and the Reduction in DTLB and ITLB miss rates is plotted, as shown in Figure 4-5. It should be noted that the miss rate used in these calculations is expressed in Misses per Thousand Instructions (MPKI). From this, it can be seen that all workloads, except dbench, show an increasing Reduction with TLB size. For instance, the Reduction trend for TPCC-UVa shows that the DTLB miss rate for a 1024-entry tagged TLB is 65% smaller than for the untagged TLB. On the other hand, even though dbench shows some increase in the Reduction in DTLB MPKI with TLB sizes up to a 256-entry TLB, the TLB misses due to the lack of TLB capacity stop being the predominant source of TLB misses and the repeated flushing of the TLB begins to dominate beyond these TLB sizes, causing a plateau in the Reduction curve. Both Vortex and SPECjbb exhibit


Reduction curves with a high slope, even for a 1024-entry TLB, indicating that a further increase in the TLB size may achieve an even lower DTLB miss rate.

The Reduction trends for the ITLB miss rate, shown in Figure 4-5B, are markedly different from the DTLB Reduction trends. The space required in the ITLB is smaller than the DTLB space requirements, due to the instruction memory footprint for these workloads being smaller than the data memory footprint. As a result, the major difference from the DTLB trend is that the reduction in ITLB miss rate is significantly larger, for any given TLB size, than the reduction in the DTLB miss rate. Moreover, while the Reduction in the DTLB is low for SPECjbb and Vortex due to their memory-intensive nature, the Reduction in the ITLB miss rate is significantly high. It can also be observed that, in spite of the small instruction memory footprint, the repeated forced flushing of the TLB causes the Reduction in ITLB miss rate for dbench to be limited.

TLB associativity

Another important TLB design parameter is the associativity. Increasing the associativity will reduce the conflict misses in the TLB. However, larger associativity values necessitate more comparators in the TLB lookup hardware to match the VPN, thereby increasing the area and power requirements. Hence, it is important to understand the effect that the TLB associativity has on the reduction in miss rate due to CR3 tagging. On simulating TPCC-UVa with an 8-entry TMT and tagged TLBs of varying associativity values, and plotting the Reduction trend as shown in Figure 4-6A and Figure 4-6B, it can be observed that the associativity has little effect on the Reduction. There is some additional Reduction in the miss rate when the associativity is changed from 4-way to 8-way, but any further increase in the set size does not change the Reduction by a large amount. This analysis is also performed for the other workloads and a similar response to changing associativity is observed. Thus, by setting the associativity value


Figure 4-6. Effect of TLB associativity on the reduction in miss rate with an 8-entry TMT. A) Reduction in DTLB miss rate. B) Reduction in ITLB miss rate. While increasing the associativity from 4-way to 8-way shows some additional increase in the reduction in TLB miss rates, higher associativity values do not make a significant difference.

at 8, the benefit of using the TMT can be obtained without a high area and power overhead.

4.4.3 Increase in Workload Performance Due to the TMT

The most important end result of using the TMT is the improvement in the performance of virtualized workloads. However, Reduction is not sufficient to understand this improvement. To quantify this performance improvement, workloads are first run on the framework described in Section 3.2 with an untagged regular TLB model, and the Instructions per Cycle (IPC) from this simulation (IPC_RegularTLB) is noted. Then, the workloads are simulated using the tagged TLB model augmented with the TMT and the IPC (IPC_TMTTLB) is noted. The reduction in TLB misses on using the tagged TLB is reflected in IPC_TMTTLB being higher than IPC_RegularTLB, and this Increase in IPC


(IIPC), as shown in Equation 4-2, gives the impact of the TMT on the performance of the workloads. The greater the number of TLB misses avoided by the TMT, the larger is the value of IIPC. The theoretical maximum value of IIPC may be obtained when the TLB behaves like an ideal TLB, as explained in Section 3.5.3, and experiences no TLB misses. In this case, the TLB-induced delay, i.e., the latency due to TLB misses and subsequent page walks, is completely eliminated. By simulating the workloads with an ideal TLB model (no TLB misses and no latency due to page walks) and observing the IPC (IPC_IdealTLB), this maximum achievable IIPC can be obtained. Expressing the IIPC achieved using the tagged TLB as a percentage of this maximum achievable IIPC, as shown in Equation 4-3, gives the Impact Factor (IF) of the TMT. This IF gives an insight into the performance benefit of the TMT architecture. For instance, an IF of 50% implies that the TMT improves the IIPC by 50% of the increase achievable by any TLB architecture (including the ideal TLB), or that the impact of the TLB delay on overall performance has been reduced by 50%.

IIPC = 100 × (IPC_TMTTLB / IPC_RegularTLB − 1)    (4-2)

IF = 100 × (IPC_TMTTLB − IPC_RegularTLB) / (IPC_IdealTLB − IPC_RegularTLB)    (4-3)

Using IPC-based metrics to understand the performance impact of the TMT has the advantage of being applicable to all types of workloads, especially when it is not feasible to run the workload benchmarks to completion. Moreover, avoiding TLB misses will reduce the time spent by the CPU waiting for page walks to complete, and using IPC is appropriate for estimating this reduction. However, it is also important to understand the implications of using the TMT with user-observable performance metrics. For this, SPECjbb is instrumented to indicate the completion of every transaction. This


number of transactions is used to estimate the throughput of SPECjbb and measure the improvement in SPECjbb's performance when the TMT is used.

To understand the improvement in performance due to the TMT, a single-CPU x86 machine is simulated using the framework described in Section 3.2, and the virtualized workload is run on this x86 machine with either the ideal TLB, the regular TLB, or the tagged TLB with an 8-entry TMT. The IF and IIPC values for various TLB sizes and 8-way associativity are calculated and presented in Figure 4-7. As seen from Section 4.4.1, TPCC-UVa experiences approximately equal numbers of inter-VM and intra-VM flushes and a much smaller number of forced flushes. Avoiding these flushes using the TMT reduces the TLB miss rate and improves the IPC value, as seen from Figure 4-7.

Two factors which determine the TLB miss rate, and therefore the delay due to TLB misses, are the TLB size and the frequency of TLB flushing. Figure 4-7 shows that scaling up the TLB size initially increases the IF and IIPC values due to a reduction in the capacity misses in the TLB. For instance, the IF for the 128-entry TLB is almost four times the IF for the 64-entry TLB. However, the IF for the 4096-entry TLB is almost the same as for the 1024-entry TLB. At these large TLB sizes, most of the required translations are cached in the TLB and the dominant cause of the TLB-induced delay is the TLB misses due to TLB flushes, which do not change on increasing the TLB size. Hence, the IF and IIPC do not vary significantly at these sizes. It is also clear that the trend in IF is different from the Reduction trends for ITLB and DTLB miss rates.

dbench, as seen from Figure 4-7, shows an IF trend similar to TPCC-UVa, i.e., increasing rapidly for smaller TLB sizes and showing smaller increments at larger TLB sizes. The significant difference is in the actual values of the Impact Factor IF. For instance, the IF for dbench with a 1024-entry TLB and 60-cycle page walk latency is 22.04%, which is less than half of the 49.65% seen for TPCC-UVa. The reason for this behavior is the flush profile of these workloads. Over a simulation run of 10 billion x86 instructions, with CR3 tagging, dbench experiences 25263 flushes, all of which

PAGE 85

Figure 4-7. Increase in workload performance using an 8-entry TMT at PW of 60 cycles and 8-way associativity. (A) Increase in IPC (IIPC). (B) Impact Factor (IF). Using the TMT eliminates a significant fraction of the TLB-induced delay, except for dbench. The impact is limited for dbench due to the predominance of forced flushes.
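The IIPC and IF metrics defined in Equation 4 can be computed directly from measured IPC values. A minimal sketch (the IPC numbers below are illustrative placeholders, not measurements from these simulations):

```python
def iipc(ipc_tmt, ipc_regular):
    """Percentage increase in IPC from using the TMT-tagged TLB."""
    return 100.0 * (ipc_tmt / ipc_regular - 1.0)

def impact_factor(ipc_tmt, ipc_regular, ipc_ideal):
    """IIPC expressed as a percentage of the maximum achievable increase,
    i.e. the increase obtained with an ideal (miss-free) TLB."""
    return 100.0 * (ipc_tmt - ipc_regular) / (ipc_ideal - ipc_regular)

# Illustrative values: regular TLB, tagged TLB with TMT, ideal TLB
ipc_regular, ipc_tmt, ipc_ideal = 1.00, 1.05, 1.10
print(round(iipc(ipc_tmt, ipc_regular), 2))                      # 5.0
print(round(impact_factor(ipc_tmt, ipc_regular, ipc_ideal), 2))  # 50.0
```

An IF of 50.0 here corresponds to the interpretation given above: the TMT recovers half of the gain an ideal TLB would provide.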
are forced flushes. On the other hand, TPCC-UVa experiences only 7686 flushes, of which 2586 are capacity flushes and 5100 are forced flushes. This higher rate of unavoidable flushes reduces the impact of CR3 tagging for dbench compared to TPCC-UVa. Thus, the IF is 20%, even at a TLB size of 4096 entries.

SPECjbb differs from TPCC-UVa in that it has a significant number of capacity-driven TLB misses. The delay due to the TLB misses in SPECjbb is primarily caused by its working set size and the lack of space in the TLB, rather than the flushing of the TLB. Hence, the benefit of larger TLB sizes is more pronounced and the increase in IF with TLB size is steeper than for TPCC-UVa, as seen from Figure 4-7. It is also observed that the improvement in the transaction rate (throughput) of SPECjbb, obtained by instrumenting SPECjbb to indicate the completion of every transaction, tracks IIPC closely. For instance, for a 60-cycle PW, the transaction rate of SPECjbb is improved by 3.29% and 7.21% for TLB sizes of 1024 entries and 4096 entries respectively.

Since Vortex also has a large number of capacity-driven TLB misses, its IF trend is closer to SPECjbb than to TPCC-UVa. The difference, as seen from Figure 4-7, lies in the actual values of IF. At a TLB size of 64 entries, the IF values for SPECjbb and Vortex are very similar at 0.39% and 0.42% respectively. However, at a TLB size of 1024 entries, the IF for Vortex increases to 71%, which is about thrice the IF for SPECjbb. This difference is due to the working set size of Vortex being smaller than that of SPECjbb, with the majority of its translations being accommodated in a 1024-entry TLB, unlike SPECjbb. This effect of a large IF when the TLB becomes sufficiently large to capture the entire working set is seen for SPECjbb as well, at a size of 4096 entries. Both these workloads fulfill the expectation of a large IF, i.e. benefit from the TMT, at large TLB sizes, as predicted in Section 4.4.2.

Sensitivity of IIPC to the page walk latency

There are recent virtualization-driven enhancements such as Nested Page Tables (NPT) [36] or Extended Page Tables (EPT) [37] that indicate that page walk latencies
Figure 4-8. Effect of the Page Walk Latency on the improvement in performance with 8-entry TMT, 1024-entry 8-way TLB. The performance improvement due to the TMT is significantly higher at larger PW values.

can further increase. Unlike the one-level Shadow Page Tables used for address translation in processors without this extension, processors with EPT/NPT support have two levels of page tables, both of which are used for translating a virtual to a physical address. This two-level translation increases the cost of a TLB miss significantly. To understand the impact of a larger TLB miss cost, TPCC-UVa, dbench, SPECjbb and Vortex are simulated on a 1-CPU x86 machine with 8-way regular and tagged TLBs (8-entry TMT) under different values of the minimum page walk latency (PW). The IIPC values obtained from these simulations are shown in Figure 4-8. From these values, it can be seen that using the TMT increases the IPC of TPCC-UVa by about 12% at a PW latency of 270 cycles for a 1024-entry TLB. Similarly, the IPC of SPECjbb and Vortex increases by about 12% and 25%, respectively. In the case of SPECjbb, it is known from the data presented in Section 4.4.3 that a 1024-entry TLB is not sufficient to capture the
entire working set size. Though not shown in Figure 4-8, at a PW of 270 cycles and a TLB size of 4096 entries, the IIPC for SPECjbb increases to about 28%.

4.5 Architectural and Workload Parameters Affecting the Impact of the TMT

The impact of the Tag Manager Table is in reducing the TLB-induced delay and thereby improving the performance of the virtualized workload. However, this impact depends on a few hardware parameters and workload factors. These factors and their influence on the improvement due to the TMT are presented in this section. These factors and parameters are also prioritized depending on the significance of their influence. It should be noted that, for the simulations presented in this section, Reduction is used as the figure of merit as it is more sensitive than IIPC.

4.5.1 Architectural Parameters

While the architectural parameters that affect the TLB behavior and the benefit of using the TMT are discussed in depth in Section 4.4, they are summarized in this section.

- The size of the Tag Manager Table decides the number of context switch related TLB flushes that cannot be avoided due to lack of capacity in the TMT.
- The size of the TLB controls the number of TLB entries of different processes that are retained across context switch boundaries when the associated TLB flushes are avoided.
- The associativity of the TLB, beyond an 8-way set size, does not have a significant impact on the benefit of using the TMT.
- The value of the minimum page walk latency (PW) influences the cost of TLB misses, and therefore the benefit that is obtained from avoiding these misses using the TMT.

4.5.2 Workload Parameters

From the discussion in Section 4.4.2, it is evident that the TLB size is an important parameter which affects the benefit that can be obtained from tagging. A small TLB will experience capacity misses irrespective of whether tags are used to avoid flushes or not. However, whether the size of the TLB is small or large depends on the workload
and the number of pages that are accessed by the workload. Similarly, the number of flushes that can be avoided by tagging, and the reduction in miss rate, depend on the number and type of TLB flushes experienced by the workload. Thus, the benefit of tagging the TLB entries will depend on the workload characteristics.

4.5.2.1 Effect of larger memory footprint

To examine the impact of the working set size of the workload, the SPECjbb benchmark is selected. The memory utilized by SPECjbb is capped by the heap size of the Java Virtual Machine (JVM) in which it runs. By increasing the heap size of the JVM, the working set size of the workload can be varied, thereby varying the demand exerted on the TLB. Four different SPECjbb-based workloads with heap sizes of 128MB, 192MB, 256MB and 320MB are prepared by launching SPECjbb in the domU of a simulated single-processor machine. The workloads are run for 8-way TLBs of sizes varying from 64 entries to 8192 entries (the TLB size is varied up to 8192 entries to illustrate the shift in the Reduction trend) without tagging, and their miss rates and flush profiles are observed. Then, the simulations are repeated with CR3 tagging and an 8-entry TMT, and the miss rates and flushes are observed.

From the TLB flushes for the four workloads, shown in Table 4-1, it can be seen that varying the heap size does not change the number of flushes significantly. In situations without TLB tags as well as with tags, the flushes for the workloads with varying heap sizes all fall within 4% of each other, which is due to the variations caused by the system noise. Moreover, there is little correlation between increasing the heap size and the increase in the number of flushes. Thus, varying the heap size does not affect the flush profile significantly, and any variation in the observed TLB miss rate is due to the impact of the differing working set sizes.
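The within-4% claim can be checked directly against the flush counts reported in Table 4-1. A small sketch using those counts:

```python
# Flush counts without tags for SPECjbb heap sizes of 128/192/256/320 MB
# (values from Table 4-1)
flushes_without_tags = {128: 32519, 192: 33550, 256: 32915, 320: 33012}

lo = min(flushes_without_tags.values())
hi = max(flushes_without_tags.values())
spread_pct = 100.0 * (hi - lo) / lo
print(round(spread_pct, 2))  # 3.17: all counts fall within 4% of each other
```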
Table 4-1. Flush profile for SPECjbb-based workloads with varying heap sizes

Heap Size (MB) | Flushes without Tags | Capacity Flushes with CR3 Tags, 8-entry TMT | Forced Flushes with CR3 Tags, 8-entry TMT
128 | 32519 | 0 | 1189
192 | 33550 | 4 | 1205
256 | 32915 | 0 | 1175
320 | 33012 | 0 | 1151

When the Reductions in DTLB miss rates, as shown in Figure 4-9A, are considered, it can be seen that there is a systematic correlation between the heap size, the TLB size and the improvement due to tagging. At very small TLB sizes, the change in heap size does not change the miss rate improvement due to tagging. Up to a TLB size of 256 entries, even the smallest heap size of 128MB is sufficient to cause a large number of capacity misses in the TLB. Hence, the four workloads exhibit an identical, albeit small, Reduction of 6% in the DTLB miss rate. However, at a TLB size of 512 entries, the TLB is large for SPECjbb with a 128MB heap size and small for SPECjbb with a 320MB heap size. Hence, the Reduction in miss rate varies by about 4% when the heap size is changed from 128MB to 320MB. This trend of varying miss rate improvement for varying heap sizes is more pronounced at 1024-entry and 2048-entry TLB sizes. For a TLB size of 2048 entries, the reduction in miss rate for SPECjbb with a 128MB heap size is 30% more than for SPECjbb with a 320MB heap size. Beyond the size of 2048 entries, however, the TLB becomes large enough to accommodate even a 320MB heap size. Hence, the variation in the impact of tagging reduces and eventually diminishes. Thus, it is clear that the working set size, in combination with the DTLB size, affects the improvement that can be obtained from tagging.

The ITLB miss rates from this experiment are presented in Figure 4-9B. It can be seen that the reduction in the ITLB miss rate does not vary significantly with the working
Figure 4-9. Effect of scaling the memory footprint on the reduction in TLB miss rate with an 8-entry TMT. (A) Reduction in DTLB miss rate. (B) Reduction in ITLB miss rate. The reduction in DTLB miss rate is affected by the memory footprint of the workload when the TLB size is between 512 entries and 2048 entries. Outside this range, the TLB is either too small or large enough to not be influenced by the memory footprint. The reduction in ITLB miss rate is not significantly affected by the memory footprint of the workload.

set size, as increasing the heap size does not affect the instruction footprint and the ITLB usage significantly.

4.5.2.2 Effect of the number of processes in the workload

While varying the heap size changes the pressure exerted on the DTLB, it does not stress the ITLB. However, on increasing the number of processes in a workload, each of these processes will require a share of the ITLB and thereby increase the demand for space in the ITLB. Thus, varying the number of processes in a multi-process application will create different workloads which are suitable for investigating the relation between the workload characteristics and the impact of the TMT in reducing the ITLB miss rate.
To create such workloads, TPCC-UVa is utilized. Four different TPCC-UVa-based workloads are prepared by changing the number of warehouses in the benchmark from 1 to 8. Since one client process is forked off for every warehouse, these four workloads have differing numbers of processes, each of which will utilize a portion of the ITLB space. These workloads are run on the domU of a simulated uniprocessor x86 machine with 8-way TLBs of sizes ranging from 64 entries to 1024 entries. The simulations are run, both with and without tagging, and the flush profile, miss rates and Reduction in miss rates for the different workloads are observed.

The flush profile for the four different TPCC-UVa workloads is shown in Table 4-2. In the untagged TLB case, the number of flushes increases by 53% when the number of warehouses is increased from 1 to 8. A similar trend is seen even when CR3 tags are used. At a small TMT size of 2 entries, the reduction in the number of flushes is about 60% for the 1-warehouse workload and 56% for the 8-process workload. At an 8-entry TMT, the capacity flushes are smaller than the forced flushes and stop being the predominant source of flushes. On further scaling up the TMT size to 16 entries, the capacity flushes reduce to 0 for all but the 4-warehouse workload.

The impact of the varying number of processes on the Reduction in ITLB miss rates, with a 2-entry TMT, is presented in Figure 4-10A. The behavior of the reduction in the ITLB miss rates for TLB sizes between 64 entries and 512 entries mimics the DTLB miss rate reduction behavior between TLB sizes of 512 entries and 2048 entries for the SPECjbb workloads from Figure 4-9A. The difference in the TLB size range where this behavior is exhibited is due to the smaller working set size of the individual processes of the TPCC-UVa workload. Another interesting difference is that the spread in the improvement curves is much higher than the spread in the DTLB improvement curves for the SPECjbb workloads. At the widest point of separation, i.e., at a TLB size of 128 entries, the reduction in TLB miss rate for the one-warehouse workload is almost twice that of the eight-warehouse
Table 4-2. Flush profile for TPCC-UVa-based workloads with varying numbers of processes and varying TMT sizes

TMT Size | Warehouses (processes) | Flushes without Tags | Capacity Flushes with CR3 tags | Forced Flushes with CR3 tags
2 | 1 | 49692 | 18944 | 1915
2 | 2 | 54521 | 20876 | 2672
2 | 4 | 63480 | 24222 | 3670
2 | 8 | 76338 | 29327 | 4654
4 | 1 | 49692 | 4302 | 1915
4 | 2 | 54521 | 4928 | 2672
4 | 4 | 63480 | 6134 | 3670
4 | 8 | 76338 | 4654 | 4654
8 | 1 | 49692 | 400 | 1915
8 | 2 | 54521 | 610 | 2672
8 | 4 | 63480 | 959 | 3670
8 | 8 | 76338 | 1872 | 4654
16 | 1 | 49692 | 0 | 1915
16 | 2 | 54521 | 0 | 2672
16 | 4 | 63480 | 1 | 3670
16 | 8 | 76338 | 0 | 4654

workload. This occurs because, in addition to the variation caused by the differing TLB demands, the number of flushes also varies significantly for the different workloads. Thus, in addition to the TLB size, the TMT size is another parameter which may be large or small depending on the workload.

Increasing the TMT size will result in a further reduction of the TLB miss rates. This is shown in Figure 4-10B, where the reduction in ITLB miss rate for the two extreme TMT sizes of 2 entries and 16 entries is shown. From Table 4-2, it is clear that a 16-entry TMT eliminates all but forced flushes. This is reflected in the miss rate of the 1-warehouse workload for a 64-entry TLB reducing by 17% with a 16-entry TMT as compared to 14% for a 2-entry TMT. This disparity increases as the TLB size increases, and at 1024 entries the one-warehouse TPCC-UVa's reduction with a 16-entry TMT is almost twice that with a 2-entry TMT.
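The flush reductions quoted for the 2-entry TMT follow directly from the counts in Table 4-2. A small sketch recomputing them:

```python
# Table 4-2, 2-entry TMT rows for the 1- and 8-warehouse workloads:
# warehouses -> (flushes without tags, capacity flushes, forced flushes)
rows = {1: (49692, 18944, 1915), 8: (76338, 29327, 4654)}

def pct_flushes_avoided(no_tag, capacity, forced):
    """Share of context-switch flushes eliminated by CR3 tagging: everything
    except the remaining capacity and forced flushes."""
    return 100.0 * (no_tag - capacity - forced) / no_tag

print(round(pct_flushes_avoided(*rows[1]), 1))  # 58.0, the "about 60%" figure
print(round(pct_flushes_avoided(*rows[8]), 1))  # 55.5, the "56%" figure
```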
Figure 4-10. Effect of the number of workload processes on the reduction in ITLB miss rate with 8-way associative TLBs. (A) Reduction in ITLB miss rate with 2-entry TMT. (B) Reduction in ITLB miss rate on scaling the TMT size. The legend nW indicates n warehouses. The effect of the number of workload processes on the reduction in ITLB miss rate for a given TMT size is pronounced at smaller TLB sizes, but reduces for larger TLB sizes. Increasing the TMT size increases the reduction in miss rate.

4.5.3 Sensitivity Analysis

In order to achieve the most benefit from using the TMT, i.e. maximize the Reduction of TLB miss rates while minimizing the sizes of the TLB and TMT, the relative significance of the various parameters in determining the reduction in miss rate should be understood. For this, a Full Factorial Experiment [96] is performed. Additional details on Full Factorial Experiments are presented in Appendix A.

To perform this evaluation, four different types of workloads are chosen such that they occupy different quadrants in a two-dimensional space. The number of flushes and the working set size form the two axes of this space. TPCC-UVa has a smaller working set size, compared to SPECjbb, but a larger number of TLB flushes, and
Table 4-3. Factors and their levels for the sensitivity analysis

Factor | Range of Values
TLB Size | 64, 128, 256, 512, 1024
TLB Associativity | 4, 8, 16, 32, 64, full
TLB replacement policy | FIFO, LRU
TMT size | 2, 4, 8, 16
Flushes/10B instructions | High (30000), Low
Memory Used | High (100MB), Low

lies in the smaller-memory higher-flushes quadrant. Vortex has a memory usage similar to TPCC-UVa, as measured using the Linux top [87] command, but experiences fewer flushes than TPCC-UVa and lies in the smaller-memory lower-flushes quadrant. SPECjbb is a good candidate for the higher-memory smaller-flushes quadrant, and a consolidated workload with TPCC-UVa and SPECjbb is created to serve as the higher-memory higher-flushes workload. These four workloads are simulated for all possible combinations of the parameters listed in Table 4-3. It should be noted that the factors listed in Table 4-3 are controllable design parameters, and understanding the influence of these parameters on the improvement in miss rate due to tagging will help in design trade-offs. Page Walk latency is not included as a factor in the listing as it is not a controllable design parameter.

From these simulations, the reductions in DTLB and ITLB miss rates for the various parameter combinations are calculated. By analyzing the variation among the DTLB miss rate reductions for all these combinations, the most significant factor in determining the reduction is identified as the TLB size, with a 65.14% significance. The other dominant factors in determining the DTLB miss rate improvement are workload characteristics (memory size and number of flushes), as seen from Table 4-4. These two factors and their interaction have a relative influence of almost 20% in determining the impact of tagging. The interaction between TLB size and memory utilization, i.e. having a larger TLB for workloads using more memory, is also significant.
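A full factorial design enumerates every combination of the factor levels in Table 4-3. A sketch of that enumeration (the dictionary keys are illustrative names, not identifiers from the simulator):

```python
from itertools import product

# Factor levels from Table 4-3
factors = {
    "tlb_size": [64, 128, 256, 512, 1024],
    "associativity": [4, 8, 16, 32, 64, "full"],
    "replacement": ["FIFO", "LRU"],
    "tmt_size": [2, 4, 8, 16],
    "flushes": ["high", "low"],   # high = 30000 per 10B instructions
    "memory": ["high", "low"],    # high = 100 MB
}

# One simulation configuration per combination of levels
configs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
print(len(configs))  # 5*6*2*4*2*2 = 960 combinations per workload
```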
Table 4-4. Factors with significant influence on the Reduction in TLB miss rates due to CR3 tagging

S.No | Factor | Influence on DTLB miss rate reduction | Influence on ITLB miss rate reduction
1 | TLB Size | 65.14% | 70.92%
2 | Flushes/10B instructions | 3.66% | 12.89%
3 | Memory Used | 14.85% | 1.85%
4 | TMT Size | 1.42% | 1.94%
5 | TLB Size * Flushes | 1.45% | 5.02%
6 | TLB Size * Memory | 5.75% | 1.47%

On performing a similar analysis for the ITLB, the relative significance of the workload's memory utilization in determining the ITLB miss rate reduction is found to be only 1.85%, whereas the number of flushes exerts a 12.89% influence, as shown in Table 4-4. The primary factor which determines the ITLB improvement is the TLB size, with a 70.92% influence. It is also verified from the Full Factorial Experiment that the associativity of the TLB and the replacement policy used in the TLB play only minor roles in deciding the impact of CR3 tagging for both the ITLB and DTLB.

4.6 Comparison of Process-Specific and Domain-Specific Tags

To compare the performance benefit of process-specific tags generated using the TMT and domain-specific tags as in the qTLB [18], the generic tagged TLB model developed in Chapter 3 is used to model the qTLB solution by mapping the domain-specific tag generation functionality to the GMT module and maintaining the current VM's tag in the Tag Cache. Then, TPCC-UVa and Vortex are simulated using both process-specific and domain-specific tagging strategies, and the IIPC values for the workloads with both types of tagging are observed. Comparing these values, as shown in Figure 4-11, it is clear that the improvement in IPC is much higher when the TMT is used. For TPCC-UVa, using the TMT increases the performance by more than 10x compared to domain-specific tags. Moreover, the dependence of the IIPC using the qTLB on the TLB size is less marked than that of the IIPC from using the TMT, as only the hypervisor mappings
Figure 4-11. Comparison of the performance improvement due to process-specific and VM-specific tagging. (A) IIPC comparison for TPCC-UVa. (B) IIPC comparison for Vortex. Process-specific tagging with an 8-entry TMT (legend P) increases the IPC significantly more than VM-specific tagging using the qTLB approach [18] (legend Q), as it can avoid all types of context switch related flushes. The advantage of process-specific tagging is even more pronounced in the non-I/O-intensive Vortex, where there are few inter-domain context switches.

are retained on domain switches in the qTLB. Once the TLB grows large enough to accommodate all the hypervisor entries (256 entries in the case of TPCC-UVa), the gain from further increasing the TLB size is minimal. The ratio of IIPC values with CR3 tagging to domain-specific tagging is even more pronounced for Vortex, due to the significantly smaller number of inter-domain switches in Vortex. These results clearly show the benefit of using process-specific tags over domain-specific tags.

4.7 Using the Tag Manager Table on Non-Virtualized Platforms

While the TMT is motivated by the need to reduce TLB-induced performance degradation on virtualized platforms, it achieves this by avoiding TLB flushes, using a tag to associate every TLB entry with the process to which it belongs. Since the generation and management of the VASI is not tied to any particular aspect of virtualization, the
TMT may also be used on non-virtualized platforms without requiring any change to the system software. As a result, the same hardware platform may be used in a virtualized or non-virtualized manner, transparent to the software stack running on it.

To estimate the performance implications of using the TMT on non-virtualized single-O/S platforms, an x86 single-core machine is simulated using the experimental framework developed in Chapter 3. A Debian Linux 2.6.18-pae kernel is booted on this simulated platform, and the I/O-intensive TPCC-UVa as well as the memory-intensive Vortex are run on it. The IPC for these workloads, with either a regular 8-way TLB or a tagged 8-way TLB with an 8-entry TMT, and a 60-cycle PW, is observed. The simulations are repeated for varying TLB sizes, and the IIPC as well as the IF for these workloads are calculated from these simulations. These values are presented in Figure 4-12.

Figure 4-12. Performance impact of the TMT on non-virtualized platforms with 60-cycle PW and 8-way TLB. The IIPC is presented on the left Y axis and the IF is presented on the right Y axis. The TMT is quite effective at eliminating TLB-induced delays for workloads running on non-virtualized platforms, even if the performance implications are not highly significant.
From Section 3.5.1, it is clear that the number of flushes is much smaller in a single-O/S scenario than on a virtualized platform. Given this low flush rate, the predominant cause of TLB misses is the lack of TLB space. As expected, Figure 4-12 shows an increasing trend in the IIPC values with TLB size. It is observed that the IIPC due to the TMT on non-virtualized platforms is quite small. For instance, even for a 1024-entry TLB and for Vortex, the IIPC is only about 1.6%, compared to the 5.9% for the virtualized Vortex as presented in Figure 4-7A. However, since the TLB-induced delay is small on single-O/S platforms, this improvement in IPC translates to an IF of 75%. Similarly, using the TMT for TPCC-UVa results in an IIPC of about 0.5% and an IF of 89%. Thus, the TMT is quite effective at eliminating TLB-induced delays for workloads running on non-virtualized platforms, even if the performance implications are not highly significant. However, the most important observation from these simulations is that the TMT can be used with no change in design for non-virtualized scenarios.

4.8 Enabling Shared Last Level TLBs Using the Tag Manager Table

A well known principle of data caching is the reduction in miss rate, and therefore in the stalls due to cache misses, on increasing the size of the cache. However, the purpose of caching the data, i.e. reducing the time taken to access the data by finding it in the cache rather than in main memory, is defeated when the cache size increases to large values. For instance, it has been estimated that the hit time for a 1MB cache, using 35nm technology, is about 6ns [97]. A well known solution to this problem is creating a hierarchy of caches, with the smaller and faster caches closer to the CPU and the larger and slower caches closer to the memory. With such multi-level caches, any miss in the first level cache which finds the data in the second level cache pays a smaller penalty than accessing the data from main memory.

When such hierarchical cache organizations in current CMP platforms are considered, the last level cache (LLC) is usually shared amongst multiple cores and serves the cache misses from the private cache hierarchies of each of these
cores. Such shared LLCs are especially beneficial for workloads which share data. Even in workloads that do not have significant sharing, aggregating the on-chip area allocation for the last level caches as a shared LLC instead of multiple per-core private LLCs has been shown to result in a lower miss rate [98], due to the better utilization of the shared cache space, even when there is little sharing between the different processes which share the cache. Moreover, by caching a block in the last level of a fully-inclusive hierarchy, the need for snooping among the upper level caches of the different processors can be avoided [99].

Due to the increasing importance of the TLB on current platforms, this hierarchical design is being extended to TLBs as well (even though there are two levels, it should be noted that both levels of the TLB are used to store the virtual to physical address translations, even in virtualized scenarios with two-level page tables, and not virtual-to-real or real-to-physical address translations). AMD Athlon processors [48] support two levels of instruction and data TLBs, with a 512-entry L2 ITLB and a 640-entry L2 DTLB. Similarly, Intel Nehalem processors [4] have 512-entry L2 instruction and data TLBs. However, these multi-level TLBs are organized as private per-core hierarchies with no shared Last Level TLB (LLTLB). Previous work has shown that a Shared Last Level TLB can exploit inter-core sharing, where a specific entry brought into the LLTLB may be used by all other cores, thereby avoiding TLB misses and page walks for those cores [19].

4.8.1 Using the TMT as the Tagging Framework

The primary requirement for sharing the Last Level TLBs, in hardware-managed TLBs such as on the x86 platform, is the need to distinguish the TLB entries of one process from the entries of another process. This may be achieved using process-specific tags which are generated and managed using the Tag Manager Table.
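The disambiguation requirement can be illustrated with a toy sketch: keying shared-TLB entries by (VASI, virtual page) lets translations of the same virtual page, belonging to different processes, coexist without conflict. The values below are made up for illustration:

```python
# Toy shared TLB keyed by (VASI, virtual_page); hypothetical values, not a
# hardware structure. The same virtual page translates to different frames
# for different processes, and the VASI tag keeps the entries apart.
shared_tlb = {}
shared_tlb[(0, 0x42)] = 0x1042   # process A's translation of page 0x42
shared_tlb[(1, 0x42)] = 0x9042   # process B's translation of the same page
print(len(shared_tlb), shared_tlb[(0, 0x42)] != shared_tlb[(1, 0x42)])  # 2 True
```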
When using the TMT, as discussed in Section 4.2, every TLB is provided with its own TMT. As a result, the CR3-to-VASI mapping in one TMT may be different from the mapping established in another TMT, and the TLB entries of the same process address space may be tagged with different VASIs in different TLBs. Such an approach is satisfactory even in multi-level TLBs, provided there are no shared TLBs. However, in the case of shared TLBs, it is important to have a consistent process-to-tag mapping in all TLBs to ensure that an entry in the shared TLB can be used by any core which shares this TLB. Thus, establishing this consistent process-to-tag mapping is the second requirement for enabling shared LLTLBs. One way of satisfying this requirement is to have one global TMT which generates and manages the tags for all per-core private TLB hierarchies which share the LLTLB.

4.8.2 Architecture of the Shared LLTLB

The architecture of the shared Last Level TLB using the Tag Manager Table is illustrated in Figure 4-13. The platform illustrated in Figure 4-13 consists of two processors, CPU0 and CPU1, with a two-level TLB hierarchy for each core. It should be noted that, though the architecture is explained considering a dual-core platform, a similar architecture may be envisioned for sharing the LLTLB among a larger number of processors. L0-TLB0 and L0-TLB1 are the private per-core TLBs of CPU0 and CPU1 respectively. The second level TLB, indicated as L1-TLBS in Figure 4-13, is the LLTLB which is shared among these cores. One global TMT is used to generate and manage the tags for all three TLBs. However, every core is provided with its own CCR register to ensure that no additional latency is imposed by the tagging framework on the critical TLB lookup path.

TLB lookup and miss handling with shared Last Level TLBs

The TLB lookup process in the shared LLTLB scenario happens as shown in Figure 4-13. A process P0 running on CPU0 with the tag VASI0 may require a translation for virtual address VA0. If this translation is not available in L0-TLB0,
Figure 4-13. Using the TMT for Shared Last Level TLBs. Two private per-core first level TLBs, L0-TLB0 and L0-TLB1, as well as a second level (Last Level) shared TLB, L1-TLBS, are shown. A uniform CR3-to-VASI mapping is ensured by using a global TMT for all TLBs. However, every core is provided with its own CCR register.

as shown in Step 1, this will trigger a lookup in the Last Level Shared TLB L1-TLBS. The VASI in the CCR of CPU0 is dispatched to the LLTLB as a part of the LLTLB lookup. Only if a translation for VA0 with this VASI tag VASI0 is found in the LLTLB will the lookup result in a hit. If this entry is not present in the LLTLB, the TLB lookup is declared a TLB miss and a page walk is triggered. On completion of the page walk, the entry is cached in L1-TLBS with tag VASI0. This is shown in Step 2. After this entry is cached in the LLTLB, to maintain the fully inclusive nature of the TLB hierarchy, the entry is copied to L0-TLB0, as shown in Step 3.

Once this entry (VA0, VASI0) is cached in the LLTLB, it will be available to service any TLB misses from either L0-TLB0 or L0-TLB1. For instance, the process P0 may get scheduled on CPU1 at some point in time after the (VA0, VASI0) entry gets cached in the LLTLB. If the translation for VA0 is
required by P0 and is not found in L0-TLB1, as shown in Step 4, the lookup will hit in L1-TLBS and avoid an expensive TLB miss, as depicted in Step 5 and Step 6. In addition to P0 being rescheduled on CPU1, threads of a multi-threaded workload which share an address space will benefit from such a shared LLTLB. It should be noted that, while this discussion focuses on fully-inclusive TLB hierarchies, the use of the TMT to enable shared LLTLBs is equally applicable in the case of exclusive TLB hierarchies as well.

TLB flush handling with shared Last Level TLBs

One implication of using a global TMT is the generation of false TLB flushes. A situation may arise during a context switch from P1 to P2 (on CPU1) where the CR3 of P2 is not in the TMT and, due to limited capacity, there are no free entries in the TMT. In this case, depending on the replacement policy, the victim entry in the global TMT, (CR33, SID3, VASI3), belonging to P3 is chosen. The CR3 and SID values of P2 replace CR33 and SID3, while VASI3 is reused for P2. To avoid the TLB entries of P3 being used for P2, the per-core private TLB hierarchy of CPU1 is flushed with a capacity flush. However, in shared LLTLB scenarios, this capacity flush will also have to flush all TLB hierarchies that share the LLTLB with CPU1's TLB hierarchy, including the shared LLTLB itself. This is required to ensure that no entry belonging to P3 is cached in any of these TLBs with the tag VASI3. Thus, the capacity flushes experienced by the shared LLTLB are the sum of the capacity flushes experienced by the private last level hierarchies it replaces. However, the number of slots in the global TMT can be set as the sum of the number of slots in all the per-TLB TMTs it replaces, thereby reducing the occurrence of capacity flushes. Forced flushes, on the other hand, are propagated to all TLB hierarchies using mechanisms such as inter-processor interrupts [53], even on existing platforms. Hence the number of forced flushes experienced by the LLTLB remains constant irrespective of whether it is shared or not.
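The lookup, fill, and capacity-flush behavior described above can be sketched as a toy functional model. The class below is a hypothetical illustration (FIFO TMT replacement and dictionary-based TLBs are simplifications, not the hardware design):

```python
class SharedLLTLBModel:
    """Toy model of an inclusive TLB hierarchy with a shared last level
    and a global TMT mapping CR3 -> VASI; structure is illustrative only."""
    def __init__(self, n_cores, tmt_capacity):
        self.l0 = [dict() for _ in range(n_cores)]   # private per-core L0 TLBs
        self.l1 = {}                                 # shared last-level TLB
        self.tmt = {}                                # global CR3 -> VASI map
        self.tmt_capacity = tmt_capacity
        self.capacity_flushes = 0

    def vasi_for(self, cr3):
        """Global TMT lookup; a victim eviction reuses its VASI, so every
        hierarchy sharing the LLTLB must first be capacity-flushed."""
        if cr3 not in self.tmt:
            if len(self.tmt) >= self.tmt_capacity:
                victim = next(iter(self.tmt))        # FIFO victim for brevity
                vasi = self.tmt.pop(victim)          # reuse the victim's VASI
                self.capacity_flush()                # flush all sharing TLBs
            else:
                vasi = len(self.tmt)                 # fresh VASI
            self.tmt[cr3] = vasi
        return self.tmt[cr3]

    def capacity_flush(self):
        self.capacity_flushes += 1
        self.l1.clear()
        for l0 in self.l0:
            l0.clear()

    def lookup(self, core, cr3, vpage):
        key = (self.vasi_for(cr3), vpage)
        if key in self.l0[core]:                     # Step 1: private L0 hit?
            return "l0-hit"
        if key in self.l1:                           # Steps 4-6: LLTLB hit
            self.l0[core][key] = self.l1[key]        # refill L0 (inclusive)
            return "l1-hit"
        self.l1[key] = vpage + 0x1000                # Step 2: walk, fill L1...
        self.l0[core][key] = self.l1[key]            # Step 3: ...then L0
        return "miss"

tlb = SharedLLTLBModel(n_cores=2, tmt_capacity=2)
r0 = tlb.lookup(0, cr3=0xA, vpage=0x42)   # miss: page walk, fill L1 and L0
r1 = tlb.lookup(1, cr3=0xA, vpage=0x42)   # P0 rescheduled on CPU1: LLTLB hit
tlb.lookup(1, cr3=0xB, vpage=0x99)
tlb.lookup(1, cr3=0xC, vpage=0x77)        # third CR3 overflows the 2-entry TMT
print(r0, r1, tlb.capacity_flushes)       # miss l1-hit 1
```

The third CR3 overflowing the 2-entry TMT triggers exactly the global capacity flush discussed above: the reused VASI is only safe once no stale entry remains in any sharing hierarchy.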
4.8.3 Miss Rate Improvement Due to Shared Last Level TLBs

In addition to the benefit for workloads which share address spaces, using shared LLTLBs will result in a better utilization of the TLB space. Thus, allocating a fixed amount of TLB space as a shared TLB rather than as two private TLBs will result in reducing the TLB miss rate. To understand the reduction in miss rate that can be achieved using the shared LLTLB for virtualized workloads, a two-processor x86 machine is simulated using the experimental framework described in Section 3.2. The tagged TLB model developed in Section 3.2.2 is modified to include an interface to facilitate communication between the two levels of a TLB hierarchy. Using this tagged TLB model, both CPUs in the simulated platform are configured with a two-level private per-core TLB hierarchy with no sharing of the last level TLB. Xen is booted on this platform and the pinned workloads TPCC-Vortex-0102 and TPCC-SPECjbb-0102 are created (the details of creating the pinned workloads and their nomenclature are explained in Section 3.3.3). Pinning the workloads in this fashion ensures that dom1, running TPCC-UVa, is the only workload domain to be scheduled on CPU0, and that dom2, running Vortex or SPECjbb, gets scheduled only on CPU1. These workloads are run on the simulated platform with a 64-entry first level TLB and varying last level TLB sizes, with an 8-entry per-core TMT, and the miss rates for the various TLBs are observed. Then, the second level TLBs of both the private per-core hierarchies are replaced with a shared TLB, and the 8-entry TMTs are replaced with a 16-entry global TMT. The simulations are repeated for varying shared LLTLB sizes and the miss rates for the various TLBs are observed.

The DTLB miss rates for the private and shared LLTLBs from these simulations are compared in Figure 4-14. It should be noted that the miss rate for a private LLTLB of a certain size is compared to the miss rate of a shared LLTLB of twice that size.
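The figure of merit in this comparison is the percent reduction in misses per kilo-instruction (MPKI) when the shared LLTLB of twice the size replaces the private per-core LLTLBs. A minimal sketch, using the Vortex miss rates reported for this experiment:

```python
def mpki_reduction_pct(private_mpki, shared_mpki):
    """Percent reduction in TLB misses per kilo-instruction when a shared
    LLTLB of twice the size replaces the private per-core LLTLBs."""
    return 100.0 * (private_mpki - shared_mpki) / private_mpki

# Vortex in TPCC-Vortex-0102: 2.42 MPKI with a 64-entry private LLTLB versus
# 2.11 MPKI with a 128-entry shared LLTLB
print(round(mpki_reduction_pct(2.42, 2.11), 1))  # 12.8, i.e. the ~13% reduction
```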
Figure 4-14. Reduction in DTLB miss rate due to the Shared Last Level TLB. The TLB size specified on the X-axis is the size of the private per-core LLTLB and half the size of the shared LLTLB. Having a shared Last Level TLB reduces the DTLB miss rate by 0% to 35% depending on the TLB size and workload.

From this, it is observed that the shared LLTLB has a lower miss rate compared to the private per-core LLTLB. For instance, a 64-entry private per-core LLTLB results in miss rates of 0.23 MPKI and 2.42 MPKI for TPCC and Vortex in the TPCC-Vortex-0102 workload respectively. However, when the private LLTLBs are replaced with a shared 128-entry TLB, the miss rate of Vortex drops to 2.11 MPKI, a 13% reduction. It is also observed that this reduction is significantly higher for Vortex and SPECjbb, which have higher data memory footprints compared to TPCC. In the case of TPCC, the potential increase in TLB space due to using a shared LLTLB is offset by the significantly higher usage of that shared TLB by Vortex and SPECjbb. TLB usage controls may be envisioned to increase the benefit of the shared LLTLB for TPCC-UVa. However, the miss rate for TPCC-UVa is never larger when using the shared LLTLB compared to the private per-core LLTLB. The average reductions in DTLB MPKI for SPECjbb and Vortex are about 15%
and 28% respectively. These results clearly demonstrate the benefit of using shared Last Level TLBs.

4.9 Summary

The Tag Manager Table is proposed in this chapter to generate and manage process-specific TLB tags in a software-independent manner for hardware-managed TLBs. The design and working of the TMT are discussed, and the reductions in the TLB miss rate and TLB-induced delay due to the TMT are analyzed. The various hardware and workload-related factors that influence the benefit of the TMT are investigated and prioritized. It is found that using the TMT for typical transaction-processing and CPU-intensive workloads reduces the delay due to TLB misses by as much as 50%-70% compared to untagged TLBs and improves the IPC by as much as 12%-25% for large TLB sizes and page walk latencies. The use of the TMT on non-virtualized platforms, as well as to enable shared Last Level TLBs, is also explored.
CHAPTER 5
CONTROLLED SHARING OF HARDWARE-MANAGED TLBS

Resource consolidation using virtualization has emerged as a viable way to share the resources of chip multicore processors among multiple workloads which have different operating system (O/S) requirements. By consolidating different workloads on the same platform, the utilization of the platform resources can be increased. This has made virtualization extremely attractive to the server industry. In a consolidated environment, the performance of one virtual machine (VM) will be susceptible to the utilization of shared resources by other VMs. In addition, system noise, i.e. the operating system carrying out vital functions such as memory management and task scheduling, also causes variation as well as degradation in the performance of virtualized workloads. This interference manifests as consumption of resources by other VM or system processes, which could have otherwise been devoted to increasing the performance of user applications, and is a major limiting factor in the performance of applications in large-scale systems [100, 101]. Hence, there is a need for controlling and managing the usage of shared resources. Such resource management techniques are vital for providing scalable and deterministic performance in future architectures such as the Datacenter-on-chip [102].

Resource management in CMP platforms for providing Quality of Service, especially in the memory subsystem, has been the focus of many research efforts. Kim, Chandra and Solihin [103] explore the sharing of caches for providing a fair share of the cache to different hardware threads. Iyer et al. [104] and Hsu et al. [105] present different types of cache-sharing policies for the last level cache for varied system-level goals, including maximizing the system throughput and ensuring uniform throughput for each of the threads. Chang and Sohi [106] discuss adaptively increasing the cache space allocated to a thread in the short run, while maintaining fairness in the long run. Qureshi and Patt [107] investigate the capability of different workloads to use the cache with
varyingdegreesofefciencyandusethisinformationtodecidethecacheallocation.Srikantaiahetal.[ 108 ]explorethepollutioninthecacheduetomultiplecoressharingthelastlevelcacheandproposeschemestoreducethispollutionbymodifyingthecacheevictionpolicies.ArchitecturalsupportforO/S-levelcachemanagementhasbeeninvestigatedbyRaque,LimandThottethodi[ 109 ].Selectivereplication[ 110 ]toimprovetheperformanceofselectedapplicationshasbeenproposedbyBeckmann,MartyandWood.However,sinceonlyoneprocesscouldusetheTLBatagiventimebeforetheadventoftaggedTLBsforreducingthevirtualizationoverhead,researchonusagecontrolinhardware-managedTLBsislimitedtotheqTLBwork[ 18 ].ThisassumptionofaprocessowningtheentireTLB,however,ischangedinthecontextoftaggedTLBs.WhiletheTMTenablesthesharing1oftheTLBamongmultipleworkloads,therebyimprovingtheperformanceoftheseworkloads,italsomakestheTLBasharedresourceandtheperformanceofanapplicationinoneVMwillvarydependingontheTLBusageofotherVMswhichrunonthesamecore.ThisnecessitatesmechanismsandpoliciesformanagingtheuseoftheTLB.Toaddressthisissue,theCShare(Controlled-Share)hardware-managedTLBisproposedinthisdissertation.AtthecoreoftheCShareTLBistheuseofaTLBSharingTable(TST),inconjunctionwithTMT-generatedprocess-specictagsforsharingtheTLBbetweenmultipleprocessesandforcontrollingtheTLBspaceusedbytheseprocesses.ByassigningvariousVMsaxedsliceofthesharedTLBspaceusingtheTST,theTLBbehaviorofaworkloadrunninginaVMcanbeisolatedfromtheTLBusageofotherVMsrunningonthesameplatform.TheTSTcanalsobeusedto 1ThesharingofasingleTLBbymultipleprocessesisthemainfocusofthischapter.However,thearchitecturesdevelopedandanalysisperformedhereareviableinthecontextofsharingacrossmultipleTLBs,suchassharedLastLevelTLBs. 108


selectively improve the performance of a high priority workload by restricting the TLB usage of other low priority workloads running on the same platform. In such scenarios, the performance improvement for the high priority workload that is achieved using the TMT can be further increased by a factor of 1.4 by restricting the TLB usage of low priority workloads. The cost of this selective performance enhancement for various types of workloads is analyzed, and the use of dynamic usage control policies for minimizing this cost and improving the overall performance of the consolidated workload is explored.

5.1 Motivation

Typical usage of virtualized platforms involves launching multiple workloads on a platform, each in their own VM, and having these VMs share resources. Thus it is important to investigate the behavior of the tagged TLB for such consolidated workloads, in addition to stand-alone workloads. To understand this, consolidated workloads are created by launching two applications, TPCC-UVa and Vortex for instance, on dom1 and dom2. Though no application is launched on dom0, the interactions between domU and the physical machine (such as I/O requests for TPCC-UVa) are served by the drivers residing on dom0, and instructions are executed on this domain as well [35]. These consolidated workloads are run on a 1-CPU x86 simulated machine, using the framework outlined in Section 3.2, and the IIPC due to the tagged TLB (without any explicit usage control) is observed. In addition to the IIPC for the entire consolidated workload, the details of the domain switches are obtained by instrumenting the Xen kernel and are used to classify executed instructions on a per-dom basis, thus enabling the calculation of IIPC on a per-domain basis. These IIPC values are shown in Figure 5-1. From these simulation results, the following observations can be made:

- While dom0 does not run any actual workload, its IPC shows definite benefit from increasing the TLB size. In fact, even at large TLB sizes of 512 entries, further scaling up of the TLB size results in further increasing dom0's IPC. This behavior is observed because, in all three workloads, dom0 is scheduled for less than 8% of the total running time. As a result, the TLB entries cached by dom0 get evicted by the entries of dom1 and dom2 before they can be significantly reused.


Figure 5-1. Performance improvement for consolidated workloads with uncontrolled TLB sharing with 8-entry TMT and PW of 60 cycles. A) IIPC for TPCC (dom1)-Vortex (dom2). B) IIPC for TPCC (dom1)-SPECjbb (dom2). C) IIPC for Vortex (dom1)-SPECjbb (dom2). The performance improvement due to tagging for a domain clearly depends on the other domains which share the TLB.


- The effect of sharing the TLB is also apparent on considering the IIPC for TPCC-UVa (dom1) in the TPCC-SPECjbb and TPCC-Vortex workloads, as seen in Figures 5-1B and 5-1A respectively. In these workloads, dom1 is scheduled for about 35% and 42% of the total execution time, for TPCC-SPECjbb and TPCC-Vortex respectively. Thus, it uses only a part of the TLB space and, unlike the IIPC trend for TPCC-UVa when it is run alone (as seen in Figure 4-7A), the increase in IIPC does not taper off with increase in TLB size beyond 256 entries. Even beyond this size, the TLB space used by TPCC is not sufficient to hold all its translations, as it has to be shared with the other workload.

- The higher TLB utilization of SPECjbb compared to Vortex, as discussed in Section 4.4.3, lowers TPCC-UVa's IIPC in TPCC-SPECjbb compared to TPCC-Vortex for any given TLB size. Thus it is clear that the shared TLB usage is heavily influenced by the nature of the workloads which share it.

These observations clearly indicate that, in the absence of explicit controls, the amount of shared TLB space used by a domain depends on the time for which the domain is scheduled on a CPU, the working set size of the workload running in the domain, and the workloads running on other domains which share the TLB. Clearly, with even more VMs sharing the TLB, the noise in the performance of the workloads will increase. This motivates the need for controlling the usage of the shared TLB by different workload VMs as well as dom0.

5.2 Architecture of the CShare TLB

The CShare TLB architecture consists of the regular hardware-managed TLB with two additional hardware tables: the Tag Manager Table (TMT) and the TLB Share Table (TST). The TMT is responsible for enabling multiple address spaces to share the TLB and has been discussed in depth in Chapter 4. The TST is used to control the shared TLB usage amongst the different sharers.

The TLB Share Table (TST) is used for controlling the TLB usage, on a per-TLB-set basis, by choosing the victim during TLB replacement depending upon the current usage of the different sharing classes. The sharing classes are the granularity at which the TLB usage is controlled, and each class may consist of a process, a VM (as in this work) or a collection of VMs. In this work we use the virtual machine as the sharing


class. Each entry of the TST, representing one sharing class, contains the TLB usage restrictions for that class and has four fields:

- The SID field, which has the identifier of the sharing class. The use of SIDs provides the flexibility of changing the granularity of the sharing classes, while including this SID as a part of the TMT entry provides a convenient mapping between the different processes and their sharing classes.

- The PRIORITY field indicates the priority of the sharing class and is used to determine the victim in situations where no sharing class has exceeded its usage limits.

- The SHARE field indicates the maximum number of entries per TLB set which can be used by the sharing class.

- The CNT field is used to store the number of entries in a set that are occupied by the Sharing ID. Unlike the previous three fields, which are programmed by the VMM, the CNT field is updated by the hardware.

Figure 5-2. Controlled TLB usage using the CShare architecture. The victim evicted from the TLB is chosen depending on the allocations and current usages for the different sharing classes.

The TLB Share Table is looked up only in the case of a TLB miss, as shown in Figure 5-2, and, similar to the TMT, is not in the critical path of TLB lookups. The virtual


address is used to calculate the TLB set (victim set) in which the translation (new entry) will be stored. The per-SID (per-VM) usage information of this set is obtained, as shown in step 1 of Figure 5-2, by counting the number of entries in that set and storing it in the CNT fields of the appropriate sharing classes in the TST. Based on these CNT and SHARE values for the different classes, the SID to which the victim should belong (V-SID) is calculated, as shown in step 2 of Figure 5-2. It should be noted that, since the CNT and SHARE information are computed on a per-set basis, the time for selecting the V-SID is small and can be overlapped with the page walk. Once the V-SID is determined, a victim belonging to this sharing class is chosen from the victim set using the regular TLB replacement heuristic (e.g. LRU). On completion of the page walk, which proceeds in parallel to the selection of the victim, the obtained translation is tagged with the CR3 tag of the current process from the CCR, as shown in step 3 of Figure 5-2. The chosen victim is replaced with this translation, as depicted in step 4.

The actual algorithm used in selection of the V-SID depends on the motivation behind controlling the usage of the TLB (performance isolation or performance enhancement). When performance isolation is the goal, the TLB can be effectively partitioned among the VMs by assigning a fixed number of TLB slots (SHARE values) to each VM, such that the sum of these SHARE values does not exceed the total number of slots in the TLB set. With such partitioning, any VM whose CNT value for a particular victim set is less than its SHARE value is guaranteed to find at least one free slot in the set, since other VMs would not have exceeded their allotted SHAREs. That free slot is used for caching the new entry. On the other hand, if the CNT of VM1 is equal to the SHARE of VM1, one of VM1's entries in the set is evicted and that slot is used for caching the new entry.

Such a strict enforcement of the SHARE for different VMs, however, may not be suitable when the motivation behind using the CShare TLB is improving the performance of a high priority workload and not enforcing TLB isolation through TLB partitioning. For instance, when the VM running the high priority workload has used all of its reserved


Table 5-1. Algorithms for selection of victim SID

FOR PERFORMANCE ISOLATION BY TLB RESERVATION
1) Count the slots used in the victim set for CCR.SID and store in the appropriate CNT
2) If CCR.SID.CNT >= CCR.SID.SHARE, V-SID = CCR.SID
3) Else:
   3.1) Choose one of the (guaranteed) free slots and use it for caching the new translation

FOR PERFORMANCE ENHANCEMENT OF SELECTIVE WORKLOAD
1) If a free slot is available in the victim set, use it
2) Else: count the slots used in the victim set on a per-SID basis and store in the appropriate CNTs
   2a) If CCR.SID.CNT >= CCR.SID.SHARE, V-SID = CCR.SID
   2b) Else:
      2b.1) For each SID_i in the TST: if SID_i.CNT > SID_i.SHARE, V-SID = SID_i
      2b.2) If no V-SID, for each SID_i in the TST: if SID_i is low priority and SID_i.CNT > 0, V-SID = SID_i

slots, it may borrow unused slots belonging to a VM which runs a lower priority workload, in order to reduce the miss rate and increase the performance of the high priority workload. These slots may be reclaimed by the VM running the low-priority workload when needed. Hence, when performance enhancement of selected high priority workloads is the goal, the algorithm for selection of the V-SID allows any VM, irrespective of its usage limits, to use any available free slots. The usage limitations and PRIORITY values of different domains come into effect in deciding the V-SID only when no free slot is available and some entry from the set has to be evicted to cache the new translation. Both these algorithms are shown in Table 5-1.
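The two victim-selection policies of Table 5-1 can be sketched in software as follows. This is a minimal illustrative model, not the hardware implementation: the class and function names are assumptions, and a TLB set is modeled simply as a list of the SIDs owning each slot (None denoting a free slot).

```python
# Illustrative model of the TST victim-SID selection policies from Table 5-1.

class TSTEntry:
    def __init__(self, sid, priority, share):
        self.sid = sid            # identifier of the sharing class (e.g. a VM)
        self.priority = priority  # 'high' or 'low'
        self.share = share        # max entries per TLB set for this class
        self.cnt = 0              # per-set usage, recomputed on each miss

def victim_sid_isolation(tst, victim_set, current_sid):
    """Performance isolation by TLB reservation (upper half of Table 5-1)."""
    entry = tst[current_sid]
    entry.cnt = victim_set.count(current_sid)
    if entry.cnt >= entry.share:
        return current_sid        # evict one of this class's own entries
    return None                   # a free slot is guaranteed; cache there

def victim_sid_enhancement(tst, victim_set, current_sid):
    """Performance enhancement of a selected workload (lower half)."""
    if None in victim_set:        # 1) any free slot is usable by any VM
        return None
    for e in tst.values():        # 2) recompute per-SID usage of this set
        e.cnt = victim_set.count(e.sid)
    cur = tst[current_sid]
    if cur.cnt >= cur.share:      # 2a) current class exceeded its SHARE
        return current_sid
    for e in tst.values():        # 2b.1) evict from an over-SHARE class
        if e.cnt > e.share:
            return e.sid
    for e in tst.values():        # 2b.2) else evict from a low-priority class
        if e.priority == 'low' and e.cnt > 0:
            return e.sid
    return None
```

Note how the isolation policy never inspects other classes' counters (each VM only ever evicts itself), while the enhancement policy lets a high-priority VM borrow free or over-SHARE slots, which is exactly the borrowing behavior described above.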


5.3 Experimental Framework

The CShare TLB is modeled by augmenting the TMT model, described in Section 4.3, with the TST. The size of the TST is set to match the number of entries in the TMT. The functionality of the TST is verified by using a Functional Check mode wherein the number of TLB slots used by each SID is counted and ensured to be within the specified limits during every TLB replacement.

The metrics used to study the impact of controlling TLB usage with the TST are similar to the metrics used in Chapter 3 and Chapter 4 and are presented here for reference:

- Number of TLB flushes

- DTLB and ITLB miss rate and the Reduction in miss rate, where

  Reduction (%) = 100 x (1 - (TLB miss rate with tags) / (TLB miss rate without tags))   (5-1)

- Instructions per Cycle (IPC) and RIPC, IIPC and IF, where

  RIPC = 100 x (1 - IPC_RegularTLB / IPC_IdealTLB)
  IIPC = 100 x (IPC_CShareTLB / IPC_RegularTLB - 1)
  IF   = 100 x (IPC_CShareTLB - IPC_RegularTLB) / (IPC_IdealTLB - IPC_RegularTLB)   (5-2)

5.4 Performance Isolation using CShare Architecture

In this section, the effect of using the CShare architecture to enforce partitions in the TLB is investigated. The workloads used for this investigation are TPCC-TPCC-0012 and TPCC-Vortex-0012. These workloads are created by simulating a two-processor x86 machine using the experimental framework described in Section 3.2. Xen is booted on this machine and two user domains (domUs) are created, with one virtual CPU (VCPU) per domain. TPCC-UVa is run in the first domU (dom1) and, once the application reaches its working phase, the domain is paused. Then, depending on


the required workload, TPCC-UVa or Vortex is launched in the second domU (dom2) and allowed to reach its working phase. Then, dom1 is resumed and the VCPUs of both dom1 and dom2 are pinned to CPU1 of the Simics simulated machine. In addition, the VCPUs of dom0 are pinned to CPU0 of the simulated machine. Pinning the VCPUs in this fashion ensures that only dom1 and dom2 are scheduled on CPU1 of the simulated machine. Thus only the workloads on these domains will share the TLB of CPU1. The performance isolation usage control policy, outlined in Table 5-1, is used to partition TLB1 into two and allocate these partitions to dom1 and dom2 explicitly.

TPCC-TPCC-0012 is simulated using the framework described in Section 5.3 for various TLB sizes and various TLB partition sizes. The DTLB and ITLB miss rates, expressed as Misses per Thousand Instructions (MPKI), for the 64-entry TLB and 512-entry TLB, as obtained from these simulations, are presented in Figure 5-3. The miss rate is used as the metric since it depends only on the shared TLB, which is being controlled using the CShare architecture, while the IPC depends on many other factors, including the cache and memory utilization of the workloads, which are not being controlled. From this figure, it is observed that the DTLB miss rate has a strong dependence on the size of the TLB space allocated to the domains. For instance, when 10% of the TLB is reserved for dom1, its miss rate is almost 8 times the miss rate of dom2. The lowest miss rate for both the domains is achieved when they share the TLB equally, as both domains run the same workload and show similar TLB usage requirements. A similar behavior is observed in the case of the ITLB miss rates. It should be noted that, while both the 64-entry TLB and the 512-entry TLB are insufficient to capture the working set size of TPCC-UVa and Vortex combined, as seen from Section 4.4, the smaller size of the 64-entry TLB causes the MPKI variation with partition sizes to be larger in magnitude and smoother than the MPKI trends for the 512-entry TLB. Thus, it is clear that the TST serves as a good control knob for controlling the TLB usage on a per-domain basis.
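The MPKI values and the other metrics quoted throughout this chapter follow the definitions of Section 5.3; as a minimal sketch of how they are derived from raw simulation counters (the function names and the counter values in the examples are illustrative assumptions, not numbers from the simulations):

```python
# Sketch: computing MPKI and the Section 5.3 metrics from simulation counters.

def mpki(misses, instructions):
    """TLB misses per thousand executed instructions."""
    return 1000.0 * misses / instructions

def reduction(miss_rate_with_tags, miss_rate_without_tags):
    """Equation 5-1: percentage reduction in miss rate due to tagging."""
    return 100.0 * (1.0 - miss_rate_with_tags / miss_rate_without_tags)

def impact_factor(ipc_cshare, ipc_regular, ipc_ideal):
    """Equation 5-2 (IF): fraction of the TLB-induced delay eliminated,
    where the ideal TLB bounds the achievable IPC from above."""
    return 100.0 * (ipc_cshare - ipc_regular) / (ipc_ideal - ipc_regular)
```

For example, a domain incurring 610 misses over one million instructions has an MPKI of 0.61, matching the scale of the per-domain miss rates reported in this section.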


Figure 5-3. Effect of varying TLB reservation on miss rate, shown by plotting the TLB miss rate for TPCC-TPCC-0012 for varying allocation of the TLB space for each domain. A) DTLB miss rate for 64-entry TLB. B) ITLB miss rate for 64-entry TLB. C) DTLB miss rate for 512-entry TLB. D) ITLB miss rate for 512-entry TLB. The miss rates of domains show a strong correlation with their allocations.


Figure 5-4. Miss rate isolation using the TMT architecture, shown by plotting the per-domain miss rates on a 64-entry CShare TLB for TPCC-TPCC-0012 (T-T) and TPCC-Vortex-0012 (T-V). Despite the different demands on the TLB by dom2, the miss rate of dom1 is isolated from the influence of dom2.

To show that the usage control knob property of the TST can be used to isolate the TLB miss rates of workloads, the simulations are repeated for the TPCC-Vortex-0012 workload. The per-domain DTLB miss rates for a 64-entry TLB for both TPCC-TPCC-0012 and TPCC-Vortex-0012 for a range of partition sizes, as obtained from these simulations, are shown in Figure 5-4. When the per-domain miss rates are considered for TPCC-TPCC-0012, since both domains run the same workload, they exhibit similar miss rates of about 0.61 MPKI when allocated equal shares in the TLB (dom1 = 50%). On reducing the TLB usage limit for dom1 and allocating a larger share of the TLB for dom2, the miss rates for these domains begin to show differing trends. At dom1 = 10%, with dom2 allowed 90%, the miss rate of TPCC-UVa on dom1 is 4.07 MPKI, which is almost an order of magnitude greater than the miss rate of TPCC-UVa on dom2. A similar trend is seen for the consolidated workload TPCC-Vortex, with the main difference being that the miss


rate for Vortex is much larger than the miss rate for TPCC-UVa, even when it is allocated a larger portion of the TLB, due to its memory intensive behavior. Since Vortex is more TLB hungry than TPCC-UVa, the miss rate of TPCC-UVa will be increased when it is consolidated with Vortex in the absence of any usage control. However, from Figure 5-4, it is seen that the miss rate of TPCC-UVa running on dom1 in both the consolidated workloads is very close and depends only on the portion of the TLB that is reserved for it. It is also seen that the miss rate of dom1 is independent of the workload running on dom2, clearly indicating the efficacy of the CShare TLB in isolating the TLB miss rate of one domain from the influence of other domains.

5.5 Performance Enhancement Using CShare Architecture

In addition to isolating the TLB behavior of an application running on a VM from other VMs running on the same platform, the CShare architecture may also be used to further improve the performance increase achieved by using the TMT. Different applications with varying working set sizes and memory access patterns exhibit correspondingly varying patterns in the usage of the TLB space. By controlling the TLB space and regulating the amount of TLB space used by every VM based on its memory access pattern, it becomes possible to achieve a lower TLB miss rate and improve the performance of the workloads.

5.5.1 Classification of TLB Usage Patterns

Typical multimedia applications exhibit a streaming memory access pattern, where the data accessed from the main memory shows regularity in the stride of access [111]. In such applications, the number of data accesses per instruction is typically very high, and there is little reuse in the accessed data. Applications which exhibit such memory behavior are termed streaming applications in this dissertation.

To understand the TLB implications of streaming applications, several workload applications are simulated on the domU of a uniprocessor x86 machine with the CShare


TLB, without explicit TLB usage control and with an 8-entry TMT, using the framework described in Chapter 3. The selected applications are:

- Vortex: a memory intensive database manipulation workload from the SPEC CPU2000 suite of benchmarks [77].

- TPCC-UVa: an I/O intensive implementation of the TPC-C benchmark [82].

- Apsi: a weather prediction program which reads a 112x112x112 array of data and iterates over 70 time steps [112].

- Art: a neural network program used for object recognition in thermal imagery [113].

- Lucas: a program to check the primality of Mersenne numbers of the form 2^n - 1 [114].

- Swim: a compute intensive floating point program for shallow water modeling with a 1335x1335 array of input data [115].

The DTLB miss rates for the domU running these applications are observed for varying TLB sizes. These miss rates, normalized to the miss rate for a 64-entry TLB, are presented in Figure 5-5A. From this, it can be seen that increasing the TLB size does not reduce the TLB miss rate to the same extent in all applications. For instance, TPCC-UVa and Vortex show significant benefit from the increase in TLB size. However, Apsi and Art show a smaller reduction in DTLB miss rate of about 20% till a TLB size of 256 entries and 512 entries respectively. Beyond these TLB sizes, the TLB miss rate rapidly reduces to less than 5% of the 64-entry TLB miss rate. Yet another trend of the TLB miss rates is exhibited by Swim and Lucas. In these workloads, there is little benefit to scaling up the TLB size, and even at a large TLB size of 1024 entries, the DTLB miss rate is not highly reduced. For instance, at this TLB size, the miss rate of Swim is 98.8% of the 64-entry DTLB miss rate. From this trend, the applications can be classified, in a manner similar to previous works [18], into three categories:

- Type 1: Applications such as TPCC-UVa and Vortex, which have a smaller working set size and show good reuse in the access pattern. These workloads are characterized by a concave parabolic response of the normalized DTLB miss rate to increasing TLB sizes. In such applications, increasing the TLB size reduces


Figure 5-5. Classification of TLB usage patterns. A) DTLB miss rate for domU running the workload application. B) ITLB miss rate for domU running the workload application. C) DTLB miss rate for dom0. D) ITLB miss rate for dom0. Applications can be classified into one of three types depending on the reduction in miss rate upon increasing the TLB size.


the miss rate even when the size is insufficient to accommodate the entire working set size.

- Type 2: Applications such as Apsi and Art, which have a small to medium working set size, but show relatively less reuse in the access pattern. The normalized DTLB miss rates of these applications show a convex parabolic trend to increasing the TLB size. As long as the TLB size is not sufficient to accommodate the working set size, there is little benefit to increasing the TLB size, since the reuse of entries is not very high. However, once the TLB size is large enough to capture the entire working set size, the DTLB miss rate reduces significantly.

- Type 3: Applications such as Swim and Lucas, which are streaming applications. Any increase in the TLB size does not significantly reduce the DTLB miss rate.

The ITLB miss rate, on the other hand, for all these applications exhibits a similar response to increasing the TLB size, as seen from Figure 5-5B. Simply doubling the TLB size from 64 entries to 128 entries reduces the ITLB miss rate of all the applications by at least 40%. In the case of Vortex and Apsi, this reduction is as high as 90% and 80% respectively. Intuitively, while the instruction footprint of different applications may vary, the behavior of the memory access for fetching instructions is similar across applications. Thus, as far as the ITLB is concerned, all applications exhibit Type 1 behavior. Similarly, the DTLB and ITLB miss rates for dom0 also exhibit Type 1 behavior, as both the code and data working set sizes on dom0, which are due to the backend drivers, are small and show good reuse. From these observations, it is clear that the benefit of awarding more TLB space to an application, or the penalty of withholding TLB space from an application, is highly dependent on the TLB usage pattern of the workload application.

5.5.2 Performance Improvement With Static TLB Usage Control

The idea behind improving the performance of workloads using TLB usage control is to give a larger TLB space to those workloads which make better use of the awarded space and to restrict the TLB space for those applications which do not make good use of the TLB space. The TLB usage by different domains is controlled using the TLB usage control policy for performance enhancement listed in Table 5-1. The usage


restrictions for each domain are specified as the maximum percentage fraction of the CShare TLB that can be used by that domain. It should be noted that in this dissertation, the notation X-Y-Z is used to represent a static TLB usage scheme where X%, Y% and Z% of the entries in the TLB set are the usage restrictions, and therefore the SHARE values, for dom0, dom1 and dom2 respectively. Since the usage control policy is static, these usage control restrictions for the different domains are set at the beginning of the experiment and are maintained constant throughout.

To demonstrate the benefit of TLB usage control in improving workload performance, the consolidated workloads TPCC-Vortex and TPCC-Lucas are run on a simulated uniprocessor x86 virtualized platform with CShare TLBs of varying sizes and 8-way associativity. dom1, which runs TPCC-UVa, is set to be the high priority domain and is allowed to use 100% of the TLB space. The usage restrictions for the low priority dom0 and dom2 (running either Lucas or Vortex) are set to be either 20%, 40%, 60%, 80% or 100%. In addition to these usage control schemes, a completely uncontrolled scheme, where all domains are given equal priority and are allowed to use the entirety of the TLB space, is also investigated. The DTLB and ITLB miss rates as well as the Impact Factor (IF) from these simulations are presented in Figure 5-6 and Figure 5-7.

From these, it can be observed that statically allocating a higher TLB space to TPCC-UVa and lower TLB space to dom0 and dom2 has different effects on the two consolidated workloads. As far as TPCC-Vortex is concerned, both the workload domains, as well as dom0, exhibit Type 1 TLB behavior. Thus, restricting the TLB usage of dom0 and dom2 results in an increase of the DTLB miss rate, as seen in Figure 5-6A. This increase is much higher at the smaller TLB size of 64 entries, as the TLB space is under high contention in this TLB size range. However, on increasing the TLB size to 1024 entries, the change in DTLB miss rate with varying usage restrictions for dom0 and dom2 becomes small. The important point is that at no static TLB usage control scheme is the DTLB miss rate smaller than


Figure 5-6. Overall miss rate improvement for consolidated workloads with static TLB usage control. A) DTLB miss rate for TPCC-Vortex. B) DTLB miss rate for TPCC-Lucas. C) ITLB miss rate for TPCC-Vortex. D) ITLB miss rate for TPCC-Lucas. Except for the curve marked "uncontrolled usage", dom1 is set at high priority with a 100% usage limit.


the uncontrolled usage scheme, wherein each domain uses as much TLB space as it needs by evicting the older entries belonging to other domains. Even when all domains are allowed to use 100% of the TLB space, the effective replacement policy is not purely LRU but LRU weighted with the priorities of the various domains. Thus the DTLB miss rate at 100%-100%-100% is smaller than in the uncontrolled usage scenario. It is also interesting to note that, at 512-entry and 1024-entry TLB sizes, increasing the usage limit for dom0 while maintaining the limit for dom2 increases the miss rate. This is an artefact of the usage control policy, especially Step 2b in the algorithm in Table 5-1.

A similar phenomenon of the uncontrolled usage resulting in a lower miss rate than any static usage control scheme is seen in the ITLB miss rates, as shown in Figure 5-6, since all ITLB trends exhibit Type 1 behavior. Thus, due to the trends in both the DTLB and ITLB miss rates, the IF is higher for the uncontrolled, unmanaged usage control policy than for any other static reservation policy, as seen from Figure 5-7A. In fact, for the 64-entry TLB and 512-entry TLB, the IF, which is a measure of the TLB delay as explained in Section 4.4.3, falls to as much as -100%, indicating that the TLB delay is doubled at those usage control settings.

The impact of usage control on the TPCC-Lucas workload, on the other hand, is quite different from the impact on TPCC-Vortex. As Lucas is a Type 3 streaming workload, as far as the DTLB is concerned, withholding TLB space from it does not significantly increase the TLB miss rate. Thus, at usage control schemes where the limit for dom2 is set to low values such as 20% and 40%, dom0 and dom1 benefit from this additional TLB space and show a lower DTLB miss rate than the uncontrolled usage scheme, as seen from Figure 5-6B. The result of this behavior is reflected in the IF trends for TPCC-Lucas, where setting a 20% restriction for Lucas results in increasing the IF from 20% to 25% for a 512-entry TLB. The ITLB miss rate trends displayed in Figure 5-6D, however, are the same as for TPCC-Vortex.


Figure 5-7. Overall performance improvement for consolidated workloads with static TLB usage control. A) IF for TPCC-Vortex. B) IF for TPCC-Lucas. Except for the curve marked "uncontrolled usage", dom1 is set at high priority with a 100% usage limit.

From these simulations, the following conclusions can be drawn regarding the miss rate and overall workload performance when the usage restrictions are statically set for consolidated workloads:

- Independent of the type of workload, the ITLB with uncontrolled and unrestricted sharing performs better than any static usage control scheme. This suggests that, for maximum performance, the ITLB should not be managed using static usage control policies.

- The benefit of static usage control schemes depends on the composite applications which are consolidated in the workload. Specifically, restricting the usage for a Type 3 application to increase the space available for Type 1 applications results in a smaller DTLB miss rate and a larger IF for the consolidated workload as a whole.

- For consolidated workloads such as TPCC-Vortex, where all domains exhibit Type 1 behavior, using priorities in the replacement policy will result in a lower miss rate, even if all domains are allowed to use the entire TLB space, compared to using pure LRU without any notion of usage control or priorities.
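The second conclusion suggests a simple rule for deriving static X-Y-Z SHARE limits from the Type 1/2/3 classification of Section 5.5.1. The sketch below is an illustrative heuristic only; the specific 20% restriction for streaming domains is an assumption chosen to match the schemes evaluated above, not a policy prescribed by this chapter:

```python
# Illustrative heuristic: choose static per-domain SHARE limits (in %)
# from the Type 1/2/3 classification, restricting streaming (Type 3)
# domains so that TLB-sensitive domains can use the freed space.

def static_shares(domain_types):
    """domain_types: list of 1, 2 or 3 for (dom0, dom1, dom2, ...).
    Returns percentage usage limits for each domain."""
    return [20 if t == 3 else 100 for t in domain_types]
```

For TPCC-Lucas, with dom0 and dom1 as Type 1 and Lucas (dom2) as Type 3, this heuristic produces a 100-100-20 style scheme, which is among the settings that lowered the DTLB miss rate relative to uncontrolled sharing in Figure 5-6B.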


5.5.3 Selective Performance Improvement With Static TLB Usage Control

The previous section examined the effect of static TLB usage control on the performance improvement of consolidated workloads. From Figure 5-7, it was evident that, as far as the IF for the entire consolidated workload was concerned, static usage control policies were beneficial only when one of the restricted domains was a TLB insensitive streaming workload. However, the motivation behind TLB usage control could be to improve the performance of one selected high priority workload domain and not the entire consolidated workload. The use of the CShare architecture to achieve this is explored in this section.

To examine this, the consolidated workloads TPCC-Vortex and TPCC-Lucas are simulated on a 1-CPU x86 machine with a CShare TLB of varying sizes, with the V-SID selection for performance enhancement algorithm shown in Table 5-1 used during TLB misses. The same static usage control schemes explored in the previous section are utilized here. In each of these schemes, except for the uncontrolled usage scheme, dom0 and dom2 are set as the low priority domains while dom1, running TPCC-UVa, is set as the high priority domain. The per-domain IIPC for the workloads is observed from these simulations. The IIPC trends for TPCC-Vortex and TPCC-Lucas for 512-entry TLB as well as 1024-entry TLB sizes are presented in Figure 5-8.

When the IIPC variation for dom0 is considered, there is a marked change in the IIPC with the TLB usage limit imposed upon it. This trend in IIPC for various usage control schemes is independent of the workload running on dom2, as dom0 mainly runs the code for servicing TPCC-UVa's I/O requests. When dom0's usage is restricted to a maximum of 20% of the total TLB space (20-100-20), the IIPC decreases by a factor of 0.83 for TPCC-Vortex and 0.81 for TPCC-Lucas. Moreover, since the V-SID selection algorithm is not geared for performance isolation, the impact of altering the usage limitations on dom2 is reflected in the IIPC values of dom0, as seen from the reduction


Figure 5-8. Selective performance improvement for consolidated workloads with static TLB usage control with PW of 60 cycles. A) IIPC for TPCC-Vortex, 512-entry TLB. B) IIPC for TPCC-Lucas, 512-entry TLB. C) IIPC for TPCC-Vortex, 1024-entry TLB. D) IIPC for TPCC-Lucas, 1024-entry TLB. Except where marked as "No Control", dom1 (TPCC-UVa) is given higher priority while dom0 (backend drivers) and dom2 (Vortex and Lucas in the TPCC-Vortex and TPCC-Lucas consolidated workloads respectively) are set at lower priority.


in IIPC by 0.54 and 0.38 for control schemes 20-100-60 and 20-100-100 for TPCC-Vortex.

The trend in the IIPC value for dom2, on the other hand, is highly dependent on whether the workload is Vortex, which significantly reuses the cached TLB entries and therefore is sensitive to changes in TLB size, or Lucas, which has low sensitivity to TLB size due to the streaming nature of its memory access and little reuse of the cached TLB entries. For instance, when Vortex is run on dom2, restricting the TLB space for Vortex severely impacts the IIPC value. When the usage limit for dom2 is set at 20%, as in usage scheme 20-100-20, the IIPC attains a value of -8.4%, compared to the 5.1% for the uncontrolled usage scenario. This indicates that, in spite of having the process-specific tagging, the sheer lack of TLB space drives the performance of Vortex lower than the performance in the case of an un-shared TLB, and the effect of avoiding the TLB flushes is nullified. In addition to the high priority dom1, when dom0 is also allowed to use the entire TLB space (usage scheme 100-100-20), the reduction in IPC further worsens and is almost 10% (IIPC is -10%). However, with a 60% usage limit for Vortex, the IIPC value bounces back to 3.9%-3.7%. While Vortex's performance at this usage limit is definitely less than with uncontrolled usage, it is higher than the performance that can be obtained without the CShare TLB.

On the other hand, when Lucas runs as the workload in dom2, the effect of depriving it of TLB space is markedly different from Vortex, due to its low sensitivity to TLB size. The lowest value of dom2's IIPC, occurring at usage control scheme 100-100-20, is 0.34, compared to the IIPC of 0.47 without any usage control. The important difference with Vortex is that at no usage scheme does Lucas exhibit a negative IIPC, indicating that the performance with the CShare TLB is higher than the performance with the regular TLB, even with a restricted TLB usage.

The behavior of the high priority TPCC-UVa workload on dom1 shows an interesting trend in IIPC for different TLB usage control schemes, as seen from Figures 5-8A and 5-8B. When run consolidated with Vortex, the IPC increases under any usage


scheme compared to the uncontrolled sharing scheme. The highest IIPC is seen when the usage of both dom0 and dom2 is restricted to 20%. In this scheme, TPCC-UVa's IIPC increases by a factor of 1.4 compared to the uncontrolled usage scheme. However, especially in the case of TPCC-Vortex, setting a usage control scheme of 20-100-20 proves extremely expensive on the performance of Vortex. Increasing dom2's usage limit to 60% reduces the penalty imposed on dom2's performance while ensuring that the IPC of TPCC-UVa on dom1 is still higher than with uncontrolled sharing. With TPCC-Lucas, on the other hand, TPCC-UVa's IIPC is actually smaller than with the uncontrolled sharing scheme when Lucas is allowed to use the entire TLB, due to the streaming nature of Lucas' memory access.

It can also be observed that the impact of usage control on the IIPC of dom1 significantly reduces at the larger TLB size of 1024 entries for TPCC-Vortex, but is still pronounced for TPCC-Lucas. At this TLB size, the working set size of both TPCC-UVa as well as Vortex can be accommodated in the TLB, and awarding a larger share of the TLB to dom1 does not pay significant dividends. On the other hand, even a 1024-entry TLB is not sufficient to hold Lucas' working set when it is consolidated with TPCC-UVa. Restricting Lucas' TLB usage, even at a large TLB size of 1024 entries, improves the IIPC, and therefore the performance, of TPCC-UVa.

From these simulations, it is observed that a usage control setting of 20-100-60 for TPCC-Vortex with a 512-entry TLB causes an IF of 62% for TPCC-UVa, implying that 62% of the TLB-induced delay in TPCC-UVa can be eliminated by using the CShare TLB. Similarly, for TPCC-UVa in TPCC-Lucas, an IF of 52% is observed for the 512-entry TLB under this usage control scheme. These IFs translate to an increase in TPCC-UVa's IPC by about 3.5% at a PW latency of 60 cycles and 16.5% at a PW latency of 270 cycles. From this analysis, the following observations can be deduced about controlled TLB sharing using the CShare architecture for selective performance enhancement:


- The impact of usage control is pronounced as long as the TLB is insufficient to capture the working set of all the workloads which share it, i.e. when the TLB is a resource of contention.

- When the TLB behavior of the low-priority workload is dependent on the size of the TLB, as with Vortex, restricting its TLB usage reduces its IPC by a larger value than it increases the IPC of the high-priority application.

- When the low-priority application exhibits a streaming type of memory access, with low reuse of the cached TLB entries, limiting the TLB space for this application increases the IPC of the high priority application by a larger value than the reduction in the low-priority application's IPC.

5.5.4 Performance Improvement With Dynamic TLB Usage Control

From Section 5.5.3, it is evident that the cost of selectively enhancing the performance of a high priority workload, i.e. the reduction in the performance of the low priority workload, depends on the nature of the workload. For workloads such as Lucas, the cost is smaller than the increase in the high priority workload's performance. However, for TLB sensitive applications such as Vortex, the cost outstrips the performance benefit. As a result, the overall performance of the consolidated workload reduces, as seen from Figure 5-7A. However, the TLB usage of many TLB-sensitive applications has distinct phases: some where the pressure exerted on the TLB is quite high and some phases where the TLB usage is low. Unlike a static usage control policy, as used in Section 5.5, a dynamic usage control policy will be able to exploit these different phases by temporarily allocating a larger share of the TLB to the low-priority application when it is in a high TLB usage phase and restricting the TLB usage only in low TLB usage phases.

In order to implement such dynamic usage policies, a phase analyzer functionality is added to the CShare TLB, as shown in Figure 5-9. The phase analyzer architecture consists of a bank of registers, similar to the performance monitoring units (PMUs). These registers are used to track the miss rate of the TLB on a per-SID basis, in a fashion similar to the PMUs for tracking cache statistics in current processors [28], as shown in step 1. It also consists of a countdown timer which can be used to set the

PAGE 132

Figure5-9. DynamicTLBUsageControlwithaPhaseAnalyzer.TLBmissesaretrackedasshowninstep . 1 .Whenthephaseanalyzerfunctionalityisinvokedatprogrammedintervals,asshowninstep . 2 ,themissrateoverthepastintervaliscalculatedandusedtoadjusttheSHAREvalueforthesharingclasses,asshowninstep . 3 frequencyatwhichthephaseanalyzerfunctionalityisinvoked.Thistimerissettothedesiredvalueandisdecrementedoneveryclocktick.Oncethetimerreacheszero,andthenextcapacityorforcedushoccurs,thephaseanalyzerfunctionalityistriggered.TheideabehindincorporatingthephaseanalysisfunctionalityasapartoftheTLBushbehavioristoavoidthegratuitousushingoftheTLBafterreallocation.Oninvocation,asshowninstep2ofFigure 5-9 ,thephaseanalyzerexaminesthecurrentusageoftheTLBbycalculatingtheTLBmissratesincethelastinvocation.ItthenusesthismissrateandthepasthistoryofthemissratechangetheTLBusagelimitofthelowprioritydomainsshowninstep . 3 .Forinstance,Ifthetrendinthemissrateisincreasing,theSHAREvalueofthelowpriorityworkloaddomainisincreasedcomparedtothecurrentallocation.Ifthemissrateofthecurrentphase,however,islowerthanthepreviousphase,theusageofthelowpriorityworkloaddomainisfurther 132

PAGE 133

restricted.Toimplementthisfunctionality.ThenumberofentriesintheTSTdecidethenumberofregistersinthisbank. Figure5-10. SelectiveperformanceimprovementforconsolidatedworkloadwithstaticTLBusagecontrolfora512-entry8-wayCShareTLB.DynamicallychangingtheTLBusagerestrictionsofthelow-priorityworkloaddomain(dom2)signicantlyreducesthecostofselectivelyenhancingtheperformanceofhighpriorityworkloaddomain(dom1)andimprovestheoverallperformanceoftheconsolidatedworkload. InordertodemonstratetheadvantageofdynamicTLBusagecontrolpolicies,TPCC-VortexissimulatedusingthesamesetupoutlinedinSection 5.5.3 withtheadditionofthephaseanalyzermodule.ThecountdowntimerisprogrammedwithavalueofvemillioncyclesasthisapproximatesthefrequencyofforcedushesfortheTPCC-Vortexworkload.Theper-domainandoverallperformancestatisticsareobservedforvariousCShareTLBsizes.Fromtheseobservations,theIFforthedom0aswellastheworkloaddomains,fora512-entryTLB,arepresentedinFigure 5-10 133
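The adjustment loop described above can be summarized in a short behavioral sketch. This is an illustrative Python model, not the hardware design: the class name, the SHARE adjustment step of 10 percentage points, and the 20-100 bounds are assumptions made only to show the control flow (per-SID miss tracking, re-evaluation on a flush once the countdown timer expires, and raising or lowering SHARE with the miss-rate trend).

```python
# Behavioral sketch of the phase analyzer; all parameter values are
# illustrative assumptions, not the actual CShare hardware design.
class PhaseAnalyzer:
    def __init__(self, interval_cycles, share_step=10, share_min=20, share_max=100):
        self.interval = interval_cycles   # countdown timer reload value
        self.timer = interval_cycles
        self.misses = 0                   # per-SID miss counter (one SID shown)
        self.accesses = 0
        self.prev_miss_rate = None
        self.share = share_max            # SHARE value of the low-priority domain
        self.step = share_step
        self.min, self.max = share_min, share_max

    def record_access(self, hit):
        self.accesses += 1
        if not hit:
            self.misses += 1

    def tick(self, cycles=1):
        # decremented on every clock tick
        self.timer = max(0, self.timer - cycles)

    def on_flush(self):
        """Invoked on a capacity or forced flush; re-evaluates SHARE only
        after the countdown timer has expired, so that reallocation does
        not cause gratuitous extra flushes."""
        if self.timer > 0 or self.accesses == 0:
            return self.share
        rate = self.misses / self.accesses
        if self.prev_miss_rate is not None:
            if rate > self.prev_miss_rate:     # rising pressure: grant more TLB
                self.share = min(self.max, self.share + self.step)
            elif rate < self.prev_miss_rate:   # low-usage phase: restrict further
                self.share = max(self.min, self.share - self.step)
        self.prev_miss_rate = rate
        self.misses = self.accesses = 0
        self.timer = self.interval
        return self.share
```

A falling miss rate across two intervals thus shrinks the low-priority domain's SHARE, while a rising one restores it, mirroring the policy in the text.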


From this figure, it can be clearly seen that dynamically managing the TLB usage of Vortex running on dom2 significantly reduces the cost of selective performance enhancement. For instance, at a static usage restriction of 100-100-20, where the lower-priority dom2 is restricted to use only 20% of the TLB while the higher-priority workload dom1 running TPCC-UVa, as well as the driver domain dom0, are allowed to use the entire TLB space, the IF of TPCC-UVa increases from 47% to 63%. However, the cost of this increase is an IF of -110% for dom2. In other words, the delay due to the TLB misses and page walks for dom2 when such a static restriction is used is more than twice the delay of the untagged TLB. The benefit of using the tagged TLB, which is a lowering of the TLB delay by 56% in the uncontrolled case, is more than offset with static usage restrictions. Even at a 60% usage restriction for dom2, the cost in terms of the lowering of the IF compared to the uncontrolled case is about 15%. However, with dynamic control using the phase analyzer, the cost is reduced to 4%, while the benefit in terms of the IF for dom1 increases by 14% over the uncontrolled case. These translate into IIPC values of 3.59% and 4.87% for dom0 and dom1 respectively, about 1.3× and 0.96× the IIPC without explicit TLB usage controls. Moreover, while not shown in the figure, the IF of the overall consolidated workload increases by about 2%. Thus, with dynamic usage control it becomes possible to achieve selective performance enhancement for TPCC-UVa running on dom1 without significantly lowering the performance of the lower-priority dom2.

5.6 Summary

In this chapter, the CShare TLB is proposed for enabling the sharing of the TLB, using process-specific tagging, in a controlled manner. The TLB usage control mechanism in the CShare TLB can be used for isolating the TLB performance of the various domains which share a TLB, by explicitly reserving portions of the TLB for different domains. Moreover, by statically partitioning the TLB space to restrict the TLB usage of a low-priority domain, the performance of the high-priority domain can be increased.


This is accompanied by an increase in the overall consolidated workload performance if the low-priority domain being restricted exhibits a TLB-insensitive streaming usage pattern. However, if the low-priority domain is TLB-sensitive, the cost of restricting its TLB usage can be significant, even to the extent of reducing the overall performance of the consolidated workload. This cost can be reduced by using dynamic TLB usage control policies to restrict the TLB usage of the low-priority domain only during phases where the TLB usage is not high. Using such usage control, the performance increase for a high-priority workload domain achieved by using an uncontrolled process-specific tagged TLB can be selectively increased by about 1.4×.


CHAPTER 6
CONCLUSION AND FUTURE WORK

Improving the performance of virtualized workloads and managing the sharing of resources among the component applications of consolidated workloads are two challenges in virtualization. Meeting these challenges, specifically in the context of hardware-managed Translation Lookaside Buffers (TLBs), forms the theme of this dissertation.

In order to understand the performance degradation caused by the high-frequency TLB flushing on virtualized platforms and to investigate the impact of various schemes that are proposed to reduce the TLB-induced delay, simulation frameworks supporting detailed and customizable performance and timing models for the TLB are needed. To address this issue, a full-system simulation framework supporting the x86 ISA and TLB models is developed, validated and used to experimentally evaluate the performance implications of the TLB in virtualized environments. The tagged TLB model developed in this work is designed to be generic enough to support the simulation of both process-specific as well as VM-specific tagging. This is the only academic simulation framework that provides a detailed timing model for the TLB and simulates the walking of the page tables on a TLB miss. Moreover, this framework is capable of simulating multiprocessor multi-domain workloads, which makes it uniquely suitable for studying virtualized platforms. Using this framework, the TLB behavior of I/O-intensive and memory-intensive virtualized workloads is characterized and contrasted with that of their non-virtualized equivalents. It is shown that, unlike in non-virtualized single-O/S scenarios, the adverse impact of the TLB on workload performance is significant on virtualized platforms. Using the developed simulation framework, it is shown that this performance reduction for virtualized workloads is as much as 35%, due to the TLB misses which are caused by the repeated flushing of the TLB and the subsequent page walks to service these misses.


This dissertation proposes a novel microarchitectural approach called the Tag Manager Table (TMT) to reduce the TLB-induced performance delay for virtualized workloads. The TMT approach involves tagging the TLB entries with tags that are process-specific, thus associating them with the process which owns them. By tagging the TLB entries, TLB flushes can be avoided during context switches. The TMT is designed to generate and manage these tags in a software-transparent fashion while ensuring low latency of TLB lookups and imposing a small area overhead. Using the simulation framework developed in this dissertation, it is found that using process-specific tags reduces the TLB miss rate by about 65% to 90%, which, depending on the TLB miss penalty, translates into a 4.5% to 25% improvement in the performance of the workloads. The architectural parameters and workload-dependent factors that influence the performance benefit of using the TMT are investigated and prioritized on the basis of the significance of their influence. Since the tags are generated at a process-level granularity and are not tied to any virtualization-specific aspect, the TMT may be used to avoid TLB flushes in non-virtualized scenarios as well. Moreover, the TMT may also be used to enable TLB sharing across multiple per-core private TLBs using a hierarchical design with a shared Last Level TLB (LLTLB), which reduces the TLB miss rate by 15% to 28% due to a better utilization of the TLB space. The use of the Tag Manager Table in tagging I/O TLBs is proposed and validated using a full-system simulation-based prototype.

The third part of this dissertation addresses the issue of usage control in the tagged TLB which, because of the tagging, is shared amongst multiple processes. The CShare TLB architecture is proposed to control the TLB sharing. The TLB usage of different applications is analyzed and classified depending on how well they use the TLB space. Based on this, the performance improvement due to the TMT without any explicit usage controls is further increased by using the CShare architecture to provide a larger TLB space to those applications which have a higher priority and to restrict the TLB usage


of TLB-insensitive applications. The use of dynamic TLB usage control policies to provide this further performance improvement, even when the restricted workload is TLB-sensitive, is investigated. Using such control, the performance increase for a high-priority workload domain achieved by using an uncontrolled process-specific tagged TLB can be selectively increased by about 1.4×. The use of the CShare architecture in ensuring TLB performance isolation amongst domains which share the TLB is also explored.

While the Tag Manager Table is motivated by the need to improve performance in virtualized scenarios, process-specific tagging of the TLB entries is key to enabling many architectural features which are common on RISC architectures with software-managed TLBs and which depend on the ability to associate TLB entries with the address space for which they are valid. Using the TMT-generated process-specific tags creates these associations on platforms with hardware-managed TLBs, like x86, and enables the adoption of ideas such as coherent TLBs and virtual caches on these platforms. The work presented in this dissertation forms the foundation for such future exploratory research.


APPENDIX A
FULL FACTORIAL EXPERIMENT

A Full Factorial Experiment is an experimental technique to understand the effect of various parameters on the output of a system. In such experiments, there are two or more factors, each of which can take one of many discrete levels. These factors act as the input to the system under test. One experiment is performed for each combination of the factors. By examining the output for these different combinations of the factors, the effect of the factors and their interactions on the response variable can be studied. In a full factorial experiment, the response variable y_{ijk} for the k-th repetition of the experiment (out of a total of r repetitions), with factor A at the j-th of a possible levels and factor B at the i-th of b possible levels, is given by

    y_{ijk} = \mu + \alpha_j + \beta_i + \gamma_{ij} + e_{ijk}    (A-1)

Here \mu is the mean value of the response variable, \alpha_j the effect of factor A at level j, \beta_i the effect of factor B at level i, and \gamma_{ij} the effect of the interaction between A at level j and B at level i; e_{ijk} is the error term. The observations from the full factorial experiment are arranged in a two-dimensional matrix of cells with b rows and a columns. The (i,j)-th cell contains the observations belonging to the r repetitions of the experiment with A and B at levels j and i respectively. Averaging the values in each cell, across columns, across rows, and across all the observations produces

    \bar{y}_{ij.} = \mu + \alpha_j + \beta_i + \gamma_{ij}
    \bar{y}_{i..} = \mu + \beta_i
    \bar{y}_{.j.} = \mu + \alpha_j
    \bar{y}_{...} = \mu    (A-2)


From these equations, the effects can be calculated as

    \mu = \bar{y}_{...}
    \alpha_j = \bar{y}_{.j.} - \bar{y}_{...}
    \beta_i = \bar{y}_{i..} - \bar{y}_{...}
    \gamma_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}
    e_{ijk} = y_{ijk} - \bar{y}_{ij.}    (A-3)

The variation of the output variable can be allocated among the two factors and their interaction by squaring both sides of Equation A-1 and assigning the different terms the notations shown in Equation A-4:

    \sum_{ijk} y_{ijk}^2 = abr\mu^2 + br\sum_j \alpha_j^2 + ar\sum_i \beta_i^2 + r\sum_{ij} \gamma_{ij}^2 + \sum_{ijk} e_{ijk}^2
    SSY = SS0 + SSA + SSB + SSAB + SSE    (A-4)

From these values, the percentage variation due to factors A and B, the interaction AB, as well as an unexplained part due to experimental errors, are calculated as shown in Equation A-5:

    SST = SSY - SS0 = SSA + SSB + SSAB + SSE
    \%Variation_A = 100 \cdot SSA / SST
    \%Variation_B = 100 \cdot SSB / SST
    \%Variation_{AB} = 100 \cdot SSAB / SST
    \%Variation_{Err} = 100 \cdot SSE / SST    (A-5)

When the number of factors involved becomes large, as in Chapter 4, the estimation of the significance of each factor can be computed using statistical software such as SAS [116].
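The effect and sum-of-squares computations above can be checked numerically. The following Python sketch implements them directly for two factors with r repetitions; the data set in the usage example is synthetic, chosen only to exercise the formulas.

```python
# Allocation of variation for a two-factor full factorial experiment:
# y[i][j][k] holds the k-th repetition with factor B at level i (b rows)
# and factor A at level j (a columns). Degenerate data with SST == 0
# is not handled in this sketch.
def full_factorial(y):
    b, a, r = len(y), len(y[0]), len(y[0][0])
    mean = lambda xs: sum(xs) / len(xs)
    y_ij = [[mean(y[i][j]) for j in range(a)] for i in range(b)]   # cell means
    y_i = [mean(y_ij[i]) for i in range(b)]                        # row means
    y_j = [mean([y_ij[i][j] for i in range(b)]) for j in range(a)] # column means
    mu = mean(y_i)                                                 # grand mean
    alpha = [y_j[j] - mu for j in range(a)]                        # effects of A
    beta = [y_i[i] - mu for i in range(b)]                         # effects of B
    gamma = [[y_ij[i][j] - y_i[i] - y_j[j] + mu for j in range(a)]
             for i in range(b)]                                    # interactions
    ssa = b * r * sum(v * v for v in alpha)
    ssb = a * r * sum(v * v for v in beta)
    ssab = r * sum(gamma[i][j] ** 2 for i in range(b) for j in range(a))
    sse = sum((y[i][j][k] - y_ij[i][j]) ** 2
              for i in range(b) for j in range(a) for k in range(r))
    sst = ssa + ssb + ssab + sse
    pct = {"A": 100 * ssa / sst, "B": 100 * ssb / sst,
           "AB": 100 * ssab / sst, "Err": 100 * sse / sst}
    return mu, pct

# Synthetic, purely additive data: factor B (rows) dominates the variation.
mu, pct = full_factorial([[[10, 10], [20, 20]], [[30, 30], [40, 40]]])
```

With this additive data set the interaction and error terms vanish, so the variation splits cleanly between the two factors.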


APPENDIX B
FULL FACTORIAL EXPERIMENTS USING THE SIMULATION FRAMEWORK

A typical form of simulation-based studies is parametric sweeps. Such studies, similar to the experiments detailed in Section 4.4, consist of running a large number of long-running simulations, varying key parameters for each simulation run. Typically, such long-running simulation jobs are performed on dedicated cluster resources or on distributed grids. This appendix provides the details of setting up the simulation runs on a typical cluster as well as on a wide-area grid.

The dedicated cluster on which the simulations are run is the University of Florida High Performance Computing Cluster [117]. The HPC consists of a centralized Linux cluster, two large-scale shared filesystems, and a dedicated high-speed network. To set up a parametric sweep in this environment, checkpoints are created using the methods outlined in Section 3.3.4 and transferred to the $HOME directory of the user on the HPC. From here, a submission script is written for each simulation, which specifies parameters such as the estimated time for the simulation, using the results from Section 3.4. The script also contains commands which start the simulation in batch mode, configure the appropriate parameters such as the page walk latency, proceed with the simulation, and archive the results on completion of the job.

To conduct large-scale simulation studies, the wide-area grid resource Archer [25] is also used. Archer is an open infrastructure for simulation-based computer architecture research. Archer consists of a few hundred cores, each with Simics installed, connected through a wide-area P2P network. It also has a cluster-wide NFS which facilitates the sharing of files on a node seamlessly throughout the cluster. Using this infrastructure, one or more nodes are populated with the checkpoints of the workloads. Using such a node as a repository for the checkpoints, many simulations are started and configured to run in parallel, with different parameter values for each run.
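As a concrete illustration of the per-run submission scripts described above, the following Python sketch generates one batch script per parameter value. The PBS directives, the checkpoint file name, the $page_walk_latency variable, and the Simics command-line flags are all illustrative assumptions, not the exact scripts used on the cluster.

```python
# Generate one batch submission script per page-walk-latency value for a
# parametric sweep. All directive names, file names, and flags below are
# illustrative assumptions for a PBS-style scheduler and Simics batch run.
def make_submit_script(checkpoint, pw_latency, walltime_hours, outdir="results"):
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -l walltime={walltime_hours}:00:00",      # estimated run time
        "#PBS -l nodes=1:ppn=1",
        f"simics -no-win -batch-mode {checkpoint} \\",   # start in batch mode
        f"  -e '$page_walk_latency = {pw_latency}' \\",  # configure the sweep parameter
        "  -e 'run'",
        f"tar czf {outdir}/pw{pw_latency}.tar.gz *.log", # archive on completion
    ])

# One job per page-walk latency value used in the sweep.
scripts = {pw: make_submit_script("tpcc_vortex.ckpt", pw, 48)
           for pw in (60, 270)}
```

Each generated script would then be handed to the scheduler, one submission per parameter combination.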


APPENDIX C
USING THE TAG MANAGER TABLE FOR TAGGING I/O TLBS

Power and performance considerations for high-throughput computing platforms are leading to a situation wherein simpler CPU cores are becoming the processor of choice even for high-throughput platforms. A case in point is the trend of the Intel Atom family of processors being increasingly preferred, in spite of their lower processing capability, in high-throughput servers over power-hungry but more capable processor variants such as Xeon [118, 119]. To fill this gap in advanced and specialized functionalities, high-throughput platforms with low-power processors need to either execute these functionalities in software, on the main processor cores, or integrate specialized hardware units or accelerators which offer these functionalities for offload. Various power/performance tradeoffs dictate the latter as the approach of choice [118]. Even in cases where more complex processor architectures are employed, there are significant power savings to be obtained by employing specialized accelerators designed for common compute-intensive functions and offloading such functions from the complex processor to these accelerators.

Traditional approaches for integrating such specialized accelerators and for offloading jobs to them view the accelerator as a device and rely on a software device driver for interfacing. This approach works well when the execution time on the accelerator is of a magnitude bigger than the overheads incurred in offloading a task. However, for the case in point, i.e., high-performance systems with very fine-grain functionality offload, a generic interface specification that reduces performance overheads and allows seamless portability of programs across platforms with varying degrees of hardware support is needed [120-122]. Several approaches, including allowing the accelerator to operate in the application domain's virtual memory space, making applications offload-aware, and achieving tight integration between CPUs and accelerators, have been proposed. However, in order to allow the accelerator to


operate in the same address space as the process, the accelerator has to be aware that the offloaded data is being specified by an address in the virtual address domain. Moreover, the virtual address should be translated to the physical address before the data can be accessed from memory. Thus, for performance considerations, an I/O TLB is needed to cache the virtual-to-physical translations used by the accelerator. Since multiple processes may offload jobs to the accelerator in an interleaved fashion, this TLB should be capable of being shared by multiple processes' address spaces [120]. The Tag Manager Table may be used in this scenario. In this dissertation, one specific accelerator interfacing scheme, the Virtual Memory Accelerator (VMA) [120], is considered and the use of the TMT in this VMA architecture is demonstrated.

C.1 Architecture of VMA

The two major objectives of VMA are: 1. establishing a low-latency interface with minimum software overheads, for improved performance, and 2. allowing user-mode data offload, for programmability and seamless portability of the application across platforms with varying degrees of hardware support. VMA achieves these by allowing the accelerators to work in the same address space domain as the processes which offload to them and by providing an extended ISA for offloading the task to the accelerator. The architecture of VMA, as shown in Figure C-1, has four components:

- Extended ISA for offloading: The offloading infrastructure consists of the mechanism by which the user application offloads a task to the accelerator. The information which has to be passed to the accelerator typically includes a source buffer with the data, a destination buffer to store the processed results, and a command word which informs the accelerator how the data should be processed. This is implemented by extending the ISA with two instructions, PUTTXN and GETTXN. The PUTTXN instruction provides a process an atomic method to send the data and command word to the accelerator. This instruction returns a unique transaction ID that the process can use to query the hardware for completion status. The GETTXN instruction provides a process with a method for querying the hardware for completion status for a given transaction.
- Virtual memory aware accelerators: Hardware accelerators can be made virtual memory aware by providing them with an application context at the time of offload, by including a context ID as a part of the offloaded functionality. This context id is then provided by the accelerator as a part of every memory transaction that it issues, in order to identify the process address space in which it operates and to facilitate mapping from this address space to the physical memory space.

- IPMMU for I/O virtual-to-physical address translations: The IP (Intellectual Property) memory management unit (IPMMU) is provided in the interconnection fabric and offers address translation services to the accelerators so that they can execute in the virtual memory domain. This also allows programs to access the accelerator functions directly from user space and to communicate using virtual memory addresses. When the accelerator tries to access application memory with a virtual address, the IPMMU will intercept the request and automatically translate the virtual address into the corresponding physical address. For address translation efficiency, the IPMMU contains a TLB to cache the recent address translations. This I/O TLB is similar in structure and organization to the core TLB, with the addition of a tag which identifies an entry in the TLB with the context of the application for whose address space the translation is valid.

- Page Fault Handling: Similar to page faults caused during address translation on the core, memory accesses initiated by the accelerator and intercepted by the IPMMU may fail in the address translation. VMA implements a fault reporting mechanism which delivers this I/O page fault to the software stack running on the system, and a fault handling mechanism consisting of software modules to handle these page faults.
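The PUTTXN/GETTXN interface described above can be sketched behaviorally. The accelerator is mocked in Python and completes tasks synchronously; all class and field names are illustrative assumptions, and only the instruction semantics (atomic submission returning a unique transaction ID, and polling for completion) follow the text.

```python
# Behavioral sketch of the extended-ISA offload interface. The mock
# completes work synchronously inside put_txn; real hardware would
# process the task asynchronously over PCI transactions.
import itertools

class MockAccelerator:
    def __init__(self):
        self._ids = itertools.count(1)   # source of unique transaction IDs
        self.done = {}                   # transaction ID -> completion bit

    def put_txn(self, src, dst, command):
        """PUTTXN: atomically submit source buffer, destination buffer and
        command word; returns a unique transaction ID for later queries."""
        txn = next(self._ids)
        self.done[txn] = False
        self.done[txn] = True            # mock: task finishes immediately
        return txn

    def get_txn(self, txn):
        """GETTXN: query the completion status for a transaction ID."""
        return self.done.get(txn, False)

acc = MockAccelerator()
txn = acc.put_txn(src=0x1000, dst=0x2000, command=0x1)  # offload one task
while not acc.get_txn(txn):   # spin polling for completion, as in the text
    pass
```

The polling loop mirrors the way the test application in Section C.4 spins on GETTXN after each offload.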
C.2 Prototyping and Simulating the VMA Architecture

In order to model the hardware and software components of VMA, Virtutech Simics, which has been discussed in detail in Section 3.2.1, is chosen as the simulation framework for developing the VMA prototype. Using Simics, a platform consisting of an Intel Xeon CPU with an X58 chipset and an ICH10 Southbridge is simulated, and 64-bit Linux 2.6.28 is booted on this platform. This platform, shown in Figure C-1, is used for modeling and simulating the VMA prototype.

Figure C-1. Architecture and simulation-based prototype of VMA. The architecture of VMA consists of an extended ISA for offloading to the accelerator, accelerators which are virtual memory aware, an IPMMU to translate from virtual to physical addresses with a tagged TLB to cache these translations, and a software handler for IPMMU-generated page faults. These components are prototyped using the Simics full-system simulation framework.

Extending the ISA with offload instructions

In order to simulate the PUTTXN and GETTXN instructions for enabling fine-grained instruction-based offload, the Magic Instruction capability of Simics is used. The magic instruction, for x86 models, is the xchg bx,bx instruction. When this is executed by the software stack running in the simulated platform, Simics stops the simulation and surrenders control to a user-defined HAP script. This script may be used to examine the architectural state of the suspended simulation and modify it, if necessary. Once the actions specified in this script are completed, Simics resumes the simulation from the point where it was stopped.

For the PUTTXN instruction, the appropriate arguments, such as the source and destination buffer addresses, are loaded into general purpose x86 registers. An instruction identifier, which identifies that the magic instruction is being used to simulate the PUTTXN instruction, is also loaded into a register, following which the magic instruction is called. The HAP script which is invoked on this magic instruction reads the instruction identifier and simulates the PUTTXN instruction by copying the arguments from these registers to the appropriate locations in the register bank of the simulated accelerator. The Tag


Manager Table is also queried, and the VASI from the CCR of the CPU on which the offloading application is executing is also provided to the accelerator as the context id. This script also generates a transaction id and updates both the accelerator as well as the general purpose register in which the source buffer address was specified with this transaction id. The script also provides the Software Trigger to the accelerator to initiate the offload. On resuming the simulation, the accelerator begins to process the offloaded task by issuing PCI transactions for accessing the data from the source buffer. The offloading application reads the transaction id from the general purpose register which was populated by the HAP script.

The GETTXN instruction is simulated in a similar fashion using the magic instruction, by loading the identifier for the GETTXN instruction as well as the transaction id into general purpose registers and then executing the magic instruction. The script invoked on the execution of the magic instruction checks the completion bit in the hardware accelerator and copies this value into the EAX register. On resuming the simulation, the value which has been loaded in the EAX register is read by the user application to check the completion status of the offloaded task. It should be noted that the use of the general purpose registers is an artefact of simulation. In reality, a location in memory can be used to offload the task and to read the transaction id. The accelerator may be made aware of this memory location during the boot-up initialization.

Prototyping the virtual memory aware accelerator

The sample accelerator prototyped in this research is a PCI-based image processing accelerator with fine-grain functionality offloads.¹

¹ It should be noted that the fine granularity refers to the functionality that is offloaded and not to the granularity of the data size. One example of such fine-grained functionality is the SIMD extensions, such as SSE and AVX, which operate on 128-bit and 256-bit wide data and perform fine-grained operations such as floating point arithmetic on these data.

A PCI-based accelerator is chosen due to the ease of modeling such devices and integrating the model with the simulated machine in Simics. Similar to most PCI Type 0 devices, the configuration space of the accelerator model is implemented as a bank of registers which are programmed by the O/S during device discovery and enumeration, and it can map up to six functional regions into the address space of the CPU. The accelerator also implements a 4 KB internal buffer, used for the internal computation of the accelerator, which is not mapped into the processor address space. The accelerator utilizes two of the six functional regions, FN0 and FN1. Each of these functional regions consists of a bank of registers, which can be addressed as Memory-mapped I/O (MMIO) addresses after device enumeration.

FN0 implements a simple Sum-of-Products (SOP) functionality. A SOP computation can be offloaded to the accelerator by writing the address of the source buffer which contains the elements of the row and column, along with the dimension of the row/column, as well as the destination buffer, to the appropriate registers in FN0. Once these buffer addresses are provided, the computation of the SOP is initiated by writing to the Software Trigger register in the FN0 register bank. On receiving the trigger, the accelerator reads the contents from the source buffer using PCI-to-memory transactions, computes the SOP and writes the result to the specified destination buffer. Then, it sets a completion bit in its register bank. The completion of the offloaded task may be notified to the software stack either by converting the setting of the completion bit to an I/O interrupt or by polling the completion bit in this register bank at regular intervals.

FN1 implements a pixel manipulation functionality. Given an image and a transformation matrix, FN1 multiplies each of the pixels by the transformation matrix and writes the transformed image into the specified destination buffer. A user application can offload an image manipulation functionality to the accelerator by writing to FN1's registers in a manner similar to the FN0 offload. These functionalities are chosen as they are quite important in image processing and are ideal candidates for acceleration [123].

The accelerator is made Virtual Memory Aware by providing the context information (VASI tag) as a part of the offload. The accelerator then includes this context id as a part of every PCI-to-memory transaction. In order to achieve this, the format of the PCI bus TLP header is changed and a context information field is added to it. Moreover, by incorporating the context id as a part of the PCI transaction, the accelerator is able to support offloads from multiple user processes with different contexts and process these in an interleaved and pipelined fashion.

Handling IPMMU-generated page faults

When the IPMMU walks the page tables to translate the virtual address belonging to a particular process to its physical equivalent, this page walk may result in a page fault due to a mismatch between the access permissions for the page (Read/Write permissions or User/Supervisor privileges) and the desired type of access. However, a more common reason for page faults is the lack of a physical page corresponding to the virtual address being accessed. For instance, in Linux, typical allocations of user-space buffers are lazy in nature, i.e., the physical memory for the buffers is not assigned when the buffers are created. When the program running on the core attempts to access the buffer, this results in a demand page fault. The O/S page fault handler allocates the page, updates the page table and then restarts the faulting instruction. In the VMA architecture, since the accelerator also works in the same virtual address space as the user application, the transactions it issues may also cause such demand paging faults. In addition, swapping out of the physical page corresponding to a user-space buffer (due to memory limitations) before that buffer can be accessed by the accelerator may also generate page faults. Whenever a page fault is caused by the IPMMU walking the page tables, the cause of that page fault is


determined. If it is due to a mismatch in the permission or privilege bits, this is treated as an unrecoverable error and the PCI read/write transaction is terminated with an explicit error indication, as mandated by the PCI standards [124]. The accelerator, on such terminated PCI requests, waits for a certain retry period and then reissues the transaction. This retry period can be effectively hidden by the accelerator issuing memory requests of another offloaded task while it is waiting. After a certain number of retries, if the PCI transaction cannot be completed, the accelerator terminates the offloaded job by setting the completion bit and indicates the unsuccessful completion of the task by setting an error bit.

For page faults caused by the lack of an entry in the page tables, the IPMMU raises an interrupt using the IPMMU Fault Reporting Mechanism (FRM). The FRM is similar to the VT-d fault reporting mechanism [125]. It consists of a bank of Fault Recording Registers (FRRs), as shown in Figure C-2, with each register having fields for storing the faulting address and the process context in which the fault occurred. The IPMMU populates one of these registers with the faulting information and raises an interrupt. Then, it terminates the PCI transaction with an explicit error indication. The IPMMU software fault handler catches the interrupt and verifies that the interrupt was raised due to a page fault. It then reads the faulting address and context from the Fault Recording Register, allocates physical memory and maps the faulting virtual address to the allocated memory by updating the page tables. The IPMMU fault handler then clears the Fault Recording Register and terminates. Subsequently, when the faulting PCI transaction is reissued by the accelerator, the page walk results in a successful translation of the virtual to physical address and the transaction successfully completes.

Simulating the IPMMU and the I/O TLB

The IPMMU is implemented on the Simics simulated platform in the Northbridge, as shown in Figure C-2. It is designed to intercept all traffic between the accelerators (I/O devices) and memory, in order to provide translation for requests from VM-aware


accelerators. On intercepting a PCI-to-memory transaction, the IPMMU examines the context id field of the TLP header. The presence of a non-zero context id indicates that the device which originated the transaction is a VMA device and that the target address specified in the transaction is a virtual address.

Figure C-2. IPMMU and I/O TLB

Using the supplied context id, the IPMMU first checks to see if the translation is cached in the IPMMU Translation Lookaside Buffer (I/O TLB). This IPMMU TLB is a tagged TLB, similar to the architecture described in Section 4.2. Every entry is annotated with the context id of the offloading user application. Such a tagged design allows the translations of multiple processes to coexist in the TLB and allows the IPMMU to handle translation requests from multiple user applications in an interleaved fashion. If the required virtual-to-physical translation for the context id of the PCI transaction being currently processed is not found in the IPMMU TLB, the IPMMU initiates an address translation process by walking the O/S page tables of the user application. Once the page walk is completed, and the physical address corresponding to the virtual address


is obtained, the IPMMU reprograms the PCI transaction with this physical address and allows it to access the data from that physical address. This translation is also added to the TLB and tagged with the context id of the offloading user application to which it belongs.

C.3 Using the Tag Manager Table in the VMA Architecture

Since multiple processes may offload tasks to the multiple accelerators, and since these tasks may be executed in an interleaved fashion, the IPMMU will have to perform address translations for multiple address spaces in an interleaved fashion. Given this, it is imperative to tag the I/O TLB entries and thereby ensure that the entries of multiple processes may be cached concurrently. The Tag Manager Table may be used for generating this process-specific tag. In this work, the I/O TLB is designed to have a separate TMT. The CR3 value of the offloading process itself is used as the context id. The TMT establishes unique CR3-to-VASI mappings and uses these VASIs to tag the TLB entries. The IPMMU, on intercepting a memory access from the accelerator, uses the TMT to get the CR3-to-VASI mapping and looks up the tagged TLB using this VASI for the required translation. If this translation is not present, the page walk is performed and the computed translation is annotated with the VASI and cached in the TLB. A simple TLB synchronization scheme, wherein every core TLB flush also flushes the I/O TLB, is used. However, it is also possible to use a core tagged TLB and a global Tag Manager Table, as in Section 4.8, and to have the same process-to-VASI mapping used in both the core and I/O TLBs. In such a design, the I/O TLB, similar to the core TLB, will be flushed only during capacity flushes and forced flushes. Thus, the number of I/O TLB flushes may be significantly reduced. In addition, if the context id being used is not the CR3 value of the offloading process, a CR3-to-context-id mapping should also be maintained for every offloading process.
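The CR3-to-VASI mapping and the tagged lookup path described above can be sketched as follows. This is an illustrative Python model: the page-walk stand-in, the unbounded tables, and the absence of capacity eviction or flushes are simplifying assumptions, not the hardware organization.

```python
# Sketch of a Tag Manager Table feeding a tagged I/O TLB: the context id
# (the offloading process's CR3) is mapped to a VASI, and TLB entries are
# keyed by (VASI, virtual page) so multiple address spaces coexist.
class TagManagerTable:
    def __init__(self):
        self._map = {}          # CR3 (context id) -> VASI
        self._next_vasi = 1

    def vasi_for(self, cr3):
        """Establish (or return the existing) unique CR3-to-VASI mapping."""
        if cr3 not in self._map:
            self._map[cr3] = self._next_vasi
            self._next_vasi += 1
        return self._map[cr3]

class TaggedIOTLB:
    def __init__(self, tmt, page_walk):
        self.tmt = tmt
        self.page_walk = page_walk   # fallback: walk the O/S page tables
        self.entries = {}            # (VASI, virtual page) -> physical page
        self.hits = self.misses = 0

    def translate(self, cr3, vpage):
        vasi = self.tmt.vasi_for(cr3)        # context id -> tag
        key = (vasi, vpage)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = self.page_walk(cr3, vpage)  # cache tagged entry
        return self.entries[key]

walk = lambda cr3, vpage: (cr3 << 8) | vpage   # hypothetical page-walk result
tlb = TaggedIOTLB(TagManagerTable(), walk)
tlb.translate(cr3=0xA, vpage=3)   # miss: walk, insert entry tagged with VASI 1
tlb.translate(cr3=0xA, vpage=3)   # hit: same context, same page
tlb.translate(cr3=0xB, vpage=3)   # miss: a second context coexists in the TLB
```

Because the entries are keyed by VASI, the second context's translation of the same virtual page does not evict or alias the first, which is the property the tagging is meant to provide.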


Figure C-3. Functional validation of the use of the TMT in VMA. A) Lena before conversion using the VMA accelerator. B) Lena after conversion using the VMA accelerator.

C.4 Functional Verification of the Use of the TMT in VMA

In order to verify the working of the VMA architecture in conjunction with the Tag Manager Table, a simple image-manipulation test application is created. This application reads in an image from a file, allocates source and destination buffers, and populates the source buffer with the pixels from the image. It should be noted that these buffers are created using lazy memory allocation. Since the image is read into the source buffer, demand paging and the conventional O/S page fault handler take care of allocating physical memory for the source buffer. On the other hand, since the destination buffer is not accessed by the user application before offload, there is no physical memory allocated for this buffer. This application offloads the pixels of the image, along with a transformation matrix for converting the image to grayscale, to the accelerator, by writing the source and destination buffers to the registers of FN1 using the PUTTXN instruction. After this, it spins in a loop polling for the completion of the offload using the GETTXN instruction. It should be noted that the data granularity of the offload is fixed at 4 KB, resulting in the application offloading the pixels on a page-by-page basis till all the pixels are converted to grayscale.
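The per-page pixel transformation that FN1 applies can be sketched as below. The dissertation does not specify the exact transformation matrix; the grayscale luminance weights used here are a common choice and are assumed only for illustration, as is the flat list-of-pixels page representation.

```python
# Sketch of the FN1 offload: every RGB pixel in a 4 KB-granularity chunk
# is multiplied by a transformation matrix. GRAY maps RGB to (Y, Y, Y)
# using common luminance weights (an assumption; the actual matrix used
# in the prototype is not specified in the text).
GRAY = [[0.299, 0.587, 0.114]] * 3

def transform_pixel(pixel, matrix=GRAY):
    # matrix-vector product, rounded back to integer channel values
    return tuple(round(sum(m * c for m, c in zip(row, pixel))) for row in matrix)

def transform_page(pixels, matrix=GRAY):
    """Processes one offloaded chunk of pixels, as FN1 does per offload."""
    return [transform_pixel(p, matrix) for p in pixels]

page = [(255, 0, 0), (0, 255, 0), (10, 10, 10)]
gray_page = transform_page(page)
```

Offloading a 512×512 image in 4 KB chunks would amount to repeated calls of this per-page transformation, one per PUTTXN.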


The image chosen for this simulation was a 512×512 sized version of the standard image Lena [126]. Converting this image to a 32 bits-per-pixel representation resulted in a source buffer size of 1 MB. Since no compression was used to store the grayscale output, the destination buffer was also 1 MB. Dictated by the 4 KB size of the offload data granularity, the source buffer was offloaded in 4 KB chunks, resulting in 256 offloads to the hardware accelerator. Since the destination buffer was created using lazy memory allocation, the very first PCI write to the destination buffer on each of these 256 offloads caused an IPMMU page fault. Each of these faults raised interrupts which were caught and handled by the IPMMU page fault handler. It was also observed that a maximum of three retries with a 10 µs retry period was sufficient to ensure that the IPMMU page fault was serviced and the PCI write transaction successfully completed. Moreover, for this simulation, a 99.90% hit rate in the IPMMU TLB was observed. The original and converted images are shown in Figure C-3. This validates the working of the VMA architecture with the TMT.

C.5 Summary

While the majority of this dissertation investigates the use of the Tag Manager Table for improving the performance of virtualized workloads, the TMT is a generic tagging framework that uses process-specific tags and can be used in non-virtualized scenarios as well. This appendix proposes the use of the TMT for tagging I/O TLBs in non-virtualized platforms. Specifically, the incorporation of the TMT as a tagging framework in the Virtual Memory Accelerator (VMA), an architecture involving I/O accelerators operating in the virtual address domain with an IPMMU and an I/O TLB for providing the virtual-to-physical translations, is examined. Using a simulation-based prototype of VMA, the proposed use of the TMT is functionally validated.


REFERENCES

[1] R. Miller. (2010, April) Facebook Now Has 30,000 Servers. [Online]. Available: http://www.datacenterknowledge.com/archives/2009/10/13/facebook-now-has-30000-servers/
[2] Avanade. (2010, April) Global Survey of Cloud Computing. [Online]. Available: http://www.avanade.com/Documents/Research%20and%20Insights/fy10cloudcomputingexecutivesummaryfinal314006.pdf
[3] K. Olukotun et al., "The case for a single-chip multiprocessor," SIGPLAN Notices, vol. 31, no. 9, pp. 2, 1996.
[4] Intel Corporation. (2010, April) First the Tick, Now the Tock: Next Generation Intel Microarchitecture (Nehalem). [Online]. Available: http://www.intel.com/technology/architecture-silicon/next-gen/whitepaper.pdf
[5] M. R. Marty and M. D. Hill, "Virtual hierarchies to support server consolidation," SIGARCH Computer Architecture News, vol. 35, no. 2, pp. 46, 2007.
[6] M. F. Mergen et al., "Virtualization for high-performance computing," SIGOPS Operating Systems Review, vol. 40, pp. 8, 2006.
[7] L. Youseff et al., "Paravirtualization effect on single- and multi-threaded memory-intensive linear algebra software," Cluster Computing, vol. 12, pp. 101, 2009.
[8] Gartner. (2010, April) Gartner Says Worldwide Hosted Virtual Desktop Market to Surpass 65 Billion in 2013. [Online]. Available: http://www.gartner.com/it/page.jsp?id=920814
[9] ——. (2010, April) Gartner Says 20 Percent of Commercial E-Mail Market Will Be Using a SaaS Platform By the End of 2012. [Online]. Available: http://www.gartner.com/it/page.jsp?id=931215
[10] J. Lange et al., "Palacios and Kitten: New High Performance Operating Systems for Scalable Virtualized and Native Supercomputing," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 2010, pp. 1.
[11] J. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann Publishers Inc., 2005.
[12] R. Goldberg, "Survey of Virtual Machine Research," Computer, vol. 7, no. 6, pp. 34, 1974.
[13] G. Amdahl, G. Blaauw, and F. Brooks, "Architecture of IBM System/360," IBM Journal of Research and Development, vol. 8, no. 2, pp. 87, 1964.


[14] U. Drepper, "The Cost of Virtualization," ACM Queue, vol. 6, no. 1, pp. 28, 2008.
[15] Gartner. (2010, April) Market Share: x86 Virtualization Market, Worldwide, 2008. [Online]. Available: http://www.gartner.com/it/page.jsp?id=1211813
[16] I. Kadayif et al., "Optimizing instruction TLB energy using software and hardware techniques," ACM Transactions on Design Automation of Electronic Systems, vol. 10, no. 2, pp. 229, 2005.
[17] C. McCurdy, A. L. Cox, and J. Vetter, "Investigating the TLB Behavior of High-end Scientific Applications on Commodity Microprocessors," in Proc. IEEE International Symposium on Performance Analysis of Systems and Software, 2008, pp. 95.
[18] O. Tickoo et al., "qTLB: Looking inside the Look-aside buffer," in Proc. The 14th International Conference on High Performance Computing, 2007, pp. 107.
[19] A. Bhattacharjee, D. Lustig, and M. Martonosi, "Shared last-level TLBs for chip multiprocessors," in Proc. The 17th International Symposium on High Performance Computer Architecture, 2011, pp. 359.
[20] D. Chisnall, The Definitive Guide to the Xen Hypervisor (Prentice Hall Open Source Software Development Series). Prentice Hall PTR, 2007.
[21] VMware Inc. (2010, April) VMware Virtual Desktop Infrastructure (VDI) datasheet. [Online]. Available: http://www.vmware.com/files/pdf/vdi_datasheet.pdf
[22] I. Krsul et al., "VMPlants: Providing and Managing Virtual Machine Execution Environments for Grid Computing," in Proc. The 2004 ACM/IEEE Conference on Supercomputing, 2004, p. 7.
[23] A. Weiss, "Computing in the clouds," netWorker, vol. 11, no. 4, pp. 16, 2007.
[24] R. Figueiredo, P. Dinda, and J. Fortes, "Guest Editors' Introduction: Resource Virtualization Renaissance," Computer, vol. 38, no. 5, pp. 28, 2005.
[25] R. J. O. Figueiredo et al., "Archer: A Community Distributed Computing Infrastructure for Computer Architecture Research and Education," Collaborative Computing: Networking, Applications and Worksharing, vol. 10, no. 2, pp. 181, 2009.
[26] SPARC International, Inc., The SPARC Architecture Manual Version 9. PTR Prentice Hall, 1993.
[27] Compaq Computer Corporation, ALPHA Architecture Reference Manual. Compaq Computer Corporation, 2002.
[28] Intel Corporation, Intel 64 and IA-32 Architectures Software Developer's Manuals. Intel Corporation, 2010.


[29] B. Jacob and T. Mudge, "Virtual memory in contemporary microprocessors," IEEE Micro, vol. 18, no. 4, pp. 60, 1998.
[30] ——, "A look at several memory management units, TLB-refill mechanisms, and page table organizations," SIGOPS Operating Systems Review, vol. 32, no. 5, pp. 295, 1998.
[31] B. Jacob. (2010, April) Virtual Memory Systems and TLB Structures. [Online]. Available: http://www.ece.umd.edu/~blj/papers/CEH-chapter.pdf
[32] C. A. Waldspurger, "Memory resource management in VMware ESX server," SIGOPS Operating Systems Review, vol. 36, no. SI, pp. 181, 2002.
[33] R. A. MacKinnon, "The changing virtual machine environment: Interfaces to real hardware, virtual hardware, and other virtual machines," IBM Systems Journal, vol. 18, no. 1, pp. 18, 1979.
[34] L. H. Seawright and R. A. MacKinnon, "VM/370: a study of multiplicity and usefulness," IBM Systems Journal, vol. 18, no. 1, pp. 4, 1979.
[35] P. Barham et al., "Xen and the art of virtualization," in Proc. The Nineteenth ACM Symposium on Operating Systems Principles, 2003, pp. 164.
[36] Advanced Micro Devices. (2010, April) AMD-V Nested Paging. [Online]. Available: http://developer.amd.com/assets/NPT-WP-1%201-final-TM.pdf
[37] G. Neiger et al., "Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization," Intel Technology Journal, vol. 10, no. 3, pp. 167, 2006.
[38] N. Jerger, D. Vantrease, and M. Lipasti, "An Evaluation of Server Consolidation Workloads for Multi-Core Designs," in Proc. 10th International Symposium on Workload Characterization, 2007, pp. 47.
[39] L. Cherkasova, D. Gupta, and A. Vahdat, "Comparison of the three CPU schedulers in Xen," SIGMETRICS Performance Evaluation Review, vol. 35, no. 2, pp. 42, 2007.
[40] D. Gupta et al., "Enforcing performance isolation across virtual machines in Xen," in Proc. The ACM/IFIP/USENIX 2006 International Conference on Middleware, 2006, pp. 342.
[41] J. R. Santos et al., "Bridging the gap between software and hardware techniques for I/O virtualization," in Proc. USENIX 2008 Annual Technical Conference, 2008, pp. 29.
[42] W. Huang et al., "A case for high performance computing with virtual machines," in Proc. The 20th Annual International Conference on Supercomputing, 2006, pp. 125.


[43] L. Cherkasova and R. Gardner, "Measuring CPU overhead for I/O processing in the Xen virtual machine monitor," in Proc. USENIX Annual Technical Conference, 2005, pp. 24.
[44] A. Menon et al., "Diagnosing performance overheads in the Xen virtual machine environment," in Proc. The 1st ACM/USENIX International Conference on Virtual Execution Environments, 2005, pp. 13.
[45] S. Thibault and T. Deegan, "Improving performance by embedding HPC applications in lightweight Xen domains," in Proc. The 2nd Workshop on System-level Virtualization for High Performance Computing, ser. HPCVirt '08, 2008, pp. 9.
[46] R. Uhlig et al., "Intel Virtualization Technology," Computer, vol. 38, no. 5, pp. 48, 2005.
[47] D. Abramson et al., "Intel Virtualization Technology for Directed I/O," Intel Technology Journal, vol. 10, no. 03, pp. 179, 2006.
[48] Advanced Micro Devices, AMD Secure Virtual Machine Architecture Reference Manual. Advanced Micro Devices, 2010.
[49] G. B. Kandiraju and A. Sivasubramaniam, "Going the distance for TLB prefetching: an application-driven study," in Proc. The 29th Annual International Symposium on Computer Architecture, 2002, pp. 195.
[50] A. Bhattacharjee and M. Martonosi, "Inter-Core cooperative TLB prefetchers for chip multiprocessors," in Proc. The 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010, pp. 359.
[51] ——, "Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors," in Proc. International Conference on Parallel Architectures and Compilation Techniques, 2009, pp. 29.
[52] V. Chadha et al., "I/O processing in a virtualized platform: a simulation-driven approach," in Proc. The 3rd International Conference on Virtual Execution Environments, 2007, pp. 116.
[53] V. Chadha, "Provisioning wide-area virtual environments through I/O interposition: The redirect-on-write file system and characterization of I/O overheads in a virtualized platform," Ph.D. dissertation, University of Florida, 2008.
[54] R. Uhlig et al., "SoftSDV: A Presilicon Software Development Environment for the IA-64 Architecture," Intel Technology Journal, vol. 3, no. 4, pp. 1, 1999.
[55] M. Ekman, P. Stenstrom, and F. Dahlgren, "TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors," in Proc. The 2002 International Symposium on Low Power Electronics and Design, 2002, pp. 243.


[56] S. Manne et al., "Low Power TLB Design for High Performance Microprocessors," University of Colorado at Boulder, CO, Tech. Rep. CU-CS-834-97, 1997.
[57] J.-H. Lee et al., "A banked-promotion translation lookaside buffer system," Journal of Systems Architecture, vol. 47, no. 14-15, pp. 1065, 2002.
[58] A. Ballesil, L. Alarilla, and L. Alarcon, "A Study of Power Trade-offs in Translation Lookaside Buffer Structures," in Proc. 2006 IEEE Region 10 Conference, 2006, pp. 1.
[59] L. T. Clark, B. Choi, and M. Wilkerson, "Reducing translation lookaside buffer active power," in Proc. The 2003 International Symposium on Low Power Electronics and Design, 2003, pp. 10.
[60] R. Jeyapaul, S. Marathe, and A. Shrivastava, "Code Transformations for TLB Power Reduction," in Proc. The 22nd International Conference on VLSI Design, 2009, pp. 413.
[61] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: an infrastructure for computer system modeling," Computer, vol. 35, no. 2, pp. 59, 2002.
[62] R. Bhargava et al., "Accelerating two-dimensional page walks for virtualized systems," in Proc. The 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008, pp. 26.
[63] G. Loh, S. Subramaniam, and Y. Xie, "Zesto: A cycle-level simulator for highly detailed microarchitecture exploration," in Proc. IEEE International Symposium on Performance Analysis of Systems and Software, 2009, pp. 53.
[64] M. Yourst, "PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator," in Proc. IEEE International Symposium on Performance Analysis of Systems and Software, 2007, pp. 23.
[65] M. Rosenblum et al., "Using the SimOS machine simulator to study complex computer systems," ACM Transactions on Modeling and Computer Simulation, vol. 7, no. 1, pp. 78, 1997.
[66] N. L. Binkert et al., "The M5 Simulator: Modeling Networked Systems," IEEE Micro, vol. 26, no. 4, pp. 52, 2006.
[67] P. S. Magnusson et al., "Simics: A full system simulation platform," Computer, vol. 35, no. 2, pp. 50, 2002.
[68] M. M. K. Martin et al., "Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset," SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 92, 2005.
[69] N. Neelakantam. (2010, April) FeS2: A Full-system Execution-driven Simulator for x86. [Online]. Available: http://fes2.cs.uiuc.edu/


[70] E. Argollo et al., "COTSon: infrastructure for full system simulation," SIGOPS Operating Systems Review, vol. 43, no. 1, pp. 52, 2009.
[71] Advanced Micro Devices Inc., SimNow Simulator User's Manual. Advanced Micro Devices Inc., 2009.
[72] L. Baugh, N. Neelakantam, and C. Zilles, "Using Hardware Memory Protection to Build a High-Performance, Strongly-Atomic Hybrid Transactional Memory," SIGARCH Computer Architecture News, vol. 36, no. 3, pp. 115, 2008.
[73] Virtutech Inc., Simics Reference Manual. Virtutech Inc., 2007.
[74] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Antfarm: tracking processes in a virtual machine environment," in Proc. USENIX '06 Annual Technical Conference, 2006, pp. 1.
[75] CPU RightMark. (2010, April) RightMark Memory Analyzer. [Online]. Available: http://cpu.rightmark.org/products/rmma.shtml
[76] V. Makhija et al., "VMmark: A Scalable Benchmark for Virtualized Systems," VMware Inc., CA, Tech. Rep. VMware-TR-2006-002, September 2006.
[77] D. R. Llanos, "TPCC-UVa: an open-source TPC-C implementation for global performance measurement of computer systems," SIGMOD Record, vol. 35, no. 4, pp. 6, 2006.
[78] A. Tridge. (2010, April) dbench benchmark. [Online]. Available: http://samba.org/ftp/tridge/dbench/
[79] M. Karlsson et al., "Memory System Behavior of Java-Based Middleware," in Proc. The 9th International Symposium on High-Performance Computer Architecture, 2003, pp. 217.
[80] Y. Shuf et al., "Characterizing the memory behavior of Java workloads: a structured view and opportunities for optimizations," in Proc. The 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2001, pp. 194.
[81] A. Adamson, D. Dagastine, and S. Sarne, "SPECjbb2005 - A Year in the Life of a Benchmark," in Proc. The 2007 SPEC Benchmark Workshop, 2007.
[82] Standard Performance Evaluation Corporation. (2010, April) 255.vortex SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CINT2000/255.vortex/docs/255.vortex.html
[83] A. Georges, L. Eeckhout, and K. D. Bosschere, "Comparing Low-Level Behavior of SPEC CPU and Java Workloads," Advances in Computer Systems Architecture, vol. 3740, pp. 669, 2005.


[84] S. Dague, D. Stekloff, and R. Sailer. (2010, April) xm(1) - Linux man page. [Online]. Available: http://linux.die.net/man/1/xm
[85] N. Andersson. (2010, April) The Maui Scheduler. [Online]. Available: http://www.nsc.liu.se/systems/retiredsystems/grendel/maui.html
[86] G. Staples, "Torque resource manager," in Proc. The 2006 ACM/IEEE Conference on Supercomputing, 2006.
[87] J. Warner. (2010, April) top(1) - Linux man page. [Online]. Available: http://linux.die.net/man/1/top
[88] A. Cahalan. (2010, April) pmap(1) - Linux man page. [Online]. Available: http://linux.die.net/man/1/pmap
[89] Silicon Graphics, Inc., MIPS R4000 Microprocessor User's Manual. PTR Prentice Hall, 1993.
[90] X. Zhang et al., "A hash-TLB approach for MMU virtualization in Xen/IA64," in Proc. IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1.
[91] Motorola Inc., PowerPC 601 RISC Microprocessor User's Manual. Motorola Inc., 2002.
[92] J. Liedtke, "Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces," German National Research Center for Information Technology, Tech. Rep. 993, 1995.
[93] V. Uhlig et al., "Performance of address-space multiplexing on the Pentium," University of Karlsruhe, Tech. Rep. 2002-1, 2002.
[94] S. Biemueller. (2010, April) ASID Management in Xen AMD-V. [Online]. Available: xen.xensource.com/xensummit/xensummit_spring_2007.html
[95] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., 2002.
[96] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, 1991.
[97] R. Min et al., "Partial tag comparison: a new technology for power-efficient set-associative cache designs," in Proc. 17th International Conference on VLSI Design, 2004, pp. 183-188.
[98] A. Jaleel, M. Mattina, and B. Jacob, "Last level cache (LLC) performance of data mining workloads on a CMP - a case study of parallel bioinformatics workloads," in Proc. The Twelfth International Symposium on High-Performance Computer Architecture, 2006, pp. 88-98.


[99] L. Zhao et al., "Towards hybrid last level caches for chip-multiprocessors," SIGARCH Computer Architecture News, vol. 36, pp. 56, 2008.
[100] K. B. Ferreira, P. Bridges, and R. Brightwell, "Characterizing application sensitivity to OS interference using kernel-level noise injection," in Proc. The 2008 ACM/IEEE Conference on Supercomputing, 2008, pp. 19:1-19:12.
[101] R. Gioiosa, S. A. McKee, and M. Valero, "Designing OS for HPC Applications: Scheduling," in Proc. IEEE International Conference on Cluster Computing, 2010, pp. 78.
[102] R. Iyer et al., "Datacenter-on-chip architectures: Tera-scale opportunities and challenges in Intel's manufacturing environment," Intel Technology Journal, vol. 11, no. 3, pp. 227, 2007.
[103] S. Kim, D. Chandra, and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," in Proc. The 13th International Conference on Parallel Architectures and Compilation Techniques, 2004, pp. 111.
[104] R. Iyer et al., "QoS policies and architecture for cache/memory in CMP platforms," SIGMETRICS Performance Evaluation Review, vol. 35, no. 1, pp. 25, 2007.
[105] L. R. Hsu et al., "Communist, utilitarian, and capitalist cache policies on CMPs: caches as a shared resource," in Proc. The 15th International Conference on Parallel Architectures and Compilation Techniques, 2006, pp. 13.
[106] J. Chang and G. S. Sohi, "Cooperative cache partitioning for chip multiprocessors," in Proc. The 21st Annual International Conference on Supercomputing, 2007, pp. 242.
[107] M. K. Qureshi and Y. N. Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," in Proc. The 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 423.
[108] S. Srikantaiah, M. Kandemir, and M. J. Irwin, "Adaptive set pinning: managing shared caches in chip multiprocessors," in Proc. The 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008, pp. 135.
[109] N. Rafique, W.-T. Lim, and M. Thottethodi, "Architectural support for operating system-driven CMP cache management," in Proc. The 15th International Conference on Parallel Architectures and Compilation Techniques, 2006, pp. 2.
[110] B. M. Beckmann, M. R. Marty, and D. A. Wood, "ASR: Adaptive Selective Replication for CMP Caches," in Proc. The 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006, pp. 443.


[111] J. Lee, C. Park, and S. Ha, "Memory access pattern analysis and stream cache design for multimedia applications," in Proc. The 2003 Asia and South Pacific Design Automation Conference, ser. ASP-DAC '03, 2003, pp. 22.
[112] Standard Performance Evaluation Corporation. (2010, April) 301.apsi SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CFP2000/301.apsi/docs/301.apsi.html
[113] ——. (2010, April) 179.art SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CFP2000/179.art/docs/179.art.html
[114] ——. (2010, April) 189.lucas SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CFP2000/189.lucas/docs/189.lucas.html
[115] ——. (2010, April) 171.swim SPEC CPU2000 Benchmark Description File. [Online]. Available: http://www.spec.org/cpu2000/CFP2000/171.swim/docs/171.swim.html
[116] SAS. (2010, April) SAS: Statistical Analysis Software. [Online]. Available: http://www.sas.com/
[117] (2010, April) The University of Florida High-Performance Computing Center. [Online]. Available: http://www.hpc.ufl.edu/index.php?body=about
[118] D. Eadline, "Low cost/power HPC," Linux Magazine, 2010.
[119] SeaMicro. (2011, January) SeaMicro to Demonstrate SM10000. [Online]. Available: http://www.seamicro.com/
[120] P. Stillwell et al., "HiPPAI: High Performance Portable Accelerator Interface for SoCs," in Proc. International Conference on High Performance Computing 2009, 2009, pp. 109.
[121] F. E. Powers, Jr. and G. Alaghband, "Introducing the Hydra parallel programming system," in Proc. The Eighteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, ser. SPAA '06, 2006, pp. 116.
[122] H. Wong et al., "Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor," in Proc. The 17th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '08, 2008, pp. 52.
[123] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Lecture Notes in Computer Science, vol. 3951, pp. 404, 2006.
[124] R. Budruk, D. Anderson, and T. Shanley, PCI Express System Architecture. Addison-Wesley Professional, 2003.


[125] Intel Corporation. (2011, January) Intel Virtualization Technology for Directed I/O. [Online]. Available: ftp://download.intel.com/technology/computing/vptech/Intel%28r%29_VT_for_Direct_IO.pdf
[126] M. Wakin. (2011, January) Standard Test Images. [Online]. Available: http://www.ece.rice.edu/~wakin/images/


BIOGRAPHICAL SKETCH

Girish Venkatasubramanian was born in Coimbatore, India in 1981. He attended GRG Matriculation and Higher Secondary School, India and graduated with the Best Outgoing Student award in 1999. He obtained his Bachelor of Engineering degree (First Class with Distinction) in Electrical and Electronics Engineering from PSG College of Technology, India. During this time he received the Dean's Letter of Commendation for Academic Performance twice. Girish was accepted to the Department of Electrical and Computer Engineering at University of Florida in 2003, from where he graduated with a Master of Science degree in 2005 (4.0 GPA) and a Doctor of Philosophy degree in 2011 (4.0 GPA). During his Ph.D., he received the University of Florida International Center's Certificate of Achievement for Outstanding Academic Performance and was selected as an Outstanding International Student. At the University of Florida, Girish joined the Advanced Computing and Information Systems (ACIS) Lab and conducted research in areas including computer architecture, operating systems, virtualization, and full-system modeling and simulation. To complement his academic skills, he also completed internships with Intel Corporation and VMware. After graduation, Girish plans to take up a full-time position at Intel and work in areas related to virtualization.