Citation
Provisioning Wide-Area Virtual Environments through I/O Interposition

Material Information

Title:
Provisioning Wide-Area Virtual Environments through I/O Interposition: The Redirect-on-Write File System and Characterization of I/O Overheads in a Virtualized Platform
Creator:
Chadha, Vineet
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
english
Physical Description:
1 online resource (145 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Figueiredo, Renato J.
Committee Members:
Fortes, Jose A.
George, Alan D.
Wilson, Joseph N.
Boykin, P. Oscar
Graduation Date:
12/19/2008

Subjects

Subjects / Keywords:
Household appliances ( jstor )
Image servers ( jstor )
Income protection insurance ( jstor )
Input output ( jstor )
Munchausen syndrome by proxy ( jstor )
Preliminary proxy material ( jstor )
Proxy reporting ( jstor )
Proxy statements ( jstor )
Simulations ( jstor )
Workloads ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
filesystem, simulation, virtualization, xen
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Computer Engineering thesis, Ph.D.

Notes

Abstract:
This dissertation presents mechanisms to provision and characterize I/O workloads for applications found in virtual data centers. It addresses two specific modes of workload execution in a virtual data center: (1) workload execution on heterogeneous compute resources across wide-area environments, and (2) workload execution and characterization within a virtualized platform. A key challenge arising in wide-area, grid computing infrastructures is that of data management: how to provide data to applications, seamlessly, in environments spanning multiple domains. In these environments, it is often the case that data movement and sharing is mediated by middleware that schedules applications. This thesis presents a novel approach, the Redirect-on-Write file system (ROW-FS), that enables wide-area applications to leverage on-demand block-based data transfers and a de-facto distributed file system (NFS) to access data stored remotely and modify it in the local area. The ROW-FS approach enables multiple clients to operate on private, virtual versions of data mounted from a single shared dataset served as a network file system (NFS), and it enables multiple VM instances to efficiently share a common set of virtual machine image files. The proposed approach offers savings in storage and bandwidth requirements compared to the conventional approaches of provisioning VMs by copying the entire VM image to the client or by cloning the image on the server side. The thin-client approach described in this dissertation uses ROW-FS to enable the use of unmodified NFS clients/servers and local buffering of file system modifications during an application's lifetime. An important application of ROW-FS is in enabling the instantiation of multiple non-persistent virtual machines across wide-area resources from read-only images stored in an image server (or distributed along multiple replicas).
A common deployment scenario of ROW-FS is when the virtual machine hosting its private, redirected "shadow" file system server and the client virtual machine are consolidated into a single physical machine. While a virtual machine provides levels of execution isolation and service partitioning that are desirable in environments such as data centers, its associated overheads can be a major impediment to wide deployment of virtualized environments. While the virtualization cost depends heavily on workloads, the overhead is much higher with I/O-intensive workloads compared to those which are compute-intensive. Unfortunately, the architectural reasons behind the I/O performance overheads are not well understood. Early research in characterizing these penalties has shown that cache misses and TLB-related overheads contribute to most of the I/O virtualization cost. While most of these evaluations are done using measurements, this thesis presents an execution-driven, simulation-based analysis methodology with symbol annotation as a means of evaluating the performance of virtualized workloads, and presents a simulation-based characterization of the performance of a representative network-intensive benchmark (iperf) in the Xen virtual machine environment. The main contributions of this dissertation work are: 1) the novel design and implementation of the ROW-FS file system; 2) experimental evaluation of ROW-FS for the O/S image framework that enables virtual machine images to be published, discovered and transferred on-demand through a combination of ROW-FS and peer-to-peer techniques; 3) a novel implementation of an execution-driven simulation framework to evaluate network I/O performance using symbol annotation for environments that encompass both a virtual machine hypervisor and guest operating system domains; and 4) evaluation, through simulation, of the potential benefits of different micro-architectural TLB improvements on performance. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2008.
Local:
Adviser: Figueiredo, Renato J.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-12-31
Statement of Responsibility:
by Vineet Chadha.

Record Information

Source Institution:
UFRGP
Rights Management:
Copyright Chadha, Vineet. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
12/31/2010
Resource Identifier:
664684947 ( OCLC )
Classification:
LD1780 2008 ( lcc )

Downloads

This item has the following downloads:


Full Text

I would like to thank my advisor Dr. Renato Figueiredo for all the support he has provided me during the last six years. He has taken me around the maze of systems research and shown me the right way whenever I felt lost. It has been a privilege to work with Dr. Figueiredo, whose calmness and humble, polite demeanor are qualities I would like to carry and apply further in my career. Thanks to Dr. Jose Fortes, who provided me the opportunity to work at the Advanced Computing and Information Systems (ACIS) laboratory. He gave me encouragement and support whenever things were down. I would like to thank Dr. Oscar Boykin for serving on my committee and for all those fruitful discussions on research, Linux, healthy food and running. His passion to achieve perfection in every endeavor of life often eggs me on to do better. Thanks to Dr. Alan George and Dr. Joseph Wilson for serving on my PhD program committee and motivating me through their courses and research work. I would like to thank my mentor, Ramesh Illikkal, and manager, Donald Newell, at Intel Corporation for the faith they have shown in me and for egging me on to work hard. It has been a privilege to work with Ramesh, who taught me the importance of teamwork, failure and success. Thanks to Dr. Padmashree Apparao and Dr. Ravishankar Iyer for guidance and encouragement to achieve my goals. Thanks is also due to Dr. Ivan Krsul and Dr. Suma Adabala for guiding me not only during the development and research of the In-VIGO project but also for often sharing thoughts on a PhD program and its expectations. Thanks is due to all the colleagues here at ACIS who made the work environment fun to work in. I would like to thank Andrea and Mauricio for providing excellent research facilities and resources. Thanks to my office mates Arijit and Girish for all the fruitful discussions. Thanks is due to Cathy for maintaining a cordial environment in the ACIS lab and extending support to me as a good friend whenever the need arises.


page
ACKNOWLEDGMENTS 4
LIST OF TABLES 9
LIST OF FIGURES 10
ABSTRACT 13
CHAPTER
1 INTRODUCTION 16
1.1 Introduction 16
1.1.1 Virtual Network File System I/O Redirection 16
1.1.2 Characterization of I/O Overheads in a Virtualized Environments 18
1.1.2.1 Simulation 19
1.1.2.2 I/O Mechanisms 19
1.2 Dissertation Contributions 20
1.3 Dissertation Relevance 22
1.4 Dissertation Overview 23
1.5 Dissertation Organization 26
2 I/O VIRTUALIZATION: RELATED TERMS AND TECHNOLOGIES 28
2.1 Introduction 28
2.2 Virtualization Technologies 29
2.3 Virtual Machine Architectures 32
2.3.1 I/O Mechanisms in Virtual Machines 32
2.3.2 Virtual Machines and CMP Architectures 34
2.4 Grid Computing 35
2.5 File System Virtualization 35
2.5.1 Network File System 36
2.5.2 Grid Virtual File System 38
3 REDIRECT-ON-WRITE DISTRIBUTED FILE SYSTEM 41
3.1 Introduction 41
3.1.1 File System Abstraction 41
3.1.2 Redirect-on-Write File System 42
3.2 Motivation and Background 42
3.2.1 Use-Case Scenario: File System Sessions for Grid Computing 44
3.2.2 Use-Case Scenario: NFS-Mounted Virtual Machine Images and O/S File Systems 44

3.3 ROW-FS Architecture 47
3.3.1 Hash Table 48
3.3.2 Bitmap 49
3.4 ROW-FS Implementation 51
3.4.1 MOUNT 51
3.4.2 LOOKUP 53
3.4.3 GETATTR/SETATTR 54
3.4.4 READ 54
3.4.5 WRITE 55
3.4.6 READDIR 55
3.4.7 REMOVE/RMDIR/RENAME 58
3.4.8 LINK/READLINK 58
3.4.9 SYMLINK 60
3.4.10 CREATE/MKDIR 60
3.4.11 STATFS 60
3.5 Experimental Results 60
3.5.1 Microbenchmark 61
3.5.2 Application Benchmark 63
3.5.3 Virtual Machine Instantiation 66
3.5.4 File System Comparison 67
3.6 Related Work 68
3.7 Conclusion 68
4 PROVISIONING OF VIRTUAL ENVIRONMENTS FOR WIDE AREA DESKTOP GRIDS 70
4.1 Introduction 70
4.2 Data Provisioning Architecture 71
4.3 ROW-FS Consistency and Replication Approach 78
4.3.1 ROW-FS Consistency in Image Provisioning 79
4.3.2 ROW-FS Replication in Image Provisioning 80
4.4 Security Implications 81
4.5 Experiments and Results 84
4.5.1 Proxy VM Resource Consumption 84
4.5.2 RPC Call Profile 86
4.5.3 Data Transfer Size 87
4.5.4 Wide-area Experiment 87
4.5.5 Distributed Hash Table State Evaluation and Analysis 88
4.6 Related Work 88
4.7 Conclusion 90

5.1 Introduction 92
5.2 Motivation and Background 93
5.2.1 Full System Simulator 94
5.2.2 I/O Virtualization in Xen 94
5.3 Analysis Methodology 96
5.3.1 Full System Simulation: Xen VMM as Workload 96
5.3.2 Instruction Trace 97
5.3.3 Symbol Annotation 98
5.3.4 Performance Statistics 98
5.3.5 Environmental Setup for Virtualized Workload 100
5.4 Experiments and Simulation Results 102
5.4.1 Life Cycle of an I/O Packet 102
5.4.1.1 Unprivileged Domain 103
5.4.1.2 Grant Table Mechanism 105
5.4.1.3 Timer Interrupts 105
5.4.1.4 Privileged Domain 106
5.4.2 Cache and TLB Characteristics 108
5.5 Cache and TLB Scaling 110
5.6 Related Work 114
5.7 Conclusion 115
6 HARDWARE SUPPORT FOR I/O WORKLOADS: AN ANALYSIS 116
6.1 Introduction 116
6.2 Translation Lookaside Buffer 117
6.2.1 Introduction 117
6.2.2 TLB Invalidation in Multiprocessors 118
6.3 Interprocessor Interrupts 120
6.4 Grant Table Mechanism: I/O Analysis 121
6.5 Experiments and Results 123
6.5.1 Grant Table Performance 123
6.5.2 Hypervisor Global Bit 124
6.5.3 TLB Coherence Evaluation 125
6.6 Related Work 129
6.7 Conclusion 131
7 CONCLUSION AND FUTURE WORK 132
7.1 Conclusion 132
7.2 Future Work 133
REFERENCES 136
BIOGRAPHICAL SKETCH 145

Table page
3-1 Summary of the NFSv2 protocol remote procedure calls 53
3-2 LAN and WAN experiments for micro-benchmarks 62
3-3 Andrew benchmark and AM-Utils execution times 63
3-4 Linux kernel compilation execution times on a LAN and WAN 65
3-5 Wide area experimental results for diskless Linux boot and second boot 66
3-6 Remote Xen boot/reboot experiment 67
4-1 Grid appliance boot and reboot times over wide area network 88
4-2 Mean and variance of DHT access time for five clients 88
6-1 Grant table overhead summary 124
6-2 TLB flush statistics with and with-out IPI flush optimization 128
6-3 Instruction TLB miss statistics with and with-out IPI flush optimization 129
6-4 Data miss TLB statistics with and with-out IPI flush optimization 129

Figure page
1-1 Protocol redirection through user-level proxies 17
1-2 Illustration of server consolidation 18
2-1 Landscape of virtualized computer systems 30
2-2 Systems partitioning characteristics 31
2-3 I/O virtualization path for single O/S and virtual machines 33
2-4 I/O partitioning in virtual machine and CMP architectures 34
2-5 Grid virtual file system 39
3-1 Indirection mechanism in the Linux virtual file system 41
3-2 Middleware data management for shared VM images 43
3-3 Check-pointing a VM container running an application with NFS-mounted file system 46
3-4 Redirect-on-write file system architecture 48
3-5 ROW-FS proxy deployment options 49
3-6 Hash table and flag descriptions 50
3-7 Remote procedure call processing in ROW-FS 50
3-8 A snapshot view of file system session through Redirect-on-Write proxy 52
3-9 Sequence of redirect-on-write file system calls 56
3-10 Number of RPC calls received by NFS server in non-virtualized environment, and by ROW-FS shadow and main servers during Andrew benchmark execution 64
4-1 O/S image management over wide area desktops 72
4-2 The deployment of the ROW proxy to support PXE-based boot of a (diskless) non-persistent VM over a wide area network 73
4-3 Algorithm to bootstrap a VM session 77
4-4 Algorithm to publish a virtual machine image 77
4-5 Replication approach for ROW-FS 81
4-6 Diskless client and publisher client security 83

4-8 RPC statistics for diskless boot 87
4-9 Cumulative distribution of DHT query through 10 IPOP clients (in seconds) 89
5-1 Full system simulation environment with Xen execution 95
5-2 Execution driven simulation and symbol annotated profiling methodology 97
5-3 Symbol annotation 99
5-4 Function-level performance statistics 99
5-5 SoftSDV CPU controller execution mode: performance or functional 101
5-6 Life of an I/O packet 103
5-7 Unprivileged domain call graph 104
5-8 TCP transmit and grant table invocation 105
5-9 Timer interrupts to initiate context switch 106
5-10 Life of a packet in privileged domain 107
5-11 Impact of TLB flush and context switch 109
5-12 Correlation between VM switching and TLB misses 109
5-13 TLB misses after a VM context switch 110
5-14 TLB misses after a grant destroy 111
5-15 Impact of VM switch on cache misses 111
5-16 L2 cache performance for transmit of I/O packets 112
5-17 Data and instruction TLB performance for transmit of I/O packets 112
5-18 L2 cache performance for receive of I/O packets 113
5-19 Data and instruction TLB performance for receive of I/O packets 114
6-1 The x86 page table for small pages 119
6-2 Interprocessor interrupt mechanism in x86 architecture 121
6-3 Simulation experimental setup 124
6-4 Impact of tagging TLB with a global bit 125
6-5 Page sharing in multicore environment 126

[1]. This dissertation investigates data provisioning and performance characterization of virtual I/O, in particular network file systems, in such virtualized environments. The goals of this dissertation are as follows: First, devise and evaluate techniques which can seamlessly provide data to applications in wide-area environments spanning multiple domains through distributed file system protocol I/O redirection. Second, because the performance of this data provisioning solution is limited inherently by overheads associated with network I/O in a virtualized environment, this dissertation evaluates network I/O virtualization overheads in such environments with a simulation-based methodology that enables quantitative analysis of the impact of micro-architecture features on performance of a contemporary split-I/O virtual machine hypervisor. Finally, this work explores hardware/software support to improve bottlenecks in I/O performance. The distributed file system redirection approach for data management and evaluation can benefit, in particular, applications where large, mostly-read datasets need to be provisioned in a virtual data center.

[2]. To facilitate data movement in such environments, I have developed a novel redirect-on-write distributed file system (ROW-FS) which allows for application-transparent buffering and request re-routing of all file system modifications locally through user-level proxies. These proxies forward file system accesses that modify any file system objects to a "shadow" distributed file system server, creating on-demand private copies of such objects in the shadow file server while routing accesses to unmodified data to a "main" server. A motivation for this

[3]. Figure 1-1 shows an example of protocol redirection of NFS through user-level proxies. Virtual machine environments are commonly used to improve CPU utilization in computer systems [1]. Such VM-based environments are being increasingly used in data centers for resource consolidation, high performance and grid computing as means to facilitate the deployment of user-customized execution environments [4][5][6][7]. Thus, the deployment of user-level proxies in VM-based execution environments is a common scenario to facilitate data movement and provisioning [4].

Figure 1-1. Protocol redirection: the client/server protocol is modified to forward RPC calls to a shadow distributed file system server.

For example, consider the scenario of multiple servers in data centers which are consolidated into a single server through the deployment of virtual machines, which represents a common contemporary use case of virtualization [8]. Figure 1-2 shows a possible deployment of ROW proxies which forward calls to the NFS and shadow servers. The shadow server VM and client VM are consolidated into a single physical machine. Such a scenario is common when the client VM is diskless or disk space on the client VM is a constraint. While deploying ROW proxies in such cases provides a much needed functionality, the overhead associated with virtualized network I/O is often considered a bottleneck [9][10]. Specifically, the performance of ROW-FS in such environments, as well as that of

[3], is dependent upon the performance of the underlying virtual I/O network layer.

Figure 1-2. Server consolidation: illustration of partitioning of a physical system with shared hardware resources. A specific case of ROW-FS deployment is shown. The shadow server is encapsulated into a separate VM. RPC calls are passed to remote/shadow servers through a ROW-FS proxy deployed in the privileged VM.

[11]. Also, system workloads are becoming more complex as applications are often compartmentalized and executed within virtual machines [1]. Thus, key responsibilities of conventional O/Ses such as scheduling are delegated to a new layer, the hypervisor or virtual machine monitor [12]. Hypervisors are widely accepted as an approach to address under-utilization of resources such as I/O and CPU in physical machines. To characterize the impact

1. Why has a simulation-based approach been chosen?
2. What are the options of network I/O mechanisms in virtual machine designs?

[13][14][15]. The use of simulation in this dissertation is motivated by the fact that current system evaluation methodologies for virtual machines are based on measurements of a deployed virtualized environment on a physical machine. Although such an approach gives good estimates of performance overheads for a given physical machine, it lacks flexibility in determining the resource scaling performance. In addition, it is difficult to replicate a measurement framework on different system architectures. It is important to move towards a full system simulation methodology because it is a flexible approach in studying different architectures.

[16]. The virtual machine I/O architectures which have been implemented in different hypervisors are: (1) the direct I/O model (Xen 1.0), where the hypervisor is responsible for running device drivers; (2) the split I/O model (Xen 2.0/3.0), where device drivers are exported to a privileged guest virtual machine; and (3) the pass-through I/O model, where direct access to hardware devices is provided from guest virtual machines. Chapter 2 further

[17]. Microkernels have not found wide acceptance, arguably because of overheads associated with inter-process communication between different entities (such as file system and device drivers). As processors are becoming faster and new CMP architectures are providing core-level parallelism, I/O approaches such as offloading selective virtual I/O functionality to a separate core are being explored [18]. I chose split I/O as a basis for the investigation in this dissertation because of the following reasons: [19]. With the advent of chip multiprocessor architectures, the microkernel approach is being re-visited; the Xen VMM version 3.0 [20] has features which are also found in microkernels [17]. Dedicated CPUs to improve split I/O performance can easily be addressed through CMP architectures.

[2] have dealt with this problem via application check-pointing and restart. A limitation of this approach lies in that it only supports a very restricted set of applications: they must be re-linked to Condor libraries and cannot use many system calls (e.g. fork, exec, mmap). The approach of redirecting RPC calls to a shadow server, in contrast, supports unmodified applications. It uses client-side virtualization mechanisms that allow for transparent buffering of all file system modifications produced by distributed file system clients on local stable storage. Locally buffered file system modifications can then be checkpointed together with the application state. By "local storage", I mean storage which is either local to the client machine or in the client's local area network.
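The redirect-on-write buffering described above can be sketched, in a highly simplified form, as follows. This is a minimal Python illustration assuming a block-level interface; `RowProxy`, the in-memory stores, and the block size are hypothetical stand-ins for illustration, not the actual ROW-FS proxy code.

```python
# Sketch of redirect-on-write routing: writes are buffered in a "shadow"
# store while reads of unmodified blocks fall through to the "main" store.
BLOCK = 4096

class RowProxy:
    """Routes block reads/writes between a read-only 'main' store and a
    writable 'shadow' store, using a per-file set of dirty block numbers."""

    def __init__(self, main, shadow):
        self.main = main          # read-only: path -> bytearray
        self.shadow = shadow      # writable:  path -> dict(block_no -> bytes)
        self.dirty = {}           # path -> set of block numbers written

    def write(self, path, block_no, data):
        # All modifications are buffered locally in the shadow store;
        # the main store is never touched.
        self.shadow.setdefault(path, {})[block_no] = data
        self.dirty.setdefault(path, set()).add(block_no)

    def read(self, path, block_no):
        # Modified blocks come from the shadow; unmodified from the main.
        if block_no in self.dirty.get(path, set()):
            return self.shadow[path][block_no]
        base = self.main[path]
        return bytes(base[block_no * BLOCK:(block_no + 1) * BLOCK])

main = {"/vm/disk.img": bytearray(b"A" * 2 * BLOCK)}
proxy = RowProxy(main, {})
proxy.write("/vm/disk.img", 1, b"B" * BLOCK)
assert proxy.read("/vm/disk.img", 0) == b"A" * BLOCK   # served by "main"
assert proxy.read("/vm/disk.img", 1) == b"B" * BLOCK   # served by "shadow"
assert main["/vm/disk.img"] == b"A" * 2 * BLOCK        # main is unmodified
```

Because all modifications live in the shadow store and the dirty-block map, checkpointing that state alongside the application suffices to capture the file system session.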

[21][22]. These O/S image frameworks either require kernel level support or rely on aggressive caching of full O/S images.

[23]. Previous research has shown that data access patterns for a DFS are often dynamic and ephemeral [23]. Also, current solutions are either based on complete file transfer, which may incur access latency overhead (as compared to block transfers) [24], or are limited to single domains [25]; this is not acceptable in a virtualized environment where there is often a need to access large O/S images over a wide area.

Solution: I present a novel approach that enables wide-area applications to leverage on-demand block-based data transfers and a de-facto distributed file system (NFS) to access data stored remotely and modify it in the local area: the Redirect-on-Write File System (ROW-FS). I show that the ROW-FS approach provides substantial improvements compared to the traditional NFS protocol for benchmark applications such as Linux kernel compilation and virtual machine instantiation.

Problem 2: Client-server consistency and replication mechanisms for on-demand data transfer. During the time a ROW-FS file system session is mounted, all modifications are redirected to the shadow server. It is important to consider consistency in distributed file systems because data can potentially be shared by multiple clients. For consistency, two different scenarios need to be considered. First, there are applications in which it is neither needed nor desirable for data in the shadow server to be reconciled with the main server; an example is the provisioning of system images for diskless clients or virtual machines. Second, for applications in which it is desirable to reconcile data with the server, the ROW-FS proxy holds state in its primary data structures that can be used to commit modifications back to the server.

Solution: I leverage APIs exported by lookup services (such as distributed hash tables) in distributed frameworks (e.g. IPOP [26]) to keep clients consistent with the latest updates

[27]. The architectural reasons behind the I/O performance overheads in virtualized environments are not well understood. Early research in characterizing these penalties has shown that cache misses and TLB related overheads contribute to most of the I/O virtualization cost [27].

Solution: I have applied an execution-driven methodology to study the network I/O performance of Xen (as a case study) in a full system simulation environment, using detailed cache and TLB models to profile and characterize software and hardware hotspots. By applying symbol annotation to the instruction flow reported by the execution-driven simulator, we derive function-level call flow information. This methodology provides detailed information at the architectural level and allows designers to evaluate potential hardware enhancements to reduce virtualization overhead.

Problem 4: Hardware support for network I/O virtualization
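The symbol-annotation step proposed in the solution to Problem 3 above can be illustrated with a short sketch: given a symbol table sorted by start address, each instruction address in the simulator's trace is mapped to the enclosing function, from which function-level call flow and per-function statistics can be aggregated. The addresses and symbol names below are invented for illustration; the actual SoftSDV-based tooling is considerably more involved.

```python
# Sketch of address-to-symbol annotation over an instruction trace.
import bisect

# Hypothetical symbol table: (start_address, function_name), sorted by address,
# as could be extracted from the Xen/guest kernel symbol files.
symbols = [
    (0xc0100000, "hypervisor_entry"),
    (0xc0100400, "do_grant_table_op"),
    (0xc0100900, "evtchn_send"),
]
starts = [addr for addr, _ in symbols]

def annotate(address):
    """Return the name of the function whose address range covers `address`."""
    i = bisect.bisect_right(starts, address) - 1
    return symbols[i][1] if i >= 0 else "<unknown>"

# Annotating a raw trace yields a function-level flow from which statistics
# (instruction counts, cache/TLB misses per function) can be aggregated.
trace = [0xc0100404, 0xc0100410, 0xc0100904]
flow = [annotate(a) for a in trace]
assert flow == ["do_grant_table_op", "do_grant_table_op", "evtchn_send"]
```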

[4], COD [28]), virtual machine image provisioning and fault tolerance in distributed computing. Further, Chapter 3 discusses the design and implementation of ROW-FS. I evaluate ROW-FS with micro- and application benchmarks to measure the overhead associated with individual RPC calls. Chapter 4 describes a novel approach to O/S image management and a provisioning architecture through diskless setup of VMs using ROW-FS. The primary goal is to

[16], and wide-area grid and peer-to-peer (P2P) computing systems [29-32]. As systems evolve in the direction of large-scale, networked environments and as applications evolve to demand processing of vast amounts of information, the input/output (I/O) subsystem, which provides a computer with access to mass storage and to networks, becomes increasingly important. Fundamentally, computer systems consist of three subsystems: processors, memory and I/O [1]. In the early days of desktop computing, I/O systems were primarily used as an extension of the memory system (as a hierarchy level backing up cache and RAM memories) and for persistent data storage. With the arrival of networks and the wide-area Internet, the I/O subsystem can be interpreted as a generic term referring to access of data, over the network or in storage. In the context of networked I/O, a particular subsystem which has been successfully used in provisioning data over networks is a

[33], AFS [24], Legion [34]). In the processor and memory subsystems, system architects increasingly apply ideas of virtualization, with approaches that build on techniques for logically partitioning physical systems developed in mainframes [12], which are now accessible for commodity systems based on the x86 architecture. While for several years the x86 architecture did not satisfy conditions that make a CPU amenable to virtualization [12], hardware vendors (Intel and AMD) have provided hardware support to simplify the design of virtual machine monitors (VMMs) [35][36]. Even though virtualization approaches can be used to address the problem of CPU under-utilization through workload consolidation, the gap between I/O mechanisms and CPU efficiency has widened. Figure 2-1 gives an overview of some of the different technologies being harnessed by state-of-the-art computing systems, and provides a landscape in which the approaches described in this dissertation are intended to be applied: wide-area systems (possibly organized in a peer-to-peer fashion), where nodes and networks are virtualized and where commodity CPUs contain multiple cores. The following subsections address these technologies in more detail.

[37]. The process of providing transparent access to heterogeneous resources through a layer of

Figure 2-1. Landscape of virtualized computer systems: resources and platforms are heterogeneous with respect to hardware and system software environments; computing nodes have multiple independent processing units (cores) and are virtualizable; grid computing middleware is used to harness the compute power of heterogeneous resources; a peer-to-peer organization of resources enables self-organizing ensembles of CPUs to support high-throughput computing applications and scientific experiments.

indirection is central to virtualization. For example, in Linux, the virtual file system provides transparent access to data across different file systems. Indirection can be achieved through interposition by an agent or proxy between two or more communicating entities. Proxies have been extensively used for user authentication, call forwarding and for secure gateways [38]. For example, virtual memory is a widely used mechanism for multiplexing physical RAM in traditional operating systems.
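The interposition pattern mentioned above can be made concrete with a minimal sketch: a proxy object placed between a caller and a backing service forwards each operation while observing it, which is the same pattern the user-level file system proxies in this dissertation rely on to intercept and reroute RPC calls. The `Service` class and its `lookup` method below are hypothetical examples, not an API from the dissertation.

```python
# Minimal illustration of interposition via a forwarding proxy.

class Service:
    """A stand-in backing service (e.g. a file server)."""
    def lookup(self, name):
        return f"handle:{name}"

class InterposingProxy:
    """Forwards operations to the wrapped object while recording which
    operations were invoked; a real proxy could instead redirect selected
    calls (e.g. mutations) to a different backing server."""

    def __init__(self, target):
        self._target = target
        self.log = []

    def __getattr__(self, op):
        def forward(*args, **kwargs):
            self.log.append(op)                       # observe the call
            return getattr(self._target, op)(*args, **kwargs)
        return forward

proxy = InterposingProxy(Service())
assert proxy.lookup("etc/passwd") == "handle:etc/passwd"
assert proxy.log == ["lookup"]
```

The caller is unaware of the extra layer, which is what makes interposition attractive for transparently virtualizing existing client/server protocols.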

Figure 2-2 provides a broader view of partitioning of a system which also includes vertical partitioning. Horizontal partitioning essentially adds an extra layer in the system stack which provides an abstraction for the application layer to access underlying resources, while vertical partitioning mechanisms can be used to divide the underlying resources, for instance to isolate sub-systems such that interference due to faults and performance cross-talk is minimized.

Figure 2-2. Systems partitioning characteristics: access to CPU in multi-core systems can be naturally partitioned across cores; however, there are hardware resources which are shared (e.g. L2 cache, memory, hard disk and NICs). Quality of service (QoS) provisioning can be used to partition shared resources across virtual containers.

In general, virtual machine architectures exhibit three common characteristics: multiplexing, polymorphism and manifolding [4]. The virtualization approach of multiplexing physical resources not only decouples the compute resources from hardware but provides the flexibility of allowing compute resources to migrate seamlessly. Today, many virtual machine monitors are available for research and development (e.g. VMware [37], Parallels [39], VirtualBox [40], KVM, lguest [41], Xen [20], UML [42], and Qemu [43]).

[1]. In order to be effectively virtualizable, it is desirable that the underlying processor instruction set architecture (ISA) follows the conditions set forth in [12]. For several years, the instruction set of what is currently the most popular microarchitecture [44] was not directly virtualizable because there is a set of sensitive instructions which do not cause the processor to trap when running in unprivileged mode [1]. System virtual machines have been designed to overcome this limitation in previous x86 generations, with two major approaches resulting in successful implementations. In the classical approach, a virtual machine such as VMware relies on efficient binary translation mechanisms to emulate non-virtualizable instructions without requiring modifications to the guest O/S. In the paravirtualized approach (e.g. Xen), modifications to the architecture-dependent code of the guest O/S are required, both to avoid the occurrence of non-virtualizable instructions and to enable improved system performance. Intel and AMD have provided hardware support to extend the x86 architecture for virtualized environments [1], making the implementation of classic VMMs a much easier task because binary translation is not required; the KVM virtual machine is an example of a recent VMM which builds upon such hardware extensions.

2-3(a). In contrast, a conventional I/O virtualization path traverses through a guest VM driver, virtual I/O device, physical device driver and physical device (e.g. a network interface card, NIC). The guest VM driver merely provides a mechanism to share access to a virtual I/O device (emulated device driver) through a shared memory mechanism. The emulated device is either

PAGE 33

hosted in the hypervisor or in a separate VM (Figure 2-3(b)).

Figure 2-3. I/O virtualization path: (a) In a single O/S, the application invokes the physical device driver by means of system calls. (b) In a VM environment, each VM's guest driver communicates with a virtual I/O device through a shared memory mechanism. The virtual I/O device can either reside in the hypervisor or in a separate VM, as shown by dotted lines.

The following are typical alternatives which have been considered for I/O handling in virtual machine environments [37][45].
[46][47]. An IOMMU allows direct mapping of hardware device addresses into guest physical memory and also protects VMs from spurious memory accesses.

Figure 2-4. I/O partitioning: Resource mapping in multi-core systems. VM1 is allocated two CPUs. VM3 is pinned to I/O devices: disk D1 and NICs N1, N2.

CMP architectures allow concurrent execution of software threads and modules. Thus, to leverage CMP architectures, it is important to partition the system software resources, operating systems and hardware resources so as to deterministically allocate CPU resources to the applications [48]. The performance gain from such a partitioning of resources could be diminished if the bottleneck is a hardware device such as a NIC. To address this, trends are either toward harnessing multiple devices such as network
interfaces [10] or toward developing new models of communication with hardware devices [48][49]. It is conceivable that virtualized CMP architectures of the future will run multiple VMs, with a mix of resources time-shared and/or dedicated to VM guests. For example, as shown in Figure 2-4, virtual machines VM1, VM2 and VM3 are hosted on a multi-core system. VM3 is pinned to disk D1 and NICs N1 and N2, and has been allocated a single CPU. VM1 has been allocated two CPUs and has privileged status to access hardware resources. The figure also illustrates the virtualization paths for the split I/O (dotted path) and direct I/O approaches. In split I/O, since the virtualization path is divided into separate VM containers, allocating dedicated resources (e.g. CPUs and NICs) to guest and privileged VMs can potentially improve I/O performance.

As defined in [29], Grid computing refers to the "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations". Grid computing typically takes place on heterogeneous resources distributed across wide-area networks, relying on support from middleware to provide services such as authentication, scheduling, and data transfers. Virtualization in the context of Grid computing is motivated by the ability to abstract the heterogeneity of resources and provide a consistent environment for the execution of workloads. The In-VIGO [4] middleware is a representative example of a system which extensively employs virtualization in this scenario. A similar approach has been taken by several projects in Grid or Utility computing, such as COD [28] and Virtual Workspaces [5]. A key challenge arising in wide-area Grid computing infrastructures is that of management of and access to data: not only I/O local to a node, but also how to provide data to applications, seamlessly, in environments spanning multiple domains [50]. In grid environments, data movement and
sharing approaches range from dedicated protocols [3][24][33] to enhancements of widely-used local-area protocols (NFSv2/v3) and to overlays of additional functionalities or modified consistency models over wide-area networks [51][52][53][54]. However, the wide-spread deployment of new protocols is hindered by the fact that operating system designers have mostly focused on local-area distributed file systems, which cover typical usage scenarios. As an example, open-source and proprietary implementations of the LAN-oriented versions of the NFS protocol (v2/v3) have been deployed (and hardened over time) in the majority of UNIX flavors and in Windows, while open-source implementations of the wide-area protocol (v4), under development since the late 1990s, have not yet been widely deployed. The following sections explain the NFS protocol architecture and a related approach that consists of a virtualization layer built on existing Network File System (NFS) components - the Grid virtual file system (GVFS).

The NFS protocol was designed with the following goals:

1. Machine and file system independence
2. Transparent access to remote files
3. Simple crash recovery mechanism
4. Low performance overhead in local-area networks
The original protocol was described in [3] and was later extended to address shortcomings such as small file sizes, a large number of Getattr calls, and performance overhead, leading to NFS version 3 (NFSv3). For example, in NFSv2, the invocation of a lookup call is always followed by the invocation of a getattr call. In NFSv3, the lookup call is optimized to return attributes in a single RPC operation. Similarly, NFSv3 introduces new procedure calls to provision buffering of writes at the client and later committing them to the server. Neither NFSv2 nor v3 scale well in cross-domain wide-area environments [3]. In addition, NFSv3 does not provide guaranteed consistency between clients. NFSv4 improves the consistency mechanism at the expense of a more complex server design, which is no longer stateless [3]. NFSv4 implements an open-close consistency mechanism. NFSv4 clients can cache data after a file is opened for access. If cached data is modified, NFS clients need to commit the data back during the file close operation. NFS clients can also re-validate cached data through the file timestamp for future accesses of the data. In addition, the NFSv4 server supports delegation and callback mechanisms to provide write permissions to the client. Thus, an NFSv4 client can allow other clients to access the data of the delegated file. NFS supports hierarchical organization of files and directories. Each directory or file in an NFS server is uniquely addressed by a persistent file handle [3]. A file handle is a reference to a file or directory that is independent of the file name. For example, NFSv2 has persistent file handles of size 32 bytes. A comprehensive survey of NFS file handle structure can be found in [55].
[3]. This approach of stateless servers simplifies failure recovery. For example, a failure requiring a server restart can be dealt with by simply requiring clients to re-establish the connection. In addition, the NFS protocol supports idempotent operations. In the event of a server crash, the client only needs to wait for the server to boot and re-send the request. The Network File System primarily consists of two protocols. First, the mountd protocol is used to initially access the file handle of the root of an exported directory. Second, the nfsd protocol is used to invoke RPC procedure calls to perform file operations on the remote server. An NFS client invokes the mount protocol through the mount utility. The mount protocol is a three-step process. First, the client contacts the mountd server to obtain the initial file handle for an exported file system. In the second step, the mount protocol accesses the attributes of the directory mount point requested by the client. Finally, the NFS client obtains the attributes of the exported file system. NFS provides the capability to enable different authentication mechanisms such as Unix system authentication (UID or GID). NFS supports authorization based on access control lists maintained by the server. This access control list provides a mapping of user and group IDs between client and server. Whenever an RPC call is received by the server, the server validates the client credentials through the access control list.

[56]. GVFS forms the basic framework for the transfer of data necessary for problem solving environments such as In-VIGO. It relies on a virtualization layer built on existing Network File System (NFS) components, and is implemented at the level of Remote Procedure Calls (RPC) by means of middleware-controlled file system proxies. A virtual file system proxy intercepts RPC calls from an NFS client and forwards them to an NFS server, possibly modifying
them [4].

Figure 2-5. Grid Virtual File System: NFS procedure calls are intercepted through user-level proxies. GVFS proxies are deployed in client and server machines. Users are authenticated through an access control list exported by the GVFS proxy and NFS server.

Figure 2-5 provides an overview of the grid virtual file system. As shown in this figure, middleware-controlled file system proxies are used to start a grid session for the client. The Grid virtual file system supports performance-enhancing mechanisms such as a disk cache [56]. GVFS proxies are further extended with write-back support to provide on-demand virtual environments for grid computing [57]. This approach relies on buffering RPC requests and results in a disk cache, and committing changes back to the server at the end of the user session. A related work uses a service-oriented approach to harness the GVFS proxies for optimizations such as caching or copy-on-write support. This approach is based on the Web Services Resource Framework (WSRF) that enables the provisioning of data
[58].
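The interposition idea behind GVFS-style proxies can be sketched as follows. This is a minimal, illustrative stand-in, not the actual GVFS implementation: the real proxies parse ONC RPC messages on the wire, whereas here an RPC call is modeled as a plain dict, and `make_proxy`, `uid_map` and `fh_map` are hypothetical names.

```python
# Minimal sketch of user-level RPC interposition as performed by
# middleware-controlled file system proxies. The proxy sits between an
# unmodified NFS client and server, rewriting credentials and file
# handles before forwarding each call.

def make_proxy(forward, uid_map, fh_map):
    """Return a proxy that rewrites credentials and file handles,
    then forwards the call to the real NFS server via `forward`."""
    def proxy(call):
        call = dict(call)  # never mutate the client's view of the call
        # Map the client-side identity to the server-side identity.
        call["uid"] = uid_map.get(call["uid"], call["uid"])
        # Map virtual file handles to the server's physical handles.
        if call.get("fh") in fh_map:
            call["fh"] = fh_map[call["fh"]]
        return forward(call)
    return proxy
```

Because the proxy is a plain callable around another callable, proxies of this shape compose naturally, which is what allows GVFS to layer caching or copy-on-write behavior without touching clients or servers.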


3.1.1 File System Abstraction

A file system is an abstraction commonly used to access data from memory/storage systems (e.g. disk). This abstraction is often implemented as a layer of indirection. Indirection mechanisms are commonly used to address computer science problems. For example, in the Linux O/S, to provide transparent access to different file systems, indirection mechanisms are typically used to steer file system operations through a common file system framework called the virtual file system (VFS). Figure 3-1 shows indirection mechanisms across three levels: transparent access of a file system through the VFS framework, logical access to a disk volume, and indirect access to a file block through an i-node. This dissertation applies indirection mechanisms through a user-level proxy so as to provide transparent access to data from one or more servers. The redirect-on-write file system enables wide-area applications to leverage on-demand block-based data transfers and a de-facto distributed file system (NFS) to access data stored remotely and modify it in the local area.

Figure 3-1. Indirection mechanism in the Linux virtual file system (VFS). Access to various file systems occurs through the indirection mechanism. A file system's blocks could be logically present in multiple hard disks. Further, access to file blocks is performed through direct or indirect access of blocks through i-nodes.
Figure 3-2. Middleware data management: Grid users G1, G2 and G3 access the file disk.img from the server and customize it for personal use through the ROW proxy. G1 modifies the second block B to B', G2 modifies block C to C', and G3 extends the file with an additional block D. (a) Modifications are stored locally at each shadow server. (b) Virtualized view.

ROW-FS complements capabilities provided by "classic" virtual machines (VMs [12][1]) to support flexible, fault-tolerant execution environments in distributed computing systems. Namely, ROW-FS enables mounted distributed file system data to be periodically check-pointed along with a VM's state during the execution of a long-running application. ROW-FS also enables the creation of non-persistent execution environments for non-virtualized machines. For instance, it allows multiple clients to access in read/write mode an NFS file system containing an O/S distribution exported in read-only mode by a single server. Local modifications are kept in per-client "shadow" file systems that are created and
destroyed on a per-session basis. Figure 3-2 illustrates an example of a shared VM image between Grid users G1, G2 and G3. O/S image modifications are locally buffered, whereas the server hosts read-only O/S images.

Such environments are provided by grid middleware (e.g. In-VIGO [4], COD [28] and Virtual Workspaces [5]). For example, data management in In-VIGO is provided by a virtualization layer known as the Grid Virtual File System. The resulting grid virtual file system allows dynamic creation and destruction of file system sessions on a per-user or per-application basis. Such sessions allow on-demand data transfers, and present to users and applications the API of a widely used distributed network file system across nodes of a computational grid. ROW-FS can export similar APIs to end-user and network-intensive applications to transparently buffer writes in a local server.
[59][60]. In these environments, it is often the case that the base operating system layer is shared (read-only) between different clients. Any modification to the base OS by a client (e.g. a kernel patch) can be made feasible by deploying ROW-FS to access the read-only images.
This scenario is illustrated in Figure 3-3. In the figure, the client virtual machine "C" crashes at time tf. In traditional NFS (Figure 3-3, top), job execution has to restart from the beginning, because the server state "S" may no longer be consistent with the client state at the time of the last checkpoint. In the redirect-on-write setup (Figure 3-3, bottom), job execution can correctly restart at the last checkpoint tc.

Figure 3-3. Check-pointing a VM container running an application with NFS-mounted file systems. In traditional NFS (top), once a client rolls back to checkpointed state, it may be inconsistent with respect to the (non-checkpointed) server state. In ROW-FS (bottom), state modifications are buffered at the client side and are checkpointed along with the VM.

An important class of Grid applications consists of long-running simulations, where execution times on the order of days are not uncommon, and mid-session faults are highly undesirable. Systems such as Condor [2] have dealt with this problem via application check-pointing and restart. A limitation of this approach lies in that it only supports a restricted set of applications - they must be re-linked to specific libraries and cannot use many system calls (e.g. fork, exec, mmap). ROW-FS, in contrast, supports unmodified
applications. The Condor [2][61] middleware is being extended with the so-called VM universe to support checkpoint and restore of entire VMs rather than individual processes; ROW-FS sessions can conceivably be controlled by this middleware to buffer file system modifications until a VM session completes.

The ROW-FS architecture is depicted in Figure 3-4. It consists of user-level DFS extensions that support selective redirection of distributed file system (DFS) calls to two servers: the main server and a shadow server. The architecture is novel in the manner in which it overlays the ROW capabilities upon unmodified clients and servers, without requiring changes to the underlying protocol. The approach relies on the opaque nature of NFS file handles to allow for virtual handles [3] that are always returned to the client, but map to physical file handles at the main and ROW servers. A file handle hash table stores such mappings, as well as information about client modifications made to each file handle. Files whose contents are modified by the client have "shadow" files created by the shadow server as sparse files, and block-based modifications are inserted in-place in the shadow file. A presence bitmap marks which blocks have been modified, at the granularity of NFS blocks (typically of size 8-32 KB). Figure 3-5 shows possible deployments of proxies enabled with user-level disk caching and ROW capabilities. For example, a cache proxy configured to cache read-only data may precede the ROW proxy, thus effectively forming a read/write cache hierarchy. Such a cache-before-redirect (Figure 3-5(a)) proxy setup allows disk caching of both read-only
contents of the main server and client modifications. Write-intensive applications can be supported with better performance using a redirect-before-cache (Figure 3-5(b)) proxy setup. Furthermore, redirection mechanisms based on the ROW proxy can be configured with both shadow and main servers being remote (Figure 3-5(c)). Such a setup could, for example, be used to support a ROW-mounted O/S image for a diskless workstation.

Figure 3-4. ROW-FS architecture: The Redirect-on-Write file system is implemented by means of a user-level proxy which virtualizes NFS by selectively steering calls to either a main server or a shadow server. MFH: Main File Handle, SFH: Shadow File Handle, F: Flags, HP: Hash table processor, BITMAP: bitmap processor.
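The cache-before-redirect and redirect-before-cache deployments can be viewed as different orderings of the same composable proxies. The sketch below illustrates that composition with an in-memory chain; the function names (`chain`, `cache_proxy`) and the dict-based call model are illustrative assumptions, not the dissertation's implementation.

```python
# Sketch of composing user-level proxies into a chain, mirroring the
# CBR/RBC setups: each proxy receives a call and a `next_hop` callable
# leading toward the real server.

def chain(*proxies, server):
    """Compose proxies in order; the last hop is the real server stub."""
    def start(call):
        def dispatch(i, call):
            if i == len(proxies):
                return server(call)
            return proxies[i](call, lambda c: dispatch(i + 1, c))
        return dispatch(0, call)
    return start

def cache_proxy(store):
    """A read cache: serve repeated READs locally, else forward."""
    def proxy(call, next_hop):
        key = (call["proc"], call["fh"], call.get("block"))
        if call["proc"] == "READ" and key in store:
            return store[key]          # hit: served from the local cache
        reply = next_hop(call)
        if call["proc"] == "READ":
            store[key] = reply         # miss: populate cache on the way back
        return reply
    return proxy
```

Placing `cache_proxy` before or after a ROW proxy in the `chain(...)` argument list yields the CBR or RBC ordering, respectively.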


Figure 3-5. Proxy deployment options: (a) Cache-before-redirect (CBR), (b) Redirect-before-cache (RBC), (c) Non-local shadow server.

The shadow-indexed (SI) table is a superset of the file system objects in the main server. The main-indexed (MI) table is needed to maintain state information about files in the main server. Figure 3-6 shows the structure of the hash table and flag information. The readdir flag (RD) is used to indicate that an NFS readdir procedure call has been invoked for a directory in the main server. The generation count (GC) is a number inserted into the hash tuple for each file system object to create a unique disk-based bitmap. The Remove (RM) and Rename (RN) flags are used to indicate deletion or renaming of a file.
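The hash-table tuple described above can be summarized in code. This is a sketch of the fields named in Figure 3-6, using assumed Python names; the actual proxy stores these tuples in C data structures.

```python
# Sketch of one shadow-indexed (SI) hash-table entry: it maps a shadow
# file handle to its main-server counterpart plus per-session state.
from dataclasses import dataclass

@dataclass
class RowEntry:
    sfh: bytes            # shadow-server file handle (SFH)
    mfh: bytes            # main-server file handle (MFH)
    gc: int               # generation count; makes the bitmap name unique
    rd: bool = False      # RD: readdir already replayed for this directory
    re: bool = False      # RE: file may have blocks at the shadow server
    rm: bool = False      # RM: file removed during this session
    rn: bool = False      # RN: file renamed during this session

# The SI table itself: shadow file handle -> entry.
si_table = {}
```

Keeping removes and renames as flags (rather than deleting files) is what lets the proxy answer later lookups without contacting the main server.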


Figure 3-6. Hash table and flag descriptions: SFH: Shadow File Handle, MFH: Main File Handle, RD: Readdir Flag, RE: Read Flag, GC: Generation Count, RM: Remove Flag, RN: Rename Flag, L1: Initial Main Link, L2: New Shadow Link, L3: Current Main Link, RL: Remove/Rename File List.

In the ROW file system, to keep track of the current location of updated blocks, each file is represented by a two-level hierarchical data structure on disk. The first level indicates the name of the file which contains information about the block. The second level indicates the location of a presence bit within the bitmap file.

Figure 3-7. Remote procedure call processing in ROW-FS. The procedure call is first forwarded to the shadow server and later to the main NFS server. SS: Shadow Server, MS: Main Server, SI: Shadow Indexed, MI: Main Indexed.
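The presence-bit bookkeeping can be illustrated with a small sketch: one bit per NFS block, set when a write lands at the shadow server. The class and the 8 KB block size are assumptions for illustration (the text gives 8-32 KB as the typical NFS block range); the real bitmap lives in an on-disk file.

```python
# Minimal sketch of a per-file presence bitmap at NFS-block granularity.
# A set bit means the current copy of that block is at the shadow server;
# a clear bit means it must be fetched from the main server.

BLOCK_SIZE = 8192  # assume the 8 KB end of the 8-32 KB NFS block range

class PresenceBitmap:
    def __init__(self):
        self.bits = bytearray()

    def mark_written(self, offset, count):
        """Record that bytes [offset, offset+count) were written."""
        first = offset // BLOCK_SIZE
        last = (offset + count - 1) // BLOCK_SIZE
        for block in range(first, last + 1):
            byte, bit = divmod(block, 8)
            if byte >= len(self.bits):          # grow the bitmap lazily,
                self.bits.extend(b"\0" * (byte + 1 - len(self.bits)))
            self.bits[byte] |= 1 << bit         # mark block as shadowed

    def at_shadow(self, block):
        """True if this block's current copy is at the shadow server."""
        byte, bit = divmod(block, 8)
        return byte < len(self.bits) and bool(self.bits[byte] >> bit & 1)
```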


Figure 3-8 illustrates a snapshot view of a file system session through the Redirect-on-Write proxy. The ROW proxy which is used to intercept the mount protocol is abstracted in the figure. NFS clients mount a read-only directory (/usr/lib) from the server VM. The mounted file system directory is transparently replicated in the client VM to buffer local modifications. Files replicated at the shadow server are dummy files which represent a sparse version of the read-only file in the server VM. Only file blocks written during the file system session are replicated in the shadow server. A hash table entry is updated to provide the status of files. Figure 3-8 illustrates a hex dump of the 32-byte NFSv2 hash table. The generation count is used to provide a unique bitmap directory: the generation count, along with the hashed value of the shadow file handle, is used to create a bitmap directory per file. As shown in Figure 3-8, the libX file handle is hashed to "777", which is further concatenated with generation count "234" to produce a unique bitmap directory. The RD flag is marked "0" as this entry is for a file. The RE flag is marked "1" to indicate that the bitmap needs to be accessed for a possible "libX" file block in the shadow server. For the "libX" file, there is no main-indexed hash table entry, as there is no status information to keep for the read-only file in the server VM. All the newly written blocks are present in the "0" file of the bitmap directory.

Figure 3-7 describes the various hash table entries stored in the proxy, which are referenced throughout this section. Table 3-1 briefly describes the NFSv2 RPC calls and points to the relevant sections for call modifications. A detailed description of all NFS protocol calls described below can be found in [3].
Figure 3-8. A snapshot view of a file system session through the Redirect-on-Write proxy. The hash table and bitmap status is shown for the file "libX", which is transparently replicated in a shadow server. Three blocks of "libX" are shown to be recently accessed and written in the shadow server.

To mount a file system, the mount utility first obtains the initial file handle from the server. In the second step, the mount utility invokes the NFS getattr procedure to get the attributes of the directory. Finally, the mount utility gets the attributes of the file system. To maintain mount transparency, ROW-FS also has a proxy for the mount protocol. The mount procedure is modified to obtain the initial mount file handle of the shadow server. Specifically, the mount proxy forwards a mount call to both shadow and main servers. When the mount utility is invoked by a client, the shadow server is contacted first to save the file handle of the directory to be mounted. This file handle is later used by NFS procedure calls to direct RPC calls to the shadow server. The initial mapping of file handles of a mounted directory is inserted in the SI hash table during invocation of the getattr procedure. Figure 3-9 (top, left) depicts handling of the Mount procedure.
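The virtualized mount step can be sketched as follows. The server stubs are plain callables and `row_mount` is an assumed name; the real proxy speaks the mountd protocol to both servers.

```python
# Sketch of the ROW-FS mount virtualization: forward the MOUNT call to
# both servers, record the shadow root handle so later RPC calls can be
# steered to the shadow server, and return that handle to the client.

def row_mount(path, mount_shadow, mount_main, state):
    sfh = mount_shadow(path)   # shadow server contacted first
    mfh = mount_main(path)     # main server mounted as well
    state[sfh] = mfh           # seed the SI table: shadow fh -> main fh
    return sfh                 # the unmodified client sees only this handle
```

Because the client only ever holds the handle returned here, every subsequent NFS call naturally flows through the proxy's handle mappings.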


Table 3-1. Summary of the NFSv2 protocol remote procedure calls. Each row summarizes the behavior of the RPC call and points to the section within this chapter where the mechanism to virtualize each call in ROW-FS is described.

  NFS call   Behavior (modification section)
  Null       Testing call (no modification)
  Getattr    Retrieves the attributes from the NFS server (Section 3.4.3)
  Setattr    Sets the attributes of a file or directory (Section 3.4.3)
  Lookup     Returns the file handle for a file name or directory (Section 3.4.2)
  Readlink   Reads a symbolic link (Section 3.4.8)
  Read       Reads a block of a file (Section 3.4.4)
  Write      Writes to a block of a file (Section 3.4.5)
  Create     Creates a new file (Section 3.4.10)
  Remove     Removes a file (Section 3.4.7)
  Rename     Renames a file (Section 3.4.7)
  Link       Creates a hard link of a file (Section 3.4.8)
  Symlink    Creates a symbolic link of a file (Section 3.4.9)
  Mkdir      Creates a new directory (Section 3.4.10)
  Rmdir      Removes an existing directory (Section 3.4.7)
  Readdir    Lists contents of an existing directory (Section 3.4.6)
  Statfs     Checks status of the file system (Section 3.4.11)

If the file has been removed during the session, the proxy returns a "does not exist" error. Otherwise, the proxy issues an NFS create call for a dummy
file at the shadow server. For the read procedure, three cases arise:

1. A new file may have been created at the shadow server. In this case, all blocks are present in the shadow server and all read calls are directed to it.
2. If the file system object is not newly created at the shadow server, file blocks may reside in either the shadow or the main server. In that case, the proxy uses the bitmap presence data structure and calculates the location of the current, valid file block to determine whether the read request should be satisfied by the main or by the shadow server.

3. An optimization (RE flag) is applied for the case when a file handle mapping is present in the MI hash table: even though a bitmap has not been created (i.e. no blocks of the file have been written), the call is forwarded to the main server. This optimization avoids the expense of checking the bitmap data structure on disk.
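The three cases above amount to a per-block routing decision, which can be sketched compactly. The dict-based entry and the function name `route_read` are illustrative assumptions.

```python
# Sketch of the read-routing decision: new files read entirely from the
# shadow server; the RE flag short-circuits the bitmap check; otherwise
# the presence bitmap picks a server per NFS block.

def route_read(entry, block, at_shadow):
    """Return 'shadow' or 'main' for one NFS block of a READ call.
    `entry` is the file's hash-table state, `at_shadow` the bitmap test."""
    if entry.get("new"):          # case 1: file created at the shadow server
        return "shadow"
    if not entry.get("re"):       # case 3: no blocks were ever written,
        return "main"             # so skip the on-disk bitmap lookup
    # case 2: consult the per-file presence bitmap, block by block
    return "shadow" if at_shadow(block) else "main"
```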


Figure 3-9. Sequence of redirect-on-write file system calls.

A simple list-directory utility (i.e. "ls -l") may invoke multiple readdir procedure calls because of the large number of file objects present in the directory. To provide synchronization between multiple readdir calls, information is needed to keep track of the position at which the last readdir call returned a file system object. In traditional NFS, this is accomplished by means of a cookie randomly generated on a per-file-object basis. In the context of ROW-FS, there are two possible scenarios for the readdir procedure: a first call or a subsequent call. For the sake of clarity, I refer to the main server readdir as m-readdir and the shadow server forwarding invocation as s-readdir. The first s-readdir procedure is virtualized as follows:
1. The ROW proxy intercepts the s-readdir request from the client and checks the status of the parent file handle in the SI hash table. If it is the first call, it initializes a temporary buffer to store temporary cookies for multiple m-readdir calls.

2. It checks the status of the RD flag of the parent directory, which indicates whether readdir has previously been called. If the RD flag is set, the call is forwarded to the shadow server. If the RD flag is not set, the following are the options for the relative structure of the directories in the shadow and main server:

3. It checks the file system type of the returned file system object. If it is a symbolic link, the readlink procedure is invoked to get the data from the main server and a symlink call is issued to the shadow server to replicate the object.
If the file system object is of link type (as provided by the nlink attribute of the fattr data structure), then the LINK procedure is called in the shadow server. The link procedure is the only procedure which can increment the nlink attribute of a file system object. All regenerated file system objects are updated in the SI/MI hash tables. If the object no longer exists, a "does not exist" error is returned.
[62]. The NIST Net emulator is deployed as a virtual router in a VMware VM with 256 MB of memory running Linux RedHat 7.3. Redirection is performed to a shadow server running in a virtual machine in the client's local domain.

The results in Table 3-2 show that ROW-FS performance is superior to NFSv3 in a WAN scenario, while comparable in a LAN. In the WAN experiment, recursive stat shows nearly a five-fold improvement over
NFSv3 (Table 3-2). Clearly, WAN performance of ROW-FS during the second run is comparable with LAN performance and much improved over NFSv3. This is because once a directory is replicated at the shadow server, subsequent calls are directed to the shadow server by means of the readdir status flag. The initial readdir overhead for ROW-FS (especially in the LAN setup) is due to the fact that dummy file objects are being created in the shadow server during the execution.

Remove: To measure the latency of remove operations, I deleted a large number of files (greater than 15000, totaling 190 MB of data). I observed that in ROW-FS, since only the remove state is being maintained rather than the complete removal of files, performance is nearly 80% better than that of conventional NFSv3: it takes nearly 37 minutes in ROW-FS, in comparison to 63 minutes in NFSv3, to delete 190 MB of data over a wide-area network. Note that each experiment is performed with cold caches, set up by re-mounting file systems in every new session. If the file system is already replicated in the shadow server, it takes 18 minutes (WAN) to delete the complete hierarchy.

Table 3-2. LAN and WAN experiments for lookup, readdir and recursive stat micro-benchmarks. For ROW-FS, each benchmark is run for two iterations: the first warms up the shadow server; the second accesses modifications locally. NFSv3 is executed once, as performance for the second run is similar to the first run. In both ROW-FS and NFSv3, NFS caching is disabled.

  Micro-benchmark   LAN (seconds)                     WAN (seconds)
                    ROW-FS 1st  ROW-FS 2nd  NFSv3    ROW-FS 1st  ROW-FS 2nd  NFSv3
  Lookup            0.018       0.008       0.011    0.089       0.018       0.108
  Readdir           67          17          41       1127        17          1170
  Recursive Stat    425         404         367      1434        367         1965
  Remove            160         NA          230      2250        NA          3785
Table 3-3 summarizes the performance of the Andrew benchmark, and Figure 3-10 provides statistics for the number of RPC calls. The important conclusion taken from the data in Figure 3-10 is that ROW-FS, while increasing the total number of RPC calls processed during the application execution, reduces the number of RPC calls that cross domains to less than half. Note that the increase in the number of getattr calls is due to the invocation of the getattr procedure to virtualize read calls to the main server. Read calls are virtualized with shadow attributes (in the case when blocks are read from the main server) because the client is unaware of the shadow server; file system attributes like file system statistics and file inode number have to be consistent between a read and a post-read getattr call. Nonetheless, since all getattr calls go to the local-area shadow server, the overhead of the extra getattr calls is small compared to getattr calls over the WAN.

Table 3-3. Andrew benchmark and AM-Utils execution times in local- and wide-area networks.

  Benchmark        ROW-FS (sec)   NFSv3 (sec)
  Andrew (LAN)     13             10
  Andrew (WAN)     78             308
  AM-Utils (LAN)   833            703
  AM-Utils (WAN)   986            2744
The AM-Utils build [63] is also used as an additional benchmark to evaluate performance. The automounter build consists of configuration tests to determine required features for the build, thus generating a large number of lookup, read and write calls. The second step involves compiling the am-utils software package. Table 3-3 provides the experimental results for LAN and WAN. The resulting average ping time for the NIST-emulated WAN is 48.9 ms in the ROW-FS experiment and 29.1 ms in the NFSv3 experiment. Wide-area performance of ROW-FS for this benchmark is again better than NFSv3, even under larger average ping latencies.

Figure 3-10. Number of RPC calls received by the NFS server in a non-virtualized environment, and by ROW-FS shadow and main servers, during Andrew benchmark execution.
Table 3-4. Linux kernel compilation execution times on a LAN and WAN.

  Setup   FS       Oldconfig time (s)   Dep time (s)   BzImage time (s)
  LAN     NFSv3    49                   120            710
  LAN     ROW-FS   55                   315            652
  WAN     NFSv3    472                  2648           4200
  WAN     ROW-FS   77                   1590           780

The kernel compilation consists of three steps: make "oldconfig", make "dep" and make "bzImage". Table 3-4 shows performance readings for both LAN and WAN environments. The performance of Linux kernel compilation for ROW-FS is comparable with NFSv3 in the LAN environment and shows substantial improvement over the emulated WAN. Note that, for the WAN, kernel compilation performance is nearly five times better with the ROW proxy in comparison with NFSv3. The results shown in Table 3-4 do not account for the overhead of synchronizing the main server. Nonetheless, as shown in Figure 3-10, a majority of RPC calls do not require server updates (read, lookup, getattr); furthermore, many RPC calls (write, create, mkdir, rename) are also aggregated in the statistics - often the same data is written again, and many temporary files are deleted and need not be committed.

Fault tolerance: Finally, I tested the check-pointing and recovery of a computational chemistry scientific application (Gaussian [64]). A VMware virtual machine running Gaussian is checkpointed (along with ROW-FS state in the VM's memory and disk). It is then resumed, runs for a period of time, and a fault is injected. Some Gaussian experiments take more than one hour to finish and generate a large amount of temporary data (hundreds of MBytes). With ROW-FS, I observe that the application successfully resumes from a previous checkpoint. With NFSv3, inconsistencies between the client checkpoint and the server state caused the application to crash, preventing its successful completion.
Table 3-5 summarizes the performance of diskless boot times with different proxy cache configurations. The results show that precaching of attributes before redirection, combined with post-redirection data caching, delivers the best performance, improving wide-area boot time with "warm" caches by over 300%.

Table 3-5. Wide-area experimental results for diskless Linux boot/second boot for (1) ROW proxy only, (2) ROW proxy + data cache, (3) attribute cache + ROW + data cache.

  WAN setup                                       Boot (sec)   2nd Boot (sec)
  Client->ROW->Server                             435          236
  Client->ROW->Data Cache->Server                 495          109
  Client->Attr. Cache->ROW->Data Cache->Server    409          76

The results of the first part of this experiment are presented in Table 3-6. In the second part, I tested the setup with
aggressive client-side caching (Figure 2(b)). Table 3-6 also presents the boot/second-boot latencies for this scenario. For delays smaller than 10 ms, the ROW+CP setup has additional overhead for Xen boot (in comparison with the ROW setup); however, for delays greater than 10 ms, the boot performance with the ROW+CP setup is better than the ROW setup. Reboot execution time is almost constant with the ROW+CP proxy setup. Clearly, the results show much better performance of Xen second boot for the ROW+CP experimental setup.

Table 3-6. Remote Xen boot/reboot experiment with ROW proxy and ROW proxy + cache.

  NIST Net delay   ROW Proxy                  ROW Proxy + Cache Proxy
                   Boot (sec)  2nd Boot (sec)  Boot (sec)  2nd Boot (sec)
  1 ms             121         38              147         36
  5 ms             179         63              188         36
  10 ms            248         88              279         37
  20 ms            346         156             331         37
  50 ms            748         266             604         41

A related approach is UnionFS [65]. The key advantages of ROW-FS over UnionFS are that the former is user-level and integrates with unmodified NFS clients/servers, while the latter is a kernel-level approach that requires support from the kernel; and that the former operates on individual file data blocks while the latter operates on whole files. This matters for applications where unmodified clients are deployed and for applications that access sparse data; for example, the provisioning of VM images. I have attempted to compare the performance of UnionFS and ROW-FS for Xen virtual machine instantiation across a wide area, but instantiating a Xen 3.0 domU with an image stacked using the latest version of UnionFS available at the time of writing (Unionfs 1.4) fails. The UnionFS copy-on-write mechanism is based on a complete copy-up to a new branch upon write invocation, whereas ROW-FS replicates only the needed block; hence, ROW-FS has an added advantage over UnionFS for disk image instantiation (a large copy-up is expensive). Advantages of UnionFS over ROW-FS include potentially
[38]. In the past, researchers have used an NFS shadowing technique to log users' behavior on old files in a versioning file system [66]. Emulation of an NFS-mounted directory hierarchy is often used as a means of caching and performance improvement [67]. Kosha provides a peer-to-peer enhancement of the network file system to utilize redundant storage space [51]. File virtualization has also been addressed through NFS-mounted file systems within private namespaces for groups of processes, with the motivation of migrating the process domain [68]. A striped network file system has been implemented to increase server throughput by striping files across multiple servers [69]. This approach is primarily used to access file blocks from multiple servers in parallel, thus improving performance over NFS. A copy-on-write file server is deployed to share immutable template images for operating system kernels and file systems in [21]. The proxy-based approach presented in this dissertation is unique in that it not only provides copy-on-write functionality but also provides provision for inter-proxy composition. Checkpoint mechanisms have been integrated into language-specific byte-code virtual machines as a means of saving an application's state [70]. VMware and Xen 3.0 virtual machines have provision for taking checkpoints (snapshots) and reverting back to them. These snapshots, however, do not support checkpoints of changes in a mounted distributed file system.
[71]. The approach is a thin-client solution for desktop grid computing based on virtual machine appliances whose images are fetched on-demand and on a per-block basis over wide-area networks. Specifically, I aim at reducing the download times associated with appliance images, and at providing a decentralized, scalable mechanism to publish and discover upgrades to appliance images. The approach embodies different components and technologies - virtual machines, an overlay network, pre-boot execution environment services, and a redirect-on-write virtual file system. Virtual machines over virtual networks are deployed with a pre-boot execution server to facilitate remote network booting. The approach uses ROW-FS, which enables the use of unmodified NFS clients/servers and local buffering of file system modifications during the appliance's lifetime. Similarly to related efforts, our approach targets applications deployed on non-persistent virtual containers [4][28] through provisioning of virtual environments with role-specific disk images [60]. Thin computing paradigms offer advantages such as lower administration cost and easier failure management. In early computing systems, thin-client computing was successful for two main reasons: low-cost commodity hardware was not available to end users, and a centralized approach to computing was often preferred due to easier system administration. As low-cost PCs and high-bandwidth local-area networks became widely available, thin-client computing lost ground. The advent of virtual machines has opened up new opportunities: virtual machines can be easily created, configured, managed and deployed. The virtualization approach of multiplexing physical resources not only
[72]. An illustrative example of a virtual appliance is a Fedora 9 appliance of size 800 MB with pre-configured graphical user interface packages [72]. Optimizing the size of an appliance is time-consuming, and in many cases not possible without loss of functionality (e.g. by avoiding installation of certain packages). Nonetheless, it is often the case that at run-time only a small fraction of the virtual disk is actually "touched" by an application. I exploit this behavior by building on on-demand data transfers that substantially reduce the download time and bandwidth requirements for the end user. The following sections will explain the overall architecture and approach.

The architecture is illustrated in Figure 4-1. The approach is based on diskless provisioning of virtual machine environments through a virtual machine proxy. The utility of the envisioned architecture can be observed from the viewpoint of both users and system administrators. Users not only have fast and transparent access to different O/S images but also have automatic support for upgrading the O/S images. For administrators, it provides a framework for simple deployment and maintenance of new images. As shown in Figure 4-1, an end user X1 downloads a small proxy appliance (VM2) from Download Server DS. The proxy appliance is configured to connect to a virtual
network overlay connecting it to other users (e.g. using IPOP [26][30]). An example NFS proxy appliance of size 350 MB can be downloaded from the VMware virtual appliance marketplace [72]. The proxy appliance is also configured to run a small FTP server and a DHCP server to serve the network bootstrap program and allocate an IP address to the client's working environment. The actual appliances which carry out computation can be configured with a desired execution environment and need not be downloaded in their entirety by end users - they are brought in on-demand through the proxy appliance. Each node is an independent computer which has its own IP address on a private network. Key to this architecture is the redirect-on-write file system (ROW-FS). As explained in Chapter 3, ROW-FS consists of user-level DFS extensions that support selective

Figure 4-1. O/S image management over wide-area desktops: User X1 downloads a small ROW-FS-proxy-configured appliance (VM2) from download server DS. User X2 can potentially share the appliance image with user X1. The image server (IS) exports read-only images to clients over NFS. VM1 is a diskless client. The appliance bootstrap procedure is explained further in Figure 4-2.
Figure 4-2. The deployment of the ROW proxy to support PXE-based boot of a (diskless) non-persistent VM over a wide-area network.

Figure 4-2 expands on Figure 4-1 to show the diskless provisioning of virtual machines. In Figure 4-2, VM1 is a diskless virtual machine, and VM2 is a boot proxy appliance configured with two NIC cards for communication with the host-only and public networks. VM2 is configured to execute the redirect-on-write file system (ROW-FS) and NFS cache proxies. In addition, VM2 is configured to run DHCP and TFTP servers to provide the diskless client (VM1) an IP address and an initial kernel image. Classic virtual machine monitors such as VMware provide support for PXE-enabled BIOS and NICs; PXE is a technology to boot diskless computers using network interface cards. The server VM is configured to share a common directory through ROW-FS with clients. To illustrate the workings of the diskless setup, consider the following steps to boot a diskless VM with an appliance image served over a wide-area network:
1. The diskless VM (VM1) invokes a DHCP request for an IP address.
2. The DHCP request is routed through a host-only switch to the gateway VM (VM2).
3. VM2 is configured to have two NICs: host-only (private IP address) and public. VM2 receives the request at the host-only NIC (eth0).
4. The DHCP server allocates an IP address and sends a reply back to the diskless VM (VM1).
5. The diskless VM invokes a TFTP request to obtain the network bootstrap program and initial kernel image.
6. VM2 receives the TFTP request at the host-only eth0.
7. The kernel image is transferred to VM1 and loaded in RAM to kick-start the boot process.
8. The diskless VM invokes a mount request to mount a read-only directory from the server (VM3) through the proxy VM (VM2).
9. VM2 is configured to redirect write calls to a local server. Read-only NFS calls are routed through the proxy VM2 to VM3; the connection between VM2 and VM3 is through the virtual overlay network.

P2P networks are considered to be inherently self-configuring, scalable, and very robust to node or system failures. Each P2P node maintains a view of the network at regular intervals, which facilitates seamless addition or removal of a node from the system. As nodes are added into the network pool, bandwidth and CPU processing are distributed and shared among users; thus P2P systems are very scalable. Furthermore, P2P systems are configured to be tolerant to node failures. P2P overlay networks such as IPOP also facilitate firewall traversal without administrator intervention, which allows P2P nodes behind firewalls to join the network [30]. The process of publishing and sharing O/S images is well supported by these P2P properties. The primary goal of the architecture is to automate the process of publishing, discovering, and mounting appliance images. Furthermore, it should be possible for images to be replicated (fully or partially) across multiple virtual servers throughout a virtual network for load balancing and fault tolerance. It is feasible to provide image versioning capability by maintaining the latest image state in a decentralized way using a Distributed Hash Table (DHT), which, in the case of the IPOP virtual network [30], is already responsible for providing DHCP addresses. DHTs provide two simple primitives: put(key, value) and get(key). In order to use the DHT to track appliance image versions, the key functionality needed can be broken down into diskless client and publisher client roles.
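The two DHT primitives can be illustrated with a minimal in-memory stand-in; a Python dict plays the role of the distributed table, and the key and value layouts below are hypothetical, not taken from the IPOP implementation:

```python
# Toy stand-in for the DHT: a dict plays the role of the distributed table.
# Key/value layouts are illustrative only, not the IPOP wire format.
dht = {}

def put(key, value):
    dht[key] = value

def get(key):
    return dht.get(key)

# Publisher advertises the latest version of an appliance named "Redhat" ...
put("Redhat/version", 3)
put("Redhat/v3/location", {"ip": "10.0.0.5", "mount_path": "/export/redhat-v3"})

# ... and a diskless client later resolves it in two lookups.
version = get("Redhat/version")
info = get("Redhat/v3/location")
```

A real deployment would replace the dict with the overlay's put/get RPCs, but the version-then-location lookup pattern stays the same.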


1. The diskless client downloads the boot proxy appliance machine from the download server (VM2 in Figure 4-1). The client bootstraps this generic appliance, which is configured to forward the client's requests to the Image Server (IS). The downloaded proxy machine is configured to connect with the network of appliances through the IPOP P2P network.
2. The diskless client queries the DHT for the version of appliance A. An illustrative example of an appliance name could be a "Redhat" appliance.
3. The diskless client queries the DHT to obtain the image server IP address and mount path for appliance A.
4. Start the ROW-FS proxies using the image server IP address. The startup of the ROW-FS proxies sets up an access control list and a session directory to allow call forwarding to the image server and the local NFS server.
5. Bootstrap a diskless client virtual machine and establish a mount session with the image server. Virtual machine APIs are leveraged to bootstrap the diskless client.
6. The diskless client performs its experiments during the established ROW-FS session between the diskless client and the image server.
7. Halt the booted diskless client.
8. Kill the ROW-FS proxies. The boot proxy appliance machine contains the client's session data and experimental run results.

Figures 4-3 and 4-4 illustrate the algorithms for a diskless client to bootstrap an appliance and for a publisher client to publish the O/S image. Unused VMs can be removed from the system at regular intervals. When the number of clients accessing the VM image is zero and the image is not the latest version, the DHT entry expires with a timeout.
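The steps above can be sketched as a single orchestration routine; every helper name here (start_row_proxies, boot_diskless_vm, and so on) is a hypothetical placeholder standing in for the real provisioning scripts, with stubs that record the call order so the control flow is visible:

```python
calls = []

# Hypothetical placeholder helpers; each records its invocation so the
# control flow of the bootstrap sequence can be traced.
def start_row_proxies(ip, path):
    calls.append("start_proxies"); return {"ip": ip, "path": path}

def boot_diskless_vm(proxies):
    calls.append("boot_vm"); return {"proxies": proxies}

def run_experiments(vm):
    calls.append("run")

def halt_vm(vm):
    calls.append("halt")

def stop_row_proxies(proxies):
    calls.append("stop_proxies")

def bootstrap_session(dht_get, appliance):
    """Sketch of steps 2-8 of the diskless-client bootstrap flow."""
    version = dht_get(appliance + "/version")                        # step 2
    server = dht_get("%s/v%s/location" % (appliance, version))       # step 3
    proxies = start_row_proxies(server["ip"], server["mount_path"])  # step 4
    vm = boot_diskless_vm(proxies)                                   # step 5
    run_experiments(vm)                                              # step 6
    halt_vm(vm)                                                      # step 7
    stop_row_proxies(proxies)                                        # step 8

table = {"Redhat/version": 3,
         "Redhat/v3/location": {"ip": "10.0.0.5", "mount_path": "/export/redhat"}}
bootstrap_session(table.get, "Redhat")
```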


Figure 4-3. Algorithm to bootstrap a VM session.

Figure 4-4. Algorithm to publish a virtual machine image.
[58]; it is also conceivable to integrate the logic to configure, create, and tear down ROW-FS sessions with application workflow schedulers such as [73]. Further, failure transparency is an important property of distributed systems. During the time a ROW-FS file system session is mounted, all modifications are redirected to the shadow server. A question that arises is how file system modifications in the shadow server should be reconciled with data in the main server at the end of a session. Three scenarios can be considered for consistency support in the context of the redirect-on-write file system:
1. There are applications in which it is neither needed nor desirable for data in the shadow server to be reconciled with the main server; an example is the provisioning of system images for diskless clients or virtual machines, where local modifications made by individual VM instances or diskless machines are not persistent.
2. For applications in which it is desirable to reconcile data with the server, the ROW-FS proxy holds state in its primary data structures (the file handle hash table and the block bitmaps) that can be used to commit modifications back to the server. The approach is to remount the file system at the end of a ROW-FS session in read/write mode, and signal the ROW-FS proxy to traverse its file handle hash tables and bitmaps to commit changes (moves, removes, renames, etc.) to directories
3. One particular use case of ROW-FS is autonomic provisioning of O/S disk images shared between multiple clients. In this context, I leverage APIs exported by lookup services (such as a Distributed Hash Table) in distributed frameworks (such as IPOP [26]) to store clients' usage and sharing information. This approach is based on multiple clients converging to use the latest appliance image over the course of time.
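The commit-back pass described in scenario 2 can be sketched as follows. This is a minimal illustration, assuming two in-memory stand-ins for the proxy's state: a file-handle table mapping shadow handles to main-server paths, and a per-file bitmap of blocks written during the session; the read/write callables are placeholders for the real NFS operations:

```python
# Minimal sketch of the scenario-2 commit-back traversal. handle_table maps
# shadow file handles to main-server paths; bitmaps marks which blocks were
# redirected (written) during the session. shadow_read/main_write stand in
# for the actual NFS READ and WRITE procedures.
BLOCK = 4096

def commit_session(handle_table, bitmaps, shadow_read, main_write):
    """Push only the blocks modified during the ROW-FS session."""
    committed = 0
    for shadow_fh, main_path in handle_table.items():
        for blkno, dirty in enumerate(bitmaps.get(shadow_fh, [])):
            if dirty:  # this block was redirected to the shadow server
                data = shadow_read(shadow_fh, blkno * BLOCK, BLOCK)
                main_write(main_path, blkno * BLOCK, data)
                committed += 1
    return committed

# Demo: one file with blocks 1 and 2 dirty, block 0 untouched.
shadow = {("fh1", 4096): b"B1", ("fh1", 8192): b"B2"}
writes = []
n = commit_session({"fh1": "/export/a"}, {"fh1": [0, 1, 1]},
                   lambda fh, off, ln: shadow[(fh, off)],
                   lambda path, off, data: writes.append((path, off, data)))
```

Only session-dirty blocks cross the wide-area link, which is the point of keeping the bitmap in the first place.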


[74]. [16]). Each directory or file in ROW-FS is uniquely addressed by a persistent file handle. It is feasible to provide replica support for wide-area desktops in the ROW-FS proxy, as file handles in a replicated VM are consistent with those in the primary VM.
Figure 4-5. Replication approach for ROW-FS. File handles of the server replica and the read-only server are equivalent. A timeout of the connection to the read-only server can result in switchover to the replica server.

The ROW-FS proxy can be configured to provide support for a read-only server replica. Figure 4-5 shows the feasibility of a virtual machine-based replication mechanism. The ROW-FS proxy is configured to forward calls to both the read-only and replica servers. Each replica is a cloned virtual machine which exports the root file system of the grid appliance to the client. An example scenario is shown in the figure. If read-only server r0 goes down for some reason, the ROW-FS proxy can be configured to switch over to a replica server (r1) after a no-response time, Tout. The shadow state of the ROW-FS proxy, which includes the file handle mapping between the shadow server (s0) and the read-only server (r0) and the bitmap data structure, is still valid with the new read-only server replica (r1). This is feasible because the shadow server state depends only on the file handles of the s0 and r0 servers. Servers r0 and r1 have identical file handles, which facilitates a seamless transition of NFS calls to r1.
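The timeout-driven switchover can be sketched as below; the server objects and call signature are illustrative stand-ins, not the real proxy API:

```python
# Sketch of timeout-driven failover: forward each NFS call to the current
# read-only server and drop to the next replica after a no-response (Tout).
# Servers are modeled as callables; identical file handles on r0 and r1 are
# what make the switchover transparent to the shadow state.
class RowProxy:
    def __init__(self, servers, timeout):
        self.servers = list(servers)  # [r0, r1, ...], all with equal handles
        self.timeout = timeout        # Tout, the no-response threshold

    def forward(self, call):
        while self.servers:
            primary = self.servers[0]
            try:
                return primary(call, timeout=self.timeout)
            except TimeoutError:
                # r0 unresponsive: the shadow state (file-handle map,
                # bitmaps) remains valid, so fail over to the next replica.
                self.servers.pop(0)
        raise RuntimeError("no replica available")

def r0(call, timeout):
    raise TimeoutError("r0 is down")

def r1(call, timeout):
    return ("ok", call)

proxy = RowProxy([r0, r1], timeout=5)
result = proxy.forward("GETATTR")
```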


[74]:
1. Confidentiality: Publishers should be able to publish O/S images for intended users.
2. Integrity: The integrity of a publisher's claim should be maintained. No other user should be able to modify the publisher's claim.

To address these security properties, I consider the following security mechanisms:
1. Encryption: A publisher's claim for an O/S image needs to be encrypted to avoid any interception or fabrication by a rogue user.
2. Authentication: The publisher of an image must be able to prove its identity. To establish the publisher's identity, a public key cryptography scheme can be applied. The public key of each user in the P2P network can be advertised. Any claim by a user is encrypted with its own private key.

While authentication and encryption help in sending data securely, it is important to model trust between the P2P users. Trust models are a way to validate client X's claim to be "User X". Various trust models have been used to establish trust in distributed systems. For example, a "web of trust" approach is a commonly used email scheme to send private emails to end users [74]. The approach relies on each user maintaining a list of trusted public keys. I suggest a public key infrastructure as the trust model for the O/S management framework. A public key infrastructure is a collection of certificate authorities and certificates assigned to users. Certificates are a common cryptographic technique used in e-commerce applications. A certificate is a digital signature which helps in maintaining identification, authorization, and data confidentiality for the user. A digital signature is a form of asymmetric cryptography used to securely send messages between users. While asymmetric key cryptography securely transfers the data, the question of trust between end users persists. For the purpose of this dissertation, I assume that there is a trusted certificate
Figure 4-6 illustrates the security mechanism to authenticate and encrypt the client's and publisher's data. Here, the assumption is that the public key of the certificate authority is built into the P2P overlay network. To validate each client's identity, the certificate authority encrypts the client's identification (IDC) and public key (K+C) (i.e., the certificate) with its private key (K-CA) and distributes it over the P2P network.

Figure 4-6. Diskless client and publisher client security. C: Client, P: Publisher, and CA: Certificate Authority.

The following equations illustrate the encryption of appliance (A) information by a publisher and its subsequent decryption by a client.

Publisher encryption:
(A, Vi) => K-P(A, Vi)
(A, Vi, IP, MountPath) => K-P(A, Vi, IP, MountPath)

Client decryption:
K+P(K-P(A, Vi)) => (A, Vi)
K+P(K-P(A, Vi, IP, MountPath)) => (A, Vi, IP, MountPath)
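The sign-with-private-key, verify-with-public-key pattern in these equations can be illustrated with textbook RSA. This is a toy sketch with deliberately tiny parameters (p = 61, q = 53) and an integer-encoded claim; a real deployment would use a vetted cryptographic library with full-size keys:

```python
# Textbook RSA with toy parameters, for illustration only.
n = 61 * 53          # modulus n = 3233
e = 17               # public exponent:  K+P = (e, n)
d = 2753             # private exponent: K-P = (d, n), since e*d = 1 mod 3120

def publisher_encrypt(m):
    """Publisher encrypts (signs) a claim with its private key K-P."""
    return pow(m, d, n)

def client_decrypt(c):
    """Any client recovers the claim with the advertised public key K+P."""
    return pow(c, e, n)

# The appliance claim (A, Vi) is assumed encoded as a small integer m < n.
claim = 42
signed = publisher_encrypt(claim)
```

Because only the publisher holds K-P, a successful decryption with K+P simultaneously authenticates the claim, which is exactly the property the equations above rely on.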


(Figure 4-2). In this experiment, virtual machines VM1, VM2, and VM3 are deployed in a host-only network using VMware's ESX Server VM monitor. Note that this experiment only reflects VM resource statistics (not application execution time). The experimental setup is as follows: server VM3 is configured with 2 GB RAM and a single virtual CPU. VM2 and VM1 are configured with 1 GB RAM, and also a single VCPU. The VMs are hosted by a dual Xeon 3.2 GHz processor, 4 GB memory server. The size of the appliance image is 934 MB. Figure 4-7 shows time-series plots with CPU, disk, and network rates for three different intervals. These values are obtained in 20-second intervals leveraging VMware ESX's internal monitoring capabilities. In the first interval, the VM is booted. In the second interval, the VM runs a CPU-intensive application (the computer architecture simulator SimpleScalar) which models the target workload of a typical voluntary computing execution. In the third phase, the appliance is rebooted.
Figure 4-7. Proxy VM usage time series for CPU, disk, and network. Results are sampled every 20 seconds and reflect data measured at the VM monitor. Three phases are shown (marked by vertical lines): appliance boot, execution of a CPU-intensive application (SimpleScalar), and appliance reboot.
[75]. The figure shows high data write rates during VM boot execution. Clearly, I can observe a maximum of 12% CPU consumption and also a high data rate across the network to load the initial kernel image into the diskless client's (VM1) memory.

Application execution: Since the application is CPU-intensive, the proxy VM exhibits little run-time overhead in this phase. This is because once the diskless VM (VM1) is booted, it loads the necessary files for the application execution into RAM (as shown by the initial network activity). I can further observe that disk and network usage is negligible in VM2 during the execution of the SimpleScalar application, thus supporting my assumption of minimal overhead for the proxy-configured VM.

VM reboot: During reboot, the client has replicated session state at the shadow server. I see an average boot time reduction (described below). I observe further spikes in network and CPU usage as some files are fetched and read from the server VM. In past results, I have shown that aggressive caching can further improve boot performance [75].

Figure 4-8 provides statistics for the number of RPC calls during the boot-up of the diskless appliance VM1. The histogram is broken down by the different types of RPC calls corresponding to NFS protocol calls, from left to right: get and set file attributes, file handle lookup, read links, read block, write block, create file, rename file, make directory, make symbolic link, and read directory. The important conclusion taken from the data in Figure 4-8 is that ROW-FS, while increasing the total number of calls routed to the local shadow server, reduces the number of RPC calls that cross domains. Note that the increase in the number of get-attribute (getattr) calls is due to the invocation of the getattr procedure to virtualize read calls to the main
Figure 4-8. RPC statistics for diskless boot. The shadow server receives the majority of RPC calls. The bars represent the number of RPC calls received by the shadow and main servers.

[30] virtual network, and the server and client VMs are behind NATs. The proxy VM2 is
Table 4-1 summarizes the results from this experiment. Notice that the boot times reduce to less than half, becoming comparable to the LAN PXE/NFS boot time of approximately 2 minutes.

Table 4-1. Appliance boot/reboot times over WAN. ISP is a VM behind a residential ISP provider; UFL is a desktop machine at the University of Florida; VIMS is a server machine at the Virginia Institute of Marine Sciences.

VM1/VM2, VM3 | Boot (seconds) | 2nd boot (seconds) | Ping latency
ISP, UFL     | 291            | 116                | 23 ms
UFL, VIMS    | 351            | 162                | 68 ms

Figure 4-9 shows the cumulative distribution of 100 accesses of the DHT through 10 IPOP clients. The clients are randomly chosen to query the DHT. The distribution graph shows that in most cases it takes less than 2 seconds to query the DHT and obtain the appliance status information. The average time to insert an appliance version, with the appliance name as key, over 10 iterations is 1.4 sec. Table 4-2 provides the mean and variance statistics for five clients. The mean and variance statistics show that client access times vary across the DHT over P2P nodes deployed on PlanetLab. The access time often depends on the route path taken to access the DHT information.

Table 4-2. Mean and variance of DHT access time for five clients

Statistics | Client1 | Client2 | Client3 | Client4 | Client5
Mean       | 0.567   | 0.648   | 2.699   | 0.6875  | 2.224
Variance   | 0.00254 | 0.0389  | 4.731   | 0.08512 | 1.337

[76] provides mechanisms to make read-only data scalable over the wide-area network through cooperative caching and NFS proxies; my approach
Figure 4-9. Cumulative distribution of DHT queries through 10 IPOP clients (in seconds).

complements it by enabling redirect-on-write capabilities, which is a requirement to support the target application environment of NFS-mounted diskless VM clients. SFS advocated the approach of a read-only file system for untrusted clients. There have been upcoming commercial products which either provide thin-client solutions based on the pre-boot execution environment [77][78] or provide cache-based solutions as a viable thin-client approach for scalable computing [79]. A distributed computing approach based on stackable virtual machine sandboxes is advocated in [59]. A stackable-storage-based framework is also used to automate cluster management as a means to reduce administrative complexity and cost [60]. The approach advocated in [60] is re-provisioning of the application environment (base OS, servers, libraries, and applications) through role-specific (read or write) disk images. A framework to manage clusters of virtual machines is proposed in [80]. The Stork package management tool provides mechanisms to share files such as libraries and binaries between virtual machines [81]. A copy-on-write file server is deployed to share immutable template images for operating system kernels and file systems in [21]. This approach uses a combination of traditional NFS for read-only mounts and AFS for aggressive caching of shared images
[21]. File systems based on write-once semantics are commonly leveraged on commodity disk images for applications such as map-reduce [82].
[1][71], a common deployment scenario of ROW-FS is when the virtual machine hosting the shadow server and the client virtual machine are consolidated into a single physical machine. Such a scenario is common when the client VM is diskless or disk space on the client VM is a constraint. While deploying ROW proxies in such cases provides much-needed functionality, the overhead associated with virtualized network I/O is often considered a bottleneck [9][10]. While the virtualization cost depends heavily on workloads, it has been demonstrated that the overhead is much higher with I/O-intensive workloads compared to those which are compute-intensive [10]. Unfortunately, the architectural reasons behind the I/O performance overheads are not well understood. Early research in characterizing these penalties has shown that cache misses and TLB-related overheads contribute most of the I/O virtualization cost [10][83][84]. While most of these evaluations are done using measurements, in this chapter I discuss an execution-driven, simulation-based analysis methodology with symbol annotation as a means of evaluating the performance of virtualized workloads. This methodology provides detailed information at the architectural level (with a focus on cache and TLB) and allows designers to evaluate potential hardware enhancements to reduce virtualization overhead. The methodology is applied to study the network I/O performance of Xen (as a case study) in a full-system simulation environment, using detailed cache and TLB models to profile and characterize software and hardware hotspots. By applying symbol annotation to the instruction flow reported by the execution-driven simulator, I derive function-level call flow information. I follow the anatomy of
[20]), using the SoftSDV [85] execution-driven simulator extended with symbol annotation support and a network I/O workload (iperf).
[10][83]. The rest of this chapter is organized as follows. The motivation behind the current work is described in Section 5.2. Section 5.3 describes the simulation methodology, tools, and symbol annotations. Section 5.4 details the software and architectural anatomy of I/O processing by following the execution path through the guest domain, hypervisor, and the I/O VM domain. I also provide initial results of resource scaling in Section 5.5. Section 5.6 describes related work.

[13][14][15]. A simulation-based methodology for virtual environments is also important to guide the design and tuning of architectures for virtualized workloads, and to help software systems developers identify and mitigate sources of overhead in their code. A driving application for simulation-driven analysis is I/O workloads. It is important to minimize the performance overheads of I/O virtualization in order to enable efficient workload consolidation. For example, in a typical three-tier data center environment, Web
[17]. Enabling a low-latency, high-bandwidth inter-domain communication mechanism between VM domains is one of the key architectural elements which could push this distributed services architecture evolution forward.

[86][13]; I use the SoftSDV simulator [85] as a basis for the experiments. SoftSDV not only supports fast emulation with dynamic binary translation, but also allows proxy I/O devices to connect a simulation run with physical hardware devices. It also supports multiple sessions being connected and synchronized through a virtual SoftSDV network. For cache and TLB modeling, I integrated CASPER [87], a functional simulator which offers a rich set of performance metrics and protocols to determine cache hierarchy statistics.
(Figure 5-1, (A)). The guest domain's frontend driver communicates with backend drivers through IPC calls. The virtual and backend driver interfaces are connected by an I/O channel. This I/O channel implements a zero-copy page remapping mechanism for transferring packets between multiple domains. I describe the I/O VM architecture along with the life-of-packet analysis in Section 5.4.

Figure 5-1. Full-system simulation environment with Xen execution includes (A) Xen virtual environment, (B) SoftSDV simulator, (C) physical machine.
Figure 5-2 summarizes the profiling methodology and the tools used. The following sections describe the individual steps in detail; these include (1) virtualization workload, (2) full-system simulation, (3) instruction trace, (4) performance simulation with detailed cache and TLB simulation, and (5) symbol annotation.

(Figure 5-1). In order to analyze a network-intensive I/O workload, the iperf benchmark application is executed in DomU. This environment allows us to tap into the instruction flow to study the execution flow and to plug in detailed performance models to characterize architectural overheads. The DomU guest uses a frontend driver to communicate with a backend driver inside Dom0, which controls the I/O devices. I synchronized two separate simulation sessions to create a virtual networked environment for I/O evaluation. The execution-driven simulation environment combines functional and performance models of the platform. For this study, I chose to abstract the processor performance model and focus on cache and TLB models to enable the coverage of a long period in the workload (approximately 1.2 billion instructions).
Figure 5-2. Execution-driven simulation and symbol-annotated profiling methodology. The full-system simulator operates either in functional mode or performance mode. Instruction traces and hardware events are parsed and correlated with symbols to obtain an annotated instruction trace.
in Section 5.4. An example execution flow after symbol annotation is given in Figure 5-3. These decoded instructions from the functional model are then provided to the performance model, which simulates the architectural resources and timing for the instructions executed; the resulting function-level statistics are illustrated in Figure 5-4.
Figure 5-3. Symbol annotation. Compile-time Xen symbols are collected from the hypervisor, driver, and application and annotated. The figure shows an example where symbols are annotated with "kernel" and "hypervisor".

Figure 5-4. Function-level performance statistics. The figure illustrates how performance statistics are coupled with the instruction call graph for each function. Sample statistics for L1/L2 caches and instruction and data TLBs are shown.
As shown in Figure 5-5, an instruction parser is used to parse different instruction events such as INT (interrupts, system calls), MOV CR3 (address space switch), and CALL (function call). These traces were dumped into a file with run-time virtual address information, as well as cache and TLB statistics. Instruction traces are parsed and mapped with symbol dumps to create an I/O call graph. SoftSDV system call (SSC) utilities facilitate transfer of data between the host and the simulated guest. A performance simulation model is used to collect instruction traces along with hardware events of the virtualized workload. These utilities are important as I gathered runtime symbols of kernels and applications from the proc kernel data structure to transfer to the host system (for example, /proc/kallsyms for kernel symbols). For iperf runtime symbols, we mapped the process ID with the corresponding process ID in the proc directory. These run-time symbols,
Figure 5-5. SoftSDV CPU controller execution modes: performance or functional. In functional mode, the SoftSDV simulator provides an instruction trace. In performance mode, the instruction trace is parsed to obtain hardware events such as cache and TLB misses. Compile-time symbols from the kernel, drivers, and application, along with runtime symbols from the proc file system, are collected to obtain per-function event statistics.

Symbols are annotated to keep track of the source of a function call invocation. Note that there can be duplicate symbols when the collected symbols are merged into a single file. These duplicates are removed, and the collected data is then formatted in a useful way. In some cases, it is necessary to manually resolve ambiguities in virtual address spaces through a checkpoint at a virtual address during a re-run of a simulated SoftSDV session. Linux utilities such as nm and objdump are often used to collect symbols from compile-time symbol tables. In general, any application can be compiled to provide symbol table information. In C++ applications (such as iperf), function name mangling in object code is used to provide distinct names for functions that share the same name; essentially, it adds encoded characters as a prefix and suffix of the function name. I used the demangle option of the nm utility to identify the correct functions for the iperf application.
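The core of the annotation step, attributing each traced instruction address to the nearest preceding symbol in a sorted symbol table, can be sketched as follows. The addresses and domain tags below are illustrative, and the table format is simplified relative to real nm or /proc/kallsyms output (which also carries a symbol type field):

```python
import bisect

# Simplified symbol table: (start_address, "domain:function") pairs, as
# might be collected from nm output or /proc/kallsyms and tagged with the
# component they came from. Addresses here are illustrative.
symbols = sorted([
    (0xc0100000, "kernel:start_kernel"),
    (0xc0234af0, "kernel:tcp_write_xmit"),
    (0xff100000, "hypervisor:do_grant_table_op"),
    (0xff10b2c0, "hypervisor:write_ptbase"),
])
starts = [addr for addr, _ in symbols]

def annotate(address):
    """Map a traced instruction address to the enclosing function symbol."""
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return "unknown"
    return symbols[i][1]
```

Running every address in the instruction trace through annotate() yields the per-function, per-domain call flow used to build the annotated call graphs.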


Figure 5-5 shows the simulation framework implementation used to obtain call graph information and perform cache scaling studies. As illustrated in Figure 5-5, the CPU controller layer in SoftSDV integrates with a performance or functional model. The platform configuration for this study is set to a single processor with two levels of cache (32 KB first-level data and instruction caches, 2 MB L2 cache) and with 64-entry instruction and data TLBs. The experimental setup involved multiple SoftSDV sessions connected over a virtual network. I chose to run the iperf application to study the life of an I/O packet, as it is a representative benchmark for measuring and studying network characteristics. The iperf client is executed to initiate packet transmissions from a Xen environment.

Figure 5-6 shows an overview of the different stages which characterize the life of a packet between VM domains. Typically, a network packet in the Xen environment goes through the following four stages in its life cycle after the application execution:
1. Unprivileged domain: packet build and memory allocation.
2. Page transfer mechanism: a zero-copy mechanism to map pages in the virtual address space of the Dom0/DomU domains.
3. Timer interrupts: context switch between the hypervisor and domains.
4. Privileged domain: forwarding the I/O packet down the wire and sending an acknowledgment back to the guest domain.

Figure 5-6. Life of an I/O packet: (a) application execution, (b) unprivileged domain, (c) grant table mechanism (switch to hypervisor), (d) timer interrupt, (e) privileged domain.

[27]. An interface in Xen to allocate a socket buffer in the networking layer (alloc_skb_from_cache) is identified. The frontend driver uses the grant table mechanism provided by the hypervisor to transfer the buffer to Dom0. The functions and the associated instruction counts for the overall life of the packet in DomU include socket lock, copy data from user space to kernel space, allocate page from free list, and release socket lock (Figure 5-7). Note that the instruction count statistics are shown in chronological order with function entry points as markers. I removed some repeating
Figure 5-7. DomU call graph: socket allocation (alloc_skb_from_cache), user-kernel data copy (copy_from_user), and finally TCP transmit write (tcp_write_xmit).
Figure 5-8 demonstrates the execution flow from DomU to the hypervisor through the grant table mechanism.

Figure 5-8. TCP transmit (tcp_transmit_skb) and grant table invocation (gnttab_claim_grant_reference).

do_upcall to start processing the event. The functions invoked during the timer interrupt are shown in Figure 5-9.
Figure 5-9. Annotated call graph showing the context switch between the hypervisor and the Dom0 VM: timer interrupts (write_ptbase).

The execution in Dom0 is shown in Figure 5-10 (since the complete execution at this stage is long, snippets covering the basic flow and highlighting the important functions are shown). Note that the grant table mechanism is used to map guest pages into the Dom0 address domain on the backend receiving side. Then the packet is sent to the bridge code, after which it is sent out on the wire. Once complete, the host mapping is destroyed and an event is sent on the event channel to the guest domain. It is interesting to note that the processor TLB is flushed while destroying the grant. This is done by writing the CR3 register (the x86 page table pointer) through the write_cr3 function. I describe the impact of this TLB flush in Section 5.4.2.
Figure 5-10. Life of a packet in Dom0: accessing the granted page (create_grant_host_mapping), ethernet transmission (e100_tx_clean), destroying the grant mapping (destroy_grant_host_mapping), and event notification back to the hypervisor (evtchn_send).
Figure 5-11 shows an execution snippet where TLB flushes and misses are plotted as a function of simulated instructions retired. The figure shows that there is a high correlation between TLB misses, context switches, and TLB flush events. An execution run of a VM during a period with no context switches or TLB flushes results in negligible TLB misses. Whenever TLB flushing events happen, there is a surge of TLB misses. This correlates well with the observations of TLB miss overhead in earlier studies. Figure 5-12 shows the increased number of TLB misses associated with the VM switches in a cumulative graph. I observe that there is a surge of TLB misses associated with each VM switch. Execution segments without VM switches show flat areas with few TLB flushes. Figure 5-13 depicts a typical VM switch scenario. The execution moves from one VM to another through a context switch. The CR3 value is changed to point to the new VM context. This triggers the hardware to flush all the TLBs to avoid invalid translations. But this comes with the cost of a TLB miss every time a new page is touched, both for code and data pages. Another scenario is the explicit TLB flushes done by the Xen hypervisor as part of the data transfer between VMs. This is an artifact of the current I/O VM implementation, as explained in the previous section. In order to revoke a grant, a complete TLB flush
Figure 5-11. Impact of TLB flush and context switch. The x-axis shows a slice of the total number of instructions retired during an execution run of the iperf application. The y-axis shows instruction and data TLB miss events, normalized to TLB flushes and context switches.

Figure 5-12. Correlation between VM switching and TLB misses. The x-axis shows a segment of the total number of instructions retired. The y-axis (left) represents VM switching, where "1" indicates a VM switch. The y-axis (right) shows cumulative TLB misses.
Figure 5-13. TLB misses after a VM context switch. Instruction and data TLB misses are plotted on the y-axis against a segment of instructions retired on the x-axis. The context switch between virtual machines causes a TLB flush, which increases the number of TLB misses.

is executed explicitly, which also creates TLB performance issues similar to a VM switch. Figure 5-14 demonstrates the code flow and the TLB impact. Figure 5-15 shows the impact of context switches on cache performance. The vertical lines mark VM switch events obtained through symbol annotation, and the plotted line shows the cumulative cache miss events. Note that the cache miss rate increases are also correlated with VM switch events.

[88]. TLB and cache statistics are measured for the transfer of approximately 25 million TCP/IP packets.
Figure 5-14. TLB misses after a grant destroy. The x-axis shows a segment of instructions retired, and the y-axis represents data and instruction TLB misses.

Figure 5-15. Impact of VM switches on cache misses. The x-axis shows a segment of instructions retired. The y-axis (left) represents VM context switches through the vertical lines between 0 and 1. The context switch between virtual machines causes a TLB flush, which increases the number of L2 cache misses (y-axis, right).
Figure 5-16. L2 cache performance when the L2 cache size is scaled from 2 MB to 32 MB. In the plot, the L2 cache miss ratio is normalized to the 2 MB L2 cache size. The data points are collected when the iperf client is executed in a guest VM (transmit of I/O packets).

Figure 5-17. Data and instruction TLB performance when the TLB size is scaled between 64 and 1024 entries. The TLB miss ratio is normalized to a TLB size of 64 entries. The data points are collected when the iperf client is executed in a guest VM (transmit of I/O packets).
Figure 5-16 shows the effect of scaling the L2 cache. The performance model is configured to simulate a two-level cache: 32 KB L1 (split data and instruction) and a 2 MB unified L2 cache. The primary goal is to understand the cache sensitivity of the I/O virtualization architecture in the context of network I/O. Note that increasing the L2 cache size up to 4 MB provided good performance scaling, after which the increase in performance was minimal. Beyond 8 MB, the rate of reduction in miss rates is small. I can attribute the reduced miss rates of the 8 MB cache to the inclusion of needed pages from the hypervisor, Dom0, and DomU.

Figure 5-18. L2 cache performance when the L2 cache size is scaled from 2 MB to 32 MB. The L2 cache miss ratio is normalized to the 2 MB L2 cache size. The data points are collected when the iperf server is running in a guest VM (receive of I/O packets).

Figure 5-17 shows the TLB performance scaling impact for data and instruction TLBs. As shown in the figure, with an increase in the size of the data TLB, the miss ratio decreases for sizes up to 128 entries. For larger sizes, the miss ratio is nearly constant. The ITLB miss rate decreases slightly, while the DTLB rate shows a sharper decrease from 64 to 128 entries. It can be inferred that a TLB size of 128 entries is sufficient to incorporate all address translations during the TLB stage. Increasing the TLB size is not a very effective enhancement in this scenario. This is because, as observed in Figures
Figure 5-19. Data and instruction TLB performance when the TLB size is scaled between 64 and 1024 entries. The TLB miss ratio is normalized to a TLB size of 64 entries. The data points are collected when the iperf server is running in a guest VM (receive of I/O packets).

5-12 and 5-13, there are substantial numbers of TLB flushes during grant revocation and VM switches, which invalidate all TLB entries. A large TLB size does not help mitigate the effect of the compulsory TLB misses that follow a flush. Similarly, the cache and TLB scaling studies were performed on the receive side. Results are given in Figures 5-18 and 5-19, respectively.

[89][20]. Performance monitoring tools have been deployed to gauge application performance in virtualized environments [9][83][10]. Traditional network optimizations such as TCP/IP checksum offload and TCP segmentation offload are being used to improve network performance of Xen-based virtual machines [10]. In addition, a faster I/O channel for transferring network packets between guest and driver domains is being studied [10]. These studies lack micro-architectural overhead analysis of the virtualized environment.
[90][36].
[91]. In multi-core processors, a virtual address translation stored in the page table entries of a processor needs to be propagated to the TLBs of all processors. Many architectures resort to flushing the entire contents of a remote TLB to enforce such coherence, in a process often called "TLB shootdown". Consider an example of network I/O communication in a virtualized environment. The grant table mechanism adopted by the Xen VMM is based on modifying access protection bits of a page table entry shared between the guest and privileged domains. Therefore, network I/O communication between guest and privileged domains may result in TLB shootdowns. The problem with the shootdown approach is that it works at a coarse coherence granularity by invalidating all entries of a TLB. Because not all TLB entries must be invalidated to enforce consistency (only those that are affected by the protection changes), this coarse-grain approach to enforcing coherence can result in the
Section 6.2. I introduce an overview of the interprocessor interrupt mechanism used by Linux on x86-based processors to implement TLB shootdowns in Section 6.3. Section 6.4 explains the page sharing mechanism in the Xen hypervisor. Section 6.5 provides details of experiments to measure I/O overhead, evaluate hardware support to tag hypervisor pages, and evaluate the potential for selective flushing in interprocessor interrupts. Section 6.6 describes the related work.

6.2.1 Introduction

The translation look-aside buffer (TLB) is an on-chip cache to expedite virtual-to-physical address translation. In the absence of a TLB, the page table data structure is used to access the physical page corresponding to a virtual address; this process of translating virtual into physical addresses is expensive. Instead, processors rely on the TLB and locality of reference to achieve fast address translation. This process can be summarized as follows. An application run generates a virtual memory address to access an instruction or data from memory. The CPU looks up the virtual address by indexing the TLB. If the TLB access is a hit, then the page table entry present in the TLB is used to access the physical page. In multi-core systems, typically each processor has its own TLB in order to achieve fast lookup times. This creates a challenge in managing multiple translations cached across multiple TLBs, and thus it is important to maintain TLB coherency. Unlike processor data and instruction caches, TLB coherency is implemented by the operating system. This is accomplished by the operating system issuing, for any update to a page table entry, a TLB invalidation operation.
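The cost of coarse-grained invalidation can be seen in a minimal TLB model: after a full flush, every touched page pays a compulsory miss even when its translation did not change. The sizes and the FIFO replacement policy below are arbitrary choices for the sketch, not a model of any particular processor:

```python
from collections import OrderedDict

class TLB:
    """Minimal TLB model: page-number -> frame cache with FIFO replacement."""
    def __init__(self, entries=64):
        self.entries = entries
        self.cache = OrderedDict()
        self.misses = 0

    def lookup(self, page, page_table):
        if page in self.cache:
            return self.cache[page]        # hit
        self.misses += 1                   # miss: walk the page table, refill
        if len(self.cache) >= self.entries:
            self.cache.popitem(last=False) # FIFO eviction
        self.cache[page] = page_table[page]
        return self.cache[page]

    def flush(self):
        """Full invalidation, e.g. on a CR3 write or a coarse shootdown."""
        self.cache.clear()

page_table = {p: p + 100 for p in range(16)}
tlb = TLB()
for p in range(16):
    tlb.lookup(p, page_table)              # 16 compulsory misses
warm = tlb.misses
for p in range(16):
    tlb.lookup(p, page_table)              # warm TLB: all hits
tlb.flush()                                # coarse invalidation
for p in range(16):
    tlb.lookup(p, page_table)              # the same pages miss again
```

A selective invalidation of only the affected entries would avoid the second round of misses, which is the case for finer-grained shootdown made in this chapter.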



[91]. TLB entries rely on information provided and updated through the page table. A TLB flush operation may result in a TLB miss, a page table walk, a possible page fault (if the page is not in memory), and a TLB refill. A hardware state machine walks through the page table to refill the TLB entry.

Figure 6-1. The x86 page table for small pages: the paging mechanism in the x86 architecture is shown. The control register (CR3) is loaded for the currently scheduled process. The virtual address from an application is divided to obtain the page directory entry (PDE), page table entry (PTE), and page offset.

Figure 6-1 illustrates the translation of a virtual address into a physical address in the x86 architecture. The sequence of accessing a physical page from a virtual (linear) address is as follows: (1) the virtual address is looked up in the TLB; (2) if a TLB translation is not available, the virtual address is translated into a physical address through the page table to retrieve the page content; (3) if the virtual address is not present in the page table, a page fault is invoked. The page table is a hierarchical structure used to index and retrieve the final location of the physical address corresponding to the virtual address. In addition, it provides entries to check the access privileges and mode of invocation of the page. This is to prevent other
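For 4 KB pages, the PDE/PTE/offset split of Figure 6-1 can be sketched directly from the bit layout (a minimal illustration; the field widths are the standard 10/10/12 split of 32-bit x86 paging):

```python
def split_x86_vaddr(vaddr):
    """Split a 32-bit virtual address for 4 KB (small) pages:
    bits 31-22 index the page directory (PDE), bits 21-12 index
    the page table (PTE), and bits 11-0 are the page offset."""
    pde = (vaddr >> 22) & 0x3FF      # 10-bit page directory index
    pte = (vaddr >> 12) & 0x3FF      # 10-bit page table index
    offset = vaddr & 0xFFF           # 12-bit offset into the page
    return pde, pte, offset

# Example: PDE 1, PTE 3, offset 0xABC
print(split_x86_vaddr(0x00403ABC))
```

CR3 supplies the physical base of the page directory; the PDE then locates a page table, whose PTE yields the physical frame that the offset indexes into.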


[92]. In the multi-core case, it may be the case that either the page table is shared between two CPUs or page entries are shared between different CPUs.

[92]. Sending such a request involves programming special registers in this APIC (the ICR, or interrupt command register) to select, among other things, the destination of an interrupt and its arguments. The format and entries of the ICR register can be obtained from Intel's software developer specifications [92]. Interprocessor interrupts are initiated after a write to the APIC's ICR register. In an SMP architecture, each processor has its own local APIC to invoke or act on an IPI. A local APIC unit indicates successful dispatch of an IPI by resetting the delivery-status bit in the interrupt command register (ICR). An example of an IPI is flushing the TLB contents when a processor modifies an entry in a page table data structure that is shared with other processors. This allows synchronization between the address translation procedures of SMP processors. Another example of an IPI is rescheduling a new task on an SMP machine. Consider the example of scheduling the "idle" task. To implement this, first all interrupts are enabled in the SMP processors and then the "hlt" instruction is issued to all processors. Whenever an interrupt is received from a system device such as the keyboard, a CPU is awakened through the interrupt. In SMP, when one CPU is awakened on such an interrupt, an IPI is sent to the other CPUs through a write to the APIC's ICR register (e.g., in the Linux O/S the "send_IPI_mask" function performs the task of writing to the ICR register). Figure 6-2 shows an example of the IPI invocation mechanism in a two-processor SMP system. Each CPU has its own local APIC unit. CPUs can store the interrupt vector


Figure 6-2. Interprocessor interrupt mechanism in the x86 architecture: an interrupt is initiated with a write to the interrupt command register (ICR). The ICR is a memory-mapped register of the APIC system.

and the identifier of the target processor in the ICR. On a write to the ICR, a message is sent to the target processor via the system bus. Figure 6-2 shows the contents of the ICR register used to identify the destination CPU (0x1000000) and the kind of IPI (e.g., the content of the ICR register for TLB invalidation is 0x8fd). The default location of the APIC registers is at 0xfee00000 in physical memory. The sequence of events to invoke an IPI for TLB invalidation is as follows:
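The 0x8fd value mentioned above decomposes according to the ICR bit layout; the following is a sketch assuming the field layout documented in the Intel software developer's manual [92] (the helper name is illustrative):

```python
# Low dword of the ICR (Intel SDM layout): bits 0-7 hold the vector,
# bits 8-10 the delivery mode (000 = fixed), bit 11 the destination
# mode (1 = logical), bits 18-19 the destination shorthand.
def icr_low(vector, logical=True, shorthand=0):
    """Compose the low 32 bits of an ICR write."""
    return (shorthand << 18) | (int(logical) << 11) | vector

# 0x8fd from the text: fixed delivery, logical destination mode,
# vector 0xfd (the kernel's TLB-invalidate vector in this setup).
assert icr_low(0xfd) == 0x8fd
```

The destination CPU identifier (0x1000000 in Figure 6-2) is written separately into the high dword's destination field before the low-dword write triggers the IPI.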



6.5.1 Grant Table Performance

The performance model used in Chapter 5 captures functional behavior but is not timing-accurate. This choice is motivated by the fact that timing-accurate models are considerably slower than functional models. A conclusion drawn from the previous chapter was that the split I/O implementation results in additional TLB flushes and misses. It is also important to characterize the impact of this mechanism on the timing and performance of a virtualized system. This subsection addresses this issue through a profiling-based analysis of a modified version of Xen. To accurately evaluate the overhead in the life cycle of a network packet, in this chapter I instrumented the Xen source code to evaluate the percentage of cycles consumed during the grant mechanism. This experiment is performed on both Intel Core 2 Duo and Pentium-based machines. Table 6-1 provides the percentage of cycles consumed due to grant table operations by the hypervisor, averaged over the period where the selected network application benchmark (iperf) executed. As inferred from Table 6-1, grant operations consume a significant amount of resources: approximately 20% of CPU cycles during a packet transfer for both the Core 2 Duo and Pentium III CPUs. While grant map and unmap statistics are gathered during the transmit of the iperf packet, grant copy or transfer statistics show


[10], similar to the result shown in Table 6-1.

Table 6-1. Grant table overhead summary
  Function call          %cycles (Core 2 Duo)   %cycles (Pentium III)
  gnttab_map             7.13                   6.18
  gnttab_unmap           4.28                   3.56
  gnttab_transfer/copy   8.20 (copy)            10.39 (transfer)

The statistics are obtained by profiling the grant operations in the Xen hypervisor through the Xentrace tool [93] and accessing the system cycle counts through the time-stamp register (RDTSC).

Figure 6-3. Experimental setup: two Simics/SoftSDV sessions are synchronized using a virtual network. The iperf application deployed on a Linux O/S is executed in one session. iperf is deployed in DomU in a Xen hypervisor environment. Packets are sent to/from DomU from/to the Linux O/S.

I studied the potential impact of a TLB optimization to make global hypervisor pages persistent in TLBs. In the absence of TLB tagging, on a TLB flush all translations
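The RDTSC-based profiling methodology reduces to summing per-invocation timestamp-counter deltas and normalizing by the cycles elapsed in the measurement window; a sketch with hypothetical sample data (in the actual experiment the deltas come from RDTSC reads inserted around Xen's grant table entry points):

```python
def percent_cycles(grant_deltas, total_cycles):
    """Percentage of CPU cycles spent inside grant operations, given
    the timestamp-counter delta measured around each invocation."""
    return 100.0 * sum(grant_deltas) / total_cycles

# Hypothetical: three grant-map invocations within a 100,000-cycle window.
print(round(percent_cycles([2400, 2600, 2130], 100_000), 2))
```

Averaging such window percentages over the benchmark run yields per-operation figures of the kind reported in Table 6-1.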


As shown in Figure 6-4, such an optimization indeed has the potential to substantially reduce DTLB misses (and, to a lesser extent, reduce ITLB misses). It is more effective than increasing the TLB size because global-bit tagging allows a subset of the translations to remain cached during switches and grant revocations. The experimental setup is shown in Figure 6-3.

Figure 6-4. Impact of tagging TLB entries with a global bit to prevent TLB flushes for hypervisor pages. In the plot, the TLB miss ratio is normalized to a TLB size of 64 entries. The data points are collected when the iperf client is executed in a guest VM (transmit of I/O packets).

The importance of performance isolation and VM-level QoS is a growing research area, especially with the introduction of multi-core processors, which share platform resources such as caches, TLBs and memory. My work is further extended to study the impact of quality of service on the TLB in [94].
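The global-bit optimization can be sketched as a small extension of a functional TLB model (hypothetical code; the actual evaluation used a modified Simics TLB module):

```python
class TaggedTLB:
    """TLB whose entries carry a 'global' bit. A non-global flush
    (e.g., on a context switch or grant revocation) evicts only the
    untagged entries, so hypervisor translations stay cached."""
    def __init__(self):
        self.entries = {}            # vpn -> (pfn, is_global)

    def insert(self, vpn, pfn, is_global=False):
        self.entries[vpn] = (pfn, is_global)

    def flush(self, include_global=False):
        if include_global:
            self.entries.clear()     # full flush: everything goes
        else:                        # selective flush: keep global entries
            self.entries = {v: e for v, e in self.entries.items() if e[1]}

ttlb = TaggedTLB()
ttlb.insert(0x10, 7, is_global=True)   # hypervisor page
ttlb.insert(0x20, 9)                   # guest page
ttlb.flush()
print(sorted(ttlb.entries))            # only the global entry survives
```

After a flush, only the tagged hypervisor translation remains resident, which is why the effect in Figure 6-4 is stronger than simply enlarging the TLB.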


Figure 6-5. Page sharing in a multicore environment: an example of a potential reason for invocation of an interprocessor interrupt between two CPUs is shown.

To check the consistency between the page table and TLB contents at the invocation of an interprocessor interrupt, the sequence of steps, as shown in Figure 6-5, is as follows:


Figure 6-6. Simics model to capture inter-processor interrupts: Simics exports an API to register different performance models. These performance models can capture important events such as IPIs through the Simics API (SIM_hap_add_callback). The Simics workload is abstracted to represent an O/S or a hypervisor.

The TLB model is generally initialized and loaded when Simics is booted. Simics provides an API to register a callback function to capture and act on an IPI event in the TLB module. An example functionality of the IPI callback function could be a counter to count the number of IPIs during the execution of a workload. Figure 6-6 illustrates that the Simics API (SIM_hap_add_callback) is used to register an event related to a core system device (in this case the APIC). This callback function is used to modify the semantics of the TLB flush
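The counting callback can be sketched in the style of a Simics Python script; only SIM_hap_add_callback is taken from the text, while the hap name, object names, and callback signature details are illustrative assumptions:

```python
ipi_count = 0

def ipi_hap(user_arg, obj, *hap_args):
    """Invoked by the simulator each time the APIC model raises an IPI."""
    global ipi_count
    ipi_count += 1

# Inside a Simics session this would be registered roughly as:
#   SIM_hap_add_callback("Core_Interprocessor_Interrupt", ipi_hap, None)
# (hap name illustrative). Here the callback is driven directly to
# show the counting behavior:
for _ in range(3):
    ipi_hap(None, "apic0")
print(ipi_count)
```

A richer callback could, as described above, alter the flush semantics rather than merely count invocations.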


Table 6-2 shows the number of IPIs for transmit and receive of network I/O packets during an execution run of the iperf benchmark. The potential benefit of not flushing the TLB during an interprocessor interrupt can be negated if the O/S issues a normal flush due to scheduling another process. For this experiment, I created a network of two Simics simulated machines. For each experiment, the Simics sessions are warmed up for 20 million instructions. The statistics shown are collected for a run of 500 million instructions. Note that the number of flushes is higher in the receive scenario, when the guest virtual machine is receiving packets. To evaluate consistency during the IPI, a page table parser is implemented in the Simics TLB module to look up the TLB entries in the page table.

Table 6-2. TLB flush statistics with and without IPI flush optimization
  Transmit/Receive   IPI flush   Normal flush
  Transmit           485         31670
  Receive            602         57718

To further understand the impact of the IPI mechanism, I studied the impact of scaling the TLB size for instruction and data TLBs. For these experiments, domain-1 is affinitized to CPU1 while domain-0 does not have any CPU affinity. Table 6-3 and Table 6-4 provide the TLB miss statistics for instruction and data caches during an application run of iperf for 50 million instructions. While the data TLB does not show significant improvement in performance for either CPU, the instruction TLB for CPU1 shows an improvement of 1.2 to 2.4%. In addition, the number of misses does not change significantly beyond a TLB size of 128 entries. While the number of TLB misses is reduced when a complete flush is avoided during the invocation of an interprocessor interrupt, the potential performance improvement is negated by complete flushes due to local causes (e.g., scheduling of a new process).
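The negation effect is visible directly in the Table 6-2 numbers: IPI-triggered flushes are a small fraction of all flushes, which bounds the benefit of suppressing them. A quick check:

```python
def ipi_flush_fraction(ipi_flushes, normal_flushes):
    """Fraction of all TLB flushes that are attributable to IPIs."""
    return ipi_flushes / (ipi_flushes + normal_flushes)

# Numbers from Table 6-2.
print(round(100 * ipi_flush_fraction(485, 31670), 1))   # transmit
print(round(100 * ipi_flush_fraction(602, 57718), 1))   # receive
```

Only about 1-1.5% of flushes are IPI-induced in either direction, so even a perfect selective-flush scheme leaves the vast majority of flushes (and their refill misses) in place.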


Table 6-3. Instruction TLB miss statistics with and without IPI flush optimization for the receive iperf benchmark
  IPI mode              TLB entries:
                        64       128      256      512
  IPI flush (CPU0)      112848   100687   100657   100687
  IPI flush (CPU1)      347      4506     4506     4506
  No IPI flush (CPU0)   112811   100604   100602   100602
  No IPI flush (CPU1)   347      4455     4455     4455

Table 6-4. Data TLB miss statistics with and without IPI flush optimization for the receive iperf benchmark
  IPI mode              TLB entries:
                        64       128      256      512
  IPI flush (CPU0)      389221   454949   450143   450118
  IPI flush (CPU1)      55369    33636    33628    33628
  No IPI flush (CPU0)   389186   454916   450051   450008
  No IPI flush (CPU1)   55369    33580    33554    33554

[95]. Performance evaluation of shared hardware resources between multi-core processors with virtual machine environment-based workloads for server consolidation has recently been addressed [96]. A simulation-based approach has been used to maximize shared memory access between multiple VMs [96]; however, this study approximates the use of virtualization: the model does not consider the execution of hypervisors. All these analyses lacked a complete understanding of the network stack. Specialized virtual machine containers (with minimal functionality, such as I/O only) are being used to study I/O scalability [97]. Solarflare [11] has an approach to bypass the hypervisor for network I/O. The approach is based on providing hardware support for a virtual NIC (vNIC) per virtual machine. The virtual NIC controller for I/O acceleration communicates with the network driver interface offered by guest virtual machines. The network drivers inside the guest virtual machines communicate directly with the interface offered by the virtual NIC controller (bypassing the hypervisor).


[98]. Many approaches have been considered in the past to improve TLB performance. When a processor tries to modify a TLB entry, it locks the page table to prevent other processors from modifying it. It flushes the local TLB entries. TLB operations (such as TLB refills) are queued for update. The processor sends an IPI and spins until all other processors are done. Finally, it unlocks the page table. These steps consume cycles in the TLB shootdown. Many improvements have been suggested to improve this aspect of TLB performance. These include:

[99]. An example of this is the upgrade of a page from read-only to read-write. The Linux O/S does support lazy TLB flushing.
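The shootdown sequence described above can be sketched as a simplified two-CPU model (locking and the IPI spin-wait are abstracted into direct calls; the function name is illustrative):

```python
def tlb_shootdown(initiator_tlb, other_tlbs, page_table, vpn, new_pfn):
    """Classic shootdown: lock the page table, update the entry, flush
    the local TLB entry, IPI the remote CPUs, wait until each has
    flushed its copy, then unlock the page table."""
    # 1. lock the page table (abstracted); 2. update the entry
    page_table[vpn] = new_pfn
    # 3. flush the stale translation from the local TLB
    initiator_tlb.pop(vpn, None)
    # 4. send an IPI; 5. each remote CPU flushes and acknowledges
    for remote_tlb in other_tlbs:
        remote_tlb.pop(vpn, None)
    # 6. unlock the page table (abstracted)

cpu0_tlb, cpu1_tlb = {0x10: 7}, {0x10: 7}   # both CPUs cache vpn 0x10
pt = {0x10: 7}
tlb_shootdown(cpu0_tlb, [cpu1_tlb], pt, 0x10, 9)
print(cpu0_tlb, cpu1_tlb, pt)
```

After the call, no TLB holds the stale translation and the page table holds the new one; the cycle cost criticized above comes from the lock hold time and the initiator spinning on remote acknowledgements, which lazy schemes such as [99] defer.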



[100] (Nehalem) to improve the cost of virtual-to-physical address translation. In this scenario, this dissertation motivates research on models that capture the behavior of similar new on-chip resources. These models can be augmented and evaluated further to guide system designers toward improved system performance. Future internet usage is predicted to include new usage models that are based on user connectivity, such as social networking, collaboration and on-line gaming. There are instances where these applications are desired to be encapsulated in VM environments



[1] J. E. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes, Morgan Kaufmann Publishers, May 2005.
[2] M. Litzkow, M. Livny, and M. W. Mutka, "Condor: a hunter of idle workstations," in Proceedings of the 8th International Conference on Distributed Computing Systems, Jun 1988, pp. 104-111.
[3] B. Callaghan, NFS Illustrated, Addison-Wesley Longman Ltd., Essex, UK, 2000.
[4] S. Adabala, V. Chadha, P. Chawla, R. J. Figueiredo, J. A. B. Fortes, I. Krsul, A. Matsunaga, M. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu, "From virtualized resources to virtual computing grids: the In-VIGO system," Future Generation Computing Systems, special issue on Complex Problem-Solving Environments for Grid Computing, vol. 21, no. 6, Apr 2005.
[5] K. Keahey, I. Foster, T. Freeman, X. Zhang, and D. Galron, "Virtual workspaces in the grid," in Proceedings of the Euro-Par Conference, Lisbon, Portugal, Sep 2005.
[6] A. Sundaraj and P. A. Dinda, "Towards virtual networks for virtual machine grid computing," in 3rd USENIX Virtual Machine Research and Technology Symp., May 2004.
[7] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus open-source cloud-computing system," in Cloud Computing and Its Applications Workshop (CCA'08), Chicago, IL, October 2008.
[8] VMware, "Merrill Lynch to standardize on VMware virtual machine software [Online]," World Wide Web electronic publication, Available: lynch.html
[9] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel, "Diagnosing performance overheads in the Xen virtual machine environment," in VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, New York, NY, USA, 2005, pp. 13-23, ACM.
[10] A. Menon, A. L. Cox, and W. Zwaenepoel, "Optimizing network virtualization in Xen," in ATEC '06: Proceedings of the Annual Conference on USENIX '06 Annual Technical Conference, Berkeley, CA, USA, 2006, USENIX Association.
[11] D. Chisnall, The Definitive Guide to the Xen Hypervisor, Prentice Hall Press, Upper Saddle River, NJ, USA, 2007.
[12] R. Goldberg, "Survey of virtual machine research," IEEE Computer Magazine, vol. 7, no. 6, pp. 34-45, 1974.


[13] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: a full system simulation platform," IEEE Computer, 2002.
[14] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta, "Complete computer system simulation: the SimOS approach," IEEE Parallel and Distributed Technology, vol. 3, pp. 34-43, 1995.
[15] J. J. Yi and D. J. Lilja, "Simulation of computer architectures: simulators, benchmarks, methodologies, and recommendations," IEEE Transactions on Computers, vol. 55, no. 3, pp. 268-280, 2006.
[16] J. Sugerman, G. Venkitachalam, and B. H. Lim, "Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor," in Proceedings of the USENIX Annual Technical Conference, Jun 2001.
[17] S. Hand, A. Warfield, K. Fraser, E. Kotsovinos, and D. Magenheimer, "Are virtual machine monitors microkernels done right?," in HOTOS '05: Proceedings of the 10th Conference on Hot Topics in Operating Systems, Berkeley, CA, USA, 2005, USENIX Association.
[18] S. Kumar, H. Raj, K. Schwan, and I. Ganev, "Re-architecting VMMs for multicore systems: the sidecore approach," in Workshop on the Interaction between Operating Systems and Computer Architecture, 2007.
[19] K. Krewell, "Best servers of 2004: multicore is norm," Microprocessor Report, 2005.
[20] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, New York, NY, USA, 2003, pp. 164-177, ACM.
[21] E. Kotsovinos, T. Moreton, I. Pratt, R. Ross, K. Fraser, S. Hand, and T. Harris, "Global-scale service deployment in the XenoServer platform," in Proceedings of the First Workshop on Real, Large Distributed Systems (WORLDS '04), Dec 2004.
[22] K. Suzaki, T. Yagi, K. Iijima, and N. A. Quynh, "OS Circular: internet client for reference," in LISA '07: Proceedings of the 21st Conference on Large Installation System Administration, Berkeley, CA, USA, 2007, pp. 105-116, USENIX Association.
[23] M. G. Baker, J. Hartman, M. Kupfer, K. W. Shirriff, and J. Ousterhout, "Measurements of a distributed file system," in Proceedings of the 13th Symposium on Operating Systems Principles, 1991.
[24] J. H. Howard, M. L. Kazar, S. G. Menees, A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West, "Scale and performance in a distributed file system," ACM Transactions on Computer Systems, vol. 6, Feb 1988.


[25] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz, "NFS version 3: design and implementation," in USENIX Summer, Boston, MA, Jun 1994.
[26] A. Ganguly, D. Wolinsky, P. O. Boykin, and R. Figueiredo, "Decentralized dynamic host configuration in wide-area overlay networks of virtual workstations," in Workshop on Large-Scale and Volatile Desktop Grids (PCGrid), Long Beach, CA, Mar 2007, pp. 1-8.
[27] P. Apparao, S. Makineni, and D. Newell, "Characterization of network processing overheads in Xen," in VTDC '06: Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, Washington, DC, USA, 2006, p. 2, IEEE Computer Society.
[28] J. S. Chase, D. E. Irwin, L. E. Grit, J. D. Moore, and S. E. Sprenkle, "Dynamic virtual clusters in a grid site manager," in HPDC '03: Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing, Washington, DC, USA, 2003, pp. 90-101, IEEE Computer Society.
[29] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of Supercomputing Applications, vol. 15, no. 3, pp. 200-222, Apr 2001.
[30] A. Ganguly, A. Agrawal, P. Boykin, and R. J. Figueiredo, "IP over P2P: enabling self-configuring virtual IP networks for grid computing," in IEEE International Parallel & Distributed Processing Symposium (IPDPS), Rhodes Island, Greece, Apr 2006.
[31] S. M. Larson, C. D. Snow, M. R. Shirts, and V. S. Pande, "Folding@Home and Genome@Home: using distributed computing to tackle previously intractable problems in computational biology," in Computational Genomics, Horizon Press, 2002.
[32] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer, "SETI@home: an experiment in public-resource computing," Communications of the ACM, vol. 45, no. 11, pp. 56-61, 2002.
[33] J. J. Kistler and M. Satyanarayanan, "Disconnected operation in the Coda file system," ACM Transactions on Computer Systems, vol. 6, Feb 1992.
[34] B. S. White, A. S. Grimshaw, and A. Nguyen-Tuong, "Grid-based file access: the Legion I/O model," in High Performance Distributed Computing, Pittsburgh, PA, Aug 2000, pp. 165-174.
[35] R. Uhlig, G. Neiger, D. Rodgers, A. L. Santoni, F. C. M. Martins, A. V. Anderson, S. M. Bennett, A. Kagi, F. H. Leung, and L. Smith, "Intel virtualization technology," Computer, vol. 38, no. 5, 2005.


[36] T. Garfinkel, K. Adams, A. Warfield, and J. Franklin, "Compatibility is not transparency: VMM detection myths and realities," in HOTOS '07: Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems, Berkeley, CA, USA, 2007, pp. 1-6, USENIX Association.
[37] M. Rosenblum and T. Garfinkel, "Virtual machine monitors: current technology and future trends," IEEE Computer, vol. 38, pp. 39-47, 2005.
[38] D. C. Anderson, J. S. Chase, and A. M. Vahdat, "Interposed request routing for scalable network storage," in Symposium on OSDI, San Diego, CA, Oct 2000.
[39] D. J. Blezard, "Multi-platform computer labs and classrooms: a magic bullet?," in SIGUCCS '07: Proceedings of the 35th Annual ACM SIGUCCS Conference on User Services, New York, NY, USA, 2007, pp. 16-20, ACM.
[40] J. Watson, "VirtualBox: bits and bytes masquerading as machines," Linux J., vol. 2008, no. 166, p. 1, 2008.
[41] S. Bhattiprolu, E. W. Biederman, S. Hallyn, and D. Lezcano, "Virtual servers and checkpoint/restart in mainstream Linux," SIGOPS Operating Systems Review, vol. 42, no. 5, pp. 104-113, 2008.
[42] J. Dike, "A user-mode port of the Linux kernel," in Proc. of the 4th Annual Linux Showcase and Conference, Atlanta, GA, 2000.
[43] F. Bellard, "QEMU, a fast and portable dynamic translator," in ATEC '05: Proceedings of the Annual Conference on USENIX Annual Technical Conference, Berkeley, CA, USA, 2005, pp. 41-46, USENIX Association.
[44] "x86 architecture [Online]," World Wide Web electronic publication, Available:
[45] A. S. Tanenbaum, J. N. Herder, and H. Bos, "Can we make operating systems reliable and secure?," Computer, May 2006, pp. 44-51, IEEE Computer Society.
[46] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, "Accelerating two-dimensional page walks for virtualized systems," in ASPLOS XIII: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, New York, NY, USA, 2008, pp. 26-35, ACM.
[47] P. Willmann, S. Rixner, and A. L. Cox, "Protection strategies for direct access to virtualized I/O devices," in ATC '08: USENIX 2008 Annual Technical Conference, Berkeley, CA, USA, 2008, pp. 15-28, USENIX Association.
[48] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt, "QoS policies and architecture for cache/memory in CMP platforms," in ACM SIGMETRICS, Jun 2007.


[49] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, F. Huici, and L. Mathy, "Fairness issues in software virtual routers," in PRESTO '08: Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow, New York, NY, USA, 2008, pp. 33-38, ACM.
[50] A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," in Symposium on Operating Systems Principles, 2001, pp. 174-187.
[51] A. R. Butt, T. Johnson, Y. Zheng, and Y. C. Hu, "Kosha: a peer-to-peer enhancement for the network file system," in Proceedings of IEEE/ACM SC2004, Nov 2004.
[52] V. Srinivasan and J. C. Mogul, "Spritely NFS: experiments with cache-consistency protocols," in Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, Dec 1989, pp. 45-57.
[53] R. Macklem, "Not quite NFS, soft cache consistency for NFS," in Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, Jan 1994.
[54] D. Hildebrand and P. Honeyman, "Exporting storage systems in a scalable manner with pNFS," in Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, Washington, DC, USA, 2005, pp. 18-27, IEEE Computer Society.
[55] A. Traeger, A. Rai, C. P. Wright, and E. Zadok, "NFS file handle security," Tech. Rep., Stony Brook University, 2004.
[56] R. J. Figueiredo, P. Dinda, and J. A. B. Fortes, "A case for grid computing on virtual machines," in Proc. of the 23rd IEEE Intl. Conference on Distributed Computing Systems (ICDCS), Providence, Rhode Island, May 2003.
[57] M. Zhao and R. J. Figueiredo, "Distributed file system support for virtual machines in grid computing," in Proceedings of HPDC, Jun 2004.
[58] M. Zhao, V. Chadha, and R. J. Figueiredo, "Supporting application-tailored grid file system sessions with WSRF-based services," in Proceedings of HPDC, Jul 2005.
[59] D. Wolinsky, A. Agrawal, P. O. Boykin, J. Davis, A. Ganguly, V. Paramygin, P. Sheng, and R. Figueiredo, "On the design of virtual machine sandboxes for distributed computing in wide area overlays of virtual workstations," in First Workshop on Virtualization Technologies in Distributed Computing (VTDC), Nov 2006.
[60] F. Oliveira, G. Guardiola, J. A. Patel, and E. V. Hensbergen, "Blutopia: stackable storage for cluster management," in Proceedings of IEEE Cluster Computing, Sep 2007.
[61] S. Santhanam, P. Elango, A. Arpaci-Dusseau, and M. Livny, "Deploying virtual machines as sandboxes for the grid," in USENIX WORLDS, 2004.


[62] M. Carson and D. Santay, "NIST Net: a Linux-based network emulation tool," SIGCOMM Computer Communication Review, vol. 33, no. 3, pp. 111-126, 2003.
[63] J. Spadavecchia and E. Zadok, "Enhancing NFS cross-administrative domain access," in USENIX Annual Technical Conference, FREENIX Track, 2002, pp. 181-194.
[64] M. Baker, R. Buyya, and D. Laforenza, "Grids and grid technologies for wide-area distributed computing," Software Practice & Experience, vol. 32, no. 15, pp. 1437-1466, 2002.
[65] C. P. Wright, J. Dave, P. Gupta, H. Krishnan, D. P. Quigley, E. Zadok, and M. N. Zubair, "Versatility and Unix semantics in namespace unification," ACM Transactions on Storage (TOS), vol. 2, no. 1, pp. 1-32, February 2006.
[66] D. Santry, M. Feeley, N. Hutchinson, A. Veitch, R. Carton, and J. Ofir, "Deciding when to forget in the Elephant file system," in 17th ACM Symposium on Operating Systems Principles, 1999.
[67] R. G. Minnich, "The AutoCacher: a file cache which operates at the NFS level," in USENIX Conference Proceedings, 1993, pp. 77-83.
[68] S. Osman, D. Subhraveti, G. Su, and J. Nieh, "The design and implementation of Zap: a system for migrating computing environments," in Symposium on OSDI, Boston, MA, Dec 2002.
[69] J. H. Hartman and J. K. Ousterhout, "The Zebra striped network file system," in SOSP '93: Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, Dec 1993, pp. 29-43, ACM.
[70] A. Agbaria and R. Friedman, "Virtual machine based heterogeneous checkpointing," Software - Practice & Experience, vol. 32, pp. 1175-1192, 2002.
[71] A. Ganguly, A. Agrawal, P. O. Boykin, and R. J. O. Figueiredo, "WOW: self-organizing wide area overlay networks of virtual workstations," in HPDC, June 2006, pp. 30-42, IEEE.
[72] C. Sun, L. He, Q. Wang, and R. Willenborg, "Simplifying service deployment with virtual appliances," in IEEE International Conference on Services Computing, vol. 2, pp. 265-272, 2008.
[73] R. Prodan and T. Fahringer, "Overhead analysis of scientific workflows in grid environments," IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 3, pp. 378-393, 2008.
[74] A. S. Tanenbaum and M. van Steen, Distributed Systems: Principles and Paradigms (1st Edition), Prentice-Hall Inc., 2002.
[75] V. Chadha and R. J. Figueiredo, "ROW-FS: a user-level virtualized redirect-on-write distributed file system for wide area applications," in International Conference on High Performance Computing (HiPC), Goa, India, Dec 2007.


[76] S. Annapureddy, M. J. Freedman, and D. Mazieres, "Shark: scaling file servers via cooperative caching," in Proceedings of the 2nd USENIX/ACM Symposium on Networked Systems Design and Implementation, May 2005.
[77] D. Reimer, A. Thomas, G. Ammons, T. Mummert, B. Alpern, and V. Bala, "Opening black boxes: using semantic information to combat virtual machine image sprawl," in VEE '08: Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, New York, NY, USA, 2008, pp. 111-120, ACM.
[78] R. Chandra, N. Zeldovich, C. Sapuntzakis, and M. S. Lam, "The Collective: a cache-based system management architecture," in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (NSDI), 2005, pp. 259-272.
[79] "2X ThinClientServer," World Wide Web electronic publication, 2008.
[80] M. McNett, D. Gupta, A. Vahdat, and G. M. Voelker, "Usher: an extensible framework for managing clusters of virtual machines," in Proceedings of the 21st Large Installation System Administration Conference (LISA), Nov 2007.
[81] J. Cappos, S. Baker, J. Plichta, D. Nyugen, J. Hardies, M. Borgard, J. Johnston, and J. H. Hartman, "Stork: package management for distributed VM environments," in Proceedings of the 21st Large Installation System Administration Conference (LISA), Nov 2007.
[82] R. Grossman and Y. Gu, "Data mining using high performance data clouds: experimental studies using Sector and Sphere," in KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2008, pp. 920-927, ACM.
[83] A. Menon, J. R. Santos, Y. Turner, G. Janakiraman, and W. Zwaenepoel, "Diagnosing performance overheads in the Xen virtual machine environment," in VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, 2005, pp. 13-23, ACM.
[84] J. R. Santos, Y. Turner, G. Janakiraman, and I. Pratt, "Bridging the gap between software and hardware techniques for I/O virtualization," in ATC '08: USENIX 2008 Annual Technical Conference, Berkeley, CA, USA, 2008, pp. 29-42, USENIX Association.
[85] R. Uhlig, R. Fishtein, O. Gershon, I. Hirsh, and H. Wang, "SoftSDV: a presilicon software development environment for the IA-64 architecture," Intel Technology Journal, 1999.
[86] M. Yourst, "PTLsim: a cycle accurate full system x86-64 microarchitectural simulator," in IEEE International Symposium on Performance Analysis of Systems & Software, April 2007, pp. 23-34.


[87] R. Iyer, "On modeling and analyzing cache hierarchies using CASPER," in 11th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Oct 2003.
[88] W. Wu and M. Crawford, "Interactivity vs. fairness in networked Linux systems," Computer Networks, vol. 51, no. 14, pp. 4050-4069, 2007.
[89] L. Cherkasova and R. Gardner, "Measuring CPU overhead for I/O processing in the Xen virtual machine monitor," in ATEC '05: Proceedings of the Annual Conference on USENIX Annual Technical Conference, Berkeley, CA, USA, Apr 2005, USENIX Association.
[90] G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig, "Intel virtualization technology: hardware support for efficient processor virtualization," Intel Technology Journal, Aug 2006.
[91] B. Jacob, S. W. Ng, and D. T. Wang, Memory Systems: Cache, DRAM and Disk, Morgan Kaufmann Publishers, 2007.
[92] Intel Corporation, "Intel 64 and IA-32 architectures software developer's manuals [Online]," World Wide Web electronic publication, Available:
[93] D. Gupta, R. Gardner, and L. Cherkasova, "XenMon: QoS monitoring and performance profiling tool [Online]," World Wide Web electronic publication, Available:
[94] O. Tickoo, H. Kannan, V. Chadha, R. Illikkal, R. Iyer, and D. Newell, "qTLB: looking inside the look-aside buffer," in International Conference on High Performance Computing (HiPC), Goa, India, Dec 2007.
[95] R. Santos, G. Janikaraman, and Y. Turner, "Xen network optimization [Online]," World Wide Web electronic publication, Available: 3/networkoptimizations.pdf
[96] M. R. Marty and M. D. Hill, "Virtual hierarchies to support server consolidation," in ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, New York, NY, USA, 2007, pp. 46-56, ACM.
[97] J. Wiegert, G. Regnier, and J. Jackson, "Challenges for scalable networking in a virtualized server," in Proceedings of the 16th International Conference on Computer Communications and Networks, Aug 2007.
[98] J. Liu, W. Huang, B. Abali, and D. K. Panda, "High performance VMM-bypass I/O in virtual machines," in ATEC '06: Proceedings of the Annual Conference on USENIX '06 Annual Technical Conference, Berkeley, CA, USA, May 2006, USENIX Association.


[99] M.-S. Chang and K. Koh, "Lazy TLB consistency for large-scale multiprocessors," in Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms/Architecture Synthesis, 1997.
[100] Intel Corporation, "First the tick, now the tock: next generation Intel microarchitecture [Online]," World Wide Web electronic publication, Available:


Vineet graduated with a B.E. in Electronics and Telecommunication from the University of Pune, India. He finished his M.S. in Computer Science at Mississippi State University. He is pursuing his Ph.D. in Computer Information Science and Engineering at the University of Florida. His research interests include virtualization, operating systems, computer architecture, file systems and distributed computing. Since Fall 2002, Vineet has been a research assistant at the Advanced Computing and Information Systems (ACIS) Laboratory. At ACIS, his research focus has been the Grid Virtual File System (GVFS) and I/O virtualization. He has been involved in the development of middleware support for network file systems and a simulation-based evaluation methodology to characterize I/O overhead in virtualized environments. To complement his academic experience, Vineet has completed two summer internships at the Intel Systems Technology Lab. Upon his graduation, Vineet plans to take up a full-time position at Intel Corporation.