
Provisioning Wide-Area Virtual Environments through I/O Interposition

Permanent Link: http://ufdc.ufl.edu/UFE0022662/00001

Material Information

Title: Provisioning Wide-Area Virtual Environments through I/O Interposition The Redirect-on-Write File System and Characterization of I/O Overheads in a Virtualized Platform
Physical Description: 1 online resource (145 p.)
Language: english
Creator: Chadha, Vineet
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: filesystem, simulation, virtualization, xen
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
Genre: Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: This dissertation presents mechanisms to provision and characterize I/O workloads for applications found in virtual data centers. It addresses two specific modes of workload execution in a virtual data center: (1) workload execution on heterogeneous compute resources across a wide-area environment, and (2) workload execution and characterization within a virtualized platform. A key challenge arising in wide-area, grid computing infrastructures is that of data management: how to provide data to applications, seamlessly, in environments spanning multiple domains. In these environments, it is often the case that data movement and sharing is mediated by middleware that schedules applications. This thesis presents a novel approach, the Redirect-on-Write file system (ROW-FS), that enables wide-area applications to leverage on-demand block-based data transfers and a de-facto distributed file system (NFS) to access data stored remotely and modify it in the local area. The ROW-FS approach enables multiple clients to operate on private, virtual versions of data mounted from a single shared dataset served as a network file system (NFS), and enables multiple VM instances to efficiently share a common set of virtual machine image files. The proposed approach offers savings in storage and bandwidth requirements compared to the conventional approaches of provisioning VMs by copying the entire VM image to the client or by cloning the image on the server side. The thin-client approach described in this dissertation uses ROW-FS to enable the use of unmodified NFS clients/servers and local buffering of file system modifications during an application's lifetime. An important application of ROW-FS is in enabling the instantiation of multiple non-persistent virtual machines across wide-area resources from read-only images stored on an image server (or distributed along multiple replicas).
A common deployment scenario of ROW-FS is when the virtual machine hosting its private, redirected "shadow" file system server and the client virtual machine are consolidated into a single physical machine. While a virtual machine provides levels of execution isolation and service partitioning that are desirable in environments such as data centers, its associated overheads can be a major impediment to wide deployment of virtualized environments. While the virtualization cost depends heavily on workloads, the overhead is much higher with I/O-intensive workloads compared to those which are compute-intensive. Unfortunately, the architectural reasons behind the I/O performance overheads are not well understood. Early research in characterizing these penalties has shown that cache misses and TLB-related overheads contribute to most of the I/O virtualization cost. While most of these evaluations are done using measurements, this thesis presents an execution-driven, simulation-based analysis methodology with symbol annotation as a means of evaluating the performance of virtualized workloads, and presents a simulation-based characterization of the performance of a representative network-intensive benchmark (iperf) in the Xen virtual machine environment. The main contributions of this dissertation work are: 1) the novel design and implementation of the ROW-FS file system; 2) experimental evaluation of ROW-FS for the O/S image framework that enables virtual machine images to be published, discovered, and transferred on-demand through a combination of ROW-FS and peer-to-peer techniques; 3) a novel implementation of an execution-driven simulation framework to evaluate network I/O performance using symbol annotation for environments that encompass both a virtual machine hypervisor and guest operating system domains; and 4) evaluation, through simulation, of the potential benefits of different micro-architectural TLB improvements on performance.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Vineet Chadha.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Figueiredo, Renato J.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-12-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0022662:00001




Full Text

ACKNOWLEDGMENTS

I would like to thank my advisor, Dr. Renato Figueiredo, for all the support he has provided me during the last six years. He has guided me around the maze of systems research and shown me the right way whenever I felt lost. It has been a privilege to work with Dr. Figueiredo, whose calm, humble, and polite demeanor is one I would like to carry and apply further in my career. Thanks to Dr. Jose Fortes, who provided me the opportunity to work at the Advanced Computing and Information Systems (ACIS) laboratory. He gave me encouragement and support whenever things were down. I would like to thank Dr. Oscar Boykin for serving on my committee and for all those fruitful discussions on research, Linux, healthy food, and running. His passion to achieve perfection in every endeavor of life often eggs me on to do better. Thanks to Dr. Alan George and Dr. Joseph Wilson for serving on my Ph.D. program committee and motivating me through their courses and research work. I would like to thank my mentor, Ramesh Illikkal, and my manager, Donald Newell, at Intel Corporation for the faith they have shown in me and for egging me on to work hard. It has been a privilege to work with Ramesh, who taught me the importance of teamwork, failure, and success. Thanks to Dr. Padmashree Apparao and Dr. Ravishankar Iyer for guidance and encouragement to achieve my goals. Thanks is also due to Dr. Ivan Krsul and Dr. Suma Adabala for guiding me not only during the development and research of the In-VIGO project but also for often sharing thoughts on a Ph.D. program and its expectations. Thanks is due to all the colleagues here at ACIS who made the work environment fun to work in. I would like to thank Andrea and Mauricio for providing excellent research facilities and resources. Thanks to my officemates Arijit and Girish for all the fruitful discussions. Thanks is due to Cathy for maintaining a cordial environment in the ACIS lab and extending support to me as a good friend whenever the need arises.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    1.1 Introduction
        1.1.1 Virtual Network File System I/O Redirection
        1.1.2 Characterization of I/O Overheads in Virtualized Environments
            1.1.2.1 Simulation
            1.1.2.2 I/O Mechanisms
    1.2 Dissertation Contributions
    1.3 Dissertation Relevance
    1.4 Dissertation Overview
    1.5 Dissertation Organization

2 I/O VIRTUALIZATION: RELATED TERMS AND TECHNOLOGIES
    2.1 Introduction
    2.2 Virtualization Technologies
    2.3 Virtual Machine Architectures
        2.3.1 I/O Mechanisms in Virtual Machines
        2.3.2 Virtual Machines and CMP Architectures
    2.4 Grid Computing
    2.5 File System Virtualization
        2.5.1 Network File System
        2.5.2 Grid Virtual File System

3 REDIRECT-ON-WRITE DISTRIBUTED FILE SYSTEM
    3.1 Introduction
        3.1.1 File System Abstraction
        3.1.2 Redirect-on-Write File System
    3.2 Motivation and Background
        3.2.1 Use-Case Scenario: File System Sessions for Grid Computing
        3.2.2 Use-Case Scenario: NFS-Mounted Virtual Machine Images and O/S File Systems
    3.3 ROW-FS Architecture
        3.3.1 Hash Table
        3.3.2 Bitmap
    3.4 ROW-FS Implementation
        3.4.1 MOUNT
        3.4.2 LOOKUP
        3.4.3 GETATTR/SETATTR
        3.4.4 READ
        3.4.5 WRITE
        3.4.6 READDIR
        3.4.7 REMOVE/RMDIR/RENAME
        3.4.8 LINK/READLINK
        3.4.9 SYMLINK
        3.4.10 CREATE/MKDIR
        3.4.11 STATFS
    3.5 Experimental Results
        3.5.1 Microbenchmark
        3.5.2 Application Benchmark
        3.5.3 Virtual Machine Instantiation
        3.5.4 File System Comparison
    3.6 Related Work
    3.7 Conclusion

4 PROVISIONING OF VIRTUAL ENVIRONMENTS FOR WIDE AREA DESKTOP GRIDS
    4.1 Introduction
    4.2 Data Provisioning Architecture
    4.3 ROW-FS Consistency and Replication Approach
        4.3.1 ROW-FS Consistency in Image Provisioning
        4.3.2 ROW-FS Replication in Image Provisioning
    4.4 Security Implications
    4.5 Experiments and Results
        4.5.1 Proxy VM Resource Consumption
        4.5.2 RPC Call Profile
        4.5.3 Data Transfer Size
        4.5.4 Wide-area Experiment
        4.5.5 Distributed Hash Table State Evaluation and Analysis
    4.6 Related Work
    4.7 Conclusion

    5.1 Introduction
    5.2 Motivation and Background
        5.2.1 Full System Simulator
        5.2.2 I/O Virtualization in Xen
    5.3 Analysis Methodology
        5.3.1 Full System Simulation: Xen VMM as Workload
        5.3.2 Instruction Trace
        5.3.3 Symbol Annotation
        5.3.4 Performance Statistics
        5.3.5 Environmental Setup for Virtualized Workload
    5.4 Experiments and Simulation Results
        5.4.1 Life Cycle of an I/O Packet
            5.4.1.1 Unprivileged Domain
            5.4.1.2 Grant Table Mechanism
            5.4.1.3 Timer Interrupts
            5.4.1.4 Privileged Domain
        5.4.2 Cache and TLB Characteristics
    5.5 Cache and TLB Scaling
    5.6 Related Work
    5.7 Conclusion

6 HARDWARE SUPPORT FOR I/O WORKLOADS: AN ANALYSIS
    6.1 Introduction
    6.2 Translation Lookaside Buffer
        6.2.1 Introduction
        6.2.2 TLB Invalidation in Multiprocessors
    6.3 Interprocessor Interrupts
    6.4 Grant Table Mechanism: I/O Analysis
    6.5 Experiments and Results
        6.5.1 Grant Table Performance
        6.5.2 Hypervisor Global Bit
        6.5.3 TLB Coherence Evaluation
    6.6 Related Work
    6.7 Conclusion

7 CONCLUSION AND FUTURE WORK
    7.1 Conclusion
    7.2 Future Work

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Summary of the NFSv2 protocol remote procedure calls
3-2 LAN and WAN experiments for micro-benchmarks
3-3 Andrew benchmark and AM-Utils execution times
3-4 Linux kernel compilation execution times on a LAN and WAN
3-5 Wide area experimental results for diskless Linux boot and second boot
3-6 Remote Xen boot/reboot experiment
4-1 Grid appliance boot and reboot times over wide area network
4-2 Mean and variance of DHT access time for five clients
6-1 Grant table overhead summary
6-2 TLB flush statistics with and without IPI flush optimization
6-3 Instruction TLB miss statistics with and without IPI flush optimization
6-4 Data TLB miss statistics with and without IPI flush optimization

LIST OF FIGURES

1-1 Protocol redirection through user-level proxies
1-2 Illustration of server consolidation
2-1 Landscape of virtualized computer systems
2-2 Systems partitioning characteristics
2-3 I/O virtualization path for single O/S and virtual machines
2-4 I/O partitioning in virtual machine and CMP architectures
2-5 Grid virtual file system
3-1 Indirection mechanism in the Linux virtual file system
3-2 Middleware data management for shared VM images
3-3 Check-pointing a VM container running an application with NFS-mounted file system
3-4 Redirect-on-write file system architecture
3-5 ROW-FS proxy deployment options
3-6 Hash table and flag descriptions
3-7 Remote procedure call processing in ROW-FS
3-8 A snapshot view of a file system session through the Redirect-on-Write proxy
3-9 Sequence of redirect-on-write file system calls
3-10 Number of RPC calls received by NFS server in non-virtualized environment, and by ROW-FS shadow and main servers during Andrew benchmark execution
4-1 O/S image management over wide area desktops
4-2 The deployment of the ROW proxy to support PXE-based boot of a (diskless) non-persistent VM over a wide area network
4-3 Algorithm to bootstrap a VM session
4-4 Algorithm to publish a virtual machine image
4-5 Replication approach for ROW-FS
4-6 Diskless client and publisher client security
4-8 RPC statistics for diskless boot
4-9 Cumulative distribution of DHT query through 10 IPOP clients (in seconds)
5-1 Full system simulation environment with Xen execution
5-2 Execution-driven simulation and symbol-annotated profiling methodology
5-3 Symbol annotation
5-4 Function-level performance statistics
5-5 SoftSDV CPU controller execution mode: performance or functional
5-6 Life of an I/O packet
5-7 Unprivileged domain call graph
5-8 TCP transmit and grant table invocation
5-9 Timer interrupts to initiate context switch
5-10 Life of a packet in privileged domain
5-11 Impact of TLB flush and context switch
5-12 Correlation between VM switching and TLB misses
5-13 TLB misses after a VM context switch
5-14 TLB misses after a grant destroy
5-15 Impact of VM switch on cache misses
5-16 L2 cache performance for transmit of I/O packets
5-17 Data and instruction TLB performance for transmit of I/O packets
5-18 L2 cache performance for receive of I/O packets
5-19 Data and instruction TLB performance for receive of I/O packets
6-1 The x86 page table for small pages
6-2 Interprocessor interrupt mechanism in x86 architecture
6-3 Simulation experimental setup
6-4 Impact of tagging TLB with a global bit
6-5 Page sharing in multicore environment



CHAPTER 1
INTRODUCTION

[1]. This dissertation investigates data provisioning and performance characterization of virtual I/O, in particular network file systems, in such virtualized environments. The goals of this dissertation are as follows. First, devise and evaluate techniques which can seamlessly provide data to applications in wide-area environments spanning multiple domains through distributed file system protocol I/O redirection. Second, because the performance of this data provisioning solution is limited inherently by overheads associated with network I/O in a virtualized environment, this dissertation evaluates network I/O virtualization overheads in such environments with a simulation-based methodology that enables quantitative analysis of the impact of micro-architecture features on the performance of a contemporary split-I/O virtual machine hypervisor. Finally, this work explores hardware/software support to improve bottlenecks in I/O performance. The distributed file system redirection approach for data management and evaluation can benefit, in particular, applications where large, mostly-read datasets need to be provisioned in a virtual data center.

[2]. To facilitate data movement in such environments, I have developed a novel redirect-on-write distributed file system (ROW-FS) which allows for application-transparent buffering and request re-routing of all file system modifications locally through user-level proxies. These proxies forward file system accesses that modify any file system objects to a "shadow" distributed file system server, creating on-demand private copies of such objects in the shadow file server while routing accesses to unmodified data to a "main" server. A motivation for this

[3]. Figure 1-1 shows an example of protocol redirection of NFS through user-level proxies. Virtual machine environments are commonly used to improve CPU utilization in computer systems [1]. Such VM-based environments are increasingly used in data centers for resource consolidation, high-performance and grid computing as a means to facilitate the deployment of user-customized execution environments [4][5][6][7]. Thus, the deployment of user-level proxies in VM-based execution environments is a common scenario to facilitate data movement and provisioning [4].

Figure 1-1. Protocol redirection: the client/server protocol is modified to forward RPC calls to a shadow distributed file system server.

For example, consider the scenario of multiple servers in data centers which are consolidated into a single server through the deployment of virtual machines, which represents a common contemporary use case of virtualization [8]. Figure 1-2 shows a possible deployment of ROW proxies which forward calls to the NFS and shadow servers. The shadow server VM and client VM are consolidated into a single physical machine. Such a scenario is common when the client VM is diskless or disk space on the client VM is a constraint. While deploying ROW proxies in such cases provides much-needed functionality, the overhead associated with virtualized network I/O is often considered a bottleneck [9][10]. Specifically, the performance of ROW-FS in such environments, as well as that of

[3], is dependent upon the performance of the underlying virtual I/O network layer.

Figure 1-2. Server consolidation: illustration of partitioning of a physical system with shared hardware resources. A specific case of ROW-FS deployment is shown. The shadow server is encapsulated into a separate VM. RPC calls are passed to remote/shadow servers through a ROW-FS proxy deployed in the privileged VM.

[11]. Also, system workloads are becoming more complex as applications are often compartmentalized and executed within virtual machines [1]. Thus, key responsibilities of conventional O/Ses such as scheduling are delegated to a new layer - the hypervisor or Virtual Machine Monitor [12]. Hypervisors are widely accepted as an approach to address under-utilization of resources such as I/O and CPU in physical machines. To characterize the impact

1. Why has a simulation-based approach been chosen?
2. What are the options of network I/O mechanisms in virtual machine designs?

[13][14][15]. The use of simulation in this dissertation is motivated by the fact that current system evaluation methodologies for virtual machines are based on measurements of a deployed virtualized environment on a physical machine. Although such an approach gives good estimates of performance overheads for a given physical machine, it lacks flexibility in determining resource scaling performance. In addition, it is difficult to replicate a measurement framework on different system architectures. It is important to move towards a full system simulation methodology because it is a flexible approach to studying different architectures.

[16]. The virtual machine I/O architectures which have been implemented in different hypervisors are: (1) the Direct I/O Model (Xen 1.0), where the hypervisor is responsible for running device drivers; (2) the Split I/O Model (Xen 2.0/3.0), where device drivers are exported to a privileged guest virtual machine; and (3) the Pass-through I/O Model, where direct access to hardware devices is provided from guest virtual machines. Chapter 2 further

[17]. Microkernels have not found wide acceptance, arguably because of overheads associated with inter-process communication between different entities (such as file system and device drivers). As processors are becoming faster and new CMP architectures are providing core-level parallelism, I/O approaches such as offloading selective virtual I/O functionality to a separate core are being explored [18]. I chose split I/O as a basis for the investigation in this dissertation for the following reasons: [19]. With the advent of chip multiprocessor architectures, the microkernel approach is being re-visited; the Xen VMM version 3.0 [20] has features which are also found in microkernels [17]. Dedicated CPUs to improve split I/O performance can easily be addressed through CMP architectures.
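In the split I/O model described above, the unprivileged guest's frontend driver has no device access; it exchanges requests and responses with a backend driver in the privileged domain over a shared ring. The following is a minimal, hypothetical sketch of that producer/consumer structure; the class and method names (SharedRing, FrontendDriver, and so on) are illustrative stand-ins, not Xen's actual ring API, and event-channel notification is elided.

```python
from collections import deque

class SharedRing:
    """Illustrative stand-in for a shared-memory I/O ring between a
    guest frontend driver and the privileged domain's backend driver."""
    def __init__(self, size=8):
        self.size = size
        self.requests = deque()    # frontend -> backend
        self.responses = deque()   # backend -> frontend

class FrontendDriver:
    """Runs in an unprivileged guest; has no direct device access."""
    def __init__(self, ring):
        self.ring = ring
    def transmit(self, packet):
        if len(self.ring.requests) >= self.ring.size:
            raise BufferError("ring full: frontend must wait")
        self.ring.requests.append(packet)  # a notification would follow here

class BackendDriver:
    """Runs in the privileged domain, which owns the real device driver."""
    def __init__(self, ring, device):
        self.ring = ring
        self.device = device
    def service(self):
        # Drain pending requests, hand them to the real driver,
        # and post completions back on the response ring.
        while self.ring.requests:
            packet = self.ring.requests.popleft()
            self.device.send(packet)
            self.ring.responses.append(("done", packet))

class FakeNIC:
    """Toy device model standing in for the physical NIC driver."""
    def __init__(self):
        self.sent = []
    def send(self, packet):
        self.sent.append(packet)

ring = SharedRing()
fe = FrontendDriver(ring)
be = BackendDriver(ring, FakeNIC())
fe.transmit(b"hello")
be.service()
```

The indirection sketched here is precisely why split I/O is measurable as overhead: every packet crosses a ring and a domain boundary instead of a single driver call.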

[2] have dealt with this problem via application check-pointing and restart. A limitation of this approach lies in that it only supports a very restricted set of applications: they must be re-linked to Condor libraries and cannot use many system calls (e.g. fork, exec, mmap). The approach of redirecting RPC calls to a shadow server, in contrast, supports unmodified applications. It uses client-side virtualization mechanisms that allow for transparent buffering, on local stable storage, of all file system modifications produced by distributed file system clients. Locally buffered file system modifications can then be checkpointed together with the application state. By "local storage", I mean storage which is either local to the client machine or in the client's local area network.
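The redirect-on-write routing described above can be sketched as follows. This is a simplified illustration, not the actual ROW-FS proxy code: the procedure names follow NFSv2, but the server objects, their fetch/store/handle methods, and the single `modified` set (standing in for ROW-FS's hash table and bitmap state) are hypothetical.

```python
# Mutating NFS procedures are redirected to the shadow server; reads go to
# the main server unless the object already has a private shadow copy.
MUTATING = {"WRITE", "CREATE", "MKDIR", "REMOVE", "RMDIR", "RENAME",
            "SETATTR", "LINK", "SYMLINK"}

class DictServer:
    """Toy stand-in for an NFS server, keyed by file handle."""
    def __init__(self, data=None):
        self.data = dict(data or {})
    def fetch(self, fh):
        return self.data.get(fh, b"")
    def store(self, fh, content):
        self.data[fh] = content
    def handle(self, proc, fh, *args):
        if proc == "READ":
            return self.data.get(fh, b"")
        if proc == "WRITE":
            self.data[fh] = args[0]
            return "ok"
        raise NotImplementedError(proc)

class RowProxy:
    """Routes mutating RPCs to the shadow server, first materializing a
    private copy of the object (copy-on-first-write); read-only RPCs are
    served from the shadow only if the object was privately modified."""
    def __init__(self, main, shadow):
        self.main = main
        self.shadow = shadow
        self.modified = set()   # file handles with private shadow copies
    def call(self, proc, fh, *args):
        if proc in MUTATING:
            if fh not in self.modified:
                self.shadow.store(fh, self.main.fetch(fh))
                self.modified.add(fh)
            return self.shadow.handle(proc, fh, *args)
        server = self.shadow if fh in self.modified else self.main
        return server.handle(proc, fh, *args)

main = DictServer({"fh1": b"original"})
proxy = RowProxy(main, DictServer())
```

For example, a WRITE through the proxy leaves the main server's copy untouched while subsequent READs see the modified shadow copy; this is what lets many clients share one read-only image, and what makes the buffered modifications available for checkpointing alongside application state.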

[21][22]. These O/S image frameworks either require kernel-level support or rely on aggressive caching of full O/S images.

[23]. Previous research has shown that data access patterns for a DFS are often dynamic and ephemeral [23]. Also, current solutions are either based on complete file transfer, which may incur access latency overhead (as compared to block transfers) [24], or are limited to single domains [25]; this is not acceptable in a virtualized environment where there is often a need to access large O/S images over a wide area.

Solution: I present a novel approach that enables wide-area applications to leverage on-demand block-based data transfers and a de-facto distributed file system (NFS) to access data stored remotely and modify it in the local area - the Redirect-On-Write File System (ROW-FS). I show that the ROW-FS approach provides substantial improvements compared to the traditional NFS protocol for benchmark applications such as Linux kernel compilation and virtual machine instantiation.

Problem 2: Client-server consistency and replication mechanisms for on-demand data transfer

During the time a ROW-FS file system session is mounted, all modifications are redirected to the shadow server. It is important to consider consistency in distributed file systems because data can potentially be shared by multiple clients. For consistency, two different scenarios need to be considered. First, there are applications in which it is neither needed nor desirable for data in the shadow server to be reconciled with the main server; an example is the provisioning of system images for diskless clients or virtual machines. Second, for applications in which it is desirable to reconcile data with the server, the ROW-FS proxy holds state in its primary data structures that can be used to commit modifications back to the server.

Solution: I leverage APIs exported by lookup services (such as Distributed Hash Tables) in distributed frameworks (e.g. IPOP [26]) to keep clients consistent with the latest updates

[27]. The architectural reasons behind the I/O performance overheads in virtualized environments are not well understood. Early research in characterizing these penalties has shown that cache misses and TLB-related overheads contribute to most of the I/O virtualization cost [27].

Solution: I have applied an execution-driven methodology to study the network I/O performance of Xen (as a case study) in a full system simulation environment, using detailed cache and TLB models to profile and characterize software and hardware hotspots. By applying symbol annotation to the instruction flow reported by the execution-driven simulator, we derive function-level call flow information. This methodology provides detailed information at the architectural level and allows designers to evaluate potential hardware enhancements to reduce virtualization overhead.

Problem 4: Hardware support for network I/O virtualization
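The symbol-annotation step in the execution-driven methodology above can be illustrated with a small sketch: given a symbol table of function start addresses, each instruction address in a simulator trace is attributed to the enclosing function, yielding function-level counts. This is a simplified stand-in for the dissertation's methodology, not its implementation; the symbol names and addresses below are hypothetical.

```python
import bisect
from collections import Counter

def annotate(trace, symbol_table):
    """Attribute each executed instruction address to the function whose
    start address most closely precedes it (as in a System.map lookup)."""
    starts = sorted(symbol_table)            # sorted function start addresses
    names = [symbol_table[a] for a in starts]
    counts = Counter()
    for addr in trace:
        i = bisect.bisect_right(starts, addr) - 1
        if i >= 0:
            counts[names[i]] += 1
    return counts

# Hypothetical symbols resembling a hypervisor/guest profile.
symbols = {0x1000: "hypervisor_entry", 0x2000: "gnttab_map", 0x3000: "netif_rx"}
trace = [0x1004, 0x2008, 0x2010, 0x3abc]
profile = annotate(trace, symbols)
```

Run over a full instruction trace with the hypervisor's and each guest kernel's symbol tables, the same lookup turns raw simulator output into the per-function hot-spot profile the methodology relies on.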

PAGE 26

4], COD [28]), virtual machine image provisioning, and fault tolerance in distributed computing. Further, Chapter 3 discusses the design and implementation of ROW-FS. I evaluate ROW-FS with micro- and application benchmarks to measure the overhead associated with individual RPC calls. Chapter 4 describes a novel approach to O/S image management and a provisioning architecture through diskless setup of VMs using ROW-FS. The primary goal is to

16], and wide-area Grid and peer-to-peer (P2P) computing systems [29-32]. As systems evolve in the direction of large-scale, networked environments and as applications evolve to demand processing of vast amounts of information, the input/output (I/O) subsystem, which provides a computer with access to mass storage and to networks, becomes increasingly important. Fundamentally, computer systems consist of three subsystems: processors, memory, and I/O [1]. In the early days of desktop computing, I/O systems were primarily used as an extension of the memory system (as a hierarchy level backing up cache and RAM memories) and for persistent data storage. With the arrival of networks and the wide-area Internet, the I/O subsystem can be interpreted as a generic term referring to access of data, over the network or in storage. In the context of networked I/O, a particular subsystem which has been successfully used in provisioning data over networks is a

33], AFS [24], Legion [34]). In the processor and memory subsystems, system architects increasingly apply ideas of virtualization, with approaches that build on techniques for logically partitioning physical systems developed in mainframes [12], which are now accessible for commodity systems based on the x86 architecture. While for several years the x86 architecture did not satisfy conditions that make a CPU amenable to virtualization [12], hardware vendors (Intel and AMD) have provided hardware support to simplify the design of Virtual Machine Monitors (VMMs) [35][36]. Even though virtualization approaches can be used to address the problem of CPU under-utilization through workload consolidation, the gap between I/O mechanisms and CPU efficiency has widened. Figure 2-1 gives an overview of some of the different technologies being harnessed by state-of-the-art computing systems, and provides a landscape in which the approaches described in this dissertation are intended to be applied: wide-area systems (possibly organized in a peer-to-peer fashion), where nodes and networks are virtualized and where commodity CPUs contain multiple cores. The following subsections address these technologies in more detail. 37]. The process of providing transparent access to heterogeneous resources through a layer of

Figure 2-1. Landscape of virtualized computer systems: resources and platforms are heterogeneous with respect to hardware and system software environments; computing nodes have multiple independent processing units (cores) and are virtualizable; Grid computing middleware is used to harness the compute power of heterogeneous resources; a peer-to-peer organization of resources enables self-organizing ensembles of CPUs to support high-throughput computing applications and scientific experiments.

indirection is central to virtualization. For example, in Linux, the virtual file system provides transparent access to data across different file systems. Indirection can be achieved through interposition by an agent or proxy between two or more communicating entities. Proxies have been used extensively for user authentication, call forwarding, and secure gateways [38]. For example, virtual memory is a widely used mechanism for multiplexing physical RAM in traditional operating systems.

2-2 provides a broader view of partitioning of a system which also includes vertical partitioning. Horizontal partitioning essentially adds an extra layer in the system stack which provides an abstraction for the application layer to access underlying resources, while vertical partitioning mechanisms can be used to divide the underlying resources, for instance to isolate sub-systems such that interference due to faults and performance cross-talk is minimized.

Figure 2-2. Systems partitioning characteristics: Access to CPUs in multi-core systems can be naturally partitioned across cores; however, there are hardware resources which are shared (e.g., L2 cache, memory, hard disk, and NICs). Quality of service (QoS) provisioning can be used to partition shared resources across virtual containers.

In general, virtual machine architectures exhibit three common characteristics: multiplexing, polymorphism, and manifolding [4]. The virtualization approach of multiplexing physical resources not only decouples the compute resources from hardware but provides the flexibility of allowing compute resources to migrate seamlessly. Today, many virtual machine monitors are available for research and development (e.g., VMware [37], Parallels [39], VirtualBox [40], KVM [vm:kvm], lguest [41], Xen [20], UML [42], and Qemu [43]).

1]. In order to be effectively virtualizable, it is desirable that the underlying processor instruction set architecture (ISA) follow the conditions set forth in [12]. For several years, the instruction set of what is currently the most popular microarchitecture [44] was not directly virtualizable because there is a set of sensitive instructions which do not cause the processor to trap when running in unprivileged mode [1]. System virtual machines have been designed to overcome this limitation in previous x86 generations, with two major approaches resulting in successful implementations. In the classical approach, a virtual machine such as VMware relies on efficient binary translation mechanisms to emulate non-virtualizable instructions without requiring modifications to the guest O/S. In the paravirtualized approach (e.g., Xen), modifications to the architecture-dependent code of the guest O/S are required, both to avoid the occurrence of non-virtualizable instructions and to enable improved system performance. Intel and AMD have provided hardware support to extend the x86 architecture for virtualized environments [1], making the implementation of classic VMMs a much easier task because binary translation is not required; the KVM virtual machine is an example of a recent VMM which builds upon such hardware extensions. 2-3(a). In contrast, a conventional I/O virtualization path traverses a guest VM driver, virtual I/O device, physical device driver, and physical device (e.g., a network interface card, NIC). The guest VM driver merely provides a mechanism to share access to a virtual I/O device (emulated device driver) through a shared memory mechanism. The emulated device is either

2-3(b)).

Figure 2-3. I/O virtualization path: (a) In a single O/S, an application invokes the physical device driver by means of system calls. (b) In a VM environment, each VM's guest driver communicates with a virtual I/O device through a shared memory mechanism. The virtual I/O device can either reside in the hypervisor or in a separate VM, as shown by dotted lines.

The following are typical alternatives which have been considered for I/O handling in virtual machine environments: 37]. 45].

46][47]. An IOMMU allows direct mapping of hardware device addresses into guest physical memory and also protects VMs from spurious memory accesses.

Figure 2-4. I/O partitioning: Resource mapping in multi-core systems. VM1 is allocated two CPUs. VM3 is pinned to I/O devices Disk D1 and NICs N1, N2.

CMP architectures allow concurrent execution of software threads and modules. Thus, to leverage CMP architectures, it is important to partition the system software resources, operating systems, and hardware resources so as to deterministically allocate CPU resources to the applications [48]. The performance gain through such a partitioning of resources could be diminished if the bottleneck is hardware devices such as NICs. To address this, trends are either toward harnessing multiple devices such as network

10] or developing new models of communication with hardware devices [48][49]. It is conceivable that virtualized CMP architectures of the future will run multiple VMs, with a mix of resources time-shared and/or dedicated to VM guests. For example, as shown in Figure 2-4, virtual machines VM1, VM2, and VM3 are hosted on a multi-core system. VM3 is pinned to Disk D1 and NICs N1 and N2, and has been allocated a single CPU. VM1 has been allocated two CPUs and has privileged status to access hardware resources. The figure also illustrates the virtualization paths for the split I/O (dotted path) and direct I/O approaches. In split I/O, since the virtualization path is divided into separate VM containers, allocating dedicated resources (e.g., CPUs and NICs) to guest and privileged VMs can potentially improve I/O performance. 29], Grid computing refers to the "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations". Grid computing typically takes place on heterogeneous resources distributed across wide-area networks, relying on support from middleware to provide services such as authentication, scheduling, and data transfers. Virtualization in the context of Grid computing is motivated by the ability to abstract the heterogeneity of resources and provide a consistent environment for the execution of workloads. The In-VIGO [4] middleware is a representative example of a system which extensively employs virtualization in this scenario. A similar approach has been taken by several projects in Grid or Utility computing, such as COD [28] and Virtual Workspaces [5]. A key challenge arising in wide-area Grid computing infrastructures is that of management of and access to data: not only I/O local to a node, but also how to provide data to applications, seamlessly, in environments spanning multiple domains. 50]. In grid environments, data movement and

3][24][33] to enhancements of widely used local-area protocols (NFSv2/v3), and to overlaying additional functionalities or modified consistency models over wide-area networks [51][52][53][54]. However, the wide-spread deployment of new protocols is hindered by the fact that operating system designers have mostly focused on local-area distributed file systems which cover typical usage scenarios. As an example, open-source and proprietary implementations of the LAN-oriented versions of the NFS protocol (v2/v3) have been deployed (and hardened over time) in the majority of UNIX flavors and Windows, while open-source implementations of the wide-area protocol (v4), under development since the late 1990s, have not yet been widely deployed. The following sections will explain the NFS protocol architecture and a related approach that consists of a virtualization layer built on existing Network File System (NFS) components - the Grid virtual file system (GVFS).

The design goals of NFS include:

1. Machine and file system independence
2. Transparent access to remote files
3. Simple crash recovery mechanism
4. Low performance overhead in local area networks

3] and was later extended to address shortcomings such as small file sizes, large numbers of Getattr calls, and performance overhead, leading to NFS version 3 (NFSv3). For example, in NFSv2, the invocation of a lookup call is always followed by the invocation of a getattr call. In NFSv3, the lookup call is optimized to return attributes in a single RPC operation. Similarly, NFSv3 introduces new procedure calls to provision buffering of writes in the client and later committing them to the server. Neither NFSv2 nor v3 scales well in cross-domain wide-area environments [3]. In addition, NFSv3 does not provide guaranteed consistency between clients. NFSv4 improves the consistency mechanism at the expense of a more complex server design, which is no longer stateless [3]. NFSv4 implements an open-close consistency mechanism. NFSv4 clients can cache data after a file is opened for access. If cached data is modified, NFS clients need to commit the data back during the file close operation. NFS clients can also re-validate cached data through file timestamps for future access of the data. In addition, the NFSv4 server supports delegation and callback mechanisms to provide write permissions to the client. Thus, an NFSv4 client can allow other clients to access the data from the delegated file. NFS supports hierarchical organization of files and directories. Each directory or file in an NFS server is uniquely addressed by a persistent file handle [3]. A file handle is a reference to a file or directory that is independent of the file name. For example, NFSv2 has persistent file handles of size 32 bytes. A comprehensive survey of NFS file handle structure can be found in [55].
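The RPC savings of NFSv3's combined lookup reply can be illustrated with a toy round-trip counter. This is a minimal sketch of the protocol difference described above, not real NFS code; the function name and the assumption of one lookup per path component are illustrative.

```python
# Toy model of RPC round trips needed to resolve a path and obtain file
# attributes under NFSv2 semantics (every lookup is followed by a separate
# getattr) versus NFSv3 semantics (lookup piggybacks attributes).

def rpcs_to_stat_path(components, version):
    """Count RPC round trips to walk `components` and fetch attributes."""
    if version == 2:
        return 2 * len(components)   # lookup + getattr per component
    if version == 3:
        return len(components)       # attributes returned with each lookup
    raise ValueError("unsupported NFS version")

path = ["usr", "lib", "libX"]
print(rpcs_to_stat_path(path, 2))    # 6 round trips under NFSv2
print(rpcs_to_stat_path(path, 3))    # 3 round trips under NFSv3
```

Over a WAN, where each round trip costs tens of milliseconds, halving the RPC count in this way dominates path-resolution latency.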

3]. This approach of stateless servers simplifies failure recovery. For example, a failure requiring a server restart can be dealt with by simply requiring clients to re-establish the connection. In addition, the NFS protocol supports idempotent operations. In the event of a server crash, the client only needs to wait for the server to boot and re-send the request. The Network File System primarily consists of two protocols. First, the mountd protocol is used to initially access the file handle of the root of an exported directory. Second, the nfsd protocol is used to invoke RPC procedure calls to perform file operations on the remote server. An NFS client invokes the mount protocol through the mount utility. The mount protocol is a three-step process. First, the client contacts the mountd server to obtain the initial file handle for an exported file system. In the second step, the mount protocol accesses the attributes of the directory mount point requested by the client. Finally, the NFS client obtains the attributes of the exported file system. NFS provides the capability to enable different authentication mechanisms such as Unix system authentication (UID or GID). NFS supports authorization based on access control lists maintained by the server. This access control list provides a mapping of user and group IDs between client and server. Whenever an RPC call is received by the server, the server validates the client credentials through the access control list. 56]. GVFS forms the basic framework for the transfer of data necessary for problem-solving environments such as In-VIGO. It relies on a virtualization layer built on existing Network File System (NFS) components, and is implemented at the level of Remote Procedure Calls (RPC) by means of middleware-controlled file system proxies. A virtual file system proxy intercepts RPC calls from an NFS client and forwards them to an NFS server, possibly modifying
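The interaction between statelessness and idempotency described above can be sketched in a few lines. This is an illustrative toy, not the NFS wire protocol: the class and method names are assumptions, and "crash" is modeled by simply discarding the server object.

```python
# Why idempotent operations simplify crash recovery for a stateless server:
# a client that never saw a reply just re-sends the identical request, and
# replaying it on a freshly restarted server yields the same final state.

class StatelessServer:
    def __init__(self):
        self.files = {}                  # file handle -> list of data blocks

    def write(self, fh, block_no, data):
        blocks = self.files.setdefault(fh, [])
        while len(blocks) <= block_no:
            blocks.append(b"")
        blocks[block_no] = data          # overwriting a block is idempotent
        return "OK"

request = ("fh1", 0, b"hello")

server = StatelessServer()
server.write(*request)                   # first attempt; reply lost in transit
server = StatelessServer()               # server crashes and restarts
server.write(*request)                   # client times out and re-sends
print(server.files["fh1"][0])            # b'hello': same state as one send
```

The same retry logic would corrupt state for a non-idempotent operation (e.g., "append"), which is why the protocol is designed around absolute offsets and handles.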

4].

Figure 2-5. Grid Virtual File System: NFS procedure calls are intercepted through user-level proxies. GVFS proxies are deployed on client and server machines. Users are authenticated through an access control list exported by the GVFS proxy and NFS server.

Figure 2-5 provides an overview of the grid virtual file system. As shown in this figure, middleware-controlled file system proxies are used to start a grid session for the client. The Grid virtual file system supports performance-enhancing mechanisms such as a disk cache [56]. GVFS proxies are further extended with write-back support to provide on-demand virtual environments for grid computing [57]. This approach relies on buffering RPC requests and results in a disk cache, and committing changes back to the server at the end of a user session. A related work uses a service-oriented approach to harness the GVFS proxies for optimizations such as caching or copy-on-write support. This approach is based on the Web Services Resource Framework (WSRF), which enables the provisioning of data

58].

3.1.1 File System Abstraction

A file system is an abstraction commonly used to access data from the memory/storage systems (e.g., disk). This abstraction is often implemented as a layer of indirection. Indirection mechanisms are commonly used to address computer science problems. For example, in the Linux O/S, to provide transparent access to different file systems, indirection mechanisms are typically used to steer file system operations through a common file system framework called the virtual file system (VFS). Figure 3-1 shows indirection mechanisms across three levels: transparent access of a file system through the VFS framework, logical access to a disk volume, and indirect access to a file block through an i-node. This dissertation applies indirection mechanisms through a user-level proxy so as to provide transparent access of data from one or more servers. The redirect-on-write file system enables wide-area applications to leverage on-demand block-based data transfers and a de-facto distributed file system (NFS) to access data stored remotely and modify it in the local area.

Figure 3-1. Indirection mechanism in the Linux virtual file system (VFS). Access to various file systems is provided through the indirection mechanism. A file system's blocks could be logically present on multiple hard disks. Further, access to file blocks is performed through direct or indirect access of blocks through inodes.

Figure 3-2. Middleware data management: Grid users G1, G2, and G3 access file disk.img from the server and customize it for personal use through the ROW proxy. G1 modifies the second block B to B', G2 modifies block C to C', and G3 extends the file with an additional block D. (a) Modifications are stored locally at each shadow server. (b) Virtualized view.

ROW-FS complements capabilities provided by "classic" virtual machines (VMs [12][1]) to support flexible, fault-tolerant execution environments in distributed computing systems. Namely, ROW-FS enables mounted distributed file system data to be periodically checkpointed along with a VM's state during the execution of a long-running application. ROW-FS also enables the creation of non-persistent execution environments for non-virtualized machines. For instance, it allows multiple clients to access in read/write mode an NFS file system containing an O/S distribution exported in read-only mode by a single server. Local modifications are kept in per-client "shadow" file systems that are created and

3-2 illustrates an example of a VM image shared between Grid users G1, G2, and G3. O/S image modifications are locally buffered, whereas the server hosts read-only O/S images. 4], COD [28], and Virtual Workspaces [5]). For example, data management in In-VIGO is provided by a virtualization layer known as the Grid Virtual File System. The resulting grid virtual file system allows dynamic creation and destruction of file system sessions on a per-user or per-application basis. Such sessions allow on-demand data transfers, and present to users and applications the API of a widely used distributed network file system across nodes of a computational grid. ROW-FS can export similar APIs to end-user and network-intensive applications to transparently buffer writes in a local server.

59][60]. In these environments, it is often the case that a base operating system layer is shared between different clients (read-only). Any modifications to the base OS by clients (e.g., a kernel patch) can be made feasible by deploying ROW-FS to access read-only images.

3-3. In the figure, the client virtual machine "C" crashes at time tf. In traditional NFS (Figure 3-3, top), job execution has to restart from the beginning, because the server state 'S' may no longer be consistent with the client state at the time of the last checkpoint. In the redirect-on-write setup (Figure 3-3, bottom), job execution can correctly restart at the last checkpoint tc.

Figure 3-3. Checkpointing a VM container running an application with NFS-mounted file systems. In traditional NFS (top), once a client rolls back to checkpointed state, it may be inconsistent with respect to the (non-checkpointed) server state. In ROW-FS (bottom), state modifications are buffered at the client side and are checkpointed along with the VM.

An important class of Grid applications consists of long-running simulations, where execution times on the order of days are not uncommon, and mid-session faults are highly undesirable. Systems such as Condor [2] have dealt with this problem via application checkpointing and restart. A limitation of this approach lies in that it only supports a restricted set of applications - they must be re-linked to specific libraries and cannot use many system calls (e.g., fork, exec, mmap). ROW-FS, in contrast, supports unmodified

2][61] middleware is being extended with the so-called VM universe to support checkpoint and restore of entire VMs rather than individual processes; ROW-FS sessions can conceivably be controlled by this middleware to buffer file system modifications until a VM session completes. 3-4. It consists of user-level DFS extensions that support selective redirection of distributed file system (DFS) calls to two servers: the main server and a shadow server. The architecture is novel in the manner it overlays the ROW capabilities upon unmodified clients and servers, without requiring changes to the underlying protocol. The approach relies on the opaque nature of NFS file handles to allow for virtual handles [3] that are always returned to the client, but map to physical file handles at the main and ROW servers. A file handle hash table stores such mappings, as well as information about client modifications made to each file handle. Files whose contents are modified by the client have "shadow" files created by the shadow server as sparse files, and block-based modifications are inserted in-place in the shadow file. A presence bitmap marks which blocks have been modified, at the granularity of NFS blocks (typically of size 8-32KB). Figure 3-5 shows possible deployments of proxies enabled with user-level disk caching and ROW capabilities. For example, a cache proxy configured to cache read-only data may precede the ROW proxy, thus effectively forming a read/write cache hierarchy. Such a cache-before-redirect (Figure 3-5(a)) proxy setup allows disk caching of both read-only
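The virtual-handle indirection that this architecture builds on can be sketched in a few lines. The sketch below is illustrative only (class and field names are assumptions, not the proxy's actual code); it shows how handle opacity lets the proxy hand the client a fabricated 32-byte token while privately recording the corresponding main and shadow handles, mirroring the MFH/SFH columns of the hash table.

```python
# Sketch of ROW-FS-style virtual file handles. NFS clients treat handles
# as opaque byte strings, so the proxy may return any unique token and
# keep the real (main, shadow) handles in its own hash table.
import os

class HandleTable:
    def __init__(self):
        self.table = {}      # virtual handle -> {"MFH", "SFH", "flags"}

    def register(self, main_fh, shadow_fh):
        vfh = os.urandom(32)             # opaque 32-byte handle, as in NFSv2
        self.table[vfh] = {
            "MFH": main_fh,              # physical handle at the main server
            "SFH": shadow_fh,            # physical handle at the shadow server
            "flags": {"RD": 0, "RE": 0, "RM": 0, "RN": 0},
        }
        return vfh                       # this is all the client ever sees

    def resolve(self, vfh, server):
        """Steer an RPC carrying `vfh` to the chosen physical handle."""
        entry = self.table[vfh]
        return entry["MFH"] if server == "main" else entry["SFH"]

ht = HandleTable()
vfh = ht.register(main_fh=b"main:/usr/lib/libX",
                  shadow_fh=b"shadow:/usr/lib/libX")
print(ht.resolve(vfh, "shadow"))         # the shadow-side physical handle
```

Because the substitution happens entirely inside the proxy, neither the unmodified NFS client nor either server needs to know two handle spaces exist.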

Figure 3-4. ROW-FS architecture - The Redirect-on-Write file system is implemented by means of a user-level proxy which virtualizes NFS by selectively steering calls to either a main server or a shadow server. MFH: Main File Handle, SFH: Shadow File Handle, F: Flags, HP: Hash table processor, BITMAP: bitmap processor.

contents of the main server as well as of client modifications. Write-intensive applications can be supported with better performance using a redirect-before-cache (Figure 3-5(b)) proxy setup. Furthermore, redirection mechanisms based on the ROW proxy can be configured with both shadow and main servers being remote (Figure 3-5(c)). Such a setup could, for example, be used to support a ROW-mounted O/S image for a diskless workstation.

Figure 3-5. Proxy deployment options: (a) Cache-before-redirect (CBR), (b) Redirect-before-cache (RBC), (c) Non-local shadow server.

is a superset of the file system objects in the main server. The main-indexed (MI) table is needed to maintain state information about files in the main server. Figure 3-6 shows the structure of the hash table and flag information. The readdir flag (RD) is used to indicate the occurrence of the invocation of an NFS readdir procedure call for a directory in the main server. The generation count (GC) is a number inserted into the hash tuple for each file system object to create a unique disk-based bitmap. The Remove (RM) and Rename (RN) flags are used to indicate deletion/rename of a file.

Figure 3-6. Hash table and flag descriptions: SFH: Shadow File Handle, MFH: Main File Handle, RD: Readdir Flag, RE: Read Flag, GC: Generation Count, RM: Remove Flag, RN: Rename Flag, L1: Initial Main Link, L2: New Shadow Link, L3: Current Main Link, RL: Remove/Rename File List.

the ROW file system. To keep track of the current location of updated blocks, each file is represented by a two-level hierarchical data structure on disk. The first level indicates the name of the bitmap file which contains information about the block. The second level indicates the location of a presence bit within the bitmap file.

Figure 3-7. Remote procedure call processing in ROW-FS. The procedure call is first forwarded to the shadow server and later to the main NFS server. SS: Shadow Server, MS: Main Server, SI: Shadow Indexed, MI: Main Indexed.

3-8 illustrates a snapshot view of a file system session through the Redirect-on-Write proxy. The ROW proxy, which is used to intercept the mount protocol, is abstracted in the figure. The NFS client mounts a read-only directory (/usr/lib) from the server VM. The mounted file system directory is transparently replicated in the client VM to buffer local modifications. Files replicated at the shadow server are dummy files which represent a sparse version of the read-only files in the server VM. Only file blocks written during the file system session are replicated in the shadow server. A hash table entry is updated to provide the status of files. Figure 3-8 illustrates a hex dump of a 32-byte NFSv2 hash table entry. The generation count is used to provide a unique bitmap directory: the generation count, along with the hashed value of the shadow file handle, is used to create a bitmap directory per file. As shown in Figure 3-8, the libX file handle is hashed to "777", which is further concatenated with generation count "234" to produce a unique bitmap directory. The RD flag is marked "0" as this entry is for a file. The RE flag is marked "1" to indicate that the bitmap needs to be accessed for a possible "libX" file block in the shadow server. For the "libX" file, there is no main-indexed hash table entry, as there is no status information to keep for a read-only file in the server VM. All newly written blocks are present in the "0" file of the bitmap directory. 3-7 describes the various hash table entries stored in the proxy, which are referenced throughout this section. Table 3-1 briefly describes NFSv2 RPC calls and points to the relevant sections for call modifications. A detailed description of all NFS protocol calls described below can be found in [3].
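The bitmap naming and the two-level presence-bit lookup described above can be sketched as follows. The constant and function names here are illustrative assumptions (the dissertation does not specify the capacity of one bitmap file); only the hash-plus-generation-count naming scheme and the (file, offset) split follow the text.

```python
# Sketch of the per-file presence-bitmap layout: the hashed shadow file
# handle concatenated with the generation count names a unique bitmap
# directory; a block number then splits into (bitmap file name, bit offset).

BITS_PER_BITMAP_FILE = 4096          # assumed capacity of one bitmap file

def bitmap_dir(fh_hash, generation_count):
    # e.g., handle hash 777 + generation count 234 -> directory "777234"
    return f"{fh_hash}{generation_count}"

def locate_presence_bit(block_no):
    """First level: which bitmap file; second level: bit offset within it."""
    return str(block_no // BITS_PER_BITMAP_FILE), block_no % BITS_PER_BITMAP_FILE

print(bitmap_dir(777, 234))          # '777234', matching the libX example
print(locate_presence_bit(10))       # ('0', 10): early blocks land in file "0"
```

This explains why, in the libX snapshot, all newly written blocks appear in the "0" file of the bitmap directory: their block numbers are small enough to fall in the first-level file named "0".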

Figure 3-8. A snapshot view of a file system session through the Redirect-on-Write proxy. The hash table and bitmap status are shown for the file "libX", which is transparently replicated in a shadow server. Three blocks of "libX" are shown to be recently accessed and written in the shadow server.

the server. In the second step, the mount utility invokes the NFS getattr procedure to get the attributes of the directory. Finally, the mount utility gets the attributes of the file system. To maintain mount transparency, ROW-FS also has a proxy for the mount protocol. The mount procedure is modified to obtain the initial mount file handle of the shadow server. Specifically, the mount proxy forwards a mount call to both shadow and main servers. When the mount utility is issued by a client, the shadow server is contacted first to save the file handle of the directory to be mounted. This file handle is later used by NFS procedure calls to direct RPC calls to the shadow server. The initial mapping of file handles of a mounted directory is inserted in the SI hash table during invocation of the getattr procedure. Figure 3-9 (top, left) depicts handling of the Mount procedure.

Table 3-1. Summary of the NFSv2 protocol remote procedure calls. Each row summarizes the behavior of the RPC call and points to the section within this chapter where the mechanism to virtualize each call in ROW-FS is described.

NFS call    Behavior (modification section)
Null        Testing call (no modification)
Getattr     Retrieves the attributes from the NFS server (Section 3.4.3)
Setattr     Sets the attributes of a file or directory (Section 3.4.3)
Lookup      Returns the file handle for a file name or directory (Section 3.4.2)
Readlink    Reads a symbolic link (Section 3.4.8)
Read        Reads a block of a file (Section 3.4.4)
Write       Writes to a block of a file (Section 3.4.5)
Create      Creates a new file (Section 3.4.10)
Remove      Removes a file (Section 3.4.7)
Rename      Renames a file (Section 3.4.7)
Link        Creates a hard link to a file (Section 3.4.8)
Symlink     Creates a symbolic link to a file (Section 3.4.9)
Mkdir       Creates a new directory (Section 3.4.10)
Rmdir       Removes an existing directory (Section 3.4.7)
Readdir     Lists contents of an existing directory (Section 3.4.6)
Statfs      Checks status of the file system (Section 3.4.11)

exist" error. Otherwise, the proxy issues an NFS create call for a dummy

1. A new file may have been created at the shadow server. In this case, all blocks are present in the shadow server and all read calls are directed to it.

2. If the file system object is not newly created at the shadow server, file blocks may reside in either the shadow or the main server. In that case, the proxy uses the bitmap presence data structure and calculates the location of the current and valid file block to determine whether the read request should be satisfied by the main or by the shadow server.

3. An optimization (RE flag) is made for the case when a file handle mapping is present in the MI hash table: even though a bitmap has not been created (i.e., no blocks of the file have been written into), the call is forwarded to the main server. This optimization avoids the expense of checking the bitmap data structure on disk.
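The three read-routing cases above reduce to a small decision function. The sketch below is illustrative (field names such as `new_at_shadow` and the set-based bitmap are assumptions, not the proxy's real data layout), but the branch order follows the text: new-at-shadow files short-circuit to the shadow server, a cleared RE flag short-circuits to the main server, and only the remaining case pays for a bitmap lookup.

```python
# Decision sketch for routing an NFS read call in ROW-FS.

def route_read(entry, block_no, bitmap):
    """entry: hash-table record for the file; bitmap: set of written blocks."""
    if entry["new_at_shadow"]:        # case 1: file was created at the shadow
        return "shadow"               #   server, so every block lives there
    if not entry["RE"]:               # case 3: nothing ever written to this
        return "main"                 #   file, skip the on-disk bitmap check
    # case 2: consult the presence bitmap block by block
    return "shadow" if block_no in bitmap else "main"

entry = {"new_at_shadow": False, "RE": 1}
written = {1, 2}                      # blocks previously redirected on write
print(route_read(entry, 2, written))  # shadow: block 2 was modified locally
print(route_read(entry, 7, written))  # main: block 7 is still read-only
```

The RE optimization matters because the bitmap lives on disk: for the common case of files that are only ever read, the proxy answers the routing question from the in-memory hash table alone.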

Figure 3-9. Sequence of redirect-on-write file system calls.

a simple list-directory utility (i.e., "ls -l") may invoke multiple readdir procedure calls because of the large number of file objects present in the directory. To provide synchronization between multiple readdir calls, information is needed to keep track of the position where the last readdir call returned a file system object. In traditional NFS, this is accomplished by means of a cookie randomly generated on a per-file-object basis. In the context of ROW-FS, there are two possible scenarios for the readdir procedure: a first call or a subsequent call. For the sake of clarity, I refer to the main server readdir as m-readdir and the shadow server forwarding invocation as s-readdir. To virtualize the first s-readdir procedure, it

1. The ROW proxy intercepts the s-readdir request from the client and checks the status of the parent file handle in the SI hash table. If it is the first call, it initializes a temporary buffer to store temporary cookies for multiple m-readdir calls.

2. It checks the status of the RD flag of the parent directory, which indicates whether readdir has been previously called. If the RD flag is set, the call is forwarded to the shadow server. If the RD flag is not set, the following are the options for the relative structure of the directories in the shadow and main server:

3. It checks the file system type of the returned file system object. If it is a symbolic link, the readlink procedure is invoked to get the data from the main server, and a symlink call is issued to the shadow server to replicate the object.
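The cookie bookkeeping in the steps above can be sketched as follows. This is an illustrative toy, not the proxy's implementation: the class name, the list-based merge, and the use of plain integer positions as proxy-issued cookies are all assumptions made for clarity.

```python
# Sketch of readdir cookie virtualization: on the first pass the proxy
# merges main-server entries into the shadow view, then serves the client
# chunks from the merged listing, issuing its own cookies so the client
# never sees the two backing servers' independent cookie spaces.

class ReaddirState:
    def __init__(self, shadow_entries, main_entries):
        self.merged = list(shadow_entries)
        # first s-readdir: pull in main-server entries not yet in the shadow
        for name in main_entries:
            if name not in self.merged:
                self.merged.append(name)

    def readdir(self, cookie, count):
        """cookie: proxy-issued position (0 on the first client call)."""
        chunk = self.merged[cookie:cookie + count]
        next_cookie = cookie + len(chunk)    # handed back to the client
        return chunk, next_cookie

state = ReaddirState(["libX"], ["libX", "libY", "libZ"])
entries, cookie = state.readdir(0, 2)
print(entries, cookie)                       # ['libX', 'libY'] 2
entries, cookie = state.readdir(cookie, 2)
print(entries, cookie)                       # ['libZ'] 3
```

Once the RD flag is set for the directory, later calls skip the merge step entirely and are served from the shadow server, which is what makes the second-run readdir numbers in the evaluation so much lower than the first-run numbers.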

If the file system object is of link type (as provided by the nlink attribute of the fattr data structure), then the LINK procedure is called in the shadow server. The link procedure is the only procedure which can increment the nlink attribute of a file system object. All regenerated file system objects are updated in the SI/MI hash tables. exist" error.

62]. The NIST Net emulator is deployed as a virtual router in a VMware VM with 256MB of memory running Linux RedHat 7.3. Redirection is performed to a shadow server running in a virtual machine in the client's local domain. 3-2 show that ROW-FS performance is superior to NFSv3 in a WAN scenario, while comparable in a LAN. In the WAN experiment, recursive stat shows nearly five times improvement over

3-2. Clearly, WAN performance for ROW-FS during the second run is comparable with LAN performance and much improved over NFSv3. This is because once a directory is replicated at the shadow server, subsequent calls are directed to the shadow server by means of the readdir status flag. The initial readdir overhead for ROW-FS (especially in the LAN setup) is due to the fact that dummy file objects are being created in the shadow server during the execution. Remove: To measure the latency of remove operations, I deleted a large number of files (greater than 15000, with total data size 190MB). I observed that in ROW-FS, since only the remove state is being maintained rather than complete removal of a file, performance is nearly 80% better than that of conventional NFSv3. It takes nearly 37 minutes in ROW-FS, in comparison to 63 minutes in NFSv3, to delete 190MB of data over a wide-area network. Note that each experiment is performed with cold caches, set up by re-mounting file systems in every new session. If the file system is already replicated in the shadow server, it takes 18 minutes (WAN) to delete the complete hierarchy.

Table 3-2. LAN and WAN experiments for lookup, readdir, and recursive stat micro-benchmarks. For ROW-FS, each benchmark is run for two iterations: the first warms up the shadow server; the second accesses modifications locally. NFSv3 is executed once, as performance for the second run is similar to the first run. In both ROW-FS and NFSv3, NFS caching is disabled.

                     LAN (seconds)                   WAN (seconds)
Micro-benchmark      ROW-FS            NFSv3         ROW-FS            NFSv3
                     1st run  2nd run                1st run  2nd run
Lookup               0.018    0.008    0.011         0.089    0.018    0.108
Readdir              67       17       41            1127     17       1170
Recursive stat       425      404      367           1434     367      1965
Remove               160      NA       230           2250     NA       3785

3-3 summarizes the performance of the Andrew benchmark, and Figure 3-10 provides statistics for the number of RPC calls. The important conclusion taken from the data in Figure 3-10 is that ROW-FS, while increasing the total number of RPC calls processed during the application execution, reduces the number of RPC calls that cross domains to less than half. Note that the increase in the number of getattr calls is due to the invocation of the getattr procedure to virtualize read calls to the main server. Read calls are virtualized with shadow attributes (in the case when blocks are read from the main server) because the client is unaware of the shadow server; file system attributes such as file system statistics and file inode numbers have to be consistent between a read and a post-read getattr call. Nonetheless, since all getattr calls go to the local-area shadow server, the overhead of the extra getattr calls is small compared to getattr calls over the WAN.

Table 3-3. Andrew benchmark and AM-Utils execution times in local- and wide-area networks.

Benchmark        ROW-FS (sec)   NFSv3 (sec)
Andrew (LAN)     13             10
Andrew (WAN)     78             308
AM-Utils (LAN)   833            703
AM-Utils (WAN)   986            2744

63] is also used as an additional benchmark to evaluate performance. The automounter build consists of configuration tests to determine the features required for the build, thus generating a large number of lookup, read, and write calls. The second step involves compiling the am-utils software package. Table 3-3 provides experimental results for LAN and WAN. The resulting average ping time for the NIST-emulated WAN is 48.9ms in the ROW-FS experiment and 29.1ms in the NFSv3 experiment. Wide-area performance of ROW-FS for this benchmark is again better than NFSv3, even under larger average ping latencies.

Figure 3-10. Number of RPC calls received by the NFS server in a non-virtualized environment, and by the ROW-FS shadow and main servers, during Andrew benchmark execution.

Table 3-4. Linux kernel compilation execution times on a LAN and WAN.

Setup   FS       Old config time (s)   Dep time (s)   BzImage time (s)
LAN     NFSv3    49                    120            710
LAN     ROW-FS   55                    315            652
WAN     NFSv3    472                   2648           4200
WAN     ROW-FS   77                    1590           780

"old config", make "dep", and make "bzImage". Table 3-4 shows performance readings for both LAN and WAN environments. The performance of Linux kernel compilation for ROW-FS is comparable with NFSv3 in the LAN environment and shows substantial improvement in performance over the emulated WAN. Note that, for the WAN, kernel compilation performance is nearly five times better with the ROW proxy in comparison with NFSv3. The results shown in Table 3-4 do not account for the overhead of synchronizing the main server. Nonetheless, as shown in Figure 3-10, a majority of RPC calls do not require server updates (read, lookup, getattr); furthermore, many RPC calls (write, create, mkdir, rename) are also aggregated in the statistics - often the same data is written again, and many temporary files are deleted and need not be committed.

Fault tolerance: Finally, I tested the checkpointing and recovery of a computational chemistry scientific application (Gaussian [64]). A VMware virtual machine running Gaussian is checkpointed (along with ROW-FS state in the VM's memory and disk). It is then resumed, runs for a period of time, and a fault is injected. Some Gaussian experiments take more than one hour to finish and generate a large amount of temporary data (hundreds of MBytes). With ROW-FS, I observe that the application successfully resumes from a previous checkpoint. With NFSv3, inconsistencies between the client checkpoint and the server state caused the application to crash, preventing its successful completion.

3-5 summarizestheperformanceofdisklessboottimeswithdierentproxycachecongurations.Theresultsshowwhatprecachingofattributesbeforeredirectionandpostredirectiondatacachingdeliverthebestperformance,reducingwide-areaboottimewith"warm"cachesbyover300%. Table3-5. WideareaexperimentalresultsfordisklessLinuxboot/secondbootfor(1)ROWproxyonly(2)ROWproxy+datacache(3)attribute+ROW+datacache WAN Boot(sec) 2ndBoot(sec) Client->ROW->Server 435 236 Client->ROW->DataCache->Server 495 109 Client->Attr.Cache->ROW->DataCache->Server 409 76 3-6 .Inthesecondpart,Itestedthesetupwith 66


Table 3-6. Remote Xen boot/reboot experiment with ROW proxy and ROW proxy + cache

  NISTNet delay   ROW proxy                   ROW proxy + cache proxy
                  Boot (sec)  2nd Boot (sec)  Boot (sec)  2nd Boot (sec)
  1ms             121          38             147         36
  5ms             179          63             188         36
  10ms            248          88             279         37
  20ms            346         156             331         37
  50ms            748         266             604         41

aggressive client-side caching (Figure 2(b)). Table 3-6 also presents the boot/second-boot latencies for this scenario. For delays smaller than 10ms, the ROW+CP setup has additional overhead for Xen boot (in comparison with the ROW setup); however, for delays greater than 10ms, boot performance with the ROW+CP setup is better than with the ROW setup. Reboot execution time is almost constant with the ROW+CP proxy setup. Clearly, the results show much better Xen second-boot performance for the ROW+CP experimental setup.

[65]. The key advantages of ROW-FS over UnionFS are that the former is user-level and integrates with unmodified NFS clients/servers, while the latter is a kernel-level approach that requires support from the kernel; and that the former operates on individual file data blocks while the latter operates on whole files. This matters for applications where unmodified clients are deployed and for applications that access sparse data; for example, the provisioning of VM images. I have attempted to compare the performance of UnionFS and ROW-FS for Xen virtual machine instantiation across a wide area, but instantiating a Xen 3.0 domU with an image stacked using the latest version of UnionFS available at the time of writing (Unionfs 1.4) fails. The UnionFS copy-on-write mechanism is based on a complete copy-up to a new branch on write invocation, whereas ROW-FS replicates only the needed block; hence, ROW-FS has an added advantage over UnionFS for disk image instantiation (a large copy-up is expensive). Advantages of UnionFS over ROW-FS include potentially


[38]. In the past, researchers have used an NFS shadowing technique to log users' behavior on old files in a versioning file system [66]. Emulation of an NFS-mounted directory hierarchy is often used as a means of caching and performance improvement [67]. Kosha provides a peer-to-peer enhancement of the network file system to utilize redundant storage space [51]. In the past, file virtualization was addressed through an NFS-mounted file system within private namespaces for a group of processes, with the motivation of migrating the process domain [68]. A striped network file system has been implemented to increase server throughput by striping files across multiple servers [69]. This approach is primarily used to access file blocks from multiple servers in parallel, thus improving performance over NFS. A copy-on-write file server is deployed to share immutable template images for operating system kernels and file systems in [21]. The proxy-based approach presented in this dissertation is unique in that it not only provides copy-on-write functionality, but also provides provision for inter-proxy composition. Checkpoint mechanisms have been integrated into language-specific byte-code virtual machines as a means of saving an application's state [70]. VMware and Xen 3.0 virtual machines have provision for taking checkpoints (snapshots) and reverting back to them. These snapshots, however, do not support checkpoints of changes in a mounted distributed file system.


[71]. The approach is a thin-client solution for desktop grid computing based on virtual machine appliances whose images are fetched on demand and on a per-block basis over wide-area networks. Specifically, I aim at reducing download times associated with appliance images, and providing a decentralized, scalable mechanism to publish and discover upgrades to appliance images. The approach embodies different components and technologies - virtual machines, an overlay network, pre-boot execution environment services, and a redirect-on-write virtual file system. Virtual machines over virtual networks are deployed with a pre-boot execution server to facilitate remote network booting. The approach uses ROW-FS, which enables the use of unmodified NFS clients/servers and local buffering of file system modifications during the appliance's lifetime. Similarly to related efforts, our approach targets applications deployed on non-persistent virtual containers [4][28] through provisioning of virtual environments with role-specific disk images [60]. Thin computing paradigms offer advantages such as lower administration cost and failure management. In early computing systems, thin-client computing was successful for two main reasons: low-cost commodity hardware was not available to the end user, and a centralized approach to computing was often preferred due to easier system administration. As low-cost PCs and high-bandwidth local-area networks became widely available, thin-client computing lost ground. The advent of virtual machines has opened up new opportunities; virtual machines can be easily created, configured, managed and deployed. The virtualization approach of multiplexing physical resources not only


[72]. An illustrative example of a virtual appliance is a Fedora 9 appliance of size 800 MB with pre-configured graphical user interface packages [72]. Optimizing the size of an appliance is time-consuming, and in many cases not possible without loss of functionality (e.g., by avoiding installation of certain packages). Nonetheless, it is often the case that at run time only a small fraction of the virtual disk is actually "touched" by an application. I exploit this behavior by building on on-demand data transfers that substantially reduce the download time and bandwidth requirements for the end user. The following sections explain the overall architecture and approach.

(Figure 4-1). The approach is based on diskless provisioning of virtual machine environments through a virtual machine proxy. The utility of the envisioned architecture can be observed from the viewpoints of both users and system administrators. Users not only have fast and transparent access to different O/S images but also have automatic support to upgrade the O/S images. For administrators, it provides a framework for simple deployment and maintenance of new images. As shown in Figure 4-1, an end user X1 downloads a small proxy appliance (VM2) from download server DS. The proxy appliance is configured to connect to a virtual


Figure 4-1. O/S image management over wide-area desktops: User X1 downloads a small ROW-FS proxy-configured appliance (VM2) from download server DS. User X2 can potentially share the appliance image with User X1. Image Server (IS) exports read-only images to clients over NFS. VM1 is a diskless client. The appliance bootstrap procedure is explained further in Figure 4-2.

network overlay connecting it to other users (e.g., using IPOP [26][30]). An example NFS proxy appliance of size 350 MB can be downloaded from the VMware virtual appliance marketplace [72]. The proxy appliance is also configured to run a small TFTP server and a DHCP server to download the network bootstrap program and allocate an IP address to the client's working environment. The actual appliances which carry out computation can be configured with a desired execution environment and need not be downloaded in their entirety by end users - they are brought in on demand through the proxy appliance. Each node is an independent computer which has its own IP address on a private network. Key to this architecture is the redirect-on-write file system (ROW-FS). As explained in Chapter 3, ROW-FS consists of user-level DFS extensions that support selective


Figure 4-2. The deployment of the ROW proxy to support PXE-based boot of a (diskless) non-persistent VM over a wide-area network.

Figure 4-2 expands on Figure 4-1 to show the diskless provisioning of virtual machines. In Figure 4-2, VM1 is a diskless virtual machine, and VM2 is a boot proxy appliance configured with two NIC cards for communication with the host-only and public networks. VM2 is configured to execute ROW file system (ROW-FS) and NFS cache proxies. In addition, VM2 is configured to run DHCP and TFTP servers to provide the diskless client (VM1) an IP address and an initial kernel image. Classic virtual machines such as VMware provide support for PXE-enabled BIOS and NICs; PXE is a technology to boot diskless computers using network interface cards. The server VM is configured to share a common directory through ROW-FS to clients. To illustrate the workings of the diskless setup, consider the following steps to boot a diskless VM with an appliance image served over a wide-area network:

1. Diskless VM (VM1) invokes a DHCP request for an IP address
2. The DHCP request is routed through a host-only switch to the gateway VM (VM2)
3. VM2 is configured to have two NICs: host-only (private IP address) and public. VM2 receives the request at the host-only NIC (eth0).
4. The DHCP server allocates an IP address and sends a reply back to the diskless VM (VM1)


5. Diskless VM invokes a TFTP request to obtain the network bootstrap program and initial kernel image.
6. VM2 receives the TFTP request at the host-only eth0
7. The kernel image is transferred to VM1 and loaded in RAM to kick-start the boot process
8. Diskless VM invokes a mount request to mount a read-only directory from the server (VM3) through the proxy VM (VM2)
9. VM2 is configured to redirect write calls to a local server. Read-only NFS calls are routed through the proxy VM2 to VM3; the connection between VM2 and VM3 is through the virtual overlay network.

P2P networks are considered to be inherently self-configuring, scalable and very robust to node or system failures. Each P2P node maintains a view of the network at regular intervals, which facilitates seamless addition or removal of a node from the system. As nodes are added to the network pool, bandwidth and CPU processing are distributed and shared among users; thus P2P systems are very scalable. Furthermore, P2P systems are configured to be tolerant to node failures. P2P overlay networks such as IPOP also facilitate firewall traversal without administrator intervention, which allows P2P nodes behind firewalls to join the network [30]. The process of publishing and sharing O/S images is well supported by these P2P properties. The primary goal of the architecture is to automate the process of publishing, discovering and mounting appliance images. Furthermore, it should be possible for images to be replicated (fully or partially) across multiple virtual servers throughout a virtual network for load balancing and fault tolerance. It is feasible to provide image versioning capability by maintaining the latest image state in a decentralized way using a Distributed Hash Table (DHT) - which, in the case of the IPOP virtual network [30], is already responsible for providing DHCP addresses. DHTs provide two simple primitives: put(key, value) and get(key). In order to use the DHT to track appliance image versions, the key functionality needed can be broken down into the diskless client and the publisher client.
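Step 9 above is the heart of the redirect-on-write approach: write calls land in a local shadow store, while reads of unmodified data pass through to the read-only server. A minimal sketch of that dispatch logic (the `RowProxy` class and its dict-based stores are illustrative, not the actual proxy implementation; the per-file block bitmap follows the description in Chapter 3):

```python
class RowProxy:
    """Toy redirect-on-write dispatcher: writes go only to a local shadow
    store; reads come from the shadow for blocks written there, and from
    the (read-only) main server otherwise."""

    def __init__(self, main_server):
        self.main = main_server   # dict: (path, block_no) -> bytes, never written
        self.shadow = {}          # local shadow store
        self.bitmap = set()       # (path, block_no) pairs redirected so far

    def write(self, path, block_no, data):
        # Never touches the main server: redirect to shadow, mark the bitmap.
        self.shadow[(path, block_no)] = data
        self.bitmap.add((path, block_no))

    def read(self, path, block_no):
        if (path, block_no) in self.bitmap:
            return self.shadow[(path, block_no)]
        return self.main[(path, block_no)]  # pass-through to read-only server
```

For example, after `proxy.write("/vm.img", 0, b"mod")`, `proxy.read("/vm.img", 0)` returns the shadow copy while the main server's block stays untouched - which is what lets many clients share one read-only image.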


1. The diskless client downloads the boot proxy appliance machine from the download server (VM2 in Figure 4-1). The client bootstraps this generic appliance, configured to forward the client's requests to the Image Server (IS). The downloaded proxy machine is configured so as to connect with the network of appliances through the IPOP P2P network.
2. The diskless client queries the DHT for appliance A's version. An illustrative example of an appliance name could be a "Redhat" appliance.
3. The diskless client queries the DHT to obtain the image server IP address and mount path for appliance A.
4. Start the ROW-FS proxies using the image server IP address. The startup of the ROW-FS proxies sets up an access control list and a session directory to allow call forwarding to the image server and the local NFS server.
5. Bootstrap a diskless client virtual machine and establish a mount session with the image server. Virtual machine APIs are leveraged to bootstrap the diskless client.
6. The diskless client performs the experiments during the established ROW-FS session between the diskless client and the image server.
7. Halt the booted diskless client
8. Kill the ROW-FS proxies. The boot proxy appliance machine contains the client's session data and experimental run results.

Figure 4-3 and Figure 4-4 illustrate the algorithms for a diskless client to bootstrap an appliance and for a publisher client to publish the O/S image. Unused VMs can be removed from the system at regular intervals. When the number of clients accessing the VM image is zero and the image is not the latest version, the DHT entry expires with a timeout.
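Steps 2-3 reduce to two DHT lookups keyed on the appliance name. A sketch of that put/get traffic, using an in-memory dict as a stand-in for the IPOP DHT (the `Dht` class, key layout, and helper names are illustrative assumptions, not the deployed interface):

```python
class Dht:
    """Stand-in for a DHT exposing the two primitives named in the text."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

def publish(dht, appliance, version, ip, mount_path):
    # Publisher client: advertise the latest version and where to mount it.
    dht.put(appliance + ":version", version)
    dht.put(appliance + ":" + str(version), {"ip": ip, "path": mount_path})

def resolve(dht, appliance):
    # Diskless client, steps 2-3: latest version first, then its server.
    version = dht.get(appliance + ":version")
    return version, dht.get(appliance + ":" + str(version))
```

With this layout, `resolve(dht, "Redhat")` yields the latest version number plus the image server IP and mount path needed to start the ROW-FS proxies in step 4.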


Figure 4-3. Algorithm to bootstrap a VM session

Figure 4-4. Algorithm to publish a virtual machine image


[58]; it is also conceivable to integrate the logic to configure, create and tear down ROW-FS sessions with application workflow schedulers such as [73]. Further, failure transparency is an important property of distributed systems. During the time a ROW-FS file system session is mounted, all modifications are redirected to the shadow server. A question that arises is how file system modifications in the shadow server should be reconciled with data in the main server at the end of a session. Three scenarios can be considered for consistency support in the context of a redirect-on-write file system:

1. There are applications in which it is neither needed nor desirable for data in the shadow server to be reconciled with the main server; an example is the provisioning of system images for diskless clients or virtual machines, where local modifications made by individual VM instances or diskless machines are not persistent.
2. For applications in which it is desirable to reconcile data with the server, the ROW-FS proxy holds state in its primary data structures (the file handle hash table and the block bitmaps) that can be used to commit modifications back to the server. The approach is to remount the file system at the end of a ROW-FS session in read/write mode, and signal the ROW-FS proxy to traverse its file handle hash tables and bitmaps to commit changes (moves, removes, renames, etc.) to directories


3. One particular use case of ROW-FS is autonomic provisioning of O/S disk images shared between multiple clients. In this context, I leverage APIs exported by lookup services (such as a Distributed Hash Table) in distributed frameworks (such as IPOP [26]) to store clients' usage and sharing information. This approach is based on multiple clients converging to use the latest appliance image over the course of time.
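Scenario 2 above - remount read/write, then walk the proxy's bitmaps to push changes back - can be sketched as follows. This is a minimal illustration, assuming dict/set stand-ins for the shadow store, block bitmap, and writable server; the real proxy traverses its file handle hash table and issues NFS writes instead:

```python
def commit_session(shadow_blocks, bitmap, server_store):
    """Push every shadow-modified block back to the main server.

    shadow_blocks: dict mapping (file_handle, block_no) -> data
    bitmap:        set of (file_handle, block_no) pairs modified in-session
    server_store:  dict standing in for the remounted read/write server
    Returns the number of blocks committed.
    """
    for key in sorted(bitmap):               # deterministic traversal order
        server_store[key] = shadow_blocks[key]   # commit one block
    return len(bitmap)
```

Only blocks flagged in the bitmap are transferred, so the commit cost is proportional to the data actually modified during the session rather than to the image size.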


[74]. [16]). Each directory or file in ROW-FS is uniquely addressed by a persistent file handle. It is feasible to provide replica support for wide-area desktops in the ROW-FS proxy, as file handles in a replicated VM are consistent with the primary VM.


Figure 4-5. Replication approach for ROW-FS. File handles of the server replica and the read-only server are equivalent. A timeout of the connection to the read-only server can result in switchover to the replica server.

The ROW-FS proxy can be configured to provide support for a read-only server replica. Figure 4-5 shows the feasibility of a virtual machine-based replication mechanism. The ROW-FS proxy is configured to forward calls to both the read-only and replica servers. Each replica is a cloned virtual machine which exports the root file system of the grid appliance to the client. An example scenario is shown in the figure. If the read-only server r0 goes down for some reason, the ROW-FS proxy can be configured to switch over to a replica server (r1) after a no-response time, Tout. The shadow state of the ROW-FS proxy - which includes the file handle mapping between the shadow server (s0) and read-only server (r0) and the bitmap data structure - is still valid with the new read-only server replica (r1). This is feasible because the shadow server state depends only on the file handles of the s0 and r0 servers. Servers r0 and r1 have identical file handles, which facilitates a seamless transition of NFS calls to r1.
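The switchover logic of Figure 4-5 amounts to trying the read-only server first and moving to the replica only when the Tout timer expires. A minimal sketch (server objects modeled as callables and `TimeoutError` standing in for an expired NFS retransmission timer are illustrative assumptions; identical file handles on r0/r1 are what make the shadow state reusable across the switch):

```python
def forward(servers, request):
    """Forward an NFS call to the first responsive server in `servers`
    (e.g. [r0, r1]); fail over to the next replica only on timeout.
    Raises the last timeout if every replica is down."""
    last_error = None
    for server in servers:
        try:
            return server(request)      # server: callable taking a request
        except TimeoutError as exc:
            last_error = exc            # Tout expired: try the next replica
    raise last_error
```

Because the shadow state keys only on file handles, a call that fails over mid-session still resolves against r1 exactly as it would have against r0.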


[74]:
1. Confidentiality: Publishers should be able to publish O/S images for intended users.
2. Integrity: The integrity of a publisher's claim should be maintained. No other user should be able to modify the publisher's claim.

To address these security properties, I consider the following security mechanisms:
1. Encryption: A publisher's claim for an O/S image needs to be encrypted to avoid any interception or fabrication by a rogue user.
2. Authentication: The publisher of an image must be able to prove its identity. To establish the publisher's identity, a public-key cryptography scheme can be applied. The public key of each user in the P2P network can be advertised. Any claim by a user is encrypted with its own private key.

While authentication and encryption help in sending data securely, it is important to model trust between the P2P users. Trust models are a way to validate client X's claim to be "User X". Various trust models have been used to establish trust in distributed systems. For example, a "web of trust" approach is a commonly used email scheme to send private emails to end users [74]. The approach relies on each user maintaining a list of trusted public keys. I suggest a public key infrastructure as the trust model for the O/S management framework. A public key infrastructure is a collection of certificate authorities and certificates assigned to users. Certificates are a common cryptographic technique used in e-commerce applications. A certificate is a digital signature which helps in maintaining identification, authorization and data confidentiality for the user. A digital signature is a kind of asymmetric cryptography used to securely send messages between users. While asymmetric-key cryptography securely transfers the data, the question of trust between end users persists. For the purpose of this dissertation, I assume that there is a trusted certificate


Figure 4-6 illustrates the security mechanism to authenticate and encrypt the client's and publisher's data. Here, the assumption is that the public key of the certificate authority is built into the P2P overlay network. To validate each client's identity, the certificate authority encrypts the client's identification (ID_C) and public key (K+_C) - i.e., the certificate - with its private key (K-_CA) and distributes it over the P2P network.

Figure 4-6. Diskless client and publisher client security. C: Client, P: Publisher and CA: Certificate Authority.

The following equations illustrate the encryption of appliance (A) information by a publisher and its subsequent decryption by a client, where K-_P and K+_P denote the publisher's private and public keys and Vi is the image version.

Publisher encryption:
  (A, Vi) => K-_P(A, Vi)
  (A, Vi, IP, MountPath) => K-_P(A, Vi, IP, MountPath)

Client decryption:
  K+_P(K-_P(A, Vi)) => (A, Vi)
  K+_P(K-_P(A, Vi, IP, MountPath)) => (A, Vi, IP, MountPath)
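The K-_P/K+_P scheme above is an ordinary public-key signature: only the publisher can produce K-_P(claim), and any client holding K+_P can invert and check it. A toy illustration with textbook RSA - tiny fixed primes and a hash digest standing in for the claim; deliberately insecure, purely to show the key roles (the key values and claim format are illustrative assumptions):

```python
import hashlib

# Toy textbook-RSA keypair (classic small example: p=61, q=53, n=3233).
# K+_P = (E, N) is advertised on the overlay; K-_P = (D, N) stays private.
N, E, D = 3233, 17, 2753

def digest(claim):
    # Reduce the claim string (A, Vi, IP, MountPath) to an integer below N.
    return int.from_bytes(hashlib.sha256(claim.encode()).digest(), "big") % N

def sign(claim):
    # Publisher: K-_P(claim) -- "encrypt" the digest with the private key.
    return pow(digest(claim), D, N)

def verify(claim, signature):
    # Client: K+_P(K-_P(claim)) must reproduce the claim's digest.
    return pow(signature, E, N) == digest(claim)
```

A client that fetches (A, Vi, IP, MountPath) from the DHT can thus confirm the record really originated from the publisher before mounting anything.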


(Figure 4-2). In this experiment, virtual machines VM1, VM2 and VM3 are deployed in a host-only network using VMware's ESX Server VM monitor. Note that this experiment only reflects VM resource statistics (not application execution time). The experimental setup is as follows: server VM3 is configured with 2GB RAM and a single virtual CPU. VM2 and VM1 are configured with 1GB RAM, and also a single VCPU. The VMs are hosted by a dual Xeon 3.2GHz processor, 4GB memory server. The size of the appliance image is 934MB. Figure 4-7 shows time-series plots with CPU, disk and network rates for three different intervals. These values are obtained in 20-second intervals leveraging VMware ESX's internal monitoring capabilities. In the first interval, the VM is booted. In the second interval, the VM runs a CPU-intensive application (the computer architecture simulator SimpleScalar) which models the target workload of a typical voluntary computing execution. In the third phase, the appliance is rebooted.


Figure 4-7. Proxy VM usage time series for CPU, disk and network. Results are sampled every 20 seconds and reflect data measured at the VM monitor. Three phases are shown (marked by vertical lines): appliance boot, execution of a CPU-intensive application (SimpleScalar), and appliance reboot.


[75]. The figure shows high data write rates during VM boot execution. Clearly, I can observe a maximum of 12% CPU consumption and also a high data rate across the network to load the initial kernel image into the diskless client's (VM1) memory. Application execution: Since the application is CPU-intensive, the proxy VM exhibits little run-time overhead in this phase. This is because once the diskless VM (VM1) is booted, it loads the necessary files for the application execution into RAM (as shown by the initial network activity). I can further observe that disk and network usage is negligible in VM2 during the execution of the SimpleScalar application, thus supporting my assumption of minimal overhead for the proxy-configured VM. VM reboot: During reboot, the client has replicated session state at the shadow server. I see an average boot-time reduction (described below). I observe further spikes in network and CPU usage as some files are fetched and read from the server VM. In past results, I have shown that aggressive caching can further improve boot performance [75].

Figure 4-8 provides statistics for the number of RPC calls during the boot-up of the diskless appliance VM1. The histogram is broken down by the different types of RPC calls corresponding to NFS protocol calls, from left to right: get and set file attributes, file handle lookup, read links, read block, write block, create file, rename file, make directory, make symbolic link, and read directory. The important conclusion taken from the data in Figure 4-8 is that ROW-FS, while increasing the total number of calls routed to the local shadow server, reduces the number of RPC calls that cross domains. Note that the increase in the number of get-attribute (getattr) calls is due to the invocation of the getattr procedure to virtualize read calls to the main


Figure 4-8. RPC statistics for diskless boot. The shadow server receives the majority of RPC calls. The bars represent the number of RPC calls received by the shadow and main servers.

[30] virtual network, and the server and client VMs are behind NATs. The proxy VM2 is


Table 4-1 summarizes the results from this experiment. Notice that the boot times reduce to less than half, becoming comparable to the LAN PXE/NFS boot time of approximately 2 minutes.

Table 4-1. Appliance boot/reboot times over WAN. ISP is a VM behind a residential ISP provider; UFL is a desktop machine at the University of Florida; VIMS is a server machine at the Virginia Institute of Marine Sciences.

  VM1/VM2, VM3   Boot (seconds)   2nd Boot (seconds)   Ping latency
  ISP, UFL       291              116                  23ms
  UFL, VIMS      351              162                  68ms

Figure 4-9 shows the cumulative distribution graph of 100 accesses of the DHT through 10 IPOP clients. The clients are randomly chosen to query the DHT. The distribution graph shows that in most cases it takes less than 2 seconds to query the DHT and obtain the appliance status information. The average time to insert the appliance version, with the appliance name as key, over 10 iterations is 1.4 sec. Table 4-2 provides the mean and variance statistics for five clients. The mean and variance statistics show that client access times to the DHT vary over P2P nodes deployed on PlanetLab. The access time often depends on the route path taken to access the DHT information.

Table 4-2. Mean and variance of DHT access time for five clients

  Statistics   Client1   Client2   Client3   Client4   Client5
  Mean         0.567     0.648     2.699     0.6875    2.224
  Variance     0.00254   0.0389    4.731     0.08512   1.337

[76] provides mechanisms to make read-only data scalable over the wide-area network through cooperative caching and NFS proxies; my approach


Figure 4-9. Cumulative distribution of DHT queries through 10 IPOP clients (in seconds)

complements it by enabling redirect-on-write capabilities, which is a requirement to support the target application environment of NFS-mounted diskless VM clients. SFS advocated the approach of a read-only file system for untrusted clients. There have been upcoming commercial products which either provide thin-client solutions based on the pre-boot execution environment [77][78] or provide a cache-based solution as a viable thin-client approach for scalable computing [79]. A distributed computing approach based on stackable virtual machines and sandboxes is advocated in [59]. A stackable storage-based framework is also used to automate cluster management as a means to reduce administrative complexity and cost [60]. The approach advocated in [60] is re-provisioning of the application environment (base OS, servers, libraries and application) through role-specific (read or write) disk images. A framework to manage a cluster of virtual machines is proposed in [80]. The Stork package management tool provides mechanisms to share files such as libraries and binaries between virtual machines [81]. A copy-on-write file server is deployed to share immutable template images for operating system kernels and file systems in [21]. This approach uses a combination of traditional NFS for read-only mounts and AFS for aggressive caching of shared images


[21]. Write-once-semantics-based file systems are commonly used to leverage commodity disk images for applications such as map-reduce [82].


[1][71], a common deployment scenario of ROW-FS is when the virtual machine hosting the shadow server and the client virtual machine are consolidated into a single physical machine. Such a scenario is common when the client VM is diskless or disk space on the client VM is a constraint. While deploying ROW proxies in such cases provides much-needed functionality, the overhead associated with virtualized network I/O is often considered a bottleneck [9][10]. While the virtualization cost depends heavily on workloads, it has been demonstrated that the overhead is much higher with I/O-intensive workloads compared to those which are compute-intensive [10]. Unfortunately, the architectural reasons behind the I/O performance overheads are not well understood. Early research in characterizing these penalties has shown that cache misses and TLB-related overheads contribute most of the I/O virtualization cost [10][83][84]. While most of these evaluations are done using measurements, in this chapter I discuss an execution-driven simulation-based analysis methodology with symbol annotation as a means of evaluating the performance of virtualized workloads. This methodology provides detailed information at the architectural level (with a focus on cache and TLB) and allows designers to evaluate potential hardware enhancements to reduce virtualization overhead. This methodology is applied to study the network I/O performance of Xen (as a case study) in a full-system simulation environment, using detailed cache and TLB models to profile and characterize software and hardware hotspots. By applying symbol annotation to the instruction flow reported by the execution-driven simulator, I derive function-level call flow information. I follow the anatomy of


[20]), using the SoftSDV [85] execution-driven simulator extended with symbol annotation support and a network I/O workload (iperf).


[10][83]. The rest of this chapter is organized as follows. The motivation behind the current work is described in Section 5.2. Section 5.3 describes the simulation methodology, tools and symbol annotations. Section 5.4 details the software and architectural anatomy of I/O processing by following the execution path through the guest domain, hypervisor and the I/O VM domain. Also, I provide initial results of resource scaling in Section 5.5. Section 5.6 describes related work.

[13][14][15]. A simulation-based methodology for virtual environments is also important to guide the design and tuning of architectures for virtualized workloads, and to help software systems developers identify and mitigate sources of overhead in their code. A driving application for simulation-driven analysis is I/O workloads. It is important to minimize performance overheads of I/O virtualization in order to enable efficient workload consolidation. For example, in a typical three-tier data center environment, Web


[17]. Enabling a low-latency, high-bandwidth inter-domain communication mechanism between VM domains is one of the key architectural elements which could push this distributed-services architecture evolution forward.

[86][13]; I use the SoftSDV simulator [85] as a basis for the experiments. SoftSDV not only supports fast emulation with dynamic binary translation, but also allows proxy I/O devices to connect a simulation run with physical hardware devices. It also supports multiple sessions being connected and synchronized through a virtual SoftSDV network. For cache and TLB modeling I integrated CASPER [87] - a functional simulator which offers a rich set of performance metrics and protocols to determine cache hierarchy statistics.


(Figure 5-1, (A)). The guest domain's frontend driver communicates with backend drivers through IPC calls. The virtual and backend driver interfaces are connected by an I/O channel. This I/O channel implements a zero-copy page-remapping mechanism for transferring packets between multiple domains. I describe the I/O VM architecture along with the life-of-packet analysis in Section 5.4.

Figure 5-1. Full system simulation environment with Xen execution includes (A) Xen virtual environment (B) SoftSDV simulator (C) physical machine


Figure 5-2 summarizes the profiling methodology and the tools used. The following sections describe the individual steps in detail; these include (1) virtualization workload, (2) full system simulation, (3) instruction trace, (4) performance simulation with detailed cache and TLB simulation, and (5) symbol annotation.

(Figure 5-1). In order to analyze a network-intensive I/O workload, the iperf benchmark application is executed in DomU. This environment allows us to tap into the instruction flow to study the execution flow and to plug in detailed performance models to characterize architectural overheads. The DomU guest uses a frontend driver to communicate with a backend driver inside Dom0, which controls the I/O devices. I synchronized two separate simulation sessions to create a virtual networked environment for I/O evaluation. The execution-driven simulation environment combines functional and performance models of the platform. For this study, I chose to abstract the processor performance model and focus on cache and TLB models to enable coverage of a long period in the workload (approximately 1.2 billion instructions).


Figure 5-2. Execution-driven simulation and symbol-annotated profiling methodology. The full system simulator operates either in functional mode or performance mode. Instruction trace and hardware events are parsed and correlated with symbols to obtain an annotated instruction trace.


(Section 5.4). An example execution flow after symbol annotation is given in Figure 5-3. These decoded instructions from the functional model are then provided to the performance model, which simulates the architectural resources and timing for the instructions executed (Figure 5-4).


Figure 5-3. Symbol annotation. Compile-time Xen symbols are collected from the hypervisor, driver and application and annotated. The figure shows an example where symbols are annotated with "kernel" and "hypervisor".

Figure 5-4. Function-level performance statistics. The figure illustrates how performance statistics are coupled with the instruction call graph for each function. Sample statistics for L1/L2 caches and instruction and data TLBs are shown.


(Figure 5-5). An instruction parser is used to parse different instruction events such as INT (interrupts, system calls), MOV CR3 (address space switch), and CALL (function call). These traces were dumped into a file with run-time virtual address information, as well as cache and TLB statistics. Instruction traces are parsed and mapped with symbol dumps to create an I/O call graph. SoftSDV system call (SSC) utilities facilitate transfer of data between the host and the simulated guest. A performance simulation model is used to collect instruction traces along with hardware events of the virtualized workload. These utilities are important as I gathered runtime symbols of the kernels and application from the proc kernel data structure to transfer to the host system (for example, /proc/kallsyms for kernel symbols). For iperf runtime symbols, we mapped the process ID with the corresponding process ID in the proc directory. These run-time symbols,


Figure 5-5. SoftSDV CPU controller execution mode: performance or functional. In functional mode, the SoftSDV simulator provides an instruction trace. In performance mode, the instruction trace is parsed to obtain hardware events such as cache and TLB misses. Compile-time symbols from the kernel, drivers and application, along with runtime symbols from the proc file system, are collected to obtain per-function event statistics.

Symbols are annotated to keep track of the source of a function call invocation. Note that there can be duplicate symbols when the collected symbols are merged into a single file. These duplicates are removed and the collected data is further formatted in a useful way. In some cases, it is necessary to manually resolve ambiguities in virtual address spaces through a checkpoint at a virtual address during a re-run of a simulated SoftSDV session. Linux utilities such as nm and objdump are often used to collect symbols from compile-time symbol tables. In general, any application can be compiled to provide symbol table information. In C++ applications (such as iperf), function name mangling in object code is used to provide distinct names for functions that share the same name; essentially, it adds encoded prefixes and suffixes to the function name. I used the demangle option of the nm utility to identify the correct functions for the iperf application.
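The annotation step described above - merging nm-style symbol dumps and mapping each trace address to its enclosing function - can be sketched as follows. This is a simplified illustration (the input format assumes `nm`'s "address type name" lines; the origin tags mirror the "kernel"/"hypervisor" labels of Figure 5-3, and the symbol names in the example are placeholders):

```python
import bisect

def load_symbols(nm_output, origin):
    """Parse nm-style lines ("<hex-addr> <type> <name>") into a sorted
    list of (start_address, tagged_name), tagging each name with its
    origin (e.g. "kernel", "hypervisor")."""
    symbols = []
    for line in nm_output.strip().splitlines():
        addr, _type, name = line.split()
        symbols.append((int(addr, 16), "%s:%s" % (origin, name)))
    symbols.sort()
    return symbols

def annotate(symbols, address):
    """Map a trace address to the enclosing function: the symbol with
    the greatest start address <= address."""
    starts = [a for a, _ in symbols]
    i = bisect.bisect_right(starts, address) - 1
    return symbols[i][1] if i >= 0 else "unknown"
```

Running each virtual address in the instruction trace through `annotate` yields the per-function call-flow and event attribution used throughout the rest of this chapter.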


Figure 5-5 shows the simulation framework implementation used to obtain call graph information and perform cache scaling studies. As illustrated in Figure 5-5, the CPU controller layer in SoftSDV integrates with a performance or functional model. The platform configuration for this study is set to a single processor with two levels of cache (32KB first-level data and instruction caches, 2MB L2 cache) and with 64-entry instruction and data TLBs. The experimental setup involved multiple SoftSDV sessions connected over a virtual network. I chose to run the iperf application to study the life of an I/O packet as it is a representative benchmark for measuring and studying network characteristics. The iperf client is executed to initiate packet transmissions from a Xen environment.

Figure 5-6 shows an overview of the different stages which characterize the life of a packet between VM domains. Typically, a network packet in the Xen environment goes through the following four stages in its lifecycle after the application execution:

1. Unprivileged domain - packet build and memory allocation
2. Page transfer mechanism - a zero-copy mechanism to map pages into the virtual address space of the Dom0/DomU domains
3. Timer interrupts - context switch between hypervisor and domains


4. Privileged domain - forwarding the I/O packet down the wire and sending an acknowledgment back to the guest domain.

Figure 5-6. Life of an I/O packet: (a) application execution, (b) unprivileged domain, (c) grant table mechanism - switch to hypervisor, (d) timer interrupt, (e) privileged domain

[27]. An interface in Xen to allocate a socket buffer in the networking layer (alloc_skb_from_cache) is identified. The frontend driver uses the grant table mechanism provided by the hypervisor to transfer the buffer to Dom0. The functions and the associated instruction counts for the overall life of the packet in DomU include socket lock, copy data from user space to kernel space, allocate page from free list, and release socket lock (Figure 5-7). Note that the instruction count statistics are shown in chronological order with function entry points as markers. I removed some repeating


Figure 5-7. Dom-U call graph: socket allocation (alloc_skb_from_cache), user-kernel data copy (copy_from_user) and finally TCP transmit write (tcp_write_xmit).


Figure 5-8 demonstrates the execution flow from DomU to the hypervisor through the grant table mechanism.

Figure 5-8. TCP transmit (tcp_transmit_skb) and grant table invocation (gnttab_claim_grant_reference)

do_upcall to start processing the event. The functions invoked during the timer interrupt are shown in Figure 5-9.


Figure 5-9. Annotated call graph showing the context switch between the hypervisor and the Dom-0 VM - timer interrupts (write_ptbase)

Figure 5-10 (since the complete execution at this stage is long, snippets of execution covering the basic flow and highlighting the important functions are shown). Note that the grant table mechanism is used to map guest pages into the Dom0 address domain on the backend receiving side. Then the packet is sent to the bridge code, after which it is sent out on the wire. Once complete, the host map is destroyed and an event is sent on the event channel to the guest domain. It is interesting to note that the processor TLB is flushed while destroying the grant. This is done by writing the CR3 register (the x86 page table pointer) through the write_cr3 function. I describe the impact of this TLB flush in Section 5.4.2.


Figure 5-10. Life of a packet in Dom-0: accessing the granted page (create_grant_host_mapping), ethernet transmission (e100_tx_clean), destroying the grant mapping (destroy_grant_host_mapping) and event notification back to the hypervisor (evtchn_send)


Figure 5-11 shows an execution snippet where TLB flushes and misses are plotted as a function of simulated instructions retired. The figure shows that there is a high correlation between TLB misses, context switches and TLB flush events. An execution run of a VM during a period with no context switches or TLB flushes results in negligible TLB misses. Whenever TLB flushing events happen, there is a surge of TLB misses. This correlates well with the observations of TLB miss overhead in earlier studies. Figure 5-12 shows the increased number of TLB misses associated with the VM switches in a cumulative graph. I observe that there is a surge of TLB misses associated with each VM switch. Execution segments without VM switches show flat areas with few TLB flushes. Figure 5-13 depicts a typical VM switch scenario. The execution moves from one VM to another through a context switch. The CR3 value is changed to point to the new VM context. This triggers the hardware to flush all the TLBs to avoid invalid translations. But this comes with the cost of a TLB miss every time a new page is touched, both for code and data pages. Another scenario is the explicit TLB flushes done by the Xen hypervisor as part of the data transfer between VMs. This is an artifact of the current I/O VM implementation, as explained in the previous section. In order to revoke a grant, a complete TLB flush

Figure 5-11. Impact of TLB flush and context switch. The x-axis shows a slice of the total number of instructions retired during an execution run of the iperf application. The y-axis shows instruction and data TLB miss events, normalized to TLB flushes and context switches.

Figure 5-12. Correlation between VM switching and TLB misses. The x-axis shows a segment of the total number of instructions retired. The y-axis (left) represents VM switching, where "1" indicates a VM switch. The y-axis (right) shows cumulative TLB misses.

Figure 5-13. TLB misses after a VM context switch. Instruction and data TLB misses are plotted on the y-axis against a segment of instructions retired on the x-axis. A context switch between virtual machines causes a TLB flush, which increases the number of TLB misses.

is executed explicitly, which also creates TLB performance issues similar to a VM switch. Figure 5-14 demonstrates the code flow and the TLB impact. Figure 5-15 shows the impact of context switches on cache performance. The vertical lines mark VM switch events obtained through symbol annotation, and the plotted line shows the cumulative cache miss events. Note that the cache miss rate increases are also correlated with VM switch events [88]. TLB and cache statistics are measured for the transfer of approximately 25 million TCP/IP packets.

Figure 5-14. TLB misses after a grant destroy. The x-axis shows a segment of instructions retired and the y-axis represents data and instruction TLB misses.

Figure 5-15. Impact of VM switch on cache misses. The x-axis shows a segment of instructions retired. The y-axis (left) represents VM context switches through the vertical lines between 0 and 1. A context switch between virtual machines causes a TLB flush, which increases the number of L2 cache misses (y-axis, right).

Figure 5-16. L2 cache performance when the L2 cache size is scaled from 2MB to 32MB. In the plot, the L2 cache miss ratio is normalized to the 2MB L2 cache size. The data points are collected when the iperf client is executed in a guest VM (transmit of I/O packets).

Figure 5-17. Data and instruction TLB performance when the TLB size is scaled between 64 and 1024 entries. The TLB miss ratio is normalized to a TLB size of 64 entries. The data points are collected when the iperf client is executed in a guest VM (transmit of I/O packets).

Figure 5-16 shows the effect of scaling the L2 cache. The performance model is configured to simulate a two-level cache: a 32KB L1 (split data and instruction) and a 2MB unified L2 cache. The primary goal is to understand the cache sensitivity of the I/O virtualization architecture in the context of network I/O. Note that increasing the L2 cache size up to 4MB provided good performance scaling, after which the increase in performance was minimal. Beyond 8MB, the rate of reduction in miss rates is small. I attribute the reduced miss rates from the 8MB cache to the inclusion of the needed pages from the hypervisor, Dom-0 and Dom-U.

Figure 5-18. L2 cache performance when the L2 cache size is scaled from 2MB to 32MB. The L2 cache miss ratio is normalized to the 2MB L2 cache size. The data points are collected when the iperf server is running in a guest VM (receive of I/O packets).

Figure 5-17 shows the TLB performance scaling impact for data and instruction TLBs. As shown in the figure, with an increase in the size of the data TLB, the miss ratio decreases for sizes up to 128 entries. For larger sizes, the miss ratio is nearly constant. The ITLB miss rate decreases slightly, while the DTLB rate shows a sharper decrease from 64 to 128 entries. It can be inferred that a TLB size of 128 entries is sufficient to hold all address translations during the TLB stage. Increasing the TLB size is not a very effective enhancement in this scenario. This is because, as observed in Figures

Figure 5-19. Data and instruction TLB performance when the TLB size is scaled between 64 and 1024 entries. The TLB miss ratio is normalized to a TLB size of 64 entries. The data points are collected when the iperf server is running in a guest VM (receive of I/O packets).

5-12 and 5-13, there are substantial numbers of TLB flushes during grant revocation and VM switches, which invalidate all TLB entries. A large TLB does not help mitigate the effect of the compulsory TLB misses that follow a flush. Similarly, the cache and TLB scaling studies are performed on the receive side; results are given in Figures 5-18 and 5-19, respectively. [89][20]. Performance monitoring tools have been deployed to gauge application performance in virtualized environments [9][83][10]. Traditional network optimizations such as TCP/IP checksum offload and TCP segmentation offload are being used to improve the network performance of Xen-based virtual machines [10]. In addition, a faster I/O channel for transferring network packets between guest and driver domains is being studied [10]. These studies lack a micro-architectural overhead analysis of the virtualized environment.

[90][36].

[91]. In multi-core processors, a virtual address translation stored in the page table entries of a processor needs to be propagated to the TLBs of all processors. Many architectures resort to flushing the entire contents of a remote TLB to enforce such coherence, in a process often called "TLB shootdown". Consider an example of network I/O communication in a virtualized environment. The grant table mechanism adopted by the Xen VMM is based on modifying the access protection bits of a page table entry shared between the guest and privileged domains. Therefore, network I/O communication between guest and privileged domains may result in TLB shootdowns. The problem with the shootdown approach is that it works at a coarse coherence granularity by invalidating all entries of a TLB. Because not all TLB entries must be invalidated to enforce consistency (only those that are affected by the protection changes), this coarse-grain approach to enforcing coherence can result in the

6.2. I introduce an overview of the interprocessor interrupt mechanism used by Linux on x86-based processors to implement TLB shootdowns in Section 6.3. Section 6.4 explains the page sharing mechanism in the Xen hypervisor. Section 6.5 provides details of experiments to measure I/O overhead, evaluate hardware support to tag hypervisor pages, and evaluate the potential for selective flushing in interprocessor interrupts. Section 6.6 describes the related work.

6.2.1 Introduction

The translation look-aside buffer (TLB) is an on-chip cache to expedite virtual-to-physical address translation. In the absence of a TLB, the page table data structure is used to access the physical page corresponding to a virtual address; this process of translating virtual into physical addresses is expensive. Instead, processors rely on the TLB and locality of reference to achieve fast address translation. This process can be summarized as follows. An application run generates a virtual memory address to access an instruction or data from memory. The CPU looks up the virtual address by indexing the TLB. If the TLB access is a hit, then the page table entry present in the TLB is used to access the physical page. In multi-core systems, typically each processor has its own TLB in order to achieve fast lookup times. This creates a challenge in managing multiple translations cached across multiple TLBs, and thus it is important to maintain TLB coherency. Unlike processor data and instruction caches, TLB coherency is implemented by the operating system. This is accomplished by the operating system issuing, for any update to a page table entry, a TLB invalidation operation.
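The lookup-and-refill sequence above can be sketched as a toy model. The C fragment below is a minimal illustration, not Xen or Linux code, and all names are hypothetical: a direct-mapped TLB indexed by virtual page number, with a flush operation that invalidates every entry. It shows why a full flush turns subsequent accesses into compulsory misses, which is the behavior observed in the measurements of Chapter 5.

```c
#include <stdint.h>
#include <string.h>

#define TLB_ENTRIES 64   /* toy size; real ITLB/DTLB sizes vary */

typedef struct {
    int      valid;
    uint32_t vpn;   /* virtual page number (tag) */
    uint32_t pfn;   /* physical frame number     */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static unsigned long tlb_misses;

/* Invalidate every entry, as a CR3 reload without global pages does. */
void tlb_flush_all(void)
{
    memset(tlb, 0, sizeof(tlb));
}

/* Look up a virtual page; on a miss, "walk the page table" (here an
 * identity mapping pfn = vpn stands in for the walk) and refill. */
uint32_t tlb_lookup(uint32_t vpn)
{
    tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];
    if (!e->valid || e->vpn != vpn) {   /* miss: refill from "page table" */
        tlb_misses++;
        e->valid = 1;
        e->vpn = vpn;
        e->pfn = vpn;                   /* identity map for the toy model */
    }
    return e->pfn;
}

unsigned long tlb_miss_count(void) { return tlb_misses; }
```

Touching a working set twice incurs misses only on the first pass; interposing tlb_flush_all() between passes makes the second pass miss again, mirroring the miss surges after VM switches and grant revocations.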


[91]. TLB entries rely on information provided and updated through the page table. A TLB flush operation may result in a TLB miss, a page table walk, a possible page fault (if the page is not in memory), and a TLB refill. A hardware state machine walks through the page table to refill the TLB entry.

Figure 6-1. The x86 page table for small pages: the paging mechanism of the x86 architecture is shown. The control register (CR3) is loaded with the page table base of the currently scheduled process. The virtual address from an application is divided to obtain the page directory entry (PDE), page table entry (PTE), and page offset.

Figure 6-1 illustrates the translation of a virtual address into a physical address in the x86 architecture. The sequence of access to a physical page from a virtual (linear) address is as follows: (1) the virtual address is looked up in the TLB; (2) if a TLB translation is not available, the virtual address is translated into a physical address through the page table to retrieve the page content; (3) if the virtual address is not present in the page table, a page fault is invoked. The page table is a hierarchical structure to index and retrieve the final location of the physical address corresponding to the virtual address. In addition, it provides entries to check the access privileges and mode of invocation of the page. This is to prevent other

[92]. In the multi-core case, it may be that either the page table is shared between two CPUs or page entries are shared between different CPUs. [92]. Sending such a request involves programming special registers in this APIC (the ICR, or interrupt command register) to select, among other things, the destination of an interrupt and its arguments. The format and entries of the ICR register can be obtained from Intel's software developer specifications [92]. Interprocessor interrupts are initiated after a write to the APIC's ICR register. In an SMP architecture, each processor has its own local APIC to invoke or act on an IPI. A local APIC unit indicates successful dispatch of an IPI by resetting the delivery-status bit in the interrupt command register (ICR). An example of an IPI is flushing the TLB contents when a processor modifies a shared entry in the page table data structure that is shared with other processors. This allows synchronization between the address translation procedures of SMP processors. Another example of an IPI is rescheduling a new task on an SMP machine. Consider the example of scheduling the "idle" task. To implement this, first all the interrupts are enabled on the SMP processors and then the "hlt" instruction is issued to all the processors. Whenever an interrupt is received from a system device such as the keyboard, a CPU is awakened through the interrupt. In SMP, when one CPU is awakened by such an interrupt, an IPI is sent to the other CPUs through a write to the APIC's ICR register (e.g., in the Linux OS, the "send_IPI_mask" function performs the task of writing to the ICR register). Figure 6-2 shows an example of the IPI invocation mechanism in a two-processor SMP system. Each CPU has its own local APIC unit. CPUs can store the interrupt vector

Figure 6-2. Interprocessor interrupt mechanism in the x86 architecture: an interrupt is initiated with a write to the interrupt command register (ICR). The ICR is a memory-mapped register of the APIC system.

and the identifier of the target processor in the ICR. On a write to the ICR, a message is sent to the target processor via the system bus. Figure 6-2 shows the contents of the ICR register identifying the destination CPU (0x1000000) and the kind of IPI (e.g., the content of the ICR register for TLB invalidation is 0x8fd). The default location of the APIC registers is 0xFEE00000 in physical memory. The sequence of events to invoke an IPI for TLB invalidation is as follows:


6.5.1 Grant Table Performance

The performance model used in Chapter 5 captures functional behavior but is not timing accurate. This choice is motivated by the fact that timing-accurate models are considerably slower than functional models. A conclusion drawn from the previous chapter was that the split I/O implementation results in additional TLB flushes and misses. It is also important to characterize the impact of this mechanism on the timing and performance of a virtualized system. This subsection addresses this issue through a profiling-based analysis of a modified version of Xen. To accurately evaluate the overhead in the life cycle of a network packet, in this chapter I instrumented the Xen source code to evaluate the percentage of cycles consumed during the grant mechanism. This experiment is performed on both Intel Core 2 Duo and Pentium-based machines. Table 6-1 provides the percentage of cycles consumed by grant table operations in the hypervisor, averaged over the period where the selected network application benchmark (iperf) executed. As inferred from Table 6-1, grant operations consume a significant amount of resources: approximately 20% of CPU cycles during a packet transfer for both the Core 2 Duo and Pentium III CPUs. While grant map and unmap statistics are gathered during the transmit of iperf packets, grant copy or transfer statistics show

[10], similar to the results shown in Table 6-1.

Table 6-1. Grant table overhead summary
Function calls         %cycles (Core 2 Duo)   %cycles (Pentium III)
gnttab_map             7.13                   6.18
gnttab_unmap           4.28                   3.56
gnttab_transfer/copy   8.20 (copy)            10.39 (transfer)

The statistics are obtained by profiling the grant operations in the Xen hypervisor through the Xentrace tool [93] and by accessing the system cycle counts through the time-stamp register (RDTSC).

Experimental setup: two Simics/SoftSDV sessions are synchronized using a virtual network. The iperf application, deployed on a Linux OS, is executed in one session; iperf is deployed in Dom-U in the Xen hypervisor environment. Packets are sent to/from Dom-U from/to the Linux OS.

I studied the potential impact of a TLB optimization that makes global hypervisor pages persistent in TLBs. In the absence of TLB tagging, on a TLB flush all translations

6-4, such an optimization indeed has the potential to substantially reduce DTLB misses (and, to a lesser extent, ITLB misses). It is more effective than increasing the TLB size because global-bit tagging allows a subset of the translations to remain cached during switches and grant revocations. The experimental setup is shown in Figure 6-3.

Figure 6-4. Impact of tagging the TLB with a global bit to prevent TLB flushes for hypervisor pages. In the plot, the TLB miss ratio is normalized to a TLB size of 64 entries. The data points are collected when the iperf client is executed in a guest VM (transmit of I/O packets).

The importance of performance isolation and VM-level QoS is a growing research area, especially with the introduction of multi-core processors that share platform resources such as cache, TLB, and memory. My work is further extended to study the impact of quality of service on the TLB in [94].

Figure 6-5. Page sharing in a multicore environment: an example of a potential reason for invocation of an interprocessor interrupt between two CPUs is shown.

To check the consistency between the page table and TLB contents at the invocation of an interprocessor interrupt, the sequence of steps is as shown in Figure 6-5:

Figure 6-6. Simics model to capture inter-processor interrupts: Simics exports an API to register different performance models. These performance models can capture important events such as IPIs through the Simics API (SIM_hap_add_callback). The Simics workload is abstracted to represent an OS or a hypervisor.

The TLB model is generally initialized and loaded when Simics is booted. Simics provides an API to register a callback function to capture and act on an IPI event in the TLB module. An example functionality of the IPI callback function could be a counter that counts the number of IPIs during the execution of a workload. Figure 6-6 illustrates that the Simics API (SIM_hap_add_callback) is used to register an event related to a core system device (in this case the APIC). This callback function is used to modify the semantics of the TLB flush

Table 6-2 shows the number of IPIs for transmit and receive of network I/O packets during an execution run of the iperf benchmark. The potential benefit of not flushing the TLB during an interprocessor interrupt can be negated if the OS issues a normal flush due to scheduling another process. For this experiment, I created a network of two Simics-simulated machines. For each experiment, the Simics sessions are warmed up for 20 million instructions. The statistics shown are collected for a run of 500 million instructions. Note that the number of flushes is higher in the receive scenario, when the guest virtual machine is receiving packets. To evaluate consistency during the IPI, a page table parser is implemented in the Simics TLB module to look up the TLB entries in the page table.

Table 6-2. TLB flush statistics with and without IPI flush optimization
Transmit/Receive   IPI flush   Normal flush
Transmit           485         31670
Receive            602         57718

To further understand the impact of the IPI mechanism, I studied the impact of scaling the TLB size for instruction and data TLBs. For these experiments, domain-1 is affinitized to CPU 1 while domain-0 does not have any CPU affinity. Tables 6-3 and 6-4 provide the TLB miss statistics for the instruction and data TLBs during an application run of iperf for 50 million instructions. While the data TLB does not show significant improvement in performance for either CPU, the instruction TLB for CPU 1 shows an improvement of 1.2 to 2.4%. In addition, the number of misses does not change significantly beyond a TLB size of 128 entries. While the number of TLB misses is reduced when a complete flush is avoided during the invocation of an interprocessor interrupt, the potential performance improvement is negated by complete flushes due to local flushes (e.g., scheduling of a new process).

Table 6-3. Instruction TLB miss statistics with and without IPI flush optimization for the receive iperf benchmark
IPI mode              64       128      256      512     (TLB entries)
IPI flush (CPU0)      112848   100687   100657   100687
IPI flush (CPU1)      347      4506     4506     4506
No IPI flush (CPU0)   112811   100604   100602   100602
No IPI flush (CPU1)   347      4455     4455     4455

Table 6-4. Data TLB miss statistics with and without IPI flush optimization for the receive iperf benchmark
IPI mode              64       128      256      512     (TLB entries)
IPI flush (CPU0)      389221   454949   450143   450118
IPI flush (CPU1)      55369    33636    33628    33628
No IPI flush (CPU0)   389186   454916   450051   450008
No IPI flush (CPU1)   55369    33580    33554    33554

[95]. Performance evaluation of hardware resources shared between multi-core processors, with virtual-machine-based workloads for server consolidation, has recently been addressed [96]. A simulation-based approach has been used to maximize shared memory access between multiple VMs [96]; however, this study approximates the use of virtualization, as the model does not consider the execution of the hypervisor. All these analyses lacked a complete understanding of the network stack. Specialized virtual machine containers (with minimal functionality, such as only I/O) are being used to study I/O scalability [97]. Solarflare [11] has an approach to bypass the hypervisor for network I/O. The approach is based on providing hardware support for a virtual NIC (vNIC) per virtual machine. The virtual NIC controller for I/O acceleration communicates with the network driver interface offered by guest virtual machines. The network drivers inside the guest virtual machines communicate directly with the interface offered by the virtual NIC controller (bypassing the hypervisor).

[98]. Many approaches have been considered in the past to improve TLB performance. When a processor tries to modify a TLB entry, it locks the page table to prevent other processors from modifying it. It flushes the local TLB entries. TLB operations (such as TLB refill) are queued for update. The processor sends an IPI and spins until all other processors are done. Finally, it unlocks the page table. These steps consume many cycles. Many improvements have been suggested to improve this performance of the TLB. These include: [99]. An example of this is the upgrade of a page from read-only to read-write. The Linux OS does support lazy TLB flushing.


[100] (Nehalem) to improve the cost of virtual-to-physical address translation. In this scenario, this dissertation motivates research on models that capture the behavior of similar new on-chip resources. These models can be augmented and evaluated further to guide system designers toward improved system performance. Future internet usage is predicted to include new usage models that are based on user connectivity, such as social networking, collaboration, and on-line gaming. There are instances where these applications are desired to be encapsulated in VM environments


[1] J. E. Smith and R. Nair, Virtual Machines: Versatile Platforms for Systems and Processes, Morgan Kaufmann Publishers, May 2005.
[2] M. Litzkow, M. Livny, and M. W. Mutka, "Condor: a hunter of idle workstations," in Proceedings of the 8th International Conference on Distributed Computing Systems, Jun 1988, pp. 104-111.
[3] B. Callaghan, NFS Illustrated, Addison-Wesley Longman Ltd., Essex, UK, 2000.
[4] S. Adabala, V. Chadha, P. Chawla, R. J. Figueiredo, J. A. B. Fortes, I. Krsul, A. Matsunaga, M. Tsugawa, J. Zhang, M. Zhao, L. Zhu, and X. Zhu, "From virtualized resources to virtual computing grids: the In-VIGO system," Future Generation Computing Systems, special issue on Complex Problem-Solving Environments for Grid Computing, vol. 21, no. 6, Apr 2005.
[5] K. Keahey, I. Foster, T. Freeman, X. Zhang, and D. Galron, "Virtual workspaces in the grid," in Proceedings of the Euro-Par Conference, Lisbon, Portugal, Sep 2005.
[6] A. Sundaraj and P. A. Dinda, "Towards virtual networks for virtual machine grid computing," in 3rd USENIX Virtual Machine Research and Technology Symposium, May 2004.
[7] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus open-source cloud-computing system," in Cloud Computing and Its Applications Workshop (CCA'08), Chicago, IL, October 2008.
[8] VMware, "Merrill Lynch to standardize on VMware virtual machine software [Online]," World Wide Web electronic publication, Available: lynch.html
[9] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel, "Diagnosing performance overheads in the Xen virtual machine environment," in VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, New York, NY, USA, 2005, pp. 13-23, ACM.
[10] A. Menon, A. L. Cox, and W. Zwaenepoel, "Optimizing network virtualization in Xen," in ATEC '06: Proceedings of the USENIX '06 Annual Technical Conference, Berkeley, CA, USA, 2006, USENIX Association.
[11] D. Chisnall, The Definitive Guide to the Xen Hypervisor, Prentice Hall Press, Upper Saddle River, NJ, USA, 2007.
[12] R. Goldberg, "Survey of virtual machine research," IEEE Computer Magazine, vol. 7, no. 6, pp. 34-45, 1974.

[13] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: a full system simulation platform," IEEE Computer, 2002.
[14] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta, "Complete computer system simulation: the SimOS approach," IEEE Parallel and Distributed Technology, vol. 3, pp. 34-43, 1995.
[15] J. J. Yi and D. J. Lilja, "Simulation of computer architectures: simulators, benchmarks, methodologies, and recommendations," IEEE Transactions on Computers, vol. 55, no. 3, pp. 268-280, 2006.
[16] J. Sugerman, G. Venkitachalam, and B. H. Lim, "Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor," in Proceedings of the USENIX Annual Technical Conference, Jun 2001.
[17] S. Hand, A. Warfield, K. Fraser, E. Kotsovinos, and D. Magenheimer, "Are virtual machine monitors microkernels done right?," in HOTOS '05: Proceedings of the 10th Conference on Hot Topics in Operating Systems, Berkeley, CA, USA, 2005, USENIX Association.
[18] S. Kumar, H. Raj, K. Schwan, and I. Ganev, "Re-architecting VMMs for multicore systems: the sidecore approach," in Workshop on the Interaction between Operating Systems and Computer Architecture, 2007.
[19] K. Krewell, "Best servers of 2004: multicore is norm," Microprocessor Report, 2005.
[20] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in SOSP '03: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, New York, NY, USA, 2003, pp. 164-177, ACM.
[21] E. Kotsovinos, T. Moreton, I. Pratt, R. Ross, K. Fraser, S. Hand, and T. Harris, "Global-scale service deployment in the XenoServer platform," in Proceedings of the First Workshop on Real, Large Distributed Systems (WORLDS '04), Dec 2004.
[22] K. Suzaki, T. Yagi, K. Iijima, and N. A. Quynh, "OS Circular: internet client for reference," in LISA '07: Proceedings of the 21st Large Installation System Administration Conference, Berkeley, CA, USA, 2007, pp. 105-116, USENIX Association.
[23] M. G. Baker, J. Hartman, M. Kupfer, K. W. Shirriff, and J. Ousterhout, "Measurements of a distributed file system," in Proceedings of the 13th Symposium on Operating Systems Principles, 1991.
[24] J. H. Howard, M. L. Kazar, S. G. Menees, A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West, "Scale and performance in a distributed file system," ACM Transactions on Computer Systems, vol. 6, Feb 1988.

[25] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz, "NFS version 3: design and implementation," in USENIX Summer, Boston, MA, Jun 1994.
[26] A. Ganguly, D. Wolinsky, P. O. Boykin, and R. Figueiredo, "Decentralized dynamic host configuration in wide-area overlay networks of virtual workstations," in Workshop on Large-Scale and Volatile Desktop Grids (PCGrid), Long Beach, CA, Mar 2007, pp. 1-8.
[27] P. Apparao, S. Makineni, and D. Newell, "Characterization of network processing overheads in Xen," in VTDC '06: Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing, Washington, DC, USA, 2006, p. 2, IEEE Computer Society.
[28] J. S. Chase, D. E. Irwin, L. E. Grit, J. D. Moore, and S. E. Sprenkle, "Dynamic virtual clusters in a grid site manager," in HPDC '03: Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing, Washington, DC, USA, 2003, pp. 90-101, IEEE Computer Society.
[29] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of Supercomputing Applications, vol. 15, no. 3, pp. 200-222, Apr 2001.
[30] A. Ganguly, A. Agrawal, P. Boykin, and R. J. Figueiredo, "IP over P2P: enabling self-configuring virtual IP networks for grid computing," in IEEE International Parallel & Distributed Processing Symposium (IPDPS), Rhodes Island, Greece, Apr 2006.
[31] S. M. Larson, C. D. Snow, M. R. Shirts, and V. S. Pande, "Folding@Home and Genome@Home: using distributed computing to tackle previously intractable problems in computational biology," in Computational Genomics, Horizon Press, 2002.
[32] D. P. Anderson, J. Cobb, E. Korpella, M. Lebofsky, and D. Werthimer, "SETI@home: an experiment in public-resource computing," Communications of the ACM, vol. 45, no. 11, pp. 56-61, 2002.
[33] J. J. Kistler and M. Satyanarayanan, "Disconnected operation in the Coda file system," ACM Transactions on Computer Systems, vol. 10, Feb 1992.
[34] B. S. White, A. S. Grimshaw, and A. Nguyen-Tuong, "Grid-based file access: the Legion I/O model," in High Performance Distributed Computing, Pittsburgh, PA, Aug 2000, pp. 165-174.
[35] R. Uhlig, G. Neiger, D. Rodgers, A. L. Santoni, F. C. M. Martins, A. V. Anderson, S. M. Bennett, A. Kagi, F. H. Leung, and L. Smith, "Intel virtualization technology," Computer, vol. 38, no. 5, 2005.

[36] T. Garfinkel, K. Adams, A. Warfield, and J. Franklin, "Compatibility is not transparency: VMM detection myths and realities," in HOTOS '07: Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems, Berkeley, CA, USA, 2007, pp. 1-6, USENIX Association.
[37] M. Rosenblum and T. Garfinkel, "Virtual machine monitors: current technology and future trends," IEEE Computer, vol. 38, pp. 39-47, 2005.
[38] D. C. Anderson, J. S. Chase, and A. M. Vahdat, "Interposed request routing for scalable network storage," in Symposium on OSDI, San Diego, CA, Oct 2000.
[39] D. J. Blezard, "Multi-platform computer labs and classrooms: a magic bullet?," in SIGUCCS '07: Proceedings of the 35th Annual ACM SIGUCCS Conference on User Services, New York, NY, USA, 2007, pp. 16-20, ACM.
[40] J. Watson, "VirtualBox: bits and bytes masquerading as machines," Linux Journal, vol. 2008, no. 166, p. 1, 2008.
[41] S. Bhattiprolu, E. W. Biederman, S. Hallyn, and D. Lezcano, "Virtual servers and checkpoint/restart in mainstream Linux," SIGOPS Operating Systems Review, vol. 42, no. 5, pp. 104-113, 2008.
[42] J. Dike, "A user-mode port of the Linux kernel," in Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, 2000.
[43] F. Bellard, "QEMU, a fast and portable dynamic translator," in ATEC '05: Proceedings of the USENIX Annual Technical Conference, Berkeley, CA, USA, 2005, pp. 41-46, USENIX Association.
[44] "x86 architecture [Online]," World Wide Web electronic publication, Available:
[45] A. S. Tanenbaum, J. N. Herder, and H. Bos, "Can we make operating systems reliable and secure?," in Computer, May 2006, pp. 44-51, IEEE Computer Society.
[46] R. Bhargava, B. Serebrin, F. Spadini, and S. Manne, "Accelerating two-dimensional page walks for virtualized systems," in ASPLOS XIII: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, New York, NY, USA, 2008, pp. 26-35, ACM.
[47] P. Willmann, S. Rixner, and A. L. Cox, "Protection strategies for direct access to virtualized I/O devices," in ATC '08: USENIX 2008 Annual Technical Conference, Berkeley, CA, USA, 2008, pp. 15-28, USENIX Association.
[48] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt, "QoS policies and architecture for cache/memory in CMP platforms," in ACM SIGMETRICS, Jun 2007.

[49] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, F. Huici, and L. Mathy, "Fairness issues in software virtual routers," in PRESTO '08: Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow, New York, NY, USA, 2008, pp. 33-38, ACM.
[50] A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," in Symposium on Operating Systems Principles, 2001, pp. 174-187.
[51] A. R. Butt, T. Johnson, Y. Zheng, and Y. C. Hu, "Kosha: a peer-to-peer enhancement for the network file system," in Proceedings of IEEE/ACM SC2004, Nov 2004.
[52] V. Srinivasan and J. C. Mogul, "Spritely NFS: experiments with cache-consistency protocols," in Proceedings of the Twelfth ACM Symposium on Operating Systems Principles, Dec 1989, pp. 45-57.
[53] R. Macklem, "Not quite NFS, soft cache consistency for NFS," in Proceedings of the Winter 1994 USENIX Conference, San Francisco, CA, Jan 1994.
[54] D. Hildebrand and P. Honeyman, "Exporting storage systems in a scalable manner with pNFS," in Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, Washington, DC, USA, 2005, pp. 18-27, IEEE Computer Society.
[55] A. Traeger, A. Rai, C. P. Wright, and E. Zadok, "NFS file handle security," Tech. Rep., Stony Brook University, 2004.
[56] R. J. Figueiredo, P. Dinda, and J. A. B. Fortes, "A case for grid computing on virtual machines," in Proceedings of the 23rd IEEE International Conference on Distributed Computing Systems (ICDCS), Providence, Rhode Island, May 2003.
[57] M. Zhao and R. J. Figueiredo, "Distributed file system support for virtual machines in grid computing," in Proceedings of HPDC, Jun 2004.
[58] M. Zhao, V. Chadha, and R. J. Figueiredo, "Supporting application-tailored grid file system sessions with WSRF-based services," in Proceedings of HPDC, Jul 2005.
[59] D. Wolinsky, A. Agrawal, P. O. Boykin, J. Davis, A. Ganguly, V. Paramygin, P. Sheng, and R. Figueiredo, "On the design of virtual machine sandboxes for distributed computing in wide-area overlays of virtual workstations," in First Workshop on Virtualization Technologies in Distributed Computing (VTDC), Nov 2006.
[60] F. Oliveira, G. Guardiola, J. A. Patel, and E. V. Hensbergen, "Blutopia: stackable storage for cluster management," in Proceedings of IEEE Cluster Computing, Sep 2007.
[61] S. Santhanam, P. Elango, A. Arpaci-Dusseau, and M. Livny, "Deploying virtual machines as sandboxes for the grid," in USENIX WORLDS, 2004.

[62] M. Carson and D. Santay, "NIST Net: a Linux-based network emulation tool," SIGCOMM Computer Communication Review, vol. 33, no. 3, pp. 111-126, 2003.
[63] J. Spadavecchia and E. Zadok, "Enhancing NFS cross-administrative domain access," in USENIX Annual Technical Conference, FREENIX Track, 2002, pp. 181-194.
[64] M. Baker, R. Buyya, and D. Laforenza, "Grids and grid technologies for wide-area distributed computing," Software Practice & Experience, vol. 32, no. 15, pp. 1437-1466, 2002.
[65] C. P. Wright, J. Dave, P. Gupta, H. Krishnan, D. P. Quigley, E. Zadok, and M. N. Zubair, "Versatility and Unix semantics in namespace unification," ACM Transactions on Storage (TOS), vol. 2, no. 1, pp. 1-32, February 2006.
[66] D. Santry, M. Feeley, N. Hutchinson, A. Veitch, R. Carton, and J. Ofir, "Deciding when to forget in the Elephant file system," in 17th ACM Symposium on Operating Systems Principles, 1999.
[67] R. G. Minnich, "The AutoCacher: a file cache which operates at the NFS level," in USENIX Conference Proceedings, 1993, pp. 77-83.
[68] S. Osman, D. Subhraveti, G. Su, and J. Nieh, "The design and implementation of Zap: a system for migrating computing environments," in Symposium on OSDI, Boston, MA, Dec 2002.
[69] J. H. Hartman and J. K. Ousterhout, "The Zebra striped network file system," in SOSP '93: Proceedings of the Fourteenth ACM Symposium on Operating Systems Principles, Dec 1993, pp. 29-43, ACM.
[70] A. Agbaria and R. Friedman, "Virtual machine based heterogeneous checkpointing," Software Practice & Experience, vol. 32, pp. 1175-1192, 2002.
[71] A. Ganguly, A. Agrawal, P. O. Boykin, and R. J. O. Figueiredo, "WOW: self-organizing wide area overlay networks of virtual workstations," in HPDC, June 2006, pp. 30-42, IEEE.
[72] C. Sun, L. He, Q. Wang, and R. Willenborg, "Simplifying service deployment with virtual appliances," in IEEE International Conference on Services Computing, vol. 2, pp. 265-272, 2008.
[73] R. Prodan and T. Fahringer, "Overhead analysis of scientific workflows in grid environments," IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 3, pp. 378-393, 2008.
[74] A. S. Tanenbaum and M. van Steen, Distributed Systems: Principles and Paradigms (1st Edition), Prentice-Hall Inc., 2002.
[75] V. Chadha and R. J. Figueiredo, "ROW-FS: a user-level virtualized redirect-on-write distributed file system for wide area applications," in International Conference on High Performance Computing (HiPC), Goa, India, Dec 2007.

[76] S. Annapureddy, M. J. Freedman, and D. Mazieres, "Shark: scaling file servers via cooperative caching," in Proceedings of the 2nd USENIX/ACM Symposium on Networked Systems Design and Implementation, May 2005.
[77] D. Reimer, A. Thomas, G. Ammons, T. Mummert, B. Alpern, and V. Bala, "Opening black boxes: using semantic information to combat virtual machine image sprawl," in VEE '08: Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, New York, NY, USA, 2008, pp. 111-120, ACM.
[78] R. Chandra, N. Zeldovich, C. Sapuntzakis, and M. S. Lam, "The Collective: a cache-based system management architecture," in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (NSDI), 2005, pp. 259-272.
[79] "2X ThinClientServer," World Wide Web electronic publication, 2008.
[80] M. McNett, D. Gupta, A. Vahdat, and G. M. Voelker, "Usher: an extensible framework for managing clusters of virtual machines," in Proceedings of the 21st Large Installation System Administration Conference (LISA), Nov 2007.
[81] J. Cappos, S. Baker, J. Plichta, D. Nyugen, J. Hardies, M. Borgard, J. Johnston, and J. H. Hartman, "Stork: package management for distributed VM environments," in Proceedings of the 21st Large Installation System Administration Conference (LISA), Nov 2007.
[82] R. Grossman and Y. Gu, "Data mining using high performance data clouds: experimental studies using Sector and Sphere," in KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2008, pp. 920-927, ACM.
[83] A. Menon, J. R. Santos, Y. Turner, G. Janakiraman, and W. Zwaenepoel, "Diagnosing performance overheads in the Xen virtual machine environment," in VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments, 2005, pp. 13-23, ACM.
[84] J. R. Santos, Y. Turner, G. Janakiraman, and I. Pratt, "Bridging the gap between software and hardware techniques for I/O virtualization," in ATC '08: USENIX 2008 Annual Technical Conference, Berkeley, CA, USA, 2008, pp. 29-42, USENIX Association.
[85] R. Uhlig, R. Fishtein, O. Gershon, I. Hirsh, and H. Wang, "SoftSDV: a presilicon software development environment for the IA-64 architecture," Intel Technology Journal, 1999.
[86] M. Yourst, "PTLsim: a cycle accurate full system x86-64 microarchitectural simulator," in IEEE International Symposium on Performance Analysis of Systems & Software, April 2007, pp. 23-34.


[87] R. Iyer, "On modeling and analyzing cache hierarchies using CASPER," in 11th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Oct. 2003.
[88] W. Wu and M. Crawford, "Interactivity vs. fairness in networked Linux systems," Computer Networks, vol. 51, no. 14, pp. 4050–4069, 2007.
[89] L. Cherkasova and R. Gardner, "Measuring CPU overhead for I/O processing in the Xen virtual machine monitor," in ATEC '05: Proceedings of the Annual Conference on USENIX Annual Technical Conference, Berkeley, CA, USA, Apr. 2005, USENIX Association.
[90] G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig, "Intel Virtualization Technology: Hardware support for efficient processor virtualization," Intel Technology Journal, Aug. 2006.
[91] B. Jacob, S. W. Ng, and D. T. Wang, Memory Systems: Cache, DRAM and Disk, Morgan Kaufmann Publishers, 2007.
[92] Intel Corporation, "Intel 64 and IA-32 Architectures Software Developer's Manuals [Online]," World Wide Web electronic publication, Available:
[93] D. Gupta, R. Gardner, and L. Cherkasova, "XenMon: QoS monitoring and performance profiling tool [Online]," World Wide Web electronic publication, Available:
[94] O. Tickoo, H. Kannan, V. Chadha, R. Illikkal, R. Iyer, and D. Newell, "qTLB: Looking inside the look-aside buffer," in International Conference on High Performance Computing (HiPC), Goa, India, Dec. 2007.
[95] R. Santos, G. Janakiraman, and Y. Turner, "Xen network optimization [Online]," World Wide Web electronic publication, Available: 3/networkoptimizations.pdf
[96] M. R. Marty and M. D. Hill, "Virtual hierarchies to support server consolidation," in ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, New York, NY, USA, 2007, pp. 46–56, ACM.
[97] J. Wiegert, G. Regnier, and J. Jackson, "Challenges for scalable networking in a virtualized server," in Proceedings of the 16th International Conference on Computer Communications and Networks, Aug. 2007.
[98] J. Liu, W. Huang, B. Abali, and D. K. Panda, "High performance VMM-bypass I/O in virtual machines," in ATEC '06: Proceedings of the Annual Conference on USENIX '06 Annual Technical Conference, Berkeley, CA, USA, May 2006, USENIX Association.


[99] M.-S. Chang and K. Koh, "Lazy TLB consistency for large-scale multiprocessors," in Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms/Architecture Synthesis, 1997.
[100] Intel Corporation, "First the tick, now the tock: Next generation Intel microarchitecture [Online]," World Wide Web electronic publication, Available:


Vineet graduated with a B.E. in Electronics and Telecommunication from the University of Pune, India. He finished his M.S. in Computer Science at Mississippi State University. He is pursuing his Ph.D. in Computer and Information Science and Engineering at the University of Florida. His research interests include virtualization, operating systems, computer architecture, file systems, and distributed computing. Since Fall 2002, Vineet has been a research assistant at the Advanced Computing and Information Systems (ACIS) Laboratory. At ACIS, his research focus has been the Grid Virtual File System (GVFS) and I/O virtualization. He has been involved in the development of middleware support for network file systems and of a simulation-based evaluation methodology to characterize I/O overheads in virtualized environments. To complement his academic experience, Vineet completed two summer internships at the Intel Systems Technology Lab. Upon graduation, Vineet plans to take up a full-time position at Intel Corporation.