Citation
Self-Managing Virtual Networks for Wide-Area Distributed Computing

Material Information

Title:
Self-Managing Virtual Networks for Wide-Area Distributed Computing
Creator:
Ganguly, Arijit
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
1 online resource (154 p.)

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Computer Engineering
Computer and Information Science and Engineering
Committee Chair:
Figueiredo, Renato J.
Committee Members:
Fortes, Jose A.
Newman, Richard E.
George, Alan D.
Avery, Paul R.
Graduation Date:
8/9/2008

Subjects

Subjects / Keywords:
Connectivity ( jstor )
Coordinate systems ( jstor )
Identifiers ( jstor )
Internet ( jstor )
Munchausen syndrome by proxy ( jstor )
Preliminary proxy material ( jstor )
Proxy reporting ( jstor )
Proxy statements ( jstor )
Tunnels ( jstor )
Uniform resource identifiers ( jstor )
Computer and Information Science and Engineering -- Dissertations, Academic -- UF
dht, firewalls, grid, nat, networks, p2p, virtual
Genre:
Electronic Thesis or Dissertation
born-digital ( sobekcm )
Computer Engineering thesis, Ph.D.

Notes

Abstract:
Sharing of computing and storage resources among different institutions and individuals connected over the Internet is seen as a solution to meet the ever-increasing computation and storage demands of modern applications. Several factors curtail the ability of existing applications to run seamlessly on Wide-area Networks (WANs): heterogeneous resource configurations, obscured access to resources due to Network Address Translators (NATs) and firewalls, inability to express sharing policies and lack of isolation provided by operating systems. This work addresses the problem of providing bi-directional network connectivity among wide-area resources behind NATs and firewalls. At the core of the presented approach is a self-managing networking infrastructure (IPOP) that aggregates wide-area hosts into a private network with decoupled address space management, and is functionally equivalent to a Local-area network (LAN) environment where a wealth of existing, unmodified IP-based applications can be deployed. The IPOP virtual network tunnels the traffic generated by applications over a P2P-based overlay network, which handles NAT/firewall traversal (through hole-punching techniques) and dynamically adapts its topology (through establishment of direct connections between communicating nodes) in a self-organized, decentralized manner. Together with classic virtual machine technology for software dissemination, IPOP facilitates deployment of large-scale distributed computing environments on wide-area hosts owned by different organizations and individuals. A real deployment of the system has been up and running for more than one year, providing access to computational resources for several users.
This dissertation makes the following contributions in the area of virtualization applied to wide-area networks: a novel self-organizing IP-over-P2P system with decentralized NAT traversal; decentralized self-optimization techniques to create overlay links between nodes based on traffic inspection; creation of isolated address spaces and decentralized allocation of IP addresses within each such address space using Distributed Hash Tables (DHTs); tunneling of overlay links for maintaining the overlay structure even in presence of NATs and routing outages; and techniques for proxy discovery for tunnel nodes using network coordinates. I describe the IPOP virtual network architecture and present an evaluation of a prototype implementation using well-known network performance benchmarks and a set of distributed applications. To further facilitate deployment of IPOP, I describe techniques that allow new users to easily create and manage isolated address spaces and decentralized allocation of IP addresses within each such address space. I present generally applicable techniques that facilitate consistent routing in structured P2P systems even in presence of overlay faults, thereby benefiting different applications of these systems. In the context of the IPOP system, these techniques provide improved virtual IP connectivity. I also describe and evaluate decentralized techniques for discovering suitable proxy nodes to establish a 2-hop overlay path between virtual IP nodes, when direct communication is not possible. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2008.
Local:
Adviser: Figueiredo, Renato J.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2009-02-28
Statement of Responsibility:
by Arijit Ganguly.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Ganguly, Arijit. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
2/28/2009
Resource Identifier:
664555333 ( OCLC )
Classification:
LD1780 2008 ( lcc )

Full Text

PAGE 4

I would like to take this opportunity to express my sincere gratitude to my advisor Dr. Renato Figueiredo, for his constant support and guidance throughout the course of my study. Without his encouragement, appreciation and openness for research ideas, it would not have been possible to publish the defining papers of my work. I also thank him for supporting my studies with a Research Assistantship and also for the opportunity to present research papers at various conferences. I am equally thankful to Dr. Oscar Boykin, whose ideas and initial work on the Brunet P2P library have been the baseline of my research and also the development of the IPOP system. I would also like to thank Dr. Jose Fortes for giving me an opportunity to be a part of the Advanced Computing and Information Systems (ACIS) Laboratory, and do cutting-edge systems research. I am thankful to my other committee members, Dr. Richard Newman, Dr. Alan George and Dr. Paul Avery, for some stimulating discussions, and the inspiration which I drew from their research work, which in turn kindled my own creativity.

In today's competitive world, it is not possible to make an impact without teamwork. I am thankful to the members of ACIS P2P team for sharing with me several tools and infrastructures that have helped me in my research. In particular, I thank David Wolinsky, whose grid-appliance was very useful for gathering data for one of my papers. I thank ACIS researchers, Mauricio and Andrea, for their efforts in ensuring the high availability of lab facilities and also for their spontaneous help whenever needed. I am also thankful to Vineet and Girish, for sharing a wonderful work environment.

I am thankful to my parents, who have taught me the ability of the right judgement, and have always taken pride and expressed confidence in me. I am thankful to my sister, Inakshi, who has been encouraging at times of despair. I should also express my thankfulness to my girlfriend, Swasti, for being so supportive, understanding and patient all this while. Finally, no achievement is possible without the blessings and wish of the All Mighty, who I thank for all my accomplishments and seek his blessings for the future.

PAGE 5

page
ACKNOWLEDGMENTS ... 4
LIST OF TABLES ... 9
LIST OF FIGURES ... 10
ABSTRACT ... 13
CHAPTER
1 INTRODUCTION ... 15
  1.1 Wide-Area Distributed Computing ... 18
    1.1.1 High-Throughput Computing ... 18
    1.1.2 Cross-Domain Collaboration and Development ... 18
  1.2 Issues with Traditional Approaches ... 19
    1.2.1 Heterogeneity of Compute Resources ... 19
    1.2.2 Obscured Internet Connectivity ... 20
    1.2.3 Lack of Isolation ... 20
  1.3 Solution: Virtualization ... 21
    1.3.1 Virtual Machines ... 21
    1.3.2 Virtual Networks ... 22
    1.3.3 Combining Virtual Machines and Virtual Networks ... 23
  1.4 Focus and Contributions ... 23
2 SELF-ORGANIZING VIRTUAL NETWORKING ... 25
  2.1 Peer-to-Peer Networks and Architectures ... 26
  2.2 Tunneling IP over P2P - IPOP ... 29
  2.3 Architecture of IPOP ... 30
    2.3.1 Virtual IP Packet Capture and Routing ... 30
    2.3.2 Overlay Connection Management ... 31
      2.3.2.1 Connection setup ... 33
      2.3.2.2 Joining an existing network and acquiring connections ... 35
      2.3.2.3 Establishing communication through NATs ... 36
      2.3.2.4 Adaptive shortcut creation ... 37
  2.4 Experiments ... 39
    2.4.1 Shortcut Connections: Latency and Bandwidth ... 39
    2.4.2 Single IPOP link: Latency and Bandwidth ... 43
    2.4.3 Implementation Optimizations ... 46
  2.5 Related Work ... 46
    2.5.1 Resource Virtualization ... 46
    2.5.2 Applications of P2P ... 46
    2.5.3 Techniques for NAT-Traversal ... 47

PAGE 6

  [title missing] ... 48
3 WIDE-AREA OVERLAYS OF VIRTUAL WORKSTATIONS ... 49
  3.1 Node Configuration and Deployment in WOWs ... 50
    3.1.1 Background and Motivations ... 50
    3.1.2 Virtual Machine Configuration ... 51
    3.1.3 Deployment Scenarios ... 52
    3.1.4 Testbed WOW Performance ... 53
      3.1.4.1 Batch application ... 55
      3.1.4.2 Parallel application ... 56
      3.1.4.3 Virtual machine migration ... 57
  3.2 Related Work ... 59
  3.3 Discussion ... 60
  3.4 Making Deployment of WOWs Simpler ... 62
4 DISTRIBUTED HASH TABLES AND APPLICATIONS IN IPOP ... 64
  4.1 Related Work ... 65
    4.1.1 Systems Based on Distributed Hash Table ... 65
    4.1.2 Distributed Hash Table Design ... 65
  4.2 Distributed Hash Table on Brunet ... 67
    4.2.1 Handling Topology Changes ... 67
    4.2.2 Application Programmers Interface (API) ... 68
    4.2.3 Tolerance to Inconsistent Roots ... 69
  4.3 Extensions to IPOP ... 71
  4.4 Lifecycle Management of WOWs ... 72
    4.4.1 Creating an IPOP namespace ... 73
    4.4.2 Dynamic Host Configuration ... 73
    4.4.3 Resolution of Virtual IP to P2P Address ... 75
  4.5 Experiments ... 75
  4.6 Conclusion ... 77
5 IMPACT OF WIDE-AREA CONNECTIVITY CONSTRAINTS ON IPOP ... 79
  5.1 Connectivity Hazards in Wide-Area Networks ... 80
  5.2 Impact of Connectivity Constraints ... 82
    5.2.1 Impact on Core Structured Overlay Routing ... 83
    5.2.2 Effect on All-to-All Connectivity ... 84
    5.2.3 Effect on Dynamic Virtual IP Configuration ... 84
    5.2.4 Effect on Completion of DHT Operations ... 85
    5.2.5 Effect on DHT Dynamics ... 85
    5.2.6 Effect on Topology Adaptation ... 86
  5.3 Discussion ... 86

PAGE 7

6 [title missing] ... 87
  6.1 Annealing Routing ... 87
  6.2 Tunnel Edges ... 90
  6.3 Improvements in Structured Routing ... 93
    6.3.1 Simulation Methodology ... 94
    6.3.2 Evaluating the Impact of Annealing Routing ... 95
    6.3.3 Evaluating the Impact of Tunnel Edges ... 98
  6.4 Tunnel Edge Implementation in Brunet ... 99
  6.5 Experiments ... 101
    6.5.1 Structure Verification of P2P Network ... 101
      6.5.1.1 Majority of nodes behind NATs ... 102
      6.5.1.2 Incomplete underlying network ... 102
      6.5.1.3 Wide-area deployment ... 102
    6.5.2 Connectivity within a WOW ... 104
  6.6 Related Work ... 105
  6.7 Conclusion ... 107
7 IMPROVING END-TO-END LATENCY ... 109
  7.1 Network Coordinates ... 110
  7.2 Improving Latency of Overlay Routing ... 112
    7.2.1 Proximity Neighbor Selection in Brunet ... 112
    7.2.2 Implementation in Brunet ... 115
    7.2.3 Experiments ... 116
  7.3 Improving Latency of Virtual IP Communication ... 119
    7.3.1 Wide-area Resource Discovery ... 120
    7.3.2 Discovering Proxies in IPOP ... 121
      7.3.2.1 Closest connection (CC) ... 121
      7.3.2.2 Expanding Segment Search (ESS) ... 121
    7.3.3 Experiments ... 123
    7.3.4 Discussion ... 127
  7.4 Conclusion ... 127
8 CONCLUSION ... 132
APPENDIX
A NETWORK ADDRESS TRANSLATORS ... 134
  A.1 Traversing NATs with UDP ... 135
  A.2 Traversing NATs with TCP ... 136
  A.3 Problems using STUN/STUNT ... 136
B VIVALDI NETWORK COORDINATES ... 137
  B.1 Vivaldi Algorithm ... 137
  B.2 Considerations for Convergence and Accuracy ... 138

PAGE 8

C [title missing] ... 140
  C.1 Application Programmers Interface (API) ... 140
  C.2 Tree Generation Algorithms ... 141
  C.3 Applications of Map-Reduce ... 142
REFERENCES ... 144
BIOGRAPHICAL SKETCH ... 154

PAGE 9

Table ... page
2-1 Bandwidth measurements between two IPOP nodes: with and without shortcuts ... 43
2-2 Configurations of machines used for evaluating performance of single IPOP link ... 44
2-3 Mean and standard deviation of 10000 ping round-trip times for IPOP-UDP and physical network ... 44
2-4 Mean and standard deviation of rate of TCP request/response transactions measured over 100 samples with netperf for IPOP-UDP and physical network ... 44
2-5 Comparison of throughput of a single IPOP link in LAN and WAN environments ... 45
3-1 Configuration of the WOW testbed depicted in Figure 3-1. All WOW guests run the same Debian/Linux 2.4.27-2 O/S ... 54
3-2 Execution times and speedups for the execution of fastDNAml-PVM in 1, 15 and 30 nodes of the WOW. The sequential execution time for the application is 22,272 seconds on node 2 and 45,191 seconds on node 34. Parallel speedups are reported with respect to the execution time of node 2. ... 57
6-1 Probability of being able to form a tunnel edge as a function of edge probability and number of required near connections on each side ... 93
6-2 Adjacent P2P nodes in wide-area deployment that could not communicate using TCP or UDP ... 103
7-1 Configurations of overlays used to evaluate CC and ESS ... 124
7-2 Relative and absolute penalties of CC, ESS(1.1), ESS(1.2) and ESS(1.05) for overlays Exp1 (with PNS) and Exp2 (without PNS) ... 126
7-3 Relative and absolute penalties of CC and ESS(1.2) for different fraction of proxy nodes ... 130

PAGE 10

Figure ... page
1-1 Hosted virtual machine monitor running two isolated virtual machine "guests" ... 22
2-1 Structured P2P routing ... 28
2-2 Virtualizing IP over a P2P overlay ... 30
2-3 Architectural overview of IPOP ... 31
2-4 Peer-to-peer connections in Brunet ... 32
2-5 Connection setup between P2P nodes ... 33
2-6 Distribution of round-trip latencies for ICMP/ping packets over 118-node PlanetLab overlay. Two hops separate the ping source from the destination. ... 38
2-7 Profiles of ICMP echo round-trip latencies and dropped packets during IPOP node join ... 40
2-8 Three regimes for percentage of dropped ICMP packets in UFL-NWU (first 50 packets). ... 42
3-1 Testbed WOW used for experiments ... 50
3-2 Frequency distributions of PBS/MEME job wall clock times ... 54
3-3 Profile of execution times for PBS-scheduled MEME sequential jobs during the migration of a worker node ... 59
4-1 Handling topology changes ... 68
4-2 Inconsistent roots in DHT ... 70
4-3 Example of two different WOWs with namespaces N1 and N2 sharing a common Brunet overlay ... 72
4-4 Events leading to virtual IP address configuration of tap interface ... 76
4-5 Cumulative distribution of the time taken by a new IPOP node to acquire a virtual IP address (T2 in Figure 4-4). Mean and variance are 27.41 seconds and 20.19 seconds, respectively. ... 77
4-6 Cumulative distribution of the time taken by a new P2P node to get connected to its left and right neighbors in the ring (T1 in Figure 4-4). Mean and variance are 4.98 seconds and 13.99 seconds, respectively. ... 77
4-7 Cumulative distribution of the number of different virtual IP addresses tried during DHCP. Mean and variance are 1.096 and 0.784, respectively. ... 78

PAGE 11

5-1 [caption missing] ... 80
5-2 Multiple levels of NATs ... 82
5-3 Inconsistent roots in DHT ... 83
6-1 Annealing routing algorithm ... 89
6-2 Tunnel edge between nodes A and B which cannot communicate over TCP or UDP transports ... 91
6-3 At edge likelihood of 70% (0.7), the percentage of non-routable pairs varies from 9.5% to 10.9% (the total number of pairs in the simulated network is 1,000,000) ... 95
6-4 At edge likelihood of 70%, the percentage of wrongly routed keys varies from 9.5% to 10.7% (the total number of simulated messages is 10,000,000) ... 96
6-5 Comparing greedy routing with tunnel edges for m=3 and m=2. At edge likelihood of 70%, the percentage of non-routable pairs in a network of 1000 nodes is (1) 3.9% for m=2, and (2) 0.86% for m=3. ... 97
6-6 Average number of non-routable pairs. At edge likelihood of 70%, the percentage of non-routable pairs for greedy and annealing routing is (1) without tunnel edges, 10.26% and 3.4% respectively; (2) with tunnel edges, 0.86% and 0.21% respectively. At edge likelihoods of 95%, there are no non-routable pairs with tunnel edges. ... 97
6-7 Average number of wrongly routed keys. At edge likelihood of 70%, the percentage of wrongly routed keys for greedy and annealing routing is (1) without tunnel edges, 10.2% and 3.4% respectively; (2) with tunnel edges, 0.86% and 0.19% respectively. ... 98
7-1 Relative error for round-trip time prediction using network coordinates measured after 1, 3 and 5 hours of bootstrapping ... 112
7-2 Proximity neighbor selection ... 113
7-3 Round-trip times (RTTs) for overlay pings with PNS and without PNS. Average RTT is (1) 1.75 secs (with PNS), (2) 2.86 secs (without PNS). ... 117
7-4 Percentage of nodes in the queried range R that are closer to the source node than the one selected by network coordinates. On average, 8.76% of the nodes in the queried range R are closer than the selected node. ... 117
7-5 Relative latency error for choosing the closest node in the network coordinate space. On average, the relative latency error is 1.432. ... 118
7-6 Expanding Segment Search (ESS) ... 123

PAGE 12

7-7 [caption missing] ... 125
7-8 Relative penalty (RP) for using a proxy node selected by CC and ESS(1.2), when all nodes in the network can serve as proxies, when only 30% of the nodes can serve as proxies. ... 128
7-9 Relative penalty (RP) for using a proxy node selected by CC and ESS(1.2), when all nodes in the network can serve as proxies, when only 20% of the nodes can serve as proxies. ... 129
C-1 Execution of a map-reduce computation over a tree of nodes ... 141
C-2 Bounded broadcast over a segment of the P2P ring ... 142

PAGE 13

Sharing of computing and storage resources among different institutions and individuals connected over the Internet is seen as a solution to meet the ever-increasing computation and storage demands of modern applications. Several factors curtail the ability of existing applications to run seamlessly on Wide-area Networks (WANs): heterogeneous resource configurations, obscured access to resources due to Network Address Translators (NATs) and firewalls, inability to express sharing policies and lack of isolation provided by operating systems.

This work addresses the problem of providing bi-directional network connectivity among wide-area resources behind NATs and firewalls. At the core of the presented approach is a self-managing networking infrastructure (IPOP) that aggregates wide-area hosts into a private network with decoupled address space management, and is functionally equivalent to a Local-area network (LAN) environment where a wealth of existing, unmodified IP-based applications can be deployed. The IPOP virtual network tunnels the traffic generated by applications over a P2P-based overlay network, which handles NAT/firewall traversal (through hole-punching techniques) and dynamically adapts its topology (through establishment of direct connections between communicating nodes) in a self-organized, decentralized manner. Together with classic virtual machine technology for software dissemination, IPOP facilitates deployment of large-scale distributed computing environments on wide-area hosts owned by different organizations and individuals. A real deployment of the system has been up and running for more than one year, providing access to computational resources for several users.

PAGE 14

This dissertation makes the following contributions in the area of virtualization applied to wide-area networks: a novel self-organizing IP-over-P2P system with decentralized NAT traversal; decentralized self-optimization techniques to create overlay links between nodes based on traffic inspection; creation of isolated address spaces and decentralized allocation of IP addresses within each such address space using Distributed Hash Tables (DHTs); tunneling of overlay links for maintaining the overlay structure even in presence of NATs and routing outages; and techniques for proxy discovery for tunnel nodes using network coordinates.

I describe the IPOP virtual network architecture and present an evaluation of a prototype implementation using well-known network performance benchmarks and a set of distributed applications. To further facilitate deployment of IPOP, I describe techniques that allow new users to easily create and manage isolated address spaces and decentralized allocation of IP addresses within each such address space. I present generally applicable techniques that facilitate consistent routing in structured P2P systems even in presence of overlay faults, thereby benefiting different applications of these systems. In the context of the IPOP system, these techniques provide improved virtual IP connectivity. I also describe and evaluate decentralized techniques for discovering suitable proxy nodes to establish a 2-hop overlay path between virtual IP nodes, when direct communication is not possible.
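The decentralized allocation of virtual IP addresses within an isolated address space can be pictured with a small sketch. It assumes, purely for illustration, a DHT that offers an atomic create-if-absent operation. The `InMemoryDht` class, the `allocate_ip` function and the key layout (a SHA-1 hash of namespace and address) are hypothetical stand-ins and are not the API of the Brunet library or of IPOP.

```python
import hashlib
from typing import Dict, Iterable, Optional


class InMemoryDht:
    """Hypothetical stand-in for a DHT with an atomic create-if-absent call."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def create(self, key: str, value: str) -> bool:
        """Store value under key only if the key is still unclaimed."""
        if key in self._store:
            return False
        self._store[key] = value
        return True


def dht_key(namespace: str, ip: str) -> str:
    # Assumed key layout: hashing "namespace:ip" keeps equal addresses in
    # different namespaces apart, which is what isolates the address spaces.
    return hashlib.sha1(f"{namespace}:{ip}".encode()).hexdigest()


def allocate_ip(dht: InMemoryDht, namespace: str, node_id: str,
                candidates: Iterable[str]) -> Optional[str]:
    """Claim the first candidate address that no other node has claimed."""
    for ip in candidates:
        if dht.create(dht_key(namespace, ip), node_id):
            return ip
    return None  # every candidate was already taken
```

In this sketch no central server hands out addresses: a node simply keeps trying candidates until one `create` call succeeds, and two namespaces can reuse the same address range without conflict.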

PAGE 15

The growth of the Internet has given an opportunity to share resources such as CPU cycles and storage capacity among different institutions connected over wide-area networks. Besides collaboration, such sharing is particularly useful to meet the ever-increasing computation and storage demands of modern applications from different domains, such as high-energy physics, medical imaging, and business data analysis, among others. Several middleware solutions [1] have been proposed to facilitate resource sharing in a way that not only respects the policies defined by the resource owners, but also provides maximum flexibility to consumers. Systems have also been conceived and implemented to harness idle cycles from desktops of users connected to the Internet [2, 3]. Common to these efforts is the vision of providing computing as a utility that can be delivered by a pool of distributed resources in a seamless manner. The terms Grid and utility computing are used to refer to such systems for wide-area distributed computing.

As distributed computing moves from within an organization to a wide-area collaboration, several new challenges arise that limit the class of applications that can benefit from resource sharing, compared to local-area environments.

Firstly, resources owned by different domains tend to differ in their hardware and software configurations. Resource providers independently choose the configurations of the resources they own (including O/S kernels, libraries, and applications), which might be incompatible with application or middleware requirements. It is difficult for the application/middleware developer to maintain different distributions for every possible resource configuration [4, 5]. Secondly, the increasing use of NATs and firewall routers at Internet sites hinders bi-directional access to resources across different domains [6-8]. The network policies at each site are designed with focus on securing hosts; often, the implementation of such policies makes it difficult to support applications from external users. Although some mechanisms to access firewalled resources (for example, secure shell

PAGE 16

To facilitate collaboration, several middleware-level solutions have been proposed. The Globus [1] toolkit provides a public-key based security infrastructure and Web-service based standards for interactions between components. BOINC [9] provides a platform for writing distributed applications that can harness CPU cycles on desktop hosts belonging to individual users connected to the Internet. However, these approaches require existing applications to be re-written with the application programming interface (API) they provide for operations such as scheduling and data transfers. The heterogeneity of wide-area hosts further complicates middleware deployment and configuration.

In this dissertation, a novel approach to wide-area distributed computing is proposed and investigated. This approach exploits virtualization to provide to applications their preferred execution environment. The execution environment of an application consists of the machine on which it runs and the network over which it communicates with other processes, middleware components and applications. In this approach, an application executes inside a virtual machine (VM) and is connected to other applications through a virtual network that is functionally equivalent to a local-area TCP/IP network.

"Classic" VMs [10-12] enable multiple operating systems, completely isolated from each other, to time-share resources of a single machine. VMs encapsulate the software dependencies of an application consisting of the entire OS, libraries, and applications within a closed environment, which is completely decoupled from the physical host. A VM-based execution environment can be quickly instantiated on any physical host [4, 5], with the only software dependence being the presence of a virtual machine monitor (VMM).

PAGE 17

[13-16] provide to applications a communication environment with decoupled IP address management and all-to-all connectivity. In a virtual network, idiosyncrasies of heterogeneous network access are handled by the virtualization layer, while applications perceive an environment that is functionally equivalent to a local-area network.

Within the context of such a virtualized distributed system, this dissertation focuses on the architecture, design, implementation and evaluation of a novel virtual networking technique. The proposed approach combines IP tunneling and peer-to-peer (P2P) overlay networking to aggregate wide-area hosts into an IP-over-P2P (IPOP) virtual network. The IPOP virtual network is architected to be scalable, fault-tolerant and self-managing, requiring minimal administrative control. The virtual network also incorporates novel techniques to protect itself from the connectivity and performance degradation due to wide-area connectivity constraints that prevent communication between pairs of nodes.

The remainder of this dissertation is organized as follows. Chapter 2 describes the IPOP architecture. Chapter 3 presents and evaluates systems which combine VM technologies and IPOP virtual networking to provide homogeneously configured Wide-area Overlays of Virtual Workstations (WOWs). Chapter 4 presents the design of a storage layer based on Distributed Hash Table (DHT) and how it can be leveraged to facilitate the deployment of WOWs. Chapter 5 describes how connectivity hazards in a Wide-area can affect consistency of P2P routing, and subsequently connectivity and performance within the IPOP virtual network. Novel techniques that facilitate consistent routing under connectivity constraints are presented in Chapter 6, and their implementation in the IPOP system is described. Chapter 7 investigates techniques to discover suitable proxy nodes to minimize end-to-end latency when direct communication is not possible, and also validates existing techniques to reduce latency of multi-hop overlay routing. Finally, the dissertation is concluded in Chapter 8.

PAGE 18

1.1.1 High-Throughput Computing

[17, 18] consist of a shared pool of loosely-coupled compute resources managed by middleware components [19, 20], which include a job scheduler and a resource monitor, among others. The system throughput can be readily increased by adding more (heterogeneous with respect to hardware configurations) compute resources. Similarly, computationally-intensive loosely-coupled parallel applications (such as master/worker applications with small communication-to-computation ratios) often achieve a lower execution time when more resources (workers) are available, irrespective of relative CPU speeds of the workers.

Adding a new resource to the shared pool requires configuring a new host with the necessary software (OS, libraries, application binaries and middleware) and also ensuring that the new resource is accessible over the network, a process that becomes non-trivial when resources are owned by multiple organizations. This dissertation investigates mechanisms by which new resources can be added to such a pool with minimal manual intervention, with focus on accessibility of the new resource.
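The observation that loosely-coupled master/worker applications tend to finish sooner when workers are added, even slower ones, can be checked with a back-of-the-envelope simulation. The sketch below assumes independent tasks, zero communication cost, and idle workers that pull the next task from the master; `makespan` and its parameters are illustrative names and do not belong to any middleware discussed here.

```python
import heapq
from typing import List, Sequence, Tuple


def makespan(task_costs: Sequence[float], worker_speeds: Sequence[float]) -> float:
    """Time at which the last task finishes when idle workers pull tasks."""
    # Heap of (time the worker becomes idle, worker index).
    idle: List[Tuple[float, int]] = [(0.0, i) for i in range(len(worker_speeds))]
    heapq.heapify(idle)
    finish = 0.0
    for cost in task_costs:
        free_at, worker = heapq.heappop(idle)      # earliest idle worker
        done_at = free_at + cost / worker_speeds[worker]
        finish = max(finish, done_at)
        heapq.heappush(idle, (done_at, worker))
    return finish


tasks = [1.0] * 30
print(makespan(tasks, [1.0]))        # one full-speed worker
print(makespan(tasks, [1.0, 0.5]))   # plus a worker running at half speed
```

Under these assumptions, 30 unit tasks take 30 time units on a single worker and 20 once a half-speed worker joins the pool. With few, long tasks a slow worker can still delay completion, which is consistent with the text saying "often" rather than "always".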

PAGE 19

As an illustration, experiences with researchers in the coastal sciences application domain reveal that researchers in different universities (e.g. University of Florida, Virginia Institute of Marine Sciences) collaborate by exchanging datasets and prediction models. They run predictions by coupling models that run at different locations, and the simulation models often require not only an executable binary but also a variety of support tools (scripting languages, data conversion tools, geographical information systems). Users at each site are very familiar with OS level abstractions, such as Unix accounts and distributed file systems. Even though Grid frameworks and tools are available to facilitate cross-domain computation and data transfers, in such scenarios users still want to reuse their existing software infrastructure without extensive modifications.

1.2.1 Heterogeneity of Compute Resources

PAGE 20

Many applications have strict software dependencies and therefore cannot be deployed on arbitrary hosts. To deal with heterogeneity, application programmers must re-write their applications to fit within the resource framework they are provided. Modifying the wealth of existing applications to make them run on remote resources is seen as an obstacle to wide-area distributed computing.

Mechanisms for NAT traversal with UDP (STUN) [21] and TCP (STUNT) [22] exist, and are being used by applications such as Skype [23]. However, these mechanisms require setting up and managing publicly reachable (STUN or STUNT) servers. Furthermore, they cannot be directly used by existing TCP/IP applications that were not designed to use these mechanisms.

PAGE 21

It is equally important to isolate from each other the applications from different users that share the same resource. Traditional approaches rely on the operating systems to provide the desired security and performance isolation. However, the complexity of modern operating systems introduces several loop-holes that can be exploited by malicious applications.

[12] and Xen [10]) were originally introduced by IBM (System/370) in 1970 to allow for time-sharing of its expensive mainframe platforms. Virtual machines allow simultaneous execution of multiple full-blown operating systems on the same host (see Figure 1-1). This process is achieved through a thin layer of software called virtual machine monitor (VMM), which provides to each running operating system (guest) an abstraction of a complete hardware platform (CPU, memory, storage and peripherals). Virtual machines provide an interface equivalent to the ISA offered by the underlying platform and allow most instructions issued by the guest OS and guest applications to be directly executed on the CPU, thus reducing the overhead of emulation.

Virtual machines completely decouple the software environment for an application from the underlying physical host. The entire execution environment (operating system, libraries and application binaries) can be encapsulated into a single large file (called a VM

PAGE 22

image), that can be copied and instantiated on any physical host with a suitable virtual machine monitor. This encapsulation enables homogeneous configurations of wide-area resources using unmodified middleware and applications, without interfering with the local site policies. In [24], the authors propose to use VM images to deploy and maintain software in an organization. The entire memory and CPU state of a VM can be checkpointed and resumed, thus allowing for migration of unmodified applications. VMs confine the guest applications within a closed sandbox, which can prevent a malicious application from causing harm to the physical host resources.

A virtual machine monitor operates at a level where it has to manage only a few resources (CPU, memory and devices) as opposed to a modern OS that manages several entities (users, files, processes etc). This simplicity not only makes VMMs more secure than modern OSes, but also makes resource usage policies easy to express and enforce [10].

Figure 1-1. Hosted virtual machine monitor running two isolated virtual machine "guests"

Virtual networks [13-15] decouple the network environment of an application from the physical environment and provide an opportunity to aggregate wide-area resources. The virtualized network has its own address space, and all application traffic is isolated from the physical network. The virtualization layer handles all complications relating to the presence of NATs/firewalls, thus presenting to applications an environment similar to local-area networks.

[13-15] techniques exist, but impose non-trivial management overhead.

In this work, I focus on the IPOP self-managing virtual network, which uses P2P techniques for overlay routing and decentralized establishment of direct connections among nodes behind NAT/firewall routers. IPOP can scale to a large number of hosts, and is highly resilient to node and link failures. The novelty of this work lies in the application of structured P2P techniques to the different aspects of the virtual network: address space management, route discovery and NAT traversal. In combination with classic VM technology, IPOP facilitates creation of homogeneously configured Wide-area clusters of Virtual Workstations (WOWs). These systems support execution/checkpoint/migration

1], Condor [18]), thus providing excellent infrastructure for deployment of desktop grids and cross-domain collaboration.

Within the context of the IPOP system, the key contributions of this work are listed below:

1. 25] from Microsoft) require highly-available servers for out-of-band exchange of information relevant to NAT traversal. The decentralized NAT traversal that uses structured P2P routing for the out-of-band messaging between NATed hosts, requiring only an easy-to-manage seed network of public nodes. The implementation of this technique in the IPOP system is described in Chapter 2.

2. 4

3. 26] and several applications and services that build on top of it. I have developed generally applicable techniques to facilitate consistent P2P routing in the presence of overlay faults. These techniques and the subsequent improvements in routing are presented in Chapter 6.

4. 27] is a well-known technique to reduce the latency of structured P2P routing, by selecting overlay links based on proximity. To assess network proximity without explicit latency measurements, a technique called network coordinates [28] is used that allows embedding node latencies in a low-dimensional space such that the distance in the coordinate space provides an estimate of latency. I have investigated the usefulness of PNS based on network coordinates, as a means to improve the route latency of IPOP overlays. Chapter 7 describes the technique and presents an analysis of a deployment on PlanetLab [29].

5. 7).

The increasing use of Network Address Translators (NATs) and firewalls creates a situation in which some nodes on the network can create outgoing connections, but cannot receive incoming connections. This lack of bi-directional connectivity is recognized as a hindrance to programming and deploying distributed applications [6, 7, 30]. Protocols for NAT/firewall traversal [21] exist, but require applications to be re-linked with the new protocol libraries.

Virtual networks [13-15] can provide to applications running in a virtualized infrastructure the perception of an environment functionally identical to a local-area network, despite the presence of NATs and firewall routers in the physical infrastructure. Virtual networks also confine communication of a distributed application within an environment that is logically isolated from the physical network infrastructure, thus reducing vulnerability to non-participating hosts and users at a site.

In the current techniques for network virtualization, overlay routing tables are either set up by an administrator or rely on virtual network routers/switches to have all-to-all connectivity among themselves. Hence, the process of adding, configuring and managing clients and servers that route traffic within the overlay is difficult to scale. Although topology adaptation is possible using techniques proposed in [31], adaptive routes are coordinated by a centralized server. These approaches can provide a robust overlay through redundancy. However, the effort required to preserve robustness would increase every time a new node is added and the network grows in size.

For wide-area collaborations, it is also necessary that a network virtualization technique is scalable, fault-tolerant and requires minimal administrative control. This work presents IPOP, a network virtualization technique based on IP tunneling over peer-to-peer (P2P) networks that meets these requirements. P2P networks are

The rest of this chapter is organized as follows. Section 2.1 gives an overview of P2P networks, applications and different P2P architectures. In Section 2.3, I present the IPOP architecture, and mechanisms to discover, establish and maintain overlay links between nodes behind NATs/firewalls. A comprehensive evaluation of the virtual network performance is presented in Section 6.5.

As desktop computers become more and more powerful, there is an increasing interest in using resources from commodity computers at the edge of the Internet. Peer-to-peer (P2P) networks refer to distributed computing architectures that are designed to facilitate this exchange of computer resources (content, storage and CPU cycles) by direct exchange, rather than requiring intermediation from centralized servers. P2P networks are designed to function, scale and self-organize in the presence of a highly transient population of nodes, and of network and computer failures. Applications of P2P networks include:

1.

32], Gnutella [33], Kazaa [34], Oceanstore [35], PAST [36], CFS [37].

2.

The first generation of P2P networks (such as the original Napster system) required centralized servers to keep track of the location of different files in the P2P network. Once a file was correctly located, it could be transferred directly without involving the server. Gnutella and Kazaa alleviated the requirement of high-capacity centralized servers by storing a distributed index of files across a set of P2P nodes. A common point among these protocols was that none of them imposed any structure on the overlay topology, and a file could potentially be located on any node in the network. These P2P networks thus incurred high overheads on searching for files. Caching techniques reduced search cost, but limited the applicability of these systems to immutable data. Nevertheless, the lack of structure also made these protocols highly resilient to churn.

In the next generation of P2P systems [38-41], the P2P nodes arrange themselves into a well-defined topology (such as a ring, or a hypercube) based on their P2P identifiers (also called P2P addresses) chosen from a large address space. The overlay topology and routing protocols bound the number of hops between P2P nodes, with asymptotic complexity sub-linear with respect to the size of the network.

Figure 2-1 illustrates routing over a ring-structured P2P network. Each node maintains a routing table with network endpoints (IP address and port) of only a few nodes in the system. At each node, the following routing rule is executed: if the destination appears in the local routing table, the message is directly communicated to the destination; otherwise it is sent to the node (in the routing table) that is closest to the destination; and in case the current node is closest, the message is delivered locally.
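
The routing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 8-bit identifier space, the routing tables in `TABLES` and the function names are hypothetical and are not taken from any particular P2P implementation. The sketch forwards a message only when the chosen neighbor is strictly closer to the destination, which guarantees termination.

```python
from typing import Dict, List, Optional, Set

RING_SIZE = 256  # toy identifier space (assumption); real overlays use far larger spaces


def ring_distance(a: int, b: int) -> int:
    """Distance between two identifiers, measured the short way around the ring."""
    d = abs(a - b) % RING_SIZE
    return min(d, RING_SIZE - d)


def next_hop(current: int, table: Set[int], dest: int) -> Optional[int]:
    """Apply the routing rule at one node; None means 'deliver locally'."""
    if current == dest:
        return None
    if dest in table:
        return dest  # destination is a known neighbor: send directly
    best = min(table, key=lambda n: ring_distance(n, dest))
    # Forward only on strict progress toward the destination.
    if ring_distance(best, dest) < ring_distance(current, dest):
        return best
    return None  # the current node is closest: deliver here


def route(source: int, dest: int, tables: Dict[int, Set[int]]) -> List[int]:
    """Return the list of nodes visited, ending at the delivering node."""
    path = [source]
    while True:
        hop = next_hop(path[-1], tables[path[-1]], dest)
        if hop is None:
            return path
        path.append(hop)


# Hypothetical routing tables: each node knows only a few other nodes.
TABLES: Dict[int, Set[int]] = {
    100: {90, 110, 118},
    118: {100, 110, 128},
    128: {100, 118, 130},
    130: {100, 128, 134},
    134: {130, 140},
}
```

With these assumed tables, `route(100, 130, TABLES)` visits nodes 100, 118, 128 and 130, while `route(100, 133, TABLES)` ends at node 134, the node closest to the absent address 133.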

To illustrate P2P overlay routing techniques, in Figure 2-1(A) node 100 sends a message to address 130. The message is forwarded to node 118 and subsequently to node 128, which has node 130 in its routing table. Eventually, the message is forwarded to node 130, where it is delivered to the local application. In Figure 2-1(B), node 100 sends a message to address 133 (destination node not present in the network); the message is routed through nodes 118, 128 and 130 to node 134 (closest to address 133), where it is delivered to the local application.

Figure 2-1. Structured P2P routing

Structured P2P systems provide an object storage facility called a Distributed Hash Table (DHT). Each object is associated with a key that belongs to the same address space as node identifiers. The ownership of keys is partitioned among participating nodes, such that each key is stored on a set of nodes that are closest to the key in the identifier space. This partitioning of key ownership, together with efficient routing, bounds the lookup overhead for an object stored in the DHT.
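
The key-ownership rule described above can be sketched as follows. The sketch is an illustration only: the 8-bit identifier space, the replica count of two and the `ToyDHT` class are assumptions made for the example, and a real DHT would locate the owners through overlay routing rather than from a global list of nodes.

```python
from typing import Dict, List, Optional

RING_SIZE = 256  # toy identifier space (assumption)


def ring_distance(a: int, b: int) -> int:
    """Distance between two identifiers, measured the short way around the ring."""
    d = abs(a - b) % RING_SIZE
    return min(d, RING_SIZE - d)


def key_owners(key: int, node_ids: List[int], replicas: int = 2) -> List[int]:
    """Nodes closest to the key in identifier space (ties broken by node id)."""
    ranked = sorted(node_ids, key=lambda n: (ring_distance(n, key), n))
    return ranked[:replicas]


class ToyDHT:
    """In-memory stand-in for a DHT: one dictionary per participating node."""

    def __init__(self, node_ids: List[int], replicas: int = 2) -> None:
        self.node_ids = list(node_ids)
        self.replicas = replicas
        self.store: Dict[int, Dict[int, str]] = {n: {} for n in node_ids}

    def put(self, key: int, value: str) -> None:
        # Store the object on every node that owns the key.
        for node in key_owners(key, self.node_ids, self.replicas):
            self.store[node][key] = value

    def get(self, key: int) -> Optional[str]:
        # Ask the owners in order of closeness to the key.
        for node in key_owners(key, self.node_ids, self.replicas):
            if key in self.store[node]:
                return self.store[node][key]
        return None
```

For a ring with nodes 100, 118, 128, 130 and 134, key 133 is owned by nodes 134 and 130, so a lookup needs to reach only those nodes.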

[42][43]. An IP-over-P2P overlay benefits from the synergy of fault-tolerant techniques applied at different levels. The IPOP overlay dynamically adapts routing of IP packets as nodes fail or leave the network; even if packets are dropped by such nodes, IP and other protocols above it in the network stack have been designed to cope with such transient failures.

[21][22] to traverse NAT/firewalls. These approaches require setting up globally reachable STUN or STUNT servers that aid in building the necessary NAT state by a carefully crafted exchange of packets. With P2P networks, each overlay node can provide this functionality for detection of NATs and their subsequent traversal. This approach is decentralized and introduces no dedicated servers.

A useful application for IPOP is in the area of Grid Computing [1], where wide-area nodes (irrespective of their physical locations) can be aggregated into a virtual IP network (Figure 2-2). The IPOP layer sits between applications (e.g. Grid clusters of physical and/or virtual machines, voice-over-IP) and physical computing nodes interconnected by existing IP networking infrastructures. This virtual network is completely decoupled from the physical network, which not only isolates the Grid application traffic, but also allows for migration of virtual IP nodes into new subnets.

Figure 2-2. Virtualizing IP over a P2P overlay

The IPOP architecture is described next.

2-3): a virtualized network interface for capturing and injecting IP packets into the virtual network, and a P2P node that encapsulates, tunnels and routes packets within the overlay. IPOP builds upon a user-level framework provided by the Brunet P2P protocol suite [44], which provides mechanisms to discover, establish and maintain overlay links between nodes, even across NATs and firewalls. The next section describes how IP packets are captured/injected at the end hosts and routed on the Brunet P2P network; the following section discusses mechanisms for overlay link setup and NAT/firewall traversal.

Figure 2-3. Architectural overview of IPOP

Figure 2-3 shows the flow of data between two applications communicating over the virtual IP network provided by IPOP:

1) The application (on the left) sends data to a virtual IP destination (src: 172.16.0.2, dest: 172.16.0.18).
2) IPOP reads the Ethernet frame from the tap and extracts the virtual IP packet.
3) The virtual IP packet is encapsulated inside a P2P (Brunet) packet addressed to P2P node B (on the right) associated with the virtual IP destination.
4) The P2P packet is routed within the P2P overlay to the destination node B.
5) At node B, IPOP extracts the virtual IP packet from the P2P packet.
6) IPOP builds an Ethernet frame that it injects into the tap.
7) Eventually, the data is delivered to the application (on B).

While IPOP sees Ethernet frames, it only routes IP packets; non-IP traffic, notably ARP traffic, is contained within the host.

The P2P address of an IPOP node is the 160-bit SHA-1 hash of the IP address (virtualized IP) of the tap device, and this is used for mapping a destination virtual IP address to the P2P address (of the destination IPOP node) and vice-versa.

44], particularly the mechanisms that enable new nodes to join an existing P2P network and connections to form between nodes behind NATs. The term "connection" is used to refer to an overlay link between P2P nodes over which packets are routed.
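
The derivation of a P2P address from a virtual IP address, as described above, can be sketched with Python's standard library. The exact byte encoding that IPOP feeds into SHA-1 is not stated in this section; hashing the 4-byte network-order form of the address is an assumption made for the example, as is the function name `p2p_address`.

```python
import hashlib
import ipaddress


def p2p_address(virtual_ip: str) -> int:
    """Return a 160-bit overlay address derived from a virtual IPv4 address.

    Assumption: the hash input is the packed 4-byte form of the address.
    """
    packed = ipaddress.IPv4Address(virtual_ip).packed
    digest = hashlib.sha1(packed).digest()  # SHA-1 yields 20 bytes = 160 bits
    return int.from_bytes(digest, "big")


src = p2p_address("172.16.0.2")
dst = p2p_address("172.16.0.18")
```

Because the mapping is deterministic, any node can compute the overlay address of a destination virtual IP locally, without consulting a directory.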

Figure 2-4. Peer-to-peer connections in Brunet

Brunet maintains a structured ring of P2P nodes ordered by 160-bit Brunet addresses (Figure 2-4). Each node maintains connections to its nearest neighbors in the P2P address space, called structured near connections.

45]. The system supports decentralized traffic inspection and bootstrapping of direct overlay connections between frequently communicating P2P nodes, which we call shortcut connections.

Brunet uses greedy routing of packets over structured (near and far) connections, where at each overlay hop the packet gets closer to the destination in the P2P address space. The packet is eventually delivered to the destination; or, if the destination is down, it is delivered to its nearest neighbors in the P2P address space.

Figure 2-5. Connection setup between P2P nodes

Connections between Brunet nodes are abstracted and may operate over any transport. The information about the transport protocol and the physical endpoint (e.g. IP address and port number) is contained inside a Uniform Resource Identifier (URI), such as brunet.tcp:192.0.1.1:1024. Note that a P2P node may have multiple URIs, if it has multiple network interfaces or if it is behind one or more levels of network address translation. The encapsulation provided by URIs provides extensibility to new connection types; currently there are implementations for TCP and UDP transports.

Figure 2-5 illustrates connection setup between two P2P nodes. The mechanism for connection setup between nodes consists of conveying the intent to connect, and resolution of P2P addresses to URIs followed by a linking handshake, which are summarized as follows:

2. 2.3.2.3

Nodes keep an idle connection state alive by periodically probing each other through ping messages, which also involves resending of unanswered pings and exponential back-offs between resends. A succession of unanswered pings is perceived as the target node going down or a network outage, and the current node discards the connection state. These ping messages incur bandwidth and processing overhead at end nodes, which restricts the number of connections a node can maintain.

It should be noted that the linking protocol is initiated by both peers, leading to a potential race condition that must be broken in favor of one peer succeeding while the other fails. Therefore, each node records its active linking attempt to the target before sending out a link request. If the current node then gets a link request from its target, it responds with a link error message stating that the target should give up its active attempt and let the current node go ahead with the protocol. The target node gives up its connection attempt, and eventually the current node succeeds. It is possible that both nodes (current and target) initiate active linking, get link error messages

The new node must now identify its correct position in the ring, and form structured near connections with its left and right neighbors. It sends a CTM request addressed to itself on the network through the leaf target. The CTM request is routed over the structured network, and (since the new node is still not in the ring) eventually delivered to its two nearest neighbors. The CTM replies received by the forwarding node are passed back to the new node. The node now knows the URIs of its nearest (left and right) neighbors and vice versa, and can form structured near connections with them. At this point, the node is fully routable. The time taken for a new node to join an existing P2P network of more than 130 nodes and become fully routable is found to be of the order of seconds (see Figure 2-7).
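
The outcome of the join step described above, in which a request addressed to the new node's own identifier reaches its two nearest neighbors, can be illustrated with a centralized sketch. This is not the Brunet protocol: a real overlay resolves the neighbors through routing, whereas the example assumes a global view of the ring. The small integer identifiers, the convention that "left" means the predecessor, and the function name are all assumptions.

```python
from bisect import bisect_left
from typing import List, Tuple


def nearest_neighbors(new_id: int, ring_ids: List[int]) -> Tuple[int, int]:
    """Return the (left, right) ring neighbors of an identifier not yet in the ring."""
    ring = sorted(ring_ids)
    i = bisect_left(ring, new_id)
    left = ring[i - 1]           # i == 0 wraps around to the largest identifier
    right = ring[i % len(ring)]  # i == len(ring) wraps around to the smallest
    return left, right
```

For a ring with nodes 100, 118, 128, 130 and 134, a node joining with identifier 120 would form near connections with 118 and 128, while a node with identifier 50 wraps around the ring and connects to 134 and 100.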

45]) on the ring using protocols described in Section 2.3.2.1

21] protocol, there are four types of NATs in common use today (described in Appendix A). Of these four types, all have the property that if a UDP packet is sent from IP address A port p_a to IP address B port p_b, the NAT device will allow packets from IP address B port p_b to flow to IP address A port p_a. In addition to the above property, three out of four of the common NAT types (all but the symmetric) use the same mapping for the NAT's port → internal (IP, port) pair, irrespective of the destination (IP, port). The UDP transport implementation of Brunet is designed to deal with NAT traversal for a large class of NAT devices found in practical deployments. The bi-directionality of the linking handshake is what enables nodes to punch holes into their own NATs as described in STUN [21, 46, 47]. This happens because one of the incoming packets is perceived as a reply to an outgoing packet, and is allowed to pass. Furthermore, this approach is decentralized and introduces no single points of failure or dedicated servers (unlike the STUN protocol).

Based on the description of URIs presented earlier, it follows that a node inside a private network behind a NAT can have multiple URIs (corresponding to the private IP/port, and the NAT-assigned IP/port when it communicates with nodes on the public Internet). Furthermore, not all URIs can be used to communicate with it. Which URIs are usable depends on the locations of the communicating nodes and the nature of the NATs. For example, two nodes behind a NAT that does not support "hairpin" translation [46] can communicate only using URIs corresponding to their private IP/port, and they fail when using the NAT-assigned IP/port. In contrast, two nodes behind "hairpin" NATs

Since private IP addresses are not unique across LANs, it is possible that trying a URI with a private address leads to communication with a node other than the intended connection target. However, each node has a unique P2P address. Linking messages contain the P2P addresses of both peers. This information is used to detect such false hits in the same LAN and suppress the linking attempt to the intended target using that URI.

Figure 2-6 shows observed latencies as high as 1600 ms between IPOP nodes (in [16]) connected to a P2P network of over 100 nodes on PlanetLab. These high latencies were due to multi-hop overlay routing through highly loaded PlanetLab nodes. This section describes the technique for decentralized adaptive shortcut creation, which enables the setup of single-hop overlay links on demand, based on traffic inspection. Section 6.5 shows that shortcuts greatly reduce latency and improve bandwidth of the virtual network.

Figure 2-6. Distribution of round-trip latencies for ICMP/ping packets over a 118-node PlanetLab overlay. Two hops separate the ping source from the destination.

The Brunet P2P library is an extensible system which allows developers to add new routing protocols and connection types. For each connection type, a P2P node has a ConnectionOverlord that ensures the node has the right number of connections of that type.

To support shortcut P2P connections, I have implemented a ShortcutConnectionOverlord within the Brunet library. The ShortcutConnectionOverlord at a node tracks communication with other nodes using a metric called score. The algorithm is based on a queueing system. The number of packets that arrive in the $i$-th unit of time is $a_i$. There is a constant service rate on this work queue. The score is the amount of remaining work left in this virtual queue. If the score at time $i$ is $s_i$, and the rate of service is $c$, it follows that:

$s_{i+1} = \max(s_i + a_i - c,\ 0)$

The higher the score of a destination node, the more communication there has been with it. The nodes for which the virtual queue is the longest are the nodes it connects to. The ShortcutConnectionOverlord establishes and maintains shortcut connections with nodes whose scores exceed a certain threshold.
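
The score recurrence above can be sketched directly. The function names, the per-destination traffic dictionary and the numeric values are hypothetical and serve only to illustrate the virtual-queue idea; this is not the ShortcutConnectionOverlord code itself.

```python
from typing import Dict, List, Sequence


def update_score(score: float, arrivals: float, service_rate: float) -> float:
    """One step of the virtual queue: add arrivals, drain at the service rate."""
    return max(score + arrivals - service_rate, 0)


def score_trace(arrivals: Sequence[float], service_rate: float) -> List[float]:
    """Scores after each time unit, starting from an empty queue."""
    trace: List[float] = []
    score = 0.0
    for a in arrivals:
        score = update_score(score, a, service_rate)
        trace.append(score)
    return trace


def shortcut_targets(traffic: Dict[str, Sequence[float]],
                     service_rate: float, threshold: float) -> List[str]:
    """Destinations whose latest score exceeds the threshold."""
    return sorted(
        dest for dest, arrivals in traffic.items()
        if arrivals and score_trace(arrivals, service_rate)[-1] > threshold
    )
```

With a service rate of 2 packets per time unit, the arrival sequence 5, 0, 0, 3 gives scores 3, 1, 0, 1: a burst raises the score, and the score decays back to zero once traffic stops, so only sustained communication crosses the threshold.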

The next few sections present experimental results comparing bandwidth and latency of the virtual network with and without the adaptive shortcuts. I also report the time required for the network to adapt and create shortcuts between communicating nodes.

3-1. The nodes in UFL were located in the same private network behind a site NAT, while the ones at NWU were in separate private networks (behind VMware NATs on two different hosts) behind a common site firewall.

Figure 2-7. Profiles of ICMP echo round-trip latencies and dropped packets during IPOP node join. (Panel B: dropped packets.)

Figure 2-7 summarizes the results from this experiment. The experiment considers three combinations of the location of the node A joining the network and the node B it communicates with: UFL-UFL, UFL-NWU and NWU-NWU.

Figure 2-7A: The plot shows latencies averaged over 100 trials as reported by the ping application for packets which were not dropped.

Figure 2-7B: The plot shows the percentage of lost packets (over 100 trials) for each ICMP sequence number as reported by the ping application.

Focusing initially on the UFL-NWU case, analyzing the data for the initial tens of ICMP packets shows three different regimes (see Figure 2-8). For the first three ICMP requests, on average, 90% of the packets are dropped

The UFL-UFL case also reveals the same three regimes; however, the timings differ. It takes up to about 40 ICMP ping packets before the node becomes routable over the P2P network. Furthermore, it takes about 200 ICMP packets before shortcut connections are formed. This high delay is because of the nature of the UFL NAT and the implementation

2-7, these few initial packets do not appear on the plot.

Figure 2-8. Three regimes for percentage of dropped ICMP packets in UFL-NWU (first 50 packets).

of the IPOP linking protocol, as follows. The UFL NAT does not support "hairpin" translation [46], i.e. it discards packets sourced within the private network and destined to the NAT-translated public IP/port. As described in Section 2.3.2.1, the linking handshake involves nodes trying target URIs one by one until finding one on which they can send and receive handshake messages. In IPOP, nodes first attempt the URIs corresponding to the NAT-assigned public IP/port for the linking handshake during connection setup. Because of conservative estimates for the re-send interval, the back-off factor and the number of retries for UDP tunneling, nodes take several seconds before giving up on that URI and trying the next in the list (private IP/port), on which they succeed. In Section 2.4.3, I describe implementation-level optimizations that increase the likelihood of picking the correct URI in the first attempt, thus reducing the connection setup delay in the UFL-UFL case to a few seconds.

For the NWU-NWU case, the two nodes are either inside the same private network (the VMware NAT network), or on different hosts. The VMware NAT supports hairpin translation, hence both URIs for the P2P node work for connection setup. As with the UFL-NWU case, the linking protocol succeeds with the first URI it tries, and hence the shortcut connections are set up within a few (about 20) ICMP packets.

The bandwidth improvements over the original P2P routing by enabling the shortcut connection setup, between two IPOP nodes communicating over the virtual network,

were evaluated using the Test TCP (ttcp) utility. This utility is used to measure the end-to-end bandwidth achieved in transfers of large files. Table 2-1 shows the average bandwidth and standard deviation measurements between two IPOP nodes, with and without shortcuts. The experiment considers 12 ttcp-based transfers of files of three different sizes (695 MB, 50 MB and 8 MB) and two scenarios for the location of nodes: UFL-UFL and NWU-UFL. Two factors along the routing path limit the bandwidth between two nodes: first, the bandwidth of the overlay links, and second, the very high CPU load of the machines hosting the intermediate IPOP routers, which reduces the processing throughput of the user-level IPOP implementation. Without shortcut connections, nodes communicated over a 3-hop communication path traversing the heavily loaded PlanetLab nodes, and a very low bandwidth was recorded. However, with shortcuts enabled, nodes communicate over a single overlay hop, thus achieving a much higher bandwidth.

Table 2-1. Bandwidth measurements between two IPOP nodes: with and without shortcuts

            Shortcuts enabled                   Shortcuts disabled
            Bandwidth (Mbps)  Std. dev (Mbps)   Bandwidth (Mbps)  Std. dev (Mbps)
UFL-UFL     12.91             0.74              0.67              0.024
UFL-NWU     10.00             1.62              0.68              0.018

The preceding sections of this chapter have reported on the delays incurred by a new node to join the P2P network and become fully routable, and also on the improvements in latency and bandwidth obtained by using shortcuts as opposed to the multi-hop routing path. Further assuming that shortcut connections are always established between two communicating IPOP nodes, the next section quantifies the latency and bandwidth overhead (due to virtualization) on a single IPOP link (using a single P2P hop) over the physical network (direct IP-level communication).

16]. Based on several optimizations

to the Brunet P2P library and IPOP, this section presents a similar comparison between IPOP-UDP and the physical network. Latency was measured from the round-trip times of ICMP pings, and also by the rate of TCP request/response transactions from the netperf benchmark. The bandwidth measurements were performed using iperf. The configurations of the machines used for these experiments are shown in Table 2-2. The LAN experiments (latency and bandwidth) refer to machines A and B connected to the same 100 Mbps switch. The WAN latency experiments refer to machines C and D, while the WAN throughput experiments refer to machines C and E connected via Abilene.

Table 2-2. Configurations of machines used for evaluating performance of a single IPOP link

Machine  Host type                 CPU                 Location
A        Physical                  Pentium-4 1.8 GHz   UFL
B        Physical                  Pentium-4 1.7 GHz   UFL
C        Virtual (VMware ESX 3.0)  Xeon 3.2 GHz        UFL
D        Physical                  Pentium-2 400 MHz   UCLA
E        Physical                  Pentium-4 3.0 GHz   VIMS

Table 2-3. Mean and standard deviation of 10000 ping round-trip times for IPOP-UDP and physical network

                           mean (msec)  std. dev (msec)
LAN (A and B)  physical    0.240        0.028
               IPOP-UDP    4.25         1.43
WAN (C and D)  physical    66.13        1.20
               IPOP-UDP    80.54        14.27

Table 2-3 summarizes the ping round-trip times for the latency experiments on an IPOP link using a single P2P hop. The latency overhead for LAN and WAN is observed to be 4 ms and 15 ms, respectively. The higher overhead on the WAN is under investigation.

Table 2-4. Mean and standard deviation of the rate of TCP request/response transactions measured over 100 samples with netperf for IPOP-UDP and physical network

                           mean (trans/sec)  std. dev (trans/sec)
LAN (A and B)  physical    8148.6            942.53
               IPOP-UDP    183.6             45.19
WAN (C and D)  physical    14.91             0.28
               IPOP-UDP    13.40             0.34
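
The figures derived from Tables 2-3 and 2-4 can be recomputed with a short script. The numbers are copied from the tables; the variable names are illustrative, and the conversion of a transaction rate into a time per transaction simply takes the reciprocal of the rate.

```python
# Mean ping round-trip times in milliseconds (Table 2-3).
ping_rtt_ms = {
    "LAN": {"physical": 0.240, "ipop": 4.25},
    "WAN": {"physical": 66.13, "ipop": 80.54},
}

# Mean netperf TCP request/response rates in transactions per second (Table 2-4).
trans_per_sec = {
    "LAN": {"physical": 8148.6, "ipop": 183.6},
    "WAN": {"physical": 14.91, "ipop": 13.40},
}

# Absolute latency added by IPOP on a single overlay hop.
overhead_ms = {env: v["ipop"] - v["physical"] for env, v in ping_rtt_ms.items()}

# Transaction rate over IPOP relative to the physical network.
rate_ratio = {env: v["ipop"] / v["physical"] for env, v in trans_per_sec.items()}

# Time per transaction in milliseconds (reciprocal of the rate).
ms_per_transaction = {
    env: {net: 1000.0 / rate for net, rate in v.items()}
    for env, v in trans_per_sec.items()
}

for env in ("LAN", "WAN"):
    print(env, round(overhead_ms[env], 2), "ms overhead,",
          round(100 * rate_ratio[env]), "% of physical transaction rate")
```

The script gives an added latency of about 4.0 ms on the LAN and about 14.4 ms on the WAN, in line with the rounded values quoted above, and a transaction rate over IPOP of about 2% (LAN) and 90% (WAN) of the physical rate.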

Table 2-5. Comparison of throughput of a single IPOP link in LAN and WAN environments

                                    Abs. B/W (Mbps)  Rel. B/W (IPOP/Phys)
LAN (A and B)  physical             93.8
               IPOP-UDP (Host A)    23.91            25%
               IPOP-UDP (Host B)    29.58            32%
WAN (C and E)  physical             13.9
               IPOP-UDP (Host E)    12.5             90%

Latencies of the order of milliseconds/packet have also been reported in the context of other user-level routing systems, such as VNET [13]. The LAN experiment provides a rough estimate of the overhead associated with the implementation of IPOP. This overhead is attributed to the traversal of the kernel TCP/IP stack twice by any packet sent on the virtual network (once on the virtual interface, and additionally on the physical interface). While the relative overhead is high in the LAN environment, for the WAN used in this experiment the overhead is 31% of that of the physical network. In a WAN, the overhead of user-level routing gets amortized over the number of Internet hops (in our case, 10) that make up a P2P link.

Table 2-4 presents the rate of TCP request/reply transactions over IPOP and the physical network, measured with netperf with a 1-byte payload. Each transaction involves sending a request to another node (which, on getting the request, immediately sends a response back) and waiting for a response. The netperf benchmark measures the number of such transactions, one after the other, that can be carried out within a time interval. The higher the latency, the higher the time per transaction and thus the lower the transaction rate. Netperf measurements are representative of latencies (multiplicative inverse of transaction rate) incurred by applications, which are typically based on TCP/IP. With IPOP, we observe a low transaction rate on the local-area network (2% of physical), while on the wide-area network IPOP is able to achieve up to 90% of the rate on the physical network.

Table 2-5 compares the throughput of a single IPOP link to that of the physical network, for both LAN and WAN scenarios. An interesting observation is that over

2.4.1, delays greater than 150 seconds were observed for direct connection setup between two nodes in UFL behind a site NAT which does not support "hairpinning". Given the conservatively chosen timeouts for packet resends in the linking protocol, this delay happened because nodes first tried the wrong URIs (corresponding to the NAT-assigned IP/port of the peer) for linking. The linking protocol has been extended such that nodes try each other's URIs five in parallel, so that they can right away try the correct URI for linking. This enhancement has resulted in UFL-UFL connection setup being as fast (within a few seconds) as for the other scenarios.

2.5.1 Resource Virtualization

[48][5][49][50][4] recognize the usefulness of virtual machines (VMware [12][51], Xen [10]) as execution environments for Grid applications. Such environments can be deployed independent of the physical setup at each Grid site. In addition, VIOLIN [14], VNET [13, 31] and ViNe [15] have also recognized the utility of network overlays in wide-area distributed environments. In these techniques, it is necessary for administrators to set up overlay links, and a centralized entity is needed to orchestrate network adaptation in [31]. The described approach is fundamentally different in its use of a P2P-based overlay, which the nodes can join and leave in a completely decentralized, self-organizing fashion.

52]. In [53] Cheema et al. and in [54] Iamnitchi et al. have investigated P2P discovery of

55] Cao et al. have proposed a P2P approach to task scheduling in computational grids. Related to this work are the Jalapeno [56], Organic Grid [57], OurGrid [58] and ParCop [59] projects, which also pursue decentralized computing using P2P technology. There is also an existing body of research on various ways in which P2P systems can be applied to existing IP systems. In [60] Cox et al. have proposed to build a distributed DNS using DHash, a peer-to-peer distributed hash table built on top of Chord [40]. The IPOP system currently applies P2P techniques to achieve self-configuration of overlay network links to enable efficient and easy-to-deploy virtual private networks on which applications, including existing P2P-based systems ([36, 37]), can be deployed without concern for NAT/firewall traversal.

The use of a P2P-based overlay to support legacy applications has also been described in the context of i3 ([61][62]). The goal is to support interoperability with new i3 applications that support multicast, anycast and mobility. In contrast, the motivation of my research is to provide seamless access to Grid resources spanning different network domains by aggregating them into a virtual IP network that is completely isolated from the physical network.

Zhou et al. have developed P6P [63, 64], an implementation of IPv6 on a P2P overlay. The Teredo [25] protocol developed by Microsoft tunnels IPv6 packets over IPv4 UDP packets to enable nodes behind NATs to be addressed with IPv6 connectivity. On the other hand, my focus is to enable existing grid applications (typically based on IPv4) to run unmodified on the wide area. Few existing applications support IPv6.

[46][47][21][22]) require publicly available rendezvous servers for out-of-band signalling and exchange of NAT-assigned IP address and port numbers. For example, the Teredo protocol developed by Microsoft, which tunnels IPv6 inside IPv4 UDP messages, requires maintaining public Teredo servers; the IPv6 address of a Teredo client is derived from IPv4 addresses of the corresponding Teredo

In this chapter, I describe how self-configuring virtual networking through IPOP can be combined with virtual machine technology to create scalable wide-area overlay networks of virtual workstations called WOWs. These systems: (1) facilitate the addition of nodes to a pool of resources through the use of system virtual machines (VMs) and self-organizing virtual network links; (2) maintain IP connectivity even if VMs migrate across network domains; and (3) present to end-users and applications an environment that is functionally identical to a local-area network or cluster of workstations [65]. By doing so, WOW nodes can be deployed independently on different domains, and WOW distributed systems can be managed and programmed just like local-area networks, reusing unmodified subsystems such as batch schedulers, distributed file systems, and parallel application environments that are very familiar to system administrators and users.

Furthermore, WOW nodes can be packaged as VM "appliances" [24] that can be instantiated without disrupting the configuration of existing, commodity desktops with a variety of hosted I/O virtualization technologies (e.g. VMware, Parallels, Linux KVM). These characteristics make WOWs an excellent infrastructure for the deployment of desktop grids that support not only applications designed for such environments, as in [2, 3, 66, 67] and systems based on BOINC [9], but also complex, full-fledged O/S environments with unmodified software and middleware components (e.g. Condor [18-20]).

Experiments with a realistic deployment consisting of 118 router nodes on PlanetLab [29] and 33 compute nodes across six different firewalled domains (Figure 3-1) demonstrate the ability of WOWs (1) to establish direct overlay links between IPOP nodes, (2) to support existing middleware and compute-intensive applications and deliver good performance, and (3) to autonomously re-establish virtual network links after a VM migrates across a wide-area network, and successfully resume the execution of TCP/IP client/server applications in a manner that is completely transparent to the applications.

Figure 3-1. Testbed WOW used for experiments

The rest of this chapter is organized as follows. In Section 3.1, I describe how WOW nodes are configured and deployed. Section 3.1.4 evaluates a testbed WOW prototype. A case for using WOWs to set up ad hoc Condor pools for high-throughput computing is presented in Section 3.3. I describe related work in Section 3.2. In Section 3.4, I list the various deficiencies in the current IPOP prototype that hinder the deployment of WOWs by new users; solutions to these problems are presented in subsequent chapters of this dissertation.

3.1.1 Background and Motivations

1]. At the resource level, system configuration heterogeneity and the difficulty of establishing connectivity among machines due to the increasing use of NATs/firewalls [7] substantially hinder sharing of resources. WOWs are designed to facilitate the aggregation of resources in an environment where systems in different domains have different hardware and software configurations and are subject to different

Virtualization allows for isolated, flexible and efficient multiplexing of resources of a shared, distributed infrastructure [48]. With the use of VMs, the native or preferred software environment for applications can be instantiated on any physical host, replicated to form virtual clusters [68], and checkpointed/migrated [69] to enable unique opportunities for load balancing and fault tolerance.

16] virtual network consisting of: the Mono .NET runtime environment, a "tap" device (Figure 2-3), and a short (tens of lines) configuration script to launch IPOP and set up the VM with an IP address on the overlay. The configuration script specifies the location of at least one IPOP node on the public Internet to establish P2P connections with other nodes. Currently, we use an overlay deployed on PlanetLab for this purpose.

One important advantage of a virtual network supporting seamless NAT traversal is that in cases where the VM monitor provides a NAT-based virtual network interface

Alternatively, it is also possible to run IPOP on the physical host that hosts the VM, and still be able to capture/inject virtual IP traffic. The VM's Ethernet interface has an IP address from the virtual address space, and all network virtualization mechanisms take place outside the VM. At the cost of extra configuration (installing IPOP) on the host, such a model completely confines the VM traffic within a virtual network.

24] is configured once, then copied and deployed across many resources, facilitating the deployment of open environments for grid computing similar in nature to efforts such as the Open Science Grid (OSG [70]). WOW allows participants to add resources in a fully decentralized manner that imposes very little administrative overhead.

An illustrative use case of WOW techniques is a VM appliance [71] that self-configures Condor [18] pools on wide-area hosts. The VM configuration is based on a Linux 2.16 kernel and a Debian distribution that is customized to optimize the VM image size, Condor 6.8.20, and the IPOP runtime

Table 3-1 details the configuration of the various compute nodes of the testbed illustrated in Figure 3-1. The WOW has 33 compute nodes, 32 of which are hosted in universities and behind at least one level of NAT and/or firewall routers: 16 nodes in Florida; 13 in Illinois (Northwestern U.); 2 in Louisiana; and 1 node each in Virginia and North Carolina (VIMS and UNC). Node 34 is in a home network, behind multiple NATs (VMware, wireless router, and ISP provider). A total of 118 P2P router nodes, which run on 20 PlanetLab hosts, are also part of the overlay network, to provide a "bootstrap" overlay running on public-address Internet nodes, to which nodes behind firewalls could connect

With the only exception of the ncgrid.org firewall, which had a single UDP port open to allow IPOP traffic, no firewall changes needed to be implemented by system administrators. Furthermore, none of the sites (except UFL) provided DHCP capabilities, and WOW nodes there used VMware NAT devices, which do not require an IP address to be allocated by the site administrator.


Table 3-1. Configuration of the WOW testbed depicted in Figure 3-1. All WOW guests run the same Debian/Linux 2.4.27-2 O/S.

Node number      Physical domain    Host CPU           Host O/S               VM monitor (VMware)
node002          ufl.edu            Xeon 2.4GHz        Linux 2.4.20-20.7smp   Workstation 5.5
node003-node016  ufl.edu            Xeon 2.4GHz        Linux 2.4.20-20.7smp   GSX 2.5.1
node017-node029  northwestern.edu   Xeon 2.0GHz        Linux 2.4.20-8smp      GSX 2.5.1
node030-node031  lsu.edu            Xeon 3.2GHz        Linux 2.4.26           GSX 3.0.0
node032          ncgrid.org         Pentium-3 1.3GHz   Linux 2.4.21-20.ELsmp  VMPlayer 1.0.0
node033          vims.edu           Xeon 3.2GHz        Linux 2.4.31           GSX 3.2.0
node034          gru.net            Pentium-4 1.7GHz   Windows XP SP2         VMPlayer 1.0.0

[Figure 3-2, panel B (shortcuts disabled): Frequency distributions of PBS/MEME job wall clock times]

I chose two representative life-science applications as benchmarks: MEME [72] version 3.5.0 and fastDNAml-p [73, 74] version 1.2.2. These applications ran, without any modifications, on the 33-node WOW; scheduling, data transfer and parallel programming run-time middleware also ran unmodified, including OpenPBS [75] version 2.3.16, PVM [76] version 3.4.5, SSH, RSH and NFS version 3.

In the PBS experiment, one of the WOW VMs (node002) was configured as the cluster head node and the rest were configured as worker nodes. In the PVM experiments, [...]


The experiments were designed to benchmark my implementation for classes of target applications for WOW: high-throughput independent tasks and parallel applications with high computation-to-communication ratios. Specifically, the goals of the experiments are: (1) to show that WOWs can deliver good throughput and parallel speedups, (2) to quantify the performance improvements due to shortcut connections, and (3) to provide qualitative insights on the deployment, use and stability of the IPOP system in a realistic environment.

[...] MEME [72] is a compute-intensive application that implements an algorithm to discover one or more motifs in a collection of DNA or protein sequences. In this experiment, I consider the execution of a large number (4000) of short-running MEME sequential jobs (approximately 30s each) queued and scheduled by PBS. The jobs run with the same set of input files and arguments, and are submitted at a frequency of 1 job/second at the PBS head node. Jobs read and write input and output files to an NFS file system mounted from the head node.

For the scenario where shortcut connections were enabled, the overall wall-clock time to finish the 4000 jobs was 4565s, and the average throughput of the WOW was 53 jobs per minute. Figure 3-2 shows a detailed analysis of the distribution of job execution times, for both the cases where the WOW had shortcut connection establishment enabled and disabled. The variation in job execution times shown in the histogram can be qualitatively explained with the help of Table 3-1: most physical machines in the WOW prototype have 2.4GHz Pentium-4 CPUs; a couple of them (nodes 32 and 34) are noticeably slower, and three of them are noticeably faster (nodes 30, 31 and 33). Overall, the slower nodes end up running a substantially smaller number of jobs than the fastest nodes (node 32 runs 1.6% of the jobs, while node 33 runs 4.2%).
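As a quick check of the figures quoted above, the average throughput follows directly from the job count and the overall wall-clock time. The short Python sketch below redoes that arithmetic; the variable names are illustrative and are not taken from the experiment scripts.

```python
# Figures quoted in the text for the shortcuts-enabled run.
total_jobs = 4000        # MEME jobs queued through PBS
wall_clock_s = 4565      # overall wall-clock time, in seconds

# Average throughput in jobs per minute.
jobs_per_minute = total_jobs / wall_clock_s * 60
print(f"throughput: {jobs_per_minute:.1f} jobs/minute")

# Approximate job counts implied by the quoted per-node shares.
jobs_on_node32 = round(total_jobs * 0.016)   # slow node, 1.6% of the jobs
jobs_on_node33 = round(total_jobs * 0.042)   # fast node, 4.2% of the jobs
print(f"node 32: ~{jobs_on_node32} jobs, node 33: ~{jobs_on_node33} jobs")
```

The computed throughput, about 52.6 jobs per minute, rounds to the reported value of 53.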


Figure 3-2 also shows that the use of shortcut connections decreases both the average and the relative standard deviation of the job execution times. The wall clock time average and standard deviation are 24.1s and 6.5s (shortcuts enabled) and 32.2s and 9.7s (shortcuts disabled). The use of shortcuts also reduced queuing delays in the PBS head node, which resulted in substantial throughput improvement, from 22 jobs/minute (without shortcut connections) to 53 jobs/minute (with shortcut connections).

Note that the throughput achieved by the deployed system depends not only on the performance of the overlay, but also on the performance of the scheduling and data transfer software that runs on it (PBS, NFS). The choice of different middleware implementations running inside WOW (e.g. Condor, Globus) can lead to different throughput values. The VM technology in use also impacts performance. The average execution time for the MEME application inside a VM was observed to be 13% higher than that of a physical host.

[...] [73, 74]. The parallel implementation of fastDNAml over PVM is based on a master-workers model, where the master maintains a task pool and dispatches tasks to workers dynamically. It has a high computation-to-communication ratio and, due to the dynamic nature of its task dispatching, it tolerates performance heterogeneities among computing nodes.

Table 3-2 summarizes the results of this experiment for the 50-taxa input data set reported in [74]. The parallel execution of fastDNAml on the WOW significantly reduces the execution time. The fastest execution is achieved on 30 nodes when the WOW has shortcut connections enabled: 24% faster than 30 nodes without shortcuts enabled, and 49% faster than the 15-node execution. Even though fastDNAml has a high computation-to-communication ratio for each task, the use of shortcuts resulted in substantial performance improvements. While I have not profiled where time is spent


within the application during its execution time, the increase in execution times can be explained by the fact that the application needs to synchronize many times during its execution, to select the best tree at each round of tree optimization [74].

Table 3-2. Execution times and speedups for the execution of fastDNAml-PVM in 1, 15 and 30 nodes of the WOW. The sequential execution time for the application is 22,272 seconds on node 2 and 45,191 seconds on node 34. Parallel speedups are reported with respect to the execution time of node 2.

Parallel execution                          15 nodes,           30 nodes,            30 nodes,
                                            shortcuts enabled   shortcuts disabled   shortcuts enabled
Execution time (seconds)                    2439                2033                 1642
Parallel speedup (with respect to node 2)   9.1                 11.0                 13.6

The sequential execution times of fastDNAml are reported for two different nodes (node002 and node034) and show that the differences in the hardware configuration of the individual nodes of the WOW result in substantial performance differences. While modeling parallel speedups in such a heterogeneous environment is difficult, I report on the speedups with respect to a node which has the hardware setup most common in the network. The parallel speedup computed under this assumption is 13.6x; in comparison, the speedup reported in [74] is approximately 23x, but is achieved in a homogeneous IBM RS/6000 SP cluster within a LAN.

[...] [69, 77, 78]. However, when a VM migrates it also carries along its connection state. Such connection state can also accumulate inside other hosts with which it is communicating. This forces the VM to retain its network identity, which in turn hampers VM migration between subnets. Virtual networking provides the opportunity of maintaining a consistent network identity for a VM, even when it migrates to a different network.


[...] 2.3.2. Clearly, packets do not get routed and are dropped until the node rejoins the P2P network; this short period of no routability is approximately 8 minutes for the 150-node network used in the setup. The TCP transport and applications are resilient to such temporary network outages, as the following experiments show.

1. [...] When the VM was resumed, its virtual eth0 network interface was restarted, and because the NATs that the VM connected to were different at UFL and NWU, the VM acquired a new physical address for eth0 at the destination. However, the virtual tap0 interface did not need to be restarted and remained with the same identity on the overlay network. Then IPOP was restarted; seconds later, the SCP server VM again became routable over the P2P network, then established a shortcut connection with the SCP client VM, and eventually the SCP file transfer resumed from the point it had stalled. The sustained transfer bandwidths before and after migration are 1.36MB/s and 1.83MB/s, respectively.

2. [...] This experiment simulates a use case of applying migration to improve load balancing: background load was introduced to a VM host resulting in an increase in the execution time of applications executing on the VM guest; the VM guest was then migrated from UFL to a different host at NWU. IPOP was restarted on the


guest upon VM resume. The job that was running on the VM continued to work without problems, and eventually committed its output data to the NFS-mounted home directory of the account used in the experiment. The runtime for the job that was "in transit" during the migration increased substantially due to the WAN migration delay. Once PBS started submitting jobs to the VM running on an unloaded host, it was observed that the job runtimes decreased with respect to the loaded host. This experiment also showed that the NFS and PBS client/server implementations were tolerant to the period with lack of connectivity. Figure 3-3 summarizes the results from this experiment. Job IDs 1 through 87 run on a VM at UFL. During job ID 88, the VM is migrated to NWU. Job ID 88 is impacted by the wide-area migration latency of hundreds of seconds, but completes successfully. Subsequent jobs scheduled by PBS also run successfully on the migrated VM, without requiring any application reconfiguration or restart.

[Figure 3-3: Profile of execution times for PBS-scheduled MEME sequential jobs during the migration of a worker node]

[...] [65] and Beowulf [79], which are very successful efforts at using commodity machines and networks for high performance distributed computing. Rather than supporting tightly-coupled parallel computation within a local-area or cluster network, the aim of WOW is to support high-throughput computing and cross-domain collaborations. The DAS [80] project built a distributed cluster based on homogeneously configured commodity nodes across five Dutch universities. Also related to my work is the Ibis project [30, 81], which lets applications span multiple sites of a grid, and copes [...]


[...] [82].

Several efforts on large-scale distributed computing have focused on aggregating wide-area resources to support high-throughput computing, but at the expense of requiring applications to be designed from scratch [2, 3, 66, 67, 83]. Legion [84] is a system also designed to scale to large numbers of machines and cross administrative domains. Globus [1] provides a security infrastructure for federated systems and supports several services for resource, data and information management. Condor [18-20] has been highly successful at delivering high-throughput computing to a large number of users.

My work differs from these approaches in that it is an end-to-end approach to providing a fully connected wide-area cluster environment for running unmodified applications. Nonetheless, the use of virtualization makes my approach one that does not preclude the use of any of these systems. Quite the contrary, because virtualization enables us to run unmodified systems software, it is possible to readily reuse existing, mature middleware implementations when applicable, and rapidly integrate future techniques.

In [82], the authors describe a virtualized infrastructure for wide-area distributed computing based on virtual machines and virtual networking. The key distinguishing feature of my approach is the use of peer-to-peer (P2P) techniques for overlay routing and establishment of overlay connections among nodes behind NAT/firewall routers in a highly scalable manner.

[...] [71] have been used to deploy a resource pool for running compute-intensive jobs through Condor. The pool consists of more than 80 nodes in several NATed/firewalled domains, and runs jobs submitted from nanoHub [17] and users [...]


The deployment of this pool has been greatly facilitated by packaging all the software (IPOP, Condor middleware) within the VM and requiring only a NAT network, as well as by the availability of free x86-based VM monitors, notably VMPlayer and VMware Server. Once a base VM image was created, replicating and instantiating new nodes was quite simple. The VM image can be downloaded and instantiated by ordinary users on their desktops. Once instantiated, the VM automatically becomes part of the shared pool. Through this VM, users can submit jobs to the shared pool and also run jobs from other users.

The system has been observed to be tolerant to physical node failures: neighbors of a failed node respond by creating connections to other nodes and thus maintain routability within the network. [...] have been shut down and restarted during this period of time. The overlay network has also exhibited resiliency to changes in NAT IP/port translations. IPOP is able to deal with these translation changes autonomously by detecting broken links and re-establishing them using the connection techniques discussed in Section 2.3.2.

An application of WOW techniques within a LAN is facilitating deployment of Condor pools at University Computer Centers consisting of hundreds of idle desktops, usually running the Windows operating system. Condor middleware runs on Linux, and can be deployed inside a Linux VM container on Windows machines. However, to be able to communicate with each other, these VMs need unique, routable IP addresses. Managing additional IP addresses for VMs can be very difficult for administrators, because many new VMs can be readily instantiated by users from a VM image, and can also be easily migrated across physical hosts. The decentralized NAT-traversal support in IPOP allows instantiation of these VMs behind NAT interfaces provided by VMMs (such as VMware and Xen), and is able to provide connectivity between them without requiring routable IP addresses for VMs.


[...] [85], the deployment of WOWs by new users is still hindered by:

1. [...] [71], IPOP supports dynamic virtual IP configuration using unmodified DHCP clients, by capturing DHCP packets from the tap interface, and making requests to a SOAP server that maintains virtual IP leases. With the virtual network provided by IPOP potentially involving hosts spanning wide-area networks and owned by multiple organizations, maintaining such dedicated DHCP servers is difficult. Moreover, dedicated servers introduce central points of failure.

2. [...]

Furthermore, earlier versions of IPOP [16, 71] have also suffered from limitations with respect to:

1. [...] [78], [13]), thus requires killing and restarting the P2P node on the target host as shown in [85].

2. [...]


[...] [28] and resource discovery to find proxy nodes that can route communication between virtual IP nodes when shortcut connections cannot form. Subsequent chapters of this dissertation describe and evaluate these techniques.

In the next chapter, I describe an implementation of a DHT over the Brunet P2P system and how it can be leveraged to make deployment of WOWs easier for new users.


Enterprise information systems involve packing and storing large numbers of storage devices throughout a series of shelves in a room, all linked together. The information in these storage systems can be accessed by a supercomputer, mainframe computer, or personal computer. These systems can only be accessed by authorized users and require constant attention and management by experts within an organization. The management activities include keeping the system up and running, hardware and software upgrades, backup and disaster recovery.

In a wide-area environment that involves several organizations and individual users, using such centralized systems poses several issues: Who manages the system? What should be the targeted system capacity? Is the system accessible to all users?

Architectures based on self-managing peer-to-peer storage (CFS [37], PAST [36]) have been proposed as an alternative to centralized approaches for various applications. As described in Chapter 2, structured P2P systems provide a primitive called the distributed hash table (DHT) for storing and locating objects. Each object is associated with a key that belongs to the same address space as node identifiers. The ownership of keys is partitioned among participating nodes, such that each key is stored on a set of nodes that are closest to the key in the identifier space. This partitioning of key ownership together with efficient routing between nodes bounds lookup overhead for an object stored in the DHT. The following properties make DHTs useful wide-area storage architectures:

1. [...]

2. [...]


4.1.1 Systems Based on Distributed Hash Table

[...] [37] based on Chord [40], and PAST [36] developed by Microsoft based on Pastry [39]. In [60] Cox et al. have proposed to build Distributed DNS using DHash, a distributed hash table (DHT) based on Chord [40]. SCRIBE [86] is a large scale application-level multicast and event notification infrastructure based on the Pastry P2P system. ePost [87] describes a decentralized email service, also based on Pastry. OpenDHT [88] is a public DHT service based on the Bamboo P2P system [89] operating on PlanetLab, and can be used by third-party applications.

By restricting the key ownerships to a small set of nodes and not requiring caching to achieve efficient lookup, DHTs can also be used to store mutable data. In [90], the authors propose an algorithm to provide atomicity of mutable data stored in a DHT. Comet [91] uses a DHT to provide a scalable and decentralized coordination space as in Linda [92].

In [93], the authors propose to use a universal overlay to provide a scalable infrastructure to bootstrap multiple service overlays providing different functionality. It provides mechanisms to advertise services and to discover services, contact nodes, and service code. In this work, I demonstrate how a universal overlay can be used to facilitate bootstrapping of multiple WOWs, each supporting a different community and having its own virtual private IP address space.


1. [...] Proximity Neighbor Selection (PNS) [27] is a technique in which whenever a P2P node has a choice on its routing table entries, it picks the node that has the least latency to it. It has been shown that choosing this local minimum at each hop can bound the total latency of a lookup to within a certain fraction of the actual latency to the node storing the key. PNS is already employed in existing P2P systems (Chord [40], Pastry [39]) and can be incorporated into Brunet to achieve the bounds on the total transit latency incurred by a virtual IP packet.

It is possible that the messages containing DHT operations get lost and need to be retried. The Bamboo P2P overlay [89] uses the Internet round-trip latency information to calculate the right timeouts on DHT operations. The P2P routing at each node also tries to avoid a high latency hop. The Bamboo DHT also replicates a key at several nodes, so that a single slow node storing the key does not delay the lookup process.

In SBARC [94] and Brocade [95], some P2P nodes upgrade to supernodes based on their high resource capacities and high stability. All intermediate P2P hops involve only the supernodes, thus achieving a low lookup latency. Skype uses a similar approach, except that it is based on an unstructured P2P network.

2. [...] [26]. A node can potentially miss a neighbor with which it cannot communicate, which can lead to an incorrect routing decision and subsequently to a DHT operation being routed to the wrong set of nodes.

3. [...]


4. [...] [96-98].

5. [...] [99], the authors present techniques to provide content/path locality and support for NATs and firewalls, where instances of conventional overlays are configured to form a hierarchy of identifier spaces that reflects administrative boundaries and respects connectivity constraints among networks.

Figure 4-1A shows how a new node arrival is handled. Initially, node 123 stores keys in the range [110, 123] ∪ [123, 128], while node 110 stores keys in the ranges [110, 123] ∪ [110, 123]. In response to the arrival of the node 116, node 123 migrates the keys in range [110, 116] to the new node. Similarly, node 110 migrates the keys in range [116, 123] to the new node.

Figure 4-1B shows how a node failure is handled. Initially, node 123 stores keys in the ranges [116, 123] and [123, 128], while node 110 stores keys in the ranges [110, 116] and [110, 123]. In response to the failure of node 116, node 123 copies the keys in range [116, 123] to node 110, while node 110 copies the keys in range [110, 116] to node 123.
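The hand-over of keys on node arrival and failure can be illustrated with a small simulation. The sketch below assumes, consistent with the ranges given for Figure 4-1 but not taken from the Brunet code, that every key is held by the two nodes that bracket it on the ring. The helper names (`holders`, `placement`), the extra nodes 100 and 128 at either end, and the sample keys are hypothetical.

```python
from bisect import bisect_left


def holders(nodes, key):
    """Return the two ring nodes that bracket `key` (left and right neighbour)."""
    ring = sorted(nodes)
    i = bisect_left(ring, key)
    left = ring[i - 1]              # index -1 wraps around to the last node
    right = ring[i % len(ring)]     # wraps around to the first node
    return {left, right}


def placement(nodes, keys):
    """Map every node to the set of keys it holds."""
    table = {n: set() for n in nodes}
    for k in keys:
        for n in holders(nodes, k):
            table[n].add(k)
    return table


keys = [112, 114, 118, 121, 125]                     # sample keys (illustrative)
without_116 = placement([100, 110, 123, 128], keys)
with_116 = placement([100, 110, 116, 123, 128], keys)

# Arrival of node 116: keys that each existing node hands over to it.
handed_over = {n: without_116[n] - with_116[n] for n in without_116}
print("node 123 hands over:", sorted(handed_over[123]))   # keys in [110, 116]
print("node 110 hands over:", sorted(handed_over[110]))   # keys in [116, 123]
```

Read in the opposite direction, the same set difference gives the copies needed when node 116 fails: node 110 must receive the keys in [116, 123] from node 123, and node 123 must receive the keys in [110, 116] from node 110.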


[Figure 4-1, panel B (node departure): Handling topology changes]

1. [...]

2. [...]

3. [...]

Since the objects are stored in the DHT as soft-state for the lifetime specified in time-to-live (after which they are automatically garbage collected), there is no primitive to delete an object associated with some key.


[...] [26], often prevent communication between immediate neighbors in the identifier space, thus affecting overlay structure maintenance and routing. This inability to communicate with a neighbor node is perceived as the neighbor being down, thus resulting in an inconsistent view of the local neighborhood and subsequently incorrect routing decisions at that node. In such cases, DHT operations on the same key k originating at different sources may not always be routed to the same node (called the root for that key). This problem is referred to as inconsistent roots in the DHT literature and hinders the ability of the DHT Create primitive to detect duplication of keys, as shown in Figure 5-3.

Nodes 115 and 110 cannot form a connection. A message is sent to key 112, and the closest node is 110. Left: a message (Create) addressed to key 112 arrives at node 115; it believes that it is the closest to the destination, so the message is delivered locally and the key is successfully created (also replicated at node 100). Right: another Create message addressed to the same key arriving at node 83 is correctly routed to node 110; here the key is not found and is created again (this Create operation also returns success instead of returning an error).

The inability of the DHT Create primitive to detect duplication of keys subsequently affects the correct operation of applications requiring such uniqueness guarantees, such as the decentralized Dynamic Host Configuration Protocol (DHCP) described in Section 4.4.2.
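The scenario with nodes 83, 100, 110 and 115 can be reproduced with a few lines of Python. This is a minimal sketch, not the Brunet routing code: the connection tables are assumed for illustration, distance is taken as the plain absolute difference between identifiers (ignoring ring wrap-around), and `greedy_route` is a hypothetical helper name.

```python
def greedy_route(links, start, target):
    """Return the nodes visited when routing towards `target` from `start`.

    Each node forwards to whichever of its connected peers (or itself) is
    closest to the target; routing stops when the current node is closest.
    """
    path = [start]
    current = start
    while True:
        candidates = links[current] | {current}
        # Ties are broken by node id so that the choice is deterministic.
        best = min(candidates, key=lambda n: (abs(n - target), n))
        if best == current:
            return path
        path.append(best)
        current = best


# Assumed connection tables: nodes 110 and 115 cannot form a connection.
links = {
    83: {100},
    100: {83, 110, 115},
    110: {100},
    115: {100},
}

root_seen_from_115 = greedy_route(links, 115, 112)[-1]
root_seen_from_83 = greedy_route(links, 83, 112)[-1]
print(root_seen_from_115, root_seen_from_83)   # two different roots for key 112
```

With the missing link in place, the Create arriving at node 115 stays there while the one arriving at node 83 ends at node 110, so the same key has two roots. Adding a connection between 110 and 115 to the tables makes both routes end at node 110.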


[Figure: Inconsistent roots in DHT]

To reduce the likelihood of inconsistent roots, each application-specified key k is internally re-mapped to n keys (k1, k2, ..., kn), which are then stored (together with the associated value) at n different locations on the P2P ring. Applications can choose this degree of re-mapping for each key, and expect DHT operations to separately provide return values for each re-mapped key, thus allowing applications to implement schemes like majority vote on results obtained for each such re-mapped key. For a fault to occur now, the roots of as many as half (more than one) of the re-mapped keys have to be inconsistent. Majority voting also has the advantage that by not requiring results for all re-mapped keys, a few slow nodes cannot slow down the entire DHT operation.

Furthermore, a new P2P node joining the overlay is not allowed to perform any DHT operation until it gets connected correctly, i.e. it forms connections with its nearest left and right neighbors on the ring. This is because an incorrectly connected node has an inconsistent view of the ring and may observe roots for the DHT keys that are inconsistent with those observed by existing nodes. The time taken for a new node to get correctly connected to an existing network of over 100 nodes has been observed to be about 5 seconds on average.
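The re-mapping and majority-vote scheme can be sketched as follows. Everything here is an illustrative assumption rather than the IPOP implementation: the hash-based `remap` function, the `ToyDht` class and the way it models inconsistent roots are all hypothetical. The values n = 8 and a threshold of 5 are the ones quoted later in this chapter for the DHCP implementation.

```python
import hashlib


def remap(key, n=8):
    """Derive n internal keys from one application key (hypothetical scheme)."""
    return [hashlib.sha1(f"{key}#{i}".encode()).hexdigest() for i in range(n)]


class ToyDht:
    """In-memory stand-in for the DHT.

    Internal keys listed in `split` model inconsistent roots: every caller
    reaches a different root for them, so duplicates go unnoticed there.
    """

    def __init__(self, split=()):
        self.store = {}
        self.split = set(split)

    def create(self, caller, key, value):
        root = caller if key in self.split else "consistent-root"
        if (root, key) in self.store:
            return False            # key already exists at this root
        self.store[(root, key)] = value
        return True


def create_with_vote(dht, caller, key, value, n=8, needed=5):
    """Create all re-mapped keys; succeed only if at least `needed` succeed."""
    results = [dht.create(caller, k, value) for k in remap(key, n)]
    return sum(results) >= needed


app_key = "N1/10.0.0.5"                        # illustrative application key
dht = ToyDht(split=remap(app_key)[:3])         # 3 of the 8 roots are inconsistent
first = create_with_vote(dht, "node-A", app_key, "X8")
second = create_with_vote(dht, "node-B", app_key, "X1")
print(first, second)
```

In this run the second Create succeeds only on the three re-mapped keys with inconsistent roots, which is below the threshold, so the duplicate is still detected. The sketch does not undo the partial entries left by the failed attempt.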


Figure 4-3 shows how different WOWs (or IPOP namespaces) can exist on top of a common P2P overlay. Each virtual IP node belongs to some IPOP namespace and is associated with a P2P node. In this example, the IP → P2P mappings for nodes A1, B1, A2, B2 (A1 → X8, B1 → X1, A2 → X2 and B2 → X4) are stored at nodes X3, X5, X6 and X7 respectively. The DHT key for each such mapping is a combination of a globally unique identifier for the namespace and the virtual IP address within that namespace. The inclusion of the namespace identifier allows virtual IP nodes in different namespaces to have the same IP addresses. To send a virtual IP packet to node B1, the node A1 queries the DHT with (N1, B1) as key. The value associated with this key is the P2P address (X1) of the P2P node associated with B1, and is quickly retrieved from node X5. From this point onwards, communication proceeds as described in Chapter 2.

Creating a new IPOP namespace only requires executing a simple program with information about the IPOP namespace (assignable virtual IP addresses and other network parameters). The namespace identifier is then provided as a parameter inside the IPOP configuration of the appliance VMs for distribution. Experiments show that a new node joining a WOW takes about 20-30 seconds on average to acquire a virtual IP address.
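The role of the namespace identifier in the DHT key can be shown with a plain dictionary standing in for the DHT. This is only a sketch of the data layout: the `register` and `resolve` helpers and the IP addresses are made up, and placing A2 and B2 in namespace N2 is an assumption.

```python
# (namespace identifier, virtual IP) -> P2P address of the associated node.
mappings = {}


def register(namespace, virtual_ip, p2p_address):
    """Record the IP -> P2P mapping for one virtual IP node."""
    mappings[(namespace, virtual_ip)] = p2p_address


def resolve(namespace, virtual_ip):
    """Return the P2P address for a virtual IP, or None if it is unknown."""
    return mappings.get((namespace, virtual_ip))


# Mappings from the example; the addresses themselves are illustrative.
register("N1", "10.0.0.1", "X8")   # A1
register("N1", "10.0.0.2", "X1")   # B1
register("N2", "10.0.0.1", "X2")   # A2
register("N2", "10.0.0.2", "X4")   # B2

# A1 looks up B1 before sending it a virtual IP packet.
print(resolve("N1", "10.0.0.2"))
# The same IP address in the other namespace maps to a different P2P node.
print(resolve("N2", "10.0.0.2"))
```

Because the namespace is part of the key, the two WOWs can reuse the same private address range without their entries colliding.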


[Figure 4-3: Example of two different WOWs with namespaces N1 and N2 sharing a common Brunet overlay]

Similar to DHT-based systems described earlier, this decentralized IP address management (1) eliminates any dedicated components, (2) scales to large numbers by harnessing the resources at participating nodes and (3) provides resilience to node failures, which are common requirements in large-scale desktop grid environments.

The described functionality is comparable with tasks an administrator would typically perform to set up a private network. Following the setting up of switches and cables, a private IP address range is set aside. Hosts connecting to the private network are assigned unique IP addresses from this IP address range. To enable dynamic network configuration of hosts connecting to the network, one or more DHCP servers are configured with the list of assignable IP addresses and other network parameters. New hosts discover [...]


In contrast, to set up a new WOW, a user is only required to create an IPOP namespace with a unique identifier and a private address space. The namespace identifier is specified inside the IPOP configuration of the WOW VM appliance image. Each deployed instance of the appliance on boot retrieves the namespace information (virtual IP address range) and configures itself with a unique virtual IP address. These steps are described below.

[...] [85]) of nodes in the universal overlay and starts up a P2P node that connects to that overlay. The node tries to insert the namespace information into the DHT (using Create) with a randomly chosen identifier as the key. If the key already exists, the Create returns an error and the program retries with a different identifier until it succeeds. Since the DHT does not store objects indefinitely, this object holding the namespace information has to be periodically recreated (using Recreate). This namespace identifier is specified inside the IPOP configuration of the WOW appliance image.

[...] [71], IPOP supports dynamic virtual IP configuration using unmodified DHCP clients. This is achieved by capturing DHCP request packets from the tap and making SOAP requests to a publicly accessible server that stores the list of assignable IP addresses and active leases, and eventually injecting DHCP response packets to the tap. The SOAP server can be a single point of failure. The decentralized DHCP uses the DHT as [...]


On intercepting a DHCP packet, IPOP retrieves information about its namespace (assignable IP address range, netmask, lease times) from the DHT (using a Get) on the namespace identifier as the key. It then chooses a random IP address from that range, belonging to the namespace, and attempts to create a DHT entry (using a Create) with: the combination of the namespace identifier and the guessed IP address as the key, a randomly chosen password, and its P2P address as the value. The entry is successfully created only if there is no other entry with the same key. This prevents IP address conflicts between WOW nodes belonging to the same namespace. In case Create returns an error, IPOP tries another (randomly chosen) IP address until it eventually succeeds. The DHCP response packet with information about the lease is written to tap. The password is recorded for subsequent operations on the key.

The entry is only created with a time-to-live (TTL) equal to the lease time for that namespace, and thus needs to be recreated (using a ReCreate) periodically. This process is again triggered by the DHCP client, which attempts to renew a virtual IP lease after half the lease time has elapsed. In this case, IPOP attempts to ReCreate the same DHT key corresponding to the virtual IP address bound to tap.

DHT inconsistencies, as described in Section 4.2.3, can create a situation where multiple IPOP nodes acquire the same virtual IP address. The DHT implementation in IPOP therefore achieves fault-tolerance to inconsistencies by re-mapping each application-specified key k to 8 different keys (k1, k2, k3, k4, k5, k6, k7, k8) and performing the corresponding operation on each of these keys. To consider a Create or Recreate to have successfully happened, it is expected that at least 5 of these operations be successful (return a true). Otherwise, the operation is considered to have failed and a different IP address is tried. However, the initial implementation of this technique that is evaluated in Section 4.5 still waits for all 8 results before performing a majority vote; in cases where [...]


Figure 4-4 shows a time-line of events, from the startup of the IPOP and DHCP client (dhclient) process, to having an IP address bound to tap. It should be noted that the IP address lease acquisition cannot start until the associated P2P node is correctly connected (i.e., with left/right neighbors). Once correctly connected, IPOP tries different virtual IP addresses (one-by-one) until it is assured that no other node within the same IPOP namespace has acquired the same IP address. It is possible to have a large IP address range for the private network, which reduces the chances of two IPOP nodes guessing the same IP address.

[...] 3, most TCP/IP based applications communicating over the IPOP virtual network are resilient to such transient packet losses. This process of resolving a virtual IP address to a P2P address is called "Brunet-ARP".
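The address-acquisition loop of the decentralized DHCP (Get the namespace information, guess an address, Create the lease, retry on conflict) can be sketched as below. This is a simplified model under stated assumptions, not the IPOP code: `SoftStateDht` is a hypothetical in-memory stand-in with an explicit clock, the password is simply stored next to the P2P address, the address range and lease time are made up, and key re-mapping and lease renewal are left out.

```python
import ipaddress
import random


class SoftStateDht:
    """In-memory stand-in for the DHT; entries expire after their TTL."""

    def __init__(self):
        self.entries = {}                      # key -> (value, expiry time)

    def get(self, key, now):
        value, expires = self.entries.get(key, (None, 0))
        return value if now < expires else None

    def create(self, key, value, ttl, now):
        """Store `value` only if no live entry exists for `key`."""
        if self.get(key, now) is not None:
            return False
        self.entries[key] = (value, now + ttl)
        return True


def acquire_lease(dht, namespace, p2p_address, now, rng, max_tries=100):
    """Guess addresses until a Create succeeds; return (ip, password)."""
    info = dht.get(namespace, now)             # namespace information (Get)
    hosts = list(ipaddress.ip_network(info["range"]).hosts())
    password = f"{rng.getrandbits(64):016x}"
    for _ in range(max_tries):
        ip = str(rng.choice(hosts))
        lease = (password, p2p_address)
        if dht.create((namespace, ip), lease, info["lease"], now):
            return ip, password
    raise RuntimeError("no free virtual IP address found")


dht = SoftStateDht()
dht.create("N1", {"range": "10.128.0.0/29", "lease": 3600}, ttl=86400, now=0)
rng = random.Random(1)
ip_a, _ = acquire_lease(dht, "N1", "X8", now=0, rng=rng)
ip_b, _ = acquire_lease(dht, "N1", "X1", now=0, rng=rng)
print(ip_a, ip_b)
```

The two nodes always end up with different addresses because a Create on an address that already has a live lease fails. Once the TTL has elapsed without renewal, the same address can be created again by another node.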


[Figure 4-4: Events leading to virtual IP address configuration of tap interface]

An experiment between two desktop machines A and B is then performed as follows. On desktop A, start IPOP and the DHCP client so that it acquires a virtual IP address, which remains fixed during the experiment. On desktop B, proceed with an iterative process of: (1) start the IPOP node and DHCP client, (2) wait until an IP address is bound to tap, (3) start pinging the virtual IP address of desktop A for 200 seconds, (4) kill IPOP and the DHCP client on B. This process was repeated 250 times.

In each trial of the experiment, the IPOP node on desktop B had a different (randomly chosen) P2P address. The virtual IP leases from different trials persisted in the DHT (since the leases are not relinquished), and this prevented desktop B from acquiring the same IP address over different trials. This mapping between virtual IP address and P2P address is arbitrary and is discovered automatically in every trial (as described in Section 4.4.3) before ping packets start flowing between the desktops over the virtual network.

Figure 4-5 shows the cumulative distribution of delay seen by the DHCP client (dhclient) to acquire an IP address on the tap. We observe that in 90% of the cases, the DHCP process finishes within 30 seconds of IPOP and DHCP client startup. As shown in Figure 4-4, this delay depends on (1) the time taken for the IPOP nodes to get correctly connected and (2) the number of different IP addresses that are tried. The cumulative distributions of these components are shown in Figure 4-6 and Figure 4-7.


[Figure 4-5: Cumulative distribution of the time taken by a new IPOP node to acquire a virtual IP address (T2 in Figure 4-4). Mean and variance are 27.41 seconds and 20.19 seconds, respectively.]

[Figure 4-6: Cumulative distribution of the time taken by a new P2P node to get connected to its left and right neighbors in the ring (T1 in Figure 4-4). Mean and variance are 4.98 seconds and 13.99 seconds, respectively.]


[Figure 4-7: Cumulative distribution of the number of different virtual IP addresses tried during DHCP. Mean and variance are 1.096 and 0.784, respectively.]

[...] management within a WOW that leverages the DHT functionality of the P2P network. Experiments have shown that a new WOW node can acquire a unique virtual IP address within 20-30 seconds on average. The implementation of this protocol has now been made resilient to P2P message losses. Instead of waiting for each internally re-mapped key operation (Create or Recreate) to return a result, the operation is considered successful and the protocol proceeds as soon as sufficient results are available to assure that the operation has succeeded on a majority of nodes. These enhancements have reduced the average time to acquire a virtual IP address to less than 10 seconds.

Additional decentralized configuration can be integrated into WOWs. In particular, the WOW-based Condor pools that are currently configured from a central server can be extended to leverage DHT-based techniques to achieve manager discovery for worker nodes.
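The early-termination rule for re-mapped key operations described in this summary can be sketched as a function that consumes per-key results in arrival order and stops as soon as the vote is decided. The function name `quorum_outcome` is hypothetical, the defaults of 8 keys and 5 required successes follow the values given earlier in the chapter, and treating a stream that ends too early (for instance because messages were lost) as a failure is an assumption.

```python
def quorum_outcome(results, n=8, needed=5):
    """Decide a re-mapped DHT operation without waiting for all n results.

    `results` yields one boolean per re-mapped key, in arrival order.
    Returns (succeeded, number_of_results_consumed).
    """
    ok = failed = 0
    for result in results:
        if result:
            ok += 1
        else:
            failed += 1
        if ok >= needed:
            return True, ok + failed       # enough successes: proceed now
        if failed > n - needed:
            return False, ok + failed      # success is no longer reachable
    return False, ok + failed              # too few results ever arrived


# Five quick successes decide the vote; the three slow results are never read.
print(quorum_outcome(iter([True] * 5 + [False] * 3)))
```

Compared with waiting for all eight results, up to three slow or lost replies no longer delay a successful operation.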


Like several related efforts (such as Chord [40], Kademlia [100] and Pastry [39]), IPOP relies on structured P2P overlays to provide the core service of message routing and additional capabilities such as object storage and retrieval. Structured P2P routing assumes each node has a consistent view of its local neighborhood in the P2P identifier space, which is reflected in its ability to communicate with its neighboring nodes. Related work on structured P2P systems has implicitly assumed an environment where P2P nodes are able to establish direct connections to one another, and has mainly focused on efficient overlay topologies [27], correct routing of object lookups under churn [89, 101], and proximity-aware routing [102].

However, in practice, wide-area environments are becoming increasingly constrained in terms of peer connectivity, primarily due to the proliferation of NAT and firewall routers. Studies have shown that about 30%-40% [103] of the nodes in a P2P system are behind NATs. Even though the majority of the NAT devices (cone NATs) can be "traversed" through UDP hole punching, up to 20% of the NATs [46] (symmetric NATs) cannot be traversed using the existing techniques. In addition, studies have also shown the existence of permanent or transient route outages between pairs of nodes on the Internet; for example, [26] reports 5.2% pair-wise outages among nodes on PlanetLab [29]. Together, these connectivity constraints pose a challenge to overlay structure maintenance: when two adjacent nodes cannot communicate directly, each falsely perceives its neighbor as not being available. In general, these missing links on a P2P structure lead to inconsistent routing decisions, and subsequently affect routability and services built upon the assumption of consistent routing.

The existing implementations of structured P2P systems have recognized the problem of overlay structure maintenance when only a small fraction of pairs (about 4% [104]) cannot communicate with each other [26, 105]. However, practical experiences with WOW


deployments reveal that this fraction can be significantly larger due to nodes behind (multiple) NATs, and NATs that are symmetric or do not support "hairpin" translation, which preclude hole-punching.

[Figure 5-1: Connectivity-constrained wide-area deployment scenario targeted by deployments of the IPOP P2P system]

[...] OpenDHT [88] relies on non-firewalled PlanetLab P2P nodes to deploy its DHT; however, nodes behind NATs and firewalls can only act as OpenDHT clients and do not store keys. In order to aggregate the increasing number of hosts behind NATs/firewalls as WOW nodes, the IPOP virtual network must be able to deal with a complex wide-area environment such as the one depicted in Figure 5-1, where typical end users of a P2P system are constrained by NATs over which they do not have the control (or expertise) necessary to set up and maintain the firewall exceptions and mappings necessary for NAT traversal.

It is common for broad-band hosts to be behind two levels of NAT (a home gateway/router and the ISP edge NAT, nodes A and B in Figure 5-1). IPOP supports establishment of UDP communication using hole punching techniques for "cone" type NATs (e.g. between nodes A and C in Figure 5-1), and there is empirical evidence pointing to the fact that these are the common case [46]. However, nodes behind NATs for [...]


Recognizing the importance of supporting traversal, some recent NATs have started supporting Universal Plug and Play (UPnP) [106], which allows them to be configured to open ports so that other hosts (outside the NAT) can initiate communication with hosts behind the NAT (e.g. hosts I and H). However, UPnP is not ubiquitous, and even when it is available, multi-level NATs create the problem that hosts can only configure their local NATs through UPnP, while having no access to control the behavior of the edge NAT, which renders the UPnP approach ineffective outside the domain. For example, although hosts A and E in Figure 5-1 are connected to UPnP NATs, they are also subject to rules from an ISP NAT and a University NAT respectively, which they do not control.

Some NATs support "hairpinning", where two nodes in the same private network and behind the same NAT can communicate using each other's translated IP and port. Such a behavior is useful in a multi-level NAT scenario, where two hosts behind the same public NAT but different semi-public NATs are able to communicate only using their IP address and port assigned by the public NAT. However, not all NATs support hairpinning, creating a situation in which two nodes in the same multi-level NATed domain may not be able to use hole-punching to communicate directly (e.g. nodes E and F) as depicted in Figure 5-2. Nodes E and F are behind two different semi-public NATs respectively, which in turn are behind a public NAT-P. While forming their initial connections with bootstrap nodes on PlanetLab, these nodes only learn their IP endpoints assigned by the public NAT-P. When E and F try to form a connection, they send link messages to each other's IP endpoints assigned by NAT-P. In case NAT-P does not support "hairpinning", E and F cannot form a connection. Only 24% of the NATs tested in [46] support hairpinning.

Some hosts are behind firewall routers (e.g. host G) that might block all UDP traffic altogether. Only a few P2P nodes are public and are expected to be able to

PAGE 82

Figure: Multiple levels of NATs

communicate with each other. Connectivity even among these hosts is constrained: Internet-1 and Internet-2 hosts cannot communicate with each other (e.g. hosts J and K), while multi-homed hosts can communicate with both. In addition, link failures, BGP routing updates, and ISP peering disputes can easily create situations where two public nodes cannot communicate directly with each other. In [26], the authors observed that about 5.2% of unordered pairs of hosts (P1, P2) on PlanetLab exhibited a behavior such that P1 and P2 cannot reach each other but another host P3 can reach both P1 and P2.

It is observed that a typical wide-area environment presents several deterrents to connectivity between a pair of nodes, and when two such nodes have adjacent identifiers on the P2P ring, structure maintenance is affected. To the best of my knowledge, while structured P2P systems have been demonstrated in public infrastructures such as PlanetLab, where there are only a few pair-wise outages and a small amount of disorder can be tolerated [105], no structured P2P systems described in the literature have been demonstrated where the majority of P2P nodes are subject to NAT constraints of various kinds as illustrated in Figure 5-1.

PAGE 83

Figure: Inconsistent roots in DHT

[45].

Similar to other structured systems, routing in Brunet uses the greedy algorithm, where at each hop a message gets monotonically closer to the destination until it is either delivered to the destination or to the node that is closest to the destination in the P2P identifier space. Greedy routing assumes each node has a consistent view of its local neighborhood, which is reflected in its ability to form structured near connections with its left and right neighbors in the P2P identifier space. The inability to form connections with immediate neighbors in the identifier space creates an inconsistent view of the local neighborhood, thus resulting in incorrect routing decisions as shown in Figure 5-3 (a).
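The effect of a missing near connection on greedy routing can be illustrated with a small sketch. This is not Brunet code: the ring size, the node identifiers, and the helper names `ring_distance` and `greedy_route` are assumptions made only for this example. Each node forwards to the connection closest to the key and delivers locally when no connection is strictly closer.

```python
RING_SIZE = 64  # hypothetical size of the identifier space


def ring_distance(a, b, ring=RING_SIZE):
    """Shortest distance between two identifiers on the ring."""
    d = abs(a - b) % ring
    return min(d, ring - d)


def greedy_route(conns, src, key):
    """Forward toward `key` while some connection is strictly closer.

    Returns the node at which the message is finally delivered.
    """
    cur = src
    while True:
        best = min(conns[cur], key=lambda n: (ring_distance(n, key), n))
        if ring_distance(best, key) >= ring_distance(cur, key):
            return cur  # local minimum: deliver here
        cur = best


# Four nodes, each connected to both ring neighbors.
full = {10: {20, 40}, 20: {10, 30}, 30: {20, 40}, 40: {30, 10}}
# Same ring, but nodes 20 and 30 cannot form their near connection.
broken = {10: {20, 40}, 20: {10}, 30: {40}, 40: {30, 10}}

print(greedy_route(full, 40, 24))    # delivered at the node closest to the key
print(greedy_route(broken, 40, 24))  # stops at node 30
print(greedy_route(broken, 10, 24))  # stops at node 20
```

In the broken ring, the same key is delivered to different nodes depending on the source, which is the kind of incorrect routing decision described above.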

PAGE 84

[85]. The notion of a connection, which describes an overlay link between two P2P nodes, is key to establishing such links. Connections operate over physical channels called edges, which in IPOP can be based on different transports such as UDP or TCP. Besides assisting in overlay structure maintenance, the connection protocols allow the creation of 1-hop shortcuts between WOW nodes to self-optimize the performance of the virtual network with respect to latency and bandwidth.

The connection setup between P2P nodes is preceded by a connection protocol for conveying the intent to connect and exchanging the list of Uniform Resource Identifiers (URIs) for communication. These connection messages are routed over the P2P network. Incorrect routing leads to situations where connection messages are either misdelivered or not delivered at all, thus affecting both overlay structure maintenance and connectivity within the virtual network.

[107], which is used for dynamic virtual IP configuration of WOW nodes, summarized as follows. IPOP supports creation of multiple mutually-isolated virtual networks (called IPOP namespaces) over a common P2P overlay. The virtual IP configuration of WOW nodes in each such private network is achieved using a decentralized implementation of the Dynamic Host Configuration Protocol (DHCP). The DHCP implementation uses a DHT primitive (called Create) to create key/value pairs mapping virtual network namespaces and virtual IP addresses uniquely to P2P identifiers. The Create primitive relies on the consistency of key-based routing to guarantee uniqueness of IP-to-P2P address mappings. That is, messages addressed to some key k must be delivered to the same set of nodes regardless of their originator.

PAGE 85

5-3 (a) and (b), can cause Create messages addressed to the same key from different sources to be routed to different nodes. This problem is also identified in [26], where it is referred to as inconsistent roots, and can lead to a situation where two WOW nodes claim the same virtual IP address.
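A minimal sketch of why inconsistent roots can lead to duplicate virtual IP addresses is given below. It is only an illustration under stated assumptions: `ToyDhtNode`, `dht_key`, `claim_ip`, the key format, and the address pool are hypothetical and do not reproduce the IPOP DHCP implementation. The `root_for` callback models the node at which routing delivers a Create message.

```python
import hashlib


class ToyDhtNode:
    """Local key/value store of one node (stand-in for a DHT root)."""

    def __init__(self):
        self.store = {}

    def create(self, key, value):
        """Create-style write: succeeds only if the key is not yet taken."""
        if key in self.store:
            return self.store[key] == value
        self.store[key] = value
        return True


def dht_key(namespace, virtual_ip):
    """Hypothetical key format: hash of namespace and virtual IP."""
    return hashlib.sha1(f"{namespace}/{virtual_ip}".encode()).hexdigest()


def claim_ip(root_for, namespace, candidates, p2p_id):
    """Try candidate addresses until one Create succeeds."""
    for ip in candidates:
        key = dht_key(namespace, ip)
        if root_for(key, p2p_id).create(key, p2p_id):
            return ip
    return None


pool = ["10.128.0.2", "10.128.0.3"]

# Consistent routing: every source reaches the same root for a key.
root = ToyDhtNode()
a = claim_ip(lambda key, src: root, "ns1", pool, "node-A")
b = claim_ip(lambda key, src: root, "ns1", pool, "node-B")

# Inconsistent roots: the delivery point depends on the source.
roots = {"node-A": ToyDhtNode(), "node-B": ToyDhtNode()}
c = claim_ip(lambda key, src: roots[src], "ns1", pool, "node-A")
d = claim_ip(lambda key, src: roots[src], "ns1", pool, "node-B")

print(a, b)  # two distinct addresses
print(c, d)  # the same address claimed twice
```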

PAGE 86

[71], the inability of a worker node to obtain an IP address implies it does not join the pool. Even if a node "N" obtains an IP address, if it cannot communicate with the central manager node "M", it is not available for computation. Furthermore, the inability of node "N" to route to a worker node "W" prevents jobs submitted by "N" from executing on "W". All these situations result in the system not being able to achieve the maximum available throughput because not all nodes can participate in computations.

Chapter 6 describes generally applicable techniques that facilitate consistent structured routing despite the connectivity constraints presented by a wide-area environment. Then in Chapter 7, I evaluate the applicability of Proximity Neighbor Selection (PNS) in conjunction with network coordinates to reduce the latency of key lookups in IPOP overlays on PlanetLab. Techniques are also described for selecting suitable proxy nodes to route communication between virtual IP nodes when topology adaptation based on shortcuts is not possible.

PAGE 87

In this chapter, I describe and evaluate two novel, synergistic techniques for fault-tolerant routing and structured overlay maintenance in the presence of network outages: annealing routing, an algorithm based on simulated annealing from optimization theory, and tunnel edges, a technique to establish connections between P2P nodes by tunneling over common neighbors. These are fully decentralized and self-configuring techniques that have been successfully implemented in the Brunet P2P system and demonstrated in actual wide-area PlanetLab deployments as well as in NATed environments with emulated pair-wise outages. The effectiveness of these approaches is analyzed for various system configurations with the aid of analytical models, simulation, and data collected from realistic system deployments.

PAGE 88

6-1 and works as follows.

In lines 10-17, the node looks up its connection table to determine if it is adjacent to the destination in the identifier space. In that case, the node delivers the message locally and also sends it to the node on the other side (left or right) of the destination in the identifier space.

If the message has not taken any hops yet (i.e. it originated at the current node), it is sent to the closest node u. Otherwise (lines 23-29), until the message has taken MAX_UPHILL hops, it is delivered to the closest node u or the next closest node u_sec (if it was already received from the closest node u). Until this point, the algorithm does not check for the forward progress of the message towards the destination in identifier space.

Beyond MAX_UPHILL hops (lines 30-41), the message is sent to u or u_sec only if the next hop is closer to the destination than the previous hop. It should be noted that this condition only requires progress with respect to the previous node; it still allows a message to take one hop that is farther away from the destination than the current node.

The annealing algorithm is very useful for routing messages addressed to exact destinations, which include connection setup messages between P2P nodes, virtual IP packets between IPOP nodes, and the results of DHT operations sent back to the source node. In a perfectly-formed structured ring, the algorithm works exactly as the greedy algorithm and incurs the same number of hops.

When messages are addressed to DHT keys, this algorithm has a better chance to reach the node closest to the key, by delivering the message at each local minimum. As a

PAGE 89

Figure: Annealing routing algorithm
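The next-hop rules of annealing routing can be sketched as follows. This is a simplified reading of the textual description for messages addressed to exact destinations, not the listing from the figure: the step that delivers a message at nodes adjacent to the destination is omitted, the value chosen for MAX_UPHILL is an assumption, and the helper names and the toy topology are hypothetical. A greedy chooser is included for comparison.

```python
RING_SIZE = 64   # hypothetical size of the identifier space
MAX_UPHILL = 2   # assumed value, chosen only for this example


def ring_distance(a, b, ring=RING_SIZE):
    d = abs(a - b) % ring
    return min(d, ring - d)


def greedy_next(conns, cur, prev, dest, hops):
    """Forward only if some connection is strictly closer than this node."""
    best = min(conns[cur], key=lambda n: (ring_distance(n, dest), n))
    closer = ring_distance(best, dest) < ring_distance(cur, dest)
    return best if closer else None


def annealing_next(conns, cur, prev, dest, hops):
    """Pick u (closest) or u_sec (next closest) following the annealing rules."""
    ranked = sorted(conns[cur], key=lambda n: (ring_distance(n, dest), n))
    u = ranked[0]
    u_sec = ranked[1] if len(ranked) > 1 else None
    if hops == 0:
        return u                      # message originated here
    cand = u_sec if u == prev else u  # do not bounce straight back
    if cand is None:
        return None
    if hops < MAX_UPHILL:
        return cand                   # no progress check yet
    # Beyond the uphill budget: require progress relative to the previous hop.
    if ring_distance(cand, dest) < ring_distance(prev, dest):
        return cand
    return None


def route(conns, src, dest, chooser, max_hops=16):
    """Return the node where the message ends up."""
    cur, prev, hops = src, None, 0
    while cur != dest and hops < max_hops:
        nxt = chooser(conns, cur, prev, dest, hops)
        if nxt is None:
            break
        prev, cur, hops = cur, nxt, hops + 1
    return cur


# Nodes 20 and 32 are ring neighbors but cannot connect to each other.
conns = {10: {20, 50}, 20: {10}, 32: {40, 50}, 40: {32, 50}, 50: {10, 32, 40}}

print(route(conns, 40, 20, greedy_next))     # stuck at node 32
print(route(conns, 40, 20, annealing_next))  # reaches node 20
```

In this topology the greedy chooser stops at the local minimum (node 32), while the annealing chooser takes one uphill hop to node 50 and then makes progress again.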

PAGE 90

The idea behind tunnel edges is as follows. Assume that each node in the network attempts to acquire connections to its closest 2m near neighbors on the P2P ring, m such neighbors on each side. Consider a situation where there is an outage between two adjacent nodes A and B on the P2P ring. Since both A and B also attempt to form near connections with 2m nodes each, their neighborhoods intersect at 2(m-1) nodes as shown in Figure 6-2. For a tunnel edge to exist between A and B, there must be at least one node C in the intersection I to which both A and B are connected. Such a node C is a candidate to be used in tunneling the structured near connection between A and B.

This enhancement allows the connection state at a node to consistently reflect the overlay topology even when it is not possible to communicate with some neighbors using the conventional TCP or UDP transports. Tunnel edges are implemented (described in Section 6.4) such that they are functionally equivalent to UDP or TCP edges once they

PAGE 91

Figure: Tunnel edge between nodes A and B which cannot communicate over TCP or UDP transports

are established, allowing seamless reuse of the code responsible for state maintenance and routing logic in the system.

Two important questions arise in the context of this proposed approach: what is the probability of a tunnel edge being formed between two nodes A and B, and how many nodes are candidates for proxying tunnel edges? These questions are addressed analytically in this section.

Let p be the probability that an edge can form between a pair of nodes (so that an edge outage occurs with probability 1-p). For a tunnel edge to exist between A and B, there must be at least one node C in the intersection I to which both A and B are connected. Assuming m near connections on each side of both nodes, the probability for a tunnel edge between A and B to exist through C is given by

PAGE 92

6-1 shows the probability of forming a tunnel edge between unconnected nodes A and B for different values of edge probability p and number of near connections m. It should be noted that there is a sharp increase in the probability of being able to form a tunnel edge when nodes acquire more than 2 near connections on each side. This fact is also reflected in simulation results, which show that improvements in correctness of routing using tunnel edges are significantly higher when m ≥ 3. Figure 6-5 shows 3.9% broken pairs when m = 2 and 0.86% broken pairs when m = 3.

Now consider a situation where a tunnel edge involves exactly one forwarding node. When the forwarding node departs, the current node also loses the tunnel edge connection. Therefore, for fault tolerance, it is also important that the forwarding set of nodes for a tunnel edge contains more than one node. The probability that the forwarding set consists of at least 2 nodes is given by

$$P[\text{forwarding set of size at least } 2] = 1 - \sum_{k=0}^{1} P[\text{forwarding set of size exactly } k] = 1 - \sum_{k=0}^{1} \binom{2(m-1)}{k} (p^2)^k (1-p^2)^{2(m-1)-k}$$
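The probabilities in this section can be checked numerically with a short sketch. It assumes that p denotes the pair-wise edge probability (as in the "edge prob" column of the table of tunnel edge probabilities), that the intersection of the two neighborhoods contains 2(m-1) nodes, and that a given node can forward only if both of its edges form, which happens with probability p^2. The function names are hypothetical.

```python
from math import comb


def tunnel_edge_probability(p, m):
    """P[at least one common neighbor is connected to both A and B]."""
    n = 2 * (m - 1)          # size of the neighborhood intersection
    return 1 - (1 - p * p) ** n


def forwarding_set_at_least(p, m, size=2):
    """P[forwarding set contains at least `size` nodes] (binomial tail)."""
    n = 2 * (m - 1)
    q = p * p                # both edges of a candidate forwarder must form
    below = sum(comb(n, k) * q**k * (1 - q) ** (n - k) for k in range(size))
    return 1 - below


for m in (2, 3, 4, 5):
    print(m, round(tunnel_edge_probability(0.70, m), 4),
          round(forwarding_set_at_least(0.70, m), 4))
```

For p = 0.70 the first column of results reproduces the tabulated values for m = 2 and m = 3 (0.7399 and 0.9323); the second column is always lower, since requiring two forwarders is a stricter condition.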

PAGE 93

Table: Probability of being able to form a tunnel edge as a function of edge probability and number of required near connections on each side

             tunnel edge probability
edge prob    m=2      m=3      m=4      m=5
0.70         0.7399   0.9323   0.9824   0.9954
0.75         0.8085   0.9633   0.9929   0.9986
0.80         0.8704   0.9832   0.9978   0.9997
0.90         0.9638   0.9986   0.9999   0.9999

It can further be shown that if each node maintains $O(\log_2 N)$ neighbors, then the probability of not being able to form a tunnel edge is:

$$(1-p^2)^{2(m-1)} = (1-p^2)^{O(m)} = (1-p^2)^{O(\log_2 N)} = O\left(N^{\log_2(1-p^2)}\right)$$

For $p = 0.9$, the above expression evaluates to $O(N^{-2.39})$. Therefore, as the network grows in size and nodes tend to acquire more near connections, tunnel edges become more and more probable.

Scenarios such as symmetric NATs, multi-level NATs and Internet route outages result in complex models for the likelihood of two nodes being able to communicate. For example, the likelihood of a node behind a symmetric NAT being able to form an edge with another arbitrary node depends on the fraction of nodes that are public (or behind full-cone NATs). In the multi-level NAT scenario (Figure 5-2) where the outermost NAT-P does not support "hairpinning", the likelihood of a node E to form an edge with another arbitrary node is a function of the fraction of nodes that are behind the same NAT-P, but

PAGE 94

In the absence of any published work that provides a fault model to capture all such scenarios, the likelihood of an edge between a pair of nodes has been modeled with a uniform pair-wise edge probability, allowing for high probabilities of P2P edges not being able to form (as high as 30%).

1. Attempt to add near connections to the immediate m neighbors (on either side), respecting the connection matrix.
2. If tunneling is enabled: identify all the missing connections between pairs of nodes, compute the overlap of their connection tables to see if tunneling is possible, and add the possible tunnel edges to the network.
3. If there are nodes with fewer than m connections on each side: each such node tries to acquire more near connections (to its closest neighbors, and fully respecting the connection matrix), until it has successfully acquired m near connections on each side.
4. If there are nodes which acquired more than m connections on each side, these excess connections are trimmed in the subsequent step.
5. Iteratively, attempt to add a far connection at each node. The distances traveled by these connections in the structured ring follow the distribution described in [45]. This step is repeated until every node has successfully acquired at least one far connection that is allowed by the connection matrix.

The all-to-all routability of the network is studied by simulating the sending of a message between each pair of nodes, and counting the number of times the message is

PAGE 95

Figure: At edge likelihood of 70% (0.7), the percentage of non-routable pairs varies from 9.5% to 10.9% (the total number of pairs in the simulated network is 1,000,000)

incorrectly delivered. This experiment is conducted for 200 different randomly generated graphs. To investigate correct routing of keys, I randomly generate 10000 different keys. For each key, I simulate the sending of a message addressed to that key from each node as the source, and count the number of times the lookup is wrongly delivered, i.e. to nodes other than the node closest to the key in identifier space. I conduct this experiment for 200 different randomly generated graphs.

Figure 6-3 shows the number of non-routable pairs (out of 1000 × 1000 possible pairs) of nodes for different values of the number of near neighbors m, when neither annealing routing nor tunnel edges are used. It is observed that as edge likelihood drops to 70%, the all-to-all routability of the network drops to less than 90%, i.e. more than 10% of pairs are non-routable. Similar observations are also made for the average number of instances when keys were wrongly routed (see Figure 6-4). As edge likelihood drops to 70%, there is more than a 10% chance that a key is wrongly routed. Furthermore, keeping more near connections at each node only marginally improves the network routability.

6-1 is shown in Figure 6-6, with m = 3; tunnel edges are not enabled. It is observed that, at an edge

PAGE 96

Figure: At edge likelihood of 70%, the percentage of wrongly routed keys varies from 9.5% to 10.7% (the total number of simulated messages is 10,000,000)

likelihood of 85%, the percentage of non-routable pairs with annealing routing is about 0.6%, which is less than one-fifth of the percentage when greedy routing is used (3.3%). Even when the edge likelihood drops to 70%, the percentage of non-routable pairs (less than 3.4%) is still less than half of that when greedy routing is used (more than 10%). It should also be noted that annealing routing with m = 3 is more likely to reach the correct destination than greedy routing with m = 5, for the same edge likelihood in this network of 1000 nodes.

The average number of hops taken by a message for both greedy and annealing routing was also measured in each simulation. In a perfectly formed structured network, both routing algorithms incur exactly the same number of hops. Otherwise, the average number of hops between P2P nodes for annealing is almost the same as for greedy routing. For an edge likelihood of 70% and m = 3, the ratio of the number of hops incurred by annealing to that of greedy is 1.01. Therefore, annealing routing only incurs a marginal overhead in terms of number of hops.

Figure 6-7 shows the average number of wrongly routed key lookups for both annealing and greedy routing using the methodology described in Section 6.3.1. At an edge likelihood of 70% (m = 3), annealing routing reduces the chances of a key being wrongly routed from 10.2% to 3.4%. By delivering a message at more than one node, the

PAGE 97

Figure: Comparing greedy routing with tunnel edges for m = 3 and m = 2. At edge likelihood of 70%, the percentage of non-routable pairs in a network of 1000 nodes is (1) 3.9% for m = 2, and (2) 0.86% for m = 3.

Figure: Average number of non-routable pairs. At edge likelihood of 70%, the percentage of non-routable pairs for greedy and annealing routing is (1) without tunnel edges, 10.26% and 3.4% respectively; (2) with tunnel edges, 0.86% and 0.21% respectively. At edge likelihoods of 95%, there are no non-routable pairs with tunnel edges.

annealing algorithm can result in creation of additional (more than two) replicas for a key

PAGE 98

Figure: Average number of wrongly routed keys. At edge likelihood of 70%, the percentage of wrongly routed keys for greedy and annealing routing is (1) without tunnel edges, 10.2% and 3.4% respectively; (2) with tunnel edges, 0.86% and 0.19% respectively.

6-6, for m = 3. It is observed that at an edge likelihood of 70%, tunnel edges substantially reduce the percentage of non-routable pairs of nodes from 3.4% to 0.21% for annealing routing (from 10% to 0.86% for greedy routing).

Each virtual hop over a tunnel edge actually corresponds to two overlay hops. The actual number of hops taken by messages addressed to exact destinations in an overlay that supported tunnel edges was also recorded. For an edge likelihood of 70% and m = 3, the ratio of the number of actual hops to that of virtual hops was observed to be 1.14, which is a small overhead considering the improvement in routability.

Figure 6-7 also compares how tunnel edges improve the consistency of key routability of the network, for m = 3. It is observed that, at an edge likelihood of 70% with tunnel edges, the chances of a key being wrongly routed are 0.86% for greedy routing (and 0.19% for annealing routing).
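The simulation methodology used for these results can be approximated by a much smaller sketch. It is a deliberate simplification and not the simulator used for the reported numbers: only the first step (near connections to the m closest neighbors on each side, subject to a uniform pair-wise edge probability) is modeled, far connections, tunnel edges and annealing routing are left out, node identifiers are simply ring positions 0..n-1, and all function names are hypothetical. It assumes n > 2m.

```python
import random


def ring_distance(a, b, n):
    d = abs(a - b)
    return min(d, n - d)


def build_overlay(n, m, p, rng):
    """Near connections only; each needed pair forms with probability p."""
    conns = [set() for _ in range(n)]
    for i in range(n):
        for k in range(1, m + 1):
            j = (i + k) % n
            if rng.random() < p:  # connection-matrix entry for pair (i, j)
                conns[i].add(j)
                conns[j].add(i)
    return conns


def greedy_delivers(conns, src, dst):
    """True if greedy routing from src reaches the exact destination dst."""
    n = len(conns)
    cur = src
    while cur != dst:
        if not conns[cur]:
            return False
        best = min(conns[cur], key=lambda v: (ring_distance(v, dst, n), v))
        if ring_distance(best, dst, n) >= ring_distance(cur, dst, n):
            return False  # stuck at a local minimum
        cur = best
    return True


def non_routable_fraction(n, m, p, seed=0):
    """Fraction of ordered (source, destination) pairs that greedy routing misses."""
    conns = build_overlay(n, m, p, random.Random(seed))
    bad = sum(not greedy_delivers(conns, s, d)
              for s in range(n) for d in range(n) if s != d)
    return bad / (n * (n - 1))


print(non_routable_fraction(30, 3, 1.0))  # perfectly formed ring
print(non_routable_fraction(30, 3, 0.7))  # ring with pair-wise outages
```

Because far connections and the later repair steps are omitted, the absolute numbers from this sketch are not comparable to the figures; it only shows how a connection matrix and an all-pairs routing check fit together.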

PAGE 99

2) that uses the P2P overlay to rendezvous with a remote node for out-of-band exchange of information relevant for communication (through ConnectToMe (CTM) messages), followed by a bidirectional linking protocol that establishes the connection. The connection protocol allows nodes to exchange their NAT-assigned IP address/port for hole-punching. To implement tunnel edges, the same mechanism is also used to exchange information about connections to near neighbors.

Each connection in Brunet is based on an edge. Each node has one or more Uniform Resource Identifiers (URIs) that abstract the edge protocols it can support and the endpoints over which it can communicate. For each type of edge, an EdgeListener is responsible for creating and maintaining edges of that type, and also for sending and receiving messages over connections using that edge type. For example, to create an edge with another node using the URI ipop.udp://128.227.56.123:4000, the UdpEdgeListener is invoked, whereas to communicate with the same node using the URI ipop.tcp://128.227.56.123:4001, the TcpEdgeListener is invoked. A Brunet node can have more than one EdgeListener, and new types can be easily added.

Before describing the process of creating a tunnel edge, I overview the functionality that allows each Brunet node C to also act as a message forwarder for communication between two nodes A and B. The message from the original source A is encapsulated inside a forward request message addressed to node C. When node C receives the message from A, it extracts the original message (from A to B), and sends it to node B. This functionality is used by a new Brunet node to identify its left and right neighbors in the P2P ring [85].

The tunneling of a connection between nodes A and B over common neighbors is achieved by implementing an EdgeListener called TunnelEdgeListener. The URI for a node corresponding to the tunnel edges is computed dynamically by concatenating the

PAGE 100

When a node A has computed the forwarding set F with a remote node B, it sends this information to B in an EdgeRequest using the forwarding services of one of the nodes in F. When B receives an EdgeRequest, it replies with an EdgeResponse and also records the new tunnel edge. On receiving the EdgeResponse, node A also records the new tunnel edge. Once the tunnel edge is successfully created, nodes A and B can subsequently create a connection between them.

This implementation does not require nodes in the forwarding set to keep any state about the tunnel edges that are using them. Furthermore, the periodic ping messages to maintain a connection based on a tunnel edge also keep the underlying connections alive. Therefore, no extra overhead is incurred by nodes in the forwarding set. The forwarding set for a tunnel edge can change over time as connections are acquired or lost. To keep the forwarding set up to date and synchronized, nodes A and B notify each other about the changes in their connections.

When a node joins an existing overlay and cannot communicate with its immediate left and right neighbors, its tunnel URI is initially empty since it does not have any connections yet. However, it is possible that the new node can communicate with its other near neighbors; therefore it must first try to form connections with them and then use those connections to form tunnel edges with its immediate neighbors on the P2P ring. The new node learns about its other close neighbors through the CTM messages it receives from its immediate neighbors, which also contain a list of their near connections.
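The bookkeeping of a forwarding set can be sketched as below. This is an illustrative model only: the class `TunnelEdge` and its methods are hypothetical and do not correspond to the Brunet TunnelEdgeListener API. The forwarding set is modeled as the intersection of the two endpoints' connection sets, recomputed whenever either side reports a change.

```python
class TunnelEdge:
    """Hypothetical bookkeeping for a tunnel edge between two nodes."""

    def __init__(self, local_conns, remote_conns):
        self.local = set(local_conns)    # connections of this node
        self.remote = set(remote_conns)  # connections reported by the peer

    @property
    def forwarders(self):
        """Common neighbors that can relay messages between the endpoints."""
        return self.local & self.remote

    @property
    def usable(self):
        return bool(self.forwarders)

    def on_local_change(self, conns):
        self.local = set(conns)

    def on_remote_change(self, conns):
        """Called when the peer notifies us about its connection changes."""
        self.remote = set(conns)

    def pick_forwarder(self):
        """Deterministic choice for the example; any member would do."""
        return min(self.forwarders) if self.forwarders else None


edge = TunnelEdge({"C", "D", "E"}, {"D", "E", "F"})
print(sorted(edge.forwarders))  # common neighbors D and E
edge.on_remote_change({"E", "F"})
print(edge.pick_forwarder())    # only E remains
edge.on_local_change({"C"})
print(edge.usable)              # no common neighbor is left
```

Note that the forwarders themselves hold no per-tunnel state in this model, which matches the property described above.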

PAGE 101

When two adjacent P2P neighbors cannot form a connection, it is likely that crawling the network using neighbor information will skip a node. If the next reported node has a connection with the missing node, an inconsistency will be reported. However, if even the second node does not have a connection to the missing node, the inconsistency may go unnoticed. It is still possible to observe a 100% consistent ring with a few nodes completely missing. These hidden nodes can be detected using information logged by Brunet at each node; knowledge of the number of nodes and their identifiers is also available.

The effect of the presented techniques with respect to overlay structure is demonstrated both in a synthetic environment with artificially created situations that prevent connection setup, and in a large-scale PlanetLab environment, which is known to exhibit route outages between pairs of hosts.
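The consistency check performed by crawling the ring can be sketched as follows. The data layout (each node reporting its left and right neighbor) and the function `crawl_ring` are assumptions for illustration, not the actual crawler. The crawl follows right neighbors and reports an inconsistency whenever the next node does not name the current node as its left neighbor.

```python
def crawl_ring(views, start):
    """Walk the ring using each node's reported right neighbor.

    `views` maps node -> (left, right) as reported by that node.
    Returns (visited nodes in order, number of inconsistencies).
    """
    visited, inconsistencies = [], 0
    cur = start
    while cur not in visited:
        visited.append(cur)
        right = views[cur][1]
        if views[right][0] != cur:
            inconsistencies += 1
        cur = right
    return visited, inconsistencies


# Nodes 2 and 3 cannot connect, but node 4 still knows node 3.
detected = {1: (5, 2), 2: (1, 4), 3: (1, 4), 4: (3, 5), 5: (4, 1)}
# Node 3 cannot connect to either neighbor: nodes 2 and 4 agree with each other.
hidden = {1: (5, 2), 2: (1, 4), 3: (1, 5), 4: (2, 5), 5: (4, 1)}

print(crawl_ring(detected, 1))  # node 3 skipped, one inconsistency reported
visited, bad = crawl_ring(hidden, 1)
print(visited, bad)             # node 3 skipped, ring looks consistent
print(set(hidden) - set(visited))  # hidden nodes, given the full node list
```

The last line uses the independently known list of node identifiers to expose the hidden node, in the spirit of the log-based detection described above.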

PAGE 102

Each node was configured to form connections with 3 neighbors on its immediate left and right. Of the total of 6180 structured near connections reported by all nodes, about 4926 (70%) existed between nodes which were on different private networks. These connections would not have been possible without decentralized NAT traversal. The P2P ring was 100% consistent.

The P2P ring contained 35 pairs of adjacent nodes that could not set up a UDP connection because of firewall rules. These pairs of nodes were, however, able to connect using tunnel edges, thus rendering a complete P2P ring.

6-2) could

PAGE 103

Table: Adjacent P2P nodes in wide-area deployment that could not communicate using TCP or UDP

PlanetLab hosts with pair-wise routing outages
planetdev05.fm.intel.com | pli2-pa-3.hpl.hp.com
planetlab-1.fokus.fraunhofer.de | planetlab2.cosc.canterbury.ac.nz
sjtu2.6planetlab.edu.cn | planetlab1.ucb-dsl.nodes.planet-lab.org
planetlab2-tijuana.lan.redclara.net | pli1-pa-3.hpl.hp.com
planetlab-1.man.poznan.pl | pli2-pa-2.hpl.hp.com
planetlab2.ls.fi.upm.es | planetlab1.cosc.canterbury.ac.nz
planetlab2.cosc.canterbury.ac.nz | planetlab3-dsl.cs.cornell.edu
athen.dvs.informatik.tu-darmstadt.de | uestc2.6planetlab.edu.cn
uestc2.6planetlab.edu.cn | planetdev02.fm.intel.com

not communicate using TCP or UDP transports. Their inability to connect, which was observed indirectly through the fact that tunnel edges had been created, was verified directly by logging into each host and observing that ICMP messages (and SSH connections) to its peer did not go through either.

The ability of tunnel edges to form was further evaluated by deploying an additional 20 P2P nodes on host H1 and 20 nodes on host H2. These nodes were configured to use only UDP transports, and their hosts H1 and H2 were configured to drop UDP packets between them, thus modeling a scenario where there is a routing outage between two sites. Two instances were observed where one node of an adjacent pair was running on H1, while the other was running on H2. Tunnel edges formed between these nodes in both cases, again rendering a 100% consistent P2P ring. Without tunnel edges, these nodes would have had inconsistent views of their local neighborhoods (in identifier space), and the messages addressed to them were likely to be misdelivered.

The delay incurred by a new P2P node (on a home desktop) to get connected with its left and right neighbors on the ring using tunnel edges was also measured, over several trials. The average time to get connected with neighbors is less than 10 seconds, using UDP or TCP. The home desktop did not have an Internet path to a few nodes on PlanetLab, and every time it became a neighbor to one of these nodes, it relied on tunnel edges to get connected, which took 41 seconds on average. The current

PAGE 104

The linking protocol for connection setup is executed through one or more linkers; each linker sends link messages using the different URIs of the remote node in parallel until it starts receiving replies. Only one linker is active at a time, during which it sends several link messages over a URI until it starts receiving replies or gives up. Initially, the new node does not have any connections to tunnel over and its tunnel URI is empty. The first linker that is created can thus only succeed using TCP or UDP. When TCP or UDP communication is not possible, it takes several attempts for the linker to finish and the next linker to be activated. In some cases, the linker containing a usable tunnel URI (created after the node has acquired a few connections) is still waiting in the queue. An alternative implementation is possible that allows updates to the tunnel URI listed in the currently active linker, which would obviate the need to wait for the next linker.

status command) was measured, which is representative of the achievable throughput of the Condor pool. Furthermore, once a worker has been chosen for job execution by the Condor manager through matchmaking, the process of job submission involves direct communication between the submit node and the worker. The all-to-all connectivity between worker nodes is also reported. The connectivity within the WOW

PAGE 105

Initially, P2P edges are allowed to form without constraints. The Condor manager reported all 180 workers, and the workers were all-to-all connected. The P2P ring was 100% consistent. To create situations where direct communication was not always possible, the UdpEdgeListener at each node was configured to deny UDP-based connections with a probability of 0.10. The probability of two nodes being able to form a UDP-based connection is thus given by (1 - 0.10)^2 = 0.90^2 = 0.81.

The P2P nodes were configured to only use greedy routing and no tunnel edges. The Condor manager reported at most 160 nodes, i.e. only 88% of the worker nodes were available. In addition, there were 6020 pair-wise worker connections (out of 180 × 180) that could not form. In another experiment, the P2P nodes were configured to use annealing routing but no tunnel edges. The Condor manager reported 177 worker nodes, and there were 859 pairs of workers that could not communicate with each other.

Finally, both annealing routing and tunnel edges were enabled. The P2P ring consisting of 201 nodes (20 bootstrap, 1 manager and 180 workers) reported 40 tunnel edges, which formed when UDP communication was denied by one of the UdpEdgeListeners between adjacent P2P nodes. Only one inconsistency was observed on the P2P ring, where a tunnel edge did not form because the P2P nodes did not have any overlap in their UDP-based connections: the overlapping connections were already based on tunnel edges, and the existing implementation does not support recursive tunneling. The Condor manager reported all 180 workers, and there were only 7 pairs of workers that could not communicate.

[8], the authors describe the implementation of a Sockets library that can be used by applications for communication between nodes subject to a variety of constraints in wide-area networks. My work, on the other hand, investigates an approach where

PAGE 106

Structured P2P systems (Chord [40], Pastry [39], Bamboo [89], Kademlia [100]) have primarily focused on efficient overlay topologies [27], reliable routing under churn [89][101], and improving latency of lookups through proximity-aware routing [102]. In [26][105], the authors describe the effect of a few (5% broken pairs) Internet routing outages on wide-area deployments of structured P2P systems. On the other hand, my focus is to enable overlay structure maintenance when a large majority of nodes are behind NATs, and several scenarios hinder communication between nodes. The techniques described in this chapter facilitate correct structured routing, even when many (up to 30%) pairs of nodes cannot communicate directly using TCP or UDP.

In [99], the authors present techniques to provide content/path locality and support for NATs and firewalls, where instances of conventional overlays are configured to form a hierarchy of identifier spaces that reflects administrative boundaries and respects connectivity constraints among networks. In a Grid scenario, however, network constraints are not representative of collaboration boundaries, as virtual organizations (VOs) are known to span multiple administrative domains.

A technique similar to tunnel edges is also described in [87], in the context of a P2P-based email system built on top of Pastry. My work, on the other hand, uses tunneling to improve all-to-all virtual-IP connectivity between WOW nodes. I also quantify the impact of the described techniques on structured routing through simulations, under different edge probabilities between nodes. Unmanaged Internet Protocol (UIP) [108] proposes to use tunneling in the Kademlia DHT to route between "unmanaged" mobile devices and hosts in ad hoc environments, beyond the hierarchical topologies that make up the current Internet. However, my focus is to facilitate IP communication between Grid resources in different "managed" Internet domains.

PAGE 107

[109], the authors describe an algorithm for providing strong consistency of key-based routing (KBR) in dynamic P2P environments, characterized by frequent changes in membership due to node arrivals and departures. The improvements in eventual consistency by using the techniques described in this chapter can also benefit the implementation of strongly consistent KBR. Similarly, in [110], the authors provide asymptotic upper bounds on the number of hops taken by messages under varying rates of link and node failures, and describe heuristics to improve routing under those failures. However, their work does not consider failures of links with neighbor nodes and the subsequent impact on consistent structured routing. To complement fault-tolerant routing, the tunneling technique described in this chapter also attempts to correct the overlay structure in the presence of link failures.

The current implementation of tunnel edges "passively" relies on an overlap existing between the connections of two nodes in order to form a tunnel edge between them. Symmetric NATed nodes can be difficult to handle using this implementation since they can only communicate with public nodes (or nodes behind cone NATs). These nodes may not

PAGE 108


PAGE 109

In structured P2P systems such as IPOP, nodes arrange themselves in a well-defined topology that is dictated by their randomly chosen node identifiers. As a result, structured routing is oblivious of the geographical location of nodes. The resulting routing delays affect the latency observed by services such as the DHT operations, the connection setup protocols, and the applications of the IPOP virtual network when shortcut connections cannot form. Furthermore, several tools used to gather information about the IPOP deployments on PlanetLab (such as the P2P ring crawler described in Chapter 6) also rely on overlay routing to communicate with a node.

Techniques [111] have been proposed to embed locality information into node identifiers to make the overlay routing (that follows the node identifiers) latency-aware; however, doing so adversely affects the uniform distribution of node identifiers and subsequently the bounds on the average number of overlay hops and DHT data reliability. A well-known technique called Proximity Neighbor Selection (PNS) [27] reduces the latency of structured P2P routing by requiring nodes to consider proximity information while choosing only some (not all) of their connections. It works with random assignment of P2P identifiers and does not affect the bounds on the number of overlay hops and DHT reliability. In this chapter, an implementation of PNS in the IPOP system is presented, and is shown to achieve up to 30% improvement in route latency of the IPOP overlays deployed on PlanetLab.

As described earlier, connectivity constraints in the wide area can prevent creation of adaptive shortcut connections between IPOP nodes. Even with PNS, the latency of (multi-hop) P2P routing for virtual IP packets is still very high for many applications. In such cases, it is possible to establish a 2-hop overlay tunnel between IPOP nodes, by selecting a proxy node based on some criteria. This chapter also investigates techniques to

PAGE 110

Both PNS and proxy discovery require knowledge of Internet latencies between pairs of nodes in the network. One possibility is to take explicit round-trip time (RTT) measurements to another node. However, in a heavily NATed environment (as depicted in Sections 5.1 and 6.5.1.1), measuring latency to an arbitrary node may also require setting up a connection to that node through hole-punching, which could incur a messaging overhead of up to 10.2 Kbytes. Since connection setup incurs non-trivial cost and delay, it is not useful to establish connections for short-lived communication with arbitrary nodes, such as measuring RTT to a node. To alleviate this need for creation of short-lived connections, a low-overhead technique based on network coordinates is used that allows arbitrary pairs of nodes to estimate Internet latencies to each other without requiring an explicit measurement.

The rest of this chapter is organized as follows. In Section 7.1, I describe network coordinates and their implementation in the IPOP system. Section 7.2 focuses on improvements in overlay routing. I describe PNS and its implementation in IPOP, and evaluate the corresponding improvements in route latency of IPOP overlays on PlanetLab. The focus of Section 7.3 is on improving IPOP latency by setting up a 2-hop path through a suitably chosen proxy node when direct communication is not possible between end-nodes. I describe and evaluate techniques to discover proxy nodes that minimize latency between end-nodes, under different scenarios.

[112], Lighthouses [113]), and all other nodes compute their coordinates by measuring latency to these landmark

PAGE 111

[28], and there is no dependence on external nodes. The second class of algorithms is more suitable for an autonomous system such as IPOP, since they facilitate deployment and maintainability by allowing all nodes in the system to execute the same code.

In the Vivaldi algorithm, all nodes in the system initially start at random points in space. By periodically taking RTT samples to a fixed set of a few random nodes, a node adjusts its coordinates. Gradually, these coordinates evolve to represent the Internet latencies between nodes. The Vivaldi algorithm is described in detail in Appendix B.

Support for Vivaldi network coordinates has been incorporated into the Brunet P2P system. Periodically, a node takes an RTT measurement to one of its existing connections. These measurements do not impose any extra overhead, since they supplement the periodic ping messages that keep idle connections alive. The periodic RTT measurements are taken at the application level, and hence exhibit high variance that can cause slow convergence and instability. The implementation of network coordinates in the IPOP system uses statistical filtering of latency samples [114] to achieve tolerance to the noise in RTT measurements while still being responsive to sustained changes in latencies.

Figure 7-1 shows the cumulative distribution function (CDF) of the estimation error for latencies between pairs of nodes (using network coordinates) on an overlay consisting of about 350 nodes on PlanetLab, measured at different instants. Each node in the overlay acquires about 15 connections on average, which it samples periodically every 10 seconds. The embedding space for the network coordinates is 2-dimensional, with a scalar height vector. The prediction accuracy of network coordinates is observed over time. After 5 hours of bootstrapping the system, the median prediction error is observed to be about 15%, which is also in agreement with other studies on using network coordinates [115]. Even though a network of 350 nodes, all bootstrapped at once, takes about 5 hours to achieve this accuracy, such networks typically grow incrementally, one node at a time. Related work [28] has shown that once there is a critical mass of well-placed nodes in a

PAGE 112

Relativeerrorforround-triptimepredictionusingnetworkcoordinatesmeasuredafter1,3and5hoursofbootstrapping Vivaldinetwork,anewnodejoiningthesystemneedstomakefewmeasurementsinordertondagoodplaceforitself. 27 ],toimprovelatencyofmulti-hopstructuredP2ProutinginIPOP,byrequiringP2Pnodestoselectasubsetoftheirconnectionsbasedonnetworkproximity.


Figure 7-2. Proximity neighbor selection

PNS requires nodes to also consider network proximity while choosing far connections, as illustrated in Figure 7-2. In this figure, the node A considers a range R of O(log(n)) nodes starting at the random target node B (in identifier space), selected using the technique described above. Instead of connecting to the node B, the node A connects to the node B' in range R that has the smallest Internet latency to itself. In addition, each node also keeps a connection to the closest node (in terms of latency) among its log(n) nearest neighbors in the identifier space.

In general, an overlay path between two nodes (adequately separated in identifier space) is dominated by far connections, since these connections cover most of the distance in identifier space. The number of hops over far connections is thus a function of n, while only the last few (a constant, say M) hops are over near connections. By selecting far connections based on proximity, messages can avoid taking long-latency hops for most of their progress towards the destination, thus reducing overlay latency. Furthermore, choosing far connections using this technique also preserves the bound on the number of overlay hops between pairs of nodes to O(1


Given the above functionality, a component called VivaldiTargetSelector (VTS) has been implemented that enables a node to discover the closest node B' in the range R starting at its guessed random address s. The VTS takes as input (1) the start address s of the range and (2) the size r of the range. Using the start address s as forwarder, it sends a query requesting the local network coordinates to the directional address "Left" with TTL set to r, and routing mode path-deliver. This query is then delivered to each node in the range of size r to the left of address s, and the results are communicated back to the source node. The VTS then measures the distance in the network coordinate space to each candidate node, and selects the closest node B' for connection setup. An estimate of the network size is made using the address range spanned by the near connections at a node.

Every 300 seconds, a node randomly selects one of its far connections and checks whether it is still optimal by querying the same range R for network coordinates. If the connected node is no longer the closest, the connection is trimmed. These periodic checks account for the evolution of network coordinates over time and for the arrival of a closer node in the range R.
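The selection step performed by the VTS can be illustrated with a minimal Python sketch. It assumes 2-dimensional coordinates with a scalar height, where the predicted RTT is the planar distance plus both heights (the height model of the Vivaldi paper [28]). The names `Coord`, `predicted_rtt` and `select_closest`, and the sample replies, are hypothetical and not taken from the Brunet code base.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Coord:
    """Hypothetical 2-D network coordinate with a scalar height (milliseconds)."""
    x: float
    y: float
    height: float = 0.0


def predicted_rtt(a: Coord, b: Coord) -> float:
    """Planar Euclidean distance plus both heights (assumed height model)."""
    return math.hypot(a.x - b.x, a.y - b.y) + a.height + b.height


def select_closest(local: Coord, candidates: Dict[str, Coord]) -> str:
    """Return the address of the candidate with the smallest predicted RTT."""
    if not candidates:
        raise ValueError("the queried range returned no candidates")
    return min(candidates, key=lambda addr: predicted_rtt(local, candidates[addr]))


# Coordinates returned by the nodes in range R (illustrative values only).
local = Coord(0.0, 0.0, 5.0)
replies = {
    "B": Coord(120.0, 50.0, 10.0),
    "B1": Coord(30.0, 40.0, 8.0),
    "B2": Coord(60.0, 10.0, 20.0),
}
best = select_closest(local, replies)
```

The periodic re-check described above would amount to running `select_closest` again on a fresh set of replies and trimming the connection if the result differs from the currently connected node.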


The cumulative distribution of the observed round-trip delay is shown in Figure 7-3. It is observed that PNS results in up to 40% reduction in the average overlay latency. This reduction in overlay latency is also reflected in the decrease in crawl times and DHT operations on IPOP overlays, thus validating the applicability of PNS and network coordinates to structured overlays deployed over the wide area.

Figure 7-3. Round-trip times (RTTs) for overlay pings with PNS and without PNS. Average RTT is (1) 1.75 secs (with PNS), (2) 2.86 secs (without PNS).

The ability of network coordinates to correctly predict the closest node in each queried range R is also evaluated. The candidate set (with coordinates) for each PNS query is logged. The round-trip times (RTTs) between PlanetLab nodes, as measured by ICMP pings, are then used to determine the percentage of nodes that were closer to the current node than the one selected by the VivaldiTargetSelector (using network coordinates). This distribution is shown in Figure 7-4. It is observed that with probability greater than 0.60, network coordinates correctly identify the closest node, while with probability 0.90 the selected node is among the closest 20% of the queried nodes.

Figure 7-4. Percentage of nodes in the queried range R that are closer to the source node than the one selected by network coordinates. On average, 8.76% of the nodes in the queried range R are closer than the selected node.

The relative latency error for each selection is defined as $(d_{sel} - d_{opt}) / d_{opt}$, where $d_{sel}$ is the latency to the selected node and $d_{opt}$ is the latency to the closest node in the range. Its distribution is shown in Figure 7-5. It is observed that in more than 80% of the cases, the relative latency error is less than 0.50.

Figure 7-5. Relative latency error for choosing the closest node in the network coordinate space. On average, the relative latency error is 1.432.

Even with PNS, the average latency on the overlay is still observed to be too high (in excess of a second) for efficient routing of virtual IP packets when direct communication is not possible between IPOP nodes. In the next section, I investigate techniques to improve the end-to-end performance of the IPOP virtual network in such scenarios, by setting up a 2-hop communication path through proxy nodes with desired capabilities.
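The two accuracy measures used in this evaluation can be sketched in Python as follows. The sketch assumes that the relative latency error is the excess latency of the selected node normalized by the latency of the true closest node, and that the percentage of closer nodes is taken over all candidates in the range, including the selected one. The function names and the sample RTTs are hypothetical.

```python
from typing import Dict


def closer_percentage(true_rtt: Dict[str, float], selected: str) -> float:
    """Percentage of candidates whose measured RTT is below the selected node's RTT."""
    closer = sum(1 for rtt in true_rtt.values() if rtt < true_rtt[selected])
    return 100.0 * closer / len(true_rtt)


def relative_latency_error(true_rtt: Dict[str, float], selected: str) -> float:
    """(d_sel - d_opt) / d_opt, with d_opt the smallest measured RTT in the range."""
    d_sel = true_rtt[selected]
    d_opt = min(true_rtt.values())
    return (d_sel - d_opt) / d_opt


# Measured ICMP RTTs (ms) to the candidates of one PNS query (illustrative values).
true_rtt = {"n1": 80.0, "n2": 40.0, "n3": 120.0, "n4": 50.0, "n5": 200.0}
pct = closer_percentage(true_rtt, "n4")
err = relative_latency_error(true_rtt, "n4")
```

In this example the coordinates picked `n4` although `n2` is closer, so one of five candidates (20%) beats the selection and the relative latency error is 0.25.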


The latency of virtual IP communication can be reduced by setting up a 2-hop communication path between end nodes A and B (that cannot establish direct communication), through another node P in the network that has sufficient resources to efficiently route virtual IP packets. A similar approach is also used in the Skype [23] system. In Skype, some nodes in the system are elevated to the status of supernodes, on the basis of their capabilities (CPU, memory and disk) and network connectivity. These nodes can proxy communication between other nodes in the system. The information about the supernodes is maintained in a central directory. It is retrieved by a new node on start-up, and cached locally for future use. In a system such as IPOP, where nodes belong to different organizations and individuals, managing such central servers is a problem, as described in Chapter 4. To achieve autonomous operation of the IPOP virtual network, decentralized techniques to discover proxy nodes are more suitable, and are investigated in this chapter.

In an IPOP overlay, a node is only aware of a few other nodes: the ones it is connected with. In order to discover a proxy node P to communicate with another node B in the network, a node A must be able to query the network for a node with certain desired characteristics that depend on the nature of the application: an interactive application might require a proxy node such that the end-to-end delay is within a delay budget, whereas a data-transfer application would need a proxy that will provide sufficient


19] uses a central manager to match application requirements to resource characteristics. Grid technologies [116] require each participating site in the collaboration to maintain a local directory, which holds information about the local resources at that site. These systems require setting up and maintaining dedicated servers, which becomes difficult when users and resource providers are individual users on the Internet.

SWORD [97] presents an extensive framework for discovering wide-area hosts for service deployment that meet certain user-defined criteria. Each resource is described by a set of attribute values, which can be time-varying. Wide-area hosts form a structured overlay where, besides being a resource that periodically publishes its state, every host is also responsible for storing part of the information about other resources. XenoSearch [117] is another system for discovering suitable Xen hosts on the Internet, for deploying services inside VM containers. The absence of special servers to store resource characteristics makes these systems highly self-managing, and thus more suitable for a fully decentralized environment.

Both these systems use a DHT to store resource attributes. For each relevant resource attribute, an entry (attribute value, resource identifier) is created in the DHT. This information is only stored as soft state, since attribute values may vary over time. In addition, SWORD also handles range queries on attribute values. Many resources might satisfy the constraints imposed by a query, which affects the query processing cost. Each query therefore also specifies the maximum cost that may be incurred to process it.


The quality of proxy selection in this approach is limited by the connections at a node, which are chosen randomly. The next section describes an approach that can search other nodes in the network, beyond the existing connections at a node, to find proxy nodes that minimize end-to-end latency.


118] have been proposed to efficiently store multi-dimensional attributes such as network coordinates, and to perform range queries on them. However, network coordinates are dynamic and error-prone; therefore, using a PHT to store these attributes can be overkill. Instead, it is possible to directly query nodes for their network coordinates. I have developed a generic framework based on map-reduce for parallel execution of such distributed queries and aggregation of results from a large number of nodes, which is described in Appendix C. I now describe the search algorithm.

The goal of this algorithm is to find a proxy node P such that the sum S of distances ($A \leftrightarrow P$ and $P \leftrightarrow B$) is within a fraction f of the direct distance $A \leftrightarrow B$. The search query is executed using the map-reduce resource discovery framework by broadcasting over different segments of the P2P ring, as shown in Figure 7-6.

The Map function takes as arguments the coordinates of nodes A ($nc_a$) and B ($nc_b$), and a delay fraction f. If the sum $S_p$ of the distances of the local coordinates ($nc_{local}$) from the end nodes (i.e. $|nc_a - nc_{local}| + |nc_b - nc_{local}|$) is within fraction f of the direct distance (i.e. $f \cdot |nc_a - nc_b|$), it returns a list containing a single tuple $(S_p, Identifier_p)$. Otherwise it returns an empty list. The Reduce function merges the child list of tuples with the current list based on the first component (sum of distances) in each tuple, to build a single list of at most K (typically 10) tuples with the best sums.

In Figure 7-6, the node A first queries a segment of the ring of size r, starting at itself, for the best K nodes that match the criteria. If no such node is found, the node A then queries the next segment, which is twice as big as the current segment. This expanding ring search terminates once at least one node that matches the criteria has been found or when the entire ring has been queried. In the rest of this Chapter, I refer to this approach as ESS.
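The search just described can be sketched in Python. This is a simplified, sequential simulation under stated assumptions: the ring is a list of (identifier, coordinates) pairs in identifier order, coordinates are plain 2-D points without a height, the end nodes A and B are excluded as proxy candidates (an assumption; otherwise A would trivially satisfy the criterion), and each segment is scanned in a loop instead of being broadcast in parallel over the overlay. All function names and sample values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Candidate = Tuple[float, str]  # (sum of distances S_p, node identifier)


def dist(a: Point, b: Point) -> float:
    """Distance between two 2-D network coordinates."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def map_fn(node_id: str, nc_local: Point, nc_a: Point, nc_b: Point, f: float) -> List[Candidate]:
    """Return [(S_p, id)] if the detour through this node is within f times the direct distance."""
    s_p = dist(nc_a, nc_local) + dist(nc_b, nc_local)
    return [(s_p, node_id)] if s_p <= f * dist(nc_a, nc_b) else []


def reduce_fn(current: List[Candidate], child: List[Candidate], k: int = 10) -> List[Candidate]:
    """Merge two candidate lists and keep the k smallest sums."""
    return sorted(current + child)[:k]


def expanding_segment_search(ring: Sequence[Tuple[str, Point]], a_id: str, b_id: str,
                             f: float, r: int, k: int = 10) -> Tuple[List[Candidate], int]:
    """Query segments of size r, 2r, 4r, ... starting at A until a match is found."""
    ids = [node_id for node_id, _ in ring]
    coords = dict(ring)
    nc_a, nc_b = coords[a_id], coords[b_id]
    start, n = ids.index(a_id), len(ids)
    queried, size = 0, r
    while queried < n:
        size = min(size, n - queried)  # never wrap past the whole ring
        result: List[Candidate] = []
        for i in range(size):
            node_id = ids[(start + queried + i) % n]
            if node_id in (a_id, b_id):
                continue  # assumption: end nodes cannot proxy for themselves
            result = reduce_fn(result, map_fn(node_id, coords[node_id], nc_a, nc_b, f), k)
        queried += size
        if result:
            return result, queried
        size *= 2
    return [], queried


ring = [("A", (0, 0)), ("n1", (0, 200)), ("n2", (-50, 0)), ("n3", (50, 40)),
        ("B", (100, 0)), ("n4", (50, 10)), ("n5", (40, 20)), ("n6", (300, 300))]
proxies, queried = expanding_segment_search(ring, "A", "B", f=1.2, r=2)
```

With these sample values the first segment (2 nodes) yields no match, the second segment (4 nodes) finds `n4`, and the search stops before `n5` is ever queried, which illustrates why a smaller f tends to increase the number of nodes queried.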


Figure 7-6. Expanding Segment Search (ESS)

Table 7-1 shows the 6 different overlays on PlanetLab that were used for experiments. The overlays within each set ((Exp1, Exp2), (Exp3, Exp4) and (Exp5, Exp6)) were deployed simultaneously on two different slices, with and without PNS, respectively. These overlays were crawled to collect the list of connections (ordered by latency) and network coordinates at each node. The average number of connections acquired by overlay nodes was observed to be 15. The ordered list of connections is then used to determine the proxy node that would have been selected using CC, for each pair of nodes. Likewise, it is also possible to simulate the ESS algorithm offline for each pair of end nodes, given the node identifiers and coordinates (both of which are available from the crawl).

Table 7-1. Configurations of overlays used to evaluate CC and ESS

Overlay   # of nodes   PNS
Exp1      485          Yes
Exp2      448          No
Exp3      482          Yes
Exp4      454          No
Exp5      488          Yes
Exp6      452          No

The correct ordered list of proxies (ordered by end-to-end latency through the proxy node) for each pair of end nodes was computed from the ICMP ping latencies between nodes. For each pair, I refer to the ideal proxy as $p_{opt}$, and the end-to-end latency when using that proxy node as $d_{opt}$. Likewise, I refer to the proxy selected by the described techniques as p, and the end-to-end latency through that proxy as $d_p$. To compare the quality of the proxy node p selected by CC and ESS, the following metrics are used:

1. Relative Penalty (RP): The relative penalty for using p over $p_{opt}$ is given by: $RP = (d_p - d_{opt}) / d_{opt}$
2. Absolute Penalty (AP): The absolute penalty for using p over $p_{opt}$ is given by: $AP = d_p - d_{opt}$

Table 7-2 summarizes the RP and AP for CC and ESS (with f: 1.2, 1.1, 1.05), observed on overlays Exp1 and Exp2. The average number of nodes queried in the three configurations of ESS was 37, 57 and 87. Interestingly, it is observed that the simple approach CC performs better than any configuration of ESS, which is reflected in its lower RP and AP. Another interesting observation is that for the overlay without PNS (Exp2), the difference in RPs and APs of proxies selected by CC and ESS is smaller than that for the overlay with PNS (Exp1). In fact, CC performs almost the same as ESS.

This observation is explained from an information-theoretic perspective. CC relies on the information contained in a node's connections for proxy selection, whereas ESS uses the information contained in network coordinates. With 15 connections, it is possible that a node acquires a good view of the Internet space. The amount of information present in CC cannot be matched by that of network coordinates, due to their high estimation errors. Furthermore, when PNS is used, the information contained in network coordinates is also used in selecting connections close to a node, thus providing a node a better view of the network and a much better performance of CC over ESS. Similar observations were also made for other overlays, as shown in Figure 7-7.
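The two penalty metrics can be illustrated with a short Python sketch. It assumes that the end-to-end latency through a proxy is the sum of the two measured hop latencies, that RP is the excess latency normalized by $d_{opt}$, and that AP is the excess latency itself. The function names and the latency values are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Latency = Dict[Tuple[str, str], float]  # measured latency (ms) between node pairs


def via(latency: Latency, a: str, p: str, b: str) -> float:
    """End-to-end latency of the 2-hop path a -> p -> b (assumed additive)."""
    return latency[(a, p)] + latency[(p, b)]


def optimal_proxy(latency: Latency, a: str, b: str, proxies: Iterable[str]) -> str:
    """The proxy p_opt that minimizes the 2-hop latency between a and b."""
    return min(proxies, key=lambda p: via(latency, a, p, b))


def relative_penalty(d_p: float, d_opt: float) -> float:
    return (d_p - d_opt) / d_opt


def absolute_penalty(d_p: float, d_opt: float) -> float:
    return d_p - d_opt


latency = {("A", "P1"): 40.0, ("P1", "B"): 50.0,
           ("A", "P2"): 30.0, ("P2", "B"): 45.0,
           ("A", "P3"): 80.0, ("P3", "B"): 20.0}
p_opt = optimal_proxy(latency, "A", "B", ["P1", "P2", "P3"])
d_opt = via(latency, "A", p_opt, "B")
d_p = via(latency, "A", "P1", "B")  # suppose the technique under test selected P1
rp, ap = relative_penalty(d_p, d_opt), absolute_penalty(d_p, d_opt)
```

Here the ideal proxy is `P2` (75 ms end to end); selecting `P1` (90 ms) costs an absolute penalty of 15 ms and a relative penalty of 0.2.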


Figure 7-7. Relative penalty (RP) for using a proxy node selected by CC and ESS (1.2), when all nodes in the network can serve as proxies. Panels: (B) Exp2, (C) Exp3, (D) Exp4, (E) Exp5, (F) Exp6.


Table 7-2. Relative and absolute penalties of CC, ESS (1.2), ESS (1.1) and ESS (1.05) for overlays Exp1 (with PNS) and Exp2 (without PNS)

                Exp1 (With PNS)                          Exp2 (Without PNS)
%-ile         CC       ESS(1.2)  ESS(1.1)  ESS(1.05)   CC       ESS(1.2)  ESS(1.1)  ESS(1.05)
25     RP     0.076    0.117     0.113     0.111       0.106    0.133     0.127     0.124
       AP     8.732    14.296    13.825    13.619      12.986   17.077    16.113    15.645
50     RP     0.166    0.213     0.208     0.204       0.225    0.245     0.236     0.231
       AP     19.509   25.787    25.232    24.863      26.573   30.247    28.998    28.3185
75     RP     0.337    0.427     0.417     0.415       0.470    0.537     0.519     0.511
       AP     35.572   43.622    42.693    42.003      49.051   51.238    49.689    48.98
90     RP     0.742    1.008     1.014     1.012       1.129    1.373     1.337     1.324
       AP     63.403   71.118    69.397    68.466      92.258   110.227   106.027   105.575

Even though the low-overhead technique CC provides a much better proxy selection (in terms of latency) than ESS, the search space and the quality of proxies (when other criteria are also included) in CC are still limited by the connections at a node. Scenarios arise where only a subset of the nodes in an IPOP overlay can be used as proxies. For example, PlanetLab nodes that impose limits on the amount of traffic they can handle are poor candidates for proxy selection. In addition, to proxy connections between certain pairs of nodes, it is also important that the proxy node has good network connectivity (for example, it is a public node) and also has sufficient CPU cycles to route IPOP traffic. It is possible that only a few (or none) of the existing connections have sufficient resources to efficiently route virtual IP traffic. With only a small subset of a node's connections to select from, it is possible that only a small amount of latency information is usable by CC.

In another set of experiments, only a fraction of the nodes were randomly classified as potential proxies. These represent nodes that have good network connectivity and other resources to efficiently route IPOP traffic. In this scenario, CC selects the closest connection that is also a potential proxy. Similarly, the ESS search only considers nodes that are potential proxies. Another algorithm called ESS (first) is also introduced, which selects the first potential proxy encountered in the ESS search without considering latency at


Table 7-3 summarizes the penalties (RP and AP) for CC and ESS when 100%, 30% and 20% of the nodes are potential proxies, for the overlays Exp1 and Exp2. It is observed that when less than 30% of the nodes (about 135 nodes out of 450) in the network can serve as proxies, ESS provides a better proxy selection than CC. Given a uniform likelihood of 30% for being a proxy, fewer than 5 (out of 15) of a node's connections can be used as proxies, which greatly reduces the latency information available to CC for proxy selection. It is also observed that CC performs better on overlays configured with PNS, because the connection set of each node already incorporates latency information from network coordinates. The average number of nodes queried by ESS (1.2) was 74 and 99 for 30% and 20% fractions of potential proxies, respectively.

Figures 7-8 and 7-9 present the CDF of the RP for the three approaches, CC, ESS (1.2) and ESS (first), on the other overlays. In each case, ESS (1.2) results in a lower RP than both CC and ESS (first).


Figure 7-8. Relative penalty (RP) for using a proxy node selected by CC and ESS (1.2), when only 30% of the nodes can serve as proxies. Panels: (B) Exp2, (C) Exp3, (D) Exp4, (E) Exp5, (F) Exp6.


Figure 7-9. Relative penalty (RP) for using a proxy node selected by CC and ESS (1.2), when only 20% of the nodes can serve as proxies. Panels: (B) Exp2, (C) Exp3, (D) Exp4, (E) Exp5, (F) Exp6.


Table 7-3. Relative and absolute penalties of CC and ESS (1.2) for different fractions of proxy nodes

                                     % of proxy nodes
                                100                  30                   20
Network          %-ile        CC       ESS(1.2)    CC       ESS(1.2)    CC       ESS(1.2)
Exp1             25     RP    0.076    0.117       0.085    0.079       0.101    0.083
(With PNS)              AP    8.732    14.296      11.281   9.807       13.116   10.515
                 50     RP    0.166    0.213       0.194    0.167       0.232    0.170
                        AP    19.509   25.787      25.114   21.363      29.639   21.948
                 75     RP    0.337    0.427       0.472    0.368       0.558    0.377
                        AP    35.357   43.622      49.260   40.266      58.390   40.612
                 90     RP    0.742    1.008       1.236    0.934       1.402    0.891
                        AP    63.403   71.118      92.470   69.599      109.546  77.211
Exp2             25     RP    0.106    0.133       0.132    0.131       0.124    0.090
(Without PNS)           AP    12.986   17.077      18.623   16.618      18.315   11.555
                 50     RP    0.225    0.245       0.270    0.240       0.274    0.187
                        AP    26.573   30.247      36.405   31.771      38.277   23.158
                 75     RP    0.470    0.537       0.695    0.513       0.965    0.415
                        AP    49.051   51.238      67.040   52.404      108.508  42.939
                 90     RP    1.129    1.373       2.482    1.278       4.210    1.097
                        AP    92.258   110.227     184.208  113.434     214.333  95.353

To improve latency of overlay routing, I implemented Proximity Neighbor Selection (PNS) based on latency estimates from network coordinates. Experiments with IPOP overlays on PlanetLab show up to 40% reduction in average overlay latency. The lower overlay latency in turn also results in better response times for DHT-based systems that rely on overlay routing of keys.

On the other hand, to provide low end-to-end latency when direct communication is not possible, techniques have been presented to discover suitable proxy nodes that can be used to set up a 2-hop path between the end nodes. It is observed that a low-overhead technique based on sampling of existing connections (CC) results in lower end-to-end latency than search based on network coordinates (ESS), when all nodes in the network are willing to serve as proxies. However, practical deployments of IPOP overlays have revealed scenarios where only a fraction of the nodes can be used as proxy nodes, based on their capabilities to efficiently route IPOP traffic. In scenarios where less than 30% of the


My research has leveraged a combination of several scientific methods: (1) validation of existing techniques, (2) presenting novel ideas and evaluating their efficacy using simulations, (3) working implementations of research contributions and (4) large-scale experiments to demonstrate the operation of implemented techniques.

I have addressed the problem of providing bi-directional network connectivity among wide-area hosts behind NATs and firewalls, to support unmodified distributed applications over the wide area. I have presented a self-managing virtual network (IPOP) that aggregates wide-area hosts into a private network with decoupled address space management. The virtual network is functionally equivalent to a Local-area network (LAN) environment where a wealth of existing, unmodified IP-based applications can be deployed. The IPOP virtual network tunnels the traffic generated by applications over a P2P-based overlay. IPOP nodes self-configure virtual IP addresses using a DHCP implementation over a Distributed Hash Table (DHT), and self-configure IP tunnels to connect to other nodes on the network. The virtual network provides a mechanism to selectively establish 1-hop overlay links between communicating nodes, which self-optimize the virtual network with respect to overlay link latency and bandwidth.

Together with VMs for software dissemination, IPOP facilitates creation of homogeneously configured wide-area clusters of Virtual Workstations (called WOWs). These systems can be programmed using existing batch schedulers and middleware, and support checkpoint/migration of distributed applications across domains. WOW distributed systems provide an excellent infrastructure for deployment of desktop grids and cross-domain collaboration, where new nodes can be added by simply downloading a VM image and instantiating it.

The WOW techniques have resulted in an easily deployable and highly usable VM appliance that configures ad hoc Condor pools on wide-area hosts for high-throughput


17], and from users at University of Florida.

In support of IPOP overlays for end-users, I have also addressed interesting research problems from different areas, structured P2P systems and wide-area resource discovery, amongst others. Deploying structured P2P systems over the wide area is a well-recognized challenge: connectivity constraints such as symmetric NATs and Internet route outages affect overlay structure maintenance, often leading to inconsistent routing decisions. I have presented generally applicable techniques to improve routability of structured P2P systems, thus benefiting the applications of these systems. In several large-scale distributed systems (including IPOP), when direct communication is not possible between end nodes, it is possible to set up 2-hop communication through a suitably chosen proxy node. Different techniques have been investigated to discover proxies that minimize end-to-end latency, and their efficacy has been evaluated under different scenarios using experiments on PlanetLab, a testbed that is representative of the Internet.


Network Address Translators (NATs) first became popular as a way to deal with the shortage of IPv4 addresses and also to avoid the difficulty of reserving IP addresses for building local-area networks. According to RFC 1918, three blocks of IPv4 addresses (10.0.0.0 - 10.255.255.255, 172.16.0.0 - 172.31.255.255 and 192.168.0.0 - 192.168.255.255) have been reserved for private networks, and are not used by hosts on the public Internet.

NATs are used to provide Internet connectivity to hosts in such private networks. A NAT router has two network interfaces: one connected to the private network and the other connected to the Internet with one or more public IP address(es). As traffic passes from the private network to the Internet, the source IP/port in each packet is translated to a NAT-assigned public IP/port. The NAT tracks these mappings from internal private IP/port to public IP/port. When a reply returns to the NAT, it uses the tracking data it stored during the outbound phase to rewrite the destination IP/port. To a system on the Internet, the router itself appears to be the source/destination for this traffic. NATs have become a standard feature in routers for home and small-office Internet connections, where the price of extra IP addresses would often outweigh the benefits.

Based on the assignment of public IP and port, and the treatment of inbound packets, the following NAT behavior is observed for UDP traffic:

1. 2. 3.


4.

Since UDP is stateless, NATs do not do much communication state tracking. Through a carefully crafted exchange of packets, two nodes can punch "holes" in their local NATs, and then communicate directly without any external proxying. This hole-punching technique is used in the STUN protocol [21], which we now describe.

When a node A behind a NAT wants to communicate with a node B behind another NAT, it contacts the STUN server, which then communicates back to A the external IP/port that it had recorded for node B. At the same time, the STUN server also informs the node B about the external IP/port of node A. The node B then sends a message to A's external IP/port, which punches a hole in B's local NAT that will later allow all packets from A's external IP/port. The packets sent out by A to B's external IP/port create a hole in A's NAT, thus allowing packets coming in from B.
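The exchange above can be traced with a toy Python model of two NATs. This is only a sketch under stated assumptions: each NAT reuses one public port per internal endpoint (cone behavior) and admits an inbound packet only from a remote endpoint that the internal host has already sent to. The class `ToyNat`, the addresses (taken from the documentation ranges) and the port numbers are hypothetical, and no real packets are sent.

```python
from typing import Dict, Optional, Set, Tuple

Endpoint = Tuple[str, int]  # (IP address, UDP port)


class ToyNat:
    """Cone NAT with address- and port-restricted inbound filtering (assumed model)."""

    def __init__(self, public_ip: str, first_port: int = 40000) -> None:
        self.public_ip = public_ip
        self.next_port = first_port
        self.mapping: Dict[Endpoint, int] = {}      # internal endpoint -> public port
        self.reverse: Dict[int, Endpoint] = {}      # public port -> internal endpoint
        self.holes: Set[Tuple[int, Endpoint]] = set()  # (public port, remote endpoint)

    def outbound(self, src: Endpoint, dst: Endpoint) -> Endpoint:
        """Translate an outgoing packet and remember the remote endpoint (the hole)."""
        if src not in self.mapping:
            self.mapping[src] = self.next_port
            self.reverse[self.next_port] = src
            self.next_port += 1
        port = self.mapping[src]
        self.holes.add((port, dst))
        return (self.public_ip, port)

    def inbound(self, src: Endpoint, dst_port: int) -> Optional[Endpoint]:
        """Return the internal destination, or None if the packet is dropped."""
        if (dst_port, src) in self.holes:
            return self.reverse[dst_port]
        return None


server: Endpoint = ("203.0.113.1", 3478)
a, nat_a = ("10.0.0.2", 5000), ToyNat("198.51.100.10")
b, nat_b = ("192.168.0.7", 6000), ToyNat("192.0.2.20")

# Both nodes contact the server, which learns their external endpoints.
a_pub = nat_a.outbound(a, server)
b_pub = nat_b.outbound(b, server)

blocked_before = nat_b.inbound(a_pub, b_pub[1])   # no hole yet: dropped
nat_b.outbound(b, a_pub)                           # B -> A punches a hole in B's NAT
first_from_b = nat_a.inbound(b_pub, a_pub[1])      # A has not sent yet: dropped at A's NAT
nat_a.outbound(a, b_pub)                           # A -> B punches a hole in A's NAT
delivered_to_b = nat_b.inbound(a_pub, b_pub[1])    # passes through B's hole
delivered_to_a = nat_a.inbound(b_pub, a_pub[1])    # later packets from B now pass
```

The model does not cover symmetric NATs, which would assign a different public port for each destination and therefore invalidate the endpoint learned through the server.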


STUNT protocols [22] exist for NAT traversal using TCP, but require techniques like packet sniffing/spoofing that need superuser privileges.

NAT traversal with TCP is further complicated by the fact that, unlike UDP, a different local port is chosen for every outgoing TCP connection. Even though the local NAT is of Cone type, it sees a different internal port for each new connection, which it maps differently.


Synthetic coordinates allow Internet hosts to predict round-trip times (RTTs) to other hosts as follows: hosts compute their coordinates in some space such that the distance between two hosts' synthetic coordinates predicts the RTT between them on the Internet. A coordinate system is particularly useful in server selection to fetch replicated data, especially when the number of servers is large and the amount of data is small. In such cases, the cost of explicit RTT measurements can easily outweigh the benefits of exploiting proximity information.

Vivaldi [28] is a simple, light-weight and fully distributed algorithm that allows hosts to compute synthetic coordinates by communicating with only a few other hosts. The algorithm does not require any fixed network infrastructure or distinguished hosts.

Let $L_{ij}$ be the actual RTT between nodes i and j, and $x_i$ be the coordinates assigned to node i. The total prediction error in the coordinates is given as follows:

$E = \sum_i \sum_j (L_{ij} - \lVert x_i - x_j \rVert)^2$ (B-1)

where $\lVert x_i - x_j \rVert$ is the distance between the coordinates of nodes i and j. This error function corresponds to the energy stored in a spring network connecting the nodes to each other, so minimizing that energy is equivalent to minimizing the prediction error.


The force that the spring between i and j exerts on i is:

$F_{ij} = (L_{ij} - \lVert x_i - x_j \rVert) \times u(x_i - x_j)$ (B-2)

The scalar quantity $(L_{ij} - \lVert x_i - x_j \rVert)$ is the displacement of the spring from rest, and $u(x_i - x_j)$ is a unit vector in the direction of the force on i.

To simulate the evolution of the spring network, the algorithm considers small intervals of time. At each interval, the algorithm moves each node i a small distance in the direction of each force $F_{ij}$ and then recomputes all forces. The coordinates at the end of the interval are: $x_i = x_i + \delta \cdot F_{ij}$, where $\delta$ is the length of the interval.

Each node in the network simulates its own movement in the spring system. Each node maintains its current coordinates, starting with coordinates at the origin. Whenever a node communicates with another node, it measures the RTT to that node and also learns that node's current coordinates. In response to such a sample, a node pushes itself for a short time in the direction of the force computed by Equation B-2; each such movement reduces the node's error with respect to the other node. As nodes communicate with each other, they converge to coordinates that predict RTT well.

1. Each node keeps track of its local prediction error using an exponentially weighted moving average. Each RTT sample carries the prediction error at the remote node. The time step is chosen as: $\delta \propto \frac{\text{local error}}{\text{local error} + \text{remote error}}$ (B-3)


2. 114] propose to use a non-linear moving percentile within a fixed window of RTT samples. It removes noise and also responds to actual changes in RTT. Overall, this technique provides improved accuracy and stability.
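The coordinate update of this appendix, together with the two refinements listed above, can be sketched in Python. The sketch makes several assumptions that are not taken from the IPOP implementation: the filter window (4 samples) and percentile (25th) are illustrative defaults, the constant `cc` that scales the time step is a tunable parameter, heights are omitted, and coincident nodes are pushed along a fixed axis so that the example stays deterministic. All names are hypothetical.

```python
import math
from collections import deque
from typing import List, Sequence


class MovingPercentileFilter:
    """Report a low percentile of the last `window` RTT samples (assumed parameters)."""

    def __init__(self, window: int = 4, percentile: float = 0.25) -> None:
        self.samples = deque(maxlen=window)
        self.percentile = percentile

    def update(self, rtt: float) -> float:
        self.samples.append(rtt)
        ordered = sorted(self.samples)
        return ordered[int(self.percentile * (len(ordered) - 1))]


def adaptive_delta(local_error: float, remote_error: float, cc: float = 0.25) -> float:
    """Time step proportional to local_error / (local_error + remote_error)."""
    total = local_error + remote_error
    return cc * local_error / total if total > 0 else 0.0


def vivaldi_step(xi: Sequence[float], xj: Sequence[float], rtt: float, delta: float) -> List[float]:
    """Move x_i by delta * F_ij, with F_ij = (rtt - ||x_i - x_j||) * u(x_i - x_j)."""
    diff = [a - b for a, b in zip(xi, xj)]
    norm = math.sqrt(sum(d * d for d in diff))
    # Coincident nodes have no defined direction; a fixed axis is used here.
    unit = [d / norm for d in diff] if norm > 0 else [1.0] + [0.0] * (len(diff) - 1)
    displacement = rtt - norm
    return [a + delta * displacement * u for a, u in zip(xi, unit)]


# A single 900 ms outlier is suppressed by the filter.
rtt_filter = MovingPercentileFilter()
filtered = [rtt_filter.update(s) for s in (100.0, 102.0, 900.0, 101.0, 99.0)]

# One update: the predicted distance is 50 ms but the measured RTT is 100 ms.
delta = adaptive_delta(local_error=1.0, remote_error=1.0, cc=0.5)
moved = vivaldi_step([0.0, 0.0], [30.0, 40.0], rtt=100.0, delta=delta)
new_distance = math.hypot(moved[0] - 30.0, moved[1] - 40.0)
```

In this example the node moves away from its peer, so the predicted distance grows from 50 to 62.5 and the prediction error for the 100 ms sample shrinks from 50 to 37.5.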


In the simplest form, resource discovery is possible by querying a subset of nodes in the network for their attributes, gathering all the results at the source node, and then computing an aggregate result. The resource discovery framework in IPOP allows for parallel query execution and result aggregation by building a tree of candidate nodes rooted at the source node, propagating the query down the tree in parallel and, as the results propagate up the tree, aggregating them at each intermediate node. Such a query can be described as a map-reduce computation that executes in the following two phases (Figure C-1):

1. C-1 A). Each node in the tree computes a local result (called map_result), and uses it to initialize the reduce_result. It then dynamically computes the list of its children and initiates similar map-reduce computations at each child node, with possibly different parameters. The node then starts waiting for child computations to return. In case the node does not have any children, it returns the reduce_result to its parent.
2. C-1 B), and aggregation takes place at each intermediate node. On getting the result of a child computation, the parent updates its reduce_result. When all child computations have returned their results, the current reduce_result is returned back to the parent.

1. args): Computes the map_result, as the query traverses down the tree.
2. args): Computes the list of immediate children nodes of the current node, based on the arguments gen_args. Returns a list of children, and arguments for their corresponding map-reduce computations. Returns an empty list if the node has no children.
3. args, current_reduce_result, child_result, [out] done): Invoked on getting a result from a child. Computes an aggregation using the current value of reduce_result and the child_result. Also returns an out parameter if certain termination criteria are met, indicating that there is no need to wait for results from other children and the current reduce_result can be returned right away to the parent.

Figure C-1. Execution of a map-reduce computation over a tree of nodes. Panel B: Reduce phase.

1. args). This tree corresponds to the overlay route to a destination. The method returns a list containing the single node that is closest to the destination in the local connection table. In case there is no node closer to the destination than the current node, it returns an empty list.

This tree allows computing statistics about an overlay route between two nodes. For example, it is possible to aggregate all intermediate node addresses into a list, by specifying a Map function that returns a list containing the current node address, and a Reduce function that simply does list concatenation. To count the number of tunnel edges in an overlay route, provide a Map function that returns 1 if the


2. arg), starting at the current node A (see Figure C-2). To compute the children at each node, the following algorithm is used.

To broadcast over a region of the ring starting at the current node A and ending at a node B, the node determines all its connections in the region [A, B), say $c_1, c_2, c_3, \ldots, c_m$. The node then assigns to $c_i$ the segment $[c_i, c_{i+1}]$. The process continues until the current node is the only node in its assigned range. It can be shown that given O(log(n)) connections at each node, the maximum depth of the tree is O(log(n)), for a range of size n.

To execute the computation over an arbitrary range [C, D) (C is not the current node), it is required to determine the first node C' in this range, which is doable by using greedy routing, and then initiate the computation at that node.

Figure C-2. Bounded broadcast over a segment of the P2P ring

1.
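The child-generation rule for the bounded broadcast, combined with the map and reduce phases, can be sketched in Python. The sketch assumes integer identifiers on a linear range (no wrap-around), half-open segments in which the last connection inherits the end of its parent's range, a connection from every node to its successor so that the whole range is reachable, and sequential recursion in place of the parallel execution of the real framework. The function names and the sample connection table are hypothetical.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")


def children(node: int, end: int, connections: Dict[int, List[int]]) -> List[Tuple[int, int]]:
    """Connections inside (node, end), each paired with the end of the segment it covers."""
    inside = sorted(c for c in connections[node] if node < c < end)
    bounds = inside[1:] + [end]  # c_i covers [c_i, c_{i+1}); the last one covers up to `end`
    return list(zip(inside, bounds))


def map_reduce(node: int, end: int, connections: Dict[int, List[int]],
               map_fn: Callable[[int], T], reduce_fn: Callable[[T, T], T]) -> T:
    """Run map at this node, recurse into each child segment, and fold the results."""
    result = map_fn(node)
    for child, child_end in children(node, end, connections):
        child_result = map_reduce(child, child_end, connections, map_fn, reduce_fn)
        result = reduce_fn(result, child_result)
    return result


# Each node knows its successor (near connection); nodes 0 and 40 also hold a far connection.
connections = {0: [10, 40], 10: [20], 20: [30], 30: [40],
               40: [50, 70], 50: [60], 60: [70], 70: []}

visited = map_reduce(0, 80, connections, lambda n: [n], lambda acc, child: acc + child)
count = map_reduce(0, 40, connections, lambda n: 1, lambda acc, child: acc + child)
```

Collecting identifiers with list concatenation shows that every node in [0, 80) is reached exactly once, and the counting query over [0, 40) stays inside its bound even though node 30 holds a connection to node 40.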


3. identifier). The Reduce function merges the child list of tuples with the current list to build a single list of (at most K) tuples with the best ranks.
4.


[1] I. Foster and C. Kesselman, "Globus: A metacomputing infrastructure toolkit," Intl. Journal of Supercomputer Applications, vol. 11, no. 2, pp. 115-128, 1997.
[2] D. P. Anderson, J. Cobb, E. Korpela, M. Lebofsky, and D. Werthimer, "SETI@Home: An experiment in public-resource computing," Communications of the ACM, vol. 45, no. 11, pp. 56-61, 2002.
[3] M. W. Chang, W. Lindstrom, A. J. Olson, and R. K. Belew, "Analysis of HIV wild-type and mutant structures via in silico docking against diverse ligand libraries," Journal of Chemical Information and Modeling, vol. 47, no. 3, pp. 1258-1262, 2007.
[4] K. Keahey, I. Foster, T. Freeman, X. Zhang, and D. Galron, "Virtual workspaces in the grid," in Proc. of Europar, Lisbon, Portugal, Sep 2005.
[5] I. Krsul, A. Ganguly, J. Zhang, J. A. B. Fortes, and R. J. Figueiredo, "VMPlants: Providing and managing virtual machine execution environments for grid computing," in Proc. of SC2004, Pittsburgh, PA, Nov 2004.
[6] S. Son and M. Livny, "Recovering internet symmetry in distributed computing," in Proc. of the 3rd Intl. Symp. on Cluster Computing and the Grid, May 2003.
[7] S. Son, B. Allcock, and M. Livny, "Codo: Firewall traversal by cooperative on-demand opening," in Proc. of 14th Intl. Symp. on High Performance Distributed Computing (HPDC), 2005.
[8] J. Maassen and H. E. Bal, "Smartsockets: Solving the connectivity problems in grid computing," in Proc. of Symp. on High Performance Distributed Computing, Monterey Bay, CA, Jun 2007.
[9] D. Anderson, "BOINC: A system for public-resource computing and storage," in Proc. of the 5th Intl. Workshop on Grid Computing (GRID-2004), Pittsburgh, PA, Nov 2004, pp. 4-10.
[10] P. Barham, B. Dragovic, K. Fraser, S. Hand, et al., "Xen and the art of virtualization," in Proc. of the 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, 2003, pp. 164-177.
[11] R. Goldberg, "Survey of virtual machine research," IEEE Computer Magazine, vol. 7, no. 6, pp. 34-45, 1974.
[12] J. Sugerman, G. Venkitachalan, and B. H. Lim, "Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor," in Proc. of the USENIX Annual Technical Conference, Jun 2001.


[13] A. Sundararaj and P. Dinda, "Towards virtual networks for virtual machine grid computing," in Proc. of the 3rd USENIX Virtual Machine Research and Technology Symp., San Jose, CA, May 2004.
[14] X. Jiang and D. Xu, "Violin: Virtual internetworking on overlay infrastructure," in Proc. of the 2nd Intl. Symp. on Parallel and Distributed Processing and Applications, Dec 2004.
[15] M. Tsugawa and J. A. B. Fortes, "A virtual network (ViNe) architecture for grid computing," in Proc. of the IEEE Intl. Parallel and Distributed Processing Symp. (IPDPS), Rhodes, Greece, Jun 2006.
[16] A. Ganguly, A. Agrawal, P. O. Boykin, and R. J. Figueiredo, "IP over P2P: Enabling self-configuring virtual IP networks for grid computing," in Proc. of the IEEE Intl. Parallel and Distributed Processing Symp. (IPDPS), Rhodes, Greece, Apr 2006.
[17] G. Klimeck, "NanoHUB.org tutorial: Education simulation tools," in Proc. of IEEE Conference on Nano/Micro Engineered and Molecular Systems (NEMS), Bangkok, Thailand, Jan 2007.
[18] M. Litzkow, M. Livny, and M. Mutka, "Condor - A hunter of idle workstations," in Proc. 8th IEEE Intl. Conference on Distributed Computing Systems (ICDCS), Jun 1988.
[19] R. Raman, M. Livny, and M. Solomon, "Matchmaking: Distributed resource management for high throughput computing," in Proc. of the 7th IEEE Intl. Symp. on High Performance Distributed Computing (HPDC), Chicago, IL, Jul 1998.
[20] T. Tannenbaum, D. Wright, K. Miller, and M. Livny, Beowulf Cluster Computing with Linux. The MIT Press, 2002, ch. Condor - A Distributed Job Scheduler.
[21] J. Rosenberg, J. Weinberger, C. Huitema, and R. Mahy, "RFC 3489 - STUN - Simple Traversal of User Datagram Protocol Through Network Address Translators," Mar 2003. [Online]. Available: http://www.ietf.org/rfc/rfc3489.txt Last accessed: Jul 2008.
[22] S. Guha and P. Francis, "Characterization and measurement of TCP traversal through NATs and firewalls," in Proceedings of the Internet Measurement Conference, Berkeley, CA, Oct 2005.
[23] S. Guha, N. Daswani, and R. Jain, "An experimental study of the Skype peer-to-peer VoIP system," in Proc. of the International Workshop on Peer-to-Peer Systems (IPTPS), Santa Barbara, CA, Feb 2006.
[24] C. Sapuntzakis, D. Brumley, R. Chandra, N. Zeldovich, J. Chow, M. S. Lam, and M. Rosenblum, "Virtual appliances for deploying and maintaining software,"


[25] S.-M. Huang, Q. Wu, and Y.-B. Lin, "Tunneling IPv6 through NAT with Teredo mechanism," in Proceedings of the International Conference on Advanced Information Networking and Applications, Taiwan, Mar 2005.
[26] M. J. Freedman, K. Lakshminarayanan, S. Rhea, and I. Stoica, "Non-transitive connectivity and DHTs," in Proc. of the 2nd USENIX Workshop on Real, Large Distributed Systems (WORLDS), San Francisco, CA, Dec 2005.
[27] K. Gummadi, R. Gummadi, S. Gribble, S. Ratnasamy, S. Shenker, and I. Stoica, "The impact of DHT routing geometry on resilience and proximity," in Proc. of ACM SIGCOMM, Karlsruhe, Germany, Aug 2003.
[28] F. Dabek, R. Cox, F. Kaashoek, and R. Morris, "Vivaldi: A decentralized network coordinate system," in Proc. of ACM SIGCOMM, Portland, OR, Aug 2004.
[29] B. Chun, D. Culler, T. Roscoe, A. Bavier, L. Peterson, M. Wawrzoniak, and M. Bowman, "PlanetLab: An overlay testbed for broad-coverage services," ACM SIGCOMM Computer Communication Review, vol. 33, no. 3, 2003.
[30] A. Denis, O. Aumage, R. Hofman, K. Verstoep, T. Kielmann, and H. E. Bal, "Wide-area communication for grids: An integrated solution to connectivity, performance and security problems," in Proc. of the 13th Intl. Symp. on High Performance Distributed Computing, Honolulu, Hawaii, Jun 2004.
[31] A. Sundararaj, A. Gupta, and P. Dinda, "Dynamic topology adaptation of virtual networks of virtual machines," in Proc. Seventh Workshop on Languages, Compilers and Run-time Support for Scalable Systems (LCR), Oct 2004.
[32] P. K. Gummadi, S. Saroiu, and S. D. Gribble, "A measurement study of Napster and Gnutella as examples of peer-to-peer file sharing systems," ACM SIGCOMM Computer Communication Review, vol. 32, no. 1, p. 82, Jan 2002.
[33] M. Ripeanu, A. Iamnitchi, and I. Foster, "Mapping the Gnutella network," IEEE Internet Communication, vol. 6, no. 1, pp. 50-57, Feb 2002.
[34] N. Leibowitz, M. Ripeanu, and A. Wierzbicki, "Deconstructing the Kazaa network," in Proceedings of 3rd Workshop on Internet Applications, San Jose, CA, Jun 2003.
[35] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "Oceanstore: An architecture for global-scale persistent storage," in Proc. of the Intl Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2000), Cambridge, MA, Nov 2000.

PAGE 147

P.DruschelandA.Rowstron,\PAST:Alarge-scale,persistentpeer-to-peerstorageutility,"inProc.ofthe8thWorkshoponHotTopicsinOperatingSystems(HotOS),SchossElmau,Germany,May2001. [37] F.Dabek,M.F.Kaashoek,D.Karger,R.Morris,andI.Stoica,\Wide-areacooperativestoragewithcooperativelesystem,"inProc.ofthe18thACMSymp.onOperatingSystemsPrinciples,Ban,Canada,Oct2001. [38] S.Ratnasamy,P.Francis,M.Handley,R.Karp,andS.Shenker,\Ascalablecontentaddressablenetwork,"inProc.oftheACMSIGCOMM2001,2001. [39] A.RowstronandP.Druschel,\Pastry:Scalable,decentralizedobjectlocationandroutingforlarge-scalepeer-to-peersystems,"inProc.oftheIFIP/ACMIntl.Conf.onDistributedSystemsPlatforms(Middleware),Heidelberg,Germany,Nov2001. [40] I.Stoica,R.Morris,D.Liben-Nowell,D.R.Karger,M.F.Kaashoek,F.Dabek,andH.Balakrishnan,\Chord:Ascalablepeer-to-peerlookupprotocolforinternetapplications,"IEEE/ACMTransactionsonNetworking,vol.11,no.1,pp.17{32,2003. [41] B.Y.Zhao,L.Huang,S.C.Rhea,J.Stribling,A.D.Joseph,andJ.D.Kubiatowicz,\Tapestry:Aglobal-scaleoverlayforrapidservicedeployment,"IEEEJ-SAC,vol.22,no.1,pp.41{53,Jan2004. [42] R.ekaAlbert,H.Jeong,andA.asloBarabasi,\Errorandattacktoleranceofcomplexnetworks,"Nature,vol.406,pp.378{381,2000. [43] M.Ripeanu,I.Foster,andA.Iamnitchi,\Mappingthegnutellanetwork:Propertiesoflarge-scalepeer-to-peersystemsandimplicationsforsystemdesign,"IEEEInternetComputingJournal,vol.6,no.1,2002,specialissueonpeer-to-peernetworking. [44] P.O.Boykin,J.S.A.Bridgewater,J.S.Kong,K.M.Lozev,B.A.Rezaei,andV.P.Roychowdhury,\AsymphonyconductedbyBrunet,"inarXiv:0709.4048v1,Sep2007. [45] J.Kleinberg,\Navigationinsmallworld,"Nature,vol.406,p.845,2000. [46] B.Ford,P.Srisuresh,andD.Kegel,\Peer-to-peercommunicationacrossnetworkaddresstranslators,"inProc.ofthe2005USENIXAnnualTechnicalConference(USENIX'05),Anaheim,California,Apr2005. [47] S.Guha,Y.Takeda,andP.Francis,\NUTSS:ASIPbasedapproachtoUDPandTCPconnectivity,"inProc.ofSpecialInterestGrouponDataCommunications(SIGCOMM)Workshops,Portland,OR,Aug2004,pp.43{48. 
[48] R.J.Figueiredo,P.Dinda,andJ.A.B.Fortes,\Acaseforgridcomputingonvirtualmachines,"inProc.ofthe23rdIEEEIntl.ConferenceonDistributedComputingSystems(ICDCS),Providence,RhodeIsland,May2003. 147

PAGE 148

S.Adabala,V.Chadha,P.Chawla,R.J.Figueiredo,J.A.B.Fortes,I.Krsul,A.Matsunaga,M.Tsugawa,J.Zhang,M.Zhao,L.Zhu,andX.Zhu,\Fromvirtualizedresourcestovirtualcomputinggrids:TheIn-VIGOsystem,"FutureGenerationComputingSystems,SpecialissueonComplexProblem-SolvingEnvironmentsforGridComputing,vol.21,no.6,Apr2005. [50] A.Shoykhet,J.Lange,andP.Dinda,\Virtuoso:Asystemforvirtualmachinemarketplaces."NorthwesternUniversity,Jul2004,technicalReportNWU-CS-04-39. [51] J.SmithandR.Nair,VirtualMachines:VersatilePlatformsforSystemsandProcesses.MorganKaufmann,2005. [52] I.FosterandA.Iamnitchi,\Ondeath,taxes,andtheconvergenceofpeer-to-peerandgridcomputing,"inProc.ofthe2ndIntl.WorkshoponPeer-to-PeerSystems(IPTPS),Berkeley,CA,Feb2003. [53] A.S.Cheema,M.Muhammad,andI.Gupta,\Peer-to-peerdiscoveryofcomputationalresourcesforgridapplications,"inProc.ofthe6thIEEE/ACMWorkshoponGridComputing,Seattle,WA,Nov2005. [54] A.Iamnitchi,I.Foster,andD.C.Nurmi,\Apeer-to-peerapproachtoresourcelocationingridenvironments,"inProc.ofthe11thSymp.onHighPerformanceDistributedComputing,Edinburgh,UK,Aug2002. [55] J.Cao,O.M.K.Kwong,X.Wang,andW.Cai,\Apeer-to-peerapproachtotaskschedulingincomputationgrid,"Intl.JournalofGridandUtilityComputing,vol.1,no.1,2005. [56] N.TherningandL.Bengtsson,\Jalapeno-Decentralizedgridcomputingusingpeer-to-peertechnology,"inProc.ofthe2ndConferenceonComputingFrontiers,Ischia,Italy,2005. [57] A.J.Chakravarti,G.Baumgartner,andM.Lauria,\Theorganicgrid:Self-organizingcomputationonapeer-to-peernetwork,"IEEETransactionsonSystems,Man,andCybernetics,vol.35,no.3,May2005. [58] N.Andrade,L.Costa,G.Germoglio,andW.Cirne,\Peer-to-peergridcomputingwiththeourgridcommunity,"inProc.ofthe23rdBrazilianSymp.onComputerNetworks,May2005. [59] N.A.Al-DmourandW.J.Teahan,\Parcop:Adecentralizedpeer-to-peercomputingsystem,"inProc.ofthe3rdIntl.Symp.onAlgorithms,ModelsandToolsforParallelComputingonHeterogenousNetworks,Jul2004. 
[60] R.Cox,A.Muthitacharoen,andR.Morris,\ServingDNSusingChord,"inProc.ofthe1stIntl.WorkshoponPeer-to-PeerSystems(IPTPS),Cambridge,MA,Mar2002. 148

PAGE 149

I.Stoica,D.Adkins,S.Zhuang,S.Shenker,andS.Surana,\InternetIndirectionInfrastructure,"IEEE/ACMTransactionsonNetworking,vol.12,no.2,pp.215{218,Apr2004. [62] J.Kannan,A.Kubota,andK.Lakshminarayan,\SupportinglegacyapplicationsoverI3."ComputerScienceDivision,UniversityofCalifornia,Berkeley,Jun2004,technicalreportno.UCB/CSD-04-1342. [63] L.ZhouandR.vanRenesse,\P6P:Apeer-to-peerapproachtointernetinfrastructure,"inProc.ofPeer-to-peerSystemsIII:ThirdIntl.Workshop(IPTPS),2004,p.75. [64] L.Zhou,R.vanRenesse,andM.Marsh,\ImplementingIPv6asapeer-to-peeroverlaynetwork,"in21stIEEESymp.onReliableDistributedSystems(SRDS),2002,p.347. [65] T.E.Anderson,D.E.Culler,andD.A.P.et.al,\Acasefornetworkofworkstations:NOW,"IEEEMicro,February1995. [66] C.Chiapusio.(2007,May)distributed.net.[Online].Available: http://distributed.net Lastaccessed:Jul2008. [67] B.Calder,A.A.Chien,J.Wang,andD.Yang,\Theentropiavirtualmachinefordesktopgrids,"inCSEtechnicalreportCS2003-0773,UniversityofCalifornia,SanDiego,SanDiego,CA,Oct2003. [68] X.Zhang,K.Keahey,I.Foster,andT.Freeman,\Virtualclusterworkspacesforgridapplications,"ArgonneNationalLaboratories,Tech.Rep.ANL/MCS-P1246-0405,Apr2005. [69] C.Clark,K.Fraser,S.Hand,J.Hansen,E.Jul,C.Limpach,I.Pratt,andA.Wareld,\Livemigrationofvirtualmachines,"inProc.2ndSymp.onNetworkedSystemsDesignandImplementation(NSDI),Boston,MA,May2005. [70] M.Lorch,D.Kafura,I.Fisk,K.Keahey,G.Carcassi,T.Freeman,T.Peremutov,andA.S.Rana,\AuthorizationandaccountmanagementintheOpenScienceGrid,"inProc.ofthe6thIEEE/ACMWorkshoponGridComputing,Nov2005. [71] D.Wolinsky,A.Agrawal,P.O.Boykin,J.Davis,A.Ganguly,V.Paramygin,P.Sheng,andR.Figueiredo,\Onthedesignofvirtualmachinesandboxesfordistributedcomputinginwideareaoverlaysofvirtualworkstations,"inProc.ofthe1stWorkshoponVirtualizationTechnologiesinDistributedComputing(VTDC),withSupercomputing,Tampa,FL,Nov2006. 
[72] T.BaileyandC.Elkan,\Fittingamixturemodelbyexpectationmaximizationtodiscovermotifsinbiopolymers,"inProc.oftheSecondIntl.ConferenceonIntelligentSystemsforMolecularBiology.MenloPark,CA:AAAIPress,1994,pp.28{36. 149

PAGE 150

G.Olsen,H.Matsuda,R.Haggstrom,andR.Overbeek,\fastDNAml:Atoolforconstructionofphyllogenetictreesofdnasequencesusingmaximumlikelihood,"Comput.Appl.Biosci.,vol.10,pp.41{48,1994. [74] C.Stewart,D.Hart,D.Berry,G.Olsen,E.Wernert,andW.Fischer,\ParallelimplementationandperformanceoffastDNAml-aprogramformaximumlikelihoodphylogeneticinference,"inProc.ofIEEE/ACMSupercomputingConference(SC01),2001. [75] K.Melero,M.Hardy,andM.Lucas,\Open-sourcesoftwaretechnologiesfordataarchivingandonlinegeospatialprocessing,"inProc.oftheIEEEGeoscienceandRemoteSensingSymposium,Toulouse,France,Jul2003. [76] V.Sunderam,J.Dongarra,A.Geist,andR.Manchek,\ThePVMconcurrentcomputingsystem:Evolution,experiences,andtrends,"ParallelComputing,vol.20,no.4,pp.531{547,Apr1994. [77] M.KozuchandM.Satyanarayanan,\Internetsuspend/resume,"in4thIEEEWorkshoponMobileComputingSystemsandApplications(WMCSA2002),20-21June2002,Callicoon,NY,USA.IEEEComputerSociety,2002,p.40. [78] C.Sapuntzakis,R.Chandra,B.Pfa,J.Chow,M.Lam,andM.Rosenblum,\Optimizingthemigrationofvirtualcomputers,"inProc.ofUSENIXOperatingSystemDesignandImplementation(OSDI),2002. [79] D.Becker,T.Sterling,D.Savarese,J.Dorband,U.Ranawake,andC.Packer,\Beowulf:Aparallelworkstationforscienticcomputation,"inProc.oftheIntl.ConferenceonParallelProcessing(ICPP),1995. [80] H.Bal,R.Bhoedjang,R.Hofman,C.Jacobs,T.Kielmann,J.Maassen,R.vanNieuwpoort,J.Romein,L.Renambot,T.Ruhl,R.Veldema,K.Verstoep,A.Baggio,G.Ballintijn,I.Kuz,G.Pierre,M.vanSteen,A.Tanenbaum,G.Doornbos,D.Germans,H.Spoelder,E.-J.Baerends,S.vanGisbergen,H.Afsermanesh,D.vanAlbada,A.Belloum,D.Dubbeldam,Z.Hendrikse,B.Hertzberger,A.Hoekstra,K.Iskra,D.Kandhai,D.Koelma,F.vanderLinden,B.Overeinder,P.Sloot,P.Spinnato,D.Epema,A.vanGemund,P.Jonker,A.Radulescu,C.vanReeuwijk,H.Sips,P.Knijnenburg,M.Lew,F.Sluiter,L.Wolters,H.Blom,C.deLaat,andA.vanderSteen,\ThedistributedASCIsupercomputerproject,"SIGOPSOper.Syst.Rev.,vol.34,no.4,pp.76{96,2000. 
[81] O.Aumage,R.F.H.Hofman,andH.E.Bal,\NetIbis:Anecientanddynamiccommunicationsystemforheterogenousgrids,"inProc.ofCCGrid2005,Cardi,UK,May2005. [82] D.Xu,P.Ruth,J.Rhee,R.Kennel,andS.Goasguen,\Autonomicadaptationofvirtualdistributedenvironmentsinamulti-domaininfrastructure,"inProc.ofIEEEIntlSymp.onHigh-PerformanceDistributedComputing(HPDC),HotTopicsSession,Paris,France,Jun2006. 150

PAGE 151

V.Lo,D.Zappala,D.Zhou,Y.Liu,andS.Zhao,\Clustercomputingonthey:P2Pschedulingofidlecyclesintheinternet,"inProc.ofthe3rdIntl.WorkshoponPeer-to-PeerSystems(IPTPS),SanDiego,CA,Feb2004. [84] A.S.GrimshawandW.A.Wulf,\Legion:Flexiblesupportforwide-areacomputing,"inProc.ofthe7thACMSIGOPSEuropeanWorkshop,Ireland,1996. [85] A.Ganguly,A.Agrawal,P.O.Boykin,andR.J.Figueiredo,\WOW:Self-organizingwideareaoverlaynetworksofvirtualworkstations,"inProc.oftheIEEEIntlSymp.onHigh-PerformanceDistributedComputing(HPDC),Paris,France,Jun2006. [86] M.Castro,P.Druschel,A.-M.Kermarrec,andA.Rowstron,\Scalableapplication-levelanycastforhighlydynamicgroups,"inProc.ofthe5thIntl.WorkshoponNetworkedGroupCommunications(NGC),Munich,Germany,Sep2003. [87] A.Mislove,A.Post,A.Haeberlen,andP.Druschel,\Experiencesinbuildingandoperatingepost,areliablepeer-to-peerapplication,"inProc.ofACMSIGOPS/EuroSysEuropeanConf.onComputerSystems,Leuven,Belgium,Apr2006. [88] S.Rhea,B.Godfrey,B.Karp,J.Kubiatowicz,S.Ratnasamy,S.Shenker,I.Stoica,andH.Yu,\Opendht:ApublicDHTserviceanditsuses."inProc.ofACMSIGCOMM,Philadelphia,PA,Aug2005. [89] S.Rhea,D.Geels,T.Roscoe,andJ.Kubiatowicz,\HandlingchurninaDHT,"inProc.ofUSENIXTechnicalConference,Jun2004. [90] A.Muthitacharoen,S.Gilbert,andR.Morris,\Etna:Afault-tolerantalgorithmforatomicmutabledhtdata,"inTechnicalReportMIT-LCS-TR-993,MIT-LCS,Jun2005. [91] Z.LiandM.Parashar,\Comet:Ascalablecoordinationspacefordecentralizeddistributedenvironments,"inProceedingsofthe2ndInternationalWorkshoponHotTopicsinPeer-to-PeerSystems(HOT-P2P),SanDiego,CA,Jul2005. [92] D.Gelernter,\Generativecommunicationinlinda,"inACMTransactionsonProgrammingLanguageSystems,vol.7,no.1,1985,pp.80{112. [93] M.Castro,P.Druschel,A.-M.Kermarrec,andA.Rowstron,\Oneringtorulethemall:Servicediscoveryandbindinginstructuredpeer-to-peeroverlaynetworks,"inProc.oftheSIGOPSEuropeanWorkshop,France,Sep2002. 
[94] Z.XuandY.Hu,\SBARC:Asupernodebasedpeer-to-peerlesharingsystem,"inProc.ofthe8thIEEEInternationalSymposiumonComputersandCommunication(ISCC),Antalaya,Turkey,Jun2003. [95] B.Y.Zhao,Y.Duan,L.Huang,A.D.Joseph,andJ.D.Kubiatowicz,\Brocade:Landmarkroutingonoverlaynetworks,"inInProc.oftheInternationalWorkshoponPeer-to-PeerSystems(IPTPS),Cambridge,MA,Mar2002. 151

PAGE 152

A.R.Bharambe,M.Agrawal,andS.Seshan,\Mercury:Supportingscalablemulti-attributerangequeries,"inProc.ofACMSIGCOMM,Portland,OR,2004. [97] D.Oppenheimer,J.Albrecht,D.Patterson,andA.Vahdat,\DistributedresourcediscoveryonplanetlabwithSWORD,"inProc.oftheACM/USENIXWorkshoponReal,LargeDistributedSystems(WORLDS),SanFrancisco,CA,Dec2004. [98] C.SchmidtandM.Parashar,\Enablingexiblequerieswithguaranteesinp2psysteme,"inIEEENetworkComputing(SpecialIssueonInformationDisseminationontheWeb),vol.3,Jun2004,pp.19{26. [99] A.MisloveandP.Druschel,\Providingadministrativecontrolandautonomyinstructuredpeer-to-peeroverlays,"inProc.ofthe3rdIntl.WorkshoponPeer-to-peersystems,SanDiego,ca,Feb2004. [100] P.MaymounkovandD.Mazieres,\Kademlia:Apeer-to-peerinformationsystembasedonthexormetric,"inProc.oftheWorkshoponPeer-to-PeerSystems(IPTPS),Cambridge,MA,Mar2002. [101] M.Castro,M.Costa,andA.Rowstron,\Performanceanddependabilityofstructuredpeer-to-peeroverlays,"inProc.oftheConf.onDependableSystemsandNetworks,Jun2004. [102] M.Castro,P.Druschel,Y.C.Hu,andA.Rowstron,\Topology-awareroutinginstructuredpeer-to-peeroverlaynetworks,"inMicrosoftResearchMSR-TR-2002-82,Sep2002. [103] J.Liang,R.Kumar,andK.Ross,\TheFastTrackoverlay:Ameasurementstudy,"inComputerNetworks(SpecialIssueonOverlays),2005. [104] J.Li,J.Stribling,R.Morris,M.F.Kaashoek,andT.M.Gil,\Aperformancevs.costframeworkforevaluatingDHTdesigntradeosunderchurn,"inProc.IEEEINFOCOM,2005. [105] S.GerdingandJ.Stribling,\Examiningthetrade-osofstructruredoverlaysindynamicnon-transitivenetwork,"Dec2003,Classproject.[Online].Available: http://pdos.lcs.mit.edu/strib/docs/projects/networking fall2003.ps Lastaccessed:Jul2008. [106] B.A.Miller,T.Nixon,C.Tai,andM.D.Wood,\HomenetworkingwithUniversalPlugandPlay,"IEEECommunicationsMagazine,vol.39,no.12,pp.104{109,Dec2001. [107] A.Ganguly,D.Wolinsky,P.O.Boykin,andR.J.Figueiredo,\Decentralizeddynamichostcongurationinwide-areaoverlaysofvirtualworkstations,"inProc.ofthePCGridworkshop,LongBeach,CA,Mar2006. 152

PAGE 153

B.Ford,\UnmanagedInternetProtrocol:Tamingtheedgenetworkmanagementcrisis,"inProc.oftheWorkshoponHotTopicsinNetworks(HotNets),Cambridge,MA,Nov2003. [109] W.ChenandX.Liu,\Enforcingroutingconsistencyinstructuredpeer-to-peeroverlays:Shouldweandcouldwe?"inInProc.oftheWorkshoponPeer-to-PeerSystems(IPTPS),SantaBarbara,CA,Feb2006. [110] J.Aspnes,Z.Diamadi,andG.Shah,\Fault-tolerantroutinginpeer-to-peersystems,"inProc.oftheSymp.onPrinciplesofDistributedComputing(PODC),Monterey,CA,Jul2002. [111] S.Ratnasamy,M.Handley,R.Karp,andS.Shenker,\Topology-awareoverlayconstructionandserverselection,"inProc.oftheIEEEINFOCOMM,NewYork,NY,2002. [112] T.S.E.NgandH.Zhang,\Predictinginternetnetworkdistancewithcoordinates-basedapproaches,"inProc.ofIEEEINFOCOMM,NewYork,NY,Jun2002. [113] M.Pias,J.Crowcroft,S.Wilbur,S.Bhatti,andT.Harris,\Lighthousesforscalabledistributedlocation,"inProc.oftheWorkshoponPeer-to-PeerSystems(IPTPS),Berkeley,CA,Feb2003. [114] P.Pietzuch,J.Ledlie,andM.Seltzer,\SupportingnetworkcoordinatesonPlanetLab,"inProc.ofthe2ndUSENIXWorkshoponReal,LargeDistributedSystems(WORLDS),SanFrancisco,CA,Dec2005. [115] P.Pietzuch,J.Ledlie,M.Mitzenmacher,andM.Seltzer,\Network-awareoverlayswithnetworkcoordinates,"inProc.oftheWorkshoponDynamicDistributedSystems(IWDDS),Lisbon,Portugal,Jul2006. [116] A.IamnitchiandI.Foster,\Onfullydecentralizedresourcediscoveryingridenvironments,"inProc.oftheInternationalWorkshoponGridComputing,Pittsburgh,PA,Nov2004. [117] D.SpenceandT.Harris,\Distributedresourcediscoveryinxenoserveropenplatform,"inProc.oftheIEEEIntl.Symp.onHighPerformanceDistributedCOmputing(HDDC),Seattle,WA,Jun2003. [118] Y.Chawathe,S.Ramabhadran,S.Ratnasamy,A.LaMarca,S.Shenker,andJ.Hellerstein,\Acasestudyinbuildinglayereddhtapplications,"inProceedingsoftheConferenceonApplications,technologies,architectures,andprotocolsforcomputercommunications.NewYork,NY,USA:ACM,2005,pp.97{108. 153


Arijit Ganguly grew up in New Delhi, the capital city of India. He attended the Indian Institute of Technology (IIT), Guwahati (India) for his undergraduate studies (B.Tech.) in computer science. He became interested in computer systems, distributed computing and networks, and later decided to pursue graduate studies. After his graduation from IIT Guwahati, Arijit joined the University of Florida in Fall 2002. In Fall 2003, Arijit started his PhD research under Dr. Renato Figueiredo at the Advanced Computing and Information Systems (ACIS) Laboratory, where he was later appointed as a Research Assistant (RA). At ACIS, he had the opportunity to do cutting-edge research on virtual machines, networks and P2P systems, and in the process to publish papers in highly acclaimed conferences and journals. Together with ACIS researchers, he has also been involved in the development of the open-source software IPOP, which is already being used by researchers at the University of Florida. To complement his academic experience, Arijit also completed summer internships at VMware, IBM Research and Microsoft in 2005, 2006 and 2007, respectively. Upon his graduation, Arijit plans to take up a full-time position in industry in the area of computer systems.