A Software-Defined Overlay Virtual Network with Self-Organizing Small-World Topology and Forwarding for Fog Computing

Material Information

A Software-Defined Overlay Virtual Network with Self-Organizing Small-World Topology and Forwarding for Fog Computing
Subratie, Kensworth
Place of Publication:
[Gainesville, Fla.]
University of Florida
Publication Date:
Physical Description:
1 online resource (152 p.)

Thesis/Dissertation Information

Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Figueiredo,Renato Jansen
Committee Co-Chair:
Fortes,Jose A
Committee Members:
Gilbert,Juan Eugene
Newman,Richard E
Graduation Date:


Subjects / Keywords:
computing -- fog -- iot
Electrical and Computer Engineering -- Dissertations, Academic -- UF
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Electrical and Computer Engineering thesis, Ph.D.


With the advent of the IoT era, the scale of interacting devices breaks the dominant cloud model. Cloud datacenter resources are being relocated in proximity to the data sources and sinks at the network's edge to mitigate the contention over network links back to the cloud. This approach, while necessary, is not without its challenges, as software-based orchestration of resources is as important -- if not more so -- at the edge. Additionally, traditional methodologies used in the datacenter are poorly suited to operation over the WAN. To address these challenges, this dissertation presents a software system for the creation and management of cyber-infrastructure useful for building edge networks and connecting them to the cloud. It provides flexible virtualized overlay networks, supporting dynamic membership, endpoint discovery, and secured communication. It is highly scalable and works for arbitrary network sizes. Furthermore, the software architecture makes it easily customizable and extensible, allowing components to be replaced by API-compliant alternatives. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis (Ph.D.)--University of Florida, 2019.
Adviser: Figueiredo,Renato Jansen.
Co-adviser: Fortes,Jose A.
Statement of Responsibility:
by Kensworth Subratie.

Record Information

Source Institution:
Rights Management:
Applicable rights reserved.
LD1780 2019 ( lcc )






© 2019 Kensworth C. Subratie


This dissertation enshrines the memories of your love, aspirations, and selfless sacrifices. Dedicated to Myrtle Doocran.


ACKNOWLEDGMENTS

I acknowledge and thank the many people who have influenced and contributed to my progress throughout my doctoral studies. Foremost, I thank my advisor, Dr. Renato Figueiredo, for his ongoing guidance through this journey. Next, I thank Dr. Jose Fortes for presenting the opportunities that have challenged me to grow as a researcher. I also thank the rest of my PhD committee for their valuable comments, feedback, and time. To my peers in the ACIS Lab, both its current and graduated members, alongside whom I worked over the years, I thank you for making my time at UF a very enjoyable experience. I would like to thank Dr. Paul Hansen, Dr. Cayelan Carey, and the members of the PRAGMA and CENTRA groups, whose collaborations and support have broadened my insight. Additionally, the experience I garnered from internships with the IBM, Intel, and NICT research groups has been invaluable and is sincerely appreciated. Finally, I must express my heartfelt gratitude to my family and loved ones for their ongoing patience, encouragement, support, and willingness to attempt to understand what I do and why I chose to do it.

This material is based upon work supported in part by the National Science Foundation under Grants No. 1527415, 1339737, 1234983 and 1550126. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
    Motivation
    Contribution

2 OVERVIEW
    Terminology
    Virtual Networks
    P2P Overlay Networks
    Software-Defined Networks
    Internet of Things
    Fog Computing
    Cloud Native Applications: Microservices
    Related Works

3 ON THE DESIGN AND IMPLEMENTATION OF IP OVER P2P OVERLAY VIRTUAL PRIVATE NETWORKS
    Background
    Core Abstractions and Architecture
    Decoupling Endpoint Discovery from Overlay
    Decoupling Control from Datapath
    Software-Defined Switching
    Concluding Remarks

4 TOWARDS DYNAMIC, ISOLATED WORK GROUPS FOR DISTRIBUTED IOT AND CLOUD SYSTEMS
    Background
    Architecture
        The Control Plane: IPOP Controller
        The Data Plane: Tincan
    Case Study
        GroupVPN
        SocialVPN
        Discussion
    Concluding Remarks

5 TOWARDS ISLAND NETWORKS: SDN-ENABLED VIRTUAL PRIVATE NETWORKS WITH PEER-TO-PEER OVERLAY LINKS FOR EDGE COMPUTING
    Background
        IoT Application Characteristics
        Model Characteristics
        Networking Implications
        Overlay and Island Networks Concepts
    An Architecture for Overlay Networks
        Controller
        Data Plane
    Experimental Evaluation
        Experimental Testbed
        Scenarios and Results
    Results and Analysis
    Concluding Remarks

6 BOUNDED FLOOD: SCALABLE LAYER 2 FORWARDING FOR DYNAMIC CLOUD-TO-EDGE NETWORK ENVIRONMENTS
    Background
    Design
        IPOP Controller
        SDN Controller
            Learning Table
            Flood Route and Bound
            Flooding Bound Algorithm
            Overlay Churn
    Evaluation
        Experiment Test Cases
        Testbed
    Results and Analysis
        Cost of soft state
        Join or Departure Cost
        Route Discovery Cost
        Verification of Correctness Tests
        On-demand Tunnels
        Network Switching Latency (Path Length in Hops)
        Dynamic Network (Churn)
        STP vs BF
    Concluding Remarks
        Lessons Learned
        Summary

7 GRAPLER: A USE CASE ON THE PRACTICAL APPLICATION OF IPOP
    Background
    Architecture and Design
        Overlay Virtual Network (IPOP)
        Workload Management (HTCondor)
        Experiment Management Tools (GEMT)
        GRAPLE R Language Package (GRAPLEr)
    Use Case Workflow
    Evaluation
    Concluding Remarks

8 CONCLUSION
    Contribution
    Future Work

LIST OF REFERENCES
BIOGRAPHICAL SKETCH


LIST OF TABLES

Table
5-1 Hardware specification for nodes used in experiment.
5-2 Ping latency test between hosts D and A showing round trip times. A total of 60 packets were transmitted and received.
5-3 Time to create a fully writable link between the IoT class device and each peer in the overlay network.
6-1 Legend of experiment configuration parameters.
6-2 Comparison of Overlay Average Path Length. The OND-enabled overlay uses k=4 while the overlay without OND uses k=7; both are of size n=128.
6-3 Maximum Path Lengths. Frequency and cumulative percentages, in overlay
7-1 GWS Application Programming Interface (API)


LIST OF FIGURES

Figure
1-1 Structure of the Dissertation
3-1 architecture and virtualized endpoint
3-2 Illustration of decoupling endpoint discovery from virtual network overlay in IPOP.
3-3 Decoupling of the IPOP node design into control and data path modules.
3-4 IPOP functions as a switch by extending a software bridge such as Open vSwitch.
4-1 Structural relationships between the CFx, the controller modules and Tincan.
4-2 Workflow for CBT Chaining.
4-3 Steps in overlay recovery using only successors vs both successors and chords.
5-1 An Overlay Network (ON).
5-2 IPOP Tunnel Architecture.
5-3 Experiment Testbed Structure.
6-1 A rendering of the logical view of the IPOP overlay topology supporting Bounded Flood.
6-2 A segment of an overlay with 128 nodes using a 7-bit address space.
6-3 The root and peer switch roles.
6-4 FRB Structure, types 1 and 2. FRBs are used by BF bridges to (1) exchange leaf MACs and (2) perform overlay broadcasts.
6-5 Partitioning from churn when using a single successor.
6-6 Resilience to partitioning by using multiple successors.
6-7 Bounded Flood Experiment Bridge Setup.
6-8 Cumulative percentages of average latency of BF and BF with on-demand tunnels.
6-9 Cumulative percentages of bandwidth for BF and BF with on-demand tunnels.
6-10 Bandwidth variation over time, STP and STP with OND.
6-11 Network Average Path Length vs Network Size.
6-12 Impact of churn on throughput between two overlay nodes measured over 300 seconds.
6-13 Histogram of latencies in BF and STP overlay. Larger values are better.
6-14 Histogram of latencies in BF and STP overlays.
7-1 System Architecture (GRAPLEr).
7-2 Workload Management (HTCondor).
7-3 GRAPLEr WEB Service (GWS).
7-4 GRAPLEr top level workflow chart.
7-5 GRAPLEr sweep job workflow chart.
7-6 Job runtimes for GRAPLEr HTCondor pool, compared to sequential execution times on CloudLab (SEQ Fast) and UF (SEQ Slow) slots.


LIST OF ABBREVIATIONS

API     Application Programming Interface
AV      Autonomous Vehicle
BF      Bounded Flood
CBT     Controller Brokered Task
CFx     Controller Framework
DoS     Denial of Service
FRB     Flooding Route and Bound
GEMT    GRAPLE Experiment Management Tools
GLM     General Lake Model
GRAPLE  GLEON Research And PRAGMA Lake Expedition
GRE     Generic Routing Encapsulation
GWS     GRAPLE WEB Service
ICE     Interactive Connectivity Establishment
IETF    Internet Engineering Task Force
IoT     Internet of Things
IPOP    IP Over P2P
LAN     Local Area Network
NAT     Network Address Translation
NFV     Network Function Virtualization
OSN     Online Social Network
OVS     Open vSwitch
P2P     Peer-to-Peer
SDN     Software-Defined Networking
STP     Spanning Tree Protocol
STUN    Session Traversal Utilities for NAT
TURN    Traversal Using Relays Around NAT
UUID    Universally Unique Identifier
VPN     Virtual Private Network
VxLAN   Virtual Extensible LAN
WRTC    Web Real-Time Communication
XMPP    Extensible Messaging and Presence Protocol


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

A SOFTWARE-DEFINED OVERLAY VIRTUAL NETWORK WITH SELF-ORGANIZING SMALL-WORLD TOPOLOGY AND FORWARDING FOR FOG COMPUTING

By
Kensworth C. Subratie
December 2019

Chair: Renato Figueiredo
Major: Electrical and Computer Engineering

The advent of virtualization and cloud computing has fundamentally changed the way in which distributed applications and services are deployed and managed. With the proliferation of IoT and mobile devices, the scale of interacting devices breaks the dominant cloud model, which becomes infeasible from the standpoint of bandwidth, latency, security, privacy, and policy enforcement. Consequently, virtualized systems akin to those offered by cloud providers are increasingly needed geographically near the edge, in close proximity to the data sources and sinks, to mitigate the contention over network links back to the cloud. Latency-sensitive, bandwidth-intensive applications can be relocated to resources at the edge near IoT devices to perform operations on high-volume sensor data (e.g., real-time high-definition camera feeds), a model referred to as fog computing. Not only is performance important, but a trustworthy network is fundamental to guaranteeing privacy and integrity at the network layer across all participating resources (e.g., IoT devices, cloud VMs, and containers on edge resources). This approach, while necessary, is not without its challenges, as software-based orchestration of resources is as important, if not more so, at the edge. Additionally, traditional methodologies used in the data center are poorly suited to operation over the Internet.

To address these challenges, this dissertation presents a software system for the creation and management of cyber-infrastructure useful for building edge networks and connecting them to the cloud. The system provides flexible virtualized overlay networks, supporting dynamic membership, endpoint discovery, and secured communication. It remains highly scalable as network size grows. Furthermore, the software architecture makes it easily customizable and extensible, allowing components to be replaced by API-compliant alternatives.


CHAPTER 1
INTRODUCTION

It is the mark of an educated mind to be able to entertain a thought without accepting it.
Aristotle

Motivation

We are on the cusp of a new era of smart devices, the Internet of Things, and smart spaces [1]. These ideas herald a degree of social and technological change that will have a profound impact on the way humans interact with technology and experience their everyday lives. While the information age introduced cyber systems, which replaced many real-world processes with electronic, online alternatives, there was still a clear delineation between the functions of the virtual, cyber world and those of the physical one. These distinctions are now being increasingly blurred by what are known as cyber-physical systems, which underscore the sense-analyze-actuate model [2]. The idea of pervasive computing is to eliminate the need for human intervention in the acquisition of data or the application of instructions [3], [4]. Motes, and other classes of smart devices for remote sensing, monitor various aspects of the world and our activities. This sensory data is transmitted to compute devices, where it is analyzed to understand what was observed and, ultimately, to make decisions. These decisions can be instructions to another class of devices called actuators, which complete the workflow by carrying out these instructions and, in doing so, interact with the physical world. The generalized sense-analyze-actuate model is widely applicable, and contemporary research applies it to various aspects of everyday life, including farming, autonomous vehicles, consumer commerce, healthcare, and law enforcement.


Transitioning from novel concepts to mainstream adoption requires fundamental advances in ubiquitous computing and an always-connected world. Billions of sensing, computing, and actuating devices are being deployed within and across communities, but the underlying communication infrastructure for interconnecting them with one another and with the cloud is also needed. Additionally, new classes of applications must be built that can leverage the new infrastructure and resources to address the increasingly sophisticated needs of the everyday consumer. Emerging prototypes of these applications already exhibit characteristics that are difficult to meet using existing cloud computing models [5], and new models are already being proposed to complement them. Approaches being investigated along these lines include edge, multi-access edge, cloudlet, mist, and fog computing [6]. These approaches use lightweight data center processing nodes to bring compute and short-term storage into closer proximity to the data sources and sinks. This avoids the latency and throughput penalties incurred from moving data across large geographic distances into high-contention areas. However, while distributed infrastructure addresses the data movement problem, it introduces an operation and management problem, which makes software-defined control an even more critical factor than in cloud-based data centers. It is necessary to interconnect all virtualized components to create a complete semblance of a virtualized computing environment. Unfortunately, software and methodologies designed for the data center are typically poorly suited for operation over the Internet due to constraints of the Internet Protocol.


The Internet's core protocol, IP version 4, was first described in IETF publication RFC 791 in 1981, but it was never designed for the scale and complexity of the modern Internet. While IP version 6 was ratified as an Internet Standard in July 2017, it has not yet seen wide deployment, leaving most of the Internet still operating on IPv4 [7], [8]. The scarcity of public IPv4 addresses and the prevalence of Network Address Translators (NATs) make peer communication difficult, as most devices do not possess an address that is routable over the public Internet. Furthermore, the lack of built-in security and privacy in the IPv4 protocol means that application- or transport-level mechanisms must be employed. Network virtualization stands at a unique point in the current virtualization landscape to deliver an open, flexible, and heterogeneous networking environment. While existing virtual private networks (VPNs) can be used to mitigate some of the hurdles, such as endpoint addressing and secured communication, current models are infeasible for operation and management at the proposed scale of future IoT applications. A decentralized and dynamic system that can virtualize addressable endpoints and provide secured communication is needed.

Contribution

The contribution of my PhD dissertation is the design and implementation of a software system for creating, operating, and managing networking cyber-infrastructure suitable for the cloud-to-thing continuum, i.e., fog computing. It leverages the previous generation of the open-source IP Over P2P (IPOP) system to provide: (1) endpoint identification and authentication, (2) private P2P communication, (3) a virtualized layer 2 (Ethernet) overlay, (4) multitenant networks with individual control, (5) dynamic membership, (6) a self-healing, resilient topology, (7) scalable loop-free forwarding, and (8) flexible operational configuration.


Figure 1-1 illustrates the document structure, and the remainder of the dissertation is organized as follows. Chapter 2 provides an overview of the concepts, technologies, and software leveraged by this work, discussed within the context of its stated goals. It also reviews existing related works. Chapter 3 presents the core concepts in the IPOP design and the evolution of its implementation leading to its current form. Chapter 4 [9] presents a novel framework that addresses the needs of extensibility and customizability for establishing overlay networks that bring together sensor, edge, and cloud devices under the same virtual private network. Chapter 5 [10] presents a novel approach that integrates both software-defined switches and overlay networks to create software-defined virtual networks across edge and cloud resources. Chapter 6 presents the Bounded Flood layer 2 switching protocol for IPOP's structured ring topology and makes the argument that it is a well-suited networking cyber-infrastructure for fog and IoT applications. Chapter 7 [11] presents and evaluates GRAPLEr, a system that uses overlay virtual networks for lake ecology simulations and that has also involved the use of IPOP in edge-to-cloud ecological forecasting. Chapter 8 summarizes and concludes the findings of this dissertation and looks at future directions for this system.


Figure 1-1. Structure of the Dissertation


CHAPTER 2
OVERVIEW

So complicated is the full story that the lay reader cannot see the wood for the trees. I have endeavored to render intelligible the broad effects.
Winston Churchill

This chapter presents an overview of the fundamental concepts on which the work of this dissertation is built. It defines terminology used throughout this document, particularly terms whose meanings carry nuances within and across fields of study. The subsequent sections provide a brief background on several technology concepts on which my contribution is built. Finally, a review of related works is covered.

Terminology

In this dissertation, I use the following terms as defined here. Motes are tiny computers for remote sensing. These devices are characterized by constrained compute, storage, and communication capabilities. A node is a logical entity in the network; it can be a physical machine, a virtual machine, or a container. Nodes are considered peers if the host systems, operating within the same logical context or overlay network, have equivalent capabilities. Nodes can be assigned endpoints, which are network identifiers that are uniquely addressable within their namespace, e.g., a MAC address or an IP address and port. A bridge or switch is a network device that connects devices along a layer 2 link to create a layer 2 network and is responsible for forwarding data to its destination. This forwarding process along a specified path is called switching or layer 2 routing, and the unit of communication used at the link layer is a frame. The disruptive effect of peers joining or leaving an overlay is referred to as churn.


Virtualization is the process of decoupling the logical or functional properties of a system from its physical realization to create a software abstraction or model of the system. It is an important methodology, as the fundamental computer system functions have been virtualized, e.g., CPU, memory, storage, network, and display. A network function is a physical networking appliance or the capability it embodies. Network Function Virtualization (NFV) is therefore the virtualization of network functions into software implementations. To illustrate, consider an Ethernet switch, which can exist as a physical device or as a memory-resident software entity executing its functionality on a CPU.

Virtual Networks

A virtual network is a model or representation of a communications network's functional capabilities. It is presented and accessed using software abstractions that decouple the services from the fixed characteristics of the physical infrastructure. These abstractions can then choose any combination of available methodologies to implement the end-to-end network services of their interface. A virtual network is intangible, or virtual, as it exists independently of the physical resource that it virtualizes. No physical network may be necessary at all if the entire system is modeled in the memory of a single host and all the components are themselves virtual. In other scenarios, there may not be a one-to-one mapping between the virtual and physical constructs: multiple physical links may be virtualized to appear as one, or as links of a different type. This is useful when a specialized physical infrastructure is needed but unavailable. Virtual networks can also be multiplexed over a single network, partitioning it into several discrete and independent ones.
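As a concrete illustration of multiplexing several virtual networks over one underlying network, the sketch below tags each Ethernet frame with a 24-bit network identifier using the 8-byte VXLAN header layout from RFC 7348. It is a minimal sketch under stated assumptions, not the mechanism used by the system described in this dissertation: the function names are hypothetical, and the outer UDP/IP encapsulation that a real VXLAN tunnel requires is omitted.

```python
import struct
from typing import Tuple

# "I" flag from RFC 7348: set when the VNI field is valid.
VXLAN_FLAG_VNI_VALID = 0x08


def vxlan_encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prefix a layer 2 frame with a VXLAN header carrying the given VNI.

    Hypothetical helper for illustration; outer UDP/IP headers are omitted.
    """
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: 8 flag bits followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decapsulate(packet: bytes) -> Tuple[int, bytes]:
    """Return (vni, inner_frame) so the receiver can pick the right virtual network."""
    if len(packet) < 8:
        raise ValueError("packet shorter than VXLAN header")
    flags_word, vni_word = struct.unpack("!II", packet[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8, packet[8:]
```

Because the identifier travels with every frame, two tenants can reuse the same MAC or IP address ranges on the same physical links without their traffic being confused.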


There are several techniques for virtualizing a network [12]. It can be accomplished by (1) creating virtual network devices, which act as the operating mechanism for accessing the network, e.g., a TAP/TUN device; (2) tunneling, or encapsulating one protocol within another so that an alternate medium can be used for interoperability, e.g., GRE [13] or VxLAN [14] tunnels; (3) manipulating the protocol headers at key points in transit to give the appearance of additional resources, e.g., VLAN [15]; (4) IP Address Management (IPAM) in conjunction with IP routing.

P2P Overlay Networks

An overlay network is regarded as any network built on top of another network. From this general definition, an IP network overlays an Ethernet network, and similarly an online social network overlays an IP network. An overlay is composed of (potentially independently operated) servers distributed across the Internet to provide infrastructure to applications by allowing them to take responsibility for special forwarding and handling of application data [16]. Peer-to-Peer (P2P) [17] refers to the direct interconnection between nodes acting as peers, where there is no distinction between clients and servers; a P2P network is built on such interconnects. P2P overlay networks are application-level implementations of value-added network services built on existing underlying communication infrastructure. They overlay the underlying general-purpose system networks (e.g., TCP/IP) and provide high-level, specialized functionalities (e.g., Distributed Hash Tables) [17]. They are inherently distributed systems without hierarchical organization or centralized control. Peers possess symmetry in their roles and form self-organizing networks overlaid on the Internet Protocol (IP). These properties are known for providing fault tolerance and massive scalability [17].
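As a rough illustration of technique (2) listed in this section, tunneling by encapsulation, the sketch below prefixes an Ethernet frame with the 8-byte VXLAN header described in RFC 7348. The function names are hypothetical and chosen only for this example, and the outer UDP/IP headers that a real tunnel endpoint would add are omitted.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field carries a valid value


def vxlan_encapsulate(vni: int, inner_frame: bytes) -> bytes:
    """Prefix an Ethernet frame with a VXLAN header (illustrative helper)."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Header layout: flags (1 byte), reserved (3), VNI (3), reserved (1).
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    # A real endpoint would now send this as the payload of a UDP datagram.
    return header + inner_frame


def vxlan_decapsulate(packet: bytes) -> tuple[int, bytes]:
    """Recover the virtual network identifier and the original frame."""
    flags_word, vni_word = struct.unpack("!II", packet[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return vni_word >> 8, packet[8:]
```

The 24-bit identifier is what lets many virtual layer 2 networks be multiplexed over one physical network: the receiving endpoint uses it to decide which virtual network the inner frame belongs to.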


There are two distinct classes of P2P overlay networks, unstructured and structured [17]. Unstructured P2P overlays make no attempt to create a specific logical structure in the way peers interconnect; neither do they attempt to place data at any special locations. Consequently, they use flooding or random walks to perform searches. In a structured P2P overlay, the network topology is carefully controlled, resulting in a specific and organized structure (e.g., a ring); content is placed at specific peers, and message routing is performed based on these expectations [17].

Software Defined Networks

The software-defined networking (SDN) [18] model separates the control logic (the control plane) from the underlying routers and switches that forward the traffic (the data plane). With the separation of the control and data planes, network switches become simple forwarding devices that are optimized for this specialized purpose. The complex control logic is implemented on general-purpose CPUs in a controller module, using a high-level programming language. This approach enables programmatic control of the network. Rapid application development tools and methodologies can then be used to simplify development of sophisticated features for policy enforcement and network configuration. This approach is beneficial as there are considerably more challenges (and hence changes) in the control plane than in the data plane. Increasing the flexibility and ease of solving network control problems and of deploying potential solutions for verification makes it easier to facilitate network evolution and innovation. Predominant contemporary approaches in SDN are OpenFlow [19] and P4 [20].
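The separation of control and data planes described above can be pictured with a small sketch. This is an abstract model only: the class and method names are hypothetical and do not correspond to the OpenFlow or P4 APIs. The switch does nothing but match packets against a rule table, while all policy decisions live in the controller.

```python
from dataclasses import dataclass, field


@dataclass
class FlowRule:
    match: dict        # header fields that must all equal the packet's fields
    out_port: int
    priority: int = 0


@dataclass
class Switch:
    """Data plane: looks up rules and holds no policy of its own."""
    table: list = field(default_factory=list)

    def install(self, rule: FlowRule) -> None:
        self.table.append(rule)
        # Keep the highest-priority rules first so they are matched first.
        self.table.sort(key=lambda r: r.priority, reverse=True)

    def forward(self, packet: dict):
        for rule in self.table:
            if all(packet.get(k) == v for k, v in rule.match.items()):
                return rule.out_port
        return None  # table miss: a real switch would consult its controller


class Controller:
    """Control plane: decides policy and pushes it to the switch."""

    def __init__(self, switch: Switch):
        self.switch = switch

    def allow_host(self, mac: str, port: int) -> None:
        self.switch.install(FlowRule({"eth_dst": mac}, port, priority=10))
```

Changing network behavior then means changing only the controller code; the forwarding loop in the switch stays the same.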


Internet of Things

The Internet of Things (IoT) is a paradigm in which things are connected to the Internet to facilitate their interactions and enable new types of services and applications [1], [21]. Things refers to a broad category of devices including everyday appliances, sensors, actuators, and smart devices, with varying degrees of capabilities, roles and connection mechanisms. IoT systems are heterogeneous; importantly, their things will be ubiquitous and pervasive, with billions of new devices predicted to be connected to the Internet. The IoT extends the power and reach of the existing Internet beyond computers and smartphones found in data centers, offices and homes, to a much broader range of environments, processes and devices [22].

Fog Computing

Fog computing [21], [23], [24] is a system architectural model that specifies how to distribute compute and storage resources over a wide area, and the mechanisms for interconnecting them for communication and interaction. It is also an architecture for control of the network itself and control of the cyber-physical systems connected by the network (e.g., using SDN). The fog's coverage includes: the cloud, the core (at the center), the metro (adjacent to the center), the edge (far from the center), clients (people and intelligent systems), and things. It is referred to as the cloud-to-thing continuum, as it covers an uninterrupted service range from the cloud at the center to things at the extreme edge [21], [23]. This ecosystem consists of three fundamental networking layers: the core, access, and extreme edge. The centralized cloud exists at the core network, and IoT devices are connected in local area networks at the extreme edge. The access network interconnects independent edge networks


among themselves as well as to the core, and it hosts fog resources such as switches, routers, compute, and storage servers. Although the cloud computing model promotes ubiquitous, on-demand network access to shared computing resources [25], the large number of devices introduced by IoT systems [26], [27] will outstrip the capabilities of traditional cloud computing data centers [5]. To meet IoT application performance requirements, fog computing bridges the gap between the cloud and end (IoT) devices by decentralizing compute, storage, networking and management across the network at nodes within close vicinity of the end devices. It is a layered model for enabling ubiquitous access to a shared continuum of scalable computing resources [28].

Cloud Native Applications Microservices

As applications evolve to provide increasingly complex functions, they have also become more distributed and rely on components that are deployed across wide geographic distances. Consequently, new application architectures and models are now in use to reflect this. One such approach is to design applications as a system of microservices interacting over the network, where a microservice represents a clearly defined and cohesive portion of application rules or logic [29]. Each microservice is packaged in a container for deployment and execution, i.e., dynamic orchestration. A container is a software abstraction designed for that purpose. The complete software environment necessary for the microservice is packed in a container, along with the microservice itself. This reduces deployment of the application in a distributed environment to placing containers on appropriate hosts across the fog and executing them. Contemporary approaches include [30]-[32] for container technology and Kubernetes [33] for container orchestration and management.


Related Works

The process of virtualizing a communication network using virtual network devices and protocol tunneling is widely used, as it is both powerful and flexible. Network Function Virtualization (NFV) [34] and SDN provide the means by which networks can be provisioned and operated entirely in software. This is an important criterion in cloud computing, where infrastructure as a service (IaaS) is necessary to meet the varying needs of cloud tenants. Virtual networks are now widely used to connect tenants both within and across data centers. In this section we review contemporary works in this area as well as approaches that are now considering the challenges of the fog.

The Container Network Interface (CNI) for Kubernetes is a set of specifications and associated libraries for developing compatible plugins that configure network interfaces in Linux containers. The functionality of the CNI specification is compact and only covers network connectivity of containers and deallocating resources when the container is deleted. Due to the popularity of Kubernetes, the CNI is supported by several commercial vendor implementations such as Flannel, Calico, Weave Net, Romana, and NSX [35]-[39]. These tunneling solutions vary in sophistication and in the number of supported virtualization levels.

CoreOS Flannel limits the scope of its functionality to providing a layer 3 IPv4 network between multiple nodes in a cluster. It gives an IP subnet to each host for use with container runtimes. Flannel does not control how containers are networked to the host, only how the traffic is transported between hosts. Its focus is on networking rather than network policy.


Project Calico is an open-source networking and network security solution for containers, virtual machines, and native host-based workloads. It virtualizes the IP layer and attempts to avoid tunneling when workload endpoints are addressable. It prioritizes this approach to avoid the inherent overhead associated with tunneling, but must fall back to VxLAN or IP-in-IP for scenarios where IP routing fails.

Weaveworks Weave Net uses VxLAN encapsulation to virtualize a layer 2 network and forwards traffic between nodes in a mesh network. Weave peers communicate their knowledge of the topology, so that all peers learn about the entire topology. Communication between peers occurs over TCP links using a spanning-tree-based broadcast with neighbor gossip.

Romana Cloud Native Networks uses layer 3 network techniques to build micro-segmented, cloud-native networks without a virtual network overlay. Micro-segmentation is a method of creating secure zones in data centers and cloud deployments by logically dividing the data center into distinct security segments down to the individual workload level, and then defining security controls and delivering services for each unique segment. Each micro-segment is made up of one or more network address blocks, and routes to these blocks are installed on hosts and network devices so that they can forward traffic directly to endpoints and enforce network policy without the overhead of encapsulation.

VMware NSX also provides network and security virtualization using micro-segmentation. Additionally, NSX provides a layers 2-4 networking and security virtualization platform (including switching, routing, load balancing, micro-segmentation, and distributed firewalling) by running these network layers in software, decoupled from the underlying physical hardware.


Midonet [40] provides virtualized layers 2-4 networking with distributed layer 2 switching and layer 3 routing, in addition to distributed load balancing and firewall services. Midonet uses VxLAN for its layer 2 networking and supports integration with VxGW (VXLAN Gateway) and hardware VTEP (VXLAN Tunnel End Point) as specified in [14].

IBM Dove and its successor SDN VE [41], [42] is an SDN-based network virtualization solution built around the concept of centrally controlled, platform-terminated overlays. It allows the creation of multiple, isolated, and dynamic virtual networks over shared physical infrastructure. It implements a proprietary NFV switch as the datapath, supporting both layer 2 and layer 3 virtualization. The system is managed from a centralized console that configures each virtual network, and controls and disseminates policies to the virtual switches. The administrator uses an intent-based modeling abstraction for specifying the network as a policy-governed service, thus expressing the functionality of the desired virtual network.

Google Andromeda [43] is the network virtualization environment for Google Cloud Platform. It is a multicomponent, distributed system using a hierarchical architecture that provides layers 2-4 virtualization with features such as switching, routing, monitoring, and firewall protection. To support throughput and latency similar to what is available from the underlying hardware, Andromeda uses a data plane that combines a set of user-space packet processing paths to handle specialized workloads. The VM host Fast Path is the first path in the data plane hierarchy and targets raw performance over flexibility. Per-packet work that is CPU-intensive or without strict latency targets is performed on Coprocessors, which run on the host in per-VM floating


threads. Hoverboards are dedicated gateways that perform virtual network routing and process the long tail of mostly idle flows. The Andromeda control plane is designed around a global hierarchy, where every cluster runs a separate control stack for isolation. It maintains information about where every VM in the network currently runs, and through a hierarchy of controllers, selected subsets of the controller state are installed at individual servers.


CHAPTER 3
ON THE DESIGN AND IMPLEMENTATION OF IP-OVER-P2P OVERLAY VIRTUAL PRIVATE NETWORKS

This chapter has been reprinted with permission from Internet Architecture, Applications and Operation Technologies for a Cyber Physical System.

There is nothing noble in being superior to your fellow men. True nobility lies in being superior to your former self.
Ernest Hemingway

Background

While software-defined virtual networking systems exist within large-scale cloud data centers [19] at the core of the Internet and under a single administrative entity, these are not suited for future distributed applications spanning edge resources. Such fog applications are distributed across multiple providers and edge networks, raising more complex security and privacy issues than in cloud environments [44]. Edge virtual networks require traversing multiple administrative environments [45] across different providers and enforcing data privacy and integrity in communication. While transport layer network security (e.g., TLS, DTLS) and VPN tunneling (e.g., IPsec) technologies exist, they are not readily applicable to systems where resource membership is dynamic, and where most devices are constrained by NAT/firewall middleboxes. Furthermore, the effort associated with developing or porting applications to enforce privacy and integrity in communications comes with significant costs. To accomplish its required functionality, a fog application requires the ability to deploy, aggregate, and process data from sensors on edge and cloud resources in a dynamic manner. It also needs to support dynamic changes in the membership of participating devices over time, as devices may be mobile (e.g. video cameras in


smartphones and vehicles). Fundamentally, the network connecting IoT, edge and cloud resources must provide trustworthy, seamless communication across a dynamic, heterogeneous, mobile set of resources. Where privacy-rich or security-sensitive data are used, these types of applications share common needs and requirements. Such applications require dynamic and timely provisioning of computational and network resources in response to user requests. Since multiple devices from multiple providers may join an application, the separation and isolation of such resources must be accomplished on the basis of user groups, so that each virtual network encompassing computing and network resources for an application has access control that allows only the corresponding user group to access them. Furthermore, the management of groups must be dynamic and allow community-provided devices to join the application, including devices that are mobile, deployed across different Internet Service Providers (ISPs), and/or behind NATs. The current version of IPOP, which embodies the novel design and technologies described in this dissertation, implements hybrid overlay/software-defined networking software. This section outlines the evolution of the overlay design leading up to its current form.

Core Abstractions and Architecture

The core abstraction exposed by IPOP to a computer system using it is that of a virtual network. The first iterations of the design exposed the abstraction of a layer 3 (IP) virtual network [46]-[48], while subsequent versions expose the abstraction at layer 2 (Ethernet) [9], [10]. In both cases, the abstraction is exposed through a virtual device available in the Linux and Windows kernels, from which frames/packets that are sent and received are


intercepted. IPOP nodes are implemented as a user-space process that runs in each node connected to the overlay; this process reads from and writes to the tap device, using the system call interface, as illustrated in Figure 3-1. IPOP nodes form virtual links among each other, where each virtual link is an Internet tunnel that carries encrypted and encapsulated virtual network frames/packets through a transport protocol, typically UDP, which is more amenable to NAT traversal than TCP. Each IPOP node in an overlay is uniquely identified by a node ID (e.g., A, B, ..., F in Figure 3-1). The set of virtual links among IPOP nodes forms a topology. While different overlay topologies have been implemented in IPOP versions over time, a structured P2P topology has been a feature of the design since its inception. In this topology, nodes form a logical ring, with successor links ordered by their node IDs, and shortcut links across the ring, following a structured P2P algorithm for topology construction and identifier-based routing such as Chord or Symphony.

The first generation of IPOP [46] was fully decentralized, following a structured P2P design using the Brunet [49] library and a Symphony-based protocol. Two key motivations for this approach were to avoid any external dependencies and any single point of failure. Each node implemented the functionality to (1) discover other nodes by means of a peer list file, (2) bootstrapping by contacting nodes in the peer list and using (3) providing (4) forwarding messages according to a greedy routing algorithm.
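The ring-with-shortcuts topology and greedy, identifier-based routing described above can be sketched as follows. This is a simplified illustration rather than the Brunet or Symphony implementation: the identifier space size, the integer node IDs and the function names are assumptions made for the example, and links are treated as one-directional (clockwise) for brevity.

```python
RING_SIZE = 2**16  # hypothetical size of the identifier space


def clockwise(a: int, b: int) -> int:
    """Clockwise distance from identifier a to identifier b on the ring."""
    return (b - a) % RING_SIZE


def build_ring(node_ids, shortcuts=()):
    """Link each node to its successor (next larger ID), then add shortcuts."""
    ids = sorted(node_ids)
    links = {n: {ids[(i + 1) % len(ids)]} for i, n in enumerate(ids)}
    for src, dst in shortcuts:
        links[src].add(dst)
    return links


def greedy_route(links, src, dst):
    """Forward to the neighbor closest to dst without passing it."""
    path, current = [src], src
    while current != dst:
        remaining = clockwise(current, dst)
        # Only neighbors that make clockwise progress toward dst qualify;
        # the successor always does, so the walk is guaranteed to terminate.
        candidates = [n for n in links[current] if clockwise(n, dst) < remaining]
        current = min(candidates, key=lambda n: clockwise(n, dst))
        path.append(current)
    return path
```

For example, on a ring of nodes 10, 20, ..., 60 the route from 10 to 50 takes four hops over successor links alone, while a single shortcut from 10 to 40 reduces it to two.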


This implementation provided a layer 3 virtual network for the IPv4 protocol. There was no encryption, and no authentication of nodes into the overlay. IP addresses were assigned to nodes by leveraging a Distributed Hash Table (DHT) [50] key/value store, which was also implemented by IPOP nodes. The DHT used a compound key, IPOP ID. This allowed multiple virtual networks to share a single overlay without collision of private subnets. IPOP nodes also included an implementation of a user-level DHCP server that handed out IP addresses by randomly assigning an address within a declared virtual network namespace and inserting the mapping into the DHT.

While this implementation proved to be useful in many scenarios, and resilient to failures and churn, it had several shortcomings. First, the monolithic design led to a complex node that implemented several modules for discovery and bootstrapping, DHT, overlay routing, NAT traversal, IP address assignment/mapping, as well as virtual network interface bindings. Second, the use of a monolithic design implemented as an application written in C# made it difficult to incorporate standards and functionality in libraries written for different languages into a single process; in particular, the STUN, TURN and ICE protocols for NAT traversal were not supported, nor was transport layer security. Third, the design did not provide a mechanism to authenticate peers into the overlay. Fourth, rules for packet forwarding and header manipulation, such as IP mapping, were implemented in the monolithic node, so they could not be changed without significant code investment. These shortcomings were progressively addressed in subsequent implementations of IPOP, as described in the next sections.
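The DHT-backed address assignment described at the start of this page can be sketched roughly as below. The exact key format used by IPOP is not recoverable from the text, so this example assumes a key that combines a namespace with the IP address and maps it to a node ID; a plain dictionary stands in for the distributed store, and all names are hypothetical.

```python
import ipaddress
import random


def assign_address(dht, namespace, subnet, node_id, rng=random):
    """Pick a random free address in `subnet` and record it in the store.

    Assumed key layout: "<namespace>:<ip>" -> node_id. The namespace part is
    what lets two virtual networks reuse the same private subnet.
    """
    hosts = list(ipaddress.ip_network(subnet).hosts())
    rng.shuffle(hosts)
    for ip in hosts:
        key = f"{namespace}:{ip}"
        # A real DHT would need an atomic create-if-absent operation here.
        if key not in dht:
            dht[key] = node_id
            return str(ip)
    raise RuntimeError("no free addresses left in this namespace")
```

Two nodes in different namespaces can receive the same IP address without conflict, because their keys differ in the namespace component.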


Decoupling Endpoint Discovery from Overlay

To tackle the complexities of overlay membership, endpoint discovery and bootstrapping links, which stem from its initial design, the next iteration of IPOP [48] introduced a signaling process reliant on online social networks (OSNs). Each host in an overlay is represented and identified by an OSN Identity (OSNI) which exists on an OSN server. The OSNI maintains the notion of a social network (i.e., its private roster of friends) which also corresponds to the peer hosts that participate in its overlay. By using an XMPP [26] compliant instant messaging service, issues of overlay membership and credentialing would instead be handled externally by a service that follows a widely used standard, and that supports the authentication of user accounts and the establishment of trust relationships between users. Additionally, a well-known, published service eliminated the need to have prior knowledge of overlay nodes in order to join the overlay. Instant messaging provided the facility to exchange bootstrapping connection data that would be used in support of other established standards, specifically the IETF protocols Interactive Connectivity Establishment (ICE), Session Traversal Utilities for NAT (STUN) and Traversal Using Relays around NAT (TURN). Collectively, these changes decoupled endpoint discovery and connection bootstrapping from the overlay. Whereas previously anyone with the software could join the single global overlay for all participants, OSNs now provided the necessary mechanism to restrict the participants of an overlay, and subsequently to define multiple separate overlays. OSN identities (OSNIs) are protected by credentials, which require each identity to authenticate with a centralized OSN server or a federation of OSN servers prior to using its services. Once authenticated, the OSNI has access to its roster of friends which is


interpreted as an indication of direct acquaintance and mutual trust. Friends are therefore used to identify the network endpoints that are eligible to participate in the social relationships. The view of social relationships in this approach is taken broadly: on one hand it may be mapped to resources owned by multiple individuals connected by a social network, while on the other hand it may be mapped to resources managed by a single (or federated) administrative domain, with trust relationships capturing the membership in an overlay. When an OSNI signs on, a presence message is broadcast to all its available friends. This presence awareness is used to trigger the process of negotiating peer links. Establishing peer links using ICE requires each node to discover and share several types of information with its peer, all of which must occur before a communication channel between the peers is established. The instant messaging facility of the OSN is utilized to exchange messages which indicate the intent to create a P2P channel as well as to exchange the necessary bootstrapping data between peers. This data includes the peer UUID, certificate fingerprints, and network address endpoints that will ultimately be used for creating all the facilities of the tunnel.

Decoupling Control from Datapath

Another enhancement introduced in later designs of IPOP was a separation of concerns between control and datapath. Rather than the monolithic process of its initial implementation, starting in [12] IPOP transitioned to an SDN-inspired modular design which separated its functional responsibilities into controller and datapath components.


The IPOP datapath, or Tincan, builds on Web Real-Time Communication (WebRTC) [51], [52] to create its communication links. WebRTC is an open standard and industry effort to enable direct, real-time media communication between browsers. Tincan utilizes the WebRTC native C++ libraries for peer-connected data channels, which constitute the virtual link between peers. These virtual links are established by direct connection, STUN/NAT traversal, or TURN relay. Direct connection occurs when two nodes are routable to each other using their local IP addresses (e.g., within a LAN); traversal occurs when nodes are behind cone-style NATs that are amenable to STUN-based traversal, while relay occurs when nodes are behind symmetric-style NATs and STUN-based traversal fails. The datapath module also handles the IO interaction between the "tap" device and the channel. Ethernet frames read from the VNIC are encrypted using DTLS, encapsulated as the payload of an IP datagram and sent across the tunnel. Conversely, incoming messages are stripped of the UDP headers, decrypted and written to the VNIC.

The IPOP Controller uses a modular framework which separates the application framework from the modules which implement specific functionalities [14]. The controller framework loads and initializes a parameterized list of modules at startup, providing an asynchronous task-based messaging service for inter-module communications. A Controller Brokered Task (CBT) is created by a module and submitted to the framework, which adds it to the recipient module's work queue. When the recipient has completed the requested task, the CBT is returned to the initiator via the framework. The CBT is a self-contained structure that fully describes the task over its lifetime, including all the details pertaining to the task request and response.
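The CBT life cycle described above (create, submit, queue at the recipient, complete, return to the initiator) can be sketched as follows. This is a minimal model under stated assumptions, not the IPOP Controller code: the field names, the `Framework` class and the two example modules are all hypothetical, and the loop is synchronous where the real framework is described as asynchronous.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any


@dataclass
class CBT:
    """Self-contained record of one task: its request and, later, its response."""
    initiator: str
    recipient: str
    action: str
    params: Any = None
    response: Any = None
    completed: bool = False


class Framework:
    """Brokers CBTs between registered modules through per-module work queues."""

    def __init__(self):
        self.modules, self.queues = {}, {}

    def register(self, name, module):
        self.modules[name] = module
        self.queues[name] = deque()

    def submit(self, cbt):
        # Pending tasks go to the recipient; completed ones return to the initiator.
        target = cbt.initiator if cbt.completed else cbt.recipient
        self.queues[target].append(cbt)

    def run(self):
        while any(self.queues.values()):
            for name, queue in self.queues.items():
                while queue:
                    self.modules[name].process(queue.popleft(), self)


class Shout:
    """Example recipient: completes the task and hands the CBT back."""

    def process(self, cbt, framework):
        cbt.response, cbt.completed = cbt.params.upper(), True
        framework.submit(cbt)


class Client:
    """Example initiator: collects the responses of its completed tasks."""

    def __init__(self):
        self.results = []

    def process(self, cbt, framework):
        self.results.append(cbt.response)
```

Because the request and the response travel in the same object, the initiator can correlate a completed task with its original request without keeping separate bookkeeping.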


A control module is a component with an application-specific role within the IPOP Controller. By implementing a framework-defined interface, modules can be loaded and initialized, sent CBTs for processing, and invoked at periodic intervals. Modules can be created for any purpose to extend the capabilities of the IPOP Controller. The core modules include signaling, link negotiation and creation, topology definition, and status reporting; the key modules are as follows.

The signaling module leverages the Extensible Messaging and Presence Protocol (XMPP) to advertise presence, indicate intent, and exchange connection bootstrap data. At sign-on, and at periodic intervals, an availability message is broadcast to all peers listed on its friendship roster, which informs peers that the node is available to accept a request for a communication channel. This module services requests from other modules that require peer communication over the XMPP band.

The link management module manages tunnels between peers, mirroring the notion of the link layer between two networked devices. It creates, maintains and destroys tunnels as a service to other modules and utilizes the signaling module to indicate the intent to create a channel as well as to exchange virtual link bootstrapping data. Additionally, it maintains the associated VNIC and data channel for a tunnel and instructs the datapath to create them. It coordinates, as an intermediary, the CAS exchange handshake between the local and peer data planes.

The topology module determines where to place tunnels and when to create and destroy them. It utilizes the link management services for the creation of individual tunnels and orchestrates the loc


structure. The provided module implements a structured P2P topology based on a successor ring with shortcut paths.

Reporting of the overlay state is accomplished by cooperative work among the reporting module and any other module that needs to report its state data. Participating modules periodically submit their respective state to the reporting module, which then aggregates the data into a node-wide representation. The node data is sent to a central collector web service, which aggregates node data into a global view encompassing all the overlays and their respective nodes.

Software Defined Switching

In the latest design iteration, IPOP moved from a layer 3 to a layer 2 virtual network. This has allowed IPOP to support broadcast/multicast protocols other than IP (e.g., ARP), and a wider variety of applications. Furthermore, in its latest iteration, IPOP supports handling of layer 2 frames using the OpenFlow [19] standard and SDN-based software switches, opening the system to a variety of possible networking implementations. An IPOP overlay creates a layer 2 broadcast domain by tunneling Ethernet frames. The ring structure with shortcut links provides multiple redundant paths between switches in the overlay. If utilized, these links provide alternate paths to avoid IO bottlenecks, and resilience against link failures. However, typical Ethernet switching does not accommodate a topology with cycles, and network failure from broadcast storms occurs if cycles are not disabled. Approaches such as the Spanning Tree Protocol (STP) [53], [54] are used in local area Ethernet networks to selectively disable links and construct a cycle-free spanning tree. Unfortunately, this approach ignores functional links that could otherwise be used as alternative paths between pairs of communicating


hosts. To address the issues stemming from cyclic paths while still retaining the functional benefits of multiple links, an OpenFlow-compliant switching protocol, called Bounded Flooding, has been designed and implemented for structured P2P topologies in IPOP. An IPOP node uses two controller components: the IPOP controller, and a new OpenFlow controller module introduced in its latest design [13]. The IPOP controller maintains the topology and the OpenFlow controller programs the switching rules. Furthermore, the IPOP overlay is a distributed structure with each node running its own instances of the pair of controllers. The OpenFlow controller functionality is topology dependent and relies on structural information queried from the IPOP controller, such as two nodes to thousands of nodes.

The IPOP topology is a ring of successors with identifiers increasing clockwise and arcs of shortcut links. Each switching node running IPOP is assigned a UUID, and each node bears the responsibility of identifying its closest neighbor (next larger UUID) and building a successor link to it; a node may select more than one successor for resilience against interruptions from churn. This process continues with each node until the ring is complete. The average number of switching hops needed to deliver frames in the overlay is proportional to the number of nodes in the overlay, and is subsequently used as a measure of the perceived latency in communication. While a complete network would provide constant switching cost, it is infeasible for overlays with more than tens of nodes. Each new node adds a link to every existing node, and an overlay with n nodes has n(n-1)/2 edges. As each tunnel incurs an ongoing


communication cost over its lifetime, this approach does not scale, as maintenance data begins to saturate the network. Shortcut links can be used to bound the average latency and using links per node, provides an equivalent bound [50]. To select a suitable peer for the shortcut links, the Symphony Long Distance Links mechanism is used. The overlay is parameterized to tune the trade-offs between node degree and latency, where each node can be independently configured. Topology data provided by the IPOP controller is used to distinguish between directly connected peer bridges and leaf devices, and to associate the leaf devices with building on-demand tunnels which avoid additional switching hops between actively communicating leaf devices.

The OpenFlow controller implements a procedure we have termed bounded flooding, which is fundamental to building its learning table (a mapping of observed source MAC addresses to the associated ingress ports) and subsequently programming the data plane flow rules. Whenever a broadcast is necessary, either due to a broadcast destination address or a unicast address that is unknown in the local learning table, a Flooding Route and Bound (FRB) is used to greedily flood all its peer ports, i.e., egress ports that connect to another bridge. Flooding Route and Bound is a custom Ethernet layer protocol implemented and used by IPOP SDN switching to perform link layer broadcasts in dynamic cyclic switched fabrics. The FRB prefixes the original broadcast frame with the root bridge and the flooding bound. The root bridge is the switch that manages the device that initiated the broadcast. The flooding bound is a


closed-open interval specifying the recipient and the furthest node to which the message should be forwarded. The recipients are adjacent peers, and each bound can potentially differ. On receiving an FRB, a node will deliver the payload on its managed leaf ports, determine if any adjacent peer bridges are within the message bound, recalculate and update the bound as necessary, and retransmit the message. Retransmitting an FRB is done clockwise around the ring on successor and shortcut edges. This approach ensures that broadcasts are never duplicated, are delivered to all devices in the overlay, and will eventually terminate. As FRBs are propagated throughout the overlay, they are tracked at each node to update its local learning table. This information collectively provides a return route across the overlay to the FRB initiator. Additionally, an FRB is used to exchange the managed leaf devices of a switch with its peer.

Concluding Remarks

We have traced the evolution of IPOP from its P2P origins in Brunet to its current hybrid implementation, showing how specific needs of a real-world system have driven pragmatic design changes at each stage. These changes allow IPOP to fill a gap in emergent technologies and seamlessly integrate existing applications. We have also [...] applications.

Generation 1 virtualized a layer 3 network through VNIC host integration and tunneling of IP packets within UDP/IP. It was a fully decentralized overlay, utilizing the Brunet library. The architecture reflected a monolithic process that incorporated all services for bootstrapping and packet forwarding, which made interoperability with open standards difficult. The global overlay used namespace identifiers and a DHT store to


provide scope for IP subnetting. Joining the overlay required a peer list and used overlay links to carry bootstrap messages, and there was no support for authentication or encryption.

Generation 2 introduced OSNs to decouple endpoint discovery from the overlay and introduced a client-server model for bootstrapping. While no longer a pure P2P model, using a published OSN server and friendship relationships facilitated independent overlays, simplified bootstrapping, and enforced authentication for membership. The monolithic process was separated into a control and data plane, and standards such as ICE, STUN, and TURN were adopted. Moving from the Brunet library meant losing the Symphony structure; to accommodate this, links were created to all friends and IP mapping was performed between them. An SDN-inspired approach was adopted which split IPOP into controller and data plane processes. However, the coarse granularity of the architectural components still left code maintenance and enhancements daunting.

Generation 3 addressed the architectural problems by introducing the controller application framework. The controller utilized modular components focused on topology, link management, and signaling. Improvements to the data plane supported multiple, concurrent, discrete layer 2 overlays within a single process.

Generation 4 reintroduced the Symphony structured ring topology, implemented in an IPOP controller module. To provide the switching routes for the specialized topology, an OpenFlow module implementing Bounded Flooding is used. Both the algorithms for the topology and the switching rules were designed to be parameterized and scalable, to function for networks from as small as two nodes to hundreds of nodes.
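The bounded flooding procedure summarized in this chapter can be illustrated with a small simulation. The following is a minimal sketch, not the IPOP implementation: node identifiers are small integers rather than UUIDs, shortcut peers are chosen uniformly at random rather than by the Symphony distribution, and the names `ID_SPACE`, `build_overlay`, and `bounded_flood` are hypothetical. It assumes every node holds a link to its ring successor, and it counts how many times each node receives a broadcast that starts at a root bridge.

```python
import random
from collections import deque

ID_SPACE = 2 ** 16  # hypothetical identifier space; IPOP nodes use UUIDs


def cw_dist(a, b):
    """Clockwise distance from identifier a to identifier b on the ring."""
    return (b - a) % ID_SPACE


def build_overlay(node_ids, shortcuts_per_node, rng):
    """Return an adjacency map: a successor ring plus random shortcut links."""
    ring = sorted(node_ids)
    adj = {n: set() for n in ring}
    for i, n in enumerate(ring):
        succ = ring[(i + 1) % len(ring)]
        if succ != n:
            adj[n].add(succ)
            adj[succ].add(n)  # tunnels are bidirectional
    for n in ring:
        candidates = [m for m in ring if m != n and m not in adj[n]]
        for m in rng.sample(candidates, min(shortcuts_per_node, len(candidates))):
            adj[n].add(m)
            adj[m].add(n)
    return adj


def bounded_flood(adj, root):
    """Flood from root; return how many times each node received the payload."""
    received = {n: 0 for n in adj}
    received[root] = 1  # the root delivers to its own leaf ports
    pending = deque()

    def forward(node, bound):
        # Select peers strictly inside the clockwise interval (node, bound).
        # bound == node only at the root, where the interval is the whole ring.
        span = cw_dist(node, bound) or ID_SPACE
        peers = sorted(
            (p for p in adj[node] if 0 < cw_dist(node, p) < span),
            key=lambda p: cw_dist(node, p),
        )
        for i, peer in enumerate(peers):
            # Each recipient is bounded by the next peer clockwise; the last
            # recipient inherits the bound that this node was given.
            next_bound = peers[i + 1] if i + 1 < len(peers) else bound
            pending.append((peer, next_bound))

    forward(root, root)
    while pending:
        node, bound = pending.popleft()
        received[node] += 1
        forward(node, bound)
    return received


rng = random.Random(7)
ids = rng.sample(range(ID_SPACE), 50)
overlay = build_overlay(ids, shortcuts_per_node=3, rng=rng)
counts = bounded_flood(overlay, root=ids[0])
exactly_once = all(c == 1 for c in counts.values())
```

Because each recipient is handed a disjoint clockwise interval, and the successor link guarantees that a non-empty interval always contains an adjacent peer, every node in this sketch receives the broadcast exactly once and the flood terminates when the intervals become empty.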


IPOP designs are meant to be implementable and practical, and IPOP has been used as the networking cyber-infrastructure for multiple collaborative projects. A few of these are PerSoNet [52], GRAPLEr [11], FLARE, and PRAGMA-ENT.

Figure 3 [...]. P2P architecture and virtualized endpoint. IPOP nodes (e.g. A, B) are connected in successive order based on their integer node [...] links, the overlay features shortcut links (e.g. A-D) which reduce routing [...] packets from the O/S kernel, and a user-mode IPOP application that implements the overlay virtual network functionality.

Figure 3-2. Illustration of decoupling endpoint discovery from the virtual network overlay in IPOP, zoomed on three of the nodes (B, C, D) in the overlay shown in Fig. 1. Nodes B and C, which may be NATed, use external services (STUN, TURN) to discover their own IP:port endpoints on the public Internet. They then use an OSN service (e.g. an XMPP-compliant instant messaging service) to exchange these endpoints if B and C have a trust relationship recorded in the OSN. Once endpoints and certificate fingerprints are exchanged, a successor link (in red) is established. The integrity and confidentiality of communications between B and C are enforced by a transport-layer protocol (e.g. DTLS).


Figure 3-3. Decoupling of the IPOP node design into control and data path modules. The data path module is responsible for packet capture/injection, as well as for the setup and maintenance of peer-to-peer private tunnels through which virtual network traffic flows. The control module is responsible for signaling, link management, and topology management, among other functions. These modules are implemented as separate processes, written in separate languages (currently, a Python controller and a C++ datapath), that execute in the same node and communicate via a localhost network API.

Figure 3-4. IPOP functions as a switch by extending a software bridge such as Open vSwitch. Tunnels are added to the bridge as subordinate interfaces as they are created, and removed when destroyed.


CHAPTER 4
TOWARDS DYNAMIC, ISOLATED WORK GROUPS FOR DISTRIBUTED IOT AND CLOUD SYSTEMS

"A community is a small group working together. Community scales by adding groups, and building connections between them, not enlarging them."
Robert Reed

Sensors, gateways, compute and storage nodes near the edge, and cloud backends are a few of the various device types involved in end-to-end, sensor-to-cloud pipelines. New approaches for seamless networking are needed which address the complexities of their interactions. Such a solution must incorporate several layers of differentiated services and be both extensible and customizable to meet its goals. This chapter describes a novel framework that addresses the needs of extensibility and customizability in the context of establishing overlay networks that bring together sensor, edge, and cloud devices under the same virtual private network. The proposed approach has been implemented in the context of the IP-over-P2P (IPOP) overlay virtual network by means of a flexible and extensible distributed controller framework for overlay management. IPOP Workgroups serves as a platform for building flexible groupings of devices for different use cases and application scenarios [...] resources within the group and across its boundaries, in a manner where controlled access is enforced while using standard abstractions.

This chapter has been reprinted with permission from Sensors to Cloud Architectures Workshop (SCAW 2017).

We advocate the use of overlay networks to enable flexible aggregation of such devices into logically isolated


Workgroups. Group members can communicate with enforced privacy, authentication, and access control at the network layer using a virtual private network abstraction. Our architecture is based on the concepts found in SDN and P2P overlays, leveraging the unique features of each to enable new classes of functionality.

The chapter is organized as follows. We present an overview of relevant background on virtual private networking and highlight some of the shortcomings of the existing host-centric approaches. Next, we present the design and reference implementation of a distributed [...] along with its integration with the OS and standard networking subsystem. Finally, a case study of an actual deployment of our reference implementation is presented.

Background

The sensors-to-cloud model is a culmination of technologies and a fundamental part of the Internet of Things strategy and goals. With the advent of an emerging technology, there are associated challenges and hurdles that must be solved. IoT brings an explosion in the number of networked devices and in the size of the datasets being generated and transmitted. To keep such large networks functioning and maintainable, there are certain key objectives that must be given consideration: (1) entity discovery, naming, and enumeration, (2) grouping of devices alongside functional peers, (3) separation of private resources from published ones, (4) enforcing access control to networked entities, and (5) maintaining privacy and integrity of data throughout the network.

Logical grouping of these sets of distributed edge and cloud devices across the Internet into Workgroups can provide an effective technique to manage large aggregates in a scalable manner. Furthermore, enforcing privacy and authentication in


communication allows such logical groups to be isolated from the public Internet. Quickly building Workgroups for specific purposes becomes a powerful tool for limiting the scope of exposure and trust. Workgroup participants leverage an implicit level of trust among peers that is gained from the barriers to joining the group. Grouping also reduces the amount of communication that a node must handle to only what it needs to participate in. Finally, Workgroups allow the resources which are published and externally accessible to be separated from the physical resources by software abstractions.

One approach to connecting IoT devices to the cloud in a secure, isolated manner is virtual private networks. They provide flexibility in address space assignment and isolation from the Internet public address space, and they enforce privacy, integrity, and access control with public-key cryptographic techniques. However, a typical centralized VPN architecture with a gateway (e.g. OpenVPN [2]) does not scale adequately if there are many devices transmitting data to the cloud. Additionally, there will be cases where not all the data needs to go up to the core cloud. Rather, edge nodes and sensors will benefit from peer communications where a contained island or edge network encapsulates a trusted execution environment which can also act as a data source. This variance from a typical VPN model warrants consideration of a solution based on decentralized virtual network overlays that allow P2P communication. While P2P VPNs exist (e.g. tinc [55], FreeLAN [56], ZeroTier [57]), there is no [...] and virtual address assignment/translation. The approach proposed in this chapter


presents a novel design of distributed overlay controllers that provides the flexibility to construct topologies that match the workflow and deployment, while still maintaining direct communication channels (i.e., without the use of a relay).

The IPOP Workgroup system described in this chapter consists of a distributed set of software processes that enable private communication among multiple nodes. In IPOP, trust relationships (and thus group membership) are established using a centralized service that provides for device authentication, discovery, and messaging (e.g., an XMPP-compliant service). For scalability, communication among members of the group is tunneled over P2P links. This allows sharing of resources from edge to cloud in a manner that is isolated from the public Internet, and allows devices within the confines of the virtual private network to interact seamlessly, as if they were connected by a local area network.

Following the taxonomy of [16], IPOP is a P2P routing security overlay tailored to private communication among peer devices through virtualized networking. In contrast to typical P2P overlays, which focus on a particular application, such as file sharing/transfer (e.g. BitTorrent, Gnutella [58]), key/value storage (e.g. Chord [59]), streaming/video on demand [60], or voice over IP (e.g. Skype), IPOP provides a framework suitable for arbitrary IP communication. IPOP allows overlay software to run on sensors (e.g. Raspberry Pi, Edison), sensor gateways (e.g. OpenWRT routers), intermediate edge processing nodes (e.g. [61]), mobile devices, and cloud instances. It supports a combination of stationary and mobile actors that are distributed and may be behind one or more levels of NATs.


[...] and management of tunnels (control plane) and overlay packet processing (data plane) in a manner inspired by the SDN architecture [62]. Unlike typical SDN systems, which operate within a data center using a single logical controller, IPOP is designed to be deployed across multiple geographically distributed networks and does not assume control over these networks. Virtualization is based on encapsulation, using tunneling and overlay routing, and it allows the virtual network to be deployed at the endpoints with NFV [34] routers.

The network overlay needs to present scalable, resilient topologies and must provide key infrastructure services that support dynamic group membership, routing with multicast, and assignment of virtual addresses to the endpoints. The actual deployment will dictate its own priorities and will call for specialization, as large and small networks exhibit different characteristics and require separate consideration. For example, Chord-based structured routing scales to large networks, while for small networks an all-to-all topology is suitable, as it avoids complex routing and link maintenance. There is also benefit in having the overlay topology parallel key functional resources, e.g., stream processing as proposed in SpanEdge [63].

The chapter discusses the system's active state model, the architecture of the software (the key components and inter-module communication), the API to the data plane, and the communication between peer controllers. Also introduced are the framework features designed to support extensibility and specialization for custom topologies. We have demonstrated its design and implementation in the context of the open-source IPOP software. Finally, case studies of two different IPOP virtual private


network overlay implementations are described. GroupVPN (GVPN) and SocialVPN (SVPN) are reviewed to show how different topologies can be realized through controller specialization.

Architecture

The IPOP network overlay is a suite of software modules that enables flexible arrangements of networked hosts into customized topologies. IPOP uses a distributed peer model, where each host runs its own control and data plane. The data plane (Tincan) establishes communication links with its peers and routes messages between peers based on rules which are provided by its local control plane. Tincan is designed to be a quick, lightweight component, performing the intrinsic tasks associated with moving messages from one host to another. The control plane (IPOP Controller) implements a specific topology by specifying when links are created and to which nodes.

The Controller and Tincan modules interact dynamically, allowing the system to grow, shrink, and heal as the situation requires. Tincan provides continuous feedback of network conditions and events to the Controller. This information is used by the Controller, in combination with its own static rules, to adjust the operational parameters of the data plane and hence maintain a functional topology. Together, they provide a paradigm for rapidly building scalable, application-specific overlays.

The previous IPOP architecture [48] required separate controller implementations for each overlay, in particular for topology and routing. The novel controller framework allows multiple overlays to share the core controller functionality and utilize existing modules. As a result, different overlay designs for different environments and topologies (e.g. structured P2P, hierarchical, unstructured) can leverage the codebase from the base controller and its framework as necessary.


The Control Plane: IPOP Controller

The IPOP Controller consists of the framework and multiple modules which work together to implement a controller model. The Controller Framework (CFx) is the orchestrating component for an IPOP Controller, and it defines the process framework for all interactions within the system. It specifies the rules for component interaction and facilitates the asynchronous, message-driven communication among them. The overarching design goal of the IPOP Controller process framework is to support an extensible model that can be used to quickly implement new functionality in the controller. It is a modular construct, where the CFx is the scaffolding for the component modules. A module that plugs into the CFx is expected to encapsulate a specific, well-defined functionality that it publishes as a set of capabilities. These capabilities are invoked using inter-module communications that are brokered by the CFx, which enforces minimal coupling in the dependencies between the modules. Figure 4 illustrates the structural relationships between the CFx, the controller modules, and Tincan, and how a message [...].

A controller module is a pluggable component of the system that embodies some functionality important to the functioning of the overall system. Some controller modules may be optional while others provide fundamental operations. Regardless, they must be highly cohesive, such that the role they play is well understood and singular in purpose, at whatever level of abstraction. A module's capabilities are essentially its programming interface to any actor that consumes its services. The CBTs are the messages that are passed from module to module. A CBT describes a request when


initiated and the corresponding response when completed. A request CBT contains (1) its origin, i.e. the module which initiated the request, (2) the module targeted by the request to perform the operation, (3) the operation to be performed and its associated parameters, and (4) control data fields used for various housekeeping tasks.

CBTs are transferred between modules via the CFx using an asynchronous messaging communication protocol. The CFx maintains work queues for each module that are used to store in-transit CBTs active in the system. A well-formed CBT, when [...] of execution, the CFx will dispatch the CBT to the appropriate recipient by invoking its CBT processing function with the CBT instance as the parameter. Responses are handled similarly; a CBT is updated as completed with a status and data (if any) and submitted to the CFx. At some later time, the CFx will again dispatch it to the module that initiated the request.

Not every initiated CBT requires a response, and for some types of operations the provider of the capability is not obligated to provide an explicit response, e.g., the logging module. However, some do; the response may be a simple status code or a dataset generated by the target. By tagging the CBT, the sender can differentiate among CBTs as they are completed. Due to CBTs being completed asynchronously, and the complex interactions and dependencies that can occur, a tag alone does not provide enough context to resolve the sequence of actions that need to be taken on receipt of a given CBT. Consider the scenario, as illustrated by Figure 4, where module B receives a request CBT A1 from module A. To execute its capability, B needs to make subsequent requests to modules


C and D, using CBT B1 and CBT B2, which modify the state of the system and retrieve a dataset, respectively. Module B can proceed no further with CBT A1 until it receives responses to both of the subsequent requests that were sent; nor does it want to remain idle in the interim, as there are likely many other CBTs pending in its work queue. To accomplish this, a strategy of chaining dependent CBTs is employed. Chaining is an optional operation that can be performed on any type of CBT and is done by the modules with the assistance of the CFx. When the CBTs (CBT B1 and CBT B2) [...] indicated by chaining them to it. CBT A1 is then marked as pending and stored on a list of CBTs awaiting further action. When B gets a completion CBT, it checks if it is chained to a parent CBT and updates the progress status of CBT A1. When all the dependent CBTs are completed, the capability requested by CBT A1 can be fulfilled, the dependent CBT B1 and CBT B2 are freed, and CBT A1 is completed by sending its response to module A.

Each controller module is provided two dedicated threads of execution: the first invokes its CBT processing function on receiving a new CBT for that module, and the second is a timed event that the module must request to activate. Since a module cannot receive additional timer events if it is active or blocked on its timer thread, modules are expected to perform only quick operations in this context. Longer-running tasks can be scheduled by creating and submitting a self-addressed CBT which requests the operation.

In addition to its capabilities, a controller module is required to implement a class-type interface common to all modules. It is one half of a two-part interaction system used for direct synchronous


communication between the CFx and a module. The first is used by the CFx to invoke [...] CFx and supports invoking the CFx services, such as submitting a CBT for delivery and querying environment state.

This choice of asynchronous messaging, separate worker threads, and timer events comes together to form a loosely coupled system that is responsive to task processing and reasonably easy to program. Messaging breaks the hard dependencies between modules, allowing one module to be replaced by another. Instead of being programmed to invoke a module capability as a function call, a module sends a message to the CFx to request the capability. The CFx has the sole responsibility of determining the actual recipient of the CBT, so it may be delivered to one or several modules as it is processed.

A controller model is a specific combination of modules that are assembled to realize a particular and desirable type of overlay network. By replacing modules with others that support the same capabilities but have different implementations, the behavior of the system can be modified. Consider a module that could change the way relationships between nodes in the overlay are identified. Currently, XMPP friendships are used, but Active Directory or LDAP are possible replacements. If two user accounts are in the same group, then they could participate in the same overlay network.

The controller model is configuration defined, i.e., the set of modules that are loaded into the system by the CFx is specified by a configuration file. This configuration is provided in part by the developer and in part by the user. The developer specifies the list of the modules which implement the given control model and the default parameters upon which they operate, while the user provides the user


specific parameters, for example, the URL and credentials for an XMPP service which is to be used for identifying the participants of the network. The configuration-driven model allows the CFx to load only the modules that are specified and enabled in the configuration. There is no need to change the framework, and enhancements can be [...].

IPOP currently implements two different controller models, GVPN and SVPN, which target different use case scenarios. GVPN is designed to create an overlay network where all peers in the group can communicate with each other. This reflects an administrative domain under a centralized authority, which can determine the parameters for participating in the group on behalf of all the participants. This is necessary as the network configuration requires coordinated selection: all participants must agree on the same subnet and on the assignment of the individual IP addresses that are used within the overlay network. This makes GVPN suited for structured groups with ongoing relationships that exhibit a higher degree of trust, e.g., your personal devices that reside at different geographic places (home, work, school), or sensors installed in a smart city operating as an edge network connected to the cloud. The topology and routing supported are reflected in this model, as it is assumed that all nodes in the overlay can potentially establish links with all other nodes. The case study in the following section provides an in-depth review of the GVPN and SVPN controllers, the considerations driving their design, and their functionality.

The Data Plane: Tincan

The IPOP data plane is known as Tincan and implements the communication channels and tunneling functionality. Tincan also implements a control interface which is used by its local controller. The core functional responsibilities are instantiating a local


virtual network device (TAP), establishing the communication channel to a peer, and tunneling Ethernet frames between the local host and remote peers.

The TAP is a virtual network interface which is emulated by kernel-mode software drivers. The OS networking stack is layered atop one of its interfaces and allows application software to bind a standard network socket to it. Additionally, Ethernet frames can be read from and written to the TAP using the OS ABI.

Virtual links are communication channels between two Tincan processes which are built using WebRTC transport channels. They provide a direct peer-to-peer connection between nodes and perform NAT traversal if necessary. All communication over virtual links is encrypted by default using Datagram Transport Layer Security (DTLS).

Tincan tunnels are a software abstraction of the TAP and virtual link composite. Ethernet frames are read from a local TAP device instance and encapsulated as a message to be transmitted via the associated virtual link. The reciprocal operation is performed for receiving a message.

Tincan supports two mutually exclusive tunneling modes: Layer 2 Tunneling (switch mode) and Layer 3 Tunneling (IP mapping). When operating in switch mode [64], Tincan tunnels address resolution protocol (ARP) messages. IPOP can be deployed to a switch or wireless router, so that devices attached to its interfaces join the overlay without the need to run Tincan locally. They see the LAN as they typically would: a subnet of nodes connected to the same switch and a gateway to the WAN. The other nodes on the subnet may not be physically on the same LAN. They only need to be on the Internet and running IPOP, or behind another switch running in switch mode.
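The chained CBT strategy described in the control plane section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the IPOP controller code: a single-threaded loop with one queue stands in for the per-module work queues and threads, and the names `CBT`, `CFx`, `Module`, `request`, `complete`, `open_children`, and the action strings are hypothetical. It reproduces the scenario in which module B receives CBT A1 from module A and chains CBT B1 and CBT B2 to it.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class CBT:
    initiator: str
    recipient: str
    action: str
    params: Any = None
    parent: Optional["CBT"] = None   # set when this CBT is chained to another
    completed: bool = False
    response: Any = None
    open_children: int = 0           # dependent CBTs still outstanding
    child_results: dict = field(default_factory=dict)


class CFx:
    """Single-threaded stand-in for the framework's message brokering."""

    def __init__(self):
        self.modules = {}
        self.queue = deque()

    def register(self, name, module):
        module.name, module.cfx = name, self
        self.modules[name] = module

    def submit(self, cbt):
        self.queue.append(cbt)

    def run(self):
        while self.queue:
            cbt = self.queue.popleft()
            # Requests go to the recipient; completions return to the initiator.
            target = cbt.initiator if cbt.completed else cbt.recipient
            self.modules[target].process_cbt(cbt)


class Module:
    def request(self, recipient, action, params=None, parent=None):
        cbt = CBT(self.name, recipient, action, params, parent)
        if parent is not None:
            parent.open_children += 1  # chain the new CBT to its parent
        self.cfx.submit(cbt)
        return cbt

    def complete(self, cbt, response):
        cbt.completed, cbt.response = True, response
        self.cfx.submit(cbt)


class Responder(Module):
    """Stands in for modules C and D: replies with a fixed response."""

    def __init__(self, reply):
        self.reply = reply

    def process_cbt(self, cbt):
        self.complete(cbt, self.reply)


class ModuleB(Module):
    def process_cbt(self, cbt):
        if not cbt.completed:
            # New request (CBT A1): issue CBT B1 and CBT B2 chained to it.
            self.request("C", "update_state", parent=cbt)
            self.request("D", "get_dataset", parent=cbt)
            return
        # A completion for B1 or B2: update the parent's progress.
        parent = cbt.parent
        parent.child_results[cbt.action] = cbt.response
        parent.open_children -= 1
        if parent.open_children == 0:
            self.complete(parent, parent.child_results)


class ModuleA(Module):
    def __init__(self):
        self.result = None

    def process_cbt(self, cbt):
        self.result = cbt.response  # completed CBT A1 arrives here


cfx = CFx()
a = ModuleA()
for name, mod in {"A": a, "B": ModuleB(),
                  "C": Responder("ok"), "D": Responder([1, 2, 3])}.items():
    cfx.register(name, mod)
a.request("B", "do_work")  # CBT A1
cfx.run()
```

Module B never blocks while waiting: it returns after submitting the chained requests and only completes CBT A1 once the counter of open dependent CBTs reaches zero, which mirrors the pending-list behavior described in the text.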


IP mapping eliminates the challenges of configuring a peer network. Finding an available subnet and IP addresses for a virtual network which spans multiple physical LANs can be tedious. Furthermore, if a new node is to join the overlay, it must first ensure that there is no overlap with the subnet defined by its physical network interface. The approach taken is to allow each host to select and use its own subnet and IP, as long as it does not conflict with its local network. When a node joins a peer network, it exchanges virtual IPs with its peers. Each node then maintains a mapping table between two virtual IPs: one which the node uses to refer to its peer, taken from its own locally configured subnet pool, and the other which the remote peer uses to refer to itself. The IP addresses in a datagram must be updated on the local system to match its network configuration before the datagram is written to the TAP. Additional details on Tincan can be found in [48].

Case Study

This section provides a case study of two controller implementations, GVPN and SVPN. It discusses the unique objectives of each controller and their functional characteristics. Finally, they are compared on how they reuse and leverage existing code in the base controller and framework.

GroupVPN

The GVPN controller realizes a structured peer-to-peer network and implements a ring topology, where participating nodes are all reachable peers but not necessarily all connected. The terms node and peer are used interchangeably to mean any host system running a compatibly configured IPOP instance. Nodes that have an established connection between them are called adjacent peers (or neighbors) and can communicate directly, i.e., P2P. Peer connections are bidirectional communication


links and are categorized with the following logical roles: successor, chord, or on-demand. These links describe traits that are characteristic of the overlay. There are two methods available in the IPOP controller to facilitate the exchange of control data between peers. The first is a message sent over the XMPP network as an instant message to the recipient. The second uses Inter-Controller Communication (ICC) to leverage the overlay for message forwarding. Intermediate peers use the destination UID of the message to determine the optimal adjacent peer to forward to, taking advantage of chords for shortcuts.

The GVPN controller was designed to be scalable to large network sizes. The link policies that govern the use of successor, chord, and on-demand links are adaptive, user configurable, and functionally independent. Each policy and its links specify a set of responsibilities for maintaining the overlay. Successors primarily maintain the ring topology, chords provide strategic shortcut links throughout the overlay, and on-demand links establish convenient direct links for high-traffic communication.

The successor policy is fundamental in maintaining the ring network, where nodes are ordered by their 20-byte Unique Identifier (UID). Additionally, this policy is responsible for bootstrapping a node into the network and for repairing malformed or partitioned networks. The number of successors is configurable per node; more successors incur additional maintenance overhead, but the added redundancy ensures a more robust integration into the overlay. The bootstrapping process starts with a newly online node logging into its XMPP server and querying for an initial list of online nodes. Using this list, connection requests are sent over the XMPP network to the nearest


succeeding UID peers in an attempt to establish initial successor links and become a leaf node. The requested nodes may be in various states of connectedness; ideally, the requested nodes are already part of a single overlay, but some may be in separate partitioned networks or even completely isolated.

To ensure the maintenance of a well-formed ring topology, the successor policy adapts its connections to the best-fitted successors. Towards this strategy, each node periodically sends advertisements via ICC to advertise a list of adjacent peers to each of its neighbors. This list can then be used for dynamic peer discovery and enables a node to identify and connect to better-fitted successors. More infrequently, each node queries the XMPP server for a list of online nodes as a failsafe mechanism to discover and repair severely partitioned networks, as is the case when nodes in each partition have no knowledge of any indirect neighbors in another partition.

The chord policy is used to establish shortcut connections through a potentially very large ring overlay. These shortcut paths improve the performance of forwarding by reducing the number of hops needed for a message to traverse the network. The number of chords is also configurable per node. A node determines its ideal chord UIDs by calculating the logarithmic reduction based on its UID. Connection requests are sent via ICC, and during the forwarding process, the peer whose UID matches or is closest to the ideal UID will be the recipient of the request. Periodically, each node issues connection requests to identify and connect to better-fitted chords.

Furthermore, chords can also enhance the fault tolerance of the overlay. Chords present extra redundant links in the overlay that can also reduce recovery time. The advertisements used in the successor policy for maintaining the ring topology can leverage these shortcuts to


speed peer discovery. This is illustrated in Figure 4. In this example, the overlay is disconnected, and node H must establish a link with node A to restore the ring topology. Without chords, the extent of advertisements is local, and thus it requires multiple iterations of successor links. Using chords, advertisements are enhanced, and peer discovery of distant nodes is accelerated.

The on-demand policy is the final facet of the Topology Manager and is responsible for optimizing direct communications between chatty peers. If two peers do not share a connection, then messages must be forwarded across the overlay. When the peers are terse, overlay forwarding is sufficient, because it reduces the per-node link count required to maintain the connected state of the overlay while incurring a reasonable forwarding overhead. However, high-traffic communication between peers can stress the overlay. Consequently, each node monitors its network traffic (in bytes per interval) and establishes on-demand links with peers when the traffic exceeds some threshold, to alleviate this stress. Network traffic is periodically monitored to determine whether the on-demand link is still necessary.

SocialVPN

SVPN provides a model inspired by Online Social Network (OSN) systems, which allow users to establish individual relationships among themselves, where the resulting topology is a social graph rather than a structured P2P system. In order to map large numbers of OSN users to a scalable topology and support private IPv4 address spaces, SVPN relies on dynamic assignment and mapping of virtual IP addresses to peers [65] and supports social overlay routing [66].

Internally, SVPN performs IP mapping, so there is no need for all participants to be on the same network subnet. Each node configures IPOP to use an IPv4 address and subnet that is


convenient for the local environment. Further addresses are assigned from this range to its peers, which results in every node having a unique local view of the overlay. SVPN then correlates the IP address it has assigned to a peer with the IP address that the peer assigned to itself. These mappings define the rules used by Tincan to update packets before they are injected into the virtual network interface. This approach makes SVPN suitable for transient groups, where there are lower barriers to participation and higher churn. Although SVPN also uses XMPP to identify peers, it could just as easily implement identification based on proximity over Bluetooth or from GPS.

Consider, as an SVPN use case, a crowd-sourced scientific experiment to measure air quality at various points over a localized area. A participant running the software can opt in so that, when in the area of interest (as determined by GPS or Bluetooth), their sensors capture this data and transmit it over the SVPN overlay. Participating does not require an individual to establish links with any participant other than the friend who is conducting the experiment.

Discussion

While SVPN has fundamentally different characteristics from GVPN in terms of membership management (per user vs. group manager), topology (social graph vs. structured P2P Chord ring), routing (social vs. identifier based), and address translation (dynamic vs. no translation), both overlay models are supported by the controller framework described in this chapter. Most core modules in the controller are common to both implementations; the modules that implement the different behavior are pluggable into the framework. This highlights the flexibility of the proposed approach in enabling different VPN modalities for different use cases, with minimal changes.

A comparison of the number of files and lines of code (LOC) of GVPN and SVPN gives an indication of how much common functionality is leveraged across both


controller models. GVPN uses 16 Python source code files totaling 2539 LOC, and SVPN uses 19 Python source code files totaling 2031 LOC. Across both projects, 14 of these files and 1518 LOC were shared without any modifications. So, while GVPN is 25% larger than SVPN in LOC, 60% of its code is shared with SVPN. Additionally, both GVPN and SVPN controllers use the Tincan data plane module (implemented in C++) without modification.

Concluding Remarks

As IoT deployments increase, new techniques are required to manage and secure this infrastructure. Grouping devices alongside functional peers is an effective and scalable approach to this problem. However, as there is no one size or model that will fit all the various requirements, the solution must be easily customizable and extensible. We advocate the use of overlay networks to enable flexible aggregation of hosts into logically isolated Workgroups and have proposed the design and reference implementation for such a system. IPOP is a platform for building custom network overlays as defined by the application needs, and it provides the services for node discovery, grouping and privacy. By separating specialized services into respective modules, common functionalities can be shared and leveraged for code reuse. This reduces the amount of work that is needed to implement a new controller model. Our case study shows that a majority of the GVPN code, approximately 60%, was reused from the SVPN controller. The process framework allows developing new modules to augment or replace existing behavior, as necessary, to implement a different topology altogether. Both GVPN and SVPN, as reference implementations, validate the design objectives of the IPOP network overlay architecture.
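The reuse percentages quoted in the Discussion can be reproduced from the LOC counts given in the text. The short sketch below only restates that arithmetic; the variable and function names are illustrative and are not part of the IPOP code base, and the SVPN percentage is a derived value that the text does not quote.

```python
# LOC figures quoted in the text for the two controller models.
GVPN_LOC = 2539
SVPN_LOC = 2031
SHARED_LOC = 1518  # lines shared by both controllers without modification


def percent(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# How much larger GVPN is than SVPN, relative to SVPN.
gvpn_larger_pct = percent(GVPN_LOC - SVPN_LOC, SVPN_LOC)
# Share of each controller that consists of the common, unmodified code.
gvpn_shared_pct = percent(SHARED_LOC, GVPN_LOC)
svpn_shared_pct = percent(SHARED_LOC, SVPN_LOC)  # derived, not quoted in the text

print(f"GVPN is larger than SVPN by {gvpn_larger_pct:.1f}%")
print(f"Shared code as a share of GVPN: {gvpn_shared_pct:.1f}%")
print(f"Shared code as a share of SVPN: {svpn_shared_pct:.1f}%")
```

Rounded to whole percentages, the first two values agree with the 25% and 60% figures stated above.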


Figure 4-1. Structural relationships between the CFx, the controller modules and Tincan.


Figure 4-2. Workflow for CBT Chaining. Tracking the dependencies between CBTs is done by linking one CBT to another. A CBT is only completed when all the CBTs on its chain have been completed.

Figure 4-3. Steps in overlay recovery using only successors vs. both successors and chords.


CHAPTER 5
TOWARDS ISLAND NETWORKS: SDN-ENABLED VIRTUAL PRIVATE NETWORKS WITH PEER-TO-PEER OVERLAY LINKS FOR EDGE COMPUTING

This chapter has been reprinted with permission from Internet and Distributed Computing Systems.

You and I come by road or rail, but economists travel on infrastructure.
Margaret Thatcher

A fundamental aspect of edge computing is the relocation of compute and storage resources to the network's edge. However, building a network at the edge poses challenges not typically encountered in the data center: the considerably larger area for deployment, the time and effort to physically access these locations, the heterogeneous mix of components that must interoperate, and the numerous independent owners/networks that contribute to the pool of edge resources. Thus, while core techniques developed for data center clouds may be leveraged, it will not be feasible to manage IoT infrastructure and services as is currently done in clouds. At the edge, there is no centralized or consolidated premise for hardware, so existing approaches for identifying, leasing, and deploying configuration become impractical. New ways are needed to solve these issues in a manner that can be orchestrated via software.

SDN is a mature technology widely used by network administrators in cloud data centers to orchestrate and manage networks via software. However, its use within a data center (and on backbone connections across data centers) is predicated on the fact that a single or a few administrative entities own the data plane (switches, routers, links) and can manage it through a centralized control plane. In contrast, applying SDN techniques to connect a multitude of IoT and edge devices across the Internet brings a


very different set of challenges [67]. Nonetheless, SDN exposes key primitives for packet handling that can provide a basis for software-defined edge networking. For instance, network virtualization can be used to create a familiar environment, such as a flattened layer 2 networking namespace, in which administrators can utilize standard tools. This allows IoT/edge applications to reuse a plethora of middleware that works atop TCP/IP networks.

In this chapter, we consider the challenges associated with the use of SDN in IoT/edge computing and propose a novel approach that integrates both SDN switches and overlay networks to create software-defined virtual networks across edge and cloud resources. The contribution of our approach lies in a novel way to integrate control/data planes for both the overlay and SDN layers. At the overlay layer, the control plane is realized by a distributed set of software modules (overlay controllers) that coordinate in a peer-to-peer fashion to create and manage virtual links as TCP/UDP tunnels across the public Internet, even when devices are in different edge providers and constrained by middleboxes (NATs, firewalls).

In our approach, the overlay layer controllers not only manage virtual links, but also dynamically bind them to ports of SDN-controlled software switches. Overlay links thus become the data plane for the SDN fabric, allowing packets sent/received by IoT, edge and cloud resources that reach a switch to be forwarded across the overlay to other nodes. Combined with virtualization primitives implemented at the SDN layer (by a centralized SDN controller or distributed SDN controllers), the resulting system delivers a software abstraction of a layer 2 (Ethernet) network, thereby reducing the complexities associated with the deployment of middleware and applications across edge and cloud


resources across the Internet. In addition to establishing the data plane, the overlay network layer controllers allow dynamic membership and grouping of resources, and enforce authentication and privacy in communication, addressing key management and security concerns for edge/IoT applications. In summary, our proposed approach creates a flexible virtual private networking (VPN) system which can dynamically aggregate IoT, edge and cloud resources into managed communication groups, with links that are peer-to-peer Internet tunnels terminating in SDN-programmable switches.

The approach is demonstrated through the development of, and experiments with, a prototype that builds on open-source frameworks for both the SDN and overlay layers. In particular, it uses Open vSwitch (OVS) [68], a widely used software switch that supports the standard OpenFlow API and various SDN controller frameworks (e.g., Ryu [69]), and IPOP tunnels, an open-source overlay network with built-in support for NAT/firewall traversal using STUN/TURN/ICE standards [70]-[72] and the WebRTC framework [51].

The rest of the chapter is organized as follows. First, we introduce overlay and island network concepts and then describe the notable traits of next-generation IoT applications and their implications for network structure and the division of application roles. Next, we present our novel hybrid overlay/SDN approach for building virtualized network infrastructure. Finally, the experimental evaluation of our reference implementation is presented along with the characterization data obtained from our testbed.

Background

Software applications have continuously evolved to address the increasingly sophisticated requirements of a modern society, and they tackle workloads and problem


spaces that cannot be addressed within singular systems. The IoT era promises to continue this trend, as the abundance of connected devices will generate new use cases for engaging consumers, both in the physical and virtual world.

IoT Application Characteristics

It is already conceivable that our everyday commute will use autonomous vehicles. However, there are significant barriers to making this concept safe and reliable for use. Autonomous vehicles (AVs) comprise three major technology categories: sensing and perception, localization and mapping, and driving policy. Multimodal sensor streams are an important facet of successfully accomplishing the first two, but contemporary technology is still restricted in its temporal and spatial resolution, and this impacts the quality of the decisions that are made in the third category. As such, it is expected that AVs will be connected to the surrounding roadway infrastructure via wireless networks to exchange sensor data that will assist with navigation [23], [73]. While it is a prudent design approach to ensure that any autonomous system can continue to operate reliably, even with diminished capabilities, when disconnected from the communication network, there is considerable benefit from leveraging information available from it. If a hazardous condition exists and fixed roadway infrastructure can be used to detect it, e.g., using cameras and machine vision, it can be intelligently transmitted to approaching vehicles. Vehicles can then initiate defensive maneuvers ahead of the time they could start sensing such conditions themselves.

Model Characteristics

The scenario presented exemplifies the role of detection using sensor devices and analytics to reach a decision, which ultimately should be acted on. The hardware components are unlikely to all reside within a single physical unit but rather be separated


according to their functional roles. The sensors and actuators must be distributed over the area of interest and the data sent to the processing nodes for analytics. This underscores another important facet of the system, the communication path. As the usefulness of the decisions expires over time, the workflow of sense-analyze-actuate must be timely. As the data-generating events exhibit spatiotemporal correlations with the actors, one approach is to keep data close to where it is produced and/or finally consumed and bring the compute operations close to the data. We have also seen that there are multiple roles and device types involved in a single solution, which produce a heterogeneous mix of operational resources. Additionally, due to the vast geographic areas that require hardware coverage, it is reasonable to assume that it will be provided by multiple independent vendors.

IoT ecosystems and the associated sense-analyze-actuate model inherently respond to real-world events, implying a data producer/consumer abstraction. It can be generalized as a process chain which starts with a data-generating event, followed by one or more consumers that apply a transformative process and subsequently produce new events and data. The processes applied at each stage are application specific but can include storage, analysis, generating derivative data and actuation.

Networking Implications

To support the execution of distributed applications as exemplified above, an interconnection network must be in place to enable communication among the IoT devices and the various compute/storage agents that participate in the workflow. While, in principle (as implied by their name), IoT devices are connected to the Internet, the reality of establishing interconnectivity among devices such that communication occurs seamlessly is more complex.


First, owing to the shortage of identifiers in the IPv4 address space, devices may be private and not have addresses on the public network. Even if IPv6 deployments fully address the identifier shortage issue, private networks are unlikely to go away: they provide a line of defense against malicious actors in the network by concealing the device behind a NAT/firewall middlebox.

Second, devices participating in such a workflow are likely to be connected to different ISP providers/network domains; for instance, a 5G wireless provider for an in-vehicle sensor, a city networking provider for infrastructure cameras, and one or more commercial edge/cloud providers. As a result, no single entity will be able to control the configuration/setup of the networking infrastructure end to end.

Third, such applications will require privacy guarantees at many layers of the stack, including privacy in communications. The public Internet does not provide such guarantees.

Fourth, in applications where the devices participating in the distributed workflow can dynamically join and leave, the networking layer must be adaptive to accommodate dynamic group membership.

Overlay and Island Networks Concepts

An overlay network is a network created atop another pre-existing network. The de facto approach to contemporary network design is to provide increasingly capable services within discrete layers. This provides the view of the Internet being constructed as a series of overlay networks. Additional overlays are created whenever a new network abstraction is presented that utilizes the services of another existing network. This is the case with network tunneling technologies and specifically IPOP. IPOP creates a new layer 2 overlay network which is tunneled through layer 3 IP across the Internet.
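The idea of carrying layer 2 frames inside a layer 3 transport can be illustrated with a minimal sketch. The code below is not IPOP's wire format (IPOP links are built on WebRTC channels); the 4-byte tunnel header, the function names and the example MAC addresses are all hypothetical and chosen only to show that a complete Ethernet frame becomes the opaque payload of a tunnel datagram and is recovered unchanged at the far end.

```python
import struct
from typing import Tuple

ETH_HEADER = struct.Struct("!6s6sH")  # destination MAC, source MAC, EtherType
TUNNEL_HEADER = struct.Struct("!I")   # hypothetical 4-byte tunnel identifier


def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Assemble a minimal Ethernet frame (no FCS, no padding)."""
    return ETH_HEADER.pack(dst, src, ethertype) + payload


def encapsulate(tunnel_id: int, frame: bytes) -> bytes:
    """Wrap a whole layer 2 frame as the payload of a tunnel datagram."""
    return TUNNEL_HEADER.pack(tunnel_id) + frame


def decapsulate(datagram: bytes) -> Tuple[int, bytes]:
    """Recover the tunnel identifier and the original frame."""
    (tunnel_id,) = TUNNEL_HEADER.unpack_from(datagram)
    return tunnel_id, datagram[TUNNEL_HEADER.size:]


# Example: a broadcast ARP frame (EtherType 0x0806) crossing the tunnel.
frame = build_frame(
    dst=b"\xff" * 6,
    src=bytes.fromhex("020000000001"),  # locally administered example address
    ethertype=0x0806,
    payload=b"example-arp-body",
)
datagram = encapsulate(7, frame)           # what would be sent over IP
tunnel_id, recovered = decapsulate(datagram)
dst, src, ethertype = ETH_HEADER.unpack_from(recovered)
print(tunnel_id, hex(ethertype), recovered == frame)
```

Because the frame is carried intact, broadcast-based protocols such as ARP can operate across the overlay as they would on a local Ethernet segment.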


There are several contemporary methodologies and technologies that are used in designing communication networks, and they impact how the hosts are connected and addressed. It is standard to use both physical and logical approaches to partition networks that interact through defined connection points. A single host can participate in multiple networks as its roles dictate; for instance, consider an application server that has a network dedicated to administration and another (or more) for its data path communication with other instances associated with the application. These application networks are the networking infrastructure that is used to support complex distributed applications. The network structure, or topology, defines the way the hosts are interrelated by means of the direct connections they establish among themselves, while the attributes and protocol specify how they are addressed and how they communicate.

Correctly done, partitioning a network into smaller individual components is a successful strategy for defining important functional, performance and security characteristics. As each component is its own private self-contained network that connects to and communicates with other component networks, this chapter regards them as island networks. Each island network can freely employ an architectural approach that best suits its internal needs, and by the encapsulation of its internal characteristics, these choices need not adversely impact other islands or the larger aggregate network.

An Architecture for Overlay Networks

IPOP provides flexible virtual networking infrastructure, using peer-to-peer tunneling techniques that can traverse NATs and firewalls, as well as peer/endpoint discovery and key exchange for private communication via online social network (OSN)


based messaging. These features make IPOP a well-suited overlay substrate to build the Overlay Network for IoT/edge computing. Inspired by the SDN approach of separating the functional roles of a controller and data plane, the IPOP design consists of two modules: a controller, and a data plane (referred to as Tincan, drawing an analogy to private peer-to-peer communication among friends). For ease of deployment on a variety of platforms (currently, it runs on Ubuntu and Windows desktops/servers, and on Raspberry Pi devices), IPOP is implemented as two user-mode processes: one implements the controller, and one implements the tunneling data plane.

This chapter introduces work which extends the existing IPOP capability into the realm of SDN by providing integration with SDN and Open vSwitch. This is accomplished by creating a new abstraction called an IPOP Tunnel, which is programmatically connected to an OVS or Linux bridge. An IPOP Tunnel has two key components: a virtual network interface (TAP device) and a virtual link (WebRTC communication channel). The IPOP Tincan creates the tunnels and connects IO between its TAP and link.

Controller

The IPOP controller provides an application framework which manages a parameterized set of task-specific modules. The controller framework supports an asynchronous, brokered, task-based messaging system for decoupled communication between its modules. Each module implements a well-defined cohesive task, which in part contributes to the composite functionality of the controller model. Notable modules provided in the default model include, but are not limited to, signaling and link bootstrapping, tunnel creation and management, and topology definition and enforcement. Signaling for bootstrapping a link between peers is performed using instant messaging


via the XMPP protocol, using software such as ejabberd. This mechanism also identifies the endpoints participating in the overlay network and provides the stage at which participants must authenticate credentials. By establishing an account on an ejabberd [...] defining who can participate in their private overlay network and to whom communication links may potentially be established.

Tunnel creation is a 9-way handshake between two peers, which exchanges the necessary data for establishing a virtual link. When a friend node is online, a presence notification is delivered via the Signal Module to both peers. This initiates the tunnel creation process, which involves allocating the local resources to be used for the tunnel [...] network addresses that can be used for local, reflection (STUN), and relay (TURN) connections, along with unique identifying fingerprints, are exchanged via signaling. With this information, the peers can negotiate keys and establish a private tunnel, which is then used as a virtual link for the overlay network. The Topology module defines which peers will establish a virtual link, and the conditions that determine when the link is created (or removed). While any topology can be implemented, an all-to-all topology was used for this work.

Data Plane

The IPOP data plane (Tincan) implements the key abstraction of the IPOP Tunnel. The IPOP Tunnel is a construct of a virtual network interface (TAP device) and an associated virtual link. The virtual network interface is the approach used for creating separate network namespaces, i.e., network segments which can be switched or routed; this is as opposed to using IP rules, which are restricted to routing functionality.
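The all-to-all policy mentioned for the Topology module can be expressed as a very small rule: every node holds a tunnel to every other online peer. The following sketch is illustrative only; the function names and the single-letter node identifiers are assumptions and do not reflect the actual IPOP Topology module interface.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set


def all_to_all_peers(local_uid: str, online_uids: Iterable[str]) -> Set[str]:
    """Peers the local node should hold a tunnel to under an all-to-all
    policy: every online node except itself."""
    return {uid for uid in online_uids if uid != local_uid}


def overlay_links(online_uids: Iterable[str]) -> Set[FrozenSet[str]]:
    """Undirected links that exist once every node applies the policy."""
    return {frozenset(pair) for pair in combinations(sorted(set(online_uids)), 2)}


# A hypothetical five-node overlay.
nodes = ["A", "B", "C", "D", "E"]
peers_of_d = all_to_all_peers("D", nodes)
links = overlay_links(nodes)
print(sorted(peers_of_d))  # the peers node D tunnels to
print(len(links))          # n * (n - 1) / 2 links overall
```

The per-node link count grows linearly with the number of peers and the total link count grows quadratically, which is one reason a pluggable Topology module that can enforce sparser topologies is useful for larger overlays.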


Messages from the TAP device are sent on the associated link. While it is possible to apply an IP configuration to each TAP device, it is impractical for large networks. A more favorable approach is to create an in-memory bridge device, e.g., a Linux or OVS (Open vSwitch) bridge, and connect the TAP devices as subordinate devices. IPOP is then able to leverage the extensive switching functionality of these tools and further extend their reach and capabilities via its Tunnels. Tincan links are built on top of the WebRTC C++ libraries and their transport and channel abstractions. Consequently, IPOP links can perform NAT traversal and utilize TURN services around symmetric NATs and restrictive firewalls.

Experimental Evaluation

This section evaluates IPOP using our reference implementation. We describe the experimental setup and the scenarios tested, provide the corresponding results and discuss factors that influence them.

Experimental Testbed

To demonstrate the functionality of the hybrid SDN + IPOP Tunnel network, we created an overlay with 5 peer nodes that are distributed geographically across the continental USA. 3 nodes were located at the University of Florida, another at CloudLab (Clemson), and 1 at Amazon AWS (Ohio). The hardware specification of the nodes was chosen to be representative of the heterogeneous mix of devices that would occur in an IoT deployment: an IoT-type node, a virtual machine, edge processing nodes and server-class cloud nodes. Each host is installed with the IPOP software, Open vSwitch, and the iproute2 network utilities. Iperf version 2 is also installed on hosts A, B and E for testing IP multicast. The IPOP controller is configured for an all-to-all network topology, such that each node creates a tunnel with each peer. The tunnels are bridged locally


and STP is enabled on the network bridge at all hosts. The bridge device has an internal interface, and each one is configured with a unique IPv4 address in the [...] is used for communication on the overlay network.

Scenarios and Results

The first demonstration is to establish basic end-to-end connectivity among all the peers. This is done using a combination of the console ping command and tcpdump. All the ping tests were successful between pairs of hosts. Table 5-2 indicates the round-trip times (RTT) between the IoT device and the AWS VM measured both on the system's native interface and the IPOP TAP interface. With the tcpdump output, the ARP and ICMP packets were verified. The IPOP Controller code was also instrumented to record the time taken to create and connect each tunnel to its respective peer, as well as the total time to establish a fully connected overlay. The latter includes the time from the first notification [...] Measurements taken in host D are shown in Table 5-3.

[...] performing an IP multicast test. The iperf network performance measurement tool was used to create an IP multicast group as illustrated in Figure 5-3. On hosts B and E, iperf was invoked in its server role, with both hosts listening on the same IPv4 multicast address [...]. This multicast address restricts communication to one site. On host A, iperf was invoked as the client, which sent packets to the same multicast address.


The first invocation of the test was done at the default UDP bandwidth rate of 1 Mbps and indicated a single packet lost in the first round of transmissions. Subsequent client invocations to the existing server processes did not show this loss, but it was reproducible when the server process was restarted. We believe this to be associated with the initial address resolution process. Beyond this, there were no errors or failures reported, and all remaining packets were received by both receiving hosts at the sending rate. We progressively increased the client transmission rate and eventually observed packet loss at around the 35 Mbps rate at both server hosts.

Results and Analysis

Connection setup times are dominated by the connection bootstrap handshake and WebRTC session establishment, which may involve STUN and/or TURN. We observe an average connection time of 11.3 seconds per node. It is important to note that the total connection time for the overlay (42.12 s) is less than the sum of the individual connection times (45.14 s). This is due to overlapping operations: multiple tunnels can bootstrap concurrently; however, new operations must wait until the existing in-progress ones complete.

The latency between host D (IoT node) and host A (AWS VM) shows an increase of less than 2% in RTT min/avg/max when using the IPOP overlay as compared to the native interface. The IPOP interface also exhibits a lower standard deviation (mdev).

[...] bandwidth and why this approach is undesirable. As the bridge must block certain tunnels from use, it results in packets being switched along longer paths. Additionally,


when considering the underlying network, the increased routing cost and potential traversal of WAN links further exacerbate the problem. While blocking tunnels is necessary to eliminate cycles, it ignores lower-cost paths to the destination. In Figure 5-3, the resulting spanning tree of the overlay network requires a response from host C to B to be sent via the root bridge at host A. Within the underlying network, hosts B and C are connected to the same physical switch, while host A is reached via a WAN link.

Concluding Remarks

In this chapter, we have described and evaluated IPOP Tunnels, a flexible virtual private networking (VPN) system which has the capabilities to dynamically aggregate IoT, edge and cloud resources into managed communication groups. IPOP Tunnels facilitate a virtual layer 2 overlay network by utilizing Internet-aware tunnels which connect remote hosts with peer-to-peer links. They can be programmatically created and managed, allowing dynamic integration with an SDN switch such as OVS. We presented our argument in the context of the Internet of Things and the relevance of connecting cloud and edge resources. Additionally, we motivated the need to build customized communication groups to support the IoT application model and illustrated how a hybrid approach using IPOP Tunnels and SDN can accomplish this. To demonstrate the feasibility of our technique, we have implemented our designs in IPOP VPN and used this reference implementation to construct an experimental network testbed. Our evaluation shows that the functional and performance requirements were met to seamlessly support a stable layer 2 overlay network across the Internet. Additionally, the increase in latency associated with tunneling overhead and securing communication was within 2% of the native interface.
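The effect of spanning-tree blocking on an all-to-all overlay, discussed in Results and Analysis, can be reproduced with a small model. The sketch below is a simplification rather than an implementation of STP: it assumes all tunnels have equal cost and uses a breadth-first search from the root bridge in place of the protocol's bridge ID and path cost election. Host names follow the testbed; the function names are illustrative.

```python
from collections import deque
from itertools import combinations


def spanning_tree(nodes, links, root):
    """Breadth-first tree from the root bridge. Returns a child -> parent map.
    Stand-in for STP under the assumption of equal link costs."""
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors[node]):
            if peer not in parent:
                parent[peer] = node
                queue.append(peer)
    return parent


def tree_path(parent, src, dst):
    """Path from src to dst that uses only links kept in the tree."""
    def to_root(node):
        chain = [node]
        while parent[chain[-1]] is not None:
            chain.append(parent[chain[-1]])
        return chain

    up, down = to_root(src), to_root(dst)
    # Drop shared ancestors above the lowest common one.
    while len(up) > 1 and len(down) > 1 and up[-2] == down[-2]:
        up.pop()
        down.pop()
    return up + down[-2::-1]


hosts = ["A", "B", "C", "D", "E"]
mesh = list(combinations(hosts, 2))           # all-to-all overlay: 10 tunnels
parent = spanning_tree(hosts, mesh, root="A")  # host A is the root bridge
kept = len(hosts) - 1                          # a tree on n nodes has n - 1 links
blocked = len(mesh) - kept
path_c_to_b = tree_path(parent, "C", "B")
print(blocked, path_c_to_b)
```

In this model the tree collapses to a star centered on host A, so six of the ten tunnels are blocked and traffic from C to B takes two overlay hops through A even though a direct C-B tunnel exists.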


Figure 5-1. An Overlay Network (ON). Four modules distributed across different resources on edge and cloud providers are logically aggregated into an ON. While the physical devices are connected by Internet links, the ON encapsulates virtual network Ethernet frames and forwards them across peer-to-peer tunnels.

Figure 5-2. IPOP Tunnel Architecture. Illustrates the architecture of the IPOP Tunnel and its interaction with the rest of the system. The tunnel abstraction is composed of a TAP device, which is connected to the OVS bridge, and a virtual [...] transmission path contrasted to the virtualized view of the tunnel.


Figure 5-3. Experiment Testbed Structure. IPOP Tunnels to create an all-to-all overlay topology. Host A is designated the root bridge for the spanning tree, as indicated by incoming light arrowheads at Host A. Bold arrowheads at a host indicate a forwarding link from the originating host.

Table 5-1. Hardware specification for nodes used in experiment.

Host                        CPU                                    RAM      NIC
A  AWS hosted VM            Intel Xeon E5-2676 v3, 1 core          990 MB   Low to moderate VIF
B  Bare metal desktop       Intel i7-6850, 6 cores                 32 GB    1 GigE
C  Bare metal desktop       Intel i7-6700, 4 cores                 16 GB    1 GigE
D  Compulab Fitlet2 IoT     Intel Celeron J3455, 4 cores           4 GB     1 GigE
E  Bare metal server        2 x Intel Xeon E5-2660 v2, 28 cores    256 GB   1 GigE

Table 5-2. Ping latency test between hosts D and A showing round-trip times. A total of 60 packets were transmitted and received.

RTT (ms)            Min      Avg      Max      MDev
Native Interface    42.047   42.229   42.758   0.226
IPOP Interface      42.731   42.990   43.375   0.170
Difference (%)      1.601    1.770    1.422

Table 5-3. Time to create a fully writable link between the IoT class device and each peer in the overlay network.

Host D                       Host A    Host B    Host C    Host E
Time to connect (s)          11.4309   10.6864   12.3229   10.7039
Total connection time (s)    42.1215
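The derived figures quoted in Results and Analysis can be recomputed from Tables 5-2 and 5-3. The sketch below only restates the tabulated measurements; the variable names are illustrative. Note that the Difference (%) row of Table 5-2 is reproduced when the RTT increase is divided by the IPOP interface value, which is an observation about the numbers rather than a statement from the text.

```python
# Table 5-2: round-trip times in milliseconds between hosts D and A.
native = {"min": 42.047, "avg": 42.229, "max": 42.758}
ipop = {"min": 42.731, "avg": 42.990, "max": 43.375}

# Table 5-3: per-peer tunnel connection times measured at host D, in seconds.
connect_times = {"A": 11.4309, "B": 10.6864, "C": 12.3229, "E": 10.7039}
total_overlay_time = 42.1215

# RTT increase relative to the IPOP value (matches the table's Difference row).
diff_pct = {k: round(100.0 * (ipop[k] - native[k]) / ipop[k], 3) for k in native}
# The same increase relative to the native value, for comparison.
diff_pct_native = {k: 100.0 * (ipop[k] - native[k]) / native[k] for k in native}

sum_individual = sum(connect_times.values())
mean_connect = sum_individual / len(connect_times)
overlap_saving = sum_individual - total_overlay_time

print(diff_pct)
print(round(sum_individual, 2), round(mean_connect, 1), round(overlap_saving, 2))
```

With either denominator the RTT increase stays below 2%, and the sum of the individual connection times exceeds the total overlay connection time by roughly three seconds, consistent with tunnels bootstrapping concurrently.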


CHAPTER 6
BOUNDED FLOOD: SCALABLE LAYER 2 FORWARDING FOR DYNAMIC CLOUD-TO-EDGE NETWORK ENVIRONMENTS

Claims that cannot be tested, assertions immune to disproof are veridically worthless, whatever value they may have in inspiring us or exciting our sense of wonder.
Carl Sagan

Edge networks [74], as used in the recent IoT context, refer to the local networks that connect IoT devices (Wi-Fi access points and gateways) and exclude the IoT devices themselves. The existing approaches [19], [75] employed in data center network architecture and administration are not well suited for building and managing these networks. Challenges arise from (1) the geographic distances over which equipment must be deployed and the effort to physically access these locations; (2) the heterogeneous mix of components and configuration; (3) networks that must span multiple administrative control domains; and (4) dynamic membership that can often be short-lived. New approaches are necessary to address these issues; approaches that enable greater autonomy and orchestration through software.

We leverage the observation [30], [33], [35]-[37] that virtualized networks are positioned to play an important role in the orchestration and communication among geographically distributed containerized applications. In this chapter, we present Bounded Flood (BF), a novel technique for scalable layer 2 topology and forwarding for dynamic cloud-to-edge networking environments, where the peer nodes are software-defined layer 2 bridges. We make the claim that Bounded Flood is a well-suited networking cyberinfrastructure for fog and IoT applications. It adapts to conditions at the edge that require dynamic membership, is


scalable, supports heterogeneous node capabilities and configuration, and provides private P2P communication among its members.

Towards this objective, the functional requirements of the system are specified across 7 broad categories. Within each category are the specific functionalities that have been implemented and validated, as presented below.

(F1) Decentralization: (a) Each node in the cyberinfrastructure is equally important and acts independently of the others; (b) switching (layer 2 forwarding) is fully distributed.

(F2) Scalability: (a) The protocol should scale as network size increases. (b) There must be deterministic bounds on switching hops and maintenance cost (adding/removing a node) relative to the number of nodes in the network.

(F3) Availability: (a) In a continuous state of change, a path of bounded length between two connected nodes must always be found. (b) The protocol must work for the arbitrary arrival and departure of nodes, and for those with a short lifespan. (c) Creating new links does not cause overlay-wide communication interruption. (d) All overlay layer 2 links are available for use, and nodes can create and utilize direct links as needed. (e) Existing unicast and multicast sessions should proceed unaffected to the extent allowed by underlying physical connectivity.

(F4) Fault Tolerance: (a) The system must tolerate a known degree of failure within the overlay without creating partitioned networks. (b) Beyond this limit, it must remain operable within individual partitions.


(F5) Resilience: (a) The system must autonomously repair a partitioned overlay and resume overlay-wide functionality when inter-partition links are restored. Failures or node departures are common at scale, so failure detection should be rapid and efficient.
(F6) Flexibility: (a) The system must provide multiple independent Ethernet broadcast domains with no forwarding loops. (b) Applications must have flexibility in how they assign layer 3 addresses, and an IP address can be relocated to any node in the overlay. (c) Functional parameters must be configurable for operational tuning (e.g., the number of links to create), and nodes can be configured independently.
(F7) Simplicity: (a) The system design must be practical to implement and (b) require minimal operational configuration.

The rest of this chapter is organized as follows. Section 6.2 introduces and briefly explains the technologies that are fundamental to this work. Section 6.3 is a detailed review of the design and components of the implemented system which references BF. Section 6.4 explains the basis for the evaluation of BF, how the design claims are validated, and the environment that is used to accomplish this. Section 6.5 presents the experimental findings and interprets the results. Section 6.6 considers potential future work that intuitively extends what is already accomplished. Finally, in Section 6.7 the summary and concluding remarks are presented.

Background

The link layer (layer 2), e.g., Ethernet [76], is concerned with the point-to-point transmission of frames between endpoints connected by a link. Multiple devices can be connected to a network bridge, which performs layer 2 switching to transparently forward messages between its hosted devices. Bridges can also be connected to form a switched network fabric (typically envisioned as a grid) to enable communication
between any pair of hosted devices. Hosted devices or leaf devices are differentiated from bridges or switches in that they perform no switching functionality and are either the original source or ultimate destination of a message. Each leaf or bridge device that participates in the same Ethernet namespace is also in the same broadcast domain; this implies that a broadcast sent from a single device is delivered to every other participant.

Ethernet has no support for loops, and specialized procedures must be implemented to protect against layer 2 broadcast storms, which would occur as frames get repeatedly duplicated at each bridge that propagates them. One approach is to design a network architecture that is loop-free, e.g., hierarchical topologies. Other approaches must implement a protocol to avoid loops that occur naturally in a fabric. A simple and widely used approach that can be implemented in the data plane is the Spanning Tree Protocol (STP) [53], [54]. In STP, all the bridges in the domain agree on a root bridge that becomes the central coordinator in identifying and disabling redundant links until a spanning tree remains. This approach has several drawbacks: domain-wide communication is interrupted each time a bridge is added or removed, as the spanning tree must be recalculated; multiple links are disabled and left idle; contention for the existing links is increased; and the overall throughput of the fabric is reduced. Furthermore, these drawbacks are exacerbated in the dynamic, distributed environments that are found in fog and edge computing.

There are other approaches [77] that implement their own software-based switching based on expectations of the topology. Critical to the feasibility of such approaches is Software Defined Networking (SDN), which splits the typical bridge into two components: the control and data planes. The data plane remains implemented in
specialized networking hardware and is concerned with efficiently moving network frames from one port to another, while the control plane is moved to a general purpose OpenFlow [19] and Ryu [69]. For practical reasons, the switching protocol is designed and implemented based on assumptions on how the underlying topology is structured, so changing the topology implies changing the behavior of the protocol.

Overlay networks are application-level implementations of value-added network services built on existing underlying communication infrastructure. They provide functionalities such as resilient routing, distributed data structures and multicast. Exploiting the properties of how nodes in an overlay are arranged and interrelated can yield beneficial results. This has been shown in the study of distributed hash tables (DHT) and structured P2P systems [50], [59], [78], [79]. These works have shown that approaches using structured P2P, where the nodes in a distributed system are equivalent peers that assemble and maintain a defined structure, are resilient and scalable. They work well for large and small networks alike, even when participants independently arrive and depart at arbitrary intervals. These properties are desirable for communication environments outside the cloud.

data centers occur at greater geographic distances, the communication latency increases; as the number of leaf devices increases, the throughput available to each device decreases. This trend is expected to continue with the advent of IoT and ubiquitous computing. Furthermore, the sense-analyze-actuate model that defines cyber-physical infrastructures is anticipated to be reliant on low-latency communication within very dynamic groups. To support these
demands on existing infrastructure, portions of cloud resources are being moved out to data centers. Moving compute and storage resources into proximity of where the data is generated eliminates unnecessary round trips to the cloud and improves the quality of communication experienced by IoT applications and end users. However, simply moving lightweight data centers to the edge is insufficient, as the ability to effectively build and operate a layer 2 domain is desirable. This is apparent when considering existing services such as VM migration [77].

Chord [59] and Symphony [50] have shown that building an overlay network with the desired properties can be accomplished, as they sufficiently address the issues of the topology and locating data items using DHTs. Viewing bridges as nodes and tunnels as links, the fundamentals of a virtualized overlay network take shape. Furthermore, these overlay networks can extend from cloud to edge, enabling existing methodologies and utilities to be used for orchestrating IoT containerized applications and services.

Design

Peer-to-peer systems are distributed systems that are defined by the lack of centralized control or specialized roles, and where each participant has functionally equivalent capabilities. These properties make them a highly scalable and robust approach to system design and, more specifically, well suited for efficient location of identifiable items. The goal of Bounded Flood is to deliver scalable layer 2 forwarding for dynamic cloud-to-edge network environments where the peer nodes act as software-defined bridges. Our two-fold approach relies on virtualized overlay networks comprising a structured ring topology and an SDN-enabled layer 2 switching fabric. The first component (the IPOP Controller) is solely focused on assembling the overlay,
and the second (the SDN Controller) programs the corresponding switching rules. Bounded Flood employs the distributed P2P approach, where each node possesses identical capabilities and independently maintains its own instances of the IPOP and SDN controllers. It uses no centralized components for overlay management and switching. However, it uses an online social network service (XMPP) and STUN/TURN [10] for endpoint discovery and tunnel bootstrapping.

IPOP Controller

D Kleinberg routable small world network [80]) structured overlay. Figure 6-1 presents a rendering of an instantiation of this topology. The topology defines three types of links: successor, long-distance, and on-demand. Nodes are arranged on the perimeter of the ring, clockwise in increasing order of their unique ID, and each node is configured to link to its closest successors. As each node creates its successor links, accommodating for circular wrap-around, the loop is closed and forms the ring. The ring is the fundamental structure that is required for correctness, as greedy internode routing between peers occurs in a clockwise fashion based on overlay identifiers.

On-demand links are used elastically, created and destroyed as needed to facilitate a direct path of one hop between peers. Using on-demand tunnels reduces the switching hops and removes the traffic burden on intermediate bridges that lie on the communication path. Long-distance links are used as shortcuts across the ring and reduce the average path length between two nodes. Bounded Flood implements shortcuts based on Symphony long-distance links [50]
maintains long-distance links and selects long-distance peers by drawing a from the following probability distribution function (pdf):

(6-1)

The product specifies the clockwise distance, from the source node, of the furthest peer, at a distance less than, that is selected as the endpoint.

There are two components integral to the functionality provided by the topology module. They are the network graph builder, which creates a representation of the desired network state, and the network orchestrator, which transitions the network from its current state to the desired one. As nodes join and leave the overlay (a process known as churn), the composition and size of the graph changes. Peers which were previously ideal candidates for links become less favorable choices, triggering the graph builder to reevaluate the desired network state.

The network graph builder receives as its input the currently available candidates, the targeted amounts for each type of link, and the existing realized network graph. It selects new successor and long-distance nodes based on available candidates and replaces existing, less favorable ones. A long-distance peer is considered too close, and is discarded, if it falls within the distance given by:

(6-2)

See Figure 6-2 for an illustration of these concepts. long candidates are selected as long peers can select this node as their long-distance peer this is equivalent to 2k long-distance connections per node. Nodes must be prepared to accept as many links as they create, or the resulting imbalance would
cause cascading failures in link creation across the overlay. Additionally, static links can be specified in the configuration at startup and are maintained while the peer nodes are online, in contrast to on-demand links, which are added or removed as needed.

adjacency list. The differences between the current and desired states provide the context to generate the new network state via update, remove and add edge operations, and in doing so, it enacts changes in the overlay. As each node in the overlay acts independently, each orchestrator effectively creates a star graph with the local node at the center of its adjacent peers. It is candidates that the distributed set of star topologies become the ring ensemble.

The orchestrator reliably transitions the network state by identifying the differences between the current realized state and the desired state. It removes deprecated links and initiates creation of newly added ones. It takes care to create replacement successor relay. It negotiates with peers to create links, rejecting requests when link quantity thresholds are reached. It handles link interruption and bootstrapping failure, ensuring

SDN Controller

The novel SDN Controller developed in this work, termed Bounded Flood, is a Ryu-based [69], OpenFlow-compliant [19] module that utilizes a specialized method for broadcast to discover acyclic network paths between peer nodes. Each node along a network path uses its edge data to build its data plane forwarding database. Each forwarding record in the database is an OpenFlow flow rule which specifies the egress action when a specific ingress port, source MAC and destination MAC are observed. The Learning
Table, Flood Route and Bound (FRB), and Flooding Bound are three key components which form the core of the Bounded Flood SDN Controller. They are explained below.

Learning Table

The learning table is a compound structure that maintains the decision-making data used to build the data plane forwarding database. It performs both Ethernet address learning [53] and root bridge learning. Within an overlay, leaf devices are the initial producers and final consumers of network messages, and they connect to the overlay through their respective root bridge (Figure 6-3). The learning table learns of leaf devices and their associated root bridge by observing encountered FRB headers; this is described in detail in the next section. Each entry in the root bridge table contains a peer switch descriptor, as observed. The peer switch descriptor comprises the peer node ID, the local port number if the peer is adjacent, and the set of observed leaf devices managed by the peer switch. This data, which is regarded as an incomplete and potentially globally inconsistent descriptor of the peer switch, is treated as soft state and maintained for building on-demand tunnels. When flow metrics indicate data rates above or below specified thresholds, on-demand tunnels are created or destroyed, respectively. To establish an on-demand tunnel, the host bridge for each leaf device is identified from the root bridge table. The FRB protocol, which is explained below, guarantees that at least one BF controller, in a

Flood Route and Bound

Flood Route and Bound (FRB) is a custom Ethernet protocol developed as part of the BF system to serve two purposes. Whenever the bridge is required to perform a
broadcast to another peer bridge, it replaces the operation with an FRB. Hence, the FRB becomes the method for performing duplicate-free broadcasts in an environment with layer 2 link cycles. Additionally, the FRB is used to share the local leaf devices hosted by the switch with its adjacent peers. In this role, the FRB includes the node ID (NID) of the initiating switch (the root NID) and the list of leaf MAC addresses that are connected to the bridge. This exchange is performed whenever a tunnel is established between peers. In the former role, the FRB header must specify, in addition to the root NID, a bound NID for limiting broadcasts, and the payload that is the original broadcast frame. The bound NID is the identifier of an IPOP node in the overlay; it indicates to the message recipient how far along the ring the broadcast can be safely propagated and hence prevents message duplication. The bound NID must be recalculated for each of Bound Algorithm. The FRB structure is depicted in Figure 6-4.

Flooding Bound Algorithm

Designed to handle broadcasts in the presence of layer 2 loops, the algorithm produces, for each destination NID, the corresponding peer NID that will terminate further propagation of the message, i.e., the bound. The approach relies on (1) the previously described structured topology, (2) an adjacency list consisting of peer nodes that are connected by an edge, and (3) the expectation that each adjacent peer bridge that receives the FRB will deliver the payload to its leaf devices and forward the FRB to its own peers that lie within the bound. Each output tuple produced by the algorithm is a closed-open interval specifying the recipient i of the message, and the furthest node j to which the recipient
may forward the message. This distance is based on the clockwise circular distance between two nodes. The NID is an integer value representing its distance from position 0, and the furthest possible node is at a distance equal to the maximum value of the node identifiers m bit address space or . Node is considered at distance 0 from itself, and the distance to another node is given as, for ; otherwise . To determine the bound, the initiating node selects two adjacent peers i (the message recipient) and j such that j is the next greater node ID after i in its sorted (by NID) adjacency list. The Flooding Bound tuple is then [i, j). This procedure is depicted in Figure 6-2. This is repeated for every such pair in its adjacency list, and for the entry preceding its own, the switch uses its own node ID as the bound. On receipt of an FRB, a controller evaluates its own adjacent peers, determining which lie within the received bound and calculating a new bound for each one. Determining the new bound is done using the same procedure as described above. The flooding bound algorithm ensures that broadcast frames are never duplicated, are delivered to all devices in the overlay, and eventually terminate.

As FRBs are propagated throughout the overlay, they are tracked at each node to update its local learning table. This information collectively provides a return route across the overlay to the FRB initiator. As there are potentially multiple valid paths between any two peers, the network path identified between two nodes is not guaranteed to be the shortest path, due to the greedy clockwise routing procedure used for discovery in the flooding process. However, it is bounded by [50]. The Bounded Flood design imposes no restrictions on the number of links a node can create over its
operational lifespan; furthermore, each node can independently and dynamically vary the number of long-distance links it creates. Thus, when k , the average overlay switching path is reduced to .

Overlay Churn

As nodes join and leave the overlay, the restructuring of the topology can potentially disrupt network connectivity within the overlay. This is mitigated by using redundant successor links. If a node has a single successor, then its departure will create a partition in the segment from the node to its first long-distance node. The partition remains until the failure is detected and a new successor is linked. In general, to tolerate the concurrent departure of n successors, n + 1 successor links are required (Figures 6-5 and 6-6).

Churn can invalidate any edge along an active communication path between nodes. As Bounded Flood routing decisions occur independently per node, the root bridges will have no indication of the failure if it involves nodes outside their adjacency lists. However, other nodes on the path will detect the failure and will attempt to rediscover a route to the destination. In this scenario, when no path is known for forwarding the request, the node performs a bounded flood for the forwarding operation. The frame is delivered to the intended recipient at the extra cost of the broadcast.

Evaluation

Bounded Flood has been implemented and evaluated quantitatively, in a realistic Internet environment, for correctness and performance. The set of tests described below is used to illustrate correct functionality with existing networking tools and applications, as well as to validate the design claims of efficiency and scalability.
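The successor redundancy rule from the Overlay Churn discussion, which experiments E4 and E5 below exercise, can be checked with a small sketch. The Python model here is an illustration under stated assumptions, not the IPOP implementation: node IDs are the integers 0..n-1, each node links only to its s clockwise successors, no repair has taken place yet, and the function name `ring_survives` is hypothetical.

```python
def ring_survives(n, s, departed):
    """Return True if the successor ring is unbroken after `departed` nodes leave.

    Assumed model: NIDs are 0..n-1, each node holds links to its s clockwise
    successors, and the overlay has not yet repaired itself.
    """
    for node in range(n):
        if node in departed:
            continue
        # The ring breaks at `node` when every one of its s successors is gone.
        if all((node + step) % n in departed for step in range(1, s + 1)):
            return False
    return True


if __name__ == "__main__":
    cases = [(1, {5}), (2, {5}), (2, {5, 6}), (2, {5, 7}), (3, {5, 6})]
    for s, gone in cases:
        print(f"s={s} departed={sorted(gone)} intact={ring_survives(64, s, gone)}")
```

In this model the ring stays intact exactly when no run of s or more consecutive nodes departs, which is the n + 1 successor rule restated: two successor links ride out one departure, or two non-adjacent ones, but not two adjacent departures.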


Below is the description of each of the conducted tests, its purpose, and the requirement from Section 6.1 that it validates. For all tests the configuration parameters are n (the total number of nodes in the overlay), k (the number of outgoing long-distance links per node), s (the number of successor links per node), p (the network path length, i.e., the number of edges or switching hops), and forwarding, which can be either STP with MAC learning or BF. When using BF forwarding, on-demand tunnels (OND) can be used to create ad hoc edges as needed. Network tools used include arping [81] (explicit ARP request/response verification), ping [82] (latency measurement), iperf [83], [84] (TCP bandwidth measurement), and tcpdump [85] (fine-grained inspection of network traffic).

Experiment Test Cases

E1. An ARP test verifies (F6.a). An ARP request is generated using arping, and each BF node records its received ARP requests. The test ensures that every BF node receives at most one ARP request, and the initiator receives the ARP reply.

E2. The reassign IP address test verifies (F6.b). An IP configuration for the same subnet is applied to two nodes and Node 1 is shown to ping Node 2. Node 2 is then powered down and its IP address is reassigned to Node 3. Node 1 is shown to successfully ping the IP address relocated from Node 2 to Node 3.

E3. The multicast test verifies (F3.e). A multicast group is created using iperf version 2. The test verifies the overlay support for the multicast single-writer, multiple-readers environment. An overlay was instantiated with parameters n = 64, s = 2, k = log n, and one node was selected as the multicast writer and two as the readers. The output from iperf was observed for functional correctness.


E4. Connectivity within partitions test verifies (F1.b, F3.c, F4.b, F5.a). An overlay was instantiated with parameters n = 64, s = 1, k = 2. A node was then selected, and two ping tests were initiated between the node and (1) its second successor and (2) the first long-distance linked peer which follows its successor. The successor node was shut down and the output of ping was observed.

E5. Resilience against (n - 1) failures with n successors test verifies (F1, F4.a). An overlay was instantiated with parameters n = 64, s = 2, k = 2, and the procedure in E4 was repeated. The ping test was then invoked between the node and (1) its peer (pn3) located 3 successor edges away and (2) the first long-distance peer which follows pn3. Both nodes at 1 and 2 successor edges distance were shut down and the output of the ping tests was observed.

E6. Overlay with mixed configuration parameters verifies (F1.a, F6.c). An overlay was configured such that BF nodes had differing amounts of successor and long-distance links, and some had on-demand tunnels enabled and others disabled: nodes 1-8 {s = 1, k = log n, OND = disabled}, nodes 9-16 {s = 2, k = 1, OND = enabled}, nodes 17-24 {s = 2, k = log n, OND = disabled}, nodes 24-32 {s = 2, k = log n, OND = enabled}. The functionality was observed.

E7. On-demand tunnel test verifies (F3.c, F3.d). Using the parameters n = 128, s = 2, OND threshold = 100 MB, and total transfer size = 3 GB, 3 overlays were instantiated such that (1) k = 4, BF forwarding with on-demand enabled; (2) k = log(n), BF forwarding with on-demand disabled; (3) k = 128, STP forwarding with MAC learning. We randomly selected 100 pairs of nodes and sequentially ran iperf, followed by ping, between each node pair. We recorded the bandwidth and average latency for each test in each scenario
and noted the differences in average path length between the 2 BF overlays of different k.

E8. Latency test verifies (F2, F7). An overlay was instantiated with parameters s = 2, k = log n and n = {8, 32, 64, 128}. Each node in the overlay pinged another node in the overlay, generating an FRB in the process. Each node will have received at least one FRB (broadcast) from every node in the overlay during this process, which allows recording the path length, in switching hops, from every sender. Each node recorded its maximum path length and the average of all path lengths between itself and its neighbors.

E9. Dynamic network (churn) verifies (F2.a, F3.a, F3.b). An overlay was configured with parameters s = 2, k = 2 and n = 32. Half the nodes in the overlay were booted and iperf was run between 2 nodes (without a direct link). We resumed booting of the remaining nodes at 60 second intervals. Next, nodes randomly left and rejoined the overlay until the iperf test completed. We observed throughput over the duration of the transfer, including any timeouts or failures.

E10. STP vs BF path utilization test verifies (F3.d). The purpose of this test is to show reduced link contention when using BF. First, an overlay was configured and booted with n = 225, k = log n and forwarding = STP. STP will disable links until a spanning tree is formed. 300 client/server pairs were randomly selected and the bandwidth and latency for each pair were individually measured. The test was repeated on a similar overlay using BF forwarding and with the same node selections.

Testbed

The testbed is designed as clusters of Docker containers on physical hosts distributed across the Internet. The Bounded Flood processes execute in a privileged
Docker container and create the overlay network within the container's networking namespace. The virtualized network is never directly visible to the host. This approach is necessary to scale to a large enough network of hundreds of nodes, with up to 200 containers per host. The hosts have enough RAM such that no memory pages are swapped during execution. A single host is used for the experiments, except for E10 where 3 physical hosts, each with 75 containers, are used. The hosts are connected via WAN links, which exhibit lower bandwidth and higher latencies compared to the links between containers on the same host. Finally, all the clusters participate in the same virtual Ethernet broadcast domain. Physical hosts used in these experiments were provided by the academic clouds CloudLab [86] and Chameleon [87].

For the purposes of the experiment, it was necessary to create a leaf node that was attached to and hosted by its BF switch, as the bridge is never the initiator or ultimate destination of an application message. In each container a secondary bridge was instantiated, its internal port appropriately configured with an IPv4 address, and a patch link established between the two bridges. Applications would bind a socket to this internal port and the frame would be patched over to the local BF bridge, switched across the overlay to the destination BF bridge, patched to the destination application bridge and delivered to the recipient application. This is illustrated in Figure 6-7.

Results and Analysis

Cost of soft state

The data plane forwarding database must store two rules per pair of communicating leaf devices. If a bridge has N leaf devices and there are M other leaf devices on the overlay, then the bridge must maintain at most 2MN flow entries. While the flow entries are set to expire over an idle threshold, and typical workloads do not
exhibit this pattern, this is the worst case for a bridge, when each of its leaf devices is actively communicating with every other leaf device that is hosted by a remote bridge. The learning table maintains an ephemeral table for every peer switch and its hosted leaf devices. The data is used for determining flow rules and the endpoints for on-demand tunnels. The topology state tracks the identifier of each node present in the overlay. Extended structural information is only kept for adjacent peers, which is , where S is the number of successors, L is the number of long-distance links and D is the number of on-demand links.

Join or Departure Cost

When a node attempts to join an overlay, an OSN sign-in presence message is exchanged with each online friend for an cost. Creating a single link is a constant cost that includes the messages related to Interactive Connectivity Establishment (ICE) [72] and the exchange of endpoint data between the two hosts. Each node will create s successor and k long-distance links and will receive at most the same number of incoming links, resulting in 2(s + k) links for the join operation. The join cost is .

When a node leaves the overlay, s successor links must be repaired. If the departure results in a change in , n links in the overlay will be discarded and relinking will occur. However, from the closeness function (6-2), only one long-distance link per node will possibly require relinking. The cost to leave is in the worst case, and otherwise on average .
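The two counts that are stated explicitly above, the worst-case flow table size and the number of links touched by a join, reduce to simple arithmetic. The following Python sketch only restates those two figures; the function names and the example values are hypothetical.

```python
def worst_case_flow_entries(local_leaves, remote_leaves):
    """Two rules per communicating leaf pair: at most 2 * M * N flow entries.

    local_leaves is N (leaf devices on this bridge); remote_leaves is M
    (leaf devices hosted elsewhere on the overlay).
    """
    return 2 * remote_leaves * local_leaves


def links_touched_by_join(s, k):
    """s + k outgoing links plus at most s + k incoming links."""
    return 2 * (s + k)


if __name__ == "__main__":
    # Hypothetical example: one leaf device per bridge in a 64-node overlay.
    print(worst_case_flow_entries(local_leaves=1, remote_leaves=63))
    # Hypothetical example: s = 2 successors and k = 6 long-distance links.
    print(links_touched_by_join(s=2, k=6))
```

Because idle flow entries expire, the first figure is an upper bound on soft state rather than a typical table size.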


Route Discovery Cost

BF is used as a replacement for Ethernet broadcast for route discovery between a pair of communicating hosts. A single message must be delivered from the initiator to each participant in the overlay, but only once. Hence, the cost of route discovery is the same as broadcast and is simply O(n). Multiple successor links provide redundancy for fault tolerance. The long-distance links are used for efficiency and are not necessary for correctness.

Verification of Correctness Tests

All tests for correctness, E1-E6, completed as expected.

(E1) Inspection of the Ethernet broadcast of ARP using BF indicated a single request/response exchange, and the host was found via the broadcast.
(E2) The IP address was successfully located via ping before and after its relocation to a different node and MAC address in the overlay.
(E3) Multicast completed successfully without any errors for the 1 Mbps transmission rate tested.
(E4) Ping to the second successor failed when the 1st successor was shut down; however, ping to the first long-distance node was uninterrupted. This indicates a partitioning in the network between the failed node and up to the 1st long-distance peer. As the network repaired its topology, the failed transmission resumed.
(E5) Using s = 2 prevents the network partitioning experienced in the previous test, but only against a single node being shut down. When two adjacent nodes are shut down, the transmission interruption manifests again. As a note, the overlay tolerated two nodes shutting down provided they were not immediate successors.
(E6) The overlay converged to a stable state where, in the absence of churn, no new links were being created. ARP, ICMP and TCP between nodes tested successfully.
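The exactly-once delivery observed in E1 follows from the way the Flooding Bound intervals partition the ring. The Python sketch below is one reading of that procedure, not the Ryu module itself. It makes several assumptions: a 16-bit identifier space, uniformly random shortcut links rather than Symphony draws, links usable in both directions, and the last peer inside a received bound inheriting that bound. All function names are hypothetical.

```python
import random
from collections import Counter, deque

SPACE = 2 ** 16  # assumed size of the NID address space


def cw(a, b):
    """Clockwise distance from NID a to NID b on the ring."""
    return (b - a) % SPACE


def bounds(me, peers, limit):
    """Yield (recipient, bound) for each peer of `me` strictly inside (me, limit)."""
    span = cw(me, limit) or SPACE  # limit == me means the initiator: whole ring
    inside = sorted((p for p in peers if 0 < cw(me, p) < span),
                    key=lambda p: cw(me, p))
    for idx, peer in enumerate(inside):
        # Bound is the next peer clockwise; the last peer inherits `limit`.
        yield peer, (inside[idx + 1] if idx + 1 < len(inside) else limit)


def flood(adj, root):
    """Simulate one bounded flood; return how many copies each node received."""
    received = Counter()
    queue = deque(bounds(root, adj[root], root))
    while queue:
        node, bound = queue.popleft()
        received[node] += 1
        queue.extend(bounds(node, adj[node], bound))
    return received


def random_overlay(n, s, k, seed=0):
    """Ring of n random NIDs with s successor links and k random shortcuts each."""
    rng = random.Random(seed)
    ids = sorted(rng.sample(range(SPACE), n))
    adj = {nid: set() for nid in ids}
    for pos, nid in enumerate(ids):
        peers = [ids[(pos + d) % n] for d in range(1, s + 1)] + rng.sample(ids, k)
        for peer in peers:
            if peer != nid:
                adj[nid].add(peer)
                adj[peer].add(nid)
    return adj


if __name__ == "__main__":
    overlay = random_overlay(n=64, s=2, k=3)
    copies = flood(overlay, root=next(iter(overlay)))
    print(len(copies), "nodes reached; max copies per node:", max(copies.values()))
```

The argument relies only on each node being linked to its immediate clockwise successor: every forwarded interval is strictly smaller than the one received, so the flood terminates, and the intervals handed to adjacent peers never overlap, so no node sees a second copy. The shortcut links shorten the paths but do not affect this property.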


On-demand Tunnels

Experiment E7 validates the benefits of using on-demand tunnels with BF and shows that on-demand tunnels have a detrimental effect on STP (Figure 6-10). Overlay A (on-demand enabled) is configured to use fewer long-distance links than Overlay B (no on-demand), and Table 6-2 shows an expected increase in overlay average path length. It takes an additional switching hop on average to deliver a frame between each pair of nodes. However, both bandwidth and latency results show improvements when using on-demand tunnels, although there are fewer links available in the overlay and its average path length is greater. For instance, inspecting an inflection point in Figure 6-8 at 0.6 ms, we observe in overlay A 82% of latencies below 0.6 ms, while only 12% were below 0.6 ms in overlay B, a 6.8x improvement. Additionally, Figure 6-9 shows 96% of bandwidth tests were over 330 Mbps in overlay A as opposed to 41% in overlay B, a 2.4x improvement. By employing on-demand tunnels, a BF overlay can achieve better bandwidth and latencies than without them, while using fewer long-distance links. As tunnels incur an ongoing maintenance cost, using a lower k reduces the associated overhead for each node.

Network Switching Latency (Path Length in Hops)

The expected average switching cost of the overlay is and when k is configured as . From experiment E8, each node records the maximum number of hops between itself and every other peer, and the average of all path lengths to every peer.

100 The results in T able 6 3 indicate that while the maximum hop count observed at each node can exceed the theorized bound, for an overlay with 128 nodes, only 3 paths had a length 12 and approximately three quarters of the paths had length less than 9. F igure 11 shows t he network average path length scales linearly with the number of n odes in the overlay , and for the network sizes evaluated it is approximately bound. As switching hops varies with the path length, it can be bound to a function of overlay size. Furthermore, as BF handles Ethernet loops, a direct edge can be placed between nodes to reduce the cost of switching to a single hop. This supports the claim that BF works well fo r both small and large overlays. Dynamic Network (Churn) Experiment E 9 validates that BF forwarding remains functional in a dynamic and changing topology. Within the specifications of the test ( i.e., the degree of churn ) the switching flows can be reprogr ammed, and the topology updated without impact to existing TCP streams. From a converged overlay, 8 nodes were randomly selected and sequentially started and stopped, with a 15 second interval between each event. No transmission failure or timeout was obse rved. Figure 6 1 2 illustrates the variation of throughput over time between two nodes. STP vs BF Experiment E 10 validates that BF allows multiple paths on a cycle to be safely utilized with a performance benefit over STP. F rom Figure 6 13 , BF indicates a g reater occurrence of low latency paths compared to STP, with 97% more nodes with a latency less than 36 ms. T he maximum bandwidth report ed in two similarly configured overlays were similar , but Figure 6 14 shows that BF had more paths with higher bandwidth than


STP. Examining the inflection point at 60 Mbps, approximately 37% of BF measured bandwidths are greater than 60 Mbps, while only 18% of STP bandwidths are above 60 Mbps, a 2.2x increase in the number of such paths. BF exhibits better latencies and bandwidth, experiencing reduced link contention and fewer switching hops, as a result of utilizing all available links for transmission. STP must selectively disable links to create its spanning tree, which has the effect of creating longer paths with increased sharing.

Concluding Remarks

Lessons Learned

Application models will change to reflect and best utilize the properties of the fog architecture. Communication privacy and deployment needs are drivers for packaging the network and its parameters. Analogous to how applications are packaged as microservices in containers, an overlay represents a preconfigured, virtualized layer 2 network packaged for immediate and reusable deployment.

Due to the complex interactions of decentralized P2P systems, they can exhibit unexpected emergent properties. It is therefore necessary to aggregate the component states as an ensemble and visualize the system both as a whole and as individuals.

Scalability is an important factor, as it indicates the solution's versatility for use in environments of varying sizes. The average number of switching hops in a BF overlay scales logarithmically with the size of the overlay.

Support for dynamic environments and churn is necessary in a loosely controlled environment. Control here means more than the right or ability to determine how or when a resource is used; it also covers the external factors that affect its operation and availability. Because overlay participation is very likely to be fluid, the peer design is integral to the system's ability to respond without interruption.


The use of multiple successor links for fault tolerance introduces overhead; however, as all links are utilized, they reduce the last-mile hop (the point on the path between two nodes where no further long-distance links can be used and the remaining hops are traversed using successor links) by .

How a network is partitioned is not always obvious, as forwarding is only done clockwise around the ring. Environmental network conditions may cause successor links to fail while long-distance links from other nodes are connected. For example, two nodes that are both behind symmetric NATs would fail to create their successor link, but long-distance links with a node behind a full-cone NAT would succeed. Messages forwarded along the successor path would fail, while those along the long-distance path would be delivered.

The flexibility of the system allows peers to be parameterized independently and reflects the conditions encountered in the distributed and heterogeneous operational environments of the fog. Nodes within the overlay may very likely be owned and controlled by independent entities, each with their own configuration criteria. However, this flexibility impacts the predictability of the overlay performance bounds. Fortunately, on-demand links can be used to mitigate this.

There are many challenges to building and maintaining a resilient overlay, and it must handle the traditional P2P vulnerabilities to rogue members. Even with good-faith participants, transient environmental problems can delay timely repairs to the overlay. For example, one must decide how forgiving to be of a peer that repeatedly fails to fulfill its operational role before removing it from consideration for further linking attempts.


A minimal out-of-the-box operational configuration does not imply few configurable parameters. While default settings will work in most scenarios, an determined by the configuration choices available to the administrator.

Design choices that are practical to implement are critical to a system such as this. The importance of resilience within the overlay and the disruptive impact of churn were major factors in selecting Symphony over Chord. Although Chord gives more precise bounds on switching hops, it is subject to higher churn maintenance. This is because chords must be maintained at each log n node, whereas Symphony requires only that the less stringent conditions of its harmonic function are met. The result is fewer changes to existing links as nodes join and leave the overlay.

Our experiments show that using k=4 instead of k=7 when n=128 results in an additional switching cost of approximately 1 hop. This setup can be used in environments where the increased switching latency can be tolerated for the benefit of reduced tunnel overhead.

The characteristics of BF overlays using on-demand edges indicate noticeable improvements over standard BF. The analytics used in this experiment to trigger the creation and removal of the on-demand tunnels were simple. With increased sophistication, they hold promise for better overlay performance and support for enhanced features such as multicast trees.

There are clear benefits to using BF over STP. STP must disable edges in the presence of loops to create the spanning tree; hence, it is never able to use all available links in the topology. Using fewer links results in longer


paths with increased latency, and the utilized links experience higher contention as more traffic is routed over them. Additionally, adding or removing a link in the overlay requires interrupting network transmission until a new spanning tree is evaluated. As such, no benefits are derived from on-demand edges with STP. In topologies with multiple layer 2 loops, BF outperforms STP, providing lower latency and increased available throughput.

Summary

We have presented our approach for building network cyber-infrastructure in support of fog computing. We proposed that virtualized overlay networks built on P2P tunnels are viable for connecting the network core to its edge, and for orchestrating dynamic edge networks of IoT leaf devices and edge processing nodes. To substantiate this, we proposed and executed several experiments that verify correctness and performance, and discussed how the results met expectations. Experiments 1-6 show that the virtual network behaves correctly as a layer 2 Ethernet broadcast domain supporting existing upper-layer protocols and applications. Experiments 7-9 verify the design and theoretical claims. Experiment 10 illustrates its performance improvements over STP, the popular existing approach. These experiments are derived from the functional requirements (F-1 to F-7) outlined in Section 6-1, which are in turn based on the project objective. Therefore, the successful outcome of our experiments identifies Bounded Flood as a scalable layer 2 forwarding approach for dynamic cloud-to-edge network cyber-infrastructure.


Figure 6-1. A rendering of the logical view of the IPOP overlay topology supporting Bounded Flood. This overlay contains 128 nodes, with 2 successors and 7 long-distance links per node. The nodes are arranged in a circle, with each node creating a successor link to each of its two closest clockwise neighbors. Each node also selects 7 peers, based on its probability distribution function (Function 6-1), for long-distance links.


Figure 6-2. A segment of an overlay with 128 nodes using a 7-bit address space. Successor links are colored orange, and the long-distance link is colored blue. The red link is invalid as the closeness function evaluates to 15. Green links show the flow of an ARP broadcast that is sent from leaf device 0. The ARP is encapsulated at switch S0 in an FRB and transmitted on each port with the appropriate bound. S1 will only propagate the FRB to S2, as its next edge (to S70) is beyond the NID 23 bound.

Figure 6-3. Illustrates the root and peer switch roles. Switch A is the root switch for devices A1 and A2, and switch B is the root switch for device B1. Switches A and B are peers, as they are both BF switches in the same overlay network.
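The bounded propagation described in the caption of Figure 6-2 can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the IPOP implementation: the function names are hypothetical, the neighbor sets used in the example are an assumed reading of the figure, and the rule that each eligible peer receives the identifier of the next eligible peer as its bound is an assumption. Only the 7-bit address space and the outcome at S1 (forward to S2, skip S70) come from the caption.

```python
RING_BITS = 7                 # 7-bit address space, as in Figure 6-2
RING_SIZE = 1 << RING_BITS    # 128 node identifiers


def clockwise_distance(src, dst):
    """Number of identifier steps from src to dst, moving clockwise."""
    return (dst - src) % RING_SIZE


def plan_flood(node_id, neighbors, bound):
    """Return (peer, new_bound) pairs for the ports that should carry an FRB.

    A peer is eligible only if it lies clockwise strictly between this node
    and the bound. Each eligible peer is handed the identifier of the next
    eligible peer as its own bound (an assumed rule); the last one inherits
    the bound this node received. A bound equal to node_id is taken to mean
    "the whole ring", which is how the originator is modelled here.
    """
    reach = clockwise_distance(node_id, bound) or RING_SIZE
    eligible = sorted(
        (peer for peer in neighbors
         if 0 < clockwise_distance(node_id, peer) < reach),
        key=lambda peer: clockwise_distance(node_id, peer),
    )
    plan = []
    for index, peer in enumerate(eligible):
        is_last = index == len(eligible) - 1
        plan.append((peer, bound if is_last else eligible[index + 1]))
    return plan


if __name__ == "__main__":
    # S0 originates the broadcast; S1 then receives it with bound 23.
    print(plan_flood(0, [1, 23], bound=0))
    print(plan_flood(1, [2, 70], bound=23))
```

Under these assumptions S1 forwards only to S2, because S70 lies 69 identifiers clockwise of S1 while the bound is only 22 identifiers away.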


Figure 6-4. FRB structure, types 1 and 2. FRBs are used by BF bridges to (1) exchange leaf MACs and (2) perform overlay broadcasts.

Figure 6-5. Partitioning from churn when using a single successor. (A) Nodes A, B and C are 3 consecutive nodes in the overlay segment, each connected by a successor link. A message from A to C must be sent via B. (B) If B leaves the overlay, the overlay is partitioned until a replacement link between A and C is created. (C) Due to BF clockwise routing, A cannot reach C while this partitioning exists, but C can potentially reach A by routing the message clockwise via the other nodes.


Figure 6-6. Resilience to partitioning by using multiple successors. (A) Nodes A, B and C are 3 consecutive nodes in the overlay segment, each connected by 2 successor links. (B) If B leaves the overlay, no partitioning occurs. (C) A will detect the event and immediately start using the edge AC to send messages to C and the nodes beyond.

Figure 6-7. Bounded Flood bridge setup. For the purposes of the experiment, it was necessary to create a leaf node that was attached to and hosted by its BF switch. In each container a secondary bridge was instantiated, its internal port appropriately configured with an IPv4 address, and a patch link established between the two bridges.


Figure 6-8. Cumulative percentages of average latency of BF and BF with on-demand tunnels. For on-demand, the latency is measured after the on-demand tunnel has been established. Approximately 71% fall in the 0.4-0.5 range, exhibiting the expected behavior of a single hop. The outliers result from transient environmental conditions, as repeating those test cases yields results in the range 0.5-0.7 milliseconds.
[Chart: BF vs BF+OND, cumulative % of average latency. X axis: average latency, 0.1 to 3.3. Y axis: cumulative %. Series: BF+OND, BF.]

Figure 6-9. Cumulative percentages of bandwidth for BF and BF with on-demand tunnels. The comparison shows an increase in bandwidth between nodes when using BF with on-demand tunnels. The on-demand tunnels were created after the transfer was initiated and crossed a 100 MB threshold; the switching flows were then transparently reprogrammed to use the new path.
[Chart: BF vs BF+OND, cumulative % of bandwidth. X axis: bandwidth (Mbps), 150 to 480. Y axis: cumulative %. Series: BF+OND, BF.]
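The threshold rule mentioned in the caption of Figure 6-9 can be sketched as a per-flow byte counter. This is a minimal sketch, not the analytics used in the experiment: the class name `OnDemandTrigger` and its interface are hypothetical, 100 MB is interpreted as 100,000,000 bytes, and tunnel removal is left out because the source does not describe its criteria.

```python
from collections import defaultdict

THRESHOLD_BYTES = 100 * 1000 * 1000  # 100 MB, per the Figure 6-9 caption


class OnDemandTrigger:
    """Tracks bytes per (src, dst) flow and flags flows that reach the threshold."""

    def __init__(self, threshold=THRESHOLD_BYTES):
        self.threshold = threshold
        self.bytes_seen = defaultdict(int)
        self.requested = set()

    def record(self, src, dst, nbytes):
        """Add nbytes to the (src, dst) flow.

        Returns True exactly once per flow, at the first call where the
        running total reaches the threshold, signalling that a direct
        tunnel should be requested. All other calls return False.
        """
        pair = (src, dst)
        self.bytes_seen[pair] += nbytes
        if pair not in self.requested and self.bytes_seen[pair] >= self.threshold:
            self.requested.add(pair)
            return True
        return False
```

Returning True only once keeps the caller from issuing duplicate tunnel requests while the first one is still being established.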


Figure 6-10. Bandwidth variation over time, STP and STP with OND. This illustrates the effect of using on-demand tunnels with STP. When a new link is added, the spanning tree must be recalculated, and network transmission is interrupted until the process completes. As this takes several seconds to complete and does not yield increased throughput, using on-demand tunnels has a detrimental effect on STP performance.
[Chart: STP vs STP+OND, bandwidth over time {n=128, s=2, k=log n}. X axis: interval (secs), 0 to 200. Y axis: bandwidth (Mbps), 0 to 250. Series: STP+OND, STP.]

Figure 6-11. Network average path length vs network size. The averages of the path length from each node to every other node within the overlay. Each node provides its average path length as the sum of all discovered path lengths divided by the number of paths. The network average path length is the average of the nodes' path lengths. The results show it to scale logarithmically with overlay size.
[Chart: network average path length {s=2, k=log n}. X axis: network size (nodes). Y axis: average path length.]

Network Size (nodes)   Avg Path Len
2                      1
4                      1
8                      1
16                     1.229
32                     1.772
64                     2.458
128                    3.547
225                    4.560


Figure 6-12. Impact of churn on throughput between two overlay nodes measured over 300 seconds. As nodes join or leave the overlay, the topology changes as links are created or removed. This can result in observable changes in the path used for communication between peers, as the number of switching hops increases or decreases.
[Chart: churn {n=64, s=2, k=log n}. X axis: intervals (s), 0-10 to 290-300. Y axis: bandwidth (Mbps), 345 to 370.]

Figure 6-13. Histogram of latencies in BF and STP overlays. Larger values are better. The chart shows that BF has a greater number of lower-latency paths than STP.
[Chart: BF vs STP, cumulative % of average latency {n=225, s=2, k=log n}. X axis: average latency (ms), 4 to 80. Y axis: frequency (cumulative %). Series: STP, BF.]


Figure 6-14. Histogram of bandwidth in BF and STP overlays. Smaller values are better. The chart shows that BF has a greater number of higher-bandwidth paths than STP.
[Chart: BF vs STP, cumulative % of bandwidth {n=225, s=2, k=log n}. X axis: bandwidth (Mbps), 15 to 495. Y axis: frequency (cumulative %). Series: STP, BF.]


Table 6-1. Legend of experiment configuration parameters.

Term   Description
BF     Bounded Flood forwarding
STP    Spanning Tree Protocol forwarding
OND    On-demand tunnels feature enabled
n      The total number of nodes in the overlay
k      The number of outgoing long-distance links per node
s      The number of successor links per node
p      The network path length (number of edges or switching hops) between two nodes

Table 6-2. Comparison of overlay average path length. The OND-enabled overlay uses k=4 while the overlay without OND uses k=7; both are of size n=128.

                              BF + OND, k=4   BF, k=7
Overlay Average Path Length   4.3181          3.4060
Mean Deviation                0.3054          0.2747

Table 6-3. Maximum path lengths. Frequency and cumulative percentages, in overlay

Max Hops   Frequency   Cumulative %
5          1           0.78%
6          14          11.72%
7          32          36.72%
8          21          53.13%
9          27          74.22%
10         22          91.41%
11         8           97.66%
12         3           100.00%
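The cumulative column of Table 6-3 follows directly from the frequency column, and recomputing it is a quick consistency check. The sketch below copies the frequencies from the table; the function name `cumulative_percentages` is hypothetical. Values are left unrounded, so they agree with the printed table only to within rounding.

```python
# Max-hop frequencies copied from Table 6-3 (they sum to n = 128 nodes).
MAX_HOP_FREQUENCY = {5: 1, 6: 14, 7: 32, 8: 21, 9: 27, 10: 22, 11: 8, 12: 3}


def cumulative_percentages(frequency):
    """Map each hop count to the percentage of nodes at or below it."""
    total = sum(frequency.values())
    running = 0
    result = {}
    for hops in sorted(frequency):
        running += frequency[hops]
        result[hops] = 100.0 * running / total
    return result


if __name__ == "__main__":
    for hops, percent in cumulative_percentages(MAX_HOP_FREQUENCY).items():
        print(f"{hops:>2} hops: {percent:.1f}%")
```

The entry for 9 hops covers 95 of the 128 nodes, which is the "approximately three quarters" figure cited in the discussion of experiment E-8.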


CHAPTER 7
GRAPLER: A USE CASE ON THE PRACTICAL APPLICATION OF IPOP

No man ever wetted clay and then left it, as if there would be bricks by chance and fortune.
Plutarch

The GLEON Research And PRAGMA Lake Expedition (GRAPLE) aims to improve our understanding and predictive capacity of water quality threats to our freshwater resources, including climate change. It is predicted that climate change will increase water temperatures in many freshwater ecosystems, potentially increasing toxic phytoplankton blooms [88], [89]. Consequently, understanding how altered climate will affect phytoplankton dynamics is paramount for ensuring the long-term sustainability of our freshwater resources. Underlying these consequences are complex physical-biological interactions, such as phytoplankton community structure and biomass responses to short-term weather patterns, multi-year climate cycles, and long-term climate trends [90], [91]. New data from high-frequency sensor networks (e.g., GLEON) provide easily measured indicators of phytoplankton communities, such as in situ pigment fluorescence, and show promise for improving predictions of the ecosystem-scale wax and wane of phytoplankton blooms [92]. However, translating sensor data to an improved understanding of coupled climate-water quality dynamics requires additional data sources, model development, and synthesis, and it is this type of complex challenge that requires increasing computational capacity for lake modeling.

This chapter has been reprinted with permission from Concurrency and Computation: Practice and Experience.


Searching through the complex response surface associated with multiple environmental starting conditions and phytoplankton traits (model parameters) requires executing and interpreting thousands of simulations, and thus substantial compute resources. Furthermore, the configuration, setup, management and execution of such large batches of simulations is time consuming, in terms of both computing and human resources. This puts the computational requirements well beyond the capabilities of any single desktop computer system, and to meet the demands imposed by these simulations it becomes necessary to tap into distributed computing resources. However, distributed computing resources and technologies are typically outside the realm of most freshwater science projects. Designing, assembling, and programming these systems is not trivial, and requires the level of skill typically available to experienced system and software engineers. Consequently, this imposes a barrier to scientists outside information technology and computer science disciplines and presents challenges to the acceptance of distributed computing as a solution by most lake ecosystem modelers.

GRAPLE is a collaboration between lake ecologists and computer scientists that aims to address this challenge. Through this interdisciplinary collaboration, we have designed and implemented a distributed system platform that supports compute-intensive model simulations, aggregates resources across an overlay network spanning collaborating institutions, and presents intuitive WEB service-based interfaces that integrate with existing programming environments that lake ecologists are familiar with, such as R.


This chapter describes GRAPLEr, a cyber-infrastructure that is unique in how it integrates a collection of distributed hardware resources through the IPOP overlay virtual network, supports existing models and the HTCondor distributed computing middleware [93], and exposes a user-friendly interface that integrates with R-based desktop environments through a WEB service. As a multi-tiered, distributed solution, GRAPLEr incorporates several components into an application-specific solution. Some of these components are pre-existing solutions which are deployed and configured for our specific uses, while others are specifically developed to address unique needs.

The rest of this chapter is organized as follows: Section 7-2 describes driving science use cases and motivates the need for the GRAPLEr cyber-infrastructure. Section 7-3 describes the architecture, design, and implementation of GRAPLEr. Section 7-4 describes a deployment of GRAPLEr and summarizes results from an experiment that evaluates its capabilities and performance. Section 7-5 discusses related work, and Section 7-6 concludes the chapter.

Background

Simulation modeling is a powerful tool for examining the effects of many different climate change scenarios on water quality in lakes. Here, we specifically focus on scenarios that look at both linear and stochastic changes in climate drivers to predict changes in harmful algal blooms. Algal community dynamics can be highly non-linear because of the large diversity of algae and their functional traits in lakes, as well as the dynamic physical-chemical environment in which the algae live [94]. Consequently, small incremental changes in different climate drivers could potentially reveal threshold effects that result in disproportionately large changes in responses. Thus, it could be


possible that a small change in climate drivers creates an ideal set of environmental conditions for blooms to occur in silico. How do algae and lake ecosystem processes respond to small changes in the timing, frequency, and magnitude of air temperature, precipitation, and wind? Can our models simulate the algal growth that is observed in lakes with high-frequency sensors in the field?

To answer these questions, we use the simulation software GLM-AED (General Lake Model-Aquatic EcoDynamics) [95], [96]. GLM-AED simulates the vertical dimension (i.e., 1D) of lake hydrodynamics in response to meteorological and hydrologic forcing, and lake chemistry in response to external loading and physical and biological fluxes. The two biological trophic levels modeled here, phytoplankton (4 functional groups) and zooplankton (3 functional groups), are modeled according to a set of functional traits that govern growth and death in response to the physical and chemical environment. All three components (physical, chemical and biological) interact to form a highly dynamic and vertically heterogeneous environment.

To study phytoplankton blooms, we adjust the model parameters representing functional traits in the phytoplankton and zooplankton. There are thousands of viable trait combinations and therefore thousands of simulations needed to determine the outcomes from these hypothetical communities. To study climate change effects on phytoplankton, we adjust meteorological forcing driver data to represent future climate scenarios.

We explored two use cases of the GRAPLEr cyber-infrastructure to address the questions above. In the first use case, we ran hundreds of thousands of GLM-AED simulations in which we incrementally range between


between 0 to +5 m/s), and precipitation by ±0.1 mm (in a range between 0 and 5 mm) at each hourly time step throughout a year. Analyzing all possible combinations of these climate drivers necessitated running many thousands of simulations to determine where threshold effects in algal growth exist in the model. This information was critical for identifying which meteorological conditions would best promote algal blooms.

In the second use case, we defined distributions of potential parameter values for different phytoplankton functional traits governing nutrient uptake, light and temperature sensitivity, and growth rates, and then randomly drew different parameter values from the distributions. Analyzing all possible combinations of the parameter values allowed us to determine which parameter values best recreated observed field data of algal abundance in the lake. Having this information consequently allowed us to improve the parameterization of the model to predict future algal blooms and water quality.

Several HTCondor-based high-throughput computing systems have been deployed in support of scientific applications. One representative example is the Open Science Grid (OSG) [97], which features a distributed set of HTCondor clusters. In contrast to OSG, which expects each site to run and manage its own HTCondor pool, GRAPLEr allows sites to join a collaborative, distributed cluster by joining its virtual HTCondor pool via the IPOP virtual network overlay. This reduces the barrier to entry for participants to contribute nodes to the network, e.g., by simply deploying one or more VMs on a private or public cloud. Furthermore, GRAPLEr exposes a domain-tailored WEB service interface that lowers the barrier to entry for end users.
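The two use cases described earlier on this page, an exhaustive sweep over climate driver offsets and random draws of trait values from distributions, can be sketched as follows. This is a hypothetical illustration, not GRAPLEr code: the function names, the driver and trait names, the wind step size, and the uniform trait distribution are all placeholders. Only the 0.1 mm precipitation step and the 0 to 5 ranges come from the text.

```python
import itertools
import random


def offsets(start, stop, step):
    """Evenly spaced offsets from start to stop, inclusive."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(count + 1)]


def linear_sweep(drivers):
    """Yield one dict of driver offsets per combination (first use case)."""
    names = list(drivers)
    for combo in itertools.product(*(drivers[name] for name in names)):
        yield dict(zip(names, combo))


def random_traits(distributions, draws, rng):
    """Draw each trait independently from its sampler (second use case)."""
    return [
        {name: sampler(rng) for name, sampler in distributions.items()}
        for _ in range(draws)
    ]


if __name__ == "__main__":
    drivers = {
        "wind_offset_ms": offsets(0.0, 5.0, 0.5),     # step size is a placeholder
        "precip_offset_mm": offsets(0.0, 5.0, 0.1),   # 0.1 mm step from the text
    }
    print(sum(1 for _ in linear_sweep(drivers)), "driver combinations")
    traits = {"growth_rate": lambda r: r.uniform(0.5, 1.5)}  # hypothetical trait
    print(random_traits(traits, 3, random.Random(7)))
```

Because the number of combinations is the product of the per-driver counts, adding a third driver or halving a step size multiplies the number of simulations, which is what pushes these experiments beyond a single desktop machine.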


WS-PGRADE [98] is a workflow design and execution tool that has a broad scope of functionalities geared at parameter sweep applications. Users can design workflows and specify how input is combined and on what type of resources job execution takes place. Workflows can also be hosted in a repository and shared with other end users. A designer selects a workflow node that matches the algorithm's characteristics and attaches data ports to it, which determine how the input is combined for execution as well as how output is generated. These workflow nodes can be combined and nested to create more complex ones and the desired application. Depending on the workflow node, execution can be mapped onto the local system, a desktop Grid, or a service Grid.

According to the WS-PGRADE workflow and parameter set classification, GRAPLEr would be a combination of a single regular node, a single parameter style (PS) node, and a generator output port. The generator port produces multiple output files derived from its input and a user-specified algorithm. A GRAPLEr job, which consists of multiple simulations, is mapped to a single PS workflow, and each executing simulation to a single regular workflow node. However, the highly generalized approach used by WS-PGRADE, which provides extensive flexibility, introduces two of the very problems GRAPLEr was designed to solve: the need to redesign and re-implement existing applications, and the need to learn new technologies outside the knowledge domain.

GRAPLEr provides an intuitive and easy-to-learn workflow and user interface. Through a collaboration between a core group of domain scientists and cyber-infrastructure experts, GRAPLEr codifies typical usage patterns into workflows that are


exposed through simple interfaces to the broader set of target end users, including students. Furthermore, rather than relying on a WEB-based presentation layer, domain scientists can quickly become productive because the GRAPLEr client API integrates directly into their existing R/RStudio development environment. The GRAPLEr workflows are built around the existing processes and leverage their existing opportunities for concurrency.

The NEWT [99] project also provides a RESTful WEB service interface to High Performance Computing (HPC) systems. It is a gateway to HPC computing and data resources at the National Energy Research Scientific Computing Center (NERSC) and is designed to make these resources highly usable via a WEB browser. The NEWT WEB service provides an API over HTTP which is used by client-side browser technologies to build WEB applications. However, NEWT does not describe any mechanism for incorporating a heterogeneous set of distributed computing resources within its HPC cluster. GRAPLEr differentiates itself with this ability to leverage virtualized cloud resources which are interconnected by virtual networks.

Architecture and Design

The system architecture of GRAPLEr is illustrated in Figure 7-1. Users interact with the system via a client-side library that is invoked from an R development environment (e.g., RStudio) running on their personal computer. User requests are created using the R language and mapped to GRAPLEr Application Programming Interface (API) calls, which in turn transform and transmit them to the GRAPLEr WEB Service (GWS). The GWS tier is responsible for interpreting the user requests, invoking the GRAPLEr Experiment Management Tools (GEMT) to set up and prepare the simulations, and queuing jobs for submission to the HTCondor pool. The HTCondor workload management tier is responsible for scheduling and dispatching model


simulations across the compute resources, which are interconnected through the IPOP virtual network overlay. All the GRAPLEr components are described in the following sections.

Overlay Virtual Network (IPOP)

Rather than investing significant effort in developing, porting, and testing new applications and distributed computing middleware, GRAPLEr has focused on an approach in which computing environments are virtualized and can be deployed on demand on cloud resources. While Virtual Machines (VMs) available in cloud infrastructures provide a basis to address the need for a user-provided software environment, another challenge remains: how to interconnect VMs deployed across multiple institutions (including private and commercial cloud providers) such that HTCondor and the simulation models work seamlessly? The approach to address this problem is to apply virtualization at the network layer.

The IPOP overlay network is a flexible and dynamic VPN. It frees the administrator to define the virtual network by simply specifying relationships among the participating nodes. IPOP then transparently builds logical communication links to facilitate seamless and secure communication within this ad hoc group. Additionally, the IPOP overlay is self-healing, as it automatically adjusts to changes in the underlying network; the hosts continue to function without user intervention and with minimal disruption. This dynamic function of IPOP is essential for addressing the complexities associated with intra-networking within a hybrid cloud system composed of local and public cloud resources.

By using tunneling protocols to extend discrete network segments between hosts, a virtual LAN is created. A virtual LAN simplifies the design


and layout of the network by grouping hosts with common requirements regardless of their actual location. For example, HTCondor hosts can be configured to communicate over public Internet infrastructure. However, the deployment of such a cluster has considerably more complexities than a corresponding cluster deployed within a LAN.

IPOP allows GRAPLEr to define and deploy its own virtual private network (VPN) that can span physical and virtual machines distributed across multiple collaborating institutions and commercial clouds. To accomplish this, IPOP captures and injects configured within an isolated virtual private address subnet space. IPOP then encrypts and tunnels virtual network packets through the public Internet. The Tincan tunnels used by IPOP to carry network traffic use facilities from WebRTC to create end-to-end links that carry virtual IP traffic instead of audio or video.

To discover and notify peers that are connected to the GRAPLEr GVPN, IPOP uses XMPP. XMPP messages carry information used to create private tunnels (the fingerprint of address:port pairs that the device is reachable). For nodes behind network address translators (NATs), public-facing address:port endpoints can be discovered using the STUN (Session Traversal Utilities for NAT) protocol, and devices behind symmetric NATs can use TURN (Traversal Using Relays around NAT) to communicate through a relay in the public Internet. Put together, these techniques handle firewalls and NATs transparently to users and applications and allow for simple configuration of VPN groups via an XMPP server.


computing resources can be integrated to implement a workload services cluster.

Workload Management (HTCondor)

A key motivation for the use of virtualization technologies, including IPOP, is the ability to integrate existing, unmodified distributed computing middleware. In particular, GRAPLEr integrates HTCondor [8], a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, HTCondor provides a job queueing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to HTCondor; HTCondor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

Figure 7-2 illustrates how HTCondor has been deployed for implementing the GRAPLEr workload execution and management tier. An HTCondor resource pool, running across distributed resources and connected by an IPOP network, provides a general-purpose capability where it is possible to run a variety of applications from different domains. Furthermore, application-tailored middleware can be layered upon this general-purpose environment to enhance the performance and streamline the configuration of user simulations.

Experiment Management Tools (GEMT)

GEMT provides a suite of scripts for designing and automating the tasks associated with running General Lake Model (GLM) based experiments on a very large scale. Here, we use the term address a science use case question, such as determining the effects of climate change on water quality metrics. GEMT is both the guidelines for the design and the layout of


individual simulations in the experiment, as well as a library of executable code for creating and managing an experiment over its lifetime in the system. The primary responsibility of GEMT is to identify and target the task-level parallelism inherent in the experiment by generating proper packaging of executables, inputs, and outputs. Furthermore, GEMT seeks to effectively exploit the distributed compute resources across the HTCondor pool by performing operations such as aggregation of multiple simulations into a single HTCondor job, compression of input and output files, and extraction of selected features from output files. For the simulations in an experiment, GEMT defines the naming convention used by the files and directories as well as their layout.

The user may interact with GEMT in two possible ways: (1) directly, by using a desktop computer configured with the IPOP overlay software and HTCondor job submission software, or (2) indirectly, by issuing requests against the GRAPLEr WEB service. In the former case, once the user has followed the GEMT specification for creating their experiment, executing it and collecting the results becomes a simple matter of invoking two GEMT scripts. However, the user is left with the responsibility of deploying and configuring both IPOP and HTCondor locally. Additionally, the user is now a trusted endpoint on the VPN which carries its own security accessing the VPN. The latter case relieves the user of both these concerns. This chapter focuses on the latter approach, where GEMT scripts are invoked indirectly by the user through the WEB service.

There are three distinct functional modes for GEMT, which pertain to the different

PAGE 125

GEMT selects a configurable number of simulations to be grouped as a single HTCondor job. Multiple simulations are grouped into a single HTCondor job because the costs of job scheduling and network transfer for short-running simulations can be significant. By grouping simulations into a single HTCondor job, redundant copies of the input can be eliminated to reduce the bandwidth transfer cost, and only a single scheduling decision is needed to dispatch all the simulations in the job. The inputs and executables pertaining to a group of simulations are then compressed and submitted as a job to the HTCondor scheduler for execution. When this job is scheduled, GEMT is invoked in its second phase, this time on the HTCondor execute node. The execute-side GEMT script coordinates running each simulation within the job and preparing the output so it can be returned to the originator. Finally, in its third phase, back on the submit node, GEMT collates the results of all the jobs that were successful and presents them in a standard format to the end user.

GEMT implements user-configurable optimizations to fine-tune its operations for individual preferences. It can limit how many simulations are placed in a job, and it will compress these files for transfer. GEMT can also overlap the client-side job creation with the server-side execution. These features can be set via a configuration file, and together they provide a simplified mechanism to execute large numbers of simulations.

The GWS module, as illustrated in Figure 7-3, is a publicly addressable WEB service available on the Internet that serves as a gateway for users to submit requests to run experiments. GWS is a middleware service tier which provides a WEB application programming interface (API), implemented over the HTTP protocol, to its remote users.
This API encompasses the server-side functionality of GRAPLEr; each method of this API represents a discrete unit of computational capability to be executed on the distributed cluster. GWS accepts and interprets user requests, configures and queues jobs, and consolidates and prepares results for retrieval. For example, a request may create and execute an experiment consisting of N simulations by varying air temperature according to a statistical distribution for a climate change scenario. GWS extensively utilizes the functionality of GEMT for simulation processing and is co-located on the same host as the GEMT library. This host acts as the submit node to the HTCondor pool, where it monitors job submission and execution.

Representational State Transfer (REST) is an architectural style for networked hypermedia applications that is primarily used to build lightweight and scalable WEB services. Such a WEB service, referred to as RESTful, is stateless with a uniform interface and representation of entities, communicates with clients via messages, and addresses resources using URIs. GWS implements the RESTful paradigm and is designed to treat every job submission independently from any other. Note that there is per-experiment state that is managed by GWS, such as the status of each HTCondor job submitted by GWS. The state of the experiment is maintained on disk, within the local filesystem, leaving the service itself stateless.

GWS implements the public-facing interface using a combination of open source middleware: Python Flask [100] for WEB service processing, and Python Celery [101] as an asynchronous task queue. The application is hosted using uWSGI (an application deployment solution) and supplemented by an Nginx reverse proxy server to offload the task of serving static files
from the application. The employed technology stack facilitates rapid service development and robust deployment.

The GWS workflow begins when a request from an R client is received by the service interface, which is handled by Flask. The request to evaluate a series of simulations can be provided in one of the several ways previously discussed. However, only data files are accepted as input; no user-provided executable binaries or scripts are executed as part of the experiment. A single client-side request can potentially unfold into large numbers (e.g., thousands) of jobs, and GWS places these requests into a Celery task queue for asynchronous processing. Provisioning a task queue allows GWS to decouple the time-consuming processing of the input and task submission to HTCondor from the response. A 40-character UUID is randomly generated for each simulation request received by GWS; it serves as an identifier to reference the state of an experiment and is used for any further interactions with the service for that experiment. Using the UUID returned by GRAPLEr, an R client can not only configure the job, but also monitor its status, download outputs, and terminate the job. Once the input file has been uploaded to the service, GWS puts the task into the task queue and responds promptly with the UUID. Therefore, the latency that the R developer experiences, from the moment the job is submitted to when the UUID is received, is minimized. A GWS worker thread then dequeues GEMT tasks from the task queue and processes the request according to the parameters defined by the user. Figure 7-3 shows the internal architecture and setup of GWS.
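
The decoupling of request handling from processing described above can be illustrated with a minimal sketch. It is not the GWS implementation: the standard-library queue stands in for Celery, an in-memory dictionary stands in for the on-disk experiment state, and the function names, the "queued" status, and the use of a 40-character hexadecimal token as the identifier are all assumptions made for illustration.

```python
import queue
import secrets
import threading

# Hypothetical in-memory stand-ins for the task queue and per-experiment state.
task_queue = queue.Queue()
experiment_state = {}
state_lock = threading.Lock()


def submit_experiment(input_blob: bytes) -> str:
    """Record the request, enqueue it, and return an identifier immediately."""
    exp_id = secrets.token_hex(20)  # 40 hexadecimal characters (assumed format)
    with state_lock:
        experiment_state[exp_id] = "queued"
    task_queue.put((exp_id, input_blob))
    return exp_id


def process(exp_id: str, input_blob: bytes) -> None:
    """Placeholder for input processing and submission to the scheduler."""
    with state_lock:
        experiment_state[exp_id] = "processing"
    # The time-consuming work would run here, outside the request path.
    with state_lock:
        experiment_state[exp_id] = "completed"


def worker() -> None:
    """Dequeue and process tasks until a None sentinel is received."""
    while True:
        item = task_queue.get()
        if item is None:
            task_queue.task_done()
            break
        process(*item)
        task_queue.task_done()


def check_status(exp_id: str) -> str:
    """Look up the state of an experiment by its identifier."""
    with state_lock:
        return experiment_state.get(exp_id, "unknown")
```

In this sketch a caller would start `worker` in a background thread, call `submit_experiment` to obtain the identifier without waiting for processing, and later poll `check_status` with that identifier, which mirrors the submit-then-poll interaction the text describes.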

GRAPLE R Language Package (GRAPLEr)

The user-facing component of GRAPLEr is an R package that serves as a thin layer of software between the WEB service and the R programming language. It provides an R language application programming interface which can be programmatically consumed by client programs wanting to utilize the GRAPLEr functionality. It acts as a proxy that translates user commands written in R into WEB service requests which are sent to GWS. It also transfers data between the client and WEB service as necessary.

The following examples illustrate the sequence of R calls to program a GRAPLEr experiment. In the first example, a new GRAPLEr instance is created and the GWS IP address, the length of time the results should remain available for download, and the local directories for storing various experiment files are specified. The availability of the service is checked to make sure it is running, and finally the list of post-processing filters is retrieved.

    batchExp <- new("Graple", Retention, ExpRootDir, ResultsDir, TempDir)
    batchExp <- GrapleCheckService(batchExp)
    batchExp <- GrapleListPostProcessFilters(batchExp)

The second example executes a batch-type experiment of one or more simulations which have previously been created and placed in the directory ExpRootDir. A human-readable experiment name is set, and the command to run them on the cluster is issued. The experiment completion status is checked, and the results are downloaded when they become available.

    batchExp <- setExpName(batchExp, "BatchExperiment1")
    batchExp <- GrapleRunExperiment(batchExp)
    batchExp <- GrapleCheckExperimentCompletion(batchExp)
    batchExp <- GrapleGetExperimentResults(batchExp)

The third example shows how a user can specify a parameter sweep experiment with simulations which are derived from a baseline set. A new experiment instance is again created, and the friendly name and post-processing filter are assigned. Both the baseline simulation and the experiment description have been created and stored in the ExpRootDir directory at the client. However, the filter specified by Filtername is stored remotely in the service. The command to run the sweep experiment is invoked and the progress is checked periodically until it indicates 100% completion. The results are then retrieved to be used locally.

    sweepExp <- new("Graple", ExpRootDir, ResultsDir, TempDir)
    sweepExp <- setExpName(sweepExp, "SweepExperiment")
    sweepExp <- GrapleRunSweepExperiment(sweepExp, "Filtername")
    sweepExp <- GrapleCheckExperimentCompletion(sweepExp)
    while (...) {  # repeat until the status message reports 100% completion
      Sys.sleep(10)
      sweepExp <- GrapleCheckExperimentCompletion(sweepExp)
      cat(paste(sweepExp@ExpName, sweepExp@StatusMsg, sep=":"))
    }
    sweepExp <- GrapleGetExperimentResults(sweepExp)

To prevent the use of the WEB service interface to execute arbitrary code, custom code, whether binary executables or R scripts, cannot be sent as part of the simulation requests; instead, users only provide input files and parameters for the GLM
simulations. The scenarios that can be run are currently restricted to using GLM tools and our own scripts. The GRAPLEr source code is publicly available under the MIT Software License, and the project website provides tutorials and usage guides.

Use Case Workflow

A key feature of GRAPLEr is the ability to automatically create and configure an experiment by generating a range of simulation scenarios (see Figures 7-4 and 7-5). This relies on application-specific knowledge. In particular, the service uses application-specific information to identify data in the input file (such as air temperature or precipitation) and apply transformations to them to generate multiple discrete simulation scenarios. This removes the onus from the user to generate, schedule, and collate the outputs of thousands of simulations on their desktops, and allows them to quickly create expressive experiment scenarios from a high-level description that simply enumerates which input variables to consider, what function to apply to vary them, and how many simulations to create. The user also has the flexibility to specify a post-processing operation for each simulation, and to retrieve and download only a selected subset of the results back to their desktops, thereby minimizing local storage requirements and data transfer times.

Based on the science use case introduced and discussed in section 7-2, and our understanding of the GRAPLEr infrastructure and APIs described in section 7-3, we can now concretely illustrate how GRAPLEr would be used to solve this problem. In the second use case, we defined distributions of potential parameter values for different
phytoplankton functional traits governing nutrient uptake, light and temperature sensitivity, and growth rates, and then randomly pulled different parameter values from the distributions. Within the R programming environment, the GRAPLEr function GrapleRunSweepExperiment is used to set the stage for creating an experiment derived from a baseline simulation which was created from actual sensor data. An optional post-processing filter name may also be specified. The experiment description file specifies which distribution (linear, random, uniform, binomial, or Poisson) to choose samples from, the number of samples, the variable(s) to be modified, and the operation applied to a variable for each randomly generated value (add, subtract, multiply, or divide). The post-processing filter name specifies a selection from a collection of operations, stored within the GEMT library, to run after the successful completion of each individual simulation.

The invocation of GrapleRunSweepExperiment results in a request being sent to the WEB service API method GrapleRunMetSample in Table 7-1, which goes about generating the simulations. From this single input and description, GWS utilizes GEMT to generate a detailed description of the experiment along with a partitioning of how the jobs should be distributed. Each job contains M simulations, which comprise its respective subset of the experiment, and executes them sequentially in turn.
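
The sweep generation and partitioning described above can be sketched as follows. This is a hedged illustration rather than GEMT code: the description-file format is not shown in the text, so the function names, the dictionary-based baseline, the operation keys, and the reading of "linear" as evenly spaced values are all assumptions, and only two of the listed distributions are sketched.

```python
import operator
import random

# Hypothetical operation names; the text lists add, subtract, multiply, divide.
OPERATIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "div": operator.truediv,
}


def draw_samples(distribution, n, low, high, seed=0):
    """Return n sample values; only 'linear' and 'uniform' are sketched."""
    if distribution == "linear":
        if n == 1:
            return [low]
        step = (high - low) / (n - 1)
        return [low + i * step for i in range(n)]
    if distribution == "uniform":
        rng = random.Random(seed)
        return [rng.uniform(low, high) for _ in range(n)]
    raise ValueError(f"unsupported distribution: {distribution}")


def generate_scenarios(baseline, variable, operation, samples):
    """Derive one scenario per sample by transforming a single variable."""
    apply = OPERATIONS[operation]
    scenarios = []
    for value in samples:
        scenario = dict(baseline)  # other variables are carried over unchanged
        scenario[variable] = [apply(x, value) for x in baseline[variable]]
        scenarios.append(scenario)
    return scenarios


def partition(scenarios, sims_per_job):
    """Group scenarios into jobs of at most sims_per_job simulations each."""
    return [scenarios[i:i + sims_per_job]
            for i in range(0, len(scenarios), sims_per_job)]
```

For example, with a baseline whose hypothetical `AirTemp` series is `[10.0, 12.0]`, `draw_samples("linear", 5, -2.0, 2.0)` followed by `generate_scenarios(..., "AirTemp", "add", samples)` yields five scenarios, and `partition(scenarios, 2)` groups them into three jobs whose simulations would run sequentially on an execute node.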

Evaluation

In this section, we present a quantitative evaluation of a proof-of-concept deployment of GRAPLEr. The goal of this evaluation is to demonstrate the functionality and capabilities of the framework by deploying a large number of simulations to an HTCondor pool. The HTCondor pool is distributed across multiple clouds and connected by the IPOP virtual network overlay. Rather than focusing solely on the reduction in execution times, we evaluate a setup that is representative of an actual deployment composed of execute nodes with varying capabilities.

A GLM simulation is specified by a set of input files, which describe model parameters and time series data that drive inputs to the simulation, such as air temperature over time, derived from sensor data. The resulting output at the completion of a model run is a netCDF file containing time series of the simulated lake, with many lake variables, such as water temperatures at different lake depths. In our experiments, we use the 1-D GLM Aquatic Eco-Dynamics (AED) model with a single GLM-AED simulation of a moderately deep lake, run for eight months at an hourly time step. The test experiment was designed to run reasonably quickly, yet for long enough that the timing results would not be skewed by extraneous timing factors. The input folder size was approximately 3 MB, whereas the size of the resulting netCDF file after successful completion of the simulation was 90 MB. We note that simulations run over decades and with more frequent time steps may increase simulation run time and result output by an order of magnitude.

We conducted simulation runs on different systems to obtain a range of simulation runtimes. With the baseline parameters, GLM-AED simulation times ranged from the best case of 6 seconds (on a CloudLab system with Intel Xeon CPU E5-2450
with 2.10GHz clock rate and 20MB cache) to 57 seconds (on a University of Florida system with virtualized Intel Xeon CPU X565 with 2.60GHz clock rate and 12MB cache). Note that individual 1-D GLM-AED simulations can be short running; the GEMT feature of grouping multiple individual simulations into a single HTCondor job increases efficiency by reducing the overhead incurred from job scheduling and placement.

Description of Experiment Setup: The GRAPLEr system deployed for this evaluation was distributed across three sites: University of Florida, NSF CloudLab, and Microsoft Azure. The GWS/GEMT service front end, HTCondor submit node, and HTC We deployed three HTCondor worker nodes, each with 16 cores and 16GB of RAM. Two nodes were hosted in virtual machines on a VMware ESX server at the University of Florida and one on a physical machine in the CloudLab Apt cluster at the University of Utah. All the nodes in this experiment ran Ubuntu 14.04, HTCondor version 8.2.8, and IPOP GVPN version 15.01.

To conduct the evaluation, we carried out executions of three different experiments containing 3000, 5000, and 10000 simulations of an example lake with varying meteorological input data. Figure 7-6 summarizes the results from this evaluation. As a reference, we also present the estimated best-case sequential execution time on a single, local machine, taking the CloudLab and UF machines as references. For 10,000 simulations, we achieved a speedup of 2.5x (with respect to the sequential execution time of the fast workstation) and 23x (with respect to the sequential execution time on a UF virtual machine).
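
The relationship between the reported per-simulation times and the two speedup figures can be checked with a short back-of-the-envelope calculation. The sketch below assumes that the sequential estimate is simply the number of simulations multiplied by the per-simulation time (6 s and 57 s, as stated above) and that speedup is defined as sequential time divided by parallel time; the measured pool runtimes themselves are reported in Figure 7-6 and are not reproduced here, so the derived values are only implied, not measured.

```python
FAST_SECONDS = 6.0   # best-case per-simulation time reported for CloudLab
SLOW_SECONDS = 57.0  # per-simulation time reported for the UF virtual machine


def sequential_estimate(num_simulations: int, seconds_per_simulation: float) -> float:
    """Estimated sequential runtime if simulations run back to back."""
    return num_simulations * seconds_per_simulation


def implied_parallel_runtime(sequential_seconds: float, speedup: float) -> float:
    """Runtime implied by speedup = sequential / parallel."""
    return sequential_seconds / speedup


def summarize(num_simulations: int = 10_000) -> dict:
    """Sequential estimates and the pool runtimes implied by each speedup, in hours."""
    fast = sequential_estimate(num_simulations, FAST_SECONDS)
    slow = sequential_estimate(num_simulations, SLOW_SECONDS)
    return {
        "sequential_fast_hours": fast / 3600,
        "sequential_slow_hours": slow / 3600,
        "implied_runtime_from_fast_hours": implied_parallel_runtime(fast, 2.5) / 3600,
        "implied_runtime_from_slow_hours": implied_parallel_runtime(slow, 23.0) / 3600,
    }
```

Under these assumptions, both reported speedups imply a pool runtime of a little under seven hours for 10,000 simulations, which suggests the two figures are mutually consistent up to rounding.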

It is observed that the time taken to complete the experiment depended greatly on how the jobs were executed by the HTCondor scheduler, as the GRAPLEr cluster is composed of heterogeneous nodes with varying capabilities. It therefore follows that the achieved speedup is smaller when compared to the best-case baseline on the fastest node than when compared to that on the slower nodes. Furthermore, because HTCondor is best suited for long-running jobs, the user-perceived speedup of GRAPLEr over local stand-alone systems will increase as longer-running experiments are executed through the service. We expect that, as demand for modeling tools by the lake ecology community increases, so will the complexity, time series resolution, and simulated epochs of climate change scenarios, further motivating users to move from a local processing workflow to distributed execution through GRAPLEr.

Submission of a job to the HTCondor pool involves processing of input (for sweep requests) and packaging of generated simulations into GEMT. In order to evaluate this step, we carried out experiments to account for the time taken by GRAPLEr to respond to a request to generate a given number of simulations and submit them for execution. The results are presented in Table 7-2. The column service response time captures the time taken by GRAPLEr to respond to a request with a UUID, which is slightly more than the time required to upload the baseline input. The inputs for job submission.

Though not fully explored yet in the design of GRAPLEr, another benefit of remote execution through a WEB service interface is the leveraging of storage and data sharing capabilities of the collaborative infrastructure aggregated by distributed
resources connected through the IPOP virtual network. For instance, this experiment resulted in unfiltered result output of 900GB. By keeping the results on the GRAPLEr cloud and allowing users to share simulation outputs and download selected subsets of the raw data, the service can provide a powerful capability to its end users in enabling large-scale, exploratory scenarios, by both reducing computational time and relaxing local storage requirements at the client side.

Concluding Remarks

GRAPLEr, a distributed computing system which integrates and applies overlay virtual network, high-throughput computing, and WEB service technologies, is a novel way to address the modeling needs of interdisciplinary GRAPLE researchers. The system targets users who are not software engineering experts but who need to leverage extensive computational resources to exploit the parallelism inherent in GRAPLE experiments. Additionally, the system scales out, by simply adding worker nodes to the pool, to manage both increasingly complex experiments and larger numbers of concurrent users. GRAPLEr is best suited for large numbers of long-running simulations, because the distribution and scheduling overhead is amortized over such experiments, whereas for small, short-running experiments it can increase the overall running time. As lake models demand increased resolution and longer time scales to address climate change scenarios, GRAPLEr provides a platform for the next generation of modeling tools and simulations.

The GRAPLE endeavor additionally includes active involvement in training its end users, as this is perceived as a critical aspect of the penetration and impact of its software infrastructure on its community. This training and inclusion extend beyond the
collaboration between the working group's engineers and domain scientists, to undergraduate and graduate students, and postdoctoral researchers. This sets the stage for an iterative process of learning and refinement of the cyber-infrastructure where each subsequent iteration facilitates tackling problems of a much larger scale, and scientists trust and rely on the tool and methodologies to address broader water research issues.

Figure 7-1. System Architecture (GRAPLEr). Users interact with GRAPLEr using R environments on their desktops (right). The client connects to a WEB service tier that exposes an endpoint to the public Internet. Job batches are prepared using GEMT and are scheduled to execute on distributed HTCondor resources across an IPOP virtual private network.

Figure 7-2. Workload Management (HTCondor). GRAPLEr supports unmodified HTCondor software and configuration to work across multiple sites (e.g., a private cloud at UF and a commercial cloud at Azure).

Figure 7-3. GRAPLEr WEB Service (GWS). The GWS is responsible for taking WEB service requests from users, interpreting them, and creating tasks for remote execution using GEMT.

Figure 7-4. GRAPLEr top-level workflow chart.

Figure 7-5. GRAPLEr sweep job workflow chart.

Figure 7-6. Job runtimes for GRAPLEr HTCondor pool, compared to sequential execution times on CloudLab (SEQ Fast) and UF (SEQ Slow) slots.

Table 7-1. GWS Application Programming Interface (API)

Method                 Input                                         Output
Service Status         N/A                                           WEB service availability
GrapleRun              Simulations to execute                        Job UID
GrapleRunStatus        UID of a previously submitted job             Job status message (completed/processing)
GrapleRunResults       UID of a previously submitted job             URI of job results
GrapleRunLinearSweep   The base simulation and job definition file   Job UID
GrapleRunMetSample     The base simulation and job definition file   Job UID
GrapleEnd              UID of previously submitted job               Status message (success/retry)
GrapleListFilters      N/A                                           The list of available post-processing filters

CHAPTER 8
CONCLUSION

It always seems impossible until it's done.
Nelson Mandela

My dissertation describes the investigation, design, implementation, and evaluation of a novel software system that provides networking cyber-infrastructure that addresses the interconnect complexities stemming from the heterogeneous environment of the fog. The scope of the design encompasses various layers, from the local control of network endpoints (including a software-defined datapath) to the coordination of topology and forwarding across the overlay. My work extends and enhances IPOP VPN to demonstrate its capabilities in practical real-world environments. It is motivated by imminent changes in the computing landscape being shaped by IoT, pervasive computing, and edge computing, and the challenges these new technologies present. Its overarching goals are (1) endpoint identification and authentication, (2) private P2P communication, (3) virtualized layer 2 (Ethernet) infrastructure, (4) dynamic membership, (5) self-healing, resilient topology, (6) scalable loop-free forwarding, and (7) flexible operational configuration.

The techniques employed to accomplish these goals start with presenting a P2P link for tunneling Ethernet frames over UDP/IP. Next, an application framework (the IPOP Controller framework) is introduced to support the modular components of the control layer. These control modules create and manage the tunnels forming the specified topology with its desired characteristics, e.g., dynamic membership and self-repair. Finally, an OpenFlow-based controller (Bounded Flood) provides Ethernet forwarding in a cyclic switched fabric.

Through practical evaluations, I have demonstrated that the implemented systems satisfy their functional requirements. In Chapter 5, I showed that Tincan tunnels and the IPOP topology provide a functional layer 2 network. STP, broadcast, unicast, and application-level multicast worked as expected and without any modification. In Chapter 6, I demonstrated that STP can be replaced with Bounded Flood, my novel SDN-based forwarding implementation. It correctly handles cycles without disabling links and, by utilizing all of them, outperforms STP while maintaining the desirable self-organizing feature of STP. Also demonstrated is the practical, ongoing capability to provide networking infrastructure supporting real-world workloads. The HTC system GRAPLEr employs computational modeling both as a research and educational tool.

Contribution

The main contribution of this dissertation is the design, implementation, and evaluation of (1) a novel controller framework for virtual network overlays, (2) a novel topology control module for creating a structured P2P topology, and (3) a novel SDN-based controller for layer 2 loop-safe forwarding. These have been fully implemented and demonstrated in the open-source system IPOP.

The IPOP Controller framework provides a software system and philosophy for building and interconnecting pluggable modules that work together towards a common application goal. It promotes the design of decoupled, cohesive functionality within discrete modules and has been shown quantitatively to increase code reusability when compared to previous implementations.

The IPOP Topology control module creates a structured P2P topology (ring) based on successor and Symphony long-distance edges. The topology is practical to
implement and is resilient to the effects of churn. It is flexible, supporting a great degree of parameterization and independent configuration. It is also scalable with regard to the structural maintenance cost of node join and leave procedures.

The Bounded Flood SDN controller provides reliable Ethernet layer forwarding in the presence of cycles within the switched fabric. Its forwarding cost has been shown to increase logarithmically with the size of the network, and it outperforms STP both in available bandwidth and end-to-end latency.

Future Work

There are useful extensions of this work that will enhance its benefit to fog networking. What has been accomplished so far provides the foundations to be extended both as reusable components and as the supporting framework. This final section of my dissertation discusses a few of the concepts that warrant further investigation in the application of networking cyber-infrastructure for the fog.

Native support for multicast within the overlay can be reasonably implemented as an enhancement of the SDN Bounded Flood controller by utilizing on-demand edges to dynamically build the branches of multicast trees. The membership and privacy properties of the overlay constrain the scope. On-demand edges can augment the existing unicast tree to minimize retransmission and bypass non-participants. Additionally, moving this functionality into the system layer promotes reuse, relieving each application of the need to implement its own.

Currently, Tincan tunnels are used exclusively for connecting nodes in all scenarios. Features such as TLS-based encryption and the associated overhead cost may be unnecessary between peers in the same data center. It would be beneficial for IPOP to recognize different tunneling implementations (e.g., GRE, VXLAN) and utilize
them in the appropriate scenario. This necessitates quantifying the respective tunneling technologies' capabilities and performance and determining the conditions where they are best suited.

Presently, features such as encryption and NAT traversal are intrinsic capabilities of the Tincan tunnel. Separating these functionalities into reusable abstractions allows them to be combined with other tunnels to create previously unseen capabilities. Finally, furthering and integrating advances which eliminate the dual traversal of the networking stack for tunneling-based virtualization [102] can provide system-wide benefits for all tunnel types.

LIST OF REFERENCES

[1] Sci. Am., vol. 291, no. 4, pp. 76-81, 2004.
[2] 2008 11th IEEE International Symposium on Object and Component Oriented Real Time Distributed Computing (ISORC), 2008, pp. 363-369.
[3] RFID J., vol. 22, no. 7, pp. 97-114, 2009.
[4] IEEE Pers. Commun., vol. 8, no. 4, pp. 10-17, Aug. 2001.
[5] B. Zhang et al. Proceedings of the 7th USENIX Conference on Hot Topics in Cloud Computing, Berkeley, CA, USA, 2015, pp. 21-21.
[6] A. Yousefpour et al. J. Syst. Archit., Feb. 2019.
[7] Internet Society.
[8] [Accessed: 28 Oct 2019].
[9] K. Subratie, S. Aditya, S. Sabogal, T. Theegala, and R. Figueiredo Dynamic, Isolated Work groups for Distributed IoT and Cloud Systems with Peer to Sensors to Cloud Architectures Workshop (SCAW 2017), Austin, Texas, USA, 2017.
[10] d Networks: SDN Enabled Virtual Private Networks with Peer to Internet and Distributed Computing Systems, 2018, pp. 122-133.
[11] K. C. Subratie, S. Aditya, S. Mahesula, R. Figueiredo, C. C. Carey, and P. C. Hanson Ecosystem Modeling that Integrates Overlay Networks, High throughput Concurr. Comput. Pract. Exp., vol. 29, no. 13, p. e4139, 2017.
[12] Queue, vol. 6, no. 1, pp. 37:36-37:ff, Jan. 2008.
[13]
[14] M. Mahalingam et al. a Network (VXLAN): A 2014. [Online]. Available: https://www.rfc [Accessed: 29 May 2019].
[15] http:// [Accessed: 28 Oct 2019].
[16] Commun. Strateg., vol. 63, p. 109, 2006.
[17] E. K. Lua, J. Cro Comparison of Peer to IEEE Commun. Surv. Tutor., vol. 7, no. 2, pp. 72-93, Second 2005.
[18] Survey of Software Defined Networking: Past, Present, and Future of IEEE Commun. Surv. Tutor., vol. 16, no. 3, pp. 1617-1634, Third 2014.
[19] N. McKeown et al. SIGCOMM Comput Commun Rev, vol. 38, no. 2, pp. 69-74, Mar. 2008.
[20] P. Bosshart et al. SIGCOMM Comput Commun Rev, vol. 44, no. 3, pp. 87-95, Jul. 2014.
[21] IEEE Commun. Mag., vol. 55, no. 4, pp. 18-20, Apr. 2017.
[22] Comput. Netw., vol. 54, no. 15, pp. 2787-2805, Oct. 2010.
[23] A. Yousefpour et al. J. Syst. Archit., Feb. 2019.
[24] F. Bonomi, R. Milito, J. Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing, New York, NY, USA, 2012, pp. 13-16.
[25] Spec. Publ. NIST SP 800-145, Sep. 2011.
[26] Gart. Res., 2013.
[27] CISCO White Pap., vol. 1, no. 2011, pp. 1-11, 2011.
[28] M. Iorga, L. Feldman, R. Barton, M. J. Martin, N. S. Goren, and C. Mahmoudi, Spec. Publ. NIST SP 500-325, Mar. 2018.
[29] D. Gannon, R. Barg IEEE Cloud Comput., vol. 4, no. 5, pp. 16-21, Sep. 2017.
[30] Docker. [Online]. Available: [Accessed: 06 Oct 2019].
[31] [Online]. Available: [Accessed: 23 Oct 2019].
[32] LXD [Accessed: 23 Oct 2019].
[33] B. Burns, B. Grant, D. Oppenheimer, E. Br
[34] IEEE Commun. Mag., vol. 53, no. 2, pp. 90-97, Feb. 2015.
[35] reOS config.html. [Accessed: 18 Oct 2019].
[36] Project Calico. [Online]. Available: https://www.projectcalico.org/. [Accessed: 18 Oct 2019].
[37] [Accessed: 18 Oct 2019].
[38] [Accessed: 18 Oct 2019].
[39] VMware. [Online]. Available: [Accessed: 18 Oct 2019].
[40] Available: https:// [Accessed: 18 Oct 2019].
[41] R. Cohen et al. 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), 2013, pp. 42-50.
[42] R. Cohen, K. Barabash, an 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), 2013, pp. 1088-1089.
[43] M. Dalton et al. Velocity at Scale in 15th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2018, Renton, WA, USA, April 9-11, 2018, 2018, pp. 373-387.
[44] ng the Cloud to the 2014 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2014, pp. 346-351.
[45] IEEE Netw., vol. 32, no. 5, pp. 106-111, Sep. 2018.
[46] IPOP VPN, 2006. [Online]. Available: http://ipop [Accessed: 20 Mar 2015].
[47] D. I. Wolinsky, Y. Liu, P. S. Juste, G. Venkatasubramanian, and R. Figueiredo, le, self Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2009, pp. 1-12.
[48] Defined P2P Virtual Network Overlays for Ad EAI Endorsed Trans. Collab. Comput., vol. 1, no. 2, p. 15, Oct. 2014.
[49] P. O. Boykin, J. S. A. Bridgewater, J. S. Kong, K. M. Lozev, B. A. Rezaei, and V. ArXiv07094048 Cs, Sep. 2007.
[50] Proceedings of the 4th Conference on USENIX Symposium on Internet Technologies and Systems Volume 4, Berkeley, CA, USA, 2003, pp. 10-10.
[51] S. Loreto and S. P. Romano, Real Time Communication with WebRTC: Peer to Peer in the Browser
[52] S. Aditya, K. Subratie, a Defined Overlay Virtual Networks Spanning Personal Devices Across Social Network Computing Technology and Science (CloudCom), 2018, pp. 171-180.
[53] C. E. Spurgeon and J. Zimmerman, Ethernet Switches: An Introduction to Network Design with Switches
[54] IEEE Std 8021D 2004 Revis. IEEE Std 8021D 1998, pp. 1-281, Jun. 2004.
[55] [Accessed: 16 Oct 2019].
[56] [Accessed: 16 Oct 2019].
[57]
[58] Topologies in Modern P2P File IEEEACM Trans. Netw., vol. 16, no. 2, pp. 267-280, Apr. 2008.
[59] I. Stoica et al. ord: A Scalable Peer to peer Lookup Protocol for Internet IEEEACM Trans Netw, vol. 11, no. 1, pp. 17-32, Feb. 2003.
[60] Y. Huang, T. Z. J. Fu, D. Design and Analysis of a Large scale P2P vod Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, New York, NY, USA, 2008, pp. 375-388.
[61] IEEE Pervasive Comput., vol. 8, no. 4, pp. 14-23, Oct. 2009.
[62] Open Networking Foundation. [Online]. Available: definition/. [Accessed: 29 Jun 2018].
[63] H. P. Sajjad, K. Danniswara, A. Al Towards Unifying Stream Processing over Central and Near the Edge Data 2016 IEEE/ACM Symposium on Edge Computing (SEC), 2016, pp. 168-178.
[64] configuring Software defined Overlay Bypass for Seamless Inter and Intra Proceedings of the 25th ACM International Symposium on High Performance Parallel and Distributed Computing, New York, NY, USA, 2016, pp. 153-164.
[65] P. St. Juste, D. Wolinsky, P. Oscar Boykin, M. J. Covington, and R. J. area Collaboration with Integrated Social Comput. Netw., vol. 54, no. 12, pp. 1926-1938, Aug. 2010.
149 [66] Multidimensional Heuristic for Social Routing in Peer to 2013 IEEE 10th Consumer Communications and Networking Conference (CCNC) , 2013, pp. 329 335. [67] Soft ware Defined Networking architecture for the Internet of 2014 IEEE Network Operations and Management Symposium (NOMS) , 2014, pp. 1 9. [68] B. Pfaff et al. 12th USENIX Symposium on Networked S ystems Design and Implementation (NSDI 15) , Oakland, CA, 2015, pp. 117 130. [69] 29 Jun 2018]. [70] ies [Accessed: 13 Jun 2019]. [71] (TURN): Relay Extensions to Session Traversal Utilities for NA [Online]. Available: https://www.rfc [Accessed: 13 Jun 2019]. [72] 2010. [Online]. Available: https://www.rfc [73] IEEE Internet Comput. , vol. 17, no. 1, pp. 80 83, Jan. 2013. [74] W. Shi, J. Cao, Q. IEEE Internet Things J. , vol. 3, no. 5, pp. 637 646, Oct. 2016. [75] SIGCOMM Com put Commun Rev , vol. 44, no. 2, pp. 87 98, Apr. 2014. [76] C. E. Spurgeon, Ethernet: the definitive guide [77] R. Niranjan Mysore et al. tolerant Layer 2 Data Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication , New York, NY, USA, 2009, pp. 39 50. [78] B. Y. Zhao, Ling Huang, J. Stribling, S. C. Rhea, A. D. Joseph, an d J. D. scale Overlay for Service IEEE J. Sel. Areas Commun. , vol. 22, no. 1, pp. 41 53, Jan. 2004.

PAGE 150

150 [79] and Routing for Large Scale Peer to Middleware 2001 , 2001, pp. 329 350. [80] 1999. [81] Programs ams.php?prog=arping. [Accessed: 18 Oct 2019]. [82] iputils . iputils, 2019. [83] SourceForge . [Online]. Available: [Accessed: 18 Oct 2019]. [84] /. [Accessed: 18 Oct 2019]. [85] [Accessed: 18 Oct 2019]. [86] D. Duplyakin et al. Proceedings of the USENIX Annual Tech nical Conference (ATC) , 2019, pp. 1 14. [87] K. Keahey et al. Contemporary High Performance Computing: From Petascale toward Exascale , 1st ed., vol. 3, J. Vetter, Ed. Boca Raton, FL: CRC Press, 2019, pp. 123 148. [88] Science , vol. 320, no. 5872, pp. 57 58, Apr. 2008. [89] Science , vol. 334, no. 6052, pp. 46 47, Oct. 2011. [90] J. Plankton Res. , vol. 27, no. 12, pp. 1205 1210, Dec. 2005. [91] C. C. Carey, P. C. Hanson, R. C. Lathrop, and A. L analyses to examine variability in phytoplankton seasonal succession and annual J. Plankton Res. , vol. 38, no. 1, pp. 27 40, Jan. 2016. [92] K. Weathers et al. Bull. Limnol. Oceanogr. , vol. 22, no. 3, pp. 71 73, 2013.

PAGE 151

151 [93] Condor Ex Concurr. Comput. Pract. Exp. , vol. 17, no. 2 4, pp. 323 356, 2005. [94] PLOS ONE , vol. 10, no. 2 , p. e0115414, Feb. 2015. [95] Univ. West. Aust. Tech. Man. Perth Aust. , 2013. [96] M. R. Hipsey et al. Systems: Water Resour. Res. , vol. 51, no. 9, pp. 7023 7043, 2015. [97] R. Pordes et al. Journal of Physics: Conference Series , 2007, vol. 78, p. 012057. [98] P. Kacsuk et al. PGRADE/gUSE Generic DCI Gateway Framework for a J. Grid Comput. , vol. 10, no. 4, pp. 601 630, Dec. 2012. [99] High Perf Gateway Computing Environments Workshop (GCE), 2010 , 2010, pp. 1 11. [100] M. Grinberg, Flask Web Development: Developing WEB Applications with Python [101] [Accessed: 11 Jun 2019]. [102] D. Zhuo et al. Overhead Container Overlay Desig n and Implementation ({NSDI} 19), 2019, pp. 331 344.

PAGE 152

BIOGRAPHICAL SKETCH

Kensworth C. Subratie earned his BSc degree in Computer Science at the Florida International University, Miami, FL. He spent the next 14 years as a professional software engineer, designing and implementing systems for a variety of environments. Kensworth returned to academia to pursue his doctoral degree at the University of Florida, where he worked as a research assistant in the Advanced Computing and Information Systems (ACIS) Laboratory. There he continued to pursue his passion for software systems design and went on to further develop interests in fog computing; concurrent, distributed, and P2P systems; and network virtualization. In 2019, he received his PhD in Computer Engineering from the Department of Electrical and Computer Engineering at the University of Florida in Gainesville.