OVERLAY INFRASTRUCTURE SUPPORT FOR INTERNET APPLICATIONS

By

ZHAN ZHANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

Copyright 2007 Zhan Zhang

To my family

ACKNOWLEDGMENTS

I am grateful for the help I have received in writing this dissertation. First of all, I would like to thank my advisor, Prof. Shigang Chen, for his guidance and support throughout my graduate studies. Without the numerous discussions and brainstorms with him, the results presented in this thesis would never have existed. I am grateful to Prof. Liuqing Yang, Prof. Randy Chow, Prof. Sartaj Sahni, and Prof. Ye Xia for their guidance and encouragement during my years at the University of Florida (UF). I am thankful to all my colleagues in Prof. Chen's group, including Liang Zhang, MyungKeun Yoon, Ying Jian, and Ming Zhang. They provided valuable feedback for my research. I would like to thank many people in the Computer and Information Science and Engineering (CISE) Department for their help in my research work. Last but not least, I am grateful to my family for their love, encouragement, and understanding. It would be impossible for me to express my gratitude toward them in mere words.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Overlay Networks
  1.2 Related Work
  1.3 Contribution
    1.3.1 A Distributed Hybrid Query Scheme to Speed Up Queries in Unstructured Peer-to-Peer Networks
    1.3.2 A Distributed Incentive Scheme for Peer-to-Peer Networks
    1.3.3 Capacity-Aware Multicast Algorithms on Heterogeneous Overlay Networks
2 A HYBRID QUERY SCHEME TO SPEED UP QUERIES IN UNSTRUCTURED PEER-TO-PEER NETWORKS
  2.1 Motivation
    2.1.1 Problems in Prior Work
    2.1.2 Motivation
  2.2 Constructing a Small-World Topology
    2.2.1 Measuring the Interest Similarity between Two Nodes
    2.2.2 Clustering Nodes with Similar Interests
    2.2.3 Bounding Clusters
  2.3 A Hybrid Query Scheme
    2.3.1 Mixing Inter-Cluster Queries and Intra-Cluster Queries
    2.3.2 Reducing the Communication Overhead
  2.4 Simulation

3 MARCH: A DISTRIBUTED INCENTIVE SCHEME IN OVERLAY NETWORKS
  3.1 Motivation
    3.1.1 Limitation of Prior Work
    3.1.2 Motivation
  3.2 System Model
  3.3 Authority Infrastructure
    3.3.1 Delegation
    3.3.2 k-pair Trustworthy Set
  3.4 MARCH: A Distributed Incentive Scheme
    3.4.1 Money and Reputation
    3.4.2 Phase 1: Contract Negotiation
    3.4.3 Phase 2: Contract Verification
    3.4.4 Phases 3 and 4: Money Transfer and Contract Execution
    3.4.5 Phase 5: Prosecution
  3.5 System Properties and Defense against Various Attacks
    3.5.1 System Properties
    3.5.2 Defending Against Various Attacks
  3.6 Discussions
    3.6.1 Rewarding Delegation Members
    3.6.2 Money Refilling
    3.6.3 System Dynamics and Overhead
  3.7 Simulation
    3.7.1 Effectiveness of Authority
    3.7.2 Effectiveness of MARCH

4 CAPACITY-AWARE MULTICAST ALGORITHMS ON HETEROGENEOUS OVERLAY NETWORKS
  4.1 Motivation
  4.2 System Overview
  4.3 CAM-Chord Approach
    4.3.1 Neighbors
    4.3.2 Lookup Routine
    4.3.3 Topology Maintenance
    4.3.4 Multicast Routine
    4.3.5 Analysis
  4.4 CAM-Koorde Approach
    4.4.1 Neighbors
    4.4.2 Lookup Routine
    4.4.3 Multicast Routine
    4.4.4 Analysis
  4.5 Discussions
    4.5.1 Group Members with Very Small Upload Bandwidth
    4.5.2 Proximity and Geography
  4.6 Simulation
    4.6.1 Throughput
    4.6.2 Throughput vs. Latency
    4.6.3 Path Length Distribution
    4.6.4 Average Path Length
    4.6.5 Impact of Dynamic Capacity Variation

5 SUMMARY

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF FIGURES

2-1 Probability of random walks escaping out of the cluster decreases exponentially with the ratio of the number of inter-cluster edges to the number of intra-cluster edges.

2-2 Markov random walks discover fewer distinct nodes than uniform random walks do.

2-3 A query scheme mixing inter-cluster queries and intra-cluster queries (the nodes in the grey clusters fall into the same interest group).

2-4 Interest similarity between nodes. The numbers of data items in nodes 1, 2, and n are 50, 100, and 200, respectively. A) No common visited nodes (different interests). B) u and v have visited node 2 (100 data items) 8 and 5 times, respectively (a certain level of similar interests). C) u and v have visited node n (200 data items) 8 and 5 times, respectively (falls in between).

2-5 Effect of average query number on the cluster size.

2-6 Interest association is a good metric to estimate interest similarity.

2-7 Percentage of returned queries within a specific hop number.

2-8 Percentage of returned queries within a specific message number.

2-9 Percentage of returned queries within a specific hop number in a less clustered network.

2-10 Percentage of returned queries within a specific message number in a less clustered network.

2-11 Number of distinct nodes discovered in the same group within a certain message range.
2-12 Number of distinct nodes discovered in the same group within a specific hop number.

2-13 Percentage of messages discovering distinct nodes within a certain message range.

2-14 Total number of distinct nodes discovered within a specific hop number.

3-1 Trustworthy probabilities for a delegation and a 5-pair delegation set with m* = 3,000 are 99.97% and 99.815%, respectively. Even if a delegation/k-pair delegation set is not trustworthy, it may not be compromised, because it is very unlikely that a single colluding group can control the majority of them.

3-2 A) Protocols for contract verification and exchange. B) Money transfer. C) Prosecution.

3-3 Trustworthiness of delegation.

3-4 Trustworthiness of k-pair delegation set.

3-5 Most of the malicious nodes are rejected within the first 50 transactions.

3-6 Failed transaction ratio and overpaid money ratio drop quickly to small percentages within the first 100 transactions.

3-7 Overpaid money ratio (measured after 250 transactions) increases linearly with the number of dishonest nodes.

3-8 Number of rejected dishonest nodes (measured after 250 transactions) increases linearly with the number of dishonest nodes.

3-9 Overpaid money ratio with respect to the threshold.

3-10 Number of rejected nodes with respect to the threshold.

3-11 Number of rejected dishonest nodes (measured after 250 transactions) increases linearly with the number of dishonest nodes.

4-1 Chord vs. CAM-Chord neighbors (c = 3).

4-2 CAM-Koorde topology with identifier space [0..63].

4-3 Multicast throughput with respect to average number of children per non-leaf node.

4-4 Throughput improvement ratio with respect to upload bandwidth range.

4-5 Multicast throughput with respect to size of the multicast group.

4-6 Throughput vs. average path length.
4-7 Path length distribution in CAM-Chord. Legend "[x..y]" means the node capacities are uniformly distributed in the range of [x..y].

4-8 Path length distribution in CAM-Koorde. Legend "[x..y]" means the node capacities are uniformly distributed in the range of [x..y].

4-9 Average path length with respect to average node capacity.

4-10 Proximity optimization.

4-11 Throughput vs. latency.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

OVERLAY INFRASTRUCTURE SUPPORT FOR INTERNET APPLICATIONS

By

Zhan Zhang

May 2007

Chair: Shigang Chen
Major: Computer Engineering

Overlay networks have gained a lot of popularity and are now being considered in many application domains such as content distribution, conferencing, and gaming. These applications rely on support from underlying overlay infrastructures. How to design overlay infrastructures to satisfy the requirements of different applications, such as resource-sharing systems and application-level multicast, is largely an open problem. My dissertation develops several techniques in the design of overlay infrastructures to facilitate Internet applications.

A distributed hybrid query scheme is proposed for resource-sharing systems. It combines Markov random walks and peer clustering to achieve a better tradeoff. The scheme has a short response time for most of the queries that belong to the same interest group, while still maintaining a smaller network diameter. More importantly, we propose a totally distributed clustering algorithm, which means better resilience to network dynamics.
A new incentive scheme is proposed to support cooperative overlays, in which the amount a node can benefit is proportional to its contribution, malicious nodes can only attack others at the cost of their own interests, and a colluding group cannot gain advantage by cooperation. Furthermore, we have designed a distributed authority infrastructure and a set of protocols to facilitate the implementation of the scheme in peer-to-peer networks, which have no trusted authorities to manage each peer's history information.

Moreover, we propose two capacity-aware multicast systems that focus on host heterogeneity, any-source multicast, dynamic membership, and scalability. We extend Chord and Koorde to be capacity-aware. We then embed implicit degree-varying multicast trees on top of the overlay network and develop multicast routines that automatically follow the trees to disseminate multicast messages. The implicit trees are well balanced, with workload evenly spread across the network. We rigorously analyze the expected performance of multi-source capacity-aware multicasting, which was not thoroughly addressed in any previous work. We also perform extensive simulations to evaluate the proposed multicast systems.

CHAPTER 1
INTRODUCTION

1.1 Overlay Networks

Overlay networks have recently attracted a lot of attention and are now being considered in many application domains such as content distribution, conferencing, and gaming. An overlay network is a computer network built on top of another network. Each node in an overlay network maintains pointers to a set of neighbor nodes. These pointers are used both to maintain the overlay and to implement application functionality, for example, to locate content stored by overlay nodes. Two neighbors in the overlay can be thought of as being connected by virtual or logical links, each of which corresponds to a path, perhaps through many physical links, in the underlying network.
For example, many peer-to-peer networks are overlay networks because they run on top of the Internet. Manually configured static overlays are nothing new. A salient, modern feature of overlay networks is that they can be made to autonomically self-organize, which provides great ease of deployment.

Resource sharing and application-level multicast are two of the major applications in overlay networks, and we study the infrastructure support for these applications from the following three aspects.

Query schemes for lookup services: The basic overlay applications are resource-sharing systems, such as grid computing and file-sharing peer-to-peer networks. The core operation in these systems is an efficient lookup mechanism for locating resources. The fundamental challenge is to achieve faster response time, smaller network diameter, better resilience to network dynamics, and lower overhead. The overlays need to be constructed with good search properties, such as clustering the nodes with similar interests, and the query scheme needs to utilize these properties to support lookup efficiently.

Incentive schemes to encourage cooperation: In cooperative overlay networks, a node is allowed to consume resources from other nodes, and is also expected to share its resources with the community. Today's peer-to-peer networks suffer from the problem of free-riders, who consume resources in the network without contributing anything in return. Originally it was hoped that users would be altruistic, "from each according to his abilities, to each according to his needs." In practice, altruism breaks down as networks grow larger and include more diverse users. This leads to a "tragedy of the commons," where individual peers' self-interest causes the system to collapse.
Overlay multicast for group communication: Overlay multicast is becoming a promising alternative for group communication, because the global deployment of IP multicast has been slow due to difficulties related to heterogeneity, scalability, manageability, and the lack of a robust inter-domain multicast routing protocol. Even though overlay multicast can be implemented on top of overlay unicast, the two have very different requirements. In overlay unicast, low-capacity nodes only affect traffic passing through them, and they create bottlenecks of limited impact. In overlay multicast, all traffic passes through all nodes in the group, and the multicast throughput is decided by the node with the smallest throughput, particularly in the case of reliable delivery. The strategy of assigning an equal number of children to each intermediate node is far from optimal. If the number of children is set too big, the low-capacity nodes will be overloaded, which slows down the entire session. If the number of children is set too small, the high-capacity nodes will be underutilized. In such systems, network heterogeneity, scalability, and manageability must be well addressed.

We focus on how to design overlay infrastructures to address these challenges. For resource-sharing systems, we present a clustering algorithm and a labeling algorithm to cluster members within the same interest group, and propose an efficient query scheme to deliver a better tradeoff between communication overhead and response time. Moreover, we propose a new incentive scheme, which is particularly suitable for overlay networks without any centralized authority. Our incentive scheme is more resilient against various attacks, especially those launched by a large number of colluding nodes. With respect to overlay multicast, we propose two capacity-aware overlay systems that support distributed applications requiring multiple-source multicast with dynamic membership.
1.2 Related Work

The fundamental challenge of constructing an overlay network for resource-sharing systems such as peer-to-peer networks is to achieve faster response time, smaller network diameter, better resilience to network dynamics, and higher security. Structured P2P networks have been proposed by many researchers [1-7], in which distributed hash tables (DHTs) are used to provide data location management in a strictly structured way. Whenever a node joins/leaves the overlay, a number of nodes need to update their routing tables to preserve the properties required for fast lookup. While structured P2P networks can offer better performance in response time and communication overhead for query procedures, they suffer from large overlay-maintenance overhead under network dynamics. In addition, DHTs are inflexible in providing generic keyword searches because they have to hash the keys associated with certain objects [8].

Unstructured P2P networks such as Gnutella rely on a random process, in which nodes are interconnected in a random manner. The randomness offers high resilience to network dynamics. Basic unstructured networks rely on flooding for users' queries, which is expensive in computation and communication overhead. Consequently, scalability has always been a major weakness of unstructured networks. Even with the use of super nodes in Morpheus and KaZaA, the traffic is still high, and even exceeds web traffic. Searching through random walks is proposed in [8-10], in which incoming queries are forwarded to a randomly chosen neighbor. In random walks, there is typically no preference for a query to visit the nodes most likely to maintain the needed data, resulting in long response times. Interest-based shortcut [11] exploits the locality of interests among different nodes. In this approach, a peer learns its shortcuts by flooding or by passively observing its own traffic.
A peer ranks its shortcuts in a list and locates content by sequentially asking the shortcuts on the list from the top until the content is found. The basic principle behind this approach is that a node tends to revisit previously accessed nodes, since it was interested in the data items from those nodes before. However, the concept of interest similarity is vague, and it is difficult to build a subtle, quantitative definition on top of it. In addition, it may cause new problems.

Another challenge in resource-sharing systems is the incentive scheme. Today's peer-to-peer networks suffer from the problem of free-riders, who consume resources in the network without contributing anything in return. Originally it was hoped that users would be altruistic, "from each according to his abilities, to each according to his needs." In practice, altruism breaks down as networks grow larger and include more diverse users. This leads to a "tragedy of the commons," where individual peers' self-interest causes the system to collapse. To reduce free-riders, systems have to incorporate incentive schemes that encourage cooperative behavior. Some recent works [12-17] propose reputation-based trust systems, in which each node is associated with a reputation established from the feedback of the nodes it has made transactions with. The reputation information helps users identify and avoid malicious nodes. An alternative is virtual currency schemes [18-20], in which each node is associated with a certain amount of money. Money is deducted from the consumers of a service and transferred to the providers of the service after each transaction. Both types of schemes rely on authentic measurement of service quality and unforgeable reputation/money information. Otherwise, selfish or malicious nodes may gain advantage through false reports. For example, a consumer may falsely claim to have not received a service in order to pay less or to defame others.
More seriously, malicious nodes may collude in cheating in order to manipulate their information. Several algorithms have been proposed to address these problems. They either analyze statistical characteristics of the nodes' behavior patterns and other nodes' feedback [13, 21], or remove the underlying incentive for cheating [22]. In order to apply these algorithms, the nodes' history information must be managed by a central authority, which is not available in typical peer-to-peer networks. Some other works [23, 24] find circular service patterns based on history information shared among trusted nodes. Each node in a service circle has the chance to be both a provider and a consumer. The communication overhead for discovering service circles is very high, which makes these schemes unscalable. In addition, nodes belonging to different interest groups have little chance to cooperate, because service circles are unlikely to form among them.

Besides resource-sharing systems, application-level multicast is another promising application in overlay networks. Many research papers [25-28] pointed out the disadvantages of implementing multicast at the IP level [29] and argued for an application-level overlay multicast service. More recent work [30-36] studied overlay multicast from different aspects. To handle dynamic groups and ensure scalability, novel proposals were made to implement multicast on top of overlay networks. For example, Bayeux [37] and Borg [38] were implemented on top of Tapestry [3] and Pastry [4], respectively, and CAN-based Multicast [39] was implemented on CAN [2]. El-Ansary et al. studied efficient broadcast in a Chord network, and their approach can be adapted for the purpose of multicast [40]. Castro et al. compared the performance of tree-based and flooding-based multicast in CAN-style versus Pastry-style overlay networks [41]. These systems assume each node has the same number of children; host heterogeneity is not addressed.
Even though overlay multicast can be implemented on top of overlay unicast, the two have very different requirements. In overlay unicast, low-capacity nodes only affect traffic passing through them, and they create bottlenecks of limited impact. In overlay multicast, all traffic passes through all nodes in the group, and the multicast throughput is decided by the node with the smallest throughput, particularly in the case of reliable delivery. The strategy of assigning an equal number of children to each intermediate node is far from optimal. If the number of children is set too big, the low-capacity nodes will be overloaded, which slows down the entire session. If the number of children is set too small, the high-capacity nodes will be underutilized. To support efficient multicast, we should allow nodes in a P2P network to have different numbers of neighbors.

Shi et al. proved that constructing a minimum-diameter degree-limited spanning tree is NP-hard [42]. Note that the terms "degree" and "capacity" are interchangeable in the context of this dissertation. Centralized heuristic algorithms were proposed to balance multicast traffic among multicast service nodes (MSNs) and to maintain low end-to-end latency [42, 43]. These algorithms do not address the dynamic membership problem, such as MSN join/departure. There has been a flourish of capacity-aware multicast systems, which excel in optimizing single-source multicast trees but are not suitable for multi-source applications such as distributed games, teleconferencing, and virtual classrooms, which are the target applications of our algorithms. Bullet [44] is designed to improve the throughput of data dissemination from one source to a group of receivers. An overlay tree rooted at the source must be established, and disjoint data objects are disseminated from the source via the tree to different receivers.
The receivers then communicate amongst themselves to retrieve the missing objects; these dynamic communication links, together with the tree, form a mesh, which offers better bandwidth than the tree alone. Overlay multicast network infrastructure (OMNI) [45] dynamically adapts its degree-constrained multicast tree to minimize the latencies to the entire client set. Riabov et al. proposed a centralized constant-factor approximation algorithm for the problem of constructing a single-source degree-constrained minimum-delay multicast tree [46]. Yamaguchi et al. described a distributed algorithm that maintains a degree-constrained delay-sensitive multicast tree for a dynamic group [47]. These algorithms are designed for a single source and are not suitable when there are many potential sources (such as in distributed games). Building one tree for each possible source is too costly. Using a single tree for all sources is also problematic. First, a minimum-delay tree for one source may not be a minimum-delay tree for other sources. Second, the single-tree approach concentrates the traffic on the links of the tree and leaves the capacities of the majority of nodes (the leaves) unused, which hurts the overall throughput in multi-source multicasting. Third, a single tree may be partitioned beyond repair for a dynamic group.

1.3 Contribution

There are three major contributions in this dissertation. First, we propose an efficient distributed hybrid query scheme for unstructured peer-to-peer networks. Second, we propose an incentive scheme to encourage members to contribute to the community. Third, we propose two overlay infrastructures to support application-level multicast in heterogeneous environments.

1.3.1 A Distributed Hybrid Query Scheme to Speed Up Queries in Unstructured Peer-to-Peer Networks

We propose a hybrid query scheme that delivers a better tradeoff between communication overhead and response time.
We define a metric, independent of any global information, to measure the interest similarity between nodes. Based on this metric, we propose a clustering algorithm that clusters nodes sharing similar interests with small overhead and fast convergence. We propose a distributed labeling algorithm to explicitly capture the borders of clusters without any extra communication overhead. We propose a new query scheme that delivers a better tradeoff among response time, communication overhead, and the ability to locate more resources by mixing inter-cluster queries and intra-cluster queries.

1.3.2 A Distributed Incentive Scheme for Peer-to-Peer Networks

We propose a new incentive scheme to encourage end users to contribute to the system. It is suitable for networks without any centralized authority, and it is more resilient against various attacks, especially those launched by a large colluding group. Furthermore, we have designed a distributed authority infrastructure and a set of protocols to implement the scheme in peer-to-peer networks. More specifically, the proposed scheme has the following advantages over conventional approaches.

We propose a new distributed incentive scheme that combines reputation and virtual money. It is able to strictly limit the damage caused by malicious nodes and their colluding groups. The following features distinguish our scheme from others: the benefit that a node can get from the system is limited by its contribution to the system; the members of a colluding group cannot increase their total money or aggregate reputation by cooperation, regardless of the group size; and malicious nodes can only attack others at the cost of their own interests.

We design a distributed authority infrastructure to manage the nodes' history information with low overhead and high security.

We design a key sharing protocol and a contract verification protocol based on threshold cryptography to implement the proposed distributed incentive scheme.
1.3.3 Capacity-Aware Multicast Algorithms on Heterogeneous Overlay Networks

We extend Chord and Koorde to support application-level multicast with the following properties.

Capacity awareness: Member hosts may vary widely in their capacities in terms of network bandwidth, memory, and CPU power. Some are able to support a large number of direct children, but others support few.

Any-source multicast: The system should allow any member to send data to other members. A multicast tree that is optimal for one source may be bad for another source. On the other hand, one tree per member is too costly.

Dynamic membership: Members may join and leave at any time. The system must be able to efficiently maintain the multicast trees for a dynamic group.

Scalability: The system must be able to scale to a large Internet group. It should be fully distributed, without a single point of failure.

CHAPTER 2
A HYBRID QUERY SCHEME TO SPEED UP QUERIES IN UNSTRUCTURED PEER-TO-PEER NETWORKS

2.1 Motivation

2.1.1 Problems in Prior Work

Small communication overhead and short response time are the two main concerns in designing efficient query schemes in peer-to-peer networks. Current approaches suffer various problems in achieving a better tradeoff between communication overhead and response time due to the blindness of their searching procedures.

Flooding: Flooding [48, 49] is a popular query scheme for searching a data item in fully unstructured P2P networks such as Gnutella. While flooding is simple and robust, its communication overhead, that is, the message number, increases exponentially with the hop number. In addition, most of these messages visit nodes that have already been searched in the same query, and they can be regarded as duplicate messages. Consequently, communication overhead and scalability are always the main problems of the flooding approach.

Random Walks: Random walks [8-10, 50] rely on query messages randomly selecting their next hops among neighbors to reduce the communication overhead.
A query may have to go through many hops before it successfully locates the queried data item. Consequently, this approach takes a long time. If the network is well clustered (nodes with similar interests are densely connected), one might expect the query latency to be reduced significantly. However, this is not true, because the chance of a random walk message escaping out of the original cluster increases exponentially with the ratio r of the inter-cluster edge number to the intra-cluster edge number, as shown in Figure 2-1. In the case of a network with a small value of r (e.g., r < 0.01), if the queried data items are in different clusters from the source node, a query message has to walk a long distance before it can cross the cluster border and locate the queried data items. In the case of a network with a large value of r (e.g., r > 0.1), query messages may escape out of the original cluster within a small number of hops, resulting in a long response time if the queried data is in the original cluster.

Figure 2-1. Probability of random walks escaping out of the cluster decreases exponentially with the ratio of the number of inter-cluster edges to the number of intra-cluster edges.

These observations are also demonstrated by our simulations in Section 2.4. Consequently, random walks may suffer long response times regardless of whether the network has been well clustered.

Interest-based shortcut: Interest-based shortcut [11] tries to avoid the blindness of random walks by favoring nodes sharing similar interests with the source, which can be regarded as a variation of Markov random walks. Markov random walks may accelerate the query process to some extent in some cases. However, they cause new problems.
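The escape behavior discussed above can be illustrated with a toy simulation. The sketch below is an assumption-laden illustration, not the dissertation's experiment: the graph model (two equal clusters with random intra-cluster edges and a few bridge edges), the node counts, and the walk length are all made up for demonstration.

```python
import random

def build_two_clusters(n, intra_deg, inter_edges, rng):
    """Two clusters of n nodes each (ids 0..n-1 and n..2n-1); adjacency dict."""
    adj = {v: set() for v in range(2 * n)}
    for base in (0, n):                       # wire random intra-cluster edges
        for _ in range(n * intra_deg // 2):
            u, v = rng.sample(range(base, base + n), 2)
            adj[u].add(v); adj[v].add(u)
    for _ in range(inter_edges):              # a few bridges between clusters
        u = rng.randrange(0, n); v = rng.randrange(n, 2 * n)
        adj[u].add(v); adj[v].add(u)
    return adj

def escape_fraction(adj, n, walk_len, trials, rng):
    """Fraction of uniform random walks started in cluster 0 that leave it."""
    escapes = 0
    for _ in range(trials):
        cur = rng.randrange(n)
        for _ in range(walk_len):
            nbrs = list(adj[cur])
            if not nbrs:
                break
            cur = rng.choice(nbrs)
            if cur >= n:                      # crossed the cluster border
                escapes += 1
                break
    return escapes / trials

rng = random.Random(1)
adj = build_two_clusters(n=200, intra_deg=6, inter_edges=12, rng=rng)
print(escape_fraction(adj, 200, walk_len=50, trials=500, rng=rng))
```

Raising `inter_edges` (i.e., the ratio r) sharply increases the escape fraction, mirroring the trend in Figure 2-1.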
Suppose the nodes in an interest group have formed a cluster, and query messages can be artificially confined to this specific cluster. Since the nodes in the cluster share similar interests, any of them may maintain the queried data. The query procedure should therefore shorten the covering time of the whole cluster rather than the hitting time of some specific nodes in it. However, due to the bias in selecting the next hop, a Markov random walk tends to keep visiting certain nodes, covering fewer distinct nodes than a uniform random walk, as illustrated by Figure 2-2. Consequently, Markov random walks work worse than uniform random walks when query messages can be confined to specific clusters.

Figure 2-2. Markov random walks discover fewer distinct nodes than uniform random walks do.

2.1.2 Motivation

Researchers [11, 51, 52] have found that many peer-to-peer networks exhibit a small-world topology, and most queried data items are offered by nodes that share similar interests with the source node. Intuitively, the nodes sharing similar interests with the source node should have higher priority to be searched than others. Practically, there are two challenges in designing such a query scheme.

The first is how to construct a small-world topology that clusters nodes sharing similar interests. By "sharing similar interests," we mean that two nodes are interested in a common set of data items. The number of commonly accessed data items can serve as a metric to measure the interest similarity between two nodes. A clustering algorithm based on this metric can easily be designed to densely connect the nodes in the same interest group. Moreover, each node u can explicitly pick a set of inter-cluster neighbors that have different interests from u, and a set of intra-cluster neighbors that share similar interests with u.
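The coverage penalty of a biased walk can be seen in a very small experiment. This is an illustrative sketch only, not the dissertation's model: the walk runs on an idealized complete graph with self-loops, and the bias is modeled crudely as a probability of stepping back to a small set of "favorite" (high-similarity) nodes.

```python
import random

def coverage(n, steps, bias, rng):
    """Distinct nodes seen by a walk of `steps` steps over n nodes.
    With probability `bias`, the walk revisits one of 10 favored nodes
    (mimicking a Markov walk's preference); otherwise it steps uniformly."""
    cur, seen = 0, {0}
    for _ in range(steps):
        if rng.random() < bias:
            cur = rng.randrange(10)      # revisit a favored node
        else:
            cur = rng.randrange(n)       # uniform step
        seen.add(cur)
    return len(seen)

rng = random.Random(7)
uniform = coverage(n=500, steps=300, bias=0.0, rng=rng)
biased = coverage(n=500, steps=300, bias=0.7, rng=rng)
print(uniform, biased)  # with these parameters the biased walk covers far fewer nodes
```

The gap between the two counts is the effect Figure 2-2 reports: bias toward already-favored nodes shortens hitting time to those nodes but lengthens the covering time of the cluster.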
Take Figure 2-3 as an example. The network consists of 5 clusters, and nodes in the same cluster fall into the same interest group. Note that there exists an interest group consisting of two clusters: 1 and 5. Suppose the network has been well-clustered, and each node explicitly maintains a set of inter-cluster and intra-cluster neighbors. The second challenge is how to quickly locate the clusters that share similar interests with the source node, and how to exhaustively search the nodes in the found clusters, if the queried data items are in the source node's interest group. We introduce two types of queries: inter-cluster queries and intra-cluster queries. The inter-cluster queries serve to discover the clusters that share similar interests with the source node; they are issued by the source node, carry the interest information, and travel only on inter-cluster neighbors. It can be expected that clusters sharing similar interests with the source node will be located quickly, because the number of clusters is much smaller than the network size, and inter-cluster queries only travel among different clusters. The intra-cluster queries are spawned by inter-cluster queries when a cluster sharing similar interests with the source node is hit. An intra-cluster query thoroughly goes through the nodes in the cluster where it was spawned, by traveling only on intra-cluster neighbors. Note that inter-cluster queries and intra-cluster queries can be easily implemented if each node explicitly knows the types of its neighbors. Occasionally, the queried data may be outside the source node's interest group, and possibly maintained by clusters with different interests. This problem is addressed by blind search: inter-cluster messages randomly spawn intra-cluster messages when hitting clusters with different interests. For example, in Figure 2-3, inter-cluster queries are first initiated by a node in cluster 1 and travel among different clusters.
Based on the interest information carried in the inter-cluster queries, cluster 5 is found to share similar interests when it is hit, and an intra-cluster query is spawned, which then exhaustively searches the nodes in it. In addition, an intra-cluster query is spawned in cluster 2 by inter-cluster queries to support blind search.

[Figure omitted: five clusters with inter-cluster and intra-cluster query paths.] Figure 2-3. A query scheme mixing inter-cluster queries and intra-cluster queries (the nodes in the grey clusters fall into the same interest group).

2.2 Constructing a Small-World Topology

2.2.1 Measuring the Interest Similarity between Two Nodes

A cluster is generally formed by connecting nodes with similar interests in a network. We start our discussion with the definition of interest similarity between two nodes in P2P networks. If node u and node v share similar interests, then it is very likely that they have previously accessed some of the same data items. The size of the common subset of accessed data items can serve as a metric to measure how similar the interests of two nodes are. Each node may offer hundreds of data items, and hence there may exist a large number of data items even in a small network. As a result, u and v can show some degree of similarity only if each of them has visited a large number of data items. An alternative way to evaluate the interest similarity is by the number of commonly accessed nodes, which may enable a clustering algorithm to converge faster than the former approach. The problem with this approach is that two nodes visiting a common node does not indicate that they have similar interests, because a node may offer data items belonging to multiple interest groups. For instance, a user u may offer resources for two groups: a number of mp3 music files for one group, and a number of research papers on P2P networks for the other group. It is possible that two nodes that have visited u are interested in data items in different interest groups.
We have to address the discrepancy between the common set of accessed nodes and the common set of visited data items. Suppose there are n nodes N = {1, 2, ..., n} in the whole P2P network and a node i offers a number of data items to the community. It categorizes (maps) all of these data items into a_i different categories, denoted as C^i = {c^i_1, c^i_2, ..., c^i_{a_i}}. Suppose a data item x in i is mapped to a category c^i(x), where c^i(x) ∈ C^i. How to categorize the data items is determined by the node i independently. For instance, node i may classify music files as category 1, while another node may classify music files as category 2. On the other hand, node i may fall into multiple interest groups, denoted as G^i = {g^i_1, g^i_2, ...}. For a node u, the access history with respect to each of its interest groups (e.g., g^u_j) can be specified by a set of data items x, each denoted as a pair (i, c^i(x)), where i represents the node offering the data item, and c^i(x) is the category in C^i defined by i. If two nodes u and v share "similar interests" (e.g., g^u_j = g^v_k), their histories for g^u_j and g^v_k tend to consist of a common set of pairs (i, c^i(x)). Note that in the above definitions, each node determines its interest groups and categories independently, meaning that a node need not maintain any global information. For ease of explanation, we study a basic approach by assuming that each user falls into only one interest group and offers one category of data items. In this scenario, the access history can be represented by the accessed nodes alone. This approach can easily be extended to multiple categories and multiple groups based on the definitions above. One node u may access another node multiple times for different data items, and hence the access history of node u can be represented by a vector V^u = (v^u_1, ..., v^u_n), where v^u_x represents the number of times u has visited node x.
To cancel out the number of queries a node has issued, the access vector V^u is normalized to the frequency vector F^u, in which the i-th element, denoted f^u_i, is computed from V^u as f^u_i = v^u_i / Σ_{j∈N} v^u_j, representing the frequency with which the corresponding node i has been accessed. Note that the values in F^u fall into the range [0, 1]. If u has never accessed node i, the corresponding element f^u_i is equal to 0. The summation of all elements is equal to 1. The real size of the data structure maintaining V^u is much smaller than the network size n, and can be fixed to record only the nodes accessed most frequently by u.

[Figure omitted: three access patterns A, B, C.] Figure 2-4. Interest similarity between nodes. The numbers of data items in nodes 1, 2, and n are 50, 100, and 200 respectively. A) No common visited nodes (different interests). B) u and v have visited node 2 (100 data items) 8 and 5 times respectively (a certain level of similar interests). C) u and v have visited node n (200 data items) 8 and 5 times respectively (falls in between).

Furthermore, if the number of data items in node i, denoted as d_i, is large, the chance that two nodes u and v have visited common data items in i may be small even if both of them have visited i multiple times. As an example, in B and C of Figure 2-4, u and v have visited one common node. But u and v in B have a greater chance of having visited common data items, because the number of data items in node 2 is half of that in node n. To account for this issue, we introduce a weighted diagonal matrix W whose (i, i)-th value w_ii is equal to 1/d_i. It represents the probability of u and v visiting a common data item in i, if both of them visit i once.
Now we define the following metric to evaluate the interest similarity between two nodes:

A^{uv} = F^{uT} W F^v = Σ_{i∈N} f^u_i f^v_i / d_i.   (2-1)

Take Figure 2-4(B) as an example: the only commonly visited node is node 2, and working through the matrix product gives A^{uv} ≈ 0.004. Similarly, the interest similarity in Figure 2-4(C) is 0.002, which means the nodes in B show more interest similarity. If we view f^u_i and f^v_i as the probabilities of nodes u and v visiting node i, and 1/d_i as the probability of u and v visiting a common data item given that both of them visit i, then the summation A^{uv} can be used to predict the probability that u and v will visit a common data item in their future queries. Note that if a node i has not been visited by both u and v, then f^u_i = 0 and/or f^v_i = 0, indicating that a node does not need to maintain any information about the nodes it has never visited. As we have discussed, the size of the vector V^u is fixed. Hence, both the storage for the access history and the computation overhead for A^{uv} are constant. Our definition is advantageous in many ways. First, nodes u and v need not maintain any global information to compute A^{uv}. Second, the frequency vector cancels out the effect of the number of queries a node has issued, enabling a clustering algorithm based on this definition to converge quickly with the average number of queries. Third, the definition prefers nodes with good properties. For instance, if two nodes i and j maintaining the same data items can be accessed at 1 MB/s and 56 Kb/s respectively, then i will obviously be accessed more often, resulting in larger values of f^u_i and f^v_i. Fourth, it reduces the impact of possible discrepancies in the category definitions by nodes. For instance, if a category in node i is poorly defined and consists of data items belonging to various interest groups, then the category will seldom be accessed compared to its size, resulting in a small value of f^u_i f^v_i / d_i.
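As a concrete illustration, the similarity metric A^{uv} of equation (2-1) can be computed directly from sparse access vectors. The sketch below uses dictionary-based vectors; the function names and the total visit counts in the example are our own illustrative assumptions.

```python
def frequency_vector(access_counts):
    """Normalize a raw access vector V^u into frequencies f^u_i."""
    total = sum(access_counts.values())
    return {i: v / total for i, v in access_counts.items()}

def interest_similarity(Vu, Vv, items_per_node):
    """A^{uv} = sum over commonly visited nodes i of f^u_i * f^v_i / d_i,
    where d_i is the number of data items offered by node i."""
    Fu, Fv = frequency_vector(Vu), frequency_vector(Vv)
    return sum(Fu[i] * Fv[i] / items_per_node[i]
               for i in Fu.keys() & Fv.keys())

# Loosely modeled on Figure 2-4(B): u and v both visited node 2
# (100 data items) 8 and 5 times; the total visit counts are made up.
Vu = {1: 32, 2: 8}           # u issued 40 queries in total
Vv = {2: 5, 3: 45}           # v issued 50 queries in total
d = {1: 50, 2: 100, 3: 200}
print(interest_similarity(Vu, Vv, d))  # (8/40) * (5/50) / 100 = 0.0002
```

Only the commonly visited nodes contribute to the sum, mirroring the observation that a node needs no information about nodes it has never visited.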
2.2.2 Clustering Nodes with Similar Interests

Given the metric to evaluate the interest similarity between two nodes, we propose a lightweight clustering algorithm to connect nodes sharing similar interests. In our strategy, each node i maintains a list L of limited size (e.g., 30) to record the nodes that possibly share the same interests as itself. Each time a query message is processed, the similarity between the querying node and the node that owns the data items is computed. The newly obtained interest similarity and the corresponding node's address are inserted into the list L. If the list is full, the stored neighbor with the lowest interest similarity is dropped. By assuming that the interests of nodes will not shift within a limited time frame, the nodes collected in L likely fall into the same interest group as i, and will serve as candidates for its intra-cluster neighbors.

2.2.3 Bounding Clusters

Although a small-world topology can be formed along with queries by the above clustering algorithm, existing query schemes such as random walks can benefit only marginally from it, as we discussed in Section 3.1. To exploit the characteristics of the small-world topology, our approach is to explicitly capture the clusters in the underlying topology by having each node i maintain a set of inter-cluster neighbors in other interest groups, and a set of intra-cluster neighbors in its own interest group. A node i can learn its inter-cluster neighbors easily. For example, i can issue a certain number of random-walk messages that travel only on other nodes' inter-cluster neighbors, and choose the nodes hit by those messages as its inter-cluster neighbors. Note that the list L collects candidates for its intra-cluster neighbors, and should not overlap with the set of inter-cluster neighbors. It is most important to learn the intra-cluster neighbors, which can be selected from the nodes collected in the list L.
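The bounded candidate list L of Section 2.2.2 can be sketched with a min-heap, so that the entry with the lowest similarity is the one dropped on overflow. Class and method names are illustrative, and repeated observations of the same node are simply pushed as new entries for brevity.

```python
import heapq

class CandidateList:
    """Fixed-size list L: keeps the `capacity` observations with the
    highest interest similarity; the lowest one is dropped when full."""
    def __init__(self, capacity=30):
        self.capacity = capacity
        self.entries = []  # min-heap of (similarity, node address)

    def observe(self, node, similarity):
        heapq.heappush(self.entries, (similarity, node))
        if len(self.entries) > self.capacity:
            heapq.heappop(self.entries)  # drop the lowest similarity

    def candidates(self):
        return {node for _, node in self.entries}

L = CandidateList(capacity=3)
for node, sim in [("a", 0.4), ("b", 0.1), ("c", 0.3), ("d", 0.2)]:
    L.observe(node, sim)
print(sorted(L.candidates()))  # ['a', 'c', 'd'] -- "b" had the lowest similarity
```

Both insertion and eviction are O(log |L|), which keeps the per-query bookkeeping lightweight, as the scheme intends.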
The purpose of intra-cluster neighbors is to confine intra-cluster queries within a specific interest group. Two nodes falsely regarded as intra-cluster neighbors (a false positive) may have a dramatic impact, because an intra-cluster query may traverse into another cluster with different interests. In contrast, two nodes i and k falsely regarded as not being intra-cluster neighbors (a false negative) will have only limited impact, because i and k may still be connected through other intermediate intra-cluster neighbors j. In addition, if i and k are in the same interest group, the chance of them falling into the same cluster tends to grow as queries proceed. Based on this observation, we propose a labeling algorithm which ensures that if a link (i, j) is labeled as an intra-cluster edge, then i and j are in the same interest group with high probability. For a node i, we normalize the interest similarity of its neighbors j in L as follows:

p_ij = A^{ij} / Σ_{k∈L} A^{ik}.

If a matrix P is organized such that its (i, j)-th element is p_ij, then each row of P sums to 1, i.e., the matrix P is row stochastic. Intuitively, p_ij can be viewed as the transition probability of a Markov random walk. The transition probability p_ij can serve as a good metric to determine whether i and j are in the same interest group by introducing a threshold, denoted as T, as a lower bound. T can be set to a relatively large value, because false negatives have limited impact as discussed. Suppose there are a neighbors in L that are possibly in the same interest group as i; then T can be set to 1/a. Note that p_ij and T are computed by node i locally. The labeling algorithm does not involve any extra communication.

2.3 A Hybrid Query Scheme

2.3.1 Mixing Inter-Cluster Queries and Intra-Cluster Queries

By explicitly capturing the cluster structures in the underlying network, we can formally define the following three types of query messages.
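The local labeling rule can be sketched as follows: node i normalizes the similarities of the candidates in L into transition probabilities p_ij and labels as intra-cluster neighbors those whose probability clears the threshold T. The function name and the example values are our own.

```python
def label_intracluster(similarities, T):
    """similarities: dict mapping neighbor j -> A^{ij}, taken from L.
    Returns the neighbors j whose normalized weight
    p_ij = A^{ij} / sum_k A^{ik} reaches the threshold T."""
    total = sum(similarities.values())
    if total == 0:
        return set()
    return {j for j, a in similarities.items() if a / total >= T}

L_sims = {"a": 0.40, "b": 0.05, "c": 0.30, "d": 0.25}
# Normalized weights: a=0.40, b=0.05, c=0.30, d=0.25.
# A threshold of 0.10 admits a, c, d and rejects only b.
print(sorted(label_intracluster(L_sims, T=0.10)))  # ['a', 'c', 'd']
```

Everything here is computed from node i's own list L, consistent with the claim that labeling involves no extra communication.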
The first is called the l-query message, a special type of inter-cluster message that travels only on inter-cluster neighbors. Its purpose is to quickly locate the clusters that may share similar interests with the source node and to disperse intra-cluster queries among different clusters. Messages of this type are issued by the source node and walk randomly among different clusters. Moreover, if the queried data is in the source node's interest group, l-query messages should piggyback the source node's frequency vector, so that nodes hit by the messages can determine whether their clusters share similar interests with the source node. The second is called the s-query message, a special type of intra-cluster message that confines itself within a specific cluster by doing uniform random walks only on intra-cluster neighbors. s-query messages are only spawned/issued in the clusters that share similar interests with the source node. Their purpose is to exhaustively search the nodes that fall into the same interest group as the source node. Messages of this type are issued by the source node if the query is in its own interest group, and/or spawned by l-query messages when clusters sharing similar interests with the source node are hit. In order to reduce the number of duplicate s-query messages, each node that receives such a message should be able to estimate to what extent the cluster has been covered. If most of the nodes have been visited, an s-query message has little chance of discovering any new nodes by continuing to walk in the cluster, so the newly received message should be discarded. Otherwise, the message should be forwarded. Accurately estimating the covering time of a cluster is difficult and resource-consuming in a distributed system.
Heuristically, if the message has sequentially hit a certain number, denoted as h, of nodes that have already been visited by previous intra-cluster messages, most of the nodes have likely been covered, and hence the message should be discarded. Every s-query message maintains a counter to keep track of the number of consecutively visited nodes that had already been hit by previous intra-cluster messages. The counter of each newly spawned s-query message is set to 0 or 1 depending on whether its starting node has been hit before. Note that the total (not sequential) number of nodes hit by previous intra-cluster queries is not a good metric for this estimate, because it depends heavily on the cluster size. The last type is called the b-query message, which is also a special type of intra-cluster message, similar to the s-query message. The difference is that b-query messages may be spawned in clusters that have different interests from the source node. Their purpose is to support blind search, because occasionally the queried data may be outside the source node's interest group. Messages of this type are issued by the source node if the query is outside the source's interest group, and/or possibly spawned by l-query messages when clusters that have not been visited by intra-cluster query messages are hit. The chance that the queried data item is in a cluster with different interests is very small. Once a b-query message hits a node that has already been visited by intra-cluster messages, the message is discarded to reduce the number of duplicated messages. To control the communication overhead, the total number of concurrent query messages has to be limited. The source node counts the numbers of l-query, s-query and b-query messages, denoted as m_l, m_s and m_b respectively.
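The sequential-hit heuristic for terminating s-query messages can be sketched as a tiny state machine; the function name and the driver loop are illustrative.

```python
def step_squery(already_visited, counter, h=10):
    """Increment the counter when the current node was already visited
    by intra-cluster messages, reset it otherwise; discard the message
    once more than h already-visited nodes are hit in a row."""
    counter = counter + 1 if already_visited else 0
    action = "discard" if counter > h else "forward"
    return action, counter

# A message walking only over already-visited nodes dies after h+1 hops;
# a single fresh node along the way would reset the counter to 0.
action, counter, hops = "forward", 0, 0
while action == "forward":
    action, counter = step_squery(True, counter)
    hops += 1
print(action, hops)  # discard 11
```

Because the counter tracks consecutive (not total) hits on visited nodes, the stopping rule is independent of the cluster size, as the text argues.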
Whenever an intra-cluster message (an s-query or b-query message) is spawned or discarded, the source node is notified so it can update the corresponding counter. A new b-query message can be spawned to support blind search only if the sum of m_l, m_s, and m_b is smaller than a certain number, denoted as m. s-query messages can be spawned without restriction, so the total number of concurrent messages may temporarily exceed m. Note that the counter m_l does not change during a query. In addition, all messages periodically check the status of the source node so that they can stop if the query has already been answered. With these three types of messages, our query scheme is designed as follows. Initialization: To initiate a query request, a node u issues m_l l-query messages. If the queried data item falls in u's interest group, the l-query messages carry the source's frequency vector, and a certain number m_s of s-query messages are issued to exhaustively search u's own cluster. Otherwise, the messages do not carry any interest information, and a b-query message is issued. Receiving an l-query message: When a node u receives an l-query message, it calculates the interest similarity with the source node. If u shares similar interests (e.g., the similarity is larger than a small value), it spawns a new s-query message and updates m_s maintained by the source node. Otherwise, a new b-query message is spawned if the node has not been hit by other s-query or b-query messages and m_l + m_s + m_b < m. Finally, node u forwards the received message to a randomly selected inter-cluster neighbor. Receiving an s-query message: When a node u receives an s-query message, if u has already been hit by s-query or b-query messages, it increases the counter in the message by 1. Otherwise, it resets the counter to 0.
Next, if the counter is larger than the threshold h (e.g., 10), the node discards the message and notifies the source node to update the counter m_s. Otherwise, it forwards the message to a randomly selected intra-cluster neighbor. Receiving a b-query message: When a node u receives a b-query message, if u has already been hit by s-query or b-query messages, the node discards the message and notifies the source node to update the counter m_b. Otherwise, it forwards the message to a randomly selected intra-cluster neighbor. Our scheme can be considered stateful: if the same query is reissued multiple times, fewer intra-cluster queries will be spawned in the clusters that have been well searched, and more intra-cluster queries will be spawned in the less-searched clusters, giving a stronger ability to discover more resources/replicas.

2.3.2 Reducing the Communication Overhead

By mixing inter-cluster and intra-cluster queries, the system performance can be expected to improve significantly. Because the access vector V^u of a node u can be fixed to a small size, the extra overhead does not increase much (only l-query messages need to carry the frequency vector). Moreover, l-query messages only travel among different clusters, and the number of clusters, especially in a well-clustered network, is much smaller than the actual network size. It can be expected that most clusters can be covered by l-query messages within a small number of hops. l-query messages can remove the frequency vector from their payloads after a certain number of hops. In the meantime, a source node can designate one l-query message to keep the frequency vector, in case some clusters sharing similar interests with the source node have not been discovered after the specified number of hops.

2.4 Simulation

In this section, the performance of the proposed clustering algorithm and query scheme is studied by simulations.
Unless stated otherwise, the number of nodes is 10,000, and the average group size is 150. Each node maintains 1,000 data items, the average number of queries issued by each node is 30, the threshold h is equal to 10, and the probability of a node incorrectly classifying its queries or data items is 0.1. Moreover, m and m_l are set to 32 and 16 respectively.

[Figure omitted: cluster size vs. number of queries for the clustering algorithm.] Figure 2-5. Effect of the average query number on the cluster size.

We compare our scheme to random walks, in which a source node issues 32 random-walk messages in each query. We have also compared our scheme to flooding schemes; as expected, we observe that the flooding schemes suffer from very large communication overhead. In the figures, the legend "Uniform random walks (0)"/"Uniform random walks (1)" indicates that the queried data items are out of/in the source node's interest group in the uniform random-walk query scheme, and similarly "Inter-intra (0)"/"Inter-intra (1)" indicates that the queries are out of/in the source node's interest group in the proposed scheme. First, we study the effectiveness of the metric measuring interest similarity and the clustering algorithm. In Figure 2-5, it can be observed that when the average query number is larger than 10, the algorithm reaches a stable state and almost all nodes in the same interest group form a single cluster. This indicates that our algorithm converges quickly with the average number of queries, which is especially useful in P2P networks, where nodes tend to join/leave the system frequently. From Figure 2-6, it can be observed that the average number of nodes in a cluster is almost the same as the group size, demonstrating that A^{uv} can effectively measure the nodes' interest similarity.

[Figure omitted: average cluster size vs. group size for the clustering algorithm.] Figure 2-6.
Interest association is a good metric to estimate interest similarity.

[Figure omitted: percentage of returned queries vs. number of hops.] Figure 2-7. Percentage of returned queries within a specific hop number.

[Figure omitted: percentage of returned queries vs. number of messages (*100).] Figure 2-8. Percentage of returned queries within a specific message number.

Second, we study the performance of our scheme with respect to query latency and communication overhead. In Figure 2-7, it can be observed that if the queried data items fall into the source node's interest group, the number of hops needed for the majority of the queries is significantly reduced, to about 20, while uniform random walks take much longer. Correspondingly, the number of messages is also much smaller in our scheme than in random walks, as shown in Figure 2-8. The figures also show that if the queried data are out of the source node's interest group, the performance of our scheme is similar to uniform random walks. Note that the longer response time is acceptable, since only a few queries will be out of the source node's interest group in many P2P networks. In addition, these two figures also demonstrate that random walks for queries in the source node's interest group benefit only marginally from the underlying clustered topology; that is, only a slightly larger percentage of them are returned, compared with those out of the source node's interest group, within the same number of hops (messages). We have also studied the performance of a network in which each group consists of multiple different clusters, as shown in Figure 2-9 and Figure 2-10. The results show similar trends, which hold for all the other metrics studied later.
Moreover, comparing Figure 2-7 with Figure 2-9, and Figure 2-8 with Figure 2-10, it can be observed that the performance of random walks in the two different (well-clustered and poorly clustered) networks is similar, which further verifies our argument in Section 3.1.

[Figure omitted: percentage of returned queries vs. number of hops in a less clustered network.] Figure 2-9. Percentage of returned queries within a specific hop number in a less clustered network.

[Figure omitted: percentage of returned queries vs. number of messages (*100) in a less clustered network.] Figure 2-10. Percentage of returned queries within a specific message number in a less clustered network.

[Figure omitted: number of distinct nodes discovered vs. message range (*100).] Figure 2-11. Number of distinct nodes discovered in the same group within a certain message range.

As has been observed, when the queried data items are in the source node's interest group, our scheme works much better than random walks. The reason is that our scheme can discover more distinct nodes in the source node's interest group within the same number of messages or hops, as shown in Figure 2-11 and Figure 2-12. In the figures, it can be observed that within the first 1,000 messages or 30 hops, more than 120 nodes in the source node's interest group have been searched by query messages. Consequently, the majority of queried data falling into the node's interest group can be found with smaller overhead and shorter latency. It also indicates that our scheme has a stronger ability to locate more replicas, since it can discover a much larger number of nodes sharing similar interests. Occasionally, the queried data item may be maintained by nodes in other interest groups, or classified into the wrong interest group by the source node.
In the former case, l-queries will not carry any interest information, while in the latter case, l-queries will carry wrong interest information. In both cases, the efficiency of our query scheme can be evaluated by the number of distinct nodes discovered by queries, including those out of the source node's interest group, within a certain number of messages and hops.

[Figure omitted: number of distinct nodes vs. number of hops.] Figure 2-12. Number of distinct nodes discovered in the same group within a specific hop number.

[Figure omitted: percentage of messages discovering distinct nodes vs. message range (*100).] Figure 2-13. Percentage of messages discovering distinct nodes within a certain message range.

[Figure omitted: total number of distinct nodes vs. number of hops.] Figure 2-14. Total number of distinct nodes discovered within a specific hop number.

Note that whether the queries are in or out of the source node's interest group makes no difference to random walks. Figure 2-13 and Figure 2-14 show that in the first 1,000 messages, if the queries carry interest information, fewer distinct nodes are searched in our scheme. The reason is that s-query messages mistakenly and exhaustively search the nodes in the clusters that share "similar" interests in the beginning, as demonstrated by our previous simulations. Consequently, the number of b-query messages is limited. As the number of messages/hops increases, our scheme behaves similarly to uniform random walks, because after most nodes sharing similar interests are covered, more b-query messages are spawned to search clusters with different interests, which are able to discover more distinct nodes.
In addition, the figures show that if the queries carry no interest information, our scheme works similarly to uniform random walks.

CHAPTER 3
MARCH: A DISTRIBUTED INCENTIVE SCHEME IN OVERLAY NETWORKS

3.1 Motivation

3.1.1 Limitation of Prior Work

Any node in a peer-to-peer network is both a service provider and a service consumer. It contributes to the system by working as a provider, and benefits from the system as a consumer. A transaction is the process of a provider offering a service to a consumer, such as supplying a video file. The purpose of an incentive scheme is to encourage nodes to take the role of providers. Neither reputation systems [12-17] nor virtual currency systems [18, 19] can effectively prevent malicious nodes, especially those in collusion, from manipulating their history information by using false service reports. Specifically, the existing schemes have the following problems. Reputation inflation: In reputation schemes, malicious nodes can work together to inflate each other's reputation or to defame innocent nodes; colluding nodes thereby protect themselves from the complaints of innocent nodes, as these complaints may be treated as noise by the systems. Money depletion: In virtual currency schemes, malicious nodes may launch attacks to deplete other nodes' money and paralyze the whole system. Without authentic reputation information, innocent nodes are not able to proactively select benign partners and avoid malicious ones. Frequent complainer: In many incentive schemes, nodes are punished if they complain frequently, which prevents malicious nodes from constantly defaming others at no cost. It also discourages innocent nodes from reporting frequent malicious acts, because otherwise they would become frequent complainers. Punishment scale: In most existing schemes, the scale of punishment is related to the service history of the transaction participants.
Consequently, an innocent node may be subject to negative discrimination attacks [12] launched by nodes with excellent history.

3.1.2 Motivation

Punishing malicious nodes and limiting the damage caused by a colluding group are indispensable requirements of an incentive scheme that is able to deter bad behavior. There are two major kinds of bad behavior. First, a provider may deceive a consumer by providing less-than-promised service. Second, a consumer may defame a provider by falsely claiming that the service is poor. Consider how these problems are dealt with in real life. Before a transaction happens, the provider wants to know if the consumer has enough money to pay for the service, and the consumer wants to know the reputation of the provider. With such information, they can control the risk and decide whether to carry out the transaction. After the transaction, if the provider deceives, it will be sued by the consumer. Consequently, the malicious provider will build up a bad reputation, which prevents it from deceiving more consumers. Now consider a consumer who intentionally defames a provider. It can do so only after it can show evidence of a transaction, which requires it to pay money first. Consequently, defaming comes at a cost. The ability of a malicious consumer to defame others is limited by the amount of money it has. Inspired by the observations above, we propose a new incentive scheme, MARCH, which is a combination of Money And Reputation sCHemes. The basic idea behind the scheme is simple: each node is associated with two parameters, money and reputation. Providers earn money (and also reputation) by serving others. Consumers pay money for the service. If a consumer does not think the received service is worth the money it has paid, it reports to an authority, specifying the amount of money it believes it has overpaid. If the authority can determine who is lying, the liar is punished.
Otherwise, the authority freezes the money claimed to have been overpaid. The money will not be available to the provider and will not be returned to the consumer either, which eliminates any reason for the consumer to lie. If the provider is guilty, the consumer has its revenge and the provider's reputation suffers. If the provider is innocent, the consumer defames at a cost because, after all, it has paid the price of the transaction. In addition, the falsely-penalized provider will not serve it any more. The technical challenges are (1) establishing a distributed authority for managing the money and reputation, (2) designing the transaction protocol that ensures authentic exchange of money/reputation information and allows the unsatisfied consumers to sue the providers, (3) analyzing the properties of such a system, and (4) evaluating the system.

3.2 System Model

The nodes in a P2P network fall into three categories: honest, selfish, and malicious. Honest nodes follow the protocol exactly, and they both provide and receive services. Selfish nodes will break the protocol only if they can benefit. Malicious nodes are willing to compromise the system by breaking the protocol even when they benefit nothing and may be punished. Selfish/malicious nodes may form colluding groups. There may exist a significant number of selfish nodes, but the malicious nodes are likely to account for a relatively small percentage of the whole network. At a certain time, all selfish/malicious nodes that break the protocol are called dishonest nodes. A node is said to be rejected from the system if it has too little money and too poor a reputation for any honest provider/consumer to perform transactions with it. We study the incentive scheme in the context of DHT-based P2P networks [1, 2, 4]. We assume the routing protocol is robust, ensuring the reliable delivery of messages in the network [53]. We also assume the networks have the following properties.
Random, non-selectable identifier: A node cannot select its identifier, which should be arbitrarily assigned by the system. This requirement is essential to defending against the Sybil attack [54]. One common approach is to hash a node's IP address to derive a random identifier for the node [1].

Public/private key pair: Each node A in the network has a public/private key pair, denoted as PA and SA respectively. A trusted third party is needed to issue public-key certificates. The trusted third party is used offline once per node for certificate issuance, and it is not involved in any transaction.

3.3 Authority Infrastructure

3.3.1 Delegation

Who will keep track of the money/reputation information in a P2P network? In the absence of a central authority for this task, we design a distributed authority infrastructure. Each node A is assigned a delegation, denoted as DA, which consists of k nodes picked pseudo-randomly. For example, we can apply k hash functions, i.e., {h1, h2, ..., hk}, on the identifier of node A to derive the identifiers of the nodes in DA. If a derived identifier does not belong to any node currently in the network, the closest node is selected. For example, in [1], it will be the node clockwise after the derived identifier on the ring. The jth element in DA is denoted as DA(j). DA keeps track of A's money/reputation. Any anomaly in the information stored at the delegation members may indicate an attempt to forge data. The information is legitimate only if the majority of the delegation members agree on it. As long as the majority of the delegation members are honest, the information about node A cannot be forged. Such a delegation is said to be trustworthy. On the other hand, if at least half of the members are dishonest, then the delegation is untrustworthy. The delegation members are appointed pseudo-randomly by the system. A node cannot select its delegation members, but can easily determine who are the members in its or any other node's delegation.
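The delegation construction above can be sketched in a few lines. The 32-bit identifier space, the SHA-1-based hash family, and the function names below are illustrative assumptions, not part of the actual system:

```python
import hashlib

ID_SPACE = 2 ** 32  # illustrative identifier space (assumption)

def node_id(ip: str) -> int:
    """Derive a random, non-selectable identifier by hashing the node's IP."""
    return int.from_bytes(hashlib.sha1(ip.encode()).digest()[:4], "big") % ID_SPACE

def delegation_ids(ident: int, k: int = 5) -> list:
    """Apply k hash functions h_1..h_k to A's identifier to get D_A.

    Each h_j is modeled here as SHA-1 salted with the index j; the real
    system may use any family of k independent hash functions.
    """
    out = []
    for j in range(1, k + 1):
        digest = hashlib.sha1(("%d:%d" % (ident, j)).encode()).digest()
        out.append(int.from_bytes(digest[:4], "big") % ID_SPACE)
    return out
```

In a live DHT, each derived identifier would then be mapped to the node responsible for it, e.g., the clockwise successor on the ring as in [1].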
To compromise a delegation, the malicious/selfish nodes from a colluding group must constitute the majority of the delegation. Unless the colluding group is very large, the probability for this to happen is small, because the identifiers of the colluding nodes are randomly assigned by the system and the identifiers of the delegation are also randomly assigned. Let m be the size of a colluding group and n be the total number of nodes in the system. The probability for exactly t out of the k delegation members to be in the colluding group is P(t, k, m/n), where P(t, k, p) denotes the probability of t successes in k trials of a Binomial distribution with the probability of success in any trial being p. Let m* be the total number of distinct nodes in all colluding groups, also including all malicious nodes. The probability of a delegation being trustworthy is at least

Σ_{t=0}^{⌊k/2⌋} P(t, k, m*/n)

We plot the trustworthy probability with respect to m* when k = 5 in Figure 3-1 (the upper curve). In order to control the overhead, we shall keep the value of k small.

Figure 3-1. Trustworthy probability for a delegation (k = 5) and a 5-pair delegation set as m* (the total number of dishonest nodes) grows to 3,000, with n = 100,000. The trustworthy probabilities at m* = 3,000 are 99.97% and 99.815% respectively. Even if a delegation/k-pair delegation set is not trustworthy, it may not be compromised, because it is very unlikely that a single colluding group can control the majority of it.

3.3.2 k-pair Trustworthy Set

A transaction involves two delegations, one for the provider and the other for the consumer. They have to cooperate in maintaining the money and reputation information, and in avoiding any fraud. To facilitate the cooperation, we introduce a new structure, called the k-pair delegation set, consisting of k pairs of delegation members. Suppose node A is the provider and node B is the consumer.
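The trustworthy probabilities quoted above can be reproduced directly from the binomial sum. This sketch (function names are ours) evaluates both the single-delegation case and the k-pair-set case of the next subsection:

```python
from math import comb

def binom(t: int, k: int, p: float) -> float:
    """P(t, k, p): probability of exactly t successes in k Bernoulli trials."""
    return comb(k, t) * p ** t * (1 - p) ** (k - t)

def delegation_trustworthy(k: int, m_star: int, n: int) -> float:
    """Probability that at most floor(k/2) of the k members are dishonest."""
    p = m_star / n  # chance a randomly assigned member is dishonest
    return sum(binom(t, k, p) for t in range(k // 2 + 1))

def pair_set_trustworthy(k: int, m_star: int, n: int) -> float:
    """Same sum, but a pair is bad if either of its two members is dishonest."""
    p = m_star / n
    p_pair = 2 * p - p * p  # = 1 - (1 - p)^2
    return sum(binom(t, k, p_pair) for t in range(k // 2 + 1))
```

With k = 5, m* = 3,000, and n = 100,000, `delegation_trustworthy` evaluates to about 0.9997 and `pair_set_trustworthy` to about 0.998, consistent with the curves in Figure 3-1.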
The ith pair is (DA(i), DB(i)), ∀i ∈ [1..k], and the whole set is {(DA(1), DB(1)), (DA(2), DB(2)), ..., (DA(k), DB(k))}. If both DA(i) and DB(i) are honest, the pair (DA(i), DB(i)) is trustworthy. If the majority of the k pairs are trustworthy, the whole set is trustworthy. A pair is untrustworthy only if at least one of its two members is dishonest, which happens with probability 1 - (1 - m*/n)^2 = 2m*/n - (m*/n)^2. It can be easily verified that the probability for the whole set to be trustworthy is

Σ_{t=0}^{⌊k/2⌋} P(t, k, 2m*/n - (m*/n)^2)

We plot the trustworthy probability for the whole set with respect to m* in Figure 3-1 (the lower curve).

3.4 MARCH: A Distributed Incentive Scheme

3.4.1 Money and Reputation

With the distributed authority designed in the previous section, the following information about a node A is maintained by a delegation of k nodes.

Total money (TMA): It is the total amount of money paid by others to node A minus the total amount of money paid to others by A in all previous transactions. The universal refilled money (Section 3.6.2) will also be added to this variable.

Overpaid money (OMA): It is the total amount of money overpaid by consumers. A consumer pays money to node A before a service. If the service contract is not fulfilled by the transaction, the consumer may file a complaint, specifying the amount of money that it has overpaid. This amount cannot be greater than what the consumer has paid. When a new node joins the network, its total money and overpaid money are initialized to zero. From TMA and OMA, we define the following two quantities.

Available money (mA): It is the amount of money that node A can use to buy services from others.

mA = TMA - OMA (3-1)

Reputation (rA): It evaluates the quality of service (with respect to the service contracts) that node A has provided.

rA = (TMA - OMA) / TMA if TMA ≠ 0, and rA = 1 if TMA = 0 (3-2)

For example, if TMA = 500 and OMA = 10, then A's available money is 490, i.e., mA = 490, and its reputation is 0.98, i.e., rA = 0.98. To track every node's available money and reputation, we propose a set of protocols.
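Equations 3-1 and 3-2 translate directly into code. A minimal sketch (the function names are ours), including the degenerate TMA = 0 case:

```python
def available_money(tm: float, om: float) -> float:
    """m_A = TM_A - OM_A  (Eq. 3-1)."""
    return tm - om

def reputation(tm: float, om: float) -> float:
    """r_A = (TM_A - OM_A) / TM_A, defined as 1 when TM_A = 0  (Eq. 3-2)."""
    return 1.0 if tm == 0 else (tm - om) / tm
```

For the worked example in the text, TMA = 500 and OMA = 10 give mA = 490 and rA = 0.98.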
Consider a transaction in which Alice (A) is the provider and Bob (B) is the consumer. The transaction consists of five sequential phases.

Phase 1: Contract Negotiation. Alice and Bob negotiate a service contract.

Phase 2: Contract Verification. Through the help of their delegations, Alice and Bob verify the authenticity of the information claimed in the contract.

Phase 3: Money Transfer. The amount of money specified in the contract is transferred from Bob's account in DB to Alice's account in DA.

Phase 4: Contract Execution. Alice offers the service to Bob based on the contract specification.

Phase 5: Prosecution. After the service, Bob provides feedback reflecting the quality of service offered by Alice.

3.4.2 Phase 1: Contract Negotiation

Suppose Bob has received a list of providers through the lookup routine of the P2P network. Each provider specifies its reputation and its price for the service. Bob wants to minimize his risk when deciding which service provider he is going to use. Let LA be the price specified by Alice and GB be the fair price estimated by Bob himself. According to the definition, rA can roughly be used as a lower bound on the probability of Alice being honest. Intuitively, the probability for Bob to receive the service is at least rA, and the probability for Bob to waste his money LA is at most (1 - rA). We define the benefit for Bob to have a transaction with Alice as GB × rA - LA × (1 - rA). We further normalize it as

R = rA - (LA / GB)(1 - rA) (3-3)

To avoid excessive risk, Bob takes Alice as a potential provider if R is greater than a threshold value T. The use of the threshold helps the system reject dishonest providers with poor reputation. Among all potential providers, Bob picks the one with the highest normalized benefit. Both the value of LA and the value of rA are given by the provider A. If LA is set too high, R will be small and the provider runs the risk of not being picked by Bob.
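Bob's selection rule under Equation 3-3 can be sketched as follows; the offer tuples, threshold value, and function names are illustrative assumptions:

```python
def normalized_benefit(r_a: float, l_a: float, g_b: float) -> float:
    """R = r_A - (L_A / G_B)(1 - r_A)  (Eq. 3-3)."""
    return r_a - (l_a / g_b) * (1 - r_a)

def pick_provider(offers, g_b, threshold):
    """offers: list of (provider, r_A, L_A) tuples.

    Keep providers whose normalized benefit exceeds the threshold T,
    then return the (provider, R) pair with the highest R, or None.
    """
    best = None
    for provider, r_a, l_a in offers:
        r = normalized_benefit(r_a, l_a, g_b)
        if r > threshold and (best is None or r > best[1]):
            best = (provider, r)
    return best
```

With GB = 5 and T = 0.5, an offer (rA = 0.98, LA = 5) yields R = 0.96, so a low-reputation provider can compete only by lowering its price, as the next paragraph observes.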
Providers with poor reputation can improve their R values by setting their prices low. In this way, they can recover their reputation by selling services at lower prices. If Alice lies about her rA, she will be caught in the next phase and be punished. Now suppose Bob chooses Alice as the best service provider. They have to negotiate a service contract, denoted as c, in the following format:

< A, B, S, Q, L, SeqA, SeqB, rA, mB >

where A, B, S, Q, and L specify the provider, the consumer, the service type, the service quality, and the service price, respectively. SeqA and SeqB are the contract sequence numbers of Alice and Bob, respectively. After the transaction, Alice and Bob each increase their sequence numbers by 1. The values of rA and mB in the contract will be verified by the delegations in the next phase. As an example, if S = Storage, Q = 200G, and L = 5, the contract means that Alice offers storage of size 200G to Bob, and in return, the amount of money Bob must pay is 5.

3.4.3 Phase 2: Contract Verification

After negotiating a contract, Alice and Bob should exchange an authenticatable contract proof, so that Alice is able to activate the money transfer procedure and Bob is granted the prosecution rights. In addition, the information in the contract such as rA and mB should be verified by the delegations of Alice and Bob. We use the notation [x]y for the digital signature signed on message x with key y and {x}y for the cipher text of message x encrypted with key y. After Phase 2, if the contract is verified by the delegations, Alice should have the following contract proof:

CA = [c]SB

CA should not be produced by Bob, who may lie about mB. Instead, Alice must receive CA from Bob's delegation after the members confirm the value of mB. Bob has k delegation members. Each of them will produce a piece of CA and send it to Alice, who will combine the pieces into a valid contract proof.
Similarly, Bob must receive the following contract proof from Alice's delegation:

CB = [c]SA

The contract proofs will be used by Alice for money transfer and by Bob for prosecution. It is important to ensure that either both Alice and Bob, or neither of them, receive the contract proofs. Otherwise, dishonest nodes may take advantage of it. It can be shown that ensuring both or neither one receives her/his contract proof is impossible without using a third party (the delegation of Alice or Bob in this case).

Key Sharing Protocol

A k-member delegation is not a centralized third party. One possible approach for producing a contract proof by a delegation is to use threshold cryptography [55]. A (k, t) threshold cryptography scheme allows k members to share the responsibility of performing a cryptographic operation, so that any subgroup of t members can perform this operation successfully, whereas any subgroup of fewer than t members cannot. For digital signatures, k shares of the private key are distributed to the k members. Each member generates a partial signature by using its share of the key. After a combiner receives at least t partial signatures, it is able to compute the signature, which is verifiable by the public key. An important property is that fewer than t compromised members cannot produce a verifiable signature on a false message. In our case, the problem is to produce CB (or CA) by the k-member delegation of Alice (or Bob). We employ a (k, ⌊k/2⌋ + 1) threshold cryptography scheme to produce the contract proof. Alice distributes shares SA(i) of her private key SA to her delegation members DA(i), which will produce partial signatures [c]SA(i) and forward them to Bob for combination. As long as the delegation of Alice is trustworthy, Bob will receive enough correct partial signatures to compute a verifiable contract proof, while the false partial signatures generated by the compromised delegation members will not yield any verifiable proof.
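A production system would use a true threshold signature scheme [55], in which the key is never reassembled. As an illustrative stand-in for the share-distribution idea only, the sketch below implements Shamir secret sharing over a prime field: any t = ⌊k/2⌋ + 1 shares reconstruct the secret, while fewer reveal nothing. This is not the dissertation's actual scheme (plain Shamir reconstruction would expose the private key to the combiner); all names and the field size are our assumptions.

```python
import random

PRIME = 2 ** 61 - 1  # a Mersenne prime; the field for the shares (assumption)

def make_shares(secret: int, k: int, t: int):
    """Split `secret` into k shares; any t of them reconstruct it (Shamir)."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation of the degree-(t-1) polynomial
            acc = (acc * x + c) % PRIME
        return acc
    return [(x, f(x)) for x in range(1, k + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

In the threshold-signature analogue, each member would use its share to produce a partial signature, and the combiner would interpolate over the partial signatures rather than the key itself.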
When applying threshold cryptography, we have to defend against dishonest nodes, which may intentionally distribute incorrect secret shares. The incorrect partial signatures cannot yield a valid signature. We propose a protocol for distributing the key shares. Take Alice as an example. The protocol guarantees that either all delegation members receive the correct shares, or they all detect that Alice is dishonest.

Step 1: Alice sends a key share SA(i) to each delegation member DA(i), encrypted by the member's public key PDA(i). The messages are shown below.

MSG1 Alice → DA(i): [{SA(i)}PDA(i)]SA, ∀DA(i) ∈ DA

Step 2: After all members receive their key shares, they negotiate a common random number s (possibly by multi-party Diffie-Hellman exchange with authentication). Each member sends the number s as a challenge to Alice, signed by the member's private key and then encrypted by Alice's public key.

MSG2 DA(i) → Alice: {[s]SDA(i)}PA, ∀DA(i) ∈ DA

Step 3: Alice signs s with SA(i) and then with SA before sending it back to DA(i).

MSG3 Alice → DA(i): [[s]SA(i)]SA, ∀DA(i) ∈ DA

Step 4: After authentication, if the received [s]SA(i) value matches the locally computed one, DA(i) forwards the message to all other members in DA.

MSG4 DA(i) → DA(j): [[s]SA(i)]SDA(i), ∀DA(j) ∈ DA

Otherwise, DA(i) files a certified complaint to the other members.

MSG5 DA(i) → DA(j): ["SA(i) is invalid"]SDA(i), ∀DA(j) ∈ DA

Step 5: DA(i) needs to collect [s]SA(j), ∀DA(j) ∈ DA, which are the partial signatures on s. If it receives MSG4 [[s]SA(j)]SDA(j) from DA(j), the value of [s]SA(j) is in the message. If it receives MSG5 from DA(j), there are two possibilities: either Alice or DA(j) is dishonest. To resolve this situation, DA(i) forwards the certified complaint to Alice. If Alice challenges the complaint, she must disclose the correct value of SA(j) to DA(i) in the following message (then DA(j) can learn SA(j) from DA(i)).

MSG6 Alice → DA(i): [{SA(j)}PDA(i)]SA

Learning SA(j) from this message, DA(i) can compute [s]SA(j).
After DA(i) has all k partial signatures on s, it can determine that Alice is honest if any (⌊k/2⌋ + 1) partial signatures produce the same signature [s]SA, which can be verified by Alice's public key. Otherwise, Alice must be dishonest. Since the value of k is typically set small (e.g., 5) and the key distribution is performed once per node, the overhead of the above protocol is not significant.

Theorem 3.1. The key sharing protocol ensures that all delegation members will either obtain the correct shares of Alice's private key or detect Alice's fraud.

Proof. First of all, no node can deny the messages it has sent to others or falsely declare it has received some messages from others, because all messages in the protocol are signed by the corresponding nodes with their private keys. Note that DA(i) knows s and learns SA(i) from MSG1. Consider the first case, that Alice is honest. All delegation members can obtain the correct shares in Step 1. In the meantime, Alice will disclose the corresponding share to challenge a complaint only if the complaint is signed by a delegation member. If Alice is honest, only dishonest members may issue certified complaints. The total number of distinct shares exposed by Alice is no larger than the number of dishonest members. Next consider the second case, that Alice is dishonest. Alice may try to deceive the delegation in two possible ways. One way is that Alice does not send shares to some delegation members DA(i), which can be easily detected by DA(i) when it receives MSG3 from Alice or MSG4 from other delegation members. Subsequently, DA(i) will file a certified complaint (MSG5). If Alice discloses the correct share (MSG6) to challenge the complaint, DA(i) can obtain its share from other members; otherwise, the honest members are certain that Alice is dishonest, and will punish her. The other possible way for Alice to deceive is to distribute incorrect shares to some members DA(i) in MSG1.
There are three possible outcomes when MSG3 is processed by DA(i). (1) The partial signature in MSG3 matches the locally computed one. Subsequently, MSG3 is forwarded to all other members by DA(i). Then all honest members can detect Alice's fraud in Step 5 because, in addition to [s]SA(i), there are ⌊k/2⌋ other partial signatures that cannot be used to compute the signature [s]SA. (2) The partial signature in MSG3 does not match the locally computed one. DA(i) will detect Alice's fraud in Step 4 because of the inconsistency between MSG1 and MSG3. It will forward the two inconsistent messages from Alice to all other members in MSG5. Consequently, all members learn the inconsistency and punish Alice. (3) Alice does not send MSG3 to DA(i) at all. This can be handled in a way similar to the previous case in which DA(i) does not receive MSG1 from Alice. □

Contract Verification Protocol

Both Alice and Bob must register the contract with their delegations so that the money transfer and the optional prosecution can be performed through the delegations at later times. The delegations must verify the information claimed by Alice and Bob in the contract and generate the contract proofs that Alice and Bob need in order to continue their transaction. We design a contract verification protocol to implement the above requirements. The protocol consists of four steps, illustrated in Figure 3-2 (A), and the number of messages is O(k) for normal cases. A procedure call is denoted as x.y(z), which means to invoke procedure y at node x with parameter(s) z. If x is a remote node, a signed message carrying z must be sent to x first.

Figure 3-2. A) Protocol for contract verification and exchange. B) Money transfer. C) Prosecution.
Step 1: Alice sends the contract c and a digital signature c' to the delegation DA for validation. c' may be a signature of the contract concatenated with the identifier of the receiver, i.e., c' = [c|DA(i)]SA. Bob does the same thing.

Alice.SendContract(Contract c, Signature c')
1. for i = 1 to k do
2.   DA(i).ComputePartialProof(c, c')

Step 2: Then the delegation member DA(i) verifies the reputation claimed by Alice in the contract (denoted as c.rA) and computes a partial signature (denoted as psi) on the contract with its key share established by the previous protocol.

DA(i).ComputePartialProof(Contract c, Signature c')
1. if rA ≥ c.rA then
2.   ContractList.add(c, c')
3.   psi = [c]SA(i)
4.   DB(i).DeliverPartialProof(c, psi)
5. else punish(A)

Line 1 verifies whether c.rA is overclaimed or not. Line 2 saves the contract for later use in Step 3. The signature c' will be used in a procedure called detect(). Line 3 produces a partial signature on the contract by using SA(i). Line 4 sends the partial signature to the ith member of DB. If c.rA is overclaimed, Alice will be punished at Line 5. The delegation members in DB execute a similar procedure, except that the condition in Line 1 should be mB ≥ c.mB.

Step 3: When DB(i) receives the contract c and the partial signature psi from DA(i), it executes the following procedure.

DB(i).DeliverPartialProof(Contract c, PartialSignature psi)
1. wait for a timeout period
2. if c is found in ContractList then
3.   Bob.ProcessPartialProof(psi)
4. else
5.   detect()

Line 1 waits for a timeout period to ensure that the contract from Bob has arrived. Line 2 checks if the received contract c is also in the local ContractList. If that is true, Bob has announced the exact same contract as Alice has, and DB(i) forwards the partial signature psi to Bob. Otherwise, there are two possibilities: (1) Bob does not have the same contract as Alice has, or (2) Bob does have the same contract but his contract has not arrived at DB(i) yet.
To distinguish these two cases, DB(i) waits for another timeout period before checking again whether c is now in ContractList. If not, DB(i) believes that Alice and Bob do not have the same contract, and the detect() procedure is executed to detect the special case of a malicious Bob forging the contract. Once Bob is detected to be dishonest, the delegation can regard him as a liar and omit the detection procedure in future suspicious transactions, which means the detection procedure needs to be invoked only once for each dishonest node. In the following, we present the design details of the detect() procedure, and then provide the correctness proof. The delegation member DB(i), which has received two different contracts from Bob and DA(i), must handle two possible cases. One case is that DB(i) has received a contract with the sequence number c.SeqB from Bob, and the other is that DB(i) has never received such a contract from Bob. In the former case, DB(i) stops the verification procedure immediately. Then it tries to detect whether Bob is lying or not by sending Bob's signature c' to all other delegation members in DB. If a member finds that c' is different from Bob's signature that it received directly from Bob, it sends its version of Bob's signature to all other members. Otherwise, the member discards the signature from DB(i). In the latter case, DB(i) sends a special request, denoted REQ, with the sequence number c.SeqB to all other members in DB. If a member has already received the contract from Bob with the specified sequence number, it sends the corresponding signature c' to all other members after receiving REQ. Otherwise, the member discards the request REQ. In both cases, any member that has received two different versions of Bob's signature c' punishes Bob and refuses to participate in the rest of the transaction for the contract.
In addition, for the latter case, DB(i) refuses to proceed with the verification if no replies are received from other members, or punishes Bob but still continues the verification procedure, using the contract retrieved from other members, if all of the signatures received are the same. We show that if, having received conflicting contracts, a delegation member simply stopped the verification procedure without invoking the detect() routine, Bob would be able to break the protocol. Suppose k is equal to 3, DB(1) is a friend of Bob, and both DB(2) and DB(3) are honest. Bob can break the protocol by sending DB(2) a correct contract while sending DB(3) a forged contract (for instance, with a lower price c.L) or not sending the contract to DB(3) at all. Because DB(1) is Bob's friend, it may forward the partial signature psA(1) from DA(1) to Bob, but not send the partial signature psB(1) to DA(1). Then Bob can collect two partial signatures psA(1) and psA(2), because DB(2) cannot detect Bob's fraud and will forward the partial signature psA(2) to Bob, while Alice can only get one signature psB(2) from DA(2). Bob can compute the contract proof signed by Alice, while Alice cannot compute the proof signed by Bob. In our protocol, this problem is addressed by DB(3) invoking the detect() routine after receiving the contract from DA(3). The detect() routine guarantees that either DB(2) detects Bob's fraud or DB(1) forwards the partial signature of the contract retrieved from DB(2).

Step 4: After Alice (Bob) receives t or more correct partial signatures, she (he) can compute the contract proof CA (CB), which can be verified by using Bob's (Alice's) public key.

Theorem 3.2. If the k-pair delegation set of Alice and Bob is trustworthy, the contract verification protocol ensures that either both Alice and Bob will receive the correct proofs, or neither one can receive a valid contract proof and the transaction is aborted.

Proof. First, we prove that neither Alice nor Bob can deceive the authority.
This is a symmetric protocol, so without loss of generality we only consider Bob. Below we analyze the four possible ways in which Bob may try to deceive the authority.

1. Bob overclaims his available money mB. In this case, all honest members in DB can detect Bob's fraud in Step 2 and punish Bob. In the meantime, these members will neither forward the partial signature [c]SB(i) to DA(i) nor deliver [c]SA(i) to Bob, and consequently the transaction will be aborted.

2. Bob modifies the contract specification, for example, by lowering the transaction price c.L in order to pay less for the transaction. He sends the same modified contract to DB. In Step 3, all members in DB learn that the contract presented by Bob is different from that presented by Alice, and they will invoke the detect() procedure. In this case, all honest delegation members will stop contract verification immediately, but they will not punish Bob, because there is only one contract signature from Bob and either Alice or Bob may be lying.

3. Bob sends different modified contracts to the delegation members. Multiple delegation members will detect that the contracts from Bob and Alice are different, and they will invoke the detect() routine. At the end of the detection procedure, all members will learn that there are different contract signatures coming from Bob. Consequently, they will all punish Bob and abort the transaction.

4. Bob does not send the contracts to some (or all) delegation members. In this case, a member DB(i) that does not receive the contract from Bob will send the request REQ to all other members. If no other member has received the contract from Bob, DB(i) will receive no reply back. It will refuse to continue the verification process, but will not punish Bob, because either Alice or Bob may be lying. Similarly, all other members will also stop the verification.
Now if some other members have received the contracts from Bob, DB(i) will receive Bob's signatures in the replies from them, and it will continue the verification process using the contracts retrieved from other members. No one stops the verification process. In summary, we can see that all members in the trustworthy set will take the same action (continuing or stopping the contract verification process) in all four possible cases. Next, we prove that dishonest members cannot deceive honest members in the trustworthy sets into stopping the verification process if both Alice and Bob are honest. As we have discussed above, an honest delegation member, DA(i) or DB(i), will stop the contract verification in only two cases. One case is that the contract signatures (both Alice's and Bob's) received by the member in the detect() routine are not identical. The other case is that the member receives different contracts from Alice and Bob in Step 3. The former case happens only when Alice/Bob dishonestly sign and distribute different contracts to delegation members, while the latter case happens when both DA(i) and DB(i) are untrustworthy or when either Alice or Bob is dishonest. If both Alice and Bob are honest, dishonest members cannot interrupt the verification process at the members in the trustworthy sets. By Theorem 3.1 and the discussions above, if Alice and Bob are honest, both of them are able to collect no fewer than ⌊k/2⌋ + 1 correct partial signatures and compute the valid contract proofs. Otherwise, if either Alice or Bob is dishonest, the transaction is aborted. □

3.4.4 Phases 3 and 4: Money Transfer and Contract Execution

Before providing the service, Alice requests her delegation to transfer the money, as illustrated in Figure 3-2 (B). Upon receiving a money transfer request from Alice, the delegation member DA(i) invokes the following procedure.

DA(i).TransferMoneyProvider(Contract c, ContractProof CA)
1. if valid(c, CA) and DB(i).TransferMoneyConsumer(c, CA) then
2.   TMA = TMA + c.L
3. else verify()

In Line 1, both DA(i) and DB(i) need to validate the contract by using Bob's public key, which can be queried from Bob if it is not locally available. After validation, DA(i) increases Alice's earned money in Line 2. Note that DB(i) may be malicious. If DA(i) cannot get a positive answer from DB(i), it must verify the validity of the contract further (Line 3), which can be designed as follows. DA(i) asks the other members in DA. If the majority of DA have received a positive answer from DB, the contract is considered to be valid (DB(i) is malicious). Otherwise, the contract is considered to be invalid and Alice is punished. When DB(i) receives a money transfer request from DA(i), it performs the following operations.

DB(i).TransferMoneyConsumer(Contract c, ContractProof CA)
1. if valid(c, CA) then
2.   if mB ≥ c.L then
3.     TMB = TMB - c.L
4.     return true
5.   else
6.     punish(B)
7.     return false
8. else return false

First, if the contract is valid (Line 1) and Bob has enough money to pay for the service (Line 2), then Bob's total money is decreased by the price and a positive answer is returned to DA(i) (Lines 3 and 4). Second, it is possible that the contract is valid but Bob does not have enough money. This happens when Alice and Bob are colluding nodes and Alice gets the contract proof CA directly from Bob instead of through her delegation. In such a case, Bob is punished and a negative answer is returned (Lines 6 and 7). Third, if the contract is invalid, a negative answer is returned (Line 8). DA(i) and DB(i), ∀i ∈ [1..k], perform money transfer at most once for each contract. They keep track of the sequence numbers (SeqA and SeqB) of the last contract for which the money has been transferred. All new contracts have larger sequence numbers.
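The delegation-side bookkeeping for money transfer (and for the prosecution phase that follows) can be sketched as a small account model. This is a single-machine toy (class and function names are ours); in MARCH each of the k delegation members keeps its own replica of this state, and the majority rules described above arbitrate disagreements.

```python
class Account:
    """One delegation member's replica of a node's money/reputation state."""
    def __init__(self):
        self.tm = 0.0        # total money, TM
        self.om = 0.0        # overpaid (frozen) money, OM
        self.last_seq = 0    # sequence number of the last settled contract

    @property
    def available(self):     # m = TM - OM  (Eq. 3-1)
        return self.tm - self.om

    @property
    def reputation(self):    # r = (TM - OM)/TM, 1 if TM == 0  (Eq. 3-2)
        return 1.0 if self.tm == 0 else (self.tm - self.om) / self.tm

def transfer(provider, consumer, price, seq):
    """Move `price` from consumer to provider, at most once per contract."""
    if seq <= consumer.last_seq:    # replayed contract: refuse
        return False
    if consumer.available < price:  # not enough money: refuse (and punish)
        return False
    consumer.tm -= price
    provider.tm += price
    consumer.last_seq = seq
    return True

def prosecute(provider, overpaid, price):
    """Phase 5: freeze money the consumer claims to have overpaid (f <= L)."""
    if not (0 < overpaid <= price):
        return False
    provider.om += overpaid
    return True
```

Note how a successful prosecution lowers both the provider's available money and its reputation, without refunding the consumer, which is exactly the incentive structure analyzed in Section 3.5.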
3.4.5 Phase 5: Prosecution

After Bob receives the service from Alice, if the quality of service specified in the contract is not met, Bob may issue a prosecution request to Alice's delegation, as illustrated in Figure 3-2 (C). The request specifies the amount of money f that Bob thinks he has overpaid. Upon receiving a prosecution request from Bob, if DA cannot evaluate the service quality, it punishes both Alice and Bob by freezing the money overpaid by Bob. The procedure is given as follows.

DA(i).Prosecution(Contract c, ContractProof CB, Overpaid f)
1. if valid(c, CB) and f ≤ c.L then
2.   OMA = OMA + f
3.   notify(A)

First, DA(i) validates the prosecution request by checking if the contract proof is authentic (Line 1). If the contract is valid, it increases Alice's overpaid money by f (Line 2). Finally, it notifies Alice so that Alice is able to determine whether to sell services to Bob in the future.

3.5 System Properties and Defense against Various Attacks

3.5.1 System Properties

We study the properties of MARCH, which solves or alleviates the problems in the previous approaches. First, according to the money transfer procedures in Section 3.4.4, transactions among members of the same colluding group cannot increase the total amount of available money of the group. We have the following property, which indicates that the malicious nodes cannot benefit by cooperation.

Property 1. Regardless of its size, a colluding group cannot increase its members' money or reputation by cooperation without decreasing other members' money and/or reputation.

Second, unlike some other schemes [12, 13, 15, 16], MARCH does not maintain the history of any consumer's complaints, and does not punish frequent complainers. We have the following property.

Property 2. If a consumer is deceived, it is not restricted by the system in any way from seeking prosecution against the malicious providers.
Third, the overpaid money is not returned to the complaining consumer, which eliminates any reason for the consumer to lie if the consumer is not malicious. If the consumer is malicious and intends to defame the providers, it has to pay the price for the transactions before committing any harm, which serves as an automatic punishment. Consequently, its ability to defame is limited by the money it has, which cannot be increased artificially by collusion, according to Property 1. In addition, by Property 2, a deceived consumer can seek revenge with no restriction, which means a malicious provider cannot benefit from its action. We have the following properties.
Property 3. A malicious provider cannot profit by deceiving the consumers, and a malicious consumer will be automatically punished for defaming the providers.
Property 4. The maximum amount of loss for an innocent provider or consumer in a transaction with a malicious node is limited by the price specified in the contract.
Property 3 removes financial incentives to cheat. A provider can make money only by serving others; a consumer will not be refunded for cheating. Property 4 makes sure that an innocent node will not be subject to negative discrimination attacks [12], in which nodes with excellent reputation can severely damage other nodes. In summary, the malicious nodes cannot increase their power (in terms of available money) by cooperation, and they can only attack others at the cost of their own interests, i.e., money and/or reputation. Consequently, the total damage caused by the malicious nodes is strictly limited. They will eventually be rejected from the system due to poor reputation or be forced to serve others for better reputation in order to stay in the system.
3.5.2 Defending Against Various Attacks
In the following, we consider four different types of attacks launched by a colluding group [12].
Unfairly high ratings: The members of a colluding group cooperate to artificially inflate each other's reputation by false reports, so that they can attack innocent nodes more effectively. In MARCH, a colluding group can inflate the reputation of some members only by moving the available money from other members to them. According to Property 1, the total money in the group cannot be inflated through cooperation. Although some members' reputation can be made better, other members' reputation will become worse, making them ineffective in attacks.
Unfairly low ratings: Providers collude with consumers to "badmouth" other providers that they want to drive out of the market. Because MARCH requires all consumers to pay money for their transactions before they can defame the providers, the malicious consumers lose their money (and reputation) for "badmouthing," which in turn makes it harder for them to stay in the system.
Negative discrimination: A provider discriminates against only a few specific consumers by offering services with much lower quality than what the contract specifies. It hopes to earn some "extra" money without damaging its reputation since it serves most consumers honestly. In MARCH, a provider cannot make such "extra" money because of the prosecution mechanism and Properties 2-3.
Positive discrimination: A provider gives an exceptionally good service to a few consumers with high reputation and an average service to the rest of the consumers. The strategy will work in an incentive scheme where a consumer's ability to affect a provider's reputation is highly related to the consumer's own reputation, and vice versa. MARCH does not have this problem. How much a provider's reputation changes after a transaction is determined by how much money it receives for the service, not by the reputation of the consumer.
3.6 Discussions
In this section, we discuss other important issues in implementing MARCH.
3.6.1 Rewarding Delegation Members
The system should offer incentive for the delegation members to perform their tasks. A simple approach is for the provider and the consumer of a transaction to reward their delegation members with a certain amount of money, which should be less than the price of the transaction. After the transaction, the provider A signs an incentive payment certificate and sends the certificate to every delegation member DA(i), which reduces TMA by a certain amount and then forwards the certificate to its delegation members, where the certificate is authenticated and the money is deposited. The consumer pays its delegation members in a similar way. If a delegation member in DA refuses to serve, node A can increase k to bring new members into DA.
3.6.2 Money Refilling
Because the overpaid money will be frozen forever, the total amount of available money in the whole system may decrease over time. As a result, the system may enter deflation and lack sufficient money for the providers and the consumers to engage in transactions. This problem can be addressed by money refilling. The delegation members of a node A will replenish the total money TMA of the node at a slow, steady rate. In this way, a minimal amount of service is provided to all consumers, even the free-riders, at all times, which we believe is reasonable. For additional service, a consumer has to contribute to the P2P network by also serving as a provider.
3.6.3 System Dynamics and Overhead
In a P2P network, nodes may join/leave the network at any time. When a node X leaves the network, its DHT table will be taken over by the closest neighbor X'. In MARCH, suppose X is a delegation member of A. After X leaves the network, X' will become a new member in A's delegation. In order to deal with abrupt departure, X' should cache the information kept at X, or it can learn the information from other delegation members after X leaves.
For a specific DHT network, there are better ways of selecting the delegation members than the approach in Section 3.3.1. Take Chord [1] as an example. We can select a subset of the log n neighbors of node A as the delegation DA. In this way, the maintenance of the delegation is free as Chord already maintains the neighbor set. The communication overhead of a transaction (excluding the actual service) consists of O(k) control messages, which are sent from the provider (consumer) via k pairs of delegation members to the consumer (provider) through direct TCP connections. This overhead is quite small compared to typical services such as downloading video files of many gigabytes or sharing storage for months. More importantly, the overhead does not increase with the network size, which makes MARCH a scalable solution, compared with other schemes [23, 24] whose overhead increases with the network size.
Figure 3-3. Trustworthiness of delegation: the number of untrustworthy delegations versus the number of dishonest nodes, for k = 3, 5, and 7.
Figure 3-4. Trustworthiness of the k-pair delegation set: the probability of an untrustworthy set versus the number of dishonest nodes, for 3-pair, 5-pair, and 7-pair delegation sets.
3.7 Simulation
In our simulations, the dishonest nodes fall into three categories with equal probability. Category one: These nodes never offer services to others after receiving money, and always defame the providers after receiving services. Category two: When these nodes find that they may be rejected from the system, they behave honestly. Otherwise, they behave in the same way as the nodes in category one. Category three: When these nodes find that they may be rejected from the system, they behave honestly. Otherwise, they cheat their transaction partners with a probability taken from [0.5,1] uniformly at random.
Unless explicitly specified otherwise, the system parameters are set as follows. The number of nodes is 100,000 and k is 5. The average number of dishonest nodes is 1,000. Initially, the total money for a node is 500, and the overpaid money is 0. The service price G estimated by the consumers is 10. The threshold T is 0.9. To satisfy the threshold requirement, the maximum selling price for a provider is denoted as max (max is the maximum value of L that keeps R above the threshold, calculated based on Eq. (42). If max is negative, then the node can no longer be a provider. If a dishonest node in Category two or three finds that its max value may become negative after additional malicious acts, it will behave honestly). The actual selling price is a random number taken uniformly from (0, max]. If a node can neither be a provider (due to poor reputation) nor be a consumer (due to little money), it is said to be rejected from the system. If one participant in a transaction tries to deceive the other one, the transaction is called a failed transaction. We define "failed transaction ratio" as the number of failed transactions divided by the total number of transactions, and "overpaid money ratio" as the total amount of overpaid money divided by the total amount of money paid in the transactions. These metrics are used to assess the overall damage caused by dishonest nodes.
Figure 3-5. Number of rejected nodes versus the average number of transactions performed by each node, for honest and dishonest nodes: most of the malicious nodes are rejected within the first 50 transactions.
3.7.1 Effectiveness of Authority
In the first set of simulations, we study the trustworthiness of the delegations and the k-pair delegation sets. Figure 3-3 shows the number of untrustworthy delegations with respect to the number of dishonest nodes for k = 3, 5, and 7. Recall that a delegation is untrustworthy if at least half of its members are dishonest.
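As a small illustration, the two metrics defined above can be computed from a transaction log as follows. The record format (a flag and two amounts per transaction) is assumed for illustration; the simulator's actual data structures are not specified in the text.

```python
def failed_transaction_ratio(transactions):
    """Number of failed transactions divided by the total number of
    transactions. Each record is a dict with a 'cheated' flag and the
    'paid' / 'overpaid' amounts (hypothetical log format)."""
    return sum(t['cheated'] for t in transactions) / len(transactions)

def overpaid_money_ratio(transactions):
    """Total overpaid money divided by total money paid in transactions."""
    return sum(t['overpaid'] for t in transactions) / \
           sum(t['paid'] for t in transactions)

log = [{'cheated': True,  'paid': 10, 'overpaid': 10},
       {'cheated': False, 'paid': 8,  'overpaid': 0},
       {'cheated': False, 'paid': 12, 'overpaid': 0},
       {'cheated': False, 'paid': 10, 'overpaid': 0}]
assert failed_transaction_ratio(log) == 0.25   # 1 failed out of 4
assert overpaid_money_ratio(log) == 0.25       # 10 overpaid out of 40 paid
```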
Out of 100,000 delegations, only a few of them are untrustworthy. For k = 5, the number of nodes with untrustworthy delegations is just 23 even when there are 3,000 dishonest nodes. Figure 3-4 shows the probability for an arbitrary k-pair delegation set to be untrustworthy (Section 3.3.2). The 5-pair delegation set is trustworthy with a probability larger than 99% even when there are 3,000 dishonest nodes. Note that when a delegation is untrustworthy, the dishonest members may not belong to the same colluding group. Without cooperation, the damage they can cause will be smaller.
3.7.2 Effectiveness of MARCH
The second set of simulations studies the effectiveness of our incentive scheme. Figure 3-5 presents how the number of rejected nodes changes with the average number of transactions performed per node, which can be used as the logical time as the simulation progresses. Recall that the default number of dishonest nodes is 1,000. The figure shows that most dishonest nodes are rejected from the system within 50 transactions per node.
Figure 3-6. Failed transaction ratio and overpaid money ratio versus the average number of transactions performed by each node: both ratios drop quickly to small percentages within the first 100 transactions.
Because of money refilling, some rejected nodes will recover after enough money is refilled, but they will be rejected again after performing malicious transactions. No honest nodes are rejected from the system during the simulation. Figure 3-6 shows that the failed transaction ratio and the overpaid money ratio both drop quickly to small fractions of a percent within the first 100 transactions per node. As time progresses, these ratios become even more insignificant. Note that the overpaid money ratio is smaller than the failed transaction ratio. This is because the dishonest providers have to lower their prices in order to compete with honest providers, which in turn lowers their ability to cause significant damage.
Ironically, if a dishonest node with poor reputation wants to stay in the system, not only does it have to behave honestly to gain reputation, but it also has to do so at a lower price in order to get consumers, which "repays" the damage it previously did to the system. Next, we study how the number of dishonest nodes affects the system performance. Figure 3-7 shows the overpaid money ratio after 250 transactions per node. We find that the ratio increases linearly with the number of dishonest nodes. Even when there are 3,000 dishonest nodes, the overpaid money ratio remains very small, just 0.15%. Figure 3-11 shows that the more dishonest nodes there are, the more of them are rejected.
Figure 3-7. Overpaid money ratio (measured after 250 transactions) increases linearly with the number of dishonest nodes.
Figure 3-8. Number of rejected dishonest nodes (measured after 250 transactions) increases linearly with the number of dishonest nodes.
Figure 3-9. Overpaid money ratio with respect to the threshold.
Figure 3-10. Number of rejected nodes with respect to the threshold, for honest and dishonest nodes.
Last, we study the impact of the threshold on the system performance. The threshold is used by a consumer to select the potential providers (Section 3.4.2). Figure 3-9 shows that the overpaid money ratio decreases linearly with the threshold value, which means the system performs better with a larger threshold. Figure 3-10 shows that the number of rejected dishonest nodes is largely insensitive to the threshold value.
When the threshold is too low, some honest nodes may be rejected by the system because a smaller threshold allows the dishonest nodes to do more damage to the honest nodes, which may even cause some honest nodes to be rejected from the system due to defamed reputation. The numbers in the above two figures are measured after 250 transactions per node.
Figure 3-11. Number of rejected dishonest nodes (measured after 250 transactions) increases linearly with the number of dishonest nodes.
CHAPTER 4
CAPACITY-AWARE MULTICAST ALGORITHMS ON HETEROGENEOUS OVERLAY NETWORKS
4.1 Motivation
Multicast is an important network function for group communication among a distributed, dynamic set of heterogeneous nodes. The global deployment of IP multicast has been slow due to the difficulties related to heterogeneity, scalability, manageability, and the lack of a robust inter-domain multicast routing protocol. Application-level multicast has become a promising alternative. Even though overlay multicast can be implemented on top of overlay unicast, they have very different requirements. In overlay unicast, low-capacity nodes only affect traffic passing through them and they create bottlenecks of limited impact. In overlay multicast, all traffic passes through all nodes in the group, and the multicast throughput is decided by the node with the smallest throughput, particularly in the case of reliable delivery. The strategy of assigning an equal number of children to each intermediate node is far from optimal. If the number of children is set too big, the low-capacity nodes will be overloaded, which slows down the entire session. If the number of children is set too small, the high-capacity nodes will be underutilized.
There has been a flourish of capacity-aware multicast systems, which excel in optimizing single-source multicast trees but are not suitable for multi-source applications such as distributed games, teleconferencing, and virtual classrooms. In particular, they are insufficient in supporting applications that require any-source multicast with varied host capacities and dynamic membership. To support efficient multicast, we should allow nodes in a P2P network to have different numbers of neighbors. We propose two overlay multicast systems that support any-source multicast with varied host capacities and dynamic membership. We model the capacity as the maximum number of direct children to which a node is willing to forward multicast messages. We extend Chord [1] and Koorde [56] to be capacity-aware, and they are called CAM-Chord and CAM-Koorde, respectively.1 A dedicated CAM-Chord or CAM-Koorde overlay network is established for each multicast group. We then embed implicit degree-varying multicast trees on top of CAM-Chord or CAM-Koorde and develop multicast routines that automatically follow the implicit multicast trees to disseminate multicast messages. Dynamic membership management and scalability are inherited features from Chord or Koorde. Capacity-aware multiple-source multicast is an added feature. Our analysis of CAM multicasting sheds light on the expected performance bounds with respect to the statistical distribution of host heterogeneity.
4.2 System Overview
Consider a multicast group G of n nodes. Each node x ∈ G has a capacity cx, specifying the maximum number of direct child nodes to which x is willing to forward the received multicast messages. The value of cx should be made roughly proportional to the upload bandwidth of node x. Intuitively, x is able to support more direct children in a multicast tree when it has more upload bandwidth. In a heterogeneous environment, the capacities of different nodes may vary in a wide range.
Our goal is to construct a resilient capacity-aware multicast service, which meets the capacity constraints of all nodes, allows frequent membership changes, and delivers multicast messages from any source to the group members via a dynamic, balanced multicast tree. Our basic idea is to build the multicast service on top of a capacity-aware structured P2P network. We focus on extending Chord [1] and Koorde [56] for this purpose. The resulting systems are called CAM-Chord and CAM-Koorde, respectively. The principles and techniques developed should be easily applicable to other P2P networks as well. A CAM-Chord or CAM-Koorde overlay network is established for each multicast group. All member nodes (i.e., hosts of the multicast group) are randomly mapped by a hash function (such as SHA-1) onto an identifier ring [0, N − 1], where the next identifier after N − 1 is zero.
1 CAM stands for Capacity-Aware Multicast.
N (= 2^b) should be large enough such that the probability of mapping two nodes to the same identifier is negligible. Given an identifier x ∈ [0, N − 1], we define successor(x) as the node clockwise after x on the ring, and predecessor(x) as the node clockwise before x on the ring. x̂ refers to the node whose identifier is x; if there is no such node, then it refers to successor(x). Node x̂ is said to be responsible for identifier x. With a little abuse of notation, x, x̂, successor(x), and predecessor(x) may represent a node or the identifier that the node is mapped to, depending on the appropriate context where the notations appear. Given two arbitrary identifiers x and y, (x, y] is an identifier segment that starts from (x + 1), moves clockwise, and ends at y. The size of (x, y] is denoted as (y − x). Note that (y − x) is always positive. It is the number of identifiers in the segment (x, y]. The distance between x and y is |x − y| = |y − x| = min{(y − x), (x − y)}, where (y − x) is the size of segment (x, y] and (x − y) is the size of segment (y, x].
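The ring arithmetic above reduces to modular arithmetic. A minimal sketch, with an illustrative ring size N:

```python
N = 2 ** 16  # identifier space size; the text uses N = 2^b

def seg_size(x, y):
    # size (y - x) of the clockwise segment (x, y], in ring arithmetic
    return (y - x) % N

def distance(x, y):
    # |x - y| = |y - x| = min{(y - x), (x - y)}, as defined in the text
    return min(seg_size(x, y), seg_size(y, x))

# the two segments (x, y] and (y, x] together cover the whole ring
x, y = 100, 65000
assert seg_size(x, y) + seg_size(y, x) == N
assert distance(x, y) == distance(y, x)
```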
(y, x] and (x, y] together form the entire identifier ring. Before we discuss the CAMs, we briefly review Chord and Koorde. Each node x in Chord has O(log2 n) neighbors, which are 1/2, 1/4, 1/8, ... of the ring away from x, respectively. When receiving a lookup request for identifier k, a node forwards the request to the neighbor closest to k. This greedy algorithm takes O(log2 n) hops with high probability to find k̂, the node responsible for k. Each node in Koorde has m neighbors. A node's identifier x is represented as a base-m number. Its neighbors are derived by shifting one digit (with value 0..m − 1) into x from the right side and discarding x's leftmost digit to maintain the same number of digits. When x receives a lookup request for k, the routing path of the request represents a transformation from x to k by shifting one digit of k at a time into x from the right until the request reaches the node responsible for k. Because k has O(log_m n) digits, it takes O(log_m n) hops with high probability to resolve a lookup request. Readers are referred to the original papers for more details. Our first system is CAM-Chord, which is essentially a base-cx (instead of base-2) Chord with cx variable for different nodes. The number of neighbors of a node x is O(cx log n / log cx), which is related to the node's capacity. Hence, different nodes may have different numbers of neighbors. The distances between x and its neighbors on the identifier ring are 1/cx, 2/cx, ..., (cx − 1)/cx, 1/cx^2, 2/cx^2, ... of the ring, respectively. Apparently, CAM-Chord becomes Chord if cx = 2 for all nodes x. We will design a greedy lookup routine for CAM-Chord and a multicast routine that essentially embeds implicit, capacity-constrained, balanced multicast trees in CAM-Chord. The multicast messages are disseminated via these implicit trees. It is a challenging problem to analyze the performance of CAM-Chord. The original analysis of Chord cannot be applied here because cx is variable.
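The Koorde-style neighbor derivation reviewed above can be illustrated with base-m digit shifting. The sketch below is a toy, with illustrative constants and helper names; the real Koorde keeps these as routing-table entries rather than computing them on the fly.

```python
M = 4        # base (degree) of the de Bruijn graph
DIGITS = 5   # identifier length, so N = M ** DIGITS identifiers

def koorde_neighbor(x, d):
    """Shift digit d (0..M-1) into x from the right and discard the
    leftmost digit, keeping the same number of base-M digits."""
    assert 0 <= d < M
    return (x * M + d) % (M ** DIGITS)

# Routing toward k shifts k's digits into x one at a time, from the most
# significant digit down; after DIGITS hops, x has become k.
x, k = 123, 987
for shift in range(DIGITS - 1, -1, -1):
    x = koorde_neighbor(x, (k // M ** shift) % M)
assert x == k
```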
We will provide a new set of analysis on the expected performance of the lookup routine and the multicast routine. The second system is CAM-Koorde, which differs from Koorde in both the variable number of neighbors and how the neighbors are calculated. This difference is critical in constructing balanced multicast trees. Each node x has cx neighbors. The neighbor identifiers are derived by shifting x to the right by a variable number l of bits and then replacing the leftmost l bits of x with a certain value. In comparison, Koorde shifts x one digit (base m) to the left and replaces the rightmost digit. This subtle difference makes sure that CAM-Koorde spreads the neighbors of a node evenly on the identifier ring, while neighbors in Koorde tend to cluster together. We will design a lookup routine and a multicast routine that essentially performs broadcast. Remarkably, we show that this broadcast-based routine achieves roughly balanced multicast trees with the expected number of hops to a receiver being O(log n / E(log cx)). CAM-Chord maintains a larger number of neighbors than CAM-Koorde (by a factor of O(log n / log cx)), which means larger maintenance overhead. On the other hand, CAM-Chord is more robust and flexible because it offers backup paths in its topology [57]. The two systems achieve their best performance under different conditions. Our simulations show that CAM-Chord is a better choice when the node capacities are small and CAM-Koorde is better when the node capacities are large.
4.3 CAM-Chord Approach
CAM-Chord is an extension of Chord. It takes the capacity of each individual node into consideration. We first describe CAM-Chord as a regular P2P network that supports a lookup routine, which is to find k̂ for a given identifier k. We then present our multicast algorithm on top of CAM-Chord. When a node joins or leaves the overlay topology, the lookup routine is needed to maintain the topology as it is defined.
CAM-Chord is not designed for data sharing among peers as most other P2P networks (e.g., Chord [1]) are. There are NO data items associated with the identifier space. Each multicast group forms its own CAM-Chord network, whose sole purpose is to provide an infrastructure for dynamic capacity-aware multicasting.
4.3.1 Neighbors
For a node x in Chord, its neighbor identifiers are (x + 2^i) mod N, ∀i ∈ [0..log2 N − 1], which are 1/N, 2/N, 4/N, ..., 1/2 of the ring away from x. CAM-Chord is a base-cx Chord with variable cx for different nodes. Let cx^i mean (cx)^i. The neighbor identifiers are
xi,j = (x + j × cx^i) mod N, ∀j ∈ [1..cx − 1], ∀i ∈ [0..log_cx N − 1]. (4-1)
i and j are called the level and the sequence number of xi,j. Let x0,0 = x. The actual neighbors are x̂i,j, which are the nodes responsible for xi,j. Note that x̂0,1 is always successor(x). See an illustration in Figure 4-1 with cx = 3. The level-one neighbors in CAM-Chord divide the whole ring into cx segments of similar size. The level-two neighbors further divide the closest segment (x, x + N/cx] into cx subsegments of similar size. And so on. Consider an arbitrary identifier k. Let
i = ⌊ log(k − x) / log cx ⌋ and j = ⌊ (k − x) / cx^i ⌋. (4-2)
Figure 4-1. Chord vs. CAM-Chord neighbors (cx = 3): level-one, level-two, and level-three neighbors.
It can be easily verified that xi,j is the neighbor identifier of x that is counter-clockwise closest to k, which means x̂i,j is the neighbor node of x that is counter-clockwise closest to node k̂.2 We call i the level and j the sequence number of k with respect to x.
4.3.2 Lookup Routine
CAM-Chord requires a lookup routine to assist member join/departure during a dynamic multicast session. This routine returns the address of node k̂ responsible for a given identifier k. x.foo() denotes a procedure call to be executed at x. It is a local (or remote) procedure call if x is the local (or a remote) node.
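The neighbor identifiers and the level/sequence computation can be sketched as follows. N and cx are illustrative (N a power of cx for a clean example), and the integer loop stands in for the floating-point logarithm to avoid rounding issues; this is our sketch, not the dissertation's implementation.

```python
N = 3 ** 8   # illustrative identifier space

def neighbors(x, cx):
    """Neighbor identifiers x_{i,j} = (x + j * cx**i) mod N,
    for j in [1..cx-1] and i in [0..log_cx(N)-1]."""
    out, step = [], 1          # step = cx ** i
    while step < N:
        out += [(x + j * step) % N for j in range(1, cx)]
        step *= cx
    return out

def level_and_seq(x, k, cx):
    """Level i and sequence number j of identifier k with respect to x."""
    d = (k - x) % N            # clockwise distance (k - x)
    i = 0
    while cx ** (i + 1) <= d:  # i = floor(log(k - x) / log cx)
        i += 1
    return i, d // cx ** i     # j = floor((k - x) / cx**i)

assert len(neighbors(0, 3)) == 16          # (cx - 1) * log_cx(N) = 2 * 8
assert level_and_seq(0, 100, 3) == (4, 1)  # 100 = 1 * 3^4 + 19
```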
The set of identifiers that x is responsible for is (predecessor(x), x]. The set of identifiers that successor(x) is responsible for is (x, successor(x)].
x.LOOKUP(k)
1. if k ∈ (x, successor(x)] then
2.   return the address of successor(x)
3. else
4.   i ← ⌊ log(k − x) / log cx ⌋
5.   j ← ⌊ (k − x) / cx^i ⌋
6.   if k ∈ (x, x̂i,j] then
7.     return the address of x̂i,j
8.   else /* forward request to x̂i,j */
9.     return x̂i,j.LOOKUP(k)
2 It is possible that x̂i,j = k̂.
First, the LOOKUP routine checks if k is between x and successor(x). If so, LOOKUP returns the address of successor(x). Otherwise, it calculates the level i and the sequence number j of k. If k falls between x and x̂i,j, which means x̂i,j is responsible for the identifier k, LOOKUP returns the address of x̂i,j. On the other hand, if x̂i,j precedes k, then x forwards the lookup request to x̂i,j. Because x̂i,j is x's closest neighbor preceding k, CAM-Chord makes greedy progress to move the request closer to k.
4.3.3 Topology Maintenance
Because CAM-Chord is an extension of Chord, we use the same Chord protocols to handle member join/departure and to maintain the correct set of neighbors at each node. The difference is that our LOOKUP routine replaces the Chord LOOKUP routine. The details of the protocols can be found in [1]. The join operation of Chord can be optimized because two consecutive nodes on the ring are likely to have similar neighbors. When a new node joins, it first performs a lookup to find its successor and retrieves its successor's neighbors (called fingers in Chord). It then checks those neighbors to make corrections if necessary. In a base-c Chord, the join complexity without the optimization is O(c log^2 n / log^2 c) for a constant c. The optimization reduces the complexity to O(c log n / log c). CAM-Chord can be regarded as a base-c Chord where c is a random variable following the node-capacity distribution. It cannot always perform the above optimization because consecutive nodes may have different capacities, which make their neighbor sets different.
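To make the greedy progress concrete, here is a toy version of LOOKUP. It assumes a fully populated ring (every identifier hosts a node, so x̂i,j = xi,j and successor(x) = x + 1) and a uniform capacity; both simplifications are ours, made so the routine can run without simulating node placement.

```python
N = 3 ** 8
CX = 3   # toy setting: every node has capacity 3

def lookup(x, k, hops=0):
    """Greedy LOOKUP on a fully populated ring."""
    d = (k - x) % N
    if d <= 1:                    # k is x itself or in (x, successor(x)]
        return k, hops
    i = 0                         # level of k with respect to x
    while CX ** (i + 1) <= d:
        i += 1
    j = d // CX ** i              # sequence number of k
    nxt = (x + j * CX ** i) % N   # x_{i,j}: closest neighbor preceding k
    return lookup(nxt, k, hops + 1)

node, hops = lookup(0, 1000)
assert node == 1000 and hops <= 8   # at most log_cx(N) = 8 greedy hops
```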
When this happens, a new node x has to perform O(cx log n / log cx) lookups to find all its neighbors. Each lookup takes O(log n / log cx) hops by Theorem 1 (to be proved). The join complexity is O(cx log^2 n / log^2 cx), which would be reduced to O(c log n / log c) if the capacities of all nodes had the same value, i.e., cx was a constant c. This overhead is too high for a traditional P2P file-sharing application such as FastTrack, because the observations in [58] showed that a large fraction of the connections last 1 minute or less and many of the IP addresses keep active for no more than 10 minutes each time after they join the system. But CAM-Chord is not designed for file-sharing applications. One appropriate CAM-Chord application is teleconferencing, which has far fewer participants than FastTrack and less dynamic membership changes. We do not expect the majority of participants to keep joining and departing during a conference call. Another application is distributed games, where a user is more likely to play for an hour than for one minute. CAM-Chord makes a tradeoff between capacity awareness and maintenance overhead, which makes it unsuitable for highly dynamic multicast groups. For them, CAM-Koorde is a better choice because a node only has cx neighbors. Our future research will attempt to develop new techniques to overcome this limitation of CAM-Chord.
4.3.4 Multicast Routine
On top of the CAM-Chord overlay, we want to implicitly embed a dynamic, roughly balanced multicast tree for each source node. Each intermediate node in the tree should not have more children than its capacity. It should be emphasized that no explicit tree is built. Given a multicast message, a node x executes a MULTICAST routine, sending the message to cx selected neighbors, which in turn execute the MULTICAST routine to further propagate the message.
The execution of the MULTICAST routine at different nodes makes sure that the message follows a capacity-aware multicast tree to reach every member. Let msg be a multicast message, k be an identifier, and x be a node executing the MULTICAST routine. The goal of x.MULTICAST(msg, k) is to deliver msg to all nodes whose identifiers belong to (x, k]. The routine is implemented as follows: x chooses cx neighbors that split (x, k] into cx subsegments as evenly as possible. Each subsegment begins at a chosen neighbor and ends at the next clockwise chosen neighbor. x forwards the multicast message to each chosen neighbor, together with the subsegment assigned to this neighbor. When a neighbor receives the message and its subsegment, it forwards the message using the same method. The above process repeats until the size of the subsegment is reduced to one. The distributed computation of MULTICAST recursively divides (x, k] into non-overlapping subsegments and hence no node will receive the multicast message more than once.
x.MULTICAST(msg, k)
1. if k = x then
2.   return
3. else
4.   i ← ⌊ log(k − x) / log cx ⌋
5.   j ← ⌊ (k − x) / cx^i ⌋
/* select children from level-i neighbors preceding k */
6.   k' ← k
7.   for m = j down to 1
8.     x̂i,m.MULTICAST(msg, k')
9.     k' ← xi,m − 1
/* select (cx − j − 1) children from level-(i − 1) neighbors */
10.  l ← cx
11.  for m = cx − j − 1 down to 1
12.    l ← ⌊ l × m / (m + 1) ⌋ /* for even separation */
13.    x̂i−1,l.MULTICAST(msg, k')
14.    k' ← xi−1,l − 1
/* select x's successor */
15.  x̂0,1.MULTICAST(msg, k')
To split (x, k] evenly, x first calculates the level i and the sequence number j of k with respect to x (Lines 4-5). Then neighbors x̂i,m (∀m ∈ [1..j]) at the i-th level preceding k are selected as children of x in the multicast tree (Lines 6-9). We also select x's successor, which is x̂0,1 (Line 15). Since j + 1 may be less than cx, in order to fully use x's capacity, cx − 1 − j neighbors at the (i − 1)-th level are chosen; Lines 10-14 ensure that the selection is evenly spread over the (i − 1)-th level.
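The even-split invariant (recursively dividing (x, k] into non-overlapping subsegments, so every member receives the message exactly once) can be checked with a toy simulation on a fully populated ring. The code below uses a simplified even split rather than the exact level-based selection of the routine above; that simplification, and the constants, are our assumptions.

```python
N = 3 ** 4        # small ring so full coverage can be checked exhaustively
received = set()  # identifiers that have received the message

def multicast(x, k, cx=3):
    """Deliver to every identifier in (x, k]: pick up to cx children that
    split (x, k] into subsegments as evenly as possible, then recurse."""
    size = (k - x) % N
    if size == 0:
        return
    heads = sorted({(x + 1 + (m * size) // cx) % N for m in range(cx)},
                   key=lambda h: (h - x) % N)
    # each child's subsegment ends just before the next chosen child
    ends = [(heads[t + 1] - 1) % N for t in range(len(heads) - 1)] + [k]
    for child, end in zip(heads, ends):
        received.add(child)          # the child itself gets the message
        multicast(child, end, cx)    # it covers (child, end] recursively

multicast(5, 4)   # cover (5, 4], i.e., every identifier except 5
assert len(received) == N - 1 and 5 not in received
```

No identifier is ever added twice with different parents, matching the claim that the subsegments are non-overlapping.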
Because the algorithm selects neighbors that divide (x, k] as evenly as possible, it constructs a multicast tree that is roughly balanced. At Line 9, we optimize the code by using k′ ← x_{i,m} − 1 instead of k′ ← x̂_{i,m} − 1, where x̂_{i,m} denotes the node responsible for identifier x_{i,m}. That is because there is no node in (x_{i,m}, x̂_{i,m}) by the definition of x̂_{i,m}.

4.3.5 Analysis

Assume c_x ≥ 2 for every node x. We analyze the performance of the LOOKUP routine and the MULTICAST routine of CAM-Chord. Suppose a node x receives a lookup request for identifier k and it forwards the request to a neighbor node x̂_{i,j} that is closest to k. We call |k − x̂_{i,j}| / |k − x| the distance reduction ratio, which measures how much closer the request is to k after one-hop routing. The following lemma establishes an upper bound on the distance reduction ratio with respect to c_x, which is a random variable of certain distribution.

Lemma 4.1. Suppose a node x forwards a lookup request for identifier k to a neighbor x̂_{i,j}. If x̂_{i,j} ∈ (x, k], then

E( |k − x̂_{i,j}| / |k − x| ) ≤ E( ln c_x / (c_x − 1) )

Proof. Based on the algorithm of the LOOKUP routine, i must be the level of k with respect to x. By (4–1) and (4–2), k can be written as

k = x + j·c_x^i + l

where j ∈ [1..c_x − 1] is the sequence number of k with respect to x and l ∈ [0..c_x^i). Hence,

|k − x| = j·c_x^i + l    (4–3)

By definition (Section 4.3.1), x_{i,j} = x + j·c_x^i. Because x, x_{i,j}, x̂_{i,j}, and k are in clockwise order on the identifier ring, we have

|k − x̂_{i,j}| ≤ |k − x_{i,j}| = l    (4–4)

By (4–3) and (4–4), we have

|k − x̂_{i,j}| / |k − x| ≤ l / (j·c_x^i + l)

We now derive the expected distance reduction ratio. E(|k − x̂_{i,j}| / |k − x|) depends on three random variables, j, l, and c_x. Because the location of k is arbitrary with respect to x, we can consider j and l as independent random variables with uniform distributions on their respective value ranges.

E( |k − x̂_{i,j}| / |k − x| ) ≤ E( (1/(c_x − 1)) Σ_{j=1}^{c_x−1} (1/c_x^i) ∫_0^{c_x^i} l/(j·c_x^i + l) dl )
 ≤ E( (1/(c_x − 1)) Σ_{j=1}^{c_x−1} ∫_0^{c_x^i} dl/(j·c_x^i + l) )    since l ≤ c_x^i
 = E( (1/(c_x − 1)) Σ_{j=1}^{c_x−1} (ln(j + 1) − ln j) )
 = E( ln c_x / (c_x − 1) )    □

Theorem 4.2. Let c_x, for all nodes x, be independent random variables of certain distribution. The expected length of a lookup path in CAM-Chord
is O( log n / log( 1 / E(ln c_x/(c_x − 1)) ) ).

Proof. Suppose (x_1, x_2, ..., x_m) is a prefix of a lookup path for identifier k, where x_1 is the node that initiates the lookup, and x_i, i ∈ [1..m], and k are in clockwise order on the identifier ring. Because the nodes are randomly mapped to the identifier ring by a hash function, the distance reduction ratio after each hop is independent of those after other hops. Consequently |k − x_i| / |k − x_{i−1}|, i ∈ [2..m], are independent random variables.

E(|k − x_m|) = E( (|k − x_m| / |k − x_{m−1}|) · (|k − x_{m−1}| / |k − x_{m−2}|) ⋯ (|k − x_2| / |k − x_1|) · |k − x_1| )
 = E(|k − x_m| / |k − x_{m−1}|) × E(|k − x_{m−1}| / |k − x_{m−2}|) × ⋯ × E(|k − x_2| / |k − x_1|) × E(|k − x_1|)    (4–5)
 ≤ ( E(ln c_x/(c_x − 1)) )^{m−1} · N

where c_x is a random variable with the same distribution as c_{x_i}, i ∈ [1..m]. Next we derive the value of m that ensures E(|k − x_m|) ≤ N/n, which is the average distance between two adjacent nodes on the identifier ring. The following is a sufficient condition to achieve E(|k − x_m|) ≤ N/n:

( E(ln c_x/(c_x − 1)) )^{m−1} · N ≤ N/n
⟺ m ≥ 1 + ln n / ln( 1 / E(ln c_x/(c_x − 1)) )

If E(|k − x_m|) ≤ N/n, the expected number of additional routing hops from x_m to k is O(1). Hence O(m) = O( ln n / ln(1/E(ln c_x/(c_x − 1))) ) gives the expected length of the lookup path. □

It is natural that the expected length of a lookup path in CAM-Chord depends on the probability distribution of c_x, which affects the topological structure of the overlay network. For a given distribution, an upper bound of the expected path length can be derived from Theorem 4.2. The following theorem gives an example.

Theorem 4.3. Suppose the node capacity c_x follows a uniform distribution with E(c_x) = c. The expected length of a lookup path in CAM-Chord is O(log n / log c).

Proof. Suppose the range of c_x is [t_1..t_2] with E(c_x) = c. We perform big-O reduction as follows.

E( ln c_x/(c_x − 1) ) = Θ( E( ln c_x / c_x ) )
 = Θ( (1/(t_2 − t_1 + 1)) ∫_{t_1}^{t_2} (ln c_x / c_x) dc_x )
 = Θ( (ln² t_2 − ln² t_1) / (2(t_2 − t_1 + 1)) )
 = Θ( (ln² c) / c )    because t_2 ≤ 2c and t_1 ≤ c

Therefore,

ln( 1 / E(ln c_x/(c_x − 1)) ) = Θ( ln( c / ln² c ) ) = Θ( ln c )

By Theorem 4.2, O( ln n / ln(1/E(ln c_x/(c_x − 1))) ) = O( ln n / ln c ). □

Other distributions of c_x may be analyzed similarly. Next we analyze the performance of the MULTICAST routine in CAM-Chord.
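Before doing so, the bound of Lemma 4.1 can be sanity-checked numerically with a small simulation sketch of ours (not part of the dissertation's analysis): for a fixed capacity c, the per-hop distance reduction ratio is at most l/(j·c^i + l) = u/(j + u) with u = l/c^i uniform in [0, 1) and j uniform in [1..c − 1], and its mean should not exceed ln c/(c − 1).

```python
# Monte Carlo check of the Lemma 4.1 bound for a fixed capacity c = 7
# (the simulation section's default average capacity).
import math
import random

rng = random.Random(0)
c = 7

def ratio(rng, c):
    """One sample of the per-hop distance reduction ratio u/(j + u)."""
    u = rng.uniform(0, 1)          # l / c^i, uniform on [0, 1)
    j = rng.randint(1, c - 1)      # sequence number, uniform on [1..c-1]
    return u / (j + u)

n_samples = 100000
mean = sum(ratio(rng, c) for _ in range(n_samples)) / n_samples
bound = math.log(c) / (c - 1)      # ln(c)/(c - 1) ≈ 0.324 for c = 7
assert mean < bound                # empirical mean respects the bound
```

For c = 7 the empirical mean is well below the bound, since the bound discards the factor u < 1 in the numerator; the lemma is deliberately loose in this direction.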
Suppose x executes x.MULTICAST(msg, k), which is responsible for delivering msg to all nodes in the identifier segment (x, k]. Specifically, x forwards msg to some neighbor nodes x_{i,m} by remotely invoking x_{i,m}.MULTICAST(msg, k′), which is responsible for delivering msg to a smaller subsegment (x_{i,m}, k′], where x_{i,m}, k′ ∈ (x, k]. It is a typical divide-and-conquer strategy. We call |k′ − x_{i,m}| / |k − x| the segment reduction ratio, which measures the degree of reduction in problem size after one-hop multicast routing. The following lemma establishes an upper bound on the segment reduction ratio with respect to c_x, which is a random variable of certain distribution.

Lemma 4.4. Suppose a node x forwards a multicast message to a neighbor x_{i,m}, i.e., x.MULTICAST(msg, k) calls x_{i,m}.MULTICAST(msg, k′). It must be true that

E( |k′ − x_{i,m}| / |k − x| ) ≤ E( ln c_x / (c_x − 1) )

Proof. Based on the algorithm of the MULTICAST routine, the execution of x.MULTICAST(msg, k) divides its responsible segment (x, k] into c_x subsegments, and x_{i,m} is responsible for delivering msg to all nodes in one subsegment (x_{i,m}, k′]. The largest subsegments are created by Lines 6–9. When Line 7 is executed for m ∈ [1..j − 1], k′ = x_{i,m+1} − 1. Therefore,

|k′ − x_{i,m}| = (x + (m + 1)·c_x^i − 1) − (x + m·c_x^i) < c_x^i    (4–6)

By Line 4, i is the level of k with respect to x. By (4–1) and (4–2), k can be written as

k = x + j·c_x^i + l

where j ∈ [1..c_x − 1] is the sequence number of k with respect to x and l ∈ [0..c_x^i). Hence,

|k − x| = j·c_x^i + l    (4–7)

By (4–6) and (4–7), we have

|k′ − x_{i,m}| / |k − x| < c_x^i / (j·c_x^i + l)

We now derive the expected segment reduction ratio. E(|k′ − x_{i,m}| / |k − x|) depends on three random variables, j, l, and c_x. Because the location of k is arbitrary with respect to x, we can consider j and l as independent random variables with uniform distributions on their respective value ranges.

E( |k′ − x_{i,m}| / |k − x| ) ≤ E( (1/(c_x − 1)) Σ_{j=1}^{c_x−1} (1/c_x^i) ∫_0^{c_x^i} c_x^i/(j·c_x^i + l) dl )
 = E( (1/(c_x − 1)) Σ_{j=1}^{c_x−1} ∫_0^{c_x^i} dl/(j·c_x^i + l) )    (4–8)
 = E( (1/(c_x − 1)) Σ_{j=1}^{c_x−1} (ln(j + 1) − ln j) )
 = E( ln c_x / (c_x − 1) )    □

A multicast path is defined as the path that the MULTICAST routine takes to deliver a multicast message from a source node to a destination node. The proofs of the following two theorems are very similar to those of Theorems 4.2 and 4.3, due to the similarity between Lemma 4.4 and Lemma 4.1, on which the theorems are based. To avoid excessive repetition and to conserve space, we omit the proofs for Theorems 4.5 and 4.6.

Theorem 4.5. Let c_x, for all nodes x, be independent random variables of certain distribution. The expected length of a multicast path in CAM-Chord is O( log n / log(1/E(ln c_x/(c_x − 1))) ).

Theorem 4.6. Suppose the node capacity c_x follows a uniform distribution and E(c_x) = c. The expected length of a multicast path in CAM-Chord is O(log n / log c).

4.4 CAM-Koorde Approach

This section proposes CAM-Koorde. For any node x in CAM-Koorde, its number of neighbors is exactly equal to its capacity c_x. The maintenance overhead of CAM-Koorde is smaller than that of CAM-Chord due to a smaller number of neighbors. Like Koorde, CAM-Koorde embeds the de Bruijn graph in the identifier ring. On the other hand, it has two major differences from Koorde, which are critical to our capacity-aware multicast service. The first difference is about neighbor selection. The neighbor identifiers of a node x in Koorde are derived by shifting x one digit (base m) to the left and then replacing the last digit with 0 through m − 1. The neighbor identifiers differ only at the last digit. Consequently they are clustered and often refer to the same physical node. For the purpose of multicast, we want the neighbors to spread evenly on the identifier ring. The neighbor identifiers of a node x in CAM-Koorde are derived by shifting x one or more bits to the right and then replacing the high-order bits with 0 through a certain number.
The neighbor identifiers differ at the high-order bits, and they are evenly distributed on the identifier ring. The second difference is about the number of neighbors. Koorde requires every node to have the same number of neighbors. CAM-Koorde allows nodes to have different numbers of neighbors.

4.4.1 Neighbors

Let N = 2^b. In CAM-Koorde, x has c_x neighbors, which are categorized into three groups. All computations are assumed to be modulo N.

Figure 4-2. CAM-Koorde topology with identifier space [0..63], showing the neighbors of node 36 in the basic, second, and third groups

The basic group has four neighbors. Two are x's predecessor and successor. The other two are the nodes responsible for identifiers (x/2) and 2^{b−1} + (x/2), respectively. After the basic group, there are c_x − 4 remaining neighbors. Let s = ⌊log(c_x − 4)⌋. If s ≥ 1, we shall shift x by s bits to the right to derive the neighbor identifiers.³ If s ≥ 1, then let t = 2^s; otherwise let t = 0. The neighbors in the second group are the nodes responsible for identifiers (i × 2^{b−s} + x/2^s), ∀i ∈ [0..t − 1]. After the basic and second groups, there are t′ = c_x − 4 − t remaining neighbors. Let s′ = s + 1. The neighbors in the third group are the nodes responsible for identifiers (i × 2^{b−s′} + x/2^{s′}), ∀i ∈ [0..t′ − 1]. It is required that c_x ≥ 4. The basic group is mandatory. The optional second and third groups pick up the remaining neighbors. An example is given in Figure 4-2, showing the neighbors of node 36 (100100) whose capacity is 10. The binary representation of the node identifier is given in the parentheses. The basic group is {35 (100011), 37 (100101), 18 (010010), 50 (110010)}. The second group is {9 (001001), 25 (011001), 41 (101001), 57 (111001)}.

³ If s = 1, it means to shift one bit. The basic group already does that.

The third group is {4 (000100), 12 (001100)}.

4.4.2 Lookup Routine

Definition 1. Given two b-bit identifiers x and k, if an l-bit prefix of x matches an l-bit suffix of k, we say x and k share l ps-common bits. x = k if the two share b ps-common bits.

Similar to CAM-Chord, a lookup routine is needed in CAM-Koorde for member join/departure. First consider an N-node network with every identifier having a corresponding node. Given an identifier k, suppose node x wants to query for the address of node k. The lookup routine forwards the lookup request along a chain of neighbors whose identifiers share progressively more ps-common bits with k. Starting from x, we identify a neighbor that has the longest prefix matching the suffix of k. More specifically, if the third group is not empty and a third-group neighbor can be derived by selecting the (⌊log(c_x − 4)⌋ + 1) bits of k that precede the current ps-common bits and shifting them from the left into x, then the lookup request is forwarded to this neighbor. Otherwise, if the second group is not empty and a second-group neighbor can be derived by selecting the ⌊log(c_x − 4)⌋ bits of k that precede the current ps-common bits and shifting them from the left into x, then the lookup request is forwarded to this neighbor. Otherwise, we forward the request to a first-group neighbor that increases the number of ps-common bits by one. To determine each subsequent node on the forwarding path, a similar process repeats by shifting more bits of k into the identifier of the next-hop receiver. After at most b hops, the request can reach node k. Now suppose n < N, which is normally the case. We still calculate the chain of neighbor identifiers in the above way, which essentially transforms identifier x to identifier k in a series of steps, each step adding one or more bits from k. Once the next neighbor identifier y on the chain is calculated, the request is forwarded to y, which in turn calculates its neighbor identifier that should be the next on the forwarding path and then forwards the request. The pseudo code of the LOOKUP routine is shown below.
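Before the pseudocode, the neighbor construction of Section 4.4.1 and the ps-common-bit count of Definition 1 can be checked concretely with a short sketch of ours (helper names are hypothetical; it computes neighbor identifiers only, without resolving them to responsible nodes, and assumes c_x > 4 so the optional groups are non-empty):

```python
# Neighbor-identifier computation for CAM-Koorde (Section 4.4.1) and the
# ps-common-bit count (Definition 1), reproducing the node-36 example.
from math import floor, log2

def neighbor_ids(x, cx, b):
    """Identifiers of the basic, second, and third neighbor groups of x."""
    N = 1 << b
    basic = [(x - 1) % N, (x + 1) % N,            # predecessor, successor
             x >> 1, (1 << (b - 1)) + (x >> 1)]   # x/2 and 2^(b-1) + x/2
    s = floor(log2(cx - 4))                       # shift for the second group
    t = (1 << s) if s >= 1 else 0
    second = [i * (1 << (b - s)) + (x >> s) for i in range(t)]
    s2, t2 = s + 1, cx - 4 - t                    # third group picks the rest
    third = [i * (1 << (b - s2)) + (x >> s2) for i in range(t2)]
    return basic, second, third

def ps_common(x, k, b):
    """Largest l such that the l-bit prefix of x equals the l-bit suffix of k."""
    for l in range(b, 0, -1):
        if (x >> (b - l)) == (k & ((1 << l) - 1)):
            return l
    return 0

# Node 36 (100100) with capacity 10 in identifier space [0..63]:
basic, second, third = neighbor_ids(36, 10, 6)
assert basic == [35, 37, 18, 50]
assert second == [9, 25, 41, 57]
assert third == [4, 12]
assert ps_common(0b110100, 0b001101, 6) == 4   # prefix 1101 == suffix 1101
```

The lookup routine below greedily increases `ps_common` between the current node's identifier and k; each hop through the second or third group shifts several bits of k in at once, which is where the O(log n / E(log c_x)) path length of the analysis comes from.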
The routine uses the high-order bits of the node identifier to match the low-order bits of k, which is different from Koorde's routine and is critical for our multicast routine, to be discussed shortly.

x.LOOKUP(k)
1.  if k ∈ (predecessor(x), x] then
2.      return the address of x
3.  if k ∈ (x, successor(x)] then
4.      return the address of successor(x)
5.  let m₁ be the number of ps-common bits shared by x and k
6.  find the neighbor y that shares the largest number m₂ of ps-common bits with k
7.  if m₁ < m₂ then
8.      return y.LOOKUP(k)
9.  else
10.     if |k − predecessor(x)| < |k − successor(x)| then
11.         return predecessor(x).LOOKUP(k)
12.     else
13.         return successor(x).LOOKUP(k)

Koorde uses Chord's protocols with a new LOOKUP routine for node join/departure, and so does CAM-Koorde.

4.4.3 Multicast Routine

When a node receives a multicast message, it forwards the message to all neighbors except those that have received or are receiving the message. Because neighbor connections are bidirectional, it is easy for a node to perform the checking through a short control packet. The overhead is negligible when the message is large, such as a video file. Note that a node does not have to wait for the entire message to arrive before forwarding it to neighbors. The forwarding is done on a per-packet basis, but the checking is performed only for the first packet of a message, which carries the message header. The pseudo code of the MULTICAST routine is shown below.

x.MULTICAST(msg)
1.  for each neighbor y do
2.      if y has not received and is not receiving msg then
3.          y.MULTICAST(msg)

4.4.4 Analysis

Lemma 4.7. Let c_x, for all nodes x, be independent random variables of certain distribution. The expected length of the shortest path between two nodes in CAM-Koorde is O(log n / E(log c_x)).

Proof. Consider two arbitrary nodes x₀ and y. Let y′ be an O(log n)-bit prefix of y. We show there exists a path of length O(log n / E(log c_x)) that reaches a node x_m with y′ also as its prefix.
We construct a physical path (x₀, x₁, x₂, ..., x_{m−1}, x_m) as follows. Node x₀ has a basic-group or second-group neighbor x₁, where x₁ is derived by shifting max{1, ⌊log(c_{x₀} − 4)⌋} low-order bits of y′ into x₀ from the left.⁴ The bits of y′ that have been used in shifting are called used bits. Similarly, x₁ has a second-group neighbor x₂, where x₂ is derived by shifting max{1, ⌊log(c_{x₁} − 4)⌋} low-order unused bits of y′ into x₁ ... Finally, x_{m−1} has a second-group neighbor x_m, where x_m is derived by shifting the remaining max{1, ⌊log(c_{x_{m−1}} − 4)⌋} low-order unused bits of y′ into x_{m−1}. The length of the path (x₀, x₁, x₂, ..., x_m) is m. The total number of used bits of y′ is Σ_{i=0}^{m−1} max{1, ⌊log(c_{x_i} − 4)⌋},

⁴ If c_{x₀} < 6, we can pick x₁ from the basic group, which shifts one bit of y′ into x₀; if c_{x₀} ≥ 6, we can pick x₁ from the second group, which shifts ⌊log(c_{x₀} − 4)⌋ bits of y′ into x₀.

which is O(log n). Let c_x be a random variable of the same distribution as c_{x_i}, ∀i ∈ [0..m−1].

Σ_{i=0}^{m−1} max{1, ⌊log(c_{x_i} − 4)⌋} = O(log n)
⇒ Σ_{i=0}^{m−1} Θ(log c_{x_i}) = O(log n)
⇒ m·E(log c_x) = O(log n)
⇒ m = O(log n / E(log c_x))

Next we construct an identifier list (x₀, x′₁, x′₂, ..., x′_{m−1}, x′_m) as follows. x′₁ is derived by shifting max{1, ⌊log(c_{x₀} − 4)⌋} low-order bits of y′ into x₀ from the left. x′₂ is derived by shifting max{1, ⌊log(c_{x₁} − 4)⌋} low-order unused bits of y′ into x′₁ ... Finally, x′_m is derived by shifting the remaining max{1, ⌊log(c_{x_{m−1}} − 4)⌋} low-order unused bits of y′ into x′_{m−1}. y′ is an O(log n)-bit prefix of both x′_m and y. The distance between x′_m and y on the identifier ring, |x′_m − y|, is O(N/n). Note that N/n is the average distance between any two adjacent nodes on the ring. If we can show that E(|x_m − x′_m|) < N/n, then E(|x_m − y|) ≤ E(|x_m − x′_m| + |x′_m − y|) = E(|x_m − x′_m|) + E(|x′_m − y|) = O(N/n), which means that the expected number of hops from x_m to y is O(1). ∀i ∈ [1..m], define a random variable Δ_i = |x̃_i − x_i|, where x̃_i is the identifier obtained by shifting the bits of y′ into x_{i−1}, so that x_i is the node responsible for x̃_i.
Because x_i can be anywhere in (predecessor(x̃_i), x̃_i], we have

E(Δ_i) = (1/2)·E(|x̃_i − predecessor(x̃_i)|) = N/(2n)    (4–9)

where N/n is the expected distance between two adjacent nodes on the identifier ring. x̃_i and x′_i are derived from x_{i−1} and x′_{i−1}, respectively, by shifting the same max{1, ⌊log(c_{x_{i−1}} − 4)⌋} bits of y′ in from the left. Therefore,

|x̃_i − x′_i| = |x_{i−1} − x′_{i−1}| / 2^{max{1, ⌊log(c_{x_{i−1}} − 4)⌋}}

It follows that, ∀i ∈ [1..m],

|x_i − x′_i| ≤ Δ_i + |x_{i−1} − x′_{i−1}| / 2^{max{1, ⌊log(c_{x_{i−1}} − 4)⌋}}

By induction, we have

|x_m − x′_m| ≤ Σ_{i=1}^{m} Δ_i · Π_{j=i}^{m−1} ( 1 / 2^{max{1, ⌊log(c_{x_j} − 4)⌋}} )

Because c_x ≥ 4 for any node x in CAM-Koorde, max{1, ⌊log(c_x − 4)⌋} ≥ 1. Hence,

E(|x_m − x′_m|) ≤ Σ_{i=1}^{m} E(Δ_i)·(1/2)^{m−i} = (N/(2n))·Σ_{i=0}^{m−1} (1/2)^i < N/n

Consequently, the expected number of hops from x_m to x′_m and then to y is O(1). The expected length of the entire path from x₀ to y is O(m) = O(log n / E(log c_x)). □

Theorem 4.8. Let c_x, for all nodes x, be independent random variables of certain distribution. The expected length of a multicast path in CAM-Koorde from a source node to a member node is O(log n / E(log c_x)).

Proof. According to the MULTICAST routine, a multicast packet is delivered in CAM-Koorde by broadcast, which follows the shortest paths to the member nodes. The expected length of a multicast path from a source node to a member node is O(log n / E(log c_x)) by Lemma 4.7. □

Theorem 4.9. Suppose the node capacity c_x follows a uniform distribution and E(c_x) = c. The expected length of a multicast path in CAM-Koorde from a source node to a member node is O(log n / log c).

Proof. Suppose the range of c_x is [t₁..t₂] with E(c_x) = c. We perform big-O reduction as follows.

E(log c_x) = (1/(t₂ − t₁ + 1)) Σ_{c_x = t₁}^{t₂} log c_x
 = Θ( (1/(t₂ − t₁ + 1)) ∫_{t₁}^{t₂} log c_x dc_x )
 = Θ( ((t₂ log t₂ − t₂) − (t₁ log t₁ − t₁)) / (t₂ − t₁ + 1) )
 = Θ(log c)    because t₂ ≤ 2c and t₁ ≤ c

By Theorem 4.8, O(log n / E(log c_x)) = O(log n / log c). □

4.5 Discussions

4.5.1 Group Members with Very Small Upload Bandwidth

A node x with very small upload bandwidth should only be a leaf in the multicast trees unless it is itself the data source.
In order to make sure that the MULTICAST routine does not select x as an intermediate node in any multicast tree, x must not be in the CAM-Chord (or CAM-Koorde) overlay network. Instead, it joins as an external member. x asks a node y known to be in CAM-Chord (or CAM-Koorde) to perform LOOKUP(k) for an arbitrary identifier k, which returns a random node z in the overlay network. x then attempts to join z as an external member. If z cannot support x, z forwards x to successor(z). If z admits x as an external member, z will forward the received multicast messages to x and x will multicast its messages via z. If z leaves the group, x must rejoin via another node in CAM-Chord (or CAM-Koorde).

4.5.2 Proximity and Geography

The overlay connections between neighbors may have very different delays. Two neighbors may be separated by transcontinental links, or they may be on the same LAN. There exist some approaches to cope with geography, for example, Proximity Neighbor Selection and Geographic Layout. With Proximity Neighbor Selection, nodes are given some freedom in choosing neighbors based on other criteria (e.g., latencies) in addition to the arithmetic relations between their identifiers. With Geographic Layout, node identifiers are chosen in a geographically informed manner. The main idea is to make geographically close-by nodes form clusters in the overlay. Readers are referred to [57, 59] for details. As extensions of the existing P2P networks, CAMs can naturally inherit most of those features without much additional effort. For example, instead of choosing the ith neighbor to be (x + 2^i), a proximity optimization of Chord allows the ith neighbor to be any node within the range of [(x + 2^i), (x + 2^{i+1})), which does not affect the complexities [57]. This optimization can also be applied to CAM-Chord (which is an extension of Chord) without affecting the complexities. A node x can choose any node whose identifier belongs to the segment [x + j·c_x^i, x + (j + 1)·c_x^i) as the neighbor x_{i,j}.
Given this freedom, some heuristics, such as smallest delay first, may be used to choose neighbors to promote proximity clustering. Specifically, a node x can progressively scan the nodes in the allowed segment for neighbor x_{i,j}, for example, by following the successor link to probe the next node in the segment after every 100k data bits sent by x, which trivializes the probing overhead. x then uses the nearest node it has found so far as its neighbor.

4.6 Simulation

Throughput and latency are two major performance metrics for a multicast application. We evaluate the performance of CAMs from these two aspects. We simulate multicast algorithms on top of CAM-Chord, Chord, CAM-Koorde, and Koorde, respectively. The identifier space is [0, 2^19). If not specified otherwise, the default number of nodes in an overlay network is 100,000, and the node capacities are taken from [4..10] with uniform probability. The upload bandwidths of the nodes are randomly distributed in a default range of [400, 1000] kbps. It should be noted that the value ranges may go far beyond the default ones in specific simulations. In our simulations, c_x = ⌊B_x/p⌋, where B_x is the node's upload bandwidth and p is a system parameter of CAMs, specifying the desired bandwidth per link in the multicast tree. By varying the value of p, we can construct CAMs with different average node capacities, which also mean different average numbers of children per non-leaf node and consequently different tree depths (latency). If the average node capacity c is not the default value of 7, the node capacities are taken uniformly from [4, 2c − 4].

4.6.1 Throughput

We compare the sustainable throughput of multicast systems based on CAM-Chord, Chord, CAM-Koorde, and Koorde. Throughput is defined as the rate at which data can be continuously delivered from a source node to all other nodes.
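This definition can be made concrete with a toy computation of our own (not the dissertation's simulator): a multicast tree sustains the rate of its weakest link, and capacity-aware fan-out keeps every link above the target p. Here p = 100 kbps is our assumption, chosen because it maps the default bandwidth range [400, 1000] kbps onto the default capacity range [4..10].

```python
# Toy bottleneck computation: with a uniform fan-out of 7 (the default
# average capacity), the slowest node caps the sustainable rate; with
# capacity-aware fan-out c_x = floor(B_x / p), every outgoing link carries
# at least p kbps.  All parameter values here are illustrative.
import random

rng = random.Random(7)
p = 100                                              # target kbps per tree link
bw = [rng.uniform(400, 1000) for _ in range(1000)]   # upload bandwidths, kbps

oblivious = min(b / 7 for b in bw)          # same fan-out for every non-leaf
aware = min(b / int(b // p) for b in bw)    # fan-out matched to bandwidth

assert aware >= p > oblivious   # capacity-awareness lifts the bottleneck rate
```

The capacity-oblivious bottleneck is roughly min(B_x)/7 ≈ 57 kbps, while the capacity-aware one never drops below p; this is the mechanism behind the throughput gap reported below.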
Due to limited buffer space at each node, the sustainable multicast throughput is decided by the link with the least allocated bandwidth in the multicast tree. CAM-Chord and CAM-Koorde produce much larger throughput because a node's capacity c_x (which is its number of children in the multicast tree) is adjustable based on the node's upload bandwidth. The primary advantage of CAMs over Chord/Koorde is their ability to adapt the overlay topology according to host heterogeneity. Figure 4-3 compares the throughput of CAM-Chord, Chord, CAM-Koorde, and Koorde with respect to the average number of children per non-leaf node in the multicast tree. CAMs perform much better. Their throughput improvement over Chord and Koorde is 70% when the nodes' upload bandwidths are chosen from the rather narrow default range of [400, 1000] kbps. The larger the upload-bandwidth range, the more the throughput improvement, as demonstrated by Figure 4-4. In general, let [a, b] be the range of upload bandwidth. The upper bound b of the range is shown on the horizontal axis of Figure 4-4, while the lower bound a is fixed at 400 kbps.

Figure 4-3. Multicast throughput with respect to average number of children per non-leaf node
Figure 4-4. Throughput improvement ratio with respect to upload bandwidth range
Figure 4-5. Multicast throughput with respect to size of the multicast group
Figure 4-6. Throughput vs. average path length
The figure shows that the throughput improvement by CAMs increases when the upload-bandwidth range is larger, representing a greater degree of host heterogeneity. The simulation data also indicate that the throughput ratio of CAM-Chord (CAM-Koorde) over Chord (Koorde) is roughly proportional to (a + b)/(2a). Figure 4-5 shows the multicast throughput with respect to the size of the multicast group. According to the simulation results, the throughput is largely insensitive to the group size.

4.6.2 Throughput vs. Latency

We measure multicast latency by the average length of multicast paths. Latency is determined by both the number of hops and the hop delay. In CAM-Chord and CAM-Koorde, the overlay links are randomly formed among the nodes due to the use of DHT. The latency of an overlay path is statistically proportional to the number of hops. That is why we used the number of hops to characterize the latency performance. In fact, the measurement in terms of number of hops carries information beyond latency. It is also an indication of how many times a message has to be relayed, which is a resource consumption issue. With the proximity neighbor selection in Section 4.5.2, the latency is no longer proportional to the number of hops. We add a simulation in Section 4.6.4 to study this case, where the actual delays are measured. Both throughput and latency are functions of average node capacity. With a larger average node capacity (achieved by a smaller value of p), the throughput decreases due to more children per non-leaf node, and the latency also decreases due to smaller tree depth. There exists a tradeoff between throughput and latency, which is depicted by Figure 4-6. Higher throughput can be achieved at the cost of longer routing paths. Given the same upload bandwidth distribution, the system's performance can be tuned by adjusting p.
The figure also shows that, for relatively small throughput (less than 46 kbps in the figure), namely, for large node capacities, CAM-Koorde slightly outperforms CAM-Chord; for relatively large throughput (greater than 46 kbps in the figure), namely, for small node capacities, CAM-Chord outperforms CAM-Koorde, which will be further explained in Section 4.6.4.

4.6.3 Path Length Distribution

Figure 4-7 and Figure 4-8 present the statistical distribution of multicast path lengths, i.e., the number of nodes that can be reached by a multicast tree in a certain number of hops. Each curve represents a simulation with node capacities chosen from a different range. When the capacity range increases, the distribution curve moves to the left of the plot due to shorter multicast paths. The improvement in the distribution can be measured by how much the curve is shifted to the left. At the beginning, a small increase in the capacity range causes significant improvement in the distribution. After the capacity range reaches a certain level ([4..10] in our simulations), the improvement slows down drastically. Each curve has a single peak, and the right side of the peak quickly decreases to zero. It means that the vast majority of nodes are reached within a small range of path lengths.

Figure 4-7. Path length distribution in CAM-Chord. Legend "[x..y]" means the node capacities are uniformly distributed in the range of [x..y].
Figure 4-8. Path length distribution in CAM-Koorde. Legend "[x..y]" means the node capacities are uniformly distributed in the range of [x..y].
Figure 4-9. Average path length with respect to average node capacity
We did not observe any multicast path whose length was significantly larger than the average path length.

4.6.4 Average Path Length

Figure 4-9 shows the average path length with respect to the average node capacity. We also plot an artificial line, 1.5·ln(100,000)/ln(c), which upper-bounds the average path lengths of CAM-Chord and CAM-Koorde, verifying Theorem 4.6 and Theorem 4.9. From the figure, when the average node capacity is less than 10, the average path length of CAM-Chord is smaller than that of CAM-Koorde; when the average node capacity is greater than 12, the average path length of CAM-Koorde is smaller than that of CAM-Chord. A smaller average path length means more balanced multicast trees. For small node capacities, CAM-Chord multicast trees are more balanced than CAM-Koorde multicast trees, and vice versa. The reasons are explained as follows. On one hand, a non-leaf CAM-Koorde node x may have fewer children than c_x because some neighbors may have already received the multicast message from a different path. This tends to make the depth of a CAM-Koorde multicast tree larger than that of a CAM-Chord tree. On the other hand, a CAM-Chord node x may split (x, k] into uneven subsegments (i.e., subtrees) with a ratio up to c_x (Lines 6–15 in Section 4.3.4). This ratio of unevenness becomes small when the node capacities are small. Combining these two factors, CAM-Chord creates better balanced trees for small node capacities; CAM-Koorde creates better balanced trees for large node capacities.

Figure 4-10. Proximity optimization

Next we use CAM-Chord as an example (Section 4.5.2) to study the impact of proximity optimization. In [60], Mukherjee found that the end-to-end packet delay on the Internet can be modeled by a shifted Gamma distribution, which is a long-tail distribution.
The shape parameter varies from approximately 1.0 during low loads to 6.0 during high loads on the backbone. We set the shape parameter to be 5.0 and the average packet delay to be 50 ms. Figure 4-10 compares the average latency of delivering a multicast message from a source to a receiver in CAM-Chord with or without the proximity optimization. The simulation is performed for different average node capacities, and the impact of proximity optimization is significant. In most cases, it reduces the latency by more than half.

Figure 4-11. Latency ratio with respect to capacity ratio

4.6.5 Impact of Dynamic Capacity Variation

In a real environment, the upload bandwidth of a node may fluctuate. If we always use the same implicit multicast trees, then the dynamic variation of node capacities will cause variation in average throughput but not in average latency. CAMs can also be easily modified to ensure throughput but allow latency variation. If a node's capacity decreases, it simply forwards messages to a smaller number of neighbors, which automatically reshapes the implicit tree. If a node's capacity increases for a long time, the node can take advantage of the improved capacity by increasing the number of neighbors. We define the capacity ratio as the actual capacity of a node x at the real time divided by the claimed capacity c_x that is used to build the topology of CAMs. We define the latency ratio as the actual delay at a given capacity ratio divided by the baseline delay when the capacity ratio is 100%, namely, no dynamic capacity variation. Apparently, the latency ratio is a function of the capacity ratio. Figure 4-11 shows the relation between the latency ratio and the capacity ratio. When the capacity ratio is smaller, which means the nodes cannot support as many children as they have claimed, the nodes will forward the received messages to a