DATA STRUCTURES FOR DYNAMIC ROUTER TABLE

By
HAIBIN LU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2003

Copyright 2003 by Haibin Lu

To my family.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Dr. Sartaj Sahni, for his mentoring and support throughout my Ph.D. study. My research career would not have been possible without his guidance. This work was supported, in part, by the National Science Foundation under grant CCR-9912395.

I am very grateful to Dr. Sanjay Ranka, Dr. Randy Chow, Dr. Richard Newman, and Dr. Michael Fang for serving on my Ph.D. supervisory committee and providing helpful suggestions.

I want to dedicate this dissertation to my parents. Without their encouragement and hard work, I could not have thought of pursuing a doctoral degree.

Finally, I would like to give my special thanks to my wife, Lan, whose caring and love enabled me to complete this work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT

CHAPTER

1 INTRODUCTION AND RELATED WORK
  1.1 Introduction
    1.1.1 Static Router Table
    1.1.2 Dynamic Router Table
  1.2 Related Work
    1.2.1 Trie
    1.2.2 Sets of Equal-Length Prefixes
    1.2.3 End-Point Array
    1.2.4 Multiway Range Tree
    1.2.5 O(log n) Dynamic Solutions
    1.2.6 Highest-Priority Prefix Table
    1.2.7 TCAM
    1.2.8 Others
  1.3 Contribution

2 O(log n) DYNAMIC ROUTER TABLE FOR PREFIXES AND RANGES
  2.1 Preliminaries
    2.1.1 Prefixes and Longest-Prefix Matching
    2.1.2 Ranges and Projections
    2.1.3 Most-Specific-Range Routing and Conflict-Free Ranges
    2.1.4 Normalized Ranges
    2.1.5 Priority Search Trees and Ranges
  2.2 Prefixes
  2.3 Nonintersecting Ranges
  2.4 Conflict-Free Ranges
    2.4.1 Determine msr(d)
    2.4.2 Insert a Range
    2.4.3 Delete a Range
    2.4.4 Computing maxP and minP
    2.4.5 A Simple Algorithm to Compute maxP
    2.4.6 An Efficient Algorithm to Compute maxP
    2.4.7 Wrapping Up Insertion of a Range
    2.4.8 Wrapping Up Deletion of a Range
    2.4.9 Complexity
  2.5 Experimental Results
    2.5.1 Prefixes
    2.5.2 Nonintersecting Ranges
    2.5.3 Conflict-Free Ranges
  2.6 Conclusion

3 DYNAMIC IP ROUTER TABLES USING HIGHEST-PRIORITY MATCHING
  3.1 Preliminaries
  3.2 Nonintersecting Highest-Priority Rule-Tables (NHRTs): BOB
    3.2.1 The Data Structure
    3.2.2 Search for hpr(d)
    3.2.3 Insert a Range
    3.2.4 Red-Black-Tree Rotations
    3.2.5 Delete a Range
    3.2.6 Expected Complexity of BOB
  3.3 Highest-Priority Prefix-Tables (HPPTs): PBOB
    3.3.1 The Data Structure
    3.3.2 Lookup
    3.3.3 Insertion and Deletion
  3.4 Longest-Matching Prefix-Tables (LMPTs): LMPBOB
    3.4.1 The Data Structure
    3.4.2 Lookup
    3.4.3 Insertion and Deletion
  3.5 Implementation Details and Memory Requirement
    3.5.1 Memory Management
    3.5.2 BOB
    3.5.3 PBOB
    3.5.4 LMPBOB
  3.6 Experimental Results
    3.6.1 Test Data and Memory Requirement
    3.6.2 Preliminary Timing Experiments
    3.6.3 Run-Time Experiments
  3.7 Conclusion

4 A B-TREE DYNAMIC ROUTER-TABLE DESIGN
  4.1 Longest-Matching Prefix-Tables: LMPT
    4.1.1 The Prefix In B-Tree Structure: PIBT
    4.1.2 Finding the Longest Matching Prefix
    4.1.3 Inserting a Prefix
    4.1.4 Inserting an Endpoint
    4.1.5 Update Interval Vectors
    4.1.6 Deleting a Prefix
    4.1.7 Deleting from a Leaf Node
    4.1.8 Borrow from a Sibling
    4.1.9 Merging Two Adjacent Siblings
    4.1.10 Deleting from a Nonleaf Node
    4.1.11 Cache-Miss Analysis
  4.2 Highest-Priority Range-Tables
    4.2.1 Preliminaries
    4.2.2 The Range In B-Tree Structure: RIBT
    4.2.3 RIBT Operations
  4.3 Experimental Results
  4.4 Conclusion

5 CONCLUSION AND FUTURE WORK
  5.1 Conclusion
  5.2 Future Work

REFERENCES
BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DATA STRUCTURES FOR DYNAMIC ROUTER TABLE

By
Haibin Lu
August 2003

Chair: Sartaj Sahni
Major Department: Computer and Information Science and Engineering

Internet routers use router tables to classify incoming packets based on the information carried in the packet headers. Packet classification is one of the network bottlenecks, especially when a high update rate becomes necessary. Much of the research in the router-table area has focused on static prefix tables, where updates usually require rebuilding the whole router table.
Some router-table designs rely on the relatively short IPv4 addresses to achieve the desired efficiency. However, these designs scale poorly with the prefix length. We propose several schemes to represent one-dimensional dynamic range tables, that is, tables into/from which rules are inserted/deleted concurrently with packet classification and in which filters are specified as ranges. Our schemes allow real-time updates and at the same time provide efficient lookup. The lookup and update complexities of our schemes are logarithmic functions of the number of filters. The first scheme, PST, which is based on priority search trees, uses the most-specific-rule tie breaker. The second scheme is called BOB (Binary search tree On Binary search tree). This scheme uses the highest-priority tie breaker. To exploit wide cache lines and reduce the tree height, a third scheme is developed in which the top-level tree is a B-tree. This scheme also uses the highest-priority tie breaker. All three schemes are suitable for prefix filters as well as for range filters in which no two filters have intersecting ranges. In addition, the PST can handle a conflict-free range set.

CHAPTER 1
INTRODUCTION AND RELATED WORK

1.1 Introduction

Today's Internet consists of thousands of packet networks interconnected by routers. When a host sends a packet into the Internet, the routers relay the packet towards its final destination. The routers exchange routing information with each other and use the information gathered to calculate the paths to all reachable destinations. Each packet is treated independently and forwarded to a next router based on its destination address. The data structure a router uses to find the next hop is called the router table. Each entry in the router table is a rule of the form (address prefix, next hop). Table 1-1 shows a set of five rules. We use W to denote the maximum possible length of a prefix. In IPv4, W = 32 and in IPv6, W = 128.
In Table 1-1, W is 5. The prefix P1, which matches all destination addresses, is called the default prefix. The prefix P3 matches the destination addresses between 16 and 19. If the address prefix of a rule matches the destination address carried by an incoming packet, the next hop of this rule is used to forward the packet.

Address prefixes were introduced by CIDR (Classless Inter-Domain Routing) to deal with address depletion and router-table explosion. A result of CIDR's address aggregation is that several rules may have prefixes that match the destination address. For example, the rules P1, P3, and P4 in Table 1-1 match the destination address 19. In this case, a tie breaker is needed to select one of the matching rules. Most specific matching is usually used; that is, the longest prefix matching the destination address is the winner. For our example router table, P4 is the winner for destination address 19.

The other two popular tie breakers are first matching and highest priority matching. For the first matching tie breaker, the rule table is assumed to be a linear list of rules, with the rules indexed 1 through n for an n-rule table. The first rule that matches the incoming packet is used. With the rules of Table 1-1 indexed as shown, rule R1 is selected for every incoming packet, since it matches all destination addresses. To give other rules a chance to win, we must index the rules carefully, and the default prefix should be the last rule. In highest priority matching, each rule is assigned a priority, and the rule with the highest priority is selected from those matching the incoming packet (see footnote 1). Notice that the first matching tie breaker is a special case of the highest priority matching tie breaker (simply assign each rule a priority equal to the negative of its index in the linear list).
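The three tie breakers described above can be illustrated on the five rules of Table 1-1. The following is a minimal sketch, not an efficient router-table structure: it scans all rules linearly, and the names (RULES, matching_rules, and so on) are hypothetical and chosen only for this illustration.

```python
W = 5  # address width used in Table 1-1

# (rule name, prefix bits without the trailing '*', next hop); "" is the default prefix
RULES = [
    ("R1", "", "N1"),
    ("R2", "0101", "N2"),
    ("R3", "100", "N3"),
    ("R4", "1001", "N4"),
    ("R5", "10111", "N5"),
]


def matching_rules(addr):
    """Return the rules whose prefix matches addr, in table order."""
    bits = format(addr, "0{}b".format(W))
    return [rule for rule in RULES if bits.startswith(rule[1])]


def longest_prefix_match(addr):
    """Most specific matching: the longest matching prefix wins."""
    return max(matching_rules(addr), key=lambda rule: len(rule[1]))


def first_match(addr):
    """First matching: the matching rule with the smallest index wins."""
    return matching_rules(addr)[0]


def highest_priority_match(addr, priority):
    """Highest priority matching: priority maps rule name -> number."""
    return max(matching_rules(addr), key=lambda rule: priority[rule[0]])


# First matching expressed as highest priority matching:
# each rule gets a priority equal to the negative of its (1-based) index.
INDEX_PRIORITY = {rule[0]: -(i + 1) for i, rule in enumerate(RULES)}
```

For destination address 19 the matching rules are R1, R3, and R4; longest_prefix_match(19) selects R4, whereas first_match(19) and highest_priority_match(19, INDEX_PRIORITY) both select R1, because the default prefix is listed first.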
Table 1-1: A router table with five rules (W = 5)

Rule Name   Prefix Name   Prefix   Next Hop   Range Start   Range Finish
R1          P1            *        N1         0             31
R2          P2            0101*    N2         10            11
R3          P3            100*     N3         16            19
R4          P4            1001*    N4         18            19
R5          P5            10111    N5         23            23

The query based on the destination address is usually called address lookup or packet forwarding. In general, other fields such as the source address and port numbers may also be used, and the router table consists of rules of the form (F, A), where F is a filter and A is an action. The action component of a rule specifies what is to be done when a packet that satisfies the rule filter is received. Sample actions are drop the packet, forward the packet along a certain output link, and reserve a specified amount of bandwidth. Tie breakers similar to those mentioned earlier are used to select a rule from the set of rules that match the incoming packet. We call this problem packet classification.

Footnote 1: We may assume either that all priorities are distinct or that selection among rules that have the same priority may be done in an arbitrary fashion.

1.1.1 Static Router Table

In a static rule table, the rule set does not vary in time. For these tables, we are concerned primarily with the following metrics:

1. Time required to process an incoming packet. This is the time required to search the rule table for the rule to use. We refer to this operation as a lookup.
2. Preprocessing time. This is the time to create the rule-table data structure.
3. Storage requirement. That is, how much memory is required by the rule-table data structure?

To handle updates, static schemes usually use two copies, working and shadow, of the router table. Lookups are done using the working table. Updates are performed in the background (either in real time on the shadow table or by watching updates and reconstructing an updated shadow at suitable intervals); periodically, the shadow replaces the working table, and the caches of the working table are flushed.
In this mode of update operation, many packets may be misclassified, because the working table isn't immediately updated. The number of misclassified packets depends on the periodicity with which the working table can be replaced by an updated shadow. Further, additional memory is required for the shadow table and for periodic reconstruction of the working table. A short preprocessing time is therefore important for reducing the number of misclassified packets.

1.1.2 Dynamic Router Table

In practice, rule tables are seldom truly static. At best, rules may be added to or deleted from the rule table infrequently. Typically, in a "static" rule table, inserts/deletes are batched and the router-table data structure is reconstructed as needed. In a dynamic rule table, rules are added/deleted with some frequency. For such tables, inserts/deletes are not batched. Rather, they are performed in real time.

We believe that dynamic structures for router tables are becoming a necessity. First, updates occur frequently in the backbone area. Labovitz et al. [1] found that the update rate could reach as high as 1000 per second. These updates stem from route failure, route repair, and route failover. With the number of autonomous systems continuously increasing, it is reasonable to expect a rising update rate. The router table needs to be updated in order to reflect the route changes. Second, fast processing of updates is preferred because, during batching and reconstruction, end-to-end delay increases, packet loss rises dramatically, and part of the network may experience connectivity loss. Labovitz et al. [2] observed dramatically increased packet loss and end-to-end latency during BGP routing changes. Batching and expensive reconstruction make things worse.
While BGP takes time to converge, route-repair events usually do not cause multiple announcements, and the latency for the router table to become stable after these events should depend only on the network delay and the router processing delay along the path [2]. In addition, as the BGP convergence time is reduced, the processing delay may dominate. Pei et al. [3] reduce the convergence time from 30.3 seconds to 0.3 seconds for a failure withdrawal in the testbed by applying two consistency assertions to BGP. Macian et al. [4] emphasize the importance of supporting a high update rate. Dynamic router tables that permit high-speed inserts and deletes are essential in QoS and VAS applications [4]. For example, edge routers that do stateful filtering require high-speed updates [5].

For dynamic router tables, we are concerned additionally with the time required to insert/delete a rule. For a dynamic rule table, the initial rule-table data structure is constructed by starting with an empty data structure and then inserting the initial set of rules into the data structure one by one. So, typically, in the case of dynamic tables, the preprocessing metric mentioned above is very closely related to the insert time. For dynamic router tables, the following metrics are measured to compare performance:

1. Lookup time.
2. Insertion time. This is the time required to insert a new rule into the rule table.
3. Deletion time. This is the time required to delete a rule from the rule table.
4. Storage requirement.

Note that there is only a working table for dynamic schemes, and updates are made directly to the working table in real time. In this mode of update, no packet is improperly classified. However, packet classification/forwarding may be delayed until a preceding update completes. To minimize this delay, it is necessary that updates be done as fast as possible.

Another important metric for both static and dynamic router tables is scalability to IPv6.
IPv6, the next generation of IP, uses 128-bit addresses (W = 128). Although some of the schemes in Section 1.2 work well for IPv4 (W = 32), they scale poorly with the prefix length.

1.2 Related Work

Data structures for rule tables in which each filter is an address prefix and the rule priority is the length of this prefix (see footnote 2) have been intensely researched in recent years. We refer to rule tables of this type as longest-matching prefix-tables (LMPTs). We refer to rule tables in which the filters are ranges and in which the highest-priority matching filter is used as highest-priority range-tables (HPRTs). When the filters of no two rules of an HPRT intersect, the HPRT is a nonintersecting HPRT (NHPRT). Although every LMPT is also an NHPRT, an NHPRT may not be an LMPT. Ruiz-Sanchez et al. [6] review data structures for static LMPTs, and Sahni et al. [7] review data structures for both static and dynamic LMPTs.

1.2.1 Trie

Several trie-based data structures for LMPTs have been proposed [8, 9, 10, 11, 12, 13, 14]. Structures such as that of Doeringer et al. [10] use the path-compression technique, so the memory requirement is O(n). The search is guided by the input key and, being biased toward a successful search, inspects only the bit position stored at each internal node. When the search reaches a leaf node without succeeding, the downward path may be backtracked to find the longest matching prefix. Hence the search can be carried out in O(W) time. The update operations, insert and delete, are natural in a trie structure and can also be performed in O(W) time. The number of memory accesses during these operations is O(W). For IPv6 (W = 128), O(W) memory accesses are quite expensive. Moreover, path compression reduces the height of the trie only if the prefixes are scattered sparsely inside the trie.
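The O(W) bound for trie lookup and insertion can be seen in a minimal sketch of a plain binary trie that branches on one bit per level. This is an assumption-laden illustration, not the path-compressed structure of Doeringer et al. [10]: there is no path compression, and instead of backtracking from a leaf the lookup simply remembers the last next hop seen on the way down. The class and method names are hypothetical.

```python
class TrieNode:
    """One node of a one-bit-per-level binary trie."""
    __slots__ = ("child", "next_hop")

    def __init__(self):
        self.child = [None, None]  # child[0] follows bit 0, child[1] follows bit 1
        self.next_hop = None       # set only if a prefix ends at this node


class BinaryTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix_bits, next_hop):
        """Insert a prefix given as a bit string without '*'; O(W) node visits."""
        node = self.root
        for b in prefix_bits:
            i = int(b)
            if node.child[i] is None:
                node.child[i] = TrieNode()
            node = node.child[i]
        node.next_hop = next_hop

    def lookup(self, addr_bits):
        """Return the next hop of the longest matching prefix, or None."""
        node = self.root
        best = node.next_hop               # default prefix, if present
        for b in addr_bits:
            node = node.child[int(b)]
            if node is None:
                break                      # fell off the trie: keep the best so far
            if node.next_hop is not None:
                best = node.next_hop       # a longer matching prefix was found
        return best


# Usage with the prefixes of Table 1-1 (W = 5).
trie = BinaryTrie()
for bits, hop in [("", "N1"), ("0101", "N2"), ("100", "N3"),
                  ("1001", "N4"), ("10111", "N5")]:
    trie.insert(bits, hop)
```

Each lookup or insert visits at most W + 1 nodes, one per address bit plus the root, which is the source of the O(W) memory accesses discussed above.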
When the number of prefixes increases, many branch nodes are needed and path compression has few nodes left to compress. Ruiz-Sanchez et al. [6] observe that the height of the BSD version of the path-compressed trie is 26 for an IPv4 router table with 47,113 prefixes, whereas the height of a simple binary trie is only 30.

Footnote 2: For example, the filter 10* matches all destination addresses that begin with the bit sequence 10; the length of this prefix is 2.

To reduce the trie height, Gupta et al. [15] use the DIR-24-8 scheme, which fully expands the binary trie at depth 24; i.e., every prefix of length at most 24 is expanded into as many 24-bit prefixes as needed, and a table with 2^24 entries is used to store these expanded prefixes. Prefixes longer than 24 bits are stored in a second table. The correspondence is established by storing pointers in the first table that point to the proper entries in the second table. The first table has 2^24 entries, and each entry is 16 bits (32M bytes in total). The first bit of each entry indicates whether the next 15 bits store the next hop or a pointer into the second table. Using more than 32M bytes of memory, the scheme can perform a search in at most two memory accesses. However, it does not scale to IPv6, because expanding to 24 bits already takes too much memory. Gupta et al. [15] also propose alternatives that use less memory but require more memory accesses.

Degermark et al. [9] use a similar prefix-expansion technique at multiple depths. Bitmap compression is deployed to reduce the memory requirement greatly. A router table with 40,000 rules can fit into 160K bytes. In the worst case, the number of memory accesses is nine. Huang et al. [16] fully expand the binary trie at depth 16 and also expand the subtries rooted at the nodes at depth 16 to their own depths. Bitmap compression is also applied to reduce the memory requirement.
The router tables used in the experiment can be compacted into less than 500K bytes. The worst-case number of memory accesses is three. Both schemes [9, 16] depend heavily on the prefix distribution, so it is hard to decide on a proper memory size for the scheme ahead of time. For example, in the extreme case in which all n prefixes in the router table have length 32 and their first 16 bits are distinct (assume n ≤ 2^16), the scheme of [16] needs at least 2^14 n bytes.

Nilsson et al. [11] apply level compression as well as path compression to the binary trie. A binary trie is path-compressed first; level compression is then used to reduce the height of the trie further by substituting the k highest levels of the binary trie with a single degree-2^k node. Although the search complexity of the LC (level-compressed) trie is still O(W), the height of the LC-trie is around 8 for the router tables used in the authors' analysis.

These data structures [9, 11, 15], as well as that of Srinivasan et al. [12], attempt to optimize lookup time through an expensive preprocessing step. While providing very fast lookup capability, they have a prohibitive insert/delete time, so they are suitable only for static router-tables (i.e., tables into/from which no inserts and deletes take place). Sahni et al. [13, 14] provide efficient constructions for fixed-stride and variable-stride multibit tries. The lookup time and memory requirement are optimized through expensive preprocessing. To improve the update speed of fixed-stride multibit tries on pipelined ASIC architectures, Basu et al. [17] describe an algorithm to optimize and balance the memory requirement across the pipeline stages.

1.2.2 Sets of Equal-Length Prefixes

Waldvogel et al. [18] have proposed a scheme that performs a binary search on hash tables organized by prefix length. In order to support binary search, O(log W) markers are generated for each prefix, and the longest matching prefix is precomputed for each marker.
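The role of the markers and of their precomputed longest matching prefixes can be illustrated with a simplified sketch of binary search over prefix lengths. This is not the implementation of Waldvogel et al. [18]: Python dictionaries stand in for the hash tables, the search runs over all lengths 1 through W rather than only the lengths in use, and the best matching prefix of each marker is computed by brute force. All function names are hypothetical.

```python
def build_tables(prefixes, W):
    """prefixes: dict mapping prefix bit strings (no '*') to next hops.

    Returns tables, where tables[L] maps an L-bit string (a real prefix or
    a marker) to the longest real prefix matching that string, or None.
    """
    def best_matching_prefix(bits):
        # Brute force: the longest real prefix that is a prefix of bits.
        for length in range(len(bits), -1, -1):
            if bits[:length] in prefixes:
                return bits[:length]
        return None

    tables = [dict() for _ in range(W + 1)]
    for p in prefixes:
        lo, hi = 1, W
        while lo <= hi:                 # replay the search path towards len(p)
            mid = (lo + hi) // 2
            if mid < len(p):
                marker = p[:mid]        # tells the search to look at longer lengths
                tables[mid].setdefault(marker, best_matching_prefix(marker))
                lo = mid + 1
            elif mid == len(p):
                tables[mid][p] = p      # the real prefix itself
                break
            else:
                hi = mid - 1
    return tables


def lookup(tables, prefixes, addr_bits):
    """Binary search on prefix length; probes O(log W) tables."""
    best = "" if "" in prefixes else None   # the default prefix, if any
    lo, hi = 1, len(tables) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        entry = tables[mid].get(addr_bits[:mid], "miss")
        if entry == "miss":
            hi = mid - 1                # nothing this long: try shorter lengths
        else:
            if entry is not None:
                best = entry            # precomputed best match for this hit
            lo = mid + 1                # a longer match may exist
    return None if best is None else prefixes[best]


# Usage with the prefixes of Table 1-1 (W = 5).
PREFIXES = {"": "N1", "0101": "N2", "100": "N3", "1001": "N4", "10111": "N5"}
TABLES = build_tables(PREFIXES, 5)
```

Address 20 (10100) shows why the precomputation matters: the search hits the marker 101, which exists only because of the prefix 10111, and then fails at length 4. The result N1 comes from the best matching prefix stored with the marker, so no backtracking to shorter lengths is needed.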
This binary search scheme has an expected complexity of O(log W) for lookup. The memory requirement is bounded by O(n log W). With the technique called marker partitioning, introduced in the full version of Waldvogel et al. [18], the scheme has O(α (nW)^(1/α) log W) insert/delete time and an increased search time of O(α + log W), for α > 1.

1.2.3 End-Point Array

An alternative adaptation of binary search to longest-prefix matching is developed in [19]. The distinct end points (start points and finish points) of the ranges defined by the prefixes are stored in ascending order in an array. The end points divide the universe into O(n) basic intervals. LMP(d) is precomputed for each interval as well as for each end point. LMP(d) is found by performing a binary search on this ordered array. A lookup in a table that has n prefixes takes O(log n) time. Because the schemes of [19] use expensive precomputation, they are not suited for dynamic router-tables.

1.2.4 Multiway Range Tree

Suri et al. [20] have proposed a B-tree data structure for dynamic LMPTs. Using their structure, we may find the longest matching prefix, LMP(d), in O(log_m n) time. However, inserts/deletes take O(W log_m n) time. When W bits fit in O(1) words (as is the case for IPv4 and IPv6 prefixes), logical operations on W-bit vectors can be done in O(1) time each. In this case, the scheme of Suri et al. [20] takes O(m log_2 W log n) time for an insertion and O(m log_m n + W) time for a deletion. Assuming that one node fits into O(1) cache lines, the number of memory accesses that occur when the data structure of Suri et al. [20] is used is O(log_m n) per search and O(m log_m n) per update.

1.2.5 O(log n) Dynamic Solutions

Sahni et al. [21, 22] develop data structures, called a collection of red-black trees (CRBT) and an alternative collection of red-black trees (ACRBT), that support the three operations of a dynamic LMPT in O(log n) time each. The number of cache misses is also O(log n). Sahni et al.
[22] show that their ACRBT structure is easily modified to extend the biased-skip-list structure of Ergun et al. [23] so as to obtain a biased-skip-list structure for dynamic LMPTs. Using this modified biased-skip-list structure, lookup, insert, and delete can each be done in O(log n) expected time and with O(log n) expected cache misses. Like the original biased-skip-list structure of Ergun et al. [23], the modified structure of Sahni et al. [22] adapts so as to perform lookups faster for bursty access patterns than for nonbursty patterns. The ACRBT structure may also be adapted to obtain a collection-of-splay-trees structure [22], which performs the three dynamic LMPT operations in O(log n) amortized time and which adapts to provide faster lookups for bursty traffic.

1.2.6 Highest-Priority Prefix Table

When an HPPT (highest-priority prefix-table) is represented as a binary trie [24], each of the three dynamic HPPT operations takes O(W) time and cache misses. Gupta et al. [25] have developed two data structures for dynamic HPRTs: heap on trie (HOT) and binary search tree on trie (BOT). The HOT structure takes O(W) time for a lookup and O(W log n) time for an insert or delete. The BOT structure takes O(W log n) time for a lookup and O(W) time for an insert/delete. The number of cache misses in a HOT and a BOT is asymptotically the same as the time complexity of the corresponding operation.

1.2.7 TCAM

Ternary content-addressable memories, TCAMs, use parallelism to achieve O(1) lookup [26]. Each memory cell of a TCAM may be set to one of three states: 0, 1, and don't care. The prefixes of a router table are stored in a TCAM in descending order of prefix length. Assume that each word of the TCAM has 32 cells. The prefix 10* is stored in a TCAM word as 10??...?, where ? denotes a don't care and there are 30 ?s in the given sequence.
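The storage convention just described, together with the match that a TCAM performs on lookup, can be emulated in software. The sketch below is only an illustration under stated assumptions: a real TCAM compares the address against all words in parallel in hardware, whereas this hypothetical SoftwareTcam class scans the words sequentially, and it re-sorts the whole list on each insert rather than using an O(W) update method.

```python
class SoftwareTcam:
    """Sequential emulation of a TCAM holding prefixes as ternary words."""

    def __init__(self, width):
        self.width = width
        self.entries = []  # (ternary word, next hop), longest prefix first

    def insert(self, prefix_bits, next_hop):
        # Pad the prefix with don't-care cells, e.g. "10" -> "10??...?".
        word = prefix_bits + "?" * (self.width - len(prefix_bits))
        self.entries.append((word, next_hop))
        # Descending prefix length is ascending number of don't-care cells.
        self.entries.sort(key=lambda entry: entry[0].count("?"))

    def lookup(self, addr_bits):
        # Hardware compares every word at once; the first match in the
        # length-sorted order is the longest matching prefix.
        for word, next_hop in self.entries:
            if all(c == "?" or c == a for c, a in zip(word, addr_bits)):
                return next_hop
        return None


# Usage with the prefixes of Table 1-1 (5-cell words).
tcam = SoftwareTcam(5)
for bits, hop in [("", "N1"), ("100", "N3"), ("10111", "N5"),
                  ("0101", "N2"), ("1001", "N4")]:
    tcam.insert(bits, hop)

# The 32-cell example from the text: the prefix 10* becomes "10" plus 30 ?s.
tcam32 = SoftwareTcam(32)
tcam32.insert("10", "X")
```

Because the words are kept in descending order of prefix length, taking the first matching word is equivalent to longest-prefix matching.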
To do a longest-prefix match, the destination address is matched, in parallel, against every TCAM entry. Using a TCAM and a sorted-by-length linear list, the longest matching prefix can be determined in O(1) time. A prefix may be inserted or deleted in O(W) time, where W is the length of the longest prefix [27]. Although TCAMs provide a simple and efficient solution for static and dynamic router tables, this solution requires special hardware, costs more, and uses more power and board space than solutions that employ SDRAMs. TCAMs also have longer latency than SDRAMs. Since a TCAM requires an arbitration module to choose the longest matching prefix, and a more complex arbitration module is needed for a larger router table, the latency of a TCAM increases with router-table size. EZchip Technologies, for example, claim that classifiers can forgo TCAMs in favor of commodity memory solutions [5, 28]. Algorithmic approaches that have lower power consumption and are conservative on board space, at the price of slightly increased search latency, are sought. "System vendors are willing to accept some latency in their searches if it means lowering the power of a line card" [28].

1.2.8 Others

Cheung et al. [29] developed a model for table-driven route lookup and cast the table design problem as an optimization problem within this model. Their model accounts for the memory hierarchy of modern computers, and they optimize average performance rather than worst-case performance. Solutions that involve modifications to the Internet Protocol (i.e., the addition of information to each packet) have also been proposed [30, 31, 32].

1.3 Contribution

We have developed data structures for dynamic router tables. The data structures use O(n) space, except that RIBT uses O(n log_m n) space. Our first data structure, PST [33, 34], uses the most specific matching tie breaker. It permits one to search, insert, and delete in O(log n) time each.
Although O(log n)-time data structures for prefix tables were known prior to our work [21, 22], the PST is more memory efficient than the data structures of [21, 22]. Further, the PST is significantly superior on the insert and delete operations, while being competitive on the search operation. For nonintersecting ranges and conflict-free ranges, PSTs are the first to permit O(log n) search, insert, and delete.

The second data structure, BOB [35], works for highest-priority matching with nonintersecting ranges. The highest-priority rule that matches a destination address may be found in O(log^2 n) time; a new rule may be inserted and an old one deleted in O(log n) time. For the case when all rule filters are prefixes, the data structure PBOB (prefix BOB) permits highest-priority matching as well as rule insertion and deletion in O(W) time each. On practical rule tables, BOB and PBOB perform each of the three dynamic-table operations in O(log n) time and with O(log n) cache misses. PBOB can also support the dynamic-table operations in O(log n) time and with O(log n) cache misses for nonintersecting ranges when the number of nesting levels is a constant.

To utilize wide cache lines, e.g., a 64-byte cache line, we propose B-tree data structures for dynamic router-tables for the cases when the filters are prefixes as well as when they are nonintersecting ranges. A crucial difference between our data structure for prefix filters and the B-tree router-table data structure of Suri et al. [20] is that in our data structure, each prefix is stored in O(1) B-tree nodes per B-tree level, whereas in the structure of Suri et al. [20], each prefix is stored in O(m) nodes per level (m is the order of the B-tree). As a result of this difference, a prefix may be inserted into or deleted from an n-filter router table by accessing only O(log_m n) nodes of our data structure; these operations access O(m log_m n) nodes using the structure of Suri et al. [20].
Even though the asymptotic complexity of prefix insertion and deletion is the same in both B-tree structures, experiments conducted by us show that, because of the reduced cache misses for our structure, the measured average insert and delete times using our structure are about 30% less than when the B-tree structure of Suri et al. [20] is used. Further, an update operation using the B-tree structure of Suri et al. [20] will, in the worst case, make 2.5 times as many cache misses as are made when our structure is used. The asymptotic complexity to find the longest matching prefix is the same, O(m log_m n), in both B-tree structures, and in both structures this operation accesses O(log_m n) nodes. The measured time for this operation is also nearly the same for both data structures. Both B-tree structures for prefix router-tables take O(n) memory. However, our structure is more memory efficient by a constant factor.

For the case of nonintersecting ranges, the highest-priority range that matches a given destination address may be found in O(m log_m n) time using our proposed B-tree data structure. The time to insert and delete a range is O((m + D) log_m n), where D is the maximum nesting depth of the ranges. Our data structure for nonintersecting ranges requires O(n log_m n) memory, and O(log_m n) nodes are accessed during an operation.

With O(log n) operation time, our data structures scale well to large router tables. Since the complexity is independent of the prefix length, our data structures also scale to IPv6. Another important feature of our data structures is that nonintersecting ranges are supported naturally, whereas most existing data structures support ranges (necessary when the filters are defined for port numbers) by breaking one range into O(W) prefixes, which results in an O(W log n) memory requirement. Supporting ranges is also a nice feature for network layer addresses.
The size of the range that a prefix covers must be a power of two, and the range must start at a number that is a multiple of the range size. The end points and the size of a general range, however, can be any numbers. Supporting ranges means that one can allocate a range of arbitrary size to a network (AppleTalk supports this feature), and range aggregation is potentially better than prefix aggregation. For example, two disjoint prefixes can aggregate into one prefix only if their ranges are adjacent to each other and they have the same length, whereas two disjoint ranges can aggregate into one range as long as they are next to each other. So, range aggregation is expected to result in router tables that have fewer rules.

CHAPTER 2
O(log n) DYNAMIC ROUTER TABLE FOR PREFIXES AND RANGES

In this chapter, we show in Section 2.2 how priority-search trees may be used to represent dynamic prefix-router-tables. The resulting structure, which is conceptually simpler than the CRBT structure of Sahni et al. [21], permits lookup, insert, and delete in O(log n) time each. For range-router-tables, we consider the case when the best matching-prefix is the most-specific matching prefix (this is the range analog of longest-matching prefix). In Section 2.3, we show that dynamic range-router-tables that employ most-specific range matching and in which no two ranges overlap may be efficiently represented using two priority-search trees. Using this two-priority-search-tree representation, lookup, insert, and delete can be done in O(log n) time each. The general case of nonconflicting ranges is considered in Section 2.4. In that section, we augment the data structure of Section 2.3 with several red-black trees to obtain a range-router-table representation for nonconflicting ranges that permits lookup, insert, and delete in O(log n) time each. Section 2.1 introduces the terminology we use. In that section, we also develop the mathematical foundation that forms the basis of our data structures.
Experimental results are reported in Section 2.5.

2.1 Preliminaries

2.1.1 Prefixes and Longest-Prefix Matching

The prefix 1101* matches all destination addresses that begin with 1101, and 10010* matches all destination addresses that begin with 10010. For example, when W = 5, 1101* matches the addresses {11010, 11011} = {26, 27}, and when W = 6, 1101* matches {110100, 110101, 110110, 110111} = {52, 53, 54, 55}. Suppose that a router table includes the prefixes P1 = 101*, P2 = 10010*, P3 = 01*, P4 = 1*, and P5 = 1010*. The destination address d = 1010100 is matched by the prefixes P1, P4, and P5. Since |P1| = 3 (the length of a prefix is the number of bits in the prefix), |P4| = 1, and |P5| = 4, P5 is the longest prefix that matches d. In longest-prefix routing, the next hop for a packet destined for d is given by the longest prefix that matches d.

Figure 2-1: Relationships between pairs of ranges. (A) Two ranges are disjoint. (B) Two ranges are nested. (C) Two ranges intersect.

2.1.2 Ranges and Projections

Definition 1 A range r = [u, v] is a pair of addresses u and v, $u \le v$. The range r represents the addresses $\{u, u + 1, \ldots, v\}$. start(r) = u is the start point of the range and finish(r) = v is the finish point of the range. The range r covers or matches all addresses d such that $u \le d \le v$. range(q) is a predicate that is true iff q is a range.

The start point of the range r = [3, 9] is 3 and its finish point is 9. This range covers or matches the addresses {3, 4, 5, 6, 7, 8, 9}. In IPv4, u and v are up to 32 bits long, and in IPv6, u and v may be up to 128 bits long. The IPv4 prefix P = 0* corresponds to the range $[0, 2^{31} - 1]$. The range [3, 9] does not correspond to any single IPv4 prefix. We may draw the range $r = [u, v] = \{u, u + 1, \ldots, v\}$ as a horizontal line that begins at u and ends at v. Figure 2-1 shows ranges drawn in this fashion. Notice that every prefix of a prefix router-table may be represented as a range.
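The correspondence between a prefix and its range can be checked with a short sketch. The helper names `prefix_to_range` and `longest_prefix_match` are hypothetical (they do not appear in the dissertation), and the linear scan only illustrates the definition of longest-prefix matching; it is not one of the data structures developed in this chapter.

```python
from typing import List, Optional, Tuple


def prefix_to_range(prefix: str, W: int) -> Tuple[int, int]:
    """Return (start, finish) of the W-bit addresses matched by a prefix such as '1101*'."""
    bits = prefix.rstrip("*")
    k = len(bits)
    if k > W:
        raise ValueError("prefix is longer than the address width W")
    base = int(bits, 2) if bits else 0      # '*' alone matches every address
    start = base << (W - k)                 # pad the prefix with 0 bits
    finish = start + (1 << (W - k)) - 1     # pad the prefix with 1 bits
    return start, finish


def longest_prefix_match(d: str, prefixes: List[str]) -> Optional[str]:
    """Linear scan: the longest prefix whose bits are a leading substring of d."""
    matching = [p for p in prefixes if d.startswith(p.rstrip("*"))]
    return max(matching, key=lambda p: len(p.rstrip("*")), default=None)
```

Under these assumptions, `prefix_to_range("1101*", 6)` gives the range [52, 55] used in the text, and scanning P1 through P5 for d = 1010100 selects 1010*.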
For example, when W = 6, the prefix P = 1101* matches addresses in the range [52, 55]. So, we say P = 1101* = [52, 55], start(P) = 52, and finish(P) = 55.

Since a range represents a set of (contiguous) points, we may use standard set operations and relations such as $\cap$ and $\subset$ when dealing with ranges. So, for example, $[2, 6] \cap [4, 8] = [4, 6]$. Note that some operations between ranges may not yield a range. For example, $[2, 6] \cup [8, 10] = \{2, 3, 4, 5, 6, 8, 9, 10\}$ is not a range.

Definition 2 Let r = [u, v] and s = [x, y] be two ranges. Let $overlap(r, s) = r \cap s$.
(a) The predicate disjoint(r, s) is true iff r and s are disjoint.
$disjoint(r, s) \iff overlap(r, s) = \emptyset \iff v < x \vee y < u$
Figure 2-1(A) shows the two cases for disjoint sets.
(b) The predicate nested(r, s) is true iff one of the ranges is contained within the other.
$nested(r, s) \iff overlap(r, s) = r \vee overlap(r, s) = s \iff r \subseteq s \vee s \subseteq r \iff x \le u \le v \le y \vee u \le x \le y \le v$
Figure 2-1(B) shows the two cases for nested sets.
(c) The predicate intersect(r, s) is true iff r and s have a nonempty intersection that is different from both r and s.
$intersect(r, s) \iff r \cap s \neq \emptyset \wedge r \cap s \neq r \wedge r \cap s \neq s \iff \neg disjoint(r, s) \wedge \neg nested(r, s) \iff u < x \le v < y \vee x < u \le y < v$
Figure 2-1(C) shows the two cases for ranges that intersect. Notice that overlap(r, s) = [x, v] when $u < x \le v < y$ and overlap(r, s) = [u, y] when $x < u \le y < v$.

[2, 4] and [6, 9] are disjoint; [2, 4] and [3, 4] are nested; [2, 4] and [2, 2] are nested; [2, 8] and [4, 6] are nested; [2, 4] and [4, 6] intersect; and [3, 8] and [2, 4] intersect. [4, 4] is the overlap of [2, 4] and [4, 6]; and overlap([3, 8], [2, 4]) = [3, 4].

Lemma 1 Let r and s be two ranges. Exactly one of the following is true.
1. disjoint(r, s)
2. nested(r, s)
3. intersect(r, s)
Proof Straightforward. ∎

Definition 3 Let $R = \{r_1, \ldots, r_n\}$ be a set of n ranges. The projection, $\Pi(R)$, of R is
$\Pi(R) = \bigcup_{1 \le i \le n} r_i$
That is, $\Pi(R)$ comprises all addresses that are covered by at least one range of R.

For A = {[2, 5], [3, 6], [8, 9]}, $\Pi(A)$ = {2, 3, 4, 5, 6, 8, 9}, and for B = {[4, 8], [7, 9]}, $\Pi(B)$ = {4, 5, 6, 7, 8, 9}.
$\Pi(A)$ is not a range. However, $\Pi(B)$ is the range [4, 9]. Note that $\Pi(R)$ is a range iff $d \in \Pi(R)$ for every d, $u \le d \le v$, where $u = \min\{d \mid d \in \Pi(R)\}$ and $v = \max\{d \mid d \in \Pi(R)\}$.

Lemma 2 Let $R = \{r_1, r_2, \ldots, r_n\}$ be a set of n ranges such that $\Pi(R) = [u, v]$.
(a) $u = minStart(R) = \min\{start(r_i)\}$ and $v = maxFinish(R) = \max\{finish(r_i)\}$.
(b) Let s be a range. $\Pi(R \cup \{s\})$ is a range iff $start(s) \le v + 1$ and $finish(s) \ge u - 1$.
(c) When $\Pi(R \cup \{s\}) = [x, y]$, $x = \min\{u, start(s)\}$ and $y = \max\{v, finish(s)\}$.
Proof (a) is straightforward. Figure 2-2 shows all possible cases for which $\Pi(R \cup \{s\})$ is a range; s is shown as a solid line. (b) and (c) are readily verified for each case of Figure 2-2. ∎

Figure 2-2: Cases for Lemma 2

2.1.3 Most-Specific-Range Routing and Conflict-Free Ranges

Definition 4 The range r is more specific than the range s iff $r \subset s$.

[2, 4] is more specific than [1, 6], and [5, 9] is more specific than [5, 12]. Since [2, 4] and [8, 14] are disjoint, neither is more specific than the other. Also, since [4, 14] and [6, 20] intersect, neither is more specific than the other.

Definition 5 Let R be a range set. ranges(d, R) (or simply ranges(d) when R is implicit) is the subset of ranges of R that match/cover the destination address d. msr(d, R) (or msr(d)) is the most specific range of R that matches d. That is, msr(d) is the most specific range in ranges(d). msr([u, v], R) = msr(u, v, R) = r iff msr(d, R) = r, $u \le d \le v$. When R is implicit, we write msr(u, v) and msr([u, v]) in place of msr(u, v, R) and msr([u, v], R). In most-specific-range routing, the next hop for packets destined for d is given by the next-hop information associated with msr(d).

When R = {[2, 4], [1, 6]}, ranges(3) = {[2, 4], [1, 6]}, msr(3) = [2, 4], msr(1) = [1, 6], msr(7) = $\emptyset$, and msr(5, 6) = [1, 6]. When R = {[4, 14], [6, 20], [6, 14], [8, 12]}, msr(4, 5) = [4, 14], msr(6, 7) = [6, 14], msr(8, 12) = [8, 12], msr(13, 14) = [6, 14], and msr(15, 20) = [6, 20].
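The definitions of ranges(d) and msr(d) can be exercised directly with a brute-force sketch. The function names below (`ranges_matching`, `msr`, `msr_interval`) are hypothetical, a range is assumed to be a `(start, finish)` tuple, and `None` stands in for the empty result; the linear scans are for illustration only and are unrelated to the $O(\log n)$ structures developed later.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # (start, finish), both inclusive


def ranges_matching(d: int, R: List[Range]) -> List[Range]:
    """ranges(d, R): the ranges of R that cover the address d."""
    return [r for r in R if r[0] <= d <= r[1]]


def msr(d: int, R: List[Range]) -> Optional[Range]:
    """msr(d, R): the matching range contained in every other matching range, if any."""
    matching = ranges_matching(d, R)
    for r in matching:
        if all(s[0] <= r[0] and r[1] <= s[1] for s in matching):
            return r
    return None  # no range matches d, or no single most specific range exists


def msr_interval(u: int, v: int, R: List[Range]) -> Optional[Range]:
    """msr(u, v, R): the range r with msr(d, R) = r for every d in [u, v], if any."""
    first = msr(u, R)
    if first is not None and all(msr(d, R) == first for d in range(u, v + 1)):
        return first
    return None
```

With R = [(4, 14), (6, 20), (6, 14), (8, 12)], `msr_interval(13, 14, R)` returns (6, 14), matching the example above.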
Definition 6 The range set R has a conflict iff there exists a destination address d for which $ranges(d) \neq \emptyset \wedge msr(d) = \emptyset$. R is conflict free iff it has no conflict. The predicate conflictFree(R) is true iff R is a conflict-free range set.

conflictFree({[2, 8], [4, 12], [4, 8]}) is true while conflictFree({[2, 8], [4, 12]}) is false. We note that our definition of conflict free is a natural extension to ranges of the definition of conflict free given by Hari et al. [36] for the case of two-dimensional prefix rules.

Definition 7 Let r and s be two intersecting ranges of the range set R. The subset $Q \subseteq R$ is a resolving subset for these two ranges iff Q is conflict free and $\Pi(Q) = overlap(r, s)$. Two ranges of a range set are in conflict iff they intersect and have no resolving subset. Two ranges are conflict free iff they are not in conflict.

Lemma 3 A range set is conflict free iff it has no pair of ranges that are in conflict.
Proof Follows from the definition of a conflict-free range set. ∎

Lemma 4 Let R be a conflict-free range set. Let r be an arbitrary range. Let A be the subset of R that comprises all ranges of R that are contained in r. A is conflict free.
Proof Since R is conflict free, every pair (s, t) of intersecting ranges in A has a resolving subset B in R. From Definition 7, it follows that every range in B is contained in overlap(s, t). Hence, $B \subseteq A$. Therefore, every pair of intersecting ranges of A has a resolving subset in A. So, A is conflict free. ∎

Lemma 5 Let R be a conflict-free range set. Let A, $A \subseteq R$, be such that $\Pi(A) = r = [u, v]$.
1. $\exists B \subseteq R\,[conflictFree(A \cup B) \wedge \Pi(A) = \Pi(A \cup B)]$
2. Let $s \in R$ be such that intersect(r, s). $\exists B \subseteq R\,[\Pi(B) = overlap(r, s)]$
3. $R \cup \{r\}$ is conflict free.
Proof 1. Follows from Lemma 4.
2. When $r \in R$, (2) follows from the definition of a conflict-free range set. So, assume $r \notin R$. Let C comprise all ranges of A contained in s. If s intersects no range of A, $\Pi(C) = overlap(r, s)$.
If s intersects at least one range of A, then let $t \in A$ be an intersecting range with maximum overlap. Since R is conflict free, $\exists D \subseteq R\,[\Pi(D) = overlap(t, s)]$. We see that $\Pi(C \cup D) = overlap(r, s)$.
3. From parts (1) and (2) of this lemma, it follows that there is a resolving subset in $R \cup \{r\}$ for every $s \in R$ that intersects with r. Hence, $R \cup \{r\}$ is conflict free. ∎

Definition 8
$maxP(u, v, R) = \max\{finish(\Pi(A)) \mid A \subseteq R \wedge range(\Pi(A)) \wedge start(\Pi(A)) = u \wedge finish(\Pi(A)) \le v\}$
is the maximum possible projection that is a range that starts at u and finishes by v.
$minP(u, v, R) = \min\{start(\Pi(A)) \mid A \subseteq R \wedge range(\Pi(A)) \wedge finish(\Pi(A)) = v \wedge start(\Pi(A)) \ge u\}$
is the minimum possible projection that is a range that finishes at v and starts by u. When $\nexists A \subseteq R\,[range(\Pi(A)) \wedge start(\Pi(A)) = u \wedge finish(\Pi(A)) \le v]$, we say that maxP(u, v, R) does not exist. Similarly, minP(u, v, R) may not exist. At times, we use maxP and minP as abbreviations for maxP(u, v, R) and minP(u, v, R), respectively. Additionally,
$maxY(u, v, R) = \max\{y \mid [x, y] \in R \wedge x < u \le y < v\}$ and
$minX(u, v, R) = \min\{x \mid [x, y] \in R \wedge u < x \le v < y\}$.
Note that maxY and minX may not exist.

Lemma 6 Let R be a conflict-free range set. Let $A = R \cup \{r\}$, where $r = [u, v] \notin R$.
$conflictFree(A) \iff maxY(u, v, R) \le maxP(u, v, R) \wedge minX(u, v, R) \ge minP(u, v, R)$
where $maxY \le maxP$ ($minX \ge minP$) is true whenever maxY (minX) does not exist and is false when maxY (minX) exists but maxP (minP) does not.
Proof ($\Rightarrow$) Assume that A is conflict free. When neither maxY nor minX exist (this happens iff no range of R intersects r = [u, v]), $maxY \le maxP \wedge minX \ge minP$. When maxY exists, $\exists s = [x, maxY] \in A \wedge x < u \le maxY < v$. (Note that intersect(r, s).) Since A is conflict free, A has a (resolving) subset B for which $\Pi(B) = overlap(r, s) = [u, maxY]$. Therefore, $maxY \le maxP$. Similarly, when minX exists, $minX \ge minP$.
($\Leftarrow$) Assume $maxY(u, v, R) \le maxP(u, v, R) \wedge minX(u, v, R) \ge minP(u, v, R)$. When neither maxY nor minX exist, no range of R intersects r.
When maxY exists, $\exists s = [x, y] \in A\,[x < u \le y < v]$. Consider any such s = [x, y]. Since $maxY \le maxP$ and maxY exists, maxP exists. Hence, $\exists B \subseteq R\,[conflictFree(B) \wedge \Pi(B) = [u, maxP]]$. When y = maxP, B is a resolving subset for s and r in A. When y < maxP, intersect(s, [u, maxP]). Since $R \cup \{[u, maxP]\}$ is conflict free (Lemma 5(3)), $R \cup \{[u, maxP]\}$ (and so also R and A) has a resolving subset for s and [u, maxP]. This resolving subset is also a resolving subset for s and r. When minX exists, $\exists s = [x, y] \in A\,[u < x \le v < y]$. In a manner analogous to the proof for the case maxY exists, we may show that A has a resolving subset for r and each such s. Hence, in all cases, intersecting ranges of A have a resolving subset. So, A is conflict free. ∎

Lemma 7 Let R be a conflict-free range set. Let $A = R - \{r\}$ for some $r \in R$.
$\nexists B \subseteq A\,[\Pi(B) = r] \wedge \nexists s \in A\,[r \subset s] \Rightarrow \nexists C \subseteq A\,[r \subseteq \Pi(C)]$
Proof Assume
$\nexists B \subseteq A\,[\Pi(B) = r]$ (2.1)
and
$\nexists s \in A\,[r \subset s]$ (2.2)
We need to show that $\nexists C \subseteq A\,[r \subseteq \Pi(C)]$. Suppose that there is a C such that $C \subseteq A \wedge r \subseteq \Pi(C)$. From $C \subseteq A$ and Equation 2.2, it follows that
$\forall t \in C\,[disjoint(r, t) \vee intersect(r, t) \vee t \subset r]$ (2.3)
If $\nexists t \in C\,[intersect(r, t)]$, then from Equation 2.3, we get $\forall t \in C\,[disjoint(r, t) \vee t \subset r]$. From this and $r \subseteq \Pi(C)$, it follows that all destination addresses d, $d \in r$, are covered by ranges of C that are contained in r. Therefore, $\exists B \subseteq C \subseteq A\,[\Pi(B) = r]$. This contradicts Equation 2.1.
Next, suppose $\exists t \in C\,[intersect(r, t)]$. Let D be the union of the resolving subsets for all of these t and r in R. Clearly, all ranges in D are contained in r. Further, let E be the subset of all ranges in C that are contained in r. It is easy to see that $D \cup E \subseteq A \wedge \Pi(D \cup E) = r$. This contradicts Equation 2.1. ∎

Lemma 8 Let R be a conflict-free range set. Let $A = R - \{r\}$, for some $r \in R$.
1. $\exists B \subseteq A\,[\Pi(B) = r] \Rightarrow conflictFree(A)$.
2. $\nexists B \subseteq A\,[\Pi(B) = r] \Rightarrow [conflictFree(A) \iff \nexists s \in A\,[r \subset s] \vee [m, n] \in A]$, where $m = \max\{start(s) \mid s \in A \wedge r \subset s\}$ and $n = \min\{finish(s) \mid s \in A \wedge r \subset s\}$.
Proof For (1), we note that by replacing r by B in every resolving subset for intersecting ranges in R, we get resolving subsets that do not include r. Hence all of these resolving subsets are present in A. So, A is conflict free.
For (2), assume that $\nexists B \subseteq A\,[\Pi(B) = r]$.
($\Rightarrow$) Assume that A is conflict free. We need to prove
$\nexists s \in A\,[r \subset s] \vee [m, n] \in A$ (2.4)
We do this by contradiction. So, assume
$\exists s \in A\,[r \subset s] \wedge [m, n] \notin A$ (2.5)
Since $\exists s \in A\,[r \subset s]$, m and n are well defined. Equation 2.5 implies that A has a range [m, y], y > n, as well as a range [x, n], x < m. Further, intersect([m, y], [x, n]) and $r \subset overlap([m, y], [x, n]) = [m, n]$. Let B be the subset of R comprised of all ranges contained in [m, n]. From Lemma 4, it follows that B is conflict free. However, r is the projection of no subset of $C = B - \{r\}$. Further, no range of C contains r. From Lemma 7, it follows that no subset of C has a projection that contains r. In particular, C has no subset whose projection is [m, n]. Therefore, A has no subset whose projection is [m, n]. So, A has no resolving subset for [m, y] and [x, n]. Therefore, A is not conflict free, a contradiction.
($\Leftarrow$) If no range of A contains r, then r is not part of the resolving subset for any pair of intersecting ranges of R. This, together with the fact that R is conflict free, implies that A is conflict free. If $[m, n] \in A$, we can use [m, n] in place of r in any resolving subset for intersecting ranges of R. Therefore, A has a resolving subset for every pair of intersecting ranges. So, A is conflict free. ∎

Lemma 9 Let R be a conflict-free range set and let d be a destination address. If $ranges(d) \neq \emptyset$, then $start(msr(d)) = a = maxStart(ranges(d)) = \max\{start(r) \mid r \in ranges(d)\}$ and $finish(msr(d)) = b = minFinish(ranges(d)) = \min\{finish(r) \mid r \in ranges(d)\}$.
Proof Since R is conflict free and $ranges(d) \neq \emptyset$, $msr(d) \neq \emptyset$. Assume that msr(d) = s. If $s \neq [a, b]$, then start(s) < a or finish(s) > b. Assume that start(s) < a (the case finish(s) > b is similar).
Let $t \in ranges(d)$ be such that start(t) = a. Now, $intersect(s, t) \vee t \subset s$. Hence, $s \neq msr(d)$. ∎

2.1.4 Normalized Ranges

Definition 9 [Normalized Ranges] The range set R is normalized iff one of the following is true.
1. $|R| \le 1$.
2. $|R| > 1$ and for every $r \in R$ and every $s \in R$, $r \neq s$, one of the following is true.
(a) disjoint(r, s).
(b) $nested(r, s) \wedge start(r) \neq start(s) \wedge finish(r) \neq finish(s)$. That is, r and s are nested and do not have a common end point.

Figure 2-3: Unnormalized (A) and normalized (B) range sets

Figure 2-3(A) shows a range set that is not normalized (it contains ranges that intersect as well as nested ranges that have common end points). Figure 2-3(B) shows a normalized range set. Regardless of which of these two range sets is used, every destination d has the same most-specific range.

Definition 10 An ordered sequence of ranges $(r_1, \ldots, r_n)$ is a chain iff $\forall i < n\,[start(r_{i+1}) = finish(r_i) + 1]$. A range set R is a chain iff its ranges can be ordered so as to form a chain. chain(R) is a predicate that is true iff R is a chain.

The range sequence ([2, 4], [5, 7], [8, 12]) is a chain while ([5, 8], [12, 14]) and ([5, 8], [2, 4]) are not. The range sets {[5, 8], [2, 4]} and {[2, 4], [8, 12], [5, 7]} are chains while {[2, 4], [8, 12]} and {[2, 4], [5, 7], [8, 12], [9, 10]} are not. Note that when R is a chain, $\Pi(R) = [minStart(R), maxFinish(R)]$.

Lemma 10 Let N be a normalized range set.
$A \subseteq N \wedge \Pi(A) = [u, v] \Rightarrow \exists B \subseteq N\,[chain(B) \wedge \Pi(B) = [u, v]]$
Proof Let B be the subset of A obtained by removing from A all ranges that are nested within at least one other range of A. Clearly, $\Pi(B) = \Pi(A) = [u, v]$. Since N is normalized and $B \subseteq N$, B is also normalized. From Definition 9 and the fact that B has no pair of nested ranges, it follows that all ranges of B are disjoint. For disjoint ranges to have a projection that is a range, the disjoint ranges must form a chain. ∎

Lemma 11 Let N be a normalized range set. 1.
N may be uniquely partitioned into a set of longest chains $CP(N) = \{C_1, \ldots, C_k\}$, $N = \bigcup_{1 \le i \le k} C_i$, such that no two chains of CP may be combined into a single chain. CP(N) is called a canonical partitioning.
2. For all i and j, $1 \le i < j \le k$, $C_i$ and $C_j$ are either disjoint, or $C_i$ is properly contained within a range of $C_j$, or $C_j$ is properly contained within a range of $C_i$. A chain $C_i$ is properly contained within the range r iff $\Pi(C_i) \subset r$ and $C_i$ and r share no end point.
Proof Direct consequence of the definition of a normalized set and that of a chain. ∎

Figure 2-4 shows a normalized range set and its canonical partitioning into three chains.

Figure 2-4: Partitioning a normalized range set into chains

Next we state a chopping rule that we use to transform every conflict-free range set R into an equivalent normalized range set norm(R). By equivalent, we mean that for every destination d, the most-specific matching range is the same in R as it is in norm(R).

Definition 11 [Chopping Rule] Let $r = [u, v] \in R$, where R is a range set. chop(r, R) (or more simply chop(r) when R is implicit) is as defined below.
1. If neither maxP(u, v - 1, R) nor minP(u + 1, v, R) exists, chop(r) = r.
2. If only maxP(u, v - 1, R) exists, chop(r) = [maxP(u, v - 1, R) + 1, finish(r)].
3. If only minP(u + 1, v, R) exists, chop(r) = [start(r), minP(u + 1, v, R) - 1].
4. If both maxP(u, v - 1, R) and minP(u + 1, v, R) exist and $maxP(u, v - 1, R) + 1 \le minP(u + 1, v, R) - 1$, chop(r) = [maxP(u, v - 1, R) + 1, minP(u + 1, v, R) - 1].
5. If both maxP(u, v - 1, R) and minP(u + 1, v, R) exist and $maxP(u, v - 1, R) + 1 > minP(u + 1, v, R) - 1$, chop(r) = $\emptyset$, where $\emptyset$ denotes the null range. The null range neither intersects nor is contained in any other range.
Define $norm(R) = \{chop(r) \mid r \in R \wedge chop(r) \neq \emptyset\}$.

Lemma 12 Let R be a conflict-free range set.
$\forall r \in R\ \forall s \in R\,[s \subset r \Rightarrow [s \subset chop(r) \wedge start(s) \neq start(chop(r)) \wedge finish(s) \neq finish(chop(r))] \vee disjoint(s, chop(r))]$
Proof The lemma is trivially true when chop(r) = $\emptyset$ (disjoint(s, $\emptyset$) is true).
So, assume that chop(r) = r'. For the lemma to be false, either intersect(s, r') or ($r' \subseteq s$ or s and r' have a common end point). If intersect(s, r'), then either $start(r') < start(s) \le finish(r') < finish(s)$ or $start(s) < start(r') \le finish(s) < finish(r')$. Assume the former (the latter case is similar). From the chopping rule, it follows that $\exists A \subseteq R\,[\Pi(A) = [finish(r') + 1, finish(r)]]$. Therefore, $A \cup \{s\} \subseteq R \wedge \Pi(A \cup \{s\}) = [start(s), finish(r)]$. From this, $start(r) \le start(r') < start(s)$, and the chopping rule, we get finish(chop(r)) < start(s). But $start(s) \le finish(r')$, a contradiction. So, $r' \subseteq s$ or s and r' have a common end point.
First consider the case $r' \subseteq s \subset r$. Suppose that $start(s) \neq start(r)$ (the case $finish(s) \neq finish(r)$ is similar). Since r' = chop(r), $\exists A \subseteq R\,[\Pi(A) = [finish(r') + 1, finish(r)]]$. Therefore, $\Pi(A \cup \{s\}) = [start(s), finish(r)]$ and $start(r) < start(s) \le start(r')$. From the chopping rule, it follows that $finish(chop(r)) < start(s) \le start(r') \le finish(r')$, a contradiction. Therefore, $s \subset r'$. If start(s) = start(r'), $maxP(start(r), finish(r) - 1) \ge finish(s)$. So, start(r') > finish(s), which contradicts $s \subset r'$. The case finish(s) = finish(r') is similar. ∎

Lemma 13 Let r and s be two intersecting ranges of a conflict-free range set R.
$disjoint(chop(r), overlap(r, s)) \wedge disjoint(chop(s), overlap(r, s)) \wedge disjoint(chop(r), chop(s))$
Proof Without loss of generality, we may assume that $start(r) < start(s) \le finish(r) < finish(s)$. Since R is conflict free, $\exists A\,[A \subseteq R \wedge \Pi(A) = overlap(r, s)]$. Therefore, finish(chop(r)) < start(s) and start(chop(s)) > finish(r). This proves the lemma. ∎

Lemma 14 Let R be a conflict-free range set. For every $r' \in norm(R)$ there is a unique $r \in R$ such that chop(r) = r'.
Proof Let r' be any range in norm(R). Clearly, for every $r' \in norm(R)$, there is at least one $r \in R$ such that chop(r) = r'. Suppose two different ranges r and s of R have r' = chop(r) = chop(s). If intersect(r, s), then from Lemma 13 we get disjoint(chop(r), chop(s)).
So, $chop(r) \neq chop(s)$. If nested(r, s), then from Lemma 12 it follows that $s \subset chop(r) \vee disjoint(s, chop(r))$ when $s \subset r$ and $r \subset chop(s) \vee disjoint(r, chop(s))$ when $r \subset s$. Consider the former case (the latter case is similar). $s \subset chop(r)$ implies $chop(s) \neq chop(r)$. disjoint(s, chop(r)) also implies $chop(s) \neq chop(r)$. The final case is disjoint(r, s). In this case, clearly, $chop(s) \neq chop(r)$. ∎

For $r' \in norm(R)$, define $full(r') = chop^{-1}(r') = r$, where r is the unique range in R for which chop(r) = r'. Notice that full(chop(r)) = r except when chop(r) = $\emptyset$.

Lemma 15 For every conflict-free range set R, norm(R) is a normalized conflict-free range set.
Proof We shall show that norm(R) is normalized. Since a normalized range set has no intersecting ranges, every normalized range set is conflict free. If $|norm(R)| \le 1$, norm(R) is normalized. So, assume that $|norm(R)| > 1$. Let r' and s' be two different ranges in norm(R). We need to show that r' and s' satisfy property 2(a) or 2(b) of Definition 9. Let r = [u, v] = full(r') and s = full(s'). There are three possible cases for r and s: they either intersect, are nested, or are disjoint (Lemma 1).
Case 1: intersect(r, s). From Lemma 13, it follows that r' and s' are disjoint.
Case 2: nested(r, s). Either $s \subset r$ or $r \subset s$. Assume the former (the latter case is similar). From Lemma 12, we get
$[s \subset chop(r) \wedge start(s) \neq start(chop(r)) \wedge finish(s) \neq finish(chop(r))] \vee disjoint(s, chop(r))$
$s \subset chop(r) \wedge start(s) \neq start(chop(r)) \wedge finish(s) \neq finish(chop(r))$ implies that s' and r' are nested and do not have a common end point. disjoint(s, chop(r)) implies that s' and r' are disjoint.
Case 3: disjoint(r, s). Clearly, disjoint(r', s'). ∎

Lemma 16 Let $r' \in norm(R)$, where R is a conflict-free range set.
$\nexists s' \in norm(R)\,[s' \subset r'] \Rightarrow r = full(r') = msr(r', R)$
Proof Assume that $\nexists s' \in norm(R)\,[s' \subset r']$. If $\exists d \in r'\,[r \neq msr(d, R)]$, then $\exists s \subset r\,[d \in s]$. From Lemma 12, it follows that $s \subset r' \vee s \cap r' = \emptyset$. Since $d \in s \wedge d \in r'$, $s \cap r' \neq \emptyset$. Hence, $s \subset r'$.
From Lemma 4, it follows that $A = \{t \mid t \in R \wedge t \subseteq s\} \neq \emptyset$ is conflict free. From the chopping rule it follows that $norm(A) \neq \emptyset$. So, $\exists t' \in norm(A) \subseteq norm(R)\,[t' \subseteq t = full(t') \subset r']$. This violates the assumption of this lemma. Therefore, $\nexists d \in r'\,[r \neq msr(d, R)]$. So, r = msr(r', R). ∎

Lemma 17 Let R be a conflict-free range set, let x be the start point of some range in R, and let y be the finish point of some range in R.
1. Let $s \in R$ be such that start(s) = x and $finish(s) = \min\{finish(t) \mid t \in R \wedge start(t) = x\}$.
(a) $chop(s) \neq \emptyset$.
(b) start(chop(s)) = x.
(c) chop(s) is the only range in norm(R) that starts at x.
2. Let $s \in R$ be such that finish(s) = y and $start(s) = \max\{start(t) \mid t \in R \wedge finish(t) = y\}$.
(a) $chop(s) \neq \emptyset$.
(b) finish(chop(s)) = y.
(c) chop(s) is the only range in norm(R) that finishes at y.
Proof We prove 1(a) through 1(c); 2(a) through 2(c) are similar. Since maxP(start(s), finish(s) - 1, R) does not exist, case 5 of the chopping rule does not apply and $chop(s) \neq \emptyset$. One of the cases 1 and 3 applies. In both of these cases, start(chop(s)) = x. For 1(c), we note that the definition of a normalized set (Definition 9) implies that no two ranges of norm(R) share an end point. In particular, norm(R) can have only one range that has x as an end point. ∎

Lemma 18 Let $r' \in norm(R)$, where R is a conflict-free range set.
$start(full(r')) \neq start(r') \Rightarrow \nexists s \in R\,[start(s) = start(r')]$
Proof Suppose that $start(full(r')) \neq start(r')$ and $\exists s \in R\,[start(s) = start(r')]$. From Lemma 17(1a and 1b), it follows that $\exists t \in R\,[start(t) = start(r') \wedge chop(t) \neq \emptyset \wedge start(chop(t)) = start(r')]$. Therefore, norm(R) has at least two ranges (r' and chop(t)) that start at start(r'). This contradicts Lemma 17(1c). ∎

Lemma 19 Let R be a conflict-free range set. Let $r \in R$ be such that r = msr(u, v, R) for some range [u, v]. r' = chop(r) = msr(u, v, norm(R)).
Proof From the definition of msr, it follows that there is no $s \in R$ such that $s \subset r \wedge s \cap [u, v] \neq \emptyset$. Therefore, $[u, v] \subseteq chop(r)$.
Further, from Lemmas 12 and 13, it follows that norm(R) contains no $s' \subset chop(r)$. So, r' = msr(u, v, norm(R)). ∎

Lemma 20 Let R be a conflict-free range set that has a subset whose projection equals [x, y]. Let $A \subseteq R$ comprise all $r \in R$ such that $r \subseteq [x, y]$.
1. $\exists B \subseteq norm(R)\,[\Pi(B) = [x, y]]$
2. $C = \{full(r') \mid r' \in norm(R) \wedge r' \subseteq [x, y]\} \subseteq A$
Proof 1. From Lemma 4, it follows that A is conflict free. Further, since R has a subset whose projection equals [x, y], $\Pi(A) = [x, y]$. From Lemma 19, it follows that every $d \in [x, y]$ has a most-specific range in norm(A). Therefore, $\Pi(norm(A)) = [x, y]$. From the definition of the chopping rule and that of A, we see that $\forall r \in A\,[chop(r, A) = chop(r, R)]$. So, $norm(A) \subseteq norm(R)$.
2. First, assume that $[x, y] \in R$. Suppose there is a range $r' \in norm(R)$ such that $r' \subseteq [x, y]$ and $r = full(r') \notin A$. There are three cases for r.
Case 1: disjoint(r, [x, y]). In this case, disjoint(r', [x, y]) and so $r' \not\subseteq [x, y]$.
Case 2: intersect(r, [x, y]). From Lemma 13, we get disjoint(chop(r), [x, y]). So $r' \not\subseteq [x, y]$.
Case 3: $[x, y] \subset r$. From Lemma 12 and $r' \subseteq [x, y]$, we get disjoint([x, y], chop(r)). So $r' \not\subseteq [x, y]$.
When $[x, y] \notin R$, let $R' = R \cup \{[x, y]\}$, $C' = \{full(r') \mid r' \in norm(R') \wedge r' \subseteq [x, y]\}$, and $A' = A \cup \{[x, y]\}$. Using the lemma case we have already proved, we get $C' \subseteq A'$. Since $chop([x, y], R') = \emptyset$ and chop(s, R) = chop(s, R') for every $s \in R$, norm(R') = norm(R). Therefore, C = C'. So, $C \subseteq A'$. Finally, since $[x, y] \notin C$, $C \subseteq A$. ∎

Lemma 21 Let R be a conflict-free range set. Let $r \notin R$ be such that $R \cup \{r\}$ is conflict free.
1. $chop(r, R \cup \{r\}) = \emptyset \Rightarrow \forall t \in R\,[chop(t, R) = chop(t, R \cup \{r\})]$.
2. Let s be the smallest range of R that contains r. Assume that s exists and that $chop(r, R \cup \{r\}) \neq \emptyset$.
(a) $\forall t \in R - \{s\}\,[chop(t, R) = chop(t, R \cup \{r\})]$.
(b) $chop(s, R) \neq chop(s, R \cup \{r\}) \Rightarrow (x' = u' \wedge y' = v') \vee (x' = u' \wedge y' > v) \vee (x' < u \wedge y' = v')$, where r = [u, v], $chop(r, R \cup \{r\}) = chop(r, R) = [u', v']$, and chop(s, R) = [x', y'].
Proof For (1), note that $chop(r, R \cup \{r\}) = \emptyset \Rightarrow \exists A \subseteq R\,[\Pi(A) = r]$.
Therefore, the addition of r to R does not affect any of the maxP and minP values.
For (2a), suppose there are two different ranges g and h in R such that $chop(g, R) \neq chop(g, R \cup \{r\})$ and $chop(h, R) \neq chop(h, R \cup \{r\})$. From the chopping rule, it follows that
$r \subset g \wedge r \subset h$ (2.6)
Therefore, $\neg disjoint(g, h)$. From this and Lemma 1, we get $intersect(g, h) \vee nested(g, h)$. Equation 2.6 and intersect(g, h) imply $r \subseteq overlap(g, h)$. From this and Lemma 13, we get $disjoint(r, chop(g, R)) \wedge disjoint(r, chop(h, R))$. Therefore, $chop(g, R) = chop(g, R \cup \{r\})$ and $chop(h, R) = chop(h, R \cup \{r\})$, a contradiction. So, $\neg intersect(g, h)$. If nested(g, h), we may assume, without loss of generality, that $g \subset h$. This and Equation 2.6 yield $r \subset g \subset h$. Therefore, $maxP(x, y - 1, R) = maxP(x, y - 1, R \cup \{r\})$ and $minP(x + 1, y, R) = minP(x + 1, y, R \cup \{r\})$, where h = [x, y]. So, $chop(h, R) = chop(h, R \cup \{r\})$, a contradiction. Hence, there can be at most one range of R whose chop() value changes as a result of the addition of r. The preceding proof for the case nested(g, h) also establishes that the chop() value may change only for the range s, that is, for the smallest enclosing range of r (i.e., the smallest $s \in R\,[r \subset s]$).
For (2b), assume that $chop(s, R) \neq chop(s, R \cup \{r\})$. This implies that $chop(s, R) \neq \emptyset$ and so x' and y' are well defined. (Note that from part (1), we get $chop(r, R) \neq \emptyset$.) We consider each of the three cases for the relationship between r and chop(s, R) (Lemma 1).
Case 1: disjoint(r, chop(s, R)). This case cannot arise, because then $chop(s, R) = chop(s, R \cup \{r\})$.
Case 2: intersect(r, chop(s, R)). Now, either $x' < u \le y' < v$ or $u < x' \le v < y'$. Consider the former case. Since $r \subset s$, $v \le y$. When v = y, $minP(u + 1, v, R \cup \{r\}) = minP(x + 1, y, R) = y' + 1$. So, v' = y'. Therefore, $x' < u \wedge y' = v'$. Consider the case v < y. From the chopping rule, it follows that $\exists A \subseteq R \subset R \cup \{r\}\,[\Pi(A) = [y' + 1, y]]$.
From this, Lemma 5(2), and the fact that $R \cup \{r\}$ is conflict free, we conclude $\exists B \subseteq R \cup \{r\}\,[\Pi(B) = overlap(r, [y' + 1, y]) = [y' + 1, v]]$. From this and minP(x + 1, y, R) = y' + 1, we get $minP(u + 1, v, R \cup \{r\}) = y' + 1$. So, v' = y'. Once again, $x' < u \wedge y' = v'$. Using a similar argument, we may show that when $u < x' \le v < y'$, $x' = u' \wedge y' > v$.
Case 3: nested(r, chop(s, R)). So, either $r \subseteq chop(s, R)$ or $chop(s, R) \subset r$. First, consider all possibilities for $r \subseteq chop(s, R)$. The case $x' < u \le v < y'$ cannot arise, because this implies $chop(s, R) = chop(s, R \cup \{r\})$. When $x' = u \le v < y'$, u' = x'. So, $x' = u' \wedge y' > v$. When $x' < u \le v = y'$, v' = y'. So, $x' < u \wedge y' = v'$. The final case is when $x' = u \le v = y'$. Now, $u' = x' \wedge y' = v'$. Using an argument similar to that used in part (2a), we may show that when $chop(s, R) \subset r$, $x' = u' \wedge y' = v'$. ∎

Lemma 22 Let R, r = [u, v], s = [x, y], x', y', u', and v' be as in Lemma 21. Assume that s exists and $chop(s) \neq \emptyset$.
1. $disjoint(r, chop(s, R)) \vee x' < u \le v < y' \Rightarrow chop(s, R \cup \{r\}) = chop(s, R)$.
2. $x' = u' \wedge y' = v' \Rightarrow chop(s, R \cup \{r\}) = \emptyset$.
3. Suppose $x' = u' \wedge y' > v$. If maxP(v' + 1, y', R) doesn't exist, then $chop(s, R \cup \{r\}) = [v + 1, y']$. If it exists, $chop(s, R \cup \{r\}) = [maxP(v' + 1, y', R) + 1, y']$.
4. Suppose $x' < u' \wedge y' = v'$. If minP(x', u' - 1, R) doesn't exist, then $chop(s, R \cup \{r\}) = [x', u - 1]$. If it exists, $chop(s, R \cup \{r\}) = [x', minP(x', u' - 1, R) - 1]$.
Proof (1) follows from the proof of Lemma 21(2b). For (2), from the proof cases of Lemma 21(2b) that have $x' = u' \wedge y' = v'$, it follows that case 5 of the chopping rule applies for s in $R \cup \{r\}$. So, $chop(s, R \cup \{r\}) = \emptyset$.
For (3), $finish(chop(s, R \cup \{r\})) = y'$ follows from the proof of Lemma 21(2b). Also, we observe that $maxP(x, y - 1, R \cup \{r\}) \ge v$. So, (3) can be false only when $maxP(x, y - 1, R \cup \{r\}) > v$ and either (a) maxP(v' + 1, y', R) doesn't exist or (b) $maxP(v' + 1, y', R) < maxP(x, y - 1, R \cup \{r\})$. For (a), $\exists [c, d] \in R\,[x \le c \le v' \wedge v < d < y']$.
For (b), $\exists [c, d] \in R\,[x \le c \le v' \wedge v < maxP(v' + 1, y', R) < d < y']$. In both cases, $c \le u$ implies that $r = [u, v] \subset [c, d] \subset s$. This contradicts the assumption that s is the smallest enclosing range of r. Also, in both cases, c > u implies intersect(r, [c, d]). So, $R \cup \{r\}$ has a subset whose projection is [c, v]. Therefore, $finish(chop(r, R \cup \{r\})) < c \le v'$, a contradiction.
The proof for (4) is similar to that for (3). ∎

Lemma 23 Let R be a conflict-free range set. Let $r = [u, v] \in R$ be such that $R - \{r\}$ is conflict free.
1. $chop(r, R) = \emptyset \Rightarrow \forall t \in R - \{r\}\,[chop(t, R) = chop(t, R - \{r\})]$.
2. Let s = [x, y] be the smallest range of $R - \{r\}$ that contains r. Assume that s exists and that chop(r, R) = [u', v'].
(a) $\forall t \in R - \{r, s\}\,[chop(t, R) = chop(t, R - \{r\})]$.
(b) $chop(s, R) = \emptyset \Rightarrow chop(s, R - \{r\}) = [u', v']$.
(c) $chop(s, R) = [x', y'] \Rightarrow chop(s, R - \{r\}) = [\min\{x', u'\}, \max\{y', v'\}]$.
Proof For (1), note that $chop(r, R) = \emptyset \Rightarrow \exists A \subseteq R - \{r\}\,[\Pi(A) = r]$. Therefore, the removal of r from R does not affect any of the maxP and minP values. For (2a), note that by substituting $R - \{r\}$ for R in Lemma 21(2a), we get $\forall t \in R - \{r, s\}\,[chop(t, R - \{r\}) = chop(t, R)]$. (2b) and (2c) follow from Lemma 22. ∎

Figure 2-5: An example range set R (A) and its mapping map1(R) into points in 2D (B)

2.1.5 Priority Search Trees And Ranges

A priority-search tree (PST) [37] is a data structure that is used to represent a set of tuples of the form (key1, key2, data), where $key1 \ge 0$, $key2 \ge 0$, and no two tuples have the same key1 value. The data structure is simultaneously a min-tree on key2 (i.e., the key2 value in each node of the tree is $\le$ the key2 value in each descendent node) and a search tree on key1. There are two common PST representations [37]:
1. In a radix priority-search tree (RPST), the underlying tree is a binary radix tree on key1.
2. In a red-black priority-search tree (RBPST), the underlying tree is a red-black tree.
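The two invariants of a PST (min-tree on key2, search tree on key1) can be illustrated with a small static sketch. This is neither an RPST nor an RBPST: the node layout with an explicit `split` key, the median-based `build`, and the function names are assumptions made for illustration, updates are not supported, and no complexity bound from [37] is claimed for this simplified query.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[int, int]  # (key1, key2); key1 values are assumed distinct


@dataclass
class Node:
    point: Point             # point with the least key2 in this subtree
    split: int               # key1 <= split is stored left, key1 > split right
    left: Optional["Node"]
    right: Optional["Node"]


def build(points: List[Point]) -> Optional[Node]:
    """Static construction: root holds the min-key2 point, the rest is split by key1."""
    if not points:
        return None
    root = min(points, key=lambda p: p[1])
    rest = sorted(p for p in points if p != root)
    if not rest:
        return Node(root, root[0], None, None)
    mid = (len(rest) + 1) // 2
    return Node(root, rest[mid - 1][0], build(rest[:mid]), build(rest[mid:]))


def min_x_in_rectangle(node: Optional[Node], x_left: int, x_right: int,
                       y_top: int) -> Optional[Point]:
    """Least-key1 point with x_left <= key1 <= x_right and key2 <= y_top, if any."""
    if node is None or node.point[1] > y_top:
        return None          # min-tree: no descendant can have key2 <= y_top
    best = node.point if x_left <= node.point[0] <= x_right else None
    sub = None
    if x_left <= node.split:
        sub = min_x_in_rectangle(node.left, x_left, x_right, y_top)
    if sub is None and x_right > node.split:
        # Every left key1 is smaller than every right key1, so the right
        # subtree only matters when the left one produced nothing.
        sub = min_x_in_rectangle(node.right, x_left, x_right, y_top)
    if sub is not None and (best is None or sub[0] < best[0]):
        best = sub
    return best
```

The key2 test at the top of the query is where the min-tree property pays off: an entire subtree is discarded as soon as its root lies above `y_top`.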
McCreight [37] has suggested a PST representation of a collection of ranges with distinct finish points. This representation uses the following mapping of a range r into a PST tuple:
$(key1, key2, data) = (finish(r), start(r), data)$ (2.7)
where data is any information (e.g., next hop) associated with the range. Each range r is, therefore, mapped to a point map1(r) = (x, y) = (key1, key2) = (finish(r), start(r)) in 2-dimensional space. Figure 2-5 shows a set of ranges and the equivalent set of 2-dimensional points (x, y).

McCreight [37] has observed that when the mapping of Equation 2.7 is used to obtain a point set P = map1(R) from a range set R, then ranges(d) is given by the points that lie in the rectangle (including points on the boundary) defined by $x_{left} = d$, $x_{right} = \infty$, $y_{top} = d$, and $y_{bottom} = 0$. These points are obtained using the method $enumerateRectangle(x_{left}, x_{right}, y_{top}) = enumerateRectangle(d, \infty, d)$ of a PST ($y_{bottom}$ is implicit and is always 0).

When an RPST is used to represent the point set P, the complexity of $enumerateRectangle(x_{left}, x_{right}, y_{top})$ is $O(\log maxX + s)$, where maxX is the largest x value in P and s is the number of points in the query rectangle. When the point set is represented as an RBPST, this complexity becomes $O(\log n + s)$, where n = |P|. A point (x, y) (and hence a range [y, x]) may be inserted into or deleted from an RPST (RBPST) in $O(\log maxX)$ ($O(\log n)$) time [37].

2.2 Prefixes

Let R be a set of ranges such that each range represents a prefix. It is well known (see Sahni et al. [21], for example) that no two ranges of R intersect. Therefore, R is conflict free. For simplicity, assume that R includes the range that corresponds to the prefix *. With this assumption, msr(d) is defined for every d. From Lemma 9, it follows that msr(d) is the range [maxStart(ranges(d)), minFinish(ranges(d))]. To find this range easily, we first transform P = map1(R) into a point set transform1(P) so that no two points of transform1(P) have the same x-value.
Then, we represent transform1(P) as a PST.

Definition 12 Let W be the (maximum) number of bits in a destination address (W = 32 in IPv4). Let $(x, y) \in P$. $\mathit{transform1}(x, y) = (x', y') = (2^W x - y + 2^W - 1,\ y)$ and $\mathit{transform1}(P) = \{\mathit{transform1}(x, y) \mid (x, y) \in P\}$.

We see that $0 \le x' < 2^{2W}$ for every $(x', y') \in \mathit{transform1}(P)$ and that no two points in transform1(P) have the same x'-value. Let PST1(P) be the PST for transform1(P). The operation $\mathit{enumerateRectangle}(2^W d - d + 2^W - 1, \infty, d)$ performed on PST1 yields ranges(d). To find msr(d), we employ the $\mathit{minXinRectangle}(x_{left}, x_{right}, y_{top})$ operation, which determines the point in the defined rectangle that has the least x-value. It is easy to see that $\mathit{minXinRectangle}(2^W d - d + 2^W - 1, \infty, d)$ performed on PST1 yields msr(d): the point with the least x'-value in this rectangle belongs to the matching range with the smallest finish point and, among ranges with that finish point, the largest start point.

To insert the prefix whose range is [u, v], we insert transform1(map1([u, v])) into PST1. In case this prefix is already in PST1, we simply update the next-hop information for this prefix. To delete the prefix whose range is [u, v], we delete transform1(map1([u, v])) from PST1. When deleting a prefix, we must take care not to delete the prefix *. Requests to delete this prefix should simply result in setting the next hop associated with this prefix to $\emptyset$.

Since minXinRectangle, insert, and delete each take O(W) ($O(\log n)$) time when PST1 is an RPST (RBPST), PST1 provides a router-table representation in which longest-prefix matching, prefix insertion, and prefix deletion can be done in O(W) time each when an RPST is used and in $O(\log n)$ time each when an RBPST is used.

2.3 Nonintersecting Ranges

Let R be a set of nonintersecting ranges. Clearly, R is conflict free. For simplicity, assume that R includes the range z that matches all destination addresses ($z = [0, 2^{32} - 1]$ in the case of IPv4). With this assumption, msr(d) is defined for every d. We may use PST1(transform1(map1(R))) to find msr(d) as described in Section 2.2. Insertion of a range r is to be permitted only if r does not intersect any of the ranges of R.
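Before turning to the intersection test, the point mapping of Sections 2.1.5 and 2.2 can be illustrated with a small sketch. This is not the dissertation's implementation: a linear scan stands in for the priority-search tree, the address width W = 5 is a toy value chosen for readability, and the helper names (prefix_to_range, min_x_in_rectangle, and so on) are hypothetical. Only map1, transform1, and the query rectangle follow the text.

```python
# Illustrative sketch only; a linear scan stands in for the priority-search tree.
W = 5  # assumed toy address width (the text uses W = 32 for IPv4)


def prefix_to_range(bits):
    """Range [start, finish] of the prefix with the given bit string ('' is *)."""
    free = W - len(bits)
    start = (int(bits, 2) if bits else 0) << free
    return (start, start + (1 << free) - 1)


def map1(r):
    """Equation 2.7: range [start, finish] -> point (x, y) = (finish, start)."""
    start, finish = r
    return (finish, start)


def transform1(p):
    """Definition 12: (x, y) -> (2^W x - y + 2^W - 1, y)."""
    x, y = p
    return ((x << W) - y + (1 << W) - 1, y)


def min_x_in_rectangle(points, x_left, x_right, y_top):
    """Brute-force stand-in for the PST query (y_bottom is implicitly 0)."""
    hits = [p for p in points if x_left <= p[0] <= x_right and 0 <= p[1] <= y_top]
    return min(hits) if hits else None


def msr(points, d):
    """Most specific range matching d: minXinRectangle(2^W d - d + 2^W - 1, inf, d)."""
    x_left = (d << W) - d + (1 << W) - 1
    hit = min_x_in_rectangle(points, x_left, float("inf"), d)
    if hit is None:
        return None
    x1, y = hit
    finish = (x1 + y - (1 << W) + 1) >> W  # invert transform1 to recover finish
    return (y, finish)


prefixes = ["", "1", "10", "101", "0110"]  # '' is the default prefix *
ranges = [prefix_to_range(b) for b in prefixes]
pst1 = [transform1(map1(r)) for r in ranges]
```

Several of these prefixes share a start or a finish point, yet their transformed x'-values are all distinct, which is the property a PST needs.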
Once we have verified this, we can insert r into PST1 as described in Section 2.2. Range intersection may be verified by noting that there are two cases for range intersection (Definition 2(c)). When inserting r = [u, v], we need to determine if $\exists s = [x, y] \in R\,[u < x \le v < y \vee x < u \le y < v]$. We see that $\exists s \in R\,[x < u \le y < v]$ iff map1(R) has at least one point in the rectangle defined by $x_{left} = u$, $x_{right} = v - 1$, and $y_{top} = u - 1$ (recall that $y_{bottom} = 0$ by default). Hence, $\exists s \in R\,[x < u \le y < v]$ iff $\mathit{minXinRectangle}(2^W u - (u - 1) + 2^W - 1,\ 2^W (v - 1) + 2^W - 1,\ u - 1)$ exists in PST1.

To verify $\exists s \in R\,[u < x \le v < y]$, map the ranges of R into 2-dimensional points using the mapping $\mathit{map2}(r) = (\mathit{start}(r),\ 2^W - 1 - \mathit{finish}(r))$. Call the resulting set of mapped points map2(R). We see that $\exists s \in R\,[u < x \le v < y]$ iff map2(R) has at least one point in the rectangle defined by $x_{left} = u + 1$, $x_{right} = v$, and $y_{top} = (2^W - 1) - v - 1$. To verify this, we maintain a second PST, PST2, of points in transform2(map2(R)), where $\mathit{transform2}(x, y) = (2^W x + y,\ y)$. Hence, $\exists s \in R\,[u < x \le v < y]$ iff $\mathit{minXinRectangle}(2^W (u + 1),\ 2^W v + (2^W - 1) - v - 1,\ (2^W - 1) - v - 1)$ exists.

To delete a range r, we must delete r from both PST1 and PST2. Deletion of a range from a PST is similar to deletion of a prefix as discussed in Section 2.2.

The complexities of the operations to find msr(d), insert a range, and delete a range are the same as those of the corresponding operations for the case when R is a set of ranges that correspond to prefixes.

Step 1: If $r = [u, v] \in R$, update the next-hop information associated with $r \in R$ and terminate.
Step 2: Compute maxP(u, v, R), minP(u, v, R), maxY(u, v, R), and minX(u, v, R).
Step 3: If $\mathit{maxY}(u, v, R) \le \mathit{maxP}(u, v, R) \wedge \mathit{minX}(u, v, R) \ge \mathit{minP}(u, v, R)$, $R \cup \{r\}$ is conflict free; otherwise, it is not. In the former case, insert transform1(map1(r)) into PST1 and transform2(map2(r)) into PST2. In the latter case, the insert operation fails.
Figure 2-6: Insert r = [u, v] into the conflict-free range set R

2.4 Conflict-Free Ranges

In this section, we extend the two-PST data structure of Section 2.3 to the general case when R is an arbitrary conflict-free range set. Once again, we assume that R includes the range z that matches all destination addresses. PST1 and PST2 are defined for the range set R as in Sections 2.2 and 2.3.

2.4.1 Determine msr(d)

Since R is conflict free, msr(d) is determined by Lemma 9. Hence, msr(d) may be obtained by performing the operation $\mathit{minXinRectangle}(2^W d - d + 2^W - 1, \infty, d)$ on PST1.

2.4.2 Insert A Range

When inserting a range $r = [u, v] \notin R$, we must insert transform1(map1(r)) into PST1 and transform2(map2(r)) into PST2. Additionally, we must verify that $R \cup \{r\}$ is conflict free. This verification is done using Lemma 6. Figure 2-6 gives a high-level description of our algorithm to insert a range into R. Step 1 is done by searching for transform1(map1(r)) in PST1. For Step 2, we note that

$\mathit{maxY}(u, v, R) = \mathit{maxXinRectangle}(2^W u - (u - 1) + 2^W - 1,\ 2^W (v - 1) + 2^W - 1,\ u - 1)$
$\mathit{minX}(u, v, R) = \mathit{minXinRectangle}(2^W (u + 1),\ 2^W v + (2^W - 1) - v - 1,\ (2^W - 1) - v - 1)$

where for maxY we use PST1 and for minX we use PST2. Section 2.4.4 describes the computation of maxP and minP. The point insertions of Step 3 are done using the standard insert algorithm for a PST [37].

Step 1: If r = z, change the next-hop information for z to $\emptyset$ and terminate.
Step 2: Delete transform1(map1(r)) from PST1 and transform2(map2(r)) from PST2 to get the PSTs for $A = R - \{r\}$. If PST1 did not have transform1(map1(r)), $r \notin R$; terminate.
Step 3: Determine whether or not A has a subset whose projection equals r = [u, v].
Step 4: If A has such a subset, conclude conflictFree(A) and terminate.
Step 5: Determine whether A has a range that contains r = [u, v]. If not, conclude conflictFree(A) and terminate.
Step 6: Determine m and n as defined in Lemma 8 as follows.
        $m = \mathit{start}(\mathit{maxXinRectangle}(0,\ 2^W u + (2^W - 1) - v,\ (2^W - 1) - v))$ (use PST2)
        $n = \mathit{finish}(\mathit{minXinRectangle}(2^W v - u + 2^W - 1,\ \infty,\ u))$ (use PST1)
Step 7: Determine whether $[m, n] \in A$. If so, conclude conflictFree(A). Otherwise, conclude $\neg$conflictFree(A). In the latter case, reinsert transform1(map1(r)) into PST1 and transform2(map2(r)) into PST2 and disallow the deletion of r.

Figure 2-7: Delete the range r = [u, v] from the conflict-free range set R

2.4.3 Delete A Range

Suppose we are to delete the range r = [u, v]. This deletion is to be permitted iff $r \ne z$ and $A = R - \{r\}$ is conflict free. Figure 2-7 gives a high-level description of our algorithm to delete r. Its correctness follows from Lemma 8. Step 2 employs the standard PST algorithm to delete a point [37]. For Step 3, we note that A has a subset whose projection equals r = [u, v] iff maxP(u, v, A) = v. In Section 2.4.4, we show how maxP(u, v, A) may be computed efficiently. For Step 5, we note that $r = [u, v] \subseteq s = [x, y]$ iff $x \le u \wedge y \ge v$. So, A has such a range iff $\mathit{minXinRectangle}(2^W v - u + 2^W - 1, \infty, u)$ exists in PST1. In Step 6, we assume that maxXinRectangle and minXinRectangle return the range of R that corresponds to the desired point in the rectangle. To determine whether $[m, n] \in A$ (Step 7), we search for the point $(2^W n - m + 2^W - 1,\ m)$ in PST1 using the standard PST search algorithm [37]. The reinsertion into PST1 and PST2, if necessary, is done using the standard PST insert algorithm [37].

Step 1: Find $r' \in \mathit{norm}(R)$ such that start(r') = u. If there is no such r' or $\mathit{start}(\mathit{full}(r')) \ne u \vee \mathit{finish}(\mathit{full}(r')) > v$, maxP does not exist; terminate.
Step 2: maxP = finish(r');
        while ($\exists s' \in \mathit{norm}(R)\,[\mathit{start}(s') = \mathit{maxP} + 1 \wedge \mathit{full}(s') \subseteq [u, v]]$) maxP = finish(s');

Figure 2-8: Simple algorithm to compute maxP(u, v, R), where [u, v] is a range and conflictFree(R)

2.4.4 Computing maxP and minP

Although maxP and minP are relatively difficult to compute using data structures such as PST1 and PST2 that directly represent R, they may be computed efficiently using data structures for norm(R).
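As a cross-check on the insert test of Figure 2-6, the four quantities of Step 2 can be computed directly from their set descriptions by linear scans. The sketch below is not the method of this chapter and makes several assumptions: maxP(u, v, R) is taken to be the largest x ≤ v for which some subset of R has projection [u, x] (adjacent integer ranges are allowed to join), minP is its mirror image, maxY and minX are read off the rectangle queries of Section 2.4.2, and a missing maxY or minX is treated as passing the corresponding half of the test. All function names are hypothetical.

```python
# Brute-force sketch of the Step 3 test of Figure 2-6 (assumed definitions, see lead-in).

def max_p(u, v, ranges):
    """Largest x <= v such that a subset of ranges has projection [u, x], else None."""
    inside = sorted(r for r in ranges if u <= r[0] and r[1] <= v)
    if not any(s == u for s, _ in inside):
        return None
    reach = u - 1
    for s, f in inside:          # sweep by increasing start point
        if s > reach + 1:        # a gap: the union would no longer be one range
            break
        reach = max(reach, f)
    return reach


def min_p(u, v, ranges):
    """Smallest x >= u such that a subset of ranges has projection [x, v], else None."""
    inside = sorted((r for r in ranges if u <= r[0] and r[1] <= v),
                    key=lambda r: -r[1])
    if not any(f == v for _, f in inside):
        return None
    reach = v + 1
    for s, f in inside:          # sweep by decreasing finish point
        if f < reach - 1:
            break
        reach = min(reach, s)
    return reach


def max_y(u, v, ranges):
    """Largest finish among ranges [x, y] with x < u <= y < v, else None."""
    ys = [y for x, y in ranges if x < u <= y < v]
    return max(ys) if ys else None


def min_x(u, v, ranges):
    """Smallest start among ranges [x, y] with u < x <= v < y, else None."""
    xs = [x for x, y in ranges if u < x <= v < y]
    return min(xs) if xs else None


def can_insert(r, ranges):
    """Step 3 of Figure 2-6: maxY <= maxP and minX >= minP."""
    u, v = r
    my, mx = max_y(u, v, ranges), min_x(u, v, ranges)
    mp, mn = max_p(u, v, ranges), min_p(u, v, ranges)
    left_ok = my is None or (mp is not None and my <= mp)
    right_ok = mx is None or (mn is not None and mx >= mn)
    return left_ok and right_ok
```

For example, with R = {[0, 31], [4, 14], [6, 14]}, the range [6, 20] passes the test because [6, 14] resolves its overlap with [4, 14]; without [6, 14], it does not.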
In this section, we show how to compute maxP from norm(R). The computation of minP is similar.

2.4.5 A Simple Algorithm to Compute maxP

Figure 2-8 is a high-level description of a simple, though not efficient, algorithm to compute maxP(u, v, R).

Theorem 1 Figure 2-8 correctly computes maxP(u, v, R).

Proof First consider Step 1. From Lemma 17(a), it follows that $\nexists r' \in \mathit{norm}(R)\,[\mathit{start}(r') = u] \Rightarrow \nexists r \in R\,[\mathit{start}(r) = u]$. Therefore, $\nexists r' \in \mathit{norm}(R)\,[\mathit{start}(r') = u] \Rightarrow \nexists\,\mathit{maxP}$. From Lemma 18, it follows that $\mathit{start}(\mathit{full}(r')) \ne \mathit{start}(r') = u \Rightarrow \nexists s \in R\,[\mathit{start}(s) = \mathit{start}(r') = u]$. So, $\mathit{start}(\mathit{full}(r')) \ne u \Rightarrow \nexists\,\mathit{maxP}$. Finally, $u = \mathit{start}(r') = \mathit{start}(\mathit{full}(r'))$ implies $\mathit{finish}(\mathit{full}(r')) = \min\{\mathit{finish}(t) \mid t \in R \wedge \mathit{start}(t) = u\}$ (Lemma 17(1)). So, $\mathit{finish}(\mathit{full}(r')) > v$ implies $\nexists s \in R\,[\mathit{start}(s) = u \wedge \mathit{finish}(s) \le v]$. Hence, $\mathit{start}(r') = u \wedge \mathit{finish}(\mathit{full}(r')) > v \Rightarrow \nexists\,\mathit{maxP}$. Further, when $\exists r' \in \mathit{norm}(R)\,[\mathit{start}(r') = u \wedge \mathit{finish}(\mathit{full}(r')) \le v]$, maxP exists and $\mathit{maxP} \ge \mathit{finish}(\mathit{full}(r')) \ge \mathit{finish}(r')$. Therefore, Step 1 correctly identifies the case when maxP doesn't exist.

We get to Step 2 only when maxP exists. From the definition of maxP, $\exists A \subseteq R\,[\Pi(A) = [u, \mathit{maxP}]]$. From this and Lemma 20(1), we get $\exists B \subseteq \mathit{norm}(R)\,[\Pi(B) = [u, \mathit{maxP}]]$. Now, from Lemma 10, we get $\exists D \subseteq \mathit{norm}(R)\,[\mathit{chain}(D) \wedge \Pi(D) = [u, \mathit{maxP}]]$. From Lemma 11, it follows that D is a subchain of the unique chain $C_i \in \mathit{CP}(\mathit{norm}(R))$ that includes r'. Let $r', s'_1, s'_2, \ldots, s'_q$ be the tail of $C_i$. It follows that maxP is either finish(r') or $\mathit{finish}(s'_j)$ for some j in the range [1, q]. Let j be the least integer such that $\mathit{full}(s'_j) \not\subseteq [u, v]$. If such a j does not exist, then $\mathit{maxP} = \mathit{finish}(s'_q)$, as norm(R) has no subset whose projection equals [u, x] for any $x > \mathit{finish}(s'_q)$. So, assume that j exists. From Lemma 20(2), it follows that $\mathit{maxP} < \mathit{finish}(s'_j)$. Hence, Step 2 correctly determines maxP. ∎

2.4.6 An Efficient Algorithm to Compute maxP

The algorithm of Figure 2-8 takes time $O(\mathit{length}(C_i))$, where $\mathit{length}(C_i)$ is the number of ranges in the chain $C_i \in \mathit{CP}(\mathit{norm}(R))$ that contains r'.
We can reduce this time to $O(\log \mathit{length}(C_i))$ by representing each chain of CP(norm(R)) as a red-black tree (actually, any balanced search tree structure that permits efficient join and split operations may be used). The number of red-black trees we use equals the number of chains in CP(norm(R)). Let $D = (t'_1, \ldots, t'_q)$ be a chain in CP(norm(R)). The red-black tree, RBT(D), for D has one node for each of the ranges $t'_i$. The key value for the node for $t'_i$ is $\mathit{start}(t'_i)$ (equivalently, $\mathit{finish}(t'_i)$ may be used as the search tree key). Each node of RBT(D) has the following four values (in addition to having a $t'_i$ and other values necessary for efficient implementation): minStartLeft, minStartRight, maxFinishLeft, and maxFinishRight. For a node p that has an empty left subtree, $\mathit{minStartLeft} = 2^W - 1$ and maxFinishLeft = 0. Similarly, when p has an empty right subtree, $\mathit{minStartRight} = 2^W - 1$ and maxFinishRight = 0. Otherwise,

$\mathit{minStartLeft} = \min\{\mathit{start}(\mathit{full}(r')) \mid r' \in \mathit{leftSubtree}(p)\}$
$\mathit{minStartRight} = \min\{\mathit{start}(\mathit{full}(r')) \mid r' \in \mathit{rightSubtree}(p)\}$
$\mathit{maxFinishLeft} = \max\{\mathit{finish}(\mathit{full}(r')) \mid r' \in \mathit{leftSubtree}(p)\}$
$\mathit{maxFinishRight} = \max\{\mathit{finish}(\mathit{full}(r')) \mid r' \in \mathit{rightSubtree}(p)\}$

The collection of red-black trees representing norm(R) is augmented by an additional red-black tree, endPointsTree(norm(R)), that represents the end points of the ranges in norm(R). With each point x in endPointsTree, we store a pointer to the node in RBT(D) that represents s' (the range of norm(R) that has x as an end point). Alternatively, we may use a PST, PST3, for the range set $\mathit{chains} = \{[\mathit{start}(C_i), \mathit{finish}(C_i)] \mid C_i \in \mathit{CP}(\mathit{norm}(R))\}$. The points in PST3 are map1(chains); with each point in PST3, we keep a pointer to the root of the RBT for that chain. Note that since range end points are distinct in chains, we do not need to use transform1 as used in PST1. To find an end point d, we first find the smallest chain that covers d by performing the operation $\mathit{minXinRectangle}(d, \infty, d)$ in PST3. Next, we follow the pointer associated with this chain to get to the corresponding RBT.
Finally, a search of this RBT gets us to the RBT node for the s' with the given end point. In the sequel, we assume that endPointsTree, rather than PST3, is used. A parallel discussion is possible for the case when PST3 is used.

To implement Step 1 of Figure 2-8, we search endPointsTree for the point u. If $u \notin \mathit{endPointsTree}$, then $\nexists r' \in \mathit{norm}(R)\,[\mathit{start}(r') = u]$. If $u \in \mathit{endPointsTree}$, then we use the pointer in the node for u to get to the root of the RBT that has r'. A search in this RBT for u locates r'. We may now perform the remaining checks of Step 1 using the data associated with r'.

Suppose that maxP exists. At the start of Step 2, we are positioned at the RBT node that represents r'. This is node 0 of Figure 2-9. We need to find the $s' \in \mathit{norm}(R)$ with least start(s') such that $\mathit{start}(s') > \mathit{finish}(r') \wedge \mathit{full}(s') \not\subseteq [u, v]$. If there is no such s', then maxP = max{finish(root.range), root.maxFinishRight}. If such an s' exists, $\mathit{maxP} = \mathit{start}(s') - 1$.

Figure 2-9: An example RBT

s' may be found in O(height(RBT)) time using a simple search process. We illustrate this process using the tree of Figure 2-9. We begin at node 0. If $[\mathit{minStartRight}, \mathit{maxFinishRight}] \subseteq [u, v]$, then s' is not in the right subtree of node 0. Since node 0 is a right child, s' is not in its parent. So, we back up to node 1 (in general, we back up to the nearest ancestor whose left subtree contains the current node). Let t' be the range in node 1. s' = t' iff $t' \not\subseteq [u, v]$. If $s' \ne t'$, we perform the test $[\mathit{minStartRight}, \mathit{maxFinishRight}] \subseteq [u, v]$ at node 1 to determine whether or not s' is in the right subtree of node 1. If the test is true, we back up to node 2. Otherwise, s' is in the right subtree of node 1. When the right subtree (if any) that contains s' is identified, we make a downward pass in this subtree to locate s'. Figure 2-10 describes this downward pass.

downwardPass(currentNode)
// currentNode is the root of a subtree all of whose ranges start at the right of u.
// This subtree contains s'. Return maxP.
while (true) {
    if ([currentNode.minStartLeft, currentNode.maxFinishLeft] ⊆ [u, v])
        // s' is not in the left subtree
        if (currentNode.range ⊆ [u, v])
            // s' ≠ currentNode; s' must be in the right subtree.
            currentNode = currentNode.rightChild;
        else return (start(currentNode.range) - 1);
    else // s' is in the left subtree
        currentNode = currentNode.leftChild;
}

Figure 2-10: Find s' (and hence maxP) in a subtree known to contain s'

2.4.7 Wrapping Up Insertion of a Range

Now that we have augmented PST1 and PST2 with a collection of RBTs and an endPointsTree, whenever we insert a range r = [u, v] into R, we must update not only PST1 and PST2 as described in Section 2.4.2, but also the RBT collection and endPointsTree. To do this, we first compute $\mathit{chop}(r, R \cup \{r\}) = \mathit{chop}(r, R) = [u', v']$ by first computing minP(u + 1, v) and maxP(u, v - 1) as described in Section 2.4.4. [u', v'] is now easily obtained from the chopping rule. Lemma 21 tells us that the only $s \in R$ whose chop() value may change as a result of the insertion of r is the smallest enclosing range of r. Since $z \in R$ and $r \ne z$, such an s must exist. Rather than search for this s explicitly, we use the conditions of cases (2)-(4) of Lemma 22 to find s' = chop(s, R) in endPointsTree. Note that if $\mathit{chop}(s, R) = \emptyset$, the search in endPointsTree will not find s'; but when $\mathit{chop}(s, R) = \emptyset$, $\mathit{chop}(s, R \cup \{r\}) = \emptyset$. So, no change in chop(s, R) is called for. Note that the insertion of r may combine two chains of CP(norm(R)). In this case, we use the join operation of red-black trees to combine the RBTs corresponding to these two chains.

2.4.8 Wrapping Up Deletion of a Range

When $\mathit{chop}(r, R) = \emptyset$, no changes are to be made to the RBTs and endPointsTree (Lemma 23(1)). So, assume that $\mathit{chop}(r, R) \ne \emptyset$. We first find s, the smallest range that contains r (see Lemma 23(2)). Note that since $z \in R$ and $r \ne z$, s exists. One may verify that s is one of the ranges given by the following two operations:

$\mathit{minXinRectangle}(2^W v - u + 2^W - 1,\ \infty,\ u)$
$\mathit{maxXinRectangle}(0,\ 2^W u + 2^W - 1 - v,\ 2^W - 1 - v)$

where the first operation is done in PST1 and the second in PST2 (both operations are done after transform1(map1([u, v])) has been deleted from PST1 and transform2(map2([u, v])) has been deleted from PST2). The ranges returned by these two operations may be compared to determine which is s.

Once we have identified s, Lemma 23(2) is used to determine $\mathit{chop}(s, R - \{r\})$. Assume that $\mathit{chop}(s, R) \ne \emptyset$. Let chop(r, R) = r' = [u', v'] and chop(s, R) = s' = [x', y']. When s' and r' are in different RBTs (this is the case when $r' \subset s'$), $\mathit{chop}(s, R) = \mathit{chop}(s, R - \{r\})$ and the RBT that contains s' may need to be split into two RBTs. When s' and r' are in the same RBT, they are in the same chain of CP(norm(R)). If s' and r' are adjacent ranges of this chain, we may simply remove the RBT node for r' and update that for s' to reflect its new start or finish point (only one may change). When r' and s' are not adjacent ranges, the nodes for these two ranges are removed from the RBT (this may split the RBT into up to two RBTs) and $\mathit{chop}(s, R - \{r\})$ is inserted. Figure 2-11 shows the different cases.

Figure 2-11: Cases (A)-(D) when s' and r' are in the same chain of CP(norm(R))

2.4.9 Complexity

The portions of the search, insert, and delete algorithms that deal only with PST1 and PST2 have the same asymptotic complexity as their counterparts for the case of nonintersecting ranges (Section 2.3). The portions that deal with the RBTs and endPointsTree require a constant number of search, insert, delete, join, and split operations on these structures.
Since each of these operations takes $O(\log n)$ time on a red-black tree, and since we can update the values minStartLeft, minStartRight, and so on, that are stored in the RBT nodes in the same asymptotic time as taken by an insert/delete/join/split, the overall complexity of our proposed data structure is $O(\log n)$ for each operation when RBPSTs are used for PST1 and PST2. When RPSTs are used, the search complexity is O(W) and the insert and delete complexity is $O(W + \log n) = O(W)$.

2.5 Experimental Results

2.5.1 Prefixes

We programmed our red-black priority-search tree algorithm for prefixes (Section 2.2) in C++ and compared its performance to that of the ACRBT of Sahni et al. [22]. Recall that the ACRBT is the best performing $O(\log n)$ data structure reported in [22] for dynamic prefix-tables. For test data, we used six IPv4 prefix databases obtained from [38]. The number of prefixes in each of these databases, as well as the memory requirements for each database of prefixes using our data structure (PST) of Section 2.2 and the ACRBT structure of Sahni et al. [22], are shown in Table 2-1. The databases Paix1, Pb1, MaeWest, and Aads were obtained on Nov 22, 2001, while Pb2 and Paix2 were obtained Sept. 13, 2000. Figure 2-12 is a plot of the data of Table 2-1. As can be seen, the ACRBT structure takes almost three times as much memory as is taken by the PST structure. Further, the memory requirement of the PST structure can be reduced to about 50% that of our current implementation. This reduction requires an n-node implementation of a priority-search tree as described in [37] rather than our current implementation, which uses 2n - 1 nodes as in [39].

Figure 2-12: Memory usage (memory in KB for PST and ACRBT on each database)
Table 2-1: Memory usage

Database                  Paix1     Pb1  MaeWest    Aads     Pb2   Paix2
Num of Prefixes           16172   22225    28889   31827   35303   85988
Memory (KB)   PST           884    1215     1579    1740    1930    4702
              ACRBT        2417    3331     4327    4769    5305   12851

To obtain the mean time to find the longest matching-prefix (i.e., to perform a search), we started with a PST or ACRBT that contained all prefixes of a prefix database. Next, a random permutation of the set of start points of the ranges corresponding to the prefixes was obtained. This permutation determined the order in which we searched for the longest matching-prefix for each of these start points. The time required to determine all of these longest-matching prefixes was measured and averaged over the number of start points (equal to the number of prefixes). The experiment was repeated 20 times and the mean and standard deviation of the 20 mean times computed. Table 2-2 gives the mean time required to find the longest matching-prefix on a Sun Blade 100 workstation that has a 500MHz UltraSPARC-IIe processor and a 256KB L2 cache. The standard deviation in the mean time is also given in this table. On our Sun workstation, finding the longest matching-prefix takes about 10% to 14% less time using an ACRBT than a PST.
Table 2-2: Prefix times on a 500MHz Sun Blade 100 workstation

Database                   Paix1     Pb1  MaeWest    Aads     Pb2   Paix2
Search   PST     Mean       2.88    3.06     3.25    3.31    3.43    4.06
(μsec)           Std        0.36    0.18     0.17    0.16    0.09    0.05
         ACRBT   Mean       2.60    2.77     2.87    2.87    3.09    3.51
                 Std        0.25    0.16     0.16    0.12    0.13    0.04
Insert   PST     Mean       3.90    4.45     4.83    5.18    5.14    6.04
(μsec)           Std        0.57    0.63     0.51    0.48    0.19    0.20
         ACRBT   Mean      21.15   23.42    24.77   25.36   25.54   28.07
                 Std        1.11    0.66     0.38    0.29    0.19    0.18
Delete   PST     Mean       4.36    4.45     4.73    4.71    5.06    5.48
(μsec)           Std        0.91    0.63     0.53    0.00    0.19    0.16
         ACRBT   Mean      21.24   22.68    23.16   23.71   24.56   25.64
                 Std        0.95    0.55     0.49    0.35    0.26    0.21

To obtain the mean time to insert a prefix, we started with a random permutation of the prefixes in a database, inserted the first 67% of the prefixes into an initially empty data structure, measured the time to insert the remaining 33%, and computed the mean insert time by dividing by the number of prefixes in 33% of the database. This experiment was repeated 20 times and the mean of the mean as well as the standard deviation in the mean computed. These latter two quantities are given in Table 2-2 for our Sun workstation. As can be seen, insertions into a PST take between about 18% and 22% of the time to insert into an ACRBT.

The mean and standard deviation data reported in Table 2-2 for the delete operation were obtained in a similar fashion by starting with a data structure that had 100% of the prefixes in the database and measuring the time to delete a randomly selected 33% of these prefixes. Deletion from a PST takes about 20% of the time required to delete from an ACRBT.

Tables 2-3 and 2-4 give the corresponding times on a 700MHz Pentium III PC and a 1.4GHz Pentium 4 PC, respectively. Both computers have a 256KB L2 cache. The run times on our 700MHz Pentium III are about one-half the times on our Sun workstation. Surprisingly, when going from the 700MHz Pentium III to the 1.4GHz Pentium 4, the measured time to find the longest matching-prefix decreased by only about 5% for PST. More surprisingly, the corresponding times for ACRBT actually increased. The net result of the slight decrease in time for PST and the increase for ACRBT is that, on our Pentium 4 PC, the PST is faster than the ACRBT on all three operations: find longest matching-prefix, insert, and delete. This somewhat surprising behavior is due to architectural differences (e.g., differences in width and size of L1 cache lines) between the Pentium III and 4 processors.

Table 2-3: Prefix times on a 700MHz Pentium III PC

Database                   Paix1     Pb1  MaeWest    Aads     Pb2   Paix2
Search   PST     Mean       1.39    1.54     1.61    1.65    1.70    1.97
(μsec)           Std        0.27    0.22     0.17    0.14    0.00    0.04
         ACRBT   Mean       1.36    1.44     1.44    1.49    1.54    1.80
                 Std        0.25    0.18     0.13    0.14    0.14    0.06
Insert   PST     Mean       2.41    2.63     2.60    2.83    2.80    3.07
(μsec)           Std        0.87    0.30     0.53    0.43    0.40    0.14
         ACRBT   Mean      11.97   12.63    13.48   13.62   13.77   14.93
                 Std        0.95    0.67     0.24    0.48    0.35    0.18
Delete   PST     Mean       2.32    2.38     2.49    2.45    2.55    2.91
(μsec)           Std        0.82    0.61     0.52    0.47    0.00    0.17
         ACRBT   Mean      11.69   12.55    12.95   13.01   13.40   14.10
                 Std        0.87    0.63     0.54    0.44    0.48    0.16

Figures 2-13, 2-14, and 2-15 histogram the search, insert, and delete time data of the preceding tables.

Table 2-4: Prefix times on a 1.4GHz Pentium 4 PC

Database                   Paix1     Pb1  MaeWest    Aads     Pb2   Paix2
Search   PST     Mean       1.30    1.44     1.51    1.52    1.63    1.92
(μsec)           Std        0.19    0.18     0.17    0.13    0.13    0.06
         ACRBT   Mean       1.48    1.69     1.83    1.87    1.87    2.24
                 Std        0.31    0.20     0.16    0.07    0.14    0.05
Insert   PST     Mean       1.76    1.96     2.18    2.17    2.38    2.65
(μsec)           Std        0.41    0.69     0.00    0.44    0.35    0.18
         ACRBT   Mean      11.22   11.81    12.41   12.91   12.92   13.94
                 Std        0.41    0.60     0.41    0.44    0.26    0.18
Delete   PST     Mean       1.76    1.69     1.92    1.93    2.00    2.22
(μsec)           Std        0.41    0.60     0.38    0.21    0.42    0.17
         ACRBT   Mean       9.46   10.39    10.54   10.42   10.92   11.64
                 Std        0.57    0.63     0.38    0.21    0.42    0.16

Figure 2-13: Time for searching longest matching prefix. (A) Sun. (B) Pentium 700MHz. (C) Pentium 1.4GHz

Figure 2-14: Time for inserting a prefix. (A) Sun. (B) Pentium 700MHz.
(C) Pentium 1.4GHz

Figure 2-15: Time for deleting a prefix. (A) Sun. (B) Pentium 700MHz. (C) Pentium 1.4GHz

2.5.2 Nonintersecting Ranges

To benchmark our algorithm for nonintersecting ranges (Section 2.3), we generated three different sets of random¹ nonintersecting ranges. These, respectively, had 30000, 50000, and 80000 ranges. Table 2-5 gives the memory requirement as well as the mean times and standard deviations for search, insert, and delete. The run times are for our 700MHz Pentium III PC. The search, insert, and delete experiments were modeled after those conducted for the case of prefix databases.

¹ We resorted to randomly generated data sets because no benchmark data for nonintersecting ranges was available.

Table 2-5: Nonintersecting Ranges. 700MHz PIII

Num of Ranges              30000   50000   80000
Memory Usage (KB)           3360    5600    8960
Search (μsec)   Mean        1.92    2.19    2.51
                Std         0.15    0.04    0.06
Insert (μsec)   Mean        8.65    9.27    9.88
                Std         0.49    0.29    0.17
Remove (μsec)   Mean        5.75    6.42    6.81
                Std         0.44    0.28    0.14

2.5.3 Conflict-free Ranges

Table 2-6 gives the memory required as well as the mean times and standard deviations for the case of conflict-free ranges. The range sequence used is generated so that when the ranges are inserted in sequence order, there are no conflicts. For deletion, 33% of the ranges are removed in the reverse of the insert order.

Table 2-6: Conflict-free Ranges. PIII 700MHz with 256K L2 cache

Num of Ranges in R                  30000   50000   80000
Num of Ranges in norm(R)   Mean     29688   48868   76472
                           Std      18.03   42.90   60.05
Memory Usage (KB)          Mean      6240    9979   15219
                           Std       7.06   10.91   11.19
Search (μsec)              Mean      1.98    2.34    2.69
                           Std       0.07    0.09    0.06
Insert (μsec)              Mean     18.45   19.65   20.76
                           Std       0.51    0.27    0.27
Remove (μsec)              Mean      19.3   20.49   21.60
                           Std       0.41    0.13    0.29

2.6 Conclusion

We have developed data structures for dynamic router tables. Our data structures permit one to search, insert, and delete in $O(\log n)$ time each. Although $O(\log n)$ time data structures for prefix tables were known prior to our work [21, 22], our data structure is more memory efficient than the data structures of Sahni et al. [21, 22]. Further, our data structure is significantly superior on the insert and delete operations, while being competitive on the search operation. For nonintersecting ranges and conflict-free ranges, our data structures are the first to permit $O(\log n)$ search, insert, and delete.

CHAPTER 3
DYNAMIC IP ROUTER TABLES USING HIGHEST-PRIORITY MATCHING

In this chapter, we focus on data structures for dynamic NHPRTs, HPPTs, and LMPTs. In Section 3.2, we develop the data structure binary tree on binary tree (BOB). This data structure is proposed for the representation of dynamic NHPRTs. Using BOB, a lookup takes $O(\log^2 n)$ time and cache misses; a new rule may be inserted and an old one deleted in $O(\log n)$ time and cache misses. For HPPTs, we propose a modified version of BOB, PBOB (prefix BOB), in Section 3.3. Using PBOB, a lookup, rule insertion, and deletion each take O(W) time and cache misses. In Section 3.4, we develop the data structure LMPBOB (longest matching-prefix BOB) for LMPTs. Using LMPBOB, the longest matching-prefix may be found in O(W) time and $O(\log n)$ cache misses; rule insertion and deletion each take $O(\log n)$ time and cache misses. On practical rule tables, BOB and PBOB perform each of the three dynamic-table operations in $O(\log n)$ time and with $O(\log n)$ cache misses. Section 3.1 introduces some terminology, and experimental results are presented in Section 3.6.
3.1 Preliminaries

Definition 13 A range r = [u, v] is a pair of addresses u and v, $u \le v$. The range r represents the addresses {u, u + 1, ..., v}. start(r) = u is the start point of the range and finish(r) = v is the finish point of the range. The range r matches all addresses d such that $u \le d \le v$.

The start point of the range r = [3, 9] is 3 and its finish point is 9. This range matches the addresses {3, 4, 5, 6, 7, 8, 9}. In IPv4, u and v are up to 32 bits long, and in IPv6, u and v may be up to 128 bits long. The IPv4 prefix P = 0* corresponds to the range $[0, 2^{31} - 1]$. The range [3, 9] does not correspond to any single IPv4 prefix. We may draw the range r = [u, v] = {u, u + 1, ..., v} as a horizontal line that begins at u and ends at v. Figure 2-1 shows ranges drawn in this fashion.

Notice that every prefix of a prefix router-table may be represented as a range. For example, when W = 6, the prefix P = 1101* matches addresses in the range [52, 55]. So, we say P = 1101* = [52, 55], start(P) = 52, and finish(P) = 55.

Since a range represents a set of (contiguous) points, we may use standard set operations and relations such as $\cap$ and $\subset$ when dealing with ranges. So, for example, $[2, 6] \cap [4, 8] = [4, 6]$. Note that some operations between ranges may not yield a range. For example, $[2, 6] \cup [8, 10] = \{2, 3, 4, 5, 6, 8, 9, 10\}$, which is not a range.

Definition 14 Let r = [u, v] and s = [x, y] be two ranges. Let $\mathit{overlap}(r, s) = r \cap s$.
(a) The predicate disjoint(r, s) is true iff r and s are disjoint.
    $\mathit{disjoint}(r, s) \iff \mathit{overlap}(r, s) = \emptyset \iff v < x \vee y < u$
    Figure 2-1(A) shows the two cases for disjoint sets.
(b) The predicate nested(r, s) is true iff one of the ranges is contained within the other.
    $\mathit{nested}(r, s) \iff \mathit{overlap}(r, s) = r \vee \mathit{overlap}(r, s) = s \iff r \subseteq s \vee s \subseteq r \iff x \le u \le v \le y \vee u \le x \le y \le v$
    Figure 2-1(B) shows the two cases for nested sets.
(c) The predicate intersect(r, s) is true iff r and s have a nonempty intersection that is different from both r and s.
    $\mathit{intersect}(r, s) \iff r \cap s \ne \emptyset \wedge r \cap s \ne r \wedge r \cap s \ne s \iff \neg\mathit{disjoint}(r, s) \wedge \neg\mathit{nested}(r, s) \iff u < x \le v < y \vee x < u \le y < v$
    Figure 2-1(C) shows the two cases for ranges that intersect. Notice that overlap(r, s) = [x, v] when $u < x \le v < y$ and overlap(r, s) = [u, y] when $x < u \le y < v$.

[2, 4] and [6, 9] are disjoint; [2, 4] and [3, 4] are nested; [2, 4] and [2, 2] are nested; [2, 8] and [4, 6] are nested; [2, 4] and [4, 6] intersect; and [3, 8] and [2, 4] intersect. [4, 4] is the overlap of [2, 4] and [4, 6]; and overlap([3, 8], [2, 4]) = [3, 4].

Lemma 24 Let r and s be two ranges. Exactly one of the following is true.
1. disjoint(r, s)
2. nested(r, s)
3. intersect(r, s)

Proof Straightforward. ∎

Definition 15 The range set R is nonintersecting iff $\mathit{disjoint}(r, s) \vee \mathit{nested}(r, s)$ for every pair of ranges r and $s \in R$.

Definition 16 The range r is more specific than the range s iff $r \subset s$.

[2, 4] is more specific than [1, 6], and [5, 9] is more specific than [5, 12]. Since [2, 4] and [8, 14] are disjoint, neither is more specific than the other. Also, since [4, 14] and [6, 20] intersect, neither is more specific than the other.

Definition 17 Let R be a range set. ranges(d, R) (or simply ranges(d) when R is implicit) is the subset of ranges of R that match the destination address d. msr(d, R) (or msr(d)) is the most specific range of R that matches d. That is, msr(d) is the most specific range in ranges(d). msr([u, v], R) = msr(u, v, R) = r iff msr(d, R) = r, $u \le d \le v$. When R is implicit, we write msr(u, v) and msr([u, v]) in place of msr(u, v, R) and msr([u, v], R). hpr(d) is the highest-priority range in ranges(d). We assume that ranges are assigned priorities in such a way that hpr(d) is uniquely defined for every d.

When R = {[2, 4], [1, 6]}, ranges(3) = {[2, 4], [1, 6]}, msr(3) = [2, 4], msr(1) = [1, 6], $\mathit{msr}(7) = \emptyset$, and msr(5, 6) = [1, 6]. When R = {[4, 14], [6, 20], [6, 14], [8, 12]}, msr(4, 5) = [4, 14], msr(6, 7) = [6, 14], msr(8, 12) = [8, 12], msr(13, 14) = [6, 14], and msr(15, 20) = [6, 20].

Definition 18 Let r and s be two ranges.
$r < s \iff \mathit{start}(r) < \mathit{start}(s) \vee (\mathit{start}(r) = \mathit{start}(s) \wedge \mathit{finish}(r) > \mathit{finish}(s))$.

Note that for every pair, r and s, of different ranges, either r < s or s < r.

Lemma 25 Let R be a nonintersecting range set. If $r \cap s \ne \emptyset$ for $r, s \in R$, then the following are true:
1. $\mathit{start}(r) < \mathit{start}(s) \Rightarrow \mathit{finish}(r) \ge \mathit{finish}(s)$.
2. $\mathit{finish}(r) > \mathit{finish}(s) \Rightarrow \mathit{start}(r) \le \mathit{start}(s)$.

Proof Straightforward. ∎

3.2 Nonintersecting Highest-Priority Rule-Tables (NHRTs): BOB

3.2.1 The Data Structure

The data structure binary tree on binary tree (BOB) that is being proposed here for NHRTs comprises a single balanced binary search tree at the top level. This top-level balanced binary search tree is called the point search tree (PTST). For an n-rule NHRT, the PTST has at most 2n nodes (we call this the PTST size constraint). The size constraint is necessary to enable $O(\log n)$ update. With each node z of the PTST, we associate a point, point(z). The PTST is a standard red-black binary search tree (actually, any binary search tree structure that supports efficient search, insert, and delete may be used) on the point(z) values of its node set [24]. That is, for every node z of the PTST, nodes in the left subtree of z have smaller point values than point(z), and nodes in the right subtree of z have larger point values than point(z).

Let R be the set of nonintersecting ranges of the NHRT. Each range of R is stored in exactly one of the nodes of the PTST. More specifically, the root of the PTST stores all ranges $r \in R$ such that $\mathit{start}(r) \le \mathit{point}(\mathit{root}) \le \mathit{finish}(r)$; all ranges $r \in R$ such that $\mathit{finish}(r) < \mathit{point}(\mathit{root})$ are stored in the left subtree of the root; all ranges $r \in R$ such that $\mathit{point}(\mathit{root}) < \mathit{start}(r)$ (i.e., the remaining ranges of R) are stored in the right subtree of the root. The ranges allocated to the left and right subtrees of the root are allocated to nodes in these subtrees using the just stated range allocation rule recursively.
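The range allocation rule can be sketched as a walk down the PTST that stops at the first node whose point lies inside the range. The sketch below is only illustrative: it uses a plain, hand-built binary search tree rather than a red-black tree, and the class name, the function name, and the particular point values are assumptions chosen for the example.

```python
# Sketch of the range allocation rule on a hand-built (unbalanced) point search tree.

class PtstNode:
    def __init__(self, point, left=None, right=None):
        self.point = point
        self.left = left
        self.right = right
        self.ranges = []  # ranges(z): the ranges allocated to this node


def allocate(root, r):
    """Store range r = (start, finish) in the first node on the root-to-leaf
    path whose point satisfies start <= point <= finish; return that node."""
    start, finish = r
    z = root
    while z is not None:
        if finish < z.point:      # range lies entirely to the left of point(z)
            z = z.left
        elif z.point < start:     # range lies entirely to the right of point(z)
            z = z.right
        else:                     # start <= point(z) <= finish
            z.ranges.append(r)
            return z
    raise ValueError("no PTST node has a point inside the range")


# Assumed example tree: 70 at the root, 30 and 80 below it, 2 and 65 below 30.
root = PtstNode(70,
                left=PtstNode(30, left=PtstNode(2), right=PtstNode(65)),
                right=PtstNode(80))

placed = {r: allocate(root, r).point
          for r in [(2, 100), (8, 68), (2, 4), (54, 66), (69, 72), (80, 80)]}
```

A range such as (90, 95) would raise the ValueError here, which corresponds to the requirement that every range must contain the point of at least one PTST node.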
Note that the range allocation rule is quite similar to that used for interval trees [40]. For the range allocation rule to successfully allocate every r ∈ R to exactly one node of the PTST, the PTST must have at least one node z for which start(r) ≤ point(z) ≤ finish(r). Table 3-1 gives an example set of nonintersecting ranges, and Figure 3-1 shows a possible PTST for this set of ranges (we say possible, because we haven't specified how to select the point(z) values and even with specified point(z) values, the corresponding red-black tree isn't unique). The number inside each node is point(z), and outside each node, we give ranges(z).

Figure 3-1: A possible PTST. The root has point 70 and holds ([2, 100], 4) and ([69, 72], 10); the node with point 30 holds ([8, 68], 10), ([8, 50], 9), ([10, 50], 20), ([10, 35], 3), ([15, 33], 5), and ([16, 30], 30); the node with point 80 holds ([80, 80], 12); the node with point 2 holds ([2, 4], 33) and ([2, 3], 34); and the remaining node holds ([54, 66], 18) and ([60, 65], 7).

Table 3-1: A nonintersecting range set

range       priority
[2, 100]    4
[2, 4]      33
[2, 3]      34
[8, 68]     10
[8, 50]     9
[10, 50]    20
[10, 35]    3
[15, 33]    5
[16, 30]    30
[54, 66]    18
[60, 65]    7
[69, 72]    10
[80, 80]    12

Let ranges(z) be the subset of ranges of R allocated to node z of the PTST (footnote 1). Since the PTST may have as many as 2n nodes and since each range of R is in exactly one of the sets ranges(z), some of the ranges(z) sets may be empty.

The ranges in ranges(z) may be ordered using the < relation of Definition 18. Using this < relation, we put the ranges of ranges(z) into a red-black tree (any balanced binary search tree structure that supports efficient search, insert, delete, join, and split may be used) called the range search-tree or RST(z). Each node x of RST(z) stores exactly one range of ranges(z). We refer to this range as range(x). Every node y in the left (right) subtree of node x of RST(z) has range(y) < range(x) (range(y) > range(x)).
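As a small illustration of the ordering used inside an RST, the following C++ sketch implements the < relation of Definition 18 as a comparator and sorts the ranges that Table 3-1 allocates to the node with point 30. The names `Range`, `before`, and `rstOrder` are hypothetical, and a sorted vector stands in for the in-order sequence of a red-black tree.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Range {
    unsigned start;
    unsigned finish;
    int priority;
};

// Definition 18: r < s iff start(r) < start(s), or the starts are
// equal and finish(r) > finish(s). Outer ranges therefore come first.
bool before(const Range& r, const Range& s) {
    if (r.start != s.start) return r.start < s.start;
    return r.finish > s.finish;
}

// Returns the ranges in the order in which an in-order traversal of
// an RST built on them would visit them.
std::vector<Range> rstOrder(std::vector<Range> ranges) {
    for (const Range& r : ranges) assert(r.start <= r.finish);
    std::sort(ranges.begin(), ranges.end(), before);
    return ranges;
}
```

For the six ranges that match point 30, the resulting order is [8, 68], [8, 50], [10, 50], [10, 35], [15, 33], [16, 30]: each range is nested within the ones before it.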
In addition, each node x stores the quantity mp(x), which is the maximum of the priorities of the ranges associated with the nodes in the subtree rooted at x. mp(x) may be defined recursively as below.

$$mp(x) = \begin{cases} p(x) & \text{if } x \text{ is a leaf} \\ \max\{mp(\mathrm{leftChild}(x)),\ mp(\mathrm{rightChild}(x)),\ p(x)\} & \text{otherwise} \end{cases}$$

where p(x) = priority(range(x)). Figure 3-2 gives a possible RST structure for ranges(30) of Figure 3-1. Each node shows (range(x), p(x), mp(x)).

Footnote 1: We have overloaded the function ranges. When u is a node, ranges(u) refers to the ranges stored in node u of a PTST; when u is a destination address, ranges(u) refers to the ranges that match u.

Figure 3-2: An example RST for ranges(30) of Figure 3-1. The root is ([10, 35], 3, 30). Its left child is ([8, 50], 9, 20), whose children are ([8, 68], 10, 10) and ([10, 50], 20, 20). Its right child is ([15, 33], 5, 30), whose only child is ([16, 30], 30, 30).

Lemma 26 Let z be a node in a PTST and let x be a node in RST(z). Let st(x) = start(range(x)) and fn(x) = finish(range(x)).
1. For every node y in the right subtree of x, st(y) ≥ st(x) and fn(y) ≤ fn(x).
2. For every node y in the left subtree of x, st(y) ≤ st(x) and fn(y) ≥ fn(x).
Proof For 1, we see that when y is in the right subtree of x, range(y) > range(x). From Definition 18, it follows that st(y) ≥ st(x). Further, since range(y) ∩ range(x) ≠ ∅ (both ranges match point(z)), if st(y) > st(x), then fn(y) ≤ fn(x) (Lemma 25); if st(y) = st(x), fn(y) < fn(x) (Definition 18). The proof for 2 is similar. ∎

3.2.2 Search for hpr(d)

The highest-priority range that matches the destination address d may be found by following a path from the root of the PTST toward a leaf of the PTST. Figure 3-3 gives the algorithm. For simplicity, this algorithm finds hp = priority(hpr(d)) rather than hpr(d). The algorithm is easily modified to return hpr(d) instead.
Algorithm hp(d) {
   // return the priority of hpr(d)
   // easily extended to return hpr(d)
   hp = -1; // assuming 0 is the smallest priority value
   z = root; // root of PTST
   while (z != null) {
      if (d > point(z)) {
         RST(z)->hpRight(d, hp);
         z = rightChild(z);
      }
      else if (d < point(z)) {
         RST(z)->hpLeft(d, hp);
         z = leftChild(z);
      }
      else // d == point(z)
         return max{hp, mp(RST(z)->root)};
   }
   return hp;
}
Figure 3-3: Algorithm to find priority(hpr(d))

We begin by initializing hp = -1, and z is set to the root of the PTST. This initialization assumes that all priorities are ≥ 0. The variable z is used to follow a path from the root toward a leaf. When d > point(z), d may be matched only by ranges in RST(z) and those in the right subtree of z. The method RST(z)->hpRight(d, hp) (Figure 3-4) updates hp to reflect any matching ranges in RST(z). This method makes use of the fact that d > point(z). Consider a node x of RST(z). If d > fn(x), then d is to the right (i.e., d > finish(range(x))) of range(x) and also to the right of all ranges in the right subtree of x. Hence, we may proceed to examine the ranges in the left subtree of x. When d ≤ fn(x), range(x) as well as all ranges in the left subtree of x match d. Additional matching ranges may be present in the right subtree of x. hpLeft(d, hp) is the analogous method for the case when d < point(z).

Algorithm hpRight(d, hp) {
   // update hp to account for any ranges in RST(z) that match d
   // d > point(z)
   x = root; // root of RST(z)
   while (x != null)
      if (d > fn(x))
         x = leftChild(x);
      else {
         hp = max{hp, p(x), mp(leftChild(x))};
         x = rightChild(x);
      }
}
Figure 3-4: Algorithm hpRight(d, hp)

Complexity The complexity of the invocation RST(z)->hpRight(d, hp) is readily seen to be O(height(RST(z))) = O(log n). Consequently, the complexity of hp(d) is O(log² n). To determine hpr(d) we need only add code to the methods hp(d), hpRight(d, hp), and hpLeft(d, hp) so as to keep track of the range whose priority is the current value of hp.
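The two RST searches can be sketched in C++ as follows. This is an illustration under stated assumptions rather than the dissertation's code: the RST is a plain array of nodes with child indices (no red-black balancing), `kNoPriority` plays the role of the initial value -1, and `hpLeft` is written by symmetry with hpRight (it compares d with st(x) instead of fn(x)), since the text describes it only as the analogous method. `figureRst` rebuilds the tree of Figure 3-2.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One RST node; left/right are indices into the node array, -1 = none.
struct RstNode {
    unsigned st, fn;  // start and finish of range(x)
    int p;            // priority of range(x)
    int mp;           // maximum priority in the subtree rooted at x
    int left, right;
};

const int kNoPriority = -1;  // assumes all real priorities are >= 0

int mpOf(const std::vector<RstNode>& t, int x) {
    return x < 0 ? kNoPriority : t[x].mp;
}

// Case d > point(z): every range in RST(z) starts at or before
// point(z), so range(x) matches d exactly when d <= fn(x).
int hpRight(const std::vector<RstNode>& t, int root, unsigned d, int hp) {
    assert(root < static_cast<int>(t.size()));
    for (int x = root; x >= 0;) {
        if (d > t[x].fn) {
            x = t[x].left;  // x and its right subtree end before d
        } else {
            // x and its whole left subtree match d (Lemma 26)
            hp = std::max({hp, t[x].p, mpOf(t, t[x].left)});
            x = t[x].right;
        }
    }
    return hp;
}

// Case d < point(z): symmetric, range(x) matches d exactly when
// st(x) <= d (assumed form of the analogous method).
int hpLeft(const std::vector<RstNode>& t, int root, unsigned d, int hp) {
    assert(root < static_cast<int>(t.size()));
    for (int x = root; x >= 0;) {
        if (d < t[x].st) {
            x = t[x].left;  // x and its right subtree start after d
        } else {
            hp = std::max({hp, t[x].p, mpOf(t, t[x].left)});
            x = t[x].right;
        }
    }
    return hp;
}

// The RST of Figure 3-2 for ranges(30); index 0 is the root.
std::vector<RstNode> figureRst() {
    std::vector<RstNode> t;
    t.push_back(RstNode{10, 35, 3, 30, 1, 2});
    t.push_back(RstNode{8, 50, 9, 20, 3, 4});
    t.push_back(RstNode{15, 33, 5, 30, -1, 5});
    t.push_back(RstNode{8, 68, 10, 10, -1, -1});
    t.push_back(RstNode{10, 50, 20, 20, -1, -1});
    t.push_back(RstNode{16, 30, 30, 30, -1, -1});
    return t;
}
```

For example, with d = 40 (to the right of point 30) the matching ranges are [8, 68], [8, 50], and [10, 50], so hpRight yields 20; with d = 16 (to the left) all six ranges match and hpLeft yields 30.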
So, hpr(d) may be found in O(log² n) time also.

3.2.3 Insert a Range

A range r that is known to have no intersection with any of the existing ranges in the router table may be inserted using the algorithm of Figure 3-5. In the while loop, we find the node z nearest to the root such that r matches point(z) (i.e., start(r) ≤ point(z) ≤ finish(r)). If such a z exists, the range r is inserted into RST(z) using the standard red-black insertion algorithm [24]. During this insertion, it is necessary to update some of the mp values on the insert path. This update is done easily. In case the PTST has no z such that r matches point(z), we insert a new node into the PTST. This insertion is done using the method insertNewNode.

Algorithm insert(r) {
   // insert the nonintersecting range r
   z = root; // root of PTST
   while (z != null)
      if (finish(r) < point(z))
         z = leftChild(z);
      else if (start(r) > point(z))
         z = rightChild(z);
      else { // r matches point(z)
         RST(z)->insert(r);
         return;
      }
   // there is no node z such that r matches point(z)
   // insert a new node into PTST
   insertNewNode(r);
}
Figure 3-5: Algorithm to insert a nonintersecting range

To insert a new node into the PTST, we first create a new PTST node y and define point(y) and RST(y). point(y) may be set to any destination address matched by r (i.e., any address such that start(r) ≤ point(y) ≤ finish(r) may be used). In our implementation, we use point(y) = start(r). RST(y) has only a root node and this root contains r; its mp value is priority(r). If the PTST is currently empty, y becomes the new root and we are done. Otherwise, the new node y may be inserted where the search conducted in the while loop of Figure 3-5 terminated, that is, as a child of the last non-null value of z. Following this insertion, the traditional bottom-up red-black rebalancing pass is made [24]. This rebalancing pass may require color changes and at most one rotation. Color changes do not affect the tree structure.
However, a rebalancing rotation, if performed, affects the tree structure and may lead to a violation of the range allocation rule. Rebalancing rotations are investigated in the next section. We note that if the number of nodes in the PTST was at most 2|R|, where |R| is the number of ranges prior to the insertion of a new range r, then following the insertion, |PTST| ≤ 2|R| + 1 < 2(|R| + 1), where |PTST| is the number of nodes in the PTST and |R| + 1 is the number of ranges following the insertion of r. Hence an insert does not violate the PTST size constraint.

Complexity Exclusive of the time required to perform the tasks associated with a rebalancing rotation, the time required to insert a range is O(height(PTST)) = O(log n). As we shall see in the next section, a rebalancing rotation can be done in O(log n) time. Since at most one rebalancing rotation is needed following an insert, the time to insert a range is O(log n). In case it is necessary for us to verify that the range to be inserted does not intersect an existing range, we can augment the PTST with priority search trees as in [34] and use these trees for intersection detection. The overall complexity of an insert remains O(log n).

3.2.4 Red-Black-Tree Rotations

Figures 3-6 and 3-7, respectively, show the red-black LL and RR rotations used to rebalance a red-black tree following an insert or delete (see [24]). In these figures, pt() is an abbreviation for point(). Since the remaining rotation types, LR and RL, may, respectively, be viewed as an RR rotation followed by an LL rotation and an LL rotation followed by an RR rotation, it suffices to examine LL and RR rotations alone.

Figure 3-6: LL rotation. Before the rotation, x is the subtree root and y is its left child; a and b are the subtrees of y, and c is the right subtree of x. After the rotation, y is the subtree root with left subtree a and right child x, whose subtrees are b and c.

Figure 3-7: RR rotation. Before the rotation, x is the subtree root and y is its right child; a is the left subtree of x, and b and c are the subtrees of y. After the rotation, y is the subtree root with left child x, whose subtrees are a and b, and right subtree c.

Lemma 27 Let R be a set of nonintersecting ranges. Let ranges(z) ⊆ R be the ranges allocated by the range allocation rule to node z of the PTST prior to an LL or RR rotation.
Let ranges'(z) be this subset for the PTST node z following the rotation. ranges(z) = ranges'(z) for all nodes z in the subtrees a, b, and c of Figures 3-6 and 3-7.
Proof Consider an LL rotation. Let ranges(subtree(x)) be the union of the ranges allocated to the nodes in the subtree whose root is x. Since the range allocation rule allocates each range r to the node z nearest the root such that r matches point(z), ranges(subtree(x)) = ranges'(subtree(y)). Further, r ∈ ranges(a) if r ∈ ranges(subtree(x)) and finish(r) < point(y). Consequently, r ∈ ranges'(a). From this and the fact that the LL rotation doesn't change the positioning of nodes in a, it follows that for every node z in the subtree a, ranges(z) = ranges'(z). The proof for the nodes in b and c as well as for the RR rotation is similar. ∎

Let x and y be as in Figures 3-6 and 3-7. From Lemma 27, it follows that ranges(z) = ranges'(z) for all z in the PTST except possibly for z ∈ {x, y}. It is not too difficult to see that ranges'(y) = ranges(y) ∪ S and ranges'(x) = ranges(x) − S, where

$$S = \{\, r \mid r \in \mathrm{ranges}(x) \,\wedge\, \mathrm{start}(r) \le \mathrm{point}(y) \le \mathrm{finish}(r) \,\}$$

Since we are dealing with a set of nonintersecting ranges, all ranges in ranges(y) are nested within the ranges of S. Figure 3-8 shows the ranges of ranges(x) using solid lines and those of ranges(y) using broken lines. S is the set of ranges drawn above ranges(y) (i.e., the solid lines above the broken lines).

The range rMax of S with largest start() value may be found by searching RST(x) for the range with largest start() value that matches point(y). (Note that rMax = msr(point(y), ranges(x)).) Since RST(x) is a binary search tree of an ordered set of ranges (Definition 18), rMax may be found in O(height(RST(x))) time by following a path from the root downward. If rMax doesn't exist, S = ∅, ranges'(x) = ranges(x), and ranges'(y) = ranges(y).
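The redistribution of ranges between x and y can be sketched in C++ with sorted vectors standing in for the two RSTs. This is only an illustration of which ranges move: it takes linear rather than logarithmic time, and the names `moveMatching`, `before`, and `matches` are hypothetical. The sketch relies on the fact that the ranges of ranges(x) all match point(x) and are therefore nested, so in Definition 18 order the members of S form a leading run that ends at rMax.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Range {
    unsigned start, finish;
    int priority;
};

// The < relation of Definition 18.
bool before(const Range& r, const Range& s) {
    if (r.start != s.start) return r.start < s.start;
    return r.finish > s.finish;
}

bool matches(const Range& r, unsigned point) {
    return r.start <= point && point <= r.finish;
}

// Moves S = { r in rangesX : r matches pointY } from rangesX to the
// front of rangesY and returns |S|. Both vectors must be sorted by
// Definition 18 and hold nonintersecting ranges that match point(x)
// and point(y), respectively.
std::size_t moveMatching(std::vector<Range>& rangesX,
                         std::vector<Range>& rangesY,
                         unsigned pointY) {
    assert(std::is_sorted(rangesX.begin(), rangesX.end(), before));
    // The last element before endOfS, if any, is rMax.
    auto endOfS = std::partition_point(
        rangesX.begin(), rangesX.end(),
        [pointY](const Range& r) { return matches(r, pointY); });
    std::size_t moved = static_cast<std::size_t>(endOfS - rangesX.begin());
    // ranges'(y) = S followed by ranges(y); every range of ranges(y)
    // is nested within the ranges of S, so the order is preserved.
    rangesY.insert(rangesY.begin(), rangesX.begin(), endOfS);
    // ranges'(x) = ranges(x) - S.
    rangesX.erase(rangesX.begin(), rangesX.begin() + moved);
    assert(std::is_sorted(rangesY.begin(), rangesY.end(), before));
    return moved;
}
```

Taking x as the node with point 70 and y as the node with point 30 from Table 3-1, S = {[2, 100]}: after the call, the vector for y starts with [2, 100] followed by its six original ranges, and the vector for x keeps only [69, 72].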
Figure 3-8: ranges(x) and ranges(y) for LL and RR rotations. Nodes x and y are as in Figures 3-6 and 3-7.

Assume that rMax exists. We may use the split operation [24] to extract from RST(x) the ranges that belong to S. The operation RST(x)->split(small, rMax, big) separates RST(x) into an RST small of ranges that are less than rMax (under the < relation of Definition 18) and an RST big of ranges that are greater than rMax. We see that RST'(x) = big and RST'(y) = join(small, rMax, RST(y)), where join [24] combines the red-black tree small with ranges < rMax, the range rMax, and the red-black tree RST(y) with ranges > rMax into a single red-black tree.

The standard split and join operations of Horowitz et al. [24] need to be modified slightly so as to update the mp values of affected nodes. This modification doesn't affect the asymptotic complexity of the split and join operations, which is logarithmic in the number of nodes in the tree being split or logarithmic in the sum of the number of nodes in the two trees being joined. So, the complexity of performing an LL or RR rotation (and hence of performing an LR or RL rotation) in the PTST is O(log n).

3.2.5 Delete a Range

Figure 3-9 gives our algorithm to delete a range r. Note that if r is one of the ranges in the PTST, then r must be in the RST of the node z that is closest to the root and such that r matches point(z). The while loop of Figure 3-9 finds this z and deletes r from RST(z).

Algorithm delete(r) {
   // delete the range r
   z = root; // root of PTST
   while (z != null)
      if (finish(r) < point(z))
         z = leftChild(z);
      else if (start(r) > point(z))
         z = rightChild(z);
      else { // r matches point(z)
         RST(z)->delete(r);
         cleanup(z);
         return;
      }
}
Figure 3-9: Algorithm to delete a range

Assume that r is, in fact, one of the ranges in our PTST. To delete r from RST(z), we use the standard red-black deletion algorithm [24] modified to update mp values as necessary.
Following the deletion of r from RST(z) we perform a cleanup operation that is necessary to maintain the size constraint of the PTST. Figure 3-10 gives the steps in the method cleanup.

Algorithm cleanup(z) {
   // maintain size constraint
   if (RST(z) is empty and the degree of z is 0 or 1)
      delete node z from the PTST and rebalance;
   while (|PTST| > 2|R|)
      delete a degree 0 or degree 1 node z with empty RST(z) from the PTST and rebalance;
}
Figure 3-10: Algorithm to maintain size constraint following a delete

Notice that following the deletion of r from RST(z), RST(z) may or may not be empty. If RST(z) becomes empty and the degree of node z is either 0 or 1, node z is deleted from the PTST using the standard red-black node deletion algorithm [24]. If this deletion requires a rotation (at most one rotation may be required), the rotation is done as described in Section 3.2.4. Since the number of ranges and the number of nodes have each decreased by 1, the size constraint may be violated (this happens if |PTST| = 2|R| prior to the delete). Hence, it may be necessary to remove a node from the PTST to restore the size constraint.

If RST(z) becomes empty and the degree of z is 2, or if RST(z) does not become empty, z is not deleted from the PTST. Now, |PTST| is unchanged by the deletion of r and |R| reduces by 1. Again, it is possible that we have a size constraint violation. If so, up to two nodes may have to be removed from the PTST to restore the size constraint.

The size constraint, if violated, is restored in the while loop of Figure 3-10. This restoration is done by removing one or two (as needed) degree 0 or degree 1 nodes that have an empty RST. Lemma 28 shows that whenever the size constraint is violated, the PTST has at least one degree 0 or degree 1 node with an empty RST. So, the node z needed for deletion in each iteration of the while loop exists.
Lemma 28 When the PTST has > 2n nodes, where n = |R|, the PTST has at least one degree 0 or degree 1 node that has an empty RST.
Proof Suppose not. Then the degree of every node that has an empty RST is 2. Let n_2 be the total number of degree 2 nodes, n_1 the total number of degree 1 nodes, n_0 the total number of degree 0 nodes, n_e the total number of nodes that have an empty RST, and n_n the total number of nodes that have a nonempty RST. Since all PTST nodes that have an empty RST are degree 2 nodes, n_2 ≥ n_e. Further, since there are only n ranges and each range is stored in exactly one RST, there are at most n nodes that have a nonempty RST, i.e., n ≥ n_n. Thus n_2 + n ≥ n_e + n_n = |PTST|, i.e., n_2 ≥ |PTST| − n. From [24], we know that n_0 = n_2 + 1. Hence,

$$n_0 + n_1 + n_2 = n_2 + 1 + n_1 + n_2 > n_2 + n_2 \ge 2|PTST| - 2n > |PTST|.$$

This contradicts n_0 + n_1 + n_2 = |PTST|. ∎

To find the degree 0 and degree 1 nodes that have an empty RST efficiently, we maintain a doubly-linked list of these nodes. Also, a doubly-linked list of degree 2 nodes that have an empty RST is maintained. When a range is inserted or deleted, PTST nodes may be added to or removed from these doubly-linked lists, and nodes may move from one list to another. The required operations can be done in O(1) time each.

Complexity It takes O(log n) time to find the PTST node z that contains the range r that is to be deleted. Another O(log n) time is needed to delete r from RST(z). The cleanup step removes up to 2 nodes from the PTST. This takes another O(log n) time. So, the overall delete time is O(log n).

3.2.6 Expected Complexity of BOB

Let maxR be the maximum number of ranges that match any destination address. So, |ranges(z)| = |RST(z)| ≤ maxR for every node z of the PTST. We may, therefore, restate the complexity of the BOB operations (lookup, insert, and delete) as O(log n log maxR), O(log n), and O(log n), respectively. Sahni et al. [21] have analyzed the prefixes in several real IPv4 prefix router tables.
They report that a destination address is matched by about 1 prefix on average; the maximum number of prefixes that match a destination address is at most 6. Making the assumption that this analysis holds true even for real range router-tables (no data is available for us to perform such an analysis), we conclude that maxR ≤ 6. So, the expected complexity of BOB on real router-tables is O(log n) per operation.

3.3 Highest-Priority Prefix-Tables (HPPTs): PBOB

3.3.1 The Data Structure

When all rule filters are prefixes, maxR ≤ min{n, W}. Hence, if BOB is used to represent an HPPT, the search complexity is O(log n · min{log n, log W}); the insert and delete complexities are O(log n) each.

Since maxR ≤ 6 for real prefix router-tables, we may expect to see better performance using a simpler structure (i.e., a structure with smaller overhead and possibly worse asymptotic complexity) for ranges(z) than the RST structure described in Section 3.2. In PBOB, we replace the RST in each node, z, of the BOB PTST with an array linear list [41], ALL(z), of pairs of the form (pLength, priority), where pLength is a prefix length (i.e., number of bits) and priority is the prefix priority. ALL(z) has one pair for each range r ∈ ranges(z). The pLength value of this pair is the length of the prefix that corresponds to the range r, and the priority value is the priority of the range r. The pairs in ALL(z) are in ascending order of pLength. Note that since the ranges in ranges(z) are nested and match point(z), the corresponding prefixes have different lengths.

3.3.2 Lookup

Figure 3-11 gives the algorithm to find the priority of the highest-priority prefix that matches the destination address d. The method maxp() returns the highest priority of any prefix in ALL(z) (note that all prefixes in ALL(z) match point(z)). The method searchALL(d, hp) examines the prefixes in ALL(z) and updates hp taking into account the priorities of those prefixes in ALL(z) that match d.
The method searchALL(d, hp) utilizes Lemma 29. Consequently, it examines prefixes of ALL(z) in increasing order of length until either all prefixes have been examined or until the first (i.e., shortest) prefix that doesn't match d is examined.

Algorithm hp(d) {
   // return the priority of hpp(d)
   // easily extended to return hpp(d)
   hp = -1; // assuming 0 is the smallest priority value
   z = root; // root of PTST
   while (z != null) {
      if (d == point(z))
         return max{hp, ALL(z)->maxp()};
      ALL(z)->searchALL(d, hp);
      if (d < point(z))
         z = leftChild(z);
      else
         z = rightChild(z);
   }
   return hp;
}
Figure 3-11: Algorithm to find priority(hpp(d))

Lemma 29 If a prefix in ALL(z) doesn't match a destination address d, then no longer-length prefix in ALL(z) matches d.
Proof Let p1 and p2 be prefixes in ALL(z). Let li be the length of pi. Assume that l1 < l2 and that p1 doesn't match d. Since both p1 and p2 match point(z), p2 is nested within p1. Therefore, all destination addresses that are matched by p2 are also matched by p1. So, p2 doesn't match d. ∎

One way to determine whether a length l prefix of ALL(z) matches d is to use the following lemma. The check of this lemma may be implemented using a mask to extract the most-significant bits of point(z) and d.

Lemma 30 A length l prefix p of ALL(z) matches d iff the l most-significant bits of point(z) and d are the same.
Proof Straightforward. ∎

Complexity We assume that the masking operations can be done in O(1) time each. (In IPv4, for example, each mask is 32 bits long and we may extract any subset of bits from a 32-bit integer by taking the logical and of the appropriate mask and the integer.) The number of PTST nodes reached in the while loop of Figure 3-11 is O(log n), and the time spent at each node z that is reached is linear in the number of prefixes in ALL(z) that match d. Since the PTST has at most maxR prefixes that match d, the complexity of our lookup algorithm is O(log n + maxR) = O(W) (note that log₂ n ≤ W and maxR ≤ W).
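A minimal C++ sketch of the PBOB lookup for IPv4 is given below. It is an illustration under stated assumptions, not the dissertation's code: the PTST is an unbalanced array of nodes with child indices, each ALL(z) is a `std::vector` of (pLength, priority) pairs, and the names `PbobNode`, `samePrefix`, `lookupPriority`, and `exampleTable` are hypothetical. `samePrefix` is the mask test of Lemma 30, and `searchALL` stops at the first non-matching prefix as Lemma 29 allows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

const int W = 32;  // IPv4 address length in bits

struct PbobNode {
    std::uint32_t point;
    std::vector<std::pair<int, int> > all;  // (pLength, priority), ascending pLength
    int left, right;                        // child indices, -1 = none
};

// Lemma 30: a length-len prefix stored at a node matches d iff point
// and d agree on their len most significant bits.
bool samePrefix(int len, std::uint32_t point, std::uint32_t d) {
    assert(0 <= len && len <= W);
    if (len == 0) return true;  // avoid an undefined 32-bit shift
    std::uint32_t mask = ~std::uint32_t(0) << (W - len);
    return (point & mask) == (d & mask);
}

int maxp(const PbobNode& z) {
    int best = -1;
    for (const auto& e : z.all) best = std::max(best, e.second);
    return best;
}

// Examine prefixes in increasing length; stop at the first mismatch
// because no longer prefix of this node can match d (Lemma 29).
int searchALL(const PbobNode& z, std::uint32_t d, int hp) {
    for (const auto& e : z.all) {
        if (!samePrefix(e.first, z.point, d)) break;
        hp = std::max(hp, e.second);
    }
    return hp;
}

// Priority of the highest-priority matching prefix, or -1 if none.
int lookupPriority(const std::vector<PbobNode>& t, int root, std::uint32_t d) {
    int hp = -1;  // assumes all real priorities are >= 0
    for (int z = root; z >= 0;) {
        if (d == t[z].point) return std::max(hp, maxp(t[z]));
        hp = searchALL(t[z], d, hp);
        z = (d < t[z].point) ? t[z].left : t[z].right;
    }
    return hp;
}

// Illustrative table: 10* (priority 5) and 1010* (priority 3) at a
// node with point 0xA0000000; 0* (priority 7) at a node with point 0.
std::vector<PbobNode> exampleTable() {
    PbobNode a;
    a.point = 0xA0000000u;
    a.all.push_back(std::make_pair(2, 5));
    a.all.push_back(std::make_pair(4, 3));
    a.left = 1;
    a.right = -1;
    PbobNode b;
    b.point = 0x00000000u;
    b.all.push_back(std::make_pair(1, 7));
    b.left = -1;
    b.right = -1;
    std::vector<PbobNode> t;
    t.push_back(a);
    t.push_back(b);
    return t;
}
```

In this example, the address 0xB0000000 (binary 1011...) matches 10* but not 1010*, so the lookup stops after the second pair of the root's list and reports priority 5.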
3.3.3 Insertion and Deletion

The PBOB algorithms to insert/delete a prefix are simple adaptations of the corresponding algorithms for BOB. rMax is found by examining the prefixes in ALL(x) in increasing order of length. ALL'(y) is obtained by prepending the prefixes in ALL(x) whose length is ≤ the length of rMax to ALL(y), and ALL'(x) is obtained from ALL(x) by removing the prefixes whose length is ≤ the length of rMax. The time required to find rMax is O(maxR). This is also the time required to compute ALL'(y) and ALL'(x). The overall complexity of an insert/delete operation is O(log n + maxR) = O(W). As noted earlier, maxR ≤ 6 in practice. So, in practice, PBOB takes O(log n) time and makes O(log n) cache misses per operation.

3.4 Longest-Matching Prefix-Tables (LMPTs): LMPBOB

3.4.1 The Data Structure

Using priority = pLength, a PBOB may be used to represent an LMPT, obtaining the same performance as for an HPPT. However, we may achieve some reduction in the memory required by the data structure if we replace the array linear list that is stored in each node of the PTST by a W-bit vector, bit. bit(z)[i] denotes the ith bit of the bit vector stored in node z of the PTST; bit(z)[i] = 1 iff ALL(z) has a prefix whose length is i. We note that Suri et al. [20] also use W-bit vectors to keep track of prefix lengths in their data structure.

3.4.2 Lookup

Figure 3-12 gives the algorithm to find the length of the longest matching-prefix, lmp(d), for destination d. The method longest() returns the largest i such that bit(z)[i] = 1 (i.e., it returns the length of the longest prefix stored in node z). The method searchBitVector(d, hp, k) examines bit(z) and updates hp taking into account the lengths of those prefixes in this bit vector that match d. The method same(k+1, point(z), d) returns true iff point(z) and d agree on their k + 1 most significant bits.
Algorithm lmp(d) {
   // return the length of lmp(d)
   // easily extended to return lmp(d)
   hp = 0; // length of lmp
   k = 0; // next bit position to examine is k+1
   z = root; // root of PTST
   while (z != null) {
      if (d == point(z))
         return max{hp, z->longest()};
      bit(z)->searchBitVector(d, hp, k);
      if (d < point(z))
         z = leftChild(z);
      else
         z = rightChild(z);
   }
   return hp;
}
Figure 3-12: Algorithm to find length(lmp(d))

Algorithm searchBitVector(d, hp, k) {
   // update hp and k
   while (k < W && same(k+1, point(z), d)) {
      if (bit(z)[k+1] == 1)
         hp = k+1;
      k++;
   }
}
Figure 3-13: Algorithm to search a bit vector for prefixes that match d

The method searchBitVector(d, hp, k) (Figure 3-13) utilizes the next two lemmas.

Lemma 31 If bit(z)[i] corresponds to a prefix that doesn't match the destination address d, then bit(z)[j], j > i, corresponds to a prefix that doesn't match d.
Proof bit(z)[q] corresponds to the prefix p_q whose length is q and which equals the q most significant bits of point(z). So, p_i matches all points that are matched by p_j. Hence, if p_i doesn't match d, p_j doesn't match d either. ∎

Lemma 32 Let w and z be two nodes in a PTST such that w is a descendant of z. Suppose that z->bit(q) corresponds to a prefix p_q that matches d. Then w->bit(j), j < q, cannot correspond to a prefix that matches d.
Proof Suppose that w->bit(j) corresponds to the prefix p_j, p_j matches d, and j < q. So, p_j equals the j most significant bits of d. Since p_q matches d and also point(z), d and point(z) have the same q most significant bits. Therefore, p_j matches point(z). So, by the range allocation rule, p_j should be stored in node z and not in node w, a contradiction. ∎

Complexity We assume that the method same can be implemented using masks and Boolean operations so as to have complexity O(1). Since a bit vector has the same number of bits as does a destination address, this assumption is consistent with the implicit assumption that arithmetic on destination addresses takes O(1) time.
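The bit-vector lookup can be sketched in C++ for IPv4 as follows. Several details are assumptions made for the sketch: the PTST is an unbalanced array of nodes with child indices, the bit vector is held in a 64-bit integer whose bit i records a stored prefix of length i, `longest` returns 0 for an empty vector, and the exact-match case returns the larger of hp and the node's longest stored length so that an empty bit vector leaves hp unchanged. The names `LmpNode`, `lmpLength`, and `exampleTable` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

const int W = 32;  // IPv4 address length in bits

struct LmpNode {
    std::uint32_t point;
    std::uint64_t bits;  // bit i set iff a prefix of length i is stored here
    int left, right;     // child indices, -1 = none
};

// True iff a and b agree on their k most significant bits.
bool same(int k, std::uint32_t a, std::uint32_t b) {
    assert(0 <= k && k <= W);
    return k == 0 || ((a ^ b) >> (W - k)) == 0;
}

bool hasLength(const LmpNode& z, int i) {
    return ((z.bits >> i) & 1u) != 0;
}

// Length of the longest prefix stored in z, or 0 if the vector is empty.
int longest(const LmpNode& z) {
    for (int i = W; i >= 0; --i)
        if (hasLength(z, i)) return i;
    return 0;
}

// Advance k while point(z) and d agree on bit k+1, recording in hp the
// longest stored prefix length seen so far (Lemmas 31 and 32).
void searchBitVector(const LmpNode& z, std::uint32_t d, int& hp, int& k) {
    while (k < W && same(k + 1, z.point, d)) {
        if (hasLength(z, k + 1)) hp = k + 1;
        ++k;
    }
}

// Length of the longest prefix that matches d (0 if none is found).
int lmpLength(const std::vector<LmpNode>& t, int root, std::uint32_t d) {
    int hp = 0;  // length of the longest match found so far
    int k = 0;   // next bit position to examine is k+1
    for (int z = root; z >= 0;) {
        if (d == t[z].point) return std::max(hp, longest(t[z]));
        searchBitVector(t[z], d, hp, k);
        z = (d < t[z].point) ? t[z].left : t[z].right;
    }
    return hp;
}

// Illustrative table: prefixes 10* and 1010* at a node with point
// 0xA0000000, and prefix 0* at a node with point 0.
std::vector<LmpNode> exampleTable() {
    std::vector<LmpNode> t;
    t.push_back(LmpNode{0xA0000000u,
                        (std::uint64_t(1) << 2) | (std::uint64_t(1) << 4), 1, -1});
    t.push_back(LmpNode{0x00000000u, std::uint64_t(1) << 1, -1, -1});
    return t;
}
```

For the address 0xA5000000, the search at the root advances k to 5 and records lengths 2 and 4 on the way, so the reported length is 4; no bit position is examined twice, even when the path visits several nodes.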
The total time spent in all invocations of searchBitVector is O(W + log n). The time spent in the remaining steps of lmp(d) is O(log n). So, the overall complexity of lmp(d) is O(W + log n) = O(W). Even though the time complexity is O(W), the number of cache misses is O(log n) (note that each bit vector takes the same amount of space as needed to store a destination address).

3.4.3 Insertion and Deletion

The insert and delete algorithms are similar to the corresponding algorithms for HPPTs. The essential differences are as below.
1. Rather than insert or delete a prefix from an ALL(z), we set bit(z)[l], where l is the length of the prefix being inserted or deleted, to 1 or 0, respectively.
2. For a rotation, we do not look for rMax in bit(x). Instead, we find the largest integer iMax such that the prefix that corresponds to bit(x)[iMax] matches point(y). The first (bit 0 comes before bit 1) iMax bits of bit'(y) are the first iMax bits of bit(x), and the remaining bits of bit'(y) are the same as the corresponding bits of bit(y). bit'(x) is obtained from bit(x) by setting its first iMax bits to 0.

Complexity iMax may be determined in O(log W) time using binary search; bit'(x) and bit'(y) may be computed in O(1) time using masks and Boolean operations. The remaining tasks performed during an insert or delete take O(log n) time. So, the overall complexity of an insert or delete operation is O(log n + log W) = O(log(Wn)). The number of cache misses is O(log n).

3.5 Implementation Details and Memory Requirement

3.5.1 Memory Management

We implemented our data structures in C++. Since dynamic memory allocation and deallocation using C++'s new and delete operators are very time consuming, we implemented our own methods to manage memory. We maintained our own list of free memory. Whenever this list was exhausted, we used new to get a large chunk of memory to add to our free list. Memory was then allocated from this large chunk as needed by our data structures.
Whenever memory was to be deallocated, it was put back on to our free list.

3.5.2 BOB

As described in Section 3.2, each node z of the PTST of BOB has the following fields: color, point(z), RST, leftChild, and rightChild. To improve the lookup performance of BOB, we added the following fields: maxPriority (maximum priority of the ranges in ranges(z)), minSt (smallest starting point of the ranges in ranges(z)), and maxFn (largest finish point of the ranges in ranges(z)). Correspondingly, the statements RST(z)->hpRight(d, hp) and RST(z)->hpLeft(d, hp) of Figure 3-3 are executed only when maxPriority > hp && d <= maxFn and maxPriority > hp && minSt <= d, respectively.

With the added fields, each node of the PTST has 8 fields. For the color and maxPriority fields, we allocate 1 byte each. Assuming 4 bytes for each of the remaining fields, we get a node size of 26 bytes. For improved cache performance, it is desirable to align nodes to 4-byte memory boundaries. This alignment is simplified if the node size is an integral multiple of 4 bytes. Therefore, for practical purposes, the PTST node size becomes 28 bytes.

In our implementation of hpRight (Figure 3-4), the while loop conditional was changed from x != null to x != null && mp > hp. A corresponding change was made to hpLeft. The nodes of an RST have the following fields: color, mp, st, fn, p, leftChild, and rightChild. Using 1 byte each for the color, p, and mp fields, and 4 bytes for each of the remaining fields, the size of an RST node becomes 19 bytes. Again, for ease of alignment to 4-byte boundaries, we make the RST node size 20 bytes. In addition to nodes, every nonempty RST has the fields root (pointer to the root of the RST) and rank (rank of the red-black tree). Each of these fields is a 4-byte field.

For the doubly-linked lists of PTST nodes with an empty RST, we used the minSt and maxFn fields to, respectively, represent left and right pointers.
So, there is no space overhead (other than the space needed to keep track of the first node) associated with maintaining the two doubly-linked lists of PTST nodes that have an empty RST. Since an instance of BOB may have up to 2n PTST nodes, n nonempty RSTs, and n RST nodes, the maximum space/memory required by BOB is 28 * 2n + 8 * n + 20 * n = 84n bytes.

3.5.3 PBOB

The required fields in each node z of the PTST of PBOB are: color, point(z), ALL, size, length, leftChild, and rightChild, where ALL is a one-dimensional array, each entry of which has the subfields pLength and priority; size is the dimension of the array, and length is the number of pairs currently in the array linear list. The array ALL initially has enough space to accommodate 4 pairs (pLength, priority). When the capacity of an ALL is exceeded, the size of the ALL is increased by 4 pairs (since at most 6 pairs are expected in an ALL, the size of an ALL needs to be increased at most once; in theory, an ALL may get as many as W pairs and, in theory, using array doubling as in [41] may work better than increasing the array size by 4 each time array capacity is exceeded).

To improve the lookup performance of PBOB, the field maxPriority (maximum priority of the prefixes in ALL(z)) may be added. Note that minSt (smallest starting point of the prefixes in ALL(z)) and maxFn (largest finish point of the prefixes in ALL(z)) are easily computed from point(z) and the pLength of the shortest (i.e., first) prefix in ALL(z). When the nodes of the PTST are augmented with a maxPriority field, the expression ALL(z)->maxp() in Figure 3-11 may be changed to maxPriority(z), and the statement ALL(z)->searchALL(d, hp) executed only when

   maxPriority > hp && minSt <= d && d <= maxFn

Since searchALL does its first check against the shortest prefix in the array linear list and this check tests minSt <= d && d <= maxFn, it is sufficient to execute the statement ALL(z)->searchALL(d, hp) only when maxPriority > hp.
Using 1 byte for each of the fields color, size, length, maxPriority, pLength, and priority, and 4 bytes for each of the remaining fields, the initial size of a PTST node of PBOB is 24 bytes.

For the doubly-linked lists of PTST nodes with an empty ALL, we used the 8 bytes of memory allocated to the empty array ALL to, respectively, represent left and right pointers. So, there is no space overhead (other than the space needed to keep track of the first node) associated with maintaining the two doubly-linked lists of PTST nodes that have an empty ALL. Since an instance of PBOB may have up to 2n PTST nodes, the minimum space/memory required by these 2n PTST nodes is 24 * 2n = 48n bytes. However, some PTST nodes may have more than 4 pairs in their ALL. There can be at most n/5 such nodes. So, the maximum space requirement of PBOB is 48n + 8n/5 = 49.6n bytes.

3.5.4 LMPBOB

In the case of LMPBOB, each node z of the PTST has the following fields: color, point(z), bit, leftChild, and rightChild. To improve the lookup performance of LMPBOB, the fields minLength (minimum of the lengths of the prefixes in bit(z)) and maxLength (maximum of the lengths of the prefixes in bit(z)) may be added. When the nodes of the PTST are augmented with a minLength and a maxLength field, we replace the statement bit(z)->searchBitVector(d, hp, k) of Figure 3-12 by

   if (same(minLength, point(z), d)) {
      hp = k = minLength;
      bit(z)->searchBitVector(d, hp, k);
   }

Observe that maxLength of LMPBOB is equivalent to maxPriority of BOB and PBOB.

Using 1 byte for each of the fields color, minLength, and maxLength; 8 bytes for bit (this analysis is for IPv4); and 4 bytes for each of the remaining fields, the size of a PTST node of LMPBOB is 23 bytes. Again, to easily align PTST nodes along 4-byte boundaries, we pad an LMPBOB PTST node so that its size is 24 bytes.

For the doubly-linked lists of PTST nodes with an empty bit vector, we used the 8 bytes of memory allocated to the empty bit vector bit to represent left and right pointers.
So, there is no space overhead (other than the space needed to keep track of the first node) associated with maintaining the two doubly-linked lists of PTST nodes that have an empty bit. Since an instance of LMPBOB may have up to 2n PTST nodes, the space/memory required by these 2n PTST nodes is 24 * 2n = 48n bytes.

3.6 Experimental Results

3.6.1 Test Data and Memory Requirement

We implemented the BOB, PBOB, and LMPBOB data structures and associated algorithms in C++ as described in Section 3.5 and measured their performance on a 1.4GHz PC. To assess the performance of these data structures, we used six IPv4 prefix databases obtained from [38] (see footnote 2). We assigned each prefix a priority equal to its length. Hence, BOB, PBOB, and LMPBOB were all used in a longest matching-prefix mode. For dynamic router-tables that use the longest matching-prefix tie breaker, the PST structure of Lu et al. [33, 34] provides O(log n) lookup, insert, and delete. So, we included the PST in our experimental evaluation of BOB, PBOB, and LMPBOB.

The number of prefixes in each of our 6 databases as well as the memory requirement for each database of prefixes are shown in Table 3-2. For the memory requirement, we performed two measurements. Measure1 gives the memory used by a data structure that is the result of a series of insertions made into an initially empty instance of the data structure. For Measure1, less than 1% of the PTST nodes in the constructed BOB, PBOB, and LMPBOB instances are empty. So, these data structures use close to the minimum amount of memory they could use. Measure2 gives the memory used after 75% of the prefixes in the data structure constructed for Measure1 are deleted. In the resulting BOB, PBOB, and LMPBOB instances, almost half the PTST nodes are empty.

2 Our experiments are limited to prefix databases because range databases are not available for benchmarking.
The databases Paix1, Pb1, MaeWest, and Aads were obtained on Nov 22, 2001, while Pb2 and Paix2 were obtained on Sep 13, 2000.

Figures 3-14 and 3-15 histogram the data of Table 3-2. The memory required by PBOB and LMPBOB is the same when rounded to the nearest KB. This is so because in each of these structures, the number of PTST nodes is the same; the minimum size of a PTST node in PBOB is 24 bytes; very few PTST nodes of PBOB are bigger than 24 bytes, because the average value of |ranges(z)| is about 1 for our data sets and the maximum value is at most 6; and the size of a PTST node in LMPBOB is 24 bytes. In Measure1, the memory required by BOB is about 2.38 times that required by PBOB and LMPBOB. However, in Measure2, this ratio is about 1.75. Also, note that, in Measure1, PST takes slightly more memory than does BOB, whereas, in Measure2, BOB takes about 50% more memory than does PST. We note also that the memory requirement of PST may be reduced by about 50% using a priority-search-tree implementation different from that used in [33]. Of course, using this more memory-efficient implementation would increase the run time of PST.

Table 3-2: Memory usage

                            Paix1   Pb1     MaeWest  Aads    Pb2     Paix2
Num of Prefixes             16172   22225   28889    31827   35303   85988
Measure1 (KB)   PST         884     1215    1579     1740    1930    4702
                BOB         851     1176    1526     1682    1876    4527
                PBOB        357     495     642      708     790     1901
                LMPBOB      357     495     642      708     790     1901
Measure2 (KB)   PST         221     303     395      435     482     1175
                BOB         331     455     592      652     723     1760
                PBOB        189     260     338      372     413     1007
                LMPBOB      189     260     338      372     413     1007

Figure 3-14: Memory usage (Measure1)

Figure 3-15: Memory usage (Measure2)

3.6.2 Preliminary Timing Experiments

We performed preliminary experiments to determine the effectiveness of the changes suggested in Section 3.5. Since these changes are only to the lookup algorithm, our preliminary timing experiments measured only the lookup times for the BOB, PBOB, and LMPBOB data structures. To obtain the mean lookup time, we started with a BOB, PBOB, or LMPBOB that contained all prefixes of a prefix database. Next, we created a list of the start points of the ranges corresponding to the prefixes in a database and then added 1 to each of these start points. Call this list L. A random permutation of L was generated and this permutation determined the order in which we searched for the longest matching-prefix for each of the addresses in L. The time required to determine all of these longest-matching prefixes was measured and averaged over the number of addresses in L (actually, since the time to perform all these lookups was too small to measure accurately, we repeated the lookup for all addresses in L several times and then averaged). The experiment was repeated 10 times, each time using a different random permutation of L, and the mean of these average times computed.

The mean time for the implementation described in Section 3.5 is the base lookup time. For BOB, we found that omitting the predicates d <= maxFn and minSt <= d resulted in a mean lookup time that is approximately twice the base lookup time. On the other hand, elimination of the predicate maxPriority > hp reduces the mean lookup time by about 2%. Even though the use of the predicate maxPriority > hp increased the lookup time slightly on our test data, we believe this is a good heuristic for data sets in which the priorities are not highly correlated with the lengths of the prefixes or ranges. So, our remaining experiments retained this predicate. Eliminating the predicate mp > hp had no noticeable effect on mean lookup time. This is to be expected on our data sets, because for these data sets, the maximum value of |ranges(z)| is <= maxR = 6.
The predicate mp > hp is expected to be effective on data sets with a larger value of maxR. So, we retained this predicate for our remaining tests.

For PBOB, elimination of the predicate hp < maxPriority results in a very slight decrease in the mean lookup time relative to the base case. However, we expect that for data sets in which the priority isn't highly correlated with the prefix length, this predicate will actually reduce lookup time. Therefore, for further experiments, we retain this predicate in our lookup code.

In the case of LMPBOB, the introduction of the statement hp = k = minLength into the base code results in a lookup time that is 15% less than when this statement is removed.

3.6.3 Run-Time Experiments

We measured the mean lookup time as described in Section 3.6.2. The standard deviation in the average times across the 10 repetitions described in Section 3.6.2 was also computed. These mean times and standard deviations are reported in Table 3-3. The mean times are also histogrammed in Figure 3-16. It is interesting to note that PBOB, which can handle prefix tables with arbitrary priority assignments, is actually 20% to 30% faster than PST, which is limited to prefix tables that employ the longest matching-prefix tie breaker. Further, lookups in BOB, which can handle range tables with arbitrary priorities, are slightly slower than in PST. LMPBOB, which, like PST, is designed specifically for longest-matching-prefix lookups, is slightly inferior to the more general PBOB.

Figure 3-16: Search time

To obtain the mean insert time, we started with a random permutation of the prefixes in a database, inserted the first 67% of the prefixes into an initially empty data structure, measured the time to insert the remaining 33%, and computed the mean insert time by dividing by the number of prefixes in 33% of the database.
(Once again, since the time to insert the remaining 33% of the prefixes was too small to measure accurately, we started with several copies of the data structure and inserted the 33% prefixes into each copy; measured the time to insert in all copies; and divided by the number of copies and number of prefixes inserted.) This experiment was repeated 10 times, each time starting with a different permutation of the database prefixes, and the mean of the mean as well as the standard deviation in the mean computed. These latter two quantities as well as the number of copies of each data structure we used for the inserts are given in Table 3-3.

Table 3-3: Prefix times on a 1.4GHz Pentium 4 PC with an 8K L1 data cache and a 256K L2 cache

                            Paix1   Pb1    MaeWest  Aads   Pb2    Paix2
Search (μsec)
  PST           Mean        1.20    1.35   1.49     1.53   1.57   1.96
                Std         0.01    0.01   0.04     0.01   0.00   0.01
  BOB           Mean        1.22    1.39   1.54     1.56   1.62   2.19
                Std         0.01    0.02   0.02     0.02   0.02   0.01
  PBOB          Mean        0.82    0.98   1.10     1.15   1.20   1.60
                Std         0.01    0.01   0.01     0.01   0.01   0.01
  LMPBOB        Mean        0.87    1.03   1.17     1.21   1.27   1.69
                Std         0.01    0.01   0.01     0.01   0.01   0.01
Insert (μsec)
  PST           Mean        2.17    2.35   2.53     2.60   2.64   3.03
                Std         0.07    0.04   0.03     0.01   0.05   0.01
  BOB           Mean        1.70    1.89   2.06     2.10   2.16   2.55
                Std         0.06    0.06   0.05     0.05   0.05   0.03
  PBOB          Mean        1.04    1.25   1.39     1.44   1.51   1.93
                Std         0.06    0.05   0.00     0.05   0.05   0.06
  LMPBOB        Mean        1.06    1.29   1.47     1.50   1.57   1.98
                Std         0.07    0.07   0.06     0.06   0.04   0.01
Delete (μsec)
  PST           Mean        1.72    1.87   2.06     2.09   2.11   2.48
                Std         0.04    0.05   0.05     0.06   0.04   0.06
  BOB           Mean        1.04    1.13   1.26     1.27   1.32   1.69
                Std         0.06    0.05   0.04     0.05   0.06   0.06
  PBOB          Mean        0.68    0.82   0.90     0.91   0.97   1.30
                Std         0.07    0.06   0.05     0.06   0.03   0.05
  LMPBOB        Mean        0.67    0.82   0.89     0.92   0.95   1.26
                Std         0.06    0.06   0.05     0.05   0.03   0.05
Num of Copies               15      11     9        8      8      3

Figure 3-17 histograms the mean insert time. As can be seen, by the means in Table 3-3, insertions into PBOB take between about 36% and 52%
less time than do insertions into PST; insertions into LMPBOB take slightly more time than do insertions into PBOB; and insertions into PST take 20% to 25% more time than do insertions into BOB.

Figure 3-17: Insert time

The mean and standard deviation data reported for the delete operation in Table 3-3 and Figure 3-18 were obtained in a similar fashion by starting with a data structure that had 100% of the prefixes in the database and measuring the time to delete a randomly selected 33% of these prefixes. Deletion from PBOB takes less than 50% of the time required to delete from a PST. For the delete operation, however, LMPBOB is slightly faster than PBOB. Deletions from BOB take about 40% less time than do deletions from PST.

Figure 3-18: Delete time

3.7 Conclusion

Table 3-4 gives the worst-case memory required by each of the data structures. The data of this table are for IPv4. When comparing these memory requirement data, we should keep in mind that BOB, PBOB, and LMPBOB have different capabilities. BOB works for highest-priority matching with nonintersecting ranges; PBOB is limited to highest-priority matching with prefixes; and LMPBOB is limited to longest-length matching with prefixes. The PST structure of Lu et al. [33] has the same restrictions as does LMPBOB.

Table 3-4: Node sizes and worst-case memory requirement in bytes for IPv4 router tables

                   BOB                   PBOB     LMPBOB   PST
Node Size          PTST (28), RST (20)   >= 24    24       28
Memory Required    84n                   49.6n    48n      56n

Table 3-5 gives the asymptotic time complexity and Table 3-6 gives the asymptotic cache misses for our data structures. In these tables, maxR is the maximum number of ranges or prefixes that match any destination address and maxL is the maximum number of cache lines needed by any of the array linear lists stored in a PTST node. For LMPBOB, it is assumed that mask operations on W-bit vectors take O(1) time and that an entire W-bit vector can be accessed with O(1) cache misses.

Table 3-5: Time complexity

          BOB                 PBOB              LMPBOB             PST
Search    O(log n log maxR)   O(log n + maxR)   O(log n + W)       O(log n)
Insert    O(log n)            O(log n + maxR)   O(log n + log W)   O(log n)
Delete    O(log n)            O(log n + maxR)   O(log n + log W)   O(log n)

Table 3-6: Cache misses

          BOB                 PBOB              LMPBOB     PST
Search    O(log n log maxR)   O(log n + maxL)   O(log n)   O(log n)
Insert    O(log n)            O(log n + maxL)   O(log n)   O(log n)
Delete    O(log n)            O(log n + maxL)   O(log n)   O(log n)

Our experiments show that PBOB is to be preferred over PST and LMPBOB for the representation of dynamic longest-matching prefix-router-tables. This is somewhat surprising because PBOB may be used for highest-priority prefix-router-tables, not just longest-matching prefix-router-tables. A possible reason why PBOB is faster than LMPBOB is that in LMPBOB one has to check O(W) prefix lengths, whereas in PBOB O(maxR) lengths are checked (note that in our test databases, W = 32 and maxR <= 6). BOB is slower than and requires more memory than PBOB when tested with longest-matching prefix-router-tables.
The same relative performance between BOB and PBOB is expected when filters are prefixes with arbitrary priority. Of the data structures considered in this chapter, BOB, of course, remains the only choice when the filters are ranges that have an associated priority.

Although the range allocation rule used by our data structures is similar to that used in an interval tree [40], the unique feature of our structures is the 2n size constraint. The size constraint is essential for O(log n) update.

CHAPTER 4
A B-TREE DYNAMIC ROUTER-TABLE DESIGN

In this chapter, we focus on B-tree data structures for dynamic NHPRTs and LMPTs. We are interested in the B-tree because, by varying the order of the B-tree, we can control the height of the tree and hence control the number of cache misses incurred when performing a rule-table operation. Although Suri et al. [20] have proposed a B-tree data structure for dynamic prefix-tables, their structure has the following shortcomings:

1. A prefix may be stored in O(m) nodes at each level of the order m B-tree. This results in excessive cache misses during the insert and delete operations.
2. Some of the prefix endpoints are stored twice in the B-tree. This is because every endpoint is stored in a leaf node and some of the endpoints are additionally stored in interior nodes. This duplication in endpoint storage increases the memory requirement.

Our proposed B-tree structure doesn't suffer from these shortcomings. In our structure, each prefix is stored in O(1) nodes at each level, and each prefix endpoint is stored once. Consequently, even though the asymptotic complexity of performing dynamic prefix-table operations is the same in both structures and the asymptotic memory requirements of both are the same, our structure is faster for the insert and delete operations and also takes less memory.

In Section 4.1, we develop our B-tree data structure, PIBT (prefix in B-tree), for dynamic prefix-tables.
Our B-tree structure for nonintersecting ranges, RIBT (range in B-tree), is developed in Section 4.2. Experimental results comparing the performance of our PIBT structure, the multiway range tree (MRT) structure of Suri et al. [20], and the best binary tree structure for dynamic prefix-tables, PBOB [35], are presented in Section 4.3.

Table 4-1: An example prefix set R (W = 5)

Prefix Name   Prefix   Range Start   Range Finish
P1            001*     4             7
P2            00*      0             7
P3            1*       16            31
P4            01*      8             15
P5            10111    23            23
P6            0*       0             15

4.1 Longest-Matching Prefix-Tables (LMPT)

4.1.1 The Prefix In B-Tree Structure (PIBT)

A range r = [u, v] is a pair of addresses u and v, u <= v. The range r represents the addresses {u, u + 1, ..., v}. start(r) = u is the start point of the range and finish(r) = v is the finish point of the range. The range r matches all addresses d such that u <= d <= v. Every prefix of a prefix router-table may be represented as a range. For example, when W = 5, the prefix p = 100* matches addresses in the range [16,19]. So, we say p = 100* = [16,19], start(p) = 16, and finish(p) = 19. The length of p is 3. Table 4-1 shows a prefix set and the ranges of the prefixes.

The set of start and finish points of a collection P of prefixes is the set of endpoints, E(P), of P. When |P| = n, |E(P)| <= 2n. Although our PIBT structure and the MRT structure of Suri et al. [20] store the endpoints E(P) together with additional information in a B-tree (see footnote 1) [41], each structure uses a different variety of B-tree.

1 A B-tree of order m is an m-way search tree. If the B-tree is not empty, the root has at least two children and other internal nodes have at least ceil(m/2) children. All external nodes are at the same level.

Our PIBT structure uses a B-tree in which each key (endpoint) is stored
exactly once, while the MRT uses a B-tree in which each key is stored once in a leaf node and some of the keys are additionally stored in interior nodes. Figure 4-1 shows a possible order-3 B-tree for the endpoints of the prefix set of Table 4-1. In this example, each endpoint is stored in exactly one node. This example B-tree is a possible B-tree for PIBT but not for MRT. Figure 4-2 shows a possible order-3 B-tree in which each endpoint is stored in exactly one leaf node and some endpoints are also stored in interior nodes. This example B-tree is a possible B-tree for MRT but not for PIBT.

Figure 4-1: B-tree for the endpoints of the prefixes of Table 4-1 (the root x has keys 7 and 16; its children y, z, and w have keys 0 and 4, 8 and 15, and 23 and 31, respectively)

Figure 4-2: Alternative B-tree for Figure 4-1

With each node x of a PIBT B-tree, we associate an interval int(x) of the destination address space [0, 2^W - 1]. The interval int(root) associated with the root of the B-tree is [0, 2^W - 1]. Let x be a B-tree node that has t keys. The format of this node is:

t, child_0, (key_1, child_1), ..., (key_t, child_t)

where key_i is the ith key in the node (key_1 < key_2 < ... < key_t) and child_i is a pointer to the ith subtree. In case of ambiguity, we use the notation x.key_i and x.child_i to refer to the ith key and child, respectively, of node x. Let key_0 = start(int(x)) and key_(t+1) = finish(int(x)). By definition,

int_i(x) = int(child_i) = [key_i, key_(i+1)], 0 <= i <= t

For the example B-tree of Figure 4-1, int(x) = [0, 31], int_0(x) = int(y) = [0, 7], int_1(x) = int(z) = [7, 16], int_2(x) = int(w) = [16, 31], int_0(y) = [0, 0], int_1(y) = [0, 4], int_2(y) = [4, 7], and int_0(z) = [7, 8].

Node x of a PIBT has t + 1 W-bit vectors x.interval_i, 0 <= i <= t, and t W-bit vectors x.equal_i, 1 <= i <= t. The lth bit of x.interval_i, denoted x.interval_i[l], is 1 iff there is a length l prefix whose range includes int_i(x) but not int(x). This rule for the interval vectors is called the prefix allocation rule.
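The definitions above (prefix ranges, the intervals int_i(x), and the prefix allocation rule) can be checked on the running example with a short sketch. Everything named here (Range, prefixToRange, childIntervals, includes, storedInInterval) is a hypothetical helper introduced only for illustration; it is not the PIBT node layout, and it assumes W <= 32 and prefixes written as bit strings with an optional trailing '*'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Range = std::pair<std::uint32_t, std::uint32_t>;  // closed range [start, finish]

// Range of a prefix such as "100*" over W-bit addresses.
Range prefixToRange(const std::string& prefix, unsigned W) {
    std::string bits = prefix;
    if (!bits.empty() && bits.back() == '*') bits.pop_back();
    assert(W >= 1 && W <= 32 && bits.size() <= W);
    std::uint64_t start = 0;
    for (char c : bits) start = (start << 1) | (c == '1' ? 1u : 0u);
    const unsigned freeBits = W - static_cast<unsigned>(bits.size());
    start <<= freeBits;  // unspecified bits set to 0
    const std::uint64_t finish = start | ((std::uint64_t{1} << freeBits) - 1);  // set to 1
    return {static_cast<std::uint32_t>(start), static_cast<std::uint32_t>(finish)};
}

// int_i(x) = [key_i, key_(i+1)], 0 <= i <= t, with key_0 = start(int(x))
// and key_(t+1) = finish(int(x)).
std::vector<Range> childIntervals(const Range& intX,
                                  const std::vector<std::uint32_t>& keys) {
    std::vector<std::uint32_t> bounds;
    bounds.push_back(intX.first);
    bounds.insert(bounds.end(), keys.begin(), keys.end());
    bounds.push_back(intX.second);
    std::vector<Range> result;
    for (std::size_t i = 0; i + 1 < bounds.size(); ++i)
        result.emplace_back(bounds[i], bounds[i + 1]);
    return result;
}

// True iff range r includes range s.
bool includes(const Range& r, const Range& s) {
    return r.first <= s.first && s.second <= r.second;
}

// Prefix allocation rule: the prefix is recorded in interval_i of node x
// iff its range includes int_i(x) but not int(x).
bool storedInInterval(const Range& prefixRange, const Range& intX, const Range& intI) {
    return includes(prefixRange, intI) && !includes(prefixRange, intX);
}
```

For the root x of Figure 4-1, childIntervals({0, 31}, {7, 16}) yields [0,7], [7,16], and [16,31], which are the intervals int(y), int(z), and int(w) listed in the text.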
For our example of Figure 4-1, y.interval_2[3] = 1 because prefix P1 has length 3 and range [4,7]; [4,7] includes int_2(y) = [4, 7] but not int(y) = [0, 7]. We say that P1 is stored in y.interval_2 and in node y. It is easy to see that a prefix may be stored in up to m - 1 intervals of an order m B-tree node and in up to 2 nodes at each level of the B-tree.

The bit x.equal_i[l] is 1 iff there is a length l prefix that has a start or finish endpoint equal to key_i of x. For our example, prefixes P2 and P6 have 0 as their start endpoint. Since the length of P2 is 2 and that of P6 is 1, y.equal_1[1] = y.equal_1[2] = 1; all other bits of y.equal_1 are 0.

To conserve space, leaf nodes do not have child pointers. Further, to reduce memory accesses, child pointers and interval bit-vectors are interleaved so that child_i and interval_i can be accessed with a single cache miss provided cache lines are long enough. In the sequel, we assume that W is sufficiently small so that this is the case. Further, we assume that bit-vector operations on W-bit vectors take O(1) time. This assumption is certainly valid for IPv4 where W = 32 and a W-bit vector may be represented as a 4-byte integer.

4.1.2 Finding The Longest Matching-Prefix

As in [20], we determine only the length of the longest prefix that matches a given destination address d. From this length and d, the longest matching-prefix, lmp(d), is easily computed. The PIBT search algorithm (Figure 4-3) employs the following lemma.

Lemma 33 Let P be a set of prefixes. If P contains a prefix whose start or finish endpoint equals d, then the longest prefix, lmp(d), that matches d has its start or finish point equal to d.

Proof Let p in P be a prefix that matches d and whose start or finish endpoint equals d. Let q in P be a prefix that matches d but whose start and finish endpoints are different from d. It is easy to see that the range of p is properly contained in the range of q. Therefore, p is a longer prefix than q. So, lmp(d) != q. The lemma follows.
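As a sanity check on the lemma, the sketch below finds lmp(d) by brute force over the prefix set of Table 4-1 and verifies the lemma's claim for every 5-bit address. It is a direct linear scan written for illustration, not the PIBT search algorithm, and the names Prefix, table41, lmp, hasEndpoint, and lemmaHolds are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A prefix given by its length and its range [start, finish].
struct Prefix {
    unsigned length;
    std::uint32_t start;
    std::uint32_t finish;
};

// The six prefixes P1..P6 of Table 4-1 (W = 5).
std::vector<Prefix> table41() {
    return {{3, 4, 7}, {2, 0, 7}, {1, 16, 31}, {2, 8, 15}, {5, 23, 23}, {1, 0, 15}};
}

// Brute-force longest matching-prefix; returns nullptr when nothing matches d.
const Prefix* lmp(const std::vector<Prefix>& P, std::uint32_t d) {
    const Prefix* best = nullptr;
    for (const Prefix& p : P) {
        const bool matchesD = p.start <= d && d <= p.finish;
        if (matchesD && (best == nullptr || p.length > best->length)) best = &p;
    }
    return best;
}

// True iff some prefix of P has d as its start or finish endpoint.
bool hasEndpoint(const std::vector<Prefix>& P, std::uint32_t d) {
    for (const Prefix& p : P)
        if (p.start == d || p.finish == d) return true;
    return false;
}

// Checks the lemma for every W-bit address: whenever d is an endpoint of some
// prefix, lmp(d) must itself start or finish at d.
bool lemmaHolds(const std::vector<Prefix>& P, unsigned W) {
    assert(W <= 32);
    for (std::uint64_t d = 0; d < (std::uint64_t{1} << W); ++d) {
        const std::uint32_t addr = static_cast<std::uint32_t>(d);
        if (!hasEndpoint(P, addr)) continue;
        const Prefix* q = lmp(P, addr);
        if (q == nullptr || (q->start != addr && q->finish != addr)) return false;
    }
    return true;
}
```

For example, d = 7 is the finish point of both P1 and P2, and the scan returns P1 (length 3), whose finish point is 7, as the lemma requires.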
The PIBT search algorithm first constructs a W-bit vector matchVector. When the router table has no prefix whose start or finish endpoint equals the destination address d, the constructed bit vector satisfies matchVector[l] = 1 iff there is a length l prefix that matches d. Otherwise, matchVector[l] = 1 iff there is a length l prefix whose start or finish endpoint equals d. The maximum l such that matchVector[l] = 1 is the length of lmp(d).

Complexity Analysis. Each iteration of the while loop takes O(log_2 m) time (we assume throughout this paper that, for sufficiently large m, a B-tree node is searched using a binary search) and the number of iterations is O(log_m n). The largest l such that matchVector[l] = 1 may be found in O(log_2 W) time by performing O(log_2 W) operations on the W-bit vector matchVector. So, the overall complexity is