DATA STRUCTURES FOR STATIC AND DYNAMIC ROUTER TABLES

By

KUN SUK KIM

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2003

Copyright 2003 by Kun Suk Kim

To my mother, Kyung Hwan Lee; my wife, Hye Ryung; and my sons, Jin Sung and Daniel.

ACKNOWLEDGMENTS

I thank God for my exciting and wonderful time at the University of Florida. Interactions with the many multicultural students and great professors have enriched me both academically and socially. I would like to deeply thank Distinguished Professor Sartaj Sahni, my advisor, for guiding me through my doctoral study. I thank God for having given me such an ideal mentor. I have learned from him how an advisor should treat his students. His enthusiastic and devoted attitude toward teaching and research has strongly affected me. He has inspired me to solve difficult problems and given me invaluable advice. I consider him a role model for my life and hope to lift my abilities to his high standards in the future. I also thank Drs. Randy Chow, Richard Newman, Shigang Chen, and Janise McNair for serving on my supervisory committee.

I am thankful to Venkatachary Srinivasan and Jon Sharp for providing their programming code and comments for the multibit trie and BSL structures, respectively. With their help, I could start my research smoothly. Special thanks go to my former advisor, Dr. Yann-Hang Lee, for supporting me for my first two semesters at the University of Florida and for encouraging me even after his move to Arizona State University. I would like to thank former and current members of the Korean Student Association of the CISE department at the University of Florida for their assistance and friendship. Thanks go to Daeyoung Kim, Yoonmee Doh, Young Joon Byun, Myung Chul Song, and many others.
I am eternally grateful to Pastor Hee Young Sohn, who is my spiritual mentor, and to the ministers at Korean Baptist Church of Gainesville (KBCG), for continuously caring about me in Jesus' love. Thanks go to Jin Kun Song, Dong Yul Sung, and other former and current members of my cell church of KBCG for sharing their lives with me and praying for me.

I cannot adequately express my gratitude to my mother, Kyung Hwan Lee. She has worked hard to provide me with better educational opportunities that made it possible for me to get this degree. I would like to thank Moon Suk, my brother, and Mi Ok, my sister, for their constant support and encouragement. I am also grateful to my parents-in-law, Dae Hun Song and Ok Jo Choi; my five elder sisters-in-law (and their husbands), Mi Hye, Mi Ah, Hye Young, and Hye Youn; and my younger sister-in-law, Hyo Jae, for supporting me both materially and morally.

At the end I have to mention my nuclear family, which has put up with my absence for many evenings while I finished this work. Thanks go to my wife, Hye Ryung, and my sons, Jin Sung and Daniel, for their love and tolerance. I am grateful to all for their help and guidance and hope to remember their love forever.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Internet Router
    1.1.1 Internet Protocols
    1.1.2 Classless Inter-Domain Routing (CIDR)
    1.1.3 Packet Forwarding
  1.2 Packet Classification
  1.3 Prior Work
    1.3.1 Linear List
    1.3.2 End-Point Array
    1.3.3 Sets of Equal-Length Prefixes
    1.3.4 Tries
    1.3.5 Binary Search Trees
    1.3.6 Others
  1.4 Dissertation Outline

2 MULTIBIT TRIES
  2.1 1-Bit Tries
  2.2 Fixed-Stride Tries
    2.2.1 Definition
    2.2.2 Construction of Optimal Fixed-Stride Tries
  2.3 Variable-Stride Tries
    2.3.1 Definition and Construction
    2.3.2 An Example
    2.3.3 Faster k = 2 Algorithm
    2.3.4 Faster k = 3 Algorithm
  2.4 Experimental Results
    2.4.1 Performance of Fixed-Stride Algorithm
    2.4.2 Performance of Variable-Stride Algorithm
  2.5 Summary

3 BINARY SEARCH ON PREFIX LENGTH
  3.1 Heuristic of Srinivasan
  3.2 Optimal-Storage Algorithm
    3.2.1 Expansion Cost
    3.2.2 Number of Markers
    3.2.3 Algorithm for ECHT
  3.3 Alternative Formulation
  3.4 Reduced-Range Heuristic
  3.5 More Accurate Cost Estimator
  3.6 Experimental Results
  3.7 Summary

4 O(log n) DYNAMIC ROUTER-TABLE
  4.1 Prefixes and Ranges
  4.2 Properties of Prefix Ranges
  4.3 Representation Using Binary Search Trees
    4.3.1 Representation
    4.3.2 Longest Prefix Matching
    4.3.3 Inserting a Prefix
    4.3.4 Deleting a Prefix
    4.3.5 Complexity
    4.3.6 Comments
  4.4 Experimental Results
  4.5 Summary

5 DYNAMIC LOOKUP FOR BURSTY ACCESS PATTERNS
  5.1 Biased Skip Lists with Prefix Trees
  5.2 Collection of Splay Trees
  5.3 Comparison of BITs and ABITs
  5.4 Experimental Results
  5.5 Summary

6 CONCLUSIONS AND FUTURE WORK
  6.1 Conclusions
  6.2 Future Work

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

2-1 Prefix databases obtained from IPMA project on Sep 13, 2000
2-2 Distributions of the prefixes and nodes in the 1-bit trie for Paix
2-3 Memory required (in Kbytes) by best k-level FST
2-4 Execution time (in μsec) for FST algorithms, Pentium 4 PC
2-5 Execution time (in μsec) for FST algorithms, SUN Ultra Enterprise 4000/5000
2-6 Memory required (in Kbytes) by best k-level VST
2-7 Execution times (in msec) for first two implementations of our VST algorithm, Pentium 4 PC
2-8 Execution times (in msec) for first two implementations of our VST algorithm, SUN Ultra Enterprise 4000/5000
2-9 Execution times (in msec) for third implementation of our VST algorithm, Pentium 4 PC
2-10 Execution times (in msec) for third implementation of our VST algorithm, SUN Ultra Enterprise 4000/5000
2-11 Execution times (in msec) for our best VST implementation and the VST algorithm of Srinivasan and Varghese, Pentium 4 PC
2-12 Execution times (in msec) for our best VST implementation and the VST algorithm of Srinivasan and Varghese, SUN Ultra Enterprise 4000/5000
2-13 Time (in msec) to construct optimal VST from optimal stride data, Pentium 4 PC
2-14 Search time (in μsec) in optimal VST, Pentium 4 PC
2-15 Insertion time (in μsec) for OptVST, Pentium 4 PC
2-16 Deletion time (in μsec) for OptVST, Pentium 4 PC
2-17 Insertion time (in μsec) for Batch1, Pentium 4 PC
2-18 Deletion time (in μsec) for Batch1, Pentium 4 PC
2-19 Insertion time (in μsec) for Batch2, Pentium 4 PC
2-20 Deletion time (in μsec) for Batch2, Pentium 4 PC
3-1 Number of prefixes and markers in solution to ECHT(P, k)
3-2 Number of prefixes and markers in solution to ACHT(P, k)
3-3 Preprocessing time in milliseconds
3-4 Execution time, in μsec, for ECHT(P, k)
3-5 Execution time, in μsec, for ACHT(P, k)
4-1 Statistics of prefix databases obtained from IPMA project on Sep 13, 2000
4-2 Memory for data structure (in Kbytes)
4-3 Execution time (in μsec) for randomized databases
4-4 Execution time (in μsec) for original databases
5-1 Memory requirement (in KB)
5-2 Trace sequences
5-3 Search time (in μsec) for CRBT, ACRBT, and SACRBT structures on NODUP, DUP, and RAN data sets
5-4 Search time (in μsec) for CST and BSLPT structures on NODUP, DUP, and RAN data sets
5-5 Search time (in μsec) for CRBT, ACRBT, and SACRBT structures on trace sequences
5-6 Search time (in μsec) for CST and BSLPT structures on trace sequences
5-7 Average time to insert a prefix (in μsec)
5-8 Average time to delete a prefix (in μsec)
6-1 Performance of data structures for longest matching-prefix

LIST OF FIGURES

1-1 Internet structure
1-2 Generic router architecture
1-3 Formats for IP packet header
1-4 Transport protocol header formats
1-5 Router table example
2-1 Prefixes and corresponding 1-bit trie
2-2 Prefix expansion and fixed-stride trie
2-3 Algorithm for fixed-stride tries
2-4 Two-level VST for prefixes of Figure 2-1(a)
2-5 A prefix set and its expansion to four lengths
2-6 1-bit trie for prefixes of Figure 2-5(a)
2-7 Opt values in the computation of Opt(N0, 0, 4)
2-8 Optimal 4-VST for prefixes of Figure 2-5(a)
2-9 Algorithm to compute C using Equation 2.20
2-10 Algorithm to compute T using Equation 2.22
2-11 Memory required (in Kbytes) by best k-level FST
2-12 Execution time (in μsec) for FST algorithms, Pentium 4 PC
2-13 Execution time (in μsec) for FST algorithms, SUN Ultra Enterprise 4000/5000
2-14 Memory required (in Kbytes) for Paix by best k-VST and best FST
2-15 Execution times (in msec) for Paix for our three VST implementations, Pentium 4 PC
2-16 Execution times (in msec) for Paix for our three VST implementations, SUN Ultra Enterprise 4000/5000
2-17 Execution times (in msec) for Paix for our best VST implementation and the VST algorithm of Srinivasan and Varghese, Pentium 4 PC
2-18 Execution times (in msec) for Paix for our best VST implementation and the VST algorithm of Srinivasan and Varghese, SUN Ultra Enterprise 4000/5000
2-19 Search time (in nsec) in optimal VST for Paix, Pentium 4 PC
2-20 Insertion time (in μsec) for Paix, Pentium 4 PC
2-21 Deletion time (in μsec) for Paix, Pentium 4 PC
3-1 Controlled prefix expansion
3-2 Prefixes and corresponding 1-bit trie
3-3 Alternative binary tree for binary search
3-4 LEC and EC values for Figure 3-2
3-5 LMC and MC values for Figure 3-2
3-6 Optimal-storage CHTs for Figure 3-2
3-7 Algorithm for binary-search hash tables
4-1 Prefixes and their ranges
4-2 Pictorial and tabular representation of prefixes and ranges
4-3 Types of prefix ranges
4-4 CBST for Figure 4-2(a)
4-5 Values of next are shown as left arrows
4-6 Algorithm to find LMP(d)
4-7 Pictorial representation of prefixes and ranges after inserting a prefix
4-8 Basic interval tree and prefix trees after inserting P6 = 01 into Figure 4-4
4-9 Algorithm to insert an end point
4-10 Splitting a basic interval when lsb(u) = 1
4-11 Prefix trees after inserting P7 = 10* into P1-P5
4-12 Algorithm to update prefix trees
4-13 P = S; P ≠ S and S starts at s; and P ≠ S and S finishes at f
4-14 Memory required (in Kbytes) by best k-VST and CRBT for Paix
4-15 Search time (in μsec) comparison for Paix
4-16 Insert time (in μsec) comparison for Paix
4-17 Delete time (in μsec) comparison for Paix
5-1 Skip list representation for basic intervals of Figure 4-2(a)
5-2 Start point s of P splits the basic interval [a, b]
5-3 BSLPT insert algorithm
5-4 Alternative Base Interval Tree corresponding to Figure 4-2(a)
5-5 Total memory requirement (in MB)
5-6 Average search time for NODUP, DUP, and RAN data sets
5-7 Average search time for trace sequences
5-8 Average time to insert a prefix
5-9 Average time to delete a prefix

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DATA STRUCTURES FOR STATIC AND DYNAMIC ROUTER TABLES

By

Kun Suk Kim

August 2003

Chair: Sartaj K. Sahni
Major Department: Computer and Information Science and Engineering

The Internet has been growing exponentially as many users adopt new applications and connect their hosts to the Internet. Because of increased processing power and high-speed communication links, packet header processing has become a major bottleneck in Internet routing. To improve packet forwarding, our study developed fast and efficient algorithms. We improved on dynamic programming algorithms to determine the strides of optimal multibit tries by providing alternative dynamic programming formulations for both fixed- and variable-stride tries.
While the asymptotic complexities of our algorithms are the same as those of the corresponding algorithms of [82], experiments using real IPv4 routing table data indicate that our algorithms run considerably faster. An added feature of our variable-stride trie algorithm is the ability to insert and delete prefixes in a fraction of the time needed to construct an optimal variable-stride trie from scratch.

IP lookup in a collection of hash tables (CHT) organization can be done with O(log l_dist) hash-table searches, where l_dist is the number of distinct prefix lengths (also equal to the number of hash tables in the CHT). We developed an algorithm that minimizes the storage required by the prefixes and markers for the set of prefixes that results when the controlled prefix-expansion technique is used to reduce the value of l_dist. Also, we proposed improvements to the heuristic of [80].

We developed a data structure called the collection of red-black trees (CRBT), in which prefix matching, insertion, and deletion can each be done in O(log n) time, where n is the number of prefixes in the router table. Experiments using real IPv4 routing databases indicate that although the proposed data structure is slower than optimized variable-stride tries for longest prefix matching, it is considerably faster for the insert and delete operations. We formulated a variant, the alternative collection of red-black trees (ACRBT), from the CRBT data structure to develop data structures for bursty access patterns. By replacing the red-black trees used in the ACRBT with splay trees (or biased skip lists), we obtained the collection of splay trees (CST) structure (or the biased skip lists with prefix trees (BSLPT) structure), in which search, insert, and delete take O(log n) amortized time (or O(log n) expected time) per operation, where n is the number of prefixes in the router table.
Experimental results using real IPv4 routing databases and synthetically generated search sequences, as well as trace sequences, indicate that the CST structure is best for extremely bursty access patterns. Otherwise, the ACRBT is recommended. Our experiments also indicate that a supernode implementation of the ACRBT usually has better search performance than the traditional one-element-per-node implementation.

CHAPTER 1
INTRODUCTION

The influence of Internet evolution reaches not only to the technical fields of computer and communications but throughout society as we move toward increasing use of online services (e.g., electronic commerce and information acquisition). The Internet is a worldwide communication infrastructure in which individuals and their computers interact and collaborate without regard for geographic location. Beginning with the early research on packet switching[1] and the ARPANET,[2] government, industry, and academia have been cooperating to evolve and deploy this exciting new technology [12, 45].

The Internet is an internetwork that ties many groups of networks together with a common Internet Protocol (IP) [6, 19]. Figure 1-1 shows routers as the switch points of an internetwork. A packet may pass through many different deployment classes of routers from source to destination. The enterprise router, located at the lowest level in the router hierarchy, must support tens to thousands of network routes and medium-bandwidth interfaces of 100 Mbps to 1 Gbps. Access routers are aggregation points for corporations and residential services. Access routers must support thousands or tens of thousands of routes. Residential terminals are connected to modem pools of the telephone central office through plain old telephone service (POTS), cable service,

[1] This technology is fundamentally different from the circuit switching that was used by the telephone system.
In a packet-switching system, data to be delivered is broken into small chunks called packets, which are labeled to show where they come from and where they are to go.

[2] This project was sponsored by the U.S. Department of Defense to develop a whole new scheme for post-nuclear communication.

[Figure 1-1: Internet structure]

or one of the digital subscriber lines (DSLs). Backbone routers require multiples of high-bandwidth ports such as OC-48 at 2.5 Gbps and OC-192 at 9.6 Gbps. Backbone routers cover national and international areas. With the doubling of Internet traffic every 3 months [85] and the tripling of Internet hosts every 2 years [31], the importance of high-speed, scalable network routers cannot be overemphasized. Fast networking will play a key role in enabling future progress [54]. Fast networking requires fast routers, and fast routers require fast router-table lookup.

The rest of this chapter is structured as follows. Section 1.1 introduces the basics of Internet routers. We describe the IP lookup and packet classification problems in Section 1.2. Section 1.3 discusses related work in these fields. Finally, Section 1.4 presents an outline of the dissertation.

1.1 Internet Router

Figure 1-2 shows the generic architecture of an IP router. Generally, a router consists of the following basic components: the controller card, the router backplane, and line cards [3, 51]. The CPU in the controller card performs path computations and router-table maintenance. The line cards perform inbound and outbound packet forwarding. The router backplane transfers packets between the cards. The basic functions in a router can be classified as routing, packet forwarding, switching, and queueing [3, 6, 51, 58]. We discuss each function in more detail below.
Routing: Routing is the process of communicating with other routers and exchanging route information to construct and maintain the router tables that are used by the packet-forwarding function. Routing protocols that are used to learn about and create a view of the network's topology include the routing information protocol (RIP) [37], open shortest path first (OSPF) [55], border gateway protocol (BGP) [47, 65], distance vector multicast routing protocol (DVMRP) [86], and protocol independent multicast (PIM) [27].

Packet forwarding: The router looks at each incoming packet and performs a table lookup to decide which output port to use. This is based on the destination IP address in the incoming packet. The result of this lookup may imply a local, unicast, or multicast delivery. A local delivery occurs when the destination address is one of the router's local addresses and the packet is locally delivered. A unicast delivery sends the packet to a single output port. A multicast delivery is done through a set of output ports, depending on the multicast group membership of the router. In addition to table lookup, routers must perform other functions.

Packet validation: This function checks that the received IPv4 packet is properly formed for the protocol before proceeding with protocol processing. However, because the checksum calculation is considered too expensive, current routers rarely verify the checksum, instead assuming that packets are transmitted through reliable media like fiber optics and that end hosts will recognize any possible corruption.

Packet lifetime control: The router adjusts the time-to-live (TTL) field in the packet to prevent packets from looping endlessly in the network. A host sending a packet initializes the TTL to 64 (recommended by [67]) or 255 (the maximum). A packet being routed to output ports has its TTL value decremented by 1. A packet whose TTL reaches zero before the packet reaches its destination is discarded by the router.
Checksum update: The IP header checksum must be recalculated because the TTL field was changed. RFC 1071 [8] contains implementation techniques for computing the IP checksum. If only the TTL was decremented by 1, a router can efficiently update the checksum incrementally instead of recalculating it over the entire IP header [48].

Packet switching: Packet switching is the process of moving packets from one interface to another based on the forwarding decision. Packet switching can be done at very high speed [14, 52].

Queueing: Queueing is the action of buffering each packet in a small memory for a short time (on the order of a few microseconds) during processing of the packet. Queueing can be done at the input, in the switch fabric, and/or at the output.

1.1.1 Internet Protocols

The headers of the IPv4 and IPv6 protocols [21, 59] are shown in Figure 1-3. Unicast packets are forwarded based on the destination address field. Each router between source and destination must look at this field. Multicast packets are forwarded based on the source network and destination group address. The protocol field identifies the transport protocol (e.g., TCP [60] and UDP [61]) that is encapsulated within the IP packet. The type-of-service (ToS) field indicates to routers a packet's priority and its queueing and dropping behavior. Some applications such as telnet and FTP set these flags.

[Figure 1-2: Generic router architecture]

The most notable change from the IPv4 to the IPv6 header is the address length of 128 bits. Payload length is the length of the IPv6 payload in octets. Next header uses the same values as the IPv4 protocol field. Hop limit is decremented by 1 by each node that forwards the packet; the packet is discarded if the hop limit reaches zero. The flow ID field was added to simplify packet classification. The tuple (source address, flow ID) uniquely identifies a flow for any nonzero flow ID.
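The incremental checksum update described under "Checksum update" above can be sketched in a few lines. This is a minimal illustration, not the router implementation the text cites in [48]; it uses the published one's-complement identity HC' = ~(~HC + ~m + m') (RFC 1624's form of the RFC 1071 arithmetic), and the header words below are made up for the example.

```python
# One's-complement 16-bit addition, as used by the IPv4 header checksum.
def add16(a, b):
    s = a + b
    return (s & 0xFFFF) + (s >> 16)   # fold the carry back in

def full_checksum(words):
    # 'words' are the 16-bit header words with the checksum field set to 0.
    s = 0
    for w in words:
        s = add16(s, w)
    return ~s & 0xFFFF

def incremental_checksum(old_cksum, old_word, new_word):
    # HC' = ~(~HC + ~m + m'), computed in one's-complement arithmetic.
    c = add16(~old_cksum & 0xFFFF, ~old_word & 0xFFFF)
    c = add16(c, new_word)
    return ~c & 0xFFFF

# Hypothetical header: word 4 packs (TTL = 0x40, protocol = 0x06);
# word 5 is the (zeroed) checksum field.
header = [0x4500, 0x0054, 0x1C46, 0x4000, 0x4006, 0x0000,
          0xC0A8, 0x0001, 0xC0A8, 0x00C7]
cksum = full_checksum(header)
old_word = header[4]
new_word = old_word - 0x0100      # decrementing TTL lowers this word by 0x0100
updated = incremental_checksum(cksum, old_word, new_word)
header[4] = new_word
assert updated == full_checksum(header)   # matches a full recomputation
```

Because only one 16-bit word changes when the TTL is decremented, the router touches three words instead of the whole header, which is the saving the text refers to.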
The headers of two transport protocols, shown in Figure 1-4, provide more information, such as source and destination port numbers and flags, that can be used to further classify packets. In TCP and UDP networks, a port is an endpoint of a logical connection and the way a client program specifies a particular server program on a computer in a network. The two port numbers are used to distribute packets to applications and represent the fine-grained variety of flows. They can be used to identify a flow within the network. Thus, applications can reserve the resources to guarantee their service requirements using an appropriate signalling protocol like RSVP [9].

[Figure 1-3: Formats for IP packet header]

Some ports have numbers that are preassigned to them; these are known as well-known ports. Port numbers range from 0 to 65535, but only port numbers 0 to 1023 are reserved for privileged services and designated as well-known ports [67]. Each of these well-known port numbers specifies the port used by the server process as its contact port. For example, port numbers 20, 25, and 80 are assigned to FTP, simple mail transfer protocol (SMTP) [42], and hypertext transfer protocol (HTTP) [29] servers, respectively.

1.1.2 Classless Inter-Domain Routing (CIDR)

As the Internet has evolved and grown, it has faced two serious scaling problems [38].
Figure 1-4: Transport protocol header formats: (a) the TCP header (source and destination ports, sequence and acknowledgement numbers, data offset, flags, window size, checksum, urgent pointer, and options); (b) the UDP header (source and destination ports, UDP data length, and checksum).

Exhaustion of the IP address space: In the old Class A, B, and C addressing scheme, a fundamental cause of this problem was the lack of a network class of a size appropriate for a mid-sized organization. Class C, with a maximum of 254 host addresses, is too small, while Class B, which allows up to 65,534 addresses, is too large to be densely populated. The result is inefficient use of Class B network numbers. For example, an organization that needs 500 addresses to configure a network would be assigned a Class B number, leaving 65,034 addresses unused.

Routing information overload: As the number of networks on the Internet increased, so did the number of routes. The size and rate of growth of the router tables in Internet routers was beyond the ability to manage effectively.

CIDR is a mechanism to slow the growth of router tables and allow for more efficient allocation of IP addresses than the old Class A, B, and C address scheme. Two solutions to these problems were developed and adopted by the Internet community [66, 30].

Restructuring IP address assignments: Instead of being limited to a network identifier (or prefix) of 8, 16, or 24 bits, CIDR uses generalized prefixes of anywhere from 13 to 27 bits. Thus, blocks of addresses can be assigned to networks as small as 32 hosts or to those with over 500,000 hosts. This allows address assignments to much more closely fit an organization's specific needs.

Hierarchical routing aggregation: The CIDR addressing scheme also enables route aggregation, in which a single high-level route entry can represent many lower-level routes in the global router tables.
Big blocks of addresses are assigned to large Internet Service Providers (ISPs), who then reallocate portions of their address blocks to their customers. For example, a tier-1 ISP (e.g., Sprint or Pacific Bell) was assigned a CIDR address block with a 14-bit prefix and typically assigns its customers, who may be smaller tier-2 ISPs, CIDR addresses with prefixes ranging from 27 bits to 18 bits. These customers in turn reallocate portions of their address blocks to their own users and/or customers (tier-3 or local ISPs). In the backbone router tables, all these different networks and hosts can be represented by the single tier-1 ISP route entry. In this way, the growth in the number of router-table entries at each level in the network hierarchy has been significantly reduced.

1.1.3 Packet Forwarding

Consider the part of the Internet shown in Figure 1-5(a) to get an intuitive idea of packet delivery. If a user in Chicago wishes to send a packet to Orlando, the packet is first sent to router R4. Router R4 may send this packet on communication link L3 to router R1. Router R1 may then send the packet on link L4 to router R5 in Orlando. R5 then sends the packet to the final destination.

An Internet router table is a set of tuples of the form (p, a), where p is a binary string whose length is at most W (W = 32 for IPv4 destination addresses and W = 128 for IPv6), and a is an output link (or next hop). When a packet with destination address A arrives at a router, we are to find the pair (p, a) in the router table for which p is a longest matching prefix of A (i.e., p is a prefix of A and there is no longer prefix q of A such that (q, b) is in the table). Once this pair is determined, the packet is sent to output link a. The speed at which the router can route packets is limited by the time it takes to perform this table lookup for each packet. For example, consider the router table at router R1 in Figure 1-5(a), shown in Figure 1-5(b).
Assume that a packet arriving at router R1 carries the destination address 101110 in its header. In this example we assume that the longest prefix length is 6. To forward the packet toward its final destination, router R1 consults its router table, which lists each prefix and the corresponding output link. The address 101110 matches both 1* and 101* in the router table, but 101* is the longest matching prefix. Since the table indicates output link L2, the router switches the packet to L2.

Longest-prefix routing is used because it results in smaller and more manageable router tables. It is impractical for a router table to contain an entry for each possible destination address. Two of the reasons this is so are:

The number of such entries would be almost one hundred million and would triple every 3 years.

Every time a new host comes online, all router tables would have to incorporate the new host's address.

By using longest-prefix routing, the size of router tables is contained to a reasonable quantity, and information about host/router changes made in one part of the Internet need not be propagated throughout the Internet.

Figure 1-5: Router table example: (a) backbone routers (Chicago advertises the prefix 11010*); (b) the router table for router R1, with prefixes 1*, 101*, 11010*, and 11100* mapped to output links L1, L2, L3, and L4, respectively.

1.2 Packet Classification

An Internet router classifies incoming packets into flows,3 using information contained in packet headers and a table of (classification) rules. This table is called the rule table (equivalently, router table). The packet-header information used to perform the classification is some subset of the source and destination addresses, the source and destination ports, the protocol, protocol flags, type of service, and so on. The specific header information used for packet classification is governed by the rules in the rule table.

3 A flow is a set of packets that are to be treated similarly for routing purposes.
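The lookup just traced can be mimicked with a naive left-to-right scan over the table. The following is an illustrative Python sketch (mine, not a router implementation); the table is the example router table for R1, with prefixes written without the trailing *:

```python
# Example router table for R1: prefix -> output link.
TABLE = {"1": "L1", "101": "L2", "11010": "L3", "11100": "L4"}

def longest_matching_prefix(dest):
    """Return (prefix, link) for the longest matching prefix of dest, or None."""
    best = None
    for prefix, link in TABLE.items():
        # A prefix matches iff it is an initial substring of the address.
        if dest.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, link)
    return best
```

For the destination address 101110, both 1* and 101* match; the function returns ("101", "L2"), so the packet is switched to output link L2.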
Each rule-table rule is a pair of the form (F, A), where F is a filter and A is an action. The action component of a rule specifies what is to be done when a packet that satisfies the rule's filter is received. Sample actions are: drop the packet, forward the packet along a certain output link, and reserve a specified amount of bandwidth. A rule filter F is a tuple that is comprised of one or more fields. In the simplest case of destination-based packet forwarding, F has a single field, which is a destination (address) prefix, and A is the next hop for packets whose destination address has the specified prefix. For example, the rule (01*, a) states that the next hop for packets whose destination address (in binary) begins with 01 is a. IP multicasting uses rules in which F comprises the two fields source prefix and destination prefix; QoS routers may use five-field rule filters (source-address prefix, destination-address prefix, source-port range, destination-port range, and protocol); and firewall filters may have one or more fields.

In the d-dimensional packet classification problem, each rule has a d-field filter. Our study was concerned solely with 1-dimensional packet classification. It should be noted that data structures for multidimensional packet classification are usually built on top of data structures for 1-dimensional packet classification. Therefore, the study of data structures for 1-dimensional packet classification is fundamental to the design and development of data structures for d-dimensional, d > 1, packet classification. For the 1-dimensional packet classification problem, we assume that the single field in the filter is the destination field and that the action is the next hop for the packet. With these assumptions, 1-dimensional packet classification is equivalent to the destination-based packet forwarding problem.
Henceforth, we use the terms rule table and router table to mean tables in which the filters have a single field, which is the destination address. This single field of a filter may be specified in one of two ways:

As a range: For example, the range [35, 2096] matches all destination addresses d such that 35 <= d <= 2096.

As an address/mask pair: Let x_i denote the i-th bit of x. The address/mask pair a/m matches all destination addresses d for which d_i = a_i for all i for which m_i = 1. That is, a 1 in the mask specifies a bit position in which d and a must agree, while a 0 in the mask specifies a don't-care bit position. For example, the address/mask pair 101100/011101 matches the destination addresses 101100, 101110, 001100, and 001110.

When all the 1-bits of a mask are to the left of all 0-bits, the address/mask pair specifies an address prefix. For example, 101100/110000 matches all destination addresses that have the prefix 10 (i.e., all destination addresses that begin with 10). In this case, the address/mask pair is simply represented as the prefix 10*, where the * denotes a sequence of don't-care bits. If W is the length, in bits, of a destination address, then the * in 10* represents all sequences of (W - 2) bits. In IPv4 the address and mask are both 32 bits, while in IPv6 both are 128 bits.

Notice that every prefix may be represented as a range. For example, when W = 6, the prefix 10* is equivalent to the range [32, 47]. A range that may be specified as a prefix for some W is called a prefix range. The specification 101100/011101 may be abbreviated to ?011?0, where ? denotes a don't-care bit. This specification is not equivalent to any single range. Also, the range specification [3, 6] is not equivalent to any single address/mask specification.

When more than one rule matches an incoming packet, a tie occurs. To select one of the many rules that may match an incoming packet, we use a tie breaker.
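The two specification styles are easy to relate programmatically. The sketch below (illustrative Python, not part of the dissertation) tests an address against an address/mask pair and converts a prefix to its equivalent range for a given W; the values mirror the examples above:

```python
def mask_match(d, a, m):
    """d, a, m are equal-length bit strings; d matches a/m iff d and a
    agree in every position where the mask bit is 1."""
    return all(db == ab for db, ab, mb in zip(d, a, m) if mb == "1")

def prefix_to_range(prefix, W):
    """Inclusive range of W-bit addresses covered by prefix (the '*' is implicit):
    pad with all 0s for the low end and all 1s for the high end."""
    low = int(prefix + "0" * (W - len(prefix)), 2)
    high = int(prefix + "1" * (W - len(prefix)), 2)
    return low, high
```

Here prefix_to_range("10", 6) returns (32, 47), matching the prefix-range example, and mask_match accepts exactly the four addresses listed for 101100/011101.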
Let RS be the set of rules in a rule table and let FS be the set of filters associated with these rules. rules(d, RS) (or simply rules(d) when RS is implicit) is the subset of rules of RS that match/cover the destination address d. filters(d, FS) and filters(d) are defined similarly. A tie occurs whenever |rules(d)| > 1 (equivalently, |filters(d)| > 1). Three popular tie breakers are:

First matching rule in table: The rule table is assumed to be a linear list ([39]) of rules, with the rules indexed 1 through n for an n-rule table. The action corresponding to the first rule in the table that matches the incoming packet is used. In other words, for packets with destination address d, the rule of rules(d) that has least index is selected. For our example router table corresponding to the five prefixes of Figure 4-1, rule R1 is selected for every incoming packet, because P1 matches every destination address. When using the first-matching-rule criterion, we must index the rules carefully. In our example, P1 should correspond to the last rule so that every other rule has a chance to be selected for at least one destination address.

Highest-priority rule: Each rule in the rule table is assigned a priority. From among the rules that match an incoming packet, the rule that has highest priority is selected. To avoid the possibility of a further tie, rules are assigned different priorities (it is actually sufficient to ensure that for every destination address d, rules(d) does not have two or more highest-priority rules). Notice that the first-matching-rule criterion is a special case of the highest-priority criterion (simply assign each rule a priority equal to the negative of its index in the linear list).

Most-specific-rule matching: The filter F1 is more specific than the filter F2 iff F2 matches all packets matched by F1 plus at least one additional packet. So, for example, the range [2, 4] is more specific than [1, 6], and [5, 9] is more specific than [5, 12].
Since [2, 4] and [8, 14] are disjoint (i.e., they have no address in common), neither is more specific than the other. Also, since [4, 14] and [6, 20] intersect,4 neither is more specific than the other. The prefix 110* is more specific than the prefix 11*. In most-specific-rule matching, ties are broken by selecting the matching rule that has the most specific filter. When the filters are destination prefixes, the most specific rule that matches a given destination d is the longest5 prefix in filters(d). Hence, for prefix filters, the most-specific-rule tie breaker is equivalent to the longest-matching-prefix criterion used in router tables. For our example rule set, when the destination address is 18, the longest matching prefix is P4. When the filters are ranges, the most-specific-rule tie breaker requires us to select the most specific range in filters(d). Notice also that most-specific-range matching is a special case of the highest-priority rule. For example, when the filters are prefixes, set the prefix priority equal to the prefix length. For the case of ranges, the range priority equals the negative of the range size.

In a static rule table, the rule set does not vary in time. For these tables, we are concerned primarily with the following metrics:

Time required to process an incoming packet: This is the time required to search the rule table for the rule to use.

Preprocessing time: This is the time to create the rule-table data structure.

Storage requirement: That is, how much memory is required by the rule-table data structure?

4 Two ranges [u, v] and [x, y] intersect iff u < x < v < y or x < u < y < v.

5 The length of a prefix is the number of bits in that prefix (note that the * is not used in determining prefix length). The length of P1 is 0 and that of P2 is 4.

In practice, rule tables are seldom truly static. At best, rules may be added to or deleted from the rule table infrequently.
Typically, in a "static" rule table, inserts/deletes are batched and the rule-table data structure is reconstructed as needed. In a dynamic rule table, rules are added/deleted with some frequency. For such tables, inserts/deletes are not batched; rather, they are performed in real time. For such tables, we are concerned additionally with the time required to insert/delete a rule. For a dynamic rule table, the initial rule-table data structure is constructed by starting with an empty data structure and then inserting the initial set of rules into the data structure one by one. So, typically, in the case of dynamic tables, the preprocessing metric mentioned above is very closely related to the insert time.

In this dissertation, we focus on data structures for static and dynamic router tables (1-dimensional packet classification) in which the filters are either prefixes or ranges. Although some of our data structures apply equally well to all three of the commonly used tie breakers, our focus in this dissertation is on longest-prefix matching.

1.3 Prior Work

Several solutions for the IP lookup problem (i.e., finding the longest matching prefix) have been proposed. Let LMP(d) be the longest matching prefix for address d.

1.3.1 Linear List

In this data structure, the rules of the rule table are stored as a linear list ([39]) L. LMP(d) is determined by examining the prefixes in L from left to right; for each prefix, we determine whether or not that prefix matches d; and from the set of matching prefixes, the one with the longest length is selected. To insert a rule q, we first search the list L from left to right to ensure that L doesn't already have a rule with the same filter as q. Having verified this, the new rule q is added to the end of the list. Deletion is similar. The time for each of the operations to determine LMP(d), insert a rule, and delete a rule is O(n), where n is the number of rules in L. The memory required is also O(n).
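A direct rendering of this linear-list scheme might look as follows (an illustrative Python sketch; the class and method names are mine, not the dissertation's):

```python
class LinearListTable:
    """Rules stored as a list of (prefix, next_hop) pairs; all operations are O(n)."""

    def __init__(self):
        self.rules = []

    def lookup(self, d):
        """Scan left to right and return the next hop of LMP(d), or None."""
        best = None
        for prefix, hop in self.rules:
            if d.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
                best = (prefix, hop)
        return None if best is None else best[1]

    def insert(self, prefix, hop):
        """Verify no rule with the same filter exists, then append at the end."""
        if any(p == prefix for p, _ in self.rules):
            raise ValueError("duplicate filter")
        self.rules.append((prefix, hop))

    def delete(self, prefix):
        """Remove the rule whose filter equals prefix."""
        self.rules = [(p, h) for p, h in self.rules if p != prefix]
```

Note that lookup must always scan the whole list, since a longer matching prefix may appear after a shorter one.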
Note that this data structure may be used regardless of the form of the filter (i.e., ranges, Boolean expressions, etc.) and regardless of the tie breaker in use. The time and memory complexities are unchanged.

1.3.2 End-Point Array

Lampson, Srinivasan, and Varghese [44] proposed a data structure in which the end points of the ranges defined by the prefixes are stored in ascending order in an array. LMP(d) is found by performing a binary search on this ordered array of end points. Although Lampson et al. [44] provide ways to reduce the complexity of the search for the LMP by a constant factor, these methods do not result in schemes that permit prefix insertion and deletion in O(log n) time. It should be noted that the end-point array may be used even when ties are broken by selecting the first matching rule or the highest-priority matching rule. Further, the method applies to the case when the filters are arbitrary ranges rather than simply prefixes. The complexity of the preprocessing step (i.e., creation of the array of ordered end points) and of the search for the rule to use is unchanged. Further, the memory requirement is the same, O(n) for an n-rule table, regardless of the tie breaker and of whether the filters are prefixes or general ranges.

1.3.3 Sets of Equal-Length Prefixes

Waldvogel et al. [87] proposed a data structure to determine LMP(d) by performing a binary search on prefix length. In this data structure, the prefixes in the router table T are partitioned into the sets S0, S1, ..., such that Si contains all prefixes of T whose length is i. For simplicity, we assume that T contains the default prefix *. So, S0 = {*}. Next, each Si is augmented with markers that represent prefixes in Sj such that j > i and i is on the binary search path to Sj. For example, suppose that the length of the longest prefix of T is 32 and that the length of LMP(d) is 22.
To find LMP(d) by a binary search on length, we first search S16 for an entry that matches the first 16 bits of d. This search6 will need to be successful for us to proceed to a larger length. The next search will be in S24. This search will need to fail. Then, we will search S20 followed by S22. So, the path followed by a binary search on length to get to S22 is S16, S24, S20, and S22. For this path to be followed, the searches in S16, S20, and S22 must succeed, while that in S24 must fail. Since the length of LMP(d) is 22, T has no matching prefix whose length is more than 22. So, the search in S24 is guaranteed to fail. Similarly, the search in S22 is guaranteed to succeed. However, the searches in S16 and S20 will succeed iff T has matching prefixes of length 16 and 20. To ensure success, every length-22 prefix P places a marker in S16 and S20; the marker in S16 is the first 16 bits of P and that in S20 is the first 20 bits of P. Note that a marker M is placed in Si only if Si doesn't contain a prefix equal to M. Notice also that for each i, the binary search path to Si has O(log l_max) = O(log W) Sj's on it, where l_max is the length of the longest prefix in T. So, each prefix creates O(log W) markers. With each marker M in Si, we record the longest prefix of T that matches M (the length of this longest matching prefix is necessarily smaller than i).

To determine LMP(d), we begin by setting leftEnd = 0 and rightEnd = l_max. The repetitive step of the binary search requires us to search for an entry in Sm, where m = ⌊(leftEnd + rightEnd)/2⌋, that equals the first m bits of d. If Sm does not have such an entry, we set rightEnd = m - 1. Otherwise, if the matching entry is the prefix P, P becomes the longest matching prefix found so far. If the matching entry is the marker M, the prefix recorded with M is the longest matching prefix found so far.

6 When searching Si, only the first i bits of d are used, because all prefixes in Si have exactly i bits.
In either case, we set leftEnd = m + 1. The binary search terminates when leftEnd > rightEnd. One may easily establish the correctness of the described binary search. Since each prefix creates O(log W) markers, the memory requirement of the scheme is O(n log W).

When each set Si is represented as a hash table, the data structure is called SELPH (sets of equal-length prefixes using hash tables). The expected time to find LMP(d) is O(log W) when the router table is represented as an SELPH. When inserting a prefix, O(log W) markers must also be inserted. With each marker, we must record a longest matching prefix. The expected time to find these longest matching prefixes is O(log^2 W). In addition, we may need to update the longest-matching-prefix information stored with the O(n log W) markers at lengths greater than the length of the newly inserted prefix. This takes O(n log^2 W) time. So, the expected insert time is O(n log^2 W). When deleting a prefix P, we must search all hash tables for markers M that have P recorded with them and then update the recorded prefix for each of these markers. For hash tables with a bounded loading density, the expected time for a delete (including marker-prefix updates) is O(n log^2 W). Waldvogel et al. [87] have shown that by inserting the prefixes in ascending order of length, an n-prefix SELPH may be constructed in O(n log^2 W) time.

When each set is represented as a balanced search tree, the data structure is called SELPT. In an SELPT, the time to find LMP(d) is O(log n log W); the insert time is O(n log n log^2 W); the delete time is O(n log n log^2 W); and the time to construct the data structure for n prefixes is O(W + n log n log^2 W). In the full version of [87], Waldvogel et al. show that by using a technique called marker partitioning, the SELPH data structure may be modified to have a search time of O(α + log W) and an insert/delete time of O(α·n^{1/α}·W log W), for any α > 1.
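The marker-based binary search on length can be made concrete with a small sketch. The following illustrative Python (mine, not the code of [87]) uses dictionaries in place of hash tables, omits the default prefix *, and computes each marker's recorded longest matching prefix naively at build time; a real implementation would maintain this information incrementally:

```python
def search_path_markers(length, W):
    """Lengths m < length visited by a binary search on [1, W] on the way to length."""
    lo, hi, path = 1, W, []
    while lo <= hi:
        m = (lo + hi) // 2
        if m == length:
            break
        if m < length:
            path.append(m)   # a successful probe is needed here; place a marker
            lo = m + 1
        else:
            hi = m - 1
    return path

def build_selph(prefixes, W):
    """tables[i] maps an i-bit string to the longest prefix of the table matching it."""
    tables = {i: {} for i in range(1, W + 1)}
    for p in prefixes:                       # the real prefixes
        tables[len(p)][p] = p

    def lmp_of(s):                           # naive longest matching prefix of s
        best = ""
        for p in prefixes:
            if s.startswith(p) and len(p) > len(best):
                best = p
        return best

    for p in prefixes:                       # markers, skipped if a prefix is already there
        for i in search_path_markers(len(p), W):
            tables[i].setdefault(p[:i], lmp_of(p[:i]))
    return tables

def selph_lmp(tables, d, W):
    lo, hi, best = 1, W, None
    while lo <= hi:
        m = (lo + hi) // 2
        hit = tables[m].get(d[:m])
        if hit is None:
            hi = m - 1                       # no entry: no longer prefix down this path
        else:
            if hit:                          # prefix itself, or the marker's recorded LMP
                best = hit
            lo = m + 1
    return best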
Because of the excessive insert and delete times, the sets-of-equal-length-prefixes data structure is suitable only for static router tables. By using the prefix expansion method [22, 82], we can limit the number of distinct lengths in the prefix set and so reduce the run time by a constant factor [87].

1.3.4 Tries

IP lookup in the BSD kernel is done using the Patricia data structure [78], which is a variant of a compressed binary trie [39]. This scheme requires O(W) memory accesses per lookup, insert, and delete. We note that the lookup complexity of longest-prefix-matching algorithms is generally measured by the number of accesses made to main memory (equivalently, the number of cache misses). Dynamic prefix tries, which are an extension of Patricia and which also take O(W) memory accesses for lookup, were proposed by Doeringer et al. [23].

For IPv4 prefix sets, Degermark et al. [22] proposed the use of a three-level trie in which the strides are 16, 8, and 8. They propose encoding the nodes in this trie using bit vectors to reduce memory requirements. The resulting data structure requires at most 12 memory accesses. However, inserts and deletes are quite expensive. For example, the insertion of the prefix 1* changes up to 2^15 entries in the trie's root node. All of these changes may propagate into the compacted storage scheme of [22].

The multibit trie data structures of Srinivasan and Varghese [82] are, perhaps, the most flexible and effective trie structures for IP lookup. Using a technique called controlled prefix expansion, which is very similar to the technique used in [22], tries of a predetermined height (and hence with a predetermined number of memory accesses per lookup) may be constructed for any prefix set. Srinivasan and Varghese [82] develop dynamic programming algorithms to obtain space-optimal fixed-stride tries (FSTs) and variable-stride tries (VSTs) of a given height. Lampson et al.
[44] proposed the use of hybrid data structures comprised of a stride-16 root and an auxiliary data structure for each of the subtries of the stride-16 root. This auxiliary data structure could be the end-point array (since each subtrie is expected to contain only a small number of prefixes, the number of end points in each end-point array is also expected to be quite small). An alternative auxiliary data structure suggested by Lampson et al. [44] is a 6-way search tree for IPv4 router tables. In the case of these 6-way trees, the keys are the remaining up to 16 bits of the prefix (recall that the stride-16 root consumes the first 16 bits of a prefix). For IPv6 prefixes, a multicolumn scheme is suggested [44]. None of these proposed structures is suitable for a dynamic table.

Nilsson and Karlsson [57] propose a greedy heuristic to construct optimal VSTs. They call the resulting VSTs LC-tries (level-compressed tries). An LC-trie is obtained from a 1-bit trie by replacing full subtries of the 1-bit trie with single multibit nodes. This replacement is done by examining the 1-bit trie top to bottom (i.e., from root to leaves).

1.3.5 Binary Search Trees

Suri et al. [84] proposed a B-tree data structure for dynamic router tables. Using their structure, we may find the longest matching prefix in O(log n) time. However, inserts/deletes take O(W log n) time. The number of cache misses is O(log n) for each operation. When W bits fit in O(1) words (as is the case for IPv4 and IPv6 prefixes), logical operations on W-bit vectors can be done in O(1) time each. In this case, the scheme of [84] takes O(log W log n) time for an insert and O(W + log n) = O(W) time for a delete.

Several researchers (for example, [16, 25, 26, 36, 74]) have investigated router-table data structures that account for bias in access patterns. Gupta, Prabhakar, and Boyd [36], for example, propose the use of ranges.
They assume that access frequencies for the ranges are known, and they construct a bounded-height binary search tree of ranges. This binary search tree accounts for the known range-access frequencies to obtain near-optimal IP lookup. Although the scheme of [36] performs IP lookup in near-optimal time, changes in the access frequencies, or the insertion or removal of a prefix, require us to reconstruct the data structure, a task that takes O(n log n) time. Ergun et al. [25, 26] use ranges to develop a biased skip list structure that performs longest-prefix matching in O(log n) expected time. Their scheme is designed to give good expected performance for bursty7 access patterns. The biased skip list scheme of Ergun et al. [25, 26] permits inserts and deletes in O(log n) time only in the severely restricted, and impractical, situation when all prefixes in the router table are of the same length. For the more general, and practical, case when the router table comprises prefixes of different lengths, their scheme takes O(n) expected time for each insert and delete.

1.3.6 Others

Cheung and McCanne [16] developed "a model for table-driven route lookup and cast the table design problem as an optimization problem within this model." Their model accounts for the memory hierarchy of modern computers, and they optimize average performance rather than worst-case performance.

Gupta and McKeown [33] examine the asymptotic complexity of a related problem, packet classification. They develop two data structures, heap-on-trie (HoT) and binary-search-tree-on-trie (BoT), for the dynamic packet classification problem. The complexity of these data structures (for packet classification and the insertion and deletion of rules) also depends on W. For d-dimensional rules, a search in a HoT takes O(W^d) time and an update (insert or delete) takes O(W^d log n) time. The corresponding times for a BoT are O(W^d log n) and O(W^{d-1} log n), respectively.
7 In a bursty access pattern, the number of different destination addresses in any subsequence of q packets is << q. That is, if the destination of the current packet is d, there is a high probability that d is also the destination of one or more of the next few packets. The fact that Internet packets tend to be bursty has been noted in [18, 46], for example.

Hardware solutions that involve the use of content addressable memory [50], as well as solutions that involve modifications to the Internet Protocol (i.e., the addition of information to each packet), have also been proposed [10, 13, 56].

1.4 Dissertation Outline

The remainder of the dissertation is organized as follows. Chapters 2 and 3 concentrate on two data structures for static router tables, in which the rule set does not vary in time. In Chapter 2, we develop new dynamic programming formulations for the construction of space-optimal tries of a predetermined height. In Chapter 3, we develop an algorithm for the problem of minimizing the storage required by a collection of hash tables. Also, we propose improvements to the heuristic of [80]. Chapters 4 and 5 provide data structures for dynamic router tables, in which rules are added/deleted with some frequency. In Chapter 4, we show how to use the range-encoding idea of [44] so that longest-prefix matching, as well as prefix insertion and deletion, can be done in O(log n) time. Chapter 5 addresses the management of router tables in a dynamic environment (i.e., searches, inserts, and deletes are performed dynamically) in which the access pattern is bursty.

CHAPTER 2
MULTIBIT TRIES

In this chapter, we focus on the controlled prefix expansion technique of Srinivasan and Varghese [82]. In particular, we develop new dynamic programming formulations for the construction of space-optimal tries of a predetermined height.
While the asymptotic complexities of the algorithms that result from our formulations are the same as those for the corresponding algorithms of [82], experiments using real IPv4 routing-table data indicate that our algorithms run considerably faster. Our fixed-stride trie algorithm is 2 to 4 times as fast on a SUN workstation and 1.5 to 3 times as fast on a Pentium 4 PC. On a SUN workstation, our variable-stride trie algorithm is between 2 and 17 times as fast as the corresponding algorithm of [82]; on a Pentium 4 PC, our algorithm is between 3 and 47 times as fast.

In Section 2.1, we describe the data structure for 1-bit tries. We develop our new dynamic programming formulations for fixed-stride and variable-stride tries in Sections 2.2 and 2.3, respectively. In Section 2.4, we present our experimental results.

2.1 1-Bit Tries

A 1-bit trie is a tree-like structure in which each node has a left child, left data, right child, and right data field. Nodes at level l - 1 of the trie store prefixes whose length is l (the length of a prefix is the number of bits in that prefix; the terminating * (if present) does not count towards the prefix length). If the rightmost bit in a prefix whose length is l is 0, the prefix is stored in the left data field of a node that is at level l - 1; otherwise, the prefix is stored in the right data field of a node that is at level l - 1. At level i of a trie, branching is done by examining bit i (bits are numbered from left to right beginning with the number 0, and levels are numbered with the root being at level 0) of a prefix or destination address. When bit i is 0, we move into the left subtree; when the bit is 1, we move into the right subtree. Figure 2-1(a) gives the prefixes in the 8-prefix example of [82], and Figure 2-1(b) shows the corresponding 1-bit trie. The prefixes in Figure 2-1(a) are numbered and ordered as in [82]. Since the trie of Figure 2-1(b) has a height of 6, a search in this trie may make up to 7 memory accesses.
The total memory required for the 1-bit trie of Figure 2-1(b) is 20 units (each node requires 2 units, one for each pair of (child, data) fields). The 1-bit tries described here are an extension of the 1-bit tries described in [39], the primary difference being that the 1-bit tries of [39] are for the case when all keys (prefixes) have the same length.

Figure 2-1: Prefixes and corresponding 1-bit trie: (a) the 8-prefix example of [82] (P1 = 10*, P2 = 111*, P3 = 11001*, P4 = 1*, P5 = 0*, P6 = 1000*, P7 = 100000*, P8 = 1000000*); (b) the corresponding 1-bit trie.

When 1-bit tries are used to represent IPv4 router tables, the trie height may be as much as 31. A lookup in such a trie takes up to 32 memory accesses. Table 2-1 gives the characteristics of five IPv4 backbone router prefix sets, and Table 2-2 gives a more detailed characterization of the prefixes in the largest of these five databases, Paix [53]. For our five databases, the number of nodes in a 1-bit trie is between 2n and 3n, where n is the number of prefixes in the database (Table 2-1).

Table 2-1: Prefix databases obtained from the IPMA project on Sep 13, 2000

Database   Prefixes   16-bit prefixes   24-bit prefixes   Nodes*
Paix       85,682     6,606             49,756            173,012
Pb         35,151     2,684             19,444            91,718
MaeWest    30,599     2,500             16,260            81,104
Aads       26,970     2,236             14,468            74,290
MaeEast    22,630     1,810             11,386            65,862

* The last column shows the number of nodes in the 1-bit trie representation of the prefix database. Note: the number of prefixes stored at level i of a 1-bit trie equals the number of prefixes whose length is i + 1.
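The node structure just described translates directly into code. Below is an illustrative Python sketch (the names are mine, and error handling is omitted): a prefix of length l is stored in the left or right data field of a node at level l - 1, and a lookup branches on one bit per level while remembering the last prefix seen:

```python
class Node:
    def __init__(self):
        self.left = self.right = None              # child fields
        self.left_data = self.right_data = None    # data fields

def insert(root, prefix):
    """Store prefix (a nonempty bit string, '*' omitted) at level len(prefix) - 1."""
    node = root
    for bit in prefix[:-1]:                        # branch on all but the last bit
        if bit == "0":
            node.left = node.left or Node()
            node = node.left
        else:
            node.right = node.right or Node()
            node = node.right
    if prefix[-1] == "0":                          # the last bit selects the data field
        node.left_data = prefix
    else:
        node.right_data = prefix

def lmp(root, d):
    """Longest matching prefix of destination address d, or None."""
    node, best = root, None
    for bit in d:
        if node is None:
            break
        data = node.left_data if bit == "0" else node.right_data
        if data is not None:                       # a prefix of length level+1 matches
            best = data
        node = node.left if bit == "0" else node.right
    return best
```

Inserting the eight prefixes of the example above and looking up the address 1100111 walks the path 1, 11, 110, 1100 and finds P3 = 11001* at level 4.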
Table 2-2: Distribution of the prefixes and nodes in the 1-bit trie for Paix

Level  Prefixes  Nodes      Level  Prefixes  Nodes
0      0         1          16     918       5,117
1      0         2          17     1,787     8,245
2      0         4          18     5,862     12,634
3      0         7          19     3,614     15,504
4      0         11         20     3,750     20,557
5      0         20         21     5,525     26,811
6      0         36         22     7,217     32,476
7      22        62         23     49,756    37,467
8      4         93         24     12        54
9      5         169        25     26        44
10     9         303        26     12        20
11     26        561        27     5         9
12     56        1,037      28     4         5
13     176       1,933      29     1         2
14     288       3,552      30     0         1
15     6,606     6,274      31     1         1

2.2 Fixed-Stride Tries

2.2.1 Definition

Srinivasan and Varghese [82] have proposed the use of fixed-stride tries to enable fast identification of the longest matching prefix in a router table. The stride of a node is defined to be the number of bits used at that node to determine which branch to take. A node whose stride is s has 2^s child fields (corresponding to the 2^s possible values for the s bits that are used) and 2^s data fields. Such a node requires 2^s memory units. In a fixed-stride trie (FST), all nodes at the same level have the same stride; nodes at different levels may have different strides.

Suppose we wish to represent the prefixes of Figure 2-1(a) using an FST that has three levels. Assume that the strides are 2, 3, and 2. The root of the trie stores prefixes whose length is 2; the level-one nodes store prefixes whose length is 5 (2 + 3); and the level-two nodes store prefixes whose length is 7 (2 + 3 + 2). This poses a problem for the prefixes of our example, because the length of some of these prefixes is different from the storable lengths. For instance, the length of P5 is 1. To get around this problem, a prefix with a nonpermissible length is expanded to the next permissible length. For example, P5 = 0* is expanded to P5a = 00* and P5b = 01*. If one of the newly created prefixes is a duplicate, natural dominance rules are used to eliminate all but one occurrence of the prefix. For instance, P4 = 1* is expanded to P4a = 10* and P4b = 11*.
However, P1 = 10* is to be chosen over P4a = 10*, because P1 is a longer match than P4. So, P4a is eliminated. Because of the elimination of duplicate prefixes from the expanded prefix set, all prefixes are distinct. Figure 2-2(a) shows the prefixes that result when we expand the prefixes of Figure 2-1 to the lengths 2, 5, and 7; the expanded set is 00* (P5a), 01* (P5b), 10* (P1), 11* (P4b), 11100* (P2a), 11101* (P2b), 11110* (P2c), 11111* (P2d), 11001* (P3), 10000* (P6a), 10001* (P6b), 1000001* (P7), and 1000000* (P8). Figure 2-2(b) shows the corresponding FST, whose height is 2 and whose strides are 2, 3, and 2.

Figure 2-2: Prefix expansion and fixed-stride trie

Since the trie of Figure 2-2(b) can be searched with at most 3 memory references, it represents a time-performance improvement over the 1-bit trie of Figure 2-1(b), which requires up to 7 memory references to perform a search. However, the space requirements of the FST of Figure 2-2(b) are more than those of the corresponding 1-bit trie. For the root of the FST, we need 8 fields or 4 units; the two level-1 nodes require 8 units each; and the level-2 node requires 4 units. The total is 24 memory units. We may represent the prefixes of Figure 2-1(a) using a one-level trie whose root has a stride of 7. Using such a trie, searches could be performed making a single memory access. However, the one-level trie would require 2^7 = 128 memory units.

2.2.2 Construction of Optimal Fixed-Stride Tries

In the fixed-stride trie optimization (FSTO) problem, we are given a set P of prefixes and an integer k. We are to select the strides for a k-level FST in such a manner that the k-level FST for the given prefixes uses the smallest amount of memory. For some P, a k-level FST may actually require more space than a (k-1)-level FST.
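The expansion-plus-dominance step can be sketched as follows (a sketch with our own function and variable names; prefixes are bit strings without the trailing *):

```python
def expand(prefixes, lengths):
    """Expand each prefix to the next permissible length.

    lengths is the sorted list of permissible lengths. Duplicates are
    resolved in favor of the longest original prefix (natural dominance).
    Returns a dict mapping each expanded prefix to the original prefix
    it represents.
    """
    table = {}
    for p in prefixes:
        target = min(l for l in lengths if l >= len(p))
        k = target - len(p)
        # all 2^k bit suffixes that pad p out to the target length
        suffixes = [""] if k == 0 else [format(i, "0%db" % k) for i in range(2 ** k)]
        for suf in suffixes:
            e = p + suf
            # dominance: a longer original prefix wins over a shorter one
            if e not in table or len(p) > len(table[e]):
                table[e] = p
    return table
```

Expanding the 8-prefix example to the lengths 2, 5, and 7 yields 13 distinct prefixes; in particular, the expanded prefix 10 maps back to P1 = 10* rather than to P4 = 1*, exactly as the dominance rule dictates.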
For example, when P = {00*, 01*, 10*, 11*}, the unique 1-level FST for P requires 4 memory units while the unique 2-level FST (which is actually the 1-bit trie for P) requires 6 memory units. Since the search time for a (k-1)-level FST is less than that for a k-level FST, we would actually prefer (k-1)-level FSTs that take less (or even equal) memory over k-level FSTs. Therefore, in practice, we are really interested in determining the best FST that uses at most k levels (rather than exactly k levels). The modified FSTO problem (MFSTO) is to determine the best FST that uses at most k levels for the given prefix set P.

Let O be the 1-bit trie for the given set of prefixes, and let F be any k-level FST for this prefix set. Let s_0, ..., s_{k-1} be the strides for F. We shall say that level 0 of F covers levels 0, ..., s_0 - 1 of O, and that level j, 0 < j < k, of F covers levels a, ..., b of O, where a = sum_{q=0}^{j-1} s_q and b = sum_{q=0}^{j} s_q - 1. So, level 0 of the FST of Figure 2-2(b) covers levels 0 and 1 of the 1-bit trie of Figure 2-1(b). Level 1 of this FST covers levels 2, 3, and 4 of the 1-bit trie of Figure 2-1(b); and level 2 of this FST covers levels 5 and 6 of the 1-bit trie. We shall refer to the levels e_u = sum_{q=0}^{u-1} s_q, 0 <= u < k, as the expansion levels of O. The expansion levels defined by the FST of Figure 2-2(b) are 0, 2, and 5.

Let nodes(i) be the number of nodes at level i of the 1-bit trie O. For the 1-bit trie of Figure 2-1(b), nodes(0 : 6) = [1, 1, 2, 2, 2, 1, 1]. The memory required by F is sum_{q=0}^{k-1} nodes(e_q) * 2^{s_q}. For example, the memory required by the FST of Figure 2-2(b) is nodes(0) * 2^2 + nodes(2) * 2^3 + nodes(5) * 2^2 = 24.

Let T(j, r), r <= j + 1, be the cost (i.e., memory requirement) of the best way to cover levels 0 through j of O using exactly r expansion levels. When the maximum prefix length is W, T(W-1, k) is the cost of the best k-level FST for the given set of prefixes.
Srinivasan and Varghese [82] have obtained the following dynamic programming recurrence for T:

T(j, r) = min_{m in {r-2..j-1}} {T(m, r-1) + nodes(m+1) * 2^{j-m}}, r > 1    (2.1)

T(j, 1) = 2^{j+1}    (2.2)

The rationale for Equation 2.1 is that the best way to cover levels 0 through j of O using exactly r expansion levels, r > 1, must have its last expansion level at level m + 1 of O, where m must be at least r - 2 (as otherwise, we do not have enough levels between levels 0 and m of O to select the remaining r - 1 expansion levels) and at most j - 1 (because the last expansion level is <= j). When the last expansion level is level m + 1, the stride for this level is j - m, and the number of nodes at this expansion level is nodes(m + 1). For optimality, levels 0 through m of O must be covered in the best possible way using exactly r - 1 expansion levels.

As noted by Srinivasan and Varghese [82], using the above recurrence, we may determine T(W-1, k) in O(kW^2) time (excluding the time needed to compute O from the given prefix set and determine nodes()). The strides for the optimal k-level FST can be obtained in an additional O(k) time. Since Equation 2.1 also may be used to compute T(W-1, q) for all q <= k in O(kW^2) time, we can actually solve the MFSTO problem in the same asymptotic complexity as required for the FSTO problem.

We can reduce the time needed to solve the MFSTO problem by modifying the definition of T. The modified function is C, where C(j, r) is the cost of the best way to cover levels 0 through j of O using at most r expansion levels. It is easy to see that C(j, r) <= C(j, r-1), r > 1. A simple dynamic programming recurrence for C is:

C(j, r) = min_{m in {-1..j-1}} {C(m, r-1) + nodes(m+1) * 2^{j-m}}, j >= 0, r > 1    (2.3)

C(-1, r) = 0 and C(j, 1) = 2^{j+1}, j >= 0    (2.4)

To see the correctness of Equations 2.3 and 2.4, note that when j >= 0, there must be at least one expansion level. If r = 1, then there is exactly one expansion level and the cost is 2^{j+1}.
If r > 1, the last expansion level in the best FST could be at any of the levels 0 through j. Let m + 1 be this last expansion level. The cost of the covering is C(m, r-1) + nodes(m+1) * 2^{j-m}. When j = -1, no levels of the 1-bit trie remain to be covered. Therefore, C(-1, r) = 0.

We may obtain an alternative recurrence for C(j, r) in which the range of m on the right side is r-2..j-1 rather than -1..j-1. First, we obtain the following dynamic programming recurrence for C:

C(j, r) = min{C(j, r-1), T(j, r)}, r > 1    (2.5)

C(j, 1) = 2^{j+1}    (2.6)

The rationale for Equation 2.5 is that the best FST that uses at most r expansion levels either uses at most r - 1 levels or uses exactly r levels. When at most r - 1 levels are used, the cost is C(j, r-1), and when exactly r levels are used, the cost is T(j, r), which is defined by Equation 2.1.

Let U(j, r) be as defined in Equation 2.7.

U(j, r) = min_{m in {r-2..j-1}} {C(m, r-1) + nodes(m+1) * 2^{j-m}}    (2.7)

From Equations 2.1 and 2.5 we obtain

C(j, r) = min{C(j, r-1), U(j, r)}    (2.8)

To see the correctness of Equation 2.8, note that for all j and r such that r <= j + 1, T(j, r) >= C(j, r). Furthermore,

T(j, r) = min_{m in {r-2..j-1}} {T(m, r-1) + nodes(m+1) * 2^{j-m}}
        >= min_{m in {r-2..j-1}} {C(m, r-1) + nodes(m+1) * 2^{j-m}}
        = U(j, r)    (2.9)

Therefore, when C(j, r-1) <= U(j, r), Equations 2.5 and 2.8 compute the same value for C(j, r) (i.e., C(j, r-1)). When C(j, r-1) > U(j, r), it appears from Equation 2.9 that Equation 2.8 may compute a smaller C(j, r) than is computed by Equation 2.5. However, from Equation 2.3, which is equivalent to Equation 2.5, the C(j, r) computed by Equations 2.3 and 2.5 satisfies

C(j, r) = min_{m in {-1..j-1}} {C(m, r-1) + nodes(m+1) * 2^{j-m}}
        <= min_{m in {r-2..j-1}} {C(m, r-1) + nodes(m+1) * 2^{j-m}}
        = U(j, r)

where C(-1, r) = 0. When C(j, r-1) > U(j, r), the C(j, r) computed by Equation 2.8 is U(j, r). Therefore, when C(j, r-1) > U(j, r), the C(j, r) computed by Equation 2.8 cannot be smaller than that computed by Equation 2.5.
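Equations 2.3 and 2.4 translate directly into a short dynamic program. The sketch below (our own naming; a plain O(kW^2) implementation that does not yet use the narrowed search range developed next) takes the nodes() array and returns C(W-1, k):

```python
def best_fst_cost(nodes, k):
    """Cost of the best FST with at most k levels (Equations 2.3 and 2.4).

    nodes[i] is the number of 1-bit-trie nodes at level i; W = len(nodes)
    is the maximum prefix length. Internally, index j+1 holds C(j, *) so
    that index 0 can hold the boundary value C(-1, *) = 0.
    """
    W = len(nodes)
    C = [[0] * (W + 1) for _ in range(k + 1)]
    for j in range(W):
        C[1][j + 1] = 2 ** (j + 1)          # C(j, 1) = 2^(j+1)
    for r in range(2, k + 1):
        for j in range(W):                   # Equation 2.3
            C[r][j + 1] = min(
                C[r - 1][m + 1] + nodes[m + 1] * 2 ** (j - m)
                for m in range(-1, j)
            )
    return C[k][W]
```

For the 1-bit trie of Figure 2-1(b), nodes(0 : 6) = [1, 1, 2, 2, 2, 1, 1], this gives C(6, 1) = 128, C(6, 2) = 32, and C(6, 3) = 20 (strides 3, 2, 2).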
Therefore, the C(j, r)s computed by Equations 2.5 and 2.8 are equal. In the remainder of this section, we use Equations 2.3 and 2.4 for C.

The range for m (in Equation 2.3) may be restricted to a range that is (often) considerably smaller than r-2..j-1. To obtain this narrower search range, we first establish a few properties of 1-bit tries and their corresponding optimal FSTs.

Lemma 1 For every 1-bit trie O, (a) nodes(i) <= 2^i, i >= 0 and (b) nodes(i+j) <= 2^j * nodes(i), j >= 0, i >= 0.

Proof Follows from the fact that a 1-bit trie is a binary tree. □

Let M(j, r), r > 1, be the smallest m that minimizes C(m, r-1) + nodes(m+1) * 2^{j-m} in Equation 2.3.

Lemma 2 ∀(j >= 0, r > 1)[M(j+1, r) >= M(j, r)].

Proof Let M(j, r) = a and M(j+1, r) = b. Suppose b < a. Then,

C(j, r) = C(a, r-1) + nodes(a+1) * 2^{j-a} < C(b, r-1) + nodes(b+1) * 2^{j-b}

since, otherwise, M(j, r) <= b. Furthermore, since M(j+1, r) = b,

C(j+1, r) = C(b, r-1) + nodes(b+1) * 2^{j+1-b} <= C(a, r-1) + nodes(a+1) * 2^{j+1-a}

Adding these two inequalities, we get

nodes(a+1) * 2^{j-a} + nodes(b+1) * 2^{j+1-b} < nodes(b+1) * 2^{j-b} + nodes(a+1) * 2^{j+1-a}

and so

nodes(b+1) * 2^{j-b} < nodes(a+1) * 2^{j-a}

Hence,

2^{a-b} * nodes(b+1) < nodes(a+1)

This contradicts Lemma 1(b). So, b >= a. □

Lemma 3 ∀(j >= 0, r >= 1)[C(j, r) <= C(j+1, r)].

Proof The case r = 1 follows from C(j, 1) = 2^{j+1}. So, assume r > 1. From the definition of M, it follows that

C(j+1, r) = C(b, r-1) + nodes(b+1) * 2^{j+1-b}

where -1 <= b = M(j+1, r) <= j. When b < j, we get

C(j, r) <= C(b, r-1) + nodes(b+1) * 2^{j-b} <= C(b, r-1) + nodes(b+1) * 2^{j+1-b} = C(j+1, r)

When b = j, C(j+1, r) = C(j, r-1) + nodes(j+1) * 2 >= C(j, r-1) >= C(j, r), since nodes(j+1) > 0. □

The next few lemmas use the function Δ, which is defined as Δ(j, r) = C(j, r-1) - C(j, r). Since C(j, r) <= C(j, r-1), Δ(j, r) >= 0 for all j >= 0 and all r >= 2.

Lemma 4 ∀(j >= 0)[Δ(j, 2) <= Δ(j+1, 2)].

Proof If C(j, 2) = C(j, 1), there is nothing to prove as Δ(j, 2) = 0 <= Δ(j+1, 2). The only other possibility is C(j, 2) < C(j, 1) (i.e., Δ(j, 2) > 0).
In this case, the best cover for levels 0 through j uses exactly 2 expansion levels. From the recurrence for C (Equations 2.3 and 2.4), it follows that C(j, 1) = 2^{j+1} and

C(j, 2) = C(a, 1) + nodes(a+1) * 2^{j-a} = 2^{a+1} + nodes(a+1) * 2^{j-a}

for some a, 0 <= a < j. Therefore,

Δ(j, 2) = C(j, 1) - C(j, 2) = 2^{j+1} - 2^{a+1} - nodes(a+1) * 2^{j-a}

From Equations 2.3 and 2.4, it follows that

C(j+1, 2) <= C(a, 1) + nodes(a+1) * 2^{j+1-a} = 2^{a+1} + nodes(a+1) * 2^{j+1-a}

Hence,

Δ(j+1, 2) >= 2^{j+2} - 2^{a+1} - nodes(a+1) * 2^{j+1-a}

Therefore,

Δ(j+1, 2) - Δ(j, 2) >= 2^{j+2} - 2^{a+1} - nodes(a+1) * 2^{j+1-a} - 2^{j+1} + 2^{a+1} + nodes(a+1) * 2^{j-a}
                     = 2^{j+1} - nodes(a+1) * 2^{j-a}
                     >= 2^{j+1} - 2^{a+1} * 2^{j-a}    (Lemma 1(a))
                     = 0 □

Lemma 5 ∀(j >= 0, k > 2)[Δ(j, k-1) <= Δ(j+1, k-1)] ⇒ ∀(j >= 0, k > 2)[Δ(j, k) <= Δ(j+1, k)].

Proof Assume that ∀(j >= 0, k > 2)[Δ(j, k-1) <= Δ(j+1, k-1)]. We shall show that ∀(j >= 0, k > 2)[Δ(j, k) <= Δ(j+1, k)]. Let M(j, k) = b and M(j+1, k-1) = c.

Case 1: c >= b.

Δ(j, k) = C(j, k-1) - C(j, k)
        = C(j, k-1) - C(b, k-1) - nodes(b+1) * 2^{j-b}
        <= C(b, k-2) + nodes(b+1) * 2^{j-b} - C(b, k-1) - nodes(b+1) * 2^{j-b}
        = Δ(b, k-1)

and

Δ(j+1, k) = C(j+1, k-1) - C(j+1, k)
          >= C(c, k-2) + nodes(c+1) * 2^{j+1-c} - C(c, k-1) - nodes(c+1) * 2^{j+1-c}
          = Δ(c, k-1)

Since c >= b, Δ(b, k-1) <= Δ(c, k-1). Therefore, Δ(j+1, k) >= Δ(c, k-1) >= Δ(b, k-1) >= Δ(j, k).

Case 2: c < b. Let M(j+1, k) = a and M(j, k-1) = d. From Lemma 2, a >= b and c >= d. Since c < b, a >= b > c >= d. Also,

Δ(j, k) = C(j, k-1) - C(j, k)
        = [C(d, k-2) + nodes(d+1) * 2^{j-d}] - [C(b, k-1) + nodes(b+1) * 2^{j-b}]

and

Δ(j+1, k) = C(j+1, k-1) - C(j+1, k)
          = [C(c, k-2) + nodes(c+1) * 2^{j+1-c}] - [C(a, k-1) + nodes(a+1) * 2^{j+1-a}]

Therefore,

Δ(j+1, k) - Δ(j, k) = [C(c, k-2) + nodes(c+1) * 2^{j+1-c}] - [C(d, k-2) + nodes(d+1) * 2^{j-d}]
                     + [C(b, k-1) + nodes(b+1) * 2^{j-b}] - [C(a, k-1) + nodes(a+1) * 2^{j+1-a}]    (2.10)

Since a >= b > c >= d = M(j, k-1),

C(c, k-2) + nodes(c+1) * 2^{j-c} >= C(d, k-2) + nodes(d+1) * 2^{j-d}    (2.11)

Furthermore, since M(j+1, k) = a,

C(b, k-1) + nodes(b+1) * 2^{j+1-b} >= C(a, k-1) + nodes(a+1) * 2^{j+1-a}    (2.12)

Substituting Equations 2.11 and 2.12 into Equation 2.10, we get

Δ(j+1, k) - Δ(j, k) >= nodes(c+1) * 2^{j-c} - nodes(b+1) * 2^{j-b}

Lemma 1 and c < b imply nodes(b+1) <= 2^{b-c} * nodes(c+1). Therefore,

nodes(c+1) * 2^{j-c} >= nodes(b+1) * 2^{j-b}

So, Δ(j+1, k) - Δ(j, k) >= 0. □

Lemma 6 ∀(j >= 0, k >= 2)[Δ(j, k) <= Δ(j+1, k)].

Proof Follows from Lemmas 4 and 5. □

Lemma 7 Let k > 2. ∀(j >= 0)[Δ(j, k-1) <= Δ(j+1, k-1)] ⇒ ∀(j >= 0)[M(j, k) >= M(j, k-1)].

Proof Assume that ∀(j >= 0)[Δ(j, k-1) <= Δ(j+1, k-1)]. Suppose that M(j, k-1) = a, M(j, k) = b, and b < a for some j, j >= 0. From Equation 2.3, we get

C(j, k) = C(b, k-1) + nodes(b+1) * 2^{j-b} <= C(a, k-1) + nodes(a+1) * 2^{j-a}

and

C(j, k-1) = C(a, k-2) + nodes(a+1) * 2^{j-a} < C(b, k-2) + nodes(b+1) * 2^{j-b}

Hence,

C(b, k-1) + C(a, k-2) < C(a, k-1) + C(b, k-2)

Therefore, Δ(a, k-1) < Δ(b, k-1). However, b < a and ∀(j >= 0)[Δ(j, k-1) <= Δ(j+1, k-1)] imply that Δ(b, k-1) <= Δ(a, k-1). Since our assumption that b < a leads to a contradiction, it must be that there is no j >= 0 for which M(j, k-1) = a, M(j, k) = b, and b < a. □

Lemma 8 ∀(j >= 0, k > 2)[M(j, k) >= M(j, k-1)].

Proof Follows from Lemmas 6 and 7. □

Theorem 1 ∀(j > 0, k > 2)[M(j, k) >= max{M(j-1, k), M(j, k-1)}].

Proof Follows from Lemmas 2 and 8. □

Algorithm FixedStrides(W, k)
// W is length of longest prefix.
// k is maximum number of expansion levels desired.
// Return C(W-1, k) and compute M(*, *).
{
    for (j = 0; j < W; j++) {
        C(j, 1) := 2^{j+1}; M(j, 1) := -1;
    }
    for (r = 1; r <= k; r++) C(-1, r) := 0;
    for (r = 2; r <= k; r++)
        for (j = r-1; j < W; j++) {
            // Compute C(j, r).
            minJ := max(M(j-1, r), M(j, r-1));
            minCost := C(j, r-1);
            minL := M(j, r-1);
            for (m = minJ; m < j; m++) {
                cost := C(m, r-1) + nodes(m+1) * 2^{j-m};
                if (cost < minCost) then { minCost := cost; minL := m; }
            }
            C(j, r) := minCost;
            M(j, r) := minL;
        }
    return C(W-1, k);
}

Figure 2-3: Algorithm for fixed-stride tries

Note 1 From Lemma 6, it follows that whenever Δ(j, k) > 0, Δ(q, k) > 0, ∀q >= j.

Theorem 1 leads to Algorithm FixedStrides (Figure 2-3), which computes C(W-1, k). The complexity of this algorithm is O(kW^2). Using the computed M values, the strides for the OFST that uses at most k expansion levels may be determined in an additional O(k) time. Although our algorithm has the same asymptotic complexity as does the algorithm of Srinivasan and Varghese [82], experiments conducted by us using real prefix sets indicate that our algorithm runs faster.

2.3 Variable-Stride Tries

2.3.1 Definition and Construction

In a variable-stride trie (VST) [82], nodes at the same level may have different strides. Figure 2-4 shows a two-level VST for the 1-bit trie of Figure 2-1.

Figure 2-4: Two-level VST for the prefixes of Figure 2-1(a)

The stride for the root is 2; that for the left child of the root is 5; and that for the root's right child is 3. The memory requirement of this VST is 4 (root) + 32 (left child of root) + 8 (right child of root) = 44. Since FSTs are a special case of VSTs, the memory required by the best VST for a given prefix set P and number of expansion levels k is less than or equal to that required by the best FST for P and k. Despite this, FSTs may be preferred in certain router applications "because of their simplicity and slightly faster search time" [82].

Let an r-VST be a VST that has at most r levels. Let Opt(N, r) be the cost (i.e., memory requirement) of the best r-VST for a 1-bit trie whose root is N.
Srinivasan and Varghese [82] have obtained the following dynamic programming recurrence for Opt(N, r):

Opt(N, r) = min_{s in {1..1+height(N)}} {2^s + sum_{Q in D_s(N)} Opt(Q, r-1)}, r > 1    (2.13)

where D_s(N) is the set of all descendants of N that are at level s of N. For example, D_1(N) is the set of children of N and D_2(N) is the set of grandchildren of N. height(N) is the maximum level at which the trie rooted at N has a node. For example, in Figure 2-1(b), the height of the trie rooted at N1 is 5. When r = 1,

Opt(N, 1) = 2^{1+height(N)}    (2.14)

Equation 2.14 is equivalent to Equation 2.2; the cost of covering all levels of N using at most one expansion level is 2^{1+height(N)}. When more than one expansion level is permissible, the stride of the first expansion level may be any number s that is between 1 and 1 + height(N). For any such selection of s, the next expansion level is level s of the 1-bit trie whose root is N. The sum in Equation 2.13 gives the cost of the best way to cover all subtrees whose roots are at this next expansion level. Each such subtree is covered using at most r - 1 expansion levels. It is easy to see that Opt(R, k), where R is the root of the overall 1-bit trie for the given prefix set P, is the cost of the best k-VST for P.

Srinivasan and Varghese [82] describe a way to determine Opt(R, k) using Equations 2.13 and 2.14. Although Srinivasan and Varghese state that the complexity of their algorithm is O(nW^2 k), where n is the number of prefixes in P and W is the length of the longest prefix, a close examination reveals that the complexity is O(pWk), where p is the number of nodes in the 1-bit trie. Since p = O(n) for realistic router prefix sets, the complexity of their algorithm is O(nWk) on realistic router prefix sets. We develop an alternative dynamic programming formulation that also permits the computation of Opt(R, k) in O(pWk) time. However, the resulting algorithm is considerably faster.
Let

Opt(N, s, r) = sum_{Q in D_s(N)} Opt(Q, r), s >= 0, r >= 1,

and let Opt(N, 0, r) = Opt(N, r). From Equations 2.13 and 2.14, we obtain:

Opt(N, 0, r) = min_{s in {1..1+height(N)}} {2^s + Opt(N, s, r-1)}, r > 1    (2.15)

and

Opt(N, 0, 1) = 2^{1+height(N)}    (2.16)

For s > 0 and r >= 1, we get

Opt(N, s, r) = sum_{Q in D_s(N)} Opt(Q, r)
             = Opt(LeftChild(N), s-1, r) + Opt(RightChild(N), s-1, r)    (2.17)

For Equation 2.17, we need the following initial condition:

Opt(null, *, *) = 0    (2.18)

The number of Opt(*, *, *) values is O(pWk). Each Opt(*, *, *) value may be computed in O(1) time using Equations 2.15 through 2.18 provided the Opt values are computed in postorder. Therefore, we may compute Opt(R, k) = Opt(R, 0, k) in O(pWk) time.

Although both our algorithm and that of [82] run in O(pWk) time, our algorithm is expected to do less work. We arrive at this expectation by performing a somewhat crude operation-count analysis. In the algorithm of [82], for each value of r (see Equation 2.13), Opt(M, r-1) is used level_M times, where level_M is the level of node M. Adding in 1 unit for the initial storage of Opt(M, r-1), we see that a node at level level_M contributes roughly level_M + 1 to the total cost of computing Opt(*, r). Therefore, a rough operation count for the algorithm of [82] is

OpCountSrini = k * sum_M (level_M + 1)

where the sum is taken over all nodes M of the 1-bit trie. Let height_M be the height of the subtree rooted at node M of the 1-bit trie (the height of a subtree that has only a root is 0). Our algorithm computes (height_M + 1) * k Opt(M, *, *) values at node M. Each of these values is computed using a single addition. So, the operation count for our algorithm is crudely estimated to be

OpCountOur = k * sum_M (height_M + 1)

For our five databases Paix, Pb, MaeWest, Aads, and MaeEast, the ratios OpCountSrini/OpCountOur are 6.7, 5.9, 5.7, 5.6, and 5.4. We can determine the possible range for this ratio by computing the ratio for skewed as well as full binary trees.
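Equations 2.15 through 2.18 can be realized with a single postorder pass over the 1-bit trie. The following sketch (our own code and naming; the trie is represented implicitly by the set of node paths) computes Opt(R, 0, k):

```python
def best_vst_cost(prefixes, k):
    """Cost of the best k-VST, via the Opt(N, s, r) formulation.

    prefixes are bit strings without the trailing *. A 1-bit-trie node
    exists for every proper prefix of every stored prefix.
    """
    paths = {p[:i] for p in prefixes for i in range(len(p))}

    def solve(path):
        # Returns (opt, height), where opt[s][r] = Opt(node, s, r) for
        # 0 <= s <= height+1 and 1 <= r <= k; (None, -1) for a null node.
        if path not in paths:
            return None, -1
        lopt, lh = solve(path + "0")
        ropt, rh = solve(path + "1")
        h = 1 + max(lh, rh)
        opt = [[0] * (k + 1) for _ in range(h + 2)]
        for s in range(1, h + 2):            # Equation 2.17
            for r in range(1, k + 1):
                for sub in (lopt, ropt):
                    if sub is not None and s - 1 < len(sub):
                        opt[s][r] += sub[s - 1][r]
        opt[0][1] = 2 ** (1 + h)             # Equation 2.16
        for r in range(2, k + 1):            # Equation 2.15
            opt[0][r] = min(2 ** s + opt[s][r - 1] for s in range(1, h + 2))
        return opt, h

    opt, _ = solve("")
    return opt[0][k]
```

For the prefixes of Figure 2-1(a), this returns 128 for k = 1 (a single node of stride 7), 26 for k = 2, and 20 for k = 3, illustrating how additional levels reduce the memory required.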
For a totally skewed 1-bit trie (e.g., a left- or right-skewed trie), the two operation-count estimates are the same. For a 1-bit trie that is a full binary tree of height W-1,

OpCountSrini/k = sum_{p=0}^{W-1} (p+1) * 2^p = (W-1) * 2^W + 1

and

OpCountOur/k = sum_{p=0}^{W-1} (W-p) * 2^p = 2^{W+1} - W - 2

So, OpCountSrini/OpCountOur ≈ (W-1)/2. Since skewed and full binary trees represent two extremes for the operation-count ratio, the operation-count ratio is expected to be between 1 and (W-1)/2. For IPv4, W = 32 and this ratio lies between 1 and 15.5. For IPv6, W = 128 and this ratio is between 1 and 63.5.

Although the number of operations being performed is an important contributing factor to the observed run time of an algorithm, the number of cache misses often has significant impact. For the algorithm of [82], we estimate the number of cache misses to be of the same order as the number of operations (i.e., OpCountSrini). Because our algorithm is a simple postorder traversal that visits each node of the 1-bit trie exactly once, the number of cache misses for our algorithm is estimated to be OpCountOur/L, where L is the smaller of k and the number of Opt(M, *, *) values that fit in a cache line. The cache-miss count gives our algorithm another factor-of-L advantage over the algorithm of [82]. When the cost of operations dominates the run time, our crude analysis indicates that our algorithm will be about 6 times as fast as that of [82] (for our test databases). When cache-miss time dominates the run time, our algorithm could be 12 times as fast when k = 2 and 42 times as fast when k = 7. Of course, since our analysis doesn't include all of the overheads associated with the two algorithms, actual speedups may be quite different.

Our algorithm requires O(W^2 k) memory for the Opt(*, *, *) values. To see this, notice that there can be at most W + 1 nodes N whose Opt(N, *, *) values must be retained at any given time, and for each of these at most W + 1 nodes, O(Wk) Opt(N, *, *) values must be retained.
To determine the optimal strides, each node of the 1-bit trie must store the stride s that minimizes the right side of Equation 2.15 for each value of r. For this purpose, each 1-bit trie node needs O(k) space. Therefore, the memory requirements of the 1-bit trie are O(pk). The total memory required is, therefore, O(pk + W^2 k).

In practice, we may prefer an implementation that uses considerably more memory. If we associate a cost array with each of the p nodes of the 1-bit trie, the memory requirement increases to O(pWk). The advantage of this increased-memory implementation is that the optimal strides can be recomputed in O(W^2 k) time (rather than O(pWk)) following each insert or delete of a prefix. This is so because the Opt(N, *, *) values need be recomputed only for nodes along the insert/delete path of the 1-bit trie. There are O(W) such nodes.

Figure 2-6: 1-bit trie for the prefixes of Figure 2-5(a)

2.3.2 An Example

Figure 2-5(a) gives a prefix set P that contains 8 prefixes. The length of the longest prefix (P8) is 7. Figure 2-5(b) gives the prefixes that remain when the prefixes of P are expanded into the lengths 1, 3, 5, and 7. As we shall see, these expanded prefixes correspond to an optimal 4-VST for P. Figure 2-6 gives the 1-bit trie for the prefixes of Figure 2-5. To determine the cost, Opt(N0, 0, 4), of the best 4-VST for the prefix set of Figure 2-5(a), we must compute all the Opt values shown in Figure 2-7. In this figure Opt1, for example, refers to Opt(N1, *, *) and Opt42 refers to Opt(N42, *, *).
Figure 2-5: A prefix set and its expansion to four lengths. (a) Original prefixes: P1 = 0*, P2 = 1*, P3 = 11*, P4 = 101*, P5 = 10001*, P6 = 1100*, P7 = 110000*, P8 = 1100000*. (b) Expanded prefixes (lengths 1, 3, 5, 7): 0 (P1), 1 (P2), 101 (P4), 110 (P3), 111 (P3), 10001 (P5), 11000 (P6), 11001 (P6), 1100000 (P8), 1100001 (P7).

Figure 2-7: Opt values in the computation of Opt(N0, 0, 4). The table for the root, Opt(N0, *, *), is:

Opt_N0   r=1   r=2   r=3   r=4
s=0      128   26    20    18
s=1      64    18    16    16
s=2      40    18    16    16
s=3      20    12    12    12
s=4      10    8     8     8
s=5      4     4     4     4
s=6      2     2     2     2

The Opt arrays shown in Figure 2-7 are computed in postorder; that is, in the order N41, N31, N21, N6, N5, N42, N32, N22, N1, N0. The Opt values shown in Figure 2-7 were computed using Equations 2.15 through 2.18. From Figure 2-7, we determine that the cost of the best 4-VST for the given prefix set is Opt(N0, 0, 4) = 18. To construct this best 4-VST, we must determine the strides for all nodes in the best 4-VST. These strides are easily determined if, with each Opt(*, 0, *), we store the s value that minimizes the right side of Equation 2.15. For Opt(N0, 0, 4), this minimizing s value is 1. This means that the stride for the root of the best 4-VST is 1, its left subtree is empty (because N0 has an empty left subtree), and its right subtree is the best 3-VST for the subtree rooted at N1. The minimizing s value for Opt(N1, 0, 3) is 2 (actually, there is a tie between s = 2 and s = 3; ties may be broken arbitrarily).

Figure 2-8: Optimal 4-VST for the prefixes of Figure 2-5(a)
Therefore, the right child of the root of the best 4-VST has a stride of 2. Its first subtree is the best 2-VST for N31; its second subtree is empty; its third subtree is the best 2-VST for N32; and its fourth subtree is empty. Continuing in this manner, we obtain the 4-VST of Figure 2-8. The cost of this 4-VST is 18.

2.3.3 Faster k = 2 Algorithm

The algorithm of Section 2.3.1 may be used to determine the optimal 2-VST for a set of n prefixes in O(pW) (equal to O(nW) for practical prefix sets) time, where p is the number of nodes in the 1-bit trie and W is the length of the longest prefix. In this section, we develop an O(p)-time algorithm for this task. From Equation 2.13, we see that the cost, Opt(root, 2), of the best 2-VST is

Opt(root, 2) = min_{s in {1..1+height(root)}} {2^s + sum_{Q in D_s(root)} Opt(Q, 1)}
             = min_{s in {1..1+height(root)}} {2^s + sum_{Q in D_s(root)} 2^{1+height(Q)}}
             = min_{s in {1..1+height(root)}} {2^s + C(s)}    (2.19)

where

C(s) = sum_{Q in D_s(root)} 2^{1+height(Q)}    (2.20)

Algorithm ComputeC(t)
// Initial invocation is ComputeC(root).
// The C array and level are initialized to 0 prior to initial invocation.
// Return height of tree rooted at node t.
{
    if (t != null) {
        level++;
        leftHeight = ComputeC(t.leftChild);
        rightHeight = ComputeC(t.rightChild);
        level--;
        height = max{leftHeight, rightHeight} + 1;
        C[level] += 2^{height+1};
        return height;
    }
    else return -1;
}

Figure 2-9: Algorithm to compute C using Equation 2.20

We may compute C(s), 1 <= s <= 1 + height(root), in O(p) time by performing a postorder traversal (see Figure 2-9) of the 1-bit trie rooted at root. (Recall that p is the number of nodes in the 1-bit trie.) Once we have determined the C values using Algorithm ComputeC (Figure 2-9), we may determine Opt(root, 2) and the optimal stride for the root in an additional O(height(root)) time using Equation 2.19.
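A sketch of this linear-time computation follows (our own code; ComputeC's global C array and level counter become an explicit array and parameter, and the trie is again represented by its set of node paths):

```python
def best_2vst_cost(prefixes):
    """Cost of the best 2-VST via one postorder pass (Equations 2.19, 2.20).

    Fills C[s] = sum over nodes Q at level s of 2^(1+height(Q)), then takes
    min over s of 2^s + C[s]. prefixes are bit strings without the trailing *.
    """
    paths = {p[:i] for p in prefixes for i in range(len(p))}
    W = max(len(p) for p in prefixes)
    C = [0] * (W + 2)

    def compute_c(path, level):
        # Returns the height of the subtree rooted at path (or -1 if absent),
        # adding the subtree root's contribution to C on the way back up.
        if path not in paths:
            return -1
        height = 1 + max(compute_c(path + "0", level + 1),
                         compute_c(path + "1", level + 1))
        C[level] += 2 ** (height + 1)
        return height

    h = compute_c("", 0)
    return min(2 ** s + C[s] for s in range(1, h + 2))
```

For the prefixes of Figure 2-1(a), this returns 26 (optimal root stride 4), agreeing with Opt(root, 2) as computed by the general O(pWk) algorithm.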
If the optimal stride for the root is s, then the second expansion level is level s (unless s = 1 + height(root), in which case there isn't a second expansion level). The stride for each node at level s is one plus the height of the subtree rooted at that node. The height of the subtree rooted at each node was computed by Algorithm ComputeC, and so the strides for the nodes at the second expansion level are easily determined.

2.3.4 Faster k = 3 Algorithm

Using the algorithm of Section 2.3.1 we may determine the optimal 3-VST in O(pW) time. In this section, we develop a simpler and faster O(pW) algorithm for this task. On practical prefix sets, the algorithm of this section runs in O(p) time. From Equation 2.13, we see that the cost, Opt(root, 3), of the best 3-VST is

Opt(root, 3) = min_{s in {1..1+height(root)}} {2^s + sum_{Q in D_s(root)} Opt(Q, 2)}
             = min_{s in {1..1+height(root)}} {2^s + T(s)}    (2.21)

where

T(s) = sum_{Q in D_s(root)} Opt(Q, 2)    (2.22)

Figure 2-10 gives our algorithm to compute T(s), 1 <= s <= 1 + height(root). The computation of Opt(M, 2) is done using Equations 2.19 and 2.20. In Algorithm ComputeT (Figure 2-10), the method allocate allocates a one-dimensional array that is to be used to compute the C values for a subtree. The allocated array is initialized to zeroes; it has positions 0 through W, where W is the length of the longest prefix (W also is 1 + height(root)); and when computing the C values for a subtree whose root is at level j, only positions j through W of the allocated array may be modified. The method deallocate frees a C array previously allocated.

Algorithm ComputeT(t)
// Initial invocation is ComputeT(root).
// The T array and level are initialized to 0 prior to initial invocation.
// Return the C array for, and the height of, the subtree rooted at node t.
{
    if (t != null) {
        level++;
        // compute C values and heights for left and right subtrees of t
        (leftC, leftHeight) = ComputeT(t.leftChild);
        (rightC, rightHeight) = ComputeT(t.rightChild);
        level--;
        // compute C values and height for t as well as
        // bestT = Opt(t, 2) and t.stride = stride of node t
        // in this best 2-VST rooted at t
        height = max{leftHeight, rightHeight} + 1;
        bestT = leftC[level] = 2^{height+1};
        t.stride = height + 1;
        for (int i = 1; i <= height; i++) {
            leftC[level + i] += rightC[level + i];
            if (2^i + leftC[level + i] < bestT) {
                bestT = 2^i + leftC[level + i];
                t.stride = i;
            }
        }
        T[level] += bestT;
        deallocate(rightC);
        return (leftC, height);
    }
    else { // t is null
        allocate(C);
        return (C, -1);
    }
}

Figure 2-10: Algorithm to compute T using Equation 2.22

The complexity of Algorithm ComputeT is readily seen to be O(pW). Once the T values have been computed using Algorithm ComputeT, we may determine Opt(root, 3) and the stride of the root of the optimal 3-VST in an additional O(W) time. The strides of the nodes at the remaining expansion levels of the optimal 3-VST may be determined from the t.stride and subtree height values computed by Algorithm ComputeT in O(p) time. So the total time needed to determine the best 3-VST is O(pW).

When the difference between the heights of the left and right subtrees of nodes in the 1-bit trie is bounded by some constant d, the complexity of Algorithm ComputeT is O(p). We use an amortization scheme to prove this. First, note that, exclusive of the recursive calls, the work done by Algorithm ComputeT for each invocation is O(height(t)). For simplicity, assume that this work is exactly height(t) + 1 (the 1 is for the work done outside the for loop of ComputeT). Each active C array will maintain a credit that is at least equal to the height of the subtree it is associated with. When a C array is allocated, it has no credit associated with it. Each node in the 1-bit trie begins with a credit of 2.
When ComputeT is invoked with t = N, 1 unit of the credits on N is used to pay for the work done outside of the for loop. The remaining unit is given to the C array leftC. The cost of the for loop is paid for by the credits associated with rightC. These credits may fall short by at most d + 1, because the height of the left subtree of N may be up to d more than the height of N's right subtree. Adding together the initial credits on the nodes and the maximum total shortfall, we see that p(2 + d + 1) credits are enough to pay for all of the work. So, the complexity of ComputeT is O(pd) = O(p) (because d is assumed to be a constant). In practice, we expect that the 1-bit tries for router prefixes will not be too skewed and that the difference between the heights of the left and right subtrees will, in fact, be quite small. Therefore, in practice, we expect ComputeT to run in O(p) time.

Table 2-3: Memory required (in Kbytes) by best k-level FST

 Levels (k)      2      3      4      5     6     7
 Paix       49,192  3,030  1,340  1,093   960   922
 Pb         47,925  2,328    896    699   563   527
 MaeWest    44,338  2,168    819    636   499   468
 Aads       42,204  2,070    782    594   467   436
 MaeEast    38,890  1,991    741    549   433   398

2.4 Experimental Results

We programmed our dynamic programming algorithms in C and compared their performance against that of the C codes for the algorithms of Srinivasan and Varghese [82]. All codes run on a SUN workstation were compiled using the gcc compiler and optimization level O2; codes run on a PC were compiled using Microsoft Visual C++ 6.0 and optimization level O2. The codes were run on a SUN Ultra Enterprise 4000/5000 computer as well as on a 2.26 GHz Pentium 4 PC. For test data, we used the five IPv4 prefix databases of Table 2-1.

2.4.1 Performance of Fixed-Stride Algorithm

Table 2-3 and Figure 2-11 show the memory required by the best k-level FST for each of the five databases of Table 2-1. Note that the y-axis of Figure 2-11 uses a semilog scale.
The k values used by us range from a low of 2 to a high of 7 (corresponding to a lookup performance of at most 2 memory accesses per lookup to at most 7 memory accesses per lookup). As was the case with the data sets used in [82], using a larger number of levels does not increase the required memory. We note that for k = 11 and 12, [82] reports no decrease in memory required for three of their data sets. We did not try such large k values for our data sets.

Table 2-4 and Figure 2-12 show the time taken by both our algorithm and that of [82] (we are grateful to Dr. Srinivasan for making his fixed- and variable-stride codes available to us) to determine the optimal strides of the best FST that has at most k levels. These times are for the Pentium 4 PC. Times in Table 2-5 and Figure 2-13 are for the SUN workstation.

Figure 2-11: Memory required (in Kbytes) by best k-level FST

As expected, the run time of the algorithm of [82] is quite insensitive to the number of prefixes in the database. Although the run time of our algorithm is independent of the number of prefixes, the run time does depend on the values of nodes(*), as these values determine M(*, *) and hence determine minJ in Figure 2-3. As indicated by the graph of Figure 2-12, the run time for our algorithm varies only slightly with the database. As can be seen, our algorithm provides a speedup of between approximately 1.5 and approximately 3 compared to that of [82]. When the codes were run on our SUN workstation, the speedup was between approximately 2 and approximately 4.

2.4.2 Performance of Variable-Stride Algorithm

Table 2-6 shows the memory required by the best k-level VST for each of the five databases of Table 2-1. The columns labeled "Yes" give the memory required when the VST is permitted to have Butler nodes [44]. This capability refers to the replacement of subtries that contain three or fewer prefixes by a single node that contains these prefixes [44].
The columns labeled "No" refer to the case when Butler nodes are not permitted (i.e., the case discussed in this chapter).

Table 2-4: Execution time (in µsec) for FST algorithms, Pentium 4 PC

        Paix          Pb           MaeWest       Aads          MaeEast
 k   [82]   Our    [82]   Our    [82]   Our    [82]   Our    [82]   Our
 2   5.23   3.20   5.19   3.15   5.13   3.15   5.15   3.17   5.17   3.09
 3   9.99   4.87   9.73   4.73   9.98   4.81   9.96   4.90  10.00   4.73
 4  14.68   6.23  14.53   6.15  14.62   6.29  14.59   6.31  14.64   6.10
 5  19.54   7.36  19.42   7.31  19.42   7.40  19.15   7.42  19.45   7.28
 6  24.32   9.39  24.08   8.37  24.07   8.47  24.03   8.46  24.23   8.29
 7  28.99   9.48  28.72   9.42  28.68   9.45  28.68   9.38  28.76   9.34

Figure 2-12: Execution time (in µsec) for FST algorithms, Pentium 4 PC

Table 2-5: Execution time (in µsec) for FST algorithms, SUN Ultra Enterprise 4000/5000

        Paix       Pb       MaeWest     Aads     MaeEast
 k   [82]  Our  [82]  Our  [82]  Our  [82]  Our  [82]  Our
 2    39   21    41   21    39   21    37   20    37   21
 3    85   30    81   30    84   31    74   31    96   31
 4   123   39   124   40   128   38   122   40   130   40
 5   174   46   174   48   147   46   161   45   164   46
 6   194   53   201   54   190   55   194   54   190   53
 7   246   62   241   63   221   63   264   62   220   62

Figure 2-13: Execution time (in µsec) for FST algorithms, SUN Ultra Enterprise 4000/5000

The data of Table 2-6, as well as the memory requirements of the best FST, are plotted in Figure 2-14. As can be seen, the Butler node provision has far more impact when k is small than when k is large. In fact, when k = 2 the Butler node provision reduces the memory required by the best VST by almost 50%. However, when k = 7, using Butler nodes results in less than a 20% reduction in memory requirement.
Table 2-6: Memory required (in Kbytes) by best k-VST

          Paix           Pb          MaeWest        Aads         MaeEast
 k     No     Yes     No    Yes     No    Yes     No    Yes     No    Yes
 2   2,528  1,722  1,806  1,041  1,754    949  1,631    891  1,621    837
 3   1,080    907    677    496    619    443    582    405    537    367
 4     845    749    489    397    441    351    410    320    371    286
 5     780    706    440    370    393    327    363    297    326    264
 6     763    695    426    361    379    319    350    290    313    257
 7     759    692    422    358    376    316    346    287    310    254

Figure 2-14: Memory required (in Kbytes) for Paix by best k-VST and best FST

For the run time comparison of the VST algorithms, we implemented three versions of our VST algorithm of Section 2.3.1. None of these versions permitted the use of Butler nodes. The first version, called the O(pk + W²k) Static Memory Implementation, is the O(pk + W²k) memory implementation described in Section 2.3.1. The O(W²k) memory required by this implementation for the cost arrays is allocated at compile time. During execution, memory segments from this preallocated O(W²k) memory are allocated to nodes, as needed, for their cost arrays. The second version, called the O(pWk) Dynamic Memory Implementation, dynamically allocates a cost array to each node of the 1-bit trie using C's malloc method. Neither the first nor the second implementation employs the fast algorithms of Sections 2.3.3 and 2.3.4. Tables 2-7 and 2-8 give the run times for these two implementations. The third implementation of our VST algorithm uses the faster k = 2 and k = 3 algorithms of Sections 2.3.3 and 2.3.4 and also uses O(pWk) memory. The O(pWk) memory is allocated in one large block by making a single call to malloc.
Following this, the large allocated block of memory is partitioned into cost arrays for the 1-bit trie nodes by our program. The run times for the third implementation are given in Tables 2-9 and 2-10. The run times for all three of our implementations are plotted in Figures 2-15 and 2-16.

Table 2-7: Execution times (in msec) for first two implementations of our VST algorithm, Pentium 4 PC

        Paix          Pb         MaeWest       Aads       MaeEast
 k     S      D     S     D     S     D     S     D     S     D
 2   34.3  107.5  17.6  56.2  31.0  50.1  15.9  46.9  12.2  40.2
 3   39.1  115.4  22.4  65.2  15.2  58.2  19.0  53.1  15.1  46.5
 4   47.0  131.4  28.2  74.8  20.0  66.2  23.3  57.6  16.6  54.2
 5   51.5  140.7  29.6  78.0  20.3  66.2  23.2  62.0  19.9  56.0
 6   59.0  146.7  32.9  82.7  27.9  69.4  26.3  71.6  21.4  62.5
 7   63.7  159.3  31.0  88.6  32.8  79.0  32.7  73.3  29.4  67.1

S = O(pk + W²k) Static Memory Implementation
D = O(pWk) Dynamic Memory Implementation

Table 2-8: Execution times (in msec) for first two implementations of our VST algorithm, SUN Ultra Enterprise 4000/5000

        Paix        Pb       MaeWest     Aads     MaeEast
 k     S     D     S    D     S    D     S    D     S    D
 2   290   500   150  280   150  260   120  200   120  230
 3   360   790   190  460   180  430   150  340   150  340
 4   430   900   210  520   220  430   180  430   160  390
 5   490  1140   260  610   240  570   200  520   190  470
 6   530  1170   290  670   270  570   270  550   220  510
 7   590  1390   330  780   300  690   300  630   260  560

S = O(pk + W²k) Static Memory Implementation
D = O(pWk) Dynamic Memory Implementation

Table 2-9: Execution times (in msec) for third implementation of our VST algorithm, Pentium 4 PC

 k   Paix    Pb   MaeWest   Aads   MaeEast
 2   21.0  10.6      9.0     8.2       7.3
 3   27.8  15.0     13.2    12.1      10.7
 4   48.5  27.6     24.6    22.9      20.6
 5   56.2  32.3     28.7    26.7      24.0
 6   62.1  36.4     32.5    30.4      27.1
 7   69.3  40.3     36.1    33.7      30.3

Table 2-10: Execution times (in msec) for third implementation of our VST algorithm, SUN Ultra Enterprise 4000/5000

 k   Paix   Pb   MaeWest   Aads   MaeEast
 2     70   30       30      20       20
 3    210  100       90      80       70
 4    550  290      270     270      240
 5    640  350      370     330      260
 6    740  430      390     410      350
 7    920  530      450     400      350
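The single large malloc plus in-program partitioning used by the third implementation is essentially an arena (pool) allocator. The sketch below is our illustration of the idea, not the dissertation's code; the names (Arena, arena_alloc) are ours, and alignment handling is omitted for brevity:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal arena: one large malloc, then bump-pointer sub-allocation.
 * This mirrors the idea of carving per-node cost arrays out of a single
 * preallocated block instead of calling malloc once per trie node. */
typedef struct {
    char  *base;  /* start of the single large block */
    size_t size;  /* total bytes in the block */
    size_t used;  /* bytes handed out so far */
} Arena;

int arena_init(Arena *a, size_t size) {
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Hand out the next chunk; returns NULL when the block is exhausted. */
void *arena_alloc(Arena *a, size_t bytes) {
    if (a->used + bytes > a->size) return NULL;
    void *p = a->base + a->used;
    a->used += bytes;
    return p;
}

/* One free releases every cost array at once. */
void arena_free_all(Arena *a) { free(a->base); a->base = NULL; }
```

With p trie nodes each needing a cost array of at most W·k entries, one arena of p·W·k·sizeof(int) bytes replaces p separate malloc calls, which is consistent with the observed speed advantage of the third implementation over the per-node-malloc one.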
Notice that this third implementation is significantly faster than our other O(pWk) memory implementation. Note also that this third implementation is faster than the O(pk + W²k) memory implementation for the cases k = 2 and k = 3 (this is because, in our third implementation, these cases use the faster algorithms of Sections 2.3.3 and 2.3.4).

To compare the run time performance of our algorithm with that of [82], we use the times for implementation 3 when k = 2 or 3 and the times for the faster of implementations 1 and 3 when k > 3. That is, we compare our best times with the times for the algorithm of [82]. The times for the algorithm of [82] were obtained using their code and running it with the Butler node option off. Since the code of [82] does no dynamic memory allocation, our use of the times for the static memory allocation does not, in any way, disadvantage the algorithm of [82]. The run times, on our 2.26 GHz PC, are shown in Table 2-11 and these times are plotted in Figure 2-17. For our largest database, Paix, our new algorithm takes less than one-third the time taken by the algorithm of [82] when k = 2 and about 1/47 the time when k = 7. On our SUN workstation, as shown in Table 2-12 and Figure 2-18, the observed speedup for Paix ranges from a low of 2.7 to a high of 15.7. The observed speedups aren't as high as predicted by our crude analysis because actual speedup is governed by both the operation cost and the cache-miss cost; further, our crude analysis doesn't account for all operations. The higher speedup observed on a PC suggests a higher relative cache-miss cost on the PC (relative to the cost of an operation) versus on a SUN workstation.

Table 2-11: Execution times (in msec) for our best VST implementation and the VST algorithm of Srinivasan and Varghese, Pentium 4 PC

          Paix            Pb           MaeWest          Aads          MaeEast
 k    [82]    Our    [82]    Our    [82]    Our    [82]    Our    [82]    Our
 2    64.6   21.0    37.4   10.6    31.1    9.0    27.9    8.2    26.6    7.3
 3   665.6   27.8   339.2   15.0   297.0   13.2   269.8   12.1   244.1   10.7
 4  1262.7   47.0   629.8   27.6   559.4   20.0   503.2   22.9   448.5   16.6
 5  1858.0   51.5   928.4   29.6   817.1   20.3   737.2   23.2   659.8   19.9
 6  2441.0   59.0  1215.8   32.9  1073.2   27.9   971.4   26.3   868.9   21.4
 7  3034.7   63.7  1512.7   31.0  1328.0   32.8  1209.3   32.7  1072.0   29.4

Table 2-12: Execution times (in msec) for our best VST implementation and the VST algorithm of Srinivasan and Varghese, SUN Ultra Enterprise 4000/5000

         Paix          Pb         MaeWest       Aads       MaeEast
 k   [82]   Our   [82]   Our   [82]   Our   [82]   Our   [82]   Our
 2    190    70    130    30     50    30     40    20     40    20
 3   1960   210   1230   100    360    90    320    80    280    70
 4   3630   430   2330   210    700   220    590   180    530   160
 5   5340   490   3440   260   1030   240    860   200    780   190
 6   7510   530   4550   290   1340   270   1150   270   1020   220
 7   9280   590   5650   330   1650   300   1420   300   1270   260

The times reported in Tables 2-7 through 2-12 are only the times needed to determine the optimal strides for a given 1-bit trie.
Once these strides have been determined, it is necessary to actually construct the optimal VST. Table 2-13 shows the time required to construct the optimal VST once the optimal strides are known. For our databases, the VST construction time is more than the time required to compute the optimal strides using our best optimal stride computation implementation. The primary operation performed on an optimal VST is a lookup or search, in which we begin with a destination address and find the longest prefix that matches this destination address. To determine the average lookup/search time, we searched for as many addresses as there are prefixes in a database. The search addresses were obtained by using the 32-bit expansion available in the database for all prefixes in the database. Table 2-14 and Figure 2-19 show the average time to perform a lookup/search. As expected, the average search time increases monotonically with k.

Figure 2-17: Execution times (in msec) for Paix for our best VST implementation and the VST algorithm of Srinivasan and Varghese, Pentium 4 PC

Figure 2-18: Execution times (in msec) for Paix for our best VST implementation and the VST algorithm of Srinivasan and Varghese, SUN Ultra Enterprise 4000/5000

Table 2-13: Time (in msec) to construct optimal VST from optimal stride data, Pentium 4 PC

 k    Paix    Pb   MaeWest   Aads   MaeEast
 2   117.1  78.0     68.5   67.4      64.3
 3   107.8  62.6     55.7   47.0      47.0
 4   115.5  66.2     61.3   50.9      47.0
 5   126.6  78.0     63.6   62.1      56.5
 6   131.4  82.6     64.5   68.9      59.5
 7   139.3  78.0     75.6   71.6      62.0

Table 2-14: Search time (in µsec) in optimal VST, Pentium 4 PC

 k   Paix    Pb   MaeWest   Aads   MaeEast
 2   0.55  0.46     0.44   0.43      0.42
 3   0.71  0.64     0.62   0.61      0.59
 4   0.79  0.74     0.73   0.72      0.72
 5   0.92  0.89     0.89   0.88      0.90
 6   1.01  1.00     0.99   0.99      0.98
 7   1.10  1.10     1.08   1.09      1.10
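The lookup operation being timed descends the multibit trie, consuming a node's stride worth of address bits at each level and remembering the last prefix seen. The following is a minimal sketch under an assumed node layout; the field names and the -1 "no prefix" sentinel are ours, not the dissertation's code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A multibit-trie node with stride s has 2^s children and 2^s
 * next-hop slots (one per expanded prefix stored at the node). */
typedef struct Node {
    int           stride;   /* bits consumed at this node */
    struct Node **child;    /* 2^stride child pointers (may be NULL) */
    int          *nexthop;  /* 2^stride next hops; -1 = no prefix here */
} Node;

/* Longest-matching-prefix lookup: at each node, peel off `stride` bits
 * of the 32-bit address, remember the last prefix seen, and descend.
 * Assumes total strides along any path do not exceed 32 bits.       */
int lookup(const Node *t, uint32_t addr) {
    int best = -1, used = 0;
    while (t != NULL) {
        uint32_t idx = (addr << used) >> (32 - t->stride);
        if (t->nexthop[idx] != -1) best = t->nexthop[idx];
        used += t->stride;
        t = t->child[idx];
    }
    return best;  /* next hop of the longest matching prefix, or -1 */
}
```

The number of iterations is at most the number of levels k, which is why the measured search time above grows monotonically with k.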
For our databases, the search time for a 2-VST is less than or equal to half that for a 7-VST.

Figure 2-19: Search time (in µsec) in optimal VST for Paix, Pentium 4 PC

Inserts and deletes are performed less frequently than searches in a VST. We experimented with three strategies for these two operations:

* OptVST: In this strategy, the VST was always the best possible k-VST for the current set of prefixes. To insert a new prefix, we first insert the prefix into the 1-bit trie of all prefixes. Then, the cost arrays on the insert path are recomputed. This is done efficiently using implementation 2 (i.e., the O(pWk) dynamic memory implementation) of our VST stride computation algorithm. Following this, the optimal strides for vertices on the insert path are computed. Since the optimal VST for the new prefix set differs from the optimal VST for the original prefix set only along the insert path, we modify the original optimal VST only along this insert path, using the newly computed strides for the vertices on this path. Deletion works in a similar fashion.

* Batch1: In this strategy, the optimal VST is computed periodically (say, after a sufficient number of inserts/deletes have taken place) rather than following each insert/delete. Inserts and deletes are done directly in the current VST without regard to maintaining optimality. If an insertion results in the creation of a new node, the stride of this new node is such that the sum of the strides of the nodes on the path from the root to this new node equals the length of the newly inserted prefix. The deletion of a prefix may require us to search a node for a replacement prefix of the next (lower) length that matches the deleted prefix.

* Batch2: This differs from strategy Batch1 in that inserts and deletes are done in both the current VST and the 1-bit trie. This increases the time for an insert as well as for a delete.
In the case of deletion, by first deleting from the 1-bit trie, we determine the next (lower) length matching prefix from the delete path taken in the 1-bit trie. This eliminates the need to search a node for this next (lower) length matching prefix when deleting from the VST. The result is a net reduction in the time for the delete operation.

The batch modes described above may also be useful when the insert/delete rate is sufficiently small that, following each insert or delete done as above, the optimal VST can be computed in the background using another processor. While this computation is being done, routes are made using the suboptimal VST resulting from the insert or delete that was done as described for the batch modes. When the new optimal VST has been computed, it is swapped with the suboptimal one.

Tables 2-15 through 2-20 give the measured run times for the insert and delete operations using each of the three strategies described above. Figures 2-20 and 2-21 plot these times for the Paix database. For the insert time experiments, we started with an optimal VST for 75% of the prefixes in the given database and then measured the time to insert the remaining 25%. The reported times are the average time for one insert. For Paix and k = 2, it takes 21 + 78 = 99 milliseconds to construct the optimal VST (the time to compute the optimal strides plus the time to construct the VST for these strides). However, the cost of an incremental insert that maintains the optimality of the VST is only 50.75 microseconds and the cost of an incremental delete is 51.85 microseconds; a speedup of about 2000 over the from-scratch optimal VST construction!
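The claimed speedup is simple unit arithmetic (milliseconds for from-scratch construction versus microseconds per incremental update); a quick check, written as a helper for clarity:

```c
#include <assert.h>

/* Ratio of a from-scratch cost (in msec) to an incremental cost (in µsec),
 * truncated to an integer. E.g., 99 ms versus 50.75 µs per insert.      */
int speedup_x(double from_scratch_ms, double incremental_us) {
    return (int)(from_scratch_ms * 1000.0 / incremental_us);
}
```

For the Paix, k = 2 numbers quoted above, 99 ms against 50.75 µs gives a factor of about 1950, and against 51.85 µs about 1909, i.e., the "about 2000" speedup stated in the text.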
Table 2-15: Insertion time (in µsec) for OptVST, Pentium 4 PC

 k      Paix      Pb   MaeWest     Aads   MaeEast
 2     50.75   49.72     49.53    49.17    208.44
 3    325.95   71.25     67.28    66.23     68.19
 4    146.74  165.60    126.74   122.92     99.87
 5    186.22  191.17    187.68   169.06    186.36
 6   2247.96  333.87    252.99   746.92    192.36
 7    912.03  446.03   2453.73   445.81    375.32

Table 2-16: Deletion time (in µsec) for OptVST, Pentium 4 PC

 k     Paix      Pb   MaeWest     Aads   MaeEast
 2    51.85   51.39     51.79    50.95     52.15
 3    61.29   60.77     60.94    59.80     61.71
 4    74.86   72.90     73.72    71.99     74.58
 5    87.47   85.71     86.31    84.97     87.80
 6    99.74   97.70     98.50    96.97     99.90
 7   111.92  109.89    110.49   108.82    113.15

Although batch insertion is considerably faster than insertion using strategy OptVST, batch insertion increases the number of levels in the VST and so results in slower searches. For example, in the experiments with Paix, the batch inserts increased the number of levels in the initial k-VST from k to 5 for k = 2, to 6 for k = 3 and 4, and to 8 for k = 5, 6, and 7. The delete times were measured by starting with an optimal VST for 100% of the prefixes in the given database and then measuring the time to delete 25% of these prefixes. Once again, the average time for a single delete is reported.

Table 2-17: Insertion time (in µsec) for Batch1, Pentium 4 PC

 k   Paix    Pb   MaeWest   Aads   MaeEast
 2   1.51  1.73     1.69   1.81      1.89
 3   2.37  2.97     3.40   2.58      2.79
 4   2.86  3.50     4.04   3.31      3.09
 5   3.63  4.33     4.52   3.54      3.93
 6   4.18  5.05     6.53   4.35      5.19
 7   5.00  4.98     9.03   4.58      5.19

Figure 2-20: Insertion time (in µsec) for Paix, Pentium 4 PC

Table 2-18: Deletion time (in µsec) for Batch1, Pentium 4 PC

 k   Paix    Pb   MaeWest   Aads   MaeEast
 2   4.72  6.13     6.53   6.24      6.44
 3   2.54  2.14     2.18   2.34      2.48
 4   2.22  2.69     2.59   2.39      2.65
 5   2.58  3.25     2.79   2.92      2.94
 6   2.73  3.05     3.15   3.86      3.06
 7   2.80  3.43     3.10   3.51      3.62

Table 2-19: Insertion time (in µsec) for Batch2, Pentium 4 PC

 k   Paix    Pb   MaeWest   Aads   MaeEast
 2   3.53  3.68     4.01   4.07      4.00
 3   4.18  4.62     4.97   4.72      4.79
 4   4.42  4.96     4.93   4.94      5.03
 5   4.70  5.32     5.48   4.94      5.51
 6   5.10  5.96     6.31   5.72      6.15
 7   5.34  6.13     7.26   5.76      5.47

Table 2-20: Deletion time (in µsec) for Batch2, Pentium 4 PC

 k   Paix    Pb   MaeWest   Aads   MaeEast
 2   5.70  7.34     7.37   7.16      7.45
 3   3.59  3.59     3.67   3.63      3.49
 4   3.48  3.78     3.93   3.89      3.98
 5   3.67  3.77     4.05   3.79      4.05
 6   3.81  4.07     4.14   3.98      4.15
 7   3.87  4.19     4.23   3.99      3.90

Figure 2-21: Deletion time (in µsec) for Paix, Pentium 4 PC

2.5 Summary

We have developed faster algorithms to compute the optimal strides for fixed- and variable-stride tries than those proposed in [82]. On IPv4 prefix databases and a 2.26 GHz Pentium 4 PC, our algorithm for fixed-stride tries is faster than the corresponding algorithm of [82] by a factor of between 1.5 and 3; on a SUN Ultra Enterprise 4000/5000, the speedup is between 2 and 4. This speedup results from narrowing the search range in the dynamic-programming formulation. Since the search range is at most 32 for IPv4 databases and at most 128 for IPv6 databases, the potential to narrow the range (and hence speed up the computation) is greater for IPv6 data. Hence, we expect that our narrowed-range FST algorithm will exhibit greater speedup on IPv6 databases. We are unable to verify this expectation because of the nonavailability of IPv6 prefix databases.
On our PC, our algorithm to compute the strides for an optimal variable-stride trie is faster than the corresponding algorithm of [82] by a factor of between 3 and 47; on our SUN workstation, the speedup is between 2 and 17. Our VST stride computation method permits the insertion and removal of prefixes without having to recompute the optimal strides from scratch. The incremental insert and delete algorithms are about 3 orders of magnitude faster than the from-scratch algorithm. We also have proposed two batch strategies for the insertion and removal of prefixes. Although these strategies permit faster insertion and deletion, they increase the height of the VST, which slows down the search operation. These batch strategies are, nonetheless, useful in applications where it is practical to rebuild the optimal VST whenever the search performance has become unacceptable.

CHAPTER 3
BINARY SEARCH ON PREFIX LENGTH

In this chapter, we focus on the collection of hash tables (CHT) scheme of Waldvogel et al. [87]. Let P be the set of prefixes in a router table, and let Pi be the subset of P comprised of prefixes whose length is i. In the scheme of Waldvogel et al. [87], we maintain a hash table Hi for every Pi that is not empty. Hi includes the prefixes of Pi as well as markers for prefixes in P whose length is greater than i; each marker m has an associated value m.lmp, the longest matching-prefix for m. Consider the prefix set P = {P1, ..., P6} of Figure 3-1(a). The prefixes of P have 5 distinct lengths: 1, 2, 4, 6, and 7. So, the CHT of [87] will comprise H1, H2, H4, H6, and H7. Given a destination address d, the longest matching-prefix, lmp(d), is found by searching the Hi's using a binary search. Suppose that the binary search follows a path as determined by the binary tree of Figure 3-1(b).
That is, if the first four bits of d correspond to a prefix in H4, this prefix becomes the longest matching-prefix found so far and the search continues to H6; if the first four bits of d correspond to a marker m in H4, then m.lmp becomes the longest matching-prefix found so far and the search continues to H6; otherwise, the search continues to H1. The quest for lmp(d) examines at most 3 hash tables in our example. When the number of distinct lengths is l_dist, the number of hash tables examined is O(log l_dist). For the described search to work correctly, H4 must have markers for P5 and P6; H1 for P3; and H6 for P6. H1, for example, will include P1 and P2 plus the marker 1* for P3 (actually, since P2 = 1*, the marker isn't needed); while H4 will include P4 plus the marker 1001* for P5 and P6. The lmp value for the marker 1001* is P3.

Srinivasan and Varghese [82] have proposed the use of controlled prefix-expansion to reduce the number of distinct lengths and hence the number of hash tables in the CHT.

Figure 3-1: Controlled prefix expansion. (a) Prefixes: P1 = 0*, P2 = 1*, P3 = 10*, P4 = 1000*, P5 = 100100*, P6 = 1001001*. (b) Tree for binary search. (c) Expanded prefixes: 00* (P1a), 01* (P1b), 10* (P3), 11* (P2a), 1000* (P4), 1001000* (P5a), 1001001* (P6).

By reducing the number of hash tables in the CHT, the worst-case number of hash tables searched in the quest for lmp(d) may be reduced. Prefix expansion [82] replaces a prefix of length u with 2^(v−u) prefixes of length v, v > u. The new prefixes are obtained by appending all possible bit sequences of length v − u to the prefix being expanded. So, for example, the prefix 1* may be expanded to the length 2 prefixes 10* and 11* or to the length 3 prefixes 100*, 101*, 110*, and 111*.
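The expansion step just described is mechanical: append every bit string of length v − u. A small sketch, representing prefixes as strings of '0'/'1' characters (the function name and interface are ours, for illustration):

```c
#include <assert.h>
#include <string.h>

/* Expand a prefix of length u = strlen(prefix) to the 2^(v-u) prefixes
 * of length v obtained by appending every bit string of length v-u.
 * `out` must have room for 2^(v-u) strings; v is assumed to be <= 32. */
int expand(const char *prefix, int v, char out[][33]) {
    int u = (int)strlen(prefix);
    int count = 1 << (v - u);  /* 2^(v-u) expanded prefixes */
    for (int i = 0; i < count; i++) {
        strcpy(out[i], prefix);
        /* append the bits of i, most significant bit first */
        for (int b = v - u - 1; b >= 0; b--)
            out[i][u + (v - u - 1 - b)] = ((i >> b) & 1) ? '1' : '0';
        out[i][v] = '\0';
    }
    return count;
}
```

For example, expand("1", 3, out) produces the four length-3 prefixes 100*, 101*, 110*, and 111* quoted above; duplicate elimination against existing prefixes (the domination rule discussed next) is left out of the sketch.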
In case an expanded prefix is already present in the prefix set of the router table, it is dominated by the existing prefix (the expanded prefix 10* represents a shorter original prefix 1* that cannot be used to match destination addresses that begin with 10 when longest-prefix matching is used) and so is discarded. So, if we expand P2 = 1* in our collection of Figure 3-1(a) to length 2, the expanded prefix P2b = 10* is dominated by P3 = 10*. Figure 3-1(c) shows the prefixes that result when the length 1 prefixes of Figure 3-1(a) are expanded to length 2 and the length 6 prefix is expanded to length 7. You may verify that lmp(d) is the same for all d regardless of whether we use the prefix set of Figure 3-1(a) or (c) (when the latter set is used, we need to map back to the original prefix from which an expanded prefix came). Since the prefixes of Figure 3-1(c) have only 3 distinct lengths, the corresponding CHT has only 3 hash tables and may be searched for lmp(d) with at most 2 hash-table searches. Hence, the CHT scheme results in faster lookup when the prefixes of Figure 3-1(c) are used than when those of Figure 3-1(a) are used. Prefix expansion has also been used to improve the lookup performance of trie representations of router tables [71, 72, 73, 82]. When reducing the number of distinct lengths from u to v, the choice of the target v lengths affects the number of markers and prefixes that have to be stored in the resulting CHT, but not the number of hash tables, which is always v. Although the number of target lengths may be determined from the expected number of packets to be processed per second and the performance characteristics of the computer to be used for this purpose, the target lengths are determined so as to minimize the storage requirements of the CHT. Consequently, Srinivasan and Varghese [82] formulated the following optimization problem.
Exact Collection of Hash Tables Optimization Problem (ECHT): Given a set P of n prefixes and a target number of distinct lengths k, determine k target lengths l1, ..., lk such that the storage required by the prefixes and markers for the prefix set expansion(P), obtained from P by prefix expansion to the determined target lengths, is minimum.

When P and k are not implicit, we use the notation ECHT(P, k). For simplicity, Srinivasan [80] assumes that the storage required by the prefixes and markers for the prefix set expansion(P) equals the number of prefixes and markers. We make the same assumption in this chapter. Srinivasan [80] provides an O(nW²)-time heuristic for ECHT. We first show, in Section 3.1, that the heuristic of Srinivasan [80] may be implemented so that its complexity is O(nW + kW²) on practical prefix sets. Then, in Section 3.2, we provide an O(nW³ + kW⁴)-time algorithm for ECHT. In Section 3.3, we formulate an alternative version, ACHT, of the ECHT problem. In this alternative version, we are to find at most k distinct target lengths to minimize storage, rather than exactly k target lengths. The ACHT problem also may be solved in O(nW³ + kW⁴) time. In Section 3.4, we propose a reduction in the search range used by the heuristic of [80]. The proposed range reduction reduces the run time by more than 50%, exclusive of the preprocessing time. The reduced-range heuristic generates the same results on our benchmark prefix datasets as are generated by the full-range heuristic of [80]. A more accurate cost estimator than is used in the heuristic of [80] is proposed in Section 3.5. Experimental results highlighting the relative performance of the various algorithms and heuristics for ECHT and ACHT are presented in Section 3.6.

3.1 Heuristic of Srinivasan

The ECHT heuristic of Srinivasan [80] uses the following definitions:

ExpansionCost(i, j): This is the number of distinct prefixes that result when all prefixes in Pq ∈ P, i ≤ q ≤ j, are expanded to length j.
For example, when P = {0*, 1*, 01*, 100*}, ExpansionCost(1, 3) = 8 (note that 0* and 1* contribute 4 prefixes each; 01* contributes none because its expanded prefixes are included in the expanded prefixes of 0*).

Entries(j): This is the maximum number of markers in Hj (should j be a target length) plus the number of prefixes in P whose length is j. Srinivasan [80] uses the "maximum number of markers" in the definition of Entries(j) rather than the exact number of markers because of the reported difficulty in computing this latter quantity.

T(j, r): This is an upper bound on the storage required by the optimal solution to ECHT(Q, r), where Q ⊆ P comprises all prefixes of P whose length is at most j; the optimal solution to ECHT(Q, r) is required to contain markers, as necessary, for prefixes of P whose length exceeds j.

Srinivasan [80] provides the following dynamic programming recurrence for T(j, r):

T(j, r) = Entries(j) + min{T(m, r−1) + ExpansionCost(m+1, j) : m ∈ {r−1, ..., j−1}}    (3.1)

T(j, 1) = Entries(j) + ExpansionCost(1, j)    (3.2)

We may verify the correctness of Equations 3.1 and 3.2. When r = 1, there is only one target length and this length is no more than j. When Q has a prefix whose length is j, then j must be the target length. In this case, the number of expanded prefixes is at most ExpansionCost(1, j) plus the number of prefixes whose length is j. So, the number of prefixes and markers is at most Entries(j) + ExpansionCost(1, j). When Q has no prefix whose length is j, the optimal target length is the largest l, l < j, such that Q has a prefix whose length is l. In this case, Entries(l) + ExpansionCost(1, l) ≤ Entries(j) + ExpansionCost(1, j) is an upper bound on the number of prefixes and markers.

To compute ExpansionCost and Entries, a 1-bit trie [39] is used. Figure 3-2 shows a prefix set and its corresponding 1-bit trie. Notice that nodes at level i (the root is at level 0) of the 1-bit trie store prefixes whose length is i + 1.
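Equations 3.1 and 3.2 translate directly into a small dynamic program. The sketch below assumes the Entries and ExpansionCost tables have already been computed (indices 1 through W); it is our illustration of the recurrence, not the code of [80]:

```c
#include <assert.h>
#include <limits.h>

#define W 32  /* maximum prefix length (32 for IPv4) */

/* Fill T[j][r] per Equations 3.1 and 3.2 and return T[W][k], the
 * heuristic's upper bound on storage with exactly k target lengths.
 * entries[j] = Entries(j); ec[i][j] = ExpansionCost(i, j).         */
int cht_cost(int k, const int entries[W + 1], int ec[W + 1][W + 1],
             int T[W + 1][W + 1]) {
    for (int j = 1; j <= W; j++)            /* Equation 3.2: r = 1 */
        T[j][1] = entries[j] + ec[1][j];
    for (int r = 2; r <= k; r++)            /* Equation 3.1 */
        for (int j = r; j <= W; j++) {      /* need >= r lengths below j */
            int best = INT_MAX;
            for (int m = r - 1; m <= j - 1; m++)
                if (T[m][r - 1] + ec[m + 1][j] < best)
                    best = T[m][r - 1] + ec[m + 1][j];
            T[j][r] = entries[j] + best;
        }
    return T[W][k];
}
```

The three nested loops make the cost of this phase O(kW²), matching the complexity quoted for the heuristic once ExpansionCost and Entries are available; recording the minimizing m for each (j, r) recovers the k target lengths by backtracking.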
Srinivasan [80] states how ExpansionCost(i, j), 1 ≤ i ≤ j ≤ W, may be computed in O(nW²) time using a 1-bit trie for P. Sahni and Kim [71] have observed that, for practical prefix sets, the 1-bit trie has O(n) nodes. So, by performing a postorder traversal of the 1-bit trie, ExpansionCost(i, j), 1 ≤ i ≤ j ≤ W, may be computed in O(nW) time (note that n > W). Details of this process are provided in Section 3.2.1, where we show how a closely related function may be computed. For Entries(j), Srinivasan [80] proposes counting the number of prefixes stored in level j − 1 of the 1-bit trie and the number of (non-null) pointers (in the 1-bit trie) to nodes at level j (the number of pointers actually equals the number of nodes).

Figure 3-2: Prefixes and corresponding 1-bit trie. (a) A prefix set: P1 = 1*, P2 = 01*, P3 = 001*, P4 = 010*, P5 = 0101*, P6 = 00001*, P7 = 00010*, P8 = 00110*, P9 = 01000*, P10 = 01001*. (b) The corresponding 1-bit trie, with root N1 at level 0 and nodes N2; N31, N32; N41–N43; and N51–N54 at levels 1 through 4.

The former gives the number of prefixes whose length is j and the latter gives the maximum number of markers needed for the longer-length prefixes. Suppose that m and j are target lengths and that no l, m < l < j, is a target length. The actual number of prefixes and markers in Hj may be considerably less than Entries(j) + ExpansionCost(m + 1, j) for the following reasons. An expanded prefix counted in ExpansionCost(m + 1, j) may be identical to a prefix in P whose length is j. Some of the prefixes in P whose length is more than j may not need to leave a marker in Hj because their length is not on any binary search (sub)path that is preceded by the length j. For example, for the binary search described by Figure 3-1(b), H1 needs markers only for prefixes in H2, not for those in H4, H6, and H7. However, Entries(1) accounts for markers needed by prefixes in H2 as well as those in H4, H6, and H7.
Entries(j) does not account for the fact that a marker may be identical to a prefix, in which case the storage count for the marker and the prefix together should be 1 and not 2. For example, in Figure 3.2(b), the marker corresponding to the nonnull pointer to node N42 is identical to the prefix P3 and that for the nonnull pointer to N43 is identical to P4. So, we can safely reduce the value of Entries(3) from 5 to 3. Note also that if the target lengths for the example of Figure 3.2(a) are 1, 3, and 5, then the number of prefixes and markers in H3 is 4. However, ExpansionCost(2, 3) + Entries(3) = 2 + 5 = 7.

Exclusive of the time needed to compute ExpansionCost and Entries, the complexity of computing T(W, k) and the k target lengths using Equations 3.1 and 3.2 is O(kW^2) [80]. So, the overall complexity is O(nW^2) (note that n > k). As noted above, we may reduce the time required to compute ExpansionCost on practical prefix sets by performing a postorder traversal of the 1-bit trie. Hence, for practical prefix sets, the overall run time is O(nW + kW^2).

3.2 Optimal-Storage Algorithm

As noted in Section 3.1, the algorithm of Srinivasan [80] is only a heuristic for ECHT. Since T(W, k) is only an upper bound on the cost of an optimal solution for ECHT(P, k), there is no assurance that the determined target lengths actually result in an optimal or close-to-optimal solution to ECHT(P, k). In this section, we develop an algorithm to determine the storage cost of an optimal solution to ECHT(P, k). The algorithm is easily extended to determine the target lengths that yield this optimal storage cost. Like the heuristic of Srinivasan [80], our algorithm uses dynamic programming. However, we modify the definition of expansion cost and introduce an accurate way to count the number of markers.

Although the heuristic of Srinivasan [80] is insensitive to the shape of the binary tree that describes the binary search, the optimal-storage algorithm cannot be insensitive to this shape.
To see this, notice that the binary tree of Figure 3.1(b) corresponds to the traditional way to program a binary search. In this, if low and up define the current search range, then the next comparison is made at mid = ⌊(low + up)/2⌋. If, instead, we were to make the next comparison at mid = ⌈(low + up)/2⌉, the search is described by the binary tree of Figure 3.3. When a binary search is performed according to this tree, only H4 need have markers.

Figure 3.3: Alternative binary tree for binary search (rooted at H4).

The markers in H4 are the same regardless of whether we use mid = ⌊(low + up)/2⌋ or mid = ⌈(low + up)/2⌉. By using the latter definition of mid, we eliminate markers from all remaining hash tables. In our development of the optimal-storage algorithm, we assume that mid = ⌈(low + up)/2⌉ is used. Our development is easily altered for the case when mid = ⌊(low + up)/2⌋ is used.

3.2.1 Expansion Cost

Define EC(i, j), 1 ≤ i ≤ j ≤ W, to be the number of distinct prefixes that result when all prefixes of P whose length q satisfies i ≤ q ≤ j are expanded to length j. Note that EC(i, i) is the number of prefixes in P whose length is i. We may compute EC by traversing the 1-bit trie for P in a postorder fashion. Each node x at level i-1 of the trie maintains a local set of EC(i, j) values, LEC(i, j), which is the expansion cost measured relative to the prefixes in the subtree of which x is the root. Some of the cases for the computation of x.LEC(i, j) are given below.

x.LEC(i, i) equals the number of prefixes stored in node x. For example, for node N1 of Figure 3.2(b), LEC(1, 1) = 1 and for node N54, LEC(5, 5) = 2.

For the remaining cases, assume i < j.

If x has a prefix in its left data field (e.g., the prefix in the left data field of node N32 is P4) and also has one in its right data field, then x.LEC(i, j) = 2^{j-i+1}.

If x has no prefixes (e.g., nodes N41 and N42) and x has nonnull left and right subtrees, then x.LEC(i, j) = x.leftChild.LEC(i+1, j) + x.rightChild.LEC(i+1, j).
If x has a right prefix and a nonnull left subtree, then x.LEC(i, j) = x.leftChild.LEC(i+1, j) + 2^{j-i}.

The remaining cases are similar to those given above. One may verify that EC(i, j) is just the sum of the LEC(i, j) values taken over all nodes at level i-1 of the trie. Figure 3.4 gives the LEC and EC values for the example of Figure 3.2. In this figure, LEC51, for example, refers to the LEC values for node N51.

Figure 3.4: LEC and EC values for Figure 3.2. (a) LEC values for nodes N1 through N54. (b) EC values; for example, EC(1, j) = 1, 3, 7, 14, 30 for j = 1, ..., 5.

Since a 1-bit trie for n prefixes may have O(nW) nodes, we may compute all EC values in O(nW^2) time by computing the LEC values as above and summing up the computed LEC values. A postorder traversal suffices for this. As noted in [71], the 1-bit tries for practical prefix sets have O(n) nodes. Hence, in practice, the EC values take only O(nW) time to compute.

3.2.2 Number of Markers

Define MC(i, j, m), 1 ≤ i ≤ j ≤ m ≤ W, to be the number of markers in Hj under the following assumptions.

The prefix set comprises only those prefixes of P whose length is at most m.

The target lengths include i-1 (for notational convenience, we assume that 0 is a trivial target length for which H0 is always empty) and j but no length between i-1 and j. Hence, prefixes whose length is i, i+1, ..., or j are expanded to length j.

Only prefixes whose length is between j+1 and m may leave a marker in Hj.

For MC(2, 4, 5) (Figure 3.2), P6 through P10 may leave markers in H4. The candidate markers are obtained by considering only the first four bits of each of these prefixes. Hence, the candidate markers are 0000*, 0001*, 0011*, and 0100*. However, since the next smaller target length is 1, P2, P3, and P4 will leave a prefix in H4.
The prefixes in H4 are 0100*, 0101*, 0110*, 0111*, 0010*, and 0011*. So, of the candidate markers, only 0000* and 0001* are different from the prefixes in H4. Therefore, the marker count MC(2, 4, 5) is 2.

We may compute all MC(i, j, m) values in O(nW^3) time (O(nW^2) for practical prefix sets) using a local function LMC in each node of the 1-bit trie and a postorder traversal. The method is very similar to that described in Section 3.2.1 for the computation of all EC values. Figure 3.5 shows the LMC and MC values for our example of Figure 3.2.

Figure 3.5: LMC and MC values for Figure 3.2. (a) LMC values; (b) MC values.

3.2.3 Algorithm for ECHT

Let Opt(i, j, r) be the storage requirement of the optimal solution to ECHT(P, r) under the following restrictions.

Only prefixes of P whose length is between i and j are considered.

Exactly r target lengths are used.

j is one of the target lengths (even if there is no prefix whose length is j).

Let lmax, lmax ≤ W, be the length of the longest prefix in P. We see that Opt(1, lmax, k) is the storage requirement of the optimal solution to ECHT(P, k). When r = 1, there is exactly one target length, j. So, all prefixes must be expanded to this length and there are no markers. Therefore,

Opt(i, j, 1) = EC(i, j), i ≤ j    (3.3)

When r = 2, one of the target lengths is j and the other, say m, lies between i and j-1. Because we assume mid = ⌈(low + up)/2⌉, the first search is made in Hj and the second in Hm.
Consequently, neither Hj nor Hm has any markers. Hj (Hm) includes prefixes resulting from the expansion of prefixes of P whose length is between m+1 and j (i and m). So,

Opt(i, j, 2) = min_{i ≤ m < j} {EC(i, m) + EC(m+1, j)}, i < j    (3.4)

Consider the case r > 2. Let the r target lengths be l1 < l2 < ... < lr. Suppose that the mid = ⌈(1 + r)/2⌉th target length is v. Let u-1 be the largest target length such that u-1 < v. The first search of the binary search is done in Hv. The number of prefixes and markers in Hv is EC(u, v) + MC(u, v, j). Additionally, the mid-1 = ⌈(r-1)/2⌉ target lengths that are less than v define an optimal (mid-1)-target-length solution for prefixes whose length is between i and u-1, subject to the constraint that u-1 is a target length (notice that there are no markers in this solution for prefixes whose length exceeds u-1), and the r-mid = ⌊(r-1)/2⌋ target lengths greater than v define an optimal (r-mid)-target-length solution for prefixes whose length is between v+1 and j, subject to the constraint that j is a target length. Hence, we obtain the following recurrence for Opt(i, j, r).

Opt(i, j, r) = min_{i+⌈(r-1)/2⌉ ≤ u ≤ v ≤ j-⌊(r-1)/2⌋} {Opt(i, u-1, ⌈(r-1)/2⌉) + Opt(v+1, j, ⌊(r-1)/2⌋) + EC(u, v) + MC(u, v, j)}, 3 ≤ r    (3.5)

Using Equations 3.3-3.5 to compute Opt(1, 5, 4) for the example of Figure 3.2, we get

Opt(1, 5, 4) = min {Opt(1, u-1, 2) + Opt(v+1, 5, 1) + EC(u, v) + MC(u, v, 5)}
             = min{Opt(1, 2, 2) + Opt(4, 5, 1) + EC(3, 3) + MC(3, 3, 5),
                   Opt(1, 2, 2) + Opt(5, 5, 1) + EC(3, 4) + MC(3, 4, 5),
                   Opt(1, 3, 2) + Opt(5, 5, 1) + EC(4, 4) + MC(4, 4, 5)}
             = min{EC(1, 1) + EC(2, 2) + EC(4, 5) + EC(3, 3) + MC(3, 3, 5),
                   EC(1, 1) + EC(2, 2) + EC(5, 5) + EC(3, 4) + MC(3, 4, 5),
                   min{EC(1, 1) + EC(2, 3), EC(1, 2) + EC(3, 3)} + EC(5, 5) + EC(4, 4) + MC(4, 4, 5)}
             = min{1 + 1 + 7 + 2 + 1, 1 + 1 + 5 + 4 + 2, min{1 + 3, 3 + 2} + 5 + 1 + 4}
             = min{12, 13, 14} = 12.

From the above computations, we see that the optimal expansion lengths are 1, 2, 3, and 5. Figure 3.6(a) shows the CHT structure that results when these four target lengths are used.
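The worked value Opt(1, 5, 4) = 12 can be cross-checked by brute force over candidate target-length sets, under the ceiling-mid binary search assumed in this section. The sketch below is set-based and illustrative only (the names are ours); it assumes that a prefix stored in Ht leaves a marker in every table where the search must branch toward longer lengths on its way to Ht, and that a marker identical to a stored prefix is counted once.

```python
from itertools import combinations

# Prefix set of Figure 3.2 (P1 through P10), written as bit strings.
PREFIXES = ["1", "01", "001", "010", "0101",
            "00001", "00010", "00110", "01000", "01001"]

def expand(p, j):
    # All length-j bit strings that begin with prefix p.
    pad = j - len(p)
    return {p + format(b, "0%db" % pad) for b in range(2 ** pad)} if pad else {p}

def marker_sites(targets):
    # marker_sites[t] = tables probed, branching toward longer lengths,
    # before Ht is reached under binary search with mid = ceil((low+up)/2).
    sites = {}
    def rec(lo, hi, right_turns):
        if lo > hi:
            return
        mid = (lo + hi + 1) // 2                        # ceiling mid
        sites[targets[mid]] = list(right_turns)
        rec(lo, mid - 1, right_turns)                   # shorter: no marker
        rec(mid + 1, hi, right_turns + [targets[mid]])  # longer: marker needed
    rec(0, len(targets) - 1, [])
    return sites

def storage_cost(targets):
    # Total prefixes plus markers stored by the CHT for these target lengths;
    # a marker identical to a stored prefix contributes only once.
    targets = sorted(targets)
    tables = {t: set() for t in targets}
    for p in PREFIXES:
        t = min(x for x in targets if x >= len(p))  # expand to next target
        tables[t] |= expand(p, t)
    markers = {t: set() for t in targets}
    for t, sites in marker_sites(targets).items():
        for m in sites:
            markers[m] |= {q[:m] for q in tables[t]}
    return sum(len(tables[t] | markers[t]) for t in targets)
```

On this sketch, storage_cost([1, 2, 3, 5]) evaluates to 12, matching the computation above, and no other choice of four target lengths containing length 5 does better.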
The three markers are shown in boldface; two of these markers are also prefixes (P3 and P4). The storage cost is 12.

Figure 3.6: Optimal-storage CHTs (arrays of hash table pointers) for Figure 3.2. (a) Four target lengths 1, 2, 3, and 5; H3 holds P3 = 001*, P4 = 010*, and the marker 000*, while H5 holds the seven length-5 prefixes and expansions. (b) Three target lengths 1, 3, and 5.

Complexity. To solve Equations 3.3-3.5 for Opt(1, lmax, k), we need to compute O(kW^2) Opt(i, j, r) values. Each of these values may be computed in O(W^2) time from earlier computed Opt values. Hence, exclusive of the time needed to compute EC and MC, the time to compute Opt(1, lmax, k) is O(kW^4). Adding in the time to compute EC and MC, we get O(nW^3 + kW^4) as the overall time needed to solve the ECHT problem. Of course, on practical data sets, the time is O(nW^2 + kW^4).

3.3 Alternative Formulation

In the ECHT(P, k) problem, we are to find exactly k target lengths for P that minimize the number of (expanded) prefixes and markers (i.e., minimize storage cost). Although k is determined by constraints on required lookup performance, the determined k is only an upper bound on the number of target lengths, because using a smaller number of target lengths improves lookup performance. The ECHT problem is formulated with the premise that using a smaller k will lead to increased storage cost, and so, in the interest of conserving storage/memory while meeting the lookup performance requirement, we use the maximum permissible number of target lengths. However, this premise is not true. As an example, consider the prefix set P = {P1, P2, P3} = {0*, 00*, 010*}.
The solution for ECHT(P, 2) uses the target lengths 2 and 3; P1 expands to 00* and 01*, but the 00* expansion is dominated by P2 and is discarded; no markers are stored in either H2 or H3; and the storage cost is 3. The solution for ECHT(P, 3), on the other hand, uses the target lengths 1, 2, and 3; no prefixes are expanded; H2 needs a marker 01* for P3; and the total storage cost is 4! With this in mind, we formulate the ACHT(P, k) problem, in which we are to find at most k target lengths for P that minimize the storage cost. In case of a tie, the solution with the smaller number of target lengths is preferred, because this solution has a reduced average lookup time. For the preceding example, the solution to ECHT(P, 3) is {1, 2, 3}, whereas the solution to ACHT(P, 3) is {2, 3}. For the example of Figure 3.2, the solution to ECHT(P, 4) is {1, 2, 3, 5}, resulting in a storage cost of 12; the solution to ACHT(P, 4) is {1, 3, 5}, resulting in a storage cost that is also 12 (see Figure 3.6(b)). The ACHT problem may be solved in the same asymptotic time as needed for the ECHT problem by first computing Opt(i, j, r), 1 ≤ i ≤ j ≤ lmax, 1 ≤ r ≤ k, and then finding the r for which Opt(1, lmax, r) is minimum, 1 ≤ r ≤ k.

3.4 Reduced-Range Heuristic

We first adapt the ECHT heuristic of Srinivasan [80] to the ACHT problem. For this purpose, we define the function C, which is the ACHT analog of T. To get the definition of C, simply replace ECHT(Q, r) by ACHT(Q, r) in the definition of T. Also, we use the same definitions for ExpansionCost (now abbreviated to ECost) and Entries as used in [80] (see Section 3.1). It is easy to see that C(j, r) ≤ C(j, r-1), r > 1. A simple dynamic programming recurrence for C is

C(j, r) = Entries(j) + min_{m ∈ {0, ..., j-1}} {C(m, r-1) + ECost(m+1, j)}, j > 0, r > 1    (3.6)

C(0, r) = 0;  C(j, 1) = Entries(j) + ECost(1, j), j > 0    (3.7)

To see the correctness of Equations 3.6 and 3.7, note that when j > 0, there must be at least one target length.
If r = 1, then there is exactly one target length. This target length is at most j (the target length is j when there is at least one prefix of this length) and so Entries(j) + ECost(1, j) is an upper bound on the storage cost. If r > 1, let m and s, m < s, be the two largest target lengths in the solution to ACHT(P, r). m could be at any of the lengths 0 through j-1; m = 0 would mean that there is only one target length. Hence the storage cost is bounded by Entries(j) + C(m, r-1) + ECost(m+1, j). Since we do not know the value of m, we may minimize over all choices for m. C(0, r) = 0 is a boundary condition.

We may obtain an alternative recurrence for C(j, r) in which the range of m on the right side is r-1, ..., j-1 rather than 0, ..., j-1. First, we obtain the following dynamic programming recurrence for C:

C(j, r) = min{C(j, r-1), T(j, r)}, r > 1    (3.8)

C(j, 1) = Entries(j) + ECost(1, j)    (3.9)

The rationale for Equation 3.8 is that the best CHT that uses at most r target lengths either uses at most r-1 target lengths or uses exactly r target lengths. When at most r-1 target lengths are used, the cost is bounded by C(j, r-1), and when exactly r target lengths are used, the cost is bounded by T(j, r), which is defined by Equation 3.1. Let U(j, r) be as defined in Equation 3.10.

U(j, r) = Entries(j) + min_{m ∈ {r-1, ..., j-1}} {C(m, r-1) + ECost(m+1, j)}, j > 0, r > 1    (3.10)

From Equations 3.1 and 3.8 we obtain

C(j, r) = min{C(j, r-1), U(j, r)}, r > 1    (3.11)

To see the correctness of Equation 3.11, note that for all j and r such that r ≤ j, T(j, r) ≥ C(j, r). Furthermore,

Entries(j) + min_{m ∈ {r-1, ..., j-1}} {T(m, r-1) + ECost(m+1, j)}
  ≥ Entries(j) + min_{m ∈ {r-1, ..., j-1}} {C(m, r-1) + ECost(m+1, j)} = U(j, r)    (3.12)

Therefore, when C(j, r-1) ≤ U(j, r), Equations 3.8 and 3.11 compute the same value for C(j, r). When C(j, r-1) > U(j, r), it appears from Equation 3.12 that Equation 3.11 may compute a smaller C(j, r) than is computed by Equation 3.8.
However, this is impossible, because

C(j, r) = Entries(j) + min_{m ∈ {0, ..., j-1}} {C(m, r-1) + ECost(m+1, j)}
        ≤ Entries(j) + min_{m ∈ {r-1, ..., j-1}} {C(m, r-1) + ECost(m+1, j)}

Therefore, the C(j, r)s computed by Equations 3.8 and 3.11 are equal. In the remainder of this section, we use the reduced range r-1, ..., j-1 for C.

Heuristically, the range for m (in Equation 3.6) may be restricted to a range that is (often) considerably smaller than r-1, ..., j-1. The narrower range we wish to use is max{M(j-1, r), M(j, r-1), r-1}, ..., j-1, where M(j, r), r > 1, is the smallest m that minimizes C(m, r-1) + ECost(m+1, j) in Equation 3.6. Although the use of this narrower range could produce results different from those obtained using the range r-1, ..., j-1, this does not happen on our benchmark prefix sets. In the remainder of this section, we derive a condition Z on the 1-bit trie that, if satisfied, guarantees that the use of the narrower range yields the same results as when the range r-1, ..., j-1 is used.

Let P be the set of prefixes represented by the 1-bit trie. Let exp(i, j), i ≤ j, be the set of distinct prefixes obtained by expanding the prefixes of P whose length is between i and j-1 to length j. Note that exp(i, i) = ∅ and that |exp(i, j)| = ECost(i, j). We say that exp(i, j) covers a length-j prefix p of P iff p ∈ exp(i, j). Let n(i, j) be the number of length-j prefixes in P that are not covered by a prefix of exp(i, j). Note that n(i, i) is the number of length-i prefixes in P. The condition Z that ensures that the use of the narrower range produces the same C values as when the range r-1, ..., j-1 is used is

Z: ECost(a, j) - ECost(b, j) ≥ 2(n(b, j) - n(a, j)), 0 < a < b ≤ j.

Lemma 9 For every 1-bit trie, (a) ECost(i, j+1) ≥ 2 ECost(i, j), 0 < i ≤ j, and (b) ECost(i, j) ≥ ECost(i+1, j), 0 < i < j.
Proof (a)

ECost(i, j+1) = |exp(i, j+1)| = 2[|exp(i, j)| + n(i, j)] = 2 ECost(i, j) + 2 n(i, j) ≥ 2 ECost(i, j)

(b) Since exp(i+1, j) ⊆ exp(i, j), ECost(i, j) = |exp(i, j)| ≥ |exp(i+1, j)| = ECost(i+1, j). □

Lemma 10 ∀(j > 0, i ≤ j)[Entries(j) + ECost(i, j) ≤ Entries(j+1) + ECost(i, j+1)].

Proof By definition, Entries(j) equals the number of prefixes of length j plus the number of nodes at level j of the trie (this latter number equals the number of pointers from level j-1 to level j). Since each length-j prefix expands to 2 length-(j+1) prefixes, the first term in the sum for Entries(j) is at most ECost(i, j+1)/2. Since the subtree rooted at each level-j node contains at least one prefix, the second term in the sum for Entries(j) is at most Entries(j+1). So,

Entries(j) ≤ ECost(i, j+1)/2 + Entries(j+1)

From Lemma 9(a), ECost(i, j) ≤ ECost(i, j+1)/2. So, Entries(j) + ECost(i, j) ≤ Entries(j+1) + ECost(i, j+1). □

Lemma 11 ∀(j > 0, r > 0)[C(j, r) ≤ C(j+1, r)].

Proof First, consider the case r = 1. From Equation 3.7, we get C(j, 1) = Entries(j) + ECost(1, j) and C(j+1, 1) = Entries(j+1) + ECost(1, j+1). From Lemma 10, Entries(j) + ECost(1, j) ≤ Entries(j+1) + ECost(1, j+1). Hence, C(j, 1) ≤ C(j+1, 1).

Next, consider the case r > 1. From the definition of M(j, r), it follows that C(j+1, r) = Entries(j+1) + C(b, r-1) + ECost(b+1, j+1), where 0 < b = M(j+1, r) ≤ j. When b < j, using Equation 3.6 and Lemma 10, we get

C(j, r) ≤ Entries(j) + C(b, r-1) + ECost(b+1, j)
        ≤ Entries(j+1) + C(b, r-1) + ECost(b+1, j+1) = C(j+1, r).

When b = j, C(j+1, r) = Entries(j+1) + C(j, r-1) + ECost(j+1, j+1) ≥ C(j, r-1) ≥ C(j, r). □

The remaining lemmas of this section assume that Z is true.

Lemma 12 ECost(a, j+1) - ECost(b, j+1) ≥ ECost(a, j) - ECost(b, j), 0 < a < b ≤ j.

Proof From the definition of n(i, j), it follows that

ECost(a, j) - ECost(b, j) = Σ_{l=a}^{j-1} n(a, l) 2^{j-l} - Σ_{l=b}^{j-1} n(b, l) 2^{j-l}
                          = Σ_{l=a}^{b-1} n(a, l) 2^{j-l} - Σ_{l=b}^{j-1} [n(b, l) - n(a, l)] 2^{j-l}