BLOOMIER HASHING: CONFINING MEMORY BANDWIDTH AND SPACE IN NETWORK ROUTING TABLE LOOKUP AND MEMORY PAGE TABLE ACCESS

By

DAVID YI LIN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010
2010 David Yi Lin
To my father and mother
ACKNOWLEDGMENTS

First, I thank my advisor, Dr. Jih-Kwon Peir, for his continuous support throughout the whole Ph.D. program. I thank him for his insightful advice, support, and mentoring. I also thank Dr. Shigang Chen for his advice and help with my research projects. I extend my appreciation to my other committee members, Dr. Ye Xia, Dr. My T. Thai, and Dr. Liuqing Yang, for their valuable comments and support. I thank my colleagues Zhuo Huang, Gang Liu, Jianming Cheng, Zhen Yang, Feiqi Su, and Xudong Shi for their help and friendship. I especially want to thank Zhuo Huang and Gang Liu for helping with my research projects and with learning the simulation environments. Last, but not least, I thank my parents and my brother for their love, support, and advice. None of this would be possible without them.
TABLE OF CONTENTS

page

ACKNOWLEDGMENTS .......... 4
LIST OF TABLES .......... 7
LIST OF FIGURES .......... 8
LIST OF ABBREVIATIONS .......... 10
ABSTRACT .......... 11

CHAPTER

1 INTRODUCTION .......... 13
1.1 Bloomier Hashing and Its Application on Destination IP Address Lookup .......... 16
1.2 Page Table Lookup Using Incremental Bloomier Hashing .......... 21
1.3 Dissertation Organization .......... 24

2 BLOOMIER HASHING AND ITS APPLICATION ON DESTINATION IP ADDRESS LOOKUP .......... 25
2.1 Introduction .......... 25
2.2 Bloomier Hashing .......... 25
2.3 IP Prefixes Expansion .......... 34
2.4 Architecture and Implementation of the BH-RT .......... 38
2.5 Performance Evaluation Methodology .......... 41
2.6 Performance Results .......... 42
2.7 Related Works .......... 48

3 PAGE TABLE LOOKUP USING INCREMENTAL BLOOMIER HASHING .......... 51
3.1 Introduction .......... 51
3.2 Conventional Page Table Organizations .......... 51
3.2.1 Forward Mapping Page Table .......... 52
3.2.2 Inverted Page Table .......... 54
3.2.3 Hashed Page Table .......... 56
3.3 Using Collision Free Bloomier Filter for Inverted Page Table .......... 58
3.4 Using Incremental Bloomier Hashing for Hashed Page Table .......... 64
3.4.1 Coalesced Hashing Approach .......... 70
3.4.2 Separate Chaining Approach .......... 74
3.5 Performance Evaluation Methodology .......... 78
3.6 Performance Results .......... 79
3.7 Related Work .......... 88

4 USING BLOOMIER HASHING TECHNIQUES FOR OTHER POSSIBLE APPLICATIONS .......... 91

5 DISSERTATION SUMMARY .......... 95

LIST OF REFERENCES .......... 97
BIOGRAPHICAL SKETCH .......... 102
LIST OF TABLES

Table    page

2.1 Percentage of prefixes in different length groups .......... 36
3.1 Space overhead and average linked list length comparisons between Bloomier filter and hash anchored inverted page table .......... 63
3.2 Footprints for eight SPEC 2000/2006 workloads .......... 78
LIST OF FIGURES

Figure    page

1.1 Longest prefix matching in a routing table .......... 17
2.1 Pseudocode for Bloomier index table setup .......... 27
2.2 Pseudocode for FindGroup .......... 28
2.3 Pseudocode for ProgramTable .......... 29
2.4 Bloomier index table encoding and setup example .......... 31
2.5 Distributions of prefixes into buckets with single hashing, two hashing and Bloomier hashing .......... 33
2.6 Distributions of prefixes based on prefix length from five routing tables .......... 35
2.7 The number of prefixes with expansions to various lengths .......... 38
2.8 The basic architecture for BH routing table lookup .......... 39
2.9 Bandwidth and memory space requirement for the five hash-based schemes .......... 44
2.10 Sensitivity on the prefix expansion for different hashing bits .......... 47
2.11 Sensitivity on the BIT size .......... 48
3.1 Forward-mapped page table .......... 53
3.2 Inverted page table with hash anchor table .......... 55
3.3 Hashed page table .......... 57
3.4 Inverted page table with Bloomier filter index table .......... 58
3.5 Bloomier filter setup failure percentage for different parameters .......... 61
3.6 Pseudocode for page table access .......... 64
3.7 Pseudocode for check page table .......... 65
3.8 Pseudocode for insert page table entry .......... 66
3.9 Average linked list length comparisons between normal hashing and Incremental Bloomier hashing with randomly generated numbers .......... 68
3.10 Hashed page table with incremental Bloomier hashing index table .......... 69
3.11 Hashed page table with incremental Bloomier hashing index table (Coalesced) .......... 70
3.12 Example of shared linked list in coalesced hashing .......... 73
3.13 Hashed page table with incremental Bloomier hashing index table (Separate Chaining) .......... 74
3.14 Hashed page table with incremental Bloomier hashing index table (Set associative) .......... 76
3.15 Average page table hit search time comparison by using separate chaining hashed page table .......... 80
3.16 Average page table hit search time comparison by using coalesced hashed page table .......... 81
3.17 Worst case search time comparison for coalesced hashed page table .......... 82
3.18 TLB misses per kilo instructions for different workloads .......... 83
3.19 Average number of cycles spent per kilo instructions for different schemes .......... 84
3.20 Sensitivity on incremental Bloomier hashing index table size .......... 86
3.21 Sensitivity on page table entries access latency .......... 87
LIST OF ABBREVIATIONS

HAT  Hash anchor table
IPT  Inverted page table
LPM  Longest prefix match
TLB  Translation lookaside buffer
XOR  Exclusive OR
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

BLOOMIER HASHING: CONFINING MEMORY BANDWIDTH AND SPACE IN NETWORK ROUTING TABLE LOOKUP AND MEMORY PAGE TABLE ACCESS

By

David Yi Lin

December 2010

Chair: Jih-Kwon Peir
Cochair: Shigang Chen
Major: Computer Engineering

In today's information age, efficient searching and retrieval of needed data are very important for many digital applications. Hashing is a common and fast method to store and look up data. However, common hashing methods, such as random hashing, create collisions in hash buckets and can lead to long and unpredictable lookup latency. In this dissertation, we develop a new hashing method called Bloomier hashing. We present our hashing solutions through the applications of network routing table lookup and page table access for memory address translation.

Continuous advancement in network speed and internet traffic demands faster routing table lookup. It is difficult to scale the current TCAM-based or trie-based solutions for future-generation routing tables due to increasing demands for higher throughput, larger table size, and longer prefix length. We present a comprehensive solution for future routing table designs with three key contributions. First, we present a partitioned longest prefix matching (LPM) scheme that groups prefixes according to their length distribution. Either TCAM or hash-based SRAM is selected in each prefix group to balance the space and bandwidth requirements for accomplishing the LPM
function. Second, we use the Bloomier hashing method to alleviate hashing collisions and to balance prefixes among the hashed buckets. Third, a constant lookup rate can be achieved by organizing the routing table as set-associative SRAM arrays along with proper prefix expansions to allow one memory access per table lookup. Performance evaluations demonstrate that the proposed routing table lookup scheme can maintain a constant lookup rate of 200 million lookups per second with lower bandwidth and space requirements in comparison with other existing hash-based approaches.

Computer architecture is moving to 64-bit virtual and physical addresses. The large address space causes problems for traditional page table organizations. We introduce Incremental Bloomier hashing, a modified version of Bloomier hashing, to create a fast and space-efficient page table architecture. Simulation results show that our Incremental Bloomier hashing can decrease the average lookup time in the page table and therefore decrease the overall number of cycles spent on page table operations.
CHAPTER 1
INTRODUCTION

In today's information age, tremendous amounts of information are collected and processed to fulfill a wide array of daily functions. This age is characterized by the ability of individuals to transfer information freely and to have instant access to knowledge that would previously have been difficult or impossible to find. Given the huge amount of data that must be stored and processed, it becomes increasingly difficult to efficiently search and retrieve the needed data. Therefore, it is essential to organize large data storage in a proper way so that information retrieval can be done quickly.

Let the collection of data elements be called a table, and each of the data elements a record. Each record contains an identity called a key, which can be used to differentiate between records. The simplest way to perform a search is to compare each record's key to the search key. The first record's key is examined, and the search goes on until a record's key matches the search key. If no record's key matches the search key, the search fails. This form of search is called a sequential search. The sequential search is time consuming, because many records must be examined one after another; in the worst case, the search examines all records in the table. However, it is possible to organize the records in a way that avoids sequential search. For example, records can be ordered by the values of their keys in the table. Searching can then be done by comparing the value of the lookup key to a record's key and deciding how to continue the search based on the result of the comparison. A binary search is an example of this strategy and is very fast compared to sequential search. However, the average lookup time of this strategy still
depends on the total number of records in the table. Hashing is another search strategy, with a fast average lookup time that is independent of the number of records in a table.

Hashing uses a data structure called the hash table. A hash table can be used to greatly reduce the number of records that need to be examined during a search. A hash table uses a hash function to transform each key into the index of a hashed bucket in a table. The record with that key is then stored into the table according to the index. Searching in a hash table is done in two steps. First, the index of the search key is computed with the hash function. Next, using the index, the records in the hashed bucket are examined to see if any record's key matches the search key. Searching is thus reduced to only the records in the hashed bucket. Moreover, the average cost of searching depends on the number of keys in a bucket, instead of the total number of keys in the table.

Ideally, the hash function should hash each key into a unique index, referred to as collision-free hashing. A hash collision occurs when two or more items are mapped into the same bucket. However, it is difficult and expensive to obtain a collision-free random hash function. Moreover, when the number of hash buckets is smaller than the number of records, hash collisions always occur. To hold multiple records in a bucket, chaining is usually applied: inside each hash bucket, a pointer is associated with each record. The pointer points to the next record in the hashed bucket, forming a linked list that contains all the records hashed to the bucket. Each entry in the list contains the key value. During lookup, the search may need to examine all items in the linked list. For efficient searching, a good hash function balances the
lengths of all the hashed linked lists. However, good hash functions, such as perfect hash functions [8, 50], take a long time to compute the hash value for a key and usually apply only to a set of known keys. Therefore, a good hash function can itself lengthen the lookup process.

One well-known approach to reducing hashing collisions is to use multiple hash functions. Instead of a single bucket, multiple hash functions hash a key to multiple candidate buckets for placing a record. By placing each record into the least-loaded bucket among the hashed candidates, the lengths of the hashed buckets can be better balanced [10, 18]. Studies show that, with the flexibility of placing a key into the smaller of two buckets chosen by two hash functions, the balance of keys among buckets is indeed improved significantly [10, 18]. However, the downside of using multiple hashing is that a search must go through all the hashed buckets, which increases the search complexity and lengthens the search time.

Furthermore, even with multiple hash functions, imbalance among hashed buckets due to hash collisions is still unavoidable. The Bloomier filter was introduced in  to completely eliminate conflicts and provide collision-free hashing. The Bloomier filter is an extension of the original Bloom filter. Unlike the Bloom filter, which only supports membership queries, the Bloomier filter can store arbitrary per-key function values for all the keys in the key set to achieve collision-free hashing. The Bloomier filter can be established as an intermediate index table. Multiple hash functions are first applied to the intermediate index table. The encoded values stored in the hashed entries are fetched and combined by a simple mathematical function (such as exclusive-OR) to produce a per-key index value. Such an index is then used to determine the hashed bucket in the hash table.
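As an illustration of the lookup path just described, the sketch below hashes a key to k positions in a small index table and XORs the stored values to derive the bucket index. This is only a toy model under assumed parameters (seeded SHA-256 hash functions, k = 3, and names of our own choosing); it omits the setup algorithm that programs the index table entries.

```python
import hashlib

def hash_k(key, seed, table_size):
    # one of k seeded hash functions into the index table (illustrative choice)
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % table_size

def bucket_index(key, index_table, num_buckets, k=3):
    # fetch the k hashed index-table entries and XOR them together
    value = 0
    for seed in range(k):
        value ^= index_table[hash_k(key, seed, len(index_table))]
    # the combined value selects the bucket in the hash table
    return value % num_buckets
```

During setup, the index-table entries would be programmed so that each key's XORed value steers it to a lightly loaded bucket; with an all-zero index table, every key trivially falls into bucket 0.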
The unique advantage of having an intermediate index table is that the values stored in the index table can be set up in such a way as to balance the record distribution among buckets. The original intent was to provide a collision-free per-key function. Such a collision-free mapping requires a hash table several times bigger than the number of records. Instead of collision-free hashing, however, we generalize and extend the Bloomier filter approach by allowing collisions in the index table. We use the Bloomier index table to balance the distribution of records among the buckets.

There are many examples of information retrieval, and hashing techniques are commonly used among them. In the following subsections, we describe a few examples of information retrieval and our proposed solutions to the problems.

1.1 Bloomier Hashing and Its Application on Destination IP Address Lookup

The first example of information retrieval is in the internet domain. The internet is formed by a huge collection of computer networks that are connected to each other and use a standard network protocol. Computers on the internet communicate with each other by sending packets. Packets are like binary mail. An internet packet contains two sections: header and data. The header of a packet contains the information the network needs to deliver the data, including source and destination addresses, error detection information, and sequencing information. Packets travel through the computer networks at high speed, and there are many different paths for a packet to reach its destination. Packets on the internet are thus like traffic in the real world, and packet traffic control is needed just as well. Routers are used for this purpose. The router is an important network device that is responsible for routing and forwarding internet packets.
One of the main functions of a network router is to route the packets it has received. A packet contains both destination and source IP addresses in its header. Once a packet is received by the router, the router needs to determine the next hop or station to route the packet to, based on the destination address. A lookup table called the routing table is used to record the next hop. There are many entries in a routing table; each entry contains information including a destination IP address and a next hop. The packet's next hop is determined by matching its destination IP address to an existing entry in the routing table. To save routing table space, the destination addresses in a routing table are stored as prefixes. A prefix is a binary string with some wildcard bits at the end. Wildcard bits are "don't care" bits that can assume any values. An incoming packet's destination address may match more than one prefix in the routing table, yet only one next hop can be selected. Therefore, the longest-prefix match (LPM) is used to obtain the next hop: the router matches the incoming address to the longest matching prefix among all the prefixes in the routing table. Figure 1.1 shows an example of LPM in a routing table.

Figure 1.1. Longest prefix matching in a routing table. Routing table: prefix 1: 1011* (next hop 1); prefix 2: 10111* (next hop 2); prefix 3: 101110 (next hop 3). Destination address: 101111. Result: LPM selects prefix 2, so next hop 2 is chosen.
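The matching in Figure 1.1 can be reproduced with a naive linear scan. This is for illustration only (the function and table representation are ours); real routers use the TCAM, trie, or hash schemes discussed below.

```python
def longest_prefix_match(address, table):
    # return the next hop of the longest prefix matching the address
    best_hop, best_len = None, -1
    for prefix, next_hop in table.items():
        bits = prefix.rstrip("*")  # the wildcard tail matches anything
        if address.startswith(bits) and len(bits) > best_len:
            best_hop, best_len = next_hop, len(bits)
    return best_hop

# the routing table of Figure 1.1
routing_table = {"1011*": 1, "10111*": 2, "101110": 3}
# destination 101111 matches prefixes 1 and 2; prefix 2 is longer,
# so longest_prefix_match("101111", routing_table) returns next hop 2
```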
As seen in Figure 1.1, there are three prefixes in the routing table. For simplicity, let the length of an IP address be six bits, and suppose the router receives a packet with destination address 101111. The router performs LPM to decide the next hop. For an IP address length of 6, prefix 1 has two wildcard bits and prefix 2 has one wildcard bit at the end; prefix 3 has no wildcard bits. Wildcard bits can assume any values. For example, prefix 2 has its last bit as a wildcard bit and can match both address 101111 and address 101110. Therefore, destination address 101111 matches both prefix 1 and prefix 2. However, since prefix 2 is longer than prefix 1 (excluding the wildcard bits), the destination address matches prefix 2, the longest prefix matched, and next hop 2 is chosen.

In a backbone router, the number of prefixes in the routing table can be up to 200-300 thousand . LPM is a classic bottleneck in backbone internet routers. With recent advances in optical network technology, along with the new IPv6 routing tables and a growing number of prefixes in internet backbone routers, it becomes increasingly difficult to provide an IP lookup scheme fast enough to handle the ever-increasing internet traffic. If the router cannot keep up, packet drops may occur. There are three categories of LPM mechanisms: ternary content addressable memory (TCAM) based schemes, trie-based schemes, and hash-based schemes.

TCAM was the choice for lookup in small routing tables. However, TCAM is inefficient in silicon space and has high power dissipation. TCAM compares the incoming address against all prefixes stored in the routing table; therefore, TCAM is well suited for matching wildcards in a prefix. However, as the number of prefixes in a large routing
table grows continuously, it becomes prohibitively expensive and power hungry to use the TCAM solution. Caching a part of the routing table permits fast lookups for recently routed prefixes . However, besides the lack of locality, caching cannot provide a constant lookup rate and may cause packet drops upon consecutive cache misses.

Trie-based search schemes are another approach for LPM. Tries are binary-tree-like data structures. To perform LPM efficiently, trie-based search algorithms match one or a few bits at a time against the prefixes in the routing table [2, 31, 35]. However, due to the multiple levels in a trie, the lookup latency is directly proportional to the length of the prefix. For a long prefix, the search needs to go very deep into the tree, leading to long search latency. The worst-case lookup latency can directly affect the performance of the router. Moreover, each node in a trie needs extra space to hold information such as child and parent node pointers. This leads to large memory usage and forces the trie to be stored off-chip.

Many recent works consider building the routing table with conventional SRAM technology and using a hash-based approach to look up the routing table [18, 65]. In this approach, a few fundamental issues need to be addressed.

- To accomplish LPM, the router must match all possible lengths of the prefixes in the routing table. This incurs a significant bandwidth requirement and delay, especially for IPv6, where the prefix lengths can vary from 16 to 64 bits. The problem is exacerbated when the growing routing table can no longer fit into on-chip SRAM and must be fetched from off-chip.

- As the internet wire speed will soon exceed 100 Gbps, a router with a single network connection needs to forward more than 150 million packets per second . The router must sustain this high constant lookup rate in order to avoid dropping incoming packets.
- One inherent difficulty in the hash-based approach stems from collisions in placing prefixes into the hashed buckets. When multiple prefixes are hashed to the same bucket, the search must cover the entire bucket. This problem is
exacerbated when the hashing of prefixes to the buckets is unbalanced, so that the search delay must accommodate the longest bucket.

To reduce multiple searches for variable prefix lengths, Controlled Prefix Expansion (CPE)  can be used. For a single prefix X of length L, expanded by W additional bits, CPE replaces X with 2^W new prefixes of length L+W, one for each of the 2^W possible values of the W expanded bits. CPE reduces the number of distinct prefix lengths to a small number at the cost of expanding the routing table. Meanwhile, the Bloom filter  has been considered as a way to filter unnecessary accesses to the routing table for all possible prefix lengths. Although the Bloom filter provides an efficient way to filter routing table lookups, it suffers a small percentage of false positives. When a false positive occurs, multiple routing table accesses can cause unpredictable delays in forwarding the incoming packet.

Even with these issues, the hash-based approach still has some major advantages. During lookup, a hashing method does not compare all prefixes in the table at the same time, so it uses much less power than the brute-force approach used by TCAM. Unlike tries, hash lookup is O(1), regardless of the prefix length. Due to these advantages, the hash-based approach seems to be the choice for LPM in the future when IPv6 is commonly used.

We propose an IP lookup architecture that uses a new hashing scheme called Bloomier hashing to greatly reduce the collisions in hash buckets. We also introduce partial prefix expansion, which allows searching only a few different lengths while maintaining a reasonable space overhead. By using these techniques, our destination IP lookup architecture can achieve much higher bandwidth than other existing hash-based architectures.
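A minimal sketch of the expansion step in CPE, expanding one binary prefix to a fixed target length (the function name and the string representation of prefixes are ours, for illustration):

```python
from itertools import product

def expand_prefix(prefix_bits, target_len):
    # Controlled Prefix Expansion: a prefix of length L expanded to
    # target_len = L + W becomes 2**W prefixes, one per value of the W new bits
    w = target_len - len(prefix_bits)
    if w <= 0:
        return [prefix_bits]  # already at or beyond the target length
    return [prefix_bits + "".join(tail) for tail in product("01", repeat=w)]

# expanding 1011 (L = 4) to length 6 (W = 2) yields 2**2 = 4 prefixes:
# 101100, 101101, 101110, 101111
```

All expanded prefixes inherit the next hop of the original prefix, which is why CPE trades routing table space for fewer distinct prefix lengths to search.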
1.2 Page Table Lookup Using Incremental Bloomier Hashing

The second example of information retrieval is in the computer virtual memory domain. Data used in computer programs are binary data stored in physical memory devices. Programmers would like to have a large and fast memory for their programs. However, a memory device with bigger capacity also has longer latency; conversely, a fast memory device is more expensive and smaller. For example, a disk is a cheap memory device that has the most space but the slowest access time. On the other hand, computer main memory is typically a RAM (random access memory) device that is more expensive than disk but has much faster access time. Since RAM is more expensive, the size of computer main memory is only a fraction of the disk space.

Ideally, program data should be in main memory all the time. However, because some programs are large, some program data need to be stored on disk when not in use. Keeping track of and managing program data in main memory and on disk at the same time is a tiresome task. Virtual memory is used in today's operating systems to solve this problem. The virtual memory management unit combines both the disk and the main physical memory into a single abstract memory unit. An application that uses virtual memory sees a large continuous memory, but in reality some parts of the memory space may reside on disk and some parts may be fragmented. Using virtual memory makes a programmer's job much easier when programming big applications.

When using virtual memory, applications request memory accesses by using virtual addresses. Memory allocation is done in units called pages. A page is a block of data residing in data storage. A virtual address is used to address a page in the virtual memory. However, with virtual memory, some of the requested pages may
not be in the main physical memory. Therefore, the hardware/software needs to determine where the requested memory page is located. To do so, the hardware/software tries to translate the virtual address into the physical address where the page is located in physical main memory. A failed translation implies that the page is located on disk. This translation process is called virtual-to-physical memory address translation. When the requested page is not located in physical memory, a page fault occurs, and the operating system must take over and move the page from disk into physical memory.

To do a virtual memory address translation, a lookup table called the page table is used. A page table stores all the mappings between virtual addresses and physical addresses. Each page table entry contains the virtual address, the physical address, protection bits, and status information . To do a virtual memory address translation, the hardware/software simply uses the virtual address to find a matching entry in the page table and returns the content of the entry if the virtual address has a valid translation.

The address translation process is on the critical path of the computation and needs to complete quickly. Since the page table can be fairly large, caching part of the page table is useful. The translation lookaside buffer (TLB) is a very small buffer that caches some page table entries. Due to its size, the TLB is very fast and located close to the processor. Many different TLB organizations [17, 38] have been proposed and evaluated. Whenever a translation is needed, the TLB is first checked to see if the translation exists in the TLB. However, due to the small size of the TLB, not all valid
virtual addresses can be translated by the TLB. The page table handles the translation when the TLB cannot find a valid translation.

The straightforward way to implement the page table is to create a page table entry for each virtual address and organize these entries into a linear array. During a lookup, the virtual address is used to directly index the array and return the content of the entry. Obviously, such an approach has a huge space requirement. For example, if the virtual address is 64 bits, then the total number of entries in the page table is 2^64 divided by the page size; with an 8 KB page size, that is 2^51 entries. Moreover, many of the entries do not contain a physical address translation, since physical memory is much smaller than virtual memory. The table is therefore quite sparse, and a lot of space is wasted. Moreover, the linear array is implemented in practice using tree-like data structures with multiple levels . For large virtual addresses, a lookup needs to traverse many levels deep down the tree, with multiple memory references.

Reverse mapping can solve the problems of the array approach. Instead of creating a page table entry for each virtual address, a page table entry is created for each physical page. The total number of entries in the page table is then equal to the total number of physical pages. Since each physical page is mapped to a virtual address, no entry in the page table is empty. However, a lookup by virtual address can be difficult in the reverse mapping approach. A typical hashing scheme can help with the lookup. One common approach is to add another intermediate table called the hash anchor table. The virtual address is first hashed into the anchor table to retrieve an index into the inverted page table. Due to hash collisions, multiple virtual addresses can be hashed to the same entry in the anchor
table. Therefore, a linked list must be built to chain all the collided virtual addresses in the inverted page table. Consequently, the lookup time can still be unpredictable and long in the worst case. The average lookup time in a page table can affect overall system performance. We propose applying the Bloomier filter and our Bloomier hashing concept to the page table. By using our methods, we can reduce the search time and the additional memory accesses when accessing the page table, improving overall system performance.

1.3 Dissertation Organization

The dissertation is structured as follows. In Chapter 2, we present our Bloomier hashing algorithm and apply this hashing method to create a high-throughput LPM architecture. In Chapter 3, we first introduce the common page table organizations. Then, we present our Incremental Bloomier hashing method and apply it to decrease the average lookup time in the hashed page table. Chapter 4 introduces other possible applications of Bloomier hashing. Chapter 5 is a summary of the dissertation.
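Before turning to Chapter 2, the hash-anchor-table lookup of Section 1.2 can be made concrete with a short sketch. The table sizes, field names, and chain layout below are illustrative assumptions, not the organization evaluated later in this dissertation.

```python
# Sketch of an inverted page table with a hash anchor table: the virtual page
# number (vpn) hashes into the anchor table, which points to the head of a
# chain of inverted-page-table entries; collided entries are linked together.
ANCHORS = 8                        # anchor table size (assumed for the sketch)

anchor = [None] * ANCHORS          # hash anchor table: slot -> first IPT index
ipt = []                           # inverted page table: one entry per frame

def insert(vpn, frame):
    slot = hash(vpn) % ANCHORS
    ipt.append({"vpn": vpn, "frame": frame, "next": anchor[slot]})
    anchor[slot] = len(ipt) - 1    # new entry becomes the head of the chain

def lookup(vpn):
    idx = anchor[hash(vpn) % ANCHORS]
    while idx is not None:         # walk the collision chain
        if ipt[idx]["vpn"] == vpn:
            return ipt[idx]["frame"]
        idx = ipt[idx]["next"]
    return None                    # no valid translation: page fault path

insert(10, 0)
insert(10 + ANCHORS, 1)            # collides with vpn 10 in the anchor table
assert lookup(10) == 0 and lookup(10 + ANCHORS) == 1
assert lookup(99) is None
```

The chain walk in `lookup` is exactly the source of the unpredictable worst-case lookup time discussed above.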
CHAPTER 2
BLOOMIER HASHING AND ITS APPLICATION ON DESTINATION IP ADDRESS LOOKUP

2.1 Introduction

Destination IP address lookup is an important function of a router. When an incoming packet arrives, the router needs to find the correct output port for the packet. To do so, the router performs a lookup on the routing table. Given an IP address, the routing table returns the next hop or output port to which the packet can be forwarded. Because the address space for IP addresses is too big, routing tables do not store next-hop information for every IP address. To save space, a routing table stores next-hop information for IP prefixes, where an IP prefix summarizes a group of IP addresses. To perform a lookup on the routing table, a longest prefix match (LPM) algorithm is used: it matches the longest prefix in the routing table against the incoming IP address and returns the next hop. There are three major approaches to the longest prefix match problem in destination IP address lookup: ternary content-addressable memory approaches, trie-based approaches, and hash-based approaches. In this chapter, we present a new hash-based approach to the longest prefix matching problem using a new hashing scheme called Bloomier hashing. Bloomier hashing can greatly improve hash table performance by using a small intermediate table. We show the general architecture using Bloomier hashing and present the results afterward.

2.2 Bloomier Hashing

Bloomier hashing is a generalized and extended solution derived from the original Bloomier filter. The Bloomier filter is built on the concept of multiple hashing. While Bloomier filters support retrieval of arbitrary per-key information and guarantee
collision-free hashing, Bloomier hashing relaxes this constraint and balances hashing collisions to distribute prefixes among buckets. Like the Bloomier filter, our Bloomier hashing requires an intermediate table. However, because the collision-free constraint is relaxed, the intermediate table in Bloomier hashing can be much smaller while still achieving very good hashing performance.

Let us first introduce some terminology for Bloomier hashing. The collection of all prefixes is the prefix set. The prefix set is hashed into a set of hashed buckets. An intermediate hashing table called the Bloomier Index Table (BIT) is introduced, into which all prefixes in the prefix set are first hashed. Proper values are stored in the BIT for determining the final hashed bucket of every prefix. There are k hashing functions that hash each prefix to k locations in the BIT; these k locations form the prefix's hash neighborhood. If a prefix is hashed into a location that is not in any other prefix's hash neighborhood, the location is called a singleton. The collection of hashed prefixes in each location of the BIT is the prefix group of that location. The process of hashing and recording the prefix set into the BIT is referred to as the index table encoding. The assignment of values to all entries of the BIT for placing and searching the prefixes, based on the encoded index table, is referred to as the index table setup. We describe the complete encoding and setup algorithms of Bloomier hashing below, followed by a detailed example.

Encoding of the Bloomier Index Table:

Index table encoding is straightforward. Based on k randomized functions, all prefixes in the prefix set are hashed into the entries of the BIT. Due to collisions, each entry in the BIT may have multiple prefixes hashed to it. Therefore, each prefix group has
a number of prefixes ranging from 0 up to the total number of prefixes in the prefix set. A counter is maintained for each prefix group for selecting prefix groups during the BIT setup. The encoding algorithm returns the index table.

Setup of the Bloomier Index Table:

The setup method takes two parameters, S and T, where S is the set of prefixes and T is the table returned by the encoding process described above. The setup algorithm involves two major functions, as shown in Figure 2.1. The first function is called FindGroup(S, T) and the second is called ProgramTable(stack, T). We describe each of the functions below.

Setup(S, T)
    stack = FindGroup(S, T)
    ProgramTable(stack, T)
    Return T

Figure 2.1. Pseudocode for Bloomier index table setup

FindGroup(S, T) takes two parameters, S and T, and returns a stack that contains all the prefixes. The algorithm is shown in Figure 2.2. It is similar to the original Bloomier filter setup method, but with some fundamental differences. In the function, we first create an empty stack. Next, while not all prefixes are on the stack, we try to find any prefixes that have singletons. If any prefixes with singletons exist, we push these prefixes and their singletons onto the stack, and we erase these prefixes' locations in the table T. If no prefix with a singleton exists, we do not redo the setup with different hashing functions as in the original Bloomier filter approach. Instead, we find the smallest prefix group, i.e., the prefix group with the smallest counter value. After the group is found, we push all the prefixes in the group
and the group location onto the stack. We also erase all of these prefixes' locations in the table. The whole process is repeated until all prefixes are on the stack.

FindGroup(S, T)
    Create an empty stack
    While not all prefixes in S are on the stack
        Find all prefixes with singletons in table T
        If prefix(es) with singleton(s) exist
            Push the prefix(es) and their singleton(s) onto the stack
            Erase each of these prefixes' k locations in table T
        Else  /* no prefix with a singleton exists */
            Find the smallest group  /* the prefix group with the smallest counter */
            Push all prefixes in the group and the group location onto the stack
            Erase each of these prefixes' k locations in table T
    Return stack

Figure 2.2. Pseudocode for FindGroup

To encode the proper values into the index table, the function ProgramTable(stack, T) is called. The algorithm is shown in Figure 2.3. The function takes two parameters: the stack returned by FindGroup(S, T), which contains all the prefixes in S, and the index table T. The function does the following. While the stack is not empty, a prefix is popped from the stack. If the prefix is a prefix with a singleton, the shortest bucket is chosen among all the hash buckets and the prefix is placed into that bucket. The singleton location in the table is encoded with the proper value by the function Encode(p, b, s, T), which uses the same XOR idea as the original Bloomier filter: in the index table T, Encode(p, b, s, T) encodes the value b into location s for prefix p. On the other hand, if the prefix popped from the stack belongs to a prefix group, then the rest of the group P is popped from the stack as well. We then find the D smallest buckets among all the hash buckets. We then
place one of the prefixes in P into one of the D buckets using the encode function. We try all combinations, pairing each member of P with each bucket in D. Using the function TotalSum(P), we then choose the assignment that gives the shortest total length of the affected buckets after encoding. The whole process repeats until all prefixes are popped from the stack.

ProgramTable(stack, T)
    While the stack is not empty
        Pop a prefix p off the top of the stack
        If the prefix p is a prefix with singleton s
            Choose the smallest bucket b in the hash table
            Encode(p, b, s, T)
        Else  /* the prefix p belongs to a prefix group */
            Pop the rest of the prefix group P and the group location l off the stack
            Pick the D smallest buckets in the hash table
            For each prefix ti in the prefix group P
                For each bucket bj in D
                    Encode(ti, bj, l, T)
                    Sum[i,j] = TotalSum(P)
            /* Let ts and bs be the prefix and bucket with the smallest sum value */
            Encode(ts, bs, l, T)

Figure 2.3. Pseudocode for ProgramTable

Search and Update Prefixes:

After the index table setup, searching for a prefix is straightforward. The prefix is first hashed to its multiple BIT locations, and the contents of these locations are fetched. The bucket ID is determined by exclusive-oring these values. Although an intermediate BIT access is necessary, the size of the BIT can be small in comparison with the
hashed buckets, and hence the BIT can fit into fast on-chip SRAM. We evaluate the performance impact of different BIT sizes later.

Example of Bloomier Hashing Encoding and Setup:

We now present a detailed example of our Bloomier hashing algorithm using Figure 2.4. Index table encoding is straightforward, as illustrated. Each prefix is hashed into the table, and a counter (not shown in Figure 2.4) keeps track of how many prefixes are in each location. For example, location 1 in the Bloomier Index Table (BIT) has prefixes K0, K2, K3, and K4 hashed to it, with a counter value of 4. Next, all the prefix groups are pushed onto the stack, starting with the smallest group. In the example, K0 and K1 in location 2 are selected first to be pushed onto the stack; K0 and K1 are removed from the BIT afterwards. Next, the remaining K4 and K5 in location 4 are selected and removed. Finally, K2 and K3 in location 1 are removed, as marked by the circled sequence numbers.

All entries in the BIT are initialized with a randomized value ranging from 0 to the total number of buckets before the second step begins. In this example, K2 and K3 are the first group to be popped from the stack. In placing these prefixes into buckets, we consider a heuristic algorithm that places one of the prefixes in the prefix group into one of the shortest buckets. In this example, we assume there are 4 buckets with lengths of 4, 2, 1, and 2 when the simulation begins. Hence, bucket 2, with a single prefix, is the target for placing either K2 or K3. For determining the hashed bucket from the intermediate BIT, we use the function-encoding idea: the ID of the bucket can be
calculated by an exclusive-OR function of the values fetched from the hashed entries in the BIT.

Figure 2.4. Bloomier Index Table encoding and setup example (Step 1: push prefixes onto the stack, starting from the smallest prefix group; Step 2: assign values from the top of the stack, using the evaluation function for groups with multiple prefixes)

Since both K2 and K3 are hashed to the same two BIT entries, 1 and 5, they
must be placed into the same hashed bucket. The ID of the shortest bucket can be expressed as B(K2) = B(K3) = V1 xor V5 = 2, where B(Ki) represents the bucket ID of prefix Ki and Vi stands for the value stored in the i-th location of the BIT. In this example, we assume location 5 in the BIT (V5) contains an initial value of 0. The value in location 1 can then be determined as V1 = 2 xor V5 = 2. The bucket lengths become 4, 2, 3, and 2 after this first placement.

Next, K4 and K5 are popped from the stack and placed into the buckets. Since buckets 1 and 3 are now the shortest, either K4 or K5 will be placed into them. In determining the value of V4 in the BIT, we apply a simple heuristic function to see whether placing K4 in bucket 1 or in bucket 3 results in a shorter total length of the affected buckets. Note that the length of a bucket with n prefixes is calculated as the sum of the traversal lengths to each prefix in the bucket (1 + 2 + ... + n), and the total length of the affected buckets is the sum of these individual bucket lengths. In this example, K4 is first placed in bucket 1. Following the same procedure used in determining the value of V1, V4 is equal to 3. Hence, B(K4) = V4 xor V1 = 1 and B(K5) = 2. The bucket lengths become 4, 3, 4, and 2 after this placement. Similarly, K4 can instead be placed in bucket 3. With the same calculation, V4 = 1; given the initial value Vi = 1 at K5's other hashed location, B(K4) = 3 and B(K5) = 0. The bucket lengths become 5, 2, 3, and 3. Applying the simple heuristic function, the total length of the affected buckets 1 and 2 for the first placement is (1+2+3) + (1+2+3+4) = 16, and the total length for the second placement is (1+2+3+4+5) + (1+2+3) = 21. Therefore, the first placement is chosen. Next, K5 is placed into a bucket; due to the associative property of the exclusive-or function, the K5 placement shows identical results to the K4 placement. Therefore, V4 can be set to
a value of 3. This process continues until all the prefix groups are popped from the stack and placed into the buckets.

Figure 2.5. Distributions of prefixes into buckets with single hashing, two hashing, and Bloomier hashing (prefix-per-bucket histograms for 128K prefixes hashed into 1K and 16K buckets)

To show the effectiveness of Bloomier hashing, we randomly generate 128K prefixes and hash them into 1K and 16K buckets using a single hashing function, two hashing functions, and our Bloomier hashing method. With two hashing functions, each prefix is placed into the bucket with the smaller number of prefixes. The results shown in Figure 2.5 clearly demonstrate the uneven distribution of prefixes to
buckets when a single randomized hashing function is used. With the flexibility of placing a prefix into the smaller of two buckets using two hashing functions, the balance of prefixes among buckets improves significantly. However, to perform a lookup with two hashing functions, two buckets must be searched. Using our Bloomier method with a small index table of 16K entries, the balance of prefixes among buckets improves even further; our Bloomier hashing method outperforms the two-hashing method by a large margin.

2.3 IP Prefixes Expansion

Given the need to match the longest prefix, each routing table lookup must search over all possible prefix lengths. This incurs long delays for sequential searches or a huge bandwidth requirement for parallel searches. We study the IP prefix distributions and use controlled prefix expansion to decide the best compromise.

Studies on several routing tables in core Internet routers have shown that the distribution of the prefixes according to their lengths is stable, but very uneven. In Figure 2.6, we plot the length distributions of five routing tables: as286, as1103, as4608, as4777, and as6447 [6, 54]. Similar to the report in , a majority of prefixes (>98%) have lengths between 16 and 24 bits, with length 24 dominating at about 54%. Prefixes longer than 24 bits are very few (<2%), and there is no prefix shorter than 8 bits. This uneven distribution provides an opportunity for partitioning the prefixes into groups and applying different LPM mechanisms to different groups. On-chip TCAM serves well when the number of prefixes is small with a wide range of lengths. On the other hand, a hash-based SRAM using proper prefix expansion can confine the space and bandwidth requirements when all prefix lengths are long, consecutive, and have small variations. Based on the length distributions in Figure 2.6, the routing tables can be partitioned into three groups, length 8-18, length 19-24, and length 25-32, for the
following reasons. Length 25-32 represents less than 2% of the prefixes; instead of including these eight different lengths in a hash-based table, it is inexpensive to use a small on-chip TCAM to perform the LPM. Similarly, less than 1% of the prefixes have lengths of 8-15, so they are also a good target for TCAM. The remaining 98% of the prefixes have lengths from 16 to 24. To avoid using expensive TCAM, these prefixes can be off-loaded to a hash-based SRAM array. However, there are still nine different prefix lengths, and all must participate in the longest match within this group.

Figure 2.6. Distributions of prefixes based on prefix length from five routing tables

The number of prefixes for each length from 16 to 24 is roughly in increasing order. To further reduce the number of different prefix lengths in the SRAM table, and hence reduce the LPM searches, more prefixes may be moved into the TCAM starting from the shortest length. The ratios of prefixes for varying numbers of lengths are given in
Table 2.1. With the six lengths from 19 to 24, the total amount of prefixes is over 90%; in other words, the remaining prefixes with 19 different lengths, the target for the TCAM solution, are less than 10%. Further reducing the number of lengths in this group helps LPM searching, but it also boosts the need for a larger TCAM. The final LPM is decided from the LPMs of the three groups.

Table 2.1. Percentage of prefixes in different length groups

          Length 16-24   Length 17-24   Length 18-24   Length 19-24   Length 20-24
AS286        99.2%          95.8%          94.3%          91.6%          85.9%
AS1103       99.0%          95.5%          94.0%          91.4%          85.8%
AS4608       98.1%          94.8%          93.3%          90.7%          85.1%
AS4777       98.3%          94.9%          93.4%          90.8%          85.2%
AS6447       97.2%          93.9%          92.5%          90.0%          84.3%

The off-chip memory organization plays a critical role in enabling one memory access per routing table lookup for prefixes with different lengths. Similar to [18, 33, 34], we consider two-dimensional set-associative SRAM arrays to support a constant lookup rate. The first dimension is the total number of hashed buckets, while the second dimension is the maximum length among all buckets. To accomplish one memory access per lookup, each bucket is allocated in consecutive memory locations and can be fetched as a single block. Furthermore, to balance the routing table size and search bandwidth, we use a hybrid prefix expansion and bucket coalescing scheme that allocates prefixes of multiple lengths in the same bucket, accommodating the LPM in a single memory access.

Consider the prefix group of length 19-24 as an example. One straightforward solution for enabling one memory access per lookup is to expand all prefixes to the maximum length, 24: all prefixes shorter than 24 are expanded to 24. To replace
a wildcard bit in a prefix, the prefix is replaced by multiple prefixes representing all possible values of the original prefix. For example, if a prefix needs to replace 2 wildcard bits, it is replaced by four other prefixes. Given that a majority of prefixes fall in this length group, such an expansion produces significant space overhead, as shown in Figure 2.7. The opposite solution is to keep all prefixes without any expansion. The multiple lengths of prefixes required for determining the LPM can be allocated in the same bucket based on the common 19 prefix bits, accommodating the shortest length. Although this bucket coalescing approach achieves one memory access per lookup, it suffers a heavy bandwidth requirement, since all prefixes must be hashed to buckets based on only 19 common bits: by hashing only on the common bits, different prefixes that share the same common bits always hash into the same bucket regardless of how the hashing is done.

Now consider a general case where all prefixes have lengths from m to n, with n-m+1 different lengths. To accomplish one memory access per lookup, a fixed length l can be chosen to balance the space overhead from prefix expansion against the bandwidth requirement due to coalescing different lengths into the same bucket. All prefixes with lengths less than l are expanded to l for determining the hashed bucket. By reducing the expansion from length n to l, the expansion ratio can be reduced by up to a factor of 2^(n-l). However, in using the common l bits for hashing, a maximum of 2^(n-l) prefixes may now be allocated in the same bucket. We refer to this collision bound as the
lower bound of the maximum number of prefixes in a bucket.

Figure 2.7. The number of prefixes with expansions to various lengths

For example, with about 70% space overhead, we can expand lengths 19-21 to 22 to be used for hashing. With this expansion, however, the lower bound on the maximum bucket length becomes 4. Similarly, in order to reduce the lower bound to 2, we must expand prefixes of lengths 19-22 to 23, with about 2.7 times the prefixes. Note that besides the collision due to limited hashing bits, the maximum length in a bucket is determined by the overall hashing collisions, since prefixes with different common bits may still be hashed into the same bucket. Detailed evaluations of the space/bandwidth tradeoff are presented in Section 2.6.

2.4 Architecture and Implementation of the BH-RT

The block diagram of the basic architecture for a BH-RT based router is illustrated in Figure 2.8. There are a number of key highlights.
Figure 2.8. The basic architecture for BH routing table lookup

Partitioned two-level longest prefix matching (LPM): Given the uneven distribution of prefix lengths in several IPv4 routing tables, the prefixes are grouped into three classes: length 8-18, length 19-24, and length 25-32 in this example design. Group 19-24 holds over 90% of the prefixes and is stored in hash-based SRAMs, likely off-loaded from the router chip. The remaining two groups, with less than 10% of the prefixes, reside in on-chip TCAMs that perform the LPM function.

Bloomier Hashing (BH) with an on-chip Bloomier Index Table (BIT): As described in Section 2.2, the BH scheme is used to balance the prefixes among the hashed buckets for group 19-24, located off-chip. A hybrid prefix
expansion/coalescing scheme, described in highlight 4 below, confines the prefixes for each lookup to one hashed bucket. For fast accesses, the BIT is partitioned into equal regions, each associated with one hashing function.

Off-chip set-associative QDR-III SRAM: To maintain a constant lookup rate, the off-chip memory is organized as a two-dimensional set-associative array. The first dimension is the set, i.e., the total number of hashed buckets. The second dimension is the set associativity, i.e., the maximum length among all the buckets. Each bucket fits in a block of fixed size in consecutive memory locations and can be fetched as a single unit. High-bandwidth 500+ MHz QDR-III SRAM is used, which supports 72-bit reads/writes per cycle.

Hybrid prefix expansion/coalescing scheme: To achieve one memory access per LPM lookup, prefixes of the different lengths required for the LPM must be allocated in the same hashed bucket. A proper length l (22, with prefix bits 0 to 21, in the illustrated design) is chosen as the index used to determine the hashed bucket. All prefixes with length less than l must be expanded to l to provide the needed common hashing bits, avoiding multiple memory fetches. Unbalanced hash collisions would increase both the fetch bandwidth and the memory space requirement.

As illustrated in Figure 2.8, an intermediate Bloomier Index Table (BIT) is constructed on-chip for the new BH-RT scheme. Multiple hashing functions using the common prefix bits determine the locations in the BIT. An exclusive-or function of the contents of the hashed BIT locations provides the bucket address in memory. A simple hashing function based on direct mapping is considered: the lower-order k bits of the common prefix bits are selected, where 2^k is the size of the BIT. In case the number of common prefix bits is less than k, each prefix is hashed to a single location in the BIT using all the available hashing bits.
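The expansion step of the hybrid scheme above can be sketched as follows. This is a minimal illustration; the binary-string representation of prefixes is an assumption made for readability, not the bit-level encoding used in the actual design.

```python
# Controlled prefix expansion: a prefix shorter than the target length l is
# replaced by all 2^(l - len) expanded prefixes that cover the same addresses.
def expand(prefix_bits, l):
    wild = l - len(prefix_bits)        # number of wildcard bits to fill in
    if wild <= 0:
        return [prefix_bits]           # already at or beyond the target length
    return [prefix_bits + format(i, "0%db" % wild) for i in range(1 << wild)]

# A 19-bit prefix expanded to length 22 becomes 2^3 = 8 prefixes.
assert len(expand("0" * 19, 22)) == 8
# Two wildcard bits give the four replacement prefixes mentioned in the text.
assert expand("10", 4) == ["1000", "1001", "1010", "1011"]
```

The 2^(n-l) growth visible here is exactly the space/bandwidth tradeoff quantified in Section 2.3.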
For multiple hashing functions, the lower bits are rotated with the upper bits to obtain a new hash value. Furthermore, to avoid conflicts when fetching multiple contents out of the BIT, the BIT is partitioned equally according to the number of hashing functions. Our evaluations show that there is little impact on hashing collisions from using a unified versus a partitioned BIT.
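A minimal sketch of the two hashing functions just described: direct mapping on the lower-order k bits, and a rotation that mixes the upper bits in. The rotation amount, the 22-bit width, and the 16K-entry BIT size are illustrative assumptions, not parameters fixed by the design.

```python
K = 14                                  # a 16K-entry BIT has 2^14 slots (assumed)
WIDTH = 22                              # common prefix bits (length-22 expansion)

def h_direct(bits):
    # First hash function: take the lower-order K bits of the common prefix.
    return bits & ((1 << K) - 1)

def h_rotate(bits, rot=6):
    # Second hash function: rotate the WIDTH-bit value so upper bits
    # participate, then take the lower-order K bits (rot = 6 is assumed).
    rotated = ((bits << rot) | (bits >> (WIDTH - rot))) & ((1 << WIDTH) - 1)
    return rotated & ((1 << K) - 1)

p = 0b1011001110001101010111            # an example 22-bit common prefix value
assert 0 <= h_direct(p) < (1 << K) and 0 <= h_rotate(p) < (1 << K)

# With a partitioned BIT, each function indexes its own half of the table,
# so the two fetches never conflict on the same region:
half = 1 << (K - 1)
slots = (h_direct(p) % half, half + h_rotate(p) % half)
assert slots[0] < half <= slots[1]
```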
2.5 Performance Evaluation Methodology

Five routing tables from Internet backbone routers [6, 54] are selected to carry out the performance evaluations and comparisons. As286 (KPN Internet Backbone), as1103 (SURFnet, The Netherlands), as4608 (Asia Pacific Network Information Center, Pty. Ltd.), and as4777 (Asia Pacific Network Information Centre) were downloaded from , using the tables dumped at 7:59 am, July 1st, 2009. The fifth table, as6447 (Oregon-IX, Oregon Exchange), was downloaded from [6] in August 2006. The numbers of prefixes in these five tables are 277K, 279K, 283K, 281K, and 212K, respectively.

We compare five hashing mechanisms for routing table lookups: simple direct-mapped hashing (Direct-mapped), randomized hashing (Randomized), hashing with an Extended Bloom Filter (Extended), multiple hashing (2Hash), and the new Bloomier hashing (Bloomier). For fast hashing without complicated hardware, Direct-mapped decodes the lower-order prefix bits to select the hashed bucket. To alleviate collisions, Randomized selects the hashing bits by exclusive-oring the lower-order prefix bits with the adjacent higher-order bits. Extended follows the scheme described in : each prefix is hashed into multiple buckets based on multiple hashing functions, and one copy is stored in the shortest bucket; a shared counter in each bucket, along with proper links across buckets, allows the search to examine only the shortest bucket. 2Hash stores each prefix in the shortest of the buckets selected by multiple hashing functions. Multiple buckets must then be searched, which incurs not only additional bandwidth but also unexpected delays due to fetching non-adjacent memory blocks. The benefit diminishes when using more than two hashing functions to balance the buckets; hence, we only consider two simple hashing functions, direct mapping and rotation of high/low prefix bits. The proposed Bloomier also uses the same two hashing functions to hash into the BIT. The size
(number of entries) of the BIT impacts the balance of the hashed buckets. In our evaluation, we consider BITs with 8K to 64K entries, denoted Bloomier-nK, where nK is the respective number of BIT entries.

The number of hashed buckets greatly impacts the number of prefixes in a bucket. Obviously, the larger the number of hashed buckets, the smaller the number of prefixes in each bucket, and hence the less bandwidth required for each lookup. Nevertheless, increasing the number of buckets increases the total memory size needed to hold a routing table with the same number of prefixes. In comparing different hash-based routing table designs, we vary the number of hashed buckets from 16K to 2M to evaluate the tradeoff between the maximum bucket length and the overall memory size.

2.6 Performance Results

Figure 2.9 shows the lookup bandwidth and memory space comparisons of the five hash-based lookup schemes, collected from simulating the five routing tables. For Bloomier, we include the results for two BIT sizes with 16K and 64K entries. We consider the prefix group of lengths 19-24, with prefix expansions to both 22 and 23 bits for hashing into buckets. After expansion to a length of 22, the numbers of prefixes in this group are 433K, 435K, 438K, 436K, and 318K for the respective routing tables, about a 70% increase over the original prefixes. When expanding to a length of 23, the numbers of prefixes become 687K, 689K, 693K, 692K, and 508K, an increase of about 170%. We use these prefix counts as the basis for calculating the memory expansion ratio when comparing the efficiency of the memory space requirement. The memory expansion ratio is equal to the product of the number of buckets and the maximum number of prefixes in a bucket, divided by the total number of prefixes.
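The memory expansion ratio just defined can be computed directly. In the example below, only the 433K prefix count comes from the text (the length-22 expansion of as286); the bucket count and maximum bucket length are made-up inputs, not values read off Figure 2.9.

```python
# Memory expansion ratio: allocated slots (buckets x max bucket length)
# divided by the actual number of prefixes stored.
def expansion_ratio(num_buckets, max_bucket_len, total_prefixes):
    return num_buckets * max_bucket_len / total_prefixes

# e.g. 433K prefixes spread over 512K buckets of at most 5 prefixes each
# (hypothetical configuration) wastes about 6x the minimum space:
ratio = expansion_ratio(512 * 1024, 5, 433 * 1000)
assert 6.0 < ratio < 6.1
```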
Note that while controlled prefix expansion (CPE) expands the routing table to reduce the number of different lengths, the memory expansion ratio measures the memory expansion due to allocating each bucket in a fixed-size memory block to achieve a constant lookup rate. We choose the expansion ratio, instead of the absolute size, for a uniform comparison across routing tables with variable numbers of prefixes. The maximum number of prefixes in a bucket determines the bandwidth requirement for each lookup, and the memory expansion ratio represents the memory size requirement. Given that the number of hashed buckets is incremented by powers of 2, the expansion ratio on the horizontal axis is plotted on a logarithmic scale. In Figure 2.9, we confine the maximum bucket size and the memory expansion ratio to 60 and 64, respectively.

Several observations can be made from Figure 2.9. First, as expected, Bloomier shows superior performance compared to the other approaches. It requires the lowest memory size expansion to obtain small buckets for both expansion lengths, mainly because of Bloomier's ability to balance the prefixes among hashed buckets through the BIT. Extended and Randomized show little improvement over Direct-mapped. Extended does not perform well when the number of prefixes is much bigger than the number of hashed buckets; in this case, each bucket counter keeping the length information is incremented multiple times whenever a prefix is hashed to it. 2Hash suffers from twice the bandwidth requirement due to the need to
search two buckets. In addition, unlike Bloomier, which places each prefix into the shortest bucket, 2Hash must make a choice even when both hashed buckets are long.

Figure 2.9. Bandwidth and memory space requirements for the five hash-based schemes
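The doubled per-lookup bandwidth of 2Hash noted above can be seen in a toy model: insertion picks the shorter of two candidate buckets, but a search must fetch both of them. The bucket count, key set, and hash choices below are illustrative assumptions, not the evaluated simulator.

```python
# Toy 2Hash table: each key has two candidate buckets; a lookup fetches both.
NB = 8
buckets = [[] for _ in range(NB)]

def two_choices(key):
    # Two deterministic hash functions (low digits and high digits of the key).
    return key % NB, (key // NB) % NB

def insert(key):
    a, b = two_choices(key)
    target = a if len(buckets[a]) <= len(buckets[b]) else b
    buckets[target].append(key)       # store one copy in the shorter bucket

def lookup(key):
    a, b = two_choices(key)
    fetched = len(buckets[a]) + (len(buckets[b]) if b != a else 0)
    return (key in buckets[a] or key in buckets[b]), fetched

for k in range(64):
    insert(k)

found, fetched = lookup(7)
assert found                 # the key is always in one of its two buckets
assert fetched == 16         # but the prefixes of BOTH buckets were fetched
```

A Bloomier-style scheme avoids this cost because the BIT pins each key to exactly one bucket, so only that bucket's contents are fetched.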
Note that Cuckoo or Peacock hashing helps balance the buckets with multiple hashing functions, but these schemes require relocating prefixes in memory to avoid overflow, and hence are not included.

Second, larger BITs indeed help, especially with small memory expansion. However, to achieve the minimum bandwidth requirement with larger expansions, the BIT size makes rather minor differences. A careful sensitivity study of the impact of the BIT size is discussed later.

Third, the bandwidth and memory space requirements of the five hashing schemes are similar between the two partial expansion lengths, 22 and 23. However, the lower bounds on the maximum prefixes in a bucket are 4 and 2, respectively, for these two expansions. As shown in Figure 2.9, these lower bounds can be reached with about 6 times the memory expansion using Bloomier, but they are unreachable by any other hashing scheme even when the expansion ratio grows to 64. Further discussion of achieving the lower bound is given in the following sensitivity study.

Recall that a combination of partial prefix expansion and coalescing of different prefix lengths into the same bucket permits each LPM lookup in one memory access. In Figure 2.10, we show the memory space and bucket size tradeoffs for the prefix group 19-24 with partial expansions to lengths 21, 22, 23, and 24. Note that in Figure 2.10, the memory size is calculated as the product of the number of hashed buckets and the maximum number of prefixes in a bucket. The Bloomier hashing with a 64K BIT is simulated. The results are obtained as an average over the first four tables; the fifth table is not considered due to its smaller size.
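The lower bounds quoted above follow directly from the hashing bits: when a group whose maximum prefix length is n is hashed on only l common bits, up to 2^(n-l) distinct prefixes are indistinguishable to the hash and must share a bucket. A one-line check for this group (n = 24):

```python
# Lower bound on the maximum prefixes per bucket when hashing on l common
# bits of a prefix group whose maximum length is n.
def lower_bound(n, l):
    return 2 ** (n - l)

# For the 19-24 group, expansion lengths 21-24 give bounds 8, 4, 2, and 1.
assert [lower_bound(24, l) for l in (21, 22, 23, 24)] == [8, 4, 2, 1]
```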
We can make the following observations. As shown in Figure 2.7, the number of prefixes increases with the respective expansion lengths by about 25%, 70%, 170%, and 400%. On the other hand, the lower bound of the maximum prefixes in a bucket decreases with the expansion lengths from 8, 4, 2, to 1. Given enough hashed buckets, these lower bounds can be reached. For example, with 128K, 1M, and 2M hashed buckets, the lower bounds of 8, 4, and 2 prefixes in a bucket are achieved for the expansion lengths 21, 22, and 23. Nevertheless, the total memory space must consider both the number of buckets and the maximum prefixes in a bucket. Therefore, the expansion length of 23 with 2M buckets and 2 prefixes in a bucket has the same memory size requirement as the expansion length of 22 with 1M buckets and 4 prefixes in a bucket. Overall, the expansion length of 23 shows the best tradeoff. With an SRAM array capable of holding slightly over 4M prefixes, this expansion can achieve the lower bound of 2 prefixes per lookup. For a lower bound of 4 prefixes per lookup, the expansion length of 23 only needs an SRAM holding 2M prefixes, instead of the 4M prefixes needed for the expansion length of 22. The expansion length of 21 suffers from its large lower bound of 8 unless the bandwidth requirement is not a problem. Although the expansion length of 24 has the minimum lower bound, it requires more than 2M buckets and incurs huge memory overhead to achieve the lower bound.

Consider a specific design using 500+ MHz QDR-III SRAMs which support 72-bit read/write operations per cycle. A burst read of 2 or 3 cycles can fetch 144 or 216 bits, respectively. Since each prefix in this group has 24 bits along with 16 bits for the output port and a few bits for distinguishing the prefix length due to partial prefix expansions, a burst read can fetch 3 or 5 prefixes in 2 or 3 cycles. Considering that each lookup only
requires 2 or 4 prefixes, a constant lookup rate of 250M/sec or 167M/sec is achievable, with additional room for bigger routing tables and/or longer prefixes.

Figure 2.10. Sensitivity on the prefix expansion for different hashing bits

The results of a sensitivity study on the BIT size are given in Figure 2.11. We simulate 4 BIT sizes with 8K, 16K, 32K, and 64K entries. We only show the expansion length of 22 in Figure 2.11 since the expansion length of 23 has similar behavior. A bigger BIT indeed helps balance the buckets, especially with smaller memory sizes. Note that since the memory size requirement scales with the total number of hashed buckets, the difference between large and small BITs is more evident when the number of buckets is small. However, to achieve the lower bound of prefixes in a bucket, the number of buckets must be sufficiently large. Therefore, the impact of the BIT size is not as significant. For example, except for the BIT with 8K entries, the other 3 bigger BIT sizes can
all achieve the lower bound of 4 prefixes in a bucket with 1M hashed buckets. Considering a BIT with 16K entries where each entry holds 22 bits for hashing to 4M buckets, the total on-chip BIT size is only 44 KB, which is smaller than a typical L1 cache and can be fitted into on-chip SRAMs.

Figure 2.11. Sensitivity on the BIT size

2.7 Related Works

There are three categories of LPM approaches: Ternary Content Addressable Memories (TCAM) [46, 2, 49], trie-based searches [21, 35, 31], and hash-based approaches [10, 25, 48, 57, 26, 18, 39]. TCAMs are custom devices that can search all the prefixes stored in memory simultaneously. They incur low delays, but require expensive tables and comparators and generate high power dissipation. Trie-based approaches use a tree-like structure to store the prefixes with the matched output
ports. They consume a low amount of power and less storage space, but incur long lookup latencies involving multiple memory operations, which makes it difficult to handle new IPv6 routing tables. Moreover, trie-based approaches also require large space overhead for storing pointers. The pointers are stored in nodes and used to traverse up and down the tries.

Hash-based approaches store and retrieve prefixes in hash tables. They are power-efficient and capable of handling a large number of prefixes. However, the hash-based approach encounters two fundamental issues: hash collisions and inefficiency in handling the LPM function. Sophisticated hash functions can reduce collisions, but require expensive hardware with long delays. On the other hand, multiple hashing functions allocate prefixes into the smallest hashed bucket for balancing the prefixes among all hashed buckets [9, 3, 10]. The Cuckoo [18] and Peacock multiple-hashing schemes further improve the balance with relocations of prefixes from long buckets. The downside of having multiple choices is that searches must cover multiple buckets to find the one that contains the correct information. The Extended Bloomier Filter places each prefix into multiple buckets and uses a counter to count the number of prefixes in each bucket. Searches are only needed in the shortest bucket. To reduce the space overhead and the length of the buckets, duplicated prefixes are removed. For efficient searches, proper links with shared counters must be established. However, the number of buckets needs to be large to achieve a good lookup rate. Hence, the number of entries in the table that stores the array of counters can be quite huge.

Handling LPM is difficult in hash-based routing table lookup. To reduce multiple searches for variable prefix lengths, Control Prefix Expansion (CPE) and its
variants [18, 58] reduce the number of different prefix lengths to a small number at high space overhead. Organizing the routing table in a set-associative memory and using common hash bits to allocate prefixes of different lengths into the same bucket reduces the number of memory operations needed to perform the LPM function [33, 34, 18]. However, coalescing buckets with multiple prefix lengths creates significant hashing collisions and hence increases the bandwidth requirement.

The Bloom filter has been considered for filtering unnecessary IP lookups of variable prefix lengths. Multiple Bloom filters are established, one for each prefix length, to filter the need for accessing the routing table for the respective length. Due to the uneven distribution of the prefix lengths, further improvement by redistribution of the hashing functions and Bloom filter tables can achieve balanced and conflict-free Bloom table accesses for multiple prefix lengths. In these filtering approaches, however, the false-positive condition may cause unpredictable delays. Given the fact that LPM requires searching all possible prefix lengths, none of the above solutions can guarantee a constant lookup rate.

For IP lookups, Chisel introduces the Bloomier hashing idea, which originated from the Bloomier filter [14]. Their approach requires a huge intermediate index table as well as a large number of hashed buckets to achieve conflict-free hashing. Inspired by the original Bloomier filter paper, our Bloomier hashing approach uses the intermediate index table to balance the hashed buckets stored in regular SRAMs. By accepting conflicts, the intermediate index table and the number of hashed buckets can be much smaller. With balanced buckets, both the bandwidth and the space requirements for IP lookups can be reduced.
CHAPTER 3
PAGE TABLE LOOKUP USING INCREMENTAL BLOOMIER HASHING

3.1 Introduction

Today's operating systems use virtual memory. Virtual memory gives an application the impression that it has a large contiguous memory to work with, but in fact, the actual physical memory may be much smaller and fragmented. Programs use virtual memory each time they request memory access via a virtual address. A virtual address needs to be translated to a physical address before the data can be fetched. This process is called memory address translation. A page table stores all the mappings between virtual addresses and physical addresses. If the requested memory page is currently located in the physical main memory, then looking up the page table via a virtual address translates it to the memory page's physical address.

A page table can be implemented in different ways. We describe the conventional ways to implement a page table and their shortcomings. Next, we describe how to implement a page table by using both the Bloomier filter and the incremental Bloomier hashing approaches. We discuss each implementation's space and time requirements by assuming that virtual addresses are 64 bits, physical addresses are 64 bits, and the page size is 4KB. Evaluation results are given afterwards.

3.2 Conventional Page Table Organizations

There are three conventional implementations for page tables: the forward mapping table, the inverted page table, and the hashed page table. A forward mapping table utilizes a straightforward method which uses the virtual address to index a series of tables until a translation is found. On the other hand, an inverted page table uses reverse mapping. Indexing an inverted page table with a physical address returns the
entry containing the virtual address that is currently mapped to the physical address. The hashed page table is based on the inverted page table. We describe each of the organizations next.

3.2.1 Forward Mapping Page Table

A forward-mapping or multi-level page table is a collection of tables put in a hierarchical order. Figure 3.1 shows an example of this type of table. In this example, there are three levels of tables. Parts of the bits in the virtual address are used to index each of the levels. The last level contains the leaf pages. The intermediate tables contain pointers. For each valid virtual address, there is an entry on a leaf page that holds the translation. Each leaf page entry also contains extra information such as status and protection bits. The Intel x86 machines currently use this kind of page table.

Time Analysis: For each page table lookup, each level in the table hierarchy needs one memory access. The total time for a page table lookup in the forward-mapped page table is therefore directly related to the total number of levels. Consider a system that uses 64-bit virtual addresses and has a 4KB page size. For the leaf pages, each entry needs to store 8 bytes of physical address and other information. Each entry in the intermediate tables needs to store 8 bytes for the pointer. So for a 4KB page size, each intermediate table holds 512 entries, and 9 bits of the virtual address can be used to index a table. Therefore, a virtual address excluding the offset is 52 bits, and there are 6 levels of tables. Each page table lookup needs to access the memory 6 times.
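The level count above follows directly from the assumed parameters. A quick check of the arithmetic (a sketch using the section's assumed sizes):

```python
import math

VADDR_BITS = 64
PAGE_SIZE = 4096   # 4KB pages
ENTRY_SIZE = 8     # bytes per page table entry (pointer or translation)

offset_bits = int(math.log2(PAGE_SIZE))          # bits consumed by the page offset
entries_per_table = PAGE_SIZE // ENTRY_SIZE      # entries per one-page table
index_bits = int(math.log2(entries_per_table))   # virtual-address bits per level
# remaining virtual-address bits, split into index_bits-sized chunks
levels = math.ceil((VADDR_BITS - offset_bits) / index_bits)
```

This reproduces the numbers in the text: a 12-bit offset, 512-entry tables indexed by 9 bits, and 6 levels, hence 6 memory accesses per walk.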
Figure 3.1. Forward-mapped page table

Space Analysis: If every virtual page contains an entry in the page table, then there are 2^52 entries in the leaf pages. If each entry in the leaf page is 8 bytes, then 2^55 bytes, or 32 petabytes, are needed. However, it is unlikely that a process uses all 64 bits of address space. To access the leaf pages, five entries in the intermediate tables are accessed. Assuming 8 bytes for each entry in the intermediate tables, the total space overhead for a leaf page is 40 bytes.

Based on this analysis using a 64-bit virtual address, the forward-mapped page table requires a big space overhead. Moreover, a memory translation requires many memory accesses to reach the leaf page. The idea of using caches to accelerate the memory address translation process has been explored. However, the method requires dedicated page table cache entries, which can decrease the already limited cache
memory. Therefore, more compact tables that require fewer memory accesses for lookup can increase the performance of a 64-bit system.

3.2.2 Inverted Page Table

Another common page table organization is called the inverted page table (IPT). Many large address space machines have used IPTs [41, 11, 29]. Compared with the forward-mapped page table, an inverted page table is a much more compact data structure. The IPT records all virtual-to-physical page mappings for those virtual pages that currently reside in memory. Each entry in the IPT contains the virtual address that is currently mapped to the physical address. Since a physical address can be used as an index to an IPT entry, the number of entries in the IPT is equal to the number of physical pages.

With virtual addresses as keys, searching is difficult in the inverted page table. Using a brute force method, each entry in the table is examined until a match is found. The worst-case search time is therefore equal to the total number of physical pages. Instead of a brute force approach, one can use hashing as an efficient search strategy. A hash anchor table (HAT) is created alongside the IPT for this purpose. Each entry in the HAT contains the page frame number that is used to index the IPT. The organizational structure is shown in Figure 3.2.

To insert an entry into the IPT, a part of the virtual page address is randomized by XORing with higher-order bits to produce a hash value. This value is then used to index the HAT. The physical page number is then put into the entry in the HAT. To handle hashing conflicts in a HAT bucket, a linked list is built using pointer spaces in the IPT. Each pointer in the IPT is also a physical page number that indexes the next member in the linked list.
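The HAT-plus-IPT organization above can be sketched as follows (Python, with tiny hypothetical table sizes and a toy hash; a real IPT is sized to the number of physical frames):

```python
# Inverted page table with a hash anchor table (HAT): conflicts in a
# HAT bucket are chained through pointer fields inside the IPT itself.

NUM_FRAMES = 8
HAT_SIZE = 8

ipt = [None] * NUM_FRAMES   # frame -> (virtual_page_tag, next_frame_or_None)
hat = [None] * HAT_SIZE     # hash value -> first frame in the chain

def hat_hash(vpage):
    # simple randomization: XOR low bits with higher-order bits
    return (vpage ^ (vpage >> 16)) % HAT_SIZE

def insert(vpage, frame):
    h = hat_hash(vpage)
    ipt[frame] = (vpage, hat[h])   # link new entry at the chain head
    hat[h] = frame

def lookup(vpage):
    frame = hat[hat_hash(vpage)]
    while frame is not None:
        tag, nxt = ipt[frame]
        if tag == vpage:
            return frame           # translation: physical frame number
        frame = nxt                # follow the IPT pointer chain
    return None                    # page fault

insert(0x1234, 3)
insert(0x123C, 5)  # collides with 0x1234 in the HAT; chained via the IPT
```

A collision costs extra memory references, one per chain element, which is exactly the unpredictability that motivates the Bloomier-based designs later in the chapter.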
Figure 3.2. Inverted page table with hash anchor table

Time Analysis: During a search, the virtual address is hashed into an entry in the anchor table. Then, the physical address in the entry is used to index an entry in the IPT. If the virtual address matches the entry in the IPT, then the translation is completed. In this case, there are two memory references. However, if there are collisions in the anchor table entry, then the linked list in the IPT is traversed, so there may be more memory references to other elements in the linked list.

Space Analysis: Each entry in the IPT contains the virtual address tag (8 bytes), a pointer to the next element in the list (8 bytes), and other information (4 bytes). The total number of entries in the IPT is equal to the total number of physical pages. So if the size of an IPT entry is 20 bytes and there are 2^19 physical pages, then the size of the IPT is 10 megabytes. Each entry in the anchor table only needs to store the physical page number, since the index of the IPT can be computed from the physical page
number. Therefore, each entry in the anchor table can use up to 8 bytes. The total number of entries in the anchor table is not fixed but should be greater than or equal to the number of physical pages. Large anchor table sizes can be used to reduce the average length of the linked lists in the IPT.

For the IPT with anchor table organization, a small anchor table or a poor hashing method can create long linked lists in the IPT and affect the overall performance. On the other hand, large anchor tables create significant space overhead. The hashed page table was introduced to eliminate the anchor table.

3.2.3 Hashed Page Table

The hashed page table is built on the inverted page table concept. It combines the inverted page table and the hash anchor table into one data structure. Due to the removal of the hash anchor table, fewer memory accesses are needed for page table accesses. However, the page table entry in the hashed page table must now contain both the physical and virtual addresses, since the physical address can no longer be computed from the page table's index. Figure 3.3 shows the hashed page table.

When a TLB (translation lookaside buffer) miss occurs, the virtual page number is hashed into a hashed page table index. The lookup virtual address is then compared to the virtual address in the page table entry. If the addresses match, then the translation is completed. If the addresses do not match, then the rest of the entries in the chain are compared until either one matches the lookup address or the end of the chain is reached. A page fault occurs when the end of the chain is reached without a match.
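The merged structure can be sketched as follows (Python; the sizes and the overflow-area chaining policy are illustrative, one of the collision resolution choices discussed in this section):

```python
# Hashed page table: anchor table and IPT merged into one structure.
# Each entry stores both the virtual tag and the physical frame, plus
# a chain pointer for collisions.

TABLE_SIZE = 8
table = [None] * TABLE_SIZE  # index -> (virtual_page_tag, frame, next_index)
overflow = []                # overflow area holding chained entries

def hpt_hash(vpage):
    return (vpage ^ (vpage >> 16)) % TABLE_SIZE

def hpt_insert(vpage, frame):
    h = hpt_hash(vpage)
    if table[h] is None:
        table[h] = (vpage, frame, None)
    else:
        # displace the old head into the overflow area and chain to it
        overflow.append(table[h])
        table[h] = (vpage, frame, len(overflow) - 1)

def hpt_lookup(vpage):
    entry = table[hpt_hash(vpage)]
    while entry is not None:
        tag, frame, nxt = entry
        if tag == vpage:
            return frame       # one memory reference on a first-slot hit
        entry = overflow[nxt] if nxt is not None else None
    return None                # page fault

hpt_insert(0x1234, 3)
hpt_insert(0x123C, 5)  # collides; old head chained into the overflow area
```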
Figure 3.3. Hashed page table

Time Analysis: During a search, the virtual address is hashed into an entry in the hashed table. If the virtual address matches the entry in the hashed page table, the translation is completed. In this case, there is just one memory reference. However, if there are collisions in the table, then there may be more memory references to other elements in the linked list.

Space Analysis: Each entry in the hashed page table contains the virtual address tag, a pointer to the next element in the list, the physical page number, and other information. An entry in the hashed page table should contain the virtual address (8 bytes), the physical address (8 bytes), the next pointer (8 bytes), and other information (4 bytes). So if the size of an entry in the page table is 28 bytes and there are 2^19 physical pages, then the size of the hashed page table is 2^19 × 28 bytes, or 14 megabytes. The average length of the chain depends on the size of the hashed
page table. Moreover, different collision resolution methods also play an important part in the average lookup time. Collisions can be chained within the table itself or chained into an overflow table.

3.3 Using Collision Free Bloomier Filter for Inverted Page Table

Inverted page tables with the hash anchor approach use a two-level data structure to achieve fast lookups while maintaining a compact size. However, due to the limitation of a single hash function, the lookup time can be unpredictable and long in the worst case. Replacing the HAT with a Bloomier filter can solve these problems. Bloomier filters offer collision-free hashing [14]. By replacing the hash anchor table with a Bloomier filter table, a constant lookup time on the IPT can be achieved. Moreover, the space overhead becomes smaller. The modified architecture is shown in Figure 3.4.

Figure 3.4. Inverted page table with Bloomier filter index table
As seen in Figure 3.4, we replace the HAT with the Bloomier filter index table. The Bloomier filter index table works the same way as the Bloomier filter. The index table performs its setup by using all the virtual addresses that are currently mapped to physical pages as keys. The table stores the physical addresses, or the indexes into the IPT, for these virtual addresses.

Time Analysis: During a lookup, the virtual address is hashed to a few locations in the index table. The contents of these locations are fetched and XORed into a physical page number, or index into the IPT. An entry in the IPT is then fetched using the index. If the virtual address in the IPT entry matches the lookup address, then the translation is completed. Otherwise, the OS initiates page fault handling. Since Bloomier filters offer collision-free hashing, fetching the entry in the IPT only needs one memory access. Performing lookups on the index table requires multiple memory accesses, but since the accesses are independent, they can be done in parallel on multi-bank memory devices.

Space Analysis: The Bloomier filter table approach guarantees collision-free hashing, so no pointer space is needed in the IPT. Therefore, the Bloomier filter index table is the only space overhead. Each entry in the IPT only needs to contain the virtual address tag (8 bytes) and other information (4 bytes). So if the size of an IPT entry is 12 bytes and there are 2^19 physical pages, then the size of the IPT is 12 × 2^19 bytes, or 6 megabytes. Each entry in the index table needs to be the same size as a physical page address, so each entry is up to 8 bytes. The total number of entries in the index table needs to be large enough so that the setup can be performed successfully. We show the space requirement for the index table a little later.
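The XOR decoding step can be made concrete with a toy example. The sketch below (Python; K = 2 and the hash functions are illustrative assumptions) hand-encodes a single key so that the XOR of its two index-table locations yields its IPT index; a real setup must encode all resident pages simultaneously, which is what the setup procedure discussed next accomplishes:

```python
INDEX_TABLE_SIZE = 16
index_table = [0] * INDEX_TABLE_SIZE

def h1(v):
    return v % INDEX_TABLE_SIZE

def h2(v):
    return (v >> 4) % INDEX_TABLE_SIZE

def encode_one(v, ipt_index):
    # place the key's IPT index so that the XOR of its two hashed
    # locations decodes to ipt_index (assumes h1(v) != h2(v) and that
    # the slot at h2(v) is still free -- a single-key illustration only)
    index_table[h2(v)] = ipt_index ^ index_table[h1(v)]

def decode(v):
    # both fetches are independent, so they can proceed in parallel
    # on multi-bank memory devices
    return index_table[h1(v)] ^ index_table[h2(v)]

encode_one(0x1234, 7)   # virtual page 0x1234 resides at IPT entry 7
```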
When a page fault occurs, the page table deletes the replaced page and inserts the new page. The replacement process may force the Bloomier index table to perform the time-consuming setup again. Moreover, page fault handling also takes a significant amount of time. However, in most applications, page faults happen rarely, so re-setup is rarely needed. Moreover, some kind of victim cache approach can also mitigate the problem. Next, we show the space requirement for a Bloomier filter index table by considering different parameters.

The Bloomier filter provides collision-free hashing, but it needs to perform a successful setup before being used. The setup process is time consuming and can fail when no singleton is found. When a setup fails, it can be redone with different hash functions. The setup is more likely to succeed if there is enough allocated space for the Bloomier filter; not allocating enough space can cause the setup to fail and be redone many times.

The Bloomier filter has three parameters: M (size of the filter, or number of entries in the filter table), N (number of keys), and K (number of hashing functions). Before a Bloomier filter of size M can be used, the filter must be able to set up using K hashing functions for N keys. In Figure 3.5, we try to set up the Bloomier filter using different combinations of parameters. We set the number of keys to 1000 and 2000. For each combination of the parameters, 100 runs were performed using randomly generated keys. The setup failure percentage is the number of times setup failed divided by the total number of setups performed. For each new run, we generated new hash values for each key. There are several observations that can be made from Figure 3.5. First, as expected, a larger filter size increases the likelihood that
Figure 3.5. Bloomier filter setup failure percentage for different parameters
the setup can be performed successfully. Second, more hash functions reduce the space needed to complete the setup. For example, when N is equal to 1000 and M is greater than 2250, two hash functions obtain a success rate of more than 50%. On the other hand, with three hash functions we obtain the same setup success rate when M is greater than 1235 and N stays the same. Moreover, by using more hashing functions, the setup failure probability drops over a much smaller range than with fewer hashing functions. For example, with N equal to 1000, M close to 1800, and two hashing functions, the setup failure probability is greater than 90%. To get a setup failure probability of less than 10%, M needs to be close to 5000 with the other parameters unchanged. If we use the same N value with three hashing functions, then the setup failure probability is greater than 90% when M is around 1200, but to get a setup failure probability of less than 10%, M only needs to be close to 1280.

From the results, we can see that the Bloomier filter size does indeed affect the setup success rate. However, by using more hashing functions, less space is required for the filter. By using three hashing functions and setting the size of the filter to be larger than 1.3 times the size of the key set, the setup can be performed successfully most of the time.

Back to the IPT comparison, the hash anchor table IPT approach requires extra space overhead in the IPT. These spaces are used for the pointers in the linked lists. The hash anchor table (HAT) by itself requires extra space. On the other hand, the Bloomier filter table approach guarantees collision-free hashing; therefore, no pointer space is needed in the IPT and the Bloomier index table is the only space overhead. Using this information, Table 3.1 compares the space and the average linked list
length between the two approaches. In Table 3.1, the value N is the number of physical pages. Table 3.1 shows that if the hash anchor table (HAT) has a number of entries equal to N, then the average linked list length in the page table is 1.5. For a HAT size of 2N, the average linked list length is 1.25. Moreover, the IPT with an anchor table needs extra space to store pointers; therefore, there are another N entries contributing to the space overhead. The total space overhead is shown in the last column of the table. On the other hand, by using a Bloomier filter index table which contains 1.4N to 2N entries and three hashing functions, we can set up the Bloomier filter with an IPT easily. We can use simple hashing such as rotating or XORing lower bits with higher bits in a virtual address. For better hashing coverage, we can also use a hardware hashing function. Most importantly, the lookup time in the page table for the Bloomier filter approach is a constant one lookup. This can be very important for some real-time applications or machines [67, 68]. The Bloomier filter approach can still have problems when applied to this application. Section 3.4 introduces the incremental Bloomier hashing which tries to solve these problems.

Table 3.1. Space overhead and average linked list length comparisons between Bloomier filter and hash anchored inverted page table

  Scheme           Index or anchor   Average linked   Space overhead   Total space
                   table size        list length      in IPT           overhead
  HAT              1N                1.5              1N               2N
  HAT              2N                1.25             1N               3N
  Bloomier filter  1.4N to 2N        Constant 1       none             1.4N to 2N
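The setup procedure whose failure rates Figure 3.5 measures can be sketched as follows (Python; the hash construction and table sizes are assumptions for illustration). It uses singleton peeling: repeatedly find a table location touched by exactly one remaining key, remove that key, then encode values by XOR in reverse removal order. Returning None on failure models the retry-with-new-hash-functions behavior described above:

```python
import random

def locs(key, m, k, seed):
    # k distinct index-table locations per key (toy hash construction)
    return random.Random(key * 2654435761 + seed).sample(range(m), k)

def bloomier_setup(values, m, k, seed=0):
    """Try to build a size-m Bloomier table mapping each key in `values`
    to its value using k hash locations per key. Returns the table, or
    None when no singleton is found (caller retries with a new seed)."""
    touching = {}                       # location -> set of remaining keys
    for key in values:
        for l in locs(key, m, k, seed):
            touching.setdefault(l, set()).add(key)
    order = []                          # (key, its singleton location)
    remaining = len(values)
    while remaining:
        pick = next((l for l, ks in touching.items() if len(ks) == 1), None)
        if pick is None:
            return None                 # setup failed
        key = touching[pick].pop()
        order.append((key, pick))
        remaining -= 1
        for l in locs(key, m, k, seed):
            touching[l].discard(key)    # key no longer touches anything
    table = [0] * m
    for key, slot in reversed(order):   # encode in reverse peel order
        acc = values[key]
        for l in locs(key, m, k, seed):
            if l != slot:
                acc ^= table[l]
        table[slot] = acc               # later writes never touch this key
    return table

def bloomier_query(table, key, k, seed=0):
    # XOR of the k hashed locations; only meaningful for set-up keys
    result = 0
    for l in locs(key, len(table), k, seed):
        result ^= table[l]
    return result

# 50 virtual pages mapped to IPT entries, M = 128, K = 3 (about 2.5x
# the key count, comfortably above the ~1.3x threshold observed above)
values = {page: frame for frame, page in enumerate(range(0x1000, 0x1032))}
seed = 0
table = bloomier_setup(values, m=128, k=3, seed=seed)
while table is None:                    # redo with different hash functions
    seed += 1
    table = bloomier_setup(values, m=128, k=3, seed=seed)
```

Querying a key that was not part of the setup returns an arbitrary index, which is why the IPT entry's virtual address tag must still be compared on every lookup.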
3.4 Using Incremental Bloomier Hashing for Hashed Page Table

Although the Bloomier filter inverted page table approach can eliminate collisions in the IPT, the Bloomier filter index table itself may still be too large. Moreover, updating is not easy. Therefore, we can relax the no-collision constraint and use the Bloomier hashing described in Chapter 2 instead. However, like the Bloomier filter, the normal Bloomier hashing previously described also needs to perform setup before it is used. For the network routing table application, all the prefixes are known before setting up the routing table; therefore, performing an index table setup is straightforward. For the page table application, we use virtual addresses as keys. Unlike the routing table application (where all the keys are known before the routing table is used), virtual addresses are highly dynamic and unknown before they are used. Therefore, performing an index table setup during program execution may not be feasible. We introduce a new hashing method called incremental Bloomier hashing which requires no table setup. The pseudocode is described below. The terminology is the same as described in Section 2.2.

PageTableAccess
    CheckPageTable(V)
    If (translation does not exist)
        InsertPageEntry(P, V)
End

Figure 3.6. Pseudocode for page table access

Figure 3.6 shows the pseudocode for page table accesses. Page table accesses go through two levels of tables as in the inverted page table. The first level is the index table, which is initialized with a random value in each entry. The second level is the page
table, which contains translations for all the physical memory pages and is empty in the beginning. During an access, the function CheckPageTable(V) is called first. This function uses the virtual page address V as the input and performs a lookup on the page table. CheckPageTable(V) returns the correct translation if V matches a page table entry in the page table. Otherwise, CheckPageTable(V) returns a translation-not-found error and a page fault occurs. In this case, the OS handles the page fault. After the page fault handling, the OS returns with a pair of addresses P and V, which need to be inserted into the page table. To do so, the function InsertPageEntry(P, V) is called. We discuss the two important functions below.

CheckPageTable(V)
    Hash V into two locations l1 and l2 in the index table
    Compute the page table index ID by XORing the contents of l1 and l2
    Go to the bucket at index ID and traverse the linked list
        For each entry, compare the virtual page tag Vt to V
        If tag Vt matches V
            Return the translation
        If the end of the linked list is reached
            Return not found
End

Figure 3.7. Pseudocode for check page table

Figure 3.7 shows the pseudocode for CheckPageTable(V). CheckPageTable(V) takes one parameter V. The function returns the translated physical page address for V if V matches a page table entry. Otherwise, a translation-not-found error is returned. The function first hashes V into two locations l1 and l2 in the index table. The contents of these two locations are XORed into an index ID into the page table, just like in the normal Bloomier hashing decoding process described in Section 2.2. Using the index, the
linked list starting at index ID is traversed. If V matches any page table entry in the linked list, then the translation is found. On the other hand, if the end of the linked list is reached, then V is not currently in the main memory and a not-found error is returned. In this case, the function InsertPageEntry(P, V) is called. We describe the details of the function below.

InsertPageEntry(P, V)
    If the physical page address P already exists, delete the old entry
    Hash V into two locations l1 and l2 in the index table
    Compute the page table index ID by XORing the contents of l1 and l2
    If the indexed bucket in the page table is empty
        Insert the translation P, V into the bucket
        Mark l1 and l2 as used
        Return
    If the indexed bucket in the page table is not empty
        Check whether l1 or l2 is occupied by other keys
        If one of l1 or l2 is not occupied
            Choose an empty bucket B in the page table
            Encode the index of B into whichever of l1 or l2 is not occupied
            Mark both l1 and l2 as occupied
        Else
            Insert into the linked list starting at ID
    Return
End

Figure 3.8. Pseudocode for insert page table entry

Figure 3.8 shows the pseudocode for InsertPageEntry. This function is called when a page fault is encountered and a new translation needs to be inserted into the page table. The function accepts two arguments P and V. P is the physical address and V is the corresponding virtual address. The function checks to see if P already exists in
an entry in the page table. If so, the old entry that contains P is deleted, since it contains a stale translation. The virtual address V is hashed into two locations l1 and l2 in the index table. The contents of these two locations are XORed into an index ID into the page table, just like the normal Bloomier hashing decode process described in Section 2.2. The bucket at location ID is checked to see if it is empty. If the bucket is empty, then we simply put the translation in that bucket. If the bucket is not empty, then we try to place the new translation in an empty bucket instead. To do so, we check to see if l1 or l2 is occupied by other keys. If l1 and l2 are both occupied, then we have no choice but to put the translation in the non-empty bucket; we then use some form of collision resolution to handle the overflow. If l1 or l2 is not occupied, then we choose an empty bucket B in the page table and encode the index of B into the unoccupied location in the index table. The encoding process is the same as described in Section 2.2. After encoding is done, both l1 and l2 are marked as occupied.

The incremental Bloomier hashing requires no index table setup. Moreover, lookup is straightforward. Figure 3.9 uses randomly generated numbers to show average linked list length comparisons between the incremental Bloomier hashing scheme and normal hashing. The numbers are generated by using the standard C++ pseudo-random integer generator. We chose to compare the average linked list length because it is directly related to the average lookup time in a page table, which we consider to be one of the most important measurements of page table performance.

Figure 3.9 compares the average linked list lengths of the normal hashing scheme and our incremental Bloomier hashing scheme at three different index table sizes.
Figure 3.9. Average linked list length comparisons between normal hashing and incremental Bloomier hashing with randomly generated numbers

size of the hash table is 64K entries. We use different hashing methods to hash 64K randomly generated numbers into the table. For the incremental Bloomier hashing scheme, the three index table sizes are as follows: half the size of the page table, the same size as the page table, and double the size of the page table. Figure 3.9 shows that our scheme can balance the hash buckets and reduce the average linked list length. A shorter average linked list length means a shorter average search time and better hash table performance. From Figure 3.9, we can also see that, just like normal Bloomier hashing, increasing the index table size indeed helps balance the buckets. With an index table of 32K entries, the average search time is reduced by a small amount; with an index table of 128K entries, it is reduced by a large amount.

After the discussion of the incremental Bloomier hashing algorithms, the modified architecture is shown below.
Figure 3.10. Hashed page table with incremental Bloomier hashing index table

As seen in Figure 3.10, instead of using the inverted page table, we use a general hashed page table. The hashed table is similar to the one described in . Each entry in the hash table contains both the virtual address and the physical page number, since the physical page number can no longer be computed from the entry's index. Compared with our previous Bloomier filter table approach, the incremental Bloomier hashing index table approach uses fewer hashing functions in the index table, and the index table is much easier to update. The lookup operation is very similar to the previous Bloomier filter approach. Using the virtual address and the hashing functions, we fetch two locations in the Bloomier hashing index table. The contents of the two locations are XORed to produce an index. Using this index, we fetch an entry in the page table and compare its virtual address tag with the lookup address.
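The lookup and insert flow just described can be sketched in Python. This is a toy sketch under our own simplifying assumptions (small power-of-two table sizes, a second hash function of our choosing, simple chaining for overflow, and coarse per-location occupancy flags), not the dissertation's implementation.

```python
# Toy sketch of incremental Bloomier hashing lookup/insert (Figure 3.8 flow).
M, N = 16, 8                        # index-table and page-table sizes (powers of two)
index_tab = [{"val": 0, "used": False} for _ in range(M)]
page_tab = [[] for _ in range(N)]   # each bucket: list of (virtual, physical)

def h1(v): return v % M
def h2(v): return (v * 2654435761 >> 4) % M   # arbitrary second hash (our choice)

def decode(v):
    # XOR the two indexed entries to get a page-table bucket (Section 2.2 style)
    return index_tab[h1(v)]["val"] ^ index_tab[h2(v)]["val"]

def lookup(v):
    for vp, pp in page_tab[decode(v)]:
        if vp == v:
            return pp
    return None                     # not found: would trigger a page fault

def insert(v, p):
    l1, l2, d = h1(v), h2(v), decode(v)
    if not page_tab[d]:             # indexed bucket empty: place the entry there
        page_tab[d].append((v, p))
    else:
        free = next((i for i in range(N) if not page_tab[i]), None)
        enc = l1 if not index_tab[l1]["used"] else (
              l2 if not index_tab[l2]["used"] else None)
        if free is not None and enc is not None and l1 != l2:
            other = l2 if enc == l1 else l1
            # encode so the two index entries XOR to the chosen empty bucket
            index_tab[enc]["val"] = free ^ index_tab[other]["val"]
            page_tab[free].append((v, p))
        else:
            page_tab[d].append((v, p))   # overflow: chain at bucket d
    # mark both locations used so later encodes cannot move this key's bucket
    index_tab[l1]["used"] = index_tab[l2]["used"] = True
```

Because N is a power of two and every stored value stays below N, the XOR in decode always yields a valid bucket; marking both hashed locations as used is the safeguard that keeps already-placed translations findable after later encodes.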
Unlike the Bloomier filter approach, which guarantees collision-free hashing, collisions in the page table are once again possible when using our incremental Bloomier hashing method. When two or more items are hashed into the same bucket, a collision occurs. There are many possible ways to handle collisions in the page table. Coalesced hashing, separate chaining, and the set-associative approach are common and well-known methods for handling collisions . Coalesced hashing does not use extra space overhead. On the other hand, both separate chaining and the set-associative approach use extra space to achieve shorter linked list lengths. Each of the collision resolution methods has its advantages and disadvantages. We examine each method below.

3.4.1 Coalesced Hashing Approach

Figure 3.11. Hashed page table with incremental Bloomier hashing index table (coalesced)
Figure 3.11 shows a page table organization that uses a modified version of coalesced hashing . Coalesced hashing uses ideas from two other common hashing schemes, separate chaining and open addressing . In coalesced hashing, whenever a collision occurs in a bucket, the first empty bucket in the table is used to hold the overflow item, and a link to the index of the overflow bucket is put into the collision bucket. Since our incremental Bloomier hashing scheme does not guarantee collision-free hashing, we use the coalesced hashing idea to handle collisions. If a translation is hashed into a bucket in the page table that is already occupied by another translation, we randomly choose an empty bucket in the page table and put the overflow item there. We call this the overflow bucket. We then link the overflow bucket to the collision bucket using a pointer. However, since the collision bucket's pointer field may already be occupied by a previously overflowed item, we may need to traverse the linked list to find an empty pointer field.

Time Analysis: First, we access two entries in the index table in parallel. Then, like the IPT with anchor table approach, we search the page table by traversing the linked list until the virtual address tag in a page table entry matches the lookup address. If the end of the linked list is reached, a page fault occurs. Therefore, at least one memory reference to the page table is needed. The average length of the linked list depends on the size of the index table. We show the performance difference for different index table sizes in Section 3.6.

Space Analysis: Each entry in the hashed page table contains the virtual address tag, a pointer to the next element in the chain, the physical page number, and other information. Coalesced hashing has the advantage that no bucket in the hash table is
wasted. Because of this, the coalesced hash table is a very compact structure, like the inverted page table. By setting the size of the page table equal to the number of physical pages, all entries in the coalesced page table are used. An entry in the page table should contain the virtual address (8 bytes), the physical address (8 bytes), the next pointer (8 bytes) and other information (4 bytes). So if the size of an entry in the page table is 28 bytes and there are 2^19 physical pages, then the size of the hashed page table is 2^19 × 28 bytes, or 14 megabytes. Like the IPT with Bloomier filter index table approach, each entry in the index table is up to 8 bytes, since the number of buckets in the page table is still equal to the number of physical pages. However, by using a hashed page table, we can create an index table and a page table for each process. For example, an entry in the page table should contain the virtual address, the physical address, other information, and a pointer; the size of such an entry should be around 24 bytes. If a process uses 256 megabytes of memory and the page size is 4 kilobytes, then there are 64 thousand, or 2^16, entries in a per-process page table. The size of an index table entry depends on the number of entries in the hashed page table, so an index table entry needs only around 2 bytes, which is only 1/12 the size of a page table entry. Therefore, both the index table and the hashed page table can become smaller if they are per-process tables. Once again, the index table size has a direct effect on the average and worst-case lengths of the linked lists. However, the index table size is only a fraction of the page table's size. We show some simulation results with different index table sizes in Section 3.6.
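The size arithmetic above can be checked directly. The field sizes (8-byte virtual address, 8-byte physical address, 8-byte next pointer, 4 bytes of other information) are the ones assumed in the text:

```python
# Worked size arithmetic for the coalesced hashed page table.
ENTRY_BYTES = 8 + 8 + 8 + 4            # 28-byte page table entry
PHYS_PAGES = 2 ** 19
table_bytes = PHYS_PAGES * ENTRY_BYTES
print(table_bytes // 2 ** 20)          # 14 (megabytes)

# Per-process table: a 256 MB process with 4 KB pages needs 2**16 entries,
# so a bucket number fits in 16 bits (2 bytes) per index table entry.
per_proc_entries = (256 * 2 ** 20) // (4 * 2 ** 10)
print(per_proc_entries == 2 ** 16)     # True
```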
Coalesced hashing can achieve a very compact page table with no wasted space. However, the average-case and worst-case linked list lengths can become long due to additions of items to an already present linked list. An example is shown below.

Figure 3.12. Example of a shared linked list in coalesced hashing

In this example, the keys (virtual addresses) VA2 and VA3 are hashed into the same bucket. Due to the collision, VA3 is put into an empty bucket, Bucket 2, with its physical address PA3. The linked list starting at Bucket 1 has length two at this point. Sometime later, VA1 is hashed into Bucket 2. But since Bucket 2 is already occupied by VA3, VA1 is placed into Bucket 3. The linked list starting at Bucket 2 has two items at this point. However, if we traverse the linked list starting at Bucket 1, there are three items. So, using coalesced hashing can increase both the average and worst-case linked list lengths. This problem can be solved by using extra overflow areas.

The separate chaining approach and the set-associative approach are two collision resolution approaches that use extra overflow area to handle collisions. The separate chaining approach stores overflow items by allocating new table entries in memory space and chaining them together. Each entry in a chain is linked to the next by
using pointers. On the other hand, the set-associative approach allocates extra space to handle overflow in advance.

3.4.2 Separate Chaining Approach

Figure 3.13 shows the Bloomier hashing separate chaining approach. In this approach, each bucket contains a linked list for overflow items. Each linked list can grow by allocating a new page table entry in memory and chaining it to the list.

Figure 3.13. Hashed page table with incremental Bloomier hashing index table (separate chaining)

Time Analysis: First, we access two entries in the index table in parallel. Then, like the IPT with anchor table approach, searching is performed by traversing the linked list associated with each bucket until a match is found. If we reach the end of the list, a page fault occurs. The average length of the linked list depends
on the size of the index table. We show the performance difference for different index table sizes in Section 3.6.

Space Analysis: The separate chaining approach is space efficient. A new page table entry is allocated on an as-needed basis when a collision occurs. Therefore, the size of the page table is not fixed; the chains can grow or shrink at run time. An entry in the page table should contain the virtual address (8 bytes), the physical address (8 bytes), the next pointer (8 bytes) and other information (4 bytes). So if the size of an entry in the page table is 28 bytes and there are 2^19 entries in the table, then the size of the hashed page table is at least 2^19 × 28 bytes, or 14 megabytes. Like the IPT with Bloomier filter index table approach, each entry in the index table is up to 8 bytes, since the number of buckets in the page table is still equal to the number of physical pages. However, by using a hashed page table, we can create an index table and a page table for each process. Therefore, both the index table and the hashed page table can become smaller if they are per-process tables. Once again, the index table size has a direct effect on the average and worst-case lengths of the linked lists.

The separate chaining approach has poor cache performance because the next member of a linked list is likely not in the same physical page as the previous member. This forces another page to be fetched from memory and can cause additional page faults. However, if we know the worst-case linked list length, we can allocate extra space in advance for all the collisions. We can use the set-associative approach, like our Bloomier hashing design for IP lookup in Chapter 2.

Figure 3.14 shows the incremental Bloomier hashing set-associative approach. In this approach, all the items that are hashed to the same bucket are put into a set. Each
Figure 3.14. Hashed page table with incremental Bloomier hashing index table (set-associative)

bucket in the page table is a set of page table entries, and the number of entries per set is fixed. Since the page size is fairly large compared to the size of a page table entry, the number of entries per set can also be large. The number of entries per set needs to be sufficiently large to handle the worst-case collisions. During an update, unlike the separate chaining approach where we can always allocate extra nodes, if the number of collisions in a bucket grows beyond the maximum entries per set, the page table may need to increase the size of every set. This can cause problems, because all the previous page table entries may need to be reallocated to different physical pages, which can cause long delays.
Time Analysis: Once again, we access two entries in the index table in parallel. However, unlike the separate chaining approach, all elements in the same set reside in the same page. Therefore, only one memory access is needed to fetch the entire set. Lookup is then done by comparing each entry in the set to the lookup address.

Space Analysis: Since the page size is fairly large compared to the size of a page table entry, the number of entries per set can also be large, and it needs to be sufficiently large to handle the worst-case collisions. Since the number of entries per set is fixed, some entries in some sets may not be used at all. An entry in the page table should contain the virtual address (8 bytes), the physical address (8 bytes), and other information (4 bytes). So if the size of an entry in the page table is 20 bytes and there are 2^19 buckets in the table, then the size of the hashed page table is 2^19 × 20 bytes, or 10 megabytes, for one entry per set. Assuming the number of entries per set is four, the total size of the page table is then 2^19 × 20 × 4 bytes, or 40 megabytes. Like the IPT with Bloomier filter index table approach, each entry in the index table is up to 8 bytes, since the number of buckets in the page table is still equal to the number of physical pages. However, by using a hashed page table, we can create an index table and a page table for each process. Therefore, both the index table and the hashed page table can become smaller if they are per-process tables. The index table size has a direct effect on the worst-case length of the linked lists, and the worst-case length determines the number of entries per set. Overall, the set-associative approach is much less space efficient than the separate chaining approach but can have better lookup performance.
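A quick check of the set-associative sizing with the stated 20-byte entry (8-byte virtual address, 8-byte physical address, 4 bytes of other information; no next pointer, since collisions stay inside the set):

```python
# Set-associative page table sizing.
ENTRY_BYTES = 8 + 8 + 4                # 20-byte entry
BUCKETS = 2 ** 19
WAYS = 4                               # entries per set
one_way = BUCKETS * ENTRY_BYTES        # one entry per set
total = one_way * WAYS                 # four entries per set
print(one_way // 2 ** 20, total // 2 ** 20)   # 10 40  (megabytes)
```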
3.5 Performance Evaluation Methodology

Many previous studies [32, 45, 7] on memory behavior used the Simics full-system simulator. For our experiments, we also ran the Virtutech Simics 3.0 simulator  on an x86 target machine running the Linux operating system. We chose eight workloads from the SPEC CPU 2000 and SPEC CPU 2006 benchmarks: Swim, Applu, Gcc06, Apsi, Wupwise, Milc, Lbm and CactusADM. Table 3.2 shows the footprints for these workloads, measured over a period of 10 billion instructions for each workload.

Table 3.2. Footprints for eight SPEC 2000/2006 workloads

Workload:           Swim  Applu  Gcc06  Apsi  Wupwise  Milc  Lbm  CactusADM
Footprint (in MBs): 178   175    127    125   178      430   404  412

For the workloads Swim, Applu, Gcc06, Apsi and Wupwise, we used a target machine with 128 megabytes of memory. Since the workloads Milc, Lbm and CactusADM had footprints much greater than 128 megabytes, we ran them on a target machine with 256 megabytes of memory. Compared to real machines today, the simulated machines contain a small amount of memory. However, the simulated machines run only one process, the workload, most of the time, while real machines run multiple processes at once. If we used a large memory size, the results would not be able to show the performance differences between the page table schemes. The page size was 4K for all machines.

For each machine, we also included a fully associative data TLB and an instruction TLB, each with 32 entries. We chose small TLBs because we want to better
test the page table performance; large TLBs can decrease the sensitivity of the page table performance measurements. For each memory reference, Simics outputs both a virtual address and its physical address. Using the virtual address, the simulator first performs a lookup in the TLB to see whether it contains the translation. If it does not, the simulator performs a page table lookup. If the page table lookup succeeds, the translation is inserted into the TLB. If the page table lookup fails, a page fault occurs; after the operating system handles the page fault, the virtual address and the corresponding physical address are inserted into both the page table and the TLB. The simulation ended when eight million TLB misses had occurred for each workload.

Section 3.6 compares our schemes to other schemes, including the normal inverted page table (IPT) with the anchor table and the hashed page table. For faster hashing without complicated hardware, one hash function decodes lower-order virtual address bits to select the hash bucket. Since our scheme requires more than one hash function, we use the rotation method, which rotates the lower-order bits with higher-order bits to obtain the second hash value. For our incremental Bloomier hashing scheme, we include results with different index table sizes. The size of the page tables is equal to the number of physical pages in the machines.

3.6 Performance Results

One of the most important measurements of page table performance is the average access time for page table hits, which we show below. For each workload, the average is calculated over a period of eight million page table accesses, of which over 98% are hits. The figures below show the results for two collision resolution methods. Using our
Bloomier hashing scheme, results for three different index table sizes are also shown. We first show the results for the separate chaining hashed page table, and then the results for the coalesced hashed page table.

Figure 3.15. Average page table hit search time comparison using the separate chaining hashed page table

Figure 3.15 compares the average search time for different methods using the separate chaining hashed page table. The time measurement is calculated in terms of the number of page table entries searched before the translation is found. N is the size of the page table. For the workloads that run on a 128MB machine with 4K pages, the size of the page table is 32K entries; for the workloads that run on a 256MB machine with 4K pages, it is 64K entries. Comparisons are made between the hashed page table approach and our incremental Bloomier hashing approach (BloomH) at three index table sizes: half the size of the page table (1/2N), the same size as the page table (1N) and twice the size of the page table (2N). From
Figure 3.15, we can see that our incremental Bloomier hashing scheme does indeed decrease the average lookup time for the page table. When using an index table the same size as the page table, all but one workload has a smaller average access time compared to the normal hashing method. If we double the size of the index table, the improvement is even greater. When using the smaller index table size (1/2N), there are still improvements for five out of the eight workloads.

Figure 3.16. Average page table hit search time comparison using the coalesced hashed page table

Figure 3.16 compares the average search time for different methods using the coalesced hashed page table. With the coalesced page table, collisions are chained within the page table itself. Because of this, a page table using the coalesced method for collision resolution can have long linked lists. Figure 3.16 indeed shows that the average lookup times are worse than for the separate chaining hashed page table in most cases. Nevertheless, compared to the normal hashing scheme, our incremental Bloomier hashing performs well. When using an index table of the
same size as the page table, all workloads have smaller average search times compared to the normal hashing method. If we double the size of the index table, the improvement is even greater; moreover, the average access time is then about the same as with the separate chaining method shown in Figure 3.15. When using the smaller index table size (1/2N), there are still improvements for five out of the eight workloads. With our incremental Bloomier hashing method, the worst-case search time is also reduced, as shown in Figure 3.17 below.

Figure 3.17. Worst-case search time comparison for the coalesced hashed page table

Figure 3.17 shows the worst-case search time for different methods using the coalesced hashed page table. Our incremental Bloomier hashing method decreases the worst-case search time. When using index tables of the same or bigger size than the page tables, all workloads have a smaller worst-case search time with the incremental Bloomier hashing method than with the normal hashing method. For workloads such as Applu and Apsi, the difference is even greater: our hashing method's worst-case search time is half that of the normal hashing method.
During the 40 million TLB misses, the number of instructions executed is also gathered for each workload, and is shown below as the number of TLB misses per kilo-instructions:

Figure 3.18. TLB misses per kilo-instructions for different workloads

For each TLB miss, a page table search needs to be performed. The gathered data show that over 98% of page table searches result in hits. Using this knowledge of the TLB miss rates, we can compare the amount of time spent on page table searches in terms of CPU cycles under the following assumptions. Due to its size, the page table likely resides in off-chip memory, while the index table, being only a fraction of the page table, can be kept in cache. Memory is slow and cache is much faster; therefore, we assume accessing a page table entry in memory takes 200 cycles, while accessing the index table in cache takes only about 5 cycles. For the normal hashing scheme, the average number of cycles spent on a page table search is the average number of page table entries searched per lookup times the number of cycles for accessing a page table entry. For the incremental Bloomier
hashing as well as the IPT with anchor table, there are two levels of data structures for lookup. Each lookup always accesses the intermediate table once; therefore, the average number of cycles spent on a page table search is the sum of the cycles spent on the index table access and the average cycles spent on the page table. Since each page table lookup always accesses the index table once, the number of cycles spent on index table access is 5. The average number of cycles spent searching the page table is again the average number of page table entries searched per lookup times the number of cycles for accessing a page table entry. Using these formulas, we show the results in Figure 3.19 below.

Figure 3.19. Average number of cycles spent per kilo-instructions for different schemes

Figure 3.19 compares the average numbers of cycles spent on page table searches per kilo-instructions for different schemes. The leftmost two bars are results for the separate chaining hashed page table; the rightmost three bars are results for the coalesced hashing page table. We include the results for the inverted page table (IPT) with the anchor table as well. The inverted page table with the anchor table is a
form of coalesced hashing. The inverted page table with the anchor table puts collisions in the anchor table and handles overflow by chaining entries together within the page table. Because of this, it can achieve the same average lookup time as the normal hashed page table with separate chaining collision resolution, without using overflow space. However, due to the need to access the anchor table, the inverted page table with the anchor table has extra time overhead. For our incremental Bloomier hashing approach (BH), we use an index table the same size as the page table; a sensitivity study on the index table size is shown later. From Figure 3.19, when using the separate chaining hashed page table and comparing to the normal hashing method, we can see that our hashing method gives positive improvements for all workloads but two: Milc and Wupwise show little to no improvement. On the other hand, with the coalesced hashing collision resolution method, the average number of cycles spent on page table searches is higher than with separate chaining for most schemes. This is expected, since the coalesced page table has no extra overflow space for handling collisions: all entries in the coalesced page table are used, while the separate chaining page table can have unused buckets. Using the coalesced hashing page table and comparing to the normal hashing method, our hashing method shows improvements for most of the workloads. The inverted page table with an anchor table size of 1N performs the same as the normal hashing scheme except for the workloads Apsi and CactusADM. Another interesting observation is that our incremental Bloomier hashing method achieves around the same performance regardless of
which collision resolution method is used. Next, we show the sensitivity study on the index table sizes for our schemes.

Figure 3.20. Sensitivity to the incremental Bloomier hashing index table size

The results of a sensitivity study on the incremental Bloomier hashing index table size are given in Figure 3.20. We simulate four different index table sizes: half the size of the page table (1/2N IT), the same size as the page table (1N IT), twice the size of the page table (2N IT) and four times the size of the page table (4N IT). We also show the results for both the separate chaining and coalesced hashing collision resolution methods. From Figure 3.20, we can see that a bigger incremental Bloomier hashing index table indeed helps decrease the average cycles spent on page table searches. Regardless of which collision resolution method is used, our incremental Bloomier hashing methods perform best with an index table size of 4N. This is most evident in the results for the workloads Apsi and CactusADM. On the other hand, the difference between using index
Figure 3.21. Sensitivity to page table entry access latency (cycles spent on page table searches per million instructions for SWIM, GCC and APSI, at 100 to 400 cycles per page table entry access)
table sizes of 1/2N and 1N is quite small for most of the workloads. Next, we show the sensitivity study on page table entry access latency.

The results of a sensitivity study on the page table latency are given in Figure 3.21. In Figure 3.19 and Figure 3.20, we used a 200-cycle latency for accessing a page table entry. In Figure 3.21, we show the results for page table entry access latencies of 100, 200, 300 and 400 cycles, using the workloads Gcc, Swim and Apsi. The results are shown for the different hashing schemes and the two collision resolution methods. From Figure 3.21, we can see that the coalesced hashing resolution method generally performs worse than the separate chaining resolution method, just as we have seen before. Our incremental Bloomier hashing scheme shows improvements for latency values of 100 and 200 cycles, and for latency values of 300 and 400 cycles it performs even better.

3.7 Related Work

To perform the Bloomier filter setup, the setup algorithm tries to find a singleton for each key. Setup can fail if no singleton can be found for any one of the keys; if the setup fails, all the keys need to be rehashed . Chisel  proposes another idea: if the setup fails, a few problematic keys can be removed and put into a spillover TCAM, and the setup process then resumes. However, with their method, a lookup needs to be performed in parallel between the filter and the spillover TCAM. Another approach is the linear equations method , which decides what value to encode for each location in the filter by solving sets of linear equations. This method requires less space for the filter, but a longer setup time.
Two common page table organizations, forward-mapped and inverted page tables, are described in detail in Section 3.2. The forward-mapped page table is very flexible, since entries can be duplicated per process . However, the forward-mapped table suffers from multiple memory accesses per lookup, and most of the tables are quite sparse. On the other hand, the inverted page table is a compact data structure. Instead of storing all valid virtual address mappings, the IPT stores only the virtual pages that currently map to physical pages. Physical addresses can be computed directly from the indexes in the IPT. With hashing, searching the IPT is fairly fast and requires few memory accesses. However, address aliasing becomes a problem, since only one physical-to-virtual address mapping can exist in the table at a time.

The hashed page table  is another page table organization with fast lookup times. It is based on the inverted page table, but does not use the hash anchor table: the virtual address is directly hashed to a bucket in the hashed page table. Because of this, each entry in the table needs to contain both the physical address and the virtual address, since the physical address can no longer be computed from the index in the hashed page table. Moreover, hashed page tables also require a collision resolution table to handle collisions in the page table. Hashed page tables support address aliasing. The authors show that the hashed page table can reduce the memory references during lookup at the expense of requiring more space for table entries.

The guarded page table  combines the advantages of the multiple-level page table and the hashed page table. However, the guarded page table's performance gains are achieved through specific assembler-level optimization. Likewise, in [56], the authors try to
combine both the IPT and the forward-mapped page table into a single page table in order to obtain the benefits of both approaches. However, the hybrid page table also introduces extra hardware cost and extra complications during lookup.

The clustered page table [63] is based on the design of the hashed page table. Each entry in the table stores the translation information for a block of several consecutive pages. The number of pages in an entry can vary depending on how sparse the address space is; this number is called the subblocking factor. The authors show that clustered page tables support more efficient page table operations. They also show that the average linked-list length in the clustered page table is shorter than the average linked-list length in the hashed page table; shorter linked lists mean the access time can improve. However, for programs that do not show spatial locality, clustered page tables are not as effective.
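The clustered organization above can be illustrated with a short sketch: one table entry per block of SUBBLOCK consecutive pages, with colliding blocks chained in a bucket. The layout and the subblocking factor of 4 are assumptions for the example, not the cited design's actual parameters.

```python
SUBBLOCK = 4   # subblocking factor: consecutive pages per entry (assumed)

class ClusteredPageTable:
    """Hash table keyed by page-block number; each entry holds the
    frame numbers for SUBBLOCK consecutive virtual pages, so pages
    with spatial locality share one entry and one (shorter) chain."""

    def __init__(self, nbuckets=256):
        self.buckets = [[] for _ in range(nbuckets)]

    def _chain(self, block):
        return self.buckets[hash(block) % len(self.buckets)]

    def insert(self, vpn, pfn):
        block, off = divmod(vpn, SUBBLOCK)
        for b, frames in self._chain(block):
            if b == block:                # block already present: fill slot
                frames[off] = pfn
                return
        frames = [None] * SUBBLOCK
        frames[off] = pfn
        self._chain(block).append((block, frames))

    def lookup(self, vpn):
        block, off = divmod(vpn, SUBBLOCK)
        for b, frames in self._chain(block):
            if b == block:
                return frames[off]
        return None                       # translation not present
```

For a workload that touches consecutive pages, the two translations below share a single chained entry, which is exactly why the average chain is shorter than in a plain hashed page table.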
CHAPTER 4
USING BLOOMIER HASHING TECHNIQUES FOR OTHER POSSIBLE APPLICATIONS

The hash table is a basic information storage and retrieval method. Our Bloomier hashing techniques can generally replace the normal hash table in any digital application or system. Therefore, Bloomier hashing is very general and can be applied in other applications involving information storage and retrieval. Applications such as intrusion detection based on virus signature searches, keyword matching in the Google search engine, maintaining connection records or per-flow states in network processing, and packet classification in networks are possible candidates that can benefit from the Bloomier hashing ideas. We introduce each of these possible applications below.

Intrusion detection based on virus signature searches: An intrusion detection system, such as Snort [59], performs Deep Packet Inspection on each packet. The content of each packet is compared against a signature database. A string matching approach can be applied for the matching process, and hash-based approaches can also be used for fast lookup. Therefore, we can apply our hashing methods to Deep Packet Inspection. For example, in [15], the algorithm hashes an input pattern to generate an index value, and then uses the index value to fetch the virus patterns that are stored in memory. Since several different virus patterns may share the same hash value, the input pattern may need to be compared against multiple virus patterns. We can use our Bloomier hashing ideas to better balance the virus patterns among the buckets and therefore increase the overall performance.
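A minimal sketch of such a hash-indexed signature search follows. The 4-byte index prefix and the table size are assumptions for illustration (real systems use stronger hardware hash functions), and all signatures are assumed to be at least 4 bytes long.

```python
def build_signature_table(signatures, nbuckets=256):
    """Index virus signatures by a hash of their first four bytes.
    Signatures that collide share a bucket, so a candidate window may
    have to be compared against several stored patterns."""
    table = [[] for _ in range(nbuckets)]
    for sig in signatures:
        table[hash(sig[:4]) % nbuckets].append(sig)
    return table

def inspect(table, payload):
    """Slide a window over the payload; each window's hash fetches one
    bucket, and the window is compared against every pattern there."""
    nbuckets = len(table)
    hits = []
    for i in range(len(payload) - 3):
        for sig in table[hash(payload[i:i + 4]) % nbuckets]:
            if payload.startswith(sig, i):
                hits.append(sig)
    return hits
```

Because colliding patterns share a bucket, a single window can trigger several comparisons; balancing the buckets, as Bloomier hashing does, bounds that per-window work.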
Keyword matching in the Google search engine: Bigtable [12] is a high-performance database system built on the Google file system [12, 24]. Many Google services, such as Google Maps, the Google search engine and Orkut, use Bigtable. Bigtable has multiple dimensions: the table's row and column names are arbitrary strings, and each table cell also contains a time stamp. Therefore, cells may share the same content but have different time stamps, and Bigtable can use the time stamps for versioning purposes. The Google search engine uses a large number of machines to serve huge numbers of users at the same time. Because of this, Bigtable is split by rows into tablets. Each Google machine then stores a large number of tablets and thus can independently serve users. However, a disproportionate load balance may cause some machines to become less busy than others. Therefore, our Bloomier hashing can be used as a load-balancing method to more equally distribute workloads among the machines.

Maintaining connection records in network processing: Many network systems need to maintain connection records, and they use hash tables to do so. For example, intrusion detection systems such as Snort [59] and Bro [51] maintain records in hash tables for all TCP connections. Each connection record is created by hashing the 5-tuple in the TCP header. Each record contains information on the connection state and is updated every time a new packet arrives on that connection. Network monitoring systems, such as NetFlow [16] and Adaptive NetFlow [22], maintain connection records in off-chip DRAM. Other hardware implementations [20, 55] also store the records in DRAM. Therefore, by using our Bloomier hashing
techniques, we can balance the number of records stored per bucket and thus increase system performance.

Packet classification in networks: Packet classification is one of the most important functions in a router. When given an IP packet, the router must classify the packet based on a number of fields in the packet's header. Many packet-classification methods first examine a single field in the header and then narrow down the search to a smaller subset of classification rules [40, 4, 43, 23]. Since a hash table can perform a field lookup, we can apply our Bloomier hashing idea and thus increase performance. For example, [62] introduces packet classification using Tuple Space Search. A tuple is a combination of header field lengths. The search algorithm groups the classification rules into sets of tuples. The tuples are then ordered by their lengths, and a hash table stores each group in memory. When a new packet arrives, the algorithm hashes the header fields and performs matches on all the hash tables. Therefore, we can apply the Bloomier hashing idea to improve the hash tables' performance.

Exact flow matching is also a form of packet classification. The exact flow matching lookup operation performs exact matches on five header fields in a packet's header. The authors of one proposal suggest using hash tables for lookup: a small, on-chip hash table contains valid bits, while a big hash table stores all filters in off-chip memory. In their lookup algorithm, when a new packet arrives, the lower-order bits, which consist of the source and destination addresses in the packet's header, produce a hash index. Using the index, the on-chip table is checked to see if the valid bit is set. If
the valid bit is set, the system again uses the index to check the off-chip hash table. The hash tables use separate chaining for collision resolution. Therefore, balancing the hash buckets by using Bloomier hashing can improve the overall system performance.
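The two-level lookup described above can be sketched as follows. The table size and the use of Python's built-in hash in place of the hardware hash of the low-order address bits are assumptions for the example.

```python
ON_CHIP_BITS = 1 << 12   # size of the on-chip valid-bit array (assumed)

class FlowLookup:
    """Two-level exact flow match: an on-chip bit vector filters out
    definite misses before the off-chip hash table, which uses
    separate chaining, is ever read."""

    def __init__(self):
        self.valid = [False] * ON_CHIP_BITS
        self.offchip = [[] for _ in range(ON_CHIP_BITS)]

    def insert(self, flow, state):
        # flow is the 5-tuple (src, dst, sport, dport, proto)
        i = hash(flow) % ON_CHIP_BITS
        self.valid[i] = True
        self.offchip[i].append((flow, state))

    def lookup(self, flow):
        i = hash(flow) % ON_CHIP_BITS
        if not self.valid[i]:
            return None                    # on-chip miss: no off-chip access
        for f, state in self.offchip[i]:   # off-chip chained search
            if f == flow:
                return state
        return None
```

The off-chip chain walk is the expensive part, one memory access per link, so keeping the chains short and even across buckets is where a balancing scheme such as Bloomier hashing pays off.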
CHAPTER 5
DISSERTATION SUMMARY

Information retrieval is an important task for many systems in today's digital information world. Massive amounts of information are stored first and retrieved later, and proper ways to store and retrieve information are critical for system performance. Hashing is a classic technique with fast lookup times; however, typical hashing still suffers from unpredictable worst-case lookup times. We introduced the concepts of Bloomier hashing and incremental Bloomier hashing based on the previous work on the Bloomier filter [14]. By using a small intermediate table, we greatly reduce the collisions in a hash table, and hence greatly reduce the average and worst-case time for retrieving an item from the table. We showed two examples of information retrieval problems and proposed our hashing techniques to solve them.

The first example is the IP lookup problem in routers. IP routing tables are growing bigger every year, so a fast way to perform lookups in a routing table is needed. We presented a comprehensive hash-based LPM solution for future backbone Internet routers. A small portion (<10%) of the prefixes is implemented in TCAM to help space and bandwidth efficiency during LPM lookups of the majority of the prefixes from regular SRAMs. The SRAMs holding the prefixes are organized as two-dimensional arrays to allow a constant fetch rate. The multiple lengths of prefixes that are required for LPM lookup are allocated in the same memory block using partial prefix expansion and bucket coalescing to enable one memory access per lookup. Bloomier hashing, through an intermediate indexing mechanism, balances the hashed buckets and achieves the best space and bandwidth tradeoffs for the LPM lookup function.
The second example is page table lookup in virtual memory. The page table is an important part of the virtual memory system. Programs address memory using virtual addresses, which need to be translated into physical memory addresses before the data can be fetched. The address translation needs to be done as fast as possible while the processor is waiting for the data. The inverted page table is a compact data structure; combined with a hash anchor table, the inverted page table also has a fast lookup time, but it still suffers the shortcomings of the traditional hashing technique. We applied our incremental Bloomier hashing techniques and the Bloomier filter technique to the page tables. Results show that we can reduce the number of page table accesses, and space reduction is also possible.

Bloomier hashing is very general and can be applied in other applications involving information storage and retrieval. Applications such as intrusion detection based on virus signature searches, maintaining connection records or per-flow states in network processing, and keyword matching in the Google search engine are possible candidates that can benefit from the Bloomier hashing idea.
LIST OF REFERENCES

[1] M. J. Akhbarizadeh and M. Nourani, "Efficient prefix cache for network processors," in 12th Annual IEEE Symposium on High Performance Interconnects, Aug. 2004.
[2] M. J. Akhbarizadeh, M. Nourani, D. S. Vijayasarathi and P. T. Balsara, "PCAM: A ternary CAM optimized for longest prefix matching tasks," in ICCD '04, 2004.
[3] Y. Azar, A. Broder and E. Upfal, "Balanced allocations," in Proceedings of the 26th ACM Symposium on the Theory of Computing, 1994.
[4] F. Baboescu and G. Varghese, "Scalable packet classification," in ACM SIGCOMM, San Diego, CA, Aug. 2001.
[5] T. W. Barr, A. L. Cox and S. Rixner, "Translation caching: Skip, don't walk (the page table)," in Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010.
[6] BGP Routing Table Analysis Report, http://bgp.potaroo.net/, 2009.
[7] A. Bhattacharjee and M. Martonosi, "Characterizing the TLB behavior of emerging parallel workloads on chip multiprocessors," in 18th International Conference on Parallel Architectures and Compilation Techniques, 2009.
[8] F. C. Botelho, Y. Kohayakawa and N. Ziviani, "A practical minimal perfect hashing method," in Experimental and Efficient Algorithms, 4th International Workshop, 2005.
[9] A. Broder and A. Karlin, "Multilevel adaptive hashing," in Proceedings of the 1st ACM-SIAM Symposium on Discrete Algorithms, 1990.
[10] A. Z. Broder and M. Mitzenmacher, "Using multiple hash functions to improve IP lookups," in INFOCOM '01, 2001.
[11] A. Chang and M. F. Mergen, "Storage: Architecture and programming," ACM Transactions on Computer Systems, vol. 6, no. 1, pp. 28-50, Feb. 1988.
[12] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes and R. E. Gruber, "Bigtable: A distributed storage system for structured data," in OSDI, 2006.
[13] D. Charles and K. Chellapilla, "Bloomier filters: A second look," in ESA 2008, LNCS 5193, 2008.
[14] B. Chazelle, J. Kilian, R. Rubinfeld and A. Tal, "The Bloomier filter: An efficient data structure for static support lookup tables," in Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2004.
[15] Y. H. Cho and W. H. Mangione-Smith, "A pattern matching coprocessor for network security," in DAC, 2005.
[16] Cisco NetFlow, http://www.cisco.com/warp/public/732/Tech/netflow.
[17] D. Clark and J. Emer, "Performance of the VAX-11/780 translation buffer: Simulation and measurement," ACM Transactions on Computer Systems, vol. 3, no. 1, pp. 31-62, Feb. 1985.
[18] S. Demetriades, M. Hanna, S. Cho and R. Melhem, "An efficient hardware-based multi-hash scheme for high speed IP lookup," in Proceedings of the Annual IEEE Symposium on High-Performance Interconnects (HOTI), Aug. 2008.
[19] S. Dharmapurikar, P. Krishnamurthy and D. Taylor, "Longest prefix matching using Bloom filters," in ACM SIGCOMM '03, 2003.
[20] S. Dharmapurikar and V. Paxson, "Robust TCP stream reassembly in the presence of adversaries," in USENIX Security Symposium, Aug. 2005.
[21] W. Eatherton, G. Varghese and Z. Dittia, "Tree bitmap: Hardware/software IP lookups with incremental updates," ACM SIGCOMM Computer Communication Review, 2004.
[22] C. Estan, K. Keys, D. Moore and G. Varghese, "Building a better NetFlow," in ACM SIGCOMM, 2004.
[23] A. Feldmann and S. Muthukrishnan, "Tradeoffs for packet classification," in Proceedings of IEEE INFOCOM, 2000.
[24] S. Ghemawat, H. Gobioff and S. T. Leung, "The Google file system," in Proceedings of the 19th ACM SOSP, Dec. 2003.
[25] M. Hanna, S. Demetriades, S. Cho and R. Melhem, "CHAP: Enabling efficient hardware-based multiple hash schemes for IP lookup," IFIP International Federation for Information Processing, 2009.
[26] J. Hasan, S. Cadambi, V. Jakkula and S. T. Chakradhar, "Chisel: A storage-efficient, collision-free hash-based network processing architecture," in ISCA '06, 2006.
[27] HDL Design House, HCR MD5: MD5 crypto core family, Dec. 2002.
[28] J. Huck and J. Hays, "Architectural support for translation table management in large address space machines," in ISCA '93, 1993.
[29] IBM, IBM System/38 Technical Developments, Order no. G580-0237, IBM, Atlanta, GA, 1978.
[30] B. Jacob and T. Mudge, "A look at several memory management units, TLB-refill mechanisms, and page table organizations," in ASPLOS '98, 1998.
[31] W. Jiang, Q. Wang and V. Prasanna, "Beyond TCAMs: An SRAM-based parallel multi-pipeline architecture for terabit IP lookup," in IEEE INFOCOM '08, 2008.
[32] G. B. Kandiraju and A. Sivasubramaniam, "Characterizing the d-TLB behavior of SPEC CPU2000 benchmarks," ACM SIGMETRICS Performance Evaluation Review, vol. 30, no. 1, 2002.
[33] S. Kaxiras and G. Keramidas, "IPStash: A power efficient memory architecture for IP lookup," in Proceedings of MICRO-36, Nov. 2003.
[34] S. Kaxiras and G. Keramidas, "IPStash: A set-associative memory approach for efficient IP lookup," in IEEE INFOCOM, 2005.
[35] K. S. Kim and S. Sahni, "Efficient construction of pipelined multibit-trie router tables," IEEE Transactions on Computers, vol. 56, no. 1, pp. 32-43, 2007.
[36] D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, 1973.
[37] L. Tan and T. Sherwood, "A high throughput string matching architecture for intrusion detection and prevention," in ISCA, 2005.
[38] G. Taylor, P. Davies and M. Farmwald, "The TLB slice: A low-cost high-speed address translation mechanism," in Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990.
[39] S. Kumar, J. Turner and P. Crowley, "Peacock hashing: Fast and updatable hashing for high performance packet processing algorithms," in IEEE INFOCOM '08, 2008.
[40] T. V. Lakshman and D. Stiliadis, "High-speed policy-based packet forwarding using efficient multi-dimensional range matching," in ACM SIGCOMM, Sep. 1998.
[41] R. B. Lee, "Precision architecture," Computer, Jan. 1989.
[42] J. Liedtke and K. Elphinstone, "Guarded page tables on MIPS R4600 or an exercise in architecture-dependent micro optimization," SIGOPS Operating Systems Review, 1996.
[43] J. Lunteren and T. Engbersen, "Fast and scalable packet classification," IEEE Journal on Selected Areas in Communications, vol. 21, May 2003.
[44] P. S. Magnusson et al., "Simics: A full system simulation platform," IEEE Computer, Feb. 2002.
[45] P. Magnusson and B. Werner, "Efficient memory simulation in SimICS," in Proceedings of the 28th Annual Simulation Symposium, 1995.
[46] H. Miyatake, M. Tanaka and Y. Mori, "A design for high-speed low-power CMOS fully parallel content-addressable memory macros," IEEE Journal of Solid-State Circuits, vol. 36, no. 6, pp. 956-968, 2001.
[47] D. Nagle, R. Uhlig, T. Stanley, S. Sechrest, T. Mudge and R. Brown, "Design tradeoffs for software-managed TLBs," ACM Transactions on Computer Systems, vol. 12, no. 3, Aug. 1994.
[48] X. Nie, D. Wilson, J. Cornet, G. Damm and Y. Zhao, "IP address lookup using a dynamic hash function," in CCECE/CCGEI, 2005.
[49] H. Noda et al., "A cost-efficient high-performance dynamic TCAM with pipelined hierarchical searching and shift redundancy architecture," IEEE Journal of Solid-State Circuits, vol. 40, no. 1, Jan. 2005.
[50] R. Pagh, "Hash and displace: Efficient evaluation of minimal perfect hash functions," BRICS Report Series, 1999.
[51] V. Paxson, "Bro: A system for detecting network intruders in real time," Computer Networks, 1999.
[52] M. Pearson, "QDR-III: Next generation SRAM for networking," http://www.qdrconsortium.org/presentation/QDR-III-SRAM.pdf, 2009.
[53] M. V. Ramakrishna, E. Fu and E. Bahcekapili, "Efficient hardware hashing functions for high performance computers," IEEE Transactions on Computers, vol. 46, no. 12, Dec. 1997.
[54] Routing Information Service, http://www.ripe.net/ris/, 2009.
[55] D. V. Schuehler, J. Moscola and J. W. Lockwood, "Architecture for a hardware-based TCP/IP content scanning system," in IEEE Symposium on High Performance Interconnects (HotI), Aug. 2003.
[56] I. J. Shyu and S. P. Shieh, "Virtual address translation for wide-address architectures," ACM SIGOPS Operating Systems Review, 1995.
[57] H. Song, S. Dharmapurikar, J. Turner and J. W. Lockwood, "Fast hash table lookup using extended Bloom filter: An aid to network processing," in SIGCOMM '05, 2005.
[58] H. Song, F. Hao, M. Kodialam and T. V. Lakshman, "IPv6 lookups using distributed and load balanced Bloom filters for 100Gbps core router line cards," in INFOCOM '09, 2009.
[59] Snort - The Open Source Network Intrusion Detection System, http://www.snort.org.
[60] SPARC International Inc., The SPARC Architecture Manual, Version 8, 1991.
[61] V. Srinivasan and G. Varghese, "Fast address lookups using controlled prefix expansion," ACM Transactions on Computer Systems, vol. 17, no. 1, pp. 1-40, 1999.
[62] V. Srinivasan, S. Suri and G. Varghese, "Packet classification using tuple space search," in SIGCOMM, 1999.
[63] M. Talluri, M. D. Hill and Y. A. Khalidi, "A new page table for 64-bit address spaces," in SIGOPS '95, 1995.
[64] D. Taylor, A. Chandra, Y. Chen, S. Dharmapurikar, J. Lockwood, W. Tang and J. Turner, "System-on-chip packet processor for an experimental network services platform," in Proceedings of IEEE GLOBECOM, 2003.
[65] M. Waldvogel, G. Varghese, J. Turner and B. Plattner, "Scalable high speed IP routing lookups," in ACM SIGCOMM '97, 1997.
[66] X. Zhou and P. Petrov, "The interval page table: Virtual memory support in real-time and memory-constrained embedded systems," in Proceedings of the 20th Annual Conference on Integrated Circuits and Systems Design, 2007.
[67] X. Zhou and P. Petrov, "Direct address translation for virtual memory in energy-efficient embedded systems," ACM Transactions on Embedded Computing Systems (TECS), 2008.
[68] X. Zhou and P. Petrov, "Arithmetic-based address translation for energy-efficient virtual memory support in low-power, real-time embedded systems," in Proceedings of the 18th Annual Symposium on Integrated Circuits and System Design, 2005.
BIOGRAPHICAL SKETCH

David Yi Lin was born in Guangdong province, China. He moved to the United States of America with his parents in 1993. He received his B.S. and M.S. degrees in computer engineering from the University of Florida in 2002 and 2004, respectively. In 2005, he entered the Ph.D. program in computer engineering at the University of Florida, and in 2010 he received his Ph.D. degree. His research interests include network router design and computer architecture.