
Design and Analysis of a Dynamically Reconfigurable Network Processor


DESIGN AND ANALYSIS OF A DYNAMICALLY RECONFIGURABLE NETWORK PROCESSOR By IAN A. TROXEL A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING UNIVERSITY OF FLORIDA 2003

ACKNOWLEDGMENTS

I wish to thank the Department of Defense and MLDesign Technologies Incorporated for their financial support, all the good professors for their inspiration and guidance, all the bad ones for their frustration and misdirection, the members of the High-performance Computing and Simulation lab for their technical support, Dr. Alan George for his guidance, the UF Turkish Folklor group for their friendship, my parents for their nurture, my ancestors for their nature, the fates for their scheming, and my wife for her patience and love.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
2 RELATED RESEARCH
3 SIMULATION MODEL DESCRIPTION
   3.1 Model Overview
   3.2 Simulation Environment
   3.3 Architecture Terminology
   3.4 Parameters and Verification
4 SIMULATION EXPERIMENTS
   4.1 Packet Processing Description
   4.2 Traffic Description
   4.3 Dynamic ME Allocation Scheme
   4.4 Baseline Experiments
   4.5 Head-To-Head Experiments
5 CASE STUDY
   5.1 Case Study Description
   5.2 Baseline Experiments
   5.3 Head-To-Head Experiments
6 CONCLUSIONS

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1 Divide operation performance vs. area
2 Optimal baseline configurations
3 Dynamic priority latency thresholds
4 Case study optimal baseline configurations

LIST OF FIGURES

1 Motorola/C-Port DCP NP
2 IBM NP2G NP
3 Intel IXP1200 NP
4 Lucent/Agere PayloadPlus NP
5 Chameleon Systems RCP
6 Functional diagram of RC-enhanced NP
7 Typical system configuration
8 2-Hot baseline results
9 1-Hot results
10 2-Hot results
11 3-Hot results
12 Theatre missile defense system
13 Theatre missile defense kill chain
14 Case study results

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Engineering

DESIGN AND ANALYSIS OF A DYNAMICALLY RECONFIGURABLE NETWORK PROCESSOR

By Ian A. Troxel

May 2003

Chair: Dr. Alan D. George
Department: Electrical and Computer Engineering

The fusion of reconfigurable computing (RC) techniques with network processor (NP) designs has opened new doors for packet-processing platforms. While previous designs have been static or configurable at best, the routing switches, edge switches and network interface cards (NICs) of the future will be able to adapt to network traffic in real time through dynamic reconfiguration. This paper presents the simulation results of a novel RC-enhanced NP based on the Intel IXP1200 design philosophy. The enhanced NP's performance is compared to that of the baseline NP under three normalized traffic patterns and a case-study traffic pattern based on a military application. The results demonstrate that the enhanced NP significantly outperforms the baseline NP in terms of latency, throughput and resource utilization for traffic that is non-uniform.

CHAPTER 1
INTRODUCTION

Advances in chip technology and wire speeds have driven the need for faster packet-processing devices. Today's network devices are required to perform more complex operations on an increasing number of packets, in a more flexible manner, at a lower cost and in a shorter amount of time than ever before. To further complicate the issue, users' appetite for additional bandwidth appears insatiable; history has shown that as network technology catches up to demand, new ways to use the additional capacity tend to push the envelope even further.

Network processors (NPs) have their origins in the era of on-line niche markets, dot-com catastrophes and high-speed access to the desktop, as a revolution gained momentum among the top providers of the Internet's infrastructure. In the early to mid 1990s, a strong divide existed between two main factions of protocol-based packet processor developers along the dimension of flexibility versus speed. On the one hand, developers who saw a need to accommodate a large diversity of protocols chose to use a General-Purpose Processor (GPP) to handle network traffic, sacrificing speed for flexibility. Those who fell into this group believed the future shape of the Internet to be a sea of protocols with intelligent translation between intranets. This group's focus, therefore, was on serving the need for flexibility of the periphery consumer, and it largely overlooked the needs of the backbone suppliers.

On the other hand, developers who wished to capitalize and improve upon the increased network speeds obtained during this era chose to produce Application-Specific Integrated Circuits (ASICs) for network traffic processing, sacrificing flexibility for speed. Those who fell into this camp saw the Internet as a close-knit community moving toward a single unifying standard. This group's focus, therefore, was on serving the need for speed of the backbone suppliers, and it largely overlooked the needs of the periphery consumer.

As these two camps strayed further apart, the inherent problems each faced by maintaining a hard-line viewpoint of an emerging Internet that did not materialize spelled drastic consequences for both. The GPP group saw a consolidation of protocols render their flexibility niche a moot point, while at the same time link speeds began to exceed processor clocks ten-fold, rendering wire speeds unattainable. At the same time, the ASIC faction saw the same consolidation of protocols stop far short of the one or two protocols they believed would define the Internet. Consequently, the fixed nature of ASICs has rendered these designs too inflexible for some applications. In addition, the tremendous production cost involved in creating ASIC designs has produced a high barrier to entry in the market, ensuring their use by only large-scale corporations. ASICs have been able to maintain a strong market presence by keeping in step with link-speed increases in recent years. While this group has not suffered the hardships that befell the GPP group, the recent decline in sales of all traditional networking equipment may have signaled a shift in networking trends.

Vendors are now realizing that optimizing for speed or flexibility alone will not meet tomorrow's market demands for routing switches, edge switches, network interface cards (NICs) and nodes that offer the best of both options. To meet these and other challenges, both groups have begun to work toward the common goal of a next-generation NP.

Meanwhile, due to technological advances over the past decade, Reconfigurable Computing (RC) has garnered a great deal of attention from the academic community as well as industry [1]. RC systems have been shown to provide a computational speedup, in some application-specific domains, as large as 100-fold compared to GPPs with comparable resources [2]. In still other domains, RC device implementations have mimicked GPP performance at two-thirds the cost [3]. The most remarkable speedups have been seen in traditionally ASIC-laden markets such as digital signal processing (DSP) [4] and cryptography [5]. While this emerging technology has grown at an accelerated pace, there are still numerous obstacles to overcome. Producing a set of language standards, defining network protocols, standardizing benchmarks and even solidifying a name for the technology are all still to be accomplished [6]. However, RC designs have produced significant performance speedups in point-solution markets that have the same processing trends as the application domain of network processing. Therefore, RC-enhanced NP designs are poised to make an impact on future packet-processing systems in so-called active networks (AN). General topics within AN include resource management [7], adaptive flow control, adaptive error recovery, adaptive mesh interconnections, adaptive routing [8-9], adaptive node topologies and reconfigurable network links [10].

The organization of the remainder of this paper is as follows. Chapter 2 presents related research as it pertains to current-generation NPs and how RC has entered NP designs.

Chapter 3 outlines the proposed RC-enhanced NP model and the simulation environment in which it was created. Chapter 4 discusses the experiments by which the new model is compared to the baseline system. Chapter 5 presents a case-study analysis of the new model, and Chapter 6 discusses conclusions and directions for future work.

CHAPTER 2
RELATED RESEARCH

A host of large-scale companies as well as numerous start-ups are shaping the face of new-generation NPs [11]. Each of the major players has been lured to new NPs by their faster time-to-market as compared to traditional ASIC designs, as well as the flexibility they provide akin to past GPP designs [12]. Also, market forecasters predict a coming surge in revenues for the NP market niche; in fact, combined revenues are expected to climb to $2.9 billion by 2004 [13]. Such potential market growth has garnered the respect of many of the industry's biggest players.

The NP market is rich with a variety of designs. While fragmentation has been the lifeblood of some of the smaller players in the NP design market, the number of designs realized or in production makes it difficult to adequately judge the benefits of each. This fragmentation of the market has left much to be desired in terms of accurate head-to-head performance analyses of the notable models. However, there exists a need to conduct such tests so that future designs will not make the same mistakes as those that preceded them. In addition, while it can be surmised that new NPs will outperform their ASIC and GPP equivalents in terms of flexibility coupled with speed [14-15], whether they are better (and if so, by how much) cannot be determined without structured comparisons. Of the proposed designs, there are at least four in particular that warrant more attention.

The Motorola/C-Port C-5 Digital Communication Processor (DCP), shown in Figure 1, represents one extreme of the NP market as a highly distributed architecture. The NP consists of 16 channel processors (CPs) and five co-processors, all connected through a 60 Gbps bus. The channel processors, each of which consists of a 32-bit RISC core and two serial data processors (SDPs), are the heart of the unit. The SDPs are microcode-programmable to implement link-layer interfaces including Ethernet, SONET and serial data streams. Since each RISC core can execute a different program, and the channel processors share a common bus, there is a great deal of flexibility in distributing processing across the chip. There can be a parallel processing arrangement, where identical programs are executed on several CPs, or a pipelined arrangement, where each processor is dedicated to a particular task and passes its output to the input of the next processor [16]. The C-5 DCP offers a wide range of processing options from network edge to core.

Figure 1. Motorola/C-Port DCP NP (Courtesy: Motorola Inc./C-Port Corp.)

The IBM NP2G is one of two designs that represent the middle of the road in terms of distributed versus tightly coupled processing. Figure 2 shows the IBM NP2G architecture [17]. The device provides fast switching by integrating switching engine, search engine and security functions on one device. It provides Ethernet, Packet over SONET (POS) and Point-to-Point Protocol (PPP) switching, and supports three priority levels [18].

Figure 2. IBM NP2G NP (Courtesy: IBM Inc.)

The NP2G's packet processor consists of control and processing components. The control component supports Layer 2 and 3 routing protocols, Layer 4 and 5 network applications, and management functions. The control component for the device can be an external processor connected through an Ethernet link or the PCI interface. The NP2G's embedded PowerPC processor can also perform control component functions.

The processing component provides packet forwarding, filtering, and classification based on the tables generated by the routing protocols. The NP2G's processing component is made up of six Dyadic Protocol Processor Units (DPPUs), which together can execute twelve independent threads of code at a time.

The Intel IXP1200 is the second of the two designs that represent the middle of the road in terms of distributed versus tightly coupled processing. Figure 3 shows the Intel IXP1200 architecture [19]. This hybrid data processor delivers high-performance parallel processing power and flexibility to a wide variety of networking, communications and other data-intensive applications. The IXP1200 is designed specifically as a data control element for applications that require access to a fast memory subsystem, a fast interface to I/O devices, and processing power to perform efficient manipulation of various data sizes [20].

Figure 3. Intel IXP1200 NP (Courtesy: Intel Corp.)

The IXP1200 combines a StrongARM microprocessor with six independent 32-bit RISC data engines, or microengines (MEs), with hardware multithreading support that, when combined, provide over one billion operations per second. The six MEs are reportedly capable of forwarding 3 million Ethernet packets per second at Layer 3. The StrongARM processor is used for more complex tasks such as address learning, building and maintaining forwarding tables, and network management.

The Lucent/Agere PayloadPlus processor family represents the other market extreme as a tightly coupled NP solution. The PayloadPlus architecture is shown in Figure 4. This architecture includes the Fast Pattern Processor (FPP), the Routing Switch Processor (RSP) and the Agere System Interface (ASI). The PayloadPlus processor is designed to handle wire-speed data streams at up to OC-48c rates. Each specialized chip provides a complementary function so the three work in concert: the FPP for high-speed classification, the RSP for processing and routing traffic, and the ASI to provide policing, manage state information and provide a PCI connection to a host processor [21].

Figure 4. Lucent/Agere PayloadPlus NP (Courtesy: Lucent Technologies/Agere Systems)

Within the realm of RC-enhanced NPs, the Reconfigurable Communications Processor (RCP) from Chameleon Systems is the first industry NP that uses dynamic reconfiguration as part of normal system operation. The architecture of the RCP is shown in Figure 5. The RCP consists of a 32-bit ARC processor, memory units and a 32-bit reconfigurable processing fabric that contains 108 parallel computation units [22]. The RCP mitigates the configuration latency problem that plagues RC systems by multiplexing contexts in the processing fabric; after initialization, context switching can be performed in a single clock cycle [23].

Figure 5. Chameleon Systems RCP (Courtesy: Chameleon Systems Inc.)

For our research, a novel design and simulation model has been developed for a dynamically reconfigurable NP by adapting the architecture of the Intel IXP1200. Its purpose is to support the study of design options and tradeoffs in dynamic NP architectures versus static ones, and to help develop an understanding of how RC-enhanced NP devices may influence future systems.

The IXP1200 was chosen as a basis due to the nature of its fixed processor coupled with flexible microengines that can be readily replaced with reconfigurable units. This design philosophy offers potent tradeoffs between performance, software tool support, flexibility, and versatility. Chapter 3 provides a description of the architecture for this RC-enhanced NP.

CHAPTER 3
SIMULATION MODEL DESCRIPTION

A description of the new model and the simulation environment in which it was produced is presented in Sections 3.1 and 3.2, respectively. Section 3.3 details terminology that is specific to the model's architecture. Section 3.4 describes the values assigned to model parameters as well as how the model was verified against the Intel IXP1200.

3.1 Model Overview

The moderately coupled, distributed processing approach highlighted in the Intel IXP1200 NP forms the basis of the new RC-enhanced NP. However, the six MEs that perform packet processing in the new system are provided with the capability to be dynamically reconfigured in the manner described in Section 3.3. Six MEs were used in the new RC-enhanced NP in order to provide a fair comparison to the IXP1200, but the model is designed such that future designs could easily include more. A functional description of this new design is shown in Figure 6.

The MEs in the new design perform pipelined packet processing as in the original IXP1200. However, in the new design, the MEs are not fixed components but dynamically reconfigurable devices such as FPGAs. To function properly, runtime-reconfigurable systems use a statistics-gathering mechanism to determine when system adaptations are necessary. For the new design, the additional functions of internal packet routing, statistics analysis, and reconfiguration management are all performed by the StrongArm processor.

Figure 6. Functional diagram of the RC-enhanced NP (the StrongArm performs time stamping, input and internal packet routing, statistics collection and reconfiguration management for the six MEs, between the network and host interfaces; signals carry configuration data, network packets and statistics data)

In the original Intel design, the StrongArm has much the same role in terms of packet routing and statistics management, so adding reconfiguration management to its workload is considered a reasonable extension.

3.2 Simulation Environment

Simulation is used to develop the RC-enhanced NP in order to accurately model the level of detail and flexibility inherent in RC systems. To model and simulate the device, the Block-Oriented Network Simulator (BONeS) Designer tool from Cadence Design Systems was used. BONeS is an integrated software package for event-driven simulation of data transfer systems. It allows for a hierarchical, dataflow representation of hardware devices and networks, with the ability to import finite-state machine diagrams and user-developed C/C++ code as functional primitives. BONeS was developed to model and simulate the flow of information represented by bits, packets, messages, or any combination of these. An overview of BONeS can be found in Shanmugen et al. [24].

The simulation model constructed using BONeS to evaluate the RC-enhanced NP is of high fidelity, consisting of approximately 3300 primitive blocks in six layers of depth, and it can be simulated to the accuracy of a clock cycle. The entire system model was constructed and tested, and the experiments performed, in approximately 1500 hours over the course of 1.5 years.

3.3 Architecture Terminology

An ME pipeline is a collection of MEs that performs packet processing in a pipelined manner. Three pipelines, each with a depth of at least one ME, exist in the simulation model. Because the RC-enhanced NP has a total of six MEs, the upper bound for the depth of any pipeline is four. Numerous pipeline configurations are possible. One option allows only one ME to be assigned to each pipeline. Another option allows multiple MEs to be assigned to each pipeline; the use of additional computing resources tends to increase a pipeline's processing throughput. Other possibilities include hybrid designs that offer mixtures of the two previous options. The convention for describing the number of MEs allotted to each pipeline is [x,y,z], where x, y and z denote the number of MEs allotted to the first, second and third pipelines, respectively, as illustrated in the sketch below.

Pipeline configurations for the NP system fall into two main types. The first type, the baseline configuration, is a collection of all three available pipelines that cannot be reconfigured during the course of packet processing. Baseline configurations represent the NP system without any RC enhancement. An optimal baseline configuration is the one that is best suited to process a specific set of packets.
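As an aside, the [x,y,z] notation and its constraints can be made concrete with a short sketch (illustrative C++ only; the actual model was built from BONeS primitives). It enumerates every configuration that assigns the six MEs to three pipelines with at least one ME each, which is why no pipeline can ever exceed a depth of four.

```cpp
#include <array>
#include <iostream>
#include <vector>

// A configuration [x,y,z]: MEs allotted to pipelines 0, 1 and 2.
using Config = std::array<int, 3>;

// Enumerate all configurations that use exactly six MEs with at
// least one ME per pipeline (hence at most four in any one pipeline).
std::vector<Config> validConfigs(int totalMEs = 6) {
    std::vector<Config> out;
    for (int x = 1; x <= totalMEs - 2; ++x)
        for (int y = 1; y <= totalMEs - 1 - x; ++y) {
            int z = totalMEs - x - y;
            if (z >= 1) out.push_back({x, y, z});
        }
    return out;
}

int main() {
    for (const Config& c : validConfigs())
        std::cout << "[" << c[0] << "," << c[1] << "," << c[2] << "]\n";
    // Prints ten configurations, including [4,1,1], [3,2,1] and [2,2,2],
    // which appear later as optimal baseline configurations.
}
```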

The second type, the dynamic configuration, is a collection of all three available pipelines that can be reconfigured during the course of packet processing. The decision process for possible reconfiguration occurs periodically at the decision interval. This interval is defined as the period between decisions and lasts the length of time it takes for the slowest pipeline to process five packets. At the end of a decision interval, the per-pipeline latency is polled. This latency is defined as the number of clock cycles it takes for a packet to move from the input of a pipeline to the output. To determine the need to reconfigure the system, the per-pipeline latency is checked to see if it falls within the acceptable region bounded by the upper and lower latency thresholds for each pipeline. A separate latency threshold value exists for each pipeline. If a per-pipeline latency value falls outside the region defined for that pipeline, a reconfiguration is attempted. The manner in which the NP performs reconfiguration is detailed in Section 4.3.

3.4 Parameters and Verification

All system delays are defined as multiples of the ME clock period of 8.33 ns (i.e., a frequency of 120 MHz). This frequency represents the lower bound of most RC hardware available today. Packets arrive at the ingress point of the NP at a rate of one packet per ten ME clock cycles. This rate is faster than the rate at which any pipeline can process a single packet, ensuring the pipelines are the bottleneck of the system. The PCI host interface, memory access delay, and physical network interface are not included in our model for simplicity. Rather, these elements are given a fixed latency of one ME clock cycle in order to keep packets from being overwritten in the simulation. The reconfiguration latency, or the number of ME clock cycles it takes to change pipeline configurations, is assumed to be one ME clock cycle, as in the Chameleon Systems RCP mentioned in Chapter 2.
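For reference, these parameter values can be collected as constants; the short sketch below (illustrative only, not the BONeS parameter blocks) restates the figures above and derives the offered packet rate they imply.

```cpp
#include <cstdio>

// Model parameters from Section 3.4, expressed in ME clock cycles (CCs).
constexpr double kMEClockPeriodNs   = 8.33;  // 120 MHz microengine clock
constexpr int    kPacketArrivalCCs  = 10;    // one packet every 10 CCs at ingress
constexpr int    kReconfigDelayCCs  = 1;     // single-cycle reconfiguration (Chameleon-style)
constexpr int    kInterfaceDelayCCs = 1;     // PCI/memory/PHY each modeled as one CC

int main() {
    // Derived ingress rate: packets per microsecond equals Mpps.
    double interArrivalNs = kPacketArrivalCCs * kMEClockPeriodNs;
    std::printf("packet inter-arrival: %.1f ns (%.2f Mpps offered)\n",
                interArrivalNs, 1e3 / interArrivalNs);
}
```

At roughly 12 million packets per second, the offered load comfortably exceeds the steady-state processing rate of about 2.65 million packets per second measured in the verification below, which is consistent with the pipelines being the system bottleneck.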

In order to gauge the relative performance of the simulated baseline configuration as compared to the original IXP1200 design, a verification of the system was performed as follows. First, a baseline configuration of our NP model with two MEs per pipeline was created. Second, 5000 Gigabit Ethernet (GigE) packets were passed through this NP model. It was found that the device maintained a steady-state packet-processing rate of 2.65 million packets per second. This value is comparable to the 3 million Ethernet packets per second asserted by the Intel documentation [19].

CHAPTER 4
SIMULATION EXPERIMENTS

This chapter details the manner in which the baseline configurations are compared to the dynamic configurations. Section 4.1 describes the packet processing methodology used for the experiments, and Section 4.2 introduces the terminology used to characterize the network traffic. Section 4.3 describes the manner in which MEs are allocated to pipelines. Section 4.4 presents the baseline experiments in which the optimal baseline configuration for each of the different cases is observed. The results from the baseline experiments are compared to the dynamic configurations in the head-to-head experiments detailed in Section 4.5.

4.1 Packet Processing Description

The NP creates the GigE packet header information for data and destination address pairs that are passed from the host processor. Analysis of the work involved in packet header construction demonstrates that a majority of processing time is spent computing the Cyclic Redundancy Check (CRC). Within the CRC operation, the majority of processing is taken up performing a 32-bit polynomial division. Therefore, system reconfiguration in order to better adapt a given pipeline is centered on the polynomial divide operation. Four possible divide operations, described in Martin and Knight [25], were chosen for this research. For each operation, the cost, represented as a number of transistors, and the performance, represented as processing latency, are shown in Table 1. The transistor count and processing latency for each operation are converted to a number of MEs and clock cycles (CCs) to produce meaningful values for the purposes of the simulation. Integer values of MEs were used since they represent the number of stages in the pipeline.

Table 1. Divide operation performance vs. area

                      Cost (Area)            Performance
Divide Operation      Transistors    MEs     Latency (ns)   CCs
32-array              32,896         4       160            20
16-digit serial       11,386         3       220            26
Modified Booth        3,808          2       320            38
Quasi bit-serial      1,944          1       640            77

As can be observed in Table 1, there is a direct relation between cost and performance for the divide operations. Our RC-enhanced NP relies upon the assumption that the divide operation can be partitioned without significant loss of performance. Previous research has shown that breaking up such complex polynomial computations can be accomplished with little additional cost from communication overhead [26].
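To make the dominant operation concrete, the following sketch shows a plain bit-serial CRC-32, in which each input bit drives one step of a 32-bit modulo-2 polynomial division. The IEEE 802.3 polynomial and reflected bit order are assumed here for illustration; the thesis does not state which CRC-32 variant the model uses, and the hardware dividers of Table 1 would realize this division with very different area and latency.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Bit-serial CRC-32 (IEEE 802.3, reflected form): each payload bit causes
// one conditional XOR with the 32-bit generator polynomial, i.e. one step
// of a long polynomial division over GF(2).
uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

int main() {
    const char* msg = "123456789";
    // The standard check value for CRC-32 of "123456789" is 0xCBF43926.
    std::printf("crc32 = 0x%08X\n",
                crc32(reinterpret_cast<const uint8_t*>(msg), std::strlen(msg)));
}
```

Roughly speaking, the four divide operations in Table 1 trade silicon area for how many of these division steps complete per clock cycle.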

4.2 Traffic Description

A traffic flow is the basic unit of differentiation between the types of network traffic that are processed by the NP. Each traffic flow originates from the host processor. For the purposes of this research, three traffic flows are used and each is mapped exclusively to one of the three pipelines. A priority scheme is used to denote the relative importance of the traffic flows, and each traffic flow is assigned an exclusive priority within a priority scheme. Three priority schemes have been defined for our NP model.

The 1PC, 2PC and U schemes allocate priority to the traffic flows as follows:

One-priority critical (1PC): P0 > P1 > P2
Two-priority critical (2PC): P0 = P1 > P2
Uniform (U): P0 = P1 = P2

A traffic mixture describes the relative number of packets from each traffic flow that makes up the total number of packets for the NP to process. A traffic mixture is expressed as the relative percentage of the number of packets from each traffic flow out of a total of 3000 packets. Three traffic mixtures (1-Hot, 2-Hot, 3-Hot) have been defined for the NP simulation model.

Within the 1-Hot traffic mixture, a single priority dominates during any given time period. 1-Hot contains the following traffic flow percentages: the first 1000 packets are made up of 90% P0, 5% P1 and 5% P2; the second 1000 packets are made up of 5% P0, 90% P1 and 5% P2; and the third 1000 packets are made up of 5% P0, 5% P1 and 90% P2. Within the 2-Hot traffic mixture, two priorities dominate during a given time period. 2-Hot contains the following traffic flow percentages: the first 1000 packets are made up of 45% P0, 45% P1 and 10% P2; the second 1000 packets are made up of 45% P0, 10% P1 and 45% P2; and the third 1000 packets are made up of 10% P0, 45% P1 and 45% P2. Finally, the 3-Hot traffic mixture provides a near-uniform mixture in which all priorities are nominally equal. Due to limitations of the simulation's traffic generators, only increments of 5% were possible, resulting in a distribution of 35%, 35% and 30%; however, it was determined that the effect of this small imbalance is negligible.

The 3-Hot traffic mixture contains the following traffic flow percentages: the first 1000 packets are made up of 35% P0, 35% P1 and 30% P2; the second 1000 packets are made up of 35% P0, 30% P1 and 35% P2; and the third 1000 packets are made up of 30% P0, 35% P1 and 35% P2.
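As an illustration of how such a phased mixture could be produced, the sketch below draws a priority for each of the 3000 packets according to the per-phase percentages of the 2-Hot mixture. It is a hypothetical stand-in for the BONeS traffic generators, and it samples priorities randomly rather than scheduling exact per-phase counts, which is a simplification.

```cpp
#include <cstdio>
#include <random>
#include <vector>

// One 1000-packet phase of a traffic mixture: percentage of packets
// belonging to priorities P0, P1 and P2 (the three values sum to 100).
struct Phase { int p0, p1, p2; };

// Draw a priority (0, 1 or 2) for each packet of each phase.
std::vector<int> generateMixture(const std::vector<Phase>& phases,
                                 int packetsPerPhase = 1000) {
    std::mt19937 rng(42);
    std::vector<int> priorities;
    for (const Phase& ph : phases) {
        std::discrete_distribution<int> pick({double(ph.p0), double(ph.p1), double(ph.p2)});
        for (int i = 0; i < packetsPerPhase; ++i)
            priorities.push_back(pick(rng));
    }
    return priorities;
}

int main() {
    // 2-Hot mixture from Section 4.2: two priorities dominate in each phase.
    std::vector<Phase> twoHot = {{45, 45, 10}, {45, 10, 45}, {10, 45, 45}};
    std::vector<int> pkts = generateMixture(twoHot);
    int count[3] = {0, 0, 0};
    for (int p : pkts) ++count[p];
    std::printf("P0=%d P1=%d P2=%d of %zu packets\n",
                count[0], count[1], count[2], pkts.size());
}
```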

4.3 Dynamic ME Allocation Scheme

For a dynamic configuration, priority schemes are also used to determine the manner in which MEs are added to, or taken away from, a given traffic flow's pipeline. The dynamic NP reallocates MEs in order to adapt to changing traffic mixture conditions. The reallocation is performed based on the following conditions.

If the latency of a given traffic flow's pipeline exceeds the upper latency threshold value for the priority assigned to that traffic flow, the NP attempts to allocate more MEs to that pipeline. If a free ME exists, the NP will assign it to the requesting pipeline. However, if no MEs are free, then the NP will take an ME away from another traffic flow's pipeline if three conditions are met. First, the priority assigned to that traffic flow must be lower than the priority of the requesting traffic flow, as dictated by the priority scheme. Second, the traffic flow's pipeline depth must be greater than one ME. Third, the traffic flow's pipeline utilization must be less than 95%. The pipeline utilization is defined as the number of clock cycles a given traffic flow's pipeline is busy, divided by the total number of clock cycles during a decision interval (busy time plus idle time). A value above 95% means that the pipeline is experiencing a large volume of packets and performance would greatly degrade if MEs were removed from it.

If the latency of a given traffic flow's pipeline is less than the lower latency threshold value for the priority assigned to that traffic flow, the NP attempts to remove MEs from that pipeline. However, MEs will be removed from a pipeline only if two conditions are met. First, the traffic flow's pipeline depth must be greater than one ME. Second, the traffic flow's pipeline utilization must be less than 95%.

This process to retune the pipelines occurs at the end of each decision interval. In order to ensure that MEs are used efficiently, the measured pipeline latency values are first checked against the lower latency thresholds to determine if MEs may be removed from a pipeline, and then checked against the upper latency thresholds to determine if MEs need to be assigned to a pipeline. A sketch of this decision logic is given at the end of this section.

As an example to illustrate the interaction of the system parameters, a typical configuration is shown in Figure 7. The three traffic flows that originate from the host are labeled Traffic Flow 0, Traffic Flow 1 and Traffic Flow 2. This example illustrates the first 1000-packet phase of the 1-Hot traffic mixture, in which Traffic Flow 0 accounts for 90% of packets, Traffic Flow 1 accounts for 5% of packets and Traffic Flow 2 accounts for 5% of packets. Note that several points (labeled A, B and C) are included in Figure 7 for statistics-gathering purposes. Pipeline latency is measured between point B and point C. Pipeline throughput, defined in Section 4.5, is measured between point A and point C. Packets from each traffic flow are assigned a given priority in the NP (P0 for Traffic Flow 0, P1 for Traffic Flow 1 and P2 for Traffic Flow 2 in this case) and mapped to a given pipeline (pipeline 0 for Traffic Flow 0, pipeline 1 for Traffic Flow 1 and pipeline 2 for Traffic Flow 2). Pipelines 0, 1 and 2 contain 3, 2 and 1 MEs, respectively. Therefore, the configuration denoted as [3,2,1] is shown in Figure 7.

Figure 7. Typical system configuration (the host's three traffic flows, carrying 90%, 5% and 5% of the packets, enter the NP through the internal packet router and map to pipelines of 3, 2 and 1 MEs; points A, B and C mark the statistics-gathering boundaries before the external network)
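The sketch below restates these reallocation rules in code form. It is an illustrative reimplementation rather than the StrongArm firmware or the BONeS blocks, and for simplicity it treats the three priorities as strictly ordered (the 1PC case); the 2PC and U schemes, where some priorities are equal, would need a small adjustment to the stealing test.

```cpp
#include <cstdio>

struct Pipeline {
    int    priority;     // lower value = more important (P0 is the highest priority)
    int    depth;        // number of MEs currently allotted
    double latency;      // measured per-pipeline latency, in CCs
    double utilization;  // busy CCs / total CCs over the decision interval
    double lower, upper; // latency thresholds for this pipeline's priority
};

const int    kTotalMEs       = 6;
const double kMaxUtilization = 0.95;

// One decision-interval pass over the three pipelines (Section 4.3 rules).
void retune(Pipeline (&p)[3]) {
    int freeMEs = kTotalMEs - (p[0].depth + p[1].depth + p[2].depth);

    // 1. Release MEs from under-loaded pipelines first.
    for (Pipeline& pl : p)
        if (pl.latency < pl.lower && pl.depth > 1 &&
            pl.utilization < kMaxUtilization) {
            --pl.depth;
            ++freeMEs;
        }

    // 2. Grant MEs to over-loaded pipelines, stealing from a lower-priority
    //    pipeline with spare depth and low utilization when the pool is empty.
    for (Pipeline& pl : p) {
        if (pl.latency <= pl.upper) continue;
        if (freeMEs > 0) { ++pl.depth; --freeMEs; continue; }
        for (Pipeline& victim : p)
            if (victim.priority > pl.priority && victim.depth > 1 &&
                victim.utilization < kMaxUtilization) {
                --victim.depth;
                ++pl.depth;
                break;
            }
    }
}

int main() {
    // Example: pipeline 0 (P0) overloaded, as in a 1-Hot detect-heavy phase.
    Pipeline p[3] = {{0, 2, 180.0, 0.99, 20, 50},
                     {1, 2,  30.0, 0.20, 40, 70},
                     {2, 2,  40.0, 0.15, 60, 100}};
    retune(p);
    std::printf("new configuration: [%d,%d,%d]\n", p[0].depth, p[1].depth, p[2].depth);
}
```

Running the example drives the initial [2,2,2] configuration toward [3,1,1], mirroring how a 1-Hot phase pulls MEs toward the dominant pipeline one decision interval at a time.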

4.4 Baseline Experiments

In the baseline experiments, the optimal baseline configuration for each of the three traffic mixtures (1-Hot, 2-Hot, 3-Hot) is determined for each of the three priority schemes (1PC, 2PC, U). The observed values are given in Table 2. The name of each priority scheme in the baseline experiments is preceded by B_ for clarity. As evidenced in Table 2, B_2PC and B_U have identical optimal baseline configurations. Therefore, the B_U case will be used to represent both of these priority schemes in the head-to-head experiments described in subsequent sections.

Table 2. Optimal baseline configurations

                    Priority Scheme
Traffic Mixture     B_1PC      B_2PC      B_U
1-Hot               [4,1,1]    [3,2,1]    [3,2,1]
2-Hot               [4,1,1]    [3,2,1]    [3,2,1]
3-Hot               [4,1,1]    [2,2,2]    [2,2,2]

In each of the baseline configuration experiments, we observed the same trends for the 1-Hot, 2-Hot and 3-Hot experiments in terms of throughput, latency, utilization and total execution time. Therefore, only the 2-Hot results are presented, for clarity. The pipeline utilization results, given in Figure 8b, show that when the priority scheme for an optimal baseline configuration matches the type of traffic mixture (i.e., 2PC for 2-Hot), the pipeline utilization for that configuration is more balanced compared to the other configurations. As such, configuration [3,2,1] outperformed all other configurations for the 2-Hot traffic mixture. This fact is also evidenced in the total execution time results shown in Figure 8a. Throughput and latency results for the optimal baseline configurations are presented in the head-to-head experiments.

Figure 8. 2-Hot baseline results: a) total execution time (CCs), b) pipeline utilization (%), for configurations [1,1,1], [4,1,1], [3,2,1] and [2,2,2], showing the hot priorities and the average of the others

4.5 Head-To-Head Experiments

In the head-to-head experiments, the optimal baseline configurations for each of the three traffic mixtures and the three priority schemes are compared to the dynamic ones. The three priority schemes for the dynamic configurations are preceded by a D_ for clarity. The latency thresholds for each of the three priority schemes are given in Table 3.

Table 3. Dynamic priority latency thresholds

                     P0                P1                P2
Priority Scheme      Lower    Upper    Lower    Upper    Lower    Upper
D_1PC                20       50       40       70       60       100
D_2PC                40       80       40       80       60       100
D_U                  60       120      60       120      60       120

Average latency values are used to compare the optimal baseline configurations to the dynamic configurations. These averages are determined by taking the average of all latency values computed at the end of each decision interval for each pipeline. Throughput is defined as the number of bytes transferred for a given traffic flow from point A to point C of Figure 7, divided by the number of clock cycles needed to perform the transfer. The throughput value is calculated on a per-packet basis and averaged over the decision interval. Only the results for the best dynamic configuration in each experiment are shown (e.g., the D_1PC scheme in the case of the 1-Hot traffic mixture) for clarity.
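A sketch of how these two metrics could be computed from per-packet records follows; the record layout and field names are illustrative assumptions rather than the statistics blocks of the BONeS model, and a single entry timestamp stands in for the separate A and B measurement points.

```cpp
#include <cstdio>
#include <vector>

// Per-packet record taken at the measurement points of Figure 7.
// A fuller model would keep separate timestamps for points A and B.
struct PacketRecord {
    long enterCC;  // entry timestamp, in ME clock cycles
    long exitCC;   // point C timestamp, in ME clock cycles
    int  bytes;    // packet size in bytes
};

// Average latency (CCs) over the packets seen in one decision interval.
double averageLatency(const std::vector<PacketRecord>& pkts) {
    if (pkts.empty()) return 0.0;
    double sum = 0.0;
    for (const PacketRecord& r : pkts) sum += double(r.exitCC - r.enterCC);
    return sum / pkts.size();
}

// Throughput (bytes per CC): per-packet values averaged over the interval.
double averageThroughput(const std::vector<PacketRecord>& pkts) {
    if (pkts.empty()) return 0.0;
    double sum = 0.0;
    for (const PacketRecord& r : pkts)
        sum += double(r.bytes) / double(r.exitCC - r.enterCC);
    return sum / pkts.size();
}

int main() {
    std::vector<PacketRecord> interval = {
        {100, 140, 64}, {110, 155, 64}, {120, 168, 64}, {130, 181, 64}, {140, 195, 64}};
    std::printf("avg latency = %.1f CCs, avg throughput = %.2f bytes/CC\n",
                averageLatency(interval), averageThroughput(interval));
}
```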

Results From The 1-Hot Experiment

The average latency results for the 1-Hot experiment are shown in Figure 9a. D_1PC produces low latency values for each priority because MEs are reallocated to each ME pipeline as needed. P2 suffers a severe latency penalty with the B_U scheme, and both P1 and P2 suffer severely with B_1PC, because these schemes cannot reallocate MEs. The dynamic system is found to reconfigure a total of only 17 times in this experiment.

The average throughput results for the 1-Hot experiment are shown in Figure 9b. D_1PC produces relatively high throughput values for all three priorities because MEs are reallocated to each ME pipeline as needed. Both baseline schemes produce throughputs for each priority that are lower than those of the dynamic scheme because they cannot reallocate MEs. The B_1PC scheme produces a lower throughput than the B_U scheme for P1 and P2 because the B_1PC scheme is optimized to process packets of priority P0.

Figure 9. 1-Hot results: a) average latency (CCs), b) average throughput (bytes/CC), for D_1PC, B_1PC and B_U across priorities P0-P2

Results From The 2-Hot Experiment

The average latency results for the 2-Hot experiment are shown in Figure 10a. D_2PC produces latency values that are fairly comparable to those of the previous experiment for each priority, despite a substantial increase in the number of system reconfigurations. Because the traffic mixture is becoming more uniform, the optimal baseline configurations are less subject to the latency penalties inherent in having fixed ME pipelines. The B_U scheme, in particular, produces good latency values for all priorities because of this trend. The dynamic system is found to reconfigure a total of 90 times in this experiment.

The average throughput results for the 2-Hot experiment are shown in Figure 10b. The throughput results show the same trend as the latency results in that D_2PC produces throughput values that are fairly comparable to those of the previous experiment for each priority, despite a substantial increase in the number of system reconfigurations. Also, the optimal baseline configurations are less subject to the penalties inherent in having fixed ME pipelines because the traffic mixture is becoming more uniform.

Figure 10. 2-Hot results: a) average latency (CCs), b) average throughput (bytes/CC), for D_2PC, B_1PC and B_U across priorities P0-P2

Results From The 3-Hot Experiment

The average latency results for the 3-Hot experiment are shown in Figure 11a. D_U produces the poorest latency values for each priority due to a large number of system reconfigurations. The relative effect versus the baseline is more severe in this experiment than in the previous one because the reconfigurations create ME pipelines that are not ideally suited for the current traffic mixture, in addition to consuming processing cycles during reconfiguration. This condition occurs because the traffic mixture at any given time changes so rapidly that the dynamic system cannot adapt in a meaningful and efficient manner. The B_U scheme produces low latency values for all priorities because it is always ideally suited for a uniform traffic mixture. B_1PC continues to produce poor latency values for P2 because this priority scheme is not well suited for a uniform traffic mixture. The dynamic system is found to reconfigure a total of 110 times in this experiment.

Figure 11. 3-Hot results: a) average latency (CCs), b) average throughput (bytes/CC), for D_U, B_1PC and B_U across priorities P0-P2

The average throughput results for the 3-Hot experiment are shown in Figure 11b. The throughput results show the same trend as the latency results in that D_U produces the poorest throughput values for each priority due to a large number of unproductive system reconfigurations. The B_U scheme produces high throughput values for all priorities because it is always ideally suited for a uniform traffic mixture. B_1PC produces poor throughput values for P2 because this priority scheme is not well suited for a uniform traffic mixture.

It should be noted that the scale of the latency charts is significantly reduced in the progression from the 1-Hot to the 2-Hot to the 3-Hot results. Therefore, the latency penalties that occur in the 3-Hot experiment are clearly much less severe than those in the 1-Hot experiment. These results illustrate three main points. First, dynamic configurations perform far better than the baseline configurations when traffic mixtures are non-uniform. Second, dynamic configurations suffer from resource thrashing if traffic mixtures become uniform. Third, the decrease in performance incurred by using dynamic configurations when processing a uniform traffic mixture is far less severe than the decrease in performance incurred by using a baseline configuration when processing non-uniform traffic mixtures.

In studying a wide variety of real-world network traffic patterns, the argument could be made that the majority of network traffic mixtures can be seen as non-uniform over an arbitrary time period. Given this assumption, the use of dynamic NP resource allocation is likely to have a large impact on tomorrow's networking environments.

In the next chapter, a description of the traffic mixture for a case study used to stimulate our NP model is presented.

CHAPTER 5
CASE STUDY

The case-study experiments employ the same testing strategy used in the previous chapter. A description of the case study is given in Section 5.1. The optimal baseline configuration for the case study's traffic mixture is found for each of the priority schemes in Section 5.2. The optimal baseline configurations are compared to the three dynamic configuration priority schemes in the head-to-head experiments shown in Section 5.3.

5.1 Case Study Description

The traffic mixture for this case study is inspired by packet types used in the network of a Navy system for theatre air and missile defense known as the Cooperative Engagement Capability (CEC). This system, under development and deployment by the U.S. Navy for the past decade, supports the total automation of fleet defense systems. The basic concept is illustrated in Figure 12. Each player in the theatre of operations communicates using three traffic flows defined as missile detection and tracking (denoted as radar), cooperation instructions (denoted as command) and engagement information (denoted as weapon). For the purposes of this case study, the three traffic flows are mapped onto the three priorities such that radar is P0, command is P1 and weapon is P2.

All networked military assets must undertake a distinct series of actions in order to destroy a missile.

This series of actions, known as the kill chain, involves three phases (detect, control and engage), illustrated in Figure 13. The 3000 packets from which the traffic mixture is calculated have been divided equally among the three phases of the kill chain.

Figure 12. Theatre missile defense system (Courtesy: ONR)

As an actual battle traffic mixture is not publicly available, a candidate traffic mixture was developed. This traffic mixture contains the following traffic flow distribution: the first 1000 packets (detect phase) are made up of 85% P0, 10% P1 and 5% P2; the second 1000 packets (control phase) are made up of 35% P0, 60% P1 and 5% P2; and the third 1000 packets (engage phase) are made up of 15% P0, 30% P1 and 55% P2.

Figure 13. Theatre missile defense kill chain

5.2 Baseline Experiments

The optimal baseline configurations for the case study were determined and are given in Table 4. As in Chapter 4, since B_2PC and B_U have identical optimal baseline configurations, the B_U case will be used to represent both priority schemes in the subsequent experiments.

Table 4. Case study optimal baseline configurations

                    Priority Scheme
Traffic Mixture     B_1PC      B_2PC      B_U
Case Study          [4,1,1]    [3,2,1]    [3,2,1]

5.3 Head-To-Head Experiments

The optimal baseline configurations for each of the three priority schemes are compared to the three priority schemes for the dynamic configurations.

The average latency results for the case study are shown in Figure 14a. All configurations produce low latency values for P0. However, B_1PC produces high latency values for P1 and P2, while the B_U scheme produces high latency values for P2, because MEs cannot be reallocated.

For the dynamic cases, D_U produces low latency values for all three priorities, while the other two dynamic configurations do not produce low latency values for P2. D_U performed well despite having a high number of system reconfigurations because these reconfigurations were meaningful and produced pipelines that were well suited for the current traffic mixture even after reconfiguration. D_1PC and D_2PC did not produce configurations that were well suited to handle P2 traffic. The reconfiguration count for the dynamic systems was found to be 16 for D_1PC, 18 for D_2PC and 63 for D_U.

The throughput results for the case study are shown in Figure 14b. All configurations produce high throughput values for P0. However, B_1PC produces lower throughput values for P1 and P2, while the B_U scheme produces lower throughput values for P2, because MEs cannot be reallocated. For the dynamic cases, D_U produces high throughput values for all three priorities, while the other two dynamic configurations do not produce high throughput values for P2. D_U outperformed all other configurations in terms of overall throughput as well as latency.

Figure 14. Case study results: a) average latency (CCs), b) average throughput (bytes/CC), for D_1PC, D_2PC, D_U, B_1PC and B_U across priorities P0-P2

The case study results illustrate two main points. First, the need for dynamic configurations to make MEs available for any traffic flow's pipeline to use is demonstrated. This point is evidenced by the fact that the D_U scheme did not suffer from system resource thrashing (as seen in the 3-Hot head-to-head experiment in Section 4.5) even though it reconfigured roughly four times more often than the other two dynamic configurations. Second, the dynamic configurations performed as well as, if not better than, each of the ideal baseline configurations overall.

CHAPTER 6
CONCLUSIONS

Current NPs have incorporated static reconfiguration or configurability that does not allow for in-system reconfiguration. The prevalence and success of RC designs that incorporate dynamic reconfigurability make the addition of dynamic RC to NPs a logical progression. In this paper, a candidate NP (the Intel IXP1200) is enhanced to provide dynamic reconfiguration of network processing resources in order to produce an NP that better adapts to changing network traffic flows. The functionality of the RC-enhanced NP model is simulated using BONeS Designer, a software tool for event-driven simulation of computer network systems that offers a high degree of fidelity. The accuracy of the baseline model is verified against a nominal value for sustained packet processing asserted in the Intel IXP1200 documentation. The RC-enhanced NP dynamic configuration is compared to a baseline configuration in order to gauge relative performance. In addition, a case study is presented to highlight the effects of dynamic reconfiguration on a traffic mixture patterned after a complex system design. This research has shown the potential advantages of incorporating RC design techniques into next-generation NPs.

In the baseline experiments, the pipelines were not reconfigurable. This type of system mimics the behavior of today's NPs that are not RC-enhanced. The baseline experiments determined the optimal baseline configurations later used for comparison of the baseline NP to the dynamic NP in the head-to-head experiments.

The pipeline utilization results from the baseline experiments showed that optimal configurations use their resources efficiently by evenly spreading the workload over all three pipelines. Configurations that do not distribute the packet-processing workload evenly waste pipeline computing cycles by allowing them to remain idle. For static network development, a trial-and-error approach is generally used to obtain the optimal configuration. However, the optimal configuration for one moment in time is not necessarily the optimal configuration at another. As this research has shown, this type of workload partitioning and scheduling problem can be overcome by using RC techniques.

The head-to-head experiments pitted the optimal baseline configurations against the dynamic configurations. Average latency and throughput results were used as the metrics for comparison. These results illustrated three main points. First, dynamic configurations perform far better than baseline configurations when traffic patterns are non-uniform because dynamic configurations can reallocate MEs. Because the baseline configurations cannot reallocate MEs, many processing clock cycles are wasted due to an imbalance of system resources. Second, dynamic configurations may suffer from resource thrashing if traffic patterns become uniform. This situation may occur if the decision interval assumed in this research is too fine-grained as compared to the time it takes to process the total number of packets per experiment. By having a small window in which to determine if a system reconfiguration is necessary, the dynamic configurations reconfigured the system based on transient traffic mixtures rather than adapting to the overall traffic mixture over time. In fact, the results demonstrate that it is better not to reconfigure the system at all for uniform traffic mixtures.

Third, the decrease in performance incurred by using dynamic configurations when processing uniform traffic mixes is far less severe than the decrease in performance incurred by using a baseline configuration when processing non-uniform traffic. This phenomenon is due to the fact that the baseline systems are optimized for only a single traffic mixture and are therefore inadequately suited to process packets in other traffic mixtures.

In the case study, the baseline experiments and head-to-head experiments are performed with a traffic mixture that resembles a realistic one. The case study illustrated two points. First, a high number of pipeline reconfigurations does not necessarily imply resource thrashing and performance loss in dynamic configurations. This fact is observed in that the configuration for the D_U scheme performed well for all priorities even though it reconfigured four times more often than the other dynamic configurations. The D_U configuration accomplished this by reconfiguring the system in a meaningful way in order to produce pipelines that were well suited for the current traffic mixture even after reconfiguration. Therefore, the decision interval used in the case study experiment has a better granularity for its traffic mixture as compared to the head-to-head experiments in Chapter 4. Second, the dynamic configurations were found to perform at or above the level of the baseline configurations for the case-study traffic mixture because resources in the NP can be reallocated to the pipelines of other traffic flows as needed.

Future directions for this research include adding to the simulation model the capability for studying the effects of full-duplex packet processing for multiple RC-enhanced NPs in a complex network.

In creating a network of nodes, implementing additional functionality such as network routing information, flow control and quality of service (QoS) mechanisms would be a logical next step. Another area for future work is to study issues with the migration of RC techniques to other NP designs. This direction would help to ascertain the relationship between the emphasis of the NP architecture (e.g., loosely coupled versus tightly coupled processing) and the most effective methods for augmenting such architectures to achieve increased performance through adaptive, reconfigurable processing.

REFERENCES

1. A. DeHon and J. Wawrzynek, Reconfigurable Computing: What, Why and Implications for Design Automation, Proc. 36th ACM Design Automation Conference (DAC), New Orleans, LA, June 21-25, 1999, pp. 610-615.
2. J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati and P. Boucard, Programmable Active Memories: Reconfigurable Systems Come of Age, IEEE Transactions on VLSI Systems, Vol. 4, No. 1, March 1996, pp. 56-69.
3. Star Bridge Systems, Theory of Hypercomputing, Star Bridge Systems, 2002, http://www.starbridegesystems.com (Accessed December 2002).
4. H. Choi, J. Kim, C. Yoon, I. Park, S. Hwang and C. Kyung, Synthesis of Application Specific Instructions for Embedded DSP Software, IEEE Transactions on Computers, Vol. 48, No. 6, June 1999, pp. 603-614.
5. O. Mencer, M. Morf and M. Flynn, Hardware Software Tri-design of Encryption for Mobile Communications Units, Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Seattle, WA, Vol. 5, May 1998, pp. 3045-3048.
6. B. Dipert, Figuring Out Reconfigurable Logic, EDN Magazine, August 5, 1999, p. 103.
7. A. Staicu, J. Radzikowski, K. Gaj, N. Alexandridis and T. El-Ghazawi, Effective Use of Networked Reconfigurable Resources, Proc. Int. Conf. on Military Applications of Programmable Logic Devices (MAPLD), Laurel, MD, September 2001.
8. A. Dollas, D. Pnevmatikatos, N. Aslanides, S. Kavvadias, E. Sotiriades, S. Zogopoulos, K. Papademetriou, N. Chrysos, K. Harteros, E. Antonidakis and N. Patrakis, Architecture and Applications of PLATO, a Reconfigurable Active Network Platform, Proc. 9th Annual IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), Rohnert Park, CA, April 29-May 2, 2001.
9. A. DeHon, R. Huang and J. Wawrzynek, Hardware-Assisted Fast Routing, Proc. 10th Annual IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, April 2002.

PAGE 47

10. S. Johansson, Transport Network Involving a Reconfigurable WDM Network Layer: A European Demonstration, IEEE Journal of Lightwave Technology, Vol. 14, No. 6, June 1996, pp. 1341-1348.

11. X. Zhu, Network Processor Directory, Gigascale Silicon Research Center, 2001, http://www.gigascale.org/mescal/forum/110.html (Accessed January 2001).

12. J. Caruso, Network Processors Combining the Speed of ASICs with the Flexibility of General-Purpose Processors, Network World Fusion, September 15, 1999.

13. D. McEuen, Network Processor Revenues Climb from Obscurity in 1999 to $2.9 Billion by 2004, In-Stat MDR, March 6, 2000.

14. E. Rothfus, The Challenge for Next Generation Network Processors, Agere, 1999, http://www.agere.com (Accessed December 2002).

15. E. Rothfus, Building Next Generation Network Processors, Agere, 1999, http://www.agere.com (Accessed December 2002).

16. Motorola, C-5 Network Processor, Motorola, 2002, http://www.motorola.com (Accessed December 2002).

17. IBM, PowerNP NP2G Network Processor, IBM, 2002, http://www.ibm.com (Accessed December 2002).

18. IBM, NP2G Data Sheet, IBM, 2002, http://www.ibm.com (Accessed December 2002).

19. Intel, IXP1200 Network Processor, Intel, 2002, http://www.intel.com (Accessed December 2002).

20. Intel, IXP1200 Data Sheet, Intel, 2002, http://www.intel.com (Accessed December 2002).

21. Agere, PayloadPlus NPs, Agere, 1999, http://www.agere.com (Accessed December 2002).

22. X. Tang, M. Aalsma and R. Jou, A Compiler Directed Approach to Hiding Configuration Latency in Chameleon Processors, Proc. 10th International Conference on Field Programmable Logic and Applications, Villach, Austria, 2000, pp. 29-38.

PAGE 48

23. Chameleon Systems, CS2000 Reconfigurable Communications Processor, Chameleon Systems, 2000, http://www.chameleonsystems.com (Accessed December 2002).

24. K. Shanmugen, V. Frost and W. LaRue, A Block-Oriented Network Simulator (BONeS), Simulation, Vol. 58, No. 2, 1992, pp. 83-94.

25. R. S. Martin and J. P. Knight, Power-Profiler: Optimizing ASICs Power Consumption at the Behavioral Level, Proc. 32nd ACM Design Automation Conference (DAC), San Francisco, CA, June 1995, pp. 42-47.

26. C. Tanougast, Y. Berviller and S. Weber, Optimization of Motion Estimator for Run-Time-Reconfiguration Implementation, Proc. 7th Reconfigurable Architectures Workshop (RAW), Cancun, Mexico, May 1-5, 2001.

PAGE 49

BIOGRAPHICAL SKETCH

Ian Troxel received a Bachelor of Science in Electrical Engineering degree, with minors in business administration and mathematics, and a Bachelor of Science in Computer Engineering degree from the ECE department at the University of Florida in May of 2000 and December of 2000, respectively. As an undergraduate student, he was selected as a University Scholars recipient and served as a paid undergraduate researcher under the direction of Dr. Alan George, founder and director of the High-performance Computing and Simulation (HCS) laboratory at the University of Florida. Since becoming a paid graduate research assistant in May of 2000, Ian has worked on several projects in the area of reconfigurable computing and network simulation. Ian is one of two co-inventors of the Honeywell Reconfigurable Space Computer (HRSC) as part of the HRSC development team at Honeywell Defense and Space Systems of Clearwater, FL (January until August 2002). He will continue his studies at the University of Florida in pursuit of a Ph.D. in Electrical Engineering.



DESIGN AND ANALYSIS OF A
DYNAMICALLY RECONFIGURABLE NETWORK PROCESSOR















By

IAN A. TROXEL


A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF
FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
DEGREE OF MASTER OF ENGINEERING

UNIVERSITY OF FLORIDA


2003















ACKNOWLEDGMENTS

I wish to thank the Department of Defense and MLDesign Technologies

Incorporated for their financial support, all the good professors for their inspiration and

guidance, all the bad ones for their frustration and misdirection, the members of the High-

performance Computing and Simulation lab for their technical support, Dr. Alan George

for his guidance, the UF Turkish Folklor group for their friendship, my parents for their

nurture, my ancestors for their nature, the fates for their scheming and my wife for her

patience and love.















TABLE OF CONTENTS



A C K N O W L E D G E M E N T S ................................................................................................ ii

L IST O F T A B L E S .. ............ ................................................... ............... v...... .... ..v

LIST OF FIGURES .................................................. ............................ vi

A B S T R A C T ...................................................................................................................... v ii

CHAPTER

1 IN TR OD U CTION .............. ...................... ................ ..1.. .. ..

2 RELA TED RESEAR CH .................... ...............................................................5......

3 SIMULATION MODEL DESCRIPTION .............................................................12

3 .1 M o d el O v erv iew .................................................................................................... 12
3.2 Sim ulation Environm ent ................. ........................................................... 13
3.3 A architecture Term inology.................................... ....................... ............... 14
3.4 P aram eters and V erification.............................................................. ............... 15

4 SIM U LA TION EXPERIM EN TS ........................................................... ................ 17

4.1 Packet Processing D escription.................. .................................................... 17
4.2 T traffic D description ... ................................................................... .............. 18
4.3 Dynamic ME Allocation Scheme .............. ............... 20
4.4 B aseline E xperim ents ..................................................................... ................ 22
4.5 H ead-T o-H ead E xperim ents ............................................................. ................ 24

5 C A SE S T U D Y ..................................................... ................................................ 30

5.1 C ase Study D description .......................................... ........................ ................ 30
5.2 B aseline E xperim ents ..................................................................... ................ 32
5.3 H ead-T o-H ead E xperim ents ............................................................. ................ 32

6 C O N C LU SIO N S.......................... .. .................. ............... .............. ..... ............... 35









REFEREN CE S .................................................................................................................. 39

B IO G R APH ICAL SK ETCH ...................... .............................................................. 42















LIST OF TABLES

Table page

1 Divide operation performance vs. area............................................................. 18

2 O ptim al baseline configurations ....................................................... ................ 23

3 Dynamic priority latency thresholds.................................................................24

4 Case study optimal baseline configurations......................................................32















LIST OF FIGURES

Figure page

1 M otorola/C -Port D CP N P ..................................... ........................ ............... 6

2 IBM NP2G NP .................... .. ...... ..........7.....

3 Intel IXP1200 NP .............. ..............................8

4 L ucent/A gere PayloadPlus N P ............................................................ ...............9...

5 C ham eleon System s R P C ....................................... ....................... ............... 10

6 Functional diagram of RC-enhanced NP ............... ....................................13

7 Typical system configuration............................................................ ................ 22

8 2-H ot baseline results ........................................................................ ............... 23

9 1-H ot results ........................................................................... ......... ............... 25

10 2 -H o t re su lts ........................................................................................................... 2 6

1 1 3 -H o t re su lts ........................................................................................................... 2 7

12 Theatre m issile defense system ......................................................... ................ 31

13 Theatre m issile defense kill chain ..................................................... ................ 32

14 C ase study results ... .. ........................................... ........................ .... ......... 33















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Engineering

DESIGN AND ANALYSIS OF A
DYNAMICALLY RECONFIGURABLE NETWORK PROCESSOR


By

Ian A. Troxel

May 2003

Chair: Dr. Alan D. George
Department: Electrical and Computer Engineering

The fusion of reconfigurable computing (RC) techniques with network processor

(NP) designs has opened new doors for packet processing platforms. While previous

designs have been static or configurable at best, routing switches, edge switches and

network interface cards (NICs) of the future will be able to adapt to network traffic in

real-time through dynamic reconfiguration. This paper presents the simulation results of

a novel RC-enhanced NP based on the Intel IXP1200 NIC design philosophy. The

enhanced NP's performance is compared to the baseline NP in terms of three normalized

traffic patterns and a case-study traffic pattern based on a military application. The

results demonstrate that the enhanced NP significantly outperforms the baseline NP in

terms of latency, throughput and resource utilization for traffic that is non-uniform.















CHAPTER 1
INTRODUCTION

Advances in chip technology and wire speeds have driven the need for faster

packet-processing devices. Today's network devices are required to perform more

complex operations on an increasing number of packets in a more flexible manner for a

lower cost in a shorter amount of time than any previously. To further complicate the

issue, the user's appetite for additional bandwidth appears insatiable -- history has shown

that as network technology catches up to demand, new ways to use the additional

functionality tend to push the envelope even further.

NPs have their origins within the vast sea of on-line niche markets, dot-com

catastrophes and high-speed access to the desktop, as a revolution among the top

providers of the Internet's infrastructure gained momentum. A strong divide existed in

the early to mid 1990s between two main factions of protocol-based packet processor

developers along the dimensions of flexibility verses speed. On the one hand, developers

who saw a need to accommodate a large diversity of protocols chose to use a General-

Purpose Processor (GPP) to handle network traffic in order to sacrifice speed for

flexibility. Those who fell into this group believed the future shape of the Internet to be a

sea of protocols with intelligent translation between intranets. This group's focus,

therefore, was on serving the need for flexibility of the periphery consumer, and largely

overlooked the needs of the backbone suppliers. On the other hand, developers who

wished to capitalize and improve upon the increased network speeds obtained during this









era chose to produce Application-Specific Integrated Circuits (ASICs) for network traffic

processing in order to sacrifice flexibility for speed. Those who fell into this camp saw

the Internet as a close-knit community, moving toward a single unifying standard. This

group's focus, therefore, was on serving the need for speed of the backbone suppliers,

and largely overlooked the needs of the periphery consumer.

As these two camps strayed further apart, the inherent problems each faced by

maintaining a hard-line viewpoint of the emerging Internet that did not materialize

spelled drastic consequences for each. The GPP group saw a consolidation of protocols

rendering their flexibility niche a moot point, while at the same time link speeds began to

exceed processor clocks ten-fold rendering wire speeds unattainable. At the same time,

the ASIC faction saw the same consolidation of protocols stop far short of the one or two

protocols they believed would define the Internet. Consequently, the fixed nature of the

ASIC has rendered their designs too inflexible for some applications. In addition, the

tremendous production cost involved in creating ASIC designs has produced a high

entrance barrier in the market, ensuring their use by only large-scale corporations. ASICs

have been able to maintain a strong market presence by keeping in step with link-speed

increases in recent years. While this group has not suffered the hardships that befell the

GPP group, the recent decline in sales of all traditional networking equipment may have

signaled a shift in the trends of networking. Vendors are now realizing that optimizing

for speed or flexibility alone will not meet tomorrow's market demands for routing

switches, edge switches, network interface cards (NICs) and nodes that offer the best of

both options. To meet these and other challenges, both groups have begun to work

toward the common goal of a next-generation NP.









Meanwhile, due to technological advances over the past decade, Reconfigurable

Computing (RC) has garnered a great deal of attention from the academic community as

well as industry [1]. RC systems have been shown to provide a computation speedup, in

some application-specific domains, as large as 100 times as compared to GPPs with

comparable resources [2]. In still other domains, RC device implementations have

mimicked GPP performance at two-thirds the cost [3]. The most remarkable speedup has

been seen in traditionally ASIC-laden markets such as digital signal processing (DSP) [4]

and cryptography [5].

While this emerging technology has grown at an accelerated pace, there are still

numerous obstacles to overcome. Producing a set of language standards, defining

network protocols, standardizing benchmarks and even solidifying a name for the

technology are all still to be accomplished [6].

However, RC designs have produced significant performance speedup in point-

solution markets that have the same processing trends as the application domain of

network processing. Therefore, RC-enhanced NP designs are poised to make an impact

on future packet-processing systems in so-called active networks (AN). Some of the

general topics within AN include resource management [7], adaptive flow control,

adaptive error recovery, adaptive mesh interconnections, adaptive routing [8-9], adaptive

node topologies and reconfigurable network links [10].

The organization of the remainder of this paper is as follows. Chapter 2 presents

related research as it pertains to current generation NPs and how RC has entered NP

designs. Chapter 3 outlines the proposed RC-enhanced NP model and the simulation






4


environment in which it was created. Chapter 4 discusses the experiments by which the

new model is compared to the baseline system. Chapter 5 presents a case-study analysis

of the new model, and Chapter 6 discusses conclusions and directions for future work.















CHAPTER 2
RELATED RESEARCH

A host of large-scale companies as well as numerous start-ups are shaping the

face of new-generation NPs [11]. Each of the major players has been lured to new NPs

due to their faster time-to-market as compared to traditional ASIC designs as well as the

flexibility they provide akin to past GPP designs [12]. Also, market forecasters predict a

coming surge in revenues for the NP market niche. In fact, combined revenues are

expected to climb to $2.9 billion by 2004 [13]. Such potential market growth has

garnered the respect of many of the industries biggest players.

The NP market is rich with a variety of designs. While fragmentation has meant

the lifeblood of some of the smaller players in the NP design market, with numerous

designs realized or in production, it is difficult to adequately judge the benefits of each.

This fragmentation of the market has left much to be desired in terms of accurate head-to-

head performance analyses of the notable models. However, there exists a need to

conduct such tests so that future designs will not make the same mistakes as those that

preceded them. In addition, while it can be surmised that new NPs will outperform their

ASIC and GPP equivalents in terms of flexibility coupled with speed [14-15], the fact

that they are better (and if so by how much) cannot be determined without structured

comparisons. Of the proposed designs, there are at least four in particular that warrant

more attention.










The Motorola/C-Port C-5 Digital Communication Processor (DCP), shown in

Figure 1, represents one extreme of the NP market as a highly distributed architecture.

The NP consists of 16 channel processors (CPs) and five co-processors, all connected

through a 60Gbps bus. The channel processors, each of which consists of a 32-bit RISC

core and two serial data processors (SDPs), are the heart of the unit. The SDPs are

microcode-programmable to implement link-layer interfaces including Ethernet, SONET

and serial data streams. Since each RISC core can execute a different program, and the

channel processors share a common bus, there is a great deal of flexibility in distributing

processing across the chip. There can be a parallel processing arrangement where

identical programs can be executed on several CPs, or a pipelined arrangement where

each processor is dedicated to a particular task and passes its output to the input of the

next processor [16]. The C-5 DCP offers a wide range of processing options from

network edge to core.


SRAM SRAM Host .PL FROM I SDRAM
Farc (optional) (optiona):
--- M o -P rt CP -rol P
L.: Cii: p.


._._ __ ... .. .i PROM


burnt Mmti








0/10X Etlhernet Giabit Ethernet
OC-3 OC-12
Figure 1. MotorolalC-Port DCP NP
(Courtesy: Motorola lnc./C-Port Corp.)










The IBM NP2G is one of two designs that represent the middle of the road in

terms of distributed versus tightly coupled processing. Figure 2 shows the IBM NP2G

architecture [17]. The device provides fast switching by integrating switching engine,

search engine, and security functions on one device. It provides Ethernet, Packet over

SONET (POS) and Point-to-Point Protocol (PPP) switching, and supports three priority

levels [18].

DDR SDRAM (10 to 13)
ZBT SRAM (2)




Ingress EDS Wrap(IEW) Egress EDS

Enqueuer Internal |||| Enqueuer
Dequeuer SRAMs Dequeuer
Scheduler Scheduler
Embedded405
rPower PC ice

0Embedded

SComplex SDRAM (4)

Ingress PMM Ac Egress PMM 1c
SMultiplexed MACfs Mutiplexed MACS p L



DMU Bus DMLJ Bus
| Physical Layer Devices
Figure 2. IBM NP2G NP
(Courtesy: IBM Inc.)

The NP2G's packet processor consists of control and processing components.

The control component supports Layer 2 and 3 routing protocols, Layer 4 and 5 network

application, and management functions. The control component for the device can be an

external processor connected through an Ethernet link or the PCI interface. The NP2G's

embedded PowerPC processor can also perform control component functions. The

processing component provides packet forwarding, filtering, and classification of the









tables generated by the routing protocols. The NP2G's processing component is made up

of six Dyadic Protocol Processor Units (DPPUs) that each forms a packet-processing unit

capable of collectively executing twelve independent threads of code at a time.

The Intel IXP 1200 is the second of two designs that represent the middle of the

road in terms of distributed versus tightly coupled processing. Figure 3 shows the Intel

IXP1200 architecture [19]. This hybrid data processor delivers high-performance parallel

processing power and flexibility to a wide variety of networking, communications and

other data-intensive applications. The IXP1200 is designed specifically as a data control

element for applications that require access to a fast memory subsystem, a fast interface

to I/O devices, and processing power to perform efficient manipulation of various data

sizes [20].


Figure 3. Intel IXP1200 NP
(Courtesy: Intel Corp.)









The IXP1200 combines a StrongARM microprocessor with six independent 32-

bit RISC data engines possessing hardware multithread support that, when combined,

provide over 1 Giga-operations per second. The six MEs are reportedly capable of

packet forwarding of 3 million Ethernet packets per second at Layer 3. The StrongARM

processor is used for more complex tasks such as address learning, building and

maintaining forwarding tables, and network management.

The Lucent/Agere PayloadPlus processor family represents the other market

extreme as a tightly coupled NP solution. The PayloadPlus architecture is shown in

Figure 4. This architecture includes the Fast Pattern Processor (FPP), Routing Switch

Processor (RSP) and the Agere System Interface (ASI). The PayloadPlus processor is

designed to handle wire-speed data streams at up to OC-48c rates. Each specialized chip

provides a complementary function to work in concert: the FPP for high-speed

classification, the RSP for processing and routing traffic, and the ASI to provide policing,

manage state information and provide a PCI connection to a host processor [21].


V PCI to Host CPU
Figure 4. LucentlAgere PayloadPlus NP
(Courtesy: Lucent Technologies/Agere Systems)









Within the realm of RC-enhanced NPs, the Reconfigurable Communications

Processor (RCP) from Chameleon Systems is the first industry NP that uses dynamic

reconfiguration as part of normal system operation. The architecture of the RCP is

shown in Figure 5. The RCP consists of a 32-bit ARC processor, memory units and a 32-

bit reconfigurable processing fabric that consists of 108 parallel computation units [22].

The RPC has been able to bridge the configuration latency problem that plagues RC

systems by multiplexing contexts in the processing fabric. After initialization, context

switching can be performed in a single clock cycle [23].









Figure 5. Chameleon Systems RPC







Imtaul i E Cw rrAgr|able 1:0

(Courtesy: Chameleon Systems Inc.)

For our research, a novel design and simulation model has been developed for a

dynamically reconfigurable NP by adapting the architecture of the Intel IXP 1200. Its

purpose is to support the study of design options and tradeoffs in dynamic NP

architectures versus static ones and help develop an understanding of how RC-enhanced

NP devices may influence future systems. The IXP1200 was chosen as a basis due to the

nature of its fixed processor coupled with flexible microengines that can be readily






11


replaced with reconfigurable units. This design philosophy offers potent tradeoffs

between performance, software tool support, flexibility, and versatility. Chapter 3

provides a description of the architecture for this RC-enhanced NP.















CHAPTER 3
SIMULATION MODEL DESCRIPTION

A description of the new model and the simulation environment in which it was

produced is presented in Sections 3.1 and 3.2, respectively. Section 3.3 details

terminology that is specific to the model's architecture. Section 3.4 describes the values

assigned to model parameters as well as how the model was verified against the Intel

IXP1200.

3.1 Model Overview

The moderately coupled, distributed processing approach highlighted in the Intel

IXP1200 NP forms the basis of the new RC-enhanced NP. However, the six MEs that

perform packet processing in the new system are provided with the capability to be

dynamically reconfigured in the manner described in Section 3.3. Six MEs were used in

the new RC-enhanced NP in order to provide a fair comparison to the IXP1200, but the

model is designed such that future designs could easily include more. A functional

description of this new design is shown in Figure 6.

The MEs in the new design perform pipelined packet processing as in the original

IXP1200. However, in the new design, the MEs are not fixed components, but instead

dynamically reconfigurable devices such as FPGAs. To function properly, runtime

reconfigurable systems use a statistics gathering mechanism to determine when system

adaptations are necessary. For the new design, the additional functions of internal packet

routing, statistics analysis, and reconfiguration management are all performed by the
























Internal Packet 1|
Network Interface Routing
(Physical Connection) (StrongArm)

Signal Description
'. ....................... > Configuration Data
NetworkJ .... .. Statistics Data
_- > Network Packets
Figure 6. Functional diagram of RC-enhanced NP

StrongArm processor. In the original Intel design, the StrongArm has much the same

role in terms of packet routing and statistics management, so the addition of the

reconfiguration management to its workload is considered to be a reasonable addition.

3.2 Simulation Environment

Simulation is used to develop the RC-enhanced NP in order to accurately model

the level of detail and flexibility inherent in RC systems. To model and simulate the

device, the Block-Oriented Network Simulator (BONeS) Designer tool from Cadence

Design Systems was used.

BONeS is an integrated software package for event-driven simulation of data

transfer systems. It allows for a hierarchical, dataflow representation of hardware devices

and networks with the ability to import finite-state machine diagrams and user-developed

C/C++ code as functional primitives. BONeS was developed to model and simulate the

flow of information represented by bits, packets, messages, or any combination of these.

An overview of BONeS can be found in Shanmugen et al. [24].









The simulation model constructed using BONeS to evaluate the RC-enhanced NP

is of a high fidelity, consisting of approximately 3300 primitive blocks in six layers of

depth that can be simulated to the accuracy of a clock cycle. The entire system model

was constructed, tested and experiments performed in approximately 1500 hours over the

course of 1.5 years.

3.3 Architecture Terminology

An ME pipeline is a collection of MEs that performs packet processing in a

pipelined manner. Three pipelines, with a depth of at least one ME per pipeline, exist in

the model simulation. Due to the fact that the RC-enhanced NP has a total of six MEs,

the upper bound for the depth of a pipeline is four. Numerous possible pipeline

configurations exist for the model. One option allows only one ME to be assigned to each

pipeline. Another option allows multiple MEs to be assigned to each pipeline. The use of

additional computing resources would tend to increase a pipeline's processing

throughput. Other possibilities include hybrid designs that offer mixtures of the two

previous options. The convention for describing the number of MEs allotted to each

pipeline is given by [x,y,z] where x, y and z denote the number of MEs allotted to the first,

second, and third pipeline respectively.

Pipeline configurations for the NP system fall into two main types. The first type,

baseline configuration, is a collection of all three available pipelines that cannot be

reconfigured during the course of packet processing. Baseline configurations represent

the NP system without any RC enhancement. An optimal baseline configuration is the

one that is best suited to process a specific set of packets.









The second type, dynamic configuration, is a collection of all three available

pipelines that can be reconfigured during the course of packet processing. The decision

process for possible reconfiguration occurs periodically at the decision interval. This

interval is defined as the period between decisions and lasts the length of time it takes for

the slowest pipeline to process five packets. At the end of a decision interval, theper-

pipeline latency is polled. This latency is defined as the number of clock cycles it takes

for a packet to move from the input of a pipeline to the output. The per-pipeline latency

is checked to see if it falls within the acceptable region bounded by the upper and lower

latency thresholds for each pipeline to determine the need to reconfigure the system. A

separate latency threshold value exists for each pipeline. If a per-pipeline latency value

falls out of the region defined for that pipeline, a reconfiguration is attempted. The

manner in which the NP performs reconfiguration is detailed in Section 4.2.

3.4 Parameters and Verification

All system delays are defined as multiples of the ME clock period of 8.33ns (i.e.

frequency of 120MHz). This frequency represents the lower bound of most RC hardware

available today. Packets arrive at the ingress point of the NP at a rate of one packet per

ten ME clock cycles. This rate is faster than any pipeline can process a single packet,

ensuring the pipelines are the bottleneck of the system. The PCI host interface, memory

access delay, and physical network interface are not included in our model for simplicity.

Rather, these values are given a fixed latency value of one ME clock cycle in order to

keep packets from being overwritten in the simulation. The reconfiguration latency, or

the number of ME clock cycles it takes to change pipeline configurations, is assumed to

be one ME clock cycle as in Chameleon Systems' RPC mentioned in Chapter 2.






16


In order to gauge the relative performance of the simulation of the baseline

configuration as compared to the original IXP1200 design, a verification of the system

was performed as follows. First, a baseline configuration of our NP model with two MEs

per pipeline is created. Second, 5000 Gigabit Ethernet (GigE) packets are passed through

this NP model. It was found that the device maintained a steady-state, packet-processing

rate of 2.65 million packets per second. This value is comparable to the 3 million

Ethernet packets per second asserted by the Intel documentation [19].















CHAPTER 4
SIMULATION EXPERIMENTS

The following section details the manner in which the baseline configurations are

compared to the dynamic configurations. Section 4.1 describes the packet processing

methodology used for the experiments, and Section 4.2 introduces the terminology used

to characterize the packet processing. Section 4.3 describes the manner in which MEs are

allocated to pipelines. Section 4.4 presents the baseline experiments in which the optimal

baseline configuration for each of the different cases is observed. The results from the

baseline experiments are compared to the dynamic configurations in the head-to-head

experiments detailed in Section 4.5.

4.1 Packet Processing Description

The NP creates the GigE packet header information for data and destination

address pairs that are passed from the host processor. Analysis of the work involved in

packet header construction, demonstrates that a majority of processing time is spent

computing the Cyclic Redundancy Check (CRC). Also, within the CRC operation, the

majority of processing is taken up performing a 32-bit polynomial division. Therefore,

system reconfiguration in order to better adapt a given pipeline is centered on the

polynomial divide operation. Four possible divide operations were chosen for this

research described in Martin and Knight [25]. For each operation, the cost, represented

as a number of transistors, and performance, represented as processing latency, is shown

in Table 1. The values of transistor size and processing latency for each operation are









converted to a number of MEs and clock cycles to produce meaningful values for the

purposes of the simulation. Integer values of MEs were used since they represent the

number of stages in the pipeline.

Table 1. Divide operation performance vs. area
Divide Cost (Area) Performance
Operation Transistors MEs Latency CCs
(ns)
32-array 32,896 4 160 20

16-digital
16digital 11,386 3 220 26
serial
Modified 3,808 2 320 38
Booth
Quasi bit-
Quasi bit- 1,944 1 640 77
serial IIIII

There exists a direct relation between cost and performance for the divide

operations as can be observed in Table 1. Our RC-enhanced NP relies upon the

assumption that the divide operation can be partitioned without significant loss of

performance. Previous research has shown that breaking up such complex polynomial

computations can be accomplished with little additional cost from communication

overhead [26].

4.2 Traffic Description

A traffic flow is the basic unit of differentiation between types of network traffic

that are processed by the NP. Each traffic flow originates from the host processor. For

the purposes of this research, three traffic flows are used and each is mapped exclusively

to one of the three pipelines. Apriority scheme is used to denote the relative importance

of the traffic flows, and each traffic flow is assigned an exclusive priority within a

priority scheme. Three priority schemes have been defined for our NP model.









The 1PC, 2PC and U schemes allocate priority to the traffic flows as follows:

One-priority critical (1PC): PO > P1 > P2
Two-priority critical (2PC): PO = P > P2
Uniform (U): PO = P1 = P2

A traffic mixture describes the relative number of packets from each traffic flow

that makes up the total number of packets for the NP to process. A traffic mixture is

expressed as the relative percentage of the number of packets from each traffic flow out

of a total of 3000 packets. Three traffic mixtures (1-Hot, 2-Hot, 3-Hot) have been

defined for the NP simulation model. Within the 1-Hot traffic mixture, a single priority

dominates during any given time period. 1-Hot contains the following traffic flow

percentages: the first 1000 packets are made up of 90% PO, 5% PI and 5% P2; the

second 1000 packets are made up of 5% PO, 90% PI and 5% P2; and the third 1000

packets are made up of 5% PO, 5% P1 and 90% P2.

Within the 2-Hot traffic mixture, two priorities dominate during a given time

period. 2-Hot contains the following traffic flow percentages: the first 1000 packets are

made up of 45% PO, 45% PI and 10% P2; the second 1000 packets are made up of 45%

PO, 10% P1 and 45% P2; and the third 1000 packets are made up of 10% PO, 45% P1 and

45% P2.

Finally, the 3-Hot traffic mixture provides a near-uniform traffic mixture in which

all priorities are equal. Due to limitations of the simulation's traffic generators, only

increments of 5% were possible, resulting in a distribution of 35%, 35%, and 30%.

However, it was determined that the affect of the small imbalance in the distribution is

negligible. The 3-Hot traffic mixture contains the following traffic flow percentages: the

first 1000 packets are made up of 35% PO, 35% PI and 30% P2; the second 1000 packets









are made up of 35% PO, 30% P1 and 35% P2; and the third 1000 packets are made up of

30% PO, 35% PI and 35% P2.

4.3 Dynamic ME Allocation Scheme

For a dynamic configuration, priority schemes are also used to determine the

manner in which MEs are added to, or taken away from, a given traffic flow's pipeline.

The dynamic NP reallocates MEs in order to adapt to changing traffic mixture conditions.

The reallocation is performed based on the following conditions. If the latency of

a given traffic flow's pipeline exceeds the upper latency threshold value for the priority

assigned to that traffic flow, the NP attempts to allocate more MEs to that pipeline. If a

free ME exists, the NP will assign it to the requesting pipeline. However, if no MEs are

free, then the NP will take away a ME of another traffic flow's pipeline if three

conditions are met. First, the priority value assigned to that traffic flow must be less than

the priority value of the requesting traffic flow as dictated by the priority scheme.

Second, the traffic flow's pipeline depth must be greater than one ME. Third, the traffic

flow's pipeline utilization must be less than 95%. The pipeline utilization is defined as

the number of clock cycles a given traffic flow's pipeline is busy, divided by the total

number of clock cycles during a decision interval (busy time plus idle time). A value

above 95% means that the pipeline is experiencing a large volume of packets and

performance would greatly degrade if MEs were removed from the pipeline.

If the latency of a given traffic flow's pipeline is less than the lower latency

threshold value for the priority assigned to that traffic flow, the NP attempts to remove

MEs from that pipeline. However, MEs will be removed from a pipeline only if two









conditions are met. First, the traffic flow's pipeline depth must be greater than one ME.

Second, the traffic flow's pipeline utilization must be less than 95%.

This process to retune the pipelines occurs at the end of each decision interval. In

order to ensure that MEs are used efficiently, the measured pipeline latency values are

first checked against the lower latency thresholds to determine if MEs may be removed

from a pipeline, and then checked against the upper latency thresholds to determine if

MEs need to be assigned to a pipeline.

As an example to illustrate the interaction of the system parameters, a typical

configuration is shown in Figure 7. The three traffic flows that originate from the host

are labeled Traffic Flow 0, Traffic Flow 1 and Traffic Flow 2. This example illustrates

the first 1000-packet phase of the 1-Hot traffic mixture in which Traffic Flow 0 accounts

for 90% of packets, Traffic Flow 1 accounts for 5% of packets and Traffic Flow 2

accounts for 5% of packets.

Note that several points (labeled A, B and C) are included in the Figure 7 for

statistics gathering purposes. Pipeline latency is measured between point B and point C.

Pipeline throughput, defined in Head-To-Head Experiments, is measured between point

A and point C.

Packets from each traffic flow are assigned a given priority in the NP (PO for

Traffic Flow 0, PI for Traffic Flow 1 and P2 for Traffic Flow 2 in this case) and mapped

to a given pipeline (pipeline 0 for Traffic Flow 0, pipeline 1 for Traffic Flow 1 and

pipeline 2 for Traffic Flow 2). Pipeline 0, pipeline 1 and pipeline 2 contains 3, 2 and 1

MEs respectively. Therefore, the configuration denoted as [3,2,1] is shown in Figure 7.











Host Processor

Traffic Flow 0 Traffic Flow 1 Traffic Flow 2
90% 5% 5%

----------------------------

Nr -------------1*--------- --- -- -

SInternal Packet Router

Priority 0 B Priority 1 Priority 2:

Pipeline 0 Pipeline 1 Pipeline 2

ME ME ME


ME ME
..







II
I External Network


Figure 7. Typical system configuration


4.4 Baseline Experiments


In the baseline experiments, the optimal baseline configuration for each of the


three traffic mixtures (1-Hot, 2-Hot, 3-Hot) is determined for each of the three priority


schemes (1PC, 2PC, U). The observed values are given in Table 2. The name for each


priority scheme in the baseline experiments is preceded by "B for clarity. As


evidenced in Table 2, B_2PC and B_U have identical optimal baseline configurations.


Therefore, the B_U case will be used to represent both these priority schemes for the


head-to-head experiments described in following subsequent sections.









Table 2. 0 timal baseline configurations
Traffic Priority Scheme
Mixture B 1PC B 2PC B U
1-Hot [4,1,1] [3,2,1] [3,2,1]
2-Hot [4,1,1] [3,2,1] [3,2,1]
3-Hot [4,1,1] [2,2,2] [2,2,2]

In each of the baseline configuration experiments, we observed the same trends

for the 1-Hot, 2-Hot and 3-Hot experiments in terms of throughput, latency, utilization

and total execution time. Therefore, the 2-Hot results are presented alone for clarity. The

pipeline utilization results, given in Figure 8b, show that when the priority scheme for an

optimal baseline configuration matches the type of traffic mixture (i.e. 2PC for 2-Hot),

the pipeline utilization for that configuration is more balanced as compared to the other

configurations. As such, configuration [3,2,1] outperformed all other configurations for

the 2-Hot traffic mixture. This fact is also evidenced in the total execution time results

shown in Figure 8a. Throughput and latency results for the optimal baseline

configurations are presented in the head-to-head experiments.


25000 100

U-
o 20000 80

E 15000 60
F a
a o
S10000 N 40 Hot Priorities
Configuration Confige of Others
-F 5000 20

0 0

Configuration Configuration

a) Total execution time b) Pipeline utilization
Figure 8. 2-Hot baseline results.









4.5 Head-To-Head Experiments

In the head-to-head experiments, the optimal baseline configurations for each of

the three traffic mixtures and the three priority schemes are compared to the dynamic

ones. The three priority schemes for the dynamic configurations are preceded by a "D_"

for clarity. The latency thresholds for each of the three priority schemes are given in

Table 3.

Table 3. Dynamic priority latency thresholds
Priority
Priority P0 P P2
Scheme
Lower Upper Lower Upper Lower Upper
D 1PC 20 50 40 70 60 100
D 2PC 40 80 40 80 60 100
DU 60 120 60 120 60 120


Average latency values are used to compare the optimal baseline configurations to

the dynamic configurations. These averages are determined by taking the average of all

latency values computed at the end of each decision interval for each pipeline.

Throughput is defined as the number of bytes transferred for a given traffic flow from

point A to point C of Figure 7, divided by the number of clock cycles needed to perform

the transfer. The throughput value is calculated on a per-packet basis and averaged over

the decision interval. Only the results for the best dynamic configuration in each

experiment are shown (e.g. D_1PC scheme in the case of the 1-Hot traffic mixture) for

clarity.









Results From The 1-Hot Experiment

The average latency results for the 1-Hot experiment are shown in Figure 9a.

D_1PC produces low latency values for each priority because MEs are reallocated to

each ME pipeline as needed. P2 suffers a severe latency with the B_U scheme, and both

P1 and P2 suffer severely with B_1PC, due to the fact that they cannot reallocate MEs.

The dynamic system is found to reconfigure a total of only 17 times in this experiment.

The average throughput results for the 1-Hot experiment are shown in Figure 9b.

D_1PC produces relatively high throughput values for all three priorities because MEs

are reallocated to each ME pipeline as needed. Both baseline schemes produce

throughputs for each priority that are lower than the dynamic scheme due to the fact that

they cannot reallocate MEs. The B_1PC scheme produces a throughput that is less than

the B_U scheme for P1 and P2 because the B_1PC scheme is optimized to process

packets of priority PO.


6000 1

5000 -0
o 0.8
" 4000 -
0 0.6
> 3000 -
._c 0.4
C 2000
ED 1PC OB 1PC OB U 0
1000 0.2

0 0
PO P1 P2
Priority


a) Average latency
Figure 9. 1-Hot results


IID_1PC 0OB-1PC IIB_U



PO P1 P2
Priority


b) Average throughput









Results From The 2-Hot Experiment

The average latency results for the 2-Hot experiment are shown in Figure l0a.

D_2PC produces latency values that are fairly comparable to the previous experiment for

each priority despite a substantial increase in the number of system reconfigurations.

Due to the fact that the traffic mixture is becoming more uniform, the optimal baseline

configurations are less subject to the latency penalties inherent in having ME pipelines

that are fixed. The B_U scheme, in particular, produces good latency values for all

priorities because of this trend. The dynamic system is found to reconfigure a total of 90

times in this experiment.

The average throughput results for the 2-Hot experiment are shown in Figure 10b.

The throughput results show the same trend as the latency results in that D_2PC produces

throughput values that are fairly comparable to the previous experiment for each priority

despite a substantial increase in the number of system reconfigurations. Also, the optimal

baseline configurations are less subject to the latency penalties inherent in having ME

pipelines that are fixed because the traffic mixture is becoming more uniform.


900

750

600

450
D_2PC 0 B_1 PC 0 B_U
300

150

0
PO P1 P2
Priority


0 0.75

0.6-

0.45 -
Q_ D_2PC OB_1P(
5 0.3 -

0.15

0
PO P1
Priority


a) Average latency b) Average throughput
Figure 10. 2-Hot results.









Results From The 3-Hot Experiment

The average latency results for the 3-Hot experiment are shown in Figure 1 la.

D_U produces the poorest latency values for each priority due to a large number of

system reconfigurations. The relative effect versus the baseline is more severe in this

experiment as compared to the previous one because the reconfigurations are creating

ME pipelines that are not ideally suited for the current traffic mixture in addition to

consuming processing cycles during reconfiguration. This condition occurs because the

traffic mixture at any given time changes so rapidly that the dynamic system cannot adapt

in a meaningful and efficient manner. The B_U scheme produces low latency values for

all priorities because it is always ideally suited for a uniform traffic mixture. B_1PC

continues to produce a poor latency values for P2 because this priority scheme is not well

suited for a uniform traffic mixture. The dynamic system is found to reconfigure a total

of 110 times in this experiment.

390 0.8 i


325

260

> 195
ci,
J 130

65

0


1<


0.6 -


1D U OB 1PC MB U U B
DB- D DU OB_ PC l
e.0.4 -
CL
S0.2
I-

0
PO P1 P2 PO P1
Priority Priority


a) Average latency b) Average throughput
Figure 11. 3-Hot results.


BU







P2









The average throughput results for the 3-Hot experiment are shown in Figure 1 lb.

The throughput results show the same trend as the latency results in that D_U produces

the poorest throughput values for each priority due to a large number of unproductive

system reconfigurations. The B_U scheme produces high throughput values for all

priorities because it is always ideally suited for a uniform traffic mixture. B_1PC

produces poor throughput values for P2 because this priority scheme is not well suited for

a uniform traffic mixture.

It should be noted that the scale of the previous three latency charts has been

significantly reduced in the progression from 1-Hot to 2-Hot to 3-Hot results. Therefore,

latency penalties that occur in the 3-Hot experiment are clearly much less severe than

those in the 1-Hot experiment.

These results illustrate three main points. First, dynamic configurations perform

far better than the baseline configurations when traffic mixtures are non-uniform.

Second, dynamic configurations suffer from resource thrashing if traffic mixtures become

uniform. Third, the decrease in performance incurred by using dynamic configurations

when processing a uniform traffic mixture is far less severe than the decrease in

performance incurred by using a baseline configuration when processing non-uniform

traffic mixtures.

In studying a wide variety of real-world network traffic patterns, the argument

could be made that the majority of network traffic mixtures can be seen as non-uniform

over an arbitrary time period. Given this assumption, the use of dynamic NP resource

allocation is likely to have a large impact on tomorrow's networking environments. In






29


the next section, a description of the traffic mixture for a case study used to stimulate our

NP model is presented.















CHAPTER 5
CASE STUDY

The case-study experiments employ the same testing strategy used in the previous

section. A description of the case study is given in Section 5.1. The optimal baseline

configuration for the case study's traffic mixture is found for each of the priority schemes

in Section 5.2. The optimal baseline configurations are compared to the three dynamic

configuration priority schemes in head-to-head experiments shown in Section 5.3.

5.1 Description

The traffic mixture for this case study is inspired by packet types used in the

network of a Navy system for theatre air and missile defense known as Cooperative

Engagement Capability (CEC). This system, under development and deployment by the

U.S. Navy for the past decade, supports the total automation of fleet defense systems.

The basic concept is illustrated in Figure 12.

Each player in the theatre of operations communicates using three traffic flows

defined as missile detection and tracking (denoted as radar), cooperation instructions

(denoted as command) as well as engagement information (denoted as weapon). For the

purposes of this case study, the three traffic flows are mapped onto the three priorities

such that radar is PO, command is P1 and weapon is P2.

All networked military assets must undertake a distinct series of actions in order

to destroy a missile. This series of actions, know as the kill chain, involves three phases










(detect, control and engage) illustrated in Figure 13. The 3000 packets from which the

traffic mixture is calculated have been divided equally between the three phases of the

"kill chain."


C-130 E-2C
Surrogate
EP-3
MSiLink


MElCaha Ridg UAVSensors





Edi ng S pl -in a id
A Sensors


Nihau Sub
DDICG21 Sensors?? (Unks)
Figure 12. Theatre missile defense system
(Courtesy: ONR)

As an actual battle traffic mixture is not publicly available, a candidate traffic mixture

was developed. This traffic mixture contains the following traffic flow distribution: the

first 1000 packets (detect phase) are made up of 85% PO, 10% P1 and 5% P2; the second

1000 packets (control phase) are made up of 35% PO, 60% P1 and 5% P2; and the third

1000 packets (engage phase) are made up of 15% PO, 30% P1 and 55% P2.









Threat Acquired

Detect -Target detection
Deect initial track of target

Control Tracking information relayed
Control.- Command asserted and decision made


Engae Command decision disseminated
Engage 'tert : ,tr,, ,- a ,:pr, deployed


Threat Destroyed

Figure 13. Theatre missile defense kill chain

5.2 Baseline Experiments

The optimal baseline configurations for the case study were determined and are

given in Table 4. As in Chapter 4, since B_2PC and B_U have identical optimal baseline

configurations, the B_U case will be used to represent both priority schemes in the

subsequent experiments.

Table 4. Case study optimal baseline configurations
Traffic Priority Scheme
Mixture B 1PC B 2PC B U
Case
Study [4,1,1] [3,2,1] [3,2,1]

5.3 Head-To-Head Experiments

The optimal baseline configurations for each of the three priority schemes are

compared to the three priority schemes for the dynamic configurations. The average

latency results for the case study are shown in Figure 14a.

All configurations produce low latency values for PO. However, B_1PC produces

high latency values for P1 and P2, while the B_U scheme produces high latency values

for P2, due to the fact that MEs cannot be reallocated. For the dynamic cases, D_U










produces low latency values for all three priorities while the other two dynamic

configurations did not produce low latency values for P2. DU performed well

regardless of having a high number of system reconfigurations because these

configurations were meaningful and produced pipelines that were well suited for the

current traffic mixture even after reconfiguration. D_1PC and D_2PC did not produce

configurations that were well suited to handle P2 traffic. The reconfiguration count for

the dynamic systems was found to be 16 times for D_1PC, 18 times for D_2PC and 63

times for D U.

The throughput results for the case study are shown in Figure 14b. All

configurations produce high throughput values for PO. However, B_1PC produces lower

throughput values for P and P2, while the B_U scheme produces lower throughput

values for P2, due to the fact that Mes cannot be reallocated. For the dynamic cases,

DU produces high throughput values for all three priorities while the other two dynamic

configurations did not produce high throughput values for P2. D_U outperformed all

other configuration in terms of overall throughput as well as latency.

09 1800

075 1500

06 1200

045 900

03 600
0D3 _1PCE D_2PC lDDU B_1PCf BU D600 D_1PC ED_2PC DDU D B_1PC B

0 15 e30 0


P0 P1 P2 PO P1 P2
Priority Priority
a) Average latency b) Average throughput
Figure 14. Case study results.






34


The case study results illustrate two main points. First, the need for dynamic

configurations to allow Mes to be available for any traffic flow's pipeline to use is

demonstrated. This point is evidenced in the fact that the D_U scheme did not suffer

from system resource thrashing (as seen in the 3-Hot head-to-head experiment in Section

4.5) even though it reconfigured four times more often than the other two dynamic

configurations. Second, the dynamic configurations performed as well, if not better,

overall than each of the ideal baseline configurations.















CHAPTER 6
CONCLUSIONS

Current NPs have incorporated static reconfiguration or configurability that does

not allow for in-system reconfiguration. The prevalence and success of RC designs that

incorporate dynamic reconfigurability makes the addition of dynamic RC to NPs a logical

progression. In this paper, a candidate NP (the Intel IXP1200) is enhanced to provide

dynamic reconfiguration of network processing resources in order to produce an NP that

better adapts to changing network traffic flows.

The functionality of the RC-enhanced NP model is simulated using BONeS

Designer, a software tool for event-driven simulation of computer network systems that

offers a high degree of fidelity. The accuracy of the baseline model is verified in terms of

a nominal value for sustained packet processing asserted in the Intel IXP1200

documentation. The RC-enhanced NP dynamic configuration is compared to a baseline

configuration in order to gauge relative performance. In addition, a case study is

presented to highlight the effects of dynamic reconfiguration on a traffic mixture

patterned after a complex system design. This research has shown the potential

advantages from incorporating RC design techniques into next-generation NPs.

In the baseline experiments, the pipelines were not reconfigurable. This type of

system mimics the behavior of today's NPs that are not RC-enhanced. The baseline

experiments determined the optimal baseline configurations later used for comparison of

the baseline NP to the dynamic NP in the head-to-head experiments. The pipeline









utilization results from the baseline experiments showed that optimal configurations

efficiently use their resources by evenly spreading the workload over all three pipelines.

Configurations that do not distribute the packet processing workload evenly waste

pipeline computing cycles by allowing them to be idle. For static network development, this sort of trial-and-error approach is generally used to obtain the optimal configuration. However, the configuration that is optimal at one moment in time is not necessarily optimal at another. As this research has shown, this type of workload partitioning and scheduling problem can be overcome by using RC techniques.
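As an illustration of how such a partitioning could be automated rather than found by trial and error, the following minimal sketch assigns a pool of MEs to per-priority pipelines in proportion to an observed traffic mixture. The function name, the largest-remainder rounding, and the example mixture are assumptions for illustration only; they are not taken from the dynamic allocation scheme evaluated in this thesis.

    # Illustrative sketch only: allocate MEs to per-priority pipelines in
    # proportion to the observed traffic mix so no pipeline sits idle while
    # another is oversubscribed.
    def partition_mes(packet_counts, total_mes):
        """Return the number of MEs to assign to each priority's pipeline."""
        total_packets = sum(packet_counts.values())
        if total_packets == 0:
            # No traffic observed: give every pipeline the same floor share.
            even = total_mes // len(packet_counts)
            return {p: even for p in packet_counts}

        # Ideal (fractional) ME share for each priority, proportional to traffic.
        shares = {p: total_mes * count / total_packets
                  for p, count in packet_counts.items()}

        # Largest-remainder rounding so the integer assignment sums to total_mes.
        alloc = {p: int(s) for p, s in shares.items()}
        leftover = total_mes - sum(alloc.values())
        by_remainder = sorted(shares, key=lambda p: shares[p] - int(shares[p]),
                              reverse=True)
        for p in by_remainder[:leftover]:
            alloc[p] += 1
        return alloc

    # Example: a mixture dominated by priority P0 packets and a pool of six MEs.
    print(partition_mes({"P0": 700, "P1": 200, "P2": 100}, total_mes=6))
    # -> {'P0': 4, 'P1': 1, 'P2': 1}

A dynamic scheme can re-run this kind of calculation at each decision interval, whereas a static configuration is fixed to whatever mixture it was tuned for at design time.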

The head-to-head experiments pitted the optimal baseline configurations against

the dynamic configurations. Average latency and throughput results were used as the metrics for comparison. These results illustrated three main points.

First, dynamic configurations perform far better than baseline configurations when traffic patterns are non-uniform because they can reallocate MEs. Since the baseline configurations cannot reallocate MEs, many processing clock cycles are wasted due to an imbalance of system resources.

Second, dynamic configurations may suffer from resource thrashing if traffic patterns become uniform. This situation may occur if the decision interval assumed in this research is too fine-grained compared to the time required to process the total number of packets per experiment. With only a small window in which to determine whether a system reconfiguration was necessary, the dynamic configurations reconfigured the system based on transient traffic mixtures rather than adapting to the overall traffic mixture over time. In fact, the results demonstrate that for uniform traffic mixtures it is better not to reconfigure the system at all.
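One hedge against this kind of thrashing, sketched below, is to smooth the observed traffic mixture across decision intervals and reconfigure only when the smoothed mixture drifts sufficiently far from the mixture the current pipeline layout was built for. The smoothing factor and drift threshold shown are assumed values, not parameters of the allocation scheme studied here.

    # Illustrative sketch only: damp transient traffic bursts before deciding
    # to reconfigure.  Constants below are assumptions for illustration.
    SMOOTHING = 0.3         # weight given to the newest interval's traffic sample
    DRIFT_THRESHOLD = 0.2   # minimum change in any priority's share before reconfiguring

    def should_reconfigure(current_mix, smoothed_mix, new_sample):
        """Update the smoothed traffic mixture and decide whether to reconfigure.

        Each mixture maps a priority (e.g. 'P0') to its fraction of observed
        packets; fractions are assumed to sum to one.
        """
        for p, share in new_sample.items():
            old = smoothed_mix.get(p, share)
            smoothed_mix[p] = (1 - SMOOTHING) * old + SMOOTHING * share

        drift = max(abs(smoothed_mix[p] - current_mix.get(p, 0.0))
                    for p in smoothed_mix)
        return drift > DRIFT_THRESHOLD, smoothed_mix

    # Example: one bursty interval of P2 traffic does not immediately force the
    # controller to abandon a pipeline layout tuned for mostly-P0 traffic.
    reconfigure, smoothed = should_reconfigure(
        current_mix={"P0": 0.7, "P1": 0.2, "P2": 0.1},
        smoothed_mix={"P0": 0.7, "P1": 0.2, "P2": 0.1},
        new_sample={"P0": 0.2, "P1": 0.2, "P2": 0.6},
    )
    print(reconfigure)  # False: the smoothed drift (about 0.15) stays below the threshold

A coarser decision interval achieves a similar effect by construction, since each sample already averages over more traffic before a reconfiguration decision is made.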









Third, the decrease in performance incurred by using dynamic configurations

when processing uniform traffic mixes is far less severe than the decrease in performance

incurred by using a baseline configuration when processing non-uniform traffic. This

phenomenon occurs because the baseline systems are optimized for only a single traffic mixture and are therefore ill suited to process packets in other traffic mixtures.

In the case study, the baseline experiments and head-to-head experiments were performed with a traffic mixture patterned after a realistic, complex system design. The case study illustrated two points. First, a high number of pipeline reconfigurations does not necessarily imply resource thrashing and performance loss in dynamic configurations. This was observed when the D_U scheme performed well for all priorities even though it reconfigured four times more often than the other dynamic configurations. The D_U configuration accomplished this by reconfiguring the system in a meaningful way, producing pipelines that remained well suited for the current traffic mixture even after reconfiguration. The decision interval used in the case study therefore had a granularity better matched to the traffic mixture than it did in the head-to-head experiments of Chapter 4. Second, the dynamic configurations were found to perform at or above the level of the baseline configurations for the case-study traffic mixture because NP resources can be reallocated to the pipelines of other traffic flows as needed.

Future directions for this research include extending the simulation model to study the effects of full-duplex packet processing for multiple RC-enhanced NPs in a complex network. In creating a network of nodes, implementing








additional functionality such as network routing information, flow control and quality of

service (QoS) mechanisms would be a logical next step. Another area for future work is

to study issues with the migration of RC techniques to other NP designs. This direction

would help to ascertain the relationship between the emphasis of an NP architecture (e.g., loosely coupled versus tightly coupled processing) and the most effective methods for augmenting it to achieve increased performance through adaptive, reconfigurable processing.






























BIOGRAPHICAL SKETCH


Ian Troxel received a Bachelor of Science in Electrical Engineering degree, with minors in business administration and mathematics, and a Bachelor of Science in Computer Engineering degree from the ECE department at the University of Florida in May of 2000 and December of 2000, respectively. As an undergraduate student, he was

selected as a University Scholars recipient and served as a paid undergraduate researcher

under the direction of Dr. Alan George, founder and director of the High-performance

Computing and Simulation (HCS) laboratory at the University of Florida.

Since becoming a paid graduate research assistant in May of 2000, Ian has

worked on several projects in the area of reconfigurable computing and network

simulation. He is one of two co-inventors of the Honeywell Reconfigurable Space Computer (HRSC) as part of the HRSC development team at Honeywell Defense and Space Systems of Clearwater, FL (January through August 2002). He will continue his

studies at the University of Florida in pursuit of a Ph.D. in Electrical Engineering.