Group Title: NARC : network-attached reconfigurable computing for high-performance, network-based applications
Title: Presentation
Permanent Link: http://ufdc.ufl.edu/UF00094734/00002
 Material Information
Title: Presentation
Physical Description: Book
Language: English
Creator: Conger, Chris
Troxel, Ian
Espinosa, Daniel
Aggarwal, Vikas
George, Alan D.
Publisher: Conger et al.
Place of Publication: Gainesville, Fla.
 Record Information
Bibliographic ID: UF00094734
Volume ID: VID00002
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

Full Text

NARC:
Network-Attached Reconfigurable Computing for
High-performance, Network-based Applications




Chris Conger, Ian Troxel, Daniel Espinosa,
Vikas Aggarwal, and Alan D. George

High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida




#233 MAPLD 2005





Outline

* Introduction
* NARC Board Architecture, Protocols
* Case Study Applications
* Experimental Setup
* Results and Analysis
* Pitfalls and Lessons Learned
* Conclusions
* Future Work







Introduction

Network-Attached Reconfigurable Computer (NARC) Project
* Inspiration: network-attached storage (NAS) devices
* Core concept: investigate challenges and alternatives for enabling direct network access to and control over reconfigurable (RC) devices
* Method: prototype a hardware interface and software infrastructure; demonstrate proof of concept for the benefits of network-attached RC resources

Motivations for the NARC project include (but are not limited to) applications such as:
* Network-accessible processing resources
  - Generic network RC resource, a viable alternative to server and supercomputer solutions
  - Power and cost savings over server-based FPGA cards are key benefits
  - No server needed to host the RC device
  - Infrastructure provided for robust operation and interfacing with users
  - A performance increase over existing RC solutions is not a primary goal of this approach
* Network monitoring and packet analysis
  - Easy attachment; unobtrusive, fast traffic gathering and processing
  - Network intrusion and attack detection, performance monitoring, active traffic injection
  - Direct network connection of the FPGA can enable wire-speed processing of network traffic
* Aircraft and advanced munitions systems
  - Standard Ethernet interface eases addition and integration of RC devices in aircraft and munitions systems
  - Low weight and power are also attractive characteristics of the NARC device for such applications







Envisioned Applications

* Aerospace & military applications
  - Modular, low-power design lends itself well to military craft and munitions deployment
  - FPGAs providing high-performance radar, sonar, and other computational capabilities
* Scientific field operations
  - Quickly provide first-level estimations in the field for geologists, biologists, etc.
* Field-deployable covert operations
  - Completely wireless device enabled through battery, WLAN
  - Passive network monitoring applications
  - Active network traffic injection
* Distributed computing
  - Cost-effective, RC-enabled clusters or cluster resources
  - Cluster NARC devices at a fraction of the cost, power, and cooling
* Cost-effective intelligent sensor networks
  - Use FPGAs in close conjunction with sensors to provide pre-processing functions before network transmission
* High-performance network technologies
  - Fast Ethernet may be replaced by any network technology
  - Gig-E, InfiniBand, RapidIO, proprietary communication protocols







NARC Board Architecture: Hardware

* ARM9 network control with FPGA processing power (see Figure 1)
  - Prototype design consists of two boards, connected via cable:
    - Network interface board (ARM9 processor + peripherals)
    - Xilinx development board (FPGA): Xilinx HW-AFX-BG560-100
  - Network interface peripherals include:
    - Layer-2 network connection (hardware PHY + MAC)
    - External memory, SDRAM and Flash
    - Serial port (debug communication link)
    - FPGA control and data lines
* NARC hardware specifications:
  - ARM-core microcontroller, 1.8V core, 3.3V peripherals
    - 32-bit RISC, 5-stage pipeline, in-order execution
    - 16KB data cache, 16KB instruction cache
    - Core clock speed 180MHz, peripheral clock 60MHz
    - On-chip Ethernet MAC layer with DMA
  - External memory, 3.3V
    - 32MB SDRAM, 32-bit data bus
    - 2MB Flash, 16-bit data bus
    - Port available for additional 16-bit SRAM devices
  - Ethernet transceiver, 3.3V
    - DM9161 PHY-layer transceiver
    - 100Mbps, full-duplex capable
    - RMII interface to MAC

Figure 1: Block diagram of NARC device (ARM network interface board with RJ-45 network connection, linked to the FPGA board)








NARC Board Architecture: Software

* ARM processor runs Linux kernel 2.4.19
  - Provides TCP(UDP)/IP stack, resource management, threaded execution, and a Berkeley Sockets interface for applications
  - Configured and compiled with drivers specifically for our board
* Applications written in C, compiled using the GCC compiler for ARM (see Figure 2)
* NARC API: low-level driver function library for basic services
  - Initialize and configure on-chip peripherals of the ARM-core processor
  - Configure the FPGA (SelectMAP protocol)
  - Transfer data to/from the FPGA, manipulate control lines
  - Monitor and initiate network traffic
* NARC protocol for job exchange (from a remote workstation)
  - NARC board application and client application must follow standard rules and procedures for responding to requests from a user
  - User appends a small header onto the data (if any) containing information about the request before sending it over the network (see Figure 3)
* Bootstrap software in on-board Flash automatically loads and executes on power-up
  - Configures clocks, memory controllers, I/O pins, etc.
  - Contacts a TFTP server running on the network; downloads Linux and a ramdisk
  - Boots Linux, then automatically executes the NARC board software contained in the ramdisk
* Optional serial interface through HyperTerminal for debugging/development

Figure 2: Software development process

Figure 3: Request header field definitions

    Field       Size      Meaning
    RTYPE       1 byte    00 = request status; 01 = configure FPGA;
                          02-FE = user-definable functions; FF = reboot board
    Job ID      1 byte    Unique identifier of request, included with response
    Undefined   2 bytes   (reserved)
    Data Size   4 bytes   Size of payload data in bytes
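
For concreteness, a minimal C sketch of the 8-byte request header of Figure 3 follows; the struct and field names are our own illustration (only the field sizes and RTYPE codes come from the slides), and it assumes the on-wire layout has no padding:

    #include <stdint.h>

    /* NARC request header (Figure 3): 8 bytes preceding any payload.
       Names are illustrative; sizes and RTYPE codes are from the slides. */
    #define RTYPE_STATUS     0x00   /* request status */
    #define RTYPE_CONFIGURE  0x01   /* configure FPGA (bitfile + descriptor) */
    /* 0x02-0xFE: user-definable functions */
    #define RTYPE_REBOOT     0xFF   /* reboot board */

    struct narc_request_hdr {
        uint8_t  rtype;       /* request type code (see above) */
        uint8_t  job_id;      /* unique request ID, echoed in the response */
        uint16_t undefined;   /* reserved, 2 bytes */
        uint32_t data_size;   /* payload size in bytes */
    };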










NARC Board Architecture: FPGA Interface

* Data communicated to/from the FPGA by means of unidirectional data paths
  - 8-bit input port, 8-bit output port, 8 control lines (Figure 4)
  - Control lines manage data transfer and also drive configuration signals (PROG, INIT, CS, WRITE, DONE)
  - Data transferred one byte at a time; full-duplex communication possible
  - Control lines include the following signals (a handshake sketch follows Figure 4):
    - Clock: software-generated signal to clock data on the data ports
    - Reset: reset signal for the interface logic in the FPGA
    - Ready: signal indicating the device is ready to accept another byte of data
    - Valid: signal indicating the device has placed valid data on the port
    - SelectMAP: all signals necessary to drive SelectMAP configuration

Figure 4: FPGA interface signal diagram (ARM and FPGA joined by In[0:7] and Out[0:7] data ports, f_ready/a_valid and a_ready/f_valid handshake pairs, plus clock and reset)
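
The ready/valid handshake with a software-generated clock can be pictured with the following C sketch of an ARM-side byte send; the GPIO helpers and pin names are hypothetical placeholders, since the slides do not expose the actual NARC API:

    #include <stdint.h>

    /* Hypothetical pin identifiers and GPIO helpers; the actual NARC
       driver functions are not shown in the slides. */
    enum { CLOCK = 0, A_VALID = 1, F_READY = 2 };
    extern const int OUT_PINS[8];            /* Out[0:7] data port */
    extern void gpio_set(int pin, int level);
    extern int  gpio_read(int pin);
    extern void gpio_write_byte(const int pins[8], uint8_t value);

    /* Send one byte from ARM to FPGA using the ready/valid handshake
       with the software-generated clock described above. */
    void fpga_send_byte(uint8_t byte)
    {
        while (!gpio_read(F_READY))          /* wait until FPGA can accept */
            ;
        gpio_write_byte(OUT_PINS, byte);     /* drive data onto Out[0:7] */
        gpio_set(A_VALID, 1);                /* ARM asserts valid */
        gpio_set(CLOCK, 1);                  /* software-generated clock edge */
        gpio_set(CLOCK, 0);
        gpio_set(A_VALID, 0);                /* deassert for the next byte */
    }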


* FPGA configuration through the SelectMAP protocol (see the sketch after this list)
  - Fastest configuration option for Xilinx FPGAs; protocol emulated using GPIO pins of the ARM
  - NARC board enables remote configuration and management of the FPGA
    - User submits a configuration request (RTYPE = 01), along with a bitfile and function descriptor
    - Function descriptor is an ASCII string: a formatted list of functions with associated RTYPE definitions
    - ARM halts and configures the FPGA, stores the descriptor in a dedicated RAM buffer for user queries
  - All FPGA designs must restrict use of all SelectMAP pins after configuration
    - Some signals are shared between the SelectMAP port and the FPGA-ARM link
    - Once configured, SelectMAP pins must remain tri-stated and unused
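
For illustration, a bit-banged SelectMAP load over GPIO might look like the sketch below; it follows the standard Xilinx SelectMAP sequence (PROG pulse, wait for INIT, clock bytes with CS/WRITE asserted, wait for DONE) and reuses the hypothetical GPIO helpers from the earlier sketch, with timeouts and error handling omitted:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical pin identifiers for the SelectMAP signals. */
    enum { PROG_B = 3, INIT_B = 4, CS_B = 5, WRITE_B = 6, CCLK = 7, DONE = 8 };
    extern const int DATA_PINS[8];             /* SelectMAP D[0:7] */
    extern void gpio_set(int pin, int level);
    extern int  gpio_read(int pin);
    extern void gpio_write_byte(const int pins[8], uint8_t value);

    void fpga_configure(const uint8_t *bitfile, size_t len)
    {
        gpio_set(PROG_B, 0);                   /* pulse PROG low to clear config */
        gpio_set(PROG_B, 1);
        while (!gpio_read(INIT_B))             /* wait until device is ready */
            ;
        gpio_set(CS_B, 0);                     /* select device (active low) */
        gpio_set(WRITE_B, 0);                  /* write mode (active low) */
        for (size_t i = 0; i < len; i++) {
            gpio_write_byte(DATA_PINS, bitfile[i]);
            gpio_set(CCLK, 1);                 /* byte latched on rising edge */
            gpio_set(CCLK, 0);
        }
        gpio_set(CS_B, 1);
        gpio_set(WRITE_B, 1);
        while (!gpio_read(DONE))               /* DONE high: configuration done */
            ;
    }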









Results and Analysis: Raw Performance


* FPGA interface I/O throughput (Table 1)
  - 1 KB of data transferred over the link, timed
  - Measured using hardware methods: a logic analyzer captures the raw link data rate; divide data sent by the time from first clock to last clock (see Figure 9)
  - Performance lower than desired for the prototype; the handshake protocol may add unnecessary overhead
  - Widening data paths and optimizing the software routine will significantly improve FPGA I/O performance
* Network throughput (Table 2)
  - Measured using the Linux network benchmark IPerf (a typical invocation is shown below)
    - NARC board located on an arbitrary switch within the network; the application partner is a user workstation
    - Transfers as much data as possible in 10 seconds; throughput calculated as data sent divided by 10 seconds
  - Performed two experiments, with the NARC board serving as client in one run and as server in the other
  - Both local and remote IPerf partners (remote location ~400 miles away, at Florida State University)
  - Network interface achieves reasonably good bandwidth efficiency
* External memory throughput (Table 3)
  - 4KB transferred to external SDRAM, both read and write
  - Measurements again taken using the logic analyzer
  - Memory throughput sufficient to provide wire-speed buffering of network traffic
    - On-chip Ethernet MAC has DMA to this SDRAM
    - Should help alleviate the I/O bottleneck between ARM and FPGA
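
For reference, the classic IPerf client/server invocation matching this methodology looks like the following; the exact flags used in the experiments are not given in the slides:

    narc$ iperf -s                    # server side, awaits connections
    user$ iperf -c <narc-ip> -t 10    # client side, 10-second transfer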


Figure 9: Logic analyzer timing capture (input and output transfers)

Table 1: FPGA interface I/O throughput (Mbps)

    Input    Output
    6.08     6.12

Table 2: Network throughput (Mbps)

                   Local Network    Remote Network (WAN)
    NARC-Server    75.4             4.9
    Server-Server  78.9             5.3

Table 3: External SDRAM throughput (values not recoverable)







Results and Analysis: Raw Performance



* Reconfiguration speed
  - Includes time to transfer the bitfile over the network, plus time to configure the device (transfer the bitfile from ARM to FPGA), plus time to receive the acknowledgement
  - Our design currently completes a user-initiated reconfiguration request with a 1.2MB bitfile in 2.35 sec

* Area/resource usage of minimal wrapper for Virtex-II Pro FPGA
  - Statistics on the resource requirements for a minimal design providing the required link control and data transfer in an application wrapper are presented below:
    - Design implemented on an older Virtex-II Pro FPGA
    - Numbers indicate requirements for the wrapper only; unused resources remain available for user applications
    - Extremely small footprint! Footprint will be even smaller on a larger FPGA

    Device utilization summary:
    ---------------------------
    Selected Device : 2vp20ff1152-5

    Number of Slices:            143 out of  9280    1%
    Number of Slice Flip Flops:  120 out of 18560    0%
    Number of 4 input LUTs:      238 out of 18560    1%
    Number of bonded IOBs:        24 out of   564    4%
    Number of BRAMs:               8 out of    88    9%
    Number of GCLKs:               1 out of    16    6%
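
As a rough sanity check on these numbers (our arithmetic, not from the slides): a 1.2MB bitfile in 2.35 sec is an effective rate of about 0.51 MB/s, or roughly 4 Mbps. Moving ~9.6 Mbits over the ~6 Mbps ARM-FPGA link of Table 1 alone would take about 1.6 sec, with the network transfer and acknowledgement plausibly accounting for much of the remainder.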








Case Study Applications

* Clustered RC Devices: N-Queens
  - HPC application demonstrating the NARC board's role as a generic compute resource
    - Application characterized by minimal communication and heavy computation within the FPGA
    - NARC version of N-Queens adapted from a previously implemented application for the PCI-based Celoxica RC1000 board housed in a conventional server
    - The N-Queens algorithm is part of the DoD high-performance computing benchmark suite and is representative of select military and intelligence processing algorithms
  - Exercises the functionality of various developed mechanisms and protocols for job submission, data transfer, etc. on NARC
    - User specifies a single parameter N; upon completion the algorithm returns the total number of possible solutions (a software reference sketch follows)
    - Purpose of the algorithm is to determine how many possible arrangements of N queens there are on an N x N chess board such that no queen may attack another (see Figure 5)
  - Results are presented from both NARC-based execution and RC1000-based execution for comparison

Figure 5: A possible 8x8 solution (image c/o Jeff Somers)
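
For reference, the computation itself is the classic backtracking count; the short C program below is a plain software equivalent (our illustration, not the FPGA design, whose internals the slides do not describe):

    #include <stdio.h>

    /* Count N-Queens solutions by backtracking: place one queen per row,
       tracking attacked columns and diagonals as bitmasks. */
    static long count_queens(int n, int row, unsigned cols, unsigned ld, unsigned rd)
    {
        if (row == n)
            return 1;                            /* all N queens placed */
        long total = 0;
        for (int c = 0; c < n; c++) {
            unsigned bit = 1u << c;
            if (!(cols & bit) && !(ld & bit) && !(rd & bit))
                total += count_queens(n, row + 1, cols | bit,
                                      (ld | bit) << 1, (rd | bit) >> 1);
        }
        return total;
    }

    int main(void)
    {
        for (int n = 5; n <= 14; n++)            /* parameter range of Figure 10 */
            printf("N=%d: %ld solutions\n", n, count_queens(n, 0, 0, 0, 0));
        return 0;
    }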










Case Study Applications

* Network processing: Bloom Filter
  - This application performs passive packet analysis through use of a classification algorithm known as a Bloom filter
    - Application characterized by constant, bursty communication patterns
    - Most communication is Rx over the network, with transmission to the FPGA
    - Filter may be programmed or queried
  - NARC device copies all received network frames to memory; the ARM parses the TCP/IP header and sends it to the Bloom filter for classification
    - User can send programming requests, which include a header and a string to be programmed into the filter
    - User can also send result-collection requests, which cause a formatted results packet to be sent back to the user
    - Otherwise, the application runs constantly, querying each header against the current Bloom filter and recording match/header pair information
  - A Bloom filter works by applying multiple hash functions to a given bit string, each hash function rendering indices into a separate bit vector (see Figure 6 and the sketch after this list)
    - To program: hash the input string and set the resulting bit positions to 1
    - To query: hash the input string; if all resulting bit positions are 1, the string matches
  - Implemented on Virtex-II Pro FPGA
    - Uses a slightly larger, but ultimately more effective, application wrapper (see Figure 7)
    - Larger FPGA selected to demonstrate interoperability with any FPGA
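
A minimal software sketch of the program/query behavior described above follows; the hash functions (seeded FNV-1a) and vector sizes are our own choices for illustration, since the slides do not specify the hardware's hash functions:

    #include <stdint.h>
    #include <stddef.h>

    #define K_HASHES  4                     /* one bit vector per hash function */
    #define VEC_BITS  1024                  /* illustrative vector size */

    static uint8_t vectors[K_HASHES][VEC_BITS / 8];

    static uint32_t hash_fnv(const uint8_t *s, size_t len, uint32_t seed)
    {
        uint32_t h = 2166136261u ^ seed;    /* FNV-1a, seeded per hash function */
        for (size_t i = 0; i < len; i++) {
            h ^= s[i];
            h *= 16777619u;
        }
        return h % VEC_BITS;                /* index into this function's vector */
    }

    /* Program: set the hashed bit position in each vector. */
    void bloom_program(const uint8_t *s, size_t len)
    {
        for (uint32_t k = 0; k < K_HASHES; k++) {
            uint32_t i = hash_fnv(s, len, k);
            vectors[k][i / 8] |= (uint8_t)(1u << (i % 8));
        }
    }

    /* Query: match only if the hashed bit is set in every vector. */
    int bloom_query(const uint8_t *s, size_t len)
    {
        for (uint32_t k = 0; k < K_HASHES; k++) {
            uint32_t i = hash_fnv(s, len, k);
            if (!(vectors[k][i / 8] & (1u << (i % 8))))
                return 0;                   /* definitely not programmed */
        }
        return 1;                           /* match (false positives possible) */
    }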


Figure 6: Bloom Filter algorithmic architecture

Figure 7: Bloom Filter implementation architecture (NARC board, ARM, and FPGA attached to the network)






Experimental Setup

* N-Queens: clustered RC devices
  - NARC device located on an arbitrary switch in the network
  - User interfaces through a client application on an RC-enabled workstation and requests the N-Queens procedure
    - Figure 8 illustrates the experimental environment
    - Client application records the time required to satisfy the request (a timing sketch follows Figure 8)
    - Power supply measures current draw of the active NARC device
  - N-Queens also implemented on an RC-enabled server equipped with a Celoxica RC1000 board
    - Client-side function call to the NARC board replaced with a function call to the RC1000 board in the local workstation, with the same timing measurement
    - Comparison offered in terms of performance, power, and cost
* Bloom Filter: network processing
  - Same experimental setup as the N-Queens case study
  - Software on the ARM co-processor captures all Ethernet frames
    - Only packet headers (TCP/IP) are passed to the FPGA
    - Data continuously sent to the FPGA as packets arrive over the network
  - By attaching the NARC device to a switch, only limited packets can be captured
    - Only broadcast packets and packets destined for the NARC device can be seen
    - A dual-port device could be inserted in-line with a network link to monitor all flow-through traffic

Figure 8: Experimental environment (user workstation, NARC devices, and servers attached to an Ethernet network)
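
A sketch of how the client-side timing could be taken is shown below; narc_submit_and_wait is a hypothetical stand-in for the client's blocking request call, not the actual API:

    #include <sys/time.h>

    extern void narc_submit_and_wait(int rtype, const void *data, unsigned len);

    /* Return the round-trip time of one request, in seconds. */
    double time_request(int rtype, const void *data, unsigned len)
    {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        narc_submit_and_wait(rtype, data, len);   /* blocks until response */
        gettimeofday(&t1, NULL);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    }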







Results and Analysis: N-Queens Case Study


* First, consider an execution time comparison between our NARC board and a PCI-based RC card (see Figures 10a and 10b)
  - Both FPGA designs clocked at 50MHz
  - Performance difference between the devices is minimal
* Being able to match the performance of a PCI-based card is a resounding success!
  - Power consumption and cost of NARC devices are drastically lower than those of server-plus-RC-card combos
  - Multiple users may share a NARC device, whereas PCI-based cards are somewhat fixed in an individual server
* Power consumption calculated using the following method
  - Three regulated power supplies exist in the complete NARC device (network interface + FPGA board): 5V, 3.3V, 2.5V
  - Current draw from each supply was measured
  - Power consumption is calculated as the sum of the V x I products of all three supplies


Figure 10: N-Queens execution time comparison between NARC board and PCI-based RC card (RC-1000) on a server; (a) small board sizes, N = 5-10, (b) large board sizes, N = 11-14








Results and Analysis: N-Queens Case Study

* Figure 11 summarizes the NARC/RC-1000 performance ratio of N-Queens between the two platforms
* Consider Table 4 for a summary of cost and power statistics
  - Unit price shown excludes the cost of the FPGA
    - FPGA costs offset when compared against another FPGA-based device
    - Price shown includes PCB fabrication and component costs
  - Approximate power consumption is drastically less than a server + RC-card combo
    - Power consumption of a server varies depending on the particular hardware
    - Typical servers operate off of 200-400W power supplies
* See Figure 12 for an example of the approximate power consumption calculation


Figure 11: NARC/RC-1000 performance ratio vs. algorithm parameter N (N = 5-14), with equivalency line shown

Table 4: Price and power figures for the NARC device

    Cost per unit (prototype)     $175.00
    Approx. power consumption     3.28 W

Figure 12: Power consumption calculation

    P = (5V)(I5) + (3.3V)(I3.3) + (2.5V)(I2.5)
    I5 = 0.2A; I3.3 = 0.49A; I2.5 = 0.27A
    P = (5)(0.2) + (3.3)(0.49) + (2.5)(0.27) = 3.28W









Results and Analysis: Bloom Filter


* Passive, continuous network traffic analysis
  - Wrapper design slightly larger than the previous minimal wrapper used with N-Queens
    - Still a small footprint on the chip; the majority of the FPGA remains for the application
    - Maximum wrapper clock frequency of 183 MHz should not limit the application clock if in the same clock domain
  - Packets received over the network link are parsed by the ARM, with the TCP/IP header saved in a buffer
  - Headers sent one at a time as query requests to the Bloom Filter (FPGA); when a query finishes, another header is de-queued if available
    - User may query the NARC device at any time for a results update or to program a new pattern
  - Figure 13 shows resource usage for the Virtex-II Pro FPGA
  - Maximum clock frequency of 113MHz
    - Not affected by the wrapper constraint
    - Computation speed is significantly faster than the FPGA-ARM link communication speed
  - FPGA-side buffer will not fill up; headers are processed before the next header is transmitted to the FPGA
  - ARM-side buffer may fill up under heavy traffic loads
    - 32MB ARM-side RAM gives a large buffer


Device utilization summary:
-------------------------------------------------------
Selected Device : 2vp20ff1152-5

Number of Slices: 1174 out of 9280 13%
Number of Slice Flip Flops: 1706 out of 18560 9%
Number of 4 input LUTs: 2032 out of 18560 11%
Number of bonded IOBs: 24 out of 564 4%
Number of BRAMs: 9 out of 88 10%
Number of GCLKs: 1 out of 16 6%


Figure 13 Device utilization statistics for Bloom Filter design








Pitfalls and Lessons Learned


* FPGA I/O throughput capacity remains a persistent problem
  - One motivation for designing custom hardware is to remove the typical PCI bottleneck and provide wire-speed network connectivity for the FPGA
  - An under-provisioned data path between the FPGA and the network interface restricts the performance benefits of our prototype design
  - Luckily, this problem may be solved through a variety of approaches:
    - Wider data paths (16-bit, 32-bit) double or quadruple throughput, at the expense of a higher pin count
    - Use of a higher-performance co-processor capable of faster I/O switching frequencies
    - An optimized data transfer protocol

* Having a co-processor in addition to the FPGA to handle the network interface is vital to the success of our approach
  - Required in order to permit initial remote configuration of the FPGA, as well as additional reconfigurations upon user request
  - Offloading the network stack, basic request handling, and other maintenance-type tasks from the FPGA saves a significant number of valuable slices for user designs
  - Drastically eases interfacing with the user application on a networked workstation
  - Serves as an active co-processor for FPGA applications, e.g. parsing network packets as in the Bloom Filter application







Conclusions

* A novel approach to providing FPGAs with standalone network connectivity has been prototyped and successfully demonstrated
  - Investigated issues critical to providing remote management of standalone NARC resources
  - Proposed and demonstrated solutions to the discovered challenges
  - Performed a pair of case studies with two distinct, representative applications for a NARC device
* Network-attached RC devices offer potential benefits for a variety of applications
  - Impressive cost and power savings over server-based RC processing
  - Independent NARC devices may be shared by multiple users without physically moving hardware
  - Tightly coupled network interface enables the FPGA to be used directly in the path of network traffic for real-time analysis and monitoring
* Two issues that are proving to be a challenge to our approach:
  - Data latency in FPGA communication
  - Software infrastructure required to achieve a robust standalone RC unit
* While the prototype design achieves relatively good performance in some areas and limited performance in others, this is acceptable for concept demonstration
  - Fairly complex board design; architecture and software enhancements in development
  - As proof of the "NARC" concept, an important goal of the project was achieved in the demonstration of an effective and efficient infrastructure for managing NARC devices







Future Work

* Expansion of network processing capabilities
  - Further development of the packet filtering application
    - More specific and practical activity or behavior sought from network traffic
    - Analyze streaming packets at or near wire-speed rates
  - Expansion of the Ethernet link to a 2-port hub
    - Permits transparent insertion of the device into a network path
    - Provides easier access to all packets in a switched IP network
* Merging the FPGA with the ARM co-processor and network interface into one device
  - Ultimate vision for the NARC device
  - Will restrict the number of different FPGAs which may be supported, according to the chosen FPGA socket/footprint for the board
  - Increased difficulty in PCB design
* Expansion to Gig-E, other network technologies
  - Fast Ethernet targeted for the prototyping effort and concept demonstration
  - A true high-performance device should support Gigabit Ethernet
  - Other potential technologies include (but are not limited to) InfiniBand, RapidIO
* Further development of management infrastructure
  - Need for more robust control/decision-making middleware
  - Automatic device discovery, concurrent job execution, fault-tolerant operation



