Title: Experimental analysis of multi-FPGA architectures over RapidIO for space-based radar processing
Permanent Link: http://ufdc.ufl.edu/UF00094721/00001
 Material Information
Title: Experimental analysis of multi-FPGA architectures over RapidIO for space-based radar processing
Physical Description: Book
Language: English
Creator: Conger, Chris
Bueno, David
George, Alan D.
Publisher: Conger et al.
Place of Publication: Gainesville, Fla.
Publication Date: September 20, 2006
Copyright Date: 2006
 Notes
General Note: Presented at the 10th Annual High Performance Embedded Computing (HPEC) Workshop, 19-21 September 2006
 Record Information
Bibliographic ID: UF00094721
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.


Full Text
Experimental Analysis of Multi-FPGA Architectures over RapidIO
for Space-Based Radar Processing

High Performance Embedded Computing (HPEC) Workshop, 19-21 September 2006

Chris Conger, David Bueno, and Alan D. George

HCS Research Laboratory
College of Engineering
University of Florida
www.hcs.ufl.edu

20 September 2006






Project Overview

* Considering advanced architectures for on-board satellite processing
  o Reconfigurable components (e.g., FPGAs)
  o High-performance, packet-switched interconnect
* Sponsored by Honeywell Electronic Systems Engineering & Applications
  o RapidIO as candidate interconnect technology
  o Ground Moving Target Indicator (GMTI) case study application
* Design a working prototype system on which to perform performance and feasibility analyses
* Experimental research, with focus on node-level design and memory-processor-interconnect interface architectures and issues
  o FPGAs for main processing nodes, parallel processing of radar data
  o Computation vs. communication: application requirements, component capabilities
  o Hardware-software co-design
  o Numerical format and precision considerations

[Satellite image courtesy [5]]









Background Information

* RapidIO (intra-system interconnect)
  o Three-layered, embedded system interconnect
  o Point-to-point, packet-switched connectivity
  o Peak single-link throughput ranging from 2 to 64 Gbps
  o Available in serial or parallel versions, in addition to message-passing or shared-memory programming models


[Figure: data-parallel vs. pipelined decomposition of GMTI across processing elements, showing pulse compression, Doppler processing, STAP, and CFAR detection applied to successive data cubes (CPIs). Image courtesy [6]]


* Space-Based Radar (SBR)
  o Space environment places tight constraints on the system
    - Frequency-limited, radiation-hardened devices
    - Power- and memory-limited
  o Streaming data for continuous, real-time processing of radar or other sensor data
    - Pipelined or data-parallel algorithm decomposition
    - Composed mainly of linear algebra and FFTs
  o Transposes or distributed corner turns of the entire data set are required, stressing the memory hierarchy (see the sketch below)
  o GMTI composed of several common kernels
    - Pulse compression, Doppler processing, CFAR detection
    - Space-Time Adaptive Processing (STAP) and beamforming
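To make the corner-turn requirement concrete, here is a minimal C sketch of a local transpose between range-major and pulse-major orderings for one channel of the data cube, using the cube dimensions from the experimental setup (1024 ranges, 128 pulses, 16 channels) and 16-bit complex samples. The in-memory layout and element type are illustrative assumptions; on the testbed this reorganization flows through the SDRAM/DMA data path rather than plain C loops.

#include <stdint.h>

#define RANGES   1024
#define PULSES   128
#define CHANNELS 16

typedef struct { int16_t re; int16_t im; } cplx16;  /* 16-bit fixed-point pair, 32 bits/element */

/* Reorder one channel from range-major (contiguous range vectors, as used by
 * pulse compression and CFAR) to pulse-major (contiguous pulse vectors, as
 * used by Doppler processing). The long-stride writes are what stress the
 * memory hierarchy during a corner turn. */
static void corner_turn(const cplx16 in[PULSES][RANGES],
                        cplx16 out[RANGES][PULSES])
{
    for (int p = 0; p < PULSES; p++)
        for (int r = 0; r < RANGES; r++)
            out[r][p] = in[p][r];
}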






Testbed Hardware

* Custom-built hardware testbed, composed of:
  o Xilinx Virtex-II Pro FPGAs (XC2VP20-FF1152-6), RapidIO IP cores
  o 128 MB SDRAM (8 Gbps peak memory bandwidth per node)
  o Custom-designed PCBs for enhanced node capabilities
  o Novel processing node architecture (HDL)
* Performance measurement and debugging with:
  o 500 MHz, 80-channel logic analyzer
  o UART connection for file transfer


Note: While we prefer to work with existing hardware, if the need arises we have the ability to design custom hardware.


[Figures: RapidIO switch PCB; RapidIO switch PCB layout]




[Figure: RapidIO testbed, showing two nodes directly connected via RapidIO, as well as logic analyzer connections]






Node Architecture

* All processing performed via hardware engines; control performed with embedded PowerPC
  o PowerPC interfaces with the DMA engine to control memory transfers
  o PowerPC interfaces with the processing engines to control processing tasks
  o Custom software API permits application development (see the hypothetical usage sketch below)
* Visualize the node design as a triangle of communicating elements:
  o External memory controller
  o Processing engine(s)
  o Network controller
* Parallel data paths (FIFOs and control logic) allow concurrent operations from different sources
  o Locally-initiated transfers completely independent of incoming, remotely-initiated transfers
* Internal memory used for processing buffers (no external SRAM)

[Figure: conceptual diagram of the FPGA design (node architecture), showing the external memory interface built on a third-party SDRAM controller, the network interface controller built on a third-party RapidIO core with incoming and outgoing ports for remote and local requests, the on-chip memory controller, DMA controller, reset and clock generation, the PowerPC module, and the hardware processing modules]
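As a rough illustration of this control flow, the hypothetical C sketch below shows the PowerPC staging one buffer of data through the DMA engine and then triggering a co-processor engine. The function names and buffer addresses (dma_read_blocking, coproc_start, and so on) are assumptions for illustration, not the actual testbed API.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical control API, assumed to be provided by the node's software layer. */
extern void dma_read_blocking(uint32_t sdram_addr, uint32_t engine_addr, size_t bytes);
extern void dma_write_blocking(uint32_t engine_addr, uint32_t sdram_addr, size_t bytes);
extern void coproc_start(int engine_id, uint32_t task);
extern void coproc_wait(int engine_id);

#define CHUNK_BYTES (4u * 1024u)   /* one 4 KB processing chunk */

/* Move one chunk from external SDRAM into the engine's on-chip buffer,
 * run the engine over it, and move the result back out to SDRAM. */
void process_one_chunk(uint32_t src_sdram, uint32_t dst_sdram, int engine_id)
{
    dma_read_blocking(src_sdram, 0x0000u /* engine input buffer (assumed) */, CHUNK_BYTES);
    coproc_start(engine_id, 0u /* task-specific control word (assumed) */);
    coproc_wait(engine_id);
    dma_write_blocking(0x4000u /* engine output buffer (assumed) */, dst_sdram, CHUNK_BYTES);
}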











Processing Engine Architectures

* All co-processor engines wrapped in a standardized interface (single data port, single control port); a hypothetical memory-map sketch follows the diagrams below
* Up to 32 KB of dual-port SRAM internal to each engine
  o Entire memory space addressable from the external data port, with read and write capability
  o Internally, SRAM divided into multiple, parallel, independent read-only or write-only ports
* Diagrams below show two example co-processor engine designs, illustrating their similarities


[Figures: Pulse Compression co-processor architecture and CFAR co-processor architecture. Each engine uses four input and four output dual-port SRAM buffers, with Port B of each buffer facing the memory controller and Port A facing the engine's compute datapath (e.g., FFT/IFFT and vector-multiply stages)]
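As a software-side view of this standardized interface, the sketch below lays out a hypothetical 32 KB engine memory space split into four 4 KB input and four 4 KB output buffers, plus a small set of control registers. The struct names and register fields are assumptions, not the actual HDL interface.

#include <stdint.h>

#define ENGINE_BUF_BYTES  (4u * 1024u)   /* each internal buffer */
#define ENGINE_NUM_BUFS   4u             /* four input plus four output buffers */

/* Data port view: the full 32 KB SRAM space, addressable for reads and writes
 * from outside the engine, while the engine's datapath sees the buffers as
 * independent read-only or write-only ports. */
typedef struct {
    uint8_t in_buf[ENGINE_NUM_BUFS][ENGINE_BUF_BYTES];   /* filled via the data port  */
    uint8_t out_buf[ENGINE_NUM_BUFS][ENGINE_BUF_BYTES];  /* drained via the data port */
} engine_sram_t;                                          /* 32 KB total */

/* Control port view: a few command/status registers (assumed). */
typedef struct {
    volatile uint32_t command;   /* start / task select          */
    volatile uint32_t length;    /* number of elements to process */
    volatile uint32_t status;    /* busy / done flags             */
} engine_ctrl_t;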









Experimental Environment

* System and algorithm parameters

  Parameter          Value     Description
  Ranges             1024      Range dimension of data cube
  Pulses             128       Pulse dimension of data cube
  Channels           16        Channel dimension of data cube
  Proc. Frequency    100 MHz   PowerPC/co-processor engine clock frequency
  Mem. Frequency     125 MHz   Memory clock frequency
  Net. Frequency     250 MHz   RapidIO clock frequency
  Max System Size    2         Max number of FPGAs used experimentally
  Proc. SRAM Size    32 KB     Max SRAM internal to each processing engine
  FIFO Size (each)   8 KB      Size of FIFOs to/from SDRAM

* Numerical format
  o Signed-magnitude, fixed-point, 16-bit
  o Complex elements, for 32 bits per element
  o Bit layout and a conversion sketch are shown below


* Experimental steps
  o No high-speed input to the system, so data must be pre-loaded
  o XModem over UART provides file transfer between the testbed and the user workstation
  o User prepares measurement equipment and initiates processing after data is loaded through the UART interface
  o Processing completes relatively quickly; the output file is transferred back to the user
  o Post-analysis of output data and/or performance measurements


  Fixed-point word layout: [ sign (1 bit) | integer bits (7) | fraction bits (8) ]
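The sketch below illustrates this 16-bit signed-magnitude fixed-point word (1 sign bit, 7 integer bits, 8 fraction bits; two such words form one 32-bit complex element) with simple float conversion helpers. The helper names and saturation behavior are illustrative assumptions.

#include <stdint.h>
#include <math.h>

#define FRAC_BITS 8
#define SIGN_BIT  0x8000u   /* MSB holds the sign */
#define MAG_MASK  0x7FFFu   /* low 15 bits hold the magnitude (7.8 format) */

/* Float -> signed-magnitude fixed point, saturating at the largest
 * representable magnitude (about +/-127.996). */
static uint16_t to_fixed(float x)
{
    uint32_t mag = (uint32_t)lrintf(fabsf(x) * (float)(1 << FRAC_BITS));
    if (mag > MAG_MASK) mag = MAG_MASK;
    return (uint16_t)((x < 0.0f ? SIGN_BIT : 0u) | mag);
}

/* Signed-magnitude fixed point -> float. */
static float to_float(uint16_t w)
{
    float mag = (float)(w & MAG_MASK) / (float)(1 << FRAC_BITS);
    return (w & SIGN_BIT) ? -mag : mag;
}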








Results: Baseline Performance

* Data path architecture results in independent clock domains, as well as varied data path widths (a worked throughput calculation appears at the end of this slide)
  o SDRAM: 64-bit, 125 MHz (8 Gbps max theoretical)
  o Processors: 32-bit, 100 MHz (4 Gbps max theoretical)
  o Network: 64-bit, 62.5 MHz (4 Gbps max theoretical)
* Generic data transfer tests stress each communication channel and measure the actual throughputs achieved
* Notice that transfers never achieve over 4 Gbps
  o "A chain is only as strong as its weakest link"
  o Simulations of the custom SDRAM controller core alone suggest a maximum sustained* throughput of 6.67 Gbps


* Max. sustained* throughputs
  o SDRAM: 6.67 Gbps
  o Processor: 4 Gbps
  o Network: 3.81 Gbps




* Assumes sequential addresses, and that data/space is always available for writes/reads
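For reference, the "max theoretical" figures above are simply data path width times clock frequency. The small sketch below reproduces that arithmetic for the SDRAM and network channels; the helper name and units are illustrative only.

#include <stdio.h>

/* Max theoretical throughput in Gbps for a data path of the given width and
 * clock: width_bits * freq_MHz * 1e6 / 1e9. */
static double peak_gbps(double width_bits, double freq_mhz)
{
    return width_bits * freq_mhz * 1e6 / 1e9;
}

int main(void)
{
    printf("SDRAM   : %.2f Gbps\n", peak_gbps(64.0, 125.0));  /* 8.00 Gbps */
    printf("Network : %.2f Gbps\n", peak_gbps(64.0, 62.5));   /* 4.00 Gbps */
    return 0;
}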







Results: Kernel Execution Time

* Processing starts when all data is buffered
* No inter-processor communication during processing
* Double-buffering maximizes co-processor efficiency
* For each kernel, processing is done along one dimension
* Multiple "processing chunks" may be buffered at a time:
  o CFAR co-processor has 8 KB buffers; all others have 4 KB buffers
  o CFAR works along the range dimension (1024 elements, or 4 KB)
  o Implies 2 "processing chunks" processed per buffer by the CFAR engine
* Single co-processing engine kernel execution times for an entire data cube
  o CFAR only 15% faster than Doppler processing, despite 39% faster per-buffer execution time
  o Loss of performance for CFAR is due to under-utilization
  o Equation below models the execution time of an individual kernel processing an entire cube (using double-buffering); a C transcription follows it
    - Kernel execution time can be capped by processing time as well as by memory bandwidth
    - After a certain point, higher co-processor frequencies or more engines per node become pointless




[Chart: kernel execution times per data cube for PC (Pulse Compression), DP (Doppler Processing), and CFAR]


  T_kernel = 2 * T_trans + N * T_steady

  where
    T_trans      = DMA_blocking + MAX[DMA_blocking, PROC_buffer]
    T_steady     = 2 * MAX[2 * DMA_blocking, PROC_buffer]
    N            = general term for the number of iterations
    DMA_blocking = time to complete a blocking DMA transfer of M elements
    PROC_buffer  = time to process M elements of buffered data (depends on the task)
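A direct C transcription of this model, as reconstructed above, is shown below. Time units are whatever the measured DMA and processing times are expressed in; this is only a restatement of the formula, not testbed code.

/* Double-buffered kernel execution-time model from the equation above. */
static double max2(double a, double b) { return (a > b) ? a : b; }

/* dma_blocking: time for a blocking DMA transfer of one buffer of M elements
 * proc_buffer : time to process M buffered elements (kernel-dependent)
 * n           : number of steady-state iterations */
static double kernel_time(double dma_blocking, double proc_buffer, double n)
{
    double t_trans  = dma_blocking + max2(dma_blocking, proc_buffer);
    double t_steady = 2.0 * max2(2.0 * dma_blocking, proc_buffer);
    return 2.0 * t_trans + n * t_steady;
}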










Results: Data Verification

* Processed data inspected for correctness
  o Compared to a C version of the equivalent algorithm from Northwestern University & Syracuse University [7]
  o MATLAB also used for verification of the Doppler processing and pulse compression engines
* Expect a decrease in the accuracy of results due to the decrease in precision
  o Fixed-point vs. floating-point
  o 16-bit elements vs. 32-bit elements
* CFAR and Doppler processing results shown below, alongside "golden" or reference data
  o Pulse compression engine is very similar to Doppler processing; results omitted due to space limitations
* CFAR detections suffer significantly from loss of precision
  o 97 detected (some false), 118 targets present
  o More false positives where values are very small
  o More false negatives where values are very large
* Slight algorithm differences prevent direct comparison of Doppler processing results with [7]
  o MATLAB implementation and testbed were both fed a square wave as input
  o Aside from the expected scaling in testbed results, data skewing can be seen from the loss of precision




[Figures: testbed Doppler processing output vs. MATLAB Doppler processing output (plotted against pulse); CFAR input data with targets (range vs. pulse); CFAR detections from the testbed (range vs. pulse)]








Results: FPGA Resource Utilization

* FPGA resource usage table* (below)
  o Virtex-II Pro (2VP40) FPGA is the target device
  o Baseline design includes:
    - PowerPC, buses, and peripherals
    - RapidIO endpoint (PHY + LOG) and endpoint controller
    - SDRAM controller, FIFOs
    - DMA engine and control logic
    - Single CFAR co-processor engine
* Co-processor engine usage* (below)
  o Only real variable aspect of the design
  o Resource requirements increase with greater data precision


  HCS_CNF design (complete), baseline:

  Resource                 Used      Available   Utilization
  Occupied Slices          11,059    19,392      57%
  PowerPCs                 1         2           50%
  BlockRAMs                111       192         58%
  Global Clock Buffers     14        16          88%
  Digital Clock Managers   5         8           63%

  Equivalent gate count for design: 7,743,720

  * Resource numbers taken from the mapper report (post-synthesis)


  Co-processor engine (label illegible in source):

  Resource          Used    Available   Utilization
  Occupied Slices   1,094   19,392      5%
  BlockRAMs         16      192         8%

  Doppler Processing co-processor:

  Resource          Used    Available   Utilization
  Occupied Slices   2,349   19,392      12%
  BlockRAMs         14      192         7%
  MULT18X18s        16      192         8%

  Equivalent gate count for design: 1,071,565









Conclusions and Future Work

* Novel node architecture introduced and demonstrated
  o All processing performed in hardware co-processor engines
  o Apps developed in Xilinx's EDK environment using C; custom API enables control of hardware resources through software
* External memory (SDRAM) throughput at each node is critical for system performance in systems with hardware processing engines and an integrated high-performance network
* Pipelined decomposition may be better for this system, due to co-processor (under)utilization
  o If co-processor engines sit idle most of the time, why have them all in each node?
  o With sufficient memory bandwidth, multiple engines could be used concurrently
* Parallel data paths are a nice feature, at the cost of more complex control logic and higher potential development cost
  o Multiple request ports to the SDRAM controller improve concurrency, but do not remove the bottleneck
    - Different modules within the design can request and begin transfers concurrently through FIFOs
    - SDRAM controller can still only service one request at a time (assuming one external bank of SDRAM)
    - Benefit of parallel data paths decreases with larger transfer sizes or more frequent transfers
  o Parallel state machines/control logic take advantage of the FPGA's affinity for parallelism
  o Custom design, not standardized like buses (e.g., CoreConnect, AMBA, etc.)
* Some co-processor engines could be run at slower clock rates to conserve power without loss of performance
* 32-bit fixed-point numbers (possibly larger) required if not using floating-point processors
  o Notable error can be seen in processed data simply by visually comparing to reference outputs
  o Error will compound as data propagates through each kernel in a full GMTI application
  o Larger precision means more memory and logic resources are required, not necessarily slower clock speeds

* Future Research
  o Enhance testbed with more nodes, more stable boards, and Serial RapidIO
  o Complete Beamforming and STAP co-processor engines; demonstrate and analyze the full GMTI application
  o Enhance architecture with a direct data path between processing SRAM and the network interface
  o More in-depth study of precision requirements and error, along with performance/resource implications








Bibliography

[1] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, "Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar," Proc. High Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington, MA, Sep. 28-30, 2004.

[2] D. Bueno, A. Leko, C. Conger, I. Troxel, and A. George, "Simulative Analysis of the RapidIO Embedded Interconnect Architecture for Real-Time, Network-Intensive Applications," Proc. 29th IEEE Conf. on Local Computer Networks (LCN) via the IEEE Workshop on High-Speed Local Networks (HSLN), Tampa, FL, Nov. 16-18, 2004.

[3] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, "RapidIO-based Space Systems Architectures for Synthetic Aperture Radar and Ground Moving Target Indicator," Proc. High Performance Embedded Computing (HPEC) Workshop, MIT Lincoln Lab, Lexington, MA, Sep. 20-22, 2005.

[4] D. Bueno, C. Conger, and A. George, "RapidIO for Radar Processing in Advanced Space Systems," ACM Transactions on Embedded Computing Systems, to appear.

[5] http://www.noaanews.noaa.gov/stories2005/s2432.htm

[6] G. Shippen, "RapidIO Technical Deep Dive 1: Architecture & Protocol," Motorola Smart Network Developers Forum, 2003.

[7] A. Choudhary, W. Liao, D. Weiner, P. Varshney, R. Linderman, M. Linderman, and R. Brown, "Design, Implementation and Evaluation of Parallel Pipelined STAP on Parallel Computers," IEEE Trans. on Aerospace and Electronic Systems, vol. 36, pp. 528-548, April 2000.





