Title: Virtual prototyping and performance analysis of RapidIO-based system architectures for space-based radar
Permanent Link: http://ufdc.ufl.edu/UF00094755/00001
 Material Information
Title: Virtual prototyping and performance analysis of RapidIO-based system architectures for space-based radar
Physical Description: Book
Language: English
Creator: Bueno, David
Leko, Adam
Conger, Chris
Troxel, Ian A.
George, Alan D.
Publisher: Bueno et al.
Place of Publication: Gainesville, Fla.
Publication Date: September 24, 2004
Copyright Date: 2004
 Record Information
Bibliographic ID: UF00094755
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.


www.hcs.ufl.edu


Virtual Prototyping and Performance
Analysis of RapidIO-based System
Architectures for Space-Based Radar


David Bueno, Adam Leko, Chris Conger,
Ian Troxel, and Alan D. George

HCS Research Laboratory
College of Engineering
University of Florida


28 September 2004








Outline




I. Project Overview
II. Background
   I. RapidIO (RIO)
   II. Ground-Moving Target Indicator (GMTI)
III. Partitioning Methods
IV. Modeling Environment and Models
   I. Compute node and RIO endpoint models
   II. RapidIO switch model
   III. GMTI models
   IV. System and backplane model
V. Experiments and Results
   I. Result latency
   II. Switch memory utilization
   III. Parallel efficiency
VI. Conclusions











Project Overview


* Simulative analysis of Space-Based Radar (SBR) systems using
  RapidIO interconnection networks
  - RapidIO (RIO) is a high-performance, switched interconnect for
    embedded systems
  - Can scale to many nodes
  - Provides better bisection bandwidth than existing bus-based technologies
* Study optimal method of constructing scalable RIO-based
  systems for Ground Moving Target Indicator (GMTI)
  - Identify system-level tradeoffs in system designs
  - Discrete-event simulation of RapidIO network,
    processing elements, and GMTI algorithm
  - Identify limitations of RIO design for SBR
  - Determine effectiveness of various GMTI algorithm
    partitionings over RIO network


Image courtesy [1]







Background - RapidIO


* Three-layered, embedded system interconnect architecture
  - Logical: memory-mapped I/O, message passing, and globally shared memory
  - Transport
  - Physical: serial and parallel
* Point-to-point, packet-switched interconnect
* Peak single-link throughput ranging from 2 to 64 Gb/s
* Focus on 16-bit parallel LVDS RIO implementation for satellite systems

[Figure: RapidIO positioned as an intra-system interconnect, alongside inter-system interconnects]


















Background - GMTI


[Figure: GMTI processing flow - receive cube, corner turns between stages
partitioned along the range and pulse dimensions, send results]

* GMTI used to track moving targets on the ground
  - Estimated processing requirements range from
    40 (aircraft) to 280 (satellite) GFLOPS
* GMTI broken into four stages:
  - Pulse Compression (PC)
  - Doppler Processing (DP)
  - Space-Time Adaptive Processing (STAP)
  - Constant False-Alarm Rate detection (CFAR)
* Incoming data organized as 3-D matrix (data cube)
  - Data reorganization ("corner turn") necessary between stages for
    processing efficiency
  - Size of each cube dictated by Coherent Processing Interval (CPI)








GMTI Partitioning Methods - Straightforward


[Figure: timeline over one CPI - PEs #1-#4 each process every stage,
through STAP and CFAR, within a single CPI]


* Data cubes divided among all Processing Elements (PEs)
* Partitioned along optimal dimension for any particular stage
* Data reorganization between stages implies personalized all-to-all
  communication (corner turn) => stresses backplane links
* Minimal latency
  - Entire cube must be processed within one CPI to receive next cube
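The traffic pattern implied above can be sketched in a few lines: repartitioning the cube from the range dimension to the pulse dimension means each PE must exchange a distinct sub-block with every other PE. This is an illustrative sketch (not the authors' simulation code), assuming a uniform partition and 8 bytes per complex sample.

```python
# Hypothetical sketch: a corner turn repartitions the data cube from the
# range dimension to the pulse dimension, so every PE sends a distinct
# sub-block to every other PE -- a personalized all-to-all exchange.

def corner_turn_transfers(num_pes, ranges, pulses, beams, bytes_per_sample=8):
    """Return per-PE-pair transfer size (bytes) and total traffic moved."""
    cube_bytes = ranges * pulses * beams * bytes_per_sample
    # Before: each PE owns ranges/num_pes of the range dimension.
    # After:  each PE owns pulses/num_pes of the pulse dimension.
    # The intersection of PE i's old slab and PE j's new slab is one
    # (ranges/num_pes) x (pulses/num_pes) x beams sub-block.
    block_bytes = cube_bytes // (num_pes * num_pes)
    # Every ordered pair (i, j) with i != j exchanges one sub-block.
    total_bytes = block_bytes * num_pes * (num_pes - 1)
    return block_bytes, total_bytes

block, total = corner_turn_transfers(num_pes=28, ranges=48_000,
                                     pulses=256, beams=6)
```

Total traffic is nearly the whole cube regardless of PE count, which is why the corner turn stresses the backplane links rather than any single node.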









GMTI Partitioning Methods - Staggered


[Figure: timeline - data cubes 0-5 sent round-robin from the data source
to Processing Groups (PGs); each PG receives a new cube every N CPIs]


* Data cubes sent to groups of PEs in round-robin fashion
  - Limiting each Processing Group (PG) to a single board significantly
    reduces backplane bandwidth impact
* Time given to each PG to receive and process a data cube is N x CPI
  - N = number of processing groups
  - CPI = amount of time between generated data cubes
* Latency to produce result is higher than in straightforward partitioning
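The N x CPI budget above is simple to check numerically. A minimal sketch, with illustrative numbers rather than the measured results:

```python
# Hypothetical sketch: with staggered partitioning, cubes arrive every
# CPI seconds but are dealt round-robin to N processing groups, so each
# group has N * CPI seconds to receive and process one cube.

def staggered_deadline(num_groups, cpi_s):
    """Time budget per processing group for one data cube (seconds)."""
    return num_groups * cpi_s

def meets_realtime(process_time_s, num_groups, cpi_s):
    """A configuration keeps up if one group finishes within its budget."""
    return process_time_s <= staggered_deadline(num_groups, cpi_s)

# Example: 5 groups and a 256 ms CPI give each group 1.28 s per cube.
budget = staggered_deadline(5, 0.256)
```

The same relation shows the latency tradeoff: a result can be up to N CPIs old when it emerges, even though one result is still produced per CPI on average.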









GMTI Partitioning Methods - Pipelined


[Figure: PEs #1-#9 divided into stage groups (Pulse Compression, Doppler
Processing, STAP + CFAR); each CPI a new data cube enters the pipeline,
and results of the 1st cube are ready after three CPIs]


* Each PE group assigned to process a single stage of GMTI
  - Groups may have varying numbers of PEs depending upon processing
    requirements of each stage
* Potential for high cross-system bandwidth requirements
  - Irregular and less predictable traffic distribution
  - Frequent communication between different group sizes
* Latency to produce result is higher than with the straightforward method
  - One result emerges each CPI, but the results are three CPIs old
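The throughput/latency tradeoff of the pipeline reduces to the standard pipelining relation; a small sketch with assumed numbers (three stage groups, 256 ms CPI):

```python
# Hypothetical sketch: the cube passes through three stage groups
# (PC, DP, STAP+CFAR), each taking up to one CPI. Throughput stays at
# one result per CPI, but result latency is roughly the pipeline depth
# times the CPI.

def pipeline_metrics(num_stage_groups, cpi_s):
    throughput_results_per_s = 1.0 / cpi_s        # one result per CPI
    result_latency_s = num_stage_groups * cpi_s   # results are depth CPIs old
    return throughput_results_per_s, result_latency_s

rate, latency = pipeline_metrics(3, 0.256)  # three groups, 256 ms CPI
```

With these assumed numbers the pipeline sustains the input rate while each result emerges about 768 ms after its cube arrived.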







Model Library Overview

* Modeling library created using Mission Level Designer (MLD), a
  commercial discrete-event simulation modeling tool from MLDesign
  Technologies
  - C++-based, block-level, hierarchical modeling tool
* Algorithm modeling accomplished via script-based processing
  - All processing nodes read from a global script file to determine
    when/where to send data, and when/how long to compute
* Our model library includes:
  - RIO central-memory switch
  - Compute node with RIO endpoint
  - GMTI traffic source/sink
  - RIO logical message-passing layer
  - Transport and parallel physical layers

[Figure: model of compute node with RIO endpoint - processor with script
pointer, RIO interface, and RapidIO endpoint connected to the fabric,
with physical-layer statistics collection]
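MLD itself is a commercial tool, but the discrete-event principle it is built on fits in a few lines. This is an illustrative sketch of that principle only, not MLD's API:

```python
# Minimal discrete-event simulation loop: events are (timestamp,
# callback) pairs popped from a priority queue in time order, and
# callbacks may schedule further events.
import heapq

class Simulator:
    def __init__(self):
        self.now = 0.0
        self._queue = []   # (time, seq, callback) min-heap
        self._seq = 0      # tie-breaker for events at equal timestamps

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, cb = heapq.heappop(self._queue)
            cb(self)

log = []
sim = Simulator()
sim.schedule(2.0, lambda s: log.append(("recv", s.now)))
sim.schedule(1.0, lambda s: (log.append(("send", s.now)),
                             s.schedule(0.5, lambda s2: log.append(("ack", s2.now)))))
sim.run()
# Events fire in timestamp order: send @ 1.0, ack @ 1.5, recv @ 2.0
```

Simulated time jumps directly from event to event, which is what lets a model of a multi-board RapidIO system run much faster than cycle-accurate simulation.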










RapidIO Models


* Key features of endpoint model
  - Message-passing logical layer
  - Transport layer
  - Parallel physical layer
  - Transmitter- and receiver-controlled flow control
  - Error detection and recovery
  - Priority scheme for buffer management
  - Adjustable link speed and width
  - Adjustable priority thresholds and queue lengths
* Key features of central-memory switch model
  - Selectable cut-through or store-and-forward routing
  - High-fidelity TDM model for memory access
  - Adjustable priority thresholds based on free switch memory
  - Adjustable link rates, etc., similar to endpoint model

[Figure: model of RIO central-memory switch]
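The cut-through vs. store-and-forward option in the switch model has a first-order latency effect that is easy to sketch. The packet and header sizes below are assumed for illustration, not taken from the RapidIO models:

```python
# Hypothetical sketch: first-bit-in to last-bit-out latency of a packet
# crossing `hops` switches, ignoring queueing and wire delay.

def store_and_forward_latency(packet_bytes, link_bytes_per_s, hops):
    serialization = packet_bytes / link_bytes_per_s
    # The whole packet is buffered and re-serialized on every link.
    return (hops + 1) * serialization

def cut_through_latency(packet_bytes, header_bytes, link_bytes_per_s, hops):
    serialization = packet_bytes / link_bytes_per_s
    header_time = header_bytes / link_bytes_per_s
    # Forwarding begins once the header arrives, so each switch adds
    # only the header time.
    return serialization + hops * header_time

# Assumed 256-byte packet over 1 GB/s links through 2 switches:
sf = store_and_forward_latency(256, 1e9, 2)   # 768 ns
ct = cut_through_latency(256, 8, 1e9, 2)      # 272 ns
```

Under contention the advantage shrinks, since a cut-through switch must fall back to buffering when the output port is busy, which is where the central-memory model matters.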







GMTI Processor Board Models


* System contains many processor boards connected via backplane
* Each processor board contains one RIO switch and four processors
* Processors modeled with three-stage finite state machine
  - Send data
  - Receive data
  - Compute
* Behavior of processors controlled with script files
  - Script generator converts high-level GMTI parameters to script
  - Script is fed into simulations

[Figure: four-processor board model - compute node ASICs attached to an
8-port central-memory switch]

[Diagram: GMTI & system parameters -> script generator -> processor
script (send ..., receive ..., compute ...) -> simulation]
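A script-driven node of the kind described above can be sketched as a tiny interpreter. The command names, link rate, and timing here are invented for illustration; the real models run inside MLD:

```python
# Hypothetical sketch of a script-driven processor model cycling through
# its send / receive / compute states.

def run_script(script_lines):
    """Interpret send/receive/compute commands; return (elapsed seconds,
    bytes sent). Timing is illustrative, not calibrated."""
    LINK_BYTES_PER_S = 1e9          # assumed 1 GB/s RapidIO link
    elapsed, bytes_sent = 0.0, 0
    for line in script_lines:
        op, arg = line.split()
        if op == "send":            # send <bytes> to the network
            bytes_sent += int(arg)
            elapsed += int(arg) / LINK_BYTES_PER_S
        elif op == "receive":       # receive <bytes> from the network
            elapsed += int(arg) / LINK_BYTES_PER_S
        elif op == "compute":       # busy for <seconds>
            elapsed += float(arg)
    return elapsed, bytes_sent

t, sent = run_script(["receive 1000000", "compute 0.010", "send 500000"])
```

Keeping the behavior in a script rather than in the model itself is what lets one set of node models replay many GMTI parameterizations without recompilation.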






System Design Constraints


* 16-bit parallel 250 MHz DDR RapidIO links (1 GB/s)
  - Expected radiation-hardened component performance by the time RIO and
    SBR are ready to fly in ~2008 to 2010
* Systems composed of processor boards interconnected by RIO backplane
  - 4 processors per board
  - 8 Floating-Point Units (FPUs) per processor
  - One 8-port central-memory switch per board; implies 4 connections to
    backplane per board
* Baseline GMTI algorithm parameters:
  - Data cube: 64k ranges, 256 pulses, 6 beams
  - CPI = 256 ms
  - Requires ~3 GB/s of aggregate throughput from source to sink to meet
    real-time constraints
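The numbers above can be checked directly. This sketch assumes 8 bytes per complex sample, which is the width that makes the cube size line up with the stated ~3 GB/s requirement:

```python
# Checking the link rate and the source-to-sink throughput requirement.

def link_rate_bytes_per_s(width_bits, clock_hz, ddr=True):
    transfers = clock_hz * (2 if ddr else 1)   # DDR: both clock edges
    return width_bits // 8 * transfers

def required_rate_bytes_per_s(ranges, pulses, beams, bytes_per_sample, cpi_s):
    cube_bytes = ranges * pulses * beams * bytes_per_sample
    return cube_bytes / cpi_s                  # one cube must arrive per CPI

link = link_rate_bytes_per_s(16, 250e6)                    # 1 GB/s per link
need = required_rate_bytes_per_s(64 * 1024, 256, 6, 8, 0.256)
# need / link is about 3.1, hence the ~3 GB/s aggregate figure: the data
# source must stripe each cube across several 1 GB/s links.
```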






Backplane and System Models


* High throughput requirements for data source and corner turns require
non-blocking connectivity between all nodes and data sources


[Figure: 7-board system - boards 0-6 and the data source connected to a
4-switch non-blocking backplane]






Overview of Experiments


* Experiments conducted to evaluate strengths and weaknesses of
  each partitioning method
* Same switch backplane used for each experiment
* Varied data cube size
  - 256 pulses, 6 beams for all tests
  - Varied number of ranges from 32k to 64k
* Several system sizes used
  - Analysis determined that 7-board configuration necessary for
    straightforward method to meet deadline
  - Both 6- and 7-board configurations used for pipelined method
  - Staggered method does not benefit from a system larger than 5 boards
    with configuration used
    - Staggering performed with one processor board per group
    - Larger system configurations leave processors idle








Result Latency Comparison


[Chart: result latency (ms) vs. number of ranges (32k-64k) for
straightforward (7 boards), staggered (5 boards), and pipelined (6 and
7 boards) configurations]


* Result latency is the interval from data arrival until results reported
* Straightforward achieved lowest latency, but required the most
  processor boards
  - No result for 64k ranges because system could not meet real-time
    deadline
* Staggered requires fewest processor boards to meet deadline
  - Efficient system configuration, small communication groups
  - Tradeoff is result latency
* Pipelined method a compromise








Switch Memory Histogram with Straightforward Method


[Histogram: fraction of time vs. free switch memory (bytes, 0-16384) for
the 7-board, straightforward configuration at 48k ranges]

* Chart shows fraction of time free switch memory lies in each bracket
* Max switch memory is 16384 bytes
* Results taken from switch on processor board 1
  - All processor board switches see essentially identical memory usage
* ~90% of time is spent with switch memory ~80% free
  - Most predictable communication patterns, enabling effective static
    planning of comm. paths






Switch Memory Histogram with Staggered Method


[Histogram: fraction of time vs. free switch memory (bytes) for the
5-board, staggered configuration at 48k ranges]

* Staggered method uses slightly more memory over course of simulation
  - More data flows through a single switch during corner turn
  - Less spread in communication patterns than straightforward method
* More switch memory usage indicates more contention for a particular
  port, not necessarily more utilization or communication









Switch Memory Histogram with Pipelined Method


[Histogram: fraction of time vs. free switch memory (bytes) for the
pipelined configuration]


* Pipelined method stresses network
  - Irregular comm. patterns
  - Greater possibility for output port contention
  - Non-blocking network not helpful when multiple senders vying for
    same destination
* Difficult to plan out optimal comm. paths beforehand
  - Much synchronization required to stagger many-to-one communication,
    but not extremely costly in total execution time








Average Parallel Efficiency


[Bar chart: average parallel efficiency for straightforward (7 boards),
staggered (5 boards), and pipelined (6 and 7 boards) configurations]

* Parallel efficiency defined as sequential execution time (i.e., result latency)
  divided by N times the parallel execution time
  - N = number of processors that work on a single CPI
  - Pipelined efficiency a special case; must use N/3 for fair comparison (shown)
    since all processors do not work on a CPI at the same time
* Staggered method most efficient due to small communication groups and low
  number of processors working on same CPI
  - Straightforward method worst for opposite reason; pipelined method a compromise
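The metric defined above, including the N/3 adjustment for the pipelined case, fits in two small functions. The example values are illustrative, not the measured results:

```python
# Sketch of the parallel-efficiency metric: E = T_seq / (N * T_par).

def parallel_efficiency(sequential_time_s, parallel_time_s, n_procs):
    return sequential_time_s / (n_procs * parallel_time_s)

def pipelined_efficiency(sequential_time_s, parallel_time_s, n_procs):
    # Only ~1/3 of processors work on a given CPI at once, so N/3 is
    # used for a fair comparison.
    return parallel_efficiency(sequential_time_s, parallel_time_s, n_procs / 3)

eff = parallel_efficiency(10.0, 0.5, 28)   # illustrative: 10/(28*0.5) ~ 0.71
```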






Conclusions


* Developed suite of simulation models and mechanisms for
  evaluation of RapidIO designs for space-based radar
* Evaluated three partitioning methods for GMTI over a fixed RapidIO
  non-blocking network topology
* Straightforward partitioning method produced lowest result
  latencies, but least scalable
  - Unable to meet real-time deadline with our maximum data cube size
* Staggered partitioning method produced worst result latencies, but
  highest parallel efficiency
  - Also able to perform algorithm with fewest processing boards
  - Important for systems where power consumption and weight are concerns
* Pipelined partitioning method is a compromise in terms of latency,
  efficiency, and scalability, but heavily taxes network
* RapidIO provides feasible path to flight for space-based radar
  - Future work to focus on additional SBR variants (e.g. Synthetic Aperture
    Radar) and experimental RIO analysis









Bibliography


[1] http://www.afa.org/magazine/aug2002/0802radar.asp
[2] G. Shippen, "RapidIO Technical Deep Dive 1: Architecture & Protocol," Motorola Smart
Network Developers Forum, 2003.
[3] "RapidIO Interconnect Specification (Parts I-IV)," RapidIO Trade Association, June
2002.
[4] "RapidIO Interconnect Specification, Part VI: Physical Layer 1x/4x LP-Serial
Specification," RapidIO Trade Association, June 2002.
[5] M. Linderman and R. Linderman, "Real-Time STAP Demonstration on an Embedded
High Performance Computer," Proc. of the IEEE National Radar Conference, Syracuse,
NY, May 13-15, 1997.
[6] "Space-Time Adaptive Processing for Airborne Radar," Tech. Rep. 1015, MIT Lincoln
Laboratory, 1994.
[7] G. Schorcht, I. Troxel, K. Farhangian, P. Unger, D. Zinn, C. Mick, A. George, and H.
Salzwedel, "System-Level Simulation Modeling with MLDesigner," Proc. of 11th
IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of
Computer and Telecommunications Systems (MASCOTS), Orlando, FL, October 12-15,
2003.
[8] R. Brown and R. Linderman, "Algorithm Development for an Airborne Real-Time STAP
Demonstration," Proc. of the IEEE National Radar Conference, Syracuse, NY, May 13-15,
1997.
[9] A. Choudhary, W. Liao, D. Weiner, P. Varshney, R. Linderman, M. Linderman, and R.
Brown, "Design, Implementation and Evaluation of Parallel Pipelined STAP on Parallel
Computers," IEEE Trans. on Aerospace and Electronic Systems, vol. 36, pp. 528-548,
April 2000.







Acknowledgements


* We wish to thank Honeywell Space Systems in Clearwater, FL for
their funding and technical guidance in support of this research.

* We wish to thank MLDesign Technologies in Palo Alto, CA for
providing us the MLD simulation tool that made this work possible.





