Simulation Framework for Performance Prediction in the
Engineering of RC Systems and Applications
Eric Grobelny, Casey Reardon, Adam Jacobs, Alan D. George
NSF Center for High-Performance Reconfigurable Computing (CHREC)
HCS Research Lab, ECE Department, University of Florida
Telephone: (352) 392-5225 Fax: (352) 392-8671
Abstract-Reconfigurable computing (RC) is rapidly emerging
as a promising technology for the future of high-performance
computing, enabling systems with the computational density and
power of custom-logic hardware and the versatility of software-
driven hardware in an optimal mix. Novel methods for rapid virtual
prototyping, performance prediction, and evaluation are of critical
importance in the engineering of complex reconfigurable systems
and applications. Such methods can yield insightful tradeoff
analyses while saving valuable time and resources for researchers
and engineers alike. The research described herein provides a
methodology for mapping arbitrary applications to targeted re-
configurable systems in a simulation environment. By splitting the
process into two domains, the application and simulation domains,
characterization of each element can occur independently and
in parallel, leading to fast and accurate performance prediction
results. This paper presents the design of a novel framework
for simulative performance prediction, along with a single-node
case study performed with a synthetic-aperture radar application
to provide validation results and a component tradeoff analysis,
illustrating the effectiveness of our approach at the node level.
I. INTRODUCTION
Reconfigurable computing (RC) is becoming recognized
as an increasingly important and viable paradigm for high-
performance computing at a time when the size and power
consumption of clusters and traditional supercomputers have
grown to alarming levels. With RC, the performance potential
of underlying hardware resources in a system can be fully
realized in a highly adaptive manner. The underlying device
technology enabling this new paradigm of computing is field-
programmable hardware, such as the field-programmable gate
array or FPGA. These programmable logic devices feature
many thousands of logic cells as building blocks that can be
quickly configured and interconnected to form application-
specific custom logic. RC extends the fields of large-scale and
embedded high-performance computing by incorporating and
dynamically reconfiguring these devices at run-time to acceler-
ate operations that would otherwise be performed in software.
Hybrid systems of microprocessors and FPGAs can leverage
system-level concepts from conventional high-performance
computing while accommodating hardware reconfigurability.
While these hybrid systems offer the potential for large
performance improvements over traditional systems, the in-
troduction of reconfigurable devices can dramatically increase
the design complexity of such systems. In addition to tra-
ditional design space parameters such as processor speed,
memory subsystem performance, and network interconnect,
RC systems must also consider FPGA resources, IO subsys-
tem performance, and reconfiguration capabilities. The large
design space can make targeting applications to a specific RC
system difficult and daunting. Simulation provides a means
of predicting the performance and bottlenecks of applications
running on numerous system configurations, for the purpose of
studying design tradeoffs. The resulting analyses can provide
useful data before investing significant amounts of time and
resources in the development of a particular solution.
In this paper, we present a framework for simulating ap-
plications on reconfigurable computing systems that balances
both speed and fidelity. By providing this balance, our frame-
work can provide a broad range of timely and meaningful
prediction analyses for a given reconfigurable application or
platform. With the appropriate models and calibration data,
numerous existing and future systems and applications can
be efficiently simulated and analyzed. The remainder of this
paper is organized as follows. Section II presents related
modeling and simulation research, for both reconfigurable
and high-performance computing. Section III provides an
overview of our simulation approach and methodology for
performance prediction of reconfigurable computing systems.
In Section IV, the results of our node-level architecture case
study are presented to validate and help illustrate the capabil-
ities of our simulation framework. Finally, the conclusions of
this paper are summarized in Section V.
II. RELATED RESEARCH
The modeling and performance prediction of RC devices
can be broken into two primary categories: device-level and
system-level. Device-level modeling is normally handled us-
ing electronic design toolkits such as ActiveHDL and Model-
Sim provided by vendors such as Aldec and Mentor Graphics,
respectively. These tools operate on the hardware description
language (HDL) sources used to design the corresponding functional
cores and target only the performance of the specific configurable
device or family of devices, rather than the entire computational
subsystem exercised when the core is used with a real application.
Although accurate, these tools are device-specific and can take
hours or even days to execute. They also do not address critical
performance issues in other components of a subsystem or
system, such as the I/O bus to which the device attaches.
A number of research projects attempt to predict the
performance of RC devices through the use of analytical
models. In [1], analytical models were developed to analyze
the performance of heterogeneous workstations outfitted with
RC devices. The research specifically addresses performance
issues dealing with load balancing using synchronous iterative
applications at both the node and device level. The models
showed reasonable accuracy; however, significant effort is
required for each application under study. In [2], RC models
are developed to predict the performance of vision algo-
rithms. The models incorporate the traditional configurable
computing system with a configurable device being used as
an offload engine by a host processor with implementation
details abstracted away. The models in their project address
performance prediction in a general sense with numerous
architectural and core variations possible. Although a compre-
hensive model is presented, model validation is not provided
and, again, the models are fairly complex and difficult to create
for mapping applications to specific architectures.
III. SIMULATION FRAMEWORK
This paper presents a novel simulation framework for
performance prediction of reconfigurable computing systems
that balances speed and accuracy. The goal of the project is to
develop a framework that supports the analysis of numerous
variations of RC architectures, and of application mappings to
them, at both the node and system levels. These architectures
include clusters of RC nodes, supercomputers with RC de-
vices, embedded systems, and other current and emerging
configurations. As a result of the wide design space possible
for current and future RC systems, we assume a generic
system model as shown in Fig. 1. The generic system consists
of one or more nodes interconnected by some high-speed
interconnect. Each RC node includes a host processor that
typically handles general computing tasks while offloading
specialized tasks to the corresponding RC device. The archi-
tecture is general enough, however, to support future systems
that incorporate RC devices that act completely independently of
the host processor. The RC devices can be attached to various
local interconnect technologies within the node, including a
peripheral bus, system bus, or switching fabric. By supporting
the generic architecture shown in Fig. 1, this simulation
framework allows for comprehensive analyses of arbitrary
serial and parallel RC applications targeted for the wide
range of reconfigurable systems. This node architecture can
be tailored for prototyping of a variety of system platforms.
For example, the Local Interconnect(s) cloud shown in Fig. 1
in some cases may represent one level of interconnect (e.g. a
direct HyperTransport connection between CPUs, FPGAs, and
main memory) and in other cases a hierarchy of interconnects
(e.g. CPUs and main memory residing on the front-side
interconnect such as HyperTransport, with FPGAs attached
via a bridge to an I/O interconnect such as PCI-Express).
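As a rough illustration of the generic node architecture described above, the following Python sketch represents a node as a host processor, an RC device, and an ordered list of local interconnect hops between them. The class names, fields, and the latency-plus-size-over-bandwidth transfer estimate are illustrative assumptions and are not taken from the framework; the numeric values are placeholders rather than measured figures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interconnect:
    """One hop between node components (hypothetical model)."""
    name: str
    latency_s: float       # fixed per-transfer latency, in seconds
    bandwidth_bps: float   # sustained bandwidth, in bytes per second

    def transfer_time(self, num_bytes: int) -> float:
        # Assumed cost model: latency + size / bandwidth.
        return self.latency_s + num_bytes / self.bandwidth_bps


@dataclass
class RCNode:
    """A node whose RC device is reached through one or more hops."""
    host_cpu: str
    rc_device: str
    path_to_rc: List[Interconnect]

    def host_to_rc_time(self, num_bytes: int) -> float:
        # Store-and-forward assumption: per-hop delays simply add up.
        return sum(hop.transfer_time(num_bytes) for hop in self.path_to_rc)


# Placeholder values chosen for easy arithmetic, not real hardware figures.
# Single-level case: the RC device sits directly on one interconnect.
direct = RCNode("cpu0", "fpga0", [Interconnect("direct link", 1e-6, 2e9)])

# Hierarchical case: the RC device is reached through a bridge to an
# I/O interconnect, so a transfer crosses two hops.
bridged = RCNode("cpu0", "fpga0", [
    Interconnect("front-side interconnect", 1e-6, 2e9),
    Interconnect("I/O interconnect", 2e-6, 1e9),
])
```

Under these assumptions, comparing `direct.host_to_rc_time(n)` with `bridged.host_to_rc_time(n)` for the same payload gives a first-order view of how the interconnect hierarchy affects host-to-device transfers.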
Fig. 1. Generic RC System Architecture (node architecture: CPU, memory, local interconnect(s), network interface, RC device)
In order to tackle the problem of optimally mapping arbitrary
applications to specific target architectures, the modeling
framework is split into two separate domains: the application
domain and the simulation domain. This split allows users
to characterize applications independently of the candidate
system architectures while supporting concurrent model devel-
opment that is independent of the potential applications. This
independence offers a high level of data and model reusability
and modularity which in turn facilitates rapid analyses of
numerous virtually prototyped systems and applications. The
overall structure of our simulation framework, and the key
steps within each domain, are illustrated in Fig. 2. The
remainder of this section provides details and examples on the
procedures used in the application and simulation domains.
Fig. 2. RC framework diagram
A. Application domain
The purpose of the application domain is to collect char-
acterization data on a selected application that captures its
inherent behavior, with the intent of creating some form of
stimulus data for the simulation models. The steps that make
up the application domain are shown by the ovals in Fig. 2.
Hardware core characterization defines the behavior of the
kernel or function to be performed within the RC device.
The computation time for an RC core can be obtained from
two sources. The first source uses experimentally measured
delays from a hardware core implementation, while the second
uses delays supplied from simulations of the hardware design
from vendor-supplied tools. Both methods provide reasonably
accurate results assuming deterministic behavior of the RC
devices. Computation time is not the only parameter required
to characterize a hardware core. Other key parameters include
core size, input data size, and output data size. The core size
parameter is important in order to manage how many cores
can fit onto a single RC device. This management capability
enables us to consider performance gains when scaling up
device size by squeezing more cores onto the fabric at once,
thus executing on more data in parallel. It also supports
the modeling of partial reconfiguration by allowing cores to
be exchanged within the RC device during the simulation.
However, an in-depth study of this technique was not included
in this paper. The final parameters, the input and output data
sizes, allow accurate modeling of transactions with the RC
device. These parameters are critical in order to accurately
capture the communication delays incurred when passing data
between various node components and the RC device.
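The hardware core parameters described above can be sketched as a small record, together with a helper that estimates how many core instances fit on a device. This is a minimal Python sketch under stated assumptions: the field names, the single-number resource measure, and the perfectly parallel execution model are hypothetical and are not the framework's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreProfile:
    """Characterization data for one hardware core (hypothetical fields)."""
    name: str
    compute_time_s: float  # measured or simulated time per invocation
    size_units: int        # fabric resources consumed (e.g. logic cells)
    input_bytes: int       # data sent to the core per invocation
    output_bytes: int      # data returned by the core per invocation


def cores_that_fit(core: CoreProfile, device_units: int) -> int:
    """Number of identical core instances that fit on one device."""
    return device_units // core.size_units


def batch_compute_time(core: CoreProfile, device_units: int,
                       num_items: int) -> float:
    """Time to process num_items with all instances running in parallel.

    Assumes perfectly parallel instances and ignores data transfer.
    """
    instances = cores_that_fit(core, device_units)
    if instances == 0:
        raise ValueError("core does not fit on the device")
    rounds = -(-num_items // instances)  # ceiling division
    return rounds * core.compute_time_s
```

For example, a core occupying 3000 resource units on a 10000-unit device yields three instances, so seven work items complete in three rounds of the core's computation time. The input and output sizes are carried in the record so that a communication model can charge transfer delays separately.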
Another vital step in the application domain that can be
conducted in parallel with characterizing the hardware core
is the application characterization stage. This step involves
identifying and gathering a sequence of key events that defines
the performance of the target application. The events currently
supported within this framework include the computation
conducted by the host processor, the computation executed in
the RC device, and the communication between RC nodes. A
specific sequence of these three events can be used to capture
the behavior of any parallel or serial RC application. However,
gathering these sequences of events is a key challenge when
dealing with RC applications due to the large number of
programming interfaces implemented to transfer data to and
from RC devices. The framework addresses this challenge by
using a scripting language that allows the user to manually
represent the behavior of an RC application. When using
this approach, classic instrumentation tools can be employed
to gather the data relevant to defining host computation and
inter-node communication. Presently, the incorporation of RC
events into scripts must be performed manually, since no
common standard exists that defines interactions with RC
devices. In the future, the adoption of a standard RC programming
interface, such as that being explored by the USURP
project [3] or the U-APPREQ work group in the OpenFPGA
consortium [4], would allow the automatic characterization of
these events. However, until a common standard is finalized,
manual script generation provides the most flexible alternative
to cover the wide range of RC application and RC device
combinations. When a common standard is adopted, automatic
generation of application scripts within our scripting syntax
can be implemented. Specific details on the scripting language
developed for stimulating RC simulation models are presented
below.
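The three supported event types can be pictured as a simple ordered sequence. The Python sketch below is only an illustration of that idea: it does not reproduce the framework's scripting syntax, and the type names, the single duration field, and the placeholder durations are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class EventKind(Enum):
    HOST_COMPUTE = "host_compute"  # computation on the host processor
    RC_COMPUTE = "rc_compute"      # computation in the RC device
    NODE_COMM = "node_comm"        # communication between RC nodes


@dataclass(frozen=True)
class Event:
    kind: EventKind
    duration_s: float  # characterized or instrumented time for the event


def total_time_by_kind(events: List[Event]) -> Dict[EventKind, float]:
    """Sum the time attributed to each event kind in a serial sequence."""
    totals = {kind: 0.0 for kind in EventKind}
    for event in events:
        totals[event.kind] += event.duration_s
    return totals


# Placeholder sequence: host pre-processing, an offloaded kernel, a message
# to another node, then a second offloaded kernel. Durations are made up.
sequence = [
    Event(EventKind.HOST_COMPUTE, 0.25),
    Event(EventKind.RC_COMPUTE, 0.5),
    Event(EventKind.NODE_COMM, 0.125),
    Event(EventKind.RC_COMPUTE, 0.5),
]
```

Calling `total_time_by_kind(sequence)` gives a per-category breakdown of where time is spent, which is the kind of information a characterized event sequence is meant to expose before it is used as simulation stimulus.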
The final step in the application domain deals with gen-
erating scripts that represent the behaviors of the target
applications. The information collected during the application
and core characterization steps defines both the structure and
the values needed to construct scripts that accurately exercise
the computational subsystems as if the actual application were