RAT: A Methodology for Predicting Performance
in Application Design Migration to FPGAs
Brian Holland, Karthik Nagarajan, Chris Conger, Adam Jacobs, Alan D. George
NSF Center for High-Performance Reconfigurable Computing (CHREC)
ECE Department, University of Florida
{holland,nagarajan,conger,jacobs,george}@chrec.org
ABSTRACT
Before any application is migrated to a reconfigurable computer (RC), it is important to consider its amenability to the hardware paradigm. To maximize the probability of success for an application's migration to an FPGA, one must quickly, and with a reasonable degree of accuracy, analyze not only the performance of the system but also the required precision and necessary resources to support a particular design. This extra preparation is meant to reduce the risk of failing to achieve the application's design requirements (e.g., speed or area) by quantitatively predicting the expected performance and system utilization. This paper presents the RC Amenability Test (RAT), a methodology for rapidly analyzing an application design's amenability to a specific FPGA platform.
1. INTRODUCTION
FPGAs continue to grow as a viable option for increasing the performance of many applications over traditional CPUs without the need for ASICs. Because no standardized rules exist for FPGA amenability, it is important for a designer to consider the likely performance of an application in hardware before undergoing a lengthy migration process. Ultimately, the designer must know what order-of-magnitude speedup (or potentially slowdown) will be encountered. Some researchers have suggested [4] that a 50x to 100x speedup is required to gain the attention and approval of "middle management." Other scenarios might place the break-even point (time of development versus time saved at execution) at a more conservative factor of ten or less. The high-performance embedded community might simply want FPGA performance to parallel a traditional processor, since savings could come in the form of reduced power usage. Ultimately, the success or failure of an application's RC migration will be judged against some metric of performance. It is critical to consider whether the chosen application architecture and FPGA platform will meet the speed, area, and power requirements of the project. The
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
HPRCTA'07, November 11, 2007, Reno, Nevada, USA
Copyright 2007 ACM 978-1-59593-894-7/07/0011 ...$5.00.
RC Amenability Test (RAT) is a combination of algorithm and software legacy-code analyses along with 'pencil and paper' computations that seeks to determine the likelihood of success for a specific algorithm's migration to a particular RC platform before any (or at least significant) hardware coding is begun.
The need for the RAT methodology stemmed from common difficulties encountered during several FPGA application migration projects. Researchers would typically possess a software application but would be unsure about potential performance gains in hardware. The level of experience with FPGAs varied greatly among the researchers, and inexperienced designers were often unable to quantitatively project and compare possible algorithmic designs and FPGA platform choices for their application. Many initial predictions were haphazardly formulated, and performance estimation methods varied greatly. Consequently, RAT was created to consolidate and unify these performance prediction strategies for faster, simpler, and more effective analyses.
Three factors are considered for the amenability of an application to hardware: throughput, numerical precision, and resource usage. The authors believe that these issues dominate the overall effectiveness of an application's hardware migration. Consequently, analyses for these three factors comprise the majority of the RAT methodology. The throughput analysis uses a series of simple equations to predict the performance of the application based upon known parameters (e.g., interconnect speed) and values estimated from the proposed design (e.g., volume of communicated data). Numerical precision analysis is a subset of throughput, encompassing the design trade-offs in performance associated with possible algorithm data formats and their associated error. Resource analysis involves estimating the application's hardware usage in order to detect designs that consume more than the available resources.
Many research projects, as discussed in [11], emphasize the usage of FPGAs to achieve speedup over traditional CPUs. Consequently, accurate throughput analysis is the primary focus of the RAT methodology. While numerical precision, resource utilization, and other issues such as development time or power usage are not trivial, they are less likely to be the sole contributor to the failure of an application migration when speedup is the primary goal. Consequently, RAT's throughput analysis is the most detailed performance test and the focus of the application case studies in this paper.
The remainder of this paper is structured as follows. Section 2 discusses background related to FPGA performance prediction and resource utilization. The fundamental analyses comprising the RAT methodology are detailed in Section 3. A detailed walkthrough illustrating the usage of RAT with a real application is given in Section 4. Section 5 presents additional case studies using RAT to further explore the accuracy of the performance prediction methodology. Conclusions and future work are discussed in Section 6.
2. RELATED WORK
Researchers have studied the area of hardware amenability, but their approaches vary. A Performance Prediction Model (PPM) is suggested in [12] for determining the optimal mapping of an algorithm to an FPGA. That methodology consists of four steps: choice and modification of the implementation, classification, feature extraction, and performance matrix computation (for frequency, latency, throughput, and area requirements). The concept of a detailed classification of the internal operations of an application design is very practical. However, the performance estimation method incorporates quantitative area, I/O pins, latency, and throughput into a large system of platform-dependent equations, which is impractical for RAT. The goal of the research presented in [14] is to determine the optimal function design with respect to area, latency, and throughput. Ultimately, the project seeks to create a tool or library for identifying the best version among many alternatives for a particular scenario. The work provides valuable insight, especially into the domain of application precision, but its focus on automated design of single kernels is overly specific for a high-level RAT methodology for applications.
A performance prediction technique presented in [16] seeks to parameterize not only the computational algorithm but also the FPGA system. Applications are decomposed and analyzed to determine their total size and computational density. Computational platforms are characterized by their memory size, bandwidth, and latency. By comparing the algorithm's computational requirements with the memory bottleneck of the FPGA platform, a worst-case computational throughput (in operations per second) can be quantified. However, the author asserts that "the performance analysis in this paper is not real performance prediction; rather it targets the general concern of whether or not an algorithm will fit within the memory subsystem that is designed to feed it." The RAT methodology differs because it seeks to quantify not only the number of operations but also the expected number of operations executed per clock cycle, yielding performance predictions strictly in units of time.
Another performance prediction technique [15] concerns modeling of shared heterogeneous workstations containing reconfigurable computing devices. This methodology chiefly concerns the modeling of system-level, multi-FPGA architectures with variable computational loading due to the multi-user environment. The basic execution-time model encompasses five steps: master node setup, serial node setup, parallel kernel computation (in hardware and/or software), serial node shutdown, and master node shutdown. Similar to RAT, analytic models are used to estimate the performance of the components of application execution time. However, this heterogeneous system modeling assumes that hardware tasks have deterministic runtimes, and performance estimates can be based on clock frequencies or simulation. In contrast, RAT is primarily focused on modeling FPGA execution times before designs are coded for hardware, and application analysis is not limited to deterministic algorithms. Effectively, RAT and the heterogeneous modeling could work collaboratively to provide higher-fidelity system-level trade-off analyses before any application code is migrated to hardware.
A research project into molecular dynamics at the University of Illinois [13] proposes a framework for application design in the hardware paradigm. The project asserts that the most frequently accessed block of code is not the only indicator of RC amenability. The types of computations and the volume of communication will increase or decrease the recommended quantity of FPGA functions. Although the general framework stresses resource utilization tests for case studies, the researchers "postpone finding out about space requirements for the design until [they] actually map it to the FPGA." A priori measurements of resource requirements may be inexact, but they are still necessary to avoid creating initial designs that are physically unrealizable.
Conceptually, the RAT methodology is meant to resemble the approach behind the Parallel Random Access Machine (PRAM) [8] model for traditional supercomputing. Both RAT and PRAM attempt to model the critical (and hopefully small) set of algorithm and platform attributes necessary to achieve a better understanding of the greater computational interaction and, ultimately, the application performance. PRAM focuses on modes of concurrent memory accesses, whereas RAT examines the communication and computation interaction on an FPGA. These critical attributes of RAT also resemble the LogP model [7] (a successor to PRAM), which seeks to abstract "the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation." RAT does not claim to be the ultimate solution to RC performance prediction but instead encourages the FPGA community to consider more structured and standardized approaches to algorithm analysis using established concepts inherited from prior successes in traditional parallel computing modeling.
3. RC AMENABILITY TEST
Figure 1 illustrates the basic methodology behind the RC amenability test. This simple set of tests serves as a basis for determining the viability of an algorithm design on the FPGA platform prior to any FPGA programming. Again, RAT is intended to address the performance of a specific design, not a generic algorithm. The results of the RAT tests must be compared against the designer's requirements to evaluate the success of the application design. Though the throughput analysis is considered the most important step, the three tests are not necessarily used as a single, sequential procedure. Often, RAT is applied iteratively during the design process until a suitable version of the algorithm is formulated or all reasonable permutations are exhausted without a satisfactory solution.
3.1 Throughput
For RAT, the predicted performance of an application is defined by two terms: communication time between the CPU and FPGA, and FPGA computation time. Reconfiguration and other setup times are ignored. These two terms encompass the rate at which data flows through the FPGA and the rate at which operations occur on that data, respectively. Because RAT seeks to analyze applications at the
[Figure 1 (flowchart): starting from kernel identification and a paper design, the throughput, numerical precision, and resource tests are applied; insufficient communication or computation throughput, an unrealizable precision requirement, or insufficient resources send the design back for reformulation, while a design with desirable performance, minimum precision, and an acceptable balance of performance and precision proceeds to HDL/HLL implementation, simulation, and verification on the hardware platform.]

Figure 1: Overview of RAT Methodology
earliest stage of hardware migration, these terms are reduced to the most generalized parameters. The RAT throughput test primarily models FPGAs as coprocessors to general-purpose processors, but the framework can be adjusted for streaming applications.
Calculating the communication time is a relatively simple process given by Equations (1), (2), and (3). The overall communication time is defined as the summation of the read and write components. For the individual reads and writes, the problem size (i.e., number of data elements, N_elements) and the numerical precision (i.e., number of bytes per element, N_bytes/element) must be decided by the user with respect to the algorithm. Note that problem size only refers to a single block of data to be buffered by the FPGA system. An application's data communication may be divided into multiple discrete transfers, which is accounted for in a subsequent equation. The hypothetical bandwidth of the FPGA/processor interconnect on the target platform (e.g., 133MHz, 64-bit PCI-X, which has a documented maximum throughput of 1GB/s) is also necessary but is generally provided either with the FPGA system documentation or as part of the interconnect standard. An additional parameter, α, represents the fraction of ideal throughput performing useful communication. The actual sustained performance of the FPGA interconnect will only be a fraction of the documented transfer rate. Microbenchmarks composed of simple data transfers can be used to establish the true communication bandwidth.
t_comm = t_read + t_write    (1)

t_read = (N_elements × N_bytes/element) / (α_read × throughput_ideal)    (2)

t_write = (N_elements × N_bytes/element) / (α_write × throughput_ideal)    (3)
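As a quick sketch, Equations (1) through (3) can be evaluated directly. The helper below is illustrative only (its name and the split of element counts into input and output directions follow the worksheet parameters of Section 4, not anything mandated by the equations):

```python
def t_comm(n_in, n_out, bytes_per_element, throughput_ideal,
           alpha_read, alpha_write):
    """Predicted CPU<->FPGA transfer time for one buffer (Eqs. 1-3).

    throughput_ideal is in bytes/second; alpha_read/alpha_write are the
    measured fractions (0 < alpha <= 1) of ideal bandwidth sustained in
    each direction. Writes carry input data to the FPGA; reads return
    results to the host.
    """
    t_write = (n_in * bytes_per_element) / (alpha_write * throughput_ideal)
    t_read = (n_out * bytes_per_element) / (alpha_read * throughput_ideal)
    return t_read + t_write
```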
Before further equations are discussed, it is important to clarify the concept of an "element." Until now, the expressions "problem size," "volume of communicated data," and "number of elements" have been used interchangeably. However, strictly speaking, the first two terms refer to a quantity of bytes, whereas the last term has the ubiquitous unit "elements." RAT operates under the assumption that the computational workload of an algorithm is directly related to the size of the problem dataset. Because communication times are concerned with bytes and (as will be subsequently shown) computation times revolve around the number of operations, a common term is necessary to express this relationship. The element is meant to be the basic building block which governs both communication and computation. For example, an element could be a value in an array to be sorted, an atom in a molecular dynamics simulation, or a single character in a string-matching algorithm. In each of these cases, some number of bytes will be required to represent that element, and some number of calculations will be necessary to complete all computations involving that element. The difficulty is establishing what subset of the data should constitute an element for a particular algorithm. Often an application must be analyzed in several separate stages, since each portion of the algorithm could interpret the input data in a different scope.
Estimating the computational component of the RC execution time, as given in Equation (4), is more complicated than communication due to the conversion factors. Whereas the number of bytes per element is ultimately a fixed, user-defined value, the number of operations (i.e., computations) per element must be manually measured from the algorithm structure. Generally, the number of operations will be a function of the overall computational complexity of the algorithm and the types of individual computations involved. Additionally, as with the communication equations, a throughput term, throughput_proc, is also included to establish the rate of execution. This parameter is meant to describe the number of operations completed per cycle. For fully pipelined designs, the number of operations per cycle will equal the number of operations per element. Less optimized designs will only have a fraction of that capacity, requiring multiple cycles to complete an element. Again, note that computation time essentially refers to the time required to operate on the data provided by one communication transfer. (Applications with multiple communication and computation blocks are resolved when the total FPGA execution time is computed later in this section.)
t_comp = (N_elements × N_ops/element) / (f_clock × throughput_proc)    (4)
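Equation (4) translates directly into a one-line estimate. The function name and parameter names below are illustrative, not part of the methodology:

```python
def t_comp(n_elements, ops_per_element, f_clock, throughput_proc):
    """Predicted FPGA computation time for one buffer (Eq. 4).

    f_clock is in Hz; throughput_proc is the average number of
    operations completed per clock cycle for the proposed design.
    """
    return (n_elements * ops_per_element) / (f_clock * throughput_proc)
```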
Despite the potential unpredictability of algorithm behavior, estimating a sufficiently precise number of operations is still possible for many types of applications. However, predicting the average rate of operation execution can be challenging even with detailed knowledge of the target hardware design. For applications with a highly deterministic pipeline, the procedure is straightforward. But for interdependent or data-dependent operations, the problem is more complex. For these scenarios, a better approach is to treat throughput_proc as an independent variable and select a desired speedup value. One can then solve for the particular throughput_proc value required to achieve that desired speedup. This method provides the user with insight into the relative amount of parallelism that must be incorporated for a design to succeed.
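This inversion can be sketched as follows, assuming the single-buffered execution model of Equation (5); the function and its arguments are hypothetical, not part of the published worksheet:

```python
def required_throughput_proc(speedup_goal, t_soft, n_iter, t_comm,
                             n_elements, ops_per_element, f_clock):
    """Solve for the ops/cycle needed to reach a desired speedup.

    Inverts Eq. 4 under the single-buffered model (Eq. 5):
    t_soft / speedup = n_iter * (t_comm + t_comp).
    """
    t_rc_target = t_soft / speedup_goal          # total RC time budget
    t_comp_budget = t_rc_target / n_iter - t_comm  # per-iteration compute budget
    if t_comp_budget <= 0:
        raise ValueError("goal unreachable: communication alone exceeds budget")
    return (n_elements * ops_per_element) / (f_clock * t_comp_budget)
```

For instance, asking for a 10x speedup over a 0.578 s baseline with 400 iterations tells the designer roughly how many operations per cycle the pipeline must sustain.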
[Figure 2 (timing diagrams): three overlap scenarios for reads (R), writes (W), and computations (C). Single buffered: R1, C1, W1, R2, C2, W2, ... with no overlap. Double buffered, computation bound: transfers overlap with the dominant computation stream. Double buffered, communication bound: computation hides behind the dominant communication stream.]

Figure 2: Example Overlap Scenarios
Similar to an element, one must also examine what constitutes an "operation." Consider an example application composed of a 32-bit addition followed by a 32-bit multiplication. The addition can be performed in a single clock cycle, but to save resources the 32-bit multiplier might be constructed using the Booth algorithm, requiring 16 clock cycles. Arguments could be made that the addition and multiplication count as either two operations (addition and multiplication) or 17 operations (addition plus 16 additions, the basis of the Booth multiplier algorithm). Either formulation is correct provided that throughput_proc is formulated with the same assumption about the scope of an operation. For this example, 2/17 and 1 operation per cycle, respectively, yield the correct computation time of 17 cycles.
Figure 2 illustrates the types of communication and computation interaction to be modeled with the throughput test. Single buffering (SB) represents the most simplistic scenario, with no overlapping tasks. However, a double-buffered (DB) system allows overlapping communication and computation by providing two independent buffers to keep both the processing and I/O elements occupied simultaneously. Since the first computation block cannot proceed until the first communication sequence has completed, steady-state behavior is not achievable until at least the second iteration. However, this startup cost is considered negligible for a sufficiently large number of iterations.
The FPGA execution time, t_RC, is a function not only of the t_comm and t_comp terms but also of the amount of overlap between communication and computation. Equations (5) and (6) model the single- and double-buffered scenarios. For single buffering, the execution time is simply the summation of the communication time, t_comm, and computation time, t_comp. In the double-buffered case, either the communication or the computation time completely overlaps the other term; the smaller latency essentially becomes hidden during steady state.
The RAT analysis for computing t_comp implicitly assumes one algorithm "functional unit" operating on a single buffer's worth of transmitted information. The parameter N_iter is the number of iterations of communication and computation required to solve the entire problem.
t_RC,SB = N_iter × (t_comm + t_comp)    (5)

t_RC,DB = N_iter × max(t_comm, t_comp)    (6)
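A minimal sketch of Equations (5) and (6), with hypothetical names:

```python
def t_rc(n_iter, t_comm, t_comp, double_buffered=False):
    """Total FPGA execution time across all iterations.

    Single buffered (Eq. 5): communication and computation serialize.
    Double buffered (Eq. 6): the smaller term is hidden at steady state.
    """
    if double_buffered:
        return n_iter * max(t_comm, t_comp)
    return n_iter * (t_comm + t_comp)
```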
Assuming that the application design currently under analysis was based upon available sequential software code, a baseline execution time, t_soft, is available for comparison with the estimated FPGA execution time to predict the overall speedup. As given in Equation (7), speedup is a function of the total application execution time, not a single iteration.
speedup = t_soft / t_RC    (7)
Related to the speedup are the computation and communication utilizations given by Equations (8), (9), (10), and (11). These metrics determine the fraction of the total application execution time spent on computation and communication for the single- and double-buffered cases. Note that the double-buffered case is only applicable to applications with a sufficient number of iterations to achieve steady-state behavior throughout most of the execution time. The computation utilization can provide additional insight about the application speedup. If utilization is high, the FPGA is rarely idle, thereby maximizing speedup. Low utilization can indicate potential for increased speedup if the algorithm can be reformulated to have less (or more overlapped) communication. In contrast to computation, which is effectively parallel for optimal FPGA processing, communication is serialized. Whereas computation utilization gives no indication about the overall resource usage, since additional FPGA logic could be added to operate in parallel without affecting the utilization, the communication utilization indicates the fraction of bandwidth remaining to facilitate additional transfers, since the channel is only a single resource.
util_comp,SB = t_comp / (t_comm + t_comp)    (8)

util_comm,SB = t_comm / (t_comm + t_comp)    (9)

util_comp,DB = t_comp / max(t_comm, t_comp)    (10)

util_comm,DB = t_comm / max(t_comm, t_comp)    (11)
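The four utilization ratios share a denominator, so a small illustrative helper covers Equations (8) through (11):

```python
def utilizations(t_comm, t_comp, double_buffered=False):
    """Fractions of execution time spent on computation and communication.

    Single buffered: denominator is t_comm + t_comp (Eqs. 8-9).
    Double buffered: denominator is max(t_comm, t_comp) (Eqs. 10-11).
    """
    denom = max(t_comm, t_comp) if double_buffered else (t_comm + t_comp)
    return {"comp": t_comp / denom, "comm": t_comm / denom}
```

Note that in the double-buffered case the dominant term has utilization 1.0, reflecting that the smaller latency is fully hidden.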
3.2 Numerical Precision
Application numerical precision is typically defined by the amount of fixed- or floating-point computation within a design. With FPGA devices, where increased precision dictates higher resource utilization, it is important to use only as much precision as necessary to remain within acceptable tolerances. Because general-purpose processors have fixed-length data types and readily available floating-point resources, it is reasonable to assume that a given software application will often have at least some measure of wasted precision. Consequently, effective migration of applications to FPGAs requires a time-efficient method to determine the minimum necessary precision before any translation begins.
While formal methods for numerical precision analysis of FPGA applications are important, they are outside the scope of the RAT methodology. A plethora of research exists on topics including automated conversion of floating-point software programs to fixed-point hardware designs [2], design-time precision analysis tools for RC [5], and custom or dynamic bitwidths for maximizing performance and area on FPGAs [3, 9]. Application designs are meant to capitalize on these numerical precision techniques and then use the RAT methodology to evaluate the resulting algorithm performance. As with parallel decomposition, numerical formulation is ultimately the decision of the application designer. RAT provides a quick and consistent procedure for evaluating these design choices.

Figure 3: Architecture of 1-D PDF Algorithm
3.3 Resources
By measuring resource utilization, RAT seeks to determine the scalability of an application design. Empirically, most FPGA designs will be limited in size by the availability of three common resources: on-chip memory, dedicated hardware functional units (e.g., multipliers), and basic logic elements (i.e., lookup tables and flip-flops).
On-chip RAM is readily measurable, since some quantity of the memory will likely be used for I/O buffers of a known size. Additionally, intra-application buffering and storage must be considered. Vendor-provided wrappers for interfacing designs to FPGA platforms can also consume a significant number of memories, but the quantity is generally constant and independent of the application design.
Although the types of dedicated functional units included in FPGAs can vary greatly, the hardware multiplier is a fairly common component. The demand for dedicated multiplier resources is highlighted by the availability of families of chips (e.g., the Xilinx Virtex-4 SX series) with extra multipliers versus other comparably sized FPGAs. Calculating the necessary number of hardware multipliers depends on the type and amount of parallel operations required. Multipliers, dividers, square roots, and floating-point units use hardware multipliers for fast execution. Varying levels of pipelining and other design choices can increase or decrease the overall demand for these resources. With sufficient design planning, an accurate measure of resource utilization can be taken for a design given knowledge of the architecture of the basic computational kernels.
Measuring basic logic elements is the most ubiquitous resource metric. High-level designs do not empirically translate into any discernible resource count. Qualitative assertions about the demand for logic elements can be made based upon approximate quantities of arithmetic or logical operations and registers, but a precise count is nearly impossible without an actual hardware description language (HDL) implementation. Above all other types of resources, routing strain increases exponentially as logic element utilization approaches its maximum. Consequently, it is often unwise (if not impossible) to fill the entire FPGA.
Currently, RAT does not employ a database of statistics to facilitate resource analysis of an application for complete FPGA novices. The usage of RAT requires some vendor-specific knowledge (e.g., 32-bit fixed-point multiplications on Xilinx Virtex-4 FPGAs require two dedicated 18-bit multipliers). Resource analyses are meant to highlight general application trends and predict scalability. For example, the structure of the molecular dynamics case study in Section 5 is designed to minimize RAM usage, and the parallelism was ultimately limited by the availability of multiplier resources.
4. WALKTHROUGH
To simplify the RAT analysis in Section 3, a worksheet can be constructed based upon Equations (1) through (11). Users simply provide the input parameters, and the resulting performance values are returned. This walkthrough further explains key concepts of the throughput test by performing a detailed analysis of a real application case study: one-dimensional probability density function (PDF) estimation. The goal is to provide a more complete description of how to use the RAT methodology in a practical setting.
4.1 Algorithm Architecture
The Parzen window technique is a generalized nonparametric approach to estimating probability density functions (PDFs) in a d-dimensional space. Though more computationally intensive than using histograms, the Parzen window technique is mathematically advantageous. For example, the resulting probability density function is continuous and therefore differentiable. The computational complexity of the algorithm is of order O(Nnd), where N is the total number of discrete probability levels (comparable to the number of "bins" in a histogram), n is the number of discrete points at which the PDF is estimated (i.e., the number of elements), and d is the number of dimensions. A set of mathematical operations is performed on every data sample over nd discrete points. Essentially, the algorithm computes the cumulative effect of every data sample at every discrete probability level. For simplicity, each discrete probability level is subsequently referred to as a bin.
In order to better understand the assumptions and choices made during the RAT analysis, one general architecture of the PDF estimation algorithm is highlighted in Figure 3. A total of 204,800 data samples are processed in batches of 512 elements at a time against 256 bins. Eight separate pipelines are created to process a data sample with respect to a particular subset of bins. Each data sample is an element with respect to the RAT analysis. The data elements are fed into the parallel pipelines sequentially. Each pipelined unit can process one element with respect to one bin per cycle. Internal registering for each bin keeps a running total of the impact of all processed elements. These cumulative totals comprise the final estimation of the PDF function.

Table 1: Input parameters for RAT analysis

  Dataset Parameters
    N_elements, input    (elements)
    N_elements, output   (elements)
    N_bytes/element      (bytes/element)
  Communication Parameters
    throughput_ideal     (MB/s)
    α_write              (0 < α < 1)
    α_read               (0 < α < 1)
  Computation Parameters
    N_ops/element        (ops/element)
    throughput_proc      (ops/cycle)
    f_clock              (MHz)
  Software Parameters
    t_soft               (sec)
    N_iter               (iterations)

Table 2: Input parameters of 1-D PDF

  Dataset Parameters
    N_elements, input    (elements)         512
    N_elements, output   (elements)         1
    N_bytes/element      (bytes/element)    4
  Communication Parameters
    throughput_ideal     (MB/s)             1000
    α_write              (0 < α < 1)        0.37
    α_read               (0 < α < 1)        0.16
  Computation Parameters
    N_ops/element        (ops/element)      768
    throughput_proc      (ops/cycle)        20
    f_clock              (MHz)              75/100/150
  Software Parameters
    t_soft               (sec)              0.578
    N_iter               (iterations)       400
4.2 RAT Input Parameters
Table 1 provides a list of all the input parameters neces
sary to perform a RAT analysis. The parameters are sorted
into four distinct categories, each referring to a particular
portion of the throughput analysis. Note that Nelements is
listed under a separate category when it is actually used by
both communication and computation. It is assumed that
the number of elements dictating the computation volume
is also the number of elements that are input to the ap
plication. While there are cases where applications exhibit
unusual computational trends or require significant amounts
of additional data (e.g. constants, seed values, or lookup ta
bles), the current RAT user base has found these instances
to be uncommon. Alterations can be made to account for
these cases but such examples are not included in this paper.
Table 2 summarizes the input parameters for the RAT analysis of our 1-D PDF estimation algorithm using Gaussian kernels. The dataset parameters are generally the first values supplied by the user, since the number of elements will ultimately govern the entire algorithm performance. Though the entire application involves 204,800 data samples, each iteration of the 1-D PDF estimation will involve only a portion, 512 data samples, or 1/400 of the total set. This algorithm effectively consumes all of the input values. Only one value is retained after each iteration per bin. Each iteration's result is retained on the FPGA, and all values are transferred back to the host in a single block after the algorithm has completed. However, the output transfer time, regardless of how it is modeled, remains negligible with respect to the overall execution time.
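Plugging the Table 2 values into Equations (1) through (7) at the 100 MHz clock option gives a quick sanity check. The arithmetic below is our own illustration of the worksheet, not a result reported from the platform:

```python
# Table 2 parameters, single-buffered model at f_clock = 100 MHz.
t_write = (512 * 4) / (0.37 * 1000e6)   # input block: 512 elements x 4 B, Eq. 3
t_read  = (1 * 4) / (0.16 * 1000e6)     # single result value, negligible, Eq. 2
t_comm  = t_read + t_write              # Eq. 1
t_comp  = (512 * 768) / (100e6 * 20)    # 768 ops/element, 20 ops/cycle, Eq. 4
t_rc_sb = 400 * (t_comm + t_comp)       # 400 iterations, Eq. 5
speedup = 0.578 / t_rc_sb               # Eq. 7, against the software baseline
```

Under these assumptions, the worksheet predicts a single-buffered speedup of roughly 7x at 100 MHz; the 150 MHz option would raise the figure accordingly.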
The number of bytes per element is rounded to four (i.e., 32 bits). Even though the PDF estimation algorithm only uses 18-bit fixed point, the communication channel uses 32-bit communication. During the algorithmic formulation, 18-bit and 32-bit fixed point, along with 32-bit floating point, were considered for use in the PDF algorithm. However, the maximum error percentage for 18-bit fixed point was small enough to provide satisfactory precision for the application. Ultimately, 18-bit fixed point was chosen so that only one Xilinx 18x18 multiply-accumulate (MAC) unit would be needed per multiplication. Though slightly smaller bitwidths would have also possessed reasonable error constraints, no performance gains or appreciable resource savings would have been achieved.
Next, the communication parameters are provided by the user since they are merely a function of the target RC platform, which in this case is a Nallatech H101-PCIXM card containing a Virtex-4 LX100 user FPGA. The card is connected to the host CPU via a 133 MHz PCI-X bus, which has a theoretical maximum bandwidth of 1000 MB/s. The α parameters were computed using a microbenchmark consisting of a read and a write for a data size comparable to one used by the 1-D PDF algorithm. The resulting read and write times were measured, combined with the transfer size to compute the actual communication rates, and finally used to calculate the α parameters by dividing by the theoretical maximum. The α parameters for this FPGA platform are low due to the communication protocols used by Nallatech atop PCI-X. In general, the microbenchmark is performed on an FPGA over a wide range of possible data sizes. The resulting α values can be tabulated and used in future RAT analyses for that FPGA platform.
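The α derivation described above amounts to dividing the measured transfer rate by the theoretical maximum. A minimal sketch (the 2048-byte / 5.56-microsecond figures are illustrative values consistent with the 1-D PDF case study, not measurements quoted from the paper):

```python
def alpha(bytes_transferred, measured_time_sec, ideal_bytes_per_sec):
    """Fraction of the interconnect's theoretical bandwidth actually achieved."""
    achieved = bytes_transferred / measured_time_sec
    return achieved / ideal_bytes_per_sec

ideal = 1000e6                      # 133 MHz PCI-X: ~1000 MB/s theoretical
# e.g. a 2048-byte write (512 elements x 4 bytes) measured at ~5.56 us
a_write = alpha(2048, 5.56e-6, ideal)   # ~0.37, matching the tabulated value
```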
The computation parameters are perhaps the most challenging portion of the performance analysis, but they are still simple given the deterministic behavior of the PDF estimation algorithm. As mentioned earlier, each element that comes into the PDF estimator is evaluated against each of the 256 bins. Each evaluation requires 3 operations: comparison (subtraction), multiplication, and addition. Therefore, the number of operations per element totals 768 (i.e. 256 x 3). This particular algorithm structure has 8 pipelines that each effectively perform 3 operations per cycle, for a total of 24. However, this value is conservatively rounded down to 20 to account for pipeline latency and other overheads that are not otherwise considered. The 1-D PDF algorithm is constructed in VHDL to allow explicit, cycle-accurate construction of the intended design.
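The arithmetic behind these computation parameters can be restated in a few lines (a sketch using only the values stated in the text):

```python
bins = 256
ops_per_bin = 3                                # subtract (compare), multiply, add
ops_per_element = bins * ops_per_bin           # 768 ops per input element

pipelines = 8
raw_ops_per_cycle = pipelines * ops_per_bin    # 24 ops/cycle at full rate
ops_per_cycle = 20   # conservatively rounded down for pipeline latency/overheads
```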
While the previous parameters could be reasonably inferred from the deterministic structure of the 1-D PDF algorithm, a priori estimation of the required clock frequency is very difficult. Empirical knowledge of FPGA platforms and algorithm design practices provides some insight as to a range of likely values. However, attaining a single, accurate estimate of the maximum achievable FPGA clock frequency is generally impossible until after the entire application has been converted to a hardware design and analyzed by an FPGA vendor's layout and routing tools. Consequently, a number of clock values ranging from 75 MHz to 150 MHz for the LX100 are used to examine the scope of possible speedups.
The software parameters provide the last piece of information necessary to complete the speedup analysis. The software execution time of the algorithm is provided by the user. Often, software legacy code is the basis for the hardware migration initiative, but one could equally generate this code from a mathematically defined algorithm strictly as a baseline for performance analysis. The baseline software for the 1-D PDF estimation was written in C, compiled using gcc, and executed on a 3.2 GHz Xeon. Lastly, the number of iterations is deduced from the portion of the overall problem that resides in the FPGA at any one time. Since the user decided to process only 512 elements at a time from the set of 204,800 elements, there must be 400 (i.e. 204800/512) iterations of the algorithm.
4.3 Predicted and Actual Results
The RAT performance numbers are compared with the experimentally measured results in Table 3. Each predicted value in the table is computed using the input parameters and equations listed in Section 3.1. For example, the predicted computation time when fclk = 150 MHz is computed as follows:

  tcomp = (512 elements x 768 ops/element) / (150 MHz x 20 ops/cycle)
        = 393216 ops / 3E+9 ops/sec
        = 1.31E-4 sec
The communication time is computed using the corresponding equation. Because the application is single-buffered, the total RC execution time is a simple summation. The speedup is simply the division of the software execution time by the RC execution time. The utilization values are computed using the corresponding single-buffered equations.
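The single-buffered prediction pipeline can be sketched end to end. This is our restatement, written to be consistent with the equations of Section 3.1 and the worked numbers in this section (the per-iteration output transfer is set to zero here because the text notes it is retained on the FPGA and negligible):

```python
def t_comm(n_in, n_out, bytes_per_elem, ideal_bps, a_write, a_read):
    """Per-iteration transfer time: write to FPGA plus read back."""
    write = n_in * bytes_per_elem / (a_write * ideal_bps)
    read = n_out * bytes_per_elem / (a_read * ideal_bps)
    return write + read

def t_comp(n_elem, ops_per_elem, ops_per_cycle, f_clk):
    """Per-iteration computation time."""
    return n_elem * ops_per_elem / (ops_per_cycle * f_clk)

def t_rc_sb(n_iter, tcomm, tcomp):
    """Total single-buffered RC execution time: a simple summation."""
    return n_iter * (tcomm + tcomp)

# 1-D PDF at fclk = 150 MHz
tcm = t_comm(512, 0, 4, 1e9, 0.37, 0.16)    # ~5.5E-6 sec
tcp = t_comp(512, 768, 20, 150e6)           # 1.31E-4 sec
trc = t_rc_sb(400, tcm, tcp)                # ~5.46E-2 sec
util_comm = tcm / (tcm + tcp)               # ~4%
```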
The communication and computation times for the actual FPGA code were measured from the hardware. The total execution time for the hardware can be computed from Equation (5) or, as with this case study, measured from the FPGA to ensure maximum accuracy. The speedup and utilization values were computed from this information using the same equations as the predicted values. From the table, the predicted speedup when fclk = 150 MHz is reasonably close to the actual value. The discrepancy in speedup in this case is due to inaccuracies in the tcomm estimation. Although communication predictions are based on experimentally gathered data from microbenchmarks, the true behavior was not encapsulated for this algorithm. Variability in the communication time at small data sizes (i.e. 2 KB for one iteration of the 1-D PDF algorithm) and additional delays introduced by 800 (400 read, 400 write) repetitive transfers are considered to be the sources of the error.
A relatively accurate tcomp prediction is not surprising given the deterministic nature of the algorithm. However, the two significant figures of agreement between the predicted and actual computation times with fclk = 150 MHz were unexpected, given that computational throughput was conservatively estimated. Much of the 1-D PDF algorithm is pipelined, but enough latency and pipeline stalls existed to genuinely warrant the 17% reduction in the throughput estimate (i.e. 20 ops/cycle instead of 24). Unfortunately, the inaccuracies in the communication time predictions reduced the accuracy of the overall speedup. Had the communication been double-buffered, the inaccuracies in the communication time could have been masked behind the more stable computation time for a more accurate (and higher) speedup. The relatively low resource usage in Table 4 also illustrates a potential for further speedup by including additional parallel kernels.

  tRC,SB = 400 iterations x (5.56E-6 sec + 1.31E-4 sec)
         = 5.46E-2 sec
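The remark about double buffering can be quantified with a simple overlap model. The bound below is an assumption on our part, not an equation quoted from the paper: with perfect overlap, each iteration costs only the larger of its communication and computation terms.

```python
def t_rc_db(n_iter, tcomm, tcomp):
    """Assumed double-buffered time: per-iteration comm/comp fully overlapped."""
    return n_iter * max(tcomm, tcomp)

# 1-D PDF predictions at fclk = 150 MHz
n_iter, tcomm, tcomp = 400, 5.56e-6, 1.31e-4
single = n_iter * (tcomm + tcomp)          # ~5.46E-2 sec, as above
double = t_rc_db(n_iter, tcomm, tcomp)     # ~5.24E-2 sec; tcomm is hidden
```

Because tcomm is small relative to tcomp here, the gain is modest; the real benefit noted in the text is that errors in the tcomm estimate would no longer propagate into the total.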
Table 3: Performance parameters of 1-D PDF

                 Predicted  Predicted  Predicted  Actual
  fclk (MHz)     75         100        150        150
  tcomm (sec)    5.56E-6    5.56E-6    5.56E-6    2.50E-5
  tcomp (sec)    2.62E-4    1.97E-4    1.31E-4    1.39E-4
  utilcomm,SB    2%         3%         4%         15%
  utilcomp,SB    98%        97%        96%        85%
  tRC,SB (sec)   1.07E-1    8.09E-2    5.46E-2    7.45E-2
  speedup        5.4        7.2        10.6       7.8
5. ADDITIONAL CASE STUDIES
Two case studies are presented as further analysis and validation of the RAT methodology: 2-D PDF estimation and molecular dynamics. Two-dimensional PDF estimation continues to illustrate the accuracy of RAT for algorithms with a deterministic structure. The molecular dynamics application, however, serves as an interesting counterpoint given the relative difficulty of encapsulating its computational behavior. As with the one-dimensional PDF estimation algorithm, the design emphasis is placed on throughput analyses because the overall goal was to minimize execution time for these designs.
Table 4: Resource usage of 1-D PDF (LX100)

  FPGA Resource    Utilization
  48-bit DSPs
  BRAMs            15%
  Slices
5.1 2-D PDF Estimation
As previously discussed, the Parzen window technique is applicable in an arbitrary number of dimensions. However, the two-dimensional case presents a significantly greater problem in terms of communication and computation volume than the original 1-D PDF estimate. Now 256 x 256 discrete bins are used for histogram generation, and the input data set is effectively doubled to account for the extra dimension. The basic computation per element grows from (N - n)^2 + c to (N1 - n1)^2 + (N2 - n2)^2 + c, where N1 and N2 are the probability level values and n1, n2 are the data
Table 5: Input parameters of 2-D PDF (LX100)

  Dataset Parameters
  Nelements, input (elements)       1024
  Nelements, output (elements)      65536
  Nbytes/element (bytes/element)    4

  Communication Parameters
  throughput_ideal (MB/s)           1000
  α_write (0 < α < 1)               0.37
  α_read (0 < α < 1)                0.16

  Computation Parameters
  Nops/element (ops/element)        393216
  throughput_proc (ops/cycle)       48
  f_clock (MHz)                     75/100/150

  Software Parameters
  t_soft (sec)                      158.8
  N_iter (iterations)               400
Table 6: Performance parameters of 2-D PDF

                 Predicted  Predicted  Predicted
  fclk (MHz)     75         100        150
  tcomm (sec)    1.65E-3    1.65E-3    1.65E-3
  tcomp (sec)    1.12E-1    8.39E-2    5.59E-2
  utilcomm,SB    1%         2%         3%
  utilcomp,SB    99%        98%        97%
  tRC,SB (sec)   4.54E+1    3.42E+1    2.30E+1
  speedup        3.5        4.6        6.9
sample values for each dimension, and c is a probability scaling factor. But despite the added complexity, the increased quantity of parallelizable operations intuitively makes this algorithm more amenable to the RC paradigm, assuming sufficient quantities of hardware resources are available.
Table 5 summarizes the input parameters for the RAT analysis of our 2-D PDF estimation algorithm. Again, the computation is performed in a two-dimensional space, so twice the number of data samples (in blocks of 512 words for each dimension) are sent to the FPGA. In contrast to the 1-D case, the PDF values computed over each iteration are sent back to the host processor. The same numerical precision of four bytes per element is used for the data set. The interconnect parameters model the same Nallatech FPGA card as in the 1-D case study. The higher order of computational complexity is reflected in the larger number of computations per element, approximately three orders of magnitude greater. However, the number of parallel operations is only increased by a factor of two. VHDL is also used for the 2-D PDF algorithm to create the cycle-accurate pipeline. Again, the same range of clock frequencies is used for comparison. The software baseline for computing speedup values was written in C and executed on the same 3.2 GHz Xeon processor. The same 400 iterations are required to complete the algorithm.
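The 2-D predictions follow from the same single-buffered relations used in the 1-D walkthrough, now fed with the Table 5 values (this is our arithmetic check of the tabulated predictions, not code from the paper):

```python
# Table 5 inputs: 1024 elements in, 65536 bin values out, 4 bytes each
ideal, a_w, a_r = 1e9, 0.37, 0.16
tcomm = 1024 * 4 / (a_w * ideal) + 65536 * 4 / (a_r * ideal)   # ~1.65E-3 sec

# 393216 ops/element over 48 ops/cycle at fclk = 150 MHz
tcomp = 1024 * 393216 / (48 * 150e6)                           # ~5.59E-2 sec

t_rc = 400 * (tcomm + tcomp)                                   # ~2.30E+1 sec
speedup = 158.8 / t_rc                                         # ~6.9
```

Note how the read-back of 65536 bin values, absent in the 1-D case, now dominates the communication term.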
The RAT performance predictions are compared with the experimentally measured results in Table 6. The predicted speedup at 150 MHz is closer to the experimental 150 MHz value than in the one-dimensional case. Though the percent error is greater for both communication and computation,
Table 7: Resource usage of 2-D PDF (LX100)

  FPGA Resource    Utilization
  48-bit DSPs
  BRAMs
  Slices           21%
the important difference is that the predicted computation time was sufficiently overestimated to balance out the underestimated communication time. However, the problem is not with the computation parameters. These values are often conservatively estimated to account for unforeseen problems and to avoid promising unattainable results. Unfortunately, the 'unforeseen problem' for this algorithm turned out to be communication six times larger than predicted, comprising 19% of the total execution instead of the originally estimated 3%. The authors acknowledge this case study as a victory in contingency planning but highlight the precariousness of applications where relatively small shifts in the communication time can greatly affect the utilization of the FPGA.
Ultimately, the reduced number of parallel operations (throughput_proc) with respect to the quantity of operations (Nops/element) resulted in an effective speedup less than that of the one-dimensional algorithm. The qualitative RC 'amenability' of the two-dimensional PDF application could only be capitalized upon to the extent permissible by the designer's algorithm structure and physical device limitations, namely the communication bandwidth. The performance decrease with respect to the 1-D PDF is highlighted by the computation utilizations. Though more parallelizable operations occur in the 2-D PDF algorithm, the increased communication demands of the higher order reduced the speedup of this design for this platform. Comparing Table 7 to the resource utilization of the 1-D algorithm, the hardware usage has increased but still has not nearly exhausted the resources of the FPGA. Additional parallelism could be exploited to improve the performance of the 2-D algorithm.
5.2 Molecular Dynamics
Molecular dynamics (MD) is the numerical simulation of the physical interactions of atoms and molecules over a given time interval. Along with standard Newtonian physics, properties such as Van der Waals forces and electrostatic charge (among others) are calculated for each molecule at each time step with respect to the movement and the molecular structure of every particle in the system. The tremendous number of options for physically modeled phenomena and computational methods makes MD a perfect example of how various designs for an application can have radically different execution times. Three different versions of the molecular dynamics algorithm [1], [6], and [10] report speedup values of 0.29x, 2x, and 46x, respectively. These designs make use of various algorithm optimizations, precision choices, and FPGA platform selections. Consequently, RAT can offer insight about a particular design, but it cannot guarantee that a better solution does not exist.
The version of the molecular dynamics application used for this case study was adapted from code provided by Oak Ridge National Lab (ORNL). Table 8 summarizes the input parameters for the RAT analysis of the MD design. The data size of 16,384 molecules (i.e. elements) was chosen because
Table 8: Input parameters of MD

  Dataset Parameters
  Nelements, input (elements)       16384
  Nelements, output (elements)      16384
  Nbytes/element (bytes/element)    36

  Communication Parameters
  throughput_ideal (MB/s)           500
  α_write (0 < α < 1)               0.9
  α_read (0 < α < 1)                0.9

  Computation Parameters
  Nops/element (ops/element)        164000
  throughput_proc (ops/cycle)       50
  f_clock (MHz)                     75/100/150

  Software Parameters
  t_soft (sec)
  N_iter (iterations)               1
Table 9: Performance parameters of MD

                 Predicted  Predicted  Predicted  Actual
  fclk (MHz)     75         100        150        100
  tcomm (sec)    2.62E-3    2.62E-3    2.62E-3    1.39E-3
  tcomp (sec)    7.17E-1    5.37E-1    3.58E-1    8.79E-1
  utilcomm,SB    0.4%       0.5%       0.7%       0.2%
  utilcomp,SB    99.6%      99.5%      99.3%      99.8%
  tRC,SB (sec)   7.19E-1    5.40E-1    3.61E-1    8.80E-1
  speedup        8.0        10.7       16.0       6.6
it is a small but still scientifically interesting problem. Each element requires 36 bytes: 4 bytes each for position, velocity, and acceleration in each of the X, Y, and Z spatial directions. The interconnect parameters model an XtremeData XD1000 platform containing an Altera Stratix II EP2S180 user FPGA connected to an Opteron processor over the HyperTransport fabric.
By far the most challenging aspect of performance prediction for the molecular dynamics application is accurately measuring the number of operations per element (i.e. molecule) and operations per clock cycle. This particular algorithm's execution time is dependent on the locality of the molecules, which is a function of the dataset values. Distant molecules are assumed to have negligible interaction and therefore require less computational effort. Consequently, the number of operations per element can only be estimated for this circumstance and, as previously discussed, is treated as a "tuning" parameter. Though 50 is the quantitative value computed from the equations to achieve the desired overall speedup of approximately 10x, this value serves qualitatively to the user as an indicator that substantial data parallelism and functional pipelining must be achieved in order to realize the desired speedup. Several major architectural design revisions were explored, based upon the RAT findings, in order to facilitate the necessary parallelism. Additionally, the same clock frequencies were used as in the previous case studies, even though the FPGA platform has changed, because the parameters are empirically reasonable. The serial software baseline was performed on a 2.2 GHz Opteron processor, the
Table 10: Resource usage of MD (EP2S180)

  FPGA Resource    Utilization
  9-bit DSPs
  BRAMs
  ALUTs
host processor of the XD1000 system. The entire dataset is processed in a single iteration.
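The "tuning" use of throughput_proc described above can be inverted: given a target speedup, solve for the ops/cycle the design must sustain. A sketch under stated assumptions: t_soft is not quoted in this excerpt, so the ~5.8 sec used below is inferred from Table 9 (speedup times execution time), and the relation is simply the single-buffered model rearranged.

```python
def required_ops_per_cycle(target_speedup, t_soft, t_comm,
                           n_elem, ops_per_elem, f_clk, n_iter=1):
    """Ops/cycle needed so that n_iter*(t_comm + t_comp) hits the target."""
    t_rc_target = t_soft / target_speedup
    t_comp_target = t_rc_target / n_iter - t_comm
    return n_elem * ops_per_elem / (t_comp_target * f_clk)

# MD at fclk = 100 MHz, targeting ~10x (t_soft ~5.8 sec is an inference)
need = required_ops_per_cycle(10, 5.8, 2.62e-3, 16384, 164000, 100e6)
# 'need' comes out in the vicinity of the 50 ops/cycle quoted in the text
```

Read qualitatively, as the text suggests: a design sustaining tens of operations per cycle demands substantial data parallelism and functional pipelining.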
The application was constructed in the high-level language (HLL) Impulse C, in contrast to the PDF algorithms. Normally the usage of an HLL would have made it more difficult (if not impossible) to ensure that the particular cycle-by-cycle structure desired by the researcher was utilized in the final design. Even slight variabilities in algorithm structure introduced by HLLs could make accurate RAT computational analyses challenging. For the molecular dynamics application, which possesses data-dependent operations and an irregular computation structure independent of its linguistic representation, performance prediction is already handicapped in terms of accuracy. Consequently, the usage of an HLL is beneficial for this MD application, significantly reducing development time because the variability of the high-level paradigm will not appreciably impact the already complex design structure.
Table 9 outlines the predicted and actual results of the molecular dynamics algorithm. The actual communication time is the same order of magnitude as the predicted value. While more accurate estimations are always the goal of RAT, any further precision improvements for this parameter are inconsequential given the low communication utilization. Computation dominated the overall RC execution time, and the actual time was also the same order of magnitude as the predicted value. Again, what eventually allowed the molecular dynamics algorithm to succeed in the RC paradigm was a qualitative interpretation of the prediction parameters, which highlighted the need for scalable parallelism (i.e. the ability to work on several molecules simultaneously). After major high-level design revisions, the designer successfully created a molecular dynamics algorithm that met the predicted criteria with moderate success. However, as Table 10 illustrates, a large percentage of the combinatorial logic and dedicated multiply-accumulators (DSPs) were required.
6. CONCLUSIONS
RAT is a simple and effective method for investigating the performance potential of the mapping of a given application design to a given FPGA platform architecture. The methodology employs a simple approach to planning an FPGA design. RAT is meant to work with empirical knowledge of RC devices to create a more efficient and effective means of formulating an application design.
RAT succeeds in its goal of simple and reasonably accurate prediction for application designs. For deterministic algorithms such as the PDF estimation, RAT can accurately estimate the computational loads expected of the FPGA. Coupled with an accurate measurement of the FPGA platform's interconnect throughput, a suitable prediction as to the 'RC amenability' was formulated before any hardware code was composed. In contrast, the molecular dynamics experiment grapples with a situation where data dependencies in the algorithm create uncertainty about the overall runtime. However, RAT was able to qualitatively highlight the large volume of parallelism that would be required to achieve even a 10x speedup. Consequently, extra algorithm restructuring was incorporated to increase performance and obtain a speedup somewhat near the 10x goal.
Though RAT has thus far proven effective and useful to the endeavors of its designers, there are several additional areas under consideration. The current methodology was designed to support applications involving several algorithms, each with its own separate RAT analysis. Further experimentation and usage with such design projects is necessary, especially with systems containing multiple FPGAs being increasingly deployed. Several strategies for improving the throughput test's analysis are also under consideration.
7. ACKNOWLEDGMENTS
This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422. The authors gratefully acknowledge vendor equipment and/or tools provided by Altera, Xilinx, Nallatech, XtremeData, and Impulse Accelerated Technologies that helped make this work possible.
8. REFERENCES
[1] N. Azizi, I. Kuon, A. Egier, A. Darabiha, and P. Chow. Reconfigurable molecular dynamics simulator. In Proc. IEEE 12th Symp. Field-Programmable Custom Computing Machines (FCCM), pages 197-206, Napa, CA, Apr 20-23 2004.
[2] P. Banerjee, D. Bagchi, M. Haldar, A. Nayak, V. Kim, and R. Uribe. Automated conversion of floating point MATLAB programs into fixed point FPGA based hardware design. In Proc. IEEE 11th Symp. Field-Programmable Custom Computing Machines (FCCM), pages 263-264, Napa, CA, Apr 8-11 2003.
[3] K. Bondalapati and V. Prasanna. Dynamic precision management for loop computations on reconfigurable architectures. In Proc. IEEE 7th Symp. Field-Programmable Custom Computing Machines (FCCM), pages 249-258, Napa, CA, Apr 21-23 1999.
[4] D. Buell. Programming reconfigurable computers: Language lessons learned. In Reconfigurable Systems Summer Institute (RSSI), Urbana, IL, Jul 12-13 2006.
[5] M. Chang and S. Hauck. Precis: A design-time precision analysis tool. In Proc. IEEE 10th Symp. Field-Programmable Custom Computing Machines (FCCM), pages 229-238, Napa, CA, Apr 22-24 2002.
[6] L. Cordova and D. Buell. An approach to scalable molecular dynamics simulation using supercomputing adaptive processing elements. In Proc. IEEE Int. Conf. Field Programmable Logic and Applications (FPL), pages 711-712, Aug 24-26 2005.
[7] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In Proc. ACM 4th Symp. Principles and Practice of Parallel Programming, pages 1-12, San Diego, CA, May 19-22 1993.
[8] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proc. ACM 10th Symp. Theory of Computing, pages 114-118, San Diego, CA, May 1-3 1978.
[9] A. Gaffar, O. Mencer, W. Luk, P. Cheung, and N. Shirazi. Floating-point bitwidth analysis via automatic differentiation. In Proc. IEEE Int. Conf. Field-Programmable Technology (FPT), pages 158-165, Hong Kong, China, Dec 16-18 2002.
[10] Y. Gu, T. VanCourt, and M. Herbordt. Accelerating molecular dynamics simulations with configurable circuits. In Proc. IEE Computers and Digital Techniques, volume 153, pages 189-195, May 2 2006.
[11] Z. Guo, W. Najjar, F. Vahid, and K. Vissers. A quantitative analysis of the speedup factors of FPGAs over processors. In Proc. ACM 16th Symp. Field-Programmable Gate Arrays (FPGA), pages 162-170, Monterey, CA, Feb 22-24 2004.
[12] T. Jeger, R. Enzler, D. Cottet, and G. Troster. The performance prediction model: a methodology for estimating the performance of an FPGA implementation of an algorithm. Technical report, Electronics Lab, Swiss Federal Inst. of Technology (ETH) Zurich, 2000.
[13] V. Kindratenko and D. Pointer. A case study in porting a production scientific supercomputing application to a reconfigurable computer. In Proc. IEEE 14th Symp. Field-Programmable Custom Computing Machines (FCCM), pages 13-22, Napa, CA, Apr 24-26 2006.
[14] D.-U. Lee, A. Gaffar, O. Mencer, and W. Luk. Optimizing hardware function evaluation. IEEE Trans. Computers, 54(12):1520-1531, Dec. 2005.
[15] M. Smith and G. Peterson. Parallel application performance on shared high performance reconfigurable computing resources. Performance Evaluation, 60:107-125, May 2005.
[16] C. Steffen. Parameterization of algorithms and FPGA accelerators to predict performance. In Reconfigurable Systems Summer Institute (RSSI), Urbana, IL, Jul 17-20 2007.
