
RAT: A Methodology for Predicting Performance

in Application Design Migration to FPGAs



Brian Holland, Karthik Nagarajan, Chris Conger, Adam Jacobs, Alan D. George
NSF Center for High-Performance Reconfigurable Computing (CHREC)
ECE Department, University of Florida
{holland,nagarajan,conger,jacobs,george}@chrec.org


ABSTRACT
Before any application is migrated to a reconfigurable com-
puter (RC), it is important to consider its amenability to
the hardware paradigm. In order to maximize the proba-
bility of success for an application's migration to an FPGA,
one must quickly and with a reasonable degree of accuracy
analyze not only the performance of the system but also
the required precision and necessary resources to support a
particular design. This extra preparation is meant to reduce
the risk of failure to achieve the application's design require-
ments (e.g. speed or area) by quantitatively predicting the
expected performance and system utilization. This paper
presents the RC Amenability Test (RAT), a methodology
for rapidly analyzing an application's design amenability
to a specific FPGA platform.


1. INTRODUCTION
FPGAs continue to grow as a viable option for increas-
ing the performance of many applications over traditional
CPUs without the need for ASICs. Because no standard-
ized rules exist for FPGA amenability, it is important for a
designer to consider the likely performance of an application
in hardware before undergoing a lengthy migration process.
Ultimately, the designer must know what order of magnitude
speedup (or potentially slowdown) will be encountered.
Some researchers have suggested [4] that a 50x to 100x
speedup is required to gain the attention and approval of
"middle management." Other scenarios might place the
break-even point (time of development versus time saved
at execution) at a more conservative factor of ten or less.
The high-performance embedded community might simply
want FPGA performance to parallel a traditional processor
since savings could come in the form of reduced power
usage. Ultimately, the success or failure of an application's
RC migration will be judged against some metric of
performance. It is critical to consider whether the chosen
application architecture and FPGA platform will meet the
speed, area, and power requirements of the project. The
RC Amenability Test (RAT) is a combination of algorithm
and software legacy code analyses along with 'pencil and
paper' computations that seeks to determine the likelihood
of success for a specific algorithm's migration to a particular
RC platform before any (or at least significant) hardware
coding is begun.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
HPRCTA'07, November 11, 2007, Reno, Nevada, USA
Copyright 2007 ACM 978-1-59593-894-7/07/0011 ...$5.00.

The need for the RAT methodology stemmed from com-
mon difficulties encountered during several FPGA applica-
tion migration projects. Researchers would typically possess
a software application but would be unsure about poten-
tial performance gains in hardware. The level of experience
with FPGAs would vary greatly among the researchers and
inexperienced designers were often unable to quantitatively
project and compare possible algorithmic design and FPGA
platform choices for their application. Many initial pre-
dictions were haphazardly formulated and performance es-
timation methods varied greatly. Consequently, RAT was
created to consolidate and unify the performance prediction
strategies for faster, simpler, and more effective analy-
ses.
Three factors are considered for the amenability of an
application to hardware: throughput, numerical precision,
and resource usage. The authors believe that these issues
dominate the overall effectiveness of an application's hard-
ware migration. Consequently, analyses for these three fac-
tors comprise the majority of the RAT methodology. The
throughput analysis uses a series of simple equations to pre-
dict the performance of the application based upon known
parameters (e.g. interconnect speed) and values estimated
from the proposed design (e.g. volume of communicated
data). Numerical precision analysis is a subset of through-
put encompassing the design trade-offs in performance as-
sociated with possible algorithm data formats and their as-
sociated error. Resource analysis involves estimating the
application's hardware usage in order to detect designs that
consume more than the available resources.
Many research projects as discussed in [11] emphasize the
usage of FPGAs to achieve speedup over traditional CPUs.
Consequently, accurate throughput analysis is the primary
focus of the RAT methodology. While numerical precision,
resource utilization, and other issues such as development
time or power usage are not trivial, they are less likely to be
the sole contributor to the failure of an application migration
when speedup is the primary goal. Consequently, RAT's
throughput analysis is the most detailed performance test
and the focus of the application case studies in this paper.
The remainder of this paper is structured as follows. Sec-
tion 2 discusses background related to FPGA performance









prediction and resource utilization. The fundamental analy-
ses comprising the RAT methodology are detailed in Section
3. A detailed walkthrough illustrating the usage of RAT
with a real application is in Section 4. Section 5 presents
additional case studies using RAT to further explore the
accuracy of the performance prediction methodology. Con-
clusions and future work are discussed in Section 6.

2. RELATED WORK
Researchers have studied the area of hardware amenabil-
ity but their approaches vary. A Performance Prediction
Model (PPM) is suggested in [12] for determining the opti-
mal mapping of an algorithm to an FPGA. Their methodol-
ogy consists of four steps: choice and modification of the im-
plementation, classification, feature extraction, and perfor-
mance matrix computation (for frequency, latency, through-
put, and area requirements). The concept of a detailed clas-
sification of the internal operations of an application de-
sign is very practical. However, the performance estimation
method incorporates quantitative area, IO pins, latency, and
throughput into a large system of platform-dependent equa-
tions which is impractical for RAT. The goal of the research
presented in [14] is to determine the optimal function design
with respect to area, latency and throughput. Ultimately,
the project seeks to create a tool or library for identify-
ing the best version among many alternatives for a particu-
lar scenario. The work provides valuable insight, especially
into the domain of application precision, but its focus on
automated design of single kernels is overly specific for a
high-level RAT methodology for applications.
A performance prediction technique presented in [16] seeks
to parameterize not only the computational algorithm but
also the FPGA system. Applications are decomposed and
analyzed to determine their total size and computational
density. Computational platforms are characterized by their
memory size, bandwidth, and latency. By comparing the al-
gorithm's computational requirement with the memory bot-
tleneck of the FPGA platform, a worst-case computational
throughput (in operations per second) can be quantified.
However, the author asserts that "the performance analysis
in this paper is not real performance prediction; rather it
targets the general concern of whether or not an algorithm
will fit within the memory subsystem that is designed to feed
it." The RAT methodology differs because it seeks not only
to quantify the number of operations but also the expected
number of operations executed per clock cycle yielding per-
formance predictions strictly in units of time.
Another performance prediction technique [15] concerns
modeling of shared heterogeneous workstations containing
reconfigurable computing devices. This methodology chiefly
concerns the modeling of system level, multi-FPGA architec-
tures with variable computational loading due to the multi-
user environment. The basic execution time model encom-
passes five steps: master node setup, serial node setup, par-
allel kernel computation (in hardware and/or software), se-
rial node shutdown, and master node shutdown. Similar to
RAT, analytic models are used to estimate the performance
of the components of application execution time. However,
this heterogeneous system modeling assumes that hardware
tasks have deterministic runtime and performance estimates
can be based on clock frequencies or simulation. In contrast,
RAT is primarily focused on modeling FPGA execution
times before designs are coded for hardware and applica-


tion analysis is not limited to deterministic algorithms. Ef-
fectively, RAT and the heterogeneous modeling could work
collaboratively to provide higher fidelity system-level trade-
off analyses before any application code is migrated to hard-
ware.
A research project into molecular dynamics at the Uni-
versity of Illinois [13] proposes a framework for application
design in the hardware paradigm. The project asserts that
the most frequently accessed block of code is not the only in-
dicator for RC amenability. The types of computations and
the volume of communication will increase or decrease the
recommended quantity of FPGA functions. Although the
general framework stresses resource utilization tests for case
studies, the researchers "postpone finding out about space
requirements for the design until [they] actually map it to
the FPGA." A priori measurements of resource requirements
may be inexact, but they are still necessary to avoid creating
initial designs that are physically unrealizable.
Conceptually, the RAT methodology is meant to resemble
the approach behind the Parallel Random Access Machine
(PRAM) [8] model for traditional supercomputing. Both
RAT and PRAM attempt to model the critical (and hope-
fully small) set of algorithm and platform attributes neces-
sary to achieve a better understanding of the greater com-
putational interaction and ultimately the application perfor-
mance. PRAM focuses on modes of concurrent memory ac-
cesses whereas RAT examines the communication and com-
putation interaction on an FPGA. These critical attributes
of RAT also resemble the LogP model [7] (a successor to
PRAM) which seeks to abstract "the computing bandwidth,
the communication bandwidth, the communication delay,
and the efficiency of coupling communication and compu-
tation." RAT does not claim to be the ultimate solution
to RC performance prediction but instead encourages the
FPGA community to consider more structured and stan-
dardized ways for algorithm analysis using established con-
cepts inherited from prior successes in traditional parallel
computing modeling.

3. RC AMENABILITY TEST
Figure 1 illustrates the basic methodology behind the RC
amenability test. This simple set of tests serves as a basis
for determining the viability of an algorithm design on the
FPGA platform prior to any FPGA programming. Again,
RAT is intended to address the performance of a specific
design, not a generic algorithm. The results of the RAT
tests must be compared against the designer's requirements
to evaluate the success of the application design. Though the
throughput analysis is considered the most important step,
the three tests are not necessarily used as a single, sequential
procedure. Often, RAT is applied iteratively during the
design process until a suitable version of the algorithm is
formulated or all reasonable permutations are exhausted
without a satisfactory solution.

3.1 Throughput
For RAT, the predicted performance of an application
is defined by two terms: communication time between the
CPU and FPGA, and FPGA computation time. Reconfigu-
ration and other setup times are ignored. These two terms
encompass the rate at which data flows through the FPGA
and rate at which operations occur on that data, respec-
tively. Because RAT seeks to analyze applications at the











Figure 1: Overview of RAT Methodology
[Flowchart: identify the kernel and create a design on paper; apply the RAT
throughput, numerical precision, and resource tests; revise the design if
communication or computation throughput is insufficient, the minimum
precision requirement is unrealizable, or resources are insufficient;
otherwise build the design in an HDL or HLL, simulate it, and verify it on
the hardware platform.]


earliest stage of hardware migration, these terms are reduced
to the most generalized parameters. The RAT throughput
test implicitly models FPGAs as co-processors to general-
purpose processors but the framework can be adjusted for
streaming applications.
Calculating the communication time is a relatively sim-
plistic process given by Equations (1), (2), and (3). The
overall communication time is defined as the summation of
the read and write components. For the individual reads
and writes, the problem size (i.e. number of data elements,
Nelements) and the numerical precision (i.e. number of bytes
per element, Nbytes/element) must be decided by the user
with respect to the algorithm. Note that problem size only
refers to a single block of data to be buffered by the FPGA
system. An application's data communication may be di-
vided into multiple discrete transfers, which is accounted
for in a subsequent equation. The hypothetical bandwidth
of the FPGA/processor interconnect on the target platform
(e.g. 133MHz 64-bit PCI-X which has a documented maxi-
mum throughput of 1GB/s) is also necessary but is generally
provided either with the FPGA system documentation or as
part of the interconnect standard. An additional parameter,
α, represents the fraction of ideal throughput performing
useful communication. The actual sustained performance of
the FPGA interconnect will only be a fraction of the docu-
mented transfer rate. Microbenchmarks composed of simple
data transfers can be used to establish the true communica-
tion bandwidth.


t_comm = t_read + t_write                                                  (1)

t_read = (N_elements × N_bytes/element) / (α_read × throughput_ideal)      (2)

t_write = (N_elements × N_bytes/element) / (α_write × throughput_ideal)    (3)
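As a concrete illustration, the following Python sketch evaluates Equations
(1) through (3). The function and parameter names are hypothetical, and the
convention that the "write" leg carries the input buffer to the FPGA while
the "read" leg returns results matches the walkthrough of Section 4.

def transfer_time(n_elements, bytes_per_element, alpha, ideal_throughput):
    """Equations (2)/(3): seconds to move one buffer across the interconnect.

    alpha scales the documented peak bandwidth (bytes/s) down to the
    fraction actually sustained for useful data.
    """
    return (n_elements * bytes_per_element) / (alpha * ideal_throughput)


def comm_time(n_in, n_out, bytes_per_element, alpha_write, alpha_read,
              ideal_throughput):
    """Equation (1): write one input buffer to the FPGA, read the results back."""
    t_write = transfer_time(n_in, bytes_per_element, alpha_write, ideal_throughput)
    t_read = transfer_time(n_out, bytes_per_element, alpha_read, ideal_throughput)
    return t_read + t_write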
Before further equations are discussed, it is important to
clarify the concept of an "element." Until now, the expres-
sions "problem size," "volume of communicated data," and


"number of elements" have been used interchangeably. How-
ever, strictly speaking, the first two terms refer to a quan-
tity of bytes whereas the last term has the ubiquitous unit
"elements." RAT operates under the assumption that the
computational workload of an algorithm is directly related
to the size of the problem dataset. Because communication
times are concerned with bytes and (as will be subsequently
shown) computation times revolve around the number of
operations, a common term is necessary to express this re-
lationship. The element is meant to be the basic building
block which governs both communication and computation.
For example, an element could be a value in an array to be
sorted, an atom in a molecular dynamics simulation, or a
single character in a string-matching algorithm. In each of
these cases, some number of bytes will be required to rep-
resent that element and some number of calculations will
be necessary to complete all computations involving that
element. The difficulty is establishing what subset of the
data should constitute an element for a particular algorithm.
Often an application must be analyzed in several separate
stages since each portion of the algorithm could interpret
the input data in a different scope.
Estimating the computational component, as given in
Equation (4), of the RC execution time is more compli-
cated than communication due to the conversion factors.
Whereas the number of bytes per element is ultimately a
fixed, user-defined value, the number of operations (i.e. com-
putations) per element must be manually measured from the
algorithm structure. Generally, the number of operations
will be a function of the overall computational complexity
of the algorithm and the types of individual computations
involved. Additionally, as with the communication equa-
tion, a throughput term, throughputproc is also included to
establish the rate of execution. This parameter is meant
to describe the number of operations completed per cycle.
For fully pipelined designs, the number of operations per
cycle will equal the number of operations per element. Less
optimized designs will only have a fraction of the capacity
requiring multiple cycles to complete an element. Again,
note that computation time essentially refers to the time
required to operate on the data provided by one communi-
cation transfer. (Applications with multiple communication
and computation blocks are resolved when the total FPGA
execution time is computed later in this section.)

t_comp = (N_elements × N_ops/element) / (f_clock × throughput_proc)        (4)
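The computation term can be sketched the same way (again with illustrative
names only):

def comp_time(n_elements, ops_per_element, f_clock_hz, ops_per_cycle):
    """Equation (4): seconds to process one buffer's worth of elements.

    ops_per_cycle (throughput_proc) is the average number of operations the
    design retires per clock cycle, so f_clock_hz * ops_per_cycle is the
    sustained computational rate in operations per second.
    """
    return (n_elements * ops_per_element) / (f_clock_hz * ops_per_cycle)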

Despite the potential unpredictability of algorithm behav-
ior, estimating a sufficiently precise number of operations is
still possible for many types of applications. However, pre-
dicting the average rate of operation execution can be chal-
lenging even with detailed knowledge of the target hard-
ware design. For applications with a highly deterministic
pipeline, the procedure is straightforward. But for interde-
pendent or data dependent operations, the problem is more
complex. For these scenarios, a better approach would be to
treat throughputproc as an independent variable and select
a desired speedup value. Then one can solve for the partic-
ular throughputproc value required to achieve that desired
speedup. This method provides the user with insight into
the relative amount of parallelism that must be incorporated
for a design to succeed.
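A minimal sketch of that rearrangement is shown below; it assumes the
single-buffered execution-time and speedup relations given later in
Equations (5) and (7), and the helper name and arguments are assumptions of
this example rather than part of the methodology itself.

def required_ops_per_cycle(desired_speedup, t_soft, n_iter, t_comm,
                           n_elements, ops_per_element, f_clock_hz):
    """Solve Equation (4) for throughput_proc under the single-buffered model
    of Equations (5) and (7): speedup = t_soft / (n_iter * (t_comm + t_comp)).
    """
    t_comp_budget = t_soft / (desired_speedup * n_iter) - t_comm
    if t_comp_budget <= 0:
        raise ValueError("communication alone already exceeds the time budget")
    return (n_elements * ops_per_element) / (f_clock_hz * t_comp_budget)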









Figure 2: Example Overlap Scenarios
[Timelines for three cases: single buffered (reads, computations, and writes
strictly alternate); double buffered, computation bound (communication is
hidden behind computation); and double buffered, communication bound
(computation is hidden behind communication). R = read, W = write,
C = compute.]




Similar to an element, one must also examine what is an
"operation." Consider an example application composed of
a 32-bit addition followed by a 32-bit multiplication. The
addition can be performed in a single clock cycle but to save
resources the 32-bit multiplier might be constructed using
the Booth algorithm requiring 16 clock cycles. Arguments
could be made that the addition and multiplication would
count as either two operations (addition and multiplication)
or 17 operations (addition plus 16 additions, the basis of the
Booth multiplier algorithm). Either formulation is correct
provided that throughputproc is formulated with the same
assumption about the scope of an operation. For this exam-
ple, 2/17 and 1 operations per cycle, respectively, yield the
correct computation time of 17 cycles.
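A two-line check confirms that both scopings are consistent (Equation (4)
with f_clock factored out so the result is in cycles):

# Both scopings of the add-plus-Booth-multiply example give 17 cycles per element.
cycles_coarse = 2 / (2 / 17)   # 2 ops/element at 2/17 ops/cycle
cycles_fine = 17 / 1           # 17 ops/element at 1 op/cycle
assert cycles_coarse == cycles_fine == 17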
Figure 2 illustrates the types of communication and com-
putation interaction to be modeled with the throughput test.
Single buffering (SB) represents the most simplistic scenario
with no overlapping tasks. However, a double-buffered (DB)
system allows overlapping communication and computation
by providing two independent buffers to keep both the pro-
cessing and I/O elements occupied simultaneously. Since the
first computation block cannot proceed until the first com-
munication sequence has completed, steady-state behavior is
not achievable until at least the second iteration. However,
this startup cost is considered negligible for a sufficiently
large number of iterations.
The FPGA execution time, tRc, is a function not only
of the tcomm and tcomp terms but also the amount of over-
lap between communication and computation. Equations
(5) and (6) model both single- and double-buffered scenar-
ios. For single buffered, the execution time is simply the
summation of the communication time, tcomm, and compu-
tation time, tcomp. With the double-buffered case, either the
communication or computation time completely overlaps the
other term. The smaller latency essentially becomes hidden
during steady-state.
The RAT analysis for computing tcomp implicitly assumes
one algorithm "functional unit" operating on a single buffer's
worth of transmitted information. The parameter Niter is
the number of iterations of communication and computation
required to solve the entire problem.


t_RC_SB = N_iter × (t_comm + t_comp)                                       (5)

t_RC_DB ≈ N_iter × max(t_comm, t_comp)                                     (6)

Assuming that the application design currently under
analysis was based upon available sequential software code,
a baseline execution time, tsoft, is available for comparison
with the estimated FPGA execution time to predict the
overall speedup. As given in Equation (7), speedup is a
function of the total application execution time, not a single
iteration.


speedup = t_soft / t_RC                                                    (7)

Related to the speedup is the computation and communi-
cation utilization given by Equations (8), (9), (10), and (11).
These metrics determine the fraction of the total applica-
tion execution time spent on computation and communica-
tion for the single- and double-buffered cases. Note that the
double-buffered case is only applicable to applications with
a sufficient number of iterations so as to achieve a steady-
state behavior throughout most of the execution time. The
computation utilization can provide additional insight about
the application speedup. If utilization is high, the FPGA
is rarely idle thereby maximizing speedup. Low utilizations
can indicate potential for increased speedups if the algorithm
can be reformulated to have less (or more overlapped) com-
munication. In contrast to computation which is effectively
parallel for optimal FPGA processing, communication is se-
rialized. Whereas computation utilization gives no indica-
tion about the overall resource usage since additional FPGA
logic could be added to operate in parallel without affect-
ing the utilization, the communication utilization indicates
the fraction of bandwidth remaining to facilitate additional
transfers since the channel is only a single resource.


util_comp_SB = t_comp / (t_comm + t_comp)                                  (8)

util_comm_SB = t_comm / (t_comm + t_comp)                                  (9)

util_comp_DB = t_comp / max(t_comm, t_comp)                               (10)

util_comm_DB = t_comm / max(t_comm, t_comp)                               (11)

3.2 Numerical Precision
Application numerical precision is typically defined by the
amount of fixed- or floating-point computation within a de-
sign. With FPGA devices, where increased precision dic-
tates higher resource utilization, it is important to use only
as much precision as necessary to remain within accept-
able tolerances. Because general-purpose processors have
fixed-length data types and readily available floating-point
resources, it is reasonable to assume that often a given soft-
ware application will have at least some measure of wasted
precision. Consequently, effective migration of applications
to FPGAs requires a time-efficient method to determine the
minimum necessary precision before any translation begins.
While formal methods for numerical precision analysis
of FPGA applications are important, they are outside the
scope of the RAT methodology. A plethora of research ex-
ists on topics including automated conversion of floating-
point software programs to fixed-point hardware designs [2],
































Figure 3: Architecture of 1-D PDF Algorithm



design-time precision analysis tools for RC [5], and custom
or dynamic bit-widths for maximizing performance and area
on FPGAs [3, 9]. Application designs are meant to capital-
ize on these numerical precision techniques and then use the
RAT methodology to evaluate the resulting algorithm per-
formance. As with parallel decomposition, numerical formu-
lation is ultimately the decision of the application designer.
RAT provides a quick and consistent procedure for evaluat-
ing these design choices.

3.3 Resources
By measuring resource utilization, RAT seeks to deter-
mine the scalability of an application design. Empirically,
most FPGA designs will be limited in size by the availabil-
ity of three common resources: on-chip memory, dedicated
hardware functional units (e.g. multipliers), and basic logic
elements (i.e. look-up tables and flip-flops).
On-chip RAM is readily measurable since some quantity
of the memory will likely be used for I/O buffers of a known
size. Additionally, intra-application buffering and storage
must be considered. Vendor-provided wrappers for inter-
facing designs to FPGA platforms can also consume a sig-
nificant number of memories but the quantity is generally
constant and independent of the application design.
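As a rough illustration, a first-order block-RAM estimate for an I/O buffer
can be written down directly; the 18 Kb block size and the helper name below
are assumptions of this example.

import math

def bram_blocks(n_elements, bytes_per_element, block_bits=18 * 1024):
    """Rough count of block RAMs needed to hold one I/O buffer.

    The 18 Kb default block size is the Virtex-4 figure; real mappings also
    depend on port widths and aspect ratios, so treat this as a lower bound.
    """
    buffer_bits = n_elements * bytes_per_element * 8
    return math.ceil(buffer_bits / block_bits)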
Although the types of dedicated functional units included
in FPGAs can vary greatly, the hardware multiplier is a
fairly common component. The demand for dedicated mul-
tiplier resources is highlighted by the availability of families
of chips (e.g. Xilinx Virtex-4 SX series) with extra multipli-
ers versus other comparably sized FPGAs. Calculating the
necessary number of hardware multipliers is dependent on
the type and amount of parallel operations required. Mul-
tipliers, dividers, square roots, and floating-point units use
hardware multipliers for fast execution. Varying levels of
pipelining and other design choices can increase or decrease
the overall demand for these resources. With sufficient de-
sign planning, an accurate measure of resource utilization
can be taken for a design given knowledge of the architec-
ture of the basic computational kernels.


Measuring basic logic elements is the most ubiquitous re-
source metric. High-level designs do not empirically trans-
late into any discernible resource count. Qualitative asser-
tions about the demand for logic elements can be made based
upon approximate quantities of arithmetic or logical opera-
tions and registers. But a precise count is nearly impossi-
ble without an actual hardware description language (HDL)
implementation. Above all other types of resources, routing
strain increases exponentially as logic element utilization ap-
proaches maximum. Consequently, it is often unwise (if not
impossible) to fill the entire FPGA.
Currently, RAT does not employ a database of statistics
to facilitate resource analysis of an application for complete
FPGA novices. The usage of RAT requires some vendor-
specific knowledge (e.g. 32-bit fixed-point multiplications on
Xilinx V4 FPGAs require two dedicated 18-bit multipliers).
Resource analyses are meant to highlight general application
trends and predict scalability. For example, the structure of
the molecular dynamics case study in Section 5 is designed
to minimize RAM usage and the parallelism was ultimately
limited by the availability of multiplier resources.


4. WALKTHROUGH
To apply the RAT analysis in Section 3, a worksheet
can be constructed based upon Equations (1) through (11).
Users simply provide the input parameters and the resulting
performance values are returned. This walkthrough further
explains key concepts of the throughput test by performing
a detailed analysis of a real application case study, one-
dimensional probability density function (PDF) estimation.
The goal is to provide a more complete description of how
to use the RAT methodology in a practical setting.

4.1 Algorithm Architecture
The Parzen window technique is a generalized nonpara-
metric approach to estimating probability density functions
(PDFs) in a d-dimensional space. Though more computa-
tionally intensive than using histograms, the Parzen window
technique is mathematically advantageous. For example, the
resulting probability density function is continuous and
differentiable. The computational complexity of the al-
gorithm is of order O(Nnd) where N is the total number
of discrete probability levels (comparable to the number of
"bins" in a histogram), n is the number of discrete points at
which the PDF is estimated (i.e. number of elements), and
d is the number of dimensions. A set of mathematical oper-
ations are performed on every data sample over nd discrete
points. Essentially, the algorithm computes the cumulative
effect of every data sample at every discrete probability level.
For simplicity, each discrete probability level is subsequently
referred to as a bin.
In order to better understand the assumptions and choices
made during the RAT analysis, one general architecture of
the PDF estimation algorithm is highlighted in Figure 3.
A total of 204,800 data samples are processed in batches
of 512 elements at a time against 256 bins. Eight separate
pipelines are created to process a data sample with respect to
a particular subset of bins. Each data sample is an element
with respect to the RAT analysis. The data elements are fed
into the parallel pipelines sequentially. Each pipelined unit
can process one element with respect to one bin per cycle.
Internal registering for each bin keeps a running total of the










Table 1: Input parameters for RAT analysis
Dataset Parameters
  N_elements, input (elements)
  N_elements, output (elements)
  N_bytes/element (bytes/element)
Communication Parameters
  throughput_ideal (MB/s)
  α_write (0 < α < 1)
  α_read (0 < α < 1)
Computation Parameters
  N_ops/element (ops/element)
  throughput_proc (ops/cycle)
  f_clock (MHz)
Software Parameters
  t_soft (sec)
  N_iter (iterations)



Table 2: Input parameters of 1-D PDF
Dataset Parameters
  N_elements, input (elements)          512
  N_elements, output (elements)         1
  N_bytes/element (bytes/element)       4
Communication Parameters
  throughput_ideal (MB/s)               1000
  α_write (0 < α < 1)                   0.37
  α_read (0 < α < 1)                    0.16
Computation Parameters
  N_ops/element (ops/element)           768
  throughput_proc (ops/cycle)           20
  f_clock (MHz)                         75/100/150
Software Parameters
  t_soft (sec)                          0.578
  N_iter (iterations)                   400



impact of all processed elements. These cumulative totals
comprise the final estimation of the PDF function.

4.2 RAT Input Parameters
Table 1 provides a list of all the input parameters neces-
sary to perform a RAT analysis. The parameters are sorted
into four distinct categories, each referring to a particular
portion of the throughput analysis. Note that Nelements is
listed under a separate category when it is actually used by
both communication and computation. It is assumed that
the number of elements dictating the computation volume
is also the number of elements that are input to the ap-
plication. While there are cases where applications exhibit
unusual computational trends or require significant amounts
of additional data (e.g. constants, seed values, or lookup ta-
bles), the current RAT user base has found these instances
to be uncommon. Alterations can be made to account for
these cases but such examples are not included in this paper.
Table 2 summarizes the input parameters for the RAT
analysis for our 1-D PDF estimation algorithm using Gaus-
sian kernels. The dataset parameters are generally the first
values supplied by the user since the number of elements
will ultimately govern the entire algorithm performance.
Though the entire application involves 204,800 data sam-


ples, each iteration of the 1-D PDF estimation will involve
only a portion, 512 data samples, or 1/400 of the total set.
This algorithm effectively consumes all of the input values.
Only one value is retained after each iteration per bin. Each
iteration's result is retained on the FPGA and all values are
transferred back to the host in a single block after the al-
gorithm has completed. However, the output transfer time,
regardless of how it is modeled, remains negligible with re-
spect to the overall execution time.
The number of bytes per element is rounded to four (i.e.
32 bits). Even though the PDF estimation algorithm only
uses 18-bit fixed point, the communication channel uses 32-
bit communication. During the algorithmic formulation, 18-
bit and 32-bit fixed point along with 32-bit floating point
were considered for use in the PDF algorithm. However,
the maximum error percentage for 18-bit fixed point was
small enough to provide satisfactory precision for the appli-
cation. Ultimately 18-bit fixed point was chosen so that only
one Xilinx 18x18 multiply-accumulate (MAC) unit would be needed
per multiplication. Though slightly smaller bitwidths would
have also possessed reasonable error constraints, no perfor-
mance gains or appreciable resource savings would have been
achieved.
Next, the communication parameters are provided by the
user since they are merely a function of the target RC
platform, which in this case is a Nallatech H101-PCIXM
card containing a Virtex-4 LX100 user FPGA. The card
is connected to the host CPU via a 133MHz PCI-X bus
which has a theoretical maximum bandwidth of 1000MB/s.
The α parameters were computed using a microbenchmark
consisting of a read and a write for a data size comparable
to one used by the 1-D PDF algorithm. The resulting read
and write times were measured and combined with the transfer
size to compute the actual communication rates, and the α
parameters were then calculated by dividing those rates by the
theoretical maximum. The α parameters for the FPGA platform are
low due to communication protocols used by Nallatech atop
PCI-X. In general, the microbenchmark is performed on an
FPGA platform over a wide range of possible data sizes. The resulting
α values can be tabulated and used in future RAT analyses
for that FPGA platform.
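In code form, deriving α from a timed microbenchmark is a simple ratio (the
function below is illustrative and not part of any vendor API):

def alpha_from_benchmark(bytes_transferred, measured_seconds, ideal_bytes_per_sec):
    """Fraction of the documented peak bandwidth actually sustained (0 < alpha <= 1)."""
    measured_bandwidth = bytes_transferred / measured_seconds
    return measured_bandwidth / ideal_bytes_per_sec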
The computation parameters are perhaps the most chal-
lenging portion of the performance analysis, but are still
simplistic given the deterministic behavior of the PDF esti-
mation algorithm. As mentioned earlier, each element that
comes into the PDF estimator is evaluated against each of
the 256 bins. Each computation requires 3 operations: com-
parison (subtraction), multiplication, and addition. There-
fore, the number of operations per element totals 768 (i.e.
256 x 3). This particular algorithm structure has 8 pipelines
that each effectively perform 3 operations per cycle for a to-
tal of 24. However, this value is conservatively rounded down
to 20 to account for pipeline latency and other overheads
that are not otherwise considered. The 1-D PDF algorithm
is constructed in VHDL to allow explicit, cycle-accurate con-
struction of the intended design.
While previous parameters could be reasonably inferred
from the deterministic structure of the 1-D PDF algorithm,
a priori estimation of the required clock frequency is very
difficult. Empirical knowledge of FPGA platforms and algo-
rithm design practices provides some insight as to a range of
likely values. However, attaining a single, accurate estimate
of the maximum FPGA clock frequency achieved is gener-









ally impossible until after the entire application has been
converted to a hardware design and analyzed by an FPGA
vendor's layout and routing tools. Consequently, a num-
ber of clock values ranging from 75MHz to 150MHz for the
LX100 are used to examine the scope of possible speedups.
The software parameters provide the last piece of infor-
mation necessary to complete the speedup analysis. The
software execution time of the algorithm is provided by the
user. Often, software legacy code is the basis for the hard-
ware migration initiative. But one could equally generate
this code from a mathematically defined algorithm strictly
as a baseline for performance analysis. The baseline software
for the 1-D PDF estimation was written in C, compiled using
gcc, and executed on a 3.2 GHz Xeon. Lastly, the number
of iterations is deduced from the portion of the overall prob-
lem to reside in the FPGA at any one time. Since the user
decided to process only 512 elements at a time from the
204,800-element set, there must be 400 (i.e. 204800/512)
iterations of the algorithm.

4.3 Predicted and Actual Results
The RAT performance numbers are compared with the
experimentally measured results in Table 3. Each predicted
value in the table is computed using the input parameters
and equations listed in Section 3.1. For example, the pre-
dicted computation time when fclk = 150 MHz is computed
as follows:


t_comp = (512 elements × 768 ops/element) / (150 MHz × 20 ops/cycle)
       = 393216 ops / 3E+9 ops/sec
       = 1.31E-4 sec


The communication time is computed using the corresponding equation.
Because the application is single buffered, the total RC execution time
is a simple summation:

t_RC_SB = 400 iterations × (5.56E-6 sec + 1.31E-4 sec) = 5.46E-2 sec

The speedup is simply the division of the software execu-
tion time by the RC execution time. The utilization equa-
tions are computed using the corresponding single-buffered
equations.
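For reference, the predicted 150 MHz column of Table 3 can be reproduced
with plain arithmetic from the Table 2 inputs; this is a standalone
restatement of the equations above, with the write/read breakdown of t_comm
inferred from the input and output element counts.

# Predicted 1-D PDF values at f_clock = 150 MHz (inputs from Table 2).
t_write = (512 * 4) / (0.37 * 1e9)      # input buffer to the FPGA
t_read = (1 * 4) / (0.16 * 1e9)         # single result element back
t_comm = t_write + t_read               # ~5.56E-6 s
t_comp = (512 * 768) / (150e6 * 20)     # ~1.31E-4 s
t_rc = 400 * (t_comm + t_comp)          # single buffered, ~5.46E-2 s
speedup = 0.578 / t_rc                  # ~10.6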
The communication and computation times for the actual
FPGA code were measured from the hardware. The to-
tal execution time for the hardware can be computed from
Equation (5) or, as with this case study, measured from the
FPGA to ensure maximum accuracy. The speedup and uti-
lization values were computed from this information using
the same equations as the predicted values. From the table,
the predicted speedup when fclk = 150 MHz is reasonably close to
the actual value. The discrepancy in speedup in this case is due
to the inaccuracies in the tcomm estimation. Although com-
munication predictions are based on experimentally gath-
ered data from microbenchmarks, the true behavior was not
encapsulated for this algorithm. Variability in the commu-
nication time with the small data sizes (i.e. 2KB for one
iteration of the 1-D PDF algorithm) and additional delays
introduced by 800 (400 read, 400 write) repetitive transfers
are considered to be the source of the error.
A relatively accurate tcomp prediction is not surprising
given the deterministic nature of the algorithm. However,
the two significant figures of accuracy between the predicted
and actual computation times with fclk = 150 MHz was un-
expected given that computational throughput was conser-
vatively estimated. Much of the 1-D PDF algorithm is
pipelined but enough latency and pipeline stalls existed to
genuinely warrant a 17% reduction in the throughput esti-
mate (i.e. 20 ops/cycle instead of 24). Unfortunately, the
inaccuracies in the communication time predictions reduced
the accuracy of the overall speedup. Had the communication
been double buffered, the inaccuracies in the communication
time could have been masked behind the more stable compu-
tation time for a more accurate (and higher) speedup. The
relatively low resource usage in Table 4 also illustrates a po-
tential for further speedup by including additional parallel
kernels.





Table 3: Performance parameters of 1-D PDF

               Predicted   Predicted   Predicted   Actual
fclk (MHz)     75          100         150         150
tcomm (sec)    5.56E-6     5.56E-6     5.56E-6     2.50E-5
tcomp (sec)    2.62E-4     1.97E-4     1.31E-4     1.39E-4
utilcommSB     2%          3%          4%          15%
utilcompSB     98%         97%         96%         85%
tRCSB (sec)    1.07E-1     8.09E-2     5.46E-2     7.45E-2
speedup        5.4         7.2         10.6        7.8

5. ADDITIONAL CASE STUDIES
Two case studies are presented as further analysis and
validation of the RAT methodology: 2-D PDF estimation
and molecular dynamics. Two-dimensional PDF estima-
tion continues to illustrate the accuracy of RAT for algo-
rithms with a deterministic structure. However, the molec-
ular dynamics application serves as an interesting counter-
point given the relative difficulty of encapsulating its compu-
tational behavior. As with the one-dimensional PDF estima-
tion algorithm, the design emphasis is placed on throughput
analyses because the overall goal was to minimize execution
time for these designs.


Table 4: Resource usage of 1-D PDF (LX100)
FPGA Resource    Utilization
48-bit DSPs      --
BRAMs            15%
Slices           --


5.1 2-D PDF Estimation
As previously discussed, the Parzen window technique is
applicable in an arbitrary number of dimensions. However,
the two-dimensional case presents a significantly greater
problem in terms of communication and computation vol-
ume than the original 1-D PDF estimate. Now 256 x 256
discrete bins are used for histogram generation and the in-
put data set is effectively doubled to account for the extra
dimension. The basic computation per element grows from
(N - n)^2 + c to (N1 - n1)^2 + (N2 - n2)^2 + c, where N1 and
N2 are the probability level values and n1, n2 are the data










Table 5: Input parameters of 2-D PDF (LX100)
Dataset Parameters
  N_elements, input (elements)          1024
  N_elements, output (elements)         65536
  N_bytes/element (bytes/element)       4
Communication Parameters
  throughput_ideal (MB/s)               1000
  α_write (0 < α < 1)                   0.37
  α_read (0 < α < 1)                    0.16
Computation Parameters
  N_ops/element (ops/element)           393216
  throughput_proc (ops/cycle)           48
  f_clock (MHz)                         75/100/150
Software Parameters
  t_soft (sec)                          158.8
  N_iter (iterations)                   400

Table 6: Performance parameters of 2-D PDF

               Predicted   Predicted   Predicted
fclk (MHz)     75          100         150
tcomm (sec)    1.65E-3     1.65E-3     1.65E-3
tcomp (sec)    1.12E-1     8.39E-2     5.59E-2
utilcommSB     1%          2%          3%
utilcompSB     99%         98%         97%
tRCSB (sec)    4.54E+1     3.42E+1     2.30E+1
speedup        3.5         4.6         6.9

sample values for each dimension, and c is a probability scal-
ing factor. But despite the added complexity, the increased
quantity of parallelizable operations intuitively makes this
algorithm more amenable to the RC paradigm, assuming
sufficient quantities of hardware resources are available.
Table 5 summarizes the input parameters for the RAT
analysis for our 2-D PDF estimation algorithm. Again,
the computation is performed in a two-dimensional space
so twice the number of data samples (in blocks of 512 words
for each dimension) are sent to the FPGA. In contrast to the
1-D case, the PDF values computed over each iteration are
sent back to the host processor. The same numerical preci-
sion of four bytes per element is used for the data set. The
interconnect parameters model the same Nallatech FPGA
card as in the 1-D case study. The higher order of computa-
tional complexity is reflected in the larger number of com-
putations per element, approximately three orders of mag-
nitude greater. However, the number of parallel operations
is only increased by a factor of two. VHDL is also used for
the 2-D PDF algorithm to create the cycle-accurate pipeline.
Again, the same range of clock frequencies are used for com-
parison. The software baseline for computing speedup values
was written in C and executed on the same 3.2GHz Xeon
processor. The same 400 iterations are required to complete
the algorithm.
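As with the 1-D walkthrough, the predicted 150 MHz column of Table 6 follows
from the Table 5 inputs by plain arithmetic; the write/read split of t_comm
is again an inference from the input and output element counts.

# Predicted 2-D PDF values at f_clock = 150 MHz (inputs from Table 5).
t_write = (1024 * 4) / (0.37 * 1e9)       # input samples to the FPGA
t_read = (65536 * 4) / (0.16 * 1e9)       # per-iteration PDF values back
t_comm = t_write + t_read                 # ~1.65E-3 s
t_comp = (1024 * 393216) / (150e6 * 48)   # ~5.59E-2 s
t_rc = 400 * (t_comm + t_comp)            # single buffered, ~2.30E+1 s
speedup = 158.8 / t_rc                    # ~6.9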
The RAT performance predictions are compared with the
experimentally measured results in Table 6. The predicted
speedup at 150MHz is closer to the experimental 150MHz
value than the one-dimensional case. Though the percent
error is greater for both communication and computation,


Table 7: Resource usage of 2-D PDF (LX100)
FPGA Resource    Utilization
48-bit DSPs      --
BRAMs            --
Slices           21%

the important difference is that the predicted computation
time was overestimated, which balanced out the
underestimated communication time. However, the prob-
lem is not with the computation parameters. These values
are often conservatively estimated to account for unfore-
seen problems and to avoid promising unattainable results.
Unfortunately, the 'unforeseen problem' for this algorithm
turned out to be communication six times larger than pre-
dicted, comprising 19% of the total execution instead of the
originally estimated 3%. The authors acknowledge this case
study as a victory in contingency planning but highlight the
precariousness of applications where relatively small shifts
in the communication time can greatly affect the utilization
of the FPGA.
Ultimately, the reduced number of parallel operations
(throughputproc) with respect to the quantity of operations
(Nops/element) resulted in an effective speedup less than the
one-dimensional algorithm. The qualitative RC 'amenabil-
ity' of the two-dimensional PDF application could only be
capitalized upon to the extent permitted by the designer's algo-
rithm structure and physical device limitations, particularly
the communication bandwidth. The performance decrease
with respect to the 1-D PDF is highlighted by the compu-
tation utilizations. Though more parallelizable operations
occur in the 2-D PDF algorithm, the increased communica-
tion demands of the higher order reduced the speedup of this
design for this platform. Comparing Table 7 to the resource
utilization from the 1-D algorithm, the hardware usage has
increased but still has not nearly exhausted the resources
of the FPGA. Additional parallelism could be exploited to
improve the performance of the 2-D algorithm.

5.2 Molecular Dynamics
Molecular Dynamics (MD) is the numerical simulation of
the physical interactions of atoms and molecules over a given
time interval. Along with standard Newtonian physics,
properties such as Van Der Waals forces and electrostatic
charge (among others) are calculated for each molecule at
each time step with respect to the movement and the molec-
ular structure of every particle in the system. The tremen-
dous number of options for physically modeled phenomena
and computational methods makes MD a perfect example
of how various designs for an application can have radi-
cally different execution times. Three different versions of
the molecular dynamics algorithm [1], [6], and [10] report
speedup values of 0.29x, 2x, and 46x respectively. These de-
signs make use of various algorithm optimizations, precision
choices, and FPGA platform selections. Consequently, RAT
can offer insight about a particular design, but it cannot
guarantee that a better solution does not exist.
The version of the molecular dynamics application used
for this case study was adapted from code provided by Oak
Ridge National Lab (ORNL). Table 8 summarizes the input
parameters for the RAT analysis of the MD design. The data
size of 16,384 molecules (i.e. elements) was chosen because










Table 8: Input parameters of MD
Dataset Parameters
  N_elements, input (elements)          16384
  N_elements, output (elements)         16384
  N_bytes/element (bytes/element)       36
Communication Parameters
  throughput_ideal (MB/s)               500
  α_write (0 < α < 1)                   0.9
  α_read (0 < α < 1)                    0.9
Computation Parameters
  N_ops/element (ops/element)           164000
  throughput_proc (ops/cycle)           50
  f_clock (MHz)                         75/100/150
Software Parameters
  t_soft (sec)                          --
  N_iter (iterations)                   1

Table 9: Performance parameters of MD

               Predicted   Predicted   Predicted   Actual
fclk (MHz)     75          100         150         100
tcomm (sec)    2.62E-3     2.62E-3     2.62E-3     1.39E-3
tcomp (sec)    7.17E-1     5.37E-1     3.58E-1     8.79E-1
utilcommSB     0.4%        0.5%        0.7%        0.2%
utilcompSB     99.6%       99.5%       99.3%       99.8%
tRCSB (sec)    7.19E-1     5.40E-1     3.61E-1     8.80E-1
speedup        8.0         10.7        16.0        6.6

it is a small but still scientifically interesting problem. Each
element requires 36 bytes, 4 bytes each for position, velocity
and acceleration in each of the X, Y, and Z spatial directions.
The interconnect parameters model an XtremeData XD1000
platform containing an Altera Stratix II EP2S180 user FPGA
connected to an Opteron processor over the HyperTransport
fabric.
The most challenging aspect of performance
prediction for the molecular dynamics application is accu-
rately measuring the number of operations per element (i.e.
molecule) and operations per clock cycle. This particular
algorithm's execution time is dependent on the locality of
the molecules, which is a function of the dataset values.
Distant molecules are assumed to have negligible interac-
tion and therefore require less computational effort. Conse-
quently, the number of operations per element can only be
estimated for this circumstance and as previously discussed,
the number of operations per cycle is treated as a "tuning"
parameter. Though 50 is the quantitative value computed
by the equations to achieve the desired overall speedup of
approximately 10x, this value serves qualitatively to the user
as an indicator that substantial data parallelism and func-
tional pipelining must be achieved in order to realize the de-
sired speedup. Several major architectural design revisions
were explored, based upon the RAT findings, in order to
facilitate the necessary parallelism. Additionally, the same
clock frequencies were used as in the previous case studies,
even though the FPGA platform has changed, because the
parameters are empirically reasonable. The serial software
baseline was performed on a 2.2 GHz Opteron processor, the
host processor of the XD1000 system. The entire dataset is
processed in a single iteration.

Table 10: Resource usage of MD (EP2S180)
FPGA Resource    Utilization
9-bit DSPs       --
BRAMs            --
ALUTs            --
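To illustrate how throughput_proc served as the tuning knob here, the
required-rate rearrangement from Section 3.1 can be applied to the Table 8
parameters at 100 MHz; the software baseline time used below is an inference
from the predicted speedups in Table 9, not a value stated in the text.

# MD parameters from Table 8; t_soft (~5.78 s) is inferred from Table 9
# speedups and should be treated as an estimate.
n_elements, ops_per_element = 16384, 164000
f_clock_hz, n_iter = 100e6, 1
t_comm = 2.62e-3                   # predicted communication time (Table 9)
t_soft = 5.78                      # assumed software baseline (see above)

t_comp_budget = t_soft / (10 * n_iter) - t_comm       # budget for a 10x speedup
ops_per_cycle = (n_elements * ops_per_element) / (f_clock_hz * t_comp_budget)
print(round(ops_per_cycle))        # ~47, in line with the 50 ops/cycle of Table 8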
The application was constructed in the HLL Impulse C,
in contrast to the VHDL used for the PDF algorithms. Normally the
usage of an HLL would have made it more difficult (if not im-
possible) to ensure that the particular cycle-by-cycle struc-
ture desired by the researcher was utilized in the final design.
Even slight variabilities in algorithm structure introduced
by HLLs could make accurate RAT computational analyses
challenging. For the molecular dynamics application, which
possesses data dependent operations and irregular computa-
tion structure independent of its linguistical representation,
performance prediction is already handicapped in terms of
accuracy. Consequently, the usage of HLLs is beneficial for
this MD application for significantly reducing development
time because the variability of the high-level paradigm will
not significantly impact the already complex design struc-
ture.
Table 9 outlines the predicted and actual results of the
molecular dynamics algorithm. The actual communication
times is the same order of magnitude as the predicted value.
While more accurate estimations are always the goal of
RAT, any further precision improvements for this parameter
are inconsequential given the low communication utilization.
Computation dominated the overall RC execution time and
the actual time was also the same order of magnitude as the
predicted value. Again, what eventually allowed the molecu-
lar dynamics algorithm to succeed in the RC paradigm was a
qualitative interpretation of the prediction parameters which
highlighted the need for scalable parallelism (i.e., the
ability to work on several molecules simultaneously). After
major high-level design revisions, the designer successfully
created a molecular dynamics algorithm that met the pre-
dicted criteria with moderate success. However, as Table 10
illustrates, a large percentage of the combinatorial logic and
dedicated multiply-accumulators (DSPs) were required.


6. CONCLUSIONS
RAT is a simple and effective method for investigating
the performance potential of the mapping of a given ap-
plication design for a given FPGA platform architecture.
The methodology employs a simple approach to planning
an FPGA design. RAT is meant to work with empirical
knowledge of RC devices to create more efficient and effec-
tive means for formulating an application design.
RAT succeeds in its goal of simple and reasonably accu-
rate prediction for application designs. For deterministic al-
gorithms such as the PDF estimation, RAT can accurately
estimate the computational loads expected of the FPGA.
Coupled with an accurate measurement of the FPGA plat-
form interconnect throughput, a suitable prediction as to
the 'RC amenability' was formulated before any hardware
code was composed. In contrast, the molecular dynamics
experiment grapples with a situation where data dependen-









cies in the algorithm create uncertainty about the overall
runtime. However, RAT was able to qualitatively highlight
the large volume of parallelism that would be required to
achieve even a 10x speedup. Consequently, extra algorithm
restructuring was incorporated to increase performance and
obtain a speedup somewhat near the 10x goal.
Though RAT has thus far proven effective and useful to
the endeavors of its designers, there are several additional
areas under consideration. The current methodology was de-
signed to support applications involving several algorithms,
each with their own separate RAT analysis. Further experi-
mentation and usage with such design projects is necessary,
especially with systems containing multiple FPGAs being
increasingly deployed. Several strategies for improving the
throughput test's analysis are also under consideration.


7. ACKNOWLEDGMENTS
This work was supported in part by the I/UCRC Pro-
gram of the National Science Foundation under Grant No.
EEC-0642422. The authors gratefully acknowledge vendor
equipment and/or tools provided by Altera, Xilinx, Nal-
latech, XtremeData, and Impulse Accelerated Technologies
that helped make this work possible.


8. REFERENCES
[1] N. Azizi, I. Kuon, A. Egier, A. Darabiha, and
P. Chow. Reconfigurable molecular dynamics
simulator. In Proc IEEE 12th Symp.
Field-Programmable Custom Computing Machines
(FCCM), pages 197-206, Napa, CA, Apr 20-23 2004.
[2] P. Banerjee, D. Bagchi, M. Haldar, A. Nayak, V. Kim,
and R. Uribe. Automated conversion of floating point
matlab programs into fixed point fpga based hardware
design. In Proc IEEE 11th Symp. Field-Programmable
Custom Computing Machines (FCCM), pages
263-264, Napa, CA, Apr 8-11 2003.
[3] K. Bondalapati and V. Prasanna. Dynamic precision
management for loop computations on reconfigurable
architectures. In Proc IEEE 7th Symp.
Field-Programmable Custom Computing Machines
(FCCM), pages 249-258, Napa, CA, Apr 21-23 1999.
[4] D. Buell. Programming reconfigurable computers:
Language lessons learned. In Reconfigurable System
Summer Institute (RSSI), Urbana, IL, Jul 12-13 2006.
[5] M. Chang and S. Hauck. Precis: A design-time
precision analysis tool. In Proc IEEE 10th Symp.
Field-Programmable Custom Computing Machines
(FCCM), pages 229-238, Napa, CA, Apr 22-24 2002.
[6] L. Cordova and D. Buell. An approach to scalable
molecular dynamics simulation using supercomputing
adaptive processing elements. In Proc. IEEE Int.
Conf. Field Programmable Logic and Applications
(FPL), pages 711-712, Aug 24-26 2005.
[7] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E.
Schauser, E. Santos, R. Subramonian, and T. von
Eicken. Logp: Towards a realistic model of parallel
computation. In Proc ACM 4th Symp. Principles and
Practice of Parallel Programming, pages 1-12, San
Diego, CA, May 19-22 1993.
[8] S. Fortune and J. Wyllie. Parallelism in random access
machines. In Proc ACM 10th Symp. Theory of


Computing, pages 114-118, San Diego, CA, May 01-03
1978.
[9] A. Gaffar, O. Mencer, W. Luk, P. Cheung, and
N. Shirazi. Floating-point bitwidth analysis via
automatic differentiation. In Proc IEEE Int. Conf.
Field-Programmable Technology (FPT), pages
158-165, Hong Kong, China, Dec 16-18 2002.
[10] Y. Gu, T. VanCourt, and M. Herbordt. Accelerating
molecular dynamics simulations with configurable
circuits. In Proc. IEE Computers and Digital
Techniques, volume 153, pages 189-195, May 2 2006.
[11] Z. Guo, W. Najjar, F. Vahid, and K. Vissers. A
quantitative analysis of the speedup factors of fpgas
over processors. In Proc ACM 16th Symp.
Field-Programmable Gate Arrays (FPGA), pages
162-170, Monterey, CA, Feb 22-24 2004.
[12] T. Jeger, R. Enzler, D. Cottet, and G. Troster. The
performance prediction model: a methodology for
estimating the performance of an FPGA implementation
of an algorithm. Technical report, Electronics Lab,
Swiss Federal Inst. of Technology (ETH) Zurich, 2000.
[13] V. Kindratenko and D. Pointer. A case study in
porting a production scientific supercomputing
application to a reconfigurable computer. In Proc
IEEE 14th Symp. Field-Programmable Custom
Computing Machines (FCCM), pages 13-22, Napa,
CA, Apr 24-26 2006.
[14] D.-U. Lee, A. Gaffar, O. Mencer, and W. Luk.
Optimizing hardware function evaluation. IEEE
Trans. Computers, 54(12):1520-1531, Dec. 2005.
[15] M. Smith and G. Peterson. Parallel application
performance on shared high performance
reconfigurable computing resources. Performance
Evaluation, 60:107-125, May 2005.
[16] C. Steffen. Parameterization of algorithms and fpga
accelerators to predict performance. In Reconfigurable
System Summer Institute (RSSI), Urbana, IL, Jul
17-20 2007.



