[61] C. Reardon, B. Holland, A. George, H. Lam, and G. Stitt, "RCML: An abstract
modeling language for design-space exploration in reconfigurable computing," in Proc.
IEEE Reconfigurable Architectures Workshop, May 25-26 2009.
[62] Y. Sun, Y. Cai, L. Liu, F. Yu, M. L. Farrell, W. McKendree, and W. Farmerie,
"ESPRIT: estimating species richness using large collections of 16S rRNA
pyrosequences," Nucleic Acids Res., vol. 37, no. 10, p. e76, 2009.
[63] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for
similarities in the amino acid sequence of two proteins," J. Molecular Biology, vol. 48,
no. 3, pp. 443-453, 1970.
[64] M. Reiser and S. S. Lavenberg, "Mean-value analysis of closed multichain queuing
networks," J. ACM, vol. 27, no. 2, pp. 313-322, 1980.
The proposed framework provides the necessary methodology to integrate modeling
environments with RAT performance prediction for strategic DSE. The framework
translates application specifications into performance characterizations and orchestrates
the required RAT predictions. The DSE tool was constructed based on the framework
methodology and demonstrated accurate performance analysis with under 5% error for
the MNW case study. Strategic DSE with the tool was efficient and rapid, requiring only
140ms and 340ms for analysis of 100,000 revisions to the MNW application and the MVA
graph, respectively.
3.3.3 Predicted and Actual Results
The RAT performance numbers are compared with the experimentally measured
results in Table 3-3. Each predicted value in the table is computed using the input
parameters and equations listed in Section 3.2.1. For example, the predicted computation
time when fclock = 150MHz is calculated as follows:

tcomp = (512 elements x 768 ops/element) / (150 MHz x 20 ops/cycle)
      = 393216 ops / 3E+9 ops/sec
      = 1.31E-4 secs
The communication time is computed using the corresponding equation. Because the
application is single-buffered, the total RC execution time is simply:
tRC,SB = 400 iterations x (2.47E-5 secs + 1.31E-4 secs)
       = 6.23E-2 secs
The speedup is simply the division of the software execution time by the RC
execution time. The utilization is computed using the corresponding SB equations.
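The single-buffered prediction above can be sketched end to end in a few lines. This is an illustration of the calculation, not a verbatim transcription of the RAT equations: the communication terms assume the alpha-scaled form t = bytes / (alpha x throughput_ideal), inferred from the parameters listed in Table 3-2.

```python
# Sketch of the RAT single-buffered (SB) prediction for 1-D PDF, using the
# input parameters from Table 3-2. The alpha-scaled communication form is an
# assumption inferred from the table, not a quote of Equations 3-1 to 3-5.

N_IN, N_OUT, BYTES_PER_ELEM = 512, 1, 4       # elements and bytes/element
T_IDEAL = 1000e6                              # ideal throughput, bytes/sec
ALPHA_WRITE, ALPHA_READ = 0.099, 0.001        # measured efficiency fractions
N_OPS, R_PROC, F_CLOCK = 768, 20, 150e6       # ops/element, ops/cycle, Hz
N_ITER, T_SOFT = 400, 0.578                   # iterations, software baseline (s)

t_write = (N_IN * BYTES_PER_ELEM) / (ALPHA_WRITE * T_IDEAL)   # ~2.07E-5 s
t_read = (N_OUT * BYTES_PER_ELEM) / (ALPHA_READ * T_IDEAL)    # ~4.0E-6 s
t_comm = t_write + t_read                                     # ~2.47E-5 s
t_comp = (N_IN * N_OPS) / (R_PROC * F_CLOCK)                  # ~1.31E-4 s

t_rc_sb = N_ITER * (t_comm + t_comp)   # single-buffered: terms serialized
speedup = T_SOFT / t_rc_sb

print(f"t_comm={t_comm:.3e}s t_comp={t_comp:.3e}s "
      f"t_RC,SB={t_rc_sb:.3e}s speedup={speedup:.1f}x")
```

Running the sketch reproduces the 2.47E-5s communication time, 1.31E-4s computation time, and 6.23E-2s total RC execution time quoted above.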
The communication and computation times for the actual FPGA code were measured
using the wall-clock time of the CPU. The error in the prediction of the communication
time was minimal, approximately 1%, due to detailed microbenchmarking for these exact
transfer sizes. A relatively accurate tcomp prediction is expected given the deterministic
structure of the parallel algorithm. However, the high degree of accuracy (two significant
figures) between the predicted and actual computation times with fclock = 150MHz was
unusual, since the computational throughput was a conservatively estimated parameter.
Much of the 1-D PDF algorithm is pipelined, but the lower effective throughput, due to
the latency in the short 0.14ms of computation time, closely matched the conservatively
estimated value for throughputproc (i.e., the 20 ops/cycle used in the RAT prediction
4.3.1.2 Application Scope .................. 62
4.3.1.3 Model Usage .................. 64
4.3.2 Model Attributes and Equations .................. 65
4.3.2.1 Compute Node Model .................. 66
4.3.2.2 Network Model .................. 67
4.3.2.3 Stage Model .................. 71
4.3.2.4 Application Model .................. 73
4.4 Detailed Walkthrough: 2-D PDF Estimation .................. 74
4.4.1 Algorithm and Platform Structure .................. 75
4.4.2 Compute Node Modeling .................. 76
4.4.3 Network Modeling .................. 78
4.4.3.1 PCI-X Network Modeling .................. 78
4.4.3.2 Ethernet Network Modeling .................. 80
4.4.4 Stage/Application Modeling .................. 82
4.4.5 Results and Verification .................. 84
4.5 Additional Case Studies .................. 86
4.5.1 Image Filtering .................. 89
4.5.2 Molecular Dynamics .................. 91
4.6 Conclusions .................. 94
5 INTEGRATED PERFORMANCE PREDICTION WITH RC MODELING LANGUAGE
(PHASE 3) .................. 96
5.1 Introduction .................. 96
5.2 Background and Related Research .................. 98
5.2.1 RAT Performance Prediction .................. 98
5.2.2 Modeling Environments .................. 99
5.3 Integrated Framework .................. 101
5.3.1 Translation .................. 102
5.3.2 Orchestration .................. 104
5.4 Case Studies .................. 105
5.4.1 Experimental Setup .................. 106
5.4.2 Modified Needleman-Wunsch (MNW) .................. 107
5.4.3 Task Graph of Mean-Value Analysis (MVA Graph) .................. 111
5.5 Integrated RATSS (MNW) .................. 113
5.6 Conclusions .................. 114
6 CONCLUSIONS .................. 116
REFERENCES .................. 119
BIOGRAPHICAL SKETCH .................. 125
for analytical, system-level modeling prior to any detailed and potentially costly
implementation of FPGA kernels or applications. RAT from Chapter 3 defined an
analytical model for performance estimation of a specific algorithm on a specific platform
prior to implementation, albeit for single-FPGA designs. The RAT methodology must
be paired with system-level modeling concepts to provide a complete model for scalable,
multi-FPGA systems.
Existing research with microprocessor-based algorithms and parallel platforms
can help bridge the gap between analytical modeling for FPGA devices and the design
challenges of FPGA-based scalable systems. The Parallel Random Access Machine
(PRAM) [8] is one of the first widely studied models that attempts to reduce complex
system behavior into a few key attributes, but the model neglects issues such as
synchronization and communication, which can greatly affect the accuracy of the
performance estimations for larger systems. The Bulk Synchronous Parallel (BSP)
Model [9] extends the modeling concepts of PRAM by defining an application in terms
of a series of global supersteps each consisting of local computation, communication,
and synchronization. The LogP [10] model attempts to define networks of computing
nodes (i.e., microprocessors and local memory) by latency, L; overhead, o; gap between
(short) messages, g; and number of processing units, P. The LogGP model [11] extends
the LogP concept with support for a long-message gap, G. Other extensions to LogP
and LogGP include support for contention, LoPC [42], and parameterization of the L,
o, g, G, and P attributes (PlogP) to support dynamically changing values in wide-area
networks [12]. Additionally, benchmarks have been created to assist in the measurement
of these attributes [43]. Leveraging these models allows RATSS to describe system-level
communication for multi-FPGA platforms.
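The LogP and LogGP cost models above can be made concrete with a short sketch. The parameter values below are hypothetical, chosen only to show how L, o, g, and G combine into a message time; they do not come from any platform measured in this work.

```python
# Illustrative sketch of the LogP/LogGP cost models described above.
# All parameter values are hypothetical placeholders.

L = 5e-6   # latency (s): network delay for one message
o = 2e-6   # overhead (s): CPU time to inject or receive a message
g = 4e-6   # gap (s): minimum interval between consecutive short messages
G = 1e-8   # Gap per byte (s): reciprocal bandwidth for long messages (LogGP)

def logp_short_message():
    """End-to-end time for one short message under LogP: send overhead,
    network latency, receive overhead."""
    return o + L + o

def loggp_long_message(k_bytes):
    """End-to-end time for a k-byte message under LogGP: the (k-1)*G term
    models per-byte serialization of the long message."""
    return o + (k_bytes - 1) * G + L + o

def logp_message_stream(n_msgs):
    """n back-to-back short messages: the gap g limits the injection rate."""
    return (n_msgs - 1) * g + logp_short_message()
```

A 1-byte message under LogGP degenerates to the LogP short-message time, which is the sense in which LogGP extends rather than replaces LogP.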
Prior work has leveraged system-level modeling concepts beyond homogeneous
microprocessors. Heterogeneous LogGP (HLogGP) [44, 45] considers extensions for
multiple processor speeds and communication networks within a cluster. In [46],
3.4.5 Summary of Case Studies
Table 3-17 outlines the predicted values, actual results, and error percentages for
the communication time, computation time, and speedup. The magnitudes of the
predicted and actual values are listed to compare the absolute impact of the relative
error percentages. For example, molecular dynamics has comparable communication and
computation error percentages but the magnitudes are different with communication
having virtually no impact on total speedup.
For these case studies, communication had the lowest average error among the
modeled times. The larger errors for LIDAR and MD were caused by minor discrepancies
in the final communication setup versus the RAT analysis of the algorithm. These error
percentages are acceptable for RAT because they still yield valid quantitative insight
about the algorithm behavior. The cost of more precise prediction must be balanced with
the impact of the communication time on performance. Again, the largest communication
error, found in MD, did not significantly affect the speedup, because the communication
was less than 1% of the overall execution time.
Even with the more complicated task of architecture and platform parameterization,
the average prediction error for computation was only slightly higher than communication.
PDF and TSP had low computation errors complementing the low communication errors.
LIDAR had double-digit error, which was due to perceivable system overheads in the
short 0.2ms computation time. The one outlier was the MD application. The difficulty
of modeling the data-driven computations was compounded by unknowns in the final
algorithm mapping by the HLL tool. As with the communication predictions for MD,
the error was significant but RAT still provided a useful insight about what order of
magnitude speedup should be achievable.
The prediction errors in the overall speedup were higher on average than the
individual computation or communication times. Particularly with 1-D PDF, LIDAR,
and TSP, overheads not part of the RAT computation and communication models were
Table 3-2. Input parameters of 1-D PDF
Dataset Parameters
  Nelements, input (elements)          512
  Nelements, output (elements)         1
  Nbytes/element (bytes/element)       4
Communication Parameters (Nallatech)
  throughputideal (MB/s)               1000
  αwrite (0 < α < 1)                   0.099
  αread (0 < α < 1)                    0.001
Computation Parameters
  Nops/element (ops/element)           768
  throughputproc (ops/cycle)           20
  fclock (MHz)                         75/100/150
Software Parameters
  tsoft (sec)                          0.578
  Niter (iterations)                   400
individual partial transfers, one per iteration, to coincide with the RAT model, but the
throughput efficiency is adjusted to align with a single block of data.
The number of bytes per element, Nbytes/element, is rounded up to four (i.e., 32 bits).
Even though the PDF estimation algorithm only uses 18-bit fixed point, the interconnect
uses 32-bit communication. The data was not byte-packed, and the remaining 14 bits
per word of communication are unused. During the algorithmic formulation, several
formats including 18-bit fixed point, 32-bit fixed point, and 32-bit floating point were
considered for use in the PDF algorithm. However, the maximum error percentage
was found to be sufficiently small for 18-bit fixed point, which is satisfactory precision
for the application. Ultimately, 18-bit fixed point was chosen so that only one Xilinx
18x18 multiply-accumulate (MAC) unit would be needed per multiplication. Though
slightly smaller bitwidths also had reasonable error constraints, no performance gains or
appreciable resource savings would have been achieved.
The communication parameters are provided by the user since they are merely a
function of the target RC platform, which is a Nallatech H101-PCIXM card containing a
Virtex-4 LX100 user FPGA for this case study. The card is connected to the host CPU
Table 5-3. Analysis times for design spaces of MVA graph
Number of Revisions Analysis Time (ms)
0 (initial design) 10
1000 12
10000 33
100000 340
Although the algorithm structure is an SDF MoC, collecting quantitative parameters
from the individual computation tasks and communication operations remains very similar
to the MNW case study. The algorithm complexity is manifested in the number of tasks,
their dependencies, and their scheduling. The predicted execution time of the initial design
for the MVA graph is 0.17s, which is dominated by the slowest pipeline, T2-1 (Figure
5-9), due to its highest assigned workload. Strategic DSE can provide useful insight about
minimum execution rates for the other pipelines and allows a designer to construct the
slowest (and least resource-intensive) pipeline possible without increasing the overall
execution time for the application. For example, Figure 5-10 illustrates the predicted
execution time of the MVA graph based on 1000 design revisions that represent different
execution rates for the T4-2 pipeline. Pipeline rates for T4-2 above a certain threshold
have no impact on the total performance of the MVA graph because the execution time
remains dominated by the slower T2-1 pipeline. However, sufficiently slow rates for the
T4-2 pipeline increase the execution time of the MVA graph. Rapid determination of this
threshold is difficult without the DSE tool.
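The threshold search described above can be sketched as a simple sweep. The workload and candidate rates below are hypothetical placeholders, not values from the MVA graph; only the structure of the search (application time is set by the slowest pipeline, so find the slowest "free" rate for T4-2) reflects the text.

```python
# Sketch of the strategic-DSE sweep described above: find the slowest rate for
# the T4-2 pipeline that does not increase total execution time, which remains
# set by the dominant T2-1 pipeline. Workload and rates are hypothetical.

T2_1_TIME = 0.17          # predicted time of the dominant pipeline (s)
T4_2_WORKLOAD = 1.0e8     # operations assigned to the T4-2 pipeline (assumed)

def total_time(t4_2_rate_ops_per_sec):
    """Application time = slowest pipeline, since pipelines run concurrently."""
    return max(T2_1_TIME, T4_2_WORKLOAD / t4_2_rate_ops_per_sec)

# Sweep 1000 candidate rates; keep the slowest one with no performance impact.
candidates = [1e8 + i * 1e7 for i in range(1000)]
free = [r for r in candidates if total_time(r) <= T2_1_TIME]
threshold_rate = min(free)   # slowest T4-2 rate that is still "free"
```

Below `threshold_rate` the T4-2 pipeline becomes the bottleneck and the total time grows; above it, further speed buys nothing, which is exactly the design freedom the DSE tool exposes.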
Despite the large number of revisions, DSE using the integrated framework remained
tractable. Table 5-3 summarizes the framework analysis times, including the transfer of
the performance information and RAT prediction, for various numbers of revisions to the
design of the MVA graph. The analysis time grows approximately linearly with the size
of the design space. The longest analysis of 100,000 revisions took 340ms, which is nearly
indistinguishable by the user from the duration of a single analysis of a simple application
(e.g., 2.3ms for MNW). The primary limitation of broad DSE (aside from Java memory
requirements) is the ability of the user to efficiently digest the generated prediction values.
Depending on the FPGA platform architecture and mapping, computation and
communication within a stage is either serialized or overlapping with the total execution
time of the stage, tstage, defined as a number of iterations, Nstageiterations, of either the
sum or maximum of the tcomp and tcomm terms (Equation 4-14). Note that performance
estimates should be modeled for each unique stage of the application execution with
attention to any special cases, such as initial and final stages possessing more or less
communication.
tstage = Nstageiterations x (tcomp + tcomm)        (serialized)     (4-14)
tstage = Nstageiterations x Max(tcomp, tcomm)      (overlapped)
tstage: total execution time of the application stage
Nstageiterations: number of stage-level iterations
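The stage-time model of Equation 4-14 reduces to a small function. This is a minimal sketch of the equation's two cases; the example times are placeholders.

```python
# Minimal sketch of the stage-time model in Equation 4-14: a stage repeats
# Nstageiterations times, and each iteration either serializes computation and
# communication (sum) or overlaps them (max), depending on the platform
# architecture and mapping.

def t_stage(n_iterations, t_comp, t_comm, overlapped=False):
    """Total stage time per Equation 4-14."""
    per_iteration = max(t_comp, t_comm) if overlapped else t_comp + t_comm
    return n_iterations * per_iteration

# With overlap, communication shorter than computation is completely hidden:
serial = t_stage(100, t_comp=1e-3, t_comm=2e-4)                    # 0.12 s
overlap = t_stage(100, t_comp=1e-3, t_comm=2e-4, overlapped=True)  # 0.10 s
```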
4.3.2.4 Application Model
The RATSS application-level model describes the scheduling of the individual stages
of task execution to estimate the full system performance. Applications consist of one
or more distinct stages of execution, which may be collectively repeated for one or more
iterations. Analogous to the computation and communication scheduling, application
stages can either be serialized or overlapping. Figure 4-3 provides an example timing
diagram for potential iterative behavior at the application level. The example consists of
three stages collectively repeated twice, reinforcing the multi-level iterative behavior of the
stage and application models as first described in Figure 4-2. This sample application does
not illustrate all potential platform and algorithm features such as multi-level networks,
but instead reinforces the ability of RATSS to organize execution paths into hierarchical
models.
Equation 4-15 defines the set, Sstage, of s stage times, tstage, for the application.
The total execution time for the application, tapplication, is the number of iterations,
instead of the theoretical 24). Also, this lower throughput accounted for extra overhead
time involved with polling the FPGA for completion of computation.
The total execution time for the FPGA is also measured using the wall-clock time,
rather than calculated from Equation (3-5), to ensure maximum accuracy. Additional
factors may be present in the total time that are not accounted for in the individual
communication and computation. In this case study, the total error was approximately 11%,
but the communication and computation errors were each only about 1%. The discrepancy
is due to overheads with managing and regulating data transfers by the host CPU that are
not expressly part of the individual RAT models. The impact of these overheads and other
synchronization issues will vary depending upon the particular FPGA platform and size of
the overhead time relative to the total RC execution time. For 1-D PDF, the extra 8.5ms
was significant compared to the 75ms of execution time. The relatively low resource usage
in Table 3-4 illustrates a potential for further speedup by including additional parallel
kernels albeit at the risk of increasing the impact of the system overhead.
3.4 Additional Case Studies
Several case studies are presented as further analysis and validation of the RAT
methodology: 2-D PDF estimation, coordinate calculation for LIDAR processing, the
traveling salesman problem, and molecular dynamics. Two-dimensional PDF estimation
continues to illustrate the accuracy of RAT for algorithms with a deterministic structure.
Coordinate calculation uses prediction on a communication-bound algorithm. Traveling
salesman explores a computation-bound searching algorithm with pipelined structure.
However, the molecular dynamics application serves as a counterpoint given the relative
difficulty of encapsulating its data-driven, non-deterministic computations. A diverse
collection of vendor platforms, Nallatech, Cray, SRC, and XtremeData, is used for 2-D
PDF, LIDAR, traveling salesman, and molecular dynamics, respectively. Each of these
case studies has single-buffered communication and computation. As with one-dimensional
node time. Because of the equal load balancing among the nodes, the total computation
time is equivalent to the performance of any i-th node in the system.
Scomp = {tfpga1, ..., tfpgaP}                      (4-20)
tcomp(Scomp) = Max(Scomp) = tfpgai                 (4-21)
The set of communication times, Scomm, summarizes the network transactions involved
in the single stage of the case study. The scatter and write times, tscatter and twrite,
represent the X and Y dimensions of the input data, and the read and reduce times, tread
and treduce, define the data collected after computation. From Equation 4-23, the total
communication performance, tcomm, is the summation of the individual, non-overlapping
network transactions.
Scomm = {tscatterX, tscatterY, twriteX, twriteY, tread, treduce}               (4-22)
tcomm(Scomm) = tscatterX + tscatterY + twriteX + twriteY + tread + treduce     (4-23)
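Equations 4-20 through 4-23 amount to a max over the set of per-node times and a sum over the set of transaction times. The sketch below uses the 8-node tfpga value from Table 4-1; the individual transaction times are hypothetical placeholders.

```python
# Sketch of Equations 4-20 to 4-23 for the single stage of the 2-D PDF case
# study: total computation is the max over per-FPGA times (equal load balancing
# makes them identical), and total communication is the sum of non-overlapping
# network transactions. Transaction values are assumed for illustration.

def t_comp(s_comp):
    """Equation 4-21: max over the set of per-node FPGA times."""
    return max(s_comp)

def t_comm(s_comm):
    """Equation 4-23: sum of non-overlapping transaction times."""
    return sum(s_comm.values())

s_comp = [35.2] * 8                 # 8 balanced nodes, tfpga from Table 4-1 (s)
s_comm = {                          # hypothetical transaction times (s)
    "scatter_x": 0.20, "scatter_y": 0.20,
    "write_x": 0.05, "write_y": 0.05,
    "read": 0.03, "reduce": 0.10,
}
```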
The inputs to the RATSS system-level model are summarized in the first block
of Table 4-4. The only user-provided attribute is the number of stage-level iterations,
Niterations. The 2-D PDF application only requires one iteration of node and network
interaction (i.e., inter-node data distribution, node-level computation, inter-node data
collection). The individual node and network transaction times comprise the majority
of the input to the system model. The Scomp and Scomm attribute sets contain the
compute node and network transaction times from the respective models. The second
block of Table 4-4 summarizes the estimated performance of the total computation and
communication times, tcomp and tcomm from Equations 4-21 and 4-23.
The 2-D PDF estimation does not use an elaborate buffering scheme, so the total
performance of the application, tapplication, as defined by the stage time, tstage, is the
16th Symp. Field-Programmable Custom Computing Machines (FCCM), Palo Alto,
CA, Apr. 14-15 2008.
[37] K. Shih, A. Balachandran, K. Nagarajan, B. Holland, C. Slatton, and A. George,
"Fast realtime LIDAR processing on FPGAs," in Proc. Engineering of Reconfigurable
Systems and Algorithms (ERSA), Las Vegas, NV, July 14-17 2008.
[38] S. Tschoke, R. Lubling, and B. Monien, "Solving the traveling salesman problem with
a distributed branch-and-bound algorithm on a 1024 processor network," in Proc.
Symp. Parallel Processing, Santa Barbara, CA, Apr. 25-28 1995.
[39] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids, Oxford University
Press, New York, 1987.
[40] D. A. Pearlman, D. A. Case, J. W. Caldwell, W. S. Ross, T. E. Cheatham III,
S. DeBolt, D. Ferguson, G. Seibel, and P. Kollman, "Amber, a package of computer
programs for applying molecular mechanics, normal mode analysis, molecular
dynamics and free energy calculations to simulate the structural and energetic
properties of molecules," Computer Physics Communications, vol. 91, no. 1-3, pp.
1-41, September 1995.
[41] M. Nelson, W. Humphrey, A. Gursoy, A. Dalke, L. Kalé, R. D. Skeel, and K. Schulten,
"NAMD: a parallel, object-oriented molecular dynamics program," Int'l J. Supercomputer
Applications and High Performance Computing, vol. 10, no. 4, pp. 251-268,
1996.
[42] M. I. Frank, A. Agarwal, and M. K. Vernon, "LoPC: modeling contention in parallel
algorithms," in Proc. 6th ACM SIGPLAN Symp. Principles and Practice of Parallel
Programming (PPoPP), 1997, pp. 276-287.
[43] T. Kielmann, H. E. Bal, and K. Verstoep, "Fast measurement of LogP parameters for
message passing platforms," in Proc. 15th IPDPS Workshop Parallel and Distributed
Processing, London, UK, 2000, pp. 1176-1183.
[44] J. L. Bosque and L. P. Perez, "HLogGP: a new parallel computational model for
heterogeneous clusters," in Proc. IEEE Symp. Cluster Computing and the Grid.
[45] J. L. Bosque and L. Pastor, "A parallel computational model for heterogeneous
clusters," IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 12, 2006.
[46] A. Lastovetsky, I.-H. Mkwawa, and M. O'Flynn, "An accurate communication model
of a heterogeneous cluster based on a switch-enabled ethernet network," in Proc. 12th
IEEE Int'l Conf. Parallel and Distributed Systems (ICPADS), Minneapolis, MN, July
12-15 2006.
[47] R. Kesavan, K. Bondalapati, and D. K. Panda, "Multicast on irregular
switch-based networks with wormhole routing," in Proc. Int'l Symp. High Performance
Computer Architecture (HPCA), San Antonio, TX, 1997, pp. 48-57.
made, key performance attributes are extracted, predictions are computed, and suitable
designs proceed to implementation. The goal is to establish the core, extensible model of
application and architecture performance that is efficient for use prior to implementation
and provides reasonably accurate results. Several application case studies are analyzed
with RAT and implemented within FPGA systems to verify the performance model and
methodology.
For the second phase, the emergence of and continued interest in multi-FPGA systems
necessitates a methodology for multi-FPGA performance prediction to improve application
development. The RC Amenability Test for Scalable Systems (RATSS) is an expansion of
the RAT methodology encompassing larger FPGA systems and potentially higher degrees
of algorithm parallelism. A major challenge is the size and variability of communication
topologies in multi-FPGA platforms, which require varying amounts of parameterization
and analysis for accurate performance prediction. RATSS uses the synchronous iterative
model [1] for two modern platform architectures for RC systems [2], providing accurate
modeling for data-parallel algorithms typically structured as SIMD-style pipelines.
The third phase involves integration of the RAT analytical model with the RC
Modeling Language (RCML). Manual specification and analysis of applications becomes
increasingly inefficient as the FPGA platform size and algorithm complexity grow.
Ideally, algorithms, FPGA platform architectures, and system mappings are specified
using a modeling environment based on a model of computation and then analyzed
by performance estimation tools such as RAT. RCML is used because it provides an
efficient, intuitive, and scalable infrastructure specifically designed for FPGA systems.
The integration of RAT and RCML provides efficient design-space exploration through
tool-assisted translation of application specifications into prediction models and evaluation
of both an initial design and potential revisions to the algorithm or platform architecture.
The remainder of this document is structured as follows. Chapter 2 provides a brief
background about FPGA computing and related research for performance prediction and
Table 5-2. Analysis times for design spaces of MNW
Number of Revisions Analysis Time (ms)
0 (initial design) 2.3
1000 3.0
10000 15
100000 140
Figure 5-9. MVA graph specification and mapping
pipelines), but these analyses can be trivial for this computation-bound application due
to the direct correspondence between the rate of execution and the overall application
performance. Instead, Figure 5-8 illustrates the predicted execution time of MNW based
on 1000 design revisions that represent different problem sizes (i.e., comparisons of
500 to 50450 DNA sequences) divided among four FPGAs. These revisions expand the
initial four-FPGA design of 1500 DNA sequences. The execution time of MNW increases
exponentially with the number of DNA sequence comparisons. This DSE can help evaluate
the suitability of the MNW application for meeting the broad performance requirements of
a designer, particularly when the size of the sequence database is expected to increase at a
potentially unknown rate after implementation. The DSE tool took only 6.1ms to analyze
this significantly larger design space. Table 5-2 summarizes analysis times for the initial
design, the 1000 revisions, and two other large DSEs. The analysis times grow linearly
with the size of the design space, allowing very large numbers of revisions to be explored
in significantly less than one second.
Figure 4-5. Platform structure for 2-D PDF estimation case study (8 FPGA nodes, each
a Nallatech FPGA board (XC4VLX100) with internal BRAM and Nallatech middleware,
connected over a PCI-X bus to a 3.2GHz Xeon host microprocessor; hosts interconnected
by a GigE switch)
Table 4-1. Node Attributes for 2-D PDF Estimation
Attribute        Units           2 Nodes      4 Nodes      8 Nodes
PLcomp           (cycles)        11           11           11
Rcomp            (ops/cycle)     240          240          240
Fclock           (MHz)           195          195          195
Ncompelements    (elements)      33,554,432   16,777,216   8,388,608
Nops/element     (ops/element)   196,608      196,608      196,608
tfpga            (s)             1.41E+2      7.05E+1      3.52E+1
of the pipeline latency, PLcomp, requires detailed knowledge of the final algorithm
structure. The pipeline for the 2-D PDF estimation has a straightforward computational
structure of three operations (subtraction, multiplication, and addition from Figure
4-4) requiring 11 total cycles. The relatively deep pipeline helps ensure a higher clock
frequency. The computational throughput, Rcomp, of 240 operations per cycle comes from
the 80 pipelined kernels, each with 3 simultaneous operations per pipeline. Predictions are
generated for a large range of possible frequencies. The prediction for the clock frequency,
Fc~ock, of 195MHz is shown since it ultimately matched the maximum frequency for
later implementation. The number of computation elements, Ncompelements, is 33,554,432
(64M/2); 16,777,216 (64M/4); and 8,388,608 (64M/8) for the two, four, and eight-node
configurations, respectively, due to the balanced data decomposition. The number of
operations per element, Nops/element, is based on the 256x256 comparisons per data
element times 3 operations per element for a total of 196,608 operations.
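The tfpga row of Table 4-1 follows directly from these attributes. The sketch below assumes the natural form of the compute-node model, elements x ops/element divided by effective throughput (Rcomp x Fclock), with the 11-cycle pipeline latency folded in; it reproduces the table's values.

```python
# Sketch of the compute-node model behind Table 4-1: per-FPGA time is element
# count times ops/element divided by effective throughput (Rcomp ops/cycle at
# Fclock). The 11-cycle pipeline latency PLcomp is negligible at these problem
# sizes but is included for completeness. The exact equation form is assumed.

R_COMP = 240               # ops/cycle: 80 pipelined kernels x 3 ops each
F_CLOCK = 195e6            # Hz
PL_COMP = 11               # pipeline latency in cycles
N_OPS_PER_ELEM = 196_608   # 256x256 comparisons x 3 ops per comparison

def t_fpga(n_elements):
    cycles = n_elements * N_OPS_PER_ELEM / R_COMP + PL_COMP
    return cycles / F_CLOCK

for nodes, n_elem in [(2, 33_554_432), (4, 16_777_216), (8, 8_388_608)]:
    print(f"{nodes} nodes: t_fpga = {t_fpga(n_elem):.3g} s")
```

The computed times match the 1.41E+2, 7.05E+1, and 3.52E+1 seconds listed in Table 4-1 for the two-, four-, and eight-node configurations.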
Table 3-10. Resource usage of LIDAR (XC2VP50)
FPGA Resource Utilization (%)
BRAMs 12
18x18 Multipliers 5
Slices 45
communication and computation may be necessary. Table 3-10 highlights the availability
of unused resources to expand the parallel computation but the benefit will be marginal
because of the communication-bound algorithm.
3.4.3 Traveling Salesman Problem
The traveling salesman problem (TSP) [38] is a particular version of the NP-complete
Hamiltonian path problem that locates the minimum length path through an undirected,
weighted graph in which each vertex (i.e. city) is visited exactly once. (Other derivations
of the Hamiltonian path problem include the snake-in-the-box, knight's tour, and the
Lovász conjecture.) For this algorithm, any city may be the starting point and all cities
are connected to every other city creating N! potential Hamiltonian paths, where N is
the number of cities. To accelerate the time to converge on a solution, heuristics are
sometimes employed to systematically search a subset of the solution space. However,
the algorithm for this case study performs an exhaustive search on all paths in the graph.
The specific algorithm formulation has significant ramifications not only on the hardware
performance but also on the prediction accuracy.
The case study targets SRC Computer's SRC-6 FPGA platform. Within the SRC-6,
the algorithm uses one of the Xilinx XC2V6000 user FPGAs in a single MAP-B unit.
The FPGA is connected to a host processor via the vendor's SNAP (memory DIMM slot)
interconnect. Nine depth-first traversals of the graph occur simultaneously on a single
FPGA starting from each of the nine different cities. Techniques such as branch and
bound are not used because each step of the search would be dependent on the previous
steps, thus preventing any pipelining. Instead, the algorithm starts with selecting N
arbitrary cities (and their N-1 edges) all at once and is then followed by determination
Sfpga = {tfpga1, ..., tfpgan}                      (4-10)
Sfpga: set of FPGA execution times for the stage
n: total number of FPGA nodes
tcomp = toverhead + Max(Sfpga)                     (4-11)
tcomp: total computation time for the stage
toverhead: configuration, setup, and other overheads for the stage
Similarly, the set of communication times, Scomm, contains the performance estimates
for each of the r network transaction times, ttransaction (Equation 4-12). Typically, this
set will contain one or more input and output transactions for each level of network
communication in the platform, though some applications will instead accumulate partial
results within the FPGAs over multiple stages with cumulative output after the last
computation iteration. From Equation 4-13, the communication time for the stage,
tcomm, is composed of the sum of the r transaction times, ttransaction. Multiple levels of
communication within an application stage are assumed non-overlapping due to blocking.
However, non-blocking transactions can be modeled by the total network delay and longest
(i.e., maximum) throughput time.
Scomm = {ttransaction1, ..., ttransactionr}        (4-12)
Scomm: set of transaction times for the stage
r: total number of communication transactions
tcomm(Scomm) = Σ Scomm                             (4-13)
tcomm: total communication time for the stage
Figure 5-7. Overview of MNW case study. A) Calculation of normalized edit distances
on multiple FPGAs for MNW: the (N²-N)/2 comparisons in the database are divided
among four FPGAs, each performing modified Needleman-Wunsch comparisons followed
by normalized edit-distance calculations. B) MNW algorithm specification and mapping
(Start, MNW, End).
Table 5-1. Predicted and experimental results for MNW
          Predicted Time (s)   Experimental Time (s)   Error (%)
1 FPGA    9.44E-1              9.58E-1                 1.5
2 FPGAs   4.72E-1              4.83E-1                 2.3
4 FPGAs   2.36E-1              2.46E-1                 4.1
database and the resulting values for the normalized edit distance are described using
the AMP MoC. From the algorithm specification (Figure 5-7B), each of the computation
tasks (Start, MNW, and End) and two communication connections requires a separate
analytical model. The performance of the software Start and End tasks is defined
by an execution-time attribute. The number of characters in the database of sequence
comparisons determines the amount of input communication (between Start and MNW)
and the amount of computation for MNW. The output communication (between MNW
and End) is defined by the number of sequence comparisons. The architecture model
(Figure 5-6) contains the parameters outlining the communication capabilities of the PCIe
interconnect. The FPGA clock frequency (architecture) and pipeline depth (algorithm)
parameters define the computation rate.
these tests are applied iteratively during the RAT analysis until a suitable version of
the algorithm is formulated or all reasonable permutations are exhausted without a
satisfactory solution. The throughput test is a suitable starting-point for an application
wishing to match the numerical precision and general architecture of a legacy algorithm.
However, starting with the numerical precision and resources tests to refine an application
prior to throughput analysis is equally viable.
[Flowchart: START → identify kernel, create design on paper → perform RAT throughput test (insufficient computation or communication bandwidth → new design; unrealizable precision requirement → revise) → build in HDL or HLL → simulate design → verify on HW platform → PROCEED]
Figure 3-1. Overview of RAT Methodology
3.2.1 Throughput
For RAT, the predicted performance of an application is defined by two terms:
communication time between the CPU and FPGA, and FPGA computation time.
Reconfiguration and other setup times are ignored. These two terms encompass the
rate at which data flows through the FPGA and rate at which operations occur on that
data, respectively. Because RAT seeks to analyze applications at the earliest stage of
hardware migration, these terms are reduced to the most generalized parameters. The RAT
throughput test primarily models FPGAs as accelerators to general-purpose processors
with burst communication but the framework can be adjusted for applications with
streaming data.
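For an iterative, non-streaming application, the two terms combine into a total execution time. The single-buffered form below matches the t_RC_SB expression used with the Section 3.2.1 example; the double-buffered variant is a sketch under the assumption that transfers fully overlap computation:

```python
def rc_time_single_buffered(n_iter: int, t_comm: float, t_comp: float) -> float:
    # Single-buffered: communication and computation alternate, so their
    # costs add in every iteration.
    return n_iter * (t_comm + t_comp)

def rc_time_double_buffered(n_iter: int, t_comm: float, t_comp: float) -> float:
    # Double-buffered (assumed ideal overlap): each iteration is limited by
    # the larger of the two terms.
    return n_iter * max(t_comm, t_comp)

# Values from the Section 3.2.1 example: 400 iterations of a 2.47E-5 s
# transfer and a 1.31E-4 s computation.
t_sb = rc_time_single_buffered(400, 2.47e-5, 1.31e-4)  # ~6.23E-2 seconds
```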
Calculating the communication time is a relatively straightforward process given by
Equations (3-1), (3-2), and (3-3). The overall communication time is defined as the
[Data flow: the head node scatters data samples through each host microprocessor into FPGA BRAM in 8,192-element chunks; 80 pipelined kernels per node accumulate the 2-D PDF into 256×256 bins/registers; each node receives (64M)/P data elements and returns 65,536 probability values for the final PDF estimate]
Figure 4-4. Application Structure for 2-D PDF Estimation Case Study
fixed point. The results are accumulated in 256×256 registers and periodically read
back by the host microprocessor. The resulting 256×256 partial sums on each of the P
nodes are collected with a reduce operation. More discussion on this 2-D PDF estimation
architecture along with general issues related to FPGA implementation can be found in
[52].
The intended platform for this case study is illustrated in Figure 4-5. The full
platform consists of eight 3.2GHz Xeon microprocessor nodes each connected to one Xilinx
XC4VLX100 (Nallatech H101 card) via a PCI-X bus. The processing nodes are organized
as a cluster of traditional computers each augmented with application acceleration
hardware (i.e., FPGAs). The on-chip Block RAMs (BRAMs) are explicitly illustrated
since they are used to store the input and output data for the 2-D PDF case study. The
microprocessor nodes are connected via Gigabit Ethernet. Network-level communication
uses the MPICH2 implementation of the Message Passing Interface (MPI). The case study
is modeled and the implementation is tested using 2, 4, and 8 FPGA nodes.
4.4.2 Compute Node Modeling
The node-level model consists of estimating the computation for the 2, 4, and 8
Nallatech-augmented compute nodes. The values in Table 4-1 consist of the computation
attributes, which are distilled from the structure of the 2-D PDF estimation algorithm as
mapped to the architecture of the FPGA node. For computation, accurate parameterization
and multi-FPGA systems but also efficient usage during formulation (i.e., strategic
design-space exploration). Known as the RC Amenability Test, RAT provides a methodology
for prediction of potential speedup for a given high-level parallel algorithm mapped
to a selected hardware target, so that a variety of strategic tradeoffs in algorithm and
architecture exploration can be quickly evaluated without undertaking weeks or months of
costly implementation. RAT performance prediction is scoped to maintain efficient and
reasonably accurate estimation relative to the FPGA design's size and complexity.
Central to RAT is the analytical performance model and the methodology for its
application to a range of algorithms and FPGA platform architectures. The key aspects
of communication and computation within the FPGA system are parameterized and used
by the RAT model to estimate the total application performance (i.e., execution time and
speedup). The prediction efficiency and reliability are increased via tool-assisted parameter
extraction and performance estimation from explicit algorithm, architecture, and system
specifications (also referred to as models). The need for the RAT methodology stemmed
from common difficulties encountered during several FPGA application development
projects. Researchers would typically possess a software application but would be unsure
about potential performance gains in hardware. The level of experience with FPGAs
would vary greatly among the researchers, and inexperienced designers were often unable
to effectively propose and compare possible algorithmic design and FPGA platform
choices for their applications. Many initial predictions were haphazardly formulated and
performance estimation methods varied greatly. Consequently, RAT was created to
consolidate and unify the performance prediction strategies for faster, simpler, and more
effective analyses.
This research is divided into three phases. In the first phase, the focus of the
performance prediction model is on systems with a single FPGA connected directly
to a microprocessor. Applications proceed in iterations of writing data to the FPGA,
performing computation, and reading results from the FPGA. Design choices are
Table 3-8. Input parameters of LIDAR
Dataset Parameters
  N_elements, input (elements)       33000
  N_elements, output (elements)      33000
  N_bytes/element (bytes/element)    8
Communication Parameters (Cray)
  throughput_ideal (MB/s)            1600
  α_write (0 < α < 1)                0.5
  α_read (0 < α < 1)                 0.5
Computation Parameters
  N_ops/element (ops/element)        1
  throughput_proc (ops/cycle)        1
  f_clock (MHz)                      100/125/150
Software Parameters
  t_soft (sec)                       0.011
  N_iter (iterations)                1
Each GPS coordinate (X, Y, and Z) is computed as the LIDAR range ρ scaled by products of sines and cosines of the platform attitude and scan angles, plus the corresponding aircraft position offset (X_ac, Y_ac, Z_ac) (3-12).
Table 3-8 summarizes the RAT input parameters of the algorithm for coordinate
calculation. The input data size of 33,000 elements is based on one second of LIDAR
returns (i.e. the time between GPS updates). A corresponding number of GPS coordinates
is returned by the calculations. The X, Y, and Z dimensions of the LIDAR returns and
GPS coordinates each use a 16-bit fixed-point format. A total of 48 bits is sent using
the 64-bit (8-byte) RapidArray interconnect. This channel has a documented theoretical
throughput of 1.6GB/s per direction but microbenchmarking indicates only half the rate
is achievable for these data transfers. Because the computation is pipelined, the number
of operations per element is synonymous with the number of elements. The pipeline can
process one operation (i.e. element) per cycle. The exact depth of the pipeline is not
known a priori but the extra latency is presumed negligible when compared to the size of
the dataset. A range of clock frequencies is examined to predict the scope of the overall
[Plot: computational utilization versus t_comp from 0 to 2 s with t_comm = 0.5 s, comparing single-buffered and double-buffered scenarios]
Figure 3-3. Trends for Computational Utilization in SB and DB Scenarios
3.2.2 Numerical Precision
Application numerical precision is typically defined by the amount of fixed- or
floating-point computation within a design. With FPGA devices, where increased
precision dictates higher resource utilization, it is important to use only as much precision
as necessary to remain within acceptable tolerances. Because general-purpose processors
have fixed-length data types and readily available floating-point resources, it is reasonable
to assume that often a given software application will have at least some measure of
wasted precision. Consequently, effective migration of applications to FPGAs requires a
method to determine the minimum necessary precision before any translation begins.
While formal methods for numerical precision analysis of FPGA applications are
important, they are outside the scope of this document. A plethora of research exists on
topics including maintaining precision with mixed data types [29], automated conversion
of floating-point software programs to fixed-point hardware designs [30], design-time
precision analysis tools for RC [31], and custom or dynamic bit-widths for maximizing
performance and area on FPGAs [32-35]. Application designs are meant to capitalize
on these numerical precision techniques and then use the RAT methodology to evaluate
the resulting algorithm performance. Numerical precision must also be balanced against
the type and quantity of available FPGA resources to support the desired format. For
revised not once but potentially hundreds or thousands of times depending on the breadth
of the design space and complexity of the algorithm.
Strategic DSE begins with identification of performance features for revisions. The
application designer may choose to annotate a parameter with one or more alternative
values denoting possible changes to the application design. The goal is to propose
revisions to specific features and compare the range of performance values against the
performance requirements of the designer. For example, several different pipeline rates
or clock frequencies may be evaluated. Also, scalability can be analyzed using revisions
that define progressively larger problem sizes and hardware resources. Alternatively,
different schedules can be evaluated by adjusting the ordering (i.e., priority) of messages to
outgoing communication channels. As illustrated by the case studies in Section 3.4, rapid
exploration of large design spaces can greatly aid design productivity. However, a designer
using the framework must ensure that the design space under investigation is realistic with
respect to the architectural constraints (e.g., maximum circuit size or clock frequency).
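The revision process described above can be sketched as an exhaustive sweep over annotated parameter alternatives. The enumeration structure follows the text; the throughput model passed in is a hypothetical stand-in for a RAT prediction, and all names are illustrative:

```python
from itertools import product

def explore(predict, base, alternatives):
    """Enumerate every revision formed by substituting the annotated
    alternative values, and rank the designs by predicted time.
    `predict` is any callable mapping a parameter dict to a time."""
    keys = list(alternatives)
    revisions = []
    for combo in product(*(alternatives[k] for k in keys)):
        params = {**base, **dict(zip(keys, combo))}
        revisions.append((params, predict(params)))
    return sorted(revisions, key=lambda r: r[1])

# Hypothetical sweep over clock frequency and pipeline rate for a fixed
# operation count; the simple throughput model is illustrative only.
ranked = explore(
    lambda p: p["ops"] / (p["ops_per_cycle"] * p["f_clock"]),
    {"ops": 393216},
    {"f_clock": [100e6, 150e6], "ops_per_cycle": [10, 20]},
)
best_params, best_time = ranked[0]
```

The same structure extends to scalability studies by adding progressively larger problem sizes to the alternatives.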
5.4 Case Studies
This section describes two case studies, MNW and MVA graph, which demonstrate
the capabilities of the integrated framework for efficient (i.e., rapid and reasonably
accurate) strategic DSE. The experimental setup, including the construction of the DSE
tool bridging the RCML modeling environment with RAT performance prediction, is
discussed in Section 5.4.1. The MNW case study in Section 5.4.2 is a bioinformatics
application with an AMP MoC. This case study demonstrates accurate prediction, as
compared to subsequent hardware implementations, and rapid DSE. The MVA graph
in Section 5.4.3 contains a more complex network of pipelines with performance defined
by the SDF MoC. This case study maintains rapid DSE, even for very large numbers of
revisions to a complex algorithm structure.
system-wide network model and two node models: the microprocessor and FPGA.
However, the NNUS architecture will involve two networks: a local interconnect between
the FPGA and microprocessor and a system-wide network between the microprocessors.
Defining nodes based on their adjoining networks creates a consistent abstraction of
computation and communication for both prevailing system classes. This distinction
becomes increasingly important as the hierarchy of the FPGA platform increases in depth.
Ultimately, the collection of node and network models provides small, separable
descriptions of the complete computation capabilities and communication performance
of the FPGA platform. For each piece of computation in the application, the clock
frequency attribute for the FPGA defines the overall rate of execution. For each network
communication, quantitative attributes include the delay through the interconnect medium
(i.e., latency) and bandwidth for message transmission.
4.3.1.2 Application Scope
Strategic performance prediction requires application characteristics amenable to
quantification. An application encompasses an algorithm and its mapping to an FPGA
platform.
Algorithm Finite number of hardware-independent tasks with explicitly defined
parallelism and ordering used to solve a problem.
Mapping Algorithm's computation tasks assigned to nodes and data movement between
nodes assigned to one or more communication networks.
A complete description of an algorithm and its mapping must be provided by the
designer for effective performance modeling. The composition and parallelization
of algorithm tasks defines the computational load for each node and the required
communication for each network to support the application. Algorithm and mapping
features must be scoped to ensure quantitative characterization of computation and
communication interaction that is tractable for analytical modeling. Specifically,
the number of data elements, Nelements, and number of bytes per element, Nbytes/element
(Equation 4-7). Expressing data in terms of elements allowed more direct correlation
between the volume of computation and the amount of communication. However, the RAT
I/O model is adjusted to coincide with the LogGP formulation for consistency within the
network model.
g(k) = (R_IO × Eff_IO)^(-1) (4-6)
R_IO: theoretical throughput rate of I/O channel (from RAT)
Eff_IO: efficiency of I/O channel (from RAT)
k = N_IO_elements × N_bytes/element (4-7)
N_IO_elements: number of I/O elements (from RAT)
N_bytes/element: number of bytes per element (from RAT)
Equation 4-8 defines the set of attributes, S_IO, for the revised I/O transaction model,
which consists of the latency and overhead delay, L and o; size-dependent message gap,
g(k); number of nodes, P; message size, k; and the additional computation cost, γ, for the
communication, if any. Though the I/O model represents a point-to-point interconnect,
the P value remains to represent unidirectional (i.e., P=1) or bidirectional (i.e., P=2)
behavior. Equation 4-9 defines the communication for the I/O transaction, t_IO, by the
delay function, f_delay, for latency and overhead; number of nodes; message size; gap; and
additional computation as a function, f_cost, of the message size and γ cost value.
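The structure of the Equation 4-9 transaction time can be sketched as follows. The exact forms of the delay and quantity functions are assumptions (simple sums and products); g is the g(k) gap function in seconds per byte, and the example numbers are purely illustrative:

```python
def io_transaction_time(L, o, P, k, g, f_cost=lambda k: 0.0):
    """Sketch of an I/O transaction time: delay (latency plus overhead),
    a size-dependent gap applied per byte of the k-byte message, a factor
    P for unidirectional (P=1) or bidirectional (P=2) transfers, and an
    optional extra computation cost."""
    return (L + o) + P * k * g(k) + f_cost(k)

# Illustrative numbers only: 16 us delay, 1 MB unidirectional write over a
# channel sustaining roughly 1 ns per byte.
t = io_transaction_time(1.6e-5, 0.0, 1, 1 << 20, lambda k: 1e-9)
```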
CHAPTER 4
EXPANDED MODELING FOR MULTI-FPGA PLATFORMS (PHASE 2)
The second research phase outlines the expansion of the RAT analytical model for
efficient and reasonably accurate performance estimation of multi-FPGA systems prior to
hardware implementation. This chapter presents a brief introduction to the challenges and
objectives of multi-FPGA RAT (Section 4.1); background and related work on multi-node
performance modeling (Section 4.2); assumptions, quantitative attributes, analytical
models, and scope of the expanded model (Section 4.3); a detailed walkthrough of a
reasonably complex application, 2-D PDF estimation (Section 4.4); two additional case
studies, image filtering and molecular dynamics (Section 4.5); and conclusions (Section
4.6).
4.1 Introduction
The reformation towards explicitly parallel architectures and algorithms is accompanied
by increased emphasis on multi-device systems for achieving additional performance
benefits. However, exploiting parallelism for scalable FPGA systems requires even
more expensive development cycles, which further limits widespread adoption of RC.
Current design approaches focus on faster coding paths to device-level implementations
(e.g., high-level synthesis), which address only one symptom of the greater productivity
challenge for FPGA systems.
In contrast to development practices based on iterative implementation, strategic
design-space exploration (DSE) is needed to improve productivity with scalable systems.
Parallel applications for multi-FPGA platforms should be planned and performance issues
analyzed prior to implementation, narrowing the range of possible algorithm and systems
mappings based on the performance requirements. For the second phase of research, the
RC Amenability Test for Scalable Systems (RATSS) extends RAT from Chapter 3 to
multi-FPGA systems by incorporating key concepts from traditional analytical modeling
(e.g., BSP [9] and LogP [10]). This new model produces a comprehensive performance
[Diagram: the source image and 3×3 filters stream from the network through each node's microprocessor to its primary and secondary FPGAs, each convolving the image with a different filter]
Figure 4-8. Algorithm Structure for Image Filtering Case Study
Table 4-7. Node Attributes for Image Filtering
Attribute         Units                  Value
RL_comp           cycles                 0
R_comp            operations/cycle       34
F_clock           MHz                    100
N_comp_elements   elements               349,448
N_ops/element     operations/element     17
quickly and accurately determine a priori. However, the pipeline latency, RL_comp, should
be negligible with respect to the volume of data. Both FPGAs contain a pipelined filtering
kernel that calculates the nine multiplications and eight additions for the convolution
for a total computational throughput, R_comp, of 34 (2 FPGAs × (9+8) operations). The
clock frequency, Fclock, for the MAP-B node is fixed at 100MHz. An image size of 418x418
pixels, limited by the size of the MAP-B SRAM, is used for this case study though larger
sizes can be simulated by repeatedly looping through the memory. In contrast to the
previous case study, each FPGA of each node needs the complete data set (i.e., image)
because each kernel convolves a different filter. Consequently, the effective number of data
elements, N_elements, per node is 349,448 (418 pixels × 418 pixels × 2 FPGAs). Again, each
element requires nine multiplications with the 3×3 filter and eight subsequent summations
for a total of 17 operations per element, N_ops/element.
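The node computation time implied by these attributes can be checked numerically. This sketch follows the attribute definitions above (pipeline fill latency plus total operations over aggregate throughput, converted to seconds at the node clock); the exact RATSS equation may differ:

```python
def node_comp_time(n_elements, n_ops_per_element, r_comp, f_clock_hz, rl_comp=0):
    """Pipelined node computation time from Table 4-7 style attributes:
    RL_comp fill cycles plus (N_elements x N_ops/element) / R_comp cycles,
    divided by the clock frequency."""
    cycles = rl_comp + (n_elements * n_ops_per_element) / r_comp
    return cycles / f_clock_hz

# Table 4-7 values: 349,448 elements, 17 ops/element, 34 ops/cycle, 100 MHz.
t = node_comp_time(349448, 17, 34, 100e6)  # ~1.75E-3 seconds
```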
Table 4-8 defines the application-specific network attributes for the SNAP network
model. Two nodes, P, with four total FPGAs are used for this case study. The pipelined
(streaming) computation is structured using shift registers and requires three new
[Plots: measured read and write efficiency for Nallatech BRAM I/O versus transfer size in bytes, with efficiency rising as transfers grow larger]
A Range of transfer sizes B Large transfer sizes
Figure 4-6. Results of Efficiency Microbenchmarks for Nallatech BRAM I/O
which defines the effective throughput for a given transfer size. The I/O latency and
overhead, Llo + olo, is assumed to be the total transfer time for a very small transfer (i.e.,
4B of data), which is dominated by the channel delay of the PCI-X bus. For writes and
reads with the FPGA block RAM, the measured performance is 1.60E-5(s) and 3.20E-5(s),
respectively. The gap, g(k), for the write and read I/O transactions is the multiplicative
inverse of the 264MB/s (i.e., 33MHz, 64-bit PCI-X) theoretical throughput, R_IO, times
the I/O efficiency, Efflo. These particular g(k) values are determined by the message size,
k, which is defined by the number of I/O elements, NIo_elements, and number of bytes per
element, N_bytes/element. For the write I/O, the 64M input data elements for each of the X
and Y dimensions are divided among the 2, 4, and 8 nodes for the N_IO_elements term. Again,
these write transfers are divided into blocks of 8,192 elements, meaning 64M/(8,192×P)
distinct transfers. The output (i.e., read I/O) involves collecting the 65,536 (256×256)
elements storing the partial PDF estimates for each of the 64M/(8,192×P) iterations.
Though the computation is 18-bit fixed point, the data format for the I/O transfers is
32-bit integer and consequently the number of bytes per element, N_bytes/element, is 4.
Equation 4-17 defines the performance for PCI-X write and read transactions,
t_transaction_write/read, by the I/O latency and overhead, L_IO + o_IO, gap value for the message
number of nodes; the short or long-message gap; and the additional computation, if any, as
a function, f.ost, of the amount of data and 7 cost value.
S_transaction = {L, o, g, G, P, k, γ} (4-4)
S_transaction: set of attribute values for the specific transaction
L, o: LogGP latency and overhead attributes, respectively
g, G: LogGP short and long-message gap, respectively
P, k: LogGP number of nodes and message size, respectively
γ: additional computation for operations such as reduce
t_transaction(S_transaction)
= f_delay(L, o) + f_quantity(P, k) × [g or G] + f_cost(P, k, γ) (4-5)
t_transaction: total time for the network transaction
f_delay(): function defining delay w.r.t. L and o
f_quantity(): function defining total data quantity w.r.t. P and k
f_cost(): function defining additional computation cost (e.g., reduce)
In contrast to the multi-node network model, I/O interconnects between microprocessors
and FPGA accelerator cards often exhibit a highly variable gap over a range of message
sizes. However, the different gap values for a range of data sizes and transfer types (i.e.,
DMA to BRAM or read from registers) can be collected prior to application analysis using
microbenchmarks and reused on future applications with similar I/O communication.
These attribute values are either collected into a table for reference or used to construct an
explicit g(k) function. From Equation 4-6, the original RAT model separated individual
gap values into the theoretical interconnect throughput, Rio, and the efficiency of the
interconnect for the message size, Efflo. Similarly, the message size was decomposed into
cost due to potential reuse for analysis of future applications with similar platform
mapping. Application-specific attributes such as the quantity of data and amount of
computational parallelism are explicitly specified by the user based on the algorithm and
platform mapping. These attributes feed the equations described in Section 4.3.2 which
compute the performance estimate. Based on the model results, the designer may further
refine the application or proceed to low-level implementation and analysis.
The accompanying case studies presented in this chapter primarily emphasize
performance prediction for the final configuration of an application prior to implementation.
However, the authors expect that strategic DSE will explore multiple options for algorithm
structure and platform mapping, which would involve several repetitions of RATSS
analysis with comparison of the predicted performance values against the designer's
expectations. The key performance criterion explored in this chapter is execution time,
but issues of application scalability, resource utilization (e.g., load balancing), power-delay
product, etc. can also be inferred from the RATSS analysis. Analyses of these issues are
not limited to physically realizable systems and can project capabilities of future system
configurations.
4.3.2 Model Attributes and Equations
This section discusses the attributes, equations, and general approach of the node
and network models along with their arrangement into stage- and application-level models
for RATSS hierarchical performance prediction. The attributes and equations for the
node and network models leverage existing research from RAT [50] and LogGP [11]
to construct computation and communications models. The platform and algorithm
scope provide efficient quantification of performance features of the computation and
communication, which serves as input to the analytical models. Essentially, both
computation and communication represent the time cost of data movement through a
component (e.g., FPGA or interconnect). Equation 4-1 defines the general structure for
node and network performance as the delay overhead through medium/architecture, delay;
(or statistically observed) rate. Classification of these basic operations is straightforward
because tasks perform only computation and connections facilitate only communication.
The quantitative attributes for the computation tasks include the amount of data to be
processed and the cost of processing each data element, which are contained within the
particular task specification. Similarly, quantitative attributes for algorithm connections
define the amount of data and segmentation for transfers between tasks, which are
contained within the connection specification. Computation and communication models
for tasks and connections, respectively, are provided with corresponding architectural
information (e.g., FPGA clock frequency or interconnect bandwidth) by the framework
translation based on the application mapping. For scheduling, the key difference between
the two MoCs is the specificity of the overlap of task execution. For AMP, task execution
is dependent only upon the order of communication message from its predecessor tasks
(i.e., those prior tasks which provide data to the current task). In contrast, the SDF model
assumes simultaneous, fine-grain operation of all tasks and connections, typically as a
pipeline operating on individual data elements within one or more streams of data. In
practice, AMP is sufficient for serializing communication between microprocessors and
FPGA application accelerators (e.g., MNW) whereas SDF is useful for describing multiple
directly connected pipelines (e.g., MVA graph).
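The scheduling difference between the two MoCs can be caricatured with a toy timing sketch. This is an assumption for illustration only, not the framework's actual composition rules: serialized AMP stages add, while a saturated SDF pipeline is paced by its slowest stage:

```python
def amp_time(stage_times):
    # AMP (assumed serialization): each task waits on messages from its
    # predecessor tasks, so stage times accumulate.
    return sum(stage_times)

def sdf_time(stage_times):
    # SDF (assumed steady-state pipeline): all tasks and connections operate
    # simultaneously on streaming elements, so the slowest stage sets the
    # pace (pipeline fill and drain ignored).
    return max(stage_times)

stages = [0.2, 0.5, 0.3]  # illustrative per-stage times in seconds
```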
5.3.2 Orchestration
Strategic DSE involves evaluating a range of application designs to determine the
most desirable configuration. Design alternatives may differ in multiple facets including
the algorithm requirements (e.g., problem size) and architectural capabilities (e.g., clock
frequency). The framework supports strategic DSE by repetitively revising an application
specification and evaluating the resulting performance against other design alternatives.
RAT is provided different sets of quantitative performance features, which typically
represent several permutations of one or more attributes. (DSE based on major revisions
to the application mapping is outside the scope of this dissertation.) Predictions are
Collectively, this research contributes an analytical model and accompanying
methodology for performance prediction where such work was lacking for FPGA
development. RAT and RATSS demonstrated high applicability to a variety of algorithms,
platform architectures, and system mappings and provide a formalized infrastructure
for integrated, efficient, reliable, and reasonably accurate aid with respect to the
prediction aspect of design-space exploration. This research contributed not only to
analytical modeling for FPGA performance estimation but also to modeling languages
and design patterns for RC. Future directions for research include incorporation of the
RAT prediction (and the integrated framework) into a larger methodology for more
fully automated design-space exploration (e.g., automated mapping, evaluation, and
optimization) with integrated bridging to design-level implementation code.
[Plot: predicted execution time in seconds versus number of DNA sequences, 0 to 60,000, rising steeply with problem size]
Figure 5-8. Predicted execution times of MNW on four FPGAs based on revisions to the
number of DNA sequences for comparison
Table 5-1 summarizes the predicted and experimental execution times for MNW
using 1500 DNA sequences (i.e., (1500² − 1500)/2 comparisons) divided across 1, 2, and 4
FPGAs. The predicted execution times were generated by RAT based on the quantitative
performance information provided by the framework. The experimental execution times
were measured from subsequent hardware implementations that correspond to the
application specification. Based on the 1.5 to 4.1% error rates, the integrated framework
was able to maintain reasonable accuracy during the abstract application specification,
collection of quantitative performance parameters, and resulting performance prediction.
Generating the abstract specification took only a few minutes and the subsequent analysis,
as directed by the framework, took approximately 2.3ms. The productivity gained by
using the framework is significant because the actual hardware implementation for MNW
required approximately 200 man-hours to code, place and route, debug, and evaluate.
Beyond the initial prediction, evaluating the performance impact of alternative
MNW designs can provide insight about the desirability of possible structural or behavior
revisions. The DSE tool can explore different architectural optimizations (e.g., faster
summation of the read and write components. For the individual reads and writes, the
problem size (i.e., number of data elements, N_elements) and the numerical precision (i.e.,
number of bytes per element, N_bytes/element) must be decided by the user with respect
to the algorithm. Note that for these equations, the problem size only refers to a single
block of data to be transferred to or from the FPGA. All read or write communication
by the application need not occur as a single transfer but can instead be partitioned
into multiple blocks of data to be more efficiently sent or received. Multiple transfers are
considered in a subsequent equation. The theoretical bandwidth of the FPGA/CPU
interconnect on the target platform (e.g., 133MHz, 64-bit PCI-X, which has a documented
maximum throughput of 1GB/s) is also necessary but is generally provided either
with the FPGA subsystem documentation or as part of the interconnect standard. An
additional parameter, α, represents the fraction of ideal throughput achievable for useful
communication. The actual sustained performance of the FPGA interconnect is typically a
fraction of the documented transfer rate.
t_comm = t_read + t_write (3-1)
t_read = (N_elements × N_bytes/element) / (α_read × throughput_ideal) (3-2)
t_write = (N_elements × N_bytes/element) / (α_write × throughput_ideal) (3-3)
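Using the Table 3-8 LIDAR parameters, Equations 3-1 through 3-3 can be checked numerically. The sketch assumes α and the ideal throughput are expressed per transfer direction:

```python
def transfer_time(n_elements, n_bytes_per_element, alpha, throughput_ideal_bps):
    """One block transfer at the achievable fraction (alpha) of the
    interconnect's ideal throughput (Equations 3-2 and 3-3)."""
    return (n_elements * n_bytes_per_element) / (alpha * throughput_ideal_bps)

# Table 3-8 LIDAR parameters: 33,000 8-byte elements, alpha = 0.5 in each
# direction, 1600 MB/s RapidArray channel.
t_read = transfer_time(33000, 8, 0.5, 1.6e9)
t_write = transfer_time(33000, 8, 0.5, 1.6e9)
t_comm = t_read + t_write  # Equation 3-1
```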
Microbenchmarks composed of simple data transfers can be used to establish the
true communication throughput. The communication times for these block transfers are
measured and compared against the theoretical interconnect throughput to establish the
α parameters. It is important for microbenchmarks to closely match the communication
methods used by real applications on the target FPGA platform to accurately model
the intended behavior. In general, microbenchmarks are performed over a wide range
key performance characteristics from the designer's specification. Such models prevent
wasted implementation effort by identifying unrealizable designs and reducing the revisions
necessary to achieve performance requirements.
RATSS provides an efficient and reasonably accurate analytical model for evaluating
a scalable FPGA application prior to implementation based on the RAT methodology
from Chapter 4. RATSS boosts designer productivity by extending concepts from
component-level models to allow efficient abstraction and estimation of the computation
and communication features of FPGA applications. The RATSS model contributes a
hierarchical approach for combining component descriptions into a full performance
estimate. RATSS performance prediction remains tractable by focusing on synchronous,
iterative computation models for the two primary classes of modern high-performance FPGA
platforms.
The 2-D PDF, image filter, and MD case studies illustrate performance modeling
for a range of problem sizes and ratios of computation-to-communication. These case
studies demonstrated nearly 90% prediction accuracy, which is considered sufficient
given the focus of RATSS on strategic application planning. The accuracy of both the
computation and communication models allows not only individual performance estimates
but also accurate predictions across a range of potential application configurations
including wide variations in problem sizes and computation-to-communication ratios.
Specifically, important performance tradeoffs such as increasing parallelism or decreasing
the communication rate can be efficiently evaluated with reasonable accuracy. These
case studies serve as motivation for broad design-space exploration with RATSS as
predictions are efficiently generated and reasonably accurate, which help ensure the
eventual implementation is the most desirable design configuration.
[Figure 5-5 diagram: the modeling environment interface supplies algorithm and
architecture models (basic operations and connections, operation and connection rates,
and scheduling of operations and connections); a tool-specific MoC abstraction maps
these attributes onto RAT computation and communication models along with an
abstract-MoC schedule describing the overlap of operations.]
Figure 5-5. Translation of application specification information for RAT prediction
the groups of operations mapped to that resource and RAT communication models based
on data movement between hardware resources. A generic schedule that describes the
parallelism and overlap between the computation and communication is constructed from
the semantics of the MoC. Conversion between algorithmic MoCs and RAT performance
models is possible as the important structure and behavior of the application specification,
properly formatted, correspond directly to available computation and communication
models. The basic computation operations within an MoC are often generic abstractions
that require additional quantitative attributes from the application designer, specifically
formatted for the assumed technology (e.g. FPGAs), for translation to the RAT model.
For the DSE tool used in Section 3.4 (and by extension, any future tool connecting
modeling environments and RAT), the underlying translation step must be tuned
for the MoCs of interest, specifically AMP and SDF for the case studies. For FPGA
systems, these MoCs can be abstractly represented as a number of "tasks" (i.e., generic
encapsulations of groups of operations with detailed specification left to the application
designer) with the data movement through algorithm "connections." Tasks often represent
either pipelines or state machines, which imply structured execution at a deterministic
PDF estimation, the design emphasis is placed on throughput analyses because the overall
goal is to minimize execution time for these designs.
These RAT case studies represent a range of experiences with estimating computational
throughput based on different user backgrounds and prediction emphases. Consequently,
2-D PDF estimation, LIDAR, and traveling salesman focus on more exact throughput
parameterization in contrast to the conservative prediction in 1-D PDF. However, the
performance of molecular dynamics could not be reliably estimated prior to implementation
because of the difficulty of analyzing the complex and data-dependent algorithm
structure as described by a high-level language. Instead, the target throughput is
computed from the speedup requirements. While this prediction will be inaccurate if
the minimum throughput is unrealizable, the RAT estimation provides a starting point
for implementation and insight about the performance ramifications of a suboptimal
architecture.
3.4.1 2-D PDF Estimation
As previously discussed, the Parzen window technique is applicable in an arbitrary
number of dimensions [36]. However, the two-dimensional case presents a significantly
greater problem in terms of communication and computation volume than the original 1-D
PDF estimate. Now 256 x 256 discrete bins are used for PDF estimation and the input
data set is effectively doubled to account for the extra dimension. The basic computation
per element grows from (N - n)^2 + c to (N1 - n1)^2 + (N2 - n2)^2 + c, where N1 and N2
are the data sample values, n1 and n2 are the probability levels for each dimension, and c
is a probability scaling factor. But despite the added complexity, the increased quantity of
parallelizable operations intuitively makes this algorithm amenable to the RC paradigm,
assuming sufficient quantities of hardware resources are available.
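The per-element workload growth described above can be sketched in a few lines. This is a minimal illustration with invented helper names, not the case study's hardware pipeline; only the formulas and the 256 × 256 bin grid come from the text.

```python
# Sketch of the per-bin computation for 1-D vs. 2-D PDF estimation.
# N, N1, N2 are data sample values; n, n1, n2 are probability levels;
# c is a probability scaling factor (all names follow the text).

def contribution_1d(N, n, c):
    """1-D per-bin computation: (N - n)^2 + c."""
    return (N - n) ** 2 + c

def contribution_2d(N1, N2, n1, n2, c):
    """2-D per-bin computation: (N1 - n1)^2 + (N2 - n2)^2 + c."""
    return (N1 - n1) ** 2 + (N2 - n2) ** 2 + c

def total_bins_2d(bins_per_dim=256):
    """2-D estimation evaluates each element against bins_per_dim^2 bins."""
    return bins_per_dim * bins_per_dim
```

With 256 bins per dimension, each element now touches 65,536 bins instead of 256, which is the source of both the added communication volume and the added parallelizable work.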
Table 3-5 summarizes the input parameters for the RAT analysis of our 2-D PDF
estimation algorithm. Again, the computation is performed in a two-dimensional space, so
twice the number of data samples are sent to the FPGA. In contrast to the 1-D case, the
[Figure 4-9 diagram: 1) write molecule data to SRAM, 2) calculation, 3) read results
from SRAM]
Figure 4-9. Algorithm Structure for Molecular Dynamics Case Study
Table 4-10. Node Attributes for Molecular Dynamics
Attribute (units)                     Value
PLcomp (cycles)                       0
Rcomp (operations/cycle)              1
Fclock (MHz)                          100
Nelements (elements)                  8,192
Nops/element (operations/element)     32,767
node, Nelements, is 8,192 (32,768/4). Each molecule's interaction is computed against every
other molecule for a total of 32,767 operations, Nops/element.
Table 4-11 defines the application-specific attributes for the SNAP network model. A
total of four nodes, P, are used for the case study. The scatter message size is twice the
gather message size due to two copies of input data required to compute a molecular
interaction in a single cycle (i.e., two memory accesses per cycle). Also, the 4-byte,
single-precision x, y, and z dimensions of the molecule data are packed into two 8-byte
words. Thus, the scatter and gather message sizes, k, are 1,048,576B (32,768x4x8B)
and 524,288B (32,768x2x8B), respectively. Because of the single time step, only one
system-level iteration, Nsystemiterations, is required for this case study.
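The message-size arithmetic above can be checked with a short sketch; the constant and function names are illustrative, not taken from the RATSS model.

```python
# Sketch of the MD message sizing from Table 4-11: molecule positions are
# packed into two 8-byte words, two copies are scattered per node (to allow
# two memory accesses per cycle), and one copy's worth is gathered back.

N_MOLECULES = 32768
WORDS_PER_MOLECULE = 2   # x, y, z single-precision values in two 8-byte words
BYTES_PER_WORD = 8

def message_size(copies):
    """Message size k in bytes for the given number of data copies."""
    return N_MOLECULES * copies * WORDS_PER_MOLECULE * BYTES_PER_WORD

scatter_k = message_size(copies=2)   # 1,048,576 B (32,768 x 4 x 8 B)
gather_k = message_size(copies=1)    # 524,288 B (32,768 x 2 x 8 B)
```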
Table 4-12 compares the results of the RATSS model with the subsequent implementation
of MD. Over 90% of the execution time is dominated by the FPGA computation, which
is highly deterministic. The model error for the node-level time, tnodes, is -0.0001 an
calculated at each time step based on the particles' masses and the relevant subatomic
forces. For this case study, the MD simulation is focused on the interaction of certain inert
liquids such as neon or argon. These atoms do not form covalent bonds and consequently
the subatomic interaction is limited to the Lennard-Jones potential (i.e., the attraction
of distant particles by van der Waals force and the repulsion of close particles based
on the Pauli exclusion principle) [39]. Large-scale MD simulators such as AMBER [40]
and NAMD [41] use these same classical physics principles but can calculate not only
Lennard-Jones potential but also the nonbonded electrostatic energies and the forces of
covalent bonds, their angles, and torsions, making them applicable to not only inert atoms
but also complex molecules such as proteins. The parallel algorithm used for this case
study was adapted from code provided by Oak Ridge National Lab (ORNL).
Figure 4-9 provides an overview of the MD case study. In slight contrast to the
image-filtering case study, four MAP-B nodes, one FPGA each, are used for MD. In
order to compare two molecules every clock cycle, two copies of the molecular data are
sent by the network-attached microprocessor to the primary FPGA of each node. (The
secondary FPGA is not used for this case study). Each copy contains the X, Y, and Z
dimensions of the molecular position data, requiring 2 SRAMs per copy for a total of 4
banks per node. Each MD kernel checks the distance of N/4 molecules against the other
N - 1 molecules, where N/4 is the number of molecules for each of the four nodes. If
the molecules are sufficiently close, the MD kernel calculates the molecular forces (and
subsequent acceleration) imparted on each other. The acceleration effects are accumulated
in the last two SRAM banks and transferred back to the network-attached microprocessor.
The node-level attributes for the MD case study are defined in Table 4-10. The
pipeline latency, PLcomp, is considered negligible for this case study due to the O(N^2)
computational complexity. One pipeline per node allows one molecular interaction (i.e.,
operation) per cycle, Nops/cycle. Again, the clock frequency, Fclock, for the MAP-B nodes
is fixed at 100MHz. For this case study, the number of data elements (i.e., molecules) per
quantities of arithmetic or logical operations and registers. But a precise count is nearly
impossible without an actual hardware description language (HDL) implementation.
Above all other types of resources, routing strain increases exponentially as logic element
utilization approaches maximum. Consequently, it is often unwise (if not impossible) to fill
the entire FPGA.
Currently, RAT does not employ a database of statistics to facilitate resource
analysis of an application for complete FPGA novices. The usage of RAT requires
some vendor-specific knowledge (e.g. single-cycle 32-bit fixed-point multiplications with
64-bit resultants on Xilinx Virtex-4 FPGAs require four dedicated 18-bit multipliers).
Additionally, the user must consider tradeoffs such as using fixed resources versus logic
elements and computational logic versus lookup tables. Resource analyses are meant to
highlight general application trends and predict scalability. For example, the structure of
the molecular dynamics case study in Section 3.4 is designed to minimize RAM usage and
the parallelism is ultimately limited by the availability of multiplier resources.
3.2.4 Scope of RAT
The analytical model described in Section 3.2.1 establishes the basic scope of RAT as
a strategic design methodology to formulate predictions about algorithm performance
and RC amenability. RAT is intended to support a diverse collection of platforms
and application fields because the methodology focuses on the common structures and
determinism within the algorithm. Communication and computation are related to the
number of data elements in the algorithm. Effective usage of the performance prediction
models requires mitigation of variabilities in the algorithm structure such as data-driven
computation. Based on the complexity of the algorithm and architecture, the RAT model
may be used to directly predict performance or instead establish minimum throughput
requirements based on the desired speedup. RAT currently targets systems with a single
CPU and FPGA as a first step towards a broad RC methodology. The FPGA device
is considered a coprocessor to the CPU but can initiate some operations independently
5.4.2 Modified Needleman-Wunsch (MNW)
The MNW case study is an FPGA-optimized application for calculating the
normalized edit distance between two DNA sequences within the composite ESPRIT
application for metagenomics [62]. The normalized edit distance provides concise
quantitative insight about the similarity of two DNA sequences based on the length of
the sequences, the number of gaps in the global sequence alignment, and the number of
edits required to transform one sequence string into the other. The MNW application
pipelines the standard Needleman-Wunsch [63] calculations for individual alignment
scores and resulting global alignment with the ESPRIT calculation of the normalized edit
distance. The pipeline concurrently computes the alignment scores with the normalized
edit distance rather than computing the edit distance from the character representation of
the alignment as is done in software. Computing the edit distance in this way eliminates
the need to store a score matrix, significantly reducing the memory requirements for the
FPGA system. This case study is referred to as modified because the typical outputs
of Needleman-Wunsch, the score matrix and global alignment, are unnecessary after
the calculation of the normalized edit distance and are never retained. However, as
with traditional Needleman-Wunsch, MNW is often useful for comparing many pairs of
sequences of similar length as a batch.
Figure 5-7A provides a general overview of the algorithm structure. A database of
comparisons is built from the sequences and divided, round-robin, among the specified
number of FPGAs. A total of N sequences requires a database of (N^2 - N)/2 comparisons
since sequences are not compared against themselves and comparisons such as [2, 1] are
equivalent to [1, 2]. The initial configuration of this case study involves 1500 sequences,
each 105 characters in length. The resulting (N^2 - N)/2 normalized edit distances are
collected by the microprocessor after computation is complete.
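The comparison-database sizing and round-robin division can be sketched as follows. The function names are illustrative, and the round-robin helper is a hypothetical stand-in for the framework's actual database division.

```python
# Sketch of the MNW comparison-database construction: N sequences yield
# N*(N-1)/2 unordered pairs (no self-comparisons; [i, j] equals [j, i]),
# dealt round-robin among the specified number of FPGAs.

def num_comparisons(n_sequences):
    """Unordered sequence pairs: (N^2 - N) / 2."""
    return n_sequences * (n_sequences - 1) // 2

def round_robin(pairs, n_fpgas):
    """Deal the comparison database among FPGAs, one pair at a time."""
    return [pairs[i::n_fpgas] for i in range(n_fpgas)]

# The initial configuration of 1500 sequences yields 1,124,250 comparisons.
db_size = num_comparisons(1500)
```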
The framework queries the modeling environment for the quantitative performance
information necessary for RAT prediction. The communication of the DNA sequence
Similar to an element, one must also examine what is an "operation." Consider an
example algorithm composed of a 32-bit addition followed by a 32-bit multiplication.
The addition can be performed in a single clock cycle but to save resources the 32-bit
multiplier might be constructed using the Booth algorithm requiring 16 clock cycles.
Arguments could be made that the addition and multiplication would count as either two
operations (addition and multiplication) or 17 operations (addition plus 16 additions,
the basis of the Booth multiplier algorithm). Either formulation is correct provided
that throughput is formulated with the same assumption about the scope of an
operation. Often, deterministic and highly structured algorithms are better viewed with
the number of operations synonymous with the number of cycles. In contrast, complex
or nondeterministic algorithms tend to be viewed as an abstract number of
operations with an average rate of execution. Ultimately, either choice is viable and left to
the preference of the user.
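The two counting conventions above yield the same predicted time as long as the throughput uses the same convention. A minimal sketch, assuming a 100 MHz clock purely for illustration:

```python
# Sketch: either operation-counting convention for the add-then-Booth-multiply
# example gives the same computation time, provided throughput (ops/cycle)
# is defined under the same convention. The clock rate is an assumed value.

F_CLOCK = 100e6  # Hz, illustrative assumption

def t_comp(n_ops, ops_per_cycle):
    """Computation time: operations divided by effective operation rate."""
    return n_ops / (ops_per_cycle * F_CLOCK)

# Convention A: 2 abstract operations completed over 17 cycles -> 2/17 ops/cycle.
t_a = t_comp(2, 2 / 17)
# Convention B: 17 cycle-level operations at 1 op/cycle.
t_b = t_comp(17, 1)
# Both evaluate to 17 cycles of work at F_CLOCK.
```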
Figure 3-2 illustrates the types of communication and computation interaction to be
modeled with the throughput test. Single buffering (SB) represents the most simplistic
scenario with no overlapping tasks. However, a double-buffered (DB) system allows
overlapping communication and computation by providing two independent buffers to
keep both the processing and I/O elements occupied simultaneously. Since the first
computation block cannot proceed until the first communication sequence has completed,
steady-state behavior is not achievable until at least the second iteration. However, this
startup cost is considered negligible for a sufficiently large number of iterations.
The FPGA execution time, tRC, is a function not only of the tcomm and tcomp terms
but also the amount of overlap between communication and computation. Equations (3-5)
and (3-6) model both SB and DB scenarios. For SB, the execution time is simply the
summation of the communication time, tcomm, and computation time, tcomp. With the
DB case, either the communication or computation time completely overlaps the other
term. The smaller latency essentially becomes hidden during steady state. The DB case
REFERENCES
[1] M. Smith and G. Peterson, "Parallel application performance on shared high
performance reconfigurable computing resources," Performance Evaluation, vol. 60,
pp. 107-125, 2005.
[2] T. El-Ghazawi, E. El-Araby, M. Huang, K. Gaj, V. Kindratenko, and D. Buell, "The
promise of high-performance reconfigurable computing," Computer, vol. 41, no. 2, pp.
69-76, Feb. 2008.
[3] D. Pellerin and S. Thibault, Practical FPGA Programming in C, Prentice Hall Press,
2005.
[4] SRC Computers, SRC Carte C Programming Environment, 2007.
[5] Mitrionics, "Low power hybrid computing for efficient software
acceleration," updated 2008, cited May 2010, available from
http://www.mitrion.com/?document Hybrid-Computing-Whitepaper.pdf.
[6] Mentor Graphics, "Handel-c synthesis methodology," updated 2010, cited May 2009,
available from http://www.mentor.com/products/fpga/handel-c/.
[7] W. Wolf, "A decade of hardware/software codesign," Computer, vol. 36, no. 4, pp.
38-43, 2003.
[8] S. Fortune and J. Wyllie, "Parallelism in random access machines," in Proc. ACM
10th Symp. Theory of Computing, San Diego, CA, May 01-03 1978, pp. 114-118.
[9] L. G. Valiant, "A bridging model for parallel computation," Communications of the ACM,
vol. 33, no. 8, pp. 103-111, Aug. 1990.
[10] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos,
R. Subramonian, and T. von Eicken, "LogP: Towards a realistic model of parallel
computation," in Proc. ACM 4th Symp. Principles and Practice of Parallel Program-
ming, San Diego, CA, May 19-22 1993, pp. 1-12.
[11] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, "LogGP:
Incorporating long messages into the LogP model for parallel computation," J.
Parallel and Distributed Computing, vol. 44, no. 1, pp. 71-79, 1997.
[12] T. Kielmann, H. E. Bal, and S. Gorlatch, "Bandwidth-efficient collective
communication for clustered wide area systems," in Proc. Int'l Parallel and Dis-
tributed Processing Symp. (IPDPS), 1999, pp. 492-499.
[13] E. Grobelny, C. Reardon, A. Jacobs, and A. George, "Simulation framework for
performance prediction in the engineering of RC systems and applications," in Proc.
Int'l Conf. Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas,
NV, Jun 25-28 2007.
of possible data sizes. The resulting α values can be tabulated and used in future RAT
analyses for that FPGA platform. By separating the effective throughput into the
theoretical maximum and the α fraction, effects such as changing the interconnect
type and efficiency can be explored separately. This fidelity is particularly useful for
hypothetical or otherwise unavailable FPGA platforms.
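Deriving α from a microbenchmark reduces to dividing the measured rate by the theoretical maximum, as the text describes. A minimal sketch, with purely illustrative transfer numbers (not measurements from any platform):

```python
# Sketch of computing the alpha (effective-bandwidth fraction) parameter
# from a microbenchmark: measured rate / theoretical maximum rate.

def alpha(measured_bytes, measured_seconds, theoretical_bytes_per_sec):
    """Fraction of theoretical interconnect bandwidth actually achieved."""
    actual_rate = measured_bytes / measured_seconds
    return actual_rate / theoretical_bytes_per_sec

# Illustrative only: a 2 KB transfer measured at 80 microseconds on a bus
# with a 1 GB/s theoretical maximum yields a small alpha, showing how
# per-transfer latency dominates small messages.
a = alpha(2048, 80e-6, 1e9)
```

Tabulating α over a sweep of transfer sizes gives the lookup values the text describes for future RAT analyses on the same platform.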
Before further equations are discussed, it is important to clarify the concept of an
"element." Until now, the expressions "problem size," "volume of communicated data,"
and "number of elements" have been used interchangeably. However, strictly speaking, the
first two terms refer to a quantity of bytes whereas the last term has units of elements.
RAT operates under the assumption that the computational workload of an algorithm
is directly related to the size of the problem dataset. Because communication times are
concerned with bytes and (as will be subsequently shown) computation times revolve
around the number of operations, a common term is necessary to express this relationship.
The element is meant to be the basic building block which governs both communication
and computation. For example, an element could be a value in an array to be sorted,
an atom in a molecular dynamics simulation, or a single character in a string-matching
algorithm. In each of these cases, some number of bytes will be required to represent that
element and some number of calculations will be necessary to complete all computations
involving that element. The difficulty is establishing what subset of the data should
constitute an element for a particular algorithm. Often an application must be analyzed in
several separate stages, since each portion of the algorithm could interpret the input data
in a different scope.
Estimating the computational component, as given in Equation (3-4), of the RC
execution time is more complicated than communication due to the conversion factors.
Whereas the number of bytes per element is ultimately a fixed, user-defined value, the
number of operations (i.e. computations) per element must be manually measured
from the algorithm structure. Generally, the number of operations will be a function
network-level analysis, which then combine in the RATSS stage-level model. Based on
LogP and its derivatives (e.g., LogGP and the RAT communication model, to an extent),
the network model consists of parameters for the latency (i.e., physical interconnect
delay), L; overhead, o; message gap, g; number of "processors," P; and message size, k.
These attributes are analogous to the general terms from Equation 4-1 in that L and o
determine delay and P and k determine data quantity. The primary difference
between LogP, LogGP, PLogP, and the RAT communication models is the gap parameter,
which is defined by the expected behavior of the particular network. Approximations
of the short-message gap, g, and long-message gap, G, of LogGP are often sufficient for
microprocessor networks such as Ethernet. However, PLogP and RAT define the gap as a
function of the message size, g(k).
Determining message size is a key issue for accurate performance prediction. The
general assumption is that each node, P, will contribute k bytes of data for the network
transaction. However, the message gap, g(k), is highly dependent not only on the volume
of data per node but also any subdivision of that data into multiple smaller transfers.
Typically, a large message will have less performance overhead than several smaller
messages. Ensuring the gap attribute accurately reflects the performance of the actual
message segmentation size will reduce modeling errors.
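The segmentation effect described above can be illustrated with a LogGP-style sketch, where a single k-byte message costs L + 2o + (k - 1)G. The parameter values below are invented for illustration, not measurements:

```python
# Sketch of why splitting one transfer into many small messages inflates
# communication time in a LogGP-style model: each message pays the latency
# and overhead terms, so per-message costs multiply with the segment count.

def transfer_time(k_bytes, L, o, G):
    """LogGP-style time for one k-byte message: L + 2o + (k - 1) * G."""
    return L + 2 * o + (k_bytes - 1) * G

def segmented_time(total_bytes, segment_bytes, L, o, G):
    """Same payload split into equal segments, each paying message overhead."""
    n_messages = total_bytes // segment_bytes
    return n_messages * transfer_time(segment_bytes, L, o, G)

# Illustrative parameters: 10 us latency, 2 us overhead, 1 ns/byte gap.
one_big = transfer_time(1_000_000, L=1e-5, o=2e-6, G=1e-9)
many_small = segmented_time(1_000_000, 1_000, L=1e-5, o=2e-6, G=1e-9)
# many_small > one_big: the fixed per-message costs dominate small segments.
```

This is why the text stresses matching the gap attribute g(k) to the actual segmentation size the platform uses.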
Although specific communication transactions (e.g., MPICH2 implementation of an
MPI scatter over Ethernet) have detailed, potentially application-specific performance, the
network model introduces general equations for the two types of network communication
used in the case studies. Equation 4-4 illustrates the individual set of attributes,
Stransaction, for multi-node network transactions such as InfiniBand or Ethernet, which
includes the L, o, g, G, P, and k attributes along with a γ cost value for any additional
required computations (e.g., reduce operations). Equation 4-5 defines the performance of
the communication transaction, ttransaction, by the delay as a function, fdelay, of the latency
and overhead attributes; the volume of data as a function, fquantity, of the message size and
example, 18-bit fixed point may be used in Xilinx FPGAs since it maximizes usage of
single 18-bit embedded multipliers. As with parallel decomposition, numerical formulation
is ultimately the decision of the application designer. RAT provides a fast and consistent
procedure for evaluating these design choices.
3.2.3 Resources
By measuring resource utilization, RAT seeks to determine the scalability of an
application design. Specifically, most FPGA designs will be limited in size by the
availability of three common resources: on-chip memory, hardcore functional units (e.g.
fixed multipliers), and basic logic elements (i.e. look-up tables and flip-flops).
On-chip RAM is readily measurable since some quantity of the memory will likely be
used for I/O buffers of a known size. Additionally, any intermediate buffering and storage
must be considered. Vendor-provided wrappers for interfacing designs to FPGA platforms
can also consume a small number of memories, but the quantity is generally constant
and independent of the application design.
Although the types of dedicated functional units included in FPGAs can vary greatly,
the hardware multiplier is a nearly universal component. The demand for dedicated
multiplier resources is mitigated by the availability of specialized chips (e.g. Xilinx
Virtex-4 and -5 SX series) with extra multipliers versus other comparably sized FPGAs.
Computing the necessary number of hardware multipliers is dependent on the type
and amount of computation required. Notably, dividers, square roots, and
floating-point units use hardware multipliers for their execution. Varying levels of pipelining
and other design choices can increase or decrease the overall demand for these resources.
With sufficient design planning, an accurate measure of resource utilization can be taken
for a design given knowledge of the architecture of the basic computational kernels.
Summarizing basic logic elements is the most common resource metric. High-level
designs do not empirically translate into any specific resource count. Qualitative
assertions about the demand for logic elements can be made based upon approximate
quantity of data. Sections 4.3.2.1 and 4.3.2.2 expound on this general performance
equation for the node and network models, respectively.

performance = delay + quantity / throughput    (4-1)
Sections 4.3.2.3 and 4.3.2.4 discuss the hierarchical aggregation of the individual
node and network components to model the performance of the individual algorithm
stages and subsequently the total application. The synchronous, iterative behavior
described in the node and network defines the computation and communication scheduling
for each algorithm stage and the overlap of these stages defines the total application
performance. The RATSS model uses this hierarchical approach to aggregate the
individual performance estimates for the components into a single, quantitative prediction
for the application.
4.3.2.1 Compute Node Model
The goal of the node model is to estimate the performance of each computational
task based on the user-provided platform and application attributes. Depending on the
application requirements, each of the devices performing a given algorithm task may
have different computational demands, each requiring a separate node-level analysis.
Similarly, the application will likely have different computational loads for each task (i.e.,
stage) of the algorithm. Again, the node model is used to describe each unique portion
of computation and the individual performance estimates are combined in the RATSS
stage-level model.
As summarized in Equation 4-2, each node at each stage of execution can have
a unique set of attribute values, Sfpga, which includes the pipeline latency, PLfpga;
number of data elements, Nfpga_elements; number of computational operations per element,
Nops/element; FPGA clock frequency, Fclock; and computation throughput, Rfpga, for the
[14] E. Grobelny, D. Bueno, I. Troxel, A. George, and J. Vetter, "FASE: A framework
for scalable performance prediction of HPC systems and applications," Simulation:
Transactions of The Society for Modeling and Simulation International, vol. 83, no.
10, pp. 721-745, October 2007.
[15] K. K. Bondalapati, Modeling and Mapping for Dynamically Reconfigurable Hybrid
Architectures, Ph.D. dissertation, University of Southern California, Los Angeles, CA,
August 2001.
[16] R. Enzler, C. Plessl, and M. Platzner, "System-level performance evaluation of
reconfigurable processors," Microprocessors and Microsystems, vol. 29, no. 2-3, pp.
63-75, April 2005.
[17] W. Fu and K. Compton, "A simulation platform for reconfigurable computing
research," in Int'l Conf. Field P,.. 'i,;I,,I,, l', Logic and Applications (FPL), August
2006, pp. 1-7.
[18] C. Steffen, "Parameterization of algorithms and FPGA accelerators to predict
performance," in Reconrfii.1it,'- System Summer Institute (RSSI), Urbana, IL, Jul
17-20 2007.
[19] H. Quinn, M. Leeser, and L. S. King, "Dynamo: A runtime partitioning system for
FPGA-based HW/SW image processing systems," J. Real-Time Image Processing,
vol. 2, no. 4, pp. 179-190, 2007.
[20] M. C. Herbordt, T. VanCourt, Y. Gu, B. Sukhwani, A. Conti, J. Model, and
D. DiSabello, "Achieving high performance with FPGA-based computing," Computer,
vol. 40, no. 3, pp. 50-57, 2007.
[21] W. M. Fang and J. Rose, "Modeling routing demand for early-stage FPGA
architecture development," in Proc. ACM Symp. Field Programmable Gate Arrays
(FPGA), Monterey, CA, 2008, pp. 139-148.
[22] V. Manohararajah, G. R. Chiu, D. P. Singh, and S. D. Brown, "Difficulty of
predicting interconnect delay in a timing driven FPGA CAD flow," in Proc. ACM
Workshop System-level Interconnect Prediction (SLIP), Munich, Germany, 2006, pp.
3-8.
[23] M. Xu and F. Kurdahi, "Accurate prediction of quality metrics for logic level design
targeted towards lookup-table-based FPGA's," IEEE Trans. Very Large Scale
Integration (VLSI) Systems, vol. 7, no. 4, pp. 411-418, Dec. 1999.
[24] S. D. Brown, J. Rose, and Z. G. Vranesic, "A stochastic model to predict the
routability of field-programmable gate arrays," IEEE Trans. Computer-Aided Design
of Integrated Circuits and Systems, vol. 12, no. 12, pp. 1827-1838, Dec. 1993.
[Figure 3-2 timeline diagram: single-buffered execution serializes reads (R), computes
(C), and writes (W); double-buffered computation-bound execution hides communication
behind back-to-back computes; double-buffered communication-bound execution hides
computation behind back-to-back transfers. Legend: R = Read, W = Write, C = Compute]
Figure 3-2. Example Overlap Scenarios
is included for completeness of the RAT model, however the case studies focus on the SB
scenario.
The RAT analysis for computing tcomp primarily assumes one algorithm "functional
unit" operating on a single buffer's worth of transmitted information. The parameter Niter
is the number of iterations of communication and computation required to solve the entire
problem.

tRC_SB = Niter (tcomm + tcomp)    (3-5)
tRC_DB ≈ Niter max(tcomm, tcomp)    (3-6)
Assuming that the application design currently under analysis was based upon
available sequential software code, a baseline execution time, tsoft, is available for
comparison with the estimated FPGA execution time to predict the overall speedup.
As given in Equation (3-7), speedup is a function of the total application execution time,
not a single iteration.
speedup = tsoft / tRC    (3-7)
CHAPTER 2
BACKGROUND AND RELATED RESEARCH
The background and related research for this document is divided into two sections.
Section 2.1 provides an overview of FPGA technology and high-performance reconfigurable
computing. Section 2.2 summarizes related work for HPC performance modeling, FPGA
simulation and analytical modeling, and prediction focus.
2.1 FPGA Background
The field-programmable gate array (FPGA) is the primary device driving reconfigurable
computing. The overall goal of FPGAs is to provide the performance of an application-specific
integrated circuit (ASIC) with the flexibility and programmability of a microprocessor.
Applications for FPGAs are developed as hardware circuits constructed from logic
elements, fixed resources such as multiply accumulators or memories, and routing
elements. Traditionally, FPGA applications are developed in hardware description
languages such as VHDL or Verilog but high-level languages are emerging to bring FPGA
code to the level of languages such as C or Java. Figure 2-1 outlines the performance
spectrum of computing devices and the general structure of FPGAs.
[Figure 2-1 diagram: a performance spectrum ranging from general-purpose processors
through reconfigurable processors (FPGAs) to ASICs, alongside the FPGA fabric
structure of 4-input LUTs and logic blocks linked by connect boxes and switch boxes]
Figure 2-1. Performance Characterization and General Structure of FPGA Devices
Reconfigurable computing systems often incorporate FPGAs to provide maximum
performance (e.g., speed, power, cost) versus comparable microprocessor-only solutions.
This research is applicable to both high-performance computing (HPC) and high-performance
pipelining allowed computational throughputs to be accurately projected even though
the high-level parallel algorithms were not yet mapped to hardware. The total RC
execution time had a somewhat higher error for the case studies than the
individual communication and computation components. Large system overhead versus
short execution time was the main cause. Overall, the methodology performed well for
the diverse collection of algorithm complexities, hardware languages, FPGA platforms,
and total execution times. RAT was designed to handle these issues in single CPU and
FPGA systems where communication and computation are governed by the number of
data elements.
specific algorithm task. Adapted from RAT, the computation time, Equation 4-3, is
analogous to Equation 4-1 where the pipeline latency is the delay term, the number
of data elements and number of operations per element are the quantity, and the
computation throughput and clock frequency define the effective throughput.
Sfpga = { PLfpga, Nfpga_elements, Nops/element, Fclock, Rfpga }    (4-2)

Sfpga: set of attribute values for a specific computation unit
PLfpga: pipeline latency of the computation (cycles)
Nfpga_elements: number of computation elements (elements)
Nops/element: number of operations per element (ops/element)
Fclock: FPGA clock frequency (MHz)
Rfpga: computation throughput (ops/cycle)

tfpga(Sfpga) = PLfpga / Fclock + (Nfpga_elements × Nops/element) / (Fclock × Rfpga)    (4-3)

tfpga: execution time for the FPGA compute node (s)
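Equation (4-3) can be evaluated directly with the molecular dynamics attributes from Table 4-10 (PL = 0 cycles, 8,192 elements, 32,767 ops/element, 100 MHz, 1 op/cycle). The function name is illustrative:

```python
# Sketch of the node-model computation time, Equation (4-3):
# t_fpga = PL / F_clock + (N_elements * N_ops_per_element) / (F_clock * R_fpga)

def t_fpga(pl_cycles, n_elements, ops_per_element, f_clock_hz, ops_per_cycle):
    """Node-level FPGA execution time in seconds."""
    delay = pl_cycles / f_clock_hz
    work = n_elements * ops_per_element
    return delay + work / (f_clock_hz * ops_per_cycle)

# Molecular dynamics attributes from Table 4-10.
t_md = t_fpga(pl_cycles=0, n_elements=8192, ops_per_element=32767,
              f_clock_hz=100e6, ops_per_cycle=1)
```

With one interaction per cycle, the 8,192 × 32,767 operations take roughly 2.68 s at 100 MHz, which is consistent with the text's observation that the deterministic FPGA computation dominates the MD execution time.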
Microprocessor nodes can also impact FPGA application performance with computation
coinciding with FPGA execution (from Figure 4-2). The execution time for a microprocessor,
tp, is defined by the software time, which must be measured from legacy code or
estimated using a traditional model. Note that this microprocessor performance attribute
is only intended for application-related computation occurring in parallel with FPGA
execution. FPGA setup, configuration, and other software-involved overheads are
considered in the stage-level model.
4.3.2.2 Network Model
The goal of the network model is to estimate the performance of a communication
transaction based on the provided platform and application attributes. Analogous
to the node model, each unique communication transaction will require a separate
via a 133MHz PCI-X bus, which has a theoretical maximum bandwidth of 1GB/s. The α
parameters were computed using a microbenchmark consisting of a read and write for data
sizes comparable to those used by the 1-D PDF algorithm. The read and write
times were measured, combined with the transfer size to determine the actual communication
rates, and finally used to calculate the α parameters by dividing by the theoretical
maximum. The α parameters for the target FPGA platform are low due to the communication
protocols and middleware used by Nallatech atop PCI-X and the high latencies associated
with the small 2KB (512 × 4B) transfers.
The computation parameters are the more challenging portion of RAT analysis,
but are still tractable given the deterministic behavior of PDF estimation. As
mentioned previously, each element that comes into the PDF estimator is evaluated against
each of the 256 bins. Each computation requires 3 operations: comparison (subtraction),
multiplication, and addition. Therefore, the number of operations per element totals 768
(i.e., 256 × 3). This particular pipeline structure has 8 pipelines that each perform 3
operations per cycle for a total of 24. However, this value is conservatively rounded down
to 20 to account for implementation details such as pipeline latency and other
overheads. This conservative parameter was selected prior to the algorithm coding and
has not (nor has any parameter for any case study) been adjusted to fit results
created from runtime data. Pre-implementation adjustments to the RAT parameters, such
as reducing the throughput value, are not required but are sometimes useful to create
more optimistic or pessimistic predictions and to account for application- or platform-specific
behaviors not modeled by RAT. Similarly, a range of throughput values could be examined
to explore the effect on performance when the implementation is better or worse than
expected. However, this case study focuses on a single value for each parameter to
validate the RAT model.
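The parameter derivation above reduces to a few lines of arithmetic; this sketch simply restates the counts from the text, with the conservative rounding applied as a literal value.

```python
N_BINS = 256
OPS_PER_BIN = 3      # comparison (subtraction), multiplication, addition
N_PIPELINES = 8

n_ops_per_element = N_BINS * OPS_PER_BIN        # 768 ops/element
raw_throughput = N_PIPELINES * OPS_PER_BIN      # 24 ops/cycle, ideal
conservative_throughput = 20                    # rounded down for pipeline
                                                # latency and other overheads
```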
While the previous parameters could be reasonably estimated given the deterministic
structure of the algorithm, a priori estimation of the required clock frequency is
CHAPTER 6
CONCLUSIONS
The promise of reconfigurable computing for achieving speedup and power savings
versus traditional computing paradigms is expanding interest in FPGAs. Among the
challenges for improving development of parallel algorithms for FPGAs, the lack of
methods for strategic design is a key obstacle to efficient usage of RC devices. Better
formulation methodologies are needed to explore algorithms, architectures, and mapping
to reduce FPGA development time and cost. Consequently, RAT is created as a simple
and effective methodology for investigating the performance potential of the mapping
of a given parallel algorithm for a given FPGA platform architecture. The methodology
employs an analytical model to analyze FPGA designs prior to actual development.
RAT is scoped and automated to provide maximum efficiency and reliability while
retaining reasonable prediction accuracy. The performance prediction is meant to work
with empirical knowledge of RC devices to create more efficient and effective means for
design-space exploration.
For the first phase of research, the RAT methodology defined the core analytical
model for performance estimation during formulation. The extensible RAT model
was specifically scoped for usage with deterministic applications on common, albeit
single-FPGA, platforms. Five case studies (1-D PDF, 2-D PDF, LIDAR, TSP, and MD)
validated the accuracy of decomposing complex system behavior into key communication
microbenchmarks and computation parameters for RAT modeling. Detailed microbenchmarking
allowed for a low average error, with some individual errors lower still, for the
communication times of the case studies. For the deterministic case studies (i.e., all
except MD), computation error peaked at 17%. The total RC execution time also had a
low average error for the case studies, which helped validate the RAT methodology of
rapid and reasonably accurate prediction.
Figure 5-4. Framework bridging modeling environments and performance prediction, and
orchestrating DSE
(KPN), a specialized AMP MoC, to describe the system concurrency and individual
component behavior. The RC Modeling Language (RCML) [61] provides hierarchical
models for the algorithm, architecture, and total application mapping, with specialized
constructs to express deep parallelism (e.g., pipelines), typically for AMP and SDF
MoCs. With suitable abstraction, any of these models could provide effective application
specification for RAT performance prediction. The proposed DSE tool in Section 5.4 uses
RCML because of its RC-oriented focus.
5.3 Integrated Framework
This section describes the methodology of the framework for connecting RAT
performance prediction with modeling environments for increased productivity during
strategic DSE. (Section 5.4 discusses the DSE tool bridging RAT with the RCML
modeling environment.) Figure 5-4 provides an overview of the framework structure.
The proposed methodology provides translation of specification information from the
modeling environment to RAT and orchestration of DSE based on revisions to the
specification information. The modeling environment and RAT performance prediction
components are the existing methods and tools from Section 5.2, as indicated by the
Table 4-5. Modeling Error for 2-D PDF Estimation (Nallatech, XC4VLX100, 195MHz)
                  Predicted (s)  Experimental (s)  Error
2 FPGAs  tcomp    1.41E+2        1.56E+2           -9.6%
         tcomm    1.35E+1        1.51E+1           -10.7%
         total    1.54E+2        1.71E+2           -9.7%
         Speedup  146            132               10.6%
4 FPGAs  tcomp    7.05E+1        7.84E+1           -10.1%
         tcomm    9.31E+0        9.93E+0           -6.2%
         total    7.98E+1        8.84E+1           -9.7%
         Speedup  283            255               11.0%
8 FPGAs  tcomp    3.52E+1        3.95E+1           -10.9%
         tcomm    7.25E+0        7.70E+0           -5.9%
         total    4.24E+1        4.72E+1           -10.1%
         Speedup  532            478               11.3%
for a range of clock frequency values, though only the results of the 195MHz estimation
are shown. Major revisions to the target algorithm or platform architecture during
implementation can significantly alter the application performance affecting the validity
of the prediction. Thus, RATSS can be used iteratively throughout the design process,
recomputing predictions whenever significant revisions are considered or become necessary
to ensure the subsequent implementation will still meet performance requirements and
thereby prevent further reductions in productivity. However, such modifications to the
application structure were not necessary for the 2-D PDF estimation case study.
In Table 4-5, the results of the performance prediction for the 2-D PDF estimation
case study are compared against a subsequent implementation of the target algorithm.
The node and network models underestimated the actual implementation times and
subsequently overestimated the total application speedup over the software baseline. The
node times represented the majority of the execution time (over 90% of the physical
implementation), thereby having the greatest impact on prediction accuracy. The
prediction errors for the 2, 4, and 8 FPGA configurations are under 11%, which is
considered reasonably accurate given the focus on high-level design-space exploration
prior to implementation. Most of the discrepancy is due to additional cycles of overhead
related to data movement during the FPGA computation. In contrast to the node
CHAPTER 3
ANALYTICAL MODEL FOR FPGA PERFORMANCE
ESTIMATION PRIOR TO DESIGN (PHASE 1)
The first research phase outlines FPGA performance estimation using the RAT
analytical model prior to hardware implementation. This chapter presents a brief
introduction on the research challenges for a formulation-time, extensible model
(Section 3.1), a detailed analysis of the prediction methodology (Section 3.2), a complete
walkthrough of performance estimation for a real scientific application (Section 3.3), four
additional case studies as further validation of RAT (Section 3.4), and conclusions (Section
3.5).
3.1 Introduction
In this chapter, research challenges with constructing a performance prediction
model to support efficient design-space exploration are investigated. Potential algorithms,
architectures, and system mappings must be investigated prior to implementation (to
reduce development cost) and predictions are limited to evaluation of a specific algorithm
targeting a specific platform (to avoid vague generalities). The RAT methodology is
presented as a technique to address the challenges of formulation-level performance
prediction. Five case studies are presented to validate the RAT performance model and
methodology.
3.2 RC Amenability Test
Figure 3-1 illustrates the basic methodology behind the RC amenability test. These
throughput, numerical precision, and resource tests serve as a basis for determining the
viability of an algorithm design on the FPGA platform prior to any FPGA programming.
Again, RAT is intended to address the performance of a specific high-level parallel
algorithm mapped to a particular FPGA platform, not a generic application. The results
of the RAT tests must be compared against the designer's requirements to evaluate the
success of the design. Though the throughput analysis is considered the most important
step, the three tests are not necessarily used as a single, sequential procedure. Often,
Figure 4-7. Platform Structure for Image Filtering Case Study
by the interconnect controller. Network transactions are still described using collective
communication terminology but the physical operations are independently initiated
point-to-point messages. Additionally, case study implementations are written in Carte C.
Common to both case studies, Table 4-6 defines the network attributes for the SNAP
interconnect, which are measured from microbenchmarks. Similar to the Nallatech system,
latency, L, is the transmission time of a single-word transfer, which is dominated by the
network delay. However, overhead, o, the time between successive messages, is a function
of the number of nodes, P, and message size, k, not a constant parameter due to the
decentralized communication requests. More extensive microbenchmarking can determine
approximate overhead values although variability in message ordering inhibits detailed
analysis. For these case studies, the effect of overhead is considered negligible and not part
of the network model. The short-message gap, g, is measured as the transmission time
for short messages minus latency. No short messages are used in these case studies but
the value is included for completeness. The long-message gap, G, is one 8-byte word every
10ns clock cycle based on the fixed 100MHz clock.
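Under the stated assumptions (overhead treated as negligible, G fixed at one 8-byte word per 10ns cycle), the resulting network model for long messages can be sketched as follows; the latency argument would come from the SNAP microbenchmarks, and the value used in the usage line is hypothetical.

```python
WORD_BYTES = 8
G_SECONDS = 10e-9   # long-message gap: one 8-byte word per 10 ns (100 MHz) cycle

def long_message_time(latency_s, message_bytes):
    """LogGP-style transfer time for a long message on the SNAP interconnect:
    single-word latency plus one long-message gap G per 8-byte word.
    Overhead o is omitted, as in these case studies."""
    words = -(-message_bytes // WORD_BYTES)   # ceiling division
    return latency_s + words * G_SECONDS

# e.g., a 2 MB transfer with a hypothetical 5 us latency:
t = long_message_time(5e-6, 2 * 1024 * 1024)
```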
2.2 Related Research
Productive application development for FPGA-based systems is key for wider
deployment and usage of FPGAs. One challenge to FPGA productivity is efficient
generation of more abstract FPGA design codes. Raising the design focus from traditional
hardware description languages, high-level languages such as Impulse C [3], Carte C
[4], Mitrion C [5], and Handel C [6] provide a software-like infrastructure for a more
efficient and familiar programming model for FPGA applications. Similarly, research in
hardware/software codesign enables a faster bridge between application specification and
hardware implementation and a brief history of this research trend can be found in [7].
However, faster implementation does not address the underlying need for application
planning and evaluation prior to significant commitments of time and money.
Efficient performance modeling for algorithms and systems is an ongoing area
of research for traditional parallel computing. The Parallel Random Access Machine
(PRAM) [8] attempts to model the critical (and hopefully small) set of algorithm
and platform attributes necessary to achieve a better understanding of the greater
computational interaction and ultimately the application performance. The Bulk
Synchronous Parallel (BSP) [9] model extends PRAM concepts, which includes support for
communication and its interaction (i.e., overlap) with computation. The LogP model [10]
(one successor to PRAM) abstracts the application performance based on the latency (i.e.,
wire delay), L; overhead, o; message gap (i.e., minimum time between messages), g; and
number of processors, P. LogGP [11] and additional revisions such as parameterized LogP
(PlogP) [12] provide further modeling fidelity to LogP by addressing specific issues such
as bandwidth constraints for long messages and dynamic performance characterization,
respectively. However, these concepts are not limited to systems of general-purpose
processors. The evolution towards heterogeneous many-core devices necessitates increased
modeling research and usage due to rising development time and cost. In particular, RAT
seeks to leverage these ideas of maximizing model flexibility (through parameterization)
Figure 5-11. Architecture specification of Novo-G system (16 nodes connected by Gigabit Ethernet)
Additional analysis tools could be constructed to identify the highest-performing design(s)
based on criteria such as fewest revisions from the original application specification, but
such features are outside the scope of this dissertation.
5.5 Integrated RATSS (MNW)
The case studies (and associated FPGA platform) from Section 5.4 only required
RAT-level analysis due to the small system size and single-level point-to-point communication.
However, the proposed framework can also be used with RATSS, which extends the
RAT tool based on the methodologies for analytical modeling of scalable systems from
Chapter 4. Abstract application specification using modeling environments and MoCs
remains unchanged except for potential mappings to larger, scalable systems with
more computation resources and communication interconnects. The methodologies
for translation and orchestration of the synchronous iterative model for performance
estimation are also maintained with RATSS, which provides models for the added system-level
communication. For this case study, the parallelism of MNW is extended to multiple
nodes of an FPGA-augmented cluster. Figure 5-11 describes the architecture of the
system, referred to as Novo-G, where each node of the cluster is the platform described
in Figure 5-6. The nodes are connected by Gigabit Ethernet, which is modeled using the
RATSS tool.
Figure 3-4. Parallel Algorithm and Mapping for 1-D PDF (sequential accesses from FPGA
memory feed eight parallel pipelines in the PDF kernel, each covering a 32-bin subset of
bins 0-255)
sample at every discrete probability level. For simplicity, each discrete probability level is
subsequently referred to as a bin.
In order to better understand the assumptions and choices made during the RAT
analysis, the chosen algorithm for PDF estimation is highlighted in Figure 3-4. A total
of 204,800 data samples are processed in batches of 512 elements against 256 bins. Eight
separate pipelines are created to process data samples with respect to a particular subset
of bins. Each data sample is an element with respect to the RAT analysis. The data
elements are fed into the parallel pipelines sequentially. Each pipeline unit can process
one element with respect to one bin per cycle. Internal registering for each bin keeps a
running total of the impact of all processed elements. These cumulative totals comprise
the final estimation of the PDF function.
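The algorithm's structure can be sketched in software as below. The per-bin `contribution` function is a hypothetical stand-in for the actual subtract/multiply/add kernel, which is not reproduced here; the constants restate the counts from the text.

```python
N_BINS = 256   # discrete probability levels (bins)
BATCH = 512    # elements processed per iteration in the case study

def pdf_update(bins, samples, contribution):
    """Structural sketch of the 1-D PDF estimator: every incoming sample is
    evaluated against every bin, and each bin keeps a running total across
    all processed elements.  `contribution(sample, bin_index)` stands in for
    the actual 3-operation kernel."""
    for s in samples:
        for b in range(N_BINS):
            bins[b] += contribution(s, b)
    return bins
```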
3.3.2 RAT Input Parameters
Table 3-1 provides a list of all the input parameters necessary to perform a RAT
analysis. The parameters are sorted into four distinct categories, each referring to a
particular portion of the throughput analysis. Note that Nelements is listed under a
separate category when it is used by both communication and computation. It is assumed
that the number of elements dictating the computation volume is also the number of
Figure 4-3. Example Timing Diagram Illustrating Stage- and Application-level Scheduling
of Computation and Communication (three stages of interleaved communication and
computation repeat across application iterations)
Nappiterations, of either the sum or maximum (longest) of the stage times. Again, this
model generalizes high-level iterative behavior. Applications can contain repetitive but
irregular behavior (e.g., 3 iterations of stage one, 7 iterations of stage two, 5 iterations of
stage three, etc.), which is simple to calculate but not explicitly considered by the model.
Sstage = {tstage1, ..., tstages} (4-15)
Sstage: set of stage times for the application
s: total number of stages for the application
tapplication = Nappiterations × Sum(Sstage)  or  Nappiterations × Max(Sstage) (4-16)
tapplication: total execution time of the application
Nappiterations: number of application-level iterations
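Equations 4-15 and 4-16 can be sketched as follows; the `overlapped` flag (my naming, not a RATSS term) selects between the sum for sequentially executed stages and the maximum for fully overlapped stages, as described above.

```python
def t_application(n_app_iterations, stage_times, overlapped=False):
    """Equations 4-15 and 4-16: total application time is the number of
    application-level iterations times either the sum of the stage times
    (sequential stages) or the maximum stage time (overlapped stages)."""
    per_iteration = max(stage_times) if overlapped else sum(stage_times)
    return n_app_iterations * per_iteration
```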
4.4 Detailed Walkthrough: 2-D PDF Estimation
This section presents a detailed walkthrough of RATSS performance prediction with
a reasonably complex case study, 2-D PDF estimation. The intended algorithm and
platform structure along with the feature characterization and performance calculations
for the node, network, stage, and application models are discussed. The results of the
Figure 5-3. Y-chart approach to application specification in modeling environments [54]
(architecture and algorithm models joined by a mapping, exported via an online API or
standard file format)
The RAT tool maintains the synchronous iterative (i.e., multiphase) performance
model, a subset of fork-join models with each hardware resource (e.g., microprocessor
or FPGA) performing an independent portion of the application computation each
iteration with synchronizing communication separating every iteration from preceding
and succeeding iterations [1, 55]. Figure 5-2 outlines the general structure of the RAT
tool. RAT assumes application execution time is defined by the summation of the
slowest computation and communication each iteration. The underlying computation
and communication models for RAT describe the potentially complex and typically
data-oriented behaviors within each iteration using a few key quantitative attributes.
Computation is defined by total number of application-specific operations and their rate
of execution based on usage of the algorithm's deep or wide parallelism by the hardware
resources. Alternatively, RC computation can often be described by the number of data
elements to be processed and the rate of completion (i.e., cycles per element). RAT uses
an extension of the Hockney model [56] to describe I/O communication and LogGP for
system-level communication.
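As a sketch of the Hockney-style I/O model referenced above, the simplified form below assumes a single startup latency plus an asymptotic-bandwidth term, rather than RAT's full extension of the model; the parameter values in the usage line are hypothetical.

```python
def hockney_time(n_bytes, latency_s, asymptotic_bw_bytes_per_s):
    """Hockney-style communication estimate: startup latency plus message
    size divided by asymptotic bandwidth."""
    return latency_s + n_bytes / asymptotic_bw_bytes_per_s

# e.g., a 1 MB transfer over a hypothetical 100 MB/s channel with 10 us latency:
t = hockney_time(1_000_000, 10e-6, 100e6)
```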
5.2.2 Modeling Environments
Modeling environments provide abstract yet reasonably precise descriptions of
application structure and behavior. An application specification consists of models (i.e.,
descriptions) of the underlying algorithm and RC platform architecture along with their
respective mapping. Algorithm and architecture models are often specified separately
Table 4-4. System Model Attributes for 2-D PDF Estimation
Attribute                 Units         2 Nodes   4 Nodes   8 Nodes
Nstageiterations          (iterations)  1         1         1
Scomp                     (s)           1.41E+2   7.05E+1   3.52E+1
Scomm   scatter X         (s)           1.28E+0   1.92E+0   2.25E+0
        scatter Y         (s)           1.28E+0   1.92E+0   2.25E+0
        write X           (s)           4.07E-1   2.03E-1   1.02E-1
        write Y           (s)           4.07E-1   2.03E-1   1.02E-1
        read              (s)           1.01E+1   5.05E+0   2.52E+0
        gather            (s)           3.89E-3   7.78E-3   1.17E-2
tcomp                     (s)           1.41E+2   7.05E+1   3.52E+1
tcomm                     (s)           1.35E+1   9.31E+0   7.25E+0
tstage                    (s)           1.54E+2   7.98E+1   4.24E+1
summation of the number of iterations, Nstage iterations, of I/O communication and FPGA
computation for the node, tcomm and tcomp (Equation 4-24). Single buffering maximizes
the available memory bandwidth to the computation units. The computation time, as
shown in Table 4-4, dominates the total application execution time and would not greatly
benefit from double buffering.
tstage = Nstageiterations × (tcomm + tcomp) (4-24)
The model output is summarized in the third block of Table 4-4. As the number
of nodes doubles, the node time reduces by half but the network time increases slightly.
Consequently, the total execution time decreases by slightly less than half as the
number of nodes doubles. This trend of nearly linear performance improvement with
increasing platform size is reasonable given the embarrassingly parallel nature of the
computation and relatively low impact of the PCI-X and Ethernet communication.
4.4.5 Results and Verification
As previously discussed, the performance prediction is calculated prior to low-level
design and was not adjusted based on implementation details. Predictions were generated
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
IMPROVING FPGA APPLICATION DEVELOPMENT VIA STRATEGIC
EXPLORATION, PERFORMANCE PREDICTION, AND TRADEOFF ANALYSIS
By
Brian M. Holland
August 2010
Chair: Alan D. George
Major: Electrical and Computer Engineering
FPGAs continue to demonstrate impressive benefits in terms of performance per Watt
across a wide range of applications. However, the design time and technical complexities
of FPGAs have made application development expensive, particularly as the number
of project revisions increases. Consequently, it is important to engage in systematic
formulation of application designs via strategic exploration, performance prediction, and
tradeoff analysis before undergoing lengthy development cycles. Unfortunately, almost all
existing simulative and analytic models for FPGAs target existing applications to provide
detailed, low-level analysis. This document explores methods, challenges, and solutions
concerning performance prediction scope and complexity, validation and verification,
applicability to small and large-scale FPGA systems, efficiency, and automation. The
RC Amenability Test (RAT) is proposed as a high-level methodology to address these
challenges and provide a necessary design evaluation mechanism currently lacking in
FPGA application development. RAT is composed of an extensible analytic model for
single and multi-FPGA platforms harnessing a modeling infrastructure, the RC Modeling
Language (RCML), to provide a breadth of features allowing FPGA designers to more
efficiently and automatically explore and evaluate algorithm, platform, and system
mapping choices.
Figure 5-2. Performance prediction using RAT (computation prediction via pipeline and
state-machine models and communication prediction via the Hockney model feed a
synchronous iterative schedule)
of the DSE tool includes the connectivity and usage of existing tools for application
specification and RAT prediction by the newly constructed components for translation and
orchestration.
5.2 Background and Related Research
The proposed framework leverages existing research in RAT performance prediction
and modeling environments to facilitate application specification and analysis for strategic
DSE for RC. RAT uses separate computation and communication models, based on
underlying assumptions about the application structure and behavior, which aggregate
into complete predictions of application performance. Several modeling environments provide methods
and tools with similar approaches for abstracting algorithm and platform architecture
specifications, albeit with differing levels of implementation detail.
5.2.1 RAT Performance Prediction
For this phase of research, a prediction tool is constructed from the RAT methodology
that includes an API to provide the necessary prediction input and gather the resulting
performance estimation. The general methodology of separate computation and
communication modeling for RAT is common in prediction techniques, though research
directed towards strategic RC analysis is not as expansive as compared to modeling
environments. RAT is included within the framework due to its fairly unique focus of
strategic prediction prior to implementation. Encapsulation of RAT for replacement with
other synchronous iterative performance models is possible, but outside the scope of this
dissertation.
elements that are input to the application (although the effective bit-widths may differ
due to the fixed width of the communication channel). While applications can exhibit
unusual computational trends or require significant amounts of additional data (e.g.
constants, seed values, or lookup tables), these instances may be considered uncommon.
Alterations can be made to account for uncorrelated communication and computation but
such examples are not included in this document.
Table 3-1. Input parameters for RAT
Dataset Parameters
  Nelements, input (elements)
  Nelements, output (elements)
  Nbytes/element (bytes/element)
Communication Parameters
  throughputideal (MB/s)
  αwrite: 0 < αwrite < 1
  αread: 0 < αread < 1
Computation Parameters
  Nops/element (ops/element)
  throughputproc (ops/cycle)
  fclock (MHz)
Software Parameters
  tsoft (sec)
  Niter (iterations)
Table 3-2 summarizes the input parameters for RAT analysis of the specified
algorithm for 1-D PDF estimation. The dataset parameters are generally the first
values supplied by the user, since the number of elements will ultimately govern the
entire algorithm performance. Though the entire application involves 204,800 data
samples, each iteration of the 1-D PDF estimation will involve only a portion, 512 data
samples, or 1/400 of the total set. This algorithm effectively consumes all of the input
values. Only one cumulative value is left after each iteration per bin but these results are
retained on the FPGA. Values are only transferred back to the host after computation
for all iterations is complete. The final output communication must be represented as
Table 3-15. Performance parameters of MD (XtremeData)
                Predicted  Predicted  Predicted  Actual
fclk (MHz)      75         100        150        100
tcomm (sec)     8.77E-4    8.77E-4    8.77E-4    1.39E-3
tcomp (sec)     7.17E-1    5.37E-1    3.58E-1    8.79E-1
utilcommSB      0.1%       0.2%       0.2%       0.2%
utilcompSB      99.9%      99.8%      99.8%      99.8%
tRCsB (sec)     7.19E-1    5.38E-1    3.59E-1    8.80E-1
speedup         8.0        10.7       16.0       6.6
Table 3-16. Resource usage of MD (EP2S180)
FPGA Resource   Utilization (%)
BRAMs           24
9-bit DSPs      100
ALUTs           73
The interconnect parameters model an XtremeData XD1000 platform containing an
Altera Stratix-II EP2S180 user FPGA connected to an Opteron processor over the
HyperTransport fabric. The theoretical interconnect throughput is 1.6GB/s but only
a fraction of the channel can be used for transferring data to the on-board SRAM as
needed for the algorithm. The number of operations per element, approximately 16,400
interactions per molecule times 10 operations each, is estimated due to the length of the
pipeline and data-driven behavior. Unlike the previous case studies, the computational
throughput cannot be reliably measured due to the complex and nondeterministic
algorithm structure. As discussed in Section 3.2.1, the number of operations per cycle
is treated as a "tuning parameter" to compute the throughput necessary to achieve the
desired speedup based on the estimate of Nops/element. Though 50 is the quantitative
value computed by the equations to achieve the desired overall speedup of approximately
10, this value serves qualitatively as an indicator that substantial data parallelism and
functional pipelining must be achieved in order to realize the desired speedup. The same
range of clock frequencies was used as in PDF estimation. The serial software baseline was
executed on a 2.2 GHz Opteron processor, the host processor of the XD1000 system. The
system-level modeling concepts form the basis for a proposed model for heterogeneous
clusters. Collective communication modeling and scheduling for node-heterogeneous
networks of workstations (NOWs) [47, 48] and clusters of clusters with hierarchical
networks [49] are further extensions to traditional system-level modeling. These
modifications for heterogeneous computing provide useful insight towards the proposed
RATSS model.
4.3 RATSS Model
This section provides a detailed discussion of the structure and contributions of the
RATSS model for fast and reasonably accurate performance prediction. This prediction
(and consequently design-space exploration) begins with the designer's specifications of
the FPGA platform and algorithm pairing for RATSS analysis. The FPGA platform
specification defines the performance capabilities of each component in the system,
specifically the computation and communication metrics such as latency, bandwidth, and
clock frequency. The algorithm specification defines the computation requirements of
every specific task and the resulting communication between devices, which depends on
the algorithm/platform mapping. Quantitative attributes are provided for every unique
computation and communication task in the FPGA system and these values feed the
component-level analytical equations. The RATSS model aggregates the individual
computation and communication predictions based on the system-level schedule defined
by the application specification and subsequently provides a quantitative performance
estimate for the platform/algorithm pairing. This prediction is used by the designer
for further design-space exploration, revising (and re-analyzing) as necessary until the
application meets their performance requirements.
Again, RATSS adapts existing computation and communication models to provide
a complete performance prediction for an FPGA application (i.e., a specific algorithm
mapped to a specific FPGA platform). RATSS performance prediction is based on
efficient quantitative characterization of the key attributes of this algorithm/platform
Deterministic: Algorithm tasks and data movement between tasks are predictable prior
to implementation, either as a constant or an average performance of typical data
sets.
Precise characterization of application task scheduling is insufficient for design-space
exploration if the underlying computation and communication times cannot be precisely
quantified. Randomness in computation and communication behavior requires quantification
of application characteristics as averages of expected behavior. The RATSS assumption for
deterministic behavior is reasonable as many applications targeting the FPGA paradigm
are SIMD-style algorithms implemented as pipelines.
Ultimately, synchronous, iterative, and deterministic behavior allows efficient
characterization of computation needs and communication requirements of the FPGA
application. Pipelined, SIMD-style algorithms involve data transformations, and both the
communication and computation are characterized by the quantity of data associated
with the particular platform/algorithm mapping. The computational demands of the
application are quantified by the number of operations per input data element and the
rate of execution (i.e., amount of deep and wide parallelism). Similarly, the attributes for
the communication requirements define the amount of data for each network transaction in
terms of bytes.
4.3.1.3 Model Usage
Again, RATSS involves quantifying the key attributes of the FPGA platform and
application for use in the underlying analytical models for performance prediction. These
quantitative characteristics are provided largely by the designer. Platform-intrinsic
attributes such as network latency and throughput are gathered from microbenchmarks
that specifically mirror algorithm operations, such as a DMA read and write. Ideally,
a database of microbenchmark results is referenced by the designer for the platform
attributes, else the benchmarks must be performed prior to any performance prediction.
Note that accurate microbenchmarking can be a nontrivial process, albeit with nonrecurring
tcomp = Max({tfpga_1, ..., tfpga_P})    (4-27)
tcomm = tbroadcast/scatter + tgather    (4-28)
tapplication = Σ_stages Niterations × (tcomp + tcomm)    (4-29)
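The scheduling model of Equations 4-27 through 4-29 can be sketched in a few lines. The numeric values below are hypothetical placeholders, not measurements from the case studies:

```python
# Sketch of the RATSS scheduling model (Equations 4-27 through 4-29).
# All numeric values below are hypothetical placeholders.

def t_comp(t_fpga):
    """Eq. 4-27: per-iteration computation time is bounded by the slowest FPGA."""
    return max(t_fpga)

def t_comm(t_scatter, t_gather):
    """Eq. 4-28: per-iteration communication is the scatter plus the gather."""
    return t_scatter + t_gather

def t_application(stages):
    """Eq. 4-29: sum over stages of iterations x (computation + communication)."""
    return sum(n_iter * (t_comp(t_fpga) + t_comm(ts, tg))
               for n_iter, t_fpga, ts, tg in stages)

# One stage, 400 iterations, four FPGAs (placeholder times in seconds)
stages = [(400, [1.3e-4, 1.3e-4, 1.4e-4, 1.3e-4], 1.5e-5, 1.0e-5)]
total = t_application(stages)
```

Because Equation 4-27 takes the maximum over the per-FPGA times, a single slow node bounds every iteration of the stage.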
Although these two additional case studies involve a different FPGA platform from
the 2-D PDF application, the sequential software baselines are measured from the same
3.2GHz Xeon microprocessor for consistency. While speedup is often an advantageous
performance metric, the specific speedup value must be compared with the problem
size and computation-to-communication ratio for the application. The image filtering
and molecular dynamics case studies illustrate communication- and computation-bound
problems, respectively, with correspondingly lower and higher speedups. Both scenarios
are modeled by RATSS with reasonable accuracy.
4.5.1 Image Filtering
The particular image filter used in this case study is a discrete 2-D convolution
of a 3x3 image segment (i.e., a pixel and its 8 neighbors) with a user-specified filter.
Example usages of this application include Sobel or Canny edge detection and high-,
low-, or band-pass filtering for noise reduction. Figure 4-8 provides an illustration of this
algorithm. The same 418x418 image is streamed (i.e., written) to the primary FPGA
of two nodes of the SRC-6 system. As part of the computation, the primary FPGAs
stream the image data to their respective secondary FPGAs. Each FPGA performs the
convolution of the image data with respect to different filter values. The resulting images
on the secondary FPGAs are streamed back to their respective primary FPGA which
DMAs the two new images from the node back to the network-attached microprocessor. A
more general overview of convolution for image filtering can be found in [53].
Table 4-7 summarizes the compute node attributes for the RATSS model. Because of
the double-precision operations, the overall pipeline will be fairly deep and too complex to
performance estimation are compared against a subsequent hardware implementation to
evaluate the accuracy of the RATSS model.
4.4.1 Algorithm and Platform Structure
The 2-D PDF estimation algorithm for this case study uses the Parzen window
technique [51], a generalized nonparametric approach to estimating PDFs in a d-dimensional
space. Despite the increased computational complexity versus traditional histograms,
the Parzen window technique is mathematically advantageous because of the rapid
convergence to a continuous function. This algorithm is amenable for FPGA acceleration
because of the high degree of computational parallelism and large computation effort
relative to the amount of data consumed (i.e., input) and produced (i.e., output). The
computational complexity of a d-dimensional PDF algorithm is O(Nn^d), where N is the
total number of samples of the random variable, n is the number of levels where the PDF
is estimated, and d is the number of dimensions. This 2-D PDF estimation algorithm
accumulates the statistical likelihood of every sample occurring within every probability
level. Each sample/level combination is independent, thereby making the algorithm
embarrassingly parallel. The data input consists of O(N) samples whereas the output is
the resulting O(n^2) probability levels.
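As a sketch of the computation itself (not of the pipelined FPGA implementation), the accumulation over every sample/level pair can be written as follows. The Gaussian window and the toy input sizes are illustrative assumptions only:

```python
import math

def parzen_2d(samples, levels, h):
    """Accumulate the statistical likelihood of every sample at every
    (x, y) probability level. Each sample/level pair is independent,
    so the two inner loops are embarrassingly parallel, matching the
    O(N) x O(n^2) structure described in the text. A Gaussian window
    of width h is assumed purely for illustration."""
    pdf = [[0.0] * len(levels) for _ in levels]
    norm = 1.0 / (len(samples) * 2 * math.pi * h * h)
    for sx, sy in samples:                      # O(N) samples ...
        for i, lx in enumerate(levels):         # ... times O(n^2) levels
            for j, ly in enumerate(levels):
                d2 = ((lx - sx) ** 2 + (ly - sy) ** 2) / (h * h)
                pdf[i][j] += norm * math.exp(-0.5 * d2)
    return pdf

# toy example: two samples, a 3x3 grid of probability levels
pdf = parzen_2d([(0.0, 0.0), (1.0, 1.0)], [0.0, 0.5, 1.0], h=0.5)
```

The independence of the inner loops is what permits the 80 parallel kernels per node described below.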
A general overview of the algorithm structure for this case study is presented
in Figure 4-4. A total of 67,108,864 (i.e., 64M) data samples, originating on one
microprocessor, are scattered equally among the P microprocessors. The number of
samples is large to fully stress the communication and memory capabilities of the target
FPGA platform. The microprocessors transfer the data to their respective FPGA node
in chunks of 8,192 data samples, limited by the available on-chip block RAM. A total of
80 pipelined kernels per node perform the necessary computations (comparison, scaling,
and accumulation) to analyze each data sample against the 256x256 probability levels.
The number of parallel kernels maximizes the available 96 hardware multipliers on the
target FPGA with some leeway. The numerical precision for the computation is 18-bit
ACKNOWLEDGMENTS
The success of this research is due to the support and generosity of the University of
Florida High-performance Computing and Simulation (HCS) Laboratory and the NSF Center
for High-Performance Reconfigurable Computing (CHREC). A special thank you goes to
the author's thesis committee: Dr. Alan D. George (chair), Dr. Herman Lam, Dr. Greg
Stitt, and Dr. Sanders. Additional thanks go to Vikas Aggarwal, Max Billingsley,
Grzegorz Cieslewski, Chris Conger, John Curreri, R. DeVille, Rafael Garcia, Dr. Ann
Gordon-Ross, Dr. Eric Grobelny, Adam Jacobs, Seth Koehler, Abhijeet Lawande, Dr.
Saumil Merchant, Karthik Nagarajan, Carlo Pascoe, Dr. Casey Reardon, Shih,
Dr. Ian Troxel, and Jason Williams.
This work was supported in part by the I/UCRC Program of the National Science
Foundation under Grant No. EEC-0642422. The author gratefully acknowledges vendor
equipment and/or tools provided by Altera, Impulse Accelerated Technologies,
Nallatech, SRC Computers, Xilinx, and XtremeData that helped make this work possible.
Additional thanks go to the students and faculty of the High-Performance Computing Lab
(HCL) at George Washington University for the generous use of their SRC-6.
[Figure 5-1 shows an abstraction pyramid. From high to low abstraction: Back of the
Envelope, Estimation Models (Proposed Framework), Abstract Executable Models
(Ptolemy, ESL languages, etc.), Cycle-Accurate Models, and Synthesizable Models
(VHDL, Verilog); the span of alternative realizations (the design space) explorable
grows from low to high with the level of abstraction.]
Figure 5-1. Abstraction pyramid comparing levels of modeling for hardware applications
[54]
using isolated abstract specification and analysis tools can be tedious, disconnected from
subsequent implementation tasks, and ultimately counterproductive.
This chapter proposes a methodology for a framework allowing integration of
modeling environments with RAT. The RAT methodology (and subsequent tool)
provides models describing the behavior of the individual computation and communication
operations and estimates the total application performance based on their aggregation.
RAT has demonstrated reasonably accurate performance prediction, but its efficiency is
limited by the currently manual interpretation of application specifications for the necessary
inputs to the analysis. The proposed framework defines a "translation" component that
distills the required prediction inputs for RAT from the (supported) model of computation
(MoC) of the application specification. An abstraction layer insulates the translation
functionality from the tool-dependent details of the particular modeling environment.
Additionally, the framework defines "orchestration" of strategic DSE, which performs
RAT prediction on an initial application design and potential revisions to the underlying
algorithm and/or platform architecture. As validation of the productivity benefits of
framework-assisted DSE, a tool for translation and orchestration using the proposed
methodology (hereafter referred to as the "DSE tool") is constructed. The functionality
CHAPTER 1
INTRODUCTION
Computing is currently undergoing two reformations, one in device architecture and
the other in application development. Using the growth in transistor density predicted
by Moore's Law for increased clock rates and instruction-level parallelism has reached
practical limits, and the nature of current and future device architectures is focused
upon higher density in terms of multi-core and many-core structures and more explicit
forms of parallelism. Many such devices exist and are emerging on this path, some with
a fixed structure (e.g., quad-core CPU, Cell Broadband Engine) and some reconfigurable
(e.g., FPGAs). Concomitant with this reformation in device architecture, the complexity
of application development for these fixed or reconfigurable devices is at the forefront of
fundamental challenges in computing today.
The development of applications for complex architectures can be defined in terms of
four stages: formulation, design, translation, and execution. The purpose of formulation is
exploration of algorithms, architectures, and mapping, where strategic decisions are made
prior to coding and debugging. The design, translation, and execution stages are
where implementation occurs in an iterative fashion, in terms of programming, translation
to executable codes and cores, debugging, verification, performance optimization, etc.
As architecture complexity continues to increase, so too does the importance of the
formulation stage, since productivity increases when design weaknesses are exposed
and addressed early in the development process. FPGA platforms are particularly
noteworthy for the amount of effort needed with existing languages and tools to
render a successful implementation, and thus productivity of application development
for FPGA-based systems can greatly benefit from better concepts and tools in the
formulation stage. This document presents a novel methodology and model to support
the rapid formulation of designs for FPGA-based reconfigurable computing systems.
This model focuses on not only reasonably accurate performance estimation for single
IMPROVING FPGA APPLICATION DEVELOPMENT VIA STRATEGIC
EXPLORATION, PERFORMANCE PREDICTION, AND TRADEOFF ANALYSIS
By
BRIAN M. HOLLAND
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2010
LIST OF FIGURES
2-1 Performance Characterization and General Structure of FPGA Devices
2-2 Spectrum of High-Performance Reconfigurable Computing Platforms
3-1 Overview of RAT Methodology
3-2 Example Overlap Scenarios
3-3 Trends for Computational Utilization in SB and DB Scenarios
3-4 Parallel Algorithm and Mapping for 1-D PDF
4-1 Two Classes of Modern High-Performance FPGA Systems
4-2 Synchronous Iterative Model
4-3 Example Timing Diagram for Application Scheduling
4-4 Application Structure for 2-D PDF Estimation Case Study
4-5 Platform Structure for 2-D PDF Estimation Case Study
4-6 Results of Efficiency Microbenchmarks for Nallatech BRAM I/O
4-7 Platform Structure for Image Filtering Case Study
4-8 Algorithm Structure for Image Filtering Case Study
4-9 Algorithm Structure for Molecular Dynamics Case Study
5-1 Abstraction pyramid comparing levels of modeling for hardware applications
5-2 Performance prediction using RAT
5-3 Y-chart approach to application specification in modeling environments
5-4 Framework bridging specification and analysis, and orchestrating DSE
5-5 Translation of application specification information for RAT prediction
5-6 Architecture specification of FPGA platform
5-7 Overview of MNW case study
5-8 Predicted execution times of MNW
5-9 MVA graph specification and mapping
such as DMA. Even for single-FPGA systems, a range of issues related to parallelism and
scalability can be explored. RAT is scoped to be a convenient and impactful model that
not only integrates broader issues such as numerical precision and resource utilization
but also contributes to the larger goal of better parallel algorithm formulation and
design-space exploration. Future research will expand the RAT methodology for larger
scale prediction on multi-FPGA systems.
3.3 Walkthrough
To simplify the RAT analysis in Section 3.2, a worksheet can be constructed based
upon Equations (3-1) through (3-11). Users simply provide the input parameters and the
resulting performance values are returned. This walkthrough further explains key concepts
of the throughput test by performing a detailed analysis of a real application case study,
one-dimensional probability density function (PDF) estimation. The goal is to provide a
more complete description of how to use the RAT methodology in a practical setting.
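The worksheet described above can itself be sketched in a few lines of code. The formulas follow the single-buffered equations of Section 3.2, and the sample inputs are the 1-D PDF computation values used later in this walkthrough (the communication time is supplied directly rather than derived, to keep the sketch minimal):

```python
def comp_time(n_elem, ops_per_elem, f_clk, ops_per_cycle):
    """Computation time: total operations over the achieved parallel rate."""
    return (n_elem * ops_per_elem) / (f_clk * ops_per_cycle)

def rc_time_sb(n_iter, t_comm, t_comp):
    """Single-buffered RC execution: communication and computation
    do not overlap, so each iteration pays both in full."""
    return n_iter * (t_comm + t_comp)

# 1-D PDF computation parameters from this walkthrough:
# 512 elements, 768 ops/element, 150 MHz clock, 20 ops/cycle
t_comp = comp_time(512, 768, 150e6, 20)      # -> 1.31e-4 s
t_rc   = rc_time_sb(400, 2.47e-5, t_comp)    # 400 single-buffered iterations
```

Supplying the remaining inputs (t_soft, the communication parameters) yields the utilization and speedup entries of the worksheet in the same way.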
3.3.1 Algorithm Architecture
The Parzen window technique [36] is a generalized nonparametric approach to
estimating probability density functions (PDFs) in a d-dimensional space. The common
parametric forms of PDFs (e.g., Gaussian, Binomial, Rayleigh distributions) represent
mathematical idealizations and, as such, are often not well matched to densities
encountered in practice. Though more computationally intensive than using histograms,
the Parzen window technique is mathematically advantageous. For example, the resulting
probability density function is continuous and therefore differentiable. The computational
complexity of the algorithm is of order O(Nn^d), where N is the total number of data
samples (i.e., number of elements), n is the number of discrete points at which the PDF is
estimated (comparable to the number of bins in a histogram), and d is the number of
dimensions. A set of mathematical operations is performed on every data sample over
n^d discrete points. Essentially, the algorithm computes the cumulative effect of every data
Table 3-6. Performance parameters of 2-D PDF (Nallatech)
              Predicted   Predicted   Predicted   Actual
fclk (MHz)    75          100         150         100
tcomm (sec)   1.01E-2     1.01E-2     1.01E-2     1.06E-2
tcomp (sec)   5.59E-2     4.19E-2     2.80E-2     4.46E-2
utilcommSB    15%         19%         27%         19%
utilcompSB    85%         81%         73%         81%
tRCSB (sec)   2.64E+1     2.08E+1     1.52E+1     2.21E+1
speedup       6.0         7.6         10.4        7.2
Table 3-7. Resource usage of 2-D PDF (XC4VLX100)
FPGA Resource   Utilization (%)
BRAMs           21
48-bit DSPs     33
Slices          22
against the comparable 100MHz prediction. The communication time is within 5%
of the predicted value, with a discrepancy of only 0.5 milliseconds, again due to accurate
microbenchmarking of the Nallatech board's PCI-X interface. This difference is potentially
significant given the 400 iterations required to perform this algorithm. The overall impact
on speedup is further affected by variation in the computation time. An underestimation
by approximately 2.7 milliseconds creates a total discrepancy just over 3 milliseconds per
iteration. This larger error in the computational throughput parameter as compared to
1-D PDF is due to the more exact modeling of the pipeline behavior without adjustments
for potential overhead. These overheads from pipeline latency and polling were assumed
insignificant due to the length of the overall execution time but instead had noticeable
effect each iteration. In total, the actual speedup was 5% less than the predicted speedup. This
error margin is excellent given the fast and coarse-grained prediction approach of RAT
compounded over hundreds of iterations. Greater attention to communication behavior
and the nuances of the computation structure can further reduce this error if desired.
To the extent possible while maintaining fast performance estimation, insight about
shortcomings in previous RAT predictions can be factored into future projects to further
boost accuracy. Comparing Table 3-7 to the resource utilization from the 1-D algorithm,
deterministic, pipelined structure. For the SRC-6 system, the individual communication
and computation times were measured on the FPGA via a vendor-provided counter
function. Both operations are FPGA-controlled and initiated by a single function call
which could not be separated from the perspective of the host microprocessor. However,
the total RC execution as measured by the wall-clock time of the CPU is approximately
0.07 seconds longer than the sum of the computation and communication times measured
on the FPGA. Consequently, extra system overhead not considered by RAT caused the
actual speedup value to be 14% less than expected. The utilizations for the actual design
reflect this overhead, with only 86% of RC execution time comprising communication
or computation. The discrepancy in total execution is large because the overhead is
significant relative to the short run time (less than 0.5s). If the overhead had been factored
into the prediction, the total estimation error would be under 1%. Additionally, resource
utilization is summarized in Table 3-13. No multipliers are required for this type of
searching but the heavy usage of logic elements limits further scalability of the algorithm
on a single FPGA of this size.
3.4.4 Molecular Dynamics
Molecular Dynamics (MD) is the numerical simulation of the physical interactions of
atoms and molecules over a given time interval. Based on Newton's second law of motion,
the acceleration (and subsequent velocity and position) of the atoms and molecules are
calculated at each time step based on the particles' masses and the relevant subatomic
forces. For this case study, the molecular dynamics simulation is primarily focused on
the interaction of certain inert liquids such as neon or argon. These atoms do not form
covalent bonds and consequently the subatomic interaction is limited to the Lennard-Jones
potential (i.e. the attraction of distant particles by van der Waals force and the repulsion
of close particles based on the Pauli exclusion principle) [39]. Large-scale molecular
dynamics simulators such as AMBER [40] and NAMD [41] use these same classical physics
principles but can calculate not only Lennard-Jones potential but also the nonbonded
The last attribute in Table 4-1, tfpga, summarizes the computation time for the
2, 4, and 8 node cases. Each node for a particular system size (i.e., number of nodes,
P) will have an identical execution time because of the equal data decomposition. Due
to the increasing number of node resources, FPGA computation time, tfpga, decreases
approximately linearly. Two FPGAs require twice the time as four FPGAs which need
twice the time of eight FPGAs. This behavior is consistent with the embarrassingly
parallel nature of the 2-D PDF estimation algorithm.
4.4.3 Network Modeling
For the FPGA platform used in this case study, two communication network models
are necessary: PCI-X I/O Bus and Ethernet. The PCI-X bus model describes the
point-to-point interconnect between a host microprocessor and its Nallatech FPGA node.
The Ethernet model describes the MPICH2 communication over the Gigabit Ethernet
network. Assembling the attribute values for these models involves not only analysis of the
algorithm structure and mapping but also microbenchmarking of the underlying platform
behavior for typical communication transactions.
4.4.3.1 PCI-X Network Modeling
The I/O operations for 2-D PDF estimation involve transfers between the host CPU
and the onboard FPGA block RAM. Microbenchmarks were performed on common
transfer sizes (i.e., powers of two from 4B upward). Figure 4-6 summarizes the results
of these transfers, which can be referenced for all future I/O performance estimations.
Smaller transfers, Figure 4-6A, have erratic but steadily increasing efficiency whereas
larger transfers, Figure 4-6B, could be approximated by a single value. For the transfer
sizes used in this case study (writing 8,192 elements and reading 65,536 elements,
discussed later), the I/O efficiencies are 0.31 and 0.10, respectively.
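A sketch of how the measured efficiencies feed the I/O estimate is shown below. The ideal PCI-X rate (1 GB/s) and the 4-byte element size are assumptions for illustration, not values taken from the microbenchmark database:

```python
def io_time(n_bytes, efficiency, ideal_Bps):
    """Estimated transfer time: the ideal bus rate scaled by the
    microbenchmarked efficiency for that transfer size."""
    return n_bytes / (efficiency * ideal_Bps)

IDEAL = 1.0e9   # assumed ideal PCI-X rate, bytes/sec (illustrative)

# 8,192-element write at efficiency 0.31; 65,536-element read at 0.10,
# assuming 4-byte elements for this sketch
write_t = io_time(8192 * 4, 0.31, IDEAL)
read_t  = io_time(65536 * 4, 0.10, IDEAL)
```

Note that although the read moves eight times the data, its lower efficiency makes it proportionally more expensive than the raw byte counts suggest.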
Table 4-2 summarizes the delay and throughput attributes, gathered from microbenchmarks
of the PCI-X I/O bus, along with the quantity of data transmitted for the 2-D PDF
estimation case study. The microbenchmarks measure the total time of a data transfer,
Table 3-5. Input parameters of 2-D PDF
Dataset Parameters
  Nelements, input (elements)       1024
  Nelements, output (elements)      65536
  Nbytes/element (bytes/element)    4
Communication Parameters (Nallatech)
  throughputideal (MB/s)            1000
  write 0 < α < 1                   0.147
  read 0 < α < 1                    0.026
Computation Parameters
  Nops/element (ops/element)        196608
  throughputproc (ops/cycle)        48
  fclock (MHz)                      75/100/150
Software Parameters
  tsoft (sec)                       158.8
  Niter (iterations)                400
65,536 (256 × 256) PDF values are sent back to the host processor after each iteration of
computation due to memory size constraints on the FPGA. The same numerical precision
of four bytes per element is used for the data set. The interconnect parameters model
the same Nallatech FPGA card as in the 1-D case study but for different transfer sizes.
The αread term is small for the relatively large output of 65,536 elements because data
is transferred in 256 batches of 256 elements each, incurring a large latency overhead.
Each of the 65,536 bins requires three operations for a total of 196,608 operations. Eight
kernels, each containing two pipelines (one per dimension), perform three operations per
pipeline per cycle for a total of 48 simultaneous computations per cycle. Again, the same
range of clock frequencies is used for comparison. The software baseline for computing
speedup values was written in C and executed on the same 3.2GHz Xeon processor. The
algorithm requires the same 400 iterations to complete the computation and VHDL is also
used.
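The Table 3-6 predictions follow directly from these inputs. A quick check of the 100MHz column, using the Table 3-5 values and summing the write and read transfers for the communication time:

```python
# 2-D PDF prediction at 100 MHz using the Table 3-5 inputs
n_in, n_out, bpe = 1024, 65536, 4          # elements and bytes/element
tput_ideal = 1000e6                        # ideal interconnect rate, B/s
a_write, a_read = 0.147, 0.026             # microbenchmarked efficiencies
ops_per_elem, tput_proc, f_clk = 196608, 48, 100e6

# communication: input write plus output read, each at its scaled rate
t_comm = (n_in * bpe / (a_write * tput_ideal)
          + n_out * bpe / (a_read * tput_ideal))
# computation: total operations over the achieved parallel rate
t_comp = n_in * ops_per_elem / (f_clk * tput_proc)
t_rc = 400 * (t_comm + t_comp)             # 400 single-buffered iterations
speedup = 158.8 / t_rc                     # t_soft from Table 3-5
```

These reproduce the 1.01E-2s, 4.19E-2s, 2.08E+1s, and 7.6 entries of the 100MHz column.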
The RAT performance predictions are listed with the experimentally measured
results in Table 3-6. These three predictions are based on the range of clock frequency
values listed in Table 3-5 but the accuracy of the actual 100MHz design is only evaluated
shaded boxes. The dashed border of the modeling environment and tool-abstraction layer
indicates the interchangeability of the specification tool. Section 5.3.1 describes the procedure
for translation of algorithm-based MoCs into the quantitative performance attributes
and application scheduling necessary to direct the synchronous, iterative performance
model of RAT. Section 5.3.2 describes the mechanism for orchestration of DSE, specifically
the directed revision of an initial application specification to examine and compare the
performance potential of design alternatives.
5.3.1 Translation
Although individual modeling environments and performance prediction techniques
sometimes include methods for direct connectivity to other tools, an explicit intermediary
between specification and analysis is advantageous. The proposed framework provides
translation between the algorithm MoCs and the RAT performance prediction, facilitating
the transfer of the required quantitative attributes and scheduling information to the
corresponding computation and communication model. Potential issues during translation
include differences in the data structures (e.g., format, representation, or precision),
abstraction levels, and semantic meaning, along with other dilemmas such as missing,
redundant, or inconsistent data. Resolving these issues can require acute awareness of
the low-level details of the data formats, syntax, and semantics of the tools with extra
functionality to identify and request additional information from the user as necessary.
The need for unique bridges between every desired modeling tool and RAT is greatly
reduced by an abstraction layer, which allows the framework to perform the majority of
the translation based on a generic format for algorithm MoCs derived from the specific
modeling environment tool.
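As an illustrative sketch only (the names below are hypothetical, not the DSE tool's actual interface), the abstraction layer can be thought of as a generic record that each modeling-environment bridge populates, after which the framework maps it onto RAT's models without knowing the source tool:

```python
from dataclasses import dataclass

@dataclass
class OperationGroup:
    """Generic, tool-independent view of one MoC operation group.
    Field names are hypothetical: each modeling-environment bridge
    only has to fill in this record, and the framework reorganizes
    the attributes into RAT computation/communication inputs."""
    name: str
    ops_per_element: int        # contribution to the computation model
    bytes_per_transaction: int  # contribution to the communication model
    iterations: int

def to_rat_inputs(groups):
    """Aggregate per-group attributes into RAT-style totals."""
    return {
        "total_ops": sum(g.ops_per_element * g.iterations for g in groups),
        "total_bytes": sum(g.bytes_per_transaction * g.iterations
                           for g in groups),
    }

# hypothetical groups: a compute kernel and a DMA transfer, 400 iterations
groups = [OperationGroup("kernel", 9, 4, 400),
          OperationGroup("dma", 0, 2048, 400)]
inputs = to_rat_inputs(groups)
```

The point of the record is that adding a new modeling environment only requires a new exporter to this format, not a new bridge to RAT.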
As illustrated in Figure 5-5, the algorithm and architecture attributes for the basic
operations of the MoC of the application specification must be reorganized and formatted
based on their contribution to the RAT computation and/or communication estimation.
The framework constructs RAT computation models for every hardware resource based on
technique presented in [18] seeks to parameterize the computational algorithm and the
FPGA system itself. The analytical methods have similarities to RAT but the emphasis
is on projecting potential bottlenecks due to memory throughput, not on predicting total
system performance. Dynamo [19] involves performance prediction of image processing
systems partitioned and compiled at runtime from existing pipelined kernels. The system
provides dynamic optimization for application construction exclusively from existing
modules and assumes that algorithm design and analysis is completed prior to the use of
Dynamo. In [20], 12 design techniques are presented for maximizing the performance of
FPGA applications. This research is synergistic with RAT by potentially reducing the number
of algorithm and architecture iterations necessary to achieve suitable performance; however,
the RAT methodology is still required to quantitatively evaluate each design iteration.
Though prediction is quite common with FPGA technologies, it is not primarily used
for system-level performance. Routing is a common target for device-level prediction due
to the impact on development time and performance. In [21], a model of the algorithm
routing demands is created early in the FPGA development cycle. In [22], prediction is
used to mitigate the variability and long run times of commercial place and route tools
for estimating interconnect delay. Other issues including timing [23], routability [24],
interconnect planning [25], and routing delay [26] are explored via prediction. Performance
is also explored by modeling issues such as power [27] and wafer yield [28]. Many of
these prediction techniques for lower-level issues migrated from application-specific
integrated circuits into the RC domain to more efficiently model the growing complexity
of FPGAs. Similarly, RAT and other methodologies are branching to RC from existing
areas of parallel application modeling to bridge the growing need for efficient performance
prediction.
Table 3-12. Performance parameters of TSP (SRC)
              Predicted   Actual
fclk (MHz)    100         100
tcomm (sec)   1.54E-5     1.57E-5
tcomp (sec)   4.31E-1     4.30E-1
utilcommSB    0.004%      0.003%
utilcompSB    99.996%     86.2%
tRCSB (sec)   4.31E-1     4.99E-1
speedup       5.16        4.45
Table 3-13. Resource usage of TSP (XC2V6000)
FPGA Resource       Utilization (%)
BRAMs               56
18x18 Multipliers   0
Slices              73
While N^2 distances are needed to compute path lengths, N^N total paths must be examined.
For consistency with the other case studies, the number of operations per element is set to
N^(N-2) (i.e., 9^7 = 4,782,969), which makes the RAT prediction computationally equivalent to
the view of N^N path elements (for computation only) with one operation each. Since nine
cities are examined in this case study using nine kernels, a total of nine potential paths are
examined per clock cycle. The clock frequency of the MAP-B unit is fixed at 100 MHz and
only one input/compute/output iteration is required for this algorithm. The C software
baseline was executed on a 3.2GHz Xeon processor. The parallel algorithm is constructed
in SRC's Carte C, a high-level language (HLL) for FPGA design.
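These parameters reproduce the Table 3-12 predictions directly; a quick check using the Table 3-11 inputs:

```python
# TSP prediction from the Table 3-11 inputs
n_in, bpe = 81, 8                   # N x N distance matrix, 8-byte elements
tput_ideal, alpha = 1400e6, 0.03    # SNAP ideal rate (B/s) and efficiency
ops_per_elem, tput_proc, f_clk = 4782969, 9, 100e6   # 9^7 ops/element

# communication: 81 elements at the efficiency-scaled SNAP rate
t_comm = n_in * bpe / (alpha * tput_ideal)
# computation: 81 x 9^7 operations at 9 path evaluations per 100 MHz cycle
t_comp = n_in * ops_per_elem / (f_clk * tput_proc)
# single iteration, so t_RC is just the sum; t_soft = 2.22 s
speedup = 2.22 / (t_comm + t_comp)
```

These reproduce the predicted 1.54E-5s communication time, 4.31E-1s computation time, and 5.16 speedup, and they make plain how thoroughly computation dominates this case study.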
The results of the hardware design are compared against the performance predictions
in Table 3-12. The percent error in the predicted communication time was less than 2%
due to microbenchmarking on the SRC-6 specifically to replicate the short communication
transfers. The cycle-accurate timers of the SRC-6 system meant this discrepancy was not
a measurement error but instead a function of the modeling and parameterization of the
SNAP interconnect throughput. However, the actual communication time is only 16µs
(versus 430ms for computation) and consequently its impact on speedup is negligible for
this case study. The predicted and actual computation times were nearly identical due to
Table 3-11. Input parameters of TSP
Dataset Parameters
  Nelements, input (elements)       81
  Nelements, output (elements)      1
  Nbytes/element (bytes/element)    8
Communication Parameters (SRC)
  throughputideal (MB/s)            1400
  write 0 < α < 1                   0.03
  read 0 < α < 1                    0.03
Computation Parameters
  Nops/element (ops/element)        4782969
  throughputproc (ops/cycle)        9
  fclock (MHz)                      100
Software Parameters
  tsoft (sec)                       2.22
  Niter (iterations)                1
of path validity and length (i.e. if all cities were visited exactly once, report the total
distance traveled). The individual steps are not interrelated and the examination of
possible paths can be pipelined. However, unlike the branch-and-bound technique which
backtracks in the middle of paths to avoid revisiting cities, the hardware pipeline operates
on full N-length paths, even those invalid because of repeated cities. Extra computation is
required but substantially more parallelism is exploitable.
Table 3-11 lists the input parameters of the RAT performance prediction for TSP.
The interconnect parameters model the proprietary SNAP interconnect of the SRC-6
system. The small fraction of throughput, α, represents the overhead associated with
the extremely minimal communication in the algorithm, only N x N input elements.
This information contains the distances between every pair of cities. The only output for
this system is the minimal path length and this communication time is assumed to be
negligible. Elements are 8 bytes, the width of the MAP-B's SRAM, but only 4 bytes (32
bits) per element are used to represent distances in fixed point. The information is not
byte-packed for communication and consequently the other 32 bits are wasted. For this
case study, the computational workload is exponentially related to the number of elements.
using the Y-chart approach [54], as illustrated in Figure 5-3. These application models
(particularly the algorithm model) describe the behavior of an application in terms of
a model of computation (MoC). A MoC defines a set of allowable "operations" (i.e.,
basic and often technology-dependent computational events), communication between
operations (i.e., data movement), their relative costs (e.g., clock cycles), and the total
system behavior based on the operations composing the application. Each modeling
environment uses graphical and/or textual elements to denote precise syntactic and
semantic meanings for an application specification based on the MoC.
The case studies for this chapter, MNW and MVA graph, are specified by asynchronous
message-passing (AMP) and synchronous dataflow (SDF) MoCs, respectively, which
represent common models for FPGA systems. AMP denotes the use of one or more
queues to describe communication between groups of operations. Only messages
within the same queue are strictly ordered with unspecified timing between different
queues. SDF represents a special case of AMP with groups of operations evaluated
as soon as the necessary messages are available from the communication channels,
which are uni-directional. Data enters the application model at a constant rate, which
eventually induces a steady-state evaluation rate for each group of operations with total
performance defined by the slowest group. AMP suitably describes the straightforward
DMA communication between the microprocessor and FPGAs for MNW. SDF provides
mechanisms for describing the pipeline network of the MVA graph.
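Under SDF's steady state, the end-to-end throughput is set by the slowest group of operations; a minimal sketch with hypothetical per-group evaluation rates:

```python
def sdf_steady_state_rate(group_rates):
    """Steady-state evaluation rate of an SDF pipeline: data enters at a
    constant rate, so the end-to-end rate settles at the slowest group's
    rate. Assumes unit production/consumption per group (illustrative)."""
    return min(group_rates)

# hypothetical evaluations/sec for three pipeline stages
rate = sdf_steady_state_rate([5.0e6, 2.0e6, 4.0e6])
```

This min-over-groups behavior is exactly the property the translation layer must preserve when mapping an SDF specification like the MVA graph onto RAT's performance model.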
The proposed DSE tool requires a modeling environment capable of effectively
representing abstract algorithm, architecture, and subsequent application mapping
models based on AMP and SDF MoCs. Ptolemy [57] is an environment specifically for
simulating and prototyping systems involving heterogeneous MoCs, including AMP and
SDF. Metropolis [58] defines "metamodels" that use formal execution semantics to define
the application function, platform architecture, and mapping of the system based on a new
or existing MoCs. Artemis [59] and Sesame [60] use a hierarchical Kahn Process Network
Table 3-17. Summary of Results
                      1-D PDF   2-D PDF   LIDAR     TSP       MD
Predicted Comm. (s)   2.47E-5   1.01E-2   6.60E-4   1.54E-5   8.77E-4
Actual Comm. (s)      2.50E-5   1.06E-2   5.65E-4   1.57E-5   1.39E-3
Comm. Error           1%        5%        17%       2%        37%
Predicted Comp. (s)   1.31E-4   4.19E-2   2.64E-4   4.31E-1   5.37E-1
Actual Comp. (s)      1.39E-4   4.46E-2   2.25E-4   4.30E-1   8.79E-1
Comp. Error           6%        6%        17%       0.2%      39%
Predicted Speedup     6.5       7.6       11.8      5.2       10.7
Actual Speedup        7.8       7.2       13.8      4.5       6.6
Speedup Error         20%       5%        17%       14%       38%
entire dataset is processed in a single iteration and the algorithm is constructed in Impulse
C, a cross-platform HLL for FPGAs.
Table 3-15 outlines the predicted and actual results of the MD. Note that these
results are unique to this specific algorithm and that different structures, target languages,
and platforms will have varying prediction accuracy. The difference in predicted and
actual communication time is 37%. The error itself is associated with the overhead of
multiple I/O transfers between the CPU and on-board SRAM memory modeled as a
single block of communication. While more accurate estimations are the goal of RAT,
any further precision improvements for this parameter are inconsequential given the low
communication utilization. Computation dominated the overall RC execution time and
the actual time is roughly 64% higher than the predicted value due to the data-driven
operations and suboptimal pipelining performance. The total number of operations was
higher than expected, coupled with relatively modest parallelism for the problem size.
Consequently, the speedup error was also substantial, with accuracy significantly less
than desired. However, this case study is useful because the qualitative need for significant
parallelism is correctly predicted even though the algorithm cannot be fully analyzed at
design time. As Table 3-16 illustrates, a large percentage of the combinatorial logic and
all dedicated multiply-accumulators (DSPs) were required for the algorithm.
similar to the scatter. However, unlike the scatter (or gather), the amount of data during
each transmission does not increase because the data is reduced at every node. Consequently,
the reduce has log2(P) latency, L, and transmission time, Gk, terms as defined in Equation
4-19. Additionally, each transmission requires an addition operation, γ, for each of the k
data values in the message. Note that Equations 4-18 and 4-19 assume P is a power of 2.
t_transaction-reduce = log2(P) × (L + 2o + Gk + γk)    (4-19)
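Equation 4-19 can be sketched directly in code. The LogGP-style parameters (latency L, overhead o, per-byte gap G, and per-value addition cost γ) are passed in; the numeric values below are hypothetical illustration values, not measurements from the case study.

```python
import math

def t_transaction_reduce(P, L, o, G, gamma, k):
    """Equation 4-19: reduce over P nodes (P a power of 2).
    Each of the log2(P) steps costs latency L, send+receive overhead 2o,
    transmission time G*k, and one addition (gamma) per data value."""
    assert P >= 2 and (P & (P - 1)) == 0, "P must be a power of 2"
    return math.log2(P) * (L + 2 * o + G * k + gamma * k)

# Hypothetical parameter values for illustration only:
t = t_transaction_reduce(P=8, L=5e-6, o=1e-6, G=1e-8, gamma=2e-9, k=65536)
```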
The second block of Table 4-3 lists the two application-dependent attributes defined
by the user based on the 2-D PDF estimation case study. Again, system configurations
of 2, 4, and 8 nodes, P, are used for this case study. The 2-D PDF application requires
two distinct transactions: distribution of the input data for the X and Y dimensions (i.e.,
MPIScatter) and reduction of the partial PDFs (i.e., MPIReduce). The 2, 4, and 8
FPGA platform configurations will involve message sizes, k, of 128MB, 64MB, and 32MB
of data, respectively, for the scatter. For the reduce, every node will contribute the 256KB
(256 × 256 × 4B) partial results (regardless of the number of nodes) that are ultimately
accumulated on the head node.
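The message-size bookkeeping above can be sketched as follows; the 256MB total input size is inferred from the 128/64/32MB scatter sizes for 2, 4, and 8 nodes, not stated directly.

```python
def message_sizes(total_bytes, P):
    """Scatter message size halves as the node count doubles, while each
    reduce contribution is the fixed partial-PDF size (256 x 256 x 4 B)."""
    scatter_k = total_bytes // P
    reduce_k = 256 * 256 * 4          # 256 KB partial result per node
    return scatter_k, reduce_k

MB = 1 << 20
for P in (2, 4, 8):
    s, r = message_sizes(256 * MB, P)   # scatter: 128 MB, 64 MB, 32 MB
```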
The third block of Table 4-3 summarizes the results of the network model. The
individual times for the scatter and reduce transactions, t_transaction, are listed. These times
increase logarithmically for the 2, 4, and 8 node platforms due to the increasing number of
messages (i.e., log2(P)) required for the transaction.
4.4.4 Stage/Application Modeling
The stage and application models are synonymous because this case study consists
of a single stage of execution. Equation 4-20 summarizes the set, Scomp, of the execution
times, tfpga, for the 2, 4 and 8 (i.e., P) FPGA nodes used in this case study. From
Equation 4-21, the computation time, tcomp, is determined by the maximum (longest)
Related to the speedup is the computation and communication utilization given by
Equations (3-8), (3-9), (3-10), and (3-11). These metrics determine the fraction of the
total application execution time spent on computation and communication for the SB and
DB cases. For SB, the computation utilization can provide additional insight about the
application speedup. If utilization is high, the FPGA is rarely idle thereby maximizing
speedup. Low utilizations can indicate potential for increased speedups if the algorithm
can be reformulated to have less (or more overlapped) communication. In contrast to
computation which is effectively parallel for optimal FPGA processing, communication
is serialized. Whereas computation utilization gives no indication about the overall
resource usage, since additional FPGA logic could be added to operate in parallel without
affecting the utilization, the communication utilization indicates the fraction of bandwidth
remaining to facilitate additional transfers since the channel is only a single resource.
For DB, assuming steady-state behavior, the implications of the utilization terms are
slightly different. The larger value, whether communication or computation, will have a
utilization of 1. If computation is the shorter (i.e. overlapped) time, utilization illustrates
how starved the computation is for data. If communication is the shorter time, utilization
is a measure of the available throughput to support additional parallel computation. An
example of these utilization trends is shown in Figure 3-3.
util_compSB = t_comp / (t_comm + t_comp)        (3-8)
util_commSB = t_comm / (t_comm + t_comp)        (3-9)
util_compDB = t_comp / max(t_comm, t_comp)      (3-10)
util_commDB = t_comm / max(t_comm, t_comp)      (3-11)
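Equations 3-8 through 3-11 can be sketched in a few lines; the sample values used below are the 125MHz predicted LIDAR times from Table 3-9.

```python
def utilizations_sb(t_comm, t_comp):
    """Single-buffered (Eqs. 3-8, 3-9): communication and computation are
    serialized, so each occupies a fraction of the total t_comm + t_comp."""
    total = t_comm + t_comp
    return t_comp / total, t_comm / total

def utilizations_db(t_comm, t_comp):
    """Double-buffered (Eqs. 3-10, 3-11): phases overlap, so the longer one
    has utilization 1 and the shorter indicates remaining headroom."""
    longest = max(t_comm, t_comp)
    return t_comp / longest, t_comm / longest

# LIDAR predicted times at 125 MHz (Table 3-9):
comp_sb, comm_sb = utilizations_sb(t_comm=6.60e-4, t_comp=2.64e-4)
comp_db, comm_db = utilizations_db(t_comm=6.60e-4, t_comp=2.64e-4)
# comp_sb ≈ 0.29, comm_sb ≈ 0.71; comm_db == 1.0 (communication dominates)
```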
Table 3-9. Performance parameters of LIDAR (Cray)
              Predicted  Predicted  Predicted  Actual
fclk (MHz)    100        125        150        125
tcomm (sec)   6.60E-4    6.60E-4    6.60E-4    5.65E-4
tcomp (sec)   3.30E-4    2.64E-4    2.20E-4    2.25E-4
utilcommsB    67%        71%        75%        72%
utilcompsB    33%        29%        25%        28%
tRCsB (sec)   9.90E-4    9.24E-4    8.80E-4    7.90E-4
speedup       11.0       11.8       12.4       13.8
speedup. The software baseline was written in C and executed on a 2.4GHz Opteron
processor, the host CPU for the Cray XD1 node. Only one iteration (i.e. GPS interval)
is required for this case study and VHDL is used to implement the parallel algorithm in
hardware.
Table 3-9 compares the RAT performance predictions with the actual 125MHz
experimental results. The structure of the particular Cray SRAM interface overlaps
computation and DMA transfers back to the CPU (i.e. tread). Consequently, the
total RC execution time was directly measurable but the individual computation and
communication times for the actual result were estimated from the total execution time
based on the expected latency of the computation. Two general conclusions were that
both the computation and communication times were overestimated by RAT but that the
utilization ratios were still fairly consistent with expectations. Unlike the previous case
studies, the total speedup was underestimated, by roughly 14%. This discrepancy in speedup
was primarily due to the difference in communication times. The actual computation pipeline
is believed to correspond closely with the high-level algorithm. The communication and
computation times of 565μs and 225μs are likely comparable to the system overhead
and measurement error, causing noticeable discrepancies and the unusual behavior of
a pessimistic prediction even with the generalized analytical model. Though extra
performance as compared to RAT projections may be an unexpected benefit for the
algorithm, the goal of the methodology is precise prediction that considers all major
factors affecting performance. Adjustments to the model for more accurate prediction of short
The second phase of research proposed RATSS, an extension of the RAT model for
multi-FPGA systems. RATSS balanced the desire for greater algorithm and platform
diversity (i.e., model applicability) with the requirement of high predictability (i.e.,
model accuracy) for scalable systems by focusing on synchronous iterative algorithms
for two classes of modern RC systems. Synchronous iterative algorithms represented a
significant class of data-parallel applications, typically structured as SIMD-style pipelines.
Focusing on two classes of RC systems allows hierarchical aggregation of computation
and communication models into RAT predictions for the full application. Successes in
conventional HPC and HPEC modeling, such as the LogP communication model, are
leveraged to help maximize efficiency and reliability. Three case studies, 2-D PDF, image
filtering, and MD, demonstrated low total prediction errors.
For the third phase of research, the RAT (and RATSS) methodologies were
integrated within a larger framework for more strategic design-space exploration of
RC applications. Specifically, the framework bridged RAT performance prediction
with modeling environments. These modeling environments allowed rapid yet accurate
application specification by a designer within the context of the MoC. The framework
provided translation between supported MoCs and the analytical performance model
for RAT (i.e., the synchronous iterative model). A tool constructed from the framework
methodology provided translation for the AMP and SDF MoCs of the RCML modeling
environment. This framework tool orchestrated design-space exploration by performing
RAT analysis on an initial application design and potential revisions to the characteristics
of algorithm and/or platform architecture, identifying suitable design configurations
based on designer criteria. Two case studies, MNW and MVA graph, demonstrated
reasonable prediction accuracy (under 5% for MNW) and rapid exploration of large design
spaces (140ms and 340ms for RAT analysis of 100K revisions to MNW and MVA graph,
respectively).
TABLE OF CONTENTS
                                                                        page
ACKNOWLEDGMENTS ......................................................... 4
LIST OF TABLES .......................................................... 7
LIST OF FIGURES ......................................................... 9
ABSTRACT ................................................................ 11
CHAPTER
1 INTRODUCTION .......................................................... 12
2 BACKGROUND AND RELATED RESEARCH ....................................... 16
  2.1 FPGA Background .................................................. 16
  2.2 Related Research ................................................. 18
3 ANALYTICAL MODEL FOR FPGA PERFORMANCE ESTIMATION PRIOR
  TO DESIGN (PHASE 1) .................................................. 21
  3.1 Introduction ..................................................... 21
  3.2 RC Amenability Test .............................................. 21
      3.2.1 Throughput ................................................. 22
      3.2.2 Numerical Precision ........................................ 29
      3.2.3 Resources .................................................. 30
      3.2.4 Scope of RAT ............................................... 31
  3.3 Walkthrough ...................................................... 32
      3.3.1 Algorithm Architecture ..................................... 32
      3.3.2 RAT Input Parameters ....................................... 33
      3.3.3 Predicted and Actual Results ............................... 38
  3.4 Additional Case Studies .......................................... 39
      3.4.1 2-D PDF Estimation ......................................... 40
      3.4.2 Coordinate Calculation for LIDAR ........................... 43
      3.4.3 Traveling Salesman Problem ................................. 46
      3.4.4 Molecular Dynamics ......................................... 49
      3.4.5 Summary of Case Studies .................................... 53
  3.5 Conclusions ...................................................... 54
4 EXPANDED MODELING FOR MULTI-FPGA PLATFORMS (PHASE 2) .................. 56
  4.1 Introduction ..................................................... 56
  4.2 Background and Related Research .................................. 57
  4.3 RATSS Model ...................................................... 59
      4.3.1 RATSS Scope ................................................ 60
            4.3.1.1 FPGA Platform Scope ................................ 60
[48] P. B. Bhat, V. K. Prasanna, and C. S. Raghavendra, "Adaptive communication
algorithms for distributed heterogeneous systems," J. Parallel Distributed Computing,
vol. 59, no. 2, pp. 252-279, 1999.
[49] F. Cappello, P. Fraigniaud, B. Mans, and A. L. Rosenberg, "HiHCoHP: Toward
a realistic communication model for hierarchical hyperclusters of heterogeneous
processors," in Proc. 15th Int'l Parallel and Distributed Processing Symp. (IPDPS),
Washington, DC, USA, 2001, p. 42, IEEE Computer Society.
[50] B. Holland, K. Nagarajan, and A. D. George, "RAT: RC amenability test for rapid
performance prediction," ACM Trans. Reconfigurable Technology and Systems
(TRETS), vol. 1, no. 4, pp. 22:1-22:31, 2009.
[51] E. Parzen, "On estimation of a probability density function and mode," Annals of
Mathematical Statistics, vol. 33, no. 3, pp. 1065-1076, 1962.
[52] K. Nagarajan, B. Holland, A. George, K. C. Slatton, and H. Lam, "Accelerating
machine-learning algorithms on FPGAs using pattern-based decomposition," J.
Signal Processing Systems, Jan. 2009.
[53] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Second Edition,
Prentice-Hall, Inc, Upper Saddle River, NJ, 2002.
[54] B. Kienhuis, E. F. Deprettere, P. van der Wolf, and K. Vissers, Embedded Processor
Design Challenges, chapter A Methodology to Design Programmable Embedded
Systems: The Y-Chart Approach, pp. 18-37, Springer, 2002.
[55] G. D. Peterson and R. D. Chamberlain, "Beyond execution time: Expanding the use
of performance models," IEEE Parallel Distributed Technology: Systems Applications,
vol. 2, no. 2, pp. 37-49, 1994.
[56] R. W. Hockney, "The communication challenge for MPP: Intel Paragon and Meiko
CS-2," Parallel Computing, vol. 20, no. 3, pp. 389-398, 1994.
[57] J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt, "Ptolemy: A framework for
simulating and prototyping heterogeneous systems," Int'l J. Computer Simulation,
vol. 4, pp. 152-184, April 1994.
[58] F. Balarin, Y. Watanabe, H. Hsieh, L. Lavagno, C. Passerone, and
A. Sangiovanni-Vincentelli, "Metropolis: an integrated electronic system design
environment," Computer, vol. 36, no. 4, pp. 45-52, April 2003.
[59] A. D. Pimentel, L. O. Hertzberger, P. Lieverse, P. van der Wolf, and E. F. Deprettere,
"Exploring embedded-systems architectures with Artemis," Computer, vol. 34, no. 11,
pp. 57-63, November 2001.
[60] A. D. Pimentel, C. Erbas, and S. Polstra, "A systematic approach to exploring
embedded system architectures at multiple abstraction levels," IEEE Trans. Comput-
ers, vol. 55, no. 2, pp. 99-112, February 2006.
Table 3-3. Performance parameters of 1-D PDF (Nallatech)
              Predicted  Predicted  Predicted  Actual
fclk (MHz)    75         100        150        150
tcomm (sec)   2.47E-5    2.47E-5    2.47E-5    2.50E-5
tcomp (sec)   2.62E-4    1.97E-4    1.31E-4    1.39E-4
utilcommsB    9%         11%        16%        15%
utilcompsB    91%        89%        84%        85%
tRCsB (sec)   1.15E-1    8.85E-2    6.23E-2    7.45E-2
speedup       5.0        6.5        9.3        7.8
Table 3-4. Resource usage of 1-D PDF (XC4VLX100)
FPGA Resource   Utilization (%)
BRAMs           15
48-bit DSPs     8
Slices          16
difficult. Empirical knowledge of FPGA platforms and algorithm design practices provides
some insight as to a range of fclk values. However, attaining a single, accurate estimate
of the maximum FPGA clock frequency achieved is generally impossible until after the
entire application has been converted to a hardware design and analyzed by an FPGA
vendor's place-and-route tools. Consequently, a number of clock values ranging from
75MHz to 150MHz for the LX100 are used to examine the scope of possible fclk values.
The software parameters provide the last piece of information necessary to complete
the speedup analysis. The software execution time of the algorithm is provided by the
user. Often, legacy software code is the basis for the hardware migration initiative. FPGA
development could be based directly on mathematical models, but there would be no
baseline for evaluating speedup. The baseline software for the 1-D PDF estimation was
written in C, compiled using gcc, and executed on a 3.2GHz Xeon. Finally, the number of
iterations is deduced from the portion of the overall dataset able to reside in the FPGA at
one time. Since the user decided to only process 512 elements at a time from the set of
204,800 elements, there must be 400 (i.e., 204,800/512) iterations of the algorithm. This
case study is implemented in VHDL.
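The throughput calculation walked through in Section 3.3.3 can be sketched as a short computation using the 150MHz input parameters above; the software baseline time below is a hypothetical stand-in chosen to be consistent with the predicted speedup of 9.3 in Table 3-3, since tsoft is not stated here directly.

```python
def rat_single_buffered(n_elem, ops_per_elem, f_clk, ops_per_cycle,
                        t_comm, n_iter, t_soft):
    """RAT throughput model, single-buffered (SB) case:
    t_comp per iteration, total RC time, and speedup vs. software."""
    t_comp = (n_elem * ops_per_elem) / (f_clk * ops_per_cycle)
    t_rc_sb = n_iter * (t_comm + t_comp)
    return t_comp, t_rc_sb, t_soft / t_rc_sb

# 1-D PDF at 150 MHz: 512 elements/iteration, 204,800 total elements.
# t_soft = 0.58 s is a hypothetical value consistent with Table 3-3.
t_comp, t_rc, speedup = rat_single_buffered(
    n_elem=512, ops_per_elem=768, f_clk=150e6, ops_per_cycle=20,
    t_comm=2.47e-5, n_iter=204800 // 512, t_soft=0.58)
# t_comp ≈ 1.31E-4 s, t_rc ≈ 6.23E-2 s, matching the table's predictions
```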
Figure 5-6. Architecture specification of FPGA platform
5.4.1 Experimental Setup
For validation of the proposed methodology, a DSE tool provides functionality for
gathering the application specification, performing translation, and orchestrating DSE
using performance prediction. This functionality includes interaction with a modeling
environment tool, RCML, to collect the necessary information from an application
specification, and usage of a prediction tool, RAT, for performance analysis. RCML
provides an RC-specific abstraction environment with semantic constructs amenable to the
RAT prediction. The DSE tool is a hierarchical composite of the existing tools for RCML and
RAT and newly constructed components providing translation and orchestration. These
translation and orchestration components are implemented as a Java-based Eclipse plug-in
to help minimize the customized interfacing necessary for connecting to the RCML and
RAT tools.
The two application case studies for this chapter are mapped onto a Linux server
containing a GiDEL PROCStar-III FPGA card connected by a PCIe x8 bus to a Xeon
E5520 (i.e., 2.26GHz Quad-core Nehalem) microprocessor. The GiDEL FPGA card
contains four Altera Stratix-III E260 FPGAs, which have interconnects to adjacent
FPGAs and support DMA transfers to and from the microprocessor. Figure 5-6 outlines
the general architecture model for the FPGA-augmented platform. This FPGA system
can be used as a prototype for an RC-augmented embedded platform or represent a single
node in a multi-node RC supercomputer.
and fidelity (through analytical methods). These modeling concepts, particularly LogGP,
are also used for extending RAT to scalable multi-FPGA systems (Chapter 4).
Simulation is another common outlet for quantifying the performance of RC
application models at a high level. In [13], a framework for simulation of FPGA systems
and applications is built on top of the Fast and Accurate Simulation Environment (FASE)
[14]. Models are created for the Mission-Level Designer (MLD) tool based on scripts of
algorithm behavior to rapidly explore large-scale FPGA systems. Another simulation
framework is the Hybrid System Architecture Model (HySAM) coupled with DRIVE
[15]. HySAM provides mechanisms for parameterizing architectures, defining algorithms,
and simulating interactions, while DRIVE provides tools for visualizing results generated
by HySAM. In [16], SimpleScalar and ModelSIM are combined for system analysis
through simultaneous processor emulation and VHDL simulation. Another tool [17] uses
a Simics-based simulator for capturing precise memory-access patterns while functionally
verifying hardware kernels. While each of these methodologies provides high-level
simulation fidelity, significant cost is associated with setting up the requisite models.
Either actual hardware or software code is required or effort is spent on constructing
custom simulation inputs distilled from algorithm and system behavior. In contrast, RAT
seeks to render performance prediction of the application and FPGA platform prior to any
significant hardware or software coding. By using analytical models instead of simulation
frameworks, prediction effort is minimized while maintaining reasonable accuracy. Still,
some insight about modeling larger systems can be leveraged from the multi-FPGA
simulation frameworks.
Understanding and improving algorithm design for FPGAs via analytical modeling
is an expanding area of RC research. One technique [1] focuses on analytical modeling
of shared heterogeneous workstations containing reconfigurable computing devices. The
methodology primarily emphasizes system-level, multi-FPGA architectures with variable
computational loading due to the multi-user environment. A performance prediction