Improving FPGA Application Development via Strategic Exploration, Performance Prediction, and Tradeoff Analysis

Material Information

Holland, Brian
Place of Publication:
[Gainesville, Fla.]
University of Florida
Publication Date:
Physical Description:
1 online resource (125 p.)

Thesis/Dissertation Information

Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
George, Alan D.
Committee Members:
Stitt, Greg
Lam, Herman
Sanders, Beverly A.
Graduation Date:


Subjects / Keywords:
Analytical models ( jstor )
Architectural design ( jstor )
Architectural models ( jstor )
Communication models ( jstor )
Design analysis ( jstor )
Input output ( jstor )
Modeling ( jstor )
Performance prediction ( jstor )
Pipelines ( jstor )
Rats ( jstor )
Electrical and Computer Engineering -- Dissertations, Academic -- UF
analytical, fpga, performance, reconfigurable
Electronic Thesis or Dissertation
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
Electrical and Computer Engineering thesis, Ph.D.


FPGAs continue to demonstrate impressive benefits in terms of performance per Watt for a wide range of applications. However, the design time and technical complexities of FPGAs have made application development expensive, particularly as the number of project revisions increases. Consequently, it is important to engage in systematic formulation of applications, performing strategic exploration, performance prediction, and tradeoff analysis before undergoing lengthy development cycles. Unfortunately, almost all existing simulative and analytic models for FPGAs target existing applications to provide detailed, low-level analysis. This document explores methods, challenges, and tradeoffs concerning performance prediction scope and complexity, calibration and verification, applicability to small and large-scale FPGA systems, efficiency, and automation. The RC Amenability Test (RAT) is proposed as a high-level methodology to address these challenges and provide a necessary design evaluation mechanism currently lacking in FPGA application formulation. RAT comprises an extensible analytic model for single and multi-FPGA systems harnessing a modeling infrastructure, RC Modeling Language (RCML), to provide a breadth of features allowing FPGA designers to more efficiently and automatically explore and evaluate algorithm, platform, and system mapping choices. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis (Ph.D.)--University of Florida, 2010.
Adviser: George, Alan D.
Statement of Responsibility:
by Brian Holland.

Record Information

Source Institution:
Rights Management:
Applicable rights reserved.
Embargo Date:
Resource Identifier:
004979613 ( ALEPH )
769019209 ( OCLC )
LD1780 2010 ( lcc )



[61] C. Reardon, B. Holland, A. George, H. Lam, and G. Stitt, "RCML: An abstract
modeling language for design-space exploration in reconfigurable computing," in Proc.
IEEE Reconfigurable Architectures Workshop, May 25-26 2009.

[62] Y. Sun, Y. Cai, L. Liu, F. Yu, M. L. Farrell, W. McKendree, and W. Farmerie,
"ESPRIT: estimating species richness using large collections of 16S rRNA
pyrosequences," Nucleic Acids Res., vol. 37, no. 10, pp. e76.

[63] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for
similarities in the amino acid sequence of two proteins," J. Molecular Biology, vol. 48,
no. 3, pp. 443-453, 1970.

[64] M. Reiser and S. S. Lavenberg, "Mean-value analysis of closed multichain queuing
networks," J. ACM, vol. 27, no. 2, pp. 313-322, 1980.

The proposed framework provides the necessary methodology to integrate modeling

environments with RAT performance prediction for strategic DSE. The framework

translates application specifications into performance characterizations and orchestrates

the required RAT predictions. The DSE tool was constructed based on the framework

methodology and demonstrated accurate performance analysis with under 5% error for

the MNW case study. Strategic DSE with the tool was efficient and rapid, requiring only

140ms and 340ms for analysis of 100,000 revisions to the MNW application and the MVA

graph, respectively.

3.3.3 Predicted and Actual Results

The RAT performance numbers are compared with the experimentally measured

results in Table 3-3. Each predicted value in the table is computed using the input

parameters and equations listed in Section 3.2.1. For example, the predicted computation

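time when f_clock = 150 MHz is calculated as follows:

\[ t_{comp} = \frac{512 \text{ elements} \times 768 \text{ ops/element}}{150 \text{ MHz} \times 20 \text{ ops/cycle}} = \frac{393216 \text{ ops}}{3\text{E+9 ops/sec}} = 1.31\text{E-4 secs} \]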

The communication time is computed using the corresponding equation. Because the

application is single-buffered, the total RC execution time is simply:

\[ t_{RC_{SB}} = 400 \text{ iterations} \times (2.47\text{E-5 secs} + 1.31\text{E-4 secs}) = 6.23\text{E-2 secs} \]
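The arithmetic above can be reproduced with a short script. This is only a sketch of the single-buffered calculation as it is worked in this section; the function and variable names are illustrative and do not come from the dissertation.

```python
def rat_comp_time(n_elements, ops_per_element, f_clock_hz, ops_per_cycle):
    """Computation time per iteration: total operations / operations per second."""
    total_ops = n_elements * ops_per_element
    ops_per_sec = f_clock_hz * ops_per_cycle
    return total_ops / ops_per_sec


def rat_single_buffered_time(n_iter, t_comm, t_comp):
    """Single-buffered RC time: communication and computation do not overlap."""
    return n_iter * (t_comm + t_comp)


# Values quoted in the text for the 1-D PDF case study at 150 MHz.
t_comp = rat_comp_time(n_elements=512, ops_per_element=768,
                       f_clock_hz=150e6, ops_per_cycle=20)
t_rc_sb = rat_single_buffered_time(n_iter=400, t_comm=2.47e-5, t_comp=t_comp)

print(f"t_comp  = {t_comp:.3g} s")
print(f"t_RC_SB = {t_rc_sb:.3g} s")
```

Changing `f_clock_hz` or `ops_per_cycle` shows how sensitive the prediction is to the conservatively estimated throughput.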

The speedup is simply the division of the software execution time by the RC

execution time. The utilization is computed using the corresponding SB equations.

The communication and computation times for the actual FPGA code were measured

using the wall-clock time of the CPU. The error in the prediction of the communication

time was minimal, approximately 1%, due to detailed microbenchmarking for these exact

transfer sizes. A relatively accurate tcomp prediction is expected given the deterministic

structure of the parallel algorithm. However, the high degree of accuracy (two significant

figures) between the predicted and actual computation times with f_clock = 150 MHz was

unusual, since the computational throughput was a conservatively estimated parameter.

Much of the 1-D PDF algorithm is pipelined but the lower effective throughput, due to

the latency in the short 0.14ms of computation time, closely matched the conservatively

estimated value for throughput_proc (i.e., the 20 ops/cycle used in the RAT prediction

      Application Scope
      Model Usage
4.3.2 Model Attributes and Equations
      Compute Node Model
      Network Model
      Stage Model
      Application Model
4.4 Detailed Walkthrough: 2-D PDF Estimation
4.4.1 Algorithm and Platform Structure
4.4.2 Compute Node Modeling
4.4.3 Network Modeling
      PCI-X Network Modeling
      Ethernet Network Modeling
4.4.4 Stage/Application Modeling
4.4.5 Results and Verification
4.5 Additional Case Studies
4.5.1 Image Filtering
4.5.2 Molecular Dynamics
4.6 Conclusions

(PHASE 3)

5.1 Introduction
5.2 Background and Related Research
5.2.1 RAT Performance Prediction
5.2.2 Modeling Environments
5.3 Integrated Framework
5.3.1 Translation
5.3.2 Orchestration
5.4 Case Studies
5.4.1 Experimental Setup
5.4.2 Modified Needleman-Wunsch (MNW)
5.4.3 Task Graph of Mean-Value Analysis (MVA Graph)
5.5 Integrated RATSS (MNW)
5.6 Conclusions

6 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH

for analytical, system-level modeling prior to any detailed and potentially costly

implementation of FPGA kernels or applications. RAT from Chapter 3 defined an

analytical model for performance estimation of a specific algorithm on a specific platform

prior to implementation, albeit for single-FPGA designs. The RAT methodology must

be paired with system-level modeling concepts to provide a complete model for scalable,

multi-FPGA systems.

Existing research with microprocessor-based algorithms and parallel platforms

can help bridge the gap between analytical modeling for FPGA devices and the design

challenges of FPGA-based scalable systems. The Parallel Random Access Machine

(PRAM) [8] is one of the first widely studied models that attempts to reduce complex

system behavior into a few key attributes, but the model neglects issues such as

synchronization and communication, which can greatly affect the accuracy of the

performance estimations for larger systems. The Bulk Synchronous Parallel (BSP)

Model [9] extends the modeling concepts of PRAM by defining an application in terms

of a series of global supersteps each consisting of local computation, communication,

and synchronization. The LogP [10] model attempts to define networks of computing

nodes (i.e., microprocessors and local memory) by latency, L; overhead, o; gap between

(short) messages, g; and number of processing units, P. The LogGP model [11] extends

the LogP concept with support for a long-message gap, G. Other extensions to LogP

and LogGP include support for contention, LoPC [42], and parameterization of the L,

o, g, G, and P attributes (PlogP) to support dynamically changing values in wide-area

networks [12]. Additionally, benchmarks have been created to assist in the measurement

of these attributes [43]. Leveraging these models allows RATSS to describe system-level

communication for multi-FPGA platforms.

Prior work has leveraged system-level modeling concepts beyond homogeneous

microprocessors. Heterogeneous LogGP (HLogGP) [44, 45] considers extensions for

multiple processor speeds and communication networks within a cluster. In [46],

3.4.5 Summary of Case Studies

Table 3-17 outlines the predicted values, actual results, and error percentages for

the communication time, computation time, and speedup. The magnitudes of the

predicted and actual values are listed to compare the absolute impact of the relative

error percentages. For example, molecular dynamics has comparable communication and

computation error percentages but the magnitudes are different with communication

having virtually no impact on total speedup.

For these case studies, communication had the lowest average error among the

modeled times. The larger errors for LIDAR and MD were caused by minor discrepancies

in the final communication setup versus the RAT analysis of the algorithm. These error

percentages are acceptable for RAT because they still yield valid quantitative insight

about the algorithm behavior. The cost of more precise prediction must be balanced with

the impact of the communication time on performance. Again, the largest communication

error, found in MD, did not significantly affect the speedup, because the communication

was less than 1% of the overall execution time.

Even with the more complicated task of architecture and platform parameterization,

the average prediction error for computation was only slightly higher than communication.

PDF and TSP had low computation errors complementing the low communication errors.

LIDAR had double-digit error, which was due to perceivable system overheads in the

short 0.2ms computation time. The one outlier was the MD application. The difficulty

of mitigating the data-driven computations was compounded by unknowns in the final

algorithm mapping by the HLL tool. As with the communication predictions for MD,

the error was significant but RAT still provided a useful insight about what order of

magnitude speedup should be achievable.

The prediction errors in the overall speedup were higher on average than the

individual computation or communication times. Particularly with 1-D PDF, LIDAR,

and TSP, overheads not part of the RAT computation and communication models were

Table 3-2. Input parameters of 1-D PDF
Dataset Parameters
  N_elements, input (elements)
  N_elements, output (elements)
  N_bytes/element (bytes/element)
Communication Parameters (Nallatech)
  throughput_ideal (MB/s)
  alpha_write (0 < alpha < 1)      0
  alpha_read (0 < alpha < 1)       0
Computation Parameters
  N_ops/element (ops/element)
  throughput_proc (ops/cycle)
  f_clock (MHz)                    75/100
Software Parameters
  t_soft (sec)                     0
  N_iter (iterations)





individual partial transfers, one per iteration, to <. : end with the RAT model, but the

throughput < '-. : ... is ..i'usted to .. ;..- .1 with a single block of data.

The number of bytes per element, N_bytes/element, is rounded to four (i.e., 32 bits).
Even though the PDF estimation algorithm only uses 18-bit fixed point, the interconnect
uses 32-bit communication. The data was not byte-packed and the remaining 14 bits
per word of communication are unused. During the algorithmic formulation, several
formats including 18-bit fixed point, 32-bit fixed point, and 32-bit floating point were
considered for use in the PDF algorithm. However, the maximum error percentage
for 18-bit fixed point was found to be low enough to provide satisfactory precision
for the application. Ultimately, 18-bit fixed point was chosen so that only one Xilinx
18x18 multiply-accumulate (MAC) unit would be needed per multiplication. Though
slightly smaller bitwidths also had reasonable error constraints, no performance gains or
appreciable resource savings would have been achieved.

The communication parameters are provided by the user since they are merely a
function of the target RC platform, which is a Nallatech H101-PCIXM card containing a
Virtex-4 LX100 user FPGA for this case study. The card is connected to the host CPU

Table 5-3. Analysis times for design spaces of MVA graph
Number of Revisions Analysis Time (ms)
0 (initial design) 10
1000 12
10000 33
100000 340

Although the algorithm structure is an SDF MoC, collecting quantitative parameters

from the individual computation tasks and communication operations remains very similar

to the MNW case study. The algorithm complexity is manifested in the number of tasks,

their dependencies, and their scheduling. The predicted execution time of the initial design

for the MVA graph is 0.17s, which is dominated by the slowest pipeline, T2-1 (Figure

5-9), due to its highest assigned workload. Strategic DSE can provide useful insight about

minimum execution rates for the other pipelines and allows a designer to construct the

slowest (and least resource-intensive) pipeline possible without increasing the overall

execution time for the application. For example, Figure 5-10 illustrates the predicted

execution time of the MVA graph based on 1000 design revisions that represent different

execution rates for the T4-2 pipeline. Pipeline rates for T4-2 above a certain threshold

have no impact on the total performance of the MVA graph because the execution time

remains dominated by the slower T2-1 pipeline. However, sufficiently slow rates for the

T4-2 pipeline increase the execution time of the MVA graph. Rapid determination of this

threshold is difficult without the DSE tool.
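The threshold behavior described above can be illustrated with a deliberately simplified sketch. It assumes the pipelines run concurrently and that the graph's execution time is the maximum of the per-pipeline times (workload divided by rate); the real MVA graph also has task dependencies and scheduling that this ignores. The workloads and rates are hypothetical, chosen only so that T2-1 takes 0.17 s, and are not values from the dissertation.

```python
# Hypothetical per-pipeline workloads (operations) and execution rates (ops/s).
workloads = {"T2-1": 1.7e7, "T4-2": 3.4e6}
base_rates = {"T2-1": 1.0e8, "T4-2": 4.0e7}


def graph_time(workloads, rates):
    """Execution time when concurrent pipelines are bounded by the slowest one."""
    return max(workloads[name] / rates[name] for name in workloads)


def sweep(pipeline, candidate_rates):
    """Predict the graph time for each candidate rate of one pipeline (one revision each)."""
    results = []
    for rate in candidate_rates:
        rates = {**base_rates, pipeline: rate}
        results.append((rate, graph_time(workloads, rates)))
    return results


def threshold_rate(pipeline, bottleneck):
    """Slowest rate for `pipeline` that does not exceed the bottleneck's time."""
    t_bottleneck = workloads[bottleneck] / base_rates[bottleneck]
    return workloads[pipeline] / t_bottleneck


for rate, t in sweep("T4-2", [1.0e7, 2.0e7, 4.0e7, 8.0e7]):
    print(f"T4-2 rate {rate:.1e} ops/s -> graph time {t:.3f} s")
print(f"threshold rate for T4-2: {threshold_rate('T4-2', 'T2-1'):.2e} ops/s")
```

Rates above the threshold leave the graph time unchanged, while slower rates make T4-2 the new bottleneck, which is the pattern the text attributes to Figure 5-10.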

Despite the large number of revisions, DSE using the integrated framework remained

tractable. Table 5-3 summarizes the framework analysis times, including the transfer of

the performance information and RAT prediction, for various numbers of revisions to the

design of the MVA graph. The analysis time grows approximately linearly with the size

of the design space. The longest analysis of 100,000 revisions took 340ms, which is nearly

indistinguishable by the user from the duration of a single analysis of a simple application

(e.g., 2.3ms for MNW). The primary limitation of broad DSE (aside from Java memory

requirements) is the ability of the user to efficiently digest the generated prediction values.

Depending on the FPGA platform architecture and mapping, computation and

communication within a stage is either serialized or overlapping with the total execution

time of the stage, tstage, defined as a number of iterations, Nstageiterations, of either the

sum or maximum of the tcomp and tcomm terms (Equation 4-14). Note that performance

estimates should be modeled for each unique stage of the application execution with

attention to any special cases, such as initial and final stages possessing more or less


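\[ t_{stage} = N_{stageiterations} \times \begin{cases} t_{comp} + t_{comm} & \text{(serialized)} \\ \mathrm{Max}(t_{comp},\, t_{comm}) & \text{(overlapping)} \end{cases} \tag{4-14} \]

t_stage: total execution time of the application stage

N_stageiterations: number of stage-level iterations

Application Model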

The RATSS application-level model describes the scheduling of the individual stages

of task execution to estimate the full system performance. Applications consist of one

or more distinct stages of execution, which may be collectively repeated for one or more

iterations. Analogous to the computation and communication scheduling, application

stages can either be serialized or overlapping. Figure 4-3 provides an example timing

diagram for potential iterative behavior at the application level. The example consists of

three stages collectively repeated twice, reinforcing the multi-level iterative behavior of the

stage and application models as first described in Figure 4-2. This sample application does

not illustrate all potential platform and algorithm features such as multi-level networks,

but instead reinforces the ability of RATSS to organize execution paths into hierarchical


Equation 4-15 defines the set, S_stage, of s stage times, t_stage, for the application.

The total execution time for the application, t_application, is the number of iterations,

instead of the theoretical 24). Also, this lower throughput accounted for extra overhead

time involved with polling the FPGA for completion of computation.

The total execution time for the FPGA is also measured using the wall-clock time,

rather than calculated from Equation (3-5), to ensure maximum accuracy. Additional

factors may be present in the total time that are not accounted in the individual

communication and computation. In this case study, the total error was 1,' but the

communication and computation errors were only 1 and I' respectively. The discrepancy

is due to overheads with managing and regulating data transfers by the host CPU that are

not expressly part of the individual RAT models. The impact of these overheads and other

synchronization issues will vary depending upon the particular FPGA platform and size of

the overhead time relative to the total RC execution time. For 1-D PDF, the extra 8.5ms

was significant compared to the 75ms of execution time. The relatively low resource usage

in Table 3-4 illustrates a potential for further speedup by including additional parallel

kernels albeit at the risk of increasing the impact of the system overhead.

3.4 Additional Case Studies

Several case studies are presented as further analysis and validation of the RAT

methodology: 2-D PDF estimation, coordinate calculation for LIDAR processing, the

traveling salesman problem, and molecular dynamics. Two-dimensional PDF estimation

continues to illustrate the accuracy of RAT for algorithms with a deterministic structure.

Coordinate calculation uses prediction on a communication-bound algorithm. Traveling

salesman explores a computation-bound searching algorithm with pipelined structure.

However, the molecular dynamics application serves as a counterpoint given the relative

difficulty of encapsulating its data-driven, non-deterministic computations. A diverse

collection of vendor platforms, Nallatech, Cray, SRC, and XtremeData, is used for 2-D

PDF, LIDAR, traveling salesman, and molecular dynamics, respectively. Each of these

case studies has single-buffered communication and computation. As with one-dimensional

node time. Because of the equal load balancing among the nodes, the total computation

time is equivalent to the performance of any i-th node in the system.

\[ S_{comp} = \{ t_{fpga_1}, \ldots, t_{fpga_p} \} \tag{4-20} \]

\[ t_{comp}(S_{comp}) = \mathrm{Max}(S_{comp}) = t_{fpga_i} \tag{4-21} \]

The set of communication times, S_comm, summarizes network transactions involved
in the single stage of the case study. The scatter and write times, t_scatter and t_write,
represent the X and Y dimensions of the input data and the read and reduce times, t_read
and t_reduce, define the data collected after computation. From Equation 4-23, the total
communication performance, t_comm, is the summation of the individual, non-overlapping
network transactions.

\[ S_{comm} = \{ t_{scatter_x},\, t_{scatter_y},\, t_{write_x},\, t_{write_y},\, t_{read},\, t_{reduce} \} \tag{4-22} \]

\[ t_{comm}(S_{comm}) = t_{scatter_x} + t_{scatter_y} + t_{write_x} + t_{write_y} + t_{read} + t_{reduce} \tag{4-23} \]

The inputs to the RATSS system-level model are summarized in the first block

of Table 4-4. The only user-provided attribute is the number of stage-level iterations,

Niterations. The 2-D PDF application only requires one iteration of node and network

interaction (i.e., inter-node data distribution, node-level computation, inter-node data

collection). The individual node and network transaction times comprise the majority

of the input to the system model. The Scomp and Scomm attribute sets contain the

compute node and network transaction times from the respective models. The second

block of Table 4-4 summarizes the estimated performance of the total computation and

communication times, tcomp and tcomm from Equations 4-21 and 4-23.

The 2-D PDF estimation does not use an elaborate buffering scheme so the total

performance of the application, t_application, as defined by the stage time, t_stage, is the

16th Symp. Field-Programmable Custom Computing Machines (FCCM), Palo Alto,
CA, Apr. 14-15 2008.

[37] K. Shih, A. Balachandran, K. Nagarajan, B. Holland, C. Slatton, and A. George,
"Fast realtime LIDAR processing on FPGAs," in Proc. Engineering of Reconfigurable
Systems and Algorithms (ERSA), Las Vegas, NV, July 14-17 2008.

[38] S. Tschoke, R. Lubling, and B. Monien, "Solving the traveling salesman problem with
a distributed branch-and-bound algorithm on a 1024 processor network," in Proc.
Symp. Parallel Processing, Santa Barbara, CA, Apr. 25-28 1995.

[39] M. P. Allen and D. J. Tildesley, Computer Simulation of Liquids, Oxford University
Press, New York, 1987.

[40] D. A. Pearlman, D. A. Case, J. W. Caldwell, W. S. Ross, T. E. Cheatham III,
S. DeBolt, D. Ferguson, G. Seibel, and P. Kollman, "Amber, a package of computer
programs for applying molecular mechanics, normal mode analysis, molecular
dynamics and free energy calculations to simulate the structural and energetic
properties of molecules," Computer Physics Communications, vol. 91, no. 1-3, pp.
1-41, September 1995.

[41] M. Nelson, W. Humphrey, A. Gursoy, A. Dalke, L. Kalé, R. D. Skeel, and K. Schulten,
"NAMD: a parallel, object-oriented molecular dynamics program," Int'l J.
Supercomputer Applications and High Performance Computing, vol. 10, no. 4, pp. 251-268,

[42] M. I. Frank, A. Agarwal, and M. K. Vernon, "LoPC: modeling contention in parallel
algorithms," in Proc. 6th ACM SIGPLAN Symp. Principles and Practice of Parallel
Programming (PPOPP), 1997, pp. 276-287.

[43] T. Kielmann, H. E. Bal, and K. Verstoep, "Fast measurement of LogP parameters for
message passing platforms," in Proc. 15th IPDPS Workshop Parallel and Distributed
Processing, London, UK, 2000, pp. 1176-1183.

[44] J. L. Bosque and L. P. Perez, "Hloggp: a new parallel computational model for
heterogeneous clusters," in Proc. IEEE Symp. Cluster Computing and the Grid.

[45] J. L. Bosque and L. Pastor, "A parallel computation model for heterogenous
clusters," IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 13, 2006.

[46] A. Lastovetsky, I.-H. Mkwawa, and M. O'Flynn, "An accurate communication model
of a heterogenous cluster based on a switch-enabled ethernet network," in Proc. 12th
IEEE Int'l Conf. Parallel and Distributed Systems (ICPADS), Minneapolis, MN, July
12-15 2006.

[47] R. Kesavan, K. Bondalapati, and D. K. Panda, "Multicast on irregular
switch-based networks with wormhole routing," in Proc. Int'l Symp. High Performance
Computer Architecture (HPCA), San Antonio, TX, 1997, pp. 48-57.

made, key performance attributes are extracted, predictions are computed, and suitable

designs proceed to implementation. The goal is to establish the core, extensible model of

application and architecture performance that is efficient for use prior to implementation

and provides reasonably accurate results. Several application case studies are analyzed

with RAT and implemented within FPGA systems to verify the performance model and


For the second phase, the emergence of and continued interest in multi-FPGA systems

necessitates a methodology for multi-FPGA performance prediction to improve application

development. The RC Amenability Test for Scalable Systems (RATSS) is an expansion of

the RAT methodology encompassing larger FPGA systems and potentially higher degrees

of algorithm parallelism. A major challenge is the size and variety of communication

topologies in multi-FPGA platforms, which require varying amounts of parameterization

and analysis for accurate performance prediction. RATSS uses the synchronous iterative

model [1] for two modern platform architectures for RC systems [2], providing accurate

modeling for data-parallel algorithms typically structured as SIMD-style pipelines.

The third phase involves integration of the RAT analytical model with the RC

Modeling Language (RCML). Manual specification and analysis of applications becomes

increasingly inefficient as the FPGA platform size and algorithm complexity grow.

Ideally, algorithms, FPGA platform architectures, and system mappings are specified

using a modeling environment based on a model of computation and then analyzed

by performance estimation tools such as RAT. RCML is used because it provides an

efficient, intuitive, and scalable infrastructure specifically designed for FPGA systems.

The integration of RAT and RCML provides efficient design-space exploration through

tool-assisted translation of application specifications into prediction models and evaluation

of both an initial design and potential revisions to the algorithm or platform architecture.

The remainder of this document is structured as follows. Chapter 2 provides a brief

background about FPGA computing and related research for performance prediction and

Table 5-2. Analysis times for design spaces of MNW
Number of Revisions Analysis Time (ms)
0 (initial design) 2.3
1000 3.0
10000 15
100000 140


Figure 5-9. MVA graph specification and mapping

pipelines), but these analyses can be trivial for this computation-bound application due

to the direct correspondence between the rate of execution and the overall application

performance. Instead, Figure 5-8 illustrates the predicted execution time of MNW based

on 1000 design revisions that represent different problem sizes (i.e., comparisons of

500 to 50450 DNA sequences) divided among four FPGAs. These revisions expand the

initial four-FPGA design of 1500 DNA sequences. The execution time of MNW increases

exponentially with the number of DNA sequence comparisons. This DSE can help evaluate

the suitability of the MNW application for meeting the broad performance requirements of

a designer, particularly when the size of the sequence database is expected to increase at a

potentially unknown rate after implementation. The DSE tool took only 6.1ms to analyze

this significantly larger design space. Table 5-2 summarizes analysis times for the initial

design, the 1000 revisions, and two other large DSEs. The analysis times grow linearly

with the size of the design space and allow very large numbers of revisions to be explored

in significantly less than one second.


Figure 4-5. Platform Structure for 2-D PDF Estimation Case Study

Table 4-1. Node Attributes for 2-D PDF Estimation
Attribute         Units           2 Nodes      4 Nodes      8 Nodes
PL_comp           (cycles)        11           11           11
R_comp            (ops/cycle)     240          240          240
F_clock           (MHz)           195          195          195
N_comp_elements   (elements)      33,554,432   16,777,216   8,388,608
N_ops/element     (ops/element)   196,608      196,608      196,608
t_fpga            (s)             1.41E+2      7.05E+1      3.52E+1

of the pipeline latency, PLcomp, requires detailed knowledge of the final algorithm

structure. The pipeline for the 2-D PDF estimation has a straightforward computational

structure of three operations (subtraction, multiplication, and addition from Figure

4-4) requiring 11 total cycles. The relatively deep pipeline helps ensure a higher clock

frequency. The computational throughput, Rcomp, of 240 operations per cycle comes from

the 80 pipelined kernels, each with 3 simultaneous operations per pipeline. Predictions are

generated for a large range of possible frequencies. The prediction for the clock frequency,

F_clock, of 195 MHz is shown since it ultimately matched the maximum frequency for
later implementation. The number of computation elements, N_comp_elements, is 33,554,432
(64M÷2); 16,777,216 (64M÷4); and 8,388,608 (64M÷8) for the two, four and eight-node
configurations, respectively, due to the balanced data decomposition. The number of
operations per element, N_ops/element, is based on the 256x256 comparisons per data
element times 3 operations per element for a total of 196,608 operations.
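The node times in Table 4-1 can be checked from the attributes quoted above. The sketch below assumes the node time is the total operation count divided by the sustained rate (R_comp x F_clock), with the pipeline latency added as a one-time cost in cycles; the exact form of the node-time equation is not shown in this excerpt, so treat that as an assumption. The function name is illustrative.

```python
def fpga_node_time(n_elements, ops_per_element, r_comp, f_clock_hz, pl_comp=0):
    """Estimated node time in seconds.

    Assumption: cycles = (total ops / ops per cycle) + pipeline latency.
    For this case study the 11-cycle latency is negligible either way.
    """
    cycles = n_elements * ops_per_element / r_comp + pl_comp
    return cycles / f_clock_hz


OPS_PER_ELEMENT = 256 * 256 * 3        # 196,608 operations per element
TOTAL_ELEMENTS = 64 * 1024 * 1024      # 64M elements divided among the nodes

for nodes in (2, 4, 8):
    n_elements = TOTAL_ELEMENTS // nodes
    t = fpga_node_time(n_elements, OPS_PER_ELEMENT, r_comp=240,
                       f_clock_hz=195e6, pl_comp=11)
    print(f"{nodes} nodes: {n_elements:,} elements/node, t_fpga = {t:.3g} s")
```

The results agree with the t_fpga row of Table 4-1 to the three significant figures reported there.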

Table 3-10. Resource usage of LIDAR (XC2VP50)
FPGA Resource Utilization (%)
BRAMs 12
18x18 Multipliers 5
Slices 45

communication and computation may be necessary. Table 3-10 highlights the availability

of unused resources to expand the parallel computation but the benefit will be marginal

because of the communication-bound algorithm.

3.4.3 Traveling Salesman Problem

The traveling salesman problem (TSP) [38] is a particular version of the NP-complete

Hamiltonian path problem that locates the minimum length path through an undirected,

weighted graph in which each vertex (i.e. city) is visited exactly once. (Other derivations

of the Hamiltonian path problem include the snake-in-the-box, knight's tour, and the

Lovász conjecture.) For this algorithm, any city may be the starting point and all cities

are connected to every other city creating N! potential Hamiltonian paths, where N is

the number of cities. To accelerate the time to converge on a solution, heuristics are

sometimes employed to systematically search a subset of the solution space. However,

the algorithm for this case study performs an exhaustive search on all paths in the graph.

The specific algorithm formulation has significant ramifications not only on the hardware

performance but also on the prediction accuracy.

The case study targets SRC Computer's SRC-6 FPGA platform. Within the SRC-6,

the algorithm uses one of the Xilinx XC2V6000 user FPGAs in a single MAP-B unit.

The FPGA is connected to a host processor via the vendor's SNAP (memory DIMM slot)

interconnect. Nine depth-first traversals of the graph occur simultaneously on a single

FPGA starting from each of the nine different cities. Techniques such as branch and

bound are not used because each step of the search would be dependent on the previous

steps, thus preventing any pipelining. Instead, the algorithm starts with selecting N

arbitrary cities (and their N − 1 edges) all at once and is then followed by determination

S_{fpga} = \{ t_{fpga_1}, \ldots, t_{fpga_n} \}    (4-10)

S_fpga: set of FPGA execution times for the stage

n: total number of FPGA nodes

t_{comp} = overhead + \max(\max(S_{fpga}), t_{\mu P})    (4-11)

t_comp: total computation time for the stage

overhead: configuration, setup, and other overheads for the stage

Similarly, the set of communication times, Scomm, contains the performance estimates

for each of the r network transaction times, t_transaction (Equation 4-12). Typically, this

set will contain one or more input and output transactions for each level of network

communication in the platform, though some applications will instead accumulate partial

results within the FPGAs over multiple stages with cumulative output after the last

computation iteration. From Equation 4-13, the communication time for the stage,

t_comm, is composed of the sum of the r transaction times, t_transaction. Multiple levels of

communication within an application stage are assumed non-overlapping due to blocking.

However, non-blocking transactions can be modeled by the total network delay and longest

(i.e., maximum) throughput time.

S_{comm} = \{ t_{transaction_1}, \ldots, t_{transaction_r} \}    (4-12)

S_comm: set of transaction times for the stage

r: total number of communication transactions

t_{comm}(S_{comm}) = \sum S_{comm}    (4-13)

t_comm: total communication time for the stage
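The stage-level equations can be sketched in a few lines of Python. This is a minimal illustration, not the RATSS tooling: the function names and sample times are hypothetical, and the `host_time` argument reflects an assumed reading of the second term inside the outer maximum of Equation 4-11 as a microprocessor time.

```python
from typing import Sequence


def stage_computation_time(fpga_times: Sequence[float],
                           host_time: float = 0.0,
                           overhead: float = 0.0) -> float:
    """t_comp = overhead + max(max(S_fpga), host_time)."""
    if not fpga_times:
        raise ValueError("S_fpga must contain at least one FPGA time")
    return overhead + max(max(fpga_times), host_time)


def stage_communication_time(transaction_times: Sequence[float]) -> float:
    """t_comm = sum of the r blocking transaction times in S_comm."""
    return sum(transaction_times)


# Hypothetical stage: four FPGAs, one input and one output transaction.
s_fpga = [0.40, 0.42, 0.41, 0.39]   # seconds, illustrative values only
s_comm = [0.10, 0.03]               # seconds, illustrative values only

t_comp = stage_computation_time(s_fpga, host_time=0.05, overhead=0.02)
t_comm = stage_communication_time(s_comm)
print(f"t_comp = {t_comp:.2f} s, t_comm = {t_comm:.2f} s")
```

The slowest FPGA bounds the stage because all nodes must finish before the stage completes, while blocking transactions simply add.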

Figure 5-7. Overview of MNW case study. A) Calculation of normalized edit distances on
multiple FPGAs for MNW, with the (N² − N)/2 database comparisons distributed across
FPGAs 1 through 4. B) MNW algorithm specification and mapping (Start, MNW, and End tasks).

Table 5-1. Predicted and experimental results for MNW
           Predicted Time (s)   Experimental Time (s)   Error
1 FPGA     9.44E-1              9.58E-1                 1.5%
2 FPGAs    4.72E-1              4.83E-1                 2.3%
4 FPGAs    2.36E-1              2.46E-1                 4.1%
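The error column can be recomputed from the two time columns. The sketch below assumes the error is the absolute difference taken relative to the experimental time; that choice is an assumption, made because it reproduces the 1.5 and 4.1 entries for the one- and four-FPGA rows.

```python
# Predicted and experimental execution times (seconds) from Table 5-1.
RESULTS = {
    "1 FPGA": (9.44e-1, 9.58e-1),
    "2 FPGAs": (4.72e-1, 4.83e-1),
    "4 FPGAs": (2.36e-1, 2.46e-1),
}


def percent_error(predicted: float, experimental: float) -> float:
    """Prediction error relative to the measured time (assumed definition)."""
    return abs(experimental - predicted) / experimental * 100.0


for label, (predicted, experimental) in RESULTS.items():
    print(f"{label}: {percent_error(predicted, experimental):.1f}% error")
```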

database and the resulting values for the normalized edit distance are described using

the AMP MoC. From the algorithm specification (Figure 5-7B), each of the computation

tasks (Start, MNW, and End) and two communication connections requires a separate

analytical model. The performance of the software Start and End tasks is defined

by an execution-time attribute. The number of characters in the database of sequence

comparisons determines the amount of input communication (between Start and MNW)

and the amount of computation for MNW. The output communication (between MNW

and End) is defined by the number of sequence comparisons. The architecture model

(Figure 5-6) contains the parameters outlining the communication capabilities of the PCIe

interconnect. The FPGA clock frequency (architecture) and pipeline depth (algorithm)

parameters define the computation rate.

these tests are applied iteratively during the RAT analysis until a suitable version of

the algorithm is formulated or all reasonable permutations are exhausted without a

satisfactory solution. The throughput test is a suitable starting-point for an application

wishing to match the numerical precision and general architecture of a legacy algorithm.

However, starting with the numerical precision and resources tests to refine an application

prior to throughput analysis is equally viable.


Figure 3-1. Overview of RAT Methodology (flowchart: identify kernel and create design on
paper; perform RAT; revise the design if communication bandwidth, throughput, or precision
is insufficient; build in HDL or HLL; simulate design; verify on HW platform)

3.2.1 Throughput

For RAT, the predicted performance of an application is defined by two terms:

communication time between the CPU and FPGA, and FPGA computation time.

Reconfiguration and other setup times are ignored. These two terms encompass the

rate at which data flows through the FPGA and rate at which operations occur on that

data, respectively. Because RAT seeks to analyze applications at the earliest stage of

hardware migration, these terms are reduced to the most generalized parameters. The RAT

throughput test primarily models FPGAs as accelerators to general-purpose processors

with burst communication but the framework can be adjusted for applications with

streaming data.

Calculating the communication time is a relatively simple process given by

Equations (3-1), (3-2), and (3-3). The overall communication time is defined as the

Figure 4-4. Application Structure for 2-D PDF Estimation Case Study (data samples are
scattered from the head node through each host microprocessor and FPGA memory to the FPGA
computation; each node receives (64M)/P data elements in chunks of 8,192 elements, runs
80 pipelined kernels, and sends back 65,536 probability values for the final PDF estimate)

fixed point. The results are accumulated in 256×256 registers and periodically read

back by the host microprocessor. The resulting 256×256 partial sums on each of the P

nodes are collected with a reduce operation. More discussion on this 2-D PDF estimation

architecture along with general issues related to FPGA implementation can be found in


The intended platform for this case study is illustrated in Figure 4-5. The full

platform consists of eight 3.2GHz Xeon microprocessor nodes each connected to one Xilinx

XC4VLX100 (Nallatech H101 card) via a PCI-X bus. The processing nodes are organized

as a cluster of traditional computers each augmented with application acceleration

hardware (i.e., FPGAs). The on-chip Block RAMs (BRAMs) are explicitly illustrated

since they are used to store the input and output data for the 2-D PDF case study. The

microprocessor nodes are connected via Gigabit Ethernet. Network-level communication

uses the MPICH2 implementation of the Message Passing Interface (MPI). The case study

is modeled and the implementation is tested using 2, 4, and 8 FPGA nodes.

4.4.2 Compute Node Modeling

The node-level model consists of estimating the computation for the 2, 4, and 8

Nallatech-augmented compute nodes. The values in Table 4-1 consist of the computation

attributes, which are distilled from the structure of the 2-D PDF estimation algorithm as

mapped to the architecture of the FPGA node. For computation, accurate parameterization

and multi-FPGA systems but also efficient usage during formulation (i.e., strategic

design-space exploration). Known as the RC Amenability Test, RAT provides a framework

for prediction of potential speedup for a given high-level parallel algorithm applied

to a selected hardware target, so that a variety of strategic tradeoffs in algorithm and

architecture exploration can be quickly evaluated before undertaking weeks or months of

costly implementation. RAT performance prediction is scoped to maintain efficient and

reasonably accurate estimation relative to the FPGA system size and complexity.

Central to RAT is the analytical performance model and the methodology for its

application to a range of algorithms and FPGA platform architectures. The key aspects

of communication and computation within the FPGA system are parameterized and used

by the RAT model to estimate the total application performance (i.e., execution time and

speedup). The prediction efficiency and reliability are increased via tool-assisted parameter

extraction and performance estimation from explicit algorithm, architecture, and system

specifications (also referred to as models). The need for the RAT methodology stemmed

from common difficulties encountered during several FPGA application development

projects. Researchers would typically possess a software application but would be unsure

about potential performance gains in hardware. The level of experience with FPGAs

would vary greatly among the researchers and inexperienced designers were often unable

to quantitatively project and compare possible algorithmic design and FPGA platform

choices for their application. Many initial predictions were haphazardly formulated and

performance estimation methods varied greatly. Consequently, RAT was created to

consolidate and unify the performance prediction strategies for faster, simpler, and more

effective analyses.

This research is divided into three phases. In the first phase, the focus of the

performance prediction model is on systems with a single FPGA connected directly

to a microprocessor. Applications proceed in iterations of writing data to the FPGA,

performing computation, and reading results from the FPGA. Design choices are

Table 3-8. Input parameters of LIDAR
Dataset Parameters
  N_elements, input (elements)
  N_elements, output (elements)
  N_bytes/element (bytes/element)
Communication Parameters (Cray)
  throughput_ideal (MB/s)
  α_write (0 < α < 1)
  α_read (0 < α < 1)
Computation Parameters
  N_ops/element (ops/element)
  throughput_proc (ops/cycle)
  f_clock (MHz) 100/1
Software Parameters
  t_soft (sec)
  N_iter (iterations)





X P(CyCrSO0 C.' rCQpSO S.' 9,SO)) + Xc

Y p(sycQSO S.' cCpCO C.' ,eO)) + Y, (3 12)


Table 3-8 summarizes the RAT input parameters of the algorithm for coordinate

calculation. The input data size of 33,000 elements is based on one second of LIDAR

returns (i.e. the time between GPS updates). A corresponding number of GPS coordinates

is returned by the calculations. The X, Y, and Z dimensions of the LIDAR returns and

GPS coordinates each use a 16-bit fixed-point format. A total of 48 bits is sent using

the 64-bit (8-byte) RapidArray interconnect. This channel has a documented theoretical

throughput of 1.6GB/s per direction but microbenchmarking indicates only half the rate

is achievable for these data transfers. Because the computation is pipelined, the number

of operations per element is synonymous with the number of elements. The pipeline can

process one operation (i.e. element) per cycle. The exact depth of the pipeline is not

known a priori but the extra latency is presumed negligible when compared to the size of

the dataset. A range of clock frequencies is examined to predict the scope of the overall




Figure 3-3. Trends for Computational Utilization in SB and DB Scenarios (single-buffered
and double-buffered curves plotted against t_comp from 0 to 2 s, with t_comm = 0.5 s)
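The trends named in the caption can be reproduced with a small sketch. The utilization definitions below are assumptions rather than formulas quoted from this section: single buffering (SB) is taken to serialize communication and computation within each iteration, and double buffering (DB) is taken to overlap them so that the longer of the two sets the iteration time.

```python
def utilization_single_buffered(t_comm: float, t_comp: float) -> float:
    """Fraction of each iteration spent computing when transfer and compute serialize."""
    return t_comp / (t_comm + t_comp)


def utilization_double_buffered(t_comm: float, t_comp: float) -> float:
    """Fraction of each iteration spent computing when transfer and compute overlap."""
    return t_comp / max(t_comm, t_comp)


T_COMM = 0.5  # seconds, the fixed communication time shown in the figure
for t_comp in (0.25, 0.5, 1.0, 1.5, 2.0):
    sb = utilization_single_buffered(T_COMM, t_comp)
    db = utilization_double_buffered(T_COMM, t_comp)
    print(f"t_comp = {t_comp:4.2f} s  SB = {sb:.2f}  DB = {db:.2f}")
```

Under these assumptions the DB curve saturates once computation time reaches the communication time, while the SB curve only approaches full utilization asymptotically.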

3.2.2 Numerical Precision

Application numerical precision is typically defined by the amount of fixed- or

floating-point computation within a design. With FPGA devices, where increased

precision dictates higher resource utilization, it is important to use only as much precision

as necessary to remain within acceptable tolerances. Because general-purpose processors

have fixed-length data types and readily available floating-point resources, it is reasonable

to assume that often a given software application will have at least some measure of

wasted precision. Consequently, effective migration of applications to FPGAs requires a

method to determine the minimum necessary precision before any translation begins.

While formal methods for numerical precision analysis of FPGA applications are

important, they are outside the scope of this document. A plethora of research exists on

topics including maintaining precision with mixed data types [29], automated conversion

of floating-point software programs to fixed-point hardware designs [30], design-time

precision analysis tools for RC [31], and custom or dynamic bit-widths for maximizing

performance and area on FPGAs [32-35]. Application designs are meant to capitalize

on these numerical precision techniques and then use the RAT methodology to evaluate

the resulting algorithm performance. Numerical precision must also be balanced against

the type and quantity of available FPGA resources to support the desired format. For

revised not once but potentially hundreds or thousands of times depending on the breadth

of the design space and complexity of the algorithm.

Strategic DSE begins with identification of performance features for revisions. The

application designer may choose to annotate a parameter with one or more alternative

values denoting possible changes to the application design. The goal is to propose

revisions to specific features and compare the range of performance values against the

performance requirements of the designer. For example, several different pipeline rates

or clock frequencies may be evaluated. Also, scalability can be analyzed using revisions

that define progressively larger problem sizes and hardware resources. Alternatively,

different schedules can be evaluated by adjusting the ordering (i.e., priority) of messages to

outgoing communication channels. As illustrated by the case studies in Section 3.4, rapid

exploration of large design spaces can greatly aid design productivity. However, a designer

using the framework must ensure that the design space under investigation is realistic with

respect to the architectural constraints (e.g., maximum circuit size or clock frequency).
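The revision process described above can be sketched as a sweep over annotated parameter alternatives. Everything here is hypothetical: the parameter names, the sample values, the 0.06 s requirement, and the simple computation-time formula (total operations divided by operations per second) stand in for whatever model the framework actually evaluates.

```python
from itertools import product


def predicted_time(n_elements, ops_per_element, ops_per_cycle, f_clock_hz):
    """Illustrative computation-time model: total operations / operations per second."""
    return n_elements * ops_per_element / (ops_per_cycle * f_clock_hz)


def explore(base, alternatives, requirement_s):
    """Evaluate every permutation of the annotated alternatives against a requirement."""
    names = list(alternatives)
    results = []
    for values in product(*(alternatives[name] for name in names)):
        design = {**base, **dict(zip(names, values))}
        t = predicted_time(**design)
        results.append((design, t, t <= requirement_s))
    return results


# Hypothetical baseline design and annotated alternatives.
base = dict(n_elements=1_000_000, ops_per_element=10,
            ops_per_cycle=1, f_clock_hz=100e6)
alternatives = {
    "ops_per_cycle": [1, 2, 4],          # pipeline rates
    "f_clock_hz": [75e6, 100e6, 150e6],  # clock frequencies
}

results = explore(base, alternatives, requirement_s=0.06)
feasible = [r for r in results if r[2]]
print(f"{len(feasible)} of {len(results)} revisions meet the requirement")
```

A realistic sweep would also discard permutations that violate architectural constraints, such as a clock frequency the device cannot reach.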

5.4 Case Studies

This section describes two case studies, MNW and MVA graph, which demonstrate

the capabilities of the integrated framework for efficient (i.e., rapid and reasonably

accurate) strategic DSE. The experimental setup, including the construction of the DSE

tool bridging the RCML modeling environment with RAT performance prediction, is

discussed in Section 5.4.1. The MNW case study in Section 5.4.2 is a bioinformatics

application with an AMP MoC. This case study demonstrates accurate prediction, as

compared to subsequent hardware implementations, and rapid DSE. The MVA graph

in Section 5.4.3 contains a more complex network of pipelines with performance defined

by the SDF MoC. This case study maintains rapid DSE, even for very large numbers of

revisions to a complex algorithm structure.

system-wide network model and two node models: the microprocessor and FPGA.

However, the NNUS architecture will involve two networks: a local interconnect between

the FPGA and microprocessor and a system-wide network between the microprocessors.

Defining nodes based on their adjoining networks creates a consistent abstraction of

computation and communication for both prevailing system classes. This distinction

becomes increasingly important as the hierarchy of the FPGA platform increases in depth.

Ultimately, the collection of node and network models provides small, separable

descriptions of the complete computation capabilities and communication performance

of the FPGA platform. For each piece of computation in the application, the clock

frequency attribute for the FPGA defines the overall rate of execution. For each network

communication, quantitative attributes include the delay through the interconnect medium

(i.e., latency) and bandwidth for message transmission.

Application Scope

Strategic performance prediction requires application characteristics amenable to

quantification. An application encompasses an algorithm and its mapping to an FPGA


Algorithm: Finite number of hardware-independent tasks with explicitly defined

parallelism and ordering used to solve a problem.

Mapping: Algorithm's computation tasks assigned to nodes and data movement between

nodes assigned to one or more communication networks.

A complete description of an algorithm and its mapping must be provided by the

designer for effective performance modeling. The composition and parallelization

of algorithm tasks defines the computational load for each node and the required

communication for each network to support the application. Algorithm and mapping

features must be scoped to ensure quantitative characterization of computation and

communication interaction that is tractable for analytical modeling. Specifically,

the number of data elements, Nelements, and number of bytes per element, Nbytes/element

(Equation 4-7). Expressing data in terms of elements allowed more direct correlation

between the volume of computation and the amount of communication. However, the RAT

I/O model is adjusted to coincide with the LogGP formulation for consistency within the

network model.

g(k) = (R_{IO} \times Eff_{IO})^{-1}    (4-6)

R_IO: theoretical throughput rate of I/O channel (from RAT)

Eff_IO: efficiency of I/O channel (from RAT)

k = N_{IO\_elements} \times N_{bytes/element}    (4-7)

N_IO_elements: number of I/O elements (from RAT)

N_bytes/element: number of bytes per element (from RAT)

Equation 4-8 defines the set of attributes, S_IO, for the revised I/O transaction model,

which consists of the latency and overhead delay, L and o; size-dependent message gap,

g(k); number of nodes, P; message size, k; and the additional computation cost, γ, for the

communication, if any. Though the I/O model represents a point-to-point interconnect,

the P value remains to represent unidirectional (i.e., P=1) or bidirectional (i.e., P=2)

behavior. Equation 4-9 defines the communication for the I/O transaction, t_IO, by the

delay function, f_delay, for latency and overhead; number of nodes; message size; gap; and

additional computation as a function, f_cost, of the message size and γ cost value.


The second research phase outlines the expansion of the RAT analytical model for

efficient and reasonably accurate performance estimation of multi-FPGA systems prior to

hardware implementation. This chapter presents a brief introduction to the challenges and

objectives of multi-FPGA RAT (Section 4.1); background and related work on multi-node

performance modeling (Section 4.2); assumptions, quantitative attributes, analytical

models, and scope of the expanded model (Section 4.3); a detailed walkthrough of a

reasonably complex application, 2-D PDF estimation (Section 4.4); two additional case

studies, image filtering and molecular dynamics (Section 4.5); and conclusions (Section


4.1 Introduction

The reformation towards explicitly parallel architectures and algorithms is accompanied

by increased emphasis on multi-device systems for achieving additional performance

benefits. However, exploiting parallelism for scalable FPGA systems requires even

more expensive development cycles, which further limits widespread adoption of RC.

Current design approaches focus on faster coding paths to device-level implementations

(e.g., high-level synthesis), which address only one symptom of the greater productivity
challenge for FPGA systems.

In contrast to development practices based on iterative implementation, strategic

design-space exploration (DSE) is needed to improve productivity with scalable systems.

Parallel applications for multi-FPGA platforms should be planned and performance issues

analyzed prior to implementation, narrowing the range of possible algorithm and systems

mappings based on the performance requirements. For the second phase of research, the

RC Amenability Test for Scalable Systems (RATSS) extends RAT from Chapter 3 to

multi-FPGA systems by incorporating key concepts from traditional analytical modeling

(e.g., BSP [9] and LogP [10]). This new model produces a comprehensive performance

Figure 4-8. Algorithm Structure for Image Filtering Case Study (P: image pixels; F: 3×3
filter; N: new image; primary and secondary FPGAs)

Table 4-7. Node Attributes for Image Filtering
Attribute          Units                 Value
PL_comp            cycles                0
R_comp             operations/cycle      34
F_clock            MHz                   100
N_comp_elements    elements              349,448
N_ops/element      operations/element    17

quickly and accurately determine a priori. However, the pipeline latency, PL_comp, should

be negligible with respect to the volume of data. Both FPGAs contain a pipelined filtering

kernel that calculates the nine multiplications and eight additions for the convolution

for a total computational throughput, R_comp, of 34 operations per cycle (2 FPGAs × (9 + 8)

operations). The clock frequency, F_clock, for the MAP-B node is fixed at 100MHz. An image

size of 418×418 pixels, limited by the size of the MAP-B SRAM, is used for this case study

though larger sizes can be simulated by repeatedly looping through the memory. In contrast to the

previous case study, each FPGA of each node needs the complete data set (i.e., image)

because each kernel convolves a different filter. Consequently, the effective number of data

elements, N_comp_elements, per node is 349,448 (418 pixels × 418 pixels × 2 FPGAs). Again, each

element requires nine multiplications with the 3×3 filter and eight subsequent summations

for a total of 17 operations per element, N_ops/element.
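The node attributes above follow from a few products, sketched below. The final computation-time line is an assumption for illustration: it divides total operations by the rate of operations per second and treats the pipeline latency as zero, which is consistent with the attribute values but is not a formula quoted in this passage.

```python
# Image filtering node attributes, derived from the quantities in the text.
IMAGE_WIDTH = IMAGE_HEIGHT = 418     # pixels, limited by MAP-B SRAM
FPGAS_PER_NODE = 2
MULTIPLIES, ADDS = 9, 8              # 3x3 convolution per pixel
F_CLOCK_HZ = 100e6                   # MAP-B clock frequency

n_comp_elements = IMAGE_WIDTH * IMAGE_HEIGHT * FPGAS_PER_NODE
n_ops_per_element = MULTIPLIES + ADDS
r_comp = FPGAS_PER_NODE * (MULTIPLIES + ADDS)   # operations per cycle

# Assumed model: time = total operations / (clock rate * operations per cycle).
t_comp = n_comp_elements * n_ops_per_element / (F_CLOCK_HZ * r_comp)

print(f"elements per node: {n_comp_elements:,}")
print(f"operations per element: {n_ops_per_element}, per cycle: {r_comp}")
print(f"estimated node computation time: {t_comp * 1e3:.3f} ms")
```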

Table 4-8 defines the application-specific network attributes for the SNAP network

model. Two nodes, P, with four total FPGAs are used for this case study. The pipelined

(streaming) computation is structured using shift registers and requires three new

Figure 4-6. Results of Efficiency Microbenchmarks for Nallatech BRAM I/O. A) Range of
transfer sizes. B) Large transfer sizes. (Read and write efficiency plotted against
transfer size in bytes.)

which defines the effective throughput for a given transfer size. The I/O latency and

overhead, L_IO + o_IO, is assumed to be the total transfer time for a very small transfer (i.e.,

4B of data), which is dominated by the channel delay of the PCI-X bus. For writes and

reads with the FPGA block RAM, the measured performance is 1.60E-5 s and 3.20E-5 s,

respectively. The gap, g(k), for the write and read I/O transactions is the multiplicative

inverse of the product of the roughly 1GB/s (i.e., 133MHz, 64-bit PCI-X) theoretical

throughput, R_IO, and the I/O efficiency, Eff_IO. These particular g(k) values are determined

by the message size, k, which is defined by the number of I/O elements, N_IO_elements, and

number of bytes per element, N_bytes/element. For the write I/O, the 64M input data elements

for each of the X and Y dimensions are divided among the 2, 4, and 8 nodes for the

N_IO_elements term. Again, these write transfers are divided into blocks of 8,192 elements,

meaning 64M÷8,192÷P distinct transfers. The output (i.e., read I/O) involves collecting the

65,536 (256×256) elements storing the partial PDF estimates for each of the 64M÷8,192÷P

iterations. Though the computation is 18-bit fixed point, the data format for the I/O

transfers is 32-bit integer and consequently the number of bytes per element, N_bytes/element, is 4.
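The transfer counts and message sizes in this paragraph can be tabulated with a short sketch. It assumes "64M" denotes 64 × 2^20 elements and that the write message size applies per dimension per block; the function and key names are illustrative only.

```python
TOTAL_ELEMENTS = 64 * 2**20    # assumed meaning of "64M" input elements per dimension
BLOCK_ELEMENTS = 8192          # elements per write transfer
PDF_ELEMENTS = 256 * 256       # partial PDF estimates read back per iteration
BYTES_PER_ELEMENT = 4          # 32-bit integer I/O format


def io_plan(nodes: int) -> dict:
    """Per-node I/O attributes for the 2-D PDF estimation case study."""
    return {
        "write_elements": TOTAL_ELEMENTS // nodes,
        "transfers": TOTAL_ELEMENTS // BLOCK_ELEMENTS // nodes,
        "write_message_bytes": BLOCK_ELEMENTS * BYTES_PER_ELEMENT,
        "read_message_bytes": PDF_ELEMENTS * BYTES_PER_ELEMENT,
    }


for nodes in (2, 4, 8):
    plan = io_plan(nodes)
    print(f"P = {nodes}: {plan['transfers']:,} transfers of "
          f"{plan['write_message_bytes']:,} B written, "
          f"{plan['read_message_bytes']:,} B read per iteration")
```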

Equation 4-17 defines the performance for PCI-X write and read transactions,

t_transaction_write/read, by the I/O latency and overhead, L_IO + o_IO, gap value for the message


number of nodes; the short or long-message gap; and the additional computation, if any, as

a function, f_cost, of the amount of data and γ cost value.

S_{transaction} = \{ L, o, g, G, P, k, \gamma \}    (4-4)

S_transaction: set of attribute values for the specific transaction

L, o: LogGP latency and overhead attributes, respectively

g, G: LogGP short and long-message gap, respectively

P, k: LogGP number of nodes and message size, respectively

γ: additional computation for operations such as reduce

t_{transaction}(S_{transaction}) = f_{delay}(L, o) + f_{quantity}(P, k) \times [g \text{ or } G] + f_{cost}(P, k, \gamma)    (4-5)

t_transaction: total time for the network transaction

f_delay(): function defining delay w.r.t. L and o

f_quantity(): function defining total data quantity w.r.t. P and k

f_cost(): function defining additional computation cost (e.g., reduce)
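The structure of the transaction equation can be sketched as a function that takes the three component functions as arguments. The class name, the sample attribute values, and the particular point-to-point choices below are hypothetical; the delay and quantity functions follow the standard LogGP cost of a single k-byte message (o + (k − 1)G + L + o), used here only as one possible instantiation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Transaction:
    """Attribute set for one network transaction (illustrative container)."""
    L: float            # latency (s)
    o: float            # overhead (s)
    gap: float          # g or G, whichever applies to this message size (s per unit)
    P: int              # number of nodes
    k: int              # message size
    gamma: float = 0.0  # additional computation cost, if any


def transaction_time(s: Transaction,
                     f_delay: Callable[[float, float], float],
                     f_quantity: Callable[[int, int], float],
                     f_cost: Callable[[int, int, float], float]) -> float:
    """t = f_delay(L, o) + f_quantity(P, k) * gap + f_cost(P, k, gamma)."""
    return f_delay(s.L, s.o) + f_quantity(s.P, s.k) * s.gap + f_cost(s.P, s.k, s.gamma)


# Hypothetical point-to-point send of a 1 MB message with no extra computation.
send = Transaction(L=5e-6, o=1e-6, gap=1e-8, P=2, k=1_000_000)
t_send = transaction_time(
    send,
    f_delay=lambda L, o: L + 2 * o,       # sender and receiver overhead plus latency
    f_quantity=lambda P, k: k - 1,        # long-message gaps between k bytes
    f_cost=lambda P, k, gamma: 0.0,       # no reduce-style computation
)
print(f"point-to-point time: {t_send * 1e3:.3f} ms")
```

Passing the component functions as arguments keeps one evaluation routine usable for different collective patterns, since only the three callables change.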

In contrast to the multi-node network model, I/O interconnects between microprocessors

and FPGA accelerator cards often exhibit a highly variable gap over a range of message

sizes. However, the different gap values for a range of data sizes and transfer types (i.e.,

DMA to BRAM or read from registers) can be collected prior to application analysis using

microbenchmarks and reused on future applications with similar I/O communication.

These attribute values are either collected into a table for reference or used to construct an

explicit g(k) function. From Equation 4-6, the original RAT model separated individual

gap values into the theoretical interconnect throughput, R_IO, and the efficiency of the

interconnect for the message size, Eff_IO. Similarly, the message size was decomposed into

cost due to potential reuse for analysis of future applications with similar platform

mapping. Application-specific attributes such as the quantity of data and amount of

computational parallelism are explicitly specified by the user based on the algorithm and

platform mapping. These attributes feed the equations described in Section 4.3.2 which

compute the performance estimate. Based on the model results, the designer may further

refine the application or proceed to low-level implementation and analysis.

The accompanying case studies presented in this chapter primarily emphasize

performance prediction for the final configuration of an application prior to implementation.

However, the authors expect that strategic DSE will explore multiple options for algorithm

structure and platform mapping, which would involve several repetitions of RATSS

analysis with comparison of the predicted performance values against the designer's

expectations. The key performance criterion explored in this chapter is execution time,

but issues of application scalability, resource utilization (e.g., load balancing), power-delay

product, etc. can also be inferred from the RATSS analysis. Analyses of these issues are

not limited to physically realizable systems and can project capabilities of future system


4.3.2 Model Attributes and Equations

This section discusses the attributes, equations, and general approach of the node

and network models along with their arrangement into stage- and application-level models

for RATSS hierarchical performance prediction. The attributes and equations for the

node and network models leverage existing research from RAT [50] and LogGP [11]

to construct computation and communications models. The platform and algorithm

scope provide efficient quantification of performance features of the computation and

communication, which serves as input to the analytical models. Essentially, both

computation and communication represent the time cost of data movement through a

component (e.g., FPGA or interconnect). Equation 4-1 defines the general structure for

node and network performance as the delay overhead through medium/architecture, delay;

(or statistically observed) rate. Classification of these basic operations is straightforward

because tasks perform only computation and connections facilitate only communication.

The quantitative attributes for the computation tasks include the amount of data to be

processed and the cost of processing each data element, which are contained within the

particular task specification. Similarly, quantitative attributes for algorithm connections

define the amount of data and segmentation for transfers between tasks, which are

contained within the connection specification. Computation and communication models

for tasks and connections, respectively, are provided with corresponding architectural

information (e.g., FPGA clock frequency or interconnect bandwidth) by the framework

translation based on the application mapping. For scheduling, the key difference between

the two MoCs is the specificity of the overlap of task execution. For AMP, task execution

is dependent only upon the order of communication messages from its predecessor tasks

(i.e., those prior tasks which provide data to the current task). In contrast, SDF models

assume simultaneous, fine-grain operation of all tasks and connections, typically as a

pipeline operating on individual data elements within one or more streams of data. In

practice, AMP is sufficient for serializing communication between microprocessors and

FPGA application accelerators (e.g., MNW) whereas SDF is useful for describing multiple

directly connected pipelines (e.g., MVA graph).

5.3.2 Orchestration

Strategic DSE involves evaluating a range of application designs to determine the

most desirable configuration. Design alternatives may differ in multiple facets including

the algorithm requirements (e.g., problem size) and architectural capabilities (e.g., clock

frequency). The framework supports strategic DSE by repetitively revising an application

specification and evaluating the resulting performance against other design alternatives.

RAT is provided different sets of quantitative performance features, which typically

represent several permutations of one or more attributes. (DSE based on major revisions

to the application mapping is outside the scope of this dissertation.) Predictions are

Collectively, this research contributes an analytical model and accompanying

methodology for performance prediction where such work was lacking for FPGA

development. RAT and RATSS demonstrated high applicability to a variety of algorithms,

platform architectures, and system mappings and provide a formalized infrastructure

for integrated, efficient, reliable, and reasonably accurate aid with respect to the

prediction aspect of design-space exploration. This research contributed not only to

analytical modeling for FPGA performance estimation but also to modeling languages

and design patterns for RC. Future directions for research include incorporation of the

RAT prediction (and the integrated framework) into a larger methodology for more

fully automated design-space exploration (e.g., automated mapping, evaluation, and

optimization) with integrated bridging to design-level implementation code.


Figure 5-8. Predicted execution times of MNW on four FPGAs based on revisions to the
number of DNA sequences for comparison (x-axis: number of DNA sequences, 0 to 60,000)

Table 5-1 summarizes the predicted and experimental execution times for MNW

using 1500 DNA sequences (i.e., (1500² − 1500)/2 comparisons) divided across 1, 2, and 4

FPGAs. The predicted execution times were generated by RAT based on the quantitative

performance information provided by the framework. The experimental execution times

were measured from subsequent hardware implementations that correspond to the

application specification. Based on the 1.5% to 4.1% error rate, the integrated framework

was able to maintain reasonable accuracy during the abstract application specification,

collection of quantitative performance parameters, and resulting performance prediction.

Generating the abstract specification took only a few minutes and the subsequent analysis,

as directed by the framework, took approximately 2.3ms. The productivity gained by

using the framework is significant because the actual hardware implementation for MNW

required approximately 200 man-hours to code, place and route, debug, and evaluate.

Beyond the initial prediction, evaluating the performance impact of alternative

MNW designs can provide insight about the desirability of possible structural or behavioral

revisions. The DSE tool can explore different architectural optimizations (e.g., faster

summation of the read and write components. For the individual reads and writes, the

problem size (i.e., number of data elements, N_elements) and the numerical precision (i.e.,

number of bytes per element, N_bytes/element) must be decided by the user with respect

to the algorithm. Note that for these equations, the problem size only refers to a single

block of data to be buffered by the FPGA. All read or write communication

for the application need not occur as a single transfer but can instead be partitioned

into multiple blocks of data to be independently sent or received. Multiple transfers are

considered in a subsequent equation. The theoretical bandwidth of the FPGA/CPU

interconnect on the target platform (e.g., 133MHz 64-bit PCI-X which has a documented

maximum throughput of 1GB/s) is also necessary but is generally provided either

with the FPGA subsystem documentation or as part of the interconnect standard. An

additional parameter, α, represents the fraction of ideal throughput performing useful

communication. The actual sustained performance of the FPGA interconnect is typically a

fraction of the documented transfer rate.

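$$ t_{comm} = t_{read} + t_{write} \tag{3-1} $$

$$ t_{read} = \frac{N_{elements} \cdot N_{bytes/element}}{\alpha_{read} \cdot throughput_{ideal}} \tag{3-2} $$

$$ t_{write} = \frac{N_{elements} \cdot N_{bytes/element}}{\alpha_{write} \cdot throughput_{ideal}} \tag{3-3} $$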
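As a rough numerical sketch of Equations (3-1) through (3-3), the following Python assumes that each transfer time is the block size in bytes divided by the sustained throughput (α times the ideal rate). The function names and the sample inputs (512 four-byte elements, α values of 0.25 and 0.5, a 1 GB/s interconnect) are illustrative assumptions, not values from the case studies.

```python
def transfer_time(n_elements, bytes_per_element, alpha, ideal_throughput):
    """Seconds to move one block of data across the FPGA/CPU interconnect."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    block_bytes = n_elements * bytes_per_element
    # Sustained throughput is modeled as a fraction alpha of the ideal rate.
    return block_bytes / (alpha * ideal_throughput)


def comm_time(n_elements, bytes_per_element, alpha_read, alpha_write,
              ideal_throughput):
    """Communication time for one block: read time plus write time."""
    t_read = transfer_time(n_elements, bytes_per_element, alpha_read,
                           ideal_throughput)
    t_write = transfer_time(n_elements, bytes_per_element, alpha_write,
                            ideal_throughput)
    return t_read + t_write


# Hypothetical inputs: 512 elements of 4 bytes, ideal rate of 1e9 bytes/s.
t_comm = comm_time(512, 4, alpha_read=0.25, alpha_write=0.5,
                   ideal_throughput=1.0e9)
print(f"t_comm = {t_comm:.4e} s")
```

Keeping α separate from the ideal rate mirrors the text: either factor can be changed independently when exploring a different interconnect.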

Microbenchmarks composed of simple data transfers can be used to establish the

true communication throughput. The communication times for these block transfers are

measured and compared against the theoretical interconnect throughput to establish the

α parameters. It is important for microbenchmarks to closely match the communication

methods used by real applications on the target FPGA platform to accurately model

the intended behavior. In general, microbenchmarks are performed over a wide range

5-10 Predicted execution times of the MVA graph

5-11 Architecture specification of Novo-G system

key performance characteristics from the designer's specification. Such models prevent

wasted implementation effort by identifying unrealizable designs and reducing the revisions

necessary to achieve performance requirements.

RATSS provides an efficient and reasonably accurate analytical model for evaluating

a scalable FPGA application prior to implementation based on the RAT methodology

from Chapter 4. RATSS boosts designer productivity by extending concepts from

component-level models to allow efficient abstraction and estimation of the computation

and communication features of FPGA applications. The RATSS model contributes a

hierarchical approach for agglomerating component descriptions into a full performance

estimate. RATSS performance prediction remains tractable by focusing on synchronous,

iterative computation models for the two major classes of modern high-performance FPGA


The 2-D PDF, image filter, and MD case studies illustrate performance modeling

for a range of problem sizes and ratios of computation-to-communication. These case

studies demonstrated nearly 90% prediction accuracy, which is considered sufficient

given the focus of RATSS on strategic application planning. The accuracy of both the

computation and communication models allows not only individual performance estimates

but also accurate predictions across a range of potential application configurations

including wide variations in problem sizes and computation-to-communication ratios.

Specifically, important performance tradeoffs such as increasing parallelism or decreasing

the communication rate can be efficiently evaluated with reasonable accuracy. These

case studies serve as motivation for broad design-space exploration with RATSS as

predictions are efficiently generated and reasonably accurate, which helps ensure the

eventual implementation is the most desirable design configuration.

[Diagram: the Modeling Environment Interface (algorithm model of basic operations and connections; architecture model of operation and connection rates; mapping and scheduling of operations and connections) feeds a tool-specific MoC abstraction (attributes of operations and connections; abstract MoC schedule), which yields the RAT computation, communication, and overlap models]

Figure 5-5. Translation of application specification information for RAT prediction

the groups of operations mapped to that resource and RAT communication models based

on data movement between hardware resources. A generic schedule that describes the

parallelism and overlap between the computation and communication is constructed from

the semantics of the MoC. Conversion between algorithmic MoCs and RAT performance

models is possible as the important structure and behavior of the application specification,

properly formatted, correspond directly to available computation and communication

models. The basic computation operations within an MoC are often generic abstractions

that require additional quantitative attributes from the application designer, specifically

formatted for the assumed technology (e.g. FPGAs), for translation to the RAT model.

For the DSE tool used in Section 3.4 (and by extension, any future tool connecting

modeling environments and RAT), the underlying translation step must be tuned

for the MoCs of interest, specifically AMP and SDF for the case studies. For FPGA

systems, these MoCs can be abstractly represented as a number of "tasks" (i.e., generic

encapsulations of groups of operations with detailed specification left to the application

designer) with the data movement through algorithm "connections." Tasks often represent

either pipelines or state machines, which imply structured execution at a deterministic

PDF estimation, the design emphasis is placed on throughput analyses because the overall

goal is to minimize execution time for these designs.

These RAT case studies represent a range of experiences with estimating computational

throughput based on different user backgrounds and prediction emphases. Consequently,

2-D PDF estimation, LIDAR, and traveling salesman focus on more exact throughput

parameterization in contrast to the conservative prediction in 1-D PDF. However, the

performance of molecular dynamics could not be reliably estimated prior to implementation

because of the difficulty of analyzing the complex and data-dependent algorithm

structure as described by a high-level language. Instead, the target throughput is

computed from the speedup requirements. While this prediction will be inaccurate if

the minimum throughput is unrealizable, the RAT estimation provides a starting point

for implementation and insight about the performance ramifications of a suboptimal


3.4.1 2-D PDF Estimation

As previously discussed, the Parzen window technique is applicable in an arbitrary

number of dimensions [36]. However, the two-dimensional case presents a significantly

greater problem in terms of communication and computation volume than the original 1-D

PDF estimate. Now 256 x 256 discrete bins are used for PDF estimation and the input

data set is effectively doubled to account for the extra dimension. The basic computation

per element grows from $(N - n)^2 + c$ to $(N_1 - n_1)^2 + (N_2 - n_2)^2 + c$, where $N_1$ and $N_2$

are the data sample values, $n_1$ and $n_2$ are the probability levels for each dimension, and $c$

is a probability scaling factor. But despite the added complexity, the increased quantity of

parallelizable operations intuitively makes this algorithm amenable to the RC paradigm,

assuming sufficient quantities of hardware resources are available.
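To make the growth in work concrete, the sketch below evaluates the per-bin term for one two-dimensional sample and counts the bin evaluations per sample. It assumes that every sample is evaluated against every one of the 256 x 256 bins; the function names and the sample values are illustrative only.

```python
BINS_PER_DIMENSION = 256  # from the text: 256 x 256 discrete bins


def kernel_term_2d(sample, level, c):
    """Per-bin term for one 2-D sample: squared distance to the bin plus c."""
    (s1, s2), (l1, l2) = sample, level
    return (s1 - l1) ** 2 + (s2 - l2) ** 2 + c


def terms_per_sample(bins=BINS_PER_DIMENSION):
    """Bin evaluations per sample, assuming every bin is visited once."""
    return bins * bins


# Hypothetical sample (3.0, 5.0) against probability levels (1.0, 2.0).
term = kernel_term_2d((3.0, 5.0), (1.0, 2.0), c=0.5)
print(term, terms_per_sample())
```

Each of these bin evaluations is independent of the others, which is the source of the parallelism noted above.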

Table 3-5 summarizes the input parameters for the RAT analysis for our 2-D PDF

estimation algorithm. Again, the computation is performed in a two-dimensional space, so

twice the number of data samples are sent to the FPGA. In contrast to the 1-D case, the

[Diagram: 1) Write to SRAM, 2) Calculation, 3) Read from SRAM]

Figure 4-9. Algorithm Structure for Molecular Dynamics Case Study

Table 4-10. Node Attributes for Molecular Dynamics
Attribute        Units                   Value
PLcomp           (cycles)                0
Rcomp            (operations/cycle)      1
Fclock           (MHz)                   100
Ncompelements    (elements)              8,192
Nops/element     (operations/element)    32,767

node, Nelements, is 8,192 (32,768/4). Each molecule's interaction is computed against every

other molecule for a total of 32,767 operations, Nops/element.
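The node attributes above can be combined into a rough computation-time estimate. The sketch below assumes the node-level time has the form (pipeline latency + elements x operations per element / operations per cycle) / clock frequency; this form is an assumption made for illustration, since the node equation itself is not reproduced here, and the function name is hypothetical.

```python
def node_compute_time(pipeline_latency_cycles, n_elements, ops_per_element,
                      ops_per_cycle, clock_hz):
    """Estimated seconds of FPGA computation for one node (assumed form)."""
    total_cycles = (pipeline_latency_cycles
                    + (n_elements * ops_per_element) / ops_per_cycle)
    return total_cycles / clock_hz


# Values from Table 4-10: negligible latency, one operation per cycle,
# 100 MHz clock, 8,192 molecules per node, 32,767 operations per molecule.
t_node = node_compute_time(0, 8192, 32767, 1, 100e6)
print(f"t_node = {t_node:.4f} s")
```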

Table 4-11 defines the application-specific attributes for the SNAP network model. A

total of four nodes, P, are used for the case study. The scatter message size is twice the

gather message size due to two copies of input data required to compute a molecular

interaction in a single cycle (i.e., two memory accesses per cycle). Also, the 4-byte,

single-precision x, y, and z dimensions of the molecule data are packed into two 8-byte

words. Thus, the scatter and gather message sizes, k, are 1,048,576B (32,768x4x8B)

and 524,288B (32,768x2x8B), respectively. Because of the single time step, only one

system-level iteration, Nsystemiterations, is required for this case study.
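The message-size arithmetic can be checked directly. In this short sketch the constant names are illustrative; the factor of four in the scatter size is read as two input copies times two 8-byte words per molecule, which is an interpretation of the description above.

```python
MOLECULES = 32768
WORD_BYTES = 8
WORDS_PER_MOLECULE = 2  # x, y, z packed into two 8-byte words
INPUT_COPIES = 2        # two copies so two molecules can be read per cycle

# Gather: one packed copy of the per-molecule results.
gather_bytes = MOLECULES * WORDS_PER_MOLECULE * WORD_BYTES
# Scatter: two copies of the packed input data.
scatter_bytes = INPUT_COPIES * gather_bytes

print(scatter_bytes, gather_bytes)
```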

Table 4-12 compares the results of the RATSS model with the subsequent implementation

of MD. Over 99% of the execution time is dominated by the FPGA computation, which

is highly deterministic. The model error for the node-level time, tnodes, is -0.0001 an

calculated at each time step based on the particles' masses and the relevant subatomic

forces. For this case study, the MD simulation is focused on the interaction of certain inert

liquids such as neon or argon. These atoms do not form covalent bonds and consequently

the subatomic interaction is limited to the Lennard-Jones potential (i.e., the attraction

of distant particles by van der Waals force and the repulsion of close particles based

on the Pauli exclusion principle) [39]. Large-scale MD simulators such as AMBER [40]

and NAMD [41] use these same classical physics principles but can calculate not only

Lennard-Jones potential but also the nonbonded electrostatic energies and the forces of

covalent bonds, their angles, and torsions, making them applicable to not only inert atoms

but also complex molecules such as proteins. The parallel algorithm used for this case

study was adapted from code provided by Oak Ridge National Lab (ORNL).

Figure 4-9 provides an overview of the MD case study. In slight contrast to the

image-filtering case study, four MAP-B nodes, one FPGA each, are used for MD. In

order to compare two molecules every clock cycle, two copies of the molecular data are

sent by the network-attached microprocessor to the primary FPGA of each node. (The

secondary FPGA is not used for this case study). Each copy contains the X, Y, and Z

dimensions of the molecular position data, requiring 2 SRAMs per copy for a total of 4

banks per node. Each MD kernel checks the distance of N/4 molecules against the other

$N - 1$ molecules, where N/4 is the number of molecules for each of the four nodes. If

the molecules are sufficiently close, the MD kernel calculates the molecular forces (and

subsequent acceleration) imparted on each other. The acceleration effects are accumulated

in the last two SRAM banks and transferred back to the network-attached microprocessor.

The node-level attributes for the MD case study are defined in Table 4-10. The

pipeline latency, PLcomp, is considered negligible for this case study due to the $O(N^2)$

computational complexity. One pipeline per node allows for one molecular interaction (i.e.,

operation) per cycle, Nops/cycle. Again, the clock frequency, Fclock, for the MAP-B nodes

is fixed at 100MHz. For this case study, the number of data elements (i.e., molecules) per

quantities of arithmetic or logical operations and registers. But a precise count is nearly

impossible without an actual hardware description language (HDL) implementation.

Above all other types of resources, routing strain increases exponentially as logic element

utilization approaches maximum. Consequently, it is often unwise (if not impossible) to fill

the entire FPGA.

Currently, RAT does not employ a database of statistics to facilitate resource

analysis of an application for complete FPGA novices. The usage of RAT requires

some vendor-specific knowledge (e.g. single-cycle 32-bit fixed-point multiplications with

64-bit resultants on Xilinx Virtex-4 FPGAs require four dedicated 18-bit multipliers).

Additionally, the user must consider tradeoffs such as using fixed resources versus logic

elements and computational logic versus lookup tables. Resource analyses are meant to

highlight general application trends and predict scalability. For example, the structure of

the molecular dynamics case study in Section 3.4 is designed to minimize RAM usage and

the parallelism is ultimately limited by the availability of multiplier resources.

3.2.4 Scope of RAT

The analytical model described in Section 3.2.1 establishes the basic scope of RAT as

a strategic design methodology to formulate predictions about algorithm performance

and RC amenability. RAT is intended to support a diverse collection of platforms

and application fields because the methodology focuses on the common structures and

determinism within the algorithm. Communication and computation are related to the

number of data elements in the algorithm. Effective usage of the performance prediction

models requires mitigation of variabilities in the algorithm structure such as data-driven

computation. Based on the complexity of the algorithm and architecture, the RAT model

may be used to directly predict performance or instead establish minimum throughput

requirements based on the desired speedup. RAT currently targets systems with a single

CPU and FPGA as a first step towards a broad RC methodology. The FPGA device

is considered a coprocessor to the CPU but can initiate some operations independently

5.4.2 Modified Needleman-Wunsch (MNW)

The MNW case study is an FPGA-optimized application for calculating the

normalized edit distance between two DNA sequences within the composite ESPRIT

application for metagenomics [62]. The normalized edit distance provides concise

quantitative insight about the similarity of two DNA sequences based on the length of

the sequences, the number of gaps in the global sequence alignment, and the number of

edits required to transform one sequence string into the other. The MNW application

pipelines the standard Needleman-Wunsch [63] calculations for individual alignment

scores and resulting global alignment with the ESPRIT calculation of the normalized edit

distance. The pipeline concurrently computes the alignment scores with the normalized

edit distance rather than computing the edit distance from the character representation of

the alignment as is done in software. Computing the edit distance in this way eliminates

the need to store a score matrix, significantly reducing the memory requirements for the

FPGA system. This case study is referred to as modified because the typical outputs

of Needleman-Wunsch, the score matrix and global alignment, are unnecessary after

the calculation of the normalized edit distance and are never retained. However, as

with traditional Needleman-Wunsch, MNW is often useful for comparing many pairs of

sequences of similar length as a batch.

Figure 5-7A provides a general overview of the algorithm structure. A database of

comparisons is built from the sequences and divided, round-robin, among the specified

number of FPGAs. A total of N sequences requires a database of $(N^2 - N)/2$ comparisons

since sequences are not compared against themselves and comparisons such as [2, 1] are

equivalent to [1, 2]. The initial configuration of this case study involves 1500 sequences,

each 105 characters in length. The resulting $(N^2 - N)/2$ normalized edit distances are collected

by the microprocessor after computation is complete.
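The size of the comparison database and its round-robin division can be sketched as follows. The function names are illustrative, and the split simply hands any remainder to the first FPGAs, which is one plausible reading of a round-robin distribution.

```python
def comparison_count(n_sequences):
    """Unordered pairs of distinct sequences: (N^2 - N) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def round_robin_share(total, n_fpgas):
    """Comparisons per FPGA when dealt out one at a time in rotation."""
    base, extra = divmod(total, n_fpgas)
    return [base + (1 if i < extra else 0) for i in range(n_fpgas)]


total = comparison_count(1500)
for n_fpgas in (1, 2, 4):
    print(n_fpgas, round_robin_share(total, n_fpgas))
```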

The framework queries the modeling environment for the quantitative performance

information necessary for RAT prediction. The communication of the DNA sequence

Similar to an element, one must also examine what is an "operation." Consider an

example algorithm composed of a 32-bit addition followed by a 32-bit multiplication.

The addition can be performed in a single clock cycle but to save resources the 32-bit

multiplier might be constructed using the Booth algorithm requiring 16 clock cycles.

Arguments could be made that the addition and multiplication would count as either two

operations (addition and multiplication) or 17 operations (addition plus 16 additions,

the basis of the Booth multiplier algorithm). Either formulation is correct provided

that the computational throughput is formulated with the same assumption about the scope of an

operation. Often, deterministic and highly structured algorithms are better viewed with

the number of operations synonymous with the number of cycles. In contrast, complex

or nondeterministic algorithms tend to be viewed as an abstract number of

operations with an average rate of execution. Ultimately, either choice is viable and left to

the preference of the user.
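The equivalence of the two counting conventions can be illustrated with exact arithmetic. The sketch assumes the adder and the 16-cycle multiplier run back to back without pipelining, so one element occupies 17 cycles; the function name is hypothetical.

```python
from fractions import Fraction


def cycles_per_element(ops_per_element, ops_per_cycle):
    """Cycles spent on one element for a given operation count and rate."""
    return Fraction(ops_per_element) / Fraction(ops_per_cycle)


# Convention 1: two operations (add, multiply) at 2/17 operations per cycle.
coarse = cycles_per_element(2, Fraction(2, 17))
# Convention 2: seventeen single-cycle operations at 1 operation per cycle.
fine = cycles_per_element(17, 1)

print(coarse, fine)
```

Both conventions give the same cycle count because the throughput is scaled by the same definition of an operation that was used for the count.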

Figure 3-2 illustrates the types of communication and computation interaction to be

modeled with the throughput test. Single buffering (SB) represents the most simplistic

scenario with no overlapping tasks. However, a double-buffered (DB) system allows

overlapping communication and computation by providing two independent buffers to

keep both the processing and I/O elements occupied simultaneously. Since the first

computation block cannot proceed until the first communication sequence has completed,

steady-state behavior is not achievable until at least the second iteration. However, this

startup cost is considered negligible for a sufficiently large number of iterations.

The FPGA execution time, tRC, is a function not only of the tcomm and tcomp terms

but also the amount of overlap between communication and computation. Equations (3-5)

and (3-6) model both SB and DB scenarios. For SB, the execution time is simply the

summation of the communication time, tcomm, and computation time, tcomp. With the

DB case, either the communication or computation time completely overlaps the other

term. The smaller latency essentially becomes hidden during steady state. The DB case


[1] M. Smith and G. Peterson, "Parallel application performance on shared high
performance reconfigurable computing resources," Performance Evaluation, vol. 60,
pp. 107-125, April 2005.

[2] T. El-Ghazawi, E. El-Araby, M. Huang, K. Gaj, V. Kindratenko, and D. Buell, "The
promise of high-performance reconfigurable computing," Computer, vol. 41, no. 2, pp.
69-76, Feb. 2008.

[3] D. Pellerin and S. Thibault, Practical FPGA Programming in C, Prentice Hall Press.

[4] SRC Computers, SRC Carte C Programming Environment, 2007.

[5] Mitrionics, "Low power hybrid computing for efficient software
acceleration," updated 2008, cited May 2010, available from Hybrid-Computing-Whitepaper.pdf.

[6] Mentor Graphics, "Handel-C synthesis methodology," updated 2010, cited May 2009,
available from

[7] W. Wolf, "A decade of hardware/software codesign," Computer, vol. 36, no. 4, pp.
38-43, 2003.

[8] S. Fortune and J. Wyllie, "Parallelism in random access machines," in Proc. ACM
10th Symp. Theory of Computing, San Diego, CA, May 01-03 1978, pp. 114-118.

[9] L. G. Valiant, "A bridging model for parallel computation," Communications ACM,
vol. 33, no. 8, pp. 103-111, Aug. 1990.

[10] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos,
R. Subramonian, and T. von Eicken, "LogP: Towards a realistic model of parallel
computation," in Proc. ACM 4th Symp. Principles and Practice of Parallel
Programming, San Diego, CA, May 19-22 1993, pp. 1-12.

[11] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman, "LogGP:
Incorporating long messages into the LogP model for parallel computation," J.
Parallel and Distributed Computing, vol. 44, no. 1, pp. 71-79, 1997.

[12] T. Kielmann, H. E. Bal, and S. Gorlatch, "Bandwidth-efficient collective
communication for clustered wide area systems," in Proc. Int'l Parallel and
Distributed Processing Symp. (IPDPS), 1999, pp. 492-499.

[13] E. Grobelny, C. Reardon, A. Jacobs, and A. George, "Simulation framework for
performance prediction in the engineering of RC systems and applications," in Proc.
Int'l Conf. Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas,
NV, Jun 25-28 2007.

of possible data sizes. The resulting α values can be tabulated and used in future RAT

analyses for that FPGA platform. By separating the effective throughput into the

theoretical maximum and the α fraction, effects such as changing the interconnect

type and efficiency can be explored separately. This fidelity is particularly useful for

hypothetical or otherwise unavailable FPGA platforms.
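A table of α values could be derived from microbenchmark timings along the following lines. This is a minimal sketch: the transfer sizes and measured times are invented for illustration, and α is taken as the measured throughput divided by the theoretical maximum.

```python
def alpha_from_benchmark(bytes_moved, measured_seconds, ideal_throughput):
    """Fraction of the ideal throughput actually sustained by one transfer."""
    return (bytes_moved / measured_seconds) / ideal_throughput


IDEAL_THROUGHPUT = 1.0e9  # bytes/s, e.g. a nominal 1 GB/s interconnect

# Hypothetical microbenchmark results: transfer size in bytes -> seconds.
measurements = {2048: 1.0e-5, 65536: 1.0e-4, 1048576: 1.25e-3}

alpha_table = {
    size: alpha_from_benchmark(size, seconds, IDEAL_THROUGHPUT)
    for size, seconds in measurements.items()
}
for size, alpha in sorted(alpha_table.items()):
    print(f"{size:>8d} B  alpha = {alpha:.3f}")
```

A later analysis for the same platform would look up the α entry closest to its planned block size instead of rerunning the benchmark.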

Before further equations are discussed, it is important to clarify the concept of an

"element." Until now, the expressions "problem size," "volume of communicated data,"

and "number of elements" have been used interchangeably. However, strictly speaking, the

first two terms refer to a quantity of bytes whereas the last term has units of elements.

RAT operates under the assumption that the computational workload of an algorithm

is directly related to the size of the problem dataset. Because communication times are

concerned with bytes and (as will be subsequently shown) computation times revolve

around the number of operations, a common term is necessary to express this relationship.

The element is meant to be the basic building block which governs both communication

and computation. For example, an element could be a value in an array to be sorted,

an atom in a molecular dynamics simulation, or a single character in a string-matching

algorithm. In each of these cases, some number of bytes will be required to represent that

element and some number of calculations will be necessary to complete all computations

involving that element. The difficulty is establishing what subset of the data should

constitute an element for a particular algorithm. Often an application must be analyzed in

several separate stages, since each portion of the algorithm could interpret the input data

in a different scope.

Estimating the computational component, as given in Equation (3-4), of the RC

execution time is more complicated than communication due to the conversion factors.

Whereas the number of bytes per element is ultimately a fixed, user-defined value, the

number of operations (i.e. computations) per element must be manually measured

from the algorithm structure. Generally, the number of operations will be a function

network-level analysis, which then combine in the RATSS stage-level model. Based on

LogP and its derivatives (e.g., LogGP and the RAT communication model, to an extent),

the network model consists of parameters for the latency (i.e., physical interconnect

delay), L; overhead, o; message gap, g; number of "processors," P; and message size, k.

These attributes are analogous to the general terms from Equation 4-1 in that L and o

determine delay; P and k determine data quantity; and g determines the rate. The key difference
between LogP, LogGP, PLogP, and the RAT communication models is the gap parameter,

which is defined by the expected behavior of the particular network. Approximations

of the short-message gap, g, and long-message gap, G, of LogGP are often sufficient for

microprocessor networks such as Ethernet. However, PLogP and RAT define the gap as a

function of the message size, g(k).

Determining message size is a key issue for accurate performance prediction. The

general assumption is that each node, P, will contribute k bytes of data for the network

transaction. However, the message gap, g(k), is highly dependent not only on the volume

of data per node but also any subdivision of that data into multiple smaller transfers.

Typically, a large message will have less performance overhead than several smaller

messages. Ensuring the gap attribute accurately reflects the performance of the actual

message segmentation size will reduce modeling errors.
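The effect of segmentation can be illustrated with a toy gap function. The sketch below is not the RATSS network equation: it assumes a hypothetical g(k) made of a fixed 5-microsecond per-message cost plus a per-byte cost at 1 GB/s, and it simply sums g over the segments of a transfer.

```python
def example_gap(k_bytes):
    """Hypothetical g(k): fixed per-message cost plus per-byte cost."""
    return 5.0e-6 + k_bytes / 1.0e9


def segmented_time(total_bytes, segment_bytes, gap):
    """Total time when a transfer is split into segments of a given size."""
    n_full, remainder = divmod(total_bytes, segment_bytes)
    total = n_full * gap(segment_bytes)
    if remainder:
        total += gap(remainder)  # final, shorter segment
    return total


one_message = segmented_time(1048576, 1048576, example_gap)
sixteen_messages = segmented_time(1048576, 65536, example_gap)
print(f"{one_message:.6e} s vs {sixteen_messages:.6e} s")
```

Under this assumed gap function the single large message is faster because the fixed per-message cost is paid once rather than sixteen times.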

Although specific communication transactions (e.g., MPICH2 implementation of an

MPI scatter over Ethernet) have detailed, potentially application-specific performance, the

network model introduces general equations for the two types of network communication

used in the case studies. Equation 4-4 illustrates the individual set of attributes,

Stransaction, for multi-node network transactions such as InfiniBand or Ethernet, which

includes the L, o, g, G, P, and k attributes along with a γ cost value for any additional

required computations (e.g., reduce operations). Equation 4-5 defines the performance of

the communication transaction, ttransaction, by the delay as a function, fdelay, of the latency

and overhead attributes; the volume of data as a function, fquantity, of the message size and

example, 18-bit fixed point may be used in Xilinx FPGAs since it maximizes usage of

single 18-bit embedded multipliers. As with parallel decomposition, numerical formulation

is ultimately the decision of the application designer. RAT provides a quick and consistent

procedure for evaluating these design choices.

3.2.3 Resources

By measuring resource utilization, RAT seeks to determine the scalability of an

application design. Empirically, most FPGA designs will be limited in size by the

availability of three common resources: on-chip memory, hardcore functional units (e.g.,

fixed multipliers), and basic logic elements (i.e., look-up tables and flip-flops).

On-chip RAM is readily measurable since some quantity of the memory will likely be

used for I/O buffers of a known size. Additionally, intra-application buffering and storage

must be considered. Vendor-provided wrappers for interfacing designs to FPGA platforms

can also consume a small number of memories but the quantity is generally constant

and independent of the application design.

Although the types of dedicated functional units included in FPGAs can vary greatly,

the hardware multiplier is a fairly common component. The demand for dedicated

multiplier resources is highlighted by the availability of families of chips (e.g., Xilinx

Virtex-4 and -5 SX series) with extra multipliers versus other comparably sized FPGAs.

Quantifying the necessary number of hardware multipliers is dependent on the type

and amount of parallel operations required. Multipliers, dividers, square roots, and

floating-point units use hardware multipliers for fast execution. Varying levels of pipelining

and other design choices can increase or decrease the overall demand for these resources.

With sufficient design planning, an accurate measure of resource utilization can be taken

for a design given knowledge of the architecture of the basic computational kernels.

Measuring basic logic elements is the most common resource metric. High-level

designs do not empirically translate into any discernible resource count. Qualitative

assertions about the demand for logic elements can be made based upon approximate

quantity of data,
and expound on this general performance equation for the node and network

models, respectively.

$$ performance = delay + delay + quantity \times rate \tag{4-1} $$

Sections and discuss the hierarchical agglomeration of the individual

node and network components to model the performance of the individual algorithm

stages and subsequently the total application. The synchronous, iterative behavior

described in the node and network defines the computation and communication scheduling

for each algorithm stage and the overlap of these stages defines the total application

performance. The RATSS model uses this hierarchical approach to agglomerate the

individual performance estimates for the components into a single, quantitative prediction

for the application.

Compute Node Model

The goal of the node model is to estimate the performance of each computational

task based on the user-provided platform and application attributes. Depending on the

application requirements, each of the devices performing a given algorithm task may

have different computational demands, each requiring a separate node-level analysis.

Similarly, the application will likely have different computational loads for each task (i.e.,

stage) of the algorithm. Again, the node model is used to describe each unique portion

of computation and the individual performance estimates are combined in the RATSS

stage-level model.

As summarized in Equation 4-2, each node at each stage of execution can have

a unique set of attribute values, Sfpga, which includes the pipeline latency, PLfpga;

number of data elements, Nfpga_element; number of computational operations per element,

Nops_element; FPGA clock frequency, Fclock; and computation throughput, Rfpga, for the

[14] E. Grobelny, D. Bueno, I. Troxel, A. George, and J. Vetter, "FASE: A framework
for scalable performance prediction of HPC systems and applications," Simulation:
Transactions of The Society for Modeling and Simulation International, vol. 83, no.
10, pp. 721-745, October 2007.

[15] K. K. Bondalapati, Modeling and Mapping for Dynamically Reconfigurable Hybrid
Architectures, Ph.D. dissertation, University of Southern California, Los Angeles, CA,
August 2001.

[16] R. Enzler, C. Plessl, and M. Platzner, "System-level performance evaluation of
reconfigurable processors," Microprocessors and Microsystems, vol. 29, no. 2-3, pp.
63-75, April 2005.

[17] W. Fu and K. Compton, "A simulation platform for reconfigurable computing
research," in Int'l Conf. Field Programmable Logic and Applications (FPL), August
2006, pp. 1-7.

[18] C. Steffen, "Parameterization of algorithms and FPGA accelerators to predict
performance," in Reconfigurable Systems Summer Institute (RSSI), Urbana, IL, Jul
17-20 2007.

[19] H. Quinn, M. Leeser, and L. S. King, "Dynamo: A runtime partitioning system for
FPGA-based HW/SW image processing systems," J. Real-Time Image Processing,
vol. 2, no. 4, pp. 179-190, 2007.

[20] M. C. Herbordt, T. VanCourt, Y. Gu, B. Sukhwani, A. Conti, J. Model, and
D. DiSabello, "Achieving high performance with FPGA-based computing," Computer,
vol. 40, no. 3, pp. 50-57, 2007.

[21] W. M. Fang and J. Rose, "Modeling routing demand for early-stage FPGA
architecture development," in Proc. ACM Symp. Field Programmable Gate
Arrays (FPGA), Monterey, CA, 2008, pp. 139-148.

[22] V. Manohararajah, G. R. Chiu, D. P. Singh, and S. D. Brown, "Difficulty of
predicting interconnect delay in a timing driven FPGA CAD flow," in Proc. ACM
Workshop System-level Interconnect Prediction (SLIP), Munich, Germany, 2006, pp.

[23] M. Xu and F. Kurdahi, "Accurate prediction of quality metrics for logic level design
targeted towards lookup-table-based FPGA's," IEEE Trans. Very Large Scale
Integration (VLSI) Systems, vol. 7, no. 4, pp. 411-418, Dec. 1999.

[24] S. D. Brown, J. Rose, and Z. G. Vranesic, "A stochastic model to predict the
routability of field-programmable gate arrays," IEEE Trans. Computer-Aided Design
of Integrated Circuits and Systems, vol. 12, no. 12, pp. 1827-1838, Dec. 1993.

[Timing diagrams of communication (Comm) and computation (Comp) activity for three cases: Single Buffered; Double Buffered, Computation Bound; Double Buffered, Communication Bound. Legend: R = Read, W = Write, C = Compute]

Figure 3-2. Example Overlap Scenarios

is included for completeness of the RAT model; however, the case studies focus on the SB


The RAT analysis for computing tcomp primarily assumes one algorithm "functional

unit" operating on a single buffer's worth of transmitted information. The parameter Niter

is the number of iterations of communication and computation required to solve the entire


$$ t_{RC_{SB}} = N_{iter} \cdot (t_{comm} + t_{comp}) \tag{3-5} $$

$$ t_{RC_{DB}} \approx N_{iter} \cdot \max(t_{comm}, t_{comp}) \tag{3-6} $$

Assuming that the application design currently under analysis was based upon

available sequential software code, a baseline execution time, tsoft, is available for

comparison with the estimated FPGA execution time to predict the overall speedup.

As given in Equation (3-7), speedup is a function of the total application execution time,

not a single iteration.

$$speedup = \frac{t_{soft}}{t_{RC}} \tag{3-7}$$
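The relationship between Equations (3-5) through (3-7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and all numeric inputs are hypothetical and are not taken from any case study in this document.

```python
def rc_time_single_buffered(n_iter, t_comm, t_comp):
    """Equation (3-5): each iteration pays communication plus computation."""
    return n_iter * (t_comm + t_comp)


def rc_time_double_buffered(n_iter, t_comm, t_comp):
    """Equation (3-6): overlapped iterations are bound by the slower phase."""
    return n_iter * max(t_comm, t_comp)


def speedup(t_soft, t_rc):
    """Equation (3-7): ratio of total software time to total RC time."""
    return t_soft / t_rc


# Hypothetical inputs, chosen only to exercise the formulas.
n_iter = 400      # iterations
t_comm = 2.0e-5   # seconds of communication per iteration
t_comp = 1.0e-4   # seconds of computation per iteration
t_soft = 0.5      # seconds for the whole software run

t_sb = rc_time_single_buffered(n_iter, t_comm, t_comp)
t_db = rc_time_double_buffered(n_iter, t_comm, t_comp)
print(f"SB: {t_sb:.4f} s, speedup {speedup(t_soft, t_sb):.2f}")
print(f"DB: {t_db:.4f} s, speedup {speedup(t_soft, t_db):.2f}")
```

Because the example is computation bound, double buffering hides the communication entirely and the speedup rises accordingly; a communication-bound input would instead be limited by t_comm.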


The background and related research for this document is divided into two sections.

Section 2.1 provides an overview of FPGA technology and high-performance reconfigurable

computing. Section 2.2 summarizes related work for HPC performance modeling, FPGA

simulation and analytical modeling, and prediction focus.

2.1 FPGA Background

The field-programmable gate array (FPGA) is the primary device driving reconfigurable

computing. The overall goal of FPGAs is to provide the performance of an application-specific

integrated circuit (ASIC) with the flexibility and programmability of a microprocessor.

Applications for FPGAs are developed as hardware circuits constructed from logic

elements, fixed resources such as multiply accumulators or memories, and routing

elements. Traditionally, FPGA applications are developed in hardware description

languages such as VHDL or Verilog, but high-level languages are emerging to bring FPGA

code to the level of languages such as C or Java. Figure 2-1 outlines the performance

spectrum of computing devices and the general structure of FPGAs.

[Figure 2-1 diagram: performance spectrum of general-purpose processors, reconfigurable processors (FPGA), and ASICs; general FPGA structure of 4-input LUTs, connect boxes, and switch boxes]

Figure 2-1. Performance Characterization and General Structure of FPGA Devices

Reconfigurable computing systems often incorporate FPGAs to provide maximum

performance (e.g., speed, power, cost) versus comparable microprocessor-only solutions.

This research is applicable to both high-performance computing (HPC) and high-performance

pipelining allowed computational throughputs to be accurately predicted even though

the high-level parallel algorithms were not yet mapped to hardware. The total RC

execution time had an error of 1[illegible] for the case studies, [illegible] higher than the

individual communication and computation components. Large system overhead versus

short execution time was the main cause. Overall, the methodology performed well for

the diverse collection of algorithm complexities, hardware languages, FPGA platforms,

and total execution times. RAT was designed to handle these issues in single CPU and

FPGA systems where communication and computation are governed by the number of

data elements.

specific algorithm task. Adapted from RAT, the computation time, Equation 4-3, is

analogous to Equation 4-1 where the pipeline latency is the delay term, the number

of data elements and number of operations per element are the [illegible], and the

computation throughput and clock frequency define the effective throughput.

$$S_{fpga} = \{ PL_{fpga},\ N_{fpga\_elements},\ N_{ops/element},\ F_{clock},\ R_{fpga} \} \tag{4-2}$$

Sfpga: set of attribute values for a specific computation unit

PLfpga: pipeline latency of the computation (cycles)

Nfpga_elements: number of computation elements (elements)

Nops/element: number of operations per element (ops/element)

Fclock: FPGA clock frequency (MHz)

Rfpga: computation throughput (ops/cycle)

$$t_{fpga}(S_{fpga}) = \frac{PL_{fpga}}{F_{clock}} + \frac{N_{fpga\_elements} \times N_{ops/element}}{F_{clock} \times R_{fpga}} \tag{4-3}$$

tfpga: execution time for the fpga compute node (s)
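As a minimal sketch, Equation 4-3 can be evaluated directly in Python. The function name and the sample attribute values below are hypothetical, and the clock is passed in Hz (rather than MHz) so that the result comes out in seconds.

```python
def fpga_node_time(pl_cycles, n_elements, ops_per_element,
                   f_clock_hz, r_ops_per_cycle):
    """Equation 4-3: pipeline fill delay plus steady-state processing time."""
    latency_term = pl_cycles / f_clock_hz
    throughput_term = (n_elements * ops_per_element) / (f_clock_hz * r_ops_per_cycle)
    return latency_term + throughput_term


# Hypothetical attribute set S_fpga, for illustration only.
t_fpga = fpga_node_time(
    pl_cycles=100,          # PL_fpga, cycles
    n_elements=512,         # N_fpga_elements
    ops_per_element=768,    # N_ops/element
    f_clock_hz=100e6,       # F_clock, 100 MHz expressed in Hz
    r_ops_per_cycle=20,     # R_fpga
)
print(f"t_fpga = {t_fpga:.6e} s")
```

For long data streams the latency term is negligible, which is why the throughput term usually dominates the estimate.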

Microprocessor nodes can also impact FPGA application performance with computation

coinciding with FPGA execution (from Figure 4-2). The execution for a microprocessor,

tp, is defined by the software time which must be measured from legacy code or

estimated using a traditional model. Note that this microprocessor performance attribute

is only intended for application-related computation occurring in parallel with FPGA

execution. FPGA setup, configuration, and other software-involved overheads are

considered in the stage-level model.

Network Model

The goal of the network model is to estimate the performance of a communication

transaction based on the provided platform and application attributes. Analogous

to the node model, each unique communication transaction will require a separate

via a 133MHz PCI-X bus which has a theoretical maximum bandwidth of 1GB/s. The α

parameters were computed using a microbenchmark consisting of a read and write for data

sizes comparable to those used by the 1-D PDF algorithm. The read and write

times were measured, combined with the transfer size to compute the actual communication

rates, and finally used to calculate the α parameters by dividing by the theoretical

maximum. The α parameters for the target FPGA platform are low due to communication

protocols and middleware used by Nallatech atop PCI-X and high latencies associated

with the small 2KB (512 x 4B) transfers.
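The α calculation described above (achieved rate divided by theoretical maximum) can be sketched as follows. The measured transfer time is a hypothetical placeholder, not a value from the Nallatech microbenchmarks, and the function name is illustrative.

```python
def alpha(transfer_bytes, measured_seconds, ideal_bytes_per_second):
    """Fraction of the ideal channel throughput achieved by one transfer."""
    achieved_rate = transfer_bytes / measured_seconds
    return achieved_rate / ideal_bytes_per_second


TRANSFER_BYTES = 512 * 4      # 2KB transfer, as in the text
IDEAL_THROUGHPUT = 1.0e9      # 1GB/s theoretical PCI-X maximum
measured_write_s = 5.0e-5     # hypothetical microbenchmark result

alpha_write = alpha(TRANSFER_BYTES, measured_write_s, IDEAL_THROUGHPUT)
print(f"alpha_write = {alpha_write:.5f}")
```

Small transfers make fixed latencies a large share of the measured time, which pushes α well below 1 even on a fast channel.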

The computation parameters are the more [illegible] portion of RAT performance

prediction, but are still [illegible] given the deterministic behavior of PDF estimation. As

mentioned earlier, each element that comes into the PDF estimator is evaluated against

each of the 256 bins. Each computation requires 3 operations: comparison (subtraction),

multiplication, and addition. Therefore, the number of operations per element totals 768

(i.e., 256 x 3). This particular algorithm structure has 8 pipelines that each perform 3

operations per cycle for a total of 24. However, this value is conservatively rounded down

to 20 to account for implementation details such as pipeline latency and [illegible]

overhead. This conservative parameter was selected prior to the algorithm coding and

has not (nor has any parameter for any case study) been adjusted for [illegible]

created from runtime data. Pre-implementation adjustments to the RAT parameters such

as reducing the throughput value are not required but are sometimes useful to create

more optimistic or pessimistic predictions and account for application- or platform-specific

behaviors not modeled by RAT. Similarly, a range of throughput values could be examined

to explore the effect on performance when the [illegible] is better or worse than

expected. However, this case study focuses on a single value for each parameter to

validate the RAT model.
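The parameter derivation in this paragraph, and the suggested sweep over throughput values, can be sketched in Python. The computation-time formula used here is the usual RAT form (elements times operations per element, divided by clock frequency times throughput), which is not shown in this excerpt and is therefore an assumption; the 100 MHz clock is likewise a hypothetical choice.

```python
N_BINS = 256          # bins evaluated per element
OPS_PER_BIN = 3       # comparison (subtraction), multiplication, addition
N_PIPELINES = 8       # parallel pipelines in the design

n_ops_per_element = N_BINS * OPS_PER_BIN     # 768 ops/element
peak_throughput = N_PIPELINES * OPS_PER_BIN  # 24 ops/cycle
conservative_throughput = 20                 # value chosen in the text


def comp_time(n_elements, ops_per_element, f_clock_hz, throughput_proc):
    """Assumed RAT computation time for one iteration, in seconds."""
    return (n_elements * ops_per_element) / (f_clock_hz * throughput_proc)


# Sweep a range of throughput values at a hypothetical 100 MHz clock.
sweep = {
    r: comp_time(512, n_ops_per_element, 100e6, r)
    for r in (16, conservative_throughput, peak_throughput)
}
for r, t in sweep.items():
    print(f"{r:2d} ops/cycle -> t_comp = {t:.4e} s per iteration")
```

Sweeping the throughput in this way gives pessimistic, nominal, and optimistic estimates from the same parameter set.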

While previous parameters could be reasonably [illegible] the deterministic

structure of the algorithm, a priori estimation of the required clock frequency is


The promise of reconfigurable computing for achieving speedup and power savings

versus traditional computing paradigms is expanding interest in FPGAs. Among the

challenges for improving development of parallel algorithms for FPGAs, the lack of

methods for strategic design is a key obstacle to efficient usage of RC devices. Better

formulation methodologies are needed to explore algorithms, architectures, and mapping

to reduce FPGA development time and cost. Consequently, RAT is created as a simple

and effective methodology for investigating the performance potential of the mapping

of a given parallel algorithm for a given FPGA platform architecture. The methodology

employs an analytical model to analyze FPGA designs prior to actual development.

RAT is scoped and automated to provide maximum efficiency and reliability while

retaining reasonable prediction accuracy. The performance prediction is meant to work

with empirical knowledge of RC devices to create more efficient and effective means for

design-space exploration.

For the first phase of research, the RAT methodology defined the core analytical

model for performance estimation during formulation. The extensible RAT model

was specifically scoped for usage with deterministic applications on common, albeit

single-FPGA, platforms. Five case studies (1-D PDF, 2-D PDF, LIDAR, TSP, and MD)

validated the accuracy of decomposing complex system behavior into key communication

microbenchmarks and computation parameters for RAT modeling. Detailed microbenchmarking

allowed for an average error of 1[illegible]% (with individual errors as low as 1[illegible]%) for the

communication times of the case studies. For the deterministic case studies (i.e., all

except MD), computation error peaked at 17%. The total RC execution time had an

average error of 1[illegible]% for the case studies, which helped validate the RAT methodology of

rapid and reasonably accurate prediction.

Figure 5-4. Framework bridging modeling environments and performance prediction, and
orchestrating DSE

(KPN), a specialized AMP MoC, to describe the system concurrency and individual

component behavior. The RC Modeling Language (RCML) [61] provides hierarchical

models for the algorithm, architecture, and total application mapping with specialized

constructs to express deep parallelism (e.g., pipelines), typically for AMP and SDF

MoCs. With suitable abstraction, any of these models could provide effective application

specification for RAT performance prediction. The proposed DSE tool in Section 3.4 uses

RCML because of the RC-oriented focus.

5.3 Integrated Framework

This section describes the methodology of the framework for connecting RAT

performance prediction with modeling environments for increased productivity during

strategic DSE. (Section 3.4 discusses the DSE tool bridging RAT with the RCML

modeling environment.) Figure 5-4 provides an overview of the framework structure.

The proposed methodology provides translation of specification information from the

modeling environment to RAT and orchestration of DSE based on revisions to the

specification information. The modeling environment and RAT performance prediction

components are the existing methods and tools from Section 5.2, as indicated by the

Table 4-5. Modeling Error for 2-D PDF Estimation (Nallatech, XC4VLX100, 195MHz)
                    Predicted (s)   Experimental (s)   Error
2 FPGAs   tcomp     1.41E+2         1.56E+2            -9.[illegible]%
          tcomm     1.35E+1         1.51E+1            -10.7%
          total     1.54E+2         1.71E+2            -9.7%
          Speedup   146             132                10.[illegible]%
4 FPGAs   tcomp     7.05E+1         7.84E+1            -10.1%
          tcomm     9.31E+0         9.93E+0            -6.[illegible]%
          total     7.98E+1         8.84E+1            -9.7%
          Speedup   283             255                11.0%
8 FPGAs   tcomp     3.52E+1         3.95E+1            -10.9%
          tcomm     7.25E+0         7.70E+0            -5.9%
          total     4.24E+1         4.72E+1            -10.1%
          Speedup   532             478                11.[illegible]%
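The error column appears consistent with a relative error of (predicted - experimental) / experimental, so that underestimated times are negative and overestimated speedups are positive. That sign convention is inferred from the table rather than stated in the text. A short Python sketch using the rounded 4-FPGA entries:

```python
def percent_error(predicted, experimental):
    """Relative prediction error in percent; negative means underestimated."""
    return 100.0 * (predicted - experimental) / experimental


# Rounded values for the 4-FPGA configuration of Table 4-5.
four_fpga = {
    "tcomp":   (7.05e1, 7.84e1),
    "total":   (7.98e1, 8.84e1),
    "speedup": (283.0, 255.0),
}

errors = {name: percent_error(p, e) for name, (p, e) in four_fpga.items()}
for name, err in errors.items():
    print(f"{name:8s} {err:+.1f}%")
```

Because the inputs are the rounded table entries, recomputed errors can differ from the published column in the last digit.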

for a range of clock frequency values, though only the results of the 195MHz estimation

are shown. Major revisions to the target algorithm or platform architecture during

implementation can significantly alter the application performance affecting the validity

of the prediction. Thus, RATSS can be used iteratively throughout the design process,

recomputing predictions whenever significant revisions are considered or become necessary

to ensure the subsequent implementation will still meet performance requirements and

thereby prevent further reductions in productivity. However, such modifications to the

application structure were not necessary for the 2-D PDF estimation case study.

In Table 4-5, the results of the performance prediction for the 2-D PDF estimation

case study are compared against a subsequent implementation of the target algorithm.

The node and network models underestimated the actual implementation times and

subsequently overestimated the total application speedup over the software baseline. The

node times represented the majority of the execution time (over 9[illegible]% of the physical

implementation) thereby having the greatest impact on prediction accuracy. The

prediction errors for the 2, 4, and 8 FPGA configurations are under 11%, which is

considered reasonably accurate given the focus on high-level design-space exploration

prior to implementation. Most of the discrepancy is due to additional cycles of overhead

related to data movement during the FPGA computation. In contrast to the node


The first research phase outlines FPGA performance estimation using the RAT

analytical model prior to hardware implementation. This chapter presents a brief

introduction on the research challenges for a formulation-time, extensible model

(Section 3.1), a detailed analysis of the prediction methodology (Section 3.2), a complete

walkthrough of performance estimation for a real scientific application (Section 3.3), four

additional case studies as further validation of RAT (Section 3.4), and conclusions (Section


3.1 Introduction

In this chapter, research challenges with constructing a performance prediction

model to support efficient design-space exploration are investigated. Potential algorithms,

architectures, and system mappings must be investigated prior to implementation (to

reduce development cost) and predictions are limited to evaluation of a specific algorithm

targeting a specific platform (to avoid vague generalities). The RAT methodology is

presented as a technique to address the challenges of formulation-level performance

prediction. Five case studies are presented to validate the RAT performance model and


3.2 RC Amenability Test

Figure 3-1 illustrates the basic methodology behind the RC amenability test. These

throughput, numerical precision, and resource tests serve as a basis for determining the

viability of an algorithm design on the FPGA platform prior to any FPGA programming.

Again, RAT is intended to address the performance of a specific high-level parallel

algorithm mapped to a particular FPGA platform, not a generic application. The results

of the RAT tests must be compared against the designer's requirements to evaluate the

success of the design. Though the throughput analysis is considered the most important

step, the three tests are not necessarily used as a single, sequential procedure. Often,

SNAP Interconnect

Figure 4-7. Platform Structure for Image Filtering Case Study

by the interconnect controller. Network transactions are still described using collective

communication terminology but the physical operations are independently initiated

point-to-point messages. Additionally, case study implementations are written in Carte C.

Common to both case studies, Table 4-6 defines the network attributes for the SNAP

interconnect, which are measured from microbenchmarks. Similar to the Nallatech system,

latency, L, is the transmission time of a single-word transfer, which is dominated by the

network delay. However, overhead, o, the time between successive messages, is a function

of the number of nodes, P, and message size, k, not a constant parameter due to the

decentralized communication requests. More extensive microbenchmarking can determine

approximate overhead values although variability in message ordering inhibits detailed

analysis. For these case studies, the effect of overhead is considered negligible and not part

of the network model. The short-message gap, g, is measured as the transmission time

for short messages minus latency. No short messages are used in these case studies but

the value is included for completeness. The long-message gap, G, is one 8-byte word every

10ns clock cycle based on the fixed 100MHz clock.
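The long-message gap can be turned into a transfer-time estimate with a short sketch. It assumes the LogGP convention that G is a per-byte gap and, following the text, drops the overhead term; the latency value is hypothetical because Table 4-6 is not reproduced here.

```python
WORD_BYTES = 8       # one 8-byte word per clock cycle
CLOCK_HZ = 100e6     # fixed 100MHz clock, 10ns per cycle

gap_per_byte = (1.0 / CLOCK_HZ) / WORD_BYTES   # G in seconds per byte
peak_bandwidth = WORD_BYTES * CLOCK_HZ         # bytes per second


def long_message_time(k_bytes, latency_s, gap_per_byte_s):
    """LogGP-style time for one k-byte message with overhead o neglected."""
    return latency_s + (k_bytes - 1) * gap_per_byte_s


latency = 1.0e-6     # hypothetical L, seconds
t_msg = long_message_time(8_000_000, latency, gap_per_byte)
print(f"G = {gap_per_byte:.3e} s/byte, peak = {peak_bandwidth:.1e} B/s")
print(f"8 MB message: {t_msg:.6f} s")
```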

2.2 Related Research

Productive application development for FPGA-based systems is key for wider

deployment and usage of FPGAs. One challenge to FPGA productivity is efficient

generation of more abstract FPGA design codes. Raising the design focus from traditional

hardware description languages, high-level languages such as Impulse C [3], Carte C

[4], Mitrion C [5], and Handel C [6] provide a software-like infrastructure for a more

efficient and familiar programming model for FPGA applications. Similarly, research in

hardware/software codesign enables a faster bridge between application specification and

hardware implementation and a brief history of this research trend can be found in [7].

However, faster implementation does not address the underlying need for application

planning and evaluation prior to significant commitments of time and money.

Efficient performance modeling for algorithms and systems is an ongoing area

of research for traditional parallel computing. The Parallel Random Access Machine

(PRAM) [8] attempts to model the critical (and hopefully small) set of algorithm

and platform attributes necessary to achieve a better understanding of the greater

computational interaction and ultimately the application performance. The Bulk

Synchronous Parallel (BSP) [9] model extends PRAM concepts, which includes support for

communication and its interaction (i.e., overlap) with computation. The LogP model [10]

(one successor to PRAM) abstracts the application performance based on the latency (i.e.,

wire delay), L; overhead, o; message gap (i.e., minimum time between messages), g; and

number of processors, P. LogGP [11] and additional revisions such as parameterized LogP

(PlogP) [12] provide further modeling fidelity to LogP by addressing specific issues such

as bandwidth constraints for long messages and dynamic performance characterization,

respectively. However, these concepts are not limited to systems of general-purpose

processors. The evolution towards heterogeneous many-core devices necessitates increased

modeling research and usage due to rising development time and cost. In particular, RAT

seeks to leverage these ideas of maximizing model flexibility (through parameterization)

[Figure 5-11 diagram: 16 Novo-G nodes connected by Gigabit Ethernet]

Figure 5-11. Architecture specification of Novo-G system

Additional analysis tools could be constructed to identify the highest performing design(s)

based on criteria such as fewest revisions from the original application specification, but

such features are outside the scope of this dissertation.

5.5 Integrated RATSS (MNW)

The case studies (and associated FPGA platform) from Section 5.4 only required

RAT-level analysis due to the small system size and single-level point-to-point communication.

However, the proposed framework can also be used with RATSS, which extends the

RAT tool based on the methodologies for analytical modeling of scalable systems from

Chapter 4. Abstract application specification using modeling environments and MoCs

remains unchanged except for potential mappings to larger, scalable systems with

more computation resources and communication interconnects. The methodologies

for translation and orchestration of the synchronous iterative model for performance

estimation are also maintained with RATSS, which provides models for the added system-level

communication. For this case study, the parallelism of MNW is extended to multiple

nodes of an FPGA-augmented cluster. Figure 5-11 describes the architecture of the

system, referred to as Novo-G, where each node of the cluster is the platform described

in Figure 5-6. The nodes are connected by Gigabit Ethernet, which is modeled using the

RATSS tool.

[Figure 3-4 diagram: sequential accesses to FPGA memory feed eight parallel pipelines of the PDF kernel, covering bins 0-31, 32-63, 64-95, 96-127, 128-159, 160-191, 192-223, and 224-255]

Figure 3-4. Parallel Algorithm and Mapping for 1-D PDF

sample at every discrete probability level. For simplicity, each discrete probability level is

subsequently referred to as a bin.

In order to better understand the assumptions and choices made during the RAT

analysis, the chosen algorithm for PDF estimation is highlighted in Figure 3-4. A total

of 204,800 data samples are processed in batches of 512 elements against 256 bins. Eight

separate pipelines are created to process data samples with respect to a particular subset

of bins. Each data sample is an element with respect to the RAT analysis. The data

elements are fed into the parallel pipelines sequentially. Each pipeline unit can process

one element with respect to one bin per cycle. Internal registering for each bin keeps a

running total of the impact of all processed elements. These cumulative totals comprise

the final estimation of the PDF function.
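The batching and bin partitioning described above can be sketched as a plain software loop. The per-bin update used here (squared difference accumulated into a running total) is a hypothetical stand-in for the actual kernel, chosen only because it uses one subtraction, one multiplication, and one addition; the demo sizes are reduced so the example runs quickly.

```python
def pdf_accumulate(samples, bin_levels, n_pipelines=8, batch_size=512):
    """Accumulate per-bin totals, batch by batch and pipeline by pipeline."""
    if len(bin_levels) % n_pipelines != 0:
        raise ValueError("bins must divide evenly among pipelines")
    bins_per_pipeline = len(bin_levels) // n_pipelines
    totals = [0.0] * len(bin_levels)
    for start in range(0, len(samples), batch_size):    # one RAT iteration
        batch = samples[start:start + batch_size]
        for p in range(n_pipelines):                    # parallel in hardware
            first = p * bins_per_pipeline
            for x in batch:                             # elements fed in order
                for b in range(first, first + bins_per_pipeline):
                    diff = x - bin_levels[b]            # subtraction
                    totals[b] += diff * diff            # multiply, then add
    return totals


# Sizes from the text: 204,800 samples, batches of 512, 256 bins, 8 pipelines.
n_iterations = 204_800 // 512
bins_per_pipeline = 256 // 8

# Reduced demo: 4 samples, 2 bins, 2 pipelines, batches of 2.
demo = pdf_accumulate([0.0, 1.0, 2.0, 3.0], [0.0, 1.0],
                      n_pipelines=2, batch_size=2)
print(n_iterations, bins_per_pipeline, demo)
```

In hardware the pipeline loop runs concurrently, so each element costs bins_per_pipeline cycles rather than the full bin count.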

3.3.2 RAT Input Parameters

Table 3-1 provides a list of all the input parameters necessary to perform a RAT

analysis. The parameters are sorted into four distinct categories, each referring to a

particular portion of the throughput analysis. Note that Nelements is listed under a

separate category when it is used by both communication and computation. It is assumed

that the number of elements dictating the computation volume is also the number of

[Figure 4-3 timing diagram: computation and communication scheduled within Application Iteration 1, Application Iteration 2, and so on along a time axis]

Figure 4-3. Example Timing Diagram Illustrating Stage- and Application-level Scheduling
of Computation and Communication.

Nappiterations, of either the sum or maximum (longest) of the stage times. Again, this

model generalizes high-level iterative behavior. Applications can contain repetitive but

irregular behavior (e.g., 3 iterations of stage one, 7 iterations of stage two, 5 iterations of

stage three, etc.), which is simple to calculate but not explicitly considered by the model.

$$S_{stage} = \{ t_{stage_1}, \ldots, t_{stage_s} \} \tag{4-15}$$

Sstage: set of stage times for the application

s: total number of stages for the application

$$t_{application} = N_{appiterations} \times \begin{cases} \sum S_{stage} \\ \max(S_{stage}) \end{cases} \tag{4-16}$$

tapplication: total execution time of the application

Nappiterations: number of application-level iterations
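A minimal Python sketch of Equations 4-15 and 4-16 follows. The `overlapped` flag is an illustrative name for choosing between the sum and the maximum of the stage times, and the stage times themselves are hypothetical.

```python
def application_time(stage_times, n_app_iterations, overlapped=False):
    """Equation 4-16: iterations times the sum or the maximum stage time."""
    per_iteration = max(stage_times) if overlapped else sum(stage_times)
    return n_app_iterations * per_iteration


stage_times = [1.0, 2.5, 0.5]   # hypothetical S_stage, seconds
t_sequential = application_time(stage_times, 4)
t_overlapped = application_time(stage_times, 4, overlapped=True)
print(t_sequential, t_overlapped)
```

Irregular schedules (for example, a different iteration count per stage) would need the per-stage products summed instead, which this sketch does not attempt.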

4.4 Detailed Walkthrough: 2-D PDF Estimation

This section presents a detailed walkthrough of RATSS performance prediction with

a reasonably complex case study, 2-D PDF estimation. The intended algorithm and

platform structure along with the feature characterization and performance calculations

for the node, network, stage, and application models are discussed. The results of the

[Figure 5-3 diagram: modeling environment containing an architecture model and an algorithm model, with online API and standard file format interfaces]

Figure 5-3. Y-chart approach to application specification in modeling environments [54]

The RAT tool maintains the synchronous iterative (i.e., multiphase) performance

model, a subset of fork-join models with each hardware resource (e.g., microprocessor

or FPGA) performing an independent portion of the application computation each

iteration with synchronizing communication separating every iteration from preceding

and succeeding iterations [1, 55]. Figure 5-2 outlines the general structure of the RAT

tool. RAT assumes application execution time is defined by the summation of the

slowest computation and communication each iteration. The underlying computation

and communication models for RAT describe the potentially complex and typically

data-oriented behaviors within each iteration using a few key quantitative attributes.

Computation is defined by total number of application-specific operations and their rate

of execution based on usage of the algorithm's deep or wide parallelism by the hardware

resources. Alternatively, RC computation can often be described by the number of data

elements to be processed and the rate of completion (i.e., cycles per element). RAT uses

an extension of the Hockney model [56] to describe I/O communication and LogGP for

system-level communication.
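The synchronous iterative assumption and the basic Hockney communication model can be sketched together. This shows only the base Hockney form (startup latency plus size over bandwidth), not the extension RAT uses, and all inputs are hypothetical.

```python
def hockney_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Base Hockney model: fixed startup latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def synchronous_iterative_time(iterations):
    """Sum, over iterations, of the slowest computation and communication.

    `iterations` is a list of (comp_times, comm_times) pairs, one pair per
    iteration, with one entry per hardware resource or transfer.
    """
    return sum(max(comp) + max(comm) for comp, comm in iterations)


t_transfer = hockney_time(1000, 1.0e-6, 1.0e9)
t_app = synchronous_iterative_time([
    ([1.0, 2.0], [0.5, 0.25]),   # iteration 1: two resources, two transfers
    ([3.0, 1.0], [0.1, 0.2]),    # iteration 2
])
print(t_transfer, t_app)
```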

5.2.2 Modeling Environments

Modeling environments provide abstract yet reasonably precise descriptions of

application structure and behavior. An application specification consists of models (i.e.,

descriptions) of the underlying algorithm and RC platform architecture along with their

respective mapping. Algorithm and architecture models are often specified separately

Table 4-4. System Model Attributes for 2-D PDF Estimation
Attribute            Units          2 Nodes   4 Nodes   8 Nodes
Nstageiterations     (iterations)   1         1         1

Scomp                (s)            1.41E+2   7.05E+1   3.52E+1
Scomm   scatter X    (s)            1.28E+0   1.92E+0   2.25E+0
        scatter Y    (s)            1.28E+0   1.92E+0   2.25E+0
        write X      (s)            4.07E-1   2.03E-1   1.02E-1
        write Y      (s)            4.07E-1   2.03E-1   1.02E-1
        read         (s)            1.01E+1   5.05E+0   2.52E+0
        gather       (s)            3.89E-3   7.78E-3   1.17E-2

tcomp                (s)            1.41E+2   7.05E+1   3.52E+1
tcomm                (s)            1.35E+1   9.31E+0   7.25E+0
tstage               (s)            1.54E+2   7.98E+1   4.24E+1

summation of the number of iterations, Nstage iterations, of I/O communication and FPGA

computation for the node, tcomm and tcomp (Equation 4-24). Single buffering maximizes

the available memory bandwidth to the computation units. The computation time, as

shown in Table 4-4, dominates the total application execution time and would not greatly

benefit from double buffering.

$$t_{application} = t_{stage} = N_{stageiterations} \cdot (t_{comm} + t_{comp}) \tag{4-24}$$

The model output is summarized in the third block of Table 4-4. As the number

of nodes doubles, the node time reduces by half but the network time increases slightly.

Consequently, the total execution time, ttotal, decreases by slightly less than half as the

number of nodes doubles. This trend of nearly linear performance improvement with

increasing platform size is reasonable given the embarrassingly parallel nature of the

computation and relatively low impact of the PCI-X and Ethernet communication.
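The stage-level arithmetic of Equation 4-24 can be checked against the rounded entries of Table 4-4. The sketch below assumes that tcomm is the plain sum of the six listed communication transactions, which the table values support to within rounding.

```python
# Communication transaction times (s) from Table 4-4, keyed by node count.
scomm = {
    2: [1.28, 1.28, 4.07e-1, 4.07e-1, 1.01e1, 3.89e-3],
    4: [1.92, 1.92, 2.03e-1, 2.03e-1, 5.05, 7.78e-3],
    8: [2.25, 2.25, 1.02e-1, 1.02e-1, 2.52, 1.17e-2],
}
tcomp = {2: 1.41e2, 4: 7.05e1, 8: 3.52e1}
table_tcomm = {2: 1.35e1, 4: 9.31, 8: 7.25}
table_tstage = {2: 1.54e2, 4: 7.98e1, 8: 4.24e1}


def stage_time(n_stage_iterations, t_comm, t_comp):
    """Equation 4-24 with single buffering."""
    return n_stage_iterations * (t_comm + t_comp)


results = {}
for nodes in (2, 4, 8):
    t_comm = sum(scomm[nodes])
    results[nodes] = (t_comm, stage_time(1, t_comm, tcomp[nodes]))
    print(f"{nodes} nodes: tcomm = {t_comm:.3f} s, tstage = {results[nodes][1]:.2f} s")
```

The recomputed values agree with the table to within about one percent, the residual being rounding in the published entries.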

4.4.5 Results and Verification

As previously discussed, the performance prediction is calculated prior to low-level

design and was not adjusted based on implementation details. Predictions were generated

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Brian M. Holland

August 2010

Chair: Alan D. George
Major: Electrical and Computer Engineering

FPGAs continue to demonstrate impressive benefits in terms of performance per Watt

for a wide range of applications. However, the design time and technical complexities

of FPGAs have made application development expensive, particularly as the number

of project revisions increases. Consequently, it is important to engage in systematic

formulation of applications, performing strategic exploration, performance prediction, and

tradeoff analysis before undergoing lengthy development cycles. Unfortunately, almost all

existing simulative and analytic models for FPGAs target existing applications to provide

detailed, low-level analysis. This document explores methods, challenges, and tradeoffs

concerning performance prediction scope and complexity, calibration and verification,

applicability to small and large-scale FPGA systems, efficiency, and automation. The

RC Amenability Test (RAT) is proposed as a high-level methodology to address these

challenges and provide a necessary design evaluation mechanism currently lacking in

FPGA application formulation. RAT is comprised of an extensible analytic model for

single and multi-FPGA systems harnessing a modeling infrastructure, RC Modeling

Language (RCML), to provide a breadth of features allowing FPGA designers to more

efficiently and automatically explore and evaluate algorithm, platform, and system

mapping choices.

[Figure 5-2 diagram: RAT performance prediction, with blocks labeled Pipeline, State Machine; Communication Prediction; and Synchronous Iterative Schedule Prediction]

Figure 5-2. Performance prediction using RAT

of the DSE tool includes the connectivity and usage of existing tools for application

specification and RAT prediction by the newly constructed components for translation and


5.2 Background and Related Research

The proposed framework leverages existing research in RAT performance prediction

and modeling environments to facilitate application specification and analysis for strategic

DSE for RC. RAT uses separate computation and communication models, based on

underlying assumptions about the application structure and behavior, which aggregate

into complete predictions of application. Several modeling environments provide methods

and tools with similar approaches for abstracting algorithm and platform architecture

specifications, albeit with differing levels of implementation detail.

5.2.1 RAT Performance Prediction

For this phase of research, a prediction tool is constructed from the RAT methodology

that includes an API to provide the necessary prediction input and gather the resulting

performance estimation. The general methodology of separate computation and

communication modeling for RAT is common in prediction techniques, though research

directed towards strategic RC analysis is not as expansive as compared to modeling

environments. RAT is included within the framework due to its fairly unique focus of

strategic prediction prior to implementation. Encapsulation of RAT for replacement with

other synchronous iterative performance models is possible, but outside the scope of this


elements that are input to the application (although the effective bit-widths may differ

due to the fixed width of the communication channel). While applications can exhibit

unusual computational trends or require significant amounts of additional data (e.g.

constants, seed values, or lookup tables), these instances may be considered uncommon.

Alterations can be made to account for uncorrelated communication and computation but

such examples are not included in this document.

Table 3-1. Input parameters for RAT
Dataset Parameters
  Nelements, input      (elements)
  Nelements, output     (elements)
  Nbytes/element        (bytes/element)
Communication Parameters
  throughputideal       (MB/s)
  αwrite                0 < α < 1
  αread                 0 < α < 1
Computation Parameters
  Nops/element          (ops/element)
  throughputproc        (ops/cycle)
  fclock                (MHz)
Software Parameters
  tsoft                 (sec)
  Niter                 (iterations)

Table 3-2 summarizes the input parameters for RAT analysis of the specified

algorithm for 1-D PDF estimation. The dataset parameters are generally the first

values supplied by the user, since the number of elements will ultimately govern the

entire algorithm performance. Though the entire application involves 204,800 data

samples, each iteration of the 1-D PDF estimation will involve only a portion, 512 data

samples, or 1/400 of the total set. This algorithm effectively consumes all of the input

values. Only one cumulative value is left after each iteration per bin but these results are

retained on the FPGA. Values are only transferred back to the host after computation

for all iterations is complete. The final output communication must be represented as

Table 3-15. Performance parameters of MD (XtremeData)
                Predicted   Predicted   Predicted   Actual
fclk (MHz)      75          100         150         100
tcomm (sec)     8.77E-4     8.77E-4     8.77E-4     1.39E-3
tcomp (sec)     7.17E-1     5.37E-1     3.58E-1     8.79E-1
utilcommSB      0.1%        0.2%        0.2%        0.2%
utilcompSB      99.9%       99.8%       99.8%       99.8%
tRCSB (sec)     7.19E-1     5.38E-1     3.59E-1     8.80E-1
speedup         8.0         10.7        16.0        6.6
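The utilization rows can be recomputed from the time rows with a short sketch. It assumes the single-buffered utilization is each component's share of tcomm + tcomp, a definition that is not shown in this excerpt; the inputs are the rounded table entries.

```python
def utilization_sb(t_comm, t_comp):
    """Single-buffered shares of the iteration spent communicating and computing."""
    total = t_comm + t_comp
    return t_comm / total, t_comp / total


# (tcomm, tcomp) in seconds for the predicted columns of Table 3-15.
predicted = {
    75: (8.77e-4, 7.17e-1),
    100: (8.77e-4, 5.37e-1),
    150: (8.77e-4, 3.58e-1),
}

shares = {mhz: utilization_sb(*times) for mhz, times in predicted.items()}
for mhz, (u_comm, u_comp) in shares.items():
    print(f"{mhz} MHz: comm {100 * u_comm:.1f}%, comp {100 * u_comp:.1f}%")
```

Because the predicted tcomp scales inversely with clock frequency while tcomm stays fixed, the communication share grows slightly as the clock rises.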

Table 3-16. Resource usage of MD (EP2S180)
FPGA Resource   Utilization (%)
BRAMs           24
9-bit DSPs      100
ALUTs           73

The interconnect parameters model an XtremeData XD1000 platform containing an

Altera Stratix-II EP2S180 user FPGA connected to an Opteron processor over the

HyperTransport fabric. The theoretical interconnect throughput is 1.6GB/s but only

a fraction of the channel can be used for transferring data to the on-board SRAM as

needed for the algorithm. The number of operations per element, approximately 16,400

interactions per molecule times 10 operations each, is estimated due to the length of the

pipeline and data-driven behavior. Unlike the previous case studies, the computational

throughput cannot be reliably measured due to the complex and nondeterministic

algorithm structure. As discussed in Section 3.2.1, the number of operations per cycle

is treated as a "tuning" parameter to compute the throughput necessary to achieve the

desired speedup based on the estimate of Nops/element. Though 50 is the quantitative

value computed by the equations to achieve the desired overall speedup of approximately

10, this value serves qualitatively as an indicator that substantial data parallelism and

functional pipelining must be achieved in order to realize the desired speedup. The same

range of clock frequencies was used as in PDF estimation. The serial software baseline was

executed on a 2.2 GHz Opteron processor, the host processor of the XD1000 system. The

system-level modeling concepts form the basis for a proposed model for heterogeneous

clusters. Collective communication modeling and scheduling for node-heterogeneous

networks of workstations (NOWs) [47, 48] and clusters of clusters with hierarchical

networks [49] are further extensions to traditional system-level modeling. These

modifications for heterogeneous computing provide useful insight towards the proposed

RATSS model.

4.3 RATSS Model

This section provides a detailed discussion of the structure and contributions of the

RATSS model for fast and reasonably accurate performance prediction. This prediction

(and consequently design-space exploration) begins with the designer's specifications of

the FPGA platform and algorithm pairing for RATSS analysis. The FPGA platform

specification defines the performance capabilities of each component in the system,

specifically the computation and communication metrics such as latency, bandwidth, and

clock frequency. The algorithm specification defines the computation requirements of

every specific task and the resulting communication between devices, which depends on

the algorithm/platform mapping. Quantitative attributes are provided for every unique

computation and communication task in the FPGA system and these values feed the

component-level analytical equations. The RATSS model combines the individual

computation and communication predictions based on the system-level schedule defined

by the application specification and subsequently provides a quantitative performance

estimate for the platform/algorithm pairing. This prediction is used by the designer

for further design-space exploration, revising (and re-analyzing) as necessary until the

application meets their performance requirements.

Again, RATSS adapts existing computation and communication models to provide

a complete performance prediction for an FPGA application (i.e., a specific algorithm

mapped to a specific FPGA platform). RATSS performance prediction is based on

efficient quantitative characterization of the key attributes of this algorithm/platform

Deterministic Algorithm tasks and data movement between tasks are predictable prior

to implementation, either as a constant or an average performance of typical data


Precise characterization of application task scheduling is insufficient for design-space

exploration if the underlying computation and communication times cannot be precisely

quantified. Randomness in computation and communication behavior requires quantification

of application characteristics as averages of expected behavior. The RATSS assumption for

deterministic behavior is reasonable as many applications targeting the FPGA paradigm

are SIMD-style algorithms implemented as pipelines.

Ultimately, synchronous, iterative, and deterministic behavior allows efficient

characterization of computation needs and communication requirements of the FPGA

application. Pipelined, SIMD-style algorithms involve data transformations and both the

communication and computation are characterized by the quantity of data associated

with the particular platform/algorithm mapping. The computational demands of the

application are quantified by the number of operations per input data element and the

rate of execution (i.e., amount of deep and wide parallelism). Similarly, the attributes for

the communication requirements define the amount of data for each network transaction in

terms of bytes.

Model Usage

Again, RATSS involves quantifying the key attributes of the FPGA platform and

application for use in the underlying analytical models for performance prediction. These

quantitative characteristics are provided largely by the designer. Platform-intrinsic

attributes such as network latency and throughput are gathered from microbenchmarks

that specifically mirror algorithm operations, such as a DMA read and write. Ideally,

a database of microbenchmark results is referenced by the designer for the platform

attributes, else the benchmarks must be performed prior to any performance prediction.

Note that accurate microbenchmarking can be a nontrivial process, albeit with nonrecurring

t_{comp} = \max(\{t_{fpga_1}, \ldots, t_{fpga_P}\})  (4-27)

t_{comm} = t_{broadcast,scatter} + t_{gather}  (4-28)

t_{application} = t_{stage} = N_{iterations} \times (t_{comp} + t_{comm})  (4-29)
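
Under the synchronous iterative model, each iteration waits for the slowest FPGA, then pays for the collective communication, and the stage repeats for a fixed number of iterations. The following is a minimal sketch of that calculation, not the author's tool. The function name and all timing values are hypothetical, and it assumes the broadcast (or scatter) and gather steps do not overlap with computation.

```python
def stage_time(t_fpga, t_broadcast_or_scatter, t_gather, n_iterations):
    """Synchronous iterative estimate of total stage time in seconds."""
    t_comp = max(t_fpga)                          # slowest node bounds each iteration
    t_comm = t_broadcast_or_scatter + t_gather    # collectives assumed serialized
    return n_iterations * (t_comp + t_comm)

# Hypothetical per-iteration times (seconds) for a 4-node system.
t_fpga = [0.040, 0.041, 0.040, 0.042]
total = stage_time(t_fpga, t_broadcast_or_scatter=0.003, t_gather=0.005,
                   n_iterations=100)
print(f"estimated stage time: {total:.2f} s")
```

Taking the maximum rather than the mean of the per-node times reflects the synchronization point at the end of every iteration.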

Although these two additional case studies involve a different FPGA platform from

the 2-D PDF application, the sequential software baselines are measured from the same

3.2GHz Xeon microprocessor for consistency. While speedup is often an advantageous

performance metric, the specific speedup value must be compared with the problem

size and computation-to-communication ratio for the application. The image filtering

and molecular dynamics case studies illustrate communication- and computation-bound

problems, respectively, with correspondingly lower and higher speedups. Both scenarios

are modeled by RATSS with reasonable accuracy.

4.5.1 Image Filtering

The particular image filter used in this case study is a discrete 2-D convolution

of a 3x3 image segment (i.e., a pixel and its 8 neighbors) with a user-specified filter.

Example usages of this application include Sobel or Canny edge detection and high,

low, or band-pass filtering for noise reduction. Figure 4-8 provides an illustration of this

algorithm. The same 418x418 image is streamed (i.e., written) to the primary FPGA

of two nodes of the SRC-6 system. As part of the computation, the primary FPGAs

stream the image data to their respective secondary FPGAs. Each FPGA performs the

convolution of the image data with respect to different filter values. The resulting images

on the secondary FPGAs are streamed back to their respective primary FPGA which

DMAs the two new images from the node back to the network-attached microprocessor. A

more general overview of convolution for image filtering can be found in [53].
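
As a software reference for the operation described above, the following is a minimal sketch of a 3x3 discrete convolution in plain Python. It is not the hardware design from the case study. The function name is hypothetical, and the sketch assumes that only the interior ("valid") pixels are produced, since the text does not say how image borders are handled.

```python
def convolve3x3(image, kernel):
    """Valid-region 2-D convolution of a list-of-lists image with a 3x3 kernel."""
    rows, cols = len(image), len(image[0])
    # Convolution (unlike correlation) reverses the kernel in both axes.
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for r in range(1, rows - 1):
        out_row = []
        for c in range(1, cols - 1):
            acc = 0
            for dr in (-1, 0, 1):          # the pixel and its 8 neighbors
                for dc in (-1, 0, 1):
                    acc += image[r + dr][c + dc] * flipped[dr + 1][dc + 1]
            out_row.append(acc)
        out.append(out_row)
    return out

# Small ramp image and two example filters.
image = [[c + 4 * r for c in range(4)] for r in range(4)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
print(convolve3x3(image, identity))
print(convolve3x3(image, sobel_x))
```

Each output pixel depends only on its own 3x3 neighborhood, which is why the filter maps naturally to a streaming pipeline with one result per cycle.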

Table 4-7 summarizes the compute node attributes for the RATSS model. Because of

the double-precision operations, the overall pipeline will be fairly deep and too complex to

performance estimation are compared against a subsequent hardware implementation to

evaluate the accuracy of the RATSS model.

4.4.1 Algorithm and Platform Structure

The 2-D PDF estimation algorithm for this case study uses the Parzen window

technique [51], a generalized nonparametric approach to estimating PDFs in a d-dimensional

space. Despite the increased computational complexity versus traditional histograms,

the Parzen window technique is mathematically advantageous because of the rapid

convergence to a continuous function. This algorithm is amenable for FPGA acceleration

because of the high degree of computational parallelism and large computation effort

relative to the amount of data consumed (i.e., input) and produced (i.e., output). The

computational complexity of a d-dimensional PDF algorithm is O(Nn^d) where N is the

total number of samples of the random variable, n is the number of levels where the PDF

is estimated, and d is the number of dimensions. This 2-D PDF estimation algorithm

accumulates the statistical likelihood of every sample occurring within every probability

level. Each sample/level combination is independent, thereby making the algorithm

embarrassingly parallel. The data input consists of O(N) samples whereas the output is

the resulting O(n^2) probability levels.
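
The loop structure behind the O(Nn^d) complexity can be sketched for d = 2 as follows. This is an illustrative sketch only. The Gaussian window, the bandwidth `h`, and the function name are assumptions, since the excerpt does not specify the kernel used in the hardware design.

```python
import math

def parzen_pdf_2d(samples, levels, h):
    """Estimate a 2-D PDF on a levels x levels grid using a Gaussian window."""
    n = len(levels)
    pdf = [[0.0] * n for _ in range(n)]
    norm = 1.0 / (len(samples) * 2.0 * math.pi * h * h)
    for (x, y) in samples:                    # N samples
        for i, lx in enumerate(levels):       # n levels in the first dimension
            for j, ly in enumerate(levels):   # n levels in the second dimension
                d2 = (x - lx) ** 2 + (y - ly) ** 2
                # Each sample/level pair is independent of all others.
                pdf[i][j] += norm * math.exp(-d2 / (2.0 * h * h))
    return pdf

pdf = parzen_pdf_2d(samples=[(0.0, 0.0)], levels=[-1.0, 0.0, 1.0], h=1.0)
print(pdf[1][1])
```

The innermost update has no dependence on other sample/level pairs, so the two inner loops can be unrolled across parallel kernels while the input stays at O(N) and the output at O(n^2).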

A general overview of the algorithm structure for this case study is presented

in Figure 4-4. A total of 67,108,864 (i.e., 64M) data samples, originating on one

microprocessor, are scattered equally among the P microprocessors. The number of

samples is large to fully stress the communication and memory capabilities of the target

FPGA platform. The microprocessors transfer the data to their respective FPGA node

in chunks of 8,192 data samples, limited by the available on-chip block RAM. A total of

80 pipelined kernels per node perform the necessary computations (comparison, scaling,

and accumulation) to analyze each data sample against the 256x256 probability levels.

The number of parallel kernels maximizes the available 96 hardware multipliers on the

target FPGA with some leeway. The numerical precision for the computation is 18-bit

ACKNOWLEDGMENTS

The success of this research is due to the support and generosity of the University of

Florida High-performance Computing and Simulation (HCS) Laboratory, and NSF Center

for High-performance Reconfigurable Computing (CHREC). A special thank you goes to

the author's thesis committee: Dr. Alan D. George (chair), Dr. Herman Lam, Dr. Greg

Stitt, and Dr. Beverly Sanders. Additional thanks go to Vikas A i- .wal, Max :i 'ey,

(. : Cicslcwski, C .:: Conger, John Curreri, R DeVille, Rafael Garcia, Dr. Ann

Gordon-Ross, Dr. Eric G-.... .. -, Adam Jacobs, Seth Koehler, A l.. eet Lawande, Dr.

Saumil Merchant, Karathik ". :: .i: Carlo Pascoe. Dr. Casey Reardon, P.::-:i Shih,

Dr. Ian Troxel, and Jason Williams.

This work was supported in part by the I/UCRC Program of the National Science

Foundation under Grant No. EEC-0642422. The author gratefully acknowledges vendor

equipment and/or tools provided by Altera, Cray, Impulse Accelerated Technologies,

Nallatech, SRC Computers, Xilinx, and XtremeData that helped make this work possible.

Additional thanks go to the students and faculty of the High Performance Computing Lab

(HCL) at George Washington University for the generous use of their SRC-6 system.

[Figure 5-1 graphic. Vertical axis: level of abstraction (high to low). Horizontal axis: alternative realizations in the design space (low to high). Levels, from top to bottom: back-of-the-envelope, estimation models (proposed framework), abstract executable models (Ptolemy, ESL languages, etc.), cycle-accurate models, synthesizable models (VHDL, Verilog).]

Figure 5-1. Abstraction pyramid comparing levels of modeling for hardware applications

using isolated abstract specification and analysis tools can be tedious, disconnected from

subsequent implementation tasks, and ultimately counterproductive.

This chapter proposes a methodology for a framework allowing integration of

modeling environments with RAT. The RAT methodology (and subsequent tool)

provides models describing the behavior of the individual computation and communication

operations and estimates the total application performance based on their .-'11,;!, ii.

RAT has demonstrated reasonably accurate performance prediction, but its efficiency is

limited by currently manual interpretation of application specifications for the necessary

inputs to the analysis. The proposed framework defines a "translation" component that

distills the required prediction inputs for RAT from the (supported) model of computation

(MoC) of the application specification. An abstraction layer insulates the translation

functionality from the tool-dependent details of the particular modeling environment.

Additionally, the framework defines "orchestration" of strategic DSE, which performs

RAT prediction on an initial application design and potential revisions to the underlying

algorithm and/or platform architecture. As validation of the productivity benefits of

framework-assisted DSE, a tool for translation and orchestration using the proposed

methodology (hereafter referred to as the "DSE tool") is constructed. The functionality


Computing is currently undergoing two reformations, one in device architecture and

the other in application development. Using the growth in transistor density predicted

by Moore's Law for increased clock rates and instruction-level parallelism has reached

fundamental limits, and the nature of current and future device architectures is focused

upon higher density in terms of multi-core and many-core structures and more explicit

forms of parallelism. Many such devices exist and are emerging on this path, some with

a fixed structure (e.g., quad-core CPU, Cell Broadband Engine) and some reconfigurable

(e.g., FPGAs). Concomitant with this reformation in device architecture, the complexity

of application development for these fixed or reconfigurable devices is at the forefront of

fundamental challenges in computing today.

The development of applications for complex architectures can be defined in terms of

four stages: formulation, design, translation, and execution. The purpose of formulation is

exploration of algorithms, architectures, and mapping, where strategic decisions are made

prior to coding and implementation. The design, translation, and execution stages are

where implementation occurs in an iterative fashion, in terms of programming, translation

to executable codes and cores, debug, verification, performance optimization, etc.

As architecture complexity continues to increase, so too does the importance of the

formulation stage, since productivity increases when design weaknesses are exposed

and addressed early in the development process. FPGA applications are particularly

noteworthy for the amount of effort needed with existing languages and tools to

render a successful implementation, and thus productivity of application development

for FPGA-based systems can greatly benefit from better concepts and tools in the

formulation stage. This document presents a novel methodology and model to support

the rapid formulation of applications for FPGA-based reconfigurable computing systems.

This model focuses on not only reasonably accurate performance estimation for single








Figure

2-1 Performance Characterization and General Structure of FPGA Devices
2-2 Spectrum of High-Performance Reconfigurable Computing Platforms
3-1 Overview of RAT Methodology
3-2 Example Overlap Scenarios
3-3 Trends for Computational Utilization in SB and DB Scenarios
3-4 Parallel Algorithm and Mapping for 1-D PDF
4-1 Two Classes of Modern High-performance FPGA systems
4-2 Synchronous Iterative Model
4-3 Example Timing Diagram for Application Scheduling
4-4 Application Structure for 2-D PDF Estimation Case Study
4-5 Platform Structure for 2-D PDF Estimation Case Study
4-6 Results of Efficiency Microbenchmarks for Nallatech BRAM I/O
4-7 Platform Structure for Image Filtering Case Study
4-8 Algorithm Structure for Image Filtering Case Study
4-9 Algorithm Structure for Molecular Dynamics Case Study
5-1 Abstraction pyramid comparing levels of modeling for hardware applications
5-2 Performance prediction using RAT
5-3 Y-chart approach to application specification in modeling environments
5-4 Framework bridging specification and analysis, and orchestrating DSE
5-5 Translation of application specification information for RAT prediction
5-6 Architecture specification of FPGA platform
5-7 Overview of MNW case study
5-8 Predicted execution times of MNW
5-9 MVA graph specification and mapping

such as DMA. Even for single-FPGA systems, a range of issues related to parallelism and

scalability can be explored. RAT is scoped to make a convenient and impactful model that

not only integrates broader issues such as numerical precision and resource utilization

but also contributes to the larger goal of better parallel algorithm formulation and

design-space exploration. Future research will expand the RAT methodology for larger

scale prediction on multi-FPGA systems.

3.3 Walkthrough

To simplify the RAT analysis in Section 3.2, a worksheet can be constructed based

upon Equations (3-1) through (3-11). Users simply provide the input parameters and the

resulting performance values are returned. This walkthrough further explains key concepts

of the throughput test by performing a detailed analysis of a real application case study,

one-dimensional probability density function (PDF) estimation. The goal is to provide a

more complete description of how to use the RAT methodology in a practical setting.

3.3.1 Algorithm Architecture

The Parzen window technique [36] is a generalized nonparametric approach to

estimating probability density functions (PDFs) in a d-dimensional space. The common

parametric forms of PDFs (e.g., Gaussian, Binomial, Rayleigh distributions) represent

mathematical idealizations and, as such, are often not well matched to densities

encountered in practice. Though more computationally intensive than using histograms,

the Parzen window technique is mathematically advantageous. For example, the resulting

probability density function is continuous therefore differentiable. The computational

complexity of the algorithm is of order O(Nn^d) where N is the total number of data

samples (i.e., number of elements), n is the number of discrete points at which the PDF is

estimated (comparable to the number of bins in a histogram), and d is the number of

dimensions. A set of mathematical operations is performed on every data sample over

n^d discrete points. Essentially, the algorithm computes the cumulative effect of every data

Table 3-6. Performance parameters of 2-D PDF (Nallatech)
Predicted Predicted Predicted Actual
fclk (MHz) 75 100 150 100
tcomm (sec) 1.01E-2 1.01E-2 1.01E-2 1.06E-2
tcomp (sec) 5.59E-2 4.19E-2 2.80E-2 4.46E-2
utilcommSB 15% 19% 27% 19%
utilcompSB 85% 81% 73% 81%
tRCSB (sec) 2.64E+1 2.08E+1 1.52E+1 2.21E+1
speedup 6.0 7.6 10.4 7.2
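
The predicted columns of Table 3-6 can be cross-checked with a few lines of arithmetic. The sketch below assumes the single-buffered (SB) case, in which communication and computation do not overlap, and uses the 400 iterations stated in the text. The function and variable names are hypothetical.

```python
N_ITER = 400  # iterations reported in the text for this algorithm

def single_buffered(t_comm, t_comp, n_iter=N_ITER):
    """Return (total RC time, communication utilization) with no overlap."""
    per_iter = t_comm + t_comp
    return n_iter * per_iter, t_comm / per_iter

# Predicted per-iteration (t_comm, t_comp) in seconds, keyed by clock in MHz.
predicted = {75: (1.01e-2, 5.59e-2),
             100: (1.01e-2, 4.19e-2),
             150: (1.01e-2, 2.80e-2)}

results = {f: single_buffered(*t) for f, t in predicted.items()}
for f_mhz, (t_rc, util_comm) in results.items():
    print(f"{f_mhz} MHz: t_RC = {t_rc:.1f} s, util_comm = {util_comm:.0%}")
```

Because the communication time is independent of the FPGA clock, its share of each iteration grows as the clock frequency rises, which is the trend visible in the utilization rows.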

Table 3-7. Resource usage of 2-D PDF (XC4VLX100)
FPGA Resource Utilization (%)
BRAMs 21
48-bit DSPs 33
Slices 22

against the comparable 100MHz prediction. The communication time is within 5%

of the predicted value with a discrepancy of 0.5 milliseconds again due to accurate

microbenchmarking of the Nallatech board's PCI-X interface. This difference is potentially

significant given the 400 iterations required to perform this algorithm. The overall impact

on speedup is further affected by variation in the computation time. An underestimation

by approximately 2.7 milliseconds creates a total discrepancy just over 3 milliseconds per

iteration. This larger error in the computational throughput parameter as compared to

1-D PDF is due to the more exact modeling of the pipeline behavior without adjustments

for potential overhead. These overheads from pipeline latency and polling were assumed

insignificant due to the length of the overall execution time but instead had noticeable

effect each iteration. In total, the speedup was i.' less than the predicted speedup. This

error margin is excellent given the fast and coarse-grained prediction approach of RAT

compounded over hundreds of iterations. Greater attention to communication behavior

and the nuances of the computation structure can further reduce this error if desired.

To the extent possible while maintaining fast performance estimation, insight about

shortcomings in previous RAT predictions can be factored into future projects to further

boost accuracy. Comparing Table 3-7 to the resource utilization from the 1-D algorithm,

deterministic, pipelined structure. For the SRC-6 system, the individual communication

and computation times were measured on the FPGA via a vendor-provided counter

function. Both operations are FPGA-controlled and initiated by a single function call

which could not be separated from the perspective of the host microprocessor. However,

the total RC execution as measured by the wall-clock time of the CPU is approximately

0.07 seconds longer than the sum of the computation and communication times measured

on the FPGA. Consequently, extra system overhead not considered by RAT caused the

actual speedup value to be 14% less than expected. The utilizations for the actual design

reflect this overhead with only 86% of RC execution time comprising communication

or computation. The discrepancy in total execution is large because the overhead is

significant relative to the short time (less than 0.5s). If the overhead had been factored

into the prediction, the total estimation error would be under 1%. Additionally, resource

utilization is summarized in Table 3-13. No multipliers are required for this type of

searching but the heavy usage of logic elements limits further scalability of the algorithm

on a single FPGA of this size.

3.4.4 Molecular Dynamics

Molecular Dynamics (MD) is the numerical simulation of the physical interactions of

atoms and molecules over a given time interval. Based on Newton's second law of motion,

the acceleration (and subsequent velocity and position) of the atoms and molecules are

calculated at each time step based on the particles' masses and the relevant subatomic

forces. For this case study, the molecular dynamics simulation is primarily focused on

the interaction of certain inert liquids such as neon or argon. These atoms do not form

covalent bonds and consequently the subatomic interaction is limited to the Lennard-Jones

potential (i.e. the attraction of distant particles by van der Waals force and the repulsion

of close particles based on the Pauli exclusion principle) [39]. Large-scale molecular

dynamics simulators such as AMBER [40] and NAMD [41] use these same classical physics

principles but can calculate not only Lennard-Jones potential but also the nonbonded

The last attribute in Table 4-1, tfpga, summarizes the computation time for the

2, 4, and 8 node cases. Each node for a particular system size (i.e., number of nodes,

P) will have an identical execution time because of the equal data decomposition. Due

to the increasing number of node resources, FPGA computation time, tfpga, decreases

approximately linearly. Two FPGAs require twice the time as four FPGAs which need

twice the time of eight FPGAs. This behavior is consistent with the embarrassingly

parallel nature of the 2-D PDF estimation algorithm.

4.4.3 Network Modeling

For the FPGA platform used in this case study, two communication network models

are necessary: PCI-X I/O Bus and Ethernet. The PCI-X bus model describes the

point-to-point interconnect between a host microprocessor and its Nallatech FPGA node.

The Ethernet model describes the MPICH2 communication over the Gigabit Ethernet

network. Assembling the attribute values for these models involves not only analysis of the

algorithm structure and mapping but also microbenchmarking of the underlying platform

behavior for typical communication transactions.

PCI-X Network Modeling

The I/O operations for 2-D PDF estimation involve transfers between the host CPU

and the onboard FPGA block RAM. Microbenchmarks were performed on common

transfer sizes (i.e., powers of two from 4B to 6 i11 ). Figure 4-6 summarizes the results

of these transfers, which can be referenced for all future I/O performance estimations.

Smaller transfers, Figure 4-6A, have erratic but steadily increasing efficiency whereas

larger transfers, Figure 4-6B, could be approximated by a single value. For the transfer

sizes used in this case study (writing 8,192 elements and reading 65,536 elements,

discussed later), the I/O efficiencies are 0.31 and 0.10, respectively.
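
To illustrate how a microbenchmarked efficiency feeds a transfer-time estimate, the sketch below divides the payload by the effective throughput (efficiency times ideal throughput). Only the two efficiencies and the element counts come from the text. The ideal throughput and bytes per element are placeholder assumptions, and the function name is hypothetical. The delay attribute mentioned alongside Table 4-2 is not modeled here.

```python
def transfer_time(n_elements, bytes_per_element, ideal_bytes_per_sec, efficiency):
    """Time for one transfer when only a fraction of ideal throughput is achieved."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return (n_elements * bytes_per_element) / (efficiency * ideal_bytes_per_sec)

# Assumed placeholders, not values from the dissertation.
IDEAL_BPS = 1.0e9        # ideal bus throughput in bytes per second
BYTES_PER_ELEMENT = 4    # element width in bytes

t_write = transfer_time(8192, BYTES_PER_ELEMENT, IDEAL_BPS, 0.31)
t_read = transfer_time(65536, BYTES_PER_ELEMENT, IDEAL_BPS, 0.10)
print(f"write: {t_write:.3e} s, read: {t_read:.3e} s")
```

Storing efficiencies per transfer size, rather than a single bus-wide figure, lets the same lookup serve later predictions that use different chunk sizes.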

Table 4-2 summarizes the delay and throughput attributes, gathered from microbenchmarks

of the PCI-X I/O bus, along with the quantity of data transmitted for the 2-D PDF

estimation case study. The microbenchmarks measure the total time of a data transfer,

Table 3-5. Input parameters of 2-D PDF
Dataset Parameters
Nelements, input (elements)
Nelements, output (elements) 6
Nbytes/element (bytes/element)
Communication Parameters (Nallatech)
throughputideal (MB/s)
αwrite (0 < α < 1) 0
αread (0 < α < 1) 0
Computation Parameters
Nops/element (ops/element) 19
throughputproc (ops/cycle)
fclock (MHz) 75/100
Software Parameters
tsoft (sec) 1
Niter (iterations)





65,536 (256 x 256) PDF values are sent back to the host processor after each iteration of

computation due to memory size constraints on the FPGA. The same numerical precision

of four bytes per element is used for the data set. The interconnect parameters model

the same Nallatech FPGA card as in the 1-D case study but for different transfer sizes.

The αread term is small for the relatively large output of 65,536 elements because data

is transferred in 256 batches of 256 elements each, incurring a large latency overhead.

Each of the 65,536 bins requires three operations for a total of 196,608 operations. Eight

kernels, each containing two pipelines (one per dimension), perform three operations per

pipeline per cycle for a total of 48 simultaneous computations per cycle. Again, the same

range of clock frequencies is used for comparison. The software baseline for computing

speedup values was written in C and executed on the same 3.2GHz Xeon processor. The

algorithm requires the same 400 iterations to complete the computation and VHDL is also


The RAT performance predictions are listed with the experimentally measured

results in Table 3-6. These three predictions are based on the range of clock frequency

values listed in Table 3-5 but the accuracy of the actual 100MHz design is only evaluated

shaded boxes. The dashed border of the modeling environment and tool-abstraction layer

indicates the interchangeability of specification tool. Section 5.3.1 describes the procedure

for translation of algorithm-based MoCs into the quantitative performance attributes

and application scheduling necessary to direct the synchronous, iterative performance

model of RAT. Section 5.3.2 describes the mechanism for orchestration of DSE, specifically

the directed revision of an initial application specification to examine and compare the

performance potential of design alternatives.

5.3.1 Translation

Although individual modeling environments and performance prediction techniques

sometimes include methods for direct connectivity to other tools, an explicit intermediary

between specification and analysis is advantageous. The proposed framework provides

translation between the algorithm MoCs and the RAT performance prediction, facilitating

the transfer of the required quantitative attributes and scheduling information to the

corresponding computation and communication model. Potential issues during translation

include differences in the data structures (e.g., format, representation, or precision),

abstraction levels, and semantic meaning along with other dilemmas such as missing,

redundant, or inconsistent data. Resolving these issues can require acute awareness of

the low-level details of the data formats, syntax, and semantics of the tools with extra

functionality to identify and request additional information from the user as necessary.

The need for unique bridges between every desired modeling tool and RAT is greatly

reduced by an abstraction layer, which allows the framework to perform the majority of

the translation based on a generic format for algorithm MoCs derived from the specific

modeling environment tool.

As illustrated in Figure 5-5, the algorithm and architecture attributes for the basic

operations of the MoC of the application specification must be reorganized and formatted

based on their contribution to the RAT computation and/or communication estimation.

The framework constructs RAT computation models for every hardware resource based on

technique presented in [18] seeks to parameterize the computational algorithm and the

FPGA system itself. The analytical methods have similarities to RAT but the emphasis

is on projecting potential bottlenecks due to memory throughput, not on predicting total

system performance. Dynamo [19] involves performance prediction of image processing

systems partitioned and compiled at runtime from existing pipelined kernels. The system

provides dynamic optimization for application construction exclusively from existing

modules and assumes that algorithm design and analysis is completed prior to the use of

Dynamo. In [20], 12 design techniques are presented for maximizing the performance of

FPGA applications. This research is synergistic to RAT by potentially reducing the number

of algorithm and architecture iterations necessary to achieve suitable performance; however,

the RAT methodology is still required to quantitatively evaluate each design iteration.

Though prediction is quite common with FPGA technologies, it is not primarily used

for system-level performance. Routing is a common target for device-level prediction due

to the impact on development time and performance. In [21], a model of the algorithm

routing demands is created early in the FPGA development cycle. In [22], prediction is

used to mitigate the variability and long run times of commercial place and route tools

for estimating interconnect delay. Other issues including timing [23], routability [24],

interconnect planning [25], and routing delay [26] are explored via prediction. Performance

is also explored by modeling issues such as power [27] and wafer yield [28]. Many of

these prediction techniques for lower-level issues migrated from application-specific

integrated circuits into the RC domain to more efficiently model the growing complexity

of FPGAs. Similarly, RAT and other methodologies are branching to RC from existing

areas of parallel application modeling to bridge the growing need for efficient performance


Table 3-12. Performance parameters of TSP (SRC)
Predicted Actual
fclk (MHz) 100 100
tcomm (sec) 1.54E-5 1.57E-5
tcomp (sec) 4.31E-1 4.30E-1
utilcommsB 0.00 !' 0.011:'
utilcompsB 99.' I.' 86.'".
tRCsB (sec) 4.31E-1 4.99E-1
speedup 5.16 4.45

Table 3-13. Resource usage of TSP (XC2V6000)
FPGA Resource Utilization (%)
BRAMs 56
18x18 Multipliers 0
Slices 73

While N^2 distances are needed to compute path lengths, N^N total paths must be examined.

For consistency with the other case studies, the number of operations per element is set to

N^(N-2) (i.e., 9^7 = 4,782,969), which makes the RAT prediction computationally equivalent to

the view of N^N path elements (for computation only) with one operation each. Since nine

cities are examined in this case study using nine kernels, a total of nine potential paths are

examined per clock cycle. The clock frequency of the MAP-B unit is fixed at 100 MHz and

only one input/compute/output iteration is required for this algorithm. The C software

baseline was executed on a 3.2GHz Xeon processor. The parallel algorithm is constructed

in SRC's Carte C, a high-level language (HLL) for FPGA design.

The results of the hardware design are compared against the performance predictions

in Table 3-12. The percent error in the predicted communication time was less than 2%

due to microbenchmarking on the SRC-6 specifically to replicate the short communication

transfers. The cycle-accurate timers of the SRC-6 system meant this discrepancy was not

a measurement error but instead a function of the modeling and parameterization of the

SNAP interconnect throughput. However, the actual communication time is only 16 µs

(versus 430ms for computation) and consequently its impact on speedup is negligible for

this case study. The predicted and actual computation times were nearly identical due to

Table 3-11. Input parameters of TSP
Dataset Parameters
Nelements, input (elements) 81
Nelements, output (elements) 1
Nbytes/element (bytes/element) 8
Communication Parameters (SRC)
throughputideal (MB/s) 1400
αwrite (0 < α < 1) 0.03
αread (0 < α < 1) 0.03
Computation Parameters
Nops/element (ops/element) 4782969
throughputproc (ops/cycle) 9
fclock (MHz) 100
Software Parameters
tsoft (sec) 2.22
Niter (iterations) 1

of path validity and length (i.e. if all cities were visited exactly once, report the total

distance traveled). The individual steps are not interrelated and the examination of

possible paths can be pipelined. However, unlike the branch-and-bound technique which

backtracks in the middle of paths to avoid revisiting cities, the hardware pipeline operates

on full N-length paths, even those invalid because of repeated cities. Extra computation is

required but substantially more parallelism is exploitable.
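The exhaustive strategy described above can be sketched in software. This is a minimal illustration, not the Carte C hardware design: it enumerates every N-length sequence of cities (N^N candidates), keeps those that visit each city exactly once, and records the shortest length. The function name `exhaustive_tsp`, the open-path (non-returning) length, and the toy distance matrix are assumptions made for the example; unlike the hardware pipeline, the sketch skips the length computation for invalid paths.

```python
from itertools import product


def exhaustive_tsp(dist):
    """Check all N^N city sequences; return (examined, valid, best_length).

    dist is an N x N matrix of pairwise distances. A path is valid when
    every city appears exactly once. Path length is the sum of distances
    between consecutive cities (open path, an assumption of this sketch).
    """
    n = len(dist)
    examined = 0
    valid = 0
    best = None
    for path in product(range(n), repeat=n):
        examined += 1
        if len(set(path)) != n:
            continue  # repeated city: invalid, but still counted as examined
        valid += 1
        length = sum(dist[a][b] for a, b in zip(path, path[1:]))
        if best is None or length < best:
            best = length
    return examined, valid, best


# Toy input: four cities on a line at positions 0, 1, 2, 3.
toy_dist = [[abs(i - j) for j in range(4)] for i in range(4)]
examined, valid, best = exhaustive_tsp(toy_dist)
```

For four cities the sketch examines 4^4 = 256 candidate paths, of which 4! = 24 are valid, which shows how much extra work the full-path formulation accepts in exchange for independent, pipelinable checks.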

Table 3-11 lists the input parameters of the RAT performance prediction for TSP.

The interconnect parameters model the proprietary SNAP interconnect of the SRC-6

system. The small fraction of throughput, α, represents the overhead associated with

the extremely minimal communication in the algorithm, only N × N input elements.

This information contains the distances between every pair of cities. The only output for

this system is the minimal path length and this communication time is assumed to be

negligible. Elements are 8 bytes, the width of the MAP-B's SRAM, but only 4 bytes (32

bits) per element are used to represent distances in fixed point. The information is not

byte-packed for communication and consequently the other 32 bits are wasted. For this

case study, the computational workload is exponentially related to the number of elements.
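The parameters in Table 3-11 can be turned into a prediction with a few lines of arithmetic. The sketch below assumes the usual single-buffered RAT forms (communication time as bytes moved divided by the effective throughput α × throughput_ideal, computation time as total operations divided by f_clock × throughput_proc, and speedup as t_soft over the total RC time); the function name `rat_throughput`, the treatment of 1 MB as 10^6 bytes, and the choice to pass zero output elements (reflecting the text's assumption that output time is negligible) are assumptions of this example.

```python
def rat_throughput(n_in, n_out, bytes_per_elem, tp_ideal_mbps,
                   alpha_write, alpha_read, ops_per_elem, tp_proc,
                   f_clock_mhz, t_soft, n_iter):
    """Single-buffered RAT estimate (assumed equation forms, see lead-in)."""
    bandwidth = tp_ideal_mbps * 1e6            # bytes/s, assuming 1 MB = 1e6 B
    t_write = n_in * bytes_per_elem / (alpha_write * bandwidth)
    t_read = n_out * bytes_per_elem / (alpha_read * bandwidth)
    t_comm = t_write + t_read
    t_comp = n_in * ops_per_elem / (f_clock_mhz * 1e6 * tp_proc)
    t_rc = n_iter * (t_comm + t_comp)          # no overlap (single buffered)
    return {"t_comm": t_comm, "t_comp": t_comp,
            "t_rc": t_rc, "speedup": t_soft / t_rc}


# TSP inputs from Table 3-11; output transfer treated as negligible (0 elements).
tsp = rat_throughput(n_in=81, n_out=0, bytes_per_elem=8, tp_ideal_mbps=1400,
                     alpha_write=0.03, alpha_read=0.03, ops_per_elem=4782969,
                     tp_proc=9, f_clock_mhz=100, t_soft=2.22, n_iter=1)
```

Under these assumptions the sketch gives a communication time of about 1.54E-5 s, a computation time of about 0.43 s, and a speedup of about 5.2, in line with the predicted TSP values in the summary table.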

using the Y-chart approach [54], as illustrated in Figure 5-3. These application models

(particularly the algorithm model) describe the behavior of an application in terms of

a model of computation (MoC). A MoC defines a set of allowable operations (i.e.,

basic and often technology-dependent computational events), communication between

operations (i.e. data movement), their relative costs (e.g., clock cycles), and the total

system behavior based on the operations composing the application. Each modeling

environment uses graphical and/or textual elements to denote precise syntactic and

semantic meanings for an application specification based on the MoC.

The case studies for this chapter, MNW and MVA graph, are specified by asynchronous

message-passing (AMP) and synchronous dataflow (SDF) MoCs, respectively, which

represent common models for FPGA systems. AMP denotes the use of one or more

queues to describe communication between groups of operations. Only messages

within the same queue are strictly ordered with unspecified timing between different

queues. SDF represents a special case of AMP with groups of operations evaluated

as soon as the necessary messages are available from the communication channels,

which are uni-directional. Data enters the application model at a constant rate, which

eventually induces a steady-state evaluation rate for each group of operations with total

performance defined by the slowest group. AMP suitably describes the straightforward

DMA communication between the microprocessor and FPGAs for MNW. SDF provides

mechanisms for describing the pipeline network of the MVA graph.
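The steady-state property of SDF noted above, that total performance is defined by the slowest group of operations, can be illustrated with a short sketch. The group names and rates are hypothetical values chosen for the example and are not taken from the MVA graph case study.

```python
def sdf_steady_state_rate(group_rates):
    """Return the steady-state rate of an SDF pipeline.

    group_rates maps each group of operations to the rate (items/s) it
    could sustain in isolation. In steady state the whole pipeline
    advances at the rate of its slowest group.
    """
    if not group_rates:
        raise ValueError("at least one group of operations is required")
    return min(group_rates.values())


def sdf_bottleneck(group_rates):
    """Return the name of the group that limits the pipeline."""
    return min(group_rates, key=group_rates.get)


# Hypothetical three-stage pipeline (rates in items per second).
rates = {"load": 200e6, "filter": 150e6, "store": 180e6}
pipeline_rate = sdf_steady_state_rate(rates)
limiting_group = sdf_bottleneck(rates)
```

Here the pipeline settles at the 150e6 items/s of the "filter" group, so speeding up either of the other two groups would not change the predicted performance.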

The proposed DSE tool requires a modeling environment capable of effectively

representing abstract algorithm, architecture, and subsequent application mapping

models based on AMP and SDF MoCs. Ptolemy [57] is an environment specifically for

simulating and prototyping systems involving heterogeneous MoCs, including AMP and

SDF. Metropolis [58] defines "metamodels" that use formal execution semantics to define

the application function, platform architecture, and mapping of the system based on a new

or existing MoCs. Artemis [59] and Sesame [60] use a hierarchical Kahn Process Network

Table 3-17. Summary of Results
                       1-D PDF   2-D PDF   LIDAR     TSP       MD
Predicted Comm. (s)    2.47E-5   1.01E-2   6.60E-4   1.54E-5   8.77E-4
Actual Comm. (s)       2.50E-5   1.06E-2   5.65E-4   1.57E-4   1.39E-3
Comm. Error            1%        5%        17%       ?         37%
Predicted Comp. (s)    1.31E-4   4.19E-2   2.64E-4   4.31E-1   5.37E-1
Actual Comp. (s)       1.39E-4   4.46E-2   2.25E-4   4.30E-1   8.79E-1
Comp. Error            6%        6%        17%       0.2%      39%
Predicted Speedup      6.5       7.6       11.8      5.2       10.7
Actual Speedup         7.8       7.2       13.8      4.5       6.6
Speedup Error          1?%       ?         1?%       1?%       3?%

entire dataset is processed in a single iteration and the algorithm is constructed in Impulse

C, a cross-platform HLL for FPGAs.

Table 3-15 outlines the predicted and actual results of the MD. Note that these

results are unique to this specific algorithm and that different structures, target languages,

and platforms will have varying prediction accuracy. The difference in predicted and

actual communication time is 37%. The error itself is associated with the overhead of

multiple I/O transfers between the CPU and on-board SRAM memory modeled as a

single block of communication. While more accurate estimations are the goal of RAT,

any further precision improvements for this parameter are inconsequential given the low

communication utilization. Computation dominated the overall RC execution time and

the actual time (8.79E-1 s) is considerably higher than the predicted value (5.37E-1 s) due to the data-driven operations

and suboptimal pipelining performance. The total number of operations was higher than

expected, coupled with relatively modest parallelism for the problem size. Consequently,

the speedup error was similarly large, with the actual speedup significantly less than desired. However, this case study

is useful because the qualitative need for significant parallelism is correctly predicted even

though the algorithm cannot be fully analyzed at design time. As Table 3-16 illustrates,

a large percentage of the combinatorial logic and all dedicated multiply-accumulators

(DSPs) were required for the algorithm.

similar to the scatter. However, unlike scatter (or gather) the amount of data during each

transmission does not increase because the data is reduced at every node. Consequently,

the reduce has log2(P) latency, L, and transmission time, Gk, as defined in Equation 4-19.

Additionally, each transmission requires an addition operation, γ, for each of the k data

values in the message. Note that Equations 4-18 and 4-19 assume P is a power of 2.

$$t_{\text{transaction-reduce}} = \log_2(P) \times (L + 2o + Gk + \gamma k) \qquad (4\text{-}19)$$

The second block of Table 4-3 lists the two application-dependent attributes defined

by the user based on the 2-D PDF estimation case study. Again, system configurations

of 2, 4, and 8 nodes, P, are used for this case study. The 2-D PDF application requires

two distinct transactions: distribution of the input data for the X and Y dimensions (i.e.,

MPIScatter) and reduction of the partial PDFs (i.e., MPIReduce). The 2, 4, and 8

FPGA platform configurations will involve message sizes, k, of 128MB, 64MB, and 32MB

of data, respectively for the scatter. For the reduce, every node will contribute the 256KB

(256x 256x 4B) partial results (regardless of the number of nodes) that are ultimately

accumulated on the head node.

The third block of Table 4-3 summarizes the results of the network model. The

individual times for the scatter and reduce transactions, t_transaction, are listed. These times

increase logarithmically for the 2, 4, and 8 node platforms due to the increasing number of

messages (i.e., log2(P)) required for the transaction.
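The logarithmic growth of the reduce time can be checked numerically from Equation 4-19. The sketch below is only an illustration: the function name `t_transaction_reduce` and the LogGP-style parameter values (L, o, G, γ) are hypothetical placeholders, not the Ethernet attributes of Table 4-3. Only the 256KB message size comes from the text.

```python
import math


def t_transaction_reduce(p, k, latency, overhead, gap_per_byte, gamma):
    """Reduce time per Equation 4-19: log2(P) * (L + 2o + G*k + gamma*k).

    p must be a power of 2, as the equation assumes.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("P must be a power of 2")
    per_step = latency + 2 * overhead + gap_per_byte * k + gamma * k
    return math.log2(p) * per_step


# 256 x 256 x 4 B partial PDF contributed by every node (from the text).
K_BYTES = 256 * 256 * 4

# Hypothetical network parameters, in seconds (per byte for G and gamma).
PARAMS = dict(latency=50e-6, overhead=5e-6, gap_per_byte=1e-8, gamma=1e-9)

reduce_times = {p: t_transaction_reduce(p, K_BYTES, **PARAMS) for p in (2, 4, 8)}
```

Because the per-step cost does not depend on P for the reduce, the times for 2, 4, and 8 nodes are in the ratio 1 : 2 : 3 regardless of the parameter values chosen.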

4.4.4 Stage/Application Modeling

The stage and application models are synonymous because this case study consists

of a single stage of execution. Equation 4-20 summarizes the set, Scomp, of the execution

times, tfpga, for the 2, 4 and 8 (i.e., P) FPGA nodes used in this case study. From

Equation 4-21, the computation time, tcomp, is determined by the maximum (longest)

Related to the speedup is the computation and communication utilization given by

Equations (3-8), (3-9), (3-10), and (3-11). These metrics determine the fraction of the

total application execution time spent on computation and communication for the SB and

DB cases. For SB, the computation utilization can provide additional insight about the

application speedup. If utilization is high, the FPGA is rarely idle thereby maximizing

speedup. Low utilizations can indicate potential for increased speedups if the algorithm

can be reformulated to have less (or more overlapped) communication. In contrast to

computation which is effectively parallel for optimal FPGA processing, communication

is serialized. Whereas computation utilization gives no indication about the overall

resource usage, since additional FPGA logic could be added to operate in parallel without

affecting the utilization, the communication utilization indicates the fraction of bandwidth

remaining to facilitate additional transfers since the channel is only a single resource.

For DB, assuming steady-state behavior, the implications of the utilization terms are

slightly different. The larger value, whether communication or computation, will have a

utilization of 1. If computation is the shorter (i.e. overlapped) time, utilization illustrates

how starved the computation is for data. If communication is the shorter time, utilization

is a measure of the available throughput to support additional parallel computation. An

example of these utilization trends is shown in Figure 3-3.

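$$util_{comp_{SB}} = \frac{t_{comp}}{t_{comm} + t_{comp}} \qquad (3\text{-}8)$$

$$util_{comm_{SB}} = \frac{t_{comm}}{t_{comm} + t_{comp}} \qquad (3\text{-}9)$$

$$util_{comp_{DB}} = \frac{t_{comp}}{\mathrm{Max}(t_{comm}, t_{comp})} \qquad (3\text{-}10)$$

$$util_{comm_{DB}} = \frac{t_{comm}}{\mathrm{Max}(t_{comm}, t_{comp})} \qquad (3\text{-}11)$$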
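As a quick numerical check of Equations 3-8 through 3-11, the sketch below evaluates all four utilizations for one pair of times. The function name `utilization` is an assumption of this example; the inputs are the predicted 125MHz LIDAR times listed in Table 3-9.

```python
def utilization(t_comm, t_comp):
    """Evaluate Equations 3-8 to 3-11 for single (SB) and double (DB) buffering."""
    total = t_comm + t_comp          # SB: communication and computation serialize
    longest = max(t_comm, t_comp)    # DB: the longer term hides the shorter one
    return {
        "comp_sb": t_comp / total,
        "comm_sb": t_comm / total,
        "comp_db": t_comp / longest,
        "comm_db": t_comm / longest,
    }


# Predicted LIDAR times at 125 MHz (Table 3-9), in seconds.
lidar = utilization(t_comm=6.60e-4, t_comp=2.64e-4)
```

For these inputs communication accounts for roughly 71% of the single-buffered time. In the double-buffered case communication has utilization 1 and computation 0.4, which by the discussion above indicates computation that is starved for data.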

Table 3-9. Performance parameters of LIDAR (Cray)
                 Predicted  Predicted  Predicted  Actual
f_clk (MHz)      100        125        150        125
t_comm (sec)     6.60E-4    6.60E-4    6.60E-4    5.65E-4
t_comp (sec)     3.30E-4    2.64E-4    2.20E-4    2.25E-4
util_comm_SB     67%        71%        75%        72%
util_comp_SB     33%        29%        25%        28%
t_RC_SB (sec)    9.90E-4    9.24E-4    8.80E-4    7.90E-4
speedup          11.0       11.8       12.4       13.8

speedup. The software baseline was written in C and executed on a 2.4GHz Opteron

processor, the host CPU for the Cray XD1 node. Only one iteration (i.e. GPS interval)

is required for this case study and VHDL is used to implement the parallel algorithm in


Table 3-9 compares the RAT performance predictions with the actual 125MHz

experimental results. The structure of the particular Cray SRAM interface overlaps

computation and DMA transfers back to the CPU (i.e. tread). Consequently, the

total RC execution time was directly measurable but the individual computation and

communication times for the actual result were estimated from the total execution time

based on the expected latency of the computation. Two general conclusions were that

both the computation and communication times were overestimated by RAT but that the

utilization ratios were still fairly consistent with expectations. Unlike the previous case

studies, the total speedup was underestimated (11.8 predicted versus 13.8 actual). This discrepancy in speedup was

primarily due to the difference in communication times. The actual computation pipeline

is believed to correspond closely with the high-level algorithm. The communication and

computation times of 565 µs and 225 µs are likely comparable to the system overhead

and measurement error, causing noticeable discrepancies and the unusual behavior of

a pessimistic prediction even with the generalized analytical model. Though extra

performance as compared to RAT projections may be an unexpected benefit for the

algorithm, the goal of the methodology is precise prediction that considers all major

factors to performance. Adjustments to the model for more accurate prediction of short

The second phase of research proposed RATSS, an extension of the RAT model for

multi-FPGA systems. RATSS balanced the desire for greater algorithm and platform

diversity (i.e., model applicability) with the requirement of high predictability (i.e.,

model accuracy) for scalable systems by focusing on synchronous iterative algorithms

for two classes of modern RC systems. Synchronous iterative algorithms represented a

significant class of data-parallel applications, typically structured as SIMD-style pipelines.

Focusing on two classes of RC systems allows hierarchical integration of computation

and communication models into RAT predictions for the full application. Successes in

conventional HPC and HPEC modeling such as the LogP communication model are

leveraged to help maximize efficiency and reliability. Three case studies, 2-D PDF, image

filtering, and MD, demonstrated low total prediction errors.


For the third phase of research, the RAT (and RATSS) methodologies were

integrated within a larger framework for more strategic design-space exploration of

RC applications. Specifically, the framework bridged RAT performance prediction

with modeling environments. These modeling environments allowed rapid yet accurate

application specification by a designer within the context of the MoC. The framework

provided translation between supported MoCs and the analytical performance model

for RAT (i.e., the synchronous iterative model). A tool constructed from the framework

methodology provided translation for the AMP and SDF MoCs of the RCML modeling

environment. This framework tool orchestrated design-space exploration by performing

RAT analysis on an initial application design and potential revisions to the characteristics

of algorithm and/or platform architecture, identifying suitable design configurations

based on designer criteria. Two case studies, MNW and MVA graph, demonstrated

reasonable prediction accuracy (under 5% for MNW) and rapid exploration of large design

spaces (140ms and 340ms for RAT analysis of 100K revisions to MNW and MVA graph,




[48] P. B. Bhat, V. K. Prasanna, and C. S. Raghavendra, "Adaptive communication
algorithms for distributed heterogeneous systems," J. Parallel Distributed Computing,
vol. 59, no. 2, pp. 252-279, 1999.

[49] F. Cappello, P. Fraigniaud, B. Mans, and A. L. Rosenberg, "HiHCoHP: Toward
a realistic communication model for hierarchical hyperclusters of heterogeneous
processors," in Proc. 15th Int'l Parallel and Distributed Processing Symp. (IPDPS),
Washington, DC, USA, 2001, p. 42, IEEE Computer Society.

[50] B. Holland, K. Nagarajan, and A. D. George, "RAT: RC amenability test for rapid
performance prediction," ACM Trans. Reconfigurable Technology and Systems
(TRETS), vol. 1, no. 4, pp. 22:1-22:31, 2009.

[51] E. Parzen, "On estimation of a probability density function and mode," Annals of
Mathematical Statistics, vol. 33, no. 3, pp. 1065-1076, 1962.

[52] K. Nagarajan, B. Holland, A. George, K. C. Slatton, and H. Lam, "Accelerating
machine-learning algorithms on FPGAs using pattern-based decomposition," J.
Signal Processing Systems, Jan. 2009.

[53] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Second Edition,
Prentice-Hall, Inc, Upper Saddle River, NJ, 2002.

[54] B. Kienhuis, E. F. Deprettere, P. van der Wolf, and K. Vissers, "A methodology to
design programmable embedded systems: The Y-chart approach," in Embedded
Processor Design Challenges, pp. 18-37, Springer, 2002.

[55] G. D. Peterson and R. D. Chamberlain, "Beyond execution time: Expanding the use
of performance models," IEEE Parallel & Distributed Technology: Systems & Applications,
vol. 2, no. 2, pp. 37-49, 1994.

[56] R. W. Hockney, "The communication challenge for MPP: Intel Paragon and Meiko
CS-2," Parallel Computing, vol. 20, no. 3, pp. 389-398, 1994.

[57] J. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt, "Ptolemy: A framework for
simulating and prototyping heterogeneous systems," Int'l J. Computer Simulation,
vol. 4, pp. 152-184, April 1994.

[58] F. Balarin, Y. Watanabe, H. Hsieh, L. Lavagno, C. Passerone, and
A. Sangiovanni-Vincentelli, "Metropolis: an integrated electronic system design
environment," Computer, vol. 36, no. 4, pp. 45-52, April 2003.

[59] A. D. Pimentel, L. O. Hertzberger, P. Lieverse, P. van der Wolf, and E. F. Deprettere,
"Exploring embedded-systems architectures with artemis," Computer, vol. 34, no. 11,
pp. 57-63, November 2001.

[60] A. D. Pimentel, C. Erbas, and S. Polstra, "A systematic approach to exploring
embedded system architectures at multiple abstraction levels," IEEE Trans. Computers,
vol. 55, no. 2, pp. 99-112, February 2006.

Table 3-3. Performance parameters of 1-D PDF (Nallatech)
                 Predicted  Predicted  Predicted  Actual
f_clk (MHz)      75         100        150        150
t_comm (sec)     2.47E-5    2.47E-5    2.47E-5    2.50E-5
t_comp (sec)     2.62E-4    1.97E-4    1.31E-4    1.39E-4
util_comm_SB     9%         11%        16%        15%
util_comp_SB     91%        89%        84%        85%
t_RC_SB (sec)    1.15E-1    8.85E-2    6.23E-2    7.45E-2
speedup          5.0        6.5        9.3        7.8

Table 3-4. Resource usage of 1-D PDF (XC4VLX100)
FPGA Resource    Utilization (%)
BRAMs            15
48-bit DSPs      8
Slices           16

difficult. Empirical knowledge of FPGA platforms and algorithm design practices provides
some insight as to a range of likely values. However, attaining a single, accurate estimate
of the maximum FPGA clock frequency achieved is generally impossible until after the
entire application has been converted to a hardware design and analyzed by an FPGA
vendor's layout and routing tools. Consequently, a number of clock values ranging from
75MHz to 150MHz for the LX100 are used to examine the scope of possible speedups.

The software parameters provide the last piece of information necessary to complete
the speedup analysis. The software execution time of the algorithm is provided by the
user. Often, software legacy code is the basis for the hardware migration initiative. FPGA
development could be based directly on mathematical models, but there would be no
baseline for evaluating speedup. The baseline software for the 1-D PDF estimation was
written in C, compiled using gcc, and executed on a 3.2 GHz Xeon. Finally, the number of
iterations is deduced from the portion of the overall problem to reside in the FPGA at
one time. Since the user decided to only process 512 elements at a time from the
204,800 element set, there must be 400 (i.e., 204800/512) iterations of the algorithm. This
case study is implemented in VHDL.
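The iteration count and its effect on the total execution time can be reproduced with a short sketch. It assumes a 204,800-element dataset processed 512 elements at a time, and assumes the single-buffered total is the iteration count multiplied by the sum of the per-iteration communication and computation times; both helper names are hypothetical. The per-iteration times are the 75MHz predictions from Table 3-3.

```python
def iterations(total_elements, block_elements):
    """Number of FPGA passes needed to cover the dataset (ceiling division)."""
    return -(-total_elements // block_elements)


def t_rc_sb(n_iter, t_comm, t_comp):
    """Single-buffered RC time: iterations of non-overlapped comm + comp."""
    return n_iter * (t_comm + t_comp)


n_iter = iterations(204800, 512)
# Per-iteration predictions at 75 MHz from Table 3-3, in seconds.
t_total_75mhz = t_rc_sb(n_iter, t_comm=2.47e-5, t_comp=2.62e-4)
```

With 400 iterations the total comes to about 1.15E-1 s, matching the 75MHz entry for t_RC_SB in Table 3-3.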

Figure 5-6. Architecture specification of FPGA platform

5.4.1 Experimental Setup

For validation of the proposed methodology, a DSE tool provides functionality for

gathering the application specification, performing translation, and orchestrating DSE

using performance prediction. This functionality includes interaction with a modeling

environment tool, RCML, to collect the necessary information from an application

specification and usage of a prediction tool, RAT, for performance analysis. RCML

provides an RC-specific abstraction environment with semantic constructs amenable to the

RAT prediction. The DSE tool is a hierarchical composite of existing tools for RCML and

RAT and newly constructed components providing translation and orchestration. These

translation and orchestration components are implemented as a Java-based Eclipse plug-in

to help minimize the customized interfacing necessary for connecting to the RCML and

RAT tools.

The two application case studies for this chapter are mapped onto a Linux server

containing a GiDEL PROCStar-III FPGA card connected by a PCIe x8 bus to a Xeon

E5520 (i.e., 2.26GHz Quad-core Nehalem) microprocessor. The GiDEL FPGA card

contains four Altera Stratix-III E260 FPGAs, which have interconnects to adjacent

FPGAs and support DMA transfers to and from the microprocessor. Figure 5-6 outlines

the general architecture model for the FPGA-augmented platform. This FPGA system

can be used as a prototype for an RC-augmented embedded platform or represent a single

node in a multi-node RC supercomputer.

and fidelity (through analytical methods). These modeling concepts, particularly LogGP,

are also used for extending RAT to scalable multi-FPGA systems (Chapter 4).

Simulation is another common outlet for quantifying the performance of RC

application models at a high level. In [13], a framework for simulation of FPGA systems

and applications is built on top of the Fast and Accurate Simulation Environment (FASE)

[14]. Models are created for the Mission-Level Designer (MLD) tool based on scripts of

algorithm behavior to rapidly explore large-scale FPGA systems. Another simulation

framework is the Hybrid System Architecture Model (HySAM) coupled with DRIVE

[15]. HySAM provides mechanisms for parameterizing architectures, defining algorithms,

and simulating interactions, while DRIVE provides tools for visualizing results generated

by HySAM. In [16], SimpleScalar and ModelSIM are combined for system analysis

through simultaneous processor emulation and VHDL simulation. Another tool [17] uses

a Simics-based simulator for capturing precise memory-access patterns while functionally

verifying hardware kernels. While each of these methodologies provides high-level

simulation fidelity, significant cost is associated with setting up the requisite models.

Either actual hardware or software code is required or effort is spent on constructing

custom simulation inputs distilled from algorithm and system behavior. In contrast, RAT

seeks to render performance prediction of the application and FPGA platform prior to any

significant hardware or software coding. By using analytical models instead of simulation

frameworks, prediction effort is minimized while maintaining reasonable accuracy. Nevertheless,

some insight about modeling larger systems can be leveraged from the multi-FPGA

simulation frameworks.

Understanding and improving algorithm design for FPGAs via analytical modeling

is an expanding area of RC research. One technique [1] focuses on analytical modeling

of shared heterogeneous workstations containing reconfigurable computing devices. The

methodology primarily emphasizes system-level, multi-FPGA architectures with variable

computational loading due to the multi-user environment. A performance prediction

Full Text




© 2010 Brian M. Holland


This work is dedicated to my family and friends.


ACKNOWLEDGMENTS

The success of this research is due to the support and generosity of the University of Florida, High-performance Computing and Simulation (HCS) Laboratory, and NSF Center for High-performance Reconfigurable Computing (CHREC). A special thank you goes to the author's thesis committee: Dr. Alan D. George (Chair), Dr. Herman Lam, Dr. Greg Stitt, and Dr. Beverly Sanders. Additional thanks go to Vikas Aggarwal, Max Billingsley, Grzegorz Cieslewski, Chris Conger, John Curreri, Ryan DeVille, Rafael Garcia, Dr. Ann Gordon-Ross, Dr. Eric Grobelny, Adam Jacobs, Seth Koehler, Abhijeet Lawande, Dr. Saumil Merchant, Karathik Nagarajan, Carlo Pascoe, Dr. Casey Reardon, Philips Shih, Dr. Ian Troxel, and Jason Williams.

This work was supported in part by the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422. The author gratefully acknowledges vendor equipment and/or tools provided by Altera, Cray, Impulse Accelerated Technologies, Nallatech, SRC Computers, Xilinx, and XtremeData that helped make this work possible. Additional thanks go to the students and faculty of the High-performance Computing Lab (HCL) at George Washington University for the generous use of their SRC-6 platform.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND AND RELATED RESEARCH
  2.1 FPGA Background
  2.2 Related Research

3 ANALYTICAL MODEL FOR FPGA PERFORMANCE ESTIMATION PRIOR TO DESIGN (PHASE 1)
  3.1 Introduction
  3.2 RC Amenability Test
    3.2.1 Throughput
    3.2.2 Numerical Precision
    3.2.3 Resources
    3.2.4 Scope of RAT
  3.3 Walkthrough
    3.3.1 Algorithm Architecture
    3.3.2 RAT Input Parameters
    3.3.3 Predicted and Actual Results
  3.4 Additional Case Studies
    3.4.1 2-D PDF Estimation
    3.4.2 Coordinate Calculation for LIDAR
    3.4.3 Traveling Salesman Problem
    3.4.4 Molecular Dynamics
    3.4.5 Summary of Case Studies
  3.5 Conclusions

4 EXPANDED MODELING FOR MULTI-FPGA PLATFORMS (PHASE 2)
  4.1 Introduction
  4.2 Background and Related Research
  4.3 RATSS Model
    4.3.1 RATSS Scope
      FPGA Platform Scope
    4.3.2 Model Attributes and Equations
  4.4 Detailed Walkthrough: 2-D PDF Estimation
    4.4.1 Algorithm and Platform Structure
    4.4.2 Compute Node Modeling
    4.4.3 Network Modeling
    4.4.4 Stage/Application Modeling
    4.4.5 Results and Verification
  4.5 Additional Case Studies
    4.5.1 Image Filtering
    4.5.2 Molecular Dynamics
  4.6 Conclusions

5 INTEGRATED PERFORMANCE PREDICTION WITH RC MODELING LANGUAGE (PHASE 3)
  5.1 Introduction
  5.2 Background and Related Research
    5.2.1 RAT Performance Prediction
    5.2.2 Modeling Environments
  5.3 Integrated Framework
    5.3.1 Translation
    5.3.2 Orchestration
  5.4 Case Studies
    5.4.1 Experimental Setup
    5.4.2 Modified Needleman-Wunsch (MNW)
    5.4.3 Task Graph of Mean-Value Analysis (MVA Graph)
  5.5 Integrated RATSS (MNW)
  5.6 Conclusions

6 CONCLUSIONS

REFERENCES
BIOGRAPHICAL SKETCH


LIST OF TABLES

3-1 Input parameters for RAT
3-2 Input parameters of 1-D PDF
3-3 Performance parameters of 1-D PDF (Nallatech)
3-4 Resource usage of 1-D PDF (XC4VLX100)
3-5 Input parameters of 2-D PDF
3-6 Performance parameters of 2-D PDF (Nallatech)
3-7 Resource usage of 2-D PDF (XC4VLX100)
3-8 Input parameters of LIDAR
3-9 Performance parameters of LIDAR (Cray)
3-10 Resource usage of LIDAR (XC2VP50)
3-11 Input parameters of TSP
3-12 Performance parameters of TSP (SRC)
3-13 Resource usage of TSP (XC2V6000)
3-14 Input parameters of MD
3-15 Performance parameters of MD (XtremeData)
3-16 Resource usage of MD (EP2S180)
3-17 Summary of Results
4-1 Node Attributes for 2-D PDF Estimation
4-2 PCI-X Network Attributes for 2-D PDF Estimation
4-3 Ethernet Network Attributes for 2-D PDF Estimation
4-4 System Model Attributes for 2-D PDF Estimation
4-5 Modeling Error for 2-D PDF Estimation (Nallatech, XC4VLX100, 195MHz)
4-6 SNAP Network Attributes for SRC-6 System
4-7 Node Attributes for Image Filtering
4-8 Additional Network Attributes for Image Filtering
4-9 Modeling Error for Image Filtering (SRC-6, XC2V6000)
4-10 Node Attributes for Molecular Dynamics
4-11 Additional Network Attributes for Molecular Dynamics
4-12 Modeling Error for Molecular Dynamics (SRC-6, XC2V6000)
5-1 Predicted and experimental results for MNW
5-2 Analysis times for design spaces of MNW
5-3 Analysis times for design spaces of MVA graph
5-4 Predicted (RATSS) and experimental results for multi-node MNW


LIST OF FIGURES

2-1 Performance Characterization and General Structure of FPGA Devices
2-2 Spectrum of High-Performance Reconfigurable Computing Platforms
3-1 Overview of RAT Methodology
3-2 Example Overlap Scenarios
3-3 Trends for Computational Utilization in SB and DB Scenarios
3-4 Parallel Algorithm and Mapping for 1-D PDF
4-1 Two Classes of Modern High-performance FPGA systems
4-2 Synchronous Iterative Model
4-3 Example Timing Diagram for Application Scheduling
4-4 Application Structure for 2-D PDF Estimation Case Study
4-5 Platform Structure for 2-D PDF Estimation Case Study
4-6 Results of Efficiency Microbenchmarks for Nallatech BRAM I/O
4-7 Platform Structure for Image Filtering Case Study
4-8 Algorithm Structure for Image Filtering Case Study
4-9 Algorithm Structure for Molecular Dynamics Case Study
5-1 Abstraction pyramid comparing levels of modeling for hardware applications
5-2 Performance prediction using RAT
5-3 Y-chart approach to application specification in modeling environments
5-4 Framework bridging specification and analysis, and orchestrating DSE
5-5 Translation of application specification information for RAT prediction
5-6 Architecture specification of FPGA platform
5-7 Overview of MNW case study
5-8 Predicted execution times of MNW
5-9 MVA graph specification and mapping
5-10 Predicted execution times of the MVA graph
5-11 Architecture specification of Novo-G system


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

IMPROVING FPGA APPLICATION DEVELOPMENT VIA STRATEGIC EXPLORATION, PERFORMANCE PREDICTION, AND TRADEOFF ANALYSIS

By Brian M. Holland
August 2010
Chair: Alan D. George
Major: Electrical and Computer Engineering

FPGAs continue to demonstrate impressive benefits in terms of performance per Watt for a wide range of applications. However, the design time and technical complexities of FPGAs have made application development expensive, particularly as the number of project revisions increases. Consequently, it is important to engage in systematic formulation of applications, performing strategic exploration, performance prediction, and tradeoff analysis before undergoing lengthy development cycles. Unfortunately, almost all existing simulative and analytic models for FPGAs target existing applications to provide detailed, low-level analysis. This document explores methods, challenges, and tradeoffs concerning performance prediction scope and complexity, calibration and verification, applicability to small and large-scale FPGA systems, efficiency, and automation. The RC Amenability Test (RAT) is proposed as a high-level methodology to address these challenges and provide a necessary design evaluation mechanism currently lacking in FPGA application formulation. RAT is comprised of an extensible analytic model for single and multi-FPGA systems harnessing a modeling infrastructure, RC Modeling Language (RCML), to provide a breadth of features allowing FPGA designers to more efficiently and automatically explore and evaluate algorithm, platform, and system mapping choices.


CHAPTER 1
INTRODUCTION

Computing is currently undergoing two reformations, one in device architecture and the other in application development. The use of the growth in transistor density predicted by Moore's Law to increase clock rates and instruction-level parallelism has reached fundamental limits, and the nature of current and future device architectures is focused upon higher density in terms of multi-core and many-core structures and more explicit forms of parallelism. Many such devices exist and are emerging on this path, some with a fixed structure (e.g., quad-core CPU, Cell Broadband Engine) and some reconfigurable (e.g., FPGAs). Concomitant with this reformation in device architecture, the complexity of application development for these fixed or reconfigurable devices is at the forefront of fundamental challenges in computing today.

The development of applications for complex architectures can be defined in terms of four stages: formulation, design, translation, and execution. The purpose of formulation is exploration of algorithms, architectures, and mappings, where strategic decisions are made prior to coding and implementation. The design, translation, and execution stages are where implementation occurs in an iterative fashion, in terms of programming, translation to executable codes and cores, debugging, verification, performance optimization, etc. As architecture complexity continues to increase, so too does the importance of the formulation stage, since productivity increases when design weaknesses are discovered and addressed early in the development process. FPGA applications are particularly noteworthy for the amount of effort needed with existing languages and tools to render a successful implementation, and thus productivity of application development for FPGA-based systems can greatly benefit from better concepts and tools in the formulation stage. This document presents a novel methodology and model to support the rapid formulation of applications for FPGA-based reconfigurable computing systems. This model focuses on not only reasonably accurate performance estimation for single and multi-FPGA systems but also efficient usage during formulation (i.e., strategic design-space exploration). Known as the RC Amenability Test, RAT provides a framework for prediction of potential speedup for a given high-level parallel algorithm mapped to a selected hardware target, so that a variety of strategic tradeoffs in algorithm and architecture exploration can be quickly evaluated before undertaking weeks or months of costly implementation. RAT performance prediction is scoped to maintain efficient and reasonably accurate estimation relative to the FPGA system size and complexity.

Central to RAT is the analytical performance model and the methodology for its application to a range of algorithms and FPGA platform architectures. The key aspects of communication and computation within the FPGA system are parameterized and used by the RAT model to estimate the total application performance (i.e., execution time and speedup). The prediction efficiency and reliability are increased via tool-assisted parameter extraction and performance estimation from explicit algorithm, architecture, and system descriptions (also referred to as models). The need for the RAT methodology stemmed from common difficulties encountered during several FPGA application development projects. Researchers would typically possess a software application but would be unsure about potential performance gains in hardware. The level of experience with FPGAs would vary greatly among the researchers, and inexperienced designers were often unable to quantitatively project and compare possible algorithmic design and FPGA platform choices for their application. Many initial predictions were haphazardly formulated and performance estimation methods varied greatly. Consequently, RAT was created to consolidate and unify the performance prediction strategies for faster, simpler, and more effective analyses.

This research is divided into three phases. In the first phase, the focus of the performance prediction model is on systems with a single FPGA connected directly to a microprocessor. Applications proceed in iterations of writing data to the FPGA, performing computation, and reading results from the FPGA. Design choices are made, key performance attributes are extracted, predictions are computed, and suitable designs proceed to implementation. The goal is to establish the core, extensible model of application and architecture performance that is efficient for use prior to implementation and provides reasonably accurate results. Several application case studies are analyzed with RAT and implemented within FPGA systems to verify the performance model and methodology.

For the second phase, the emergence of and continued interest in multi-FPGA systems necessitate a methodology for multi-FPGA performance prediction to improve application development. The RC Amenability Test for Scalable Systems (RATSS) is an expansion of the RAT methodology encompassing larger FPGA systems and potentially higher degrees of algorithm parallelism. A major challenge is the size and variety of communication topologies in multi-FPGA platforms, which require varying amounts of parameterization and analysis for accurate performance prediction. RATSS uses the synchronous iterative model [1] for two modern platform architectures for RC systems [2], providing accurate modeling for data-parallel algorithms typically structured as SIMD-style pipelines.

The third phase involves integration of the RAT analytical model with the RC Modeling Language (RCML). Manual specification and analysis of applications becomes increasingly inefficient as the FPGA platform size and algorithm complexity grow. Ideally, algorithms, FPGA platform architectures, and system mappings are specified using a modeling environment based on a model of computation and then analyzed by performance estimation tools such as RAT. RCML is used because it provides an efficient, intuitive, and scalable infrastructure specifically designed for FPGA systems. The integration of RAT and RCML provides efficient design-space exploration through tool-assisted translation of application specifications into prediction models and evaluation of both an initial design and potential revisions to the algorithm or platform architecture.

The remainder of this document is structured as follows. Chapter 2 provides a brief background about FPGA computing and related research for performance prediction and modeling. Chapter 3 outlines the RAT analytic model for FPGA performance prediction prior to design (Phase 1). Chapter 4 describes the expansion of the RAT model for multi-FPGA platforms (Phase 2). Chapter 5 discusses the integration of RAT performance prediction with RCML (Phase 3). Finally, conclusions are provided in Chapter 6.


CHAPTER 2
BACKGROUND AND RELATED RESEARCH

The background and related research for this document is divided into two sections. Section 2.1 provides an overview of FPGA technology and high-performance reconfigurable computing. Section 2.2 summarizes related work for HPC performance modeling, FPGA simulation and analytical modeling, and prediction focus.

2.1 FPGA Background

The field-programmable gate array (FPGA) is the primary device driving reconfigurable computing. The overall goal of FPGAs is to provide the performance of an application-specific integrated circuit (ASIC) with the flexibility and programmability of a microprocessor. Applications for FPGAs are developed as hardware circuits constructed from logic elements, fixed resources such as multiply accumulators or memories, and routing elements. Traditionally, FPGA applications are developed in hardware description languages such as VHDL or Verilog, but high-level languages are emerging to bring FPGA code to the level of languages such as C or Java. Figure 2-1 outlines the performance spectrum of computing devices and the general structure of FPGAs.

Figure 2-1. Performance Characterization and General Structure of FPGA Devices

Reconfigurable computing systems often incorporate FPGAs to provide maximum performance (e.g., speed, power, cost) versus comparable microprocessor-only solutions. This research is applicable to both high-performance computing (HPC) and high-performance embedded computing (HPEC). The term "high performance" can have broad applicability to computing systems, but for this document it will refer to systems where maximizing speed is a significant requirement of the intended design (i.e., within reason, higher speed will yield a correspondingly more productive design). RC systems can range in scale from CPU-less single-FPGA solutions to supercomputers incorporating thousands of FPGA-augmented nodes (HPC) or large collections of FPGA-centric devices (HPEC). A general overview of the spectrum of RC systems is outlined in Figure 2-2. The options for interconnect topologies increase with the system size, compounding the communication challenges. In contrast, computation seeks to align FPGAs into a single large virtual fabric for maximum parallelism. Ultimately, the particular algorithm choices with respect to the mapping to the platform architecture will determine the total application performance. For this research, platform selection and the subsequent mappings will focus on the most common systems with the highest amenability to formulation-time prediction.

Figure 2-2. Spectrum of High-Performance Reconfigurable Computing Platforms


2.2 Related Research

Productive application development for FPGA-based systems is key for wider deployment and usage of FPGAs. One challenge to FPGA productivity is efficient generation of more abstract FPGA design codes. Raising the design focus from traditional hardware description languages, high-level languages such as Impulse C [3], Carte C [4], Mitrion C [5], and Handel C [6] provide a software-like infrastructure for a more efficient and familiar programming model for FPGA applications. Similarly, research in hardware/software codesign enables a faster bridge between application specification and hardware implementation; a brief history of this research trend can be found in [7]. However, faster implementation does not address the underlying need for application planning and evaluation prior to significant commitments of time and money.

Efficient performance modeling for algorithms and systems is an ongoing area of research for traditional parallel computing. The Parallel Random Access Machine (PRAM) [8] attempts to model the critical (and hopefully small) set of algorithm and platform attributes necessary to achieve a better understanding of the greater computational interaction and ultimately the application performance. The Bulk Synchronous Parallel (BSP) [9] model extends PRAM concepts to include support for communication and its interaction (i.e., overlap) with computation. The LogP model [10] (one successor to PRAM) abstracts the application performance based on the latency (i.e., wire delay), L; overhead, o; message gap (i.e., minimum time between messages), g; and number of processors, P. LogGP [11] and additional revisions such as parameterized LogP (PlogP) [12] provide further modeling fidelity to LogP by addressing specific issues such as bandwidth constraints for long messages and dynamic performance characterization, respectively. However, these concepts are not limited to systems of general-purpose processors. The evolution towards heterogeneous many-core devices necessitates increased modeling research and usage due to rising development time and cost. In particular, RAT seeks to leverage these ideas of maximizing model flexibility (through parameterization) and fidelity (through analytical methods). These modeling concepts, particularly LogGP, are also used for extending RAT to scalable multi-FPGA systems (Chapter 4).

Simulation is another common outlet for quantifying the performance of RC application models at a high level. In [13], a framework for simulation of FPGA systems and applications is built on top of the Fast and Accurate Simulation Environment (FASE) [14]. Models are created for the Mission-Level Designer (MLD) tool based on scripts of algorithm behavior to rapidly explore large-scale FPGA systems. Another simulation framework is the Hybrid System Architecture Model (HySAM) coupled with DRIVE [15]. HySAM provides mechanisms for parameterizing architectures, defining algorithms, and simulating interactions, while DRIVE provides tools for visualizing results generated by HySAM. In [16], SimpleScalar and ModelSim are combined for system analysis through simultaneous processor emulation and VHDL simulation. Another tool [17] uses a Simics-based simulator for capturing precise memory-access patterns while functionally verifying hardware kernels. While each of these methodologies provides high-level simulation fidelity, significant cost is associated with setting up the requisite models. Either actual hardware or software code is required, or effort is spent on constructing custom simulation inputs distilled from algorithm and system behavior. In contrast, RAT seeks to render performance prediction of the application and FPGA platform prior to any significant hardware or software coding. By using analytical models instead of simulation frameworks, prediction effort is minimized while maintaining reasonable accuracy. Nevertheless, some insight about modeling larger systems can be leveraged from the multi-FPGA simulation frameworks.

Understanding and improving algorithm design for FPGAs via analytical modeling is an expanding area of RC research. One technique [1] focuses on analytical modeling of shared heterogeneous workstations containing reconfigurable computing devices. The methodology primarily emphasizes system-level, multi-FPGA architectures with variable computational loading due to the multi-user environment. A performance prediction technique presented in [18] seeks to parameterize the computational algorithm and the FPGA system itself. The analytical methods have similarities to RAT, but the emphasis is on projecting potential bottlenecks due to memory throughput, not on predicting total system performance. Dynamo [19] involves performance prediction of image processing systems partitioned and compiled at runtime from existing pipelined kernels. The system provides dynamic optimization for application construction exclusively from existing modules and assumes that algorithm design and analysis is completed prior to the use of Dynamo. In [20], 12 design techniques are presented for maximizing the performance of FPGA applications. This research is synergistic to RAT by potentially reducing the number of algorithm and architecture iterations necessary to achieve suitable performance; however, the RAT methodology is still required to quantitatively evaluate each design iteration.

Though prediction is quite common with FPGA technologies, it is not primarily used for system-level performance. Routing is a common target for device-level prediction due to the impact on development time and performance. In [21], a model of the algorithm routing demands is created early in the FPGA development cycle. In [22], prediction is used to mitigate the variability and long runtimes of commercial place and route tools for estimating interconnect delay. Other issues including timing [23], routability [24], interconnect planning [25], and routing delay [26] are explored via prediction. Performance is also explored by modeling issues such as power [27] and wafer yield [28]. Many of these prediction techniques for lower-level issues migrated from application-specific integrated circuits into the RC domain to more efficiently model the growing complexity of FPGAs. Similarly, RAT and other methodologies are branching to RC from existing areas of parallel application modeling to bridge the growing need for efficient performance prediction.


CHAPTER 3
ANALYTICAL MODEL FOR FPGA PERFORMANCE ESTIMATION PRIOR TO DESIGN (PHASE 1)

The first research phase outlines FPGA performance estimation using the RAT analytical model prior to hardware implementation. This chapter presents a brief introduction on the research challenges for a formulation-time, extensible model (Section 3.1), a detailed analysis of the prediction methodology (Section 3.2), a complete walkthrough of performance estimation for a real scientific application (Section 3.3), four additional case studies as further validation of RAT (Section 3.4), and conclusions (Section 3.5).

3.1 Introduction

In this chapter, research challenges with constructing a performance prediction model to support efficient design-space exploration are investigated. Potential algorithms, architectures, and system mappings must be investigated prior to implementation (to reduce development cost), and predictions are limited to evaluation of a specific algorithm targeting a specific platform (to avoid vague generalities). The RAT methodology is presented as a technique to address the challenges of formulation-level performance prediction. Five case studies are presented to validate the RAT performance model and methodology.

3.2 RC Amenability Test

Figure 3-1 illustrates the basic methodology behind the RC amenability test. These throughput, numerical precision, and resource tests serve as a basis for determining the viability of an algorithm design on the FPGA platform prior to any FPGA programming. Again, RAT is intended to address the performance of a specific high-level parallel algorithm mapped to a particular FPGA platform, not a generic application. The results of the RAT tests must be compared against the designer's requirements to evaluate the success of the design. Though the throughput analysis is considered the most important step, the three tests are not necessarily used as a single, sequential procedure. Often, these tests are applied iteratively during the RAT analysis until a suitable version of the algorithm is formulated or all reasonable permutations are exhausted without a satisfactory solution. The throughput test is a suitable starting point for an application wishing to match the numerical precision and general architecture of a legacy algorithm. However, starting with the numerical precision and resource tests to refine an application prior to throughput analysis is equally viable.

Figure 3-1. Overview of RAT Methodology

3.2.1 Throughput

For RAT, the predicted performance of an application is defined by two terms: communication time between the CPU and FPGA, and FPGA computation time. Reconfiguration and other setup times are ignored. These two terms encompass the rate at which data flows through the FPGA and the rate at which operations occur on that data, respectively. Because RAT seeks to analyze applications at the earliest stage of hardware mapping, these terms are reduced to the most generalized parameters. The RAT throughput test primarily models FPGAs as accelerators to general-purpose processors with burst communication, but the framework can be adjusted for applications with streaming data.

Calculating the communication time is a relatively simple process given by Equations (3-1), (3-2), and (3-3). The overall communication time is defined as the summation of the read and write components. For the individual reads and writes, the problem size (i.e., number of data elements, $N_{\text{elements}}$) and the numerical precision (i.e., number of bytes per element, $N_{\text{bytes/element}}$) must be decided by the user with respect to the algorithm. Note that for these equations, the problem size only refers to a single block of data to be buffered by the FPGA system. All read or write communication for the application need not occur as a single transfer but can instead be partitioned into multiple blocks of data to be independently sent or received. Multiple transfers are considered in a subsequent equation. The hypothetical bandwidth of the FPGA/CPU interconnect on the target platform (e.g., 133 MHz 64-bit PCI-X, which has a documented maximum throughput of 1 GB/s) is also necessary but is generally provided either with the FPGA subsystem documentation or as part of the interconnect standard. An additional parameter, $\alpha$, represents the fraction of ideal throughput performing useful communication. The actual sustained performance of the FPGA interconnect is typically a fraction of the documented transfer rate.

$$t_{\text{comm}} = t_{\text{read}} + t_{\text{write}} \tag{3-1}$$

$$t_{\text{read}} = \frac{N_{\text{elements}} \cdot N_{\text{bytes/element}}}{\alpha_{\text{read}} \cdot \text{throughput}_{\text{ideal}}} \tag{3-2}$$

$$t_{\text{write}} = \frac{N_{\text{elements}} \cdot N_{\text{bytes/element}}}{\alpha_{\text{write}} \cdot \text{throughput}_{\text{ideal}}} \tag{3-3}$$

Microbenchmarks composed of simple data transfers can be used to establish the true communication throughput. The communication times for these block transfers are measured and compared against the theoretical interconnect throughput to establish the $\alpha$ parameters. It is important for microbenchmarks to closely match the communication methods used by real applications on the target FPGA platform to accurately model the intended behavior. In general, microbenchmarks are performed over a wide range of possible data sizes. The resulting $\alpha$ values can be tabulated and used in future RAT analyses for that FPGA platform. By separating the effective throughput into the theoretical maximum and the $\alpha$ fraction, effects such as changing the interconnect type and efficiency can be explored separately. This fidelity is particularly useful for hypothetical or otherwise unavailable FPGA platforms.

Before further equations are discussed, it is important to clarify the concept of an "element." Until now, the expressions "problem size," "volume of communicated data," and "number of elements" have been used interchangeably. However, strictly speaking, the first two terms refer to a quantity of bytes whereas the last term has units of "elements." RAT operates under the assumption that the computational workload of an algorithm is directly related to the size of the problem dataset. Because communication times are concerned with bytes and (as will be subsequently shown) computation times revolve around the number of operations, a common term is necessary to express this relationship. The element is meant to be the basic building block which governs both communication and computation. For example, an element could be a value in an array to be sorted, an atom in a molecular dynamics simulation, or a single character in a string-matching algorithm. In each of these cases, some number of bytes will be required to represent that element and some number of calculations will be necessary to complete all computations involving that element. The difficulty is establishing what subset of the data should constitute an element for a particular algorithm. Often an application must be analyzed in several separate stages, since each portion of the algorithm could interpret the input data in a different scope.

Estimating the computational component of the RC execution time, as given in Equation (3-4), is more complicated than communication due to the conversion factors. Whereas the number of bytes per element is ultimately a fixed, user-defined value, the number of operations (i.e., computations) per element must be manually measured from the algorithm structure. Generally, the number of operations will be a function of the overall computational complexity of the algorithm and the types of individual computations involved. Additionally, as with the communication equation, a throughput term, $\text{throughput}_{\text{proc}}$, is also included to establish the rate of execution. This parameter is meant to describe the number of operations completed per cycle. For fully pipelined designs, the number of operations per cycle will be a consistent portion of the number of operations per element, though the two terms are often equal for elements with linear computational complexity. Less optimized designs will have lower throughput, requiring multiple cycles to complete an element. Some designs may not use pipelining but process several elements per unit time via multiple parallel kernels. Again, note that computation time essentially refers to the time required to operate on the data provided by one communication transfer. (Applications with multiple communication and computation blocks are resolved when the total FPGA execution time is computed later in this section.)

$$t_{\text{comp}} = \frac{N_{\text{elements}} \cdot N_{\text{ops/element}}}{f_{\text{clock}} \cdot \text{throughput}_{\text{proc}}} \tag{3-4}$$

Despite the potential unpredictability of algorithm behavior, estimating a sufficiently precise number of operations is still possible for many types of applications. However, predicting the average rate of operation execution can be challenging even with detailed knowledge of the target hardware design. For applications with a highly deterministic pipeline, the procedure is straightforward. Throughput can be modeled as accurately as possible, or adjustments can be made for optimistic or pessimistic predictions. But for interdependent or data-dependent operations, the problem is more complex. For these scenarios, a better approach would be to treat $\text{throughput}_{\text{proc}}$ as an independent variable and select a desired speedup value. Then one can solve for the particular $\text{throughput}_{\text{proc}}$ value required to achieve that desired speedup. This method provides the user with insight into the relative amount of parallelism that must be incorporated for a design to succeed. The molecular dynamics case study in Section 3.4 illustrates a complex algorithm where the throughput requirements are based on the desired speedup.
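As a minimal sketch of Equations (3-1) through (3-4), the following Python fragment computes the communication and computation times for one buffer of data. The function names and every numeric input (efficiency fractions, operation count, clock rate, operations per cycle) are illustrative assumptions chosen for this example, not values measured in this work; only the 1 GB/s figure echoes the PCI-X example above.

```python
def transfer_time(n_elements, bytes_per_element, alpha, throughput_ideal):
    """Eq. (3-2)/(3-3): seconds to move one buffer in one direction."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return (n_elements * bytes_per_element) / (alpha * throughput_ideal)


def computation_time(n_elements, ops_per_element, f_clock, throughput_proc):
    """Eq. (3-4): seconds to process one buffer of elements."""
    return (n_elements * ops_per_element) / (f_clock * throughput_proc)


# Illustrative inputs only (assumed for this sketch, not measured).
N_ELEMENTS = 512            # elements per buffer
BYTES_PER_ELEMENT = 4       # e.g., 32-bit samples
THROUGHPUT_IDEAL = 1.0e9    # bytes/s, documented interconnect peak
ALPHA_WRITE = 0.50          # hypothetical microbenchmark result
ALPHA_READ = 0.25           # hypothetical microbenchmark result
OPS_PER_ELEMENT = 768       # hypothetical operation count per element
F_CLOCK = 100.0e6           # Hz
THROUGHPUT_PROC = 16        # operations completed per cycle

t_write = transfer_time(N_ELEMENTS, BYTES_PER_ELEMENT, ALPHA_WRITE, THROUGHPUT_IDEAL)
t_read = transfer_time(N_ELEMENTS, BYTES_PER_ELEMENT, ALPHA_READ, THROUGHPUT_IDEAL)
t_comm = t_read + t_write   # Eq. (3-1)
t_comp = computation_time(N_ELEMENTS, OPS_PER_ELEMENT, F_CLOCK, THROUGHPUT_PROC)

print(f"t_comm = {t_comm:.3e} s, t_comp = {t_comp:.3e} s")
```

Keeping the efficiency fraction separate from the ideal throughput mirrors the model: a different interconnect changes only `THROUGHPUT_IDEAL`, while a new set of microbenchmarks changes only the alpha values.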


Similar to an element, one must also examine what is an "operation." Consider an example algorithm composed of a 32-bit addition followed by a 32-bit multiplication. The addition can be performed in a single clock cycle, but to save resources the 32-bit multiplier might be constructed using the Booth algorithm requiring 16 clock cycles. Arguments could be made that the addition and multiplication would count as either two operations (addition and multiplication) or 17 operations (addition plus 16 additions, the basis of the Booth multiplier algorithm). Either formulation is correct provided that $\text{throughput}_{\text{proc}}$ is formulated with the same assumption about the scope of an operation. Often, deterministic and highly structured algorithms are better viewed with the number of operations synonymous with the number of cycles. In contrast, complex or nondeterministic algorithms tend to be viewed as a number of abstract operations with an average rate of execution. Ultimately, either choice is viable and left to the preference of the user.

Figure 3-2 illustrates the types of communication and computation interaction to be modeled with the throughput test. Single buffering (SB) represents the most simplistic scenario with no overlapping tasks. However, a double-buffered (DB) system allows overlapping communication and computation by providing two independent buffers to keep both the processing and I/O elements occupied simultaneously. Since the first computation block cannot proceed until the first communication sequence has completed, steady-state behavior is not achievable until at least the second iteration. However, this startup cost is considered negligible for a sufficiently large number of iterations.

The FPGA execution time, $t_{RC}$, is a function not only of the $t_{\text{comm}}$ and $t_{\text{comp}}$ terms but also the amount of overlap between communication and computation. Equations (3-5) and (3-6) model both SB and DB scenarios. For SB, the execution time is simply the summation of the communication time, $t_{\text{comm}}$, and computation time, $t_{\text{comp}}$. With the DB case, either the communication or computation time completely overlaps the other term. The smaller latency essentially becomes hidden during steady state. The DB case is included for completeness of the RAT model; however, the case studies focus on the SB scenario.

Figure 3-2. Example Overlap Scenarios

The RAT analysis for computing $t_{\text{comp}}$ primarily assumes one algorithm "functional unit" operating on a single buffer's worth of transmitted information. The parameter $N_{\text{iter}}$ is the number of iterations of communication and computation required to solve the entire problem.

$$t_{RC_{SB}} = N_{\text{iter}} \cdot (t_{\text{comm}} + t_{\text{comp}}) \tag{3-5}$$

$$t_{RC_{DB}} \approx N_{\text{iter}} \cdot \max(t_{\text{comm}}, t_{\text{comp}}) \tag{3-6}$$

Assuming that the application design currently under analysis was based upon available sequential software code, a baseline execution time, $t_{\text{soft}}$, is available for comparison with the estimated FPGA execution time to predict the overall speedup. As given in Equation (3-7), speedup is a function of the total application execution time, not a single iteration.

$$\text{speedup} = \frac{t_{\text{soft}}}{t_{RC}} \tag{3-7}$$
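Equations (3-5) through (3-7) can be sketched in a few lines of Python. The fragment below also inverts the single-buffered model to find the operations per cycle needed for a target speedup, which is the desired-speedup approach described earlier in this section. The function names and all numeric inputs are illustrative assumptions rather than results from the case studies, and the inversion is a straightforward rearrangement of Equations (3-4), (3-5), and (3-7) rather than a formula stated in the text.

```python
def rc_time(t_comm, t_comp, n_iter, double_buffered=False):
    """Total FPGA execution time in seconds."""
    if double_buffered:
        # Eq. (3-6): steady-state approximation, startup cost ignored.
        return n_iter * max(t_comm, t_comp)
    # Eq. (3-5): no overlap between communication and computation.
    return n_iter * (t_comm + t_comp)


def speedup(t_soft, t_rc):
    """Eq. (3-7): software baseline divided by predicted FPGA time."""
    return t_soft / t_rc


def required_throughput_proc(t_soft, target_speedup, n_iter, t_comm,
                             n_elements, ops_per_element, f_clock):
    """Operations per cycle needed to reach target_speedup (SB case).

    Rearranges Eqs. (3-4), (3-5), and (3-7) for throughput_proc.
    """
    t_comp_budget = t_soft / (target_speedup * n_iter) - t_comm
    if t_comp_budget <= 0:
        raise ValueError("communication alone exceeds the time budget")
    return (n_elements * ops_per_element) / (f_clock * t_comp_budget)


# Illustrative inputs only (assumed for this sketch).
T_COMM = 1.0e-5   # s per iteration
T_COMP = 2.5e-4   # s per iteration
N_ITER = 400      # e.g., 204,800 samples in buffers of 512
T_SOFT = 0.5      # s, hypothetical software baseline

t_rc_sb = rc_time(T_COMM, T_COMP, N_ITER)
t_rc_db = rc_time(T_COMM, T_COMP, N_ITER, double_buffered=True)
needed = required_throughput_proc(T_SOFT, 10.0, N_ITER, T_COMM,
                                  n_elements=512, ops_per_element=768,
                                  f_clock=100.0e6)

print(f"SB speedup = {speedup(T_SOFT, t_rc_sb):.2f}")
print(f"DB speedup = {speedup(T_SOFT, t_rc_db):.2f}")
print(f"ops/cycle needed for 10x (SB) = {needed:.1f}")
```

When the target speedup cannot be met even with infinitely fast computation, the function raises an error instead of returning a negative throughput, which signals that the communication pattern itself must be reformulated.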


Related to the speedup is the computation and communication utilization given by Equations (3-8), (3-9), (3-10), and (3-11). These metrics determine the fraction of the total application execution time spent on computation and communication for the SB and DB cases. For SB, the computation utilization can provide additional insight about the application speedup. If utilization is high, the FPGA is rarely idle, thereby maximizing speedup. Low utilizations can indicate potential for increased speedups if the algorithm can be reformulated to have less (or more overlapped) communication. In contrast to computation, which is effectively parallel for optimal FPGA processing, communication is serialized. Whereas computation utilization gives no indication about the overall resource usage, since additional FPGA logic could be added to operate in parallel without affecting the utilization, the communication utilization indicates how much bandwidth remains (the unused fraction) to facilitate additional transfers, since the channel is only a single resource.

For DB, assuming steady-state behavior, the implications of the utilization terms are slightly different. The larger value, whether communication or computation, will have a utilization of 1. If computation is the shorter (i.e., overlapped) time, utilization illustrates how starved the computation is for data. If communication is the shorter time, utilization is a measure of the available throughput to support additional parallel computation. An example of these utilization trends is shown in Figure 3-3.

$$\text{util}_{\text{comp}_{SB}} = \frac{t_{\text{comp}}}{t_{\text{comm}} + t_{\text{comp}}} \tag{3-8}$$

$$\text{util}_{\text{comm}_{SB}} = \frac{t_{\text{comm}}}{t_{\text{comm}} + t_{\text{comp}}} \tag{3-9}$$

$$\text{util}_{\text{comp}_{DB}} = \frac{t_{\text{comp}}}{\max(t_{\text{comm}}, t_{\text{comp}})} \tag{3-10}$$

$$\text{util}_{\text{comm}_{DB}} = \frac{t_{\text{comm}}}{\max(t_{\text{comm}}, t_{\text{comp}})} \tag{3-11}$$
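The utilization metrics of Equations (3-8) through (3-11) reduce to a single helper, sketched below in Python. The function name and the per-iteration times are hypothetical values chosen only to show how the SB and DB cases differ.

```python
def utilization(t_comm, t_comp, double_buffered=False):
    """Return computation and communication utilization as fractions.

    SB uses Eqs. (3-8) and (3-9); DB uses Eqs. (3-10) and (3-11).
    """
    if double_buffered:
        denominator = max(t_comm, t_comp)
    else:
        denominator = t_comm + t_comp
    return {"comp": t_comp / denominator, "comm": t_comm / denominator}


# Illustrative per-iteration times (assumed for this sketch).
T_COMM = 1.0e-5   # s
T_COMP = 4.0e-5   # s

util_sb = utilization(T_COMM, T_COMP)
util_db = utilization(T_COMM, T_COMP, double_buffered=True)

print("SB:", util_sb)
print("DB:", util_db)
```

With these inputs computation is the longer term, so in the DB case its utilization is 1 and the communication utilization shows how much of the channel is occupied while computation proceeds.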


Figure3-3.TrendsforComputationalUtilizationinSBandDB Scenarios 3.2.2NumericalPrecision Applicationnumericalprecisionistypicallydenedbythea mountofxed-or roating-pointcomputationwithinadesign.WithFPGAdevic es,whereincreased precisiondictateshigherresourceutilization,itisimpo rtanttouseonlyasmuchprecision asnecessarytoremainwithinacceptabletolerances.Becau segeneral-purposeprocessors havexed-lengthdatatypesandreadilyavailableroatingpointresources,itisreasonable toassumethatoftenagivensoftwareapplicationwillhavea tleastsomemeasureof wastedprecision.Consequently,eectivemigrationofapp licationstoFPGAsrequiresa methodtodeterminetheminimumnecessaryprecisionbefore anytranslationbegins. WhileformalmethodsfornumericalprecisionanalysisofFP GAapplicationsare important,theyareoutsidethescopeofthisdocument.Aple thoraofresearchexistson topicsincludingmaintainingprecisionwithmixeddatatyp es[ 29 ],automatedconversion ofroating-pointsoftwareprogramstoxed-pointhardware designs[ 30 ],design-time precisionanalysistoolsforRC[ 31 ],andcustomordynamicbit-widthsformaximizing performanceandareaonFPGAs[ 32 { 35 ].Applicationdesignsaremeanttocapitalize onthesenumericalprecisiontechniquesandthenusetheRAT methodologytoevaluate theresultingalgorithmperformance.Numericalprecisionm ustalsobebalancedagainst thetypeandquantityofavailableFPGAresourcestosupport thedesiredformat.For 29


example,18-bitxedpointmaybeusedinXilinxFPGAssinceitm aximizesusageof single18-bitembeddedmultipliers.Aswithparalleldecomp osition,numericalformulation isultimatelythedecisionoftheapplicationdesigner.RAT providesaquickandconsistent procedureforevaluatingthesedesignchoices.3.2.3Resources Bymeasuringresourceutilization,RATseekstodeterminet hescalabilityofan applicationdesign.Empirically,mostFPGAdesignswillbe limitedinsizebythe availabilityofthreecommonresources:on-chipmemory,ha rdcorefunctionalunits(e.g. xedmultipliers),andbasiclogicelements(i.e.look-upt ablesandrip-rops). On-chipRAMisreadilymeasurablesincesomequantityofthem emorywilllikelybe usedforI/Obuersofaknownsize.Additionally,intra-appl icationbueringandstorage mustbeconsidered.Vendor-providedwrappersforinterfac ingdesignstoFPGAplatforms canalsoconsumeasignicantnumberofmemoriesbutthequan tityisgenerallyconstant andindependentoftheapplicationdesign. Althoughthetypesofdedicatedfunctionalunitsincludedin FPGAscanvarygreatly, thehardwaremultiplierisafairlycommoncomponent.Thede mandfordedicated multiplierresourcesishighlightedbytheavailabilityof familiesofchips(e.g.Xilinx Virtex-4and-5SXseries)withextramultipliersversusothe rcomparablysizedFPGAs. Quantifyingthenecessarynumberofhardwaremultipliersi sdependentonthetype andamountofparalleloperationsrequired.Multipliers,d ividers,squareroots,and roating-pointunitsusehardwaremultipliersforfastexec ution.Varyinglevelsofpipelining andotherdesignchoicescanincreaseordecreasetheoveral ldemandfortheseresources. Withsucientdesignplanning,anaccuratemeasureofresou rceutilizationcanbetaken foradesigngivenknowledgeofthearchitectureofthebasic computationalkernels. Measuringbasiclogicelementsisthemostcommonresourcem etric.High-level designsdonotempiricallytranslateintoanydiscernibler esourcecount.Qualitative assertionsaboutthedemandforlogicelementscanbemadeba seduponapproximate 30


quantitiesofarithmeticorlogicaloperationsandregiste rs.Butaprecisecountisnearly impossiblewithoutanactualhardwaredescriptionlanguag e(HDL)implementation. Aboveallothertypesofresources,routingstrainincreases exponentiallyaslogicelement utilizationapproachesmaximum.Consequently,itisoften unwise(ifnotimpossible)toll theentireFPGA. Currently,RATdoesnotemployadatabaseofstatisticstofa cilitateresource analysisofanapplicationforcompleteFPGAnovices.Theus ageofRATrequires somevendor-specicknowledge(e.g.single-cycle32-bit xed-pointmultiplicationswith 64-bitresultantsonXilinxVirtex-4FPGAsrequirefourdedica ted18-bitmultipliers). Additionally,theusermustconsidertradeossuchasusing xedresourcesversuslogic elementsandcomputationallogicversuslookuptables.Res ourceanalysesaremeantto highlightgeneralapplicationtrendsandpredictscalabil ity.Forexample,thestructureof themoleculardynamicscasestudyinSection 3.4 isdesignedtominimizeRAMusageand theparallelismisultimatelylimitedbytheavailabilityo fmultiplierresources. 3.2.4ScopeofRAT TheanalyticalmodeldescribedinSection 3.2.1 establishesthebasicscopeofRATas astrategicdesignmethodologytoformulatepredictionsab outalgorithmperformance andRCamenability.RATisintendedtosupportadiversecoll ectionofplatforms andapplicationeldsbecausethemethodologyfocusesonth ecommonstructuresand determinismwithinthealgorithm.Communicationandcompu tationarerelatedtothe numberofdataelementsinthealgorithm.Eectiveusageoft heperformanceprediction modelsrequiresmitigationofvariabilitiesinthealgorit hmstructuresuchasdata-driven computation.Basedonthecomplexityofthealgorithmandar chitecture,theRATmodel maybeusedtodirectlypredictperformanceorinsteadestab lishminimumthroughput requirementsbasedonthedesiredspeedup.RATcurrentlyta rgetssystemswithasingle CPUandFPGAasarststeptowardsabroadRCmethodology.The FPGAdevice isconsideredacoprocessortotheCPUbutcaninitiatesomeo perationsindependently 31


such as DMA. Even for single-FPGA systems, a range of issues related to parallelism and scalability can be explored. RAT is scoped to make a convenient and impactful model that not only integrates broader issues such as numerical precision and resource utilization but also contributes to the larger goal of better parallel algorithm formulation and design-space exploration. Future research will expand the RAT methodology for larger scale prediction on multi-FPGA systems.

3.3 Walkthrough

To simplify the RAT analysis in Section 3.2, a worksheet can be constructed based upon Equations (3-1) through (3-11). Users simply provide the input parameters and the resulting performance values are returned. This walkthrough further explains key concepts of the throughput test by performing a detailed analysis of a real application case study, one-dimensional probability density function (PDF) estimation. The goal is to provide a more complete description of how to use the RAT methodology in a practical setting.

3.3.1 Algorithm Architecture

The Parzen window technique [36] is a generalized nonparametric approach to estimating probability density functions (PDFs) in a d-dimensional space. The common parametric forms of PDFs (e.g., Gaussian, Binomial, Rayleigh distributions) represent mathematical idealizations and, as such, are often not well matched to densities encountered in practice. Though more computationally intensive than using histograms, the Parzen window technique is mathematically advantageous. For example, the resulting probability density function is continuous, therefore differentiable. The computational complexity of the algorithm is of order $O(N n^d)$ where N is the total number of data samples (i.e. number of elements), n is the number of discrete points at which the PDF is estimated (comparable to the number of "bins" in a histogram), and d is the number of dimensions. A set of mathematical operations is performed on every data sample over $n^d$ discrete points. Essentially, the algorithm computes the cumulative effect of every data


sample at every discrete probability level. For simplicity, each discrete probability level is subsequently referred to as a bin.

Figure 3-4. Parallel Algorithm and Mapping for 1-D PDF

In order to better understand the assumptions and choices made during the RAT analysis, the chosen algorithm for PDF estimation is highlighted in Figure 3-4. A total of 204,800 data samples are processed in batches of 512 elements against 256 bins. Eight separate pipelines are created to process data samples with respect to a particular subset of bins. Each data sample is an element with respect to the RAT analysis. The data elements are fed into the parallel pipelines sequentially. Each pipelined unit can process one element with respect to one bin per cycle. Internal registering for each bin keeps a running total of the impact of all processed elements. These cumulative totals comprise the final estimation of the PDF.

3.3.2 RAT Input Parameters

Table 3-1 provides a list of all the input parameters necessary to perform a RAT analysis. The parameters are sorted into four distinct categories, each referring to a particular portion of the throughput analysis. Note that N_elements is listed under a separate category because it is used by both communication and computation. It is assumed that the number of elements dictating the computation volume is also the number of


elements that are input to the application (although the effective bit-widths may differ due to the fixed width of the communication channel). While applications can exhibit unusual computational trends or require significant amounts of additional data (e.g. constants, seed values, or lookup tables), these instances may be considered uncommon. Alterations can be made to account for uncorrelated communication and computation but such examples are not included in this document.

Table 3-1. Input parameters for RAT

Dataset Parameters
    N_elements, input     (elements)
    N_elements, output    (elements)
    N_bytes/element       (bytes/element)
Communication Parameters
    throughput_ideal      (MB/s)
    α_write               0 < α < 1
    α_read                0 < α < 1
Computation Parameters
    N_ops/element         (ops/element)
    throughput_proc       (ops/cycle)
    f_clock               (MHz)
Software Parameters
    t_soft                (sec)
    N_iter                (iterations)

Table 3-2 summarizes the input parameters for RAT analysis of the specified algorithm for 1-D PDF estimation. The dataset parameters are generally the first values supplied by the user, since the number of elements will ultimately govern the entire algorithm performance. Though the entire application involves 204,800 data samples, each iteration of the 1-D PDF estimation will involve only a portion, 512 data samples, or 1/400 of the total set. This algorithm effectively consumes all of the input values. Only one cumulative value is left after each iteration per bin but these results are retained on the FPGA. Values are only transferred back to the host after computation for all iterations is complete. The final output communication must be represented as


individual partial transfers, one per iteration, to correspond with the RAT model, but the throughput efficiency is adjusted to correspond with a single block of data.

Table 3-2. Input parameters of 1-D PDF

Dataset Parameters
    N_elements, input     (elements)         512
    N_elements, output    (elements)         1
    N_bytes/element       (bytes/element)    4
Communication Parameters (Nallatech)
    throughput_ideal      (MB/s)             1000
    α_write               0 < α < 1          0.099
    α_read                0 < α < 1          0.001
Computation Parameters
    N_ops/element         (ops/element)      768
    throughput_proc       (ops/cycle)        20
    f_clock               (MHz)              75/100/150
Software Parameters
    t_soft                (sec)              0.578
    N_iter                (iterations)       400

The number of bytes per element, N_bytes/element, is rounded to four (i.e. 32 bits). Even though the PDF estimation algorithm only uses 18-bit fixed point, the interconnect uses 32-bit communication. The data was not byte-packed and the remaining 14 bits per word of communication are unused. During the algorithmic formulation, several formats including 18-bit fixed point, 32-bit fixed point, and 32-bit floating point were considered for use in the PDF algorithm. However, the maximum error percentage was found to be only 3.8% for 18-bit fixed point, which is satisfactory precision for the application. Ultimately 18-bit fixed point was chosen so that only one Xilinx 18×18 multiply-accumulate (MAC) unit would be needed per multiplication. Though slightly smaller bit widths also had reasonable error constraints, no performance gains or appreciable resource savings would have been achieved.

The communication parameters are provided by the user since they are merely a function of the target RC platform, which is a Nallatech H101-PCIXM card containing a Virtex-4 LX100 user FPGA for this case study. The card is connected to the host CPU


via a 133 MHz PCI-X bus which has a theoretical maximum bandwidth of 1 GB/s. The α parameters were computed using a microbenchmark consisting of a read and write for data sizes comparable to those used by the 1-D PDF algorithm. The resulting read and write times were measured, combined with the transfer size to compute the actual communication rates, and finally used to calculate the α parameters by dividing by the theoretical maximum. The α parameters for the target FPGA platform are low due to communication protocols and middleware used by Nallatech atop PCI-X and high latencies associated with the small 2 KB (512 × 4 B) transfers.

The computation parameters are the more challenging portion of RAT performance prediction, but are still simple given the deterministic behavior of PDF estimation. As mentioned earlier, each element that comes into the PDF estimator is evaluated against each of the 256 bins. Each computation requires 3 operations: comparison (subtraction), multiplication, and addition. Therefore, the number of operations per element totals 768 (i.e. 256 × 3). This particular algorithm structure has 8 pipelines that each perform 3 operations per cycle for a total of 24. However, this value is conservatively rounded down to 20 to account for implementation details such as pipeline latency and computation overhead. This conservative parameter was selected prior to the algorithm coding and has not (nor has any parameter for any case study) been adjusted for any "fudge factor" created from runtime data. Pre-implementation adjustments to the RAT parameters such as reducing the throughput value are not required but are sometimes useful to create more optimistic or pessimistic predictions and account for application- or platform-specific behaviors not modeled by RAT. Similarly, a range of throughput values could be examined to explore the effect on performance when the implementation is better or worse than expected. However, this case study focuses on a specific value for each parameter to validate the RAT model.

While previous parameters could be reasonably inferred from the deterministic structure of the algorithm, a priori estimation of the required clock frequency is very


difficult. Empirical knowledge of FPGA platforms and algorithm design practices provides some insight as to a range of likely values. However, attaining a single, accurate estimate of the maximum FPGA clock frequency achieved is generally impossible until after the entire application has been converted to a hardware design and analyzed by an FPGA vendor's layout and routing tools. Consequently, a number of clock values ranging from 75 MHz to 150 MHz for the LX100 are used to examine the scope of possible speedups.

Table 3-3. Performance parameters of 1-D PDF (Nallatech)

                    Predicted   Predicted   Predicted   Actual
    f_clk (MHz)     75          100         150         150
    t_comm (sec)    2.47E-5     2.47E-5     2.47E-5     2.50E-5
    t_comp (sec)    2.62E-4     1.97E-4     1.31E-4     1.39E-4
    util_comm_SB    9%          11%         16%         15%
    util_comp_SB    91%         89%         84%         85%
    t_RC_SB (sec)   1.15E-1     8.85E-2     6.23E-2     7.45E-2
    speedup

Table 3-4. Resource usage of 1-D PDF (XC4VLX100)

    FPGA Resource    Utilization (%)
    BRAMs            15
    48-bit DSPs      8
    Slices           16

The software parameters provide the last piece of information necessary to complete the speedup analysis. The software execution time of the algorithm is provided by the user. Often, software legacy code is the basis for the hardware migration initiative. FPGA development could be based directly on mathematical models, but there would be no baseline for evaluating speedup. The baseline software for the 1-D PDF estimation was written in C, compiled using gcc, and executed on a 3.2 GHz Xeon. Lastly, the number of iterations is deduced from the portion of the overall problem to reside in the FPGA at any one time. Since the user decided to only process 512 elements at a time from the set of 204,800 elements, there must be 400 (i.e. 204800/512) iterations of the algorithm. The case study is implemented in VHDL.
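The worksheet idea behind this parameter list can be sketched in a few lines of Python. The sketch below is illustrative only: the equations of Section 3.2.1 are not reproduced in this part of the text, so the single-buffered formulas in `rat_single_buffered` are an assumption, chosen because they are consistent with the 1-D PDF parameters and predictions reported here (about 2.47E-5 s for communication, 1.31E-4 s for computation, and 6.23E-2 s in total at 150 MHz). The names `RatInputs`, `rat_single_buffered`, and `PDF_1D` are hypothetical, and 1 MB is assumed to mean 1e6 bytes.

```python
from dataclasses import dataclass


@dataclass
class RatInputs:
    """Input parameters in the four categories of Table 3-1 (SI units)."""
    n_elements_in: int       # elements written to the FPGA per iteration
    n_elements_out: int      # elements read back per iteration
    bytes_per_element: int
    throughput_ideal: float  # bytes/s (assumes 1 MB = 1e6 bytes)
    alpha_write: float       # achieved fraction of ideal throughput, 0 < alpha < 1
    alpha_read: float
    ops_per_element: int
    throughput_proc: float   # operations per clock cycle
    f_clock: float           # Hz
    t_soft: float            # software baseline, seconds
    n_iter: int


def rat_single_buffered(p: RatInputs) -> dict:
    """Assumed single-buffered throughput model (a sketch, not the thesis worksheet)."""
    t_write = p.n_elements_in * p.bytes_per_element / (p.alpha_write * p.throughput_ideal)
    t_read = p.n_elements_out * p.bytes_per_element / (p.alpha_read * p.throughput_ideal)
    t_comm = t_write + t_read
    t_comp = p.n_elements_in * p.ops_per_element / (p.f_clock * p.throughput_proc)
    t_rc = p.n_iter * (t_comm + t_comp)  # no overlap of communication and computation
    return {
        "t_comm": t_comm,
        "t_comp": t_comp,
        "util_comm": t_comm / (t_comm + t_comp),
        "util_comp": t_comp / (t_comm + t_comp),
        "t_rc": t_rc,
        "speedup": p.t_soft / t_rc,
    }


# Table 3-2 values for 1-D PDF, using the 150 MHz clock option.
PDF_1D = RatInputs(
    n_elements_in=512, n_elements_out=1, bytes_per_element=4,
    throughput_ideal=1000e6, alpha_write=0.099, alpha_read=0.001,
    ops_per_element=768, throughput_proc=20, f_clock=150e6,
    t_soft=0.578, n_iter=204800 // 512,
)

if __name__ == "__main__":
    for name, value in rat_single_buffered(PDF_1D).items():
        print(f"{name:10s} {value:.3g}")
```

Changing `f_clock` to 75e6 or 100e6 gives the other predicted columns, which is the kind of quick what-if exploration the worksheet is meant to support.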


3.3.3 Predicted and Actual Results

The RAT performance numbers are compared with the experimentally measured results in Table 3-3. Each predicted value in the table is computed using the input parameters and equations listed in Section 3.2.1. For example, the predicted computation time when f_clk = 150 MHz is calculated as follows:

$$t_{comp} = \frac{512\ \text{elements} \times 768\ \text{ops/element}}{150\ \text{MHz} \times 20\ \text{ops/cycle}} = \frac{393216\ \text{ops}}{3\text{E+}9\ \text{ops/sec}} = 1.31\text{E-}4\ \text{secs}$$

The communication time is computed using the corresponding equation. Because the application is single-buffered, the total RC execution time is simply:

$$t_{RC_{SB}} = 400\ \text{iterations} \times (2.47\text{E-}5\ \text{secs} + 1.31\text{E-}4\ \text{secs}) = 6.23\text{E-}2\ \text{secs}$$

The speedup is simply the division of the software execution time by the RC execution time. The utilization is computed using the corresponding SB equations.

The communication and computation times for the actual FPGA code were measured using the wall-clock time of the CPU. The error in the prediction of the communication time was minimal, approximately 1%, due to detailed microbenchmarking for these exact transfer sizes. A relatively accurate t_comp prediction is expected given the deterministic structure of the parallel algorithm. However, the high degree of accuracy (two significant figures) between the predicted and actual computation times with f_clk = 150 MHz was unusual, since the computational throughput was a conservatively estimated parameter. Much of the 1-D PDF algorithm is pipelined but the lower effective throughput, due to the latency in the short 0.14 ms of computation time, closely matched the conservatively estimated value for throughput_proc (i.e. the 20 ops/cycle used in the RAT prediction


instead of the theoretical 24). Also, this lower throughput accounted for extra overhead time involved with polling the FPGA for completion of computation.

The total execution time for the FPGA is also measured using the wall-clock time, rather than calculated from Equation (3-5), to ensure maximum accuracy. Additional factors may be present in the total time that are not accounted for in the individual communication and computation times. In this case study, the total error was 16% but the communication and computation errors were only 1% and 6%, respectively. The discrepancy is due to overheads with managing and regulating data transfers by the host CPU that are not expressly part of the individual RAT models. The impact of these overheads and other synchronization issues will vary depending upon the particular FPGA platform and size of the overhead time relative to the total RC execution time. For 1-D PDF, the extra 8.5 ms was significant compared to the 75 ms of execution time. The relatively low resource usage in Table 3-4 illustrates a potential for further speedup by including additional parallel kernels, albeit at the risk of increasing the impact of the system overhead.

3.4 Additional Case Studies

Several case studies are presented as further analysis and validation of the RAT methodology: 2-D PDF estimation, coordinate calculation for LIDAR processing, the traveling salesman problem, and molecular dynamics. Two-dimensional PDF estimation continues to illustrate the accuracy of RAT for algorithms with a deterministic structure. Coordinate calculation uses prediction on a communication-bound algorithm. Traveling salesman explores a computation-bound searching algorithm with pipelined structure. However, the molecular dynamics application serves as a counterpoint given the relative difficulty of encapsulating its data-driven, non-deterministic computations. A diverse collection of vendor platforms, Nallatech, Cray, SRC, and XtremeData, is used for 2-D PDF, LIDAR, traveling salesman, and molecular dynamics, respectively. Each of these case studies has single-buffered communication and computation. As with one-dimensional


PDF estimation, the design emphasis is placed on throughput analyses because the overall goal is to minimize execution time for these designs.

These RAT case studies represent a range of experiences with estimating computational throughput based on different user backgrounds and prediction emphases. Consequently, 2-D PDF estimation, LIDAR, and traveling salesman focus on more exact throughput parameterization in contrast to the conservative prediction in 1-D PDF. However, the performance of molecular dynamics could not be reliably estimated prior to implementation because of the difficulty of analyzing the complex and data-dependent algorithm structure as described by a high-level language. Instead, the target throughput is computed from the speedup requirements. While this prediction will be inaccurate if the minimum throughput is unrealizable, the RAT estimation provides a starting point for implementation and insight about the performance ramifications of a suboptimal architecture.

3.4.1 2-D PDF Estimation

As previously discussed, the Parzen window technique is applicable in an arbitrary number of dimensions [36]. However, the two-dimensional case presents a significantly greater problem in terms of communication and computation volume than the original 1-D PDF estimate. Now 256 × 256 discrete bins are used for PDF estimation and the input data set is effectively doubled to account for the extra dimension. The basic computation per element grows from $(N - n)^2 + c$ to $((N_1 - n_1)^2 + (N_2 - n_2)^2) + c$ where $N_1$ and $N_2$ are the data sample values, $n_1$ and $n_2$ are the probability levels for each dimension, and c is a probability scaling factor. But despite the added complexity, the increased quantity of parallelizable operations intuitively makes this algorithm amenable to the RC paradigm, assuming sufficient quantities of hardware resources are available.

Table 3-5 summarizes the input parameters for the RAT analysis for our 2-D PDF estimation algorithm. Again, the computation is performed in a two-dimensional space, so twice the number of data samples are sent to the FPGA. In contrast to the 1-D case, the


65,536 (256 × 256) PDF values are sent back to the host processor after each iteration of computation due to memory size constraints on the FPGA. The same numerical precision of four bytes per element is used for the data set. The interconnect parameters model the same Nallatech FPGA card as in the 1-D case study but for different transfer sizes. The α_read term is small for the relatively large output of 65,536 elements because data is transferred in 256 batches of 256 elements each, incurring a large latency overhead. Each of the 65,536 bins requires three operations for a total of 196,608 operations. Eight kernels, each containing two pipelines (one per dimension), perform three operations per pipeline per cycle for a total of 48 simultaneous computations per cycle. Again, the same range of clock frequencies is used for comparison. The software baseline for computing speedup values was written in C and executed on the same 3.2 GHz Xeon processor. The algorithm requires the same 400 iterations to complete the computation and VHDL is also used.

Table 3-5. Input parameters of 2-D PDF

Dataset Parameters
    N_elements, input     (elements)         1024
    N_elements, output    (elements)         65536
    N_bytes/element       (bytes/element)    4
Communication Parameters (Nallatech)
    throughput_ideal      (MB/s)             1000
    α_write               0 < α < 1          0.147
    α_read                0 < α < 1          0.026
Computation Parameters
    N_ops/element         (ops/element)      196608
    throughput_proc       (ops/cycle)        48
    f_clock               (MHz)              75/100/150
Software Parameters
    t_soft                (sec)              158.8
    N_iter                (iterations)       400

The RAT performance predictions are listed with the experimentally measured results in Table 3-6. These three predictions are based on the range of clock frequency values listed in Table 3-5 but the accuracy of the actual 100 MHz design is only evaluated


against the comparable 100 MHz prediction.

Table 3-6. Performance parameters of 2-D PDF (Nallatech)

                    Predicted   Predicted   Predicted   Actual
    f_clk (MHz)     75          100         150         100
    t_comm (sec)    1.01E-2     1.01E-2     1.01E-2     1.06E-2
    t_comp (sec)    5.59E-2     4.19E-2     2.80E-2     4.46E-2
    util_comm_SB    15%         19%         27%         19%
    util_comp_SB    85%         81%         73%         81%
    t_RC_SB (sec)   2.64E+1     2.08E+1     1.52E+1     2.21E+1
    speedup         6.0         7.6         10.4        7.2

Table 3-7. Resource usage of 2-D PDF (XC4VLX100)

    FPGA Resource    Utilization (%)
    BRAMs            21
    48-bit DSPs      33
    Slices           22

The communication time is within 5% of the predicted value with a discrepancy of 0.5 milliseconds, again due to accurate microbenchmarking of the Nallatech board's PCI-X interface. This difference is potentially significant given the 400 iterations required to perform this algorithm. The overall impact on speedup is further affected by variation in the computation time. An underestimation by approximately 2.7 milliseconds creates a total discrepancy just over 3 milliseconds per iteration. This larger error in the computational throughput parameter as compared to 1-D PDF is due to the more exact modeling of the pipeline behavior without adjustments for potential overhead. These overheads from pipeline latency and polling were assumed insignificant due to the length of the overall execution time but instead had a noticeable effect each iteration. In total, the speedup was 6% less than the predicted speedup. This error margin is excellent given the fast and coarse-grained prediction approach of RAT compounded over hundreds of iterations. Greater attention to communication behavior and the nuances of the computation structure can further reduce this error if desired. To the extent possible while maintaining fast performance estimation, insight about shortcomings in previous RAT predictions can be factored into future projects to further boost accuracy. Comparing Table 3-7 to the resource utilization from the 1-D algorithm,


the hardware usage has increased but still has not nearly exhausted the resources of the FPGA. Additional parallelism could be exploited to improve the performance of the 2-D algorithm if additional revisions are desired.

3.4.2 Coordinate Calculation for LIDAR

Airborne light detection and ranging (LIDAR) [37] is emerging as an important remote sensing modality for providing high-resolution position information on targets of interest, primarily ground-based features such as terrain topology, from long distances. LIDAR processing begins with the calculation of the Cartesian coordinates of the LIDAR targets based upon the travel time of the laser beam (i.e. the range), the angle of the scan, the GPS (Global Positioning System) coordinates of the airplane (X_ac, Y_ac, Z_ac), and the aircraft attitude (roll r, pitch p, and yaw y). For this case study, the laser continuously provides target range information at a rate of 33 kHz while the GPS coordinates only update at 1 Hz and the aircraft attitude at 5 Hz. Interpolation is used to map each LIDAR return with a specific GPS coordinate and aircraft attitude.

For this algorithm, coordinate calculation is constructed as one pipeline on a single node of a Cray XD1 system consisting of a Xilinx XC2VP50 user FPGA connected to a host Opteron processor via the RapidArray interconnect. The algorithm consists of six steps for each LIDAR return. First, interpolation of the GPS coordinates and aircraft attitude is performed for the particular return value. Second, the unit vector between the laser source and the target is computed from the scan angle. The third and fourth steps generate three rotation matrices to align the aircraft attitude with the GPS coordinate and apply the matrices to the unit vector, respectively. Fifth, a range vector is produced by scaling the rotated unit vector by the target's range. Sixth, the range vector is translated into the coordinate space relative to the aircraft GPS position. Equation 3-12 summarizes the mathematical computations of the steps where S and C abbreviate sine and cosine operations, respectively.
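The first of the six steps, mapping each 33 kHz laser return onto the slower GPS and attitude streams, can be illustrated with a short Python sketch. The text says only that interpolation is used, so linear interpolation is an assumption here, and the function name `interpolate` and all sample values are hypothetical.

```python
from bisect import bisect_right


def interpolate(sample_times, sample_values, t):
    """Linearly interpolate a slowly sampled quantity at time t.

    sample_times must be sorted in increasing order; times outside the
    sampled range are clamped to the nearest sample.
    """
    if t <= sample_times[0]:
        return sample_values[0]
    if t >= sample_times[-1]:
        return sample_values[-1]
    i = bisect_right(sample_times, t)  # index of the first sample strictly after t
    t0, t1 = sample_times[i - 1], sample_times[i]
    v0, v1 = sample_values[i - 1], sample_values[i]
    return v0 + (t - t0) / (t1 - t0) * (v1 - v0)


LASER_RATE_HZ = 33_000

# Hypothetical one-second interval: GPS at 1 Hz, attitude at 5 Hz.
gps_times = [0.0, 1.0]
gps_x = [100.0, 160.0]                       # aircraft X position (made-up values)
attitude_times = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
roll = [0.00, 0.01, 0.03, 0.02, 0.00, -0.01]  # roll in radians (made-up values)

if __name__ == "__main__":
    for k in (0, 16_500, 32_999):            # a few of the 33,000 returns
        t = k / LASER_RATE_HZ
        print(k, interpolate(gps_times, gps_x, t), interpolate(attitude_times, roll, t))
```

Each return needs one such lookup per GPS coordinate and per attitude angle before the rotation and scaling steps can be applied.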


[X, Y, Z]^T = [ (C_y C_r S C_y S_r C_p S S_y S_r S)) + X_ac,  (S_y C_r S S_y S_r C_p C C_y S_r C)) + Y_ac,  (S_r S C_r C_p C) + Z_ac ]^T    (3-12)

Table 3-8. Input parameters of LIDAR

Dataset Parameters
    N_elements, input     (elements)         33000
    N_elements, output    (elements)         33000
    N_bytes/element       (bytes/element)    8
Communication Parameters (Cray)
    throughput_ideal      (MB/s)             1600
    α_write               0 < α < 1          0.5
    α_read                0 < α < 1          0.5
Computation Parameters
    N_ops/element         (ops/element)      1
    throughput_proc       (ops/cycle)        1
    f_clock               (MHz)              100/125/150
Software Parameters
    t_soft                (sec)              0.011
    N_iter                (iterations)       1

Table 3-8 summarizes the RAT input parameters of the algorithm for coordinate calculation. The input data size of 33,000 elements is based on one second of LIDAR returns (i.e. the time between GPS updates). A corresponding number of GPS coordinates is returned by the calculations. The X, Y, and Z dimensions of the LIDAR returns and GPS coordinates each use a 16-bit fixed-point format. A total of 48 bits is sent using the 64-bit (8-byte) RapidArray interconnect. This channel has a documented theoretical throughput of 1.6 GB/s per direction but microbenchmarking indicates only half the rate is achievable for these data transfers. Because the computation is pipelined, each element is counted as a single operation, so the total number of operations equals the number of elements. The pipeline can process one operation (i.e. element) per cycle. The exact depth of the pipeline is not known a priori but the extra latency is presumed negligible when compared to the size of the data set. A range of clock frequencies is examined to predict the scope of the overall


speedup. The software baseline was written in C and executed on a 2.4 GHz Opteron processor, the host CPU for the Cray XD1 node. Only one iteration (i.e. GPS interval) is required for this case study and VHDL is used to implement the parallel algorithm in hardware.

Table 3-9. Performance parameters of LIDAR (Cray)

                    Predicted   Predicted   Predicted   Actual
    f_clk (MHz)     100         125         150         125
    t_comm (sec)    6.60E-4     6.60E-4     6.60E-4     5.65E-4
    t_comp (sec)    3.30E-4     2.64E-4     2.20E-4     2.25E-4
    util_comm_SB    67%         71%         75%         72%
    util_comp_SB    33%         29%         25%         28%
    t_RC_SB (sec)   9.90E-4     9.24E-4     8.80E-4     7.90E-4
    speedup         11.0        11.8        12.4        13.8

Table 3-9 compares the RAT performance predictions with the actual 125 MHz experimental results. The structure of the particular Cray SRAM interface overlaps computation and DMA transfers back to the CPU (i.e. t_read). Consequently, the total RC execution time was directly measurable but the individual computation and communication times for the actual result were estimated from the total execution time based on the expected latency of the computation. Two general conclusions were that both the computation and communication times were overestimated by RAT but that the utilization ratios were still fairly consistent with expectations. Unlike the previous case studies, the total speedup was underestimated by 16%. This discrepancy in speedup was primarily due to the difference in communication times. The actual computation pipeline is believed to correspond closely with the high-level algorithm. The communication and computation times of 565 μs and 225 μs are likely comparable to the system overhead and measurement error, causing noticeable discrepancies and the unusual behavior of a pessimistic prediction even with the generalized analytical model. Though extra performance as compared to RAT projections may be an unexpected benefit for the algorithm, the goal of the methodology is precise prediction that considers all major factors to performance. Adjustments to the model for more accurate prediction of short


communication and computation may be necessary. Table 3-10 highlights the availability of unused resources to expand the parallel computation but the benefit will be marginal because of the communication-bound algorithm.

Table 3-10. Resource usage of LIDAR (XC2VP50)

    FPGA Resource        Utilization (%)
    BRAMs                12
    18x18 Multipliers    5
    Slices               45

3.4.3 Traveling Salesman Problem

The traveling salesman problem (TSP) [38] is a particular version of the NP-complete Hamiltonian path problem that locates the minimum length path through an undirected, weighted graph in which each vertex is visited exactly once. (Other derivations of the Hamiltonian path problem include the snake-in-the-box, knight's tour, and the Lovász conjecture.) For this algorithm, any city may be the starting point and all cities are connected to every other city creating N! potential Hamiltonian paths, where N is the number of cities. To accelerate the time to converge on a solution, heuristics are sometimes employed to systematically search a subset of the solution space. However, the algorithm for this case study performs an exhaustive search on all paths in the graph. The specific algorithm formulation has significant ramifications not only on the hardware performance but also on the prediction accuracy.

The case study targets SRC Computer's SRC-6 FPGA platform. Within the SRC-6, the algorithm uses one of the Xilinx XC2V6000 user FPGAs in a single MAP-B unit. The FPGA is connected to a host processor via the vendor's SNAP (memory DIMM slot) interconnect. Nine depth-first traversals of the graph occur simultaneously on a single FPGA starting from each of the nine different cities. Techniques such as branch and bound are not used because each step of the search would be dependent on the previous steps, thus preventing any pipelining. Instead, the algorithm starts with selecting N arbitrary cities (and their N − 1 edges) all at once and is then followed by determination


of path validity and length (i.e. if all cities were visited exactly once, report the total distance traveled). The individual steps are not interrelated and the examination of possible paths can be pipelined. However, unlike the branch-and-bound technique which backtracks in the middle of paths to avoid revisiting cities, the hardware pipeline operates on full N-length paths, even those invalid because of repeated cities. Extra computation is required but substantially more parallelism is exploitable.

Table 3-11. Input parameters of TSP

Dataset Parameters
    N_elements, input     (elements)         81
    N_elements, output    (elements)         1
    N_bytes/element       (bytes/element)    8
Communication Parameters (SRC)
    throughput_ideal      (MB/s)             1400
    α_write               0 < α < 1          0.03
    α_read                0 < α < 1          0.03
Computation Parameters
    N_ops/element         (ops/element)      4782969
    throughput_proc       (ops/cycle)        9
    f_clock               (MHz)              100
Software Parameters
    t_soft                (sec)              2.22
    N_iter                (iterations)       1

Table 3-11 lists the input parameters of the RAT performance prediction for TSP. The interconnect parameters model the proprietary SNAP interconnect of the SRC-6 system. The small fraction of throughput, α, represents the overhead associated with the extremely minimal communication in the algorithm, only N × N input elements. This information contains the distances between every pair of cities. The only output for this system is the minimal path length and this communication time is assumed to be negligible. Elements are 8 bytes, the width of the MAP-B's SRAM, but only 4 bytes (32 bits) per element are used to represent distances in fixed point. The information is not byte-packed for communication and consequently the other 32 bits are wasted. For this case study, the computational workload is exponentially related to the number of elements.
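The exhaustive formulation described above, in which every full N-length sequence of cities is evaluated and invalid ones are simply discarded, can be sketched in software as follows. This is a sequential Python illustration of the search space only, not the Carte C design; the function name `shortest_hamiltonian_path` and the 4-city distance matrix are hypothetical.

```python
from itertools import product


def shortest_hamiltonian_path(dist):
    """Examine all N^N city sequences; keep the shortest one that visits each city once.

    Returns (best_length, sequences_examined). The length of every sequence is
    computed before validity is checked, mirroring a pipeline that does the
    same work for valid and invalid paths.
    """
    n = len(dist)
    best = None
    examined = 0
    for path in product(range(n), repeat=n):       # N choices for each of N positions
        examined += 1
        length = sum(dist[a][b] for a, b in zip(path, path[1:]))  # N - 1 edges
        valid = len(set(path)) == n                # no repeated cities
        if valid and (best is None or length < best):
            best = length
    return best, examined


# Hypothetical symmetric distance matrix for 4 cities.
DIST = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]

if __name__ == "__main__":
    print(shortest_hamiltonian_path(DIST))
```

For the nine cities of the case study the same enumeration covers 9^9 = 387,420,489 sequences, which equals the 81 input elements times the 4,782,969 operations per element listed in Table 3-11. Only N! of those sequences are valid, which is the extra computation traded for a pipeline with no data dependences.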


While $N^2$ distances are needed to compute path lengths, $N^N$ total paths must be examined. For consistency with the other case studies, the number of operations per element is set to $N^{N-2}$ (i.e. $9^7 = 4782969$), which makes the RAT prediction computationally equivalent to the view of $N^N$ path elements (for computation only) with one operation each. Since nine cities are examined in this case study using nine kernels, a total of nine potential paths are examined per clock cycle. The clock frequency of the MAP-B unit is fixed at 100 MHz and only one input/compute/output iteration is required for this algorithm. The C software baseline was executed on a 3.2 GHz Xeon processor. The parallel algorithm is constructed in SRC's Carte C, a high-level language (HLL) for FPGA design.

Table 3-12. Performance parameters of TSP (SRC)

                    Predicted   Actual
    f_clk (MHz)     100         100
    t_comm (sec)    1.54E-5     1.57E-5
    t_comp (sec)    4.31E-1     4.30E-1
    util_comm_SB    0.004%      0.003%
    util_comp_SB    99.996%     86.2%
    t_RC_SB (sec)   4.31E-1     4.99E-1
    speedup         5.16        4.45

Table 3-13. Resource usage of TSP (XC2V6000)

    FPGA Resource        Utilization (%)
    BRAMs                56
    18x18 Multipliers    0
    Slices               73

The results of the hardware design are compared against the performance predictions in Table 3-12. The percent error in the predicted communication time was less than 2% due to microbenchmarking on the SRC-6 specifically to replicate the short communication transfers. The cycle-accurate timers of the SRC-6 system meant this discrepancy was not a measurement error but instead a function of the modeling and parameterization of the SNAP interconnect throughput. However, the actual communication time is only 16 μs (versus 430 ms for computation) and consequently its impact on speedup is negligible for this case study. The predicted and actual computation times were nearly identical due to


the deterministic, pipelined structure. For the SRC-6 system, the individual communication and computation times were measured on the FPGA via a vendor-provided counter function. Both operations are FPGA-controlled and initiated by a single function call which could not be separated from the perspective of the host microprocessor. However, the total RC execution as measured by the wall-clock time of the CPU is approximately 0.07 seconds longer than the sum of the computation and communication times measured on the FPGA. Consequently, extra system overhead not considered by RAT caused the actual speedup value to be 14% less than expected. The utilizations for the actual design reflect this overhead with only 86% of RC execution time comprising communication or computation. The discrepancy in total execution is large because the overhead is significant relative to the short time (less than 0.5 s). If the overhead had been factored into the prediction, the total estimation error would be under 1%. Additionally, resource utilization is summarized in Table 3-13. No multipliers are required for this type of searching but the heavy usage of logic elements limits further scalability of the algorithm on a single FPGA of this size.

3.4.4 Molecular Dynamics

Molecular Dynamics (MD) is the numerical simulation of the physical interactions of atoms and molecules over a given time interval. Based on Newton's second law of motion, the acceleration (and subsequent velocity and position) of the atoms and molecules are calculated at each time step based on the particles' masses and the relevant subatomic forces. For this case study, the molecular dynamics simulation is primarily focused on the interaction of certain inert liquids such as neon or argon. These atoms do not form covalent bonds and consequently the subatomic interaction is limited to the Lennard-Jones potential (i.e. the attraction of distant particles by van der Waals force and the repulsion of close particles based on the Pauli exclusion principle) [39]. Large-scale molecular dynamics simulators such as AMBER [40] and NAMD [41] use these same classical physics principles but can calculate not only Lennard-Jones potential but also the nonbonded


electrostatic energies and the forces of covalent bonds, their angles, and torsions, making them applicable to not only inert atoms but also complex molecules such as proteins.

The parallel algorithm used for this case study was adapted from code provided by Oak Ridge National Labs (ORNL) and targets the XtremeData XD1000 FPGA platform. The most challenging aspect of performance prediction for MD is accurately measuring the number of operations per molecular interaction and the computational throughput. This particular algorithm's execution time is dependent on the locality of the molecules, which is a function of the data set values. Sufficiently distant molecules are assumed to have negligible interaction and therefore require less computational effort. Attempts are made to mitigate the data-driven behavior through pipelined computations. However, the complexity of the algorithm and potential for suboptimal behavior during mapping require some algorithm parameters to be estimated.

Table 3-14. Input parameters of MD

Dataset Parameters
    N_elements, input     (elements)         16384
    N_elements, output    (elements)         16384
    N_bytes/element       (bytes/element)    36
Communication Parameters (XtremeData)
    throughput_ideal      (MB/s)             1600
    α_write               0 < α < 1          0.28
    α_read                0 < α < 1          0.28
Computation Parameters
    N_ops/element         (ops/element)      164000
    throughput_proc       (ops/cycle)        50
    f_clock               (MHz)              75/100/150
Software Parameters
    t_soft                (sec)              5.76
    N_iter                (iterations)       1

Table 3-14 summarizes the input parameters for the RAT analysis of the MD design. The data size of 16,384 molecules (i.e. elements) was chosen because it is a small but still scientifically interesting problem. Each element requires 36 bytes, 4 bytes each for position, velocity and acceleration in each of the X, Y, and Z spatial directions.
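The data-driven behavior that makes N_ops/element hard to fix in advance can be seen in a small Python sketch. It is not the ORNL-derived code used in the case study: it uses the standard 12-6 form of the Lennard-Jones potential in reduced units, and the cutoff radius, function names, and particle positions are hypothetical choices for illustration.

```python
import math


def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones pair potential at separation r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def potential_energy(positions, cutoff):
    """Sum pair potentials, skipping pairs farther apart than the cutoff.

    Returns (energy, pairs_evaluated); the second value shows how the amount
    of work depends on where the particles are, not only on how many there are.
    """
    energy = 0.0
    evaluated = 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            if r < cutoff:              # distant pairs are treated as negligible
                energy += lennard_jones(r)
                evaluated += 1
    return energy, evaluated


# Two hypothetical 4-particle data sets of the same size but different locality.
CLUSTERED = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.1, 0.0), (1.1, 1.1, 0.0)]
SPREAD = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 0.0), (5.0, 5.0, 0.0)]

if __name__ == "__main__":
    print(potential_energy(CLUSTERED, cutoff=2.5))
    print(potential_energy(SPREAD, cutoff=2.5))
```

Both data sets contain the same number of elements, yet the clustered one evaluates every pair while the spread one evaluates none, which is why an operation count per element has to be estimated rather than read off the algorithm structure.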


Table 3-15. Performance parameters of MD (XtremeData)

                    Predicted   Predicted   Predicted   Actual
    f_clk (MHz)     75          100         150         100
    t_comm (sec)    8.77E-4     8.77E-4     8.77E-4     1.39E-3
    t_comp (sec)    7.17E-1     5.37E-1     3.58E-1     8.79E-1
    util_comm_SB    0.1%        0.2%        0.2%        0.2%
    util_comp_SB    99.9%       99.8%       99.8%       99.8%
    t_RC_SB (sec)   7.19E-1     5.38E-1     3.59E-1     8.80E-1
    speedup         8.0         10.7        16.0        6.6

Table 3-16. Resource usage of MD (EP2S180)

    FPGA Resource    Utilization (%)
    BRAMs            24
    9-bit DSPs       100
    ALUTs            73

The interconnect parameters model an XtremeData XD1000 platform containing an Altera Stratix-II EP2S180 user FPGA connected to an Opteron processor over the HyperTransport fabric. The theoretical interconnect throughput is 1.6 GB/s but only a fraction of the channel can be used for transferring data to the on-board SRAM as needed for the algorithm. The number of operations per element, approximately 16,400 interactions per molecule times 10 operations each, is estimated due to the length of the pipeline and data-driven behavior. Unlike the previous case studies, the computational throughput cannot be reliably measured due to the complex and nondeterministic algorithm structure. As discussed in Section 3.2.1, the number of operations per cycle is treated as a "tuning" parameter to compute the throughput necessary to achieve the desired speedup based on the estimate of N_ops/element. Though 50 is the quantitative value computed by the equations to achieve the desired overall speedup of approximately 10, this value serves qualitatively as an indicator that substantial data parallelism and functional pipelining must be achieved in order to realize the desired speedup. The same range of clock frequencies was used as in PDF estimation. The serial software baseline was executed on a 2.2 GHz Opteron processor, the host processor of the XD1000 system. The


Table3-17.SummaryofResults 1-DPDF2-DPDFLIDARTSPMD PredictedComm.(s)2.47E-51.01E-26.60E-41.54E-58.77E4 ActualComm.(s)2.50E-51.06E-25.65E-41.57E-41.39E-3Comm.Error1%5%17%2%37% PredictedComp.(s)1.31E-44.19E-22.64E-44.31E-15.37E1 ActualComp.(s)1.39E-44.46E-22.25E-44.30E-18.79E-1Comp.Error6%6%17%0.2%39% PredictedSpeedup6.57.611.85.210.7ActualSpeedup7. entiredatasetisprocessedinasingleiterationandthealg orithmisconstructedinImpulse C,across-platformHLLforFPGAs. Table 3-15 outlinesthepredictedandactualresultsoftheMD.Notethat these resultsareuniquetothisspecicalgorithmandthatdiere ntstructures,targetlanguages, andplatformswillhavevaryingpredictionaccuracy.Thedi erenceinpredictedand actualcommunicationtimeis37%.Theerroritselfisassoci atedwiththeoverheadof multipleI/OtransfersbetweentheCPUandon-boardSRAMmemo rymodeledasa singleblockofcommunication.Whilemoreaccurateestimat ionsarethegoalofRAT, anyfurtherprecisionimprovementsforthisparameterarei nconsequentialgiventhelow communicationutilization.Computationdominatedtheove rallRCexecutiontimeand theactualtimeis39%higherthanthepredictedvalueduetot hedata-drivenoperations andsuboptimalpipeliningperformance.Thetotalnumberof operationswashigherthan expected,coupledwithrelativelymodestparallelismfort heproblemsize.Consequently, thespeeduperrorwasalso39%,signicantlylessthandesir ed.However,thiscasestudy isusefulbecausethequalitativeneedforsignicantparal lelismiscorrectlypredictedeven thoughthealgorithmcannotbefullyanalyzedatdesigntime .AsTable 3-16 illustrates, alargepercentageofthecombinatoriallogicandalldedica tedmultiply-accumulators (DSPs)wererequiredforthealgorithm. 52
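As a rough check, the predicted computation times and speedups in Table 3-15 can be reproduced from the Table 3-14 inputs. The sketch below assumes the single-buffered RAT formulation from Chapter 3, namely $t_{comp} = N_{elements} \times N_{ops/element} / (f_{clock} \times throughput_{proc})$ and $speedup = t_{soft} / (N_{iter} \times (t_{comm} + t_{comp}))$. The function names are illustrative, and the predicted communication time is copied from Table 3-15 rather than recomputed.

```python
# Sketch of the single-buffered RAT estimate for the MD case study.
# Inputs are copied from Tables 3-14 and 3-15; names are illustrative.

def rat_comp_time(n_elements, ops_per_element, f_clock_hz, ops_per_cycle):
    """Computation time (s): total operations divided by operations per second."""
    return (n_elements * ops_per_element) / (f_clock_hz * ops_per_cycle)


def rat_speedup(t_soft, t_comm, t_comp, n_iter=1):
    """Speedup: software time over N_iter * (t_comm + t_comp)."""
    return t_soft / (n_iter * (t_comm + t_comp))


N_ELEMENTS = 16384         # molecules
OPS_PER_ELEMENT = 164000   # estimated operations per molecule
OPS_PER_CYCLE = 50         # the "tuning" throughput parameter
T_SOFT = 5.76              # software baseline (s)
T_COMM = 8.77e-4           # predicted communication time (s), from Table 3-15

predictions = {}
for f_mhz in (75, 100, 150):
    t_comp = rat_comp_time(N_ELEMENTS, OPS_PER_ELEMENT, f_mhz * 1e6, OPS_PER_CYCLE)
    predictions[f_mhz] = (t_comp, rat_speedup(T_SOFT, T_COMM, t_comp))
    print(f"{f_mhz} MHz: t_comp = {t_comp:.3e} s, "
          f"speedup = {predictions[f_mhz][1]:.1f}")
```

Because computation accounts for more than 99% of the predicted execution time, the predicted speedup scales almost linearly with clock frequency.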


3.4.5 Summary of Case Studies

Table 3-17 outlines the predicted values, actual results, and error percentages for the communication time, computation time, and speedup. The magnitudes of the predicted and actual values are listed to compare the absolute impact of the relative error percentages. For example, molecular dynamics has comparable communication and computation error percentages but the magnitudes are different, with communication having virtually no impact on total speedup.

For these case studies, communication had the lowest average error among the modeled times. The larger errors for LIDAR and MD were caused by minor discrepancies in the final communication setup versus the RAT analysis of the algorithm. These error percentages are acceptable for RAT because they still yield valid quantitative insight about the algorithm behavior. The cost of more precise prediction must be balanced with the impact of the communication time on performance. Again, the largest communication error, found in MD, did not significantly affect the speedup, because the communication was less than 1% of the overall execution time.

Even with the more complicated task of architecture and platform parameterization, the average prediction error for computation was only slightly higher than communication. PDF and TSP had low computation errors complementing the low communication errors. LIDAR had double-digit error, which was due to perceivable system overheads in the short 0.2 ms computation time. The one outlier was the MD application. The difficulty of mitigating the data-driven computations was compounded by unknowns in the final algorithm mapping by the HLL tool. As with the communication predictions for MD, the error was significant but RAT still provided a useful insight about what order of magnitude speedup should be achievable.

The prediction errors in the overall speedup were higher on average than the individual computation or communication times. Particularly with 1-D PDF, LIDAR, and TSP, overheads not part of the RAT computation and communication models were


noticeable portions of the short RC execution times, 70 ms, 0.2 ms, and 430 ms, respectively. The 2-D PDF estimation was long enough to mitigate system overheads. The overall performance of molecular dynamics matched the computation time due to 99% utilization. While minimizing the prediction error was an important issue, rapidly achieving a reasonable projection was the ultimate goal.

The data summarized in Table 3-17 highlights the focus of RAT on performance approximation. Refinements to the RAT model for short communication and computation times, platform-specific overheads, and multiple data transfers per iteration are potentially useful and can help improve the overall accuracy of the performance estimations. Extra analysis during parameterization can also improve RAT predictions. While future research will involve some revisions to the model, the current errors in the 5% to 15% range have proven sufficient for the intent of RAT.

3.5 Conclusions

Concomitant with the reformation towards explicitly parallel architectures such as FPGAs, efficient development of parallel algorithms remains a fundamental challenge for computing. Insufficient attention is given to formulation, the exploration and analysis of application designs preceding the design, translation, and execution phases. Particularly, the efficiency of FPGA application development suffers from a lack of methods and tools for rapid evaluation of candidate algorithms, architectures, and system mappings prior to expensive implementation. The RC Amenability Test is presented as a methodology to evaluate the performance of a specific parallel algorithm on a specific FPGA platform before any lengthy coding.

RAT demonstrated reasonable accuracy with predicting communication time, computation time, and speedup with the five case studies. Detailed microbenchmarking prior to the RAT analyses allowed for an average error of 12% (with individual errors as low as 1%) for the communication times of algorithms. For the deterministic case studies (i.e., all except MD), computation error peaked at 17%. Each algorithm's inclusion of


pipelining allowed computational throughputs to be accurately projected even though the high-level parallel algorithms were not yet mapped to hardware. The total RC execution time had an average error of 18% for the case studies, slightly higher than the individual communication and computation components. Large system overhead versus short execution time was the main cause. Overall, the methodology performed well for the diverse collection of algorithm complexities, hardware languages, FPGA platforms, and total execution times. RAT was designed to handle these issues in single CPU and FPGA systems where communication and computation are governed by the number of data elements.
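The communication and computation error figures quoted in these conclusions can be recomputed from Table 3-17. The sketch below is illustrative only: it assumes each error percentage is the absolute difference between prediction and measurement taken relative to the measured value, which is consistent with the tabulated MD and LIDAR entries, and the dictionary and function names are not from the original text.

```python
from statistics import mean

# Error percentages transcribed from Table 3-17.
comm_error = {"1-D PDF": 1, "2-D PDF": 5, "LIDAR": 17, "TSP": 2, "MD": 37}
comp_error = {"1-D PDF": 6, "2-D PDF": 6, "LIDAR": 17, "TSP": 0.2, "MD": 39}


def percent_error(predicted, actual):
    """Absolute prediction error relative to the measured value, in percent."""
    return abs(predicted - actual) / actual * 100.0


# Average communication error over all five case studies.
avg_comm_error = mean(comm_error.values())

# Peak computation error over the deterministic case studies (all except MD).
peak_det_comp_error = max(v for name, v in comp_error.items() if name != "MD")

# Spot check of the MD entries against the raw times in Table 3-17.
md_comm_error = percent_error(8.77e-4, 1.39e-3)
md_comp_error = percent_error(5.37e-1, 8.79e-1)

print(f"average communication error: {avg_comm_error:.1f}%")
print(f"peak deterministic computation error: {peak_det_comp_error}%")
print(f"MD errors: comm {md_comm_error:.0f}%, comp {md_comp_error:.0f}%")
```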


CHAPTER 4
EXPANDED MODELING FOR MULTI-FPGA PLATFORMS (PHASE 2)

The second research phase outlines the expansion of the RAT analytical model for efficient and reasonably accurate performance estimation of multi-FPGA systems prior to hardware implementation. This chapter presents a brief introduction to the challenges and objectives of multi-FPGA RAT (Section 4.1); background and related work on multi-node performance modeling (Section 4.2); assumptions, quantitative attributes, analytical models, and scope of the expanded model (Section 4.3); a detailed walkthrough of a reasonably complex application, 2-D PDF estimation (Section 4.4); two additional case studies, image filtering and molecular dynamics (Section 4.5); and conclusions (Section 4.6).

4.1 Introduction

The reformation towards explicitly parallel architectures and algorithms is accompanied by increased emphasis on multi-device systems for achieving additional performance benefits. However, exploiting parallelism for scalable FPGA systems requires even more expensive development cycles, which further limits widespread adoption of RC. Current design approaches focus on faster coding paths to device-level implementations (e.g., high-level synthesis), which address only one symptom of the greater productivity challenge for FPGA systems.

In contrast to development practices based on iterative implementation, strategic design-space exploration (DSE) is needed to improve productivity with scalable systems. Parallel applications for multi-FPGA platforms should be planned and performance issues analyzed prior to implementation, narrowing the range of possible algorithm and systems mappings based on the performance requirements. For the second phase of research, the RC Amenability Test for Scalable Systems (RATSS) extends RAT from Chapter 3 to multi-FPGA systems by incorporating key concepts from traditional analytical modeling (e.g., BSP [9] and LogP [10]). This new model produces a comprehensive performance


prediction for the system by agglomerating and analyzing the results of the underlying models based on the system hierarchy. RATSS prediction is fast and reasonably accurate, supporting the overall strategic DSE goal.

RAT, LogP, and many other analytical models seek to express the behavior of the computation and communication within a target component, device, subsystem, etc. The "node" and "network" abstractions are extended by RATSS for simple representation of computation and communication, respectively. RATSS systematically characterizes an algorithm mapped to an FPGA platform using comparable node- and network-level analysis, which combine to form a complete model for the system. The RATSS model is tractable for scalable systems because the scope and constraints of the underlying models are adapted for strategic, multi-FPGA performance characterization. This chapter discusses RATSS modeling for data-parallel algorithms structured as SIMD-style pipelines targeting modern high-performance FPGA systems [2]. While the authors believe that RATSS has usage beyond this scope, these algorithm and platform assumptions allow for concise and consistent identification of requirements and capabilities, analysis of individual computation and communication components, and agglomeration of the node and network models for complete performance prediction.

4.2 Background and Related Research

The analytical modeling research discussed in Chapters 2 and 3 is also related to application development productivity for scalable multi-FPGA systems. The performance prediction techniques in [18] characterize algorithms for bottleneck detection on multi-FPGA systems. Dynamo [19] defines performance analysis for image processing applications mapped at runtime to systems typically containing one or more HW accelerators. Smith and Peterson [1] propose an analytical model for synchronous, iterative applications on clusters of shared heterogeneous workstations each containing one or more reconfigurable computing devices. While these models address some of the challenges for productive application development for scalable systems, they do not directly address the need


for analytical, system-level modeling prior to any detailed and potentially costly implementation of FPGA kernels or applications. RAT from Chapter 3 defined an analytical model for performance estimation of a specific algorithm on a specific platform prior to implementation, albeit for single-FPGA designs. The RAT methodology must be paired with system-level modeling concepts to provide a complete model for scalable, multi-FPGA systems.

Existing research with microprocessor-based algorithms and parallel platforms can help bridge the gap between analytical modeling for FPGA devices and the design challenges of FPGA-based scalable systems. The Parallel Random Access Machine (PRAM) [8] is one of the first widely studied models that attempts to reduce complex system behavior into a few key attributes, but the model neglects issues such as synchronization and communication, which can greatly affect the accuracy of the performance estimations for larger systems. The Bulk Synchronous Parallel (BSP) Model [9] extends the modeling concepts of PRAM by defining an application in terms of a series of global supersteps each consisting of local computation, communication, and synchronization. The LogP [10] model attempts to define networks of computing nodes (i.e., microprocessors and local memory) by latency, $L$; overhead, $o$; gap between (short) messages, $g$; and number of processing units, $P$. The LogGP model [11] extends the LogP concept with support for a long-message gap, $G$. Other extensions to LogP and LogGP include support for contention, LoPC [42], and parameterization of the $L$, $o$, $g$, $G$, and $P$ attributes (PLogP) to support dynamically changing values in wide-area networks [12]. Additionally, benchmarks have been created to assist in the measurement of these attributes [43]. Leveraging these models allows RATSS to describe system-level communication for multi-FPGA platforms.

Prior work has leveraged system-level modeling concepts beyond homogeneous microprocessors. Heterogeneous LogGP (HLogGP) [44, 45] considers extensions for multiple processor speeds and communication networks within a cluster. In [46],


system-level modeling concepts form the basis for a proposed model for heterogeneous clusters. Collective communication modeling and scheduling for node-heterogeneous networks of workstations (NOWs) [47, 48] and clusters of clusters with hierarchical networks [49] are further extensions to traditional system-level modeling. These modifications for heterogeneous computing provide useful insight towards the proposed RATSS model.

4.3 RATSS Model

This section provides a detailed discussion of the structure and contributions of the RATSS model for fast and reasonably accurate performance prediction. This prediction (and consequently design-space exploration) begins with the designer's specifications of the FPGA platform and algorithm pairing for RATSS analysis. The FPGA platform specification defines the performance capabilities of each component in the system, specifically the computation and communication metrics such as latency, bandwidth, and clock frequency. The algorithm specification defines the computation requirements of every specific task and the resulting communication between devices, which depends on the algorithm/platform mapping. Quantitative attributes are provided for every unique computation and communication task in the FPGA system and these values feed the component-level analytical equations. The RATSS model agglomerates the individual computation and communication predictions based on the system-level schedule defined by the application specification and subsequently provides a quantitative performance estimate for the platform/algorithm pairing. This prediction is used by the designer for further design-space exploration, revising (and re-analyzing) as necessary until the application meets their performance requirements.

Again, RATSS adapts existing computation and communication models to provide a complete performance prediction for an FPGA application (i.e., a specific algorithm mapped to a specific FPGA platform). RATSS performance prediction is based on efficient quantitative characterization of the key attributes of this algorithm/platform


pairing. These attributes are explicitly defined by RATSS from adaptations of the underlying models for use in the computation, communication, and ultimately the full application-level models. Section 4.3.1 discusses how the scope of the platform, algorithm, and their mapping reveal the computation and communication attributes of an application. Section 4.3.2 discusses the usage of the attributes to form equations that model the performance of the target application.

4.3.1 RATSS Scope

Analytical models such as LogP are effective because a few key characteristics can describe the performance of a diverse set of application structures. Similarly, FPGA platform and application descriptions, specified by the designer within the model's intended scope, can exploit inherent commonalities for efficient performance prediction. (Conversely, platforms and applications outside the intended scope will not be precisely characterizable by the model attributes.) The following sections discuss how the scope of platform and application features affects the amenability of RATSS performance characterization and ultimately the accuracy of the RATSS performance model.

Defining the range of the FPGA platforms applicable to RATSS is necessary for clear and consistent characterization of the performance attributes that form the comprehensive prediction. Even with a reduced scope for FPGA platforms, a key challenge for RATSS is concise modeling of the architectural diversities. Efficient characterization of FPGA platforms provides the necessary insight into the computation and communication capabilities for executing a specified application. As an extension to traditional HPC modeling, RATSS abstracts FPGA platform architectures as compute "nodes" connected by the communication "networks."

Node: One or more devices, tightly coupled, for computation (e.g., microprocessors or FPGAs).

Network: Communication medium connecting two or more nodes.


Figure 4-1. Adapted from El-Ghazawi et al. [2], Modern High-performance FPGA Systems Comprise Two Classes: Uniform Node Nonuniform Systems (UNNSs) and Nonuniform Node Uniform Systems (NNUSs)

Figure 4-1, adapted from [2], illustrates the two major classes of modern high-performance FPGA systems. RATSS supports comprehensive performance prediction for these two classes of systems through node and network models addressing the computation and communication abstractions, respectively. In contrast to traditional homogeneous HPC systems, the presence of FPGAs as application accelerators, orchestrated by microprocessors, can create heterogeneity not only among adjacent devices but also at the system level. For RATSS, nodes are not defined as fixed arrangements of FPGAs and microprocessors, but as an abstraction for one or more devices that can be accurately modeled as a single computational unit (e.g., two FPGAs with a dedicated interconnect). Conversely, collections of devices are separate nodes if their interconnect requires an explicit communication model.

More simply, RATSS agglomerates computation devices into a single node model if the interaction between the devices does not require a full network model. This simplification is potentially application-dependent, but the general assumption is that directly connected FPGAs can be modeled as a single compute node. Similarly, multi-core microprocessors or other locally connected SMPs are modeled as single compute nodes. RATSS focuses on explicit network models for system-wide multi-microprocessor interconnects (e.g., Ethernet) and microprocessor/FPGA interconnects (e.g., PCI-based FPGA accelerator cards). From Figure 4-1, the UNNS architecture requires one


system-wide network model and two node models: the microprocessor and FPGA. However, the NNUS architecture will involve two networks: a local interconnect between the FPGA and microprocessor and a system-wide network between the microprocessors. Defining nodes based on their adjoining networks creates a consistent abstraction of computation and communication for both prevailing system classes. This distinction becomes increasingly important as the hierarchy of the FPGA platform increases in depth.

Ultimately, the collection of node and network models provides small, separable descriptions of the complete computation capabilities and communication performance of the FPGA platform. For each piece of computation in the application, the clock frequency attribute for the FPGA defines the overall rate of execution. For each network communication, quantitative attributes include the delay through the interconnect medium (i.e., latency) and bandwidth for message transmission.

Strategic performance prediction requires application characteristics amenable to quantification. An application encompasses an algorithm and its mapping to an FPGA platform.

Algorithm: Finite number of hardware-independent tasks with explicitly defined parallelism and ordering used to solve a problem.

Mapping: Algorithm's computation tasks assigned to nodes and data movement between nodes assigned to one or more communication networks.

A complete description of an algorithm and its mapping must be provided by the designer for effective performance modeling. The composition and parallelization of algorithm tasks defines the computational load for each node and the required communication for each network to support the application. Algorithm and mapping features must be scoped to ensure quantitative characterization of computation and communication interaction that is tractable for analytical modeling. Specifically,


the computation and communication of the FPGA application should conform to the synchronous, iterative model.

Synchronous: Computation occurs simultaneously on all nodes and is preceded and followed by communication, which collectively define a "stage" of the application.

Iterative: Individual or multiple stages collectively may be repeated as part of the execution.

Figure 4-2. Adapted from Smith and Peterson [1], the Synchronous, Iterative Model Is Expanded for Multi-stage Applications. Synchronous Communication Can Occur over One or More Networks. Iterative Behavior Can Occur within a Stage, over One Stage, or over Many Stages.

Figure 4-2, extended from Smith and Peterson [1], summarizes analytical modeling for synchronous, iterative applications. These characteristics reduce the complexity of computation and communication overlap and allow straightforward agglomeration of node and network performance by RATSS. Regardless of the platform architecture, problem requirements, and the locality of data, the synchronous behavior of computation and communication reduces the total application performance to either the collective summation of the stage times or the single slowest component (assuming steady state). Iterative behavior can occur over one stage of execution or the full application. Implicitly, the analytical model for synchronous, iterative behavior also requires deterministic behavior.
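The reduction described above, in which total time is either the sum of the component times or the single slowest component, repeated over some number of iterations, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `overlapped` flag, and the example times are hypothetical rather than part of the RATSS definition.

```python
def total_time(component_times, iterations=1, overlapped=False):
    """Combine component times for a synchronous, iterative schedule.

    Serialized components contribute their sum; fully overlapped components
    in steady state are bounded by the slowest component.
    """
    per_iteration = max(component_times) if overlapped else sum(component_times)
    return iterations * per_iteration


# Hypothetical times (ms) for input communication, computation, and output.
times_ms = [2.0, 5.0, 3.0]

serialized_ms = total_time(times_ms)                               # sum
overlapped_ms = total_time(times_ms, overlapped=True)              # slowest
repeated_ms = total_time(times_ms, iterations=4, overlapped=True)  # 4 iterations

print(serialized_ms, overlapped_ms, repeated_ms)
```

The same helper applies at either level of the hierarchy: the components can be the computation and communication of one stage, or the stages of a full application.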


Deterministic: Algorithm tasks and data movement between tasks are predictable prior to implementation, either as a constant or an average performance of typical data sets.

Precise characterization of application task scheduling is insufficient for design-space exploration if the underlying computation and communication times cannot be precisely quantified. Randomness in computation and communication behavior requires quantification of application characteristics as averages of expected behavior. The RATSS assumption for deterministic behavior is reasonable as many applications targeting the FPGA paradigm are SIMD-style algorithms implemented as pipelines.

Ultimately, synchronous, iterative, and deterministic behavior allows efficient characterization of computation needs and communication requirements of the FPGA application. Pipelined, SIMD-style algorithms involve data transformations and both the communication and computation are characterized by the quantity of data associated with the particular platform/algorithm mapping. The computational demands of the application are quantified by the number of operations per input data element and the rate of execution (i.e., amount of deep and wide parallelism). Similarly, the attributes for the communication requirements define the amount of data for each network transaction in terms of bytes.

Again, RATSS involves quantifying the key attributes of the FPGA platform and application for use in the underlying analytical models for performance prediction. These quantitative characteristics are provided largely by the designer. Platform-intrinsic attributes such as network latency and throughput are gathered from microbenchmarks that specifically mirror algorithm operations, such as a DMA read and write. Ideally, a database of microbenchmark results is referenced by the designer for the platform attributes, else the benchmarks must be performed prior to any performance prediction. Note that accurate microbenchmarking can be a nontrivial process, albeit with nonrecurring


cost due to potential reuse for analysis of future applications with similar platform mapping. Application-specific attributes such as the quantity of data and amount of computational parallelism are explicitly specified by the user based on the algorithm and platform mapping. These attributes feed the equations described in Section 4.3.2 which compute the performance estimate. Based on the model results, the designer may further refine the application or proceed to low-level implementation and analysis.

The accompanying case studies presented in this chapter primarily emphasize performance prediction for the final configuration of an application prior to implementation. However, the authors expect that strategic DSE will explore multiple options for algorithm structure and platform mapping, which would involve several repetitions of RATSS analysis with comparison of the predicted performance values against the designer's expectations. The key performance criterion explored in this chapter is execution time, but issues of application scalability, resource utilization (e.g., load balancing), power-delay product, etc. can also be inferred from the RATSS analysis. Analyses of these issues are not limited to physically realizable systems and can project capabilities of future system configurations.

4.3.2 Model Attributes and Equations

This section discusses the attributes, equations, and general approach of the node and network models along with their arrangement into stage- and application-level models for RATSS hierarchical performance prediction. The attributes and equations for the node and network models leverage existing research from RAT [50] and LogGP [11] to construct computation and communications models. The platform and algorithm scope provide efficient quantification of performance features of the computation and communication, which serves as input to the analytical models. Essentially, both computation and communication represent the time cost of data movement through a component (e.g., FPGA or interconnect). Equation 4-1 defines the general structure for node and network performance as the delay overhead through medium/architecture, $delay$;


quantity of data, $quantity$; and rate of service, $throughput$ (i.e., $gap^{-1}$). The node and network model discussions that follow expound on this general performance equation.

$$ performance = delay + \frac{quantity}{throughput} = delay + quantity \times gap \tag{4-1} $$

The stage- and application-level model discussions that follow cover the hierarchical agglomeration of the individual node and network components to model the performance of the individual algorithm stages and subsequently the total application. The synchronous, iterative behavior described in the node and network models defines the computation and communication scheduling for each algorithm stage and the overlap of these stages defines the total application performance. The RATSS model uses this hierarchical approach to agglomerate the individual performance estimates for the components into a single, quantitative prediction for the application.

The goal of the node model is to estimate the performance of each computational task based on the user-provided platform and application attributes. Depending on the application requirements, each of the devices performing a given algorithm task may have different computational demands, each requiring a separate node-level analysis. Similarly, the application will likely have different computational loads for each task (i.e., stage) of the algorithm. Again, the node model is used to describe each unique portion of computation and the individual performance estimates are combined in the RATSS stage-level model.

As summarized in Equation 4-2, each node at each stage of execution can have a unique set of attribute values, $S_{fpga}$, which includes the pipeline latency, $PL_{fpga}$; number of data elements, $N_{elements}^{fpga}$; number of computational operations per element, $N_{ops/element}$; FPGA clock frequency, $F_{clock}$; and computation throughput, $R_{fpga}$, for the


specific algorithm task. Adapted from RAT, the computation time, Equation 4-3, is analogous to Equation 4-1 where the pipeline latency is the $delay$ term, the number of data elements and number of operations per element are the $quantity$, and the computation throughput and clock frequency define the effective $throughput$.

$$ S_{fpga} = \{ PL_{fpga}, N_{elements}^{fpga}, N_{ops/element}, F_{clock}, R_{fpga} \} \tag{4-2} $$

$S_{fpga}$: set of attribute values for a specific computation unit
$PL_{fpga}$: pipeline latency of the computation (cycles)
$N_{elements}^{fpga}$: number of computation elements (elements)
$N_{ops/element}$: number of operations per element (ops/element)
$F_{clock}$: FPGA clock frequency (MHz)
$R_{fpga}$: computation throughput (ops/cycle)

$$ t_{fpga}(S_{fpga}) = \frac{PL_{fpga}}{F_{clock}} + \frac{N_{elements}^{fpga} \times N_{ops/element}}{F_{clock} \times R_{fpga}} \tag{4-3} $$

$t_{fpga}$: execution time for the FPGA compute node (s)

Microprocessor nodes can also impact FPGA application performance with computation coinciding with FPGA execution (from Figure 4-2). The execution for a microprocessor, $t_{\mu P}$, is defined by the software time which must be measured from legacy code or estimated using a traditional model. Note that this microprocessor performance attribute is only intended for application-related computation occurring in parallel with FPGA execution. FPGA setup, configuration, and other software-involved overheads are considered in the stage-level model.

The goal of the network model is to estimate the performance of a communication transaction based on the provided platform and application attributes. Analogous to the node model, each unique communication transaction will require a separate


network-level analysis, which then combine in the RATSS stage-level model. Based on LogP and its derivatives (e.g., LogGP and the RAT communication model, to an extent), the network model consists of parameters for the latency (i.e., physical interconnect delay), $L$; overhead, $o$; message gap, $g$; number of "processors", $P$; and message size, $k$. These attributes are analogous to the general terms from Equation 4-1 in that $L$ and $o$ determine $delay$; $P$ and $k$ determine data $quantity$; and $g$ is the $gap$. The key distinction between LogP, LogGP, PLogP, and the RAT communication models is the $gap$ parameter, which is defined by the expected behavior of the particular network. Approximations of the short-message gap, $g$, and long-message gap, $G$, of LogGP are often sufficient for microprocessor networks such as Ethernet. However, PLogP and RAT define the gap as a function of the message size, $g(k)$.

Determining message size is a key issue for accurate performance prediction. The general assumption is that each node, $P$, will contribute $k$ bytes of data for the network transaction. However, the message gap, $g(k)$, is highly dependent not only on the volume of data per node but also any subdivision of that data into multiple smaller transfers. Typically, a large message will have less performance overhead than several smaller messages. Ensuring the gap attribute accurately reflects the performance of the actual message segmentation size will reduce modeling errors.
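The effect of message segmentation on a size-dependent gap can be illustrated with the general form of Equation 4-1, where each transfer costs $delay + k \times g(k)$. The sketch below is only an illustration: the gap table, the fixed per-transfer delay, and the function names are hypothetical values standing in for microbenchmark results, and the total size is assumed to divide evenly into segments.

```python
import bisect

# Hypothetical microbenchmark table: (message size in bytes, gap in s/byte).
# Larger messages are assumed to achieve a smaller gap (higher throughput).
GAP_TABLE = [(1024, 20e-9), (65536, 5e-9), (1048576, 2e-9)]


def gap(k):
    """Gap of the smallest benchmarked size >= k (last entry if k is larger)."""
    sizes = [size for size, _ in GAP_TABLE]
    index = min(bisect.bisect_left(sizes, k), len(GAP_TABLE) - 1)
    return GAP_TABLE[index][1]


def transfer_time(total_bytes, segment_bytes, delay=10e-6):
    """Time to send total_bytes as equal segments, each costing delay + k*g(k)."""
    n_segments = total_bytes // segment_bytes  # assumes an even split
    return n_segments * (delay + segment_bytes * gap(segment_bytes))


one_large = transfer_time(1048576, 1048576)   # a single 1 MiB message
many_small = transfer_time(1048576, 65536)    # sixteen 64 KiB messages

print(f"one large: {one_large:.3e} s, sixteen small: {many_small:.3e} s")
```

With these assumed numbers the segmented transfer is slower, so using the gap measured at 1 MiB to model a transfer that actually moves 64 KiB segments would underestimate the communication time.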
Although specific communication transactions (e.g., MPICH2 implementation of an MPI scatter over Ethernet) have detailed, potentially application-specific performance, the network model introduces general equations for the two types of network communication used in the case studies. Equation 4-4 illustrates the individual set of attributes, $S_{transaction}$, for multi-node network transactions such as InfiniBand or Ethernet, which includes the $L$, $o$, $g$, $G$, $P$, and $k$ attributes along with a $r$ cost value for any additional required computations (e.g., reduce operations). Equation 4-5 defines the performance of the communication transaction, $t_{transaction}$, by the delay as a function, $f_{delay}$, of the latency and overhead attributes; the volume of data as a function, $f_{quantity}$, of the message size and


number of nodes; the short or long-message gap; and the additional computation, if any, as a function, $f_{cost}$, of the amount of data and $r$ cost value.

$$ S_{transaction} = \{ L, o, g, G, P, k, r \} \tag{4-4} $$

$S_{transaction}$: set of attribute values for the specific transaction
$L, o$: LogGP latency and overhead attributes, respectively
$g, G$: LogGP short and long-message gap, respectively
$P, k$: LogGP number of nodes and message size, respectively
$r$: additional computation for operations such as reduce

$$ t_{transaction}(S_{transaction}) = f_{delay}(L, o) + f_{quantity}(P, k) \times [g \text{ or } G] + f_{cost}(P, k, r) \tag{4-5} $$

$t_{transaction}$: total time for the network transaction
$f_{delay}()$: function defining delay w.r.t. $L$ and $o$
$f_{quantity}()$: function defining total data quantity w.r.t. $P$ and $k$
$f_{cost}()$: function defining additional computation cost (e.g., reduce)

In contrast to the multi-node network model, I/O interconnects between microprocessors and FPGA accelerator cards often exhibit a highly variable gap over a range of message sizes. However, the different gap values for a range of data sizes and transfer types (i.e., DMA to BRAM or read from registers) can be collected prior to application analysis using microbenchmarks and reused on future applications with similar I/O communication. These attribute values are either collected into a table for reference or used to construct an explicit $g(k)$ function. From Equation 4-6, the original RAT model separated individual gap values into the theoretical interconnect throughput, $R_{IO}$, and the efficiency of the interconnect for the message size, $E_{IO}$. Similarly, the message size was decomposed into


the number of data elements, $N_{elements}^{IO}$, and number of bytes per element, $N_{bytes/element}$ (Equation 4-7). Expressing data in terms of elements allowed more direct correlation between the volume of computation and the amount of communication. However, the RAT I/O model is adjusted to coincide with the LogGP formulation for consistency within the network model.

$$ g(k) = (R_{IO} \times E_{IO})^{-1} \tag{4-6} $$

$R_{IO}$: theoretical throughput rate of I/O channel (from RAT)
$E_{IO}$: efficiency of I/O channel (from RAT)

$$ k = N_{elements}^{IO} \times N_{bytes/element} \tag{4-7} $$

$N_{elements}^{IO}$: number of I/O elements (from RAT)
$N_{bytes/element}$: number of bytes per element (from RAT)

Equation 4-8 defines the set of attributes, $S_{IO}$, for the revised I/O transaction model, which consists of the latency and overhead delays, $L$ and $o$; size-dependent message gap, $g(k)$; number of nodes, $P$; message size, $k$; and the additional computation cost, $r$, for the communication, if any. Though the I/O model represents a point-to-point interconnect, the $P$ value remains to represent unidirectional (i.e., $P = 1$) or bidirectional (i.e., $P = 2$) behavior. Equation 4-9 defines the communication for the I/O transaction, $t_{IO}$, by the delay function, $f_{delay}$, for latency and overhead; number of nodes; message size; gap; and additional computation as a function, $f_{cost}$, of the message size and $r$ cost value.
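The I/O transaction model described above (Equations 4-6, 4-7, and 4-9) can be sketched in a few lines. The source leaves $f_{delay}$ and $f_{cost}$ as general functions, so this sketch assumes $f_{delay}(L, o) = L + o$ and treats the additional cost as a precomputed constant; both choices, along with the function names and the example numbers, are assumptions for illustration only.

```python
def io_gap(r_io_bytes_per_s, e_io):
    """Equation 4-6: gap (s/byte) from ideal throughput and channel efficiency."""
    return 1.0 / (r_io_bytes_per_s * e_io)


def message_size(n_elements, bytes_per_element):
    """Equation 4-7: message size in bytes."""
    return n_elements * bytes_per_element


def io_time(latency, overhead, p, k, gap_s_per_byte, extra_cost=0.0):
    """Equation 4-9, assuming f_delay = L + o and a constant f_cost."""
    return (latency + overhead) + p * k * gap_s_per_byte + extra_cost


# Hypothetical channel: 1000 MB/s ideal throughput at 50% efficiency,
# moving 1024 elements of 4 bytes each, with delay terms set to zero.
k = message_size(1024, 4)
g = io_gap(1000e6, 0.5)

t_unidirectional = io_time(0.0, 0.0, 1, k, g)  # P = 1
t_bidirectional = io_time(0.0, 0.0, 2, k, g)   # P = 2

print(f"k = {k} bytes, t(P=1) = {t_unidirectional:.3e} s, "
      f"t(P=2) = {t_bidirectional:.3e} s")
```

When the gap is tabulated from microbenchmarks instead of derived from $R_{IO}$ and $E_{IO}$, the `io_gap` call can be replaced by a lookup on $k$ without changing `io_time`.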


$$ S_{IO} = \{ L, o, g(k), P, k, r \} \tag{4-8} $$

$S_{IO}$: set of attribute values for the I/O transaction
$L, o$: latency and overhead attributes, respectively
$g(k)$: gap as a function of message size, $k$
$k$: message size
$r$: additional computation cost

$$ t_{IO}(S_{IO}) = f_{delay}(L, o) + P \times k \times g(k) + f_{cost}(k, r) \tag{4-9} $$

$t_{IO}$: total time for the I/O transaction
$f_{delay}()$: function defining delay w.r.t. $L$ and $o$
$f_{cost}()$: function defining additional computation cost

As discussed previously, the RATSS stage model represents the collective scheduling of the one or more repetitions of computation and communication for an algorithm task. The set of FPGA node times, $S_{fpga}$, contains the individual FPGA execution times, $t_{fpga}$, for each of the $n$ nodes involved in the application stage (Equation 4-10). Often, all available nodes are performing the specific task, but some stages may require less computation time and use only $n$ nodes. Thus, the total computation time, $t_{comp}$, for a stage of an application is the setup and configuration overhead, $t_{overhead}$, plus the maximum (longest) individual execution times from the FPGA nodes and microprocessor (Equation 4-11).


$$S_{fpga} = \{t_{fpga_1}, \ldots, t_{fpga_n}\} \tag{4-10}$$

$S_{fpga}$: set of FPGA execution times for the stage
$n$: total number of FPGA nodes

$$t_{comp} = t_{overhead} + \max(\max(S_{fpga}), t_{\mu P}) \tag{4-11}$$

$t_{comp}$: total computation time for the stage
$t_{overhead}$: configuration, setup, and other overheads for the stage

Similarly, the set of communication times, $S_{comm}$, contains the performance estimates for each of the $m$ network transaction times, $t_{transaction}$ (Equation 4-12). Typically, this set will contain one or more input and output transactions for each level of network communication in the platform, though some applications will instead accumulate partial results within the FPGAs over multiple stages with cumulative output after the last computation iteration. From Equation 4-13, the communication time for the stage, $t_{comm}$, is composed of the sum of the $m$ transaction times, $t_{transaction}$. Multiple levels of communication within an application stage are assumed nonoverlapping due to blocking. However, non-blocking transactions can be modeled by the total network delay and longest (i.e., maximum) throughput time.

$$S_{comm} = \{t_{transaction_1}, \ldots, t_{transaction_m}\} \tag{4-12}$$

$S_{comm}$: set of transaction times for the stage
$m$: total number of communication transactions

$$t_{comm}(S_{comm}) = \sum S_{comm} \tag{4-13}$$

$t_{comm}$: total communication time for the stage
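Equations 4-11 and 4-13 reduce to a maximum and a sum, which the following Python sketch makes explicit. The function names and the sample times are hypothetical. The non-blocking helper is only one possible reading of the remark above (sum of the fixed delays plus the single longest throughput term) and should be treated as an assumption.

```python
def stage_computation_time(fpga_times, t_overhead=0.0, t_micro=0.0):
    """Eq. 4-11: overhead plus the longest of the FPGA and microprocessor times."""
    return t_overhead + max(max(fpga_times), t_micro)


def stage_communication_time(transaction_times):
    """Eq. 4-13: blocking transactions are serialized, so their times add."""
    return sum(transaction_times)


def nonblocking_communication_time(delays, throughput_times):
    """ASSUMED reading of the non-blocking case: total network delay
    plus the longest (maximum) throughput time."""
    return sum(delays) + max(throughput_times)


# Hypothetical stage with three FPGA nodes and three transactions (seconds).
fpga_times = [2.0, 2.5, 1.5]
t_comp = stage_computation_time(fpga_times, t_overhead=0.1, t_micro=1.0)
t_comm = stage_communication_time([0.4, 0.4, 1.0])
print(f"t_comp = {t_comp:.2f} s, t_comm = {t_comm:.2f} s")
```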


Depending on the FPGA platform architecture and mapping, computation and communication within a stage are either serialized or overlapping, with the total execution time of the stage, $t_{stage}$, defined as a number of iterations, $N_{stage\ iterations}$, of either the sum or maximum of the $t_{comp}$ and $t_{comm}$ terms (Equation 4-14). Note that performance estimates should be modeled for each unique stage of the application execution with attention to any special cases, such as initial and final stages possessing more or less communication.

$$t_{stage} = N_{stage\ iterations} \cdot \begin{cases} t_{comp} + t_{comm} & \text{(serialized)} \\ \max(t_{comp}, t_{comm}) & \text{(overlapping)} \end{cases} \tag{4-14}$$

$t_{stage}$: total execution time of the application stage
$N_{stage\ iterations}$: number of stage-level iterations

The RATSS application-level model describes the scheduling of the individual stages of task execution to estimate the full system performance. Applications consist of one or more distinct stages of execution, which may be collectively repeated for one or more iterations. Analogous to the computation and communication scheduling, application stages can either be serialized or overlapping. Figure 4-3 provides an example timing diagram for potential iterative behavior at the application level. The example consists of three stages collectively repeated twice, reinforcing the multi-level iterative behavior of the stage and application models as first described in Figure 4-2. This sample application does not illustrate all potential platform and algorithm features such as multi-level networks, but instead reinforces the ability of RATSS to organize execution paths into hierarchical models.

Equation 4-15 defines the set, $S_{stage}$, of $s$ stage times, $t_{stage}$, for the application. The total execution time for the application, $t_{application}$, is the number of iterations,


$N_{app\ iterations}$, of either the sum or maximum (longest) of the stage times. Again, this model generalizes high-level iterative behavior. Applications can contain repetitive but irregular behavior (e.g., 3 iterations of stage one, 7 iterations of stage two, 5 iterations of stage three, etc.), which is simple to calculate but not explicitly considered by the model.

Figure 4-3. Example Timing Diagram Illustrating Stage- and Application-level Scheduling of Computation and Communication.

$$S_{stage} = \{t_{stage_1}, \ldots, t_{stage_s}\} \tag{4-15}$$

$S_{stage}$: set of stage times for the application
$s$: total number of stages for the application

$$t_{application} = N_{app\ iterations} \cdot \begin{cases} \sum S_{stage} & \text{(serialized)} \\ \max(S_{stage}) & \text{(overlapping)} \end{cases} \tag{4-16}$$

$t_{application}$: total execution time of the application
$N_{app\ iterations}$: number of application-level iterations

4.4 Detailed Walkthrough: 2-D PDF Estimation

This section presents a detailed walkthrough of RATSS performance prediction with a reasonably complex case study, 2-D PDF estimation. The intended algorithm and platform structure along with the feature characterization and performance calculations for the node, network, stage, and application models are discussed. The results of the


performance estimation are compared against a subsequent hardware implementation to evaluate the accuracy of the RATSS model.

4.4.1 Algorithm and Platform Structure

The 2-D PDF estimation algorithm for this case study uses the Parzen window technique [51], a generalized nonparametric approach to estimating PDFs in a $d$-dimensional space. Despite the increased computational complexity versus traditional histograms, the Parzen window technique is mathematically advantageous because of the rapid convergence to a continuous function. This algorithm is amenable to FPGA acceleration because of the high degree of computational parallelism and large computation effort relative to the amount of data consumed (i.e., input) and produced (i.e., output). The computational complexity of a $d$-dimensional PDF algorithm is $O(Nn^d)$ where $N$ is the total number of samples of the random variable, $n$ is the number of levels where the PDF is estimated, and $d$ is the number of dimensions. This 2-D PDF estimation algorithm accumulates the statistical likelihood of every sample occurring within every probability level. Each sample/level combination is independent, thereby making the algorithm embarrassingly parallel. The data input consists of $O(N)$ samples whereas the output is the resulting $O(n^2)$ probability levels.

A general overview of the algorithm structure for this case study is presented in Figure 4-4. A total of 67,108,864 (i.e., 64M) data samples, originating on one microprocessor, are scattered equally among the $P$ microprocessors. The number of samples is large in order to fully stress the communication and memory capabilities of the target FPGA platform. The microprocessors transfer the data to their respective FPGA node in chunks of 8,192 data samples, limited by the available on-chip block RAM. A total of 80 pipelined kernels per node perform the necessary computations (comparison, scaling, and accumulation) to analyze each data sample against the 256 × 256 probability levels. The number of parallel kernels maximizes use of the available 96 hardware multipliers on the target FPGA with some leeway. The numerical precision for the computation is 18-bit


fixed point. The results are accumulated in 256 × 256 registers and periodically read back by the host microprocessor. The resulting 256 × 256 partial sums on each of the $P$ nodes are collected with a reduce operation. More discussion on this 2-D PDF estimation architecture along with general issues related to FPGA implementation can be found in [52].

Figure 4-4. Application Structure for 2-D PDF Estimation Case Study

The intended platform for this case study is illustrated in Figure 4-5. The full platform consists of eight 3.2 GHz Xeon microprocessor nodes each connected to one Xilinx XC4VLX100 (Nallatech H101 card) via a PCI-X bus. The processing nodes are organized as a cluster of traditional computers each augmented with application acceleration hardware (i.e., FPGAs). The on-chip Block RAMs (BRAMs) are explicitly illustrated since they are used to store the input and output data for the 2-D PDF case study. The microprocessor nodes are connected via Gigabit Ethernet. Network-level communication uses the MPICH2 implementation of the Message Passing Interface (MPI). The case study is modeled and the implementation is tested using 2, 4, and 8 FPGA nodes.

4.4.2 Compute Node Modeling

The node-level model consists of estimating the computation for the 2, 4, and 8 Nallatech-augmented compute nodes. The values in Table 4-1 consist of the computation attributes, which are distilled from the structure of the 2-D PDF estimation algorithm as mapped to the architecture of the FPGA node. For computation, accurate parameterization


of the pipeline latency, $PL_{comp}$, requires detailed knowledge of the final algorithm structure. The pipeline for the 2-D PDF estimation has a straightforward computational structure of three operations (subtraction, multiplication, and addition from Figure 4-4) requiring 11 total cycles. The relatively deep pipeline helps ensure a higher clock frequency. The computational throughput, $R_{comp}$, of 240 operations per cycle comes from the 80 pipelined kernels, each with 3 simultaneous operations per pipeline. Predictions are generated for a large range of possible frequencies. The prediction for the clock frequency, $F_{clock}$, of 195 MHz is shown since it ultimately matched the maximum frequency for later implementation. The number of computation elements, $N_{comp\ elements}$, is 33,554,432 (64M / 2); 16,777,216 (64M / 4); and 8,388,608 (64M / 8) for the two, four and eight-node configurations, respectively, due to the balanced data decomposition. The number of operations per element, $N_{ops/element}$, is based on the 256 × 256 comparisons per data element times 3 operations per element for a total of 196,608 operations.

Figure 4-5. Platform Structure for 2-D PDF Estimation Case Study

Table 4-1. Node Attributes for 2-D PDF Estimation

Attribute          Units            2 Nodes      4 Nodes      8 Nodes
PL_comp            (cycles)         11           11           11
R_comp             (ops/cycle)      240          240          240
F_clock            (MHz)            195          195          195
N_comp elements    (elements)       33,554,432   16,777,216   8,388,608
N_ops/element      (ops/element)    196,608      196,608      196,608
t_fpga             (s)              1.41E+2      7.05E+1      3.52E+1
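The $t_{fpga}$ row of Table 4-1 can be reproduced from the other attributes with a short Python sketch. The throughput formula used here is an assumption: it divides the total operation count by the operations per cycle, adds the pipeline latency once as fill cycles, and converts cycles to seconds with the clock frequency. The function name is hypothetical; the formula is offered only because it is consistent with the tabulated values.

```python
def fpga_time(n_elements, ops_per_element, r_comp, f_clock_hz, pl_comp=0):
    """Estimated FPGA execution time in seconds.

    ASSUMED form: cycles = N_elements * N_ops/element / R_comp + PL_comp,
    with the pipeline latency counted once.
    """
    cycles = n_elements * ops_per_element / r_comp + pl_comp
    return cycles / f_clock_hz


TOTAL_SAMPLES = 64 * 2**20            # 64M data samples
OPS_PER_ELEMENT = 256 * 256 * 3       # 196,608 operations per sample
R_COMP = 80 * 3                       # 80 kernels, 3 operations each per cycle
F_CLOCK = 195e6                       # 195 MHz
PL_COMP = 11                          # pipeline depth in cycles

predictions = {
    nodes: fpga_time(TOTAL_SAMPLES // nodes, OPS_PER_ELEMENT,
                     R_COMP, F_CLOCK, PL_COMP)
    for nodes in (2, 4, 8)
}
for nodes, seconds in predictions.items():
    print(f"{nodes} nodes: t_fpga = {seconds:.3e} s")
```

With billions of cycles per node, the 11-cycle pipeline latency has no visible effect on the result, which is why its exact treatment matters little for this case study.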


The last attribute in Table 4-1, $t_{fpga}$, summarizes the computation time for the 2, 4, and 8 node cases. Each node for a particular system size (i.e., number of nodes, $P$) will have an identical execution time because of the equal data decomposition. Due to the increasing number of node resources, FPGA computation time, $t_{fpga}$, decreases approximately linearly. Two FPGAs require twice the time of four FPGAs, which need twice the time of eight FPGAs. This behavior is consistent with the embarrassingly parallel nature of the 2-D PDF estimation algorithm.

4.4.3 Network Modeling

For the FPGA platform used in this case study, two communication network models are necessary: PCI-X I/O bus and Ethernet. The PCI-X bus model describes the point-to-point interconnect between a host microprocessor and its Nallatech FPGA node. The Ethernet model describes the MPICH2 communication over the Gigabit Ethernet network. Assembling the attribute values for these models involves not only analysis of the algorithm structure and mapping but also microbenchmarking of the underlying platform behavior for typical communication transactions.

The I/O operations for 2-D PDF estimation involve transfers between the host CPU and the onboard FPGA block RAM. Microbenchmarks were performed on common transfer sizes (i.e., powers of two from 4 B to 64 MB). Figure 4-6 summarizes the results of these transfers, which can be referenced for all future I/O performance estimations. Smaller transfers, Figure 4-6A, have erratic but steadily increasing efficiency whereas larger transfers, Figure 4-6B, could be approximated by a single value. For the transfer sizes used in this case study (writing 8,192 elements and reading 65,536 elements, discussed later), the I/O efficiencies are 0.31 and 0.10, respectively.

Table 4-2 summarizes the delay and throughput attributes, gathered from microbenchmarks of the PCI-X I/O bus, along with the quantity of data transmitted for the 2-D PDF estimation case study. The microbenchmarks measure the total time of a data transfer,


which defines the effective throughput for a given transfer size. The I/O latency and overhead, $L_{IO} + o_{IO}$, is assumed to be the total transfer time for a very small transfer (i.e., 4 B of data), which is dominated by the channel delay of the PCI-X bus. For writes and reads with the FPGA block RAM, the measured performance is 1.60E-5 s and 3.20E-5 s, respectively. The gap, $g(k)$, for the write and read I/O transactions is the multiplicative inverse of the 1,064 MB/s (i.e., 133 MHz, 64-bit PCI-X) theoretical throughput, $R_{IO}$, times the I/O efficiency, $E_{IO}$. These particular $g(k)$ values are determined by the message size, $k$, which is defined by the number of I/O elements, $N_{IO\ elements}$, and number of bytes per element, $N_{bytes/element}$. For the write I/O, the 64M input data elements for each of the X and Y dimensions are divided among the 2, 4, and 8 nodes for the $N_{IO\ elements}$ term. Again, these write transfers are divided into blocks of 8,192 elements, meaning 64M / 8,192 / $P$ distinct transfers. The output (i.e., read I/O) involves collecting the 65,536 (256 × 256) elements storing the partial PDF estimates for each of the 64M / 8,192 / $P$ iterations. Though the computation is 18-bit fixed point, the data format for the I/O transfers is 32-bit integer and consequently the number of bytes per element, $N_{bytes/element}$, is 4.

Figure 4-6. Results of Efficiency Microbenchmarks for Nallatech BRAM I/O. A) Range of transfer sizes. B) Large transfer sizes.

Equation 4-17 defines the performance for PCI-X write and read transactions, $t_{transaction\ write,read}$, by the I/O latency and overhead, $L_{IO} + o_{IO}$, gap value for the message


size, $g(k)$, and message size, $k$. The I/O results of the two writes and one read for this case study are summarized in the second block of Table 4-2.

$$t_{transaction\ write,read} = L_{IO} + o_{IO} + g(k) \cdot k \tag{4-17}$$

Table 4-2. PCI-X Network Attributes for 2-D PDF Estimation

Attribute                 Direction  Units       2 Nodes          4 Nodes          8 Nodes
L_IO + o_IO               write      (s)         1.60E-5          1.60E-5          1.60E-5
                          read                   3.20E-5          3.20E-5          3.20E-5
g(k)_IO                   write      (MB/s)^-1   3.03E-3          3.03E-3          3.03E-3
  = (R_IO × E_IO)^-1                             (1064 × 0.31)^-1 (1064 × 0.31)^-1 (1064 × 0.31)^-1
                          read                   9.40E-3          9.40E-3          9.40E-3
                                                 (1064 × 0.10)^-1 (1064 × 0.10)^-1 (1064 × 0.10)^-1
k                         write X    (bytes)     128M (32M × 4)   64M (16M × 4)    32M (8M × 4)
  = N_IO elements         write Y                128M (32M × 4)   64M (16M × 4)    32M (8M × 4)
    × N_bytes/element     read                   1024M (256M × 4) 512M (128M × 4)  256M (64M × 4)

t_comm                    write X    (s)         4.07E-1          2.03E-1          1.02E-1
                          write Y                4.07E-1          2.03E-1          1.02E-1
                          read                   1.01E+1          5.05E+0          2.52E+0

The attributes in the first block of Table 4-3 were gathered using the LogGP benchmarking tool described in [43]. This tool computes the parameterized LogP functions, which are converted to the fixed latency, $L$, overhead, $o$, short-message gap, $g$, and long-message gap, $G$, values. Much of the tool's methodology is outside the scope of the paper, but essentially a progressively increasing sequence of messages is used to calculate the gap. The latency and overhead are deduced from the delays between these messages. The computational overhead, $r$, consists of the addition operations for the reduce transaction used in the 2-D PDF estimation and is benchmarked on the Xeon


microprocessor. Using these parameters, precise analytical models can be constructed for the binomial-tree scatter and reduce used in this case study.

Table 4-3. Ethernet Network Attributes for 2-D PDF Estimation

Attribute                  Units      2 Nodes    4 Nodes    8 Nodes
L                          (s)        1.08E-4    1.08E-4    1.08E-4
o                          (s)        6.75E-6    6.75E-6    6.75E-6
g                          (s)        1.64E-5    1.64E-5    1.64E-5
G                          (s/Byte)   9.56E-9    9.56E-9    9.56E-9
r                          (s/Byte)   1.90E-8    1.90E-8    1.90E-8

P                          (nodes)    2          4          8
k              scatter X   (Bytes)    128M       64M        32M
               scatter Y              128M       64M        32M
               reduce                 256K       256K       256K

t_transaction  scatter X   (s)        1.28E+0    1.92E+0    2.25E+0
               scatter Y              1.28E+0    1.92E+0    2.25E+0
               reduce                 3.89E-3    7.78E-3    1.17E-2

The analytical model for the binomial-tree scatter used in the MPICH2 implementation of MPI is defined in Equation 4-18. As opposed to a naive scatter, which requires $P - 1$ messages to complete a $P$-node scatter, a binomial-tree scatter only requires $\log_2(P)$ messages as each node that has received data subsequently scatters to other remaining nodes. However, these messages for the binomial-tree scatter begin as half the total data to be scattered and decrease by half with each transmission because nodes must be supplied with not only their data but also the data they must pass on. In actuality, the binomial-tree scatter uses more network bandwidth (i.e., many transmissions in parallel) and the throughput component is the same as a naive scatter. The advantage of the binomial-tree scatter is the reduction in latency.

$$t_{transaction\ scatter} = \log_2(P) \cdot L + 2o + G \cdot (P - 1) \cdot k \tag{4-18}$$

For the shorter message sizes necessary for data collection in the 2-D PDF estimation algorithm, the MPICH2 implementation of the MPI_Reduce function uses a binomial tree


similar to the scatter. However, unlike scatter (or gather), the amount of data during each transmission does not increase because the data is reduced at every node. Consequently, the reduce has $\log_2(P)$ latency, $L$, and transmission time, $Gk$, as defined in Equation 4-19. Additionally, each transmission requires an addition operation, $r$, for each of the $k$ data values in the message. Note that Equations 4-18 and 4-19 assume $P$ is a power of 2.

$$t_{transaction\ reduce} = \log_2(P) \cdot (L + 2o + Gk + rk) \tag{4-19}$$

The second block of Table 4-3 lists the two application-dependent attributes defined by the user based on the 2-D PDF estimation case study. Again, system configurations of 2, 4, and 8 nodes, $P$, are used for this case study. The 2-D PDF application requires two distinct transactions: distribution of the input data for the X and Y dimensions (i.e., MPI_Scatter) and reduction of the partial PDFs (i.e., MPI_Reduce). The 2, 4, and 8 FPGA platform configurations will involve message sizes, $k$, of 128 MB, 64 MB, and 32 MB of data, respectively, for the scatter. For the reduce, every node will contribute the 256 KB (256 × 256 × 4 B) partial results (regardless of the number of nodes) that are ultimately accumulated on the head node.

The third block of Table 4-3 summarizes the results of the network model. The individual times for the scatter and reduce transactions, $t_{transaction}$, are listed. These times increase logarithmically for the 2, 4, and 8 node platforms due to the increasing number of messages (i.e., $\log_2(P)$) required for the transaction.

4.4.4 Stage/Application Modeling

The stage and application models are synonymous because this case study consists of a single stage of execution. Equation 4-20 summarizes the set, $S_{comp}$, of the execution times, $t_{fpga}$, for the 2, 4 and 8 (i.e., $P$) FPGA nodes used in this case study. From Equation 4-21, the computation time, $t_{comp}$, is determined by the maximum (longest)


node time. Because of the equal load balancing among the nodes, the total computation time is equivalent to the performance of any $i$-th node in the system.

$$S_{comp} = \{t_{fpga_1}, \ldots, t_{fpga_P}\} \tag{4-20}$$

$$t_{comp}(S_{comp}) = \max(S_{comp}) = t_{fpga_i} \tag{4-21}$$

The set of communication times, $S_{comm}$, summarizes network transactions involved in the single stage of the case study. The scatter and write times, $t_{scatter}$ and $t_{write}$, represent the X and Y dimensions of the input data and the read and reduce times, $t_{read}$ and $t_{reduce}$, define the data collected after computation. From Equation 4-23, the total communication performance, $t_{comm}$, is the summation of the individual, non-overlapping network transactions.

$$S_{comm} = \{t_{scatter_X}, t_{scatter_Y}, t_{write_X}, t_{write_Y}, t_{read}, t_{reduce}\} \tag{4-22}$$

$$t_{comm}(S_{comm}) = t_{scatter_X} + t_{scatter_Y} + t_{write_X} + t_{write_Y} + t_{read} + t_{reduce} \tag{4-23}$$

The inputs to the RATSS system-level model are summarized in the first block of Table 4-4. The only user-provided attribute is the number of stage-level iterations, $N_{iterations}$. The 2-D PDF application only requires one iteration of node and network interaction (i.e., inter-node data distribution, node-level computation, inter-node data collection). The individual node and network transaction times comprise the majority of the input to the system model. The $S_{comp}$ and $S_{comm}$ attribute sets contain the compute node and network transaction times from the respective models. The second block of Table 4-4 summarizes the estimated performance of the total computation and communication times, $t_{comp}$ and $t_{comm}$, from Equations 4-21 and 4-23.

The 2-D PDF estimation does not use an elaborate buffering scheme so the total performance of the application, $t_{application}$, as defined by the stage time, $t_{stage}$, is the


number of iterations, $N_{stage\ iterations}$, multiplied by the sum of the communication and FPGA computation times for the node, $t_{comm}$ and $t_{comp}$ (Equation 4-24). Single buffering maximizes the available memory bandwidth to the computation units. The computation time, as shown in Table 4-4, dominates the total application execution time and would not greatly benefit from double buffering.

$$t_{application} = t_{stage} = N_{stage\ iterations} \cdot (t_{comm} + t_{comp}) \tag{4-24}$$

Table 4-4. System Model Attributes for 2-D PDF Estimation

Attribute                 Units          2 Nodes    4 Nodes    8 Nodes
N_stage iterations        (iterations)   1          1          1
S_comp                    (s)            1.41E+2    7.05E+1    3.52E+1
S_comm       scatter X    (s)            1.28E+0    1.92E+0    2.25E+0
             scatter Y                   1.28E+0    1.92E+0    2.25E+0
             write X                     4.07E-1    2.03E-1    1.02E-1
             write Y                     4.07E-1    2.03E-1    1.02E-1
             read                        1.01E+1    5.05E+0    2.52E+0
             reduce                      3.89E-3    7.78E-3    1.17E-2

t_comp                    (s)            1.41E+2    7.05E+1    3.52E+1
t_comm                    (s)            1.35E+1    9.31E+0    7.25E+0

t_stage                   (s)            1.54E+2    7.98E+1    4.24E+1

The model output is summarized in the third block of Table 4-4. As the number of nodes doubles, the node time reduces by half but the network time increases slightly. Consequently, the total execution time, $t_{total}$, decreases by slightly less than half as the number of nodes doubles. This trend of nearly linear performance improvement with increasing platform size is reasonable given the embarrassingly parallel nature of the computation and relatively low impact of the PCI-X and Ethernet communication.

4.4.5 Results and Verification

As previously discussed, the performance prediction was calculated prior to low-level design and was not adjusted based on implementation details. Predictions were generated


for a range of clock frequency values, though only the results of the 195 MHz estimation are shown. Major revisions to the target algorithm or platform architecture during implementation can significantly alter the application performance, affecting the validity of the prediction. Thus, RATSS can be used iteratively throughout the design process, recomputing predictions whenever significant revisions are considered or become necessary to ensure the subsequent implementation will still meet performance requirements and thereby prevent further reductions in productivity. However, such modifications to the application structure were not necessary for the 2-D PDF estimation case study.

Table 4-5. Modeling Error for 2-D PDF Estimation (Nallatech, XC4VLX100, 195 MHz)

                     Predicted (s)   Experimental (s)   Error
2 FPGAs   t_comp     1.41E+2         1.56E+2            -9.6%
          t_comm     1.35E+1         1.51E+1            -10.7%
          t_total    1.54E+2         1.71E+2            -9.7%
          Speedup    146             132                10.6%
4 FPGAs   t_comp     7.05E+1         7.84E+1            -10.1%
          t_comm     9.31E+0         9.93E+0            -6.2%
          t_total    7.98E+1         8.84E+1            -9.7%
          Speedup    283             255                11.0%
8 FPGAs   t_comp     3.52E+1         3.95E+1            -10.9%
          t_comm     7.25E+0         7.70E+0            -5.9%
          t_total    4.24E+1         4.72E+1            -10.1%
          Speedup    532             478                11.3%

In Table 4-5, the results of the performance prediction for the 2-D PDF estimation case study are compared against a subsequent implementation of the target algorithm. The node and network models underestimated the actual implementation times and subsequently overestimated the total application speedup over the software baseline. The node times represented the majority of the execution time (over 90% of the physical implementation), thereby having the greatest impact on prediction accuracy. The prediction errors for the 2, 4, and 8 FPGA configurations are under 11%, which is considered reasonably accurate given the focus on high-level design-space exploration prior to implementation. Most of the discrepancy is due to additional cycles of overhead related to data movement during the FPGA computation. In contrast to the node


modeling error, which remained relatively constant for 2, 4, and 8 nodes, the network modeling error had significantly more variability. The error for the 2 node platform was just under 12%. Though not ideal, the error had minimal impact on the overall prediction accuracy and was due to extra overhead associated with a 2-node scatter and reduce versus the single message predicted by the model. The network error was 6.2% and 5.9% for the 4 and 8 node platforms, respectively. This network error, lower than the respective node error, reduced the overall execution error. Consequently, the predicted speedup over a sequential software baseline executed on one 3.2 GHz Xeon microprocessor in the platform was approximately 90% accurate (i.e., under 12% error) for all three system sizes. The speedup values are substantial due to the large problem size and high computation-to-communication ratio.

A key lesson learned from this case study relates to modeling effort versus accuracy. The node characterization and performance estimation required significantly less effort as compared to microbenchmarking and modeling the two communication networks. However, precise node modeling is important as the case study consists of over 90% computation, and more effort related to modeling unforeseen computational overheads could benefit future applications targeting this platform. The need for greater accuracy in communication modeling is situational, as illustrated in the next two case studies, image filtering and MD, which have significant and trivial communication, respectively.

4.5 Additional Case Studies

This section discusses two additional case studies for further validation of the RATSS model: image filtering and molecular dynamics. Both case studies use an SRC-6 system, which incorporates up to four MAP-B nodes each consisting of two Xilinx XC2V6000 FPGAs (Figure 4-7). All four MAP-B nodes are connected to a dual-Xeon microprocessor workstation via a single, proprietary interconnect called SNAP. In contrast to the Nallatech-augmented cluster from the 2-D PDF estimation case study, initiation of communication is performed locally by each node as a DMA transfer and arbitrated


by the interconnect controller. Network transactions are still described using collective communication terminology but the physical operations are independently initiated point-to-point messages. Additionally, case study implementations are written in Carte C.

Figure 4-7. Platform Structure for Image Filtering Case Study

Common to both case studies, Table 4-6 defines the network attributes for the SNAP interconnect, which are measured from microbenchmarks. Similar to the Nallatech system, latency, $L$, is the transmission time of a single-word transfer, which is dominated by the network delay. However, overhead, $o$, the time between successive messages, is a function of the number of nodes, $P$, and message size, $k$, not a constant parameter, due to the decentralized communication requests. More extensive microbenchmarking can determine approximate overhead values although variability in message ordering inhibits detailed analysis. For these case studies, the effect of overhead is considered negligible and not part of the network model. The short-message gap, $g$, is measured as the transmission time for short messages minus latency. No short messages are used in these case studies but the value is included for completeness. The long-message gap, $G$, is one 8-byte word every 10 ns clock cycle based on the fixed 100 MHz clock.


Table 4-6. SNAP Network Attributes for SRC-6 System

Attribute   Units      Value
L           (s)        1.01E-5
o           (s)        o(P, k)
g           (s)        6.40E-7
G           (s/Byte)   1.25E-9

The equations for the SNAP communication transactions (broadcast, scatter, and gather) are defined in Equations 4-25 and 4-26. In general, SRC-6 transactions will be a series of serialized messages, albeit of different sizes. However, the independently initiated DMA transfers are designed to allow node computation to proceed before all communication has completed. For an accurate yet simple model of this partial overlap, the output transaction (i.e., gather or another collection-type operation) can be represented by just the final message of the communication, which cannot be hidden because computation has completed, instead of the total $P$ messages.

$$t_{transaction\ broadcast,scatter} = L + G \cdot P \cdot k \tag{4-25}$$

$$t_{transaction\ gather} = \begin{cases} L + G \cdot k, & \text{overlapping communication} \\ L + G \cdot P \cdot k, & \text{non-overlapping communication} \end{cases} \tag{4-26}$$

For brevity, the performance model for the image filtering and MD case studies (and many other SRC-6 applications) is summarized in Equations 4-27 through 4-29. Again, application execution is defined by FPGA computation with synchronizing DMA communication to a central microprocessor node. Distributed data movement incorporating global common memories can help overlap communication and computation but is not considered for the following case studies. The longest node time defines the computation time (Equation 4-27) and the communication consists of broadcast or scatter, and gather (Equation 4-28). The total application performance is the summation of the computation and communication times (Equation 4-29).
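The SNAP transaction times in Equations 4-25 and 4-26 can be evaluated with a small Python sketch. The function names are hypothetical. The message sizes are the broadcast and gather sizes given for the image filtering case study later in this section, and the choice of the overlapping gather variant is an assumption made because it lands near the communication time predicted for that case study.

```python
def snap_long_message_gap(bytes_per_word=8, f_clock_hz=100e6):
    """Long-message gap G: one 8-byte word per 100 MHz clock cycle."""
    return 1.0 / (bytes_per_word * f_clock_hz)


def snap_broadcast_or_scatter(latency, gap, p, k):
    """Eq. 4-25: P serialized messages of k bytes each."""
    return latency + gap * p * k


def snap_gather(latency, gap, p, k, overlapping=True):
    """Eq. 4-26: only the final message is exposed when communication
    overlaps computation; otherwise all P messages are counted."""
    messages = 1 if overlapping else p
    return latency + gap * messages * k


L = 1.01e-5                     # measured SNAP latency from Table 4-6 (s)
G = snap_long_message_gap()     # 1.25e-9 s/byte

# Image filtering message sizes (bytes) on P = 2 nodes.
t_in = snap_broadcast_or_scatter(L, G, p=2, k=4_193_376)
t_out = snap_gather(L, G, p=2, k=2_795_584, overlapping=True)  # assumed variant
t_comm = t_in + t_out
print(f"G = {G:.3e} s/byte, t_comm = {t_comm:.3e} s")
```

The overhead term $o(P, k)$ is omitted, matching the statement that its effect is treated as negligible for these case studies.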


$$t_{comp} = \max(\{t_{fpga_1}, \ldots, t_{fpga_P}\}) \tag{4-27}$$

$$t_{comm} = t_{broadcast,scatter} + t_{gather} \tag{4-28}$$

$$t_{application} = t_{stage} = N_{stage\ iterations} \cdot (t_{comp} + t_{comm}) \tag{4-29}$$

Although these two additional case studies involve a different FPGA platform from the 2-D PDF application, the sequential software baselines are measured from the same 3.2 GHz Xeon microprocessor for consistency. While speedup is often an advantageous performance metric, the specific speedup value must be compared with the problem size and computation-to-communication ratio for the application. The image filtering and molecular dynamics case studies illustrate communication- and computation-bound problems, respectively, with correspondingly lower and higher speedups. Both scenarios are modeled by RATSS with reasonable accuracy.

4.5.1 Image Filtering

The particular image filter used in this case study is a discrete 2-D convolution of a 3 × 3 image segment (i.e., a pixel and its 8 neighbors) with a user-specified filter. Example usages of this application include Sobel or Canny edge detection and high, low, or band-pass filtering for noise reduction. Figure 4-8 provides an illustration of this algorithm. The same 418 × 418 image is streamed (i.e., written) to the primary FPGA of two nodes of the SRC-6 system. As part of the computation, the primary FPGAs stream the image data to their respective secondary FPGAs. Each FPGA performs the convolution of the image data with respect to different filter values. The resulting images on the secondary FPGAs are streamed back to their respective primary FPGA, which DMAs the two new images from the node back to the network-attached microprocessor. A more general overview of convolution for image filtering can be found in [53].

Table 4-7 summarizes the compute node attributes for the RATSS model. Because of the double-precision operations, the overall pipeline will be fairly deep and too complex to


quickly and accurately determine a priori. However, the pipeline latency, $PL_{comp}$, should be negligible with respect to the volume of data. Both FPGAs contain a pipelined filtering kernel that calculates the nine multiplications and eight additions for the convolution for a total computational throughput, $R_{comp}$, of 34 (2 FPGAs × (9 + 8) operations). The clock frequency, $F_{clock}$, for the MAP-B node is fixed at 100 MHz. An image size of 418 × 418 pixels, limited by the size of the MAP-B SRAM, is used for this case study though larger sizes can be simulated by repeatedly looping through the memory. In contrast to the previous case study, each FPGA of each node needs the complete data set (i.e., image) because each kernel convolves a different filter. Consequently, the effective number of data elements, $N_{elements}$, per node is 349,448 (418 pixels × 418 pixels × 2 FPGAs). Again, each element requires nine multiplications with the 3 × 3 filter and eight subsequent summations for a total of 17 operations per element, $N_{ops/element}$.

Figure 4-8. Algorithm Structure for Image Filtering Case Study

Table 4-7. Node Attributes for Image Filtering

Attribute          Units                  Value
PL_comp            (cycles)               0
R_comp             (operations/cycle)     34
F_clock            (MHz)                  100
N_comp elements    (elements)             349,448
N_ops/element      (operations/element)   17

Table 4-8 defines the application-specific network attributes for the SNAP network model. Two nodes, $P$, with four total FPGAs are used for this case study. The pipelined (streaming) computation is structured using shift registers and requires three new


inputs each cycle (i.e., a pixel and its upper and lower neighbors). Consequently, the streaming communication of the 418 × 418 images requires a total of 4,193,376 bytes (418 × 418 × 3 × 8 B) for the broadcast message, $k$. Similarly, the output communication will involve two filtered images (one per FPGA) of 418 × 418 pixels each, for a total gather message size, $k$, of 2,795,584 bytes (418 × 418 × 2 × 8 B).

Table 4-8. Additional Network Attributes for Image Filtering

Attribute         Units     Value
P                 (nodes)   2
k    broadcast    (Bytes)   4,193,376
     gather                 2,795,584

Table 4-9. Modeling Error for Image Filtering (SRC-6, XC2V6000)

                                  Predicted (s)   Experimental (s)   Error
2 Nodes      t_comp               5.24E-3         5.24E-3            -0.04%
(4 FPGAs)    t_comm               1.40E-2         1.41E-2            -1.08%
             t_application        1.92E-2         1.98E-2            -3.13%
             Speedup              1.39            1.35               2.96%

Table 4-9 highlights the sources of error for the image filtering case study. The computation time is comparable to the $O(N)$ write time. The deterministic structure of the computational pipelines allowed for highly accurate modeling of the total execution time for the nodes, $t_{comp}$, which is off by only 204 cycles, close to the 127-cycle pipeline latency reported by the Carte tool during implementation. The network communication time, $t_{comm}$, comprised the majority of the total execution time, $t_{application}$, and the error remained just over 1%. Consequently, the total error remained just over 3%. The accuracy of the RATSS communication model for the SNAP interconnect ensured only a small error (under 3%) in the speedup prediction. The speedup value is relatively modest due to the lower computation-to-communication ratio as compared to the other case studies.

4.5.2 Molecular Dynamics

Molecular Dynamics (MD) is the numerical simulation of the physical interactions of atoms and molecules over a given time interval. Based on Newton's second law of motion, the acceleration (and subsequent velocity and position) of the atoms and molecules are


calculated at each time step based on the particles' masses and the relevant subatomic forces. For this case study, the MD simulation is focused on the interaction of certain inert liquids such as neon or argon. These atoms do not form covalent bonds and consequently the subatomic interaction is limited to the Lennard-Jones potential (i.e., the attraction of distant particles by van der Waals force and the repulsion of close particles based on the Pauli exclusion principle) [39]. Large-scale MD simulators such as AMBER [40] and NAMD [41] use these same classical physics principles but can calculate not only Lennard-Jones potential but also the nonbonded electrostatic energies and the forces of covalent bonds, their angles, and torsions, making them applicable to not only inert atoms but also complex molecules such as proteins. The parallel algorithm used for this case study was adapted from code provided by Oak Ridge National Lab (ORNL).

Figure 4-9 provides an overview of the MD case study. In slight contrast to the image-filtering case study, four MAP-B nodes, one FPGA each, are used for MD. In order to compare two molecules every clock cycle, two copies of the molecular data are sent by the network-attached microprocessor to the primary FPGA of each node. (The secondary FPGA is not used for this case study.) Each copy contains the X, Y, and Z dimensions of the molecular position data, requiring 2 SRAMs per copy for a total of 4 banks per node. Each MD kernel checks the distance of $N/4$ molecules against the other $N - 1$ molecules, where $N/4$ is the number of molecules for each of the four nodes. If the molecules are sufficiently close, the MD kernel calculates the molecular forces (and subsequent acceleration) imparted on each other. The acceleration effects are accumulated in the last two SRAM banks and transferred back to the network-attached microprocessor.
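The per-node work described above (a distance check of each local molecule against every other molecule, followed by a force calculation for sufficiently close pairs) can be illustrated in software. The following Python sketch is not the ORNL-derived kernel or its Carte C implementation: the function name, the cutoff test, the reduced units (eps = sigma = mass = 1), and the standard 12-6 Lennard-Jones force expression are all assumptions chosen for illustration.

```python
def lj_accelerations(positions, start, stop, cutoff,
                     eps=1.0, sigma=1.0, mass=1.0):
    """Accumulate Lennard-Jones accelerations for molecules start..stop-1.

    Each local molecule is compared against all other molecules, which
    mirrors the N/4-by-(N-1) comparison pattern of one node. Pairs farther
    apart than the (hypothetical) cutoff contribute nothing. Coincident
    molecules are not handled.
    """
    acc = [[0.0, 0.0, 0.0] for _ in range(stop - start)]
    cutoff_sq = cutoff * cutoff
    for i in range(start, stop):
        xi, yi, zi = positions[i]
        for j, (xj, yj, zj) in enumerate(positions):
            if j == i:
                continue
            dx, dy, dz = xi - xj, yi - yj, zi - zj
            r_sq = dx * dx + dy * dy + dz * dz
            if r_sq > cutoff_sq:
                continue                      # too far apart to interact
            sr2 = sigma * sigma / r_sq
            sr6 = sr2 ** 3
            # Standard 12-6 force divided by r^2 so it scales the
            # displacement vector directly; positive means repulsion.
            scale = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / (r_sq * mass)
            a = acc[i - start]
            a[0] += scale * dx
            a[1] += scale * dy
            a[2] += scale * dz
    return acc


# Two molecules one sigma apart repel each other; a distant third is ignored.
demo = lj_accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (50.0, 0.0, 0.0)],
                        start=0, stop=3, cutoff=2.5)
print(demo)
```

Splitting the outer loop range across nodes (the `start` and `stop` arguments) is what gives each node roughly one quarter of the pairwise comparisons in a four-node configuration.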
The node-level attributes for the MD case study are defined in Table 4-10. The pipeline latency, $PL_{comp}$, is considered negligible for this case study due to the $O(N^2)$ computational complexity. One pipeline per node allows for one molecular iteration (i.e., operation) per cycle, $N_{ops/cycle}$. Again, the clock frequency, $F_{clock}$, for the MAP-B nodes is fixed at 100 MHz. For this case study, the number of data elements (i.e., molecules) per


node, $N_{elements}$, is 8,192 (32,768/4). Each molecule's interaction is computed against every other molecule for a total of 32,767 operations, $N_{ops/element}$.

Figure 4-9. Algorithm Structure for Molecular Dynamics Case Study

Table 4-10. Node Attributes for Molecular Dynamics

Attribute               Units                  Value
$PL_{comp}$             (cycles)               0
$R_{comp}$              (operations/cycle)     1
$F_{clock}$             (MHz)                  100
$N^{comp}_{elements}$   (elements)             8,192
$N_{ops/element}$       (operations/element)   32,767

Table 4-11 defines the application-specific attributes for the SNAP network model. A total of four nodes, $P$, are used for the case study. The scatter message size is twice the gather message size because two copies of the input data are required to compute a molecular interaction in a single cycle (i.e., two memory accesses per cycle). Also, the 4-byte, single-precision x, y, and z dimensions of the molecule data are packed into two 8-byte words. Thus, the scatter and gather message sizes, $k$, are 1,048,576 B ($32{,}768 \times 4 \times 8$ B) and 524,288 B ($32{,}768 \times 2 \times 8$ B), respectively. Because of the single time step, only one system-level iteration, $N^{system}_{iterations}$, is required for this case study.

Table 4-12 compares the results of the RATSS model with the subsequent implementation of MD. Over 99% of the execution time is dominated by the FPGA computation, which is highly deterministic. The model error for the node-level time, $t_{nodes}$, is -0.0001%, an


underestimation of 304 cycles, roughly the pipeline latency of 271 cycles reported by the Carte tool during implementation. However, the total network time, $t_{network}$, is difficult to measure accurately due to the independently initiated communication for the four nodes, which resulted in a 22% error. The overall impact on prediction accuracy is nevertheless negligible, as the total application execution time, $t_{total}$, was overestimated by only 0.03%. Consequently, the speedup versus the sequential software baseline was underestimated by 0.03%. While the speedup value appears low given the high computation-to-communication ratio, the important consideration is accurate RATSS performance prediction for computation-bound applications with nontrivial problem sizes. Larger datasets for the MD algorithm would result in higher speedups.

Table 4-11. Additional Network Attributes for Molecular Dynamics

Attribute        Units     Value
$P$              (nodes)   4
$k_{scatter}$    (Bytes)   1,048,576
$k_{gather}$     (Bytes)   524,288

Table 4-12. Modeling Error for Molecular Dynamics (SRC-6, XC2V6000)

4 Nodes (4 FPGAs)    Predicted (s)   Experimental (s)   Error
$t_{comp}$           2.68E+0         2.68E+0            -0.0001%
$t_{comm}$           5.90E-3         4.84E-3            21.8%
$t_{application}$    2.69E+0         2.69E+0            0.03%
Speedup              5.12            5.12               -0.03%

4.6 Conclusions

Based on current parallel computing trends, scalable systems with FPGAs are increasingly desirable for their performance and power benefits. However, the associated cost of application development has inhibited greater adoption of multi-FPGA systems. Insufficient attention has been given to strategic planning for applications, particularly as applications scale. Simply providing faster implementation paths for FPGA devices only addresses one symptom of the productivity challenge. Effective DSE involves not only rapid design but also efficient performance evaluation. Analytical models of multi-FPGA systems can estimate the performance of application designs by distilling and evaluating


key performance characteristics from the designer's specification. Such models prevent wasted implementation effort by identifying unrealizable designs and reducing the revisions necessary to achieve performance requirements.

RATSS provides an efficient and reasonably accurate analytical model for evaluating a scalable FPGA application prior to implementation based on the RAT methodology from Chapter 4. RATSS boosts designer productivity by extending concepts from component-level models to allow efficient abstraction and estimation of the computation and communication features of FPGA applications. The RATSS model contributes a hierarchical approach for agglomerating component descriptions into a full performance estimate. RATSS performance prediction remains tractable by focusing on synchronous, iterative computation models for the two major classes of modern high-performance FPGA platforms.

The 2-D PDF, image filter, and MD case studies illustrate performance modeling for a range of problem sizes and ratios of computation-to-communication. These case studies demonstrated nearly 90% prediction accuracy, which is considered sufficient given the focus of RATSS on strategic application planning. The accuracy of both the computation and communication models allows not only individual performance estimates but also accurate predictions across a range of potential application configurations including wide variations in problem sizes and computation-to-communication ratios. Specifically, important performance tradeoffs such as increasing parallelism or decreasing the communication rate can be efficiently evaluated with reasonable accuracy. These case studies serve as motivation for broad design-space exploration with RATSS, as predictions are efficiently generated and reasonably accurate, which helps ensure the eventual implementation is the most desirable design configuration.
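As a concrete check of how few attributes these estimates need, the following Python sketch recomputes the MD computation time and the message sizes quoted in Section 4.5. It assumes the computation time is simply the operation count divided by the execution rate, and that a single non-overlapped iteration sums computation and communication; both assumptions reproduce the rounded values in Table 4-12 but are not claimed to be the complete RATSS formulation. The function names are hypothetical.

```python
def computation_time(n_elements, ops_per_element, f_clock_hz,
                     ops_per_cycle, pipeline_latency_cycles=0):
    """Cycles needed for all operations, converted to seconds."""
    cycles = n_elements * ops_per_element / ops_per_cycle
    return (cycles + pipeline_latency_cycles) / f_clock_hz


def message_bytes(n_items, words_per_item, bytes_per_word=8):
    """Message size for n_items, each packed into 8-byte words."""
    return n_items * words_per_item * bytes_per_word


# MD node attributes from Table 4-10: 8,192 elements, 32,767 ops/element,
# 100 MHz clock, 1 operation per cycle, negligible pipeline latency.
t_comp_md = computation_time(8192, 32767, 100e6, 1)

# Assumption: one iteration, no overlap, so the times simply add.
# 5.90E-3 s is the predicted communication time from Table 4-12.
t_app_md = t_comp_md + 5.90e-3

# Message sizes from Sections 4.5.1 and 4.5.2.
scatter_md = message_bytes(32768, 4)
gather_md = message_bytes(32768, 2)
broadcast_img = message_bytes(418 * 418, 3)
gather_img = message_bytes(418 * 418, 2)

print(f"t_comp = {t_comp_md:.3g} s, t_application = {t_app_md:.3g} s")
print(scatter_md, gather_md, broadcast_img, gather_img)
```

Changing a single argument (for example, doubling `ops_per_cycle` to model a second pipeline) gives the kind of quick tradeoff estimate discussed above.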


CHAPTER 5
INTEGRATED PERFORMANCE PREDICTION WITH RC MODELING LANGUAGE (PHASE 3)

The third research phase outlines the integration of the RAT analytical model with the RC Modeling Language (RCML) for efficient design-space exploration. This chapter presents a brief introduction on the necessities of integrated prediction (Section 5.1); background and related work on FPGA modeling languages and infrastructure (Section 5.2); discussion of the proposed framework for RAT and RCML integration (Section 5.3); two case studies, modified Needleman-Wunsch and task graph for mean-value analysis (Section 5.4); an additional case study for MNW on multi-FPGA systems with RATSS analysis (Section 5.5); and conclusions (Section 5.6).

5.1 Introduction

As illustrated in Figure 5-1, a key productivity challenge for FPGA platforms is the gap in the abstraction pyramid for system design [54] between "Back of the Envelope" estimations and "Abstract Executable Models." As part of this need for more strategic design-space exploration (DSE), RAT and RATSS (hereafter referred to simply as "RAT") provide methodologies for predicting performance of application designs prior to implementation. However, the efficiency of performance estimation with RAT is limited by manual application characterization (i.e., parameterization) and agglomeration of component performance models. Existing modeling environments can provide more precise application specification but remain isolated from prediction methodologies, which inhibits productive DSE of RC systems.

One solution to the RC productivity challenge is an extensible methodology and tools for bridging modeling environments with performance prediction. Modeling environments provide more customizable, human-productive design entry than lists or spreadsheets of quantitative performance information. Performance prediction techniques can provide relatively rapid and reasonably accurate estimations, albeit with potentially manual data input, error checking, and revision. Without an integrated approach, strategic DSE


using isolated abstract specification and analysis tools can be tedious, disconnected from subsequent implementation tasks, and ultimately counterproductive.

Figure 5-1. Abstraction pyramid comparing levels of modeling for hardware applications [54]

This chapter proposes a methodology for a framework allowing integration of modeling environments with RAT. The RAT methodology (and subsequent tool) provides models describing the behavior of the individual computation and communication operations and estimates the total application performance based on their agglomeration. RAT has demonstrated reasonably accurate performance prediction, but its efficiency is limited by currently manual interpretation of application specifications for the necessary inputs to the analysis. The proposed framework defines a "translation" component that distills the required prediction inputs for RAT from the (supported) model of computation (MoC) of the application specification. An abstraction layer insulates the translation functionality from the tool-dependent details of the particular modeling environment. Additionally, the framework defines "orchestration" of strategic DSE, which performs RAT prediction on an initial application design and potential revisions to the underlying algorithm and/or platform architecture. As validation of the productivity benefits of framework-assisted DSE, a tool for translation and orchestration using the proposed methodology (hereafter referred to as the "DSE tool") is constructed. The functionality


of the DSE tool includes the connectivity and usage of existing tools for application specification and RAT prediction by the newly constructed components for translation and orchestration.

Figure 5-2. Performance prediction using RAT

5.2 Background and Related Research

The proposed framework leverages existing research in RAT performance prediction and modeling environments to facilitate application specification and analysis for strategic DSE for RC. RAT uses separate computation and communication models, based on underlying assumptions about the application structure and behavior, which agglomerate into complete predictions of application performance. Several modeling environments provide methods and tools with similar approaches for abstracting algorithm and platform architecture specifications, albeit with differing levels of implementation detail.

5.2.1 RAT Performance Prediction

For this phase of research, a prediction tool is constructed from the RAT methodology that includes an API to provide the necessary prediction input and gather the resulting performance estimation. The general methodology of separate computation and communication modeling for RAT is common in prediction techniques, though research directed towards strategic RC analysis is not as expansive as research on modeling environments. RAT is included within the framework due to its fairly unique focus on strategic prediction prior to implementation. Encapsulation of RAT for replacement with other synchronous iterative performance models is possible, but outside the scope of this dissertation.
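The agglomeration of separate computation and communication models can be sketched in a few lines of Python. The sketch assumes the synchronous iterative form used by RAT, in which each iteration costs the slowest computation plus the slowest communication, with no overlap between the two. Communication uses the plain Hockney form (latency plus size divided by bandwidth), not the extended Hockney or LogGP models of the RAT tool. All function names and numeric values are hypothetical and do not reflect the RAT API.

```python
def hockney_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Plain Hockney model: fixed latency plus size over bandwidth."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def iteration_time(computation_times, communication_times):
    """One synchronous iteration: slowest computation + slowest communication."""
    return max(computation_times) + max(communication_times)


def application_time(iterations):
    """Sum the per-iteration times over all iterations.

    iterations: sequence of (computation_times, communication_times) pairs,
    one pair per iteration, each listing the time of every resource or link.
    """
    return sum(iteration_time(comp, comm) for comp, comm in iterations)


# Hypothetical example: two FPGAs, three identical iterations.
comm = [hockney_time(1_000_000, 1e-6, 1e9),   # link to FPGA 0
        hockney_time(500_000, 1e-6, 1e9)]     # link to FPGA 1
comp = [0.2, 0.5]                             # seconds per FPGA
total = application_time([(comp, comm)] * 3)
print(f"estimated application time: {total:.4f} s")
```

Because the model only needs a handful of numbers per resource, swapping in a different communication model changes one function rather than the overall structure.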


Figure 5-3. Y-chart approach to application specification in modeling environments [54]

The RAT tool maintains the synchronous iterative (i.e., multiphase) performance model, a subset of fork-join models with each hardware resource (e.g., microprocessor or FPGA) performing an independent portion of the application computation each iteration with synchronizing communication separating every iteration from preceding and succeeding iterations [1, 55]. Figure 5-2 outlines the general structure of the RAT tool. RAT assumes application execution time is defined by the summation of the slowest computation and communication each iteration. The underlying computation and communication models for RAT describe the potentially complex and typically data-oriented behaviors within each iteration using a few key quantitative attributes. Computation is defined by the total number of application-specific operations and their rate of execution based on usage of the algorithm's deep or wide parallelism by the hardware resources. Alternatively, RC computation can often be described by the number of data elements to be processed and the rate of completion (i.e., cycles per element). RAT uses an extension of the Hockney model [56] to describe I/O communication and LogGP for system-level communication.

5.2.2 Modeling Environments

Modeling environments provide abstract yet reasonably precise descriptions of application structure and behavior. An application specification consists of models (i.e., descriptions) of the underlying algorithm and RC platform architecture along with their respective mapping. Algorithm and architecture models are often specified separately


using the Y-chart approach [54], as illustrated in Figure 5-3. These application models (particularly the algorithm model) describe the behavior of an application in terms of a model of computation (MoC). A MoC defines a set of allowable "operations" (i.e., basic and often technology-dependent computational events), communication between operations (i.e., data movement), their relative costs (e.g., clock cycles), and the total system behavior based on the operations composing the application. Each modeling environment uses graphical and/or textual elements to denote precise syntactic and semantic meanings for an application specification based on the MoC.

The case studies for this chapter, MNW and MVA graph, are specified by asynchronous message-passing (AMP) and synchronous dataflow (SDF) MoCs, respectively, which represent common models for FPGA systems. AMP denotes the use of one or more queues to describe communication between groups of operations. Only messages within the same queue are strictly ordered, with unspecified timing between different queues. SDF represents a special case of AMP with groups of operations evaluated as soon as the necessary messages are available from the communication channels, which are uni-directional. Data enters the application model at a constant rate, which eventually induces a steady-state evaluation rate for each group of operations with total performance defined by the slowest group. AMP suitably describes the straightforward DMA communication between the microprocessor and FPGAs for MNW. SDF provides mechanisms for describing the pipeline network of the MVA graph.

The proposed DSE tool requires a modeling environment capable of effectively representing abstract algorithm, architecture, and subsequent application mapping models based on AMP and SDF MoCs. Ptolemy [57] is an environment specifically for simulating and prototyping systems involving heterogeneous MoCs, including AMP and SDF. Metropolis [58] defines "metamodels" that use formal execution semantics to define the application function, platform architecture, and mapping of the system based on a new or existing MoC. Artemis [59] and Sesame [60] use a hierarchical Kahn Process Network


(KPN), a specialized AMP MoC, to describe the system concurrency and individual component behavior. The RC Modeling Language (RCML) [61] provides hierarchical models for the algorithm, architecture, and total application mapping with specialized constructs to express deep parallelism (e.g., pipelines), typically for AMP and SDF MoCs. With suitable abstraction, any of these models could provide effective application specification for RAT performance prediction. The proposed DSE tool in Section 5.4 uses RCML because of its RC-oriented focus.

Figure 5-4. Framework bridging modeling environments and performance prediction, and orchestrating DSE

5.3 Integrated Framework

This section describes the methodology of the framework for connecting RAT performance prediction with modeling environments for increased productivity during strategic DSE. (Section 5.4 discusses the DSE tool bridging RAT with the RCML modeling environment.) Figure 5-4 provides an overview of the framework structure. The proposed methodology provides translation of specification information from the modeling environment to RAT and orchestration of DSE based on revisions to the specification information. The modeling environment and RAT performance prediction components are the existing methods and tools from Section 5.2, as indicated by the


shaded boxes. The dashed border of the modeling environment and tool-abstraction layer indicates the interchangeability of the specification tool. Section 5.3.1 describes the procedure for translation of algorithm-based MoCs into the quantitative performance attributes and application scheduling necessary to direct the synchronous, iterative performance model of RAT. Section 5.3.2 describes the mechanism for orchestration of DSE, specifically the directed revision of an initial application specification to examine and compare the performance potential of design alternatives.

5.3.1 Translation

Although individual modeling environments and performance prediction techniques sometimes include methods for direct connectivity to other tools, an explicit intermediary between specification and analysis is advantageous. The proposed framework provides translation between the algorithm MoCs and the RAT performance prediction, facilitating the transfer of the required quantitative attributes and scheduling information to the corresponding computation and communication model. Potential issues during translation include differences in the data structures (e.g., format, representation, or precision), abstraction levels, and semantic meaning, along with other dilemmas such as missing, redundant, or inconsistent data. Resolving these issues can require acute awareness of the low-level details of the data formats, syntax, and semantics of the tools, with extra functionality to identify and request additional information from the user as necessary. The need for unique bridges between every desired modeling tool and RAT is greatly reduced by an abstraction layer, which allows the framework to perform the majority of the translation based on a generic format for algorithm MoCs derived from the specific modeling environment tool.

As illustrated in Figure 5-5, the algorithm and architecture attributes for the basic operations of the MoC of the application specification must be reorganized and formatted based on their contribution to the RAT computation and/or communication estimation. The framework constructs RAT computation models for every hardware resource based on


the groups of operations mapped to that resource and RAT communication models based on data movement between hardware resources. A generic schedule that describes the parallelism and overlap between the computation and communication is constructed from the semantics of the MoC. Conversion between algorithmic MoCs and RAT performance models is possible because the important structure and behavior of the application specification, properly formatted, correspond directly to available computation and communication models. The basic computation operations within an MoC are often generic abstractions that require additional quantitative attributes from the application designer, specifically formatted for the assumed technology (e.g., FPGAs), for translation to the RAT model.

Figure 5-5. Translation of application specification information for RAT prediction

For the DSE tool used in Section 5.4 (and by extension, any future tool connecting modeling environments and RAT), the underlying translation step must be tuned for the MoCs of interest, specifically AMP and SDF for the case studies. For FPGA systems, these MoCs can be abstractly represented as a number of "tasks" (i.e., generic encapsulations of groups of operations with detailed specification left to the application designer) with the data movement through algorithm "connections." Tasks often represent either pipelines or state machines, which imply structured execution at a deterministic


(or statistically observed) rate. Classification of these basic operations is straightforward because tasks perform only computation and connections facilitate only communication. The quantitative attributes for the computation tasks include the amount of data to be processed and the cost of processing each data element, which are contained within the particular task specification. Similarly, quantitative attributes for algorithm connections define the amount of data and segmentation for transfers between tasks, which are contained within the connection specification. Computation and communication models for tasks and connections, respectively, are provided with corresponding architectural information (e.g., FPGA clock frequency or interconnect bandwidth) by the framework translation based on the application mapping. For scheduling, the key difference between the two MoCs is the specificity of the overlap of task execution. For AMP, task execution is dependent only upon the order of communication messages from its predecessor tasks (i.e., those prior tasks which provide data to the current task). In contrast, SDF models assume simultaneous, fine-grain operation of all tasks and connections, typically as a pipeline operating on individual data elements within one or more streams of data. In practice, AMP is sufficient for serializing communication between microprocessors and FPGA application accelerators (e.g., MNW), whereas SDF is useful for describing multiple directly connected pipelines (e.g., MVA graph).

5.3.2 Orchestration

Strategic DSE involves evaluating a range of application designs to determine the most desirable configuration. Design alternatives may differ in multiple facets including the algorithm requirements (e.g., problem size) and architectural capabilities (e.g., clock frequency). The framework supports strategic DSE by repetitively revising an application specification and evaluating the resulting performance against other design alternatives. RAT is provided different sets of quantitative performance features, which typically represent several permutations of one or more attributes. (DSE based on major revisions to the application mapping is outside the scope of this dissertation.) Predictions are


revised not once but potentially hundreds or thousands of times depending on the breadth of the design space and complexity of the algorithm.

Strategic DSE begins with identification of performance features for revisions. The application designer may choose to annotate a parameter with one or more alternative values denoting possible changes to the application design. The goal is to propose revisions to specific features and compare the range of performance values against the performance requirements of the designer. For example, several different pipeline rates or clock frequencies may be evaluated. Also, scalability can be analyzed using revisions that define progressively larger problem sizes and hardware resources. Alternatively, different schedules can be evaluated by adjusting the ordering (i.e., priority) of messages to outgoing communication channels. As illustrated by the case studies in Section 5.4, rapid exploration of large design spaces can greatly aid design productivity. However, a designer using the framework must ensure that the design space under investigation is realistic with respect to the architectural constraints (e.g., maximum circuit size or clock frequency).

5.4 Case Studies

This section describes two case studies, MNW and MVA graph, which demonstrate the capabilities of the integrated framework for efficient (i.e., rapid and reasonably accurate) strategic DSE. The experimental setup, including the construction of the DSE tool bridging the RCML modeling environment with RAT performance prediction, is discussed in Section 5.4.1. The MNW case study in Section 5.4.2 is a bioinformatics application with an AMP MoC. This case study demonstrates accurate prediction, as compared to subsequent hardware implementations, and rapid DSE. The MVA graph in Section 5.4.3 contains a more complex network of pipelines with performance defined by the SDF MoC. This case study maintains rapid DSE, even for very large numbers of revisions to a complex algorithm structure.
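The orchestration step of Section 5.3.2 can be pictured as enumerating every permutation of the annotated alternative values, discarding unrealistic revisions, and comparing each prediction against the designer's requirement. The Python sketch below is a minimal illustration under stated assumptions: the parameter names, the stand-in `predict` function, and the 200 MHz constraint are all hypothetical, and the actual DSE tool is a Java-based Eclipse plug-in whose interfaces are not shown in this chapter.

```python
from itertools import product


def enumerate_revisions(base, alternatives):
    """Yield one revised specification per permutation of alternative values."""
    names = sorted(alternatives)
    for values in product(*(alternatives[name] for name in names)):
        revision = dict(base)
        revision.update(zip(names, values))
        yield revision


def explore(base, alternatives, predict, is_realistic, requirement_s):
    """Predict every realistic revision; return results sorted by time."""
    results = []
    for revision in enumerate_revisions(base, alternatives):
        if not is_realistic(revision):      # architectural constraints
            continue
        t = predict(revision)
        results.append((t, revision, t <= requirement_s))
    return sorted(results, key=lambda item: item[0])


# Hypothetical stand-in for a RAT computation model.
def predict(rev):
    cycles = rev["elements"] * rev["ops_per_element"] / rev["ops_per_cycle"]
    return cycles / rev["clock_hz"]


# Hypothetical constraint: the device cannot exceed 200 MHz.
def is_realistic(rev):
    return rev["clock_hz"] <= 200e6


base = {"elements": 1_000_000, "ops_per_element": 1,
        "clock_hz": 100e6, "ops_per_cycle": 1}
alternatives = {"clock_hz": [100e6, 150e6, 400e6], "ops_per_cycle": [1, 2]}

results = explore(base, alternatives, predict, is_realistic, requirement_s=0.006)
for t, rev, meets in results:
    print(f"{rev['clock_hz'] / 1e6:.0f} MHz, {rev['ops_per_cycle']} ops/cycle: "
          f"{t * 1e3:.2f} ms, meets requirement: {meets}")
```

The design space grows as the product of the number of alternatives per parameter, which is why per-revision analysis time matters in the case studies that follow.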


Figure 5-6. Architecture specification of FPGA platform

5.4.1 Experimental Setup

For validation of the proposed methodology, a DSE tool provides functionality for gathering the application specification, performing translation, and orchestrating DSE using performance prediction. This functionality includes interaction with a modeling environment tool, RCML, to collect the necessary information from an application specification and usage of a prediction tool, RAT, for performance analysis. RCML provides an RC-specific abstraction environment with semantic constructs amenable to the RAT prediction. The DSE tool is a hierarchical composite of existing tools for RCML and RAT and newly constructed components providing translation and orchestration. These translation and orchestration components are implemented as a Java-based Eclipse plug-in to help minimize the customized interfacing necessary for connecting to the RCML and RAT tools.

The two application case studies for this chapter are mapped onto a Linux server containing a GiDEL PROCStar-III FPGA card connected by a PCIe x8 bus to a Xeon E5520 (i.e., 2.26 GHz Quad-core Nehalem) microprocessor. The GiDEL FPGA card contains four Altera Stratix-III E260 FPGAs, which have interconnects to adjacent FPGAs and support DMA transfers to and from the microprocessor. Figure 5-6 outlines the general architecture model for the FPGA-augmented platform. This FPGA system can be used as a prototype for an RC-augmented embedded platform or represent a single node in a multi-node RC supercomputer.
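The translation component of the DSE tool can be pictured with a small sketch that pairs task and connection attributes with the architectural attributes of the resources they are mapped to. The Python below is illustrative only: the `Task` and `Connection` classes, their fields, and the `translate` function are hypothetical and do not reflect RCML or the RAT API. Summing the tasks mapped to the same resource is also an assumption (serial execution on that resource).

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    elements: int              # amount of data to process
    cycles_per_element: float  # cost of processing each element
    resource: str              # hardware resource the task is mapped to


@dataclass
class Connection:
    source: str
    destination: str
    num_bytes: int             # amount of data transferred
    interconnect: str          # interconnect the connection is mapped to


def translate(tasks, connections, clock_hz, bandwidth_bytes_per_s):
    """Distill per-resource computation and per-connection communication times.

    clock_hz and bandwidth_bytes_per_s are architecture attributes keyed by
    resource and interconnect name. Missing attributes raise KeyError so the
    caller can request the information from the user.
    """
    computation = {}
    for task in tasks:
        if task.resource not in clock_hz:
            raise KeyError(f"missing clock frequency for {task.resource!r}")
        seconds = task.elements * task.cycles_per_element / clock_hz[task.resource]
        computation[task.resource] = computation.get(task.resource, 0.0) + seconds
    communication = {}
    for conn in connections:
        if conn.interconnect not in bandwidth_bytes_per_s:
            raise KeyError(f"missing bandwidth for {conn.interconnect!r}")
        communication[(conn.source, conn.destination)] = (
            conn.num_bytes / bandwidth_bytes_per_s[conn.interconnect])
    return computation, communication


# Hypothetical specification: one pipeline task fed over one interconnect.
example_tasks = [Task("MNW", 1_000_000, 1.0, "fpga0")]
example_connections = [Connection("Start", "MNW", 8_000_000, "pcie")]
comp, comm = translate(example_tasks, example_connections,
                       {"fpga0": 100e6}, {"pcie": 1e9})
print(comp, comm)
```

Keeping the architecture attributes in separate dictionaries mirrors the Y-chart separation of algorithm and architecture models, so a remapping only changes the `resource` and `interconnect` fields.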


5.4.2 Modified Needleman-Wunsch (MNW)

The MNW case study is an FPGA-optimized application for calculating the normalized edit distance between two DNA sequences within the composite ESPRIT application for metagenomics [62]. The normalized edit distance provides concise quantitative insight about the similarity of two DNA sequences based on the length of the sequences, the number of gaps in the global sequence alignment, and the number of edits required to transform one sequence string into the other. The MNW application pipelines the standard Needleman-Wunsch [63] calculations for individual alignment scores and resulting global alignment with the ESPRIT calculation of the normalized edit distance. The pipeline concurrently computes the alignment scores with the normalized edit distance rather than computing the edit distance from the character representation of the alignment as is done in software. Computing the edit distance in this way eliminates the need to store a score matrix, significantly reducing the memory requirements for the FPGA system. This case study is referred to as modified because the typical outputs of Needleman-Wunsch, the score matrix and global alignment, are unnecessary after the calculation of the normalized edit distance and are never retained. However, as with traditional Needleman-Wunsch, MNW is often useful for comparing many pairs of sequences of similar length as a batch.

Figure 5-7A provides a general overview of the algorithm structure. A database of comparisons is built from the sequences and divided, round robin, among the specified number of FPGAs. A total of $N$ sequences requires a database of $\frac{N^2 - N}{2}$ comparisons since sequences are not compared against themselves and comparisons such as $[2, 1]$ are equivalent to $[1, 2]$. The initial configuration of this case study involves 1500 sequences, each 105 characters in length. The resulting $\frac{N^2 - N}{2}$ normalized edit distances are collected by the microprocessor after computation is complete.

The framework queries the modeling environment for the quantitative performance information necessary for RAT prediction. The communication of the DNA sequence


database and the resulting values for the normalized edit distance are described using the AMP MoC. From the algorithm specification (Figure 5-7B), each of the computation tasks (Start, MNW, and End) and two communication connections requires a separate analytical model. The performance of the software Start and End tasks is defined by an execution-time attribute. The number of characters in the database of sequence comparisons determines the amount of input communication (between Start and MNW) and the amount of computation for MNW. The output communication (between MNW and End) is defined by the number of sequence comparisons. The architecture model (Figure 5-6) contains the parameters outlining the communication capabilities of the PCIe interconnect. The FPGA clock frequency (architecture) and pipeline depth (algorithm) parameters define the computation rate.

Figure 5-7. Overview of MNW case study. A) Calculation of normalized edit distances on multiple FPGAs for MNW. B) MNW algorithm specification and mapping.

Table 5-1. Predicted and experimental results for MNW

           Predicted Time (s)   Experimental Time (s)   Error
1 FPGA     9.44E-1              9.58E-1                 1.5%
2 FPGAs    4.72E-1              4.83E-1                 2.3%
4 FPGAs    2.36E-1              2.46E-1                 4.1%
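The size of the comparison database and the error column of Table 5-1 can be recomputed with a few lines of Python. The helper names are hypothetical, and the error definition (unsigned difference relative to the experimental time) is an assumption that happens to reproduce the rounded percentages in the table.

```python
def num_comparisons(n_sequences):
    """Unique unordered pairs of distinct sequences: (N^2 - N) / 2."""
    return (n_sequences * n_sequences - n_sequences) // 2


def round_robin_counts(n_items, n_fpgas):
    """Number of comparisons each FPGA receives under round-robin division."""
    base, extra = divmod(n_items, n_fpgas)
    return [base + (1 if i < extra else 0) for i in range(n_fpgas)]


def percent_error(predicted, experimental):
    """Unsigned prediction error relative to the measured time (assumption)."""
    return abs(experimental - predicted) / experimental * 100.0


comparisons = num_comparisons(1500)
per_fpga = round_robin_counts(comparisons, 4)

# Predicted and experimental times (s) from Table 5-1, keyed by FPGA count.
table_5_1 = {1: (9.44e-1, 9.58e-1), 2: (4.72e-1, 4.83e-1), 4: (2.36e-1, 2.46e-1)}
errors = {k: round(percent_error(p, e), 1) for k, (p, e) in table_5_1.items()}

print(comparisons, per_fpga, errors)
```

The predicted times halve each time the FPGA count doubles, which is consistent with a computation-bound application whose comparisons divide evenly among the devices.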


Figure 5-8. Predicted execution times of MNW on four FPGAs based on revisions to the number of DNA sequences for comparison

Table 5-1 summarizes the predicted and experimental execution times for MNW using 1500 DNA sequences (i.e., $\frac{1500^2 - 1500}{2}$ comparisons) divided across 1, 2, and 4 FPGAs. The predicted execution times were generated by RAT based on the quantitative performance information provided by the framework. The experimental execution times were measured from subsequent hardware implementations that correspond to the application specification. Based on the 1% to 4% error rate, the integrated framework was able to maintain reasonable accuracy during the abstract application specification, collection of quantitative performance parameters, and resulting performance prediction. Generating the abstract specification took only a few minutes and the subsequent analysis, as directed by the framework, took approximately 2.3 ms. The productivity gained by using the framework is significant because the actual hardware implementation for MNW required approximately 200 man-hours to code, place and route, debug, and evaluate.

Beyond the initial prediction, evaluating the performance impact of alternative MNW designs can provide insight about the desirability of possible structural or behavioral revisions. The DSE tool can explore different architectural optimizations (e.g., faster


pipelines), but these analyses can be trivial for this computation-bound application due to the direct correspondence between the rate of execution and the overall application performance. Instead, Figure 5-8 illustrates the predicted execution time of MNW based on 1000 design revisions that represent different problem sizes (i.e., comparisons of 500 to 50450 DNA sequences) divided among four FPGAs. These revisions expand the initial four-FPGA design of 1500 DNA sequences. The execution time of MNW grows quadratically with the number of DNA sequences, since $N$ sequences require $\frac{N^2 - N}{2}$ comparisons. This DSE can help evaluate the suitability of the MNW application for meeting the broad performance requirements of a designer, particularly when the size of the sequence database is expected to increase at a potentially unknown rate after implementation. The DSE tool took only 6.1 ms to analyze this significantly larger design space. Table 5-2 summarizes analysis times for the initial design, the 1000 revisions, and two other large DSEs. The analysis times grow linearly with the size of the design space and allow very large numbers of revisions to be explored in significantly less than one second.

Table 5-2. Analysis times for design spaces of MNW

Number of Revisions   Analysis Time (ms)
0 (initial design)    2.3
1000                  3.0
10000                 15
100000                140

Figure 5-9. MVA graph specification and mapping
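A problem-size sweep of this kind can be sketched in Python as follows. Two assumptions are made and should be read as such: the 1000 revisions are taken to be evenly spaced (a step of 50 sequences is consistent with the stated endpoints of 500 and 50450), and the execution time is taken to scale in direct proportion to the number of comparisons, calibrated to the 0.236 s four-FPGA prediction for 1500 sequences from Table 5-1. The actual RAT model also includes communication terms that this sketch ignores.

```python
def num_comparisons(n_sequences):
    """Unique unordered pairs of distinct sequences: (N^2 - N) / 2."""
    return (n_sequences * n_sequences - n_sequences) // 2


def problem_sizes(first=500, last=50450, count=1000):
    """Evenly spaced problem sizes (assumed spacing; see lead-in)."""
    step = (last - first) // (count - 1)
    return [first + i * step for i in range(count)]


def predicted_time(n_sequences, ref_sequences=1500, ref_time_s=0.236):
    """Scale the reference prediction by the ratio of comparison counts."""
    return ref_time_s * num_comparisons(n_sequences) / num_comparisons(ref_sequences)


sizes = problem_sizes()
sweep = [(n, predicted_time(n)) for n in sizes]

# Doubling the number of sequences roughly quadruples the predicted time.
ratio = predicted_time(3000) / predicted_time(1500)
print(len(sweep), sizes[0], sizes[-1], f"ratio for 2x sequences: {ratio:.3f}")
```

Each revision costs only a handful of arithmetic operations, which is in keeping with the millisecond-scale analysis times in Table 5-2.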


Figure 5-10. Predicted execution times of the MVA graph based on revisions to the execution rate of the T4-2 pipeline

5.4.3 Task Graph of Mean-Value Analysis (MVA Graph)

An MVA graph is a defined structure for tasks and dependences that commonly describes the parallel decomposition of a recursive algorithm. As illustrated in Figure 5-9, the general algorithm structure widens until the particular "base" case is achieved and subsequently contracts to agglomerate the individual values. This task graph is commonly associated with mean-value analysis [64], though other algorithms such as quicksort have comparable structure. For this case study, the tasks represent a network of pipelined computations conforming to the SDF MoC. Random volumes of data and execution rates (i.e., pipeline rates and parallel decomposition) are assigned to each task. Communication is based on the multi-FPGA mapping. The structure of the MVA graph is often used as a benchmark for hardware/software scheduling algorithms. For this case study, the randomly populated MVA graph is used as a synthetic application to demonstrate the capabilities of the DSE tool for rapidly analyzing large design spaces of complex applications.
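Under the SDF MoC of Section 5.2.2, total performance is defined by the slowest group of operations, so a rate sweep on one pipeline only matters once that pipeline becomes the bottleneck. The Python sketch below illustrates this with two pipelines named after T2-1 and T4-2 in Figure 5-9; the workloads, rates, and candidate values are hypothetical and are not the randomly assigned values used in the case study, and communication is ignored.

```python
def sdf_execution_time(workloads, rates):
    """Steady-state estimate: the slowest pipeline bounds the total time.

    workloads: elements to process per pipeline
    rates    : elements per second per pipeline
    """
    return max(workloads[name] / rates[name] for name in workloads)


def sweep_rate(workloads, rates, name, candidate_rates):
    """Predict the execution time for each candidate rate of one pipeline."""
    results = []
    for rate in candidate_rates:
        revised = dict(rates)
        revised[name] = rate
        results.append((rate, sdf_execution_time(workloads, revised)))
    return results


def minimum_harmless_rate(results, baseline_s):
    """Slowest candidate rate that does not increase the baseline time."""
    harmless = [rate for rate, t in results if t <= baseline_s]
    return min(harmless) if harmless else None


# Hypothetical workloads (elements) and rates (elements/s).
workloads = {"T2-1": 17e6, "T4-2": 4e6}
rates = {"T2-1": 100e6, "T4-2": 100e6}

baseline = sdf_execution_time(workloads, rates)
results = sweep_rate(workloads, rates, "T4-2", [10e6, 20e6, 25e6, 50e6, 100e6])
threshold = minimum_harmless_rate(results, baseline)
print(f"baseline {baseline:.2f} s, slowest harmless T4-2 rate {threshold / 1e6:.0f} M/s")
```

With a finer grid of candidate rates, the same sweep locates the threshold more precisely, which is the kind of question the 1000-revision DSE in this case study answers.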


Table 5-3. Analysis times for design spaces of MVA graph

Number of Revisions   Analysis Time (ms)
0 (initial design)    10
1000                  12
10000                 33
100000                340

Although the algorithm structure is an SDF MoC, collecting quantitative parameters from the individual computation tasks and communication operations remains very similar to the MNW case study. The algorithm complexity is manifested in the number of tasks, their dependencies, and their scheduling. The predicted execution time of the initial design for the MVA graph is 0.17 s, which is dominated by the slowest pipeline, T2-1 (Figure 5-9), due to its highest assigned workload. Strategic DSE can provide useful insight about minimum execution rates for the other pipelines and allows a designer to construct the slowest (and least resource-intensive) pipeline possible without increasing the overall execution time for the application. For example, Figure 5-10 illustrates the predicted execution time of the MVA graph based on 1000 design revisions that represent different execution rates for the T4-2 pipeline. Pipeline rates for T4-2 above a certain threshold have no impact on the total performance of the MVA graph because the execution time remains dominated by the slower T2-1 pipeline. However, sufficiently slow rates for the T4-2 pipeline increase the execution time of the MVA graph. Rapid determination of this threshold is difficult without the DSE tool.

Despite the large number of revisions, DSE using the integrated framework remained tractable. Table 5-3 summarizes the framework analysis times, including the transfer of the performance information and RAT prediction, for various numbers of revisions to the design of the MVA graph. The analysis time grows approximately linearly with the size of the design space. The longest analysis of 100,000 revisions took 340 ms, which is nearly indistinguishable by the user from the duration of a single analysis of a simple application (e.g., 2.3 ms for MNW). The primary limitation of broad DSE (aside from Java memory requirements) is the ability of the user to efficiently digest the generated prediction values.


Additional analysis tools could be constructed to identify the highest performing design(s) based on criteria such as fewest revisions from the original application specification, but such features are outside the scope of this dissertation.

Figure 5-11. Architecture specification of Novo-G system

5.5 Integrated RATSS (MNW)

The case studies (and associated FPGA platform) from Section 5.4 only required RAT-level analysis due to the small system size and single-level point-to-point communication. However, the proposed framework can also be used with RATSS, which extends the RAT tool based on the methodologies for analytical modeling of scalable systems from Chapter 4. Abstract application specification using modeling environments and MoCs remains unchanged except for potential mappings to larger, scalable systems with more computation resources and communication interconnects. The methodologies for translation and orchestration of the synchronous iterative model for performance estimation are also maintained with RATSS, which provides models for the added system-level communication. For this case study, the parallelism of MNW is extended to multiple nodes of an FPGA-augmented cluster. Figure 5-11 describes the architecture of the system, referred to as Novo-G, where each node of the cluster is the platform described in Figure 5-6. The nodes are connected by Gigabit Ethernet, which is modeled using the RATSS tool.
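The multi-node estimate composes a per-node time with the added system-level communication time. The Python sketch below uses the predicted values reported in Table 5-4 and the four-FPGA prediction from Table 5-1; treating the total as a simple sum and the node time as the single-node time divided by the node count are assumptions that match the rounded predictions in those tables, not a statement of the full RATSS model. The function names are hypothetical.

```python
# Predicted (t_node, t_MPI) in seconds from Table 5-4, keyed by node count.
predicted = {
    2: (1.18e-1, 8.09e-3),
    4: (5.90e-2, 1.30e-2),
    8: (2.95e-2, 1.64e-2),
    16: (1.48e-2, 1.89e-2),
}

SINGLE_NODE_TIME_S = 2.36e-1   # four-FPGA prediction from Table 5-1


def scaled_node_time(nodes, single_node_time_s=SINGLE_NODE_TIME_S):
    """Assumption: computation-bound work divides evenly among nodes."""
    return single_node_time_s / nodes


def total_time(t_node, t_mpi):
    """Assumption: node time and MPI time do not overlap."""
    return t_node + t_mpi


def communication_fraction(t_node, t_mpi):
    """Share of the total time spent in inter-node communication."""
    return t_mpi / total_time(t_node, t_mpi)


totals = {n: total_time(*times) for n, times in predicted.items()}
fractions = {n: communication_fraction(*times) for n, times in predicted.items()}

for n in sorted(predicted):
    print(f"{n:2d} nodes: t_total = {totals[n]:.3e} s, "
          f"communication share = {fractions[n]:.0%}")
```

The growing communication share explains why adding nodes yields diminishing returns for a fixed problem size even though the per-node time keeps shrinking.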

Table 5-4. Predicted (RATSS) and experimental results for multi-node MNW

                        Predicted Time (s)   Experimental Time (s)   Error
8 FPGAs (2 Nodes)
  t_node                1.18E-1              1.23E-1                 4.1%
  t_MPI                 8.09E-3              6.72E-3                 20.4%
  t_total               1.26E-1              1.20E-1                 2.8%
16 FPGAs (4 Nodes)
  t_node                5.90E-2              6.15E-2                 4.1%
  t_MPI                 1.30E-2              1.10E-2                 18.1%
  t_total               7.20E-2              7.25E-2                 0.7%
32 FPGAs (8 Nodes)
  t_node                2.95E-2              3.08E-2                 4.1%
  t_MPI                 1.64E-2              1.47E-2                 11.4%
  t_total               4.59E-2              4.55E-2                 0.9%
64 FPGAs (16 Nodes)
  t_node                1.48E-2              1.54E-2                 4.1%
  t_MPI                 1.89E-2              1.70E-2                 11.6%
  t_total               3.37E-2              3.24E-2                 4.0%

Table 5-4 summarizes the predicted and experimental results from parallelization of the 1500 2 1500 2 molecular comparisons described in Section 5.4.2 across 8, 16, 32, and 64 FPGAs on 2, 4, 8, and 16 nodes, respectively. The execution time of each four-FPGA node, t_node, decreases approximately linearly as the system size grows because the application is computation-bound, maintaining the approximately 4% underestimation by the RATSS model. The inter-node communication, MPI over Gigabit Ethernet using MPICH2 (t_MPI), involves broadcasting the sequence database and gathering the resulting normalized edit distances. This communication was overestimated by approximately 11% for the larger 8- and 16-node systems and approximately 20% for the smaller 2- and 4-node systems. Because the underestimated node time and the overestimated communication time partially offset, RATSS maintains high prediction accuracy for the total application performance, t_total, with errors between 1% and 4%.

5.6 Conclusions

Widespread adoption of FPGA systems is increasingly limited by application development productivity. Modeling environments and performance prediction help reduce the costly development process by facilitating iterative application refinement prior to hardware implementation. However, efficient design space exploration requires a comprehensive approach to application specification, analysis, and ultimately revision, if necessary.

The proposed framework provides the necessary methodology to integrate modeling environments with RAT performance prediction for strategic DSE. The framework translates application specifications into performance characterizations and orchestrates the required RAT predictions. The DSE tool was constructed based on the framework methodology and demonstrated accurate performance analysis with under 5% error for the MNW case study. Strategic DSE with the tool was efficient and rapid, requiring only 140 ms and 340 ms for analysis of 100,000 revisions to the MNW application and the MVA graph, respectively.

CHAPTER 6
CONCLUSIONS

The promise of reconfigurable computing for achieving speedup and power savings versus traditional computing paradigms is expanding interest in FPGAs. Among the challenges for improving development of parallel algorithms for FPGAs, the lack of methods for strategic design is a key obstacle to efficient usage of RC devices. Better formulation methodologies are needed to explore algorithms, architectures, and mappings to reduce FPGA development time and cost. Consequently, RAT is created as a simple and effective methodology for investigating the performance potential of the mapping of a given parallel algorithm for a given FPGA platform architecture. The methodology employs an analytical model to analyze FPGA designs prior to actual development. RAT is scoped and automated to provide maximum efficiency and reliability while retaining reasonable prediction accuracy. The performance prediction is meant to work with empirical knowledge of RC devices to create more efficient and effective means for design-space exploration.

For the first phase of research, the RAT methodology defined the core analytical model for performance estimation during formulation. The extensible RAT model was specifically scoped for usage with deterministic applications on common, albeit single-FPGA, platforms. Five case studies (1-D PDF, 2-D PDF, LIDAR, TSP, and MD) validated the accuracy of decomposing complex system behavior into key communication microbenchmarks and computation parameters for RAT modeling. Detailed microbenchmarking allowed for an average error of 12% (with individual errors as low as 1%) for the communication times of the case studies. For the deterministic case studies (i.e., all except MD), computation error peaked at 17%. The total RC execution time had an average error of 18% for the case studies, which helped validate the RAT methodology of rapid and reasonably accurate prediction.
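The decomposition into communication microbenchmarks and computation parameters can be sketched as a small throughput-style estimate in Python. This is a minimal sketch under stated assumptions, not the RAT equations themselves: the parameter names, the sustained-bandwidth fraction `alpha`, and the single-buffered and double-buffered compositions are all illustrative choices, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass
class Design:
    """Hypothetical per-iteration parameters for one candidate design."""
    bytes_moved: float      # input plus output bytes per iteration (assumed)
    ideal_bw: float         # ideal interconnect bandwidth, bytes/s
    alpha: float            # fraction of ideal bandwidth sustained (microbenchmark)
    n_elements: int         # data elements processed per iteration
    ops_per_element: float  # operations required per element
    f_clock: float          # FPGA clock frequency, Hz
    ops_per_cycle: float    # operations completed per cycle by the pipelines
    n_iter: int             # number of iterations


def t_comm(d: Design) -> float:
    """Communication time per iteration from sustained bandwidth."""
    return d.bytes_moved / (d.alpha * d.ideal_bw)


def t_comp(d: Design) -> float:
    """Computation time per iteration from pipeline throughput."""
    return d.n_elements * d.ops_per_element / (d.f_clock * d.ops_per_cycle)


def t_rc(d: Design, double_buffered: bool = False) -> float:
    """Total time: overlapped (max) if double buffered, otherwise serialized."""
    if double_buffered:
        return d.n_iter * max(t_comm(d), t_comp(d))
    return d.n_iter * (t_comm(d) + t_comp(d))


def speedup(t_soft: float, d: Design, double_buffered: bool = False) -> float:
    """Predicted speedup over a software baseline time t_soft (seconds)."""
    return t_soft / t_rc(d, double_buffered)


# Invented example values, for illustration only.
design = Design(bytes_moved=8192, ideal_bw=1e9, alpha=0.5, n_elements=1024,
                ops_per_element=100, f_clock=100e6, ops_per_cycle=10, n_iter=100)
print("single buffered:", t_rc(design))
print("double buffered:", t_rc(design, double_buffered=True))
print("speedup vs 1 s software baseline:", speedup(1.0, design))
```

Keeping communication and computation as separate terms mirrors the validation reported above, where communication and computation errors are measured independently before the total execution time is assessed.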

The second phase of research proposed RATSS, an extension of the RAT model for multi-FPGA systems. RATSS balanced the desire for greater algorithm and platform diversity (i.e., model applicability) with the requirement of high predictability (i.e., model accuracy) for scalable systems by focusing on synchronous iterative algorithms for two classes of modern RC systems. Synchronous iterative algorithms represented a significant class of data-parallel applications, typically structured as SIMD-style pipelines. Focusing on two classes of RC systems allowed hierarchical agglomeration of computation and communication models into RAT predictions for the full application. Successes in conventional HPC and HPEC modeling such as the LogP communication model were leveraged to help maximize efficiency and reliability. Three case studies, 2-D PDF, image filtering, and MD, demonstrated total prediction errors under 12%, 3%, and 0.03%, respectively.

For the third phase of research, the RAT (and RATSS) methodologies were integrated within a larger framework for more strategic design-space exploration of RC applications. Specifically, the framework bridged RAT performance prediction with modeling environments. These modeling environments allowed rapid yet accurate application specification by a designer within the context of the MoC. The framework provided translation between supported MoCs and the analytical performance model for RAT (i.e., the synchronous iterative model). A tool constructed from the framework methodology provided translation for the AMP and SDF MoCs of the RCML modeling environment. This framework tool orchestrated design-space exploration by performing RAT analysis on an initial application design and potential revisions to the characteristics of algorithm and/or platform architecture, identifying suitable design configurations based on designer criteria. Two case studies, MNW and MVA graph, demonstrated reasonable prediction accuracy (under 5% for MNW) and rapid exploration of large design spaces (140 ms and 340 ms for RAT analysis of 100K revisions to MNW and MVA graph, respectively).

Collectively, this research contributes an analytical model and accompanying methodology for performance prediction where such work was lacking for FPGA development. RAT and RATSS demonstrated high applicability to a variety of algorithms, platform architectures, and system mappings and provide a formalized infrastructure for integrated, efficient, reliable, and reasonably accurate aid with respect to the prediction aspect of design-space exploration. This research contributed not only to analytical modeling for FPGA performance estimation but also to modeling languages and design patterns for RC. Future directions for research include incorporation of the RAT prediction (and the integrated framework) into a larger methodology for more fully automated design-space exploration (e.g., automated mapping, evaluation, and optimization) with integrated bridging to design-level implementation code.


BIOGRAPHICAL SKETCH

Brian Holland received his B.S. degree in computer engineering from Clemson University and M.S. degree in electrical and computer engineering from the University of Florida. Mr. Holland is a senior research assistant with the High-performance Computing and Simulation (HCS) Laboratory and the NSF Center for High-performance Reconfigurable Computing (CHREC). Mr. Holland led the Application Case Studies group and co-led the Formulation and Design group focusing on productive application development through strategic performance analysis. Mr. Holland has conducted research involving application modeling, analytical performance prediction, and high-level synthesis for diverse application fields including signal processing and bioinformatics.

Mr. Holland has guest lectured on high-level languages, application design productivity, and analytical modeling for Parallel Computer Architecture (EEL6763) and Reconfigurable Computing (EEL5934). Mr. Holland has served as a reviewer for the Journal of Supercomputing and the 2007 Parallel and Distributed Computing and Systems (PDCS) conference. Mr. Holland has authored or co-authored 4 journal publications and 11 conference publications to date, with one journal submission pending. Upon graduation, Mr. Holland will seek employment in either industry or government to continue research into productive application design, analysis, and implementation for novel system architectures.