Title: Memory-driven algorithm mapping of molecular dynamics for high-performance reconfigurable computers
Permanent Link: http://ufdc.ufl.edu/UF00094687/00001
 Material Information
Title: Memory-driven algorithm mapping of molecular dynamics for high-performance reconfigurable computers
Physical Description: Book
Language: English
Creator: Holland, Brian
George, Alan D.
Publisher: Holland et al.
Place of Publication: Gainesville, Fla.
Publication Date: April 2008
Copyright Date: 2008
 Notes
General Note: Paper presented at the Many-core and Reconfigurable Supercomputing Conference (MRSC), April 1-3, 2008
 Record Information
Bibliographic ID: UF00094687
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

Full Text






Memory-Driven Algorithm Mapping

of Molecular Dynamics for

High-Performance Reconfigurable Computers


*CHREC
NSF Center for High-Performance
Reconfigurable Computing



Brian Holland & Alan D. George

NSF CHREC Center

ECE Department
University of Florida


1-3 April 2008







Outline

* Molecular Dynamics Introduction
* Strategic Design (Formulation)
* Performance Prediction Overview
  * RC Amenability Test (RAT)
* Impulse C & XtremeData XD1000
  * System configuration and linguistic factors
  * Performance prediction with MD
* Carte & SRC MAP-B
  * System configuration and linguistic factors
  * Performance prediction with MD
* Results, Speedup, & Prediction Accuracy
* Taking Knowledge Forward
  * Design Patterns
* Conclusions









Molecular Dynamics


* Numerical simulation of physical interactions of atoms and molecules over a given time interval



* Previous Work
  * Widely studied in HPC & HPRC
    * [Alam 06], [Azizi 04], [Cordova 05], [Gu 06], [Kindratenko 06], etc.
  * Software baseline code for molecular dynamics provided to CHREC by Oak Ridge National Lab
    * Explored as an HPC case study to uncover lessons learned


* Primary Stages of MD Computation
1. Reading molecular pair's position from memory
2. Computing distance between two molecules
3. Computing atomic interaction (if distance < threshold)
4. Adjusting net acceleration for adjacent molecules


[Figure: snapshots of MD simulations (time 0.0041 ps); images from Wikipedia Commons and folding.stanford.edu, for academic purposes only]








Molecular Dynamics


void ComputeAccel() {
  double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
  int j1, j2, n, k;
  rrCut = RCUT*RCUT;
  for (n=0; n<nAtom; n++)
    for (k=0; k<3; k++) ra[n][k] = 0.0;
  potEnergy = 0.0;
  for (j1=0; j1<nAtom-1; j1++) {
    for (j2=j1+1; j2<nAtom; j2++) {
      /* Step 1: read the molecular pair's positions from memory   */
      /* Step 2: compute the distance between the two molecules    */
      for (rr=0.0, k=0; k<3; k++) {
        dr[k] = r[j1][k] - r[j2][k];
        dr[k] = dr[k] - SignR(RegionH[k], dr[k]-RegionH[k])
                      - SignR(RegionH[k], dr[k]+RegionH[k]);
        rr = rr + dr[k]*dr[k];
      }
      /* Step 3: compute the atomic interaction (if distance < threshold) */
      if (rr < rrCut) {
        ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
        fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
        /* Step 4: adjust net acceleration for both molecules */
        for (k=0; k<3; k++) {
          f = fcVal*dr[k];
          ra[j1][k] = ra[j1][k] + f;
          ra[j2][k] = ra[j2][k] - f;
        }
        potEnergy += 4.0*ri6*(ri6-1.0) - Uc - Duc*(r1-RCUT);
      }
    }
  }
}

1. Reading Position
   * Potential contention for shared resource

2. Computing Distance
   * Simple hardware mapping if Step 1 contentions are resolved

3. Computing Interaction
   * Conditional computations must be speculatively executed for pipelines (see the sketch below)

4. Adjusting Net Acceleration
   * Similar contention to Step 1 for shared resource
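A minimal sketch (ours, not from the slides) of one way the Step 3 conditional can be predicated so a hardware pipeline never branches; the variables follow ComputeAccel() above, and the zero-mask trick is an assumption about one possible mapping, not the authors' implementation:

/* Always compute the interaction, then mask its contribution to zero when the
 * pair is beyond the cutoff, so every loop iteration takes the same pipeline path. */
ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
double keep = (rr < rrCut) ? 1.0 : 0.0;     /* predicate replaces the branch */
for (k = 0; k < 3; k++) {
    f = keep * fcVal * dr[k];               /* contributes 0.0 beyond the cutoff */
    ra[j1][k] = ra[j1][k] + f;
    ra[j2][k] = ra[j2][k] - f;
}
potEnergy += keep * (4.0*ri6*(ri6-1.0) - Uc - Duc*(r1-RCUT));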








Molecular Dynamics: Software vs. Hardware

[Figure: the Step 1/2 distance loop and Step 4 acceleration-update loop from ComputeAccel(), with software and hardware diagrams showing how Steps 1-4 map onto shared memory banks]

* Memory-driven algorithm mapping
  * Memory contention resolution in Step 1
    * Replicate input data across multiple banks
    * Alternately, use multi-ported memory
  * Memory contention resolution in Step 4
    * Use multi-ported memory
      * Not always available, especially for external SRAM
    * Compute pairs twice, only saving one value each time (see the sketch below)
      * Both updates are appends (read, add, write)
      * Potentially many stall cycles are necessary
      * One molecule per pair requires only one register append with a final memory write (after N iterations)
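A minimal C sketch (ours, assuming the baseline's nAtom and ra arrays; pairForce() is a hypothetical helper wrapping Steps 2-3) of the register-append resolution described above: each pair is computed twice so that a molecule's acceleration accumulates in registers and is written to memory only once.

void ComputeAccelRegisterAppend() {
  double acc[3], f[3];
  int j1, j2, k;
  for (j1 = 0; j1 < nAtom; j1++) {
    acc[0] = acc[1] = acc[2] = 0.0;            /* per-molecule register accumulator */
    for (j2 = 0; j2 < nAtom; j2++) {
      if (j2 == j1) continue;
      pairForce(j1, j2, f);                    /* hypothetical: Steps 2-3 for this pair */
      for (k = 0; k < 3; k++)
        acc[k] += f[k];                        /* append stays in registers, no memory write */
    }
    for (k = 0; k < 3; k++)
      ra[j1][k] = acc[k];                      /* single memory write after the inner loop */
  }
}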








Strategic Design Methodology


* What is the next step towards an FPGA design?
  * Jump right in and start coding (ill-advised!)
  * Examine prior work and leverage success (better!!)
    * But what if you're in uncharted territory with an algorithm?
  * Formulation/Strategic Design (best!!!)
    * What are the specifications and requirements?
    * Can I quantitatively show my design will meet expectations?
  * Develop the algorithm, then use RAT to gauge likely performance

* Molecular Dynamics
  * Requirements
    * Create the highest-performance design possible for 16,384 molecules
    * Minimize memory footprint to enable larger future designs
  * Quantitative Analysis
    * Will the memory structure and bandwidth support the design?
    * Are enough computations operating in parallel to achieve speedup?









Performance Prediction


* RC Amenability Test (RAT)
  * Determines execution time of a specific algorithm on a specific hardware platform
  * Amenability is gauged based on performance and speedup requirements as determined by the user

* RAT Analytic Model

  Communication time:  t_{comm} = t_{read} + t_{write}
  Computation time:    t_{comp} = \frac{N_{elements} \times N_{ops/element}}{f_{clock} \times throughput_{proc}}
  RC execution time:   t_{RC,SB} = N_{iter} (t_{comm} + t_{comp})        (single buffered)
                       t_{RC,DB} = N_{iter} \max(t_{comm}, t_{comp})     (double buffered)
  Speedup:             speedup = t_{soft} / t_{RC}


[Figure: FPGA application development spectrum — RAT performs numerical analysis for performance prediction from specific requirements and resource usage]


[Figure: communication and computation overlap for single- or double-buffered designs; double buffering is either computation bound or communication bound. Legend: R = read, W = write, C = compute]
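A minimal C sketch (ours, not the actual RAT tool) of the analytic model above; the α derating factors on input and output throughput are folded into the single throughput argument here for brevity.

/* Evaluate the RAT model: communication, computation, RC execution time, speedup. */
#include <stdio.h>

double rat_predict(double n_elem_in, double n_elem_out, double bytes_per_elem,
                   double throughput_Bps,      /* effective interconnect throughput, bytes/s */
                   double n_elem_comp, double ops_per_elem,
                   double throughput_proc,     /* ops per cycle */
                   double f_clock_Hz, double n_iter,
                   double t_soft, int double_buffered)
{
    double t_read  = (n_elem_in  * bytes_per_elem) / throughput_Bps;
    double t_write = (n_elem_out * bytes_per_elem) / throughput_Bps;
    double t_comm  = t_read + t_write;
    double t_comp  = (n_elem_comp * ops_per_elem) / (f_clock_Hz * throughput_proc);
    double t_rc    = double_buffered
                   ? n_iter * (t_comm > t_comp ? t_comm : t_comp)   /* overlapped */
                   : n_iter * (t_comm + t_comp);                    /* serialized */
    printf("t_comm = %.3e s, t_comp = %.3e s, t_RC = %.3e s, speedup = %.2f\n",
           t_comm, t_comp, t_rc, t_soft / t_rc);
    return t_soft / t_rc;
}

Plugging in the XD1000 numbers from the MD prediction later in the deck (16,384 input and output elements of 12 bytes at 500 MB/s; 16,384 computation elements of 163,840 ops each at 40 ops/cycle and 100 MHz; t_soft = 6.71 s; one iteration, single buffered) reproduces t_comp ≈ 0.671 s and a kernel speedup close to the predicted 9.99.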











Performance Prediction

* Interconnect Parameters
  * throughput_ideal, α_input, α_output
    * Model CPU/FPGA interconnect throughput
* Communication Parameters
  * N_elements, input/output
    * Quantity of input and output data for the algorithm
  * N_bytes/element
    * Element is an algorithm's basic unit of data
    * Conversion factor for determining total data size
* Computation Parameters
  * N_elements, comp
    * Computation is not always related to data transfer
  * N_ops/element
    * Amount of computational work needed to complete one element
  * throughput_proc
    * Average number of operations per cycle
  * Clock frequency
* Software Parameters
  * t_soft
    * Software execution time
  * N_iterations
    * Number of input, compute, output cycles




RAT input worksheet (constants/user-defined parameters):

  Interconnect Parameters:   throughput(ideal) (MB/s), α(input), α(output)
  Communication Parameters:  # of input elements, # of output elements, bytes per element (B)
  Computation Parameters:    # of comp elements, ops per element (ops/elem), throughput(proc) (ops/cycle), f(clock) (MHz)
  Software Parameters:       t(soft) (sec), N (iterations)

Calculated sub-metrics (single buffered): t(soft), t(comm), t(comp), t(RC), speedup(kernel)







XD1000 System & Impulse C Language


* XtremeData XD1000 System
  * Single XD1000 FPGA module
    * One Altera Stratix II EP2S180
    * Connected to Opteron server via HyperTransport interconnect
    * Single 4MB SRAM bank
  * ~500 MB/s interconnect throughput
* Linguistic Factors
  * Internal FPGA network
    * Stream-oriented communication with serialized input and output
    * Links 14 MD computational kernels
  * Computation
    * Challenging to explore and evaluate optimal pipelining


[Images of the XD1000 module and Impulse C logo from xtremedatainc.com and impulsec.com, for academic purposes only]









MD Performance Prediction

Constants/User-Defined Parameters (Impulse C & XD1000)

  Interconnect:    throughput(ideal) = 500 MB/s; α(input), α(output)
  Communication:   # of input elements = 16384; # of output elements = 16384; bytes per element = 12 B
  Computation:     # of comp elements = 16384; ops per element = 163840; throughput(proc) = 40 ops/cycle; f(clock) = 100 MHz
  Software:        t(soft) = 6.71 sec; N = 1 iteration

Calculated Sub-metrics (single buffered)

  t(soft) = 6.71E+00 sec; t(comm) = 8.74E-04 sec; t(comp) = 6.71E-01 sec; t(RC) = 6.72E-01 sec
  speedup(kernel) = 9.99


* Interconnect Parameters
  * throughput_ideal, α_input, α_output
    * Model the HyperTransport interconnect
* Communication Parameters
  * N_elements, input/output
    * 16,384 molecules (elements) are used in the MD simulation
  * N_bytes/element
    * 4 bytes (32 bits) x 3 (x, y, z dimensions)
* Computation Parameters
  * N_elements, comp
    * 16,384 (elements, iterations, etc.)
  * N_ops/element
    * 16,384 other elements x 10 ops each
    * Difficult to measure for this algorithm
  * throughput_proc
    * 40, the number required to achieve 10x speedup
    * Difficult to measure for this algorithm
  * Clock frequency
    * 100 MHz, default frequency of the XD1000 system
* Software Parameters
  * t_soft
    * Execution time on a 3.2 GHz Xeon
  * N_iterations
    * Only 1 iteration necessary
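As a consistency check (our arithmetic, using only the parameters listed above), the computation term and kernel speedup follow directly; the communication term also involves the α factors, so its value is taken from the table:

t_{comp} = \frac{N_{elements} \times N_{ops/element}}{f_{clock} \times throughput_{proc}} = \frac{16384 \times 163840}{(100 \times 10^{6}) \times 40} \approx 0.671\ \mathrm{s}

speedup_{kernel} = \frac{t_{soft}}{t_{comm} + t_{comp}} = \frac{6.71}{8.74 \times 10^{-4} + 0.671} \approx 9.99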










SRC System & Carte Language


* SRC-6 System
  * 4 MAP-B units
    * Two Xilinx Virtex XC2V6000 each
    * Only one FPGA per unit used
    * Connected to host server via SNAP (memory) interconnect
    * Six 4MB 64-bit SRAM banks each
* Linguistic Issues
  * Mapping SRAM resources to computational needs
    * One MD kernel per MAP unit consuming all SRAM resources
    * Four computation kernels total
  * Computation
    * Mapping unrolled MD loops and removing pipeline stalls




[Figure: SRC-6 MAP hardware (image from www.srccomp.com) and the mapping of molecule position data ([X,Y] and [Z] blocks) across the SRAM banks]










MD Performance Prediction

Constants/User-Defined Parameters (Carte C & SRC MAP-B)

  Interconnect:    throughput(ideal) = 800 MB/s; α(input), α(output)
  Communication:   # of input elements = 65536; # of output elements = 32768; bytes per element = 8 B
  Computation:     # of comp elements = 16384; ops per element = 16383; throughput(proc) = 4 ops/cycle; f(clock) = 100 MHz
  Software:        t(soft) = 6.71 sec; N = 1 iteration

Calculated Sub-metrics (single buffered)

  t(soft) = 6.71E+00 sec; t(comm) = 1.03E-03 sec; t(comp) = 6.71E-01 sec; t(RC) = 6.72E-01 sec
  speedup(kernel) = 9.98

* Interconnect Parameters
  * throughput_ideal, α_input, α_output
    * Model the SNAP interconnect
* Communication Parameters
  * N_elements, input
    * 16,384 molecules x 2 blocks (x/y, z) x 2 copies (i, j)
  * N_elements, output
    * 16,384 molecules x 2 blocks (x/y, z)
  * N_bytes/element
    * 8 bytes = 64-bit wide SRAM
* Computation Parameters
  * N_elements, comp
    * 16,384 (elements, iterations, etc.)
  * N_ops/element
    * 16,383 operations (i.e., comparisons) against the other atoms
    * Individual steps of a comparison (i.e., pipeline depth) are difficult to quantify before design, but not necessary
  * throughput_proc
    * 4 parallel kernels, processing 4 comparisons per cycle
    * Assumes that pipelines will not stall
  * Clock frequency
    * 100 MHz, default frequency
* Software Parameters
  * t_soft
    * Execution time on a 3.2 GHz Xeon
  * N_iterations
    * Only 1 iteration necessary
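The same consistency check (our arithmetic) with the SRC parameters:

t_{comp} = \frac{16384 \times 16383}{(100 \times 10^{6}) \times 4} \approx 0.671\ \mathrm{s} \qquad speedup_{kernel} = \frac{6.71}{1.03 \times 10^{-3} + 0.671} \approx 9.98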









Results


Execution Time (sec) of 16K MD Simulation

  density    XD1000    SRC-6    3.2 GHz Xeon
  0.01       0.48      0.92     6.67
  0.1        0.94      0.92     6.70
  1          0.85      0.92     6.71
  10         2.28      0.92     7.12
  100        6.49      0.92     10.87
  1000       12.30     0.92     16.29

[Figure: execution time of the 16K MD simulation vs. density parameter for XD1000, SRC-6, and 3.2 GHz Xeon]

* Molecular Density
  * Pairs of molecules above the distance threshold do not require extra computation
  * Sparse sets will have many pairs above the distance threshold
  * Dense sets will have many pairs below the distance threshold
* Impulse
  * Diminishing performance for denser sets
    * Pipelining inefficiencies
    * Load-balancing issues
* SRC
  * Consistent pipelining performance
    * Best for dense sets of molecules
    * However, not as resource-efficient











Speedup & Prediction Accuracy
[Figure: speedup of the 16,384-molecule simulation vs. molecular density for the SRC-6 and XD1000 designs]

Prediction accuracy (16,384-molecule simulation):

                    Impulse C & XD1000          Carte C & SRC-6
                    Predicted    Actual         Predicted    Actual
  t_comm (sec)      8.74E-04     1.40E-03       1.04E-03     2.81E-03
  t_comp (sec)      6.71E-01     8.47E-01       6.71E-01     6.71E-01
  overhead (sec)    0            N/A            0            2.46E-01
  t_RC (sec)        6.71E-01     8.48E-01       6.71E-01     9.20E-01
  Speedup           9.99         7.93           9.98

* XD1000 has CPU-initiated data transfers
  * All measurements are from wall-clock time
* SRC has FPGA-initiated data transfers
  * Total execution time is wall-clock time
  * Individual values taken from FPGA cycle count
  * Discrepancy caused by system overhead











Taking Knowledge Forward

* Design Patterns
  * "A design pattern names, abstracts, and identifies the key aspects of a common design structure that make it useful for creating a reusable object-oriented design" [1]
  * "Design patterns offer us organizing and structuring principles that help us understand how to put building blocks (e.g., adders, multipliers, FIRs) together." [2]


[Figure: shared design pattern for gravitational N-body and molecular dynamics — input duplication of position data, distance calculation, interaction computation, acceleration register, and output unrolling]
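To make the pattern concrete, a small sketch (ours, not from the slides) of the one stage that differs between the two applications; the rest of the skeleton above is shared. The MD form mirrors fcVal in ComputeAccel() with the smoothing term omitted; the N-body form and its constants are illustrative assumptions.

#include <math.h>

static const double G = 6.674e-11;   /* gravitational constant (SI) */

/* MD interaction: Lennard-Jones force factor, cf. fcVal in ComputeAccel() */
double md_interaction(double rr) {
  double ri2 = 1.0/rr, ri6 = ri2*ri2*ri2;
  return 48.0*ri2*ri6*(ri6 - 0.5);
}

/* N-body interaction: gravitational force factor, so force component = factor * dr[k] */
double nbody_interaction(double rr, double m_i, double m_j) {
  return G * m_i * m_j / (rr * sqrt(rr));
}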
1. Gamma, Erich, et al., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Boston, 1995.
2. DeHon, Andre, et al., "Design Patterns for Reconfigurable Computing," Proceedings of the 12th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), April 20-23, 2004, Napa, California.









Conclusions


* Molecular Dynamics
  * Case study for evaluating algorithm mapping and optimization
    * Illustrates need for strategic design planning (formulation)
  * Memory hierarchy
    * Significant impact on overall algorithm performance
    * Without proper data locality & distribution, computation stalls unnecessarily
  * Balance requirements of the overall application with features and strengths of the target FPGA platform and design language
    * Efficient kernels had the highest average speedup for MD
* Strategic Design
  * Performance prediction: RC Amenability Test
    * Knowledge of algorithm, FPGA system, & design language is critical
    * Accuracy will improve with larger problem sizes
    * Must balance accuracy of prediction with time required to make predictions
  * Knowledge is reusable for betterment of future algorithm design








Acknowledgements

* NSF I/UCRC Program (Grant EEC-0642422)
* CHREC members
* Altera Corporation (tools, devices)
* George Washington University (SRC-6 access)
* Impulse Accelerated Technologies (tools)
* SRC Computers (tools)
* XtremeData Inc. (tools, platform)

Questions?









References

1. S. R. Alam, J. S. Vetter, P. K. Agarwal, and A. Geist. Performance characterization of molecular dynamics techniques for biomolecular simulations. In 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, Mar 29-31, 2006.
2. V. Kindratenko and D. Pointer. A case study in porting a production scientific supercomputing application to a reconfigurable computer. In Proc. IEEE 14th Symp. Field-Programmable Custom Computing Machines (FCCM), pages 13-22, Napa, CA, Apr 24-26, 2006.
3. Azizi, N., Kuon, I., Egier, A., Darabiha, A., and Chow, P. 2004. Reconfigurable molecular dynamics simulator. In Proc. IEEE 12th Symp. Field-Programmable Custom Computing Machines (FCCM), Napa, CA, 197-206.
4. Cordova, L. and Buell, D. 2005. An approach to scalable molecular dynamics simulation using supercomputing adaptive processing elements. In Proc. IEEE Int. Conf. Field Programmable Logic and Applications (FPL), 711-712.
5. Gu, Y., VanCourt, T., and Herbordt, M. 2006. Accelerating molecular dynamics simulations with configurable circuits. In Proc. IEE Computers and Digital Techniques, Vol. 153, 189-195.






