Title: Lessons learned with performance prediction and design patterns on molecular dynamics
Permanent Link: http://ufdc.ufl.edu/UF00094691/00001
Material Information
Title: Lessons learned with performance prediction and design patterns on molecular dynamics
Physical Description: Book
Language: English
Creators: Holland, Brian; Nagarajan, Karthik; Merchant, Saumil; Lam, Herman; George, Alan D.
Publisher: Holland et al.
Place of Publication: Gainesville, Fla.
Copyright Date: 2007
 Record Information
Bibliographic ID: UF00094691
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

Full Text




National Science Foundation


Lessons Learned with Performance

Prediction and Design Patterns on

Molecular Dynamics


CHREC
NSF Center for High-Performance
Reconfigurable Computing



Brian Holland
Karthik Nagarajan
Saumil Merchant
Herman Lam
Alan D. George

ECE Department, University of Florida
NSF CHREC Center







Outline of Algorithm Design Progression


* Algorithm decomposition
  - Design flow challenges
* Performance prediction
  - RC Amenability Test (RAT)
  - Molecular dynamics case study
  - Improvements to RAT
* Design patterns and methodology
  - Introduction and related research
  - Expanding pattern documentation
  - Molecular dynamics case study
* Conclusions




[Figure: Design Evolution - timeline of design milestones: Feb '07, Jun '07, Sep '07]







Design Flow Challenges

* Original mission
  - Create scientific applications for FPGAs as case studies to investigate topics such as portability and scalability
    - Molecular dynamics is one such application
    - Goal is not application implementation but lessons learned from the app.
  - Maximize performance and productivity using HLLs and high-performance reconfigurable computing (HPRC) design techniques
    - Applications should have significant speedup over SW baseline
* Challenges
  - Ensure speedup over traditional implementations
    - Particularly when the researcher is not an RC engineer
  - Explore application design space thoroughly and efficiently
    - Several designs may achieve speedup, but which should be used?








Algorithm Performance

* Motivation
  - (Re)designing applications is expensive
    - Only want to design once, and even then, do it most efficiently
  - Scientific applications can contain extra precision
    - Floating point may not be necessary but is used as a SW "standard"
  - Optimal design may overuse available FPGA resources
    - Discovering resource exhaustion mid-development is expensive
* Need
  - Performance prediction
    - Quickly, and with reasonable accuracy, estimate performance of a particular algorithm on a specific FPGA platform
    - Use simple analytic models to make prediction accessible to novices











RC Amenability Test (RAT)

"A methodology for fast and accurate RC performance prediction of a specific application on a specific platform before any hardware coding"

* Throughput Test
  - Algorithm and FPGA platform are parameterized
  - Equations are used to predict speedup
* Numerical Precision Test
  - RAT user should explicitly examine impact of reducing precision on computation
  - Interrelated with throughput test
    - Two tests essentially proceed simultaneously
* Resource Utilization Test
  - FPGA resource usage is estimated to determine scalability on FPGA platform


[Figure: Overview of RAT Methodology - flowchart. START: identify kernel, create design on paper; perform RAT: the Throughput Test balances desirable performance against the minimum precision from the Numerical Precision Test, and an acceptable balance of performance and precision feeds the Resource Test. Insufficient comm. bandwidth or throughput, an unrealizable precision requirement, or insufficient resources loop back to redesign; otherwise PROCEED: build in HDL or HLL, simulate design, verify on HW platform.]






Original RAT Analytic Model


Communication time:

    t_comm = t_read + t_write

    t_read  = (N_elements x N_bytes/element) / throughput_read(ideal)
    t_write = (N_elements x N_bytes/element) / throughput_write(ideal)

Computation time:

    t_comp = (N_elements x N_ops/element) / (f_clock x throughput_proc)

Total RC execution time:

    t_RC(DB) = N_iter x max(t_comm, t_comp)    (double buffering: communication and computation overlap)
    t_RC(SB) = N_iter x (t_comm + t_comp)      (single buffering)

Speedup:

    speedup = t_soft / t_RC

[Figure: Communication and Computation Overlap for Single or Double Buffering - timelines of R1..R4, W1..W4, C1..C4 blocks; Legend: R = Read, W = Write, C = Compute]

Application and RC platform attributes are parameterized and used in these equations to estimate performance.










Molecular Dynamics

void ComputeAccel() {
    double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
    int j1, j2, n, k;

    rrCut = RCUT*RCUT;
    for (n=0; n<nAtom; n++)                 /* nAtom: number of molecules */
        for (k=0; k<3; k++) ra[n][k] = 0.0;
    potEnergy = 0.0;

    for (j1=0; j1<nAtom-1; j1++) {
        for (j2=j1+1; j2<nAtom; j2++) {
            for (rr=0.0, k=0; k<3; k++) {
                dr[k] = r[j1][k] - r[j2][k];
                dr[k] = dr[k] - SignR(RegionH[k], dr[k]-RegionH[k])
                              - SignR(RegionH[k], dr[k]+RegionH[k]);
                rr = rr + dr[k]*dr[k];
            }
            if (rr < rrCut) {
                ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
                fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
                for (k=0; k<3; k++) {
                    f = fcVal*dr[k];
                    ra[j1][k] = ra[j1][k] + f;
                    ra[j2][k] = ra[j2][k] - f;
                }
                potEnergy += 4.0*ri6*(ri6-1.0) - Uc - Duc*(r1-RCUT);
            }
        }
    }
}

SW Baseline Code


* Simulation of interactions of a set of molecules over a given time interval
  - Based upon code provided by Oak Ridge National Lab (ORNL)
* Challenges for accurate performance prediction of MD
  - Large simulation datasets
    - Exhaust FPGA's local memory
    - Sets of molecules are often on the order of 100,000s of atoms, with dozens of time steps
  - Nondeterministic runtime
    - Molecules beyond a certain threshold are assumed to have zero impact
    - Certain sets require less comp.
























Molecular Dynamics

* Algorithm
  - 16,384-molecule data set
  - Written in Impulse C
  - XtremeData XD1000 platform
    - Altera Stratix II EP2S180 FPGA
    - HyperTransport interconnect
  - SW baseline on 2.4 GHz Opteron
* Parameters
  - Dataset Parameters
    - Model volume of data used by FPGA
  - Communication Parameters
    - Model the HyperTransport interconnect
  - Computation Parameters
    - Model computational requirement of FPGA
    - Nops/element
      - 164000 = 16384 x 10 ops
      - i.e., each molecule (element) takes 10 ops/iteration
    - Throughputproc
      - 50
      - i.e., operations per cycle needed for >10x speedup
  - Software Parameters
    - Software baseline runtime and iterations required to complete RC application


Dataset Parameters
  Nelements, input   (elements)        16384
  Nelements, output  (elements)        16384
  Nbytes/element     (bytes/element)   36

Communication Parameters
  throughput(ideal)  (Mbps)            500
  a(input)           0
  a(output)          0

Computation Parameters
  Nops/element       (ops/element)     164000
  throughput(proc)   (ops/cycle)       50
  f(clock)           (MHz)             75/100/150

Software Parameters
  t(soft)            (sec)             5.76
  N                  (iterations)      1

RAT Input Parameters of MD

             Predicted  Predicted  Predicted   Actual
f(clock)        75         100        150        100
tcomm         2.62E-3    2.62E-3    2.62E-3    1.39E-3
tcomp         7.17E-1    5.37E-1    3.58E-1    8.79E-1
utilcomm       0.4%       0.5%       0.7%       0.2%
utilcomp      99.6%      99.5%      99.3%      99.8%
tRC           7.19E-1    5.40E-1    3.61E-1    8.80E-1
speedup          8        10.7        16         6.6

Performance Parameters of MD












Parameter Alterations for Pipelining

* MD Optimization
  - Each molecular pair's computation should be pipelined
    - Individual molecules have nondeterministic workloads
    - But pairs of molecules will enter the pipeline at a constant rate
* Parameters
  - Computation Parameters
    - Nops/element
      - 16400
      - Strictly the number of interactions per element
    - Throughputpipeline
      - 0.333
      - Inverse of the number of cycles needed per interaction; i.e., the pipeline can only stall for 2 extra cycles
    - Npipelines
      - 15
      - Guess based upon predicted area usage


Dataset Parameters
  Nelements, input     (elements)        16384
  Nelements, output    (elements)        16384
  Nbytes/element       (bytes/element)   36

Communication Parameters
  throughput(ideal)    (Mbps)            500
  a(input)             0
  a(output)            0

Computation Parameters
  Nops/element         (ops/element)     16400
  throughput(pipeline) (ops/cycle)       0.33333
  Npipelines                             15
  f(clock)             (MHz)             75/100/150

Software Parameters
  t(soft)              (sec)             5.76
  N                    (iterations)      1

Modified RAT Input Parameters of MD


             Predicted  Predicted  Predicted   Actual
f(clock)        75         100        150        100
tcomm         2.62E-3    2.62E-3    2.62E-3    1.39E-3
tcomp         7.17E-1    5.37E-1    3.58E-1    8.79E-1
utilcomm       0.4%       0.5%       0.7%       0.2%
utilcomp      99.6%      99.5%      99.3%      99.8%
tRC           7.19E-1    5.40E-1    3.61E-1    8.80E-1
speedup          8        10.7        16         6.6

Performance Parameters of MD











Pipelined Performance Prediction

* Molecular Dynamics
  - If a pipeline is possible, certain parameters become obsolete
    - Number of operations in the pipeline (i.e., depth) is not important
    - Number of pipeline stalls becomes critical and is much more meaningful for non-deterministic apps
* Parameters
  - Nelements
    - 16384^2
    - Number of molecular pairs
  - Nclks/element
    - 3
    - i.e., up to two cycles can be stalls
  - Npipelines
    - 15
    - Same number of kernels as before


Dataset Parameters
  Nelements      (elements)        (16384)^2
  Nclks/element  (cycles/element)  3
  Npipelines                       15
  Depthpipeline  (cycles)          100
  f(clock)       (MHz)             100
  t(soft)        (sec)             5.76

Predicted Performance
  tRC            (sec)             0.537
  Speedup                          10.7

Pipelined RAT Input Parameters of MD

Modified RC Execution Time Equation:

    t_RC = (N_elements x N_clks/element) / (N_kernels x f_clk)
           + Depth_pipeline / f_clk + t_comm



















"And now for something completely different"


-Monty


Python


CHREC
NSF Center for High-Performance
SReconfigurable Computing


Image from "Monty Python and the Holy Grail".
For academic purposes only.


UF I FCvjR~liij k
Vu'giac BYU
[Ip1lch


(Or


is


it?)







Leveraging Algorithm Designs

* Introduction
  - Molecular dynamics provided several lessons learned
    - Best design practices for coding in Impulse C
    - Algorithm optimizations for maximum performance
    - Memory staging for minimal footprint and delay
      - Sacrificing computation efficiency for decreased memory accesses
* Motivations and Challenges
  - Application designs should educate the researcher
    - Successes and mistakes are retained to expedite future apps.
    - Application designs should also train other researchers
  - Unfortunately, new designing can be expensive
    - Collecting application knowledge into design patterns provides distilled lessons learned for efficient application design









What are Design Patterns?

* Object-oriented software engineering:
  - "A design pattern names, abstracts, and identifies the key aspects of a common design structure that make it useful for creating a reusable object-oriented design" [1]
* Reconfigurable computing:
  - "Design patterns offer us organizing and structuring principles that help us understand how to put building blocks (e.g., adders, multipliers, FIRs) together." [2]

1. Gamma, Erich, et al., Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Boston, 1995.
2. DeHon, Andre, et al., "Design Patterns for Reconfigurable Computing", Proceedings of the 12th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), April 20-23, 2004, Napa, California.







Classification of Design Patterns - OO Textbook [1]

* Pattern categories
  - Creational
    - Abstract Factory
    - Prototype
    - Singleton
    - etc.
  - Structural
    - Adapter
    - Bridge
    - Proxy
    - etc.
  - Behavioral
    - Iterator
    - Mediator
    - Interpreter
    - etc.
* Describing Patterns
  - Pattern name
  - Intent
  - Also known as
  - Motivation
  - Applicability
  - Structure
  - Participants
  - Collaborations
  - Consequences
  - Implementation
  - Sample code
  - Known uses
  - Related patterns









Sample Design Patterns - RC Paper [2]

* 14 pattern categories, 89 patterns identified (samples):
  - Area-Time Tradeoffs: Coarse-Grained Time Multiplexing, Synchronous Dataflow
  - Expressing Parallelism: Multi-threaded
  - Implementing Parallelism: Sequential vs. Parallel Implementation
  - Processor-FPGA Integration (hardware-software partitioning)
  - Common-Case Optimization: SIMD, Communicating FSMDs
  - Re-using Hardware Efficiently: Instruction augmentation
  - Specialization: Exceptions
  - Partial Reconfiguration: Pipelining, Worst-Case Footprint
  - Communications: Streaming Data
  - Synchronization: Shared Memory
  - Efficient Layout and Communications: Synchronous Clocking
  - Implementing Communication: Asynchronous Handshaking, Cellular Automata
  - Value-Added Memory Patterns: Token Ring
  - Number Representation Patterns: etc.









Example: Datapath Duplication

Replicated computational structures for parallel processing

Intent: Exploiting computation parallelism in sequential programming structures (loops)
Motivation: Achieving faster performance through replication of computational structures
Applicability: Data-independent computations; no feedback loops (acyclic dataflow)
Participants: Single computational kernel; buffer/accumulator
Collaborations: Control algorithm directs dataflow and synchronization
Consequences: Area-time tradeoff; higher processing speed at the cost of increased implementation footprint in hardware
Known Uses: PDF estimation, BbNN implementation, MD, etc.
Implementation: Centralized controller orchestrates data movement and synchronization of parallel processing elements

[Figure: parallel data input fanned out to replicated kernels K1..Kj feeding a buffer/accumulator]












Example Design Pattern: Pipelining

Description
* Name
  - Pipelining (a.k.a. Instruction Pipelining)
* Motivation
  - Instruction throughput could be increased if the design allows the possibility to execute more instructions per unit of time
  - "Chain"-like instruction execution yields increased processing speeds
* Consequences
  - Stall or wasted cycles for non-independent instructions in design
  - Extra registers and flip-flops in data path
* Known uses
  - Algorithm/program with instruction independency in its structure

[Figure: Structure - inputs I1, I2, I3, I4 streaming through overlapped instruction execution over time]

Implementation
Pseudo HLL code:

void PipelineFunction() {
    while (...) {
        Instruction 1;
        Instruction 2;
        Instruction 3;
    }
}

Equivalent VHDL code (sketch):

entity PipeliningSampleCode is
    Port ( I : in  std_logic_vector(31 downto 0);
           O : out std_logic_vector(31 downto 0));
end PipeliningSampleCode;

architecture arch of PipeliningSampleCode is
    -- signal declarations for intermediate values
begin
    s1 <= Instruction 1;
    s2 <= Instruction 2 (s1);
    s3 <= Instruction 3 (s2);
    -- Control for dependent instructions
end arch;











Example Design Pattern: Memory Dependency

Description
* Name
  - Memory dependency resolution for efficient pipeline implementations
* Motivation
  - Resolve memory dependencies in computations for efficient pipeline implementations
* Applicability
  - Memory dependency may arise due to:
    - Multiple reads from the same memory in a single clock cycle
    - Multiple reads and writes to a memory in a single clock cycle
    - Multiple writes to a memory in a single clock cycle

    for (i=0; i<N; i++) {
        c[i] = a[i] + a[i+1];    /* two reads of a[] per cycle */
    }

  - Memory dependency resolutions
    - Two parallel reads can be implemented using dual-ported memories where possible
    - Modifying operations to serialize memory accesses

    for (i=0; i<N; i++) {
        a0 = a1;                 /* registers carry the prior value */
        a1 = a[i];               /* single read of a[] per cycle    */
        if (i>0) { c[i] = a0 + a1; }
    }

* Consequences
  - Increasing number of pipeline stages
    - Not a problem for large number of iterations










System-level Patterns for MD

* When designing MD, the initial goal is to decompose the algorithm into parallel kernels
  - "Datapath duplication" is a potential starting pattern
  - MD will require additional modifications since the computational structure will not divide cleanly

[Figure: Visualization of Datapath Duplication - j kernels implemented in parallel]

"On-line Shopping" for Design Patterns:
  "What do customers buy after viewing this item?"
    67% use this pattern
    37% alternatively use ....
  "May we also recommend:"
    Pipelining
    Loop Fusion











Kernel-level Optimization Patterns for MD

* 2-D arrays
  - SW addressing is handled by the C compiler
  - HW should be explicit
* Loop fusion
  - Fairly straightforward in explicit languages
  - Challenging to make efficient in other HLLs
* Memory dependencies
  - Shared bank
    - Repeat accesses in pipeline cause stalls
  - Write after read
    - Double access, even of the same memory location, similarly causes stalls

Pattern Utilization (excerpt of the SW baseline kernel):

void ComputeAccel() {
    double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
    int j1, j2, n, k;
    rrCut = RCUT*RCUT;
    ...
    for (k=0; k<3; k++) { ... }
    if (rr < rrCut) {
        ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
        fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
        for (k=0; k<3; k++) { ... }
        potEnergy += 4.0*ri6*(ri6-1.0) - Uc - Duc*(r1-RCUT);
    }
}


CHREC
NSF Center for High-Performance
Reconfigurable Computing


UF UNIVERSITY of
'i4mI BYU
[Ip1lch











Design Pattern Effects on MD

Stall sources in the fully pipelined design (Type / Stall Cycles):
  Nested loop: pipeline depth x outer loop iterations
  Possible bank conflict: 3 iterations access each bank, 1 extra access each
  Accumulation conflicts: energy calc is longest

C baseline code for MD:

void ComputeAccel() {
    double dr[3], f, fcVal, rrCut, rr, ri2, ri6, r1;
    int j1, j2, n, k;
    rrCut = RCUT*RCUT;
    for (n=0; n<nAtom; n++)
        for (k=0; k<3; k++) ra[n][k] = 0.0;
    potEnergy = 0.0;
    for (j1=0; j1<nAtom-1; j1++) {
        for (j2=j1+1; j2<nAtom; j2++) {
            for (rr=0.0, k=0; k<3; k++) {
                dr[k] = r[j1][k] - r[j2][k];
                dr[k] = dr[k] - SignR(RegionH[k], dr[k]-RegionH[k])
                              - SignR(RegionH[k], dr[k]+RegionH[k]);
                rr = rr + dr[k]*dr[k];
            }
            if (rr < rrCut) {
                ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
                fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
                for (k=0; k<3; k++) {
                    f = fcVal*dr[k];
                    ra[j1][k] = ra[j1][k] + f;
                    ra[j2][k] = ra[j2][k] - f;
                }
                potEnergy += 4.0*ri6*(ri6-1.0) - Uc - Duc*(r1-RCUT);
            }
        }
    }
}

Carte MD, fully pipelined, 282-cycle depth:

cg_count_ceil_32(1, 0, i==0, num-2, &k);
cg_count_ceil_32(1, 0, i==0, num-2, &j2);
cg_count_ceil_32(j2==0, 0, i==0, num, &j1);
if (j2 >= j1) j2++;

if (j2==0) rr = 0.0;
split_64to32_flt_flt(AL[j1], &j1y, &j1x);
split_64to32_flt_flt(BL[j1], &dummy, &j1z);
split_64to32_flt_flt(CL[j2], &j2y, &j2x);
split_64to32_flt_flt(DL[j2], &dummy, &j2z);

if (j1 < j2) { dr0 = j1x - j2x; dr1 = j1y - j2y; dr2 = j1z - j2z; }
else         { dr0 = j2x - j1x; dr1 = j2y - j1y; dr2 = j2z - j1z; }

dr0 = dr0 - (dr0 > REGIONH0 ? REGIONH0 : MREGIONH0)
          - (dr0 > MREGIONH0 ? REGIONH0 : MREGIONH0);
dr1 = dr1 - (dr1 > REGIONH1 ? REGIONH1 : MREGIONH1)
          - (dr1 > MREGIONH1 ? REGIONH1 : MREGIONH1);
dr2 = dr2 - (dr2 > REGIONH2 ? REGIONH2 : MREGIONH2)
          - (dr2 > MREGIONH2 ? REGIONH2 : MREGIONH2);

rr = dr0*dr0 + dr1*dr1 + dr2*dr2;
ri2 = 1.0/rr; ri6 = ri2*ri2*ri2; r1 = sqrt(rr);
fcVal = 48.0*ri2*ri6*(ri6-0.5) + Duc/r1;
fx = fcVal*dr0; fy = fcVal*dr1; fz = fcVal*dr2;

if (j2 < j1) { fx = -fx; fy = -fy; fz = -fz; }

fp_accum_32(fx, k==(num-2), 1, k==0, &ja1x, &err);
fp_accum_32(fy, k==(num-2), 1, k==0, &ja1y, &err);
fp_accum_32(fz, k==(num-2), 1, k==0, &ja1z, &err);
if (rr /* ... */) {
    comb_32to64_flt_flt(ja1y, ja1x, &EL[j1]);
    comb_32to64_flt_flt(0, ja1z, &FL[j1]);
    fp_accum_32(4.0*ri6*(ri6-1.0) - Uc - Duc*(r1-RCUT),
                i==lim-1, j1 /* ... */);
}









Conclusions

* Performance prediction is a powerful technique for improving efficiency of RC application formulation
  - Provides reasonable accuracy for a rough estimate
  - Encourages importance of numerical precision and resource utilization in performance prediction
* Design patterns provide lessons-learned documentation
  - Records and disseminates algorithm design knowledge
  - Allows for more effective formulation of future designs
* Future Work
  - Improve connection b/w design patterns and performance prediction
  - Expand design pattern methodology for better integration with RC
  - Increase role of numerical precision in performance prediction





