Scalable and Portable Architecture for
Probability Density Function Estimation on FPGAs
Karthik Nagarajan, Brian Holland, Clint Slatton, Alan D. George
NSF Center for HighPerformance Reconfigurable Computing (CHREC)
ECE Department, University of Florida
Email: [nagartuiii, .ii 'lltJi. \I,,tti, n. georgej @chrec.org
1. Introduction
Significant success has been achieved in FPGA
application research by selecting, developing, and
testing diverse applications from various fields. The
Parzen windowbased, nonparametric estimation
algorithm for Probability Density Function (PDF) is
known to suffer from superlinear increases in
computational time with increasing data size and
number of signals being processed. It forms an integral
part of numerous machinelearning applications and
hence fast and efficient execution of the algorithm is
highly desirable. Different FPGA architectures (i.e.
diverse platforms) offer varying degrees of
performance for different applications. Thus, it is best
to determine at a preliminary stage of system design
the maximum performances attainable in migrating a
specific algorithm to a variety of FPGA platforms.
Along this line, this paper describes the design,
development, and analysis of a scalable and portable
architecture for multidimensional, nonparametric
PDF estimation using Gaussian kernels on FPGAs.
2. Algorithm Overview
The computational complexity of the Parzen
window algorithm [1] is of order O(Nnd), where N is the
total number of data points, n is the number of discrete
points at which the PDF along a dimension are
estimated (i.e. bins), and d is the number of dimensions.
The data flow in the algorithm follows an exhaustive
data permutation pattern wherein all of the input data
affects the value of every output data point.
Mathematically, the probability that point i falls in a d
dimensional space is given by
xr1/ h_d (i d 11)2 (y,y2 )2
pi .. e 2h ..e 2h (1)
n1",nd j=l j2=1
This work was supported in part by the I/UCRC Program of the
National Science Foundation under Grant No. EEC0642422. The
authors also gratefully acknowledge vendor equipment and/or tools
provided by Xilinx, Cray, and Nallatech that helped make this work
possible.
where h is a tuning parameter of the algorithm, (x',...yi)
denotes the bin at which the PDF is estimated, and
(xi,...,yi) denotes the input in a ddimensional space. To
make the algorithm more suitable to hardware, a
secondorder Taylor series expansion is applied to the
exponential function. Specifically, the expression for
the 1D PDF is given in Eq. 2.
P ,1 h (x )2 (2)
Pi ,=hn 2h2
3. FPGA Core Design
After understanding the PDF estimation algorithm,
an FPGA core design to efficiently perform the
algorithm can be crafted. The general architecture of the
1D PDF estimation algorithm is highlighted in Fig. 1.
A kernel computes the PDF value for a particular data
sample x, at a particular point x' (i.e. at a particular bin).
Parallel pipelines for k such kernels are created. Each
kernel is seeded differently and performs the same set
of operations on every input data sample independently.
Load balancing and reducing communication and
synchronization requirements are needed to ensure that
hardware outperforms its sequential software
counterpart. The Parzen window technique is an
embarrassingly parallel algorithm that has no
communication between kernels and requires minimal
synchronization.
FPGA Memory

_ ,,_ .r _' . _
x x < X X X X x
+ +i +
b. f 0, B' ... E ". 3 ." B i'
031 3243 495 6127 12159 100191 192223 '12
PDF Kernel
Figure 1. Architecture of 1D PDF algorithm
This trait allows the possibility to distribute data
samples across kernels that can process data in a
parallel fashion (see Fig. 1). In estimating multi
dimensional PDFs, operations in each dimension are
independent of the others and hence performed in a
parallel fashion within each pipeline. Internal
registering for each bin keeps a running total of the
impact of all processed data. These cumulative totals
comprise the final estimation of the PDF function.
The RC Amenability Test (RAT) [2] is a
ilnc'li dl .'l', developed for predicting the performance
(speedup) of a specific application design on a specific
platform and features two important steps. The first step
deals with estimating the communication burden (tcomm)
involved in transferring data in and out of the FPGA.
The second step involves the estimation of time spent in
performing computation (tcomp) over the data transferred
under an assumed FPGA clock frequency. The software
baseline for computing speedup values was measured
from an optimized C program executed on a 3.2 GHz
Xeon processor using singleprecision floating point.
The predicted speedups for the 1D and 2D PDF
algorithms were 10.5 and 6.8, respectively.
4. Results
The experimental platform primarily used in these
experiments was a Nallatech H101 board with a Xilinx
Virtex4 LX100 FPGA. The board communicates with
the host processor over a PCIX interconnect. A 32bit
middleware communication channel is used of which 9
bits were allocated for the fixedpoint fractional
segment. The singlecore 1D and 2D PDF designs
operated at FPGA clock speeds of 150 MHz and 100
MHz, respectively. The number of kernels, k, in the
core and the number of data samples per transfer were
set to 8 and 512, respectively. It can be seen from the
comparisons made in Table 1A that the actual speedups
obtained are reasonably close to the predicted speedups.
The deviation between the values is primarily due to the
inaccurate estimate of tcomm values (variation in the
interconnect channel efficiency for small data
transfers). The tco,,p predictions are relatively closer to
the experimental values owing to the deterministic
nature of the algorithm.
Dualcore architectures were developed and
evaluated (see Table 1B) to study inherent
characteristics for scalability on single or multiple
device systems by replicating the singlecore designs
sharing the same interconnect. In addition to having
scalability impacts, application characteristics also
contribute greatly to platform selection and
performance. The utilization factors in RAT indicate
the nature of a particular design in terms of whether it is
communicationbound or computationbound and aid in
the selection of an ideal platform. While platforms with
FPGAs having more multipliers (e.g., the Cray XD1
with Virtex2 Pro FPGAs) and faster interconnects
(e.g., RapidArray in XD1) might intuitively suggest
improvements in speedup by performing more
computations in the PDF algorithm in parallel, these
could be negated by poor read and write efficiencies
during data communication. This case was particularly
true with the Cray XD1 system (see Table 1C).
Although it housed a theoretically faster interconnect,
significantly low read speeds (4MB/s for small data
transfers) led to a small speedup in the 2D PDF design.
Table 1. Speedup values for A) singlecore designs, B)
dualcore designs, and C) designs across platforms
Predicted Experimental
A 1D 2D 1D 2D
tc,,m(sec) 6.0E6 1.6E3 2.5E5 1.1E2
tcomp(sec) 1.3E4 5.6E2 1.4E4 6.5E2
Speedup 10.5 6.8 7.8 5.2
B Singlecore Dualcore
1D 2D 1D 2D
Speedup 7.8 5.2 13.4 7.2
Nallatech Cray XD1
1D 2D 1D 2D
Speedup 7.8 5.2 20.6 4.0
5. Conclusions
Significant performance improvements in terms of
speedup were obtained from our design for the Parzen
windowbased PDF estimation algorithm on FPGAs.
Dualcore architectures were developed on a single
device system by exploiting scalability in the algorithm.
Key design parameters were identified for tuning the
architecture to suit different platforms. We believe that
with the successful design of a scalable and portable
architecture for the PDF algorithm, rapid development
of hardware designs for similarly patterned algorithms
(e.g. kmeans clustering, correlation functions, and
information theoretic measures) is possible by
investigating and generalizing design patterns.
6. References
[1] R. O. Duda, P. E. Hart, and D. G. Stork, "Pattern
Classification," 2nd edition, John Wiley & Sons, Inc., 2001.
[2] Brian Holland, Karthik Nagarajan, Chris Conger, Adam
Jacobs, and Alan D. George, "RAT: A Methodology for
Predicting Performance in Application Design Migration to
FPGAs," Proc. of HighPerformance Rt ,,ti'iilr'ltI
Computing Technologies & Applications Workshop
(HPRCTA 2007), SC'07, Reno, NV, Nov. 11, 2007.
