Group Title: Design and analysis of parallel N-Queens on reconfigurable hardware with Handel-C and MPI
Title: Presentation
Permanent Link: http://ufdc.ufl.edu/UF00094756/00002
 Material Information
Title: Presentation
Physical Description: Book
Language: English
Creator: Aggarwal, Vikas
Troxel, Ian
George, Alan D.
Publisher: Aggarwal et al.
Place of Publication: Gainesville, Fla.
 Record Information
Bibliographic ID: UF00094756
Volume ID: VID00002
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

UNIVERSITY OF
FLORIDA


Design and Analysis of Parallel N-Queens on
Reconfigurable Hardware with Handel-C and MPI


Vikas Aggarwal, Ian Troxel, and Alan D. George

High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida
Gainesville, FL


#198 MAPLD 2004


Aggarwal





Outline


* Introduction
* N-Queens Solutions
* Backtracking Approach
* N-Queens Parallelization
* Experimental Setup
* Handel-C and Lessons Learned
* Results and Analysis
* Conclusions
* Future Work and Acknowledgements
* References





Introduction


* N-Queens dates back to the 19th century
(studied by Gauss)

* Classical combinatorial problem, widely used
as a benchmark because of its simple and regular structure

* Problem involves placing N queens on an N x N chessboard such
that no queen can attack any other

* Benchmark code versions include finding the first solution and
finding all solutions







Introduction



* Mathematically stated:
Find a permutation of the BOARD() vector containing
numbers 1:N, such that for any i != j

    Board(i) - i != Board(j) - j
    Board(i) + i != Board(j) + j

[Figure: 5 x 5 chessboard with one queen per column; BOARD[] = 1 3 5 2 4]
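The two diagonal conditions can be checked mechanically. The following is a minimal Python sketch, not code from the talk: the function name `is_valid_board` is an assumption, and the board is taken as a 1-based permutation with one entry per column, as on the slide.

```python
def is_valid_board(board):
    """Return True if board (a permutation of 1..N) has no two attacking queens."""
    n = len(board)
    # A permutation guarantees exactly one queen per row and per column.
    if sorted(board) != list(range(1, n + 1)):
        return False
    # Board(i) - i and Board(i) + i must each be distinct over all columns;
    # a repeated value means two queens share a diagonal.
    diffs = {board[i] - (i + 1) for i in range(n)}
    sums = {board[i] + (i + 1) for i in range(n)}
    return len(diffs) == n and len(sums) == n


print(is_valid_board([1, 3, 5, 2, 4]))  # the BOARD[] example from the slide
print(is_valid_board([1, 2, 3, 4, 5]))  # every queen on the main diagonal
```

Using sets turns the pairwise "for any i != j" test into a distinctness check, which is equivalent and avoids the double loop.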







N-Queens Solutions


Various approaches to the problem
* Brute force [2]
* Local search algorithms [4]
* Backtracking [2], [7], [11], [12], [13]
* Divide and conquer approach
* Permutation generation [2]
* Mathematical solutions [6]
* Graph theory concepts [2]
* Heuristics and AI [4], [14]





Backtracking Approach


* One of the only approaches that guarantees a solution, though it can be slow
* Can be seen as a form of intelligent depth-first search
* Complexity of backtracking typically rises exponentially with problem size
* Good test case for performance analysis of RC systems, as the problem is complex even for small data sizes
* Traditional processors provide a suboptimal platform for this iterative application due to the serial nature of their processing pipelines
    * Tremendous speedups achieved by adding parallelism at the logic level via RC

Note: For an 8x8 board, 981 moves (876 tests + 105 backtracks) are required for the first solution alone
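The depth-first search with backtracking can be sketched in software as follows. This is an illustrative Python version under my own assumptions (function name `first_solution`, rows tried in ascending order, columns filled left to right), not the Handel-C design from the talk, and its move count is not claimed to match the figures quoted on the slide.

```python
def first_solution(n):
    """Depth-first backtracking: return the first solution as 1-based rows, or None."""
    board = []  # board[c] = 0-based row of the queen in column c

    def safe(row):
        col = len(board)
        # No earlier queen may share the row or either diagonal.
        return all(r != row and abs(r - row) != col - c for c, r in enumerate(board))

    def place():
        if len(board) == n:
            return True
        for row in range(n):
            if safe(row):
                board.append(row)
                if place():
                    return True
                board.pop()  # backtrack: undo the placement and try the next row
        return False

    return [r + 1 for r in board] if place() else None


print(first_solution(8))
```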







Backtracking Approach



* Tables provide an estimate of the backtracking approach's complexity
    * Problem can be made to find first solution or the total number of solutions
    * Total number of solutions is obviously a more challenging problem
    * Interesting observation: 1st solution's complexity (i.e. number of operations) does not increase monotonically with board size

# of operations for 1st solution [7]
(step counts in the order listed; the board-size column did not survive extraction)

15; 30; 44; 196; 365; 558; 981; 1,067; 1,463; 3,315; 21,624; 28,380;
50,710; 96,579; 170,748; 188,133; 609,996; 784,510; 1,265,433; 4,192,125;
10,289,900; 10,737,522; 12,717,586; 39,955,071; 45,966,735; 87,182,230;
421,959,504; 463,557,767; ?,749,317,724; 2,887,216,497; 80,311,184,345

Number of solutions [8]

Board Size (length of one side      Number of Solutions to
of N x N chessboard)                N Queens Problem
 1                                  1
 2                                  0
 3                                  0
 4                                  2
 5                                  10
 6                                  4
 7                                  40
 8                                  92
 9                                  352
10                                  724
11                                  2680
12                                  14200
13                                  73712
14                                  365596
15                                  2279184
16                                  14772512
17                                  95815104
18                                  666090624
19                                  4968057848
20                                  39029188884
21                                  314666222712
22                                  2691008701644
23                                  24233937684440
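The small entries of the number-of-solutions table can be reproduced by exhaustive backtracking. The sketch below is an assumed Python illustration (the name `count_solutions` is mine); it tracks occupied rows and diagonals in sets rather than rescanning the board.

```python
def count_solutions(n):
    """Count all N-Queens solutions by exhaustive backtracking."""
    def place(col, rows, diag_down, diag_up):
        if col == n:
            return 1
        total = 0
        for row in range(n):
            # Skip rows and diagonals already occupied by an earlier queen.
            if row in rows or (row - col) in diag_down or (row + col) in diag_up:
                continue
            total += place(col + 1,
                           rows | {row},
                           diag_down | {row - col},
                           diag_up | {row + col})
        return total

    return place(0, frozenset(), frozenset(), frozenset())


for n in range(1, 9):
    print(n, count_solutions(n))
```

For board sizes 1 through 8 this prints the counts 1, 0, 0, 2, 10, 4, 40 and 92, in agreement with the table.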







N-Queens Parallelization

* Different levels of parallelism added to improve performance
    * Functional Unit replication
    * Parallel column check
    * Parallel row check

Parallelization Comparison
* Sequential: 11 cycles
* Parallel column check: 3 cycles
* Multiple row check appended: 1 cycle
* 11x speedup over sequential operation


Note: Assume first four queens have been placed and the fifth queen starts from the 1st row
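The structure of the column and row checks can be modelled in software. The Python sketch below is only an analogy under stated assumptions: `column_check` and `row_check` are hypothetical names, the four-queen placement is an example of my own rather than the one in the slide's figure, and the comparisons that hardware would evaluate in one cycle are simply evaluated one after another here.

```python
def column_check(placed, row):
    """Check one candidate row against every already-filled column.

    In hardware these comparisons can all be evaluated in the same cycle;
    all() stands in for that wide AND.
    """
    col = len(placed)
    return all(r != row and abs(r - row) != col - c for c, r in enumerate(placed))


def row_check(placed, n):
    """Evaluate every candidate row of the next column (one flag per row)."""
    return [column_check(placed, row) for row in range(n)]


placed = [0, 2, 4, 1]        # hypothetical example: four queens placed, 0-based rows
print(row_check(placed, 5))  # validity flag for each row of the fifth column
```

A sequential design would step through both loops one comparison at a time; the parallel column check collapses the inner loop, and appending row checks collapses part or all of the outer one.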







Experimental setup


* Experiments conducted using RC1000 boards from Celoxica, Inc., and Tarari RC boards from Tarari, Inc.
* Each RC1000 board features a Xilinx Virtex-2000 FPGA, 8 MB of on-card SRAM, and PCI Mezzanine Card (PMC) sockets for connecting two daughter cards
* Each Tarari board features two user-programmable Xilinx Virtex-II FPGAs in addition to a controller FPGA, and 256 MB of DDR SDRAM
* Configurations designed in Handel-C using Celoxica's application mapping tool DK-2, along with Xilinx ISE for place and route
* Performance compared against 2.4 GHz Xeon server and 1.33 GHz Athlon server







Celoxica RC1000


* PCI-based card having one Xilinx FPGA and four memory banks
* FPGA configured from the host processor over the PCI bus
* Four memory banks, each of 2 MB, accessible to both the FPGA and any other device on the PCI bus
* Data transfers: The RC1000 provides 3 methods of transferring data over PCI bus between host processor and FPGA:
    * Bulk data transfers performed via memory banks
    * Two unidirectional 8 bit ports, called control and status ports, for direct comm. between FPGA and PCI bus (note: this method used in our experiments)
    * User I/O pins USER1 and USER0 for single bit communication with FPGA
* API-layer calls from host to configure and communicate with RC board

[Figure: RC1000 block diagram showing the PCI bridge, PMC sockets, FPGA, and four 512Kx32 SRAM banks. Figure courtesy of Celoxica RC1000 manual]






Tarari Content Processing Platform



* PCI-based board having 3 FPGAs and a 256 MB memory bank
* Two Xilinx Virtex-II FPGAs available for user to load configuration files from host over the PCI bus
* Each Content Processing Engine or CPE (User FPGA) configured with one or two agents
* Third FPGA acts as controller providing high-bandwidth access to memory and configuration of CPP with agents
* 256 MB of DDR SDRAM for data sharing between CPEs and the host application
* Configuration files first uploaded into the memory slots and used to configure each FPGA
* Both single-word transfers and DMA transfers supported between the host and the CPP

[Figure: Tarari CPP block diagram with PCI bus (32/64 bit, 33/66 MHz, 3.3 V). Figure courtesy of Tarari CP-DK manual]





Handel-C Programming Paradigm



Handel-C acts as a bridge between VHDL and "C"

* Comparison with conventional C
    * More explicit provisioning of parallelism within the code
    * Variables declared to have the exact bit-lengths to save space
    * Provides more bit-level manipulations beyond shifts and logic operations
    * Limited support for many ANSI C standards and extensions

* Comparison with VHDL
    * Application porting is much faster for experienced coders
    * Similar to VHDL behavioral models
    * Lacks VHDL concurrent signal assignments, which can be suspended until changes on inputs trigger them (Handel-C requires polling)
    * Provides more higher-level routines








Handel-C Design Specifics



Design makes use of the following two approaches

* Approach 1
    * Use of an array of binary numbers to hold a '1' at a particular bit position to indicate the location of queen in the column
    * A 32 x 32 board will require an array of 32 elements of 32 bits each
    * Correspondingly use bit-shift operations and logical-and operations to check diagonal and row conditions
    * More closely corresponds to the way the operations will take place on the RC fabric

* Approach 2
    * Use of an array of integers instead of binary numbers
    * Correspondingly use the mathematical model of the problem to check the validation conditions
    * Smaller variables yield better device utilization; slices occupied reduce from about 75% to about 15% for similar performance and parallelism

* Approach 2 found to be more amenable for Handel-C designs
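The difference between the two representations can be illustrated with a small sketch. This is Python rather than Handel-C, and the function names `conflict_bits` and `conflict_ints` are assumptions made for illustration; the talk's actual code is not shown in the slides. Approach 1 stores one one-hot word per column and tests conflicts with shifts and ANDs, while Approach 2 stores the row index and applies the arithmetic conditions.

```python
def conflict_bits(board_bits, cand_bit):
    """Approach 1: each column is a one-hot word; conflicts found with shifts and ANDs."""
    col = len(board_bits)
    for k, word in enumerate(board_bits):
        d = col - k  # distance between the candidate column and column k
        # Same row, or one of the two diagonals d rows away.
        if word & (cand_bit | (cand_bit << d) | (cand_bit >> d)):
            return True
    return False


def conflict_ints(board_rows, cand_row):
    """Approach 2: each column stores the queen's row as a small integer."""
    col = len(board_rows)
    return any(r == cand_row or abs(r - cand_row) == col - k
               for k, r in enumerate(board_rows))


# Queens in rows 0 and 2 of the first two columns; candidate in row 4 of the third.
print(conflict_bits([0b00001, 0b00100], 0b10000), conflict_ints([0, 2], 4))
```

In the bit version every stored word is as wide as the board, whereas the integer version needs only enough bits to hold a row index, which is in line with the slide's observation that smaller variables improve device utilization.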







Lessons Learned with Handel-C

Some interesting observations:
* Code for which place and route did not work finally worked when the function parameters were replaced by global variables
* Less control at lower level, with place and route being a consistent problem even with designs using up only 40% of total slices
* Self-referenced operations (e.g. a=a+x) affect the design adversely, so use intermediate variables
* Order of operations and conditional statements can affect design
* Useful to reduce wider-bit operations into a sequence of narrower-bit operations
* Balancing "if" with "else" branches leads to better designs
* Comments in the main program sometimes affected the synthesis, leading to place and route errors in fully commented code
* We are still learning more every day!










Sequential First-Solution Results



[Figure: Performance Comparison of Sequential Version with Host. x-axis: Board Size (ordered 1, 5, 4, 7, 6, 9, 11, 8, 10, 13, 12, 15, 14, 19, 17, 16, 21, 23, 18, 25, 20, 24). Series: RC1000, Dual Xeon Server, Athlon Server. RC1000 clock speed @ 40 MHz]

[Figure: Performance Comparison of Similar Version of Bit and Integer Algorithms. x-axis: Board Size (ordered 1, 5, 4, 7, 6, 9, 11, 8, 10, 13, 12, 15, 14, 19, 17, 16, 18, 20). Series: RC1000 (Bit Version), RC1000 (Integer Version). RC1000 clock speed @ 40 MHz]

* Sequential version does not perform well versus the Xeon and Athlon CPUs
* Algorithm needs an efficient design to minimize resource utilization
* The results do not include the one-time configuration overhead of ~150 ms

Algorithm type                   Bit manipulation   Integer manipulation
Parallel column checks           19%                3%
Parallel row and column checks   78%                15%








Parallel First-Solution Results


[Figure: Performance Comparison of Parallel Algorithm (two panels). x-axis: Board Size. Series: RC1000 (Parallel column check), RC1000 (2 Row check appended), RC1000 (6 Row check appended), Dual Xeon Server, Athlon Server. RC1000 clock speed @ 25 MHz]

* The most parallel algorithm runs about 20x faster than sequential algorithm on RC fabric
* Parallel algorithm with two row checks almost duplicates behavior of 2.4 GHz Xeon server, while 6-row check outperforms it by 74%
* Further increasing the number of rows checked is likely to further improve performance for larger problem sizes

Version                                       Slices Occupied
With parallel column check (for all columns)  3%
With two row check appended                   5%
With six row check appended                   15%

Version                                       Speedup
With parallel column check (for columns)      0.18
With 2-row check appended                     0.83
With 6-row check appended                     1.74





Total Number of Solutions Method


* Employ divide-and-conquer approach
* Seen as a parallel depth-first search
* Solutions obtained with queen positioned in any row in the first column are independent from solutions with queens in other positions
* Technique allows for high degree of parallelism (DoP)
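The independence of the sub-problems can be seen in a short sketch: fixing the first-column queen in each row in turn gives disjoint searches whose counts simply add up. This is an assumed Python illustration (the name `count_from` is mine), not the hardware design.

```python
def count_from(n, first_row):
    """Count solutions whose first-column queen sits in first_row (0-based)."""
    def place(col, rows):
        if col == n:
            return 1
        total = 0
        for row in range(n):
            if all(r != row and abs(r - row) != col - c for c, r in enumerate(rows)):
                total += place(col + 1, rows + [row])
        return total

    return place(1, [first_row])


n = 6
per_row = [count_from(n, r) for r in range(n)]
print(per_row, sum(per_row))
```

Because no sub-search reads or writes state belonging to another, each one can be handed to a separate worker without any coordination beyond the final sum.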









One-Board Total-Solutions Results


[Figure: Comparison of Tarari CPP vs. RC1000. x-axis: Board Size (4 to 17). Series: Tarari 1FU, RC1000 1FU, Xeon Server, Athlon Server. RC1000 and Tarari clock speed @ 33 MHz]

Target Platform           Area (slices)
RC1000 (Virtex 2000e)     10%
Tarari CPP (Virtex-II)    94%

* Designs on hardware perform around 1.7x faster than Xeon server
* Performance on both RC platforms similar for same clock rates
* RC1000 performs a notch better for smaller chess board sizes while Tarari CPP's performance improves with chess board sizes
* Almost entire Virtex-II chip on the Tarari is occupied for one FU





Multiple Functional Units (FUs)


* Used additional FUs per chip to increase parallelism per chip
* Each FU searches for the number of solutions corresponding to a subset of rows in the first column
* The controller
    * Handles communication with the host
    * Invokes all FUs in parallel
    * Combines all results

[Figure: host processor connected to an on-chip controller that drives multiple FUs]
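The controller's role can be mimicked in software. In the Python sketch below, `functional_unit`, `controller`, and the round-robin assignment of first-column rows are all assumptions made for illustration; the slides do not say how rows are divided among FUs. Threads stand in for FUs only structurally, since Python threads do not give true hardware-style parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def count_subtree(n, rows):
    """Count completions of a partial placement (rows[c] = row of the queen in column c)."""
    col = len(rows)
    if col == n:
        return 1
    return sum(
        count_subtree(n, rows + [row])
        for row in range(n)
        if all(r != row and abs(r - row) != col - c for c, r in enumerate(rows))
    )


def functional_unit(n, first_rows):
    """One FU: search the sub-problems for its subset of first-column rows."""
    return sum(count_subtree(n, [r]) for r in first_rows)


def controller(n, num_fus):
    """Split first-column rows round-robin, invoke all FUs, and combine their counts."""
    subsets = [list(range(fu, n, num_fus)) for fu in range(num_fus)]
    with ThreadPoolExecutor(max_workers=num_fus) as pool:
        partial = list(pool.map(lambda rows: functional_unit(n, rows), subsets))
    return sum(partial)


print(controller(8, 3))
```

With an 8 x 8 board and three FUs the combined count is 92, the same total a single FU would report.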









Total-Solutions Results with Multiple FUs


[Figure: Performance Comparison with Host. x-axis: Board Size (4 to 17). Series: RC1000 1fu, RC1000 2fu, RC1000 3fu, Xeon Server, Athlon Server. RC1000 clock speed @ 30 MHz]

* RC1000 with three FUs performs almost 5x faster than Xeon server
* Speedup increases near linearly with number of FUs
* Area occupied scales linearly with number of FUs

N-Queens Optimization    Area (slices)
1 Functional Unit        10%
2 Functional Units       21%
3 Functional Units       29%

[Figure: Speedup vs. FU Scaling. x-axis: Number of FUs (1 to 3). RC speedup vs. Xeon server for board size of 17]






MPI for Inter-Board Communication


* To further increase system speedup (having more functional units), multiple boards employed
* Each FU programmed to search a subset of the solution space
* Servers communicate using the Message Passing Interface (MPI) to start search in parallel and obtain the final result

[Figure: two host servers, each with an on-board FPGA (with one or multiple FUs), communicating via MPI]







Total-Solutions Results with MPI





[Figure: Performance Comparison. x-axis: Board Size (4 to 17). Series: 1 Tarari, 2 Tarari, 4 Tarari, Xeon Server, Athlon Server. Tarari CPP clock speed @ 33 MHz]

[Figure: Speedup vs. Board Scaling. x-axis: Number of Boards (1, 2, 4). RC speedup vs. Xeon server for board size of 12]

* Results show total execution time including MPI overhead
* Minimal MPI overhead incurred (high computation-to-communication ratio)
* Communication overhead bounded to 3 ms regardless of problem size and initialization overhead is around 750 ms
* Overhead becomes negligible for large problem sizes
* Speedup scales near linearly with number of boards
* 4-board Tarari design performs about 6.5x faster than Xeon server







Total-Solutions Results with MPI


[Figure: Performance Comparison with Host. x-axis: Board Size (8 to 16). Series: 1 RC1000, 2 RC1000, 4 RC1000, Xeon Server, Athlon Server. RC1000 clock speed @ 30 MHz]

[Figure: Speedup vs. Board Scaling. RC speedup vs. Xeon server for board size of 12]

* Results show total execution time including MPI overhead
* Minimal MPI overhead incurred (high computation-to-communication ratio)
* Communication overhead bounded to 3 ms regardless of problem size and initialization overhead is around 750 ms
* Overhead becomes negligible for large problem sizes
* Speedup scales near linearly with number of boards
* 4-board RC1000 design performs about 12x faster than Xeon server






Total-Solutions Results with MPI

[Figure: Performance Comparison with Host. x-axis: Board Size (8 to 16). Series: On 8 Boards, Xeon Server, Athlon Server. RC1000 clock speed @ 30 MHz and Tarari clock speed @ 33 MHz]

* Communication overheads still remain low, while MPI initialization overheads increase with number of boards (now 1316 ms for 8 boards)
* Heterogeneous mixture of boards employed to solve the problem coordinating via MPI
* Total of 8 boards (4 RC1000 and 4 Tarari boards) allows up to 16 (4x3 + 4x1) FUs
* 8 boards perform about 21x faster than Xeon server for chess board size of 16
* What appears to be an unfair comparison really shows how the approach scales to many more FUs per FPGA (on higher density chips)





Conclusions

* Parallel backtracking for solving N-Queens problem in RC shows promise for performance
    * N-Queens is an important benchmark in the HPC community
    * RC devices outperform CPUs for N-Queens due to RC's efficient processing of fine-grained, parallel, bit-manipulation operations
    * Previously inefficient methods for CPUs like backtracking can be improved by reexamining their design
    * This approach can be applied to many other applications
    * Numerous parallel approaches developed at several levels
* Handel-C lessons learned
    * A "C-based" programming model for application mapping provides a degree of higher-level abstraction, yet still requires programmer to code from a hardware perspective
    * Solutions produced to date show promise for application mapping






Future Work and Acknowledgements

* Compare application mappers with HDL design in terms of mapping efficiency
* Develop and use direct communication between FPGAs to avoid MPI overhead
* Export approach featured in this talk to variety of algorithms and HPC benchmarks for performance analysis and optimization
* Develop library of application and middleware kernels for RC-based HPC

We wish to thank the following for their support of this research:
* Department of Defense
* Xilinx
* Celoxica
* Tarari
* Key vendors of our HPC cluster resources (Intel, AMD, Cisco, Nortel)










References

[1] "Divide and Conquer under Global Constraints: A Solution to the N-Queens Problem", Bruce Abramson and Mordechai M. Yung
[2] "Different Perspectives Of The N-queens Problem", Cengiz Erbas, Seyed Sarkeshik, Murat M. Tanik, Department of Computer Science and Engineering, Southern Methodist University, Dallas
[3] "Algorithms and Complexity", Herbert S. Wilf, University of Pennsylvania, Philadelphia
[4] "Fast search algorithms for N-Queens problem", Rok Sosic, Jun Gu, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 6, pp. 1572-76, Nov/Dec 1991
[5] http://www.cit.gu.edu.au/~sosic/nqueens.html
[6] http://bridges.canterbury.ac.nz/features/eight.html
[7] www.math.utah.edu/~alfeld/queens/queens.html
[8] www.jsomers.com/nqueen_demo/nqueens.html
[9] "A polynomial time algorithm for N-queens problem"
[10] remus.rutgers.edu/~rhoads/Code/code.html
[11] http://www.mactech.com/articles/mactech/Vol.13/13.12/TheEightQueensProblem/index.html
[12] http://www2.ilog.com/preview/Discovery/samples/nqueens/
[13] http://www.infosun.fmi.uni-passau.de/br/lehrstuhl/Kurse/Proseminar_ss01/backtracking_nm.pdf
[14] "From Alife Agents To A Kingdom Of N Queens", Han Jing, Jiming Liu, Cai Qingsheng
[15] http://www.wi.leidenuniv.nl/~kosters/nqueens.html
[16] http://www.dsitri.de/projects/NQP/




