Group Title: Survey of C-based application mapping tools for reconfigurable computing
Title: Presentation
Permanent Link: http://ufdc.ufl.edu/UF00094749/00002
 Material Information
Title: Presentation
Physical Description: Book
Language: English
Creator: Troxel, Ian
Holland, Brian
Vacas, Mauricio
Aggarwal, Vikas
DeVille, Ryan
George, Alan D.
Publisher: Troxel et al.
Place of Publication: Gainesville, Fla.
 Record Information
Bibliographic ID: UF00094749
Volume ID: VID00002
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

UNIVERSITY OF FLORIDA


Survey of
C-based Application Mapping Tools
for Reconfigurable Computing



Brian Holland, Mauricio Vacas, Vikas Aggarwal,
Ryan DeVille, Ian Troxel, and Alan D. George

High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida


#215 MAPLD 2005


Holland






Outline


* Introduction
* General Survey
  - Ten C-based Application Mappers
* Benchmarking & Results
  - Finite-Impulse Response (FIR)
  - N-Queens
  - Radix Sort
* Lessons Learned
* Conclusions
* Acknowledgements
* References


Mappers surveyed: Carte, Catapult C, DIME-C, Handel C, Impulse C, Mitrion C, Napa C, SA-C, Streams C, SystemC







Motivation for Application Mappers

* HDL programming has shortcomings
  - Limited applicability to application developers
  - More involved development process (vs. software)
  - Requires training beyond application level

* Instead, can we find and exploit an environment that allows a
  measure of hardware control along with increased productivity?
  - Can we bring RC performance benefits to application developers?
  - Would this be practical/possible in traditional HDL?
    - HDL is well below the level of traditional application programming
    - Consequently, we need to move to a higher level of abstraction






Introduction

[Figure: C code passed through a compiler]

* Selecting a Higher Level of Abstraction
  - CAD tools: Visually appealing, but tedious for large projects
  - New language: Optimal, but requires complete retraining
  - Traditional or Object-Oriented languages: Which? How?

* Ideally, use pure ANSI-C, "The Universal Language"
  - Requires no additional knowledge or special training
  - Port existing C programs into hardware implementations (HDL)

* Translation can be handled by a hardware compiler
  - Programmer concentrates on algorithmic functionality






Commonalities


* General characteristics of C-based application mappers:
  - Companies create proprietary ANSI C-based language
  - Languages do not have all ANSI C features
  - Extra pragmas are included for corresponding compilers
  - Additional libraries of functions/macros for further extensions
  - Must adhere to specific programming "style" for maximum optimization
  - Emphasis on both hardware generation and I/O interfaces

[Figure: user ANSI-C source, e.g. void FIR(int INPUTA, int OUTPUTB), compiled to VHDL]










Spectrum of C-based Application Mappers

[Figure, survey portion: the surveyed mappers arranged along a spectrum with the categories "Open Standard", "Generic HDL, Multiple Platforms", "Generic HDL (Optimized for Manufacturer's Hardware)", "Targets a Specific Platform/Configuration", and "Hybrid Only". Tool placements legible in the source: SystemC, Catapult C, Impulse C, Mitrion C.]

[Figure, benchmark section: "The Law of Conservation of Pain", placing ANSI-C, Impulse C, Handel C, DIME-C, and VHDL along an effort axis. Axis labels legible in the source: Software, Pragmas, Effort.]






Carte
SRC Computers
* C/Fortran FPGA environment
  - Direct mapping of C/Fortran code to configuration level
  - Software emulation and simulation of compiled code for debugging
  - Capable of multiprocessor and multi-FPGA computational definitions
  - Allows explicit data flow control within memory hierarchy
* Targets SRC's MAP processor
  - Produces "Unified Executables" for HW or SW processor execution
  - Runtime libraries handle required interfacing and management


Catapult C
Mentor Graphics
* Algorithmic synthesis tool for RTL generation
  - RTL from "pure" untimed C++
  - No extensions, pragmas, etc.
* Compiler uses "wrappers" around algorithmic code
  - External: manages I/O interface
  - Internal: constrains synthesis to optimize for chosen interface
* Explicit architectural constraints and optimization
* Output: RTL netlists in VHDL, Verilog, and SystemC







DIME-C
Nallatech
* FPGA prototyping tool
* Designs are not cycle-accurate
  - Allows application synthesis for a higher clock speed
* Compilation/Optimization
  - Pipeline/parallelize where possible
  - Included IEEE-754 FP cores
  - Dedicated (integer) multipliers
* Currently in beta, expected release: 4Q05
* Output: synthesizable VHDL and DIMEtalk components


Handel C
Celoxica
* Environment for cycle-accurate application development
* All operations occur in one deterministic clock cycle
  - Makes it cycle-accurate, but clock frequency is reduced to that of the slowest operation
  - Decisions/loops are "penalty-free" but can significantly impact timing
* Language has pragmas for explicitly defined parallelism
* Compiler can analyze, optimize, and rewrite code
* Output: VHDL/Verilog, SystemC, or targeted EDIFs







Impulse C
Impulse Accelerated Technologies
* Language/compiler for modeling sequential apps.
  - Processes: independent, potentially concurrent, computing blocks
  - Streams: communicate and synchronize processes
* Uses Streams-C methodology
  - However, focuses on compatibility with C development environments
* Compilation
  - Each process implemented as separate state machine
* Output: Generic or FPGA-specific VHDL


Mitrion C
Mitrion
* "Softcore" processor tactic
  - "Processor" creates abstraction layer between C code and FPGA
* Compilation
  - C code is mapped to a generic "API" of possible functions
  - Processor instantiated on FPGA, tailored to specific application
  - Custom instruction bit-widths, specific cache and buffer sizes
* Currently in beta, expected release: 4Q05
* Output: a VHDL IP core for target architectures








Napa C
National Semiconductor
* Language/compiler for RISC/FPGA hybrid processor
  - Capitalize on single-cycle interconnect instead of I/O bus
* Datapath Synthesis Technique
  - Hand-optimized pre-placed, pre-routed module generators
  - Compiler generates hardware pipelines from C loops
* Targets NS NAPA1000 hybrid processor
  - Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP)
  - ALP also compiles to RTL VHDL, structural VHDL, structural Verilog


SA-C
Colorado State University
* High-level, expression-oriented, machine-independent, single-assignment language
  - Designed to implicitly express data-parallel operations
  - Image and signal processing
* Compiler (UC-Irvine, UC-Riverside, Colorado State Univ.)
  - Loop optimizations
  - Structural transforms
  - Execution block placement
* Target Platforms
  - UC Irvine Morphosys; Annapolis WildForce, StarFire, WildFire







Streams C
Los Alamos National Laboratory
* Stream-oriented sequential process modeling
  - Essentially, data elements moving through discrete functional blocks
* Compiler
  - Generates multi-threaded processor executables and multiple FPGA bitstreams
  - Allows parallel C program translation into a parallel arch.
* Includes functional-level simulation environment
* Output: synthesizable RTL


SystemC
Open SystemC Initiative (OSCI)
* Open-source extension of C++ for HW/SW modeling
  - Core language, modules & ports for defining structure, and interfaces & channels
* Supports functional modeling
  - Hierarchical decomposition of a system into modules
  - Structural connectivity between modules using ports/exports
  - Scheduling and synchronization of concurrent processes using events
* Event-driven simulator
  - Events are basic dynamic/static process synchronization objects








About the Benchmarks

Three classic algorithms used for benchmarking:
* Finite-Impulse Response (FIR)
  - Simple 51-tap FIR filter for standard DSP applications
  - Compare compiler solutions and analyze their usage metrics

* N-Queens
  - Classic embarrassingly parallel HPC backtracking search problem
  - Showcases the potential of optimized implementations

* Radix Sort
  - Sorts using "binary bins", minimizing resources
  - Illustrates resource metrics in RAM-intensive applications

Implementation Details
  - DIME-C, Handel C, Impulse C, VHDL, and ANSI-C (for baseline timing)
  - Experiments performed on Nallatech BenNUEY-PCI card with Virtex-II 6000 FPGA
  - Resource utilization based on post place-and-route data
  - Runtime represents communication time (setup and verification I/O is negated)
  - Handel C and Impulse C require VHDL wrappers which can increase resource usage







Finite-Impulse Response

[Figure: FIR Resource Utilization Statistics (slices, multipliers, block RAMs, clock frequency) for DIME-C, Handel C, Impulse C, and VHDL]

[Figure: Speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]

* FIR filter containing 51 taps, each 16 bits wide (based on algorithms in [4,6])
* Various application-mapper languages do not have a consistent I/O interface
  - Could not create a consistent streaming channel with requisite blocking in every tool
  - Instead, FIR algorithm operates on values stored in a block RAM
* Obtains speedup through parallel multiplication, efficient memory accesses
  - The 51 coefficients and variables are stored in local variables
* Additional performance boosts are possible in multi-channel DSP processing
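
As a rough illustration of the kind of kernel being benchmarked, the following is a minimal ANSI-C sketch of a 51-tap, 16-bit FIR that reads its input from a buffer standing in for the block RAM. It is not the benchmark source: the direct-form structure, the zero-padding of early samples, the 64-bit accumulator, and the names `fir_filter` and `NUM_TAPS` are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TAPS 51  /* tap count from the slide; everything else is assumed */

/* Direct-form FIR over a buffer that stands in for the block RAM.
 * Writes n outputs to out; samples before in[0] are treated as zero.
 * A 64-bit accumulator is used so 51 full-scale 16x16 products cannot overflow. */
void fir_filter(const int16_t coeff[NUM_TAPS],
                const int16_t *in, int64_t *out, size_t n)
{
    assert(coeff != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        int64_t acc = 0;
        /* The multiplies within one output are independent of each other,
         * which is what gives a hardware mapper room to run them in parallel. */
        for (size_t k = 0; k < NUM_TAPS && k <= i; k++) {
            acc += (int64_t)coeff[k] * (int64_t)in[i - k];
        }
        out[i] = acc;
    }
}
```

Feeding a unit impulse through `fir_filter` should return the coefficients themselves, which makes a convenient sanity check before porting the loop to any of the mappers.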








N-Queens

[Figure: N-Queens Resource Utilization Statistics (slices, clock frequency) for DIME-C, Handel C, Impulse C, and VHDL]

[Figure: Speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]

* Represents a purely computational algorithm; virtually no communication overhead
* Algorithm contains several parallelizable code segments, exploitable for speedup
* Implementations are based upon same baseline C code
  - Every available technique and compiler optimization is employed to boost performance
* Notes:
  - Handel C N-Queens is a benchmark from our MAPLD'04 paper with additional refinements
  - VHDL N-Queens is culmination of a semester-long endeavor into algorithm's parallelism
  - DIME-C and Impulse C N-Queens are results of experimentation with beta compilers
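
To make the parallelism concrete, here is a minimal C sketch of a backtracking N-Queens solution counter. It is not the baseline code used in the study: the bitmask formulation and the names `place`, `n_queens_from_column`, and `n_queens` are assumptions chosen for illustration. Each first-row column yields an independent subproblem, which is one example of a segment that could be replicated across functional units.

```c
#include <assert.h>
#include <stdint.h>

/* Count completions of a partial placement. cols, diag_l and diag_r are
 * bitmasks of the columns attacked in the current row. */
static uint64_t place(uint32_t all, uint32_t cols,
                      uint32_t diag_l, uint32_t diag_r)
{
    if (cols == all) {
        return 1;  /* every column holds a queen, so every row is filled */
    }
    uint64_t count = 0;
    uint32_t open = all & ~(cols | diag_l | diag_r);
    while (open != 0) {
        uint32_t bit = open & (~open + 1u);  /* lowest free column */
        open ^= bit;
        count += place(all, cols | bit,
                       ((diag_l | bit) << 1) & all,
                       (diag_r | bit) >> 1);
    }
    return count;
}

/* Solutions with the first-row queen fixed in column col.
 * Calls for different col values share no state. */
uint64_t n_queens_from_column(int n, int col)
{
    assert(n >= 1 && n <= 16 && col >= 0 && col < n);
    uint32_t all = (1u << n) - 1u;
    uint32_t bit = 1u << col;
    return place(all, bit, (bit << 1) & all, bit >> 1);
}

/* Total solution count: the sum of the independent per-column searches. */
uint64_t n_queens(int n)
{
    uint64_t total = 0;
    for (int col = 0; col < n; col++) {
        total += n_queens_from_column(n, col);
    }
    return total;
}
```

Calling `n_queens(8)` counts the solutions on a standard chessboard; in a hardware port, the loop in `n_queens` is the natural place to instantiate several search units side by side.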








Radix Sort


[Figure: Radix Sort Resource Utilization Statistics (slices, block RAMs, clock frequency) for DIME-C, Handel C, Impulse C, and VHDL]

[Figure: Speedup over 2.4GHz Xeon for DIME-C, Handel C, Impulse C, VHDL, gcc -O3, and gcc -O0]


* Sorts values one bit at a time (saving significant resources vs. sorting one digit at a time)
* Represents a "worst-case" legacy algorithm, containing no functional-level parallelism
  - Every element in every iteration depends on every previous element in every iteration
  - Ideal for software processor with fast cache, challenging in FPGA hardware
* Speedup comes through efficient RAM usage and compiler optimizations/pipelining
  - Reduce quantity and addressing complexity of RAM accesses whenever possible
* Metrics are based on sorting 600 32-bit integers contained within a block RAM
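
The one-bit-per-pass scheme can be sketched in plain C as follows. This is an illustrative reconstruction, not the benchmarked code: it assumes unsigned 32-bit keys, a caller-supplied scratch buffer, and the hypothetical name `radix_sort_1bit`. Note how every pass must finish before the next can start, which is the dependency the slide describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LSD radix sort, one bit per pass, using two "binary bins" (zeros, ones).
 * Sorts n unsigned 32-bit keys in ascending order; scratch must hold n keys. */
void radix_sort_1bit(uint32_t *data, uint32_t *scratch, size_t n)
{
    assert(n == 0 || (data != NULL && scratch != NULL));
    uint32_t *src = data;
    uint32_t *dst = scratch;

    for (unsigned bit = 0; bit < 32; bit++) {
        /* Count keys whose current bit is 0 to find where the ones bin starts. */
        size_t zeros = 0;
        for (size_t i = 0; i < n; i++) {
            if (((src[i] >> bit) & 1u) == 0) {
                zeros++;
            }
        }
        /* Stable scatter into the two bins. */
        size_t z = 0;
        size_t o = zeros;
        for (size_t i = 0; i < n; i++) {
            if ((src[i] >> bit) & 1u) {
                dst[o++] = src[i];
            } else {
                dst[z++] = src[i];
            }
        }
        /* Swap roles for the next pass. */
        uint32_t *tmp = src;
        src = dst;
        dst = tmp;
    }
    /* 32 is an even number of passes, so the sorted keys end up back in data. */
}
```

Only two bins and two buffers are needed regardless of key range, which is the resource saving over a digit-at-a-time sort; the cost is 32 full passes over memory.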








Some Optimization Techniques

* Keep expensive computational operations to a minimum
  - Multiplication, division, modulo, greater/less than, and floating point are *slow*

* Minimize reliance on arrays
  Before:
      for(i=0;i<20;i++){
        a[0] = b[i];
      }
  After:
      temp = a[0];
      for(i=0;i<20;i++){
        temp = b[i];
      }
      a[0] = temp;

* Exploit functional level parallelism
  Before:
      for(i=0;i<3;i++){
        for(j=0;j<20;j++){
          a[i][j] = i+j;
        }
      }
  After:
      for(j=0;j<20;j++){
        a[j] = j;
      }
      for(j=0;j<20;j++){
        b[j] = 1+j;
      }
      for(j=0;j<20;j++){
        c[j] = 2+j;
      }

* Watch for combinable statements
  Before:
      if(flag == 1 && test == 1){
        solution++;
      }
  After:
      solution += (flag&test);

* Reduce bit-widths to minimal size
      int i;
      for(i=0;i<255;i++);

      short i;
      for(i=0;i<255;i++);

      unsigned char i;  /* unsigned, so the 8-bit counter can reach 254 */
      for(i=0;i<255;i++);
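
Before applying rewrites like these, it helps to confirm in software that both forms compute the same result. The sketch below does that for the array and combinable-statement techniques; the function names and the `LEN` constant are hypothetical and exist only for this example. It also exposes one caveat: `flag & test` matches the `&&` form only when both operands are restricted to 0 or 1.

```c
#include <assert.h>

#define LEN 20

/* "Minimize reliance on arrays": original form, a[0] written every iteration. */
int last_value_array(const int b[LEN])
{
    int a[1] = {0};
    for (int i = 0; i < LEN; i++) {
        a[0] = b[i];
    }
    return a[0];
}

/* Rewritten form: work in a scalar, store to the array once at the end. */
int last_value_scalar(const int b[LEN])
{
    int a[1] = {0};
    int temp = a[0];
    for (int i = 0; i < LEN; i++) {
        temp = b[i];
    }
    a[0] = temp;
    return a[0];
}

/* "Watch for combinable statements": original branch form. */
int count_branch(const int *flag, const int *test, int n)
{
    assert(n >= 0);
    int solution = 0;
    for (int i = 0; i < n; i++) {
        if (flag[i] == 1 && test[i] == 1) {
            solution++;
        }
    }
    return solution;
}

/* Branch-free form. Equivalent only if every flag and test is 0 or 1. */
int count_and(const int *flag, const int *test, int n)
{
    assert(n >= 0);
    int solution = 0;
    for (int i = 0; i < n; i++) {
        solution += (flag[i] & test[i]);
    }
    return solution;
}
```

With inputs limited to 0 and 1 the two counters agree; with a value such as `flag = 3, test = 1` the branch form adds nothing while the bitwise form adds 1, so the rewrite is safe only when the variables are true one-bit flags.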









Case Study: Dot Product


Color legend in the original slide: green = computation, blue = communication, orange = pragmas

DIME-C

    void Kernel(int a[50], int b[50], int answer)
    {
      int i, temp = 0;
      for(i=0;i<50;i++)
      {
        temp += a[i] * b[i];
      }
      answer = temp;
    }

    void dotproduct(int a1[50], int b1[50],
                    int a2[50], int b2[50], int answer)
    {
      int answer1, answer2;

      #pragma genusc instance Kernel1
      Kernel(a1,b1,answer1);

      #pragma genusc instance Kernel2
      Kernel(a2,b2,answer2);

      answer = answer1 + answer2;
    }

IMPULSE C

    void Kernel1(co_stream a1, co_stream b1, co_stream z1){
      int i, a[50], b[50], answer=0;
      co_stream_open(a1,O_RDONLY,INT_TYPE(32)); /*etc*/
      for(i=0;i<50;i++)
      {
        co_stream_read(a1, &a[i], sizeof(int32));
        co_stream_read(b1, &b[i], sizeof(int32));
      }
      for(i=0;i<50;i++)
      {
        #pragma CO UNROLL
        answer += a[i] * b[i];
      }
      co_stream_write(z1, &answer, sizeof(int32));
      co_stream_close(a1); /*etc*/
    }

    void Kernel2(co_stream a2, co_stream b2, co_stream z2){
      /* SAME AS IN Kernel1 */
    }

    void dot_product(co_stream z1, co_stream z2, co_stream ans){
      int i, answer, answer1, answer2;
      co_stream_open(z1,O_RDONLY,INT_TYPE(32)); /*etc*/
      co_stream_read(z1, &answer1, INT_TYPE(32));
      co_stream_read(z2, &answer2, INT_TYPE(32));
      answer = answer1 + answer2;
      co_stream_write(ans, &answer, INT_TYPE(32));
      co_stream_close(z1); /*etc*/
    }

HANDEL C

    int 32 Kernel1(int 32 a[50], int 32 b[50])
    {
      static int 32 i, temp[50], answer;
      par(i=0;i<50;i++)
      {
        temp[i] = a[i] * b[i];
      }
      for(i=0;i<50;i++)
      {
        answer += temp[i];
      }
      return answer;
    }
    int 32 Kernel2(int 32 a[50], int 32 b[50])
    /* SAME AS IN Kernel1 */

    void main() //dotproduct
    {
      int 32 a1[50]; int 32 b1[50];
      int 32 a2[50]; int 32 b2[50];
      int 32 ans1, ans2;
      int 32 answer;
      interface bus_out() OutputResult(answer);
      par
      {
        ans1 = Kernel1(a1, b1);
        ans2 = Kernel2(a2, b2);
      }
      answer = ans1 + ans2;
    }



*Not all implementations are perfectly optimized. Your mileage will vary.*
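
For comparison with the mapper listings, a plain ANSI-C version of the same two-kernel dot product might look like the sketch below. It is an assumed software baseline rather than code from the study; the names `kernel`, `dot_product`, and `VEC_LEN` are illustrative, and a 64-bit accumulator is used here to sidestep signed overflow.

```c
#include <assert.h>
#include <stdint.h>

#define VEC_LEN 50  /* vector length used in the slide's listings */

/* One dot-product kernel: multiply-accumulate over a pair of vectors. */
int64_t kernel(const int32_t a[VEC_LEN], const int32_t b[VEC_LEN])
{
    int64_t temp = 0;
    for (int i = 0; i < VEC_LEN; i++) {
        temp += (int64_t)a[i] * (int64_t)b[i];
    }
    return temp;
}

/* Two kernel calls with no shared data. In software they run back to back;
 * the mapper versions express them as two concurrent hardware instances. */
int64_t dot_product(const int32_t a1[VEC_LEN], const int32_t b1[VEC_LEN],
                    const int32_t a2[VEC_LEN], const int32_t b2[VEC_LEN])
{
    int64_t answer1 = kernel(a1, b1);
    int64_t answer2 = kernel(a2, b2);
    return answer1 + answer2;
}
```

The software form carries no pragmas or stream calls; what the three mapper versions add on top of it is the communication and the explicit request for two parallel kernel instances.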





Lessons Learned


* Tools are not near point of automatic translation
  - Programs still require some tweaking for hardware compilation
  - Optimized software C is not the same as optimized hardware C
* However, generating VHDL is significantly easier
  - Learning basics of a C-based mapper is straightforward
* At least two major challenges remain:
  - Input/output interfaces become a limiting factor
    - Moving generic VHDL to unsupported platforms requires VHDL knowledge
    - However, once a generic I/O wrapper is generated, it should be reusable
  - True hardware debugging remains a challenge
    - Another level of abstraction means another layer for mistranslation
    - With no knowledge of internal VHDL signals, tracing becomes difficult






Conclusions

* Advantages of C-based application mappers
  - Far broader audience of potential RC users with high-level languages
  - Required HDL knowledge is significantly reduced or eliminated
  - Time to preliminary results is much less than manual HDL
  - Software-to-hardware porting is considerably easier
  - Visualization of C hardware is far easier for scientific community
* Disadvantages
  - Mapper instructions are many times more powerful than CPU instructions, but FPGA clocks are many times slower
  - Mappers can parallelize and pipeline C code; however, they generally cannot automatically instantiate multiple functional units
  - Optimized C-mapper code is obtained through manual parallelization of existing code using techniques pertinent to algorithm's structure
  - Reduced development time can come at cost of performance







Acknowledgements


* We thank the following vendors for application mapping tools, information, and technical support:
  - Celoxica (Handel C)
  - Impulse Accelerated Technologies (Impulse C)
  - Nallatech (DIME-C)
  - Mitrion (Mitrion C)

* We thank the following vendors for providing tools and/or hardware that made this study possible:
  - Aldec (Active-HDL & Riviera EDA tools)
  - Intel (Xeon servers)
  - Nallatech (FUSE & DIMEtalk tools, RC boards)
  - Xilinx (ISE, RC boards, FPGAs)







References

[1] http://www.srccomp.com
[2] http://www.mentor.com/products/c-based_design/catapult_c_synthesis/.
[3] K. Morris, "Catapult C: Mentor Announces Architectural Synthesis," fpgajournal.com, June 1, 2004.
[4] Nallatech, Inc., "DIME-C User Guide," Reference Manual, United Kingdom, 2005.
[5] Celoxica, Ltd. "Using Handel-C with DK," Training Manual, United Kingdom, 2005.
[6] D. Pellerin and S. Thibault, "Practical FPGA Programming in C," Pearson Education, Inc., Upper Saddle River, NJ, 2005.
[7] Mitrionics AB, Inc, "The Mitrion Processor," Product Overview, Sweden, 2005.
[8] M. Gokhale, J. Stone and E. Gomersall, "Co-Synthesis to a Hybrid RISC/FPGA Architecture," Journal of VLSI Signal Processing
Systems, 24, pp. 165-180, 2000.
[9] J. Hammes and W. Bohm, "The SA-C Language," Reference Manual, Colorado State University, 2001.
[10] J. Hammes, M. Chawathe and W. Bohm, "The SA-C Compiler," Reference Manual, Colorado State University, 2001.
[11] Colorado State Univ. "Cameron Poster for ACS PI Meeting," Arlington, VA, March 7, 2002.
[12] I. Troxel, "CARMA: An Infrastructure for Reconfigurable High-Performance Computing," Ph.D. Prospectus, University of Florida, pp.
30-32, 2005.
[13] R. Goering, "Open-source C compiler targets FPGAs," Embedded.com, October 18, 2002.
[14] J. Frigo, M. Gokhale and D. Lavenier, "Evaluation of Streams-C C-to-FPGA Compiler: An Applications Perspective," Proc.
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 11-13, 2001.
[15] http://www.systemc.org.
[16] OSCI, "SystemC 2.0.1 Language Reference Manual," Reference Manual, San Jose, CA, 2003.
[17] D. A. Buell, S. Akella, J. P. Davis, G. Quan, and D. Caliga, "The DARPA boolean equation benchmark on a reconfigurable
computer," Proc. Military Applications of Programmable Logic Devices (MAPLD), Washington, DC, September 8-10, 2004.
[18] V. Aggarwal, I. Troxel, and A. George, "Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and
MPI," Proc. MAPLD, Washington, DC, September 8-10, 2004.
[19] J. Jussel, "The future of programmable SoC design is C-based", Proc. Engineering of Reconfigurable Systems and Algorithms
(ERSA), Las Vegas, NV, June 27-30, 2005.
