Group Title: Design and tradeoff analysis of JPEG2000 on hardware-reconfigurable systems
Title: Presentation
Permanent Link: http://ufdc.ufl.edu/UF00094745/00002
 Material Information
Title: Presentation
Physical Description: Book
Language: English
Creator: DeVille, Ryan
Aggarwal, Vikas
Troxel, Ian
George, Alan D.
Publisher: DeVille et al.
Place of Publication: Gainesville, Fla.
 Record Information
Bibliographic ID: UF00094745
Volume ID: VID00002
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

Full Text


Design and Tradeoff Analysis
of JPEG-2000 on
Hardware-Reconfigurable Systems



Ryan DeVille, Vikas Aggarwal,
Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering
University of Florida


#229 MAPLD 2005


DeVille






Introduction

[Figure: JPEG-2000 encoding block diagram; recoverable labels: Multicomponent Transform, Discrete Wavelet Transform, EBCOT Algorithm Tier-1 Encoding (compression)]



* JPEG-2000 Encoding
  - State-of-the-art low bit-rate compression algorithm
  - Progressive transmission by quality, resolution, component, or spatial locality
  - Spatially random access to bitstream
  - Region of interest coding
* Motivation for porting JPEG-2000 to RC systems
  - High-performance and low-cost solution is attractive for airborne and satellite imaging systems
  - Speedup readily available with fine-grain and coarse-grain parallelism opportunities







Related Research

* EBCOT encoder designs
  - Group of Column optimization method
  - Previous RC designs
    - Space systems prototype [5]
    - Scalable Entropy Encoder [6]
    - Dual Processing Elements Architecture [7]
* 2D Discrete Wavelet Transform designs
  - Several mimic early VLSI designs [8, 9]
  - Multiple architecture design classifications [10]
    - Direct
      - 1D transform, transpose, perform another 1D transform
      - Intrinsically slow
    - Separate serial and parallel filters, or parallel row and parallel column filters
      - Processes along rows and columns
      - Represents significant performance improvement
    - Symmetrically extended
      - Improves processing efficiency, especially towards center of image







JPEG-2000 Encoder Design


* Software code profiling first used to determine effort distribution
  - Previous research efforts show that DWT and Tier-1 encoding consume 80-85% of execution time
  - Current profiling results with JasPer and OpenJPEG show that >90% of execution time is spent in DWT and Tier-1
* Benchmark images selected from Kodak Lossless True Color Image Suite, JasPer benchmark images, and standard image-processing images (lena, etc.)

[Figure: JasPer execution time profile; legend: MCT, FWT, QUANT, TIER1, TIER2]






Discrete Wavelet Transform (DWT)


* Features
  - Second-most computationally intensive block in compression process
  - Transforms each component's tile data into coefficients
    - Reversible transform involves all integer operations
    - Represents high- and low-frequency components of image
    - Amenable to compression, resulting in better compression ratios
  - Recursive application yields frequency bands at multiple resolutions

* Operation
  - 2D transform achieved by successively applying 1D transform in X and Y directions
  - Each 1D transform consists of
    - Filtering step
    - De-interleave step: reorganizing of data into bands
* Available data and functional parallelism can be exploited
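
The filtering and de-interleave steps can be illustrated in software. The following is a minimal sketch, assuming the reversible integer transform is the 5/3 lifting filter that JPEG-2000 specifies for its reversible path; the slides do not show the filter itself. All function names are hypothetical, the sketch handles even-sized tiles only, and it says nothing about how the hardware module is organized.

```python
def _mirror(i, n):
    """Reflect an out-of-range index into [0, n) (whole-sample symmetric extension)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def fwd_53_1d(x):
    """One level of 5/3 lifting; returns the de-interleaved (low, high) bands."""
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("this sketch handles even lengths >= 2 only")
    half = n // 2
    # Filtering, predict step: odd samples become high-pass coefficients.
    high = [x[2 * k + 1] - ((x[2 * k] + x[_mirror(2 * k + 2, n)]) >> 1)
            for k in range(half)]
    # Filtering, update step: even samples become low-pass coefficients.
    low = [x[2 * k] + ((high[max(k - 1, 0)] + high[k] + 2) >> 2)
           for k in range(half)]
    return low, high


def inv_53_1d(low, high):
    """Undo fwd_53_1d exactly (integer arithmetic, so no rounding loss)."""
    half = len(low)
    n = 2 * half
    x = [0] * n
    for k in range(half):
        x[2 * k] = low[k] - ((high[max(k - 1, 0)] + high[k] + 2) >> 2)
    for k in range(half):
        x[2 * k + 1] = high[k] + ((x[2 * k] + x[_mirror(2 * k + 2, n)]) >> 1)
    return x


def fwd_53_2d(tile):
    """Apply the 1D transform along every row, then along every column."""
    rows = []
    for r in tile:
        low, high = fwd_53_1d(list(r))
        rows.append(low + high)          # row layout: [low | high]
    cols = []
    for c in zip(*rows):
        low, high = fwd_53_1d(list(c))
        cols.append(low + high)
    return [list(r) for r in zip(*cols)]  # transpose back; LL band is top-left


def inv_53_2d(coeffs):
    """Invert fwd_53_2d: columns first, then rows."""
    cols = []
    for c in zip(*coeffs):
        c = list(c)
        h = len(c) // 2
        cols.append(inv_53_1d(c[:h], c[h:]))
    rows = [list(r) for r in zip(*cols)]
    return [inv_53_1d(r[:len(r) // 2], r[len(r) // 2:]) for r in rows]
```

Calling `fwd_53_2d` again on the top-left quarter of its output gives the next resolution level, which is the recursive application mentioned under Features.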


[Figure: multi-resolution subband decomposition; recoverable labels: 3LL, 3HL, 3LH]








DWT Hardware Architecture

* Challenges presented by DWT
  - Parallel processing limited by memory bandwidth requirements
  - Some sequential nature in processing involved

* Design features
  - Data-level parallelism exploited by operating on multiple "tiles"
  - Function-level parallelism exploited by pipelining different processing steps
  - Data reuse eliminates extra read cycles

* Internal architecture
  - Each tile is entirely stored in single Block RAM to minimize data movement
  - Overlapped processing to further reduce latency

[Figure: DWT module block diagram; recoverable labels: Input Buffer, Tile Data, Column, Temp Buffer, De-interleave, Row, Output Buffer]





Embedded Block Coding with Optimized Truncation
(EBCOT): Tier-1

* Features
  - Specially adapted arithmetic coder
  - Four bit-plane coding primitives
  - Three coding passes for each bit-plane (except the most significant)

* Operation
  - Coding passes: cleanup pass (CUP) begins at most significant bit-plane
  - Iteratively perform coding passes over remaining bit-planes
  - Coding-pass-generated context and bit data serially encoded and compressed by arithmetic encoder
  - Flush and reset arithmetic coder at completion
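
The pass ordering described above can be sketched as follows. The slides name only the cleanup pass; the other two pass names (significance propagation and magnitude refinement) and the four-row stripe scan are taken from the JPEG-2000 standard rather than from this presentation. The function names are hypothetical, and the sketch covers only scheduling and bit-plane extraction, not context formation or the arithmetic coder.

```python
def coding_pass_schedule(num_bitplanes):
    """List (bitplane, pass_name) pairs in coding order, most significant plane first."""
    if num_bitplanes < 1:
        raise ValueError("need at least one bit-plane")
    msb = num_bitplanes - 1
    schedule = [(msb, "cleanup")]  # the most significant plane gets a cleanup pass only
    for plane in range(msb - 1, -1, -1):
        schedule += [
            (plane, "significance propagation"),
            (plane, "magnitude refinement"),
            (plane, "cleanup"),
        ]
    return schedule


def stripe_scan_order(width, height, stripe_height=4):
    """Yield (row, col): stripes top to bottom, columns left to right,
    and rows top to bottom inside each stripe column."""
    for top in range(0, height, stripe_height):
        for col in range(width):
            for row in range(top, min(top + stripe_height, height)):
                yield row, col


def bitplane(coeffs, plane):
    """Extract one magnitude bit-plane from a 2D list of signed integers."""
    return [[(abs(c) >> plane) & 1 for c in row] for row in coeffs]
```

For a codeblock with P bit-planes the schedule holds 3P - 2 passes, each of which feeds the single arithmetic coder in order.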







Tier-1 Encoding Hardware Architecture

* Challenges presented by Tier-1 encoding
  - Serial process: creation of current MQ context data directly depends upon previous pass results
  - "Bursty" communication: contextual data from a pass arrives in short, semi-continuous bursts
  - Large amounts of data and flags must be stored through multiple iterations of algorithm, requiring high memory bandwidth
* Internal architecture (high-level)
  - Retrieve current stripe from memory for processing
  - Data is operated on in a pipelined fashion through registers
  - Context and data information sent to queues
  - Serializing agent: arithmetic entropy encoder
  - MQ Input Controller regulates input to arithmetic entropy encoder, ensuring correct operation
  - Data from arithmetic entropy encoder is written to a separate, final buffer
* Design decision to use MQ encoder as serializing agent saves area and BlockRAM space without sacrificing too much performance

[Figure: Tier-1 module block diagram; recoverable labels: Codeblock, Pass Queues, MQ Input Controller, Arithmetic Entropy Encoder]






Target HPEC Platform


* High-Performance Embedded Computing: Nallatech BenNUEY w/ BenBLUE-II
  - Three FPGAs (all Xilinx Virtex2 6000, -4)
    - Single "user" FPGA on BenNUEY PCI board
    - Dual FPGAs on BenBLUE-II daughter card
  - Low bandwidth to system memory through 64-bit/66 MHz PCI bus connection
  - Large memory storage capability with 12 MB SRAM (166 MHz, ZBT)
  - Advantages/Disadvantages
    - High configuration time (PCI bus + chained JTAG interface)
    - Large memory storage helps alleviate strain on PCI bus
    - Very good IO interface support with proprietary tools

[Figure: BenNUEY/BenBLUE-II block diagram; recoverable labels: four ZBT SSRAM banks (2 MB, 2 MB, 4 MB, 4 MB), PCI bus FPGA (Xilinx Spartan2), BenNUEY user FPGA (Xilinx Virtex2 6000, -4), BenBLUE-II primary and secondary FPGAs (Xilinx Virtex2 6000, -4), PCI (64-bit data, 66 MHz), local bus (32-bit data), inter-FPGA communications bus (159 I/O, user-defined clock). The diagram reflects only those buses actually used in the design; other communication schemes are available.]








DWT Single FPGA Results


Results for single DWT module design for BenNUEY board operating at 80 MHz

                               Single-module design       Single-module design
                               processing one tile (µs)   processing eight tiles (µs)
DMA write time                 127                        1001
DMA read time                  80                         573
Computation time (part 1)      52                         56
Computation time (part 2)      48                         404
Total time for FPGA solution   307                        2034
Time for software solution     130                        1043

Note: software solution comes from execution on server with 2.4 GHz Xeon CPU

Results for eight DWT modules design for BenNUEY board operating at 40 MHz

                               Processing eight tiles (µs)   Processing forty tiles (µs)
DMA write time                 758                           3750
DMA read time                  382                           1900
Computation time (part 1)      80                            80
Computation time (part 2)      82                            424
Total time for FPGA solution   1302                          6154
Time for software solution     1043                          5219
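
The effect of DMA transfers on these results can be made explicit with a little arithmetic. The sketch below only recombines the numbers from the timing tables above; the dictionary layout and the function name are illustrative choices and do not come from the presentation.

```python
# Times copied from the DWT timing tables (all in the same time unit).
# "compute" is the sum of computation time parts 1 and 2.
RESULTS = {
    "1 module, 1 tile":    {"dma_write": 127,  "dma_read": 80,   "compute": 52 + 48,  "software": 130},
    "1 module, 8 tiles":   {"dma_write": 1001, "dma_read": 573,  "compute": 56 + 404, "software": 1043},
    "8 modules, 8 tiles":  {"dma_write": 758,  "dma_read": 382,  "compute": 80 + 82,  "software": 1043},
    "8 modules, 40 tiles": {"dma_write": 3750, "dma_read": 1900, "compute": 80 + 424, "software": 5219},
}


def speedups(r):
    """Return (speedup ignoring DMA, speedup including DMA) over software."""
    total = r["dma_write"] + r["dma_read"] + r["compute"]
    return r["software"] / r["compute"], r["software"] / total


if __name__ == "__main__":
    for name, r in RESULTS.items():
        compute_only, with_dma = speedups(r)
        print(f"{name:20s} compute-only {compute_only:5.2f}x   with DMA {with_dma:4.2f}x")
```

With these figures the eight-module design reaches a little over 10x on computation alone for forty tiles, while every configuration stays below 1x once DMA time is counted.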


[Figures: two "Performance Comparison" charts plotted against tiles processed, each with three series: FPGA solution (without DMA), FPGA solution (with DMA), and software solution]

Resource Utilization on Virtex2 6000 -4

# of Modules     Slices        BRAMs
Single Module    1157 (3%)     6 (4%)
Eight Modules    5742 (17%)    48 (33%)









Tier-1 Encoding Current Results


Results for Tier-1 module design for BenNUEY board operating at 90 MHz

                   Single-module design   Eight-module design
                   processing one         processing one
                   codeblock (µs)         codeblock each (µs)
DMA write time     70                     218
DMA read time      49                     388
Computation time   175                    175
Total time         294                    781
Software time      276                    2189

Note: software solution comes from execution on server with 2.4 GHz Xeon processor

# of Modules   Slices         BlockRAMs
Single         3,527 (10%)    7 (5%)
Eight          25,556 (75%)   56 (38%)

[Figure: per-image execution time profile (0% to 100%) by stage (MCT, FWT, QUANT, TIER1, TIER2) for peppers.ras, camera.ras, kodim06.ras, kodim10.ras, kodim 1.ras, kodim16.ras, kodim21.ras, kodim22.ras, kodim23.ras, baboon.ras, lena.ras, and water.pnm. Profiling shows performance projections with DMA transfer times included.]



* Results synthesized with Synplify Pro 7.7.1, PAR with Xilinx ISE 6.3





Conclusions from HPEC Platform

* Multi-chip system offers resources for increased parallelism or a multi-component application
* Order of magnitude improvement in total computation time
* Faster computation times on FPGA
  - But communication overhead severely hinders performance improvement
  - Low-bandwidth PCI interconnect not amenable to designs with challenging memory demands








Target HPC Platform

* High-Performance Computing: SGI Altix 350 with FPGA Brick
  - Single FPGA: Virtex2 6000 (-6 speed grade)
    - Approximately 33% of chip used for SGI's RASC system layer
    - Two algorithm clock speeds: 200 MHz and 100 MHz
  - High bandwidth to system memory through proprietary NUMAlink interconnect (12.8 GB/s) through Scalable System Port (6.4 GB/s)
  - 3 banks of QDR SRAM (6 MB each) with a full bandwidth of 9.6 GB/s (1.6 GB/s for each read and write)
  - Advantages/Disadvantages
    - Extremely low reconfiguration time
    - High memory bandwidth greatly helps memory-intensive apps, such as JPEG-2K

[Figure: "SGI Altix w/ RASC extension" block diagram; recoverable labels: 2 MB QDR SRAM banks with 36-bit data paths plus address and control lines, Algorithm FPGA, SelectMAP programming interface, Loader FPGA, PCI (33 MHz), TIO, 72-bit links, NUMAlink connectors. The diagram reflects only those buses actually used in the design; other communication schemes are available.]







Performance Projections


[Figure: projected execution time profile (0% to 100%) by stage; legend: MCT, FWT, QUANT, TIER1, TIER2]


Profile shows projections for no-latency, infinite-bandwidth interconnect.

* NUMAlink interconnect
  - Approximate order-of-magnitude improvement of transfers in similar designs
  - Mitigates communication overhead bottleneck
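
The order-of-magnitude figure is consistent with the peak rates quoted on the two platform slides. The sketch below is a back-of-the-envelope check only: it compares theoretical peaks, ignores protocol overhead and DMA setup cost, and uses hypothetical names throughout.

```python
def parallel_bus_peak_bytes_per_s(width_bits, clock_hz):
    """Theoretical peak of a parallel bus moving one word per clock cycle."""
    return width_bits / 8 * clock_hz


# 64-bit, 66 MHz PCI bus on the Nallatech platform.
PCI_PEAK = parallel_bus_peak_bytes_per_s(64, 66e6)

# Scalable System Port rate quoted for the SGI RASC platform.
SSP_PEAK = 6.4e9

# Three QDR SRAM banks, each with 1.6 GB/s read plus 1.6 GB/s write.
QDR_TOTAL = 3 * (1.6e9 + 1.6e9)

if __name__ == "__main__":
    print(f"PCI peak:       {PCI_PEAK / 1e6:.0f} MB/s")
    print(f"SSP peak:       {SSP_PEAK / 1e9:.1f} GB/s")
    print(f"SSP / PCI:      {SSP_PEAK / PCI_PEAK:.1f}x")
    print(f"QDR SRAM total: {QDR_TOTAL / 1e9:.1f} GB/s")
```

The ratio comes out at roughly 12x, and the per-bank SRAM rates add up to the 9.6 GB/s stated for the RASC memory.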







Lessons Learned and Conclusions

* Lessons Learned
  - HW/SW codesign
    - Shared-memory systems more amenable to closely-coupled processing associated with communication-sensitive RC applications
    - PCI boards for servers effective when tasks are offloaded for processing with minimal or masked communication
  - Memory bandwidth constrains parallelism in DWT design
  - Serializing agent (arithmetic coder) in Tier-1 design is key limit to performance improvement
* Conclusions
  - Identifying and accelerating key components yields better system performance (with a wary eye on Amdahl's Law)
  - Performance enhancements achieved mostly through functional parallelism due to sequential processing constraints
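
The caution about Amdahl's Law can be quantified with a short sketch. The 0.90 fraction is the lower bound from the profiling slide (more than 90% of execution time in DWT and Tier-1); the kernel speedup of 10 is an illustrative assumption, and the function name is hypothetical.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / kernel_speedup)


if __name__ == "__main__":
    # 90% of run time accelerated 10x (illustrative kernel speedup).
    print(f"{amdahl_speedup(0.90, 10):.2f}x overall")
    # Upper bound if the accelerated part took no time at all.
    print(f"{amdahl_speedup(0.90, float('inf')):.2f}x ceiling")
```

Under these assumptions the whole encoder gains a little over 5x, and no kernel speedup can push it past about 10x unless the remaining stages are accelerated too.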






Future Work and Acknowledgments

* Future Work
  - Full system implementation on SGI Altix with RASC
  - Region of Interest capability
  - Lossy encoding and rate capability
  - MCT and Tier-2 encoding on FPGA as well
  - Single FPGA JPEG-2000 encoding application
* Acknowledgments
  - We wish to thank the following vendors for equipment and/or tools in support of this research: SGI, Nallatech, Xilinx, and Aldec
  - Special thanks to the SGI Digital Media group and SGI RASC engineers for their help and suggestions







References

[1] Adams, M.D. and Ward, R.K., "JasPer: a portable flexible open-source software tool kit for image coding/processing", in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), pp. 241-244, May 2004.
[2] OpenJPEG. http://www.openjpeg.org/
[3] Liu, L., Li, D., Li, Z., Wang, Z., and Chen, H., "A VLSI architecture of EBCOT encoder for JPEG2000", in 5th International Conference on ASIC, pp. 882-885, Oct. 2003.
[4] Chen, K., Lian, C., Chen, H., and Chen, L., "Analysis and architecture design of EBCOT for JPEG-2000", in IEEE International Symposium on Circuits and Systems, vol. 2, pp. 765-768, May 2001.
[5] Van Buren, D., "A high-rate JPEG2000 compression system for space", in IEEE Aerospace Conference, March 2005.
[6] Aouadi, I. and Hammami, O., "Analysis and hardware design of a scalable dual JPEG-2000 entropy coder", in Euromicro Symposium on Digital System Design (DSD 2004), pp. 227-233, Sept. 2004.
[7] Gangadhar, M. and Bhatia, D., "FPGA based EBCOT architecture for JPEG 2000", in IEEE International Conference on Field-Programmable Technology (FPT '03), pp. 228-233, Dec. 2003.
[8] Hung, K., Huang, Y., Truong, T., and Wang, C., "FPGA implementation for 2D discrete wavelet transform", in Electronics Letters, pp. 639-640, April 1998.
[9] Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K., and Sriram, G., "Design and FPGA implementation of image block encoders with 2D-DWT", in Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), pp. 1015-1019, Oct. 2003.
[10] McCanny, P., Masud, S., and McCanny, J., "Design and implementation of the symmetrically extended 2-D wavelet transform", in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 3, pp. 3108-3111, May 2002.
[11] Taubman, D., "High performance scalable image compression with EBCOT", in IEEE Trans. Image Processing, vol. 9, pp. 1158-1170, July 2000.
[12] Richardson, I.E.G., Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, New York: John Wiley and Sons, Ltd (UK), 2002.
[13] Acharya, T. and Tsai, P.-S., JPEG2000 Standard for Image Compression: Concepts, Algorithms, and VLSI Architectures. Hoboken, New Jersey: John Wiley and Sons, Inc., 2005.




