Citation
Scalable Window Generation For The Intel Broadwell+Arria 10 And High-Bandwidth FPGA Systems

Material Information

Title:
Scalable Window Generation For The Intel Broadwell+Arria 10 And High-Bandwidth FPGA Systems
Series Title:
19th Annual Undergraduate Research Symposium
Creator:
Emas, Madison
Language:
English
Physical Description:
Undetermined

Subjects

Subjects / Keywords:
Center for Undergraduate Research
Genre:
Conference papers and proceedings
Poster

Notes

Abstract:
Emerging FPGA systems are providing higher external memory bandwidth to compete with GPU performance. However, because FPGAs often achieve parallelism through deep pipelines, traditional FPGA design strategies do not necessarily scale well to the large numbers of replicated pipelines needed to take advantage of higher bandwidth. We show that sliding-window applications, an important subset of digital signal processing, demonstrate this scalability problem. We introduce a window generator architecture that enables replication to over 330 GB/s, which is an 8.7× improvement over previous work. We evaluate the window generator on the Intel Broadwell+Arria 10 system for 2D convolution and show that for traditional convolution (one filter per image), our approach outperforms a 12-core Xeon Broadwell E5 by 81× and a high-end Nvidia P6000 GPU by an order of magnitude for most input sizes, while improving energy by 15.7×. For convolutional neural nets (CNNs), we show that although the GPU and Xeon typically outperform existing FPGA systems, the projected performance of the window generator running on FPGAs with sufficient bandwidth can exceed that of high-end GPUs for many common CNN parameters.
General Note:
Research authors: Greg Stitt, Abhay Gupta, Madison N. Emas, David Wilson, Austin Baylis - University of Florida
General Note:
University Scholars Program

Record Information

Source Institution:
University of Florida
Rights Management:
Copyright Madison Emas. Permission granted to University of Florida to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.

Full Text

2D Convolution on the Intel Broadwell + Arria 10
Madison N Emas

Introduction

Problem:
GPUs often provide 100+ GB/s of memory bandwidth.
Emerging FPGA systems can now provide more bandwidth to compete with GPUs.
Not all circuits scale to larger amounts of memory bandwidth.
2D convolution (an important FPGA app) exhibits these scaling problems.
Solution:
A scalable sliding window generator for 2D convolution is introduced and evaluated on the Intel Broadwell + Arria 10 system.
Enables 1024 replicated pipelines.
Scales up to 330 GB/s of memory bandwidth.

Intel Broadwell + Arria 10

Broadwell + Arria 10 on same die with shared memory.
FPGA accesses shared memory over PCIe and/or QPI using a provided cache coherent interface (CCI).

Scalable Sliding Window Generator

Given a stream of pixels, the window generator assembles p windows of data each cycle, where p is the number of pixels provided each cycle.
A window is a 2D subset of the image.
Sliding window apps typically have consecutive windows with large amounts of overlap.
Overlap represents reused data that should be buffered internally.
Previous window generators were limited to one window/cycle.
Our approach produces p windows each cycle for any window size, limited by memory bandwidth.
The architecture consists of three components: Variable Read FIFO, Window Buffer, Window Coalescer.

Variable Read FIFO:
Reads a variable number of pixels from the input stream.
Enables usage of any image size.
Eliminates need for padding of the input image.
Highly pipelined for high clock speeds.
Writes p pixels/cycle.
User-specifiable reads from 0 to p each cycle.
Window Buffer:
Buffers incoming rows of the image.
Stores reused data across multiple windows.
Window Coalescer:
Coalesces window buffer data into complete windows.
Outputs p windows per cycle.

Comparison to Previous Work

Compared new window generator to previous work using 3x3 windows and image sizes of 2048x2048.
Performed timing optimization to obtain maximum clock frequency.
Previous Approach:
Frequency decreases to 100 MHz for just 128 pipelines.
Does not scale past 256 pipelines.
Maximum bandwidth is just 38.4 GB/s.
New Approach:
Frequency remains high at 242 MHz for 128 pipelines.
Can scale to 1024 pipelines.
Maximum bandwidth is 336 GB/s: an 8.7x improvement.
Tradeoff:
Increase in flip-flops for higher memory bandwidth.
New approach uses fewer LUTs (40% average reduction due to eliminating muxes in Window Buffer) but more flip-flops (20% increase).

2D Convolution Results

Experimental Setup:
Compared Broadwell + Arria 10 to 12-core Broadwell Xeon E5 and Nvidia Quadro P6000 GPU.
Highly optimized software and GPU code from DeepBench, Nvidia SDK, and Intel Math Kernel Library.
All examples run convolution pipelines at 271 MHz.
All FPGA results include PCIe and QPI transfer times for accessing memory.
Time and power measured by putting relevant code in a loop and averaging numerous measurements.
Results for 2D convolution (1 filter/image):
Broadwell + Arria 10 vs 12-core Xeon Broadwell E5: 81x average speedup, 96x average energy improvement.
Broadwell + Arria 10 vs high-end Nvidia P6000 GPU: 12x average speedup, 15.7x average energy improvement, 1.2x faster even without GPU PCIe transfers.

CNN Projections Results

Experimental Setup:
Compared Arria 10 and Stratix 10 to 12-core Broadwell Xeon E5 and Nvidia Quadro P6000 GPU.
Evaluated convolution parameters common to CNNs.
Results: 3x3 filters
For 32 filters per image, both Arria 10 and Stratix 10 outperform other devices.
For 64 filters per image, Stratix 10 is still faster than the GPU.
[Figure: 3x3 Filter. Execution time (ms) versus filters per image (32, 64, 256, 512) for Arria 10, Stratix 10, Xeon Broadwell, P6000 GPU (PCIe), and P6000 GPU (no PCIe).]
Results: 5x5 filters
GPU is faster than Arria 10.
Stratix 10 is still faster than the GPU when including PCIe transfers.
[Figure: 5x5 Filter. Execution time (ms) versus filters per image (32, 64, 256, 512) for Arria 10, Stratix 10, Xeon Broadwell, P6000 GPU (PCIe), and P6000 GPU (no PCIe).]

Conclusions

The presented window generator enables emerging FPGA systems to achieve performance that is better than or comparable to the P6000 GPU for many CNN use cases.
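
The poster describes the window generator only at a high level, so the following Python sketch is a behavioral model rather than the authors' architecture. It assumes a simplified view in which the k most recent image rows are buffered and, in each modeled cycle, up to p overlapping k-by-k windows are assembled from that buffer. The function name `generate_windows` and its parameters are hypothetical, and the model ignores hardware details such as the Variable Read FIFO, pipelining, and clocking.

```python
from collections import deque


def generate_windows(image, k, p):
    """Behavioral model (an assumption, not the poster's RTL).

    Yields one list per modeled cycle; each list holds up to p
    k-by-k windows taken from consecutive column positions.
    """
    width = len(image[0])
    rows = deque(maxlen=k)  # stands in for the window buffer: last k rows
    for row in image:
        rows.append(row)  # buffer the incoming row; oldest row drops out
        if len(rows) < k:
            continue  # not enough rows buffered for a full window yet
        n_cols = width - k + 1  # valid window positions in this row band
        for start in range(0, n_cols, p):
            stop = min(start + p, n_cols)
            # stands in for the coalescer: slice overlapping windows
            # out of the buffered rows, up to p per modeled cycle
            yield [[r[c:c + k] for r in rows] for c in range(start, stop)]


if __name__ == "__main__":
    # 4x6 test image whose pixel value encodes its (row, column) position
    img = [[r * 10 + c for c in range(6)] for r in range(4)]
    for cycle, windows in enumerate(generate_windows(img, k=3, p=4)):
        print("cycle", cycle, "windows", len(windows))
```

Because adjacent windows share all but one column, the model reuses the buffered rows instead of re-reading pixels, which is the overlap the poster says should be buffered internally. A hardware design would receive p pixels per cycle rather than a whole row at once; the row-at-a-time loop here is a simplification chosen to keep the sketch short.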