J. Semicond. > Volume 41 > Issue 2 > Article Number: 021402

A survey of FPGA design for AI era

Zhengjie Li, Yufan Zhang, Jian Wang and Jinmei Lai


Abstract: FPGAs are an appealing platform for accelerating deep neural networks (DNNs). We survey a range of FPGA chip designs for AI. For the DSP block, one line of design supports low-precision operations, such as 9-bit or 4-bit multiplication; the other supports floating-point multiply-accumulates (MACs), which preserve high DNN accuracy. For the ALM (adaptive logic module), one line of design supports low-precision MACs through three modifications (an extra carry chain, a 4-bit adder, or shadow multipliers) that increase the density of on-chip MAC operations. The other enhancement of the ALM or CLB (configurable logic block) supports BNNs (binarized neural networks), an ultra-reduced-precision form of DNN. For the memory blocks that store DNN weights and activations, three types of memory are proposed: embedded memory, in-package HBM (high-bandwidth memory), and off-chip memory interfaces such as DDR4/5. Other designs introduce new architectures and specialized AI engines. Xilinx ACAP, in 7 nm, is the industry's first adaptive compute acceleration platform; its AI engine provides up to 8X silicon compute density. Intel Agilex, in 10 nm, works coherently with Intel's own CPUs, which increases computation performance and reduces overhead and latency.
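The low-precision DSP designs surveyed above rest on a simple arithmetic trick: several small products can share one wide multiplier because they occupy disjoint bit fields of the result. The sketch below (plain Python for illustration, not vendor code; the function name and field widths are our own assumptions) packs two independent 4-bit multiplications into a single wide multiplication, the idea behind fracturing a DSP multiplier into 9-bit or 4-bit modes:

```python
def packed_dual_mult(a, b1, b2):
    """Compute a*b1 and a*b2 with one wide multiplication.

    a, b1, b2 are unsigned 4-bit values, so each partial product
    fits in 8 bits; placing b1 eight bits above b2 keeps the two
    products in disjoint fields of the single wide result, which
    is how a fractured DSP multiplier yields several low-precision
    products per cycle.
    """
    assert 0 <= a < 16 and 0 <= b1 < 16 and 0 <= b2 < 16
    packed_b = (b1 << 8) | b2        # b1 in the upper field, b2 in the lower
    wide = a * packed_b              # one "DSP-sized" multiplication
    return (wide >> 8) & 0xFF, wide & 0xFF   # (a*b1, a*b2)
```

With signed operands the fields must be widened to absorb sign extension, which is why real DSP fracturing schemes add correction logic rather than a plain shift-and-add.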

Key words: FPGA; DNN; low-precision; DSP; CLB; ALM
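The BNN support mentioned in the abstract works because a dot product over {-1, +1} values collapses to bitwise logic, which maps directly onto LUTs and carry chains. A minimal sketch (illustrative Python; the encoding of +1 as bit 1 and -1 as bit 0 is the standard BNN convention, the function name is ours):

```python
def binarized_dot(w_bits, a_bits, n):
    """Dot product of two length-n vectors over {-1, +1},
    encoded bitwise (+1 -> 1, -1 -> 0).

    Matching bit pairs contribute +1 and mismatching pairs -1,
    so dot = n - 2 * popcount(w XOR a): the XNOR/popcount form
    that FPGA logic-block enhancements for BNNs accelerate.
    """
    mismatches = bin((w_bits ^ a_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * mismatches
```

For example, two length-4 vectors that disagree in one position have a dot product of 4 - 2*1 = 2.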




References:

[1] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. Neural Information Processing Systems (NIPS), 2012, 1097

[2] Liang S, Yin S, Liu L, et al. FP-BNN: Binarized neural network on FPGA. Neurocomputing, 2017, 275, 1072

[3] Freund K. Machine learning application landscape. https://www.xilinx.com/support/documentation/backgrounders/Machine-Learning-Application-Landscape.pdf. 2017

[4] Zhang C, Li P, Sun G, et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. The 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015, 161

[5] Qiu J, Wang J, Yao S, et al. Going deeper with embedded FPGA platform for convolutional neural network. The 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016, 26

[6] Yin S, Ouyang P, Tang S, et al. A high energy efficient reconfigurable hybrid neural network processor for deep learning applications. IEEE J Solid-State Circuits, 2018, 53(4), 968

[7] Han S, Mao H, Dally W J. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016

[8] Gysel P, Motamedi M, Ghiasi S. Hardware-oriented approximation of convolutional neural networks. ICLR, 2016

[9] Han S, Liu X, Mao H, et al. EIE: efficient inference engine on compressed deep neural network. International Symposium on Computer Architecture (ISCA), 2016, 243

[10] Zhou A, Yao A, Guo Y, et al. Incremental network quantization: towards lossless CNNs with low-precision weights. ICLR, 2017

[11] Mishra A, Nurvitadhi E, Cook J J, et al. WRPN: wide reduced-precision networks. arXiv: 1709.01134, 2017

[12] Hubara I, Courbariaux M, Soudry D. Binarized neural networks. Neural Information Processing Systems (NIPS), 2016, 1

[13] Umuroglu Y, Fraser N J, Gambardella G, et al. FINN: a framework for fast, scalable binarized neural network inference. International Symposium on Field-Programmable Gate Arrays, 2017, 65

[14] Boutros A, Yazdanshenas S, Betz V. Embracing diversity: enhanced DSP blocks for low-precision deep learning on FPGAs. 28th International Conference on Field-Programmable Logic and Applications, 2018, 35

[15] Won M S. Intel® Agilex™ FPGA architecture. https://www.intel.com/content/www/us/en/products/programmable/fpga/agilex.html. Intel White Paper

[16] Versal: the first adaptive compute acceleration platform (ACAP). https://www.xilinx.com/support/documentation/white_papers/wp505-versal-acap.pdf. Xilinx White Paper. Version: v1.0, October 2, 2018

[17] Boutros A, Eldafrawy M, Yazdanshenas S, et al. Math doesn't have to be hard: logic block architectures to enhance low-precision multiply-accumulate on FPGAs. The 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019, 94

[18] Kim J H, Lee J, Anderson J H. FPGA architecture enhancements for efficient BNN implementation. International Conference on Field-Programmable Technology (ICFPT), 2018, 217

[19] Versal architecture and product data sheet: overview. https://www.xilinx.com/support/documentation/data_sheets/ds950-versal-overview.pdf. DS950. Version: v1.2, July 3, 2019

[20] Xilinx AI engines and their applications. https://www.xilinx.com/support/documentation/white_papers/wp506-ai-engine.pdf. Xilinx White Paper. Version: v1.0.2, October 3, 2018

[21] Intel Arria 10 core fabric and general purpose I/Os handbook. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/arria-10/a10_handbook.pdf. Version: 2018.06.24

[22] Intel® Stratix® 10 variable precision DSP blocks user guide. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/ug-s10-dsp.pdf. Version: 2018.09.24

[23] Yazdanshenas S, Betz V. Automatic circuit design and modelling for heterogeneous FPGAs. International Conference on Field-Programmable Technology (ICFPT), 2017, 9

[24] Yazdanshenas S, Betz V. COFFE 2: automatic modelling and optimization of complex and heterogeneous FPGA architectures. ACM Trans Reconfig Technol Syst, 2018, 12(1), 3

[25] Versal ACAP AI core series product selection guide. https://www.xilinx.com/support/documentation/selection-guides/versal-ai-core-product-selection-guide.pdf. XMP452. Version: v1.0.1, 2018

[26] UltraScale architecture DSP slice user guide. https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf. UG579. Version: v1.9, September 20, 2019

[27] Langhammer M, Pasca B. Floating-point DSP block architecture for FPGAs. The 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015, 117

[28] Intel® Agilex™ FPGA advanced information brief. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/agilex/ag-overview.pdf. AG-OVERVIEW. Version: 2019.07.02



Citation: Z J Li, Y F Zhang, J Wang, J M Lai. A survey of FPGA design for AI era[J]. J. Semicond., 2020, 41(2): 021402. doi: 10.1088/1674-4926/41/2/021402


History: Manuscript received 26 September 2019; revised 19 October 2019; accepted manuscript online 14 December 2019; uncorrected proof 25 December 2019; published 11 February 2020
