J. Semicond. > Volume 41 > Issue 2 > Article Number: 022406

Optimizing energy efficiency of CNN-based object detection with dynamic voltage and frequency scaling

Weixiong Jiang 1, 2, 3, Heng Yu 4, Jiale Zhang 1, 2, 3, Jiaxuan Wu 1, 2, 3, Shaobo Luo 5, and Yajun Ha 1, 2, 3


Abstract: On the one hand, accelerating convolutional neural networks (CNNs) on FPGAs requires ever-increasing energy efficiency in the edge computing paradigm. On the other hand, unlike conventional digital algorithms, CNNs remain highly robust even in the presence of limited timing errors. Taking advantage of this property, we propose using dynamic voltage and frequency scaling (DVFS) to further optimize the energy efficiency of CNNs. First, we develop a DVFS framework on FPGAs. Second, we apply DVFS to SkyNet, a state-of-the-art neural network targeting object detection. Third, we analyze the impact of DVFS on CNNs in terms of performance, power, energy efficiency, and accuracy. Experimental results show that, compared to the state of the art, we achieve a 38% improvement in energy efficiency without any loss in accuracy. Results also show that we can achieve a 47% improvement in energy efficiency if we allow a 0.11% relaxation in accuracy.

Key words: CNN, FPGA, DVFS, object detection
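To make the intuition behind the abstract concrete, the following is a minimal sketch of why lowering the supply voltage can raise energy efficiency. It uses the textbook CMOS dynamic power relation P = C·V²·f and is not the authors' framework or measurement setup. The names `OperatingPoint`, `dynamic_power_w`, `energy_efficiency` and `relative_gain`, as well as every numeric value (capacitance, voltages, frequencies, frames per MHz), are hypothetical and chosen only for illustration. The sketch also assumes that throughput scales linearly with clock frequency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    voltage_v: float      # core supply voltage in volts (hypothetical value)
    frequency_mhz: float  # clock frequency in MHz (hypothetical value)


def dynamic_power_w(op, switched_capacitance_nf=1.0):
    """Textbook CMOS dynamic power P = C * V^2 * f (activity folded into C)."""
    c_farads = switched_capacitance_nf * 1e-9
    f_hz = op.frequency_mhz * 1e6
    return c_farads * op.voltage_v ** 2 * f_hz


def energy_efficiency(op, static_power_w=0.0, frames_per_mhz=0.1):
    """Frames per joule, assuming throughput scales linearly with frequency."""
    throughput_fps = frames_per_mhz * op.frequency_mhz
    total_power_w = dynamic_power_w(op) + static_power_w
    return throughput_fps / total_power_w


def relative_gain(baseline, scaled, **kwargs):
    """Fractional efficiency improvement of `scaled` over `baseline`."""
    return energy_efficiency(scaled, **kwargs) / energy_efficiency(baseline, **kwargs) - 1.0


# Hypothetical operating points: same clock, 10% lower supply voltage.
nominal = OperatingPoint(voltage_v=1.0, frequency_mhz=200.0)
undervolted = OperatingPoint(voltage_v=0.9, frequency_mhz=200.0)
print(f"efficiency gain from undervolting: {relative_gain(nominal, undervolted):.1%}")
```

In this simplified model, with static power set to zero, frames per joule depend only on the voltage, so the gain comes entirely from the V² term. A nonzero `static_power_w` makes higher frequencies more attractive as well. The model says nothing about timing errors or accuracy, which the paper evaluates experimentally.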








W X Jiang, H Yu, J L Zhang, J X Wu, S B Luo, Y J Ha, Optimizing energy efficiency of CNN-based object detection with dynamic voltage and frequency scaling[J]. J. Semicond., 2020, 41(2): 022406. doi: 10.1088/1674-4926/41/2/022406.


History

Manuscript received: 17 September 2019
Manuscript revised: 13 November 2019
Online: Accepted Manuscript: 17 December 2019; Uncorrected proof: 15 January 2020; Published: 11 February 2020
