Towards high performance low bitwidth training for deep neural networks

Chunyou Su 1, Sheng Zhou 2, Liang Feng 1, and Wei Zhang 1

Abstract: The high performance of state-of-the-art deep neural networks (DNNs) comes at the cost of substantial computing resources. Network quantization has recently been recognized as a promising way to significantly reduce this resource usage. However, previous quantization works have focused mostly on DNN inference, and very few address the challenges of DNN training. In this paper, we leverage a dynamic fixed-point (DFP) quantization algorithm and a stochastic rounding (SR) strategy to develop fully quantized 8-bit neural networks targeting low bitwidth training. Experiments show that, compared with the full-precision networks, the accuracy drop of our quantized convolutional neural networks (CNNs) can be less than 2%, even for deep models evaluated on the ImageNet dataset. Additionally, our 8-bit GNMT translation network achieves a BLEU score almost identical to that of the full-precision network. We further implement a prototype on an FPGA, and the synthesis results show that the low bitwidth training scheme can reduce resource usage significantly.

Key words: CNN, quantized neural networks, limited precision training
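
To give a concrete flavor of the two ingredients named in the abstract, dynamic fixed-point (DFP) quantization and stochastic rounding (SR), the NumPy sketch below quantizes a tensor to 8-bit dynamic fixed point and rounds stochastically so the rounding error is zero in expectation. This is a minimal illustration only: the function name dfp_quantize_sr and the per-tensor exponent rule are assumptions for this sketch, not the authors' implementation.

```python
# Minimal sketch of 8-bit dynamic fixed-point (DFP) quantization with
# stochastic rounding (SR). The exponent-selection rule here is illustrative.
import numpy as np

def dfp_quantize_sr(x, bits=8, rng=None):
    """Quantize a tensor to `bits`-bit dynamic fixed point with stochastic rounding."""
    rng = rng if rng is not None else np.random.default_rng()
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x), 0
    # Shared exponent: give the integer part just enough bits for the largest
    # magnitude, and spend the remaining bits on the fraction.
    int_bits = int(np.ceil(np.log2(max_abs))) + 1   # +1 for the sign bit
    frac_bits = bits - int_bits
    scale = 2.0 ** frac_bits

    scaled = x * scale
    floor_val = np.floor(scaled)
    # Stochastic rounding: round up with probability equal to the fractional
    # remainder, so the quantization is unbiased in expectation.
    round_up = rng.random(np.shape(x)) < (scaled - floor_val)
    q = np.clip(floor_val + round_up, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q / scale, frac_bits                     # dequantized value and exponent

# Example: quantizing a small gradient-like tensor to 8-bit DFP.
g = np.array([0.031, -0.502, 0.126, 0.874])
q, f = dfp_quantize_sr(g)
```

Because the exponent is chosen per tensor from its current dynamic range, the same 8-bit format can track quantities of very different magnitudes (weights, activations, gradients) during training, while stochastic rounding keeps small gradient updates from being systematically rounded away.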

GET CITATION

C Y Su, S Zhou, L Feng, W Zhang. Towards high performance low bitwidth training for deep neural networks. J. Semicond., 2020, 41(2): 022404. doi: 10.1088/1674-4926/41/2/022404

History

Manuscript received: 15 January 2020
Accepted manuscript: 21 January 2020
Uncorrected proof: 03 February 2020
Published: 11 February 2020
