
Accelerating hybrid and compact neural networks targeting perception and control domains with coarse-grained dataflow reconfiguration

Zheng Wang 1, Libing Zhou 2, Wenting Xie 2, Weiguang Chen 1, Jinyuan Su 2, Wenxuan Chen 2, Anhua Du 2, Shanliao Li 3, Minglan Liang 3, Yuejin Lin 2, Wei Zhao 2, Yanze Wu 4, Tianfu Sun 1, Wenqi Fang 1 and Zhibin Yu 1


Abstract: Driven by the continuous scaling of nanoscale semiconductor technologies, recent years have witnessed steady advances in machine learning techniques and applications. Dedicated machine learning accelerators, especially for neural networks, have attracted the research interest of computer architects and VLSI designers. State-of-the-art accelerators increase performance by deploying large numbers of processing elements, yet they still suffer degraded resource utilization on hybrid and non-standard algorithmic kernels. In this work, we exploit the properties of important neural network kernels for both perception and control to propose a reconfigurable dataflow processor, which adjusts its dataflow patterns, the functionality of its processing elements, and its on-chip storage according to the network kernel. In contrast to state-of-the-art fine-grained dataflow techniques, the proposed coarse-grained dataflow reconfiguration approach enables extensive sharing of computing and storage resources. Three hybrid networks, for MobileNet, deep reinforcement learning and sequence classification, are constructed and analyzed with a customized instruction set and toolchain. A test chip has been designed and fabricated in UMC 65 nm CMOS technology; it occupies a 1.8 × 1.8 mm² die and consumes a measured 7.51 mW at 100 MHz.
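To make the central idea concrete, the Python sketch below illustrates what coarse-grained dataflow reconfiguration could look like at the software level: one configuration is selected per network kernel (rather than steering individual data items), so a single shared PE array and on-chip SRAM pool serve convolutional, dense and recurrent kernels alike. This is a minimal illustrative sketch only; the names (KernelKind, DataflowConfig, configure), the dataflow patterns and the buffer splits are hypothetical assumptions for exposition, not the paper's instruction set or measured configuration.

# Illustrative sketch of per-kernel coarse-grained reconfiguration.
# All names and numbers below are hypothetical, not the paper's design.
from dataclasses import dataclass
from enum import Enum, auto

class KernelKind(Enum):
    CONV_STANDARD = auto()    # standard convolution (e.g., MobileNet 1x1 layers)
    CONV_DEPTHWISE = auto()   # depthwise convolution (e.g., MobileNet 3x3 layers)
    FULLY_CONNECTED = auto()  # dense layers (e.g., deep reinforcement learning heads)
    LSTM_GATE = auto()        # recurrent gates (sequence classification)

@dataclass
class DataflowConfig:
    pattern: str       # how operands stream through the PE array
    pe_function: str   # arithmetic mode loaded into each processing element
    buffer_split: dict # partition of the shared on-chip SRAM, in kB

def configure(kind: KernelKind, sram_kb: int = 128) -> DataflowConfig:
    """Pick one configuration for a whole kernel.

    In contrast to fine-grained schemes that steer individual data items,
    everything here is fixed once per kernel, so the same PEs and storage
    are shared across all kernel types.
    """
    if kind is KernelKind.CONV_DEPTHWISE:
        # Depthwise conv reuses activations per channel; weights are tiny.
        return DataflowConfig("row-stationary", "mac",
                              {"act": 64, "wgt": 16, "psum": 48})
    if kind is KernelKind.FULLY_CONNECTED:
        # Dense layers are weight-dominated: stream weights, hold activations.
        return DataflowConfig("weight-streaming", "mac",
                              {"act": 16, "wgt": 96, "psum": 16})
    if kind is KernelKind.LSTM_GATE:
        # Gate evaluation adds elementwise ops and sigmoid/tanh lookup.
        return DataflowConfig("weight-streaming", "mac+activation-lut",
                              {"act": 32, "wgt": 64, "psum": 32})
    # Standard convolution: balanced reuse of activations and weights.
    return DataflowConfig("output-stationary", "mac",
                          {"act": 48, "wgt": 48, "psum": 32})

if __name__ == "__main__":
    for kind in KernelKind:
        print(kind.name, configure(kind))

Because the configuration is chosen once per kernel, switching between, say, a depthwise layer and a dense layer amounts to rewriting a small control word rather than re-routing individual operands, which is the sharing advantage the abstract claims over fine-grained schemes.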

Key words: CMOS technology, digital integrated circuits, neural networks, dataflow architecture


References:

[1] Chen Y, Krishna T, Emer J, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE J Solid-State Circuits, 2017, 52, 127
[2] Jouppi N, Young C, Patil N, et al. In-datacenter performance analysis of a tensor processing unit. ACM/IEEE International Symposium on Computer Architecture, 2017, 1
[3] Chen Y, Luo T, Liu S, et al. DaDianNao: A machine-learning supercomputer. ACM/IEEE International Symposium on Microarchitecture, 2014, 609
[4] Cong J, Xiao B. Minimizing computation in convolutional neural networks. Artificial Neural Networks and Machine Learning, 2014, 281
[5] Yin S, Ouyang P, Tang S, et al. A high energy efficient reconfigurable hybrid neural network processor for deep learning applications. IEEE J Solid-State Circuits, 2017, 53, 968
[6] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. Int J Comput Vision, 2015, 115, 211
[7] Iandola F, Han S, Moskewicz M, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv: 1602.07360, 2016
[8] Howard A, Zhu M, Chen B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv: 1704.04861, 2017
[9] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv: 1409.1556, 2014
[10] Yang C, Wang Y, Wang X, et al. A reconfigurable accelerator based on the fast Winograd algorithm for convolutional neural networks in the Internet of Things. IEEE International Conference on Solid-State and Integrated Circuit Technology, 2018, 1
[11] Vasilache N, Johnson J, Mathieu M, et al. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv: 1412.7580, 2014
[12] Guo K, Zeng S, Yu J, et al. A survey of FPGA-based neural network accelerators. arXiv: 1712.08934, 2017
[13] Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with deep reinforcement learning. arXiv: 1312.5602, 2013
[14] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning. Nature, 2015, 518, 529
[15] Silver D, Huang A, Maddison C, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016, 529, 484
[16] Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge. Nature, 2017, 550, 354
[17] Chen Y, Emer J, Sze V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. ACM/IEEE International Symposium on Computer Architecture, 2016, 44, 367
[18] Gers F, Schmidhuber J, Cummins F. Learning to forget: Continual prediction with LSTM. 9th International Conference on Artificial Neural Networks, 1999, 850
[19] Basterretxea K, Tarela J M, Del Campo I. Approximation of sigmoid function and the derivative for hardware implementation of artificial neurons. IEE Proc Circuits Devices Syst, 2004, 151, 18
[20] Sutton R, Barto A. Reinforcement learning: An introduction. MIT Press, 2018
[21] Gulli A, Pal S. Deep learning with Keras. Packt Publishing Ltd, 2017
[22] Li S, Ouyang N, Wang Z. Accelerator design for convolutional neural network with vertical data streaming. IEEE Asia Pacific Conference on Circuits and Systems, 2018, 544
[23] Guo Y. Fixed point quantization of deep convolutional networks. International Conference on Machine Learning, 2016, 2849
[24] Opal Kelly FrontPanel product manual. https://opalkelly.com/products/frontpanel
[25] Chen W, Wang Z, Li S, et al. Accelerating compact convolutional neural networks with multi-threaded data streaming. IEEE Computer Society Annual Symposium on VLSI, 2019, 519
[26] Spryn M. Solving a maze with Q learning. www.mitchellspryn.com/2017/10/28/Solving-A-Maze-With-Q-Learning.html
[27] Liang M, Chen M, Wang Z. A CGRA-based neural network inference engine for deep reinforcement learning. IEEE Asia Pacific Conference on Circuits and Systems, 2018, 519
[28] Chen Y, Krishna T, Emer J, et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE International Solid-State Circuits Conference (ISSCC), 2016, 127
[29] Moons B, Uytterhoeven R, Dehaene W, et al. ENVISION: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI. IEEE International Solid-State Circuits Conference (ISSCC), 2017, 246
[30] Yin S, Ouyang P, Tang S, et al. A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications. Symposium on VLSI Circuits, 2017


Citation: Z Wang, L B Zhou, W T Xie, W G Chen, J Y Su, W X Chen, A H Du, S L Li, M L Liang, Y J Lin, W Zhao, Y Z Wu, T F Sun, W Q Fang, Z B Yu. Accelerating hybrid and compact neural networks targeting perception and control domains with coarse-grained dataflow reconfiguration. J. Semicond., 2020, 41(2): 022401. doi: 10.1088/1674-4926/41/2/022401


History

Manuscript received: 08 October 2019. Manuscript revised: 16 December 2019. Accepted manuscript online: 26 December 2019. Uncorrected proof: 06 January 2020. Published: 11 February 2020.
