High-Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization
1Kothapally Kanchana, 2B. Srilekha, *3Madipalli Sumalatha
1,2,3Department of ECE,
1,2,3Siddhartha Institute of Technology & Sciences, Narapally, Ghatkesar, Medchal-Malkajgiri, Telangana, India
*3sitsmtechms@gmail.com
________________________________________________________________________________________________________
Abstract: The efficacy of convolutional neural networks (CNNs) has led to their extensive use in image recognition and classification. However, embedded systems struggle to hold the massive volume of weight data a CNN requires in on-chip memory. Pruning a CNN model can reduce its size with little impact on accuracy, but the resulting irregular sparsity makes the model slower to execute on a parallel architecture. This paper introduces a hardware-centric CNN compression method that partitions a deep neural network (DNN) model into pruning layers (P-layers) and no-pruning layers (NP-layers). An NP-layer retains a regular weight distribution for efficient parallel processing. Pruning leaves a P-layer with an irregular structure that is unfriendly to hardware, but the high compression ratio it yields makes this trade-off worthwhile. Using uniform and incremental quantization methods, we balance processing efficiency against compression ratio with only a small loss of accuracy. We further improve the regular model by integrating multiple parallel finite impulse response (FIR) filters into a distributed convolutional architecture for the NP-layers, and we propose an ADF-based processing element for the irregular sparse model of the P-layers. A hardware/algorithm co-optimization (HACO) method applies the proposed compression strategy and hardware architecture to implement the NP-P hybrid compressed CNN model on FPGAs. Using a hardware accelerator on a single FPGA chip without off-chip memory, the VGG-16 network achieves a compression ratio of 27.5× with a top-5 accuracy loss of only 0.44%. Implemented on a Xilinx VCU118 evaluation board for image applications, the compressed VGG-16 model reaches 83.0 frames per second (FPS), a 1.8× speedup.
Index Terms - CNN, Accelerators, Co-Optimization.
________________________________________________________________________________________________________
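To make the hybrid compression scheme described in the abstract concrete, the following minimal Python sketch illustrates the idea. It assumes magnitude-based pruning for the P-layers and uniform fixed-point quantization for every layer; the function names, pruning ratio, and bit width are illustrative assumptions, not the exact procedure or code used in this work.

import numpy as np

def prune_layer(weights, ratio=0.9):
    # Zero out the smallest-magnitude weights: the irregular sparsity of a P-layer.
    threshold = np.quantile(np.abs(weights), ratio)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def quantize_uniform(weights, bits=8):
    # Uniform fixed-point quantization; preserves the regular shape of an NP-layer.
    scale = max(np.max(np.abs(weights)), 1e-12) / (2 ** (bits - 1) - 1)
    return np.round(weights / scale) * scale

def compress(model, p_layer_names, prune_ratio=0.9, bits=8):
    # P-layers are pruned (high compression, irregular structure);
    # all layers are quantized so NP-layers stay regular and parallel-friendly.
    compressed = {}
    for name, w in model.items():
        if name in p_layer_names:
            w = prune_layer(w, prune_ratio)
        compressed[name] = quantize_uniform(w, bits)
    return compressed

The HACO flow described in the abstract additionally couples these software-side steps to the hardware architecture when selecting which layers to prune and how aggressively; that feedback loop is beyond this sketch.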