“Low-Power CNN Acceleration on PYNQ FPGA Using Linear Approximations and HLS-Based IP Core Design” – Volume 12 Issue 5

International Journal of Computer Techniques
ISSN 2394-2231
Volume 12, Issue 5  |  Published: September – October 2025
Authors
Mrs. K. S. S. Soujanya Kumari, K. Venkata Rao

Abstract

Convolutional neural networks (CNNs) have transformed image recognition, but their repeated convolution operations demand more computation than small embedded systems can supply. CPU- and GPU-based systems suffer latency and high power draw, leaving a gap for real-time, low-power inference in edge applications. This research investigates how CNNs with linearly approximated activation functions and approximate multipliers can be implemented on modern FPGAs to close that gap. Using High-Level Synthesis (HLS), a configurable, high-throughput IP core was designed and deployed on a Zynq-based PYNQ FPGA. The proposed design reduces arithmetic complexity while maintaining inference accuracy, enabling efficient CNN execution with minimal resource usage. On the MNIST dataset, the accelerator runs 52× faster than ARM-only processing on the same board while drawing about 1.54 W, a power budget suitable for embedded applications and open to further optimization. These results are significant for mobile FPGA-SoC platforms such as the Xilinx Ultra96 and PYNQ-Z1, which are used in autonomous drones, ADAS systems, and industrial IoT devices. The study bridges efficient algorithms and hardware constraints, providing a scalable, low-power solution for deploying deep-learning models in real-time embedded systems.
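The abstract does not reproduce the exact linear approximation used for the activation functions. One widely used scheme that matches the description is PLAN (Piecewise Linear Approximation of Nonlinearity) for the sigmoid, whose segment slopes are powers of two so that every multiply reduces to a hardware shift. The sketch below is an illustrative assumption, not the authors' exact design:

```python
def plan_sigmoid(x: float) -> float:
    """Piecewise-linear approximation of the sigmoid (PLAN scheme).

    All slopes (0.25, 0.125, 0.03125) are powers of two, so in an
    FPGA datapath each multiply becomes a bit-shift -- the kind of
    arithmetic simplification linearly approximated activations buy.
    NOTE: illustrative only; the paper's exact scheme may differ.
    """
    neg = x < 0.0
    ax = -x if neg else x          # exploit sigmoid symmetry: s(-x) = 1 - s(x)
    if ax >= 5.0:
        y = 1.0
    elif ax >= 2.375:
        y = 0.03125 * ax + 0.84375
    elif ax >= 1.0:
        y = 0.125 * ax + 0.625
    else:
        y = 0.25 * ax + 0.5
    return 1.0 - y if neg else y
```

Because the approximation is exact at x = 0 and saturates for |x| ≥ 5, its worst-case deviation from the true sigmoid stays small enough that MNIST-scale classification accuracy is typically unaffected.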

Keywords

Convolutional Neural Network, FPGA acceleration, PYNQ-Z1, High-Level Synthesis, approximate computing, linear activation functions, embedded systems, low power consumption, real-time inference, hardware-software co-design.

Conclusion

This study introduces a customizable CNN accelerator built on the PYNQ-Z1 FPGA that incorporates approximate multipliers and linearly approximated activation functions to reduce computational demand and power consumption. Designed with High-Level Synthesis (HLS), the accelerator achieves a 52× speedup over ARM-based execution while maintaining near-baseline accuracy on the MNIST dataset, with total power consumption of about 1.54 W. These findings demonstrate the practicality of deploying deep-learning models on resource-constrained embedded systems for real-time applications, including autonomous systems and industrial IoT. The proposed architecture contributes to the expanding field of edge AI by providing a scalable and energy-efficient solution for CNN inference. Future work could investigate dynamic partial reconfiguration, support for deeper network architectures, and integration with heterogeneous computing systems to further improve adaptability and performance.
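The conclusion credits approximate multipliers for part of the power saving but does not specify the design. One common family truncates low-order operand bits so the low partial-product rows of the multiplier array are never built, trading a bounded error for area and power. The sketch below models that idea for unsigned fixed-point operands; it is an assumption for illustration, not the paper's actual multiplier:

```python
def approx_mul(a: int, b: int, trunc_bits: int = 4) -> int:
    """Truncation-based approximate multiplier model (unsigned operands).

    Zeroing the low `trunc_bits` bits of each operand removes the
    corresponding low-order partial-product rows from the hardware
    multiplier array, shrinking it at the cost of a bounded,
    always-negative error.  Illustrative assumption only; the
    accelerator's actual approximate multiplier may differ.
    """
    mask = ~((1 << trunc_bits) - 1)   # clears the low trunc_bits bits
    return (a & mask) * (b & mask)
```

For 8-bit operands with `trunc_bits=4`, `approx_mul(255, 255)` yields 57600 against the exact 65025, roughly an 11% worst-case underestimate; CNN inference tolerates such errors well because convolution sums average many products.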



© 2025 International Journal of Computer Techniques (IJCT).