Abstract
The rapid expansion of deep learning applications has driven significant interest in optimizing the execution of convolutional neural networks (CNNs), particularly on edge and embedded devices. The convolutional layer, being the computational backbone of CNNs, is highly resource-intensive and requires efficient implementation strategies. This paper proposes a hardware-software co-optimization framework that jointly tunes computational graph mappings and hardware accelerator configurations to maximize throughput and minimize energy consumption. The design leverages parameter-aware scheduling and layer-specific profiling to bridge the performance-efficiency gap observed in traditional accelerator deployments. Empirical results demonstrate up to a 2.4× improvement in latency and a 1.9× reduction in energy usage over baseline FPGA-based implementations.