Browsing by Subject "Performance Simulators"


Efficient Computation of Deep Neural Workloads on Domain-Specific Custom Accelerator Platforms
(2022-10) Manasi, Susmita Dey
Recent breakthroughs in artificial intelligence (AI) using deep neural networks (DNNs) have made DNNs an important component of all modern computing systems. However, one of the key challenges of this AI-powered era is to find efficient solutions to meet the high computational demands of executing DNN algorithms. General-purpose machines, such as CPUs and GPUs, are unable to efficiently execute these heavy neural workloads, and incur high overheads. Relative to these platforms, DNN execution on domain-specific custom hardware accelerators has been shown to offer major improvements in cost, energy, and performance. However, customized neural engines are facing increasingly difficult challenges in meeting the demands of DNNs. One key challenge is related to the extremely high computational complexity of executing operations in multiple layers of the DNN. Inherently, DNN computations are very memory-intensive, and inefficient implementations can lead to large energy and performance penalties in the hardware. Therefore, developing techniques that improve the energy efficiency and throughput of customized DNN accelerators is vital to enable the widespread deployment of AI applications. Another key challenge is associated with the design process for custom neural engines. Synthesis and fabrication of a neural accelerator chip are extremely time-consuming and costly. Moreover, due to the multidimensional computational features in DNNs, the set of choices available in building neural hardware constitutes a large design space. Consequently, there is a strong need to develop specialized performance simulators and optimization frameworks for domain-specific neural processors to perform early-stage design evaluations. However, the research community lacks comprehensive tools and simulation resources that can guide neural hardware architects with principled design insights and systematic evaluation of the intricate DNN accelerator design space.
In this thesis, we attempt to address these two key challenges. We develop a set of comprehensive and accurate performance simulators for ASIC-based DNN accelerators to support design-stage optimization and fast evaluation of the hardware. In addition, we develop three energy- and performance-aware solutions to improve the efficiency of executing heavy DNN workloads on ASIC accelerator platforms. We develop a new analytical model, CNNergy, to predict the energy of executing deep convolutional neural networks (CNNs) on an ASIC-based deep learning (DL) accelerator. CNNergy is based on careful modeling of the behavior of actual neural hardware, captures all major components of the computation, and is successfully validated against measured silicon data. Utilizing the analytical framework of CNNergy, we also develop an energy-efficient scheme, NeuPart, that optimizes energy on a battery-constrained mobile client by partitioning CNN computations between in situ processing on the client and offloaded computations in the cloud. We demonstrate that partitioned computation by NeuPart provides significant energy savings on the client over fully cloud-based or fully in situ computation on standard CNN topologies. For example, at 80 Mbps effective bit rate and 0.78 W transmission power, the optimal partition for AlexNet [SqueezeNet] saves up to 52.4% [73.4%] energy over a fully cloud-based computation, and 27.3% [28.8%] energy over a fully in situ computation. To enable optimized execution of memory-intensive CNN workloads, we develop a novel optimization framework, DeepOpt, for general ASIC-based systolic accelerators, which systematically studies the large design space of scheduling multidimensional CNN computation on neural hardware.
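The client-side trade-off behind a partitioning scheme such as NeuPart can be sketched as follows. This is a minimal illustration, not the thesis's model: the function names, the per-layer energies, and the data sizes are all hypothetical, and transmission energy is assumed to be simply transmit power multiplied by time on air. Only the 80 Mbps bit rate and 0.78 W transmission power come from the abstract.

```python
def transmission_energy_joules(num_bits, bit_rate_bps, tx_power_watts):
    """Energy to send num_bits: transmit power times time on air (assumed model)."""
    return tx_power_watts * (num_bits / bit_rate_bps)


def best_partition(input_bits, layer_energy_j, layer_output_bits,
                   bit_rate_bps, tx_power_watts):
    """Return (k, energy): run the first k layers on the client, offload the rest.

    k = 0 is fully cloud-based (the raw input is transmitted).
    k = len(layer_energy_j) is fully in situ (nothing is transmitted in this sketch).
    """
    n = len(layer_energy_j)
    best_k, best_energy = None, float("inf")
    compute = 0.0
    for k in range(n + 1):
        if k > 0:
            compute += layer_energy_j[k - 1]  # client runs one more layer
        if k == 0:
            send_bits = input_bits
        elif k < n:
            send_bits = layer_output_bits[k - 1]
        else:
            send_bits = 0
        total = compute + transmission_energy_joules(
            send_bits, bit_rate_bps, tx_power_watts)
        if total < best_energy:
            best_k, best_energy = k, total
    return best_k, best_energy


# Hypothetical three-layer network; only the link parameters are from the abstract.
k, energy = best_partition(
    input_bits=8e6,
    layer_energy_j=[0.01, 0.02, 0.05],
    layer_output_bits=[4e6, 1e6, 1e3],
    bit_rate_bps=80e6,
    tx_power_watts=0.78,
)
print(k, energy)
```

With these made-up numbers an intermediate split wins because early layers shrink the data to be transmitted while later layers cost more to compute than to offload, which is the kind of effect the reported savings over both extremes suggest.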
The two key contributions of DeepOpt are: (1) a novel layer-specific optimized scheduling within a CNN, showing significant improvements over fixed scheduling schemes, and (2) optimal selection of hardware resources (e.g., number of computation units, selection of on-chip storage parameters). The choices made by DeepOpt are tuned for specific networks and optimized for applications that may be energy- or latency-sensitive. Our analysis using standard CNN topologies reveals that, for the same hardware area, optimal hardware allocation significantly reduces the execution cost of a network as compared to generic allocation of hardware resources. For example, for a 16 mm² 65 nm ASIC implementation, DeepOpt shows improvements of up to 50x in the energy-delay product for VGG-16 and 41x for GoogleNet-v1. Our third efficient accelerator solution targets the lightweight family of CNNs that uses depthwise convolution (DwC) in key layers. DL accelerators are primarily optimized for standard convolution, and the structural difference between DwC and standard convolution leads to a significant performance bottleneck in executing lightweight CNNs on such platforms. We develop an efficient methodology that reuses the fast general matrix multiplication (GEMM) core of DL accelerators by mapping DwC to channel-wise parallel matrix-vector multiplications. We formulate an analytical framework to guide pre-RTL hardware choices, and develop new hardware modules and software support for end-to-end evaluation of the solution. Our GEMM-based DwC execution strategy offers substantial performance gains for lightweight CNNs: a 7x speed-up and 1.8x lower off-chip communication for MobileNet-v1 over a conventional DL accelerator, a 74x speed-up over a CPU, and even a 1.4x speed-up over a power-hungry GPU. Finally, we develop a novel performance analysis framework, SimDIT, which supports both CNN training and inference on ASIC-based hardware accelerator platforms.
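The idea of expressing DwC as channel-wise matrix-vector multiplications can be illustrated in a few lines. The abstract does not describe the actual hardware mapping, so the sketch below is an assumption: it uses an im2col-style layout in which each channel's input patches form the rows of a matrix that is multiplied by that channel's flattened kernel. Stride 1, no padding, and all function names are illustrative choices.

```python
def depthwise_conv_direct(x, w):
    """Reference DwC. x: [C][H][W], w: [C][K][K]; stride 1, no padding."""
    C, H, W, K = len(x), len(x[0]), len(x[0][0]), len(w[0])
    out_h, out_w = H - K + 1, W - K + 1
    return [[[sum(x[c][i + di][j + dj] * w[c][di][dj]
                  for di in range(K) for dj in range(K))
              for j in range(out_w)]
             for i in range(out_h)]
            for c in range(C)]


def depthwise_conv_matvec(x, w):
    """Same result, computed as one matrix-vector product per channel."""
    C, H, W, K = len(x), len(x[0]), len(x[0][0]), len(w[0])
    out_h, out_w = H - K + 1, W - K + 1
    out = []
    for c in range(C):  # channels are independent, so they could run in parallel
        kernel_vec = [w[c][di][dj] for di in range(K) for dj in range(K)]
        # One row per output pixel, holding that pixel's K*K input patch.
        patch_matrix = [[x[c][i + di][j + dj]
                         for di in range(K) for dj in range(K)]
                        for i in range(out_h) for j in range(out_w)]
        flat = [sum(a * b for a, b in zip(row, kernel_vec))
                for row in patch_matrix]
        out.append([flat[i * out_w:(i + 1) * out_w] for i in range(out_h)])
    return out


# Small integer example: 2 channels, 4x4 input, 3x3 kernels.
x = [[[c + i * 4 + j for j in range(4)] for i in range(4)] for c in range(2)]
w = [[[(c + 1) * (di - dj) for dj in range(3)] for di in range(3)]
     for c in range(2)]
print(depthwise_conv_matvec(x, w) == depthwise_conv_direct(x, w))
```

Unlike standard convolution, there is no summation across channels here, which is why each channel yields a matrix-vector product rather than contributing to one large matrix-matrix product.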
Modern CNN graphs consist of many types of layers other than convolution, which become especially important during training due to their high computational cost. However, today’s performance analysis frameworks for deep learning accelerators largely focus on convolution layers only and lack support for training operations. Our modeling effort in SimDIT comprehensively covers convolution and non-convolution operations of both CNN inference and training with a highly parameterizable hardware substrate. SimDIT is integrated with a backend silicon implementation flow and provides detailed end-to-end performance statistics (i.e., data access cost, cycle counts, energy, and power) for executing CNN inference and training workloads on a specific platform. Our performance analysis using SimDIT reveals that on a 64x64 processing array, non-convolution operations constitute 59.5% of the total runtime for a ResNet-50 training workload. In addition, by optimally distributing available off-chip DRAM bandwidth and on-chip SRAM resources, SimDIT achieves an 18x performance improvement over a generic static resource allocation for ResNet-50 inference.
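One way to see why the 59.5% non-convolution share matters is to apply Amdahl's law to it. The short calculation below is an illustration added here, not an analysis from the thesis; it assumes the reported share is fixed and that only the convolution portion of the runtime is accelerated. The function name is hypothetical.

```python
def overall_speedup(accelerated_fraction, factor):
    """Amdahl's law: speedup when only part of the runtime gets `factor` times faster."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


NON_CONV_SHARE = 0.595             # reported share for ResNet-50 training, 64x64 array
conv_share = 1.0 - NON_CONV_SHARE  # remaining runtime spent in convolution

# Speeding up convolution alone gives diminishing returns.
for factor in (2, 10, 1000):
    print(factor, round(overall_speedup(conv_share, factor), 3))
```

Under these assumptions the overall gain can never exceed 1 / 0.595, roughly 1.68x, no matter how fast the convolution layers become, which is consistent with the abstract's argument that non-convolution operations need to be modeled for training workloads.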

University Digital Conservancy
