Unnikrishnan, Nanda
2023-11-28
2023-11-28
2023-06
https://hdl.handle.net/11299/258677
University of Minnesota Ph.D. dissertation. June 2023. Major: Electrical/Computer Engineering. Advisor: Keshab Parhi. 1 computer file (PDF); xii, 169 pages.
Artificial intelligence (AI) has become an increasingly important and prevalent technology in today’s world. The past decade has seen tremendous growth in AI, which is now used in a wide range of applications, including healthcare, finance, transportation, research, manufacturing, and even entertainment. One of the most significant advancements in AI has been the development of deep neural networks (DNNs), which have revolutionized the field by providing unprecedented human-like performance in solving many real-world problems. However, the computations involved in DNNs are expensive and time-consuming, especially for large and complex networks. Additionally, a variety of models, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers, and graph neural networks (GNNs), pose significant challenges for hardware design, particularly due to the diverse set of operations used. Each operation brings its own set of challenges for energy, performance, and memory that do not always align with one another, precluding a one-size-fits-all solution. The thesis addresses the above challenges in three parts. The first part aims to develop a fundamental understanding of the different operations involved in different DNN models. The thesis explores the evolution of brain-inspired computing models from a historical perspective, focusing on DNNs, CNNs, RNNs, and GNNs, among others. This provides the necessary context for optimizing DNN operations for training and inference. The second part of the thesis proposes hardware-software co-design techniques, inspired by the design of DSP systems, to address energy, computation, and memory challenges during training for CNNs.
The thesis proposes a novel approach for using systolic architectures to train convolutional neural networks using gradient interleaving, called InterGrad. The approach involves interleaving the computations of two gradients on the same configurable systolic array, resulting in significant savings in terms of the number of cycles and memory accesses. The proposed method uses 25% fewer cycles and memory accesses, and 16% less energy, in state-of-the-art CNNs, and up to 2.2× fewer cycles and memory accesses in the fully connected layers. The thesis also presents a novel optimization approach called LayerPipe, which explores how to optimally partition and pipeline DNN training workloads on multi-processor systems. LayerPipe can better balance workloads while minimizing the communication overhead. LayerPipe achieves an average speedup of 25%, and upwards of 80% with 7 to 9 processors, when compared to prior approaches such as PipeDream. Lastly, the thesis explores the design of dedicated hardware accelerators for graph neural networks (GNNs). The proposed SCV-GNN method uses a novel sparse compressed vectors (SCV) format optimized for the aggregation operation. The proposed method achieves geometric mean speedups of 7.96× and 7.04× over compressed sparse column (CSC) and compressed sparse row (CSR) aggregation operations, respectively, and reduces the memory traffic by factors of 3.29× and 4.37× over CSC and CSR, respectively.
en
ASIC accelerators
Deep learning
Graph accelerators
Hardware-software co-design
Neural networks
VLSI for DSP
Towards Hardware-Software Co-design for Energy-Efficient Deep Learning
Thesis or Dissertation