Design of high-efficiency accelerators for diverse AI workloads
Abstract
Artificial intelligence (AI) has evolved from dense deep neural networks (DNNs) toward a diverse set of models, such as sparse graph convolutional networks (GCNs) and large language models (LLMs). AI accelerators are essential for processing these workloads efficiently, but designing them is challenging because the models differ in size, processing flow, memory access patterns, and data/model sparsity. This thesis discusses the key design requirements of AI accelerators for efficient execution of diverse AI models, namely reconfigurability, heterogeneity, and scalability, and proposes three works based on these findings.

The first work is an FPGA-based, dynamically reconfigurable GCN accelerator that efficiently handles the heterogeneous sparse and dense compute patterns in GCNs. Unlike DNNs, GCNs are sparse, irregular, and unstructured, posing unique challenges to hardware acceleration with regular processing elements. To overcome these challenges, we propose an end-to-end hardware-software co-design that accelerates GCNs on resource-constrained FPGAs with the following features: (1) a custom dataflow that leverages symmetry about the diagonal of the adjacency matrix to accelerate feature aggregation for undirected graphs; (2) unified compute cores for both the aggregation and transformation phases, with full support for the symmetry-based dataflow; and (3) software preprocessing of the graph that rearranges edges and features to match the custom dataflow. The accelerator is implemented on an Intel Stratix 10 MX FPGA board with HBM2 and demonstrates a 1.3×-110.5× improvement in end-to-end GCN latency over state-of-the-art FPGA implementations on the Cora, Pubmed, Citeseer, and Reddit graph datasets.
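To make the symmetry-based dataflow of this first design concrete, the following is a minimal software sketch of symmetric feature aggregation (an illustration only, not the thesis's FPGA implementation; the function name, the (i, j, w) edge-list format, and the NumPy usage are assumptions for exposition). Because the adjacency matrix of an undirected graph is symmetric, only the upper-triangular edges need to be stored and streamed, and each stored edge updates both of its endpoint rows, roughly halving edge storage and traversal:

```python
# Illustrative sketch: feature aggregation X' = A @ X for an undirected graph,
# streaming only the upper-triangular half of the symmetric adjacency matrix.
# Each stored edge (i, j, w) contributes to both row i and row j, since A[j, i] == A[i, j].
# Names and data layout are assumptions; the actual accelerator uses a custom
# hardware dataflow, not this software loop.
import numpy as np

def aggregate_symmetric(upper_edges, X):
    """upper_edges: iterable of (i, j, w) with i <= j (upper triangle, incl. any self-loops)."""
    out = np.zeros_like(X)
    for i, j, w in upper_edges:
        out[i] += w * X[j]        # edge (i, j) seen from node i
        if i != j:
            out[j] += w * X[i]    # symmetric contribution to node j
    return out

# Tiny usage example: a 3-node path graph 0-1-2 with unit edge weights.
X = np.arange(6, dtype=float).reshape(3, 2)   # one 2-dim feature vector per node
upper = [(0, 1, 1.0), (1, 2, 1.0)]            # only i <= j edges are stored
print(aggregate_symmetric(upper, X))          # [[2. 3.] [4. 6.] [2. 3.]]
```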
The second work addresses the need for a heterogeneous accelerator that supports a diverse set of AI workloads: a new RISC-V based reconfigurable heterogeneous accelerator that aims to balance computation needs and energy efficiency. Based on representative DNNs and GCNs, we propose two types of processing elements (PEs): (1) a latch-based digital in-memory computing array (LIMC) for regular, dense computation, and (2) a digital SIMD array with fine-grained control for irregular, sparse workloads. To integrate both types of PEs and dynamically manage the dataflow, we design reconfigurable scatter/gather modules and buffers that support different types of memory access and compute patterns. The heterogeneous accelerator has been designed and taped out in 16nm. Based on the 16nm design data, it achieves an 11× improvement in latency compared to baseline homogeneous accelerators, and up to 2.1× and 20× improvements in TOPS/mm² and TOPS/W, respectively, compared to state-of-the-art accelerators.

With AI models evolving at a rapid pace and data volumes growing dramatically, it is also critical that AI accelerators scale easily to meet performance and data bandwidth requirements. Traditional computer vision tasks in autonomous machines and AR/VR rely on high-speed links, such as MIPI CSI-2, to transfer data from sensors to computing units. While these systems have struggled in the past to meet the growing demands for high bandwidth and low latency, today's advanced packaging technologies, which allow multiple tiers of sensing and computing chiplets to be stacked together, have the potential to support real-time processing of these tasks.

In the third work, we utilize advanced packaging technology, specifically 3D die stacking with high-density copper (Cu) pillars, to develop a 2-tier hardware-software co-design for an AI vision transformer (ViT) accelerator. Ultimately, this 2-tier accelerator will be integrated with the sensing tier to process continuous data streams. Our 3D stacking approach, featuring face-to-face bonding at a 5 µm pitch, offers two key advantages: (1) higher compute density than what is offered by 2D/2.5D packaging and (2) higher connection density than conventional TSV-based stacking. Our synthesis results at 28nm demonstrate an 18× improvement in latency and a 127× reduction in energy consumption compared to conventional 2D designs, and an 11× improvement in latency compared to a similar 3D architecture.
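As a rough, back-of-envelope illustration of the connection-density advantage claimed above (all numbers here are assumptions except the stated 5 µm face-to-face pitch; the 40 µm TSV pitch is a hypothetical value for comparison only, not a figure from the thesis):

```python
# Illustrative estimate: vertical connections per mm^2 for a square grid of
# face-to-face Cu pillars at a 5 um pitch, versus a hypothetical TSV grid at 40 um.
pitch_f2f_um = 5.0     # stated face-to-face bonding pitch
pitch_tsv_um = 40.0    # assumed TSV pitch, for comparison only

density_f2f = (1000.0 / pitch_f2f_um) ** 2   # 40,000 connections per mm^2
density_tsv = (1000.0 / pitch_tsv_um) ** 2   # 625 connections per mm^2

print(f"face-to-face: {density_f2f:.0f}/mm^2, TSV: {density_tsv:.0f}/mm^2, "
      f"ratio: {density_f2f / density_tsv:.0f}x")
```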
Description
University of Minnesota Ph.D. dissertation. March 2025. Major: Electrical Engineering. Advisor: Yu Cao. 1 computer file (PDF); x, 78 pages.
Suggested Citation
Raveendran Nair, Gopikrishnan. (2025). Design of high-efficiency accelerators for diverse AI workloads. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/273527.
