Browsing by Subject "Computer Architecture"
Now showing 1 - 3 of 3
Item: Advancing architecture optimizations with Bespoke Analysis and Machine Learning (2023-01)
Sethumurugan, Subhash

With transistor scaling nearing atomic dimensions and leakage power dissipation imposing strict energy limits, it has become increasingly difficult to improve energy efficiency in modern processors without sacrificing performance or functionality. One way to avoid this tradeoff is to take a cue from application behavior and eliminate energy expenditure in areas that will not impact application performance. This approach is especially relevant in embedded systems, which often have ultra-low power and energy requirements and typically run a single application repeatedly throughout their operational lifetime. In such processors, application behavior can be effectively characterized and leveraged to identify opportunities for "free" energy savings. We find that, in addition to instruction-level sequencing, constraints imposed by program-level semantics can be used to automate processor customization and further improve energy efficiency. This dissertation describes automated techniques to identify, form, propagate, and enforce application-based constraints in gate-level simulation, revealing opportunities to optimize a processor at the design level. While this can significantly improve energy efficiency, truly maximizing energy efficiency requires considering not only design-level optimizations but also architectural optimizations. Architectural optimization, however, presents several challenges. First, the symbolic simulation tool used to characterize the gate-level behavior of an application must be written anew for each new architecture; given the expansiveness of the architectural parameter space, this is not feasible.
To overcome this barrier, we developed a generic symbolic simulation tool that can handle any design, technology, or architecture, making it possible to explore application-specific architectural optimizations. However, exploring each parameter variation still requires synthesizing a new design and performing application-specific optimizations, which again becomes infeasible given the large architectural parameter space. Given the wide use of machine learning (ML) for effective design space exploration, we enlisted ML to explore the architectural parameter space efficiently. We built a tool that accounts for the impact of architectural optimizations on an application and predicts the architectural parameters that yield near-optimal energy efficiency for that application. This dissertation explores the objective, training, and inference of the ML model in detail. Inspired by the ability of ML-based tools to automate architecture optimization, we also apply ML-guided architecture design and optimization to other challenging problems. Specifically, we target cache replacement, an area in which performance has historically been difficult to improve and in which improvements have largely been ad hoc, depending heavily on designer skill and creativity. We show that ML can be used to automate the design of a replacement policy that meets or exceeds the performance of the current state of the art.

Item: The Design of Spintronic-based Circuitry for Memory and Logic Units in Computer Systems (2018-10)
Ma, Cong

As CMOS technology faces serious scaling and power consumption issues, emerging beyond-CMOS technologies have drawn substantial attention in recent years. Spintronic devices, among the most promising CMOS alternatives, offer smaller size and low standby power consumption, fitting the needs of the growing mobile and IoT device market.
Spin-Transfer Torque MRAM (STT-MRAM), with read latency comparable to SRAM, and All-Spin Logic (ASL), capable of implementing purely spin-based circuits, are potential candidates to replace CMOS memory and logic devices. However, spintronic memory still requires higher write energy, presenting a challenge to memory hierarchy design when energy consumption is a concern. This motivates the use of STT-MRAM for the first-level caches of a multicore processor to reduce energy consumption without significantly degrading performance. The large STT-MRAM first-level cache saves leakage power, while a small level-0 cache recovers the performance lost to STT-MRAM's long write latency. This combination reduces the energy-delay product by 65% on average compared to a CMOS baseline. All-spin logic suffers from random bit flips that significantly impact Boolean logic reliability. Stochastic computing, which uses random bit streams for computation, has shown low hardware cost and high fault tolerance compared to conventional binary encoding, motivating the use of ASL in stochastic computing to exploit its simplicity and fault tolerance. A finite-state machine (FSM), a sequential stochastic computing element, can compute complex functions, including exponentiation and hyperbolic tangent, more efficiently, but it suffers from long calculation latency and autocorrelation issues. We propose a parallel FSM implementation scheme that uses an estimator and a dispatcher to initialize the FSM directly to its steady state; it matches or exceeds the serial implementation at the cost of some hardware overhead. We also propose a re-randomizer based on an up/down counter to address the autocorrelation issue.

Item: Performance-correctness challenges in emerging heterogeneous multicore processors (2013-12)
Mekkat, Vineeth

We are witnessing a tremendous amount of change in the design of the modern microprocessor.
With dozens of CPU cores on chip in recent multicore processors, the search for thread-level parallelism (TLP) is more significant than ever. In parallel, a very different processor architecture has emerged that extracts parallelism at an entirely different scale: originally proposed for accelerating graphical applications, graphics processing units (GPUs) are increasingly employed to improve the performance of general-purpose applications. Advances in process technology and the need for energy efficiency have brought CPU and GPU cores together on the same die, forming on-chip heterogeneous multicore processors; several industrial designs that follow this philosophy are already part of mainstream computing. The presence of diverse cores on the same die, sharing on-chip resources, presents several challenges in achieving an efficient design. In particular, this thesis addresses two key aspects of designing efficient heterogeneous multicore processors: performance and correctness.

Performance is of paramount concern in the design of a microprocessor, and the last-level cache (LLC) is a critical on-chip component from this perspective. Several techniques have been proposed to share the LLC efficiently among on-chip cores, but when those cores differ significantly in their memory access characteristics, current techniques face severe challenges in attaining effective LLC sharing. In the first part of this thesis, we address this problem and propose a new policy that improves the management of the shared LLC in the presence of heterogeneous workloads, in terms of both performance and energy efficiency.

Execution correctness is an important concern in the quest to extract parallelism. Concurrency bugs, such as data races, are severe impediments to the effectiveness of parallel computing.
Although several techniques have been proposed to identify and rectify data races, their implementation faces several challenges. Software-based mechanisms are cheaper to implement but inflict severe performance overhead on the monitored application; the high performance of hardware-based mechanisms, on the other hand, comes at the expense of additional hardware support and increased implementation cost. In the second part of this thesis, we propose a technique that uses the available on-chip GPU cores to perform efficient data race detection for applications executing on the CPU cores. Together, these two techniques address two critical challenges in the design of emerging heterogeneous multicore processors.
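To illustrate the kind of per-access monitoring that dynamic data race detectors perform, the sketch below shows a minimal lockset-style check in Python. This is not the mechanism proposed in the thesis (which offloads detection work to on-chip GPU cores); it is only an assumed, simplified example of the analysis class being accelerated, and all names in it are hypothetical.

```python
class LocksetDetector:
    """Minimal lockset-style race check: for each shared address, track the
    set of locks held on every access; if that set ever becomes empty, no
    single lock consistently protects the address (a potential data race).
    Illustrative sketch only; real detectors track far more state."""

    def __init__(self):
        # address -> set of locks held on every access observed so far
        self.candidate = {}

    def access(self, addr, locks_held):
        """Record an access to addr while holding locks_held.
        Returns True if the address still has a protecting lock candidate,
        False if a potential race has been detected."""
        if addr not in self.candidate:
            self.candidate[addr] = set(locks_held)   # first access seeds the set
        else:
            self.candidate[addr] &= set(locks_held)  # intersect with held locks
        return len(self.candidate[addr]) > 0


d = LocksetDetector()
d.access(0x10, {"L1"})         # first access: candidate lockset {L1}
d.access(0x10, {"L1", "L2"})   # still consistently protected by L1
ok = d.access(0x10, {"L2"})    # no common lock remains -> potential race
```

Even this tiny check runs on every monitored memory access, which is why software-only implementations are so costly and why offloading the work to idle GPU cores is attractive.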