Holey, Anup Purushottam2015-04-092015-04-092015-01https://hdl.handle.net/11299/171130University of Minnesota Ph.D. dissertation. January 2015. Major: Computer Science. Advisor: Antonia Zhai. 1 computer file (PDF); xi, 122 pages.Graphics Processing Units (GPUs) are becoming increasingly popular not only across various scientific communities, but also as integrated data-parallel accelerators on existing multicore processors. Support for massive fine-grained parallelism in contemporary GPUs provides a tremendous amount of computing power. GPUs support thousands of lightweight threads to deliver high computational throughput. Popularity of GPUs is facilitated by easy-to-adopt programming models such as CUDA and OpenCL that aim to ease programmers' efforts while developing parallel GPU applications. However, designing and implementing correct and efficient GPU programs is still challenging since programmers must consider interaction between thousands of parallel threads. Therefore, addressing these challenges is essential for improving programmers' productivity as well as software reliability. Towards this end, this dissertation proposes mechanisms for improving programmability of irregular applications and ensuring correctness of compute kernels.Some applications possess abundant data-level parallelism, but are unable to take advantage of GPU's parallelism. They exhibit irregular memory access patterns to the shared data structures. Programming such applications on GPUs requires synchronization mechanisms such as locks, which significantly increase the programming complexity. Coarse-grained locking, where a single lock controls all the shared resources, although reduces programming efforts, can substantially serialize GPU threads. On the other hand, fine-grained locking, where each data element is protected by an independent lock, although facilitates maximum parallelism, requires significant programming efforts. To overcome these challenges, we propose transactional memory (TM) on GPU that is able to achieve performance comparable to fine-grained locking, while requiring minimal programming efforts. Transactional execution can incur runtime overheads due to activities such as detecting conflicts across thousands of GPU threads and managing a consistent memory state. Thus, in this dissertation we illustrate lightweight TM designs that are capable of scaling to a large number of GPU threads. In our system, programmers simply mark the critical sections in the applications, and the underlying TM support is able to achieve performance comparable to fine-grained locking.Ensuring functional correctness on GPUs that are capable of supporting thousands of concurrent threads is crucial for achieving high performance. However, GPUs provide relatively little guarantee with respect to the coherence and consistency of the memory system. Thus, they are prone to a multitude of concurrency bugs related to inconsistent memory states. Many such bugs manifest as some form of data race condition at runtime. It is critical to identify such race conditions, and mechanisms that aid their detection at runtime can form the basis for powerful tools for enhancing GPU software correctness. However, relatively little attention has been given to explore such runtime monitors. Most prior works focus on the software-based approaches that incur significant overhead. We believe that minimal hardware support can enable efficient data race detection for GPUs. In this dissertation, we propose a hardware-accelerated data race detection mechanism for efficient and accurate data race detection in GPUs. Our evaluation shows that the proposed mechanism can accurately detect data race bugs in GPU programs with moderate runtime overheads.enComputera architectureConcurrency bugsGPUsParallel programmingSpeculative executionTransactional memoryComputer scienceEnhancing GPU programmability and correctness through transactional executionThesis or Dissertation