Browsing by Subject "Parallel programming"
Now showing 1 - 2 of 2
Item: Enhancing GPU programmability and correctness through transactional execution (2015-01) Holey, Anup Purushottam

Graphics Processing Units (GPUs) are becoming increasingly popular not only across various scientific communities, but also as integrated data-parallel accelerators on existing multicore processors. Support for massive fine-grained parallelism in contemporary GPUs provides a tremendous amount of computing power: GPUs support thousands of lightweight threads to deliver high computational throughput. The popularity of GPUs is aided by easy-to-adopt programming models such as CUDA and OpenCL, which aim to ease programmers' efforts in developing parallel GPU applications. However, designing and implementing correct and efficient GPU programs is still challenging, since programmers must consider interactions among thousands of parallel threads. Addressing these challenges is therefore essential for improving programmers' productivity as well as software reliability. Toward this end, this dissertation proposes mechanisms for improving the programmability of irregular applications and for ensuring the correctness of compute kernels.

Some applications possess abundant data-level parallelism but are unable to take advantage of the GPU's parallelism because they exhibit irregular access patterns to shared data structures. Programming such applications on GPUs requires synchronization mechanisms such as locks, which significantly increase programming complexity. Coarse-grained locking, where a single lock controls all the shared resources, reduces programming effort but can substantially serialize GPU threads. Fine-grained locking, where each data element is protected by an independent lock, permits maximum parallelism but requires significant programming effort.
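The locking trade-off the abstract describes can be sketched in ordinary CPU code. The following is a minimal Python illustration, not taken from the dissertation; the `CoarseTable`/`FineTable` names and the hash-table workload are hypothetical stand-ins for the shared data structures the abstract has in mind:

```python
import threading

# Coarse-grained locking: one lock guards the whole table, so every
# writer serializes, no matter which bucket it touches.
class CoarseTable:
    def __init__(self, nbuckets=8):
        self.nbuckets = nbuckets
        self.buckets = [dict() for _ in range(nbuckets)]
        self.lock = threading.Lock()

    def put(self, key, value):
        with self.lock:  # all threads contend on this single lock
            self.buckets[hash(key) % self.nbuckets][key] = value

# Fine-grained locking: one lock per bucket, so writers to different
# buckets proceed in parallel, but the programmer now manages many
# locks (and must avoid deadlock when an operation spans several).
class FineTable:
    def __init__(self, nbuckets=8):
        self.nbuckets = nbuckets
        self.buckets = [dict() for _ in range(nbuckets)]
        self.locks = [threading.Lock() for _ in range(nbuckets)]

    def put(self, key, value):
        i = hash(key) % self.nbuckets
        with self.locks[i]:  # only same-bucket writers contend
            self.buckets[i][key] = value
```

With thousands of threads, `FineTable` admits far more concurrency, at the cost of the lock-management burden the abstract describes; that gap is what the transactional-memory proposal below aims to close.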
To overcome these challenges, we propose transactional memory (TM) on the GPU, which achieves performance comparable to fine-grained locking while requiring minimal programming effort. Transactional execution can incur runtime overheads from activities such as detecting conflicts across thousands of GPU threads and managing a consistent memory state. This dissertation therefore presents lightweight TM designs that scale to a large number of GPU threads: programmers simply mark the critical sections in their applications, and the underlying TM support achieves performance comparable to fine-grained locking.

Ensuring functional correctness on GPUs that support thousands of concurrent threads is crucial for achieving high performance. However, GPUs provide relatively few guarantees about the coherence and consistency of the memory system, so they are prone to a multitude of concurrency bugs related to inconsistent memory states. Many such bugs manifest as some form of data race at runtime. It is critical to identify such race conditions, and mechanisms that aid their detection at runtime can form the basis of powerful tools for enhancing GPU software correctness. Relatively little attention has been given to such runtime monitors, however; most prior work focuses on software-based approaches that incur significant overhead. We believe that minimal hardware support can enable efficient data race detection for GPUs, and in this dissertation we propose a hardware-accelerated mechanism for efficient and accurate data race detection on GPUs.
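The programming model described above, where the programmer only marks a critical section and the runtime detects conflicts, can be approximated in software. Below is a minimal optimistic-transaction sketch in Python, not the dissertation's hardware GPU design; the `TVar` and `atomically` names are hypothetical. Each transaction buffers its writes, records the version of everything it reads, and validates those versions before committing:

```python
import threading

_commit_lock = threading.Lock()  # serializes only the commit step

class TVar:
    """A transactional variable: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

def atomically(fn):
    """Run fn(read, write) as a transaction, retrying on conflict."""
    while True:
        read_set, write_set = {}, {}

        def read(tvar):
            if tvar in write_set:          # read-your-own-writes
                return write_set[tvar]
            read_set.setdefault(tvar, tvar.version)
            return tvar.value

        def write(tvar, value):
            write_set[tvar] = value        # buffered until commit

        result = fn(read, write)
        with _commit_lock:
            # Validate: every version we read must be unchanged.
            if all(tvar.version == v for tvar, v in read_set.items()):
                for tvar, value in write_set.items():
                    tvar.value = value
                    tvar.version += 1
                return result
        # A concurrent commit invalidated one of our reads: retry.
```

The programmer writes only the body passed to `atomically`; conflict detection and rollback (here, simply discarding the buffered write set) are the runtime's job, which is the division of labor the abstract claims for its GPU TM designs.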
Our evaluation shows that the proposed mechanism can accurately detect data race bugs in GPU programs with moderate runtime overheads.

Item: A Transactional Model for Parallel Programming of Graph Applications on Computing Clusters (IEEE, 2017) Tripathi, Anand; Padhye, Vinit; Sunkara, Tara Sasank; Tucker, Jeremy; Thirunavukarasu, BhagavathiDhass; Pandey, Varun; Sharma, Rahul R.

We present the results of our investigation of a transactional model of parallel programming on cluster computing systems. This model specifically targets graph applications, with the goal of harnessing the unstructured parallelism inherently present in many such problems. In this model, tasks for vertex-centric computations are executed optimistically in parallel as serializable transactions. A key-value based globally shared object store, implemented in the main memory of the cluster nodes, holds the graph data. Task computations read and modify data in the distributed global store, without any explicitly programmed message passing in the application code. Based on this model, we developed a framework for parallel programming of graph applications on computing clusters. We present the programming abstractions provided by this framework and its architecture. Using several graph problems, including graph coloring, k-nearest neighbors, and single-source shortest-path computation, we illustrate the simplicity of the abstractions provided by this model. We also show how incremental computations can be supported by this programming model, and we use these problems to evaluate the transactional model and the mechanisms provided by the framework.
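The vertex-centric transactional pattern of the second item, in which a task reads its neighbors' state from a shared key-value store and commits its own update only if nothing it read has changed in the meantime, can be sketched for the graph-coloring example. This is a hypothetical single-node Python illustration of the optimistic pattern, not the authors' cluster framework; the `store`/`color_vertex` names are invented:

```python
import threading

# Shared key-value store: vertex id -> (color, version).
store = {}
commit_lock = threading.Lock()  # serializes only validate-and-commit

def color_vertex(v, neighbors):
    """One vertex-centric task: pick the smallest color unused by any
    neighbor, reading and writing only through the shared store."""
    while True:
        # Record the version of every store entry the task reads.
        read_versions = {u: store[u][1] for u in neighbors}
        taken = {store[u][0] for u in neighbors}
        color = next(c for c in range(len(neighbors) + 1)
                     if c not in taken)
        with commit_lock:
            # Validate: abort and retry if any neighbor changed
            # between our reads and this commit attempt.
            if all(store[u][1] == rv for u, rv in read_versions.items()):
                _, old_ver = store[v]
                store[v] = (color, old_ver + 1)
                return color
```

Each task is a serializable transaction: two adjacent vertices committing concurrently cannot both succeed, because the later commit sees the earlier one's version bump and retries, so no application-level message passing or locking discipline is needed.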