Browsing by Author "LaSalle, Dominique"
Now showing 1 - 4 of 4
Item: A Parallel Hill-Climbing Refinement Algorithm for Graph Partitioning (2015-11-25)
LaSalle, Dominique; Karypis, George
Graph partitioning is an important step in distributing workloads on parallel compute systems, sparse matrix re-ordering, and VLSI circuit design. Producing high-quality graph partitionings while effectively utilizing available CPU power is becoming increasingly challenging due to the rising number of cores per processor. This not only increases the amount of parallelism required of the partitioner, but also the degree of parallelism of the partitionings it must generate. In this work we present a new shared-memory parallel k-way method of refining an existing partitioning that can break out of local minima. Our method matches the quality of current high-quality serial refinement methods and achieves speedups of 5.7-16.7x using 24 threads, while exhibiting edge cuts only 0.52% higher than when run serially. This is 6.3x faster than other parallel refinement methods.

Item: Graph Partitioning, Ordering, and Clustering for Multicore Architectures (2015-12)
LaSalle, Dominique
The terms dual-core and quad-core have become ubiquitous for modern computer processors. Where processors previously had a single powerful processing core, they now have multiple lower-power processing cores. While some applications can readily be adapted to this change in computer architecture, a large class of applications dealing with unstructured data poses significant challenges to achieving performance on these new architectures. In this thesis we propose new algorithms and methods for the problems of graph partitioning, fill-reducing ordering, and graph clustering on multicore architectures. These problems are used to discover and introduce structure into such data, which is important in the fields of scientific and parallel computing, data science, the life sciences, and integrated circuit design.
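The hill-climbing refinement summarized in the first item above can be illustrated by a serial FM-style pass that accepts negative-gain moves and keeps the best partition seen, which is what lets it escape local minima. The following is a minimal sketch under that idea only; the function name and the crude balance rule are illustrative assumptions, not the authors' parallel algorithm:

```python
def hill_climb_refine(adj, part, passes=5):
    """FM-style hill-climbing refinement of a 2-way partition (sketch).
    Each pass moves every vertex at most once, choosing the best
    edge-cut gain and accepting negative-gain moves, then rolls back
    to the best partition seen -- letting it climb out of local minima.
    adj: dict vertex -> list of neighbors; part: dict vertex -> 0 or 1."""
    n = len(adj)
    min_size = n // 2 - 1          # crude balance constraint (assumption)

    def cut(p):
        # every undirected edge appears twice in adj, hence // 2
        return sum(p[u] != p[v] for u in adj for v in adj[u]) // 2

    best, best_cut = dict(part), cut(part)
    for _ in range(passes):
        moved = set()
        size = [sum(1 for s in part.values() if s == 0), 0]
        size[1] = n - size[0]
        while len(moved) < len(adj):
            # gain of moving v = cut edges removed - cut edges created;
            # consider only moves that keep both parts large enough
            gains = {v: sum(1 if part[u] != part[v] else -1 for u in adj[v])
                     for v in adj
                     if v not in moved and size[part[v]] - 1 >= min_size}
            if not gains:
                break
            v = max(gains, key=gains.get)
            size[part[v]] -= 1
            part[v] ^= 1            # move v even if the gain is negative
            size[part[v]] += 1
            moved.add(v)
            c = cut(part)
            if c < best_cut:
                best_cut, best = c, dict(part)
        part = dict(best)           # restart the next pass from the best state
    return best, best_cut

# Two triangles joined by the edge (2, 3), starting from a poor partition.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
part = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}   # initial cut = 5
best, best_cut = hill_climb_refine(adj, part)
print(best_cut)   # the triangles separate, leaving a cut of 1
```

Note how the intermediate moves pass through states worse than the starting cut before the rollback recovers the best configuration; a purely greedy pass would stop at the first local minimum.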
Our new methods for graph partitioning exploit current multicore architectures to greatly reduce both runtime and memory requirements. Our methods rely on a model of coarse-grained parallelism while effectively maximizing data locality. We develop new methods for each phase of multilevel graph partitioning, including a new aggregation method, an adaptive parallel formulation of initial partitioning, a high-speed method for performing greedy refinement, and a new, highly scalable method for high-quality refinement. These new algorithms achieve 3x lower runtimes than previous methods while matching their quality when using the greedy refinement method; our high-quality, highly scalable refinement method improves solution quality by 8%. We develop new methods for generating fill-reducing orderings of sparse matrices, including a new method of vertex separator refinement that achieves effective parallelization while still matching the quality of the best serial algorithms. We introduce a task scheduling method designed specifically for nested dissection and show that it performs significantly better than the task schedulers available in current threading libraries. Overall, our algorithms for nested dissection are 1.5x faster than existing parallel methods, and compared to the orderings produced by other parallel methods they reduce the fill-in of Cholesky factorization by 3.7% and its operation count by 14%. This matches the quality of orderings generated by serial methods while achieving a 10.1x speedup on 16 cores. Finally, we develop new serial and parallel algorithms for the graph clustering problem. We take many of the strategies we used for parallelizing graph partitioning on multicore architectures and adapt them to the modularity maximization problem, developing new methods for vertex aggregation, initial clustering, and cluster refinement.
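The nested dissection orderings discussed above recursively split the graph with a vertex separator and number the separator vertices last, so that eliminating one half cannot create fill in the other. A minimal serial sketch of that recursion follows; the `path_separator` helper is a toy stand-in (it only works for a path graph) rather than a real separator algorithm, and none of this is the thesis code:

```python
def nested_dissection(vertices, find_separator):
    """Order vertices for elimination: recursively order each half,
    then number the separator last (it borders both halves)."""
    if len(vertices) <= 1:
        return list(vertices)
    left, right, sep = find_separator(vertices)
    return (nested_dissection(left, find_separator)
            + nested_dissection(right, find_separator)
            + list(sep))

def path_separator(vertices):
    """Toy separator for a path graph 0-1-2-...: removing the middle
    vertex disconnects the two halves."""
    vs = sorted(vertices)
    mid = len(vs) // 2
    return vs[:mid], vs[mid + 1:], [vs[mid]]

order = nested_dissection(range(7), path_separator)
print(order)   # [0, 2, 1, 4, 6, 5, 3] -- separators numbered last
```

Because the recursive calls on `left` and `right` are independent, they are natural units for the task-parallel scheduling the thesis describes.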
Our serial algorithms are an order of magnitude faster than state-of-the-art serial methods while producing clusterings of equal or greater quality. Our parallel algorithms are 4.5-27.2x faster than other parallel methods and achieve an 8.9x speedup on 16 cores while exhibiting less than 1% degradation in solution quality compared to their serial counterparts. Our implementation of these algorithms can cluster a three-billion-edge graph in under 90 seconds on a single computer.

Item: MPI for Big Data: New Tricks for an Old Dog (2013-10-21)
LaSalle, Dominique; Karypis, George
The processing of massive amounts of data on clusters with a finite amount of memory has become an important problem facing the parallel/distributed computing community. While MapReduce-style technologies provide an effective means for addressing various problems that fit within the MapReduce paradigm, there are many classes of problems for which this paradigm is ill-suited. In this paper we present a runtime system for traditional MPI programs that enables the efficient and transparent out-of-core execution of distributed-memory parallel programs. This system, called BDMPI, leverages the semantics of MPI's API to orchestrate the execution of a large number of MPI processes on far fewer compute nodes, so that the running processes maximize the amount of computation they perform with the data fetched from disk. BDMPI enables the development of efficient out-of-core parallel distributed-memory codes without the high engineering and algorithmic complexities associated with multiple levels of blocking. BDMPI achieves significantly better performance than existing technologies on a single node as well as on a small cluster, and performs within 30% of optimized out-of-core implementations.

Item: Multi-Threaded Modularity Based Graph Clustering using the Multilevel Paradigm (2014-05-05)
LaSalle, Dominique; Karypis, George
Graphs are an important tool for modeling data in many diverse domains.
Recent increases in sensor technology and deployment, the adoption of online services, and the scale of VLSI circuits have caused the size of these graphs to skyrocket. Finding clusters of highly connected vertices within these graphs is a critical part of their analysis. In this paper we apply the multilevel paradigm to the modularity graph clustering problem. We improve upon the state of the art by introducing new efficient methods for coarsening graphs, creating initial clusterings, and performing local refinement on the resulting clusterings. We establish that for a graph with n vertices and m edges, these algorithms have O(m + n) runtime complexity and O(m + n) space complexity, and show that in practice they are extremely fast. We present shared-memory parallel formulations of these algorithms to take full advantage of modern architectures. Finally, we present the product of this research, the clustering tool Nerstrand. In serial mode, Nerstrand runs in a fraction of the time of current methods and produces results of equal quality. When run in parallel mode, Nerstrand exhibits significant speedup with less than one percent degradation of clustering quality. Nerstrand works well on large graphs, clustering a graph with over 105 million vertices and 3.3 billion edges in 90 seconds.
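Modularity, the objective these clustering algorithms maximize, compares each cluster's fraction of internal edges with the fraction expected from vertex degrees alone. The following sketch of the objective is illustrative only (it is not Nerstrand's implementation, and the function and variable names are assumptions):

```python
from collections import defaultdict

def modularity(adj, cluster):
    """Q = sum over clusters of (internal edge fraction) minus
    (expected fraction given the cluster's total degree).
    adj: dict vertex -> neighbor list, with each undirected edge stored
    in both directions; cluster: dict vertex -> cluster id."""
    two_m = sum(len(nbrs) for nbrs in adj.values())     # 2 * |E|
    internal = defaultdict(int)   # counts each internal edge twice
    degree = defaultdict(int)     # summed vertex degrees per cluster
    for v, nbrs in adj.items():
        degree[cluster[v]] += len(nbrs)
        for u in nbrs:
            if cluster[u] == cluster[v]:
                internal[cluster[v]] += 1
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2
               for c in degree)

# Two triangles joined by a single edge, with each triangle as a cluster.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
cluster = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(round(modularity(adj, cluster), 4))   # 0.3571
```

A single pass over the adjacency lists suffices, which is consistent with the O(m + n) runtime and space bounds stated in the abstract.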