Technical Report 
Department of Computer Science 
University of Minnesota 
4-192 EECS Bu ilding 
200 Union Street SE 
Minneapolis, MN 55455-0159 USA 
TR 97-014 
Parallel Multilevel Diffusion 
Algorithms for Repartitioning of 
Adaptive Meshes 
by: Kirk Sch loegel, George 
Karypis, and Vipin Kumar 
Parallel Multilevel Diffusion Algorithms for 
Repartitioning of Adaptive Meshes * 
Kirk Schloegel, George Karypis, Vipin Kumar 
University of Minnesota, Department of Computer Science 
( kirk, karypis, kumar) @ cs.umn.edu 
Abstract 
Graph partitioning has been shown to be an effective way to divide a large computation over an arbitrary number 
of processors. A good partitioning can ensure load balance and minimize the communication overhead of the compu-
tation by partitioning an irregular mesh into p equal parts while minimizing the number of edges cut by the partition. 
For a large class of irregular mesh applications, the structure of the graph changes from one phase of the computation 
to the next. Eventually, as the graph evolves, the adapted mesh has to be repartitioned to ensure good load balance. 
Failure to do so will lead to higher parallel run time. This repartitioning needs to maintain a low edge-cut in order to 
minimize communication overhead in the follow-on computation. It also needs to minimize the time for physically 
migrating data from one processor to another since this time can dominate overall run time. Finally, it must be fast and 
scalable since it may be necessary to repartition frequently. Partitioning the adapted mesh again from scratch with an 
existing graph partitioner can be done quickly and will result in a low edge-cut. However, it will lead to an excessive 
migration of data among processors. In this paper, we present new parallel algorithms for robustly computing reparti-
tionings of adaptively refined meshes. These algorithms perform diffusion of vertices in a multilevel framework and 
minimize data movement without compromising the edge-cut. Furthermore, our parallel repartitioners include param-
eterized heuristics to specifically optimize edge--cut, total data migration, or the maximum amount of data migrated 
into and out of any one processor. Our results on a variety of synthetic meshes show that our parallel multilevel dif-
fusion algorithms are highly robust schemes for repartitioning adaptive meshes. The resulting edge-cuts are close to 
those resulting from partitioning from scratch with a state-of-the-art graph partitioner, while data migration is sub-
stantially reduced. Furthermore, repartitioning can be done very fast. Our experiments show that meshes with around 
eight million vertices can be repartitioned on a 256-processor Cray T3D in only a couple of seconds. 
1 Introduction 
Mesh partitioning is an important problem which has applications in many areas, including scientific computing. In 
irregular mesh applications, the amount of computation associated with a grid point is represented by the weight of its 
associated vertex. The amount of interaction required between two grid points is represented by the weight of the edge 
between the associated vertices. Efficient parallel execution of these irregular grid applications requires the partitioning 
of the associated graph into p parts with the following two constraints: (i) Each partition has an equal amount of total 
vertex weight; (ii) The total number of edges cut by the partitions (thereafter referred to as edge-cut) is minimized. Since 
the weight of any given edge represents the amount of communication required between nodes, minimizing the number 
*This work was supported by NSF CCR-9423082, by Anny Research Office contract DNDAAH04-95-l-0538, by Arrny High Perforrnance 
Computing Research Center cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, by the IBM Partnership 
Award, and by the IBM SUR equipment grant. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute. 
Related papers are available via WWW at URL: http://www.cs.umn.edwkarypis 
of edges cut by the partition tends to minimize the overall amount of communication required by the computation. This 
problem has been well defined and discussed in previous work [2, 7). 
For a large class of irregular grid applications, the computational structure of the problem changes in an incremental 
fashion from one phase of the computation to another. For example, in adaptive meshes [ l], areas of the original graph 
are selectively coarsened or refined in order to accurately model the dynamic computation. This causes the weights 
of the vertices and the edges to change. Eventually, as the graph evolves, it becomes necessary to correct the partition 
in accordance with the structural changes in the computation and to migrate a certain amount of computation between 
processors. Thus, we need a partitioning or repartitioning algorithm to redistribute the adapted graph. This algorithm 
should satisfy the following constraints. 
I. It robustly balances the graph. Failure to balance the graph will lead to load imbalance, which will result in 
higher parallel run time. In order to make the repartitioning algorithm general it must be able to balance graphs 
from a wide variety of application domains. 
2. It minimizes edge-cut. The redistributed graph should have a small edge-cut to minimize communication 
overhead in the follow-on computation. 
3. It minimizes vertex migration time. Once the mesh is repartitioned, and before the grid computation can 
begin, data associated with the migrated vertices also needs to be moved. In many adaptive computations, the 
data associated with each vertex is very large. The time for movement of the data can dominate overall run time, 
especially if the mesh is adapted frequently. 
4. It is fast. The computational cost of repartitioning should be inexpensive since it is done frequently. Also, since 
the problem studied in this paper is relevant only in the parallel context, the repartitioning algorithm should be 
parallelizable. 
If the adapted graph is partitioned from scratch using a state-of-the-art multilevel graph partitioner such as the one 
implemented in M~S (6), then it will reasonably optimize criterion 1 and 2. Since a highly parallel formulation of this 
algorithm exists [6], criterion 4 can also be met to a large extent. Partitioning from scratch will, however, result in high 
vertex migration, as the partitioning does not take the initial location of the vertices into account. A partitioning meth<>d 
that incrementally constructs a new partition as simply a modification of the input partition ( e.g. by diffusion [ 11]) can 
potentially move a much smaller number of vertices. Such a method can also be potentially faster than partitioning the 
graph from scratch. 
Repartitioning schemes that incrementally modify an existing partition have been quite successful on graphs that 
are small perturbations of the original graphs (9, 11). For these graphs, such a scheme inherits a good (i.e., low edge-
cut), but imbalanced starting point in the initial partition. It then attempts to fix the imbalance of this partition while 
maintaining its good edge-cut. It does this by minimizing the amount of disturbance (i.e., vertex migration) to the initial 
partition in the balancing phase. For only slightly imbalanced graphs, the initial partition does not need to be disturbed 
very much, and so these algorithms are able to maintain an edge-cut comparable with the initial partition. However, if 
the initial partition is highly imbalanced, then many vertices need to move in order to balance the graph. Thus, even if 
the disturbance to the initial partition is minimized, the final partition will necessarily end up quite a bit removed from 
it. Hence, the balancing phase of such a method will increase the edge-cut considerably. Local refinement [9, 11] can 
only provide a limited improvement in the edge-cut of the resulting partition. 
One promising solution to the problem of edge-cut degradation as the degree of imbalance increases in size and 
complexity is the use of a multilevel scheme (2, 4) that takes the initial location of the vertices into consideration. The 
multilevel paradigm allows the local refinement to be performed at multiple coarsened versions of the graph, which 
has been shown to be quite effective in reducing the edge-cut. In addition to the refinement, the movement of graph 
vertices (to achieve load balance) can also be done at multiple coarsened versions. This multilevel diffusion scheme 
can move large chunks of vertices at coarser levels, and then achieve better load balance at finer levels. In a multilevel 
context, a global picture of the graph (3) can be used to guide graph balancing while utilizing a multilevel view to guide 
refinement. 
We have developed multilevel diffusion schemes that incrementally constructs a new partition of the graph (10). 
These schemes coarsen the graph, begin multilevel diffusion on the coarsest graphs, and then begin multilevel refine-
ment as soon as the partition is sufficiently balanced. By applying diffusion on the coarsest graphs, we speed up the 
costly diffusion process, since this is done on relatively small graphs. We are also able to effectively control the trade-
off between edge-cut and vertex migration since diffusion and refinement are linked in a multilevel context. In this 
2 
paper, we present a brief overview of these schemes and provide experimental results of an MPI-based implementation 
of these schemes on a 256-processor Cray T3D for a variety of adaptive graphs. Our experimental results show that we 
can repartition adaptive meshes in less time than it takes to partition the meshes from scratch, maintain a comparable 
edge-cut, and move far less vertices in the process. 
2 Notations, Definitions, and lss ues 
Our discussion will include the concepts of both vertex weight and vertex size. Vertex weight is the computational cost 
of the work represented by the vertex while size is its migration cost. Thus, the repartitioner should attempt to balance 
the graph with respect to vertex weight while minimizing vertex migration with respect to vertex size. Depending on 
the representation and storage policy of the data, size and weight may not necessarily be equal. One example of such 
a situation arises in [8]. 
· Let G = (V, E) be an undirected graph of V vertices and E edges and P be a set of p processors. Lets; represent the 
cost of movement of vertex v;. We will refer to s; as the size of vertex i. Let w; represent the weight (i.e., computational 
work) of vertex v; and w,(v1, Vi) equal the amount of communication required between v1 and Vi• A vertex's density 
is equal to its weight divided by its size. We denote B(q) as the set of vertices with partition q. The weight of any 
partition q can then be defined as: 
W(q) = L W; 
v;€8(q) 
A vertex is clean if its current partition is its initial partition on the input graph. Otherwise it is dirty. A vertex is a 
border vertex if one of its adjacent vertices is in another partition. If so, then all such partitions are the vertex's neighbor 
partitions. If a partition contains at least one vertex which has another partition as a neighbor partition, then those two 
partitions are neighbor partitions to each other. 
TotalV is defined as the sum of the sizes of vertices which change partitions as the result of partitioning or reparti-
tioning. Thus, it is the sum of the sizes of the dirty vertices. MaxV is defined as the maximum of the sums of the sizes 
of those vertices which migrate into and out of any one partition as a result of partitioning or repartitioning. 
2.1 Multilevel Diffusion Algorithms: A Brief Overview 
Our serial multilevel diffusion algorithms have three phases: a coarsening phase, a multilevel diffusion phase, and a 
multilevel refinement phase. In the coarsening phase, only pairs of nodes that belong to the same partition are consid-
ered for merging. Hence, the initial partition of the coarsest level graph is identical to the input partition of the graph 
that is being repartitioned, and thus does not need to be computed. 
In the multilevel diffusion phase, balance is sought by means of either undirected or directed diffusion. In the case 
of undirected diffusion, the border vertices are visited in a random order. If a vertex belongs to an overweight partition 
(i.e., a partition whose weight is higher than the average weight), then il is moved to an adjacent partition with lower 
weight. If there are more than one adjacent partition satisfying this condition, the one that leads to the smaller edge-
cut is selected. If a vertex belongs to a partition with average weight, then it is moved to a partition that leads lo a 
reduction in the edge-cut as long as il does not make the destination partition overweight. Again, if there are more 
than one partition satisfying this condition, the one that leads to the smaller edge-cut is selected. If a vertex belongs 
to an under-weight partition it is not moved. Note that a vertex belonging to an overweight partition can move even if 
this migration leads to an increase in the edge-cut. This process is repeated for a small number of steps or until either 
balance is obtained or no progress is made in balancing. Note that in this scheme, diffusion is performed using only 
local information. 
In the case of directed diffusion, a global picture is used to guide the vertex migration. This global picture is com-
puted by the 2-norm minimization solution described in (3). The result of this computation is a transfer matrix that 
indicates how much weight needs to be transfered between neighboring partitions. Using this transfer matrix, the di-
rected diffusion scheme works as follows. Again, the boarder vertices are visited in a random order. If a vertex is 
neighbors with a partition which according to the transfer matrix needs to transfer work to, then that vertex can be mi-
grated to the neighbor partition. If a vertex is neighbors with more than one such partition, it is migrated to that partition 
which will produce the highest gain in edge-cut. The vertex is migrated even if the gain in edge-cut is negative. After 
a vertex is migrated, the transfer matrix is updated to reflect the vertex migration (i.e., the weight of the vertex that was 
moved is subtracted from the appropriate entry of the transfer matrix). After each border vertex is visited exactly once, 
the process repeats until either balance is obtained or no progress is made in balancing. 
3 
In both of these schemes used during the multilevel diffusion phase, it may not be possible to balance the graph at the 
coarsest graph level. That is, there may not be sufficiently fine vertices on tbe coarsest graph to allow for total balancing. 
If this is the case, the graph needs to be uncoarsened one level in order to increase the number of finer vertices. The 
process described above is then begun on the next coarsest graph. 
After the graph is balanced, multilevel diffusion ends and multilevel refinement begins on the current graph. Here, 
the emphasis is on improving the edge-cut. The border vertices are again visited randomly and are checked to see if 
they can be migrated to another partition so that 
I. maintain the edge-cut, maintain the balance, and the selected partition is the vertex's initial partition from the 
input graph, or 
2. decrease the edge-cut while maintaining the graph balance, or 
3. maintain the edge-cut and improve graph balance. 
If so, vertices are migrated. These three conditions make up the refinement phase vertex migration criteria. Criterion 1 
allows vertices to migrate to their initial partitions (as long as the migration does not increase the edge-cut and worsen 
the load balance), and therefore, to lower TotalV and possibly MaxV (10]. 
Dynamic suppression of vertices as described in [10] is also utilized during multilevel diffusion to reduce MaxV. 
Essentially with this variation, the migration of relatively non-dense vertices is suppressed during the balancing phase. 
The concept of vertex cleanness as described in (10) is incorporated in multilevel refinement in order to reduce the 
TotalV. That is, only those vertices which have either been displaced in the process of balancing or those which will 
result in a sufficiently high edge-cut-gain-to-size-ratio are eligible for migration during refinement. 
A complete description of the scheme and extensive experimental results conducted on a single processor for a va-
riety of adapted meshes are presented in (10). 
3 Parallel Multilevel Diffusion Algorithms 
We have developed parallel versions of the multilevel diffusion algorithms as follows. Vertices are initially assumed to 
be distributed across p processors. This division of vertices corresponds to the original partition of a static partitioner 
and is assumed to be of good quality (i.e., low edge-cut). However, the sums of the vertex weights of the vertices 
resident on each processor are assumed to be variant. Thus, the original partition is not balanced and so there is a need 
for repartitioning. 
Our parallel algorithms begin with a coarsening phase in whi-ch a sequence G; = (V;, E;) for i = 0, l, ... , m, of 
successively coarser graphs is constructed. Graph G;+1 is constructed from G; by first computing a matching of vertices 
of G; and then collapsing together the matched vertices. The matchings computed are restricted to vertices residing on 
the same processors. By adhering to this restriction, coarsening is inherently very parallel. Otherwise, this phase is 
identical as described in [7]. 
The parallel formulation of the multilevel diffusion phase depends on whether or not we are using directed or undi-
rected diffusion. In the case of directed diffusion, we chosen to perform the directed diffusion for the coarsest graph 
serially. Since the coarse graph is very small (its size is proportional to the number of processors), this serial compu-
tation does not significantly affect the overall performance and scalability of our parallel multilevel directed diffusion 
algorithm. Furthermore, we use the additional processors to obtain a better directed diffusion as follows. The graph 
is broadcast to all processors. Each processor then simultaneously balances the coarsest graph using th~ directed dif-
fusion algorithm described in Section 2.1 . Since the diffusion scheme is inherently random, each processor computes 
a potentially unique partition. The best partition is selected. The selection criteria are either; (i) lowest edge-cut, (ii) 
lowest TotalV, (iii) lowest MaxV, or (iv) greatest balance. In our experiments, we select the partition that has the low-
est edge-cut. In some cases, the coarsest graph may be too coarse to allow for complete balancing. For this reason we 
use the parallel undirected diffusion algorithm (described in the next paragraph) to balance any minor imbalances. 
The parallel formulation of the undirected diffusion algorithm is modeled after our coarse-grained parallel multilevel 
refinement algorithm (7). Each iteration of the parallel multilevel refinement algorithm consists of two sub-phases. 
During the first sub-phase, vertices are migrated only from lower- to higher-numbered partitions. During the second 
sub-phase, vertices are migrated from higher- to lower-numbered partitions. In this way, unexpected edge-cut drops 
caused by the simultaneous migration of neighboring vertices is avoided. Furthermore, our schemes avoids any bias 
towards the lower- or higher-numbered partitions, by using at each step a random partition ordering. In each sub-phase, 
vertices are visited and selected for migration according to the criteria for undirected diffusion described in Section 2.1. 
4 
Cl 
i 
() 
I s· 
"" 
Multilevel Diffusion Multilevel Diffusion 
Figure 1: Multilevel Diffusion Algorithms: (a) undirected diffusion and (b) directed diffusion. 
The parallel formulation of the multilevel refinement algorithm is similar to that for undirected diffusion with the 
exception that vertices are moved according to the refinement phase vertex migration criteria described in Section 2.1. 
Furthermore, the concept of vertex cleanness as described in [10] is also employed in our scheme in order to reduce 
TotalV. Otherwise, our multilevel refinement algorithm is identical to the coarse-grained parallel multilevel refinement 
algorithm described in [7]. 
In summary, as illustrated in Figure 1, our parallel multilevel diffusion algorithms are made up of three phases, graph 
coarsening, multilevel diffusion, and coarse-grained multilevel refinement. In the case of undirected diffusion all three 
phases are done in parallel, whereas in the case of directed diffusion two of the three phases, coarsening and refinement, 
are done in p~rallel, and the diffusion is done serially on all processors. 
4 Experimental Results 
We tested our parallel multilevel repartitioning algorithms on a Cray T3D with 256 processors. Each processor on the 
T3D is a 150Mhz Dec Alpha (EV4). The processors are interconnected via a three dimensional torus network that 
has a peak unidirectional bandwidth of 150Bytes per second, and a small latency. We used Cray's MPI library for 
communication. Cray's MPI achieves a peak bandwidth of 45MBytes and an effective startup time of 57 microseconds. 
We evaluated the performance of our parallel adaptive multilevel diffusion algorithms on four medium to large size 
graphs arising in finite element computations. The characteristics of these graphs are described in Table 1. For each 
graph we synthetically generated adaptive graphs by randomly changing the weights of the vertices. For example, on 
64 processors, we first computed a 64-way partition and moved the graph so that processor P; stores the vertices that 
belong to partition i. Next, each processor P; selects a random number r; between zero and the number of vertices 
that it stores, and randomly selects r; of its vertices and. changes their weight from one to three. We found that this 
scheme leads to graphs that are about 50% load imbalanced, i.e., there is a partition whose weight is 50% higher than 
the average weight. These synthetically generated graphs are then used as the input to our parallel multilevel diffusion 
and partitioning algorithms. In all of our experiments, a graph that is 5% load imbalance is assumed to be well balanced, 
and in all experiments this load imbalance level is easily achieved. 
Graph Name Number of Vertices Number of Edges Description 
MRNGA 257000 505048 Dual of a 30 Finite element mesh 
MRNGB 1017253 2015714 Dual of a 3D Finite element mesh 
MRNGC 4039160 8016848 Dual of a 3D Finite element mesh 
MRNGD 7833224 15291280 Dual of a 3D Finite element mesh 
Table 1: The various graphs used in the experiments. 
Table 2 shows the results obtained by our parallel multilevel directed and undirected diffusion algorithms for the test 
5 
problems on 64, 128, and 256 processors. For each problem, this table shows the edge-cut of the resulting partitioning, 
the total number of vertices that needs to be moved (TotalV), the maximum number of vertices that needs to be moved 
in and out of any particular processor (MaxV), as well as the amount of time (in seconds) required to compute the 
repartitioning. The rows labeled 'Scratch' show similar performance metrics for the scheme in which the adaptive 
graph is being repartitioned from scratch using the parallel multilevel k-way partitioning algorithm described in [7]. 
Finally, the rows labeled 'Imbalance' indicate the load imbalance resulted by our synthetic adaptation scheme for each 
one of the graph-processor combinations. As you can see, the load imbalance ranges between 1.42 (42%) and 1.52 
(52%). Note that MRNGD could not run on 64 processors due to the limited amount of memory on each processor. 
Graph 
MRNGA 
MRNGB 
MRNGC 
MRNGD 
Table 2: The performance of the parallel adaptive repartitioning schemes on the four test graphs for 64, 128, and 256 processors 
on Cray T30. All run-times are in seconds. 
To better compare the three schemes we graphically depicted the results shown in Table 2 in the sequence of graphs 
shown in Figures 2- 5. In particular, Figure 2 compares the quality in terms of edge-cut produced by the two multilevel 
diffusion algorithms relative to partitioning from scratch. For each experiment, we computed the ratio of the edgecut 
produced by the diffusion algorithms to that of partitioning from scratch and plotted it using a bar chart. For example, 
the bars for MRNGB-128 corresponds to partitioning the adaptively refined graph MRNGB on 128 processors. Look-
ing at this figure we can see that the edge-cuts of the partitionings produced by the two multilevel diffusion algorithms 
are in general within 5% of the edge-cut produced by partitioning the adaptive graphs from scratch. Furthermore, for 
some of these graphs (e.g ., MRNGA-256 and MRNGC0 128) the multilevel diffusion algorithms produced partition-
ings whose edge-cut was slightly better than that of partitioning from scratch. These results show that our adaptive 
multilevel diffusion algorithms are able to produce partitionings whose quality is comparable to the quality obtained 
by the parallel multilevel partitioning algorithm. Comparing the two multilevel diffusion algorithms together, we see 
that in general they lead to partitionings that have comparable edge-cuts. Out of the eleven experiments, the undirected 
diffusion scheme leads to smaller edge-cuts for six of them, whereas the directed diffusion does better in five experi-
ments. 
The performance in terms on the number of vertices that need to be moved in terms of TotalV and MaxV is shown 
in Figures 3 and 4, respectively. Looking at these two figures we can clearly see the advantage oftbe multilevel diffu-
sion algorithms over partitioning from scratch in significantly reducing the number of vertices that needs to be moved 
in order to achieve load balance. The multilevel diffusion schemes move in general less than 10% percent of the total 
number of vertices that are required to be moved by partitioning from scratch (Figure 3), and the maximum number of 
vertices that is sent and received by any processor is less than 30% of that required by partitioning from scratch (Fig-
ure 4). This dramatic reduction in data movement coupled with the comparable edge-cut quality makes the multilevel 
diffusion algorithms a clear choice for repartitioning adaptively refined graphs. As it was the case with the edge-cut, 
the relative performance of directed and undirected diffusion is quite similar, with both schemes performing within a 
couple percentage points of each other. 
Finally, Figure 5 shows the amount of time required by the two multilevel diffusion algorithms relative to parti-
tioning from scratch. From this figure we can see that the multilevel diffusion algorithms are considerably faster than 
partitioning the graph from scratch. In particular, for MRNGD on 256 processors, undirected diffusion is about 40% 
faster than partitioning from scratch, requiring only 2.75 seconds (partitioning from scratch requires 4.69 seconds). This 
6 
0.9 
o.a 
0 .7 
0 .6 
0 ,5 "-· 
c::::, Unclrected Dlfualon 
0.4 ··- ··- ·-··· 
0.3 --- -·- ····· 
0 .1 
-Directedl'.>iusion 
'l' 
,...__, 
,...__, 
- P.arJliof'ling from sc:ra1ch (b11s.lin•) 
······- ··- ··- --
·- ---
1----- -
Figure 2: Quality in terms of edge-cuts of the partitionings produced by the adaptive multilevel directed and undirected diffusion 
algorithms relative to partitioning from scratch. For each graph, the ratio of the edge-cut of the adaptive multilevel diffusion algo-
rithms to that of the multilevel k-way partitioning is plotted for 64·, 128·, and 256-way partitionings on 64, 128, and 256 processors, 
respectively. Bars under the baseline indicate that the multileve I diffusion algorithms perform better than partitioning from scratch. 
run-time difference is mostly due to the fact that the multilevel diffusion algorithms do not have to compute an initial 
k-way partitioning of the coarsest graph, as they inherit the original partitioning. Comparing the multilevel undirected 
diffusion with the directed diffusion algorithm we see that the former is faster, and the performance gap increases with 
the number of processors. This is because, as discussed in Section 3, the directed diffusion algorithm needs to diffuse 
the coarsest graph serially, whereas the undirected diffusion algorithm does not have this serial component. Looking 
at the run-times shown in Table 2 we can see that both multilevel diffusion algorithms scale very well with the number 
of processors, and they are able to partition graphs with around eight million vertices in around three seconds. 
5 Conclusion and Related Work 
Walshaw, Cross, and Everett also implemented a parallel partitioner and directed diffusion repartitioner based on an 
optimization of the Hu and Blake diffusion solver [3, 12, 11]. Their algorithms has two distinct phases, called balanc-
ing and refinement phase. The first is a balancing phase in which the diffusion solution guides vertex migration in order 
to balance the graph. Unlike our algorithms, this diffusion is performed on the original graph (i.e., it does not utilize 
a multilevel representation), and requires a significant amount of time especially for graphs that are grossly load im-
balanced. The second is a local refinement phase in which a local view of the graph guides vertex migration in order 
to decrease the edge-cut upset by the balancing phase. For this refinement phase they use either a single-level refine-
ment scheme (performed on the full graph), or they use a multilevel refinement scheme similar to that described in [5). 
They report results on a set of small- to medium-size two-dimensional meshes with very moderate imbalance problems 
(around 10%). In the case of their single-level refinement scheme, they report results in which the TotalV is low and 
the edge-cut increases only slightly over partitioning from scratch. However, when they use the multilevel refinement 
scheme, they are able to decrease the edge-cut, at the cost of increasing the TotalV considerably. 
In this paper we presented scalable and highly parallel formulations of multilevel diffusion algorithms. Our algo-
rithms are able to quickly repartition graphs corresponding to adaptively refined meshes and obtain partitionings whose 
quality is comparable to those obtained by the high-quality parallel multilevel k-way partitioning algorithm of MEliS 
[7], while dramatically reducing the amount of data movement required to realize this new partition. Unlike the al-
gorithms of Walshaw et al., our parallel multilevel diffusion algorithms combine both diffusion and refinement in a 
unified multil!!vel framework making it possible to quickly balance graphs that are highly imbalanced while producing 
edgecuts comparable to partitioning from scratch and substantially reducing the total vertex movement. Furthermore, 
our algorithms utilize the concepts of vertex weight, vertex size, and vertex cleanness to specifically minimize TotalV 
and MaxV in addition to the edge-cut. Our experiments has shown that grapns with around eight million vertices can be 
repartitioned in under three seconds on a 256-processorCray T3D making it possible to perform large scale simulations 
that require frequent mesh refinements. 
7 
cz::J Uldrected Dilusion - Partitionin9 from acratch {bowlin~) 
0.9 
o.•f----------------------------------------1 
0.1 
0.6 - --------------------------- -------; 
o.s . . - ·- ··- ·- ·- ·· - - - - - --- --
o.•>-------------------- ----------------i 
0.3 f----------------------------------- -----1 
0,2 t------ ---------- ---------------------; 
0 .1 
Figure 3: Quality in terms of the total number of vertices that need to be moved (TotalV) of the partitionings produced by the adaptive 
multilevel directed and undirected diffusion algorithms relative to partitioning from scratch. For each graph, the ratio of the TotalV of 
the adaptive multilevel diffusion algorithms to that of the multilevel k-way partitioning is plotted for 64-, 128-, and 256-way partition-
ings on 64, 128, and 256 processors, respectively. Bars under the baseline indicate that the multilevel diffusion algorithms perform 
better than partitioning from scratch. 
References 
[ I) R. Biswas and R. C. Strawn. A new procedure for dynamic adaption of three-dimensional unstructured grids. Applied Numerical Mathematics, 
13:437-452, 1994. 
[2] Bruce Hendrickson and Robert Leland. A multilevel algorithm for partitioning graphs. Technical Report SAND93-1301. Sandia National 
Laboratories, 1993. 
[3] Y. F. Hu and R. J. Blake. An optimal dynamic load balancing algorithm. Technical Report DL-P-95-011, Daresbury Laboratory, Wanington, 
UK, 1995. 
[4] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. Technical Report TR 95-035, Department 
of Computer Science, University of Minnesota, 1995. Also available on WWW at URL http://www.cs.umn.edu/1rarypis. A short version 
appears in Intl. Conf. on Parallel Processing 1995. 
[5] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Technical Report TR 95-064, Department of Computer 
Science, University of Minnesota, 1995. Also available on WWW at URL http://www.cs.urnn.edu/1carypis. 
[6] G. Karypis and V. Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. Technical Report TR 96-036, Department of 
Computer Science, University of Minnesota, 1996. Also available on WWW at URL http://www.cs.umn.edu/1carypis. A sho1t version appears 
in Supercomputing 96. 
[7] George Karypis and Vi pin Kumar. A coarse-grain parallel multilevel k-way partitioning algorithm. In Proceedings of the eighth SIAM co11-
fere11ce on Parallel Processing for Scientific Computing, 1997. 
[8] Leonid Oliker and Rupak Biswa~. Efficient load balancing and data remapping for adaptive grid calculations. Technical report, NASA Ames 
Research Center, Moffett Field, CA, 1997. 
[9] Chao-Wei Ou and Sanjay Ranka. Parallel incremental graph partitioning using linear programming. Technical report, Syracuse University, 
Syracuse, NY, 1992. 
[ I OJ Kirk Schloegel, George Karypis, and Vipin Kumar. Multilevel diffusion schemes for repartitioning of adaptive meshes. Technical Report TR 
97-013, University of Minnesota. Department of Computer Science, 1997. http://www.cs.umn.edu/ karypis. 
[ II] C. Walshaw, M. Cross, and M. G. Everett. Dynamic load-balancing for parallel adaptive unstrucrured meshes. Parallel Processing for Scientific 
Computi11g, 1997. 
[12] C. Walshaw, M. Cross, and M. G. Everett. Dynamic mesh partitioning: A unified optimisation and load-balancing algorithm. Technical Report 
95/IM/06, Centre for Numerical Modelling and Process Analysis, University of Greenwich, London. UK, December I 995. 
8 
c:::3 Undir~td Oilusion 
- Dwet.ctOitus.ion 
.. 
o.o f---- - ----------------------------l 
0.1 1--- --- --------------------------l 
o.• t--- -- --------------------------1 
o., 
o., ~------ - ----·-------------------l 
0.3 !----~---=.--.,-=---F------,~----:.,=----nl■-----..------l 
0,2 
Figure 4: Quality in terms of the maximum number of vertices that need to be moved in and out of a partition (MaxV) of the par-
titionings produced by the adaptive multilevel directed and undirected diffusion algorithms relative to partitioning from scratch. For 
each graph, the ratio of the MaxV of the adaptive multilevel diffusion algorithms to that of the multilevel k-way partitioning is plotted 
for 64·, 128·, and 256-way partitionings on 64, 128, and 256 processors, respectively. Bars under the baseline indicate that the 
multilevel diffusion algorithms perform better than partitioning from scratch. 
- hllitlof'lino lrom acn.llch ~) 
..  
o., f----------------------=■1---- ---------"■--------l 
0.7 
o.• 
0,6 
o., 
0.3 
I 
Figure 5: Run time of adaptive multilevel directed and undirected diffusion algorithms relative to partitioning from scratch. For each 
graph, the ratio of the time required by adaptive multilevel diffusion algorithms to that of the multilevel k-way partitioning is plotted 
for 64·, 128·, and 256-way partilionings on 64, 128, and 256 processors, respectively. Bars under the baseline indicate that the 
multilevel diffusion algorithms run faster than partitioning from scratch. 
9