Preconditioning is often the most problematic part of an iterative method
in a parallel environment.
Incomplete decompositions of A form a popular class of preconditioners
in the context of solving discretized PDEs. In this case the preconditioner
K=LU, where L and U have a sparsity pattern equal, or close, to that of
the corresponding triangular parts of A (L is lower triangular, U is upper
triangular). For details see [4], [82], and [83].
Solving Kw=r leads to solving successively Lz=r and Uw=z.
These triangular solves lead to recurrence relations that are not easily
parallelized.
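To make the difficulty concrete, the following minimal sketch (pure Python, with an illustrative row-wise sparse storage scheme of our own choosing, not from the original) shows the forward substitution Lz=r: each z[i] needs previously computed values z[j], j &lt; i, which is exactly the recurrence that blocks straightforward parallelization.

```python
# Forward substitution Lz = r for a sparse lower-triangular L.
# Illustrative storage: each row of L is a list of (column, value)
# pairs, with the diagonal entry stored last.

def forward_solve(L_rows, r):
    n = len(r)
    z = [0.0] * n
    for i in range(n):                  # z[i] needs all earlier z[j]:
        s = r[i]                        # this recurrence is inherently
        for (j, v) in L_rows[i][:-1]:   # sequential in the general case
            s -= v * z[j]
        (_, diag) = L_rows[i][-1]
        z[i] = s / diag
    return z

# Small example with L = [[1,0,0],[2,1,0],[0,3,1]]:
L_rows = [[(0, 1.0)],
          [(0, 2.0), (1, 1.0)],
          [(1, 3.0), (2, 1.0)]]
z = forward_solve(L_rows, [1.0, 4.0, 8.0])   # z = [1.0, 2.0, 2.0]
```

The backward solve Uw=z has the same structure, with the recurrence running in the opposite direction.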
We will now discuss a number of approaches to obtain parallelism in the
preconditioning part.
- (1) Reordering the computations. Depending on the structure of
the matrix, a frontal approach may lead to successful parallelism. By
inspecting the dependency graph one can select those elements that can
be computed in parallel. For instance, if a second order PDE is discretized
by the usual five-point star over a rectangular grid, then
the triangular solves can be parallelized if the computation is carried
out along diagonals of the grid, instead of the usual lexicographical
order. For vector computers this leads to a vectorizable preconditioner
(see [84,8,85,86]).
For coarse-grained parallelism this approach is insufficient. In
three-dimensional situations more parallelism can be obtained by a similar
technique: the so-called hyperplane approach [87,85,86].
The disadvantage is that the data need to be redistributed over the
processors, since the grid points,
which correspond to a hyperplane in the grid, are located
quite irregularly in the array. For shared memory machines this also leads
to reduced performance because of indirect addressing. In general one
concludes that the data dependency approach is not adequate for
obtaining a suitable degree of parallelism.
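The diagonal ordering discussed above can be sketched as follows for the lower-triangular solve of a five-point-star factor on an m-by-m grid: a point (i,j) depends only on (i-1,j) and (i,j-1), so all points on one anti-diagonal i+j = d are independent once diagonal d-1 is done. The function name and coefficient values below are illustrative, not from the original.

```python
# Lower-triangular solve for a five-point-star factor, swept along
# anti-diagonals of the grid instead of lexicographically.
# Illustrative model-problem coefficients: off-diagonal -1, diagonal 4.

def wavefront_solve(m, r, a_off=-1.0, a_diag=4.0):
    z = [[0.0] * m for _ in range(m)]
    for d in range(2 * m - 1):            # sweep anti-diagonals i+j == d
        # every point on one anti-diagonal is independent:
        # this inner loop could be a parallel (or vector) loop
        for i in range(max(0, d - m + 1), min(d, m - 1) + 1):
            j = d - i
            s = r[i][j]
            if i > 0:
                s -= a_off * z[i - 1][j]  # coupling to neighbor below
            if j > 0:
                s -= a_off * z[i][j - 1]  # coupling to left neighbor
            z[i][j] = s / a_diag
    return z
```

Reordering along diagonals only changes the schedule, not the result: the same recurrence evaluated lexicographically would produce identical values, but as a fully sequential double loop.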
- (2) Reordering the unknowns. One may also use a coloring
scheme
for reordering the unknowns, so that unknowns with the same color are not
explicitly coupled. This means that the triangular solves can be parallelized
for each color. Of course, communication is required for couplings between
groups of different colors.
Simple coloring schemes, like red-black ordering for the five-point
discretized
Poisson operator, seem to have a negative effect on the convergence behavior.
Duff and Meurant [88] have carried out numerical experiments for
many different orderings; these show that the number of iterations may
increase significantly for orderings other than the lexicographical one. Some modest
degree of parallelism can be obtained, however, with so-called incomplete
twisted factorizations [8,89,85].
Multi-color schemes with a large number of colors (e.g., 20 to 100) may
lead to little or no degradation in convergence behavior [90], but also
to less parallelism. Moreover, the ratio of computation
to communication may be more unfavorable.
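As a sketch of the coloring idea, consider the simplest case of red-black ordering for the five-point stencil: points with i+j even couple only to points with i+j odd and vice versa, so each half of the sweep below is a fully parallel update. All names and coefficients are illustrative model-problem choices, not from the original.

```python
# One preconditioning sweep under red-black ordering of a five-point
# stencil on an m-by-m grid.  Within one color, all updates are
# independent; communication is only needed between the two colors.

def redblack_sweep(m, r, z, a_off=-1.0, a_diag=4.0):
    for color in (0, 1):                 # red half-sweep, then black
        # all points of one color are independent: a parallel loop
        for i in range(m):
            for j in range(m):
                if (i + j) % 2 != color:
                    continue
                s = r[i][j]
                for (p, q) in ((i - 1, j), (i + 1, j),
                               (i, j - 1), (i, j + 1)):
                    if 0 <= p < m and 0 <= q < m:
                        s -= a_off * z[p][q]
                z[i][j] = s / a_diag
    return z
```

The parallelism comes at the price noted above: the red-black ordering changes the preconditioner itself, and with it the convergence behavior.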
- (3) Forced parallelism. Parallelism can also be forced by simply
neglecting couplings to unknowns residing in other processors.
This is like block Jacobi preconditioning, in which the blocks
may be decomposed in incomplete form [91]. Again, this
may not always reduce the overall solution time, since the effects of
increased parallelism are more than undone by an increased number of
iteration steps.
In order to reduce this effect, it is suggested in [92] to
construct
incomplete decompositions on slightly overlapping domains. This requires
communication similar to that for matrix-vector products.
In [92] results are reported for a six-processor shared
memory system (an IBM 3090), and speedups close to 6 have been observed.
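A minimal sketch of the forced-parallelism idea, under the assumption of a simple contiguous block partitioning: couplings between blocks are dropped, so each processor can factor and solve its own diagonal block independently. Here a small dense solve stands in for the incomplete factorization of each block, and all names are illustrative.

```python
# Block Jacobi preconditioning: apply w = K^{-1} r with K the
# block-diagonal part of A.  Each block is an independent task.

def solve_dense(A, b):
    # Tiny Gaussian elimination without pivoting; adequate for the
    # small, well-conditioned model blocks used here.
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            b[i] -= f * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

def block_jacobi_apply(A, r, block_size):
    n = len(r)
    w = [0.0] * n
    # each block below is an independent task (one per processor);
    # couplings of A outside the blocks are simply ignored
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        Ablk = [[A[i][j] for j in range(start, end)]
                for i in range(start, end)]
        w[start:end] = solve_dense(Ablk, r[start:end])
    return w
```

With overlapping blocks, as suggested in [92], each task would also read a few rows owned by neighboring processors, which is where the matrix-vector-product-like communication enters.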