9.1 Parallelism in the Kernels of Iterative Methods

Next: 9.2 Parallelism and Data
Up: 9 Iterative Methods for
Previous: 9 Iterative Methods for
The basic time-consuming computational kernels of iterative schemes are
usually:
- (1)
- inner products,
- (2)
- vector updates,
- (3)
- matrix-vector products, like
(for some methods also
),
- (4)
- preconditioning (e.g., solve for
in
).
The inner products can be easily parallelized; each processor computes the
inner product of two segments of each vector (local inner products or LIPs).
On distributed memory machines the LIPs have to be sent to other processors
in order to be reduced to the required global inner product. This step
requires communication. For shared memory machines the inner products
can be computed in parallel without difficulty.
If the distributed memory
system supports overlap of communication with computation,
then we seek opportunities in the algorithm to do so. In the
standard formulation of most iterative schemes this is usually a major
problem.
We will come back to this in the next two sections.
Vector updates are trivially parallelizable: each processor updates its
`own' segment.
The matrix-vector products are often easily parallelized on shared memory
machines by splitting the matrix into strips corresponding to the vector
segments. Each processor takes care of the matrix-vector product of one
strip.
For distributed memory machines there may be a problem if each processor
has only a segment of the vector in its memory. Depending on the bandwidth
of the matrix we may need communication for other elements of the vector,
which may lead to communication problems. However, many sparse
matrix problems are related to a network in which only nearby nodes are
connected. In such a case it seems natural to subdivide the network, or
grid, in suitable blocks and to distribute these blocks over the processors.
When computing
each processor needs at most the values of
at
some nodes in neighboring blocks. If the number of connections to these
neighboring blocks is small compared to the number of internal nodes,
then the communication time can be overlapped with computational work.
For more detailed discussions on implementation aspects on distributed
memory systems, see [80] and [81].
Preconditioning is often the most problematic part in a parallel
environment.
Incomplete decompositions of
form a popular class of preconditionings
in the context of solving discretized PDEs. In this case the preconditioner
, where
and
have a sparsity pattern equal or close to the
sparsity pattern
of the corresponding parts of
(
is lower triangular,
is upper
triangular). For details see [4], [82] and
[83].
Solving
leads to solving successively
and
.
These triangular solves lead to recurrence relations that are not easily
parallelized.
We will now discuss a number of approaches to obtain parallelism in the
preconditioning part.
- (1)
- Reordering the computations. Depending on the structure of
the matrix a frontal approach may lead to successful parallelism. By
inspecting the dependency graph one can select those elements that can
be computed in parallel. For instance, if a second order PDE is discretized
by the usual five-point star over a rectangular grid, then
the triangular solves can be parallelized if the computation is carried
out along diagonals of the grid, instead of the usual lexicographical
order. For vector computers this leads to a vectorizable preconditioner
(see [86][85][8][84]).
For coarse-grained parallelism this approach is insufficient. By a similar
approach more parallelism can be obtained in three-dimensional situations: the
so-called
hyperplane approach
[86][85][87].
The disadvantage is that the data need to be redistributed over the
processors, since the grid points,
which correspond to a hyperplane in the grid, are located
quite irregularly in the array. For shared memory machines this also leads
to reduced performance because of indirect addressing. In general one
concludes that the data dependency approach is not adequate for
obtaining a suitable degree of parallelism.
- (2)
- Reordering the unknowns. One may also use a coloring
scheme
for reordering the unknowns, so that unknowns with the same color are not
explicitly coupled. This means that the triangular solves can be parallelized
for each color. Of course, communication is required for couplings between
groups of different colors.
Simple coloring schemes, like red-black ordering for the five-point
discretized
Poisson operator, seem to have a negative effect on the convergence behavior.
Duff and Meurant [88] have carried out numerical experiments for
many different orderings, which show that the numbers of iterations may
increase significantly for other than lexicographical ordering. Some modest
degree of parallelism can be obtained, however, with so-called incomplete
twisted factorizations [85][89][8].
Multi-color schemes with a large number of colors (e.g., 20 to 100) may
lead to little or no degradation in convergence behavior [90].
but also to less parallelism. Moreover, the ratio of computation
to communication may be more unfavorable.
- (3)
- Forced parallelism. Parallelism can also be forced by simply
neglecting couplings to unknowns residing in other processors.
This is like block Jacobi preconditioning, in which the blocks
may be decomposed in incomplete form [91]. Again, this
may not always reduce the overall solution time, since the effects of
increased parallelism are more than undone by an increased number of
iteration steps.
In order to reduce this effect, it is suggested in [92] to
construct
incomplete decompositions on slightly overlapping domains. This requires
communication similar to that of matrix-vector products.
In [92] results are reported on a six-processor shared
memory system (IBM3090), and speedups close to
have been observed.
The problems with parallelism in the preconditioner have led to searches
for other preconditioners. Often simple diagonal scaling is an adequate
preconditioner and this is trivially parallelizable.
For results on a Connection Machine, see [93].
Often this approach
leads to a significant increase in iteration steps.
Still another approach is to use polynomial preconditioning:
,
i.e.
, for some suitable
th degree polynomial. This
preconditioner can be implemented by forming only matrix-vector products,
which, depending on the structure of
, are easier to parallelize
[94].
For
one often selects a Chebychev polynomial,
which requires some information on the spectrum of
.
Finally we point out the possibility of using the truncated Neumann series
for the approximate inverse of
, or parts of
and
. Madsen et
al. [95]
discuss approximate inversion of
, which from the
implementation point of view is equivalent to polynomial preconditioning.
In [96] the use of truncated Neumann series for removing some of
the recurrences in the triangular solves is discussed. This approach leads
to only fine-grained parallelism (vectorization).

Next: 9.2 Parallelism and Data
Up: 9 Iterative Methods for
Previous: 9 Iterative Methods for
verena@csep1.phy.ornl.gov