## 9.1 Parallelism in the Kernels of Iterative Methods     continued...

Preconditioning is often the most problematic part in a parallel environment. Incomplete decompositions of A form a popular class of preconditionings in the context of solving discretized PDEs. In this case the preconditioner K=LU, where L and U have a sparsity pattern equal or close to the sparsity pattern of the corresponding parts of A (L is lower triangular, U is upper triangular). For details see [4], [82] and [83]. Solving Kw=r leads to solving successively Lz=r and Uw=z. These triangular solves lead to recurrence relations that are not easily parallelized. We will now discuss a number of approaches to obtain parallelism in the preconditioning part.

(1)
Reordering the computations. Depending on the structure of the matrix a frontal approach may lead to successful parallelism. By inspecting the dependency graph one can select those elements that can be computed in parallel. For instance, if a second order PDE is discretized by the usual five-point star over a rectangular grid, then the triangular solves can be parallelized if the computation is carried out along diagonals of the grid, instead of the usual lexicographical order. For vector computers this leads to a vectorizable preconditioner (see [84,8,85,86]). For coarse-grained parallelism this approach is insufficient. By a similar approach more parallelism can be obtained in three-dimensional situations: the so-called hyperplane approach [87,85,86]. The disadvantage is that the data need to be redistributed over the processors, since the grid points, which correspond to a hyperplane in the grid, are located quite irregularly in the array. For shared memory machines this also leads to reduced performance because of indirect addressing. In general one concludes that the data dependency approach is not adequate for obtaining a suitable degree of parallelism.
(2)
Reordering the unknowns. One may also use a coloring scheme for reordering the unknowns, so that unknowns with the same color are not explicitly coupled. This means that the triangular solves can be parallelized for each color. Of course, communication is required for couplings between groups of different colors. Simple coloring schemes, like red-black ordering for the five-point discretized Poisson operator, seem to have a negative effect on the convergence behavior. Duff and Meurant [88] have carried out numerical experiments for many different orderings, which show that the numbers of iterations may increase significantly for other than lexicographical ordering. Some modest degree of parallelism can be obtained, however, with so-called incomplete twisted factorizations [8,89,85]. Multi-color schemes with a large number of colors (e.g., 20 to 100) may lead to little or no degradation in convergence behavior [90]. but also to less parallelism. Moreover, the ratio of computation to communication may be more unfavorable.
(3)
Forced parallelism. Parallelism can also be forced by simply neglecting couplings to unknowns residing in other processors. This is like block Jacobi preconditioning, in which the blocks may be decomposed in incomplete form [91]. Again, this may not always reduce the overall solution time, since the effects of increased parallelism are more than undone by an increased number of iteration steps. In order to reduce this effect, it is suggested in [92] to construct incomplete decompositions on slightly overlapping domains. This requires communication similar to that of matrix--vector products. In [92] results are reported on a six-processor shared memory system (IBM3090), and speedups close to 6 have been observed.