9.1 Parallelism in the Kernels of Iterative Methods



next up previous
Next: 9.2 Parallelism and Data Up: 9 Iterative Methods for Previous: 9 Iterative Methods for

9.1 Parallelism in the Kernels of Iterative Methods

 

The basic time-consuming computational kernels of iterative schemes are usually:

(1)
inner products,
(2)
vector updates,
(3)
matrix-vector products, like (for some methods also ),
(4)
preconditioning (e.g., solve for in ).

The inner products can be easily parallelized; each processor computes the inner product of two segments of each vector (local inner products or LIPs). On distributed memory machines the LIPs have to be sent to other processors in order to be reduced to the required global inner product. This step requires communication. For shared memory machines the inner products can be computed in parallel without difficulty. If the distributed memory system supports overlap of communication with computation, then we seek opportunities in the algorithm to do so. In the standard formulation of most iterative schemes this is usually a major problem. We will come back to this in the next two sections. Vector updates are trivially parallelizable: each processor updates its `own' segment. The matrix-vector products are often easily parallelized on shared memory machines by splitting the matrix into strips corresponding to the vector segments. Each processor takes care of the matrix-vector product of one strip.

For distributed memory machines there may be a problem if each processor has only a segment of the vector in its memory. Depending on the bandwidth of the matrix we may need communication for other elements of the vector, which may lead to communication problems. However, many sparse matrix problems are related to a network in which only nearby nodes are connected. In such a case it seems natural to subdivide the network, or grid, in suitable blocks and to distribute these blocks over the processors. When computing each processor needs at most the values of at some nodes in neighboring blocks. If the number of connections to these neighboring blocks is small compared to the number of internal nodes, then the communication time can be overlapped with computational work. For more detailed discussions on implementation aspects on distributed memory systems, see [80] and [81].

Preconditioning is often the most problematic part in a parallel environment. Incomplete decompositions of form a popular class of preconditionings in the context of solving discretized PDEs. In this case the preconditioner , where and have a sparsity pattern equal or close to the sparsity pattern of the corresponding parts of ( is lower triangular, is upper triangular). For details see [4], [82] and [83]. Solving leads to solving successively and . These triangular solves lead to recurrence relations that are not easily parallelized. We will now discuss a number of approaches to obtain parallelism in the preconditioning part.

(1)
Reordering the computations. Depending on the structure of the matrix a frontal approach may lead to successful parallelism. By inspecting the dependency graph one can select those elements that can be computed in parallel. For instance, if a second order PDE is discretized by the usual five-point star over a rectangular grid, then the triangular solves can be parallelized if the computation is carried out along diagonals of the grid, instead of the usual lexicographical order. For vector computers this leads to a vectorizable preconditioner (see [86][85][8][84]). For coarse-grained parallelism this approach is insufficient. By a similar approach more parallelism can be obtained in three-dimensional situations: the so-called hyperplane approach [86][85][87]. The disadvantage is that the data need to be redistributed over the processors, since the grid points, which correspond to a hyperplane in the grid, are located quite irregularly in the array. For shared memory machines this also leads to reduced performance because of indirect addressing. In general one concludes that the data dependency approach is not adequate for obtaining a suitable degree of parallelism.
(2)
Reordering the unknowns. One may also use a coloring scheme for reordering the unknowns, so that unknowns with the same color are not explicitly coupled. This means that the triangular solves can be parallelized for each color. Of course, communication is required for couplings between groups of different colors. Simple coloring schemes, like red-black ordering for the five-point discretized Poisson operator, seem to have a negative effect on the convergence behavior. Duff and Meurant [88] have carried out numerical experiments for many different orderings, which show that the numbers of iterations may increase significantly for other than lexicographical ordering. Some modest degree of parallelism can be obtained, however, with so-called incomplete twisted factorizations [85][89][8]. Multi-color schemes with a large number of colors (e.g., 20 to 100) may lead to little or no degradation in convergence behavior [90]. but also to less parallelism. Moreover, the ratio of computation to communication may be more unfavorable.
(3)
Forced parallelism. Parallelism can also be forced by simply neglecting couplings to unknowns residing in other processors. This is like block Jacobi preconditioning, in which the blocks may be decomposed in incomplete form [91]. Again, this may not always reduce the overall solution time, since the effects of increased parallelism are more than undone by an increased number of iteration steps. In order to reduce this effect, it is suggested in [92] to construct incomplete decompositions on slightly overlapping domains. This requires communication similar to that of matrix-vector products. In [92] results are reported on a six-processor shared memory system (IBM3090), and speedups close to have been observed.

The problems with parallelism in the preconditioner have led to searches for other preconditioners. Often simple diagonal scaling is an adequate preconditioner and this is trivially parallelizable. For results on a Connection Machine, see [93]. Often this approach leads to a significant increase in iteration steps. Still another approach is to use polynomial preconditioning: , i.e. , for some suitable th degree polynomial. This preconditioner can be implemented by forming only matrix-vector products, which, depending on the structure of , are easier to parallelize [94]. For one often selects a Chebychev polynomial, which requires some information on the spectrum of .

Finally we point out the possibility of using the truncated Neumann series for the approximate inverse of , or parts of and . Madsen et al. [95] discuss approximate inversion of , which from the implementation point of view is equivalent to polynomial preconditioning. In [96] the use of truncated Neumann series for removing some of the recurrences in the triangular solves is discussed. This approach leads to only fine-grained parallelism (vectorization).



next up previous
Next: 9.2 Parallelism and Data Up: 9 Iterative Methods for Previous: 9 Iterative Methods for



verena@csep1.phy.ornl.gov