9.3 Parallelism and Data Locality in GMRES



next up previous
Next: 10 Error Bounds for Up: 9 Iterative Methods for Previous: 9.2 Parallelism and Data

9.3 Parallelism and Data Locality in GMRES

 

GMRES, proposed by Saad and Schultz [107], is a CG-like method for solving general nonsingular linear systems . GMRES minimizes the residual over the Krylov subspace , with . This requires, as with CG, the construction of an orthogonal basis of this space. Since we do not require to be symmetric, we need long recurrences: each new vector must be explicitly orthogonalized against all previously generated basis vectors. In its most common form GMRES orthogonalizes using Modified Gram-Schmidt [4]. In order to limit memory requirements (since all basis vectors must be stored), GMRES is restarted after each cycle of iteration steps; this is called GMRES. A slightly simplified version of GMRES with preconditioning is as follows (for details, see [108]):

 

Another scheme for GMRES, based upon Householder orthogonalization instead of modified Gram-Schmidt, has been proposed in [109]. For some applications the additional computation required by Householder orthogonalization is compensated by improved numerical properties: the better orthogonality saves iteration steps. In [110] a variant of GMRES is proposed in which the preconditioner itself may be an iterative process, which may help to increase parallel efficiency.

Similar to CG and other iterative schemes, the major computations are matrix-vector computations (with and ), inner products and vector updates. All of these operations are easily parallelizable, although on distributed memory machines the inner products in the orthogonalization act as synchronization points. In this part of the algorithm, one new vector, , is orthogonalized against the previously built orthogonal set , , ... , . In Algorithm 9.4, this is done using Level 1 BLAS, which may be quite inefficient. To incorporate Level 2 BLAS we can do either Householder orthogonalization or classical Gram-Schmidt twice (which mitigates classical Gram-Schmidt's potential instability [103]. Both approaches significantly increase the computational work and do not remove the synchronization and data-locality problems completely. Note that we cannot, as in CG, overlap the inner product computation with updating the approximate solution, since in GMRES this updating can be done only after completing a cycle of orthogonalization steps.

The obvious way to extract more parallelism and data locality is to generate a basis , , ..., for the Krylov subspace first, and to orthogonalize this set afterwards; this is called -step GMRES() [104]. This approach does not increase the computational work and, in contrast to CG, the numerical instability due to generating a possibly near-dependent set is not necessarily a drawback. One reason is that error cannot build up as in CG, because the method is restarted every steps. In any case, the resulting set, after orthogonalization, is the basis of some subspace, and the residual is then minimized over that subspace. If, however, one wants to mimic standard GMRES() as closely as possible, one could generate a better (more independent) starting set of basis vectors , , ..., , where the are suitable degree polynomials. Newton polynomials are suggested in [111] and and Chebychev polynomials in [80].

After generating a suitable starting set, we still have to orthogonalize it. In [80] modified Gram-Schmidt is used while avoiding communication times that cannot be overlapped. We outline this approach, since it may be of value for other orthogonalization methods. Given a basis for the Krylov subspace, we orthogonalize by

for :}
/* orthogonalize w.r.t. */
for



In order to overlap the communication costs of the inner products, we split the -loop into two parts. Then for each we proceed as follows.

1. compute in parallel the local parts of the inner products for the first group
2. assemble the local inner products to global inner products
3. compute in parallel the local parts of the inner products for the second group
4. update ; compute the local inner products required for
5. assemble the local inner products of the second group to global inner products
6. update the vectors
7. compute
From this scheme it is obvious that if the length of the vector segments per processor are not too small, in principle all communication time can be overlapped by useful computations.

For a processor MEIKO system, configured as a torus, de Sturler [80] reports speedups of about for typical discretized PDE systems with unknowns (i.e. unknowns per processor). For larger systems, the speedup increases to (or more if more processors are involved) as expected. Calvetti et al. [112] report results for an implementation of -step GMRES, using BLAS2 Householder orthogonalization, for a four-processor IBM 6000 distributed memory system. For larger linear systems, they observed speedups close to .



next up previous
Next: 10 Error Bounds for Up: 9 Iterative Methods for Previous: 9.2 Parallelism and Data



verena@csep1.phy.ornl.gov