GMRES, proposed by Saad and Schultz [107], is a CG-like
method for solving general nonsingular linear systems
.
GMRES minimizes the residual over the Krylov subspace
, with
.
This requires, as with CG, the construction of an orthogonal
basis of this space. Since we do not require
to be symmetric,
we need long recurrences: each new vector must be
explicitly orthogonalized against all previously generated basis vectors.
In its most common form GMRES orthogonalizes using Modified Gram-Schmidt
[4].
In order to limit memory
requirements (since all basis vectors must be stored), GMRES is
restarted after each cycle of
iteration steps; this is called
GMRES
.
A slightly simplified version of GMRES
with
preconditioning
is as follows (for details, see [108]):
Another scheme for GMRES, based upon Householder orthogonalization instead of modified Gram-Schmidt, has been proposed in [109]. For some applications the additional computation required by Householder orthogonalization is compensated by improved numerical properties: the better orthogonality saves iteration steps. In [110] a variant of GMRES is proposed in which the preconditioner itself may be an iterative process, which may help to increase parallel efficiency.
Similar to CG and other iterative schemes, the major computations
are matrix-vector computations (with
and
), inner products and
vector updates. All of these operations are easily parallelizable,
although on distributed memory machines the inner products in the
orthogonalization act as
synchronization points. In this part of the algorithm, one new vector,
, is orthogonalized against the previously built orthogonal set
,
, ... ,
. In Algorithm 9.4,
this is done using
Level 1 BLAS, which may be quite inefficient. To incorporate Level 2 BLAS
we can do either Householder orthogonalization or classical Gram-Schmidt
twice (which mitigates classical Gram-Schmidt's potential instability
[103]. Both approaches
significantly increase the computational work and do not remove
the synchronization and data-locality problems completely. Note that we
cannot, as in CG, overlap the inner product computation with updating
the approximate solution, since in GMRES this updating can be done only
after completing a cycle of
orthogonalization steps.
The obvious way to extract more parallelism and
data locality is to generate a basis
,
, ...,
for the Krylov subspace first,
and to orthogonalize this set afterwards; this is called
-step GMRES(
) [104]. This approach does not
increase the computational work and, in contrast to CG, the numerical
instability due to generating a possibly near-dependent set is not
necessarily a drawback. One reason is that error cannot build up as in
CG, because the method is restarted every
steps.
In any case, the resulting set, after orthogonalization,
is the basis of some subspace, and the residual is then minimized over that
subspace. If, however, one wants to mimic
standard GMRES(
) as closely as possible,
one could generate a better (more independent) starting set of basis
vectors
,
, ...,
, where the
are suitable degree
polynomials.
Newton polynomials are suggested in [111] and
and Chebychev polynomials in [80].
After generating a suitable starting set, we still have to
orthogonalize it. In [80]
modified Gram-Schmidt is used while avoiding communication times
that cannot be overlapped. We outline this approach, since it
may be of value for other orthogonalization methods.
Given a basis for the Krylov subspace, we orthogonalize by
for
:}
/* orthogonalize
w.r.t.
*/
for
In order to overlap the communication costs of the inner products,
we split the
-loop into two parts. Then for each
we proceed as
follows.
1. compute in parallel the local parts of the inner products for
the first group
2. assemble the local inner products to global inner products
3. compute in parallel the local parts of the inner products for
the second group
4. update
; compute the local inner products required for 
5. assemble the local inner products of the second group to global
inner products
6. update the vectors
7. compute
From this scheme it is obvious that if the length of the vector segments per
processor are not too small, in principle all communication time can be
overlapped by useful computations.
For a
processor MEIKO system, configured as a
torus,
de Sturler [80] reports speedups of about
for typical
discretized PDE systems with
unknowns (i.e.
unknowns
per processor). For larger systems, the speedup increases to
(or more if more processors are involved) as expected.
Calvetti et al. [112] report results for an implementation of
-step
GMRES, using BLAS2 Householder orthogonalization, for a four-processor
IBM 6000 distributed memory system. For larger linear systems, they observed
speedups close to
.