Another scheme for GMRES, based upon Householder orthogonalization instead of modified Gram--Schmidt, has been proposed in [109]. For some applications the additional computation required by Householder orthogonalization is compensated by improved numerical properties: the better orthogonality saves iteration steps. In [110] a variant of GMRES is proposed in which the preconditioner itself may be an iterative process, which may help to increase parallel efficiency.
Similar to CG and other iterative schemes, the major computations
are matrix--vector computations (with A and K), inner products and
vector updates. All of these operations are easily parallelizable,
although on distributed memory machines the inner products in the
orthogonalization act as
synchronization points. In this part of the algorithm, one new vector,
, is orthogonalized against the previously built orthogonal set
,
, ... ,
. In Algorithm 9.4,
this is done using
Level 1 BLAS, which may be quite inefficient. To incorporate Level 2 BLAS
we can do either Householder orthogonalization or classical Gram--Schmidt
twice (which mitigates classical Gram--Schmidt's potential instability
[103]. Both approaches
significantly increase the computational work and do not remove
the synchronization and data-locality problems completely. Note that we
cannot, as in CG, overlap the inner product computation with updating
the approximate solution, since in GMRES this updating can be done only
after completing a cycle of m orthogonalization steps.