The obvious way to extract more parallelism and data locality is to generate a basis $r_0, Ar_0, \ldots, A^m r_0$ for the Krylov subspace first, and to orthogonalize this set afterwards; this is called m-step GMRES(m). This approach does not increase the computational work and, in contrast to CG, the numerical instability due to generating a possibly near-dependent set is not necessarily a drawback. One reason is that error cannot build up as in CG, because the method is restarted every m steps. In any case, the resulting set, after orthogonalization, is a basis of some subspace, and the residual is then minimized over that subspace. If, however, one wants to mimic standard GMRES(m) as closely as possible, one could generate a better (more independent) starting set of basis vectors $r_0, p_1(A)r_0, \ldots, p_m(A)r_0$, where the $p_j$ are suitable degree-$j$ polynomials. Newton polynomials are suggested in , and Chebyshev polynomials in .
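As a rough illustration of the polynomial-basis idea, the following sketch generates a Newton-type Krylov basis $r_0, (A-\theta_1 I)r_0, (A-\theta_2 I)(A-\theta_1 I)r_0, \ldots$; the shift values $\theta_j$ (typically Ritz values from a previous restart) and the function name are illustrative assumptions, not part of the text:

```python
import numpy as np

def newton_krylov_basis(A, r0, shifts):
    """Generate Newton-basis Krylov vectors
        r0, (A - theta_1 I) r0, (A - theta_2 I)(A - theta_1 I) r0, ...
    Well-chosen shifts theta_j spread out the basis vectors, so the
    set stays better conditioned than the monomial basis
    r0, A r0, A^2 r0, ...; each step is still one matrix-vector
    product plus vector operations, so the work is unchanged.
    """
    V = [r0 / np.linalg.norm(r0)]
    for theta in shifts:
        v = A @ V[-1] - theta * V[-1]    # apply (A - theta I) to last vector
        V.append(v / np.linalg.norm(v))  # rescale to avoid over/underflow
    return np.column_stack(V)            # n x (len(shifts)+1) basis matrix
```

With all shifts equal to zero this reduces to the (scaled) monomial basis; the resulting set still has to be orthogonalized afterwards, as described next.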
After generating a suitable starting set, we still have to
orthogonalize it. In , modified Gram--Schmidt is used in a way
that avoids communication that cannot be overlapped with
computation. We outline this approach, since it may be of value
for other orthogonalization methods.
Given a basis $v_1, v_2, \ldots, v_{m+1}$ for the Krylov subspace, we orthogonalize by

    for k = 2, 3, ..., m+1
      for j = 1, 2, ..., k-1
        v_k = v_k - (v_k, v_j) v_j   /* orthogonalize v_k w.r.t. v_j */
      v_k = v_k / ||v_k||_2
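A minimal serial sketch of this modified Gram--Schmidt loop (the function name is illustrative; a parallel version would additionally overlap the inner-product communication, as discussed below):

```python
import numpy as np

def modified_gram_schmidt(V):
    """Orthonormalize the columns of V by modified Gram-Schmidt:
    each column is orthogonalized against the already-orthonormal
    columns one inner product at a time, then normalized. This is
    numerically more stable than classical Gram-Schmidt, at the
    cost of sequential inner products (the j-loop of the text)."""
    Q = np.array(V, dtype=float, copy=True)
    m = Q.shape[1]
    for k in range(m):
        for j in range(k):                       # the j-loop
            Q[:, k] -= (Q[:, j] @ Q[:, k]) * Q[:, j]
        Q[:, k] /= np.linalg.norm(Q[:, k])       # normalize v_k
    return Q
```

On a distributed-memory machine each inner product `Q[:, j] @ Q[:, k]` requires a global reduction, which is the communication cost the text seeks to hide.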
In order to overlap the communication costs of the inner products,
we split the j-loop into two parts. Then for each k we proceed as