1. compute in parallel the local parts of the inner products for
the first group

2. assemble the local inner products to global inner products

3. compute in parallel the local parts of the inner products for
the second group

4. update ; compute the local inner products required for

5. assemble the local inner products of the second group to global
inner products

6. update the vectors

7. compute

From this scheme it is obvious that if the length of the vector segments per processor is not too small, in principle all communication time can be overlapped with useful computation.
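The overlap idea above can be sketched in code. The following is a minimal illustration, not the actual implementation from [80]: two simulated "processors" each hold a vector segment, the global assembly of the first group of inner products (a non-blocking all-reduce in a real message-passing code) is stood in for by a background thread, and the local work for the second group proceeds concurrently. All function and variable names are hypothetical.

```python
import threading

def local_dot(x_seg, y_seg):
    # Local part of an inner product over one "processor's" segment.
    return sum(a * b for a, b in zip(x_seg, y_seg))

def overlapped_inner_products(segments_x, segments_y, segments_u, segments_v):
    # Step 1: compute in parallel the local parts of the first group.
    locals_g1 = [local_dot(xs, ys) for xs, ys in zip(segments_x, segments_y)]

    # Step 2: start assembling the first group to a global inner product.
    # In a real code this would be a non-blocking reduction; here a
    # background thread merely simulates the communication phase.
    result = {}
    def assemble(key, parts):
        result[key] = sum(parts)
    comm = threading.Thread(target=assemble, args=("g1", locals_g1))
    comm.start()

    # Step 3: meanwhile, compute the local parts of the second group,
    # hiding the communication time behind this useful computation.
    locals_g2 = [local_dot(us, vs) for us, vs in zip(segments_u, segments_v)]

    # Step 5: wait for the first assembly, then assemble the second group.
    comm.join()
    assemble("g2", locals_g2)
    return result["g1"], result["g2"]
```

The overlap only pays off when the segments are long enough that the local computation in step 3 takes at least as long as the assembly started in step 2, which is exactly the condition stated above.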

For a **150**-processor MEIKO system, configured as a torus,
de Sturler [80] reports speedups of about **120** for typical
discretized PDE systems with **60,000** unknowns (i.e. **400** unknowns
per processor). For larger systems, the speedup increases to **150**
(or more if more processors are involved) as expected.
Calvetti *et al.* [112] report results for an implementation of **m**-step
GMRES, using BLAS2 Householder orthogonalization, for a four-processor
IBM 6000 distributed memory system. For larger linear systems, they observed
speedups close to .