Now we consider parallelizing the rest of the algorithm.
Note that updating and can only begin after
completing the inner product for . Since on a distributed
memory machine communication is needed for the inner product, we
cannot overlap this communication with useful computation.
The same observation applies to updating , which can only begin
after completing the inner product for .
Apart from computing and solving , we need to
load **7** vectors for **10** vector floating point operations. This means that
for this part of the computation only floating point operation can be
carried out per memory reference on average.

Several authors [98,99,100,101]
have attempted to improve this ratio, and to reduce the number
of synchronization points (the points at which computation must wait for
communication). In Algorithm 9.1
there are two such synchronization
points, namely the computation of both inner products.
Meurant [100] (see also [94]) has proposed a variant in
which there is only one
synchronization point, however at the cost of possibly reduced numerical
stability, and one additional inner product. In this scheme the ratio between
computations and memory references is about **2**.
We show here yet another variant, proposed by Chronopoulos and Gear
[98].