Now we consider parallelizing the rest of the algorithm. Note that updating the iterate and the residual can only begin after completing the inner product that determines the step length. Since on a distributed-memory machine the inner product requires communication, this communication cannot be overlapped with useful computation. The same observation applies to updating the search direction, which can only begin after completing the inner product for its update coefficient. Apart from computing the matrix-vector product and solving the preconditioning system, we need to load 7 vectors for 10 vector floating point operations. This means that for this part of the computation only about 10/7 ≈ 1.4 floating point operations can be carried out per memory reference on average.
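To make the data dependencies concrete, the following is a minimal serial sketch of a preconditioned conjugate gradient loop with the two inner products marked as synchronization points. It is an illustration only, not a transcription of the algorithm referred to in the text: the function names `dot`, `matvec`, and `pcg`, the Jacobi (diagonal) preconditioner, and the stopping test are assumptions made for this example.

```python
def dot(u, v):
    # In a distributed-memory code this would be a global reduction:
    # every process waits here until the result is available.
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    # Dense matrix-vector product, written out with a plain local sum.
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def pcg(A, b, diag, tol=1e-10, max_iter=50):
    """Hypothetical Jacobi-preconditioned CG.

    Returns (x, syncs), where syncs counts the inner products,
    i.e. the points where a parallel code would have to wait.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)              # residual for the initial guess x = 0
    p = [0.0] * n
    rho_old = 1.0
    syncs = 0
    for it in range(max_iter):
        # Preconditioner solve (purely local for a diagonal preconditioner).
        z = [ri / di for ri, di in zip(r, diag)]

        rho = dot(r, z)      # synchronization point 1
        syncs += 1
        if rho < tol * tol:  # assumed stopping test, reuses rho
            break

        # The search direction update cannot start before rho is known.
        beta = rho / rho_old if it > 0 else 0.0
        p = [zi + beta * pi for zi, pi in zip(z, p)]

        q = matvec(A, p)
        alpha = rho / dot(p, q)   # synchronization point 2
        syncs += 1

        # The iterate and residual updates cannot start before alpha is known.
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rho_old = rho
    return x, syncs


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, syncs = pcg(A, b, diag=[4.0, 3.0])
```

In this sketch each full iteration passes through two reductions, and the vector updates that follow each reduction depend on its result, so there is no independent work in the loop body that could hide the communication latency.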
Several authors [98,99,100,101] have attempted to improve this ratio and to reduce the number of synchronization points (the points at which computation must wait for communication). In Algorithm 9.1 there are two such synchronization points, namely the computations of the two inner products. Meurant  (see also ) has proposed a variant in which there is only one synchronization point, though at the cost of possibly reduced numerical stability and one additional inner product. In this scheme the ratio between computations and memory references is about 2. We show here yet another variant, proposed by Chronopoulos and Gear .