Now we consider parallelizing the rest of the algorithm.
Note that updating and
can only begin after
completing the inner product for
. Since on a distributed
memory machine communication is needed for the inner product, we
cannot overlap this communication with useful computation.
The same observation applies to updating
, which can only begin
after completing the inner product for
.
Apart from computing
and solving
, we need to
load 7 vectors for 10 vector floating point operations. This means that
for this part of the computation only
floating point operation can be
carried out per memory reference on average.
Several authors [98,99,100,101] have attempted to improve this ratio, and to reduce the number of synchronization points (the points at which computation must wait for communication). In Algorithm 9.1 there are two such synchronization points, namely the computation of both inner products. Meurant [100] (see also [94]) has proposed a variant in which there is only one synchronization point, however at the cost of possibly reduced numerical stability, and one additional inner product. In this scheme the ratio between computations and memory references is about 2. We show here yet another variant, proposed by Chronopoulos and Gear [98].