9.2 Parallelism and Data Locality in Preconditioned CG (continued)

In this scheme every vector needs to be loaded only once per pass through the loop, which improves data locality. The price is 2n extra flops per iteration step. Chronopoulos and Gear [98] claim, on the basis of their numerical experiments, that the method is stable. Instead of the two synchronization points of the standard version of CG, we now have only one: the next pass of the loop can start only when the inner products at the end of the previous pass have been completed. Another slight advantage is that these inner products can be computed in parallel.
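To make the single synchronization point concrete, here is a minimal sketch in plain Python. It assumes the commonly cited form of the Chronopoulos and Gear recurrences, so its notation may differ from the scheme given earlier in this section. The function names (`dot`, `matvec`, `cg_single_sync`), the diagonal (Jacobi) preconditioner `m_diag`, and the dense list-of-lists matrix are all illustrative assumptions, not part of the original text.

```python
def dot(a, b):
    """Inner product; on a parallel machine this is a global reduction."""
    return sum(x * y for x, y in zip(a, b))


def matvec(A, v):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [dot(row, v) for row in A]


def cg_single_sync(A, b, m_diag, tol=1e-10, max_iter=200):
    """Preconditioned CG with one synchronization point per iteration.

    A is assumed symmetric positive definite; m_diag holds the diagonal of
    an assumed Jacobi preconditioner M. Returns (x, iterations performed).
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                                  # r = b - A x for x = 0
    u = [ri / di for ri, di in zip(r, m_diag)]   # u = M^{-1} r
    w = matvec(A, u)                             # w = A u
    gamma = dot(r, u)
    delta = dot(w, u)
    if gamma <= tol * tol:                       # right-hand side is (nearly) zero
        return x, 0
    alpha = gamma / delta
    beta = 0.0
    p = [0.0] * n
    s = [0.0] * n                                # s tracks A p by recurrence
    for it in range(1, max_iter + 1):
        # Vector updates: these could be fused into one loop over the
        # entries, so that each vector is loaded once per pass.
        p = [ui + beta * pi for ui, pi in zip(u, p)]
        s = [wi + beta * si for wi, si in zip(w, s)]   # the extra 2n flops
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        u = [ri / di for ri, di in zip(r, m_diag)]
        w = matvec(A, u)
        # The two inner products are independent of each other and can be
        # combined into a single reduction: the only synchronization point.
        gamma_new = dot(r, u)
        delta = dot(w, u)
        if gamma_new <= tol * tol:               # sqrt(r^T M^{-1} r) <= tol
            return x, it
        beta = gamma_new / gamma
        alpha = gamma_new / (delta - beta * gamma_new / alpha)
        gamma = gamma_new
    return x, max_iter
```

The scalar `alpha` is obtained from `gamma_new` and `delta` alone, so no inner product with the new search direction is needed before the vector updates of the next pass. The stopping test reuses `gamma_new` rather than adding a third inner product; that is a design choice of this sketch, not something prescribed by the text.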