Under the assumptions that we have made, CG can be efficiently parallelized:
- All compute-intensive operations can be done in parallel.
Only operations (2), (3), (7), (8), (9), and (0) require communication.
We have assumed that the communication in (2), (7), and (0) can be
overlapped with computation.
- The communication required for the reduction of the inner product in (3)
can be overlapped with the update for x_i in (4)
(which could in fact have been done in the previous iteration step).
- The reduction of the inner product in (8) can be overlapped with the
computation in (0). Also step (9) usually requires information such
as the norm of the residual, which can be overlapped with (0).
- Steps (1), (2), and (3) can be combined: the computation of a segment
of p_i can be followed immediately by the computation of a segment
of q_i in (2), and this can be followed by the computation of a part of
the inner product in (3). This saves on load operations for segments of
p_i and q_i.
- Depending on the structure of L, the computation of segments of r_{i+1}
in (6) can be followed by operations in (7), which can be
followed by the computation of parts of the inner product in (8), and
the computation of the norm of r_{i+1}, required for (9).
- The computation of β_i can be done as soon as the computation
in (8) has been completed. At that moment, the computation for (1)
can be started if the requested parts of w_{i+1} have been completed in (0).
- If no preconditioner is used, then w_i = r_i, and steps (7) and (0)
are skipped. Step (8) is replaced by
ρ_{i+1} = (r_{i+1}, r_{i+1}). Now we need some computation
to overlap the communication for this inner product. To this end, one
might split the computation in (4) into two parts. The
first part would be computed in parallel with (3), and the
second part with (8).
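The numbered steps are not reproduced in this excerpt; the discussion above is consistent with a rearranged preconditioned CG iteration of the following form. This is a sequential sketch in plain Python, not the text's own code: the step labels (0)-(10), the splitting K = L L^T of the preconditioner, and the delayed update of x in (4) are reconstructions from the surrounding discussion.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(a, u, v):
    # returns a*u + v, elementwise
    return [a * x + y for x, y in zip(u, v)]

def matvec(A, v):
    return [dot(row, v) for row in A]

def lower_solve(L, r):
    # s = L^{-1} r for lower-triangular L (forward substitution)
    s = []
    for i in range(len(r)):
        s.append((r[i] - sum(L[i][j] * s[j] for j in range(i))) / L[i][i])
    return s

def upper_solve(L, s):
    # w = L^{-T} s for lower-triangular L (back substitution)
    n = len(s)
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (s[i] - sum(L[j][i] * w[j] for j in range(i + 1, n))) / L[i][i]
    return w

def pcg(A, b, L, tol=1e-12, max_it=100):
    n = len(b)
    x = [0.0] * n
    r = axpy(-1.0, matvec(A, x), b)      # r_0 = b - A x_0
    p = [0.0] * n                        # p_{-1} = 0
    alpha = beta = 0.0                   # alpha_{-1} = beta_{-1} = 0
    s = lower_solve(L, r)
    rho = dot(s, s)                      # rho_0
    for _ in range(max_it):
        w = upper_solve(L, s)            # (0)  w_i = L^{-T} s       [communication]
        p, p_prev = axpy(beta, p, w), p  # (1)  p_i = w_i + beta_{i-1} p_{i-1}
        q = matvec(A, p)                 # (2)  q_i = A p_i          [communication]
        gamma = dot(p, q)                # (3)  (p_i, q_i)  [reduction; overlap with (4)]
        x = axpy(alpha, p_prev, x)       # (4)  delayed: x_i = x_{i-1} + alpha_{i-1} p_{i-1}
        alpha = rho / gamma              # (5)  alpha_i = rho_i / gamma
        r = axpy(-alpha, q, r)           # (6)  r_{i+1} = r_i - alpha_i q_i
        s = lower_solve(L, r)            # (7)  s = L^{-1} r_{i+1}   [communication]
        rho_next = dot(s, s)             # (8)  rho_{i+1}  [reduction; overlap with next (0)]
        if dot(r, r) ** 0.5 < tol:       # (9)  convergence test on ||r_{i+1}||
            break                        #      x_{i+1} = x_i + alpha_i p_i, done below
        beta = rho_next / rho            # (10) beta_i = rho_{i+1} / rho_i
        rho = rho_next
    return axpy(alpha, p, x)
```

The lines marked [communication] are the ones whose messages or reductions would be overlapped on a parallel machine, as described above; the sketch itself runs each step in order.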
More recent work on removing synchronization points in CG while
retaining numerical stability appears in
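The combination of steps (1), (2), and (3) described above can be made concrete for a simple model problem. Below is a minimal sketch (a hypothetical helper, not from the text) for a symmetric tridiagonal A, with "segments" of length one: as soon as p[j] is available, row j-1 of q = A p is complete and its term of the inner product can be accumulated, so each entry of p stays in fast memory instead of being reloaded in three separate sweeps.

```python
def fused_1_2_3(w, p_prev, beta, d, e):
    # Fused steps (1)-(3) for a symmetric tridiagonal A with main diagonal d
    # and off-diagonal e (e[j] couples rows j and j+1); requires len(w) >= 2.
    n = len(w)
    p = [0.0] * n
    q = [0.0] * n
    gamma = 0.0
    for j in range(n):
        p[j] = w[j] + beta * p_prev[j]                # segment of step (1)
        if j >= 1:
            q[j - 1] = d[j - 1] * p[j - 1] + e[j - 1] * p[j]
            if j >= 2:
                q[j - 1] += e[j - 2] * p[j - 2]       # segment of step (2)
            gamma += p[j - 1] * q[j - 1]              # part of step (3)
    q[n - 1] = e[n - 2] * p[n - 2] + d[n - 1] * p[n - 1]  # last row of (2)
    gamma += p[n - 1] * q[n - 1]                      # last term of (3)
    return p, q, gamma
```

For a general banded A the same idea applies per segment: a segment of q can be formed once the neighbouring segments of p that its band touches are available.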