next up previous

9.2 Parallelism and Data Locality in Preconditioned CG     continued...

Under the assumptions that we have made, CG can be efficiently parallelized as follows.

(1)
All compute intensive operations can be done in parallel. Only operations (2), (3), (7), (8), (9), and (0) require communication. We have assumed that the communication in (2), (7), and (0) can be overlapped with computation.
(2)
The communication required for the reduction of the inner product in (3) can be overlapped with the update for in (4), (which could in fact have been done in the previous iteration step).
(3)
The reduction of the inner product in (8) can be overlapped with the computation in (0). Also step (9) usually requires information such as the norm of the residual, which can be overlapped with (0).
(4)
Steps (1), (2), and (3) can be combined: the computation of a segment of can be followed immediately by the computation of a segment of in (2), and this can be followed by the computation of a part of the inner product in (3). This saves on load operations for segments of and .
(5)
Depending on the structure of L, the computation of segments of in (6) can be followed by operations in (7), which can be followed by the computation of parts of the inner product in (8), and the computation of the norm of , required for (9).
(6)
The computation of can be done as soon as the computation in (8) has been completed. At that moment, the computation for (1) can be started if the requested parts of have been completed in (0).
(7)
If no preconditioner is used, then , and steps (7) and (0) are skipped. Step (8) is replaced by . Now we need some computation to overlap the communication for this inner product. To this end, one might split the computation in (4) in two parts. The first part would be computed in parallel with (3), and the second part with .

More recent work on removing synchronization points in CG while retaining numerical stability appears in [105,106].