
The basic time-consuming computational kernels of iterative schemes are
usually:
- (1)
- inner products,
- (2)
- vector updates,
- (3)
- matrix--vector products, like
(for some methods also
),
- (4)
- preconditioning (e.g., solve for w in Kw=r).
The inner products can be easily parallelized; each processor computes the
inner product of two segments of each vector (local inner products or LIPs).
On distributed memory machines the LIPs have to be sent to other processors
in order to be reduced to the required global inner product. This step
requires communication. For shared memory machines the inner products
can be computed in parallel without difficulty.
If the distributed memory
system supports overlap of communication with computation,
then we seek opportunities in the algorithm to do so. In the
standard formulation of most iterative schemes this is usually a major
problem.
We will come back to this in the next two sections.
Vector updates are trivially parallelizable: each processor updates its
`own' segment.
The matrix--vector products are often easily parallelized on shared memory
machines by splitting the matrix into strips corresponding to the vector
segments. Each processor takes care of the matrix--vector product of one
strip.
For distributed memory machines there may be a problem if each processor
has only a segment of the vector in its memory. Depending on the bandwidth
of the matrix we may need communication for other elements of the vector,
which may lead to communication problems. However, many sparse
matrix problems are related to a network in which only nearby nodes are
connected. In such a case it seems natural to subdivide the network, or
grid, in suitable blocks and to distribute these blocks over the processors.
When computing
each processor needs at most the values of
at
some nodes in neighboring blocks. If the number of connections to these
neighboring blocks is small compared to the number of internal nodes,
then the communication time can be overlapped with computational work.
For more detailed discussions on implementation aspects on distributed
memory systems, see [80] and [81].

