The basic time-consuming computational kernels of iterative schemes are usually:

1. inner products,
2. vector updates,
3. matrix--vector products, like **Ap** (for some methods also **A^T p**),
4. preconditioning (e.g., solve for **w** in **Kw = r**).

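The four kernels above all appear in a single iteration of a typical Krylov method. As a minimal sketch (not tied to any particular method in this text), here is a preconditioned conjugate gradient loop with each kernel labeled; `M_inv`, which applies the action of **K**⁻¹, is an assumed helper name, and dense NumPy arrays stand in for what would be sparse data structures in practice:

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=200):
    """Preconditioned CG sketch, labelling the four computational kernels.

    A and M_inv are dense here for illustration only; in practice A is
    sparse and M_inv applies the preconditioner solve Kw = r.
    """
    x = np.zeros(len(b))
    r = b - A @ x                      # kernel 3: matrix-vector product
    w = M_inv @ r                      # kernel 4: preconditioning (Kw = r)
    p = w.copy()
    rho = r @ w                        # kernel 1: inner product
    for _ in range(max_iter):
        q = A @ p                      # kernel 3: matrix-vector product
        alpha = rho / (p @ q)          # kernel 1: inner product
        x += alpha * p                 # kernel 2: vector update
        r -= alpha * q                 # kernel 2: vector update
        if np.linalg.norm(r) < tol:
            break
        w = M_inv @ r                  # kernel 4: preconditioning
        rho_new = r @ w                # kernel 1: inner product
        p = w + (rho_new / rho) * p    # kernel 2: vector update
        rho = rho_new
    return x
```

Note that the inner products (kernel 1) are the only kernels that require a global reduction across all processors, which is why they dominate communication cost on distributed memory machines.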
For distributed memory machines there may be a problem if each processor holds only a segment of the vector in its memory. Depending on the bandwidth of the matrix, the matrix--vector product then requires elements of the vector stored on other processors, which may lead to communication problems. However, many sparse matrix problems arise from a network in which only nearby nodes are connected. In such a case it seems natural to subdivide the network, or grid, into suitable blocks and to distribute these blocks over the processors. When computing **Ap**, each processor then needs at most the values of **p** at some nodes in neighboring blocks. If the number of connections to these neighboring blocks is small compared to the number of internal nodes, the communication time can be overlapped with computational work. For more detailed discussions of implementation aspects on distributed memory systems, see [80] and [81].
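The block-partitioned matrix--vector product can be sketched in serial form. In this hypothetical example, a 1-D Laplacian-like stencil is split into blocks, and each block needs only a single "halo" value of **p** from each neighboring block — the part that, on a distributed machine, would be communicated (and overlapped with the computation on internal nodes):

```python
import numpy as np

def blocked_tridiag_matvec(p, nblocks):
    """Serial sketch of a block-partitioned product y = A p for the
    1-D stencil (-1, 2, -1). Each block needs just one halo value of p
    from each neighboring block, so the data to be communicated is small
    compared to the work on the block's internal nodes.
    """
    n = len(p)
    y = np.empty(n)
    bounds = np.linspace(0, n, nblocks + 1, dtype=int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # "Communication": fetch one halo value from each neighbor block
        # (zero at the physical boundary of the grid).
        left = p[lo - 1] if lo > 0 else 0.0
        right = p[hi] if hi < n else 0.0
        # Local computation on this block's internal nodes.
        for i in range(lo, hi):
            pl = p[i - 1] if i > lo else left
            pr = p[i + 1] if i < hi - 1 else right
            y[i] = 2.0 * p[i] - pl - pr
    return y
```

In a real distributed implementation the halo fetches would be nonblocking messages posted before the interior computation, which is exactly the overlap of communication with computational work described above.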