7 Gaussian Elimination on Band Matrices (continued)

For parallel computers, the parallelism in eliminating these subdiagonal blocks is relatively fine-grained compared with the more coarse-grained parallelism in the first step of the algorithm. Furthermore, on distributed memory machines the data for each subdiagonal block have to be spread over all processors. It has been shown in [50] that this limits the performance of the algorithm: the speedup is bounded by the ratio of computational speed to communication speed, and this ratio is often very low [50].

The other approach is first to eliminate, successively, the last nonzero elements in the subdiagonal blocks. This can be done with a short recurrence, after which all fill-in can be eliminated in parallel. The recurrence requires some data communication between processors. However, for k large enough relative to the bandwidth, good speedups can be attained for this algorithm on a k-processor system [51]. For a generalization of the divide-and-conquer approach for banded systems, see [52]; the data transport aspects for distributed memory machines have been discussed in [50].
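As a minimal illustration of the divide-and-conquer idea, the sketch below tears a tridiagonal system at a single separator unknown: the two interior blocks can then be factored independently (the coarse-grained step), and a 1-by-1 Schur complement couples them. This two-block, tridiagonal setting and the names `thomas` and `dc_tridiag_solve` are simplifying assumptions for exposition, not the actual scheme of [51], which handles general bandwidths and many processors.

```python
import numpy as np

def thomas(l, d, u, b):
    """Solve a tridiagonal system without pivoting.
    l = subdiagonal, u = superdiagonal (length n-1), d = diagonal, b = rhs."""
    n = len(d)
    d = d.astype(float).copy()
    b = b.astype(float).copy()
    for i in range(1, n):                    # forward elimination
        w = l[i - 1] / d[i - 1]
        d[i] -= w * u[i - 1]
        b[i] -= w * b[i - 1]
    x = np.empty(n)
    x[-1] = b[-1] / d[-1]                    # back substitution
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - u[i] * x[i + 1]) / d[i]
    return x

def dc_tridiag_solve(l, d, u, b, s):
    """Divide-and-conquer with one separator unknown x[s] (1 <= s <= n-2).
    Blocks 0..s-1 and s+1..n-1 are independent and could be solved on
    different processors; x[s] follows from a scalar Schur complement."""
    # Each block solves two systems: one with its part of b, one with
    # its coupling column toward the separator.
    y1 = thomas(l[:s - 1], d[:s], u[:s - 1], b[:s])
    e1 = np.zeros(s); e1[-1] = u[s - 1]          # A[s-1, s] coupling
    g1 = thomas(l[:s - 1], d[:s], u[:s - 1], e1)
    y2 = thomas(l[s + 1:], d[s + 1:], u[s + 1:], b[s + 1:])
    e2 = np.zeros(len(d) - s - 1); e2[0] = l[s]  # A[s+1, s] coupling
    g2 = thomas(l[s + 1:], d[s + 1:], u[s + 1:], e2)
    # Scalar Schur complement for the separator equation (row s).
    schur = d[s] - l[s - 1] * g1[-1] - u[s] * g2[0]
    xs = (b[s] - l[s - 1] * y1[-1] - u[s] * y2[0]) / schur
    return np.concatenate([y1 - xs * g1, [xs], y2 - xs * g2])
```

The two calls per block are embarrassingly parallel; only the scalar Schur step needs communication, which is the source of the favorable speedup noted above.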

There are other variants of the divide-and-conquer approach that move the fill-in into other columns of the subblocks, or that are numerically more stable. For example, in [53] the matrix is split into a block diagonal matrix and a remainder by means of rank-1 updates.
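A hedged sketch of the rank-1 tearing idea, in the spirit of (but not identical to) the splitting in [53]: one rank-1 update removes the coupling entries of a tridiagonal matrix, leaving a block diagonal matrix D whose two blocks can be solved independently; the Sherman-Morrison formula then recovers the solution of the original system. The tridiagonal restriction and the function names are illustrative assumptions.

```python
import numpy as np

def thomas(l, d, u, b):
    """Tridiagonal solve without pivoting (subdiag l, diag d, superdiag u)."""
    n = len(d)
    d = d.astype(float).copy()
    b = b.astype(float).copy()
    for i in range(1, n):
        w = l[i - 1] / d[i - 1]
        d[i] -= w * u[i - 1]
        b[i] -= w * b[i - 1]
    x = np.empty(n)
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - u[i] * x[i + 1]) / d[i]
    return x

def tear_solve(l, d, u, b, p):
    """Write A = D + w v^T, where w = u[p]*e_p + l[p]*e_{p+1}, v = e_p + e_{p+1}.
    D is then block diagonal (blocks 0..p and p+1..n-1), and
    A^{-1} b = D^{-1} b - D^{-1} w * (v^T D^{-1} b) / (1 + v^T D^{-1} w)."""
    n = len(d)
    dd = d.astype(float).copy()
    dd[p] -= u[p]            # subtracting w v^T zeroes A[p, p+1], A[p+1, p]
    dd[p + 1] -= l[p]        # and perturbs the two adjacent diagonal entries
    w = np.zeros(n); w[p] = u[p]; w[p + 1] = l[p]

    def solve_D(rhs):        # the two blocks are independent -> parallel
        x1 = thomas(l[:p], dd[:p + 1], u[:p], rhs[:p + 1])
        x2 = thomas(l[p + 1:], dd[p + 1:], u[p + 1:], rhs[p + 1:])
        return np.concatenate([x1, x2])

    y = solve_D(b)                       # D^{-1} b
    z = solve_D(w)                       # D^{-1} w
    denom = 1.0 + z[p] + z[p + 1]        # 1 + v^T D^{-1} w
    return y - ((y[p] + y[p + 1]) / denom) * z
```

Note that the update can make D ill-conditioned even when A is well-conditioned, which is why the stability-oriented variants cited above choose the splitting with care.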