For parallel computers, the parallelism in eliminating these subdiagonal blocks is relatively fine-grained compared with the coarse-grained parallelism in the first step of the algorithm. Furthermore, on distributed memory machines the data for each subdiagonal block have to be spread over all processors. In [50] it is shown that this limits the performance of the algorithm: the speedup is bounded by the ratio of computational speed to communication speed, and this ratio is often very low [50].

The other approach is first to eliminate successively the last nonzero elements in the subdiagonal blocks. This can be done with a short recurrence of length **k**, after which all fill-in can be eliminated in parallel. The recurrence requires some data communication between processors.
However, when the order of the system is large enough with respect to **k**, one can attain speedups close to **k** for this algorithm on a **k**-processor system
[51].
For a generalization of the divide-and-conquer approach for banded systems,
see [52]; the data transport aspects for distributed memory
machines
have been discussed in [50].
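The structure of this second approach can be illustrated with a small dense sketch. This is not the exact algorithm of [51], but a minimal Schur-complement view of the same divide-and-conquer pattern: each of the **k** interior blocks is eliminated independently (one block per processor), a small coupled interface system of order **k**-1 plays the role of the short sequential recurrence, and the remaining work is again fully parallel. The function name and the use of dense NumPy solves are illustrative assumptions.

```python
import numpy as np

def dc_banded_solve(T, f, k):
    """Divide-and-conquer sketch for a tridiagonal system T x = f.

    The unknowns are split into k interior blocks, separated by single
    interface unknowns (one separator index between consecutive blocks;
    a width-one separator suffices because the matrix is tridiagonal).
    """
    n = len(f)
    # Separator indices between the k interior blocks.
    S = [(j + 1) * n // k for j in range(k - 1)]
    interior_blocks, lo = [], 0
    for s in S + [n]:
        interior_blocks.append(list(range(lo, s)))
        lo = s + 1
    I = [i for blk in interior_blocks for i in blk]

    # Phase 1 (parallel): each block solves with its own diagonal block,
    # for the right-hand side and for the coupling columns A_IS.
    rhsI = np.column_stack([f[I], T[np.ix_(I, S)]])
    yI = np.empty_like(rhsI)
    pos = 0
    for blk in interior_blocks:
        m = len(blk)
        yI[pos:pos + m] = np.linalg.solve(T[np.ix_(blk, blk)], rhsI[pos:pos + m])
        pos += m

    # Phase 2 (sequential, small): the (k-1)-by-(k-1) interface system,
    # the analogue of the short recurrence.
    A_SI = T[np.ix_(S, I)]
    schur = T[np.ix_(S, S)] - A_SI @ yI[:, 1:]
    xS = np.linalg.solve(schur, f[S] - A_SI @ yI[:, 0])

    # Phase 3 (parallel): back-substitution for the interior unknowns.
    x = np.zeros(n)
    x[S] = xS
    x[I] = yI[:, 0] - yI[:, 1:] @ xS
    return x
```

In this sketch the only coupled work is the small interface system, which is why the attainable speedup grows with the per-block work relative to it.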

There are other variants of the divide-and-conquer approach that move the fill-in into other columns of the subblocks or are numerically more stable. For example, in [53] the matrix is split into a block diagonal matrix and a remainder via rank-1 updates.
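A minimal sketch of the rank-1 splitting idea, for a tridiagonal matrix with a single interface (the function name and this single-interface setting are illustrative assumptions, not the exact construction of [53]): the two coupling elements at the interface are absorbed into one rank-1 term, so the remaining matrix is block diagonal, and the system is solved with the Sherman–Morrison formula. The two solves with the block diagonal part are independent and could run on separate processors.

```python
import numpy as np

def solve_tridiag_rank1_split(T, f, m):
    """Solve tridiagonal T x = f via the splitting T = D + u v^T,
    where D is block diagonal (blocks 0..m-1 and m..n-1) and u v^T
    carries the two coupling elements at the interface."""
    n = T.shape[0]
    b = T[m, m - 1]      # subdiagonal coupling element
    c = T[m - 1, m]      # superdiagonal coupling element
    # u v^T places the 2x2 interface correction [[c, c], [b, b]];
    # subtracting it zeros the off-diagonal couplings of D.
    u = np.zeros(n); u[m - 1] = c; u[m] = b
    v = np.zeros(n); v[m - 1] = 1.0; v[m] = 1.0
    D = T - np.outer(u, v)          # block diagonal by construction
    # Independent solves with the two diagonal blocks (parallel step).
    y = np.zeros(n); w = np.zeros(n)
    for s in (slice(0, m), slice(m, n)):
        y[s] = np.linalg.solve(D[s, s], f[s])   # D^{-1} f
        w[s] = np.linalg.solve(D[s, s], u[s])   # D^{-1} u
    # Sherman-Morrison combination (the only coupled step):
    # x = D^{-1} f - D^{-1} u (v^T D^{-1} f) / (1 + v^T D^{-1} u)
    alpha = (v @ y) / (1.0 + v @ w)
    return y - alpha * w
```

With several interfaces one would use one rank-1 term per interface and the Sherman–Morrison–Woodbury formula instead; the diagonal blocks remain decoupled in the same way.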