For parallel computers, the parallelism in eliminating these subdiagonal blocks is relatively fine-grained compared with the more coarse-grained parallelism in the first step of the algorithm. Furthermore, on distributed memory machines the data for each subdiagonal block has to be spread over all processors. In  it has been shown that this limits the performance of the algorithm, the speedup being bounded by the ratio of computational speed and communication speed. This ratio is often very low .
The other approach is first to eliminate successively the last nonzero elements in the subdiagonal blocks . This can be done with a short recurrence of length , after which all fill-in can be eliminated in parallel. For the recurrence we need some data communication between processors. However, for k large enough with respect to , one can attain speedups close to for this algorithm on a k processor system . For a generalization of the divide-and-conquer approach for banded systems, see ; the data transport aspects for distributed memory machines have been discussed in .
There are other variants of the divide-and-conquer approach that move the fill-in into other columns of the subblocks or are more stable numerically. For example, in  the matrix is split into a block diagonal matrix and a remainder via rank-1 updates.