As described earlier, data layout influences the algorithm.
We show the algorithm for
a block scatter mapping in both dimensions, and then discuss how other
layouts may be handled.
The algorithm is essentially the same as Algorithm 6.4
with interprocessor communication inserted as
necessary. The block size equals
, which determines the
layout in the horizontal direction.
Communication is required in Algorithm 6.3 to find the pivot entry at each step and swap rows if necessary; then each processor can perform the scaling and rank-1 updates independently. The pivot search is a reduction operation, meaning that values from all processors must be reduced to a single value, a pointer to the row containing the largest pivot. After the block column is fully factorized, the pivot information must be broadcast so other processors can permute their own data, as well as permute among different processors.
In Algorithm 6.4,
the L matrix
stored on the diagonal must be spread rightward to other processors
in the same
row, so they can compute their entries of U. Finally,
the processors holding the rest of L below
the diagonal must spread their submatrices to the right,
and the processors holding
the new entries of U just computed must spread their submatrices
downward, before the
final rank-
update in the last line of Algorithm 6.4
can take place.