As described earlier, data layout influences the algorithm. We show the algorithm for a block scatter mapping in both dimensions, and then discuss how other layouts may be handled. The algorithm is essentially the same as Algorithm 6.4 with interprocessor communication inserted as necessary. The block size equals , which determines the layout in the horizontal direction.
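To make the block scatter mapping concrete, the following minimal sketch maps a global row or column index to the processor that owns it and to its position in that processor's local storage. The names `block_scatter_owner`, `owner_2d`, `b`, and `p` are hypothetical and chosen only for illustration; the sketch assumes blocks are dealt out round-robin starting from processor 0.

```python
def block_scatter_owner(g, b, p):
    """Map global index g to (processor, local index).

    Assumes block size b and p processors along this dimension,
    with blocks dealt out round-robin starting at processor 0.
    """
    block = g // b                      # which block the index falls in
    proc = block % p                    # round-robin assignment of blocks
    local = (block // p) * b + g % b    # offset inside the owner's storage
    return proc, local


def owner_2d(i, j, b_row, b_col, p_row, p_col):
    """Processor grid coordinates owning matrix entry (i, j)."""
    return (block_scatter_owner(i, b_row, p_row)[0],
            block_scatter_owner(j, b_col, p_col)[0])
```

For example, with block size 2 on 3 processors, global index 5 lies in block 2 and is therefore held by processor 2 at local position 1.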

Communication is required in Algorithm 6.3 to find the pivot entry at each step and to swap rows if necessary; each processor can then perform the scaling and rank-1 updates independently. The pivot search is a *reduction* operation: the candidate values from all processors must be reduced to a single value, a pointer to the row containing the largest pivot. After the block column is fully factorized, the pivot information must be *broadcast* so that the other processors can apply the same row interchanges to their own data, both locally and by exchanging rows with other processors.
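The pivot search can be sketched as a serial simulation of the reduction, with no real message passing. Each simulated processor holds some rows of the current column and proposes its largest-magnitude entry; the candidates are then combined into a single pivot row. The function names, the `(rows, values)` data layout, and the tie-breaking rule (lowest row index wins) are assumptions made for this illustration, not part of the algorithms in the text.

```python
def local_pivot_candidate(rows, values):
    """Return (magnitude, global row) of this processor's best pivot."""
    row, value = max(zip(rows, values), key=lambda rv: abs(rv[1]))
    return abs(value), row


def reduce_pivot(candidates):
    """Reduce per-processor candidates to one global pivot row.

    Ties in magnitude are broken in favor of the lowest row index
    (an assumption of this sketch).
    """
    return max(candidates, key=lambda c: (c[0], -c[1]))[1]


def find_pivot(column_pieces):
    """column_pieces: one (rows, values) pair per simulated processor."""
    candidates = [local_pivot_candidate(rows, values)
                  for rows, values in column_pieces if rows]
    pivot_row = reduce_pivot(candidates)
    # Stand-in for making the result known everywhere:
    # every processor ends up with the same row index.
    return [pivot_row] * len(column_pieces)
```

In a real implementation the list comprehension and `reduce_pivot` would be replaced by a collective operation over the processors holding the column, but the data flow is the same.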

In Algorithm 6.4, the **L** matrix stored on the diagonal must be *spread* rightward to the other processors in the same row, so they can compute their entries of **U**. Finally, the processors holding the rest of **L** below the diagonal must *spread* their submatrices to the right, and the processors holding the new entries of **U** just computed must *spread* their submatrices downward, before the final rank- update in the last line of Algorithm 6.4 can take place.
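The communication pattern of these spreads can be checked with a small sketch that tracks only which processors hold which blocks, not the numerical data. It assumes, for illustration, that block (i, j) of the matrix is owned by processor (i mod `p_row`, j mod `p_col`) on a `p_row` by `p_col` grid; all function and parameter names are hypothetical.

```python
def owner(i, j, p_row, p_col):
    """Grid coordinates of the processor owning block (i, j)."""
    return (i % p_row, j % p_col)


def spread_receivers(k, n_blocks, p_row, p_col):
    """Processors holding each L and U block of step k after the spreads."""
    has_l = {}  # block row i -> processors holding L(i, k); i == k is the diagonal block
    has_u = {}  # block column j -> processors holding U(k, j)
    for i in range(k, n_blocks):
        r = i % p_row                     # spread rightward along processor row r
        has_l[i] = {(r, c) for c in range(p_col)}
    for j in range(k + 1, n_blocks):
        c = j % p_col                     # spread downward along processor column c
        has_u[j] = {(r, c) for r in range(p_row)}
    return has_l, has_u


def update_is_local(k, n_blocks, p_row, p_col):
    """True if, after the spreads, every block operation needs no further communication."""
    has_l, has_u = spread_receivers(k, n_blocks, p_row, p_col)
    trailing = range(k + 1, n_blocks)
    # Owners of U(k, j) need the diagonal L block to compute their entries of U.
    u_ok = all(owner(k, j, p_row, p_col) in has_l[k] for j in trailing)
    # Owners of trailing block (i, j) need both L(i, k) and U(k, j).
    a_ok = all(owner(i, j, p_row, p_col) in has_l[i]
               and owner(i, j, p_row, p_col) in has_u[j]
               for i in trailing for j in trailing)
    return u_ok and a_ok
```

Under these assumptions, `update_is_local` returns `True` for any step: once the three spreads have completed, each processor already holds the pieces of **L** and **U** it needs to update its own blocks.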