As described earlier, data layout influences the algorithm. We show the algorithm for a block scatter mapping in both dimensions, and then discuss how other layouts may be handled. The algorithm is essentially the same as Algorithm 6.4 with interprocessor communication inserted as necessary. The block size equals , which determines the layout in the horizontal direction.
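To make the layout concrete, the following minimal sketch computes which processor owns a given matrix entry. It assumes the block scatter mapping deals blocks out cyclically over a two-dimensional processor grid, which is the usual meaning of the term. The function name `owner` and all of its parameters are hypothetical and zero-based; they are not taken from the text.

```python
def owner(i, j, block_rows, block_cols, proc_rows, proc_cols):
    """Return the (grid row, grid column) of the processor owning entry (i, j).

    Assumption: blocks of size block_rows x block_cols are assigned
    cyclically to a proc_rows x proc_cols processor grid.
    """
    grid_row = (i // block_rows) % proc_rows   # block row index, wrapped around the grid
    grid_col = (j // block_cols) % proc_cols   # block column index, wrapped around the grid
    return grid_row, grid_col


# Example: 2 x 2 blocks on a 2 x 3 processor grid.
example_owner = owner(2, 4, 2, 2, 2, 3)
```

Under this assumption, all entries of one block column live in a single column of the processor grid, which is why the horizontal block size governs how much of the factorization one processor column can do before communicating.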
Communication is required in Algorithm 6.3 to find the pivot entry at each step and to swap rows if necessary; each processor can then perform the scaling and rank-1 updates independently. The pivot search is a reduction operation: the candidate values held by all processors must be reduced to a single value, a pointer to the row containing the largest pivot. After the block column is fully factorized, the pivot information must be broadcast so that the other processors can apply the same row interchanges, both within their own local data and between different processors.
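The pivot search can be illustrated with a small serial simulation of the reduction. This is only a sketch: the processors are modeled as plain Python lists rather than real message-passing ranks, the names `combine`, `local_candidate`, and `pivot_row` are hypothetical, and each simulated processor is assumed to own at least one row of the current column.

```python
from functools import reduce


def combine(a, b):
    """Reduction operator on (magnitude, global row) pairs.

    Keeps the larger magnitude; on a tie, keeps the lower row index.
    """
    if a[0] > b[0] or (a[0] == b[0] and a[1] < b[1]):
        return a
    return b


def local_candidate(rows, values):
    """Best pivot candidate among the rows one processor owns."""
    return reduce(combine, ((abs(v), r) for r, v in zip(rows, values)))


def pivot_row(per_processor):
    """Reduce the local candidates to a single global pivot row index."""
    candidates = (local_candidate(rows, vals) for rows, vals in per_processor)
    return reduce(combine, candidates)[1]


# Three simulated processors, each holding two entries of the current column.
column_pieces = [
    ([0, 3], [1.0, -2.0]),
    ([1, 4], [0.5, 7.0]),
    ([2, 5], [-7.0, 3.0]),
]
chosen = pivot_row(column_pieces)
```

Because `combine` is associative, the same operator could be applied in a tree across processors; only the resulting row index then needs to be broadcast.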
In Algorithm 6.4, the L matrix stored on the diagonal must be spread rightward to other processors in the same row, so they can compute their entries of U. Finally, the processors holding the rest of L below the diagonal must spread their submatrices to the right, and the processors holding the new entries of U just computed must spread their submatrices downward, before the final rank- update in the last line of Algorithm 6.4 can take place.
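The three spreading steps described above can be summarized as a communication plan on the processor grid. The sketch below is illustrative only: it assumes a cyclic two-dimensional grid in which "rightward" means "to every other processor in the same grid row" and "downward" means "to every other processor in the same grid column". The function `broadcast_plan` and its dictionary keys are hypothetical names, not part of Algorithm 6.4.

```python
def broadcast_plan(diag_owner, proc_rows, proc_cols):
    """List senders and receivers for one block step.

    diag_owner is the (grid row, grid column) of the processor that
    holds the current diagonal block.
    """
    pr, pc = diag_owner

    def row_peers(r):
        # Every other processor in grid row r.
        return [(r, c) for c in range(proc_cols) if c != pc]

    def col_peers(c):
        # Every other processor in grid column c.
        return [(r, c) for r in range(proc_rows) if r != pr]

    return {
        # Step 1: the diagonal block's L goes along the owner's grid row.
        "diag_L_to_row": row_peers(pr),
        # Step 2: each holder of L below the diagonal sends along its grid row.
        "L_panel_rightward": {(r, pc): row_peers(r) for r in range(proc_rows)},
        # Step 3: each holder of new U entries sends down its grid column.
        "U_panel_downward": {(pr, c): col_peers(c) for c in range(proc_cols)},
    }


# Diagonal block held by processor (1, 0) on a 2 x 3 grid.
plan = broadcast_plan((1, 0), 2, 3)
```

After steps 2 and 3, every processor holds both the rows of L and the columns of U that meet in its own blocks, so it can apply the final update to its local data without further communication.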