By being a little more flexible about the algorithms we implement, we can mitigate the apparent tradeoff between load balance and applicability of BLAS. For example, the layout of A in Figure 3 is identical to the layout in Figure 2 of $PAP^T$, where P is a permutation matrix.
This shows that running Cannon's algorithm from the last section to multiply A times B in scatter layout is the same as multiplying $PAP^T$ and $PBP^T$ to get $(PAP^T)(PBP^T) = PABP^T$ (since $P^TP = I$), which is the desired product: it is AB stored in the same scatter layout. Indeed, as long as (1) A and B are both distributed over a square array of processors; (2) the permutations of the columns of A and rows of B are identical; and (3) for all i the number of columns of A stored by processor column i is the same as the number of rows of B stored by processor row i, the algorithms of the previous section will correctly multiply A and B. The distribution of the product will be determined by the distribution of the rows of A and columns of B. We will see a similar phenomenon for other distributed memory algorithms later.
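The permutation argument can be checked numerically. The following is a minimal sequential sketch in plain Python, not an implementation of Cannon's algorithm: it only verifies that when the columns of A and the rows of B are reordered by the same permutation, the product of the reordered matrices equals AB with its rows ordered like the rows of A and its columns ordered like the columns of B. The helper names (`matmul`, `permute`, `scatter_perm`, `check_permuted_product`) and the choice of a cyclic ordering to stand in for the scatter layout are assumptions made for illustration.

```python
import random


def matmul(X, Y):
    """Plain triple-loop product of dense matrices stored as lists of rows."""
    inner = len(Y)
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def permute(X, row_perm, col_perm):
    """Return the matrix whose (i, j) entry is X[row_perm[i]][col_perm[j]]."""
    return [[X[r][c] for c in col_perm] for r in row_perm]


def scatter_perm(n, p):
    """Hypothetical helper: list indices 0..n-1 grouped by owner when index i
    is assigned to processor i mod p (a cyclic, scatter-style assignment)."""
    return [i for owner in range(p) for i in range(owner, n, p)]


def check_permuted_product(n=6, p=3, seed=0):
    """Check that a shared inner permutation leaves the product correct."""
    rng = random.Random(seed)
    A = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
    B = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]

    rows_of_A = scatter_perm(n, p)   # decides the row layout of the product
    cols_of_B = list(range(n))       # decides the column layout of the product
    rng.shuffle(cols_of_B)
    shared = list(range(n))          # columns of A and rows of B: same permutation
    rng.shuffle(shared)

    permuted_product = matmul(permute(A, rows_of_A, shared),
                              permute(B, shared, cols_of_B))
    expected = permute(matmul(A, B), rows_of_A, cols_of_B)
    return permuted_product == expected


if __name__ == "__main__":
    print(check_permuted_product())
```

Integer entries are used so that the comparison is exact. Choosing `rows_of_A`, `shared`, and `cols_of_B` all equal to the same permutation recovers the symmetric case $(PAP^T)(PBP^T) = PABP^T$ discussed above.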