By being a little more flexible about the algorithms we implement,
we can mitigate the
apparent tradeoff between load balance and applicability of BLAS.
For example, the layout of **A** in
Figure 3
is identical to the layout in
Figure 2
of $PAP^T$, where **P** is a permutation matrix.

This shows that running Cannon's algorithm from the last section
to multiply **A** times **B** in scatter
layout is the same as multiplying
$PAP^T$ and $PBP^T$ to get $(PAP^T)(PBP^T) = PABP^T$ (since $P^TP = I$), which is the desired product.
Indeed, as long as
(1) **A** and **B** are both distributed over a square array of
processors;
(2) the permutations of the columns of **A** and rows of **B** are
identical; and
(3) for all **i** the number of columns of **A** stored by processor column **i**
is the same
as the number of rows of **B** stored by processor row **i**, the algorithms
of the previous section will
correctly multiply **A** and **B**. The distribution of the product
will be determined by the
distribution of the rows of **A** and columns of **B**.
We will see a similar phenomenon for other distributed memory
algorithms later.
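
The role of condition (2) can be checked with a small serial sketch. It is not an implementation of Cannon's algorithm: it only verifies the underlying matrix identity on small random integer matrices. The helper names (`scatter_order`, `permute`, `matmul`, `check`) and the sizes `n` and `s` are illustrative assumptions, and the scatter ordering used here simply groups indices with equal remainder modulo `s`.

```python
import random


def scatter_order(n, s):
    """Indices 0..n-1 regrouped so that equal values of (i mod s) are contiguous."""
    return [i for p in range(s) for i in range(p, n, s)]


def permute(M, row_order, col_order):
    """Return the matrix whose (i, j) entry is M[row_order[i]][col_order[j]]."""
    return [[M[r][c] for c in col_order] for r in row_order]


def matmul(X, Y):
    """Plain serial matrix product of two lists of lists."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(X))]


def check(n=6, s=3, seed=0):
    rng = random.Random(seed)
    A = [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
    B = [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
    AB = matmul(A, B)

    # Case 1: one permutation everywhere (the scatter-layout example).
    p = scatter_order(n, s)
    same_perm_ok = matmul(permute(A, p, p), permute(B, p, p)) == permute(AB, p, p)

    # Case 2: rows of A and columns of B permuted independently, while the
    # columns of A and the rows of B share one permutation (condition (2)).
    rows_a = list(range(n)); rng.shuffle(rows_a)
    shared = list(range(n)); rng.shuffle(shared)
    cols_b = list(range(n)); rng.shuffle(cols_b)
    general_ok = (matmul(permute(A, rows_a, shared), permute(B, shared, cols_b))
                  == permute(AB, rows_a, cols_b))
    return same_perm_ok, general_ok


print(check())
```

In the second case the product comes out with its rows ordered like the rows of **A** and its columns ordered like the columns of **B**, which matches the statement about the distribution of the product.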