By being a little more flexible about the algorithms we implement,
we can mitigate the
apparent tradeoff between load balance and applicability of BLAS.
For example, the layout of A in
Figure 3
is identical to the layout in
Figure 2
of $PAP^T$, where P is a permutation matrix.
This shows that running Cannon's algorithm from the last section
to multiply A times B in scatter
layout is the same as multiplying
$PAP^T$ and $PBP^T$
to get
$PABP^T$, which is the desired product.
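The permutation argument can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a scatter layout in which index i belongs to processor i mod s, and it assumes the convention that P lists the indices owner by owner. The names scatter_permutation, n, and s are hypothetical.

```python
import numpy as np

def scatter_permutation(n, s):
    """Index order that groups indices by owner under a scatter (cyclic) layout.

    Assumption: index i is owned by processor i % s. Listing owner 0's
    indices first, then owner 1's, and so on turns the cyclic layout
    into a block layout.
    """
    return np.concatenate([np.arange(p, n, s) for p in range(s)])

n, s = 8, 2                          # hypothetical sizes: 8x8 matrices, 2x2 grid
perm = scatter_permutation(n, s)     # owner-by-owner index order
P = np.eye(n, dtype=int)[perm]       # permutation matrix: (P @ x)[k] == x[perm[k]]

rng = np.random.default_rng(0)
A = rng.integers(-5, 5, size=(n, n))  # integer entries keep the check exact
B = rng.integers(-5, 5, size=(n, n))

A_perm = P @ A @ P.T                 # A in scatter layout, viewed as a block layout
B_perm = P @ B @ P.T
C_perm = A_perm @ B_perm             # what a block-layout algorithm would compute

# The leading block of the permuted matrix is exactly what processor (0, 0)
# stores in the scatter layout: every s-th row and column of A.
block_matches = np.array_equal(A_perm[: n // s, : n // s], A[0::s, 0::s])

# The inner factors P.T @ P cancel, so the result is A @ B permuted the same way.
product_matches = np.array_equal(C_perm, P @ (A @ B) @ P.T)
```

Both block_matches and product_matches should be True: multiplying the permuted matrices yields the product stored in the same scatter layout.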
Indeed, as long as
(1) A and B are both distributed over a square array of
processors;
(2) the permutations of the columns of A and rows of B are
identical; and
(3) for all i the number of columns of A stored by processor column i
is the same
as the number of rows of B stored by processor row i, the algorithms
of the previous section will
correctly multiply A and B. The distribution of the product
will be determined by the
distribution of the rows of A and columns of B.
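Conditions (1) to (3) can be illustrated with a small serial simulation. This is a sketch under stated assumptions rather than a parallel implementation: the helpers split and cannon are hypothetical, the index groupings are arbitrary examples, and the skew and shift steps of Cannon's algorithm are modeled by index arithmetic instead of communication.

```python
import numpy as np

def split(M, row_groups, col_groups):
    """blocks[i][j] holds rows row_groups[i] and columns col_groups[j] of M."""
    return [[M[np.ix_(r, c)] for c in col_groups] for r in row_groups]

def cannon(A_blocks, B_blocks):
    """Serially simulate Cannon's algorithm on an s-by-s processor grid.

    After the initial skew and t unit shifts, processor (i, j) holds
    A block (i, k) and B block (k, j) with k = (i + j + t) % s.
    """
    s = len(A_blocks)
    C_blocks = [[None] * s for _ in range(s)]
    for i in range(s):
        for j in range(s):
            acc = 0
            for t in range(s):
                k = (i + j + t) % s
                acc = acc + A_blocks[i][k] @ B_blocks[k][j]
            C_blocks[i][j] = acc
    return C_blocks

n, s = 6, 2
rng = np.random.default_rng(1)
A = rng.integers(-3, 4, size=(n, n))
B = rng.integers(-3, 4, size=(n, n))

# Hypothetical, deliberately irregular assignments of indices to processors.
a_rows = [np.array([0, 3, 4]), np.array([1, 2, 5])]      # rows of A per processor row
inner = [np.array([5, 0]), np.array([1, 2, 3, 4])]       # columns of A and rows of B:
                                                         # same permutation, sizes 2 and 4
b_cols = [np.array([2]), np.array([0, 1, 3, 4, 5])]      # columns of B per processor column

C_blocks = cannon(split(A, a_rows, inner), split(B, inner, b_cols))

# Each result block is the part of A @ B selected by A's row grouping
# and B's column grouping.
C = A @ B
layout_ok = all(
    np.array_equal(C_blocks[i][j], C[np.ix_(a_rows[i], b_cols[j])])
    for i in range(s)
    for j in range(s)
)
```

Here the inner grouping is shared by the columns of A and the rows of B, so every local block product is well defined even though the groups have different sizes, and layout_ok confirms that the product inherits the row distribution of A and the column distribution of B.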
We will see a similar phenomenon for other distributed memory
algorithms later.