Let us examine in detail how to implement matrix multiplication to minimize the
number of memory moves.
Suppose we have two levels of memory hierarchy, fast and slow,
where the slow memory is large enough to contain the matrices
A, B and C, but the fast memory contains only M words where
. Further assume the data are reused optimally (which may
be optimistic if the decisions are made automatically by hardware).
The simplest algorithm one might try consists of three nested loops:
We count the number of references to the slow memory as follows:
for reading B n times,
for reading A one row at
a time and keeping it in fast memory until it is no longer needed, and
for reading one entry of C at a time, keeping it in fast
memory until it is completely computed. This comes to
for a
q of about 2, which is no better than the Level 2 BLAS and far from
the maximum
possible
. If
, so that we cannot keep a full row of A in
fast memory, q further decreases to 1, since the algorithm reduces
to a sequence of inner products, which are Level 1 BLAS. For every
permutation of the three loops on i, j and k, one gets another
algorithm with q about the same.