Let us examine in detail how to implement matrix multiplication to minimize the number of memory moves. Suppose we have two levels of memory hierarchy, fast and slow, where the slow memory is large enough to contain the matrices $A$, $B$, and $C$, but the fast memory contains only $M$ words, where $M \ll n^2$. Further assume the data are reused optimally (which may be optimistic if the decisions are made automatically by hardware).

The simplest algorithm one might try consists of three nested loops:

    for i = 1 to n
        for j = 1 to n
            for k = 1 to n
                C(i,j) = C(i,j) + A(i,k) * B(k,j)

We count the number of references to the slow memory as follows: $n^3$ for reading $B$ $n$ times, $n^2$ for reading $A$ one row at a time and keeping it in fast memory until it is no longer needed, and $2n^2$ for reading and writing one entry of $C$ at a time, keeping it in fast memory until it is completely computed. This comes to $n^3 + 3n^2$ memory references, for a $q$ of about 2, which is no better than the Level 2 BLAS and far from the maximum possible $q = 2n^3/(4n^2) = n/2$. If $M < n$, so that we cannot keep a full row of $A$ in fast memory, $q$ further decreases to 1, since the algorithm reduces to a sequence of inner products, which are Level 1 BLAS. For every permutation of the three loops on $i$, $j$, and $k$, one gets another algorithm with $q$ about the same.
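As a quick numerical check of this count, the ratio of floating-point operations to slow-memory references can be evaluated directly (the function name `q_naive` is ours, for illustration only):

```python
def q_naive(n):
    """Flops-to-memory-reference ratio for the three-nested-loop
    algorithm, using the counts derived above: 2n^3 flops against
    n^3 (reading B n times) + n^2 (A by rows) + 2n^2 (C entries)."""
    flops = 2 * n**3
    refs = n**3 + n**2 + 2 * n**2
    return flops / refs

# The ratio approaches 2 from below as n grows.
for n in (10, 100, 1000):
    print(n, q_naive(n))
```

For $n = 1000$ this gives $q \approx 1.994$, confirming that no matrix size rescues the naive loop ordering.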

The next possibility is dividing $B$ and $C$ into *column blocks*,
and computing $C$ block by block. Recall that we use the notation
$B(i:j, k:l)$ to mean the submatrix of $B$ in rows $i$ through $j$ and columns
$k$ through $l$. We partition

$$B = [B^{(1)}, B^{(2)}, \ldots, B^{(N)}],$$

where each

$$B^{(j)} = B(1:n, \, (j-1)\tfrac{n}{N}+1 : j\tfrac{n}{N})$$

is $n \times n/N$,
and similarly for $C$. Our column block algorithm is then

    for j = 1 to N
        C^(j) = C^(j) + A * B^(j)

Assuming $M \geq 2n^2/N + n$, so that fast memory can accommodate $B^{(j)}$, $C^{(j)}$, and one column of $A$ simultaneously, our memory reference count is as follows: $2n^2$ for reading and writing each block of $C$ once, $n^2$ for reading each block of $B$ once, and $Nn^2$ for reading $A$ $N$ times. This comes to $(N+3)n^2$ memory references and yields $q \approx 2n/N$. Since the assumption on fast memory forces $N \gtrsim 2n^2/M$, we get $q \approx 2n/N \lesssim M/n$, so that $M$ needs to grow with $n$ to keep $q$ large.
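Under the same counting model, a short sketch (the function name `q_column_block` is ours) shows how $q$ for the column block algorithm falls off as the number of blocks $N$ grows, i.e. as fast memory shrinks relative to $n^2$:

```python
def q_column_block(n, N):
    """Flops-to-memory-reference ratio for the column block algorithm:
    2n^3 flops against 2n^2 (C read and written once) + n^2 (B once)
    + N*n^2 (A read N times)."""
    flops = 2 * n**3
    refs = 2 * n**2 + n**2 + N * n**2
    return flops / refs

# With n fixed, q decays like 2n/N as N grows.
for N in (4, 16, 64):
    print(N, q_column_block(1024, N))
```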

Finally, we consider *rectangular blocking*, where $C$ is broken into
an $N \times N$ block matrix with $n/N \times n/N$ blocks $C^{(i,j)}$,
and $A$ and $B$ are similarly partitioned. The algorithm becomes

    for i = 1 to N
        for j = 1 to N
            for k = 1 to N
                C^(i,j) = C^(i,j) + A^(i,k) * B^(k,j)

Assuming $M \geq 3(n/N)^2$, so that one block each from $A$, $B$, and $C$ fit in fast memory simultaneously, our memory reference count is as follows: $2n^2$ for reading and writing each block of $C$ once, $Nn^2$ for reading $A$ $N$ times, and $Nn^2$ for reading $B$ $N$ times. This comes to $2(N+1)n^2$ memory references and yields $q \approx n/N \approx \sqrt{M/3}$, which is much better than the previous algorithms. In [15] an analysis of this problem leads to an upper bound for $q$ near $\sqrt{M}$, so we cannot expect to improve much on this algorithm for square matrices.
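To make the blocked loop nest concrete, here is a minimal pure-Python sketch of rectangular blocking (the function name is ours; a production code would call a tuned BLAS rather than interpret Python loops). It assumes $N$ divides $n$:

```python
def blocked_matmul(A, B, N):
    """Compute C = A*B by N-by-N rectangular blocking.
    A and B are n-by-n lists of lists, with N dividing n; the inner
    loops over ii, kk, jj perform the block update
    C^(i,j) = C^(i,j) + A^(i,k) * B^(k,j)."""
    n = len(A)
    b = n // N  # block dimension n/N
    C = [[0.0] * n for _ in range(n)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                # Update block C^(i,j) using A^(i,k) and B^(k,j),
                # all of which fit in fast memory when M >= 3*(n/N)^2.
                for ii in range(i * b, (i + 1) * b):
                    for kk in range(k * b, (k + 1) * b):
                        a = A[ii][kk]
                        for jj in range(j * b, (j + 1) * b):
                            C[ii][jj] += a * B[kk][jj]
    return C
```

Each of the three blocks touched in the innermost update occupies $(n/N)^2$ words, matching the assumption $M \geq 3(n/N)^2$ above.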