Let us examine in detail how to implement matrix multiplication to minimize the number of memory moves. Suppose we have two levels of memory hierarchy, fast and slow, where the slow memory is large enough to contain the matrices A, B and C, but the fast memory contains only M words where . Further assume the data are reused optimally (which may be optimistic if the decisions are made automatically by hardware).
The simplest algorithm one might try consists of three nested loops:
We count the number of references to the slow memory as follows: for reading B n times, for reading A one row at a time and keeping it in fast memory until it is no longer needed, and for reading one entry of C at a time, keeping it in fast memory until it is completely computed. This comes to for a q of about 2, which is no better than the Level 2 BLAS and far from the maximum possible . If , so that we cannot keep a full row of A in fast memory, q further decreases to 1, since the algorithm reduces to a sequence of inner products, which are Level 1 BLAS. For every permutation of the three loops on i, j and k, one gets another algorithm with q about the same.