On the other hand, this brief analysis ignores a number of practical issues:

- (1) high-level language constructs do not yet support the block layout of matrices described here;
- (2) if the fast memory consists of vector registers and has vector operations supporting `saxpy` but not inner products, a column-blocked code may be superior;
- (3) a real code will have to deal with nonsquare matrices, for which the optimal block sizes may not themselves be square.
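To make the distinction in item (2) concrete, here is a minimal sketch of a column-oriented multiplication in which every inner step is a `saxpy` (y = y + alpha*x) on a column of A, rather than an inner product. The function name and loop ordering are illustrative, not from the original text.

```python
def matmul_saxpy(A, B, C):
    """Accumulate C = C + A*B for square matrices stored as lists of rows.

    Each innermost loop is a saxpy on column j of C: C[:,j] += alpha * A[:,k].
    This organization favors machines with fast vector saxpy but slow
    inner products, as described in item (2).
    """
    n = len(A)
    for j in range(n):          # for each column of C
        for k in range(n):      # scale columns of A by scalars from B
            alpha = B[k][j]
            for i in range(n):  # the saxpy: C[:,j] += alpha * A[:,k]
                C[i][j] += alpha * A[i][k]
    return C
```

A dot-product (inner-product) organization would instead fix i and j and reduce over k; the two compute the same result but stress different vector operations.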

Another quite different algorithm is Strassen's method [17], which multiplies matrices recursively by dividing them into block matrices, and multiplies the subblocks using 7 matrix multiplications (recursively) and 15 matrix additions of half the size. This leads to an asymptotic complexity of $O(n^{\log_2 7}) \approx O(n^{2.81})$ instead of $O(n^3)$. The value of this algorithm lies not only in its asymptotic complexity but in its reduction of the problem to smaller subproblems that eventually fit in fast memory; once the subproblems fit in fast memory, standard matrix multiplication may be used. This approach has led to speedups on relatively large matrices on some machines [18]. A drawback is the need for significant workspace, and somewhat lower numerical stability, although it is adequate for many purposes [19].
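The recursion just described can be sketched as follows. This is an illustrative implementation, not the code from the cited references: it assumes n-by-n matrices with n a power of two, and the `CUTOFF` constant is a hypothetical stand-in for "the subproblem fits in fast memory," below which it switches to standard multiplication as the text suggests.

```python
CUTOFF = 2  # hypothetical threshold; in practice tuned so subblocks fit in fast memory

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul_standard(A, B):
    # conventional O(n^3) multiplication, used once subproblems are small
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def split(M):
    # partition M into four half-size blocks M11, M12, M21, M22
    n = len(M) // 2
    return ([row[:n] for row in M[:n]], [row[n:] for row in M[:n]],
            [row[:n] for row in M[n:]], [row[n:] for row in M[n:]])

def strassen(A, B):
    n = len(A)
    if n <= CUTOFF:
        return mat_mul_standard(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # the 7 recursive half-size multiplications
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22))
    M2 = strassen(mat_add(A21, A22), B11)
    M3 = strassen(A11, mat_sub(B12, B22))
    M4 = strassen(A22, mat_sub(B21, B11))
    M5 = strassen(mat_add(A11, A12), B22)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12))
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22))
    # recombine into the four blocks of C
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_sub(mat_add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

The recurrence $T(n) = 7\,T(n/2) + O(n^2)$ gives the $O(n^{\log_2 7})$ cost; the temporaries for the M_i terms are the significant workspace mentioned above.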