LINPACK's Cholesky performed so poorly in Table 1 because it was not designed to minimize memory movement on machines such as the Cray YMP; it was designed to minimize a different kind of memory movement, namely page faults between main memory and disk. In contrast, the matrix-matrix multiplication and the other BLAS timed on the Cray YMP in Table 1 were written in assembly language specifically for that machine to minimize data movement.
Since it is expensive and time-consuming to rewrite every routine like Cholesky in assembly language for every new computer, we would like a better approach. Here is the most successful approach discovered so far. Since operations like matrix-matrix multiplication are so common, computer manufacturers have standardized them as the Basic Linear Algebra Subroutines, or BLAS [10,11,12], and optimized them for their machines. In other words, a library of subroutines for matrix-matrix multiplication, matrix-vector multiplication, and other operations is available as a standard Fortran (or C) callable library on most high-performance machines, and underneath, each routine has been optimized for the machine on which it runs. If we can reorganize our algorithms to use these optimized BLAS to perform all or most of their work, then our algorithms will run as fast as the manufacturer's optimized BLAS. This was the approach taken in LAPACK: the algorithms in LINPACK (and in the corresponding library for eigenvalue problems, EISPACK [13,14]) were reorganized to call the BLAS in their innermost loops, where most of the work is done. This led to the speedups shown in Table 1.