LINPACK's Cholesky performed so poorly in Table 1 because it was *not* designed to
minimize memory movement on machines such as the Cray YMP (it was designed to
minimize another kind of memory movement, *page faults* between main memory
and disk). In contrast, the matrix-matrix multiplication
and other BLAS in Table 1
were written in assembly language specifically for the Cray YMP to minimize data movement.

Since it is expensive and time-consuming to write every routine like Cholesky in
assembly language for every new computer, we would like a better approach.
Here is the most successful approach we have discovered so far.
Since operations like matrix-matrix multiplication are so common, computer manufacturers
have standardized them as the *Basic Linear Algebra Subroutines*, or *BLAS*
[10,11,12],
and optimized them for their machines. In other words, subroutines for
matrix-matrix multiplication, matrix-vector multiplication, and related operations
are available through a standard Fortran- (or C-) callable interface on most
high performance machines, with each implementation optimized underneath for its machine.
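To make the contrast concrete, here is a minimal sketch in Python with NumPy (NumPy and the function name `naive_matmul` are illustrative assumptions, not part of the original discussion): the triple loop below performs the same O(n^3) arithmetic as the BLAS routine, but with poor data reuse, while `A @ B` dispatches to the platform's optimized BLAS `dgemm` underneath.

```python
import numpy as np

def naive_matmul(A, B):
    # Textbook triple loop: same flop count as the BLAS routine,
    # but limited by memory traffic rather than arithmetic speed.
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
# A @ B calls the vendor-optimized BLAS; the two results agree.
assert np.allclose(naive_matmul(A, B), A @ B)
```

Both compute the same product; the difference on a machine with a memory hierarchy is purely in how the data moves.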
If we can reorganize our algorithms so that these optimized BLAS perform all or most
of the work, then our algorithms will run about as fast as the BLAS themselves.
This was the approach taken in LAPACK: the algorithms in LINPACK (and the corresponding
library for eigenvalue problems, EISPACK [13,14]) were reorganized to
call the BLAS in their innermost loops, where most of the work is done. This led to the
speedups shown in Table 1.
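A reorganization of this kind can be sketched as follows. This is not LAPACK's actual code; the function name `blocked_cholesky`, the use of NumPy, and the block size are illustrative assumptions. The point is that almost all of the arithmetic lands in the two matrix-matrix operations (a triangular solve and a symmetric rank-b update, both BLAS-3 routines in LAPACK), leaving only the small diagonal blocks for unblocked code.

```python
import numpy as np

def blocked_cholesky(A, b=2):
    """Right-looking blocked Cholesky: returns lower-triangular L with A = L @ L.T."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for j in range(0, n, b):
        jb = min(b, n - j)
        # Factor the small diagonal block (the only non-BLAS-3 work).
        A[j:j+jb, j:j+jb] = np.linalg.cholesky(A[j:j+jb, j:j+jb])
        if j + jb < n:
            L11 = A[j:j+jb, j:j+jb]
            # Triangular solve L21 = A21 * L11^{-T} (a BLAS-3 TRSM in LAPACK).
            A[j+jb:, j:j+jb] = np.linalg.solve(L11, A[j+jb:, j:j+jb].T).T
            L21 = A[j+jb:, j:j+jb]
            # Symmetric rank-jb update A22 -= L21 @ L21.T (BLAS-3 SYRK/GEMM):
            # this matrix-matrix product is where most of the work is done.
            A[j+jb:, j+jb:] -= L21 @ L21.T
    return np.tril(A)
```

With a block size tuned to the cache, the update step runs at matrix-matrix-multiplication speed, which is what produced the speedups in Table 1.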