The innermost loop of Algorithm 6.2
can be performed by a single call to
the Level 1 BLAS operation saxpy; this in done in LINPACK.
To achieve higher performance, we
modify this code first to use the Level 2 and then the Level 3 BLAS in
its innermost loops. Again, 3! versions of these algorithms are possible,
but we just describe the ones used in the LAPACK library [2].
There is obvious parallelism in the innermost loop, since each
can be updated independently.
To make the use of BLAS clear, we use Fortran 90 (or Matlab) notation:
The parallelism is evident:
most work is performed is a single rank-1 update of the
trailing
submatrix
,
where each entry of
can be updated in parallel.
Other permutations of the
nested loops lead to different algorithms, which depend on the BLAS for
matrix-vector multiplication and solving a triangular system instead of
rank-1 updating [30][29];
which is faster depends on the relative speed of these on a given machine.
To convert to the Level 3 BLAS involves column blocking

into
blocks, where
is the block size and
. The optimal choice of
depends on the
memory hierarchy of the machine in question: our approach is to compute
the LU decomposition
of each
subblock of
using Algorithm 6.3
in the fast memory, and then use
Level 3 BLAS to update the rest of the matrix:
Most of the work is performed in the last two lines, solving a triangular
system with
many right-hand sides, and matrix multiplication. Other similar algorithms
may be derived
by conformally partitioning
,
and
, and equating partitions
in
.
Algorithms 6.3 and 6.4 are available as,
respectively, subroutines sgetf2 and
sgetrf in LAPACK [2].
(See exercise 9.)
The LAPACK routine implementing Cholesky is sposv. Other matrix structures handled by LAPACK include