The innermost loop of Algorithm 6.2
can be performed by a single call to
the Level 1 BLAS operation saxpy; this is done in LINPACK.
To achieve higher performance, we
modify this code first to use the Level 2 and then the Level 3 BLAS in
its innermost loops. Again, 3! versions of these algorithms are possible,
but we just describe the ones used in the LAPACK library [2].
There is obvious parallelism in the innermost loop, since each
matrix entry can be updated independently.
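As an illustration of the saxpy formulation, the following is a minimal NumPy sketch, not the LINPACK code itself. It assumes a square matrix with nonzero pivots and omits pivoting for brevity; the function name `lu_saxpy` is hypothetical. The innermost operation is one row update of the form y <- y + alpha*x, which is what a saxpy call computes.

```python
import numpy as np

def lu_saxpy(A):
    """In-place style LU factorization without pivoting (assumption:
    all pivots are nonzero). Returns a matrix holding U in its upper
    triangle and the multipliers of unit lower triangular L below it."""
    A = np.array(A, dtype=float)   # work on a copy
    n = A.shape[0]
    for k in range(n - 1):
        for i in range(k + 1, n):
            A[i, k] /= A[k, k]     # multiplier l_ik
            # saxpy: y <- y + alpha * x, with alpha = -l_ik,
            # x = rest of pivot row k, y = rest of row i.
            # Every entry of y is updated independently of the others.
            A[i, k + 1:] += -A[i, k] * A[k, k + 1:]
    return A
```

The factors can be recovered as `L = np.tril(F, -1) + np.eye(n)` and `U = np.triu(F)` from the returned matrix `F`.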
To make the use of BLAS clear, we use Fortran 90 (or Matlab) notation:
The parallelism is evident:
most of the work is performed in a single rank-1 update of the
trailing submatrix,
where each entry of that submatrix
can be updated in parallel.
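The rank-1 update formulation can be sketched in NumPy as follows. This is an illustrative sketch under the same assumptions as before (square matrix, nonzero pivots, no pivoting), not the LAPACK routine; the name `lu_rank1` is hypothetical. The column scaling corresponds to a Level 1 scaling operation, and the outer-product update corresponds to the Level 2 BLAS rank-1 update (sger in single precision).

```python
import numpy as np

def lu_rank1(A):
    """LU factorization without pivoting (assumption: nonzero pivots),
    organized so that each step is one rank-1 update of the trailing
    submatrix. Returns L (strictly below the diagonal) and U packed
    into one matrix."""
    A = np.array(A, dtype=float)   # work on a copy
    n = A.shape[0]
    for k in range(n - 1):
        # Scale the subdiagonal part of column k to form the multipliers.
        A[k + 1:, k] /= A[k, k]
        # Rank-1 update of the trailing submatrix:
        # A22 <- A22 - (column of multipliers) * (rest of pivot row).
        # All entries of A22 can be updated in parallel.
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return A
```

Compared with a row-by-row formulation, the two inner loops have collapsed into a single matrix operation per step, which is what allows a tuned Level 2 BLAS to be substituted.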
Other permutations of the
nested loops lead to different algorithms, which depend on the BLAS for
matrix-vector multiplication and solving a triangular system instead of
rank-1 updating [29,30];
which variant is faster depends on the relative speed of these BLAS operations on a given machine.
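One such loop permutation can be sketched as a column-by-column ("left-looking") factorization. The sketch below is an assumption about what such a variant looks like, not a transcription of the algorithms in [29,30]; it again omits pivoting, and the name `lu_left_looking` is hypothetical. For simplicity it calls the general solver `np.linalg.solve` as a stand-in for a dedicated triangular solve.

```python
import numpy as np

def lu_left_looking(A):
    """Column-oriented LU without pivoting (assumption: nonzero pivots).
    Column j is finished using only previously computed columns, via one
    triangular solve and one matrix-vector multiplication."""
    A = np.array(A, dtype=float)   # work on a copy
    n = A.shape[0]
    for j in range(n):
        if j > 0:
            # Unit lower triangular block of L computed so far.
            L11 = np.tril(A[:j, :j], -1) + np.eye(j)
            # Triangular solve: U(0:j, j) = L11^{-1} * A(0:j, j).
            # (A general solver stands in for a triangular solve here.)
            A[:j, j] = np.linalg.solve(L11, A[:j, j])
            # Matrix-vector multiply: update the rest of column j.
            A[j:, j] -= A[j:, :j] @ A[:j, j]
        # Scale the subdiagonal entries to obtain the multipliers.
        A[j + 1:, j] /= A[j, j]
    return A
```

This variant produces the same factors as the rank-1 formulation but reads each earlier column many times and writes each column only once, so its relative speed depends on how well the matrix-vector and triangular-solve kernels perform on the target machine.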