The innermost loop of Algorithm 6.2
can be performed by a single call to
the Level 1 BLAS operation ` saxpy`; this in done in LINPACK.
To achieve higher performance, we
modify this code first to use the Level 2 and then the Level 3 BLAS in
its innermost loops. Again, 3! versions of these algorithms are possible,
but we just describe the ones used in the LAPACK library [2].
There is obvious parallelism in the innermost loop, since each
can be updated independently.
To make the use of BLAS clear, we use Fortran 90 (or Matlab) notation:

The parallelism is evident: most work is performed is a single rank-1 update of the trailing submatrix , where each entry of can be updated in parallel. Other permutations of the nested loops lead to different algorithms, which depend on the BLAS for matrix--vector multiplication and solving a triangular system instead of rank-1 updating [29,30]; which is faster depends on the relative speed of these on a given machine.