The innermost loop of Algorithm 6.2 can be performed by a single call to the Level 1 BLAS operation saxpy; this in done in LINPACK. To achieve higher performance, we modify this code first to use the Level 2 and then the Level 3 BLAS in its innermost loops. Again, 3! versions of these algorithms are possible, but we just describe the ones used in the LAPACK library . There is obvious parallelism in the innermost loop, since each can be updated independently. To make the use of BLAS clear, we use Fortran 90 (or Matlab) notation:
The parallelism is evident: most work is performed is a single rank-1 update of the trailing submatrix , where each entry of can be updated in parallel. Other permutations of the nested loops lead to different algorithms, which depend on the BLAS for matrix--vector multiplication and solving a triangular system instead of rank-1 updating [29,30]; which is faster depends on the relative speed of these on a given machine.