When this algorithm terminates, the diagonal and upper triangle of **A** contain **U**,
the part below the diagonal contains the corresponding part of **L** (the diagonal of
**L** is all ones, and so does not need to be stored. The permutation matrix **P**
is determined implicitly by the permutations in the second line of the algorithm.

If **A** is stored by column, as is the case in Fortran (but not C), then since
the inner loop combines rows of **A**, it
accesses memory locations (at least) **n** locations apart. As described in
Chapter CA,

,
this does not respect locality, and
so is likely to be slower than an algorithm which accesses **A** by
column (i.e. accesses consecutive memory locations).
Algorithm 6.1 is
also called **kij**-* LU* decomposition, because of the
nesting order of its
loops. All the rest of the permutations of **i**, **j** and **k** lead to
valid algorithms, some of which access columns of **A** in the innermost loop.
The next algorithm is one of these,
and is used in the LINPACK routine ` sgefa` [1].

- 6.1 Gaussian Elimination on Shared Memory Machines
- 6.2 Gaussian Elimination on Distributed Memory Machines