The fifth row of the table gives the performance of the Cholesky subroutine
`spofa` from the
LINPACK subroutine library [1]. LINPACK is a well-known and widely used
library of Fortran 77 linear algebra routines, but was written before the advent of vector
and parallel computers. Its performance is very poor: over 4.5 times slower than
the maximum machine speed on a single processor, and over 36 times slower than the maximum on
8 processors. This is very disappointing, since at first glance it is hard to
imagine an algorithm better suited to the Cray than Cholesky.
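To see why the algorithm looks so well suited at first glance, here is a minimal sketch of unblocked Cholesky in Python/NumPy. This is illustrative only, not the Fortran 77 `spofa` code: the inner operations are simple dot products and vector scalings, exactly the kind of work a vector machine is built for.

```python
import numpy as np

def cholesky_unblocked(A):
    """Unblocked Cholesky factorization A = L @ L.T (L lower triangular).

    Illustrative sketch of the scalar/vector algorithm; A is assumed
    symmetric positive definite. Not the actual LINPACK spofa routine.
    """
    A = A.copy()
    n = A.shape[0]
    for j in range(n):
        # Subtract contributions of previous columns from the diagonal.
        A[j, j] = np.sqrt(A[j, j] - A[j, :j] @ A[j, :j])
        # Update and scale the column below the diagonal.
        A[j+1:, j] = (A[j+1:, j] - A[j+1:, :j] @ A[j, :j]) / A[j, j]
    return np.tril(A)
```

Each step touches only vectors, so the flops vectorize; the trouble on machines like the Cray, as the rest of this example shows, lies in how the memory traffic per flop is organized, not in the flop count.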

The last two rows of Table 1 give the performance of a
different version of Cholesky, `sposv`, taken from the newer LAPACK library of
linear algebra routines [2]. LAPACK was written to exploit architectures
like the Cray. It is also written in Fortran 77, but includes calls to the Basic Linear Algebra Subprograms (BLAS), which
are written in Cray assembler. It performs the same floating point operations as
LINPACK, yet goes 4 times faster on a single processor and almost 20 times faster
on 8 processors. Increasing the matrix dimension **n** from 500 to 1000 increases
LAPACK's speed even more, to within 80% of the peak machine speed.
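The key transformation behind LAPACK's speedup is blocking: the factorization is reorganized so that most of its floating point operations occur inside matrix-matrix operations (triangular solves and symmetric rank-k updates), which the assembler BLAS execute near peak speed. The sketch below illustrates the idea in Python/NumPy; the block size `nb` and the helper calls are illustrative assumptions, not the actual LAPACK code.

```python
import numpy as np

def cholesky_blocked(A, nb=32):
    """Blocked (LAPACK-style) Cholesky sketch: A = L @ L.T.

    A is assumed symmetric positive definite. Most flops land in the
    matrix-matrix panel solve and trailing update, which in LAPACK map
    onto Level-3 BLAS calls (TRSM and SYRK).
    """
    L = A.copy().astype(float)
    n = L.shape[0]
    for j in range(0, n, nb):
        b = min(nb, n - j)
        # Factor the small diagonal block (unblocked work, O(nb^3) flops).
        L[j:j+b, j:j+b] = np.linalg.cholesky(L[j:j+b, j:j+b])
        if j + b < n:
            L11 = L[j:j+b, j:j+b]
            # Panel solve: find L21 with L21 @ L11.T = A21 (TRSM in LAPACK).
            L[j+b:, j:j+b] = np.linalg.solve(L11, L[j+b:, j:j+b].T).T
            L21 = L[j+b:, j:j+b]
            # Trailing update: A22 -= L21 @ L21.T (SYRK/GEMM in LAPACK).
            L[j+b:, j+b:] -= L21 @ L21.T
    return np.tril(L)
```

Both versions perform the same floating point operations; the blocked form merely reorders them so that each matrix block is reused many times once loaded, which is why the performance gap in Table 1 is so large.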

The purpose of this example is to show that an ``obviously'' parallel or vector algorithm may perform far worse than expected, but that certain systematic transformations can make it perform quite well. We understand these algorithmic transformations most completely for simple algorithms like Cholesky on architectures that are simple (from the user's point of view), like the Cray Y-MP. We understand them rather less well for more complicated algorithms on more complicated architectures. Our goal in this chapter is to discuss these transformations so that the reader can learn to use them, and to give pointers to existing software, such as LAPACK, in which these transformations have already been performed.