Since the three standard problems listed above have been studied for so long, one
might hope that good numerical software libraries would be available to solve them.
This is partly true, as illustrated by LAPACK [2,7].
However, there are several reasons one cannot depend on libraries entirely.
First, it is impossible to anticipate all possible structures of **A** that can arise in
practice and write corresponding libraries (although extensive libraries for various
specialized **A** do exist [7]).
Second, high performance computer architectures, programming languages and compilers
have been changing so much lately that library writers have not been able to keep up.
There is no clear-cut rule for when to use a library routine. In some cases an unoptimized
or serial implementation of an algorithm that fully exploits a matrix structure may be
much faster than a highly tuned, parallel implementation of a more general algorithm,
and in other cases the opposite may be true.
If the user's priority is ease of programming and reliability,
a library routine is a good choice. If, instead, the user's priority is maximum performance
on one specialized problem, and software development time is secondary, custom software
is the likely choice. We will try to address the concerns of both classes of users in
this chapter.

The rest of this chapter is organized as follows.
Section 2 discusses the cost models we will use
to understand algorithm performance.
Section 3 discusses the BLAS, which we use to construct many
high performance algorithms.
Section 4 discusses how to implement matrix
multiplication quickly on several different parallel architectures; these techniques
serve as models for implementing more complicated algorithms.
Section 5 discusses how to lay out matrices on distributed
memory machines.
Section 6 discusses parallelizing Gaussian elimination on
dense matrices, and gives pointers to other related algorithms.
Section 8 discusses parallelizing Gaussian elimination on
sparse matrices; this is a much harder problem, and much less software is
available. Section 9 discusses iterative methods for
solving **Ax=b**, which, in contrast to Gaussian elimination, take an approximate
solution and iteratively improve it, rather than providing the final answer after
a fixed number of steps.
Section 10 shows how to bound the error in a computed
solution to **Ax=b**, for any of the methods discussed in this chapter.
Section 11 discusses how to use netlib to retrieve existing library
software electronically. Finally, section 12 gives a quick reference
guide to the BLAS.
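To make the contrast between direct and iterative methods concrete, the following is a minimal sketch, not taken from this chapter: it solves a small **Ax=b** with Jacobi iteration (a simple iterative method of the kind Section 9 covers), compares against a direct solve, and evaluates a standard residual-based error bound of the kind Section 10 discusses. The matrix and tolerance are illustrative choices; Jacobi is only guaranteed to converge for special matrices, such as the diagonally dominant one used here.

```python
import numpy as np

# A small diagonally dominant system, chosen so Jacobi iteration converges.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])

# Jacobi iteration: split A = D + R with D diagonal, then repeat
#   x_{k+1} = D^{-1} (b - R x_k)
# starting from an initial guess, improving it each step.
D = np.diag(A)
R = A - np.diag(D)
x = np.zeros_like(b)
for k in range(100):
    x_new = (b - R @ x) / D
    done = np.linalg.norm(x_new - x, np.inf) < 1e-12
    x = x_new
    if done:
        break

# Direct solve for comparison (Gaussian elimination inside np.linalg.solve).
x_direct = np.linalg.solve(A, b)

# Residual-based error bound: the relative error in x is at most
#   cond(A) * ||r|| / (||A|| * ||x||),  where r = b - A x.
r = b - A @ x
bound = (np.linalg.cond(A, np.inf) * np.linalg.norm(r, np.inf)
         / (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)))
print(np.allclose(x, x_direct))  # True: both methods agree
```

Note the trade-off this illustrates: the direct solve finishes after a fixed amount of work, while the iteration stops as soon as the answer is accurate enough, which can be much cheaper for large sparse problems.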

This is a necessarily partial survey of a large and active field. For further
reading in numerical linear algebra, see the textbooks by Golub and Van Loan
[4], Watkins [5] or Demmel [6].
For more details on parallel numerical linear algebra in particular, see
[3,8,9], the last of which includes a bibliography
of over 2000 references.