We begin with an example to motivate our study of parallel and vector algorithms
for linear algebra problems. Consider solving a system of linear equations
, where
is a given dense
-by-
matrix,
is a given
-by-1 vector, and
is an
-by-1 vector of unknowns we want to compute.
The algorithm we will use is Cholesky, which is a variation of Gaussian
elimination suitable for matrices
with the special (but common) properties
of symmetry and positive definiteness (these are discussed in
more detail below). Cholesky, as well as Gaussian elimination, consists almost
entirely of simple, regular operations which are in principle easy to parallelize
or vectorize: adding multiples of one column of the matrix to other columns.
Since vector processors like the Cray Y-MP are designed to perform this operation
particularly well, we expect to get good performance running Cholesky on a Cray Y-MP.
The second column of Table 1 gives the speed on a single
processor of a Cray Y-MP; here vectorization is feature to be exploited.
The third column gives the speed using all 8 processors; here parallelization as well
as vectorization is possible. The first row of the table gives the maximum possible
speed in megaflops of the machine; no algorithm for any problem can go faster.
The next three rows give the speed of various Basic Linear Algebra Subroutines
(or BLAS), such as multiplying two large, square (dimension
) matrices, multiplying
a large square matrix by a vector, and solving a triangular system of equations
by substitution. These BLAS have been written in Cray assembly language and
fully exploit the Cray Y-MP architecture; we see that they generally come quite
close to the maximum performance given in the first row.
The fifth row of the table gives the performance of the Cholesky subroutine spofa from the LINPACK subroutine library [1]. LINPACK is a well-known and widely used library of Fortran 77 linear algebra routines, but was written before the advent of vector and parallel computers. Its performance is very poor, over 4.5 times slower than the maximum machine speed on a single processor and over 36 times slower than maximum on 8 processors. This is very disappointing, since at first glance it is hard to imagine an algorithm better suited to the Cray than Cholesky.
The last two rows of Table 1 give the performance of a
different version of Cholesky, sposv, taken from the newer LAPACK library of
linear algebra routines [2]. LAPACK was written to exploit architectures
like the Cray. It is also written in Fortran 77, but includes calls to the BLAS, which
are written in Cray assembler. It performs the same floating point operations as
LINPACK, yet goes 4 times faster on a single processor and almost 20 times faster
on 8 processors. Increasing the matrix dimension
from 500 to 1000 increases
LAPACK's speed even more, to within 80% of the peak machine speed.
The purpose of this example is to show that an ``obviously'' parallel or vector algorithm may work much more poorly than expected, but by applying certain (systematic) transformations to the algorithm, it can be made to work quite well. We understand these algorithmic transformations most completely in the case of simple algorithms like Cholesky, on simple (to the user) architectures like the Cray Y-MP. We understand them rather less well for more complicated algorithms on more complicated architectures. Our goal in this chapter is to discuss these transformations so the reader can learn to use them, and give pointers to existing software (like LAPACK) where these transformations have already been performed.
In order to describe and solve the basic linear algebra problems of this chapter,
we need some notation.
We will refer frequently to matrices, vectors and
scalars. A matrix will be denoted by an upper case letter like
,
and its
-th element by
. Occasionally in detailed algorithmic
descriptions we will instead write
. The submatrix of
occupying
rows
though
and columns
through
will be denoted
.
A lower case letter like
will denote a vector, and its
-th element will
be written
.
Vectors will almost always be column vectors, which are the same as
matrices with one column. Lower case Greek letters (and occasionally lower
case letters) will denote scalars.
will denote the set of real numbers,
the set of
-dimensional real vectors,
and
the set of
-by-
real matrices.
will denote the transpose of the
matrix
:
. A matrix is called symmetric if
.
A matrix
is positive definite if
for all nonzero
vectors
. Other notation will be introduced as needed.
Computational science problems can often be reduced to solving one or a sequence of the following standard linear algebra problems:
. Here
is a given
-by-
nonsingular real or complex matrix,
is a given column vector with
entries,
and
is a column vector with
entries we wish to compute.
which minimizes
where
is
-by-
,
is
-by-
, and
is
-by-
.
is called the
2-norm of the vector
.
If
, so we have more equations than unknowns, the system is called overdetermined. In this case we cannot generally solve
exactly.
If
, the system is called underdetermined, and we will have
infinitely many solutions.
-by-
matrix
,
find an
-by-
nonzero vector
and a scalar
so that
.
For example, linear equations often arise when solving ordinary or partial
differential equations numerically; the linear system must be solved to advance
the solution by a time step. Least squares problems arise in fitting curves or
surfaces to experimental data. Eigenvalue problems arise when analyzing vibrations.
There are also many important variations on these basic problems,
but to keep this chapter to a reasonable length, we will concentrate on
algorithms for solving systems of linear equations, and refer elsewhere
for least squares and eigenvalue problems
[6][5][4][3].
A great many algorithms are available for all these problems. The choice of algorithm depends on three things:
has a large influence on the algorithm. For example,
solving
using Gaussian elimination takes
flops (floating point
operations) if
is a dense matrix. If
is symmetric and positive definite,
we can use the Cholesky algorithm which costs half as much:
flops.
If
is triangular (i.e. either zero
above the diagonal or zero below the diagonal), then we can solve
in
only
flops by simple substitution, which is much faster.
Recognizing triangularity is easy, but other structures are more subtle. For example,
if
arises from solving certain elliptic partial differential equations
(like Poisson's equation), then
can actually be solved in
flops using
multigrid, which is essentially as little work as possible.
to 16 decimals of accuracy (more on what this means below),
one might use a careful and expensive implementation of Gaussian elimination, but if
one only needs 3 decimal digits, a few iterations of a much cheaper iterative
method like conjugate gradients might suffice.
Since the three standard problems listed above have been studied for so long, one
might hope that good numerical software libraries would be available to solve them.
This is partly true, as we illustrated by LAPACK [7][2].
However, there are several reasons one cannot depend on libraries entirely.
First, it is impossible to anticipate all possible structures of
that can arise in
practice and write corresponding libraries (although extensive libraries for various
specialized
do exist [7]).
Second, high performance computer architectures, programming languages and compilers
have been changing so much lately that library writers have not been able to keep up.
There is no clear cut rule on when to use a library routine. In some cases an unoptimized
or serial implementation of an algorithm which fully exploits a matrix structure may be
much faster than a highly tuned, parallel implementation of a more general algorithm,
and in other cases the opposite may be true.
If the user's priority is ease of programming and reliability,
a library routine is a good choice. If, instead, the user's priority is maximum performance
on one specialized problem, and software development time is secondary, custom software
is the likely choice. We will try to address the concerns of both classes of users in
this chapter.
The rest of this chapter is organized as follows.
Section 2 discusses the cost models we will use
to understand algorithm performance.
Section 3 discusses the BLAS, which we use to construct many
high performance algorithms.
Section 4 discusses how to implement matrix
multiplication quickly on several different parallel architectures; these techniques
serve as models for implementing more complicated algorithms.
Section 5 discusses how to layout matrices on distributed
memory machines.
Section 6 discusses parallelizing Gaussian elimination on
dense matrices, and gives pointers to other related algorithms.
Section 7 discusses parallelizing Gaussian elimination on
dense matrices, and gives pointers to other related algorithms.
Section 8 discusses parallelizing Gaussian elimination on
sparse matrices; this is a much harder problem, and much less software is
available. Section 9 discusses iterative methods for
solving
, which in contrast to Gaussian elimination take an approximate
solution and iteratively improves it, rather than providing the final answer after
a fixed number of steps.
Section 10 shows how to bound the error in a computed
solution to
, for any of the methods discussed in this chapter.
Section 11 discusses how to use netlib to retrieve existing library
software electronically. Finally, section 12 gives a quick references
guide to the BLAS.
This is a necessarily partial survey of a large and active field. For further reading in numerical linear algebra, see the textbooks by Golub and Van Loan [4], Watkins [5] or Demmel [6]. For more details on parallel numerical linear algebra in particular, see [9][8][3], the last of which includes a bibliography of over 2000 references.