We begin with an example to motivate our study of parallel and vector algorithms
for linear algebra problems. Consider solving a system of linear equations
**Ax=b**, where **A** is a given dense **n**-by-**n** matrix, **b** is a given
**n**-by-1 vector, and **x** is an **n**-by-1 vector of unknowns we want to compute.
The algorithm we will use is *Cholesky*, which is a variation of Gaussian
elimination suitable for matrices **A** with the special (but common) properties
of *symmetry* and *positive definiteness* (these are discussed in
more detail below). Cholesky, as well as Gaussian elimination, consists almost
entirely of simple, regular operations which are in principle easy to parallelize
or vectorize: adding multiples of one column of the matrix to other columns.
Since vector processors like the Cray Y-MP are designed to perform this operation
particularly well, we expect to get good performance running Cholesky on a Cray Y-MP.
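To make the column-oriented operation concrete, here is a minimal pure-Python sketch of Cholesky factorization (**A = LLᵀ**). It is illustrative only, not the tuned library code discussed in the text; the inner update loop is exactly the "add a multiple of one column to another column" kernel that vector processors execute well:

```python
import math

def cholesky(A):
    """Column-oriented Cholesky factorization of a symmetric positive
    definite matrix A (given as a list of lists), returning the lower
    triangular factor L with A = L * L^T."""
    n = len(A)
    # work on a copy so the input matrix is not modified
    L = [[A[i][j] for j in range(n)] for i in range(n)]
    for j in range(n):
        # compute the diagonal entry of column j
        L[j][j] = math.sqrt(L[j][j])
        # scale the rest of column j by the diagonal entry
        for i in range(j + 1, n):
            L[i][j] /= L[j][j]
        # subtract multiples of column j from the later columns;
        # this regular, saxpy-like update is the vectorizable kernel
        for k in range(j + 1, n):
            for i in range(k, n):
                L[i][k] -= L[i][j] * L[k][j]
    # zero out the strict upper triangle
    for i in range(n):
        for j in range(i + 1, n):
            L[i][j] = 0.0
    return L
```

For example, factoring the 2-by-2 matrix [[4, 2], [2, 3]] yields L = [[2, 0], [1, √2]].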

Table 1: Performance in Megaflops of Cholesky on a Cray Y-MP.

The second column of Table 1 gives the speed on a single
processor of a Cray Y-MP; here vectorization is the feature to be exploited.
The third column gives the speed using all 8 processors; here parallelization as well
as vectorization is possible. The first row of the table gives the maximum possible
speed in megaflops of the machine; no algorithm for any problem can go faster.
The next three rows give the speed of various Basic Linear Algebra Subroutines
(or BLAS), such as multiplying two large, square (dimension **n=500**) matrices, multiplying
a large square matrix by a vector, and solving a triangular system of equations
**Tx=b** by substitution. These BLAS have been written in Cray assembly language and
fully exploit the Cray Y-MP architecture; we see that they generally come quite
close to the maximum performance given in the first row.
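The triangular solve mentioned among the BLAS can also be sketched in a few lines. This is a plain-Python illustration of column-oriented forward substitution for a lower triangular system **Lx = b**, not the hand-tuned Cray assembly routine the table measures:

```python
def lower_triangular_solve(L, b):
    """Solve L x = b by forward substitution, where L is lower
    triangular with nonzero diagonal (given as a list of lists)."""
    n = len(b)
    x = b[:]  # copy so b is not modified
    for j in range(n):
        # solve for the j-th unknown
        x[j] /= L[j][j]
        # subtract x[j] times column j of L from the remaining
        # right-hand side -- again a column-oriented update
        for i in range(j + 1, n):
            x[i] -= L[i][j] * x[j]
    return x
```

For instance, with L = [[2, 0], [1, 2]] and b = [4, 6], the solution is x = [2, 2]. Combined with the factorization above, this is how **Ax = b** is solved: factor **A = LLᵀ**, then perform two triangular solves.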