We begin with an example to motivate our study of parallel and vector algorithms for linear algebra problems. Consider solving a system of linear equations Ax=b, where A is a given dense n-by-n matrix, b is a given n-by-1 vector, and x is an n-by-1 vector of unknowns we want to compute. The algorithm we will use is Cholesky, which is a variation of Gaussian elimination suitable for matrices A with the special (but common) properties of symmetry and positive definiteness (these are discussed in more detail below). Cholesky, as well as Gaussian elimination, consists almost entirely of simple, regular operations which are in principle easy to parallelize or vectorize: adding multiples of one column of the matrix to other columns. Since vector processors like the Cray Y-MP are designed to perform this operation particularly well, we expect to get good performance running Cholesky on a Cray Y-MP.
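To make the column operation concrete, here is a minimal sketch of a column-oriented Cholesky factorization in plain Python. The function name and the list-of-rows matrix representation are illustrative choices, not from the text; the inner loop is exactly the operation described above, subtracting a multiple of one column from another.

```python
import math

def cholesky_column(A):
    """Column-oriented Cholesky factorization A = L L^T (a sketch).
    A is a symmetric positive definite matrix given as a list of rows;
    returns the lower-triangular factor L."""
    n = len(A)
    L = [row[:] for row in A]          # work on a copy
    for j in range(n):
        L[j][j] = math.sqrt(L[j][j])
        for i in range(j + 1, n):
            L[i][j] /= L[j][j]         # scale column j below the diagonal
        for k in range(j + 1, n):
            for i in range(k, n):
                # subtract a multiple of column j from column k --
                # the simple, regular operation highlighted in the text
                L[i][k] -= L[i][j] * L[k][j]
    for i in range(n):                 # zero the strictly upper triangle
        for j in range(i + 1, n):
            L[i][j] = 0.0
    return L
```

The two inner loops over `i` are the long, regular vector operations that a machine like the Cray Y-MP is designed to execute efficiently.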
Table 1: Performance in Megaflops of Cholesky on a Cray Y-MP.
The second column of Table 1 gives the speed on a single processor of a Cray Y-MP; here vectorization is the feature to be exploited. The third column gives the speed using all 8 processors; here parallelization as well as vectorization is possible. The first row of the table gives the maximum possible speed in megaflops of the machine; no algorithm for any problem can go faster. The next three rows give the speed of various Basic Linear Algebra Subroutines (or BLAS), such as multiplying two large, square (dimension n=500) matrices, multiplying a large square matrix by a vector, and solving a triangular system of equations Tx=b by substitution. These BLAS have been written in Cray assembly language and fully exploit the Cray Y-MP architecture; we see that they generally come quite close to the maximum performance given in the first row.
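The last of these BLAS, the triangular solve Tx=b, can be sketched in a few lines of plain Python. The function name and argument conventions are illustrative assumptions; the algorithm shown is standard forward substitution for a lower-triangular T.

```python
def forward_substitute(T, b):
    """Solve T x = b by forward substitution, where T is a
    lower-triangular matrix given as a list of rows and b is a list.
    This is the triangular-solve BLAS operation named in the text."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # subtract the contributions of the already-computed unknowns,
        # then divide by the diagonal entry
        s = b[i] - sum(T[i][j] * x[j] for j in range(i))
        x[i] = s / T[i][i]
    return x
```

Note that each unknown x[i] depends on all earlier ones, which is why the triangular solve is harder to parallelize than matrix-matrix or matrix-vector multiplication.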