1 Introduction



next up previous
Next: 2 Memory Hierarchies Up: LA Chapter Previous: LA Chapter

1 Introduction

We begin with an example to motivate our study of parallel and vector algorithms for linear algebra problems. Consider solving a system of linear equations , where is a given dense -by- matrix, is a given -by-1 vector, and is an -by-1 vector of unknowns we want to compute. The algorithm we will use is Cholesky, which is a variation of Gaussian elimination suitable for matrices with the special (but common) properties of symmetry and positive definiteness (these are discussed in more detail below). Cholesky, as well as Gaussian elimination, consists almost entirely of simple, regular operations which are in principle easy to parallelize or vectorize: adding multiples of one column of the matrix to other columns. Since vector processors like the Cray Y-MP are designed to perform this operation particularly well, we expect to get good performance running Cholesky on a Cray Y-MP.


Table 1: Performance in Megaflops of Cholesky on a Cray Y-MP View Table

The second column of Table 1 gives the speed on a single processor of a Cray Y-MP; here vectorization is feature to be exploited. The third column gives the speed using all 8 processors; here parallelization as well as vectorization is possible. The first row of the table gives the maximum possible speed in megaflops of the machine; no algorithm for any problem can go faster. The next three rows give the speed of various Basic Linear Algebra Subroutines (or BLAS), such as multiplying two large, square (dimension ) matrices, multiplying a large square matrix by a vector, and solving a triangular system of equations by substitution. These BLAS have been written in Cray assembly language and fully exploit the Cray Y-MP architecture; we see that they generally come quite close to the maximum performance given in the first row.

The fifth row of the table gives the performance of the Cholesky subroutine spofa from the LINPACK subroutine library [1]. LINPACK is a well-known and widely used library of Fortran 77 linear algebra routines, but was written before the advent of vector and parallel computers. Its performance is very poor, over 4.5 times slower than the maximum machine speed on a single processor and over 36 times slower than maximum on 8 processors. This is very disappointing, since at first glance it is hard to imagine an algorithm better suited to the Cray than Cholesky.

The last two rows of Table 1 give the performance of a different version of Cholesky, sposv, taken from the newer LAPACK library of linear algebra routines [2]. LAPACK was written to exploit architectures like the Cray. It is also written in Fortran 77, but includes calls to the BLAS, which are written in Cray assembler. It performs the same floating point operations as LINPACK, yet goes 4 times faster on a single processor and almost 20 times faster on 8 processors. Increasing the matrix dimension from 500 to 1000 increases LAPACK's speed even more, to within 80% of the peak machine speed.

The purpose of this example is to show that an ``obviously'' parallel or vector algorithm may work much more poorly than expected, but by applying certain (systematic) transformations to the algorithm, it can be made to work quite well. We understand these algorithmic transformations most completely in the case of simple algorithms like Cholesky, on simple (to the user) architectures like the Cray Y-MP. We understand them rather less well for more complicated algorithms on more complicated architectures. Our goal in this chapter is to discuss these transformations so the reader can learn to use them, and give pointers to existing software (like LAPACK) where these transformations have already been performed.

In order to describe and solve the basic linear algebra problems of this chapter, we need some notation. We will refer frequently to matrices, vectors and scalars. A matrix will be denoted by an upper case letter like , and its -th element by . Occasionally in detailed algorithmic descriptions we will instead write . The submatrix of occupying rows though and columns through will be denoted . A lower case letter like will denote a vector, and its -th element will be written . Vectors will almost always be column vectors, which are the same as matrices with one column. Lower case Greek letters (and occasionally lower case letters) will denote scalars. will denote the set of real numbers, the set of -dimensional real vectors, and the set of -by- real matrices. will denote the transpose of the matrix : . A matrix is called symmetric if . A matrix is positive definite if for all nonzero vectors . Other notation will be introduced as needed.

Computational science problems can often be reduced to solving one or a sequence of the following standard linear algebra problems:

For example, linear equations often arise when solving ordinary or partial differential equations numerically; the linear system must be solved to advance the solution by a time step. Least squares problems arise in fitting curves or surfaces to experimental data. Eigenvalue problems arise when analyzing vibrations. There are also many important variations on these basic problems,gif but to keep this chapter to a reasonable length, we will concentrate on algorithms for solving systems of linear equations, and refer elsewhere for least squares and eigenvalue problems [6][5][4][3].

A great many algorithms are available for all these problems. The choice of algorithm depends on three things:

Since the three standard problems listed above have been studied for so long, one might hope that good numerical software libraries would be available to solve them. This is partly true, as we illustrated by LAPACK [7][2]. However, there are several reasons one cannot depend on libraries entirely. First, it is impossible to anticipate all possible structures of that can arise in practice and write corresponding libraries (although extensive libraries for various specialized do exist [7]). Second, high performance computer architectures, programming languages and compilers have been changing so much lately that library writers have not been able to keep up. There is no clear cut rule on when to use a library routine. In some cases an unoptimized or serial implementation of an algorithm which fully exploits a matrix structure may be much faster than a highly tuned, parallel implementation of a more general algorithm, and in other cases the opposite may be true. If the user's priority is ease of programming and reliability, a library routine is a good choice. If, instead, the user's priority is maximum performance on one specialized problem, and software development time is secondary, custom software is the likely choice. We will try to address the concerns of both classes of users in this chapter.

The rest of this chapter is organized as follows. Section 2 discusses the cost models we will use to understand algorithm performance. Section 3 discusses the BLAS, which we use to construct many high performance algorithms. Section 4 discusses how to implement matrix multiplication quickly on several different parallel architectures; these techniques serve as models for implementing more complicated algorithms. Section 5 discusses how to layout matrices on distributed memory machines. Section 6 discusses parallelizing Gaussian elimination on dense matrices, and gives pointers to other related algorithms. Section 7 discusses parallelizing Gaussian elimination on dense matrices, and gives pointers to other related algorithms. Section 8 discusses parallelizing Gaussian elimination on sparse matrices; this is a much harder problem, and much less software is available. Section 9 discusses iterative methods for solving , which in contrast to Gaussian elimination take an approximate solution and iteratively improves it, rather than providing the final answer after a fixed number of steps. Section 10 shows how to bound the error in a computed solution to , for any of the methods discussed in this chapter. Section 11 discusses how to use netlib to retrieve existing library software electronically. Finally, section 12 gives a quick references guide to the BLAS.

This is a necessarily partial survey of a large and active field. For further reading in numerical linear algebra, see the textbooks by Golub and Van Loan [4], Watkins [5] or Demmel [6]. For more details on parallel numerical linear algebra in particular, see [9][8][3], the last of which includes a bibliography of over 2000 references.



next up previous
Next: 2 Memory Hierarchies Up: LA Chapter Previous: LA Chapter



verena@csep1.phy.ornl.gov