2 Memory Hierarchies



next up previous
Next: 3 The BLAS Up: LA Chapter Previous: 1 Introduction

2 Memory Hierarchies

 

To understand the speedups of LAPACK over LINPACK in Table 1, and in general to be able to understand which algorithms are fast and which are slow, we need to understand memory hierarchies. Memory hierarchies were discussed in some detail in the chapter on Computer Architecture gif, and we just review them briefly here, emphasizing their importance to algorithm design.

Computer memories are built as hierarchies, with a series of different kinds of memories ranging from very fast, expensive, and therefore small memory at the top of the hierarchy, down to slow, cheap and very large memory at the bottom. For example, registers typically form the fastest memory, then cache, main memory, disks, and finally tape as the slowest, largest and cheapest. Useful floating point operations can only be done on data at the top of the hierarchy, in the registers. Since an entire large matrix cannot fit in the registers, it must be moved up and down through the hierarchy, transferred up to the registers when work needs to be done, and transferred back down to the main memory (or disk or tape) when it is no longer needed. It takes time to move between levels in the memory hierarchy, and moving is slower the farther down in the hierarchy one goes. Indeed, one such data movement takes far longer than performing a floating point operation. So the limiting factor for many algorithms is not the time needed for floating point operations, but the time to move data in the memory hierarchy. Therefore, a clever algorithm will attempt to minimize the number of these memory moves (even if it might mean doing a few more floating point operations).

LINPACK's Cholesky performed so poorly in Table 1, because it was not designed to minimize memory movement on machines such as the Cray YMP (it was designed to minimize another kind of memory movement, page faults between main memory and disk). In contrast, the matrix-matrix multiplication and other BLAS on the Cray YMP in Table 1 were written (in assembly language) just for the Cray YMP to minimize data movement.

Since it is expensive and time-consuming to write every routine like Cholesky in assembly language for every new computer, we would like a better approach. Here is the most successful approach we have discovered so far. Since operations like matrix-matrix multiplication are so common, computer manufacturers have standardized them as the Basic Linear Algebra Subroutines or BLAS [12][11][10],   and optimized them for their machines. In other words, a library of subroutines for matrix-matrix multiplication, matrix-vector multiplication, and other routines is available as a standard Fortran (or C) callable library on most high performance machines, and underneath they have been optimized for each machine. If we can reorganize our algorithms to use these optimized BLAS to perform all or most of the work, then our algorithms will go as fast as the manufacturer's optimized BLAS. This was the approach taken in LAPACK: the algorithms in LINPACK (and the corresponding library for eigenvalue problems, EISPACK [14][13]) were reorganized to call the BLAS in their innermost loops, where most of the work is done. This led to the speedups shown in Table 1.

This approach was very successful on machines like the Cray, i.e. parallel vector processors using fast shared memory with relatively few processors. On newer architectures, especially distributed memory machines like the Intel Paragon and CM-5, this approach cannot be used as straightforwardly, although many of the same ideas still apply. The difficulty with distributed memory machines is twofold. First, the memory hierarchy is deeper, including both ``local memory'' and ``remote memory'' layers at the bottom (remote memory means memory physically on another processor). Individual nodes may also be more complicated; for example, the CM-5 has 5 discernible levels of memory hierarchy on a single processor node; other machines are also complicated. Second, languages and compilers are still evolving, so that there are many more possible ways to store a matrix on a machine, and ``obviously'' parallelizable or vectorizable loops may or may not be compiled well.

In the remaining sections, we will discuss the successful BLAS-based approach to numerical software, and outline current activities on distributed memory machines.



next up previous
Next: 3 The BLAS Up: LA Chapter Previous: 1 Introduction



verena@csep1.phy.ornl.gov