This approach was very successful on machines like the Cray, i.e., parallel vector processors with relatively few processors sharing fast memory. On newer architectures, especially distributed memory machines like the Intel Paragon and CM-5, it cannot be applied as directly, although many of the same ideas still apply. The difficulty with distributed memory machines is twofold. First, the memory hierarchy is deeper, with both ``local memory'' and ``remote memory'' layers at the bottom (remote memory is memory physically located on another processor). Individual nodes may also be more complicated; for example, the CM-5 has five discernible levels of memory hierarchy on a single processor node, and other machines have comparably complicated nodes. Second, languages and compilers for these machines are still evolving: there are many more possible ways to store a matrix on such a machine, and ``obviously'' parallelizable or vectorizable loops may or may not be compiled into efficient code.
In the remaining sections, we discuss the successful BLAS-based approach to numerical software and outline current work on distributed memory machines.