Choosing a data layout may be described as choosing a mapping from location in a matrix to the processor on which it is stored. As discussed previously, we hope to design so that it permits highly parallel implementation of a variety of matrix algorithms, limits communication cost as much as possible, and retains these attractive properties as we scale to larger matrices and larger machines. For example, the algorithm of the previous section uses the map , where we subscript matrices starting at 0, number processors by their coordinates in a grid (also starting at (0,0)), and store an submatrix on each processor, where .
There is an emerging consensus about data layouts for distributed memory machines. This is being implemented in several programming languages , that will be available to programmers in the near future. We describe these layouts here.
High Performance Fortran (HPF)  permits the user to define a virtual array of processors, align actual data structures like matrices and arrays with this virtual array (and so with respect to each other), and then to layout the virtual processor array on an actual machine. We describe the layout functions offered for this last step. The range of is a rectangular array of processors numbered from up to . Then all can be parameterized by two integer parameters and as follows:
Suppose the matrix (or virtual processor array) is . Then choosing yields a column of processors, each containing some number of complete rows of . Choosing yields a row of processors. Choosing and yields a blocked layout, where is broken into subblocks, each of which resides on a single processor. This is the simplest two-dimensional layout one could imagine (we used it in the previous section), and by having large subblocks stored on each processor it makes using the BLAS on each processor attractive. However, for straightforward matrix algorithms that process the matrix from left to right (including Gaussian elimination, QR decomposition, reduction to tridiagonal form, and so on), the leftmost processors will become idle early in the computation and make load balance poor. Choosing is called scatter mapping (also known as wrapped or cyclic or interleaved mapping), and optimizes load balance, since the matrix entries stored on a single processor are as nearly as possible uniformly distributed throughout the matrix. On the other hand, this appears to inhibit the use of the BLAS locally in each processor, since the data owned by a processor are not contiguous from the point of view of the matrix. Finally, by choosing and , we get a block-scatter mapping which trades off load balance and applicability of the BLAS. These layouts are shown in Figures 2 through 4 for a matrix laid out on a processor grid; each array entry is labeled by the number of the processor that stores it.
(See exercises 7 and 8.)
A different approach is to write algorithms that work independently of the location of the data, and rely on the underlying language or run-time system to optimize the necessary communications. This makes code easier to write, but puts a large burden on compiler and run-time system writers . Even though compiler and run-time system writers aspire to make programming this easy, they have not yet succeeded (at least without large performance penalties), so the computational scientist interested in top performance must currently pay attention to details of data layout and communication.