Choosing a data layout may be described as choosing a mapping $f$ from the position of an entry in a matrix to the processor on which it is stored. As discussed previously, we hope to design $f$ so that it permits highly parallel implementation of a variety of matrix algorithms, limits communication cost as much as possible, and retains these attractive properties as we scale to larger matrices and larger machines. For example, the algorithm of the previous section uses the map $f(i,j) = (\lfloor i/m \rfloor, \lfloor j/m \rfloor)$, where we subscript matrices starting at 0, number processors by their coordinates in an $s \times s$ grid (also starting at $(0,0)$), and store an $m \times m$ submatrix on each processor, where $m = n/s$.

There is an emerging consensus about data layouts for distributed memory machines. It is being implemented in several programming languages [27, 26] that will be available to programmers in the near future. We describe these layouts here.

High Performance Fortran (HPF) [27] permits the user to define a virtual array of processors, to align actual data structures like matrices and arrays with this virtual array (and so with respect to each other), and then to lay out the virtual processor array on an actual machine. We describe the layout functions $f$ offered for this last step. The range of $f$ is a rectangular array of processors numbered from $(0,0)$ up to $(p_r - 1, p_c - 1)$. Then all the layouts can be parameterized by two integer parameters $b_r$ and $b_c$ as follows:

$$f(i,j) = \left( \lfloor i/b_r \rfloor \bmod p_r,\ \lfloor j/b_c \rfloor \bmod p_c \right).$$
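In code, this parameterization can be sketched as follows. This is a minimal illustration, assuming the block-cyclic map $f(i,j) = (\lfloor i/b_r \rfloor \bmod p_r, \lfloor j/b_c \rfloor \bmod p_c)$; the function and parameter names are ours, not HPF syntax.

```python
# A sketch of the block-cyclic owner map; br, bc are the block sizes and
# pr, pc the processor grid dimensions. Names are illustrative only.

def owner(i, j, br, bc, pr, pc):
    """Return the grid coordinates of the processor storing entry (i, j)."""
    return (i // br) % pr, (j // bc) % pc

# On a 2 x 2 grid, a pure cyclic layout uses br = bc = 1:
print(owner(0, 0, 1, 1, 2, 2))  # -> (0, 0)
print(owner(1, 0, 1, 1, 2, 2))  # -> (1, 0)
print(owner(2, 3, 1, 1, 2, 2))  # -> (0, 1)
```

Every layout discussed below arises from a particular choice of $b_r$ and $b_c$ in this one function.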

Suppose the matrix (or virtual processor array) is $n \times n$.
Then choosing $b_c = n$
yields a column of processors, each containing some number
of complete rows of the matrix.
Choosing $b_r = n$ yields a row of processors. Choosing $b_r = \lceil n/p_r \rceil$ and
$b_c = \lceil n/p_c \rceil$ yields a *blocked layout*, where the matrix is
broken into $p_r \times p_c$ subblocks,
each of which resides on a single processor.
This is the simplest two-dimensional layout one could
imagine (we used it in the previous section),
and since each processor stores a single large subblock,
it makes using the BLAS on
each processor attractive.
However, for straightforward matrix algorithms that process
the matrix from left to right
(including
Gaussian elimination, *QR* decomposition, reduction to tridiagonal form,
and so on), the
leftmost processors become idle early in the computation,
making load balance poor.
Choosing $b_r = b_c = 1$ yields a *scatter mapping* (also known as a
*wrapped*, *cyclic*, or *interleaved* mapping),
which optimizes load balance, since the
matrix entries stored on a single processor are as nearly
as possible uniformly distributed
throughout the matrix. On the other hand, this appears to
inhibit the use of the BLAS locally in
each processor, since the data owned by a processor are
not contiguous from the point of view
of the matrix. Finally, by choosing
$1 < b_r < \lceil n/p_r \rceil$ and $1 < b_c < \lceil n/p_c \rceil$, we get a *block-scatter mapping*,
which trades off load balance and
applicability of the BLAS. These
layouts are shown in
Figures 2 through 4
for a matrix laid out on a processor grid;
each array entry is labeled
by the number of the processor that stores it.
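The load-balance contrast between the blocked and scatter layouts can be made concrete by counting how many uneliminated columns each processor column still owns after the first $k$ columns have been processed. Below is a minimal sketch assuming the block-cyclic parameterization above; the function name and the 16-column example are illustrative.

```python
# Count active (uneliminated) columns owned by each processor column under
# a block-cyclic layout with column block size bc on pc processor columns.
# Illustrative sketch; n = 16, pc = 4, k = 8 are arbitrary example values.

def active_columns(n, pc, bc, k):
    """Distribution of the remaining columns j >= k over pc processor columns."""
    counts = [0] * pc
    for j in range(k, n):
        counts[(j // bc) % pc] += 1
    return counts

n, pc = 16, 4
print(active_columns(n, pc, n // pc, 8))  # blocked (bc = n/pc): [0, 0, 4, 4]
print(active_columns(n, pc, 1, 8))        # scatter (bc = 1):    [2, 2, 2, 2]
```

Halfway through a left-to-right algorithm, the blocked layout leaves half the processor columns idle, while the scatter layout keeps all of them equally busy.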

Figure 2: Block layout of a matrix on a processor grid.

Figure 3: Scatter layout of a matrix on a processor grid.

Figure 4: Block scatter layout of a matrix on a processor grid with blocks.
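Pictures in the style of Figures 2 through 4 can be generated directly from the block-cyclic parameterization. The sketch below uses an $8 \times 8$ matrix on a $2 \times 2$ grid with row-major processor numbering; these are illustrative choices, not necessarily those of the figures.

```python
# Label each matrix entry by the number of the processor that stores it,
# in the style of Figures 2 through 4. Processors are numbered row-major.
# The 8 x 8 matrix and 2 x 2 grid here are illustrative choices.

def layout(n, br, bc, pr, pc):
    """Return an n x n grid of processor numbers under a block-cyclic map."""
    return [[((i // br) % pr) * pc + (j // bc) % pc for j in range(n)]
            for i in range(n)]

for name, br, bc in [("block", 4, 4), ("scatter", 1, 1), ("block-scatter", 2, 2)]:
    print(name)
    for row in layout(8, br, bc, 2, 2):
        print(" ".join(map(str, row)))
    print()
```

Setting `br = bc = 4` reproduces the blocked pattern, `br = bc = 1` the scatter pattern, and `br = bc = 2` a block-scatter pattern with $2 \times 2$ blocks.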

A different approach is to write algorithms that work independently of the location of the data, and to rely on the underlying language or run-time system to optimize the necessary communication. This makes code easier to write, but places a large burden on the writers of compilers and run-time systems [28]. Although they aspire to make programming this easy, they have not yet succeeded (at least not without large performance penalties), so the computational scientist interested in top performance must currently pay attention to the details of data layout and communication.