Choosing a data layout may be described as choosing
a mapping $f(i,j)$ from location $(i,j)$
in a matrix to the processor on which it is stored.
As discussed previously, we
hope to design f so that it permits highly parallel
implementation of a variety of
matrix algorithms, limits communication cost as much as possible,
and retains these
attractive properties as we scale to larger matrices and larger machines.
For example,
the algorithm of the previous section uses the map
$f(i,j) = (\lfloor i/r \rfloor, \lfloor j/r \rfloor)$,
where we subscript matrices
starting at 0, number processors by their coordinates
in a grid (also starting at (0,0)),
and store an $r \times r$
submatrix on each processor,
where $r = n/s$.
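As an illustration, a blocked layout of this kind can be sketched in a few lines of Python. The sketch rests on assumptions the text above does not spell out in full: the matrix is n-by-n, the processors form an s-by-s grid, s divides n evenly, and each processor holds one contiguous square block of side r = n/s. The function names `block_owner` and `entries_per_processor` are hypothetical and chosen only for this example.

```python
def block_owner(i, j, n, s):
    """Return the grid coordinates of the processor that stores entry (i, j).

    Assumptions (for illustration only): an n-by-n matrix, an s-by-s
    processor grid, 0-based indices, and s dividing n evenly.
    """
    if n % s != 0:
        raise ValueError("this sketch assumes s divides n")
    if not (0 <= i < n and 0 <= j < n):
        raise IndexError("matrix index out of range")
    r = n // s  # side length of the square block held by each processor
    return (i // r, j // r)


def entries_per_processor(n, s):
    """Count how many matrix entries each processor owns under block_owner."""
    counts = {}
    for i in range(n):
        for j in range(n):
            owner = block_owner(i, j, n, s)
            counts[owner] = counts.get(owner, 0) + 1
    return counts
```

With n = 6 and s = 3, for instance, r = 2 and `block_owner(5, 2, 6, 3)` evaluates to `(2, 1)`. Under these assumptions every one of the s*s processors owns exactly r*r entries, which is the storage balance such a layout is meant to provide.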
There is an emerging consensus about data layouts for distributed memory machines. These layouts are being implemented in several programming languages [26,27] that will be available to programmers in the near future. We describe these layouts here.