Choosing a data layout may be described as choosing a mapping $f(i,j)$ from location $(i,j)$ in a matrix to the processor on which it is stored. As discussed previously, we hope to design $f$ so that it permits highly parallel implementation of a variety of matrix algorithms, limits communication cost as much as possible, and retains these attractive properties as we scale to larger matrices and larger machines. For example, the algorithm of the previous section uses the map $f(i,j) = (\lfloor i/r \rfloor, \lfloor j/r \rfloor)$, where we subscript matrices starting at 0, number processors by their coordinates in a grid (also starting at $(0,0)$), and store an $r \times r$ submatrix on each processor, where $r = n/s$.
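A block layout of this kind can be illustrated with a minimal Python sketch. It assumes an n-by-n matrix stored on an s-by-s processor grid with s dividing n, so that each processor holds one contiguous r-by-r block with r = n/s. The names `block_owner` and `layout_table` and their parameters are hypothetical and chosen only for illustration.

```python
def block_owner(i, j, n, s):
    """Return the grid coordinates of the processor that stores entry (i, j).

    Assumptions (not from the text): the matrix is n-by-n, the processor
    grid is s-by-s, s divides n, and all indices start at 0.
    """
    if n % s != 0:
        raise ValueError("this sketch assumes s divides n")
    if not (0 <= i < n and 0 <= j < n):
        raise IndexError("matrix index out of range")
    r = n // s  # side length of the square block held by each processor
    return (i // r, j // r)


def layout_table(n, s):
    """Build an n-by-n table whose (i, j) entry is the owner of matrix entry (i, j)."""
    return [[block_owner(i, j, n, s) for j in range(n)] for i in range(n)]
```

With n = 4 and s = 2, for instance, every entry with i in {0, 1} and j in {2, 3} is assigned to processor (0, 1), so each processor owns exactly one 2-by-2 block.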
There is an emerging consensus about data layouts for distributed memory machines. These layouts are being implemented in several programming languages [26,27] that will be available to programmers in the near future. We describe them here.