Choosing to store matrix entry a(i,j) on processor (i mod pr, j mod pc) of a pr x pc processor grid is called scatter mapping (also known as wrapped, cyclic, or interleaved mapping). It optimizes load balance, since the matrix entries stored on a single processor are as nearly as possible uniformly distributed throughout the matrix. On the other hand, this appears to inhibit the use of the BLAS locally in each processor, since the data owned by a processor are not contiguous from the point of view of the matrix. Finally, by choosing a block size b between these extremes and storing the b x b block containing a(i,j) on processor (floor(i/b) mod pr, floor(j/b) mod pc), we get a block-scatter mapping, which trades off load balance against applicability of the BLAS. These layouts are shown in Figures 2 through 4 for a 16 x 16 matrix laid out on a 4 x 4 processor grid; each array entry is labeled by the number of the processor that stores it.
Figure 2: Block layout of a 16 x 16 matrix on a 4 x 4 grid.
Figure 3: Scatter layout of a 16 x 16 matrix on a 4 x 4 processor grid.
Figure 4: Block scatter layout of a 16 x 16 matrix on a 4 x 4 processor grid with 2 x 2 blocks.
(See exercise 13.)
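The ownership rules behind Figures 2 through 4 can be sketched in a few lines of code. The helper below is illustrative, not from the text: it computes which processor of a pr x pc grid (numbered row-major, as in the figures) owns entry (i, j) under a block-scatter mapping with b x b blocks. Setting b = n/pr recovers the pure block layout, and b = 1 recovers the pure scatter (cyclic) layout.

```python
def owner(i, j, pr, pc, b):
    """Processor owning entry (i, j) of a matrix distributed over a
    pr x pc processor grid with b x b blocks, block-scatter style.
    Processors are numbered row-major, as in Figures 2 through 4."""
    grid_row = (i // b) % pr   # which processor row holds this block row
    grid_col = (j // b) % pc   # which processor column holds this block column
    return grid_row * pc + grid_col

# 16 x 16 matrix on a 4 x 4 grid, matching the figures:
n, pr, pc = 16, 4, 4
print(owner(5, 6, pr, pc, b=4))  # block layout (b = n/pr = 4): processor 5
print(owner(5, 6, pr, pc, b=1))  # scatter layout: processor 6
print(owner(5, 6, pr, pc, b=2))  # block-scatter with 2 x 2 blocks: processor 11
```

Counting how many entries each processor owns confirms the load-balance claim: with b = 1 every processor owns exactly (n/pr)*(n/pc) entries, and the same is true here for b = 2 and b = 4 because b divides n evenly; the difference is only in how contiguous those entries are.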
A different approach is to write algorithms that work independently of the location of the data, relying on the underlying language or run-time system to optimize the necessary communication. This makes code easier to write, but puts a large burden on compiler and run-time system writers. Although they aspire to make programming this easy, they have not yet succeeded (at least not without large performance penalties), so the computational scientist interested in top performance must currently pay attention to the details of data layout and communication.