Choosing is called scatter mapping (also known as
wrapped or cyclic or interleaved mapping),
and optimizes load balance, since the
matrix entries stored on a single processor are as nearly
as possible uniformly distributed
throughout the matrix. On the other hand, this appears to
inhibit the use of the BLAS locally in
each processor, since the data owned by a processor are
not contiguous from the point of view
of the matrix. Finally, by choosing
and
, we get a block-scatter mapping
which trades off load balance and
applicability of the BLAS. These
layouts are shown in
Figures 2 through 4
for a
matrix laid out on a
processor grid;
each array entry is labeled
by the number of the processor that stores it.
Figure 2: Block layout of a 16 x 16 matrix on a 4 x 4 grid.
Figure 3: Scatter layout of a 16 x 16 matrix on a 4 x 4 processor grid.
Figure 4: Block scatter layout of a 16 x 16 matrix on a 4 x 4 processor grid with 2 x 2 blocks.
A different approach is to write algorithms that work independently of the location of the data, and rely on the underlying language or run-time system to optimize the necessary communications. This makes code easier to write, but puts a large burden on compiler and run-time system writers [28]. Even though compiler and run-time system writers aspire to make programming this easy, they have not yet succeeded (at least without large performance penalties), so the computational scientist interested in top performance must currently pay attention to details of data layout and communication.