The optimal choice of block sizes **b_r** and **b_c** depends on the
cost of communication *versus* computation. For example, if the communication required to do
pivot search and swapping of rows is expensive, **b_c** should be large,
so that the cost of pivoting is amortized over more computation.
The execution time is a function
of the dimension **n**, the block sizes **b_r** and **b_c**,
the processor counts **p_r** and **p_c**,
and the cost of computation and communication
(from Chapter CA, we know how to model these).
Given this function, it
may be minimized as a function of **b_r**, **b_c**, **p_r**, and **p_c**.
Some theoretical analyses of this sort
for special cases may be found in [30] and the references
therein.
See also [31] and [32]. As an example of the
performance that can be
attained in practice, on an Intel Delta with 512 processors the speed
of *LU* ranged from a little over 1 gigaflop for **n=2000** to nearly 12
gigaflops for **n=25000**.
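
To make this concrete, here is a minimal sketch of such a minimization in Python. The cost model and all machine constants below are illustrative assumptions (not the model developed in Chapter CA), and `lu_time` and `best_parameters` are hypothetical names.

```python
import itertools, math

# Illustrative machine parameters (assumed, not measured).
GAMMA = 1e-9   # seconds per floating point operation
ALPHA = 1e-5   # per-message latency in seconds
BETA  = 1e-8   # per-word transfer time in seconds

def lu_time(n, br, bc, pr, pc):
    """Toy execution-time model for block-cyclic LU:
    the 2n^3/3 flops shared by p = pr*pc processors, plus
    hedged latency/bandwidth terms for the n/bc panel
    broadcasts and the per-column pivot reductions."""
    p = pr * pc
    flops = (2.0 / 3.0) * n**3 * GAMMA / p
    bcast = (n / bc) * (ALPHA + BETA * n * bc / pr)
    pivot = n * ALPHA * math.log2(max(pr, 2))
    return flops + bcast + pivot

def best_parameters(n, p, block_sizes=(16, 32, 64, 128)):
    """Grid-search the model over block sizes and grid shapes."""
    grids = [(pr, p // pr) for pr in range(1, p + 1) if p % pr == 0]
    return min((lu_time(n, br, bc, pr, pc), br, bc, pr, pc)
               for br, bc in itertools.product(block_sizes, repeat=2)
               for pr, pc in grids)

t, br, bc, pr, pc = best_parameters(n=25000, p=512)
print(f"predicted {t:.2f}s at b_r={br}, b_c={bc}, grid {pr}x{pc}")
```

In practice one would replace the toy terms with measured values of the latency, bandwidth, and flop rate of the target machine.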

Even if the layout is not block scatter as described so far,
essentially the same
algorithm may be used. As described in Section 5,
many possible layouts
are related by permutation matrices. So simply performing
the algorithm just described
with (optimal) block sizes **b_r** and **b_c** on the matrix **A**
as stored is equivalent to
performing the *LU* decomposition of **P·A·Q**, where **P** and **Q** are permutation
matrices. Thus, at the cost of keeping track of these permutations
(a possibly nontrivial software
issue), a single algorithm suffices for a wide variety of layouts.
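
As an illustration of this layout/permutation correspondence, here is a minimal sketch for a 1-D block-cyclic row layout; `block_cyclic_permutation` and its parameters are hypothetical names, and the 2-D case applies the same idea to the columns as well.

```python
import numpy as np

def block_cyclic_permutation(n, b, p):
    """Permutation gathering a 1-D block-cyclic layout:
    global row i lies in block i//b, which is owned by
    processor (i//b) % p as its ((i//b)//p)-th local block.
    Listing rows processor by processor (then local block,
    then offset within the block) yields the permutation
    relating the matrix as stored to the original one."""
    perm = sorted(range(n),
                  key=lambda i: ((i // b) % p,   # owning processor
                                 (i // b) // p,  # local block number
                                 i % b))         # offset within block
    return np.array(perm)

n, b, p = 12, 2, 3
perm = block_cyclic_permutation(n, b, p)
A = np.arange(n * n, dtype=float).reshape(n, n)
P = np.eye(n)[perm]                  # row-permutation matrix P
assert np.allclose(P @ A, A[perm])   # P*A is A with rows as stored
```

Running LU on the concatenated local pieces is then exactly LU of **P·A** (or **P·A·Q** once columns are permuted too), which is what must be remembered when applying the resulting factors.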

Finally, we need to solve the triangular systems **Ly=b** and **Ux=y**
arising from the
*LU* decomposition. On a shared memory machine, this is accomplished by
two calls to the
Level 2 BLAS. Designing such an algorithm on a distributed
memory machine is harder,
because the fewer floating point operations performed (**O(n^2)**
instead of **O(n^3)**) make
it harder to mask the communication
[33,34,35,36].
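
For the shared memory case, the following is a minimal sketch of the two triangular solves in Python, using SciPy's triangular-solve routine as a stand-in for direct Level 2 BLAS (`trsv`) calls; the random test problem is purely illustrative.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Factor A = P*L*U with L unit lower triangular, U upper triangular.
P, L, U = lu(A)

# Forward substitution: solve L*y = P^T*b  (first triangular solve).
y = solve_triangular(L, P.T @ b, lower=True, unit_diagonal=True)
# Back substitution:    solve U*x = y      (second triangular solve).
x = solve_triangular(U, y, lower=False)

assert np.allclose(A @ x, b)
```

Each solve touches **O(n^2)** data while performing only **O(n^2)** operations, which is why, unlike the **O(n^3)**-flop factorization, it leaves little computation behind which to hide communication on a distributed memory machine.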