The optimal choice of block sizes and
depends on the
cost of communication
versus computation. For example, if the communication required to do
pivot search and
swapping of rows is expensive,
should be large.
The execution time is a function
of dimension n, block sizes
and
,
processor counts
and
,
and the cost of computation and communication
(from Chapter CA,
,
we know how to model these).
Given this function, it
may be minimized as a function of
,
,
and
.
Some theoretical analyses of this sort
for special cases may be found in [30] and the references
therein.
See also [31] and [32]. As an example of the
performance that can be
attained in practice, on an Intel Delta with 512 processors the speed
of LU ranged from a little over 1 gigaflop for n=2000 to nearly 12
gigaflops for n=25000.
Even if the layout is not block scatter as described so far,
essentially the same
algorithm may be used. As described in Section 5,
many possible layouts
are related by permutation matrices. So simply performing
the algorithm just described
with (optimal) block sizes and
on the matrix A
as stored is equivalent to
performing the LU decomposition of
where
and
are permutation
matrices. Thus at the cost of keeping track of these permutations
(a possibly nontrivial software
issue), a single algorithm suffices for a wide variety of layouts.
Finally, we need to solve the triangular systems Ly=b and Ux=y
arising from the
LU decomposition. On a shared memory machine, this is accomplished by
two calls to the
Level 2 BLAS. Designing such an algorithm on a distributed
memory machine is harder,
because the fewer floating point operations performed (
instead of
) make
it harder to mask the communication
[33,34,35,36].