The optimal choice of block sizes and depends on the cost of communication versus computation. For example, if the communication required to do pivot search and swapping of rows is expensive, should be large. The execution time is a function of dimension n, block sizes and , processor counts and , and the cost of computation and communication (from Chapter CA,
, we know how to model these). Given this function, it may be minimized as a function of , , and . Some theoretical analyses of this sort for special cases may be found in  and the references therein. See also  and . As an example of the performance that can be attained in practice, on an Intel Delta with 512 processors the speed of LU ranged from a little over 1 gigaflop for n=2000 to nearly 12 gigaflops for n=25000.
Even if the layout is not block scatter as described so far, essentially the same algorithm may be used. As described in Section 5, many possible layouts are related by permutation matrices. So simply performing the algorithm just described with (optimal) block sizes and on the matrix A as stored is equivalent to performing the LU decomposition of where and are permutation matrices. Thus at the cost of keeping track of these permutations (a possibly nontrivial software issue), a single algorithm suffices for a wide variety of layouts.
Finally, we need to solve the triangular systems Ly=b and Ux=y arising from the LU decomposition. On a shared memory machine, this is accomplished by two calls to the Level 2 BLAS. Designing such an algorithm on a distributed memory machine is harder, because the fewer floating point operations performed ( instead of ) make it harder to mask the communication [33,34,35,36].