On the other hand, this brief analysis ignores a number of practical issues:
Another quite different algorithm is Strassen's method, which multiplies matrices recursively by dividing them into 2-by-2 block matrices and multiplying the subblocks using 7 matrix multiplications (recursively) and 15 matrix additions of half the size. This leads to an asymptotic complexity of O(n^(log2 7)) ≈ O(n^2.81) instead of O(n^3). The value of this algorithm lies not just in its asymptotic complexity but in its reduction of the problem to smaller subproblems that eventually fit in fast memory; once the subproblems fit in fast memory, standard matrix multiplication may be used. This approach has led to speedups on relatively large matrices on some machines. A drawback is the need for significant workspace, and somewhat lower numerical stability, although it is adequate for many purposes.
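The recursion can be sketched in Python/NumPy as below. This is the classic Strassen formulation (which uses 18 additions; the 15-addition count corresponds to Winograd's variant); the function name and the `cutoff` parameter, which models the point where subproblems fit in fast memory, are invented for illustration. The sketch assumes square matrices whose dimension is a power of two.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices (dimension a power of two) by Strassen's
    recursion.  Below `cutoff`, fall back to standard multiplication,
    mirroring the point that subproblems fitting in fast memory are
    handled by the ordinary algorithm."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # Strassen's 7 half-size products
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # recombine into the four blocks of the product
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

Counting flops, each level replaces one multiplication of size n with 7 of size n/2, which is where the log2 7 exponent comes from.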
Consider the basic linear algebra subroutine which solves TX = B for X, where T is a given n-by-n triangular matrix, B is a given n-by-m matrix, and X is an n-by-m matrix of unknowns. Give blocked implementations for this subroutine analogous to the ones for matrix multiplication above, and compare their ratios of flops to memory references. How do your answers depend on m?
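One possible blocked structure for this solve, sketched in Python/NumPy for a lower triangular T (the function name and the block size parameter `nb` are invented for illustration, not part of the exercise):

```python
import numpy as np

def blocked_trsm(T, B, nb=32):
    """Solve T X = B for X, with T lower triangular n-by-n and B n-by-m.
    Blocked forward substitution: each diagonal block is solved directly,
    and the update of the trailing rows is a matrix-matrix multiply --
    the part with a high ratio of flops to memory references."""
    n, m = B.shape
    X = B.astype(float).copy()
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        # solve with the jb-by-jb diagonal block (small triangular solve)
        X[j:j+jb] = np.linalg.solve(T[j:j+jb, j:j+jb], X[j:j+jb])
        # rank-jb update of the trailing rows: one GEMM-like call
        X[j+jb:] -= T[j+jb:, j:j+jb] @ X[j:j+jb]
    return X
```

The trailing update does roughly 2(n-j)·jb·m flops on O((n-j)·jb + (n-j)·m) data, so the flop-to-memory-reference ratio grows with the block size and with m.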
Show how to reduce the number of messages sent in the first two loops of the algorithm by sometimes shifting right (down) instead of left (up). How does this reduce the overall cost?
When Cannon's algorithm completes, A and B are not in their original storage locations. Write an algorithm to return A and B to their original positions. What is the running time of your algorithm?
By being a little more flexible about the algorithms we implement, we can mitigate the apparent tradeoff between load balance and applicability of the BLAS. For example, the layout of A in Figure 3 is identical to the layout in Figure 2 of P·A·P^T, where P is a permutation matrix. This shows that running Cannon's algorithm from the last section to multiply A times B in scatter layout is the same as multiplying P·A·P^T and P·B·P^T to get P·(A·B)·P^T, which is the desired product stored in the same scatter layout. Indeed, as long as (1) A and B are both distributed over a square array of processors; (2) the permutations applied to the columns of A and the rows of B are identical; and (3) for all i, the number of columns of A stored by processor column i is the same as the number of rows of B stored by processor row i, the algorithms of the previous section will correctly multiply A and B. The distribution of the product will be determined by the distributions of the rows of A and the columns of B. We will see a similar phenomenon for other distributed memory algorithms later.
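The permutation identity underlying this argument can be checked numerically. The sketch below (the matrices and the permutation are arbitrary examples) verifies that multiplying identically permuted copies of A and B yields the product A·B permuted in the same way, because the interior factors P^T·P cancel:

```python
import numpy as np

n = 6
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# a permutation matrix P (here built from a random permutation of the rows)
perm = rng.permutation(n)
P = np.eye(n)[perm]

# (P A P^T)(P B P^T) = P (A B) P^T, since P^T P = I
lhs = (P @ A @ P.T) @ (P @ B @ P.T)
rhs = P @ (A @ B) @ P.T
assert np.allclose(lhs, rhs)
```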
Using your implementation of Cannon's algorithm from an earlier exercise, confirm by running it that it multiplies matrices correctly even when they are stored in a scattered layout.
The Cholesky factorization of a symmetric positive definite matrix A is A = L·L^T, where L is a lower triangular matrix. The algorithm is very similar to Gaussian elimination, but the special properties of A mean only half as much storage and half as many flops are needed as for standard Gaussian elimination. Here is the analog of Algorithm 6.1 for Cholesky:
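The algorithm listing did not survive extraction; as a sketch (not the book's exact listing), the standard unblocked, column-oriented Cholesky can be written in Python/NumPy as follows. Note that only the lower triangle of A is read:

```python
import numpy as np

def cholesky_lower(A):
    """Unblocked Cholesky: compute lower triangular L with A = L L^T.
    Only the lower triangle of A is accessed."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        # diagonal entry: subtract the squares of the row computed so far
        d = A[j, j] - L[j, :j] @ L[j, :j]
        L[j, j] = np.sqrt(d)  # d > 0 when A is positive definite
        # column below the diagonal
        L[j+1:, j] = (A[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
    return L
```

The half-storage, half-flop savings relative to Gaussian elimination show up here as each column update touching only entries on or below the diagonal.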
Note that Cholesky does not require pivoting. Derive analogs of Algorithms 6.2 through 6.4 for Cholesky. Ideally your algorithm should only need to read and write the lower (or upper) triangular part of A.
Retrieve the LAPACK routine sgesvx from netlib (see section 11 below on how to do this). Write a modified version, which we will call sgesvn, which does not do pivoting. Run both routines on randomly generated test matrices and compare the error bounds they compute. Modify the test matrices to make the (1,1) entry very small. How do the error bounds compare now? Note that LAPACK produces other error bounds besides the one described here.