Algorithm 8.5
takes a demand-driven approach: the updates for a given
column **j** are not computed until needed to complete that column, and
they are computed by the sending processors rather than the receiving
processor. As a result, all of a given processor's contributions to
the updating of the column in question can be combined into a single
aggregate update column, which is then transmitted in a single message
to the processor containing the target column. This approach not only
decreases communication frequency and volume, but it also facilitates a
more efficient computational inner loop. In particular, no
communication is required to complete the columns corresponding to any
subtree that is assigned entirely to a single processor. Thus, with an
appropriate locality-preserving and load-balanced subtree mapping,
Algorithm 8.5
has a perfectly parallel, communication-free initial phase
that is followed by a second phase in which communication takes place
over increasingly larger subsets of processors as the computation
proceeds up the elimination tree, encountering larger subtrees. This
perfectly parallel phase, which is due entirely to sparsity, tends to
constitute a larger proportion of the overall computation as the size
of the problem grows for a fixed number of processors, and thus the
algorithm enjoys relatively high efficiencies for sufficiently large
problems.