Having developed some understanding of the sources of parallelism in sparse Cholesky factorization, we now consider some algorithms for exploiting it. In designing any parallel algorithm, one of the most important decisions is how tasks are to be assigned to processors. In a shared memory parallel architecture, the tasks can easily be assigned to processors dynamically by maintaining a common pool of tasks from which available processors claim work to do. This approach has the additional advantage of providing automatic load balancing to whatever degree is permitted by the chosen task granularity. An implementation of this approach for parallel sparse factorization is given in [67].
In a distributed memory environment, communication costs often prohibit dynamic task assignment or load balancing, and thus we seek a static mapping of tasks to processors. In the case of column-oriented factorization algorithms, this amounts to assigning the columns of the matrix to processors according to some mapping procedure determined in advance. Such an assignment could be made using the block or wrap mappings, or combinations thereof, often used for dense matrices. However, such simple mappings risk wasting much of the large-grain parallelism identified by means of the elimination tree, and may also incur unnecessary communication. For example, the leaf nodes of the elimination tree can be processed in parallel if they are assigned to different processors, but the latter is not necessarily ensured by a simple block or wrap mapping.
A better approach for sparse factorization is to preserve locality by
assigning subtrees of the elimination tree to contiguous subsets of
neighboring processors. A good example of this technique is the
`subtree-to-subcube' mapping often used with hypercube multicomputers
[68].
Of course, the same idea applies to other
network topologies, such as submeshes of a larger mesh. We will assume
that some such mapping is used, and we will comment further on its
implications later. Whatever the mapping, we will denote the processor
containing column j by , or, more generally, if J is a set
of column numbers,
will denote the set of processors
containing the given columns.
One of the earliest and simplest parallel algorithms for sparse Cholesky factorization is the following version of submatrix-Cholesky [69]. Algorithm 8.4 runs on each processor, with each responsible for its own subset, mycols, of columns.
In Algorithm 8.4,
any processor that owns a column of L corresponding
to a leaf node of the elimination tree can complete it immediately
merely by performing the necessary operation, since such a
column depends on no prior columns. The resulting factor columns are
then broadcast (fanned-out) to all other processors that will need them
to update columns that they own. The remainder of the algorithm is
then driven by the arrival of factor columns, as each processor goes
into a loop in which it receives and applies successive factor columns,
in whatever order they may arrive, to whatever columns remain to be
processed. When the modifications of a given column have been
completed, then the
operation is done, the resulting factor
column is broadcast as before, and the process continues until all
columns of L have been computed.
Algorithm 8.4
potentially exploits both the large-grain parallelism
characterized by concurrent s and the medium-grain parallelism
characterized by concurrent
s, but this data-driven approach also
has a number of drawbacks that severely limit its efficiency.
In particular, performing the column updates one at a time by the
receiving processors results in unnecessarily high communication
frequency and volume, and in a relatively inefficient computational
inner loop. The communication requirements can be reduced by careful
mapping and by aggregating updating information over subtrees (see,
e.g., [70,71,72]) but even with this
improvement, the fan-out algorithm is usually not competitive with
other algorithms presented later. The shortcomings of the fan-out
algorithm motivated the formulation of the following fan-in algorithm
for sparse factorization, which is a parallel implementation of
column-Cholesky [73].