In this section it will be convenient to number matrix entries (or subblocks)
and processors from **0** to **n-1** instead of **1** to **n**.

On distributed memory machines the cost model is more complicated than on
shared memory machines, because we will need to worry about the
*data layout*, or how the matrices are
partitioned across the machine. This will determine both the amount
of parallelism and the cost of communication. Recall from the chapter
on Computer Architecture
that the cost of sending a message of **k** words
from one processor to another is **α + β·k**, where
**α** is the *start-up cost* or *latency*, and **β** is the per-word cost,
or reciprocal of *bandwidth*. Therefore to assess the cost of an algorithm
we need to count the number of floating point operations, the number of
messages sent (at a cost of **α** per message), and the total length
of messages sent (at a cost of **β** per word).
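The cost model above can be sketched in a few lines of code. This is a minimal illustration, not part of the original text: the constants `ALPHA`, `BETA`, and `GAMMA` are hypothetical machine parameters chosen only for demonstration.

```python
# Sketch of the distributed-memory cost model: a message of k words
# costs alpha + beta*k, and each flop costs gamma. All constants below
# are illustrative assumptions, not measurements from any real machine.

ALPHA = 1e-4   # hypothetical start-up cost (latency) per message, seconds
BETA = 1e-8    # hypothetical per-word cost (1/bandwidth), seconds per word
GAMMA = 1e-9   # hypothetical cost per floating point operation, seconds

def message_cost(k_words, alpha=ALPHA, beta=BETA):
    """Time to send one message of k_words words: alpha + beta*k."""
    return alpha + beta * k_words

def algorithm_cost(num_flops, num_messages, total_words,
                   alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Total cost = flop term + per-message latency term + per-word term."""
    return gamma * num_flops + alpha * num_messages + beta * total_words

# Because the start-up cost alpha is paid once per message, sending
# n words as one message is cheaper than sending n one-word messages:
n = 1000
assert message_cost(n) < n * message_cost(1)
```

The final assertion illustrates why the model rewards aggregating data into fewer, larger messages, a theme that recurs in the matrix multiplication algorithms below.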
We begin by showing how best to implement matrix multiplication without
regard to the layout's suitability for other matrix operations, and
return to the question of layouts in the next section.