In this section it will be convenient to number matrix entries (or subblocks) and processors from 0 to n-1 instead of 1 to n.
On distributed memory machines the cost model is more complicated than on shared memory machines, because we need to worry about the data layout, that is, how the matrices are partitioned across the machine. The layout determines both the amount of parallelism and the cost of communication. Recall from the chapter on Computer Architecture that the cost of sending a message of k words from one processor to another is $\alpha + k\beta$, where $\alpha$ is the start-up cost or latency, and $\beta$ is the per-word cost, or reciprocal of bandwidth. Therefore, to assess the cost of an algorithm we need to count the number of floating point operations, the number of messages sent (at a cost of $\alpha$ per message), and the total length of messages sent (at a cost of $\beta$ per word). We begin by showing how best to implement matrix multiplication without regard to the layout's suitability for other matrix operations, and return to the question of layouts in the next section.
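As a rough illustration of this cost model, the following Python sketch adds up the three counts. The function names, the per-flop time t_flop, and the numerical values of alpha and beta are assumptions made only for this example; they do not describe any particular machine.

```python
def message_cost(k, alpha, beta):
    """Time to send one message of k words: alpha + k*beta."""
    return alpha + k * beta


def algorithm_cost(flops, messages, words, t_flop, alpha, beta):
    """Estimated time from the three counts in the text:
    flops     - number of floating point operations
    messages  - number of messages sent (alpha each)
    words     - total length of all messages (beta per word)
    t_flop    - assumed time per floating point operation
    """
    return flops * t_flop + messages * alpha + words * beta


if __name__ == "__main__":
    # Hypothetical machine parameters, in arbitrary time units.
    alpha, beta = 100.0, 2.0

    # Same 1000 words of data, sent as one message or as 1000 messages.
    one_big = message_cost(1000, alpha, beta)
    many_small = 1000 * message_cost(1, alpha, beta)
    print(one_big, many_small)
```

With a latency much larger than the per-word cost, as assumed here, sending the same data in one long message is far cheaper than sending it in many short ones, which is why both the number of messages and their total length are counted separately.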