In this section it will be convenient to number matrix entries (or subblocks) and processors from to instead of to .
On distributed memory machines the cost model is more complicated than on shared memory machines, because we will need to worry about the data layout, or how the matrices are partitioned across the machine. This will determine both the amount of parallelism and the cost of communication. Recall from the chapter on Computer Architecture that the cost of sending a message of words from one processor to another is , where is the start-up cost or latency, and is the per-word cost, or reciprocal of bandwidth. Therefore to assess the cost of an algorithm we need to count the number of floating point operations, the number of messages sent (at a cost of per message), and the total length of messages sent (at a cost of per word). We begin by showing how best to implement matrix multiplication without regard to the layout's suitability for other matrix operations, and return to the question of layouts in the next section.
The algorithm is due to Cannon  and is well suited for computers laid out in a square mesh, i.e. where each processor communicates most efficiently with the four other processors immediately north, east, south and west of itself. We also assume the processors at the edges of the grid are directly connected to the processors on the opposite edge; this makes the topology that of a two-dimensional torus. Let be partitioned into square subblocks as before, with stored on processor . Let and be partitioned similarly. The algorithm is given below. It is easily seen that whenever and `meet' in processor , they are multiplied and accumulated in ; the products for the different are accumulated in different orders.
Figure 1 illustrates the functioning of this algorithm for .
The time spent by this algorithm is computed as follows. Assume we multiply -by- matrices on an -by- processor mesh, where for simplicity divides evenly. Assuming messages are sent between nearest neighbors only, and that a processor can only send a single message at a time, the total number of messages sent (in parallel) is , the total number of words sent (in parallel) is , and the total number of flops (in parallel) is .
(See exercises 3, 4, 5, and 6.)
A variation of this algorithm suitable for machines that are efficient at spreading subblocks across rows (or down columns) is to do this instead of the preshifting and rotation of (or ) .
Cannon's algorithm may also be easily adapted to a hypercube . The simplest way is to embed a grid (or two-dimensional torus) in a hypercube, i.e. map the processors in a grid to the processors in a hypercube, and the connections in a grid to a subset of the connections in a hypercube . This approach (which is useful for more than matrix multiplication) uses only a subset of the connections in a hypercube, which makes the communication slower than it need be. Several sophisticated improvements on this basic algorithm have been developed , the latter of which fully utilizes the available bandwidth of the hypercube to reduce the number of messages sent by a factor of 2 and the number of words sent by a factor of nearly . This assumes each processor in the hypercube can send messages to all its neighbors simultaneously, as was the case on the CM-2. If the architecture permits us to overlap communication and computation, we can save up to another factor of two in speed.
In this section we have shown one can optimize matrix multiplication in a series of steps tuning it ever more highly for a particular computer architecture, until essentially every communication link and floating point unit is utilized. The algorithms are scalable, in that they continue to run efficiently on larger machines and larger problems, with communication costs becoming ever smaller with respect to computation. On the other hand, let us ask what we lose by optimizing so heavily for one architecture. Our high performance depends on the matrices having just the right dimensions, being laid out just right in memory, and leaving them in a scrambled final position at the end (although a modest amount of extra communication could repair this). It is unreasonable to expect users, who want to do several computations of which this is but one, to satisfy all these requirements. Therefore a practical algorithm will have to deal with many irregularities, and be quite complicated. Our ability to do this extreme optimization is limited to a few simple and regular problems like matrix multiplication on a hypercube, as well as other heavily used kernels like the BLAS, which have indeed been highly optimized for many architectures. We do not expect equal success for more complicated algorithms on all architectures of interest, at least within a reasonable amount of time. Also, the algorithm is highly tuned to a particular interconnection network topology, which may require redesign for another machine. In view of this, a number of recent machines try to make communication time appear as independent of topology as possible, so the user sees essentially a completely connected topology.