The time spent by this algorithm is computed as follows. Assume we multiply
**n**-by-**n** matrices on an **N**-by-**N** processor mesh, where for simplicity
**N** divides **n** evenly. Assuming messages are sent between nearest neighbors
only, and that a processor can only send a single message at a time, the total
number of messages sent (in parallel) is **4N-2**, the total number of words
sent (in parallel) is **(4N-2)n^2/N^2** (each message carries one
**(n/N)**-by-**(n/N)** block), and the total number of
flops (in parallel) is **2n^3/N^2**.

(See the exercises.)
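The block shifts and rotations described above can be simulated serially; the sketch below (a hypothetical helper, `cannon_matmul`, not code from any reference cited here) partitions **A** and **B** into an **N**-by-**N** grid of blocks, preskews row *i* of **A** left by *i* and column *j* of **B** up by *j*, then performs **N** multiply-and-rotate steps:

```python
import numpy as np

def cannon_matmul(A, B, N):
    """Serial simulation of Cannon's algorithm on an N-by-N mesh.

    A and B are n-by-n with N dividing n evenly; "processor" (i, j)
    owns one (n/N)-by-(n/N) block of A, B, and C.
    """
    n = A.shape[0]
    b = n // N
    # Partition into N-by-N grids of b-by-b blocks.
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(N)] for i in range(N)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(N)] for i in range(N)]
    Cb = [[np.zeros((b, b)) for _ in range(N)] for _ in range(N)]
    # Preskew: shift row i of A left by i, and column j of B up by j.
    Ab = [[Ab[i][(j + i) % N] for j in range(N)] for i in range(N)]
    Bb = [[Bb[(i + j) % N][j] for j in range(N)] for i in range(N)]
    # N steps: multiply local blocks, then rotate A left and B up by one.
    for _ in range(N):
        for i in range(N):
            for j in range(N):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        Ab = [[Ab[i][(j + 1) % N] for j in range(N)] for i in range(N)]
        Bb = [[Bb[(i + 1) % N][j] for j in range(N)] for i in range(N)]
    return np.block(Cb)
```

On a real mesh each rotation above would be one nearest-neighbor message of **(n/N)^2** words per processor, which is where the message and word counts quoted above come from.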

A variation of this algorithm, suitable for machines that are
efficient at *spreading* subblocks across rows (or down columns),
does this instead of the preshifting and rotation of **A** (or **B**)
[21].

Cannon's algorithm may also be easily adapted to a hypercube [3]. The simplest way is to embed a grid (or two-dimensional torus) in a hypercube, i.e. map the processors in a grid to the processors in a hypercube, and the connections in a grid to a subset of the connections in a hypercube [22,23]. This approach (which is useful for more than matrix multiplication) uses only a subset of the connections in a hypercube, which makes the communication slower than it need be. Several sophisticated improvements on this basic algorithm have been developed [24,25], the latter of which fully utilizes the available bandwidth of the hypercube to reduce the number of messages sent by a factor of 2 and the number of words sent by a factor of nearly **log<sub>2</sub>N**. This assumes each processor in the hypercube can send messages to all its neighbors simultaneously, as was the case on the CM-2. If the architecture permits us to overlap communication and computation, we can save up to another factor of two in speed.
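The grid-to-torus embedding mentioned above is commonly done with a binary-reflected Gray code, so that neighboring mesh processors (including the wraparound neighbors) land on hypercube nodes differing in exactly one address bit. A minimal sketch, assuming **N** is a power of two (the function names are illustrative, not from the cited references):

```python
def gray(i):
    # Binary-reflected Gray code: consecutive integers (cyclically,
    # for a power-of-two range) differ in exactly one bit.
    return i ^ (i >> 1)

def grid_to_hypercube(i, j, N):
    """Map mesh processor (i, j) on an N-by-N torus, N a power of 2,
    to a node of the log2(N^2)-dimensional hypercube.

    The high log2(N) address bits encode the row, the low bits the
    column, so a one-step move along either mesh direction flips a
    single hypercube address bit (i.e., crosses one hypercube link).
    """
    return gray(i) * N + gray(j)
```

Because each mesh processor has 4 neighbors while a hypercube node on **N^2** processors has **2log<sub>2</sub>N** links, this embedding leaves most hypercube links idle, which is exactly the inefficiency the improved algorithms [24,25] exploit.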