The time spent by this algorithm is computed as follows. Assume we multiply n-by-n matrices on an N-by-N processor mesh, where for simplicity N divides n evenly. Assuming messages are sent between nearest neighbors only, and that a processor can only send a single message at a time, the total number of messages sent (in parallel) is 4N-2, the total number of words sent (in parallel) is , and the total number of flops (in parallel) is .
(See exercises 13, 13, 13, and 13.)
A variation of this algorithm suitable for machines that are efficient at spreading subblocks across rows (or down columns) is to do this instead of the preshifting and rotation of A (or B) .
Cannon's algorithm may also be easily adapted to a hypercube . The simplest way is to embed a grid (or two-dimensional torus) in a hypercube, i.e. map the processors in a grid to the processors in a hypercube, and the connections in a grid to a subset of the connections in a hypercube [22,23]. This approach (which is useful for more than matrix multiplication) uses only a subset of the connections in a hypercube, which makes the communication slower than it need be. Several sophisticated improvements on this basic algorithm have been developed [24,25], the latter of which fully utilizes the available bandwidth of the hypercube to reduce the number of messages sent by a factor of 2 and the number of words sent by a factor of nearly . This assumes each processor in the hypercube can send messages to all its neighbors simultaneously, as was the case on the CM-2. If the architecture permits us to overlap communication and computation, we can save up to another factor of two in speed.