The algorithm is due to Cannon [20] and is well suited for computers
laid out in a square mesh, i.e. where each processor communicates
most efficiently with the four other processors immediately north,
east, south and west of itself. We also assume the processors
at the edges of the grid are directly connected to the processors on
the opposite edge; this makes the topology that of a two-dimensional
torus. Let A be partitioned into square subblocks
as before, with
stored on processor
. Let
B and C be partitioned similarly. The algorithm is given below. It
is easily seen that whenever
and
`meet' in
processor i,j, they are multiplied and accumulated in
;
the products for the different
are accumulated in different
orders.
Figure 1 illustrates the functioning of this algorithm for N=3.
Figure 1: Cannon's algorithm for N=3.