The algorithm is due to Cannon  and is well suited for computers laid out in a square mesh, i.e. where each processor communicates most efficiently with the four other processors immediately north, east, south and west of itself. We also assume the processors at the edges of the grid are directly connected to the processors on the opposite edge; this makes the topology that of a two-dimensional torus. Let A be partitioned into square subblocks as before, with stored on processor . Let B and C be partitioned similarly. The algorithm is given below. It is easily seen that whenever and `meet' in processor i,j, they are multiplied and accumulated in ; the products for the different are accumulated in different orders.
Figure 1 illustrates the functioning of this algorithm for N=3.
Figure 1: Cannon's algorithm for N=3.