The history of shared memory multiprocessors goes back to the early 1970s, to two influential research projects at Carnegie-Mellon University. The first machine, named c.mmp (from the PMS notation for ``computer with multiple mini-processors''), was organized around a crossbar switch that connected 16 PDP-11 processors to 16 memory banks. The second, cm*, also used PDP-11 processors, but connected them via the tree-shaped network shown in Figure 17 on page 63. The basic building block for this system was a processor cluster, which consisted of four processors, each with its own local memory. The global memory space was evenly partitioned among the memories in the system. When a processor generated a request for an address, its bus logic would check whether that address was in the range of addresses held in that processor's local memory. If it wasn't, the request was transferred to a cluster controller, which would check whether the address belonged to any other memory within that cluster. If not, the request would be routed up the tree to another level of cluster controllers. In all, 50 processors were connected by three levels of buses.
cm* was an early example of a non-uniform memory access (NUMA) architecture. Depending on whether an item was in a processor's local memory, in another memory within the same cluster, or in another cluster, the time to fetch it was 3 µs, 9 µs, or 27 µs, respectively. As a reference point, a PDP-11 of this era, without the cluster interconnection logic, could fetch an item from main memory in about 2 µs.
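The three-level lookup described above can be sketched in a few lines of Python. This is only an illustration, not a model of the real bus logic: the memory size per processor (`WORDS_PER_MEMORY`) and the function names are hypothetical, and the sketch assumes processors are numbered consecutively so that each group of four forms a cluster. The three access times are the figures quoted in the text, in the same units.

```python
# Illustrative sketch of a cm*-style address lookup.
# PROCS_PER_CLUSTER comes from the text; WORDS_PER_MEMORY is an
# assumed value chosen only to make the example concrete.
PROCS_PER_CLUSTER = 4
WORDS_PER_MEMORY = 4096  # hypothetical size of each local memory

# Relative access times quoted in the text (local, same cluster, other cluster).
LATENCY = {"local": 3, "cluster": 9, "remote": 27}


def owner(address):
    """Return the processor whose local memory holds this address.

    Assumes the global address space is split evenly, in order,
    among the local memories.
    """
    return address // WORDS_PER_MEMORY


def classify(requester, address):
    """Decide which level of the hierarchy satisfies the request."""
    home = owner(address)
    if home == requester:
        return "local"      # handled by the processor's own bus logic
    if home // PROCS_PER_CLUSTER == requester // PROCS_PER_CLUSTER:
        return "cluster"    # handled by the cluster controller
    return "remote"         # routed up the tree to another cluster


def access_time(requester, address):
    """Access time for a request, in the units used in the text."""
    return LATENCY[classify(requester, address)]
```

For example, under these assumptions processor 0 reaches address 100 locally, address 4096 through its cluster controller (it lives in processor 1's memory), and address 16384 only by going up the tree (it lives in processor 4's memory, in the next cluster).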
One of the first commercial systems of this type was the BBN Butterfly. As its name implies, it consisted of a butterfly switch connecting up to 256 processors and memories. The processors were Motorola 68000 single-chip microprocessors. BBN added an extra path from the processors to memory by pairing up each processor with one of the memory modules, so each processor had a ``favored'' memory unit. The processor could access this memory directly without going through the switch. The result was a NUMA architecture, with a ratio of about 15:1 in access times depending on whether the processor used the butterfly switch or the direct connection.
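To see why the 15:1 ratio matters, it helps to compute an average access time for a given mix of references. The sketch below is a simple weighted average, with time measured in units of one direct (favored-memory) access. The fraction of references that go to the favored memory is a hypothetical workload parameter; the text gives only the ratio.

```python
def mean_access_time(local_fraction, local_time=1.0, ratio=15.0):
    """Average access time for a processor on a NUMA machine.

    local_fraction: share of references served by the favored memory
                    (a hypothetical workload parameter, between 0 and 1).
    local_time:     time for a direct access, in arbitrary units.
    ratio:          switch access time divided by direct access time
                    (about 15 for the Butterfly, according to the text).
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    remote_time = ratio * local_time
    return local_fraction * local_time + (1.0 - local_fraction) * remote_time
```

With half of the references going through the switch, `mean_access_time(0.5)` gives 8.0, eight times the cost of a purely local program, which suggests why data placement matters so much on a machine of this kind.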
A recent commercial system in this category, with computing power and scalability that could potentially make it widely used in computational science, is the KSR-1 from Kendall Square Research. Processing elements (PEs) are connected in rings, with between 8 and 32 PEs per ring. Larger systems have a second-level ring that connects up to 34 first-level rings, for a maximum machine size of 1088 processors. Each ring is unidirectional, i.e. information flows in only one direction, with a bandwidth of 1 GB/sec.
Each PE has a 32 MB cache, but there is no primary memory. This unusual organization uses a cache directory to access all information. When a processor makes a reference to an item at some location, the cache line that contains that location migrates around the rings until it reaches the requesting processor. If two or more processors need the same item, the hardware implements the cache coherency protocols necessary to keep every copy up to date. This type of system is also known as shared virtual memory.
The processors in the KSR-1 are proprietary 64-bit superscalar processors with a 20 MHz clock rate. According to the LINPACK benchmark report, a single KSR-1 processor achieves 31 MFLOPS out of a theoretical peak of 40 MFLOPS on Gaussian elimination. A 32-node system reaches 513 MFLOPS, a speedup of about 16.5 over a single processor.
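The arithmetic behind these figures is easy to reproduce. The short Python sketch below uses only the numbers quoted in the text; the function names are illustrative, and parallel efficiency (speedup divided by the number of processors) is a standard derived measure that the benchmark report quoted here does not state directly.

```python
def speedup(parallel_rate, single_rate):
    """Ratio of the parallel rate to the single-processor rate."""
    return parallel_rate / single_rate


def efficiency(parallel_rate, single_rate, processors):
    """Speedup divided by the number of processors used."""
    return speedup(parallel_rate, single_rate) / processors


# Figures quoted in the text (MFLOPS on the LINPACK benchmark).
SINGLE_MFLOPS = 31.0
PEAK_MFLOPS = 40.0
PARALLEL_MFLOPS = 513.0
NODES = 32

fraction_of_peak = SINGLE_MFLOPS / PEAK_MFLOPS
ksr_speedup = speedup(PARALLEL_MFLOPS, SINGLE_MFLOPS)
ksr_efficiency = efficiency(PARALLEL_MFLOPS, SINGLE_MFLOPS, NODES)

print(f"fraction of peak: {fraction_of_peak:.3f}")
print(f"speedup on {NODES} nodes: {ksr_speedup:.1f}")
print(f"parallel efficiency: {ksr_efficiency:.2f}")
```

A single processor therefore runs at a little over three quarters of its peak rate, while the 32-node system delivers a speedup of about 16.5, or roughly half of the ideal factor of 32.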