One of the first commercial systems of this type was the BBN Butterfly. As its name implies, it consisted of a butterfly switch connecting up to 256 processors and memories. The processors were Motorola 68000 single-chip microprocessors. BBN added an extra path from the processors to memory by pairing up each processor with one of the memory modules, so each processor had a "favored" memory unit. The processor could access this memory directly without going through the switch. The result was a NUMA architecture, with a ratio of about 15:1 in access times depending on whether the processor used the butterfly switch or the direct connection.
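To see why the 15:1 ratio matters, the average access time can be estimated from the fraction of references that go to the favored memory. The following is a minimal sketch, not a model of the actual Butterfly hardware: the function name, the unit local access time, and the sample fractions are assumptions chosen for illustration; only the 15:1 ratio comes from the text.

```python
def mean_access_time(local_fraction, local_time=1.0, ratio=15.0):
    """Average memory access time for a two-level NUMA machine.

    local_fraction: share of references served by the favored (direct) memory.
    local_time:     cost of a direct access (arbitrary unit, assumed).
    ratio:          remote/local cost ratio (about 15 in the text).
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must lie in [0, 1]")
    remote_time = ratio * local_time
    # Weighted average of the two access paths.
    return local_fraction * local_time + (1.0 - local_fraction) * remote_time


if __name__ == "__main__":
    for f in (1.0, 0.9, 0.5, 0.0):
        print(f"local fraction {f:.1f}: mean access time {mean_access_time(f):.2f}")
```

Under these assumptions, even when 90% of references are local, the average access costs 2.4 times a purely local one, which is why data placement matters so much on a NUMA machine.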
A recent commercial system in this category, with computing power and scalability that could potentially make it widely used in computational science, is the KSR-1 from Kendall Square Research. Processing elements (PEs) are connected in rings, with 8 to 32 PEs per ring. Larger systems have a second-level ring that connects up to 34 first-level rings, for a maximum machine size of 34 x 32 = 1088 processors. Each ring is unidirectional, i.e., information flows in only one direction, with a bandwidth of 1 GB/sec.
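The ring arithmetic can be checked with a short sketch. The constants come from the text; the function names and the hop-count helper are illustrative assumptions. The helper only counts positions on a single unidirectional ring and does not model the second-level ring or any timing.

```python
MAX_PES_PER_RING = 32        # from the text: 8 to 32 PEs per ring
MAX_FIRST_LEVEL_RINGS = 34   # from the text: up to 34 first-level rings


def max_processors(rings=MAX_FIRST_LEVEL_RINGS, pes_per_ring=MAX_PES_PER_RING):
    """Largest machine size for a two-level ring hierarchy."""
    return rings * pes_per_ring


def ring_hops(src, dst, ring_size):
    """Hops from position src to position dst on a unidirectional ring.

    Traffic flows in one direction only, so reaching the neighbor
    "behind" the sender means travelling almost all the way around.
    """
    return (dst - src) % ring_size


if __name__ == "__main__":
    print(max_processors())      # 1088
    print(ring_hops(0, 1, 32))   # 1
    print(ring_hops(1, 0, 32))   # 31
```

The asymmetry between the last two results is a direct consequence of the one-way flow of information on the ring.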
Each PE has a 32 MB cache, but there is no primary memory. This unusual organization uses a cache directory to access all information. When a processor makes a reference to an item in location i, the cache line that contains i migrates around the rings until it reaches the requesting processor. If two or more processors need the same item, the hardware implements the necessary cache coherency protocols to keep all copies up to date. This type of system is also known as a shared virtual memory.
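The idea of data migrating to whichever processor references it can be illustrated with a toy model. This is a deliberately simplified sketch and not the KSR-1 protocol: the class name, the 128-byte line size, and the single-owner rule are all assumptions, and the model ignores shared read-only copies, coherency traffic, and ring latency.

```python
class ToyCacheOnlyMemory:
    """Toy directory for a machine whose only storage is per-PE caches.

    Each cache line has exactly one owner at a time (an assumption made
    to keep the sketch short; real hardware also allows shared copies).
    """

    def __init__(self, num_pes, line_size=128):
        self.num_pes = num_pes
        self.line_size = line_size   # bytes per line (assumed value)
        self.owner = {}              # line number -> PE holding the line

    def line_of(self, address):
        """Map a byte address to the number of the line containing it."""
        return address // self.line_size

    def reference(self, pe, address):
        """Reference an address from a PE.

        Returns "hit" if the line is already in that PE's cache;
        otherwise migrates the line to the PE and returns "miss".
        """
        if not 0 <= pe < self.num_pes:
            raise ValueError("unknown PE")
        line = self.line_of(address)
        if self.owner.get(line) == pe:
            return "hit"
        self.owner[line] = pe        # the line moves to the requester
        return "miss"


if __name__ == "__main__":
    mem = ToyCacheOnlyMemory(num_pes=4)
    print(mem.reference(0, 100))  # miss: first touch brings the line to PE 0
    print(mem.reference(0, 120))  # hit: same line, already local
    print(mem.reference(1, 100))  # miss: the line migrates to PE 1
    print(mem.reference(0, 100))  # miss: PE 0 no longer holds the line
```

The last two references show the cost of two processors alternately touching the same line: each access forces a migration, which is the situation the hardware coherency protocol has to manage.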
Figure 18: Kendall Square Research KSR-1.
The processors in the KSR-1 are proprietary 64-bit superscalar processors with a 20 MHz clock rate. According to the LINPACK benchmark report, a single KSR-1 processor achieves 31 MFLOPS out of a theoretical peak of 40 MFLOPS on Gaussian elimination. A 32-node system reaches 513 MFLOPS, a speedup of about 16.5 over a single processor.
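The speedup figure follows directly from the two measured rates, and dividing it by the processor count gives the parallel efficiency. The sketch below uses only the numbers quoted above; the function names are illustrative.

```python
def speedup(parallel_rate, single_rate):
    """Ratio of the parallel rate to the single-processor rate."""
    return parallel_rate / single_rate


def efficiency(speedup_value, num_processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup_value / num_processors


if __name__ == "__main__":
    single_mflops = 31.0     # measured, one processor
    peak_mflops = 40.0       # theoretical peak, one processor
    parallel_mflops = 513.0  # measured, 32 processors
    nodes = 32

    s = speedup(parallel_mflops, single_mflops)
    print(f"fraction of peak on one processor: {single_mflops / peak_mflops:.1%}")
    print(f"speedup on {nodes} processors: {s:.1f}")
    print(f"parallel efficiency: {efficiency(s, nodes):.1%}")
```

By this measure the 32-node run achieves a little over half of the ideal linear speedup, while a single processor sustains about three quarters of its theoretical peak.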