The history of shared-memory multiprocessors goes back to the
early 1970s, to two influential research projects at Carnegie-
Mellon University. The first machine, named c.mmp (from the PMS
notation for ``computer with multiple mini-processors''), was
organized around a crossbar switch that connected 16 PDP-11
processors to 16 memory banks. The second, cm*, also used PDP-11
processors, but connected them via the tree-shaped network shown
in Figure 17 on page 63. The basic building block for this system
was a processor cluster, which consisted of four processors, each
with its own local memory. The global memory space was evenly
partitioned among the memories in the system. When a processor
generated a memory request, its bus logic would check whether the
requested address fell within the range of addresses held in that
processor's local memory. If it did not, the request was transferred
to a cluster controller, which would check whether the address
belonged to any other memory within that cluster. If not, the
request would be routed up the tree to another level of cluster
controllers. In all, 50 processors were connected by three levels
of buses.
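
The lookup order described above (local memory first, then the cluster, then the rest of the tree) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a model of the actual cm* hardware: the names `Location`, `owner_of`, and `route` are hypothetical, the memory size of 1024 words is invented, and the address space is assumed to be split into equal contiguous ranges, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Position of a processor/memory pair (hypothetical naming)."""
    cluster: int
    processor: int


def owner_of(address, words_per_memory, procs_per_cluster):
    """Find the memory that holds a global address.

    Assumes equal, contiguous address ranges per memory.
    """
    memory_index = address // words_per_memory
    return Location(cluster=memory_index // procs_per_cluster,
                    processor=memory_index % procs_per_cluster)


def route(requester, address, words_per_memory=1024, procs_per_cluster=4):
    """Classify a request by how far up the tree it must travel."""
    target = owner_of(address, words_per_memory, procs_per_cluster)
    if target == requester:
        return "local"      # satisfied by the processor's own bus logic
    if target.cluster == requester.cluster:
        return "cluster"    # resolved by the cluster controller
    return "remote"         # routed up to another level of controllers


me = Location(cluster=0, processor=0)
print(route(me, 10), route(me, 1024), route(me, 4096))
```

With these assumed sizes, address 10 lies in the requester's own memory, address 1024 in a neighbouring memory of the same cluster, and address 4096 in a different cluster.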
cm* was an early example of a non-uniform memory access (NUMA)
architecture. Depending on whether an item was in a processor's
local memory, within the same cluster, or in another cluster, the
time to fetch it was 3 µs, 9 µs, or 27 µs, respectively.
As a reference point, a PDP-11 of this era, without the cluster
interconnection logic, could fetch an item from main memory in
about 2 µs.
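
To see why the placement of data matters on a NUMA machine, the three fetch times quoted above (3, 9, and 27 microseconds) can be combined into an average access time. The sketch below is only illustrative: the function name `mean_access_time_us` is hypothetical, and the reference mix of 90% local, 8% same-cluster, and 2% remote is an invented example, not a measurement from cm*.

```python
# Fetch times in microseconds, as quoted in the text.
LATENCY_US = {"local": 3.0, "cluster": 9.0, "remote": 27.0}


def mean_access_time_us(fractions):
    """Weighted average fetch time for a given mix of reference types.

    `fractions` maps "local", "cluster", and "remote" to the share of
    references of each kind; the shares must sum to 1.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(LATENCY_US[kind] * share for kind, share in fractions.items())


# Hypothetical mix: most references stay in local memory.
mix = {"local": 0.90, "cluster": 0.08, "remote": 0.02}
print(round(mean_access_time_us(mix), 2))
```

Under this assumed mix, the 2% of references that leave the cluster contribute about as much to the average as the 8% that stay within it, which is why keeping data close to the processor that uses it pays off.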
One of the first commercial systems of this type was the BBN Butterfly. As its name implies, it consisted of a butterfly switch connecting up to 256 processors and memories. The processors were Motorola 68000 single-chip microprocessors. BBN added an extra path from the processors to memory by pairing up each processor with one of the memory modules, so each processor had a ``favored'' memory unit. The processor could access this memory directly without going through the switch. The result was a NUMA architecture, with a ratio of about 15:1 in access times depending on whether the processor used the butterfly switch or the direct connection.
A recent commercial system in this category, with computing power and scalability that could potentially make it widely used in computational science, is the KSR-1 from Kendall Square Research. Processing elements (PEs) are connected in rings, with 8 to 32 PEs per ring. Larger systems have a second-level ring that connects up to 34 first-level rings, for a maximum machine size of 1088 processors. Each ring is unidirectional, i.e., information flows in only one direction, with a bandwidth of 1 GB/sec.
Each PE has a 32 MB cache, but there is no primary memory. This
unusual organization uses a cache directory to access all
information. When a processor makes a reference to an item, the
cache line that contains that item migrates around the rings until
it reaches the requesting processor. If two or more processors need
an item, the hardware implements the necessary cache coherency
protocols to keep every copy of the item up to date. This type of
system is also known as a shared virtual memory.
The processors in the KSR-1 are proprietary 64-bit superscalar
processors with a 20 MHz clock. According to the LINPACK benchmark
report, a single KSR-1 processor achieves 31 MFLOPS out of a
theoretical peak of 40 MFLOPS on Gaussian elimination. A 32-node
system reaches 513 MFLOPS, a speedup of about 16.5.
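
The speedup figure follows directly from the two LINPACK rates quoted above, and dividing it by the number of processors gives the parallel efficiency. The short Python sketch below reproduces that arithmetic; the function names `speedup` and `efficiency` are chosen here for illustration and do not come from the benchmark report.

```python
def speedup(parallel_rate, single_rate):
    """Ratio of the parallel rate to the single-processor rate."""
    return parallel_rate / single_rate


def efficiency(speedup_value, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup_value / processors


s = speedup(513.0, 31.0)       # MFLOPS figures quoted in the text
e = efficiency(s, 32)          # 32-node system
peak_fraction = 31.0 / 40.0    # single-processor rate vs. theoretical peak

print(round(s, 1), round(e, 2), round(peak_fraction, 3))
```

In other words, a single processor runs at a little over three quarters of its theoretical peak, and the 32-node system achieves roughly half of the ideal linear speedup.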