A straightforward way to connect several processors together to build a multiprocessor is shown in Figure 7. The physical connections are quite simple. Most bus structures allow an arbitrary (but not too large) number of devices to communicate over the bus. Bus protocols were initially designed to allow a single processor and one or more disk or tape controllers to communicate with memory. If the I/O controllers are replaced by processors, one has a small single-bus multiprocessor.
The problem with this design is that processors must contend for access to the bus. If one processor is fetching an instruction, all other processors must wait until the bus is free. If there are only two processors, they can perform close to their maximum rate, since the bus can alternate between them: while one processor is decoding and executing an instruction, the other can use the bus to fetch its next instruction. However, when a third processor is added, performance begins to degrade. Usually by the time 10 processors are connected to the bus, the performance curve has flattened out, so that adding an 11th processor will not increase performance at all. The underlying limit is that the memory and bus have a fixed bandwidth, determined by a combination of the cycle time of the memory and the bus protocol, and in a single-bus multiprocessor this bandwidth is divided among several processors. If the processor cycle is slow compared to the memory cycle, a fairly large number of processors can be accommodated by this plan, but in practice processor cycles are usually much faster than memory cycles, so this scheme is not widely used.
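The fixed-bandwidth argument can be illustrated with a very rough model. The sketch below is only an upper bound under stated assumptions: the function name `bus_speedup` and the parameter `bus_fraction` (the fraction of time one processor, running alone, keeps the bus busy) are hypothetical, and the value 0.1 is chosen only so that the curve flattens at 10 processors, as in the discussion above. It ignores queueing delays, so it does not reproduce the gradual degradation that begins with the third processor.

```python
def bus_speedup(n_processors: int, bus_fraction: float) -> float:
    """Upper bound on speedup when n processors share a single bus.

    bus_fraction is an assumed quantity: the fraction of time a single
    processor, running alone, keeps the bus busy.
    """
    if n_processors < 1:
        raise ValueError("need at least one processor")
    if not 0.0 < bus_fraction <= 1.0:
        raise ValueError("bus_fraction must be in (0, 1]")
    # The bus can serve at most 1/bus_fraction processors' worth of traffic,
    # so total throughput cannot exceed that, however many processors exist.
    return min(float(n_processors), 1.0 / bus_fraction)


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 11, 20):
        print(n, bus_speedup(n, bus_fraction=0.1))
```

With the assumed fraction of 0.1, the bound grows linearly up to 10 processors and stays at the same value for 11 or more, which is the flattening described in the text.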
A slight modification to this design will improve performance, but it cannot indefinitely postpone the flattening of the performance curve. If each processor has its own local cache, there is a high probability that the instruction or data it wants is in the local cache. A reasonable cache hit rate will greatly reduce the number of bus accesses a processor makes and thus improve overall efficiency. The ``knee'' of the performance curve, which identifies the point up to which it is still cost-effective to add processors, can now be around 20 processors, and the curve will not flatten out until around 30 processors.
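The effect of a cache on bus traffic can be sketched with a similar back-of-the-envelope calculation. This is an illustration under assumptions, not a model of any particular machine: `saturation_point`, `bus_fraction` (the fraction of time an uncached processor keeps the bus busy) and `hit_rate` are hypothetical names, and the model assumes that only cache misses use the bus, ignoring write traffic and coherency traffic.

```python
def saturation_point(bus_fraction: float, hit_rate: float) -> float:
    """Rough number of processors at which a shared bus saturates.

    Assumes only cache misses reach the bus, so each processor's bus
    demand shrinks from bus_fraction to bus_fraction * (1 - hit_rate).
    """
    if not 0.0 < bus_fraction <= 1.0:
        raise ValueError("bus_fraction must be in (0, 1]")
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    effective_demand = bus_fraction * (1.0 - hit_rate)
    return 1.0 / effective_demand


if __name__ == "__main__":
    # Illustrative values only; neither figure is taken from the text.
    for h in (0.0, 0.5, 0.9):
        print(f"hit rate {h:.1f}: saturates near {saturation_point(0.5, h):.1f} processors")
```

In this model a higher hit rate pushes the saturation point out, but for any hit rate below 1 the point remains finite, which is why a cache postpones the flattening of the curve without removing it.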
Giving each processor its own cache introduces a difficulty known as the cache coherency problem. In its simplest form, the problem may be exemplified by the following scenario. Suppose two processors use data item A, so that A ends up in the cache of both processors. Next suppose processor 1 performs a calculation that changes A. When it is done, the new value of A is written out to main memory. Processor 2 at a later time needs to fetch A. However, since A was already in its cache, it will use the cached value and not the newly updated value calculated by processor 1. Maintaining a consistent version of shared data requires providing new versions of the cached data to each processor whenever one of the processors updates its copy.
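The scenario above can be reproduced in a few lines. The following is a toy sketch, not a description of any real cache protocol: the `Cache` class, its `peers` list, and `run_scenario` are hypothetical names, main memory is a plain dictionary, and coherency is modeled in the simplest way the text suggests, by pushing the new value to every other cache that holds the item whenever one processor writes it.

```python
class Cache:
    """A toy write-through cache in front of a shared memory dictionary."""

    def __init__(self, memory, peers=None):
        self.memory = memory   # shared dict standing in for main memory
        self.lines = {}        # address -> locally cached value
        self.peers = peers     # all caches on the bus, or None for no coherency

    def read(self, addr):
        if addr not in self.lines:               # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                  # hit: use the local copy

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value                # write through to main memory
        if self.peers is not None:
            # Assumed update scheme: give the new value to every other
            # cache that currently holds this address.
            for other in self.peers:
                if other is not self and addr in other.lines:
                    other.lines[addr] = value


def run_scenario(coherent: bool):
    """Return (value processor 2 sees, value in main memory) after P1 updates A."""
    memory = {"A": 1}
    peers = [] if coherent else None
    p1, p2 = Cache(memory, peers), Cache(memory, peers)
    if coherent:
        peers.extend([p1, p2])
    p1.read("A")                       # A ends up in both caches
    p2.read("A")
    p1.write("A", p1.read("A") + 41)   # processor 1 changes A
    return p2.read("A"), memory["A"]


if __name__ == "__main__":
    print("without coherency:", run_scenario(coherent=False))
    print("with updates:     ", run_scenario(coherent=True))
```

Without the update step, processor 2 keeps reading its stale cached copy even though main memory holds the new value; with it, both agree. Real machines use more elaborate schemes, and invalidating the other copies instead of updating them is a common alternative.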
The multiprocessors produced by Sequent, Inc. are classic examples of machines of this type. Their first machine, the Balance 8000, was intended to compete with the DEC VAX 780, a popular minicomputer at that time. A 2-processor configuration gave slightly less performance than the VAX, but the next larger configuration, with four processors, was faster. The operating system was a modified version of Unix. There was a single global task queue, and each processor could fetch a task from the queue, execute it until it blocked or timed out, and return it to the queue. Thus the system implemented a form of job-level parallelism. Sequent also provided a library of procedures that allowed users to write parallel programs, and the machine became a popular testbed for parallel languages and algorithms. The current machines, in the Symmetry series, are widely used for on-line transaction processing.
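The single global task queue can be sketched with threads standing in for processors. This is a minimal illustration of the scheduling structure only, not of the actual Sequent operating system: `run_jobs`, `quantum`, and the representation of a task as a name plus a count of remaining work units are all assumptions made for the example. Python threads also do not run CPU-bound code truly in parallel, so the sketch shows the queue discipline rather than any speedup.

```python
import queue
import threading


def run_jobs(work_units, n_processors=4, quantum=3):
    """Run tasks from one shared queue; return time slices used per task.

    work_units maps a task name to its (assumed) amount of work.
    """
    tasks = queue.Queue()
    for name, units in work_units.items():
        tasks.put((name, units))
    slices = {name: 0 for name in work_units}
    lock = threading.Lock()

    def processor():
        while True:
            item = tasks.get()                 # fetch a task from the global queue
            if item is None:                   # sentinel: no more work
                tasks.task_done()
                return
            name, remaining = item
            remaining -= min(quantum, remaining)   # "execute" for one time slice
            with lock:
                slices[name] += 1
            if remaining > 0:
                tasks.put((name, remaining))   # timed out: return it to the queue
            tasks.task_done()

    workers = [threading.Thread(target=processor) for _ in range(n_processors)]
    for w in workers:
        w.start()
    tasks.join()                               # wait until every task has finished
    for _ in workers:
        tasks.put(None)                        # tell each processor to stop
    for w in workers:
        w.join()
    return slices


if __name__ == "__main__":
    print(run_jobs({"a": 7, "b": 3, "c": 1}, n_processors=2, quantum=3))
```

Because any idle processor takes whatever task is next in the queue, a task may run on a different processor in each time slice, which is the job-level parallelism described above.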
Programming a shared memory machine is fairly straightforward. Programming constructs such as semaphores, fork-join, and monitors, which were developed for communication and synchronization of parallel processes in operating systems and other concurrent programming applications, have been adapted for parallel processing. The implementation of the basic synchronization primitives from which these constructs are built is more complex in a parallel system, but this complexity is hidden from users. For example, the bus in the Sequent Symmetry has provisions for implementing a pool of semaphores so that processes are guaranteed to gain exclusive access to shared structures.
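As a small illustration of two of these constructs, the sketch below uses a semaphore to guarantee exclusive access to a shared counter, and a fork-join pattern to start the workers and wait for them. It uses Python's standard `threading` module rather than any Sequent library; the function name `parallel_count` and its parameters are hypothetical.

```python
import threading


def parallel_count(n_threads=4, increments=1000):
    """Increment a shared counter from several threads under a semaphore."""
    total = 0
    mutex = threading.Semaphore(1)   # binary semaphore guarding the counter

    def worker():
        nonlocal total
        for _ in range(increments):
            with mutex:              # acquire; released on leaving the block
                total += 1           # read-modify-write on shared data

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:                # fork
        t.start()
    for t in threads:                # join
        t.join()
    return total


if __name__ == "__main__":
    print(parallel_count())
```

The user-level code only acquires and releases the semaphore; how the underlying atomic operation is implemented, in hardware or in the bus, is hidden, as the paragraph above notes.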
Another way of building a shared memory multiprocessor is shown in Figure 8. In these designs, the bus has been replaced by a switch that routes requests from a processor to one of several different memory modules. Even though there are several physical memories, there is one large virtual address space. The advantage of this organization is that the switch can handle multiple requests in parallel. Each processor can be paired up with a memory, and each can then run at full speed as it accesses the memory it is currently connected to. Contention still occurs, though. If two processors make requests of the same memory module, only one will be given access and the other will be blocked. Several machines with this design will be discussed in the survey of MIMD machines following the section on interconnection topology, which introduces concepts that will explain various switch designs.
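Contention at the switch can be sketched for a single memory cycle. The example below is a simplified illustration under assumptions that are not in the text: addresses are mapped to modules by taking the address modulo the number of modules, and when two processors want the same module the one with the lower identifier wins. The function name `arbitrate` is hypothetical.

```python
def arbitrate(requests, n_modules):
    """Decide which processors reach memory in one cycle.

    requests maps a processor id to the address it wants.
    Returns (granted, blocked) as lists of processor ids.
    """
    busy = set()                     # modules already claimed this cycle
    granted, blocked = [], []
    for proc in sorted(requests):    # assumed fixed priority: lowest id first
        module = requests[proc] % n_modules   # assumed address-to-module mapping
        if module in busy:
            blocked.append(proc)     # module taken: this processor must wait
        else:
            busy.add(module)
            granted.append(proc)     # switch connects processor to module
    return granted, blocked


if __name__ == "__main__":
    # Four processors, four modules, all requests to different modules.
    print(arbitrate({0: 0, 1: 1, 2: 2, 3: 3}, n_modules=4))
    # Processors 0 and 1 both want module 0; processor 2 wants module 1.
    print(arbitrate({0: 4, 1: 8, 2: 1}, n_modules=4))
```

When every request goes to a different module, all processors proceed in the same cycle; when two requests collide on one module, one of them is blocked, which is the contention described above.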