So far the discussion
of high performance computing has concentrated on increasing the
amount of processing power in a system, either through
parallelism, which seeks to increase the number of instructions
that can be executed in a time period, or through pipelining,
which improves the instruction throughput. Another, equally
important, aspect of high performance computing is the
organization of the memory system. No matter how fast one makes
the processing unit, if the memory cannot keep up and provide
instructions and data at a sufficient rate there will be no
improvement in performance. The main problem that needs to be
overcome in matching memory response to processor speed is the
memory cycle time, defined in section 2.2 to be the
time between two successive memory operations. Processor cycle
times are typically much shorter than memory cycle times. When a
processor initiates a memory transfer at time
, the memory will
be ``busy'' until
, where is the memory cycle
time. During this
period no other device - I/O controller, other processors, or
even the processor that makes the request - can use the memory
since it will be busy responding to the request.
Solutions to the memory access problem have led to a dichotomy in parallel systems. In one type of system, known as a shared memory system, there is one large virtual memory, and all processors have equal access to data and instructions in this memory. The other type of system is a distributed memory, in which each processor has a local memory that is not accessible from any other processor.
The
difference between shared or distributed memory is a difference
in the structure of virtual memory, i.e. the memory as seen from
the perspective of a processor. Physically, almost every memory
system is partitioned into separate components that can be
accessed independently. What distinguishes a shared memory from a
distributed memory is how the memory subsystem interprets an
address generated by a processor. As an example, suppose a
processor executes the instruction load R0,i, which means
``load
register R0 with the contents of memory location i''
(denoted
Mem[i]). The question is, what does i mean? In a shared memory
system, i is a global address, and Mem[i] to one processor is the
same memory cell as Mem[i] to another processor. If both
processors execute this instruction at the same time they will
both load the same information into their R0 registers. In a
distributed memory system, i is a local address. If two
processors both execute load R0,i they may end up with different
values in their R0 registers since Mem[i] designates two
different memory cells, one in the local memory of each
processor.
The distinction between shared memory and distributed memory is an important one for programmers because it determines how different parts of a parallel program will communicate. In a shared memory system it is only necessary to build a data structure in memory and pass references to the data structure to parallel subroutines. For example, a matrix multiplication routine that breaks matrices into quadrants only needs to pass the indices of each quadrant to the parallel subroutines. A distributed memory machine on the other hand must create copies of shared data in each local memory. These copies are created by sending a message containing the data to another processor. In the matrix multiplication example, the controlling process would have to send messages to three other processors. Each message would contain the submatrices required to compute one quadrant of the result. A drawback to this memory organization is that these messages might have to be quite large; in this example, half of each input matrix needs to be sent to each parallel subroutine.
In this section we will explore the range of techniques used to connect processors to memories in high performance computers and how these techniques affect programmers. The first section is on interleaved memory, a method long used in vector processors to provide successive vector elements at a rate that matches the cycle time in the pipelined data processing units. The next two sections deal with shared memory and distributed memory organizations for parallel systems.