3.3 Memory Organizations




 

So far the discussion of high performance computing has concentrated on increasing the amount of processing power in a system, either through parallelism, which seeks to increase the number of instructions that can be executed in a time period, or through pipelining, which improves the instruction throughput. Another, equally important, aspect of high performance computing is the organization of the memory system. No matter how fast one makes the processing unit, if the memory cannot keep up and provide instructions and data at a sufficient rate there will be no improvement in performance. The main problem that needs to be overcome in matching memory response to processor speed is the memory cycle time, defined in section 2.2 to be the time between two successive memory operations. Processor cycle times are typically much shorter than memory cycle times. When a processor initiates a memory transfer at time t, the memory will be ``busy'' until t + tm, where tm is the memory cycle time. During this period no other device - I/O controller, other processor, or even the processor that made the request - can use the memory, since it is busy responding to the request.
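The mismatch can be made concrete with some round numbers (assumed here purely for illustration): a single memory bank with cycle time tm can complete at most one access per tm, so a processor that wants one access per processor cycle spends most of its time stalled.

```python
# Illustration with assumed round numbers: a processor with a 2 ns
# cycle issuing one memory request per cycle against a single memory
# bank with a 10 ns cycle time (tm).
processor_cycle_ns = 2    # hypothetical processor cycle time
memory_cycle_ns = 10      # hypothetical memory cycle time tm

# A single bank completes at most one access per tm, so its peak
# sustained rate is 1 / tm accesses per second.
peak_accesses_per_sec = 1e9 / memory_cycle_ns

# If the processor demands one access per processor cycle, the bank
# satisfies only this fraction of the demand; the rest are stalls.
fraction_served = processor_cycle_ns / memory_cycle_ns
stall_cycles_per_access = memory_cycle_ns / processor_cycle_ns - 1

print(peak_accesses_per_sec)    # 100000000.0 accesses/s
print(fraction_served)          # 0.2
print(stall_cycles_per_access)  # 4.0
```

Interleaving the memory into several independent banks, discussed in section 3.3.1, is one way to close this gap.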

Solutions to the memory access problem have led to a dichotomy in parallel systems. In one type of system, known as a shared memory system, there is one large virtual memory, and all processors have equal access to data and instructions in this memory. The other type of system is a distributed memory, in which each processor has a local memory that is not accessible from any other processor.

The difference between shared and distributed memory is a difference in the structure of virtual memory, i.e. the memory as seen from the perspective of a processor. Physically, almost every memory system is partitioned into separate components that can be accessed independently. What distinguishes a shared memory from a distributed memory is how the memory subsystem interprets an address generated by a processor. As an example, suppose a processor executes the instruction load R0,i, which means ``load register R0 with the contents of memory location i'' (denoted Mem[i]). The question is, what does i mean? In a shared memory system, i is a global address, and Mem[i] to one processor is the same memory cell as Mem[i] to another processor. If both processors execute this instruction at the same time they will both load the same information into their R0 registers. In a distributed memory system, i is a local address. If two processors both execute load R0,i they may end up with different values in their R0 registers, since Mem[i] designates two different memory cells, one in the local memory of each processor.
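The two interpretations of the address i can be sketched in a small simulation (the class and variable names below are invented for illustration; this models the addressing semantics, not real hardware):

```python
# Sketch: how the same address i is interpreted under shared vs.
# distributed memory. Mem[i] is modeled as a dictionary lookup.

shared_mem = {7: "global value"}   # one global address space

class Processor:
    """A processor with a register R0 and its own local memory."""
    def __init__(self, local_mem):
        self.local_mem = local_mem
        self.R0 = None

    def load_shared(self, i):
        # Shared memory: i is a global address; every processor
        # that issues load R0,i reaches the same cell Mem[i].
        self.R0 = shared_mem[i]

    def load_local(self, i):
        # Distributed memory: i is a local address; each processor
        # has its own, private cell Mem[i].
        self.R0 = self.local_mem[i]

p1 = Processor({7: "p1's value"})
p2 = Processor({7: "p2's value"})

# Shared memory: both processors load the same contents into R0.
p1.load_shared(7); p2.load_shared(7)
assert p1.R0 == p2.R0

# Distributed memory: the same instruction, the same address i,
# but two different memory cells and two different results.
p1.load_local(7); p2.load_local(7)
assert p1.R0 != p2.R0
```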

The distinction between shared memory and distributed memory is an important one for programmers because it determines how different parts of a parallel program will communicate. In a shared memory system it is only necessary to build a data structure in memory and pass references to the data structure to parallel subroutines. For example, a matrix multiplication routine that breaks matrices into quadrants only needs to pass the indices of each quadrant to the parallel subroutines. A distributed memory machine on the other hand must create copies of shared data in each local memory. These copies are created by sending a message containing the data to another processor. In the matrix multiplication example, the controlling process would have to send messages to three other processors. Each message would contain the submatrices required to compute one quadrant of the result. A drawback to this memory organization is that these messages might have to be quite large; in this example, half of each input matrix needs to be sent to each parallel subroutine.
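The communication cost difference in the matrix multiplication example can be sketched as follows (a minimal illustration with invented helper names; in shared memory the "message" to a worker is just a pair of index ranges, while in distributed memory it is a copy of half of each input matrix):

```python
# Sketch of the quadrant matrix-multiply example. In shared memory a
# worker receives only index ranges and reads A and B in place; in
# distributed memory the controller must send copies of the data.

def quadrant_ranges(n, q):
    """Row and column ranges of quadrant q (0..3) of an n x n matrix."""
    h = n // 2
    r0 = 0 if q < 2 else h
    c0 = 0 if q % 2 == 0 else h
    return range(r0, r0 + h), range(c0, c0 + h)

def multiply_quadrant(A, B, rows, cols):
    """Compute C[r][c] of C = A * B for r in rows, c in cols."""
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in cols]
            for r in rows]

n = 4
A = [[r * n + c for c in range(n)] for r in range(n)]
B = [[1 if r == c else 0 for c in range(n)] for r in range(n)]  # identity

# Shared memory: pass only the index ranges for, say, the
# bottom-right quadrant of the result; no data is copied.
rows, cols = quadrant_ranges(n, 3)
C_br = multiply_quadrant(A, B, rows, cols)

# Distributed memory: the controller must instead send copies of the
# needed halves - the bottom rows of A and the right columns of B,
# i.e. half of each input: n*n/2 + n*n/2 = n*n elements in total.
msg_A = [A[r][:] for r in rows]                       # n/2 full rows of A
msg_B = [[B[k][c] for c in cols] for k in range(n)]   # n/2 full columns of B
message_elements = sum(len(row) for row in msg_A) + \
                   sum(len(row) for row in msg_B)
print(message_elements)  # n*n = 16 elements for this 4 x 4 example
```

Since B is the identity here, the computed quadrant is simply the bottom-right quadrant of A, which makes the result easy to check by hand.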

In this section we will explore the range of techniques used to connect processors to memories in high performance computers and how these techniques affect programmers. The first section is on interleaved memory, a method long used in vector processors to provide successive vector elements at a rate that matches the cycle time in the pipelined data processing units. The next two sections deal with shared memory and distributed memory organizations for parallel systems.





