3 High Performance Computer Architecture

next up previous
Next: 3.1 Flynn's Taxonomy Up: CA Chapter Previous: 2.7 Performance Models

3 High Performance Computer Architecture


As described in Section 3.6 the performance of a computer system is defined by three factors. The time to execute a program is a function of the number of instructions to execute, the average number of clock cycles required per instruction, and the clock cycle time:

Lowering the clock cycle time is mostly a matter of engineering, through the use of more advanced materials or production techniques that allow the construction of smaller (and thus faster and more efficient) circuits. In this section we will survey several techniques for designing architectures that improve the other two factors.

The common thread that runs through all these techniques is parallelism, which is achieved by replicating basic components in the system. For example, an architect may use four adder/multiplier units instead of one inside the CPU, or connect two or more memories to the CPU in order to increase bandwidth, or connect two or more processors to one memory in order to increase the number of instructions executed per unit time, or even replicate the entire computer (processor, memory, and I/O connections) in a network of machines that all work together on the same program. Parallelism has existed in the minds of computer architects from the time of Charles Babbage in the early 19th century, and has been manifested in a large number of machines in a variety of ways that might be classified in distinct levels [13]:

Job level parallelism is also exploited within a single computer by treating a job or several jobs as a collection of independent tasks. For example, several jobs may reside in memory at the same time with only one in execution at any given time. When that job requires some I/O services, such as a read from disc, the operation is initiated, the job requiring the service is suspended, and another job is put into execution state. At some future time, after the I/O operation has completed and the data is available, control will pass back to the original job and execution will continue. In this example, the CPU and the I/O system are functioning in parallel.

Parallelism at the program level is generally manifested in two ways: independent sections of a given program, and individual iterations of a loop where there are no dependencies between iterations. This type of parallelism may be exploited by multiple processors or multiple functional units. For example, the following code segment calls for the calculation of sums:

          DO 10 I=1,N
          A(I) = B(I) + C (I)
10        CONTINUE

The sums are independent, i.e. the calculation of does not depend on for any . That means they can be done in any order, and in particular a machine with processors could do them all at the same time.

Obviously, more complex segments may also be treated in parallel depending on the sophistication of the architecture. This is the level of parallelism with which we will primarily be concerned in this book.

The next lower level of parallelism is at the instruction level, where individual instructions may be overlapped or a given instruction may be decomposed into suboperations with the suboperations overlapped. In the first case, for example, it is common to find a load instruction, which copies a value from memory to an internal CPU register, overlapped with an arithmetic instruction. The second situation is exemplified by the ubiquitous pipeline that has become the mainstay for arithmetic processing. In general, programmers need not concern themselves with this level of parallelism, since compilers are adept at reorganizing programs to exploit this form of parallelism. Nevertheless, one should keep in mind that the quality of compilers varies greatly from system to system and one may have to structure the code in particular ways to help the compiler make maximum use of the hardware. For example, as we will see below, Cray supercomputers are most efficient when vector lengths are 64 or smaller, and rearranging programs to operate on small segments of long vectors can improve performance. In addition, awareness of the internal structure of a computer is often necessary when analyzing the performance of a program.

A concept related to the level of parallelism is the granularity of parallel tasks. A large grain system is one in which the operations that run in parallel are fairly large, on the order of entire programs. Small grain parallel systems divide programs into very small pieces, in some cases only a few instructions. The example used above of a processor that calculates sums in parallel is an example of very fine grain parallelism.

When designing a machine the architect must make a decision to use a relatively small number of powerful processors or a large number of simple processors to achieve the desired performance. The latter approach is often termed massively parallel.gif At one extreme are the systems built by Cray Research Inc. that consist of two to sixteen very powerful vector processors; at the other extreme are arrays of tens of thousands of very simple processors, exemplified by the CM-1 from Thinking Machines Corporation, which has up to 65,536 single-bit processors. The motivation for a small number of powerful processors is that they are simpler to interconnect and they lend themselves to an implementation of memory organizations that make the systems relatively easy to program. On the other hand such processors are very expensive to build, power and cool. Some architects have used commodity microprocessors which offer great economies of scale at the expense of more complex interconnection strategies. With the rapid increase in the power of microprocessors, arrays of a few hundred processors have the same theoretical peak performance of the fastest machines offered by Cray Research Inc.

This section of the book explores the design space of high performance machines, including single processor vector machines, parallel systems with a few powerful processors, and massively parallel architectures. Section 3.1 introduces a popular taxonomy of parallel systems. The next three sections cover major concepts of parallel systems organization, including pipelining, schemes for interconnecting processors and memories, and scalability. Section 3.5 is a survey of the major types of parallel machines, with an emphasis on systems that have been used in computational science. Finally, Section 3.6 present models for analyzing the performance of parallel computer systems.

next up previous
Next: 3.1 Flynn's Taxonomy Up: CA Chapter Previous: 2.7 Performance Models