As described in Section 3.6 the performance of a computer system is defined by three factors. The time to execute a program is a function of the number of instructions to execute, the average number of clock cycles required per instruction, and the clock cycle time:

Lowering the clock cycle time is mostly a matter of engineering, through the use of more advanced materials or production techniques that allow the construction of smaller (and thus faster and more efficient) circuits. In this section we will survey several techniques for designing architectures that improve the other two factors.
The common thread that runs through all these techniques is parallelism, which is achieved by replicating basic components in the system. For example, an architect may use four adder/multiplier units instead of one inside the CPU, or connect two or more memories to the CPU in order to increase bandwidth, or connect two or more processors to one memory in order to increase the number of instructions executed per unit time, or even replicate the entire computer (processor, memory, and I/O connections) in a network of machines that all work together on the same program. Parallelism has existed in the minds of computer architects from the time of Charles Babbage in the early 19th century, and has been manifested in a large number of machines in a variety of ways that might be classified in distinct levels [13]:
Job level parallelism is also exploited within a single computer by treating a job or several jobs as a collection of independent tasks. For example, several jobs may reside in memory at the same time with only one in execution at any given time. When that job requires some I/O services, such as a read from disc, the operation is initiated, the job requiring the service is suspended, and another job is put into execution state. At some future time, after the I/O operation has completed and the data is available, control will pass back to the original job and execution will continue. In this example, the CPU and the I/O system are functioning in parallel.
Parallelism at the program
level is generally manifested in two ways: independent sections
of a given program, and individual iterations of a loop where
there are no dependencies between iterations. This type of
parallelism may be exploited by multiple processors or multiple
functional units. For example, the following code segment calls
for the calculation of
sums:
DO 10 I=1,N
A(I) = B(I) + C (I)
10 CONTINUE
The sums are independent, i.e. the calculation of
does not
depend on
for any
. That means they
can be done in any order,
and in particular a machine with
processors could do them all
at the same time.
Obviously, more complex segments may also be treated in parallel depending on the sophistication of the architecture. This is the level of parallelism with which we will primarily be concerned in this book.
The next lower level of parallelism is at the instruction level, where individual instructions may be overlapped or a given instruction may be decomposed into suboperations with the suboperations overlapped. In the first case, for example, it is common to find a load instruction, which copies a value from memory to an internal CPU register, overlapped with an arithmetic instruction. The second situation is exemplified by the ubiquitous pipeline that has become the mainstay for arithmetic processing. In general, programmers need not concern themselves with this level of parallelism, since compilers are adept at reorganizing programs to exploit this form of parallelism. Nevertheless, one should keep in mind that the quality of compilers varies greatly from system to system and one may have to structure the code in particular ways to help the compiler make maximum use of the hardware. For example, as we will see below, Cray supercomputers are most efficient when vector lengths are 64 or smaller, and rearranging programs to operate on small segments of long vectors can improve performance. In addition, awareness of the internal structure of a computer is often necessary when analyzing the performance of a program.
A concept related to the level of
parallelism is the granularity of parallel tasks. A large grain
system is one in which the operations that run in parallel are
fairly large, on the order of entire programs. Small grain
parallel systems divide programs into very small pieces, in some
cases only a few instructions. The example used above of a
processor that calculates
sums in parallel is an example of
very fine grain parallelism.
When designing a machine the
architect must make a decision to use a relatively small number
of powerful processors or a large number of simple processors to
achieve the desired performance. The latter approach is often
termed massively parallel.
At one extreme are the systems built
by Cray Research Inc. that consist of two to sixteen very
powerful vector processors; at the other extreme are arrays of
tens of thousands of very simple processors, exemplified by the
CM-1 from Thinking Machines Corporation, which has up to 65,536
single-bit processors. The motivation for a small number of
powerful processors is that they are simpler to interconnect and
they lend themselves to an implementation of memory organizations
that make the systems relatively easy to program. On the other
hand such processors are very expensive to build, power and cool.
Some architects have used commodity microprocessors which offer
great economies of scale at the expense of more complex
interconnection strategies. With the rapid increase in the power
of microprocessors, arrays of a few hundred processors have the
same theoretical peak performance of the fastest machines offered
by Cray Research Inc.
This section of the book explores the design space of high performance machines, including single processor vector machines, parallel systems with a few powerful processors, and massively parallel architectures. Section 3.1 introduces a popular taxonomy of parallel systems. The next three sections cover major concepts of parallel systems organization, including pipelining, schemes for interconnecting processors and memories, and scalability. Section 3.5 is a survey of the major types of parallel machines, with an emphasis on systems that have been used in computational science. Finally, Section 3.6 present models for analyzing the performance of parallel computer systems.