Hockney and Jesshope [13] introduced
two parameters to describe the performance of vector processors.
The first parameter is the theoretical peak performance, or
asymptotic performance, denoted
. It is the
maximum possible rate
of computation, expressed as a number of floating point
operations per second. The parameter may be applied to a single
vector pipeline or to an entire system. Thus
for a
single pipe
of the Cray Y-MP is 167 MFLOPS (6ns cycle, one result per cycle),
and approximately 2.6 GFLOPS for an 8-processor system with the
add and multiply units in operation simultaneously. The other
parameter, designated
and known as the half
performance length,
is the length of the vector for which a system attains half of
its peak performance, i.e.
.
is
a function of vector startup
time and pipeline depth. As these values increase it becomes
harder and harder to achieve near peak performance for the system
because it requires algorithms with longer and longer vectors. As
we saw in section 2.2, the startup times for the CDC
vector computers were an order of magnitude greater than those
from Cray Research. This is reflected in
for the two systems
differing also by an order of magnitude. Even though
was higher
for the CDC machines, the systems from Cray proved more popular
than those from CDC, so users seem to find lower
more
important.
Perhaps the most fundamental performance question that can be asked of an algorithm running on a parallel system is ``does it run faster, and if so by how much?''. Ideally, if one uses P processors to solve a given problem, the execution time would be cut by a factor of P. This leads to a definition of speedup, which is the ratio of the execution time on one processor to the execution time on P processors:

For example, if a program takes
18 minutes to run on one processor, but only 4.7 minutes on four
processors, the speedup is a factor of
.
The strength of such a
measure is that it uses observed execution time and thus takes
into account any overhead in the parallel system for breaking a
job into parallel tasks and intertask communication time.
Comparing time on one processor vs. time on
processors can be
misleading, however. One might be tempted to write a program for
a
-processor machine, time it first on one processor and then on
processors, and call the ratio the speedup. Plotted for
different values of
this procedure gives an accurate measure of
the scalability of the algorithm used, but it does not answer the
question how much faster a problem may be solved using
processors since a parallel algorithm usually incurs overheads
that are not found in sequential algorithms.
Ortega and Voigt [24] defined speedup as the
ratio of the solution time for the
best serial algorithm with that required by the parallel
algorithm:

In the 1960's, Amdahl [1] noted that speedup is
limited by the size of the portion of a problem that is not
executed faster. For example, suppose a program that executes in
10.0 seconds contains a key subroutine that accounts for 80% of
the execution time. The rest of the program uses 20% of the total
time, or 2.0 seconds. If we use a more efficient version of the
subroutine that runs twice as fast, total execution time will
drop to 6.0 seconds (
seconds for the subroutine,
2.0 seconds for
the remainder of the program). If we find a parallel subroutine
that speeds up perfectly on P processors, and run the program on
an 8-processor machine, execution time will drop to 3.0 seconds
(
seconds for the parallel portion, 2.0 seconds for the
sequential portion). If we run on 100 processors, the
total execution time
will be 2.08 seconds. As more processors are used, the execution
time gets closer to the time required for the sequential part,
but it can never get lower than this. Since the fastest this
program will ever run is 2.0 seconds, no matter how many
processors are used, the maximum speedup is a factor of 5.0.
If we normalize the formula that defines speedup by letting the sequential execution time be 1 and expressing the other times as percentages of the sequential time, we derive the following formulation of Amdahl's law for parallel processors:

This
version of the equation makes the contribution of the sequential
portion of the computation more apparent. Here
is the fraction
of the program that can be performed in parallel, and thus
is the portion that is sequential. The denominator is the time to
execute the program in parallel, i.e. the sum of the time spent
in the sequential portion and the time spent in the parallel
part, where the parallel time is a function of the speedup factor
of the parallel portion. If the parallel portion exhibits perfect
speedup, i.e. a factor of
when run on P processors, the
equation becomes:

The efficiency of a parallel computation is the ratio of the speedup to the number of processors used to obtain that speedup:

For example, if 10 processors are used and
the program runs 10 times faster, we are running at the maximum
possible speed, i.e. all processors are being used to their full
capacity. If the speedup is only a factor of 5, however, the
efficiency is
, i.e. half the computing power is
lost in overhead or synchronization.
The above models do not try to characterize the execution of a parallel program. They simply measure the time required to execute a program on a given machine, and compare that time to sequential execution times. A simple model that breaks a parallel program into constituent parts is

The three
components of overall execution time are
, the
computation time,
, the time spent in communication, and
, the
time used to
synchronize the processors at appropriate points in the
algorithm. This type of analysis is very important for algorithms
that will run on distributed memory machines, where locality and
communication costs will play a major role in efficiency.
Expressions for communication complexity can range from very
simple, e.g. all communication requires the same amount of time
(a reasonable model in some cases for a shared memory system), to
very complex, as would be required for an accurate model of a
distributed memory system that communicates by message passing.
In the former case, the number of accesses to memory times the
average access time might suffice. In the latter case, a more
complex analysis is necessary for an accurate model. For example,
the time to send a message from processor
to processor
in a packet switched network is often modeled by
an expression such as

where
is the overhead in setting up the message in the sending
processor (and, if necessary, storing a message in the receiving
processor),
is the distance from
to
, and
is
the length
of the message in packets. Again, this simple formula hides many
additional complexities that might or might not effect
performance: the existence of a separate processor at each
processing node to handle messages, contention at links or nodes,
whether or the routing is fixed or dynamic, etc.
The overhead for synchronization can also involve a number of issues depending on the algorithm and the type of synchronization mechanism used. For example, an SIMD system automatically synchronizes parallel subtasks and no further modeling is required. For MIMD systems, in addition to the overhead associated with the actual synchronization process, there is also the idle time created when processors wait at a synchronization point. A thorough discussion of synchronization mechanisms may be found in Andrews and Schneider [1983]. The development of a complexity model involving all three of the above factors in great detail may be found in Reed and Patrick [1985]. A general discussion of many of the issues relative to linear algebra algorithms may be found in Ortega [1988].