3.6 Performance Models for Vector and Parallel Machines

Next: 4 Exercises Up: 3 High Performance Computer Previous: 3.5.6 SIMD Machines

## 3.6 Performance Models for Vector and Parallel Machines

Hockney and Jesshope [13] introduced two parameters to describe the performance of vector processors. The first parameter is the theoretical peak performance, or asymptotic performance, denoted . It is the maximum possible rate of computation, expressed as a number of floating point operations per second. The parameter may be applied to a single vector pipeline or to an entire system. Thus for a single pipe of the Cray Y-MP is 167 MFLOPS (6ns cycle, one result per cycle), and approximately 2.6 GFLOPS for an 8-processor system with the add and multiply units in operation simultaneously. The other parameter, designated and known as the half performance length, is the length of the vector for which a system attains half of its peak performance, i.e. . is a function of vector startup time and pipeline depth. As these values increase it becomes harder and harder to achieve near peak performance for the system because it requires algorithms with longer and longer vectors. As we saw in section 2.2, the startup times for the CDC vector computers were an order of magnitude greater than those from Cray Research. This is reflected in for the two systems differing also by an order of magnitude. Even though was higher for the CDC machines, the systems from Cray proved more popular than those from CDC, so users seem to find lower more important.

Perhaps the most fundamental performance question that can be asked of an algorithm running on a parallel system is ``does it run faster, and if so by how much?''. Ideally, if one uses P processors to solve a given problem, the execution time would be cut by a factor of P. This leads to a definition of speedup, which is the ratio of the execution time on one processor to the execution time on P processors:

For example, if a program takes 18 minutes to run on one processor, but only 4.7 minutes on four processors, the speedup is a factor of .

The strength of such a measure is that it uses observed execution time and thus takes into account any overhead in the parallel system for breaking a job into parallel tasks and intertask communication time. Comparing time on one processor vs. time on processors can be misleading, however. One might be tempted to write a program for a -processor machine, time it first on one processor and then on processors, and call the ratio the speedup. Plotted for different values of this procedure gives an accurate measure of the scalability of the algorithm used, but it does not answer the question how much faster a problem may be solved using processors since a parallel algorithm usually incurs overheads that are not found in sequential algorithms. Ortega and Voigt [24] defined speedup as the ratio of the solution time for the best serial algorithm with that required by the parallel algorithm:

In the 1960's, Amdahl [1] noted that speedup is limited by the size of the portion of a problem that is not executed faster. For example, suppose a program that executes in 10.0 seconds contains a key subroutine that accounts for 80% of the execution time. The rest of the program uses 20% of the total time, or 2.0 seconds. If we use a more efficient version of the subroutine that runs twice as fast, total execution time will drop to 6.0 seconds ( seconds for the subroutine, 2.0 seconds for the remainder of the program). If we find a parallel subroutine that speeds up perfectly on P processors, and run the program on an 8-processor machine, execution time will drop to 3.0 seconds ( seconds for the parallel portion, 2.0 seconds for the sequential portion). If we run on 100 processors, the total execution time will be 2.08 seconds. As more processors are used, the execution time gets closer to the time required for the sequential part, but it can never get lower than this. Since the fastest this program will ever run is 2.0 seconds, no matter how many processors are used, the maximum speedup is a factor of 5.0.

If we normalize the formula that defines speedup by letting the sequential execution time be 1 and expressing the other times as percentages of the sequential time, we derive the following formulation of Amdahl's law for parallel processors:

This version of the equation makes the contribution of the sequential portion of the computation more apparent. Here is the fraction of the program that can be performed in parallel, and thus is the portion that is sequential. The denominator is the time to execute the program in parallel, i.e. the sum of the time spent in the sequential portion and the time spent in the parallel part, where the parallel time is a function of the speedup factor of the parallel portion. If the parallel portion exhibits perfect speedup, i.e. a factor of when run on P processors, the equation becomes:

The efficiency of a parallel computation is the ratio of the speedup to the number of processors used to obtain that speedup:

For example, if 10 processors are used and the program runs 10 times faster, we are running at the maximum possible speed, i.e. all processors are being used to their full capacity. If the speedup is only a factor of 5, however, the efficiency is , i.e. half the computing power is lost in overhead or synchronization.

The above models do not try to characterize the execution of a parallel program. They simply measure the time required to execute a program on a given machine, and compare that time to sequential execution times. A simple model that breaks a parallel program into constituent parts is

The three components of overall execution time are , the computation time, , the time spent in communication, and , the time used to synchronize the processors at appropriate points in the algorithm. This type of analysis is very important for algorithms that will run on distributed memory machines, where locality and communication costs will play a major role in efficiency.

Expressions for communication complexity can range from very simple, e.g. all communication requires the same amount of time (a reasonable model in some cases for a shared memory system), to very complex, as would be required for an accurate model of a distributed memory system that communicates by message passing. In the former case, the number of accesses to memory times the average access time might suffice. In the latter case, a more complex analysis is necessary for an accurate model. For example, the time to send a message from processor to processor in a packet switched network is often modeled by an expression such as

where is the overhead in setting up the message in the sending processor (and, if necessary, storing a message in the receiving processor), is the distance from to , and is the length of the message in packets. Again, this simple formula hides many additional complexities that might or might not effect performance: the existence of a separate processor at each processing node to handle messages, contention at links or nodes, whether or the routing is fixed or dynamic, etc.

The overhead for synchronization can also involve a number of issues depending on the algorithm and the type of synchronization mechanism used. For example, an SIMD system automatically synchronizes parallel subtasks and no further modeling is required. For MIMD systems, in addition to the overhead associated with the actual synchronization process, there is also the idle time created when processors wait at a synchronization point. A thorough discussion of synchronization mechanisms may be found in Andrews and Schneider [1983]. The development of a complexity model involving all three of the above factors in great detail may be found in Reed and Patrick [1985]. A general discussion of many of the issues relative to linear algebra algorithms may be found in Ortega [1988].

Next: 4 Exercises Up: 3 High Performance Computer Previous: 3.5.6 SIMD Machines

verena@csep1.phy.ornl.gov