2.7 Performance Models



next up previous
Next: 3 High Performance Computer Up: 2 Basic Computer Architecture Previous: 2.6 Data Representations

2.7 Performance Models

 

The most widely recognized aspect of a machine's internal organization that relates to performance is the clock cycle time, which controls the rate of internal operations in the CPU (Section 2.1). A shorter clock cycle time, or equivalently a larger number of cycles per second, implies more operations can be performed per unit time.

For a given architecture, it is often possible to rank systems according to their clock rates. For example, the HP 9000/725 and 9000/735 workstations have basically the same architecture, meaning they have the same instruction set and, in general, appear to be the same system as far as compiler writers are concerned. The 725 has a 66MHz clock, while the 735 has a 99MHz clock, and indeed the 735 has a higher performance on most programs.

There are several reasons why simply comparing clock cycle times is an inadequate measure of performance. One reason is that processors don't operate ``in a vacuum'', but rely on memories and buses to supply information. The size and access times of the memories and the bandwidth of the bus all play a major role in performance. It is very easy to imagine a program that requires a large amount of memory running faster on an HP 725 that has a larger cache and more main memory than a 735. We will return to the topic of memory organization and processor- memory interconnection in later sections on vector processors and parallel processors since these two aspects of systems organization are even more crucial for high performance in those systems.

A second reason clock rate by itself is an inadequate measure of performance is that it doesn't take into account what happens during a clock cycle. This is especially true when comparing systems with different instruction sets. It is possible that a machine might have a lower clock rate, but because it requires fewer cycles to execute the same program it would have higher performance. For example, consider two machines, A and B, that are almost identical except that A has a multiply instruction and B does not. A simple loop that multiplies a vector by a scalar (the constant 3 in this example) is shown in the table below. The number of cycles for each instruction is given in parentheses next to the instruction.


Table 3 View Table

The first instruction loads an element of the vector into an internal processor register X. Next, machine A multiplies the vector element by 3, leaving the result in the register. Machine B does the same operation by shifting and adding, i.e. . B copies the contents of X to another register Y, shifts X left one bit (which multiplies it by 2), and then adds Y, again leaving the result in X. Both machines then store the result back into the vector in memory and branch back to the top of the loop if the vector index is not at the end of the vector (the comparison and branch are done by the dbr instruction). Machine A might be slightly slower than B, but since it takes fewer cycles it will execute the loop faster. For example if A's cycle time is 9 MHz (.11s per cycle) and B's cycle time is 10 MHz (.10s per cycle) A will execute one pass through the loop in 1.1s but B will require 1.2s.

As a historical note, microprocessor and microcomputer designers in the 1970s tended to build systems with instruction sets like those of machine A above. The goal was to include instructions with a large ``semantic content,'' e.g. multiplication is relatively more complex than loading a value from memory or shifting a bit pattern. The payoff was in reducing the overhead to fetch instructions, since fewer instructions could accomplish the same job. By the 1980s, however, it became widely accepted that instruction sets such as those of machine B were in fact a better match for VLSI chip technology. The move toward simpler instructions became known as RISC, for Reduced Instruction Set Computer. A RISC has fewer instructions in its repertoire, but more importantly each instruction is very simple. The fact that operations are so simple and so uniform leads to some very powerful implementation techniques, such as pipelining, and opens up room on the processor chip for items such as on-chip caches or multiple functional units, e.g. a CPU that has two or more arithmetic units. We will discuss these types of systems in more detail later, in the section on superscalar designs (Section 3.5.2). Another benefit to simple instructions is that cycle times can also be much shorter; instead of being only moderately faster, e.g 10MHz vs. 9MHz as in the example above, cycle times on RISC machines are often much faster, so even though they fetch and execute more instructions they typically outperform complex instruction set (CISC) machines designed at the same time.

In order to compare performance of two machines with different instruction sets, and even different styles of instruction sets (e.g. RISC vs. CISC), we can break the total execution time into constituent parts [11]. The total time to execute any given program is the product of the number of machine cycles required to execute the program and the processor cycle time:

The number of cycles executed can be rewritten as the number of instructions executed times the average number of cycles per instruction:

The middle factor in this expression describes the average number of machine cycles the processor devotes to each instruction. It is the number of cycles per instruction, or CPI. The basic performance model for a single processor computer system is thus

where

The three factors each describe different attributes of the execution of a program. The number of instructions depends on the algorithm, the compiler, and to some extent the instruction set of the machine. Total execution time can be reduced by lowering the instruction count, either through a better algorithm (one that executes an inner loop fewer times, for example), a better compiler (one that generates fewer instructions for the body of the loop), or perhaps by changing the instruction set so it requires fewer instructions to encode the same algorithm. As we saw earlier, however, a more compact encoding as a result of a richer instruction set does not always speed up a program since complex instructions require more cycles. The interaction between instruction complexity and the number of cycles to execute a program is very involved, and it is hard to predict ahead of time whether adding a new instruction will really improve performance.

The second factor in the performance model is CPI. At first it would seem this factor is simply a measure of the complexity of the instruction set: simple instructions require fewer cycles, so RISC machines should have lower CPI values. That view is misleading, however, since it concerns a static quantity. The performance equation describes the average number of cycles per instruction measured during the execution of a program. The difference is crucial. Implementation techniques such as pipelining allow a processor to overlap instructions by working on several instructions at one time. These techniques will lower CPI and improve performance since more instructions are executed in any given time period. For example, the average instruction in a system might require three machine cycles: one to fetch it from cache, one to fetch its operands from registers, and one to perform the operation and store the result in a register. Based on this static description one might conclude the CPI is 3.0, since each instruction requires three cycles. However, if the processor can juggle three instructions at once, for example by fetching instruction while it is locating the operands for instruction and executing instruction , then the effective CPI observed during the execution of the program is just a little over 1.0 (Figure 3). Note that this is another illustration of the difference between speed and bandwidth. Overall performance of a system can be improved by increasing bandwidth, in this case by increasing the number of instructions that flow through the processor per unit time, without changing the execution time of the individual instructions.

The third factor in the performance model is the processor cycle time . This is usually in the realm of computer engineering: a better layout of the components on the surface of the chip might shorten wire lengths and allow for a faster clock, or a different material (e.g. gallium arsenide vs. silicon based semiconductors) might have a faster switching time. However, the architecture can also affect cycle time. One of the reasons RISC is such a good fit for current VLSI technology is that if the instruction set is small, it requires less logic to implement. Less logic means less space on the chip, and smaller circuits run faster and consume less power [12]. Thus the design of the instruction set, the organization of pipelines, and other attributes of the architecture and its implementation can impact cycle time.


Figure 3 Pipelined execution View Figure

We conclude this section with a few remarks on some metrics that are commonly used to describe the performance of computer systems. MIPS stands for ``millions of instructions per second.'' With the variation in instruction styles, internal organization, and number of processors per system it is almost meaningless for comparing two systems. As a point of reference, the DEC VAX 11/780 executed approximately one million instructions per second. You may see a system described as having performance rated at ``X VAX MIPS.'' This is a measure of performance normalized to VAX 11/780 performance. What this means is someone ran a program on the VAX, then ran the same program on the other system, and the ratio is X. The term ``native MIPS'' refers to the number of millions of instructions of the machine's own instruction set that can be executed per second.

MFLOPS (pronounced ``megaflops'') stands for ``millions of floating point operations per second.'' This is often used as a ``bottom-line'' figure. If you know ahead of time how many operations a program needs to perform, you can divide the number of operations by the execution time to come up with a MFLOPS rating. For example, the standard algorithm for multiplying matrices requires operations ( inner products, with multiplications and additions in each product). Suppose you compute the product of two matrices in 0.35 seconds. Your computer achieved

Obviously this type of comparison ignores the overhead involved in setting up loops, checking terminating conditions, and so on, but as a ``bottom line'' it gets to the point: what you care about (in this example) is how long it takes to multiply two matrices, and if that operation is a major component of your research it makes sense to compare machines by how fast they can multiply matrices. A standard set of reference programs known as LINPACK (linear algebra package) is often used to compare systems based on their MFLOPS ratings by measuring execution times for Gaussian elimination on matrices [8].

The term ``theoretical peak MFLOPS'' refers to how many operations per second would be possible if the machine did nothing but numerical operations. It is obtained by calculating the time it takes to perform one operation and then computing how many of them could be done in one second. For example, if it takes 8 cycles to do one floating point multiplication, the cycle time on the machine is 20 nanoseconds, and arithmetic operations are not overlapped with one another, it takes 160ns for one multiplication, and

so the theoretical peak performance is 6.25 MFLOPS. Of course, programs are not just long sequences of multiply and add instructions, so a machine rarely comes close to this level of performance on any real program. Most machines will achieve less than 10% of their peak rating, but vector processors or other machines with internal pipelines that have an effective CPI near 1.0 can often achieve 70% or more of their theoretical peak on small programs.

Using metrics such as CPI, MIPS, or MFLOPS to compare machines depends heavily on the programs used to measure execution times. A benchmark is a program written specifically for this purpose. There are several well-known collections of benchmarks. One that is be particularly interesting to computational scientists is LINPACK, which contains a set of linear algebra routines written in Fortran. MFLOPS ratings based on LINPACK performance are published regularly [8]. Two collections of a wider range of programs are SPEC (System Performance Evaluation Cooperative) and the Perfect Club, which is oriented toward parallel processing. Both include widely used programs such as a C compiler and a text formatter, not just small special purpose subroutines, and are useful for comparing systems such as high performance workstations that will be used for other jobs in addition to computational science modelling.



next up previous
Next: 3 High Performance Computer Up: 2 Basic Computer Architecture Previous: 2.6 Data Representations



verena@csep1.phy.ornl.gov