3.5.2 Superscalar Processors (continued)

A vector processor would load the (x,y) pairs into two vector registers and then use vector instructions. On the RS/6000 the operations are compiled into the following loop (the constants, which appear as cos, sin, xd, and yd in the listing, are loaded into registers before the loop begins):

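L:      load   R8,x[i]            # R8  = x[i]
        fma    R10,R8,cos,xd      # R10 = R8*cos + xd
        load   R9,y[i]            # R9  = y[i]
        fma    R11,R9,cos,yd      # R11 = R9*cos + yd
        fma    R12,R9,-sin,R10    # R12 = R9*(-sin) + R10
        store  R12,x[i]'          # new x value
        fma    R13,R8,sin,R11     # R13 = R8*sin + R11
        store  R13,y[i]'          # new y value
        branch L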
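
Read register by register, the listing computes x' = x*cos - y*sin + xd and y' = x*sin + y*cos + yd for each point. As a rough sketch only, the same loop might look like this in C. The function name, parameter names, and use of double precision are assumptions made for illustration, not taken from the original source program.

```c
#include <stddef.h>

/* Hypothetical C rendering of the loop in the listing above.
 * cos_t, sin_t, xd and yd stand for the constants that the compiled
 * code keeps in registers; all names here are illustrative. */
void rotate_translate(const double *x, const double *y,
                      double *x_out, double *y_out, size_t n,
                      double cos_t, double sin_t, double xd, double yd)
{
    for (size_t i = 0; i < n; i++) {
        double xi = x[i];               /* load  R8,x[i]        */
        double t1 = xi * cos_t + xd;    /* fma   R10,R8,cos,xd  */
        double yi = y[i];               /* load  R9,y[i]        */
        double t2 = yi * cos_t + yd;    /* fma   R11,R9,cos,yd  */
        x_out[i] = yi * -sin_t + t1;    /* fma   R12,... ; store R12 */
        y_out[i] = xi * sin_t + t2;     /* fma   R13,... ; store R13 */
    }
}
```

Each statement of the form a * b + c is a candidate for a single multiply-add instruction. Whether a given compiler actually fuses it depends on the target and the compiler options. C99 also provides fma() in <math.h> for requesting a fused operation explicitly.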

The fma W,X,Y,Z instruction is "floating multiply and add", i.e. W = X*Y+Z. Note that the compiler has carefully interleaved load and store instructions with data processing instructions. Each loop iteration performs eight floating point operations (two per fma instruction), and the loop body contains eight instructions, not counting the branch. Over the entire loop, then, the processor initiates one floating point operation per instruction. Since the instruction fetch unit executes the branch, there are no cycles when the floating point unit is not busy. The machine will deliver one result per cycle for arbitrarily long vectors as long as there are no cache misses. See [10] for a detailed explanation of the timing of this loop.

A 62.5 MHz RS/6000 system ran the LINPACK benchmark (Gaussian elimination) at 104 MFLOPS, against a theoretical peak performance of 125 MFLOPS. The HP 9000/735 workstation has a 99 MHz superscalar HP-PA processor. This machine executes the LINPACK benchmark at 107 MFLOPS, against a theoretical peak of 198 MFLOPS. By comparison, the Cray-1S, with a clock rate of 80 MHz, achieves 110 MFLOPS against a theoretical peak of 160 MFLOPS.

The advantage of the superscalar approach is that it does not rely on a vectorizing compiler to detect loops and turn them into vector instructions. A superscalar machine still requires a very sophisticated compiler to allocate resources and schedule operations in an order that takes best advantage of the machine's resources, but in the long run the superscalar approach may be more flexible and applicable to a wider range of applications than vector processing.
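
The peak and LINPACK figures quoted above can be cross-checked with a little arithmetic. Each quoted peak equals twice the clock rate in MHz, which is consistent with two floating point operations completing per cycle. That factor of two is an inference from the numbers, not a statement from the text about how each machine reaches its peak. The helper names below are hypothetical.

```c
/* Peak rate in MFLOPS, assuming a fixed number of floating point
 * operations complete per clock cycle (an assumption of this sketch). */
double peak_mflops(double clock_mhz, double flops_per_cycle)
{
    return clock_mhz * flops_per_cycle;
}

/* Fraction of the theoretical peak reached by a measured rate. */
double fraction_of_peak(double measured_mflops, double peak)
{
    return measured_mflops / peak;
}
```

With the quoted figures, the RS/6000 reaches 104/125, or about 83% of its peak on LINPACK. The HP 9000/735 reaches 107/198, or about 54%, and the Cray-1S reaches 110/160, or about 69%.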