
3.5.2 Superscalar Processors

 


Figure 16: Vector Chaining

The evolution of microprocessors has reached the point where architectural concepts pioneered in the vector processors and mainframe computers of the 1960s and 1970s (most notably the CDC-6600 and the Cray-1) are starting to appear in RISC processors. Early RISC machines were very simple single-chip processors. As VLSI technology improved, more room became available on the chip. Rather than increase the complexity of the architecture, most designers decided to use this room on techniques that improve the execution of their current architecture. The two principal techniques are on-chip caches and instruction pipelines.

The latest step in this evolutionary process is the superscalar processor, a scalar processor that is capable of executing more than one instruction in each cycle. The keys to superscalar execution are an instruction fetch unit that can fetch more than one instruction at a time from cache; instruction decoding logic that can decide when instructions are independent and can therefore be executed simultaneously; and enough execution units to process several instructions at one time. The execution units may themselves be pipelined, e.g. they may be floating point adders or multipliers, in which case the cycle time of each stage matches the cycle time of the fetching and decoding logic. In many systems the high level architecture is unchanged from earlier scalar designs; the superscalar designs use instruction level parallelism to implement these architectures more efficiently.
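As a rough illustration (not taken from any particular machine), the C fragment below contains two multiplications with no data dependence between them; a superscalar processor with two floating point units could issue both in the same cycle, while the final addition depends on both products and must be scheduled after them:

    #include <stdio.h>

    /* Hypothetical example: t1 and t2 are independent, so a two-way
       superscalar processor can issue both multiplies together; the
       final add must wait for both results. */
    static double dot2(double a, double b, double c, double d)
    {
        double t1 = a * b;      /* independent of t2 */
        double t2 = c * d;      /* independent of t1 */
        return t1 + t2;         /* depends on t1 and t2 */
    }

    int main(void)
    {
        printf("%f\n", dot2(1.0, 2.0, 3.0, 4.0));   /* prints 14.000000 */
        return 0;
    }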

A good example of a superscalar processor is the IBM RS/6000 [10]. There are three major subsystems in this processor: the instruction fetch unit, an integer processor, and a floating point processor. The instruction fetch unit is a 2-stage pipeline; during the first stage a packet of four instructions is fetched from the instruction cache, and in the second stage instructions are routed to the integer processor and/or the floating point processor. An interesting feature of this instruction unit is that it executes branch instructions itself, so in a tight loop there is effectively no branching overhead: the instruction unit executes the branches while the data units compute values. The integer unit is a four-stage pipeline; in addition to executing data processing instructions it does some preprocessing for the floating point unit. The floating point unit itself is six stages deep.

The following example from [10] shows the potential of this style of computing. The code, from a computer graphics application, rotates and displaces a set of (x, y) pairs by an angle θ and a displacement (xd, yd).
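The source statements themselves are not reproduced here; the following C sketch is a reconstruction consistent with the compiled loop shown below (the array names x, y, xp, yp and the hoisted constants c and s are assumptions of this sketch, not identifiers from [10]):

    #include <math.h>

    /* Reconstruction of the rotate-and-displace loop implied by the
       compiled code below; names are illustrative, not from [10]. */
    void rotate(int n, const double *x, const double *y,
                double *xp, double *yp,
                double theta, double xd, double yd)
    {
        double c = cos(theta);          /* loop-invariant constants, */
        double s = sin(theta);          /* kept in registers         */
        for (int i = 0; i < n; i++) {
            xp[i] = x[i] * c - y[i] * s + xd;
            yp[i] = x[i] * s + y[i] * c + yd;
        }
    }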

A vector processor would load the (x, y) pairs into two vector registers and then use vector instructions. On the RS/6000 the operations are compiled into the following loop (loop-invariant constants such as cos θ, sin θ, xd, and yd are loaded into registers before the loop begins):

L:      load   R8,x[i]
        fma    R10,R8,cos,xd
        load   R9,y[i]
        fma    R11,R9,cos,yd
        fma    R12,R9,-sin,R10
        store  R12,x'[i]
        fma    R13,R8,sin,R11
        store  R13,y'[i]
        branch L

The fma W,X,Y,Z instruction is ``floating multiply and add'', i.e. W = X*Y+Z. Note that the compiler has carefully interleaved load and store instructions with data processing instructions. There are eight floating point operations in each iteration (two per fma instruction) and eight instructions in the loop body, not counting the branch, so the processor initiates an average of one floating point operation per instruction. Since the instruction fetch unit executes the branch, there are no cycles in which the floating point unit is idle, and the machine will deliver one result per cycle for arbitrarily long vectors as long as there are no cache misses. See [10] for a detailed explanation of the timing of this loop.

A 62.5 MHz RS/6000 system ran the LINPACK benchmark (Gaussian elimination) at a rate of 104 MFLOPS, against a theoretical peak performance of 125 MFLOPS. The HP 9000/735 workstation, with a 99 MHz superscalar HP-PA processor, executes the LINPACK benchmark at 107 MFLOPS, with a theoretical peak of 198 MFLOPS. By comparison, the Cray-1S, with an 80 MHz clock, performs at 110 MFLOPS and has a theoretical peak of 160 MFLOPS.

The advantage of the superscalar approach is that it does not rely on a vectorizing compiler to detect loops and turn them into vector instructions. A superscalar machine still requires a very sophisticated compiler to allocate resources and schedule operations in an order that best exploits the resources of the machine, but in the long run the superscalar approach may be more flexible and applicable to a wider range of applications than vector processing.
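As a rough check on the peak figures quoted above, all three peaks are consistent with one multiply-add (two floating point operations) completed per cycle; this two-flops-per-cycle rate is an inference from the numbers in the text, not a claim from [10]:

    #include <stdio.h>

    /* Peak MFLOPS = clock (MHz) x flops per cycle, assuming one
       multiply-add (2 flops) per cycle for all three machines;
       LINPACK rates are the MFLOPS figures quoted in the text. */
    int main(void)
    {
        const char *name[] = { "RS/6000 (62.5 MHz)", "HP 9000/735 (99 MHz)",
                               "Cray-1S (80 MHz)" };
        double clock_mhz[] = { 62.5, 99.0, 80.0 };
        double linpack[]   = { 104.0, 107.0, 110.0 };

        for (int i = 0; i < 3; i++) {
            double peak = clock_mhz[i] * 2.0;
            printf("%-22s peak %6.1f MFLOPS, LINPACK %5.1f MFLOPS (%2.0f%% of peak)\n",
                   name[i], peak, linpack[i], 100.0 * linpack[i] / peak);
        }
        return 0;
    }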


