The evolution of microprocessors has reached the point where architectural concepts pioneered in vector processors and mainframe computers of the 1970s (most notably the CDC-6600 and Cray-1) are starting to appear in RISC processors. Early RISC machines were very simple single-chip processors. As VLSI technology improved, more room became available on the chip. Rather than increase the complexity of the architecture, most designers decided to use this room for techniques that improve the execution of their current architecture. The two principal techniques are on-chip caches and instruction pipelines.
The latest step in this evolutionary process is the superscalar processor. The name means that these are scalar processors capable of executing more than one instruction in each cycle. The keys to superscalar execution are an instruction fetching unit that can fetch more than one instruction at a time from cache; instruction decoding logic that can decide when instructions are independent and can thus be executed simultaneously; and sufficient execution units to process several instructions at one time. Note that the execution units may be pipelined, e.g. they may be floating point adders or multipliers, in which case the cycle time for each stage matches the cycle time of the fetching and decoding logic. In many systems the high level architecture is unchanged from earlier scalar designs. The superscalar designs use instruction level parallelism to improve the implementation of these architectures.
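The independence test performed by the decoding logic can be illustrated with a small sketch. The `Instr` record and the `independent` function below are hypothetical names introduced only for this illustration, and the check covers register dependences only, using the standard read-after-write, write-after-read, and write-after-write categories. It is a deliberately conservative model, not a description of any particular processor's decoder.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: at most one register written."""
    op: str
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read


def independent(first: Instr, second: Instr) -> bool:
    """Return True if `second` has no register dependence on `first`.

    Conservative check: any overlap between a written register and
    a register used by the other instruction blocks dual issue.
    """
    # Read-after-write: second reads what first writes.
    raw = first.dest is not None and first.dest in second.srcs
    # Write-after-read: second overwrites something first still reads.
    war = second.dest is not None and second.dest in first.srcs
    # Write-after-write: both write the same register.
    waw = first.dest is not None and first.dest == second.dest
    return not (raw or war or waw)


add = Instr("add", "R1", ("R2", "R3"))
mul = Instr("mul", "R4", ("R5", "R6"))
sub = Instr("sub", "R7", ("R1", "R4"))

print(independent(add, mul))  # no shared registers
print(independent(add, sub))  # sub reads R1, which add writes
```

Real decoding logic must also confirm that an execution unit is free for each instruction and that memory references do not conflict; the sketch leaves both out.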
A good example of a superscalar processor is the IBM RS/6000 [10]. There are three major subsystems in this processor: the instruction fetch unit, an integer processor, and a floating point processor. The instruction fetch unit is a two-stage pipeline; during the first stage a packet of four instructions is fetched from an instruction cache, and in the second stage instructions are routed to the integer processor and/or floating point processor. An interesting feature of this instruction unit is that it executes branch instructions itself. In a tight loop there is therefore effectively no overhead from branching, since the instruction unit executes branches while the data units are computing values. The integer unit is a four-stage pipeline. In addition to executing data processing instructions, this unit does some preprocessing for the floating point unit. The floating point unit itself is six stages deep.
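The following example from [10] shows the potential of this style
of computing. This code from a computer graphics application
rotates and displaces a set of $(x, y)$ pairs by an angle
$\theta$ and displacement $(x_d, y_d)$:

$$x_i' = x_d + x_i \cos\theta - y_i \sin\theta$$
$$y_i' = y_d + y_i \cos\theta + x_i \sin\theta$$
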
A vector processor would load the (x, y) pairs into two
vector registers and then use vector instructions. On the RS/6000
the operations are compiled into the following loop (the
constants cos, sin, etc. are loaded into registers before the
loop begins):
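L:  load   R8,x[i]           # R8  = x[i]
    fma    R10,R8,cos,xd     # R10 = x[i]*cos + xd
    load   R9,y[i]           # R9  = y[i]
    fma    R11,R9,cos,yd     # R11 = y[i]*cos + yd
    fma    R12,R9,-sin,R10   # R12 = -y[i]*sin + R10  (new x)
    store  R12,x[i]'         # x[i]' = R12
    fma    R13,R8,sin,R11    # R13 = x[i]*sin + R11   (new y)
    store  R13,y[i]'         # y[i]' = R13
    branch L                 # executed by the instruction fetch unit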
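
As a cross-check on the schedule, the loop body can be mirrored one instruction at a time in Python, modeling fma W,X,Y,Z as W = X*Y + Z. This is only a sketch: the function names and the list-based arrays are assumptions made for the illustration, and Python rounds after the multiply and again after the add, so the last bits may differ from a fused hardware result.

```python
import math


def fma(x, y, z):
    """Model of fma W,X,Y,Z: returns x*y + z (two roundings here)."""
    return x * y + z


def rotate_displace(xs, ys, theta, xd, yd):
    """Rotate each (x, y) by theta and displace by (xd, yd).

    The loop body follows the compiled instruction order exactly.
    """
    cos, sin = math.cos(theta), math.sin(theta)  # loaded before the loop
    neg_sin = -sin
    new_xs, new_ys = [], []
    for x, y in zip(xs, ys):
        r8 = x                        # load  R8,x[i]
        r10 = fma(r8, cos, xd)        # fma   R10,R8,cos,xd
        r9 = y                        # load  R9,y[i]
        r11 = fma(r9, cos, yd)        # fma   R11,R9,cos,yd
        r12 = fma(r9, neg_sin, r10)   # fma   R12,R9,-sin,R10
        new_xs.append(r12)            # store R12,x[i]'
        r13 = fma(r8, sin, r11)       # fma   R13,R8,sin,R11
        new_ys.append(r13)            # store R13,y[i]'
    return new_xs, new_ys


# Rotate (1, 0) and (0, 1) by 90 degrees, then shift by (2, 3).
xs, ys = rotate_displace([1.0, 0.0], [0.0, 1.0], math.pi / 2, 2.0, 3.0)
print(xs, ys)
```

Each pass through the loop makes four fma calls, i.e. eight floating point operations, against eight load, fma, and store steps.
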
The fma W,X,Y,Z instruction is
``floating multiply and add'', i.e.
W = X*Y + Z. Note that the
compiler has carefully interleaved load and store instructions
with data processing instructions, and there are eight floating
point operations (two per fma instruction) in each loop iteration
and the loop itself has eight instructions, not counting the
branch. Over the entire loop, then, the processor initiates one
floating point operation per instruction. Since the instruction
fetch unit executes the branch there are no cycles when the
floating point unit is not busy. The machine will deliver one
result per cycle for arbitrarily long vectors as long as there
are no cache misses. See [10] for a detailed
explanation of the
timing of this loop.

A 62.5 MHz RS/6000 system ran the LINPACK
benchmark (Gaussian elimination) at a rate of 104 MFLOPS, against
a theoretical peak performance of 125 MFLOPS. The HP 9000/735
workstation has a 99 MHz superscalar HP-PA processor. This
machine executes the LINPACK benchmark at 107 MFLOPS, with a
theoretical peak performance of 198 MFLOPS. By comparison, the
Cray-1S, with a clock rate of 80 MHz, performs at 110 MFLOPS
against a theoretical peak of 160 MFLOPS.

The advantage of the
superscalar approach is that it does not rely on a vectorizing
compiler to detect loops and turn them into vector instructions.
A superscalar machine still requires a very sophisticated
compiler to allocate resources and schedule operations in an
order that will best take advantage of the resources of the
machine, but in the long run the superscalar approach may be more
flexible and applicable to a wider range of applications than
vector processing.
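
The benchmark figures quoted above can be put on a common footing with a little arithmetic. The sketch below uses only the clock rates, LINPACK rates, and peak rates given in the text; the dictionary layout and the `summarize` function are hypothetical names chosen for the illustration.

```python
# (clock in MHz, LINPACK MFLOPS, theoretical peak MFLOPS), from the text
systems = {
    "IBM RS/6000": (62.5, 104.0, 125.0),
    "HP 9000/735": (99.0, 107.0, 198.0),
    "Cray-1S": (80.0, 110.0, 160.0),
}


def summarize(clock_mhz, linpack_mflops, peak_mflops):
    """Return (flops per cycle at peak, fraction of peak on LINPACK)."""
    flops_per_cycle = peak_mflops / clock_mhz
    fraction_of_peak = linpack_mflops / peak_mflops
    return flops_per_cycle, fraction_of_peak


for name, figures in systems.items():
    per_cycle, fraction = summarize(*figures)
    print(f"{name:12s} {per_cycle:.1f} flops/cycle at peak, "
          f"LINPACK at {fraction:.0%} of peak")
```

All three peak rates correspond to two floating point operations per clock cycle, which for the RS/6000 matches one fma per cycle. On LINPACK the machines reach roughly 83%, 54%, and 69% of peak, respectively.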