A vector processor is a processor that can operate on entire vectors with one instruction, i.e. the operands of some instructions specify complete vectors. For example, consider the following add instruction:
C = A + B
In both scalar and vector machines this means ``add the contents of A to the contents of B and put the sum in C.'' In a scalar machine the operands are numbers, but in a vector processor the operands are vectors, and the instruction directs the machine to compute the sum of each pair of corresponding vector elements. A processor register, usually called the vector length register, tells the processor how many individual additions to perform when it adds the vectors.
A vectorizing compiler is a compiler that will try to recognize when loops can be transformed into single vector instructions. For example, the following loop can be executed by a single instruction on a vector processor:
DO 10 I=1,N
A(I) = B(I) + C(I)
10 CONTINUE
This loop would be translated into an instruction that sets the vector length to N, followed by a vector add instruction.
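The translation described above can be sketched as a toy model. This is only an illustration, not a description of any real instruction set: the class `VectorUnit` and its methods `set_vector_length` and `vadd` are hypothetical names introduced here.

```python
# Toy model of a vector add controlled by a vector length register.
# VectorUnit, set_vector_length and vadd are hypothetical names.

class VectorUnit:
    def __init__(self):
        self.vector_length = 0  # models the vector length register

    def set_vector_length(self, n):
        self.vector_length = n

    def vadd(self, b, c):
        # One "instruction": sums of the first vector_length element pairs.
        n = self.vector_length
        return [b[i] + c[i] for i in range(n)]


def scalar_loop(b, c, n):
    # What the DO loop does on a scalar machine: one add per iteration.
    a = [0] * n
    for i in range(n):
        a[i] = b[i] + c[i]
    return a


unit = VectorUnit()
b = [1, 2, 3, 4]
c = [10, 20, 30, 40]
unit.set_vector_length(len(b))  # "set the vector length to N"
a = unit.vadd(b, c)             # "vector add"
print(a)                        # [11, 22, 33, 44]
```

The scalar loop and the two-step vector sequence produce the same result; the difference is that the vector version is expressed as two instructions instead of N loop iterations.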
The use of vector instructions pays off in two different ways. First, the machine has to fetch and decode far fewer instructions, so the control unit overhead is greatly reduced and the memory bandwidth necessary to perform this sequence of operations is reduced by a corresponding amount. The second payoff, equally important, is that the instruction provides the processor with a regular source of data. When the vector instruction is initiated, the machine knows it will have to fetch pairs of operands which are arranged in a regular pattern in memory. Thus the processor can tell the memory system to start sending those pairs. With an interleaved memory, the pairs will arrive at a rate of one per cycle, at which point they can be routed directly to a pipelined data unit for processing. Without an interleaved memory or some other way of providing operands at a high rate, the advantages of processing an entire vector with a single instruction would be greatly reduced.
A key division of vector processors arises from the way the instructions access their operands. In the memory to memory organization the operands are fetched from memory and routed directly to the functional unit. Results are streamed back out to memory as the operation proceeds. In the register to register organization operands are first loaded into a set of vector registers, each of which can hold a segment of a vector, for example 64 elements. The vector operation then proceeds by fetching the operands from the vector registers and returning the results to a vector register.
The advantage of memory to memory machines is the ability to process very long vectors, whereas register to register machines must break long vectors into fixed length segments. Unfortunately, this flexibility is offset by a relatively large overhead known as the startup time, which is the time between the initiation of the instruction and the time the first result emerges from the pipeline. The long startup time on a memory to memory machine is a function of memory latency, which is longer than the time it takes to access a value in an internal register. Once the pipeline is full, however, a result is produced every cycle or perhaps every other cycle. Thus a performance model for a vector processor is of the form
$T = s + \alpha N$
where $T$ is the execution time, $s$ is the startup time, $N$ is the length of the vector and $\alpha$ is an instruction dependent constant, usually a small value such as 1 or 2.
Examples of this type of architecture include the Texas Instruments Inc. Advanced Scientific Computer and a family of machines built by Control Data Corp. known first as the Cyber 200 series and later the ETA-10 when Control Data Corp. founded a separate company known as ETA Systems Inc. These machines appeared in the mid 1970s after a long development cycle that left them with dated technology and disappeared in the mid 1980s. For a thorough discussion of their characteristics, see Hockney and Jesshope [13]. One of the major reasons for their demise was the large startup time, which was on the order of 100 processor cycles. This meant that short vector operations were very inefficient, and even for vectors of length 100 the machines were delivering only about half their maximum performance. In a later section we will see how this vector length that yields half of peak performance is used to characterize vector computers.
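The half-of-peak figure quoted above follows directly from a model in which execution time is the startup time plus a fixed cost per element. A minimal sketch, assuming a startup of 100 cycles and one result per cycle once the pipeline is full; the function names `vector_time` and `fraction_of_peak` are introduced here for illustration only.

```python
def vector_time(n, startup, cycles_per_result=1):
    # Cycles to process a vector of length n: startup plus a per-element cost.
    return startup + cycles_per_result * n


def fraction_of_peak(n, startup, cycles_per_result=1):
    # At peak rate the n results would take cycles_per_result * n cycles.
    peak_time = cycles_per_result * n
    return peak_time / vector_time(n, startup, cycles_per_result)


for n in (10, 100, 1000):
    print(n, round(fraction_of_peak(n, startup=100), 3))
# 10 0.091
# 100 0.5
# 1000 0.909
```

Under these assumptions the vector length that reaches half of peak equals the startup time divided by the per-element cost, which is why a 100-cycle startup penalizes short vectors so heavily.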
In the register to register machines the vectors have a relatively short length, 64 in the case of the Cray family, but the startup time is far less than on the memory to memory machines. Thus these machines are much more efficient for operations involving short vectors, but for long vector operations the vector registers must be loaded with each segment before the operation can continue. Register to register machines now dominate the vector computer market, with a number of offerings from Cray Research Inc., including the Y-MP and the C-90. The approach is also the basis for machines from Fujitsu, Hitachi and NEC. Clock cycles on modern vector processors range from 2.5ns (NEC SX-3) to 4.2ns (Cray C-90), and single processor performance on LINPACK benchmarks is in the range of 1000 to 2000 MFLOPS (1 to 2 GFLOPS).
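Processing a long vector in register-sized segments is commonly called strip mining, a term the text above does not use. The sketch below illustrates the idea under the assumption of 64-element registers; the function `strip_mined_add` and its variable names are hypothetical, and Python lists stand in for memory and vector registers.

```python
VECTOR_REGISTER_LENGTH = 64  # register size quoted in the text for the Cray family


def strip_mined_add(b, c, max_len=VECTOR_REGISTER_LENGTH):
    # Add two long vectors one register-sized segment at a time.
    n = len(b)
    a = [0] * n
    segments = 0
    for start in range(0, n, max_len):
        vl = min(max_len, n - start)          # vector length for this segment
        v0 = b[start:start + vl]              # "load" segment of b
        v1 = c[start:start + vl]              # "load" segment of c
        v2 = [x + y for x, y in zip(v0, v1)]  # vector add on the registers
        a[start:start + vl] = v2              # "store" the result segment
        segments += 1
    return a, segments


b = list(range(150))
c = [2 * x for x in b]
a, segments = strip_mined_add(b, c)
print(segments)  # 3 segments: 64 + 64 + 22 elements
```

Each pass through the loop pays the register load and store cost again, which is the price a register to register machine pays on long vectors in exchange for its short startup time.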
The basic processor architecture of the Cray supercomputers has changed little since the Cray-1 was introduced in 1976 [28]. There are 8 vector registers, named V0 through V7, which each hold 64 64-bit words. There are also 8 scalar registers, which hold single 64-bit words, and 8 address registers (for pointers) that have 20-bit words. Instead of a cache, these machines have a set of backup registers for the scalar and address registers; transfer to and from the backup registers is done under program control, rather than by lower level hardware using dynamic memory referencing patterns.
The original Cray-1 had 12 pipelined data processing units; newer Cray systems have 14. There are separate pipelines for addition, multiplication, computing reciprocals (to divide $a$ by $b$, a Cray computes $a \times (1/b)$), and logical operations. The cycle time of the data processing pipelines is carefully matched to the memory cycle time. The memory system delivers one value per clock cycle through the use of 4-way interleaved memory.
An interesting feature introduced in the Cray computers is the notion of vector chaining. Consider the following two vector instructions:
V2 = V0 * V1
V4 = V2 + V3
The output of the first instruction is one of the operands of the second instruction. Recall that since these are vector instructions, the first instruction will route up to 64 pairs of numbers to a pipelined multiplier. About midway through the execution of this instruction, the machine will be in an interesting state: the first few elements of V2 will contain recently computed products; the products that will eventually go into the next elements of V2 are still in the multiplier pipeline; and the remainder of the operands are still in V0 and V1, waiting to be fetched and routed to the pipeline. This situation is shown in Figure 16, where the operands from V0 and V1 that are currently in the multiplier pipeline are indicated by gray cells. At this point, the system is fetching V0[k] and V1[k] to route them to the first stage of the pipeline, and V2[j] is just leaving the pipeline. Vector chaining relies on the path marked with an asterisk. While V2[j] is being stored in the vector register, it is also routed directly to the pipelined adder, where it is matched with V3[j]. As the figure shows, the second instruction can begin even before the first has finished, and while both are executing the machine is producing two results per cycle (V4[i] and V2[j]) instead of just one.
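The benefit of chaining can be estimated with a simple cycle count. This is a rough sketch under stated assumptions, not a timing model of any real machine: each pipeline is assumed to deliver one result per cycle after a fixed latency, and the latencies of 7 and 6 cycles are illustrative values chosen here.

```python
def unchained_cycles(n, mul_latency, add_latency):
    # The add starts only after the multiply has delivered all n products.
    return (mul_latency + n) + (add_latency + n)


def chained_cycles(n, mul_latency, add_latency):
    # Each product is forwarded to the adder as it leaves the multiplier,
    # so the two pipelines overlap and only their latencies add up.
    return mul_latency + add_latency + n


n = 64                            # one full vector register
mul_latency, add_latency = 7, 6   # illustrative pipeline depths (assumed)
print(unchained_cycles(n, mul_latency, add_latency))  # 141
print(chained_cycles(n, mul_latency, add_latency))    # 77
```

With these numbers the chained pair finishes in a little over half the time, consistent with the two-results-per-cycle rate described above while both pipelines are busy.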
Without vector chaining, the peak performance of the Cray-1
would have been 80 MFLOPS (one full pipeline producing a result
every 12.5ns, or 80,000,000 results per second). With three
pipelines chained together, there is a very short burst of time
where all three are producing results, for a theoretical peak
performance of 240 MFLOPS.
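The peak figures above are straightforward arithmetic on the 12.5ns cycle time. A small sketch, with `peak_mflops` as a helper name introduced here:

```python
CYCLE_TIME_NS = 12.5  # Cray-1 cycle time quoted in the text


def peak_mflops(cycle_time_ns, pipelines=1):
    # Each full pipeline delivers one result per cycle.
    results_per_second = pipelines * 1e9 / cycle_time_ns
    return results_per_second / 1e6


print(peak_mflops(CYCLE_TIME_NS))     # 80.0
print(peak_mflops(CYCLE_TIME_NS, 3))  # 240.0
```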
In principle vector chaining could be implemented in a memory-to-memory vector processor, but it would require much higher memory bandwidth to do so. Without chaining, three ``channels'' must be used to fetch two input operand streams and store one result stream; with chaining, five channels would be needed for three inputs and two outputs. Thus the ability to chain operations together to double performance gave register-to-register designs another competitive edge over memory-to-memory designs.