
### 3.5.1 Vector Processors

A vector processor is a processor that can operate on entire vectors with one instruction, i.e. the operands of some instructions specify complete vectors. For example, consider the following add instruction:

`C = A + B`

In both scalar and vector machines this means "add the contents of `A` to the contents of `B` and put the sum in `C`." In a scalar machine the operands are numbers, but in a vector processor the operands are vectors, and the instruction directs the machine to compute the sum of each pair of corresponding vector elements. A processor register, usually called the vector length register, tells the processor how many individual additions to perform when it adds the vectors.

A vectorizing compiler is a compiler that will try to recognize when loops can be transformed into single vector instructions. For example, the following loop can be executed by a single instruction on a vector processor:

```
      DO 10 I = 1, N
         A(I) = B(I) + C(I)
   10 CONTINUE
```

This code would be translated into an instruction that would set the vector length to `N` followed by a vector add instruction.
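As a rough illustration of that translation, the following Python sketch models a vector unit with a vector length register and counts how many instructions are issued. The class `VectorUnit` and its methods `set_vl` and `vadd` are hypothetical names invented for this example, not the instruction set of any real machine, and the scalar count ignores loop-control instructions.

```python
class VectorUnit:
    """Toy model of a vector processor (hypothetical, for illustration only)."""

    def __init__(self):
        self.vl = 0        # vector length register
        self.issued = 0    # number of instructions issued so far

    def set_vl(self, n):
        """Instruction 1: set the vector length register."""
        self.vl = n
        self.issued += 1

    def vadd(self, b, c):
        """Instruction 2: one vector add performing self.vl element additions."""
        self.issued += 1
        return [b[i] + c[i] for i in range(self.vl)]


def scalar_add_loop(b, c):
    """Scalar version: one add instruction per element (loop overhead ignored)."""
    a, issued = [], 0
    for x, y in zip(b, c):
        a.append(x + y)
        issued += 1
    return a, issued


B = [1.0, 2.0, 3.0, 4.0]
C = [10.0, 20.0, 30.0, 40.0]

vu = VectorUnit()
vu.set_vl(len(B))
A_vector = vu.vadd(B, C)
A_scalar, scalar_issued = scalar_add_loop(B, C)
```

Both versions compute the same result, but in this model the vector version issues two instructions regardless of `N`, while the scalar version issues at least `N`.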

The use of vector instructions pays off in two different ways. First, the machine has to fetch and decode far fewer instructions, so the control unit overhead is greatly reduced and the memory bandwidth necessary to perform this sequence of operations is reduced by a corresponding amount. The second payoff, equally important, is that the instruction provides the processor with a regular source of data. When the vector instruction is initiated, the machine knows it will have to fetch pairs of operands which are arranged in a regular pattern in memory. Thus the processor can tell the memory system to start sending those pairs. With an interleaved memory, the pairs will arrive at a rate of one per cycle, at which point they can be routed directly to a pipelined data unit for processing. Without an interleaved memory or some other way of providing operands at a high rate, the advantages of processing an entire vector with a single instruction would be greatly reduced.
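The effect of interleaving on operand delivery can be sketched with a small simulation. This is a simplified model under stated assumptions: word `i` of a stream lives in bank `(i * stride) % n_banks`, a bank is busy for `bank_busy` cycles after each access, and at most one request is issued per cycle. The function name `cycles_to_fetch` and the values of 4 banks and a 4-cycle busy time are illustrative choices, not figures for a particular machine.

```python
def cycles_to_fetch(n_words, n_banks, bank_busy, stride=1):
    """Cycles elapsed through the issue of the last of n_words requests.

    Assumptions (illustrative only): word i maps to bank (i * stride) % n_banks,
    a bank cannot accept a new request until bank_busy cycles after its
    previous one, and at most one request is issued per cycle.
    """
    ready_at = [0] * n_banks          # cycle at which each bank is free again
    cycle = 0
    for i in range(n_words):
        bank = (i * stride) % n_banks
        cycle = max(cycle, ready_at[bank])   # stall until the bank is free
        ready_at[bank] = cycle + bank_busy   # bank is busy after this access
        cycle += 1                           # earliest cycle for the next request
    return cycle


interleaved = cycles_to_fetch(64, n_banks=4, bank_busy=4)             # 64
single_bank = cycles_to_fetch(64, n_banks=1, bank_busy=4)             # 253
bad_stride = cycles_to_fetch(64, n_banks=4, bank_busy=4, stride=4)    # 253
```

In this model, consecutive words spread over four banks stream at one word per cycle, while a single bank, or a stride that always hits the same bank, delivers only one word every four cycles.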

A key division of vector processors arises from the way the instructions access their operands. In the memory-to-memory organization the operands are fetched from memory and routed directly to the functional unit. Results are streamed back out to memory as the operation proceeds. In the register-to-register organization operands are first loaded into a set of vector registers, each of which can hold a segment of a vector, for example 64 elements. The vector operation then proceeds by fetching the operands from the vector registers and returning the results to a vector register.

The advantage of memory-to-memory machines is the ability to process very long vectors, whereas register-to-register machines must break long vectors into fixed-length segments. Unfortunately, this flexibility is offset by a relatively large overhead known as the startup time, which is the time between the initiation of the instruction and the time the first result emerges from the pipeline. The long startup time on a memory-to-memory machine is a function of memory latency, which is longer than the time it takes to access a value in an internal register. Once the pipeline is full, however, a result is produced every cycle or perhaps every other cycle. Thus a performance model for a vector processor is of the form

$$T = T_s + k \cdot n$$

where $T_s$ is the startup time, $n$ is the length of the vector and $k$ is an instruction-dependent constant, usually $\tfrac{1}{2}$, 1 or 2.

Examples of this type of architecture include the Texas Instruments Inc. Advanced Scientific Computer and a family of machines built by Control Data Corp. known first as the Cyber 200 series and later the ETA-10 when Control Data Corp. founded a separate company known as ETA Systems Inc. These machines appeared in the mid 1970s after a long development cycle that left them with dated technology and disappeared in the mid 1980s. For a thorough discussion of their characteristics, see Hockney and Jesshope [13]. One of the major reasons for their demise was the large startup time, which was on the order of 100 processor cycles. This meant that short vector operations were very inefficient, and even for vectors of length 100 the machines were delivering only about half their maximum performance. In a later section we will see how this vector length that yields half of peak performance is used to characterize vector computers.
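The claim about vectors of length 100 can be checked with a short calculation. The sketch below assumes a timing model consisting of a startup time plus a constant cost per element, with `startup` the startup time in cycles, `n` the vector length and `k` the per-element constant; the function names are invented for this example.

```python
def vector_time(n, startup, k=1.0):
    """Cycles to process a vector of length n: startup plus k cycles per element."""
    return startup + k * n


def fraction_of_peak(n, startup, k=1.0):
    """Delivered rate relative to the peak rate of one result every k cycles."""
    return (k * n) / vector_time(n, startup, k)


def half_performance_length(startup, k=1.0):
    """Vector length at which the machine delivers half of its peak rate."""
    return startup / k


short = fraction_of_peak(10, startup=100)      # about 9% of peak
medium = fraction_of_peak(100, startup=100)    # exactly half of peak
long_ = fraction_of_peak(1000, startup=100)    # about 91% of peak
```

Under these assumptions, a startup time of 100 cycles with `k = 1` gives a half-performance length of 100 elements, which matches the figure quoted above.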

In the register-to-register machines the vectors have a relatively short length, 64 in the case of the Cray family, but the startup time is far less than on the memory-to-memory machines. Thus these machines are much more efficient for operations involving short vectors, but for long vector operations the vector registers must be loaded with each segment before the operation can continue. Register-to-register machines now dominate the vector computer market, with a number of offerings from Cray Research Inc., including the Y-MP and the C-90. The approach is also the basis for machines from Fujitsu, Hitachi and NEC. Clock cycles on modern vector processors range from 2.5ns (NEC SX-3) to 4.2ns (Cray C-90), and single processor performance on LINPACK benchmarks is in the range of 1000 to 2000 MFLOPS (1 to 2 GFLOPS).
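Breaking a long vector into register-sized segments is commonly called strip mining, a term not used in the text above. The following Python sketch shows the idea for a vector add, assuming 64-element vector registers; the function `strip_mined_add` and the constant `MAX_VL` are hypothetical names, and list slices stand in for vector loads and stores.

```python
MAX_VL = 64   # assumed vector register length, as in the Cray family


def strip_mined_add(b, c, max_vl=MAX_VL):
    """Add two equal-length vectors one register-sized segment at a time.

    Returns the result and the number of segments processed.
    """
    n = len(b)
    a = [0] * n
    segments = 0
    for start in range(0, n, max_vl):
        vl = min(max_vl, n - start)            # set the vector length register
        vb = b[start:start + vl]               # vector load of a segment of b
        vc = c[start:start + vl]               # vector load of a segment of c
        va = [x + y for x, y in zip(vb, vc)]   # one vector add on vl elements
        a[start:start + vl] = va               # vector store of the result
        segments += 1
    return a, segments


b = list(range(150))
c = [1] * 150
a, segments = strip_mined_add(b, c)   # segments of 64, 64 and 22 elements
```

A 150-element add therefore takes three passes, each with its own loads, add and store, which is the per-segment cost the paragraph above refers to.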

The basic processor architecture of the Cray supercomputers has changed little since the Cray-1 was introduced in 1976 [28]. There are 8 vector registers, named V0 through V7, which each hold 64 64-bit words. There are also 8 scalar registers, which hold single 64-bit words, and 8 address registers (for pointers) that have 24-bit words. Instead of a cache, these machines have a set of backup registers for the scalar and address registers; transfer to and from the backup registers is done under program control, rather than by lower level hardware using dynamic memory referencing patterns.

The original Cray-1 had 12 pipelined data processing units; newer Cray systems have 14. There are separate pipelines for addition, multiplication, computing reciprocals (to divide $x$ by $y$, a Cray computes $x \cdot (1/y)$), and logical operations. The cycle time of the data processing pipelines is carefully matched to the memory cycle times. The memory system delivers one value per clock cycle through the use of 4-way interleaved memory.

An interesting feature introduced in the Cray computers is the notion of vector chaining. Consider the following two vector instructions:

```
V2 = V0 * V1
V4 = V2 + V3
```

The output of the first instruction is one of the operands of the second instruction. Recall that since these are vector instructions, the first instruction will route up to 64 pairs of numbers to a pipelined multiplier. About midway through the execution of this instruction, the machine will be in an interesting state: the first few elements of `V2` will contain recently computed products; the products that will eventually go into the next elements of `V2` are still in the multiplier pipeline; and the remainder of the operands are still in `V0` and `V1`, waiting to be fetched and routed to the pipeline. This situation is shown in Figure 16, where the operands from `V0` and `V1` that are currently in the multiplier pipeline are indicated by gray cells. At this point, the system is fetching `V0[k]` and `V1[k]` to route them to the first stage of the pipeline, and `V2[j]` is just leaving the pipeline. Vector chaining relies on the path marked with an asterisk. While `V2[j]` is being stored in the vector register, it is also routed directly to the pipelined adder, where it is matched with `V3[j]`. As the figure shows, the second instruction can begin even before the first has finished, and while both are executing the machine is producing two results per cycle (`V4[i]` and `V2[j]`) instead of just one.
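A simple timing model gives a feel for what chaining saves. The sketch below assumes each pipeline delivers its first result after `depth` cycles and one result per cycle thereafter, and it ignores instruction issue and register access overhead. The depths of 7 cycles for the multiplier and 6 for the adder are illustrative assumptions, and the function names are invented for this example.

```python
def pipeline_time(n, depth):
    """Cycles for one vector instruction alone: first result after `depth`
    cycles, then one result per cycle for the remaining n - 1 elements."""
    return depth + n - 1


def unchained_time(n, mul_depth, add_depth):
    """The add starts only after the multiply has written all of V2."""
    return pipeline_time(n, mul_depth) + pipeline_time(n, add_depth)


def chained_time(n, mul_depth, add_depth):
    """Each product enters the adder as it leaves the multiplier, so
    element j of V4 emerges at cycle mul_depth + add_depth + j."""
    return mul_depth + add_depth + n - 1


# Illustrative pipeline depths (assumed, not taken from the text).
MUL_DEPTH, ADD_DEPTH = 7, 6

slow = unchained_time(64, MUL_DEPTH, ADD_DEPTH)   # 139 cycles
fast = chained_time(64, MUL_DEPTH, ADD_DEPTH)     # 76 cycles
```

With these assumed depths, the chained pair of 64-element instructions finishes in 76 cycles instead of 139, because the two pipelines overlap for almost the whole operation.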

Without vector chaining, the peak performance of the Cray-1 would have been 80 MFLOPS (one full pipeline producing a result every 12.5ns, or 80,000,000 results per second). With three pipelines chained together, there is a very short burst of time where all three are producing results, for a theoretical peak performance of 240 MFLOPS. In principle vector chaining could be implemented in a memory-to-memory vector processor, but it would require much higher memory bandwidth to do so. Without chaining, three "channels" must be used to fetch two input operand streams and store one result stream; with chaining, five channels would be needed for three inputs and two outputs. Thus the ability to chain operations together to double performance gave register-to-register designs another competitive edge over memory-to-memory designs.
