### 3.5.1 Vector Processors (continued)

An interesting feature introduced in the Cray computers is the notion of vector chaining. Consider the following two vector instructions:

```
V2 = V0 * V1
V4 = V2 + V3
```

The output of the first instruction is one of the operands of the second instruction. Recall that since these are vector instructions, the first instruction will route up to 64 pairs of numbers to a pipelined multiplier. About midway through the execution of this instruction, the machine will be in an interesting state: the first few elements of `V2` will contain recently computed products; the products that will eventually go into the next elements of `V2` are still in the multiplier pipeline; and the remaining operands are still in `V0` and `V1`, waiting to be fetched and routed to the pipeline. This situation is shown in Figure 16, where the operands from `V0` and `V1` that are currently in the multiplier pipeline are indicated by gray cells. At this point, the system is fetching `V0[k]` and `V1[k]` to route them to the first stage of the pipeline, and `V2[j]` is just leaving the pipeline. Vector chaining relies on the path marked with an asterisk in the figure. While `V2[j]` is being stored in the vector register, it is also routed directly to the pipelined adder, where it is matched with `V3[j]`. As the figure shows, the second instruction can begin even before the first has finished, and while both are executing, the machine is producing two results per cycle (`V4[i]` and `V2[j]`) instead of just one.
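The overlap described above can be illustrated with a toy timing model. This is only a sketch under stated assumptions: the function name `completion_times`, the one-element-per-cycle issue rate, and the latencies of 7 cycles for the multiplier and 6 for the adder are illustrative choices, not figures taken from the text.

```python
def completion_times(n, mul_latency, add_latency, chained):
    """Return the cycle at which each V4[j] becomes available.

    Assumed model: element j of the multiply is issued at cycle j and
    leaves the multiplier mul_latency cycles later.
    """
    # Cycle at which each product V2[j] leaves the multiplier pipeline.
    v2_ready = [j + mul_latency for j in range(n)]
    if chained:
        # Chaining: each product enters the adder the moment it leaves
        # the multiplier, while it is also being stored in V2.
        add_issue = v2_ready
    else:
        # No chaining: the add waits for the last product, then issues
        # one element per cycle from the completed V2 register.
        start = v2_ready[-1]
        add_issue = [start + j for j in range(n)]
    return [t + add_latency for t in add_issue]


if __name__ == "__main__":
    n, mul, add = 64, 7, 6  # assumed vector length and latencies
    print("chained:  ", completion_times(n, mul, add, True)[-1])
    print("unchained:", completion_times(n, mul, add, False)[-1])
```

Under these assumed latencies, the last element of `V4` is available at cycle 76 with chaining and at cycle 139 without it, so the chained pair finishes in roughly half the time.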

Without vector chaining, the peak performance of the Cray-1 would have been 80 MFLOPS (one full pipeline producing a result every 12.5 ns, or 80,000,000 results per second). With three pipelines chained together, there is a very short burst of time during which all three are producing results, for a theoretical peak performance of 240 MFLOPS. In principle, vector chaining could be implemented in a memory-to-memory vector processor, but doing so would require much higher memory bandwidth. Without chaining, three "channels" must be used to fetch two input operand streams and store one result stream; with chaining, as in the two-instruction example above, five channels would be needed for three inputs and two outputs. Thus the ability to chain operations together to double performance gave register-to-register designs another competitive edge over memory-to-memory designs.
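The arithmetic behind these figures can be checked with a short sketch. The helper names `peak_mflops` and `memory_channels` are hypothetical, and the channel count assumes a memory-to-memory machine that needs one stream per distinct input vector and one per stored result.

```python
def peak_mflops(clock_ns, pipelines=1):
    """Peak rate in MFLOPS, assuming each pipeline delivers one result per clock."""
    # One result every clock_ns nanoseconds is 1000 / clock_ns million per second.
    return pipelines * 1000.0 / clock_ns


def memory_channels(instructions):
    """Count the memory streams needed for one chain of vector instructions.

    instructions: list of (dest, src_a, src_b) tuples in issue order.
    A source produced earlier in the chain is forwarded, not fetched.
    """
    produced = set()
    inputs = set()
    for dest, src_a, src_b in instructions:
        for src in (src_a, src_b):
            if src not in produced:
                inputs.add(src)
        produced.add(dest)
    return len(inputs) + len(produced)


if __name__ == "__main__":
    print(peak_mflops(12.5))                  # one pipeline
    print(peak_mflops(12.5, pipelines=3))     # three chained pipelines
    print(memory_channels([("V2", "V0", "V1")]))
    print(memory_channels([("V2", "V0", "V1"), ("V4", "V2", "V3")]))
```

With a 12.5 ns clock this gives 80 MFLOPS for one pipeline and 240 MFLOPS for three, and the channel count rises from three for a single instruction to five for the chained pair.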