An interesting feature introduced in the Cray computers is the notion of vector chaining. Consider the following two vector instructions:
    V2 = V0 * V1
    V4 = V2 + V3
The output of the first
instruction is one of the operands of the second instruction.
Recall that since these are vector instructions, the first
instruction will route up to 64 pairs of numbers to a pipelined
multiplier. About midway through the execution of this
instruction, the machine will be in an interesting state: the
first few elements of V2 will contain recently computed products;
the products that will eventually go into the next elements of V2
are still in the multiplier pipeline; and the remainder of the
operands are still in V0 and V1, waiting to be fetched and routed
to the pipeline. This situation is shown in the figure, where the
operands from V0 and V1 that are currently in the multiplier
pipeline are indicated by gray cells. At this point, the system is
fetching V0[k] and V1[k] to route them to the first stage of the
pipeline, and V2[j] is just leaving the pipeline.
Vector chaining relies on the path marked with an asterisk. While
V2[j] is being stored in the vector register, it is also routed
directly to the pipelined adder, where it is matched with V3[j].
As the figure shows, the second instruction can begin even before
the first has finished, and while both are executing the machine is
producing two results per cycle (one from the multiplier and one
from the adder) instead of just one.
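The effect of chaining on the two-instruction sequence above can be sketched with a small timing model. This is an illustration under stated assumptions, not a description of the actual Cray hardware: the function name completion_times is made up for this sketch, the latencies of 7 cycles for the multiplier and 6 for the adder are illustrative values, and startup and register-access overheads are ignored.

```python
def completion_times(n, mul_latency, add_latency, chained):
    """Return (mul_done, add_done): the cycle in which each product
    V2[i] and each sum V4[i] leaves its pipeline.

    Assumes element i enters the multiplier in cycle i and that each
    pipeline accepts one new operand pair per cycle.
    """
    mul_done = [i + mul_latency for i in range(n)]
    if chained:
        # Each product is forwarded to the adder as it leaves the multiplier.
        add_done = [t + add_latency for t in mul_done]
    else:
        # The add waits for the last product, then streams one pair per cycle.
        start = mul_done[-1]
        add_done = [start + i + add_latency for i in range(n)]
    return mul_done, add_done


N, MUL, ADD = 64, 7, 6  # vector length; assumed pipeline latencies in cycles

for chained in (False, True):
    mul_done, add_done = completion_times(N, MUL, ADD, chained)
    # Cycles in which both pipelines deliver a result at the same time.
    overlap = len(set(mul_done) & set(add_done))
    label = "chained" if chained else "unchained"
    print(f"{label}: last sum in cycle {add_done[-1]}, "
          f"{overlap} cycles with two results")
```

Under these assumptions the unchained sequence delivers its last sum in cycle 139 and never produces two results in one cycle, while the chained sequence finishes in cycle 76 and produces two results per cycle for 58 of those cycles.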
Without vector chaining, the peak performance of the Cray-1 would have been 80 MFLOPS (one full pipeline producing a result every 12.5 ns, or 80,000,000 results per second). With three pipelines chained together, there is a very short burst of time where all three are producing results, for a theoretical peak performance of 240 MFLOPS. In principle vector chaining could be implemented in a memory-to-memory vector processor, but it would require much higher memory bandwidth to do so. Without chaining, three "channels" must be used to fetch two input operand streams and store one result stream; with chaining, five channels would be needed for three inputs and two outputs. Thus the ability to chain operations together to double performance gave register-to-register designs another competitive edge over memory-to-memory designs.
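The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and the channel formula is an assumed generalization of the two cases quoted above (three channels for one operation, five for two chained operations): it supposes that every chained operation brings in one additional input stream and stores its own result stream.

```python
CLOCK_NS = 12.5  # clock period quoted in the text


def peak_mflops(active_pipelines, clock_ns=CLOCK_NS):
    """Peak rate when each active pipeline delivers one result per clock.

    One result per nanosecond equals 1000 MFLOPS, so one result every
    clock_ns nanoseconds equals 1000 / clock_ns MFLOPS per pipeline.
    """
    return active_pipelines * 1000.0 / clock_ns


def memory_channels(chained_ops):
    """Memory streams needed by a memory-to-memory design (assumed model).

    The first operation reads two input streams, each further chained
    operation reads one more, and every operation stores one result stream.
    """
    inputs = chained_ops + 1
    outputs = chained_ops
    return inputs + outputs


print(peak_mflops(1))      # one pipeline
print(peak_mflops(3))      # three chained pipelines
print(memory_channels(1))  # a single vector operation
print(memory_channels(2))  # two chained operations
```

With the 12.5 ns clock this gives 80 and 240 MFLOPS for one and three pipelines, and 3 and 5 channels for one and two chained operations, matching the numbers in the text.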