## 3.2 Pipelines

A common analogy for a pipeline is the assembly line used in manufacturing. The end goal is to increase productivity - the number of instructions executed per second or the number of cars built per day - by dividing a complex operation into pieces that can be performed in parallel. Separate ``workers'' implement successive steps along the assembly line, and when an item finishes one step it is passed down the line to the next step.

Pipelines are used in two major areas in computer design: instruction processing and arithmetic operations. The following requirements must be satisfied in a pipelined system:

• A system is a candidate for pipelined implementation if it repeatedly executes a basic function.
• A basic function must be divisible into independent stages that have minimal overlap.
• The complexity of the stages should be roughly similar.
The number of stages is referred to as the depth of the pipeline. As an example of a pipeline, consider the floating point addition of two numbers of the form m × 2^e, i.e. computing m1 × 2^e1 + m2 × 2^e2. One possible breakdown of this function into stages is as follows:
1. If e1 < e2, swap the operands. Find the difference in exponents d = e1 - e2.
2. Shift m2 to the right by d bits.
3. Compute the mantissa of the sum by adding m1 and the shifted m2. The exponent of the sum is e1.
4. Normalize the sum.
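The four stages above can be sketched in Python. This is a toy model under stated assumptions: mantissas are unsigned integers with a fixed number of significant bits, signs and rounding are ignored, and the function name and `precision` parameter are illustrative, not from the text.

```python
def fp_add(m1, e1, m2, e2, precision=8):
    """Add m1*2**e1 + m2*2**e2 using the four pipeline stages.

    Mantissas are unsigned integers with `precision` significant bits;
    this toy model ignores signs and rounding.
    """
    # Stage 1: ensure e1 >= e2, then find the exponent difference d.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    d = e1 - e2

    # Stage 2: shift the smaller operand's mantissa right by d bits.
    m2 >>= d

    # Stage 3: add the mantissas; the exponent of the sum is e1.
    m, e = m1 + m2, e1

    # Stage 4: normalize so the mantissa fits in `precision` bits.
    while m >= (1 << precision):
        m >>= 1
        e += 1
    while 0 < m < (1 << (precision - 1)):
        m <<= 1
        e -= 1
    return m, e

# 128*2**0 + 128*2**0 = 256, which normalizes to 128*2**1.
print(fp_add(128, 0, 128, 0))   # (128, 1)
```

In a hardware pipeline each stage would be a separate circuit; here the four stages simply run in sequence for a single pair of operands.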
The extra complexity of a pipelined adder pays off when adding long sequences of numbers. Operations at each stage can be done on different pairs of inputs, e.g. one stage can be comparing the exponents in one pair of operands at the same time another stage is adding the mantissas of a different pair of operands.

A very important requirement for overlapping operations this way is that there be no resource conflicts, i.e. the operands must be independent. For example, suppose a program contains the two instructions

```
R2 = R0 + R1
R4 = R3 + R1
```

Note that both instructions identify `R1` as one of their inputs. There is a potential conflict in stages two and three of the adder pipeline because stage two might need to shift the mantissa of `R1` at the same time stage three needs to add the mantissa of `R0` to the mantissa of `R1`. The solution is to make copies of the operands, and pass the copies through the pipeline. Thus the CPU gives the adder copies of `R0` and `R1` when it starts the pipeline for the first pair, and the second stage gets these copies, along with the exponent difference computed in the first stage.
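A minimal sketch of this copying, with registers holding (mantissa, exponent) pairs; the names and representation are illustrative. Because the first stage latches copies of the operand values, a later write to `R1` does not disturb an operation already in the pipeline.

```python
# Registers hold (mantissa, exponent) pairs in this toy model.
registers = {"R0": (128, 0), "R1": (128, 1)}

def issue(regs, a, b):
    """Stage 1: latch *copies* of the operand values into the pipeline."""
    (m1, e1), (m2, e2) = regs[a], regs[b]   # values copied out of the registers
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    # The latch passed to stage 2: both mantissas and the exponent difference.
    return {"m1": m1, "e1": e1, "m2": m2, "d": e1 - e2}

latch = issue(registers, "R0", "R1")   # first pair enters the pipe
registers["R1"] = (255, 7)             # a later write to R1 ...
# ... does not affect the in-flight operation, which uses its copies:
print(latch)   # {'m1': 128, 'e1': 1, 'm2': 128, 'd': 1}
```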

There is another potential problem illustrated by this example. Suppose the second instruction is changed so one of its operands is `R2`, so the pair of instructions is:

```
R2 = R0 + R1
R4 = R2 + R1
```

Now the second instruction depends on the result of the first instruction. The CPU cannot send the second pair of operands to the pipelined adder until the result of the first addition exits the last stage of the pipeline. Interactions such as these lead to periods when pipeline stages are empty. These empty time slots are often called bubbles.

Figure 5 shows a type of diagram, known as a Gantt chart, that is commonly used to illustrate the operation of a pipeline. The horizontal axis represents time. There is one row for each stage of the pipeline. A line segment during cycle t means a stage is active in that cycle; a blank means the stage is inactive. The figure illustrates the pipelined floating point adder of the previous example in the case when two successive instructions can be overlapped and in the case where the second instruction must wait for the first to complete. In the case of overlapped instructions, note that in cycle 2 the first stage is busy with the second instruction while the second stage is busy with the first instruction. In each successive cycle the two instructions are passed down the ``assembly line'' to the next stage. In the first case, the second sum is done after 6 cycles, but in the second case it is not finished until after the 10th cycle. The ``bubble'' in the pipeline is the 4-cycle dead period in each stage caused by delaying the second instruction.
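The timing in the chart can be reproduced with a small model. Assuming the 5-stage adder suggested by the cycle counts in the figure, an instruction that enters stage 1 in cycle s leaves the last stage in cycle s + depth - 1. The function below is a sketch, not from the original text:

```python
def completion_cycles(depth, issue_cycles):
    """Cycle in which each instruction exits a `depth`-stage pipeline,
    given the cycle in which it enters stage 1."""
    return [start + depth - 1 for start in issue_cycles]

DEPTH = 5
# Overlapped: the second pair enters stage 1 one cycle after the first.
print(completion_cycles(DEPTH, [1, 2]))   # [5, 6]
# Dependent: the second pair must wait for the first result (ready in
# cycle 5), so it cannot enter stage 1 until cycle 6.
print(completion_cycles(DEPTH, [1, 6]))   # [5, 10]
```

The difference between issue cycles 2 and 6 is the 4-cycle bubble described above.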

Figure 5: Gantt Charts for Pipelined Floating Point Adder

It should be apparent that in the general case a pipeline of depth d can process n items in n + d - 1 steps when there are no bubbles. Without a pipeline, each application of the basic function would require d cycles, and they would have to be executed sequentially, for a total time of nd cycles. The speedup obtained by a full pipeline is thus S = nd / (n + d - 1). When n >> d we can safely ignore the d - 1 in the denominator, so the asymptotic speedup, observed for large n, is a factor of d. For example, suppose we want to add 1000 pairs of numbers, e.g. when adding two 1000-element vectors. If it takes 5 cycles for each addition, a machine without a pipelined adder would require 5000 cycles. With our 5-stage pipelined adder, the last sum will appear after 1000 + 5 - 1 = 1004 cycles, so the pipeline is 5000/1004 ≈ 4.98 times faster. Providing a steady stream of independent operands that will keep a pipeline full is the distinguishing feature of a vector processor, which can initiate such a series of operations with a single instruction.
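The speedup arithmetic can be checked directly; the function name below is ours, a one-line transcription of the formula:

```python
def pipeline_speedup(n, d):
    """Speedup of a full d-stage pipeline over n sequential d-cycle operations:
    nd / (n + d - 1)."""
    return (n * d) / (n + d - 1)

# 1000 additions on a 5-stage adder:
# sequential: 5000 cycles; pipelined: 1000 + 5 - 1 = 1004 cycles.
print(pipeline_speedup(1000, 5))   # 5000/1004, roughly 4.98
```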

There are many possible sources of bubbles in pipelines. Dependencies between instructions are the main cause. For arithmetic pipelines, data dependencies arise when pairs of operations share inputs and outputs. Consider the following two add instructions:

```
z1 = x1 + y1
...
z2 = x2 + y2
```

The dependence illustrated previously is `x2 = z1`, i.e. the output of the first instruction is an input of the second; in the earlier example both were the register `R2`. Other dependencies are `z1 = z2` (both instructions write to the same register) and `x1 = z2` (the output from the second instruction may overwrite the register before the first has had a chance to read the old value; an unlikely occurrence in a vector machine, but a situation that must be taken into account). Note that the instructions do not have to be consecutive, i.e. there could be intervening instructions. A compiler that checks for dependencies and possibly reorders instructions has to ``look ahead'' in the code by an amount equal to the depth of the pipeline used in the first instruction.
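These three cases correspond to what are commonly called read-after-write (RAW), write-after-write (WAW), and write-after-read (WAR) hazards. A minimal checker can be sketched as follows; the representation of an instruction as (output register, set of input registers) and the function name are illustrative:

```python
def hazards(first, second):
    """Classify data hazards between two instructions.

    Each instruction is a pair (output_register, set_of_input_registers).
    Returns the subset of {"RAW", "WAW", "WAR"} that applies.
    """
    out1, in1 = first
    out2, in2 = second
    found = set()
    if out1 in in2:
        found.add("RAW")   # second reads what first writes (x2 = z1)
    if out1 == out2:
        found.add("WAW")   # both write the same register (z1 = z2)
    if out2 in in1:
        found.add("WAR")   # second overwrites a register first reads (x1 = z2)
    return found

# R2 = R0 + R1  followed by  R4 = R2 + R1: a read-after-write hazard.
print(hazards(("R2", {"R0", "R1"}), ("R4", {"R2", "R1"})))  # {'RAW'}
```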

Instruction pipelines, used to speed up the fetch-decode-execute cycle, are also susceptible to bubbles. The most common case here is caused by branch or loop instructions, which are control dependencies. If the pipeline is ``looking ahead'' and fetching instructions it thinks the machine will want to execute, but in fact the machine branches to another location, a bubble is introduced while the fetch stage goes to get the instructions at the new location.
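One simple way to account for control-dependency bubbles is to charge each taken branch the number of cycles between the fetch stage and the stage in which the branch target becomes known. The model below is a rough sketch under that assumption; the parameters and function name are illustrative, not from the text:

```python
def cycles_with_branches(n_instructions, n_taken_branches, depth, resolve_stage):
    """Total cycles for a `depth`-stage instruction pipeline when each taken
    branch flushes the instructions fetched after it.

    `resolve_stage` is the stage (1-based) in which the branch target is
    known, so each taken branch costs resolve_stage - 1 bubble cycles.
    """
    bubbles = n_taken_branches * (resolve_stage - 1)
    return (n_instructions + depth - 1) + bubbles

# 100 instructions on a 5-stage pipeline with branches resolved in stage 3:
# 104 cycles without branches, plus 2 bubble cycles for each of 10 branches.
print(cycles_with_branches(100, 10, 5, 3))  # 124
```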

Pipelines have been widely used in high performance machines for many years. The CDC 6600 is a classic example of a complex instruction pipeline, using a circuit known as a ``scoreboard'' to detect data dependencies between instructions and instruction buffers to implement pipelining in the fetch-decode logic. The Cray-1 was a very successful early supercomputer in which every data processing unit was pipelined. Until the late 1980s pipelining was one of the attributes that separated ``mainframes'' and supercomputers from microprocessors, but with the advance of VLSI technology microprocessors now have room on chip for complex control circuitry. Most now have pipelined data processing units and complex instruction scheduling logic that rivals that of the CDC 6600.
