The next lower level of parallelism is at the instruction level, where individual instructions may be overlapped or a given instruction may be decomposed into suboperations that are themselves overlapped. In the first case, for example, it is common to find a load instruction, which copies a value from memory to an internal CPU register, overlapped with an arithmetic instruction. The second situation is exemplified by the ubiquitous pipeline, which has become the mainstay of arithmetic processing. In general, programmers need not concern themselves with this level of parallelism, since compilers are adept at reorganizing programs to exploit it. Nevertheless, one should keep in mind that the quality of compilers varies greatly from system to system, and one may have to structure code in particular ways to help the compiler make maximum use of the hardware. For example, as we will see below, Cray supercomputers are most efficient when vector lengths are 64 or smaller (64 is the number of elements a Cray vector register holds), and rearranging programs to operate on small segments of long vectors can improve performance. In addition, awareness of the internal structure of a computer is often necessary when analyzing the performance of a program.
A concept related to the level of parallelism is the granularity of parallel tasks. A large grain system is one in which the operations that run in parallel are fairly large, on the order of entire programs. Small grain parallel systems divide programs into very small pieces, in some cases only a few instructions. The processor described above that calculates n sums in parallel illustrates very fine grain parallelism.