SIMD machines have one instruction processing unit, sometimes called a controller and indicated by a K in the PMS notation, and several data processing units, generally called D-units or processing elements (PEs). The first operational machine of this class was the ILLIAC-IV, a joint project by DARPA, Burroughs Corporation, and the University of Illinois Institute for Advanced Computation [5]. Later machines included the Distributed Array Processor (DAP) from the British corporation ICL, and the Goodyear MPP. Two recent machines, the Thinking Machines CM-1 and the MasPar MP-1, are discussed in detail in Section 3.1.2.
The control unit is
responsible for fetching and interpreting instructions. When it
encounters an arithmetic or other data processing instruction, it
broadcasts the instruction to all PEs, which then all perform the
same operation. For example, the instruction might be "add R3,R0".
Each PE would add the contents of its own internal
register R3 to its own R0. To allow for needed flexibility in
implementing algorithms, a PE can be deactivated. Thus on each
instruction, a PE is either idle, in which case it does nothing,
or it is active, in which case it performs the same operation as
all other active PEs. Each PE has its own memory for storing
data. A memory reference instruction, for example "load R0,100",
directs each PE to load its internal register R0 with the contents
of memory location 100, meaning the 100th cell in its own local
memory.
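
The execution model described above can be sketched as a small simulation. This is only an illustration, not the instruction set of any real machine: the names PE, broadcast, and demo are hypothetical, the memory size of 128 cells is arbitrary, and the sketch assumes that "add R3,R0" leaves its result in R0, which the text does not state.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    """One processing element: private registers, private memory, activity flag."""
    regs: dict = field(default_factory=lambda: {"R0": 0, "R3": 0})
    mem: list = field(default_factory=lambda: [0] * 128)  # size is arbitrary
    active: bool = True


def broadcast(pes, op, *args):
    """Control unit: send one instruction to every PE; idle PEs ignore it."""
    for pe in pes:  # in hardware, all active PEs would act simultaneously
        if not pe.active:
            continue
        if op == "load":
            # load reg,addr: reg <- this PE's own memory[addr]
            reg, addr = args
            pe.regs[reg] = pe.mem[addr]
        elif op == "add":
            # add src,dst: dst <- dst + src (destination is an assumption)
            src, dst = args
            pe.regs[dst] += pe.regs[src]
        else:
            raise ValueError(f"unknown instruction: {op}")


def demo():
    """Run "load R0,100" then "add R3,R0" on four PEs, one of them idle."""
    pes = [PE() for _ in range(4)]
    for i, pe in enumerate(pes):
        pe.mem[100] = 10 * i   # each PE holds different data at location 100
        pe.regs["R3"] = 1
    pes[2].active = False      # a deactivated PE does nothing
    broadcast(pes, "load", "R0", 100)
    broadcast(pes, "add", "R3", "R0")
    return [pe.regs["R0"] for pe in pes]
```

Calling demo() returns [1, 11, 0, 31]: every active PE executed the same two instructions on its own data, while the deactivated PE kept its initial R0 of 0.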
One of the advantages of this style of parallel machine organization is a saving in the amount of logic. Anywhere from 20% to 50% of the logic on a typical processor chip is devoted to control, namely to fetching, decoding, and scheduling instructions. The remainder is used for on-chip storage (registers and cache) and the logic required to implement the data processing (adders, multipliers, etc.). In an SIMD machine, only one control unit fetches and processes instructions, so more logic can be dedicated to arithmetic circuits and registers. For example, 32 PEs fit on one chip in the MasPar MP-1, and a 1024-processor system is built from 32 chips, all of which fit on a single board (the control unit occupies a separate board).
Vector processing is performed on an SIMD machine by distributing
elements of vectors across all data memories. For example,
suppose we have two vectors, a and b, and a machine with 1024
PEs. We would store a_i in location 0 of memory i and b_i in
location 1 of memory i. To add a and b, the machine would tell each PE to
load the contents of location 0 into one register, the contents
of location 1 into another register, add the two registers, and
write the result. As long as the number of PEs is at least
the length of the vectors, vector processing on an SIMD machine
is done in constant time, i.e., the time does not depend on the length
of the vectors. Vector operations on a pipelined SISD vector
processor, however, take time that is a linear function of the
length of the vectors.
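
The vector addition described above can be sketched as follows, counting one step per broadcast instruction. This is a hedged illustration rather than real machine code: the function name simd_vector_add is hypothetical, the result is assumed to be written to location 2 of each local memory (the text does not say where it goes), and the sketch covers only vectors that fit on the available PEs.

```python
def simd_vector_add(a, b, num_pes=1024):
    """Add two vectors on a simulated SIMD machine.

    Returns (result, broadcasts), where broadcasts is the number of
    instructions the control unit issued.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("vectors must have the same length")
    if n > num_pes:
        raise ValueError("this sketch only covers vectors that fit on the PEs")

    # Distribute the data: a[i] in location 0 and b[i] in location 1 of
    # memory i. Location 2 (an assumption) will receive the result.
    mem = [[0, 0, 0] for _ in range(num_pes)]
    for i in range(n):
        mem[i][0] = a[i]
        mem[i][1] = b[i]

    active = [i < n for i in range(num_pes)]  # PEs past the vector end are idle
    r0 = [0] * num_pes                        # one R0 per PE
    r1 = [0] * num_pes                        # one R1 per PE
    broadcasts = 0

    def broadcast(step):
        """Issue one instruction; the loop stands in for simultaneous PEs."""
        nonlocal broadcasts
        for i in range(num_pes):
            if active[i]:
                step(i)
        broadcasts += 1

    def load_a(i):
        r0[i] = mem[i][0]      # load the contents of location 0

    def load_b(i):
        r1[i] = mem[i][1]      # load the contents of location 1

    def add(i):
        r0[i] += r1[i]         # add the two registers

    def store(i):
        mem[i][2] = r0[i]      # write the result

    for step in (load_a, load_b, add, store):
        broadcast(step)

    return [mem[i][2] for i in range(n)], broadcasts
```

For example, simd_vector_add([1, 2, 3], [10, 20, 30]) returns ([11, 22, 33], 4). A 1024-element addition on the same simulated machine also takes 4 broadcasts, which is the constant-time behaviour described above; the Python loop inside broadcast is sequential only because the simulation runs on an ordinary processor.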