3.5.6 SIMD Machines





 

Several commercial SIMD machines were introduced in the 1970s, but they were not widely used. Interest in this class of machines was renewed in the late 1980s with the introduction of the Connection Machine (CM-1) from Thinking Machines, Inc., and the MasPar MP-1. Part of the renewed interest was certainly the result of VLSI technology, which had advanced by that time to the point where several small processors could be put on a single chip. By themselves these processors were too simple to compete with general-purpose single-chip processors such as the Motorola 68020 or Intel 80386, but literally thousands of them could be packaged in a small space and built into a cost-effective system. For example, 32 MP-1 processors fit on a single chip, and 32 chips were placed on a single board, for a total of 1024 processors (and their associated memory) in approximately 4 square feet.

The CM-1 was based on 1-bit processors. Every operation in the machine processed 1-bit operands and produced 1-bit results. Operations on larger data elements, for example 32-bit integers, required one cycle per bit. Attached to each processor was a local memory with a capacity of 4K bits. Memory references, like processor operations, were 1-bit operations, i.e., a fetch copied 1 bit from memory into a 1-bit processor register. Sixteen processors were implemented on a single chip. Within a chip, processors were connected in a grid, and up to 4096 chips were connected via a 12-dimensional hypercube. All processors obeyed instructions issued by a central control processor, which in turn was connected to a front-end workstation.
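
The bit-serial style of arithmetic described above can be illustrated with a small simulation. The sketch below is only a simplified model, not the CM-1's actual microcode: the function names (to_bits, from_bits, simd_bit_serial_add, hypercube_neighbors) are hypothetical, each simulated processor is assumed to hold its operands least significant bit first, and one loop iteration stands in for one lockstep step broadcast by the control processor. The last function uses the standard definition of a hypercube, in which two nodes are neighbors when their binary addresses differ in exactly one bit; how the CM-1 actually numbered its chips is an assumption here.

```python
def to_bits(value, width):
    """Split an integer into `width` bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Reassemble an integer from a least-significant-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def simd_bit_serial_add(xs, ys, width=32):
    """Add xs[p] + ys[p] on every simulated processor p, one bit per step.

    Returns the sums (modulo 2**width) and the number of lockstep steps.
    """
    n = len(xs)
    a = [to_bits(x, width) for x in xs]   # each processor's local operand bits
    b = [to_bits(y, width) for y in ys]
    carry = [0] * n                       # one 1-bit carry register per processor
    result = [[0] * width for _ in range(n)]
    steps = 0
    for i in range(width):                # one broadcast step per bit position
        for p in range(n):                # conceptually simultaneous on all processors
            ai, bi, c = a[p][i], b[p][i], carry[p]
            result[p][i] = ai ^ bi ^ c                 # 1-bit sum
            carry[p] = (ai & bi) | (c & (ai ^ bi))     # 1-bit carry out
        steps += 1
    return [from_bits(r) for r in result], steps


def hypercube_neighbors(node, dimensions=12):
    """Nodes whose binary address differs from `node` in exactly one bit."""
    return [node ^ (1 << d) for d in range(dimensions)]


if __name__ == "__main__":
    sums, steps = simd_bit_serial_add([1, 2, 2**32 - 1], [3, 5, 1])
    print(sums, steps)
    print(hypercube_neighbors(0))
```

In this model a 32-bit add takes 32 steps regardless of how many processors participate, which is the trade-off the paragraph describes: each operation is slow, but it is applied to every data element at once. With 12 dimensions, each of the 4096 nodes has exactly 12 neighbors.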


Figure 20: Switching in an X-net

The MasPar MP-1 was introduced a few years after the CM-1. It also has a very narrow datapath, but it processes data 4 bits at a time instead of 1 bit at a time. Each processor can have up to 64KB of local memory. One of the interesting aspects of the MP-1 is that there are two separate communication systems, and programmers can alternate between them to choose the best performance for different parts of their algorithms. One interconnection network is known as the X-net (Figure 20). It connects each processor to its 8 nearest neighbors in a 2D mesh with wraparound connections. The other network is a global router, which provides point-to-point communication between any two processing elements (PEs). The router is implemented by a 3-stage switching network, where each stage in a 1024-processor machine contains a crossbar; together the three stages comprise a crossbar. The processors are controlled by a proprietary RISC processor known as the array control unit, or ACU. The ACU has its own local memory and is used for scalar operations, while the processor array is intended for vector and array operations. An MP-1 can be configured as a square mesh or a rectangular mesh. The smallest configuration has 1,024 processors and the largest has 16,384 processors in a grid.
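
The X-net connectivity described above (8 nearest neighbors in a 2D mesh with wraparound) can be sketched in a few lines. This is an illustrative model only: the names XNET_OFFSETS and xnet_neighbors are hypothetical, processors are assumed to be addressed by (row, column), and the 32 by 32 grid in the usage example is an assumed layout for a 1,024-processor machine rather than a documented one.

```python
# Offsets (row delta, column delta) to the 8 surrounding mesh positions.
XNET_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                (0, -1),           (0, 1),
                (1, -1),  (1, 0),  (1, 1)]


def xnet_neighbors(row, col, rows, cols):
    """Return the 8 neighbors of (row, col) on a rows x cols torus.

    The modulo operation implements the wraparound connections, so a
    processor on an edge is linked to processors on the opposite edge.
    """
    return [((row + dr) % rows, (col + dc) % cols) for dr, dc in XNET_OFFSETS]


if __name__ == "__main__":
    # Corner processor of an assumed 32 x 32 mesh.
    print(xnet_neighbors(0, 0, 32, 32))
```

Because of the wraparound, every processor has exactly 8 neighbors, including those in the corners; the corner processor (0, 0) above is linked to (31, 31) on the opposite corner.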

The newest machines from Thinking Machines and MasPar are the CM-5 and MP-2, respectively. The CM-5 is described in more detail in [15]. The MP-2 has a wider internal data path than the MP-1 (32 bits vs. 4 bits) but is otherwise very similar to the MP-1 in that it uses both the X-net and the global router to connect PEs in a 2D mesh. The largest MP-2, which has 16,384 processors, has a theoretical peak performance of 550 MFLOPS and reaches 473 MFLOPS on the LINPACK benchmark for parallel machines.






verena@csep1.phy.ornl.gov