Draw a PMS diagram for one of the systems you use. The diagram should include at least the size of main memory and cache, the processor(s), and the pathways between processor(s) and memory. Consult [32] for more information on PMS if necessary.
Write a
program that creates two
matrices of
random real numbers between
0 and 1 and then computes the product of the two matrices.
should be an input parameter. The only output from the program
should be the time required to multiply the matrices.
?
. Keep adding
entries until you have runs that take more than a few minutes. Do
the entries in the table agree with your predicted performance
model?
gnuplot or one of the
other plotting packages described in Chapter ``Scientific
Visualization In High Performance Computing''.
Look up ``MIPS'' rating of your machine, either in the manuals provided by the manufacturer or in a standard reference ([find Dongarra's numbers]). How does it compare to numbers you achieved in previous problem?
What is the communication bandwidth between the frame buffer and the monitor in a typical high resolution 8-bit RGB display?
Use
mathematical induction to prove there are
unique
-digit strings
composed only of the symbols 1 and 0.
Prove that if
is the
-bit representation of the integer
, the
two's complement of
, found by inverting
every bit
and adding 1, is the
representation of
. Hint: the value of the complement of
is
.
(a) Prove that if
is
the
-bit representation of the
integer
, then (a) shifting
left by
bits
is equivalent to
multiplying
by
and (b) shifting
right by
bits
is equivalent
to dividing
by
(ignoring any remainder).
(b) If
,
represents
a negative number in a two's complement system. In what is known
as a ``logical shift right'' 0's are inserted into the leftmost
bits, which means the result will be a positive number. Obviously
if we divide a negative number by a power of two we expect a
negative result, and in this case the logical shift doesn't give
the correct answer. For example,
. The 8-bit
two's complement
representation of
is 11110000. If we shift it right by two bits
to divide by 4, we get 00111100, which represents 60, not
. Can
you think of a simple way to ``fix'' the shift operation so that it
gives the correct result when it shifts both positive and
negative numbers?
Prove that if
is the
-bit
representation of the integer
, the low order
bits are the value
of the remainder of
. NOTE: this operation is also
performed very
efficiently on most machines. Let
be a pattern known as a
``mask'' that contains 0's in the high order
bits and 1's in
the low
order
bits. An operation known as a ``bitwise AND'' will compute a
pattern
such that
(the
operation is the logical AND, which is
1 if and only if both operands are 1). To find the remainder of a
division by
, create a mask with
1's in the low order bits, then
``and'' it with
. For example, the remainder of 14 (00001110 in an
8-bit system) divided by
is
.
Explain the floating point number system on the machine that you are using.
Look up the representation for floating point numbers in the system you will be using for this course. Does it conform to the IEEE standard (standard 754)? How many bits are in the mantissa and exponent in single precision? In double precision? Does a double precision number really have twice as much precision as a single precision number? Explain.
A Gantt chart can be used to show
how interleaved memory works by drawing one row for each memory
bank. Mark the time line in units of processor cycles. If a
processor requests an item from bank
at time
, draw a line in
row
starting at time
and continuing for
units, where
is
the memory cycle time.
Figure 21
shows the Gantt chart
for an 8-way interleaved memory in a system where the processor
cycle time is 10ns and the memory cycle time is 40ns. The chart
illustrates which memories are busy when the processor requests
items from successive memory cells. Asterisks on the time line
indicate when data items reach the processor (assuming data is
delivered on the last memory cycle). Asterisks in every column
indicate the memory is performing at its full potential,
i.e. there are no bank conflicts.
Use a Gantt chart to show that for a system with 16 memory banks, 12.5ns processor cycle time, and 50ns memory cycle time, there will be no conflicts when the stride is 2 or 4, but there will be conflicts when the stride is 8 or 16. (NOTE: these cycle times and interleaving factors are taken from the Cray-1).
Use a Gantt chart to illustrate the effectiveness of vector chaining. Use the following parameters: the length of each vector is 64 elements; the multiply pipeline is 7 stages deep, the add pipeline has 6 stages, and there is a one-cycle delay (called the chain slot time) after producing the first product before the first pair of operands go to the adder. How many cycles does it take to compute V4 = V3+(V0*V1) without chaining? With chaining?
What is your understanding of the connection between scalar registers, vector registers, and floating point units for the performance of the SAXPY benchmark?
Does vector chaining improve performance on the SAXPY benchmark? What could you infer about the organization of the data path and the connection between scalar registers, vector registers, and floating point units if you measured the performance of SAXPY and found out that operands were indeed being chained between data units?
Each vertex of a
-dimensional hypercube is
connected to d other vertices. In a parallel machine with a
hypercube topology, there is a single processor at each node, and
it is linked to d other processors. An alternative design is to
place a ring of d processors at each vertex, linking the
i
processor to the neighboring
vertex along dimension i. The result
is a topology known as a cube-connected cycle (CCC)
(Figure 22).
Every node in a CCC is
connected to 3 other nodes: its
two neighbors in the ring plus the node in the neighboring
vertex. Thus a CCC has a constant degree, no matter how many
nodes are in the topology.
(a) What is the diameter of a CCC (assume bidirectional communication).
(b) How many nodes are in a general n-dimensional CCC?
(c) Draw a picture of a 4-dimensional CCC.
Label the nodes in the 3d and 4d hypercubes in Figure 11. In the 4D cube, draw the paths that are taken by the following messages: from 0 to 15; from 3 to 12; from 5 to 10, and from 10 to 5.
You may have noticed in the previous exercise that the message from node 5 to node 10 travels a different path than the message from node 10 to node 5. Explain why, for any two processors i and j, messages sent from i to j travel a different path than messages from j to i. Can you characterize the relationship between the two paths?
For each interior node in the butterfly switch of figure Figure 15, indicate whether it is in the straight- through or flipped configuration for the following pairs of processor-memory connections. Assume the connections that come first in the list are made first by the switching network:
P0- M3, P1-M5, P2-M2, P3-M7, P4-M6, P5-M0, P6-M4, P7-M1
How many of these connections block due to contention in the switch?
Most of the uses of a crossbar switch mentioned in this chapter connect a set of processors on the ``input'' side to a set of memories on the ``output'' side (Figure 14). However, two systems, the Fujitsu VPP500 and the MasPar MP-1, use a crossbar to interconnect processing elements (PEs), nodes which consist of a processor and its local memory. Draw a PMS diagram of such a system and explain how information could be moved from one node to another.
The ring network in the KSR-1 is a slotted ring, which means packets flow around the ring in discrete steps under control of a global clock. The KSR-1 transfers 128 million packets per second past each slot on the ring.
(a) The bandwidth is 1GB/sec. How wide is the communication channel?
(b) The processor cycle time is 20MHz. How many packets can be taken off the ring during each processor cycle?
(c) What is the diameter of a full-size (1088-PE) KSR-1?
(d) Ignoring communication overhead and the possibility of contention that would delay a message, how long will it take the machine to transfer 256 bytes along the longest communication path?
Use gnuplot or some other plotting package to
plot Amdahl's law for different values of
. Note that
varies
from 0 to 1, i.e. from completely sequential to completely parallel.