4 Exercises

Next: References Up: CA Chapter Previous: 3.6 Performance Models for

# 4 Exercises

Draw a PMS diagram for one of the systems you use. The diagram should include at least the size of main memory and cache, the processor(s), and the pathways between processor(s) and memory. Consult [32] for more information on PMS if necessary.

Write a program that creates two matrices of random real numbers between 0 and 1 and then computes the product of the two matrices. should be an input parameter. The only output from the program should be the time required to multiply the matrices.

• Analyze the complexity of your program, i.e. how many operations does it perform as a function of the input size ?

• Construct a table of execution times as a function of the matrix size . Keep adding entries until you have runs that take more than a few minutes. Do the entries in the table agree with your predicted performance model?

• Plot the data in the table with `gnuplot` or one of the other plotting packages described in Chapter ``Scientific Visualization In High Performance Computing''.

• From the data in the table can you tell when the matrices are large enough so they don't all fit in cache? How does this data point correlate with the size of the cache on your system?

• If you have a vector processor, can you infer anything about optimum vector sizes from the pattern of data in your table?

• If you are collecting your data on a workstation, record the system load average along with the execution time. Try several runs of the same matrix size but at times when the load average is high, medium, and low. Plot the data again, either with separate lines for different load averages, or with error bars that show the range of times. Does system load affect the performance of your program?

Look up ``MIPS'' rating of your machine, either in the manuals provided by the manufacturer or in a standard reference ([find Dongarra's numbers]). How does it compare to numbers you achieved in previous problem?

What is the communication bandwidth between the frame buffer and the monitor in a typical high resolution 8-bit RGB display?

Use mathematical induction to prove there are unique -digit strings composed only of the symbols 1 and 0.

Prove that if is the -bit representation of the integer , the two's complement of , found by inverting every bit and adding 1, is the representation of . Hint: the value of the complement of is .

(a) Prove that if is the -bit representation of the integer , then (a) shifting left by bits is equivalent to multiplying by and (b) shifting right by bits is equivalent to dividing by (ignoring any remainder).

(b) If , represents a negative number in a two's complement system. In what is known as a ``logical shift right'' 0's are inserted into the leftmost bits, which means the result will be a positive number. Obviously if we divide a negative number by a power of two we expect a negative result, and in this case the logical shift doesn't give the correct answer. For example, . The 8-bit two's complement representation of is 11110000. If we shift it right by two bits to divide by 4, we get 00111100, which represents 60, not . Can you think of a simple way to ``fix'' the shift operation so that it gives the correct result when it shifts both positive and negative numbers?

Prove that if is the -bit representation of the integer , the low order bits are the value of the remainder of . NOTE: this operation is also performed very efficiently on most machines. Let be a pattern known as a ``mask'' that contains 0's in the high order bits and 1's in the low order bits. An operation known as a ``bitwise AND'' will compute a pattern such that (the operation is the logical AND, which is 1 if and only if both operands are 1). To find the remainder of a division by , create a mask with 1's in the low order bits, then ``and'' it with . For example, the remainder of 14 (00001110 in an 8-bit system) divided by is .

Explain the floating point number system on the machine that you are using.

Look up the representation for floating point numbers in the system you will be using for this course. Does it conform to the IEEE standard (standard 754)? How many bits are in the mantissa and exponent in single precision? In double precision? Does a double precision number really have twice as much precision as a single precision number? Explain.

Figure 21 Gantt chart for 8-way interleaved memory View Figure

A Gantt chart can be used to show how interleaved memory works by drawing one row for each memory bank. Mark the time line in units of processor cycles. If a processor requests an item from bank at time , draw a line in row starting at time and continuing for units, where is the memory cycle time. Figure 21 shows the Gantt chart for an 8-way interleaved memory in a system where the processor cycle time is 10ns and the memory cycle time is 40ns. The chart illustrates which memories are busy when the processor requests items from successive memory cells. Asterisks on the time line indicate when data items reach the processor (assuming data is delivered on the last memory cycle). Asterisks in every column indicate the memory is performing at its full potential, i.e. there are no bank conflicts.

Use a Gantt chart to show that for a system with 16 memory banks, 12.5ns processor cycle time, and 50ns memory cycle time, there will be no conflicts when the stride is 2 or 4, but there will be conflicts when the stride is 8 or 16. (NOTE: these cycle times and interleaving factors are taken from the Cray-1).

Use a Gantt chart to illustrate the effectiveness of vector chaining. Use the following parameters: the length of each vector is 64 elements; the multiply pipeline is 7 stages deep, the add pipeline has 6 stages, and there is a one-cycle delay (called the chain slot time) after producing the first product before the first pair of operands go to the adder. How many cycles does it take to compute V4 = V3+(V0*V1) without chaining? With chaining?

Figure 22 A single vertex in a 3D cube View Figure

What is your understanding of the connection between scalar registers, vector registers, and floating point units for the performance of the SAXPY benchmark?

Does vector chaining improve performance on the SAXPY benchmark? What could you infer about the organization of the data path and the connection between scalar registers, vector registers, and floating point units if you measured the performance of SAXPY and found out that operands were indeed being chained between data units?

Each vertex of a -dimensional hypercube is connected to d other vertices. In a parallel machine with a hypercube topology, there is a single processor at each node, and it is linked to d other processors. An alternative design is to place a ring of d processors at each vertex, linking the i processor to the neighboring vertex along dimension i. The result is a topology known as a cube-connected cycle (CCC) (Figure 22). Every node in a CCC is connected to 3 other nodes: its two neighbors in the ring plus the node in the neighboring vertex. Thus a CCC has a constant degree, no matter how many nodes are in the topology.

(a) What is the diameter of a CCC (assume bidirectional communication).

(b) How many nodes are in a general n-dimensional CCC?

(c) Draw a picture of a 4-dimensional CCC.

Label the nodes in the 3d and 4d hypercubes in Figure 11. In the 4D cube, draw the paths that are taken by the following messages: from 0 to 15; from 3 to 12; from 5 to 10, and from 10 to 5.

You may have noticed in the previous exercise that the message from node 5 to node 10 travels a different path than the message from node 10 to node 5. Explain why, for any two processors i and j, messages sent from i to j travel a different path than messages from j to i. Can you characterize the relationship between the two paths?

For each interior node in the butterfly switch of figure Figure 15, indicate whether it is in the straight- through or flipped configuration for the following pairs of processor-memory connections. Assume the connections that come first in the list are made first by the switching network:

P0- M3, P1-M5, P2-M2, P3-M7, P4-M6, P5-M0, P6-M4, P7-M1

How many of these connections block due to contention in the switch?

Most of the uses of a crossbar switch mentioned in this chapter connect a set of processors on the ``input'' side to a set of memories on the ``output'' side (Figure 14). However, two systems, the Fujitsu VPP500 and the MasPar MP-1, use a crossbar to interconnect processing elements (PEs), nodes which consist of a processor and its local memory. Draw a PMS diagram of such a system and explain how information could be moved from one node to another.

The ring network in the KSR-1 is a slotted ring, which means packets flow around the ring in discrete steps under control of a global clock. The KSR-1 transfers 128 million packets per second past each slot on the ring.

(a) The bandwidth is 1GB/sec. How wide is the communication channel?

(b) The processor cycle time is 20MHz. How many packets can be taken off the ring during each processor cycle?

(c) What is the diameter of a full-size (1088-PE) KSR-1?

(d) Ignoring communication overhead and the possibility of contention that would delay a message, how long will it take the machine to transfer 256 bytes along the longest communication path?

Use gnuplot or some other plotting package to plot Amdahl's law for different values of . Note that varies from 0 to 1, i.e. from completely sequential to completely parallel.

Next: References Up: CA Chapter Previous: 3.6 Performance Models for

verena@csep1.phy.ornl.gov