Here, we modify Amdahl's Law [1] to describe the operations required to achieve the above vectorized events. As Monte Carlo processes are probabilistic, they result in random branching. This inhibits direct vectorization, as the particles which ``pass'' tests are generally located discontiguously in memory. On vector architectures, we must either: (1) pay the penalty of data motion to ``gather'' elements into contiguous locations in memory (and perhaps subsequently to ``scatter'' elements into their discontiguous original positions in memory) or (2) pay the penalty of full (100% - passed and failed) memory retrieval of elements, while masking-off elements which are not to have their values stored. On SIMD architectures, the penalty is incurred by conditionally masking-off individual processors where particles do not pass the test.

Suppose we are mapping this process to an architecture and desire to model the speedup from either: (1) scalar to vector or (2) single processor to multiprocessors. Here, the speedup is the ratio of: (1) scalar to vector or (2) single to multiprocessor execution times. Letting:

Then, the elapsed time to run the problem on a scalar processor will be:

where the elapsed time is simply the number of operations to be performed, , divided by the scalar execution rate, . Now, suppose that some fraction of this work, , can be done in the vector unit or in multiple processors. Additionally, suppose that the overhead associated with this is composed of two parts: (1) a fractional addition, , due to vector start-up (or due to the spreading of the problem over multiple processors) and (2) a fractional addition, , due to vectorized data motion (or due to inter-processor communication). Note that, these fractional additions are purely overhead, as the work done by the CPU is useless work. Then, the elapsed time to execute the problem on a parallel or vector machine is:

Then, dividing the above two equations yields the speedup ratio:

Note, as and appear in conjunction, we can combine their effects into one. Thus, hereafter, we let represent the sum of and . Moreover, it is generally possible to select long enough vector lengths so that is negligible; however, data motion, , is always significant. In effect, we pay the overhead of data motion (useless work ``gathering'' data elements into contiguous memory locations), so that we can perform subsequent operations in the much faster vector hardware.

In Figure 21, we depict versus for (representative of Cray Hardware), with varying parametrically.

Figure 21 Modified Amdahl's Law. View Figure

Several things are noteworthy from the graph:

- High values of are attainable, even if there is significant overhead in data motion,
- must be large to obtain good speedup, and
*The curves are**relatively*insensitive to

If we fix (nominally fixed by the architecture), we can consider equation (1) to contain two independent parameters, and . Therefore, if we measure for two different architectures with known but different values of , we can determine both and .

Burns et al. [3] have performed such an experiment on a Cray Y/MP and an ETA 10-G for the Monte Carlo simulation GAMTEB-one of Los Alamos' benchmark programs [2]. We measured (by turning vectorization ``on'' and ``off'' for critical loops) values of V/S of 12 and 25 for the Cray Y/MP and the ETA 10-G, respectively.

In particular, GAMTEB exhibits many ``standard'' aspects of Monte Carlo simulations, so this exercise should yield representative values for these two parameters. The results were:

Additionally, we independently verified these numbers using Cray's hardware performance monitor (hpm). Ergo, we conclude that Monte Carlo programs may be very highly vectorized (), with about a 25% penalty due to data motion. For Cray architectures, the speedup that results exceeds 9. This region of the parameter space is shown shaded in Figure 21.