Next: 7. Load balancing Up: Parallel fast Fourier transforms Previous: 5. Cost comparison

6. Results

In order to assess the validity of the analysis of Sec. 5 in practice, we compared the communication times for both methods running on the CRAY T3E-1200E at the University of Manchester. This machine was chosen because it enabled us to compare the methods running on up to 512 nodes. However, as mentioned at the end of Sec. 5, we would not expect this machine to give particularly favourable results for the new method because its latency cost is small.

In Fig. 4 we plot the results for two different FFT grid sizes: $128 \times 128 \times 128$ and $64 \times 64 \times 64$ . The implementation of each method used the same communication library call. The results plotted were obtained by averaging the communication times for 50 convolutions (i.e. both a forward and backward FFT) on the grid and 100 convolutions on the grid. The error bars show the resulting standard deviations. The dashed and dotted lines show the best fits to Eqs. 4 and 5 respectively.

**Figure 4:** Comparison of average communication times for both methods applied to $128 \times 128 \times 128$ (left) and $64 \times 64 \times 64$ (right) grid sizes.
$\includegraphics [height=70mm]{timing.eps}$

The quality of the fits of Eqs. 4 and 5 to the data supports our analysis. In particular, the cross-over occurs at a smaller number of processors for the case of the smaller FFT grid, as expected. On the machine used here, the new method is out-performed by the old method for the reasons given above, and would only be applicable for very small FFT grids or very large numbers (over 1000) of nodes. However, from the quality of the fit to the analysis of Sec. 5, we would expect to be able to apply Eqs. 4 and 5 with confidence to estimate how the methods would scale on other machines. For example, on the cluster of PCs mentioned in Sec. 5, on which we measured a latency of about $300~\mathrm{\mu s}$ and a bandwidth of $8.7~\mathrm{Mb s^{-1}}$ , the expected cross-over points are 40 and 140 nodes for the and grids respectively.

Next: 7. Load balancing Up: Parallel fast Fourier transforms Previous: 5. Cost comparison

Peter Haynes