next up previous
Next: 6. Results Up: Parallel fast Fourier transforms Previous: 4. New parallel implementation


5. Cost comparison

Both of the methods described in Secs. 3 and 4 require the same number of floating point operations, but differ dramatically in their communication patterns. The traditional method requires two transposition or communication phases, and in general, each of these may require each node to communicate with every other node. By contrast, the new method requires $\log_2 N$ communication phases, but in each of these, the nodes communicate in a pairwise fashion.

In the following analysis, we assume that the cost of communicating a packet of data from one node to another consists of two parts. The first part is the transmission time, which is the size of the data packet divided by the bandwidth of the connection $\beta$. The second part is a fixed overhead, the latency $\alpha$, which is independent of the size of the data packet. We define $\alpha$ and $\beta$ for the situation in which a node is sending a data packet to another node while simultaneously receiving a data packet of the same size from another node.

In the traditional method, each of the two communication phases generally involves each node communicating with every other node. This can be achieved in $N-1$ steps in which all of the nodes simultaneously send a data packet to one node and receive a packet from another (usually different) node. The total latency cost for the traditional parallel DFT is thus $2 (N-1) \alpha$. We define $n = n_x n_y n_z$ to be the total number of data elements in the full FFT grid. In the traditional method, each data packet has a size of $n u / N^2$, where $u$ is the size of a single data element (typically 16 bytes for a double precision complex data type). The total communication time involved in the traditional method, $\tau_{\mathrm{trad}}$, is thus:

$\displaystyle \tau_{\mathrm{trad}} = 2 \left( N - 1 \right) \left[ \alpha + {n u \over \beta N^2} \right] \approx 2 \left[ \alpha N + {n u \over \beta N} \right] \mathrm{~,~for~large~} N. \qquad (4)$
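As a sketch, Eq. (4) can be evaluated directly; the function name `tau_trad` and its argument order are introduced here for illustration, not taken from the paper:

```python
def tau_trad(N, n, alpha, beta, u=16):
    """Communication time of Eq. (4) for the traditional parallel FFT.

    N     -- number of nodes
    n     -- total number of grid elements, n_x * n_y * n_z
    alpha -- latency per exchange step (seconds)
    beta  -- bandwidth (bytes per second)
    u     -- size of one data element (16 bytes for double complex)
    """
    # Two transposition phases, each of N-1 exchange steps;
    # every packet holds n*u/N**2 bytes.
    return 2 * (N - 1) * (alpha + n * u / (beta * N**2))
```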

In the new method, there are $\log_2 N$ communication phases in which each node communicates with just one other node. The total latency cost for this method is thus $\alpha \log_2 N$. However, in this method, the size of the data packets exchanged is $n u / N$, a factor of $N$ larger than in the traditional method. The total communication time for the new method, $\tau_{\mathrm{new}}$ is then:

$\displaystyle \tau_{\mathrm{new}} = \log_2 N \left[ \alpha + {n u \over \beta N} \right]. \qquad (5)$
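Eq. (5) can be sketched the same way; again, `tau_new` is a name introduced here for illustration:

```python
import math

def tau_new(N, n, alpha, beta, u=16):
    """Communication time of Eq. (5) for the new parallel FFT."""
    # log2(N) pairwise-exchange phases, each moving packets of n*u/N bytes.
    return math.log2(N) * (alpha + n * u / (beta * N))
```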

The new method has a lower latency cost, due to the smaller number of data packets sent, but a higher transmission cost due to the larger size of those data packets. We therefore expect the new method to be advantageous in the limit of a large number of processors $N$. In this limit, many processors need to communicate with each other, but the packets they exchange are very small, so that the latency cost dominates. Since the latency cost of the new method increases only logarithmically instead of linearly, it should scale to a larger number of processors.

The ``cross-over'' point at which the new method performs more efficiently than the traditional method depends upon the hardware (and low-level software libraries which interface to it), parametrised by $\alpha$ and $\beta$, and the size of the FFT grid $n$. The smaller the FFT grid, the more competitive the new method will be. It is for this reason that the method may be most useful in electronic structure calculations in which the FFT grid sizes are relatively small.
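The cross-over point can be located numerically by comparing Eqs. (4) and (5). The helper below is a sketch under the assumption that $N$ is restricted to powers of two; the name `crossover_N` is hypothetical:

```python
import math

def crossover_N(n, alpha, beta, u=16, max_nodes=1 << 20):
    """Smallest power-of-two node count N at which the new method's
    communication time, Eq. (5), beats the traditional one, Eq. (4).
    Returns None if no cross-over occurs up to max_nodes."""
    N = 2
    while N <= max_nodes:
        t_trad = 2 * (N - 1) * (alpha + n * u / (beta * N**2))
        t_new = math.log2(N) * (alpha + n * u / (beta * N))
        if t_new < t_trad:
            return N
        N *= 2
    return None
```

Consistent with the discussion above, a latency-dominated setting (large $\alpha$, small grid) crosses over at small $N$, while a transmission-dominated setting (negligible $\alpha$, large grid) never does.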

The product $\alpha \beta$ defines the packet size which costs as much in latency as transmission to send. On a cluster of PCs connected by 100 Mbit ethernet this product is of the order of 2 Kbytes. We measured a very similar value for a 64-node SGI Origin 2000 supercomputer. In both cases, the generic Message-Passing Interface (MPI) was used. However, use of lower-level vendor-specific libraries on the Origin would reduce the latency and hence the $\alpha \beta$ product. We therefore anticipate that the new method may be more suitable for implementation on clusters of workstations than on supercomputers designed and built to run programs in parallel.
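As a back-of-envelope illustration of the figures quoted above (assuming 100 Mbit/s corresponds to roughly $12.5 \times 10^6$ bytes/s), an $\alpha \beta$ product of about 2 Kbytes implies a latency of order 160 microseconds:

```python
beta = 12.5e6            # bandwidth in bytes/second (100 Mbit/s)
packet = 2 * 1024        # alpha*beta product quoted in the text, in bytes
alpha = packet / beta    # implied latency in seconds
print(round(alpha * 1e6))  # latency in microseconds -> 164
```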


Peter Haynes