Next: 5. Cost comparison Up: Parallel fast Fourier transforms Previous: 3. Traditional parallel implementation

4. New parallel implementation

In the traditional method, the communication phases simply redistribute the data across the nodes so that the nodes can perform local 1D-FFTs. By constrast, in the new method we present here the communication phases also play a rôle in the calculation of the 3D-FFT.

**Figure 2:** Distribution of data for new implementation.
$\includegraphics [height=58mm]{new.eps}$

The distribution of data in the new method is illustrated in Fig. 2, again for a $4 \times 6 \times 8$ grid and 4 nodes. This distribution is motivated by the Danielson-Lanczos lemma. Each node now deals with a set of -planes. For simplicity, consider first applying the Danielson-Lanczos lemma recursively to a 1D-FFT. In the first step, the data is partitioned into even and odd elements. In the second step, each of these two sets of data is partitioned into even and odd elements again. Thus one obtains four sets of data: even-even, even-odd, odd-even and odd-odd elemets. In the case of four nodes, we would therefore assign one of these sets to each node, to obtain the pattern shown in Fig. 2. Since the three 1D-FFTs in the -, - and -directions in Eq. 3 commute, this distribution is quite flexible, and an alternative is shown in Fig. 3 in which the Danielson-Lanczos lemma has been applied in both the - and -directions.

**Figure 3:** Alternative distribution of data for new implementation.
$\includegraphics [height=58mm]{alt.eps}$

Thus, instead of distributing data as rods as in the traditional method, each node deals with a set of data which is picked uniformly from the full FFT grid. In the case of a distribution across nodes, each node thus deals with one of the sets of data which would result after $\log_2 N$ applications of the Danielson-Lanczos lemma. Hence, in order to perform the FFT from momentum-space to real-space, each node first performs a local 3D-FFT on its data. It then remains for the nodes to communicate in order to combine this data to complete the full FFT (i.e. to perform the multiplication by a phase factor and addition shown in the last line of Eq. 2). When complete, each node possesses a single contiguous block of real-space data (right of Figs. 2 and 3). Again, the DFT from real- to momentum-space is performed by reversing the above operations.

Another advantage of the new method is that the Hamiltonian matrix is now effectively blocked. The FFT permits the application of the Hamiltonian on state vectors in $O(n \log n)$ operations. The new method distributes the data such that each block of the matrix can be applied in $O\left( (n/N) \log (n/N) \right)$ . This opens up the possibility of applying iterative diagonalisation algorithms for block matrices to the electronic structure problem to further decrease the amount of communication.

Next: 5. Cost comparison Up: Parallel fast Fourier transforms Previous: 3. Traditional parallel implementation

Peter Haynes