It is most straightforward to implement a code to perform these calculations on serial or vector processors. These offer good performance: for example, the Hitachi S3600 has a theoretical peak performance of 2 GFLOPS (billion floating-point operations per second). However, although the approximations introduced above greatly reduce the cost of ab initio electronic structure calculations, this performance is not sufficient for the largest calculations. Significantly higher performance can be achieved only through the use of parallel architectures, in which many high-performance microprocessors are connected by a high-performance network. While this can offer much higher performance (for example, 64 processors of a Hitachi SR2201 have a combined theoretical peak of 19.2 GFLOPS, or 0.3 GFLOPS per processor), the techniques necessary to write efficient code are much more complex. The data must be partitioned between the processors, and communication must occur whenever data resident on one processor are needed by another. In this implementation, the grid on which the charge densities are represented and the reciprocal-space sphere of plane waves are both distributed as columns between processors. Each processor may then operate on the portion of the data that it holds locally. In particular, one-dimensional fast Fourier transforms may be performed along these columns without communication, but the data must be reordered between processors before the Fourier transforms along the other axes can be performed. This reordering is the dominant communication cost for this implementation.
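The column decomposition described above can be illustrated with a small serial simulation. The following is a minimal sketch under stated assumptions, not the implementation discussed in the text: the "processors" are plain Python dictionaries, a naive discrete Fourier transform stands in for the one-dimensional FFT, the full cubic grid is used rather than a sphere of plane waves, and the round-robin `owner` rule and all function names are hypothetical. It performs local transforms along one axis, reorders the data so that columns run along the next axis, and counts how many grid values would have to cross between processors.

```python
import cmath

def dft_1d(column):
    """Naive O(n^2) DFT of one column; a stand-in for a 1D FFT."""
    n = len(column)
    return [sum(column[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def owner(key, n, nproc):
    """Hypothetical round-robin assignment of column `key` to a processor."""
    return (key[0] * n + key[1]) % nproc

def split_index(full, axis):
    """Return (column key, position along the column) for a full (x, y, z) index."""
    key = tuple(v for i, v in enumerate(full) if i != axis)
    return key, full[axis]

def distribute(grid, n, nproc, axis):
    """Deal the columns running along `axis` out to the simulated processors."""
    stores = [dict() for _ in range(nproc)]
    for x in range(n):
        for y in range(n):
            for z in range(n):
                key, pos = split_index((x, y, z), axis)
                col = stores[owner(key, n, nproc)].setdefault(key, [0j] * n)
                col[pos] = grid[x][y][z]
    return stores

def transform_local(stores):
    """Each processor transforms only the columns it holds; no communication."""
    return [{key: dft_1d(col) for key, col in store.items()} for store in stores]

def reorder(stores, n, nproc, old_axis, new_axis):
    """Re-deal the data so columns run along `new_axis`; count off-processor moves."""
    new_stores = [dict() for _ in range(nproc)]
    moved = 0
    for p, store in enumerate(stores):
        for key, col in store.items():
            for pos, value in enumerate(col):
                full = list(key)
                full.insert(old_axis, pos)  # rebuild the (x, y, z) index
                new_key, new_pos = split_index(tuple(full), new_axis)
                q = owner(new_key, n, nproc)
                new_stores[q].setdefault(new_key, [0j] * n)[new_pos] = value
                moved += (q != p)  # this value would be sent over the network
    return new_stores, moved

def fft3d_distributed(grid, n, nproc):
    """3D transform as local 1D transforms along z, then y, then x."""
    stores = transform_local(distribute(grid, n, nproc, axis=2))
    total_moved = 0
    for old_axis, new_axis in ((2, 1), (1, 0)):
        stores, moved = reorder(stores, n, nproc, old_axis, new_axis)
        total_moved += moved
        stores = transform_local(stores)
    # Gather the x-columns back into a single grid for inspection.
    result = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for store in stores:
        for (y, z), col in store.items():
            for x, value in enumerate(col):
                result[x][y][z] = value
    return result, total_moved

def dft3d_direct(grid, n):
    """Reference 3D DFT evaluated directly from its definition."""
    def w(m):
        return cmath.exp(-2j * cmath.pi * m / n)
    return [[[sum(grid[x][y][z] * w(kx * x + ky * y + kz * z)
                  for x in range(n) for y in range(n) for z in range(n))
              for kz in range(n)] for ky in range(n)] for kx in range(n)]

# Small demonstration: a 4 x 4 x 4 grid shared between 2 simulated processors.
n, nproc = 4, 2
grid = [[[complex(x + 2 * y - z, (x * y * z) % 3) for z in range(n)]
         for y in range(n)] for x in range(n)]
result, moved = fft3d_distributed(grid, n, nproc)
reference = dft3d_direct(grid, n)
error = max(abs(result[x][y][z] - reference[x][y][z])
            for x in range(n) for y in range(n) for z in range(n))
print(f"max error {error:.2e}, values moved between processors: {moved}")
```

In this sketch the transforms themselves never touch another processor's data; all inter-processor traffic arises in `reorder`, which plays the role of the data reordering identified above as the dominant communication cost. How many values move depends on the assumed `owner` rule, so the count printed here should be read as illustrative only.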
For the foreseeable future, parallel architectures offer the only way to achieve the computational performance necessary to address significant questions in biology. As parallel machines become even more widespread, the price:performance ratio is likely to continue to decrease rapidly, making such calculations cost-effective.