Monte Carlo calculations are inherently parallel in nature. At their most basic, they involve the calculation of a large set of independent random numbers and then the averaging over a set of results produced by each of these random numbers. Coupled with the fact that QMC calculations are relatively expensive to perform on todays workstation computers, they are an ideal candidate for porting to parallel architecture machines, which offer two to three orders of magnitude more computational power than a conventional workstation.
All the VMC calculations reported on in this thesis have been performed on the Cray T3D and Hitachi SR2001 Massively Parallel Processing (MPP) machines. Only a few changes to the general VMC algorithm outlined above are required to produce a VMC algorithm that will make efficient use of an MPP machine.
The modified parallel algorithm is illustrated in figure . Each processing element on the parallel machine runs its own version of the VMC algorithm. These are made independent of each other by starting the random number generator with a different seed on each node. Care must be taken to ensure that the period of the random number generator is sufficiently long that the sequences of random numbers on each of the nodes do not overlap. This ensures that by the time the equilibration process has finished, the positions of the electrons on each node are not correlated. Each node then accumulates its own set of observables such as the total energy and charge density. At the end of the calculation, these observables are all passed to one node that simply adds them up and takes the mean.
It is not strictly necessary to run through the equilibration stage on each of the nodes. A single node calculation could be used to produce a set of equilibrated configurations of electron positions that could then be used as a starting point for the energy evaluation stage of the process on each of the nodes. In practice, the equilibration stage of the calculation is only a small fraction of the total time and this was therefore not deemed necessary.
The time overhead required to perform the communication between processors for accumulating averages etc. is negligible. Therefore, the parallel efficiency of the algorithm is effectively 100%. This means that on parallel machines with a few hundred processing elements, the algorithm achieves almost linear scaling, i.e. if one doubles the number of processors on which the code is running, the overall wall-clock time required to perform a particular VMC calculation to within a certain specific statistical accuracy is halved.
Figure: Flow chart illustrating the Parallel VMC algorithm. Each individual node is enclosed within the dashed boxes. Only three of the N nodes are shown in the diagram. The initial random seeds are generated on one node and then distributed out to all the other N-1 processing nodes. At the end of each block of energy evaluation moves, averages are taken across all the processors.