TCM
UoC crest

Linear Congruential Random Numbers and Parallelisation

Linear congruential pseudorandom number generators have a bad press: they tend to have short repeat periods, and n-tuples do not fill an n-dimensional space uniformly, but rather lie on hyperplanes. They are well described in a Wikipedia article.

But sometimes a fast, and poor, generator is all that is needed. Linear congruential generators have the advantage that their state is small, and they are easy to step forwards without needing to calculate every intermediate value in the sequence.

Thus one can write OpenMP code to fill in an array with random numbers, and obtain the same results no matter how many threads are used.

The basic equation is

xn+1=(a*xn+c) mod m

For well chosen a, m and c the repeat length can be up to m. For speed m is often chosen to be a power of two, so that the modulus operation becomes simply a bitwise and. A popular set of choices is:

m=231
a=1103515245
c=12345

The formula is easy to step forwards large distances because

xn+2=(a*((a*xn+c) mod m + c) mod m
= ((a2 mod m)x + (a*c+c) mod m ) mod m

which is the same equation as before, but with a replaced by a2 mod m, and c by (a*c+c) mod m. (Note that this new choice of a and c leads to a sequence with half the repeat length of the original sequence.)

This can be repeated to work out quickly how to step forwards four, eight, sixteen, thirty-two etc. steps of the original sequence at once. To step forwards an arbitrary number of steps, N, requires order log2N calculations. (That it takes so little effort to jump forwards in this sequence might be regarded as proof that it is not truely random.)

Random numbers and parallelisation

When parallelising code reliant on random numbers, it is useful for debugging if precisely the same random numbers are used no matter how many parallel processes are running. This does not mean the same random numbers on each parallel process, but rather the same sequence that the serial code would have produced if one combines all the parallel parts.

The ability to take a linear congruential generator and readily produce another which produces every nth value from the same sequence, or to step forwards quickly a large number of values and then calculate every value in the original sequence, can be useful. As an example this page considers parallelising the linear congruential generator using OpenMP.

A rather conservative piece of Fortran for filling a 1D array with random numbers in the range 0<=x<1 might be:

  subroutine pran(rnd,seed)
    implicit none
    integer, intent(inout) :: seed
    real(kind(1d0)), intent(out) :: rnd(:)
    integer, parameter :: long=selected_int_kind(15)
    integer(long), parameter :: m=2147483647_long
    integer(long) :: s

    s=seed
  
    do i=1,size(rnd)
       s=iand(s*1103515245+12345,m)
       rnd(i)=s/(m+1d0)
    enddo

    seed=s
  end subroutine pran

(This code makes no assumptions about how the multiplication of default integers behaves under overflow. It uses 64 bit integers for everything which might overflow a 32 bit integer.)

Suppose we actually wish to jump forwards and start filling the array rnd at some offset from the beginning. This we can do by recalculating s after a loop involving just log2(offset) iterations. That loop looks like

    a=1
    c=0
    ap=1103515245
    cp=12345
    p=1

    do while (p<=ioffset)
       if (iand(p,ioffset).ne.0) then
          c=iand(ap*c+cp,m)
          a=iand(a*ap,m)
       end if
       cp=iand(ap*cp+cp,m)
       ap=iand(ap*ap,m)
       p=p*2
    enddo

    s=iand(seed*a+c,m)

An example of this using OpenMP and Fortran to fill in an array is given as prand.f90. This example lets each thread operate on a block of elements, rather than interleaving the elements in the array which each thread modifies. Interleaving would cause cache line conflicts.

The example also needs to ensure that the final seed written back is the same one which the unparallelised code would have produced.

One could argue that this example is pointless: the run-time is approximately zero, even for arrays of many millions of elements. But it is the concept that counts! If one wished to use a modulus which was not a power of two, so than an explicit modulus operation was needed, then the compute costs would be significantly higher, and parallelisation more useful.

And, given that test programs are expected, see prand_test.f90 for the same with a short test program appended.

Finally, for those who prefer it, prand_test.c, the same in C. One might argue that a lot of the modulus operations are unnecessary, and that the slightly less readable prand2_test.c suffices. In C/C++, unsigned integer overflow is well-defined, whereas Fortran has no unsigned integer type, and signed integer overflow is less well defined.

Timings

One timed test of the above OpenMP C code produced the following data on a quad core 3.1GHz Haswell.

Sizens per number
OMP offOMP on
2 3.48 536.22
20 1.57 54.90
200 1.65 6.34
2000 1.63 0.94
20000 1.63 0.47
200000 1.62 0.41
2000000 1.64 0.50
20000000 1.64 0.75

(Compiled with "gcc -O3 -mavx2".)

This shows clearly that OMP parallelisation has an overhead. Eventually it wins by about a factor of four (until the final two rows, with data sizes of over 8MB, fall out of the last level cache and the memory starts slowing progress slightly). But for sizes under about 1,000 elements, it is slower.