C's Pointers and Optimisation

Many supercomputers rely on being able to spot when there is independence between different data structures, or between different parts of the same structure. This can be either because they wish to split work between multiple processors (e.g. summing a very long vector), or because they wish to move data from memory to the CPU long before it is needed, with a guarantee that that data will not be overwritten unexpectedly by instructions executed in the meantime. There follows an example of why this is impossible to do well with C.

(This example is deliberately kept free of explicit pointer arithmetic, for the amusement of the "C without pointers is just like FORTRAN" school. It is not. The example is intended to show why automatic (pseudo-)vectorisation of C is in general unachievable without breaking the C standard, and hence why non-standard, non-portable directives will be required on any vector (or long-pipeline, or multi-pipeline) machine to extract reasonable performance from C in numeric codes.)

Consider a function to add two vectors, storing the result in a third.

void vadd(double *a,double *b,double *c,int n)
{
   int i;
   for(i=0;i<n;i++) c[i]=a[i]+b[i];
}
This is something which should surely pseudo-vectorise easily. Each iteration of the loop is independent of the preceding iterations, and one can simply preload large amounts of the vectors a and b into vector registers, add them, and then poststore the result. Nothing could be simpler.
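To make the preload/poststore model concrete, here is a rough sketch of what such pseudo-vectorised code effectively does. It is an illustration only, not the output of any particular compiler: the chunk size VLEN and the temporary arrays standing in for vector registers are invented for the purpose, and the whole scheme assumes that c never overlaps a or b.

#define VLEN 64   /* hypothetical vector length, for illustration only */

void vadd_chunked(double *a,double *b,double *c,int n)
{
   int i,j;
   double ra[VLEN],rb[VLEN];   /* stand-ins for vector registers */

   for(i=0;i+VLEN<=n;i+=VLEN){
      for(j=0;j<VLEN;j++) ra[j]=a[i+j];         /* preload a chunk of a */
      for(j=0;j<VLEN;j++) rb[j]=b[i+j];         /* preload a chunk of b */
      for(j=0;j<VLEN;j++) c[i+j]=ra[j]+rb[j];   /* add and poststore */
   }
   for(;i<n;i++) c[i]=a[i]+b[i];                /* remainder, element by element */
}

Note that whole chunks of a and b are loaded before anything is stored into c; everything that follows hinges on when that assumption of independence is actually safe.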

Not quite so simple, unfortunately. Suppose the function is called as vadd(x,x,x,100). This should double the first 100 elements of the array x. It is a silly way of doubling something, but the code the compiler produces for this function must cope with this unusual usage. In this case, the vectorised model described above would still cope.
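For completeness, here is a small, purely hypothetical driver for this aliased call (it is not part of the original example). Because each c[i] depends only on a[i] and b[i] at the same index, even the chunked sketch above would handle it correctly: every element of a chunk is loaded before any of it is stored.

#include<stdio.h>

void vadd(double *a,double *b,double *c,int n);   /* the function defined above */

int main(void)
{
   double x[100];
   int i;

   for(i=0;i<100;i++) x[i]=i;

   vadd(x,x,x,100);                  /* aliased call: doubles x in place */

   printf("%f %f\n",x[1],x[99]);     /* expect 2.000000 and 198.000000 */
   return 0;
}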

Now consider doing something really odd with the vadd function - let us use it to calculate the Fibonacci series:

/* Generate Fibonacci series by using "vadd" function */

#include<stdio.h>
#define NMAX 50

void vadd(double *a,double *b,double *c,int n);   /* defined earlier */

int main()
{
  double fib[NMAX];
  int i;

  fib[0]=fib[1]=1.0;

  vadd(&fib[0],&fib[1],&fib[2],NMAX-2);

  for(i=0;i<NMAX;i++) printf("%2d  %12.0f\n",i+1,fib[i]);
}

Any attempt to vectorise the vadd function along the lines described above would now lead to erroneous results. The way in which the arguments to vadd have been passed masks the horror that the loop is now effectively:

for(i=0;i<48;i++) x[i+2]=x[i+1]+x[i];
In other words, each iteration of the loop requires data calculated in the two immediately preceding iterations. Any attempt to preload data is likely to lead to the original (uninitialised) data being used, rather than the data written out in the preceding iterations.

C compilers must generate code which will produce the correct result as long as the source obeys the relevant standards. Thus, on seeing the function vadd defined as above, the compiler must produce code which will run correctly if it is called to double a vector in place, or to produce a Fibonacci sequence.
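One partial escape route, sketched below purely as an assumption about what a compiler might do (no claim is made about what any particular compiler actually emits), is run-time loop versioning: test whether the arrays overlap and take the fast, vectorisable path only when they provably do not. The test costs time on every call, and comparing pointers into different objects is not strictly defined by the C standard (a real compiler performs the equivalent comparison on raw addresses), which is part of why the problem remains awkward.

/* Sketch of run-time loop versioning; the hypothetical name vadd_versioned
 * and the source-level pointer comparisons are for illustration only. */
void vadd_versioned(double *a,double *b,double *c,int n)
{
   int i;
   int no_overlap=(c+n<=a || a+n<=c) && (c+n<=b || b+n<=c);

   if(no_overlap){
      for(i=0;i<n;i++) c[i]=a[i]+b[i];   /* fast path: safe to vectorise */
   }else{
      for(i=0;i<n;i++) c[i]=a[i]+b[i];   /* safe path: element by element, in order */
   }
}

The two branches are identical C at the source level; the point is that only the first would be turned into vector code, while the second would be left strictly scalar.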

So why FORTRAN?

Those unfamiliar with FORTRAN might believe that the statements above apply equally to FORTRAN, as the code appears trivial to translate into FORTRAN.


      subroutine vadd(a,b,c,n)
      integer i,n
      double precision a(*),b(*),c(*)

      do i=1,n
        c(i)=a(i)+b(i)
      enddo

      end
and
      program fbncc
      integer nmax,i
      parameter (nmax=50)
      double precision fib(nmax)

      fib(1)=1.0d0
      fib(2)=1.0d0

      call vadd(fib,fib(2),fib(3),nmax-2)

      do i=1,nmax
         write(*,100) i,fib(i)
      enddo

100   format(1X,I2,1X,F12.0)

      end

This code will run happily on many workstations, but produces rubbish with full optimisation on some machines. A bug in the compilers? Not necessarily: rather, bad FORTRAN. The FORTRAN standard forbids a subroutine from modifying a dummy argument which overlaps (is aliased to) another of its dummy arguments, so calling vadd in this way breaks the FORTRAN standard. (DEC Alphas with full optimisation produce rubbish for this code.)

Thus the stricter standard of FORTRAN allows compilers to optimise much more aggressively than a C compiler can, because many more assumptions about the code and its data dependencies can be made.