
MJ Rutter's Linpack: unroll or block?

Code optimisation in the 1980s and before was all about unrolling loops. Today it is all about blocking. Here the difference is briefly explained.

What takes the time?

In the linpack code, there is just a single loop which takes the great majority of the time. It is at about line 220 in this code and is

     for(i=k+1;i<n;i++)
        a[i+j*n]+=t*a[i+k*n];

Unsurprisingly, the loop is the inner loop of a nest. Looking back through the code we can see:

 for(k=0;k<n-1;k++){
    [...]
    for(j=k+1;j<n;j++){
    [...]

Only then do we reach the loop over i.
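
For orientation, a minimal sketch of the surrounding elimination, with the elided lines filled in naively, might look like the following. This is only an illustration of where the hot loop sits; it omits the partial pivoting which the real code performs, and is not the actual contents of the elided lines.

 void eliminate(double *a, int n)     /* sketch only: no pivoting */
 {
   int i,j,k;
   for(k=0;k<n-1;k++){
     double p=-1.0/a[k+k*n];          /* (a pivot search belongs here)  */
     for(i=k+1;i<n;i++)
       a[i+k*n]*=p;                   /* store negated multipliers      */
     for(j=k+1;j<n;j++){
       double t=a[k+j*n];             /* pivot-row element of column j  */
       for(i=k+1;i<n;i++)             /* the hot loop                   */
         a[i+j*n]+=t*a[i+k*n];
     }
   }
 }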

Unrolling

lpk_unroll.c

The theory of unrolling is that the problem with this loop is all the nasty overhead of it being a loop: adding one to a counter, comparing the counter to n, and performing a conditional jump. If we could reduce this overhead, perhaps things would be faster? It is readily reduced:

     for(i=k+1;i<n-1;i+=2){
        a[j*n+i]+=t*a[k*n+i];
        a[j*n+i+1]+=t*a[k*n+i+1];
      }
      if(i==(n-1)) a[i+j*n]+=t*a[k*n+i];

Now each iteration of the new loop performs two iterations of the old, yet still has a single counter increment, comparison and conditional jump. The final line (outside of the loop) deals with the problem of the loop count not necessarily being divisible by two.

Unfortunately the time is unchanged. There are two reasons for this. Firstly, most compilers will unroll loops automatically, and actually do so better if a human has not done so first, so sometimes the hand-unrolled version is slower. Secondly, even with automatic unrolling disabled by a compiler switch such as -fno-unroll-loops, there is no measurable difference in execution speed: the loop overhead simply was not a problem here.
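
For reference, deeper unrolling follows exactly the same pattern, with a clean-up loop replacing the single trailing statement. A sketch of a four-way unroll of the same update (an illustration only, not one of the variants timed above):

     for(i=k+1;i<n-3;i+=4){
        a[j*n+i]+=t*a[k*n+i];
        a[j*n+i+1]+=t*a[k*n+i+1];
        a[j*n+i+2]+=t*a[k*n+i+2];
        a[j*n+i+3]+=t*a[k*n+i+3];
     }
     for(;i<n;i++)                    /* remaining 0 to 3 iterations */
        a[j*n+i]+=t*a[k*n+i];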

Blocking

lpk_block.c

for(k=0;k<n-1;k++){
   [...]
   for(j=k+1;j<n;j++){
      [...]
      t=[...]
      [...]
      for(i=k+1;i<n;i++)
         a[i+j*n]+=t*a[i+k*n];

The problem here is the memory access pattern. With n being 2,000, each column of this matrix is 16KB, and the whole matrix is 32MB. The single column a[i+k*n] is reused by every iteration of the j loop and can stay in cache, but with a typical size of last-level cache each access of a[i+j*n] comes from main memory. The theoretical memory bandwidth of a 3.4GHz Haswell with two channels of DDR3/1600 memory is 25.6GB/s. The inner loop has two floating point operations, and two memory operations on a[i+j*n] (a load and a store) which, we are arguing, go to main memory. So there is one floating point operation per memory operation, and, as each memory operation is 8 bytes, this loop cannot possibly exceed 3.2 GFLOPS, a small fraction of the 220 GFLOPS the CPU is capable of if all cores are used.
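
As a trivial check of that arithmetic (the numbers are the ones quoted above for this particular machine):

#include <stdio.h>

int main(void)
{
  double bandwidth=25.6e9;       /* bytes/s: two channels of DDR3/1600   */
  double bytes_per_flop=8.0;     /* one 8-byte memory operation per flop */
  printf("bound: %.1f GFLOPS\n",bandwidth/bytes_per_flop/1e9);  /* 3.2   */
  return 0;
}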

So we need to change the memory access pattern.

for(k=0;k<n-2;k+=2){
   [...]
   for(j=k+2;j<n;j++){
      [...]
      for(i=k2+1;i<n;i++)
         a[i+j*n]+=t*a[i+k*n]+t2*a[i+(k+1)*n];

Rather than performing two iterations of the inner loop at once, and keeping the memory access pattern unchanged, we now perform two iterations of the outer loop at once. Each iteration of the inner loop is at least as bad as before, and its access of a[i+j*n] is as bad as ever. However, because the outer loop now steps in twos, there are only half as many iterations of this inner loop as there were before, and the code might be expected to go twice as fast.

It nearly does! The previous code managed a very poor 2.3 GFLOPS, and this code manages 3.7 GFLOPS. The problem is that the code is already becoming unreadable and unmaintainable, and to get much closer to the CPU's peak one needs to block far more aggressively.
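
To give a flavour of what more aggressive blocking looks like, below is a sketch of the update generalised to a block of nb pivot columns. The function name, and the assumption that suitable multipliers t[0..nb-1] have already been computed for column j, are illustrative rather than taken from lpk_block.c; computing those multipliers correctly is precisely the awkward part.

 /* Apply nb eliminations to column j in one pass: each element of
    column j is loaded and stored once per nb multiply-adds, rather
    than once per multiply-add.  Sketch only.                       */
 static void update_column_block(double *a, int n, int k, int nb,
                                 const double *t, int j)
 {
   int i,m;
   for(i=k+nb;i<n;i++){
     double s=a[i+j*n];              /* one load          */
     for(m=0;m<nb;m++)
       s+=t[m]*a[i+(k+m)*n];         /* nb multiply-adds  */
     a[i+j*n]=s;                     /* one store         */
   }
 }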

Fortunately there is a slight variation of the algorithm used here which is much easier to block, and that is what modern implementations of linpack (or dgesv) use. As the comments in the above codes suggest, their core is a routine called dgefa; the modern, more blockable, alternative is called dgetrf.
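
In practice one does not write this by hand at all, but calls a tuned LAPACK. A minimal sketch using the LAPACKE C interface (assuming liblapacke and a BLAS library are installed), solving a small system with dgesv, which performs the blocked dgetrf factorisation internally:

#include <stdio.h>
#include <lapacke.h>

int main(void)
{
  double a[4]={4.0,2.0,            /* column-major 2x2 matrix */
               1.0,3.0};
  double b[2]={5.0,5.0};           /* right-hand side         */
  lapack_int ipiv[2],info;

  info=LAPACKE_dgesv(LAPACK_COL_MAJOR,2,1,a,2,ipiv,b,2);
  if(info==0)
    printf("x = (%g, %g)\n",b[0],b[1]);   /* expect (1, 1) */
  return (int)info;
}

Linked against something like -llapacke -llapack -lblas (the exact libraries vary by distribution), this prints x = (1, 1).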

Back to Linpack page.