
Intel's Variable Clock Speeds and Benchmarking

Benchmarking, i.e. timing the execution of programs, is a useful way both of comparing the performance of different computers, and of comparing the performance of different codes. If one modifies a code with the intention of making it run faster, only some form of timed run, or benchmark, can determine whether one was successful.

Getting repeatable timings from modern computers is becoming increasingly difficult. Issues surrounding caching (including TLBs, large page entries, and fragmentation reducing the number of large pages), and the NUMA nature of dual-socket computers (and some single-socket ones), are not discussed here. Instead, this page discusses issues arising from automatic variation of the CPU's clock speed.

Turbos

If a CPU is operating at below its maximum temperature and power thresholds, it may increase its clock-speed, particularly if not all of its cores are active. This is generally referred to as a turbo mode. The increase can be significant (tens of percent), and, for consistent benchmark results, one wishes to turn this firmly off in the BIOS. Else measured performance could depend on the temperature of the room the computer is in!

Turbos and AVX2 / AVX-512

Processors which support AVX2, and especially AVX-512, often have a lower nominal clock-speed when executing sequences of instructions which contain a large proportion of those instructions. The corresponding functional units are very power-hungry, and the CPU will adjust its clock-speed in response to a change in the instruction mix after about 1ms.

One example of this would be the Xeon Scalable 6128, a six-core processor with a nominal speed of 3.4GHz and a maximum turbo frequency of 3.7GHz. When executing AVX2 instructions its nominal frequency drops to 2.9GHz, with a maximum turbo frequency of 3.6GHz. When executing AVX-512 instructions, the nominal frequency drops to 2.3GHz, and the maximum turbo frequency ranges from 3.5GHz (two cores or fewer active) down to 2.9GHz (all cores active).

So, when operating on all cores, this CPU has frequency ranges of 3.4 to 3.7GHz, 2.9 to 3.6GHz or 2.3 to 2.9GHz for standard, AVX2 and AVX-512 instructions respectively. If the turbo function is turned off, this does not pin the frequency to the lowest in the three ranges. Rather, it caps the maximum frequency at the standard nominal 3.4GHz. In practice, if cooling is good, this means that no slow-down is seen for AVX2 instructions, whereas AVX-512 instructions cause a slow-down to 2.9GHz if all cores are active. Under Linux, these frequency changes can be monitored by looking at /proc/cpuinfo.
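The effect of turning the turbo off can be written out as a small calculation. This is only a sketch of the reasoning above, using the all-core figures quoted for the Xeon Scalable 6128; the function name best_case_ghz is invented for illustration, and it assumes cooling is good enough for the CPU to reach the top of each range.

```python
# All-core frequencies in GHz for the Xeon Scalable 6128, as quoted above:
# (nominal, maximum turbo) for each class of instruction.
RANGES_GHZ = {
    "standard": (3.4, 3.7),
    "AVX2": (2.9, 3.6),
    "AVX-512": (2.3, 2.9),
}

NOMINAL_GHZ = 3.4  # standard nominal speed, which becomes the cap with turbo off


def best_case_ghz(instruction_class, turbo_enabled):
    """Highest all-core frequency reachable, assuming good cooling."""
    _, max_turbo = RANGES_GHZ[instruction_class]
    if turbo_enabled:
        return max_turbo
    # Assumption: disabling turbo only caps each range at the nominal speed.
    return min(max_turbo, NOMINAL_GHZ)


for name in RANGES_GHZ:
    print(name, best_case_ghz(name, turbo_enabled=False))
# prints:
# standard 3.4
# AVX2 3.4
# AVX-512 2.9
```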

Very few (no?) Intel CPUs can sustain their standard clock-speed when executing long, dense sequences of AVX-512 instructions.

(Frequency data from this Intel document.)
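As a rough illustration of monitoring clock speeds through /proc/cpuinfo, the sketch below extracts the "cpu MHz" field for each logical CPU. It assumes the x86 Linux layout in which each core's entry contains a line of the form "cpu MHz : 3400.000"; the function names are made up for this example.

```python
import time


def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of each logical CPU, in file order."""
    speeds = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            speeds.append(float(value))
    return speeds


def sample_cpu_mhz(samples=5, interval=0.2, path="/proc/cpuinfo"):
    """Print the reported speeds a few times, pausing between samples."""
    for _ in range(samples):
        with open(path) as f:
            print(parse_cpu_mhz(f.read()))
        time.sleep(interval)
```

Calling sample_cpu_mhz() in a second terminal while a benchmark is running should show whether the cores are holding the expected frequency. On systems where the field is absent, the lists will simply be empty.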

Power Saving

An idle CPU will reduce its clock-speed, and operating voltage, quite dramatically. The clock speed is generally reduced to around 1GHz, and then restored relatively rapidly when the CPU stops being idle. I would recommend leaving this power-saving on: it saves electricity when the machine is idle, and the small fraction of a second (under 1ms) taken to restore the CPU's speed at the start of a benchmark is not very significant compared to the usual program start-up overheads.

BUT, what happens if the CPU fails to detect itself as being properly busy when the benchmark is run? A CPU which is very lightly loaded (e.g. decoding a DVD which has been mostly offloaded to a discrete video card anyway) will, not unreasonably, stay in its low-power state. Can one assume that all benchmarks will be detected as requiring the full-speed state? Not necessarily: some Skylake CPUs fail to do so for memory bandwidth benchmarks, such as John McCalpin's STREAM.

With Linux, the minimum idle speed can be controlled through
/sys/devices/system/cpu/intel_pstate/min_perf_pct
which gives the minimum speed as a percentage of the standard speed. So one can read the normal value, write "100" to this file (as root) to turn off the clock frequency scaling, run some benchmarks, and then restore the old value.
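The read, write and restore sequence just described might be wrapped up as follows. This is a minimal sketch: the context manager name is invented, it must be run as root to write to the real file, and it assumes the intel_pstate driver is in use so that the file exists.

```python
from contextlib import contextmanager

MIN_PERF_PCT = "/sys/devices/system/cpu/intel_pstate/min_perf_pct"


@contextmanager
def min_perf_pct(value=100, path=MIN_PERF_PCT):
    """Temporarily set min_perf_pct, restoring the old value on exit."""
    with open(path) as f:
        old = f.read().strip()
    with open(path, "w") as f:
        f.write(str(value))
    try:
        yield old
    finally:
        # Restore the original setting even if the benchmark raises.
        with open(path, "w") as f:
            f.write(old)


# Usage (run_benchmarks is a placeholder for one's own code):
#
#     with min_perf_pct(100):
#         run_benchmarks()
```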

A four-core 3.0GHz CPU with dual channel DDR4/2133 ECC memory gave results between 28GB/s and 8GB/s with the default settings, and consistently 28GB/s with min_perf_pct set to 100 (rather than the default 22). A Xeon Scalable Gold CPU seems to ignore this file completely.
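Variation of this kind only shows up if a benchmark is run more than once. The following is a hedged sketch of a repeat-and-compare harness, not the method used to obtain the figures above; the function names are invented. For a fixed amount of data, results of 28GB/s and 8GB/s correspond to run times differing by a factor of 3.5.

```python
import time


def time_runs(benchmark, repeats=10):
    """Time benchmark() several times; return the elapsed seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        benchmark()
        timings.append(time.perf_counter() - start)
    return timings


def spread(timings):
    """Ratio of slowest to fastest run; a value near 1.0 means repeatable."""
    return max(timings) / min(timings)


# Illustrative workload only; substitute the code actually being measured.
timings = time_runs(lambda: sum(range(1_000_000)), repeats=5)
print(f"slowest/fastest = {spread(timings):.2f}")
```

A ratio well above 1 suggests that something, possibly the clock speed, changed between runs.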

Other slowdowns

Some AVX2-supporting CPUs turn off the top half of their AVX2 registers and functional units when they are not being used. At this point instructions which attempt to use the full 256 bits get emulated by being passed twice through the active lower part of the units, until the full unit can be powered up. Powering up takes around 10μs, and powering down happens after about 1ms of inactivity.

See here and here for more details.