AVX-512: A Performance Disappointment for Deep Learning on Skylake-X

October 2020

I have been hesitant to publish this analysis, as the findings were rather underwhelming. However, the technical
nuance makes it a story worth sharing. Recently, I acquired Skylake-X processors with AVX-512 support and optimized
the DeepTrainer engine to take advantage of these wider vector instructions. Given my custom water-cooled setup, I
was able to push the hardware to its thermal and frequency limits.

The results were unexpected: AVX-512 actually performed worse than AVX2.

The Benchmark: Deep Neural Network Training

To test performance, I used a fully connected neural network with the following architecture:

  • Input Layer: 1 neuron
  • Hidden Layers: 16, 256, 256, 16 neurons
  • Output Layer: 5 neurons

This specific topology was chosen so that the hidden-layer widths (16 and 256, both multiples of the 8-wide and
16-wide vector blocks) fill the partitioned matrices exactly. No "empty" padding values were calculated; every
floating-point operation contributed to the result.

Baseline: AVX2 Performance

Using the AVX2 instruction set (256-bit registers, each holding 8 single-precision floats), I built the kernel
around 8×8 partial (block) matrix multiplications.

  • Clock Speed: 4.8 GHz (All 8 cores, stable for 10+ hours)
  • Performance: 1.2 GCps (Giga-Connections per second)
  • Efficiency: ~21 MCps / GHz / Core

This 1.2 GCps result is the highest single-CPU throughput I have achieved to date.

The AVX-512 Experiment

I then extended the kernel to use AVX-512 instructions (512-bit wide registers, holding 16 single-precision floats)
with 16×16 block matrix partitions. Theoretically, doubling the register width should offer significant speedups.

  • Clock Speed: Throttled to 4.3 GHz (Automatic downclocking)
  • Performance: 1.08 GCps
  • Efficiency: ~19 MCps / GHz / Core

The Downclocking Penalty

The culprit, confirmed via HWMonitor, was the processor’s automatic thermal protection behavior. When AVX-512
instructions are detected, the CPU drastically reduces its clock speed (in this case, from 4.8 GHz to 4.3 GHz) to
manage the intense heat generation and power consumption associated with these dense instructions.

This penalty completely negated the throughput gains of the wider vector units. This is a known issue with the
first-generation AVX-512 implementation on Skylake-X, but seeing the numbers in practice was a stark realization.

Is there a workaround?

It is theoretically possible to mitigate this by “diluting” the instruction mix. One could implement a mixed block
matrix partitioning scheme where, for example, 30-60% of blocks use 16×16 (AVX-512) and the rest use 8×8 (AVX2).
This might keep the power draw just below the threshold that triggers the aggressive frequency offset.

However, the complexity of implementing such a hybrid scheduler likely outweighs the marginal gains.

Conclusion: The Pivot to GPU

For simulating neurons, even single-precision (32-bit) floats are arguably overkill. Neural networks often converge
well with half-precision (16-bit) values or lower. Unfortunately, the CPU lineup I tested lacks efficient native
arithmetic support for these lower precisions.

Given that AVX-512 on this architecture appears to be a dead end for sustained high-throughput workloads, the logical
next step for DeepTrainer is to move away from pure CPU optimization and embrace GPU acceleration.
The massive parallelism of CUDA and the specialized tensor cores in modern NVIDIA GPUs offer a far more promising
path for scaling deep learning performance.
