I have wanted to write this post for a while, but because it is mostly about disappointing findings I did not feel particularly motivated to do so. I got my hands on some proper Skylake-X processors with AVX512 enabled and modified my code to make use of it. Since the system is water cooled, I was also able to push the processor to its limits and achieve some nice results. However, all the remarkable results come from AVX2 instructions.
I was able to run an 8-core, 16-thread CPU with AVX2 instructions, which use 256-bit wide registers holding 8 single precision (32-bit) floats each. These allowed me to accelerate 8×8 partial matrix multiplications. The test load was a neural network with 1 input neuron, four hidden layers of 16, 256, 256 and 16 neurons, and 5 output neurons. This layout makes sure that no empty values are calculated during matrix partitioning: all values matter. I reached 1.2 GCps (with 21 MCps / GHz / core efficiency), which is the highest number I have seen so far. All this came from running all 8 cores at 4.8 GHz continuously (tested to be stable for 10 hours straight), without thermal throttling.
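To make the idea of 8×8 partial matrix multiplications concrete, here is a minimal pure-Python sketch of a blocked matrix product. It is not my actual SIMD code: the function names are made up for illustration, and it assumes every dimension is a multiple of the block size, which is what the hidden layer widths above guarantee. In the real thing, each row of a tile would sit in one 256-bit register.

```python
def matmul_naive(a, b):
    """Reference product of a (n x k) and b (k x m), given as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_blocked(a, b, block=8):
    """Same product, accumulated one block x block tile at a time.

    Assumes n, k and m are all multiples of block (illustrative restriction).
    """
    n, k, m = len(a), len(b), len(b[0])
    assert n % block == 0 and k % block == 0 and m % block == 0
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p0 in range(0, k, block):
                # One partial product: tile of a times tile of b, added into tile of c.
                for i in range(i0, i0 + block):
                    row_c = c[i]
                    for p in range(p0, p0 + block):
                        a_ip = a[i][p]
                        row_b = b[p]
                        # This inner loop over 8 adjacent floats is what a single
                        # 256-bit multiply-add would cover in SIMD code.
                        for j in range(j0, j0 + block):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Calling `matmul_blocked(a, b, block=16)` gives the 16×16 partitioning that matches 512-bit registers, as long as the dimensions divide evenly.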
Then I extended my code to use 512-bit wide registers, each holding 16 single precision floats, together with 16×16 block matrix partitions. This required the AVX512 instruction set. The result was only 1.08 GCps (with 19 MCps / GHz / core efficiency). The utility HWMonitor revealed what was really going on: all cores that were running AVX512 instructions were automatically downclocked from 4.8 GHz to 4.3 GHz. This is a well-known problem with the AVX512 instruction set, and I am not the first person to complain about it. It seems that the only way to gain a performance advantage from these instructions is to use them sparingly, e.g. by mixing them with AVX2.
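As a rough sanity check, the throughput and clock figures quoted above can be compared directly. The sketch below only divides those numbers by each other, taking them at face value; the variable names are mine.

```python
# Figures quoted in the text above.
avx2_gcps, avx512_gcps = 1.2, 1.08   # measured throughput
avx2_ghz, avx512_ghz = 4.8, 4.3      # sustained core clocks

clock_ratio = avx512_ghz / avx2_ghz        # about 0.90
measured_ratio = avx512_gcps / avx2_gcps   # about 0.90
per_clock_ratio = measured_ratio / clock_ratio  # close to 1.0

print(f"clock ratio:      {clock_ratio:.3f}")
print(f"throughput ratio: {measured_ratio:.3f}")
print(f"per-clock ratio:  {per_clock_ratio:.3f}")
```

The throughput drop is almost exactly the clock drop, so under this simple reading the wider registers delivered essentially no extra work per clock cycle in this workload.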
I could probably improve on my results by introducing a mixed block matrix partitioning, where e.g. 30-60% of the 16×16 blocks are further partitioned into 8×8 blocks. One part of the calculations would then run on 256-bit wide registers and the other part on 512-bit wide ones, so that I only use as much AVX512 as the CPU cores tolerate before they clock down. It is certainly possible to do this, but I doubt it is worth it.
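A mixed partitioning of this kind could be planned roughly as follows. This is only a sketch of the bookkeeping: `plan_tiles` and its parameters are hypothetical names, and I have not tested whether interleaving at this granularity actually keeps the cores from clocking down.

```python
def plan_tiles(rows, cols, split_fraction, big=16, small=8):
    """Cover a rows x cols matrix with big x big tiles, splitting roughly
    split_fraction of them into small x small tiles.

    Returns a list of (row0, col0, size) tuples. Assumes rows and cols are
    multiples of big, big is a multiple of small, and 0 <= split_fraction <= 1.
    """
    assert rows % big == 0 and cols % big == 0 and big % small == 0
    origins = [(r, c) for r in range(0, rows, big) for c in range(0, cols, big)]
    total = len(origins)
    n_split = round(split_fraction * total)
    tiles = []
    for index, (r, c) in enumerate(origins):
        # Spread the split tiles evenly through the sequence instead of
        # bunching them together; exactly n_split indices satisfy this test.
        split = (index * n_split) // total != ((index + 1) * n_split) // total
        if split:
            for dr in range(0, big, small):
                for dc in range(0, big, small):
                    tiles.append((r + dr, c + dc, small))  # 256-bit path
        else:
            tiles.append((r, c, big))  # 512-bit path
    return tiles
```

For the 256×256 hidden layer, `plan_tiles(256, 256, 0.5)` keeps half of the 16×16 blocks intact and splits the other half into four 8×8 blocks each, so both register widths end up covering the same number of matrix cells.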
I am looking towards GPU solutions instead. AVX512 is unfortunately a dead end, along with the other capabilities that Intel CPUs currently offer. First of all, single precision is still far more than is needed for simulating neurons. These simple activation functions would also run happily on half precision (16-bit) floating point numbers, and such precision is not available in the current Intel line-up.
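To get a feeling for how little is lost at 16 bits, the sketch below rounds the output of an activation function to half precision and measures the worst error. The logistic sigmoid is my assumption here, chosen only as a typical simple activation, and the code only rounds the result to half precision rather than doing the arithmetic in half precision.

```python
import math
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sigmoid(x):
    """Logistic sigmoid, used here as an example activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def max_half_error(lo=-8.0, hi=8.0, steps=1601):
    """Largest absolute error from storing sigmoid outputs in half precision."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        y = sigmoid(x)
        worst = max(worst, abs(to_half(y) - y))
    return worst


print(f"worst half-precision error: {max_half_error():.6f}")
```

Because sigmoid outputs lie between 0 and 1, the rounding error cannot exceed 2^-12 (about 0.00024), which is tiny compared with the output range of the neuron.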