Artificial Intelligence Fight – comparing neural network implementations.

Ok, my apologies in advance if you expected artificial intelligence agents playing against each other in this post – I couldn’t help myself when I wrote the title. This post is about comparing the speeds of two significantly different implementations of the same neural network algorithm.


I’ve just done a very simple performance comparison between my two neural network implementations (no exact time measurements though). Both applications use a 1x10x10x10x5 neural network model, and both learn the same set of sample data.
  • In the left corner:
    • Old C++ code from uni, 15 years ago
    • Written in Borland C++ Builder 6
    • Fixed-size containers for network parameters
    • Implemented with arrays and lots of raw pointers
    • For loops with if/else conditions in the matrix algorithms
    • A mental number of copy constructor calls when returning temporary matrices by value
    • Simple, but should be fast
  • In the right corner:
    • C++11 implementation
    • Dynamic container sizes
    • Lots of smart pointers
    • Lots of vtable accesses
    • Move semantics
    • STL algorithms for almost everything except the matrix product
    • A transpose_iterator implementation that fakes a transposed matrix without actually shuffling the data; transposing a matrix is now just flipping a boolean
    • Wrapped in C++/CLI
    • Displayed in a C# WinForms app

Both are release builds, started within 20 iterations of each other.

The new implementation is the clear winner, but by a surprisingly small margin: only 11.8%. Clearly there is a lot of room for performance improvement; I think it’s time to do some serious profiling to find the worst offenders.


(Update: the worst offender is the expected one; the matrix multiplication operator takes around 18% of the total CPU time.)


Both implementations can be downloaded with source code from GitHub:

My plans for further performance improvement include using SSE intrinsics and, after that, a CUDA implementation. A Tensor Core CUDA 9 implementation will have to wait until I can get my hands on a Volta card. (I need a “donate” button.)

Update 2: I have optimised the loops in the matrix multiplication algorithm:

  • pre-fetching const values before the loops
  • pre-fetching a function pointer to the right getter function (transposed vs. normal), saving an if/else check inside the innermost loop

The speed difference between the two applications jumped to 35.8%, which is a lot more pleasing to see. The speed gain also shows up in the profiler: the CPU spends around 16% less time in the matrix multiplication operator than in the previous build.