Ok, my apologies in advance if you expected artificial intelligence agents playing with each other in this post, I couldn’t help myself when I wrote the title. This post is about comparing the speeds of two significantly different implementations of the same neural network algorithm.

I’ve just done a very simple performance comparison between my neural network algorithm implementations (no exact time measurements though). Both applications are using a 1x10x10x10x5 Neural Network model, and both are learning the same amount of sample data.

**In the left corner:**- Old C++ code from the uni 15 years ago,
- Written in Borland C++ Builder 6,
- Fixed size containers for network parameters
- Implemented with arrays and lots of raw pointers,
- For loops with if/else conditions for matrix algorithms
- Mental amount of copy constructor calls when returning temporary matrices by value,
- Simple but should be fast.

**In the right corner:**- C++11 implementation
- Dynamic container sizes
- With lots of smart pointers
- Lots of vtable access,
- Move constructor semantics,
- Uses STL algorithms for almost everything except matrix product
- Transpose_iterator implementation that fakes a transposed matrix without actually shuffling the data – transposing a matrix is now flipping a boolean
- Wrapped in CLI/C++
- Displayed in C# WinForms.

Both are release builds started within 20 iterations from each other.

The new implementation is the clear winner, but it is very surprising by how little – by only 11.8%. Clearly there is a lot to improve in the performance. I think it’s time to start doing some serious profiling to find the worst offenders.

(

**Update**: the worst offender is the already expected one, the matrix multiplication operator takes around 18% of the whole CPU time.)

Both implementations can be downloaded with source code from GitHub:

My plans for further performance improvement include using SSE intrinsics and after that a CUDA core implementation. Tensor Core CUDA 9 implementation will have to wait though until I can afford to put my hands on a Volta card. (I need a “donate” button.)

**Update 2**: I have optimised the loops in the matrix multiplication algorithm:

- pre-fetching const values before the loops
- pre-fetching a function pointer to the right Getter function (transposed vs normal), saving an extra if/else condition from inside the loop

the speed difference between the two applications jumped to 35.8% – which is a lot more pleasing for me to see. The speed gain also shows in the profiler, the CPU spends around 16% less time in the matrix multiplier operator compared to the previous build.