This project is currently in progress, but I thought I would publish it anyway. I have created a modern C++ DLL from the code extracted from the Borland C++ Builder project, as my 15-year-old code is hardly going to be useful to anyone these days.
What this library currently does – compared to the old version:
- Dynamic allocation of network parameters
- CLI/C++ wrapper to make it accessible from managed .NET code
- Revised indexing of network layers
- Redesigned objects
- Interface for the Matrix class
- A thread-safe Matrix class
- Matrix class rewritten to use STL algorithms wherever possible
There is still a lot of work I will have to do on this library:
- Create more accessors and hide public variables in the classes
- Create interfaces for all classes
- Get rid of magic numbers
- Create unit tests
The repository also contains a small test application written in WinForms. It doesn’t do much yet; at the moment I am using it to test the library through the wrapper in debug mode. I think I am not too far from recreating the whole original application either in WinForms or in WPF, or maybe both.
Achieving thread-safety for the CMatrix class was not a straightforward move, because when the std::mutex implementation saw the /clr switch in the wrapper, it threw away everything in its hands and started running around waving its arms and shouting in panic:
`<mutex> is not supported when compiling with /clr or /clr:pure.`
The solution was to create an intermediate Locker class that can be forward-declared in my header file: when the compiler reaches the header, it only receives a promise that “a class named Locker will be defined later on”. This way I could avoid including `<mutex>` in the header. Forward-declared classes can only be used as pointer or reference members, so a unique_ptr is a perfect fit for this situation. Mutexes can be neither copied nor moved, so the default, copy and move constructors each instantiate their own Locker.
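A minimal sketch of this pattern, condensed into a single file (the class and member names are illustrative, not the actual library code): the “header” half forward-declares Locker behind a unique_ptr, and only the “implementation” half pulls in `<mutex>`, so a /clr consumer of the header never sees it.

```cpp
// "Header" part -- what the /clr wrapper would include. No <mutex> here.
#include <cassert>
#include <memory>

class Locker;  // forward declaration only; the full type appears later

class CMatrix {
public:
    CMatrix();
    CMatrix(const CMatrix& other);      // gets its own Locker (mutexes don't copy)
    CMatrix(CMatrix&& other) noexcept;  // likewise for move
    ~CMatrix();                         // must be defined where Locker is complete
    void Set(double v);
    double Get() const;
private:
    std::unique_ptr<Locker> m_lock;     // incomplete type is fine behind a pointer
    double m_value = 0.0;               // stand-in for the real matrix storage
};

// "Implementation" part -- in the real project this is a separate .cpp file
// compiled without /clr, so including <mutex> is allowed here.
#include <mutex>

class Locker {
public:
    std::mutex m;
};

CMatrix::CMatrix() : m_lock(std::make_unique<Locker>()) {}
CMatrix::CMatrix(const CMatrix& other)
    : m_lock(std::make_unique<Locker>()), m_value(other.m_value) {}
CMatrix::CMatrix(CMatrix&& other) noexcept
    : m_lock(std::make_unique<Locker>()), m_value(other.m_value) {}
CMatrix::~CMatrix() = default;  // Locker is complete here, so unique_ptr can delete it

void CMatrix::Set(double v) {
    std::lock_guard<std::mutex> g(m_lock->m);
    m_value = v;
}
double CMatrix::Get() const {
    std::lock_guard<std::mutex> g(m_lock->m);
    return m_value;
}
```

Note that the destructor is declared in the header but defined only after Locker is complete; defaulting it at the declaration would force unique_ptr to delete an incomplete type.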
The Matrix class stores its values in memory as a contiguous array, so every operation that affects each value of the matrix individually can be implemented on this flat array, and the STL provides fast algorithms for exactly that. Except for transpose and the dot product, all operations now use the STL.
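For example, element-wise operations over contiguous storage reduce to a single `std::transform` over the flat array (a sketch with hypothetical names, not the library’s actual API):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical matrix with contiguous row-major storage.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;  // rows * cols elements
};

// Element-wise addition of two equally shaped matrices:
// one std::transform over the flat arrays.
Matrix Add(const Matrix& a, const Matrix& b) {
    Matrix r{a.rows, a.cols, std::vector<double>(a.data.size())};
    std::transform(a.data.begin(), a.data.end(), b.data.begin(),
                   r.data.begin(), std::plus<double>());
    return r;
}

// Scaling every element: std::transform with a lambda.
Matrix Scale(const Matrix& a, double s) {
    Matrix r{a.rows, a.cols, std::vector<double>(a.data.size())};
    std::transform(a.data.begin(), a.data.end(), r.data.begin(),
                   [s](double v) { return v * s; });
    return r;
}
```

Because the storage is one contiguous buffer, the matrix shape never enters these loops at all.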
For the matrix dot product and transpose operations I am planning to use SSE3 intrinsics, which have been available in all Intel-based processors since 2004. A matrix dot product could make use of a fast transpose operation: if the second matrix were stored transposed during the calculation, more values could stay in the cache. The question is whether it is worth it, as transposing a matrix could take at least as many cycles as calculating the dot product itself. One trick could accelerate this, though: storing every matrix in memory twice, once in its original form and once transposed. I think I’ll only find out by implementing it and measuring it.
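To make the cache argument concrete, here is a plain C++ sketch (no intrinsics, illustrative names) of a dot product that transposes the second matrix first; with the transposed copy, both operands of the inner loop are walked with stride 1, so every loaded cache line is fully used:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // row-major, rows * cols elements

// Transpose a rows x cols matrix into a new buffer.
Mat Transpose(const Mat& m, std::size_t rows, std::size_t cols) {
    Mat t(m.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            t[c * rows + r] = m[r * cols + c];
    return t;
}

// a: m x k, b: k x n. Without the transpose, the inner loop would read b
// with stride n; with bt, both reads are sequential in memory.
Mat DotTransposed(const Mat& a, const Mat& b,
                  std::size_t m, std::size_t k, std::size_t n) {
    Mat bt = Transpose(b, k, n);  // bt is n x k
    Mat c(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t x = 0; x < k; ++x)
                sum += a[i * k + x] * bt[j * k + x];  // both strides are 1
            c[i * n + j] = sum;
        }
    return c;
}
```

Whether the O(k·n) transpose pays for itself against the O(m·k·n) product is exactly the measurement question raised above.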
My ultimate optimisation goal is to implement partitioned matrix multiplication over double-precision 4×4 sub-matrices, as these operations can be executed not only by SIMD instructions, but also by CUDA cores and the latest Tensor Cores from NVIDIA.
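The partitioned scheme can be sketched in plain C++ (assuming square matrices whose dimension is a multiple of 4; names are illustrative): the three outer loops walk a grid of 4×4 tiles, and each innermost triple loop is one 4×4-by-4×4 product, which is precisely the unit of work that SIMD lanes or a Tensor Core fragment could take over.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 4;  // 4x4 sub-matrix (tile) size

// Multiply two n x n row-major matrices, n a multiple of 4, tile by tile:
// C[bi..bi+3][bj..bj+3] += A-tile(bi,bk) * B-tile(bk,bj) for every bk.
std::vector<double> BlockMul(const std::vector<double>& a,
                             const std::vector<double>& b, std::size_t n) {
    std::vector<double> c(n * n, 0.0);
    for (std::size_t bi = 0; bi < n; bi += kTile)
        for (std::size_t bj = 0; bj < n; bj += kTile)
            for (std::size_t bk = 0; bk < n; bk += kTile)
                // One 4x4 * 4x4 product -- the SIMD/Tensor-Core-sized unit.
                for (std::size_t i = bi; i < bi + kTile; ++i)
                    for (std::size_t k = bk; k < bk + kTile; ++k) {
                        double aik = a[i * n + k];
                        for (std::size_t j = bj; j < bj + kTile; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
    return c;
}
```

Swapping the innermost tile product for a vectorised or GPU kernel would then leave the outer partitioning loop untouched.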