Artificial Intelligence Fight VI. – Some Unexpected Improvements

Since the last AI Fight article I have not made any conscious effort to improve the performance of the algorithms; I have been busy updating the existing interfaces and applications, and creating new interfaces. I have rewritten all test harnesses from scratch and added a new WebAPI interface. I have also started experimenting with an actual cloud build in the form of an Azure Worker Role application.

Platform

Until the cloud implementation I had kept the main platform in the project configuration at 32 bit, and used a 64 bit platform only for the CUDA configuration, because Nvidia no longer supports 32 bit code. The point is: I did not have any performance measurements in 64 bit, and I assumed it could not be any better than the 32 bit code. But then I was forced to create an "AnyCPU" configuration for the cloud, for which I had to build all native libraries in 64 bit, so I decided to try the 64 bit version of the test applications as well. The performance was surprisingly better than in 32 bit.
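
In hindsight the speedup is plausible: the x64 ABI guarantees SSE2 and exposes 16 SIMD registers instead of the 8 available to 32 bit code (plus twice as many general-purpose registers), so the compiler spills less in tight loops. That is my assumption rather than a profiled fact. A trivial C++ check of which build is actually running:

    // Print the bitness of the running process: 64 bit builds get the
    // larger x64 register file, 32 bit builds do not.
    #include <cstdio>

    int main() {
        printf("%zu-bit build\n", sizeof(void*) * 8);
        return 0;
    }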

Intrinsics

When I reimplemented the test harness applications I also made the block size changeable through the user interface, so now you can choose between 4×4 and 8×8 block matrices. Before this change I had been using 8×8 block matrices by default, thinking that the more calculations are done per SIMD instruction, the faster my algorithm would become. I was in for a big surprise, though. When I enabled 4×4 blocks and tried them just to see that everything was working as it should, I immediately noticed a performance increase of around 5-10%. This was almost certainly because I was testing with 1-input, 1-output networks: with 4×4 blocks the single input is padded to 4 values, while with 8×8 blocks it is padded to 8. When the first and last hidden layers contained 256 neurons, this resulted in quite a noticeable difference in the weight matrix sizes (8×256 is double the size of 4×256), so I had to find a strategy to eliminate this distortion. I decided to use a 1x8x256x256x8x1 network topology, in which the difference from rounding up to 4 or 8 becomes insignificant. With that, the 8×8 blocks came out as winners, although by only around 1.5%; I would have expected a larger boost from double-width SIMD instructions.
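
For illustration, here is a minimal C++ sketch of this kind of block kernel. It is not the actual library code; it assumes single-precision floats and column-major storage within each block, with SSE for the 4×4 case and AVX for the 8×8 case:

    #include <immintrin.h>

    // 4x4 block: out += W * in, using 128-bit SSE registers.
    // w points to 16 floats (one block, column-major),
    // in to 4 inputs, out to 4 outputs accumulated in place.
    void block4x4_mul(const float* w, const float* in, float* out)
    {
        __m128 acc = _mm_loadu_ps(out);
        for (int c = 0; c < 4; ++c) {
            __m128 col = _mm_loadu_ps(w + 4 * c);  // column c of the block
            __m128 x   = _mm_set1_ps(in[c]);       // broadcast input c
            acc = _mm_add_ps(acc, _mm_mul_ps(col, x));
        }
        _mm_storeu_ps(out, acc);
    }

    // 8x8 block: the same pattern with 256-bit AVX registers.
    // Twice the work per instruction, but also twice the padding
    // for layers narrower than the block size.
    void block8x8_mul(const float* w, const float* in, float* out)
    {
        __m256 acc = _mm256_loadu_ps(out);
        for (int c = 0; c < 8; ++c) {
            __m256 col = _mm256_loadu_ps(w + 8 * c);
            __m256 x   = _mm256_set1_ps(in[c]);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(col, x));
        }
        _mm256_storeu_ps(out, acc);
    }

The 8×8 variant does twice the work per instruction, which is why I expected it to win by more; apparently the padding overhead and memory traffic eat most of that advantage.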

I was able to achieve 18800 kCpS/GHz/Core with 4×4 blocks, and 19280 kCpS/GHz/Core with 8×8 blocks, using the 1x8x256x256x8x1 network topology.
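
For clarity on how I read these numbers: the normalized figure is the raw throughput divided by the clock frequency and the number of cores used. The helper below is only a sketch of that normalization, not the harness's actual code; note that sustained_ghz should be the clock the CPU actually holds under full load, which can be lower than the nominal frequency:

    // Hypothetical helper: normalize raw throughput into kCpS/GHz/Core.
    double normalized_kcps(double raw_cps,       // CpS measured by the harness
                           double sustained_ghz, // clock held under full load
                           int cores)            // hardware threads used
    {
        return raw_cps / 1e3 / sustained_ghz / cores;
    }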

Further Measurements

I am including only my measurements with the 1x256x256x1 network topology here; with fewer hidden layers the results are even better.

[Figures: performance measurements with the 1x256x256x1 topology, 4×4 blocks, x64 build, in the WinForms and WPF test harnesses]

As you can see, the new record is 633 MCpS at 4.5 GHz on 8 hardware threads. That corresponds to an efficiency of 19745 kCpS/GHz/Core, which is much higher than anything I have achieved before, and the increase is mostly due to moving from the 32 bit to the 64 bit build. Of course, I was not displaying any graphs during the performance measurements, as rendering them would have affected the results.

Unfortunately I no longer have access to the 40-core Xeon monster workstation I used for some measurements before last July, but I am confident that with the current code I would easily achieve 1.5 GCpS. This figure comes from interpolation: when I achieved 0.933 GCpS on the Xeon machine, I measured 0.318 GCpS on my Core i7. My Core i7 now measures 0.633 GCpS, so by linear interpolation I should expect about 1.86 GCpS on the Xeon machine. (Let's be generous; I am already happy if I see 1.5.)
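
Spelled out, the interpolation is just the machine-to-machine ratio applied to the new measurement; the sketch below assumes throughput still scales by whatever that ratio captured (core count, clocks, memory bandwidth):

    // Back-of-the-envelope estimate from the numbers above.
    #include <cstdio>

    int main() {
        const double xeon_then = 0.933; // GCpS on the Xeon, last July
        const double i7_then   = 0.318; // GCpS on the Core i7 at that time
        const double i7_now    = 0.633; // GCpS on the Core i7 today
        // Assume the Xeon/i7 ratio still holds for the current code.
        printf("expected Xeon throughput: %.2f GCpS\n",
               i7_now * xeon_then / i7_then); // prints 1.86
        return 0;
    }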
