Why NVidia’s Volta architecture is important for AI and Machine Learning

In this short entry I would like to talk about the details behind the “magical” buzzwords spreading these days on the Machine Learning and Big Data scenes. I am sure the marketing and sales departments at NVidia are working hard to make expressions like “tensor cores” part of everyday conversation, even though most people do not have the slightest clue what they mean. I am not mocking this process; I think it actually helps technology progress.

In my previous articles I explained in detail how a neural network can be expressed with matrices, and how the training process can be carried out with simple matrix operations. Here I would like to write about how various processing architectures can improve the performance of these operations.

MATRIX OPERATIONS IN NEURAL NETWORKS

As you may have noticed, many of the matrix calculations in neural network learning algorithms are easily serialisable, which means they can be executed by simply iterating through vectors that are stored contiguously in memory. The only two operations that are hard to serialise in this way are the matrix product and the matrix transpose. The matrix products are the most compute-intensive operations in these algorithms.
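To make the difference concrete, here is a minimal sketch in plain Python. It assumes matrices are stored as flat row-major lists; the function names `elementwise_add` and `matmul_flat` are made up for this illustration and are not taken from any library.

```python
# Matrices are flat row-major lists: element (i, j) lives at index i * cols + j.

def elementwise_add(x, y):
    """Element-wise sum: one pass over two contiguous vectors."""
    return [a + b for a, b in zip(x, y)]

def matmul_flat(x, y, n, k, m):
    """Product of an n×k matrix x and a k×m matrix y, both flat row-major."""
    out = [0.0] * (n * m)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                # x is read sequentially along row i, but y is read with
                # stride m (down column j), which is not contiguous.
                acc += x[i * k + p] * y[p * m + j]
            out[i * m + j] = acc
    return out

x = [1.0, 2.0, 3.0, 4.0]   # 2×2 matrix [[1, 2], [3, 4]]
y = [5.0, 6.0, 7.0, 8.0]   # 2×2 matrix [[5, 6], [7, 8]]
summed = elementwise_add(x, y)
product = matmul_flat(x, y, 2, 2, 2)
```

The element-wise sum never needs to know the shape of the matrix, while the product has to jump through the second operand column by column.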

MULTIPLYING PARTITIONED MATRICES

A beautiful feature of the matrix product is that the calculation stays the same if it is carried out on matrix partitions instead of individual matrix cells. If a matrix is partitioned into 2×2 or 4×4 sub-matrices, with any leftover rows and columns partitioned into 1×2, 2×1 or 1×4, 4×1 vectors, the matrix multiplication algorithm is the same on the sub-matrices.

The operations used in calculating a matrix product are multiplication and addition. In the case of sub-matrices the multiplication becomes a matrix product (be aware that, unlike the multiplication of scalar matrix values, this multiplication is not commutative). During partitioning the only thing one has to be careful about is that the partition sizes match up. If the dimensions of the original large matrix are not exact multiples of 4, you will have to partition the remainder so that the inner dimensions agree wherever two sub-matrices are multiplied: a 4×3 sub-matrix should only meet 3×3 or 3×4 sub-matrices, and a 4×1 sub-matrix only 1×1 or 1×4 ones.
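The partitioning rule can be checked with a small sketch. The helper names (`matmul`, `block`, `blocked_matmul`) are hypothetical, and the code assumes matrices are nested Python lists. The key point is that the columns of the left matrix and the rows of the right matrix are cut at the same positions, so every pair of sub-matrices that meets has matching inner dimensions, including the leftover edge pieces.

```python
def matmul(a, b):
    """Plain cell-by-cell matrix product of nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def block(m, r0, r1, c0, c1):
    """Sub-matrix of m covering rows r0..r1-1 and columns c0..c1-1."""
    return [row[c0:c1] for row in m[r0:r1]]

def blocked_matmul(a, b, size=4):
    """Same product, computed from sub-matrices of at most size×size."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i0 in range(0, n, size):
        for j0 in range(0, m, size):
            for p0 in range(0, k, size):
                i1, j1, p1 = min(i0 + size, n), min(j0 + size, m), min(p0 + size, k)
                # Columns p0..p1 of a meet rows p0..p1 of b: inner sizes match.
                tile = matmul(block(a, i0, i1, p0, p1), block(b, p0, p1, j0, j1))
                for i, row in enumerate(tile):
                    for j, value in enumerate(row):
                        out[i0 + i][j0 + j] += value
    return out

a = [[i * 6 + j for j in range(6)] for i in range(5)]           # 5×6
b = [[(i + 1) * (j - 2) for j in range(7)] for i in range(6)]   # 6×7
blocked = blocked_matmul(a, b)
plain = matmul(a, b)

# Sub-matrix multiplication is not commutative, unlike scalar multiplication.
p = [[1, 2], [3, 4]]
q = [[0, 1], [1, 0]]
pq, qp = matmul(p, q), matmul(q, p)
```

Here `blocked` and `plain` hold the same 5×7 result even though neither 5, 6 nor 7 is a multiple of 4, while `pq` and `qp` differ.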

ACCELERATING PARTITIONED MATRIX MULTIPLICATION WITH HARDWARE SOLUTIONS

Intel has provided means for accelerating matrix operations in the form of SSE SIMD (Single Instruction Multiple Data) instructions since 1999. NVidia took this to a new level by recently introducing Tensor Cores in their latest Volta architecture.

“Tensor Cores” is a marketing term. It tries to convey the picture of a three-dimensional 4x4x4 cube of something serious, and indeed it looks very fancy and sciency; the underlying concept, however, is very simple.

As I have mentioned above, every matrix multiplication can be broken down into multiplication and addition operations between 4×4 sub-matrices. NVidia’s tensor cores support exactly that:

A = B x C + D

where A, B, C and D are all 4×4 matrices. Recent Volta processors such as the Tesla V100 have 640 Tensor Cores that can execute such operations simultaneously, so the connections between two layers of a neural network can be calculated incredibly quickly during a backpropagation step, which calculates the gradient values of the network.
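As a rough illustration of how a larger product reduces to this single operation, the following sketch imitates only the arithmetic in plain Python. It is not NVidia's API and it ignores numeric precision and parallelism; `fma_4x4`, `tile` and `matmul_with_fma` are hypothetical names, and the matrix dimensions are assumed to be multiples of 4.

```python
TILE = 4

def fma_4x4(b, c, d):
    """Return B x C + D for 4×4 nested lists."""
    return [[d[i][j] + sum(b[i][p] * c[p][j] for p in range(TILE))
             for j in range(TILE)] for i in range(TILE)]

def tile(m, r0, c0):
    """The 4×4 sub-matrix of m whose top-left cell is (r0, c0)."""
    return [row[c0:c0 + TILE] for row in m[r0:r0 + TILE]]

def matmul_with_fma(x, y):
    """Multiply x (n×k) by y (k×m) using only fma_4x4; returns (result, op count)."""
    n, k, m = len(x), len(y), len(y[0])
    out = [[0] * m for _ in range(n)]
    ops = 0
    for i0 in range(0, n, TILE):
        for j0 in range(0, m, TILE):
            acc = [[0] * TILE for _ in range(TILE)]   # plays the role of D
            for p0 in range(0, k, TILE):
                # The previous partial sum is fed back in as D.
                acc = fma_4x4(tile(x, i0, p0), tile(y, p0, j0), acc)
                ops += 1
            for i in range(TILE):
                out[i0 + i][j0:j0 + TILE] = acc[i]
    return out, ops

x = [[i + j for j in range(8)] for i in range(4)]     # 4×8
y = [[i * j for j in range(12)] for i in range(8)]    # 8×12
product, ops = matmul_with_fma(x, y)
```

Each output tile (each `i0`, `j0` pair) is computed independently of the others, which is why having many units that run this one operation side by side pays off.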

If you organise these 4×4 matrices into a cube, then you get a 4x4x4 cube which can make your presentations very serious-looking and sciency when you mention machine learning. Still, it doesn’t hurt to know that in reality these are simple operations we learned about in middle school.