Matrices have a beautiful feature that comes in very handy when creating fast matrix multiplication (dot product) algorithms. If a matrix is partitioned into equal-sized blocks, and the blocks themselves are treated as cells of a new matrix, then multiplying the block matrices gives exactly the same result as multiplying the original matrices.
In the DeepTrainer project I have already implemented this partitioning method.
Imagine you have a 4×6 matrix A and a 6×4 matrix B. First let’s calculate their product without partitioning:
A =
| 1 | 2 | 3 | 4 | 5 | 6 |
| 7 | 8 | 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 |
| 19 | 20 | 21 | 22 | 23 | 24 |

B =
| 1 | 7 | 13 | 19 |
| 2 | 8 | 14 | 20 |
| 3 | 9 | 15 | 21 |
| 4 | 10 | 16 | 22 |
| 5 | 11 | 17 | 23 |
| 6 | 12 | 18 | 24 |

C = A × B =
| 91 | 217 | 343 | 469 |
| 217 | 559 | 901 | 1243 |
| 343 | 901 | 1459 | 2017 |
| 469 | 1243 | 2017 | 2791 |
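For reference, here is a minimal sketch of this plain, unpartitioned multiplication in C++. The flat row-major `std::vector<double>` storage and the `MatMul` name are my own illustrative choices, not the actual DeepTrainer implementation:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Plain row-major matrix product: C (rows x cols) = A (rows x inner) * B (inner x cols).
std::vector<double> MatMul(const std::vector<double>& A, const std::vector<double>& B,
                           std::size_t rows, std::size_t inner, std::size_t cols)
{
    std::vector<double> C(rows * cols, 0.0);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t k = 0; k < inner; ++k)      // i-k-j loop order keeps the B accesses sequential
            for (std::size_t j = 0; j < cols; ++j)
                C[i * cols + j] += A[i * inner + k] * B[k * cols + j];
    return C;
}

int main()
{
    // The 4x6 matrix A and 6x4 matrix B from the example above.
    std::vector<double> A = {  1,  2,  3,  4,  5,  6,    7,  8,  9, 10, 11, 12,
                              13, 14, 15, 16, 17, 18,   19, 20, 21, 22, 23, 24 };
    std::vector<double> B = {  1,  7, 13, 19,   2,  8, 14, 20,   3,  9, 15, 21,
                               4, 10, 16, 22,   5, 11, 17, 23,   6, 12, 18, 24 };
    std::vector<double> C = MatMul(A, B, 4, 6, 4);
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) std::cout << C[i * 4 + j] << ' ';
        std::cout << '\n';                           // first row prints 91 217 343 469
    }
}
```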
We can create 2×2 sub-matrices from the original A and B matrices:
A, partitioned into 2×2 blocks (A11, A12, A13 in the top block row, A21, A22, A23 in the bottom block row; block boundaries are marked by the vertical bars):
| 1  2 | 3  4 | 5  6 |
| 7  8 | 9  10 | 11  12 |
| 13  14 | 15  16 | 17  18 |
| 19  20 | 21  22 | 23  24 |

B, partitioned into 2×2 blocks (B11, B12, then B21, B22, then B31, B32 from top to bottom):
| 1  7 | 13  19 |
| 2  8 | 14  20 |
| 3  9 | 15  21 |
| 4  10 | 16  22 |
| 5  11 | 17  23 |
| 6  12 | 18  24 |
Now we have a 2×3 and a 3×2 block matrix, each of whose cells is itself a 2×2 matrix. So we can calculate the product by multiplying these two block matrices:
| A11 | A12 | A13 |       | B11 | B12 |
| A21 | A22 | A23 |   ×   | B21 | B22 |
                          | B31 | B32 |
The result can be calculated with the same method as if the cells held numbers, except that the calculation now operates on sub-matrices. You have to be careful, because the multiplications here are not scalar multiplications but matrix multiplications. Matrix multiplication is not commutative, so the order of the factors must be preserved!
| A11×B11 + A12×B21 + A13×B31 | A11×B12 + A12×B22 + A13×B32 |
| A21×B11 + A22×B21 + A23×B31 | A21×B12 + A22×B22 + A23×B32 |
If you calculate these, the result will be exactly the same as without partitioning:
| 91 | 217 | 343 | 469 |
| 217 | 559 | 901 | 1243 |
| 343 | 901 | 1459 | 2017 |
| 469 | 1243 | 2017 | 2791 |
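The same block-wise formula can be sketched in code roughly as follows. The block size of 2, the flat row-major storage, and the helper names (`MulBlockAccumulate`, `BlockMatMul`) are illustrative assumptions of mine, not DeepTrainer's actual API; the essential point is that the 2×2 block products are accumulated in the order given above:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

constexpr std::size_t BS = 2; // block (sub-matrix) size

// Accumulates the 2x2 product of block (bi,bk) of A and block (bk,bj) of B
// into block (bi,bj) of C. All matrices are flat, row-major.
void MulBlockAccumulate(const std::vector<double>& A, const std::vector<double>& B,
                        std::vector<double>& C, std::size_t inner, std::size_t cols,
                        std::size_t bi, std::size_t bk, std::size_t bj)
{
    for (std::size_t i = 0; i < BS; ++i)
        for (std::size_t k = 0; k < BS; ++k)
            for (std::size_t j = 0; j < BS; ++j)
                C[(bi * BS + i) * cols + bj * BS + j] +=
                    A[(bi * BS + i) * inner + bk * BS + k] *
                    B[(bk * BS + k) * cols + bj * BS + j];
}

// Partitioned product: every dimension must be a multiple of BS.
std::vector<double> BlockMatMul(const std::vector<double>& A, const std::vector<double>& B,
                                std::size_t rows, std::size_t inner, std::size_t cols)
{
    std::vector<double> C(rows * cols, 0.0);
    for (std::size_t bi = 0; bi < rows / BS; ++bi)
        for (std::size_t bj = 0; bj < cols / BS; ++bj)
            for (std::size_t bk = 0; bk < inner / BS; ++bk)   // Cij = sum over k of Aik x Bkj, order preserved
                MulBlockAccumulate(A, B, C, inner, cols, bi, bk, bj);
    return C;
}

int main()
{
    std::vector<double> A = {  1,  2,  3,  4,  5,  6,    7,  8,  9, 10, 11, 12,
                              13, 14, 15, 16, 17, 18,   19, 20, 21, 22, 23, 24 };
    std::vector<double> B = {  1,  7, 13, 19,   2,  8, 14, 20,   3,  9, 15, 21,
                               4, 10, 16, 22,   5, 11, 17, 23,   6, 12, 18, 24 };
    std::vector<double> C = BlockMatMul(A, B, 4, 6, 4);
    std::cout << C[0] << ' ' << C[1] << '\n';   // 91 217, same as the unpartitioned result
}
```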
The performance advantage of partitioning comes from the fact that modern processors are equipped with SIMD (Single Instruction, Multiple Data) instructions that let you calculate the product of 2×2 or 4×4 matrices very quickly using processor intrinsics, which I covered in my previous post. Even greater acceleration is offered by the latest Nvidia Tensor Cores, which can execute a large number of such small matrix multiplications in parallel. So with a Volta or RTX GPU you can calculate the sub-results all at once.
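To illustrate why the 2×2 blocks map well onto SIMD hardware, here is one possible shape of a 2×2 multiply-accumulate kernel using SSE2 intrinsics on doubles. This is only an illustrative sketch of mine (the intrinsics code I actually use is in the previous post); on Tensor Cores the equivalent operation is a fused multiply-accumulate of small tiles executed in bulk:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <iostream>

// C += A * B for 2x2 row-major double blocks.
// Each row of C is built as a00*row0(B) + a01*row1(B) with packed operations,
// so a whole row is produced per multiply-add pair instead of one scalar at a time.
void Mul2x2Accumulate(const double* A, const double* B, double* C)
{
    __m128d b0 = _mm_loadu_pd(B);        // { b00, b01 }
    __m128d b1 = _mm_loadu_pd(B + 2);    // { b10, b11 }

    __m128d c0 = _mm_loadu_pd(C);
    c0 = _mm_add_pd(c0, _mm_mul_pd(_mm_set1_pd(A[0]), b0));
    c0 = _mm_add_pd(c0, _mm_mul_pd(_mm_set1_pd(A[1]), b1));
    _mm_storeu_pd(C, c0);

    __m128d c1 = _mm_loadu_pd(C + 2);
    c1 = _mm_add_pd(c1, _mm_mul_pd(_mm_set1_pd(A[2]), b0));
    c1 = _mm_add_pd(c1, _mm_mul_pd(_mm_set1_pd(A[3]), b1));
    _mm_storeu_pd(C + 2, c1);
}

int main()
{
    double A11[] = { 1, 2, 7, 8 };       // top-left block of A from the example
    double B11[] = { 1, 7, 2, 8 };       // top-left block of B
    double C[]   = { 0, 0, 0, 0 };
    Mul2x2Accumulate(A11, B11, C);
    std::cout << C[0] << ' ' << C[1] << ' ' << C[2] << ' ' << C[3] << '\n'; // 5 23 23 113
}
```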
A question naturally arises from the above algorithm: what happens if you can’t create exact partitions because your matrix dimensions are not multiples of 2 or 4? The answer is that you have to extend your matrices to make their dimensions partitionable, by adding rows and/or columns that contain only zeros.
Let’s take a look at a similar example, this time using a 3×5 matrix A and a 5×3 matrix B:
A =
| 1 | 2 | 3 | 4 | 5 |
| 7 | 8 | 9 | 10 | 11 |
| 13 | 14 | 15 | 16 | 17 |

B =
| 1 | 7 | 13 |
| 2 | 8 | 14 |
| 3 | 9 | 15 |
| 4 | 10 | 16 |
| 5 | 11 | 17 |

A × B =
| 55 | 145 | 235 |
| 145 | 415 | 685 |
| 235 | 685 | 1135 |
Before partitioning, we need to extend these matrices to 4×6 and 6×4. Here are the extended matrices, before partitioning:
A (extended to 4×6) =
| 1 | 2 | 3 | 4 | 5 | 0 |
| 7 | 8 | 9 | 10 | 11 | 0 |
| 13 | 14 | 15 | 16 | 17 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |

B (extended to 6×4) =
| 1 | 7 | 13 | 0 |
| 2 | 8 | 14 | 0 |
| 3 | 9 | 15 | 0 |
| 4 | 10 | 16 | 0 |
| 5 | 11 | 17 | 0 |
| 0 | 0 | 0 | 0 |
Here are the sub-matrices:
A, partitioned into 2×2 blocks (A11, A12, A13 on top, A21, A22, A23 on the bottom):
| 1  2 | 3  4 | 5  0 |
| 7  8 | 9  10 | 11  0 |
| 13  14 | 15  16 | 17  0 |
| 0  0 | 0  0 | 0  0 |

B, partitioned into 2×2 blocks (B11, B12, then B21, B22, then B31, B32):
| 1  7 | 13  0 |
| 2  8 | 14  0 |
| 3  9 | 15  0 |
| 4  10 | 16  0 |
| 5  11 | 17  0 |
| 0  0 | 0  0 |
Following the same partitioned algorithm, the result will be a 4×4 matrix that contains zeros in all the rows and columns we added, so the 3×3 result can easily be cropped out of it:
C =
| 55 | 145 | 235 | 0 |
| 145 | 415 | 685 | 0 |
| 235 | 685 | 1135 | 0 |
| 0 | 0 | 0 | 0 |
As you can see, the result is exactly the same as with the original 3×5 and 5×3 matrices.
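The zero-extension and cropping steps can be sketched like this, assuming the same flat row-major layout as in the earlier snippets; `PadToMultiple` and `Crop` are hypothetical helper names of mine, not part of DeepTrainer:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Copies src (rows x cols) into the top-left corner of a zero-filled matrix whose
// dimensions are rounded up to the next multiple of `block`.
std::vector<double> PadToMultiple(const std::vector<double>& src, std::size_t rows,
                                  std::size_t cols, std::size_t block,
                                  std::size_t& paddedRows, std::size_t& paddedCols)
{
    paddedRows = (rows + block - 1) / block * block;
    paddedCols = (cols + block - 1) / block * block;
    std::vector<double> dst(paddedRows * paddedCols, 0.0);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            dst[i * paddedCols + j] = src[i * cols + j];
    return dst;
}

// Crops the top-left rows x cols corner out of a matrix with paddedCols columns.
std::vector<double> Crop(const std::vector<double>& src, std::size_t paddedCols,
                         std::size_t rows, std::size_t cols)
{
    std::vector<double> dst(rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            dst[i * cols + j] = src[i * paddedCols + j];
    return dst;
}

int main()
{
    // The 3x5 matrix A from the example, padded to 4x6 for 2x2 partitioning.
    std::vector<double> A = { 1, 2, 3, 4, 5,   7, 8, 9, 10, 11,   13, 14, 15, 16, 17 };
    std::size_t pr = 0, pc = 0;
    std::vector<double> padded = PadToMultiple(A, 3, 5, 2, pr, pc);
    // The padded matrices can now be fed to the block multiplication, and the 3x3
    // result is cropped back out with Crop(result, paddedResultCols, 3, 3).
    for (std::size_t i = 0; i < pr; ++i) {
        for (std::size_t j = 0; j < pc; ++j) std::cout << padded[i * pc + j] << ' ';
        std::cout << '\n';
    }
}
```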
Extending the matrices with zeros results in larger matrices, which is clearly a disadvantage in terms of performance, although when the original matrices are around 99×99 in size this small overhead is negligible. And when matrix multiplier units or Tensor Cores are used instead of processor intrinsics, this disadvantage disappears completely.