Calculating dot product using matrix partitioning

Matrices have a beautiful feature that comes very handy when creating fast dot product algorithms. If a matrix is partitioned to equal size blocks, and the blocks themselves are treated as matrix cells in a new matrix, then the dot product calculation will be the same with the new matrix.

In the DeepTrainer project I have already implemented this partitioning method.

Imagine you have two 6×4 and 4×6 matrices. First let’s calculate the matrix dot product without partitioning:

A B
1 2 3 4 5 6 1 7 13 19
7 8 9 10 11 12 X 2 8 14 20
13 14 15 16 17 18 3 9 15 21
19 20 21 22 23 24 4 10 16 22
5 11 17 23
6 12 18 24
C
91 217 343 469
= 217 559 901 1243
343 901 1459 2017
469 1243 2017 2791

We can create 2×2 sub-matrices from the original A and B matrices:

A11 A12 A13 B11 B12
1 2 3 4 5 6 1 7 13 19
7 8 9 10 11 12 2 8 14 20
A21 A22 A23 B21 B22
13 14 15 16 17 18 3 9 15 21
19 20 21 22 23 24 4 10 16 22
B31 B32
5 11 17 23
6 12 18 24

Now we have two 3×2 and 2×3 matrices, each of which has 2×2 matrices in their cells. So we can calculate the dot product by multiplying these two block matrices:

A11 A12 A13 B11 B12
A21 A22 A23 x B21 B22
B31 B32

The result can be calculated using the same method as if we had numbers in the cells, but this calculation will use sub-matrices. You have to be careful because the multiplications here do not mean scalar products, but instead matrix products. Matrix products are not commutative, so the order of the elements must be preserved!

A11xB11 + A12xB21 + A13xB31 A11xB12 + A12xB22 + A13xB32
A21xB11 + A22xB21 + A23xB31 A21xB12 + A22xB22 + A23xB32

If you calculate these the result will be exactly the same as without partitioning:

91 217 343 469
217 559 901 1243
343 901 1459 2017
469 1243 2017 2791

The performance advantage of partitioning comes from the fact that modern processors are equipped with SIMD (Single Instruction Multiple Data) instructions that allow you to calculate the product of 2×2 or 4×4 matrices very quickly by using processor intrinsics – which I have published in my previous post. An even better acceleration is enabled by the latest Nvidia Tensor Cores, that are able to execute a large number of such multiplications in parallel. So with a Volta or RTX GPU you can calculate the sub-results all at once.

A question naturally arises from the above algorithm: what happens if you can’t create exact partitions because your matrix dimensions are not multiples of 2 or 4? The answer is that you have to extend your existing matrices to make their dimensions partitionable, by adding rows and/or columns only containing zeros.

Lets take a look at the above example, but this time using 5×3 and 3×5 matrices:

1 2 3 4 5 1 7 13 55 145 235
7 8 9 10 11 X 2 8 14 = 145 415 685
13 14 15 16 17 3 9 15 235 685 1135
4 10 16
5 11 17

Before partitioning, we need to extend these matrices to become 6×4 and 4×6 in dimensions. Here is the solution with extended matrices, without partitioning:

A B
1 2 3 4 5 0 1 7 13 0
7 8 9 10 11 0 X 2 8 14 0
13 14 15 16 17 0 3 9 15 0
0 0 0 0 0 0 4 10 16 0
5 11 17 0
0 0 0 0

Here are the sub-matrices:

A11 A12 A13 B11 B12
1 2 3 4 5 0 1 7 13 0
7 8 9 10 11 0 2 8 14 0
A21 A22 A23 B21 B22
13 14 15 16 17 0 3 9 15 0
0 0 0 0 0 0 4 10 16 0
B31 B32
5 11 17 0
0 0 0 0

Following the same partitioned algorithm, the result will be a 4×4 matrix, which will contain zeros in all those columns that we extended, so the resulting 3×3 matrix can be easily cropped out from it:

C
55 145 235 0
145 415 685 0
235 685 1135 0
0 0 0 0

As you can see, the result is exactly the same as with 5×3 and 3×5 matrices.

Extending the matrices with zeros will result in larger matrices, so this is clearly a disadvantage in terms of performance, although when our original matrices have the size of 99×99 this small extra is negligible. Also when instead of processor intrinsics matrix multiplier cores or tensor cores are used, this disadvantage disappears completely.