Matrices have a beautiful feature that comes in very handy when creating fast matrix multiplication (dot product) algorithms. If a matrix is partitioned into equal-sized blocks, and the blocks themselves are treated as cells of a new matrix, then multiplying the block matrices gives exactly the same result as multiplying the original matrices.
In the DeepTrainer project I have already implemented this partitioning method.
Imagine you have a 4×6 matrix A and a 6×4 matrix B. First let’s calculate their product without partitioning:
A =
| 1 | 2 | 3 | 4 | 5 | 6 |
| 7 | 8 | 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 |
| 19 | 20 | 21 | 22 | 23 | 24 |

B =
| 1 | 7 | 13 | 19 |
| 2 | 8 | 14 | 20 |
| 3 | 9 | 15 | 21 |
| 4 | 10 | 16 | 22 |
| 5 | 11 | 17 | 23 |
| 6 | 12 | 18 | 24 |

C = A × B =
| 91 | 217 | 343 | 469 |
| 217 | 559 | 901 | 1243 |
| 343 | 901 | 1459 | 2017 |
| 469 | 1243 | 2017 | 2791 |
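For reference, here is a minimal sketch of this plain, unpartitioned multiplication in C++. The flat row-major `std::vector<double>` storage and the `MatMul` name are my own illustrative choices, not the actual DeepTrainer implementation:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Plain row-major matrix product: C (rows x cols) = A (rows x inner) * B (inner x cols).
std::vector<double> MatMul(const std::vector<double>& A, const std::vector<double>& B,
                           std::size_t rows, std::size_t inner, std::size_t cols)
{
    std::vector<double> C(rows * cols, 0.0);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t k = 0; k < inner; ++k)      // i-k-j loop order keeps the B accesses sequential
            for (std::size_t j = 0; j < cols; ++j)
                C[i * cols + j] += A[i * inner + k] * B[k * cols + j];
    return C;
}

int main()
{
    // The 4x6 matrix A and 6x4 matrix B from the example above.
    std::vector<double> A = {  1,  2,  3,  4,  5,  6,    7,  8,  9, 10, 11, 12,
                              13, 14, 15, 16, 17, 18,   19, 20, 21, 22, 23, 24 };
    std::vector<double> B = {  1,  7, 13, 19,   2,  8, 14, 20,   3,  9, 15, 21,
                               4, 10, 16, 22,   5, 11, 17, 23,   6, 12, 18, 24 };
    std::vector<double> C = MatMul(A, B, 4, 6, 4);
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) std::cout << C[i * 4 + j] << ' ';
        std::cout << '\n';                           // first row prints 91 217 343 469
    }
}
```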
We can create 2×2 sub-matrices from the original A and B matrices:
A, partitioned into 2×2 blocks (A11, A12, A13 in the top block row, A21, A22, A23 in the bottom block row; block boundaries are marked by the vertical bars):
| 1  2 | 3  4 | 5  6 |
| 7  8 | 9  10 | 11  12 |
| 13  14 | 15  16 | 17  18 |
| 19  20 | 21  22 | 23  24 |

B, partitioned into 2×2 blocks (B11, B12, then B21, B22, then B31, B32 from top to bottom):
| 1  7 | 13  19 |
| 2  8 | 14  20 |
| 3  9 | 15  21 |
| 4  10 | 16  22 |
| 5  11 | 17  23 |
| 6  12 | 18  24 |
Now we have a 2×3 and a 3×2 block matrix, each of whose cells is itself a 2×2 matrix. So we can calculate the product by multiplying these two block matrices:
| A11 | A12 | A13 |       | B11 | B12 |
| A21 | A22 | A23 |   ×   | B21 | B22 |
                          | B31 | B32 |
The result can be calculated with the same method as if the cells held numbers, except that the calculation now operates on sub-matrices. You have to be careful, because the multiplications here are not scalar multiplications but matrix multiplications. Matrix multiplication is not commutative, so the order of the factors must be preserved!
| A11×B11 + A12×B21 + A13×B31 | A11×B12 + A12×B22 + A13×B32 |
| A21×B11 + A22×B21 + A23×B31 | A21×B12 + A22×B22 + A23×B32 |
If you calculate these, the result will be exactly the same as without partitioning:
| 91 | 217 | 343 | 469 |
| 217 | 559 | 901 | 1243 |
| 343 | 901 | 1459 | 2017 |
| 469 | 1243 | 2017 | 2791 |
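The same block-wise formula can be sketched in code roughly as follows. The block size of 2, the flat row-major storage, and the helper names (`MulBlockAccumulate`, `BlockMatMul`) are illustrative assumptions of mine, not DeepTrainer's actual API; the essential point is that the 2×2 block products are accumulated in the order given above:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

constexpr std::size_t BS = 2; // block (sub-matrix) size

// Accumulates the 2x2 product of block (bi,bk) of A and block (bk,bj) of B
// into block (bi,bj) of C. All matrices are flat, row-major.
void MulBlockAccumulate(const std::vector<double>& A, const std::vector<double>& B,
                        std::vector<double>& C, std::size_t inner, std::size_t cols,
                        std::size_t bi, std::size_t bk, std::size_t bj)
{
    for (std::size_t i = 0; i < BS; ++i)
        for (std::size_t k = 0; k < BS; ++k)
            for (std::size_t j = 0; j < BS; ++j)
                C[(bi * BS + i) * cols + bj * BS + j] +=
                    A[(bi * BS + i) * inner + bk * BS + k] *
                    B[(bk * BS + k) * cols + bj * BS + j];
}

// Partitioned product: every dimension must be a multiple of BS.
std::vector<double> BlockMatMul(const std::vector<double>& A, const std::vector<double>& B,
                                std::size_t rows, std::size_t inner, std::size_t cols)
{
    std::vector<double> C(rows * cols, 0.0);
    for (std::size_t bi = 0; bi < rows / BS; ++bi)
        for (std::size_t bj = 0; bj < cols / BS; ++bj)
            for (std::size_t bk = 0; bk < inner / BS; ++bk)   // Cij = sum over k of Aik x Bkj, order preserved
                MulBlockAccumulate(A, B, C, inner, cols, bi, bk, bj);
    return C;
}

int main()
{
    std::vector<double> A = {  1,  2,  3,  4,  5,  6,    7,  8,  9, 10, 11, 12,
                              13, 14, 15, 16, 17, 18,   19, 20, 21, 22, 23, 24 };
    std::vector<double> B = {  1,  7, 13, 19,   2,  8, 14, 20,   3,  9, 15, 21,
                               4, 10, 16, 22,   5, 11, 17, 23,   6, 12, 18, 24 };
    std::vector<double> C = BlockMatMul(A, B, 4, 6, 4);
    std::cout << C[0] << ' ' << C[1] << '\n';   // 91 217, same as the unpartitioned result
}
```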
The performance advantage of partitioning comes from the fact that modern processors are equipped with SIMD (Single Instruction, Multiple Data) instructions that let you calculate the product of 2×2 or 4×4 matrices very quickly using processor intrinsics, which I covered in my previous post. Even greater acceleration is offered by the latest Nvidia Tensor Cores, which can execute a large number of such small matrix multiplications in parallel. So with a Volta or RTX GPU you can calculate the sub-results all at once.
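To illustrate why the 2×2 blocks map well onto SIMD hardware, here is one possible shape of a 2×2 multiply-accumulate kernel using SSE2 intrinsics on doubles. This is only an illustrative sketch of mine (the intrinsics code I actually use is in the previous post); on Tensor Cores the equivalent operation is a fused multiply-accumulate of small tiles executed in bulk:

```cpp
#include <emmintrin.h>   // SSE2 intrinsics
#include <iostream>

// C += A * B for 2x2 row-major double blocks.
// Each row of C is built as a00*row0(B) + a01*row1(B) with packed operations,
// so a whole row is produced per multiply-add pair instead of one scalar at a time.
void Mul2x2Accumulate(const double* A, const double* B, double* C)
{
    __m128d b0 = _mm_loadu_pd(B);        // { b00, b01 }
    __m128d b1 = _mm_loadu_pd(B + 2);    // { b10, b11 }

    __m128d c0 = _mm_loadu_pd(C);
    c0 = _mm_add_pd(c0, _mm_mul_pd(_mm_set1_pd(A[0]), b0));
    c0 = _mm_add_pd(c0, _mm_mul_pd(_mm_set1_pd(A[1]), b1));
    _mm_storeu_pd(C, c0);

    __m128d c1 = _mm_loadu_pd(C + 2);
    c1 = _mm_add_pd(c1, _mm_mul_pd(_mm_set1_pd(A[2]), b0));
    c1 = _mm_add_pd(c1, _mm_mul_pd(_mm_set1_pd(A[3]), b1));
    _mm_storeu_pd(C + 2, c1);
}

int main()
{
    double A11[] = { 1, 2, 7, 8 };       // top-left block of A from the example
    double B11[] = { 1, 7, 2, 8 };       // top-left block of B
    double C[]   = { 0, 0, 0, 0 };
    Mul2x2Accumulate(A11, B11, C);
    std::cout << C[0] << ' ' << C[1] << ' ' << C[2] << ' ' << C[3] << '\n'; // 5 23 23 113
}
```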
A question naturally arises from the above algorithm: what happens if you can’t create exact partitions because your matrix dimensions are not multiples of 2 or 4? The answer is that you have to extend your matrices to make their dimensions partitionable, by adding rows and/or columns that contain only zeros.
Let’s take a look at a similar example, this time using a 3×5 matrix A and a 5×3 matrix B:
A =
| 1 | 2 | 3 | 4 | 5 |
| 7 | 8 | 9 | 10 | 11 |
| 13 | 14 | 15 | 16 | 17 |

B =
| 1 | 7 | 13 |
| 2 | 8 | 14 |
| 3 | 9 | 15 |
| 4 | 10 | 16 |
| 5 | 11 | 17 |

A × B =
| 55 | 145 | 235 |
| 145 | 415 | 685 |
| 235 | 685 | 1135 |
Before partitioning, we need to extend these matrices to 4×6 and 6×4. Here are the extended matrices, before partitioning:
A (extended to 4×6) =
| 1 | 2 | 3 | 4 | 5 | 0 |
| 7 | 8 | 9 | 10 | 11 | 0 |
| 13 | 14 | 15 | 16 | 17 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |

B (extended to 6×4) =
| 1 | 7 | 13 | 0 |
| 2 | 8 | 14 | 0 |
| 3 | 9 | 15 | 0 |
| 4 | 10 | 16 | 0 |
| 5 | 11 | 17 | 0 |
| 0 | 0 | 0 | 0 |
Here are the sub-matrices:
A, partitioned into 2×2 blocks (A11, A12, A13 on top, A21, A22, A23 on the bottom):
| 1  2 | 3  4 | 5  0 |
| 7  8 | 9  10 | 11  0 |
| 13  14 | 15  16 | 17  0 |
| 0  0 | 0  0 | 0  0 |

B, partitioned into 2×2 blocks (B11, B12, then B21, B22, then B31, B32):
| 1  7 | 13  0 |
| 2  8 | 14  0 |
| 3  9 | 15  0 |
| 4  10 | 16  0 |
| 5  11 | 17  0 |
| 0  0 | 0  0 |
Following the same partitioned algorithm, the result will be a 4×4 matrix that contains zeros in all the rows and columns we added, so the 3×3 result can easily be cropped out of it:
C =
| 55 | 145 | 235 | 0 |
| 145 | 415 | 685 | 0 |
| 235 | 685 | 1135 | 0 |
| 0 | 0 | 0 | 0 |
As you can see, the result is exactly the same as with the original 3×5 and 5×3 matrices.
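The zero-extension and cropping steps can be sketched like this, assuming the same flat row-major layout as in the earlier snippets; `PadToMultiple` and `Crop` are hypothetical helper names of mine, not part of DeepTrainer:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Copies src (rows x cols) into the top-left corner of a zero-filled matrix whose
// dimensions are rounded up to the next multiple of `block`.
std::vector<double> PadToMultiple(const std::vector<double>& src, std::size_t rows,
                                  std::size_t cols, std::size_t block,
                                  std::size_t& paddedRows, std::size_t& paddedCols)
{
    paddedRows = (rows + block - 1) / block * block;
    paddedCols = (cols + block - 1) / block * block;
    std::vector<double> dst(paddedRows * paddedCols, 0.0);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            dst[i * paddedCols + j] = src[i * cols + j];
    return dst;
}

// Crops the top-left rows x cols corner out of a matrix with paddedCols columns.
std::vector<double> Crop(const std::vector<double>& src, std::size_t paddedCols,
                         std::size_t rows, std::size_t cols)
{
    std::vector<double> dst(rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            dst[i * cols + j] = src[i * paddedCols + j];
    return dst;
}

int main()
{
    // The 3x5 matrix A from the example, padded to 4x6 for 2x2 partitioning.
    std::vector<double> A = { 1, 2, 3, 4, 5,   7, 8, 9, 10, 11,   13, 14, 15, 16, 17 };
    std::size_t pr = 0, pc = 0;
    std::vector<double> padded = PadToMultiple(A, 3, 5, 2, pr, pc);
    // The padded matrices can now be fed to the block multiplication, and the 3x3
    // result is cropped back out with Crop(result, paddedResultCols, 3, 3).
    for (std::size_t i = 0; i < pr; ++i) {
        for (std::size_t j = 0; j < pc; ++j) std::cout << padded[i * pc + j] << ' ';
        std::cout << '\n';
    }
}
```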
Extending the matrices with zeros results in larger matrices, which is clearly a disadvantage in terms of performance, although when the original matrices are around 99×99 in size this small overhead is negligible. And when matrix multiplier units or Tensor Cores are used instead of processor intrinsics, this disadvantage disappears completely.