# Single and double precision 4×4 and 8×8 block matrix products using SSE, AVX and AVX512 intrinsics

In this post I would like to summarize the matrix multiplication algorithms I am using in my neural network library – hopefully they will come in handy for some of you.

## The data types

Every SIMD instruction works on a whole vector of data in parallel; in these routines, one vector holds one row of the matrix.

- `__m128`: stores 4 32-bit float values, usable for 4×4 matrices of floats (SSE)
- `__m256d`: stores 4 64-bit double values, usable for 4×4 matrices of doubles (AVX)
- `__m256`: stores 8 32-bit float values, usable for 8×8 matrices of floats (AVX)
- `__m512d`: stores 8 64-bit double values, usable for 8×8 matrices of doubles (AVX512)

The AVX512 intrinsics are only available in Visual Studio 2017 or newer; earlier compilers can’t handle these instructions and data types. You also need a recent Xeon or a Core-X series processor to run the code.
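Because an AVX512 kernel faults with an illegal instruction on unsupported CPUs, it is worth gating the fast paths at run time. A minimal sketch using a GCC/Clang builtin (assumption: this builtin is not available on MSVC, where `__cpuidex` would be used instead; the function names are mine):

```cpp
// Runtime checks for the instruction sets used below.
// __builtin_cpu_supports is a GCC/Clang builtin; on MSVC use __cpuidex.
bool HasAVX()     { return __builtin_cpu_supports("avx") != 0; }
bool HasAVX512F() { return __builtin_cpu_supports("avx512f") != 0; }
```

At startup you would pick the widest kernel these checks allow and fall back to the SSE version otherwise.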

## Algorithm used for multiplying 4×4 matrices of floats (using SSE intrinsics)

The C += A x B algorithm:

```
#define MatrixPartitionSize 4

struct Mat44f
{
    float value[MatrixPartitionSize][MatrixPartitionSize];
};

void MultiplyAndAddMM4x4F(Mat44f &result, const Mat44f &MatrixA, const Mat44f &MatrixB)
{
    __m128 rightRow[MatrixPartitionSize];
    __m128 resultRow[MatrixPartitionSize];
    // Load the rows of B and the current rows of C.
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        rightRow[i] = _mm_loadu_ps(MatrixB.value[i]);
        resultRow[i] = _mm_loadu_ps(result.value[i]);
    }
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        for (int j = 0; j < MatrixPartitionSize; ++j)
        {
            // C[i] += A[i][j] * B[j]: broadcast one element of A
            // and multiply it with a whole row of B.
            resultRow[i] = _mm_add_ps(resultRow[i],
                _mm_mul_ps(_mm_set1_ps(MatrixA.value[i][j]), rightRow[j]));
        }
        _mm_storeu_ps(result.value[i], resultRow[i]);
    }
}
```
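A quick way to validate the kernel is to compare it against the scalar triple loop. A self-contained sketch (the kernel is repeated so the snippet compiles on its own; `ReferenceMultiplyAndAdd` and `KernelMatchesReference` are my names, not from the library – SSE is part of the x86-64 baseline, so no special compiler flags are needed):

```cpp
#include <immintrin.h>
#include <cmath>

struct Mat44f { float value[4][4]; };

// Same row-broadcast kernel as above: C += A * B.
void MultiplyAndAddMM4x4F(Mat44f &result, const Mat44f &A, const Mat44f &B)
{
    __m128 rightRow[4], resultRow[4];
    for (int i = 0; i < 4; ++i)
    {
        rightRow[i] = _mm_loadu_ps(B.value[i]);
        resultRow[i] = _mm_loadu_ps(result.value[i]);
    }
    for (int i = 0; i < 4; ++i)
    {
        for (int j = 0; j < 4; ++j)
            resultRow[i] = _mm_add_ps(resultRow[i],
                _mm_mul_ps(_mm_set1_ps(A.value[i][j]), rightRow[j]));
        _mm_storeu_ps(result.value[i], resultRow[i]);
    }
}

// Scalar reference: the textbook triple loop computing C += A * B.
void ReferenceMultiplyAndAdd(Mat44f &C, const Mat44f &A, const Mat44f &B)
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                C.value[i][j] += A.value[i][k] * B.value[k][j];
}

bool KernelMatchesReference()
{
    Mat44f A, B, C{}, R{};  // C and R start as zero matrices
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
        {
            A.value[i][j] = float(i * 4 + j + 1);
            B.value[i][j] = float((i + j) % 3) - 1.0f;
        }
    MultiplyAndAddMM4x4F(C, A, B);
    ReferenceMultiplyAndAdd(R, A, B);
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            if (std::fabs(C.value[i][j] - R.value[i][j]) > 1e-4f)
                return false;
    return true;
}
```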

## Algorithm used for multiplying 4×4 matrices of doubles (using AVX intrinsics)

The C += A x B algorithm:

```
#define MatrixPartitionSize 4

struct Mat44d { double value[MatrixPartitionSize][MatrixPartitionSize]; };

void MultiplyAndAddMM4x4D(Mat44d &result, const Mat44d &MatrixA, const Mat44d &MatrixB)
{
    __m256d rightRow[MatrixPartitionSize];
    __m256d resultRow[MatrixPartitionSize];
    // Load the rows of B and the current rows of C.
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        rightRow[i] = _mm256_loadu_pd(MatrixB.value[i]);
        resultRow[i] = _mm256_loadu_pd(result.value[i]);
    }
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        for (int j = 0; j < MatrixPartitionSize; ++j)
        {
            // C[i] += A[i][j] * B[j]
            resultRow[i] = _mm256_add_pd(resultRow[i],
                _mm256_mul_pd(_mm256_set1_pd(MatrixA.value[i][j]), rightRow[j]));
        }
        _mm256_storeu_pd(result.value[i], resultRow[i]);
    }
}
```
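The `__m256d` version needs AVX hardware. On SSE2-only machines the same row-broadcast update can still be vectorized two doubles at a time, by treating each 4-wide row as two `__m128d` halves. A sketch (the function name is mine; SSE2 is part of the x86-64 baseline):

```cpp
#include <emmintrin.h>  // SSE2

struct Mat44d { double value[4][4]; };

// 4x4 double C += A * B using only SSE2: a __m128d holds two doubles,
// so each row is processed as a low half and a high half.
void MultiplyAndAddMM4x4D_SSE2(Mat44d &result, const Mat44d &A, const Mat44d &B)
{
    for (int i = 0; i < 4; ++i)
    {
        __m128d lo = _mm_loadu_pd(&result.value[i][0]);
        __m128d hi = _mm_loadu_pd(&result.value[i][2]);
        for (int j = 0; j < 4; ++j)
        {
            __m128d a = _mm_set1_pd(A.value[i][j]);  // broadcast A[i][j]
            lo = _mm_add_pd(lo, _mm_mul_pd(a, _mm_loadu_pd(&B.value[j][0])));
            hi = _mm_add_pd(hi, _mm_mul_pd(a, _mm_loadu_pd(&B.value[j][2])));
        }
        _mm_storeu_pd(&result.value[i][0], lo);
        _mm_storeu_pd(&result.value[i][2], hi);
    }
}
```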

## Algorithm used for multiplying 8×8 matrices of floats (using AVX intrinsics)

The C += A x B algorithm:

```
#define MatrixPartitionSize 8

struct Mat88f { float value[MatrixPartitionSize][MatrixPartitionSize]; };

void MultiplyAndAddMM8x8F(Mat88f &result, const Mat88f &MatrixA, const Mat88f &MatrixB)
{
    __m256 rightRow[MatrixPartitionSize];
    __m256 resultRow[MatrixPartitionSize];
    // Load the rows of B and the current rows of C.
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        rightRow[i] = _mm256_loadu_ps(MatrixB.value[i]);
        resultRow[i] = _mm256_loadu_ps(result.value[i]);
    }
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        for (int j = 0; j < MatrixPartitionSize; ++j)
        {
            // C[i] += A[i][j] * B[j]
            resultRow[i] = _mm256_add_ps(resultRow[i],
                _mm256_mul_ps(_mm256_set1_ps(MatrixA.value[i][j]), rightRow[j]));
        }
        _mm256_storeu_ps(result.value[i], resultRow[i]);
    }
}
```
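These fixed-size kernels are meant as building blocks: a larger matrix product is tiled into 4×4 or 8×8 blocks, and each block product is accumulated with one of the C += A × B kernels. A sketch of the blocking loops, with a scalar stand-in for the SIMD tile kernel so it runs anywhere (my names; assumes square row-major matrices with a dimension divisible by the tile size):

```cpp
#include <vector>
#include <cstddef>

constexpr int Tile = 4;  // tile size; use 8 for the AVX/AVX512 kernels

// Stand-in for the SIMD kernels above: accumulates one Tile x Tile block,
// where each block row is `n` floats apart in the full matrix.
void TileMultiplyAdd(float *C, const float *A, const float *B, std::size_t n)
{
    for (int i = 0; i < Tile; ++i)
        for (int k = 0; k < Tile; ++k)
            for (int j = 0; j < Tile; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// C += A * B for n x n matrices (row-major, n divisible by Tile).
void BlockedMultiplyAdd(std::vector<float> &C, const std::vector<float> &A,
                        const std::vector<float> &B, std::size_t n)
{
    for (std::size_t bi = 0; bi < n; bi += Tile)
        for (std::size_t bk = 0; bk < n; bk += Tile)
            for (std::size_t bj = 0; bj < n; bj += Tile)
                TileMultiplyAdd(&C[bi * n + bj], &A[bi * n + bk],
                                &B[bk * n + bj], n);
}
```

Swapping `TileMultiplyAdd` for the SIMD kernel of your target hardware is then a local change.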

## Algorithm used for multiplying 8×8 matrices of doubles (using AVX512 intrinsics)

(N.B.: This algorithm only compiles in Visual Studio 2017 or newer, and AVX512 instructions are only supported on Core-X series processors or recent Xeon CPUs.)

The C += A x B algorithm:

```
#define MatrixPartitionSize 8

struct Mat88d { double value[MatrixPartitionSize][MatrixPartitionSize]; };

void MultiplyAndAddMM8x8D(Mat88d &result, const Mat88d &MatrixA, const Mat88d &MatrixB)
{
__m512d rightRow[MatrixPartitionSize];
__m512d resultRow[MatrixPartitionSize];

for (int i = 0; i < MatrixPartitionSize; ++i)
{