Single and double precision 4×4 and 8×8 block matrix products using SSE2, AVX and AVX512 intrinsics
Dear Reader,

In this post I would like to summarize the matrix multiplication algorithms I am using in my neural network library – hopefully they will come in handy for some of you.

The data types

Every SIMD instruction works on a vector of data in parallel. In our case, each such vector holds one row of the matrix.

__m128: […]
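
To make the row-as-vector idea concrete, here is a minimal sketch of a 4×4 single-precision product in C. It is not taken from the library mentioned above: the function name `matmul4x4_sse`, the row-major layout and the use of unaligned loads are all assumptions made for illustration. Each row of B is loaded into one `__m128`, and each row of C is built as a sum of B's rows scaled by the matching elements of A's row. Only SSE float intrinsics from `<xmmintrin.h>` are used.

```c
#include <assert.h>
#include <xmmintrin.h>  /* __m128 and the _mm_*_ps intrinsics */

/* Hypothetical helper: C = A * B for row-major 4x4 float matrices.
 * Assumes c does not overlap a or b. */
static void matmul4x4_sse(const float *a, const float *b, float *c)
{
    assert(a && b && c);

    /* One row of B (4 floats) fits exactly into one __m128. */
    __m128 b0 = _mm_loadu_ps(b + 0);
    __m128 b1 = _mm_loadu_ps(b + 4);
    __m128 b2 = _mm_loadu_ps(b + 8);
    __m128 b3 = _mm_loadu_ps(b + 12);

    for (int i = 0; i < 4; ++i) {
        /* Row i of C = sum over k of A[i][k] * (row k of B).
         * _mm_set1_ps broadcasts the scalar A[i][k] to all four lanes. */
        __m128 r = _mm_mul_ps(_mm_set1_ps(a[4 * i + 0]), b0);
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 1]), b1));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 2]), b2));
        r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[4 * i + 3]), b3));

        /* Unaligned store, so callers need not align their buffers. */
        _mm_storeu_ps(c + 4 * i, r);
    }
}
```

If the matrices are known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could replace the unaligned variants. The sketch keeps the unaligned forms so that it works with any `float` array.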