Dear Reader,
In this post I would like to summarize the matrix multiplication algorithms I use in my neural network library; hopefully they will come in handy for some of you.
The data types
Every SIMD instruction works on a vector of data in parallel; in these algorithms one such vector holds one row of the matrix (a minimal load/store sketch follows below).
__m128: stores 4 32-bit float values, usable for 4×4 matrices with floats (SSE)
__m256d: stores 4 64-bit double values, usable for 4×4 matrices with doubles (AVX)
__m256: stores 8 32-bit float values, usable for 8×8 matrices with floats (AVX)
__m512d: stores 8 64-bit double values, usable for 8×8 matrices with doubles (AVX512)
The AVX512 intrinsics are only available in Visual Studio 2017; earlier compilers cannot handle these instructions and data types. You also need either a recent Xeon or a Core-X series processor to run the code.
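To make the row-to-vector mapping concrete, here is a minimal sketch (not from the library itself; the function name and row pointers are made up for illustration) that loads one matrix row into each of the four vector types with an unaligned load and writes it back with the matching store:

#include <immintrin.h>

// Illustrative only: one matrix row per vector type, loaded and stored back.
void LoadStoreRowExamples(const float *rowF4, const double *rowD4,
                          const float *rowF8, const double *rowD8,
                          float *outF4, double *outD4, float *outF8, double *outD8)
{
    __m128  a = _mm_loadu_ps(rowF4);        // 4 floats  (SSE)
    __m256d b = _mm256_loadu_pd(rowD4);     // 4 doubles (AVX)
    __m256  c = _mm256_loadu_ps(rowF8);     // 8 floats  (AVX)
    __m512d d = _mm512_loadu_pd(rowD8);     // 8 doubles (AVX512)

    _mm_storeu_ps(outF4, a);
    _mm256_storeu_pd(outD4, b);
    _mm256_storeu_ps(outF8, c);
    _mm512_storeu_pd(outD8, d);
}

The unaligned variants (_mm_loadu_ps and friends) are the same ones the algorithms below use, so the matrix rows do not have to sit on 16-, 32- or 64-byte boundaries.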
Algorithm used for multiplying 4×4 matrices of floats (using SSE intrinsics)
The C += A × B algorithm:
#include <immintrin.h>

#define MatrixPartitionSize 4

struct Mat44f
{
    float value[MatrixPartitionSize][MatrixPartitionSize];
};

// result += MatrixA * MatrixB for a 4x4 block of floats
void MultiplyAndAddMM4x4F(Mat44f &result, const Mat44f &MatrixA, const Mat44f &MatrixB)
{
    __m128 rightRow[MatrixPartitionSize];
    __m128 resultRow[MatrixPartitionSize];

    // Load every row of B and of the accumulator into SSE registers
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        rightRow[i] = _mm_loadu_ps((const float *)MatrixB.value[i]);
        resultRow[i] = _mm_loadu_ps((const float *)result.value[i]);
    }

    // Row i of the result accumulates A[i][j] * (row j of B) for every j
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        for (int j = 0; j < MatrixPartitionSize; ++j)
        {
            resultRow[i] = _mm_add_ps(resultRow[i], _mm_mul_ps(rightRow[j], _mm_set1_ps(MatrixA.value[i][j])));
        }
        _mm_storeu_ps((float *)result.value[i], resultRow[i]);
    }
}
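A minimal usage sketch (my addition, not part of the original routine), assuming the Mat44f struct and MultiplyAndAddMM4x4F from the block above. Since the routine accumulates, the result matrix has to be zeroed (or hold a previous partial sum) before the call:

int main()
{
    Mat44f A = {}, B = {}, C = {};    // value-initialized: all elements start at zero

    // A = identity, B[i][j] = i * 4 + j, purely to make the result easy to check
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        A.value[i][i] = 1.0f;
        for (int j = 0; j < MatrixPartitionSize; ++j)
            B.value[i][j] = float(i * MatrixPartitionSize + j);
    }

    MultiplyAndAddMM4x4F(C, A, B);    // C += A * B; with A = identity, C now equals B
    return 0;
}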
Algorithm used for multiplying 4×4 matrices of doubles (using AVX intrinsics)
The C += A × B algorithm:
#include <immintrin.h>

#define MatrixPartitionSize 4

struct Mat44d
{
    double value[MatrixPartitionSize][MatrixPartitionSize];
};

// result += MatrixA * MatrixB for a 4x4 block of doubles
void MultiplyAndAddMM4x4D(Mat44d &result, const Mat44d &MatrixA, const Mat44d &MatrixB)
{
    __m256d rightRow[MatrixPartitionSize];
    __m256d resultRow[MatrixPartitionSize];

    // Load every row of B and of the accumulator into AVX registers
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        rightRow[i] = _mm256_loadu_pd((const double *)MatrixB.value[i]);
        resultRow[i] = _mm256_loadu_pd((const double *)result.value[i]);
    }

    // Row i of the result accumulates A[i][j] * (row j of B) for every j
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        for (int j = 0; j < MatrixPartitionSize; ++j)
        {
            resultRow[i] = _mm256_add_pd(resultRow[i], _mm256_mul_pd(rightRow[j], _mm256_set1_pd(MatrixA.value[i][j])));
        }
        _mm256_storeu_pd((double *)result.value[i], resultRow[i]);
    }
}
Algorithm used for multiplying 8×8 matrices of floats (using AVX intrinsics)
The C += A × B algorithm:
#include <immintrin.h>

#define MatrixPartitionSize 8

struct Mat88f
{
    float value[MatrixPartitionSize][MatrixPartitionSize];
};

// result += MatrixA * MatrixB for an 8x8 block of floats
void MultiplyAndAddMM8x8F(Mat88f &result, const Mat88f &MatrixA, const Mat88f &MatrixB)
{
    __m256 rightRow[MatrixPartitionSize];
    __m256 resultRow[MatrixPartitionSize];

    // Load every row of B and of the accumulator into AVX registers
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        rightRow[i] = _mm256_loadu_ps((const float *)MatrixB.value[i]);
        resultRow[i] = _mm256_loadu_ps((const float *)result.value[i]);
    }

    // Row i of the result accumulates A[i][j] * (row j of B) for every j
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        for (int j = 0; j < MatrixPartitionSize; ++j)
        {
            resultRow[i] = _mm256_add_ps(resultRow[i], _mm256_mul_ps(rightRow[j], _mm256_set1_ps(MatrixA.value[i][j])));
        }
        _mm256_storeu_ps((float *)result.value[i], resultRow[i]);
    }
}
Algorithm used for multiplying 8×8 matrices of doubles (using AVX512 intrinsics)
(N.B.: This algorithm only compiles in Visual Studio 2017, and AVX512 instructions are only supported on Core-X series processors or recent Xeon CPUs.)
The C += A × B algorithm:
#include <immintrin.h>

#define MatrixPartitionSize 8

struct Mat88d
{
    double value[MatrixPartitionSize][MatrixPartitionSize];
};

// result += MatrixA * MatrixB for an 8x8 block of doubles
void MultiplyAndAddMM8x8D(Mat88d &result, const Mat88d &MatrixA, const Mat88d &MatrixB)
{
    __m512d rightRow[MatrixPartitionSize];
    __m512d resultRow[MatrixPartitionSize];

    // Load every row of B and of the accumulator into AVX512 registers
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        rightRow[i] = _mm512_loadu_pd((const double *)MatrixB.value[i]);
        resultRow[i] = _mm512_loadu_pd((const double *)result.value[i]);
    }

    // Row i of the result accumulates A[i][j] * (row j of B) for every j
    for (int i = 0; i < MatrixPartitionSize; ++i)
    {
        for (int j = 0; j < MatrixPartitionSize; ++j)
        {
            resultRow[i] = _mm512_add_pd(resultRow[i], _mm512_mul_pd(rightRow[j], _mm512_set1_pd(MatrixA.value[i][j])));
        }
        _mm512_storeu_pd((double *)result.value[i], resultRow[i]);
    }
}
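These fixed-size routines look like building blocks for larger matrices (the MatrixPartitionSize name suggests as much). To close, here is a hedged sketch of how a bigger matrix stored as a grid of 8×8 blocks could be multiplied with the AVX float kernel from above; the blocked layout and the wrapper function are my assumption for illustration, not something taken from the library:

// Hypothetical blocked multiply: each big matrix is an N x N grid of 8x8 tiles.
// Block C[i][k] accumulates A[i][j] * B[j][k] over j, reusing the kernel above.
template <int N>
void MultiplyAndAddBlocked(Mat88f (&C)[N][N], const Mat88f (&A)[N][N], const Mat88f (&B)[N][N])
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                MultiplyAndAddMM8x8F(C[i][k], A[i][j], B[j][k]);
}

As with the 8×8 kernel itself, the C tiles have to be zeroed before the first call, because every call adds onto whatever is already stored there.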