[HN Gopher] Experiments with Byte Matrix Multiplication
___________________________________________________________________
Experiments with Byte Matrix Multiplication
Author : serge-ss-paille
Score : 27 points
Date : 2025-01-10 15:36 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gok wrote:
| Curious how this compares with, say, the implementation of
| gemm_s8s8s32 in Intel's MKL / oneAPI.
| dkhudia wrote:
| > It's quite common in machine learning operations to multiply a
| matrix of unsigned byte by a matrix of signed byte. Don't ask me
| why, but that's the case.
|
| Overflow is the reason. Intel's vpmaddubsw takes uint8_t and
| int8_t operands and gives you results in int16_t. If both were
| unsigned, 255 * 255 = 65025 would be out of range for int16_t
| (-32,768 to +32,767), so the instruction was likely designed to
| take one uint8_t and one int8_t. With one signed and one unsigned
| operand, the extremes -128 * 255 = -32,640 and 127 * 255 = 32,385
| always fit in int16_t. Overflow (or rather saturation, with this
| instruction) can still occur because it sums adjacent
| multiplications. See my comment in PyTorch.
| https://github.com/pytorch/pytorch/blob/a37db5ae3978010e1bb7...
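To make the saturation point concrete, here is a minimal, hedged C
sketch (not from the linked repo) of the intrinsic behind
vpmaddubsw, _mm256_maddubs_epi16: each unsigned-byte * signed-byte
product fits in int16_t, but the instruction adds two adjacent
products with signed saturation, so the sum can still clamp.
Requires AVX2 (compile with -mavx2).

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* a: unsigned bytes, all 255; b: signed bytes, all -128. */
        __m256i a = _mm256_set1_epi8((char)0xFF);  /* 255 as uint8_t */
        __m256i b = _mm256_set1_epi8((char)0x80);  /* -128 as int8_t */

        /* Each 16-bit lane gets 255*(-128) + 255*(-128) = -65280,
           which saturates to INT16_MIN (-32768). */
        __m256i r = _mm256_maddubs_epi16(a, b);

        short out[16];
        _mm256_storeu_si256((__m256i *)out, r);
        printf("%d\n", out[0]);  /* prints -32768, not -65280 */
        return 0;
    }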
| atq2119 wrote:
| This doesn't feel like a convincing argument. If you wanted to
| multiply uint8 * uint8, you'd naturally use an unsigned
| multiply with a uint16 result. That doesn't overflow either.
|
| I believe a better argument appeals to the structure of neural
| networks. Activation inputs to a matrix multiply come out of a
| non-linear function, and ReLU is a popular choice whose outputs
| are non-negative, so the activations quantize naturally to an
| unsigned type. The weights then need to be signed so that the
| matrix multiplication can produce negative outputs; without
| negative outputs, you would lose the non-linearity of ReLU.
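As a hedged illustration of that structural argument (the values
below are made up for the example, not taken from the article):
after ReLU the quantized activations are non-negative uint8_t,
while the weights stay int8_t so the accumulated result can be
negative.

    #include <stdint.h>
    #include <stdio.h>

    /* Reference dot product: uint8 activations x int8 weights,
       accumulated in int32 (the s32 of a u8*s8 -> s32 GEMM). */
    int32_t dot_u8s8(const uint8_t *act, const int8_t *w, int n) {
        int32_t acc = 0;
        for (int i = 0; i < n; ++i)
            acc += (int32_t)act[i] * (int32_t)w[i];
        return acc;
    }

    int main(void) {
        uint8_t act[4] = {0, 7, 200, 31}; /* ReLU output: never negative */
        int8_t  w[4]   = {-5, 3, -1, 1};  /* weights: can be negative */
        /* 0*(-5) + 7*3 + 200*(-1) + 31*1 = -148: the result is negative
           only because the weights are signed. */
        printf("%d\n", dot_u8s8(act, w, 4));
        return 0;
    }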
___________________________________________________________________
(page generated 2025-01-10 23:01 UTC)