[HN Gopher] Bolt: Faster matrix and vector operations that run o...
___________________________________________________________________
Bolt: Faster matrix and vector operations that run on compressed
data
Author : febin
Score : 81 points
Date : 2022-06-18 18:01 UTC (4 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| raxxorraxor wrote:
| This looks good. Why do the vectors have to be dense? Is it just
| that the overhead would be highest and the speed gain lowest
| otherwise? Just asking whether you could use it universally for
| all operations if I don't know the density.
| anderskaseorg wrote:
| If your data is represented as sparse vectors, that sparse
| representation is already compressed. You wouldn't want to
| decompress it to a dense representation just to apply a
| different, less effective, lossy compression algorithm. That
| would be like taking a screenshot of a paragraph of text so you
| can post a JPEG of it to Twitter.
|
| Oh, wait.
| jansan wrote:
| This sounds and looks impressive, but this part struck me:
|
| "If you ... and can tolerate lossy compression"
|
| What does this mean? I wouldn't have thought that matrix
| operations could be lossy. Does anybody know to what extent they
| are lossy and where this would be acceptable?
| epistasis wrote:
| The matrix itself is the operation: it's a function from R^n to
| R^m, and that function is what matrix compression approximates.
|
| If you are familiar with PCA or SVD, you are already close to
| understanding a basic form of compression. SVD breaks an m x n
| matrix down into an n x n rotation matrix, an n x n diagonal
| scaling matrix, and an m x n loadings matrix. The new matrices
| are usually ordered so that the directions describing the most
| variability come first. So if you keep only the first, say, 5 of
| the n dimensions, you can reconstruct an approximation of the
| original matrix that uses approximately 5/n of the original
| storage.
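|
| A minimal numpy sketch of that truncation (toy code, not Bolt's
| API; a random matrix won't compress well, structured data does):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     A = rng.standard_normal((100, 50))    # original m x n matrix
|
|     # thin SVD: A = U @ diag(s) @ Vt
|     U, s, Vt = np.linalg.svd(A, full_matrices=False)
|
|     k = 5                                 # keep the top-5 components
|     A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
|
|     # stores roughly k*(m + n) numbers instead of m*n
|     err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
|     print(f"rank-{k} relative error: {err:.3f}")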
|
| PCA is often used in machine learning too, and there is a deep
| connection between compression and learning that is hard to make
| explicit or formalize.
|
| The implementation linked here uses Vector Quantization
| instead:
|
| https://en.wikipedia.org/wiki/Vector_quantization
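|
| Vector quantization itself is just "replace each vector with the
| nearest entry of a small learned codebook". A toy k-means sketch
| of the idea (illustration only, assuming numpy; Bolt actually
| uses product quantization over subspaces with a much cheaper
| encoder):
|
|     import numpy as np
|
|     def train_codebook(X, k=256, iters=20, seed=0):
|         # plain k-means: learn k centroids summarizing the rows of X
|         rng = np.random.default_rng(seed)
|         C = X[rng.choice(len(X), k, replace=False)].copy()
|         for _ in range(iters):
|             assign = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
|             for j in range(k):
|                 if (assign == j).any():
|                     C[j] = X[assign == j].mean(0)
|         return C
|
|     def encode(X, C):
|         # one byte per vector when k <= 256
|         d = ((X[:, None] - C[None]) ** 2).sum(-1)
|         return d.argmin(1).astype(np.uint8)
|
|     def decode(codes, C):
|         return C[codes]   # lossy reconstruction from the codebook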
| jansan wrote:
| The furthest I have come was applying an SVD to a 2D matrix
| to find the new axes of a skewed ellipse, but I think I could
| follow for the most part. Thanks.
| Iv wrote:
| In almost all practical uses of matrix multiplication, we have
| rounding errors. For example, in 3D it is hard to reverse
| exactly a rotation and get the exact initial position back.
|
| I don't know what amount of loss we are talking about, but in
| deep learning many operations don't require a crazy level of
| precision, which has led to some lightweight float formats
| (bfloat16 being the most common, but there are also 8-bit floats
| for extreme cases).
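|
| (A quick sketch of why bfloat16 is so convenient, assuming numpy:
| it is just the top half of a float32's bits, so a crude conversion
| is bit truncation; real conversions round instead.)
|
|     import numpy as np
|
|     def to_bfloat16_like(x):
|         # keep sign, exponent and top 7 mantissa bits of each float32
|         bits = np.asarray(x, dtype=np.float32).view(np.uint32)
|         return (bits & np.uint32(0xFFFF0000)).view(np.float32)
|
|     x = np.array([3.14159265, 0.001, 65504.0], dtype=np.float32)
|     print(np.c_[x, to_bfloat16_like(x)])  # ~3 decimal digits survive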
|
| If that's really a 10-100x speed increase at the cost of a bit
| of loss, I am sure machine learning will love it.
| [deleted]
| nynx wrote:
| Wow, this is fascinating. I wonder if hardware could be designed
| to do this really efficiently.
| skohan wrote:
| It already is, right? A GPU is basically a purpose-built linear
| algebra machine.
| mandarax8 wrote:
| From the abstract:
|
| > (In the common case that one matrix is known ahead of
| time,) our method also has the interesting property that it
| requires zero multiply-adds. These results suggest that a
| mixture of hashing, averaging, and byte shuffling---the
| core operations of our method---could be a more promising
| building block for machine learning than the sparsified,
| factorized, and/or scalar quantized matrix products that have
| recently been the focus of substantial research and hardware
| investment.
|
| This is not at all what modern GPUs are optimized for.
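|
| To make the "zero multiply-adds" idea concrete, here is a toy
| product-quantization style sketch in numpy (not Bolt's actual
| algorithm: Bolt/MADDNESS learn the codebooks, use 4-bit codes and
| SIMD byte shuffles; here the prototypes are just random rows and
| the column count is assumed to split evenly, so the approximation
| is crude):
|
|     import numpy as np
|
|     def lut_matmul(A, B, n_sub=8, k=16, seed=0):
|         # Approximate A @ B: quantize each block of A's columns to
|         # one of k prototype sub-vectors, precompute prototype @ B
|         # once, then every output row is table gathers and adds.
|         rng = np.random.default_rng(seed)
|         out = np.zeros((A.shape[0], B.shape[1]))
|         for cols in np.split(np.arange(A.shape[1]), n_sub):
|             X = A[:, cols]
|             P = X[rng.choice(len(X), k, replace=False)]  # codebook
|             codes = ((X[:, None] - P[None]) ** 2).sum(-1).argmin(1)
|             T = P @ B[cols]               # k x p lookup table
|             out += T[codes]               # gather + add, no multiplies
|         return out
|
|     A = np.random.default_rng(1).standard_normal((10000, 64))
|     B = np.random.default_rng(2).standard_normal((64, 8))
|     print(np.abs(lut_matmul(A, B) - A @ B).mean())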
| bee_rider wrote:
| I guess the naive approach, if we wanted to do a quick lossy
| matrix multiply, would be to take the truncated SVD and use that.
| How does this library compare to the boring strategy, I wonder?
| Iv wrote:
| > If you have a large collection of mostly-dense vectors and can
| tolerate lossy compression, Bolt can probably save you 10-200x
| space and compute time.
|
| Space. It can save space.
|
| The main limitation of fast ML models nowadays is how many
| parameters you can load into your GPU memory, and these are
| usually matrices.
|
| 200x would allow me to run GPT-3 on my old GTX 1050.
|
| Frameworks, please implement this NOW!
| cgreerrun wrote:
| Maddness is their more recent work and yields 100x speedups:
| https://arxiv.org/pdf/2106.10860.pdf
|
| The code for Maddness is in the same github repo if you search
| for "Mithral".
|
| SIMD instructions can work wonders in the right context.
| skohan wrote:
| It's incredible that there's actually this much room to
| improve. How does this compare to GPU implementations?
|
| Also, it looks like the optimization comes from running
| operations on a compressed representation. For the 10x vs 100x
| speedup, is there a tradeoff between speed and accuracy, or is
| that extra order of magnitude just from bringing SIMD into the
| picture?
| jacobolus wrote:
| From what I can tell, this is a machine learning based
| approximation to matrix multiplication by a particular matrix
| (which it was trained on). It trades accuracy for speed. If
| you need to multiply many (many!) vectors by a static matrix
| and you have loose enough error tolerance, this can provide
| up to 100x speedup.
| Iv wrote:
| This is actually from a paper published last year:
|
| https://www.reddit.com/r/MachineLearning/comments/pffoo8/r_m...
|
| A few questions:
|
| - Do some ML frameworks implement it already?
|
| - It promises up to 200x compression; is it reasonable to expect
| it to allow us to run GPT-3 on smaller mainstream GPUs?
___________________________________________________________________
(page generated 2022-06-18 23:00 UTC)