[HN Gopher] Show HN: Speeding up LLM inference 2x times (possibly)
___________________________________________________________________
Show HN: Speeding up LLM inference 2x times (possibly)
Here's a project I've been working on for the last few months.
It's a new (I think) algorithm that lets you adjust smoothly -
and in real time - how many calculations you'd like to do during
inference of an LLM model. It seems possible to do just 20-25% of
the weight multiplications instead of all of them and still get
good inference results. I implemented it to run on the M1/M2/M3
GPU. The mmul approximation itself can be pushed to run 2x faster
before the quality of the output collapses. The inference speed is
only a bit faster than llama.cpp's, because the rest of the
implementation could be better, but with further development I
think it can become a new method to speed up inference - in
addition to quantization. You could call it ad-hoc model
distillation :) You can change the speed / accuracy of a model at
will, in real time. Oh, and as a side effect, the data format also
lets you choose how much of the model you want to load into
memory. You can decide to skip, say, 10-20-40% of the least
important weights. It's implemented for Mistral, and it was also
lightly tested on Mixtral and Llama. It's FP16-only for now, but
Q8 is in the works. The algorithm is described here, and the
implementation is open source.
https://kolinko.github.io/effort/ I know these are bold claims,
but I hope they survive the scrutiny :)
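
To make "20-25% of the weight multiplications" concrete, here is a
minimal reference sketch of the idea (plain numpy, exhaustive and
slow - the function and variable names are made up for illustration
and are not from the repo): keep only the largest |v_i * W_ij|
products and skip the rest.

  import numpy as np

  def topk_products_matvec(v, W, effort=0.25):
      # reference-only sketch: keep the largest `effort` fraction of the
      # individual products |v_i * W_ij| and skip everything else
      scores = np.abs(v)[:, None] * np.abs(W)
      k = max(1, int(effort * scores.size))
      flat = np.partition(scores, scores.size - k, axis=None)
      cutoff = flat[scores.size - k]
      return ((v[:, None] * W) * (scores >= cutoff)).sum(axis=0)

  rng = np.random.default_rng(0)
  v, W = rng.standard_normal(1024), rng.standard_normal((1024, 1024))
  exact, approx = v @ W, topk_products_matvec(v, W)
  print(np.dot(exact, approx) / (np.linalg.norm(exact) * np.linalg.norm(approx)))

The real implementation never computes all the products like this -
the whole point of the writeup is getting the same effect while only
reading the weights that actually end up being used.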
Author : kolinko
Score : 391 points
Date : 2024-04-17 17:26 UTC (1 days ago)
(HTM) web link (asciinema.org)
(TXT) w3m dump (asciinema.org)
| kolinko wrote:
| Here to answer questions!
|
| A friend of mine has published the original link on HN before me,
| so I hope a double post won't be an issue :)
| dang wrote:
| We've merged them. (other one was
| https://news.ycombinator.com/item?id=40067489)
| kolinko wrote:
| Thx! Although I'm happy you took your time and now I can brag
| until the end of time that I got 2 out of 5 top slots for a
| moment :D
| dartos wrote:
| Thank you for this really cool and open contribution!
|
| I will be watching llama.cpp closely for them to implement this!
|
| I've been looking for ways to speed up CPU inference and I really
| like this idea of "effort"
| kolinko wrote:
| Hahah, thanks! It was a marathon to develop this, and I'm
| glad it reached the front page.
|
| The name was proposed by chatgpt :) It claims it doesn't
| recognise this approach - so there is a chance it's really a
| new thing.
|
| I want to reach out to llama.cpp and the others - I hope it
| gets implemented. I considered just writing a patch for llama,
| but C++ and the scale of that project were beyond me.
|
| As for CPU inference - it should speed it up just as well. But
| thanks to the fact that it can load up a fraction of weights
| (e.g. just 70%, skipping the least important ones), it should
| be possible now to run models on less VRAM than before (still,
| Q8 needs to be implemented though).
|
| Funnily enough - when I tried comparing benchmarks to llama.cpp, I
| couldn't find speeds for 7B/FP16 on MB Air 16GB, because it's
| impossible to run with regular methods. It is possible with
| Effort.
|
| Ditto - I was running a full-resolution, but cropped, Mixtral on
| my 96GB M2, even though it usually takes 114GB of RAM. I just
| loaded 75% of the weights, and it was working smoothly (before I
| messed something up in the implementation and it now produces
| crap output - needs a fix).
| dhruvdh wrote:
| I would imagine the importance of weights depends on the
| prompt. How do you decide which weights are important?
| kolinko wrote:
| Yeah, that is the point more or less - it dynamically chooses
| the weights layer by layer depending on the internal state.
|
| A slightly technical explanation is here:
| https://kolinko.github.io/effort/equations.html
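|
| Roughly, the dynamic part looks something like this (a hedged
| sketch in numpy, not the repo's code; `probes` here stands in
| for one representative weight magnitude per input dimension,
| which is a simplification of what the page describes):
|
|     import numpy as np
|
|     def choose_cutoff(v, probes, effort=0.25):
|         # estimate how big each input dimension's products will be,
|         # then pick a threshold so roughly `effort` of them survive
|         expected = np.abs(v) * probes
|         k = max(1, int(effort * len(expected)))
|         return np.sort(expected)[-k]
|
| Products whose estimated magnitude falls below that cutoff get
| skipped, so a different input vector skips a different subset of
| the weights.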
| indymike wrote:
| > It is possible with Effort.
|
| "All things are possible with enough effort." -- Dad.
| kolinko wrote:
| Hahaha :)
| 0x4139 wrote:
| Implementing this approach could significantly enhance the
| adoption of LLMs within mobile phone libraries and other
| compact devices. I highly recommend opening an improvement
| issue for llama.cpp.
| avereveard wrote:
| we need this in llama.cpp - it seems somewhat stable down to 40%
| effort
| kolinko wrote:
| Yes! I hope, if it's proven, that it will be implemented into
| the main inference engines.
|
| 40% effort is only a bit faster than a full base
| multiplication, but I hope both the speed and the accuracy
| could be improved further.
| toisanji wrote:
| This sounds related: https://arxiv.org/abs/2312.12456
| https://github.com/SJTU-IPADS/PowerInfer
| kolinko wrote:
| Similar theme, but they skip whole neurons; in my case it goes
| down to the level of single weights.
|
| From my experiments, skipping whole neurons (so whole
| rows/columns of matrices) didn't give such good results.
| In my case 30-50% of neurons are skipped entirely at 15% effort,
| but the rest are still used partially.
|
| There are a few papers on a similar theme that a friend sent me
| this morning - I plan to add them to the citations section.
| toisanji wrote:
| awesome, looking forward to seeing how your results perform.
| I tested powerinfer on smaller models and didn't see large
| performance gains.
| bigcat12345678 wrote:
| """ So instead, let's flip the matrix, sort the elements row-
| wise, and revisit the multiplications from that direction.
|
| This is called a Compressed Sparse Row (CSR) format by the smart
| people. To do the multiplication now, we take, say, the 1 from
| the vector, multiply it by 256, and add it into the output vector
| at the 3rd row. And so on.
|
| Now, let's see what happens if we truncate the last column - the
| one with the lowest values. """
|
| How does CSR work with reduced-number multiplication?
| kolinko wrote:
| Can you rephrase the question? Not sure I get it.
| bigcat12345678 wrote:
| I could not tell from the text how multiplication with the CSR
| format works in the context of the optimization.
|
| The key missing piece for me is how to utilize the CSR
| format to find the numbers to multiply - in other
| words, how does the CSR format help with picking the numbers to
| multiply with the vector?
| kolinko wrote:
| Ah, I get the question now.
|
| If I understand CSR correctly, it stores indexes and values
| as a list. I store them as a list too - that's why the
| comparison is there. The difference is that with CSR you
| store, say, 15% of the values from a given row. I store all
| of them, but use only the first X% of them. The X% varies
| and depends on the input vector.
|
| They are stored sorted, from the one with the highest absolute
| value to the one with the lowest.
|
| It's after midnight, so my explanations may not be too good
| right now, but I hope the pseudocode on the page and the
| examples explain it slightly better.
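|
| In numpy-ish terms, something like this (a rough sketch of the
| layout as described above, not the repo's actual code; `W` is
| assumed to be used as `W @ v`):
|
|     import numpy as np
|
|     def precompute(W):
|         # keep every weight (unlike CSR), but sort each row of W.T by
|         # |value|, descending, remembering the original column index
|         Wt = W.T                           # one row per input dimension
|         order = np.argsort(-np.abs(Wt), axis=1)
|         W_val = np.take_along_axis(Wt, order, axis=1)
|         return order.astype(np.int32), W_val
|
|     def approx_mul(v, W_idx, W_val, cutoff):
|         out = np.zeros(W_val.shape[1])
|         for i, vi in enumerate(v):
|             # rows are pre-sorted, so stop as soon as |vi * w| < cutoff
|             n = np.searchsorted(-np.abs(vi) * np.abs(W_val[i]), -cutoff)
|             out[W_idx[i, :n]] += vi * W_val[i, :n]
|         return out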
|
| I'll be fixing grammar / typos, and asking ChatGPT to
| rewrite the page text for me tomorrow to make it more
| readable :)
| gcy wrote:
| Could you explain the pseudocode on your equations page? Is the
| second approxMul call a typo (also the capitalized V)?
|
|     def precompute(W):
|         W = W.T
|         probes = get_probes(W)
|         W_idx, W_val = sortMatrixRows(W)
|
|     def approxMul(v, W_idx, W_val, probes):
|         cutoff_chart = v * probes
|         cutoff = topK(cutoff_chart, effort)
|         approxMul(V, W_idx, W_val, cutoff)
| kolinko wrote:
| Oh, thanks - I fixed it. No idea what I meant there originally.
|
| There are still a few typos on the page, I'll be fixing them
| tomorrow - it's midnight now, and my mental batteries are
| slowly drying out :)
| mijoharas wrote:
| Did you deploy your changes? I just refreshed and that is
| still there on the page[0].
|
| [0] https://kolinko.github.io/effort/equations.html
| queuebert wrote:
| Please, please, please call the final product Halfwit.
|
| Seriously though, this is a very interesting biologically
| inspired idea, since not all neuronal pathways fire all the time.
|
| It seems to follow that, if you can predict which weights you
| won't need, then you should be able to compress the model
| architecture permanently, at least for certain use cases.
| kolinko wrote:
| Haha halfwit! I'm waiting for such a fork.
|
| As for predicting the weights - not necessarily. It seems
| most weights are being used, just not all the time. Kind of
| like that saying that humans use just 5% of their brain -
| perhaps they do, but it's a different 5% at different times.
|
| Interestingly, Effort works just as well on MoE, if not better.
| I did most of the development on Mixtral and I think it can go
| down to even 15-20% effort before losing quality, but there is
| some sort of bug right now that prevents inference on Mixtral.
|
| It's on a todo to fix, but I didn't want to delay the release
| because of it.
| HPsquared wrote:
| Halfweight. Half the weights, half the wait, half the wit.
| LorenDB wrote:
| Effortless would be another great name (since you are literally
| reducing effort to get speed). OK, maybe not "great", but "an
| option if you're going for puns".
| coolvision wrote:
| how does it compare to 8-bit/4-bit quantization in terms of
| speed/accuracy?
| kolinko wrote:
| Hard to say for now - I'm curious as well, but I used simpler
| tests so far because of the implementation issues - most test
| suites are geared towards testing models and not model
| implementation.
|
| I didn't want to wait any longer with the release, but better
| tests will be coming soon I hope. Anecdotally, I think 30%
| effort should be comparable to Q8.
|
| More importantly, this algorithm should work on top of Q8. The
| quality is not yet certain though - I could use help with the
| implementation.
| kanwisher wrote:
| Could you break down a bit more why you can skip so many
| calculations?
| kolinko wrote:
| It's explained in detail here:
|
| https://kolinko.github.io/effort/equations.html
|
| Long story short - I kind of sort them and pick only the top %
| that would give the highest result.
|
| One part is choosing them - I think this was done before
| in some papers. But the second part was an implementation of
| multiplication that is efficient on both GPUs and CPUs when
| choosing weights almost at will.
|
| It's all explained on the site, but I just got feedback that it
| may not be easy enough to read, so I'll push it through GPT for
| grammar fixes soon :) It's also a bit complicated as an
| algorithm.
| throwaway2562 wrote:
| Hmmm. Very interesting. I wonder if you could speed up your
| clever approximation still further with this approach
| https://arxiv.org/pdf/2205.09120.pdf
| kolinko wrote:
| Nah, sadly this most likely will not work with Q1/Q1.5
| implementations - initially I was playing with Monte Carlo
| approximations (before I arrived at bucketMul), and the
| convergence was very slow for binary/ternary networks.
|
| Or, in simpler terms - if you have just ones, zeroes,
| and minus ones, you can remove the zeroes from the calculations,
| but that's it. There's no good method to figure out which ones
| are more important than the others.
|
| Also, there are no bits left to store positional
| information when bucketing.
|
| There are some paths that could be explored in this
| fashion, but it would require a redesign of the algorithm
| from the ground up.
| punnerud wrote:
| A bit like calculating the fastest route for a car: you
| probably don't need to calculate the distances for the opposite
| side of the Earth if you will not drive there. Then multiply that
| by a billion, but the optimization still holds.
| kolinko wrote:
| Nice analogy, thanks! Yes - you still need a map, because one
| day you need this road, another day you need info about
| another road, but you're not using info about all the roads
| all of the time.
| saurik wrote:
| Also being discussed at:
| https://news.ycombinator.com/item?id=40067489
| byyoung3 wrote:
| I think this is the overall idea behind the MoE LLM models right?
| MoE just expands upon this idea by learning which sets of weights
| to use
| kolinko wrote:
| I'd say that this expands on MoE really - MoE dynamically
| chooses which groups of weights may be needed, but it's
| whole matrices. Here it's single weights.
|
| Also, this works on top of MoE beautifully - most of the
| development and testing was done on Mixtral and it was getting
| (anecdotally) even better results - getting down to 15-18%
| effort before seriously losing quality.
|
| I decided to release the Mistral version, but Mixtral was fully
| operational a few commits back :)
|
| Also, the cool thing - because you can load only the top, say,
| 70% of weights, I was running Mixtral at full precision on my
| 96GB MacBook - there were no benchmarks for this in other
| implementations because others need to load the full model into
| memory.
|
| The real question is Q8 performance - I didn't implement it
| fully so far.
| anentropic wrote:
| Could you add a git tag for the last commit where Mixtral was
| working?
| huac wrote:
| It also feels similar to mixture of depths
| (https://arxiv.org/abs/2404.02258).
|
| Being able to apply this post-training is pretty cool though,
| makes it easier to use across a wider range of setups.
| kolinko wrote:
| Yes! I like that, and I saw that paper last weekend iirc. I
| think these MoD/MoE and other similar methods are highly
| compatible, and in a similar style.
|
| I was originally afraid that this method wouldn't be
| compatible with MoE and the other methods, but fortunately,
| at least for Mixtral, there seems to be an amazing synergy.
|
| By the way, other tasks have higher priority now, but there
| is an interesting observation about MoE. In MoE you get two
| experts chosen, and each expert has a different weight
| attached to it - e.g. expert 1 has a 75% weight, and expert 2
| has a 25% weight. Perhaps this could allow scaling the effort
| to give 75% effort to one expert, and 25% to the other. There
| are some issues there due to the non-linearity of the layers,
| but perhaps there is something to it.
| gsuuon wrote:
| Nice writeup! Very curious about the performance per VRAM with
| this compared to just quantization. Any plans to implement a
| cross platform version?
| kolinko wrote:
| Per VRAM - not much better, because it still uses all the
| weights, just not all the time.
|
| I mean - it can also load fewer weights, but quality seems to
| degrade quickly after offloading more than 20-30% of the weights.
|
| In other words - this algorithm decouples inference time from
| VRAM use.
|
| Having said that, I'm curious as well if using effort you can
| get better results on Q8 cropped to 75% than on Q6.
|
| But it's still probably a few weeks to get the implementation
| polished enough to be well tested.
| AnthonyMouse wrote:
| > Having said that, I'm curious as well if using effort you
| can get better results on Q8 cropped to 75% than on Q6.
|
| This is what I wanted to ask. This seems like the same _kind_
| of optimization as quantization, sacrificing a bit of quality
| for performance by discarding some data. So then the question
| is, which is better, and how do they combine?
|
| You could even potentially get different results at different
| points in the scale. Maybe Q8 cropped to 75% isn't better
| than Q6 but Q4 cropped to 75% is better than Q3, or vice
| versa.
| kolinko wrote:
| Below Q8 I think it will combine poorly - bucketMul needs
| to keep a few bits for an index. It can be offset by the
| fact that the numbers from each bucket are from a limited
| range. My intuition is that with Q8 this trades off roughly
| evenly - meaning bucketed Q8 may store as much info as
| regular Q8, but it will be more difficult with lower
| quantizations, and cross the boundary of impossible after
| Q5. I don't have the math to back it up, but perhaps some
| information theoretician could pick up the calculations.
|
| You could say that these are divergent paths for future
| development (if the results hold for Q8) - _perhaps_ you
| can crop the Q8 models to Q6 sizes and the inference speeds
| of Q2/Q4. On the other hand, the _wildly optimistic_ scenario
| is that the bucketMul speeds will overcome even Q1, with
| dynamic computation pruning - token by token and layer by
| layer, by having a separate small network that chooses
| effort levels based on the input data (someone already
| proposed it in this thread).
|
| For now, the most important thing is fixing the bugs, doing
| more serious tests for FP16, and showing how the charts
| look for Q8. The Q8 especially is the biggest unknown,
| although the initial results are hopeful.
| brrrrrm wrote:
| do you have a simple python impl? :)
| kolinko wrote:
| It originally started as a fork of Recmo's cria, a pure numpy
| llama impl :)
|
| https://github.com/recmo/cria
|
| Took a whole night to compute a few tokens, but I used it to do
| the first tests.
|
| Also, my friend pasted the paper into Claude and it produced a
| working basic impl instantly :D
|
| But in all seriousness - I think an MLX implementation would be
| doable, or a wrapper for the Metal GPU functionality.
| kwikiel wrote:
| Will share a Python implementation soon, as a kind of executable
| pseudocode which can then be ported to any platform.
|
| This project is kind of like the ultimate nerdsnipe - the math is
| quite simple, you don't need a PhD to understand it, and
| actually implementing things would teach you linear algebra
| faster than just mindlessly doing exercise sets.
| kolinko wrote:
| Haha yes :) Publish it, Kacper!
|
| The project is a nerdsnipe for math geeks, because there
| are multiple small things that beg to be proven / described
| by math there. For example - what's the tradeoff between
| the number of bits we lose by embedding the position vs. the
| bits of information we gain by knowing which bucket a
| weight belongs to?
|
| In other words - is it possible that when storing weights
| in the bucketed form we can actually end up having a higher
| precision than using a regular form? For Q8 we get just 4
| bits to store the weight (and 1 bit for sign, and 3 bits
| for location), but these 4 bits need to express numbers
| from a smaller range than before.
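|
| Just to make the bit accounting concrete - the layout below is
| only a guess at one way such a byte could be packed, not the
| repo's actual format:
|
|     def pack_bucketed_q8(sign_bit, position, magnitude):
|         # 8 bits per weight: 1 sign bit + 3 bits of in-bucket position
|         # + 4 bits of magnitude; the magnitude only has to cover the
|         # narrow range that its bucket spans
|         assert 0 <= position < 8 and 0 <= magnitude < 16
|         return (sign_bit << 7) | (position << 4) | magnitude
|
|     def unpack_bucketed_q8(b):
|         return (b >> 7) & 1, (b >> 4) & 7, b & 15
|
|     assert unpack_bucketed_q8(pack_bucketed_q8(1, 5, 12)) == (1, 5, 12)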
| brrrrrm wrote:
| Not sure why I got downvoted for asking this. Seems like a
| reasonable thing in the ML space.
|
| Anyway, here's a numerical simulation written in PyTorch for
| those who want to consider this algo for their projects before
| fully integrating:
| https://gist.github.com/bwasti/78d7058aad7f42dc893d906c877b7...
|
| happy hacking
| kolinko wrote:
| If you got downvoted, it was briefly. That was a good
| question. PyTorch/numpy is awesome for initial testing.
| xrd wrote:
| My takeaway is that this proves what was said on the recent
| latent.space podcast with David Luan from Adept.
|
| https://www.latent.space/p/adept
|
| "I think is going to commoditize a lot of the regular LLMs and
| soon regular multimodal models."
|
| In other words, if you train your own models, you will not get to
| take advantage of breakthroughs like this that start with open
| models (like Mistral).
|
| All the advantages are going towards the open models and this is
| an existential risk for OpenAI and other closed model companies.
| quadrature wrote:
| Maybe, but there's nothing that stops OpenAI from stealing these
| tricks.
| refulgentis wrote:
| It's extremely unlikely anyone will take any of this.
|
| Quick take:
|
| - it was 15x slower than llama.cpp when I used Apple's new
| proprietary ML framework on my MacBook
|
| - So I made it possible to skip arbitrary amounts of work.
|
| - I identified an arbitrary tradeoff that seems arbitrarily
| good to me.
|
| - I've confirmed this by making GPT-4 write some prompt with
| questions. Then I had the normal version answer, and the
| "skip arbitrary work" version answer, and it LGTM.
|
| - So I threw it up on GitHub, then on HN with a title "LLM
| inference 2x faster (possibly)", and people missed: [on my
| laptop] [in the ML framework I'm forcing myself to use]
| [based on an eval I made up] [based on an eval where I am
| the evaluator]
|
| This *really* shouldn't have the title it does, very
| misleading.
|
| Author, please feel free to correct me, I'm sorry for not
| taking the time to find a gentler way to communicate this. I
| hope you kick ass. You did put possibly in a parenthetical,
| but it's carrying the weight of the world here - people just
| see LLM 2x faster. That's why everyone is spinning off into
| grand speculation land, which I also see you valiantly
| commenting to dissuade
| kolinko wrote:
| The whole inference is slow, but it's the matrix
| multiplications that count. They work reliably on all the
| MacBooks that I tested - at 50% effort it's the same speed
| as state-of-the-art matrix multiplications, at 25% they
| are twice as fast.
|
| Apple's MPS matrix multiplications are comparable in speed
| to Llama.cpp and the other implementations. When I was doing
| tests, I was comparing the Llama.cpp benchmarks (
| https://github.com/ggerganov/llama.cpp/discussions/4167 )
| to Apple's MPS - they match very closely. And then I was
| comparing Apple's MPS to my results.
|
| Even if the end results were to show that the models somehow
| break (which they might on Q8), there is no other
| implemented method right now that would give you such
| speedups with matrices at 25% sparsity. The usual methods
| break even with full matrix multiplication around the 15%
| mark, and show speed improvements under 10% (as far as I
| know, but I'm new to the field, so I'm waiting to be corrected).
|
| As for the other metrics - I hope to get help from the
| community to get the implementation done properly. So far
| it's been 3 months of work 12 hours a day - even during
| Easter - to get this version going. It is as far as I can
| push it without the community support, which I'm happy I
| received over the last hour.
|
| Also, I'm not sure what you'd expect, really. A full
| production-ready system on day one? From a solo
| developer? Seriously? :)
|
| Let's get the flame war going! :D
| refulgentis wrote:
| Nah it's good work, you'll be able to flame me in a
| couple weeks...months?..too when I ship my yawn-worthy
| yet another llama.cpp / OpenAI wrapper. :p
|
| I'd love this knob, particularly in llama.cpp - inference
| is a bit too slow on Android, 6 tkn/s for 3B. I just can't
| stand it when people don't actually read anything but the
| title, and go crazy overboard - like, how are we in a
| thread where people are like "oh this confirms local
| models will definitely win like I heard on a podcast" and
| "big bad OpenAI will steal this".
| kolinko wrote:
| Hahah thanks, although I was hoping for a flame to get
| adrenaline flowing to push through the night :D
|
| I also hope there will be an extra knob - or more like
| knobs, because effort can be regulated smoothly layer by
| layer, token by token, matrix by matrix. Think more like
| an equalizer, not a volume control :)
|
| The biggest question right now is how (if) it will
| perform with Q8 and with smaller models. The risk is that
| the quality dropoff will show up closer to 40-60% at Q8,
| negating the performance gains.
| xrd wrote:
| I think you are commenting on the "stealing" reply and not
| my original comment. And, I think you are making my point
| stronger.
|
| OpenAI could easily (and will) put out a blog post or tweet
| saying "next models do inference 2.5x faster!" Kolinko did
| that, or maybe he didn't and someone else put words in his
| mouth. I don't really care: I can validate and test your
| comments here (and they are great!) and I can try his
| experiments myself.
|
| I cannot do that against "GPT-5-2.5x faster (c) 2024"-42B
| (because it isn't released publicly yet). Putting a paper
| and some vague ideas on arXiv isn't really doing much these
| days except adding to the confusion. Truly open work like
| kolinko's is really exciting and feels like it can
| only be done against truly open models like Mistral.
|
| Oh wait, Mistral isn't fully open either (ducks...).
| observationist wrote:
| There used to be a class of software called freeware -
| you could download and use it without restriction, you
| just couldn't resell it, or have the source to modify it.
| Llama and similar models are like freeware - an
| inscrutable binary blob crafted to work with other
| software, except instead of a VM or native OS
| environment, you have llama.cpp or similar software that
| runs the AI model.
|
| Mistral is open source, in that you can do anything the
| Apache license allows you to do, even package it into
| your own product and resell or modify it. We're missing
| the dataset details, the source to the software that
| produces the model, similar to not having access to an
| operating system and special compiler software. That's
| not a huge deal, because people don't have the resources
| to make use of those large datasets or the Mistral
| training software, which is likely highly tailored to
| their own training and development pipeline, and wouldn't
| do much good for anyone without at least a pod of A100's
| of their own.
|
| Weights available and other terms are being thrown
| around, and Meta and the like are calling their stuff
| "open" but that use of the term bears little resemblance
| to the use of the word by the open source community.
|
| The public Mistral models have open source licenses. The
| model can be used like open source software. The terms
| are permissive and free, requiring only attribution.
| Meta's license scheme is novel and not open, with
| arbitrary lawyerese and absolutely, 100% will bite
| someone in the ass when the threshold between "annoying
| to sue" and "profitable to sue" gets exceeded by someone
| using Llama in a way that's technically incorrect. Right
| now, Meta wants the goodwill more than they want a couple
| million dollars chasing a couple dozen startups.
|
| If the model doesn't have an open source license, it's
| not open. It might be freeware. Llama is freeware. You
| can, technically, do whatever you want to it, but try to
| not attract too much notice or be too successful with it.
|
| Mistral, by using Apache licensing, couldn't go after you
| even if they wanted to, unless you do something
| deliberately stupid.
| oceanplexian wrote:
| > If the model doesn't have an open source license, it's
| not open.
|
| Actually, OSS comes with tons of strings attached that
| make the term open dubious. And there are many ways they
| could come after you legally. Apache, GPL, etc. all have
| terms and conditions - you have to contribute back X, Y, and
| Z, agree to our manifesto, and so on.
|
| The only truly free license is MIT. Go build a billion
| dollar business, change it however you want with no
| strings attached and you can express the license terms in
| a single paragraph.
| observationist wrote:
| Apache is stricter because it requires you to list all
| the modifications you make to the original software. A
| copy of the license, with copyright and attribution
| notices, must be included in the source code and any
| modifications. A specific section on contributions
| outlines the conditions for accepting external
| contributions to the project. The MIT license requires a
| copyright notice, but it does not have explicit
| requirements for the license or attribution notices.
|
| source: https://www.mend.io/blog/top-10-apache-license-
| questions-ans...
|
| Apache 2.0 is almost as permissive as MIT, and better
| suited for some cases. I love the MIT license, as it
| allows the most freedom for users and creators, across
| the board. Apache 2.0 is second best, but might be better in
| a formalized organization that wants more structure and
| formalized documentation requirements.
| littlestymaar wrote:
| Exactly, that's the problem with the current state of things
| with open models: the players that keep their secret sauce
| keep an edge over the people doing things in the open, while
| benefiting from all of their work without contributing back.
| kolinko wrote:
| That was the claim with a lot of the software in the past,
| but open source won in many places in the end.
| iwontberude wrote:
| At the end of the day, if their profit margins aren't
| good, it doesn't matter whether their competition is open
| source or not (which is often where OSS wins). I think we
| are seeing that AI isn't the slam dunk for increasing
| productivity or we would see companies like UIPath being
| profitable. I don't think we've seen anyone net a profit
| on AI software and the only company that has been
| investing since at least 2017, Apple, gets zero credit
| for their contributions and commercialization of the
| tech. I think about how Amazon abandoned the AI-powered,
| checkout-free tech because the margin of error stayed
| stubbornly high for too long. The clock is ticking on the
| industry and some players, like Apple, already have found
| it isn't profitable (well to their standards of 60%
| return on investment).
| kolinko wrote:
| Depends on the field.
|
| The project from the thread would take me an impossible
| amount of time without GPT. Even the page itself would
| take me twice as long to generate - the charts were done
| by pasting the source data into GPT and having it write
| matplotlib code for me to chart them, and the equations were
| originally written by GPT as well, because I wasn't
| familiar with MathJax.
|
| Ditto with the code - a bunch of it was written by GPT
| originally, since this is my first Swift/Metal project. I
| kept telling it what I wanted to do in Python, and it kept
| rewriting it in Swift/Metal until I learned the latter.
|
| The name "effort" was also invented by GPT. Originally,
| internally, I was using "quant" but that would be
| confused with quantization. I considered "perc" from
| percentage - but that's ugly. I described the project to
| GPT, and it suggested "effort" as a metric.
|
| As for self-checkout - in Poland we have Zabka Nano, which
| is still going, and seems more solid than Amazon's, but
| of course time will tell :)
| littlestymaar wrote:
| > but open source won in many places in the end.
|
| The biggest companies in the world are running closed-
| source software for profit that uses open source
| foundations while barely contributing back, so it's really
| not the counter-argument you think it is. And it's no
| wonder we're seeing open source companies going for
| source-available licenses now (Redis, HashiCorp) or other
| kinds of restrictions (Red Hat), because they were
| helpless against the parasitic behavior of the big bad
| wolves.
| Hugsun wrote:
| In these fast moving early times of LLMs, they can maintain
| this advantage with simple things like proprietary datasets
| and greater compute.
|
| The difference in quality between the best model that can
| be made with such proprietary utilities and without is
| likely to decrease over time as open datasets of greater
| quality are published and the field matures.
|
| The difference in quality and number of competitors
| ultimately pays the bills and the harsher the competition
| is, the less money there will be, for each individual
| company, to maintain their possibly dwindling proprietary
| edge.
|
| The greater access to compute is an edge companies will
| likely hold for a while. It will be interesting to see how
| much open models will be able to catch up and how great of
| an edge proprietary models will maintain.
| kolinko wrote:
| In this case though, the algorithm should be just as useful to
| closed models as to open models. There is nothing there that is
| optimised specifically for Mistral - aside from the hard-coded
| dimensions in multiple places in the code :D
|
| Having said that, it's awesome to have open source models out
| there, and I hope they ultimately win.
| Havoc wrote:
| That seems fundamentally flawed to me. Closed providers can
| copy techniques from open models but not vice versa.
|
| To me that reads as "closed will always be a step ahead", not as
| an existential risk to OpenAI.
| globally_unique wrote:
| That looks like fantastic stuff. I just want to point out that
| the 15ms delay looks similar to 60Hz vsync (16.7ms) - if you are
| updating the screen once per token, maybe that's causing a sync
| somehow?
| kolinko wrote:
| Nah, that's not it, I measure the CPU & GPU work separately,
| and 15ms happens between the kernel invocations. It also
| happens when I don't print out text.
|
| Thanks for the idea though! I treat it as the first community
| contribution :D
| bick_nyers wrote:
| I'm wondering if you could sort the two inputs, add the indices
| for the multiplications together, then take the largest of that.
|
| In your 0.1 example, 1000 gets index 2, and 0.1 index 0, which
| combines to 2. This will tie with the 1*8, but I think it would
| even out with larger vector lengths.
|
| Edit: I could be wrong but I think you can precompute the indices
| for the weights in advance without a prompt, then you won't need
| to perform those sorts at runtime.
| kolinko wrote:
| That was one of the ideas I was also considering originally!
| Don't remember why I skipped it though - may have been because
| I tried the published one and it worked well enough during the
| first tries.
|
| As for indices for the weights - not sure if I get what you
| mean, but the weights are sorted as part of the precomputation.
| Sorting is not done at runtime, because that would kill any
| efficiency - the moment you need to read all the weights,
| you've lost, because it's the memory reads, not the
| computations, that matter. If you can skip one memory read by
| doing 5-10 multiplications, it's a good tradeoff.
| marmaduke wrote:
| Having used CSR, it's not surprising, and some newer formats like
| block ELL might have more mechanical sympathy, since they avoid
| uncoalesced reads / gathers, though the code is trickier.
| kolinko wrote:
| Oh, nice to finally bump into someone who has experience with
| CSR!
|
| bucketMul has few uncoalesced reads, and it uses a different
| data structure than regular CSR - it's described here:
| https://kolinko.github.io/effort/bucketmul.html It splits each
| matrix row into 16 parts, and chooses which ones are necessary
| to read. The writes are fully linear.
|
| Not sure if I'm making sense though, it's getting a bit late
| today, and it's been a long day ;)
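|
| The per-row decision reduces to something like this (my rough
| CPU-side reading of the writeup, not the actual Metal kernel;
| `bucket_heads` is a name made up here):
|
|     import numpy as np
|
|     def buckets_to_read(v, bucket_heads, cutoff):
|         # bucket_heads[i, b]: largest |weight| in bucket b of (pre-sorted)
|         # row i, so the buckets of a row are themselves in descending order.
|         # A bucket is only worth reading if even its head value could
|         # produce a product above the cutoff.
|         return (np.abs(v)[:, None] * bucket_heads >= cutoff).sum(axis=1)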
| marmaduke wrote:
| Yeah, nice, you could also consider a low rank factorization
| as well.
| uhlo wrote:
| Okay now add a small model that decides how much effort is needed
| in each inference step and we are good to go
| kolinko wrote:
| Yes! That would be awesome. Especially since there are ~32*6
| independent effort settings for every single token.
|
| I tested the most basic implementation, with a flat effort
| setting for all the muls, but I bet the results could be pushed
| even further with such an approach. Or even with just doing
| some ML to figure out which layer/matrix needs more and which
| less effort.
| uhlo wrote:
| Great work! One thing: it seems the hugging face link doesn't
| work... I get a 404
| kolinko wrote:
| It should work now :)
| mentos wrote:
| If this is within reach of doing, it sounds like the joke
| might be worth exploring?
| spencerchubb wrote:
| I love this line in the gpu implementation section.
|
| "Readers fresh to GPU programming may ask now - how does it work?
|
| Readers experienced with GPU programming may ask - how the hell
| does it work?"
| kolinko wrote:
| Haha thanks! :) As far as I understand, I had to implement the
| memory reads and some other things the opposite way to what is
| considered the proper approach.
|
| Would love to have that code reviewed by someone who actually
| knows stuff about Metal - this is my first GPU programming
| attempt.
| smcleod wrote:
| Looks like the models are missing on Huggingface, I've logged an
| issue: https://github.com/kolinko/effort/issues/3
| kolinko wrote:
| Ah yes, forgot to make the repo public. Thanks a ton for filing
| the issue and writing the comment - I'd have missed it if you
| hadn't pointed it out on HN.
| a2code wrote:
| If it sounds too good to be true, it probably is not.
|
| If removing weights improves some metrics, that may be a clue
| that the model is not optimal in some sense.
| kolinko wrote:
| The algorithm still uses all the weights, just not all the time
| - it just skips the weights when they are not important given an
| input vector.
|
| Also, approximation methods, as a field, are not new and they
| have shown their use.
|
| Having said all that, extraordinary claims require
| extraordinary evidence - that's why I hedge my communication.
| It's "probably" until we get serious tests going.
| mijoharas wrote:
| The metric that's improved is computation speed, and it's
| achieved by essentially changing the computation (by not
| performing some computations that likely don't have a large
| impact on the results).
|
| Given that it's a different computation, you could argue that
| Mistral+effort is a new model with an improved metric of
| quality per amount of computation performed.
|
| Otherwise - given that for every different input there's a
| separate set of weights in the model that are excluded - I
| don't think you could conclude from this (if it holds up etc
| etc) that the base model is not optimal.
|
| In a similar sense, quantization improved the "quality per
| model size" metric, but I don't think people are arguing that
| Mistral is less optimal than quantised Mistral (unless you're
| speaking about literally that metric). On the other hand, if
| you're targeting that metric specifically, then it would make
| perfect sense to say that quantised Mistral is more optimal
| for it.
|
| I guess it comes down to optimality being dependent on the
| metric you're looking at, and there being many things you might
| want to optimise for.
|
| To note again, if this technique holds up, it's better than
| model distillation (just get rid of some of the weights)
| because for some inputs those weights could matter and this
| technique should (iiuc) account for that somewhat. To me, this
| is what it sounds like you're referring to when saying:
|
| > If removing weights improves some metrics, that may be a clue
| that the model is not optimal in some sense
| addandsubtract wrote:
| Wait, does the second "it" refer to the true part? Because
| traditionally, it refers to the "too good" expression. So you'd
| say, "it _is_ too good to be true".
| hatthew wrote:
| This seems similar to semi-structured (aka 2:4) sparsity - it may
| be worth explicitly comparing. As far as I can tell by skimming,
| your technique:
|
| - is optimized for apple silicon
| - ~2x speed at 75% sparsity
| - dynamic, depends on input, applied at runtime
| - can choose amount of sparsity
|
| And 2:4 semi-structured sparsity:
|
| - is optimized for GPUs with sparse tensor cores (nvidia ampere
|   and beyond)
| - ~2x speed at 50% sparsity
| - static, applied to the model at rest
| - probably worse results than your technique at 50% sparsity
|
| The interesting comparison I'd want to see is semi-structured
| sparsity results (50% sparsity, 2x speedup) vs your results at
| 75% sparsity (2x speedup).
| kolinko wrote:
| Thanks for checking it out!
|
| I also can't wait to have more tests done.
|
| Also, I chose Apple Silicon because it was easier for me to
| develop on. It is possible that the algorithm would also have
| good performance on other architectures.
| yousif_123123 wrote:
| I know this doesn't retrain, but I wonder if approaches like this
| plus quantization could get back any "lost" quality with some
| post training.
|
| It's great to see, and to know in one's mind how much better
| performance and cost are likely to be in the future.
|
| I know it's fun to work on, but I also want to say THANK YOU for
| developing open source.
| spencerchubb wrote:
| At first glance, that sounds like it could work. From what I've
| read, there seem to be two main ways to regain some quality
| with quantization: post-training quantization, which happens
| after training, and quantization-aware training, which quantizes
| during training but leaves the activations and gradients at full
| precision.
| wayoverthecloud wrote:
| I was wondering if something similar can be done for CNN too.
| kolinko wrote:
| A friend who first heard of the method immediately suggested
| this may work, perhaps even better, on diffusion models.
|
| It is really a drop-in replacement for the regular matrix
| multiplication. The data structure is a bit more painful to
| work with (you have three datasets, not just one to represent a
| weight matrix), but it shouldn't be too difficult for devs of
| the existing inference engines to implement it for a test.
|
| Half of my challenge was that I wasn't knowledgeable enough to
| just patch llama.cpp or MLX and use their engines with
| bucketMul. That's why I opted for making my own - still not
| sure if it was a good choice to build everything from the
| ground up though, although I'm proud of the name :)
|
| Finally - the basic math behind the approximation suggests that
| this should work with all the models - the cosine similarity
| score is 0.99 up until the magical 25% mark for most of the
| matrices I tried. It can vary within a model though - e.g. in
| Llama, on the first layer, wq/wv/wk could be easily approximated
| with 5% effort, whereas some deeper layers at 25% had just a
| 0.90 cos score - still seemingly enough to not lose coherency
| within the model.
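|
| For reference, the score I mean is just the cosine similarity
| between the exact and the approximated output of a single
| multiplication (a small sketch; `approx_mul` stands for whatever
| approximation is being tested):
|
|     import numpy as np
|
|     def cos_score(exact, approx):
|         # 1.0 means the approximated output points in exactly the
|         # same direction as the exact one
|         norm = np.linalg.norm(exact) * np.linalg.norm(approx) + 1e-12
|         return float(np.dot(exact, approx) / norm)
|
|     # e.g. cos_score(W @ v, approx_mul(v, W_idx, W_val, cutoff)),
|     # computed per matrix and per layer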
| brrrrrm wrote:
| I've been trying to think about how you'd amp up the batch size
| here. It's a bit tricky since the memory access would be way
| higher, but I think you can actually still save on compute by
| chunking things up in a clever way to utilize the tensor cores.
| kolinko wrote:
| What would definitely help would be a better understanding of
| how a GPU handles myVal - does it go to the cache, to
| threadgroup memory, or somewhere else? Or perhaps a
| decompilation of the kernel to see what's really going on
| there. If we knew, we could calculate the optimal parameters
| given an architecture.
|
| There is also another approach I tried - I've kept it in
| wildMul.swift/metal. It's more by the book: splitting the work
| so that one thread keeps track of only one or two output
| dimensions when going through the buckets. I didn't manage to
| get it working at a decent speed though.
|
| The idea would be to make sure that one simdgroup reads 4/8
| buckets in a batch (so still ~8-16 consecutive bytes which is
| considered minimum for an optimal read on Apple Silicon) and
| splits each single bucket among 4-8 threads. One thread might
| then need to remember only 2-4 dims in myVal, and this might
| stay in internal GPU registers.
|
| Not sure if I'm explaining this all correctly though.
|
| Having said this - the priority now will be to remove the
| overhead to get the whole inference time closer in speed to
| just the multiplication speeds, but above all - fixing the bugs
| causing suboptimal results at 100% (and breaking Mixtral), and
| doing Q8. The algorithm needs to work in Q8 to be really
| usable.
| renonce wrote:
| Essentially the algorithm is about pruning parameters _on the
| fly_ and making the weight matrix sparse by setting least
| significant weights to zero, where "least significant" is
| defined in terms of the rank of the absolute value of the pruned
| weight within its group?
|
| A search for model pruning turns up many results, including
| https://arxiv.org/abs/2305.11627 which discusses "magnitude
| pruning" as a baseline, and refers to
| https://arxiv.org/pdf/2301.00774.pdf, which asserts in the
| introduction:
|
| > First, as shown in Figure 1, SparseGPT can induce uniform
| layerwise sparsity of up to 60% in e.g. the 175-billion-parameter
| variant of the OPT family (Zhang et al., 2022), with minor
| accuracy loss. By contrast, the only known one-shot baseline
| which easily extends to this scale, Magnitude Pruning (Hagiwara,
| 1994; Han et al., 2015), preserves accuracy only until 10%
| sparsity, and completely collapses beyond 30% sparsity.
|
| I don't like how these papers boast about their own methods by
| poorly implementing a baseline and describing their own methods
| with lots of mathematical jargon - the OP's blog post is a
| breeze to read and makes the methods accessible to people with
| very little background.
| kolinko wrote:
| Thanks! The last full month was all about making sure all the
| research is as replicable and as trustworthy as I could make it.
| The original implementation was extremely inefficient, and even
| once I had the Metal/GPU mmul op working fast, I spent a lot of
| time bringing the rest of the implementation as close to
| Llama.cpp as possible to allow for easier benchmarking.
|
| As for the approaches - it seems the papers you mention do this
| statically, and didn't produce an algorithm for actually
| speeding up computations with their 20-50% results - which was
| a large part of the difficulty. I'll probably take some time
| off one day and finally do a thorough review of the literature.
|
| Ultimately I want to add a citations page with these and some
| other papers people have posted in the comments soon. I expect
| sooner or later someone will find this algorithm written up by
| someone :)
|
| During development I asked GPT-4 and googled, trying to find
| information about this method - everything seemed either
| static, or focused on removing full dimensions/layers ad hoc
| with eventual retraining. I didn't find anything matching this
| idea exactly.
| bravura wrote:
| Thank you. If you want to contribute novel research, a
| thorough literature search of prior work is essential. And
| hopefully a comparison to previous approaches.
|
| If you didn't find these works in your literature search, I
| encourage you to improve this skill, because it's important
| and valuable.
| kolinko wrote:
| True, and it will be easier next time because new LLMs allow
| for better searching through papers.
|
| Having said that - the tradeoff with getting into a new
| field is that there are 50 years of history to dig through,
| and by the time you know exactly what you're looking for,
| you don't really need to find it anymore. Although a better
| math framework or another approach would help a bit for sure.
|
| I think in the future it will be way easier with better LLM
| searches. GPT right now fell short at finding some stuff,
| but it was already very useful in giving me an overview of
| the field.
| llama_person wrote:
| One more opinion here: if you're looking around and
| nobody else has something obviously working, it's ok not
| to do a thorough search through hundreds of papers hoping
| to find someone else who came up with the idea. Get it
| working first, show it off, and if someone comes claiming
| it was their idea, respectfully acknowledge the prior
| work. But there's a gigantic gap between "wrote it in a
| paper" and "actually got it working".
|
| Also, awesome job! Thanks for writing up some seriously
| cool engineering.
| Macuyiko wrote:
| Have a look at https://arxiv.org/pdf/2306.11695.pdf which
| also uses the norm of the inputs, based on calibration.
___________________________________________________________________
(page generated 2024-04-18 23:01 UTC)