[HN Gopher] Quantized Llama models with increased speed and a re...
       ___________________________________________________________________
        
       Quantized Llama models with increased speed and a reduced memory
       footprint
        
       Author : egnehots
       Score  : 182 points
       Date   : 2024-10-24 18:52 UTC (4 hours ago)
        
 (HTM) web link (ai.meta.com)
 (TXT) w3m dump (ai.meta.com)
        
       | newfocogi wrote:
       | TLDR: Quantized versions of Llama 3.2 1B and 3B models with
       | "competitive accuracy" to the original versions (meaning some
       | degraded performance; plots included in the release notes).
        
         | newfocogi wrote:
         | Quantization schemes include post-training quantization (PTQ),
         | SpinQuant, and QLoRA.
        
       | arnaudsm wrote:
        | How do they compare to the existing quants on Ollama, like
        | q4_K_S?
        
         | tcdent wrote:
          | These undergo additional fine-tuning (QLoRA) using some or all
          | of the original dataset, so they're able to get the weights to
          | align better to the nf4 dtype, which increases accuracy.
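          | 
          | Roughly, that nf4 + LoRA recipe looks like this with
          | transformers/peft/bitsandbytes (a sketch only - the model id
          | and hyperparameters are placeholders, and Meta's exact setup
          | may differ):
          | 
          |     import torch
          |     from transformers import AutoModelForCausalLM, BitsAndBytesConfig
          |     from peft import LoraConfig, get_peft_model
          | 
          |     # Load the base model with weights stored as 4-bit
          |     # NormalFloat (nf4).
          |     bnb = BitsAndBytesConfig(
          |         load_in_4bit=True,
          |         bnb_4bit_quant_type="nf4",
          |         bnb_4bit_compute_dtype=torch.bfloat16,
          |     )
          |     model = AutoModelForCausalLM.from_pretrained(
          |         "meta-llama/Llama-3.2-3B-Instruct",  # placeholder id
          |         quantization_config=bnb,
          |     )
          | 
          |     # Attach small trainable LoRA adapters; only these are
          |     # trained, so they compensate for the nf4 rounding error.
          |     lora = LoraConfig(r=16, lora_alpha=32,
          |                       target_modules=["q_proj", "v_proj"])
          |     model = get_peft_model(model, lora)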
        
       | philipkglass wrote:
        | These quantized models show much less degradation than "vanilla
        | post-training quantization", but there are a bunch of PTQ
       | schemes that people have already applied to Llama models [1]. I
       | didn't see any details about the vanilla PTQ they used as a
       | baseline. Has it been written about elsewhere?
       | 
       | [1] https://ollama.com/library/llama3.2/tags
        
       | nisten wrote:
        | It's pretty interesting that the new SpinQuant method did not
        | manage to beat good old nf4 (4-bit) QLoRA training (Tim Dettmers
        | really cooked with that one).
       | 
       | Really appreciate that Meta published both results+model quants
       | and didn't just make some bs claim about a new sota quant like
       | most other bigger companies would've done.
        
       | EliBullockPapa wrote:
       | Anyone know a nice iOS app to run these locally?
        
         | Arcuru wrote:
          | I access them by running the models in Ollama (on my own
          | hardware) and then using my app Chaz[1] to reach them through
          | my normal Matrix client.
         | 
         | [1] - https://github.com/arcuru/chaz
        
         | simonw wrote:
          | MLC Chat is a great iPhone app for running models (it's on
          | Android too) and currently ships with Llama 3.2 3B Instruct -
          | not the version Meta released today, but a quantized version of
          | their previous release.
         | 
          | I wouldn't be surprised to see it add the new ones shortly;
          | it's quite actively maintained.
         | 
         | https://apps.apple.com/us/app/mlc-chat/id6448482937
        
         | behnamoh wrote:
         | I've been using PocketGPT.
        
         | drilbo wrote:
         | https://github.com/a-ghorbani/pocketpal-ai
         | 
          | This was just recently open sourced and is pretty nice. The
          | only issue I've had is very minor UI stuff on Android; from
          | skimming comments, it sounds like it runs better on iOS.
        
       | theanonymousone wrote:
        | May I ask if anyone has successfully used 1B and 3B models in
        | production, and if yes, in what use cases? I seem to be failing
        | even at seemingly simple tasks such as word translation or zero-
        | shot classification. For example, they seem not to care about
        | instructions to write only the response and no explanation, which
        | makes it impossible to use them in a pipeline :/
        
         | wswope wrote:
         | I've only toyed with them a bit, and had a similar experience -
         | but did find I got better output by forcing them to adhere to a
         | fixed grammar:
         | https://github.com/ggerganov/llama.cpp/tree/master/grammars
         | 
         | For context, I was playing with a script to bulk download
         | podcasts, transcribe with whisper, pass the transcription to
         | llama.cpp to ID ads, then slice the ads out with ffmpeg. I
         | started with the generic json_array example grammar, then
         | iteratively tweaked it.
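          | 
          | Roughly, the constrained-output part looked like this (a sketch
          | via the llama-cpp-python bindings rather than the CLI; the
          | grammar, model path and prompt are simplified placeholders):
          | 
          |     from llama_cpp import Llama, LlamaGrammar
          | 
          |     # GBNF grammar: output can only be a JSON array of
          |     # [start, end] pairs in seconds, e.g. [[12.5, 74.0]].
          |     GRAMMAR = r'''
          |     root ::= "[" ws (pair ("," ws pair)*)? ws "]"
          |     pair ::= "[" ws num "," ws num ws "]"
          |     num  ::= [0-9]+ ("." [0-9]+)?
          |     ws   ::= [ \t\n]*
          |     '''
          | 
          |     transcript = "..."  # whisper output goes here
          |     llm = Llama(model_path="llama-3.2-3b-q4.gguf")  # placeholder
          |     prompt = ("List the ad segments in this transcript as JSON:\n"
          |               + transcript)
          |     out = llm(prompt,
          |               grammar=LlamaGrammar.from_string(GRAMMAR),
          |               max_tokens=256)
          |     print(out["choices"][0]["text"])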
        
         | accrual wrote:
         | Not in production, but I've used a 3B model to test a local LLM
         | application I'm working on. I needed a full end-to-end
         | request/response and it's a lot faster asking a 3B model than
          | an 8B model. I could set up a test harness and replay the
          | responses... but this was a lot simpler.
        
           | jdthedisciple wrote:
           | If for testing then why not just mock the whole thing for
           | ultimate performance ... ?
        
             | nkozyra wrote:
              | Probably faster to use an off-the-shelf model with
              | llama.cpp than to mock it.
        
         | com2kid wrote:
          | 3B models are perfectly capable; I've had great luck with Phi
          | 3.5.
         | 
         | > For example, they seem to not care about instructions to only
         | write a response and no explanation
         | 
          | You need to use tools to force the model to adhere to a
          | schema, or learn to parse out the part of the response you
          | want; both approaches work.
         | 
         | You'll also need to make good use of robust examples in your
         | initial prompt, and give lots of examples of how you want the
         | output to look. (Yes this quickly burns up the limited context
         | length!)
         | 
          | Finally, embrace the fact that these models are tuned for chat:
          | the more conversational you make the back-and-forth, the less
          | you are stretching the model's abilities.
         | 
         | I wrote a very small blog post at
         | https://meanderingthoughts.hashnode.dev/unlock-the-full-pote...
         | explaining some of this.
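          | 
          | A simplified illustration of the "give examples and parse out
          | the part you want" approach (the model path is a placeholder;
          | any local runner works similarly):
          | 
          |     from llama_cpp import Llama
          | 
          |     llm = Llama(model_path="phi-3.5-mini-q4.gguf")  # placeholder
          | 
          |     # Few-shot prompt: show the exact output shape, pre-open
          |     # the <ans> tag, and stop generation at the closing tag.
          |     PROMPT = """Translate the word to French. Answer only inside <ans> tags.
          | 
          |     Word: dog
          |     <ans>chien</ans>
          | 
          |     Word: house
          |     <ans>maison</ans>
          | 
          |     Word: {word}
          |     <ans>"""
          | 
          |     def translate(word: str) -> str:
          |         out = llm(PROMPT.format(word=word), max_tokens=16,
          |                   stop=["</ans>"])
          |         return out["choices"][0]["text"].strip()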
        
           | teleforce wrote:
            | I wonder if CUE could help here, in a similar fashion to the
            | DSL methods you've described in your blog post [1]. After
            | all, CUE's fundamentals are based on feature structures from
            | the deterministic tradition of NLP, unlike LLMs, which are
            | stochastic NLP [2],[3]. Perhaps combining the deterministic
            | and non-deterministic approaches is the potent mix that could
            | reach the same results with a much smaller footprint, and be
            | more energy efficient in the process.
           | 
           | [1] Cue - A language for defining, generating, and validating
           | data:
           | 
           | https://news.ycombinator.com/item?id=20847943
           | 
           | [2] Feature structure:
           | 
           | https://en.m.wikipedia.org/wiki/Feature_structure
           | 
           | [3] The Logic of CUE:
           | 
           | https://cuelang.org/docs/concept/the-logic-of-cue/
        
             | com2kid wrote:
              | On my LinkedIn post about this topic, someone replied with
              | a method of steering LLM output that's superior to anything
              | else I've ever heard of, so I've decided that until I find
              | time to implement their method, I'm not going to worry
              | about it.
             | 
              | tl;dr: you put into the prompt all the JSON up to the point
              | where you want the LLM to speak, set the stop token to the
              | closing token of the current JSON item (',' or '}' or ']',
              | whatever), and then your code fills out the rest of the
              | JSON syntax up until the next LLM-generated value is
              | needed.
             | 
             | I hope that makes sense.
             | 
             | It is super cool, and I am pretty sure there is a way to
             | make a generator that takes in an arbitrary JSON schema and
             | builds a state machine to do the above.
             | 
             | The performance should be super fast on locally hosted
             | models that are using context caching.
             | 
              | Eh, I should write this up as a blog post and hope someone
              | else implements it; if not, I'll just do it myself.
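              | 
              | Something like this, for a toy all-string "schema"
              | (untested sketch; the model path is a placeholder):
              | 
              |     import json
              |     from llama_cpp import Llama
              | 
              |     llm = Llama(model_path="llama-3.2-3b-q4.gguf")  # placeholder
              | 
              |     # Toy "schema": three string-valued fields.
              |     FIELDS = ["title", "sentiment", "summary"]
              | 
              |     def fill_json(document: str) -> dict:
              |         # We write the JSON scaffold; the model only fills
              |         # in each value, and the closing quote is the stop
              |         # token.
              |         partial = "{"
              |         for i, key in enumerate(FIELDS):
              |             partial += (", " if i else "") + f'"{key}": "'
              |             prompt = (f"Document:\n{document}\n\n"
              |                       f"Extract as JSON:\n{partial}")
              |             out = llm(prompt, max_tokens=64, stop=['"'])
              |             value = out["choices"][0]["text"].strip()
              |             partial += value + '"'
              |         return json.loads(partial + "}")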
        
         | JohnHammersley wrote:
         | > For example, they seem to not care about instructions to only
         | write a response and no explanation, thus making it impossible
         | to use them in a pipeline
         | 
         | I was doing some local tidying up of recording transcripts,
         | using a fairly long system prompt, and I saw the same behaviour
         | you mention if the transcript I was passing in was too long --
          | batching it up to stay under the max context length prevented
          | this.
         | 
         | Might not be what's happening in your case, but I mention it
         | because it wasn't immediately obvious to me when I first saw
         | the behaviour.
        
         | beoberha wrote:
          | For me, it was almost random whether I would get a little spiel
          | at the beginning of the response - even on the unquantized 8B
          | Instruct. Since Ollama doesn't support grammars, I tried to get
          | it to work with one prompt that summarized an article and
          | extracted and classified certain information I requested, then
          | another prompt that digested the summary and spat out
          | structured JSON output. It was much better than trying to do it
          | in one prompt, but still far too random, even with temperature
          | at 0. Sometimes the first prompt misclassified things.
          | Sometimes the second prompt would include a "here's your
          | structured output".
         | 
         | And Claude did everything perfectly ;)
        
         | scriptsmith wrote:
         | Yes, I've used the v3.2 3B-Instruct model in a Slack app.
         | Specifically using vLLM, with a template:
         | https://github.com/vllm-project/vllm/blob/main/examples/tool...
         | 
         | Works as expected if you provide a few system prompts with
         | context.
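          | 
          | The client side is just the OpenAI-compatible API that vLLM
          | serves (a sketch - the model name, port and prompts here are
          | illustrative):
          | 
          |     from openai import OpenAI
          | 
          |     # vLLM exposes an OpenAI-compatible server; assume it was
          |     # launched with `vllm serve` plus the chat template linked
          |     # above.
          |     client = OpenAI(base_url="http://localhost:8000/v1",
          |                     api_key="unused")
          | 
          |     system = ("You are a concise Slack assistant. "
          |               "Answer in plain text, no preamble.")
          |     context = "Context: questions are usually about deploys."
          | 
          |     messages = [
          |         {"role": "system", "content": system},
          |         {"role": "system", "content": context},
          |         {"role": "user", "content": "Summarize the deploy thread."},
          |     ]
          |     resp = client.chat.completions.create(
          |         model="meta-llama/Llama-3.2-3B-Instruct",
          |         messages=messages,
          |         temperature=0.2,
          |     )
          |     print(resp.choices[0].message.content)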
        
         | bloomingkales wrote:
          | Qwen2.5 3B is very, very good.
        
       | mmaunder wrote:
       | Funny how quiet the HN comments on big AI news with serious
       | practical applications, like this, are getting. ;-)
        
         | pryelluw wrote:
          | I don't get the comment. For one, I'm excited about
          | developments in the field. I'm not afraid it will "replace me";
          | technology has replaced me multiple times over already. I'm
          | looking forward to working with these models more and more.
        
           | mmaunder wrote:
            | No, I meant that a lot of us are working very fast on pre-
            | launch products, implementing cutting-edge ideas using, e.g.,
            | the incredible speedup of a small, fast inference model like
            | a quantized 3B in combination with other tools, and I think
            | there's quite a bit of paranoia out there that someone else
            | will beat you to market. So there's not a lot of sharing
            | going on in the comments - at least not as much as
            | previously, and not as much technical discussion as in other,
            | non-AI threads on HN.
        
             | mattgreenrocks wrote:
             | This thread attracts a smaller audience than, say, a new
             | version of ChatGPT.
        
             | pryelluw wrote:
             | Ok, thank you for pointing that out.
             | 
              | I'm focused on making models play nice with each other
              | rather than building a feature that relies on one. That's
              | where I see the more relevant work being, and why news like
              | this is exciting!
        
         | accrual wrote:
         | Two days ago there was a pretty big discussion on this topic:
          | "Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku"
          | (https://news.ycombinator.com/item?id=41914989) - 1421 points,
          | 717 comments.
        
         | flawn wrote:
         | A sign of the ongoing commoditization?
        
         | keyle wrote:
         | Aren't we all just tired of arguing the same points?
        
         | lxgr wrote:
         | What kind of fundamental discussion are you hoping to see under
         | an article about an iterative improvement to a known model?
         | 
         | "AI will destroy the world"? "AI is great and will save
         | humanity"? If you're seriously missing that, there's really
         | enough platforms (and articles for more fundamental
         | announcements/propositions on this one) where you can have
         | these.
        
         | yieldcrv wrote:
          | I mean, this outcome for LLMs was expected, and LLM drops are
          | coming too fast - definitely too fast to wait for Meta to hold
          | an annual conference with a ton of hype. Furthermore, these
          | releases are just prerequisites for a massive lemming rush of
          | altering these models for the real fun, which happens in other
          | communities.
        
       | behnamoh wrote:
        | Does anyone know why quantization is the most common method for
        | speeding up inference? I keep hearing about all sorts of new
        | methods, but almost none of them get implemented in practice
        | (except for flash attention).
        
         | o11c wrote:
          | Because the way LLMs work is more or less "for every token,
          | read all of the weights from memory and do math on them". The
          | math is fast and the memory reads are the bottleneck, so if you
          | manage to use only half the bits to store each weight, you only
          | have to move half as much data per token. Of course, sometimes
          | those least significant bits were relied upon in the original
          | training.
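          | 
          | Back-of-envelope illustration (the numbers are made up, and
          | real decoders don't hit this ceiling):
          | 
          |     # Single-stream decoding is bounded by how fast the weights
          |     # stream from memory: tokens/sec <= bandwidth / model size.
          |     bandwidth_gb_s = 100   # assumed laptop-class bandwidth
          |     params_b = 3.0         # Llama 3.2 3B
          | 
          |     for bits in (16, 8, 4):
          |         size_gb = params_b * bits / 8  # GB read per token
          |         peak = bandwidth_gb_s / size_gb
          |         print(f"{bits:>2}-bit: <= {peak:.0f} tok/s")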
        
         | formalsystem wrote:
          | It's particularly useful in memory-bound workloads like batch-
          | size-1 LLM inference, where you're bottlenecked by how quickly
          | you can send weights to your GPU. This is why, at least in
          | torchao, we strongly recommend people try out int4
          | quantization.
          | 
          | At larger batch sizes you become compute-bound, so quantization
          | matters less and you have to rely on hardware support to
          | accelerate smaller dtypes like fp8.
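          | 
          | A minimal int4 weight-only example, with a single Linear layer
          | standing in for a full model (API names are from recent torchao
          | releases and may shift between versions):
          | 
          |     import torch
          |     from torchao.quantization import quantize_, int4_weight_only
          | 
          |     # int4 weight-only quantization swaps Linear weights for
          |     # packed int4, cutting weight memory traffic ~4x vs bf16;
          |     # activations stay in bf16.
          |     model = torch.nn.Sequential(
          |         torch.nn.Linear(4096, 4096, bias=False,
          |                         dtype=torch.bfloat16, device="cuda"),
          |     )
          |     quantize_(model, int4_weight_only())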
        
       | justanotheratom wrote:
        | Any pointers on how to fine-tune this on my dataset, then
        | package and run it in my Swift iOS app?
        
       | tveita wrote:
        | So SpinQuant learns a rotation for activations and weights that,
        | to my understanding, "smears" the outliers out so you don't get
        | extreme values in any one weight.
       | 
       | Random anecdote warning - In the old days, before vector search
       | became AI and everyone and their dog offered a vector database, I
       | had a task that required nearest neighbour search in a decent
       | amount of high-dimensional vectors.
       | 
       | I tried quantizing them to bit vectors in an index and scanning
       | through it to get an initial set of candidates. Performance was
       | actually quite decent - reading through RAM linearly is fast! But
       | the selectivity wasn't great.
       | 
       | Somewhere along the way I found this paper[1] that iteratively
       | finds a rotation to apply before quantization to reduce the
       | quantization error. Very similar goal to SpinQuant, but focused
       | on bit quantization only.
       | 
       | As it turns out the 'random rotation' baseline they benchmark
       | against worked great for my use case, so I never tried
       | implementing the fancier algorithm. But it's a pretty rare day at
       | work that "apply a random rotation matrix to a 128-dimensional
       | vector" is the solution to my problem.
       | 
       | [1] https://ieeexplore.ieee.org/abstract/document/6296665 /
       | https://slazebni.cs.illinois.edu/publications/ITQ.pdf
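        | 
        | For the curious, the whole baseline fits in a few lines of NumPy
        | (the data and dimensions here are synthetic):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     d, n = 128, 100_000
        |     db = rng.standard_normal((n, d)).astype(np.float32)
        | 
        |     # Random orthogonal rotation (QR of a Gaussian matrix), then
        |     # sign-quantize to 1 bit per dimension. The rotation spreads
        |     # the energy across dimensions so the sign bits lose less
        |     # information; ITQ iteratively refines R instead.
        |     R, _ = np.linalg.qr(rng.standard_normal((d, d)))
        |     db_bits = np.packbits((db @ R) > 0, axis=1)  # n x 16 bytes
        | 
        |     def candidates(query, k=100):
        |         q_bits = np.packbits(query @ R > 0)
        |         # Hamming distance via XOR, scanned linearly through RAM.
        |         dist = np.unpackbits(db_bits ^ q_bits, axis=1).sum(axis=1)
        |         return np.argsort(dist)[:k]  # rerank these exactly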
        
       | ed wrote:
        | Oh cool! I've been playing with quantized Llama 3B (4-bit
        | SpinQuant) for the last week. The code for SpinQuant has been
        | public for a bit.
        | 
        | It's pretty adept at most natural language tasks ("summarize
        | this") and performance on iPhone is usable. It's even decent at
        | tool use once you get the chat template right.
        | 
        | But it struggles with JSON and HTML syntax (correctly escaping
        | characters), and isn't great at planning, which makes it a bad
        | fit for most agentic uses.
       | 
        | My plan was to let Llama communicate with more advanced AIs,
        | using natural language to offload tool use to them, but very
        | quickly Llama goes rogue and starts doing things you didn't ask
        | it to, like trying to delete data.
       | 
       | Still - the progress Meta has made here is incredible and it
       | seems we'll have capable on-device agents in the next generation
       | or two.
        
       | formalsystem wrote:
        | Hi, I'm Mark. I work on torchao, which was used for the
        | quantization-aware training and the ARM kernels in this blog
        | post. If you have any questions about quantization or
        | performance more generally, feel free to let me know!
        
         | philipkglass wrote:
         | What was the "vanilla post-training quantization" used for
         | comparison? There are 22 GGUF quantization variants smaller
          | than 16 bits per weight, and I can't tell which one is being
          | compared against:
         | 
         | https://huggingface.co/docs/hub/en/gguf#quantization-types
         | 
         | It might even mean a non-GGUF quantization scheme; I'm just an
         | intermediate user of local models, not an expert user or
         | developer.
        
       ___________________________________________________________________
       (page generated 2024-10-24 23:00 UTC)