[HN Gopher] Quantized Llama models with increased speed and a re...
___________________________________________________________________
Quantized Llama models with increased speed and a reduced memory
footprint
Author : egnehots
Score : 182 points
Date : 2024-10-24 18:52 UTC (4 hours ago)
(HTM) web link (ai.meta.com)
(TXT) w3m dump (ai.meta.com)
| newfocogi wrote:
| TLDR: Quantized versions of Llama 3.2 1B and 3B models with
| "competitive accuracy" to the original versions (meaning some
| degraded performance; plots included in the release notes).
| newfocogi wrote:
| Quantization schemes include post-training quantization (PTQ),
| SpinQuant, and QLoRA.
| arnaudsm wrote:
| How do they compare to their original quants on ollama like
| q4_K_S?
| tcdent wrote:
| These undergo additional fine tuning (QLoRA) using some or all
| of the original dataset, so they're able to get the weights to
| align to the nf4 dtype better, which increases the accuracy.
| philipkglass wrote:
| These quantized models show much less degradation compared to a
| "vanilla post-training-quantization" but there are a bunch of PTQ
| schemes that people have already applied to Llama models [1]. I
| didn't see any details about the vanilla PTQ they used as a
| baseline. Has it been written about elsewhere?
|
| [1] https://ollama.com/library/llama3.2/tags
| nisten wrote:
| It's pretty interesting that the new SpinQuant method did not
| manage to be better than good old nf4bit QLORA training (Tim
| Dettmers really cooked with that one).
|
| Really appreciate that Meta published both results+model quants
| and didn't just make some bs claim about a new sota quant like
| most other bigger companies would've done.
| EliBullockPapa wrote:
| Anyone know a nice iOS app to run these locally?
| Arcuru wrote:
| I access them by running the models in Ollama (on my own
| hardware), and then using my app Chaz[1] to access it through
| my normal Matrix client.
|
| [1] - https://github.com/arcuru/chaz
| simonw wrote:
| MLC Chat is a great iPhone app for running models (it's on
| Android too) and currently ships with Llama 3.2 3B Instruct -
| not the version Meta released today, its a quantized version of
| their previous release.
|
| I wouldn't be surprised to see it add the new ones shortly,
| it's quite actively maintained.
|
| https://apps.apple.com/us/app/mlc-chat/id6448482937
| behnamoh wrote:
| I've been using PocketGPT.
| drilbo wrote:
| https://github.com/a-ghorbani/pocketpal-ai
|
| This was just recently open sourced and is pretty nice. Only
| issue I've had is very minor UI stuff (on Android, sounds like
| it runs better on iOS from skimming comments)
| theanonymousone wrote:
| May I ask if anyone has successfully used 1B and 3B models in
| production and if yes, in what use cases? I seem to be failing
| even in seemingly simpler tasks such as word translation or zero-
| shot classification. For example, they seem to not care about
| instructions to only write a response and no explanation, thus
| making it impossible to use them in a pipeline :/
| wswope wrote:
| I've only toyed with them a bit, and had a similar experience -
| but did find I got better output by forcing them to adhere to a
| fixed grammar:
| https://github.com/ggerganov/llama.cpp/tree/master/grammars
|
| For context, I was playing with a script to bulk download
| podcasts, transcribe with whisper, pass the transcription to
| llama.cpp to ID ads, then slice the ads out with ffmpeg. I
| started with the generic json_array example grammar, then
| iteratively tweaked it.
| accrual wrote:
| Not in production, but I've used a 3B model to test a local LLM
| application I'm working on. I needed a full end-to-end
| request/response and it's a lot faster asking a 3B model than
| an 8B model. I could setup a test harness and replay the
| responses... but this was a lot simpler.
| jdthedisciple wrote:
| If for testing then why not just mock the whole thing for
| ultimate performance ... ?
| nkozyra wrote:
| Probably faster to use off the shelf model with llama.cpp
| than to mock it
| com2kid wrote:
| 3B models are perfectly capable, I've had great luck with Phi
| 3.5.
|
| > For example, they seem to not care about instructions to only
| write a response and no explanation
|
| You need to use tools to force the model to adhere to a schema.
| Or you can learn to parse out the part of the response you
| want, both work.
|
| You'll also need to make good use of robust examples in your
| initial prompt, and give lots of examples of how you want the
| output to look. (Yes this quickly burns up the limited context
| length!)
|
| Finally, embrace the fact that these models are tuned for chat,
| so the more conversational you make the back and forth the less
| you are stretching the models abilities.
|
| I wrote a very small blog post at
| https://meanderingthoughts.hashnode.dev/unlock-the-full-pote...
| explaining some of this.
| teleforce wrote:
| I wonder if CUE can help the situation in similar fashion to
| the DSL methods that you've described in your blog post [1].
| After all CUE fundamentals are based on feature structure
| from the deterministic approach of NLP unlike LLM that's
| stochastic NLP [2],[3]. Perhaps deterministic and non-
| deterministic approaches is the potent combination that can
| effectively help reduce much of the footprint to get to the
| same results and being energy efficient in the process.
|
| [1] Cue - A language for defining, generating, and validating
| data:
|
| https://news.ycombinator.com/item?id=20847943
|
| [2] Feature structure:
|
| https://en.m.wikipedia.org/wiki/Feature_structure
|
| [3] The Logic of CUE:
|
| https://cuelang.org/docs/concept/the-logic-of-cue/
| com2kid wrote:
| On my LinkedIn post about this topic someone actually
| replied with a superior method of steering LLM output
| compared to anything else I've ever heard of, so I've
| decided that until I find time to implement their method,
| I'm not going to worry about things.
|
| tl;dr you put into the prompt all the JSON up until what
| you want the LLM to say, and you set the stop token to the
| end token of the current JSON item (so ',' or '}' ']',
| whatever) and you then your code fills out the rest of the
| JSON syntax up until another LLM generated value is needed.
|
| I hope that makes sense.
|
| It is super cool, and I am pretty sure there is a way to
| make a generator that takes in an arbitrary JSON schema and
| builds a state machine to do the above.
|
| The performance should be super fast on locally hosted
| models that are using context caching.
|
| Eh I should write this up as a blog post, hope someone else
| implements it, and if not, just do it myself.
| JohnHammersley wrote:
| > For example, they seem to not care about instructions to only
| write a response and no explanation, thus making it impossible
| to use them in a pipeline
|
| I was doing some local tidying up of recording transcripts,
| using a fairly long system prompt, and I saw the same behaviour
| you mention if the transcript I was passing in was too long --
| batching it up to make sure to be under the max length
| prevented this.
|
| Might not be what's happening in your case, but I mention it
| because it wasn't immediately obvious to me when I first saw
| the behaviour.
| beoberha wrote:
| For me, it was almost random if I would get a little spiel at
| the beginning of my response - even on the unquantized 8b
| instruct. Since ollama doesn't support grammars, I was trying
| to get it to work where I had a prompt that summarized an
| article and extracted and classified certain information that I
| requested. Then I had another prompt that would digest the
| summary and spit out a structured JSON output. It was much
| better than trying to do it in one prompt, but still far too
| random even with temperature at 0. Sometimes the first prompt
| misclassified things. Sometimes the second prompt would include
| a "here's your structured output".
|
| And Claude did everything perfectly ;)
| scriptsmith wrote:
| Yes, I've used the v3.2 3B-Instruct model in a Slack app.
| Specifically using vLLM, with a template:
| https://github.com/vllm-project/vllm/blob/main/examples/tool...
|
| Works as expected if you provide a few system prompts with
| context.
| bloomingkales wrote:
| Qwen2.5 3b is very very good.
| mmaunder wrote:
| Funny how quiet the HN comments on big AI news with serious
| practical applications, like this, are getting. ;-)
| pryelluw wrote:
| I don't get the comment. For one I'm excited for developments
| in the field. Not afraid it will "replace me" as technology has
| replaced me multiple times over. I'm looking towards working
| with these models more and more.
| mmaunder wrote:
| No, I meant that a lot of us are working very fast on a pre-
| launch product, implementing some cutting edge ideas using
| e.g. the incredible speedup in a small fast inference model
| like quantized 3B in combination with other tools, and I
| think there's quite a bit of paranoia out there that someone
| else will beat you to market. And so not a lot of sharing
| going on in the comments. At least not as much as previously,
| and not as much technical discussion vs other non-AI threads
| on HN.
| mattgreenrocks wrote:
| This thread attracts a smaller audience than, say, a new
| version of ChatGPT.
| pryelluw wrote:
| Ok, thank you for pointing that out.
|
| I'm focused on making models play nice with each other
| rather than building a feature that relies on it. That's
| where I see the more relevant work being. Why such news are
| exciting!
| accrual wrote:
| Two days ago there was a pretty big discussion on this topic:
| Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
| https://news.ycombinator.com/item?id=41914989 1421
| points, 717 comments
| flawn wrote:
| A sign of the ongoing commoditization?
| keyle wrote:
| Aren't we all just tired of arguing the same points?
| lxgr wrote:
| What kind of fundamental discussion are you hoping to see under
| an article about an iterative improvement to a known model?
|
| "AI will destroy the world"? "AI is great and will save
| humanity"? If you're seriously missing that, there's really
| enough platforms (and articles for more fundamental
| announcements/propositions on this one) where you can have
| these.
| yieldcrv wrote:
| I mean, this outcome of LLMs is expected and the frequency of
| LLM drops are too fast, and definitely too fast to wait for
| Meta to do an annual conference with a ton of hype, and
| furthermore these things are just prerequisites for a massive
| lemming rush of altering these models for the real fun, which
| occurs in other communities
| behnamoh wrote:
| Does anyone know why the most common method to speed up inference
| time is quantization? I keep hearing about all sorts of new
| methods but nearly none of them is implemented in practice
| (except for flash attention).
| o11c wrote:
| Because the way LLMs work is more-or-less "for every token,
| read the entire matrix from memory and do math on it". Math is
| fast, so if you manage to use only half the bits to store each
| item in the matrix, you only have to do half as much work. Of
| course, sometimes those least-significant-bits were relied-upon
| in the original training.
| formalsystem wrote:
| It's particularly useful in memory bound workflows like batch
| size = 1 LLM inference where you're bottlenecked by how quickly
| you can send weights to your GPU. This is why at least in
| torchao we strongly recommend people try out int4 quantization.
|
| At larger batch sizes you become compute bound so quantization
| matters less and you have to rely on hardware support to
| accelerate smaller dtypes like fp8
| justanotheratom wrote:
| Any pointers no how to finetune this on my dataset and package
| and run it in my swift ios app?
| tveita wrote:
| So SpinQuant learns a rotation for activations and weights that,
| to my understanding, "smear" the outlier weights out so you don't
| get extreme values in any one weight.
|
| Random anecdote warning - In the old days, before vector search
| became AI and everyone and their dog offered a vector database, I
| had a task that required nearest neighbour search in a decent
| amount of high-dimensional vectors.
|
| I tried quantizing them to bit vectors in an index and scanning
| through it to get an initial set of candidates. Performance was
| actually quite decent - reading through RAM linearly is fast! But
| the selectivity wasn't great.
|
| Somewhere along the way I found this paper[1] that iteratively
| finds a rotation to apply before quantization to reduce the
| quantization error. Very similar goal to SpinQuant, but focused
| on bit quantization only.
|
| As it turns out the 'random rotation' baseline they benchmark
| against worked great for my use case, so I never tried
| implementing the fancier algorithm. But it's a pretty rare day at
| work that "apply a random rotation matrix to a 128-dimensional
| vector" is the solution to my problem.
|
| [1] https://ieeexplore.ieee.org/abstract/document/6296665 /
| https://slazebni.cs.illinois.edu/publications/ITQ.pdf
| ed wrote:
| Oh cool! I've been playing with quantized llama 3B for the last
| week. (4-bit spinquant). The code for spinquant has been public
| for a bit.
|
| It's pretty adept at most natural language tasks ("summarize
| this") and performance on iPhone is usable. It's even decent at
| tool once you get the chat template right.
|
| But it struggles with json and html syntax (correctly escaping
| characters), and isn't great at planning, which makes it a bad
| fit for most agenetic uses.
|
| My plan was to let llama communicate with more advanced AI's,
| using natural language to offload tool use to them, but very
| quickly llama goes rogue and starts doing things you didn't ask
| it to, like trying to delete data.
|
| Still - the progress Meta has made here is incredible and it
| seems we'll have capable on-device agents in the next generation
| or two.
| formalsystem wrote:
| Hi I'm Mark I work on torchao which was used for the quantization
| aware training and ARM kernels in this blog. If you have any
| questions about quantization or performance more generally feel
| free to let me know!
| philipkglass wrote:
| What was the "vanilla post-training quantization" used for
| comparison? There are 22 GGUF quantization variants smaller
| than 16 bits per weight and I can't tell which one is being
| compared with:
|
| https://huggingface.co/docs/hub/en/gguf#quantization-types
|
| It might even mean a non-GGUF quantization scheme; I'm just an
| intermediate user of local models, not an expert user or
| developer.
___________________________________________________________________
(page generated 2024-10-24 23:00 UTC)