[HN Gopher] LLaMa running at 5 tokens/second on a Pixel 6
___________________________________________________________________
LLaMa running at 5 tokens/second on a Pixel 6
Author : pr337h4m
Score : 175 points
Date : 2023-03-15 16:50 UTC (6 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| naillo wrote:
| This is really cool, but the output is such garbage at that weight
| size that you might as well be running a Markov chain.
| circuit10 wrote:
| It's quite a bit bigger than GPT-2, which was a really big deal not
| very long ago (remember the unicorn news article example and the
| slow release because it was apparently too powerful?)
| lostmsu wrote:
| From the video, the output seems fine.
|
| But if it is a trimmed version, it is wrong to call it LLaMa.
| refulgentis wrote:
| It's nonsensical: the celeb announces they're going to rehab and
| notes it (?) is an issue affecting all women; at least, earlier
| today (??), they also noted it wasn't drugs or alcohol this time,
| but a life (???)
| londons_explore wrote:
| Without instruction tuning, the perfect language model produces
| output with the same level of intelligibility as random text from
| the training set. And the training set probably has a lot of spam
| and junk in it.
| ddren wrote:
| What are you comparing it to? Without instruction tuning, and with
| a two-character prompt ("He"), I am not sure why you would expect
| it to perform any better.
| refulgentis wrote:
| I was replying to a comment that said it "seems fine."
|
| It does not seem fine.
|
| It is incomprehensible and doesn't match the results I've
| seen from 7B through 65B.
|
| It is true that RLHF could improve it, and perhaps then this
| aggressive an optimization will seem fine.
| tbalsam wrote:
| I've heard a number of people say (from earlier) that the
| quantization and default sampling parameters are way whacked.
| Honestly, even running a model of that size at all is the big
| achievement here, and getting the accuracy to actually reach the
| benchmark is the beeg next step nao, I believe. <3 :'))))
| PreachSoup wrote:
| Could call it Slim LLaMa
| eganist wrote:
| SLLaMa?
| falcor84 wrote:
| Isn't any LLM mathematically a Markov chain, such that the
| current state includes the context of the last (finite) n
| tokens?
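| (Concretely: with a fixed context window of n tokens, the next-token
| distribution depends only on the current window, which is exactly
| the Markov property over the state s_t = (x_{t-n+1}, ..., x_t):
|
|     P(x_{t+1} | x_1, ..., x_t) = P(x_{t+1} | x_{t-n+1}, ..., x_t)
|
| just with an astronomically large state space.)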
| sottol wrote:
| Afaik, the Llama sampler needs to be tuned to get more sensible
| outputs.
|
| https://twitter.com/theshawwn/status/1632569215348531201
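|
| For reference, the knobs being tuned there (temperature, top-k,
| etc.) control a sampling step roughly like the sketch below. This is
| a hand-written illustration rather than llama.cpp's actual sampler
| code; the function name and the assumption that k <= vocab size are
| mine.
|
|     // Scale logits by temperature, keep the k most likely tokens,
|     // softmax over them, and draw one token id.
|     #include <algorithm>
|     #include <cmath>
|     #include <random>
|     #include <vector>
|
|     int sample_top_k(std::vector<float> logits, float temperature,
|                      int k, std::mt19937 &rng) {
|         // Temperature < 1 sharpens the distribution, > 1 flattens it.
|         for (float &l : logits) l /= temperature;
|
|         // Indices of the k highest-logit tokens.
|         std::vector<int> idx(logits.size());
|         for (size_t i = 0; i < idx.size(); ++i) idx[i] = (int)i;
|         std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
|             [&](int a, int b) { return logits[a] > logits[b]; });
|
|         // Unnormalized softmax weights over the survivors;
|         // discrete_distribution normalizes them for us.
|         std::vector<float> w(k);
|         for (int i = 0; i < k; ++i)
|             w[i] = std::exp(logits[idx[i]] - logits[idx[0]]);
|         std::discrete_distribution<int> dist(w.begin(), w.end());
|         return idx[dist(rng)];
|     }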
| simonw wrote:
| That's why Alpaca is so exciting: it instruction-tunes LLaMA to
| the point that even the tiny 7B model (the one that fits on a
| phone) produces useful output:
| https://simonwillison.net/2023/Mar/13/alpaca/
| MagicMoonlight wrote:
| But they won't give us the model... so it's ultimately
| meaningless because they'll just sell out
| itake wrote:
| My understanding is they legally can't. It was trained using
| OpenAI's output, and OpenAI doesn't allow using their output to
| train new models. Someone would need to find another data source to
| fine-tune LLaMA.
| hawski wrote:
| Why can they train their model on copyrighted data, claiming fair
| use because they don't outright copy it, while disallowing training
| other models on their output? I understand revoking access, though.
| alwayslikethis wrote:
| * * *
| Zetice wrote:
| What will OpenAI do, sue? Okay but now it's out there.
| nwoli wrote:
| Also, how could that even be under protection? As if they haven't
| been scraping copyrighted materials and sites with end-user
| agreements to train the model in the first place.
| TaylorAlexander wrote:
| I believe the work was done by Stanford, so OpenAI could
| revoke Stanford's access to their API. That would inhibit
| Stanford's ability to do new research with this system.
| dwallin wrote:
| You don't need to find a new data source, you just need to find an
| unencumbered third party. You can use the data publicly provided in
| the git repo as long as you haven't signed an agreement with OpenAI
| yourself.
| sdrg822 wrote:
| It's only a matter of time
| simonw wrote:
| If they don't release the model, recreating it doesn't look
| too hard. $100 worth of compute time to run the fine-
| tuning, and the training data they used is here:
| https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpac...
|
| That would have the same licensing problems that they have
| though: that alpaca_data.json file was created using GPT3.
| But creating a "clean" training set of 52,000 examples
| doesn't feel impossible to me for the right group.
| dwallin wrote:
| You're only bound by the terms of OpenAI's agreement if you agreed
| to the terms of use. If a third party obtained the data without
| signing an agreement with OpenAI (e.g. by just downloading it from
| that repo), they are under no obligation to refrain from using it
| to compete with OpenAI. It is fair use by the same argument OpenAI
| itself uses to train its own models on publicly available data.
| throwaway1851 wrote:
| I've been playing with the Alpaca demo, and I'm really
| impressed! The outputs are generally excellent, especially
| for a model of that size, fine tuned on a $100 (!!) compute
| budget.
|
| If the cloud of uncertainty around commercial use of
| derivative weights from LLaMA can be resolved, I think this
| could be the answer for a lot of domain-specific generative
| language needs. A model you can fine tune on your own data,
| and which you host and control, rather than depending on a
| cloud service not to arbitrarily up prices/close your
| account/apply unhelpful filters to the output/etc.
| jdright wrote:
| How can one tune the model for a specific use case? Is there
| somewhere that teaches this?
| Havoc wrote:
| Any more details? I'm guessing they're leveraging the NPU in the
| pixel?
| saidinesh5 wrote:
| I think they are using llama.cpp without any NPU/TPU patches.
| By default it only runs on CPU with support for various SIMD
| extensions.
|
| https://github.com/ggerganov/llama.cpp
| zodester wrote:
| It uses the ARM NEON extensions to the instruction set for SIMD
| (as far as I understand).
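|
| Roughly what NEON buys here: the hot loops are dot products, and the
| 128-bit registers process four floats per instruction. A simplified
| AArch64 sketch (ggml's real kernels work on quantized blocks and are
| considerably more involved):
|
|     #include <arm_neon.h>
|
|     // Dot product, four float lanes at a time.
|     // Assumes n is a multiple of 4.
|     float dot_neon(const float *a, const float *b, int n) {
|         float32x4_t acc = vdupq_n_f32(0.0f);
|         for (int i = 0; i < n; i += 4) {
|             float32x4_t va = vld1q_f32(a + i);
|             float32x4_t vb = vld1q_f32(b + i);
|             acc = vfmaq_f32(acc, va, vb);  // acc += va * vb per lane
|         }
|         return vaddvq_f32(acc);  // horizontal sum of the 4 lanes
|     }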
| a-dub wrote:
| would be even cooler if it employed the accelerator!
|
| (unless this ggml library is doing that under the hood)
|
| i assume it has unified memory, but maybe not little numbers...
| beiller wrote:
| Here is a thread on tweaking the parameters, which the model seems
| very sensitive to:
|
| https://github.com/ggerganov/llama.cpp/issues/129
| nico wrote:
| Could the model itself be used to tweak its own parameters
| iteratively?
| OscarCunningham wrote:
| This would be useful for predictive text. That's exactly what
| LLMs are actually built for.
| a-dub wrote:
| LSTMs have been in the Google keyboard for years...
| __mharrison__ wrote:
| I'm waiting until it runs on my C64...
| [deleted]
| tosh wrote:
| Did anyone get this to run on an iPhone or in a browser yet?
| pzo wrote:
| Most iPhones have only 4GB of RAM (even the latest iPhone 14 has
| only 6GB). The Pixel 6 has 8GB. But the bigger issue is that iOS
| still limits how much RAM your app can use and might kill your app.
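|
| Back-of-the-envelope: the 4-bit 7B model is roughly
|
|     7 x 10^9 params x 0.5 bytes/param ≈ 3.5 GB
|
| of weights alone, before the KV cache and everything else, so a
| 4-6 GB iPhone with a per-app memory cap is genuinely tight, while
| 8 GB on the Pixel 6 leaves some headroom.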
| londons_explore wrote:
| I'm still amazed that Apple invests so much into every other
| bit of hardware on a high end phone, yet always gives you the
| bare minimum amount of RAM they can get away with.
|
| There are so many use cases (like this) that require more
| RAM. And even if a use case doesn't theoretically require
| more RAM, getting a developer to dedicate time to optimizing
| RAM is time taken away from making a wonderful app.
| squarefoot wrote:
| > I'm still amazed that Apple invests so much into every
| other bit of hardware on a high end phone, yet always gives
| you the bare minimum amount of RAM they can get away with.
|
| Advanced hardware makes for bullet points in advertising to sell
| the device; giving the bare minimum of RAM accelerates the device's
| planned obsolescence, so that users will be forced to upgrade to
| the next model sooner.
| szundi wrote:
| How is it that an iPhone 7 is still completely current, versus a
| branded Android from the same release year? Show me one that even
| gets security updates, let alone new features.
| alden5 wrote:
| The problem with iPhones is that once updates stop there's nothing
| you can do. The iPhone 7 isn't current; it's stuck on iOS 15 while
| the newest is 16. And while the Pixel 2 (which is only a month
| younger than the iPhone 7) only got official support up to Android
| 11, you actually own the device and can easily unlock the
| bootloader to upgrade to Android 13.
| thewataccount wrote:
| Apple still does security updates for iOS - the last was 12.5.7 on
| 23 Jan 2023 - and that goes back to the iPhone 5S.
|
| Feature updates with the current iOS 16 go back to the iPhone 8.
|
| Yeah, you do lose feature updates, and slowly app support once the
| latest version drops your device, but it's not like they're
| dropping support after 2 years, and you can stay on it for years
| afterwards if you'd like.
|
| I'm not saying it couldn't be better, but they're clearly far above
| the vast majority of their competition.
| thewataccount wrote:
| I personally don't buy that it's planned obsolescence. I think most
| people just don't need that much RAM. iOS is really good at
| loading/unloading stuff as needed; outside of HN I'm not sure most
| consumers care about the exact amount of RAM.
|
| Apple still does security updates for iOS - the last was 12.5.7 on
| 23 Jan 2023 - and that goes back to the iPhone 5S.
|
| They've literally provided security updates for a 10-year-old
| device; has any competitor even come close to that?
| syntaxing wrote:
| Does this in theory mean it should be relatively easy to port to a
| Coral TPU?
| saidinesh5 wrote:
| All their tensor/math magic seems to happen in
| https://github.com/ggerganov/llama.cpp/blob/master/ggml.h .
|
| So maybe if you reimplemented ggml.c with TensorFlow/libcoral you'd
| have a chance.
| sottol wrote:
| Afaik that TPU has only 8MB of RAM to fit models; you'd have to
| continuously stream the weights, and I can't imagine that's
| workable.
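|
| To put numbers on it: ~3.5 GB of 4-bit weights against 8 MB of
| on-chip memory means re-streaming essentially the whole model for
| every generated token, so the interconnect, not the TPU, would
| almost certainly be the bottleneck.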
| superkuh wrote:
| Until it thermally throttles 40 seconds later. But yeah, it's really
| cool how many platforms the vanilla llama.cpp code can easily be
| compiled on. And somehow I doubt they did the quantization step on
| the Pixel itself. My favorite was the person who did it on the
| RPi 4. I know a guy working on getting it going on the RPi 3, but
| the ARMv7/v8 mixing, NEON support, and 64-bit ARM intrinsics are
| apparently non-trivial to convert.
| aquajet wrote:
| I'm the original tweet author.
|
| Currently typing this from my Pixel after running it countless
| times :)
| sebzim4500 wrote:
| >And somehow I doubt they did the quantization step on the
| Pixel itself
|
| You're probably right (because why would they?) but I don't see
| any reason they couldn't have done this if they wanted to.
| aquajet wrote:
| I would have tried it but I didn't have enough storage on my
| phone to hold both the original and quantized weights.
| nshm wrote:
| It is not really LLaMA, it is LLaMA quantized to 4-bit, which is not
| even the quality of the original 7B. I could also quantize it to
| 1 bit and claim it runs on my RPi 3.
| mrWiz wrote:
| The 4 bit quantization performs well, though. Does your 1 bit
| version?
| tbalsam wrote:
| 1 bit is mathematically guaranteed to be more efficient in
| performance-per-parameter, so to me it's a pretty clear eventuality
| one day, but I think the relative performance % will still likely
| tank. Honestly impressed that it held up so well at 4 bit tbh; I
| personally thought 8 bit was the ceiling.
|
| However, I can see fractional bits (via binary representations) and
| larger models happening before that compression step.
|
| And then we have the sub-bit range..... ;DDDD
| nshm wrote:
| Do you have the numbers? I suspect it is way worse. The original
| llama.cpp authors never measured any numbers either.
| sottol wrote:
| They're using GPTQ -- here you go:
| https://arxiv.org/abs/2210.17323 . The authors benchmarked two
| families of models over a wide range of parameter counts.
| ddren wrote:
| llama.cpp is using RTN at the moment.
| minxomat wrote:
| Some numbers here: https://github.com/qwopqwop200/GPTQ-for-LLaMa#result
| ddren wrote:
| The Python implementation[1] ran some tests using the same
| quantization algorithm as llama.cpp (4-bit RTN).
|
| 1: https://github.com/qwopqwop200/GPTQ-for-LLaMa
| nshm wrote:
| Great, thanks a lot.
|
| So we have numbers on PTB: original perplexity 8.79, quantized
| 9.68 - already 10% worse. And PPL is reported per token, I suppose?
| Because word-level PPL for PTB must be around 20, not less than 10.
|
| Any numbers on more complex tasks then, like QA?
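|
| (For reference, perplexity is the exponentiated average negative
| log-likelihood per unit,
|
|     PPL = exp( -(1/N) * sum_i log p(x_i | x_{<i}) ),
|
| so the same model scores differently per token than per word:
| word-level PPL = token-level PPL^(N_tokens / N_words).)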
| elbigbad wrote:
| The quantization to four bits doesn't have that much effect on the
| output. 1 bit might not either, but someone would need to do some
| testing before making the claim that "1 bit ... runs on my RPI3",
| because "runs" is a bit overloaded here to mean "runs and produces
| sensible output." I think you're missing that overloading.
| nwoli wrote:
| It should also be mentioned that it isn't really that each weight
| is a 4-bit float, but rather that they're basically clustering the
| floats into 2^4 clusters and then grabbing from a lookup table the
| float associated with a 4-bit value as needed. So as long as the
| weights roughly fall into 16 clusters you'll get nearly identical
| results.
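|
| A minimal sketch of that idea: keep a 16-entry lookup table of
| representative values and store one 4-bit index per weight. (The
| codebook construction below just uses 16 evenly spaced levels for
| brevity; a real clustering would use something like 1-D k-means,
| and llama.cpp's Q4_0 actually uses a simpler per-block scale with
| round-to-nearest, as noted elsewhere in the thread.)
|
|     #include <algorithm>
|     #include <array>
|     #include <cmath>
|     #include <cstdint>
|     #include <vector>
|
|     struct Quantized4Bit {
|         std::array<float, 16> codebook;    // 16 representative values
|         std::vector<std::uint8_t> index;   // one 4-bit index per weight
|     };
|
|     // Assumes w is non-empty.
|     Quantized4Bit quantize_4bit(const std::vector<float> &w) {
|         Quantized4Bit q;
|         // Simplification: 16 evenly spaced levels between min and max.
|         float lo = *std::min_element(w.begin(), w.end());
|         float hi = *std::max_element(w.begin(), w.end());
|         for (int i = 0; i < 16; ++i)
|             q.codebook[i] = lo + (hi - lo) * i / 15.0f;
|         // Map every weight to its nearest representative value.
|         q.index.reserve(w.size());
|         for (float x : w) {
|             int best = 0;
|             for (int i = 1; i < 16; ++i)
|                 if (std::fabs(x - q.codebook[i]) <
|                     std::fabs(x - q.codebook[best]))
|                     best = i;
|             q.index.push_back((std::uint8_t)best);
|         }
|         return q;
|     }
|
|     // Dequantization is a table lookup per weight.
|     inline float dequantize(const Quantized4Bit &q, size_t i) {
|         return q.codebook[q.index[i]];
|     }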
| renewiltord wrote:
| I used the 7B quantized to 4 bit and it needs a few tries for
| most things, but it's not useless.
| alden5 wrote:
| I haven't noticed 4-bit quantization affecting the quality of
| LLaMA-7B; it produces very coherent outputs. The trick is having a
| good example in your prompt so it has a good idea of what's
| expected of it.
___________________________________________________________________
(page generated 2023-03-15 23:01 UTC)