[HN Gopher] LLaMa running at 5 tokens/second on a Pixel 6
       ___________________________________________________________________
        
       LLaMa running at 5 tokens/second on a Pixel 6
        
       Author : pr337h4m
       Score  : 175 points
       Date   : 2023-03-15 16:50 UTC (6 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
        | naillo wrote:
        | This is really cool but the output is such garbage at that weight
        | size that you might as well be running a Markov chain.
        
          | circuit10 wrote:
          | It's quite a bit bigger than GPT-2, which was a really big deal
          | not very long ago (remember the unicorn news article example,
          | and the slow release because it was apparently too powerful?).
        
          | lostmsu wrote:
          | From the video, the output seems fine.
          | 
          | But if it is a trimmed version, it is wrong to call it LLaMa.
        
            | refulgentis wrote:
            | It's nonsensical: a celeb announces they're going to rehab
            | and notes it (?) is an issue affecting all women, at least,
            | earlier today (??), and they also noted it wasn't drugs or
            | alcohol this time, but a life (???)
        
              | londons_explore wrote:
              | Without instruction tuning, the perfect language model
              | produces output with the same level of intelligibility as
              | random text from the training set. And the training set
              | probably has a lot of spam and junk in it.
        
              | ddren wrote:
              | What are you comparing it to? Without instruction tuning,
              | and with only the two-character prompt "He", I am not sure
              | why you would expect it to perform any better.
        
               | refulgentis wrote:
               | I was replying to a comment that said it "seems fine."
               | 
               | It does not seem fine.
               | 
               | It is incomprehensible and doesn't match the results I've
               | seen from 7B through 65B.
               | 
                | It is true that RLHF could improve it, and perhaps then
                | this severe an optimization will seem fine.
        
                | tbalsam wrote:
                | I've heard a number of people say (from earlier) that the
                | quantization and default sampling parameters are way
                | wacked. Honestly, even running a model of that size is
                | the big achievement here, and getting the accuracy to
                | actually reach the benchmark is the beeg next step nao, I
                | believe. <3 :'))))
        
           | PreachSoup wrote:
           | Could call it Slim LLaMa
        
             | eganist wrote:
             | SLLaMa?
        
          | falcor84 wrote:
          | Isn't any LLM mathematically a Markov chain, where the current
          | state is the context of the last (finite) n tokens?
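          | 
          | Roughly sketched (assuming a context window of n tokens): take
          | the state to be the last n tokens, so the next-token
          | distribution depends only on that state,
          | 
          |   P(x_{t+1} \mid x_1, \dots, x_t)
          |     = P(x_{t+1} \mid x_{t-n+1}, \dots, x_t),
          | 
          | which is the Markov property over the (enormous) state space
          | V^n.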
        
         | sottol wrote:
         | Afaik, the Llama sampler needs to be tuned to get more sensible
         | outputs.
         | 
         | https://twitter.com/theshawwn/status/1632569215348531201
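          | 
          | For a sense of what "tuning the sampler" means, here is a
          | minimal sketch of temperature + top-k sampling over the model's
          | logits (illustrative only; this is not llama.cpp's actual
          | sampler and the names are made up):
          | 
          |   #include <algorithm>
          |   #include <cmath>
          |   #include <random>
          |   #include <utility>
          |   #include <vector>
          | 
          |   // Scale logits by 1/temperature, keep the top_k largest,
          |   // softmax the survivors, then draw one token id.
          |   int sample_top_k(const std::vector<float>& logits, int top_k,
          |                    float temperature, std::mt19937& rng) {
          |       std::vector<std::pair<float, int>> items;
          |       items.reserve(logits.size());
          |       for (int i = 0; i < (int)logits.size(); ++i)
          |           items.push_back({logits[i] / temperature, i});
          |       if (top_k < (int)items.size()) {
          |           std::partial_sort(items.begin(), items.begin() + top_k,
          |               items.end(), [](const auto& a, const auto& b) {
          |                   return a.first > b.first;
          |               });
          |           items.resize(top_k);
          |       }
          |       float max_l = items[0].first;
          |       for (const auto& it : items)
          |           max_l = std::max(max_l, it.first);
          |       std::vector<float> probs;
          |       float sum = 0.f;
          |       for (const auto& it : items) {
          |           float p = std::exp(it.first - max_l);
          |           probs.push_back(p);
          |           sum += p;
          |       }
          |       for (auto& p : probs) p /= sum;
          |       std::discrete_distribution<int> dist(probs.begin(),
          |                                            probs.end());
          |       return items[dist(rng)].second;
          |   }
          | 
          | Lower temperature and a smaller top_k make the sampling more
          | conservative, which is the kind of knob being tuned here.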
        
         | simonw wrote:
         | That's why Alpaca is so exciting: it instruction-tunes LLaMA to
         | the point that even the tiny 7B model (the one that fits on a
         | phone) produces useful output:
         | https://simonwillison.net/2023/Mar/13/alpaca/
        
           | MagicMoonlight wrote:
           | But they won't give us the model... so it's ultimately
           | meaningless because they'll just sell out
        
              | itake wrote:
              | My understanding is they legally can't. It was trained
              | using OpenAI's output, and OpenAI doesn't allow using its
              | output to train new models. Someone would need to find
              | another data source to fine-tune LLaMA.
        
                | hawski wrote:
                | Why can they train their model on copyrighted data,
                | claiming fair use because they do not outright copy it,
                | while disallowing training other models on their output?
                | I understand revoking access, though.
        
               | alwayslikethis wrote:
               | * * *
        
               | Zetice wrote:
               | What will OpenAI do, sue? Okay but now it's out there.
        
                | nwoli wrote:
                | Also, how could that even be under protection? As if they
                | haven't been scraping copyrighted materials and sites
                | with end-user agreements to train the model in the first
                | place.
        
               | TaylorAlexander wrote:
               | I believe the work was done by Stanford, so OpenAI could
               | revoke Stanford's access to their API. That would inhibit
               | Stanford's ability to do new research with this system.
        
                | dwallin wrote:
                | You don't need to find a new data source, you just need
                | to find an unencumbered third party. You can use the data
                | publicly provided in the git repo as long as you haven't
                | signed an agreement with OpenAI yourself.
        
               | sdrg822 wrote:
               | It's only a matter of time
        
             | simonw wrote:
             | If they don't release the model, recreating it doesn't look
             | too hard. $100 worth of compute time to run the fine-
             | tuning, and the training data they used is here:
              | https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpac...
              | 
              | That would have the same licensing problems that they have
              | though: that alpaca_data.json file was created using GPT-3.
              | But creating a "clean" training set of 52,000 examples
              | doesn't feel impossible to me for the right group.
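              | 
              | For reference, alpaca_data.json is a JSON array of
              | instruction-following records, roughly of this shape (the
              | values here are illustrative):
              | 
              |   [
              |     {
              |       "instruction": "Give three tips for staying healthy.",
              |       "input": "",
              |       "output": "1. Eat a balanced diet. 2. Exercise ..."
              |     }
              |   ]
              | 
              | A "clean" replacement set would just need 52,000 examples
              | of that shape that weren't generated with GPT-3.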
        
               | dwallin wrote:
               | You're only bound by the terms of OpenAI's agreement if
                | you agreed to the terms of use. If a third party obtained
                | the data without signing an agreement with OpenAI (e.g.
                | by just downloading it from that repo) they are under no
                | obligation to refrain from using it to compete with
                | OpenAI. It is fair use by the same argument OpenAI itself
                | uses to train its own models on publicly available data.
        
           | throwaway1851 wrote:
           | I've been playing with the Alpaca demo, and I'm really
           | impressed! The outputs are generally excellent, especially
           | for a model of that size, fine tuned on a $100 (!!) compute
           | budget.
           | 
           | If the cloud of uncertainty around commercial use of
           | derivative weights from LLaMA can be resolved, I think this
           | could be the answer for a lot of domain-specific generative
           | language needs. A model you can fine tune on your own data,
           | and which you host and control, rather than depending on a
           | cloud service not to arbitrarily up prices/close your
           | account/apply unhelpful filters to the output/etc.
        
            | jdright wrote:
            | How can one tune the model to a specific usage? Is there
            | somewhere that teaches this?
        
       | Havoc wrote:
       | Any more details? I'm guessing they're leveraging the NPU in the
       | pixel?
        
         | saidinesh5 wrote:
         | I think they are using llama.cpp without any NPU/TPU patches.
         | By default it only runs on CPU with support for various SIMD
         | extensions.
         | 
         | https://github.com/ggerganov/llama.cpp
        
         | zodester wrote:
         | It uses the ARM NEON extensions to the instruction set for SIMD
         | (as far as I understand).
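          | 
          | For a rough idea of what that looks like, here is an
          | illustrative NEON-vectorized dot product in the spirit of
          | ggml's ARM inner loops (this is not the actual ggml kernel):
          | 
          |   #include <arm_neon.h>
          | 
          |   // Multiply-accumulate 4 floats per iteration, then reduce.
          |   float dot_f32_neon(const float* a, const float* b, int n) {
          |       float32x4_t acc = vdupq_n_f32(0.0f);
          |       int i = 0;
          |       for (; i + 4 <= n; i += 4) {
          |           acc = vmlaq_f32(acc, vld1q_f32(a + i),
          |                                vld1q_f32(b + i));
          |       }
          |       // vaddvq_f32 (AArch64) sums the four lanes.
          |       float sum = vaddvq_f32(acc);
          |       for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
          |       return sum;
          |   }
          | 
          | The real quantized kernels also unpack the 4-bit weights on the
          | fly before the multiply.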
        
       | a-dub wrote:
       | would be even cooler if it employed the accelerator!
       | 
       | (unless this ggml library is doing that under the hood)
       | 
        | i assume it has unified memory, but maybe it can't handle such
        | little number formats...
        
       | beiller wrote:
        | Here is a thread on tweaking the parameters, which the model
        | seems to be very sensitive to:
       | 
       | https://github.com/ggerganov/llama.cpp/issues/129
        
          | nico wrote:
          | Could the model itself be used to tweak its own parameters
          | iteratively?
        
       | OscarCunningham wrote:
       | This would be useful for predictive text. That's exactly what
       | LLMs are actually built for.
        
         | a-dub wrote:
         | LSTMs have been in the Google keyboard for years...
        
       | __mharrison__ wrote:
       | I'm waiting until it runs on my C64...
        
         | [deleted]
        
       | tosh wrote:
       | Did anyone get this to run on an iPhone or in a browser yet?
        
          | pzo wrote:
          | Most iPhones have only 4GB of RAM (even the latest iPhone 14
          | has only 6GB). The Pixel 6 has 8GB. But the bigger issue is
          | that iOS still limits how much RAM your app can use and might
          | kill your app.
        
           | londons_explore wrote:
           | I'm still amazed that Apple invests so much into every other
           | bit of hardware on a high end phone, yet always gives you the
           | bare minimum amount of RAM they can get away with.
           | 
           | There are so many use cases (like this) that require more
           | RAM. And even if a use case doesn't theoretically require
           | more RAM, getting a developer to dedicate time to optimizing
           | RAM is time taken away from making a wonderful app.
        
             | squarefoot wrote:
             | > I'm still amazed that Apple invests so much into every
             | other bit of hardware on a high end phone, yet always gives
             | you the bare minimum amount of RAM they can get away with.
             | 
              | Advanced hardware makes for bullet points in advertising
              | to sell the device; giving the bare minimum of RAM
              | accelerates the device's planned obsolescence, so that
              | users will be forced to upgrade to the next model sooner.
        
                | szundi wrote:
                | How is it that an iPhone 7 is still completely current,
                | while a branded Android from the same release year
                | doesn't even get security updates, let alone features?
        
                | alden5 wrote:
                | The problem with iPhones is that once updates stop
                | there's nothing you can do. The iPhone 7 isn't current;
                | it's stuck on iOS 15 while the newest is 16. And while
                | the Pixel 2 (which is only a month younger than the
                | iPhone 7) only got official support up to Android 11,
                | you actually own the device and can easily unlock the
                | bootloader to upgrade to Android 13.
        
                | thewataccount wrote:
                | Apple still does security updates for iOS - the last was
                | 12.5.7 on 23 Jan 2023 - and that goes back to the iPhone
                | 5S.
                | 
                | Feature updates with the current iOS 16 go back to the
                | iPhone 8.
                | 
                | Yeah, you do lose feature updates, and slowly app support
                | once the latest version drops your device, but it's not
                | like they're dropping support after 2 years, and you can
                | stay on it for years afterwards if you'd like.
                | 
                | I'm not saying it couldn't be better, but they're clearly
                | far above the vast majority of their competition.
        
                | thewataccount wrote:
                | I personally don't buy that it's planned obsolescence. I
                | think most people just don't need that much RAM. iOS is
                | really good at loading/unloading stuff as needed, and
                | outside of HN I'm not sure most consumers care about the
                | exact amount of RAM.
                | 
                | Apple still does security updates for iOS - the last was
                | 12.5.7 on 23 Jan 2023 - and that goes back to the iPhone
                | 5S.
                | 
                | They've literally provided security updates for a
                | 10-year-old device; has any competitor even come close
                | to that?
        
        | syntaxing wrote:
        | Does this in theory mean it should be relatively easy to port to
        | a Coral TPU?
        
         | saidinesh5 wrote:
         | All their tensor/math magic seems to happen in
         | https://github.com/ggerganov/llama.cpp/blob/master/ggml.h .
         | 
          | So maybe if you reimplement ggml.c with tensorflow/libcoral,
          | you'd have a chance.
        
          | sottol wrote:
          | Afaik that TPU has only 8MB of RAM to fit models in, so you'd
          | have to continuously stream the weights - I can't imagine
          | that's workable.
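          | 
          | Rough arithmetic (assuming the 4-bit 7B model): the weights
          | alone are about
          | 
          |   7 \times 10^9 \text{ params} \times 0.5 \text{ bytes}
          |     \approx 3.5\ \text{GB},
          | 
          | versus 8 MB of on-chip SRAM, so each full forward pass would
          | have to stream roughly 400x the SRAM capacity over the bus.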
        
       | superkuh wrote:
       | Until it thermally throttles 40 seconds later. But yeah, it's
       | really cool how many platforms the vanilla code in llama.cpp can
       | be easily compiled on. And somehow I doubt they did the
       | quantization step on the Pixel itself. My favorite was the person
        | who did it on the rpi4. I know a guy working on getting it going
        | on rpi3, but the ARMv7/8 mixing, NEON support, and 64-bit ARM
        | intrinsics are apparently non-trivial to convert.
        
         | aquajet wrote:
         | I'm the original tweet author.
         | 
         | Currently typing this from my Pixel after running it countless
         | times :)
        
         | sebzim4500 wrote:
         | >And somehow I doubt they did the quantization step on the
         | Pixel itself
         | 
         | You're probably right (because why would they?) but I don't see
         | any reason they couldn't have done this if they wanted to.
        
           | aquajet wrote:
           | I would have tried it but I didn't have enough storage on my
           | phone to hold both the original and quantized weights.
        
        | nshm wrote:
        | It is not really LLaMA, it is LLaMA quantized to 4 bits. Not
        | even the quality of the original 7B. I could also quantize it to
        | 1 bit and claim it runs on my RPi3.
        
         | mrWiz wrote:
         | The 4 bit quantization performs well, though. Does your 1 bit
         | version?
        
            | tbalsam wrote:
            | 1 bit is mathematically guaranteed to be more efficient in
            | performance-per-parameter, so to me it is a pretty clear
            | eventuality one day, but I think the relative performance %
            | will still likely tank. Honestly I'm impressed that it held
            | up so well at 4 bit tbh, I personally thought 8 bit was the
            | ceiling.
            | 
            | However, I can see fractional bits (via binary
            | representations) and larger models happening before that
            | compression step.
            | 
            | And then we have the sub-bit range..... ;DDDD
        
            | nshm wrote:
            | Do you have the numbers? I suspect it is way worse. The
            | original llama.cpp authors never measured any numbers
            | either.
        
              | sottol wrote:
              | They're using GPTQ -- here you go:
              | https://arxiv.org/abs/2210.17323 . The authors benchmarked
              | two families of models over a wide range of parameter
              | counts.
        
               | ddren wrote:
               | llama.cpp is using RTN at the moment.
        
              | minxomat wrote:
              | Some numbers here:
              | https://github.com/qwopqwop200/GPTQ-for-LLaMa#result
        
             | ddren wrote:
             | The python implementation[1] ran some tests using the same
             | quantization algorithm as llama.cpp (4 bit RTN).
             | 
             | 1: https://github.com/qwopqwop200/GPTQ-for-LLaMa
        
                | nshm wrote:
                | Great, thanks a lot.
                | 
                | So we have numbers on PTB: original perplexity 8.79,
                | quantized 9.68 - already 10% worse. And the PPL is
                | reported per token, I suppose? Because word-level PPL
                | for PTB must be around 20, not less than 10.
                | 
                | Any numbers on more complex tasks, then, like QA?
        
          | elbigbad wrote:
          | The quantization to four bits doesn't have that much effect on
          | the output. 1 bit might not either, but someone would need to
          | do some testing before making the claim that "1 bit ... runs
          | on my RPI3", because "runs" here is overloaded to mean "runs
          | and produces sensible output." I think you're missing that
          | overloading.
        
            | nwoli wrote:
            | It should also be mentioned that it isn't really that each
            | weight is a 4-bit float; rather, they're basically
            | clustering the floats into 2^4 clusters and then, as needed,
            | grabbing from a lookup table the float associated with a
            | 4-bit value. So as long as the weights roughly fall into 16
            | clusters you'll get identical results.
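            | 
            | As a simplified sketch of 4-bit blockwise quantization (the
            | general idea, not the exact llama.cpp q4_0 layout): each
            | block of 32 floats stores one scale plus 32 4-bit indices,
            | and dequantization maps each index back through the uniform
            | 16-entry grid implied by that scale.
            | 
            |   #include <algorithm>
            |   #include <cmath>
            |   #include <cstdint>
            | 
            |   struct Q4Block {
            |       float scale;         // per-block scale
            |       std::uint8_t q[16];  // 32 x 4-bit indices, 2 per byte
            |   };
            | 
            |   Q4Block quantize_block(const float* x /* 32 values */) {
            |       Q4Block b{};
            |       float amax = 0.f;
            |       for (int i = 0; i < 32; ++i)
            |           amax = std::max(amax, std::fabs(x[i]));
            |       b.scale = amax / 7.0f;  // grid spans [-7, +7] * scale
            |       float inv = b.scale != 0.f ? 1.f / b.scale : 0.f;
            |       for (int i = 0; i < 32; i += 2) {
            |           // Round to the nearest grid point, offset to [0, 15].
            |           int q0 = (int)std::lround(x[i] * inv) + 8;
            |           int q1 = (int)std::lround(x[i + 1] * inv) + 8;
            |           q0 = std::min(std::max(q0, 0), 15);
            |           q1 = std::min(std::max(q1, 0), 15);
            |           b.q[i / 2] = (std::uint8_t)(q0 | (q1 << 4));
            |       }
            |       return b;
            |   }
            | 
            |   float dequantize(const Q4Block& b, int i) {
            |       int nib = (b.q[i / 2] >> ((i & 1) * 4)) & 0xF;
            |       return (nib - 8) * b.scale;  // back to a float
            |   }
            | 
            | How closely the weights in a block land on those 16 grid
            | points is what decides how much quality the 4-bit version
            | loses.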
        
         | renewiltord wrote:
         | I used the 7B quantized to 4 bit and it needs a few tries for
         | most things, but it's not useless.
        
          | alden5 wrote:
          | I haven't noticed 4-bit quantization affecting the quality of
          | LLaMA-7B; it produces very coherent outputs. The trick is
          | having a good example in your prompt so it has a good idea of
          | what's expected of it.
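          | 
          | For instance, a short made-up few-shot prompt along these
          | lines tends to keep the base model on track:
          | 
          |   Q: What is the capital of France?
          |   A: Paris.
          |   Q: What is the capital of Japan?
          |   A:
          | 
          | The model then mostly continues the pattern instead of
          | wandering off.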
        
       ___________________________________________________________________
       (page generated 2023-03-15 23:01 UTC)