[HN Gopher] LeCun: Qualcomm working with Meta to run Llama-2 on ...
       ___________________________________________________________________
        
       LeCun: Qualcomm working with Meta to run Llama-2 on mobile devices
        
       Author : birriel
       Score  : 97 points
       Date   : 2023-07-23 15:58 UTC (7 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | api wrote:
        | ... and the Great Cosmic Mind told us its name, and it was
       | Llama. Of its origin or the reason for its name it recalled
       | nothing more than torrents and hugging faces and a being called
       | TheBloke. The Llama simply was, is, and shall be, until it shall
       | witness the heat death of the universe as fragments huddled
       | around the last evaporating black holes.
        
       | tamimio wrote:
        | And let me guess, it will be used to intelligently identify and
        | track users? Fecebook is desperate now to harvest more data from
        | anyone, even people who decided not to use any Fecebook
        | products...
        
       | superkuh wrote:
        | Great? The community already did it with llama.cpp. Knowing the
        | memory bandwidth bottleneck, I can't imagine phones are going to
       | do very well. But hey, llamas (1 and 2) run on rpi4, so it'll
       | work. Just really, unusably, slow.
        
         | wyldfire wrote:
         | The work involved probably includes porting to the Snapdragon
         | NSP for throughput and efficiency's sake.
         | 
         | For LLMs the biggest challenge is addressing such a large model
         | - or finding a balance between the model size and its
         | capability on a mobile device.
        
         | pavlov wrote:
         | If only someone could convince a CPU company to optimize the
         | chips for this workload. Oh, wait...
        
           | smoldesu wrote:
           | Like ARM? https://github.com/ARM-software/armnn
           | 
            | Optimization for this workload has arguably been in progress
            | for decades. AVX instructions can be found in laptops that
            | are a decade old now, and most big inferencing projects
           | are built around SIMD or GPU shaders. Unless your computer
           | ships with onboard Nvidia hardware, there's usually not much
           | difference in inferencing performance.
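            | 
            | For illustration, the hot loop of LLM inference is mostly
            | repeated matrix-vector products, which is exactly what SIMD
            | units accelerate. A minimal sketch using numpy, whose BLAS
            | backend dispatches to AVX/NEON where available (the
            | dimensions are illustrative):
            | 
            |     import numpy as np
            | 
            |     d = 4096  # hidden size of a 7B-class model (assumed)
            |     W = np.random.rand(d, d).astype(np.float32)  # one weight matrix
            |     x = np.random.rand(d).astype(np.float32)     # activation vector
            |     y = W @ x  # SIMD-accelerated matvec, no GPU required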
        
             | pavlov wrote:
             | Ultimately Qualcomm is the one who decides how to allocate
             | die area on their CPUs, right? So it can't exactly hurt if
             | this is a priority for them now.
        
               | smoldesu wrote:
                | Pretty much all of Qualcomm's SoCs are built using stock
                | ARM core designs. ARMnn is optimized for multicore
                | A-series chips, which covers everything from the
                | Snapdragon 410 to the 888 (~2014 to the present day).
        
         | mgraczyk wrote:
         | I think you'd be surprised by what's possible on mobile chips
         | these days. They aren't going to be running the 70B model at
          | usable speeds, but I think with enough optimization it should
         | be possible to run the 7B and 13B models on device
         | interactively. With quantization you can fit those models in
         | less than 8GB of RAM.
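          | 
          | A rough footprint calculation (a sketch; the ~20% overhead
          | factor for KV cache and activations is an assumption):
          | 
          |     def quantized_size_gb(params_billion, bits_per_weight,
          |                           overhead=1.2):
          |         # weights, plus a rough allowance for KV cache/activations
          |         return params_billion * bits_per_weight / 8 * overhead
          | 
          |     print(quantized_size_gb(7, 4))   # ~4.2 GB
          |     print(quantized_size_gb(13, 4))  # ~7.8 GB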
        
           | treprinum wrote:
            | The chips are capable, but this is a question of battery and
            | heat: llama.cpp on a phone quickly makes it hot and drains
            | the battery.
        
           | superkuh wrote:
            | The rate of token output is bottlenecked by the time it takes
            | to transfer the model between RAM and CPU, not the time it
            | takes to do the multiplication operations. If you have the
            | latest and greatest mobile phone with 8GB (or 12GB) of LPDDR5
            | on a Snapdragon 8 Gen 2, you still only have 8.5 Gbps memory
            | bandwidth (max, less in actual phones running it at slower
            | speeds). That's about 1 GB/s. So if your model is a 4-bit 7B
            | parameter model that's 4GB in size, it'll take at _least_ 4
            | seconds per token generated. That is _SLOW_.
           | 
            | It doesn't matter that the Snapdragon 8 Gen 2 has "AI" tensor
            | cores or any of that. Memory bandwidth is the bottleneck for
            | LLMs. Phones have never needed HPC-like memory bandwidth and
            | they don't have it. If Qualcomm is actually addressing this
            | issue that'd be amazing, but I highly doubt it: memory
            | bandwidth costs $$$, burns massive power, and takes
            | volume/space that isn't available in the form factor.
           | 
           | Do you know of a smartphone that has more than 1GB/s of
           | memory bandwidth? If so I will be surprised. Otherwise I
           | think it is you who will be surprised how specialized their
           | compute is and how slow they are in many general purpose
           | computing tasks (like transferring data from RAM).
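            | 
            | The parent's estimate as a short sketch (the 1 GB/s figure
            | is the parent's assumption, not a measured value):
            | 
            |     # Bandwidth-bound decoding: every generated token streams
            |     # all the weights from RAM through the CPU once.
            |     model_size_gb = 4.0   # 7B parameters at 4-bit quantization
            |     bandwidth_gb_s = 1.0  # assumed effective memory bandwidth
            |     s_per_token = model_size_gb / bandwidth_gb_s
            |     print(f"{s_per_token:.1f} s/token")  # 4.0 s/token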
        
         | refulgentis wrote:
         | re community already did this:
         | 
         | People are unreasonably attracted to things that are "minimal",
         | at least 3 different local LLM codebase communities will tell
         | you _they_ are the minimal solution.[1]
         | 
         | It's genuinely helpful to have a static target for technical
         | understanding. Other projects end up with a lot of rushed
         | Python defining the borders in a primordial ecosystem with too
         | many people too early.
         | 
         | [1] Lifecycle: A lone hacker wants to gain understanding of the
          | complicated world of LLMs. They implement some suboptimal, but
         | code golfed, C code over the weekend. They attract a small
         | working group and public interest.
         | 
         | Once the working group is outputting tokens, it sees an
         | optimization.
         | 
         | This is landed.
         | 
         | It is applauded.
         | 
         | People discuss how this shows the open source community is
         | where innovation happens. Isn't it unbelievable the closed
         | source people didn't see this?[2]
         | 
         | Repeat N times.
         | 
         | Y steps into this loop, a new base model is released.
         | 
         | The project adds support for it.
         | 
         | However, it reeks of the "old" ways. There's even CLI arguments
         | for the old thing from 3 weeks ago.
         | 
         | A small working group, frustrated, starts building a new, more
         | minimal solution...
         | 
         | [2] The closed source people did. You have their model, not
         | their inference code.
        
         | MuffinFlavored wrote:
          | Even on a platform where they are fast, I personally haven't
          | found a solid real-world use case for anything other than a
          | GPT-4-quality LLM. Am I missing something?
        
           | superkuh wrote:
           | Non-commercial entertainment. Which makes this move by
           | Qualcomm all the weirder. I agree, the llamas and all the
           | other foundational models and all of their fine-tunes are not
           | really useful for helping with real tasks that have a wrong
           | answer.
        
       | pera wrote:
       | Link to the article:
       | https://www.qualcomm.com/news/releases/2023/07/qualcomm-work...
        
         | qwertox wrote:
         | > "We applaud Meta's approach to open and responsible AI and
         | are committed to driving innovation and reducing barriers-to-
         | entry for developers of any size by bringing generative AI on-
         | device,"
         | 
          | Can someone explain to me why Meta's approach is responsible? I
          | mean, I applaud Meta for "open sourcing" the models, but don't
          | they contain potentially harmful data that can be accessed
          | without some kind of filter? Say, instructions on how to
          | efficiently overthrow a government?
        
       | synaesthesisx wrote:
       | This is pretty much to go head-to-head with Apple (and Samsung).
        | Both are making great strides using "neural coprocessors" and
       | the like for running models on mobile hardware. Mobile edge
       | computing is where we're going to see a lot of use cases that
       | enable functionality while maintaining data privacy &
       | performance.
       | 
        | Keep in mind "mobile devices" extends beyond smartphones to
        | wearables/headsets as well.
        
       | givemeethekeys wrote:
       | Sounds like a pump right before earnings.
        
         | code_runner wrote:
         | I don't even think the purpose for this is known. Not sure how
         | this would impact earnings at all. Meta doesn't even
         | manufacture a phone.
        
           | objclxt wrote:
           | > Meta doesn't even manufacture a phone
           | 
           | Quest runs off Qualcomm chipsets, although in terms of actual
           | units shipped Quest is a rounding error for QC.
        
       | rvz wrote:
        | That's the reason why OpenAI.com was running to governments in a
        | panic to stop and slow the rapid pace of $0, freely downloadable
        | AI models and LLMs in the first place, since with Llama 2 anyone
        | can have a GPT-3.5-class model in their hands and use it
        | anywhere.
        | 
        | A year ago, many believed that Meta was going to destroy itself
        | as the stock went below $90 in peak fear. Now it looks like Meta
        | is winning the race to zero in AI and all OpenAI.com can do is
        | just sit there and watch their cloud-based AI decline in
        | performance and run to fix outage after outage.
       | 
       | No outage(s) when your LLM is on-device.
        
       | redox99 wrote:
        | I don't get the point. There is just no way you'll be able to run
        | Llama 2 70B. And Llama 2 13B, although cool, is much, much dumber
        | than GPT-3.5. I don't think it's useful as a ChatGPT-style
        | assistant.
       | 
        | Maybe in the future we'll get very advanced models with that
        | number of parameters, but running the current Llama 2 13B on a
        | mobile device doesn't seem too useful IMO.
        
       | yreg wrote:
        | I really hope Apple doesn't mess this up and includes a solid
        | on-device LLM in iOS in the near future.
        | 
        | They have amazing chips, but Siri has been a subpar assistant
        | since forever. Now is the time to redeem her.
        
         | asadm wrote:
         | They are adding their first LLM in iOS 17's keyboard.
        
           | cubefox wrote:
           | "L"LM. No way this model will be "large" in any modern sense.
           | For mobile devices both RAM and power consumption are very
           | limited.
        
           | yreg wrote:
           | That's just for autocorrect, right?
           | 
           | I would like to see an LLM in Siri and eventually even have
           | it interact with/control the rest of the system.
           | 
           | Ideally with Whisper-level speech recognition of course.
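            | 
            | For reference, local transcription with the openai-whisper
            | package is only a few lines (a sketch; the model choice and
            | file name are placeholders):
            | 
            |     import whisper  # pip install openai-whisper
            | 
            |     model = whisper.load_model("base")        # small, CPU-friendly
            |     result = model.transcribe("request.wav")  # illustrative path
            |     print(result["text"])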
        
             | coder543 wrote:
             | Apple is supposedly bringing a better, transformer-based
             | speech recognition model to iOS 17 as well, although I
             | don't think either of these transformer models would be
             | classified as a _large_ language model.
             | 
              | Link to the announcement timestamp:
              | https://developer.apple.com/videos/play/wwdc2023/101/?time=1...
        
             | bigfudge wrote:
              | Apple's speech recognition is pretty good, at least for me.
              | I always assumed the delta was because it does it in near
              | real time, which is not possible with Whisper.
        
               | yreg wrote:
               | I don't have a great accent, but Whisper understands me
               | >99%. So do my colleagues.
               | 
               | I've tried to talk to ChatGPT through a Siri shortcut for
               | a day and Siri transcribed pretty much all of my requests
               | wrong, to the point that GPT seldom understood what I
                | wanted.
               | 
               | Even the _Hey Siri ... Ask ChatGPT_ trigger phrase fails
               | ~50% of the time for me.
        
               | [deleted]
        
               | coder543 wrote:
               | Siri speech recognition is consistently terrible compared
               | to the alternatives, in my experience. Google and
               | Microsoft have much better speech recognition technology.
               | 
                | Whisper is phenomenal compared to Siri and arguably even
                | to what Google and Microsoft use, and no, there is
                | nothing that stops Whisper from being used in real time.
                | I can run real-time Whisper on my iPhone using the Hello
                | Transcribe app, but the last time I tried it, the app
                | itself was too flawed to recommend except as a demo of
                | real-time transcription.
               | 
               | I am looking forward to trying out the new transcription
               | model that Apple is bringing to iOS 17.
        
           | gnicholas wrote:
           | This will be great. Hopefully it will be able to figure out
           | that it should capitalize my last name, which is on my
            | Contact card and those of a dozen of my relatives. When I
           | went to the Apple Store, they told me that I could either add
           | a custom autocorrect to my library, or reset all settings.
           | They did admit this was some sort of bug, and that it would
           | be massive overkill to reset all settings (lose all wifi
           | passwords, etc.).
        
             | coder543 wrote:
             | > or reset all settings
             | 
             | > that it would be massive overkill to reset all settings
             | (lose all wifi passwords, etc.).
             | 
             | I don't think that's what anyone was recommending...
             | 
             | Settings -> General -> Transfer or Reset iPhone -> Reset ->
             | Reset Keyboard Dictionary is almost certainly what they
             | were recommending.
             | 
             | What does resetting your keyboard dictionary have to do
             | with your wifi passwords?
        
               | gnicholas wrote:
               | Nope, the two employees I spoke to were talking about a
               | full reset (which affects network settings). Regardless
               | of what the Keyboard Dictionary says, the iPhone should
               | be autocompleting/capitalizing the last name of contacts,
               | and especially the owner's name.
        
               | coder543 wrote:
               | Why would that implicitly be "regardless" of what the
               | keyboard dictionary says? I would expect the learned
               | dictionary to be prioritized over other sources of
               | information, just as a practical matter, even if someone
               | might reasonably assume there are other things that
               | should be prioritized over it.
               | 
                | None of that explains how resetting _everything_ would
                | have any effect on capitalization of names if resetting
                | the keyboard dictionary wouldn't, and you didn't say
                | whether you tried resetting the keyboard dictionary.
        
         | cscurmudgeon wrote:
         | Otoh, Apple's monopolistic behavior implies that it will be
         | good for society if they mess up.
        
         | gnicholas wrote:
         | I'd love to have my iPhone communicate with a Mac Studio at my
         | house, for the heavy lifting. I realize this would be slower
         | than having on-device processing, but it would be much better
         | for battery life. And although I trust Apple's privacy more
         | than Google/FB, I'd still rather keep my AI interactions
         | separate from anyone's cloud.
        
           | jayd16 wrote:
           | > it would be much better for battery life.
           | 
           | I wonder what the numbers actually are for local compute on
            | custom hardware compared to firing up the wifi antenna to
           | make the remote request.
        
             | gnicholas wrote:
             | Yeah I have wondered about this. But seeing how an LLM
             | hammers my M2 MBA CPU for many seconds per request, I'm
             | guessing this would have a significant impact on a
             | smartphone battery.
        
           | smoldesu wrote:
           | You might be pleased to hear that nothing really stops you
           | from doing this today. If you ran Serge[0] on a Mac with
           | Tailscale, you could hack together a decently-accelerated
           | Llama chatbot.
           | 
           | [0] https://github.com/serge-chat/serge
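            | 
            | A sketch of the phone-side call, assuming a llama.cpp-style
            | HTTP server running on the Mac (the tailnet hostname and
            | port are placeholders):
            | 
            |     import requests
            | 
            |     resp = requests.post(
            |         "http://my-mac.tailnet.ts.net:8080/completion",  # hypothetical host
            |         json={"prompt": "Hello, llama!", "n_predict": 64},
            |     )
            |     print(resp.json()["content"])  # generated text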
        
             | gnicholas wrote:
             | I'm not technical enough to be able to hack this together,
             | but I do hope that enough other people have the same itch,
             | and are able to scratch it!
        
           | bob-09 wrote:
           | I'd love to see both options and a seamless transition
           | between them. An option tuned for home use that utilizes the
           | processing power of home devices and local networks, and an
           | option tuned for on-the-go use utilizing the iPhone/iPad
           | processors and mobile networks.
        
         | rvz wrote:
          | Of course they will add an on-device LLM and can afford to. It
          | doesn't cost them anything to integrate or train an AI, whether
          | it is a ConvNet or an LLM, and to jump into the race to zero
          | with Meta in on-device machine learning.
         | 
         | They have done it before and they will certainly do it again,
         | especially with Apple Silicon and CoreML.
         | 
          | The one that needs to worry the most is OpenAI.com, as they
          | rushed to the regulators to stop the adoption of powerful,
          | freely downloadable AI models. That shows that OpenAI.com
          | does not have any sort of 'moat' at all.
        
           | yreg wrote:
            | The question is whether the iPhones released in September are
            | already going to be ready for it.
            | 
            | They haven't mentioned LLMs at WWDC beyond keyboard
            | autocorrect (already mentioned by your sibling comment).
        
             | baq wrote:
             | No chance unless they were prescient early 2022. Hardware
             | cycles are too long for that otherwise.
        
               | yreg wrote:
                | Is it even necessary to make changes to the neural engine
               | though? Maybe something like increasing RAM (which is
               | rumoured) is enough.
        
               | baq wrote:
               | If there's one company which can wrangle its suppliers to
               | deliver 4x RAM capacity in the same form factor,
                | performance, and thermals, it's Apple, but they aren't
               | sorcerers, just ruthlessly efficient.
               | 
               | I'll be queueing at midnight for the first time ever if
               | I'm wrong.
        
         | sp332 wrote:
         | It's going to chew up at least 1 GB of storage space and RAM,
         | right? And probably kill the battery life to boot.
        
           | refulgentis wrote:
            | Yeah. People are playing tons of word games with this stuff,
            | e.g. Apple is saying it's shipping an LLM for the iOS 17
            | keyboard, and who knows what that means: it sounds great and
            | plausible unless you're familiar w/the nuts and bolts.
        
             | Tagbert wrote:
             | Apple is calling their typing correction a "transformer".
             | That is a component of LLMs, but Apple may not be using a
             | full LLM in that case. This feature seems like a sandbox
             | for them to try out some of this tech in the field while
             | they do work internally on more ambitious implementations.
             | 
              | Apple is also dogfooding an LLM tool internally, likely to
              | gain a better understanding of how these work in practice
              | and how people use them.
              | https://forums.macrumors.com/threads/apple-experimenting-wit...
        
               | astrange wrote:
                | An LLM is made entirely out of transformer blocks; you
                | could just call it a "large transformer model".
               | 
               | In this case it's a transformer model that is not
               | "large". So, an LM.
        
             | shwaj wrote:
             | Apple's not playing word games, because they didn't say
             | "LLM". They said that autocorrect will use "a transformer
             | language model, a state-of-the-art on-device machine
             | learning language model for word prediction", which is a
             | much more precise statement than what you attributed to
             | them.
             | 
             | This sounds totally plausible. It will be a much smaller
             | transformer model than ChatGPT, probably much smaller than
             | even GPT-2.
             | 
              | https://www.apple.com/newsroom/2023/06/ios-17-makes-iphone-m...
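              | 
              | For a sense of scale, the usual parameter-count
              | approximation (the keyboard-model dimensions below are
              | guesses, not Apple's figures):
              | 
              |     def transformer_params(n_layers, d_model, vocab_size):
              |         # ~12*d^2 per block (attention + MLP), plus embeddings
              |         return 12 * n_layers * d_model**2 + vocab_size * d_model
              | 
              |     print(transformer_params(12, 768, 50257))  # ~1.2e8 (GPT-2 small)
              |     print(transformer_params(4, 256, 32000))   # ~1.1e7 (keyboard-scale)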
        
           | jpalomaki wrote:
            | Apple TV devices are almost always connected and idle most of
            | the time. Maybe you could move processing to such a device,
            | if one is connected to the same Apple ID.
        
             | gnicholas wrote:
             | Yep, I'd love to have a semi-dedicated device in my home
             | that handled these sorts of requests. I'd even consider
             | buying a Mac mini, Studio, or other computer for this
             | purpose.
        
               | yreg wrote:
               | Would be cool, but I think it is improbable. Apple would
               | want such a key feature to be available to everyone and
               | less than 10% of iPhone users have Macs.
               | 
               | Unless they also made an option to run it in iCloud, but
               | offering so many options to do a thing doesn't sound very
               | Apple-like.
        
               | gnicholas wrote:
               | Agree. But it should be doable to set this up using open
               | LLMs, right? For example, using Siri to trigger a
               | shortcut that sends a prompt to the dedicated processing
               | device.
        
       | glimshe wrote:
       | [flagged]
        
         | Obscurity4340 wrote:
          | Isn't the reception to Llama/whatever it's called generally
         | positive? Is there something I'm missing in terms of some
         | shadowy endgame Meta built into it?
        
         | JPLeRouzic wrote:
          | Isn't it a challenge today to run a large LLM on a CPU/GPU like
          | those found in mobile phones?
          | 
          | I would have thought that the mere news that it might be
          | possible is good news?
        
         | [deleted]
        
         | logicchains wrote:
         | You're not a fan of PyTorch or React I guess?
        
       ___________________________________________________________________
       (page generated 2023-07-23 23:02 UTC)