[HN Gopher] MobileLLM: Optimizing Sub-Billion Parameter Language...
       ___________________________________________________________________
        
       MobileLLM: Optimizing Sub-Billion Parameter Language Models for On-
       Device Use
        
       Author : tosh
       Score  : 246 points
       Date   : 2024-07-09 11:48 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | sourcecodeplz wrote:
        | Nice, could one use this to train models for Windows PCs also? I
        | don't have a lot of RAM.
        
         | skiexperte wrote:
          | Training models is not OS-dependent. RAM usage depends on the
          | model size, and I would argue a model this small should be a
          | lot easier to fine-tune with less GPU RAM.
          | 
          | Nonetheless, the end goal will probably be downloading a model
          | like this (or paying for fine-tuning) and then running it
          | through an optimized neural chip.
          | 
          | It's currently more a question of when this will happen. The
          | newest Windows certification already requires a neural chip,
          | and even my Google Pixel 8 Pro can host small models (I know
          | the Pixel is not a cheap phone, but the coprocessor should
          | still be much more affordable than a big GPU).
        
       | mmastrac wrote:
       | > MobileLLM-125M/350M attains a remarkable 2.7%/4.3% accuracy
       | boost over preceding 125M/350M SoTA models on zero-shot
       | commonsense reasoning tasks
       | 
       | Small models, slightly improved, probably still not good enough
       | for the same use as online models. Nothing wrong with incremental
       | progress, however.
       | 
        | The 1.5B parameter model does seem to be a pretty decent step
        | up, even beating larger models by a wide margin. I'm not sure
        | why they didn't go larger -- having a more efficient model that
        | fits on hardware the size of the RPi could be a gamechanger
        | (IIRC TinyLlama 7B does run, barely).
        
         | phkahler wrote:
         | >> Small models, slightly improved, probably still not good
         | enough for the same use as online models. Nothing wrong with
         | incremental progress, however.
         | 
          | An even smaller language model should still be useful as part
          | of a speech-to-text system, which should benefit from using
          | the language model to narrow down which word was spoken in
          | the face of ambiguity or noise.
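          | 
          | A minimal sketch of that idea -- n-best rescoring, where a
          | small causal LM picks the most plausible transcript (the
          | model choice here is just an illustration):
          | 
          |     import torch
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          |     
          |     tok = AutoTokenizer.from_pretrained("gpt2")
          |     lm = AutoModelForCausalLM.from_pretrained("gpt2")
          |     
          |     def lm_score(text: str) -> float:
          |         # Mean negative log-likelihood; lower = more plausible.
          |         ids = tok(text, return_tensors="pt").input_ids
          |         with torch.no_grad():
          |             return lm(ids, labels=ids).loss.item()
          |     
          |     # Ambiguous hypotheses from a noisy utterance:
          |     n_best = ["I want to recognize speech",
          |               "I want to wreck a nice beach"]
          |     best = min(n_best, key=lm_score)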
        
           | woodson wrote:
           | ASR systems already use language models during decoding,
           | though mostly not large decoder-only LLMs. However,
           | incorporating LLMs into ASR is currently at the center of a
           | lot of research, e.g. using a speech encoder like wav2vec 2.0
           | or the whisper encoder with a Qformer etc. and a LoRA adapter
           | on an LLM trained for ASR.
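            | 
            | Roughly, the connector looks like this (a pure-PyTorch
            | sketch; the dimensions are illustrative assumptions, not
            | from any particular paper):
            | 
            |     import torch.nn as nn
            |     
            |     class SpeechProjector(nn.Module):
            |         # Maps frozen speech-encoder frames (e.g. 768-d
            |         # wav2vec 2.0 features) into the LLM's embedding
            |         # space, so the LLM can attend to audio the same
            |         # way it attends to token embeddings.
            |         def __init__(self, d_speech=768, d_llm=2048):
            |             super().__init__()
            |             self.proj = nn.Sequential(
            |                 nn.Linear(d_speech, d_llm), nn.GELU(),
            |                 nn.Linear(d_llm, d_llm))
            |     
            |         def forward(self, x):  # (batch, frames, d_speech)
            |             return self.proj(x)  # (batch, frames, d_llm)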
        
             | omarelb wrote:
             | Really interested in this! Do you know of some good reading
             | in this area?
        
         | cjtrowbridge wrote:
          | Llama-3-8B runs fine on a Raspberry Pi.
        
           | inhumantsar wrote:
           | how fast is that for you?
        
         | choppaface wrote:
          | But imagine if these models were baked into your Instagram
          | app and then used for ad targeting with your own compute.
          | Facebook would then get to look at tons of other data at
          | lower cost (and much less litigation risk) to them.
          | 
          | In this application it's unfair to compare tiny models to
          | cloud models. Moreover, any incremental precision boost to
          | tiny models would be notable (and would directly translate
          | to revenue).
        
         | HanClinto wrote:
         | > I'm not sure why they didn't go larger -- having a more
         | efficient model that fits on hardware the size of the RPi could
         | be a gamechanger (IIRC TinyLlama 7B does run, barely).
         | 
         | I'm not sure that RPi is the right target for the next step of
         | local LLMs, and I think that it's worth considering web-
         | deployment on engines like WebLLM [1].
         | 
         | A 7B model may "run fine" on a Raspberry Pi, but I've
         | (personally) found 7B models to be a bit larger than I want to
         | download / run for web-based interfaces.
         | 
          | However, a solid 125M model is the sort of thing that I can
          | run on a webpage, and the time it takes to download to the
          | local user's browser (combined with my bandwidth costs) isn't
          | exorbitant.
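          | 
          | Back-of-the-envelope numbers (my own assumptions about
          | quantization, not from the repo):
          | 
          |     # Rough download-size arithmetic, assuming weights
          |     # dominate and bits-per-parameter as stated.
          |     for params in (125e6, 7e9):
          |         for bits in (8, 4):
          |             mb = params * bits / 8 / 1e6
          |             print(f"{params/1e6:.0f}M @ {bits}-bit ~= {mb:,.0f} MB")
          |     # 125M @ 4-bit ~= 63 MB; 7B @ 4-bit ~= 3,500 MB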
         | 
         | [1] https://github.com/mlc-ai/web-llm
        
       | zurfer wrote:
       | While this is interesting, I wonder what the use case is, other
       | than better autocomplete?
        
         | potatoman22 wrote:
          | It could power simple agents like Siri under the hood,
          | helping with natural language understanding, intent
          | classification, retrieval, and other agent tasks.
        
           | rvnx wrote:
           | Like the Rabbit R1 or Humane AI Pin
        
         | redox99 wrote:
          | A local agent like Siri that can do simple tasks and route
          | more complex requests.
        
         | skiexperte wrote:
          | Reading emails, replying to emails, scheduling tasks, using
          | APIs for services.
          | 
          | Basically everything that needs actions rather than
          | knowledge.
          | 
          | "Tell my wife I'm late" and it will use some configured magic
          | to talk to service xy and just do it.
          | 
          | Siri is very good at doing home automation without the
          | internet; the old Google assistant and Alexa absolutely were
          | not, and I don't think they were ever available offline.
          | 
          | This basically gives you a good, working local (local-first!)
          | assistant.
        
           | Narhem wrote:
            | It would be very nice to have my schedule automatically
            | managed by Siri. It already has a few nice features, but I
            | genuinely have trust issues, especially with AI.
        
             | lovethevoid wrote:
             | You can get very far with the Shortcuts app by the way.
             | Some examples: using your current location to estimate when
             | you should leave to get to your next meeting on your
             | calendar, letting those included in the calendar event know
             | you're running late. Highly highly recommend it, the
             | learning curve isn't much, a bunch of drag and drop!
        
         | throwthrowuknow wrote:
         | You could possibly fine tune it for narrow domain tasks like
         | they did with tiny-agent
         | https://bair.berkeley.edu/blog/2024/05/29/tiny-agent/
         | 
          | I like the approach that Apple seems to be taking with
          | fine-tuned small models that handle routine tasks and then
          | defer to larger off-device models for things they can't
          | confidently do. I imagine you could construct a training set
          | containing examples that should produce low-confidence
          | answers, add an output option that is essentially a "call
          | for help", and train the model to choose it in those cases.
          | Smaller models also mean you could have more running in
          | parallel and use another one to route requests to the
          | appropriate expert.
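          | 
          | A toy sketch of that deferral logic (all names and the
          | threshold are hypothetical):
          | 
          |     import torch
          |     
          |     CALL_FOR_HELP = 3  # extra "defer to big model" class
          |     
          |     def route(logits: torch.Tensor, threshold=0.7) -> str:
          |         # Defer when the small model is unconfident or
          |         # explicitly picks the help class.
          |         probs = torch.softmax(logits, dim=-1)
          |         conf, label = probs.max(dim=-1)
          |         helping = label.item() == CALL_FOR_HELP
          |         if helping or conf.item() < threshold:
          |             return "defer_to_cloud"
          |         return f"handle_locally:{label.item()}"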
        
         | Narhem wrote:
         | Probably hacking foreign intelligence codes.
        
         | nsonha wrote:
          | Use cases are those of LLMs, from a mobile UI (so every AI
          | use case there is), when you need privacy from big tech's AI
          | APIs.
          | 
          | I'm just so amazed by statements like "LLMs can ONLY be used
          | for autocomplete", like am I supposed to be impressed by the
          | snark?
        
           | nl wrote:
            | The question was more about the capability and knowledge
            | of a sub-1B LLM: at that size, what is it capable of
            | beyond excellent autocompletion?
        
         | barronli wrote:
          | It can be fine-tuned for device-related actions. In other
          | words, the small model can virtually inherit the capabilities
          | of your device's applications and services: it can dispatch a
          | user request expressed in natural language to those
          | applications and orchestrate them, and it can dispatch
          | requests beyond the device's capabilities to a cloud model.
          | This is powerful because it changes how you interact with
          | your devices.
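          | 
          | One way to picture that dispatch layer (a sketch with
          | made-up intents and handlers):
          | 
          |     # Model-predicted intents map to on-device actions;
          |     # anything unknown escalates to a cloud model.
          |     ACTIONS = {
          |         "set_alarm": lambda args: f"clock.set({args})",
          |         "send_message": lambda args: f"sms.send({args})",
          |     }
          |     
          |     def dispatch(intent: str, args: dict) -> str:
          |         handler = ACTIONS.get(intent)
          |         return handler(args) if handler else "escalate_to_cloud"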
        
         | syassami wrote:
         | https://www.meta.com/smart-glasses/
        
         | simion314 wrote:
          | I tested the Google AI on my phone: I had the browser open
          | and asked it to read the page to me, and it responded that it
          | does not have access to the internet.
          | 
          | So I would like an AI assistant that:
          | 
          | 1. can understand English and my native language;
          | 
          | 2. is aware that it runs on Android (or KDE/Linux) and can
          | understand commands like "open the Android Settings,
          | Application section", "read the page that is open in the
          | browser", or "read the text in the popup that just opened".
          | Basically, it should be integrated with the OS via public and
          | open APIs. Big AI companies could compete on selling us
          | better assistants, especially for multilingual people;
          | 
          | 3. is small: it should not know geography, history, music
          | bands, etc. For tasks where the user asks such questions,
          | there should be an option for the model to forward the
          | question to a search engine or even an online LLM.
        
       | Havoc wrote:
        | What apps can one currently use to run them on, say, an
        | iPhone? I'm only aware of the MLC one, which has literally
        | three old models.
        
         | 5cott0 wrote:
         | wat
         | 
         | https://huggingface.co/mlc-ai
        
           | Havoc wrote:
            | On my iPhone there doesn't seem to be an option to
            | download more.
            | 
            | I vaguely recall there being a button initially, but I
            | don't see it anymore.
        
         | pickettd wrote:
         | The Android apk for MLC is updated frequently with recent
         | models built-in. And a Samsung S24+ can comfortably run 7-8B
         | models at reasonable speeds (10ish tokens/sec).
         | 
         | https://llm.mlc.ai/docs/deploy/android.html
        
         | woadwarrior01 wrote:
         | I have an (mlc-llm based) app on the App Store that supports
         | over 2 dozen models, including some recent ones.
        
         | ukuina wrote:
         | cnvrs runs GGUFs on iOS:
         | https://testflight.apple.com/join/ERFxInZg
        
       | yshvrdhn wrote:
        | Am I missing something, or couldn't something like
        | distillation help here?
        
         | imurray wrote:
         | The paper says they tried that:
         | https://arxiv.org/abs/2402.14905
         | 
         | Deep link to the relevant snippet in html version:
         | https://ar5iv.labs.arxiv.org/html/2402.14905#S3.SS5
         | 
         |  _" So far, we trained compact models from scratch using next
         | tokens as hard labels. We explored Knowledge Distillation
         | (KD)... Unfortunately KD increases training time (slowdown of
         | 2.6-3.2x) and exhibits comparable or inferior accuracy to
         | label-based training (details in appendix)."_
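          | 
          | For reference, the KD objective they are comparing against is
          | typically a mix of soft and hard targets (a generic sketch,
          | not the paper's exact setup):
          | 
          |     import torch.nn.functional as F
          |     
          |     def kd_loss(student_logits, teacher_logits, labels,
          |                 T=2.0, alpha=0.5):
          |         # Soft part: match the teacher's temperature-
          |         # smoothed distribution (scaled by T^2).
          |         soft = F.kl_div(
          |             F.log_softmax(student_logits / T, dim=-1),
          |             F.softmax(teacher_logits / T, dim=-1),
          |             reduction="batchmean") * T * T
          |         # Hard part: ordinary next-token cross-entropy.
          |         hard = F.cross_entropy(student_logits, labels)
          |         return alpha * soft + (1 - alpha) * hard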
        
       | PoignardAzur wrote:
        | I wonder how much you can push the "deeper and thinner" part.
        | At some point your entire FFN fits into your L2 cache, and
        | you're bound to get some performance jumps.
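        | 
        | Rough math on when a layer starts fitting (cache sizes vary
        | widely; these numbers are just illustrative):
        | 
        |     # One FFN block has ~2 * d_model * d_ffn weights
        |     # (up- and down-projection), ignoring biases/gating.
        |     d_model, d_ffn, bytes_per = 512, 1408, 2  # fp16
        |     ffn_bytes = 2 * d_model * d_ffn * bytes_per
        |     print(f"{ffn_bytes / 1e6:.1f} MB per FFN layer")  # ~2.9
        |     # A few MB fits in large L2/L3 slices; a 4096-wide
        |     # fp16 FFN per layer would not.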
        
         | sigmoid10 wrote:
          | Other research from Meta FAIR actually suggests that you
          | should prune deeper layers if you want to improve performance
          | while maintaining accuracy [1]. So there must be a cutoff
          | point for smaller networks where this approach still works;
          | otherwise the results are contradictory. Or we could
          | drastically improve these new models even further.
         | 
         | [1] https://arxiv.org/html/2403.17887v1
        
         | woodson wrote:
         | That reminds me of the findings of Google's paper on
         | EfficientT5 (https://arxiv.org/abs/2109.10686). They refer to
         | it as "DeepNarrow".
        
       | ejdhshsuwisjsh wrote:
        | Is anyone aware of custom mobile LLMs?
        | 
        | Optimizing and loading in your own voice, selecting your
        | primary language, and adding a little bit of personal
        | knowledge like nicknames, location, and such?
        | 
        | My Pixel 8 can apparently use/load local models, but I don't
        | have the time right now to follow that rabbit hole.
        
         | euniceee3 wrote:
          | Tensor chips are not open enough for an optimized mobile LLM
          | to be run on them.
        
       | vhiremath4 wrote:
        | It seems like the smaller models get the largest size decrease
        | from embedding sharing/weight tying between the linear head
        | and token embeddings. Is there any research going into how to
        | further reduce size from there?
        
         | cztomsik wrote:
          | If you mean that the LM head is just the transposed
          | embedding matrix, then this was already done in GPT-2.
          | 
          | Unfortunately, the only thing I found out about this is that
          | bigger models benefit from a separate layer. But this was
          | only mentioned somewhere on Discord, so there's no paper to
          | read, and my personal hunch is that it should work for bigger
          | models too. After all, GPT-3 was just a scaled-up GPT-2.
          | 
          | From my personal experiments, models learn better if you give
          | them a harder task, and tied weights could be one such thing.
          | Multi-token prediction could be another, and bitnet could
          | also be considered such (and dropout too).
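          | 
          | For anyone unfamiliar, tying in PyTorch is one line (a
          | minimal sketch):
          | 
          |     import torch.nn as nn
          |     
          |     vocab, d_model = 32000, 512
          |     embed = nn.Embedding(vocab, d_model)
          |     lm_head = nn.Linear(d_model, vocab, bias=False)
          |     lm_head.weight = embed.weight  # one matrix, used twice
          |     # Saves vocab * d_model params (~16M here), a big chunk
          |     # of a 125M-class model's budget.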
        
       | cjtrowbridge wrote:
        | Why no MMLU or GSM8K?
        
       | lawlessone wrote:
        | Does it have to stay on mobile devices? Bit of a niche, but if
        | it's not a resource hog it could be handy for giving NPCs in
        | games more interesting dialogue.
        | 
        | Even better if it could be tuned in some way to allow dialogue
        | to influence NPC behavior or actions.
        
         | janalsncm wrote:
         | It would be fascinating if NPCs had more backstory to them and
         | more complex behaviors. Although I would imagine it would be
         | near impossible to test since anything could influence their
         | behavior.
        
           | lawlessone wrote:
            | Yeah, testing would definitely be a nightmare, especially
            | if conversations could influence the wider game.
            | 
            | You'd have someone on YouTube cheesing games by running
            | scams on NPCs.
        
           | HanClinto wrote:
           | I'm definitely interested in exploring this sort of thing.
           | How much can we do with creating interesting characters and
           | interesting circumstances?
           | 
           | Makes me think of the way that characters are set up in AI
           | Alibis -- each with their own secrets, but also with clues
           | about other NPC's secrets. That feels like clever design, and
           | it's the first use-case of using LLMs for NPC dialogue that
           | feels interesting to me:
           | https://news.ycombinator.com/item?id=40921990
        
         | kevingadd wrote:
         | Would it be _interesting_ dialogue? You could generate more
         | dialogue, but would it have anything underpinning it of
         | interest to the player? i.e. you could suddenly have
         | townspeople that would talk about local scenery or their
         | relationships with other NPCs, but none of that stuff they
         | describe would actually _exist_ in the game. I would personally
         | be weirded out if NPCs started making stuff up.
         | 
          | I can imagine training some sort of LLM _on_ your game data
          | such that NPCs are able to actually describe the game world,
          | but I can't imagine what kind of scale you'd need to operate
          | at for that to be cheaper than just paying someone to write
          | the dialogue. Maybe at Ubisoft's scale, where team sizes are
          | in the thousands (AFAIK, they have been investigating using
          | AI for writing, but it's mostly for things like combat barks
          | which are very repetitive and basically noise).
        
           | lawlessone wrote:
            | >Would it be interesting dialogue?
            | 
            | It would definitely depend a lot on the implementation. I
            | think it could work great for some indie devs. Not all of
            | them, of course; devs that like writing understandably
            | won't like it.
        
       | KTibow wrote:
        | When Gemma 2 2B releases, it would be interesting to compare
        | its scaling with this.
        
       | pmontra wrote:
        | Interesting research, but Meta does not have any device worth
        | talking about (at least at scale), unless they want to ship
        | this as part of their apps.
        
         | TeMPOraL wrote:
         | They have Oculus.
        
         | ynx wrote:
         | Dismissiveness like this tends to radiate ignorance, not
         | insight.
         | 
          | The Quest has shipped roughly half as many units as the PS5.
          | That's certainly a scale that only a handful of
          | technologically advanced product lines outside of phones
          | ever reach.
         | 
         | Incidentally, the enabling technology for the Quest? On-device
         | ML that grew out of - you guessed it - developing on-device
         | inference for their apps.
        
         | HanClinto wrote:
         | 125M parameters feels very feasible to ship as part of apps --
         | even web-based apps.
        
       | banish-m4 wrote:
        | Hey HN. I actually have a current need for on-device,
        | wake-word-like STT. Which model(s) have the lowest WER and can
        | run on an RPi 4B? I've been looking at openWakeWord. It's for
        | a DIY inventory system.
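        | 
        | For context, the openWakeWord loop I've been testing looks
        | roughly like this (from memory of its README -- treat the
        | exact arguments as assumptions and check the docs):
        | 
        |     import numpy as np
        |     from openwakeword.model import Model
        |     
        |     # Placeholder path to a pretrained wake-word model.
        |     oww = Model(wakeword_models=["path/to/model"])
        |     
        |     def on_audio_frame(frame: np.ndarray):
        |         # frame: 16 kHz, 16-bit mono PCM chunk.
        |         scores = oww.predict(frame)
        |         for name, score in scores.items():
        |             if score > 0.5:
        |                 print("wake word:", name)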
        
       ___________________________________________________________________
       (page generated 2024-07-10 23:02 UTC)