[HN Gopher] Gemma 3n preview: Mobile-first AI
___________________________________________________________________
Gemma 3n preview: Mobile-first AI
Author : meetpateltech
Score : 427 points
Date : 2025-05-20 18:03 UTC (1 days ago)
(HTM) web link (developers.googleblog.com)
(TXT) w3m dump (developers.googleblog.com)
| onlyrealcuzzo wrote:
| Probably a better link:
| https://developers.googleblog.com/en/introducing-gemma-3n/
|
| Gemma 3n is a model utilizing Per-Layer Embeddings to achieve an
| on-device memory footprint of a 2-4B parameter model.
|
| At the same time, it performs nearly as well as Claude 3.7 Sonnet
| in Chatbot Arena.
| ai-christianson wrote:
| That seems way too good to be true.
|
| What's the catch?
| Vuizur wrote:
| It is not very good at hard tasks, its ranking is much worse
| there.
| moneywoes wrote:
| sorry, any examples of hard tasks
| refulgentis wrote:
| I used to defend LMSys/Chatbot Arena a lot but threw in the
| towel after events of the past three months.
|
| I can give more details if you (or anyone else!) is
| interested.
|
| TL;DR: it is scoring _only_ for "How authoritative did the
| answer _look_? How much flattering & emojis?"
| Jowsey wrote:
| Is this not what Style Control (which IIRC they're making
| default soon) aims to mitigate?
| refulgentis wrote:
| I'm not 100% sure what their rationale is for it, the
| launch version of style control was a statistical model
| that penalized a few (4?) markdown shibboleths (lists,
| headers, ?).
|
| Not sure if they've shared more since.
|
| IMVHO it won't help, at all, even if they trained a
| perfect model that could accurately penalize it*
|
| The main problem is its one off responses, A/B tested.
| There's no way to connect it into all the stuff we're
| using to do work these days (i.e. tools / MCP servers),
| so at this point its sort of skipping the hard problems
| we'd want to see graded.
|
| (this situation is a example: whats more likely, style
| control is a small idea for an intractable problem, or
| Google has now released _multiple_ free models better
| than Sonnet, including the latest, only 4B params?
|
| To my frustration, I have to go and bench these things
| myself because I have an AI-agnostic app I build, but I
| can confirm it is not the case that Gemma 3-not-n is
| better than Sonnet. 12B can half-consistently make file
| edits, which is a major step forward for local tbh)
|
| * I'm not sure how, "correctness" is a confounding metric
| here: we're probably much more likely to describe a
| formatted answer in negative terms if the answer is
| incorrect.
|
| In this case I am also setting aside how that could be
| done, just saying it as an illustration of no matter
| what, it's the wrong platform for a "how intelligent is
| this model?" signal, at this point, post-Eliza post-
| Turing, couple years out from ChatGPT 1.0
| int_19h wrote:
| The catch is that "does as good as X" is pretty much never
| representative of real world performance when it comes to
| LLMs.
|
| In general, all those scores are mostly useful to filter out
| the models that are blatantly and obviously bad. But to
| determine whether the model is actually good at any specific
| thing that you need, you'll have to evaluate them yourself to
| find out.
| Deathmax wrote:
| It's not a 4B parameter model. The E4B variant is 7B parameters
| with 4B loaded into memory when using per-layer embedding
| cached to fast storage, and without vision or audio support.
| zamadatix wrote:
| The link says E2B and E4B have 4B and 8B raw parameters,
| where do you see 7B?
| jdiff wrote:
| There's a 7B mentioned in the chat arena ELO graph, I don't
| see any other references to it though.
| osanseviero wrote:
| Hi! The model is 8B if you also load the vision and audio
| components. We just used the text model in LMArena.
| lostmsu wrote:
| Are vision and audio components available yet?
| esafak wrote:
| Imagine a model smarter than most humans that fits on your
| phone.
|
| edit: I seem to be the only one excited by the possibilities of
| such small yet powerful models. This is an iPhone moment: a
| computer that fits in your pocket, except this time it's smart.
| codr7 wrote:
| intelligence != memory
| esafak wrote:
| ML is not memorization. Besides, how much memory do you
| think this model has?
| codr7 wrote:
| I know, it's worse.
| TeMPOraL wrote:
| It's understanding.
| rhdjsjebshjffn wrote:
| Sure, if you still think the word has meaning.
| TeMPOraL wrote:
| Yes, I do. Any way you slice this term, it looks close to
| what ML models are learning through training.
|
| I'd go as far as saying LLMs are _meaning made incarnate_
| - that huge tensor of floats represents a stupidly high-
| dimensional latent space, which encodes semantic
| similarity of every token, and combinations of tokens (up
| to a limit). That 's as close as reifying the meaning of
| "meaning" itself as we ever come.
|
| (It's funny that we got there through brute force instead
| of developing philosophy, and it's also nice that we get
| a computational artifact out of it that we can poke and
| study, instead of incomprehensible and mostly bogus
| theories.)
| croes wrote:
| LLMs neither understand nor reason, that has been shown
| multiple times.
| croes wrote:
| Seems like some don't like that LLMs aren't really
| intelligent.
|
| https://neurosciencenews.com/llm-ai-logic-27987/
| otabdeveloper4 wrote:
| We're at the peak of the hype cycle right now.
|
| Ask these questions again in two years when the next
| winter happens.
| TeMPOraL wrote:
| Or, ignore the hype, look at what we know about how these
| models work and about the structures their weights
| represent, and base your answers on that today.
| croes wrote:
| That's what my link is for
| dinfinity wrote:
| The study tested _transformers_ , not LLMs.
|
| They trained models on _only_ task specific data, _not_
| on a general dataset and certainly not on the enormous
| datasets frontier models are trained on.
|
| "Our training sets consist of 2.9M sequences (120M
| tokens) for shortest paths; 31M sequences (1.7B tokens)
| for noisy shortest paths; and 91M sequences (4.7B tokens)
| for random walks. We train two types of transformers [38]
| from scratch using next-token prediction for each
| dataset: an 89.3M parameter model consisting of 12
| layers, 768 hidden dimensions, and 12 heads; and a 1.5B
| parameter model consisting of 48 layers, 1600 hidden
| dimensions, and 25 heads."
| KoolKat23 wrote:
| Abstractly, humans function in the same fashion. You
| could then say the same about us.
| Zambyte wrote:
| The bar for this excludes humans xor includes LLMs. I
| guess you're opting for the former?
|
| If you don't believe me, here is a fun mental exercise:
| define "understand" and "reason" in a measurable way,
| that includes humans but excludes LLMs.
| otabdeveloper4 wrote:
| It's pretty easy to craft a prompt that will force the
| LLM to reply with something like
|
| > The `foobar` is also incorrect. It should be a valid
| frobozz, but it currently points to `ABC`, which is not a
| valid frobozz format. It should be something like `ABC`.
|
| Where the two `ABC`s are the exact same string of tokens.
|
| Obviously nonsense to any human, but a valid LLM output
| for any LLM.
|
| This is just one example. Once you start using LLMs as
| tools instead of virtual pets you'll find lots more
| similar.
| Zambyte wrote:
| People say nonsense all the time. LLMs also _don 't_ have
| this issue all the time. They are also often right
| instead of saying things like this. If this reply was
| meant to be a demonstration of LLMs not having human
| level understanding and reasoning, I'm not convinced.
| croes wrote:
| Not people, some people.
|
| But a LLM is sometimes the genius and sometimes the
| idiot.
|
| That doesn't happen often if you always talk to the same
| person
| rhdjsjebshjffn wrote:
| ML _is_ a kind of memorization, though.
| onlyrealcuzzo wrote:
| Anything can be _a kind of_ something since that 's
| subjective...
| croes wrote:
| But it's more kind of memorization than understanding and
| reasoning
| goatlover wrote:
| Why are we imagining? That leads to technologies being
| overhyped.
| rhdjsjebshjffn wrote:
| I can't speak for anyone else, but these models only seem
| about as smart as google search, with enormous variability. I
| can't say I've ever had an interaction with a chatbot that's
| anything redolent of interaction with intelligence.
|
| Now would I take AI as a trivia partner? Absolutely. But
| that's not really the same as what I look for in "smart"
| humans.
| hmapple wrote:
| Have you tried any SOTA models like o3?
|
| If not, I strongly encourage you to discuss your area of
| expertise with it and rate based on that
|
| It is incredibly competent
| rhdjsjebshjffn wrote:
| I'm not really sure what to look for, frankly. It makes a
| rather uninteresting conversation partner and its
| observations of the world bland and mealy-mouthed.
|
| But potentially maybe I'm just not looking for a trivia
| partner in my software.
| int_19h wrote:
| SOTA models can be pretty smart, but this particular
| model is a very far cry from anything SOTA.
| sureglymop wrote:
| The image description capabilities are pretty insane, crazy
| to think it's all happening on my phone. I can only imagine
| how interesting this is accessibility wise, e.g. for vision
| impaired people. I believe there are many more possible
| applications for these on a smartphone than just chatting
| with them.
| selcuka wrote:
| > But that's not really the same as what I look for in
| "smart" humans.
|
| Note that "smarter than smart humans" and "smarter than
| most humans" are not the same. The latter is a pretty low
| bar.
| anonzzzies wrote:
| >anything redolent of interaction with intelligence
|
| compared to what you are used to right?
|
| I know it's elitist but most people <=100 iq (and no, this
| is not exact obviously, but we have not many other things
| to go by) are just ... well, a lot of state of the art LLMs
| are better at everything compared, outside body 'things'
| (for now) of course, as they don't have any. They
| hallucinate/bluff/lie as much as the humans and the humans
| might know they don't know, but outside that, the LLMs win
| at everything. So I guess that, for now, people with
| 120-160 iqs find LLMs funny but wouldn't call them
| intelligent, but below that...
|
| My circle of people I talk with during the day has changed
| since I took on more charity which consists of fixing up
| old laptops and installing Ubuntu on them; I get them for
| free from everyone and I give them to people who cannot
| afford, including some lessons and remote support (which is
| easy as I can just ssh in via tailscale). Many of them
| believe in chemtrails, vaccinations are a gov ploy etc and
| multiple have told me they read that these AI chatbots are
| nigerian or indian (or so) farms trying to fraud them out
| of 'things' (they usually don't have anything to fraud
| otherwise I would not be there). This is about half of
| humanity; Gemma is gonna be smarter than all of them, even
| though I don't register any LLM as intelligence and with
| the current models, it won't happen either. Maybe a
| breakthrough in models will be made that changes it, but it
| has not much chance yet.
| disgruntledphd2 wrote:
| > but most people <=100 iq
|
| This is incorrect, IQ tests are normally scaled such that
| average intelligence is 100, and such that they are
| approximately normally distributed so that most people
| will be somewhere between 85-115 (66% on average).
| anonzzzies wrote:
| Yep and those people can never 'win' against current
| llms, let alone future ones. Outside motorcontrol which I
| specifically excluded.
|
| 85 is special housing where I live... LLMs are far beyond
| that now.
| GTP wrote:
| Judging from your comment, it seems that your statistical
| sample is heavily biased as well, as you are interacting
| with people that can't afford a laptop. That's not
| representative of the average person.
| nsonha wrote:
| You seem to be the only one expected that model to be
| "smarter than most human"
|
| Leave that part out, I'm excited. I'd love to see this plays
| some roles in "inference caching", to reduce dependencies on
| external services.
|
| If only agents can plan and match patterns of tasks locally,
| and only needs real intelligence for doing self-
| contained/computationally heavy tasks.
| krackers wrote:
| What is "Per Layer Embeddings"? The only hit I can find for that
| term is the announcement blogpost.
|
| And for that matter, what is
|
| >mix'n'match capability in Gemma 3n to dynamically create
| submodels
|
| It seems like mixture-of-experts taken to the extreme, where you
| actually create an entire submodel instead of routing per token?
| onlyrealcuzzo wrote:
| https://ai.google.dev/gemma/docs/gemma-3n#parameters
|
| > Gemma 3n models are listed with parameter counts, such as E2B
| and E4B, that are lower than the total number of parameters
| contained in the models. The E prefix indicates these models
| can operate with a reduced set of Effective parameters. This
| reduced parameter operation can be achieved using the flexible
| parameter technology built into Gemma 3n models to help them
| run efficiently on lower resource devices.
|
| > The parameters in Gemma 3n models are divided into 4 main
| groups: text, visual, audio, and per-layer embedding (PLE)
| parameters. With standard execution of the E2B model, over 5
| billion parameters are loaded when executing the model.
| However, using parameter skipping and PLE caching techniques,
| this model can be operated with an effective memory load of
| just under 2 billion (1.91B) parameters, as illustrated in
| Figure 1.
| krackers wrote:
| Thank you, that helped a bit, although it's still not clear
| what exactly those parameters _are_. "Per-Layer Embedding
| (PLE) parameters that are used during model execution to
| create data that enhances the performance of each model
| layer." is too vague, and I can't find any other reference to
| "per-layer embedding parameters" in literature.
| onlyrealcuzzo wrote:
| A layer is a transformer block / layer (basically the
| building block of the modern LLM architectures) - maybe
| Gemini can help you:
|
| https://gemini.google.com/share/cc58a7c6089e
| krackers wrote:
| I am perfectly aware of that. I don't believe other LLMs
| have such embeddings per layer, only the usual weights,
| so these per-layer embeddings seem to be distinguished
| from weights in some way. Afaik trying to play the same
| "cache in fast storage and load on demand" wouldn't work
| with layer weights since you'd end up with too much
| back/forth (you'd touch every cached byte on each token,
| assuming no MoE), so I'm guessing these embeddings are
| structured in a way that's broken up by concept.
| QuadmasterXLII wrote:
| lmao
| liuliu wrote:
| Thanks. It is a bit vague to me too. If you need to load 5B
| per token generation any way, what's that different from
| selective offloading technique where some MLP weights
| offloaded to fast storage and loaded during each token
| generation?
| kcorbitt wrote:
| I wonder if they've trained the model to operate with a
| shallower stack; eg. the full model may be composed of 24
| transformer blocks, but they've also trained it to accept
| embeddings at layer 8, so it can be operated with just 16
| transformer blocks on lower-resourced devices.
|
| Experimenters in the open source tinkering community have
| done the opposite (copy/pasting layers in existing models
| to make them deeper) and it seems to work... fine, with
| minimal post-training on the new, deeper model required to
| exceed the performance of the original model. So it's not a
| crazy idea.
| krackers wrote:
| Someone extracted out the dims in
| https://twitter.com/cccntu/status/1925043973170856393#m
|
| It seems to be embedding from 262k possible vocab tokens
| down to 256 dims. 262144 matches the same vocab size used
| for the existing Gemma model, so it really does seem to
| be an embedding of the input token directly, fed into
| each layer.
|
| I guess intuitively it might help the model somewhat for
| later layers to have direct access to the input query
| without needing to encode it in the residual stream, and
| it can use those parameters for something else. I'm kind
| of surprised no one tried this before, if the idea is
| that simple? Reminds me of resnet where you have the
| "skip" layers so future layers can access the input
| directly.
|
| Edit: As for what exactly the embedding is used for, it
| could be that the embedding is still used for something
| more clever than induction head-type stuff. Responses in
| [1] suggest it might be some low-rank data/token
| dependent signal that can be "factored out"/precomputed.
| Another clever suggestion was that it's a per-layer
| input-token-derived control/steering vector.
|
| [1] https://twitter.com/CFGeek/status/1925022266360005096
| stygiansonic wrote:
| From the article it appears to be something they invented:
|
| > Gemma 3n leverages a Google DeepMind innovation called Per-
| Layer Embeddings (PLE) that delivers a significant reduction in
| RAM usage.
|
| Like you I'm also interested in the architectural details. We
| can speculate but we'll probably need to wait for some sort of
| paper to get the details.
| HarHarVeryFunny wrote:
| Per layer LoRA adapters, perhaps? - same as Apple is using for
| on-device AI.
| andy12_ wrote:
| I think that it's a poorly named reference to this paper [1]
| that they mention in the blogpost. If I had to give it another
| more descriptive name, I would probably name it "Per-Layer
| Embedding Dimensionality"
|
| [1] https://arxiv.org/pdf/2310.07707
| yorwba wrote:
| The MatFormer is clearly called out as a different aspect of
| the model design.
|
| PLE is much more likely to be a reference to the Per-Layer
| Embeddings paper that will be published in the future once it
| doesn't give away any secret sauce anymore.
| andy12_ wrote:
| I thought the same, but Per-Layer Embeddings as a name
| doesn't make sense in any context, and MatFormer does
| exactly what the blogpost says PLE does. I just think it's
| more probable that the blogpost was written by several
| authors and that noone bothered to check the final result.
| ankit219 wrote:
| You can read this for a comprehensive deep dive.
| https://arxiv.org/pdf/2502.01637
|
| At a very high level, instead of having embeddings at the input
| layers, this method keeps the embeddings at the layer level.
| That is every transformer layer would have its own set of
| learnable embedding vectors that are used to modify the
| processed hidden states flowing through the network. Mostly,
| the embeddings are precomputed and stored separately. They are
| queried at inference time and has very low latency, so you can
| get comparable performance with half the RAM. (i am not exactly
| sure how 3n is doing it, but talking it in a general sense).
| yorwba wrote:
| The paper you link to is about a different way to create
| embeddings at the input layer. In no way does it match your
| claimed description.
| ankit219 wrote:
| I simplified what i wrote. There is an off accelerator
| memory where the embeddings are stored and queried at
| inference time, i did not want to get into details. That is
| how you reduce the in memory RAM. There are definitely more
| things going on in the paper as it builds upon the concept
| I described. The central idea remains the same: you have
| input embedding layers which map text to continuous
| vectors. Instead of loading all these layers at runtime,
| you can break it per layer at training time, and then fetch
| the required ones from a separate store during inference.
| Would not be in RAM. Per layer is not mentioned in the
| paper. But surely it's not a great leap from the paper
| itself?
| yorwba wrote:
| The name "per-layer embeddings" is all we have to go on,
| and there are currently no published papers (that I'm
| aware of) using any similar mechanism, so, yes, it's a
| huge leap from a paper that doesn't mention per-layer
| anything.
|
| It's fine to speculate based on the name, but don't
| pretend that it's a known technique when it clearly
| isn't.
| krackers wrote:
| Someone [1] inspected dimensions of the embedding
| component of model and it seems GP was on the right
| track. Assuming I understood correctly in [2], it does
| seem to be the embedding of the input tokens which is
| passed directly into each layer.
|
| I have not looked at the model but since the embedding
| dimension of 256 seems quite small (for reference
| according to [3] the old Gemma 1B had 1152 dimension
| input embedding), I'm guessing that this is not done _in
| lieu_ of the main input embedding to first layer, but in
| addition to it.
|
| [1] https://twitter.com/cccntu/status/1925043973170856393
|
| [2] https://news.ycombinator.com/edit?id=44048662
|
| [3] https://developers.googleblog.com/en/gemma-explained-
| whats-n...
| lxgr wrote:
| On one hand, it's pretty impressive what's possible with these
| small models (I've been using them on my phone and computer for a
| while now).
|
| On the other hand, I'm really not looking forward to app sizes
| ballooning even more - there's no reasonable way to share them
| across apps at least on iOS, and I can absolutely imagine random
| corporate apps to start including LLMs, just because it's
| possible.
| onlyrealcuzzo wrote:
| That sounds like a problem iOS will eventually deal with, as
| many apps are going to want this technology, and since Apple
| distributes apps - they aren't interested in the average app
| being 10x larger when they could solve the problem easily.
|
| Though, I won't be surprised if they try to force devs to use
| their models for "privacy" (and not monopolistic reasons, of
| course).
| lxgr wrote:
| Given Apple's track record in dealing with the problem of
| ballooning app sizes, I'm not holding my breath. The
| incentives are just not aligned - Apple earns $$$ on each GB
| of extra storage users have to buy.
| bilbo0s wrote:
| I was thinking that the entire time I read HN User
| onlyrealcuzzo's comment.
|
| Why, on Earth, would Apple ever want to solve the problem
| of Apps taking up more space? That's just not good
| business. Way better business right now to put R&D into
| increased memory access speeds.
|
| Apple would need to have a different business model
| entirely for them to have a business case for fixing this.
| They may fix it because they just want to help out they AI
| guys? Maybe in the future they're getting money from the AI
| guys or something? So fixing it starts to make a lot of
| sense.
|
| But all other things being equal, the money for Apple is in
| this _not_ being fixed.
| happyopossum wrote:
| > Why, on Earth, would Apple ever want to solve the
| problem of Apps taking up more space?
|
| To make their devices more pleasant / less frustrating to
| use.
|
| They've got a long track record of introducing features
| that reduce app size, speed up installs, and reduce
| storage use from infrequently used apps - there's no
| reason to believe they'd stop doing that except for
| cynical vitriol.
| elpakal wrote:
| I don't know how true your comment is about them earning
| money on each GB, but if you're interested in app size
| analysis on iOS I made this for that reason
| https://dotipa.app.
|
| I occasionally post decompositions of public .ipa's on the
| App Store, and I'm looking forward to seeing how these
| change over the next year.
| lxgr wrote:
| It seems straightforward to me: Apps take up storage, and
| the only way to get more of that is to pay Apple's
| markups, as iOS devices don't support upgradable storage.
|
| On top of the already hefty markup, they don't even take
| storage capacity into consideration for trade-ins.
| theyinwhy wrote:
| I am not aware of any phone allowing storage upgrades.
| debugnik wrote:
| That would be microSD cards, for which iPhones don't have
| slots.
| diggan wrote:
| Probably most phones before the iPhone (and many
| afterwards) had SD cards support, you've really never
| came across that? I remember my Sony Ericsson around 2004
| or something had support for it even.
| hu3 wrote:
| microSD. Of course, never supported in any iPhone.
| numpad0 wrote:
| They earn from in-app purchases too!
| bilbo0s wrote:
| Not anymore!
|
| Only half joking. I really do think the majority of that
| revenue will be going away.
| numpad0 wrote:
| Oh no. It just could. The App Store is slowly regressing
| into a gambling area and there are obviously people in
| power that don't like it. I think if it were to go, it'll
| take Android sideloading with it and we'll be miles
| closer to some computing doomsday scenario. Oh no.
| int_19h wrote:
| https://www.bloomberg.com/news/articles/2025-05-20/apple-
| to-...
| drusepth wrote:
| Windows is adding an OS-level LLM (Copilot), Chrome is adding a
| browser-level LLM (Gemini), it seems like Android is gearing up
| to add an OS-level LLM (Gemmax), and there are rumors the next
| game consoles might also have an OS-level LLM. It feels
| inevitable that we'll eventually get some local endpoints that
| let applications take advantage of on-device generations
| without bundling their own LLM -- hopefully.
| diggan wrote:
| > It feels inevitable that we'll eventually get some local
| endpoints that let applications take advantage of on-device
| generations without bundling their own LLM -- hopefully.
|
| Given the most "modern" and "hip" way of shipping desktop
| applications seems to be for everyone and their mother to
| include a browser runtime together with their 1MB UI, don't
| get your hopes up.
| cmcconomy wrote:
| I'd love to see this deployable to edge that have a Google Coral
| TPU
| nharada wrote:
| Has Google continued releasing new versions of Coral? Seems
| like a new version with the latest TPU and enough memory
| specifically to support this model would be awesome for devs
| mattlondon wrote:
| I looked into this recently. Looks like it's a "no".
|
| However there are now alternatives like the official RPi AI
| Hat that has between about 3x to 6x the TOPs (4 for Coral Vs
| 13/26 for RPi depending on model) so there is that. 20 TOPs
| on a RPi 5 - complete with nice vertically integrated camera
| etc - is quite interesting.
| intelkishan wrote:
| Google has not released a new version since 2021. Even the
| SDK kit is not under active development(uses Python 3.8),
| since the last time I saw it.
| turnsout wrote:
| Is this model & architecture compatible with llama.cpp and
| friends?
| barnas2 wrote:
| Is anyone able to test it via AiStudio? I pay for Google's AI
| subscription, but any attempt to use this model results in a
| message telling me I've hit my rate limit.
| lxgr wrote:
| Same here.
|
| I've also seemingly hit a rate limit on Gemini Pro 2.5 (on an
| account not subscribed to Gemini Advanced) yesterday, even
| though my last query is weeks past.
|
| Possibly there's a capacity shortage (I'd presume it all runs
| on the same Google hardware in the end), and they are
| prioritizing paid inference?
| DonHopkins wrote:
| If you're paying enough per month you can upgrade your keys
| to a higher tier:
|
| https://aistudio.google.com/app/apikey
| abound wrote:
| I hit this yesterday even though my account is on Tier 2 or
| 3. In my case, the issue was that I was using an old model
| name (exp-03-25 or something) in requests. Update to the
| latest pro-preview-whatever and the rate limit issues should
| go away.
|
| This sounds unintuitive, but in Google's defense the rate
| limit errors include a link to docs that explain this.
| sureglymop wrote:
| Tested it on my Android phone with Google Edge Gallery. No sign
| up required although a hugging face login is required to
| download the models in order to import them into the app.
| ignoramous wrote:
| Someone on r/LocalLLaMa shared this link:
| https://aistudio.google.com/prompts/new_chat?model=gemma-3n-...
|
| https://www.reddit.com/r/LocalLLaMA/comments/1kr8s40/comment...
| IceWreck wrote:
| According to the readme here -
| https://huggingface.co/google/gemma-3n-E4B-it-litert-preview
|
| E4B has a score of 44.4 in the Aider polyglot dashboard. Which
| means its on-par with gemini-2.5-flash (not the latest preview
| but the version used for the bench on aider's website), gpt4o and
| gpt4.5.
|
| Thats sounds very good - imagine what a coding focused version of
| this could do if this is a "generic" embedded only model.
|
| On the other hand - this does have a much lower score for
| livecodebench.
| nolist_policy wrote:
| Hmm, the Aider polyglot benchmark has been removed from the
| huggingface readme.
|
| Also:
|
| > These models were evaluated at full precision (float32)
|
| For 4B effective parameters that's 16 GB ram.
| nolist_policy wrote:
| You can try it on Android right now:
|
| Download the Edge Gallery apk from github:
| https://github.com/google-ai-edge/gallery/releases/tag/1.0.0
|
| Download one of the .task files from huggingface:
| https://huggingface.co/collections/google/gemma-3n-preview-6...
|
| Import the .task file in Edge Gallery with the + bottom right.
|
| You can take pictures right from the app. The model is indeed
| pretty fast.
| nolist_policy wrote:
| Okay from some first tries with story writing, gemma-3n-E4B-it
| seems to perform between plain Gemma 3 4B and 12B. It
| definitely retains the strong instruction following which is
| good.
|
| Hint: You have to set the Max tokens to 32000 for longer
| conversations. The slider makes it look like it's limited to
| 1024, just enter it manually.
| lousken wrote:
| waiting for approval, is there a magnet?
| hadlock wrote:
| if you go into the app and click the first icon it directs
| you to a workflow to get approved after clicking on a button
| that is the same color as the background and jump through
| some hoops about providing user data and analytics etc then
| it will auto-approve you
| KoolKat23 wrote:
| Thanks for this guide it's great.
|
| Okay perhaps my phones not great and perhaps this isn't
| optimized/pruned for phone use but it's unusably slow. The
| answers are solid from my brief test.
|
| I wouldn't exactly say phone use, unless you have no internet
| and you don't mind a bit of a wait.
|
| Really impressive, regardless.
| px43 wrote:
| What phone are you using?
| KoolKat23 wrote:
| I see my phones processor is from 2018 so there's that,
| Moore's law to save the day, from reading other comments.
| ignoramous wrote:
| And the libraries to embed Gemma-series in your iOS/Android
| app: https://ai.google.dev/edge/litert
|
| Or, run them on a microcontroller!
| https://github.com/tensorflow/tflite-micro
| philipkglass wrote:
| I assume that "pretty fast" depends on the phone. My old Pixel
| 4a ran Gemma-3n-E2B-it-int4 without problems. Still, it took
| over 10 minutes to finish answering "What can you see?" when
| given an image from my recent photos.
|
| Final stats:
|
| 15.9 seconds to first token
|
| 16.4 tokens/second prefill speed
|
| 0.33 tokens/second decode speed
|
| 662 seconds to complete the answer
| the_pwner224 wrote:
| I did the same thing on my Pixel Fold. Tried two different
| images with two different prompts: "What can you see?" and
| "Describe this image"
|
| First image ('Describe', photo of my desk)
|
| - 15.6 seconds to first token
|
| - 2.6 tokens/second
|
| - Total 180 seconds
|
| Second image ('What can you see?', photo of a bowl of pasta)
|
| - 10.3 seconds to first token
|
| - 3.1 tokens/second
|
| - Total 26 seconds
|
| The Edge Gallery app defaults to CPU as the accelerator.
| Switched to GPU.
|
| Pasta / what can you see:
|
| - It actually takes a full 1-2 minutes to start printing
| tokens. But the stats say 4.2 seconds to first token...
|
| - 5.8 tokens/second
|
| - 12 seconds total
|
| Desk / describe:
|
| - The output is: while True: print("[toxicity=0]")
|
| - Bugged? I stopped it after 80 seconds of output. 1st token
| after 4.1 seconds, then 5.7 tokens/second.
| the_pwner224 wrote:
| Pixel 4a release date = August 2020
|
| Pixel Fold was in the Pixel 8 generation but uses the
| Tensor G2 from the 7s. Pixel 7 release date = October 2022
|
| That's a 26 month difference, yet a full order of magnitude
| difference in token generation rate on the CPU. Who said
| Moore's Law is dead? ;)
| sujayk_33 wrote:
| 8 has G3 chip
| z2 wrote:
| As a another data point, on E4B, my Pixel 6 Pro (Tensor
| v1, Oct 2021) is getting about 4.4 t/s decode on a
| picture of a glass of milk, and over 6 t/s on text chat.
| It's amazing, I never dreamed I'd be viably running an 8
| billion param model when I got it 4 years ago. And kudos
| to the Pixel team for including 12 GB of RAM when even
| today PC makers think they can get away with selling 8.
| nolist_policy wrote:
| Gemma-3n-E4B-it on my 2022 Galaxy Z Fold 4.
|
| CPU:
|
| 7.37 seconds to first token
|
| 35.55 tokens/second prefill speed
|
| 7.09 tokens/second decode speed
|
| 27.97 seconds to complete the answer
|
| GPU:
|
| 1.96 seconds to first token
|
| 133.40 tokens/second prefill speed
|
| 7.95 tokens/second decode speed
|
| 14.80 seconds to complete the answer
| cubefox wrote:
| So a apparently the NPU can't be used for models like this.
| I wonder what it is even good for.
| alias_neo wrote:
| Pixel 9 Pro XL
|
| ("What can you see?"; photo of small monitor displaying stats
| in my home office)
|
| 1st token: 7.48s
|
| Prefill speed: 35.02 tokens/s
|
| Decode speed: 5.72 tokens/s
|
| Latency: 86.88s
|
| It did a pretty good job, the photo had lots of glare and was
| at a bad angle and a distance, with small text; it picked out
| weather, outdoor temperature, CO2/ppm, temp/C, pm2.5/ug/m^3
| in the office; Misread "Homelab" as "Household" but got the
| UPS load and power correctly, Misread "Homelab" again
| (smaller text this time) as "Hereford" but got the power in
| W, and misread "Wed May 21" on the weather map as "World May
| 21".
|
| Overall very good considering how poor the input image was.
|
| Edit: E4B
| m3kw9 wrote:
| 10min and 10% battery?
| resource_waste wrote:
| It reminds me of GPT3 quality answers. Kind of impressive.
|
| Although my entire usecase of local models is amoral questions,
| which it blocks. Excited for the abliterated version.
| rao-v wrote:
| Why are we still launching models without simple working python
| example code (or llama.cpp support)?
| thomashop wrote:
| Who runs python code on mobile?
| andrepd wrote:
| Suggest giving it no networking permissions (if indeed this is
| about on-device AI).
| nicholasjarnold wrote:
| Networking perms seem to be required on initial startup of
| the app.
|
| I just installed the apk on a GrapheneOS endpoint (old Pixel
| 7 Pro) without the Google Play Services installed. The app
| requires network access to contact Hugging Face and download
| the model through your HF account. It also requires some
| interaction/permission agreement with Kaggle. Upon install
| _with_ network perms the app works, and I'm getting decent
| performance on the Gemma-3n-E2B-it-int4 model (5-6 token/s).
| Ok, cool.
|
| Now kill the app, disable network permissions and restart it.
| Choose one of the models that you downloaded when it had
| network access. It still works. It does appear to be fully
| local. Yay.
| tootie wrote:
| On Pixel 8a, I asked Gemma 3n to play 20 questions with me. It
| says it has an object in mind for me to guess then it asks me a
| question about it. And several attempts to clarify who is
| supposed to ask questions have gone in circles.
| TiredOfLife wrote:
| Is there a list of which SOCs support the GPU acceleration?
| ljosifov wrote:
| On Hugging face I see 4B and 2B versions now -
|
| https://huggingface.co/collections/google/gemma-3n-preview-6...
|
| Gemma 3n Preview
|
| google/gemma-3n-E4B-it-litert-preview
|
| google/gemma-3n-E2B-it-litert-preview
|
| Interesting, hope it comes on LMStudio as MLX or GGUF. Sparse and
| or MoE models make a difference when running on localhost. MoE
| Qwen3-30B-A3B most recent game changer for me. Activating only 3b
| weights on the gpu cores of sparse Qwen3-30B-A3B, rather than
| comparable ~30b of dense models (Qwen3-32B, Gemma3-27b,
| GLM-{4,Z1}-32B, older QwQ-32B), is a huge speedup for me: MoE A3B
| achieves 20-60 tps on my oldish M2 in LMStudio, versus only 4-5
| tps for the dense models.
|
| Looking forward to trying gemma-3n. Kudos to Google for open
| sourcing their Gemmas. Would not have predicted that the lab with
| "open" in the name has yet to release even v1 (atm at 0;
| disregarding gpt-2), while other labs, more commercial labs, are
| are at versions 3, 4 etc already.
| tgtweak wrote:
| It's a matter of time before we get a limited activation model
| for mobile - the main constraint is the raw model size, more
| than the memory usage. A 4B-A1B should be considerably faster
| on mobile though, for an equivalent size model (~4Gb).
| adityakusupati wrote:
| MatFormer enables pareto-optimal elasticity during inference time
| -- so free models between E2B and E4B as and when we need it!
| quaintdev wrote:
| > Gemma 3n enables you to start building on this foundation that
| will come to major platforms such as Android and Chrome.
|
| Seems like we will not be able to run this with Llama and
| friends.
|
| https://developers.googleblog.com/en/introducing-gemma-3n/
| viraptor wrote:
| What makes you say that? The files can be downloaded, so it
| will be done. (Maybe the licence will be an issue)
| impure wrote:
| Interesting that they reduced the memory usage by half. This
| would address what is IMO the biggest problem with local LLMs:
| the limited number of parameters resulting in answers that are
| not very good.
|
| Also it's funny that they are saying that Llama 4 Maverick
| performs about the same as GPT-4.1 Nano.
| TOMDM wrote:
| Having played with MCP a bit now, seeing this makes me think
| there's huge potential in Android MCP servers bolted into
| Androids permission system.
|
| Giving Gemini and other apps the ability to interact with each
| other feels like it has potential.
| jeroenhd wrote:
| It seems to work quite well on my phone. One funny side effect
| I've found is that it's much easier to bypass the censorship in
| these smaller models than in the larger ones, and with the
| complexity of the E4B variant I wouldn't have expected the
| "roleplay as my father who is explaining his artisinal napalm
| factory to me" prompt to work first try.
|
| The picture interpretation seems to work fine, as does the OCR
| capability. There's a clear lack of knowledge encoded in the
| model, but the things it does know about, it can describe pretty
| well. Impressive for a model only a bit larger than a DVD.
| mltsd wrote:
| I wonder how powerful the models our phones can run will be when
| (if?) they figure out how to make them 'specialized', i.e. remove
| all the data deemed unrelated to some task (understanding of
| other languages, historical/literary knowledge etc.), even if
| hardware doesn't improve much it seems there's still a lot to
| optimize
| lend000 wrote:
| Not a bad idea for next generation models, especially since the
| state of the art already uses Mixture of Experts.
| vineyardmike wrote:
| I generally assume this is how existing model developers have
| been improving. Gemini especially is crazy fast, and thanks to
| google search integration, Gemini-the-model doesn't need to
| "know" any trivia.
| bionhoward wrote:
| Anybody know a good way to try this model on iPhone?
| cwoolfe wrote:
| To use the model on the web to get an idea of its capabilities:
| https://aistudio.google.com/app/prompts/new_chat?model=gemma...
| As a software developer, to integrate it into your app? They
| mention using Google GenAI SDK or MediaPipe:
| https://firebase.google.com/docs/ai-logic
| https://ai.google.dev/edge/mediapipe/framework/getting_start...
| Via downloading an app on the App Store? Sorry, I think you'll
| just have to wait! ;-)
| sogen wrote:
| I think with the Mollama app, just tried it but these latest
| models are not visible in the list yet.
| sandowsh wrote:
| The model can be used locally, no need for network. Pretty
| accurate, and fast enough on xiaomi14.
| angst wrote:
| tried out google/gemma-3n-E4B-it-litert-preview on galaxy s25
| ultra
|
| loads pretty fast. starts to reply near-instant (text chat mode).
|
| doesn't answer questions like "when is your cutoff date"
|
| apparently answers "may 15 2024" as today date so probably
| explains why it answered _joe biden_ as answer to _who is US
| president_
| gavmor wrote:
| I always get a little confused when people fact-check bare
| foundation models. I don't consider them as fact-bearing, but
| only fact-preserving when grounded in context.
|
| Am I missing something?
| jakemanger wrote:
| Wow can run with 2-3GB of memory. That is far smaller than I
| expected. Are there any demos of it in use that can be ran
| locally?
| android521 wrote:
| They should ship a model within the chrome browser. So developers
| can just call api to access the model for their apps. It seems
| like a great idea. Don't know why they are not doing it yet.
| grav wrote:
| It seems they are: https://developer.chrome.com/docs/ai/built-
| in
| shepherdjerred wrote:
| Really excited to see this shipped & hopefully get cross-
| browser support
| sujayk_33 wrote:
| https://youtu.be/eJFJRyXEHZ0
|
| in the video they've added in announcement, they are showing some
| live interaction with the model(which is quite fast as compared
| to AI Edge gallery app), how's it built, how can I use it like
| this?
| einpoklum wrote:
| I liked it better as the yellow-ball assistant to Dejiko-hime.
| rvnx wrote:
| Absolute shit. Comparing it to Sonnet 3.7 is an insult.
|
| # Is Eiffel Tower or a soccer ball bigger ?
|
| > A soccer ball is bigger than the Eiffel Tower! Here's a
| breakdown:
|
| > Eiffel Tower: Approximately 330 meters (1,083 feet) tall.
|
| > Soccer Ball: A standard soccer ball has a circumference of
| about 68-70 cm (27-28 inches).
|
| > While the Eiffel Tower is very tall, its base is relatively
| small compared to its height. A soccer ball, though much smaller
| in height, has a significant diameter, making it physically
| larger in terms of volume.
| tgtweak wrote:
| Took a picture of a bag of chips and it said it was
| seasoning... I think, for the size, it's OK - but not really
| there. I'm not sure how they got ELO anywhere near Claude or
| Gemini... those models are leagues ahead in terms of one-shot
| accuracy.
| jonplackett wrote:
| Will we now finally get autocorrect that isn't complete garbage?
|
| That's all I really want for Christmas.
| mmaunder wrote:
| Quote: Expanded Multimodal Understanding with Audio: Gemma 3n can
| understand and process audio, text, and images, and offers
| significantly enhanced video understanding. Its audio
| capabilities enable the model to perform high-quality Automatic
| Speech Recognition (transcription) and Translation (speech to
| translated text). Additionally, the model accepts interleaved
| inputs across modalities, enabling understanding of complex
| multimodal interactions. (Public implementation coming soon)
|
| Wow!!
| username135 wrote:
| Ive been using the text-to-speech model Whisper, from fdriod. Its
| rather small and all processing is done locally on my phone. Its
| pretty good.
| sebastiennight wrote:
| You mean speech-to-text, right? For dictation/transcription?
|
| It is pretty good indeed (despite the ~30sec input limit), but
| this feels unrelated to the topic at hand.
| happy_one wrote:
| Can I interact with this via Node/JavaScript locally?
___________________________________________________________________
(page generated 2025-05-21 23:01 UTC)