[HN Gopher] Gemma.cpp: lightweight, standalone C++ inference eng...
___________________________________________________________________
Gemma.cpp: lightweight, standalone C++ inference engine for Gemma
models
Author : mfiguiere
Score : 298 points
Date : 2024-02-23 15:15 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| a1o wrote:
| If I want to put a Gemma model in a minimalist command line
| interface, build it to a standalone exe file that runs offline,
| what is the size of my final executable? I am interested in how
| small the size of something like this can be while still being
| functional.
| replete wrote:
| I used gemma:2b with ollama last night and the model was around
| 1.3gb IIRC
| samus wrote:
| Depends on how much you quantize the model. For most general-
| purpose LLMs, the model completely dwarfs the size of the
| binary code.
| coder543 wrote:
| https://ollama.com/library/gemma/tags
|
| You can see the various quantizations here, both for the 2B
| model and the 7B model. The smallest you can go is the q2_K
| quantization of the 2B model, which is 1.3GB, but I wouldn't
| really call that "functional". The q4_0 quantization is 1.7GB,
| and that would probably be functional.
|
| The size of anything but the model is going to be rounding
| error compared to how large the models are, in this context.
| sorenjan wrote:
| What's the use case of models this small? Can you use the
| "knowledge" encoded in them and ask them questions and get
| relevant answers, or are they used as text processors to
| summarize documents etc?
| trisfromgoogle wrote:
| Gemma 2B generation quality is excellent in my own very-
| biased opinion. I asked it to write a response to your
| comment:
|
| --
|
| Large language models (LLMs) have achieved significant
| progress in recent years, with models like GPT-3 and LaMDA
| demonstrating remarkable abilities in various tasks such as
| language generation, translation, and question answering.
|
| However, 2b parameter models are a much smaller and simpler
| type of LLM compared to GPT-3. While they are still capable
| of impressive performance, they have a limited capacity for
| knowledge representation and reasoning.
|
| Despite their size, 2b parameter models can be useful in
| certain scenarios where the specific knowledge encoded in
| the model is relevant to the task at hand. For example:
|
| - Question answering: 2b parameter models can be used to
| answer questions by leveraging their ability to generate
| text that is similar to the question.
|
| - Text summarization: 2b parameter models can be used to
| generate concise summaries of documents by extracting the
| most important information.
|
| - Code generation: While not as common, 2b parameter models
| can be used to generate code snippets based on the
| knowledge they have learned.
|
| Overall, 2b parameter models are a valuable tool for tasks
| that require specific knowledge or reasoning capabilities.
| However, for tasks that involve general language
| understanding and information retrieval, larger LLMs like
| GPT-3 may be more suitable.
|
| --
|
| Generated in under 1s from query to full response on
| together.ai
|
| https://api.together.xyz/playground/chat/google/gemma-2b-it
| brucethemoose2 wrote:
| The code is a basically irrelevant fraction of the model
| weights. The raw FP16 is like 17GB.
|
| In practice your priority would be fancy quantization, and just
| _any_ library that compiles down to an executable (like this,
| MLC-LLM or llama.cpp)
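|
| (Back-of-the-envelope for where numbers like 17GB come from; the
| ~8.5B total parameter count for Gemma "7B" including embeddings is
| my approximation, and the quantized figures ignore per-block
| overhead.)
|
|     #include <cstdio>
|
|     int main() {
|       // bytes ~= parameter count * bytes per weight
|       const double params = 8.5e9;  // approx. Gemma "7B" incl. embeddings
|       std::printf("fp16 : ~%.1f GB\n", params * 2.0 / 1e9);  // ~17 GB
|       std::printf("int8 : ~%.1f GB\n", params * 1.0 / 1e9);  // ~8.5 GB
|       std::printf("4-bit: ~%.1f GB\n", params * 0.5 / 1e9);  // ~4.3 GB
|     }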
| a1o wrote:
| 17GB looks like a lot. Thanks, I will wait until people
| figure how to make these smaller before trying to use to make
| something standalone.
| wg0 wrote:
| These won't get smaller, I guess, given we keep the number of
| parameters the same.
|
| In the pre-LLM era (let's say 2020), the hardware used to look
| decently powerful for most use cases (disks in hundreds of
| GBs, a dozen or two GB of RAM, and quad or hex core processors),
| but with the advent of LLMs even disk drives start to look
| pretty small, let alone compute and memory.
| brucethemoose2 wrote:
| And cache! The talk of AI hardware is now "how do we fit
| these darn things inside SRAM?"
| sillysaurusx wrote:
| The average PS5 game seems to be around 45GB. Cyberpunk was
| 250GB.
|
| Distributing 17GB isn't a big deal if you shove it into
| Cloudflare R2.
| brucethemoose2 wrote:
| In theory quantized weights of smaller models are under a
| gigabyte.
|
| If you are looking for megabytes, yeah, those "chat" llms
| are pretty unusable at that size.
| swatcoder wrote:
| It's always going to be a huge quantity of data. Even as
| efficiency improves, storage and bandwidth are so cheap now
| that the incentive will be to convert that efficiency
| towards performance (models with more parameters, ensembles
| of models, etc) rather than chasing some micro-model that
| doesn't do as well. It might not always be 17GB, but don't
| expect some lesser order of magnitude for anything
| competitive.
|
| As maturity arrives, we'll likely see a handful of
| competing local models shipped as part of the OS or as
| redistributable third-party bundles (a la the .NET or Java
| runtimes) so that individual applications don't all need to
| be massive.
|
| You'll either need to wait for that or bite the bullet and
| make something chonky. It's never going to get that small.
| superkuh wrote:
| *EDIT*: Nevermind, llamafile hasn't been updated in a full
| month and gemma support was only added to llama.cpp on the 21st
| of this month. Disregard this post for now and come back when
| mozilla updates llamafile.
|
| ---
|
| llama.cpp has integrated gemma support. So you can use
| llamafile for this. It is a standalone executable that is
| portable across most popular OSes.
|
| https://github.com/Mozilla-Ocho/llamafile/releases
|
| So, download the executable from the releases page under
| assets. You want just the main, server, and llava binaries. Don't
| get the huge ones with the model inlined in the file. The
| executable is about 30MB in size:
|
| https://github.com/Mozilla-Ocho/llamafile/releases/download/...
| brucethemoose2 wrote:
| Not to be confused with llama.cpp and the GGML library, which is
| a separate project (and almost immediately worked with Gemma).
| throwaway19423 wrote:
| I am confused how all these things are able to interoperate.
| Are the creators of these models following the same IO for
| their models? Won't the tokenizer or token embedder be
| different? I am genuinely confused by how the same code works
| for so many different models.
| brucethemoose2 wrote:
| It's complicated, but basically because _most_ are llama
| architecture. Meta all but set the standard for open source
| llms when they released llama1, and anyone trying to deviate
| from it has run into trouble because the models don't work
| with the hyper-optimized llama runtimes.
|
| Also, there's a lot of magic going on behind the scenes with
| configs stored in gguf/huggingface format models, and the
| libraries that use them. There are different tokenizers, but
| they mostly follow the same standards.
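|
| To make "configs stored in gguf" concrete, here is a minimal sketch
| of peeking at a GGUF file's fixed header; the field layout below is
| my reading of the GGUF v3 spec (magic, version, tensor count, then a
| key/value metadata section holding things like
| "general.architecture" and the tokenizer), so verify against ggml's
| headers before relying on it:
|
|     #include <cstdint>
|     #include <cstdio>
|
|     int main(int argc, char** argv) {
|       if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
|       std::FILE* f = std::fopen(argv[1], "rb");
|       if (!f) { std::perror("fopen"); return 1; }
|       char magic[4];                      // expected to be "GGUF"
|       uint32_t version = 0;
|       uint64_t tensor_count = 0, kv_count = 0;
|       if (std::fread(magic, 1, 4, f) != 4 ||
|           std::fread(&version, sizeof version, 1, f) != 1 ||
|           std::fread(&tensor_count, sizeof tensor_count, 1, f) != 1 ||
|           std::fread(&kv_count, sizeof kv_count, 1, f) != 1) {
|         std::fprintf(stderr, "short read\n"); std::fclose(f); return 1;
|       }
|       // The kv_count metadata entries that follow are where a loader
|       // finds the architecture name, hyperparameters, and vocab, which
|       // is how one runtime can load many different models.
|       std::printf("magic=%.4s version=%u tensors=%llu kv=%llu\n", magic,
|                   version, (unsigned long long)tensor_count,
|                   (unsigned long long)kv_count);
|       std::fclose(f);
|       return 0;
|     }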
| null_point wrote:
| I found the magic!
| https://github.com/search?q=repo%3Aggerganov%2Fggml%20magic&...
| jebarker wrote:
| I doubt there'd be confusion as the names are totally different
| brucethemoose2 wrote:
| ...Also, we have eval'd Gemma 7B internally in a deterministic,
| zero-temperature test, and its error rate is like double that of
| Mistral Instruct 0.2. It scores well below most other 7Bs.
|
| Was not very impressed with the chat either.
|
| So maybe this is neat for embedded projects, but if it's Gemma
| only, that would be quite a sticking point for me.
| Havoc wrote:
| That does seem to be the consensus, unfortunately. It would have
| been better for everyone if Google's foray into open models, a la
| FB, had made a splash.
| brucethemoose2 wrote:
| Yeah, especially with how much Google is hyping it.
|
| It could have been long context? Or a little bigger, to fill
| the relative gap in the 13B-30B area? Even if the model
| itself was mediocre (which you can't know until after
| training), it would have been more interesting.
| Vetch wrote:
| Was it via gemma.cpp or some other library? I've seen a few
| people note that gemma performance via gemma.cpp is much better
| than via llama.cpp; is it possible that the non-Google
| implementations are still not quite right?
| brucethemoose2 wrote:
| I eval'd it with vllm.
|
| One thing I _do_ suspect people are running into is sampling
| issues. Gemma probably doesn't like llama defaults with its
| 256K vocab.
|
| Many Chinese llms have a similar "default sampling" issue.
|
| But our testing was done with zero temperature and
| constrained single-token responses, so that shouldn't be an
| issue.
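|
| (For intuition on why sampling settings drop out at zero
| temperature: greedy decoding just takes the argmax of the logits,
| so top-k/top-p/repetition-penalty defaults never enter the picture.
| A minimal sketch:)
|
|     #include <algorithm>
|     #include <cstddef>
|     #include <vector>
|
|     // Zero-temperature ("greedy") decoding: pick the highest-logit
|     // token. No sampler knobs are involved, which is why a
|     // constrained single-token eval sidesteps default-sampling issues.
|     size_t GreedyToken(const std::vector<float>& logits) {
|       return static_cast<size_t>(
|           std::max_element(logits.begin(), logits.end()) - logits.begin());
|     }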
| trisfromgoogle wrote:
| Any chance you can share more details on your measurement setup
| and eval protocols? You're likely seeing some config snafus,
| which we're trying to track down.
| colesantiago wrote:
| Isn't there a huge risk that Google could eventually deprecate
| Gemini, Gemma and Gemma.cpp? It's not really smart to build on
| anything from Google, e.g. Google Cloud for AI.
|
| Has this perception changed or pretty much the same?
| beoberha wrote:
| Gemini - maybe, though I find it pretty unlikely it'll happen
| anytime soon.
|
| Not sure what you mean about Gemma considering it's not a
| service. You can download the model weights and the inference
| code is on GitHub. Everything is local!
| brucethemoose2 wrote:
| This is not necessarily a production backend, as it mentions in
| the readme.
|
| There are some very interesting efforts in JAX/TPU land like
| https://github.com/erfanzar/EasyDeL
| ertgbnm wrote:
| The weights are downloadable so there isn't much of a risk if
| Google stops hosting Gemma apart from the fact that it won't
| get new versions that you swap out in the future.
| cyanydeez wrote:
| even if there's a new model, I'm not seeing how these models
| provide any reliability metric.
|
| if you figure out a money making software/service, you're
| gonna be tied to that model to some significant degree.
| brokensegue wrote:
| does anyone have stats on cpu only inference speed with this?
| austinvhuang wrote:
| any particular hardware folks are most interested in?
| brokensegue wrote:
| I'm just looking for ballpark figures. Maybe a common aws
| instance type
| notum wrote:
| Not sure if this is of any value to you, but Ryzen 7
| generates 2 tokens per second for the 7B-Instruct model.
|
| The model itself is very unimpressive and I see no reason
| to play with it over the worst alternative from Hugging
| Face. I can only imagine this was released for some bizarre
| compliance reasons.
| brokensegue wrote:
| the metrics suggest it's much better than that
| austinvhuang wrote:
| Hi, one of the authors austin here. Happy to answer any questions
| the best I can.
|
| To get a few common questions out of the way:
|
| - This is separate / independent of llama.cpp / ggml. I'm a big
| fan of that project and it was an inspiration (we say as much in
| the README). I've been a big advocate of gguf + llama.cpp support
| for gemma and am happy for people to use that.
|
| - how is it different than inference runtime X? gemma.cpp is a
| direct implementation of gemma, in its current form it's aimed at
| experimentation + research and portability + easy modifiability
| rather than a general purpose deployment framework.
|
| - this initial implementation is cpu simd centric. we're
| exploring options for portable gpu support but the cool thing is
| it will build and run on a lot of environments you might not
| expect an llm to run in, so long as you have the memory to load the
| model.
|
| - I'll let other colleagues answer questions about the Gemma
| model itself, this is a C++ implementation of the model, but
| relatively independent of the model training process.
|
| - Although this is from Google, we're a very small team that
| wanted such a codebase to exist. We have lots of plans to use it
| ourselves and we hope other people like it and find it useful.
|
| - I wrote a twitter thread on this project here:
| https://twitter.com/austinvhuang/status/1760375890448429459
| rgbrgb wrote:
| Thanks for releasing this! What is your use case for this
| rather than llama.cpp? For the on-device AI stuff I mostly do,
| llama.cpp is better because of GPU/metal offloading.
| austinvhuang wrote:
| llama.cpp is great, if it fit your needs you can use it. I
| think at this point llama.cpp is effectively a platform
| that's hardened for production.
|
| In its current form, I think of gemma.cpp as more of a direct
| model implementation (somewhere between the minimalism of
| llama2.c and the generality of ggml).
|
| I tend to think of 3 modes of usage:
|
| - hacking on inference internals - there's very little
| indirection, no IRs, the model is just code, so if you want
| to add support for your own runtime support for
| sparsity/quantization/model compression/etc. and demo it
| working with gemma, there's minimal barriers to do so
|
| - implementing experimental frontends - i'll add some
| examples of this in the very near future. but you're free to
| get pretty creative with terminal UIs, code that interacts
| with model internals like the KV cache, accepting/rejecting
| tokens etc.
|
| - interacting with the model locally with a small program -
| of course there's other options for this but hopefully this
| is one way to play with gemma w/ minimal fuss (rough sketch
| below).
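|
| Roughly, such a small program ends up being a loop like this; the
| names here are placeholders I made up, not the actual gemma.cpp
| API, but the shape is: tokenize, generate with a per-token
| callback, and let the callback decide whether to keep streaming:
|
|     #include <cstdio>
|     #include <functional>
|     #include <vector>
|
|     using StreamToken = std::function<bool(int token)>;
|
|     // Stand-in for a real Generate(): echoes the prompt tokens so the
|     // sketch runs end to end without any model weights.
|     void FakeGenerate(const std::vector<int>& prompt, size_t max_tokens,
|                       const StreamToken& on_token) {
|       for (size_t i = 0; i < max_tokens && i < prompt.size(); ++i) {
|         if (!on_token(prompt[i])) break;  // callback can stop generation
|       }
|     }
|
|     int main() {
|       std::vector<int> prompt = {72, 105, 33};  // pretend-tokenized "Hi!"
|       FakeGenerate(prompt, 256, [](int token) {
|         // A real frontend could accept/reject tokens here, inspect the
|         // KV cache, stop on a custom delimiter, drive a TUI, etc.
|         std::printf("%c", static_cast<char>(token));
|         return true;
|       });
|       std::printf("\n");
|       return 0;
|     }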
| dartharva wrote:
| So... llamafile release?
|
| https://github.com/Mozilla-Ocho/llamafile
| austinvhuang wrote:
| gguf files are out there, so anyone should be able to do
| this! are people looking for an "official" version?
|
| ps i'm a fan of cosmopolitan as well.
| beoberha wrote:
| > Although this is from Google, we're a very small team that
| wanted such a codebase to exist. We have lots of plans to use
| it ourselves and we hope other people like it and find it
| useful.
|
| This is really cool, Austin. Kudos to your team!
| austinvhuang wrote:
| Thanks so much!
|
| Everyone working on this self-selected into contributing, so
| I think of it less as my team than ... a team?
|
| Specifically want to call out: Jan Wassenberg (author of
| https://github.com/google/highway) and I started gemma.cpp as
| a small project just a few months ago + Phil Culliton, Dan
| Zheng, and Paul Chang + of course the GDM Gemma team.
| trisfromgoogle wrote:
| Huge +1, this has definitely been a self-forming collective
| of people who love great AI, great research, and the open
| community.
|
| Austin and Jan are truly amazing. The optimization work is
| genuinely outstanding; I get incredible CPU performance on
| Gemma.cpp for inference. Thanks for all of the awesomeness,
| Austin =)
| moffkalast wrote:
| Cool, any plans on adding K quants, an API server and/or a
| python wrapper? I really doubt most people want to use it as a
| cpp dependency and run models at FP16.
| austinvhuang wrote:
| There's a custom 8-bit quantization (SFP), it's what we
| recommend. At 16 bit, we do bfloat16 instead of fp16 thanks
| to https://github.com/google/highway, even on CPU. Other
| quants - stay tuned.
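|
| (For a concrete feel of what 8-bit weight compression looks like,
| here is a generic per-block absmax int8 scheme; this is purely an
| illustration and not the actual SFP format in the repo:)
|
|     #include <algorithm>
|     #include <cmath>
|     #include <cstdint>
|
|     // Generic per-block int8 quantization, for illustration only:
|     // 32 weights share one float scale, so ~9 bits per weight total.
|     struct BlockQ8 {
|       float scale;
|       int8_t q[32];
|     };
|
|     BlockQ8 Quantize(const float* w) {
|       float amax = 0.f;
|       for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(w[i]));
|       BlockQ8 b;
|       b.scale = amax / 127.f;
|       for (int i = 0; i < 32; ++i)
|         b.q[i] = static_cast<int8_t>(std::lround(w[i] / (b.scale + 1e-12f)));
|       return b;
|     }
|
|     float Dequantize(const BlockQ8& b, int i) { return b.scale * b.q[i]; }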
|
| python wrapper - if you want to run the model in python I
| feel like there's already a lot of more mature options
| available (see the model variations at
| https://www.kaggle.com/models/google/gemma) , but if people
| really want this and have something they want to do with a
| python wrapper that can't be done with existing options let
| me know. (similar thoughts wrt to API servers).
| moffkalast wrote:
| In my experience there's really no reason to run any model
| above Q6_K, the performance is identical and you shave off
| almost 2 GB of VRAM on a 7B model compared to Q8. To those
| of us with single digit amounts, that's highly significant.
| But most people seem to go for 4 bits anyway and it's the
| AWQ standard too. If you think it'll make the model look
| bad, then don't worry, it's only the relative performance
| that matters.
|
| I would think that having an OpenAI standard compatible API
| would be a higher priority over a python wrapper, since
| then it can act as a drop in replacement for most any
| backend.
| austinvhuang wrote:
| A nice side effect of implementing cpu simd is you just
| need enough regular RAM, which tends to be far less
| scarce than VRAM. Nonetheless, I get your point that more
| aggressive quantization is valuable + will share with the
| modeling team.
| moffkalast wrote:
| True, it's the only way I can for example run Mixtral on
| a 8GB GPU, but main memory will always have more latency
| so some tradeoff tends to be worth it. And parts like the
| prompt batch buffer and most of the context generally
| have to be in VRAM if you want to use cuBLAS, with
| OpenBLAS it's maybe less of a problem, but it is slower.
| leminimal wrote:
| Kudos on your release! I know this was just made available, but:
|
| - Somewhere in the README, consider adding the need for a
| `-DWEIGHT_TYPE=hwy::bfloat16_t` flag for non-sfp. Maybe around
| step 3.
|
| - The README should explicitly say somewhere that there's no GPU
| support (at the moment)
|
| - "Failed to read cache gating_ein_0 (error 294)" is pretty
| obscure. I think even "(error at line number 294)" would be a
| big improvement when it fails to FindKey.
|
| - There's something odd about the 2b vs 7b model. The 2b will
| claim it's trained by Google but the 7b won't. Were these
| trained on the same data?
|
| - Are the .sbs weights the same weights as the GGUF? I'm
| getting different answers compared to llama.cpp. Do you know of
| a good way to compare the two? Any way to make both
| deterministic? Or even dump probability distributions on the
| first (or any) token to compare?
| austinvhuang wrote:
| Yes - thanks for pointing that out. The README is being
| updated; you can see an updated WIP in the dev branch:
| https://github.com/google/gemma.cpp/tree/dev?tab=readme-ov-f...
| and improving error messages is a high priority.
|
| The weights should be the same across formats, but it's easy
| for differences to arise due to quantization and/or subtle
| implementation differences. Minor implementation differences
| has been a pain point in the ML ecosystem for a while (w/
| IRs, onnx, python vs. runtime, etc.), but hopefully the
| differences aren't too significant (if they are, it's a bug
| in one of the implementations).
|
| There were quantization fixes like
| https://twitter.com/ggerganov/status/1760418864418934922 and
| other patches happening, but it may take a few days for
| patches to work their way through the ecosystem.
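|
| On the "good way to compare the two" question above: one low-tech
| approach is to run both implementations greedily (temperature 0),
| dump the probability distribution for the first token from each,
| and compare them directly. A sketch of the comparison half (the
| dumping side depends on the runtime):
|
|     #include <algorithm>
|     #include <cmath>
|     #include <cstdio>
|     #include <vector>
|
|     // Compare two first-token probability distributions, e.g. one from
|     // gemma.cpp and one from llama.cpp. A large max-abs-diff or KL
|     // suggests a real implementation/quantization difference rather
|     // than sampling noise.
|     void CompareDistributions(const std::vector<double>& p,
|                               const std::vector<double>& q) {
|       double max_abs = 0.0, kl = 0.0;
|       for (size_t i = 0; i < p.size() && i < q.size(); ++i) {
|         max_abs = std::max(max_abs, std::fabs(p[i] - q[i]));
|         if (p[i] > 0.0 && q[i] > 0.0) kl += p[i] * std::log(p[i] / q[i]);
|       }
|       std::printf("max |p-q| = %g, KL(p||q) = %g\n", max_abs, kl);
|     }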
| verticalscaler wrote:
| Hi Austin, what say you about how the Gemma rollout was
| handled, issues raised, and atmosphere around the office? :)
| trisfromgoogle wrote:
| I'm not Austin, but I am Tris, the friendly neighborhood
| product person on Gemma. Overall, I think that the main
| feeling is: incredibly relieved to have had the launch go as
| smoothly as it has! The complexity of the launch is truly
| astounding:
|
| 1) Reference implementations in JAX, PyTorch, TF with Keras
| 3, MaxText/JAX, more...
|
| 2) Full integration at launch with HF including Transformers
| + optimization therein
|
| 3) TensorRT-LLM and full NVIDIA opt across the stack in
| partnership with that team (mentioned on the NVIDIA earnings
| call by Jensen, even)
|
| 4) More developer surfaces than you can shake a stick at:
| Kaggle, Colab, Gemma.cpp, GGUF
|
| 5) Comms landing with full coordination from Sundar + Demis +
| Jeff Dean, not to mention positive articles in NYT, Verge,
| Fortune, etc.
|
| 6) Full Google Cloud launches across several major products,
| including Vertex and GKE
|
| 7) Launched globally and with a permissive set of terms that
| enable developers to do awesome stuff
|
| Pulling that off without any _major_ SNAFUs is a huge relief
| for the team. We're excited by the potential of using all of
| those surfaces and the launch momentum to build a lot more
| great things for you all =)
| kergonath wrote:
| I am not a fan of a lot of what Google does, but
| congratulations! That's a massive undertaking and it is
| bringing the field forward. I am glad you could do this,
| and hope you'll have many other successful releases.
|
| Now, I'm off playing with a new toy :)
| verticalscaler wrote:
| Has there been any negative articles or valid criticism at
| all in your opinion? =)
| trisfromgoogle wrote:
| Always -- anything that comes with the Google name
| attached always attracts some negativity. There's plenty
| of valid criticism, most of which we hope to address in
| the coming weeks and months =).
| verticalscaler wrote:
| Can you without addressing just acknowledge some of it
| here? Specific examples? =)
|
| > not to mention positive articles in NYT, Verge,
| Fortune, etc.
|
| You in fact are mentioning them and only them. I was
| wondering if you can simply mention the negative ones.
| Otherwise it sort of sounded at first like its all roses.
| ;)
| trisfromgoogle wrote:
| I mean, many articles will have a negative cast because
| of the need for clicks -- e.g., the Verge's launch
| article is entitled "Google Gemma: because Google doesn't
| want to give away Gemini yet" -- which I think is both an
| unfair characterization (given the free tier of Gemini
| Pro) and unnecessarily inflammatory.
|
| Legitimate criticisms include not working correctly out
| of the box for llama.cpp due to repetition penalty and
| vocab size, some snafus on chat templates with
| huggingface, the fact that they're not larger-sized
| models, etc. Lots of the issues are already fixed, and
| we're committed to making sure these models are great.
|
| Honestly, not sure what you're trying to get at here --
| are you trying to "gotcha" the fact that not everything
| is perfect? That's true for any launch.
| verticalscaler wrote:
| I mean, take for example Paul Graham, the guy who made
| this website you're posting on: "Wow,
| Gemini is a joke. The ridiculous images
| generated by Gemini aren't an anomaly.
| They're a self-portrait of Google's bureaucratic
| corporate culture. The bigger your cash cow,
| the worse your culture can get without driving you out of
| business. And Google's cash cow, search advertising, is
| one of the biggest the world has ever seen." -
| https://twitter.com/paulg/status/1760078920135872716
|
| Are these legitimate criticisms? I don't _think_ he needs
| clicks. "The AI behavior guardrails that
| are set up with prompt engineering and filtering should
| be public -- the creators should proudly stand behind
| their vision of what is best for society and how they
| crystallized it into commands and code. *I
| suspect many are actually ashamed.* The
| thousands of tiny nudges encoded by reinforcement
| learning from human feedback offer a lot more plausible
| deniability, of course." -
| https://twitter.com/ID_AA_Carmack/status/1760360183945965853
|
| Neither does John Carmack.
| trisfromgoogle wrote:
| Neither of those applies at all to Gemma, though? I'm
| still confused -- what are you trying to accomplish with
| this line of questioning?
| verticalscaler wrote:
| I'm confused about your confusion ("Still confused"? When
| did you suddenly become confused? This is the first time
| you've mentioned it).
|
| Gemini is a nerfed, for lack of a better term, model
| released by Google to overwhelming negative response from
| the public. Not just the click-hungry press, but users,
| including highly knowledgeable ones.
|
| Said nerfing "a self-portrait of Google's bureaucratic
| corporate culture" according to Paul Graham.
|
| Same company. Same culture. Releases a toy open model, a
| mini Gemini. The concerns around it are the same.
|
| > Launched globally and with a permissive set of terms
| that enable developers to do awesome stuff
|
| I think by permissive set of terms you're referring to
| the license, not the model itself. Not entirely sure how
| developers can trust the model itself given recent
| events. Seems like a reasonable line of questioning.
|
| I'm trying to establish whether honest human-like non-
| evasive communication is on offer along with the model so
| developers might contemplate trusting your work output.
|
| Otherwise, besides being a technical curio, unclear what
| anybody can do with it. Most wouldn't feel comfortable
| rolling a Gemma model into a product without a lot more
| clarity.
| trisfromgoogle wrote:
| I've been completely honest, human-like, and non-evasive
| with you. I answered your questions directly and frankly.
|
| Every time, you ignored the honest and human-like answers
| to try and score some imaginary points.
|
| We're honestly trying our best to build open models
| *with* the community that you can tune and use to build
| neat AI research + products. Ignoring that in favor of
| some political narrative is really petty.
| verticalscaler wrote:
| > not to mention positive articles in NYT, Verge,
| Fortune, etc.
|
| > There's plenty of valid criticism, most of which we
| hope to address in the coming weeks and months =).
|
| > many articles will have a negative cast because of the
| need for clicks
|
| This simply comes across as, at best, spin. Given what
| just happened with Gemini I urge you not to communicate
| in this style.
|
| > doesn't apply at all to Gemma
|
| That would be the crux of the matter. As a sibling
| comment states:
|
| > a larger unreleased version of it (Gemma) is (likely)
| used as part of the Gemini product
|
| There is no reason for anybody outside Google to think
| otherwise, these are blackboxes.
|
| > Ignoring that in favor of some political narrative is
| really petty.
|
| You can assume any politics you wish; the issues
| with Gemini have already been acknowledged by Google
| along with an apology.
|
| https://blog.google/products/gemini/gemini-image-generation-...
| "over time, the model became way more cautious than we
| intended and refused to answer certain prompts entirely --
| wrongly interpreting some very anodyne prompts as sensitive."
|
| There is no information for me or anybody else to go on that
| your Gemma models are not nerfed; whether they were nerfed
| maliciously or accidentally is immaterial.
|
| Instead of addressing that you and you alone eventually
| brought it back to politics.
|
| > We're honestly trying our best to build open models
| _with_ the community
|
| Amongst other things this requires much more trust and
| transparency and an altogether different communication
| style.
| summerlight wrote:
| It looks like you're trying to get some sort of
| "confession" from relevant people based on recent memes
| against the company? The reality is likely that the
| developers sincerely believe in the value of this product
| and are proud of its launch. You're just adding
| uninteresting, irrelevant noise to the discussion and you
| probably won't get what you want.
| jph00 wrote:
| These comments appear to be about Gemini's image
| generation, IIUC. Gemma, however, is a language model --
| whilst I believe that a larger unreleased version of it
| is used as part of the Gemini product, it doesn't seem
| relevant to these criticisms. Also, the Gemma base model
| is released, which doesn't AFAIK contain any RLHF.
|
| The impression I have is that you're using the release of
| Gemma to complain about tangentially related issues about
| Google and politics more generally. The HN guidelines
| warn against this: "Eschew flamebait. Avoid generic
| tangents... Please don't post shallow dismissals,
| especially of other people's work. A good critical
| comment teaches us something. Please don't use Hacker
| News for political or ideological battle."
| dankle wrote:
| What's the reason to not integrate with llama.cpp instead of a
| separate app? In what ways is this better than llama.cpp?
| austinvhuang wrote:
| On uses, see
| https://news.ycombinator.com/item?id=39481554#39482302 and on
| llama.cpp support -
| https://news.ycombinator.com/item?id=39481554
|
| Gemma support has been added to llama.cpp, and we're more
| than happy to see people use it there.
| freedomben wrote:
| I think on uses you meant to link to
| https://news.ycombinator.com/item?id=39482581 child of
| https://news.ycombinator.com/item?id=39481554#39482302 ?
|
| side note: imagine how gnarly those urls would be if HN
| used UUIDs instead of integers for IDs :-D
| einpoklum wrote:
| Come on Dejiko, we don't have time for this gema!
|
| https://www.youtube.com/watch?v=9FSAqDVZHhU
| a-french-anon wrote:
| Glad I wasn't alone.
| einpoklum wrote:
| Well, it was just so nostalgic for me nyo :-\
| sillysaurusx wrote:
| Every time I see Gemma all I hear is Jubei screaming Genmaaaa
| since the n is almost silent.
| https://youtu.be/TFR9-cZecWo?si=rMED2LEh-fssHeeG
| austinvhuang wrote:
| lol
| ofermend wrote:
| Awesome work on getting this done so quickly. We just added Gemma
| to the HHEM leaderboard -
| https://huggingface.co/spaces/vectara/leaderboard, and as you can
| see there it's doing pretty well in terms of low hallucination
| rate, relative to other small models.
| swozey wrote:
| > LLM hallucinations
|
| I wasn't familiar with the term, good article -
| https://masterofcode.com/blog/hallucinations-in-llms-what-yo...
| ed wrote:
| Karpathy offers a more concise (and whimsical) explanation
| https://x.com/karpathy/status/1733299213503787018
| swozey wrote:
| The velocity of the LLM open source ecosystem is absolutely
| insane.
|
| I just got into hobby projects with diffusion a week ago and I'm
| seeing non-stop releases. It's hard to keep up. It's a firehose
| of information, acronyms, code etc.
|
| It's been a great python refresher.
| austinvhuang wrote:
| Don't be discouraged, you don't have to follow everything.
|
| In fact it's probably better to dive deep into one hobby
| project like you're doing than constantly context switch with
| every little news item that comes up.
|
| While working on gemma.cpp there were definitely a lot of "gee
| i wish i could clone myself and work on that other thing too".
| next_xibalba wrote:
| Is this neutered in the way Gemini is (i.e. is the "censorship"
| built in) or is that a "feature" of the Gemini application?
| ComputerGuru wrote:
| It depends on the model you load/use, the team released both
| censored and "PT" versions.
| jonpo wrote:
| These models (Gemma) are very difficult to jailbreak.
| throwaway19423 wrote:
| Can any kind soul explain the difference between GGUF, GGML and
| all the other model packaging I am seeing these days? Was used to
| pth and the thing tf uses. Is this all to support inference or
| quantization? Who manages these formats or are they brewing
| organically?
| austinvhuang wrote:
| I think it's mostly an organic process arising from the
| ecosystem.
|
| My personal way of understanding it is this - the original sin
| of model weight format complexity is that NNs are both data and
| computation.
|
| Representing the computation as data is the hard part and
| that's where the simplicity falls apart. Do you embed the
| compute graph? If so, what do you do about different frameworks
| supporting overlapping but distinct operations. Do you need the
| artifact to make training reproducible? Well that's an even
| more complex computation that you have to serialize as data.
| And so on..
| moffkalast wrote:
| It's all mostly just inference, though some train LoRAs
| directly on quantized models too.
|
| GGML and GGUF are the same thing, GGUF is the new version that
| adds more data about the model so it's easy to support multiple
| architectures, and also includes prompt templates. These can
| run CPU only, be partially or fully offloaded to a GPU. With K
| quants, you can get anywhere from a 2 bit to an 8 bit GGUF.
|
| GPTQ was the GPU-only optimized quantization method that was
| superseded by AWQ, which is roughly 2x faster and now by EXL2
| which is even better. These are usually only 4 bit.
|
| Safetensors and pytorch bin files are raw float16 model files,
| these are only really used for continued fine tuning.
| Gracana wrote:
| > and also includes prompt templates
|
| That sounds very convenient. What software makes use of the
| built-in prompt template?
| moffkalast wrote:
| Of the ones I commonly use, I've only seen it read by text-
| generation-webui, in the GGML days it had a long hardcoded
| list of known models and which templates they use so they
| could be auto-selected (which was often wrong), but now it
| just grabs it from any model directly and sets it when it's
| loaded.
| liuliu wrote:
| pth can include Python code (PyTorch code) for inference. TF
| includes the complete static graph.
|
| GGUF is just weights, safetensors the same thing. GGUF doesn't
| need a JSON decoder for the format while safetensors needs
| that.
|
| I personally think having a JSON decoder is not a big deal and
| makes the format more amendable, given that GGUF evolves too.
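|
| (Concretely, the safetensors layout as I understand it is an 8-byte
| little-endian header length, then that many bytes of JSON describing
| each tensor's name/dtype/shape/offset, then the raw tensor data; the
| JSON header is the part that needs a decoder. A sketch, assuming a
| little-endian host:)
|
|     #include <cstdint>
|     #include <cstdio>
|     #include <string>
|
|     int main(int argc, char** argv) {
|       if (argc < 2) { std::fprintf(stderr, "usage: %s file.safetensors\n", argv[0]); return 1; }
|       std::FILE* f = std::fopen(argv[1], "rb");
|       if (!f) { std::perror("fopen"); return 1; }
|       uint64_t header_len = 0;  // little-endian u64 header length
|       if (std::fread(&header_len, sizeof header_len, 1, f) != 1) { std::fclose(f); return 1; }
|       std::string header(header_len, '\0');
|       if (std::fread(&header[0], 1, header_len, f) != header_len) { std::fclose(f); return 1; }
|       // The JSON maps tensor names to {dtype, shape, data_offsets}.
|       std::printf("JSON header (%llu bytes): %.200s...\n",
|                   (unsigned long long)header_len, header.c_str());
|       std::fclose(f);
|       return 0;
|     }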
| Wissan wrote:
| Hello
| sintax wrote:
| Demo when model quantized to q0_K?
| tarruda wrote:
| Is it not possible to add Gemma support on Llama.cpp?
| austinvhuang wrote:
| Gemma support has been added to llama.cpp, in fact it was added
| almost immediately after the release:
| https://twitter.com/ggerganov/status/1760293079313973408
|
| However, be aware that there were some quality issues with
| quantization initially (hopefully they're resolved but i
| haven't followed too closely):
| https://twitter.com/ggerganov/status/1760418864418934922
| zoogeny wrote:
| I know a lot of people chide Google for being behind OpenAI in
| their commercial offerings. We also dunk on them for the over-
| protective nature of their fine-tuning.
|
| But Google is scarily capable on the LLM front and we shouldn't
| count them out. OpenAI might have the advantage of being quick to
| move, but when the juggernaut gets past its resting inertia and
| starts to gain momentum it is going to leave an impression.
|
| That became clear to me after watching the recent Jeff Dean video
| [1] which was posted a few days ago. The depth of institutional
| knowledge that is going to be unlocked inside Google is actually
| frightening for me to consider.
|
| I hope the continued competition on the open source front, which
| we can really thank Facebook and Llama for, keeps these behemoths
| sharing. As OpenAI moves further from its original mission into
| capitalizing on its technological lead, we have to remember why
| the original vision they had is important.
|
| So thank you, Google, for this.
|
| 1.
| https://www.youtube.com/watch?v=oSCRZkSQ1CE&ab_channel=RiceK...
| whimsicalism wrote:
| Realistically, if Google has all this talent, they should have
| gotten the juggernaut moving in 2020.
|
| Google has had _years_ to get to this stage, and they've lost
| a lot of the talent that made their initial big splashes to OAI
| and competitors. Try finding someone on a sparse MoE paper from
| Google prior to 2022 who is still working there and not at OAI.
|
| With respect, they can hardly even beat Mistral, resorting to
| rounding down a 7.8b model (w/o embeddings) to 7b.
| freedomben wrote:
| Organizational dysfunction can squash/squander even the most
| talented engineers. Especially in a big org in big tech. My
| bet is that their inability to deliver before is probably a
| result of non-comittal funders/decision makers, product
| whiplash, corporate politics, and other non-technical
| challenges.
|
| Google has been the home of the talent for many years. They
| came on my radar in the late 00s when I used Peter Norvig's
| textbook in college, and they hired Ray Kurzweil in like 2012
| or 2013 IIRC. They were hiring ML PhDs with talent for many
| years, and they pioneered most of the major innovations. They
| just got behind on productizing and shipping.
| whimsicalism wrote:
| Right, which was fine for them before there was major
| competition. But starting in 2020, they have basically
| attrited most of their talented labor force to OAI and
| competitors who were not similarly dysfunctional.
| dguest wrote:
| Maybe someone who knows google better can answer my question
| here: are they behind simply because LLMs are not really their
| core business? In other words, it wasn't (and still isn't)
| obvious that LLMs will help them sell add space.
|
| And of course writing that gives me a terrible realization:
| product placement in LLMs is going to be a very big thing in
| the near future.
| freedomben wrote:
| I'm an outsider and am speculating based on what I've heard,
| so maybe I shouldn't even comment, but to me it seems like
| it's been entirely corporate/organizational reasons. Non-
| serious funding, shifting priorities, personnel
| transfers/fluctuations, internal fragmentation, and more.
| Lack of talent has never been their problem.
| elwell wrote:
| LLM bad because cannibalizes search ads. Wait as long as
| possible. OpenAI opens pandora's box. Now full speed ahead;
| catch up and overtake.
| brigadier132 wrote:
| There was a podcast yesterday that explained well why Google is
| in a tough position.
|
| https://youtu.be/-i9AGk3DJ90?t=616
|
| In essence, Google already rules information retrieval. Their
| margins are insane. Switching to LLM based search cuts into
| their margins and increases their costs dramatically. Also, the
| advantage they've built over decades has been cut down.
|
| All of this means there is potential for less profit and a
| shrinking valuation. A shrinking valuation means issues with
| employee retention and it could lead to long term stagnation.
| corysama wrote:
| The Innovator's Dilemma over and over again.
| brikym wrote:
| I'm sure Kodak had the same problem with the digital camera.
| llm_nerd wrote:
| While I generally agree with you, who has ever counted Google
| out? We've made fun of Google for lagging while they instead
| spend their engineering time renaming projects and performing
| algorithmic white-erasure, but we all knew they're a potent
| force.
|
| Google has as much or more computing power than anyone. They're
| massively capitalized and have a market cap of almost $2T and
| colossal cashflow, and have the ability to throw enormous
| resources at the problem until they have a competitor. They
| have an enormous, benchmark-setting amount of data across their
| various projects to train on. That we're talking like they're
| some scrappy upstart is super weird.
|
| >As OpenAI moves further from its original mission into
| capitalizing on its technological lead, we have to remember why
| the original vision they had is important.
|
| I'm way more cynical about the open source models released by
| the megas, and OpenAI is probably the most honest about their
| intentions. Meta and Google are releasing these models arguably
| to kneecap any possible next OpenAI. They want to basically set
| the market value of anything below state of the art at $0.00,
| ensuring that there is no breathing room below the $2T cos.
| These models (Llama, Gemma, etc) are fun toys, but in the end
| they're completely uncompetitive and will yield zero "wins", so
| to speak.
| jerpint wrote:
| > Meta and Google are releasing these models arguably to
| kneecap any possible next OpenAI. They want to basically set
| the market value of anything below state of the art at $0.00,
| ensuring that there is no breathing room below the $2T cos
|
| Never thought about it that way, but it makes a lot of sense.
| It's also true these models are not up to par with SOTA no
| matter what the benchmarks say
| loudmax wrote:
| I certainly would not count out Google's engineering talent.
| But all the technical expertise in the world won't matter
| when the leadership is incompetent and dysfunctional. Rolling
| out a new product takes vision, and it means taking some
| risks. This is diametrically opposed to how Google operates
| today. Gemini could be years ahead of ChatGPT (and maybe it
| is now, if it weren't neutered), but Google's current
| leadership would have no idea what to do with it.
|
| Google has the technical resources to become a major player
| here, maybe even the dominant player. But it won't happen
| under current management. I won't count out Google entirely,
| and there's still time for the company to be saved. It starts
| with new leadership.
| refulgentis wrote:
| There's nothing provided here other than Jeff Dean gave a stock
| entry-level presentation to students at Rice, therefore "The
| depth of institutional knowledge that is going to be unlocked
| inside Google is actually frightening for me to consider."
|
| You should see Google's turnover numbers from 4 years ago, much
| less now.
|
| It's been years, it's broken internally, we see the results.
|
| Here, we're in awe of 1KLOC of C++ code that runs inference on
| the CPU.
|
| Nobody serious is running inference on CPU unless you're on the
| extreme cutting edge. (ex. I need to on Android and on the
| Chrome OS Linux VM, but I still use llama.cpp because it does
| support GPU everywhere else)
|
| I'm not sure what else to say.
|
| (n.b. i am a xoogler)
| kwantaz wrote:
| nice
| namtranase wrote:
| Thanks to the team for the awesome repo. I have been using
| gemma.cpp since the first day, and it has been smooth in my view.
| I hope gemma.cpp will continue to add cool features (something
| like k-quants, a server, ...) so it can serve more use cases.
| I have also developed a Python wrapper for it:
| https://github.com/namtranase/gemma-cpp-python The purpose is to
| make it easy to use and to pick up every new technique from the
| gemma.cpp team.
| dontupvoteme wrote:
| At the risk of being snarky, it's interesting that Llama.cpp was
| a 'grassroots' effort originating from a Bulgarian hacker, and
| Google now launches a corporatized effort inspired by it.
|
| I wonder if there are some analogies to the 80s or 90s in here.
| trisfromgoogle wrote:
| To be clear, this is not comparable directly to llama.cpp --
| Gemma models work on llama.cpp and we encourage people who love
| llama.cpp to use them there. We also launched with Ollama.
|
| Gemma.cpp is a highly optimized and lightweight system. The
| performance is pretty incredible on CPU, give it a try =)
| natch wrote:
| Apart from the fact that they are different things, since they
| came out of the same organization I think it's fair to ask:
|
| Do these models have the same kind of odd behavior as Gemini?
| xrd wrote:
| I was discussing LLMs with a non technical person on the plane
| yesterday. I was explaining why LLMs aren't good at math. And, he
| responded, no, chatgpt is great at multivariate regression, etc.
|
| I'm using LLMs locally almost always and eschewing API backed
| LLMs like chatgpt. So I'm not very familiar with plugins, and I'm
| assuming chatgpt plugs into a backend when it detects a math
| problem. So it isn't the LLM doing the math but to the user it
| appears to be.
|
| Does anyone here know which LLM projects like llama.cpp or
| gemma.cpp support a plugin model?
|
| I'm interested in adding to the dungeons and dragons system I
| built using llama.cpp. Because it doesn't do math well, the
| combat mode is terrible. But I was writing my own layer to break
| out when combat mode occurs, and I'm wondering if there is a
| better way with some kind of plugin approach.
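|
| The kind of thing I have in mind (hypothetical marker and names, not
| an existing llama.cpp or gemma.cpp feature) is scanning the model's
| output for a tool-call marker like [ROLL 2d6+3], doing the math in
| plain code, and splicing the result back into the next prompt:
|
|     #include <cstdio>
|     #include <random>
|     #include <regex>
|     #include <string>
|
|     // Hypothetical "plugin" layer: the LLM never does arithmetic.
|     // We evaluate [ROLL NdM+K] markers in ordinary code instead.
|     int RollDice(int count, int sides, int bonus, std::mt19937& rng) {
|       std::uniform_int_distribution<int> die(1, sides);
|       int total = bonus;
|       for (int i = 0; i < count; ++i) total += die(rng);
|       return total;
|     }
|
|     std::string ResolveRolls(const std::string& text, std::mt19937& rng) {
|       static const std::regex kRoll(R"(\[ROLL (\d+)d(\d+)([+-]\d+)?\])");
|       std::string out = text;
|       std::smatch m;
|       while (std::regex_search(out, m, kRoll)) {
|         int bonus = m[3].matched ? std::stoi(m[3].str()) : 0;
|         int total =
|             RollDice(std::stoi(m[1].str()), std::stoi(m[2].str()), bonus, rng);
|         out = m.prefix().str() + std::to_string(total) + m.suffix().str();
|       }
|       return out;
|     }
|
|     int main() {
|       std::mt19937 rng(42);
|       std::puts(
|           ResolveRolls("The goblin hits for [ROLL 2d6+3] damage!", rng)
|               .c_str());
|     }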
___________________________________________________________________
(page generated 2024-02-23 23:00 UTC)