[HN Gopher] Gemma.cpp: lightweight, standalone C++ inference eng...
       ___________________________________________________________________
        
       Gemma.cpp: lightweight, standalone C++ inference engine for Gemma
       models
        
       Author : mfiguiere
       Score  : 298 points
       Date   : 2024-02-23 15:15 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | a1o wrote:
       | If I want to put a Gemma model in a minimalist command line
       | interface, build it to a standalone exe file that runs offline,
       | what is the size of my final executable? I am interested in how
       | small can the size of something like this be and it still be
       | functional.
        
         | replete wrote:
          | I used gemma:2b with ollama last night and the model was
          | around 1.3 GB IIRC
        
         | samus wrote:
         | Depends how much you quantize the model. For most general-
         | purpose LLMs, the model completely dwarfs the size of the
         | binary code.
        
         | coder543 wrote:
         | https://ollama.com/library/gemma/tags
         | 
         | You can see the various quantizations here, both for the 2B
         | model and the 7B model. The smallest you can go is the q2_K
         | quantization of the 2B model, which is 1.3GB, but I wouldn't
         | really call that "functional". The q4_0 quantization is 1.7GB,
         | and that would probably be functional.
         | 
         | The size of anything but the model is going to be rounding
         | error compared to how large the models are, in this context.
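          | 
          | For a rough sense of where these numbers come from: file
          | size is about parameter count times bits per weight divided
          | by eight. A sketch (parameter counts are approximate, and
          | real quant formats store block scales and keep some tensors
          | at higher precision, so actual files run somewhat larger):
          | 
          |       #include <cstdio>
          |       
          |       // Back-of-the-envelope model file sizes.
          |       int main() {
          |         const double params_2b = 2.5e9;  // Gemma 2B, incl. the
          |                                          // 256K-vocab embeddings
          |         const double params_7b = 8.5e9;  // Gemma 7B, ditto
          |         const double bpw[] = {16, 8, 4.5, 2.6};  // fp16, q8_0,
          |                                                  // ~q4_0, ~q2_K
          |         for (double b : bpw) {
          |           std::printf("%4.1f bpw: 2B ~%4.1f GB, 7B ~%4.1f GB\n",
          |                       b, params_2b * b / 8 / 1e9,
          |                       params_7b * b / 8 / 1e9);
          |         }
          |         // 16 bpw on the 7B gives ~17 GB, matching the raw
          |         // FP16 figure mentioned elsewhere in the thread.
          |       }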
        
           | sorenjan wrote:
           | What's the use case of models this small? Can you use the
           | "knowledge" encoded in them and ask them questions and get
           | relevant answers, or are they used as text processors to
           | summarize documents etc?
        
             | trisfromgoogle wrote:
             | Gemma 2B generation quality is excellent in my own very-
             | biased opinion. I asked it to write a response to your
             | comment:
             | 
             | --
             | 
             | Large language models (LLMs) have achieved significant
             | progress in recent years, with models like GPT-3 and LaMDA
             | demonstrating remarkable abilities in various tasks such as
             | language generation, translation, and question answering.
             | 
             | However, 2b parameter models are a much smaller and simpler
             | type of LLM compared to GPT-3. While they are still capable
             | of impressive performance, they have a limited capacity for
             | knowledge representation and reasoning.
             | 
             | Despite their size, 2b parameter models can be useful in
             | certain scenarios where the specific knowledge encoded in
             | the model is relevant to the task at hand. For example:
             | 
             | - Question answering: 2b parameter models can be used to
             | answer questions by leveraging their ability to generate
             | text that is similar to the question.
             | 
             | - Text summarization: 2b parameter models can be used to
             | generate concise summaries of documents by extracting the
             | most important information.
             | 
             | - Code generation: While not as common, 2b parameter models
             | can be used to generate code snippets based on the
             | knowledge they have learned.
             | 
             | Overall, 2b parameter models are a valuable tool for tasks
             | that require specific knowledge or reasoning capabilities.
             | However, for tasks that involve general language
             | understanding and information retrieval, larger LLMs like
             | GPT-3 may be more suitable.
             | 
             | --
             | 
             | Generated in under 1s from query to full response on
             | together.ai
             | 
             | https://api.together.xyz/playground/chat/google/gemma-2b-it
        
         | brucethemoose2 wrote:
         | The code is a basically irrelevant fraction of the model
         | weights. The raw FP16 is like 17GB.
         | 
         | In practice your priority would be fancy quantization, and just
         | _any_ library that compiles down to an executable (like this,
         | MLC-LLM or llama.cpp)
        
           | a1o wrote:
            | 17GB looks like a lot. Thanks, I will wait until people
            | figure out how to make these smaller before trying to use
            | one to make something standalone.
        
             | wg0 wrote:
              | These won't get smaller, I guess, given we keep the
              | number of parameters the same.
             | 
              | In the pre-LLM era (let's say 2020), hardware used to
              | look decently powerful for most use cases (disks in the
              | hundreds of GBs, a dozen or two GB of RAM, and quad- or
              | hex-core processors), but with the advent of LLMs even
              | disk drives start to look pretty small, let alone
              | compute and memory.
        
               | brucethemoose2 wrote:
               | And cache! The talk of AI hardware is now "how do we fit
               | these darn things inside SRAM?"
        
             | sillysaurusx wrote:
             | The average PS5 game seems to be around 45GB. Cyberpunk was
             | 250GB.
             | 
             | Distributing 17GB isn't a big deal if you shove it into
             | Cloudflare R2.
        
             | brucethemoose2 wrote:
             | In theory quantized weights of smaller models are under a
             | gigabyte.
             | 
             | If you are looking for megabytes, yeah, those "chat" llms
             | are pretty unusable at that size.
        
             | swatcoder wrote:
             | It's always going to be a huge quantity of data. Even as
             | efficiency improves, storage and bandwidth are so cheap now
             | that the incentive will be to convert that efficiency
             | towards performance (models with more parameters, ensembles
             | of models, etc) rather than chasing some micro-model that
             | doesn't do as well. It might not always be 17GB, but don't
             | expect some lesser order of magnitude for anything
             | competitive.
             | 
             | As maturity arrives, we'll likely see a handful of
             | competing local models shipped as part of the OS or as
             | redistributable third-party bundles (a la the .NET or Java
             | runtimes) so that individual applications don't all need to
             | be massive.
             | 
             | You'll either need to wait for that or bite the bullet and
             | make something chonky. It's never going to get that small.
        
         | superkuh wrote:
         | *EDIT*: Nevermind, llamafile hasn't been updated in a full
         | month and gemma support was only added to llama.cpp on the 21st
         | of this month. Disregard this post for now and come back when
         | mozilla updates llamafile.
         | 
         | ---
         | 
         | llama.cpp has integrated gemma support. So you can use
         | llamafile for this. It is a standalone executable that is
         | portable across most popular OSes.
         | 
         | https://github.com/Mozilla-Ocho/llamafile/releases
         | 
          | So, download an executable from the releases page under
          | assets. You want just the bare main, server, or llava
          | binaries. Don't get the huge ones with the model inlined in
          | the file. The executable is about 30MB in size:
         | 
         | https://github.com/Mozilla-Ocho/llamafile/releases/download/...
        
       | brucethemoose2 wrote:
        | Not to be confused with llama.cpp and the GGML library, which
        | is a separate project (and almost immediately worked with
        | Gemma).
        
         | throwaway19423 wrote:
         | I am confused how all these things are able to interoperate.
         | Are the creators of these models following the same IO for
         | their models? Won't the tokenizer or token embedder be
         | different? I am genuinely confused by how the same code works
         | for so many different models.
        
           | brucethemoose2 wrote:
           | It's complicated, but basically because _most_ are llama
           | architecture. Meta all but set the standard for open source
           | llms when they released llama1, and anyone trying to deviate
            | from it has run into trouble because the models don't
            | work with the hyper-optimized llama runtimes.
           | 
           | Also, there's a lot of magic going on behind the scenes with
           | configs stored in gguf/huggingface format models, and the
           | libraries that use them. There are different tokenizers, but
           | they mostly follow the same standards.
        
             | null_point wrote:
             | I found the magic! https://github.com/search?q=repo%3Aggerg
             | anov%2Fggml%20magic&...
        
         | jebarker wrote:
         | I doubt there'd be confusion as the names are totally different
        
       | brucethemoose2 wrote:
        | ...Also, we have eval'd Gemma 7B internally in a
        | deterministic, zero-temperature test, and its error rate is
        | roughly double that of Mistral Instruct 0.2, putting it well
        | below most other 7Bs.
       | 
       | Was not very impressed with the chat either.
       | 
       | So maybe this is neat for embedded projects, but if it's Gemma
       | only, that would be quite a sticking point for me.
        
         | Havoc wrote:
          | That does seem to be the consensus, unfortunately. It would
          | have been better for everyone if Google's foray into open
          | models, a la FB, had made a splash.
        
           | brucethemoose2 wrote:
           | Yeah, especially with how much Google is hyping it.
           | 
           | It could have been long context? Or a little bigger, to fill
           | the relative gap in the 13B-30B area? Even if the model
           | itself was mediocre (which you can't know until after
           | training), it would have been more interesting.
        
         | Vetch wrote:
          | Was it via gemma.cpp or some other library? I've seen a few
          | people note that gemma performance via gemma.cpp is much
          | better than via llama.cpp; is it possible that the non-
          | Google implementations are still not quite right?
        
           | brucethemoose2 wrote:
           | I eval'd it with vllm.
           | 
           | One thing I _do_ suspect people are running into is sampling
            | issues. Gemma probably doesn't like llama defaults with its
           | 256K vocab.
           | 
           | Many Chinese llms have a similar "default sampling" issue.
           | 
           | But our testing was done with zero temperature and
            | constrained single-token responses, so that shouldn't be an
           | issue.
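            | 
            | For reference, zero temperature reduces sampling to a
            | greedy argmax over the logits, so sampler settings like
            | top-p or repetition penalty can't be the source of the
            | errors. A minimal sketch:
            | 
            |       #include <cstddef>
            |       #include <vector>
            |       
            |       // Greedy decoding: pick the highest-logit token.
            |       // With temperature > 0 you would scale the logits,
            |       // softmax, and sample instead.
            |       size_t GreedyToken(const std::vector<float>& logits) {
            |         size_t best = 0;
            |         for (size_t i = 1; i < logits.size(); ++i) {
            |           if (logits[i] > logits[best]) best = i;
            |         }
            |         return best;
            |       }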
        
         | trisfromgoogle wrote:
         | Any chance you can share more details on your measurement setup
         | and eval protocols? You're likely seeing some config snafus,
         | which we're trying to track down.
        
       | colesantiago wrote:
        | Isn't there a huge risk that Google deprecates Gemini, Gemma,
        | and Gemma.cpp? Not really smart to build on anything with
        | Google, e.g. Google Cloud for AI.
       | 
       | Has this perception changed or pretty much the same?
        
         | beoberha wrote:
         | Gemini - maybe, though I find it pretty unlikely it'll happen
         | anytime soon.
         | 
         | Not sure what you mean about Gemma considering it's not a
         | service. You can download the model weights and the inference
         | code is on GitHub. Everything is local!
        
         | brucethemoose2 wrote:
         | This is not necessarily a production backend, as it mentions in
         | the readme.
         | 
         | There are some very interesting efforts in JAX/TPU land like
         | https://github.com/erfanzar/EasyDeL
        
         | ertgbnm wrote:
          | The weights are downloadable, so there isn't much of a risk
          | if Google stops hosting Gemma, apart from the fact that you
          | won't get new versions to swap in down the road.
        
           | cyanydeez wrote:
           | even if there's a new model, I'm not seeing how these models
           | provide any reliability metric.
           | 
           | if you figure out a money making software/service, you're
           | gonna be tied to that model to some significant degree.
        
       | brokensegue wrote:
       | does anyone have stats on cpu only inference speed with this?
        
         | austinvhuang wrote:
         | any particular hardware folks are most interested in?
        
           | brokensegue wrote:
           | I'm just looking for ballpark figures. Maybe a common aws
           | instance type
        
             | notum wrote:
             | Not sure if this is of any value to you, but Ryzen 7
             | generates 2 tokens per second for the 7B-Instruct model.
             | 
             | The model itself is very unimpressive and I see no reason
             | to play with it over the worst alternative from Hugging
             | Face. I can only imagine this was released for some bizarre
             | compliance reasons.
        
               | brokensegue wrote:
               | the metrics suggest it's much better than that
        
       | austinvhuang wrote:
       | Hi, one of the authors austin here. Happy to answer any questions
       | the best I can.
       | 
       | To get a few common questions out of the way:
       | 
       | - This is separate / independent of llama.cpp / ggml. I'm a big
       | fan of that project and it was an inspiration (we say as much in
       | the README). I've been a big advocate of gguf + llama.cpp support
       | for gemma and am happy for people to use that.
       | 
        | - how is it different from inference runtime X? gemma.cpp is
        | a direct implementation of gemma; in its current form it's
        | aimed at experimentation + research and portability + easy
        | modifiability rather than being a general-purpose deployment
        | framework.
       | 
        | - this initial implementation is cpu simd centric. we're
        | exploring options for portable gpu support, but the cool
        | thing is it will build and run in a lot of environments you
        | might not expect an llm to run in, so long as you have the
        | memory to load the model (see the sketch at the end of this
        | comment).
       | 
       | - I'll let other colleagues answer questions about the Gemma
       | model itself, this is a C++ implementation of the model, but
       | relatively independent of the model training process.
       | 
       | - Although this is from Google, we're a very small team that
       | wanted such a codebase to exist. We have lots of plans to use it
       | ourselves and we hope other people like it and find it useful.
       | 
       | - I wrote a twitter thread on this project here:
       | https://twitter.com/austinvhuang/status/1760375890448429459
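        | 
        | To give a flavor of the cpu simd approach, here is a minimal
        | dot product sketch using google/highway, the SIMD library
        | gemma.cpp builds on. This is a simplified static-dispatch
        | example, not gemma.cpp's actual kernel code (the real code
        | uses highway's per-CPU dynamic dispatch and fused ops):
        | 
        |       #include <cstddef>
        |       #include "hwy/highway.h"
        |       
        |       namespace hn = hwy::HWY_NAMESPACE;
        |       
        |       float DotProduct(const float* a, const float* b, size_t n) {
        |         const hn::ScalableTag<float> d;  // widest vector on target
        |         const size_t N = hn::Lanes(d);
        |         auto acc = hn::Zero(d);
        |         size_t i = 0;
        |         for (; i + N <= n; i += N) {     // vector body (FMA)
        |           acc = hn::MulAdd(hn::Load(d, a + i),
        |                            hn::Load(d, b + i), acc);
        |         }
        |         float sum = hn::GetLane(hn::SumOfLanes(d, acc));
        |         for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
        |         return sum;
        |       }
        | 
        | The same source compiles to SSE4/AVX2/AVX-512/NEON/SVE and so
        | on, which is what makes the "runs in environments you might
        | not expect" portability possible.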
        
         | rgbrgb wrote:
         | Thanks for releasing this! What is your use case for this
         | rather than llama.cpp? For the on-device AI stuff I mostly do,
         | llama.cpp is better because of GPU/metal offloading.
        
           | austinvhuang wrote:
            | llama.cpp is great; if it fits your needs, you can use it. I
           | think at this point llama.cpp is effectively a platform
           | that's hardened for production.
           | 
            | In its current form, I think of gemma.cpp as more of a direct
           | model implementation (somewhere between the minimalism of
           | llama2.c and the generality of ggml).
           | 
           | I tend to think of 3 modes of usage:
           | 
            | - hacking on inference internals - there's very little
            | indirection, no IRs, the model is just code, so if you
            | want to add your own runtime support for
            | sparsity/quantization/model compression/etc. and demo it
            | working with gemma, there are minimal barriers to doing so
           | 
           | - implementing experimental frontends - i'll add some
           | examples of this in the very near future. but you're free to
            | get pretty creative with terminal UIs, code that interacts
            | with model internals like the KV cache, accepting/rejecting
            | tokens, etc.
           | 
           | - interacting with the model locally with a small program -
           | of course there's other options for this but hopefully this
           | is one way to play with gemma w/ minimal fuss.
        
         | dartharva wrote:
         | So... llamafile release?
         | 
         | https://github.com/Mozilla-Ocho/llamafile
        
           | austinvhuang wrote:
           | gguf files are out there, so anyone should be able to do
           | this! are people looking for an "official" version?
           | 
           | ps i'm a fan of cosmopolitan as well.
        
         | beoberha wrote:
         | > Although this is from Google, we're a very small team that
         | wanted such a codebase to exist. We have lots of plans to use
         | it ourselves and we hope other people like it and find it
         | useful.
         | 
         | This is really cool, Austin. Kudos to your team!
        
           | austinvhuang wrote:
           | Thanks so much!
           | 
           | Everyone working on this self-selected into contributing, so
           | I think of it less as my team than ... a team?
           | 
           | Specifically want to call out: Jan Wassenberg (author of
           | https://github.com/google/highway) and I started gemma.cpp as
           | a small project just a few months ago + Phil Culliton, Dan
           | Zheng, and Paul Chang + of course the GDM Gemma team.
        
             | trisfromgoogle wrote:
             | Huge +1, this has definitely been a self-forming collective
             | of people who love great AI, great research, and the open
             | community.
             | 
             | Austin and Jan are truly amazing. The optimization work is
             | genuinely outstanding; I get incredible CPU performance on
             | Gemma.cpp for inference. Thanks for all of the awesomeness,
             | Austin =)
        
         | moffkalast wrote:
         | Cool, any plans on adding K quants, an API server and/or a
         | python wrapper? I really doubt most people want to use it as a
         | cpp dependency and run models at FP16.
        
           | austinvhuang wrote:
            | There's a custom 8-bit quantization (SFP); it's what we
            | recommend. At 16 bits, we do bfloat16 instead of fp16,
            | thanks to https://github.com/google/highway, even on CPU.
            | Other quants - stay tuned.
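            | 
            | For those unfamiliar with bfloat16: it keeps fp32's 8
            | exponent bits and truncates the mantissa to 7 bits, so
            | converting is nearly just taking the top 16 bits of the
            | fp32 word. A small sketch of the round-trip, rounding to
            | nearest even and ignoring NaN edge cases:
            | 
            |       #include <cstdint>
            |       #include <cstring>
            |       
            |       // fp32 -> bf16: keep the high 16 bits, rounding to
            |       // nearest even. (Real code also special-cases NaN.)
            |       uint16_t F32ToBF16(float f) {
            |         uint32_t bits;
            |         std::memcpy(&bits, &f, sizeof(bits));
            |         bits += 0x7FFF + ((bits >> 16) & 1);
            |         return static_cast<uint16_t>(bits >> 16);
            |       }
            |       
            |       // bf16 -> fp32: shift back into the high half.
            |       float BF16ToF32(uint16_t h) {
            |         uint32_t bits = static_cast<uint32_t>(h) << 16;
            |         float f;
            |         std::memcpy(&f, &bits, sizeof(f));
            |         return f;
            |       }
            | 
            | This is why bf16 keeps fp32's dynamic range while halving
            | memory, at the cost of mantissa precision.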
           | 
           | python wrapper - if you want to run the model in python I
           | feel like there's already a lot of more mature options
           | available (see the model variations at
           | https://www.kaggle.com/models/google/gemma) , but if people
           | really want this and have something they want to do with a
           | python wrapper that can't be done with existing options let
            | me know. (Similar thoughts wrt API servers.)
        
             | moffkalast wrote:
              | In my experience there's really no reason to run any
              | model above Q6_K: the performance is identical and you
              | shave almost 2 GB of VRAM off a 7B model compared to
              | Q8. To those of us with single-digit gigabytes of VRAM,
              | that's highly significant. But most people seem to go
              | for 4 bits anyway, and it's the AWQ standard too. If
              | you think it'll make the model look bad, then don't
              | worry, it's only the relative performance that matters.
             | 
              | I would think that having an OpenAI-compatible API
              | would be a higher priority than a python wrapper, since
              | then it can act as a drop-in replacement for almost any
              | backend.
        
               | austinvhuang wrote:
               | A nice side effect of implementing cpu simd is you just
               | need enough regular RAM, which tends to be far less
               | scarce than VRAM. Nonetheless, I get your point that more
               | aggressive quantization is valuable + will share with the
               | modeling team.
        
               | moffkalast wrote:
                | True, it's the only way I can, for example, run
                | Mixtral on an 8GB GPU, but main memory will always
                | have more latency, so some tradeoff tends to be worth
                | it. And parts like the prompt batch buffer and most
                | of the context generally have to be in VRAM if you
                | want to use cuBLAS; with OpenBLAS it's maybe less of
                | a problem, but it is slower.
        
         | leminimal wrote:
         | Kudos on your release! I know this was just made available but
         | 
          | - Somewhere in the README, consider adding the need for a
         | `-DWEIGHT_TYPE=hwy::bfloat16_t` flag for non-sfp. Maybe around
         | step 3.
         | 
          | - The README should explicitly say somewhere that there's no GPU
         | support (at the moment)
         | 
         | - "Failed to read cache gating_ein_0 (error 294)" is pretty
         | obscure. I think even "(error at line number 294)" would be a
         | big improvement when it fails to FindKey.
         | 
          | - There's something odd about the 2b vs 7b model. The 2b
          | will claim it's trained by Google but the 7b won't. Were
          | these trained on the same data?
         | 
         | - Are the .sbs weights the same weights as the GGUF? I'm
         | getting different answers compared to llama.cpp. Do you know of
         | a good way to compare the two? Any way to make both
         | deterministic? Or even dump probability distributions on the
         | first (or any) token to compare?
        
           | austinvhuang wrote:
           | Yes - thanks for pointing that out. The README is being
           | updated, you can see an updated WIP in the dev branch:
           | https://github.com/google/gemma.cpp/tree/dev?tab=readme-
           | ov-f... and improving error messages is a high priority.
           | 
           | The weights should be the same across formats, but it's easy
           | for differences to arise due to quantization and/or subtle
            | implementation differences. Minor implementation differences
            | have been a pain point in the ML ecosystem for a while (w/
           | IRs, onnx, python vs. runtime, etc.), but hopefully the
           | differences aren't too significant (if they are, it's a bug
           | in one of the implementations).
           | 
           | There were quantization fixes like
           | https://twitter.com/ggerganov/status/1760418864418934922 and
           | other patches happening, but it may take a few days for
           | patches to work their way through the ecosystem.
        
         | verticalscaler wrote:
         | Hi Austin, what say you about how the Gemma rollout was
         | handled, issues raised, and atmosphere around the office? :)
        
           | trisfromgoogle wrote:
           | I'm not Austin, but I am Tris, the friendly neighborhood
           | product person on Gemma. Overall, I think that the main
           | feeling is: incredibly relieved to have had the launch go as
           | smoothly as it has! The complexity of the launch is truly
           | astounding:
           | 
           | 1) Reference implementations in JAX, PyTorch, TF with Keras
           | 3, MaxText/JAX, more...
           | 
           | 2) Full integration at launch with HF including Transformers
           | + optimization therein
           | 
           | 3) TensorRT-LLM and full NVIDIA opt across the stack in
           | partnership with that team (mentioned on the NVIDIA earnings
           | call by Jensen, even)
           | 
           | 4) More developer surfaces than you can shake a stick at:
           | Kaggle, Colab, Gemma.cpp, GGUF
           | 
           | 5) Comms landing with full coordination from Sundar + Demis +
           | Jeff Dean, not to mention positive articles in NYT, Verge,
           | Fortune, etc.
           | 
           | 6) Full Google Cloud launches across several major products,
           | including Vertex and GKE
           | 
           | 7) Launched globally and with a permissive set of terms that
           | enable developers to do awesome stuff
           | 
           | Pulling that off without any _major_ SNAFUs is a huge relief
           | for the team. We 're excited by the potential of using all of
           | those surfaces and the launch momentum to build a lot more
           | great things for you all =)
        
             | kergonath wrote:
             | I am not a fan of a lot of what Google does, but
             | congratulations! That's a massive undertaking and it is
             | bringing the field forward. I am glad you could do this,
             | and hope you'll have many other successful releases.
             | 
             | Now, I'm off playing with a new toy :)
        
             | verticalscaler wrote:
             | Has there been any negative articles or valid criticism at
             | all in your opinion? =)
        
               | trisfromgoogle wrote:
               | Always -- anything that comes with the Google name
               | attached always attracts some negativity. There's plenty
               | of valid criticism, most of which we hope to address in
               | the coming weeks and months =).
        
               | verticalscaler wrote:
               | Can you without addressing just acknowledge some of it
               | here? Specific examples? =)
               | 
               | > not to mention positive articles in NYT, Verge,
               | Fortune, etc.
               | 
               | You in fact are mentioning them and only them. I was
               | wondering if you can simply mention the negative ones.
                | Otherwise it sort of sounded at first like it's all roses.
               | ;)
        
               | trisfromgoogle wrote:
               | I mean, many articles will have a negative cast because
               | of the need for clicks -- e.g., the Verge's launch
               | article is entitled "Google Gemma: because Google doesn't
               | want to give away Gemini yet" -- which I think is both an
               | unfair characterization (given the free tier of Gemini
               | Pro) and unnecessarily inflammatory.
               | 
               | Legitimate criticisms include not working correctly out
               | of the box for llama.cpp due to repetition penalty and
               | vocab size, some snafus on chat templates with
               | huggingface, the fact that they're not larger-sized
               | models, etc. Lots of the issues are already fixed, and
               | we're committed to making sure these models are great.
               | 
               | Honestly, not sure what you're trying to get at here --
               | are you trying to "gotcha" the fact that not everything
               | is perfect? That's true for any launch.
        
               | verticalscaler wrote:
                | I mean, take for example Paul Graham, the guy who
                | made this website you're posting on:
                | 
                |       "Wow, Gemini is a joke.
                | 
                |       The ridiculous images generated by Gemini
                |       aren't an anomaly. They're a self-portrait of
                |       Google's bureaucratic corporate culture.
                | 
                |       The bigger your cash cow, the worse your
                |       culture can get without driving you out of
                |       business. And Google's cash cow, search
                |       advertising, is one of the biggest the world
                |       has ever seen."
                | 
                | - https://twitter.com/paulg/status/1760078920135872716
               | 
                | Are these legitimate criticisms? I don't _think_ he
                | needs clicks.
                | 
                |       "The AI behavior guardrails that are set up
                |       with prompt engineering and filtering should be
                |       public -- the creators should proudly stand
                |       behind their vision of what is best for society
                |       and how they crystallized it into commands and
                |       code.
                | 
                |       *I suspect many are actually ashamed*
                | 
                |       The thousands of tiny nudges encoded by
                |       reinforcement learning from human feedback
                |       offer a lot more plausible deniability, of
                |       course."
                | 
                | - https://twitter.com/ID_AA_Carmack/status/1760360183945965853
               | 
               | Neither does John Carmack.
        
               | trisfromgoogle wrote:
               | Neither of those applies at all to Gemma, though? I'm
               | still confused -- what are you trying to accomplish with
               | this line of questioning?
        
               | verticalscaler wrote:
                | I'm confused about your confusion ("Still confused"?
                | When did you suddenly become confused? This is the
                | first time you've mentioned it).
               | 
                | Gemini is a nerfed, for lack of a better term, model
                | released by Google to an overwhelmingly negative
                | response from the public. Not just the click-hungry
                | press: users, including highly knowledgeable ones.
               | 
               | Said nerfing "a self-portrait of Google's bureaucratic
               | corporate culture" according to Paul Graham.
               | 
               | Same company. Same culture. Releases a toy open model, a
               | mini Gemini. The concerns around it are the same.
               | 
               | > Launched globally and with a permissive set of terms
               | that enable developers to do awesome stuff
               | 
               | I think by permissive set of terms you're referring to
               | the license, not the model itself. Not entirely sure how
               | developers can trust the model itself given recent
               | events. Seems like a reasonable line of questioning.
               | 
               | I'm trying to establish whether honest human-like non-
               | evasive communication is on offer along with the model so
               | developers might contemplate trusting your work output.
               | 
               | Otherwise, besides being a technical curio, unclear what
               | anybody can do with it. Most wouldn't feel comfortable
               | rolling a Gemma model into a product without a lot more
               | clarity.
        
               | trisfromgoogle wrote:
               | I've been completely honest, human-like, and non-evasive
               | with you. I answered your questions directly and frankly.
               | 
               | Every time, you ignored the honest and human-like answers
               | to try and score some imaginary points.
               | 
               | We're honestly trying our best to build open models
               | *with* the community that you can tune and use to build
               | neat AI research + products. Ignoring that in favor of
               | some political narrative is really petty.
        
               | verticalscaler wrote:
               | > not to mention positive articles in NYT, Verge,
               | Fortune, etc.
               | 
               | > There's plenty of valid criticism, most of which we
               | hope to address in the coming weeks and months =).
               | 
               | > many articles will have a negative cast because of the
               | need for clicks
               | 
               | This simply comes across as, at best, spin. Given what
               | just happened with Gemini I urge you not to communicate
               | in this style.
               | 
               | > doesn't apply at all to Gemma
               | 
               | That would be the crux of the matter. As a sibling
               | comment states:
               | 
               | > a larger unreleased version of it (Gemma) is (likely)
               | used as part of the Gemini product
               | 
               | There is no reason for anybody outside Google to think
               | otherwise, these are blackboxes.
               | 
               | > Ignoring that in favor of some political narrative is
               | really petty.
               | 
                | You can assume any politics you wish; the issues
               | with Gemini have already been acknowledged by Google
               | along with an apology.
               | 
                | https://blog.google/products/gemini/gemini-image-generation-...
                | 
                |       "over time, the model became way more cautious
                |       than we intended and refused to answer certain
                |       prompts entirely -- wrongly interpreting some
                |       very anodyne prompts as sensitive."
               | 
                | There is no information for me or anybody else to go
                | on that your Gemma models are not nerfed; whether the
                | nerfing is malicious or accidental is immaterial.
               | 
               | Instead of addressing that you and you alone eventually
               | brought it back to politics.
               | 
               | > We're honestly trying our best to build open models
               | _with_ the community
               | 
               | Amongst other things this requires much more trust and
               | transparency and an altogether different communication
               | style.
        
               | summerlight wrote:
               | It looks like you're trying to get some sort of
               | "confession" from relevant people based on recent memes
               | against the company? The reality is likely that the
               | developers sincerely believe in the value of this product
               | and are proud of its launch. You're just adding
               | uninteresting, irrelevant noise to the discussion and you
               | probably won't get what you want.
        
               | jph00 wrote:
               | These comments appear to be about Gemini's image
               | generation, IIUC. Gemma, however, is a language model --
               | whilst I believe that a larger unreleased version of it
               | is used as part of the Gemini product, it doesn't seem
               | relevant to these criticisms. Also, the Gemma base model
               | is released, which doesn't AFAIK contain any RLHF.
               | 
               | The impression I have is that you're using the release of
               | Gemma to complain about tangentially related issues about
               | Google and politics more generally. The HN guidelines
               | warn against this: "Eschew flamebait. Avoid generic
               | tangents... Please don't post shallow dismissals,
               | especially of other people's work. A good critical
               | comment teaches us something. Please don't use Hacker
               | News for political or ideological battle."
        
         | dankle wrote:
          | What's the reason to not integrate with llama.cpp instead
          | of making a separate app? In what ways is this better than
          | llama.cpp?
        
           | austinvhuang wrote:
           | On uses, see
           | https://news.ycombinator.com/item?id=39481554#39482302 and on
           | llama.cpp support -
           | https://news.ycombinator.com/item?id=39481554
           | 
           | Gemma support has been added to llama.cpp, and we're more
           | than happy to see people use it there.
        
             | freedomben wrote:
             | I think on uses you meant to link to
             | https://news.ycombinator.com/item?id=39482581 child of
             | https://news.ycombinator.com/item?id=39481554#39482302 ?
             | 
             | side note: imagine how gnarly those urls would be if HN
             | used UUIDs instead of integers for IDs :-D
        
       | einpoklum wrote:
       | Come on Dejiko, we don't have time for this gema!
       | 
       | https://www.youtube.com/watch?v=9FSAqDVZHhU
        
         | a-french-anon wrote:
         | Glad I wasn't alone.
        
           | einpoklum wrote:
           | Well, it was just so nostalgic for me nyo :-\
        
         | sillysaurusx wrote:
         | Every time I see Gemma all I hear is Jubei screaming Genmaaaa
         | since the n is almost silent.
         | https://youtu.be/TFR9-cZecWo?si=rMED2LEh-fssHeeG
        
           | austinvhuang wrote:
           | lol
        
       | ofermend wrote:
       | Awesome work on getting this done so quickly. We just added Gemma
       | to the HHEM leaderboard -
        | https://huggingface.co/spaces/vectara/leaderboard, and as you
        | can see there, it's doing pretty well in terms of low
        | hallucination rate relative to other small models.
        
         | swozey wrote:
         | > LLM hallucinations
         | 
         | I wasn't familiar with the term, good article -
         | https://masterofcode.com/blog/hallucinations-in-llms-what-yo...
        
           | ed wrote:
           | Karpathy offers a more concise (and whimsical) explanation
           | https://x.com/karpathy/status/1733299213503787018
        
       | swozey wrote:
       | The velocity of the LLM open source ecosystem is absolutely
       | insane.
       | 
       | I just got into hobby projects with diffusion a week ago and I'm
       | seeing non-stop releases. It's hard to keep up. It's a firehose
       | of information, acronyms, code etc.
       | 
       | It's been a great python refresher.
        
         | austinvhuang wrote:
         | Don't be discouraged, you don't have to follow everything.
         | 
         | In fact it's probably better to dive deep into one hobby
         | project like you're doing than constantly context switch with
         | every little news item that comes up.
         | 
          | While working on gemma.cpp there were definitely a lot of
          | "gee I wish I could clone myself and work on that other
          | thing too" moments.
        
       | next_xibalba wrote:
       | Is this neutered in the way Gemini is (i.e. is the "censorship"
       | built in) or is that a "feature" of the Gemini application?
        
         | ComputerGuru wrote:
         | It depends on the model you load/use, the team released both
         | censored and "PT" versions.
        
         | jonpo wrote:
         | These models (Gemma) are very difficult to jailbreak.
        
       | throwaway19423 wrote:
       | Can any kind soul explain the difference between GGUF, GGML and
        | all the other model packaging I am seeing these days? I was
        | used to pth and the thing TF uses. Is this all to support
        | inference or
       | quantization? Who manages these formats or are they brewing
       | organically?
        
         | austinvhuang wrote:
         | I think it's mostly an organic process arising from the
         | ecosystem.
         | 
         | My personal way of understanding it is this - the original sin
         | of model weight format complexity is that NNs are both data and
         | computation.
         | 
         | Representing the computation as data is the hard part and
         | that's where the simplicity falls apart. Do you embed the
         | compute graph? If so, what do you do about different frameworks
         | supporting overlapping but distinct operations. Do you need the
         | artifact to make training reproducible? Well that's an even
         | more complex computation that you have to serialize as data.
         | And so on..
        
         | moffkalast wrote:
         | It's all mostly just inference, though some train LoRAs
         | directly on quantized models too.
         | 
         | GGML and GGUF are the same thing, GGUF is the new version that
         | adds more data about the model so it's easy to support multiple
         | architectures, and also includes prompt templates. These can
         | run CPU only, be partially or fully offloaded to a GPU. With K
         | quants, you can get anywhere from a 2 bit to an 8 bit GGUF.
         | 
          | GPTQ was the GPU-only optimized quantization method; it was
          | superseded by AWQ, which is roughly 2x faster, and now by
          | EXL2, which is even better. These are usually only 4 bit.
         | 
         | Safetensors and pytorch bin files are raw float16 model files,
         | these are only really used for continued fine tuning.
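          | 
          | If you're curious what "adds more data about the model"
          | looks like concretely: the GGUF container starts with a
          | small fixed header followed by metadata key/values (that's
          | where the architecture name and prompt template live). A
          | sketch that reads just the fixed header, per my
          | understanding of the ggml GGUF spec (assumes a
          | little-endian host):
          | 
          |       #include <cstdint>
          |       #include <cstdio>
          |       
          |       // Reads GGUF's fixed header: 4-byte magic "GGUF",
          |       // u32 version, u64 tensor count, u64 metadata kv
          |       // count (per GGUF v2+). The metadata kv pairs that
          |       // follow hold the architecture, template, etc.
          |       int main(int argc, char** argv) {
          |         if (argc < 2) return 1;
          |         FILE* f = std::fopen(argv[1], "rb");
          |         if (!f) return 1;
          |         char magic[4];
          |         uint32_t version;
          |         uint64_t n_tensors, n_kv;
          |         if (std::fread(magic, 1, 4, f) != 4 ||
          |             std::fread(&version, 4, 1, f) != 1 ||
          |             std::fread(&n_tensors, 8, 1, f) != 1 ||
          |             std::fread(&n_kv, 8, 1, f) != 1) return 1;
          |         std::printf(
          |             "magic=%.4s version=%u tensors=%llu kvs=%llu\n",
          |             magic, (unsigned)version,
          |             (unsigned long long)n_tensors,
          |             (unsigned long long)n_kv);
          |         std::fclose(f);
          |       }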
        
           | Gracana wrote:
           | > and also includes prompt templates
           | 
           | That sounds very convenient. What software makes use of the
           | built-in prompt template?
        
             | moffkalast wrote:
             | Of the ones I commonly use, I've only seen it read by text-
             | generation-webui, in the GGML days it had a long hardcoded
             | list of known models and which templates they use so they
             | could be auto-selected (which was often wrong), but now it
             | just grabs it from any model directly and sets it when it's
             | loaded.
        
         | liuliu wrote:
         | pth can include Python code (PyTorch code) for inference. TF
         | includes the complete static graph.
         | 
          | GGUF is just weights; safetensors is the same thing. GGUF
          | doesn't need a JSON decoder for the format, while
          | safetensors does.
          | 
          | I personally think having a JSON decoder is not a big deal
          | and makes the format more amenable to change, given that
          | GGUF evolves too.
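          | 
          | For comparison with the GGUF sketch above: a safetensors
          | file is just a u64 little-endian length, then that many
          | bytes of JSON describing every tensor (dtype, shape, byte
          | offsets), then the raw weights. A sketch that extracts the
          | JSON header, which you'd then hand to a JSON parser:
          | 
          |       #include <cstdint>
          |       #include <cstdio>
          |       #include <vector>
          |       
          |       // Prints the JSON header of a .safetensors file.
          |       int main(int argc, char** argv) {
          |         if (argc < 2) return 1;
          |         FILE* f = std::fopen(argv[1], "rb");
          |         if (!f) return 1;
          |         uint64_t len = 0;  // header length, little-endian
          |         if (std::fread(&len, 8, 1, f) != 1) return 1;
          |         std::vector<char> json(len);
          |         if (std::fread(json.data(), 1, len, f) != len) return 1;
          |         std::fwrite(json.data(), 1, len, stdout);
          |         std::fclose(f);
          |       }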
        
       | Wissan wrote:
       | Hello
        
         | sintax wrote:
         | Demo when model quantized to q0_K?
        
       | tarruda wrote:
        | Is it not possible to add Gemma support to llama.cpp?
        
         | austinvhuang wrote:
         | Gemma support has been added to llama.cpp, in fact it was added
         | almost immediately after the release:
         | https://twitter.com/ggerganov/status/1760293079313973408
         | 
         | However, be aware that there were some quality issues with
         | quantization initially (hopefully they're resolved but i
         | haven't followed too closely):
         | https://twitter.com/ggerganov/status/1760418864418934922
        
       | zoogeny wrote:
       | I know a lot of people chide Google for being behind OpenAI in
       | their commercial offerings. We also dunk on them for the over-
       | protective nature of their fine-tuning.
       | 
       | But Google is scarily capable on the LLM front and we shouldn't
       | count them out. OpenAI might have the advantage of being quick to
       | move, but when the juggernaut gets passed its resting inertia and
       | starts to gain momentum it is going to leave an impression.
       | 
       | That became clear to me after watching the recent Jeff Dean video
       | [1] which was posted a few days ago. The depth of institutional
       | knowledge that is going to be unlocked inside Google is actually
       | frightening for me to consider.
       | 
       | I hope the continued competition on the open source front, which
       | we can really thank Facebook and Llama for, keeps these behemoths
       | sharing. As OpenAI moves further from its original mission into
       | capitalizing on its technological lead, we have to remember why
       | the original vision they had is important.
       | 
       | So thank you, Google, for this.
       | 
       | 1.
       | https://www.youtube.com/watch?v=oSCRZkSQ1CE&ab_channel=RiceK...
        
         | whimsicalism wrote:
         | Realistically, if Google has all this talent, they should have
         | gotten the juggernaut moving in 2020.
         | 
          | Google has had _years_ to get to this stage, and they've lost
         | a lot of the talent that made their initial big splashes to OAI
         | and competitors. Try finding someone on a sparse MoE paper from
         | Google prior to 2022 who is still working there and not at OAI.
         | 
         | With respect, they can hardly even beat Mistral, resorting to
         | rounding down a 7.8b model (w/o embeddings) to 7b.
        
           | freedomben wrote:
           | Organizational dysfunction can squash/squander even the most
           | talented engineers. Especially in a big org in big tech. My
           | bet is that their inability to deliver before is probably a
            | result of non-committal funders/decision makers, product
           | whiplash, corporate politics, and other non-technical
           | challenges.
           | 
           | Google has been the home of the talent for many years. They
           | came on my radar in the late 00s when I used Peter Norvig's
           | textbook in college, and they hired Ray Kurzweil in like 2012
           | or 2013 IIRC. They were hiring ML PhDs with talent for many
           | years, and they pioneered most of the major innovations. They
           | just got behind on productizing and shipping.
        
             | whimsicalism wrote:
             | Right, which was fine for them before there was major
             | competition. But starting in 2020, they have basically
             | attrited most of their talented labor force to OAI and
             | competitors who were not similarly dysfunctional.
        
         | dguest wrote:
         | Maybe someone who knows google better can answer my question
         | here: are they behind simply because LLMs are not really their
         | core business? In other words, it wasn't (and still isn't)
          | obvious that LLMs will help them sell ad space.
         | 
         | And of course writing that gives me a terrible realization:
         | product placement in LLMs is going to be a very big thing in
         | the near future.
        
           | freedomben wrote:
           | I'm an outsider and am speculating based on what I've heard,
           | so maybe I shouldn't even comment, but to me it seems like
           | it's been entirely corporate/organizational reasons. Non-
           | serious funding, shifting priorities, personnel
           | transfers/fluctuations, internal fragmentation, and more.
           | Lack of talent has never been their problem.
        
           | elwell wrote:
           | LLM bad because cannibalizes search ads. Wait as long as
           | possible. OpenAI opens pandora's box. Now full speed ahead;
           | catch up and overtake.
        
         | brigadier132 wrote:
         | There was a podcast yesterday that explained well why Google is
         | in a tough position.
         | 
         | https://youtu.be/-i9AGk3DJ90?t=616
         | 
         | In essence, Google already rules information retrieval. Their
         | margins are insane. Switching to LLM based search cuts into
         | their margins and increases their costs dramatically. Also, the
         | advantage they've built over decades has been cut down.
         | 
         | All of this means there is potential for less profit and a
         | shrinking valuation. A shrinking valuation means issues with
         | employee retention and it could lead to long term stagnation.
        
           | corysama wrote:
           | The Innovator's Dilemma over and over again.
        
           | brikym wrote:
           | I'm sure Kodak had the same problem with the digital camera.
        
         | llm_nerd wrote:
         | While I generally agree with you, who has ever counted Google
         | out? We've made fun of Google for lagging while they instead
         | spend their engineering time renaming projects and performing
         | algorithmic white-erasure, but we all knew they're a potent
         | force.
         | 
         | Google has as much or more computing power than anyone. They're
         | massively capitalized and have a market cap of almost $2T and
         | colossal cashflow, and have the ability to throw enormous
         | resources at the problem until they have a competitor. They
         | have an enormous, benchmark-setting amount of data across their
         | various projects to train on. That we're talking like they're
         | some scrappy upstart is super weird.
         | 
         | >As OpenAI moves further from its original mission into
         | capitalizing on its technological lead, we have to remember why
         | the original vision they had is important.
         | 
         | I'm way more cynical about the open source models released by
         | the megas, and OpenAI is probably the most honest about their
         | intentions. Meta and Google are releasing these models arguably
         | to kneecap any possible next OpenAI. They want to basically set
         | the market value of anything below state of the art at $0.00,
         | ensuring that there is no breathing room below the $2T cos.
         | These models (Llama, Gemma, etc) are fun toys, but in the end
         | they're completely uncompetitive and will yield zero "wins", so
         | to speak.
        
           | jerpint wrote:
           | > Meta and Google are releasing these models arguably to
           | kneecap any possible next OpenAI. They want to basically set
           | the market value of anything below state of the art at $0.00,
           | ensuring that there is no breathing room below the $2T cos
           | 
           | Never thought about it that way, but it makes a lot of sense.
           | It's also true these models are not up to par with SOTA no
           | matter what the benchmarks say
        
           | loudmax wrote:
           | I certainly would not count out Google's engineering talent.
           | But all the technical expertise in the world won't matter
           | when the leadership is incompetent and dysfunctional. Rolling
           | out a new product takes vision, and it means taking some
           | risks. This is diametrically opposed to how Google operates
           | today. Gemini could be years ahead of ChatGPT (and maybe it
           | is now, if it weren't neutered), but Google's current
           | leadership would have no idea what to do with it.
           | 
           | Google has the technical resources to become a major player
           | here, maybe even the dominant player. But it won't happen
           | under current management. I won't count out Google entirely,
           | and there's still time for the company to be saved. It starts
           | with new leadership.
        
         | refulgentis wrote:
         | There's nothing provided here other than Jeff Dean gave a stock
         | entry-level presentation to students at Rice, therefore "The
         | depth of institutional knowledge that is going to be unlocked
         | inside Google is actually frightening for me to consider."
         | 
         | You should see Google's turnover numbers from 4 years ago, much
         | less now.
         | 
         | It's been years, it's broken internally, we see the results.
         | 
         | Here, we're in awe of 1KLOC of C++ code that runs inference on
         | the CPU.
         | 
         | Nobody serious is running inference on CPU unless you're on the
         | extreme cutting edge. (ex. I need to on Android and on the
         | Chrome OS Linux VM, but I still use llama.cpp because it does
         | support GPU everywhere else)
         | 
         | I'm not sure what else to say.
         | 
         | (n.b. i am a xoogler)
        
       | kwantaz wrote:
       | nice
        
       | namtranase wrote:
        | Thanks to the team for the awesome repo. I have been using
        | gemma.cpp since the first day, and it is smooth in my view.
        | So I hope gemma.cpp will continue to add cool features
        | (something like k-quants, a server, ...) so it can serve more
        | widely. Actually, I have developed a Python wrapper for it:
        | https://github.com/namtranase/gemma-cpp-python The purpose is
        | to make it easy to use and to pick up every new technique
        | from the gemma.cpp team.
        
       | dontupvoteme wrote:
        | At the risk of being snarky, it's interesting that llama.cpp
        | was a 'grassroots' effort originating with a Bulgarian
        | hacker, and Google now launches a corporatized effort
        | inspired by it.
       | 
       | I wonder if there's some analogies to the 80s or 90s in here.
        
         | trisfromgoogle wrote:
         | To be clear, this is not comparable directly to llama.cpp --
         | Gemma models work on llama.cpp and we encourage people who love
          | llama.cpp to use them there. We also launched with Ollama.
         | 
         | Gemma.cpp is a highly optimized and lightweight system. The
         | performance is pretty incredible on CPU, give it a try =)
        
       | natch wrote:
       | Apart from the fact that they are different things, since they
       | came out of the same organization I think it's fair to ask:
       | 
       | Do these models have the same kind of odd behavior as Gemini?
        
       | xrd wrote:
        | I was discussing LLMs with a non-technical person on the
        | plane yesterday. I was explaining why LLMs aren't good at
        | math, and he responded: no, chatgpt is great at multivariate
        | regression, etc.
       | 
       | I'm using LLMs locally almost always and eschewing API backed
       | LLMs like chatgpt. So I'm not very familiar with plugins, and I'm
       | assuming chatgpt plugs into a backend when it detects a math
       | problem. So it isn't the LLM doing the math but to the user it
       | appears to be.
       | 
        | Does anyone here know which LLM projects like llama.cpp or
        | gemma.cpp support a plugin model?
       | 
       | I'm interested in adding to the dungeons and dragons system I
       | built using llama.cpp. Because it doesn't do math well, the
       | combat mode is terrible. But I was writing my own layer to break
       | out when combat mode occurs, and I'm wondering if there is a
       | better way with some kind of plugin approach.
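        | 
        | The layer I have in mind is roughly this shape (a
        | hypothetical sketch with invented names and tags, not a real
        | llama.cpp or gemma.cpp plugin API): scan the model's output
        | for a tool tag, evaluate it outside the LLM, and splice the
        | result back in before showing it to the user. That's roughly
        | how "the LLM appears to do math" works: a detector plus an
        | external evaluator, not the model itself.
        | 
        |       #include <cstdlib>
        |       #include <iostream>
        |       #include <regex>
        |       #include <string>
        |       
        |       // Replace tags like [ROLL 2d6+3] in model output with
        |       // real dice results computed outside the LLM.
        |       // (Seed with std::srand for non-deterministic rolls.)
        |       std::string ResolveTools(std::string out) {
        |         static const std::regex kRoll(
        |             R"(\[ROLL (\d+)d(\d+)\+(\d+)\])");
        |         std::smatch m;
        |         while (std::regex_search(out, m, kRoll)) {
        |           int n = std::stoi(m[1].str());
        |           int sides = std::stoi(m[2].str());
        |           int total = std::stoi(m[3].str());  // flat modifier
        |           for (int i = 0; i < n; ++i)
        |             total += 1 + std::rand() % sides;
        |           out = m.prefix().str() + std::to_string(total) +
        |                 m.suffix().str();
        |         }
        |         return out;
        |       }
        |       
        |       int main() {
        |         std::cout << ResolveTools(
        |             "The goblin takes [ROLL 2d6+3] damage.\n");
        |       }
        | 
        | The model then only has to be prompted or tuned to emit the
        | tag when combat math comes up; the dice never touch the LLM.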
        
       ___________________________________________________________________
       (page generated 2024-02-23 23:00 UTC)