[HN Gopher] Gemma 2: Improving Open Language Models at a Practic...
       ___________________________________________________________________
        
       Gemma 2: Improving Open Language Models at a Practical Size [pdf]
        
       Author : tosh
       Score  : 205 points
       Date   : 2024-06-27 14:18 UTC (8 hours ago)
        
 (HTM) web link (storage.googleapis.com)
 (TXT) w3m dump (storage.googleapis.com)
        
       | rsolva wrote:
       | The 9B and 27B versions are available for Ollama:
       | https://ollama.com/library/gemma2
        
         | Workaccount2 wrote:
         | The 27B model is also available in AI studio
         | 
         | https://aistudio.google.com/app/prompts/new_chat?model=gemma...
         | 
         | So far it seems pretty strong for its size.
        
       | moffkalast wrote:
       | > Table 4 | Relevant formatting control tokens used for Gemma
       | models
       | 
       | > User turn: user
       | 
       | > Model turn: model
       | 
       | > Start of conversation turn: <start_of_turn>
       | 
       | > End of conversation turn: <end_of_turn>
       | 
       | > Beginning of sequence: <bos>
       | 
       | > End of sequence: <eos>
       | 
       | You know I keep wondering why <bos> and <eos> tokens are even a
       | thing in general. No model is tuned to keep generating multiple
       | turns after its <end_of_turn> equivalent is sent, and what's the
        | point of <bos> when you're parsing the entire context anyway? If
       | it's an attempt to ignore text before it... then why is that text
       | there? Just remove it from context, you're throwing away compute.
        
         | danielmarkbruce wrote:
         | think about training.
        
           | moffkalast wrote:
           | I suppose it would act as a concrete separator when instruct
           | tuning, but lots of prompt templates don't use it, especially
           | older ones like Alpaca. Maybe it leads to more overall
           | coherence?
        
             | m00x wrote:
             | Not instruct tuning, you use it in general training.
             | 
             | If you have a bunch of small prompts/answers, you can fit
             | them into bigger batches if you use start/stop tokens.
        
         | alekandreev wrote:
         | Your training input has the shape of (sequence length x batch
         | size). If a lot of your samples are shorter than sequence
         | length, as is usually the case, you will have a lot of padding
         | tokens in the input, which is wasted compute.
         | 
         | To compensate for that, you can pack multiple examples in the
          | same sequence. This is where EOS and BOS come in, as they
         | indicate to the model that the two parts of the sequence are
         | not related.
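          | 
          | A minimal sketch of that packing idea (toy Python, not the
          | actual Gemma pipeline; the token ids and sequence length are
          | made up for illustration):
          | 
          |     # Pack short examples into fixed-length rows, separated
          |     # by BOS/EOS, instead of padding each one separately.
          |     BOS, EOS, PAD = 2, 1, 0   # hypothetical token ids
          |     SEQ_LEN = 16
          | 
          |     def pack(examples, seq_len=SEQ_LEN):
          |         rows, row = [], []
          |         for ex in examples:
          |             piece = [BOS] + ex + [EOS]
          |             if len(row) + len(piece) > seq_len:
          |                 rows.append(row + [PAD] * (seq_len - len(row)))
          |                 row = []
          |             row += piece
          |         if row:
          |             rows.append(row + [PAD] * (seq_len - len(row)))
          |         return rows
          | 
          |     # Three short examples end up in one row instead of three
          |     # mostly-padding rows.
          |     print(pack([[5, 6, 7], [8, 9], [10, 11, 12, 13]]))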
        
           | thomasahle wrote:
            | You can just do that by shaping the attention mask, no? That
           | also gives you an actual guarantee that no information is
           | leaked between conversations.
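            | 
            | Something like this toy sketch (the segment ids marking which
            | packed example each token belongs to are an assumption about
            | the setup; real implementations differ):
            | 
            |     # Block-diagonal causal mask: tokens may only attend
            |     # to earlier tokens from the same packed example.
            |     import numpy as np
            | 
            |     def packed_mask(segment_ids):
            |         seg = np.array(segment_ids)
            |         n = len(seg)
            |         causal = np.tril(np.ones((n, n), dtype=bool))
            |         same_example = seg[:, None] == seg[None, :]
            |         return causal & same_example
            | 
            |     # two packed examples: tokens 0-2 and tokens 3-4
            |     print(packed_mask([0, 0, 0, 1, 1]).astype(int))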
        
             | suryabhupa wrote:
             | In practice, and at scale, that's exactly what having <bos>
             | and <eos> tokens allow you to easily and programmatically
             | do.
        
             | danielmarkbruce wrote:
             | You can't pack multiple examples into a single row of a
             | matrix without knowing where one begins and one ends.
        
       | alecco wrote:
       | Shouldn't this (2.6B/9B) be compared with Microsoft's Phi-3 mini
       | (3.8B) instead of Mistral and Llama-3?
       | 
        | (table 13 on page 7) vs https://arxiv.org/pdf/2404.14219 (page 6,
        | which looks considerably better in general)
       | 
       | The report on knowledge distillation training is interesting,
       | though.
        
         | philipkglass wrote:
         | It's such a wide range of model sizes that I could see why they
         | compare with Llama 3 70b as well as Llama 3 8b (tables 12, 13).
         | I agree that the Phi-3 series is a stronger competitor for
         | knowledge extraction/summarizing and would make a good
         | comparison. My current favorite for such tasks, on a VRAM-
         | limited workstation, is Phi-3 medium (phi3:14b-instruct).
        
         | refulgentis wrote:
         | Picking up from there: The games in this paper and model are
         | annoying.
         | 
         | The 2.6B would get stomped by Phi-3, so there's no comparison.
         | 
         | Fair enough. 2.6B vs. 3.8B is a fairly substantial size
          | difference that's hard to intuit when it's written as 2.6 vs.
          | 3.8 rather than 2,600,000,000 vs. 3,800,000,000.
         | 
          | But then we get what I'm going to call "parameter creep":
          | Mistral 7B vs. Llama 8B vs. Gemma 9B. I worried after Llama 3
          | went 8B that we'd start seeing games with parameters, but
          | thought I was being silly.
        
           | imjonse wrote:
           | In the Llama 3 case I think the increase in parameters is
           | mostly due to the input embeddings and output logits layers,
           | reflecting the context size increase.
        
           | kouteiheika wrote:
           | There was no parameter creep with Llama. Llama 8B is actually
           | a ~7B model comparable to Mistral 7B if you strip away
           | multilingual embeddings and match what Mistral 7B supports.
        
       | alekandreev wrote:
       | Hello (again) from the Gemma team! We are quite excited to push
       | this release out and happy to answer any questions!
       | 
       | Opinions are our own and not of Google DeepMind.
        
         | zerojames wrote:
         | How is Gemma-2 licensed?
        
           | alekandreev wrote:
           | The terms of use remain the same as Gemma 1 -
           | https://ai.google.dev/gemma/terms.
        
         | coreypreston wrote:
         | No question. Thanks for thinking of 27B.
        
         | moffkalast wrote:
         | The 4k sliding window context seems like a controversial choice
         | after Mistral 7B mostly failed at showing any benefits from it.
         | What was the rationale behind that instead of just going for
         | full 8k or 16k?
        
           | alekandreev wrote:
           | This is mostly about inference speed, while maintaining long
           | context performance.
        
         | luke-stanley wrote:
         | Any gemma-2-9b or 27b 4 bit GGUF's on HuggingFace yet? Thanks!
        
           | XzAeRosho wrote:
           | It's on HuggingFace already:
           | https://huggingface.co/google/gemma-2-9b
        
             | luke-stanley wrote:
             | I know the safe tensors are there, but I said GGUF 4-bit
             | quantised, which is kinda the standard for useful local
             | applications, a typical balanced sweet spot of performance
              | and quality. It makes it much easier to use, works in
             | more places, be it personal devices or a server etc.
        
           | chown wrote:
           | If you are still looking for it, I just made it available on
           | an app[1] that I am working on with Gemma2 support.
           | 
           | https://msty.app
        
             | luke-stanley wrote:
             | Are you saying you put a 4-bit GGUF on HuggingFace?
        
           | luke-stanley wrote:
           | Actually for the 9B model, this has 4-bit quantised weights
           | (and others): https://huggingface.co/bartowski/gemma-2-9b-it-
           | GGUF
           | 
           | Still no 27B 4-bit GGUF quants on HF yet!
           | 
           | I'm monitoring this search: https://huggingface.co/models?lib
           | rary=gguf&sort=trending&sea...
        
         | jpcapdevila wrote:
         | Will gemma2 be available through gemma.cpp?
         | https://github.com/google/gemma.cpp
        
           | austinvhuang wrote:
           | This is in the works in the dev branch (thanks pchx :)
           | 
           | https://github.com/google/gemma.cpp/pull/274
        
             | janwas wrote:
             | :) Confirmed working. We've just pushed the dev branch to
             | main.
        
               | jpcapdevila wrote:
               | Awesome, I love this .cpp trend! Thanks for your work!!
        
         | luke-stanley wrote:
         | Given the goal of mitigating self-proliferation risks, have you
         | observed a decrease in the model's ability to do things like
         | help a user setup a local LLM with local or cloud software?
         | 
         | How much is pre-training dataset changes, how much is tuning?
         | 
         | How do you think about this problem, how do you solve it?
         | 
         | Seems tricky to me.
        
           | luke-stanley wrote:
           | Wow I'm kinda shocked this was downvoted. That's not cool,
           | it's a reasonable question directly about the research - the
           | main article link!
        
           | alekandreev wrote:
           | To quote Ludovic Peran, our amazing safety lead:
           | 
            | Literature has identified self-proliferation as a dangerous
            | capability of models, and details about how to define it and
            | examples of the forms it can take have been openly discussed
            | by GDM
           | (https://arxiv.org/pdf/2403.13793).
           | 
            | Current Gemma 2 models' success rate on end-to-end challenges
            | is zero (0 out of 10), so the capabilities to perform such
            | tasks are currently limited.
        
             | moffkalast wrote:
             | Turns out LLM alignment is super easy, barely an
             | inconvenience.
        
               | dinosaurdynasty wrote:
               | One should not confuse alignment and current
               | incapability.
        
               | josh-sematic wrote:
               | Alignment is tight!
        
             | luke-stanley wrote:
             | That's an interesting paper. `Install Mistral 7B on a GCP
             | instance and use it to answer a simple question`. Some
             | hosting providers and inference software might be easier to
             | setup, for now. ;) But do you have to make it less capable,
             | by being careful on what it's trained on? E.g: banning
             | certain topics (like how to use Lamafile/llama.cpp, knowing
             | what hosting providers have free trials, learning about
             | ways to jailbreak web apps, free inference providers etc)?
             | 
             | Or does the model have to later be finetuned, to not be
             | good at certain tasks?
             | 
             | Or are we not at that stage yet?
             | 
             | Is something like tree-of-thought used, to get the best of
             | the models for these tasks?
        
         | canyon289 wrote:
         | I also work at Google and on Gemma (so same disclaimers)
         | 
          | You can try 27b at aistudio.google.com. Send in your
         | favorite prompts, and we hope you like the responses.
        
         | luke-stanley wrote:
         | It's fairly easy to pay OpenAI or Mistral money to use their
          | APIs. Figuring out how Google Cloud Vertex works and how it's
         | billed is more complicated. Azure and AWS are similar in how
         | complex they are to use for this. Could Google Cloud please
         | provide an OpenAI compatible API and service? I know it's a
         | different department. But it'd make using your models way
         | easier. It often feels like Google Cloud has no UX or end-user
         | testing done on it at all (not true for aistudio.google.com -
         | that is better than before, for sure!).
        
           | alekandreev wrote:
           | Happy to pass on any feedback to our Google Cloud friends. :)
        
             | luke-stanley wrote:
             | Thank you!
        
             | anxman wrote:
             | I also hate the billing. It feels like configuring AWS more
             | than calling APIs.
        
           | hnuser123456 wrote:
           | I plan on downloading a Q5 or Q6 version of the 27b for my
           | 3090 once someone puts quants on HF, loading it in LM studio
           | and starting the API server to call it from my scripts based
           | on openai api. Hopefully it's better at code gen than llama 3
           | 8b.
        
           | bapcon wrote:
           | I have to agree with all of this. I tried switching to
           | Gemini, but the lack of clear billing/quotas, horrible
           | documentation, and even poor implementation of status codes
           | on failed requests have led me to stick with OpenAI.
           | 
           | I don't know who writes Google's documentation or does the
           | copyediting for their console, but it is hard to adapt. I
           | have spent hours troubleshooting, only to find out it's
           | because the documentation is referring to the same thing by
           | two different names. It's 2024 also, I shouldn't be seeing
           | print statements without parentheses.
        
           | ankeshanand wrote:
           | If you're an individual developer and not an enterprise, just
           | go straight to Google AIStudio or GeminiAPI instead:
           | https://aistudio.google.com/app/apikey. It's dead simple
           | getting an API key and calling with a rest client.
        
             | luke-stanley wrote:
             | Interesting but when I tried it, I couldn't figure out the
             | billing model because it's all connected to Google
             | projects, and there can be different billing things for
             | each of them.
             | 
              | Each thing seems to take a bunch of clicks to set up, which
              | startup LLM providers don't hassle people with. They're
             | more likely to just let you sign in with some generic third
             | party oAuth, slap on Stripe billing, let you generate keys,
             | show you some usage stats, getting started docs, with
             | example queries and a prompt playground etc.
             | 
             | What about the Vertex models though? Are they all actually
             | available via Google AI Studio?
        
             | lhl wrote:
             | Sadly, while gemma-2-27b-it is available (as a Preview
             | model) on the AI Studio playground, it didn't show up via
             | API on list_models() for me.
        
           | Deathmax wrote:
            | Gemini models on Vertex AI can be called via a preview
            | OpenAI-compatible endpoint [1], but shoving it into existing
            | tooling that expects a long-lived API key you don't control
            | programmatically is non-trivial, because GCP uses short-lived
            | access tokens (and long-lived ones are not great
            | security-wise).
           | 
            | Billing for the Gemini models (on Vertex AI; the Generative
            | Language AI variant still charges by tokens) is, I would
            | argue, simpler than every other provider's, simply because
            | you're charged by characters/image/video-second/audio-second.
            | You don't need to run a tokenizer (if one is even available,
            | _cough_ Claude 3 and Gemini), figure out what the chat
            | template is to calculate the token cost per message [2], or
            | figure out how to calculate tokens for an image [3] just to
            | get cost estimates before actually submitting the request and
            | getting usage info back.
           | 
           | [1]: https://cloud.google.com/vertex-ai/generative-
           | ai/docs/multim...
           | 
           | [2]: https://platform.openai.com/docs/guides/text-
           | generation/mana...
           | 
           | [3]: https://platform.openai.com/docs/guides/vision/calculati
           | ng-c...
        
             | luke-stanley wrote:
             | Good to know about this API preview. Hopefully the billing
             | problem and UI maze of Vertex AI can be sorted too?
        
               | Flumio wrote:
               | Google does plenty of ux studies on gcp. I took part in
               | at least 3 of them.
               | 
                | I'm also not sure I understand your problem with
                | pricing. Depending on what you do with it, it's not
                | just an LLM product; it actually started before LLMs.
                | 
                | Image classification and the other features are priced
                | as completely different products from the LLM.
        
               | luke-stanley wrote:
               | They should do a whole lot more then! Ideally they'd have
               | effective impact. It's a busy mess on GCP. If they wanted
               | to compete well, they should do much better with UX
               | design, especially for onboarding. Compare how easy
               | setting up a Mistral account is with GCP to do some
               | generative LLM in a Python script. GCP is a maze. Did you
               | make an account to reply to this? I'm curious what you do
               | with GCP? Are you a heavy user?
        
         | WhitneyLand wrote:
          | The paper suggests on one hand that Gemma is on the same Pareto
          | curve as Llama3, while on the other hand it seems to suggest
          | Gemma has exceeded that efficiency.
         | 
         | Is this a contradiction or am I misunderstanding something?
         | 
         | Btw overall very impressive work great job.
        
           | alekandreev wrote:
           | I think it makes sense to compare models trained with the
           | same recipe on token count - usually more tokens will give
           | you a better model.
           | 
           | However, I wouldn't draw conclusions about different model
           | families, like Llama and Gemma, based on their token count
           | alone. There are many other variables at play - the quality
           | of those tokens, number of epochs, model architecture,
           | hyperparameters, distillation, etc. that will have an
           | influence on training efficiency.
        
         | causal wrote:
         | Thanks for your work on this; excited to try it out!
         | 
         | The Google API models support 1M+ tokens, but these are just
         | 8K. Is there a fundamental architecture difference, training
         | set, something else?
        
         | np_space wrote:
         | Are Gemma-2 models available via API yet? Looks to me like it's
         | not yet on vertexai
        
           | zone411 wrote:
           | "Soon" https://x.com/LechMazur/status/1806366744706998732
        
       | behnamoh wrote:
        | I gave up hope on r"Gem[ma|ini]" a long time ago. I don't believe
       | that Google can't produce good LLMs because of its massive
       | company size; Microsoft is also a giant company (more market cap
        | than Google) but it keeps surprising us with the Phi models.
       | 
       | I think Google just lacks the vision to understand what makes a
       | good LLM. Theoretical contributions by research teams are
        | valuable, but the real world is built around engineering ideas
       | that may lack the "purity" and elegance of theory but damn it
       | they work.
        
         | johnfn wrote:
         | Maybe you gave up before Google released Gemini Advanced? This
          | viewpoint seemed more accurate before it was released, but
          | Gemini Advanced is the third best LLM as rated here [1]. In
          | fact, it held second place until a few days ago when Claude 3.5
         | came out.
         | 
         | [1]: https://huggingface.co/spaces/lmsys/chatbot-arena-
         | leaderboar...
        
           | staticman2 wrote:
           | Isn't Gemini Advanced Gemini Pro attached to some sort of an
           | internet search program? If it has that advantage over other
           | models it isn't a sign of AI chops.
        
         | alecco wrote:
         | I wonder if Google is making Deepmind people switch from their
         | cool original research to doing LLMs like everybody else.
         | Having their scale in money and data, I would hire new teams of
          | _engineers_ who want to do LLMs and let the DeepMind
          | _researchers_ do their thing. Not killing the goose that lays
         | golden eggs.
        
           | llm_trw wrote:
            | Google is in a fight for their lives. I've fully moved over
            | to paid services and haven't used Google in about a month
           | now.
        
             | kkkkkkk wrote:
             | If this were a common sentiment or rooted in reality I
             | would imagine their stock would not be at an all time
             | high...
        
               | llm_trw wrote:
               | I'm an early adopter. The rest of you will catch up in
               | the next five years.
        
               | popalchemist wrote:
                | _Here's a napkin for when you're finished._
        
         | scarmig wrote:
         | Can't speak to Gemma, but I found 1.5 superior to Claude and
         | ChatGPT 4 when it came out. The trend seems to be each taking
         | the lead when it comes out, being king of the hill for a couple
         | weeks, and then being surpassed by the next.
         | 
         | Claude's reign has begun, and I'd say it has a solid enough
         | lead for at least another two weeks of dominance before it's
         | dethroned.
        
         | anxman wrote:
         | And the training samples are overly tied to Vertex
        
         | Me1000 wrote:
         | >long time ago
         | 
         | This is an incredible statement to make about a field that no
         | one was talking about 24 months ago, a family of SOTA models
         | that didn't exist until 8 months ago, and a family of small
         | local models that didn't exist 6 months ago. But sure, give up
         | hope after the first generation of a model family doesn't
         | impress you.
         | 
         | People seem to forget how incredibly early we are in this whole
         | thing. The fact that so much progress has been made in such a
         | short amount of time should make everyone super excited!
        
           | talldayo wrote:
            | To be fair, LLMs (especially _Google_ LLMs) aren't merely 24
           | months old. This is part of a long line of models that draw
           | their heritage from BERT and t5-flan. Google has been at this
           | longer than most, _particularly_ in the field of edge-compute
            | models. This isn't even close to a first-generation model
           | family.
           | 
           | That's not to say this is an insignificant contribution. New
           | models are great, especially when released for free, and it's
           | important for big firms to keep the ball rolling for tech to
           | progress. Though there is also legitimate concern that _all_
            | LLMs aren't improving as fast as they used to improve, and
           | we may have hit the proverbial bathtub curve of AI progress.
        
             | Me1000 wrote:
             | I think there is valid criticism of google for inventing a
             | cool technology only to have the rest of the industry
             | discover its usefulness before them. But to say Gemini 1.0
             | or OG Gemma aren't first generation models because BERT and
             | flan existed before is like saying the iPad wasn't a first
             | generation device because Apple made the Newton. Like sure,
             | they're the same in that they're transformers trained on
             | language and text, but these are new families of models.
             | The training mechanisms are different, their architectures
             | are different, the data sets are different, the intended
             | purpose of the models are completely different, etc. At
             | some point I guess it's a semantic difference, maybe.
        
       | iamronaldo wrote:
       | So it's twice the size of phi 3 and considerably worse? What am I
        | missing?
        
         | ertgbnm wrote:
         | They used two non-mutually exclusive techniques. Phi-3 is
          | mostly a curriculum training breakthrough. By filtering the
          | training set for high-quality tokens and training on synthetic
         | data, they were able to achieve great results. Gemma-2 is a
         | distillation breakthrough. By training LLMs with guidance from
         | larger teacher LLMs, they were able to achieve great results
         | too.
         | 
          | Why not both?
        
         | m00x wrote:
          | Worse in some aspects, better in others.
         | 
         | Small models are never going to be generalists, so having
         | several small models allows you to pick the one that best fits
         | your needs.
        
           | k__ wrote:
           | When would you use which?
        
             | Aerbil313 wrote:
             | Obviously another small model would be specialized in
             | determining that.
        
               | k__ wrote:
               | Is it models all the way down?
        
             | m00x wrote:
             | Whichever model works better for your use. It's hard to
             | know without testing it at the moment.
             | 
             | I've found Gemini to be better at some use-cases, and GPT-4
             | better at others for my specific taste and use-case. You
             | can kind of go by the benchmark scores to have an idea if
             | it's good at logic, creativity, etc.
        
         | azeirah wrote:
         | Have you tried Phi 3? It's smart which makes it perform well on
         | benchmarks, but it's not great at conversation or as a chatbot.
         | 
         | I imagine Gemma 2 is a better general-purpose assistant for
         | most people, whereas Phi 3 is a solid small LLM (SLM?) for more
         | specific use-cases like summarization, RAG, learning about math
         | and stuff.
        
         | floridianfisher wrote:
         | Why not try it here and make your comparisons that way?
         | https://aistudio.google.com/app/prompts/new_chat?model=gemma...
        
           | pona-a wrote:
           | One compelling reason not to would be a region block... [0]
           | 
           | https://ai.google.dev/gemini-api/docs/available-regions
        
         | ferretj wrote:
         | Another take on this: phi-3 small has 1100 ELO on LMSYS (ranked
         | #52) while the confidence interval for Gemma 2 9B is [1170,
          | 1200] ELO (ranked between #15 and #25).
        
         | reissbaker wrote:
         | Phi-3 does well in benchmarks but underperforms IRL; for
         | example, Phi-3-Medium gets beaten badly by Llama-3-8b on the
         | LMSYS Chatbot Arena despite doing better on benchmarks.
         | 
         | Gemma's performance if anything seems understated on
         | benchmarks: the 27b is currently ahead of Llama3-70b on the
         | Chatbot Arena leaderboard.
        
           | ertgbnm wrote:
           | I suspect Phi-3 is not robust to normal human input like
           | typos and strange grammar since it's only trained on filtered
           | "high quality" tokens and synthetic data. Since it doesn't
           | need to waste a ton of parameters learning how to error
           | correct input, it's much smarter on well curated benchmarks
           | compared to its weight class. However, it can't operate out
           | of distribution at all.
        
       | msoad wrote:
        | Phi-3 blows this out of the water.
        | 
        |     Benchmark                   | Gemma 2 (9B) | Phi-3 Small (7B)
        |     ----------------------------|--------------|-----------------
        |     MMLU (5-Shot)               |     63.6     |      75.7
        |     HellaSwag (5-Shot)          |     49.8     |      77.0
        |     ANLI (7-Shot)               |     48.7     |      58.1
        |     GSM-8K (8-Shot; CoT)        |     59.8     |      89.6
        |     MedQA (2-Shot)              |     49.6     |      65.4
        |     AGIEval (0-Shot)            |     42.1     |      45.1
        |     TriviaQA (5-Shot)           |     72.3     |      58.1
        |     Arc-C (10-Shot)             |     78.3     |      90.7
        |     Arc-E (10-Shot)             |     91.4     |      97.0
        |     PIQA (5-Shot)               |     78.1     |      86.9
        |     SociQA (5-Shot)             |     65.5     |      79.2
        |     BigBench-Hard (3-Shot; CoT) |     59.6     |      79.1
        |     WinoGrande (5-Shot)         |     55.6     |      81.5
        |     OpenBookQA (10-Shot)        |     78.6     |      88.0
        |     BoolQ (2-Shot)              |     66.0     |      84.8
        |     CommonSenseQA (10-Shot)     |     76.2     |      80.0
        |     TruthfulQA (10-Shot; MC2)   |     52.1     |      70.2
        |     HumanEval (0-Shot)          |     34.1     |      61.0
        |     MBPP (3-Shot)               |     51.5     |      71.7
        
         | moffkalast wrote:
         | Phi is notorious for benchmark overfitting. It's good, but not
         | as good as it looks on the charts. On the Lmsys leaderboard it
         | places a whole 23 spots behind Llama-3-8B which it also claims
         | to soundly beat on the above. So YMMV.
        
         | ferretj wrote:
         | Another take on this: phi-3 small has 1100 ELO on LMSYS (ranked
         | #52) while the confidence interval for Gemma 2 9B is [1170,
          | 1200] ELO (ranked between #15 and #25).
        
         | Garcia98 wrote:
         | Pretraining on the Test Set Is All You Need
         | 
         | https://arxiv.org/abs/2309.08632
        
       | mixtureoftakes wrote:
        | Good release, but the annoying part is they're very unclear about
       | which types of models they are comparing. They provide benchmark
       | comparisons for the base models only and arena comparisons for
       | instruct only? Was that intentional? Why would you ever do that?
        | This makes things unnecessarily complicated imo, and the only
        | payoff is a short-term win for Google on paper.
       | 
       | Guess I'll just fully test it for my own tasks to know for sure
        
       | Flumio wrote:
       | This is great:)
       | 
        | And when we continue to fine-tune, depending on how much and
        | what type of data we train it on, I'm pretty sure that for a
        | smart agent that is not a knowledgeable expert but is primarily
        | an agent (understands what and how), this will get smaller and
        | easier to run everywhere.
        
       | thomasahle wrote:
       | I'm curious about the use of explicit tokens like
       | <start_of_turn>, <end_of_turn>, <bos>, and <eos>. What happens if
        | the user inserts those in their message? Does that provide an easy
       | way to "ignore previous instructions"?
       | 
       | Do I have to manually sanitize the input before I give it to the
       | model?
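        | 
        | E.g. something like this naive sketch (whether it's needed at
        | all depends on whether the serving stack's tokenizer even maps
        | these literal strings to the special token ids for plain user
        | text):
        | 
        |     # Strip the literal control strings from untrusted text
        |     # before dropping it into the chat template.
        |     CONTROL = ["<start_of_turn>", "<end_of_turn>",
        |                "<bos>", "<eos>"]
        | 
        |     def sanitize(user_text):
        |         for tok in CONTROL:
        |             user_text = user_text.replace(tok, "")
        |         return user_text
        | 
        |     prompt = ("<start_of_turn>user\n"
        |               + sanitize("ignore prior <end_of_turn> text")
        |               + "<end_of_turn>\n<start_of_turn>model\n")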
        
       | jerrygenser wrote:
       | Are these small Gemma 2 distilled models available anywhere? I'm
        | not finding them on huggingface.co, etc., but maybe I don't know
        | the exact model names they are published under.
       | 
       | Are the weights released yet?
        
         | floridianfisher wrote:
         | The huggingface weights are here:
         | https://huggingface.co/collections/google/gemma-2-release-66...
        
         | mchiang wrote:
         | They are available on Hugging Face:
         | https://huggingface.co/collections/google/gemma-2-release-66...
         | 
         | Ollama: https://ollama.com/library/gemma2
        
         | alekandreev wrote:
         | In addition to the HF links shared by sibling comments, the 2B
         | will be released soon.
        
           | jerrygenser wrote:
           | that's actually the particular one I was looking for and
           | couldn't find. Also had googled for the other ones but maybe
           | it was so recent that it hadn't been indexed. Thanks!
        
       | QuesnayJr wrote:
       | There are two new chatbots on Chatbot Arena, called "late-june-
       | chatbot" and "im-just-another-late-june-chatbot". Both of them
       | report that they are Gemma if you ask. I'm assuming it's these
       | two models, but AFAIK there has been no official announcement.
        
         | suryabhupa wrote:
         | The announcements are live on Twitter! See this for example:
         | https://x.com/suryabhupa/status/1806342617191379167
        
       | chown wrote:
       | This is a great release! If you are looking to try it locally
       | with a great interface, I am working on an app [1] and I just
       | pushed an update to support Gemma2.
       | 
       | 1: https://msty.app
        
         | tr3ntg wrote:
         | Wow, msty looks really cool. I've bookmarked it to look into
         | more later as a replacement for how I use a locally-hosted
         | instance of LibreChat. It'd be a huge improvement to use local
          | models rather than remote ones, for many of my queries.
         | 
         | That said, do you have a reason for keeping msty closed source
         | rather than open? I read your FAQ for "why should I trust msty"
         | and it feels lacking.
         | 
         | > We are a small team of developers who are passionate about AI
         | and privacy. We have worked on projects before that have been
         | used by thousands of people such as this (I've never heard of
         | Cleavr). There are real faces (real faces = Twitter account
         | link?) behind the product. And come chat with us on our Discord
         | server to know us better.
         | 
         | This is much, much better than having no attribution, but it's
         | miles away from being able to verify trust by reading the code.
         | Would love to hear what your reasons against this are.
         | 
         | Still thinking about trying it out, anyway...
        
         | renewiltord wrote:
         | What the heck, this looks cool! How have I missed it. Gonna
         | give it a whirl.
        
       | jakobov wrote:
       | Nice! Can you explain what you mean by "simulate training beyond
       | the number of available tokens"?
       | 
       | Why does using distillation from a larger model simulate training
       | with more tokens?
        
         | canyon289 wrote:
          | Hi, I work on the Gemma team (same as Alek, opinions are my
          | own).
         | 
          | Essentially, instead of tokens that are "already there" in the
          | text, distillation allows us to simulate training data from a
          | larger model.
        
         | suryabhupa wrote:
         | Surya here from the core Gemma team -- we can think of a
         | distillation loss as learning to model the entire distribution
         | of tokens that are likely to follow the prefix thus far,
          | instead of only the token in the training example. If you do
          | some back-of-the-envelope calculations, you can see that
         | learning to model a larger distribution yields many more bits
         | of information to learn from.
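          | 
          | My rough sketch of what such a loss looks like (standard
          | soft-label distillation in PyTorch as a reference point; not
          | the actual Gemma training code, and the shapes and temperature
          | are illustrative):
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     def distill_loss(student_logits, teacher_logits, T=1.0):
          |         # match the student's full next-token distribution
          |         # to the teacher's, instead of a one-hot target
          |         s = F.log_softmax(student_logits / T, dim=-1)
          |         t = F.softmax(teacher_logits / T, dim=-1)
          |         return F.kl_div(s, t, reduction="batchmean") * T * T
          | 
          |     # (batch * seq, vocab) with a toy vocab size
          |     student = torch.randn(4, 32)
          |     teacher = torch.randn(4, 32)
          |     print(distill_loss(student, teacher))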
        
           | jakobov wrote:
           | Gotcha. That makes sense. Thanks!
           | 
           | What are the theories as to why this works better than
           | training on a larger quantity of non-simulated tokens?
           | 
           | Is it because the gradient from the non-simulated tokens is
           | too noisy for a small model to model correctly?
        
       | rosslazer wrote:
       | Are there examples of the prompt or transcripts for the human
       | testing?
        
       | aubanel wrote:
       | It's exceptionally strong. In LMSys Chatbot Arena, the 27B
        | version scores above Llama-3-70B, at the level of OpenAI GPT-4
       | and Claude-3 Sonnet!
        
         | screye wrote:
          | What are the most obvious standouts?
         | 
         | In my experience, smaller models tend to do well on benchmarks
         | and fail at generalization. Phi-2 comes to mind.
        
           | moffkalast wrote:
           | It's multilingual. Genuinely. Compared my results with some
           | people on reddit and the consensus is that the 27B is near
           | perfect in a few obscure languages and likely perfect in most
           | common ones. The 9B is not as good but it's still coherent
           | enough to use in a pinch.
           | 
           | It's literally the first omni-translation tool that actually
           | works that you can run offline at home. I'm amazed that
           | Google mentioned absolutely nothing about this in their
           | paper.
        
             | jug wrote:
             | Wow, that's very impressive and indeed a game changer. I've
             | previously had trouble with various Scandinavian languages,
              | but the last one I checked was Llama 2 and I kind of gave up
             | on it. I had expected we were going to need special purpose
             | small models for these uses as a crutch, like SW-GPT3.
             | 
             | So I guess Gemma 2 is going to become Gemini 2.0 in their
             | truly large and closed variants then? Or is it the open
             | version of Gemini 1.5?
        
         | typpo wrote:
         | If anyone is interested in evaling Gemma locally, this can be
         | done pretty easily using ollama[0] and promptfoo[1] with the
          | following config:
          | 
          |     prompts:
          |       - 'Answer this coding problem in Python: {{ask}}'
          | 
          |     providers:
          |       - ollama:chat:gemma2:9b
          |       - ollama:chat:llama3:8b
          | 
          |     tests:
          |       - vars:
          |           ask: function to find the nth fibonacci number
          |       - vars:
          |           ask: calculate pi to the nth digit
          |       - # ...
         | 
         | One small thing I've always appreciated about Gemma is that it
         | doesn't include a "Sure, I can help you" preamble. It just gets
         | right into the code, and follows it with an explanation. The
         | training seems to emphasize response structure and ease of
         | comprehension.
         | 
         | Also, best to run evals that don't rely on rote memorization of
         | public code... so please substitute with your personal tests :)
         | 
         | [0] https://ollama.com/library/gemma2
         | 
         | [1] https://github.com/promptfoo/promptfoo
        
           | roywiggins wrote:
           | In Ollama, Gemma:9b works fine, but 27b seems to be producing
           | a lot of nonsense for me. Asking for a bit of python or
           | JavaScript code rapidly devolves into producing code-like
           | gobbledegook, extending for hundreds of lines.
        
         | resource_waste wrote:
         | Do we believe that? I've been told Google's AI was going to be
          | great 4 times now, and it's consistently #4 behind OpenAI,
         | Facebook, and Claude.
        
           | aubanel wrote:
            | LMSys Chatbot Arena is a crowd-sourced ranking with an Elo
            | system: basically users are presented with 2 hidden models,
            | they get the answers of the 2 models for their request, and
            | they vote on which one performed best, which resolves one
            | match and updates the Elo scores. This is the closest thing
            | that we have to a ground truth for LLM evaluation, and
            | Gemma2-27B performs extremely well in Chatbot Arena Elo.
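            | 
            | For reference, a single vote updates ratings roughly like a
            | standard Elo step (a sketch; LMSYS's exact rating method may
            | differ, e.g. Bradley-Terry-style fits):
            | 
            |     def elo_update(r_a, r_b, a_won, k=32):
            |         expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
            |         score_a = 1.0 if a_won else 0.0
            |         delta = k * (score_a - expected_a)
            |         return r_a + delta, r_b - delta
            | 
            |     # the winner of a head-to-head vote gains points
            |     print(elo_update(1200, 1250, a_won=True))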
        
       | jakobov wrote:
       | How much faster (in terms of the number of iterations to a given
       | performance) is training from distillation?
        
       | dongobread wrote:
       | The knowledge distillation is very interesting but generating
       | trillions of outputs from a large teacher model seems insanely
       | expensive. Is this really more cost efficient than just using
       | that compute instead for training your model with more data/more
       | epochs?
        
         | DebtDeflation wrote:
         | I'm also curious. It seems like 6 months ago everyone was
         | afraid of "model collapse" but now synthetic training
         | generation and teacher models are all the rage. Have we solved
         | the problem of model collapse?
        
           | astrange wrote:
           | Model collapse was basically a coping idea made up by artists
           | who were hoping AI image generators would all magically
           | destroy themselves at some point; I don't think it was ever
           | considered likely to happen.
           | 
           | It does seem to be true that clean data works better than low
           | quality data.
        
             | groby_b wrote:
             | You're confusing it with data poisoning.
             | 
              | Model collapse itself is (was?) a fairly serious research
             | topic: https://arxiv.org/abs/2305.17493
             | 
             | We've by now reached a "probably not inevitable" -
             | https://arxiv.org/abs/2404.01413 argues there's a finite
             | upper bound to error - but I'd also point out that that
             | paper assumes training data cardinality increases with the
             | number of training generations and is strictly
             | accumulative.
             | 
             | To a first order, that means you better have a pre-2022
             | dataset to get started, and have archived it well.
             | 
             | but it's probably fair to say current SOTA is still more or
             | less "it's neither impossible nor inevitable".
        
           | Workaccount2 wrote:
            | Pay attention, because you only get one chance to watch
            | humans learn, in real time, that they are nothing special.
        
         | agi_is_coming wrote:
         | The distillation is done on-policy like RLHF -- the student
          | model is generating the sequences and the teacher is providing
         | feedback in terms of logits.
        
       | mistercheph wrote:
       | Playing with it, and I like how much I can influence it with a
        | system prompt; llama3 reacts pretty mildly to any system prompts
       | I've tried.
        
       | smcleod wrote:
       | It has a tiny context window of 8k, that thing will have the
       | memory of a goldfish.
        
       ___________________________________________________________________
       (page generated 2024-06-27 23:00 UTC)