[HN Gopher] Gemma 2: Improving Open Language Models at a Practic...
___________________________________________________________________
Gemma 2: Improving Open Language Models at a Practical Size [pdf]
Author : tosh
Score : 205 points
Date : 2024-06-27 14:18 UTC (8 hours ago)
(HTM) web link (storage.googleapis.com)
(TXT) w3m dump (storage.googleapis.com)
| rsolva wrote:
| The 9B and 27B versions are available for Ollama:
| https://ollama.com/library/gemma2
| Workaccount2 wrote:
| The 27B model is also available in AI studio
|
| https://aistudio.google.com/app/prompts/new_chat?model=gemma...
|
| So far it seems pretty strong for its size.
| moffkalast wrote:
| > Table 4 | Relevant formatting control tokens used for Gemma
| models
|
| > User turn: user
|
| > Model turn: model
|
| > Start of conversation turn: <start_of_turn>
|
| > End of conversation turn: <end_of_turn>
|
| > Beginning of sequence: <bos>
|
| > End of sequence: <eos>
|
| You know, I keep wondering why <bos> and <eos> tokens are even a
| thing in general. No model is tuned to keep generating multiple
| turns after its <end_of_turn> equivalent is sent, and what's the
| point of <bos> when you're parsing the entire context anyway? If
| it's an attempt to ignore text before it... then why is that text
| there? Just remove it from context; you're throwing away compute.
| danielmarkbruce wrote:
| think about training.
| moffkalast wrote:
| I suppose it would act as a concrete separator when instruct
| tuning, but lots of prompt templates don't use it, especially
| older ones like Alpaca. Maybe it leads to more overall
| coherence?
| m00x wrote:
| Not instruct tuning, you use it in general training.
|
| If you have a bunch of small prompts/answers, you can fit
| them into bigger batches if you use start/stop tokens.
| alekandreev wrote:
| Your training input has the shape of (sequence length x batch
| size). If a lot of your samples are shorter than sequence
| length, as is usually the case, you will have a lot of padding
| tokens in the input, which is wasted compute.
|
| To compensate for that, you can pack multiple examples into the
| same sequence. This is where EOS and BOS come in, as they
| indicate to the model that the two parts of the sequence are
| not related.
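|
| As a rough sketch of the packing step (plain Python; the token
| ids and SEQ_LEN are made-up values, not the actual pipeline):
|
|       BOS, EOS, PAD = 2, 1, 0   # made-up token ids
|       SEQ_LEN = 2048
|
|       def pack(examples):
|           # Concatenate <bos>...example...<eos> chunks into
|           # fixed-length rows; assumes each example fits in SEQ_LEN.
|           rows, row = [], []
|           for ex in examples:   # ex is a list of token ids
|               chunk = [BOS] + ex + [EOS]
|               if len(row) + len(chunk) > SEQ_LEN:
|                   rows.append(row + [PAD] * (SEQ_LEN - len(row)))
|                   row = []
|               row += chunk
|           if row:
|               rows.append(row + [PAD] * (SEQ_LEN - len(row)))
|           return rows  # each row holds several examples, little padding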
| thomasahle wrote:
| You can just do that by shaping the attention mask, no? That
| also gives you an actual guarantee that no information is
| leaked between conversations.
| suryabhupa wrote:
| In practice, and at scale, that's exactly what having <bos>
| and <eos> tokens allow you to easily and programmatically
| do.
| danielmarkbruce wrote:
| You can't pack multiple examples into a single row of a
| matrix without knowing where one begins and one ends.
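|
| For reference, the mask-shaping idea upthread looks roughly like
| this (NumPy sketch; deriving segment ids by counting <bos> tokens
| is one illustrative scheme, not the actual implementation):
|
|       import numpy as np
|
|       def packed_causal_mask(segment_ids):
|           # segment_ids, e.g. [0, 0, 0, 1, 1, 2], can be built by
|           # cumulatively counting <bos> tokens in the packed row.
|           seg = np.asarray(segment_ids)
|           same_segment = seg[:, None] == seg[None, :]
|           causal = np.tril(np.ones((len(seg), len(seg)), dtype=bool))
|           # True = may attend; no attention across packed examples.
|           return same_segment & causal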
| alecco wrote:
| Shouldn't this (2.6B/9B) be compared with Microsoft's Phi-3 mini
| (3.8B) instead of Mistral and Llama-3?
|
| (table 13 on page 7) vs https://arxiv.org/pdf/2404.14219 (page 6,
| quite better in general)
|
| The report on knowledge distillation training is interesting,
| though.
| philipkglass wrote:
| It's such a wide range of model sizes that I could see why they
| compare with Llama 3 70b as well as Llama 3 8b (tables 12, 13).
| I agree that the Phi-3 series is a stronger competitor for
| knowledge extraction/summarizing and would make a good
| comparison. My current favorite for such tasks, on a VRAM-
| limited workstation, is Phi-3 medium (phi3:14b-instruct).
| refulgentis wrote:
| Picking up from there: The games in this paper and model are
| annoying.
|
| The 2.6B would get stomped by Phi-3, so there's no comparison.
|
| Fair enough. 2.6B vs. 3.8B is a fairly substantial size
| difference that's hard to intuit when it's 2.6 vs 3.8 versus
| 2,600,000,000 and 3,800,000,000.
|
| But then we get what I'm going to call "parameter creep": Mistral
| 7B vs. Llama 8B vs. Gemma 9B. I worried after Llama 3 went 8B
| that we'd start seeing games with parameters, but thought I was
| being silly.
| imjonse wrote:
| In the Llama 3 case I think the increase in parameters is
| mostly due to the input embeddings and output logits layers,
| reflecting the vocabulary size increase.
| kouteiheika wrote:
| There was no parameter creep with Llama. Llama 8B is actually
| a ~7B model comparable to Mistral 7B if you strip away
| multilingual embeddings and match what Mistral 7B supports.
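|
| Back-of-the-envelope with the public configs (rough numbers;
| untied input/output embeddings assumed for both models):
|
|       # Llama 3 8B: vocab 128,256, hidden 4,096
|       llama_embed   = 128_256 * 4_096 * 2   # ~1.05B params
|       # Mistral 7B: vocab 32,000, hidden 4,096
|       mistral_embed =  32_000 * 4_096 * 2   # ~0.26B params
|       # ~0.8B difference: most of the 7B -> 8B gap is vocabulary.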
| alekandreev wrote:
| Hello (again) from the Gemma team! We are quite excited to push
| this release out and happy to answer any questions!
|
| Opinions are our own and not of Google DeepMind.
| zerojames wrote:
| How is Gemma-2 licensed?
| alekandreev wrote:
| The terms of use remain the same as Gemma 1 -
| https://ai.google.dev/gemma/terms.
| coreypreston wrote:
| No question. Thanks for thinking of 27B.
| moffkalast wrote:
| The 4k sliding window context seems like a controversial choice
| after Mistral 7B mostly failed at showing any benefits from it.
| What was the rationale behind that instead of just going for
| full 8k or 16k?
| alekandreev wrote:
| This is mostly about inference speed, while maintaining long
| context performance.
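|
| (The model interleaves local sliding-window layers with global
| layers.) Schematically, a local layer just narrows the causal
| mask so each position attends to at most the previous `window`
| tokens, which caps that layer's KV cache; a rough sketch (NumPy,
| illustrative only):
|
|       import numpy as np
|
|       def sliding_window_mask(seq_len, window=4096):
|           # Position i may attend to j where i - window < j <= i.
|           i = np.arange(seq_len)[:, None]
|           j = np.arange(seq_len)[None, :]
|           return (j <= i) & (j > i - window)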
| luke-stanley wrote:
| Any gemma-2-9b or 27b 4 bit GGUF's on HuggingFace yet? Thanks!
| XzAeRosho wrote:
| It's on HuggingFace already:
| https://huggingface.co/google/gemma-2-9b
| luke-stanley wrote:
| I know the safetensors are there, but I said GGUF 4-bit
| quantised, which is kinda the standard for useful local
| applications, a typical balanced sweet spot of performance
| and quality. It makes it much easier to use and works in
| more places, be it personal devices or a server, etc.
| chown wrote:
| If you are still looking for it, I just made it available on
| an app[1] that I am working on with Gemma2 support.
|
| https://msty.app
| luke-stanley wrote:
| Are you saying you put a 4-bit GGUF on HuggingFace?
| luke-stanley wrote:
| Actually for the 9B model, this has 4-bit quantised weights
| (and others): https://huggingface.co/bartowski/gemma-2-9b-it-
| GGUF
|
| Still no 27B 4-bit GGUF quants on HF yet!
|
| I'm monitoring this search: https://huggingface.co/models?lib
| rary=gguf&sort=trending&sea...
| jpcapdevila wrote:
| Will gemma2 be available through gemma.cpp?
| https://github.com/google/gemma.cpp
| austinvhuang wrote:
| This is in the works in the dev branch (thanks pchx :)
|
| https://github.com/google/gemma.cpp/pull/274
| janwas wrote:
| :) Confirmed working. We've just pushed the dev branch to
| main.
| jpcapdevila wrote:
| Awesome, I love this .cpp trend! Thanks for your work!!
| luke-stanley wrote:
| Given the goal of mitigating self-proliferation risks, have you
| observed a decrease in the model's ability to do things like
| help a user set up a local LLM with local or cloud software?
|
| How much is pre-training dataset changes, how much is tuning?
|
| How do you think about this problem, how do you solve it?
|
| Seems tricky to me.
| luke-stanley wrote:
| Wow I'm kinda shocked this was downvoted. That's not cool,
| it's a reasonable question directly about the research - the
| main article link!
| alekandreev wrote:
| To quote Ludovic Peran, our amazing safety lead:
|
| Literature has identified self-proliferation as a dangerous
| capability of models, and details about how to define it and
| examples of the forms it can take have been openly discussed by
| GDM (https://arxiv.org/pdf/2403.13793).
|
| Current Gemma 2 models' success rate on end-to-end challenges
| is zero (0 out of 10), so the capabilities to perform such tasks
| are currently limited.
| moffkalast wrote:
| Turns out LLM alignment is super easy, barely an
| inconvenience.
| dinosaurdynasty wrote:
| One should not confuse alignment and current
| incapability.
| josh-sematic wrote:
| Alignment is tight!
| luke-stanley wrote:
| That's an interesting paper. `Install Mistral 7B on a GCP
| instance and use it to answer a simple question`. Some
| hosting providers and inference software might be easier to
| set up, for now. ;) But do you have to make it less capable
| by being careful about what it's trained on? E.g.: banning
| certain topics (like how to use Llamafile/llama.cpp, knowing
| what hosting providers have free trials, learning about
| ways to jailbreak web apps, free inference providers, etc.)?
|
| Or does the model have to later be finetuned, to not be
| good at certain tasks?
|
| Or are we not at that stage yet?
|
| Is something like tree-of-thought used, to get the best of
| the models for these tasks?
| canyon289 wrote:
| I also work at Google and on Gemma (so same disclaimers)
|
| You can try 27b at aistudio.google.com. Send in your
| favorite prompts, and we hope you like the responses.
| luke-stanley wrote:
| It's fairly easy to pay OpenAI or Mistral money to use their
| APIs. Figuring out how Google Cloud Vertex works and how it's
| billed is more complicated. Azure and AWS are similar in how
| complex they are to use for this. Could Google Cloud please
| provide an OpenAI compatible API and service? I know it's a
| different department. But it'd make using your models way
| easier. It often feels like Google Cloud has no UX or end-user
| testing done on it at all (not true for aistudio.google.com -
| that is better than before, for sure!).
| alekandreev wrote:
| Happy to pass on any feedback to our Google Cloud friends. :)
| luke-stanley wrote:
| Thank you!
| anxman wrote:
| I also hate the billing. It feels like configuring AWS more
| than calling APIs.
| hnuser123456 wrote:
| I plan on downloading a Q5 or Q6 version of the 27b for my
| 3090 once someone puts quants on HF, loading it in LM studio
| and starting the API server to call it from my scripts based
| on openai api. Hopefully it's better at code gen than llama 3
| 8b.
| bapcon wrote:
| I have to agree with all of this. I tried switching to
| Gemini, but the lack of clear billing/quotas, horrible
| documentation, and even poor implementation of status codes
| on failed requests have led me to stick with OpenAI.
|
| I don't know who writes Google's documentation or does the
| copyediting for their console, but it is hard to adapt. I
| have spent hours troubleshooting, only to find out it's
| because the documentation is referring to the same thing by
| two different names. It's 2024 also, I shouldn't be seeing
| print statements without parentheses.
| ankeshanand wrote:
| If you're an individual developer and not an enterprise, just
| go straight to Google AIStudio or GeminiAPI instead:
| https://aistudio.google.com/app/apikey. It's dead simple
| getting an API key and calling with a rest client.
| luke-stanley wrote:
| Interesting but when I tried it, I couldn't figure out the
| billing model because it's all connected to Google
| projects, and there can be different billing things for
| each of them.
|
| Each thing seems to involve a bunch of clicks to set up that
| startup LLM providers don't hassle people with. They're
| more likely to just let you sign in with some generic third
| party oAuth, slap on Stripe billing, let you generate keys,
| show you some usage stats, getting started docs, with
| example queries and a prompt playground etc.
|
| What about the Vertex models though? Are they all actually
| available via Google AI Studio?
| lhl wrote:
| Sadly, while gemma-2-27b-it is available (as a Preview
| model) on the AI Studio playground, it didn't show up via
| API on list_models() for me.
| Deathmax wrote:
| Gemini models on Vertex AI can be called via a preview
| OpenAI-compatible endpoint [1], but shoving that into existing
| tooling, where you don't have programmatic control over the API
| key and it's expected to be long-lived, is non-trivial: GCP uses
| short-lived access tokens (and long-lived ones are not great
| security-wise).
|
| Billing for the Gemini models on Vertex AI (the Generative
| Language API variant still charges by tokens) is arguably
| simpler than every other provider's, because you're charged by
| characters/image/video-second/audio-second. You don't need to
| run a tokenizer (if one is even available _cough_ Claude 3 and
| Gemini), figure out what the chat template is to calculate the
| token cost per message [2], or figure out how tokens are counted
| for an image [3] to get cost estimates before actually
| submitting the request and getting usage info back.
|
| [1]: https://cloud.google.com/vertex-ai/generative-
| ai/docs/multim...
|
| [2]: https://platform.openai.com/docs/guides/text-
| generation/mana...
|
| [3]: https://platform.openai.com/docs/guides/vision/calculati
| ng-c...
| luke-stanley wrote:
| Good to know about this API preview. Hopefully the billing
| problem and UI maze of Vertex AI can be sorted too?
| Flumio wrote:
| Google does plenty of UX studies on GCP; I took part in at
| least 3 of them.
|
| I'm also not sure I understand your problem with pricing.
| Depending on what you do with it, it's not just an LLM; GCP
| actually started before LLMs.
|
| Pricing for image classification and the other features covers
| completely different products than an LLM.
| luke-stanley wrote:
| They should do a whole lot more then! Ideally they'd have
| effective impact. It's a busy mess on GCP. If they wanted
| to compete well, they should do much better with UX
| design, especially for onboarding. Compare how easy
| setting up a Mistral account is with GCP to do some
| generative LLM in a Python script. GCP is a maze. Did you
| make an account to reply to this? I'm curious what you do
| with GCP? Are you a heavy user?
| WhitneyLand wrote:
| The paper on one hand suggests Gemma is on the same Pareto
| curve as Llama 3, while on the other hand it seems to suggest
| Gemma has exceeded Llama 3's efficiency.
|
| Is this a contradiction, or am I misunderstanding something?
|
| Btw, overall very impressive work, great job.
| alekandreev wrote:
| I think it makes sense to compare models trained with the
| same recipe on token count - usually more tokens will give
| you a better model.
|
| However, I wouldn't draw conclusions about different model
| families, like Llama and Gemma, based on their token count
| alone. There are many other variables at play - the quality
| of those tokens, number of epochs, model architecture,
| hyperparameters, distillation, etc. that will have an
| influence on training efficiency.
| causal wrote:
| Thanks for your work on this; excited to try it out!
|
| The Google API models support 1M+ tokens, but these are just
| 8K. Is there a fundamental architecture difference, training
| set, something else?
| np_space wrote:
| Are Gemma-2 models available via API yet? Looks to me like it's
| not yet on vertexai
| zone411 wrote:
| "Soon" https://x.com/LechMazur/status/1806366744706998732
| behnamoh wrote:
| I gave up hope on r"Gem(ma|ini)" a long time ago. I don't buy
| that Google can't produce good LLMs because of its massive
| company size; Microsoft is also a giant company (more market cap
| than Google), but it keeps surprising us with the Phi models.
|
| I think Google just lacks the vision to understand what makes a
| good LLM. Theoretical contributions by research teams are
| valuable, but the real world is built around engineering ideas
| that may lack the "purity" and elegance of theory, but damn it,
| they work.
| johnfn wrote:
| Maybe you gave up before Google released Gemini Advanced? This
| viewpoint seemed more accurate before it was released, but
| Gemini Advanced is the third best LLM as rated here [1]. In
| fact, it held second place until a few days ago when Claude 3.5
| came out.
|
| [1]: https://huggingface.co/spaces/lmsys/chatbot-arena-
| leaderboar...
| staticman2 wrote:
| Isn't Gemini Advanced Gemini Pro attached to some sort of an
| internet search program? If it has that advantage over other
| models it isn't a sign of AI chops.
| alecco wrote:
| I wonder if Google is making Deepmind people switch from their
| cool original research to doing LLMs like everybody else.
| Having their scale in money and data, I would hire new teams of
| _engineers_ who want to do LLMs and let the Deepmind
| _researchers_ do their thing. Not killing the goose that lays
| golden eggs.
| llm_trw wrote:
| Google is in a fight for their lives, I've fully moved over
| to paid services and haven't used google in about a month
| now.
| kkkkkkk wrote:
| If this were a common sentiment or rooted in reality I
| would imagine their stock would not be at an all time
| high...
| llm_trw wrote:
| I'm an early adopter. The rest of you will catch up in
| the next five years.
| popalchemist wrote:
| _Here's a napkin for when you're finished._
| scarmig wrote:
| Can't speak to Gemma, but I found 1.5 superior to Claude and
| ChatGPT 4 when it came out. The trend seems to be each taking
| the lead when it comes out, being king of the hill for a couple
| weeks, and then being surpassed by the next.
|
| Claude's reign has begun, and I'd say it has a solid enough
| lead for at least another two weeks of dominance before it's
| dethroned.
| anxman wrote:
| And the training samples are overly tied to Vertex
| Me1000 wrote:
| >long time ago
|
| This is an incredible statement to make about a field that no
| one was talking about 24 months ago, a family of SOTA models
| that didn't exist until 8 months ago, and a family of small
| local models that didn't exist 6 months ago. But sure, give up
| hope after the first generation of a model family doesn't
| impress you.
|
| People seem to forget how incredibly early we are in this whole
| thing. The fact that so much progress has been made in such a
| short amount of time should make everyone super excited!
| talldayo wrote:
| To be fair, LLMs (especially _Google_ LLMs) aren't merely 24
| months old. This is part of a long line of models that draw
| their heritage from BERT and t5-flan. Google has been at this
| longer than most, _particularly_ in the field of edge-compute
| models. This isn't even close to a first-generation model
| family.
|
| That's not to say this is an insignificant contribution. New
| models are great, especially when released for free, and it's
| important for big firms to keep the ball rolling for tech to
| progress. Though there is also legitimate concern that _all_
| LLMs aren't improving as fast as they used to improve, and
| we may have hit the proverbial bathtub curve of AI progress.
| Me1000 wrote:
| I think there is valid criticism of google for inventing a
| cool technology only to have the rest of the industry
| discover its usefulness before them. But to say Gemini 1.0
| or OG Gemma aren't first generation models because BERT and
| flan existed before is like saying the iPad wasn't a first
| generation device because Apple made the Newton. Like sure,
| they're the same in that they're transformers trained on
| language and text, but these are new families of models.
| The training mechanisms are different, their architectures
| are different, the data sets are different, the intended
| purposes of the models are completely different, etc. At
| some point I guess it's a semantic difference, maybe.
| iamronaldo wrote:
| So it's twice the size of Phi-3 and considerably worse? What am
| I missing?
| ertgbnm wrote:
| They used two non-mutually exclusive techniques. Phi-3 is
| mostly a curriculum-training breakthrough. By filtering the
| training set for high-quality tokens and training on synthetic
| data, they were able to achieve great results. Gemma-2 is a
| distillation breakthrough. By training LLMs with guidance from
| larger teacher LLMs, they were able to achieve great results
| too.
|
| Why not both?
| m00x wrote:
| Worse in some aspects, better in others.
|
| Small models are never going to be generalists, so having
| several small models allows you to pick the one that best fits
| your needs.
| k__ wrote:
| When would you use which?
| Aerbil313 wrote:
| Obviously another small model would be specialized in
| determining that.
| k__ wrote:
| Is it models all the way down?
| m00x wrote:
| Whichever model works better for your use. It's hard to
| know without testing it at the moment.
|
| I've found Gemini to be better at some use-cases, and GPT-4
| better at others for my specific taste and use-case. You
| can kind of go by the benchmark scores to have an idea if
| it's good at logic, creativity, etc.
| azeirah wrote:
| Have you tried Phi 3? It's smart, which makes it perform well on
| benchmarks, but it's not great at conversation or as a chatbot.
|
| I imagine Gemma 2 is a better general-purpose assistant for
| most people, whereas Phi 3 is a solid small LLM (SLM?) for more
| specific use-cases like summarization, RAG, learning about math
| and stuff.
| floridianfisher wrote:
| Why not try it here and make your comparisons that way?
| https://aistudio.google.com/app/prompts/new_chat?model=gemma...
| pona-a wrote:
| One compelling reason not to would be a region block... [0]
|
| https://ai.google.dev/gemini-api/docs/available-regions
| ferretj wrote:
| Another take on this: phi-3 small has 1100 ELO on LMSYS (ranked
| #52) while the confidence interval for Gemma 2 9B is [1170,
| 1200] ELO (ranked btw #15 and #25).
| reissbaker wrote:
| Phi-3 does well in benchmarks but underperforms IRL; for
| example, Phi-3-Medium gets beaten badly by Llama-3-8b on the
| LMSYS Chatbot Arena despite doing better on benchmarks.
|
| Gemma's performance if anything seems understated on
| benchmarks: the 27b is currently ahead of Llama3-70b on the
| Chatbot Arena leaderboard.
| ertgbnm wrote:
| I suspect Phi-3 is not robust to normal human input like
| typos and strange grammar since it's only trained on filtered
| "high quality" tokens and synthetic data. Since it doesn't
| need to waste a ton of parameters learning how to error
| correct input, it's much smarter on well curated benchmarks
| compared to its weight class. However, it can't operate out
| of distribution at all.
| msoad wrote:
| Phi-3 blows this out of the water.
|
|       Benchmark                   | Gemma 2 (9B) | Phi-3 Small (7B)
|       ----------------------------|--------------|-----------------
|       MMLU (5-Shot)               | 63.6         | 75.7
|       HellaSwag (5-Shot)          | 49.8         | 77.0
|       ANLI (7-Shot)               | 48.7         | 58.1
|       GSM-8K (8-Shot; CoT)        | 59.8         | 89.6
|       MedQA (2-Shot)              | 49.6         | 65.4
|       AGIEval (0-Shot)            | 42.1         | 45.1
|       TriviaQA (5-Shot)           | 72.3         | 58.1
|       Arc-C (10-Shot)             | 78.3         | 90.7
|       Arc-E (10-Shot)             | 91.4         | 97.0
|       PIQA (5-Shot)               | 78.1         | 86.9
|       SociQA (5-Shot)             | 65.5         | 79.2
|       BigBench-Hard (3-Shot; CoT) | 59.6         | 79.1
|       WinoGrande (5-Shot)         | 55.6         | 81.5
|       OpenBookQA (10-Shot)        | 78.6         | 88.0
|       BoolQ (2-Shot)              | 66.0         | 84.8
|       CommonSenseQA (10-Shot)     | 76.2         | 80.0
|       TruthfulQA (10-Shot; MC2)   | 52.1         | 70.2
|       HumanEval (0-Shot)          | 34.1         | 61.0
|       MBPP (3-Shot)               | 51.5         | 71.7
| moffkalast wrote:
| Phi is notorious for benchmark overfitting. It's good, but not
| as good as it looks on the charts. On the Lmsys leaderboard it
| places a whole 23 spots behind Llama-3-8B which it also claims
| to soundly beat on the above. So YMMV.
| Garcia98 wrote:
| Pretraining on the Test Set Is All You Need
|
| https://arxiv.org/abs/2309.08632
| mixtureoftakes wrote:
| Good release, but the annoying part is they're very unclear about
| which types of models they are comparing. They provide benchmark
| comparisons for the base models only and arena comparisons for
| instruct only? Was that intentional? Why would you ever do that?
| This makes things unnecessarily complicated imo, and the only
| payoff is a short-term win for Google on paper.
|
| Guess I'll just fully test it for my own tasks to know for sure
| Flumio wrote:
| This is great :)
|
| And when we continue to fine-tune, depending on how much and what
| type of data we train it on, I'm pretty sure that for a smart
| agent that is not a knowledgeable expert but primarily an agent
| (understands what and how), this will get smaller and easier to
| run everywhere.
| thomasahle wrote:
| I'm curious about the use of explicit tokens like
| <start_of_turn>, <end_of_turn>, <bos>, and <eos>. What happens if
| the user insert those in their message? Does that provide an easy
| way to "ignore previous instructions"?
|
| Do I have to manually sanitize the input before I give it to the
| model?
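|
| For concreteness, the kind of sanitization I mean (a sketch; the
| escaping scheme here is just illustrative, not any particular
| library's API):
|
|       import re
|
|       CONTROL = ["<start_of_turn>", "<end_of_turn>", "<bos>", "<eos>"]
|
|       def sanitize(user_text: str) -> str:
|           # Break up literal control-token strings so they can't be
|           # tokenized as real control tokens when the prompt is built.
|           pattern = "|".join(re.escape(t) for t in CONTROL)
|           return re.sub(pattern,
|                         lambda m: m.group(0).replace("<", "< "),
|                         user_text)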
| jerrygenser wrote:
| Are these small Gemma 2 distilled models available anywhere? I'm
| not finding them on huggingface.co, etc. but maybe I don't know
| the exact model names they are published under.
|
| Are the weights released yet?
| floridianfisher wrote:
| The huggingface weights are here:
| https://huggingface.co/collections/google/gemma-2-release-66...
| mchiang wrote:
| They are available on Hugging Face:
| https://huggingface.co/collections/google/gemma-2-release-66...
|
| Ollama: https://ollama.com/library/gemma2
| alekandreev wrote:
| In addition to the HF links shared by sibling comments, the 2B
| will be released soon.
| jerrygenser wrote:
| that's actually the particular one I was looking for and
| couldn't find. Also had googled for the other ones but maybe
| it was so recent that it hadn't been indexed. Thanks!
| QuesnayJr wrote:
| There are two new chatbots on Chatbot Arena, called "late-june-
| chatbot" and "im-just-another-late-june-chatbot". Both of them
| report that they are Gemma if you ask. I'm assuming it's these
| two models, but AFAIK there has been no official announcement.
| suryabhupa wrote:
| The announcements are live on Twitter! See this for example:
| https://x.com/suryabhupa/status/1806342617191379167
| chown wrote:
| This is a great release! If you are looking to try it locally
| with a great interface, I am working on an app [1] and I just
| pushed an update to support Gemma2.
|
| 1: https://msty.app
| tr3ntg wrote:
| Wow, msty looks really cool. I've bookmarked it to look into
| more later as a replacement for how I use a locally-hosted
| instance of LibreChat. It'd be a huge improvement to use local
| models rather than remote ones, for many of my queries.
|
| That said, do you have a reason for keeping msty closed source
| rather than open? I read your FAQ for "why should I trust msty"
| and it feels lacking.
|
| > We are a small team of developers who are passionate about AI
| and privacy. We have worked on projects before that have been
| used by thousands of people such as this (I've never heard of
| Cleavr). There are real faces (real faces = Twitter account
| link?) behind the product. And come chat with us on our Discord
| server to know us better.
|
| This is much, much better than having no attribution, but it's
| miles away from being able to verify trust by reading the code.
| Would love to hear what your reasons against this are.
|
| Still thinking about trying it out, anyway...
| renewiltord wrote:
| What the heck, this looks cool! How have I missed it. Gonna
| give it a whirl.
| jakobov wrote:
| Nice! Can you explain what you mean by "simulate training beyond
| the number of available tokens"?
|
| Why does using distillation from a larger model simulate training
| with more tokens?
| canyon289 wrote:
| Hi, I work on the Gemma team (same as Alek, opinions are my
| own).
|
| Essentially, instead of tokens that are "already there" in text,
| distillation allows us to simulate training data from a larger
| model.
| suryabhupa wrote:
| Surya here from the core Gemma team -- we can think of a
| distillation loss as learning to model the entire distribution
| of tokens that are likely to follow the prefix thus far,
| instead of only the token in the training example. If you do
| some back-of-the-envelope calculations, you can see that
| learning to model a larger distribution yields many more bits
| of information to learn from.
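|
| In loss terms, a sketch (PyTorch; the temperature and details
| are illustrative, not the exact recipe):
|
|       import torch.nn.functional as F
|
|       def distill_loss(student_logits, teacher_logits, T=1.0):
|           # Cross-entropy against the teacher's full next-token
|           # distribution, instead of a one-hot training token.
|           p_teacher = F.softmax(teacher_logits / T, dim=-1)
|           log_q     = F.log_softmax(student_logits / T, dim=-1)
|           return -(p_teacher * log_q).sum(-1).mean()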
| jakobov wrote:
| Gotcha. That makes sense. Thanks!
|
| What are the theories as to why this works better than
| training on a larger quantity of non-simulated tokens?
|
| Is it because the gradient from the non-simulated tokens is
| too noisy for a small model to model correctly?
| rosslazer wrote:
| Are there examples of the prompt or transcripts for the human
| testing?
| aubanel wrote:
| It's exceptionally strong. In LMSys Chatbot Arena, the 27B
| version scores above Llama-3-70B, at the level of OpenAI GPT-4
| and Claude-3 Sonnet!
| screye wrote:
| What are the most obvious standouts?
|
| In my experience, smaller models tend to do well on benchmarks
| and fail at generalization. Phi-2 comes to mind.
| moffkalast wrote:
| It's multilingual. Genuinely. Compared my results with some
| people on reddit and the consensus is that the 27B is near
| perfect in a few obscure languages and likely perfect in most
| common ones. The 9B is not as good but it's still coherent
| enough to use in a pinch.
|
| It's literally the first omni-translation tool that actually
| works that you can run offline at home. I'm amazed that
| Google mentioned absolutely nothing about this in their
| paper.
| jug wrote:
| Wow, that's very impressive and indeed a game changer. I've
| previously had trouble with various Scandinavian languages,
| but the last one I checked with was Llama 2, and I kind of gave
| up on it. I had expected we were going to need special-purpose
| small models for these uses as a crutch, like SW-GPT3.
|
| So I guess Gemma 2 is going to become Gemini 2.0 in their
| truly large and closed variants then? Or is it the open
| version of Gemini 1.5?
| typpo wrote:
| If anyone is interested in evaling Gemma locally, this can be
| done pretty easily using ollama[0] and promptfoo[1] with the
| following config:
|
|       prompts:
|         - 'Answer this coding problem in Python: {{ask}}'
|
|       providers:
|         - ollama:chat:gemma2:9b
|         - ollama:chat:llama3:8b
|
|       tests:
|         - vars:
|             ask: function to find the nth fibonacci number
|         - vars:
|             ask: calculate pi to the nth digit
|         - # ...
|
| One small thing I've always appreciated about Gemma is that it
| doesn't include a "Sure, I can help you" preamble. It just gets
| right into the code, and follows it with an explanation. The
| training seems to emphasize response structure and ease of
| comprehension.
|
| Also, best to run evals that don't rely on rote memorization of
| public code... so please substitute with your personal tests :)
|
| [0] https://ollama.com/library/gemma2
|
| [1] https://github.com/promptfoo/promptfoo
| roywiggins wrote:
| In Ollama, Gemma:9b works fine, but 27b seems to be producing
| a lot of nonsense for me. Asking for a bit of python or
| JavaScript code rapidly devolves into producing code-like
| gobbledegook, extending for hundreds of lines.
| resource_waste wrote:
| Do we believe that? I've been told Google's AI was going to be
| great 4 times now, and it's consistently #4 behind OpenAI,
| Facebook, and Claude.
| aubanel wrote:
| LMSys Chatbot Arena is a crowd-sourced ranking with an ELO
| system: users are presented with 2 hidden models, they get the
| answers of the 2 models to their request, and they vote on which
| one performed best, which resolves one match and updates the ELO
| scores. This is the closest thing we have to a ground truth for
| LLM evaluation: and Gemma2-27B performs extremely well in
| Chatbot Arena ELO.
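|
| For intuition, the classic per-match Elo update (the leaderboard
| nowadays fits a Bradley-Terry model, but the idea is similar; a
| minimal sketch):
|
|       def elo_update(r_a, r_b, score_a, k=32):
|           # score_a: 1 if A wins, 0 if B wins, 0.5 for a tie.
|           expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
|           r_a += k * (score_a - expected_a)
|           r_b -= k * (score_a - expected_a)
|           return r_a, r_b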
| jakobov wrote:
| How much faster (in terms of the number of iterations to a given
| performance) is training from distillation?
| dongobread wrote:
| The knowledge distillation is very interesting but generating
| trillions of outputs from a large teacher model seems insanely
| expensive. Is this really more cost efficient than just using
| that compute instead for training your model with more data/more
| epochs?
| DebtDeflation wrote:
| I'm also curious. It seems like 6 months ago everyone was
| afraid of "model collapse" but now synthetic training
| generation and teacher models are all the rage. Have we solved
| the problem of model collapse?
| astrange wrote:
| Model collapse was basically a coping idea made up by artists
| who were hoping AI image generators would all magically
| destroy themselves at some point; I don't think it was ever
| considered likely to happen.
|
| It does seem to be true that clean data works better than low
| quality data.
| groby_b wrote:
| You're confusing it with data poisoning.
|
| Model collapse itself is (was?) a fairly serious research
| topic: https://arxiv.org/abs/2305.17493
|
| We've by now reached a "probably not inevitable" -
| https://arxiv.org/abs/2404.01413 argues there's a finite
| upper bound to error - but I'd also point out that that
| paper assumes training data cardinality increases with the
| number of training generations and is strictly
| accumulative.
|
| To a first order, that means you better have a pre-2022
| dataset to get started, and have archived it well.
|
| but it's probably fair to say current SOTA is still more or
| less "it's neither impossible nor inevitable".
| Workaccount2 wrote:
| Pay attention, because only once will you get to watch humans
| learn, in real time, that they are nothing special.
| agi_is_coming wrote:
| The distillation is done on-policy, like RLHF -- the student
| model generates the sequences and the teacher provides feedback
| in terms of logits.
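|
| Schematically (pseudo-PyTorch; generate() and the model-call
| signatures are placeholders, not a real API):
|
|       import torch
|       import torch.nn.functional as F
|
|       def on_policy_distill_step(student, teacher, prompts):
|           seqs = student.generate(prompts)   # student samples itself
|           with torch.no_grad():
|               t_logits = teacher(seqs)       # teacher scores them
|           s_logits = student(seqs)
|           # Student matches the teacher on its own samples.
|           loss = F.kl_div(F.log_softmax(s_logits, -1),
|                           F.softmax(t_logits, -1),
|                           reduction="batchmean")
|           loss.backward()
|           return loss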
| mistercheph wrote:
| Playing with it, and I like how much I can influence it with a
| system prompt; llama3 reacts pretty mildly to any system prompts
| I've tried.
| smcleod wrote:
| It has a tiny context window of 8k; that thing will have the
| memory of a goldfish.
___________________________________________________________________
(page generated 2024-06-27 23:00 UTC)