[HN Gopher] Gemini Flash
___________________________________________________________________
Gemini Flash
Author : meetpateltech
Score : 245 points
Date : 2024-05-14 18:00 UTC (4 hours ago)
(HTM) web link (deepmind.google)
(TXT) w3m dump (deepmind.google)
| causal wrote:
| 1M token context by default is the big feature here IMO, but we
| need better benchmarks to measure what that really means.
|
| My intuition is that as contexts get longer we start hitting the
| limits of how much comprehension can be embedded in a single
| point of vector space, and will need better architectures for
| selecting the relevant portions of the context.
| dragonwriter wrote:
| > 1M token context by default is the big feature here IMO, but
| we need better benchmarks to measure what that really means.
|
| Multimodality in a model that's between 4-7% of the cost per
| token of OpenAI's cheapest multimodal model is an important
| feature when you are talking about production use and not just
| economically unsustainable demos.
| refulgentis wrote:
| In preview, can't be used in production, they already rug-
| pulled people building on Gemini w/r/t cost and RPM, and
| they're pointedly not putting any RPM or cost on the page.
| (seriously, try finding info on cost, RPM, or release right
| now, you're linked in circles.)
|
| Agree on OpenAI multimodal but it's sort of a stilted example
| itself, it's because OpenAI has a hole in its lineup - ex.
| Claude Haiku is multimodal, faster, and significantly cheaper
| than GPT 3.5.
| dragonwriter wrote:
| > they're pointedly not putting any RPM or cost on the page
|
| 360 RPM base limit, pricing is posted.
|
| > seriously, try finding info on cost, RPM, or release
| right now,
|
| I wasn't making up numbers; it's on their Gemini API pricing
| page: https://ai.google.dev/pricing
| refulgentis wrote:
| Nice, thanks (btw, I didn't think you were making it up,
| it was in the keynote!)
| causal wrote:
| +1 on Haiku being oft overlooked.
| verdverm wrote:
| Shows the power of the brand and the limit of names
| consumers will recall long term
|
| "Who are the biggest soda or potato chip makers?"
| refulgentis wrote:
| One comment, and like everywhere, HN is chock-full of people
| who are happy to opine on AI like it's sports, but have
| little experience building with it or making business choices
| based on it.
| verdverm wrote:
| 1. Not an AI specific comment series, it applies
| generally
|
| 2. Comments like yours are against the guidelines of HN
|
| 3. You don't know what I do and are incorrect in your
| assessment of my work and experience (another common HN
| mishap)
|
| ---
|
| (re: #2/3) Please make your substantive points without
| crossing into personal attack.
|
| https://news.ycombinator.com/newsguidelines.html
| anoncareer0212 wrote:
| What do you mean?
|
| A) This fills a gaping hole for cheap multimodal models,
| OpenAI doesn't have one
|
| B) Anthropic's Haiku is a good choice.
|
| You) wow A didn't know Anthropic. Goes to show power of
| brands, much like snack foods
|
| B) Eh, I wouldn't conclude anything from A. It's one comment.
| Some people don't know what an Anthropic is because there's
| high interest in AI relative to interest in AI APIs. You can
| expect a low SNR, even on HN
|
| You) Stop personally attacking me! It's against the
| rules!!
| memothon wrote:
| Can you explain this comment further? I don't really
| understand your point here.
| leetharris wrote:
| The problem is that even 1.5 Pro seems completely useless for
| long context multimodal stuff.
|
| I have tried it for so many use cases in video / audio and it
| hallucinates an unbelievable amount. More than any other
| model I've ever used.
|
| So if 1.5 Pro can't even handle simple tasks without
| hallucination, I imagine this tiny model is even more
| useless.
| refulgentis wrote:
| Yeah, it's not very good in practice; you can get a halfway
| decent demo out of it ("look I gave it 6.5 Harry Potters and it
| made an SVG map connecting characters with
| annotations!!"...some of the characters...spare
| annotations...cost $20). Just good enough to fool you a couple
| of times when you try to make it work 10 times.
| cynicalsecurity wrote:
| Feed 1 mln tokens, get blocked by some silly overly sensitive
| "safety" trigger
| gpm wrote:
| Last I checked you could disable the safety triggers as an API
| user with gemini (which doesn't alleviate your obligation to
| follow the TOS as to the uses of the model).
| VS1999 wrote:
| I'm not working with a company that can just write in the ToS
| "we can do anything we want. lol. lmao" and expect me to
| follow it religiously. Corporations need less control over
| speech, not more.
| xianshou wrote:
| Looking at MMLU and other benchmarks, this essentially means sub-
| second first-token latency with Llama 3 70B quality (but not
| GPT-4 / Opus), native multimodality, and 1M context.
|
| Not bad compared to rolling your own, but among frontier models
| the main competitive differentiator was native multimodality.
| With the release of GPT-4o I'm not clear on why an organization
| not bound to GCP would pick Gemini. 128k context (4o) is fine
| unless you're processing whole books/movies at once. Is anyone
| doing this at scale in a way that can't be filtered down from 1M
| to 100k?
| Workaccount2 wrote:
| With 1M tokens you can dump 2000 pages of documents into the
| context windows before starting a chat.
|
| Gemini's strength isn't in being able to answer logic puzzles;
| its strength is in its context length. Studying for an exam?
| Just put the entire textbook in the chat. Need to use a dead
| language for an old test system with no information on the
| internet? Drop the 1300-page reference manual in and ask away.
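|
| As a rough sketch of that workflow (assuming the
| google-generativeai Python SDK; "manual.txt" and the model
| name are placeholders, and pricing/limits may change):
|
|     # Dump a whole reference manual into the context, then ask away.
|     # Assumes `pip install google-generativeai` and GOOGLE_API_KEY set.
|     import os
|     import google.generativeai as genai
|
|     genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
|     model = genai.GenerativeModel("gemini-1.5-flash-latest")
|
|     manual = open("manual.txt", encoding="utf-8").read()  # ~2000 pages
|     response = model.generate_content(
|         [manual, "Using only the manual above, how do I declare a task?"]
|     )
|     print(response.text)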
| tulip4attoo wrote:
| You don't really use it, right? There's no way to debug if
| you're doing it like this. Also, the accuracy isn't high, and
| it can't answer complicated questions, making it quite
| useless for the cost.
| dang wrote:
| Please make your substantive points without crossing into
| personal attack. Your comment would be fine without the
| first sentence.
|
| https://news.ycombinator.com/newsguidelines.html
| ianbicking wrote:
| How much do those input tokens cost?
|
| According to https://ai.google.dev/pricing it's $0.70/million
| input tokens (for a long context). That will be per-exchange,
| so every little back and forth will cost around that much (if
| you're using a substantial portion of the context window).
|
| And while I haven't tested Gemini, most LLMs get increasingly
| wonky as the context goes up, more likely to fixate, more
| likely to forget instructions.
|
| That big context window could definitely be great for certain
| tasks (especially information extraction), but it doesn't
| feel like a generally useful feature.
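|
| To put rough numbers on it (a back-of-the-envelope sketch; the
| rate is from the pricing page above, the token counts are made
| up):
|
|     # Per-exchange input cost for a long-context Gemini Flash chat.
|     INPUT_RATE = 0.70 / 1_000_000   # USD per input token (>128K prompts)
|
|     context_tokens = 800_000        # hypothetical document + chat history
|     turns = 20                      # back-and-forth exchanges
|
|     # The whole accumulated context is re-sent on every turn.
|     print(f"${context_tokens * INPUT_RATE * turns:.2f}")  # ~$11.20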
| lxgr wrote:
| Is there a way to amortize that cost over several queries,
| i.e. "pre-bake" a document into a context persisted in some
| form to allow cheaper follow-up queries about it?
| inlined wrote:
| Though I'm not familiar with the specifics, they
| announced "context caching"
| simonw wrote:
| They announced that today, calling it "context caching" -
| but it looks like it's only going to be available for
| Gemini Pro 1.5, not for Gemini Flash.
|
| It reduces prompt costs by half for those shared prefix
| tokens, but you have to pay $4.50/million tokens/hour to
| keep that cache warm - so probably not a useful
| optimization for most lower traffic applications.
|
| https://ai.google.dev/gemini-api/docs/caching
| dragonwriter wrote:
| > It reduces prompt costs by half for those shared prefix
| tokens, but you have to pay $4.50/million tokens/hour to
| keep that cache warm - so probably not a useful
| optimization for most lower traffic applications
|
| That's on a model with a $3.50/1M input token cost, so half
| price on cached prefix tokens at $4.50/1M tokens/hour breaks
| even at a little over 2.5 requests/hour that hit the cached
| prefix.
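|
| A quick sketch of that break-even arithmetic (rates from the
| pricing and caching pages; cache write and output costs are
| ignored):
|
|     # Context caching break-even for Gemini 1.5 Pro (<=128K prompts).
|     input_rate   = 3.50 / 1_000_000   # USD per input token
|     cache_saving = input_rate / 2     # cached prefix tokens are half price
|     storage_rate = 4.50 / 1_000_000   # USD per cached token per hour
|
|     # Requests per hour needed before the cache pays for itself:
|     print(storage_rate / cache_saving)   # ~2.57 requests/hour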
| gcanyon wrote:
| Depending on the output window limit, the first query
| could be something like: "Summarize this down to its
| essential details" -- then use that to feed future
| queries.
|
| Tediously, it would be possible to do this chapter by chapter
| in order to work around the output limit, building something
| up for future inputs.
|
| Of course, the summary might not fulfill the same
| functionality as the original source document. YMMV
| mcbuilder wrote:
| That per-exchange context cost is what really puts me off
| using cloud LLMs for anything serious. I know batching and
| everything is needed in the data center and is important for
| keeping the KV cache around, but you basically need to take
| over a machine entirely to get an interactive session where
| the context cost scales with sequence length. So it's useful,
| but more in the case of a local LLaMA type situation if you
| want a conversation.
| falcor84 wrote:
| I wonder if we could implement the equivalent of JIT
| compilation, whereby context sequences that get repeatedly
| reused would be used for online fine-tuning.
| bredren wrote:
| Can anyone speculate on how G arrived at this price, and
| perhaps how it contrasts with how OAI arrived at its
| updated pricing? (realizing it can't be held up directly to
| GPT x at the moment)
| tk90 wrote:
| Isn't there retrieval degradation with such a large context
| size? I would think that a RAG system over 128K is still
| better than no RAG + a 1M context window, no? (assuming text
| only)
| dragonwriter wrote:
| > With the release of GPT-4o I'm not clear on why an
| organization not bound to GCP would pick Gemini.
|
| Price, for anything that doesn't need GPT-4 quality --
| particularly multimodal tasks, for which GPT-4o is OpenAI's
| cheapest model. GPT-3.5-Turbo -- which is itself 1/10 the cost
| of GPT-4o -- is $0.50/1M tokens on input and $1.50/1M on
| output, with a 16K context window. Gemini 1.5 Flash, for
| prompts up to 128K, is $0.35/1M tokens on input and $0.53/1M
| tokens on output.
|
| For tasks that require multimodality but not GPT-4 smarts
| (which I think includes a lot of document-processing tasks,
| for which GPT-4 with Vision and now GPT-4o are magical but
| pricey), Gemini Flash looks like close to a 95% price cut.
| mupuff1234 wrote:
| I think that's a bit like asking why someone would need a 1GB
| Gmail account when a 50MB Yahoo account is clearly enough.
|
| It means you can dump context without thinking about it twice
| and without needing to hack some solutions to deal with context
| overflow etc.
|
| And given that most use cases most likely deal with text and
| not multimodal the advantage seems pretty clear imo.
| tedsanders wrote:
| Long context is a little bit different than extra email
| storage. Having 1 gb of storage instead of 50 mb has
| essentially no downside to the user experience.
|
| But submitting 1M input tokens instead of 100k input tokens:
|
| - Causes your costs to go up ~10x
|
| - Causes your latency to go up ~10x (or between 1x and 10x)
|
| - Can result in worse answers (especially if the model gets
| distracted by irrelevant info)
|
| So longer context is great, yes, but it's not a no-brainer
| like more email storage. It brings costs. And whether those
| costs are worth it depends on what you're doing.
| thefourthchime wrote:
| I tried to use the 1M tokens with Gemini a couple of months
| ago. It either crashed, or responded _very_ slowly and then
| crashed.
|
| I tried a half dozen times and gave up. I hope this one is
| faster and more stable.
| leetharris wrote:
| There's no way it's Llama 3 70b quality.
|
| I've been trying to work Gemini 1.5 Pro into our workstream for
| all kinds of stuff and it is so bad. Unbelievable amount of
| hallucinations, especially when you introduce video or audio.
|
| I'm not sure I can think of a single use case where a high
| hallucination tiny multimodal model is practical in most
| businesses. Without reliability it's just a toy.
| dibujaron wrote:
| Seconding this. Gemini 1.5 is comically bad at basic tasks
| that GPT4 breezes through, not to mention GPT4o.
| chimney wrote:
| Price.
| killerstorm wrote:
| I guess it depends on what you want to do.
|
| E.g. I want to send an entire code base in a context. It might
| not fit into 128k.
|
| Filtering down is a complex task by itself. It's much easier to
| call a single API.
|
| Regarding quality of responses, I've seen both disappointing
| and brilliant responses from Gemini. So maybe worth trying.
| But it will probably take several iterations until it can be
| relied upon.
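|
| For what it's worth, a minimal sketch of the "just send the
| whole repo" approach (the directory, glob, and skip list are
| placeholders):
|
|     # Concatenate a code base into one prompt string, with file headers
|     # so the model can cite paths; no filtering beyond obvious noise.
|     from pathlib import Path
|
|     SKIP = {".git", "node_modules", "__pycache__"}
|
|     def pack_repo(root: str) -> str:
|         parts = []
|         for path in sorted(Path(root).rglob("*.py")):
|             if any(p in SKIP for p in path.parts):
|                 continue
|             parts.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")
|         return "\n\n".join(parts)
|
|     blob = pack_repo("./my-project")
|     print(f"~{len(blob) // 4} tokens (rough 4-chars-per-token estimate)")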
| treprinum wrote:
| GPT-3.5 has 0.5s average first-token latency and Claude3 Haiku
| 0.4s.
| refulgentis wrote:
| It's absolutely unconscionable that Gemini Ultra got memory-
| holed. I can't trust anything that Google says about benchmarks.
|
| It seemingly existed only so that, in December 2023, Google
| could claim Gemini ~= GPT-4 (the April 2023 version) (on
| paper) ("32-shot CoT" vs. 5-shot GPT-4).
| summerlight wrote:
| Gemini Ultra is 1.0 with an 8K window. This is 1.5 with a 1M
| window. Your feeling is based on an incorrect assumption.
| anoncareer0212 wrote:
| And?
|
| You're replying to a comment that points out Gemini Ultra was
| never released, wasn't mentioned today, and it's the only
| model Google's benchmarking at GPT-4 level. They didn't say
| anything about feelings or context window.
| dontreact wrote:
| Gemini Ultra has been available for people to try via
| Gemini Advanced (formerly Bard) for a few months
| cma wrote:
| It says it may fall back to a worse model under load, and
| there is no way to tell which you are getting. I think ChatGPT
| has at times done something similar, though.
| summerlight wrote:
| > You're replying to a comment that points out Gemini Ultra
| was never released
|
| What are you even talking about? How do you know it's
| memory-holed if you haven't used it? The API is not GA, but
| the model can be used through the chatbot subscription. GP is
| talking about their lack of trust in Google's claim of a 1M
| token context, not GPT-4 level reasoning. If you expect GPT-4
| level performance from cost-efficient models, that's another
| problem.
| CSMastermind wrote:
| Anyone who uses both products regularly will tell you that
| Gemini Advanced is far behind GPT-4 and Claude 3 Opus.
|
| Pretending that they have a model internally that's on par but
| they're not releasing it is a very "my girlfriend goes to
| another school" move and makes no sense if they're a business
| that's actually trying to compete.
| Workaccount2 wrote:
| Probably because they have to spend 30% more time per
| iteration to make sure its output is hobbled enough to be
| DEI compliant.
|
| Never let social ideologues into your organization, and
| definitely don't give them power.
| numbers wrote:
| It's ironic that when you ask these AI chatbots what their own
| context size is, they don't know. ChatGPT running 4o doesn't
| even know that 4o exists.
| SoftTalker wrote:
| Does a monkey know that it is a monkey?
| verdverm wrote:
| I think "yes" is the most likely answer here
|
| Animals have a lot more intelligence than they're typically
| credited with.
|
| Tool use, names, language, social structure and behavior,
| even drug use has been shown across many species
| chaorace wrote:
| Okay, but the monkey doesn't _know_ that it knows that it's a
| monkey.
| verdverm wrote:
| are you sure?
|
| Many animals recognize themselves and their species as
| separate concepts
| advisedwang wrote:
| Ask a human how many neurons they have. Hell, over history
| humans haven't even consistently understood that the brain is
| where cognition happens.
| simonw wrote:
| The models didn't exist when their training data was collected.
|
| But... that's not really an excuse any more. Model vendors
| should understand now that the most natural thing in the world
| is for people to ask models directly about their own abilities
| and architecture.
|
| I think models should have a final layer of fine-tuning or even
| system prompting to help them answer these kinds of questions
| in a useful way.
| nightski wrote:
| A lightweight model that you can only use in the cloud? That is
| amusing. These tech megacorps are really intent on owning your
| usage of AI. But we must not let that be the future.
| kherud wrote:
| Now that context length seems abundant for most tasks, I'm
| wondering why sub-word tokens are still used. I'm really curious
| how character-based LLMs would compare. With 2M context, the
| compute bottleneck fades away. I'm not sure, though, what role
| the
| vocabulary size has. Maybe a large size is critical, since the
| embedding already contains a big chunk of the knowledge. On the
| other hand, using a character-based vocabulary would solve
| multiple problems, I think, like glitch tokens and possibly
| things like arithmetic and rhyming capabilities. Implementing
| sub-word tokenizers correctly and training them also seems
| quite complex. On a character level this should be trivial.
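|
| One way to get a feel for the trade-off is simply to count (a
| sketch using tiktoken as a convenient subword tokenizer;
| Gemini's own tokenizer differs):
|
|     # Character-level vs. subword sequence length for the same text.
|     import tiktoken
|
|     text = "The quick brown fox jumps over the lazy dog. " * 1_000
|     enc = tiktoken.get_encoding("cl100k_base")   # GPT-4-family BPE
|
|     subword_len = len(enc.encode(text))
|     char_len = len(text)
|     print(subword_len, char_len, char_len / subword_len)
|     # Characters give a ~4x longer sequence, so attention cost
|     # (quadratic in length) grows ~16x, while the vocab shrinks to ~100.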
| AaronFriel wrote:
| The attention mechanism is vastly more efficient to train when
| it can attend to larger, more meaningful tokens. For inference
| servers, a significant amount of memory goes into the KV cache,
| and as you note, to build up the embedding through attention
| would then require correlating far more tokens, each of which
| is "less meaningful".
|
| I think we may get to this point eventually; in the limit, we
| will want multimodal LLMs that understand images and sounds
| down to the pixel and frequency, and it seems like for text,
| too, we will eventually want that.
| yk wrote:
| > a significant amount of memory goes into the KV cache
|
| Is there a good paper (or talk) on how inference looks at
| scale? (Kinda like ELI-using-single-gpus)
| thomasahle wrote:
| Maybe you could just use a good-old 1D-CNN for the bottom 3-4
| layers. Then the model would be able to combine characters
| into roughly token-length chunks anyway.
|
| Just make sure to have some big MLPs at the start too, to
| enrich the "tokens" with the information currently stored in
| the embedding tables.
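|
| Something like this, perhaps (a toy PyTorch sketch of that
| idea, not anything shipped; all dimensions are arbitrary):
|
|     # Character embeddings, enriched by an MLP, then downsampled by a
|     # strided 1D conv into roughly token-length chunks.
|     import torch
|     import torch.nn as nn
|
|     class CharFrontend(nn.Module):
|         def __init__(self, n_chars=256, d_model=512, chunk=4):
|             super().__init__()
|             self.embed = nn.Embedding(n_chars, d_model)
|             self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
|             # stride == kernel_size: every 4 characters become one "token"
|             self.conv = nn.Conv1d(d_model, d_model, chunk, stride=chunk)
|
|         def forward(self, char_ids):             # (batch, seq)
|             x = self.mlp(self.embed(char_ids))   # (batch, seq, d_model)
|             x = self.conv(x.transpose(1, 2))     # (batch, d_model, seq/4)
|             return x.transpose(1, 2)             # (batch, seq/4, d_model)
|
|     out = CharFrontend()(torch.randint(0, 256, (2, 1024)))
|     print(out.shape)   # torch.Size([2, 256, 512])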
| joaogui1 wrote:
| I would say 2 big problems are:
|
| 1. latency, which would get worse if you have to sequentially
| generate more output
|
| 2. These models very roughly turn tokens -> "average meaning"
| on the embedding layer, followed by attention layers that
| combine the meanings, and feed forward layers that match the
| current meaning combination to some kind of learned
| archetype/prototype almost. When you move from word parts to
| characters, all of that becomes more confusing (what's the
| average meaning of "a"?), and so I don't think there are good
| enough techniques to learn character-based models yet.
| novaRom wrote:
| In AI music generation we have much better results with large
| vocabulary sizes on the order of 10^6; my uneducated guess is
| that's because transformers are not universal pattern
| recognizers -- they can only catch patterns at a certain
| granularity level.
| darby_eight wrote:
| > On a character level this should be trivial.
|
| Characters are not the semantic components of words--these are
| syllables. Generally speaking, anyway. I've got to imagine this
| approach would yield higher quality results than the roman
| alphabet. I'm curious if this could be tested by just looking
| at how LLMs handle English vs Chinese.
| inbetween wrote:
| The minimal semantic parts of words are morphemes. Syllables
| are phonological units (roughly: the minimal unit for
| rhythmic purposes such as stress, etc)
| darby_eight wrote:
| Only in languages that have morphemes! This is hardly a
| universal attribute of language so much as an attribute of
| those that use an alphabet to encode sounds. It makes more
| sense to just bypass the encoding and directly consider the
| speech.
|
| Besides, considering morphemes as semantic often results in
| a completely different meaning than we actually intend. We
| aren't trying to train a chatbot to speak in prefixes and
| suffixes, we're trying to train a chatbot to speak in
| natural language, even if it is encoded to latin script
| before output.
| nojvek wrote:
| Will wait for Meta to release Flash equivalent weights.
|
| Multi-modal models running offline on mobile devices with
| millisecond latencies per token seem like the future.
|
| Where is Apple in all of this. Why is Siri still so shit?
| visarga wrote:
| Apple made a deal with OpenAI for GPT-4o; the stakes are
| indeed high, and they can't be caught with their pants down.
| The iPhone needs to remain the premium brand.
| alephxyz wrote:
| Not very informative. They're selling it as the fast/cheap
| option, but they don't benchmark inference speed or compare it
| with non-Gemini models.
|
| According to https://ai.google.dev/pricing it's priced a bit
| lower than GPT-3.5 Turbo, but no idea how it compares to it.
| simonw wrote:
| I upgraded my llm-gemini plugin to provide CLI access to
| Gemini Flash:
|
|     pipx install llm  # or: brew install llm
|     llm install llm-gemini --upgrade
|     llm keys set gemini
|     # paste API key here
|     llm -m gemini-1.5-flash-latest 'a short poem about otters'
|
| https://github.com/simonw/llm-gemini/releases/tag/0.1a4
| quantisan wrote:
| Price (input):
|   $0.35 / 1 million tokens (for prompts up to 128K tokens)
|   $0.70 / 1 million tokens (for prompts longer than 128K)
|
| Price (output):
|   $0.53 / 1 million tokens (for prompts up to 128K tokens)
|   $1.05 / 1 million tokens (for prompts longer than 128K)
|
| ---
|
| Compared to GPT-3.5 Turbo:
|
|   Input:  US$0.50 / 1M tokens
|   Output: US$1.50 / 1M tokens
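|
| A quick sketch comparing the two on a made-up workload (rates
| copied from above; the 10M-in / 2M-out volumes are arbitrary):
|
|     # Monthly cost: 10M input tokens, 2M output tokens, prompts <=128K.
|     def cost(in_rate, out_rate, in_tok=10e6, out_tok=2e6):
|         return (in_tok * in_rate + out_tok * out_rate) / 1e6
|
|     print(cost(0.35, 0.53))   # Gemini 1.5 Flash: ~$4.56
|     print(cost(0.50, 1.50))   # GPT-3.5 Turbo:    ~$8.00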
| cs702 wrote:
| We're witnessing a race to the bottom on pricing as it's
| happening. Competition based solely or mainly on pricing is a
| defining characteristic of a commodity market, i.e., a market in
| which competing products are interchangeable, and buyers are
| happy to switch to the cheapest option for a given level of
| quality.
|
| There's an old saying that if you're selling a commodity, "you
| can only be as smart as your dumbest competitor."
|
| If we want to be more polite, we could say instead: "you can only
| price your service as high as your lowest-cost competitor."
|
| It seems that a lot of capital that has been "invested" to train
| AI models is, ahem, unlikely ever to be recovered.
| Aloisius wrote:
| Price competition isn't limited to commodities.
| cs702 wrote:
| I never said it was.
| EGreg wrote:
| You never said it wasn't, either :-P
| Aloisius wrote:
| Then why imply that it is a commodity because they (partly)
| compete on price?
|
| Fungibility is the defining characteristic of commodities.
| While these products can be used to accomplish the same
| task, we're not near real fungibility yet.
| Delmololo wrote:
| But the race to the bottom has a counter-force, right?
|
| People expect to see a return on investment, which will set
| the floor on pricing (at least once the old money runs out).
|
| I'm also curious whether AI is a good example, because AI will
| become fundamental. That means if you don't invest you might
| be gone, so it's more like a fee even if the investment
| doesn't pan out.
| __loam wrote:
| Supply and demand determines price, not the hopes and dreams
| of investors.
| r0m4n0 wrote:
| Google is building on top of, and integrating with, their
| cloud offerings. Having first-party solutions like this gives
| big
| cloud customers an easy way to integrate. For Google it's just
| another tool in the chest that gets sold to these big
| enterprises. Many go all in on all the same cloud products.
| Also the models are only the building blocks. Other cloud
| products at Google will be built with this and sold as a
| service
|
| Not so sure about Open AI though...
| daghamm wrote:
| Is this a race to the bottom, or just Google's new TPUs being
| extremely efficient?
| rmbyrro wrote:
| Google figured it can't beat OpenAI technically, but they sure
| know they can beat them financially and infrastructurally.
| __loam wrote:
| Is infrastructure and scale not an expression of technical
| ability? It should have been obvious that Meta and Google
| would bury a tiny company with less than 1000 employees given
| the amount of capital they can leverage for compute, talent,
| and data. Google literally invented the transformer
| architecture behind GPT.
| __loam wrote:
| You're saying the quiet part out loud here.
| objektif wrote:
| Does Goog have anything like the OpenAI Assistants API? If
| they had one, I would definitely give it a try.
___________________________________________________________________
(page generated 2024-05-14 23:00 UTC)