[HN Gopher] Gemini Flash
       ___________________________________________________________________
        
       Gemini Flash
        
       Author : meetpateltech
       Score  : 245 points
       Date   : 2024-05-14 18:00 UTC (4 hours ago)
        
 (HTM) web link (deepmind.google)
 (TXT) w3m dump (deepmind.google)
        
       | causal wrote:
       | 1M token context by default is the big feature here IMO, but we
       | need better benchmarks to measure what that really means.
       | 
       | My intuition is that as contexts get longer we start hitting the
       | limits of how much comprehension can be embedded in a single
       | point of vector space, and will need better architectures for
       | selecting the relevant portions of the context.
        
         | dragonwriter wrote:
         | > 1M token context by default is the big feature here IMO, but
         | we need better benchmarks to measure what that really means.
         | 
          | Multimodality in a model that's 4-7% of the cost per token of
          | OpenAI's cheapest multimodal model is an important feature
          | when you are talking about production use and not just
          | economically unsustainable demos.
        
           | refulgentis wrote:
           | In preview, can't be used in production, they already rug-
           | pulled people building on Gemini w/r/t cost and RPM, and
           | they're pointedly not putting any RPM or cost on the page.
           | (seriously, try finding info on cost, RPM, or release right
           | now, you're linked in circles.)
           | 
           | Agree on OpenAI multimodal but it's sort of a stilted example
           | itself, it's because OpenAI has a hole in its lineup - ex.
           | Claude Haiku is multimodal, faster, and significantly cheaper
           | than GPT 3.5.
        
             | dragonwriter wrote:
             | > they're pointedly not putting any RPM or cost on the page
             | 
             | 360 RPM base limit, pricing is posted.
             | 
             | > seriously, try finding info on cost, RPM, or release
             | right now,
             | 
              | I wasn't making up numbers; it's on their Gemini API
              | pricing page: https://ai.google.dev/pricing
        
               | refulgentis wrote:
               | Nice, thanks (btw, I didn't think you were making it up,
               | it was in the keynote!)
        
             | causal wrote:
             | +1 on Haiku being oft overlooked.
        
               | verdverm wrote:
               | Shows the power of the brand and the limit of names
               | consumers will recall long term
               | 
               | "Who are the biggest soda or potato chip makers?"
        
               | refulgentis wrote:
                | One comment, and like everywhere, HN is chock-full of
               | people who are happy to opine on AI like it's sports, but
               | have little experience building with it, or making
               | business choices based on it.
        
               | verdverm wrote:
               | 1. Not an AI specific comment series, it applies
               | generally
               | 
               | 2. Comments like yours are against the guidelines of HN
               | 
               | 3. You don't know what I do and are incorrect in your
               | assessment of my work and experience (another common HN
               | mishap)
               | 
               | ---
               | 
               | (re: #2/3) Please make your substantive points without
               | crossing into personal attack.
               | 
               | https://news.ycombinator.com/newsguidelines.html
        
               | anoncareer0212 wrote:
               | What do you mean?
               | 
               | A) This fills a gaping hole for cheap multimodal models,
               | OpenAI doesn't have one
               | 
               | B) Anthropic's Haiku is a good choice.
               | 
               | You) wow A didn't know Anthropic. Goes to show power of
               | brands, much like snack foods
               | 
                | B) Eh, I wouldn't conclude anything from A. It's one
                | comment. Some people don't know what an Anthropic is
                | because there's high interest in AI relative to interest
                | in AI APIs. You can expect a low SNR, even on HN.
               | 
               | You) Stop personally attacking me! It's against the
               | rules!!
        
               | memothon wrote:
               | Can you explain this comment further? I don't really
               | understand your point here.
        
           | leetharris wrote:
           | The problem is that even 1.5 Pro seems completely useless for
           | long context multimodal stuff.
           | 
           | I have tried it for so many use cases in video / audio and it
           | hallucinates an unbelievable amount. More than any other
           | model I've ever used.
           | 
           | So if 1.5 Pro can't even handle simple tasks without
           | hallucination, I imagine this tiny model is even more
           | useless.
        
         | refulgentis wrote:
         | Yeah it's not very good in practice, you can get a halfway
         | decent demo out of it ("look I gave it 6.5 harry potters and it
         | made an SVG map connecting characters with
         | annotations!!"...some of the characters...spare
         | annotations...cost $20). Just good enough to fool you a couple
         | times when you try to make it work 10 times.
        
       | cynicalsecurity wrote:
       | Feed 1 mln tokens
       | 
       | @
       | 
       | Get blocked by some silly overly sensitive "safety" trigger
        
         | gpm wrote:
         | Last I checked you could disable the safety triggers as an API
         | user with gemini (which doesn't alleviate your obligation to
         | follow the TOS as to the uses of the model).
        
           | VS1999 wrote:
           | I'm not working with a company that can just write in the ToS
           | "we can do anything we want. lol. lmao" and expect me to
           | follow it religiously. Corporations need less control over
           | speech, not more.
        
       | xianshou wrote:
       | Looking at MMLU and other benchmarks, this essentially means sub-
       | second first-token latency with Llama 3 70B quality (but not
       | GPT-4 / Opus), native multimodality, and 1M context.
       | 
       | Not bad compared to rolling your own, but among frontier models
       | the main competitive differentiator was native multimodality.
       | With the release of GPT-4o I'm not clear on why an organization
       | not bound to GCP would pick Gemini. 128k context (4o) is fine
       | unless you're processing whole books/movies at once. Is anyone
       | doing this at scale in a way that can't be filtered down from 1M
       | to 100k?
        
         | Workaccount2 wrote:
         | With 1M tokens you can dump 2000 pages of documents into the
         | context windows before starting a chat.
         | 
          | Gemini's strength isn't in being able to answer logic puzzles;
          | its strength is in its context length. Studying for an exam?
         | Just put the entire textbook in the chat. Need to use a dead
         | language for an old test system with no information on the
         | internet? Drop the 1300 page reference manual in and ask away.
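          | 
          | The rough arithmetic behind that, with assumed words-per-token
          | and words-per-page figures:
          | 
          |     # ~0.75 words per token and ~375 words per printed page
          |     # are assumptions; both vary a lot with formatting.
          |     tokens = 1_000_000
          |     words = tokens * 0.75
          |     print(words / 375)   # ~2000 pages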
        
           | tulip4attoo wrote:
           | You don't really use it, right? There's no way to debug if
           | you're doing it like this. Also, the accuracy isn't high, and
           | it can't answer complicated questions, making it quite
           | useless for the cost.
        
             | dang wrote:
             | Please make your substantive points without crossing into
             | personal attack. Your comment would be fine without the
             | first sentence.
             | 
             | https://news.ycombinator.com/newsguidelines.html
        
           | ianbicking wrote:
           | How much do those input tokens cost?
           | 
           | According to https://ai.google.dev/pricing it's $0.70/million
           | input tokens (for a long context). That will be per-exchange,
           | so every little back and forth will cost around that much (if
           | you're using a substantial portion of the context window).
           | 
           | And while I haven't tested Gemini, most LLMs get increasingly
           | wonky as the context goes up, more likely to fixate, more
           | likely to forget instructions.
           | 
           | That big context window could definitely be great for certain
           | tasks (especially information extraction), but it doesn't
           | feel like a generally useful feature.
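            | 
            | A rough sketch of that per-exchange math (rates from the
            | pricing page above; the token counts are assumptions):
            | 
            |     # Prompts over 128K are billed at the long-context rate.
            |     IN_RATE = 0.70 / 1e6    # $ per input token
            |     OUT_RATE = 1.05 / 1e6   # $ per output token
            |     context = 900_000       # docs + history resent each turn
            |     reply = 1_000           # assumed reply length
            |     cost = context * IN_RATE + reply * OUT_RATE
            |     print(cost)   # ~$0.63 per exchange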
        
             | lxgr wrote:
             | Is there a way to amortize that cost over several queries,
             | i.e. "pre-bake" a document into a context persisted in some
             | form to allow cheaper follow-up queries about it?
        
               | inlined wrote:
               | Though I'm not familiar with the specifics, they
               | announced "context caching"
        
               | simonw wrote:
               | They announced that today, calling it "context caching" -
               | but it looks like it's only going to be available for
               | Gemini Pro 1.5, not for Gemini Flash.
               | 
               | It reduces prompt costs by half for those shared prefix
               | tokens, but you have to pay $4.50/million tokens/hour to
               | keep that cache warm - so probably not a useful
               | optimization for most lower traffic applications.
               | 
               | https://ai.google.dev/gemini-api/docs/caching
        
               | dragonwriter wrote:
               | > It reduces prompt costs by half for those shared prefix
               | tokens, but you have to pay $4.50/million tokens/hour to
               | keep that cache warm - so probably not a useful
               | optimization for most lower traffic applications
               | 
               | That's on a model with $3.5/1M input token cost, so half
               | price on cached prefix tokens for $4.5/1M/hour breaks
               | even at a little over 2.5 requests/hour using the cached
               | prefix.
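                | 
                | A minimal sketch of that break-even arithmetic:
                | 
                |     in_rate = 3.50 / 1e6   # $/input token, uncached
                |     cached = in_rate / 2   # cached prefix: half price
                |     storage = 4.50 / 1e6   # $/cached token/hour
                |     saved_per_req = in_rate - cached
                |     print(storage / saved_per_req)  # ~2.57 req/hour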
        
               | gcanyon wrote:
               | Depending on the output window limit, the first query
               | could be something like: "Summarize this down to its
               | essential details" -- then use that to feed future
               | queries.
               | 
                | Tediously, it would be possible to do this chapter by
                | chapter, working around the output limit while building
                | something for future inputs.
               | 
               | Of course, the summary might not fulfill the same
               | functionality as the original source document. YMMV
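                | 
                | A sketch of that chapter-by-chapter pass, assuming the
                | google-generativeai Python package and that the book is
                | already split into chapter strings:
                | 
                |     import google.generativeai as genai
                | 
                |     genai.configure(api_key="...")  # placeholder key
                |     model = genai.GenerativeModel(
                |         "gemini-1.5-flash-latest")
                | 
                |     def distill(chapters):
                |         notes = []
                |         for i, ch in enumerate(chapters, 1):
                |             r = model.generate_content(
                |                 "Summarize this chapter down to its "
                |                 "essential details:\n\n" + ch)
                |             notes.append(f"Chapter {i}: {r.text}")
                |         # the joined notes become the smaller,
                |         # reusable context for future queries
                |         return "\n\n".join(notes)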
        
             | mcbuilder wrote:
              | That per-exchange context cost is what really puts me off
              | using cloud LLMs for anything serious. I know batching and
              | everything is needed in the data center, and keeping the KV
              | cache around is important, but you basically need to fully
              | take over a machine for an interactive session to get
              | context costs that scale with sequence length. So it's
              | useful, but more in the case of a local LLaMA type
              | situation if you want a conversation.
        
               | falcor84 wrote:
                | I wonder if we could implement the equivalent of JIT
                | compilation, whereby context sequences that get
                | repeatedly reused would be used for online fine-tuning.
        
             | bredren wrote:
             | Can anyone speculate on how G arrived at this price, and
             | perhaps how it contrasts with how OAI arrived at its
             | updated pricing? (realizing it can't be held up directly to
             | GPT x at the moment)
        
           | tk90 wrote:
            | Isn't there retrieval degradation with such a large context
            | size? I would think that a RAG system on 128K is still
            | better than no RAG + a 1M context window, no? (assuming text
            | only)
        
         | dragonwriter wrote:
         | > With the release of GPT-4o I'm not clear on why an
         | organization not bound to GCP would pick Gemini.
         | 
          | Price, for anything that doesn't need GPT-4 quality,
          | particularly multimodal tasks, where GPT-4o is OpenAI's
          | cheapest option. GPT-3.5-Turbo, which is itself 1/10 the cost
          | of GPT-4o, is $0.50/1M tokens on input and $1.50/1M on output,
          | with a 16K context window. Gemini 1.5 Flash, for prompts up to
          | 128K, is $0.35/1M tokens on input and $0.53/1M tokens on
          | output.
         | 
         | For tasks that require multimodality but not GPT-4 smarts
          | (which I think includes a lot of document-processing tasks, for
          | which GPT-4 with Vision and now GPT-4o are magical but pricey),
         | Gemini Flash looks like close to a 95% price cut.
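          | 
          | Back-of-the-envelope version of that price cut, using the
          | GPT-4o list prices as I recall them ($5/1M in, $15/1M out,
          | worth double-checking) and a made-up document job:
          | 
          |     # 50k input tokens (a scanned doc) + 2k output tokens
          |     doc_in, doc_out = 50_000, 2_000
          |     gpt4o = doc_in * 5.00 / 1e6 + doc_out * 15.00 / 1e6
          |     flash = doc_in * 0.35 / 1e6 + doc_out * 0.53 / 1e6
          |     print(gpt4o, flash, 1 - flash / gpt4o)
          |     # ~$0.28 vs ~$0.019, roughly a 93% cut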
        
         | mupuff1234 wrote:
          | I think that's a bit like asking why someone would need a 1GB
          | Gmail account when a 50MB Yahoo account is clearly enough.
         | 
         | It means you can dump context without thinking about it twice
         | and without needing to hack some solutions to deal with context
         | overflow etc.
         | 
         | And given that most use cases most likely deal with text and
         | not multimodal the advantage seems pretty clear imo.
        
           | tedsanders wrote:
           | Long context is a little bit different than extra email
           | storage. Having 1 gb of storage instead of 50 mb has
           | essentially no downside to the user experience.
           | 
           | But submitting 1M input tokens instead of 100k input tokens:
           | 
           | - Causes your costs to go up ~10x
           | 
           | - Causes your latency to go up ~10x (or between 1x and 10x)
           | 
           | - Can result in worse answers (especially if the model gets
           | distracted by irrelevant info)
           | 
           | So longer context is great, yes, but it's not a no-brainer
           | like more email storage. It brings costs. And whether those
           | costs are worth it depends on what you're doing.
        
         | thefourthchime wrote:
          | I tried to use the 1M token context with Gemini a couple of
          | months ago. It either crashed or responded _very_ slowly and
          | then crashed.
          | 
          | I tried a half dozen times and gave up. I hope this one is
          | faster and more stable.
        
         | leetharris wrote:
         | There's no way it's Llama 3 70b quality.
         | 
         | I've been trying to work Gemini 1.5 Pro into our workstream for
         | all kinds of stuff and it is so bad. Unbelievable amount of
         | hallucinations, especially when you introduce video or audio.
         | 
         | I'm not sure I can think of a single use case where a high
         | hallucination tiny multimodal model is practical in most
         | businesses. Without reliability it's just a toy.
        
           | dibujaron wrote:
           | Seconding this. Gemini 1.5 is comically bad at basic tasks
           | that GPT4 breezes through, not to mention GPT4o.
        
         | chimney wrote:
         | Price.
        
         | killerstorm wrote:
         | I guess it depends on what you want to do.
         | 
         | E.g. I want to send an entire code base in a context. It might
         | not fit into 128k.
         | 
         | Filtering down is a complex task by itself. It's much easier to
         | call a single API.
         | 
         | Regarding quality of responses, I've seen both disappointing
          | and brilliant responses from Gemini. So maybe worth trying. But
         | it will probably take several iterations until it can be relied
         | upon.
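          | 
          | A minimal sketch of the "send an entire code base" call, with
          | a rough 4-characters-per-token estimate (an assumption) as a
          | sanity check against the window:
          | 
          |     from pathlib import Path
          | 
          |     def pack_repo(root, exts=(".py", ".ts"),
          |                   limit=1_000_000):
          |         parts, chars = [], 0
          |         for p in sorted(Path(root).rglob("*")):
          |             if p.suffix in exts and p.is_file():
          |                 text = p.read_text(errors="ignore")
          |                 parts.append(f"=== {p} ===\n{text}")
          |                 chars += len(text)
          |         if chars // 4 > limit:  # rough token estimate
          |             raise ValueError("won't fit in the window")
          |         return "\n\n".join(parts)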
        
         | treprinum wrote:
         | GPT-3.5 has 0.5s average first-token latency and Claude3 Haiku
         | 0.4s.
        
       | refulgentis wrote:
       | It's absolutely unconscionable that Gemini Ultra got memory-
       | holed. I can't trust anything that Google says about benchmarks.
       | 
        | It seemingly existed only so that, in December 2023, Google
        | could claim Gemini ~= GPT-4 (the April 2023 version) (on paper)
        | ("32-shot CoT" vs. 5-shot GPT-4).
        
         | summerlight wrote:
          | Gemini Ultra is 1.0 with an 8k window. This is 1.5 with a 1M
          | window. Your feeling is based on an incorrect assumption.
        
           | anoncareer0212 wrote:
           | And?
           | 
           | You're replying to a comment that points out Gemini Ultra was
           | never released, wasn't mentioned today, and it's the only
           | model Google's benchmarking at GPT-4 level. They didn't say
           | anything about feelings or context window.
        
             | dontreact wrote:
             | Gemini Ultra has been available for people to try via
             | Gemini Advanced (formerly Bard) for a few months
        
               | cma wrote:
               | It says it may fall back to a worse model under load and
               | there is no way to tell which you are getting. I think
               | chatgpt has at times done something similar though.
        
             | summerlight wrote:
             | > You're replying to a comment that points out Gemini Ultra
             | was never released
             | 
              | What are you even talking about? How do you know it's
              | memory-holed if you haven't used it? The API is not GA,
              | but the model can be used through the chatbot
              | subscription. GP is talking about their lack of trust in
              | Google's claim of a 1M token context, not GPT-4 level
              | reasoning. If you expect GPT-4 level performance from
              | cost-efficient models, that's another problem.
        
         | CSMastermind wrote:
         | Anyone who uses both products regularly will tell you that
         | Gemini Advanced is far behind GPT-4 and Claude 3 Opus.
         | 
         | Pretending that they have a model internally that's on par but
         | they're not releasing it is a very "my girlfriend goes to
         | another school" move and makes no sense if they're a business
         | that's actually trying to compete.
        
           | Workaccount2 wrote:
            | Probably because they have to spend 30% more time per
            | iteration to make sure its output is hobbled enough to be
            | DEI compliant.
           | 
           | Never let social ideologues into your organization, and
           | definitely don't give them power.
        
       | numbers wrote:
       | It's ironic that when you ask these AI chatbots what their own
        | context size is, they don't know. ChatGPT doesn't even know that
        | 4o exists when you ask 4o itself.
        
         | SoftTalker wrote:
         | Does a monkey know that it is a monkey?
        
           | verdverm wrote:
           | I think "yes" is the most likely answer here
           | 
           | animals have a lot more intelligence than they typically get
           | attributed
           | 
           | Tool use, names, language, social structure and behavior,
           | even drug use has been shown across many species
        
             | chaorace wrote:
              | Okay, but the monkey doesn't _know_ that it knows that
              | it's a monkey.
        
               | verdverm wrote:
               | are you sure?
               | 
               | Many animals recognize themselves and their species as
               | separate concepts
        
         | advisedwang wrote:
         | Ask a human how many neurons they have. Hell, over history
         | humans haven't even consistently understood that the brain is
         | where cognition happens.
        
         | simonw wrote:
         | The models didn't exist when their training data was collected.
         | 
         | But... that's not really an excuse any more. Model vendors
         | should understand now that the most natural thing in the world
         | is for people to ask models directly about their own abilities
         | and architecture.
         | 
         | I think models should have a final layer of fine-tuning or even
         | system prompting to help them answer these kinds of questions
         | in a useful way.
        
       | nightski wrote:
       | A lightweight model that you can only use in the cloud? That is
       | amusing. These tech megacorps are really intent on owning your
       | usage of AI. But we must not let that be the future.
        
       | kherud wrote:
       | Now that context length seems abundant for most tasks, I'm
       | wondering why sub-word tokens are still used. I'm really curious
        | how character-based LLMs would compare. With 2M context, the
       | compute bottleneck fades away. I'm not sure though what role the
       | vocabulary size has. Maybe a large size is critical, since the
       | embedding already contains a big chunk of the knowledge. On the
       | other hand, using a character-based vocabulary would solve
       | multiple problems, I think, like glitch tokens and possibly
       | things like arithmetic and rhyming capabilities. Implementing
       | sub-word tokenizers correctly and training them seems also quite
       | complex. On a character level this should be trivial.
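        | 
        | One way to see what the switch costs, assuming the tiktoken
        | package for the sub-word side:
        | 
        |     import tiktoken
        | 
        |     text = open("some_large_file.txt").read()  # any big text
        |     enc = tiktoken.get_encoding("cl100k_base")
        |     print(len(text) / len(enc.encode(text)))
        |     # typically ~4 characters per sub-word token for English,
        |     # so a character-level model needs ~4x the context length
        |     # for the same text.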
        
         | AaronFriel wrote:
         | The attention mechanism is vastly more efficient to train when
         | it can attend to larger, more meaningful tokens. For inference
         | servers, a significant amount of memory goes into the KV cache,
         | and as you note, to build up the embedding through attention
         | would then require correlating far more tokens, each of which
         | is "less meaningful".
         | 
          | I think we may get to this point eventually: in the limit we
          | will want multimodal LLMs that understand images and sounds
          | down to the pixel and frequency, and it seems like we will
          | eventually want that for text as well.
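          | 
          | For a sense of scale, a back-of-the-envelope KV-cache estimate
          | for one sequence, assuming a Llama-3-70B-like shape (80
          | layers, 8 KV heads, head dim 128, fp16):
          | 
          |     def kv_cache_bytes(seq_len, layers=80, kv_heads=8,
          |                        head_dim=128, dtype_bytes=2):
          |         # 2x for keys and values, per layer, per position
          |         return (2 * layers * kv_heads * head_dim
          |                 * seq_len * dtype_bytes)
          | 
          |     for n in (8_000, 128_000, 1_000_000):
          |         print(n, kv_cache_bytes(n) / 2**30)
          |     # ~2.4 GiB, ~39 GiB, ~305 GiB of cache per sequence
          | 
          | At character granularity the same text needs several times
          | more positions, and the cache grows with it.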
        
           | yk wrote:
           | > a significant amount of memory goes into the KV cache
           | 
            | Is there a good paper (or talk) on how inference looks at
            | scale? (Kinda like ELI-using-single-GPUs)
        
           | thomasahle wrote:
            | Maybe you could just use a good old 1D CNN for the bottom 3-4
            | layers. Then the model can combine characters into roughly
            | token-length chunks anyway.
           | 
           | Just make sure to have some big MLPs at the start too, to
           | enrich the "tokens" with the information currently stored in
           | the embedding tables.
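            | 
            | Roughly, as a PyTorch sketch (layer sizes are arbitrary;
            | two stride-2 convs give ~4 characters per chunk):
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     class CharToChunk(nn.Module):
            |         def __init__(self, vocab=256, d=512):
            |             super().__init__()
            |             self.embed = nn.Embedding(vocab, d)
            |             # the "big MLP at the start"
            |             self.mlp = nn.Sequential(
            |                 nn.Linear(d, 4 * d), nn.GELU(),
            |                 nn.Linear(4 * d, d))
            |             # strided convs: ~4x fewer positions out
            |             self.convs = nn.Sequential(
            |                 nn.Conv1d(d, d, 3, 2, 1), nn.GELU(),
            |                 nn.Conv1d(d, d, 3, 2, 1), nn.GELU())
            | 
            |         def forward(self, char_ids):    # (B, L)
            |             x = self.mlp(self.embed(char_ids))
            |             x = self.convs(x.transpose(1, 2))
            |             return x.transpose(1, 2)    # (B, L/4, d)
            | 
            |     x = torch.randint(0, 256, (1, 1024))
            |     print(CharToChunk()(x).shape)  # (1, 256, 512)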
        
         | joaogui1 wrote:
         | I would say 2 big problems are:
         | 
         | 1. latency, which would get worse if you have to sequentially
         | generate more output
         | 
         | 2. These models very roughly turn tokens -> "average meaning"
         | on the embedding layer, followed by attention layers that
         | combine the meanings, and feed forward layers that match the
         | current meaning combination to some kind of learned
          | archetype/prototype almost. When you move from word parts to
          | characters, all of that becomes more confusing (what's the
          | average meaning of "a"?), and so I don't think there are good
          | enough techniques to learn character-based models yet.
        
         | novaRom wrote:
          | In AI music generation we have much better results with large
          | vocabulary sizes on the order of 10^6; my uneducated guess is
          | that's because transformers are not universal pattern
          | recognizers: they can only catch patterns at a certain level
          | of granularity.
        
         | darby_eight wrote:
         | > On a character level this should be trivial.
         | 
         | Characters are not the semantic components of words--these are
         | syllables. Generally speaking, anyway. I've got to imagine this
         | approach would yield higher quality results than the roman
         | alphabet. I'm curious if this could be tested by just looking
         | at how LLMs handle English vs Chinese.
        
           | inbetween wrote:
           | The minimal semantic parts of words are morphemes. Syllables
           | are phonological units (roughly: the minimal unit for
           | rhythmic purposes such as stress, etc)
        
             | darby_eight wrote:
             | Only in languages that have morphemes! This is hardly a
             | universal attribute of language so much as an attribute of
             | those that use an alphabet to encode sounds. It makes more
             | sense to just bypass the encoding and directly consider the
             | speech.
             | 
             | Besides, considering morphemes as semantic often results in
             | a completely different meaning than we actually intend. We
             | aren't trying to train a chatbot to speak in prefixes and
             | suffixes, we're trying to train a chatbot to speak in
             | natural language, even if it is encoded to latin script
             | before output.
        
       | nojvek wrote:
       | Will wait for Meta to release Flash equivalent weights.
       | 
        | Multi-modal models running offline on mobile devices with
        | millisecond latencies per token seem like the future.
       | 
       | Where is Apple in all of this. Why is Siri still so shit?
        
         | visarga wrote:
          | Apple made a deal with OpenAI for GPT-4o; the stakes are
          | indeed high, and they can't be caught with their pants down.
          | The iPhone needs to remain the premium brand.
        
       | alephxyz wrote:
       | Not very informative. They're selling it as the fast/cheap option
       | but they don't benchmark inference speed or compare it with non-
       | gemini models.
       | 
       | According to https://ai.google.dev/pricing it's priced a bit
       | lower than gpt3.5-turbo but no idea how it compares to it.
        
       | simonw wrote:
        | I upgraded my llm-gemini plugin to provide CLI access to Gemini
        | Flash:
        | 
        |     pipx install llm  # or brew install llm
        |     llm install llm-gemini --upgrade
        |     llm keys set gemini
        |     # paste API key here
        |     llm -m gemini-1.5-flash-latest 'a short poem about otters'
       | 
       | https://github.com/simonw/llm-gemini/releases/tag/0.1a4
        
       | quantisan wrote:
        | Price (input): $0.35 / 1 million tokens (for prompts up to
        | 128K tokens); $0.70 / 1 million tokens (for prompts longer than
        | 128K)
        | 
        | Price (output): $0.53 / 1 million tokens (for prompts up to
        | 128K tokens); $1.05 / 1 million tokens (for prompts longer than
        | 128K)
        | 
        | ---
        | 
        | Compared to GPT-3.5 Turbo:
        | 
        | Input: US$0.50 / 1M tokens; Output: US$1.50 / 1M tokens
        
       | cs702 wrote:
       | We're witnessing a race to the bottom on pricing as it's
       | happening. Competition based solely or mainly on pricing is a
       | defining characteristic of a commodity market, i.e., a market in
       | which competing products are interchangeable, and buyers are
       | happy to switch to the cheapest option for a given level of
       | quality.
       | 
       | There's an old saying that if you're selling a commodity, "you
       | can only be as smart as your dumbest competitor."
       | 
       | If we want to be more polite, we could say instead: "you can only
       | price your service as high as your lowest-cost competitor."
       | 
       | It seems that a lot of capital that has been "invested" to train
       | AI models is, ahem, unlikely ever to be recovered.
        
         | Aloisius wrote:
         | Price competition isn't limited to commodities.
        
           | cs702 wrote:
           | I never said it was.
        
             | EGreg wrote:
             | You never said it wasn't, either :-P
        
             | Aloisius wrote:
             | Then why imply that it is a commodity because they (partly)
             | compete on price?
             | 
             | Fungibility is the defining characteristic of commodities.
             | While these products can be used to accomplish the same
             | task, we're not near real fungibility yet.
        
         | Delmololo wrote:
          | But the race to the bottom has a floor, right?
          | 
          | People expect to see a return on investment, which will set
          | the bottom of pricing (at least once the old money runs out).
          | 
          | I'm also curious whether AI is a good example, because AI will
          | become fundamental. This means that if you don't invest you
          | might be gone, so it's more like a fee in case the investment
          | doesn't pan out.
        
           | __loam wrote:
           | Supply and demand determines price, not the hopes and dreams
           | of investors.
        
         | r0m4n0 wrote:
          | Google is building on top of, and integrating with, their cloud
          | offerings. Having first-party solutions like this gives big
         | cloud customers an easy way to integrate. For Google it's just
         | another tool in the chest that gets sold to these big
         | enterprises. Many go all in on all the same cloud products.
         | Also the models are only the building blocks. Other cloud
         | products at Google will be built with this and sold as a
         | service
         | 
         | Not so sure about Open AI though...
        
         | daghamm wrote:
          | Is this a race to the bottom, or just Google's new TPUs being
          | extremely efficient?
        
         | rmbyrro wrote:
         | Google figured it can't beat OpenAI technically, but they sure
         | know they can beat them financially and infrastructurally.
        
           | __loam wrote:
           | Is infrastructure and scale not an expression of technical
           | ability? It should have been obvious that Meta and Google
           | would bury a tiny company with less than 1000 employees given
            | the amount of capital they can leverage for compute, talent,
            | and data. Google literally invented the transformer behind
            | GPT.
        
         | __loam wrote:
         | You're saying the quiet part out loud here.
        
       | objektif wrote:
        | Does Google have anything like the OpenAI Assistants API? If
        | they did, I would definitely give it a try.
        
       ___________________________________________________________________
       (page generated 2024-05-14 23:00 UTC)