[HN Gopher] Gemini 2.0: our new AI model for the agentic era
       ___________________________________________________________________
        
       Gemini 2.0: our new AI model for the agentic era
        
       Author : meetpateltech
       Score  : 564 points
       Date   : 2024-12-11 15:33 UTC (7 hours ago)
        
 (HTM) web link (blog.google)
 (TXT) w3m dump (blog.google)
        
       | og_kalu wrote:
       | The Gemini 2 models support native audio and image generation but
       | the latter won't be generally available till January. Really
       | excited for that as well as 4o's image generation (whenever that
       | comes out). Steerability has lagged behind aesthetics in image
        | generation for a while now, and it'd be great to see a big advance
       | in that.
       | 
       | Also a whole lot of computer vision tasks (via LLMs) could be
       | unlocked with this. Think Inpainting, Style Transfer, Text
       | Editing in the wild, Segmentation, Edge detection etc
       | 
       | They have a demo: https://www.youtube.com/watch?v=7RqFLp0TqV0
        
         | jncfhnb wrote:
         | These are not computer vision tasks...
        
           | Jabrov wrote:
           | What are they, then...?
        
             | 85392_school wrote:
             | The first two are tasks which involve _making_ images. They
             | could be called image generation or image editing.
        
           | newfocogi wrote:
           | Maybe some of these tasks are arguably not aligned with the
           | traditional applications of CV, but Segmentation and Edge
           | detection are definitely computer vision in every definition
           | I've come across - before and after NNs took over.
        
       | jerpint wrote:
       | We're definitely going to need better benchmarks for agentic
       | tasks, and not just code reasoning. Things that are needlessly
       | painful that humans go through all the time
        
         | AuthConnectFail wrote:
          | it's insane on lmarena for its size; livebench should have it
          | soon too, I guess
        
           | maeil wrote:
           | The size isn't stated, not necessarily a given that it's as
           | small as 1.5-Flash.
        
       | bradhilton wrote:
       | Beats Gemini 1.5 Pro at all but two of the listed benchmarks.
       | Google DeepMind is starting to get their bearings in the LLM era.
       | These are the minds behind AlphaGo/Zero/Fold. They control their
       | own hardware destiny with TPUs. Bullish.
        
         | p1esk wrote:
         | Are these benchmarks still meaningful?
        
           | maeil wrote:
           | No, and they haven't been for at least half a year. Utterly
           | optimized for by the providers. Nowadays if a model would be
           | SotA for general use but not #1 on any of these benchmarks, I
           | doubt they'd even release it.
        
           | CamperBob2 wrote:
           | I've started keeping an eye out for original brainteasers,
           | just for that reason. GCHQ's Christmas puzzle just came out
           | [1], and o1-pro got 6 out of 7 of them right. It took about
           | 20 minutes in total.
           | 
           | I wasn't going to bother trying those because I was pretty
           | sure it wouldn't get _any_ of them, but decided to give it an
           | easy one (#4) and was impressed at the CoT.
           | 
           | Meanwhile, Google's newest 2.0 Flash model went 0 for 7.
           | 
           | 1: https://metro.co.uk/2024/12/11/gchq-christmas-
           | puzzle-2024-re...
        
             | nrvn wrote:
             | Did it get the 8 right? The linked article provides the
             | wrong answer btw.
        
             | p1esk wrote:
             | Wow! That's all I need to know about Google's model.
        
               | danpalmer wrote:
               | That's a comparison of multiple GPT-4 models working
               | together... against a single GPT-4 mini style model.
        
               | Workaccount2 wrote:
               | What is impressive about this new model is that it is the
               | lightweight version (flash).
               | 
               | There will probably be a 2.0 pro (which will be 4o/sonnet
               | class) and maybe an ultra (o1(?)/Opus).
        
             | iamdelirium wrote:
              | Why are you comparing flash vs o1-pro? Wouldn't a fairer
              | comparison be flash vs mini?
        
               | iamdelirium wrote:
                | I just asked o1-mini the first two questions and it got
                | them wrong.
        
         | dagmx wrote:
          | Regarding TPUs, sure, for the stuff that's running in the
          | cloud.
         | 
         | However their on device TPUs lag behind the competition and
         | Google still seem to struggle to move significant parts of
         | Gemini to run on device as a result.
         | 
         | Of course, Gemini is provided as a subscription service as well
         | so perhaps they're not incentivized to move things locally.
         | 
         | I am curious if they'll introduce something like Apple's
         | private cloud compute.
        
           | whimsicalism wrote:
           | i don't think they need to win the on device market.
           | 
           | we need to separate inference and training - the real winners
           | are those who have the training compute. you can always have
           | other companies help with inference
        
             | dagmx wrote:
             | At what point does the on device stuff eat into their
             | market share though? As on device gets better, who will pay
             | for cloud compute? Other than enterprise use.
             | 
             | I'm not saying on device will ever truly compete at
             | quality, but I believe it'll be good enough that most
             | people don't care to pay for cloud services.
        
               | whimsicalism wrote:
                | You're still focused on inference :)
               | 
               | inference basically does not matter, it is a commodity
        
               | dagmx wrote:
                | You're still focused on training :)
               | 
               | training doesn't matter if inference costs are high and
               | people don't pay for them
        
               | whimsicalism wrote:
                | but inference costs _aren't high_ already and there are
               | tons of hardware companies that can do relatively cheap
               | LLM inference
        
               | dagmx wrote:
               | Inference costs per invocation aren't high. Scale it out
               | to billions of users and it's a different story.
               | 
               | Training is amortized over each inference, so the cost of
               | inference also needs to include the cost of training to
               | break even unless made up elsewhere
        
               | rowanG077 wrote:
                | That makes no sense. Inference costs dwarf training costs
                | if you have a successful product pretty quickly. Afaik
               | there is no commodity hardware that can run state of the
               | art models like chatgpt-o1.
        
               | whimsicalism wrote:
               | > Afaik there is no commodity hardware that can run state
               | of the art models like chatgpt-o1.
               | 
               | Stack enough GPUs and any of them can run o1. Building a
               | chip to infer LLMs is _much easier_ than building a
               | training chip.
               | 
               | Just because one cost dwarfs another does not mean that
               | this is where the most marginal value from developing a
               | better chip will be, especially if other people are just
               | doing it for you. Google gets a good model, inference
               | providers will be begging to be able to run it on their
               | platform, or to just sell google their chips - and as I
               | said, inference chips are much easier.
        
               | menaerus wrote:
                | Each GPU costs ~$50k. You need at least 8 of them to run
               | mid-sized models. Then you need a server to plug those
               | GPUs into. That's not commodity hardware.
        
               | whimsicalism wrote:
               | more like ~$16k for 16 3090s. AMD chips can also run
               | these models. The parts are expensive but there is a
               | competitive market in processors that can do LLM
               | inference. Less so in training.
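                | 
                | Back-of-the-envelope (all numbers below are rough
                | assumptions for illustration, not measurements):
                | 
                |   # Aggregate VRAM of a hypothetical 16x RTX 3090 box,
                |   # vs. the weights of an assumed 70B-parameter model
                |   # quantized to ~1 byte per parameter.
                |   gpus, vram_per_gpu_gb = 16, 24
                |   total_vram_gb = gpus * vram_per_gpu_gb   # 384 GB
                |   weights_gb = 70 * 1                      # ~70 GB
                |   print(total_vram_gb, weights_gb)
                |   # Leaves headroom for KV cache / activations at that
                |   # scale; frontier-sized models still need far more.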
        
             | maeil wrote:
             | > i don't think they need to win the on device market.
             | 
             | The second Apple comes out with strong on-device AI - and
             | it very much looks like they will - Google will have to
             | respond on Android. They can't just sit and pray that e.g.
             | Samsung makes a competitive chip for this purpose.
        
               | petra wrote:
                | But given inference-time compute, to give a strong reply
                | reasonably fast you'll need a lot of compute that is very
                | rarely used.
               | 
               | Economically this fits the cloud much better.
        
               | reportingsjr wrote:
               | The Android on chip AI is and has been leagues better
               | than what is available on iOS.
               | 
               | If anything, I think the upcoming iOS AI update will
               | bring them to a similar level as android/google.
        
               | SimianSci wrote:
               | I think Apple is uniquely disadvantaged in the AI race to
                | a point people don't realize. They have less training data
                | to use, having famously been focused on privacy for their
               | users and thus having no particular advantage in this
               | space due to not having customer data to train on. They
               | have little to no cloud business, and while they operate
               | a couple of services for their users, they do not have
               | the infrastructure scale to compete with hyperscaler
               | cloud vendors such as Google and Microsoft. Most of what
               | they would need to spend on training new models would
               | require that they hand over lots of money to the very
               | companies that already have their own models,
               | supercharging their competition.
               | 
                | While there is a chance that Apple might come out with a
                | very sophisticated on-device model, the problem is that
                | they would only be able to compete with other on-device
                | models. The magnitude of compute needed to keep
               | pace with SOA models is not achievable on a single
               | device. It will take many generations of Apple silicon in
               | order to compete with the compute of existing
               | datacenters.
               | 
               | Google also already has competitive silicon in this space
               | with the Tensor series processors, which are being fabbed
               | at Samsung plants today. There is no sitting and praying
               | necessary on their part as they already compete.
               | 
               | Apple is a very distant competitor in the space of AI,
               | and I see no reason to assume this will change, they are
               | uniquely disadvantaged by several of the choices they
               | made on their way to mobile supremacy. The only thing
               | they currently have going for them is the development of
               | their own ARM silicon which may give them the ability to
               | compete with Google's TPU chips, but there is far more
               | needed to be competitive here than the ability to avoid
               | the Nvidia tax.
        
               | whimsicalism wrote:
               | yeah i've never understood the outsized optimism for
               | apple's ai strategy, especially on hn.
               | 
               | they're _a little bit less of a nobody_ than they used to
               | be, but they're basically a nobody when it comes to
                | frontier research/scaling. and the best model matters
               | way more than on-device which can always just be
               | distilled later and find some random startup/chipco to do
               | inference
        
               | msabalau wrote:
               | Theory: Apple's lifestyle branding is quite important to
               | the identity of many in the community here. I mean, look
               | at the buy-in at launch for Apple Vision Pro by so many
               | people on HN--it made actual Apple communities and
               | publications look like jaded skeptics.
        
               | simonw wrote:
               | "having famously been focused on privacy for its users
               | and thus having no particular advantage in this space due
               | to not having customer data to train on"
               | 
               | That may not be as big a disadvantage as you think.
               | 
               | Anthropic claim that they did not use any data from their
               | users when they trained Claude 3.5 Sonnet.
        
               | whimsicalism wrote:
               | sure but they certainly acquired data from mass scraping
               | (including of data produced by their users) and/or data
               | brokering aka paying someone to do the same.
        
             | vineyardmike wrote:
             | I don't think the AI market will ever really be a healthy
             | one until inference vastly outnumbers training. What does
             | it say about AI if training is done more than inference?
             | 
              | I agree that the on-device inference market is not
              | important yet.
        
               | whimsicalism wrote:
               | done more != where the value is at
               | 
               | inference hardware is a commodity in a way that training
               | is not
        
           | mupuff1234 wrote:
            | The majority of people want better performance; running
            | locally is just a nice-to-have feature.
        
             | dagmx wrote:
             | They'll care though when they have to pay for it, or when
             | they're in an area with poor reception.
        
               | mupuff1234 wrote:
               | They pay to run it locally as well (more expensive
               | hardware)
               | 
               | And sure, poor reception will be an issue, but most
               | people would still absolutely take a helpful remote
               | assistant over a dumb local assistant.
               | 
               | And you don't exactly see people complaining that they
               | can't run Google/YouTube/etc locally.
        
               | dagmx wrote:
                | Your first sentence has a fallacy: you're attributing
                | the cost of the whole device to a single feature and
                | weighing that against the cost of that single feature.
               | 
               | Most people are unlikely to buy the device for the AI
               | features alone. It's a value add to the device they'd buy
               | anyway.
               | 
               | So you need the paid for option to be significantly
               | better than the free one that comes with the device.
               | 
               | Your second sentence assumes the local one is dumb. What
               | happens when local ones get better? Again how much better
               | is the cloud one to compete on cost?
               | 
               | To your last sentence, it assumes data fetching from the
               | cloud. Which is valid but a lot of data is local too. Are
               | people really going to pay for what Google search is
               | giving them for free?
        
               | mupuff1234 wrote:
                | I think it's a more likely assumption that on-device
                | performance will trail off-device models by a significant
                | margin for at least the next few years - of course, if
                | you could magically make it work locally with the same
                | level of performance, that would be better.
               | 
               | Plus a lot of the "agentic" stuff is interaction with the
               | outside world, connectivity is a must regardless.
        
               | dagmx wrote:
               | My point is that you do NOT need the same level of
               | performance. You need an adequate level of performance
               | that the cost to get more performance isn't worth it to
               | most people.
        
               | mupuff1234 wrote:
               | And my point is that it's way too early to try to
               | optimize for running locally, if performance really
               | stabilizes and comes to a halt (which may likely happen)
               | then it makes more sense to optimize.
               | 
               | Plus once you start with on device features you start
               | limiting your development speed and flexibility.
        
               | jsight wrote:
               | It isn't really hypothetical. Lots of good models run
               | well on a modern Macbook Pro.
        
               | YetAnotherNick wrote:
                | You can run a model >100x faster in the cloud compared
                | to on-device with DDR RAM. That would make up for the
                | reception.
        
               | dagmx wrote:
               | And you can't run the cloud model at all if you can't
               | talk to the cloud.
        
               | YetAnotherNick wrote:
                | Yes, but I can't imagine situations where I "have" to run
                | a model when I don't have internet at that time. My life
                | would be more affected by losing the rest of the internet
                | than by having to run a small, stupid model locally. At
                | the very least until hallucination is completely solved,
                | I need the internet to verify the models' outputs.
        
               | dagmx wrote:
               | You're assuming the model is purely for generation
               | though. Several of the Gemini features are lookup of
               | things across data available to it. A lot of that data
               | can be local to device.
               | 
               | That is currently Apple's path with Apple Intelligence
               | for example.
        
               | vineyardmike wrote:
               | Poor reception is rapidly becoming a non-issue for most
               | of the developed world. I can't think of the last time I
               | had poor reception (in America) and wasn't on an
               | airplane.
               | 
               | As the global human population increasingly urbanizes,
               | it'll become increasingly easy to blanket it with cell
               | towers. Poor(er) regions of the world will increase
               | reception more slowly, but they're also more likely to
               | have devices that don't support on-device models.
               | 
               | Also, Gemini Flash is basically positioned as a free
               | model, (nearly) free API, free in GUI, free in Search
               | Results, Free in a variety of Google products, etc. No
               | one will be paying for it.
        
               | dagmx wrote:
               | Many major cities have significant dead spots for
               | coverage. It's not just for developing areas.
               | 
                | Flash is free for API use at a low rate limit. Gemini as
                | a whole is not free to Android users (free right now,
                | with subscription costs beyond a time period for advanced
                | features) and isn't free to Google without some monetary
                | incentive. Hence why I originally asked about private
                | cloud compute alternatives from Google.
        
             | griomnib wrote:
             | Latency is a _huge_ factor in performance, and local models
             | often have a huge edge. Especially on mobile devices that
             | could be offline entirely.
        
           | YetAnotherNick wrote:
            | If the model weights are not open, you can't run it on
            | device anyway.
        
             | kridsdale1 wrote:
             | The Pixel 9 runs many small proprietary Gemini models on
             | the internal TPU.
        
               | griomnib wrote:
               | And yet these new models still haven't reached feature
               | parity with Google Assistant, which can turn my
               | flashlight on, but with all the power of burning down a
               | rainforest, Gemini still cannot interact with my actual
               | phone.
        
               | lern_too_spel wrote:
               | I just tried asking my phone to turn on the flashlight
               | using Gemini. It worked.
               | https://9to5google.com/2024/11/07/gemini-utilities-
               | extension...
        
               | griomnib wrote:
               | Ok I tried literally last week on Pixel 7a and it didn't
               | work. What model do you have? Maybe it requires a phone
               | that can do on-device models?
        
               | staticman2 wrote:
               | I just tried it on my Galaxy Ultra s23 and it worked. I
               | then disconnected internet and it did not work.
        
               | YetAnotherNick wrote:
                | Gemini Nano weights were leaked and Google doesn't care
                | about them being leaked. Google would definitely care if
                | Pro weights were leaked.
        
               | onlyrealcuzzo wrote:
               | Is there any phone in the world that can realistically
               | run pro weights?
        
         | JeremyNT wrote:
         | Yeah they've been slow to release end-user facing stuff but
         | it's obvious that they're just grinding away internally.
         | 
         | They've ceded the fast mover advantage, but with a massive
         | installed base of Android devices, a team of experts who
         | basically created the entire field, a huge hardware presence
         | (that THEY own), massive legal expertise, existing content
         | deals, and a suite of vertically integrated services, I feel
         | like the game is theirs to lose at this point.
         | 
         | The only caution is regulation / anti-trust action, but with a
         | Trump administration that seems far less likely.
        
         | VirusNewbie wrote:
         | If you look at where talent is going, it's Anthropic that is
         | the real competitor to Google, not OpenAI.
        
       | echelon wrote:
       | Gemini in search is answering so many of my search questions
       | wrong.
       | 
       | If I ask natural language yes/no questions, Gemini sometimes
       | tells me outright lies with confidence.
       | 
       | It also presents information as authoritative - locations,
       | science facts, corporate ownership, geography - even when it's
       | pure hallucination.
       | 
       | Right at the top of Google search.
       | 
       | edit:
       | 
       | I can't find the most obnoxious offending queries, but here was
       | one I performed today: "how many islands does georgia have?".
       | 
       | Compare that with "how many islands does georgia have? Skidaway
       | Island".
       | 
       | This is an extremely mild case, but I've seen some wildly wrong
       | results, where Google has claimed companies were founded in the
       | wrong states, that towns were located in the wrong states, etc.
        
         | sib301 wrote:
         | This has happened to me zero times. :shrug:
        
         | airstrike wrote:
         | Doesn't match my experience. It also feels like it's getting
         | better over time.
        
         | nilayj wrote:
         | can you provide some example queries that Gemini in search gets
         | wrong?
        
         | nice__two wrote:
          | Gemini 1.5 is indeed a lot of hit-and-miss. Also, the
          | political-correctness and medical-info filtering limits its
          | usefulness a lot, IMHO.
          | 
          | I also find that it's not yet really as context-aware as
          | ChatGPT-4o. Even just asking a follow-up question confuses
          | Gemini 1.5.
         | 
         | Hope Gemini 2.0 will improve that!
        
         | adultSwim wrote:
         | I've found these results quite useful
        
         | jonomacd wrote:
          | At first this was true, but now it has gotten pretty good. The
          | times it gets things wrong are often not the model's fault,
          | just Google Search's fault.
        
       | zb3 wrote:
       | Is this the gemini-exp model on LMArena?
        
         | jasonjmcghee wrote:
         | Both are available on aistudio so I don't think so.
         | 
         | In my own testing "exp 1206" is significantly better than
         | Gemini 2.
         | 
         | Feels like haiku 3.5 vs sonnet 3.5 kind of thing.
        
         | warkdarrior wrote:
         | Yes, LMArena shows Gemini-2.0-Flash-Exp ranking 3rd right now,
         | after Gemini-Exp-1206 and ChatGPT-4o-latest_(2024-11-20), and
         | ahead of o1-preview and o1-mini:
         | 
         | https://lmarena.ai/?leaderboard
        
           | zb3 wrote:
           | There's also the "gremlin" model (not reachable directly) and
           | it seems to be pretty smart.. maybe that's the deep research
           | mode?
           | 
           | EDIT: probably not deep research.. is it Google testing their
           | equivalent of o1? who knows..
        
         | usaar333 wrote:
          | It looks like a slightly upgraded gemini-exp-1121. 1206 is
          | something else.
        
       | crowcroft wrote:
       | Big companies can be slow to pivot, and Google has been famously
       | bad at getting people aligned and driving in one direction.
       | 
        | But, once they do get moving in the right direction they can
       | achieve things that smaller companies can't. Google has an insane
       | amount of talent in this space, and seems to be getting the right
       | results from that now.
       | 
       | Remains to be seen how well they will be able to productize and
        | market, but hard to deny that their LLM models are really,
        | really good though.
        
         | manishsharan wrote:
          | >> hard to deny that their LLM models are really, really
          | good though.
         | 
         | The context window of Gemini 1.5 pro is incredibly large and it
         | retains the memory of things in the middle of the window well.
         | It is quite a game changer for RAG applications.
        
           | KaoruAoiShiho wrote:
           | It looks like long context degraded from 1.5 to 2.0 according
           | to the 2.0 launch benchmarks.
        
           | caeril wrote:
           | Bear in mind that a "1 million token" context window isn't
           | actually that. You're being sold a sparse attention model,
           | which is guaranteed to drop critical context. Google TPUs
           | aren't running inference on a TERABYTE of fp8 query-key
           | inputs, let alone TWO of fp16.
           | 
           | Google's marketing wins again, I guess.
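            | 
            | For a sense of scale, a rough KV-cache calculation under
            | assumed model dimensions (the real architecture isn't
            | public) lands in the same ballpark:
            | 
            |   # KV cache for dense attention at a 1M-token context.
            |   # layers/hidden are illustrative assumptions only.
            |   layers, hidden, seq_len = 80, 8192, 1_000_000
            |   def kv_cache_tb(bytes_per_elem):
            |       # 2x for keys and values, per layer, per token
            |       return 2 * layers * hidden * seq_len * bytes_per_elem / 1e12
            |   print(kv_cache_tb(1))   # fp8:  ~1.3 TB
            |   print(kv_cache_tb(2))   # fp16: ~2.6 TB
            | 
            | Grouped-query and sparse attention cut that dramatically,
            | which is presumably how long contexts are made servable at
            | all.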
        
         | pelorat wrote:
         | Well, compared to github copilot (paid), I think Gemini Free is
         | actually better at writing non-archaic code.
        
           | rafaelmn wrote:
           | Using Claude 3.5 sonnet ?
        
           | jacooper wrote:
           | Gemini is coming to copilot soon anyway.
        
         | talldayo wrote:
         | BERT and Gemma 2B were both some of the highest-performing edge
         | models of their time. Google does really well - in terms of
         | pushing efficiency in the community they're second to none.
         | They also don't need to rely on inordinate amounts of compute
         | because Google's differentiating factor is the products they
         | own and how they integrate it. OpenAI is API-minded, Google is
         | laser-focused on the big-picture experience.
         | 
         | For example; those little AI-generated YouTube summaries that
         | have been rolling out are wonderful. They don't require
         | heavyweight LLMs to generate, and can create pretty effective
         | summaries using nothing but a transcript. It's not only more
         | useful than the other AI "features" I interact with regularly,
         | it doesn't demand AGI or chain-of-thought.
        
           | closewith wrote:
           | > Google is laser-focused on the big-picture experience.
           | 
           | This doesn't match my experience of any Google product.
        
             | talldayo wrote:
             | I disagree - another way you could phrase this is that
             | Google is presbyopic. They're very capable of thinking
             | long-term (eg. Google Deepmind and AI as a whole, cloud,
             | video, Drive/GSuite, etc.), but as a result they struggle
             | to respond to quick market changes. AdSense is the perfect
             | example of Google "going long" on a product and reaping the
             | rewards to monopolistic ends. They can corner a market when
              | they set their sights on it.
             | 
             | I don't think Google (or really any of FAANG) makes "good"
             | products anymore. But I do think there are things to
             | appreciate in each org, and compared to the way Apple and
             | Microsoft are flailing helplessly I think Google has proven
             | themselves in software here.
        
         | panabee wrote:
         | With many research areas converging to comparable levels, the
         | most critical piece is arguably vertical integration and
         | forgoing the Nvidia tax.
         | 
         | They haven't wielded this advantage as powerfully as possible,
         | but changes here could signal how committed they are to slaying
         | the search cash cow.
         | 
         | Nadella deservedly earned acclaim for transitioning Microsoft
         | from the Windows era to cloud and mobile.
         | 
         | It will be far more impressive if Google can defy the odds and
         | conquer the innovator's dilemma with search.
         | 
         | Regardless, congratulations to Google on an amazing release and
         | pushing the frontiers of innovation.
        
           | crowcroft wrote:
           | They need an iPod to iPhone like transition. If they can pull
           | it off it will be incredible for the business.
        
           | bloomingkales wrote:
            | They have to not get blindsided by Sora, while at the same
           | time fighting the cloud war against MS/Amazon.
           | 
           | Weirdly Google is THE AI play. If AI is not set to change
           | everything and truly is a hype cycle, then Google stock
           | withstands and grows. If AI is the real deal, then Google
           | still withstands due to how much bigger the pie will get.
        
         | bushbaba wrote:
          | Yet, Google continues to show it'll deprecate its APIs,
          | services, and functionality to the detriment of your own
          | business. I'm not sure enterprises will trust Google's LLM
          | over the alternatives. Too many have been burned throughout
          | the years, including GCP customers.
         | 
          | The fact that GCP needs to have these pages, and that these
          | lists are not 100% comprehensive, is telling enough.
         | https://cloud.google.com/compute/docs/deprecations
         | https://cloud.google.com/chronicle/docs/deprecations
         | https://developers.google.com/maps/deprecations
         | 
         | Steve Yegge rightfully called this out, and yet no change has
         | been made. https://medium.com/@steve.yegge/dear-google-cloud-
         | your-depre...
        
           | weatherlite wrote:
           | GCP grew 35% last quarter , just saying ...
        
             | Jabbles wrote:
             | "just saying" things that are false.
             | 
             | Google Cloud grew 35% year over year, when comparing the 3
             | months ending September 30th 2024 with 2023.
             | 
             | https://abc.xyz/assets/94/93/52071fba4229a93331939f9bc31c/g
             | o... page 12
        
               | surajrmal wrote:
               | Isn't that the typical interpretation of what the parent
               | comment said? How is it false?
        
         | StableAlkyne wrote:
         | > Remains to be seen how well they will be able to productize
         | and market
         | 
         | The challenge is trust.
         | 
         | Google is one of the leaders in AI and are home to incredibly
         | talented developers. But they also have an incredibly bad track
         | record of supporting their products.
         | 
         | It's hard to justify committing developers and money to a
         | product when there's a good chance you'll just have to pivot
         | again once they get bored. Say what you will about Microsoft,
         | but at least I can rely on their obsession with supporting
         | outdated products.
        
           | egeozcan wrote:
           | > they also have an incredibly bad track record of supporting
           | their products
           | 
            | Incredibly bad track record of supporting products _that
            | don't grow_. I'm not saying this to defend Google; I'm
            | still (perhaps unreasonably) angry because of Reader. It's
            | just that there is a pattern, and AI isn't likely to fit
            | that for a long while.
        
             | RandomThoughts3 wrote:
             | I'm sad for reader but it was a somewhat niche product.
             | Inbox I can't forgive. It was insanely good and was killed
             | because it was a threat to Gmail.
             | 
              | My main issue with Google is that internal politics affect
              | users all the time. See the debacle of anything built on
              | top of Android being treated as a second-class citizen.
             | 
             | You can't trust a company which can't shield users from its
             | internal politics. It means nothing is aligned correctly
             | for users to be taken seriously.
        
             | mannycalavera42 wrote:
             | not going to miss the opportunity to upvote on the grief of
             | having lost Reader
        
             | msabalau wrote:
              | Yeah, either AI is significant, in which case Google isn't
              | going to kill it. Or AI is a bubble, in which case any of
              | the alternatives one might pick can easily crash and die
              | long before Google end-of-lifes anything.
              | 
              | This isn't some minor consumer play, like a random tablet
              | or Stadia. Anyone who has been paying attention would have
              | noticed that AI has been an important, consistent, long-
              | term strategic interest of Google's for a very long time.
              | They've been killing off the failed/minor products to
              | invest in _this_.
        
           | TIPSIO wrote:
           | Yes. Imagine Google banning your entire Google account /
           | Gmail because you violated their gray area AI terms ([1] or
           | [2]). Or, one of your users did via an app you made using an
           | API key and their models.
           | 
           | With that being said, I am extremely bullish on Google AI for
           | a long time. I imagine they land at being the best and
           | cheapest for the foreseeable future.
           | 
           | [1] https://policies.google.com/terms/generative-ai
           | 
           | [2] https://policies.google.com/terms/generative-ai/use-
           | policy
        
             | estebarb wrote:
              | For me that is a reason for not touching anything from
              | Google for building stuff. I can afford losing my Amazon
              | account, but losing my Google one would be too much. At
              | least they should be clear in their terms that getting
              | banned at Cloud doesn't mean getting banned from
              | Gmail/Docs/Photos...
        
               | bippihippi1 wrote:
               | why not just make a business / project account?
        
               | rtsil wrote:
               | That won't help. Their TOS and policies are vague enough
               | that they can terminate all accounts you own (under "Use
               | of multiple accounts for abuse" for instance).
        
               | TIPSIO wrote:
               | To be fair, I believe this is reserved for things like
               | fighting fraud.
        
               | dbdoskey wrote:
                | It has happened a few times that people who had a Google
                | Play app banned would get their personal account banned
                | as well.
               | 
               | https://www.xda-developers.com/google-developer-account-
               | ban-...
        
           | fluoridation wrote:
           | >Say what you will about Microsoft, but at least I can rely
           | on their obsession with supporting outdated products.
           | 
           | Eh... I don't know about that. Their tech graveyard isn't as
           | populous as Google's, but it's hardly empty. A few that come
           | to mind: ATL, MFC, Silverlight, UWP.
        
             | bri3d wrote:
             | Besides Silverlight (which was supported all the way until
             | the end of 2021!), you can still not only run but _write
             | new applications_ using all of the listed technologies.
        
               | fluoridation wrote:
               | That doesn't constitute support when it comes to
               | development platforms. They've not received any updates
                | in years or decades. What they've done is simply not
                | remove the build capability from the toolchains. That
                | is, they haven't even done the work that would be
                | required to no longer support them in any way. Compare
               | that to C#, which has evolved rapidly over the same time
               | period.
        
               | Fidelix wrote:
               | That's different from "killing" the product / technology,
               | which is what Google does.
        
               | fluoridation wrote:
               | Only because they operate different businesses. Google is
               | primarily a service provider. They have few software
               | products that are not designed to integrate with their
               | servers. Many of Microsoft's businesses work
               | fundamentally differently. There's nothing Microsoft
               | could do to Windows to disable all MFC applications and
               | only MFC applications, and if there was it would involve
               | more work than simply not doing anything else with MFC.
        
           | dotancohen wrote:
           | > Google is one of the leaders in AI and are home to
           | incredibly talented developers. But they also have an
           | incredibly bad track record of supporting their products.
           | 
           | This is why we've stayed with Anthropic. Every single person
           | I work with on my current project is sore at Google for
           | discontinuing one product or another - and not a single one
           | of them mentioned Reader.
           | 
           | We do run some non-customer facing assets in Google Cloud.
           | But the website and API are on AWS.
        
           | bastardoperator wrote:
            | Putting your trust in Google is a fool's errand. I don't
            | know anyone that doesn't have a story.
        
         | crazygringo wrote:
         | > _and Google has been famously bad at getting people aligned
         | and driving in one direction._
         | 
         | To be fair, it's not that they're bad at it -- it's that they
         | generally have an explicit philosophy against it. It's a
         | choice.
         | 
         | Google management doesn't want to "pick winners". It prefers to
         | let multiple products (like messaging apps, famously) compete
         | and let the market decide. According to this way of thinking,
         | you come out ahead in the long run because you increase your
         | chances of having the winning product.
         | 
         | Gemini is a great example of when they do choose to focus on a
         | single strategy, however. Cloud was another great example.
        
           | xnx wrote:
           | I definitely agree that multiple competing products is a
           | deliberate choice, but it was foolish to pursue it for so
           | long in a space like messaging apps that has network effects.
           | 
           | As a user I always still wish that there were fewer apps with
           | the best features of both. Google's 2(!) apps for AI podcasts
           | being a recent example : https://notebooklm.google.com/ and
           | https://illuminate.google.com/home
        
           | tbarbugli wrote:
            | Google is not winning on cloud; AWS is winning and MS is
            | gaining ground.
        
             | surajrmal wrote:
             | Parent didn't claim Google is winning. Only that there is a
             | cohesive push and investment in a single product/platform.
        
             | rrdharan wrote:
             | That was 2023; more recently Microsoft is losing ground to
             | Google (in 2024).
        
         | bwb wrote:
         | So far, for my tests, it has performed terribly compared to
         | ChatGPT and Claude. I hope this version is better.
        
         | aerhardt wrote:
         | > seems to be getting the right results
         | 
          | > hard to deny that their LLM models are really, really good
          | though
         | 
         | I'm so scarred by how much their first Gemini releases sucked
         | that the thought of trying it again doesn't even cross my mind.
         | 
         | Are you telling us you're buying this press release wholesale,
         | or you've tried the tech they're talking about and love it, or
         | you have some additional knowledge not immediately evident
         | here? Because it's not clear from your comment where you are
         | getting that their LLM models are really good.
        
           | MaxDPS wrote:
           | I've been using Gemini 1.5 Pro for coding and it's been
           | great.
        
       | jncfhnb wrote:
       | Am I alone in thinking the word "agentic" is dumb as shit?
       | 
       | Most of these things seem to just be a system prompt and a tool
       | that get invoked as part of a pipeline. They're hardly "agents".
       | 
       | They're modules.
        
         | thomassmith65 wrote:
         | It's easier for consultants and sales people to sell to
         | enterprise if the terminology is familiar but mysterious.
         | 
          | Bad
          | 
          |   1. installed Antivirus software
          |   2. added screen-size CSS rules
          |   3. copied 'Assets' harddrive to DropBox
          |   4. edited homepage to include Bitcoin wallet address link
          |   5. upgraded to ChatGPT Pro
          | 
          | "Good"
          | 
          |   1. Cyber-security defenses
          |   2. Responsive Design implementation
          |   3. Cloud Storage
          |   4. Blockchain Technology gateway
          |   5. Agentic enhancements
        
         | xnx wrote:
         | Controlling a browser in Project Mariner seems very agentic:
         | https://youtu.be/Fs0t6SdODd8?t=86
        
         | uludag wrote:
          | Definitely not alone. With all this money at stake, coining
          | dumb terms like this might make you a pretty penny.
         | 
         | It's like a meme that can be milked for monetization.
        
         | Agentus wrote:
          | The beauty of LLMs isn't just that these coding objects speak
          | human vernacular; it's that they can be chained together with
          | human-vernacular prompts, and that output can itself be used
          | sensibly as an input, command, or further output without
          | necessarily causing an error, even when a given combination of
          | inputs wasn't preprogrammed.
          | 
          | I have an A.I. textbook with agent terminology that was
          | written in pre-LLM days. Agents are just autonomous-ish code
          | that loops on itself with some extra functionality. LLMs, in
          | their elegance, can self-loop out of the box just by
          | concatenating language prompts, sensibly. They are almost
          | agent-ready out of the box by that very elegant quality (the
          | textbook's agent diagram is just a conceptual self-perpetuation
          | loop), except...
          | 
          | Except they fail at a lot of things or get stuck on hiccups.
          | But here is a novel thought: what if an LLM becomes more
          | agentic (i.e. more able to sustain autonomous chains of prompts
          | that take actions without a terminal failure) and less
          | copilot-like, not by more complex controlling wrapper code for
          | self-perpetuation, but by training the core LLM itself to
          | function more fluidly in agentic scenarios?
          | 
          | A better agentically-performing LLM that isn't mislabeled with
          | a bad buzzword might reveal itself not in its wrapper control
          | code but in simply performing better in a typical agentic loop
          | or environment, whatever the initiating prompt, control wrapper
          | code, or pipeline that kicks off its self-perpetuation cycle.
        
         | WA wrote:
         | Gemini, too, for the sole reason that non-native speakers have
         | no clue how to pronounce it.
        
           | kaashif wrote:
           | Also, people at NASA pronounce it two ways, even native
           | speakers of English.
        
         | Havoc wrote:
         | >"agentic" is dumb as shit?
         | 
         | It'll create endless consulting opportunities for projects that
         | never go anywhere and add nothing of value unless you value
         | rich consultants.
        
       | brokensegue wrote:
       | Any word on price? I can't find it at
       | https://ai.google.dev/pricing
        
         | Oras wrote:
          | £18/month
         | 
         | https://gemini.google/advanced/?Btc=web&Atc=owned&ztc=gemini...
         | 
         | then sign in with Google account and you'll see it
        
           | brokensegue wrote:
           | Oh but I only care about api pricing
        
             | xnx wrote:
             | I think it is free for 1500 requests/day. See the model
             | dropdown on https://aistudio.google.com/prompts/new_chat
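              | 
              | A minimal sketch of calling it through the API, assuming
              | the google-generativeai Python SDK and the
              | gemini-2.0-flash-exp model id (check the docs for current
              | names and quotas):
              | 
              |   # pip install google-generativeai; key from aistudio.google.com
              |   import google.generativeai as genai
              |   genai.configure(api_key="YOUR_API_KEY")  # placeholder
              |   model = genai.GenerativeModel("gemini-2.0-flash-exp")
              |   resp = model.generate_content("Explain TPUs in one sentence.")
              |   print(resp.text)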
        
         | gman83 wrote:
         | I've been using Gemini Flash for free through the API using
         | Cline for VS Code. I switch between Claude and Gemini Flash,
         | using Claude for more complicated tasks. Hope that the 2.0
         | model comes closer to Claude for coding.
        
           | SV_BubbleTime wrote:
           | Or... just continue using Claude?
        
             | 85392_school wrote:
             | I think they try to conserve costs by only using Claude
             | when needed.
        
             | IAmGraydon wrote:
             | Claude is ridiculously expensive and often subject to rate
             | limiting.
        
         | serjester wrote:
         | Agreed - tried some sample prompts on our data and the rough
         | vibe check is that flash is now as good as the old pro. If they
         | keep pricing the same, this would be really promising.
        
       | airstrike wrote:
       | OT: I'm not entirely sure why, but "agentic" sets my teeth on
       | edge. I don't mind the concept, but the word itself has that
       | hollow, buzzwordy flavor I associate with overblown LinkedIn
       | jargon, particularly as it is not actually in the
       | dictionary...unlike perfectly serviceable entries such as
       | "versatile", "multifaceted" or "autonomous"
        
         | geodel wrote:
          | Huh, all three words you mentioned as replacements are equally
          | buzzwordy, and I see them a lot in CVs while screening
          | candidates for job interviews.
        
           | airstrike wrote:
           | At least all three of them are actually in the dictionary
        
             | hombre_fatal wrote:
             | That's not necessarily a good thing because they are
             | overloaded while novel jargon is specific.
             | 
             | We use new words so often that we take it for granted.
             | You've passively picked up dozens of new words over the
             | last 5 or 10 years without questioning them.
        
           | raincole wrote:
            | Versatile implies it can do more kinds of tasks (than its
            | predecessor or competitor). Agentic implies it requires less
            | human intervention.
            | 
            | I don't think these are necessarily buzzwords _if_ the
            | product really does what they imply.
        
           | lolinder wrote:
           | They agree--they're saying that at least those buzzwords are
           | in the dictionary, not that they'd be a good replacement for
           | "agentic".
        
         | thom wrote:
         | I'm personally very glad that the word has adhered itself to a
         | bunch of AI stuff, because people had started talking about
         | "living more agentically" which I found much more aggravating.
         | Now if anyone states that out loud you immediately picture them
         | walking into doors and misunderstanding simple questions, so it
         | will hopefully die out.
        
         | OutOfHere wrote:
         | To play devil's advocate, the correct use of the word would be
         | when multiple AIs are coordinating and handing off tasks to
         | each other with limited context, such that the handoffs are
         | dynamically decided at runtime by the AI, not by any routine
         | code. I have yet to see a single example where this is
         | required. Most problems can be solved with static workflows and
         | simple rule based code. As such, I do believe that >95% of the
         | usage of the word is marketing nonsense.
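          | 
          | To make the distinction concrete, here's a toy sketch (the
          | call_llm helper and model names are hypothetical stand-ins,
          | not any real API): a static workflow hard-codes the sequence
          | of calls, while the "agentic" version lets the first model's
          | output decide at runtime whether and where to hand off.
          | 
          |   # Toy illustration; call_llm is a hypothetical stand-in.
          |   def call_llm(model: str, prompt: str) -> str:
          |       raise NotImplementedError("plug in a real client")
          | 
          |   def static_workflow(task: str) -> str:
          |       # Routine code fixes the steps ahead of time.
          |       draft = call_llm("writer", task)
          |       return call_llm("reviewer", "Review and fix:\n" + draft)
          | 
          |   def agentic_workflow(task: str) -> str:
          |       # The model itself decides the handoff at runtime.
          |       prompt = ("Task: " + task +
          |                 "\nReply HANDOFF:<name> or ANSWER:<text>")
          |       plan = call_llm("router", prompt)
          |       if plan.startswith("HANDOFF:"):
          |           name = plan.split(":", 1)[1].strip()
          |           return call_llm(name, task)
          |       return plan.removeprefix("ANSWER:")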
        
           | maeil wrote:
            | I actually have built such a tool (two AIs, each with
            | different capabilities), but still cringe at calling it
            | agentic. Might just be an instinctive reflex.
        
           | danpalmer wrote:
           | I think this sort of usage is already happening, but perhaps
           | in the internal details or uninteresting parts, such as
           | content moderation. Most good LLM products are in fact using
           | many LLM calls under the hood, and I would expect that
           | results from one are influencing which others get used.
        
         | m3kw9 wrote:
          | Yeah, I hate it when AI companies throw around words like AGI
          | and agentic capabilities. It's nonsense to most people and
          | ambiguous at best.
        
         | ramoz wrote:
         | Need a general term for autonomous intelligent decision making.
        
           | airstrike wrote:
           | Isn't that just "intelligent"?
        
             | ramoz wrote:
             | We need something to describe a behavioral element in
             | business processes. Something goes into it, something comes
             | out of it - though in this case nondeterminism is involved
             | and it may not be concrete outputs so much as further
             | actioning.
             | 
             | Intelligence is a characteristic.
        
               | airstrike wrote:
               | Volitional, independent, spontaneous, free-willed,
               | sovereign...
        
           | aithrowawaycomm wrote:
           | No, we need a _scientific understanding_ of autonomous
           | intelligent decision-making. The problem with "agentic AI" is
           | the same old "Artificial Intelligence, Natural Stupidity"
           | problem: we have no clue what "reasoning" or "intelligence"
           | or "autonomous" actually means in animals, and trying to
           | apply these terms to AI without understanding them (or
           | inventing a new term without nailing down the underlying
           | concept) is doomed to fail.
        
         | wepple wrote:
          | Versatile is far worse. It's broad to the point of
          | meaninglessness. My garden rake is fairly versatile.
         | 
         | Agentic to me means that it acts somewhat under its own
         | authority rather than a single call to an LLM. It has a small
         | degree of agency.
        
       | geodel wrote:
       | Just searched for _GVP vs SVP_ and got:
       | 
       | "GVP stands for Good Pharmacovigilance Practice, which is a set
       | of guidelines for monitoring the safety of drugs. SVP stands for
       | Senior Vice President, which is a role in a company that focuses
       | on a specific area of operations."
       | 
        | Seems there's a lot of pharma regulation in my telecom company.
        
       | PaulWaldman wrote:
       | Anecdotally, using the Gemini App with "Gemini Advanced 2.0 Flash
       | Experimental", the response quality is ignorantly improved and
       | faster at some basic Python and C# generation.
        
         | xnx wrote:
         | > ignorantly improved
         | 
         | autocorrect of "significantly improved"?
        
       | oldpersonintx wrote:
       | the demos are amazing
       | 
       | I need to rewire my brain for the power of these tools
       | 
       | this plus the quantum stuff...Google is on a win streak
        
       | SubiculumCode wrote:
       | Considering so many of us would like more vRAM than NVIDIA is
       | giving us for home compute, is there any future where these
       | Trillium TPUs become commodity hardware?
        
         | geodel wrote:
          | _So many of us_ is probably in the thousands; it would need to
          | be 3 orders of magnitude higher before Google could even think
          | of it.
        
         | kajecounterhack wrote:
         | Power concerns aside, individual chips in a TPU pod don't
         | actually have a ton of vRAM; they rely on fast interconnects
         | between a lot of chips to aggregate vRAM and then rely on
         | pipeline / tensor parallelism. It doesn't make sense to try to
         | sell the hardware -- it's operationally expensive. By keeping
         | it in house Google only has to support the OS/hardware in their
         | datacenter and they can and do commercialize through hosted
         | services.
         | 
         | Why do you want the hardware vs just using it in the cloud? If
         | you're training huge models you probably don't also keep all
         | your data on prem, but on GCS or S3 right? It'd be more
         | efficient to use training resources close to your data. I guess
          | inference on huge models? Still, isn't just using a hosted API
          | simpler / what everyone is doing now?
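          | 
          | Rough numbers on why aggregation over the interconnect
          | matters (all values are illustrative assumptions, not real
          | TPU or model specs):
          | 
          |   # Sharding a large model's weights across a pod.
          |   params_b = 500            # assumed parameters, in billions
          |   bytes_per_param = 2       # bf16 weights
          |   hbm_per_chip_gb = 32      # assumed on-chip memory
          |   weights_gb = params_b * bytes_per_param          # ~1000 GB
          |   chips = -(-weights_gb // hbm_per_chip_gb)        # ceil -> ~32
          |   print(chips)  # before KV cache, activations, or replicas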
        
       | katamari-damacy wrote:
       | Is it better than GPT4o? Does it have an API?
        
         | jerrygenser wrote:
         | API is accessible via Vertex AI on Google Cloud in preview. I
         | think it's also available in the consumer Gemini Chat.
        
         | kgwgk wrote:
         | https://ai.google.dev/gemini-api/docs/models/gemini-v2
        
       | chrsw wrote:
       | Instead of throwing up tables of benchmarks just let me try to do
       | stuff and see if it's useful.
        
       | topicseed wrote:
       | Is it on AI studio already?
        
         | jonomacd wrote:
         | Yes it is. Including the live features. It is pretty
         | impressive. Basically voice mode with a live video feed as
         | well.
        
       | siliconc0w wrote:
       | What's everyone's favorite LLM leaderboard? Gemini 2 seems to be
        | edging out 4o on chatbot arena (https://lmarena.ai/?leaderboard)
        
         | SV_BubbleTime wrote:
         | AI benchmarks and leaderboards are complete nonsense though.
         | 
         | Find something you like, use it, be ready to look again in a
         | month or two.
        
           | falcor84 wrote:
           | With the accelerating progress, the "be ready to look again"
           | is becoming a full time job that we need to be able to
           | delegate in some way, and I haven't found anything better
           | than benchmarks, leaderboards and reviews.
           | 
           | EDIT: Typo
        
           | siliconc0w wrote:
            | FWIW I've found the 'coding' category of the leaderboard to
            | be reasonably accurate. Claude was the best, then o1-mini was
            | typically stronger, and now Gemini Exp 1206 is at the top.
           | 
           | I find myself just paying a la carte via the API rather than
           | paying the $20/mo so I can switch between the models.
        
           | hombre_fatal wrote:
           | poe.com has a decent model where you buy credits and spend
           | them talking to any LLM which makes it nice to swap between
           | them even during the same conversation instead of paying for
           | multiple subscriptions.
           | 
           | Though gpt-4o could say "David Mayer" on poe.com but not on
           | chat.openai.com which makes me wonder if they sometimes cheat
           | and sneak in different models.
        
         | manishsharan wrote:
         | Leaderboards are not that useful for measuring real-life
          | effectiveness of the models, at least in my day-to-day usage.
         | 
         | I am currently struggling to diagnose an ipv6 mis-configuration
         | in my enormous aws cloudformation yaml code. I gave the same
         | input to Claude Opus, Gemini and ChatGPT ( o1 and 4o).
         | 
          | 4o was the worst: verbose and a waste of my time.
         | 
         | Claude completely went off-tangent and began recommending fixes
         | for ipv4 while I specifically asked for ipv6 issues
         | 
         | o1 made a suggestion which I tried out and it fixed it. It
         | literally found a needle in the haystack. The solution is
         | working well now.
         | 
         | Gemini made a suggestion which almost got it right but it was
         | not a full solution.
         | 
         | I must clarify diagnosing network issues on AWS VPC is not my
         | expertise and I use the LLMs to supplement my knowledge.
        
           | blastbking wrote:
           | Sonnet 3.5 as of today is superior to Opus, curious if sonnet
           | could have solved your problem
        
         | zhyder wrote:
         | I like that https://artificialanalysis.ai/leaderboards/models
         | describes both quality and speed (tokens/s and first chunk s).
         | Not sure how accurate it is; anyone know? Speed and variance of
         | it in particular seems difficult to pin down because providers
         | obviously vary it with load to control their costs.
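          | 
          | For reference, the two speed numbers they report boil down to
          | something like this (a rough sketch; stream_chunks is a
          | hypothetical iterator standing in for any streaming API):
          | 
          |     import time
          | 
          |     def measure_speed(stream_chunks):
          |         start = time.monotonic()
          |         first_chunk_s = None
          |         n_tokens = 0
          |         for chunk in stream_chunks:
          |             if first_chunk_s is None:
          |                 # time to first chunk, in seconds
          |                 first_chunk_s = time.monotonic() - start
          |             n_tokens += len(chunk.split())  # crude token count
          |         total_s = time.monotonic() - start
          |         return first_chunk_s, n_tokens / total_s  # tokens/s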
        
         | IAmGraydon wrote:
         | https://aider.chat/docs/leaderboards/
        
         | lossolo wrote:
         | https://livebench.ai/#/
        
         | danpalmer wrote:
         | Notably, GPT-4o is a "full size" model, whereas Gemini 2 Flash
         | is the small and efficient variant in that family as far as I
         | understand it.
        
       | dangoodmanUT wrote:
       | Jules looks like it's going after Devin
        
         | m3kw9 wrote:
         | Claude MCP does the same thing. It's the setup that is hard. It
          | will do push, pull, and create branch automatically from a single
          | prompt. $500 a month for Devin could be worth it if you want it
          | taken care of, plus the models can be used by a team, but a single
          | person can set it up themselves
        
       | moralestapia wrote:
       | >2,000 words of bs
       | 
       | >General availability will follow in January, along with more
       | model sizes.
       | 
       | >Benchmarks against their own models which always underperformed
       | 
       | >No pricing visible anywhere
       | 
       | Completely inept leadership at play.
        
       | EternalFury wrote:
        | Think of Google as a tanker ship. It takes a while to change
       | course, but it has great momentum. Sundar just needs to make sure
       | the course is right.
        
         | griomnib wrote:
         | And where is the ship headed if they are no longer supporting
         | the open web?
         | 
         | Publishers are being squeezed and going under, or replacing
         | humans with hallucinated genai slop.
         | 
          | It's like we're applying the private equity model - extract the
          | value, kill the thing off - to the entire web.
         | 
         | I'm not sure where this is headed, but I don't think Sundar has
         | any strategy here other than playing catch up.
         | 
         | Demis' goal is pretty transparently positioning himself to take
         | over.
        
         | CSMastermind wrote:
         | That's almost word for word what people said about Windows
         | Phone when I was at Microsoft.
        
           | atorodius wrote:
           | Was the Windows Phone ever at the frontier tho?
        
           | zaptrem wrote:
           | It is a lot easier to switch LLMs than it is to switch
           | smartphone platforms.
        
           | onlyrealcuzzo wrote:
            | But Windows Phone was actually good, like the Zune; it was just
           | late, and it was incredibly popular to hate Microsoft at the
           | time.
           | 
           | Additionally, Microsoft didn't really have any advantage in
           | the smart phone space.
           | 
           | Google is already a product the majority of people on the
           | planet use regularly to answer questions.
           | 
           | That seems like a competitive advantage to me.
        
             | machiaweliczny wrote:
             | Yeah, I liked my windows phone, not sure why they killed it
        
           | rrrrrrrrrrrryan wrote:
           | Windows Phone was actually great though, and would've
           | eventually been a major player in the space if Microsoft were
           | stubborn enough to stick with it long enough, like they did
           | with the Xbox.
           | 
           | By his own admission, Gates was extremely distracted at the
           | time by the antitrust cases in Europe, and he let the
           | initiative die.
        
       | thisoneworks wrote:
       | "gemini for video games" - here we go again with the AI does the
       | interesting stuff for you rather than the boring stuff
        
       | gotaran wrote:
       | Google beat OpenAI at their own game.
        
       | transcriptase wrote:
       | "Hey google turn on kitchen lights"
       | 
       | "Sure, playing don't fear the reaper on bathroom speaker"
       | 
       | Ok
        
       | wonderfuly wrote:
       | Chat now: https://app.chathub.gg/chat/cloud-gemini-2.0-flash
        
       | m3kw9 wrote:
        | Can these guys lead for once? They are always responding to what
       | OpenAI is doing.
        
       | losvedir wrote:
       | This naming is confusing...
       | 
       | Anyway, I'm glad that this Google release is actually available
       | right away! I pay for Gemini Advanced and I see "Gemini Flash
       | 2.0" as an option in the model selector.
       | 
       | I've been going through Advent of Code this year, and testing
       | each problem with each model (GPT-4o, o1, o1 Pro, Claude Sonnet,
       | Opus, Gemini Pro 1.5). Gemini has done decent, but is probably
       | the weakest of the bunch. It failed (unexpectedly to me) on Day
       | 10, but when I tried Flash 2.0 it got it! So at least in that one
       | benchmark, the new Flash 2.0 edged out Pro 1.5.
       | 
       | I look forward to seeing how it handles upcoming problems!
       | 
       | I should say: Gemini Flash didn't _quite_ get it out of the box.
       | It actually had a syntax error in the for loop, which caused it
       | to fail to compile, which is an unusual failure mode for these
        | models. Maybe it was a different version of Java or something (I'm
        | also trying to learn Java with AoC this year...). But when I
       | gave Flash 2.0 the compilation error, it did fix it.
       | 
        | For the more Java proficient, can someone explain why it may have
        | provided this code:
        | 
        |     for (int[] current = queue.remove(0)) {
        | 
        | which was a compilation error for me? The corrected code it gave
        | me afterwards was just
        | 
        |     for (int[] current : queue) {
       | 
       | and with that one change the class ran and gave the right
       | solution.
        
         | ianmcgowan wrote:
         | A tangent, but is there a clear best choice amongst those
         | models for AOC type questions?
        
         | srameshc wrote:
          | I use Claude and Gemini a lot for coding and I realized there
          | is no single best model. Every model has its upsides and
          | downsides. I was trying to get authentication working according
         | to the newer guidelines of Manifest V3 for browser extensions
         | and every model is terrible. It is one use case where there is
         | not much information or right documentation so every model
          | makes up stuff. But this is my experience and I don't speak for
         | everyone.
        
           | huijzer wrote:
            | Relatedly, I'm starting to think more and more that AI is great for
            | mediocre stuff. If you just need to do the 1000th website, it
            | can do that. Do you want to build a new framework? Then there
            | will probably be fewer useful suggestions. (Still not
           | useless though. I do like it a lot for refactoring while
           | building xrcf.)
           | 
            | EDIT: One reason that led me to think it's better for
           | mediocre stuff was seeing the Sora model generate videos. Yes
           | it can create semi-novel stuff through combinations of
           | existing stuff, but it can't stick to a coherent "vision"
           | throughout the video. It's not like a movie by a great
           | director like Tarantino where every detail is right and all
           | details point to the same vision. Instead, Sora is just
           | flailing around. I see the same in software. Sometimes the
           | suggestions go towards one style and the next moment into
            | another. I guess AI currently just has a much shorter effective
            | context length. Tarantino has been refining his style for 30
           | years now. And always he has been tuning his model towards
           | his vision. AI in comparison seems to always just take
           | everything and turn it into one mediocre blob. It's not
           | useless but currently good to keep in mind I think. That you
           | can only use it to generate mediocre stuff.
        
             | meiraleal wrote:
             | We got to the point that AI isn't great because it is not
             | like a Tarantino movie. What a time to be alive.
        
           | monkmartinez wrote:
           | This is true for all newish code bases. You need to provide
           | the context it needs to get the problem right. It has been my
           | experience that one or two examples with new functions or new
           | requirements will suffice for a correction.
        
           | copperx wrote:
           | That's when having a huge context is valuable. Dump all of
           | the new documentation into the model along with your query
           | and the chances of success hugely increase.
        
           | xnx wrote:
           | > I use a Claude and Gemini a lot for coding and I realized
           | there is no good or best model.
           | 
           | True to a point, but is anyone using GPT2 for anything still?
           | Sometimes the better model completely supplants others.
        
         | notamy wrote:
         | > For the more Java proficient, can someone explain why it may
         | have provided this code:
         | 
         | To me that reads like it was trying to accomplish something
          | like
          | 
          |     int[] current;
          |     while ((current = queue.pop()) != null) {
        
         | rybosome wrote:
         | I can't comment on why the model gave you that code, but I can
         | tell you why it was not correct.
         | 
         | `queue.remove(0)` gives you an `int[]`, which is also what you
         | were assigning to `current`. So logically it's a single
         | element, not an iterable. If you had wanted to iterate over
         | each item in the array, it would need to be:
         | 
          |     for (int[] current : queue) {
          |         for (int c : current) {
          |             // ...do stuff...
          |         }
          |     }
         | 
         | Alternatively, if you wanted to iterate over each element in
         | the queue and treat the int array as a single element, the
         | revised solution is the correct one.
        
       | nuz wrote:
       | I guess this means we'll have an openai release soon
        
       | nightski wrote:
       | Anyone else annoyed how the ML/AI community just adopted the word
       | "reasoning" when it seems like it is being used very out of
       | context when looking at what the model actually does?
        
         | ramoz wrote:
         | These models take an instruction, along with any contextual
         | information, and are trained to produce valid output.
         | 
         | That production of output is a form of reasoning via _some_
         | type of logical processing. No?
         | 
         | Maybe better to say computational reasoning. That's a mouthful.
        
           | nightski wrote:
           | Static computation is not reasoning (these models are not
           | building up an argument from premises, they are merely
           | finding statistically likely completions). Computational
           | thinking/reasoning would be breaking down a problem into an
            | algorithmic steps. The model is doing neither. I wouldn't be
            | fooled by the fact that it can break a problem into steps if you
            | ask it to, because again that is just regurgitation. It's not going
           | through that process without your prompt. That is not part of
           | its process to arrive at an answer.
        
             | ramoz wrote:
             | The point is emergent capabilities in LLMs go beyond
             | statistical extrapolation, as they demonstrate reasoning by
             | combining learned patterns.
             | 
             | When asked, "If Alice has 3 apples and gives 2 to Bob, how
             | many does she have left?", the model doesn't just retrieve
             | a memorized answer--it infers the logical steps
             | (subtracting 2 from 3) to generate the correct result,
             | showcasing reasoning built on the interplay of its scale
             | and architecture rather than explicit data recall.
        
             | thelastbender12 wrote:
             | I kinda agree with you but I can also see why it isn't that
             | far from "reasoning" in the sense humans do it.
             | 
             | To wit, if I am doing a high school geometry proof, I come
             | up with a sequence of steps. If the proof is correct, each
             | step follows logically from the one before it.
             | 
             | However, when I go from step 2 to step 3, there are
              | multiple options for step 3 I could have chosen. Is it so
             | different from a "most-likely-prediction" an LLM makes? I
             | suppose the difference is humans can filter out logically-
             | incorrect steps, or prune chains-of-steps that won't lead
             | to the actual theorem quicker. But an LLM predictor coupled
             | with a verifier doesn't feel that different from it.
        
         | w10-1 wrote:
         | Does it help to explore the annoyance using gap analysis? I
         | think of it as heuristics. As with humans, it's the pragmatic
         | "whatever seems to work" where "seems" is determined via
          | training. It's neither reasoning from first principles (system
          | 2) nor just selecting the most likely/prevalent answer (system
          | 1). And chaining heuristics doesn't make it reasoning, either.
         | But where there's evidence that it's working from a model, then
         | it becomes interesting, and begins to comply with classical
         | epistemology wrt "reasoning". Unfortunately, information theory
         | seems to treat any compression as a model leading to some
         | pretty subtle delusions.
        
         | resource_waste wrote:
          | These kinds of simplifications continue to make me an expert in
         | LLM applications.
         | 
          | So... it's a trade secret to know how it actually works...
        
       | summerlight wrote:
       | It is interesting to see that they keep focusing on the cheapest
       | model instead of the frontier model. Probably because of their
       | primary (internal?) customer's need?
        
         | discobot wrote:
          | The problem is that the last generation of the largest models
          | failed to beat smaller models on the benchmarks; see the lack
          | of a new Claude Opus or GPT-5. The problem is probably in the
          | benchmarks, but anyway.
        
         | coder543 wrote:
         | It's cheaper and faster to train a small model, which is better
         | for a research team to iterate on, right? If Google decides
         | that a particular small model is really good, why wouldn't they
         | go ahead and release it while they work on scaling up that work
         | to train the larger versions of the model?
        
           | summerlight wrote:
           | I have no knowledge of Google specific cases, but in many
           | teams smaller models are trained upon bigger frontier models
           | through distillation. So the frontier models come first then
           | smaller models later.
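            | 
            | Roughly, distillation in that sense looks like the sketch
            | below (generic PyTorch, hypothetical teacher/student logits;
            | not any lab's actual recipe):
            | 
            |     import torch.nn.functional as F
            | 
            |     def distillation_loss(student_logits, teacher_logits, t=2.0):
            |         # Soften both distributions with temperature t, then pull
            |         # the student toward the teacher's distribution via KL.
            |         soft_teacher = F.softmax(teacher_logits / t, dim=-1)
            |         log_student = F.log_softmax(student_logits / t, dim=-1)
            |         return F.kl_div(log_student, soft_teacher,
            |                         reduction="batchmean") * (t * t)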
        
             | coder543 wrote:
             | Training a "frontier model" without testing the
             | architecture is very risky.
             | 
             | Meta trained the smaller Llama 3 models first, and then
             | trained the 405B model on the same architecture once it had
             | been validated on the smaller ones. Later, they went back
             | and used that 405B model to improve the smaller models for
             | the Llama 3.1 release. Mistral started with a number of
             | small models before scaling up to larger models.
             | 
             | I feel like this is a fairly common pattern.
             | 
             | If Google had a bigger version of Gemini 2.0 ready to go, I
             | feel confident they would have mentioned it, and it would
             | be difficult to distill it down to a small model if it
             | wasn't ready to go.
        
       | xyst wrote:
       | Side note on Gemini: I pay for Google Workspace simply to enable
       | e-mail capability for a custom domain.
       | 
       | I never used the web interface to access email until recently. To
       | my surprise, all of the AI shit is enabled by default. So it's
       | very likely Gemini has been training on private data without my
       | explicit consent.
       | 
       | Of course G words it as "personalizing" the experience for me but
       | it's such a load of shit. I'm tired of these companies stealing
       | our data and never getting rightly compensated.
        
         | hnuser123456 wrote:
         | Gmail is hosting your email. Being able to change the domain
         | doesn't change that they're hosting it on their terms. I think
         | there are other email providers that have more privacy-focused
         | policies.
        
       | dandiep wrote:
       | Gemini multimodal live docs here:
       | https://cloud.google.com/vertex-ai/generative-ai/docs/model-...
       | 
       | A little thin...
       | 
       | Also no pricing is live yet. OpenAI's audio inputs/outputs are
       | too expensive to really put in production, so hopefully Gemini
       | will be cheaper. (Not to mention, OAI's doesn't follow
       | instructions very well.)
        
         | kwindla wrote:
         | The Multimodal Live API is free while the model/API is in
         | preview. My guess is that they will be pretty aggressive with
         | pricing when it's in GA, given the 1.5 Flash multimodal
         | pricing.
         | 
         | If you're interested in this stuff, here's a full chat app for
         | the new Gemini 2 API's with text, audio, image, camera video
         | and screen video. This shows how to use both the WebSocket API
         | and to route through WebRTC infrastructure.
         | 
         | https://github.com/pipecat-ai/gemini-multimodal-live-demo
        
           | dandiep wrote:
           | Thanks, this is great!
        
       | AJRF wrote:
       | I think they are really overloading that word "Agent". I know
        | there isn't a standard definition - but I think Google is
        | stretching the meaning far thinner than even most C-suite
        | execs do when they talk about agents.
       | 
       | I think DeepMind could make progress if they focused on the agent
       | definition of multi-step reasoning + action through a web
       | browser, and deliver a ton of value, outside of lumping in the
       | seldom used "Look at the world through a camera" or "Multi modal
       | Robots" thing.
       | 
        | Even if Google cracked robots, past plays show that the market for
        | those isn't big enough to interest Google. Like VR, you just
       | can't get a billion people to be interested in robots - so even
       | if they make progress, it won't survive under Google.
       | 
       | The "Look at the world through a camera" thing is a footnote in
       | an Android release.
       | 
       | Agentic computer use _is_ a product a billion people would use,
       | and it's adjacent to the business interests of Google Search.
        
       | jstummbillig wrote:
       | Reminder that implied models are not actual models. Models have
       | failed to materialize repeatedly and vanished without further
       | mention. I assume no one is trying to be misleading but, at this
       | point, maybe overly optimistic.
        
       | nycdatasci wrote:
       | Gemini 2.0 Flash is available here:
       | https://aistudio.google.com/prompts/new_chat
       | 
       | Based on initial interactions, it's extremely verbose. It seems
       | to be focused on explaining its reasoning, but even after just a
       | few interactions I have seen some surprising hallucinations. For
       | example, to assess current understanding of AI, I mentioned "Why
       | hasn't Anthropic released Claude 3.5 Opus yet?" Gemini responded
       | with text that included "Why haven't they released Claude 3.5
       | Sonnet First? That's an interesting point." There's clearly some
       | reflection/attempted reasoning happening, but it doesn't feel
       | competitive with o1 or the new Claude 3.5 Sonnet that was trained
       | on 3.5 Opus output.
        
       | simonw wrote:
       | I released a new llm-gemini plugin with support for the Gemini
        | 2.0 Flash model, here's how to use that in the terminal:
        | 
        |     llm install -U llm-gemini
        |     llm -m gemini-2.0-flash-exp 'prompt goes here'
       | 
       | LLM installation: https://llm.datasette.io/en/stable/setup.html
       | 
       | Worth noting that the Gemini models have the ability to write and
       | then execute Python code. I tried that like this:
        | 
        |     llm -m gemini-2.0-flash-exp -o code_execution 1 \
        |       'write and execute python to generate a 80x40 ascii art fractal'
       | 
       | Here's the result:
       | https://gist.github.com/simonw/0d8225d62e8d87ce843fde471d143...
       | 
       | It can't make outbound network calls though, so this fails:
        | 
        |     llm -m gemini-2.0-flash-exp -o code_execution 1 \
        |       'write python code to retrieve https://simonwillison.net/ and
        |        use a regex to extract the title, run that code'
       | 
       | Amusingly Gemini itself doesn't know that it can't make network
       | calls, so it tries several different approaches before giving up:
       | https://gist.github.com/simonw/2ccfdc68290b5ced24e5e0909563c...
       | 
        | The new model seems very good at vision:
        | 
        |     llm -m gemini-2.0-flash-exp describe -a \
        |       https://static.simonwillison.net/static/2024/pelicans.jpg
       | 
       | I got back a solid description, see here:
       | https://gist.github.com/simonw/32172b6f8bcf8e55e489f10979f8f...
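        | 
        | If you'd rather drive it from Python than the terminal, the llm
        | library exposes roughly the same thing (a quick sketch; assumes
        | the Gemini API key is already configured for the plugin):
        | 
        |     import llm
        | 
        |     model = llm.get_model("gemini-2.0-flash-exp")
        |     response = model.prompt("Describe a pelican in one sentence")
        |     print(response.text())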
        
         | rafram wrote:
         | > Some pelicans have white on their heads, suggesting that some
         | of them are older birds.
         | 
         | Interesting theory!
        
           | smackay wrote:
           | Brown Pelican (Pelecanus occidentalis) heads are white in the
           | breeding season. Birds start breeding aged three to five. So
           | technically the statement is correct but I wonder if Gemini
           | didn't get its pelicans and cormorants in a muddle. The
           | mainland European Great Cormorant (Phalacrocorax carbo
           | sinensis) has a head that gets progressively whiter as birds
           | age.
        
         | pcwelder wrote:
         | Code execution is okay, but soon runs into the problem of
         | missing packages that it can't install.
         | 
         | Practically, sandboxing hasn't been super important for me.
         | Running claude with mcp based shell access has been working
         | fine for me, as long as you instruct it to use venv, temporary
         | directory, etc.
        
           | UltraSane wrote:
           | Is there a guide on how to do that?
        
             | pcwelder wrote:
             | For building mcp server? The official docs do a great job
             | 
             | https://modelcontextprotocol.io/introduction
             | 
             | My own mcp server could be an inspiration on Mac. It's
             | based on pexpect to enable repl session and has some tricks
             | to prevent bad commands.
             | 
             | https://github.com/rusiaaman/wcgw
             | 
             | However, I recommend creating one with your own customised
             | prompts and tools for maximum benefit.
        
             | stavros wrote:
             | I wrote a program that can do more or less the same thing,
             | if you only care about the LLM running commands to help you
             | do something:
             | 
             | https://github.com/skorokithakis/sysaidmin
        
           | mnky9800n wrote:
           | Can it run ipython? Then you could use ipython magic to pip
           | install things:
           | 
           | https://ipython.readthedocs.io/en/stable/interactive/magics..
           | ..
        
         | simonw wrote:
         | Published some more detailed notes on my explorations of Gemini
         | 2.0 here https://simonwillison.net/2024/Dec/11/gemini-2/
        
         | bravura wrote:
         | Question: Have you tried using this for video?
         | 
         | Alternately, if I wanted to pipe a bunch of screencaps into it
         | and get one grand response, how would I do that?
         | 
         | e.g. "Does the user perform a thumbs up gesture in any of these
         | stills?"
         | 
         | [edit: also, do you know the vision pricing? I couldn't find it
         | easily]
        
           | simonw wrote:
            | Previous Gemini models worked really well for video, and this
            | one can even handle streaming video:
           | https://simonwillison.net/2024/Dec/11/gemini-2/#the-
           | streamin...
        
       | melvinmelih wrote:
       | No mention of Perplexity yet in the comments but it's obvious to
       | me that they're targeting Perplexity Pro directly with their new
       | Deep Research feature
       | (https://blog.google/products/gemini/google-gemini-deep-
       | resea...). I still wonder why Perplexity is worth $7 billion when
       | the 800-pound gorilla is pounding on their door (albeit slowly).
        
         | yandie wrote:
         | Just tried the deep search. It's a much much slower experience
          | than Perplexity at the moment - taking many long minutes to
          | return a result. Maybe it's more extensive, but I use Perplexity
          | for quick information summaries a lot and this is a very
         | different UX.
         | 
         | Haven't used it enough to evaluate the quality, however.
        
           | BoorishBears wrote:
           | Before dropping it for a different project that got some
           | traction, "Slow Perplexity" was something I was pretty set on
           | building.
           | 
            | Perplexity is a much less versatile product than it could be,
            | in the chase for speed: you can only chew through so many
           | tokens, do so much CoT, etc. in a given amount of time.
           | 
           | They optimized for virality (it's just as fast as Google but
           | gives me more info!) but I suspect it kills the stickiness
           | for a huge number of users since you end up with some
           | "embarrassing misses": stuff that should have been a slam
           | dunk, goes off the rails due to not enough search, or the
           | wrong context being surfaced from the page, etc. and the user
           | just doesn't see value in it anymore.
        
       | ianbutler wrote:
       | Unfortunately the 10rpm quota for this experimental model isn't
       | enough to run an actual Agentic experience on.
       | 
        | That's my main issue with Google: there are several models we want
        | to try with our agent, but quota is limited and we have to jump
       | through hoops to see if we can get it raised.
        
       | aantix wrote:
       | Their offering is just so... bad. Even the new model. All the
       | data in the world, yet they trail behind.
       | 
       | They have all of these extensions that they use to prop up the
       | results in the web UI.
       | 
       | I was asking for a list of related YouTube videos - the UI
       | returns them.
       | 
       | Ask the API the same prompt, it returns a bunch of made up
       | YouTube titles and descriptions.
       | 
       | How could I ever rely on this product?
        
       | fuddle wrote:
       | I'd be interested to see Gemini 2.0's performance on SWE-Bench.
        
       | mherrmann wrote:
       | Their Mariner tool for controlling the browser sounds scary and
       | exciting. At the moment, it's an extension, which means
       | JavaScript. Some web sites block automation that happens this
       | way, and developers resort to tools such as Selenium. These use
       | the Chrome DevTools API to automate the browser. It's better, but
       | can still be distinguished from normal use with very technical
       | details. I wonder if Google, who still own Chrome, will give
       | extensions better APIs for automation that can not be
       | distinguished from normal use.
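        | 
        | As a small illustration of how detectable current tooling is, a
        | sketch with Selenium (sites can simply read the flag Chrome sets
        | when it is being driven this way):
        | 
        |     from selenium import webdriver
        | 
        |     driver = webdriver.Chrome()
        |     driver.get("https://example.com")
        |     # Chrome reports navigator.webdriver = true under automation,
        |     # so any page can trivially detect it.
        |     print(driver.execute_script("return navigator.webdriver"))
        |     driver.quit()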
        
       | tpoacher wrote:
       | I know this isn't really a useful comment, but, I'm still sour
       | about the name they chose. They MUST have known about the Gemini
       | protocol. I'm tempted to think it was intentional, even.
       | 
       | It's like Microsoft creating an AI tool and calling it Peertube.
       | "Hurr durr they couldn't possibly be confused; one is a
       | decentralised video platform and the other is an AI tool hurr
       | durr. And ours is already more popular if you 'bing' it hurr
       | durr."
        
         | jvolkman wrote:
         | > It's like Microsoft creating an AI tool and calling it
         | Peertube.
         | 
         | How is it like that? Gemini is a much more common word than
         | Peertube. https://en.wikipedia.org/wiki/Gemini
        
       | petesergeant wrote:
       | Speed looks good vis-a-vis 4o-mini, and quality looks good so far
       | against my eval set. If it's cheaper than 4o-mini too (which, it
       | probably will be?) then OpenAI have a real problem, because
       | switching between them is a value in a config file.
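        | 
        | To illustrate: with the OpenAI-compatible endpoints, the switch
        | really is just a couple of strings (a sketch; the base URL and
        | model name here are my assumptions, check the current docs):
        | 
        |     from openai import OpenAI
        | 
        |     # Swap these two values (plus the API key) to change vendors.
        |     BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"
        |     MODEL = "gemini-2.0-flash-exp"
        | 
        |     client = OpenAI(api_key="...", base_url=BASE_URL)
        |     resp = client.chat.completions.create(
        |         model=MODEL,
        |         messages=[{"role": "user", "content": "Say hi"}],
        |     )
        |     print(resp.choices[0].message.content)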
        
       | smallerfish wrote:
       | Was this written by an LLM? It's pretty bad copy. Maybe they laid
       | off their copywriting team...?
       | 
       | > "Now millions of developers are building with Gemini. And it's
       | helping us reimagine all of our products -- including all 7 of
       | them with 2 billion users -- and to create new ones"
       | 
       | and
       | 
       | > "We're getting 2.0 into the hands of developers and trusted
       | testers today. And we're working quickly to get it into our
       | products, leading with Gemini and Search. Starting today our
       | Gemini 2.0 Flash experimental model will be available to all
       | Gemini users."
        
         | utopcell wrote:
         | Sorry, what's wrong with these phrases?
        
           | echelon wrote:
           | It reads like a transcribed speech. You can picture this
           | being read from a teleprompter at a conference keynote.
           | 
           | Short sentence fact. And aspirational tagline - pause for
           | some metrics - and more. And. Today. And. And. Today.
        
           | krona wrote:
           | > all of our products -- including all 7 of them
           | 
           | All the products including all the products?
        
             | iamdelirium wrote:
             | Why did you specifically ignore the remainder of the
             | sentence?
             | 
             | "...all of our products -- including all 7 of them with 2
             | billion users..."
             | 
             | It tells people that 7 of their products have 2b users.
        
               | fluoridation wrote:
               | That's not really any better, since "all of our products"
               | already includes the subset that has at least 2B users.
               | "I brought all my shoes, including all my red shoes."
        
               | stavros wrote:
               | They're pointing out that seven of their products have
               | more than 2 billion users.
               | 
               | "I brought all my shoes, including the pairs that cost
               | over $10,000" is saying something about what shoes you
               | brought, more than "all of them".
        
               | fluoridation wrote:
               | Why are they bragging about something completely
               | unrelated in the middle of a sentence about the impact of
               | a piece of technology?
               | 
               | -Hey, are you done packing?
               | 
               | -Yes, I decided I'll bring all my shoes, including the
               | ones that cost over $10,000.
               | 
               | What, they just couldn't help themselves?
        
               | stavros wrote:
               | The fact that they're using Gemini with even their most
               | important products shows that they trust it.
        
               | fluoridation wrote:
               | Again, that's covered by "all our products". Why do we
               | need to be reminded that Google has a lot of users?
               | Someone oblivious to that isn't going to care about this
               | press release.
        
               | aerhardt wrote:
               | That phrasing still sucks, I am neither a native speaker
               | nor a wordsmith but I've worked with professional English
               | writers who could make that look and sound infinitely
               | better.
        
               | jay_kyburz wrote:
               | all of our products, 7 of which have over 2 billion
               | users..
        
             | hombre_fatal wrote:
             | The meme of LLM generated content is that it's verbose and
             | formal, not that it's poorly written.
             | 
             | It's why the quoted text is obviously written by a human.
        
               | contagiousflow wrote:
                | There's no law that says LLM generated text has to be bad in
               | a singular way
        
           | scudsworth wrote:
           | executive spotted
        
       | ryandvm wrote:
       | I am sure Google has the resources to compete in this space. What
       | I'm less sure about is whether Google can monetize AI in a way
       | that doesn't cannibalize their advertising income.
       | 
       | Who the hell wants an AI that has the personality of a car
       | salesman?
        
       | ipsum2 wrote:
       | Tested out Gemini-2 Flash, I had such high hopes that a better
       | base model would help. It still hallucinates like crazy compared
       | to GPT-4o.
        
         | gbickford wrote:
         | Small models don't "know" as much so they hallucinate more.
         | They are better suited for generations that are based in a
         | ground truth, like in a RAG setup.
         | 
         | A better comparison might be Flash 2.0 vs 4o-mini. Even then,
         | the models aren't meant to have vast world knowledge, so
         | benchmarking them on that isn't a great indicator of how they
         | would be used in real-world cases.
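          | 
          | For what it's worth, "grounded" here just means stuffing the
          | retrieved text into the prompt before asking, roughly like this
          | (a sketch with hypothetical search() and ask_llm() helpers):
          | 
          |     def answer_with_rag(question, search, ask_llm, k=5):
          |         # search() returns k relevant text snippets; ask_llm()
          |         # is any chat completion call. Both are stand-ins here.
          |         context = "\n\n".join(search(question, k=k))
          |         prompt = ("Answer using only the context below.\n\n"
          |                   f"Context:\n{context}\n\nQuestion: {question}")
          |         return ask_llm(prompt)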
        
           | ipsum2 wrote:
           | Yes, it's not an apples to apples comparison. My point is the
           | position it's at on the lmarena leaderboard is misplaced due
           | to the hallucination issues.
        
       | CSMastermind wrote:
       | > We're also launching a new feature called Deep Research, which
       | uses advanced reasoning and long context capabilities to act as a
       | research assistant, exploring complex topics and compiling
       | reports on your behalf. It's available in Gemini Advanced today.
       | 
       | Anyone seeing this? I don't have an option in my dropdown.
        
         | atorodius wrote:
          | Rolling out over the next few days, according to Jeff
        
         | fudged71 wrote:
         | Not seeing it yet on web or mobile (in Canada)
        
       | echohack5 wrote:
       | Is this what it feels like to become one of the gray bearded
       | engineers? This sounds like a bunch of intentionally confusing
       | marketing drivel.
       | 
       | When capitalism has pilfered everything from the pockets of
       | working people so people are constantly stressed over healthcare
        | and groceries, and there's little left to further line the pockets
        | of plutocrats, the only marketing that makes sense is to appeal to
        | other companies in order to raid their coffers by tricking their
        | Directors into buying a nonsensical product.
       | 
       | Is that what they mean by "agentic era"? Cause that's what it
        | sounds like to me. Also smells a lot like press-release-driven
       | development where the point is to put a feather in the cap of
       | whatever poor Google engineer is chasing their next promotion.
        
         | weatherlite wrote:
         | > Is that what they mean by "agentic era"? Cause that's what it
         | sounds like to me.
         | 
         | What are you basing your opinion on? I have no idea how well
          | these LLM agents will perform, but they're definitely a thing.
          | OpenAI is working on them, as are Anthropic and certainly Google.
        
         | cush wrote:
         | Yeah it's a lot of marketing fluff but these tools are
          | genuinely useful, and it's no wonder Google is working
         | hard to prevent them from destroying their search-dependent
         | bottom line.
         | 
         | Marketing aside, agents are just LLMs that can reach out of
         | their regular chat bubbles and use tools. Seems like just the
         | next logical evolution
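          | 
          | Concretely, the "reach out and use tools" part is usually just a
          | loop like this (a sketch with a hypothetical call_llm() and a
          | dict of plain Python functions as tools):
          | 
          |     def run_agent(task, call_llm, tools, max_steps=5):
          |         history = [f"Task: {task}"]
          |         for _ in range(max_steps):
          |             reply = call_llm("\n".join(history))
          |             if reply.startswith("FINAL:"):   # model says it's done
          |                 return reply[len("FINAL:"):].strip()
          |             if reply.startswith("CALL "):    # e.g. "CALL search: query"
          |                 name, _, arg = reply[len("CALL "):].partition(":")
          |                 fn = tools.get(name.strip(), lambda a: "unknown tool")
          |                 history.append(f"{reply}\nResult: {fn(arg.strip())}")
          |             else:
          |                 history.append(reply)
          |         return "no final answer within max_steps"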
        
       | serjester wrote:
       | Buried in the announcement is the real gem -- they're releasing a
       | new SDK that actually looks like it follows modern best
       | practices. Could be a game-changer for usability.
       | 
       | They've had OpenAI-compatible endpoints for a while, but it's
       | never been clear how serious they were about supporting them
       | long-term. Nice to see another option showing up. For reference,
       | their main repo (not kidding) recommends setting up a Kubernetes
       | cluster and a GCP bucket to submit batch requests.
       | 
       | [1]https://github.com/googleapis/python-genai
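        | 
        | A minimal sketch of what usage looks like with the new SDK
        | (following the client pattern in that repo; exact details may
        | shift while it's in preview):
        | 
        |     from google import genai
        | 
        |     client = genai.Client(api_key="...")
        |     response = client.models.generate_content(
        |         model="gemini-2.0-flash-exp",
        |         contents="Explain agentic workflows in one sentence.",
        |     )
        |     print(response.text)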
        
         | pkkkzip wrote:
          | It's interesting that just as the LLM hype appears to be
         | simmering down, DeepMind is making big strides. I'm more
         | excited by this than any of OpenAI's announcements.
        
       | ComputerGuru wrote:
       | I've been using gemini-exp-1206 and I notice a lot of
       | similarities to the new gemini-2.0-flash-exp: they're not that
       | much _actually smarter_ but they go out of their way to convince
       | you they are with overly verbose  "reasoning" and explanations.
       | The reasoning and explanations aren't necessarily wrong per se,
       | but put them aside and focus on the actual logical reasoning
       | steps and conclusions to your prompts and it's still very much a
       | dumb model.
       | 
       | The models do just fine on "work" but are terrible for
       | "thinking". The verbosity of the explanations (and the sheer
       | amount of praise the models like to give the prompter - I've
       | never had my rear end kissed so much!) should lead one to beware
       | any subjective reviews of their performance rather than objective
       | reviews focusing solely on correct/incorrect.
        
       | epolanski wrote:
       | I'm not gonna lie I like Google's models.
       | 
       | Flash combines speed and cost and is extremely good to build apps
       | on.
       | 
       | People really take that whole benchmarking thing more seriously
       | than necessary.
        
       | Animats wrote:
       | _" Over the last year, we have been investing in developing more
       | agentic models, meaning they can understand more about the world
       | around you, think multiple steps ahead, and take action on your
       | behalf, with your supervision."_
       | 
       | "With your supervision". Thus avoiding Google being held
        | responsible. That's like Tesla's Fake Self Driving, where the user
       | must have their hands on the wheel at all times.
        
       | sorenjan wrote:
       | Published the day after one of the authors, Demis Hassabis,
       | received his Nobel prize in Stockholm.
        
       | beepbooptheory wrote:
       | We are moving through eras faster than years these days.
        
       | strongpigeon wrote:
       | Did anyone get to play with the native image generation part? In
       | my experience, Imagen 3 was much better than the competition so
       | I'm curious to hear people's take on this one.
        
         | strongpigeon wrote:
         | Hrm, when I tried to get it to generate an image it said it was
         | using Imagen 3. Not sure what "native" image generation means
         | then.
        
       | fpgaminer wrote:
       | I work with LLMs and MLLMs all day (as part of my work on
       | JoyCaption, an open source VLM). Specifically, I spend a lot of
       | time interacting with multiple models at the same time, so I get
       | the chance to very frequently compare models head-to-head on real
       | tasks.
       | 
       | I'll give Flash 2 a try soon, but I gotta say that Google has
       | been doing a great job catching up with Gemini. Both Gemini 1.5
       | Pro 002 and Flash 1.5 can trade blows with 4o, and are easily
       | ahead of the vast majority of other major models (Mistral Large,
       | Qwen, Llama, etc). Claude is usually better, but has a major flaw
       | (to be discussed later).
       | 
       | So, here's my current rankings. I base my rankings on my work,
       | not on benchmarks. I think benchmarks are important and they'll
       | get better in time, but most benchmarks for LLMs and MLLMs are
       | quite bad.
       | 
       | 1) 4o and its ilk are far and away the best in terms of accuracy,
       | both for textual tasks as well as vision related tasks.
       | Absolutely nothing comes even close to 4o for vision related
       | tasks. The biggest failing of 4o is that it has the worst
       | instruction following of commercial LLMs, and that instruction
       | following gets _even_ worse when an image is involved. A prime
       | example is when I ask 4o to help edit some text, to change
        | certain words, verbiage, etc. No matter how I prompt it, it will
       | often completely re-write the input text to its own style of
       | speaking. It's a really weird failing. It's like their RLHF
       | tuning is hyper focused on keeping it aligned with the
       | "character" of 4o to the point that it injects that character
       | into all its outputs no matter what the user or system
       | instructions state. o1 is a MASSIVE improvement in this regard,
       | and is also really good at inferring things so I don't have to
       | explicitly instruct it on every little detail. I haven't found
       | o1-pro overly useful yet. o1 is basically my daily driver outside
       | of work, even for mundane questions, because it's just better
       | across the board and the speed penalty is negligible. One
        | particular example of o1 being better I encountered yesterday.
       | I had it re-wording an image description, and thought it had
       | introduced a detail that wasn't in the original description.
       | Well, I was wrong and had accidentally skimmed over that detail
       | in the original. It _told_ me I was wrong, and didn't update the
       | description! Freaky, but really incredible. 4o never corrects me
       | when I give it an explicit instruction.
       | 
       | 4o is fairly easy to jailbreak. They've been turning the screws
       | for awhile so it isn't as easy as day 1, but even o1-pro can be
       | jailbroken.
       | 
       | 2) Gemini 1.5 Pro 002 (specifically 002) is second best in my
       | books. I'd guesstimate it at being about 80% as good as 4o on
       | most tasks, including vision. But it's _significantly_ better at
       | instruction following. Its RLHF is a lot lighter than ChatGPT
       | models, so it's easier to get these models to fall back to
       | pretraining, which is really helpful for my work specifically.
       | But in general the Gemini models have come a long way. The
       | ability to turn off model censorship is quite nice, though it
       | does still refuse at times. The Flash variation is interesting;
       | often times on-par with Pro with Pro edging out maybe 30% of the
       | time. I don't frequently use Flash, but it's an impressive model
       | for its size. (Side note: The Gemma models are ... not good.
       | Google's other public models, like so400m and OWLv2 are great, so
        | it's a shame their open LLM forays are falling behind). Google
       | also has the best AI playground.
       | 
       | Jailbreaking Gemini is a piece of cake.
       | 
       | 3) Claude is third on my list. It has the _best_ instruction
       | following of all the models, even slightly better than o1. Though
       | it often requires multi-turn to get it to fully follow
       | instructions, which is annoying. Its overall prowess as an LLM is
       | somewhere between 4o and Gemini. Vision is about the same as
       | Gemini, except for knowledge based queries which Gemini tends to
       | be quite bad at (who is this person? Where is this? What brand of
       | guitar? etc). But Claude's biggest flaw is the insane "safety"
       | training it underwent, which makes it practically useless. I get
       | false triggers _all_ the time from Claude. And that's to say
       | nothing of how unethical their "ethics" system is to begin with.
       | And what's funny is that Claude is an order of magnitude
        | _smarter_ when it's reasoning about its safety training. It's the
       | only real semblance of reason I've seen from LLMs ... all just to
       | deny my requests.
       | 
        | I've put Claude third out of respect for the _technical_
        | achievements of the product, but I think the developers need to
        | take a long look in the mirror and ask why they think it's okay
        | for _them_ to decide what people with disabilities are and are
        | not allowed to have access to.
       | 
       | 4) Llama 3. What a solid model. It's the best open LLM, hands
       | down. Nowhere near the commercial models above, but for a model
       | that's completely free to use locally? That's invaluable. Their
       | vision variation is ... not worth using. But I think it'll get
       | better with time. The 8B variation far outperforms its weight
       | class. 70B is a respectable model, with better instruction
       | following than 4o. The ability to finetune these models to a task
       | with so little data is a huge plus. I've made task specific
       | models with 200-400 examples.
       | 
       | 5) Mistral Large (I forget the specific version for their latest
       | release). I love Mistral as the "under-dog". Their models aren't
       | bad, and behave _very_ differently from all other models out
       | there, which I appreciate. But Mistral never puts any effort into
       | polishing their models; they always come out of the oven half-
       | baked. Which means they frequently glitch out, have very
       | inconsistent behavior, etc. Accuracy and quality is hard to
       | assess because of this inconsistency. On its best days it's up
       | near Gemini, which is quite incredible considering the models are
       | also released publicly. So theoretically you could finetune them
       | to your task and get a commercial grade model to run locally. But
       | rarely see anyone do that with Mistral, I think partly because of
       | their weird license. Overall, I like seeing them in the race and
       | hope they get better, but I wouldn't use it for anything serious.
       | 
       | Mistral is lightly censored, but fairly easy to jailbreak.
       | 
       | 6) Qwen 2 (or 2.5 or whatever the current version is these days).
       | It's an okay model. I've heard a lot of praises for it, but in
        | all my uses thus far it's always been really inconsistent,
       | glitchy, and weak. I've used it both locally and through APIs. I
       | guess in _theory_ it's a good model, based on benchmarks. And
       | it's open, which I appreciate. But I've not found any practical
       | use for it. I even tried finetuning with Qwen 2VL 72B, and my
       | tiny 8B JoyCaption model beat it handily.
       | 
       | That's about the sum of it. AFAIK that's all the major commercial
       | and open models (my focus is mainly on MLLMs). OpenAI are still
       | leading the pack in my experience. I'm glad to see good
       | competition coming from Google finally. I hope Mistral can polish
       | their models and be a real contender.
       | 
       | There are a couple smaller contenders out there like Pixmo/etc
       | from allenai. Allen AI has hands down the _best_ public VQA
       | dataset I've seen, so huge props to them there. Pixmo is ...
       | okayish. I tried Amazon's models a little but didn't see anything
       | useful.
       | 
       | NOTE: I refuse to use Grok models for the obvious reasons, so
       | fucks to be them.
        
         | kernal wrote:
         | >Grok models
         | 
         | Thanks for the admission as it invalidates pretty much
         | everything you've said, and will say.
        
       | nopcode wrote:
       | > What can you tell me about this sculpture?
       | 
       | > It's located in London.
       | 
       | Mind blowing.
        
       | jacooper wrote:
       | The best thing about gemini models is the huge context windows,
       | you can just throw big documents and find stuff real fast, rather
        | than struggling with cutoffs in Perplexity or Claude.
        
       | bluelightning2k wrote:
       | Does anyone know how to sign up to the speech output wait list or
        | tester program? I have a decent spend with GCP over the years if
       | that helps at all. Really want DemoTime videos to use those
       | voices. (I like how truly incredible best in the world tts is
       | like a footnote in this larger announcement.)
        
       | jgalt212 wrote:
       | Can cloudflare turnstile (and others) detect these agents as
       | bots?
        
       | jamesponddotco wrote:
       | At least when it comes to Go code, I'm pretty impressed by the
       | results so far. It's also pretty good at following directions,
       | which is a problem I have with open source models, and seems to
       | use or handle Claude's XML output very well.
       | 
       | Overall, especially seeing as I haven't paid a dime to use the
       | API yet, I'm pretty impressed.
        
       | mfonda wrote:
       | Does anyone have any insights into how Google selects source
       | material for AI overviews? I run an educational site with lots of
       | excellent information, but it seems to have been passed over
       | entirely for AI overviews. With these becoming an increasingly
       | large part of search--and from the sound of it, now more so with
       | Gemini 2.0--this has me a little worried.
       | 
       | Anyone else run into similar issues or have any tips?
        
       | nuancebydefault wrote:
       | I'm quite impressed with the flash demo's reasoning capabilities.
        | Did the 20 questions game with it; it found the computer mouse I
       | had in my head. At first it was confused about our roles and said
       | something weird, it thought that it had to guess its own word.
       | Afterwards I had a meta conversation about that weirdness and it
       | gave impressive insights:
       | 
       | " Why My "I Guess What I Have in Mind" Statement Doesn't Make
       | Sense                   The Problem of Agency: You're spot-on
       | when you say it doesn't make sense for me to state "I guess what
       | I (the computer) have in mind." AI does not have a "mind" in the
       | same way that humans do. My internal state is a complex pattern
       | of numerical data, not conscious thoughts or intentions in the
       | way you have things in mind. My phrasing gave the illusion of
       | having a "thing in mind," but in reality, I don't have this
       | capacity.              Misinterpretation of the Task: My
       | statement "I guess what I have in mind" misinterprets the task.
       | You were clearly signaling that I needed to ask questions to
       | uncover what you were thinking of. My failure to grasp this
       | created a weird scenario where it seemed like I was trying to
       | determine my own data set!"
        
       | paradite wrote:
       | Honestly this post makes Google sound like the new IBM. Very
       | corporate.
       | 
       | "Hear from our CEO first, and then our other CEO in charge of
       | this domain and CTO will tell you the actual news."
       | 
       | I haven't seen other tech companies write like that.
        
       ___________________________________________________________________
       (page generated 2024-12-11 23:00 UTC)