[HN Gopher] Gemini 2.0: our new AI model for the agentic era
___________________________________________________________________
Gemini 2.0: our new AI model for the agentic era
Author : meetpateltech
Score : 564 points
Date : 2024-12-11 15:33 UTC (7 hours ago)
(HTM) web link (blog.google)
(TXT) w3m dump (blog.google)
| og_kalu wrote:
| The Gemini 2 models support native audio and image generation but
| the latter won't be generally available till January. Really
| excited for that as well as 4o's image generation (whenever that
| comes out). Steerability has lagged behind aesthetics in image
  | generation for a while now and it'd be great to see a big advance
| in that.
|
| Also a whole lot of computer vision tasks (via LLMs) could be
| unlocked with this. Think Inpainting, Style Transfer, Text
| Editing in the wild, Segmentation, Edge detection etc
|
| They have a demo: https://www.youtube.com/watch?v=7RqFLp0TqV0
| jncfhnb wrote:
| These are not computer vision tasks...
| Jabrov wrote:
| What are they, then...?
| 85392_school wrote:
| The first two are tasks which involve _making_ images. They
| could be called image generation or image editing.
| newfocogi wrote:
| Maybe some of these tasks are arguably not aligned with the
| traditional applications of CV, but Segmentation and Edge
| detection are definitely computer vision in every definition
| I've come across - before and after NNs took over.
| jerpint wrote:
| We're definitely going to need better benchmarks for agentic
| tasks, and not just code reasoning. Things that are needlessly
| painful that humans go through all the time
| AuthConnectFail wrote:
    | it's insane on lmarena for its size; livebench should have it
    | soon too, I guess
| maeil wrote:
| The size isn't stated, not necessarily a given that it's as
| small as 1.5-Flash.
| bradhilton wrote:
| Beats Gemini 1.5 Pro at all but two of the listed benchmarks.
| Google DeepMind is starting to get their bearings in the LLM era.
| These are the minds behind AlphaGo/Zero/Fold. They control their
| own hardware destiny with TPUs. Bullish.
| p1esk wrote:
| Are these benchmarks still meaningful?
| maeil wrote:
| No, and they haven't been for at least half a year. Utterly
| optimized for by the providers. Nowadays if a model would be
| SotA for general use but not #1 on any of these benchmarks, I
| doubt they'd even release it.
| CamperBob2 wrote:
| I've started keeping an eye out for original brainteasers,
| just for that reason. GCHQ's Christmas puzzle just came out
| [1], and o1-pro got 6 out of 7 of them right. It took about
| 20 minutes in total.
|
| I wasn't going to bother trying those because I was pretty
| sure it wouldn't get _any_ of them, but decided to give it an
| easy one (#4) and was impressed at the CoT.
|
| Meanwhile, Google's newest 2.0 Flash model went 0 for 7.
|
| 1: https://metro.co.uk/2024/12/11/gchq-christmas-
| puzzle-2024-re...
| nrvn wrote:
| Did it get the 8 right? The linked article provides the
| wrong answer btw.
| p1esk wrote:
| Wow! That's all I need to know about Google's model.
| danpalmer wrote:
| That's a comparison of multiple GPT-4 models working
| together... against a single GPT-4 mini style model.
| Workaccount2 wrote:
| What is impressive about this new model is that it is the
| lightweight version (flash).
|
| There will probably be a 2.0 pro (which will be 4o/sonnet
| class) and maybe an ultra (o1(?)/Opus).
| iamdelirium wrote:
| Why are you comparing flash vs o1-pro, wouldn't a more fair
| comparison be flash vs mini?
| iamdelirium wrote:
            | I just asked o1-mini the first two questions and it got
            | them wrong.
| dagmx wrote:
    | Regarding TPUs, sure, for the stuff that's running on the
| cloud.
|
| However their on device TPUs lag behind the competition and
| Google still seem to struggle to move significant parts of
| Gemini to run on device as a result.
|
| Of course, Gemini is provided as a subscription service as well
| so perhaps they're not incentivized to move things locally.
|
| I am curious if they'll introduce something like Apple's
| private cloud compute.
| whimsicalism wrote:
| i don't think they need to win the on device market.
|
| we need to separate inference and training - the real winners
| are those who have the training compute. you can always have
| other companies help with inference
| dagmx wrote:
| At what point does the on device stuff eat into their
| market share though? As on device gets better, who will pay
| for cloud compute? Other than enterprise use.
|
| I'm not saying on device will ever truly compete at
| quality, but I believe it'll be good enough that most
| people don't care to pay for cloud services.
| whimsicalism wrote:
          | You're still focused on inference :)
|
| inference basically does not matter, it is a commodity
| dagmx wrote:
            | You're still focused on training :)
|
| training doesn't matter if inference costs are high and
| people don't pay for them
| whimsicalism wrote:
            | but inference costs _aren't high_ already and there are
| tons of hardware companies that can do relatively cheap
| LLM inference
| dagmx wrote:
| Inference costs per invocation aren't high. Scale it out
| to billions of users and it's a different story.
|
| Training is amortized over each inference, so the cost of
| inference also needs to include the cost of training to
| break even unless made up elsewhere
| rowanG077 wrote:
            | That makes no sense. Inference costs dwarf training costs
            | pretty quickly if you have a successful product. Afaik
| there is no commodity hardware that can run state of the
| art models like chatgpt-o1.
| whimsicalism wrote:
| > Afaik there is no commodity hardware that can run state
| of the art models like chatgpt-o1.
|
| Stack enough GPUs and any of them can run o1. Building a
| chip to infer LLMs is _much easier_ than building a
| training chip.
|
| Just because one cost dwarfs another does not mean that
| this is where the most marginal value from developing a
| better chip will be, especially if other people are just
| doing it for you. Google gets a good model, inference
| providers will be begging to be able to run it on their
| platform, or to just sell google their chips - and as I
| said, inference chips are much easier.
| menaerus wrote:
| Each GPU costs ~50k. You need at least 8 of them to run
| mid-sized models. Then you need a server to plug those
| GPUs into. That's not commodity hardware.
| whimsicalism wrote:
| more like ~$16k for 16 3090s. AMD chips can also run
| these models. The parts are expensive but there is a
| competitive market in processors that can do LLM
| inference. Less so in training.
| maeil wrote:
| > i don't think they need to win the on device market.
|
| The second Apple comes out with strong on-device AI - and
| it very much looks like they will - Google will have to
| respond on Android. They can't just sit and pray that e.g.
| Samsung makes a competitive chip for this purpose.
| petra wrote:
          | But given inference-time compute, to give a strong reply
          | reasonably fast you'll need a lot of compute that is very
          | rarely used.
|
| Economically this fits the cloud much better.
| reportingsjr wrote:
| The Android on chip AI is and has been leagues better
| than what is available on iOS.
|
| If anything, I think the upcoming iOS AI update will
| bring them to a similar level as android/google.
| SimianSci wrote:
| I think Apple is uniquely disadvantaged in the AI race to
          | a point people don't realize. They have less training data
| to use, having famously been focused on privacy for its
| users and thus having no particular advantage in this
| space due to not having customer data to train on. They
| have little to no cloud business, and while they operate
| a couple of services for their users, they do not have
| the infrastructure scale to compete with hyperscaler
| cloud vendors such as Google and Microsoft. Most of what
| they would need to spend on training new models would
| require that they hand over lots of money to the very
| companies that already have their own models,
| supercharging their competition.
|
          | While there is a chance that Apple might come out with a
          | very sophisticated on-device model, the problem is that
          | they would only be able to compete with other on-device
          | models. The magnitude of compute needed to keep pace
          | with SOTA models is not achievable on a single
| device. It will take many generations of Apple silicon in
| order to compete with the compute of existing
| datacenters.
|
| Google also already has competitive silicon in this space
| with the Tensor series processors, which are being fabbed
| at Samsung plants today. There is no sitting and praying
| necessary on their part as they already compete.
|
| Apple is a very distant competitor in the space of AI,
| and I see no reason to assume this will change, they are
| uniquely disadvantaged by several of the choices they
| made on their way to mobile supremacy. The only thing
| they currently have going for them is the development of
| their own ARM silicon which may give them the ability to
| compete with Google's TPU chips, but there is far more
| needed to be competitive here than the ability to avoid
| the Nvidia tax.
| whimsicalism wrote:
| yeah i've never understood the outsized optimism for
| apple's ai strategy, especially on hn.
|
| they're _a little bit less of a nobody_ than they used to
| be, but they're basically a nobody when it comes to
| frontier research /scaling. and the best model matters
| way more than on-device which can always just be
| distilled later and find some random startup/chipco to do
| inference
| msabalau wrote:
| Theory: Apple's lifestyle branding is quite important to
| the identity of many in the community here. I mean, look
| at the buy-in at launch for Apple Vision Pro by so many
| people on HN--it made actual Apple communities and
| publications look like jaded skeptics.
| simonw wrote:
| "having famously been focused on privacy for its users
| and thus having no particular advantage in this space due
| to not having customer data to train on"
|
| That may not be as big a disadvantage as you think.
|
| Anthropic claim that they did not use any data from their
| users when they trained Claude 3.5 Sonnet.
| whimsicalism wrote:
| sure but they certainly acquired data from mass scraping
| (including of data produced by their users) and/or data
| brokering aka paying someone to do the same.
| vineyardmike wrote:
| I don't think the AI market will ever really be a healthy
| one until inference vastly outnumbers training. What does
| it say about AI if training is done more than inference?
|
| I agree that the in-device inference market is not
| important yet.
| whimsicalism wrote:
| done more != where the value is at
|
| inference hardware is a commodity in a way that training
| is not
| mupuff1234 wrote:
      | The majority of people want better performance; running locally
      | is just a nice-to-have feature.
| dagmx wrote:
| They'll care though when they have to pay for it, or when
| they're in an area with poor reception.
| mupuff1234 wrote:
| They pay to run it locally as well (more expensive
| hardware)
|
| And sure, poor reception will be an issue, but most
| people would still absolutely take a helpful remote
| assistant over a dumb local assistant.
|
| And you don't exactly see people complaining that they
| can't run Google/YouTube/etc locally.
| dagmx wrote:
            | Your first sentence has the fallacy of attributing the
            | full cost of the device to a single feature and weighing it
            | against the cost of that feature alone.
|
| Most people are unlikely to buy the device for the AI
| features alone. It's a value add to the device they'd buy
| anyway.
|
| So you need the paid for option to be significantly
| better than the free one that comes with the device.
|
| Your second sentence assumes the local one is dumb. What
            | happens when local ones get better? Again, how much better
            | does the cloud one need to be to compete on cost?
|
| To your last sentence, it assumes data fetching from the
| cloud. Which is valid but a lot of data is local too. Are
| people really going to pay for what Google search is
| giving them for free?
| mupuff1234 wrote:
            | I think it's a more likely assumption that on-device
            | performance will trail off-device models by a significant
| margin for at least the next few years - of course if
| magically you can make it work locally with the same
| level of performance it would be better.
|
| Plus a lot of the "agentic" stuff is interaction with the
| outside world, connectivity is a must regardless.
| dagmx wrote:
| My point is that you do NOT need the same level of
            | performance. You need a level of performance good enough
            | that the cost of getting more isn't worth it to
| most people.
| mupuff1234 wrote:
| And my point is that it's way too early to try to
            | optimize for running locally. If performance really
            | stabilizes and comes to a halt (which may well happen),
            | then it makes more sense to optimize.
|
| Plus once you start with on device features you start
| limiting your development speed and flexibility.
| jsight wrote:
| It isn't really hypothetical. Lots of good models run
| well on a modern Macbook Pro.
| YetAnotherNick wrote:
            | You can run a model >100x faster in the cloud compared to
            | on-device with DDR RAM. This would make up for the
| reception.
| dagmx wrote:
| And you can't run the cloud model at all if you can't
| talk to the cloud.
| YetAnotherNick wrote:
            | Yes, but I can't imagine situations where I "have" to run
            | a model when I don't have internet at that time. My life
            | would be more affected by losing the rest of the internet
            | than by having to run a small, stupid model locally. At the
            | very least until hallucination is completely solved, as I
            | need the internet to verify the models' output.
| dagmx wrote:
| You're assuming the model is purely for generation
| though. Several of the Gemini features are lookup of
| things across data available to it. A lot of that data
| can be local to device.
|
| That is currently Apple's path with Apple Intelligence
| for example.
| vineyardmike wrote:
| Poor reception is rapidly becoming a non-issue for most
| of the developed world. I can't think of the last time I
| had poor reception (in America) and wasn't on an
| airplane.
|
| As the global human population increasingly urbanizes,
| it'll become increasingly easy to blanket it with cell
| towers. Poor(er) regions of the world will increase
| reception more slowly, but they're also more likely to
| have devices that don't support on-device models.
|
| Also, Gemini Flash is basically positioned as a free
| model, (nearly) free API, free in GUI, free in Search
| Results, Free in a variety of Google products, etc. No
| one will be paying for it.
| dagmx wrote:
| Many major cities have significant dead spots for
| coverage. It's not just for developing areas.
|
| Flash is free for api use at a low rate limit. Gemini as
| a whole is not free to Android users (free right now with
| subscription costs beyond a time period for advanced
| features) and isn't free to Google without some monetary
            | incentive. Hence why I also originally asked about private
| cloud compute alternatives with Google.
| griomnib wrote:
| Latency is a _huge_ factor in performance, and local models
| often have a huge edge. Especially on mobile devices that
| could be offline entirely.
| YetAnotherNick wrote:
      | If the model weights are not open, you can't run it on device
      | anyway.
| kridsdale1 wrote:
| The Pixel 9 runs many small proprietary Gemini models on
| the internal TPU.
| griomnib wrote:
| And yet these new models still haven't reached feature
| parity with Google Assistant, which can turn my
| flashlight on, but with all the power of burning down a
| rainforest, Gemini still cannot interact with my actual
| phone.
| lern_too_spel wrote:
| I just tried asking my phone to turn on the flashlight
| using Gemini. It worked.
| https://9to5google.com/2024/11/07/gemini-utilities-
| extension...
| griomnib wrote:
| Ok I tried literally last week on Pixel 7a and it didn't
| work. What model do you have? Maybe it requires a phone
| that can do on-device models?
| staticman2 wrote:
| I just tried it on my Galaxy Ultra s23 and it worked. I
| then disconnected internet and it did not work.
| YetAnotherNick wrote:
| Gemini nano weights are leaked and google doesn't care
| about it being leaked. Google would definitely care if
| Pro weights are leaked.
| onlyrealcuzzo wrote:
| Is there any phone in the world that can realistically
| run pro weights?
| JeremyNT wrote:
| Yeah they've been slow to release end-user facing stuff but
| it's obvious that they're just grinding away internally.
|
| They've ceded the fast mover advantage, but with a massive
| installed base of Android devices, a team of experts who
| basically created the entire field, a huge hardware presence
| (that THEY own), massive legal expertise, existing content
| deals, and a suite of vertically integrated services, I feel
| like the game is theirs to lose at this point.
|
| The only caution is regulation / anti-trust action, but with a
| Trump administration that seems far less likely.
| VirusNewbie wrote:
| If you look at where talent is going, it's Anthropic that is
| the real competitor to Google, not OpenAI.
| echelon wrote:
| Gemini in search is answering so many of my search questions
| wrong.
|
| If I ask natural language yes/no questions, Gemini sometimes
| tells me outright lies with confidence.
|
| It also presents information as authoritative - locations,
| science facts, corporate ownership, geography - even when it's
| pure hallucination.
|
| Right at the top of Google search.
|
| edit:
|
| I can't find the most obnoxious offending queries, but here was
| one I performed today: "how many islands does georgia have?".
|
| Compare that with "how many islands does georgia have? Skidaway
| Island".
|
| This is an extremely mild case, but I've seen some wildly wrong
| results, where Google has claimed companies were founded in the
| wrong states, that towns were located in the wrong states, etc.
| sib301 wrote:
| This has happened to me zero times. :shrug:
| airstrike wrote:
| Doesn't match my experience. It also feels like it's getting
| better over time.
| nilayj wrote:
| can you provide some example queries that Gemini in search gets
| wrong?
| nice__two wrote:
      | Gemini 1.5 indeed is very hit-and-miss. Also, the
| politically correct and medical info filtering is limiting its
| usefulness a lot, IMHO.
|
      | I also find that it's not yet really as context-aware as
      | ChatGPT 4o. Even just asking a follow-up question confuses
      | Gemini 1.5.
|
| Hope Gemini 2.0 will improve that!
| adultSwim wrote:
| I've found these results quite useful
| jonomacd wrote:
| At first, this was true but now it has gotten pretty good. The
      | times it gets things wrong are often not the model's fault but
      | rather Google Search's fault.
| zb3 wrote:
| Is this the gemini-exp model on LMArena?
| jasonjmcghee wrote:
| Both are available on aistudio so I don't think so.
|
| In my own testing "exp 1206" is significantly better than
| Gemini 2.
|
| Feels like haiku 3.5 vs sonnet 3.5 kind of thing.
| warkdarrior wrote:
| Yes, LMArena shows Gemini-2.0-Flash-Exp ranking 3rd right now,
| after Gemini-Exp-1206 and ChatGPT-4o-latest_(2024-11-20), and
| ahead of o1-preview and o1-mini:
|
| https://lmarena.ai/?leaderboard
| zb3 wrote:
| There's also the "gremlin" model (not reachable directly) and
| it seems to be pretty smart.. maybe that's the deep research
| mode?
|
| EDIT: probably not deep research.. is it Google testing their
| equivalent of o1? who knows..
| usaar333 wrote:
| It looks like gemini-exp-1121 slightly upgraded. 1206 is
| something else.
| crowcroft wrote:
| Big companies can be slow to pivot, and Google has been famously
| bad at getting people aligned and driving in one direction.
|
  | But, once they do get moving in the right direction they can
| achieve things that smaller companies can't. Google has an insane
| amount of talent in this space, and seems to be getting the right
| results from that now.
|
  | It remains to be seen how well they will be able to productize
  | and market, but it's hard to deny that their LLM models are
  | really, really good.
| manishsharan wrote:
    | >> hard to deny that their LLM models are really, really
    | good.
|
| The context window of Gemini 1.5 pro is incredibly large and it
| retains the memory of things in the middle of the window well.
| It is quite a game changer for RAG applications.
| KaoruAoiShiho wrote:
| It looks like long context degraded from 1.5 to 2.0 according
| to the 2.0 launch benchmarks.
| caeril wrote:
| Bear in mind that a "1 million token" context window isn't
| actually that. You're being sold a sparse attention model,
| which is guaranteed to drop critical context. Google TPUs
        | aren't running inference on a TERABYTE of fp8 key-value
        | cache, let alone TWO of fp16.
|
| Google's marketing wins again, I guess.
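        |
        | For a rough sense of scale, here is a back-of-envelope
        | sketch in Java (the layer/head counts below are purely
        | illustrative assumptions, not Gemini's actual
        | architecture):
        |
        |     // KV-cache size for full (non-sparse) attention.
        |     // All architecture numbers are assumptions.
        |     public class KvCacheEstimate {
        |         public static void main(String[] args) {
        |             long layers = 96;        // hypothetical depth
        |             long kvHeads = 96;       // full MHA, no GQA
        |             long headDim = 128;      // per-head dimension
        |             long tokens = 1_000_000; // advertised context
        |             long bytesPerVal = 1;    // fp8
        |             // 2x for keys and values
        |             long bytes = 2 * layers * kvHeads * headDim
        |                     * tokens * bytesPerVal;
        |             System.out.printf("KV cache: %.2f TB%n",
        |                     bytes / 1e12);
        |         }
        |     }
        |
        | With these (assumed) numbers that is ~2.4 TB; grouped-query
        | or sparse attention shrinks it dramatically, which is the
        | trade-off being described above.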
| pelorat wrote:
| Well, compared to github copilot (paid), I think Gemini Free is
| actually better at writing non-archaic code.
| rafaelmn wrote:
| Using Claude 3.5 sonnet ?
| jacooper wrote:
| Gemini is coming to copilot soon anyway.
| talldayo wrote:
| BERT and Gemma 2B were both some of the highest-performing edge
| models of their time. Google does really well - in terms of
| pushing efficiency in the community they're second to none.
| They also don't need to rely on inordinate amounts of compute
| because Google's differentiating factor is the products they
| own and how they integrate it. OpenAI is API-minded, Google is
| laser-focused on the big-picture experience.
|
| For example; those little AI-generated YouTube summaries that
| have been rolling out are wonderful. They don't require
| heavyweight LLMs to generate, and can create pretty effective
| summaries using nothing but a transcript. It's not only more
| useful than the other AI "features" I interact with regularly,
| it doesn't demand AGI or chain-of-thought.
| closewith wrote:
| > Google is laser-focused on the big-picture experience.
|
| This doesn't match my experience of any Google product.
| talldayo wrote:
| I disagree - another way you could phrase this is that
| Google is presbyopic. They're very capable of thinking
| long-term (eg. Google Deepmind and AI as a whole, cloud,
| video, Drive/GSuite, etc.), but as a result they struggle
| to respond to quick market changes. AdSense is the perfect
| example of Google "going long" on a product and reaping the
| rewards to monopolistic ends. They can corner a market when
        | they set their sights on it.
|
| I don't think Google (or really any of FAANG) makes "good"
| products anymore. But I do think there are things to
| appreciate in each org, and compared to the way Apple and
| Microsoft are flailing helplessly I think Google has proven
| themselves in software here.
| panabee wrote:
| With many research areas converging to comparable levels, the
| most critical piece is arguably vertical integration and
| forgoing the Nvidia tax.
|
| They haven't wielded this advantage as powerfully as possible,
| but changes here could signal how committed they are to slaying
| the search cash cow.
|
| Nadella deservedly earned acclaim for transitioning Microsoft
| from the Windows era to cloud and mobile.
|
| It will be far more impressive if Google can defy the odds and
| conquer the innovator's dilemma with search.
|
| Regardless, congratulations to Google on an amazing release and
| pushing the frontiers of innovation.
| crowcroft wrote:
| They need an iPod to iPhone like transition. If they can pull
| it off it will be incredible for the business.
| bloomingkales wrote:
      | They have to not get blindsided by Sora, while at the same
| time fighting the cloud war against MS/Amazon.
|
| Weirdly Google is THE AI play. If AI is not set to change
| everything and truly is a hype cycle, then Google stock
| withstands and grows. If AI is the real deal, then Google
| still withstands due to how much bigger the pie will get.
| bushbaba wrote:
    | Yet, Google continues to show it'll deprecate its APIs,
    | services, and functionality to the detriment of your own
    | business. I'm not sure enterprises will trust Google's LLM over
| the alternatives. Too many have been burned throughout the
| years, including GCP customers.
|
    | The fact that GCP needs to have these pages, and that the lists
    | are not 100% comprehensive, is telling enough.
| https://cloud.google.com/compute/docs/deprecations
| https://cloud.google.com/chronicle/docs/deprecations
| https://developers.google.com/maps/deprecations
|
| Steve Yegge rightfully called this out, and yet no change has
| been made. https://medium.com/@steve.yegge/dear-google-cloud-
| your-depre...
| weatherlite wrote:
      | GCP grew 35% last quarter, just saying...
| Jabbles wrote:
| "just saying" things that are false.
|
| Google Cloud grew 35% year over year, when comparing the 3
| months ending September 30th 2024 with 2023.
|
| https://abc.xyz/assets/94/93/52071fba4229a93331939f9bc31c/g
| o... page 12
| surajrmal wrote:
| Isn't that the typical interpretation of what the parent
| comment said? How is it false?
| StableAlkyne wrote:
| > Remains to be seen how well they will be able to productize
| and market
|
| The challenge is trust.
|
| Google is one of the leaders in AI and are home to incredibly
| talented developers. But they also have an incredibly bad track
| record of supporting their products.
|
| It's hard to justify committing developers and money to a
| product when there's a good chance you'll just have to pivot
| again once they get bored. Say what you will about Microsoft,
| but at least I can rely on their obsession with supporting
| outdated products.
| egeozcan wrote:
| > they also have an incredibly bad track record of supporting
| their products
|
      | Incredibly bad track record of supporting products _that
      | don't grow_. I'm not saying this to defend Google, I'm still
| (perhaps unreasonably) angry because of Reader, it's just
| that there is a pattern and AI isn't likely to fit that for a
| long while.
| RandomThoughts3 wrote:
| I'm sad for reader but it was a somewhat niche product.
| Inbox I can't forgive. It was insanely good and was killed
| because it was a threat to Gmail.
|
        | My main issue with Google is that internal politics affect
        | users all the time. See the debacle of anything built on
        | top of Android being treated as a second-class citizen.
|
| You can't trust a company which can't shield users from its
| internal politics. It means nothing is aligned correctly
| for users to be taken seriously.
| mannycalavera42 wrote:
| not going to miss the opportunity to upvote on the grief of
| having lost Reader
| msabalau wrote:
        | Yeah, either AI is significant, in which case Google isn't
        | going to kill it. Or AI is a bubble, in which case any of
        | the alternatives one might pick can easily crash and die
        | long before Google end-of-lifes anything.
|
| This isn't some minor consumer play, like a random tablet
        | or Stadia. Anyone who has been paying attention would have
| noticed that AI has been an important, consistent, long
| term strategic interest of Google's for a very long time.
| They've been killing off the fail/minor products to invest
| in _this_.
| TIPSIO wrote:
| Yes. Imagine Google banning your entire Google account /
| Gmail because you violated their gray area AI terms ([1] or
| [2]). Or, one of your users did via an app you made using an
| API key and their models.
|
| With that being said, I am extremely bullish on Google AI for
| a long time. I imagine they land at being the best and
| cheapest for the foreseeable future.
|
| [1] https://policies.google.com/terms/generative-ai
|
| [2] https://policies.google.com/terms/generative-ai/use-
| policy
| estebarb wrote:
| For me that is a reason for not touching anything from
      | Google for building stuff. I can afford losing my Amazon
      | account, but losing my Google one would be too much. At least they
| should be clear in their terms that getting banned at cloud
| doesn't mean getting banned from Gmail/Docs/Photos...
| bippihippi1 wrote:
| why not just make a business / project account?
| rtsil wrote:
| That won't help. Their TOS and policies are vague enough
| that they can terminate all accounts you own (under "Use
| of multiple accounts for abuse" for instance).
| TIPSIO wrote:
| To be fair, I believe this is reserved for things like
| fighting fraud.
| dbdoskey wrote:
            | It has been reported a few times by people who had a
            | Google Play app banned that their personal account got
            | banned as well.
|
| https://www.xda-developers.com/google-developer-account-
| ban-...
| fluoridation wrote:
| >Say what you will about Microsoft, but at least I can rely
| on their obsession with supporting outdated products.
|
| Eh... I don't know about that. Their tech graveyard isn't as
| populous as Google's, but it's hardly empty. A few that come
| to mind: ATL, MFC, Silverlight, UWP.
| bri3d wrote:
| Besides Silverlight (which was supported all the way until
| the end of 2021!), you can still not only run but _write
| new applications_ using all of the listed technologies.
| fluoridation wrote:
| That doesn't constitute support when it comes to
| development platforms. They've not received any updates
| in years or decades. What they've done is simply not
          | remove the build capability from the toolchains. That is,
          | they haven't even done the work that would be
          | required to no longer support them in any way. Compare
| that to C#, which has evolved rapidly over the same time
| period.
| Fidelix wrote:
| That's different from "killing" the product / technology,
| which is what Google does.
| fluoridation wrote:
| Only because they operate different businesses. Google is
| primarily a service provider. They have few software
| products that are not designed to integrate with their
| servers. Many of Microsoft's businesses work
| fundamentally differently. There's nothing Microsoft
| could do to Windows to disable all MFC applications and
| only MFC applications, and if there was it would involve
| more work than simply not doing anything else with MFC.
| dotancohen wrote:
| > Google is one of the leaders in AI and are home to
| incredibly talented developers. But they also have an
| incredibly bad track record of supporting their products.
|
| This is why we've stayed with Anthropic. Every single person
| I work with on my current project is sore at Google for
| discontinuing one product or another - and not a single one
| of them mentioned Reader.
|
| We do run some non-customer facing assets in Google Cloud.
| But the website and API are on AWS.
| bastardoperator wrote:
    | Putting your trust in Google is a fool's errand. I don't know
| anyone that doesn't have a story.
| crazygringo wrote:
| > _and Google has been famously bad at getting people aligned
| and driving in one direction._
|
| To be fair, it's not that they're bad at it -- it's that they
| generally have an explicit philosophy against it. It's a
| choice.
|
| Google management doesn't want to "pick winners". It prefers to
| let multiple products (like messaging apps, famously) compete
| and let the market decide. According to this way of thinking,
| you come out ahead in the long run because you increase your
| chances of having the winning product.
|
| Gemini is a great example of when they do choose to focus on a
| single strategy, however. Cloud was another great example.
| xnx wrote:
| I definitely agree that multiple competing products is a
| deliberate choice, but it was foolish to pursue it for so
| long in a space like messaging apps that has network effects.
|
| As a user I always still wish that there were fewer apps with
| the best features of both. Google's 2(!) apps for AI podcasts
| being a recent example : https://notebooklm.google.com/ and
| https://illuminate.google.com/home
| tbarbugli wrote:
| Google is not winning on cloud, AWS is winning and MS gaining
| ground.
| surajrmal wrote:
| Parent didn't claim Google is winning. Only that there is a
| cohesive push and investment in a single product/platform.
| rrdharan wrote:
| That was 2023; more recently Microsoft is losing ground to
| Google (in 2024).
| bwb wrote:
| So far, for my tests, it has performed terribly compared to
| ChatGPT and Claude. I hope this version is better.
| aerhardt wrote:
| > seems to be getting the right results
|
    | > hard to deny that their LLM models are really, really good
|
| I'm so scarred by how much their first Gemini releases sucked
| that the thought of trying it again doesn't even cross my mind.
|
| Are you telling us you're buying this press release wholesale,
| or you've tried the tech they're talking about and love it, or
| you have some additional knowledge not immediately evident
| here? Because it's not clear from your comment where you are
| getting that their LLM models are really good.
| MaxDPS wrote:
| I've been using Gemini 1.5 Pro for coding and it's been
| great.
| jncfhnb wrote:
| Am I alone in thinking the word "agentic" is dumb as shit?
|
| Most of these things seem to just be a system prompt and a tool
| that get invoked as part of a pipeline. They're hardly "agents".
|
| They're modules.
| thomassmith65 wrote:
| It's easier for consultants and sales people to sell to
| enterprise if the terminology is familiar but mysterious.
|
    | Bad:
    | 1. installed Antivirus software
    | 2. added screen-size CSS rules
    | 3. copied 'Assets' harddrive to DropBox
    | 4. edited homepage to include Bitcoin wallet address link
    | 5. upgraded to ChatGPT Pro
    |
    | "Good":
    | 1. Cyber-security defenses
    | 2. Responsive Design implementation
    | 3. Cloud Storage
    | 4. Blockchain Technology gateway
    | 5. Agentic enhancements
| xnx wrote:
| Controlling a browser in Project Mariner seems very agentic:
| https://youtu.be/Fs0t6SdODd8?t=86
| uludag wrote:
    | Definitely not alone. With all this money at stake, coining
| dumb terms like this might make you a pretty penny.
|
| It's like a meme that can be milked for monetization.
| Agentus wrote:
    | The beauty of LLMs isn't just that these coding objects speak
    | human vernacular; they can be chained with human-vernacular
    | prompts, and that itself can be used sensibly as an input,
    | command or output without necessarily causing errors, even if a
    | combination of inputs wasn't preprogrammed.
|
    | I have an A.I. textbook with agent terminology that was
    | written in pre-LLM days. Agents are just autonomous-ish code that
    | loops on itself with some extra functionality. LLMs in their
    | elegance can more easily self-loop out of the box, just by
    | concatenating language prompts sensibly. They are almost agent-
    | ready out of the box by this very elegant quality (the textbook
    | agentic diagram is just a conceptual self-perpetuation loop),
    | except...
|
    | Except they fail at a lot or get stuck on hiccups. But here is
    | a novel thought: what if an LLM becomes more agentic (i.e. more
    | able to sustain autonomous chains of prompts that take actions
    | without a terminal failure) and less copilot-like not by more
    | complex controlling wrapper self-perpetuation code, but by
    | means of training the core LLM itself to function more fluidly
    | in agentic scenarios?
|
    | A better agentically-performing LLM that isn't mislabeled with a
    | bad buzzword might not reveal itself in its wrapper control
    | code but through simply performing better in a typical
    | agentic loop or environment, with whatever initiating
    | prompt, control wrapper code, or pipeline initiates its
    | self-perpetuation cycle.
| WA wrote:
| Gemini, too, for the sole reason that non-native speakers have
| no clue how to pronounce it.
| kaashif wrote:
| Also, people at NASA pronounce it two ways, even native
| speakers of English.
| Havoc wrote:
| >"agentic" is dumb as shit?
|
| It'll create endless consulting opportunities for projects that
| never go anywhere and add nothing of value unless you value
| rich consultants.
| brokensegue wrote:
| Any word on price? I can't find it at
| https://ai.google.dev/pricing
| Oras wrote:
    | £18/month
|
| https://gemini.google/advanced/?Btc=web&Atc=owned&ztc=gemini...
|
| then sign in with Google account and you'll see it
| brokensegue wrote:
| Oh but I only care about api pricing
| xnx wrote:
| I think it is free for 1500 requests/day. See the model
| dropdown on https://aistudio.google.com/prompts/new_chat
| gman83 wrote:
| I've been using Gemini Flash for free through the API using
| Cline for VS Code. I switch between Claude and Gemini Flash,
| using Claude for more complicated tasks. Hope that the 2.0
| model comes closer to Claude for coding.
| SV_BubbleTime wrote:
| Or... just continue using Claude?
| 85392_school wrote:
| I think they try to conserve costs by only using Claude
| when needed.
| IAmGraydon wrote:
| Claude is ridiculously expensive and often subject to rate
| limiting.
| serjester wrote:
| Agreed - tried some sample prompts on our data and the rough
| vibe check is that flash is now as good as the old pro. If they
| keep pricing the same, this would be really promising.
| airstrike wrote:
| OT: I'm not entirely sure why, but "agentic" sets my teeth on
| edge. I don't mind the concept, but the word itself has that
| hollow, buzzwordy flavor I associate with overblown LinkedIn
| jargon, particularly as it is not actually in the
| dictionary...unlike perfectly serviceable entries such as
| "versatile", "multifaceted" or "autonomous"
| geodel wrote:
    | Huh, all three words you mentioned as replacements are equally
    | buzzwordy and I see them a lot in CVs while screening candidates
    | for job interviews.
| airstrike wrote:
| At least all three of them are actually in the dictionary
| hombre_fatal wrote:
| That's not necessarily a good thing because they are
| overloaded while novel jargon is specific.
|
| We use new words so often that we take it for granted.
| You've passively picked up dozens of new words over the
| last 5 or 10 years without questioning them.
| raincole wrote:
      | Versatile implies it can do more kinds of tasks (than its
      | predecessor or competitor). Agentic implies it requires less
| human intervention.
|
      | I don't think these are necessarily buzzwords _if_ the product
| really does what they imply.
| lolinder wrote:
| They agree--they're saying that at least those buzzwords are
| in the dictionary, not that they'd be a good replacement for
| "agentic".
| thom wrote:
| I'm personally very glad that the word has adhered itself to a
| bunch of AI stuff, because people had started talking about
| "living more agentically" which I found much more aggravating.
| Now if anyone states that out loud you immediately picture them
| walking into doors and misunderstanding simple questions, so it
| will hopefully die out.
| OutOfHere wrote:
| To play devil's advocate, the correct use of the word would be
| when multiple AIs are coordinating and handing off tasks to
| each other with limited context, such that the handoffs are
| dynamically decided at runtime by the AI, not by any routine
| code. I have yet to see a single example where this is
| required. Most problems can be solved with static workflows and
| simple rule based code. As such, I do believe that >95% of the
| usage of the word is marketing nonsense.
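    |
    | To make the distinction concrete, here is a minimal sketch
    | (hypothetical types; call() stands in for whichever LLM API
    | is actually used):
    |
    |     import java.util.Map;
    |
    |     interface Model { String call(String prompt); }
    |
    |     class Workflows {
    |         // Static workflow: the hand-off order is fixed in code.
    |         static String staticPipeline(Model summarizer,
    |                 Model translator, String doc) {
    |             String summary = summarizer.call("Summarize: " + doc);
    |             return translator.call("Translate to French: " + summary);
    |         }
    |
    |         // Dynamic hand-off: a router model decides at runtime
    |         // which specialist (if any) handles the next step.
    |         static String dynamicHandoff(Model router,
    |                 Map<String, Model> specialists, String task) {
    |             String choice = router.call("Task: " + task
    |                     + "\nReply with exactly one of: "
    |                     + specialists.keySet());
    |             Model next = specialists.getOrDefault(choice.trim(), router);
    |             return next.call(task);
    |         }
    |
    |         public static void main(String[] args) {
    |             Model stub = prompt -> "[stub reply to: " + prompt + "]";
    |             System.out.println(staticPipeline(stub, stub, "a doc"));
    |             System.out.println(dynamicHandoff(stub,
    |                     Map.of("coder", stub), "fix the bug"));
    |         }
    |     }
    |
    | Only the second shape really earns the label, and even then the
    | static pipeline is usually the simpler, more reliable choice.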
| maeil wrote:
| I actually have built such a tool (two AIs, each with
      | different capabilities), but still cringe at calling it
| agentic. Might just be an instinctive reflex.
| danpalmer wrote:
| I think this sort of usage is already happening, but perhaps
| in the internal details or uninteresting parts, such as
| content moderation. Most good LLM products are in fact using
| many LLM calls under the hood, and I would expect that
| results from one are influencing which others get used.
| m3kw9 wrote:
| Yeah I hate it when AI companies throw around words like AGI
    | and agentic capabilities. It's nonsense to most people and
    | ambiguous at best.
| ramoz wrote:
| Need a general term for autonomous intelligent decision making.
| airstrike wrote:
| Isn't that just "intelligent"?
| ramoz wrote:
| We need something to describe a behavioral element in
| business processes. Something goes into it, something comes
| out of it - though in this case nondeterminism is involved
| and it may not be concrete outputs so much as further
| actioning.
|
| Intelligence is a characteristic.
| airstrike wrote:
| Volitional, independent, spontaneous, free-willed,
| sovereign...
| aithrowawaycomm wrote:
| No, we need a _scientific understanding_ of autonomous
| intelligent decision-making. The problem with "agentic AI" is
| the same old "Artificial Intelligence, Natural Stupidity"
| problem: we have no clue what "reasoning" or "intelligence"
| or "autonomous" actually means in animals, and trying to
| apply these terms to AI without understanding them (or
| inventing a new term without nailing down the underlying
| concept) is doomed to fail.
| wepple wrote:
| Versatile is far worse. It's so broad to the point of
| meaninglessness. My garden rake is fairly versatile.
|
| Agentic to me means that it acts somewhat under its own
| authority rather than a single call to an LLM. It has a small
| degree of agency.
| geodel wrote:
| Just searched for _GVP vs SVP_ and got:
|
| "GVP stands for Good Pharmacovigilance Practice, which is a set
| of guidelines for monitoring the safety of drugs. SVP stands for
| Senior Vice President, which is a role in a company that focuses
| on a specific area of operations."
|
  | Seems there's a lot of pharma regulation in my telecom company.
| PaulWaldman wrote:
| Anecdotally, using the Gemini App with "Gemini Advanced 2.0 Flash
| Experimental", the response quality is ignorantly improved and
| faster at some basic Python and C# generation.
| xnx wrote:
| > ignorantly improved
|
| autocorrect of "significantly improved"?
| oldpersonintx wrote:
| the demos are amazing
|
| I need to rewire my brain for the power of these tools
|
| this plus the quantum stuff...Google is on a win streak
| SubiculumCode wrote:
| Considering so many of us would like more vRAM than NVIDIA is
| giving us for home compute, is there any future where these
| Trillium TPUs become commodity hardware?
| geodel wrote:
    | _So many of us_ are probably in the thousands; it would need to
    | be 3 orders of magnitude higher before Google can even think of it.
| kajecounterhack wrote:
| Power concerns aside, individual chips in a TPU pod don't
| actually have a ton of vRAM; they rely on fast interconnects
| between a lot of chips to aggregate vRAM and then rely on
| pipeline / tensor parallelism. It doesn't make sense to try to
| sell the hardware -- it's operationally expensive. By keeping
| it in house Google only has to support the OS/hardware in their
| datacenter and they can and do commercialize through hosted
| services.
|
| Why do you want the hardware vs just using it in the cloud? If
| you're training huge models you probably don't also keep all
| your data on prem, but on GCS or S3 right? It'd be more
| efficient to use training resources close to your data. I guess
| inference on huge models? Still isn't just using a hosted API
| simpler / what everyone is doing now?
| katamari-damacy wrote:
| Is it better than GPT4o? Does it have an API?
| jerrygenser wrote:
| API is accessible via Vertex AI on Google Cloud in preview. I
| think it's also available in the consumer Gemini Chat.
| kgwgk wrote:
| https://ai.google.dev/gemini-api/docs/models/gemini-v2
| chrsw wrote:
| Instead of throwing up tables of benchmarks just let me try to do
| stuff and see if it's useful.
| topicseed wrote:
| Is it on AI studio already?
| jonomacd wrote:
| Yes it is. Including the live features. It is pretty
| impressive. Basically voice mode with a live video feed as
| well.
| siliconc0w wrote:
| What's everyone's favorite LLM leaderboard? Gemini 2 seems to be
| edging out 4o on chatbot arena(https://lmarena.ai/?leaderboard)
| SV_BubbleTime wrote:
| AI benchmarks and leaderboards are complete nonsense though.
|
| Find something you like, use it, be ready to look again in a
| month or two.
| falcor84 wrote:
| With the accelerating progress, the "be ready to look again"
| is becoming a full time job that we need to be able to
| delegate in some way, and I haven't found anything better
| than benchmarks, leaderboards and reviews.
|
| EDIT: Typo
| siliconc0w wrote:
| FWIW I've found the 'coding' 'category' of the leaderboard to
    | be reasonably accurate. Claude was the best, then o1-mini was
    | typically stronger, and now Gemini Exp 1206 is at the top.
|
| I find myself just paying a la carte via the API rather than
| paying the $20/mo so I can switch between the models.
| hombre_fatal wrote:
| poe.com has a decent model where you buy credits and spend
| them talking to any LLM which makes it nice to swap between
| them even during the same conversation instead of paying for
| multiple subscriptions.
|
| Though gpt-4o could say "David Mayer" on poe.com but not on
| chat.openai.com which makes me wonder if they sometimes cheat
| and sneak in different models.
| manishsharan wrote:
    | Leaderboards are not that useful for measuring the real-life
    | effectiveness of the models, at least in my day-to-day usage.
|
    | I am currently struggling to diagnose an IPv6 misconfiguration
    | in my enormous AWS CloudFormation YAML code. I gave the same
    | input to Claude Opus, Gemini, and ChatGPT (o1 and 4o).
    |
    | 4o was the worst: verbose and a waste of my time.
    |
    | Claude completely went off on a tangent and began recommending
    | fixes for IPv4 while I specifically asked about IPv6 issues.
|
| o1 made a suggestion which I tried out and it fixed it. It
| literally found a needle in the haystack. The solution is
| working well now.
|
| Gemini made a suggestion which almost got it right but it was
| not a full solution.
|
| I must clarify diagnosing network issues on AWS VPC is not my
| expertise and I use the LLMs to supplement my knowledge.
| blastbking wrote:
      | Sonnet 3.5 as of today is superior to Opus; curious whether
      | Sonnet could have solved your problem.
| zhyder wrote:
| I like that https://artificialanalysis.ai/leaderboards/models
| describes both quality and speed (tokens/s and first chunk s).
| Not sure how accurate it is; anyone know? Speed and variance of
| it in particular seems difficult to pin down because providers
| obviously vary it with load to control their costs.
| IAmGraydon wrote:
| https://aider.chat/docs/leaderboards/
| lossolo wrote:
| https://livebench.ai/#/
| danpalmer wrote:
| Notably, GPT-4o is a "full size" model, whereas Gemini 2 Flash
| is the small and efficient variant in that family as far as I
| understand it.
| dangoodmanUT wrote:
| Jules looks like it's going after Devin
| m3kw9 wrote:
    | Claude MCP does the same thing. It's the setup that is hard. It
    | will do push, pull, and create-branch automatically from a single
    | prompt. $500 a month for Devin could be worth it if you want it
    | taken care of, plus use of the models for a team, but a single
    | person can set it up themselves.
| moralestapia wrote:
| >2,000 words of bs
|
| >General availability will follow in January, along with more
| model sizes.
|
| >Benchmarks against their own models which always underperformed
|
| >No pricing visible anywhere
|
| Completely inept leadership at play.
| EternalFury wrote:
  | Think of Google as a tanker ship. It takes a while to change
| course, but it has great momentum. Sundar just needs to make sure
| the course is right.
| griomnib wrote:
| And where is the ship headed if they are no longer supporting
| the open web?
|
| Publishers are being squeezed and going under, or replacing
| humans with hallucinated genai slop.
|
| It's like we're taking the private equity model of extracting
| value and killing something off to the entire web.
|
| I'm not sure where this is headed, but I don't think Sundar has
| any strategy here other than playing catch up.
|
| Demis' goal is pretty transparently positioning himself to take
| over.
| CSMastermind wrote:
| That's almost word for word what people said about Windows
| Phone when I was at Microsoft.
| atorodius wrote:
| Was the Windows Phone ever at the frontier tho?
| zaptrem wrote:
| It is a lot easier to switch LLMs than it is to switch
| smartphone platforms.
| onlyrealcuzzo wrote:
      | But Windows Phone was actually good, like Zune; it was just
| late, and it was incredibly popular to hate Microsoft at the
| time.
|
| Additionally, Microsoft didn't really have any advantage in
| the smart phone space.
|
| Google is already a product the majority of people on the
| planet use regularly to answer questions.
|
| That seems like a competitive advantage to me.
| machiaweliczny wrote:
| Yeah, I liked my windows phone, not sure why they killed it
| rrrrrrrrrrrryan wrote:
| Windows Phone was actually great though, and would've
| eventually been a major player in the space if Microsoft were
| stubborn enough to stick with it long enough, like they did
| with the Xbox.
|
| By his own admission, Gates was extremely distracted at the
| time by the antitrust cases in Europe, and he let the
| initiative die.
| thisoneworks wrote:
| "gemini for video games" - here we go again with the AI does the
| interesting stuff for you rather than the boring stuff
| gotaran wrote:
| Google beat OpenAI at their own game.
| transcriptase wrote:
| "Hey google turn on kitchen lights"
|
| "Sure, playing don't fear the reaper on bathroom speaker"
|
| Ok
| wonderfuly wrote:
| Chat now: https://app.chathub.gg/chat/cloud-gemini-2.0-flash
| m3kw9 wrote:
  | Can these guys lead for once? They are always responding to what
| OpenAI is doing.
| losvedir wrote:
| This naming is confusing...
|
| Anyway, I'm glad that this Google release is actually available
| right away! I pay for Gemini Advanced and I see "Gemini Flash
| 2.0" as an option in the model selector.
|
| I've been going through Advent of Code this year, and testing
| each problem with each model (GPT-4o, o1, o1 Pro, Claude Sonnet,
| Opus, Gemini Pro 1.5). Gemini has done decent, but is probably
| the weakest of the bunch. It failed (unexpectedly to me) on Day
| 10, but when I tried Flash 2.0 it got it! So at least in that one
| benchmark, the new Flash 2.0 edged out Pro 1.5.
|
| I look forward to seeing how it handles upcoming problems!
|
| I should say: Gemini Flash didn't _quite_ get it out of the box.
| It actually had a syntax error in the for loop, which caused it
| to fail to compile, which is an unusual failure mode for these
  | models. Maybe it was a different version of Java or something
  | (I'm also trying to learn Java with AoC this year...). But when I
| gave Flash 2.0 the compilation error, it did fix it.
|
  | For the more Java-proficient, can someone explain why it may have
  | provided this code:
  |
  |     for (int[] current = queue.remove(0)) {
|
| which was a compilation error for me? The corrected code it gave
  | me afterwards was just:
  |
  |     for (int[] current : queue) {
|
| and with that one change the class ran and gave the right
| solution.
| ianmcgowan wrote:
| A tangent, but is there a clear best choice amongst those
| models for AOC type questions?
| srameshc wrote:
    | I use Claude and Gemini a lot for coding and I realized there
    | is no single best model. Every model has its upsides and
    | downsides. I was trying to get authentication working according
| to the newer guidelines of Manifest V3 for browser extensions
| and every model is terrible. It is one use case where there is
    | not much information or proper documentation, so every model
    | makes up stuff. But this is my experience and I don't speak for
| everyone.
| huijzer wrote:
      | Relatedly, I'm starting to think more and more that AI is great
      | for mediocre stuff. If you just need to do the 1000th website, it
      | can do that. Do you want to build a new framework? Then there
      | will probably be fewer useful suggestions. (Still not
      | useless though. I do like it a lot for refactoring while
| building xrcf.)
|
      | EDIT: One reason that led me to think it's better for
| mediocre stuff was seeing the Sora model generate videos. Yes
| it can create semi-novel stuff through combinations of
| existing stuff, but it can't stick to a coherent "vision"
| throughout the video. It's not like a movie by a great
| director like Tarantino where every detail is right and all
| details point to the same vision. Instead, Sora is just
| flailing around. I see the same in software. Sometimes the
| suggestions go towards one style and the next moment into
      | another. I guess AI currently just has a much shorter
      | context length. Tarantino has been refining his style for 30
| years now. And always he has been tuning his model towards
| his vision. AI in comparison seems to always just take
| everything and turn it into one mediocre blob. It's not
| useless but currently good to keep in mind I think. That you
| can only use it to generate mediocre stuff.
| meiraleal wrote:
| We got to the point that AI isn't great because it is not
| like a Tarantino movie. What a time to be alive.
| monkmartinez wrote:
| This is true for all newish code bases. You need to provide
| the context it needs to get the problem right. It has been my
| experience that one or two examples with new functions or new
| requirements will suffice for a correction.
| copperx wrote:
| That's when having a huge context is valuable. Dump all of
| the new documentation into the model along with your query
| and the chances of success hugely increase.
| xnx wrote:
| > I use a Claude and Gemini a lot for coding and I realized
| there is no good or best model.
|
| True to a point, but is anyone using GPT2 for anything still?
| Sometimes the better model completely supplants others.
| notamy wrote:
| > For the more Java proficient, can someone explain why it may
| have provided this code:
|
    | To me that reads like it was trying to accomplish something
    | like:
    |
    |     int[] current;
    |     while ((current = queue.pop()) != null) {
| rybosome wrote:
| I can't comment on why the model gave you that code, but I can
| tell you why it was not correct.
|
| `queue.remove(0)` gives you an `int[]`, which is also what you
| were assigning to `current`. So logically it's a single
| element, not an iterable. If you had wanted to iterate over
| each item in the array, it would need to be:
|
    | ```
    | for (int[] current : queue) {
    |     for (int c : current) {
    |         // ...do stuff...
    |     }
    | }
    | ```
|
| Alternatively, if you wanted to iterate over each element in
| the queue and treat the int array as a single element, the
| revised solution is the correct one.
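    |
    | For what it's worth, here is a minimal sketch of the
    | consume-the-queue pattern the model was probably conflating
    | with the for-each loop (assuming `queue` is a `List<int[]>`
    | of coordinates, which is a guess about the original program):
    |
    |     import java.util.ArrayList;
    |     import java.util.List;
    |
    |     public class QueueLoop {
    |         public static void main(String[] args) {
    |             List<int[]> queue = new ArrayList<>();
    |             queue.add(new int[]{0, 0}); // hypothetical start cell
    |
    |             // Consume the queue one element at a time, BFS-style.
    |             while (!queue.isEmpty()) {
    |                 int[] current = queue.remove(0); // one int[], not iterable
    |                 System.out.println(current[0] + "," + current[1]);
    |                 // ...add neighbour cells to the queue here...
    |             }
    |         }
    |     }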
| nuz wrote:
| I guess this means we'll have an openai release soon
| nightski wrote:
| Anyone else annoyed how the ML/AI community just adopted the word
| "reasoning" when it seems like it is being used very out of
| context when looking at what the model actually does?
| ramoz wrote:
| These models take an instruction, along with any contextual
| information, and are trained to produce valid output.
|
| That production of output is a form of reasoning via _some_
| type of logical processing. No?
|
| Maybe better to say computational reasoning. That's a mouthful.
| nightski wrote:
| Static computation is not reasoning (these models are not
| building up an argument from premises, they are merely
| finding statistically likely completions). Computational
| thinking/reasoning would be breaking down a problem into an
      | algorithmic steps. The model is doing neither. I wouldn't read
      | too much into the fact that it can break a problem into steps
      | if you ask it, because again that is just regurgitation. It's
      | not going
| through that process without your prompt. That is not part of
| its process to arrive at an answer.
| ramoz wrote:
| The point is emergent capabilities in LLMs go beyond
| statistical extrapolation, as they demonstrate reasoning by
| combining learned patterns.
|
| When asked, "If Alice has 3 apples and gives 2 to Bob, how
| many does she have left?", the model doesn't just retrieve
| a memorized answer--it infers the logical steps
| (subtracting 2 from 3) to generate the correct result,
| showcasing reasoning built on the interplay of its scale
| and architecture rather than explicit data recall.
| thelastbender12 wrote:
| I kinda agree with you but I can also see why it isn't that
| far from "reasoning" in the sense humans do it.
|
| To wit, if I am doing a high school geometry proof, I come
| up with a sequence of steps. If the proof is correct, each
| step follows logically from the one before it.
|
| However, when I go from step 2 to step 3, there are
        | multiple options for step 3 I could have chosen. Is it so
        | different from the "most-likely prediction" an LLM makes? I
| suppose the difference is humans can filter out logically-
| incorrect steps, or prune chains-of-steps that won't lead
| to the actual theorem quicker. But an LLM predictor coupled
| with a verifier doesn't feel that different from it.
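|
| To make that concrete, a toy propose-and-verify loop might look like
| the sketch below (propose_step() and verify() are placeholders for
| an LLM sampler and a proof checker, not any real API):
|
|     # hypothetical sketch: sample "likely" next steps, keep only the
|     # ones a verifier accepts, stop when the chain reaches the goal
|     def search_proof(goal, propose_step, verify, max_depth=10, k=5):
|         steps = []
|         for _ in range(max_depth):
|             candidates = [propose_step(goal, steps) for _ in range(k)]
|             valid = [c for c in candidates if verify(goal, steps + [c])]
|             if not valid:
|                 return None  # dead end: no candidate step checks out
|             steps.append(valid[0])
|             if verify(goal, steps, complete=True):
|                 return steps  # a verified chain of steps to the goal
|         return None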
| w10-1 wrote:
| Does it help to explore the annoyance using gap analysis? I
| think of it as heuristics. As with humans, it's the pragmatic
| "whatever seems to work" where "seems" is determined via
| training. It's neither reasoning from first principles (system
| 2) nor just selecting the most likely/prevalent answer (system
| 1). And chaining heuristics doesn't make it reasoning, either.
| But where there's evidence that it's working from a model, then
| it becomes interesting, and begins to comply with classical
| epistemology wrt "reasoning". Unfortunately, information theory
| seems to treat any compression as a model leading to some
| pretty subtle delusions.
| resource_waste wrote:
| These kinds of simplifications continue to make me an expert in
| LLM applications.
|
| So... it's a trade secret to know how it actually works...
| summerlight wrote:
| It is interesting to see that they keep focusing on the cheapest
| model instead of the frontier model. Probably because of their
| primary (internal?) customer's need?
| discobot wrote:
| the problem is that the last generation of the largest models
| failed to beat smaller models on the benchmarks -- see the lack
| of a new Claude Opus or GPT-5. The problem is probably in the
| benchmarks, but anyway.
| coder543 wrote:
| It's cheaper and faster to train a small model, which is better
| for a research team to iterate on, right? If Google decides
| that a particular small model is really good, why wouldn't they
| go ahead and release it while they work on scaling up that work
| to train the larger versions of the model?
| summerlight wrote:
| I have no knowledge of Google specific cases, but in many
| teams smaller models are trained upon bigger frontier models
| through distillation. So the frontier models come first then
| smaller models later.
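|
| For context, "distillation" here usually means training the small
| model to match the big model's output distribution rather than just
| the hard labels. A generic sketch of that loss in plain PyTorch
| (nothing Gemini-specific; the temperature value is arbitrary):
|
|     import torch.nn.functional as F
|
|     def distillation_loss(student_logits, teacher_logits, t=2.0):
|         # soften both distributions with temperature t, then push
|         # the student's log-probs toward the teacher's probs via KL
|         student_log_probs = F.log_softmax(student_logits / t, dim=-1)
|         teacher_probs = F.softmax(teacher_logits / t, dim=-1)
|         kl = F.kl_div(student_log_probs, teacher_probs,
|                       reduction="batchmean")
|         return kl * (t * t)  # rescale gradients vs. hard-label loss
|
| In practice this is usually mixed with the ordinary cross-entropy
| loss on the training labels.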
| coder543 wrote:
| Training a "frontier model" without testing the
| architecture is very risky.
|
| Meta trained the smaller Llama 3 models first, and then
| trained the 405B model on the same architecture once it had
| been validated on the smaller ones. Later, they went back
| and used that 405B model to improve the smaller models for
| the Llama 3.1 release. Mistral started with a number of
| small models before scaling up to larger models.
|
| I feel like this is a fairly common pattern.
|
| If Google had a bigger version of Gemini 2.0 ready to go, I
| feel confident they would have mentioned it, and it would
| be difficult to distill it down to a small model if it
| wasn't ready to go.
| xyst wrote:
| Side note on Gemini: I pay for Google Workspace simply to enable
| e-mail capability for a custom domain.
|
| I never used the web interface to access email until recently. To
| my surprise, all of the AI shit is enabled by default. So it's
| very likely Gemini has been training on private data without my
| explicit consent.
|
| Of course G words it as "personalizing" the experience for me but
| it's such a load of shit. I'm tired of these companies stealing
| our data while we never get rightly compensated.
| hnuser123456 wrote:
| Gmail is hosting your email. Being able to change the domain
| doesn't change that they're hosting it on their terms. I think
| there are other email providers that have more privacy-focused
| policies.
| dandiep wrote:
| Gemini multimodal live docs here:
| https://cloud.google.com/vertex-ai/generative-ai/docs/model-...
|
| A little thin...
|
| Also no pricing is live yet. OpenAI's audio inputs/outputs are
| too expensive to really put in production, so hopefully Gemini
| will be cheaper. (Not to mention, OAI's doesn't follow
| instructions very well.)
| kwindla wrote:
| The Multimodal Live API is free while the model/API is in
| preview. My guess is that they will be pretty aggressive with
| pricing when it's in GA, given the 1.5 Flash multimodal
| pricing.
|
| If you're interested in this stuff, here's a full chat app for
| the new Gemini 2 API's with text, audio, image, camera video
| and screen video. It shows both how to use the WebSocket API
| directly and how to route through WebRTC infrastructure.
|
| https://github.com/pipecat-ai/gemini-multimodal-live-demo
| dandiep wrote:
| Thanks, this is great!
| AJRF wrote:
| I think they are really overloading the word "Agent". I know
| there isn't a standard definition - but I think Google is
| stretching the meaning far thinner than the way most C-suite
| execs talk about agents.
|
| I think DeepMind could make progress if they focused on the agent
| definition of multi-step reasoning + action through a web
| browser, and delivered a ton of value there, instead of lumping
| in the seldom-used "Look at the world through a camera" or
| "Multimodal Robots" thing.
|
| If Google cracked robots, past plays show that the market for
| those isn't big enough to interest Google. Like VR, you just
| can't get a billion people to be interested in robots - so even
| if they make progress, it won't survive under Google.
|
| The "Look at the world through a camera" thing is a footnote in
| an Android release.
|
| Agentic computer use _is_ a product a billion people would use,
| and it's adjacent to the business interests of Google Search.
| jstummbillig wrote:
| Reminder that implied models are not actual models. Models have
| failed to materialize repeatedly and vanished without further
| mention. I assume no one is trying to be misleading but, at this
| point, maybe overly optimistic.
| nycdatasci wrote:
| Gemini 2.0 Flash is available here:
| https://aistudio.google.com/prompts/new_chat
|
| Based on initial interactions, it's extremely verbose. It seems
| to be focused on explaining its reasoning, but even after just a
| few interactions I have seen some surprising hallucinations. For
| example, to probe its awareness of current AI news, I asked "Why
| hasn't Anthropic released Claude 3.5 Opus yet?" Gemini responded
| with text that included "Why haven't they released Claude 3.5
| Sonnet First? That's an interesting point." There's clearly some
| reflection/attempted reasoning happening, but it doesn't feel
| competitive with o1 or the new Claude 3.5 Sonnet that was trained
| on 3.5 Opus output.
| simonw wrote:
| I released a new llm-gemini plugin with support for the Gemini
| 2.0 Flash model, here's how to use that in the terminal:
|     llm install -U llm-gemini
|     llm -m gemini-2.0-flash-exp 'prompt goes here'
|
| LLM installation: https://llm.datasette.io/en/stable/setup.html
|
| Worth noting that the Gemini models have the ability to write and
| then execute Python code. I tried that like this:
| llm -m gemini-2.0-flash-exp -o code_execution 1 \
| 'write and execute python to generate a 80x40 ascii art fractal'
|
| Here's the result:
| https://gist.github.com/simonw/0d8225d62e8d87ce843fde471d143...
|
| It can't make outbound network calls though, so this fails:
| llm -m gemini-2.0-flash-exp -o code_execution 1 \
| 'write python code to retrieve https://simonwillison.net/ and use
| a regex to extract the title, run that code'
|
| Amusingly Gemini itself doesn't know that it can't make network
| calls, so it tries several different approaches before giving up:
| https://gist.github.com/simonw/2ccfdc68290b5ced24e5e0909563c...
|
| The new model seems very good at vision:
|
|     llm -m gemini-2.0-flash-exp describe -a \
|       https://static.simonwillison.net/static/2024/pelicans.jpg
|
| I got back a solid description, see here:
| https://gist.github.com/simonw/32172b6f8bcf8e55e489f10979f8f...
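|
| The same plugin also works from Python via llm's library API. A
| minimal sketch, assuming llm >= 0.17 (for attachment support) with
| llm-gemini installed and an API key set via the LLM_GEMINI_KEY
| environment variable:
|
|     import llm
|
|     model = llm.get_model("gemini-2.0-flash-exp")
|     response = model.prompt(
|         "describe",  # same prompt as the CLI example above
|         attachments=[llm.Attachment(
|             url="https://static.simonwillison.net/static/2024/pelicans.jpg"
|         )],
|     )
|     print(response.text())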
| rafram wrote:
| > Some pelicans have white on their heads, suggesting that some
| of them are older birds.
|
| Interesting theory!
| smackay wrote:
| Brown Pelican (Pelecanus occidentalis) heads are white in the
| breeding season. Birds start breeding aged three to five. So
| technically the statement is correct but I wonder if Gemini
| didn't get its pelicans and cormorants in a muddle. The
| mainland European Great Cormorant (Phalacrocorax carbo
| sinensis) has a head that gets progressively whiter as birds
| age.
| pcwelder wrote:
| Code execution is okay, but soon runs into the problem of
| missing packages that it can't install.
|
| Practically, sandboxing hasn't been super important for me.
| Running Claude with MCP-based shell access has been working
| fine for me, as long as you instruct it to use a venv, a
| temporary directory, etc.
| UltraSane wrote:
| Is there a guide on how to do that?
| pcwelder wrote:
| For building an MCP server? The official docs do a great job:
|
| https://modelcontextprotocol.io/introduction
|
| My own MCP server could be an inspiration on Mac. It's
| based on pexpect to enable a REPL session and has some tricks
| to prevent bad commands.
|
| https://github.com/rusiaaman/wcgw
|
| However, I recommend creating one with your own customised
| prompts and tools for maximum benefit.
| stavros wrote:
| I wrote a program that can do more or less the same thing,
| if you only care about the LLM running commands to help you
| do something:
|
| https://github.com/skorokithakis/sysaidmin
| mnky9800n wrote:
| Can it run ipython? Then you could use ipython magic to pip
| install things:
|
| https://ipython.readthedocs.io/en/stable/interactive/magics..
| ..
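|
| For example, from inside an IPython or Jupyter session the pip
| magic installs into the environment of the running kernel:
|
|     %pip install requests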
| simonw wrote:
| Published some more detailed notes on my explorations of Gemini
| 2.0 here https://simonwillison.net/2024/Dec/11/gemini-2/
| bravura wrote:
| Question: Have you tried using this for video?
|
| Alternately, if I wanted to pipe a bunch of screencaps into it
| and get one grand response, how would I do that?
|
| e.g. "Does the user perform a thumbs up gesture in any of these
| stills?"
|
| [edit: also, do you know the vision pricing? I couldn't find it
| easily]
| simonw wrote:
| Previous Gemini models worked really well for video, and this
| one can even handle streaming video:
| https://simonwillison.net/2024/Dec/11/gemini-2/#the-
| streamin...
| melvinmelih wrote:
| No mention of Perplexity yet in the comments but it's obvious to
| me that they're targeting Perplexity Pro directly with their new
| Deep Research feature
| (https://blog.google/products/gemini/google-gemini-deep-
| resea...). I still wonder why Perplexity is worth $7 billion when
| the 800-pound gorilla is pounding on their door (albeit slowly).
| yandie wrote:
| Just tried Deep Research. It's a much, much slower experience
| than Perplexity at the moment - taking its sweet time, many
| minutes, to return a result. Maybe it's more thorough, but I use
| Perplexity for quick information summaries a lot and this is a
| very different UX.
|
| Haven't used it enough to evaluate the quality, however.
| BoorishBears wrote:
| Before dropping it for a different project that got some
| traction, "Slow Perplexity" was something I was pretty set on
| building.
|
| Perplexity is a much less versatile product than it could be,
| in the chase for speed: you can only chew through so many
| tokens, do so much CoT, etc. in a given amount of time.
|
| They optimized for virality (it's just as fast as Google but
| gives me more info!) but I suspect it kills the stickiness
| for a huge number of users since you end up with some
| "embarrassing misses": stuff that should have been a slam
| dunk, goes off the rails due to not enough search, or the
| wrong context being surfaced from the page, etc. and the user
| just doesn't see value in it anymore.
| ianbutler wrote:
| Unfortunately the 10 RPM quota for this experimental model isn't
| enough to run an actual agentic experience on.
|
| That's my main issue with Google: there are several models we
| want to try with our agent, but quota is limited and we have to
| jump through hoops to see if we can get it raised.
| aantix wrote:
| Their offering is just so... bad. Even the new model. All the
| data in the world, yet they trail behind.
|
| They have all of these extensions that they use to prop up the
| results in the web UI.
|
| I was asking for a list of related YouTube videos - the UI
| returns them.
|
| Ask the API the same prompt and it returns a bunch of made-up
| YouTube titles and descriptions.
|
| How could I ever rely on this product?
| fuddle wrote:
| I'd be interested to see Gemini 2.0's performance on SWE-Bench.
| mherrmann wrote:
| Their Mariner tool for controlling the browser sounds scary and
| exciting. At the moment, it's an extension, which means
| JavaScript. Some web sites block automation that happens this
| way, and developers resort to tools such as Selenium. These use
| the Chrome DevTools API to automate the browser. It's better, but
| can still be distinguished from normal use with very technical
| details. I wonder if Google, which still owns Chrome, will give
| extensions better APIs for automation that cannot be
| distinguished from normal use.
| tpoacher wrote:
| I know this isn't really a useful comment, but, I'm still sour
| about the name they chose. They MUST have known about the Gemini
| protocol. I'm tempted to think it was intentional, even.
|
| It's like Microsoft creating an AI tool and calling it Peertube.
| "Hurr durr they couldn't possibly be confused; one is a
| decentralised video platform and the other is an AI tool hurr
| durr. And ours is already more popular if you 'bing' it hurr
| durr."
| jvolkman wrote:
| > It's like Microsoft creating an AI tool and calling it
| Peertube.
|
| How is it like that? Gemini is a much more common word than
| Peertube. https://en.wikipedia.org/wiki/Gemini
| petesergeant wrote:
| Speed looks good vis-a-vis 4o-mini, and quality looks good so far
| against my eval set. If it's cheaper than 4o-mini too (which, it
| probably will be?) then OpenAI have a real problem, because
| switching between them is just changing a value in a config file.
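|
| As a sketch of what that switch looks like in practice (assuming
| you go through OpenAI-compatible chat endpoints for both providers;
| the Gemini base URL below is Google's documented compatibility
| endpoint, but treat the details as illustrative):
|
|     from openai import OpenAI
|
|     # which provider to use is just configuration
|     PROVIDERS = {
|         "openai": {
|             "base_url": "https://api.openai.com/v1",
|             "model": "gpt-4o-mini",
|         },
|         "gemini": {
|             "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
|             "model": "gemini-2.0-flash-exp",
|         },
|     }
|
|     def complete(provider: str, api_key: str, prompt: str) -> str:
|         cfg = PROVIDERS[provider]
|         client = OpenAI(base_url=cfg["base_url"], api_key=api_key)
|         resp = client.chat.completions.create(
|             model=cfg["model"],
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp.choices[0].message.content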
| smallerfish wrote:
| Was this written by an LLM? It's pretty bad copy. Maybe they laid
| off their copywriting team...?
|
| > "Now millions of developers are building with Gemini. And it's
| helping us reimagine all of our products -- including all 7 of
| them with 2 billion users -- and to create new ones"
|
| and
|
| > "We're getting 2.0 into the hands of developers and trusted
| testers today. And we're working quickly to get it into our
| products, leading with Gemini and Search. Starting today our
| Gemini 2.0 Flash experimental model will be available to all
| Gemini users."
| utopcell wrote:
| Sorry, what's wrong with these phrases?
| echelon wrote:
| It reads like a transcribed speech. You can picture this
| being read from a teleprompter at a conference keynote.
|
| Short sentence fact. And aspirational tagline - pause for
| some metrics - and more. And. Today. And. And. Today.
| krona wrote:
| > all of our products -- including all 7 of them
|
| All the products including all the products?
| iamdelirium wrote:
| Why did you specifically ignore the remainder of the
| sentence?
|
| "...all of our products -- including all 7 of them with 2
| billion users..."
|
| It tells people that 7 of their products have 2b users.
| fluoridation wrote:
| That's not really any better, since "all of our products"
| already includes the subset that has at least 2B users.
| "I brought all my shoes, including all my red shoes."
| stavros wrote:
| They're pointing out that seven of their products have
| more than 2 billion users.
|
| "I brought all my shoes, including the pairs that cost
| over $10,000" is saying something about what shoes you
| brought, more than "all of them".
| fluoridation wrote:
| Why are they bragging about something completely
| unrelated in the middle of a sentence about the impact of
| a piece of technology?
|
| -Hey, are you done packing?
|
| -Yes, I decided I'll bring all my shoes, including the
| ones that cost over $10,000.
|
| What, they just couldn't help themselves?
| stavros wrote:
| The fact that they're using Gemini with even their most
| important products shows that they trust it.
| fluoridation wrote:
| Again, that's covered by "all our products". Why do we
| need to be reminded that Google has a lot of users?
| Someone oblivious to that isn't going to care about this
| press release.
| aerhardt wrote:
| That phrasing still sucks, I am neither a native speaker
| nor a wordsmith but I've worked with professional English
| writers who could make that look and sound infinitely
| better.
| jay_kyburz wrote:
| all of our products, 7 of which have over 2 billion
| users..
| hombre_fatal wrote:
| The meme of LLM generated content is that it's verbose and
| formal, not that it's poorly written.
|
| It's why the quoted text is obviously written by a human.
| contagiousflow wrote:
| There's no law that says LLM-generated text has to be bad in
| a singular way
| scudsworth wrote:
| executive spotted
| ryandvm wrote:
| I am sure Google has the resources to compete in this space. What
| I'm less sure about is whether Google can monetize AI in a way
| that doesn't cannibalize their advertising income.
|
| Who the hell wants an AI that has the personality of a car
| salesman?
| ipsum2 wrote:
| Tested out Gemini-2 Flash, I had such high hopes that a better
| base model would help. It still hallucinates like crazy compared
| to GPT-4o.
| gbickford wrote:
| Small models don't "know" as much so they hallucinate more.
| They are better suited for generation that is grounded in
| supplied source material, as in a RAG setup.
|
| A better comparison might be Flash 2.0 vs 4o-mini. Even then,
| the models aren't meant to have vast world knowledge, so
| benchmarking them on that isn't a great indicator of how they
| would be used in real-world cases.
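|
| A minimal sketch of that grounding pattern (no particular retrieval
| system or model API assumed; build_grounded_prompt just stuffs
| retrieved text into the prompt so the small model answers from the
| supplied context rather than from memory):
|
|     def build_grounded_prompt(question: str, documents: list[str]) -> str:
|         # concatenate retrieved passages into a single context block
|         context = "\n\n".join(documents)
|         return (
|             "Answer using only the context below. If the answer is "
|             "not in the context, say you don't know.\n\n"
|             f"Context:\n{context}\n\nQuestion: {question}"
|         )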
| ipsum2 wrote:
| Yes, it's not an apples to apples comparison. My point is the
| position it's at on the lmarena leaderboard is misplaced due
| to the hallucination issues.
| CSMastermind wrote:
| > We're also launching a new feature called Deep Research, which
| uses advanced reasoning and long context capabilities to act as a
| research assistant, exploring complex topics and compiling
| reports on your behalf. It's available in Gemini Advanced today.
|
| Anyone seeing this? I don't have an option in my dropdown.
| atorodius wrote:
| Rolling out over the next few days, according to Jeff.
| fudged71 wrote:
| Not seeing it yet on web or mobile (in Canada)
| echohack5 wrote:
| Is this what it feels like to become one of the gray bearded
| engineers? This sounds like a bunch of intentionally confusing
| marketing drivel.
|
| When capitalism has pilfered everything from the pockets of
| working people so people are constantly stressed over healthcare
| and groceries, and there's little left to further the pockets of
| plutocrats, the only marketing that makes sense is to appeal to
| other companies in order to raid their coffers by tricking their
| Directors into buying a nonsensical product.
|
| Is that what they mean by "agentic era"? Cause that's what it
| sounds like to me. Also smells a lot like press-release-driven
| development where the point is to put a feather in the cap of
| whatever poor Google engineer is chasing their next promotion.
| weatherlite wrote:
| > Is that what they mean by "agentic era"? Cause that's what it
| sounds like to me.
|
| What are you basing your opinion on? I have no idea how well
| these LLM agents will perform, but it's definitely a thing.
| OpenAI is working on them, as are Anthropic and certainly Google.
| cush wrote:
| Yeah it's a lot of marketing fluff but these tools are
| genuinely useful, and it's no wonder Google is working
| hard to prevent them from destroying their search-dependent
| bottom line.
|
| Marketing aside, agents are just LLMs that can reach out of
| their regular chat bubbles and use tools. Seems like just the
| next logical evolution
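|
| A toy version of that loop, with call_model() standing in for
| whichever LLM API you use (the reply format here is made up for the
| sketch):
|
|     import json
|
|     TOOLS = {
|         "get_time": lambda args: "2024-12-11T23:00:00Z",  # stub tool
|     }
|
|     def run_agent(call_model, user_message, max_steps=5):
|         history = [{"role": "user", "content": user_message}]
|         for _ in range(max_steps):
|             # model returns {"answer": ...} or {"tool": ..., "args": ...}
|             reply = call_model(history)
|             if "answer" in reply:
|                 return reply["answer"]
|             result = TOOLS[reply["tool"]](reply.get("args", {}))
|             history.append({
|                 "role": "tool",
|                 "content": json.dumps({"tool": reply["tool"],
|                                        "result": result}),
|             })
|         return "gave up after max_steps"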
| serjester wrote:
| Buried in the announcement is the real gem -- they're releasing a
| new SDK that actually looks like it follows modern best
| practices. Could be a game-changer for usability.
|
| They've had OpenAI-compatible endpoints for a while, but it's
| never been clear how serious they were about supporting them
| long-term. Nice to see another option showing up. For reference,
| their main repo (not kidding) recommends setting up a Kubernetes
| cluster and a GCP bucket to submit batch requests.
|
| [1]https://github.com/googleapis/python-genai
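|
| For reference, a minimal call with the new SDK looks roughly like
| this (based on the python-genai README at the time of writing;
| names may still shift while it's in preview):
|
|     # pip install google-genai
|     from google import genai
|
|     client = genai.Client(api_key="YOUR_API_KEY")  # or Vertex AI auth
|     response = client.models.generate_content(
|         model="gemini-2.0-flash-exp",
|         contents="Summarize what an agentic workflow is.",
|     )
|     print(response.text)
|
| No Kubernetes cluster or GCS bucket required, which is the point.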
| pkkkzip wrote:
| It's interesting that just as the LLM hype appears to be
| simmering down, DeepMind is making big strides. I'm more
| excited by this than any of OpenAI's announcements.
| ComputerGuru wrote:
| I've been using gemini-exp-1206 and I notice a lot of
| similarities to the new gemini-2.0-flash-exp: they're not that
| much _actually smarter_ but they go out of their way to convince
| you they are with overly verbose "reasoning" and explanations.
| The reasoning and explanations aren't necessarily wrong per se,
| but put them aside and focus on the actual logical reasoning
| steps and conclusions to your prompts and it's still very much a
| dumb model.
|
| The models do just fine on "work" but are terrible for
| "thinking". The verbosity of the explanations (and the sheer
| amount of praise the models like to give the prompter - I've
| never had my rear end kissed so much!) should make one wary of
| subjective reviews of their performance, as opposed to objective
| reviews focusing solely on correct/incorrect.
| epolanski wrote:
| I'm not gonna lie I like Google's models.
|
| Flash combines speed and cost and is extremely good to build apps
| on.
|
| People really take that whole benchmarking thing more seriously
| than necessary.
| Animats wrote:
| _" Over the last year, we have been investing in developing more
| agentic models, meaning they can understand more about the world
| around you, think multiple steps ahead, and take action on your
| behalf, with your supervision."_
|
| "With your supervision". Thus avoiding Google being held
| responsible. That's like Tesla's Fake Self Driving, where the user
| must have their hands on the wheel at all times.
| sorenjan wrote:
| Published the day after one of the authors, Demis Hassabis,
| received his Nobel prize in Stockholm.
| beepbooptheory wrote:
| We are moving through eras faster than years these days.
| strongpigeon wrote:
| Did anyone get to play with the native image generation part? In
| my experience, Imagen 3 was much better than the competition so
| I'm curious to hear people's take on this one.
| strongpigeon wrote:
| Hrm, when I tried to get it to generate an image it said it was
| using Imagen 3. Not sure what "native" image generation means
| then.
| fpgaminer wrote:
| I work with LLMs and MLLMs all day (as part of my work on
| JoyCaption, an open source VLM). Specifically, I spend a lot of
| time interacting with multiple models at the same time, so I get
| the chance to very frequently compare models head-to-head on real
| tasks.
|
| I'll give Flash 2 a try soon, but I gotta say that Google has
| been doing a great job catching up with Gemini. Both Gemini 1.5
| Pro 002 and Flash 1.5 can trade blows with 4o, and are easily
| ahead of the vast majority of other major models (Mistral Large,
| Qwen, Llama, etc). Claude is usually better, but has a major flaw
| (to be discussed later).
|
| So, here's my current rankings. I base my rankings on my work,
| not on benchmarks. I think benchmarks are important and they'll
| get better in time, but most benchmarks for LLMs and MLLMs are
| quite bad.
|
| 1) 4o and its ilk are far and away the best in terms of accuracy,
| both for textual tasks as well as vision related tasks.
| Absolutely nothing comes even close to 4o for vision related
| tasks. The biggest failing of 4o is that it has the worst
| instruction following of commercial LLMs, and that instruction
| following gets _even_ worse when an image is involved. A prime
| example is when I ask 4o to help edit some text, to change
| certain words, verbiage, etc. No matter how I prompt it, it will
| often completely re-write the input text to its own style of
| speaking. It's a really weird failing. It's like their RLHF
| tuning is hyper focused on keeping it aligned with the
| "character" of 4o to the point that it injects that character
| into all its outputs no matter what the user or system
| instructions state. o1 is a MASSIVE improvement in this regard,
| and is also really good at inferring things so I don't have to
| explicitly instruct it on every little detail. I haven't found
| o1-pro overly useful yet. o1 is basically my daily driver outside
| of work, even for mundane questions, because it's just better
| across the board and the speed penalty is negligible. One
| particular example of o1 being better came up yesterday.
| I had it re-word an image description, and I thought it had
| introduced a detail that wasn't in the original description.
| Well, I was wrong and had accidentally skimmed over that detail
| in the original. It _told_ me I was wrong, and didn't update the
| description! Freaky, but really incredible. 4o never corrects me
| when I give it an explicit instruction.
|
| 4o is fairly easy to jailbreak. They've been turning the screws
| for a while so it isn't as easy as day 1, but even o1-pro can be
| jailbroken.
|
| 2) Gemini 1.5 Pro 002 (specifically 002) is second best in my
| books. I'd guesstimate it at being about 80% as good as 4o on
| most tasks, including vision. But it's _significantly_ better at
| instruction following. Its RLHF is a lot lighter than ChatGPT
| models, so it's easier to get these models to fall back to
| pretraining, which is really helpful for my work specifically.
| But in general the Gemini models have come a long way. The
| ability to turn off model censorship is quite nice, though it
| does still refuse at times. The Flash variation is interesting;
| often on par with Pro, with Pro edging it out maybe 30% of the
| time. I don't frequently use Flash, but it's an impressive model
| for its size. (Side note: The Gemma models are ... not good.
| Google's other public models, like so400m and OWLv2 are great, so
| it's a shame their open-LLM forays are falling behind). Google
| also has the best AI playground.
|
| Jailbreaking Gemini is a piece of cake.
|
| 3) Claude is third on my list. It has the _best_ instruction
| following of all the models, even slightly better than o1. Though
| it often requires multi-turn to get it to fully follow
| instructions, which is annoying. Its overall prowess as an LLM is
| somewhere between 4o and Gemini. Vision is about the same as
| Gemini, except for knowledge based queries which Gemini tends to
| be quite bad at (who is this person? Where is this? What brand of
| guitar? etc). But Claude's biggest flaw is the insane "safety"
| training it underwent, which makes it practically useless. I get
| false triggers _all_ the time from Claude. And that's to say
| nothing of how unethical their "ethics" system is to begin with.
| And what's funny is that Claude is an order of magnitude
| _smarter_ when it's reasoning about its safety training. It's the
| only real semblance of reason I've seen from LLMs ... all just to
| deny my requests.
|
| I've put Claude three out of respect for the _technical_
| achievements of the product, but I think the developers need to
| take a long look in the mirror and ask why they think it's okay
| for _them_ to decide what people with disabilities are and are
| not allowed to have access to.
|
| 4) Llama 3. What a solid model. It's the best open LLM, hands
| down. Nowhere near the commercial models above, but for a model
| that's completely free to use locally? That's invaluable. Their
| vision variation is ... not worth using. But I think it'll get
| better with time. The 8B variation far outperforms its weight
| class. 70B is a respectable model, with better instruction
| following than 4o. The ability to finetune these models to a task
| with so little data is a huge plus. I've made task specific
| models with 200-400 examples.
|
| 5) Mistral Large (I forget the specific version for their latest
| release). I love Mistral as the "under-dog". Their models aren't
| bad, and behave _very_ differently from all other models out
| there, which I appreciate. But Mistral never puts any effort into
| polishing their models; they always come out of the oven half-
| baked. Which means they frequently glitch out, have very
| inconsistent behavior, etc. Accuracy and quality is hard to
| assess because of this inconsistency. On its best days it's up
| near Gemini, which is quite incredible considering the models are
| also released publicly. So theoretically you could finetune them
| to your task and get a commercial-grade model to run locally. But
| I rarely see anyone do that with Mistral, I think partly because of
| their weird license. Overall, I like seeing them in the race and
| hope they get better, but I wouldn't use it for anything serious.
|
| Mistral is lightly censored, but fairly easy to jailbreak.
|
| 6) Qwen 2 (or 2.5 or whatever the current version is these days).
| It's an okay model. I've heard a lot of praises for it, but in
| all my uses thus far it's always been really inconsistent,
| glitchy, and weak. I've used it both locally and through APIs. I
| guess in _theory_ it's a good model, based on benchmarks. And
| it's open, which I appreciate. But I've not found any practical
| use for it. I even tried finetuning with Qwen 2VL 72B, and my
| tiny 8B JoyCaption model beat it handily.
|
| That's about the sum of it. AFAIK that's all the major commercial
| and open models (my focus is mainly on MLLMs). OpenAI are still
| leading the pack in my experience. I'm glad to see good
| competition coming from Google finally. I hope Mistral can polish
| their models and be a real contender.
|
| There are a couple smaller contenders out there like Pixmo/etc
| from allenai. Allen AI has hands down the _best_ public VQA
| dataset I've seen, so huge props to them there. Pixmo is ...
| okayish. I tried Amazon's models a little but didn't see anything
| useful.
|
| NOTE: I refuse to use Grok models for the obvious reasons, so
| fucks to be them.
| kernal wrote:
| >Grok models
|
| Thanks for the admission as it invalidates pretty much
| everything you've said, and will say.
| nopcode wrote:
| > What can you tell me about this sculpture?
|
| > It's located in London.
|
| Mind blowing.
| jacooper wrote:
| The best thing about Gemini models is the huge context window:
| you can just throw big documents at them and find stuff really
| fast, rather than struggling with cut-offs in Perplexity or
| Claude.
| bluelightning2k wrote:
| Does anyone know how to sign up to the speech output wait list or
| tester program? I have a decent spend with GCP over the years, if
| that helps at all. Really want DemoTime videos to use those
| voices. (I like how truly incredible, best-in-the-world TTS is
| just a footnote in this larger announcement.)
| jgalt212 wrote:
| Can cloudflare turnstile (and others) detect these agents as
| bots?
| jamesponddotco wrote:
| At least when it comes to Go code, I'm pretty impressed by the
| results so far. It's also pretty good at following directions,
| which is a problem I have with open source models, and seems to
| use or handle Claude's XML output very well.
|
| Overall, especially seeing as I haven't paid a dime to use the
| API yet, I'm pretty impressed.
| mfonda wrote:
| Does anyone have any insights into how Google selects source
| material for AI overviews? I run an educational site with lots of
| excellent information, but it seems to have been passed over
| entirely for AI overviews. With these becoming an increasingly
| large part of search--and from the sound of it, now more so with
| Gemini 2.0--this has me a little worried.
|
| Anyone else run into similar issues or have any tips?
| nuancebydefault wrote:
| I'm quite impressed with the flash demo's reasoning capabilities.
| I played the 20 questions game with it, and it found the computer
| mouse I had in my head. At first it was confused about our roles
| and said something weird - it thought that it had to guess its own
| word.
| Afterwards I had a meta conversation about that weirdness and it
| gave impressive insights:
|
| " Why My "I Guess What I Have in Mind" Statement Doesn't Make
| Sense The Problem of Agency: You're spot-on
| when you say it doesn't make sense for me to state "I guess what
| I (the computer) have in mind." AI does not have a "mind" in the
| same way that humans do. My internal state is a complex pattern
| of numerical data, not conscious thoughts or intentions in the
| way you have things in mind. My phrasing gave the illusion of
| having a "thing in mind," but in reality, I don't have this
| capacity. Misinterpretation of the Task: My
| statement "I guess what I have in mind" misinterprets the task.
| You were clearly signaling that I needed to ask questions to
| uncover what you were thinking of. My failure to grasp this
| created a weird scenario where it seemed like I was trying to
| determine my own data set!"
| paradite wrote:
| Honestly this post makes Google sound like the new IBM. Very
| corporate.
|
| "Hear from our CEO first, and then our other CEO in charge of
| this domain and CTO will tell you the actual news."
|
| I haven't seen other tech companies write like that.
___________________________________________________________________
(page generated 2024-12-11 23:00 UTC)