[HN Gopher] Gemini 2.5 Pro Preview
___________________________________________________________________
Gemini 2.5 Pro Preview
Author : meetpateltech
Score : 434 points
Date : 2025-05-06 15:10 UTC (7 hours ago)
(HTM) web link (developers.googleblog.com)
(TXT) w3m dump (developers.googleblog.com)
| jeswin wrote:
| Now if only there were a way to add prepaid credits and monitor
| usage in near real-time on a dashboard, like every other vendor.
| Hey Google, are you listening?
| greenavocado wrote:
| You can do that by using deepinfra to manage your billing. It's
| pay-as-you-go and they have a pass-through virtual target for
| Google Gemini.
|
| Deepinfra's token usage updates every time you switch to the tab
| if it's open to the usage page, so you can see updates as often
| as every second.
| Hawkenfall wrote:
| You can do this with https://openrouter.ai/
| pzo wrote:
| But if you want to use the Google SDK (python-genai, js-genai)
| rather than the OpenAI SDK (I found the Google API more feature-
| rich when using different modalities like audio/images/video),
| you cannot use OpenRouter. Also, if you are developing an app
| and need higher rate limits - what's the typical rate limit via
| OpenRouter?
| pzo wrote:
| Also, for some reason, when I tested a simple prompt (a few
| words, no system prompt) with 1 attached image, OpenRouter
| charged me ~1700 tokens, whereas going directly via python-genai
| it's ~400 tokens. Also keep in mind they charge a small markup
| fee when you top up your account.
| slig wrote:
| In the meantime, I'm using OpenRouter.
| cchance wrote:
| OpenRouter. I don't think anyone should use Google direct till
| they fix their shit billing.
| greenavocado wrote:
| Even afterwards. Avoid paying directly if you can because
| they generally could not care less about individuals.
|
| If you have less than $10 million in spend you will be treated
| worse than cattle, because at least farmers feed their cattle
| before they are milked.
| tucnak wrote:
| You need LLM ops. YC happens to have invested in Langfuse;
| if you're serious about tracking metrics, you'll appreciate
| the rest of it, too.
|
| And before you ask: yes, for cached content and batch
| completion discounts you can accommodate both--just needs a bit
| of logic in your completion-layer code.
| therealmarv wrote:
| Is this on Google AI Studio or Google Vertex or both?
| simple10 wrote:
| You can do this with LLM proxies like LiteLLM. e.g. Cursor ->
| LiteLLM -> LLM provider API.
|
| I have LiteLLM server running locally with Langfuse to view
| traces. You configure LiteLLM to connect directly to providers'
| APIs. This has the added benefit of being able to create
| LiteLLM API keys per project that proxy to different sets of
| provider API keys to monitor or cap billing usage.
|
| I use https://github.com/LLemonStack/llemonstack/ to spin up
| local instances of LiteLLM and Langfuse.
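|
| A rough sketch of the client side, assuming a LiteLLM proxy on
| localhost:4000 and a per-project virtual key (the key and the
| model alias below are made up - they come from your LiteLLM
| config):
|
| ```python
| from openai import OpenAI
|
| # LiteLLM exposes an OpenAI-compatible endpoint, so any OpenAI
| # client can point at it; usage then shows up in Langfuse traces.
| client = OpenAI(
|     base_url="http://localhost:4000",  # local LiteLLM proxy
|     api_key="sk-my-project-key",       # per-project virtual key
| )
|
| resp = client.chat.completions.create(
|     model="gemini-2.5-pro",  # alias defined in the proxy config
|     messages=[{"role": "user", "content": "Hello"}],
| )
| print(resp.choices[0].message.content)
| ```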
| ramesh31 wrote:
| >Best-in-class frontend web development
|
| It really is wild to have seen this happen over the last year.
| The days of traditional "design-to-code" FE work are completely
| over. I haven't written a line of HTML/CSS in months. If you are
| still doing this stuff by hand, you need to adapt fast. In
| conjunction with an agentic coding IDE and a few MCP tools, weeks'
| worth of UI work is now done in hours, to a _higher_ level of
| quality and consistency, with practically zero effort.
| shostack wrote:
| What does your tool and model stack look like for this?
| ramesh31 wrote:
| Cline with Gemini 2.5 (https://cline.bot/)
|
| Framelink MCP (https://github.com/GLips/Figma-Context-MCP)
|
| Playwright MCP (https://github.com/microsoft/playwright-mcp)
|
| Pull down designs via Framelink, optionally enrich with PNG
| exports of nodes added as image uploads to the prompt, write
| out the components, test/verify via Playwright MCP.
|
| Gemini has a 1M context size now, so this applies to large
| mature codebases as well as greenfield. The key thing here is
| the coding agent being really clever about maintaining its
| context; you don't need to fit an entire codebase into a
| single prompt in the same way that you don't need to fit the
| entire codebase into your head to make a change, you just
| need enough context on the structure and form to maintain the
| correct patterns.
| jjani wrote:
| The designs themselves are still done by humans, I presume?
| ramesh31 wrote:
| >The designs themselves are still done by humans, I presume?
|
| Indeed, in fact design has become the bottleneck now.
| Figma has really dropped the ball here WRT building out
| AI assisted (not _driven_ ) tooling for designers.
| redox99 wrote:
| What tools do you use?
| amarcheschi wrote:
| I'm surprised by no line of CSS/HTML in months. Maybe it's an
| exaggeration and that's okay.
|
| However, just today I was building a website for fun with
| Gemini and had to manually fix some issues with CSS that it
| struggled with. As often happens, trying to let it repair the
| damage only made it go into a pit of despair (for me). I fixed
| the issues in about a glance and 5 minutes. This is not to say
| it's bad, but sometimes it still makes absurd mistakes and
| can't find a way to solve them.
| PaulHoule wrote:
| I have pretty good luck with AI assistants with CSS and with
| theming React components like MUI where you have to figure
| out what to put in an sx or a theme. Sure beats looking
| through 50 standards documents (fortunately not a lot of
| "document A invalidates document B" in that pile) or digging
| through wrong answers where ignoramuses hold court on
| StackOverflow.
| ramesh31 wrote:
| >"just today i was building a website for fun with gemini and
| had to manually fix some issues with css that he struggled
| with."
|
| Tailwind (with utility classes) is the real key here. It
| provides a semantic layer over CSS that allows the LLM to
| reason about how things will actually look. Night and day
| difference from using stylesheets with custom classes.
| dlojudice wrote:
| > are now done in hours to a higher level of quality
|
| However, I feel that there is a big difference between the
| models. In my tests, using Cursor, Claude 3.7 Sonnet has a much
| more refined "aesthetic sense" than other models. Many times I
| ask "make it more beautiful" and it manages to improve, where
| other models just can't understand it.
| danielbln wrote:
| I've noticed the same, but I wonder if this new Gemini
| checkpoint is better at it now.
| preommr wrote:
| Elaborate, because I have serious doubts about this.
|
| If we're talking about just slapping on tailwind+component-
| library(e.g. shadcn-ui, material), then that's just one step-
| above using no-code solutions. Which, yes, that works well. But
| if someone didn't need customized logic, then it was always
| possible to just hop on fiverr or use some very simple
| template-based tools to accomplish this.
|
| If we're talking more advanced logic, understanding aesthetics,
| etc. Then I'd say it's much worse than other coding areas like
| backend, because they work on a visual and ux level beyond just
| code which is just text manipulation (and what llms excel at).
| In other words, I think the results are still very shallow
| beyond first impressions.
| kweingar wrote:
| If it's zero effort, then why do devs need to adapt fast? And
| wouldn't adapting be incredibly easy?
|
| The only disadvantage to not using these tools would be that
| your current output is slower. As soon as your employer asks
| for more or you're looking for a new job, you can just turn on
| AI and be as fast as everyone who already uses it.
| Workaccount2 wrote:
| "Why are we paying you $150k/yr to middleman a chatbot?"
| ramesh31 wrote:
| >"Why are we paying you $150k/yr to middleman a chatbot?"
|
| Because I don't get paid $150k/yr to write HTML and CSS. I
| get paid to provide technical solutions to business
| problems. And "chatbots" are a very useful new tool to aid
| in that.
| kweingar wrote:
| > I get paid to provide technical solutions to business
| problems.
|
| That's true of all SWEs who write HTML and CSS, and it's
| the reason I don't think there's much downside for devs
| to not proactively start using these agentic tools.
|
| If it truly turns weeks of work into hours as you say,
| then my managers will start asking me to use them, and I
| will use them. I won't be at a disadvantage compared to
| people who started using them a bit earlier than me.
|
| If I am looking for a new job and find an employer that
| wants people to use agentic tools, then I will tell the
| hiring manager that I will use those tools. Again, no
| disadvantage.
|
| Being outdated as a tech employee puts you at a
| disadvantage to the extent that there is a difficult-to-
| cross gap. If you are working in COBOL and the market
| demands Rust engineers, then you need a significant
| amount of learning/experience to catch up.
|
| But a major pitch of AI tools is that it is not difficult
| to cross the gap. You draw on your domain experience to
| describe what you want, and it gives it to you. When it
| makes a mistake, you draw on your domain experience to
| tweak or fix things as needed.
|
| Maybe someday there will be a gap. Maybe people will
| develop years of experience and intuition using
| particular AI tools that makes them much more attractive
| than somebody without this experience. But the tools are
| churning so quickly (Claude Code and Cursor are brand
| new, tools from 18 months ago are obsolete, newer and
| better tools are surely coming soon) that this seems far
| off.
| jaccola wrote:
| Yup, I see comments like the parent all of the time and they
| are always a head scratcher. They would be far more rational
| (and a bit desperate) if they were trying to sell something,
| but they never appear to be.
|
| Always "10x"/"100x" more productive with AI, "you will miss
| out if you don't adopt now"! Build a great company 100x
| faster and every rational actor in the market will notice,
| believe you and be begging to adopt your ways of working (and
| you will become filthy rich as a nice kicker).
|
| The proof of the pudding is in the eating.
| mediaman wrote:
| I find they achieve acceptable, but not polished levels of
| work.
|
| I'm not even a designer, but I care about the consistency of UI
| design and whether the overall experience is well-organized,
| aligned properly, things are placed in a logical flow for the
| user, and so on.
|
| While I'm pro-AI tooling and use it heavily, and these models
| usually provide a good starting point, I can't imagine shipping
| the slop without writing/editing a line of HTML for anything
| that's interaction-heavy.
| siwakotisaurav wrote:
| Usually don't believe the benchmarks but first in web dev arena
| specifically is crazy. That one has been Claude for so long,
| which tracks in my experience
| hersko wrote:
| Give Gemini a shot. It is genuinely very good.
| enraged_camel wrote:
| I'm wondering when Claude 4 will drop. It's long overdue.
| danielbln wrote:
| I was a little disappointed when the last thing coming out of
| Anthropic was their MAX pricing plan instead of a better
| model...
| Etheryte wrote:
| For me, Claude 3.7 was a noticeable step down across a wide
| range of tasks when compared to 3.5 with the same prompt.
| Benchmarks are one thing, but for real life use, I kept
| finding myself switching back to 3.5. Wouldn't be surprised
| if they were trying to figure out what happened there and how
| to prevent that in the next version.
| ranyume wrote:
| I don't know if I'm doing something wrong, but every time I ask
| gemini 2.5 for code it outputs SO MANY comments. An exaggerated
| amount of comments. Section comments, step comments, block
| comments, inline comments, all the gang.
| cchance wrote:
| And comments are bad? I mean you could tell it not to comment
| the code, or to self-document with naming instead of inline
| comments. It's an LLM; it does what you tell it to.
| tucnak wrote:
| Ask it to do less of it, problem solved, no? With tools like
| Cursor it's become really easy to fit the models to the shoe,
| or the shoe to the foot.
| GaggiX wrote:
| You can ask it not to use comments, or to use fewer comments;
| you can put this in the system prompt too.
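|
| Something like this with the python-genai SDK - a sketch from
| memory, so treat the exact parameter names and the model string
| as assumptions to verify against the docs:
|
| ```python
| from google import genai
| from google.genai import types
|
| client = genai.Client(api_key="...")  # your AI Studio key
|
| resp = client.models.generate_content(
|     model="gemini-2.5-pro-preview-05-06",
|     contents="Implement feature X",
|     config=types.GenerateContentConfig(
|         # steer comment style up front instead of fixing it later
|         system_instruction=(
|             "Keep comments minimal: only explain *why* when "
|             "something is non-obvious, never restate the code."
|         ),
|     ),
| )
| print(resp.text)
| ```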
| ChadMoran wrote:
| I've tried this, aggressively and it still does it for me. I
| gave up.
| ziml77 wrote:
| I tried this as well. I'm interfacing with Gemini 2.5 using
| Cursor and I have rules to limit the comments. It still
| ends up over-commenting.
| shawabawa3 wrote:
| I have a feeling this may be a Cursor issue; perhaps
| Cursor's system prompt asks for comments? Asking in the
| aistudio UI for code and ending the prompt with "no code
| comments" has always worked for me
| koakuma-chan wrote:
| Have you tried threats?
| throwup238 wrote:
| It strips the comments from the code or else it gets the
| hose again.
| blensor wrote:
| Maybe too many comments could be a good metric to check whether
| someone just yolo-accepted the result or actually checked that
| it's correct.
|
| I don't have a problem with getting lots of comments in the
| output; I just delete them while reading what it did.
| tough wrote:
| Another great tell of code reviewers yolo'ing it is that LLMs
| usually put the full filename path at the top of the output, so
| if you see a file with the filename/path on the first line,
| that's probably LLM output.
| Scene_Cast2 wrote:
| It also does super defensive coding. Not that it's a bad thing
| in general, but I write a lot of prototype code.
| prpl wrote:
| Production quality code is defensive. Probably trained on a
| lot of google code.
| montebicyclelo wrote:
| Does the code consist of many large try/except blocks that
| catch "Exception"? Gemini seems to like doing that. (I thought
| it was bad practice to catch the generic Exception in Python.)
| hnuser123456 wrote:
| Catching the generic exception is a nice middleground
| between not catching exceptions at all (and letting your
| script crash), and catching every conceivable exception
| individually and deciding exactly how to handle each one.
| Depends on how reliable you need your code to be.
| montebicyclelo wrote:
| Hmm, for my use case just allowing the lines to fail would
| have been better (which I told the model).
| Tainnor wrote:
| Depends on what you mean by "defensive". Anticipating error
| and non-happy-path cases and handling them is definitely
| good. Also fault tolerance, i.e. allowing parts of the
| application to fail without bringing down everything.
|
| But I've heard "defensive code" used for the kind of code
| where almost every method validates its input parameters,
| wraps everything in a try-catch, returns nonsensical
| default values in failure scenarios, etc. This is a
| complete waste because the caller won't know what to do
| with the failed validations or thrown errors, and it's just
| unnecessary bloat that obfuscates the business logic.
| Validation, error handling and so on should be done in
| specific parts of the codebase (bonus points if you can
| encode the successful validation or the presence/absence of
| errors in the type system).
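|
| A contrived sketch of the contrast:
|
| ```python
| import logging
|
| # "Defensive" in the bad sense: swallow everything, return a
| # default the caller can't do anything sensible with.
| def load_config_defensive(path):
|     try:
|         with open(path) as f:
|             return f.read()
|     except Exception as e:
|         logging.error("failed to load config: %s", e)
|         return None  # caller now has to guess what went wrong
|
| # Narrow handling at the boundary; unexpected errors propagate.
| def load_config(path):
|     try:
|         with open(path) as f:
|             return f.read()
|     except FileNotFoundError:
|         raise SystemExit(f"config file not found: {path}")
| ```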
| neilellis wrote:
| this!
|
| Lots of hasattr("") rubbish. I've increased the amount of
| prompting but it still does this - basically it defers its
| lack of compile-time knowledge to runtime: 'let's hope for
| the best, and see what happens!'
|
| Trying to teach it FAIL FAST is an uphill struggle.
|
| Oh and yes, returning mock objects if something goes
| wrong is a favourite.
|
| It truly is an Idiot Savant - but still amazingly
| productive.
| Benjammer wrote:
| I've found that heavily commented code can be better for the
| LLM to read later, so it pulls in explanatory comments into
| context at the same time as reading code, similar to pulling in
| @docs, so maybe it's doing that on purpose?
| koakuma-chan wrote:
| No, it's just bad. I've been writing a lot of Python code the
| past two days with Gemini 2.5 Pro Preview, and all of its
| code was like:
|
| ```python
| def whatever():
|     # --- SECTION ONE OF THE CODE ---
|     ...
|     # --- SECTION TWO OF THE CODE ---
|     try:
|         ...  # [some "dangerous" code]
|     except Exception as e:
|         logging.error(f"Failed to save files to {output_path}: {e}")
|         # Decide whether to raise the error or just warn
|         # raise IOError(f"Failed to save files to {output_path}: {e}")
| ```
|
| (it adds commented out code like that all the time, "just in
| case")
|
| It's terrible.
|
| I'm back to Claude Code.
| brandall10 wrote:
| It's certainly annoying, but you can try following up with
| "can you please remove superfluous comments? In particular,
| if a comment doesn't add anything to the understanding of
| the code, it doesn't deserve to be there".
| diggan wrote:
| I'm having the same issue, and no matter what I prompt
| (even stuff like "Don't add any comments at all to
| anything, at any time") it still tries to add these
| typical junior-dev comments where it's just re-iterating
| what the code on the next line does.
| tough wrote:
| you can have a script that drops them all
| shawabawa3 wrote:
| You don't need a follow up
|
| Just end your prompt with "no code comments"
| brandall10 wrote:
| I prefer not to do that as comments are helpful to guide
| the LLM, and esp. show past decisions so it doesn't redo
| things, at least in the scope of a feature. For me this
| tends to be more of a final refactoring step to tidy them
| up.
| NeutralForest wrote:
| I'm seeing it try to catch blind exceptions in Python all the
| time. I see it in my colleagues' code too; it's driving me
| nuts.
| jerkstate wrote:
| There are a bunch of stupid behaviors of LLM coding that
| will be fixed by more awareness pretty soon. Imagine
| putting the docs and code for all of your libraries into
| the context window so it can understand what exceptions
| might be thrown!
| maccard wrote:
| Copilot and the likes have been around for 4 years, and
| we've been hearing this all along. I'm bullish on LLM
| assistants (not vibe coding) but I'd love to see some of
| these things actually start to happen.
| kenjackson wrote:
| I feel like it has gotten better over time, but I don't
| have any metrics to confirm this. And it may also depend
| on what type of language/libraries you use.
| maccard wrote:
| I feel like there was a huge jump when cursor et al
| appeared, and things have been "changing" since then
| rather than improving.
| NeutralForest wrote:
| It just feels to me like trying to derive correct
| behavior without a proper spec so I don't see how it'll
| get that much better. Maybe we'll collectively remove the
| pathological code but otherwise I'm not seeing it.
| tclancy wrote:
| Well, at least now we know who to blame for the training
| data :)
| JoshuaDavid wrote:
| The training loop asked the model to one-shot working
| code for the given problems without being able to
| iterate. If you had to write code that _had_ to work on
| the first try, and where a partially correct answer was
| better than complete failure, I bet your code would look
| like that too.
|
| In any case, it knows what good code looks like. You can
| say "take this code and remove spurious comments and
| prefer narrow exception handling over catch-all", and
| it'll do just fine (in a way it _wouldn 't_ do just fine
| if your prompt told it to write it that way the first
| time, writing new code and editing existing code are
| different tasks).
| NeutralForest wrote:
| It's only an example; there's plenty of irrelevant stuff
| that LLMs default to which is pretty bad Python. I'm not
| saying it's always bad but there's a ton of not so nice
| code or subtly wrong code generated (for example file and
| path manipulation).
| breppp wrote:
| I always thought these were there to ground the LLM on the
| task and produce better code, an artifact of the fact that
| this will autocomplete better based on past tokens. Similarly
| always thought this is why ChatGPT always starts every reply
| with repeating exactly what you asked again
| rst wrote:
| Comments describing the organization and intent, perhaps.
| Comments just saying what a "require ..." line requires,
| not so much. (I find it will frequently put notes on the
| change it is making in comments, contrasting it with the
| previous state of the code; these aren't helpful at all to
| anyone doing further work on the result, and I wound up
| trimming a lot of them off by hand.)
| guestbest wrote:
| What kind of problems are you putting in where that is the
| solution? Just curious.
| dyauspitr wrote:
| Just ask it for fewer comments, it's not rocket science.
| Maxatar wrote:
| Tell it not to write so many comments then. You have a great
| deal of flexibility in dictating the coding style and can even
| include that style in your system prompt or upload a coding
| style document and have Gemini use it.
| Trasmatta wrote:
| Every time I ask an LLM to not write comments, it still
| litters it with comments. Is Gemini better about that?
| dheera wrote:
| I usually ask ChatGPT to "comment the shit out of this" for
| everything it writes. I find it vastly helps future LLM
| conversations pick up all of the context and why various
| pieces of code are there.
|
| If it is ingesting data, there should also be a sample of
| the data in a comment.
| sitkack wrote:
| LLMs are extremely poor at following negative instructions,
| tell them what to do, not what not to do.
| diggan wrote:
| Ok, so saying "Implement feature X" leads to a ton of
| comments. How do you rewrite that prompt so it doesn't
| include "don't write comments" while still making the output
| contain no comments? "Write only source code, no plain
| text with special characters in the beginning of the
| line", or what are you suggesting here in practical terms?
| FireBeyond wrote:
| "Implement feature X, and as you do, insert only minimal
| and absolutely necessary comments that explain why
| something is being done, not what is being done."
| sitkack wrote:
| You would say "omit the how". That word has negation
| built in.
| sroussey wrote:
| "Constrain all comments to a single block at the top of
| the file. Be concise."
|
| Or something similar that does not rely on negation.
| sitkack wrote:
| I also include something about "Target the comments
| towards a staff engineer that favors concise comments
| that focus on the why, and only for code that might cause
| confusion."
|
| I also try and get it to channel that energy into the doc
| strings, so it isn't buried in the source.
| diggan wrote:
| But I want no comments whatsoever, not one huge block of
| comments at the top of the file. How'd I get that without
| negation?
|
| Besides, other models seems to handle negation correctly,
| not sure why it's so difficult for the Gemini family of
| models to understand.
| staticman2 wrote:
| This is sort of LLM specific. For some tasks you might
| try including the word comment but give the order at the
| beginning and end of the prompt. This is very model
| dependent. Like:
|
| Refactor this. Do not write any comments.
|
| <code to refactor>
|
| As a reminder, your task is to refactor the above code
| and do not write any comments.
| diggan wrote:
| > Do not write any comments. [...] do not write any
| comments.
|
| Literally both of those are negations.
| staticman2 wrote:
| Yes my suggestion is that negations can work just fine,
| depending on the model and task, and instead of avoiding
| negations you can try other promoting strategies like
| emphasizing what you want at the beginning and at the end
| of the prompt.
|
| If you think negations never work tell Gemini 2.5 to
| "write 10 sentences that do not include the word the" and
| see what happens.
| Hackbraten wrote:
| "Whenever you are tempted to write a line or block
| comment, it is imperative that you just write the actual
| code instead"
| grw_ wrote:
| No, you can tell it not to write these comments in every
| prompt and it'll still do it
| nearbuy wrote:
| Sample size of one, but I just tried it and it worked for
| me on 2.5 pro. I just ended my prompt with "Do not include
| any comments whatsoever."
| taf2 wrote:
| I really liked the Gemini 2.5 pro model when it was first
| released - the upload code folder was very nice (but they
| removed it). The annoying things I find with the model is it
| does a really bad job of formatting the code it generates... I
| know I can use a code formatting tool, and I do when I use
| Gemini output, but otherwise I find Grok much easier to work
| with and it yields better results.
| throwup238 wrote:
| _> I really liked the Gemini 2.5 pro model when it was first
| released - the upload code folder was very nice (but they
| removed it)._
|
| Removed from where? I use the attach code folder feature
| every day from the Gemini web app (with a script that clones
| a local repo that deletes .git and anything matching a
| gitignore pattern).
| puika wrote:
| I have the same issue plus unnecessary refactorings (that break
| functionality). It doesn't matter if I write a whole paragraph
| in the chat or the prompt explaining I don't want it to change
| anything else apart from what is required to fulfill my very
| specific request. It will just go rogue and massacre the
| entirety of the file.
| fkyoureadthedoc wrote:
| Where/how do you use it? I've only tried this model through
| GitHub Copilot in VS Code and I haven't experienced much
| changing of random things.
| diggan wrote:
| I've used it via Google's own AI studio and via my own
| library/program using the API and finally via Aider. All of
| them lead to the same outcome, large chunks of changes to a
| lot of unrelated things ("helpful" refactors that I didn't
| ask for) and tons of unnecessary comments everywhere (like
| those comments you ask junior devs to stop making). No
| amount of prompting seems to address either problem.
| dherikb wrote:
| I have the exact same issue using it with Aider.
| mgw wrote:
| This has also been my biggest gripe with Gemini 2.5 Pro.
| While it is fantastic at one-shotting major new features,
| when wanting to make smaller iterative changes, it always
| does big refactors at the same time. I haven't found a way to
| change that behavior through changes in my prompts.
|
| Claude 3.7 Sonnet is much more restrained and does smaller
| changes.
| cryptoz wrote:
| This exact problem is something I'm hoping to fix with a
| tool that parses the source to AST and then has the LLM
| write code to modify the AST (which you then run to get
| your changes) rather than output code directly.
|
| I've started in a narrow niche of python/flask webapps and
| constrained to that stack for now, but if you're interested
| I've just opened it for signups:
| https://codeplusequalsai.com
|
| Would love feedback! Especially if you see promising
| results in not getting huge refactors out of small change
| requests!
|
| (Edit: I also blogged about how the AST idea works in case
| you're just that curious: https://codeplusequalsai.com/stat
| ic/blog/prompting_llms_to_m...)
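|
| The core idea in miniature - a toy sketch using Python's ast
| module (this is just an illustrative rename, not what the real
| tool generates):
|
| ```python
| import ast
|
| source = """
| def handler(x):
|     return x * 2
| """
|
| class Rename(ast.NodeTransformer):
|     def visit_FunctionDef(self, node):
|         if node.name == "handler":
|             node.name = "process"  # the "edit", applied to the AST
|         self.generic_visit(node)
|         return node
|
| tree = Rename().visit(ast.parse(source))
| ast.fix_missing_locations(tree)
| print(ast.unparse(tree))  # code regenerated from the modified AST
| ```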
| jtwaleson wrote:
| Having the LLM modify the AST seems like a great idea.
| Constraining an LLM to only generate valid code would be
| super interesting too. Hope this works out!
| HenriNext wrote:
| Interesting idea. But LLMs are trained on vast amount of
| "code as text" and tiny fraction of "code as AST";
| wouldn't that significantly hurt the result quality?
| cryptoz wrote:
| Thanks and yeah that is a concern; however I have been
| getting quite good results from this AST approach, at
| least for building medium-complexity webapps. On the
| other hand though, this wasn't always true...the only
| OpenAI model that really works well is o3 series. Older
| models do write AST code but fail to do a _good job_
| because of the exact issue you mention, I suspect!
| tough wrote:
| Interesting, i started playing with ts-morph and neo4j to
| parse TypeScript codebases.
|
| simonw has symbex which could be useful for you for
| python
| nolist_policy wrote:
| Can't you just commit the relevant parts? The git index is
| made for this sort of thing.
| tasuki wrote:
| It's not always trivial to find the relevant 5 line
| change in a diff of 200 lines...
| fwip wrote:
| Really? I haven't tried Gemini 2.5 yet, but my main
| complaint with Claude 3.7 is this exact behavior - creating
| 200+ line diffs when I asked it to fix one function.
| bugglebeetle wrote:
| This is generally controllable with prompting. I usually
| include something like, "be excessively cautious and
| conservative in refactoring, only implementing the desired
| changes" to avoid.
| energy123 wrote:
| It probably increases scores in the RL training since it's a
| kind of locally specific reasoning that would reduce bugs.
|
| Which means if you try to force it to stop, the code quality
| will drop.
| bugglebeetle wrote:
| It's annoying, but I've done extensive work with this model and
| leaving the comments in for the first few iterations produced
| better outcomes. I expect this is baked into the RL they're
| doing, but because of the context size, it's not really an
| issue. You can just ask it to strip out in the final pass.
| lukeschlather wrote:
| I usually remove the comments by hand. It's actually pretty
| helpful, it ensures I've reviewed every piece of code
| carefully, especially since most of the comments are literally
| just restating the next line, and "does this comment add any
| information?" is a really helpful question to make sure I
| understand the code.
| tasuki wrote:
| Same! It eases my code review. On the rare occasions I don't
| want to do that, I ask the LLM to provide the code without
| comments.
| mrinterweb wrote:
| If you don't want so many comments, have you tried asking the
| AI for fewer comments? Seems like something a little prompt
| engineering could solve.
| asadm wrote:
| you need to do a 2nd step as a post-process to erase the
| comments.
|
| Models use comments to think, asking to remove will affect code
| quality.
| merksittich wrote:
| My favourites are comments such as: from openai import
| DefaultHttpxClient # Import the httpx client
| benbristow wrote:
| You can ask it to remove the comments afterwards, and it'll do
| a decent job of it, but yeah, it's a pain.
| sureIy wrote:
| My custom default Claude prompt asks it to never explain code
| unless specifically asked to. Also to produce modern and
| compact code. It's a beauty to see. You ask for code and you
| get code, nothing else.
| Workaccount2 wrote:
| I have a strong sense that the comments are for the model more
| than the user. It's effectively more thinking in context.
| HenriNext wrote:
| Same experience. Especially the "step" comments about the
| performed changes are super annoying. Here is my prompt-rule to
| prevent them:
|
| "5. You must never output any comments about the progress or
| type of changes of your refactoring or generation. Example: you
| must NOT add comments like: 'Added dependency' or 'Changed to
| new style' or worst of all 'Keeping existing implementation'."
| kurtis_reed wrote:
| > all the gang
|
| What does that mean?
| Semaphor wrote:
| 2.5 was the most impressive model I use, but I agree about the
| comments. And when refactoring some code it wrote before, it
| just adds more comments, it becomes like archaeological history
| (disclaimer: I don't use it for work, but to see what it can
| do, so I try to intervene as little as possible, and get it to
| refactor what it _thinks_ it should)
| Hikikomori wrote:
| So many comments, more verbose code and will refactor stuff on
| its own. Still better than chatgpt, but I just want a small
| amount of code that does what I asked for so I can read through
| it quickly.
| freddydumont wrote:
| That's been my experience as well. It's especially jarring when
| asking for a refactor as it will leave a bunch of WIP-style
| comments highlighting the difference with the previous
| approach.
| AuthConnectFail wrote:
| you can ask it to remove, it does p good job at it
| segphault wrote:
| My frustration with using these models for programming in the
| past has largely been around their tendency to hallucinate APIs
| that simply don't exist. The Gemini 2.5 models, both pro and
| flash, seem significantly less susceptible to this than any other
| model I've tried.
|
| There are still significant limitations, no amount of prompting
| will get current models to approach abstraction and architecture
| the way a person does. But I'm finding that these Gemini models
| are finally able to replace searches and stackoverflow for a lot
| of my day-to-day programming.
| thefourthchime wrote:
| Ask the models that can search to double check their API usage.
| This can just be part of a pre-prompt.
| ksec wrote:
| I have been asking whether AI without hallucination, coding or
| not, is possible, but so far with no real concrete answer.
| mattlondon wrote:
| It's already much improved on the early days.
|
| But I wonder when we'll be happy? Do we expect colleagues
| friends and family to be 100% laser-accurate 100% of the
| time? I'd wager we don't. Should we expect that from an
| artificial intelligence too?
| cinntaile wrote:
| It's a tool, not a human, so I don't know if the comparison
| even makes sense?
| ziml77 wrote:
| Yes we should expect better from an AI that has a knowledge
| base much larger than any individual and which can very
| quickly find and consume documentation. I also expect them
| to not get stuck trying the same thing they've already been
| told doesn't work, same as I would expect from a person.
| kweingar wrote:
| I expect my calculator to be 100% accurate 100% of the
| time. I have slightly more tolerance for other software
| having defects, but not much more.
| asadotzler wrote:
| And a $2.99 drugstore slim wallet calculator with solar
| power gets it right 100% of the time while billion dollar
| LLMs can still get arithmetic wrong on occasion.
| pb7 wrote:
| My hammer can't do any arithmetic at all, why does anyone
| even use them?
| namaria wrote:
| Does it sometimes instead of driving a nail hit random
| things in the house?
| hn_go_brrrrr wrote:
| Yes, like my thumb.
| izacus wrote:
| What you're being asked is to stop trying to hammer every
| single thing that comes into your vicinity. Smashing your
| computer with a hammer won't create code.
| Analemma_ wrote:
| I don't think that's the relevant comparison though. Do
| you expect StackOverflow or product documentation to be
| 100% accurate 100% of the time? I definitely don't.
| ctxc wrote:
| The error introduced by the data is expected and
| internalized; it's the error of LLMs on _top_ of that
| that's hard to internalize.
| ctxc wrote:
| Also, documentation and SO are incorrect in a predictable
| way. We don't expect them to state things in a matter of
| fact way that just don't exist.
| kweingar wrote:
| I actually agree with this. I use LLMs often, and I don't
| compare them to a calculator.
|
| Mainly I meant to push back against the reflexive
| comparison to a friend or family member or colleague. AI
| is a multi-purpose tool that is used for many different
| kinds of tasks. Some of these tasks are analogues to
| human tasks, where we should anticipate human error.
| Others are not, and yet we often ask an LLM to do them
| anyway.
| pizza wrote:
| Are you sure about that? Try these..
|
| - (1e(1e10) + 1) - 1e(1e10)
|
| - sqrt(sqrt(2)) * sqrt(sqrt(2)) * sqrt(sqrt(2)) *
| sqrt(sqrt(2))
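|
| Or, the same failure mode with ordinary floats in Python (my
| examples, not the ones above):
|
| ```python
| # float64 carries ~15-16 significant digits, so neither of these
| # comes out "right":
| print((1e16 + 1) - 1e16)   # 0.0 - the +1 is lost to rounding
| print(0.1 + 0.2 == 0.3)    # False
| ```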
| ctxc wrote:
| Three decades and I haven't had to do anything remotely
| resembling this on a calculator, much less find the
| calculator wrong. Same for the majority of general
| population I assume.
| jjmarr wrote:
| (1/3)*3
| tasuki wrote:
| The person you're replying to pointed out that you
| shouldn't expect a calculator to be 100% accurate 100% of
| the time. _Especially_ not when faced with adversarial
| prompts.
| Vvector wrote:
| Try "1/3". The calculator answer is not "100% accurate"
| bb88 wrote:
| I had a casio calculator back in the 1980's that did
| fractions.
|
| So when I punched in 1/3 it was exactly 1/3.
| gilbetron wrote:
| It's your option not to use it. However, this is a
| competitive environment and so we will see who pulls out
| ahead, those that use AI as a productivity multiplier
| versus those that do not. Maybe that multiplier is less
| than 1, time will tell.
| kweingar wrote:
| Agreed. The nice thing is that I am told by HN and
| Twitter that agentic workflows makes code tasks very
| easy, so if it turns out that using these tools
| multiplies productivity, then I can just start using them
| and it will be easy. Then I am caught up with the early
| adopters and don't need to worry about being out-competed
| by them.
| mattlondon wrote:
| AIs aren't intended to be used as calculators though?
|
| You could say that when I use my spanner/wrench to
| tighten a nut it works 100% of the time, but as soon as I
| try to use a screwdriver it's terrible and full of
| problems and it can't even reliably so something as
| trivially easy as tighten a nut, even though a
| screwdriver works the same way by using torque to tighten
| a fastener.
|
| Well that's because one tool is designed for one thing,
| and one is designed for another.
| mdp2021 wrote:
| > _AIs are_
|
| "AI"s are designed to be reliable; "AGI"s are designed to
| be intelligent; "LLM"s seem to be designed to make some
| qualities emerge.
|
| > _one tool is designed for one thing, and one is
| designed for another_
|
| The design of LLMs seems to be "let us see where the
| promise leads us". That is not really "design", i.e.
| "from need to solution".
| thewebguyd wrote:
| > AIs aren't intended to be used as calculators though?
|
| Then why are we using them to write code, which should
| produce reliable outputs for a given input...much like a
| calculator.
|
| Obviously we want the code to produce correct results for
| whatever input we give, and as it stands now, I can't
| trust LLM output without reviewing first. Still a helpful
| tool, but ultimately my desire would be to have them be
| as accurate as a calculator so they can be trusted enough
| to not need the review step.
|
| Using an LLM and being OK with untrustworthy results,
| it'd be like clicking the terminal icon on my dock and
| sometimes it opens terminal, sometimes it might open a
| browser, or just silently fail because there's no
| reproducible output for any given input to an LLM. To me
| that's a problem, output should be reproducible,
| especially if it's writing code.
| LordDragonfang wrote:
| A calculator isn't software, it's hardware. Your inputs
| into a calculator _are_ code.
|
| Your interaction with LLMs is categorically closer to
| interactions with people than with a calculator. Your
| inputs into it are language.
|
| Of course the two are different. A calculator is a
| computer, an LLM is not. Comparing the two is making the
| same category error which would confuse Mr. Babbage, but
| in reverse.
|
| ("On two occasions, I have been asked [by members of
| Parliament], 'Pray, Mr. Babbage, if you put into the
| machine wrong figures, will the right answers come out?'
| I am not able to rightly apprehend the kind of confusion
| of ideas that could provoke such a question.")
| pohuing wrote:
| It's a tool, not an intelligence, a tool that costs money
| on every erroneous token. I expect my computer to be more
| reliable at remembering things than myself, that's one of
| the primary use cases even. Especially if using it costs
| money. Of course errors are possible, but rarely do they
| happen as frequently in any other program I use.
| kortilla wrote:
| If colleagues lie with the certainty that LLMs do, they
| would get fired for incompetence.
| scarab92 wrote:
| I wish that were true, but I've found that certain types
| of employees do confidently lie as much as llms,
| especially when answering "do you understand" type
| questions
| izacus wrote:
| And we try to PIP and fire those as well, not turn
| everyone else into them.
| dmd wrote:
| Or elected to high office.
| ChromaticPanic wrote:
| Have you worked in an actual workplace. Confidence is
| king.
| ksec wrote:
| I don't expect it to be 100% accurate. Software isn't bug-
| free, humans aren't perfect. But maybe 99.99%? At least,
| given enough time and resources, humans could fact-check it
| ourselves. And precisely because we know we are not
| perfect, in accounting and court cases we have due
| diligence.
|
| And it is also not just about the %. It is also about the
| type of error. Will we reach a point we change our
| perception and say these are expected non-human error?
|
| Or could we have a specific LLM that only checks for these
| types of error?
| mdp2021 wrote:
| Yes we want people "in the game" to be of sound mind. (The
| matter there is not about being accurate, but of being
| trustworthy - substance, not appearance.)
|
| And tools in the game, even more so (there's no excuse for
| the engineered).
| Foreignborn wrote:
| Try dropping the entire api docs in the context. If it's
| verbose, i usually pull only a subset of pages.
|
| Usually I'm using a minimum of 200k tokens to start with
| gemini 2.5.
| nolist_policy wrote:
| That's more than 222 novel pages:
|
| 200k tokens ≈ 1/3 x 200k words ≈ 67k words, and at ~300 words
| per page that's ≈ 222 pages.
| pizza wrote:
| "if it were a fact, it wouldn't be called intelligence" -
| donald rumsfeld
| codebolt wrote:
| I've found they do a decent job searching for bugs now as well.
| Just yesterday I had a bug report on a component/page I wasn't
| familiar with in our Angular app. I simply described the issue
| as well as I could to Claude and asked politely for help
| figuring out the cause. It found the exact issue correctly on
| the first try and came up with a few different suggestions for
| how to fix it. The solutions weren't quite what I needed but it
| still saved me a bunch of time just figuring out the error.
| M4v3R wrote:
| That's my experience as well. Many bugs involve typos, syntax
| issues or other small errors that LLMs are very good at
| catching.
| ChocolateGod wrote:
| I asked today both Claude and ChatGPT to fix a Grafana Loki
| query I was trying to build, both hallucinated functions that
| didn't exist, even when told to use existing functions.
|
| To my surprise, Gemini got it spot on first time.
| fwip wrote:
| Could be a bit of a "it's always in the last place you look"
| kind of thing - if Claude or CGPT had gotten it right, you
| wouldn't have tried Gemini.
| redox99 wrote:
| Making LLMs know what they don't know is a hard problem. Many
| attempts at making them refuse to answer what they don't know
| caused them to refuse to answer things they did in fact know.
| Volundr wrote:
| > Many attempts at making them refuse to answer what they
| don't know caused them to refuse to answer things they did in
| fact know.
|
| Are we sure they know these things as opposed to being able
| to consistently guess correctly? With LLMs I'm not sure we
| even have a clear definition of what it means for it to
| "know" something.
| redox99 wrote:
| Yes. You could ask for factual information like "Tallest
| building in X place" and first it would answer it did not
| know. After pressuring it, it would answer with the correct
| building and height.
|
| But also things where guessing was desirable. For example
| with a riddle it would tell you it did not know or there
| wasn't enough information. After pressuring it to answer
| anyway it would correctly solve the riddle.
|
| The official llama 2 finetune was pretty bad with this
| stuff.
| Volundr wrote:
| > After pressuring it, it would answer with the correct
| building and height.
|
| And if you bully it enough on something nonsensical it'll
| give you a wrong answer.
|
| You press it, and it takes a guess even though you told
| it not to, and gets it right, then you go "see it knew!".
| There's no database hanging out in
| ChatGPT/Claude/Gemini's weights with a list of cities and
| the tallest buildings. There's a whole bunch of opaque
| stats derived from the content it's been trained on that
| means that most of the time it'll come up with the same
| guess. But there's no difference in process between that
| highly consistent response to you asking the tallest
| building in New York and the one where it hallucinates a
| Python method that doesn't exist, or suggests glue to
| keep the cheese on your pizza. It's all the same process
| to the LLM.
| ajross wrote:
| > Are we sure they know these things as opposed to being
| able to consistently guess correctly?
|
| What is the practical difference you're imagining between
| "consistently correct guess" and "knowledge"?
|
| LLMs aren't databases. We have databases. LLMs are
| probabilistic inference engines. _All they do_ is guess,
| essentially. The discussion here is about how to get the
| guess to "check itself" with a firmer idea of "truth". And
| it turns out that's hard because it requires that the
| guessing engine know that something needs to be checked in
| the first place.
| mynameisvlad wrote:
| Simple, and even simpler from your own example.
|
| Knowledge has an objective correctness. We know that
| there is a "right" and "wrong" answer and we know what a
| "right" answer is. "Consistently correct guesses", based
| on the name itself, is not reliable enough to actually be
| trusted. There's absolutely no guarantee that the next
| "consistently correct guess" is knowledge or a
| hallucination.
| ajross wrote:
| This is a circular semantic argument. You're saying
| knowledge is knowledge because it's correct, where
| guessing is guessing because it's a guess. But "is it
| correct?" is precisely the question you're asking the
| poor LLM to answer in the first place. It's not helpful
| to just demand a computation device work the way you
| want, you need to actually make it work.
|
| Also, too, there are whole subfields of philosophy that
| make your statement here kinda laughably naive. Suffice
| it to say that, no, knowledge as rigorously understood
| does _not_ have "an objective correctness".
| mynameisvlad wrote:
| I mean, it clearly does based on your comments showing a
| need for a correctness check to disambiguate between made
| up "hallucinations" and actual "knowledge" (together, a
| "consistently correct guess").
|
| The fact that you are humanizing an LLM is honestly just
| plain weird. It does not have feelings. It doesn't care
| that it has to answer "is it correct?" and saying _poor_
| LLM is just trying to tug on heartstrings to make your
| point.
| ajross wrote:
| FWIW "asking the poor <system> to do <requirement>" is an
| extremely common idiom. It's used as a metaphor for an
| inappropriate or unachievable design requirement. Nothing
| to do with LLMs. I work on microcontrollers for a living.
| Volundr wrote:
| > You're saying knowledge is knowledge because it's
| correct, where guessing is guessing because it's a guess.
|
| Knowledge is knowledge because the knower knows it to be
| correct. I know I'm typing this into my phone, because
| it's right here in my hand. I'm guessing you typed your
| reply into some electronic device. I'm guessing this is
| true for all your comments. Am I 100% accurate? You'll
| have to answer that for me. I don't know it to be true,
| it's a highly informed guess.
|
| Being wrong sometimes is not what makes a guess a guess.
| It's the difference between pulling something from your
| memory banks, be they biological or mechanical, vs
| inferring it from some combination of your knowledge
| (what's in those memory banks), statistics, intuition,
| and whatever other fairy dust you sprinkle on.
| fwip wrote:
| So, if that were so, then an LLM possess no knowledge
| whatsoever, and cannot ever be trusted. Is that the line
| of thought you are drawing?
| Volundr wrote:
| > What is the practical difference you're imagining
| between "consistently correct guess" and "knowledge"?
|
| Knowing it's correct. You've just instructed it not to
| guess remember? With practice people can get really good
| at guessing all sorts of things.
|
| I think people have a serious misunderstanding about how
| these things work. They don't have their training set
| sitting around for reference. They are usually guessing.
| Most of the time with enough consistency that it seems
| like they "know'. Then when they get it wrong we call it
| "hallucinations". But instructing then not to guess means
| suddenly they can't answer much. There no guessing vs not
| with an LLM, it's all the same statistical process, the
| difference is just if it gives the right answer or not.
| mountainriver wrote:
| https://github.com/IINemo/lm-polygraph is the best work in
| this space
| rdtsc wrote:
| > Making LLMs know what they don't know is a hard problem.
| Many attempts at making them refuse to answer what they don't
| know caused them to refuse to answer things they did in fact
| know.
|
| They are the perfect "fake it till you make it" example
| cranked up to 11. They'll bullshit you, but will do it
| confidently and with proper grammar.
|
| > Many attempts at making them refuse to answer what they
| don't know caused them to refuse to answer things they did in
| fact know.
|
| I can see in some contexts that being desirable if it can be
| a parameter that can be tweaked. I guess it's not that easy,
| or we'd already have it.
| bezier-curve wrote:
| The best way around this is to dump documentation of the APIs
| you need them privy to into their context window.
| pzo wrote:
| I feel your pain. Cursor has a docs feature, but many times
| when I pointed it at @docs and selected a recently indexed
| one, it still didn't get it. I still have to try the context7
| MCP, which looks promising:
|
| https://github.com/upstash/context7
| doug_durham wrote:
| If they never get good at abstraction or architecture they will
| still provide a tremendous amount of value. I have them do the
| parts of my job that I don't like. I like doing abstraction and
| architecture.
| mynameisvlad wrote:
| Sure, but that's not the problem people have with them nor
| the general criticism. It's that people without the knowledge
| to do abstraction and architecture don't realize the
| importance of these things and pretend that "vibe coding" is
| a reasonable alternative to a well-thought-out project.
| sanderjd wrote:
| The way I see this is that it's just another skill
| differentiator that you can take advantage of if you can
| get it right.
|
| That is, if it's true that abstraction and architecture are
| useful for a given product, then people who know how to do
| those things will succeed in creating that product, and
| those who don't will fail. I think this is true for
| essentially all production software, but a lot of software
| never reaches production.
|
| Transitioning or entirely recreating "vibecoded" proofs of
| concept to production software is another skill that will
| be valuable.
|
| Having a good sense for when to do that transition, or when
| to start building production software from the start, and
| especially the ability to influence decision makers to
| agree with you, is another valuable skill.
|
| I do worry about what the careers of entry level people
| will look like. It isn't obvious to me how they'll
| naturally develop any of these skills.
| mynameisvlad wrote:
| > "vibecoded" proofs of concept
|
| The fact that you called it out as a PoC is already many
| bars above what most vibe coders are doing. Which is
| considering a barely functioning web app as proof that
| vibe coding is a viable solution for coding in general.
|
| > I do worry about what the careers of entry level people
| will look like. It isn't obvious to me how they'll
| naturally develop any of these skills.
|
| Exactly. There isn't really a path forward from vibe
| coding to anything productizable without actual, deep CS
| knowledge. And LLMs are not providing that.
| sanderjd wrote:
| Yeah I think we largely agree. But I do know people,
| mostly experienced product managers, who are excited
| about "vibecoding" expressly as a prototyping / demo
| creation tool, which can be useful in conjunction with
| people who know how to turn the prototypes into real
| software.
|
| I'm sure lots of people aren't seeing it this way, but
| the point I was trying to make about this being a skill
| differentiator is that I think understanding the
| advantages, limitations, and tradeoffs, and keeping that
| understanding up to date as capabilities expand, is
| already a valuable skillset, and will continue to be.
| Karrot_Kream wrote:
| We can rewind the clock 10 years and I can substitute "vibe
| coding" for VBA/Excel macros and we'd get a common type of
| post from back then.
|
| There's always been a demand for programming by non
| technical stakeholders that they try and solve without
| bringing on real programmers. No matter the tool, I think
| the problem is evergreen.
| Tainnor wrote:
| I definitely get more use out of Gemini Pro than other models
| I've tried, but it's still very prone to bullshitting.
|
| I asked it a complicated question about the Scala ZIO framework
| that involved subtyping, type inference, etc. - something that
| would definitely be hard to figure out just from reading the
| docs. The first answer it gave me was very detailed, very
| convincing and very wrong. Thankfully I noticed it myself and
| was able to re-prompt it and I got an answer that is probably
| right. So it was useful in the end, but only because I realised
| that the first answer was nonsense.
| gxs wrote:
| Huh? Have you ever just told it, that API doesn't exist, find
| another solution?
|
| Never seen it fumble that around
|
| Swear people act like humans themselves don't ever need to be
| asked for clarification
| 0x457 wrote:
| I've noticed that models that can search the internet do it a
| lot less, because I guess they can look up documentation? My
| annoyance now is that it doesn't take version into
| consideration.
| tough wrote:
| You should give it docs for each of your base dependencies in
| an MCP/tool or whatever, so it can just consult them.
|
| Internet access also helps.
|
| Also having markdown files with the stack etc. and any -rules-
| siscia wrote:
| This problem has been solved by LSP (Language Server
| Protocol); all we need is a small server behind MCP that can
| communicate LSP information back to the LLM, and get the LLM
| to use it by adding to the prompt something like: "check your
| API usage with the LSP".
|
| The unfortunate state of open source funding makes building
| such a simple tool a losing adventure, unfortunately.
| satvikpendem wrote:
| This already happens in agent modes in IDEs like Cursor or
| VSCode with Copilot, it can check for errors with the LSP.
| jug wrote:
| I've seen benchmarks on hallucinations and OpenAI has typically
| performed worse than Google and Anthropic models. Sometimes
| significantly so. But it doesn't seem like they have cared
| much. I've suspected that LLM performance is correlated with
| risking hallucinations? That is, if they're bolder, this can be
| beneficial, which helps in other performance benchmarks - but of
| course at the risk of hallucinating more...
| mountainriver wrote:
| The hallucinations are a result of RLVR. We reward the model
| for an answer and then force it to reason about how to get
| there when the base model may not have that information.
| mdp2021 wrote:
| > _The hallucinations are a result of RLVR_
|
| Well, let us reward them for producing output that is
| consistent with selected documentation accessed from a
| database, then, and massacre them for output they cannot
| justify - like we do with humans.
| johnisgood wrote:
| > hallucinate APIs
|
| Tell me about it. Thankfully I have not experienced it as much
| with Claude as I did with GPT. It can get quite annoying. GPT
| kept telling me to use this and that and none of them were real
| projects.
| impulser_ wrote:
| Use few-shot learning. Build a simple prompt with basic
| examples of how to use the API and it will do significantly
| better.
|
| LLMs just guess, so you have to give it a cheatsheet to help it
| guess closer to what you want.
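|
| A hand-rolled sketch of what that can look like (FooClient and
| its methods are hypothetical stand-ins for whatever API you are
| actually targeting):
|
| ```python
| FEW_SHOT = """\
| You write code against the FooClient API. Follow the examples.
|
| Example 1 - list projects:
|     client = FooClient(api_key=KEY)
|     for p in client.projects.list():
|         print(p.name)
|
| Example 2 - upload a file:
|     client.projects.upload(project_id="p1", path="report.pdf")
|
| Task:
| """
|
| def build_prompt(task: str) -> str:
|     # with the cheatsheet in context, the model is far less likely
|     # to invent methods that are not shown above
|     return FEW_SHOT + task
|
| print(build_prompt("download every file in project p1"))
| ```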
| rcpt wrote:
| I'm using repomix for this
| M4v3R wrote:
| At this point the time it takes to teach the model might be
| more than you save from using it for interacting with that
| API.
| Jordan-117 wrote:
| I recently needed to recommend some IAM permissions for an
| assistant on a hobby project; not complete access but just
| enough to do what was required. Was rusty with the console and
| didn't have direct access to it at the time, but figured it was
| a solid use case for LLMs since AWS is so ubiquitous and well-
| documented. I actually queried 4o, 3.7 Sonnet, and Gemini 2.5
| for recommendations, stripped the list of duplicates, then
| passed the result to Gemini to vet and format as JSON. The
| result was perfectly formatted... and still contained a bunch
| of non-existent permissions. My first time being burned by a
| hallucination IRL, but just goes to show that even the latest
| models working in concert on a very well-defined problem space
| can screw up.
| dotancohen wrote:
| AWS docs have (had) an embedded AI model that would do this
| perfectly. I suppose it had better training data, and the
| actual spec as a RAG.
| djhn wrote:
| Both AWS and Azure docs' built in models have been
| absolutely useless.
| darepublic wrote:
| Listen I don't blame any mortal being for not grokking the
| AWS and Google docs. They are a twisting labyrinth of
| pointers to pointers, some of them deprecated though
| recommended by Google itself.
| perching_aix wrote:
| Sounds like a vague requirement, so I'd just generally point
| you towards the AWS managed policies summary [0] instead.
| Particularly the PowerUserAccess policy sounds fitting here
| [1] if the description for it doesn't raise any immediate
| flags. Alternatively, you could browse through the job
| function oriented policies [2] they have and see if you find
| a better fit. Can just click it together instead of bothering
| with the JSON. Though it sounds like you're past this problem
| by now.
|
| [0] https://docs.aws.amazon.com/IAM/latest/UserGuide/access_p
| oli...
|
| [1] https://docs.aws.amazon.com/aws-managed-
| policy/latest/refere...
|
| [2] https://docs.aws.amazon.com/IAM/latest/UserGuide/access_p
| oli...
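|
| And if you do end up scripting it, attaching the managed policy
| is a single call (user name below is made up; needs boto3 and
| credentials configured):
|
|     # Sketch: attach the AWS-managed PowerUserAccess policy to a user.
|     import boto3
|
|     iam = boto3.client("iam")
|     iam.attach_user_policy(
|         UserName="hobby-assistant",  # placeholder
|         PolicyArn="arn:aws:iam::aws:policy/PowerUserAccess",
|     )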
| satvikpendem wrote:
| If you use Cursor, you can use @Docs to let it index the
| documentation for the libraries and languages you use, so no
| hallucination happens.
| Rudybega wrote:
| The context7 mcp works similarly. It allows you to search a
| massive constantly updated database of relevant documentation
| for thousands of projects.
| thr0waway39290 wrote:
| Replacing stackoverflow is definitely helpful, but the best use
| case for me is how much it helps in high-level architecture and
| planning before starting a project.
| yousif_123123 wrote:
| The opposite problem is also true. I was using it to edit code
| I had that was calling the new openai image API, which is
| slightly different from the dalle API. But Gemini was
| consistently "fixing" the OpenAI call even when I explained
| clearly not to do that since I'm using a new API design etc.
| Claude wasn't having that issue.
|
| The models are very impressive. But issues like these still
| make me feel they are more pattern matching (although
| there's also some magic, don't get me wrong) but not fully
| reasoning over everything correctly like you'd expect of a
| typical human reasoner.
| toomuchtodo wrote:
| It seems like the fix is straightforward (check the output
| against a machine readable spec before providing it to the
| user), but perhaps I am a rube. This is no different than me
| clicking through a search result to the underlying page to
| verify the veracity of the search result surfaced.
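|
| As a sketch of what that check could look like (assumes you've
| already dumped a machine-readable list of real action names,
| e.g. from the service authorization reference, into
| actions.json):
|
|     # Reject hallucinated IAM actions before showing the policy to anyone.
|     import json
|
|     with open("actions.json") as f:        # assumed: ["s3:GetObject", ...]
|         VALID_ACTIONS = set(json.load(f))
|
|     def vet_policy(policy: dict) -> list[str]:
|         """Return actions in an LLM-generated policy that don't exist."""
|         unknown = []
|         for statement in policy.get("Statement", []):
|             actions = statement.get("Action", [])
|             if isinstance(actions, str):
|                 actions = [actions]
|             unknown += [a for a in actions if a not in VALID_ACTIONS]
|         return unknown
|
|     # Example: vet whatever JSON the model produced, re-prompt if needed.
|     proposed = {"Statement": [{"Action": ["s3:GetObject", "s3:GetMagicObject"]}]}
|     print(vet_policy(proposed))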
| disgruntledphd2 wrote:
| Why coding agents et al don't make use of the AST through
| LSP is a question I've been asking myself since the first
| release of GitHub copilot.
|
| I assume that it's trickier than it seems as it hasn't
| happened yet.
| celeritascelery wrote:
| What good do you think that would do?
| disgruntledphd2 wrote:
| They are definitely pattern matching. Like, that's how we
| train them, and no matter how many layers of post training
| you add, you won't get too far from next token prediction.
|
| And that's fine and useful.
| mdp2021 wrote:
| > _fine and useful_
|
| And crippled, incomplete, and deceiving, dangerous.
| mannycalavera42 wrote:
| same, I asked a simple question about the JavaScript fetch API
| and it started talking about the workspace API. When I asked
| about that workspace API it replied it was the Google Workspace
| API ¯\_(ツ)_/¯
| mbesto wrote:
| To date, LLMs can't replace the human element of:
|
| - Determining what features to make for users
|
| - Forecasting out a roadmap that are aligned to business goals
|
| - Translating and prioritizing all of these to a developer
| (regardless of whether these developers are agentic or human)
|
| Coincidentally these are the areas that frequently are the
| largest contributors to software businesses' success... not
| whether you use Next.js with a Go and Elixir backend against a
| multi-geo redundant, multi-sharded CockroachDB database, or that
| your code is clean/elegant.
| nearbuy wrote:
| What does it say when you ask it to?
| dist-epoch wrote:
| Maybe at elite companies.
|
| At half of the companies you can randomly pick those three
| things and probably improve the situation. Using an AI would
| be a massive improvement.
| jstummbillig wrote:
| > no amount of prompting will get current models to approach
| abstraction and architecture the way a person does
|
| I find this sentiment increasingly worrisome. It's entirely
| clear that every last human will be beaten on code design in
| the upcoming years (I am not going to argue if it's 1 or 5
| years away, who cares?)
|
| I wished people would just stop holding on to what amounts to
| nothing, and think and talk more about what can be done in a
| new world. We need good ideas and I think this could be a place
| to advance them.
| jjice wrote:
| I'm confused by your comment. It seems like you didn't really
| provide a retort to the parent's comment about bad
| architecture and abstraction from LLMs.
|
| FWIW, I think you're probably right that we need to adapt,
| but there was no explanation as to _why_ you believe that
| that's the case.
| TuringNYC wrote:
| I think they are pointing out that the advantage humans
| have has been chipped away little by little and computers
| winning at coding is inevitable on some timeline. They are
| also suggesting that perhaps the GP is being defensive.
| dml2135 wrote:
| Why is it inevitable? Progress towards a goal in the past
| does not guarantee progress towards that goal in the
| future. There are plenty of examples of technology moving
| forward, and then hitting a wall.
| TuringNYC wrote:
| I agree with you it isnt guaranteed to be inevitable, and
| also agree there have been plenty of journeys which were
| on a trajectory only to fall off.
|
| That said, IMHO it is inevitable. My personal (dismal)
| view is that businesses see engineering as a huge cost
| center to be broken up and it will play out just like
| manufacturing -- decimated without regard to the human
| cost. The profit motive and cost savings are just too
| great to not try. It is a very specific line item so
| cost/savings attribution is visible and already tracked.
| Finally, a good % of the industry has been staffed up
| with under-trained workers (e.g., express bootcamp) who
| aren't working on abstraction, etc. -- they are doing basic
| CRUD work.
| warkdarrior wrote:
| > businesses see engineering as a huge cost center to be
| [...] decimated without regard to the human cost
|
| Most cost centers in the past were decimated in order to
| make progress: from horse-drawn carriages to cars and
| trucks, from mining pickaxes to mining machines, from
| laundry at the river to clothes washing machines, etc. Is
| engineering a particularly unique endeavor that needs to
| be saved from automation?
| saurik wrote:
| I mean, didn't you just admit you are wrong? If we are
| talking 1-5 years out, that's not "current models".
| jstummbillig wrote:
| Imagine sitting in a car, that is fast approaching a cliff,
| with no brakes, while the driver talks about how they have
| not been in any serious car accident so far.
|
| Technically correct. And yet, you would probably at
| least be a little worried about that cliff and rather talk
| about that.
| mattgreenrocks wrote:
| I'm always impressed by the ability of the comment section to
| come up with more reasons why decent design and architecture
| of source code just can't happen:
|
| * "it's too hard!"
|
| * "my coworkers will just ruin it"
|
| * "startups need to pursue PMF, not architecture"
|
| * "good design doesn't get you promoted"
|
| And now we have "AI will do it better soon."
|
| None of those are entirely wrong. They're not entirely
| correct, either.
| dullcrisp wrote:
| It's always so aggressive too. What fools we are for trying
| to write maintainable code when it's so obviously
| impossible.
| DanHulton wrote:
| > It's entirely clear that every last human will be beaten on
| code design in the upcoming years
|
| Citation needed. In fact, I think this pretty clearly hits
| the "extraordinary claims require extraordinary evidence"
| bar.
| coffeemug wrote:
| AlphaGo.
| giovannibonetti wrote:
| A board game has a much narrower scope than programming
| in general.
| cft wrote:
| That was in 2016. 9 years have passed.
| astrange wrote:
| LLMs and AlphaGo don't work at all similarly, since LLMs
| don't use search.
| kaliqt wrote:
| Trends would dictate that this will keep scaling and
| surpass each goalpost year by year.
| sweezyjeezy wrote:
| I would argue that what LLMs are capable of doing right now
| is already pretty extraordinary, and would fulfil your
| extraordinary evidence request. To turn it on its head -
| given the rather astonishing success of the recent LLM
| training approaches, what evidence do you have that these
| models are going to plateau short of your own abilities?
| sigmaisaletter wrote:
| What they do is extraordinary, but it's not just a claim -
| they actually do it, and their doing so is the evidence.
|
| Here someone just claimed that it is "entirely clear"
| LLMs will become super-human, without any evidence.
|
| https://en.wikipedia.org/wiki/Extraordinary_claims_requir
| e_e...
| sweezyjeezy wrote:
| Again - I'd argue that the extraordinary success of LLMs,
| in a relatively short amount of time, using a fairly
| unsophisticated training approach, is strong evidence
| that coding models are going to get a lot better than
| they are right now. Will it definitely surpass every
| human? I don't know, but I wouldn't say we're lacking
| extraordinary evidence for that claim either.
|
| The way you've framed it seems like the only evidence you
| will accept is after it's actually happened.
| sigmaisaletter wrote:
| Well, predicting the future is always hard. But if
| someone claims some extraordinary future event is going
| to happen, you at least ask for their reasons for
| claiming so, don't you.
|
| In my mind, at this point we either need (a) some
| previously "hidden" super-massive source of training
| data, or (b) another architectural breakthrough. Without
| either, this is a game of optimization, and the scaling
| curves are going to plateau really fast.
| sweezyjeezy wrote:
| A couple of comments there -
|
| a) it hasn't even been a year since the last big
| breakthrough, the reasoning models like o3 only came out
| in September. I'd wait a second before assuming the low-
| hanging fruit is done.
|
| b) I think coding is a really good environment for agents
| / reinforcement learning. Rather than requiring a
| continual supply of new training data, we give the model
| coding tasks to execute (writing / maintaining /
| modifying) and then test its code for correctness. We
| could for example take the entire history of a code-base
| and just give the model its changing unit + integration
| tests to implement. My hunch (with no extraordinary
| evidence) is that this is how coding agents start to nail
| some of the higher-level abilities.
| davidsainez wrote:
| I use LLMs for coding every day. There have been significant
| improvements over the years but mostly across a single
| dimension: mapping human language to code. This capability is
| robust, but you still have to know how to manage context to
| keep them focused. I still have to direct them to consider
| e.g. performance or architecture considerations.
|
| I'm not convinced that they can reason effectively (see the
| ARC-AGI-2 benchmarks). Doesn't mean that they are not useful,
| but they have their limitations. I suspect we still need to
| discover tech distinct from LLMs to get closer to what a
| human brain does.
| acedTrex wrote:
| > It's entirely clear that every last human will be beaten on
| code design in the upcoming years
|
| In what world is this statement remotely true.
| dullcrisp wrote:
| In the world where idle speculation can be passed off as
| established future facts, i.e., this one I guess.
| 1024core wrote:
| Proof by negation, I guess?
|
| If someone were to claim: no computer will ever be able to
| beat humans in code design, would you agree with that? If
| the answer is "no", then there's your proof.
| Workaccount2 wrote:
| Software will change to accommodate LLMs, if for no other
| reason than we are on the cusp of everyone being a junior
| level programmer. What does software written for LLMs to
| middleman look like?
|
| I think there is a total seismic change in software that is
| about to go down, similar to something like going from gas
| lamps to electric. Software doesn't need to be the way it is
| now anymore, since we have just about solved human language
| to computer interface translation. I don't want to fuss with
| formatting a word document anymore, I would rather just tell
| an LLM and let it modify the program memory to implement
| what I want.
| epolanski wrote:
| > no amount of prompting will get current models to approach
| abstraction and architecture the way a person does
|
| Which person is it? Because 90% of the people in our trade
| are bad, like, real bad.
|
| I get that people on HN are in that elitist niche of those
| who care more, focus on career more, etc so they don't even
| realize the existence of armies of low quality body rental
| consultancies and small shops out there working on Magento or
| Liferay or even worse crap.
| bayindirh wrote:
| > It's entirely clear that every last human will be beaten on
| code design in the upcoming years (I am not going to argue if
| it's 1 or 5 years away, who cares?)
|
| No-code & AI-assisted programming has been said to be around
| the corner since 2000. We have just arrived at a point where
| models remix what others have typed on their keyboards, and
| yet somebody _still_ argues that humans will be left in the
| dust in near times.
|
| No machine, incl. humans can create something more complex
| than itself. This is the rule of abstraction. As you go
| higher level, you lose expressiveness. Yes, you express more
| with less, yet you can express less _in total_. You're
| reducing the set's symbol size (element count) as you go
| higher by clumping symbols together and assigning more
| complex meanings to them.
|
| Yet, being able to describe a larger set with more elements
| while keeping all elements addressable with fewer possible
| symbols doesn't sound plausible to me.
|
| So, as others said. Citation needed. Extraordinary claims
| need extraordinary evidence. No, asking AI to create a
| premium mobile photo app and getting Halide's design as an
| output doesn't count. It's training data leakage.
| joshjob42 wrote:
| I mean, if you draw the scaling curves out and believe them,
| then sometime in the next 3-10 years, plausibly shorter, AIs
| will be able to achieve best-case human performance in
| everything able to be done with a computer and do it at
| 10-1000x less cost than a human, and shortly thereafter
| robots will be able to do something similar (though with a
| smaller delta in cost) for physical labor, and then shortly
| after that we get atomically precise manufacturing and post-
| scarcity. So the amount of stuff that amounts to nothing is
| plausibly every field of endeavor that isn't slightly
| advancing or delaying AI progress itself.
| sigmaisaletter wrote:
| If the scaling continues. We just don't know.
|
| It is kinda a meme at this point, that there is no more
| "publicly available"... cough... training data. And while
| there have been massive breakthroughs in architecture, a
| lot of the progress of the last couple years has been ever
| more training for ever larger models.
|
| So, at this point we either need (a) some previously
| "hidden" super-massive source of training data, or (b)
| another architectural breakthrough. Without either, this is
| a game of optimization, and the scaling curves are going to
| plateau really fast.
| bdangubic wrote:
| _It 's entirely clear that every last human will be beaten on
| code design in the upcoming years (I am not going to argue if
| it's 1 or 5 years away, who cares?)_
|
| Our entire industry (after all these years) does not have
| even a remotely sane measure or definition of what good code
| design is. Hence, this statement is dead on arrival as you are
| claiming something that cannot be either proven or disproven
| by anyone.
| froh wrote:
| searching and ranking existing fragments and recombining them
| within well known paths is one thing, exploratively combining
| existing fragments to completely novel solutions quickly runs
| into combinatorial explosion.
|
| so it's a great tool in the hands of a creative architect, but
| it is not one in and by itself and I don't see yet how it can
| be.
|
| my pet theory is that the human brain can't understand and
| formalize its creativity because you need a higher order logic
| to fully capture some other logic. I've been contested that the
| second Godel incompleteness theorem "can't be applied like this
| to the brain" but I stubbornly insist yes, the brain implements
| _some_ formal system and it can't understand how that system
| works. tongue in cheek, somewhat, maybe.
|
| but back to earth I agree llms are a great tool for a creative
| human mind.
| dist-epoch wrote:
| > Demystifying Godel's Theorem: What It Actually Says
|
| > If you think his theorem limits human knowledge, think
| again
|
| https://www.youtube.com/watch?v=OH-ybecvuEo
| froh wrote:
| thanks for the pointer.
|
| first, with Neil DeGrasse Tyson I feel in fairly ok company
| with my little pet peeve fallacy ;-)
|
| yah as I said, I both get it and don't ;-)
|
| And then the video escapes me saying statements about the
| brain "being a formal method" can't be made "because" the
| finite brain can't hold infinity.
|
| that's beyond me. although obviously the brain can't
| enumerate infinite possibilities, we're still fairly well
| capable of formal thinking, aren't we?
|
| And many lovely formal systems nicely fit on fairly finite
| paper. And formal proofs can be run on finite computers.
|
| So somehow the logic in the video is beyond me.
|
| My humble point is this: if we build "intelligence" as a
| formal system, like some silicon running some fancy pants
| LLM what have you, and we want rigor in it's construction,
| i.e. if we want to be able to tell "this is how it works",
| then we need to use a subset of our brain that's capable of
| formal and consistent thinking. And my claim is that _that
| subsystem_ can't capture "itself". So we have to use "more"
| of our brain than that subsystem. so either the "AI" that
| we understand is "less" than what we need and use to
| understand it. or we can't understand it.
|
| I fully get our brain is capable of more, and this "more"
| is obviously capable of very inconsistent outputs, HAL 9000
| had that problem, too ;-)
|
| I'm an old woman. it's late at night.
|
| When I sat through Godel back in the early 1990s in CS and
| then in contrast listened to the enthusiastic AI lectures
| it didn't sit right with me. Maybe one of the AI Prof's
| made that tactical mistake to call our brain "wet
| biological hardware" in contrast to "dry silicon hardware".
| but I can't shake of that analogy ;-) I hope I'm wrong :-)
| "real" AI that we can trust because we can reason about
| its inner workings will be fun :-)
| breuleux wrote:
| > I've been contested that the second Godel incompleteness
| theorem "can't be applied like this to the brain" but I
| stubbornly insist yes, the brain implements _some_ formal
| system and it can't understand how that system works
|
| I would argue that the second incompleteness theorem doesn't
| have much relevance to the human brain, because it is trying
| to prove a falsehood. The brain is blatantly _not_ a
| consistent system. It is, however, paraconsistent: we are
| perfectly capable of managing a set of inconsistent premises
| and extracting useful insight from them. That's a good
| thing.
|
| It's also true that we don't understand how our own brain
| works, of course.
| jppittma wrote:
| I've had great success by asking it to do project design first,
| compose the design into an artifact, and then asking it to
| consult the design artifact as it writes code.
| epaga wrote:
| This is a great idea - do you have a more detailed overview
| of this approach and/or an example? What types of things do
| you tell it to put into the "artefact"?
| pdntspa wrote:
| I don't know about that, my own adventures with Gemini Pro 2.5
| in Roo Code has it outputting code in a style that is very
| close to my own
|
| While far from perfect for large projects, controlling the
| scope of individual requests (with orchestrator/boomerang mode,
| for example) seems to do wonders
|
| Given the sheer, uh, variety of code I see day to day in an
| enterprise setting, maybe the problem isn't with Gemini?
| abletonlive wrote:
| I feel like there are two realities right now where half the
| people say LLM doesn't do anything well and there is another
| half that's just using LLM to the max. Can everybody preface
| what stack they are using or what exactly they are doing so we
| can better determine why it's not working for you? Maybe even
| include what your expectations are? Maybe even tell us what
| models you're using? How are you prompting the models exactly?
|
| I find for 90% of the things I'm doing LLM removes 90% of the
| starting friction and let me get to the part that I'm actually
| interested in. Of course I also develop professionally in a
| python stack and LLMs are 1 shotting a ton of stuff. My work is
| standard data pipelines and web apps.
|
| I'm a tech lead at faang adjacent w/ 11YOE and the systems I
| work with are responsible for about half a billion dollars a
| year in transactions directly and growing. You could argue
| maybe my standards are lower than yours but I think if I was
| making deadly mistakes the company would have been on my ass by
| now or my peers would have caught them.
|
| Everybody that I work with is getting valuable output from
| LLMs. We are using all the latest openAI models and have a
| business relationship with openAI. I don't think I'm even that
| good at prompting and mostly rely on "vibes". Half of the time
| I'm pointing the model to an example and telling it "in the
| style of X do X for me".
|
| I feel like comments like these almost seem gaslight-y or maybe
| there's just a major expectation mismatch between people. Are
| you expecting LLMs to just do exactly what you say and your
| entire job is to sit back and prompt the LLM? Maybe I'm just used to
| shit code but I've looked at many code bases and there is a
| huge variance in quality and the average is pretty poor. The
| average code that AI pumps out is much better.
| oparin10 wrote:
| I've had the opposite experience. Despite trying various
| prompts and models, I'm still searching for that mythical 10x
| productivity boost others claim.
|
| I use it mostly for Golang and Rust, I work building cloud
| infrastructure automation tools.
|
| I'll try to give some examples, they may seem overly specific
| but it's the first things that popped into my head when
| thinking about the subject.
|
| Personally, I found that LLMs consistently struggle with
| dependency injection patterns. They'll generate tightly
| coupled services that directly instantiate dependencies
| rather than accepting interfaces, making testing nearly
| impossible.
|
| If I ask them to generate code and also their respective unit
| tests, they'll often just create a bunch of mocks or start
| importing mock libraries to compensate for their faulty
| implementation, rather than fixing the underlying
| architectural issues.
|
| They consistently fail to understand architecture patterns,
| generating code where infrastructure concerns bleed into
| domain logic. When corrected, they'll make surface level
| changes while missing the fundamental design principle of
| accepting interfaces rather than concrete implementations,
| even when explicitly instructed that it should move things
| like side-effects to the application edges.
|
| Despite tailoring prompts for different models based on
| guides and personal experience, I often spend 10+ minutes
| correcting the LLM's output when I could have written the
| functionality myself in half the time.
|
| No, I'm not expecting LLMs to replace my job. I'm expecting
| them to produce code that follows fundamental design
| principles without requiring extensive rewriting. There's a
| vast middle ground between "LLMs do nothing well" and the
| productivity revolution being claimed.
|
| That being said, I'm glad it's working out so well for you, I
| really wish I had the same experience.
| abletonlive wrote:
| > I use it mostly for Golang and Rust
|
| I'm starting to suspect this is the issue. Neither of these
| languages are in the top 5 languages so there is probably
| less to train on. It'd be interesting to see if this
| improves over time or if the gap between the languages
| become even more intense as it becomes favorable to use a
| language simply because LLMs are so much better at it.
|
| There are a lot of interesting discussions to be had here:
|
| - if the efficiency gains are real and llms don't improve
| in lesser used languages, one should expect that we might
| observe that companies that chose to use obscure languages
| and tech stacks die out as they become a lot less
| competitive against stacks that are more compatible with
| llms
|
| - if the efficiency gains are real this might
| disincentivize new language adoption and creation unless
| the folks training models somehow address this
|
| - languages like python with higher output acceptance rates
| are probably going to become even more compatible with llms
| at a faster rate if we extrapolate that positive
| reinforcement is probably more valuable than negative
| reinforcement for llms
| oparin10 wrote:
| Yes, I agree, that's likely a big factor. I've had a
| noticeably better LLM design experience using widely
| adopted tech like TypeScript/React.
|
| I do wonder if the gap will keep widening though. If
| newer tools/tech don't have enough training data, LLMs
| may struggle more with them early on. Its possible that
| RAG and other optimization techniques will evolve fast
| enough to narrow the gap and prevent diminishing returns
| on LLM driven productivity.
| Implicated wrote:
| I'm also suspecting this has a lot to do with the
| dichotomy between the "omg llms are amazing at code
| tasks" and "wtf are these people using these llms for
| it's trash" takes.
|
| As someone who works primarily within the Laravel stack,
| in PHP, the LLM's are wildly effective. That's not to say
| there aren't warts - but my productivity has skyrocketed.
|
| But it's become clear that when you venture into the
| weeds of things that aren't very mainstream you're going
| to get wildly more hallucinations and solutions that are
| puzzling.
|
| Another observation is that I believe that when you start
| getting outside of your expertise you're likely going to
| have a correlating amount of 'waste' time spent where the
| LLM is spitting out solutions that an expert in the
| domain would immediately recognize as problematic but the
| non-expert will see and likely reason that it seems
| reasonable/or, worse, not even look at the solution and
| just try to use it.
|
| 100% of the time that I've tried to get
| Claude/Gemini/ChatGPT to "one shot" a whole feature or
| refactor it's been a waste of time and tokens. But when
| I've spent even a minimal amount of energy to focus it in
| on the task, curate the context and then approach?
| Tremendously effective most times. But this also requires
| me to do enough mental work that I probably have an idea
| of how it should work out which primes my capability to
| parse the proposed solutions/code and pick up the pieces.
| Another good flow is to just prompt the LLM (in this
| case, Claude Code, or something with MCP/filesystem
| access) with the feature/refactor/request asking it to
| draw up the initial plan of implementation to feed to
| itself. Then iterate on that as needed before starting up
| a new session/context with that plan and hitting it one
| item at a time, while keeping a running
| {TASK_NAME}_WORKBOOK.md (that you task the llm to keep up
| to date with the relevant details) and starting a new
| session/context for each task/item on the plan, using the
| workbook to get the new sessions up to speed.
|
| Also, this is just a hunch, but I'm generally a nocturnal
| creature and tend to be working in the evening into early
| mornings. Once 8am PST rolls around I really feel like
| Claude (in particular) just turns into mush. Responses
| get slower but it seems it loses context where it
| otherwise wouldn't start getting off topic/having to re-
| read files it should already have in context. (Note; I'm
| pretty diligent about refreshing/working with the context
| and something happens in the 'work' hours to make it
| terrible)
|
| I'd imagine we're going to end up with language specific
| llms (though I have no idea, just seems logical to me)
| that a 'main' model pushes tasks/tool usage to. We don't
| need our "coding" LLM's to also be proficient on oceanic
| tidal patterns and 1800's boxing history. Those are all
| parameters that could have been better spent on the code.
| thewebguyd wrote:
| I've found, like you mentioned, that the tech stack you work
| with matters a lot in terms of successful results from LLMs.
|
| Python is generally fine, as you've experienced, as is
| JavaScript/TypeScript & React.
|
| I've had mixed results with C# and PowerShell. With
| PowerShell, hallucinations are still a big problem. Not sure
| if it's the Noun-Verb naming scheme of cmdlets, but most
| models still make up cmdlets that don't exist on the fly
| (though will correct itself once you correct it that it
| doesn't exist but at that point - why bother when I can just
| do it myself correctly the first time).
|
| With C#, even with my existing code as context, it can't
| adhere to a consistent style, and can't handle nullable
| reference types (albeit, a relatively new feature in C#). It
| works, but I have to spend too much time correcting it.
|
| Given my own experiences and the stacks I work with, I still
| won't trust an LLM in agent mode. I make heavy use of them as
| a better Google, especially since Google has gone to shit,
| and to bounce ideas off of, but I'll still write the code
| myself. I don't like reviewing code, and having LLMs write
| code for me just turns me into a full time code reviewer, not
| something I'm terribly interested in becoming.
|
| I still get a lot of value out of the tools, but for me I'm
| still hesitant to unleash them on my code directly. I'll
| stick with the chat interface for now.
|
| _edit_ Golang is another language I've had problems relying
| on LLMs for. On the flip side, LLMs have been great for me
| with SQL and I'm grateful for that.
| neonsunset wrote:
| FWIW If you are using Github Copilot Edit/Agent mode - you
| may have more luck with other plugins. Until recently,
| Claude 3.5 Sonnet worked really well with C# and
| required relatively few extra commands to stay consistent
| to "newest tersest" style. But then, from my
| understanding, there was a big change in how Copilot
| extension handles attached context alongside changes around
| what I presume were prompt and fine-tuning changes, which
| resulted in severe degradation of the output quality. Hell,
| even attaching context data does not properly work 1 out of 3
| times. But at least Gemini 2.5 Pro can write tests semi-
| competently. I still can't fathom how they managed to make it
| so much worse!
| bboygravity wrote:
| This is hilarious to read if you have actually seen the average
| (embedded systems) production code written by humans.
|
| Either you have no idea how terrible real world commercial
| software (architecture) is or you're vastly underestimating
| newer LLMs or both.
| onlyrealcuzzo wrote:
| 2.5 pro seems like a huge improvement.
|
| One area I've still noticed weakness is if you want to use a
| pretty popular library from one language in another language,
| it has a tendency to think the function signatures in the
| popular language match the other.
|
| Naively, this seems like a hard problem to solve.
|
| I.e. ask it how to use torchlib in Ruby instead of Python.
| viraptor wrote:
| > no amount of prompting will get current models to approach
| abstraction and architecture the way a person does.
|
| What do you mean specifically? I found the "let's write a spec,
| let's make a plan, implement this step by step with testing"
| results in basically the same approach to design/architecture
| that I would take.
| nurettin wrote:
| Just tell it to cite docs when using functions, works wonders.
| tastysandwich wrote:
| Re hallucinating APIs that don't exist - I find this with
| Golang sometimes. I wonder if it's because the training data
| doesn't just consist of all the docs and source code, but
| potentially feature proposals that never made it into the
| language.
|
| Regexes are another area where I can't get much help from LLMs.
| If it's something common like a phone number, that's fine. But
| anything novel it seems to have trouble. It will spit out junk
| very confidently.
| xnx wrote:
| This is much bigger news than OpenAI's acquisition of WindSurf.
| herpdyderp wrote:
| I agree it's very good but the UI is still usually an unusable,
| scroll-jacking disaster. I've found it's best to let a chat sit
| for a few minutes after it has finished printing the AI's
| output. Finding the `ms-code-block` element in dev tools and
| logging `$0.textContent` is reliable too.
| OsrsNeedsf2P wrote:
| Loading the UI on mobile while on low bandwidth is also a non-
| starter. It simply doesn't work.
| uh_uh wrote:
| Noticed this too. There's something funny about billion dollar
| models being handicapped by stuck buttons.
| energy123 wrote:
| The Gemini app has a number of severe bugs that impact
| everyone who uses it, and those bugs have persisted for over
| 6 months.
|
| There's something seriously dysfunctional and incompetent
| about the team that built that web app. What a way to waste
| the best LLM in the world.
| kubb wrote:
| It's the company. Letting incompetent people who are vocal
| rise to the top is a part of Google's culture, and the
| internal performance review process discourages excellence
| - doing the thousand small improvements that makes a
| product truly great is invisible to it, so nobody does it.
|
| Software that people truly love is impossible to build in
| there.
| crat3r wrote:
| So, are people using these tools without the org they work for
| knowing? The amount of hoops I would have to jump through to get
| either of the smaller companies I have worked for since the AI
| boom to let me use a tool like this would make it absolutely not
| worth the effort.
|
| I'm assuming large companies are mandating it, but ultimately the
| work that these LLMs seem poised for would benefit smaller
| companies most and I don't think they can really afford using
| them? Are people here paying for a personal subscription and then
| linking it to their work machines?
| jeffbee wrote:
| Not every coding task is something you want to check into your
| repo. I have mostly used Gemini to generate random crud. For
| example I had a huge JSON representation of a graph, and I
| wanted the graph modified in a given way, and I wanted it
| printed out on my terminal in color. None of which I was
| remotely interested in writing, so I let a robot do it and it
| was fine.
| crat3r wrote:
| Fair, but I am seeing so much talk about how it is completing
| actual SDE tickets. Maybe not this model specifically, but to
| be honest I don't care about generating dummy data, I care
| about the claims that these newer models are on par with
| junior engineers.
|
| Junior engineers will complete a task to update an API, or
| fix a bug on the front-end, within a couple days with, let's
| say 80 percent certainty they hit the mark (maybe an inflated
| metric). How are people comparing the output of these models
| to that of a junior engineer if they generally just say "Here
| is some of my code, what's wrong with it?". That certainly
| isn't taking a real ticket and completing it in any capacity.
|
| I am obviously very skeptical but mostly I want to try one of
| these models myself but in reality I think that my higher-ups
| would think that they introduce both risk AND the potential
| for major slacking off haha.
| jpc0 wrote:
| I don't know about tickets but my org definitely happily
| pays for Gemini Advanced and encourages its use and would
| be considered a small org.
|
| The latest SOTA models are definitely at the point where
| they can absolutely improve workflows and not get in your
| way too much.
|
| I treat it a lot like an intern, "Here's an api doc and
| spec, write me the boilerplate and a general idea about
| implementation"
|
| Then I go in, review, rip out crud, and add what I need.
|
| It almost always gets architecture wrong, don't expect that
| from it. However small functions and such is great.
|
| When it comes to refactoring ask it for suggestions, eat
| the meat leave the bones.
| bongodongobob wrote:
| I work for a large company and everything other than MS Copilot
| is blocked aggressively at the DNS/cert level. Tried Deepseek
| when it came out and they already had it blocked. All .ai TLDs
| are blocked as well. If you're not in tech, there is a lot of
| "security" fear around AI.
| codebolt wrote:
| If you can get them to approve GitHub Copilot Business then
| Gemini Pro 2.5 and many others are available there. They have
| guarantees that they don't share/store prompts or code and the
| parent company is Microsoft. If you can argue that they will
| save money (on saved developer time), what would be their
| argument against?
| otabdeveloper4 wrote:
| > They have guarantees that they don't share/store prompts or
| code
|
| "They trust me. Dumb ..."
| tasuki wrote:
| > The amount of hoops I would have to jump through to get
| either of the smaller companies I have worked for since the AI
| boom to let me use a tool like this would make it absolutely
| not worth the effort.
|
| Define "smaller"? In small companies, say 10 people, there are
| no hoops. That is the whole point of small companies!
| ionwake wrote:
| Is it possible to use this with Cursor? If so what is the name of
| the model? gemini-2.5-pro-preview ?
|
| edit> Its gemini-2.5-pro-preview-05-06
|
| edit> Cursor says it doesn't have "good support" yet, but I'm not
| sure if this is a default message when it doesn't recognise a
| model? Is this a big deal? Should I wait until it's officially
| supported by Cursor?
|
| Just trying to save time here for everyone - anyone know the
| answer?
| androng wrote:
| At the bottom of the article it says no action is required and
| the Gemini-2.5-pro-preview-03-25 now points to the new model
| ionwake wrote:
| well, a lot of action was required, such as adding the model, so
| no idea what happened to the guy who wrote the article - maybe
| there is a new Cursor update now
| tough wrote:
| Cursor UI sucks, it tells me to use -auto mode- to be faster,
| but gemini 2.5 is way faster than any of the other free models,
| so just selecting that one is faster even if the UI says
| otherwise
| ionwake wrote:
| yeah ive noticed this too, like wtf would I use Auto?
| bn-l wrote:
| The one with exp in the name is free (you may have to add it
| yourself) but they train on you. And after a certain limit it
| becomes paid).
| xbmcuser wrote:
| As a non-programmer I have been really loving Gemini 2.5 Pro
| for my Python scripting for manipulating text and Excel files
| and for web scraping. In the past I was able to use ChatGPT to
| code some of the things that I wanted, but with Gemini 2.5 Pro
| it has been just another level. If they improved it further
| that would be amazing.
| djrj477dhsnv wrote:
| I don't understand what I'm doing wrong.. it seems like everyone
| is saying Gemini is better, but I've compared dozens of examples
| from my work, and Grok has always produced better results.
| redox99 wrote:
| I haven't tested this release yet, but I found Gemini to be
| overrated before.
|
| My choice of LLMs was
|
| Coding in cursor: Claude
|
| General questions: Grok, if it fails then Gemini
|
| Deep Research: Gemini (I don't have GPT plus, I heard it's
| better)
| dyauspitr wrote:
| Anecdotally grok has been the worst of the bunch for me.
| athoun wrote:
| I agree, from my experience Grok gives superior coding results,
| especially when modifying large sections of the codebase at
| once such as in refactoring.
|
| Although it's not for coding, I have noticed Gemini 2.5 pro
| Deep Research has surpassed Grok's DeepSearch in thoroughness
| and research quality however.
| white_beach wrote:
| object?
|
| (aider joke)
| llm_nerd wrote:
| Their nomenclature is a bit confused. The Gemini web app has a
| 2.5 Pro (experimental), yet this apparently is referring to 2.5
| Pro Preview 05-06.
|
| Would be ideal if they incremented the version number or the
| like.
| martinald wrote:
| I'm totally lost again! If I use Gemini on the website
| (gemini.google.com), am I using 2.5 Pro IO edition, or am I using
| the old one?
| disgruntledphd2 wrote:
| Check the dropdown in the top left (on my screen,at least).
| martinald wrote:
| Are you referring to gemini.google.com or ai studio? I see
| 2.5 Pro but is this the right one? I saw a tweet from them
| saying you have to select Canvas first? I'm so so lost.
| koakuma-chan wrote:
| http://aistudio.google.com/app/prompts/new_chat?model=gemini...
| martinald wrote:
| I get this in AI studio, but does it apply to
| gemini.google.com?
| pzo wrote:
| "The previous iteration (03-25) now points to the most recent
| version (05-06), so no action is required to use the improved
| model"
| oellegaard wrote:
| Is there anything like Claude code for other models such as
| gemini?
| mickeyp wrote:
| I'm literally working on this particular problem. Locally-run
| server; browser-based interface instead of TUI/CLI; connects to
| all the major model APIs; many, many quality of life and
| feature improvements over other tools that hook into your
| browser.
|
| Drop me a line (see profile) if you're interested in beta
| testing it when it's out.
| oellegaard wrote:
| I'm actually very happy with everything in Claude code, eg
| the CLI so im really just curious to try other models
| revicon wrote:
| Same! I prefer the CLI, way easier when I'm connected via
| ssh from another network somewhere.
| mickeyp wrote:
| The CLI definitely has its advantages!
|
| But with my app: you can install the host anywhere and
| connect to it securely (via SSH forwarding or private VPN
| or what have you) so that workflow definitely still
| works!
| Filligree wrote:
| I find that 2.5 Pro has a higher ceiling of understanding,
| while Claude writes more maintainable code with better
| comments. If we want to combine them... well, it should be
| easier to fix 2.5 than Claude. That said, neither is there
| yet.
|
| Currently Claude Code is a big value-add for Claude. Google
| has nothing equivalent; aider requires far more manual
| work.
| alphabettsy wrote:
| Aider
| danielbln wrote:
| Aider wasn't all that agentic last time I tried it, has that
| changed?
| elliot07 wrote:
| OpenAI has a version called Codex that has support. It's
| lacking in a few features like MCP right now and the TUI isn't
| there yet, but interestingly they are building a Rust version
| (it's all open source) that seems to include MCP support and
| looks significantly higher quality. I'd bet within the next few
| weeks there will be a high quality claude code alternative.
| martythemaniak wrote:
| Goose by Block (Square/CashApp) is like an open-source Claude
| Code that works with any remote or local LLM.
|
| https://github.com/block/goose
| vunderba wrote:
| Haven't tried it yet, but I've heard good things about Plandex.
|
| https://github.com/plandex-ai/plandex
| mliker wrote:
| The "video to learning app" feature is a cool concept (see it in
| AI Studio). I just passed in two separate Stanford lectures to
| see if it could come up with an interesting interactive app. The
| apps it generated weren't too useful, but I can see with more
| focus and development, it'd be a game changer for education.
| SparkyMcUnicorn wrote:
| Anyone know of any coding agents that support video inputs?
|
| Web chat interfaces are great, but copy/paste gets old fast.
| lostmsu wrote:
| I wonder how it processes video. Even individual pictures take
| a lot of tokens.
| brap wrote:
| Gemini is now ranked #1 across every category in lmarena.
| aoeusnth1 wrote:
| LMArena is a joke, though
| killerstorm wrote:
| Why can't they just use version numbers instead of this "new
| preview" stuff?
|
| E.g. call it Gemini Pro 2.5.1.
| lukeschlather wrote:
| I take preview to mean the model may be retired on an
| accelerated timescale and replaced with a "real" model so it's
| dangerous to put into prod unless you are paying attention.
| lolinder wrote:
| They could still use version numbers for that. 2.5.1-preview
| becomes 2.5.1 when stable.
| danenania wrote:
| Scheduled tasks in ChatGPT are useful for keeping track of
| these kinds of things. You can have it check daily whether
| there's a change in status, price, etc. for a particular
| model (or set of models).
| cdolan wrote:
| I appreciate that you are trying to help
|
| But I do not want to have to build a network of bots with
| non-deterministic outputs to simply stay on top of versions
| danenania wrote:
| Neither do I, but it's the best solution I've found so
| far. It beats checking models/prices manually every day
| to see if anything has changed, and it works well enough
| in practice.
|
| But yeah, some kind of deterministic way to get alerts
| would be better.
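|
| Even something dumb and deterministic works for the alert part,
| e.g. hash the page you care about on a cron job (URL below is
| just an example target):
|
|     # Deterministic "did anything change?" alert, no LLM involved.
|     import hashlib
|     import pathlib
|     import urllib.request
|
|     URL = "https://ai.google.dev/pricing"  # swap for whatever page you watch
|     STATE = pathlib.Path("last_hash.txt")
|
|     body = urllib.request.urlopen(URL, timeout=30).read()
|     digest = hashlib.sha256(body).hexdigest()
|
|     if STATE.exists() and STATE.read_text() != digest:
|         print("page changed, go look")  # or send an email/webhook here
|     STATE.write_text(digest)
|
| (Noisy pages may need you to diff just the relevant section
| rather than the whole body.)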
| mhh__ wrote:
| Are you saying you find model names like o4-mini-high-pro-
| experimental-version5 confusing and stupid?
| andy12_ wrote:
| Interestingly, when comparing benchmarks of Experimental 03-25
| [1] and Experimental 05-06 [2] it seems the new version scores
| slightly lower in everything except on LiveCodeBench.
|
| [1] https://storage.googleapis.com/model-
| cards/documents/gemini-... [2]
| https://deepmind.google/technologies/gemini/
| arnaudsm wrote:
| This should be the top comment. Cherry-picking is hurting this
| industry.
|
| I bet they kept training on coding tasks, made everything worse
| on the way, and tried to hide it under the rug because of the
| sunk costs.
| luckydata wrote:
| Or because they realized that coding is what most of those
| LLMs are used for anyways?
| arnaudsm wrote:
| They should have shown the benchmarks. Or marketed it as a
| coding model, like Qwen & Mistral.
| jjani wrote:
| That's clearly not a PR angle they could possibly take
| when it's replacing the overall SotA model. This is a
| business decision, potentially inference cost related.
| arnaudsm wrote:
| From a business pov it's a great move, for the customers
| it's evil to hide evidence that your product became
| worse.
| cma wrote:
| They likely knew continued training on code would cause some
| amount of catastrophic forgetting on other stuff. They didn't
| throw away the old weights, so this probably isn't the sunk
| cost fallacy. But the model is relatively new, and they found
| out X% of API token spend was on coding agents (where X is
| huge) compared to what the token spend distribution looked
| like on prior Geminis that couldn't code well. They probably
| didn't want the complexity and worse batching of a separate
| coding model if the impacts weren't too large, decided they
| hadn't weighted coding enough initially, and judged the
| tradeoffs worth it.
| jjani wrote:
| Sounds like they were losing so much money on 2.5-Pro they came
| up with a forced update that made it cheaper to run. They can't
| come out with "we've made it worse across the board", nor do
| they want to be the first to actually raise prices, so instead
| they made a bit of a distill that's slightly better at coding
| so they can still spin it positively.
| sauwan wrote:
| I'd be surprised if this was a new base model. It sounds like
| they just did some post-training RL tuning to make this
| version specifically stronger for coding, at the expense of
| other priorities.
| jjani wrote:
| Every frontier model now is a distill of a larger
| unpublished model. This could be a slightly smaller
| distill, with potentially the extra tuning you're
| mentioning.
| cubefox wrote:
| That's an unsubstantiated claim. I doubt this is true,
| since people are disproportionately more willing to pay
| for the best of the best, rather than for something
| worse.
| tangjurine wrote:
| Any info on this?
| Workaccount2 wrote:
| Google doesn't pay the nvidia tax. Their TPUs are designed
| for Gemini and Gemini designed for their TPUs. Google is no
| doubt paying far less per token than every other AI house.
| merksittich wrote:
| According to the article, "[t]he previous iteration (03-25) now
| points to the most recent version (05-06)." I assume this
| applies to both the free tier gemini-2.5-pro-exp-03-25 in the
| API (which will be used for training) and the paid tier
| gemini-2.5-pro-preview-03-25.
|
| Fair enough, one could say, as these were all labeled as
| preview or experimental. Still, considering that the new model
| is slightly worse across the board in benchmarks (except for
| LiveCodeBench), it would have been nice to have the option to
| stick with the older version. Not everyone is using these
| models for coding.
| zurfer wrote:
| Just switching a pinned version (even alpha, beta,
| experimental, preview) to another model doesn't feel right.
|
| I get it, chips are scarce and they want their capacity back,
| but it breaks trust with developers to just downgrade your
| model.
|
| Call it gemini-latest and I understand that things will
| change. Call it *-03-25 and I want the same model that I got
| on 25th March.
| nopinsight wrote:
| Livebench.ai actually suggests the new version is better on
| most things.
|
| https://livebench.ai/#/
| thevillagechief wrote:
| I've been switching between this and GPT-4o at work, and Gemini
| is really verbose. But I've been primarily using it. I'm confused
| though, the model available in copilot says Gemini 2.5 Pro
| (Preview), and I've had it for a few weeks. This was just
| released today. Is this an updated preview? If so, the
| blog/naming is confusing.
| CSMastermind wrote:
| Hasn't Gemini 2.5 Pro been out for a while?
|
| At first I was very impressed with its coding abilities,
| switching off of Claude for it, but recently I've been using GPT o3
| which I find is much more concise and generally better at problem
| solving when you hit an error.
| spaceman_2020 wrote:
| Think that was still the experimental model incorrectly labeled
| by many platforms as "Pro"
| 85392_school wrote:
| That's inaccurate. First, there was the experimental 03-25
| checkpoint. Then it was promoted to Preview without changing
| anything. And now we have a new 05-06 checkpoint, still
| called Gemini 2.5 Pro, and still in Preview.
| laborcontract wrote:
| My guess is that they've done a lot of tuning to improve diff
| based code editing. Gemini 2.5 is fantastic at agentic work, but
| it still is pretty rough around the edges in terms of generating
| perfectly matching diffs to edit code. It's probably one of the
| very few issues with the model. Luckily, aider tracks this.
|
| They measure the old gemini 2.5 generating proper diffs 92% of
| the time. I bet this goes up to ~95-98%
| https://aider.chat/docs/leaderboards/
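|
| For anyone who hasn't seen it, the edit format in question is
| roughly a search/replace block like the one below (from memory,
| details may differ - see the aider docs). The usual failure is
| the SEARCH half not matching the file byte-for-byte:
|
|     path/to/file.py
|     <<<<<<< SEARCH
|     def total(xs):
|         return sum(xs)
|     =======
|     def total(xs: list[float]) -> float:
|         return sum(xs)
|     >>>>>>> REPLACE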
|
| Question for the google peeps who monitor these threads: Is
| gemini-2.5-pro-exp (free tier) updated as well, or will it go
| away?
|
| Also, in the blog post, it says: > The previous
| iteration (03-25) now points to the most recent version (05-06),
| so no action is required to use the improved model, and it
| continues to be available at the same price.
|
| Does this mean gemini-2.5-pro-preview-03-25 now uses 05-06? Does
| the same apply to gemini-2.5-pro-exp-03-25?
|
| update: I just tried updating the date in the exp model
| (gemini-2.5-pro-exp-05-06) and that doesn't work.
| okdood64 wrote:
| What do you mean by agentic work in this context?
| laborcontract wrote:
| Knowing when to call functions, generating the proper
| function calling text structure, properly executing functions
| in sequence, knowing when it's completed its objective, and
| doing that over an extended context window.
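|
| Concretely, something like this (illustrative only - every
| provider's schema differs in the details):
|
|     # A declared tool the model is allowed to call...
|     tool = {
|         "name": "read_file",
|         "description": "Read a file from the workspace",
|         "parameters": {
|             "type": "object",
|             "properties": {"path": {"type": "string"}},
|             "required": ["path"],
|         },
|     }
|
|     # ...and what a well-formed "agentic" turn from the model boils down
|     # to. The harness runs the tool, appends the result, and loops until
|     # the model stops asking for tools and declares the objective done.
|     model_turn = {"tool_call": {"name": "read_file",
|                                 "arguments": {"path": "src/app.py"}}}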
| laborcontract wrote:
| Update 2: I've been using this model in both aider and cline
| and I've haven't gotten a diff matching error yet, even with
| some pretty difficult substitutions across different places in
| multiple files. The overall feel of this model is nice.
|
| I don't have a formal benchmark but there's a notable
| improvement in code generation due to this alone.
|
| I've had gemini chug away on plans that have taken ~1 hour to
| implement. (~80mln tokens spent) A good portion of that energy
| was spent fixing mistakes made by cline/aider/roo due to
| search/replace mistakes. If this model gets anywhere close to
| 100% on diffs then this is a BFD. I estimate this will
| translate to a 50-75% productivity boost on long context coding
| tasks. I hope the initial results i'm seeing hold up!
|
| I'm surprised by the reaction in the rest of the thread. A lot
| of unproductive complaining, a lot of off-topic stuff, nothing
| talking about the model itself.
|
| Any thoughts from anyone else using the updated model?
| EliasWatson wrote:
| I wonder how the latest version of Grok 3 would stack up to
| Gemini 2.5 Pro on the web dev arena leaderboard. They are still
| just showing the original early access model for some reason,
| despite there being API access to the latest model. I've been
| using Grok 3 with Aider Chat and have been very impressed with
| it. I get $150 of free API credits every month by allowing them
| to train on my data, which I'm fine with since I'm just working
| on personal side projects. Gemini 2.5 Pro and Claude 3.7 might be
| a little better than Grok 3, but I can't justify the cost when
| Grok doesn't cost me a penny to use.
| mohsen1 wrote:
| I use Gemini for almost everything. But their model card[1] only
| compares to o3-mini! In known benchmarks o3 is still ahead:
| +------------------------------+---------+--------------+
| | Benchmark | o3 | Gemini 2.5 |
| | | | Pro |
| +------------------------------+---------+--------------+
| | ARC-AGI (High Compute) | 87.5% | -- |
| | GPQA Diamond (Science) | 87.7% | 84.0% |
| | AIME 2024 (Math) | 96.7% | 92.0% |
| | SWE-bench Verified (Coding) | 71.7% | 63.8% |
| | Codeforces Elo Rating | 2727 | -- |
| | MMMU (Visual Reasoning) | 82.9% | 81.7% |
| | MathVista (Visual Math) | 86.8% | -- |
| | Humanity's Last Exam | 26.6% | 18.8% |
| +------------------------------+---------+--------------+
|
| [1] https://storage.googleapis.com/model-
| cards/documents/gemini-...
| jsnell wrote:
| The text in the model card says the results are from March
| (including the Gemini 2.5 Pro results), and o3 wasn't released
| yet.
|
| Is this maybe not the updated card, even though the blog post
| claims there is one? Sure, the timestamp is in late April, but
| I seem to remember that the first model card for 2.5 Pro was
| only released in the last couple of weeks.
| cbg0 wrote:
| o3 is $40/M output tokens and 2.5 Pro is $10-15/M output tokens
| so o3 being slightly ahead is not really worth paying 4 times
| more than Gemini.
| jorl17 wrote:
| Also, o3 is insanely slow compared to Gemini 2.5 Pro
| i_have_an_idea wrote:
| Not sure why this is being downvoted, but it's absolutely
| true.
|
| If you're using these models to generate code daily, the
| costs add up.
|
| Sure, I'll give a really tough problem to o3 (and probably
| over ChatGPT, not the API), but on general code tasks, there
| really isn't meaningful enough difference to justify 4x the
| cost.
| gitroom wrote:
| Man, that endless commenting seriously kills my flow. Gotta
| say, even after all the prompts and hacks, I still can't get
| these models to chill out. Do you think we'll ever get AI to
| stop overdoing it and actually fit real developer habits, or is
| it always gonna be like this?
| arnaudsm wrote:
| Be careful: this model is worse than the 03-25 release in 10 of
| the 12 benchmarks (!)
|
| I bet they kept training on coding, made everything worse along
| the way, and tried to sweep it under the rug because of the
| sunk costs.
| jstummbillig wrote:
| It seems that trying to build LLMs is the definition of
| accepting sunk cost.
| nashashmi wrote:
| I keep hearing good things about Gemini online and offline. I
| wrote them off as terrible when they first launched and have not
| looked back since.
|
| How are they now? Sufficiently good? Competent? Competitive? Or
| limited? My needs are very consumer-oriented, not
| programming/API stuff.
| hmate9 wrote:
| Probably the best one right now, their deep research is also
| very good.
| danielbln wrote:
| Bard sucked, Gemini sucked, Gemini 2 was alright, 2.5 is
| awesome and my main driver for coding these days.
| thevillagechief wrote:
| The Gemini deep research is a revelation. I obsessively
| research most things I buy, from home appliances to gym
| equipment. It has literally saved untold hours of comparisons.
| You get detailed reports generated from every website,
| including YouTube reviews. I've bought a bunch of stuff on its
| recommendation.
| Imanari wrote:
| Care to share your search prompt?
| ramoz wrote:
| Never sleep on Google.
| panarchy wrote:
| Is it just me, or does Gemini 2.5 generate a lot of code whose
| end results are usually lackluster compared to Claude and even
| ChatGPT? I also find it hard-headed: it frequently does things
| in ways I explicitly told it not to. The massive context window
| is pretty great though, and enables me to do things I can't
| with the others, so it still gets used a lot.
| scrlk wrote:
| How are you using it?
|
| I find that I get the best results from 2.5 Pro via Google AI
| Studio with a low temperature (0.2-0.3).
| panarchy wrote:
| AI Studio as well, but I haven't played around with the
| temperature too much and even then I only lowered it to like
| 0.8 a few times. So I'll have to try this out. Thanks.
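|
| For anyone who wants to reproduce the low-temperature setup
| outside the AI Studio UI, here is a minimal sketch assuming the
| python-genai SDK (the model name and temperature value are
| illustrative, not a recommendation):
|
|     from google import genai
|     from google.genai import types
|
|     client = genai.Client(api_key="...")  # AI Studio API key
|     response = client.models.generate_content(
|         model="gemini-2.5-pro-preview-05-06",
|         contents="Refactor this function: ...",
|         # a low temperature tends to give more deterministic edits
|         config=types.GenerateContentConfig(temperature=0.2),
|     )
|     print(response.text)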
| xyst wrote:
| Proprietary junk beats DeepSeek by a mere 213 points?
|
| Oof. G and others are way behind
| childintime wrote:
| How does it perform on anything but Python and JavaScript? In
| my experience my mileage varied a lot when using C#, for
| example, or Zig, so I've learnt to just let it select the
| language it wants.
|
| Also, why doesn't Ctrl+C work??
| scbenet wrote:
| It's very good at Go, which makes sense because I'm assuming
| it's trained on a lot of Google's code
| simianwords wrote:
| How would they train it on Google code without revealing
| internal IP?
| obsolete_wagie wrote:
| o3 is so far ahead of Anthropic and Google that these models
| aren't even worth using
| Workaccount2 wrote:
| o3 is expensive in the API and intentionally crippled in the
| web app.
| Squarex wrote:
| source?
| obsolete_wagie wrote:
| Use the models daily; it's not even close.
| mattlondon wrote:
| The benchmarks (1) seem to suggest that o3 is in 3rd place
| after Gemini 2.5 Pro preview and Gemini 2.5 Pro exp (for text
| reasoning; o3 is 4th for webdev). o3 doesn't even appear on the
| OpenRouter leaderboards (2), suggesting it is hardly used (if
| at all) by anyone using LLMs to actually _do_ anything (such as
| coding), which makes one question whether it is actually any
| good at all (otherwise, if it were so great, I'd expect to see
| heavy usage).
|
| Not sure where your data is coming from but everything else is
| pointing to Google supremacy in AI right now. I look forward to
| some new models from Anthropic, xAI, Meta et al (it remains to be
| seen if OpenAI has anything left apart from bluster). Exciting
| times.
|
| 1 - https://beta.lmarena.ai/leaderboard
|
| 2 - https://openrouter.ai/rankings
| obsolete_wagie wrote:
| You just aren't using the models to their full capacity if you
| think this; the benchmarks have all been hacked.
| cellis wrote:
| 8x the cost for maybe 5% improvement?
| epolanski wrote:
| Not my experience, at all.
|
| I have long stopped using OpenAI products, and all oX have been
| letdowns.
|
| For coding it has been Claude 3.5 -> 3.7 -> Gemini 2.5 for me.
| For general use it has been chatgpt -> Gemini.
|
| Google has retaken the ML crown for my use cases and it keeps
| getting better.
|
| Gemini 2.0 Flash was also the first LLM I put in production,
| because for my use case (summarizing news articles and
| translating them) it was far too fast, accurate and cheap to
| ignore, whereas ChatGPT was consistently too slow and expensive
| to even be considered.
| ionwake wrote:
| Can someone tell me if Windsurf is better than Cursor?
| (Preferably someone who has used both for a few days.)
| kurtis_reed wrote:
| Relevance?
| ionwake wrote:
| It's what literally every HN coder is using to program with
| models like Gemini. Where have you been, brother?
| ramoz wrote:
| Claude Code, and it's not close. I feed my entire project to
| Gemini for planning and figuring out complex solutions for
| Claude Code to execute on. I use Prompt Tower for building
| entire-codebase prompts for Gemini.
| ionwake wrote:
| Fantastic reply, thanks. Can I ask if you have tried Cursor? I
| used to use Claude Code but it was super expensive and got
| stuck in loops. (I know it is cheaper now.) Do you have any
| thoughts?
| ramoz wrote:
| I spend the money on Claude Code, and don't think twice.
| I've spent low 1,000s at this point but the return is
| justified.
|
| I use Cursor when I code myself, but I don't use its chat
| or agent features. I had replaced VS Code with it, and at
| this point I could go back to VS Code, but I'm lazy.
|
| Cursor agent/chat are fine if you're bottlenecked by
| money. I have no idea why or how it uses things like the
| codebase embedding. An agent on top of a filesystem is a
| powerful thing. People also like Aider and RooCode for the
| CLI experience, and I think they are affordable.
|
| To make the most use of these things, you need to guide
| them and provide them adequate context for every task. For
| Claude Code I have built a meta management framework that
| works really well. If I were forced to use cursor I would
| use the same approach.
| m_kos wrote:
| [Tangent] Anyone here using 2.5 Pro in Gemini Advanced? I have
| been experiencing a ton of bugs, e.g.,:
|
| - [codes] showing up instead of references,
|
| - raw search tool output sliding across the screen,
|
| - Gemini continuously answering questions asked two or more
| messages before but ignoring the most recent one (you need to ask
| Gemini an unrelated question for it to snap out of this bug for a
| few minutes),
|
| - weird messages including text irrelevant to any of my chats
| with Gemini, like baseball,
|
| - confusing its own replies with mine,
|
| - not being able to run its own Python code due to some
| unsolvable formatting issue,
|
| - timeouts, and more.
| Dardalus wrote:
| The Gemini app is absolute dog doo... use it through AI Studio.
| Google ought to shut down the entire Gemini app.
| paulirish wrote:
| > Gemini 2.5 Pro now ranks #1 on the WebDev Arena leaderboard
|
| It'd make sense to rename WebDev Arena to React/Tailwind Arena.
| Its system prompt requires [1] those technologies and the entire
| tool breaks when requesting vanilla JS or other frameworks. The
| second-order implications of models competing on this narrow
| definition of webdev are rather troublesome.
|
| [1] https://blog.lmarena.ai/blog/2025/webdev-
| arena/#:~:text=PROM...
| martinsnow wrote:
| Bwoah, it's almost as if React and Tailwind are the bee's
| knees in frontend atm
| byearthithatius wrote:
| Sadly. Tailwind is so oof in my opinion. Let's import
| megabytes just so we don't have to write 5 whole CSS classes.
| I mean, just copy-paste the code.
|
| Don't get me started on how ugly the HTML becomes when most
| tags have 20 f*cking classes which could have been two.
| johnfn wrote:
| In most reasonably-sized websites, Tailwind will decrease
| overall bundle size when compared to other ways of writing
| CSS. Which is less code, 100 instances of "margin-left:
| 8px" or 100 instances of "ml-2" (and a single definition
| for ml-2)? Tailwind will dead-code eliminate all rules
| you're not using.
|
| In typical production environments Tailwind is only around
| 10 KB[1].
|
| [1]: https://v3.tailwindcss.com/docs/optimizing-for-
| production
| postalrat wrote:
| I've found them to be pretty good with vanilla HTML and CSS.
| shortcord wrote:
| Not a fan of the dominance of shadcn and Tailwind when it comes
| to generating greenfield code.
| BoorishBears wrote:
| shadcn/ui is such a terrible thing for the frontend
| ecosystem, and it'll get _even worse_ for it as AI gets
| better.
|
| Instead of learnable, stable, APIs for common components with
| well established versioning and well defined tokens, we've
| got people literally copying and pasting components and
| applying diffs so they can claim they "own them".
|
| Except the vast majority of them don't ever change a line and
| just end up with a strictly worse version of a normal package
| (typically out of date or a hodgepodge of "versions" because
| they don't want to figure out diffs), and the few that do
| make changes don't have anywhere _near_ the design sense to
| be using shadcn, since there aren't enough tokens to keep the
| look and feel consistent across components.
|
| The would-be 1% who would change it _and_ have their own
| well-thought-out design systems don't get a lift from shadcn
| either vs just starting with Radix directly.
|
| -
|
| Amazing spin job though with the "registry" idea too: "it's
| actually very good for AI that we invented a parallel
| distribution system for ad-hoc components with no standard
| except a loose convention around sticking stuff in a folder
| called ui"
| aero142 wrote:
| If LLMs are able to write better code with more declarative and
| local programming components and Tailwind, then I could imagine
| a future where a new programming language is created to
| maximize LLM success.
| epolanski wrote:
| This so much.
|
| To me it seems so strange that a few good language designers
| and ML folks haven't grouped together to work on this.
|
| It's clear that there is space for some LLM meta-language
| that could be designed to compile to bytecode, binary, JS,
| etc.
|
| It also doesn't need to be textual like the code we write; it
| could be some form of AST an LLM can manipulate with ease.
| LZ_Khan wrote:
| Readability would probably be the sticking point.
| senbrow wrote:
| At that point why not just have LLMs generate bytecode in
| one shot?
|
| Plenty of training data to go on, I'd imagine.
| seb1204 wrote:
| Would this be addressed by better documentation of code and
| APIs as well as examples? All this would go into the
| training materials and then be the body of knowledge.
| nicce wrote:
| > It'd make sense to rename WebDev Arena to React/Tailwind
| Arena.
|
| Funnily, the training of these models feels like it was cut
| off in the middle of the Tailwind v3/v4 transition, and Gemini
| always tries to "correct" my mistakes (... use v3 instead of
| v4).
| qwertox wrote:
| I have my issues with the code Gemini Pro in AI Studio generates
| without customized "System Instructions".
|
| It turns a perfectly readable 5-line code snippet into a
| 30-line snippet full of comments and mostly unnecessary error
| handling; code which becomes harder to reason about.
|
| But for sysadmin tasks, like dealing with ZFS and LVM, it is
| absolutely incredible.
| bn-l wrote:
| I've found the same thing. I don't use it for code any more
| because it produces highly verbose and inefficient code that
| may work but is ugly and subtly brittle.
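|
| A system instruction is the usual lever against that
| verbosity. A hedged sketch with the python-genai SDK (the
| instruction wording is only an example, tune it to taste):
|
|     from google import genai
|     from google.genai import types
|
|     client = genai.Client(api_key="...")
|     config = types.GenerateContentConfig(
|         # ask the model to stay terse instead of padding the code
|         system_instruction=(
|             "Keep code minimal: no explanatory comments, no "
|             "speculative error handling, keep the original structure."
|         ),
|         temperature=0.3,
|     )
|     response = client.models.generate_content(
|         model="gemini-2.5-pro-preview-05-06",
|         contents="Simplify this snippet: ...",
|         config=config,
|     )
|     print(response.text)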
| mvdtnz wrote:
| I truly do not understand how people are getting worthwhile
| results from Gemini 2.5 Pro. I have used all of the major models
| for lots of different programming tasks and I have never once had
| Gemini produce something useful. It's not just wrong, it's
| laughably bad. And people are making claims that it's the best. I
| just... don't... get it.
| WaltPurvis wrote:
| That's weird. What languages/frameworks/tasks are you using it
| for? I've been using Gemini 2.5 with Dart recently and it
| frequently produces indisputably useful code, and indisputably
| helpful advice. Along with some code that's pretty dumb or
| misguided, and some advice that would be counterproductive if I
| actually followed it. But "never once had Gemini produce
| something useful" is _wildly_ different from my recent
| experience.
| franze wrote:
| I like it. I threw some random concepts at it (Neon, LSD,
| Falling, Elite, Shooter, Escher + Mobile Game + SPA) and this
| is what it came up with after a few (5x) roundtrips.
|
| https://show.franzai.com/a/star-zero-huge?nobuttons
| cadamsdotcom wrote:
| Google/Alphabet is a giant hulking machine that's been frankly
| running at idle. All that resume-driven development,
| performance-review promo cycles, and retention of top talent
| mainly to work on ad tech means it's packed to the rafters with
| latent capability. Holding on to so much talent in the face of
| basically having nothing to do is a testament to the company's
| leadership - even if said leadership didn't manage to make Google
| push humanity forward over the last decade or so.
|
| Now that there's a big nugget to chew on (LLMs), you're seeing
| that latent capability come to life. This awakening feels more
| bottom-up driven than top-down. Google's a war machine chugging
| along nicely in peacetime, but now it's war again!
|
| Hats off to the engineers working on the tech. Excited to try it
| out!
| kccqzy wrote:
| > retention of top talent mainly to work on ad tech
|
| No the top talent worked on exciting things like Fuchsia. Ad
| tech is boring stuff written by people who aren't enough of a
| snob to refuse working on ad tech.
| cadamsdotcom wrote:
| Top talent worked on what now?
|
| Isn't that a flower?
|
| (Hopefully you see my point)
| alana314 wrote:
| The google sheets UI asked me to try Gemini to create a formula,
| so I tried it, starting with "Create a formula...", and its
| answer was "Sorry, I can't help with creating formulas yet, but
| I'm still learning."
| wewewedxfgdf wrote:
| Gemini does not accept uploads of TSX files; it says "File type
| unsupported".
|
| You must _rename your files to .tsx.txt THEN IT ACCEPTS THEM_
| and it works perfectly fine writing TSX code.
|
| This is absolutely bananas. How can such a powerful coding engine
| have this behavior?
| krat0sprakhar wrote:
| Where are you testing this? I'm able to upload tsx files on
| aistudio
| wewewedxfgdf wrote:
| https://gemini.google.com/app
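|
| In the meantime the rename workaround is easy to script; a
| small sketch that writes .tsx.txt copies next to the originals
| (the src/ path is just an assumption about your layout):
|
|     from pathlib import Path
|
|     # gemini.google.com rejects *.tsx uploads, so write *.tsx.txt
|     # copies it will accept, leaving the originals untouched
|     for p in Path("src").rglob("*.tsx"):
|         p.with_name(p.name + ".txt").write_text(p.read_text())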
| jmward01 wrote:
| Google's models are pretty good, but their API(s) and guarantees
| aren't. We were just told today that 'quota doesn't guarantee
| capacity', so basically on-demand isn't prod-capable. Add to
| that that there isn't a second vendor source like Anthropic and
| OpenAI have, and Google's reliability makes it a hard sell to
| use them unless you can back up the calls with a different
| model family altogether.
| simonw wrote:
| Here's a summary of the 394 comments on this post created using
| the new gemini-2.5-pro-preview-05-06. It looks very good to me -
| well grouped, nicely formatted.
|
| https://gist.github.com/simonw/7ef3d77c8aeeaf1bfe9cc6fd68760...
|
| 30,408 input tokens, 8,535 output tokens = 12.336 cents.
|
| 8,500 is a very long output! Finally a model that obeys my
| instructions to "go long" when summarizing Hacker News threads.
| Here's the script I used:
| https://gist.github.com/simonw/7ef3d77c8aeeaf1bfe9cc6fd68760...
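|
| The cost figure is easy to sanity-check against the preview's
| list prices; a quick sketch (assuming $1.25/M input and $10/M
| output tokens for this model):
|
|     input_tokens, output_tokens = 30_408, 8_535
|     price_in, price_out = 1.25, 10.00  # assumed $ per 1M tokens
|     cost = (input_tokens / 1e6 * price_in
|             + output_tokens / 1e6 * price_out)
|     print(f"{cost * 100:.3f} cents")  # -> 12.336 cents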
___________________________________________________________________
(page generated 2025-05-06 23:00 UTC)